Continuous Pretraining

Upload Pretraining Data

Overview

Upload pretraining data in Arcee to inject new knowledge into your language models. Ensure you have at least 100 million tokens for effective continuous pretraining.

Add the following bucket policy to grant Arcee read access to your S3 bucket.

json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "812782781539"
            },
            "Action": [
                "s3:GetBucketLocation",
                "s3:ListBucket",
                "s3:GetObject",
                "s3:GetObjectAttributes",
                "s3:GetObjectTagging"
            ],
            "Resource": [
                "arn:aws:s3:::yourdatafolder",
                "arn:aws:s3:::yourdatafolder/*"
            ]
        }
    ]
}
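
If you prefer to set the policy programmatically rather than through the AWS console, here is a minimal sketch using boto3; the bucket name is a placeholder and the policy mirrors the JSON above.

python
import json
import boto3

bucket = "yourdatafolder"  # placeholder; use your own bucket name

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": "812782781539"},  # Arcee's AWS principal, as shown above
            "Action": [
                "s3:GetBucketLocation",
                "s3:ListBucket",
                "s3:GetObject",
                "s3:GetObjectAttributes",
                "s3:GetObjectTagging",
            ],
            "Resource": [
                f"arn:aws:s3:::{bucket}",
                f"arn:aws:s3:::{bucket}/*",
            ],
        }
    ],
}

# Requires AWS credentials with permission to modify the bucket policy.
boto3.client("s3").put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))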

Step 3: Uploading Your Data

Once your S3 bucket is linked, upload your pretraining data. Arcee will parse, tokenize, and pack your files to prepare them for pretraining.

Choosing a Tokenizer

When you go to the Create Pretraining Dataset screen, you'll find a dropdown menu labeled Tokenizer. Select the tokenizer that matches your target model: choosing one compatible with the model architecture ensures tokenization aligns with how the model interprets and processes text.
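
If you want to sanity-check a tokenizer locally before creating the dataset, a minimal sketch with the Hugging Face transformers library is shown below; the model ID is a placeholder, so substitute the one that matches your target model.

python
from transformers import AutoTokenizer

# Placeholder model ID; load the tokenizer that matches your target model.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

sample = "Continuous pretraining injects new domain knowledge into a model."
token_ids = tokenizer.encode(sample)
print(f"{len(token_ids)} tokens for {len(sample.split())} words")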

Upload Pretraining Data

Step-by-step
  1. Go to the Datasets tab

  2. Click the Create button

  3. Enter the dataset name

  4. Enter the dataset URL

  5. Select the tokenizer

  6. Click the Create button
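
The dataset URL in step 4 points at the S3 location that holds your corpus. Here is a minimal boto3 sketch for uploading a local folder and building that URL, with placeholder bucket and prefix names:

python
import pathlib
import boto3

bucket, prefix = "yourdatafolder", "pretraining-corpus"  # placeholders
s3 = boto3.client("s3")

# Upload every file in the local corpus/ folder under the chosen prefix.
for path in pathlib.Path("corpus").rglob("*"):
    if path.is_file():
        s3.upload_file(str(path), bucket, f"{prefix}/{path.name}")

# Enter this value as the dataset URL in step 4.
print(f"s3://{bucket}/{prefix}/")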

Uploading Pretraining Data

Dataset Preparation

To prepare your dataset for pretraining in Arcee, first make sure it meets the minimum size requirement.

Continuous pretraining requires more than 100 million tokens of text data, roughly equivalent to 25 million words. This token count ensures the model receives enough information for new knowledge to be injected effectively. For context, 4GB of text is roughly 1 billion tokens.

Your dataset must be large enough to provide meaningful pretraining data; the pretraining process may fail if it does not meet the minimum token requirement.

Prepare Your Data

We recommend using unstructured.io to parse and format your files before uploading. For best performance, upload your corpus in 1MB chunks.
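
As a rough sketch of that preparation step, the snippet below uses the unstructured library to extract text from a single document (the filename is a placeholder) and writes it out in roughly 1MB plain-text chunks:

python
import os
from unstructured.partition.auto import partition

os.makedirs("corpus", exist_ok=True)

# Parse one source document into text elements; "report.pdf" is a placeholder.
elements = partition(filename="report.pdf")
text = "\n\n".join(el.text for el in elements if el.text)

# Write the extracted text back out in roughly 1MB chunks.
chunk_size = 1_000_000
for i in range(0, len(text), chunk_size):
    with open(f"corpus/report_{i // chunk_size:04d}.txt", "w", encoding="utf-8") as f:
        f.write(text[i : i + chunk_size])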

Ensure You Have Enough Tokens

You need at least 100 million tokens (roughly 25 million words) to perform continuous pretraining. Make sure your dataset meets this requirement.

Convert Data Size to Tokens Easily

Remember that 4GB of text data is approximately 1 billion tokens. Use this conversion to estimate the size of your pretraining dataset.
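
For a quick back-of-the-envelope estimate, apply that rule of thumb to your corpus size; the figures below are placeholders:

python
corpus_bytes = 500 * 1024**2                     # e.g. a 500MB text corpus
tokens_per_byte = 1_000_000_000 / (4 * 1024**3)  # 4GB of text ~ 1B tokens
estimated_tokens = corpus_bytes * tokens_per_byte
print(f"~{estimated_tokens / 1e6:.0f}M tokens")  # ~122M, above the 100M minimum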

File Formats Supported for Pretraining Data Upload

  • PDF: Portable Document Format used for documents

  • JSON: JavaScript Object Notation, a lightweight data-interchange format

  • XML: eXtensible Markup Language for data representation

  • TXT: Plain text files with no formatting

  • HTML: HyperText Markup Language, the standard for web pages

  • CSV: Comma-Separated Values, used for tabular data
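
Before uploading, you may want to drop files in unsupported formats. A minimal sketch (not an official Arcee check), assuming your corpus sits in a local corpus/ folder:

python
import pathlib

SUPPORTED = {".pdf", ".json", ".xml", ".txt", ".html", ".csv"}

files = [p for p in pathlib.Path("corpus").rglob("*")
         if p.is_file() and p.suffix.lower() in SUPPORTED]
print(f"{len(files)} files ready to upload")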

Verifying Completion

Once your data has been parsed, tokenized, and packed, you will see a final token count. As a rule of thumb, 4GB of text is roughly 1B tokens.

Video Walkthrough

This video demonstrates model deployment with the Arcee UI and the Arcee Python SDK. The video description includes a link to the companion notebook.

Frequently Asked Questions

  • How much data do I need? You need at least 100 million tokens of text data to perform continuous pretraining in Arcee. This equates to roughly 25 million words.

  • Which tokenizer should I use? When creating a pretraining dataset, you can choose from various tokenizers; pick one compatible with your target model.

  • How do I confirm my data was processed? After uploading your pretraining data, check the status on the Datasets tab to confirm the data has been successfully parsed, tokenized, and packed.

  • Why might an upload or pretraining run fail? Common issues include incorrect S3 URL formatting, insufficient permissions to access the S3 bucket, and not meeting the minimum token requirement for continuous pretraining.

  • Can a corpus mix file formats? Yes, a corpus can contain documents in different formats, e.g., PDF and HTML.

  • Are subfolders supported? Yes, subfolders are supported.

  • Can I update individual files? No, file-level updates are not possible. However, the Python API lets you reupload a fully updated corpus with the same name.