Continuous Pretraining

Upload Pretraining Data

Overview

Upload pretraining data in Arcee to inject new knowledge into your language models. Ensure you have at least 100 million tokens for effective continuous pretraining.

Add the following bucket policy to grant Arcee read access to your S3 bucket.

json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "812782781539"
            },
            "Action": [
                "s3:GetBucketLocation",
                "s3:ListBucket",
                "s3:GetObject",
                "s3:GetObjectAttributes",
                "s3:GetObjectTagging"
            ],
            "Resource": [
                "arn:aws:s3:::yourdatafolder",
                "arn:aws:s3:::yourdatafolder/*"
            ]
        }
    ]
}
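
If you prefer to set the policy programmatically rather than through the AWS console, here is a minimal sketch using boto3; the bucket name is a placeholder and the policy mirrors the JSON above.

python
import json
import boto3

bucket = "yourdatafolder"  # placeholder; use your own bucket name

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": "812782781539"},  # Arcee's AWS principal, as shown above
            "Action": [
                "s3:GetBucketLocation",
                "s3:ListBucket",
                "s3:GetObject",
                "s3:GetObjectAttributes",
                "s3:GetObjectTagging",
            ],
            "Resource": [
                f"arn:aws:s3:::{bucket}",
                f"arn:aws:s3:::{bucket}/*",
            ],
        }
    ],
}

# Requires AWS credentials with permission to modify the bucket policy.
boto3.client("s3").put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))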

Step 3: Uploading Your Data

Once your S3 bucket is linked, upload your pretraining data. Arcee will parse, tokenize, and pack your files to prepare them for pretraining.

Choosing a Tokenizer

When you go to the Create Pretraining Dataset screen, you'll find a dropdown menu labeled Tokenizer. Select the tokenizer that matches your target model: choosing one compatible with the model architecture ensures tokenization aligns with how the model interprets and processes text.
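
If you want to sanity-check a tokenizer locally before creating the dataset, a minimal sketch with the Hugging Face transformers library is shown below; the model ID is a placeholder, so substitute the one that matches your target model.

python
from transformers import AutoTokenizer

# Placeholder model ID; load the tokenizer that matches your target model.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

sample = "Continuous pretraining injects new domain knowledge into a model."
token_ids = tokenizer.encode(sample)
print(f"{len(token_ids)} tokens for {len(sample.split())} words")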

Upload Pretraining Data

Step-by-step
  1. Go to the Datasets tab

  2. Click the Create button

  3. Enter the dataset name

  4. Enter the dataset URL

  5. Select the tokenizer

  6. Click the Create button
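
The dataset URL in step 4 points at the S3 location that holds your corpus. Here is a minimal boto3 sketch for uploading a local folder and building that URL, with placeholder bucket and prefix names:

python
import pathlib
import boto3

bucket, prefix = "yourdatafolder", "pretraining-corpus"  # placeholders
s3 = boto3.client("s3")

# Upload every file in the local corpus/ folder under the chosen prefix.
for path in pathlib.Path("corpus").rglob("*"):
    if path.is_file():
        s3.upload_file(str(path), bucket, f"{prefix}/{path.name}")

# Enter this value as the dataset URL in step 4.
print(f"s3://{bucket}/{prefix}/")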

Uploading Pretraining Data

Dataset Preparation

To prepare your dataset for pretraining in Arcee, first make sure it meets the minimum size requirement.

Continuous pretraining requires more than 100 million tokens of text data, roughly equivalent to 25 million words. This token count ensures the model receives enough information for new knowledge to be injected effectively. For context, 4GB of text is roughly 1 billion tokens.

Your dataset must be large enough to provide meaningful pretraining data; the pretraining process may fail if it does not meet the minimum token requirement.

Prepare Your Data

We recommend using unstructured.io to parse and format your files before uploading. For best performance, upload your corpus in 1MB chunks.
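
As a rough sketch of that preparation step, the snippet below uses the unstructured library to extract text from a single document (the filename is a placeholder) and writes it out in roughly 1MB plain-text chunks:

python
import os
from unstructured.partition.auto import partition

os.makedirs("corpus", exist_ok=True)

# Parse one source document into text elements; "report.pdf" is a placeholder.
elements = partition(filename="report.pdf")
text = "\n\n".join(el.text for el in elements if el.text)

# Write the extracted text back out in roughly 1MB chunks.
chunk_size = 1_000_000
for i in range(0, len(text), chunk_size):
    with open(f"corpus/report_{i // chunk_size:04d}.txt", "w", encoding="utf-8") as f:
        f.write(text[i : i + chunk_size])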

Ensure You Have Enough Tokens

You need at least 100 million tokens (roughly 25 million words) to perform continuous pretraining. Make sure your dataset meets this requirement.

Convert Data Size to Tokens Easily

Remember that 4GB of text data is approximately 1 billion tokens. Use this conversion to estimate the size of your pretraining dataset.
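
For a quick back-of-the-envelope estimate, apply that rule of thumb to your corpus size; the figures below are placeholders:

python
corpus_bytes = 500 * 1024**2                     # e.g. a 500MB text corpus
tokens_per_byte = 1_000_000_000 / (4 * 1024**3)  # 4GB of text ~ 1B tokens
estimated_tokens = corpus_bytes * tokens_per_byte
print(f"~{estimated_tokens / 1e6:.0f}M tokens")  # ~122M, above the 100M minimum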

File Formats Supported for Pretraining Data Upload

  • PDF: Portable Document Format used for documents

  • JSON: JavaScript Object Notation, a lightweight data-interchange format

  • XML: eXtensible Markup Language for data representation

  • TXT: Plain text files with no formatting

  • HTML: HyperText Markup Language, the standard for web pages

  • CSV: Comma-Separated Values, used for tabular data
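
Before uploading, you may want to drop files in unsupported formats. A minimal sketch (not an official Arcee check), assuming your corpus sits in a local corpus/ folder:

python
import pathlib

SUPPORTED = {".pdf", ".json", ".xml", ".txt", ".html", ".csv"}

files = [p for p in pathlib.Path("corpus").rglob("*")
         if p.is_file() and p.suffix.lower() in SUPPORTED]
print(f"{len(files)} files ready to upload")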

Verifying Completion

Once your data has been parsed, tokenized, and packed, you will see a final token count. As a rule of thumb, 4GB of text is roughly 1B tokens.

Video Walkthrough

This video demonstrates model deployment with the Arcee UI and the Arcee Python SDK. The video description includes a link to the companion notebook.

Frequently Asked Questions

  • How much data do I need? You need at least 100 million tokens of text data to perform continuous pretraining in Arcee. This equates to roughly 25 million words.

  • Which tokenizer should I use? When creating a pretraining dataset, you can choose from various tokenizers; pick one compatible with your target model.

  • How do I confirm my data was processed? After uploading your pretraining data, check the status on the Datasets tab to confirm the data has been successfully parsed, tokenized, and packed.

  • Why might an upload or pretraining run fail? Common issues include incorrect S3 URL formatting, insufficient permissions to access the S3 bucket, and not meeting the minimum token requirement for continuous pretraining.

  • Can a corpus mix file formats? Yes, a corpus can contain documents in different formats, e.g., PDF and HTML.

  • Are subfolders supported? Yes, subfolders are supported.

  • Can I update individual files? No, file-level updates are not possible. However, the Python API lets you reupload a fully updated corpus with the same name.