Upload Pretraining Data
Overview
Upload pretraining data to Arcee to inject new knowledge into your language models. Ensure you have at least 100 million tokens for effective continuous pretraining.
Step 1: Link Your S3 Bucket
Add the following bucket policy to ensure Arcee has access to your S3 bucket.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "812782781539"
      },
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListBucket",
        "s3:GetObject",
        "s3:GetObjectAttributes",
        "s3:GetObjectTagging"
      ],
      "Resource": [
        "arn:aws:s3:::yourdatafolder",
        "arn:aws:s3:::yourdatafolder/*"
      ]
    }
  ]
}
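If you manage your bucket programmatically, you can attach the same policy with boto3 instead of the S3 console. The snippet below is a minimal sketch, assuming a bucket named yourdatafolder and default AWS credentials; adjust the bucket name and resources to match your setup.

```python
import json
import boto3

# The same bucket policy as above, granting the Arcee account read access.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": "812782781539"},
            "Action": [
                "s3:GetBucketLocation",
                "s3:ListBucket",
                "s3:GetObject",
                "s3:GetObjectAttributes",
                "s3:GetObjectTagging",
            ],
            "Resource": [
                "arn:aws:s3:::yourdatafolder",
                "arn:aws:s3:::yourdatafolder/*",
            ],
        }
    ],
}

# Attach the policy to the bucket (requires s3:PutBucketPolicy permissions on your side).
s3 = boto3.client("s3")
s3.put_bucket_policy(Bucket="yourdatafolder", Policy=json.dumps(policy))
```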
Step 2: Upload Your Data
Once your S3 bucket is linked, upload your pretraining data. Arcee will parse, tokenize, and pack your files to prepare them for pretraining.
Choosing a Tokenizer
On the Create Pretraining Dataset screen, use the Tokenizer dropdown to select the tokenizer that matches your target model. Choosing a tokenizer compatible with the model architecture ensures the tokenized data aligns with how your model interprets and processes text.
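To sanity-check your tokenizer choice before uploading, you can load the same tokenizer locally and see how it splits a sample of your text. Below is a minimal sketch using the Hugging Face transformers library; the model name is only an example of a possible target model, not a requirement.

```python
from transformers import AutoTokenizer

# Example only: load the tokenizer of the model you plan to pretrain.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

sample = "Arcee injects new domain knowledge into language models via continuous pretraining."
token_ids = tokenizer(sample)["input_ids"]

print(f"{len(sample.split())} words -> {len(token_ids)} tokens")
print(tokenizer.convert_ids_to_tokens(token_ids)[:10])  # inspect the first few tokens
```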
Upload Pretraining Data
1. Go to the Datasets tab.
2. Click the Create button.
3. Enter the dataset name.
4. Enter the dataset URL.
5. Select the tokenizer.
6. Click the Create button.
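You can also create the dataset from code. The snippet below is a hypothetical sketch of what this looks like with the Arcee Python SDK; the function name, parameter names, and values shown are assumptions, so check the SDK reference for the exact call.

```python
import arcee  # Arcee Python SDK

# Hypothetical call: the function and parameter names here are assumptions, not the confirmed API.
arcee.upload_corpus_folder(
    corpus="legal-filings-2024",                  # dataset name shown in the Datasets tab
    s3_folder_url="s3://yourdatafolder/corpus/",  # the linked S3 location
    tokenizer="llama-3",                          # tokenizer matching your target model
)
```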
Dataset Preparation
To prepare your dataset for pretraining in Arcee, first make sure it meets the minimum size requirement.
Continuous pretraining requires more than 100 million tokens of text data, roughly equivalent to 25 million words. This token count ensures the model receives enough information to inject new knowledge effectively. For context, 4GB of text is roughly 1 billion tokens.
The pretraining process may fail if your dataset does not meet this minimum token requirement.
Prepare Your Data
We recommend using unstructured.io to parse and format your files before uploading. For best performance, upload your corpus in 1MB chunks.
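As a rough sketch of that preparation step, the snippet below uses the unstructured library to extract plain text from a source document and then writes it out in roughly 1MB chunks. The file names and chunk-splitting strategy are illustrative assumptions, not a prescribed pipeline.

```python
from pathlib import Path
from unstructured.partition.auto import partition

# Extract plain text from a source document (PDF, HTML, etc.).
elements = partition(filename="reports/annual_report.pdf")
text = "\n\n".join(str(el) for el in elements)

# Write the text back out in ~1MB chunks for upload.
chunk_size = 1_000_000  # ~1MB of UTF-8 text (approximate for ASCII-heavy content)
out_dir = Path("prepared_corpus")
out_dir.mkdir(exist_ok=True)

for i in range(0, len(text), chunk_size):
    (out_dir / f"annual_report_{i // chunk_size:04d}.txt").write_text(text[i:i + chunk_size])
```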
Ensure You Have Enough Tokens
You need at least 100 million tokens, or roughly 25 million words, to perform continuous pretraining. Make sure your dataset meets this requirement.
Convert Data Size to Tokens Easily
Remember that 4GB of text data is approximately 1 billion tokens. Use this conversion to estimate the size of your pretraining dataset.
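Using this rule of thumb (4GB ≈ 1 billion tokens, i.e. roughly 4 bytes per token), you can estimate the token count of a corpus from its size on disk before uploading. A minimal sketch, assuming your prepared files live in a local folder:

```python
from pathlib import Path

BYTES_PER_TOKEN = 4          # rule of thumb: 4GB of text ~= 1 billion tokens
MIN_TOKENS = 100_000_000     # minimum for continuous pretraining

corpus_bytes = sum(f.stat().st_size for f in Path("prepared_corpus").rglob("*") if f.is_file())
estimated_tokens = corpus_bytes // BYTES_PER_TOKEN

print(f"Corpus size: {corpus_bytes / 1e9:.2f} GB, ~{estimated_tokens / 1e6:.0f}M tokens")
if estimated_tokens < MIN_TOKENS:
    print("Warning: below the 100M-token minimum for continuous pretraining.")
```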
File Formats Supported for Pretraining Data Upload
Format | Description |
---|---|
PDF | Portable Document Format, used for documents |
JSON | JavaScript Object Notation, a lightweight data-interchange format |
XML | eXtensible Markup Language for data representation |
TXT | Plain text files with no formatting |
HTML | HyperText Markup Language, the standard for web pages |
CSV | Comma-Separated Values, used for tabular data |
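If you want to confirm that every file in your corpus uses one of these formats before uploading, a quick check like the following can help; the folder name is an assumption.

```python
from pathlib import Path

SUPPORTED = {".pdf", ".json", ".xml", ".txt", ".html", ".csv"}

unsupported = [
    f for f in Path("prepared_corpus").rglob("*")
    if f.is_file() and f.suffix.lower() not in SUPPORTED
]

if unsupported:
    print("Files with unsupported extensions:")
    for f in unsupported:
        print(f"  {f}")
else:
    print("All files use supported formats.")
```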
Verifying Completion
Once your data has been parsed, tokenized, and packed, you will see a final token count. As a rule of thumb, 4GB of text is roughly 1B tokens.
Video Walkthrough
This video demonstrates model deployment with the Arcee UI and the Arcee Python SDK. The video description includes a link to the companion notebook.
Frequently Asked Questions
How many tokens do I need for continuous pretraining?
You need at least 100 million tokens of text data to perform continuous pretraining in Arcee. This equates to roughly 25 million words.
Which tokenizer should I use for my pretraining dataset?
When creating a pretraining dataset, you can choose from various tokenizers; select one compatible with your target model.
How do I know my data was processed successfully?
After uploading your pretraining data, check the status on the Datasets tab to confirm the data has been successfully parsed, tokenized, and packed.
What are common issues when uploading pretraining data?
Common issues include incorrect S3 URL formatting, insufficient permissions to access the S3 bucket, and not meeting the minimum token requirement for continuous pretraining.
Can a corpus contain documents in different formats?
Yes, a corpus can contain documents in different formats, e.g., PDF and HTML.
Are subfolders in my S3 bucket supported?
Yes, subfolders are supported.
Can I update individual files in an uploaded corpus?
File-level updates are not possible. However, the Python API lets you re-upload a fully updated corpus under the same name.