Continuous Pretraining

What is Continuous Pretraining?

Overview

Continuous Pretraining involves training your language model on custom text data so it better fits your specific needs. Consider it when the general model lacks exposure to your domain, or when you want to tune its responses for a targeted area.

When to Consider Pretraining

Custom Text Data

Pretraining is essential when your use case relies on text data the general model has not encountered. This process adapts the language model to your needs, ensuring it can generate responses that fit your target domain.

By pretraining on your text data, you tailor the model's responses to be more accurate and relevant to your context. This means the model becomes better at understanding and generating text matching your domain-specific requirements.
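To make this concrete, the sketch below shows one common way to continue pretraining a causal language model on domain text using the open-source Hugging Face transformers and datasets libraries. The base model name, corpus file, and hyperparameters are placeholders; this is a generic sketch, not Arcee's own pipeline.

```python
# Minimal continued-pretraining sketch using Hugging Face transformers/datasets.
# The base model name, corpus file, and hyperparameters below are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model = "meta-llama/Llama-3.1-8B"  # placeholder general checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed for padding during collation
model = AutoModelForCausalLM.from_pretrained(base_model)

# Each line of the corpus becomes one training example; a production pipeline
# would typically pack the text into fixed-length blocks instead.
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

train_data = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="cpt-checkpoint",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=16,
        learning_rate=1e-5,  # a low rate limits catastrophic forgetting
        num_train_epochs=1,
        save_strategy="epoch",
    ),
    train_dataset=train_data,
    # mlm=False gives standard causal (next-token) language modeling labels
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("cpt-checkpoint")
```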

Token Requirement

Continuous Pretraining in Arcee involves training the model on new text data that extends beyond the information found in general models.

A key factor for Continuous Pretraining is the amount of text data. Using more than 100 million tokens (approximately 25 million words) is generally recommended. This threshold ensures the model receives enough new information to improve its performance.

Make Sure You Have Enough Data

For effective pretraining, ensure you have at least 100 million tokens (about 25 million words) of text data. Insufficient data can lead to poor model performance and unreliable results.
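One way to check whether a corpus clears that bar is to count tokens with the tokenizer of the base model you plan to adapt. Below is a minimal sketch, assuming a plain-text corpus file and a Hugging Face tokenizer (both are placeholders):

```python
# Rough token count for a plain-text corpus, streamed line by line.
# The tokenizer name and corpus path are placeholders.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

total_tokens = 0
with open("domain_corpus.txt", encoding="utf-8") as corpus:
    for line in corpus:
        total_tokens += len(tokenizer.encode(line, add_special_tokens=False))

print(f"Corpus size: {total_tokens:,} tokens")
if total_tokens < 100_000_000:
    print("Below the ~100M-token threshold generally recommended for Continuous Pretraining.")
```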

RAG Index Consideration

A large Retrieval-Augmented Generation (RAG) index can significantly influence the decision to perform Continuous Pretraining. This is because an extensive RAG index usually indicates a substantial amount of domain-specific information is available to enhance your language models.

Even if you have a large RAG index, Continuous Pretraining can still be beneficial, because it aligns the model more closely with the specific knowledge in the index. This improves the model's ability to generate relevant and accurate responses based on retrieved information. It can also reduce the amount of data that must be retrieved from the RAG index and injected into the prompt, which may help lower latency and RAG costs.
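To illustrate what "retrieved from the RAG index and injected into the prompt" means, here is a toy sketch of RAG prompt construction; it substitutes TF-IDF similarity for a real embedding model and vector index, and the indexed passages are invented placeholders.

```python
# Toy sketch of retrieval-augmented prompt construction: fetch the top-k most
# similar passages from an index and inject them into the prompt. A real
# deployment would use an embedding model and a vector database; TF-IDF and
# the passages below are stand-ins.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

index = [
    "Policy A covers water damage up to $10,000.",
    "Claims must be filed within 30 days of the incident.",
    "Policy B excludes flood damage in coastal regions.",
]  # placeholder domain passages

vectorizer = TfidfVectorizer()
index_vectors = vectorizer.fit_transform(index)

def build_prompt(question: str, top_k: int = 2) -> str:
    # Rank passages by similarity to the question and keep the top_k of them.
    scores = cosine_similarity(vectorizer.transform([question]), index_vectors)[0]
    top_passages = [index[i] for i in scores.argsort()[::-1][:top_k]]
    context = "\n".join(top_passages)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

print(build_prompt("Is flood damage covered?"))
```

A model that has already been pretrained on the same domain can often answer well with a smaller top_k, which is what shrinks the injected context and, with it, latency and retrieval cost.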

Additionally, merging the pretrained model back into the general chat checkpoint integrates the new domain-specific knowledge with the model's general capabilities. This ensures the model can handle a broad range of topics while remaining finely tuned to your domain.
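As a rough illustration of the idea (not of Arcee's actual merging method), the sketch below averages the parameters of a general chat checkpoint and a continuously pretrained checkpoint; the checkpoint names and the mixing weight are placeholders, and the two models must share an architecture.

```python
# Illustrative weighted-average merge of two checkpoints with the same
# architecture. Arcee's merging supports more sophisticated strategies; this
# only shows the basic idea, and both checkpoint names are placeholders.
from transformers import AutoModelForCausalLM

chat_model = AutoModelForCausalLM.from_pretrained("general-chat-checkpoint")
domain_model = AutoModelForCausalLM.from_pretrained("cpt-checkpoint")
alpha = 0.5  # weight given to the domain-adapted parameters

chat_state = chat_model.state_dict()
domain_state = domain_model.state_dict()
merged_state = {}
for name, chat_param in chat_state.items():
    if chat_param.dtype.is_floating_point:
        merged_state[name] = (1 - alpha) * chat_param + alpha * domain_state[name]
    else:
        merged_state[name] = chat_param  # leave integer/bool buffers untouched

chat_model.load_state_dict(merged_state)
chat_model.save_pretrained("merged-checkpoint")
```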

Frequently Asked Questions

  • How much data does Continuous Pretraining require? Generally more than 100 million tokens of text data (approximately 25 million words).

  • When should I avoid pretraining? Avoid pretraining if you do not have enough unique and relevant data, or if the general model already performs well on your tasks. Pretraining may also be unnecessary if your project does not involve domain-specific adaptation.

  • What is RAG, and how does it relate to pretraining? RAG is a technique that uses a retrieval system to fetch relevant documents or text passages to augment the responses generated by a language model. When you have large retrieval indices, continuous pretraining is usually worthwhile.

  • How does model merging work on Arcee? You can merge models by combining the parameters of two language models. This process extends the pretrained checkpoint with additional knowledge from another model.