Continuous Pretraining

Start Pretraining

Overview

Begin pretraining in Arcee by selecting a base generator model and preparing your pretraining data. Ensure you have the necessary data and access rights before initiating the process.

Pretraining Setup

Base Generator Selection

Choosing a base generator is important because it serves as the foundation for your Small Language Model. The base generator sets your model's initial skills and knowledge, which pretraining on your proprietary data then refines and expands. Arcee supports the following base generators:

  • mistralai/Mistral-7B-Instruct-v0.2

  • mistralai/Mistral-7B-Instruct-v0.1

  • meta-llama/Meta-Llama-3-8B

  • meta-llama/Meta-Llama-3-8B-Instruct

Choose Your Base Generator

Pick the supported base generator that best fits your project: the Instruct variants ('mistralai/Mistral-7B-Instruct-v0.2', 'mistralai/Mistral-7B-Instruct-v0.1', and 'meta-llama/Meta-Llama-3-8B-Instruct') are already tuned to follow instructions, while 'meta-llama/Meta-Llama-3-8B' is a raw foundation model. Whichever you choose becomes the foundation that pretraining builds on, as shown in the sketch below.
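
When scripting jobs, it can help to catch typos in the generator name before launching. Here is a minimal sketch; the SUPPORTED_GENERATORS constant and choose_generator helper are illustrative conveniences, not part of the Arcee SDK:

# Illustrative constant listing the supported generators above (not an SDK export).
SUPPORTED_GENERATORS = [
    "mistralai/Mistral-7B-Instruct-v0.2",
    "mistralai/Mistral-7B-Instruct-v0.1",
    "meta-llama/Meta-Llama-3-8B",
    "meta-llama/Meta-Llama-3-8B-Instruct",
]

def choose_generator(name: str) -> str:
    """Return name unchanged if it is a supported base generator, else fail loudly."""
    if name not in SUPPORTED_GENERATORS:
        raise ValueError(
            f"Unsupported base generator {name!r}; pick one of: {', '.join(SUPPORTED_GENERATORS)}"
        )
    return name

You can then pass choose_generator("meta-llama/Meta-Llama-3-8B") directly as the base generator argument of the start_pretraining call shown in the next section.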

Python SDK Installation

To install the Arcee Python SDK, you need to follow a few simple steps. The SDK allows you to interact with the Arcee platform using Python.

  • First, ensure you have Python installed on your system. The SDK is compatible with Python 3.6 and above.

  • Open your terminal or command prompt.

  • Run the following command to install the SDK:

pip install -q arcee-py

  • Once the SDK is installed, set up your Arcee API key. This key is necessary to authenticate and interact with the Arcee platform. In a notebook, you can set the API key as an environment variable using the following command:

%env ARCEE_API_KEY=YOUR-ARCEE-API-KEY

Replace YOUR-ARCEE-API-KEY with your actual Arcee API key.

After completing these steps, you will be ready to start using the Arcee Python SDK to manage and interact with your Small Language Models on the Arcee platform. Start a pretraining job by passing a job name, the name of your pretraining corpus, and a base generator:

import arcee

arcee.start_pretraining("my-pretrain", "my-corpus", "meta-llama/Meta-Llama-3-8B")
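
If you work in a plain Python script rather than a notebook, you can set the key with the standard library instead of the %env magic. Only the start_pretraining call below comes from these docs; the rest is standard Python:

import os

# Set the key before importing the SDK, in case the client reads it at import
# time (an assumption; adapt to however your environment manages secrets).
os.environ["ARCEE_API_KEY"] = "YOUR-ARCEE-API-KEY"

import arcee

# Launch continuous pretraining on the uploaded corpus (the documented call).
arcee.start_pretraining("my-pretrain", "my-corpus", "meta-llama/Meta-Llama-3-8B")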

Visualize Training Loss

Visualize continuous pretraining (CPT) loss by clicking View Training Loss on your CPT job.
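
The UI is the documented way to view CPT loss. If you also track loss values yourself (for example, from the companion notebook), a generic matplotlib sketch can plot them; the cpt_loss.csv file and its step/loss columns here are assumptions for illustration, not an Arcee export format:

import csv
import matplotlib.pyplot as plt

# Assumed file and column names for illustration only.
steps, losses = [], []
with open("cpt_loss.csv") as f:
    for row in csv.DictReader(f):
        steps.append(int(row["step"]))
        losses.append(float(row["loss"]))

plt.plot(steps, losses)
plt.xlabel("Training step")
plt.ylabel("Loss")
plt.title("CPT training loss")
plt.show()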

Video Walkthrough

This video demonstrates model pretraining with the Arcee UI and the Arcee Python SDK. The video description includes a link to the companion notebook.

Frequently Asked Questions

  • What is pretraining? Pretraining is the process of training a language model on a large dataset to give it a broad understanding of language. This stage helps the model learn grammar, facts, and some reasoning abilities, making it more effective for specific tasks later. (A conceptual sketch of the underlying objective appears after this list.)

  • How do I upload pretraining data? Click the Create button on the Datasets tab and choose Pretraining Data.

  • Can I change the base generator? Yes. You can select from 'mistralai/Mistral-7B-Instruct-v0.2', 'mistralai/Mistral-7B-Instruct-v0.1', 'meta-llama/Meta-Llama-3-8B', and 'meta-llama/Meta-Llama-3-8B-Instruct'.

  • How can I monitor pretraining progress? The Arcee interface provides real-time updates on training metrics and status.
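
For intuition about the objective pretraining optimizes, here is a conceptual sketch of next-token prediction with cross-entropy loss in plain PyTorch. It is illustrative only: the random tensors stand in for a real model and corpus, and this is not Arcee's internal training code:

import torch
import torch.nn.functional as F

# Toy stand-ins: random logits play the role of model outputs, random ids the corpus.
vocab_size, seq_len, batch = 100, 8, 2
logits = torch.randn(batch, seq_len, vocab_size)
tokens = torch.randint(vocab_size, (batch, seq_len))

# Shift by one position so each token predicts the next one, then score with
# cross-entropy: this next-token objective is what pretraining minimizes at scale.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)
print(loss.item())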