Pre-training

Data format for a pre-training completion task.

For pre-training, there are no prompt templates or roles. The only required field is text:

data.jsonl
{"text": "first row"}
{"text": "second row"}
...
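As a minimal sketch of producing a file in this shape, the Python below writes one JSON object per line with a single text field and then checks that every row has it. The helper names write_pretraining_jsonl and check_pretraining_jsonl are hypothetical and not part of Axolotl; only the standard library is used.

```python
import json
from pathlib import Path


def write_pretraining_jsonl(texts, path):
    """Write one {"text": ...} JSON object per line (hypothetical helper)."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for text in texts:
            # ensure_ascii=False keeps non-ASCII text readable in the file.
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
    return path


def check_pretraining_jsonl(path):
    """Return the number of rows; raise ValueError if a row lacks a string "text"."""
    count = 0
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)
            if not isinstance(row, dict) or not isinstance(row.get("text"), str):
                raise ValueError(f"line {line_no}: missing string 'text' field")
            count += 1
    return count
```

Calling write_pretraining_jsonl(["first row", "second row"], "data.jsonl") would produce the two example rows shown above. Running the check before training is optional, but it can catch malformed rows early.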
Streaming is recommended for large datasets

Axolotl usually loads the entire dataset into memory, which can be impractical for large datasets. To enable streaming, use the following config:

config.yaml
pretraining_dataset: # hf path only
...
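To illustrate what streaming means in general, the sketch below reads a JSONL file lazily, one row at a time, and groups the rows into small batches instead of loading everything first. It is a conceptual illustration in plain Python, not Axolotl's implementation, and it reads a local file purely for demonstration. The names iter_texts and batched are hypothetical.

```python
import json
from itertools import islice


def iter_texts(path):
    """Yield the "text" field of each row, reading one line at a time."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)["text"]


def batched(iterable, batch_size):
    """Group a lazy iterable into lists of at most batch_size items."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch
```

With this pattern, memory use depends on the batch size rather than on the size of the dataset, which is the property that makes streaming attractive for large corpora.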