Pre-training
Also known as: Pre-training / 事前学習 / プレトレーニング
The initial large-scale training phase in which an LLM learns from vast text corpora via next-token prediction, establishing the general language and world knowledge that downstream fine-tuning and alignment build upon.
Overview
Pre-training is the first phase of LLM development: self-supervised next-token prediction on trillions of tokens drawn from web pages, books, and code. Through this objective the model acquires language structure, grammar, world knowledge, and coding ability. Because it demands enormous GPU compute, almost all organizations start from an existing pre-trained model rather than pre-training from scratch.
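A minimal sketch of the next-token prediction objective, assuming a PyTorch-style causal language model that maps token IDs to vocabulary logits; the names `model`, `next_token_loss`, and the commented `corpus_stream` loop are illustrative, not a specific library's API.

```python
# Sketch of the next-token prediction loss used in pre-training (assumed setup).
import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    """Cross-entropy loss where each position predicts the following token.

    token_ids: LongTensor of shape (batch, seq_len), sampled from the corpus.
    """
    inputs = token_ids[:, :-1]      # tokens the model conditions on
    targets = token_ids[:, 1:]      # the same sequence shifted left by one
    logits = model(inputs)          # (batch, seq_len - 1, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )

# Pre-training repeats this loss over trillions of tokens, e.g.:
# for batch in corpus_stream:               # hypothetical data loader
#     loss = next_token_loss(model, batch)
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```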
Frontier vs open models
OpenAI, Anthropic, and Google conduct proprietary large-scale pre-training and offer access through APIs. In contrast, open-weight model families such as Llama and Qwen publish their pre-trained weights, letting organizations fine-tune from a capable base without bearing the cost of pre-training.
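As a hedged illustration of starting from an open-weight base, the snippet below loads a published pre-trained checkpoint with the Hugging Face transformers library; the model ID "Qwen/Qwen2.5-7B" is just one example of such a checkpoint, and fine-tuning details are omitted.

```python
# Illustrative only: load an open-weight pre-trained base model so that
# downstream fine-tuning can continue from its weights instead of scratch.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B"  # example open-weight base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Supervised fine-tuning or alignment training would then update these
# weights on task-specific data, building on the pre-trained knowledge.
```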