What Are Large Language Models?
A Large Language Model (LLM) is an artificial intelligence system trained on massive corpora of text to understand and generate human language with remarkable fluency, coherence, and contextual awareness.
Unlike earlier AI systems that required explicit rules or hand-crafted features, LLMs learn entirely from data. By training on hundreds of billions to trillions of tokens of text — books, articles, code, web pages, scientific papers — they develop rich internal representations of language, world knowledge, and even reasoning patterns. The "large" refers both to the scale of the training data and the number of parameters in the model, which can range from a few billion to over a trillion.
At their core, LLMs are next-token predictors: given a sequence of text (the "context"), the model outputs a probability distribution over what token should come next. During training, this simple objective, applied at extraordinary scale, produces systems capable of translation, summarisation, question answering, code generation, logical reasoning, and creative writing.
Key Terminology
Before diving deeper, it helps to pin down what "large" actually means here.
The term "large" is relative and has shifted dramatically over time. In 2018, BERT with 340 million parameters was considered enormous. By 2023, models routinely exceed 70 billion parameters, and frontier models are estimated in the hundreds of billions to trillions.
A Brief History of LLMs
The story of large language models is one of compounding insights — each breakthrough enabling the next leap in capability.
Neural Word Embeddings
Google's Word2Vec (2013) demonstrated that words could be represented as dense numerical vectors capturing semantic meaning. "King" − "Man" + "Woman" ≈ "Queen". This planted the seed for neural language modelling at scale.
The Transformer Architecture
Vaswani et al. at Google introduced the Transformer in 2017: a model using self-attention rather than recurrence. It processed all tokens in parallel, enabling unprecedented training scale and capturing long-range dependencies that stumped earlier RNN architectures. It remains arguably the most consequential paper in modern AI.
The Pre-Training Era Begins
OpenAI's GPT-1 (117M parameters, 2018) showed that pre-training on large corpora and then fine-tuning on specific tasks outperformed task-specific models. Google's BERT (also 2018) demonstrated that bidirectional context (reading text left-to-right AND right-to-left) significantly improved understanding.
In-Context Learning Emerges
GPT-3 (175B parameters, 2020) shocked the research community by performing well on tasks it was never explicitly trained for, simply by being shown a few examples in the prompt. This "few-shot learning" suggested that at sufficient scale, LLMs develop general-purpose capabilities beyond simple pattern matching.
Alignment via RLHF
OpenAI applied Reinforcement Learning from Human Feedback (RLHF) to fine-tune GPT models to be helpful, harmless, and honest. ChatGPT, launched in November 2022, reached 100 million users in about two months, the fastest consumer-application adoption to that date, validating that alignment techniques could make powerful models accessible and safe for general use.
Open Source & Multimodal Models
Meta's Llama series brought capable open-weight models to researchers worldwide. GPT-4, Claude 3, and Gemini Ultra introduced multimodal capabilities (image + text). The Mixture-of-Experts architecture enabled frontier capabilities at reduced inference cost. The gap between frontier closed-source and open-weight models narrowed dramatically.
LLMs as Autonomous Agents
Models like o3 (OpenAI), Claude 3.7 Sonnet, and Gemini 2.0 demonstrated extended reasoning via chain-of-thought and "thinking" modes. LLMs began operating autonomously as agents — browsing the web, writing and executing code, managing files, and completing multi-step tasks over hours. Context windows expanded to millions of tokens.
The Transformer Architecture
Every major LLM today is built on the transformer architecture. Understanding it reveals why LLMs work the way they do.
The transformer processes text in three broad stages: tokenisation (breaking text into tokens), embedding (converting tokens into numerical vectors), and decoding (generating output one token at a time through repeated forward passes).
Self-Attention: The Core Insight
The mechanism that makes transformers special is self-attention. For every token in the input, self-attention computes how much "attention" it should pay to every other token. This allows the model to understand that in "The animal didn't cross the street because it was too tired," "it" refers to "animal" — even across many words.
Mathematically, each token produces three vectors, Query (Q), Key (K), and Value (V), through learned linear projections. Attention is computed as Attention(Q, K, V) = softmax(QKᵀ / √dₖ)V: the softmax-normalised scores weight a sum of the value vectors. Multi-head attention runs this in parallel across multiple "heads," each learning to attend to different relationships.
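The computation is compact enough to sketch for a single head in pure Python (a minimal illustration; real implementations batch this with matrix libraries and run many heads at once):

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of d-dimensional vectors."""
    d = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        weights = softmax(scores)
        # Each output is a weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Toy example: 3 tokens, d = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = attention(Q, K, V)
```

Because the weights sum to 1, every output row is a convex combination of the value vectors: the model blends information from the tokens it attends to.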
Feed-Forward Networks & Residual Connections
After attention, each token representation passes through a position-wise feed-forward network (FFN) — typically two linear layers with a non-linearity between them. The FFN is where much of the model's factual knowledge appears to be stored. Residual connections around both attention and FFN layers (plus layer normalisation) enable training of very deep networks without gradient vanishing.
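A minimal sketch of one such sub-layer for a single token vector (post-norm layout as in the original Transformer, ReLU for simplicity; GELU is more common in modern models):

```python
import math

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def ffn(x, W1, b1, W2, b2):
    # First linear layer expands the dimension, then a non-linearity...
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    # ...then the second linear layer projects back down.
    return [sum(w * hi for w, hi in zip(row, h)) + b
            for row, b in zip(W2, b2)]

def ffn_sublayer(x, W1, b1, W2, b2):
    # Residual connection: add the FFN output back onto its input,
    # then normalise.
    y = ffn(x, W1, b1, W2, b2)
    return layer_norm([xi + yi for xi, yi in zip(x, y)])

# Toy dimensions: model width 2, hidden width 4.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 1.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
b2 = [0.0, 0.0]
out = ffn_sublayer([1.0, 2.0], W1, b1, W2, b2)
```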
Positional Encoding
Unlike RNNs, transformers process all tokens simultaneously and have no inherent sense of order. Positional encodings — either fixed sinusoidal patterns (original Transformer) or learned embeddings (BERT) or rotary encodings (RoPE, used in Llama, GPT-NeoX) — inject position information so the model understands word order.
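The original sinusoidal variant is simple enough to write out (a sketch of the fixed-pattern scheme; learned and rotary encodings work differently):

```python
import math

def sinusoidal_encoding(position, d_model):
    """Fixed sinusoidal encoding from the original Transformer: each
    dimension pair uses a different wavelength, so every position gets
    a unique, smoothly varying fingerprint."""
    pe = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]

# The encoding is simply added element-wise to each token's embedding.
first = sinusoidal_encoding(0, 4)  # → [0.0, 1.0, 0.0, 1.0]
```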
DeepMind's 2022 Chinchilla paper showed that previous large models were undertrained relative to their compute budget: too many parameters for the amount of data they saw. Optimal training requires roughly 20 training tokens per parameter, so a 70B-parameter model should see ~1.4 trillion tokens for optimal compute efficiency.
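The heuristic reduces to a one-line calculation:

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    """Rough compute-optimal token budget under the Chinchilla heuristic."""
    return n_params * tokens_per_param

# 70B parameters → roughly 1.4 trillion training tokens.
budget = chinchilla_optimal_tokens(70e9)  # → 1.4e12
```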
How LLMs Are Trained
Training an LLM is a multi-stage process that transforms billions of pages of raw text into a system capable of sophisticated language understanding and generation.
Stage 1 — Data Collection & Preprocessing
Training begins with assembling a massive, diverse corpus: CommonCrawl web data, books (Project Gutenberg, BooksCorpus), Wikipedia, academic papers (arXiv, PubMed), code (GitHub), and curated datasets. Raw web data requires aggressive filtering to remove duplicates, near-duplicates, low-quality content, hate speech, and personally identifiable information (PII). Quality filtering — using classifier models to score content — has become as important as scale.
Stage 2 — Pre-Training (Next-Token Prediction)
The model is trained on the objective of predicting the next token in a sequence — called causal language modelling (CLM) for decoder-only models. For each position in the training text, the model outputs a probability distribution over all possible next tokens, and the loss (cross-entropy between prediction and the actual next token) is backpropagated to update the model's weights. Over trillions of such predictions, the model implicitly learns grammar, facts, reasoning, and style.
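For a single position, the training signal reduces to cross-entropy over the vocabulary; a toy sketch with a four-token vocabulary:

```python
import math

def next_token_loss(logits, target_index):
    """Causal LM loss at one position: -log p(actual next token),
    where p comes from a softmax over the vocabulary logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    return -math.log(exps[target_index] / sum(exps))

# The model puts most probability mass on token 0, so the loss is low
# when token 0 really is the next token...
logits = [2.0, 0.5, 0.1, -1.0]
low = next_token_loss(logits, 0)
# ...and high when the true next token was the unlikely token 3.
high = next_token_loss(logits, 3)
```

Averaging this loss over trillions of positions, and backpropagating it through the network, is essentially all that pre-training does.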
Stage 3 — Supervised Fine-Tuning (SFT)
After pre-training, the base model is a powerful text completer, but it doesn't know how to be a helpful assistant. SFT fine-tunes the model on a curated dataset of (instruction, ideal response) pairs, written by human contractors. This teaches the model to follow instructions, answer questions directly, decline harmful requests, and format responses helpfully. SFT datasets typically contain tens to hundreds of thousands of examples across diverse tasks.
Stage 4 — Reinforcement Learning from Human Feedback (RLHF)
RLHF is the technique that transformed base LLMs into genuinely useful assistants. Human annotators rank multiple model responses to the same prompt from best to worst. These rankings train a reward model that learns to predict human preferences. The LLM is then fine-tuned using Proximal Policy Optimisation (PPO) — a reinforcement learning algorithm — to maximise the reward model's score. This aligns the model's outputs with human values and preferences.
Direct Preference Optimisation (DPO), introduced in 2023, achieves similar alignment results without the complexity of training a separate reward model and running PPO. DPO directly fine-tunes the LLM on preference pairs, making it significantly more stable and compute-efficient than full RLHF. Most post-2023 models use DPO or hybrid approaches.
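The DPO objective itself is compact. A sketch for a single preference pair, assuming the log-probabilities passed in are the summed token log-probs of each full response:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair: pi_* are response log-probs
    under the model being tuned, ref_* under a frozen reference model."""
    margin = beta * ((pi_chosen - ref_chosen)
                     - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)): shrinks as the tuned model prefers the
    # chosen response more strongly than the reference model does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Tuned model prefers the chosen response more than the reference → low loss.
good = dpo_loss(-10.0, -20.0, -15.0, -15.0)
# Preference inverted → high loss.
bad = dpo_loss(-20.0, -10.0, -15.0, -15.0)
```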
Essential LLM Techniques
A practitioner's toolkit — the most important techniques for getting the best out of LLMs in real-world applications.
Prompt Engineering
Prompt engineering is the art and science of crafting inputs that elicit better outputs from LLMs — without changing model weights. Key techniques include:
- Zero-shot prompting: Providing a task with no examples, relying on the model's pre-trained knowledge. Works well for common tasks on capable models.
- Few-shot prompting: Including 3–10 examples of the desired input/output format in the prompt. Dramatically improves performance on specialised or unusual tasks.
- Chain-of-thought (CoT): Instructing the model to "think step by step" before answering. Enables complex multi-step reasoning and dramatically improves mathematical and logical performance. Introduced by Wei et al. (Google, 2022).
- System prompts: A persistent instruction prepended to every conversation, setting the model's role, persona, tone, and constraints. Used extensively in production deployments.
- Structured output prompting: Asking models to respond in JSON, XML, or Markdown formats for downstream programmatic processing.
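All of these techniques are ultimately string construction. A sketch combining few-shot examples with a chain-of-thought trigger (function and field names here are illustrative, not any particular library's API):

```python
def build_prompt(system, examples, query):
    """Assemble a few-shot, chain-of-thought prompt as one string.
    `examples` is a list of (question, worked_answer) pairs."""
    parts = [system]
    for question, worked_answer in examples:
        parts.append(f"Q: {question}\nA: {worked_answer}")
    # The closing cue nudges the model into step-by-step reasoning.
    parts.append(f"Q: {query}\nA: Let's think step by step.")
    return "\n\n".join(parts)

prompt = build_prompt(
    "You are a careful maths tutor.",
    [("What is 12 x 3?", "12 x 3 = 36. The answer is 36.")],
    "What is 15 x 4?",
)
```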
Retrieval-Augmented Generation (RAG)
LLMs have a knowledge cutoff and cannot access private data. RAG solves this by combining an LLM with a retrieval system. When a query arrives:
- The query is embedded into a vector using an embedding model (e.g., text-embedding-3-large).
- A vector database (Pinecone, Weaviate, pgvector, Chroma) retrieves the most semantically similar document chunks.
- These chunks are injected into the LLM's context window alongside the query.
- The LLM generates a response grounded in the retrieved content, with citations.
RAG enables LLMs to answer questions about proprietary documents, recent events, and live data without expensive retraining. Its main failure modes are poor retrieval (irrelevant chunks returned), poorly chosen chunk sizes (too large or too small), and context-length limits.
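The pipeline can be sketched end-to-end with a stand-in embedding (a hashed bag-of-words here; a real system would call a trained embedding model and a vector database):

```python
import math

def embed(text, dim=32):
    # Stand-in embedding: hashed bag-of-words, purely for illustration.
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=2):
    # Rank chunks by semantic similarity to the query embedding.
    qv = embed(query)
    return sorted(chunks, key=lambda c: cosine(qv, embed(c)), reverse=True)[:k]

def build_rag_prompt(query, chunks):
    context = "\n".join(f"[{i + 1}] {c}"
                        for i, c in enumerate(retrieve(query, chunks)))
    return (f"Answer using only the context below, citing sources.\n\n"
            f"{context}\n\nQuestion: {query}")

docs = [
    "Our refund policy allows returns within 30 days of purchase.",
    "The office is closed on public holidays.",
    "Standard shipping takes three to five business days.",
]
prompt = build_rag_prompt("refund policy for returns", docs)
```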
Fine-Tuning Strategies
Full fine-tuning updates all model parameters — expensive but maximally effective. Several parameter-efficient techniques exist for resource-constrained settings:
| Technique | Description | GPU Cost | Best For |
|---|---|---|---|
| Full Fine-Tuning | Update all parameters on task data | Very High | Maximum performance, abundant compute |
| LoRA | Inject low-rank adapter matrices; train only those | Low–Medium | Domain adaptation, most production use cases |
| QLoRA | LoRA + 4-bit quantisation of base model | Very Low | Consumer GPU fine-tuning (RTX 3090) |
| Adapter Layers | Insert small trainable modules between transformer layers | Low | Multi-task learning with shared base |
| Prefix Tuning | Prepend trainable "soft prompt" tokens to input | Very Low | Fast iteration on small datasets |
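The core of LoRA fits in a few lines: the frozen weight is used as-is, and only two small matrices are trained (a conceptual sketch for a single vector, not a production implementation):

```python
def matvec(M, v):
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]

def lora_forward(x, W, A, B, alpha=16.0, r=2):
    """y = W x + (alpha / r) * B(A x). W is frozen; only the low-rank
    adapters A (r x d_in) and B (d_out x r) receive gradients."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

# B is zero-initialised, so at the start of fine-tuning the adapter is
# a no-op and the model behaves exactly like the frozen base.
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.1, 0.2], [0.3, 0.4]]    # r x d_in = 2 x 2
B0 = [[0.0, 0.0], [0.0, 0.0]]   # d_out x r, zero-initialised
y = lora_forward([1.0, 1.0], W, A, B0)
```

With rank r far smaller than the weight dimensions, the trainable parameter count drops by orders of magnitude relative to full fine-tuning.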
Mixture of Experts (MoE)
MoE models (Mixtral, Grok, and reportedly GPT-4) replace dense feed-forward layers with a set of "expert" networks and a routing mechanism that selects only a small subset of experts (often 2 of 8) per token. A 56B parameter MoE model might only activate a fraction of its parameters per forward pass, achieving frontier capability at a much lower inference cost than a dense model of equivalent quality.
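A sketch of top-k routing for a single token (experts here are arbitrary callables and the router scores are given directly; real routers are learned linear layers):

```python
import math

def moe_forward(x, experts, router_logits, top_k=2):
    """Send token vector x to its top-k experts only, then mix their
    outputs by softmax-normalised router weights."""
    top = sorted(range(len(experts)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    weights = [e / sum(exps) for e in exps]
    # Only the selected experts execute; the rest cost nothing.
    outputs = [experts[i](x) for i in top]
    return [sum(w * o[j] for w, o in zip(weights, outputs))
            for j in range(len(x))]

# Four toy "experts" that simply scale their input by 1x, 2x, 3x, 4x.
experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]
# The router strongly favours experts 2 and 3, so only they run.
y = moe_forward([1.0, 1.0], experts, router_logits=[0.0, 0.0, 5.0, 4.0])
```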
Quantisation & Efficient Inference
Serving large models is expensive. Quantisation reduces model size by representing weights with fewer bits (16-bit → 8-bit → 4-bit), trading minimal accuracy for dramatic memory and speed gains. Techniques like GPTQ, AWQ, and llama.cpp's GGUF format allow running 7B–70B models on consumer hardware. Speculative decoding (using a small draft model to propose tokens, verified by the large model) can accelerate inference 2–4× with no quality loss.
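Symmetric int8 quantisation, the simplest scheme, stores one float scale per tensor plus one signed byte per weight (a sketch; GPTQ and AWQ are considerably more sophisticated):

```python
def quantize_int8(weights):
    """Map floats to signed 8-bit integers plus one float scale."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

weights = [0.52, -1.0, 0.003, 0.25]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# The round trip loses at most half a quantisation step per weight.
```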
Real-World Applications
LLMs are rapidly transforming industries — not as magic solutions, but as powerful building blocks when applied thoughtfully to the right problems.
AI Alignment & Ethics
As LLMs become more capable and widely deployed, ensuring they behave safely, honestly, and in accordance with human values becomes increasingly critical.
The Alignment Problem
An "aligned" AI system reliably pursues goals that are beneficial to humans. LLMs face several alignment challenges: sycophancy (telling users what they want to hear rather than truth), hallucination (generating confident but false information), jailbreaking (adversarial prompts bypassing safety guidelines), and misuse (dual-use capabilities enabling harm).
Constitutional AI (CAI)
Anthropic's Constitutional AI trains models to follow a written "constitution" of principles. Rather than relying solely on human feedback, the model is trained to critique and revise its own outputs according to these principles using AI-generated supervision (RLAIF — Reinforcement Learning from AI Feedback). This scales alignment without proportional increases in human labour.
Hallucination: Causes & Mitigations
LLMs hallucinate because they are trained to produce fluent, plausible text — not to retrieve ground truth. Mitigations include: grounding via RAG (retrieved context anchors outputs), citation requirements (forcing models to cite sources), calibration training (teaching models to express uncertainty), and human-in-the-loop review for high-stakes outputs. Hallucination rates vary dramatically by task and model — factual lookup tasks have higher error rates than reasoning tasks.
Bias & Fairness
LLMs trained on internet data absorb and can amplify societal biases. Studies have documented stereotyping across gender, race, religion, and nationality. Mitigation approaches include data curation (filtering biased training content), instruction tuning on balanced data, and red-teaming evaluations that specifically probe for differential treatment across demographic groups.
The Future of LLMs
The field is advancing at a pace that makes confident prediction difficult — but several clear trajectories are emerging from current research.