Chapter 2: Foundation Models and Large Language Models#

Why it matters#

Chapter 1 introduced foundation models and LLMs as the engines of generative AI. This chapter opens the engine. We compare foundation models with the traditional machine learning you may already know, walk through the transformer architecture and the attention mechanism that made modern LLMs possible, and end with a clear-eyed look at where these models break down. Understanding the architecture is not academic: it explains why prompts work the way they do, why context windows matter, and why some models are better at understanding while others are better at generating.

A quick review of traditional ML#

Traditional machine learning trains a model on task-specific data to do one job well. The workflow is familiar:

  • The model is trained on data curated for a single task.

  • Training usually starts from scratch.

  • You choose among model families, tree-based models, linear models, neural networks, depending on the problem.

  • The result is optimized for one task (classification, regression, clustering) and is hard to adapt to even a similar task.

Recall the basic vocabulary. In supervised learning you have features (the input columns, such as number of logins or whether a customer watched a video) and a label (the thing you predict, such as whether the customer signed up). The learning algorithm starts from a random function f and refines it until the features reliably predict the correct label.

Table 5 A toy supervised dataset#

Number of logins

Watched video

Number of purchases

Label: signed up

120

Yes

4

Yes

1

No

0

No

219

No

12

Yes

Foundation models change the workflow#

Foundation models break the one-model-per-task pattern. The key difference is how they are trained and adapted.

Traditional ML

Foundation models

Train on labeled data, then deploy.

Pre-train on huge unlabeled data, then adapt.

One model per task.

One model adapted to many tasks (text generation, summarization, information extraction, Q&A, chatbot).

Needs curated labels for every task.

Learns general structure first; little or no task-specific labeling needed to adapt.

A foundation model is pre-trained once on broad data and then adapted to downstream tasks, by prompting or light fine-tuning, rather than retrained from scratch each time. That is why the number and capability of LLMs has grown so quickly over the past several years: progress on the base model lifts every task at once.

The transformer architecture#

The breakthrough that made today’s LLMs practical is the transformer, introduced in the 2017 paper Attention Is All You Need [Vaswani et al., 2017]. To see why it mattered, start with the problem it solved.

Sequence-to-sequence problems#

Many language tasks map an input sequence to an output sequence where both can vary in length, machine translation (“They are watching” to “Ils regardent”), or auto-completion. Earlier models processed sequences one token at a time, which was slow and tended to forget information from earlier in long inputs.

Encoder-decoder and the context vector#

The classic design has two halves. An encoder reads the input and compresses it into a numerical representation, sometimes called a context vector. A decoder then reads that representation and produces the output sequence one token at a time. Transformers were originally built as encoder-decoder models in exactly this shape.

Attention is the key idea#

The transformer’s central innovation is the attention mechanism. Attention lets the model, when processing any given word, look at all the other words in the observable input and weigh how relevant each one is to the current prediction.

Intuition for attention

In the sentence “The animal didn’t cross the street because it was too tired,” what does “it” refer to? Attention lets the model associate “it” strongly with “animal” rather than “street.” It assigns high, medium, and low attention weights across the sentence so that the most relevant tokens influence the next prediction the most.

Both the encoder and decoder use attention, and they use multi-headed attention, several attention computations in parallel, so the model can capture different kinds of relationships at once. Two consequences follow that you should remember:

  • Parallel processing. Unlike older sequential models, transformers process all input tokens in parallel, which makes training on huge datasets feasible.

  • Positional encoding. Because tokens are processed together rather than in order, the model adds positional information to each token’s embedding so it still knows word order.

Three architectural flavors#

Not every model uses both halves of the transformer. Which half a model keeps determines what it is good at.

Architecture

Examples

Best for

Encoder-only

BERT, ELECTRA

Natural language understanding: classification, named entity recognition, text extraction. Uses bi-directional attention (sees the whole input).

Decoder-only

GPT, Llama, Claude

Text generation. Auto-regressive: each token can only see tokens that came before it, and the model generates the next token from that history.

Encoder-decoder

Original transformer, T5

Sequence-to-sequence tasks such as translation and summarization.

The decoder-only family is the one behind most chat and generation systems on Bedrock, and it is the architecture we lean on for the rest of the book.

Why transformers won#

Pulling the threads together, the transformer is the state-of-the-art deep learning architecture for generative AI because it:

  • processes input in parallel rather than sequentially,

  • uses self-attention to capture relationships between all words regardless of position,

  • typically learns through self-supervised learning, generating labels automatically from unlabeled text, so there is no need to hand-curate labels, and

  • generalizes beyond language to computer vision, audio, reinforcement learning, and multimodal applications.

AWS in practice

You rarely implement a transformer yourself when working with Bedrock; the provider has already trained it. But the architecture explains the knobs you do control. The context window (how much text the model can attend to at once) is a direct consequence of attention. The difference between an embedding model like Titan Embeddings (encoder-style, understanding) and a generative model like Titan Text (decoder-style, generation) maps onto the flavors above.

Challenges and limitations of LLMs#

Powerful as they are, LLMs have real limits. Designing responsibly, the focus of Module 2, starts with knowing them.

Reliability and bias. A model’s knowledge is limited to its training data. It cannot reliably tell truth from falsehood and can reproduce biases present in that data.

Context window. The model’s attention is bounded by its context window. Anything beyond that length is invisible to the model. For example, an early Titan Premier model had a 30,000-token limit at release. Inputs longer than the window must be truncated, chunked, or retrieved selectively, which is one motivation for retrieval-augmented generation in Module 3.

Copyright and intellectual property. Training data may include sensitive or copyrighted material, and a model can generate content resembling someone’s creative or intellectual property, raising ethical and legal questions.

Misinformation and privacy. Models can generate sensitive or personal data and can create or amplify misinformation about people, groups, or organizations.

System cost. Training large models demands enormous compute, specialized talent, and power. Models with more than 100 billion parameters can carry total project costs over 100 million dollars.

Environmental impact. That compute has a carbon footprint. By one cited estimate, the CO2 emissions from training a five-billion-parameter model on GPUs are roughly equivalent to a trans-American flight.

A useful unit: the token

LLMs read and write tokens, not characters or words. A rough rule of thumb is that one token is about four characters of English, so roughly 100 tokens equals about 75 words. Tokens are the unit you are billed in and the unit the context window is measured in, so the concept reappears throughout the book.

In the news#

Two architecture-driven trends dominate recent headlines. First, context windows have expanded dramatically, from a few thousand tokens to hundreds of thousands and beyond in leading models, easing (though not eliminating) the context-window limitation above. Second, the cost and efficiency conversation has matured: techniques such as quantization and smaller, well-trained models now deliver strong performance at a fraction of the compute, and providers increasingly publish model cards documenting capabilities and limitations. Both trends trace directly back to the transformer’s attention mechanism and the economics of scale described in this chapter.

Hands-on labs#

With the architecture in hand, the Module 1 labs become much easier to read. Lab 2a: Introduction to Amazon Bedrock shows how to invoke different Bedrock foundation models with Boto3, and Lab 2b: Chat with Amazon Bedrock builds a simple conversational application on top of a decoder-only model.

Key takeaways#

  • Traditional ML trains one model per task; foundation models pre-train once and adapt to many tasks.

  • The transformer uses attention to relate every token to every other token, processes input in parallel, and learns through self-supervision.

  • Encoder-only models excel at understanding, decoder-only at generation, and encoder-decoder at sequence-to-sequence tasks.

  • LLMs are limited by reliability, bias, the context window, IP and privacy risks, cost, and environmental impact, all of which motivate later modules.

Next, we turn from how models work to how you direct them: prompt engineering.