Chapter 2: Developing Conversational Applications

Why it matters

A chatbot is the most familiar generative-AI application, and it is the natural first thing to build with LangChain. This chapter takes you from one-off question-answering to genuine conversation: how chat models work, how to assemble a chatbot from an LLM, a prompt template, and memory, and how to keep conversations efficient with caching and smarter memory as they grow long.

From Q&A to conversation

Many LLMs, namely instruction-tuned and chat models, are optimized for conversation and accept a sequence of messages as input: a system message followed by past human and AI messages. Amazon Bedrock offers a suite of chat-optimized models, including Amazon Nova, Anthropic Claude, Meta Llama, Mistral, and Cohere Command; check each model card for specifics.

LangChain provides chat prompt templates built around these message roles:

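from langchain_core.prompts import ChatPromptTemplate

# Each tuple is (role, text); {placeholders} are filled in at invoke time.
chat_template = ChatPromptTemplate([
    ("system", "You are a helpful AI bot. Your name is {name}."),
    ("human", "Hello, how are you doing?"),
    ("ai", "I'm doing well, thanks!"),
    ("human", "{user_input}"),
])
# Returns a prompt value holding the four filled-in messages.
prompt = chat_template.invoke({"name": "Andy", "user_input": "What is your name?"})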
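
To make the role structure concrete, the following plain-Python sketch shows roughly what filling such a template amounts to. It is not LangChain's implementation: `fill_chat_template` is a hypothetical helper, and the dictionary shape of each message is an assumption chosen for illustration.

```python
def fill_chat_template(template, variables):
    """Fill each (role, text) pair with the given variables.

    Hypothetical helper for illustration; not part of LangChain.
    """
    return [
        {"role": role, "content": text.format(**variables)}
        for role, text in template
    ]


template = [
    ("system", "You are a helpful AI bot. Your name is {name}."),
    ("human", "Hello, how are you doing?"),
    ("ai", "I'm doing well, thanks!"),
    ("human", "{user_input}"),
]

messages = fill_chat_template(
    template, {"name": "Andy", "user_input": "What is your name?"}
)
for message in messages:
    print(f"{message['role']}: {message['content']}")
```

The model receives the whole list, so the earlier human and AI turns act as context for the final human message.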

The distinction that organizes this chapter:

| Question-Answering (Q&A) | Conversation |
| --- | --- |
| Context is only the latest prompt. | Maintains context of all past interactions. |
| Interactions do not persist. | Interactions persist by updating the context. |
| No additional modules required. | Typically uses a form of memory. |
| Used for standard predictions (content creation). | Used for chat applications (like ChatGPT). |

Building a chatbot

The simplest chatbot needs just three parts: an LLM, a prompt template (defining the bot’s role and guidelines), and a memory module (to persist information across turns). The loop is: take user input, combine it with retrieved memory in the prompt, generate a response, and save the exchange back to memory.

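from langchain.memory import ConversationBufferMemory  # legacy class; import path varies by version
from langchain_core.output_parsers import StrOutputParser

# prompt must expose {chat_history} and {input}; llm and user_input are defined elsewhere.
memory = ConversationBufferMemory(memory_key="chat_history")  # the default key is "history"
chain = prompt | llm | StrOutputParser()
chat_history = memory.load_memory_variables({}).get("chat_history", "")
response = chain.invoke({"chat_history": chat_history, "input": user_input})
memory.save_context({"input": user_input}, {"output": response})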
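
The loop described above can also be sketched without any library, which makes the four steps easy to trace. Everything here is an assumption made for illustration: `fake_llm` stands in for a real chat model, and `BufferMemory`, `PROMPT`, and `chat_turn` are hypothetical names.

```python
def fake_llm(prompt_text):
    """Stand-in for a real chat model: reports how many past turns it saw."""
    turns = prompt_text.count("Human:") - 1  # exclude the current input line
    return f"(reply after seeing {turns} earlier turn(s))"


class BufferMemory:
    """Hypothetical minimal memory: stores every exchange as text."""

    def __init__(self):
        self.turns = []

    def load(self):
        return "\n".join(self.turns)

    def save(self, user_input, response):
        self.turns.append(f"Human: {user_input}\nAI: {response}")


PROMPT = "You are a helpful assistant.\n{chat_history}\nHuman: {input}\nAI:"


def chat_turn(memory, user_input, llm=fake_llm):
    history = memory.load()                       # 1. retrieve memory
    prompt_text = PROMPT.format(                  # 2. combine it with the input
        chat_history=history, input=user_input
    )
    response = llm(prompt_text)                   # 3. generate a response
    memory.save(user_input, response)             # 4. save the exchange
    return response


memory = BufferMemory()
first = chat_turn(memory, "Hi, I'm Sam.")
second = chat_turn(memory, "What is my name?")
print(first)
print(second)
```

Because the second call's prompt contains the first exchange, a real model would have what it needs to answer the follow-up question.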

Customizing chat applications

Caching

Every generated response costs money, compute, and time. Response caching stores answers so that the same or similar prompts are answered from the store instead of being regenerated, which significantly reduces latency. There are two kinds:

  • In-memory cache: stored in the application’s runtime; fastest, but lost on restart.

  • Persistent cache: stored outside the application process (preferably in a database); survives restarts and scales better.
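
The difference between the two kinds comes down to where the lookup table lives. The sketch below is a minimal illustration using only the standard library, with exact-match lookup only; `RuntimeCache`, `DatabaseCache`, `cached_call`, and `counting_llm` are hypothetical names, not LangChain classes.

```python
import sqlite3


class RuntimeCache:
    """Lives in the process; contents vanish when the process exits."""

    def __init__(self):
        self._store = {}

    def get(self, prompt):
        return self._store.get(prompt)

    def put(self, prompt, response):
        self._store[prompt] = response


class DatabaseCache:
    """Lives in a SQLite database; a file-backed database survives restarts."""

    def __init__(self, path):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS cache "
            "(prompt TEXT PRIMARY KEY, response TEXT)"
        )

    def get(self, prompt):
        row = self._conn.execute(
            "SELECT response FROM cache WHERE prompt = ?", (prompt,)
        ).fetchone()
        return row[0] if row else None

    def put(self, prompt, response):
        self._conn.execute(
            "INSERT OR REPLACE INTO cache (prompt, response) VALUES (?, ?)",
            (prompt, response),
        )
        self._conn.commit()


def cached_call(cache, llm, prompt):
    """Return the cached response if present; otherwise call the model and store it."""
    hit = cache.get(prompt)
    if hit is not None:
        return hit
    response = llm(prompt)
    cache.put(prompt, response)
    return response


calls = []


def counting_llm(prompt):
    """Stand-in model that records how often it is invoked."""
    calls.append(prompt)
    return f"answer to: {prompt}"


# ":memory:" keeps the demo self-contained; pass a file path to persist across restarts.
cache = DatabaseCache(":memory:")
cached_call(cache, counting_llm, "What is Amazon Bedrock?")
cached_call(cache, counting_llm, "What is Amazon Bedrock?")
print(len(calls))  # the model ran once; the second call was served from the cache
```

Both classes expose the same `get`/`put` interface, so the calling code does not change when you move from one to the other. LangChain ships its own cache backends; consult its documentation for the current API.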

Caching works best for deterministic queries (temperature zero); for creative tasks it returns the same text every time, which defeats the purpose of sampling. Set an expiration so cached answers do not go stale, and size the store for the volume of prompts you expect.
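
These three guidelines (cache only deterministic requests, expire entries, bound the size) can be sketched as follows. This is one possible design under stated assumptions, not a library API: `TTLCache`, `generate`, and `stub_llm` are hypothetical, and the clock is injectable only so the expiry can be demonstrated without waiting.

```python
import time
from collections import OrderedDict


class TTLCache:
    """Hypothetical in-memory cache with expiry and a size limit."""

    def __init__(self, ttl_seconds, max_entries, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.max_entries = max_entries
        self._clock = clock
        self._store = OrderedDict()  # key -> (stored_at, response)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, response = entry
        if self._clock() - stored_at > self.ttl_seconds:
            del self._store[key]  # stale: drop it and report a miss
            return None
        return response

    def put(self, key, response):
        self._store[key] = (self._clock(), response)
        self._store.move_to_end(key)
        while len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict the least recently written entry


def generate(cache, llm, prompt, temperature):
    """Use the cache only for deterministic (temperature 0) requests."""
    if temperature != 0:
        return llm(prompt)
    cached = cache.get(prompt)
    if cached is not None:
        return cached
    response = llm(prompt)
    cache.put(prompt, response)
    return response


now = [0.0]  # fake clock so the demo can "advance time"
cache = TTLCache(ttl_seconds=60, max_entries=2, clock=lambda: now[0])
calls = []


def stub_llm(prompt):
    """Stand-in model that records each invocation."""
    calls.append(prompt)
    return prompt.upper()


generate(cache, stub_llm, "hello", temperature=0)    # miss: calls the model
generate(cache, stub_llm, "hello", temperature=0)    # hit: served from the cache
now[0] = 61.0
generate(cache, stub_llm, "hello", temperature=0)    # expired: calls the model again
generate(cache, stub_llm, "hello", temperature=0.9)  # creative: bypasses the cache
print(len(calls))
```

Of the four requests, only the second is served from the cache; the others reach the model because of a miss, an expired entry, and a non-zero temperature respectively.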

Handling long conversations

LLMs have a finite context window, and naively keeping every message causes three problems: messages may become redundant or obsolete, costs compound (each new message carries all the old ones), and eventually the window overflows. Two memory strategies address this:
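
A little arithmetic shows how quickly the cost compounds. The sketch below assumes, purely for illustration, that every human and AI message has the same token count and that each request resends the full history; the function names are hypothetical.

```python
def cumulative_input_tokens(turns, tokens_per_message):
    """Total input tokens sent over a conversation when every request
    resends the full history (assumes uniform message sizes)."""
    total = 0
    history = 0
    for _ in range(turns):
        request = history + tokens_per_message  # history plus the new user message
        total += request
        history = request + tokens_per_message  # the model's reply joins the history
    return total


def first_overflow_turn(context_window, tokens_per_message):
    """First turn whose request no longer fits in the context window."""
    if tokens_per_message <= 0:
        raise ValueError("tokens_per_message must be positive")
    history, turn = 0, 0
    while True:
        turn += 1
        if history + tokens_per_message > context_window:
            return turn
        history += 2 * tokens_per_message  # user message plus reply


for n in (10, 100):
    print(n, cumulative_input_tokens(n, tokens_per_message=100))
print(first_overflow_turn(context_window=1000, tokens_per_message=100))
```

Under these assumptions the total grows with the square of the number of turns: ten times as many turns cost a hundred times as many input tokens, which is why trimming or summarizing the history pays off well before the window actually overflows.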

| Strategy | How it works |
| --- | --- |
| Conversation Buffer Window Memory | Keeps only the last k interactions, a sliding window that prevents the buffer from growing without bound. |
| Conversation Summary Memory | Uses an LLM to summarize the history, preserving critical information from older messages while staying compact; useful for long conversations. |

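from langchain.memory import ConversationBufferWindowMemory, ConversationSummaryMemory

memory = ConversationBufferWindowMemory(k=1)   # keep only the last exchange
# or
memory = ConversationSummaryMemory(llm=llm)    # llm (defined elsewhere) summarizes the history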
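
To see what each strategy retains, here is a library-free sketch of both ideas. It does not reproduce LangChain's classes: `WindowMemory` and `SummaryMemory` are hypothetical, and `stub_summarizer` is a deterministic placeholder for what would be an LLM call in practice.

```python
from collections import deque


class WindowMemory:
    """Keeps only the last k exchanges (a sliding window)."""

    def __init__(self, k):
        self.exchanges = deque(maxlen=k)  # older exchanges fall off automatically

    def save(self, user_input, response):
        self.exchanges.append((user_input, response))

    def load(self):
        return "\n".join(f"Human: {u}\nAI: {a}" for u, a in self.exchanges)


class SummaryMemory:
    """Folds each exchange into a running summary produced by a summarizer."""

    def __init__(self, summarize):
        self.summarize = summarize  # callable(previous_summary, new_lines) -> str
        self.summary = ""

    def save(self, user_input, response):
        new_lines = f"Human: {user_input}\nAI: {response}"
        self.summary = self.summarize(self.summary, new_lines)

    def load(self):
        return self.summary


def stub_summarizer(previous_summary, new_lines):
    """Placeholder for an LLM call: keeps only what the human said."""
    human_line = new_lines.splitlines()[0][len("Human: "):]
    facts = [fact for fact in previous_summary.split("; ") if fact]
    facts.append(human_line)
    return "; ".join(facts)


exchanges = [
    ("My name is Sam.", "Nice to meet you, Sam."),
    ("I like hiking.", "Hiking is great."),
]

window = WindowMemory(k=1)
summary = SummaryMemory(stub_summarizer)
for user_input, response in exchanges:
    window.save(user_input, response)
    summary.save(user_input, response)

print(window.load())   # only the most recent exchange remains
print(summary.load())  # a condensed record of both exchanges
```

With `k=1` the window has already forgotten the user's name, while the summary still carries it in compressed form. That trade-off (exact recent text versus lossy long-range recall) is the main consideration when choosing between the two.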

Context is more than chat history

The context you put in a prompt need not be only past interactions; it can include relevant information, external data, and human feedback. One natural extension is chatting with documents: prompting the model with the full text of one or more documents. But this runs straight into the LLM limitations from Module 1: reliability and bias, the context-window limit (for instance, an early Nova Pro release allowed up to 300,000 tokens), compute and memory cost, and potential copyright issues. Feeding entire documents does not scale, which is exactly the motivation for retrieval-augmented generation in the next chapter.
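
A quick budget check illustrates the context-window limit. The sketch below relies on a crude assumption of about four characters per token; real tokenizers differ by model and language, so treat the numbers as rough orders of magnitude. Both helper functions and the stand-in documents are hypothetical.

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate (assumption); real tokenizers vary."""
    return -(-len(text) // chars_per_token)  # ceiling division


def fits_in_context(documents, question, context_window, reserved_for_answer):
    """Check whether all documents plus the question fit in the window,
    leaving room for the model's answer."""
    needed = sum(estimate_tokens(d) for d in documents) + estimate_tokens(question)
    return needed + reserved_for_answer <= context_window, needed


# Two stand-in documents of 400,000 and 900,000 characters.
documents = ["x" * 400_000, "y" * 900_000]
ok, needed = fits_in_context(
    documents,
    "Summarize both.",
    context_window=300_000,
    reserved_for_answer=4_000,
)
print(ok, needed)
```

Even against a 300,000-token window, two large documents are enough to exceed the budget under this estimate, and every follow-up question would have to resend them in full.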

AWS in practice

On Amazon Bedrock you build these chatbots from chat-optimized models (Nova, Claude, Llama, and others) through LangChain’s langchain-aws integration. Bedrock also offers managed conversational features, but understanding the memory and caching mechanics here lets you reason about cost and quality whichever path you choose.

In the news

Conversational AI has moved from stateless chat toward persistent memory: assistants that remember preferences across sessions, and long-context models that hold entire documents or codebases at once. Both developments soften, but do not eliminate, the context-window pressures in this chapter; summarization, windowing, and retrieval remain essential for cost control even as raw context windows grow.

Key takeaways

  • Chat models consume a sequence of system, human, and AI messages; LangChain’s chat prompt templates structure them.

  • A chatbot is an LLM + prompt template + memory, looping retrieve, respond, and save.

  • Caching cuts cost and latency for repeated, deterministic queries.

  • Long conversations need window or summary memory; stuffing whole documents into the prompt does not scale, motivating RAG.

Next, we ground models in external data with retrieval-augmented generation.