Lab 5c: Multimodal Agents#
About This Lab#
Throughout this lab, you will encounter two types of interactive elements:


Activities require no coding: you explore a concept, answer questions, or run a code cell.
Challenges are where you test your understanding by implementing something new or taking a short quiz.
Please work through this notebook from top to bottom to avoid errors due to missing code or context.
1. Installing dependencies#
In this lab, we will develop custom multimodal tools and agents to accomplish certain complex tasks.
%%capture
!pip install -q -r ../requirements.txt
Let’s import the libraries and modules required for this lab, including the invoke_nova_lite_multimodal and get_base64_encoded_image functions we defined and used in previous labs, along with the prepare_image helper from the same module.
import sys
sys.path.append('..')
import boto3
import base64
import json
from IPython.display import JSON
import time
from tqdm import tqdm
from botocore.exceptions import ClientError
from IPython.display import Image, display, Markdown, IFrame
from langchain.tools import tool, BaseTool
import io
from PIL import Image as pil_image, ImageDraw, ImageFont
from mlu_utils.multimodal_utils import invoke_nova_lite_multimodal, prepare_image, get_base64_encoded_image
2. Multimodal agent for image generation and description#
Let’s see how we can develop a multimodal agent with custom tools for an engaging movie poster and story generation application.
Movie poster generation: You provide a prompt or concept for a movie, and the application uses the image_generator_tool to create an initial movie poster from your input. This tool calls the Stability AI Stable Diffusion 3.5 Large model.
Poster variation: Once the first movie poster is generated, the image_variation_tool creates a second, alternative poster. In the code below, this tool calls the same Stable Diffusion 3.5 Large model with a modified prompt that asks for an alternative style, a different color palette, and a unique composition.
Story generation: The application then uses the image_to_story_tool, which is powered by the Amazon Nova Lite multimodal model. The model analyzes the visual elements, symbolism, and imagery in a poster and generates a story or plot synopsis based on those visual cues.
Output: The final output is a set of two visually distinct movie posters and a corresponding story or plot synopsis. Combining image generation with language generation lets you explore how a creative concept might translate into a movie idea.
The application combines two kinds of models: Stability AI Stable Diffusion 3.5 Large for image generation and Amazon Nova Lite for visual understanding and language generation. Together they let you explore movie ideas and see how visual elements can inspire and shape narratives.
2.1 Custom multimodal tools for image generation and description#
@tool
def image_to_story_tool(image_path: str):
    """Use this tool to generate a story related to a given image. The input of the tool is the path of the image."""
    # Strip newlines, surrounding whitespace, and quotes from the agent-supplied path
    image_path_clean = image_path.replace("\n", "").strip().strip('"').strip("'")
    # Remove keyword argument syntax like image_path="..."
    if '=' in image_path_clean:
        image_path_clean = image_path_clean.split('=', 1)[1].strip().strip('"').strip("'")
    # Remove any trailing ReAct artifacts the agent may append
    for suffix in ['Observation', 'Thought', 'Action', 'Final Answer']:
        if suffix in image_path_clean:
            image_path_clean = image_path_clean[:image_path_clean.index(suffix)].strip().strip('"').strip("'")
    image_binary, image_type = prepare_image(image_path_clean)
    prompt = "Write an interesting story related to the given image. Produce the response without a preamble. Just write the story."
    response = invoke_nova_lite_multimodal(prompt=prompt, images=image_binary, image_types=image_type)
    return response
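The input cleanup inside image_to_story_tool can be tried out on its own. The sketch below copies that logic into a standalone helper (the name clean_tool_path and the sample inputs are illustrative, not part of the lab code) so you can see what each step removes.

```python
REACT_MARKERS = ("Observation", "Thought", "Action", "Final Answer")

def clean_tool_path(raw: str) -> str:
    """Normalize a file path that an LLM agent passed as a tool input."""
    # Drop newlines, outer whitespace, and outer quotes
    path = raw.replace("\n", "").strip().strip('"').strip("'")
    # Drop keyword-argument syntax such as image_path="poster.png"
    if "=" in path:
        path = path.split("=", 1)[1].strip().strip('"').strip("'")
    # Cut the string at the first occurrence of each ReAct keyword, if present
    for marker in REACT_MARKERS:
        if marker in path:
            path = path[: path.index(marker)].strip().strip('"').strip("'")
    return path

# Illustrative inputs resembling what an agent might emit
samples = [
    'generated_image.png',
    ' "generated_image.png" \n',
    'image_path="generated_image.png"\nObservation',
]
for sample in samples:
    print(repr(sample), "->", repr(clean_tool_path(sample)))
```

One consequence of this heuristic is that a path which itself contains "=" or one of the ReAct keywords (for example a folder named Action_shots) would be truncated, so it suits simple file names like the ones used in this lab.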
@tool
def add_text_to_image(image_path_and_title: str) -> str:
    """Use this tool to add the title to the movie poster. The input is a string with the image_path and title separated by comma."""
    # Split the single string input into the image path and the title text
    image_path, text = [part.strip().strip('"').strip("'") for part in image_path_and_title.split(",", 1)]
    # Open the image
    image = pil_image.open(image_path)
    # Create a drawing object
    draw = ImageDraw.Draw(image)
    # Define the font and its properties
    font_path = "data/lab4/Agents/FranklinGothic.ttf"  # Replace this with the path to your desired font file
    font_size = 80  # Adjust the font size as needed
    font = ImageFont.truetype(font_path, font_size)
    # Calculate the text position
    text_width = draw.textlength(text, font)
    image_width, image_height = image.size
    text_x = (image_width - text_width) / 2  # Center the text horizontally
    text_y = image_height - font_size - 50  # Position the text near the bottom
    # Draw the text on the image
    draw.text((text_x, text_y), text, font=font, fill=(255, 255, 255))  # White text color
    # Save the modified image (the output file name is an assumption; the original left it undefined)
    output_path = "titled_image.png"
    image.save(output_path)
    return output_path
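The tool's docstring describes a single comma-separated input, and the code places the title horizontally centered near the bottom edge. A minimal sketch of those two steps without Pillow is shown below; the helper names are hypothetical, and the 50-pixel margin is taken from the tool code above.

```python
def split_path_and_title(tool_input: str) -> tuple[str, str]:
    """Split 'path, title' on the first comma and trim whitespace and quotes."""
    image_path, title = tool_input.split(",", 1)
    return image_path.strip().strip("\"'"), title.strip().strip("\"'")

def bottom_center_position(image_width, image_height, text_width, font_size, margin=50):
    """Return the (x, y) origin that centers the text and keeps it above the bottom edge."""
    text_x = (image_width - text_width) / 2
    text_y = image_height - font_size - margin
    return text_x, text_y

print(split_path_and_title("generated_image.png, Paradox"))
print(bottom_center_position(1024, 1024, text_width=400, font_size=80))
```

Splitting on the first comma only means the title may itself contain commas, while the image path may not.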
@tool
def image_generator_tool(prompt: str) -> str:
    """
    Generate an image using a text prompt.
    Args:
        prompt (str): The text prompt for image generation.
    Returns:
        str: The file path of the generated image.
    """
    # Initialize AWS client for Bedrock Runtime
    client = boto3.client(service_name="bedrock-runtime", region_name="us-west-2")
    # Set request headers
    accept = "application/json"
    content_type = "application/json"
    # Set model ID for image generation
    model_id = 'stability.sd3-5-large-v1:0'
    # Prepare request body
    body = json.dumps({
        "prompt": prompt
    })
    # Invoke the model
    response = client.invoke_model(
        body=body, modelId=model_id, accept=accept, contentType=content_type
    )
    # Parse the response
    response_body = json.loads(response.get("body").read())
    # A non-null finish reason means the image was not generated successfully
    finish_reason = response_body.get('finish_reasons', [None])[0]
    if finish_reason is not None:
        raise Exception(f"Image generation error: {finish_reason}")
    img = response_body.get('images')[0]
    base64_bytes = img.encode('ascii')
    image_bytes = base64.b64decode(base64_bytes)
    # Save the generated image
    image_path = "generated_image.png"
    pil_image.open(io.BytesIO(image_bytes)).save(image_path)
    return image_path
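The response handling in image_generator_tool can be exercised without calling Bedrock. The sketch below applies the same checks to a fabricated payload; the "images" and "finish_reasons" fields mirror what the tool code reads, while the helper name, the sample bytes, and the sample finish reason are made up for illustration.

```python
import base64
import json

def extract_image_bytes(response_body: dict) -> bytes:
    """Return the decoded bytes of the first image, or raise if generation did not finish cleanly."""
    finish_reason = response_body.get("finish_reasons", [None])[0]
    if finish_reason is not None:
        raise RuntimeError(f"Image generation error: {finish_reason}")
    return base64.b64decode(response_body["images"][0])

# Fabricated payloads standing in for the parsed JSON response body
fake_image = b"\x89PNG\r\n\x1a\n" + b"placeholder bytes, not a real image"
ok_body = json.loads(json.dumps({
    "images": [base64.b64encode(fake_image).decode("ascii")],
    "finish_reasons": [None],
}))
blocked_body = {"images": [], "finish_reasons": ["example finish reason"]}

print(extract_image_bytes(ok_body) == fake_image)
try:
    extract_image_bytes(blocked_body)
except RuntimeError as err:
    print(err)
```

Checking the finish reason before touching the image list matters here, because a failed generation may return no image to index into.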
@tool
def image_variation_tool(prompt: str):
    """
    Generate a second, alternative version of a movie poster. Use this tool AFTER image_generator_tool to create a different poster for the same movie. The input is the text prompt describing the movie poster.
    Args:
        prompt (str): The text prompt for the alternative poster.
    Returns:
        str: The file path of the generated variation image.
    """
    # Initialize the AWS Bedrock Runtime client
    client = boto3.client(service_name="bedrock-runtime", region_name="us-west-2")
    # Set the request headers and parameters
    accept = "application/json"
    content_type = "application/json"
    model_id = 'stability.sd3-5-large-v1:0'
    # Create the request body - generate a variation using a modified prompt
    variation_prompt = f"A different artistic interpretation of: {prompt}. Alternative style, different color palette, unique composition."
    body = json.dumps({
        "prompt": variation_prompt
    })
    # Invoke the Bedrock Runtime model for image variation
    response = client.invoke_model(
        body=body, modelId=model_id, accept=accept, contentType=content_type
    )
    response_body = json.loads(response.get("body").read())
    # Extract the generated image from the response
    finish_reason = response_body.get('finish_reasons', [None])[0]
    if finish_reason is not None:
        raise Exception(f"Image variation error: {finish_reason}")
    img = response_body.get('images')[0]
    base64_bytes = img.encode('ascii')
    image_bytes = base64.b64decode(base64_bytes)
    # Save the generated image to a file
    output_path = "variation_image.png"
    pil_image.open(io.BytesIO(image_bytes)).save(output_path)
    return output_path
2.2 Agentic application for image generation and description#
Let’s define the custom agent that will select the best tool at each planning step using the ReAct logic and accomplish the task. The application utilizes LangChain’s Agent Executor to orchestrate the custom agentic workflow that leverages three tools: an image generator, an image variation tool, and an image-to-story tool.
The agent workflow proceeds as follows:
Ingest the user’s movie concept or prompt.
Invoke the image generator tool to create an initial movie poster.
Call the image variation tool to generate a second poster variation.
Utilize the image-to-story tool (powered by a multimodal model) to analyze the visual elements of both posters and generate a corresponding plot synopsis.
Collate and present the two movie posters and the generated plot synopsis as the final output.
The agent executor acts as the central coordinator, managing the execution flow and data transfer between the custom tools. This agentic approach, facilitated by LangChain, enables the seamless integration and orchestration of the custom tools, resulting in a streamlined process for generating movie posters, variations, and narratives based on the user’s input.
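To make the Thought/Action/Observation cycle concrete, here is a toy version of the loop in plain Python. It is a sketch under stated assumptions, not LangChain's implementation: fake_llm returns scripted text instead of calling a model, and the three tools are stubs that return fixed strings.

```python
import re

SCRIPT = [
    "Thought: I need a poster first.\nAction: image_generator_tool\nAction Input: sci-fi poster for 'Paradox'",
    "Thought: Now a second version.\nAction: image_variation_tool\nAction Input: sci-fi poster for 'Paradox'",
    "Thought: Now the story.\nAction: image_to_story_tool\nAction Input: generated_image.png",
    "Thought: I now know the final answer\nFinal Answer: Two posters and a story were produced.",
]

def fake_llm(scratchpad: str) -> str:
    """Stand-in for the model: pick the next scripted step from the number of observations so far."""
    return SCRIPT[scratchpad.count("Observation:")]

# Stub tools that only return fixed strings
TOOLS = {
    "image_generator_tool": lambda prompt: "generated_image.png",
    "image_variation_tool": lambda prompt: "variation_image.png",
    "image_to_story_tool": lambda path: f"(story about {path})",
}

def run_react(max_iterations: int = 10):
    scratchpad, steps = "", []
    for _ in range(max_iterations):
        text = fake_llm(scratchpad)
        if "Final Answer:" in text:
            return text.split("Final Answer:", 1)[1].strip(), steps
        match = re.search(r"Action:\s*(.+?)\s*\nAction Input:\s*(.*)", text, re.DOTALL)
        if match is None:
            raise ValueError(f"Could not parse model output: {text!r}")
        tool_name, tool_input = match.group(1).strip(), match.group(2).strip()
        observation = TOOLS[tool_name](tool_input)
        steps.append((tool_name, tool_input, observation))
        # Feed the observation back so the next model call can build on it
        scratchpad += f"{text}\nObservation: {observation}\nThought:"
    return None, steps  # iteration budget exhausted without a final answer

final_answer, steps = run_react()
print(final_answer)
for step in steps:
    print(step)
```

The iteration cap plays the same role as the max_iterations setting on the executor: it stops a run in which the model never produces a final answer.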
# define custom agent
def create_custom_agent(tools):
    """
    Creates a custom agent with the given tools and a specific prompt template.
    Args:
        tools (list): A list of tools to be used by the agent.
    Returns:
        AgentExecutor: An instance of the AgentExecutor class with the custom agent.
    """
    import re
    from langchain_aws import ChatBedrockConverse
    from langchain.agents import AgentExecutor, create_react_agent
    from langchain_core.prompts.chat import ChatPromptTemplate
    from langchain.agents.output_parsers import ReActSingleInputOutputParser
    from langchain_core.agents import AgentAction, AgentFinish

    class FixedReActOutputParser(ReActSingleInputOutputParser):
        """Custom parser that handles function-call-style actions like tool_name(args)."""
        def parse(self, text: str):
            # Fix function-call-style actions: tool_name(args) -> tool_name + args
            func_call_pattern = r'Action:\s*(\w+)\((.*)\)'
            match = re.search(func_call_pattern, text, re.DOTALL)
            if match:
                tool_name = match.group(1).strip()
                tool_input = match.group(2).strip().strip('"').strip("'")
                # Remove keyword argument syntax like key="value"
                if '=' in tool_input and not any(c in tool_input.split('=')[0] for c in ' /\\.'):
                    tool_input = tool_input.split('=', 1)[1].strip().strip('"').strip("'")
                text = text[:match.start()] + f'Action: {tool_name}\nAction Input: {tool_input}'
            return super().parse(text)

    # Initialize the large language model (LLM) with the specified model ID and temperature
    llm = ChatBedrockConverse(
        model="amazon.nova-pro-v1:0",
        temperature=0
    )
    # Define the prompt template for the agent
    prompt = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                """Answer the following questions as best you can. You have access to the following tools:
{tools}
Use the following format:
Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question""",
            ),
            ("user", "Begin!\n\nQuestion: {input}\nThought:{agent_scratchpad}")
        ]
    )
    # Create the custom agent using the LLM, tools, and prompt
    agent = create_react_agent(llm, tools, prompt, output_parser=FixedReActOutputParser())
    # Create an instance of the AgentExecutor with the custom agent
    return AgentExecutor(agent=agent, tools=tools, verbose=True, return_intermediate_steps=True, handle_parsing_errors=True, max_iterations=10)

# Create a list of all relevant tools
tools = [image_to_story_tool, image_generator_tool, image_variation_tool]
# Define the custom agent and provide the agent access to the above tools
movie_agent = create_custom_agent(tools)
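The rewrite that FixedReActOutputParser applies before delegating to the standard parser can be inspected in isolation. The sketch below lifts the same regular expression and keyword-argument heuristic into a standalone function (normalize_action is an illustrative name) and runs it on two sample model outputs.

```python
import re

FUNC_CALL = re.compile(r"Action:\s*(\w+)\((.*)\)", re.DOTALL)

def normalize_action(text: str) -> str:
    """Rewrite 'Action: tool(args)' into the 'Action: tool' plus 'Action Input: args' form."""
    match = FUNC_CALL.search(text)
    if not match:
        return text  # already in the standard ReAct format
    tool_name = match.group(1).strip()
    tool_input = match.group(2).strip().strip('"').strip("'")
    # Treat 'key=value' as a keyword argument only if the key looks like an identifier
    if "=" in tool_input and not any(c in tool_input.split("=")[0] for c in " /\\."):
        tool_input = tool_input.split("=", 1)[1].strip().strip('"').strip("'")
    return text[: match.start()] + f"Action: {tool_name}\nAction Input: {tool_input}"

call_style = 'Thought: write the story\nAction: image_to_story_tool(image_path="generated_image.png")'
standard = "Thought: draw it\nAction: image_generator_tool\nAction Input: a poster (dark mood)"

print(normalize_action(call_style))
print(normalize_action(standard) == standard)
```

Text that already follows the standard format passes through unchanged, so the fix only affects outputs where the model wrote the action as a function call.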

Activity: Try it yourself!#
Try different prompts and observe the posters and the plots of the movies using the custom agent.
# Test out the agent with a simple example
prompt = """Draw a poster for a sci-fi PG-13 movie called 'Paradox' using the image_generator_tool, \
then create a second different poster for the same movie using the image_variation_tool, \
and finally write the story of the movie based on the first poster using the image_to_story_tool."""
response_movie = movie_agent.invoke({"input": prompt})
> Entering new AgentExecutor chain...
Thought: I need to first generate a poster for the sci-fi movie 'Paradox' using the image_generator_tool. After that, I will create a second different poster for the same movie using the image_variation_tool. Finally, I will write the story of the movie based on the first poster using the image_to_story_tool.
Action: image_generator_tool
Action Input: "A sci-fi PG-13 movie poster for 'Paradox' featuring a futuristic cityscape with towering skyscrapers, flying vehicles, and a mysterious protagonist in the foreground. The sky is filled with neon lights and holographic advertisements. The title 'Paradox' is prominently displayed in bold, futuristic font."
Observation
---------------------------------------------------------------------------
AccessDeniedException                     Traceback (most recent call last)
Cell In[18], line 6
----> 6 response_movie = movie_agent.invoke({"input": prompt})
(...)
Cell In[14], line 29, in image_generator_tool(prompt)
     28 # Invoke the model
---> 29 response = client.invoke_model(
     30     body=body, modelId=model_id, accept=accept, contentType=content_type
     31 )
(...)
AccessDeniedException: An error occurred (AccessDeniedException) when calling the InvokeModel operation: Model access is denied due to IAM user or service role is not authorized to perform the required AWS Marketplace actions (aws-marketplace:ViewSubscriptions, aws-marketplace:Subscribe) to enable access to this model. Refer to the Amazon Bedrock documentation for further details. Your AWS Marketplace subscription for this model cannot be completed at this time. If you recently fixed this issue, try again after 5 minutes.
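In the run above, the exception raised inside image_generator_tool propagated through the executor and ended the run. The access problem itself has to be resolved in the AWS account, as the error message describes. Separately, one option (an assumption, not something the lab code does) is to turn tool exceptions into text so the agent receives them as an observation. The sketch below shows the idea with a plain decorator and a stub tool; AccessDeniedError is a hypothetical stand-in for the real exception class.

```python
import functools

def errors_as_observations(func):
    """Wrap a tool function so that exceptions are returned as a message instead of raised."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            return f"Tool '{func.__name__}' failed: {type(exc).__name__}: {exc}"
    return wrapper

class AccessDeniedError(Exception):
    """Hypothetical stand-in for the service error seen in the traceback."""

@errors_as_observations
def image_generator_stub(prompt: str) -> str:
    # Simulates the failing model call
    raise AccessDeniedError("Model access is denied")

print(image_generator_stub("a poster"))
```

The trade-off is that failures become easier to overlook, so this pattern fits cases where you want the agent to report the problem in its final answer rather than stop with a traceback.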
Here’s the first movie poster:#
Image("generated_image.png", width=300)
Here’s the second movie poster:#
Image("variation_image.png", width=300)
And here’s the plot of the movie:#
Markdown("<i>"+ response_movie['intermediate_steps'][-1][1] +"</i>")
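The cell above reads the story from the last intermediate step, which works only when image_to_story_tool is the last tool the agent called. A minimal sketch of looking the observation up by tool name instead is shown below. It relies on each intermediate step being an (action, observation) pair whose action exposes a tool attribute; FakeAction and the sample steps are stand-ins for illustration.

```python
from collections import namedtuple

# Stand-in for the agent action objects stored in intermediate_steps
FakeAction = namedtuple("FakeAction", ["tool", "tool_input"])

def observation_for(intermediate_steps, tool_name):
    """Return the most recent observation produced by tool_name, or None if it was never called."""
    for action, observation in reversed(intermediate_steps):
        if action.tool == tool_name:
            return observation
    return None

# Sample run in which the story tool was not called last
sample_steps = [
    (FakeAction("image_generator_tool", "poster prompt"), "generated_image.png"),
    (FakeAction("image_to_story_tool", "generated_image.png"), "Once upon a time..."),
    (FakeAction("image_variation_tool", "poster prompt"), "variation_image.png"),
]

print(observation_for(sample_steps, "image_to_story_tool"))
print(sample_steps[-1][1])
```

With the sample steps, the positional lookup returns the variation image path, while the lookup by name returns the story.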

Challenge: Use all the tools#
In the example above, we did not need to use the add_text_to_image tool. Try to craft a prompt so that this tool is also used.
### Enter your code below
###
3. Multimodal agent for retrieval-based responses#
This multimodal agent helps verify the authenticity of physical products by combining web search with multimodal image analysis. It has access to three custom tools, discussed in the next section.
3.1 Custom multimodal tools for retrieval#
Let’s define three custom tools that allow the agent to retrieve results from websites to authenticate the product in the image and to suggest websites where the authentic product may be purchased.
Image Comparison Tool: This tool utilizes Amazon Nova’s state-of-the-art multimodal capabilities to compare two product images in detail. It can analyze and compare various visual properties, such as shape, color, design elements, and intricate details, to determine the similarity or dissimilarity between the images.
Product Web Search: This tool allows the agent to perform comprehensive web searches using DuckDuckGo to gather information and visual representations of a specific product. It can retrieve product descriptions, specifications, and images from various online sources, building a comprehensive knowledge base about the product.
Image Web Search: This tool enables the agent to search the web for images of a product based on a text prompt. It can find and retrieve relevant product images from various online sources, further enhancing the agent’s visual knowledge base.
@tool
def image_comparison(image_paths: str):
    """Use this tool to compare and contrast two images. The input of the tool is a string consisting of both image paths separated by comma. Image paths can be local paths or urls."""
    # Split the single string input into individual, cleaned image paths
    image_paths_arr = [f.strip().replace("\n", "").strip('"').strip("'") for f in image_paths.split(',')]
    image_binary, image_type = prepare_image(image_paths_arr)
    prompt = "Compare and contrast the two images. Share insights on if they are completely identical, similar or distinct. Produce the response without a preamble. Just write the analysis."
    response = invoke_nova_lite_multimodal(prompt=prompt, images=image_binary, image_types=image_type)
    return response

from ddgs import DDGS

@tool
def product_web_search(prompt: str):
    """Search online for a website about a product. The input is the prompt with the product ID and the brand name. The prompt needs to be under 45 characters."""
    search_tool = DDGS()
    time.sleep(1)  # Add delay between requests
    response = search_tool.text(query=prompt, max_results=5, region='us-en', safesearch='on')
    return response

@tool
def image_web_search(prompt: str):
    """Search the web for images of a product based on the prompt. The input is the prompt or query used to search for images of a product."""
    search_tool = DDGS()
    time.sleep(10)  # Add delay between requests
    # Take the first image result and drop any query string from its URL
    response = search_tool.images(query=prompt, region='us-en', max_results=1)[0]['image'].partition("?")[0]
    time.sleep(10)  # Add delay between requests
    return response
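The search tools above space out requests with fixed delays. An alternative, sketched here as an assumption rather than part of the lab, is to retry a failed call with exponentially growing delays. The helper name, its parameters, and the flaky stand-in search function are all illustrative; no real search request is made.

```python
import time

def call_with_backoff(func, *args, retries=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call func, retrying after failures with delays of base_delay * 2**attempt seconds."""
    for attempt in range(retries + 1):
        try:
            return func(*args, **kwargs)
        except Exception:
            # In real use, catch only the search library's rate-limit error here
            if attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)

# Demo with a stand-in search function that fails twice before succeeding
attempts = {"count": 0}

def flaky_search(query):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("rate limited")
    return f"results for {query}"

delays = []  # record the delays instead of actually sleeping
result = call_with_backoff(flaky_search, "FN5214-131", sleep=delays.append)
print(result, delays)
```

Passing the sleep function as a parameter keeps the demo instant; in a notebook you would leave it at its default so the delays are real.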

Challenge: Create custom tools#
The agent's ability to pick the appropriate tool is crucial in achieving the desired response. Let's explore this ability by providing the agent with many more tools. Create a few more useful tools in the cell below and test the agent's response when it has to select from many more tools.
### Enter your code below
###
3.2 Agentic application based on retrieval#
Let’s define the custom agent, similar to the previous example, that will select the best tool at each planning step using the ReAct logic.
The multimodal agent’s workflow is as follows:
When presented with an image of a product, the agent utilizes the image web search tool to find additional images of the product, expanding its visual knowledge base.
Using the image comparison tool, the agent compares the visual information gathered from the web with the physical product in question.
Based on the comparison results, the agent can provide an assessment of whether the physical product is likely to be authentic or an imitation.
Finally, it searches online for a website where the user may purchase the authenticated product.
This multimodal agent combines web search with multimodal image comparison to support product authentication. By comparing the product photo against reference images found online, the agent can indicate whether a product is likely authentic; its assessment is model-generated and should not be treated as proof.
retrieval_tools = [product_web_search, image_web_search, image_comparison]
retrieval_agent = create_custom_agent(retrieval_tools)
Image("content/Agents/nike.jpg", width=300)

Activity: Try it yourself!#
Try different images of products and observe how the agent authenticates the product using the tools.
If you get a RateLimitException, wait a few minutes before trying again. DuckDuckGo is a free web search service that limits the number of API requests.
prompt = """I have an image of a shoe at "./content/Agents/nike.jpg".
Can you check if they are Nike Air Jordan 1 Low SE FN5214-131?
If they are, find the link to the website where I can find the product. Do not generate clickable links in the output."""
try:
    response_shoes = retrieval_agent.invoke({"input": prompt})
except Exception as e:
    response_shoes = {}
    # A 403 status in the error text is treated as a rate-limit signal
    if "403" in str(e):
        print("\nRateLimitException raised. Wait a few minutes before trying again.")
    else:
        print(f"\nAn unexpected error occurred: {str(e)}")
Here’s the response about the authenticity of the product:#
⚠️ IMPORTANT SECURITY NOTICE ⚠️#
The URLs displayed in this educational notebook are REAL URLs. While they are shown for educational purposes, we strongly advise:
DO NOT click on or visit these URLs
DO NOT use them for further exploration
DO NOT assume they are safe or vetted
This notebook is for demonstration purposes only. Visiting unknown URLs can expose you to security risks, malware, or inappropriate content. Always practice safe browsing habits and only visit trusted, verified websites.
If you need to explore web resources, please use official documentation and trusted sources.
if 'output' in response_shoes:
    result = Markdown("<i>" + response_shoes['output'] + "</i>")
else:
    result = Markdown("No field `output` found in response data")
result
Here’s the response about the webpage to purchase the original product.#
if 'intermediate_steps' in response_shoes and response_shoes['intermediate_steps']:
    url = response_shoes['intermediate_steps'][-1][1]
    # Call display() explicitly: a bare expression inside an if-block is not rendered by the notebook
    if isinstance(url, str):
        display(Markdown(f"<i>{url}</i>"))
    else:
        display(JSON(url))
else:
    display(Markdown("No intermediate steps found in response data"))
4. Quizzes#

Challenge: Try it Yourself!#
Answer the following questions to test your understanding of using multimodal models for generating personalized and inclusive content.
from mlu_utils.quiz_questions import lab5c_question1, lab5c_question2
lab5c_question1.display()
lab5c_question2.display()
Conclusion#
In this lab, you have:
Developed custom multimodal tools for image generation and description
Created an agentic application for movie poster generation and storytelling
Built custom multimodal tools for retrieval-based responses
Implemented a product authentication agent using web search and image comparison
Additional Resources#
LangChain Agents Documentation
Amazon Bedrock Documentation