Independent Study Guide Edition 1.0 · 2026 · Demo

Production Generative AI on AWS

A Field Guide for the Developer–Professional Exam
Covers
All 5 exam domains · 20 task statements · the AWS GenAI stack
Format
Concept · Service spotlight · Decision framework · Practice MCQs
Independently published. Not affiliated with, endorsed by, or sponsored by Amazon Web Services or Amazon.com, Inc.
Preface

How to Use This Guide

A study guide is only as good as the discipline you bring to it. This one is built for the candidate who has eight to ten weeks, a full-time job, and a healthy distrust of certification fluff.

The AWS Certified Generative AI Developer — Professional exam (AIP-C01) is among the first professional-level certifications focused entirely on building production GenAI systems on AWS. This is not a survey course in machine learning, nor a prompt-engineering quiz. The exam tests your ability to make architectural and operational decisions under realistic constraints — latency budgets, compliance regimes, cost ceilings, accuracy thresholds, and vendor trade-offs.

This guide is structured to mirror that decision-making muscle. Every chapter follows the same rhythm: concept first, then services, then decision framework, then practice. You will not find encyclopedic dumps of every AWS feature. You will find the features the exam actually probes, organized so you can recall them under time pressure. Along the way you’ll meet a small set of recurring visual primitives — mental-model figures, service spotlight cards, comparison tables, decision lists, and code-compare panes — that earn their place by carrying the highest-leverage decisions of each chapter.

Who this guide is for

AIP-C01 is a professional-level certification. The content of this book is correspondingly intermediate to advanced and assumes you have already built or shipped at least one cloud application on AWS — you have used IAM, VPCs, S3, Lambda, CloudWatch, and at least one compute or container service in anger — and that you have hands-on familiarity with at least one large language model and a basic understanding of how vector retrieval and embeddings work.

If you are entirely new to AWS or to generative AI, start with the AWS Cloud Practitioner and AI Practitioner certifications first; this guide will move faster than is comfortable for absolute beginners. We will not stop to explain what an S3 bucket is, what an IAM role looks like, what a VPC endpoint does, or what a foundation model is — the exam doesn’t and neither do we.

You are likely:

  • A backend or platform engineer with several years of AWS experience, now integrating foundation models into production systems.
  • An ML engineer with classical-ML background expanding into generative workloads.
  • A solutions architect needing to defend GenAI design choices to security, finance, and product stakeholders.
  • A consultant who needs to walk into client conversations with current, exam-grade fluency in the AWS GenAI stack.

Treat anything that lands as “new to me, but I’ve seen the building blocks” as the target reading level. If a chapter feels like it’s assuming things you’ve never touched (a VPC endpoint policy, a SageMaker endpoint, a Lambda execution role), pause — build the missing block in the console for half an hour — then come back. The exam is hands-on by design; the book is hands-on by design.

How to read this book

Each chapter is self-contained and can be read in isolation, but the parts build on each other. Domain 1 establishes vocabulary. Domain 2 covers integration patterns that the rest of the book assumes. Domains 3, 4, and 5 each focus on a non-functional concern: safety, efficiency, and quality. The pattern within each chapter is consistent:

Chapter Anatomy
What every chapter contains
Open | Learning objectives
Core | Concepts & AWS services
Pattern | Decision framework
Drill | Practice MCQs
Close | Summary & recap

Three callout types appear throughout. Read them — they carry the highest density of exam-relevant material:

A nine-week study plan

The plan below assumes about 10–12 hours per week of focused study. If you have more, accelerate; if you have less, stretch the schedule and prioritize Weeks 1, 4, and 9.

Recommended Weekly Schedule
Week | Focus | Outcome
1 | Read Part I (Domain 1, Chapters 1–6): foundation models, RAG, vector stores, prompts | Architect a RAG solution on paper
2 | Hands-on lab: Bedrock + Knowledge Bases + OpenSearch Serverless | End-to-end RAG demo working
3 | Read Part II (Domain 2, Chapters 7–11): agents, deployment, FM APIs, MLOps | Build a working Bedrock Agent with tool calls
4 | Read Part III (Domain 3, Chapters 12–15): guardrails, encryption, governance, responsible AI | Configure Bedrock Guardrails + CloudTrail logging
5 | Read Part IV (Domain 4, Chapters 16–18): cost, latency, monitoring | Build a CloudWatch dashboard for an LLM workload
6 | Read Part V (Domain 5, Chapters 19–20): evaluation, troubleshooting | Implement an LLM-as-a-judge eval pipeline
7 | Walk Back Matter A (Exam Strategy) + B (Glossary, spaced repetition) | Internalize the five-pass MCQ procedure
8 | Practice exams · review missed questions · re-read flagged callouts | Score ≥ 80% on practice consistently
9 | Final review · Cheat Sheets (Back Matter C) · book the exam | Pass on first attempt

What this guide is not

This is not an AWS service catalog, not a Python tutorial, and not a substitute for hands-on practice. Where source material includes long code listings, we have summarized the conceptual takeaway and pointed you to AWS documentation for current SDK syntax. Treat the official AWS documentation as canonical for any code you intend to ship. The Back Matter sections (Exam Strategy, Glossary, Cheat Sheets) are where the book’s decisions distill into something you can re-read in the fifteen minutes before the test — budget time for them.

A note from the author

I wrote this guide for myself first. I was preparing for the AIP-C01 exam in early 2026 and could not find a study resource that combined the depth I needed with the structure my brain wanted — concepts, then services, then decision frameworks, then drills, with no padding. So I built one. I sat the exam, I passed, and the material in your hands is the same material I used to get there.

This is not an official AWS publication, and it is not a replacement for hands-on practice or for the AWS documentation. It is an opinionated, decision-oriented study guide — a real candidate’s playbook rather than a marketing document. If it helped me pass, I’m confident it can help you pass too.

Good luck. Now: open Part I, brew something hot, and let’s begin.

End of Preface
Exam Overview

The AIP-C01 Exam, In One Chapter

Format, scoring, domain weightings, and a realistic look at what the question paper actually feels like.

The AWS Certified Generative AI Developer — Professional exam (code: AIP-C01) is a 180-minute, scenario-driven examination consisting of 100 questions — 85 scored and 15 unscored. You won’t be told which are which; treat them all as scored. It’s delivered through Pearson VUE testing centers and as an online proctored exam, and it carries a recommended prerequisite of two or more years of hands-on experience designing and operating GenAI workloads on AWS.

You do not need to memorize service quotas or current pricing. You do need to be able to look at a multi-paragraph scenario — complete with constraints, red herrings, and competing priorities — and pick the architecture that satisfies the requirements at the lowest reasonable cost and operational burden.

Domain weightings

Five domains are weighted as follows. The percentages are official AWS guidance; treat them as your study budget allocator.

AIP-C01 Domain Weighting
# | Domain | Weight | Tasks
1 | Foundation Model Integration & Data Management | 27% | 1.1 – 1.6
2 | Implementation & Integration | 26% | 2.1 – 2.5
3 | Security, Compliance & Governance | 20% | 3.1 – 3.4
4 | Operational Efficiency & Optimization | 14% | 4.1 – 4.3
5 | Evaluation & Troubleshooting | 13% | 5.1 – 5.2
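One way to use the weightings as a budget allocator is to split your total study hours in proportion to them. The sketch below assumes the weights in the table above and a hypothetical 100-hour budget (roughly nine weeks at 10–12 hours); the function name and the budget are illustrative, not part of any official guidance.

```python
# Domain weights (percent), copied from the table above.
DOMAIN_WEIGHTS = {
    "1 Foundation Model Integration & Data Management": 27,
    "2 Implementation & Integration": 26,
    "3 Security, Compliance & Governance": 20,
    "4 Operational Efficiency & Optimization": 14,
    "5 Evaluation & Troubleshooting": 13,
}


def allocate_hours(total_hours, weights=DOMAIN_WEIGHTS):
    """Split a study budget across domains in proportion to exam weight."""
    total_weight = sum(weights.values())
    return {name: total_hours * w / total_weight for name, w in weights.items()}


# Hypothetical budget of 100 hours.
for name, hours in allocate_hours(100).items():
    print(f"{hours:5.1f} h  Domain {name}")
```

If your practice-exam score report shows a weak domain, shifting a few hours toward it is a reasonable override of the purely proportional split.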

Question formats

The exam uses three question formats. Knowing the difference matters because the marking rules differ.

  1. Multiple choice — one stem, four options, exactly one correct answer. The most common format.
  2. Multiple response — one stem, five or more options, two or three correct answers. Partial credit is not awarded; you must select the exact correct subset.
  3. Scenario / case study — a long preamble (architecture diagram, customer requirements, constraints) followed by 2–4 dependent questions. Read the preamble carefully before starting; the same setup feeds multiple questions.

Passing standard

The reported passing score is approximately 750 / 1000, but AWS uses a scaled scoring model with statistical equating. Aim for ≥ 80% on practice exams to be comfortable on test day. Score reports break results down by domain (“Meets Competencies / Needs Improvement”); use those to direct your final review week.

Time budget

180 minutes for 100 questions works out to an average of about 1:48 per question. In practice, scenario questions consume 3–5 minutes each, while plain multiple-choice items can be answered in under a minute. Use the in-exam Flag for Review feature liberally: on the first pass, flag anything you cannot answer in 90 seconds, finish the easy questions, and circle back. Back Matter A (Exam Strategy) walks through the full five-pass pacing plan in detail.

Time Budget
A realistic 180-minute pacing plan
Pass 1 | ~90 min · touch every question once, answer the easy ones, flag the hard
Pass 2 | ~50 min · work the flagged set deeply, commit to answers
Pass 3 | ~25 min · revisit still-flagged few with fresh eyes
Pass 4–5 | ~15 min · gut-check & verify nothing is blank
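The pacing arithmetic is easy to check for yourself. This sketch uses only the numbers quoted in this chapter (180 minutes, 100 questions, and the pass budgets above); the helper names are illustrative.

```python
EXAM_MINUTES = 180
QUESTIONS = 100

# Pass budgets in minutes, copied from the pacing plan above.
PASSES = {"Pass 1": 90, "Pass 2": 50, "Pass 3": 25, "Pass 4-5": 15}


def seconds_per_question(minutes=EXAM_MINUTES, questions=QUESTIONS):
    """Average time available per question, in seconds."""
    return minutes * 60 / questions


def format_mm_ss(seconds):
    """Render a duration in seconds as m:ss."""
    whole = round(seconds)
    return f"{whole // 60}:{whole % 60:02d}"


print(format_mm_ss(seconds_per_question()))                          # whole-exam average
print(format_mm_ss(seconds_per_question(PASSES["Pass 1"])))          # average touch time in pass 1
print(sum(PASSES.values()) == EXAM_MINUTES)                          # plan uses the full sitting
```

The first-pass average comes out under a minute per question, which is why the plan only works if you flag hard items quickly instead of wrestling with them.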

What the exam loves to test

Across all five domains, expect heavy emphasis on these decision pivots. These are the seams where two services overlap and the “right” answer depends on a single qualifier in the question:

  • Bedrock vs. SageMaker JumpStart vs. self-hosted — managed convenience vs. customization vs. control.
  • RAG vs. fine-tuning vs. continued pre-training — data freshness, knowledge depth, cost.
  • Amazon Bedrock Knowledge Bases vs. Kendra vs. raw OpenSearch — managed vs. document ACLs vs. flexibility.
  • On-demand vs. provisioned throughput vs. batch inference — latency, cost, predictability.
  • Bedrock Agents vs. custom orchestration with the Converse API — managed vs. flexible.
  • Bedrock Guardrails vs. application-level filters vs. Amazon Comprehend — safety surface area.

Each of these pivots gets its own decision framework in the chapters that follow.

End of Exam Overview
Part I
01

Foundation Model Integration & Data Management

Domain 1 of the AIP-C01 covers the end-to-end lifecycle of a foundation-model workload — from translating a business problem into an architecture, through choosing and configuring the model, building data pipelines, indexing into vector stores, retrieving relevant context, and governing the prompts that drive it all. This is where every GenAI application begins.

Exam weight: 27%
Tasks: 6
Chapters: 6
Practice MCQs: ~20
01 Part I · Chapter 1 · Task 1.1

Analyze Requirements & Design GenAI Solutions

Before you write a single line of Bedrock code, you have to translate a business problem into an architecture. This chapter teaches you how to decide whether GenAI is even the right tool, and — if it is — which of five canonical patterns to reach for.
[Figure: GenAI in Real Life. “Use a foundation model where it earns its keep, not where regex would do.” Two phone-screen panels. Wrong tool (LLM to parse a ZIP code): an LLM-powered address validator checks whether 94103 is a valid US ZIP code; latency 1,840 ms, cost $0.00214 per call. A regex would have worked: 10,000 calls/day × $0.002 = $600/month for what /^\d{5}$/ does for free. Right tool (support-ticket triage): an assistant summarizes a 14-message thread, classifies priority (billing dispute, refund requested, priority HIGH), and drafts a customer reply. No rule-based system can do this; it saves an agent 4 minutes per ticket. Principle: foundation models earn their cost on tasks rules can’t express (ambiguous, generative, language-shaped), not on tasks a 30-character regex solves.]
GenAI in Real Life — The same $0.002 model call is a waste in one app and a steal in another. The four-question checklist in §1.1 exists to keep you on the right side of this line.
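To make the figure's point concrete, here is a minimal sketch of the deterministic alternative plus the cost arithmetic. It checks format only (five digits), exactly like the figure's pattern; whether a ZIP code is actually assigned would need a lookup table. The function names are illustrative.

```python
import re

# Five ASCII digits. fullmatch() anchors both ends, like ^...$ in the figure.
ZIP_RE = re.compile(r"\d{5}", re.ASCII)


def is_valid_zip_format(value: str) -> bool:
    """Format check only: exactly five digits, nothing else."""
    return ZIP_RE.fullmatch(value) is not None


def monthly_cost(calls_per_day, cost_per_call, days=30):
    """Monthly spend for a fixed per-call price."""
    return calls_per_day * cost_per_call * days


print(is_valid_zip_format("94103"))
print(is_valid_zip_format("9410A"))
print(f"${monthly_cost(10_000, 0.002):,.2f} per month for the LLM version")
```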

1.1 · Is this even a GenAI problem?

The most expensive mistake in generative AI is using a foundation model where a regular expression would do. Before you reach for Bedrock, work through a four-question checklist. If you cannot answer yes to at least three, your problem belongs to traditional ML, classic search, or simple business logic. The exam punishes over-engineering.

  1. Does the task require language understanding, generation, or transformation? — summarization, drafting, translation, intent extraction. If the answer is “classification with structured features,” reach for Amazon SageMaker or Amazon Comprehend instead.
  2. Is the input variable, unstructured, or open-ended? — free-form support tickets, PDFs, conversational queries. Foundation models excel at variability.
  3. Can the system tolerate probabilistic output? — if every response must be 100% accurate (think: tax calculations, medical dosages), you need a deterministic system underneath, with the LLM only as an orchestrator.
  4. Is sufficient context obtainable? — either through prompts, retrieval, or fine-tuning. If the answer lives in a database the model cannot reach, no amount of prompting will help.
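The checklist can be written down as a tiny scoring helper. This is only a study aid under the "at least three yes answers" rule stated above; the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FitCheck:
    needs_language: bool            # Q1: understanding, generation, or transformation
    input_is_unstructured: bool     # Q2: variable, unstructured, or open-ended input
    tolerates_probabilistic: bool   # Q3: probabilistic output is acceptable
    context_obtainable: bool        # Q4: context reachable via prompt, retrieval, or tuning

    def yes_count(self) -> int:
        return sum([
            self.needs_language,
            self.input_is_unstructured,
            self.tolerates_probabilistic,
            self.context_obtainable,
        ])

    def is_genai_candidate(self, threshold: int = 3) -> bool:
        """True when at least `threshold` of the four answers are yes."""
        return self.yes_count() >= threshold


zip_validation = FitCheck(False, False, False, True)
ticket_triage = FitCheck(True, True, True, True)
print(zip_validation.is_genai_candidate())
print(ticket_triage.is_genai_candidate())
```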

Common GenAI use case categories

The AIP-C01 exam organizes GenAI use cases into five categories. Memorize the canonical AWS service for each — questions often hide the use case in a verb (“summarize”, “classify”, “generate”) rather than name it directly.

Canonical GenAI Use Cases & Default AWS Services
Category | Examples | Default service
Text generation | Email drafting, content creation, conversational assistants | Amazon Bedrock (Anthropic Claude, Meta Llama, Amazon Nova)
Code generation | Code completion, refactoring, test generation | Amazon Q Developer (Bedrock under the hood)
Summarization & extraction | Long-document summaries, structured field extraction | Bedrock + structured output (tool use / JSON mode)
Image & multimodal | Image creation, visual Q&A, document understanding | Bedrock (Stable Diffusion, Amazon Titan Image, Nova Canvas)
Search & knowledge | Natural-language Q&A over private data | Bedrock Knowledge Bases · Amazon Kendra · custom RAG on OpenSearch

1.2 · Functional & non-functional requirements

Once you have decided GenAI is appropriate, the requirements analysis you would do for any system splits into two buckets: functional and non-functional. The exam is unusually fond of latency / cost / compliance trade-offs, so each row in the non-functional table below is a likely question seam.

Functional requirements

What the solution must accomplish. For foundation-model workloads this includes: input modalities (text, images, audio, documents); expected output format and grounding requirements; required domain knowledge; integration points; and the interaction pattern (synchronous chat, streaming, batch). One requirement matters most: does the app need multi-turn conversational state? If yes, you need session management. If no, a stateless InvokeModel call works.
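The multi-turn question matters because the Converse API itself is stateless: your application resends the message history on every call. The sketch below keeps that history in a plain Python list as a stand-in for a real session store such as DynamoDB or ElastiCache. The `ChatSession` class is hypothetical; `client` is assumed to be a `boto3.client("bedrock-runtime")` instance, or any object with a compatible `converse` method.

```python
class ChatSession:
    """Keeps the running message list for one conversation."""

    def __init__(self, client, model_id, max_tokens=512):
        self.client = client          # e.g. boto3.client("bedrock-runtime")
        self.model_id = model_id
        self.max_tokens = max_tokens
        self.messages = []            # alternating user / assistant turns

    def send(self, user_text):
        """Append the user turn, call the model, and store the reply."""
        self.messages.append({"role": "user", "content": [{"text": user_text}]})
        resp = self.client.converse(
            modelId=self.model_id,
            messages=self.messages,   # the full history goes out on every call
            inferenceConfig={"maxTokens": self.max_tokens},
        )
        reply = resp["output"]["message"]
        self.messages.append(reply)   # keep the assistant turn for the next call
        return reply["content"][0]["text"]
```

In production you would load `messages` from the session store at the start of a request and write it back at the end, and you would likely trim or summarize old turns to stay inside the context window.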

Non-functional requirements

Latency, throughput, cost, availability, security, and compliance. These usually determine the architecture rather than merely influence it.

Non-Functional Requirements → AWS Mapping
Concern | Diagnostic question | AWS lever
Latency | Is end-user response time under 2 s critical? | Smaller model · Bedrock provisioned throughput · response streaming
Throughput | Sustained TPS at peak? | Provisioned throughput · SageMaker auto-scaling · async inference
Cost | Cost per inference / monthly ceiling? | Model right-sizing · Bedrock batch · prompt & semantic caching
Availability | RTO / RPO requirements? | Multi-region deployment · Bedrock cross-region inference
Security | Data classification & access control? | IAM · AWS KMS · VPC endpoints · Bedrock Guardrails
Compliance | HIPAA / PCI / GDPR / SOC 2? | Regional deployment · CloudTrail · AWS Config · data residency
Data volume | How much context / how much training data? | S3 · OpenSearch · Kendra · SageMaker training
Accuracy | Quality threshold & tolerance for errors? | Bedrock Guardrails · human-in-loop · automated evaluation

1.3 · The five canonical AWS GenAI patterns

Almost every architecture the exam asks you to design is a composition of these five patterns. Memorize their trigger conditions; the exam phrases scenarios specifically so that exactly one pattern fits.

Figure 1.1 · Mental Model
Pattern selection tree — which of the five fits the problem?

Walk down the tree from the top. Stop at the first “yes.” The exam writes scenarios so that exactly one branch fits cleanly.

Start: business problem. A YES picks the pattern; a NO moves to the next question.

  1. Q1. Is the answer in the model’s training data? YES → Pattern 1, Direct API (Bedrock Converse / InvokeModel).
  2. Q2. Does it need to act on systems? YES → Pattern 3, Agents & Tool Use (Bedrock Agents · Action Groups).
  3. Q3. Is the answer in your data? YES → Pattern 2, RAG (Knowledge Bases · OpenSearch).
  4. Q4. Need style / format / domain shift? YES → Pattern 4, Fine-Tuning / PEFT (SageMaker · Bedrock Custom). NO → Pattern 5, Multi-Model / Ensemble (route · cascade · combine).

Pro tip: Patterns compose. Production systems frequently chain RAG + Agents, or run fine-tuned models behind a router. Each “no” you traverse buys more capability at the cost of more tokens, more engineering, or both.
Read it like this: the green branch is the cheapest pattern that satisfies the requirement. Every “no” spends more tokens, more engineering effort, or both. Always justify the next branch — the exam loves to test the over-engineered choice.
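The tree above reduces to a chain of early returns. The sketch below encodes the four questions in the order the figure asks them; the function and argument names are illustrative, and real scenarios often compose several patterns.

```python
def select_pattern(in_training_data, needs_actions, in_your_data, needs_style_shift):
    """Walk the selection tree top-down and stop at the first 'yes'."""
    if in_training_data:        # Q1
        return "Pattern 1: Direct API"
    if needs_actions:           # Q2
        return "Pattern 3: Agents & Tool Use"
    if in_your_data:            # Q3
        return "Pattern 2: RAG"
    if needs_style_shift:       # Q4
        return "Pattern 4: Fine-Tuning / PEFT"
    return "Pattern 5: Multi-Model / Ensemble"


# Generic summarization: the model already knows enough.
print(select_pattern(True, False, False, False))
# Q&A over private, frequently updated PDFs.
print(select_pattern(False, False, True, False))
```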

Pattern 1 · Direct API integration

The simplest pattern: your application calls InvokeModel or Converse on Amazon Bedrock, the foundation model returns a response, and your code post-processes it. Conversation state lives in your application layer (DynamoDB, ElastiCache, or memory). Compute is typically AWS Lambda or a container on Amazon ECS / EKS.

Use this pattern when the model’s training data is sufficient knowledge for the task — summarization, translation, brainstorming, generic Q&A, classification of free-text input. The moment you need your data in the response, you graduate to RAG.

InvokeModel vs. Converse: the same call, two eras. The older InvokeModel API requires a model-specific JSON body: Anthropic, Meta, and Cohere each expect a different shape, so swapping models means rewriting the request. The newer Converse API normalizes that into one request shape and one response shape across the Bedrock models that support chat-style messages. Use Converse for any new conversational work and treat InvokeModel as legacy there; InvokeModel is still the call for models Converse does not cover, such as embedding and image-generation models.

InvokeModel (legacy) · Per-model JSON
import boto3, json

bedrock = boto3.client("bedrock-runtime")

# Anthropic-specific body shape — different
# for Llama, Cohere, Titan, etc.
body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 512,
    "messages": [
        {"role": "user",
         "content": "Summarize CRISPR in 3 lines."}
    ],
}

resp = bedrock.invoke_model(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    body=json.dumps(body),
    contentType="application/json",
)

# Parse model-specific response shape
data = json.loads(resp["body"].read())
print(data["content"][0]["text"])
Couples your code to one model’s schema. Switching models means rewriting both the request body and the response parser.
Converse (recommended) · Provider-agnostic
import boto3

bedrock = boto3.client("bedrock-runtime")

# Same shape works for Claude, Llama, Nova,
# Cohere, Mistral — swap modelId only.
resp = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    messages=[
        {"role": "user",
         "content": [{"text": "Summarize CRISPR in 3 lines."}]}
    ],
    inferenceConfig={"maxTokens": 512},
)

# Uniform response shape
text = resp["output"]["message"]["content"][0]["text"]
print(text)
# resp["stopReason"] tells you why generation ended
# resp["usage"] gives input/output token counts
One call shape for every model. Native multi-turn, streaming (converse_stream), and tool use without per-provider hacks.
Service
Amazon Bedrock Converse API

Provider-agnostic API that handles multi-turn message state, tool use, and streaming uniformly across Bedrock-hosted models. Replaces the older InvokeModel for all conversational workloads.

Use when

You want a single integration that works across Anthropic, Meta, Cohere, Mistral, and Amazon Nova models without rewriting your client code per provider.
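For streaming, the counterpart call is `converse_stream`, which returns an event stream instead of a single message. A minimal sketch, assuming `client` is a `boto3.client("bedrock-runtime")` instance (or a compatible stand-in) and that you only care about text deltas; the helper name `stream_text` is hypothetical.

```python
def stream_text(client, model_id, prompt, max_tokens=512):
    """Yield text fragments as the model produces them."""
    resp = client.converse_stream(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": max_tokens},
    )
    for event in resp["stream"]:
        # Text arrives in contentBlockDelta events; other event types
        # (messageStart, messageStop, metadata) are skipped here.
        delta = event.get("contentBlockDelta", {}).get("delta", {})
        if "text" in delta:
            yield delta["text"]
```

A caller would typically write each fragment straight to the user, for example `for piece in stream_text(client, model_id, "Summarize CRISPR in 3 lines."): print(piece, end="", flush=True)`. Streaming does not shorten total generation time, but it cuts the time to the first visible token.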

Pattern 2 · Retrieval-Augmented Generation (RAG)

RAG is the dominant enterprise pattern. The workflow: embed the user query, search a vector store for relevant chunks, splice the retrieved text into the prompt, then call the foundation model. The model now “knows” about your private data without ever being trained on it.

RAG is the right answer when the source-of-truth changes more frequently than you can retrain a model (most enterprise data), when you must cite sources, or when the corpus is too large to fit in any context window. Two implementation paths:

  • Managed RAG — Amazon Bedrock Knowledge Bases handles ingestion, chunking, embedding, and retrieval. Lowest operational overhead. Choose this unless you need something it does not do.
  • Custom RAG — you build the pipeline yourself using Amazon OpenSearch Service (or Aurora pgvector), Bedrock embedding models, and your own orchestration. Choose this when you need fine control over chunking, hybrid search, custom metadata filtering, or non-AWS vector databases.
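The retrieve-then-splice workflow can be seen end to end in a few lines. This sketch is deliberately toy-sized: a bag-of-words counter stands in for a real embedding model, an in-memory list stands in for the vector store, and the final model call is left out. All names here are illustrative.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding'; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query, chunks, k=2):
    """Splice the retrieved chunks into the prompt, numbered for citation."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(retrieve(query, chunks, k)))
    return (
        "Answer using only the context below and cite sources by number.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


CHUNKS = [
    "Refunds are issued within 14 days of purchase.",
    "Our office is closed on public holidays.",
    "Shipping takes 3 to 5 business days.",
]
print(build_prompt("How many days until refunds are issued?", CHUNKS))
```

The resulting prompt string is what you would pass to the foundation model. With Bedrock Knowledge Bases, ingestion, embedding, and retrieval are handled for you; the custom path gives you control over each function above.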

Pattern 3 · Agents and tool use

An agent extends a foundation model with the ability to take actions. The model decides which function to call, supplies arguments, the function executes (a Lambda function, an API call, a database query), the result is fed back, and the agent decides what to do next. This loop continues until the agent has enough information to respond.

Amazon Bedrock Agents wraps this with managed orchestration: you define an instruction, register action groups (Lambda functions or OpenAPI schemas), optionally attach knowledge bases, and Bedrock handles the reasoning loop. For more flexibility, build a custom agent on top of the Bedrock Converse API’s tool-use feature.
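If you build the custom variant, the reasoning loop described above looks roughly like this with Converse tool use. Treat it as a sketch under stated assumptions: `client` is a `boto3.client("bedrock-runtime")` instance or a compatible stand-in, the `get_order_status` tool and its schema are hypothetical, each handler returns a JSON-serializable dict, and error handling is omitted.

```python
# Hypothetical tool definition in the Converse toolConfig shape.
TOOL_CONFIG = {"tools": [{"toolSpec": {
    "name": "get_order_status",
    "description": "Look up the status of an order by ID.",
    "inputSchema": {"json": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    }},
}}]}


def run_agent_loop(client, model_id, user_text, tool_config, handlers, max_turns=5):
    """Call the model, execute requested tools, feed results back, repeat."""
    messages = [{"role": "user", "content": [{"text": user_text}]}]
    for _ in range(max_turns):
        resp = client.converse(modelId=model_id, messages=messages, toolConfig=tool_config)
        reply = resp["output"]["message"]
        messages.append(reply)
        if resp["stopReason"] != "tool_use":
            # No tool requested: the text blocks are the final answer.
            return "".join(b["text"] for b in reply["content"] if "text" in b)
        results = []
        for block in reply["content"]:
            if "toolUse" in block:
                call = block["toolUse"]
                output = handlers[call["name"]](**call["input"])
                results.append({"toolResult": {
                    "toolUseId": call["toolUseId"],
                    "content": [{"json": output}],
                }})
        # Tool results go back to the model as a user turn.
        messages.append({"role": "user", "content": results})
    raise RuntimeError("agent did not finish within max_turns")
```

The `max_turns` cap is a design choice worth keeping: without it, a model that keeps requesting tools can loop indefinitely and burn tokens. Bedrock Agents manages this loop for you, at the price of less control over each step.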

Service
Amazon Bedrock Agents

Managed orchestration of multi-step tool-using workflows. Handles prompt construction, tool selection, parameter inference, and conversation state. Integrates natively with Lambda action groups and Knowledge Bases.

Use when

The task requires multi-step actions across systems (lookup → decide → act → confirm) and you do not want to hand-roll the orchestration loop in your application code.

Pattern 4 · Fine-tuning & custom models

Fine-tuning adapts a foundation model’s weights using your data. Reach for it when prompting and RAG cannot achieve the consistency you need — specialized output format, domain vocabulary, or brand voice. Bedrock supports continued pre-training and fine-tuning for select models; SageMaker JumpStart offers more model choice and full training control.

Pattern 5 · Multi-model & ensemble

Multi-model architectures assign each model the job it does best. Three common shapes: a small classifier routes traffic to a large generation model only when needed; an embedding model handles search while a generation model handles synthesis; or multiple models propose answers and a judge selects the best. Bedrock’s unified API makes this trivial to compose: you can call Claude, Titan, and Stable Diffusion from one application without managing three SDKs.
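The routing shape can be sketched without any AWS calls by treating each model as a plain function. Everything here is hypothetical: `cheap_model` and `strong_model` would wrap real model invocations, and `needs_escalation` is whatever signal you trust (a confidence score, a classifier label, a validation failure).

```python
def cascade(prompt, cheap_model, strong_model, needs_escalation):
    """Try the cheap model first; escalate only when the check says so."""
    draft = cheap_model(prompt)
    if needs_escalation(prompt, draft):
        return strong_model(prompt), "strong"
    return draft, "cheap"


# Stand-in models for illustration only.
def cheap(prompt):
    return "UNSURE" if "refund" in prompt else "Reset link sent."


def strong(prompt):
    return "Detailed answer about the refund dispute."


def escalate(prompt, draft):
    return draft == "UNSURE"


print(cascade("I want to reset my password.", cheap, strong, escalate))
print(cascade("I have a refund dispute.", cheap, strong, escalate))
```

Returning the tier alongside the answer makes it easy to log the escalation rate, which is the number that tells you whether the cascade is actually saving money.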

1.4 · Cost optimization at design time

Cost is decided at the architecture stage, not at the bill stage. Three levers dominate.

  1. Right-size the model. The largest model is rarely the right model. Try the smallest capable model first, measure quality, escalate only if needed. A per-token cost differential of an order of magnitude or more between Nova Micro and Claude Opus is typical.
  2. Pick the right inference mode. On-demand pricing for variable workloads, provisioned throughput for predictable steady traffic, batch inference for offline workloads (up to 50% cheaper). Mismatched mode is the #1 cause of bill shock.
  3. Cache aggressively. Bedrock prompt caching reuses repeated system prompts at a discount. Application-level caching of identical queries can eliminate 20–60% of inference calls. Semantic caching (vector-similarity match on prior queries) extends this further.
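Application-level caching of identical queries, the second idea in lever 3, is the simplest of these to sketch. The wrapper below is an assumption-laden illustration: the cache is an in-process dict (a shared store such as ElastiCache would be the production analogue), the normalization is just lowercasing and whitespace collapsing, and `call_model` is any function that takes a prompt and returns text.

```python
import hashlib


class CachedModel:
    """Exact-match response cache in front of a model call."""

    def __init__(self, call_model):
        self.call_model = call_model
        self.cache = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt):
        # Normalize case and whitespace so trivially different prompts share a key.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def __call__(self, prompt):
        key = self._key(prompt)
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        self.cache[key] = self.call_model(prompt)
        return self.cache[key]
```

Semantic caching replaces the hash lookup with a vector-similarity search over prior queries, which catches paraphrases but adds the risk of returning an answer to a question that is only superficially similar.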

1.5 · The Well-Architected GenAI lens, in summary

AWS Well-Architected gains one dimension for GenAI: Responsible AI. It threads across the existing six pillars rather than replacing them.

Well-Architected GenAI Concerns by Pillar
Pillar | What to verify in a GenAI workload
Operational Excellence | Model versioning, prompt versioning, evaluation pipelines, automated rollback on quality regression.
Security | Data classification at ingest, KMS encryption, VPC isolation, IAM least-privilege per action group, Guardrails for input/output safety.
Reliability | Cross-region inference fallback, retry & backoff for throttling, graceful degradation when a model is unavailable.
Performance Efficiency | Right-sized models, response streaming, dynamic batching, semantic caching, parallel tool execution.
Cost Optimization | Token budgets, model cascading, batch inference, prompt caching, monitoring per-request cost.
Sustainability | Reuse of cached responses, batch over real-time when possible, choosing efficient models, regional placement.
Responsible AI | Bias evaluation, transparency & citation, opt-out, harm mitigation, fairness across user segments.
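The Reliability row's "retry & backoff for throttling" is worth seeing once in code. This is a generic sketch of exponential backoff with full jitter, not an AWS-specific implementation; the function name, the delay values, and the `is_throttle` predicate are assumptions. With boto3, the predicate would typically inspect the error code on a `botocore.exceptions.ClientError`, for example `e.response["Error"]["Code"] == "ThrottlingException"`.

```python
import random
import time


def call_with_backoff(fn, is_throttle, max_attempts=5,
                      base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Retry fn() on throttling errors with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            # Re-raise anything that is not throttling, or the final failure.
            if not is_throttle(exc) or attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # full jitter spreads out retries
```

Note that the AWS SDKs already retry some errors on their own, so a wrapper like this is mainly useful when you want explicit control over the policy or a fallback (another Region, a smaller model) after the final attempt.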

Chapter summary

Designing GenAI is choosing the right tool, the right pattern, and the right model for the constraints in front of you.

  • GenAI fit first — validate that GenAI is the right tool before designing. Not every problem needs a foundation model.
  • NFRs → AWS levers — map latency, cost, and compliance onto specific services. Compliance usually eliminates options first.
  • Five core patterns — Direct API, RAG, Agents, Fine-Tuning, Multi-Model. Almost every workload is one of these.
  • Pattern triggers — RAG when knowledge is external or fresh; Agents when the system must take action; Fine-Tuning is the last resort, not the first.
  • Cost is architectural — right-size the model, pick the right inference mode, cache. Decided at design time, not after launch.
  • Well-Architected + Responsible AI — weave the Responsible AI thread across all six pillars.

The exam rewards architecture-time decisions; it punishes ‘biggest model on every problem’.

Review Questions

Five scenario MCQs. Reveal the explanation only after you commit to an answer — the cognitive cost of guessing-then-checking is what builds exam memory.

Question 1
A healthcare company wants to build a chatbot that answers patient questions about their insurance coverage. The information lives in PDFs that are updated quarterly. The application must comply with HIPAA, and patients must only see information relevant to their specific plan. Which architecture best satisfies these requirements?
  A. Deploy a fine-tuned model on SageMaker trained on all insurance documents.
  B. Use Amazon Bedrock Knowledge Bases with OpenSearch Serverless, applying metadata filtering by plan type, with VPC endpoints and KMS encryption.
  C. Use Amazon Kendra with an S3 data source and document-level access control lists, integrated with a Bedrock foundation model for response generation.
  D. Provide all insurance documents inside the system prompt of a Bedrock foundation model.
Answer & explanation

Correct: C. Amazon Kendra provides built-in document-level ACLs that map directly to the per-patient access requirement, and its native S3 connector handles PDF ingestion. Combined with a Bedrock model for natural-language response, this pattern satisfies access control, freshness, and compliance.

Why not B? Bedrock Knowledge Bases supports metadata filtering, but document-level user-aware ACLs require additional custom logic; Kendra offers this natively. Why not A? Fine-tuning bakes data into weights and cannot enforce per-user access. Why not D? Quarterly insurance corpora exceed any context window and offer no per-user filtering.

Question 2
A financial-services firm needs to summarize 50,000 quarterly earnings reports overnight. Real-time output is not required and cost optimization is the primary concern. Which approach fits?
  A. Bedrock on-demand inference with parallel Lambda fan-out.
  B. Bedrock batch inference processing all reports as a single batch job.
  C. SageMaker endpoint with provisioned capacity.
  D. Bedrock with provisioned throughput.
Answer & explanation

Correct: B. Bedrock batch inference is purpose-built for high-volume offline jobs at a meaningful per-token discount versus synchronous inference. With no real-time requirement, every other option pays a real-time premium that batch avoids.
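For a sense of what the batch path involves: a Bedrock batch job reads a JSONL file from S3 in which each line pairs a record ID with a model input in that model's InvokeModel body format. The sketch below only builds those lines. The record-ID scheme, prompt wording, and function name are hypothetical, and you should confirm the exact input format and limits in the current AWS documentation before relying on it.

```python
import json


def to_batch_records(reports, max_tokens=512):
    """Yield one JSONL line per report for a batch inference input file."""
    for i, text in enumerate(reports, start=1):
        record = {
            "recordId": f"REPORT{i:05d}",  # hypothetical ID scheme
            "modelInput": {
                # Anthropic-style body, same shape as the InvokeModel example
                # earlier in this chapter.
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": max_tokens,
                "messages": [{
                    "role": "user",
                    "content": f"Summarize this earnings report:\n\n{text}",
                }],
            },
        }
        yield json.dumps(record)


lines = list(to_batch_records(["Q1 revenue rose 8%.", "Q2 margins fell."]))
print(len(lines))
```

You would upload the resulting file to S3 and submit it with the `create_model_invocation_job` call on the Bedrock control-plane client, then collect the outputs from the S3 location you specify.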

Question 3
A retail company wants an assistant that answers product questions, checks order status, and processes returns. It must call the order-management API and a product-catalog database. Which architecture is most suitable?
  A. An Amazon Bedrock Agent with action groups for order management and returns, plus a Knowledge Base for product information.
  B. A RAG pipeline on OpenSearch containing both product and order data.
  C. A fine-tuned model trained on product catalog and historical orders.
  D. A direct Bedrock InvokeModel call with the order API documented in the system prompt.
Answer & explanation

Correct: A. The requirement combines knowledge retrieval (product catalog) with multi-step actions (order lookup, returns) — exactly the agent pattern. Bedrock Agents support both action groups and Knowledge Bases natively in one runtime. RAG alone cannot act, fine-tuning cannot reach a live order system, and stuffing API specs into a prompt does not give the model a way to actually call them.

Question 4
A team is choosing between RAG and fine-tuning to make a foundation model produce strict JSON output matching an internal schema. Output format consistency is the only requirement; the underlying knowledge is generic. Which option is most appropriate and cost-effective?
  A. Fine-tune a small Bedrock model on thousands of input/output examples.
  B. Build a RAG pipeline that retrieves the schema and includes it in every prompt.
  C. Use the Bedrock Converse API with tool use / structured output to enforce the schema.
  D. Continue pre-training a model on schema documentation.
Answer & explanation

Correct: C. Structured output via tool use enforces the schema at decoding time without retraining or retrieval overhead. Fine-tuning is overkill for format-only adaptation; RAG injects context but does not enforce structure; continued pre-training is even heavier than fine-tuning and inappropriate here.
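A minimal sketch of the structured-output approach: define a single tool whose input schema is your target JSON schema, then force the model to call it. Assumptions to note: `client` is a `boto3.client("bedrock-runtime")` instance or a compatible stand-in, the helper name is hypothetical, and support for forcing a specific tool through `toolChoice` varies by model.

```python
def extract_with_schema(client, model_id, text, tool_name, schema):
    """Ask the model to return data as the arguments of a forced tool call."""
    tool_config = {
        "tools": [{"toolSpec": {
            "name": tool_name,
            "description": "Record the extracted fields.",
            "inputSchema": {"json": schema},
        }}],
        # Force a call to this tool instead of a free-text reply.
        "toolChoice": {"tool": {"name": tool_name}},
    }
    resp = client.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": text}]}],
        toolConfig=tool_config,
    )
    for block in resp["output"]["message"]["content"]:
        if "toolUse" in block:
            return block["toolUse"]["input"]  # already parsed into a dict
    raise ValueError("model returned no tool call")
```

It is still prudent to validate the returned dict against the schema in your own code before passing it downstream.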

Question 5
A startup is prototyping a customer-support assistant. Traffic is unpredictable, will likely be low-volume for the first six months, and the team has no capacity for ongoing infrastructure work. Which inference mode is the best starting point?
  A. Bedrock provisioned throughput with a one-month commitment.
  B. Bedrock on-demand inference.
  C. A self-managed model on a SageMaker real-time endpoint.
  D. SageMaker serverless inference with a custom container.
Answer & explanation

Correct: B. On-demand pricing matches unpredictable, low-volume workloads with zero commitment and no capacity management. Provisioned throughput requires a commitment that does not match the volume profile; SageMaker options shift operational burden onto a team that explicitly cannot absorb it.

End of Chapter 1
02 Part I · Chapter 2 · Task 1.2

Select & Configure Foundation Models

Picking a model is not a vibe check. It is a constrained optimization across capability, latency, cost, context window, modality, and deployment surface. This chapter teaches you to read a scenario, list the constraints, and converge on the one model the exam expects.
GenAI in Real Life · A frontier model on a router decision is a Ferrari in a school zone.
WRONG TIER: Claude Opus for every hop. In a Help Center chat the user writes “I want to reset my password” and the assistant is still “Thinking…” after 6.4 seconds. Opus for intent classification: cost per call $0.075 · latency p50 6,400 ms · user abandonment 31%.
RIGHT TIER: Haiku first, escalate (cascade: Haiku routes; Sonnet drafts). Same chat, same message, answered immediately: “Sure! Click the link I just sent to your email to reset your password. Need anything else?” Haiku classified and answered: cost per call $0.0008 · latency p50 480 ms · abandonment 3% · 94× cheaper.
PRINCIPLE Pick the model tier from the dominant constraint: latency here, capability elsewhere. Cascade reserves the frontier model for the questions that need it.
GenAI in Real Life — A 6-second password-reset is a 31% abandonment rate. Same product, same user, same prompt — the only thing that changed was which model answered.

2.1 · The Bedrock model landscape

Amazon Bedrock is a managed surface for foundation models from Amazon (Nova and Titan), Anthropic, Meta, Mistral, Cohere, AI21 Labs, DeepSeek, Stability AI, Writer, and Luma. You do not provision GPUs. You do not manage weights. You call an API. The exam expects you to know each family’s sweet spot well enough to pick from four choices under time pressure.

Bedrock model families — what each is good at
Family | Sweet spot | Watch out for
Anthropic Claude (Sonnet · Haiku · Opus) | Long-context reasoning, tool use, instruction following, code, structured output. Default choice for agents and complex RAG. | Sonnet and Opus cost more per token than small-tier models; Opus tier is slow.
Amazon Nova (Micro · Lite · Pro · Premier) | AWS-native, lowest cost for tiered workloads, native multimodal (Lite/Pro), tight Bedrock integration. | Newer ecosystem; some advanced reasoning still trails Claude/GPT-class peers.
Amazon Titan (Text · Embeddings · Image) | Embeddings (amazon.titan-embed-text-v2:0) are a default for RAG on AWS. Image generation when staying in-house. | For new generation work, AWS is positioning the Nova family alongside Titan Text.
Meta Llama | Open-weight reasoning; fits when the customer wants portability or self-hosting on SageMaker. | Capabilities lag closed-weight peers at the same parameter count.
Cohere Command / Embed | Multilingual embeddings, retrieval, classification, RAG with strong non-English support. | Smaller community for tool-use patterns.
Mistral | Cost-effective European hosting, fast small-model inference, function calling. | Smaller context windows on lower tiers.
Stability AI | Image generation (SD3, SDXL) when output must be stylistically tunable. | Not a chat model; do not pick it for text tasks.

2.2 · The selection trade-off — one mental model, four axes

Every model selection question collapses into four axes. The exam phrases scenarios so that one axis dominates — identify it, and the answer falls out.

Figure 2.1 · Mental Model
The four-axis model selection radar
When two axes pull in opposite directions, the dominant constraint — latency, cost, capability, or context — decides.
Axes (all four are framed as goods, so “more is better” everywhere; a wider polygon on an axis means stronger on that axis):
  • CAPABILITY: reasoning depth
  • CONTEXT: window size
  • SPEED: low latency
  • AFFORDABILITY: low $/token
Tiers:
  • SMALL (speed · affordability): Haiku · Nova Micro/Lite · Llama 8B. High volume, low latency, classification, routing, drafts.
  • MID (balanced): Sonnet · Nova Pro · Llama 70B. RAG, agents, multi-step reasoning, most production work.
  • FRONTIER (capability · context): Opus · Nova Premier · Llama 405B. Hard reasoning, long-context analysis, last 5% of quality.
EXAM HEURISTIC Read the scenario. Find the dominant constraint: “real-time chat” → small. “500-page contract” → frontier. “reasoning over many tools” → mid+. “classify” → small.
PRO TIP Default to mid tier. Step down only when latency or cost forces it; step up only when a measurable quality gap remains after prompt engineering and retrieval are dialed in.
Figure 2.1 · Model tier trade-off radar. All four axes are upward goods — capability and context measure raw power; speed and affordability invert latency and cost so wider polygon always means “better.” Small models dominate the speed / affordability side, frontier models reach further on capability / context, mid tier sits in the middle — usually the right starting point.
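The radar's rule of thumb can be written down as a small lookup. This is only a sketch of the heuristic, not an AWS API: the function pick_tier and the constraint labels are hypothetical names chosen for illustration.

```python
# Hypothetical mapping from a scenario's dominant constraint to a model tier.
TIER_BY_CONSTRAINT = {
    "latency": "small",        # real-time chat, routing, classification
    "cost": "small",           # high volume at low $/token
    "capability": "frontier",  # hard reasoning, last 5% of quality
    "context": "frontier",     # very long documents
}


def pick_tier(dominant_constraint=None):
    """Return the tier for the dominant constraint; default to mid tier.

    Passing None models a scenario where no single axis dominates.
    """
    if dominant_constraint is None:
        return "mid"
    return TIER_BY_CONSTRAINT[dominant_constraint]
```

An unknown label raises KeyError on purpose: if the constraint cannot be named, the scenario has not been read closely enough yet.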

2.3 · Inference parameters that actually matter

Bedrock’s Converse API exposes the same handful of inference parameters across every provider. Most candidates can name them. Few can predict what changes when you turn each one. The exam loves this gap.

Inference parameters — what they do, when to change them
Parameter | Effect | When to change it
temperature | Scales the logits before sampling. Low (0–0.3) → deterministic, repeatable. High (0.7–1.0) → creative, variable. | Use 0–0.2 for extraction, classification, function-calling. Use 0.7+ for brainstorming, marketing copy, creative drafts.
top_p (topP in Converse) | Nucleus sampling: keep the smallest set of most-likely tokens whose cumulative probability reaches p. Low p = narrower vocabulary. | Combine with low temperature for tight, factual output. Rarely tune both at once; pick one knob.
top_k | Keep only the top k next-token candidates. Hard cap on diversity. Not part of Converse inferenceConfig; pass it through additionalModelRequestFields on models that support it. | Useful for very narrow domains (SQL, JSON only). Most providers default to a sensible value; leave it alone unless you have a measured reason.
max_tokens (maxTokens in Converse) | Hard cap on response length. Generation stops at max_tokens or a stop sequence, whichever comes first. | Always set it. It caps cost and stops runaway loops. Tune to the realistic 95th percentile of your task.
stopSequences | List of strings that, if emitted, terminate generation immediately. | Use for structured output (for example "\n\n") and when chaining prompts.
system | Persistent role / persona / rules above the conversation. | Always set it. The system prompt is where guardrails, tone, and output schema live.
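To make the table concrete, here is a toy illustration of how temperature, top_k, and top_p reshape a next-token distribution. It is a sketch of the textbook definitions in plain Python, not Bedrock's internal implementation, and the logits are made-up numbers.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing by temperature (must be > 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most likely tokens, zero the rest, then renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


logits = [2.0, 1.0, 0.5, -1.0]  # toy next-token scores, not from a real model
cold = softmax_with_temperature(logits, 0.2)  # mass piles onto the top token
warm = softmax_with_temperature(logits, 1.0)  # mass is spread more evenly
```

Temperature 0 would divide by zero in this sketch; in practice it is treated as greedy decoding, which is what top_k_filter(warm, 1) reproduces.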

Configuring inference with the Converse API

Deterministic extraction · Low temp · structured
import json

import boto3

bedrock = boto3.client("bedrock-runtime")

# Hypothetical sample input; in practice, pass the raw ticket text.
ticket_text = "Order 1042 arrived damaged. Please refund my card."

resp = bedrock.converse(
    modelId="anthropic.claude-3-5-haiku-20241022-v1:0",
    system=[{"text":
        "Extract entities as JSON. "
        "Output ONLY valid JSON, no prose."}],
    messages=[{
        "role": "user",
        "content": [{"text": ticket_text}],
    }],
    inferenceConfig={
        "temperature": 0.0,    # tight
        "topP": 0.1,          # narrow vocab
        "maxTokens": 512,     # hard ceiling
        "stopSequences": ["\n\n"],
    },
)
# The first content block carries the model's text; parse it as JSON.
data = json.loads(resp["output"]["message"]
                  ["content"][0]["text"])
Use for entity extraction, classification, JSON output, function-calling, anything that must round-trip cleanly into downstream code.
Creative drafting · Higher temp · open-ended
import boto3

bedrock = boto3.client("bedrock-runtime")

resp = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    system=[{"text":
        "You are a senior product marketer. "
        "Write in a direct, energetic voice."}],
    messages=[{
        "role": "user",
        "content": [{"text":
            "Draft 3 launch taglines for a "
            "developer-focused vector database."}],
    }],
    inferenceConfig={
        "temperature": 0.8,    # exploratory
        "topP": 0.95,
        "maxTokens": 800,
    },
)
print(resp["output"]["message"]
     ["content"][0]["text"])
Use for ideation, copywriting, multiple-candidate generation. Pair with sampling: call n times and rank with a separate evaluator.
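The "call n times and rank" advice can be sketched as two small helpers. Both function names are hypothetical, the client is any object with a Converse-style converse method (a boto3 bedrock-runtime client in practice), and the scoring function stands in for whatever separate evaluator you use.

```python
def draft_candidates(client, model_id, prompt, n=3, temperature=0.8):
    """Call Converse n times at a high temperature and collect each reply's text."""
    drafts = []
    for _ in range(n):
        resp = client.converse(
            modelId=model_id,
            messages=[{"role": "user", "content": [{"text": prompt}]}],
            inferenceConfig={"temperature": temperature, "maxTokens": 200},
        )
        drafts.append(resp["output"]["message"]["content"][0]["text"])
    return drafts


def rank(drafts, score):
    """Sort drafts best-first using a caller-supplied scoring function."""
    return sorted(drafts, key=score, reverse=True)
```

The scorer can be as simple as a length or keyword check while prototyping, or a second model call that grades each draft against a rubric.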

2.4 · Inference modes — on-demand, provisioned, batch

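Once you have picked a model, you pick how Bedrock serves it. The three modes map cleanly to three traffic shapes, and the exam will give you the traffic shape and ask which mode fits. Cross-region inference, the fourth row in the table below, is a routing option layered on on-demand rather than a separate mode.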

Bedrock inference modes — matching workload to mode
Mode | Best for | Pricing | Watch out for
On-demand | Unpredictable, low-to-moderate volume; prototyping; bursty production traffic. | Pay per input + output token. No commitment. | Subject to account-level token-per-minute (TPM) and request-per-minute (RPM) quotas; throttling under bursts.
Provisioned throughput | Steady, high-volume production; latency-sensitive workloads needing capacity guarantees; custom (fine-tuned) models. | Hourly commitment per “model unit.” 1-month or 6-month terms. | You pay for the unit whether you use it or not. Wrong for spiky traffic.
Batch inference | Offline scoring, bulk summarization, embedding back-catalogs, eval datasets. | ~50% discount on input + output tokens. | Async, hours-scale latency. Not for real-time. Inputs/outputs are S3 files, not API responses.
Cross-region inference | Production workloads that need higher effective throughput than a single region’s quota allows. | Same as on-demand; routing happens transparently. | Data may transit additional regions; check residency rules before enabling.
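The pricing column turns into simple arithmetic. The sketch below compares the three modes for one month; every price, rate, and token volume is a made-up placeholder, so substitute current figures from the Bedrock pricing page before drawing conclusions.

```python
def on_demand_cost(input_tokens, output_tokens, in_price_per_1k, out_price_per_1k):
    """Token-metered cost: input and output tokens are priced separately."""
    return (input_tokens / 1000) * in_price_per_1k + (output_tokens / 1000) * out_price_per_1k


def batch_cost(input_tokens, output_tokens, in_price_per_1k, out_price_per_1k, discount=0.5):
    """Batch applies a discount (roughly 50% per the table) to the on-demand token cost."""
    base = on_demand_cost(input_tokens, output_tokens, in_price_per_1k, out_price_per_1k)
    return base * (1 - discount)


def provisioned_cost(model_units, hourly_rate, hours):
    """Provisioned throughput bills per model unit per hour, used or not."""
    return model_units * hourly_rate * hours


# Hypothetical monthly workload and prices, for illustration only.
IN_TOKENS, OUT_TOKENS = 3_000_000_000, 300_000_000
IN_PRICE, OUT_PRICE = 0.003, 0.015  # assumed $ per 1,000 tokens
costs = {
    "on_demand": on_demand_cost(IN_TOKENS, OUT_TOKENS, IN_PRICE, OUT_PRICE),
    "batch": batch_cost(IN_TOKENS, OUT_TOKENS, IN_PRICE, OUT_PRICE),
    "provisioned": provisioned_cost(model_units=1, hourly_rate=20.0, hours=730),
}
cheapest = min(costs, key=costs.get)
```

With these placeholder numbers batch comes out cheapest, but the comparison ignores latency needs and whether a single model unit could actually carry the load. Cost is only one of the columns in the table.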

2.5 · Squeezing cost without losing capability

Once a model is in production, three levers move the needle far more than swapping providers: prompt caching, model cascading, and distillation.

Figure 2.2 · Mental Model
The cost-optimization escalation ladder
Climb only as high as your budget demands. Each rung up adds engineering effort and operational surface; each delivers larger compounding savings.
The ladder runs from FREE (bottom rung) to MORE EFFORT (top rung).
  • RUNG 1 · Prompt caching (START HERE). Cache long repeated prefixes (system prompts, retrieved context, few-shot examples). Effort: flag a cachePoint · Savings: 60–90% of input tokens · Risk: zero quality change.
  • RUNG 2 · Model cascading (router pattern). Try the cheap model first; on low confidence, retry on a larger one (Haiku → Sonnet → Opus, cascading only on low confidence). Effort: confidence check + routing · Savings: 50–80% if >75% terminates cheap · Risk: must measure escalation rate; runaway escalations erode the savings.
  • RUNG 3 · Distillation & fine-tuning (LAST RESORT). Collect (input, frontier-output) pairs. Fine-tune a small model via Bedrock Custom or SageMaker. Effort: dataset curation + training run + ongoing eval pipeline · Savings: 70–95% per inference · Risk: model drift over time; needs versioning, regression tests, and a re-training cadence.
EXAM HEURISTIC “Reduce cost without losing quality” → caching. “Most traffic is simple, some is hard” → cascading. “Labeled data, want a small in-house model” → distillation.
Figure 2.2 · Cost-optimization escalation ladder. Three levers, ordered by effort. Climb the ladder; do not skip rungs. Caching costs nothing in quality and is the right first move; cascading buys 50–80% on top; distillation is the most expensive intervention and the most permanent one.

Prompt caching

On supported models, Bedrock can cache long, repeated prompt prefixes (system instructions, retrieved-context windows, few-shot examples). A cache hit bills those tokens at a steep discount. For most RAG workloads, that is 60–90% off input tokens at zero quality cost. Mark cache breakpoints explicitly via the cachePoint content block in Converse. A hit requires the prefix to match exactly, so put the stable content before the breakpoint and the per-request content after it.
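A minimal sketch of marking a cache breakpoint after a long system prompt. The helper names are hypothetical. The usage field names (cacheReadInputTokens, cacheWriteInputTokens) and the assumption that inputTokens counts only uncached input should be checked against the current Converse response reference for your model.

```python
def build_cached_request(model_id, system_prompt, user_text):
    """Build Converse keyword arguments with a cache breakpoint after the system prompt.

    Everything before the cachePoint block is the cacheable prefix;
    the user message after it changes per request.
    """
    return {
        "modelId": model_id,
        "system": [
            {"text": system_prompt},
            {"cachePoint": {"type": "default"}},
        ],
        "messages": [{"role": "user", "content": [{"text": user_text}]}],
        "inferenceConfig": {"maxTokens": 512, "temperature": 0.0},
    }


def cache_hit_ratio(usage):
    """Fraction of input tokens served from cache, given a Converse usage dict."""
    read = usage.get("cacheReadInputTokens", 0)
    written = usage.get("cacheWriteInputTokens", 0)
    fresh = usage.get("inputTokens", 0)
    total = read + written + fresh
    return read / total if total else 0.0
```

In use, pass the result to client.converse(**request) and log cache_hit_ratio(resp["usage"]); a ratio that stays near zero usually means the prefix is changing between calls.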

Model cascading (router pattern)

Send every request first to the cheapest model that might succeed. If a confidence check fails (low logprob, schema-validation error, explicit “I am not sure”), retry on a larger model. A common pattern: Haiku → Sonnet → Opus, with ~80% of traffic terminating at Haiku.
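A sketch of the cascade, using schema validation as the confidence check. The function names are hypothetical, and each model is represented by a plain callable so the routing logic can be read and tested without AWS access.

```python
import json


def is_valid(text, required_keys):
    """Confidence check: the output parses as a JSON object with every required key."""
    try:
        data = json.loads(text)
    except (TypeError, ValueError):
        return False
    return isinstance(data, dict) and all(key in data for key in required_keys)


def cascade(prompt, models, required_keys):
    """Try each (name, call) pair from cheapest to largest.

    Returns (name, output) for the first model whose output passes the check.
    If none passes, returns the largest model's output so the caller can decide.
    """
    last_output = None
    for name, call in models:
        last_output = call(prompt)
        if is_valid(last_output, required_keys):
            return name, last_output
    return models[-1][0], last_output
```

Logging the returned name on every request gives the escalation rate directly, which is the number that decides whether the cascade is actually saving money.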

Distillation & fine-tuning for a smaller model

Frontier model meets quality, cost does not? Distill. Collect (input, frontier-output) pairs, then fine-tune a smaller model via Bedrock Custom Models or SageMaker. You trade one-time training cost for ongoing inference savings.
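The data-collection step can be sketched as turning (input, frontier-output) pairs into JSONL training records. The prompt/completion field names are an assumption: the required training schema differs by base model, so check the customization guide for the model you plan to tune. The function name to_jsonl is hypothetical.

```python
import json


def to_jsonl(pairs):
    """Serialize (input, frontier_output) pairs as one JSON object per line.

    Pairs with an empty side are skipped, since they teach the student model nothing.
    """
    lines = []
    for prompt, completion in pairs:
        if not prompt.strip() or not completion.strip():
            continue
        lines.append(json.dumps({"prompt": prompt, "completion": completion}))
    return "\n".join(lines)
```

The resulting text would be written to S3 as the training file; holding back a slice of pairs as a fixed regression set makes later re-training runs comparable.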

Chapter summary

Model selection on AWS is a two-axis choice: family from the workload, tier from the dominant constraint.

  • Two-axis selection — family from workload type; tier from dominant constraint (latency, cost, capability, or context).
  • Mid-tier default — Sonnet / Nova Pro / Llama 70B. Step down or up only when a measurable signal forces it.
  • Platform — Bedrock for serverless API access; SageMaker JumpStart when you need full control over the endpoint.
  • Inference parameters — set temperature, maxTokens, and system on every call. Tune top_p only when low temperature alone is not tight enough.
  • Throughput modes — on-demand for spiky; provisioned for steady-and-large; batch for offline. Cross-region profiles lift the ceiling without re-architecting.
  • Cost optimization order — prompt caching → model cascading → distillation. Stop at the first rung that meets your budget.

The exam rewards picking the smallest model that meets the bar; it punishes ‘always Opus’.

Review Questions

Five scenario MCQs. Reveal the explanation only after you commit to an answer — the cognitive cost of guessing-then-checking is what builds exam memory.

Question 1
A company needs a document-understanding system that processes invoices containing both text and tables. The invoices are scanned images in various formats. The system must extract structured data including vendor name, invoice number, line items, and totals. Which approach is most appropriate?
  1. Use Amazon Textract to extract text and tables from the scanned documents, then use a Bedrock foundation model to structure the extracted data.
  2. Use a multimodal foundation model through Bedrock to directly process the scanned images and extract structured data.
  3. Train a custom model on SageMaker using labeled invoice samples.
  4. Use Amazon Comprehend to extract entities from scanned invoices.

Correct: A. Textract is purpose-built for OCR + table extraction on scanned documents and outperforms general multimodal models on precise structured extraction. Pairing it with a Bedrock FM to format the structured output is the canonical pipeline. (B) works but is less reliable than a specialist OCR. (C) needs labeled data and training effort that is unjustified when Textract exists. (D) cannot process images directly.

Question 2
An application requires a foundation model that produces deterministic outputs for a classification task. The same input must produce the same label across multiple requests. Which configuration is most important?
  1. Set temperature to 1.0 and Top-P to 1.0 for maximum consistency.
  2. Set temperature to 0 and use a stop sequence after the classification label.
  3. Use fine-tuning to ensure consistent outputs.
  4. Set Top-K to 1 and maximum tokens to 1000.

Correct: B. Temperature 0 makes decoding effectively deterministic: the model always takes the most likely next token. A stop sequence terminates generation right after the label so trailing tokens cannot reintroduce variability. (A) maximizes randomness, the opposite of the requirement. (C) may improve quality but does not guarantee determinism. (D) Top-K of 1 is also greedy, but Top-K is not available on every model, and a 1,000-token ceiling lets the output run on past the label.

Question 3
A company wants to adapt a foundation model to generate customer-support responses in their brand voice using product-specific terminology. They have 500 examples of ideal responses and a limited compute budget. Which customization approach should they try first?
  1. Continued pre-training on a large corpus of company documentation.
  2. Fine-tuning using the 500 example responses through Amazon Bedrock.
  3. Parameter-efficient fine-tuning (LoRA) on Amazon SageMaker.
  4. Prompt engineering with carefully selected few-shot examples drawn from the 500 responses.

Correct: D, then B if insufficient. Start with prompt engineering — cheapest, fastest, often sufficient for capturing voice and terminology. If quality plateaus, escalate to Bedrock fine-tuning with the 500 examples. (A) requires far more data and compute than the budget permits. (C) saves compute over full fine-tuning, but adds SageMaker complexity. Do not pay that price until prompt engineering has visibly failed.

Question 4
A retail bank deploys an internal assistant that summarizes 200-page loan files for underwriters. Volume is steady at ~4,000 documents per day, and the same input must produce the same summary on re-run for audit. Cost is the second concern after auditability. Which configuration best fits?
  1. Anthropic Claude Opus on on-demand inference with temperature=0.7, results stored in S3.
  2. Anthropic Claude Sonnet via batch inference, temperature=0, prompt caching enabled, batch outputs versioned in S3 and (input-hash, output) cached in DynamoDB.
  3. Amazon Nova Micro on provisioned throughput with a 6-month commitment.
  4. A self-managed Llama 405B endpoint on SageMaker with autoscaling.

Correct: B. Steady, predictable, non-real-time volume is the textbook batch-inference shape: a ~50% token discount in exchange for asynchronous, S3-based delivery. temperature=0 plus a hash-keyed cache in DynamoDB gives replayable output for audit. Mid-tier Sonnet has the reasoning depth for 200-page documents at a fraction of Opus cost. (A) over-specs and uses creative temperature for an extraction task. (C) under-sizes capability. (D) imposes operational burden the scenario does not justify.
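The (input-hash, output) cache from option B can be sketched as follows. A plain dict stands in for the DynamoDB table, and the function names and key layout are hypothetical.

```python
import hashlib
import json


def input_hash(model_id, prompt, inference_config):
    """Stable SHA-256 key over everything that can change the output."""
    payload = json.dumps(
        {"model": model_id, "prompt": prompt, "config": inference_config},
        sort_keys=True,  # key order must not change the hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def summarize_with_replay(store, model_id, prompt, inference_config, generate):
    """Return the stored summary for this exact input, generating it only once."""
    key = input_hash(model_id, prompt, inference_config)
    if key in store:
        return store[key]
    output = generate(prompt)
    store[key] = output
    return output
```

Including the model ID and inference settings in the hash means a model upgrade or a parameter change produces a new key instead of silently overwriting the audited output.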

Question 5
A team is building a real-time customer-facing chatbot. P95 latency must stay under 1.5 seconds for short conversational replies. Cost matters but is secondary. Which Bedrock configuration is the strongest first choice?
  1. Anthropic Claude Opus on provisioned throughput.
  2. Amazon Nova Pro via batch inference with prompt caching.
  3. Anthropic Claude Haiku (or Nova Lite) on on-demand inference, with maxTokens capped and a tight system prompt.
  4. A SageMaker JumpStart Llama 70B endpoint with auto-scaling.

Correct: C. Latency-dominated workloads point to small-tier models (Haiku, Nova Lite) served on-demand, which needs no capacity commitment or endpoint to manage. Capping maxTokens trims tail latency. (A) Opus on provisioned throughput is high-capacity but high-latency per token; over-spec’d for short replies. (B) batch is offline; it cannot meet a 1.5s SLA at all. (D) JumpStart adds endpoint operations the scenario does not justify, and 70B is overkill for short conversational replies.

End of Chapter 2
Demo edition

You’ve reached the demo’s edge.

The Field Guide doesn’t stop here — eighteen more chapters carry the same treatment across every AIP-C01 domain.

If the first two chapters earned the time you spent on them, the rest of the book is built the same way: decision-oriented, service-by-service, grounded in the exam’s twenty task statements. Below is what each format gets you and where to pick one up.

How the full edition continues

  • Part I — Foundation Models (Chapters 1–6): you’ve seen 2 of 6. The remaining four cover Data Pipelines, Vector Stores, Retrieval (RAG), and Prompt Engineering.
  • Part II — Implementation (Chapters 7–11): Bedrock vs SageMaker selection, Knowledge Bases, Agents, model evaluation, and deployment patterns.
  • Part III — Security & Governance (Chapters 12–15): IAM scoping, Guardrails, PII redaction, and compliance.
  • Part IV — Optimization (Chapters 16–18): cost, performance, monitoring.
  • Part V — Evaluation & Troubleshooting (Chapters 19–20): metrics, drift, incident response.
  • Back matter: 9-week study plan, glossary, exam-day cheat sheets.

Pick the format that fits

The full edition is available in four formats
Format | Best for | Delivery | Price
Digital HTML | Reading on desktop or tablet: same experience as this demo, with all 20 chapters and back matter. | Single self-contained .html file. No DRM. Unlimited devices. | $29
PDF | Offline study, print-friendly, annotation in any PDF reader. | Letter-size PDF with page numbers, running headers, and recto chapter starts. | $29
Kindle | Reading on Kindle devices or the Kindle app (reflowable text). | EPUB delivered through Amazon Kindle Direct. | $19.99
Paperback | Physical reference: 6×9″ trim, perfect-bound, ships globally via Amazon. | Printed via Amazon KDP; same content as the digital editions. | $44.99

Was the demo useful?

If a service comparison felt thin, a decision table missed a corner case, or you spotted a fact that needs updating — write to press@minecloudcraftpress.com. Field reports from candidates studying for the exam are the only thing that keeps this guide honest. Replies usually land within 48 hours.

MineCloudCraft Press is the publishing arm of MineCloudCraft, an independent practice covering consultancy, training, and mentoring for teams building production AI on AWS.

End of Demo edition