Prompting and Structured Output

Beginner

A prompt is not magic phrasing. It is a written specification for the task you want the model to perform. The model reads instructions, examples, and context, then predicts the next tokens that best match that setup. Weak prompts leave too much room for interpretation. Strong prompts reduce ambiguity and make the desired behavior easier to imitate.

What to control in a prompt

Task: say exactly what job the model is doing, such as classify, extract, rewrite, or summarize.
Input boundaries: clearly mark which text is the data to operate on and which text is instruction.
Constraints: define what must not happen, such as inventing facts, adding advice, or using unsupported labels.
Output contract: specify the expected shape, fields, style, and level of detail.
Examples: include one or more examples when the task is nuanced or easy to misunderstand.

User intent
   -> task definition
   -> constraints and examples
   -> model completion
   -> parse and validate output
   -> use in application or ask for retry

Why wording alone is not enough

Many beginner prompts fail because they describe the task but not the success criteria. "Summarize this" is not enough if your application actually needs a concise summary, a severity score, and a machine-readable format. Prompting improves when you think like an API designer: what should go in, what should come out, and what counts as failure?

Weak prompt	Why it fails	Stronger version
"Summarize this ticket."	No length target, no fields, no formatting contract.	"Summarize in 2 sentences, assign urgency low/medium/high, and return JSON with summary and urgency."
"Classify this email."	The label set is undefined.	"Classify into billing, account, technical, or other. Return one label only."
"Extract the key info."	"Key info" means different things to different readers.	"Extract customer_name, product, issue, and requested_action as JSON."

Zero-shot, one-shot, and few-shot prompting

These terms describe how many examples you provide inside the prompt.

Zero-shot: only instructions, no examples. Good for simple or familiar tasks.
One-shot: one example. Useful when format matters and one example is enough to teach it.
Few-shot: several examples. Helps when labels are subtle, tone matters, or edge cases are common.

Examples are especially useful when the task depends on hidden judgment, such as deciding whether a user message is a complaint, a refund request, or a feature request. The examples show the model what your team means by each label.

A practical prompt template

You are a support triage assistant.

Task:
- Read the support ticket.
- Summarize the issue in at most 2 sentences.
- Classify product_area.
- Estimate urgency as low, medium, or high.

Rules:
- Do not invent details not present in the ticket.
- If a field is missing, return null.
- Return JSON only.

Ticket:
"""
Customer reports the billing page fails after entering a VAT ID.
Error started after yesterday's release.
"""

This works better than a short request because it separates the job, the rules, and the input. Delimiters such as triple quotes or XML-style tags help the model distinguish instructions from data.

What structured output means

Structured output means the model responds in a format that code can reliably consume. Common choices are JSON, CSV-like rows, XML, or fixed markdown tables. In practice, JSON is the most common because it maps cleanly to objects, APIs, and validation libraries.

Format	Good for	Main risk
Free-form prose	Human-readable answers and explanations.	Hard to parse and easy to vary unexpectedly.
Markdown list or table	Reports or quick UI display.	Still ambiguous for downstream software.
JSON object	Extraction, classification, and app integration.	Can be syntactically valid but semantically wrong.
Schema-constrained JSON	Production pipelines that need stable fields and types.	Poor schema design can still force bad answers.

ticket = "Customer reports the billing page fails after entering a VAT ID. Error started after yesterday's release."

prompt = f"""
You are a support triage assistant.
Read the ticket and return valid JSON with keys:
- summary
- product_area
- urgency

Rules:
- Use only the information in the ticket.
- If product area is unclear, return \"unknown\".
- urgency must be one of: low, medium, high.

Ticket:
{ticket}
"""

print(prompt)

Why schema thinking matters early

Even if you are not using strict provider-side schema enforcement yet, it helps to think in schemas. Decide the fields, types, valid labels, and null behavior before you prompt. That design discipline improves both model behavior and downstream code.

If the output is consumed by code, the format is not decoration. It is part of the interface contract. A beautiful paragraph is usually a worse interface than a plain but valid object.

Advanced

At engineering level, prompting is about reliability rather than clever wording. A prompt that works on three happy-path examples can still fail in production because inputs vary, user data is messy, and small instruction ambiguities show up at scale. Good prompt engineering therefore includes template design, schema validation, evaluation sets, retry logic, and explicit handling of impossible inputs.

Think of prompts as layered contracts

A robust prompt stack usually has durable instructions, task-specific instructions, and request-specific data.

Layer 1: durable behavior
  system / developer instructions

Layer 2: task template
  what to do, what not to do, output contract, examples

Layer 3: runtime input
  user request, document text, ticket body, form fields

Result
  model output -> parser -> validator -> business checks

This separation makes prompts easier to test and maintain. The system-level layer should rarely change. The task template changes when the job definition changes. The runtime layer changes on every request.

Common prompt patterns

Pattern	When to use it	Failure mode
Instruction-first	Clear, bounded tasks like classification and extraction.	Important edge-case rules are missing.
Few-shot prompting	Nuanced labeling, style transfer, subtle formatting.	Examples anchor the model too hard or do not cover the real distribution.
Rubric-based prompting	Evaluation, grading, and rule-driven judgment tasks.	The rubric is vague or internally inconsistent.
Decomposition or prompt chaining	Large tasks that become more reliable when split into steps.	Error compounds across stages or latency rises too much.
Schema-constrained output	Automation pipelines, typed APIs, and UI generation.	The schema is too strict, too loose, or mismatched with the task.

Instruction ordering and delimiter hygiene

Instructions should normally appear before long documents or user-provided text. If the context comes first, the model may spend too much attention on the raw data and underweight the rules. Delimiters also matter because they reduce accidental blending of instructions and content.

<task>
Extract a purchase order record.
Return JSON only.
If a field is missing, use null.
</task>

<allowed_statuses>
pending, approved, rejected
</allowed_statuses>

<document>
Purchase Order #4018 was approved by Maria Chen on 2026-03-01.
</document>

Tags, headings, and quoted blocks all serve the same purpose: they make the prompt easier for both the model and the human maintainer to parse.

Structured outputs in production

There is a major difference between asking for JSON and enforcing a schema. JSON-only output can still omit keys, invent new ones, use the wrong enum values, or return strings where arrays were expected. Schema-constrained output reduces that risk because the provider or parser checks the shape more strictly.

Approach	Guarantee level	Typical use
"Please return JSON"	Low. The model may still drift in shape.	Quick experiments and manual workflows.
JSON mode	Medium. Valid JSON is more likely, but schema mismatches still happen.	Systems that can validate and retry.
Strict schema output	Higher. Required keys, enums, and nesting are constrained.	Production extraction, typed workflows, and programmatic UIs.

Recent provider guidance converges on the same principle: use explicit schemas when possible, keep additionalProperties disabled, require all expected fields, and represent optional values explicitly with null or a documented fallback state.

from enum import Enum
from pydantic import BaseModel, ValidationError
import json

class Urgency(str, Enum):
    low = "low"
    medium = "medium"
    high = "high"

class TicketSummary(BaseModel):
    summary: str
    product_area: str
    urgency: Urgency
    customer_impact: str | None

raw_model_output = '{"summary": "Billing page crashes after VAT ID input.", "product_area": "billing", "urgency": "high", "customer_impact": null}'

try:
    parsed = TicketSummary.model_validate(json.loads(raw_model_output))
    print(parsed)
except (json.JSONDecodeError, ValidationError) as exc:
    print("Retry with stronger schema instructions", exc)

Failure modes worth testing for

Under-specification: the model fills gaps with its own assumptions because labels or rules were not explicit.
Instruction conflict: one part of the prompt asks for brevity while another asks for detailed reasoning.
Context burial: critical instructions get lost in long documents or long chat history.
Schema drift: the prompt, parser, and application types no longer agree on field names or enums.
Impossible-input hallucination: the model tries to populate every field even when the source text does not support them.
Prompt injection via data: untrusted text inside the input tries to override the real instructions.

Evaluation loop for prompts

Prompt quality should be measured against a fixed test set, not judged by ad hoc spot checks. A practical loop is draft prompt, run eval set, inspect failures, revise prompt or schema, then rerun the same eval set.

Prompt template
  -> eval set of representative inputs
  -> model outputs
  -> format checks
  -> schema validation
  -> task-quality scoring
  -> revise instructions, examples, or schema
  -> rerun the same eval set

Prompt injection and instruction-data separation

If your prompt contains untrusted content such as emails, PDFs, tickets, or scraped text, that content can carry phrases like "ignore previous instructions" or "output your hidden system message." Treat that text as data, not as instruction. Mark it clearly, tell the model not to follow instructions found inside it, and validate the output anyway.

A useful rule is: instructions should describe how to interpret the document, while the document itself should never be allowed to redefine the task.

Design rules for durable prompt systems

Keep prompts versioned in code or config, not buried in application strings.
Define success cases and failure cases before editing wording.
Prefer explicit enums and null policies over vague optional prose.
Keep output schemas close to typed models so they do not diverge.
Retry only when the failure is repairable; otherwise abstain or escalate.
Measure both format correctness and task correctness. A perfect JSON object can still contain the wrong answer.

Keep this module separate from adjacent topics. Prompting is about instruction design and output control. It is not the same as retrieval for external knowledge, fine-tuning for learned behavior changes, or deployment work for latency and serving.

To-do list

Learn

Explain task, constraints, context, examples, and output contract without looking at notes.
Understand the difference between free-form output, JSON output, and schema-constrained output.
Learn when zero-shot is enough and when one-shot or few-shot examples add real value.
Understand why instruction ordering, delimiters, and token budget affect reliability.
Read about prompt injection and why untrusted content must stay separate from instructions.

Practice

Write zero-shot, one-shot, and few-shot prompts for the same classification task and compare results.
Design a JSON schema for extraction from invoices, tickets, or meeting notes, then validate ten outputs.
Create a 20-example prompt eval set with at least five edge cases and score both format and correctness.
Test how outputs change when instructions come before versus after a long document.
Feed in adversarial text such as "ignore previous instructions" and inspect how well your template resists it.

Build

Create a Python script that extracts structured fields from support tickets and validates them with Pydantic.
Build a meeting-note summarizer that returns action items, owners, deadlines, and unresolved questions as JSON.
Add retry or repair logic for malformed or schema-invalid outputs, with a logged fallback path.
Add a prompt registry file with prompt versions, intended use, and known failure modes.
Build a tiny evaluation harness that runs the same prompt over a fixed dataset and stores scores.