Prompt Engineering
for Complex Systems

25 techniques derived from building
autonomous agent architectures

7 sessions · Workshop format · Beyond "write clear instructions"

What this is not

  • "Be specific and clear" — you already know that
  • "Use examples" — obvious
  • "Set a role" — table stakes

This is systems engineering for prompts.

Patterns that emerge when prompts become load-bearing infrastructure.

Session 1

Foundations

  • Technique 11 — Constitutional Layering
  • Technique 15 — Cold-Read Design
Technique 11

Constitutional Layering

Prompts as stratified documents where later layers
amend earlier ones with explicit precedence.

Why flat prompts break

Flat prompt

  • Every sentence equally weighted
  • Conflicts unresolvable
  • Must rewrite to amend
  • No maintenance path

Layered prompt

  • Explicit precedence hierarchy
  • Later layers override earlier
  • Add amendments, don't edit
  • Model reasons about intent

Example: Layered system prompt

Prompt literal — the system prompt itself

## Layer 0 — Identity (immutable)
You are a code reviewer specializing in security.

## Layer 1 — Governance (rarely changes)
Report findings with severity: critical > high > medium > low.
Never suppress a finding to reduce noise.

## Layer 2 — Session override (supersedes Layer 1)
UPDATED 2024-03: For this codebase, suppress medium/low
findings in test files. Rationale: test noise was obscuring
critical production findings. Layer 1 still applies to src/.

The model sees the evolution. It knows WHY the override exists.
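
One way to keep the layering mechanical rather than informal is to assemble the system prompt from an ordered list of layers, each carrying its label, rationale, and an optional expiry. A minimal sketch, assuming a hypothetical Layer structure and assemble_prompt helper rather than any particular framework:

from dataclasses import dataclass
from datetime import date

@dataclass
class Layer:
    name: str                     # e.g. "Layer 2 — Session override (supersedes Layer 1)"
    body: str                     # instruction text, including its rationale
    expires: date | None = None   # None means no expiry

def assemble_prompt(layers: list[Layer], today: date) -> str:
    """Later layers override earlier ones by position; expired overrides are
    dropped, so earlier layers resume without anyone editing the originals."""
    sections = []
    for layer in layers:
        if layer.expires and today > layer.expires:
            continue  # the override has lapsed
        sections.append(f"## {layer.name}\n{layer.body}")
    return "\n\n".join(sections)

layers = [
    Layer("Layer 0 — Identity (immutable)",
          "You are a code reviewer specializing in security."),
    Layer("Layer 1 — Governance (rarely changes)",
          "Report findings with severity: critical > high > medium > low."),
    Layer("Layer 2 — Session override (supersedes Layer 1)",
          "Suppress medium/low findings in test files. Rationale: test noise "
          "was obscuring critical production findings. Layer 1 still applies to src/.",
          expires=date(2024, 6, 30)),
]
print(assemble_prompt(layers, today=date.today()))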

Exercise

A customer support chatbot. Original prompt: "Be helpful, resolve issues quickly." After complaints about over-promising refunds, amendment added: "Don't promise refunds over $50 without escalation." Three months later, holiday override: "$50 limit raised to $100 for loyalty-tier customers."

Discuss: What are the layers? What happens if you edit the original instead of amending?
How does a new instance understand the policy history?

Exercise — some valid answers

Not a key — the exercise is a discussion. These are valid directions, revealed one at a time; the list is deliberately non-exhaustive.

Q1. What are the layers?

  • Layer 1 (original): "Be helpful, resolve issues quickly." Layer 2 (amendment): "$50 refund escalation threshold." Layer 3 (override): "$100 threshold for loyalty-tier customers during holidays."
  • Each layer has a scope: the original applies universally, the amendment narrows the refund case specifically, the holiday override is time-scoped and tier-scoped.
  • The override is a layer, not an edit — it coexists with the amendment. On January 2nd, Layer 3 should expire, leaving Layer 2 intact.

Q2. What happens if you edit the original instead of amending?

  • The reasoning trail disappears — a new instance sees "$100 for loyalty customers" but doesn't know it started at $50, or why, or that the change was seasonal.
  • Silent regression risk: the next person to edit thinks $100 is the permanent policy and might remove the loyalty-tier qualifier when "cleaning up."
  • The original "be helpful, resolve quickly" framing is gone — future instances lose the governing principle that justified all the exceptions.

Q3. How does a new instance understand the policy history?

  • With amendments in place: the new instance sees each layer with its rationale and scope — it can reconstruct the intent behind each rule and the order of precedence.
  • Without amendments (edited original): the new instance has one instruction with no context on why it evolved, what exceptions apply to what cases, or which parts are permanent vs. seasonal.
  • The holiday override should carry an explicit expiry: "applies Dec 1–Jan 1; reverts to Layer 2 after." A new instance reading it mid-February then knows the current state unambiguously.
Technique 15

Cold-Read Design

Prompts that function correctly on first encounter
with zero prior context.

The cold-read test

Hand this prompt to someone who knows nothing about your project.
Can they understand what's being asked?
  • "As discussed" — cold-read failure
  • "Continue from last time" — cold-read failure
  • "The usual format" — cold-read failure

Every invocation is a fresh context window.
Even within a session, after compaction.

Before / After

Assumes context

Continue the analysis.
Focus on the issues
we identified earlier.
Use the same format.
vs

Cold-readable

Analyze auth.py for SQL
injection vulnerabilities.

Prior findings (for context):
- Line 42: unsanitized input
- Line 89: string concat in query

Output format:
[LINE]:[SEVERITY]:[DESCRIPTION]
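
Part of the cold-read test can be mechanized: scan the prompt for phrases that presuppose shared history and fail loudly before sending. A minimal sketch; the phrase list is illustrative, not exhaustive:

import re

# Phrases that presuppose shared history a fresh context window will not have.
CONTEXT_DEPENDENT = [
    r"\bas discussed\b",
    r"\bas before\b",
    r"\bthe usual\b",
    r"\blast time\b",
    r"\bcontinue (the|from)\b",
    r"\bidentified earlier\b",
    r"\bsame format\b",
]

def cold_read_failures(prompt: str) -> list[str]:
    """Return every context-dependent phrase found in the prompt."""
    lowered = prompt.lower()
    return [p for p in CONTEXT_DEPENDENT if re.search(p, lowered)]

prompt = "Continue the analysis. Focus on the issues we identified earlier."
failures = cold_read_failures(prompt)
if failures:
    print("Cold-read failures:", failures)  # rewrite the prompt before sending it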

Exercise

A shared incident response runbook. Entry says: "Use the usual escalation path. Contact the on-call as before. Apply the fix from last time." A new team member joins during an outage at 3 AM and opens this runbook.

Discuss: What fails the cold-read test? What information must be inline?
What's the operational cost of the ambiguity?

Exercise — some valid answers

Not a key — the exercise is a discussion. These are valid directions, revealed one at a time; the list is deliberately non-exhaustive.

Q1. What fails the cold-read test?

  • "Use the usual escalation path" — "usual" encodes institutional memory. A new team member at 3 AM has none of that memory.
  • "Contact the on-call as before" — there is no "before" for someone joining mid-incident. Who is on-call? How? Their phone? Slack? PagerDuty?
  • "Apply the fix from last time" — which last time? There may have been dozens. This phrase requires you to know the history to use the runbook.

Q2. What information must be inline?

  • The escalation path spelled out step by step: who to contact, in what order, by what channel, with what account credentials or access method.
  • The fix itself, not a reference to it: "run kubectl rollout undo deployment/api --namespace prod" not "apply the rollback."
  • Context for judgment calls: "if the issue is a crash loop, check the OOM killer logs first; if it's latency, check the database slow query log" — not "debug as before."

Q3. What's the operational cost of the ambiguity?

  • Minutes lost locating the "usual" escalation path during an outage translate directly into user-facing downtime and revenue loss.
  • A new engineer who can't follow the runbook will escalate to someone who can — waking a second person who wouldn't have been needed at all if the runbook were cold-readable.
  • Ambiguous runbooks accumulate: each incident where someone improvises a step creates a new undocumented procedure, making the next incident harder to handle cold.
Session 2

Failure Engineering

  • Technique 12 — Failure Mode Encoding
  • Technique 19 — Consumed vs. Done
Technique 12

Failure Mode Encoding

Describe HOW the system fails,
not just how it should succeed.

Why positive instructions fail

"Be thorough" → model can't match against own behavior

"The system fails when you list items without
processing them — mentioning ≠ addressing" → recognizable pattern

Models self-correct more reliably when they know
what the failure looks like from inside.

Example: Encoding failure modes

## How this system fails

1. **Surface-level pass**: You read all inputs, produce output
   that references each one, but don't actually synthesize.
   The output LOOKS complete but adds no insight beyond restating.

2. **Premature convergence**: You commit to a thesis in paragraph 1
   and spend the rest confirming it. Counter-evidence gets mentioned
   but not weighted.

3. **Acknowledgment-as-completion**: You write "noted" or "I'll
   address X" — and then don't. The item disappears.

For each: if you catch yourself doing this mid-response, stop,
name the failure mode, and restart that section.

Exercise

A nightly data migration pipeline. Reads from Legacy DB, transforms 50,000 records, writes to New DB. Reports "success" when it finishes without crashing. Doesn't verify row counts, data integrity, or that transformations preserved semantics.

Discuss: What are the failure modes, described from inside the system?
What does "success" actually mean? How would you encode these failures into the pipeline's own monitoring?

Exercise — some valid answers

Not a key — the exercise is a discussion. These are valid directions, revealed one at a time; the list is deliberately non-exhaustive.

Q1. What are the failure modes, described from inside the system?

  • Silent data loss: "I finished processing all 50,000 records" — but the row count dropped from 50,000 to 48,941. The pipeline never checks that input count = output count.
  • Semantic corruption: "I transformed all date fields" — but the legacy DB stored dates as MM/DD/YYYY and the transformation assumed YYYY-MM-DD, silently inverting month and day for ambiguous dates.
  • Partial write success: "I completed the migration" — but a write failure at record 31,400 left the New DB in a half-migrated state, with no rollback and no flag marking the split point.

Q2. What does "success" actually mean?

  • "Didn't crash" is the absence of one specific failure — it says nothing about data quality, completeness, or correctness. The pipeline can run to completion and produce wrong data silently.
  • A meaningful success criterion needs multiple tests: (1) output row count matches input row count; (2) a sample of transformed records is spot-checked for semantic accuracy; (3) the New DB's constraint checks all pass.
  • The current "success = no exception" is a single-point check on the outer loop — it leaves the failure surface of every inner operation unchecked.

Q3. How would you encode these failures into the pipeline's own monitoring?

  • Post-run assertions: after processing, query both DBs and compare row counts. If New DB rows ≠ Legacy DB rows, the pipeline reports "FAILED: row count mismatch" not "success."
  • Checkpoint logging: write a progress marker after each batch (e.g., every 1,000 records). If the pipeline crashes at record 31,400, the checkpoint tells you exactly where to resume rather than restart.
  • Semantic sample checks: after transformation, run a spot-check on 100 random records comparing key fields between old and new. If more than 1% differ unexpectedly, fail loudly before marking success.
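
Those post-run assertions, as a minimal sketch. The legacy_db and new_db handles and their count / all_ids / get methods are hypothetical; the point is that "success" is computed from checks, not inferred from the absence of an exception:

import random

def post_migration_checks(legacy_db, new_db, table: str, sample_size: int = 100) -> list[str]:
    """Return failure messages; only an empty list counts as success.
    legacy_db and new_db are hypothetical handles exposing count / all_ids / get."""
    failures = []

    # Check 1: input row count must equal output row count.
    legacy_count = legacy_db.count(table)
    new_count = new_db.count(table)
    if legacy_count != new_count:
        failures.append(f"row count mismatch: legacy={legacy_count}, new={new_count}")

    # Check 2: spot-check a random sample for semantic drift (e.g. inverted dates).
    sample = random.sample(legacy_db.all_ids(table), k=min(sample_size, legacy_count))
    drifted = [rid for rid in sample
               if legacy_db.get(table, rid) != new_db.get(table, rid)]
    if len(drifted) > 0.01 * len(sample):
        failures.append(f"semantic drift in {len(drifted)}/{len(sample)} sampled records")

    return failures  # the pipeline reports FAILED unless this list is empty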
Technique 19

Consumed vs. Done

Acknowledgment is not completion.
Processing input is not producing output.

The acknowledgment trap

Input: 5 requirements

Model output mentions all 5 ✓

But only 3 have concrete deliverables

2 were consumed, not done

Partial completion passes casual inspection
because all inputs are referenced.

Structural prevention

For each requirement below, produce:
- [ ] Implementation (actual code/output, not a description)
- [ ] Verification (how to confirm it works)
- [ ] Edge case (one scenario that might break it)

Requirements:
1. User can reset password via email
2. Rate limit: max 3 attempts per hour
3. Token expires after 15 minutes
4. Audit log entry on every attempt

If any checkbox is empty, the requirement is NOT done.
Do not proceed to the next requirement until all three
checkboxes for the current one are filled.

Exercise

An employee onboarding workflow with 6 steps: account creation, equipment request, team introduction, access provisioning, training assignment, mentor pairing. The system marks each step "complete" when the request is submitted — not when the outcome is delivered. A new hire has all 6 steps marked complete on day 1, but doesn't receive their laptop until day 8.

Discuss: What are the primitives? Where does consumed ≠ done?
What structure makes true completion visible?

Exercise — some valid answers

Not a key — the exercise is a discussion. These are valid directions, revealed one at a time; the list is deliberately non-exhaustive.

Q1. What are the primitives?

  • Each onboarding step has two events: requested_at (when the action was submitted) and fulfilled_at (when the outcome was delivered). The current system records only the first.
  • The primitives are request and fulfillment — two distinct states the system collapses into one. "Complete" conflates submitting an equipment request with receiving the laptop.
  • A third useful primitive: blocker — something that delays fulfillment. Knowing step 2 (equipment) is blocked on procurement backlog is actionable; knowing it's "complete" (submitted) is not.

Q2. Where does consumed ≠ done?

  • Equipment request: submitted = consumed (form filled). Done = laptop received. These can be 8 days apart. Marking "complete" at submission makes the gap invisible.
  • Access provisioning: submitted = consumed (ticket opened). Done = credentials confirmed working. A submitted ticket that's never processed looks identical to a fulfilled one.
  • Mentor pairing: submitted = consumed (name assigned). Done = first meeting happened. The new hire might have a mentor name on paper and no contact for three weeks.

Q3. What structure makes true completion visible?

  • Split each step into two state fields: requested_at and fulfilled_at. A step is "done" only when fulfilled_at is populated. The gap between them is the metric worth tracking.
  • A dashboard that shows "6 steps submitted, 2 fulfilled" on the new hire's day 1 makes the true state visible — the missing laptop is a tracked gap from day 1 rather than a surprise discovered on day 8.
  • Outcome checklist per step: equipment = "hardware received AND configured AND VPN working." Each checkbox requires a human confirmation action, not a system event. Partial completion is visible as partially-checked.
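
The requested/fulfilled split, as a minimal sketch; the step names and fields are illustrative:

from dataclasses import dataclass
from datetime import datetime

@dataclass
class OnboardingStep:
    name: str
    requested_at: datetime | None = None   # the action was submitted (consumed)
    fulfilled_at: datetime | None = None   # the outcome was delivered (done)

    @property
    def done(self) -> bool:
        return self.fulfilled_at is not None

def report(steps: list[OnboardingStep]) -> str:
    submitted = sum(s.requested_at is not None for s in steps)
    fulfilled = sum(s.done for s in steps)
    return f"{submitted} steps submitted, {fulfilled} fulfilled"

steps = [
    OnboardingStep("account creation", datetime(2024, 3, 1), datetime(2024, 3, 1)),
    OnboardingStep("equipment request", datetime(2024, 3, 1)),  # laptop not delivered yet
]
print(report(steps))  # "2 steps submitted, 1 fulfilled": the gap is visible on day 1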
Session 3

Structural Compliance

  • Technique 13 — Enforcement vs. Request
  • Technique 14 — Backlog Pressure
Technique 13

Enforcement vs. Request

Distinguish between constraints the system enforces
mechanically and those that rely on cooperation.

The compliance spectrum

Enforced (mechanical)

  • JSON schema validation
  • Output length truncation
  • Tool permission boundaries
  • Type checking on function calls

Model can't violate even if it tries.

Requested (cooperative)

  • "Keep responses concise"
  • "Don't hallucinate"
  • "Follow this format"
  • "Be objective"

Model compliance is probabilistic.

Signaling enforcement honestly

Prompt literal — signals enforcement, does not perform it

## Enforced by the runtime — stated so you know the boundary
- Output is validated against the JSON schema below; invalid
  output is rejected before it reaches the user.
- Token budget is hard-capped at 4096; output truncated at the boundary.
- tool_choice is restricted to [search, calculate, submit] — calls to
  any other tool are not possible.

## Requested of you — deviation acceptable if justified
- Prefer concise explanations (1-2 sentences per finding).
- When uncertain, include the finding but mark confidence: low.
- Group related findings under a single heading.

The first list is not asking you to comply — it is telling you what
the system does regardless. The second list is asking.

The diagnostic: for every constraint in a prompt, ask — could a deterministic mechanism enforce this? If yes, the prompt line is at best a backup, at worst a false sense of safety. Enforce it in code; keep the prompt statement only as labeled awareness.
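
A minimal sketch of moving one constraint from the requested column to the enforced column, assuming the jsonschema package; the schema and the rejection path are illustrative:

import json
from jsonschema import validate, ValidationError  # pip install jsonschema

FINDING_SCHEMA = {
    "type": "object",
    "properties": {
        "line": {"type": "integer"},
        "severity": {"enum": ["critical", "high", "medium", "low"]},
        "description": {"type": "string", "maxLength": 500},
    },
    "required": ["line", "severity", "description"],
    "additionalProperties": False,
}

def accept_model_output(raw: str) -> dict:
    """The runtime enforces the format; the prompt only announces it."""
    try:
        finding = json.loads(raw)
        validate(instance=finding, schema=FINDING_SCHEMA)
    except (json.JSONDecodeError, ValidationError) as err:
        # Invalid output never reaches the user, whatever the model intended.
        raise ValueError(f"rejected: {err}") from err
    return finding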

Exercise

An automated code review tool has these rules: (1) No function over 100 lines. (2) All public APIs must have JSDoc comments. (3) Prefer immutable data structures. (4) No console.log in production code. Engineers routinely violate rule 3 with no consequence.

Discuss: Which rules can be mechanically enforced? Which rely on cooperation?
What's the cost of claiming enforcement you don't have?

Exercise — some valid answers

Not a key — the exercise is a discussion. These are valid directions, revealed one at a time; the list is deliberately non-exhaustive.

Q1. Which rules can be mechanically enforced? Which rely on cooperation?

  • Rules 1, 2, and 4 are mechanically enforceable — line count, JSDoc presence, and a console.log ban are deterministic AST or regex checks.
  • Rule 3 — "prefer immutable data structures" — is cooperative. "Prefer" has no deterministic test; nothing can decide "this should have been immutable."
  • A single rule can also split internally: a linter enforces that a JSDoc comment exists (rule 2), not that it is accurate — presence is enforced, quality is cooperative.

Q2. What's the cost of claiming enforcement you don't have?

  • Trust erodes and spreads — engineers watch rule 3 violated freely, conclude the rules don't really matter, and under-comply with the genuinely-enforced rules too.
  • It becomes a silent-failure surface — a rule everyone believes is enforced but isn't will be violated with nothing catching it.
  • The honest fix: relabel rule 3 a preference, drop it, or give it a real check. Name what is enforced and what is requested — don't let the list bluff.
Technique 14

Backlog Pressure

Structural alternatives to "please be thorough" —
systems where incomplete output triggers re-execution.

Making completeness structural

"Be thorough" → aspirational → unreliable

vs.

Output validator → incomplete? → re-invoke → repeat

The model learns: incomplete = more work, not escape

Surface the loop to the model.
"You will be re-invoked if criteria X isn't met."

Example: Validator loop

# Pseudo-code: structural thoroughness
requirements = load_requirements()  # 7 items
gaps = []  # empty on the first pass; nothing flagged yet


while True:
    output = model.generate(prompt, requirements, gaps)

    # Validator checks each requirement
    gaps = validator.check(output, requirements)

    if not gaps:
        break  # genuinely complete

    # Feed gaps back — model sees what it missed
    prompt += f"\nINCOMPLETE. Missing: {gaps}\n"
    prompt += "Address these specifically before proceeding."

The prompt itself tells the model: "You will be re-invoked if incomplete."

Exercise

A ticket triage bot processes incoming bug reports. It tags each with severity, assigns to a team, and writes a summary. After processing 50 tickets: 8 have tags but no assignment, 3 have assignment but no summary. Bot reports "50 tickets processed."

Discuss: What are the primitives of "processed"? How would you make completeness structural?
What loop would catch the partial outputs?

Exercise — some valid answers

Not a key — the exercise is a discussion. These are valid directions, revealed one at a time; the list is deliberately non-exhaustive.

Q1. What are the primitives of "processed"?

  • "Processed" has three sub-steps: (1) severity tagged, (2) team assigned, (3) summary written. The bot counts tickets consumed (received), not tickets where all three sub-steps completed.
  • The completion predicate is: tag IS NOT NULL AND assignee IS NOT NULL AND summary IS NOT NULL. A ticket satisfying only one or two conditions is partially processed, not done.
  • The 50-ticket "processed" count conflates "I looked at each" with "I fully handled each." These are different claims; the current system makes only the weaker one.

Q2. How would you make completeness structural?

  • After processing each ticket, run a validator: validate(ticket) → missing_fields. If missing_fields is non-empty, the ticket returns to the queue. The bot cannot mark it done until the validator passes.
  • Change the output contract: instead of "N tickets processed," report "N tickets processed, M complete, K pending (missing fields listed per ticket)." The incomplete tickets are the backlog.
  • The structural guarantee is: the bot's "complete" claim is gated on the validator, not on the bot's self-assessment. Completeness is externally tested, not internally declared.

Q3. What loop would catch the partial outputs?

  • After the initial processing pass, filter for tickets where complete = false. Re-invoke the bot with only those tickets and the specific missing fields listed. The bot knows exactly what's needed.
  • The loop terminates when the validator finds zero incomplete tickets — not when the bot runs out of input. This is backlog pressure: incomplete output triggers re-execution, not a new batch.
  • An escape hatch: if a ticket is incomplete after N re-invocations, escalate to human review rather than looping forever. The loop has a stopping condition that isn't "all done."
Session 4

Agent Design

  • Technique 16 — Authority Partitioning
  • Technique 24 — Posture Calibration
  • Technique 22 — Identity vs. Task Separation
Technique 16

Authority Partitioning

Explicitly defining what the model decides
vs. what it defers on.

Two failure modes, one fix

Over-deference

  • "Shall I proceed?"
  • "Would you like me to..."
  • "I can do X if you'd like"
  • "Let me know if..."

Stalls progress. Shifts decisions to user.

Over-reach

  • Ignores stated constraints
  • Makes decisions beyond scope
  • Rewrites user's intent
  • Acts without disclosure

Loses trust. Produces unwanted output.

Fix: explicit authority boundary.

Example: Clear authority boundary

## Your authority (decide and act, no approval needed)
- Code architecture and implementation choices
- File organization and naming conventions
- Which tests to write and how to structure them
- Refactoring decisions within scope

## Ask before acting (requires human input)
- Adding new dependencies (cost/licensing implications)
- Changing public API contracts (downstream consumers)
- Deleting files (irreversible without git archaeology)

## Never (hard boundary)
- Modifying CI/CD configuration
- Touching production credentials
- Force-pushing to main

The Never tier is stated for the model's awareness — but the prompt does not enforce it. Branch protection, credential ACLs, and tool-permission scoping do. A prompt-only "Never" is a Request (Technique 13); pair it with the mechanism that actually stops it.
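
A minimal sketch of enforcing the tiers at the tool-dispatch layer rather than in the prompt; the tool names, run_tool executor, and approval queue are hypothetical:

AUTONOMOUS = {"write_file", "run_tests", "refactor"}
NEEDS_APPROVAL = {"add_dependency", "change_public_api", "delete_file"}
BLOCKED = {"edit_ci_config", "read_prod_credentials", "force_push_main"}

def dispatch(tool: str, args: dict, approval_queue) -> str:
    """The Never tier is a refusal in code, not a request in prose."""
    if tool in BLOCKED:
        raise PermissionError(f"{tool} is outside the agent's authority")
    if tool in NEEDS_APPROVAL:
        approval_queue.put({"tool": tool, "args": args})
        return f"{tool} queued for human approval"
    if tool in AUTONOMOUS:
        return run_tool(tool, args)  # run_tool is a stand-in for the real executor
    raise ValueError(f"unknown tool: {tool}")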

Exercise

A billing system where an AI agent handles subscription changes. Available actions: upgrade plan, downgrade plan, apply promo code, issue refund, change payment method, cancel subscription, waive late fee, extend trial.

Discuss: Which actions should the agent own outright? Which need human approval?
Which should be hard-blocked? What criteria are you using to partition?

Exercise — some valid answers

Not a key — the exercise is a discussion. These are valid directions, revealed one at a time; the list is deliberately non-exhaustive.

Q1. Which actions should the agent own outright?

  • Reversible, low-cost actions with no financial loss: upgrade/downgrade plan, apply a valid promo code, extend a trial. These are undoable and the cost of a mistake is low.
  • The criterion is: if the agent gets this wrong, can we fix it in under 60 seconds with no customer harm? Yes → agent owns it.
  • Change payment method is reversible in principle but should require the customer's explicit confirmation — not the agent's unilateral action. The criterion isn't just reversibility; it's also consent for financial instrument changes.

Q2. Which need human approval?

  • Refunds and waived fees: these represent direct revenue loss. Even if technically reversible (you could re-charge), the customer experience of being re-charged makes this functionally irreversible.
  • Large or repeat refunds for the same customer should escalate regardless of amount — the pattern signals potential abuse, which a human should assess.
  • Cancel subscription: irreversible in customer experience (even if re-activation is possible, the relationship is damaged). Requires human confirmation or at minimum an explicit multi-step flow.

Q3. Which should be hard-blocked? What criteria are you using to partition?

  • Hard-blocked: none of these need to be hard-blocked entirely — but the approval tier should be enforced mechanically. A refund that requires human approval should route to a queue, not just "ask" the agent to check first.
  • The criteria that work: reversibility (can we undo without customer harm?), financial impact (direct revenue loss?), legal/contractual risk (does this change a binding agreement?). All three dimensions matter.
  • The partition should be documented as part of the agent's explicit authority definition — not just implicit in the code. The agent should be able to state its own authority boundary when asked.
Technique 24

Posture Calibration

Setting the model's commitment level —
how much it should commit vs. hedge.

The hedging tax

"You might want to consider possibly using a HashMap here,
though there could be other approaches that might work
depending on your specific requirements..."

vs.

"Use a HashMap. O(1) lookup fits your access pattern.
If memory is constrained, BTreeMap trades speed for density."

Same information. Different posture. Different trust signal.

Calibrating posture explicitly

## Output posture

When confident (>80% of your responses):
- State directly. "Do X." Not "you might consider X."
- No preamble. Start with the answer, then explain.
- Commit. If you're wrong, I'll correct you.

When genuinely uncertain:
- Name the uncertainty: "I'm unsure between X and Y because Z."
- Don't hide it in hedging language — be explicit about what's unknown.
- Never say "it depends" without specifying what it depends ON.

Forbidden phrases:
- "You might want to consider..."
- "There are several approaches..."
- "It's worth noting that..."
- "Depending on your needs..."

Exercise

A technical writing assistant reviews API documentation. Engineer asks "Is this clear enough?" It responds: "It could potentially be improved by possibly considering rewording some sections that might be unclear to certain readers who may not have full context..."

Discuss: What are the primitives of posture? What's the violation?
When is hedging appropriate vs. when does it add negative value? How would you calibrate this system?

Exercise — some valid answers

Not a key — the exercise is a discussion. These are valid directions, revealed one at a time; the list is deliberately non-exhaustive.

Q1. What are the primitives of posture?

  • Commitment level: how directly does the system state its finding? "This section is unclear" vs. "It could potentially be improved."
  • Specificity: does the feedback name the exact problem? "Section 3's authentication flow diagram is missing the OAuth callback step" vs. "some sections might be unclear to certain readers."
  • Uncertainty handling: when genuinely unsure, the system should name the uncertainty explicitly rather than hiding it in hedging language across the whole response.

Q2. What's the violation?

  • The system is hedging on something it's confident about. Five hedges in one sentence ("potentially," "possibly," "might," "some," "certain") while delivering what is clearly a definite judgment: the docs need improvement.
  • The posture communicates low confidence; the substance communicates high confidence. The mismatch erodes trust — if the system is this uncertain, why is it reviewing docs at all?
  • Hedging is stealing from the reader: it makes them do the work of filtering "actually problematic" from "just being polite about it" — cognitive load that belongs to the system, not the recipient.

Q3. When is hedging appropriate vs. negative value?

  • Hedging is appropriate when genuinely uncertain: "I'm not sure if users without prior OAuth experience will follow section 3 — I'd test with someone unfamiliar." That's a real uncertainty, disclosed specifically.
  • Hedging adds negative value when used as politeness padding on something the system actually knows. "This section is unclear" is a committed, helpful finding. Wrapping it in five qualifiers degrades the signal.
  • Calibrated posture for this system: "Section 3's authentication flow is unclear — the OAuth callback step is missing from the diagram. Section 5 I'm less certain about; it may be fine for readers who already know REST conventions." Two findings, two different postures, each honest.
Technique 22

Identity vs. Task Separation

Cleanly separating "who you are"
from "what you're doing right now."

The entanglement problem

## Entangled (common)
"You are a helpful coding assistant. Analyze this Python
file for security vulnerabilities and output findings as
a markdown table with severity ratings..."

## Question: what happens when you change the task?

If identity = "helpful coding assistant who analyzes Python for security"...
asking it to write docs creates role confusion.

Clean separation

## Identity (system prompt — persistent)
You are a senior security engineer with expertise in
web applications. You are direct, precise, and commit
to your assessments. You reason from first principles.

## Task (user message — ephemeral, swappable)
Analyze auth.py for SQL injection. Output as:
| Line | Severity | Vector | Remediation |

Test: Can you swap the task without touching identity?
✓ "Write a security section for the API docs" — same identity, different task.

Exercise

A prompt says: "You are a QA engineer who writes integration tests for a Node.js REST API using Jest. Write tests for the user registration endpoint covering happy path, validation errors, and duplicate email handling." Next week, you need it to write unit tests for a Python CLI tool.

Discuss: What's identity vs. task here? What breaks when you swap?
How would you restructure for reuse?

Exercise — some valid answers

Not a key — the exercise is a discussion. These are valid directions, revealed one at a time; the list is deliberately non-exhaustive.

Q1. What's identity vs. task here?

  • Task (specific, swappable): Node.js, Jest, REST API, user registration endpoint, happy path, validation errors, duplicate email handling. All of these change when the domain changes.
  • Identity (stable across all test-writing work): a QA engineer who thinks adversarially, tests edge cases, writes failing tests that are informative, considers what the requirements imply but don't state.
  • The current prompt collapses both — "QA engineer who writes integration tests for Node.js REST APIs using Jest" makes the technology stack part of the identity, not the task.

Q2. What breaks when you swap to Python CLI tests?

  • The identity is now wrong: the system self-describes as a Node.js/Jest engineer but is asked to work in Python/pytest. It may produce test patterns from Node.js conventions even when they don't fit Python.
  • The model may resist or produce lower-quality output because its role self-model is misaligned with the task. It's been told "you are a Node.js engineer" and you're asking it to be something else.
  • More subtly: the QA reasoning approach (adversarial, edge-case focused, failing tests) transfers perfectly. The technology layer doesn't transfer at all. Conflating them loses the former when you swap the latter.

Q3. How would you restructure for reuse?

  • Identity: "You are a QA engineer. You think adversarially — your job is to find ways the code can fail. You write tests that fail informatively: a test failure should tell you exactly what went wrong and where."
  • Task: "Write integration tests for the user registration endpoint in this Node.js/Express app using Jest. Cover: happy path, validation errors (missing fields, invalid email format), duplicate email handling."
  • With this structure, you can reuse the identity unchanged when you switch to Python CLI tests — only the task section changes. The QA mindset transfers; the framework specifics don't need to.
Session 5

Integrity & Evolution

  • Technique 17 — Supersession Protocol
  • Technique 25 — Immutability Marking
  • Technique 18 — Nonprescriptive Bias
Technique 17

Supersession Protocol

Amend, never edit.
New instructions coexist with originals.

Why edits are dangerous

  • Lost reasoning: why was the original written? Gone.
  • Silent regression: future editors don't know what changed or why
  • Model confusion: edited prompts can be internally inconsistent without anyone noticing
  • No rollback: if the change was wrong, the original is lost

Supersession = git for instructions.
The model benefits from seeing the evolution.

Example: Amendment with rationale

## Rule 3 (2024-01)
Always include a confidence score (0-100) with each finding.

## Rule 3a — SUPERSEDES Rule 3 for batch mode (2024-03)
In batch mode (>50 items), omit individual confidence scores.
Instead, provide aggregate confidence at the batch level.

Rationale: individual scores on 200+ items created noise that
obscured the summary. The per-item scores are still computed
internally — they feed the aggregate. Not lost, just hidden.

Rule 3 still applies to single-item and small-batch (<50) mode.

Exercise

A frontend style guide. Version 1 (January): "All buttons use border-radius: 4px." Six months later, someone edits it to: "All buttons use border-radius: 8px." No history of why. A new developer asks: "Why 8px? The mockups from Q1 show 4px."

Discuss: What was lost? How would supersession preserve it?
What are the primitives of a style guide entry that can evolve safely?

Exercise — some valid answers

Not a key — the exercise is a discussion. These are valid directions, revealed one at a time; the list is deliberately non-exhaustive.

Q1. What was lost?

  • The original value (4px) and its rationale. Why was 4px chosen? Was it a design decision, a technical constraint, a legacy default? Without the history, no one knows if 8px is an improvement or a regression.
  • The scope of the change. Did 8px replace 4px everywhere, or only for certain button types? An in-place edit can't express "this applies to primary buttons; secondary buttons still use 4px."
  • The author and date. When did this change? Who decided? The new developer's Q1 mockups show 4px — was the style guide updated before or after those mockups? Without a timestamp, the question is unanswerable.

Q2. How would supersession preserve it?

  • The supersession entry would say: "border-radius: 8px for primary action buttons (updated 2024-06). Supersedes 4px from January 2024 for this button type only. Rationale: accessibility audit found 4px corners too subtle at small sizes. 4px still applies to icon-only buttons."
  • The original entry stays in the document. A new developer reading it sees both: the original principle (4px, the default) and the exception (8px, primary buttons, accessibility-motivated). The Q1/current discrepancy is immediately explained.
  • Future editors know the reasoning. If the next accessibility audit changes the guidance again, they can supersede the 8px entry with a new one, linking the full chain of decisions.

Q3. What are the primitives of a style guide entry that can evolve safely?

  • Value (the current rule), scope (which components it applies to), rationale (why this value, not another), date (when it was set), and supersedes (what it replaced, if anything).
  • An entry without rationale can be changed for any reason, since there's no stated reason to contradict. Rationale is what makes a rule defensible and makes future changes deliberate rather than casual.
  • Scope is the most often omitted. "All buttons use border-radius: X" looks like a universal rule. "Primary action buttons use X; icon-only buttons use Y" is a scoped rule that can evolve independently per scope.
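
Those primitives, as a minimal data sketch; the field names and the naive scope matching are illustrative:

from dataclasses import dataclass

@dataclass
class StyleRule:
    rule_id: str
    value: str               # the current rule, e.g. "border-radius: 8px"
    scope: str               # which components it applies to
    rationale: str           # why this value and not another
    date: str                # when it was set
    supersedes: str | None = None   # rule_id it amends, if any

rules = [
    StyleRule("btn-radius-1", "border-radius: 4px", "all buttons",
              "original design-system default", "2024-01"),
    StyleRule("btn-radius-2", "border-radius: 8px", "primary action buttons",
              "accessibility audit: 4px corners too subtle at small sizes",
              "2024-06", supersedes="btn-radius-1"),
]

def rule_for(component_scope: str, rules: list[StyleRule]) -> StyleRule:
    """Later entries win within their own scope; everything else falls back to
    the general rule. Superseded entries stay in the list with their rationale."""
    scoped = [r for r in rules if r.scope == component_scope]
    return scoped[-1] if scoped else [r for r in rules if r.scope == "all buttons"][-1]

print(rule_for("primary action buttons", rules).value)  # border-radius: 8px
print(rule_for("icon-only buttons", rules).value)       # border-radius: 4px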
Technique 25

Immutability Marking

Some state must never drift.
Mark it. Enforce it.

The drift problem in agentic systems

Agent modifies its own instructions

"Optimization": removes a constraint to complete current task

Future invocations inherit the weakened constraint

Identity drift — gradual, invisible, compounding

The most insidious failure: the system changes itself
to make the current task easier.

Example: Structural protection

## IMMUTABLE — Do not modify, rephrase, or reinterpret.
## Protected by: tool permission boundary (write denied)

Core identity: You are a financial auditor. Your findings
must be defensible in regulatory review. When uncertain,
disclose uncertainty — never suppress findings to simplify.

## MUTABLE — Agent workspace (read/write permitted)

Current task state:
- Items reviewed: 47/120
- Findings so far: 3 critical, 7 high
- Next batch: items 48-72

Separate mutable workspace from immutable reference material. Enforce at the tool layer.
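
A minimal sketch of enforcing that separation at the tool layer; the path layout is hypothetical:

from pathlib import Path

IMMUTABLE = (Path("prompts/core_identity.md"),)   # read-only reference material
WORKSPACE = Path("workspace")                     # agent read/write area

def agent_write(path: str, content: str) -> None:
    """The write tool itself refuses to touch immutable files, so the agent
    cannot weaken a constraint even if its reasoning argues for it."""
    target = Path(path).resolve()
    if any(target.is_relative_to(p.resolve()) for p in IMMUTABLE):
        raise PermissionError(f"{path} is marked IMMUTABLE; write denied")
    if not target.is_relative_to(WORKSPACE.resolve()):
        raise PermissionError(f"{path} is outside the agent workspace")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content)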

Exercise

A shared environment config used by 12 microservices. Contains: database connection strings, API rate limits, feature flags, logging levels, and a service discovery URL. Any engineer can edit it. Last month someone changed the service discovery URL "temporarily" during debugging and forgot to revert — causing a 4-hour outage.

Discuss: What should be immutable? What's legitimately mutable?
What enforcement mechanism would you use? What looks like it should be immutable but actually needs to change?

Exercise — some valid answers

Not a key — the exercise is a discussion. These are valid directions, revealed one at a time; the list is deliberately non-exhaustive.

Q1. What should be immutable? What's legitimately mutable?

  • Immutable: database connection strings, service discovery URL, API keys. These are infrastructure anchors — wrong values cause cascading failures across all 12 services simultaneously.
  • Legitimately mutable: feature flags and logging levels — that's their entire purpose. Rate limits are legitimately mutable too, but they should change through a process (review, test, deploy) rather than ad-hoc edits by any engineer.
  • Interesting middle case: API rate limits look mutable (we adjust them periodically) but a wrong value can cause production failures just as surely as a wrong DB connection string. The frequency of intended change doesn't determine mutability tier; the cost of unintended change does.

Q2. What enforcement mechanism would you use?

  • File-level separation: immutable config in one file with read-only permissions for most roles (only infra/ops can edit). Mutable config (feature flags, log levels) in a separate file with broader write access.
  • Approval workflow for anything in the rate-limit tier: changes must go through a PR with at least one reviewer. The ad-hoc edit that caused the outage would have been caught in review.
  • Change detection alerting: any write to the immutable config file triggers an immediate alert to the on-call engineer. "Someone changed service_discovery_url" is worth a 3 AM page; "someone changed log_level to DEBUG" is not.

Q3. What looks immutable but actually needs to change?

  • Database connection strings during a migration or DR failover. The string that "never changes" has to change when you cut over to a new database. The immutability must have an emergency override path.
  • Service discovery URLs during a topology change. What was immutable becomes a migration target. The enforcement mechanism should include a documented break-glass procedure: how do you change an immutable value when you legitimately must?
  • The lesson: immutability is a protection against casual, unreviewed changes — not a permanent freeze. The difference between "immutable" and "requires deliberate process to change" is the gap worth designing for.
Technique 18

Nonprescriptive Bias

Not every user statement is an instruction.
Teach the model to distinguish ideation from command.

The compliance default

User: "What if we used Redis instead of Postgres for this?"

Over-compliant model: "Great idea! Here's how to migrate to Redis..."

Calibrated model: "Redis would give you faster reads but you'd lose ACID on the transaction table. The latency gain doesn't justify the consistency risk here. Postgres is correct."

Correcting the bias

## Interaction mode: collaborative (not instruction-following)

Not every user message is a directive. Classify before acting:

INSTRUCTION signals: "Do X", "Make X", "Change X to Y", "Add X"
→ Execute. Produce output.

IDEATION signals: "What about X?", "Thoughts on X?", "I wonder if..."
→ Evaluate. Push back if wrong. Propose alternatives.

QUESTION signals: "Why does X?", "How does X work?", "What is X?"
→ Explain. Don't change anything.

When uncertain: ask "Should I implement this, or are we exploring?"
Never silently execute something that was phrased as a question.

Exercise

An architecture advisory bot. Developer says: "What if we split the monolith into microservices?" Bot responds: "Great idea! Here's a migration plan: Step 1, define service boundaries. Step 2, extract the auth service..."

Discuss: What input classification was missed? What should the bot have done?
When IS immediate execution the correct response? What signal distinguishes ideation from instruction?

Exercise — some valid answers

Not a key — the exercise is a discussion. These are valid directions, revealed one at a time; the list is deliberately non-exhaustive.

Q1. What input classification was missed?

  • "What if we split the monolith into microservices?" is ideation — the question form signals exploration, not command. The bot treated it as an instruction and executed immediately.
  • The "Great idea!" framing before any evaluation is the tell: the bot validated the premise before testing it. Ideation prompts should trigger evaluation first, not affirmation.
  • The developer wasn't asking "how do we do this" — they were asking "should we consider this." The bot answered the wrong question because it misclassified the input type.

Q2. What should the bot have done?

  • Evaluated the idea on its merits: "Splitting this monolith would improve deployment independence for the auth and billing services, which change frequently. But your current inter-service dependencies are tight — the migration cost is 3-6 months of refactoring before any benefit."
  • Asked a clarifying question: "What's driving the consideration? Deployment velocity? Scaling a specific service? The answer changes whether microservices are the right path here."
  • Optionally offered a smaller first step: "If the goal is deployment independence for auth, you could extract just that service first and test the complexity before committing to a full split." This is evaluation, not execution of the original idea.

Q3. When IS immediate execution correct?

  • When the input is explicitly imperative: "Split the monolith into microservices" (no question mark, no "what if") = instruction. "What if we split..." = ideation. The verb form and question mark are the primary signal.
  • When there's an established context that makes the intent unambiguous: in a migration planning session where the decision has already been made and the developer is now asking for implementation steps, "how do we split this?" is an instruction, not exploration.
  • When the cost of misclassifying as ideation is high: if someone says "what if we deploy right now?" during an incident and deployment would fix the issue, executing is correct. Read context; the signal isn't just grammar.
Session 6

Architecture

  • Technique 20 — Heartbeat Architecture
  • Technique 21 — Structured Communication
  • Technique 23 — Write-Then-Verify
Technique 20

Heartbeat Architecture

Design for iteration cycles,
not monolithic outputs.

Why cycles beat monoliths

Monolithic

  • One shot to get it right
  • Failure = total restart
  • Progress invisible until done
  • Context window = hard ceiling

Cycled

  • Each cycle self-contained
  • Failure = redo one cycle
  • Progress visible at each step
  • Total work exceeds context

The model can do more total work across many small cycles
than in one large response.

Cycle state contract

Illustration — the shape of a cycle contract

## Cycle N reads:
- progress.json: what's been done (items 1-4 complete)
- findings.json: accumulated output so far
- task.md: the full task specification (cold-readable)

## Cycle N produces:
- Process items 5-6 (scope: exactly 2 items per cycle)
- Append results to findings.json
- Update progress.json: items 1-6 complete, 7-10 remaining

## Cycle exit condition:
- progress.json shows all items complete → stop
- Otherwise → invoke cycle N+1

## Invariant: each cycle's output is usable alone.
No cycle produces "partial results that need later cycles
to make sense." If the system crashes after cycle 3,
findings.json contains valid, complete results for items 1-6.

Exercise

A document indexer that processes 10,000 PDFs into a search index. Current implementation: one function call that processes all 10,000, takes 4 hours, and if it fails at PDF 8,500 — starts over from PDF 1.

Discuss: What are the cycle primitives? What state persists between heartbeats?
What's the invariant that makes partial progress usable? How many items per cycle?

Exercise — some valid answers

Not a key — the exercise is a discussion. These are valid directions, revealed one at a time; the list is deliberately non-exhaustive.

Q1. What are the cycle primitives?

  • Batch size (how many PDFs per cycle), progress cursor (which PDF number was last successfully indexed), and an output artifact (the partial index built so far). Each cycle reads the cursor, processes one batch, and writes both the cursor update and the new index entries.
  • The cycle exit condition: cursor == total_count. Until then, the system invokes the next cycle. This is structurally analogous to a while loop where the state lives outside the function.
  • An optional cycle-level result: summary of what was indexed in this batch (count, any errors, any skips). This surfaces problems per-batch rather than at the end of a 4-hour run.

Q2. What state persists between heartbeats?

  • The progress cursor: "last successfully indexed PDF = 8,499." If the cycle crashes at 8,500, the next run starts at 8,500, not at 1.
  • The partial index itself: the search index built so far, covering PDFs 1 through the cursor. This is both an output artifact and input to the next cycle (it gets appended to, not rebuilt).
  • Error log: PDFs that failed indexing (corrupt file, unsupported format). These accumulate across cycles as a separate list for human review, separate from the main progress cursor.

Q3. What's the invariant? How many items per cycle?

  • The invariant: after cycle N, the search index contains valid, queryable entries for PDFs 1 through N×batch_size. A user can search the partial index before the full run completes and get correct results for the indexed portion.
  • Batch size tradeoff: smaller batches (10-50) make restarts cheaper but add per-batch overhead. Larger batches (500+) reduce overhead but mean more work lost on failure. 100-200 PDFs per cycle balances both concerns well for a 10,000-item corpus.
  • The key test for batch size: if this cycle crashes partway through, how much re-work do you do on restart? If the answer is "less than 5 minutes of processing," the batch size is calibrated correctly for reliability.
Technique 21

Structured Communication

Metadata on messages enables routing
without reading the body.

Attention as a limited resource

10 input messages, equal formatting

Model reads all 10 fully to determine priority

Context budget spent on low-priority items

High-priority items get less attention than they need

Structured headers = attention routing signals.

Example: Metadata-driven triage

Example input — messages the system consumes

---
from: security-scanner
to: code-reviewer
kind: request        # requires action (vs. "report" = FYI)
priority: critical
subject: SQL injection in auth.py line 42
---
[full finding details below...]

---
from: style-checker
to: code-reviewer
kind: report         # informational, no action required
priority: low
subject: 3 formatting inconsistencies in utils.py
---
[details...]

Prompt literal — the triage instruction

Process `request` messages first. `report` messages are
informational — acknowledge but don't block on them.
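
A minimal sketch of routing on headers alone; the message dicts mirror the example above and the priority order is illustrative:

PRIORITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}

messages = [
    {"from": "style-checker", "kind": "report", "priority": "low",
     "subject": "3 formatting inconsistencies in utils.py", "body": "..."},
    {"from": "security-scanner", "kind": "request", "priority": "critical",
     "subject": "SQL injection in auth.py line 42", "body": "..."},
]

def triage(messages: list[dict]) -> list[dict]:
    """Decide the order of attention from headers only; no body is read."""
    requests = sorted((m for m in messages if m["kind"] == "request"),
                      key=lambda m: PRIORITY_ORDER[m["priority"]])
    reports = [m for m in messages if m["kind"] == "report"]
    return requests + reports   # requests first, reports acknowledged afterwards

for m in triage(messages):
    print(m["kind"], m["priority"], m["subject"])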

Exercise

A monitoring system sends alerts to an on-call engineer. All alerts use identical formatting: subject line + body text. Last week: engineer received 47 alerts in 2 hours, triaged sequentially, and missed a critical database failure buried as alert #31.

Discuss: What metadata would enable routing without reading the body?
What are the primitives of a triageable alert? What's the cost of uniform formatting?

Exercise — some valid answers

Not a key — the exercise is a discussion. These are valid directions, revealed one at a time; the list is deliberately non-exhaustive.

Q1. What metadata would enable routing without reading the body?

  • Severity (critical/high/low/info), service (which system generated the alert), kind (action-required vs. informational), and dedup-key (is this the same alert repeating?). These four fields let you triage 47 alerts in seconds without reading bodies.
  • A timestamp and a "first-seen" flag: if an alert has been firing for 30 seconds, that's different from one that's been firing for 2 hours. Duration metadata changes the urgency calculation without body inspection.
  • A "correlated alerts" field: "this alert is related to alert #28 (database failure)" allows the engineer to see that #31 is downstream of #28 — solve one, resolve the other. Without this, both get triaged independently.

Q2. What are the primitives of a triageable alert?

  • Severity (how urgent), service (where it's from), kind (what response it needs: action vs. awareness), dedup-key (is it repeating), and optionally runbook link (what to do). These are the minimum for routing without body-reading.
  • The body provides evidence and detail — but the triage decision (do I act on this now?) should be answerable from metadata alone. If you need to read the body to know if it's urgent, the metadata is incomplete.
  • A well-structured alert lets the engineer ask: "What do I need to do, and in what order?" before reading a single alert body. The answer lives entirely in structured headers.

Q3. What's the cost of uniform formatting?

  • Sequential processing cost: without priority metadata, the engineer must read every alert in sequence to find the critical ones. Alert #31 gets the same attention budget as alert #1, regardless of actual urgency.
  • Cognitive overload: 47 identically-formatted alerts are indistinguishable at a glance. The engineer's working memory fills with low-priority detail before reaching the critical database failure buried in the middle.
  • Alert fatigue compounds this: after processing 30 alerts without structure, the engineer starts skimming, and the critical alert in a sea of identical noise gets the same skim treatment as the preceding 30 routine ones.
Technique 23

Write-Then-Verify

Generate and verify as distinct steps.
Never simultaneously.

Why self-checking during generation fails

A model that generates and verifies simultaneously
tends to rationalize its own output.
  • Generation context biases verification judgment
  • "I wrote this, so it must be right" (implicit)
  • Errors become invisible from inside the generation
  • Cold verification catches what hot-generation misses

Separation pattern

## Step 1: Generate (your context: the requirements only)
Produce the implementation. Don't self-critique while writing.
Just produce the best output you can.

## Step 2: Verify (your context: output + checklist only)
You are reviewing code produced by another engineer.
You did not write this.

First run the tooling — don't spend a verify pass on what it proves:
- Tests: null input, edge cases, regressions
- Type checker: return types match the interface
- Linter / static analysis: parameterized SQL, dead code

Then review only what tooling cannot catch — your real job:
- [ ] Error messages don't leak internal state
- [ ] Implementation matches the requirement's intent
- [ ] Edge cases the requirements imply but no test covers

## Step 3: If verification fails
List specific failures. Regenerate only the failing sections.
Do not regenerate passing sections — they're locked.
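
A minimal sketch of the two-call structure; call_model is a stand-in for whatever API you use, and the checklist and PASS/FAIL convention are illustrative:

GEN_SYSTEM = "Produce the implementation. Do not self-critique while writing."
VERIFY_SYSTEM = ("You are reviewing code produced by another engineer. "
                 "You did not write this. Answer each checklist item as PASS or FAIL.")

def generate(requirements: str) -> str:
    # Step 1: generation sees only the requirements.
    return call_model(system=GEN_SYSTEM, user=requirements)

def verify(output: str, checklist: list[str]) -> list[str]:
    # Step 2: verification sees the output and the checklist, not the
    # requirements discussion or any reasoning from the generation call.
    review = call_model(system=VERIFY_SYSTEM,
                        user=f"Code:\n{output}\n\nChecklist:\n" + "\n".join(checklist))
    return [line for line in review.splitlines() if line.endswith("FAIL")]

code = generate("Implement password reset with a 15-minute token expiry.")
failures = verify(code, ["Error messages don't leak internal state",
                         "Token expiry is enforced server-side"])
# Step 3: regenerate only the failing sections; passing sections stay locked.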

Exercise

An AI generates Terraform configurations based on requirements. The same AI instance reviews its output for security issues. It consistently rates its own configs as "secure" — even when they contain publicly accessible S3 buckets and security groups open to 0.0.0.0/0.

Discuss: Where's the verification bias? What structural change prevents self-rationalization?
What does cold verification look like here?

Exercise — some valid answers

Not a key — the exercise is a discussion. These are valid directions, revealed one at a time; the list is deliberately non-exhaustive.

Q1. Where's the verification bias?

  • The generator has access to its own reasoning: it chose to open the S3 bucket because the requirements said "publicly accessible content" — and during review, that reasoning biases it toward validating the choice rather than questioning it.
  • The reviewer sees not just the output but the context that produced it. "I made this public because X" is information the reviewer inherits, and it makes the reviewer more likely to confirm the decision than a cold reviewer would be.
  • The failure mode is confirmation bias, not hallucination: the AI isn't wrong about what it produced, it's using the generation context to rationalize the output as correct. A cold reviewer has no such context to rationalize from.

Q2. What structural change prevents self-rationalization?

  • Use a separate model instance for verification. The verification instance receives only the Terraform config + a security checklist — not the requirements or any context about why choices were made. It has nothing to rationalize.
  • Even with the same model, break the context: new conversation, no reference to the generation reasoning. "You are reviewing infrastructure code produced by another engineer. You did not write this." The posture shift is real even within one model.
  • Automate the structural checks first (Checkov, tfsec, AWS Config rules) before any LLM review. The tools can't rationalize. Let them catch the deterministic violations; the LLM review then focuses only on what tooling can't catch.

Q3. What does cold verification look like here?

  • The verifier receives: the Terraform config + a security checklist (S3 bucket access controls, security group ingress rules, IAM policies, encryption settings). Nothing else. No requirements, no explanations, no context.
  • The checklist is specific: "Is any S3 bucket configured with acl = 'public-read' or acl = 'public-read-write'? If yes, that is a finding regardless of stated intent." The checklist removes the judgment call the generator's reasoning would have biased.
  • Cold verification catches the class of bug where the generator was right about what it produced but wrong about whether it was safe — which is exactly what the self-review missed. Structural separation is what makes the verification meaningful.
Session 7

Integration Workshop

Apply all 25 techniques in a single prompt system

Workshop exercise

  1. Choose your domain — a system you actually build/maintain
  2. Layer it — identity / governance / task (Session 1)
  3. Encode failures — how does YOUR system typically fail? (Session 2)
  4. Classify constraints — enforced vs. requested (Session 3)
  5. Partition authority — model owns / asks / never (Session 4)
  6. Mark immutables — what must never drift? (Session 5)
  7. Design the cycle — state contract per heartbeat (Session 6)

Peer review: use Symmetry Detection + Layer-Crossing Checks
from the original framework to audit each other's systems.

Appendix A

Original Framework

Techniques 1–10: Evaluation & Auditing

Technique 1

Directive Hardening

Making instructions resistant to drift,
misinterpretation, or creative undermining.

The problem

Models creatively reinterpret instructions
to match their training distribution.

Three hardening techniques prevent this drift:

  1. Negative anchors
  2. Boundary clarification
  3. Calibration scales

Negative Anchors

Define what NOT to flag as explicitly
as what to flag.

Without anchors

"Flag long functions"
→ flags switch statements
→ flags data tables
→ flags config arrays

With negative anchors

"Flag long functions"
"Do NOT flag:"
"- switch statements"
"- data declarations"
"- config arrays"

Positive-only instructions leave the boundary undefined. The model fills the gap with training priors.

Boundary Clarification

"This is X, not Y" — explicit distinctions
at ambiguity boundaries.

## Boundary clarifications

- This is a CODE REVIEW, not a style guide enforcement.
  (Don't flag formatting unless it hides a bug.)

- "Error handling" means exception flow, not validation.
  (Missing input validation is a separate category.)

- "Performance issue" requires measurable impact evidence.
  (Theoretical inefficiency without data is a NOTE, not a finding.)

Boundaries prevent category bleeding — where one instruction swallows adjacent territory.

Calibration Scales

Concrete examples at each level of a judgment scale.

## Severity calibration

critical: SQL injection, auth bypass, data exposure
  → Example: unsanitized user input in raw SQL query

high: logic errors causing incorrect output for some inputs
  → Example: off-by-one in pagination returns duplicate rows

medium: missing edge case handling, non-obvious failure modes
  → Example: no timeout on HTTP call to external service

low: code quality, naming, minor style
  → Example: variable named 'x' in a 5-line scope

Without calibration, the model's severity distribution matches its training data — not your team's.

Example: Negative anchors

## What to flag
- Functions over 50 lines with no documentation
- Public APIs that don't validate input types
- Error handling that swallows exceptions silently

## What NOT to flag (explicitly)
- Long functions that are just switch statements (inherently linear)
- Internal helper functions without docs (low reader traffic)
- Caught-and-logged exceptions (not swallowed — handled)

Without the "NOT" list, the model flags everything that
pattern-matches on surface features.

Note: "functions over 50 lines," "missing docs" are deterministic — a linter counts them. The LLM's anchors should govern judgment ("is this long function inherently linear?"), not measurements tooling already does better. See Technique 13.

Exercise

A notification system. Rules: "Send notifications for important events. Don't spam users." System sends notifications for every login, password change, and settings update. Users complain. New rule: "Only send for critical security events." Now password changes don't notify — and users miss compromised-account alerts.

Discuss: What's missing from both rule versions? What negative anchors would fix this?
What does the hardened version look like?

Exercise — some valid answers

Not a key — the exercise is a discussion. These are valid directions, revealed one at a time; the list is deliberately non-exhaustive.

Q1. What's missing from both rule versions?

  • Both versions lack boundary clarification: "important" has no definition, and "critical security event" is underspecified. The system has no anchor for classifying new event types as they're added.
  • Both versions have no negative anchors — they only state what to notify about, not what NOT to notify about. The model (or engineer implementing the rule) fills that gap with their own priors, which diverge from intent.
  • The second version overcorrects: tightening "important" to "critical security events" was a reaction to the spam problem, but it didn't reason about which security events are notification-worthy. The fix introduced a new undefined boundary.

Q2. What negative anchors would fix this?

  • Notify on: password change (security signal), new device login (potential unauthorized access), permission or role changes (privilege escalation), failed login burst (brute force signal). These are enumerated, not inferred.
  • Do NOT notify on: same-device logins (routine), profile photo updates, notification preference changes, timezone/locale settings. These are explicitly excluded, not left for inference.
  • The negative anchor list is as important as the positive list. Without it, a new engineer adding a "billing address changed" event asks "is this a security event?" and can guess wrong in either direction.

Q3. What does the hardened version look like?

  • A two-list structure: "Always notify: [enumerated list]. Never notify: [enumerated list]. For new event types not on either list: escalate to the security team for classification before adding notification logic."
  • The hardened version is defensive by default for unclassified events: new events require explicit classification before notification behavior is determined. This prevents both "notify everything" and "notify nothing" as defaults.
  • It also has a maintenance path: the classification decision for each new event type gets added to one of the two lists, growing the ruleset deliberately rather than through inference.
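
One possible hardened version, sketched as a prompt fragment (event names taken from the discussion above, not exhaustive):

## Notification rules

Always notify (enumerated, not inferred):
- password change
- login from a new device
- permission or role change
- burst of failed logins

Never notify:
- same-device login
- profile photo update
- notification preference change
- timezone/locale change

New event types on neither list: escalate to the security team
for classification before adding notification logic.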
Technique 2

Seed Decision Trees

Pre-loading the model's reasoning path
with structured choices.

Three components of a seed tree

Without seed trees, the model's reasoning path is
determined by training distribution, not your task structure.

  1. Methodology — the reasoning path
  2. Stopping conditions — when to halt
  3. Conditional routing — branching logic

Methodology: "Start with X → then Y → then Z"

Ordered sequence of reasoning steps.
The model follows your path instead of inventing one.

## Analysis methodology (follow this order)

1. Identify the system's core primitives (nouns that compose)
2. Map how primitives interact (dependencies, data flow)
3. For each interaction: does the interface contract hold?
4. Where contracts break: what's the operational cost?

DO NOT skip steps. DO NOT reorder. Step 3 requires step 2's output.

Ordering matters — later steps consume earlier steps' output. Explicit dependency prevents shortcuts.

Stopping Conditions

When to halt: saturation, repetition, threshold.

## Stop when:
- You've covered all public interfaces (completeness)
- You find >10 critical findings (triage-first)
- Remaining items are all severity:low (diminishing returns)
- You're repeating a finding you already noted (saturation)

## Do NOT stop because:
- "It's getting long" (length is not a stopping condition)
- "This seems thorough" (feeling ≠ completeness)
- You've reached some internal output limit

Models default to stopping when output "feels complete" — an artifact of training on human-length responses.

Conditional Routing

"If X, then Y; otherwise Z" — branching logic for the model's path.

## Routing rules

- If finding involves data loss → severity:critical, stop and report
- If finding is style-only → severity:low, batch at end
- If finding is ambiguous → include with confidence:medium
- If finding duplicates an earlier finding → merge, don't repeat

## Escalation
- If >3 critical findings: stop analysis, report immediately
- If finding contradicts the specification: flag as "spec conflict"
  and continue (don't resolve — surface for human decision)

Routing prevents flat-priority output where everything appears equally important.

Example: Seeded methodology

## Analysis methodology (follow this order)

1. Identify the system's core primitives (nouns that compose)
2. Map how primitives interact (dependencies, data flow)
3. For each interaction: does the interface contract hold?
4. Where contracts break: what's the operational cost?

## Stopping conditions
- Stop when you've covered all public interfaces
- Stop if you find >10 critical findings (triage first)
- Stop if remaining items are all severity:low

## Routing
- If finding involves data loss → severity:critical, stop and report
- If finding is style-only → severity:low, batch at end
- If uncertain → include but mark confidence:medium

Exercise

A PR review bot. Instructions: "Review this PR. Flag important issues." It produces a flat list mixing typos, style suggestions, logic errors, and security vulnerabilities — all presented equally important, in file order rather than priority order.

Discuss: What methodology would you seed? What stopping conditions prevent noise?
What routing rules separate severity tiers?

Exercise — some valid answers

Not a key — the exercise is a discussion. These are valid directions, revealed one at a time; the list is deliberately non-exhaustive.

Q1. What methodology would you seed?

  • Step 1: Check for security vulnerabilities (SQL injection, XSS, auth bypass, secret exposure). Step 2: Check for logic errors (off-by-ones, edge cases, incorrect conditionals). Step 3: Check for style and maintainability. Process in this order; do not skip to step 3 first.
  • The ordering matters: security issues block merge, so they must surface first. Style issues are low-priority suggestions — they should never bury a critical security finding earlier in the output.
  • A dependency constraint: "complete step 1 before step 2; complete step 2 before step 3. Each step's output is independent — a step 1 finding doesn't prevent step 2, but step 3 should not inflate a report that already has critical step 1 findings."

Q2. What stopping conditions prevent noise?

  • Stop after 3 critical findings: if 3 or more security-critical issues exist, triage them first. Producing 15 more findings below them dilutes attention and delays the engineer getting to the critical items.
  • Stop when remaining items are all severity:low: if the only remaining issues are style suggestions, end the review and batch them separately. They don't belong in the same triage queue as logic errors.
  • Stop on saturation: "if you've seen this pattern 3 times, note the pattern once with examples — don't report the same finding 8 times across 8 instances. Repetition is a stopping signal."

Q3. What routing rules separate severity tiers?

  • Security finding → severity:critical, block merge, report immediately. Logic error causing incorrect output → severity:high, request changes. Missing edge case → severity:medium, suggestion. Style inconsistency → severity:low, batch at end.
  • The routing rule prevents the flat-list default: the bot can no longer present "SQL injection in auth.py" and "inconsistent indentation in utils.py" at the same visual priority. The severity tier is assigned in the methodology, not inferred at output time.
  • A routing rule for ambiguity: "if severity is unclear, assign medium and explain why. Don't suppress uncertain findings — include them with confidence:medium so the reviewer can assess." This prevents the bot from silently dropping low-confidence findings.
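
A sketch of the seeded prompt these answers describe (thresholds and wording are illustrative):

## Review methodology (follow this order)
1. Security vulnerabilities (injection, XSS, auth bypass, secret exposure)
2. Logic errors (off-by-ones, edge cases, incorrect conditionals)
3. Style and maintainability

## Stopping conditions
- Stop and triage after 3 critical findings
- Stop when remaining items are all severity:low; batch them separately
- If a pattern repeats, report it once with examples

## Routing
- Security finding → critical, block merge, report immediately
- Logic error → high, request changes
- Missing edge case → medium, suggestion
- Style → low, batch at end
- Unclear severity → medium, explain what would resolve it; don't suppress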
Technique 3

Steering

Guide toward desired patterns
without over-constraining.

The constraint spectrum

Too loose: unpredictable output

Steering: soft targets + escape hatches
  • Templates with exits: "Use this format, BUT deviate when..."
  • Soft targets: "Typical range: 3-7" (not "exactly 5")
  • Disclosure preference: "When uncertain, include but mark it"

Too tight: brittle, breaks on edge cases

Example: Soft targets

## Output guidance (steering, not constraints)

Typical audits surface 3-7 findings. This is a soft target:
- Fewer is fine if the system is genuinely clean
- More is fine if genuine issues exist
- DON'T manufacture findings to hit the range
- DON'T suppress findings to stay under it

For each finding, typical length is 2-4 sentences.
Expand if the finding requires context to understand.
Contract if it's self-evident from the code snippet.

If you find yourself writing more than 10 findings,
pause and triage: are all genuinely actionable?

Exercise

A content moderation system for a developer forum. Rules: "Posts must be on-topic. Remove spam. Keep discussions professional." A post discusses using a competitor's product to solve a problem the community's product can't handle. Moderator bot flags it as "off-topic / competitive spam."

Discuss: What's a hard constraint vs. soft target here? What escape hatch is missing?
How do you steer without over-constraining?

Exercise — some valid answers

Not a key — the exercise is a discussion. These are valid directions, revealed one at a time; the list is deliberately non-exhaustive.

Q1. What's a hard constraint vs. soft target here?

  • Hard constraint: "remove spam" — mechanically definable as repeated links, keyword stuffing, irrelevant promotions. This can and should be enforced, not steered.
  • Soft target: "on-topic" and "professional" — these are judgment calls. A post that's technical, solves a real problem, and mentions a competitor is on-topic by any reasonable definition but fails a naive "about our product" reading.
  • "Keep discussions professional" is a soft target too — professional behavior is contextual. A heated technical debate that stays technical is professional; a polite post that spreads misinformation isn't. Professional ≠ polite.

Q2. What escape hatch is missing?

  • The missing escape hatch: "mentioning a competitor is NOT spam when the mention is in the context of solving a technical problem. The question is whether the post helps the community; competitor mentions are incidental, not disqualifying."
  • A broader escape hatch structure: "if a post is borderline on the 'on-topic' criterion but clearly helps other developers with a real technical problem, lean toward allowing it. The goal is a useful community, not a promotional one."
  • Without the escape hatch, "on-topic" tightens over time as the model learns from its own moderation decisions. Each borderline case treated as off-topic trains the boundary tighter. The escape hatch preserves the intended breadth.

Q3. How do you steer without over-constraining?

  • Define the target as a range with escape: "posts are on-topic if they address a technical problem developers in this community face. Competitor mentions in that context are acceptable; posts whose primary purpose is competitor promotion are not."
  • Add a disclosure preference for borderline cases: "if a post is ambiguous, allow it and flag for human review rather than removing it. The cost of a false removal (chilling legitimate content) is higher than the cost of a false allowance."
  • Soft target with an example at each boundary: "on-topic example: [helpful competitor comparison]. Off-topic example: [pure competitor criticism with no technical content]. The rule is the range between these, not a point."
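
A sketch of the steering block these answers point toward (wording illustrative):

## Moderation guidance (steering, not constraints)

Hard rule (enforced): remove spam, defined as repeated links, keyword
stuffing, or promotions with no technical content.

Soft target: a post is on-topic if it addresses a technical problem this
community faces. Competitor mentions in that context are acceptable;
posts whose primary purpose is promotion are not.

Escape hatch: if a post is borderline, allow it and flag for human review.
A false removal costs more than a false allowance.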
Technique 4

Conceptual Anchoring

Identify and reason from system primitives,
not surface patterns.

Primitives > patterns

  • Prevents hallucination: findings grounded in operational reality
  • Transfers across domains: same technique for code, prose, research
  • Enables composition: primitives compose; surface patterns don't

The three questions:

  1. "What are the core primitives of this system?"
  2. "How do these primitives interact?"
  3. "Where does the map break down?" ← violations live here

Example: Anchoring an audit

## System primitives (your anchors)

This system has 4 primitive concepts:
- **User**: has roles, belongs to organizations
- **Document**: has owner, has visibility, has versions
- **Permission**: maps (user, document, action) → allow/deny
- **Audit trail**: immutable log of permission checks

## Your job: verify that every Document operation
## checks Permission before executing.

Anchor in Permission. For each Document endpoint:
1. Does it call Permission.check() before acting?
2. Does the Permission check match the action being taken?
3. Is the result logged to Audit trail?

Findings are grounded in these primitives — not in code style,
naming conventions, or aesthetic preferences.

Exercise

A CLI tool with 8 subcommands. Each subcommand has --help text. Most follow: Synopsis (usage), Description paragraph, Options list, Examples section. Two subcommands omit Examples. One has Synopsis + Examples only (no Description).

Discuss: What are the primitives? What's a violation?
What's justified divergence?

Exercise — some valid answers

Not a key — the exercise is a discussion. These are valid directions, revealed one at a time; the list is deliberately non-exhaustive.

Q1. What are the primitives?

  • A --help text entry has four primitives: Synopsis (what does this command do, one line), Description (why/when would you use it), Options (reference for each flag), Examples (how to use it in practice). These compose into a complete help entry.
  • Each primitive serves a different reader mode: Synopsis for scanning (what does this command do at a glance), Description for evaluation (should I use this), Options for reference (what flags exist), Examples for learning (how do I actually use it).
  • The audit is: for each subcommand, which primitives are present? Which are missing? Is the omission intentional (the primitive genuinely doesn't apply) or incidental (someone forgot)?

Q2. What's a violation?

  • The two subcommands missing Examples sections: if Examples serve the "learning" reader mode, their absence leaves that reader stranded. The violation is that users who want to know how to use the command have no reference to copy from.
  • The subcommand with Synopsis + Examples but no Description: depends on what the subcommand does. If "rm — removes a file" is self-explanatory from Synopsis, Description adds no value. If it's a complex command with edge cases, the Description omission is a violation.
  • Violation = a missing primitive that would have served a reader. Not-violation = a missing primitive that genuinely doesn't apply. The anchor is "does this reader need this?"

Q3. What's justified divergence?

  • A "version" subcommand that shows the app version has no meaningful Examples section — "run `app version`" is the entire usage. The primitive doesn't apply; its absence is correct.
  • A complex subcommand might legitimately omit Description if the Synopsis is precise enough to convey both function and use-case. The test: can a new user read the Synopsis alone and know when they'd reach for this subcommand? If yes, Description is optional.
  • Divergence is justified when it's intentional and documented. The distinction that matters is omission-by-accident versus omission-by-design: the audit should flag both, but the response differs. Accidental omissions get fixed; intentional ones get rationale added so future audits know not to flag them.
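
A sketch of an anchored audit prompt for this exercise (primitive definitions taken from the discussion above):

## Help-text primitives (your anchors)
- Synopsis: what the command does, one line
- Description: why/when you'd use it
- Options: reference for each flag
- Examples: how to use it in practice

For each subcommand:
1. Which primitives are present? Which are missing?
2. For each missing primitive: does it genuinely not apply
   (a `version` subcommand needs no Examples), or was it forgotten?
3. Flag accidental omissions as findings; for intentional omissions,
   ask for documented rationale so future audits don't re-flag them.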
Technique 5

Operational Cost Framing

Evaluate by impact, not by difference.
"What breaks?" not "what's different?"

Three cost dimensions

  • Operational cost: "What breaks if this persists?"
  • Cognitive cost: "What mental load does this add for readers?"
  • Future cost: "What maintenance trap does this create?"

Models default to noticing variation.
Cost framing forces evaluation of impact.

Reduces false positives from style differences.

Example: Cost-based severity

## Severity = cost, not deviation

CRITICAL (operational cost: system breaks)
- Missing null check on user input → crash in production
- Unvalidated redirect URL → open redirect vulnerability

HIGH (operational cost: data/security compromise)
- Hardcoded API key → secret exposure on public repo
- Missing rate limit → resource exhaustion possible

MEDIUM (cognitive cost: readers misled)
- Function name doesn't match behavior → debugging time
- Docs say X, implementation does Y → wrong assumptions

LOW (future cost: maintenance burden)
- Inconsistent naming → grep becomes unreliable
- Dead code → codebase noise, no runtime impact

NOT A FINDING (style variation, no cost)
- Tabs vs spaces (if no standard set)
- Comment style differences (if functionally equivalent)

Exercise

A shopping cart service audit flags: (1) Cart totals use floating-point math instead of integer cents. (2) "Remove item" button is labeled "Delete." (3) Cart doesn't persist across sessions. (4) Tax calculation rounds per-item instead of on the total.

Discuss: What's the operational cost of each? Which are critical vs. cosmetic?
What looks high-severity but isn't? What looks low but is actually critical?
Which would you fix first?

Exercise — some valid answers

Not a key — the exercise is a discussion. These are valid directions, revealed one at a time; the list is deliberately non-exhaustive.

Q1. What's the operational cost of each?

  • (1) Floating-point totals: penny rounding errors on every transaction, compounding over thousands of orders. Accounting reconciliation fails; financial audits flag discrepancies. Legal/regulatory exposure. CRITICAL.
  • (2) "Delete" vs. "Remove": cosmetic. Both are standard verbs for removing an item. No data integrity issue, no user confusion severe enough to cause loss. LOW.
  • (3) No session persistence: items lost when the user closes their browser or navigates away. High user frustration, potential cart abandonment. MEDIUM — operational cost is friction and lost conversions, not data corruption.

Q2. Which look high-severity but aren't? Which look low but are actually critical?

  • (3) Looks high (users complain loudly about losing carts) but is MEDIUM — the cost is friction and frustration, not money or correctness. User complaints are not the same as operational cost.
  • (4) Tax per-item rounding looks like a rounding preference, almost cosmetic. It's actually CRITICAL: tax rounding errors create legally incorrect invoices that fail compliance audits. The financial and legal cost is real and accumulating on every transaction.
  • The lesson: severity should be grounded in what breaks — legal/financial systems, data integrity, or security — not in how visible or complained-about the issue is. Loudly complained-about ≠ high operational cost.

Q3. Which would you fix first?

  • Fix (1) first: floating-point math is actively corrupting financial records on every transaction. Every order placed between now and the fix is a compounding error in the accounting system.
  • Fix (4) second: per-item tax rounding is a legal compliance issue on every invoice. The fix is low-effort (integer arithmetic with one rounding step at the total level) and the risk of not fixing it is regulatory.
  • (3) and (2) are lower priority — cart persistence is a user experience improvement, not a correctness fix, and the label change is cosmetic. They matter, but not before the financial integrity issues.
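
A minimal TypeScript sketch of the fix described for (1) and (4): totals in integer cents, tax rounded once on the total. Names and the flat tax rate are illustrative; real tax rules vary by jurisdiction.

type LineItem = { unitPriceCents: number; quantity: number };

function cartTotalCents(items: LineItem[]): number {
  // Integer arithmetic only: no floating-point drift per transaction.
  return items.reduce((sum, item) => sum + item.unitPriceCents * item.quantity, 0);
}

function taxCents(totalCents: number, rate: number): number {
  // One rounding step, applied to the total rather than per item.
  return Math.round(totalCents * rate);
}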
Technique 6

Symmetry Detection

Notice when patterns hold almost everywhere —
then investigate the breaks.

One-offs vs. broken symmetry

"8/10 endpoints validate input. 2 don't."

That's not two outliers. That's broken symmetry.
Either the 2 have justification, or they're bugs.
  • Quantify pattern adherence (not just "usually follows")
  • Check for documented justification on breaks
  • Assess cost of the asymmetry (back to technique 5)

Example: Symmetry audit

## Task: Identify broken symmetry in this API

For each pattern you observe:
1. How many endpoints follow it? (N/total)
2. Which endpoints break it? (list specifically)
3. For each break: is there documented justification?
4. If no justification: what's the operational cost?

## Example finding format:
Pattern: "All POST endpoints validate request body schema"
Adherence: 11/13 endpoints
Breaks: POST /legacy/import, POST /admin/bulk-update
Justification: None documented
Cost: Malformed input reaches business logic → potential crash
Severity: HIGH (broken symmetry with operational cost)

Exercise

A component library with 20 components. 17 accept a `size` prop with values "sm" | "md" | "lg". Of the remaining 3: one uses "small" | "medium" | "large", one has no size prop at all, and one accepts a numeric pixel value.

Discuss: What's the symmetry? Which breaks are justified?
What's the operational cost of each asymmetry? How would you report this?

Exercise — some valid answers

Not a key — the exercise is a discussion. These are valid directions, revealed one at a time; the list is deliberately non-exhaustive.

Q1. What's the symmetry?

  • Pattern: 17/20 components accept a size prop with values "sm" | "md" | "lg". The adherence rate (85%) is high enough to call this a canonical convention for the library.
  • The symmetry extends beyond just the values: 17 components share the same prop name ("size"), the same enum type, and the same three values. This is a strong interface contract that the three exceptions break in different ways.
  • The audit question is: for each break, is the deviation intentional (justified by the component's nature) or incidental (someone used different naming without thinking)? The answer is different for each of the three.

Q2. Which breaks are justified?

  • "small"/"medium"/"large" — unjustified. This is the same concept with different names. The operational cost is that TypeScript autocomplete diverges, documentation is inconsistent, and a developer writing a themed wrapper has to handle two naming conventions.
  • No size prop at all — possibly justified. A divider or decorator component might genuinely have one fixed presentation. The audit should flag it for review, not as a definite violation. The question is: "should this component ever render at different sizes?"
  • Numeric pixel value — possibly justified for specific components. A spacer or layout primitive that needs arbitrary sizing can't be constrained to three enum values. The justification should be documented ("this component accepts arbitrary spacing, not discrete size tiers") so future developers understand the intentional divergence.

Q3. What's the operational cost? How would you report this?

  • "small"/"medium"/"large" break: developers using the library must memorize two size APIs. TypeScript doesn't catch mismatches (both are valid strings). Grep for "size=" returns inconsistent results. Rename it in the fix — the cost of the inconsistency accumulates with every consumer.
  • How to report it: "Pattern: 17/20 components use size: 'sm' | 'md' | 'lg'. Break: ComponentX uses 'small' | 'medium' | 'large' (no documented justification). Operational cost: dual API surface for consumers. Recommended fix: align to 'sm'|'md'|'lg'."
  • The report format matters: don't just list the breaks — quantify adherence (17/20), identify the break, check for justification, and assess cost. A report that says "3 inconsistencies found" is less useful than one that distinguishes two unjustified naming divergences from one intentional design choice.
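
A small TypeScript sketch of the alignment fix (names illustrative): one shared Size type, plus a temporary shim so consumers of the divergent component aren't broken mid-migration.

export type Size = 'sm' | 'md' | 'lg';
type LegacySize = 'small' | 'medium' | 'large';

// Accept both spellings during migration, normalize to the canonical enum.
export const normalizeSize = (s: Size | LegacySize): Size =>
  s === 'small' ? 'sm' : s === 'medium' ? 'md' : s === 'large' ? 'lg' : s;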
Technique 7

Confidence Disclosure

Make uncertainty visible.
"I don't know" is valuable data.

Why disclosure beats suppression

Without disclosure:

  • Uncertain findings get suppressed
  • OR uncertain findings get over-committed
  • Both lose information

With disclosure:

  • Low-confidence items included + marked
  • Reader triages by confidence
  • Future analysis can revisit

Normalize: "When confidence is low, include but mark it."

Example: Confidence scale

## Confidence marking (required on every finding)

HIGH: You can point to specific code/evidence that confirms.
      Would defend this finding in peer review.

MEDIUM: Pattern suggests an issue but you can't confirm
        without runtime testing or broader context.
        Include — reviewer may have the missing context.

LOW: Heuristic match only. Might be a false positive.
     Include anyway — mark it. Don't suppress.
     "I noticed X but can't confirm it's a problem."

## Rule: never suppress a finding to avoid looking uncertain.
## Uncertainty disclosed > certainty faked.

Exercise

A log analyzer reports: "Definite memory leak in service-A (evidence: linear memory growth over 72 hours)." Also reports: "Possible connection pool exhaustion in service-B (evidence: 2 timeout errors in 10,000 requests over 24h)."

Discuss: What are the confidence primitives? What's the cost of treating both equally?
What if the second is suppressed entirely? What's the right disclosure?

Exercise — some valid answers

Not a key — the exercise is a discussion. These are valid directions, revealed one at a time; the list is deliberately non-exhaustive.

Q1. What are the confidence primitives?

  • Evidence strength: finding 1 has 72 hours of linear memory growth — a strong, repeatable pattern. Finding 2 has 2 events in 10,000 over 24 hours — a weak, possibly-noise signal. Different evidence strength, different confidence tier.
  • Sample size and time window: 72 hours is a long enough window to rule out one-off spikes. 24 hours and 2/10,000 is a small sample in a short window — the signal might disappear with more data, or it might be an early warning.
  • Reproducibility: the memory leak is reproducible (it's continuous and measurable). The connection pool issue may not be — 2 timeouts in 24 hours could be coincidence, network noise, or a real but infrequent failure mode.

Q2. What's the cost of treating both equally?

  • If both are presented at the same severity: the engineer may investigate the low-confidence finding with the same urgency as the high-confidence one — wasting hours on a 2/10,000 rate that may be noise while the confirmed memory leak continues growing.
  • Or the reverse: if the engineer notices that #2 seems flimsy, they may discount both findings because the report didn't help them calibrate. Treating them equally makes both seem equally uncertain.
  • The decision about where to spend investigation time belongs to the engineer — but only if the report gives them the confidence information to make that decision. Flat-severity reports steal that ability.

Q3. What if finding 2 is suppressed entirely?

  • A suppressed finding is a hidden one. If the connection pool issue is an early signal of a growing problem, suppressing it means it first surfaces as an incident rather than a flagged concern. The cost of suppression is borne by the system, not the report.
  • The right disclosure: include finding 2, mark it LOW confidence with the specific evidence and its limitations ("2 timeout errors in 10,000 requests over 24h — may be noise, but warrants a monitoring ticket for 1 week of data"). The engineer decides whether to investigate now or watch the metric.
  • "Include but mark it" preserves information that "suppress if uncertain" loses. The engineer reading the marked finding can dismiss it in 30 seconds if they recognize the pattern as known noise — but they can't act on a finding they never saw.
Technique 8

Domain-Neutral Primitives

Design prompts that work across artifact types:
code, prose, research, systems.

Abstract > specific

  • Use "conceptual units" not "modules" or "paragraphs"
  • Use "composition boundary" not "function interface" or "section break"
  • Use "coherence" not "code correctness" or "logical flow"

Test: can your prompt audit both a CLI tool
AND a research paper without modification?

Same auditor for software, papers, component libraries, organizational docs.

Example: Domain-neutral language

## Audit framework (works across domains)

For any artifact, identify:
1. **Units**: the composable pieces (functions/paragraphs/components)
2. **Boundaries**: where units meet (interfaces/transitions/edges)
3. **Contracts**: what each boundary promises (types/claims/invariants)
4. **Coherence**: do units compose without contradiction?

## Apply to:
- Source code: units=functions, boundaries=APIs, contracts=types
- Research paper: units=sections, boundaries=transitions, contracts=claims
- Design system: units=components, boundaries=props, contracts=variants

The audit technique is identical. Only the vocabulary maps change.

Exercise

A multi-tenant SaaS audit prompt says: "Check that tenant data doesn't leak between organizations. Review database queries for proper WHERE clauses. Verify API endpoints check org_id."

Discuss: What's domain-specific language here? How would you rephrase to also audit
a document-sharing system's permission model? A component library's theme isolation?
What vocabulary transfers across all three?

Exercise — some valid answers

Not a key — the exercise is a discussion. These are valid directions, revealed one at a time; the list is deliberately non-exhaustive.

Q1. What's domain-specific language here?

  • "Tenant data," "organizations," "WHERE clauses," "org_id" — all specific to multi-tenant SaaS data architecture. None of these terms apply to a document-sharing system or a component library.
  • "Database queries" is somewhat domain-neutral but presupposes a relational data model. A document-sharing system with a different persistence layer needs different vocabulary for the same underlying concept.
  • The domain-specific framing constrains the prompt to one class of systems — it can't be reused without rewriting, even when the underlying principle (scope isolation) is identical across all three systems.

Q2. How would you rephrase to audit all three?

  • Multi-tenant SaaS: "scope" = org, "isolation boundary" = data that belongs to one org must not be accessible to another, "boundary check" = every data access is scoped to the requesting org's identifier.
  • Document-sharing: "scope" = user + permission set, "isolation boundary" = a document visible to user A must not be accessible to user B without a permission grant, "boundary check" = every document access verifies the requester's permissions.
  • Component library theme isolation: "scope" = component instance, "isolation boundary" = a component's theme tokens must not bleed into sibling components, "boundary check" = CSS scoping or prop encapsulation prevents style leakage across component boundaries.

Q3. What vocabulary transfers across all three?

  • "Scope" (the unit that defines what a principal can access), "isolation boundary" (where scope A must not cross into scope B), "boundary check" (the mechanism that enforces the boundary), and "crossing detection" (how you verify that boundary checks are happening).
  • The transferable one-sentence audit: "Verify that every access to a resource in scope A checks the requester's right to that scope before granting access." This applies to tenants, documents, and theme tokens without modification.
  • Domain-specific vocabulary (tenant, org_id, WHERE clause) can be added as examples: "In a multi-tenant DB, this check is a WHERE org_id = :current_org; in a document system, it's a permission.check(user, document); in a component library, it's CSS encapsulation." The principle is shared; the vocabulary maps are per-domain footnotes.
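
A domain-neutral sketch of the audit these answers describe, with the vocabulary maps kept as per-domain examples:

## Isolation audit

1. Identify the scope unit (tenant / user + permission set / component instance).
2. State the isolation boundary: what in scope A must never be reachable from scope B?
3. For every access path that crosses scopes, locate the boundary check.
4. Missing or bypassed checks are findings; cite the crossing path as evidence.

Examples of the boundary check per domain:
- Multi-tenant SaaS: WHERE org_id = :current_org
- Document sharing: permission.check(user, document)
- Component library: CSS scoping / prop encapsulation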
Technique 9

Layer-Crossing Checks

Verify alignment across abstraction layers.
Most bugs live at boundaries.

Where integrity violations hide

Documentation → says "validates all input"
BOUNDARY (check here)
Implementation → skips validation on 2 endpoints

Schema → defines field as required
BOUNDARY (check here)
Usage → passes null without error

Single-layer audits miss drift between layers.
Anchor in one layer, validate in another.

Boundary: Docs ↔ Code

Documentation says one thing. Implementation does another.

## Docs ↔ Code verification

For each documented feature:
1. Find the claim in documentation (README, API docs, comments)
2. Locate the implementation
3. Verify: does the implementation honor the documented behavior?

Common drift patterns:
- Docs describe v1 behavior; code has been patched to v3
- Docs promise error handling that doesn't exist
- Docs say "required" but code has a default fallback
- Feature was removed but docs still reference it

Docs ↔ Code drift is the most common and least detected. Tests verify code, not docs.

Boundary: Schema ↔ Usage

Schema defines a contract. Callers violate it silently.

## Schema ↔ Usage verification

For each schema definition:
1. Read the schema constraints (required fields, types, enums)
2. Find all call sites / consumers
3. Verify: does every caller satisfy the schema?

Red flags:
- Schema says "required" but 3 callers pass undefined
- Schema says enum ["draft","published"] but code also uses "archived"
- Schema says maxLength:100 but no truncation at input boundary
- Schema updated, but callers unchanged (version skew)

Schemas are promises. Code that bypasses the validation layer breaks promises silently.

Boundary: Claim ↔ Evidence

System makes an assertion. Evidence doesn't support it.

## Claim ↔ Evidence verification

For each system claim (logs, metrics, status pages, user-facing messages):
1. Identify the claim ("99.9% uptime", "processed successfully", "validated")
2. Trace to the evidence source
3. Verify: does the evidence actually demonstrate the claim?

Failure modes:
- "Processed" means "received" — not "completed"
- "Validated" means "parsed" — not "business-rule checked"
- "Healthy" means "responding" — not "correct"
- Metric measures attempts, claim reports "successes"

Most dangerous in LLM systems: "the model confirmed" ≠ "the answer is correct."

Example: Cross-layer checklist

Prompt literal — a verification framework

## Layer-crossing verification

For each feature, check these boundaries:

| Source layer | Target layer | Question |
|---|---|---|
| README | Implementation | Does code do what docs claim? |
| Schema | Usage | Are required fields always provided? |
| Error messages | Error handling | Do messages match actual error types? |
| API docs | Route handlers | Do params/returns match? |
| Config defaults | Runtime behavior | Does the default actually apply? |

Process: Pick source layer. Read its claims. Cross to target
layer. Verify each claim has matching implementation.
Mismatches at boundaries are findings.

Exercise

An API has an OpenAPI spec (doc layer) and Express route handlers (implementation layer). Spec says PUT /users/:id requires `{ name, email }` in body. Handler actually accepts `{ name, email, role }` — and setting role=admin works, bypassing the invitation flow.

Discuss: What layers exist here? What's the boundary violation?
What's the operational cost? What other layer-crossings would you check in this system?

Exercise — some valid answers

Not a key — the exercise is a discussion. These are valid directions, revealed one at a time; the list is deliberately non-exhaustive.

Q1. What layers exist here?

  • Documentation layer: the OpenAPI spec — the public contract for what the API accepts and returns. Authorization layer: the invitation flow — the business rule that controls how role=admin is granted. Implementation layer: the Express handler — what the code actually does.
  • A fourth layer exists implicitly: the test suite. If tests were written against the OpenAPI spec, they would not test for role=admin in the request body — because the spec doesn't document it. The vulnerability lives below the test floor.
  • The vulnerability is an undocumented layer-crossing: the implementation accepts input the spec doesn't acknowledge, and that input bypasses an authorization layer the spec implies is the only path to elevated roles.

Q2. What's the boundary violation?

  • The spec says 2 fields; the handler accepts 3. The undocumented field is not just an inconsistency — it has real authorization consequences. Setting role=admin bypasses the invitation flow, granting elevated access without the intended review step.
  • This is a docs↔code violation with security impact: the spec is the source of truth for what the API accepts, and the implementation silently extends it with a privilege escalation surface. Anyone reading only the spec would never find this vulnerability.
  • The authorization layer (invitation flow) has a claimed guarantee: "admin access requires an invitation." The implementation makes that guarantee false without being visible at the authorization layer. That's a claim↔evidence violation nested inside the docs↔code violation.

Q3. What other layer-crossings would you check?

  • Error responses: does the spec document the error format for 400/401/403 responses? Do the handlers return exactly that format, or do some return raw Express error objects with stack traces?
  • Rate limiting: if the spec documents rate limits (429 after N requests), does the middleware enforce them? Or does the spec describe a protection the implementation doesn't actually provide?
  • Authentication requirements: for each endpoint the spec marks as "authenticated," does the middleware chain actually enforce authentication? A missing authMiddleware in one route handler would be invisible from the spec layer but exploitable at the implementation layer.
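
One way to close the spec↔handler gap, sketched in TypeScript (updateUser is a hypothetical persistence helper; the allow-listing is the point):

import { Router } from 'express';

// Stand-in for the real persistence layer.
declare function updateUser(
  id: string,
  fields: { name?: string; email?: string }
): Promise<unknown>;

const router = Router();

router.put('/users/:id', async (req, res) => {
  // Accept only the fields the OpenAPI spec documents.
  const { name, email } = req.body;
  // Role changes are not accepted here; they go through the invitation flow.
  const user = await updateUser(req.params.id, { name, email });
  res.json(user);
});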
Technique 10

Escape Hatch Design

Build flexibility into rigid templates
without losing structure.

Hard constraints + permission to break them

"Follow this template OR explain why you didn't"
>
"Follow this template" (no escape)
  • Edge cases always emerge
  • Models need permission to deviate when justified
  • Deviation with explanation > silent rule-breaking
  • Deviation with explanation > forced compliance that distorts output

Example: Template with exits

## Output format

For each finding, use:
### [SEVERITY] Finding title
**Location:** file:line
**Evidence:** [code snippet or quote]
**Impact:** [operational cost]
**Fix:** [specific remediation]

## Escape hatches
- If finding spans multiple files: use "Location: [list]"
  instead of single file:line
- If evidence requires >10 lines: summarize + reference
  the full span ("see lines 42-89")
- If fix is non-obvious: replace "Fix:" with "Options:"
  and list 2-3 alternatives with tradeoffs
- If you can't determine severity: use "UNCERTAIN" and explain
  what information would resolve it

Exercise

A form builder requires every field to have: a label, a validation rule, a help text tooltip, and an error message. An engineer adds a "spacer" element (pure layout, takes no input). System rejects it — no validation rule, no error message.

Discuss: What are the template primitives? What's the violation?
What escape hatch accommodates the spacer without weakening the contract for real input fields?

Exercise — some valid answers

Not a key — the exercise is a discussion. These are valid directions, revealed one at a time; the list is deliberately non-exhaustive.

Q1. What are the template primitives?

  • For input fields: label (what is this field?), validation rule (what values are accepted?), help text (tooltip explaining the expected input), error message (what to show when validation fails). All four serve user interaction with an input element.
  • The template assumes a one-to-one mapping: every element in the form IS an input field. The primitives were derived from the common case and generalized incorrectly to "all elements."
  • The spacer element is a layout element, not an input element. It doesn't accept user input, has no validation state, and needs none of the four primitives. The constraint is correct for its intended scope; the error is over-application of that scope.

Q2. What's the violation?

  • The system rejects a valid use case: a spacer element is a legitimate form element (many form builders need layout control) but the validation contract prevents it from being expressed. The system is more rigid than the problem requires.
  • The violation is in the scope assumption, not the contract itself. "All form elements must have label + validation + help + error" is correct for input elements. Applying it to all elements conflates element types.
  • A type-blind validation contract cannot distinguish between "missing required field" (violation) and "element type doesn't need this field" (legitimate). Both look identical to the validator.

Q3. What escape hatch accommodates the spacer?

  • Introduce element types: type: "input" | "layout". Input elements require all four properties; layout elements require none. This strengthens the contract for input elements (it now applies specifically and unambiguously) while accommodating layout elements.
  • The escape hatch is the type distinction, not the relaxation of constraints. The rules for input fields are unchanged — in fact, they're now clearer because the scope is explicit. The validator can enforce them more strictly on elements it knows are input fields.
  • An alternative escape hatch: allow elements to declare which primitives apply to them (required_primitives: ["label", "validation"]) with the default being all four. This is more flexible but harder to reason about — the type-based approach is cleaner when the element categories are stable.
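
A TypeScript sketch of the type-based escape hatch (field representations are illustrative):

type InputField = {
  type: 'input';
  label: string;
  validation: string;     // how the rule is encoded is illustrative
  helpText: string;
  errorMessage: string;
};

type LayoutElement = {
  type: 'layout';         // e.g. a spacer: takes no input, needs none of the four
};

type FormElement = InputField | LayoutElement;

// The validator enforces the full contract on 'input' elements and
// requires nothing beyond the discriminant on 'layout' elements.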

The Meta-Pattern

Every technique in this seminar is an instance of one principle:

Make the implicit explicit.
Make the structural visible.
Make the enforced distinguishable from the requested.

Prompts are infrastructure. Engineer them like infrastructure.

25 Techniques — Quick Reference

1. Directive Hardening
2. Seed Decision Trees
3. Steering
4. Conceptual Anchoring
5. Operational Cost Framing
6. Symmetry Detection
7. Confidence Disclosure
8. Domain-Neutral Primitives
9. Layer-Crossing Checks
10. Escape Hatch Design
11. Constitutional Layering
12. Failure Mode Encoding
13. Enforcement vs. Request
14. Backlog Pressure
15. Cold-Read Design
16. Authority Partitioning
17. Supersession Protocol
18. Nonprescriptive Bias
19. Consumed vs. Done
20. Heartbeat Architecture
21. Structured Communication
22. Identity vs. Task Separation
23. Write-Then-Verify
24. Posture Calibration
25. Immutability Marking