By yrzhe · May 25, 2026 · 7 min read

# Why do LLM coding agents fail as constraints pile up — and what can a solo builder measure, mitigate, and build?

They fail because “constraint decay” is real. As you layer structural requirements (file layout, framework conventions, ORM usage, and inter-file contracts) on top of functional behavior, LLM agents struggle to track and jointly satisfy the growing set of rules across multiple files. Code that looks plausible under a loose spec then collapses into architectural defects and data-layer or runtime failures once the spec becomes production-shaped.

## The mechanism: why constraints compound into failures

The key dynamic in the “Constraint Decay” study is that structural constraints aren’t independent; they interact. A single requirement like “use this ORM pattern” implies many downstream obligations: where models live, how imports resolve, how migrations/schemas align, how transactions are handled, and what conventions a framework expects at runtime. As those obligations stack, the agent has to maintain more cross‑file state and more implicit “this must match that” relationships.

The paper’s error analysis highlights that failures concentrate in the data layer—incorrect query composition, ORM runtime violations, schema mismatches, and improper transaction handling—plus broader structural defects like misplaced modules/files or broken inter‑file relationships. In other words, constraint decay isn’t just “the model forgot a detail”; it’s that backend correctness often emerges from consistent composition across files and layers, and that consistency is exactly what deteriorates as constraints accumulate. The result is code that may be locally reasonable but globally inconsistent.

## What the new research shows (and why it’s different)

The arXiv preprint “Constraint Decay: The Fragility of LLM Agents in Backend Code Generation” evaluates 100 backend tasks—80 greenfield generation tasks and 20 feature‑implementation tasks—spanning eight web frameworks (examples cited include Flask, FastAPI, and Django). A key design choice is that the study fixes a unified API contract across tasks to isolate the effect of structural constraints while holding functional requirements steady.

Crucially, evaluation is dual: runtime assertion‑based end‑to‑end tests measure behavior, while static verifiers check structural rules, conventions, and code properties. That combination matters because it prevents a common benchmark loophole: scoring “functionally correct” snippets that violate the architecture you’d actually ship.
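
To make the static channel concrete, here is a minimal sketch of one structural verifier. It assumes a hypothetical rule that route-layer files must not import data-layer modules directly. The rule, the `FORBIDDEN_IN_ROUTES` set, and the function name are illustrative and are not taken from the paper's verifiers.

```python
import ast

# Hypothetical structural rule: files in the route layer must not import
# these data-layer modules directly.
FORBIDDEN_IN_ROUTES = {"sqlalchemy", "sqlite3"}


def forbidden_imports(source: str, forbidden: set[str] = FORBIDDEN_IN_ROUTES) -> list[str]:
    """Return the forbidden top-level modules imported by `source`."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Relative imports such as `from . import x` have module=None.
            names = [node.module or ""]
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root in forbidden:
                found.add(root)
    return sorted(found)
```

A check like this passes or fails independently of whether the endpoints behave correctly, which is why it can catch code that is functionally fine but violates the intended architecture.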

Quantitatively, the headline result is a ~30 percentage‑point average drop in assertion pass rates when moving from baseline (loosely specified) to fully specified (structurally constrained) tasks for capable agent configurations. Weaker configurations often fall near zero under full specification. The losses are not evenly distributed: minimal, explicit frameworks (e.g., Flask) are comparatively agent-friendly, while convention-heavy frameworks (e.g., FastAPI and Django) see much larger degradation—consistent with the idea that implicit framework norms amplify constraint interactions.

## Why It Matters Now

Constraint decay lands at an awkward moment: more builders are relying on agentic code generation for backend work, but production backend work is defined by structural compliance—framework conventions, architectural boundaries, and database correctness—at least as much as by endpoint behavior.

The study’s core thesis is that benchmarks focused only on functionality can systematically understate fragility when generated systems must obey real structural requirements. That’s not an abstract concern; it changes what “done” means for agent output. The secondary coverage (e.g., The Neural Feed and AgentPatterns.ai) emphasizes this dual-evaluation angle and the framework sensitivity, helping push the conversation from “can it produce code?” to “can it produce a system that fits constraints?”

If you’re already feeling this in practice (agents producing patches that balloon review load, break conventions, or fail on DB interactions), the constraint-decay framing gives you something measurable to optimize against. It also connects to a broader shift toward spec-driven checks and gated automation in codegen workflows (see “Spec-driven LLM coding and Claude Code plugins: practical moves for solo AI builders”).

## How a solo builder can measure constraint decay locally

You can mirror the study’s core measurement idea without reproducing the full dataset:

1. Start with a small suite (10–20 tasks) of multi-file backend changes. Include at least two frameworks: one minimal/explicit and one convention-heavy, because the paper shows performance depends strongly on framework norms.
2. Keep the functional contract fixed across variants (the study uses a unified API contract specifically to isolate structural effects). The point is to avoid confusing “harder behavior” with “more structure.”
3. Add constraints incrementally: file layout rules, naming conventions, architectural boundaries (e.g., where business logic can live), and data-layer requirements (ORM patterns, schema expectations, transaction handling).
4. Evaluate in two channels each time:
   - End-to-end runtime assertions that exercise APIs and database interactions.
   - Static checks that enforce the structural rules you just introduced.

Then track success rates as you tighten constraints. The metric you care about is not “did it compile,” but “how sharply do pass rates fall as structural requirements accumulate, and which class of constraints causes the steepest drop?” The paper’s results suggest you should expect especially visible breakage around ORM usage and other data-layer rules.
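
The tracking step above can be sketched as a small aggregation. This assumes each run is recorded as a tuple of constraint level, task id, runtime result, and static result. The level names and their ordering are placeholders for whatever constraint ladder you define, not the paper's taxonomy.

```python
from collections import defaultdict

# Assumed constraint ladder, from loosest to tightest (illustrative names).
LEVELS = ["baseline", "layout", "architecture", "data_layer"]


def pass_rates(results):
    """Fraction of tasks per level that pass BOTH channels.

    `results` is an iterable of (level, task_id, runtime_ok, static_ok).
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for level, _task, runtime_ok, static_ok in results:
        totals[level] += 1
        passed[level] += int(runtime_ok and static_ok)
    return {lvl: passed[lvl] / totals[lvl] for lvl in LEVELS if totals[lvl]}


def steepest_drop(rates):
    """Return (level, drop) for the largest fall versus the previous level.

    Requires rates for at least two levels.
    """
    levels = [lvl for lvl in LEVELS if lvl in rates]
    drops = [(cur, rates[prev] - rates[cur]) for prev, cur in zip(levels, levels[1:])]
    return max(drops, key=lambda d: d[1])
```

Here a task counts as passing only when both channels pass, which mirrors the dual-evaluation idea. Reporting the two channels separately is an equally reasonable choice if you want to see which one breaks first.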

This measurement mindset also complements the operational reality that agent outputs are flooding downstream systems; validation and costed iteration loops are becoming the product (see “Agents are flooding issue trackers — rethink agent outputs, validation, and costed caching”).

## Practical mitigations: reduce cross-file state, increase verification pressure

The paper’s implication is not “don’t use agents,” but “don’t ask for a fully constrained multi-file backend in one shot and hope.” For a builder, the consequence is a workflow change: you need tighter feedback loops and fewer simultaneous constraints per generation step.

Three mitigations follow directly from the failure modes and evaluation design:

- Spec-first repos: check in machine-readable contracts and executable tests so the agent can’t treat structure as prose. The study’s fixed API contract is a hint: contracts reduce ambiguity, but only if you enforce them.
- Verification-in-the-loop: run static structural checks and runtime assertions repeatedly, and feed failures back as repair tasks. Dual evaluation is the point of the paper; adopt it operationally.
- Decompose generation: scaffold modules and interfaces first, then fill them. You’re reducing the amount of cross-file consistency the agent must maintain at once.
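
Verification-in-the-loop can be sketched as a bounded generate, check, repair cycle. This is an illustrative outline, not a toolchain from the paper: `generate` stands in for whatever agent call you use, and each check is a hypothetical callable that returns a list of failure messages for the generated code.

```python
from typing import Callable

# A check takes generated code and returns failure messages (empty = pass).
Check = Callable[[str], list[str]]


def generate_with_repair(
    generate: Callable[[str, list[str]], str],
    checks: list[Check],
    spec: str,
    max_rounds: int = 3,
) -> tuple[str, list[str]]:
    """Regenerate until every check passes or the round budget is spent.

    Failures from the previous round are passed back to `generate`
    so the next attempt is framed as a repair task.
    """
    failures: list[str] = []
    code = ""
    for _ in range(max_rounds):
        code = generate(spec, failures)
        failures = [msg for check in checks for msg in check(code)]
        if not failures:
            break
    return code, failures
```

The round budget matters: without it, a constraint the agent cannot satisfy turns into an unbounded and costly loop. Any failures returned at the end are the ones to hand to a human reviewer.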

## Architectural options that make constraints cheaper to satisfy

The framework sensitivity result suggests a pragmatic architectural thesis: make the “implicit” explicit.

If convention-heavy frameworks amplify constraint decay, you can reduce fragility by introducing clearer boundaries: thin adapters around the framework surface, explicit service/data interfaces, and a deliberately simple data-access pattern that avoids “hidden” ORM magic. This doesn’t eliminate constraints; it makes them checkable and local, which is exactly what dual evaluation rewards.
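
One way to picture an explicit service/data interface is a small protocol that the service layer depends on instead of the framework or ORM. Every name below (`User`, `UserRepository`, `InMemoryUserRepository`, `register_user`) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class User:
    id: int
    email: str


class UserRepository(Protocol):
    """Explicit data-access boundary (hypothetical interface)."""

    def get(self, user_id: int) -> Optional[User]: ...
    def add(self, email: str) -> User: ...


class InMemoryUserRepository:
    """Test double that satisfies the protocol without any ORM."""

    def __init__(self) -> None:
        self._users: dict[int, User] = {}

    def get(self, user_id: int) -> Optional[User]:
        return self._users.get(user_id)

    def add(self, email: str) -> User:
        user = User(id=len(self._users) + 1, email=email)
        self._users[user.id] = user
        return user


def register_user(repo: UserRepository, email: str) -> User:
    """Service-layer function: depends on the interface, not the framework."""
    return repo.add(email.strip().lower())
```

With a boundary like this, the rule "business logic never touches the ORM directly" becomes something a static check can verify per file, and an ORM-backed implementation can be swapped in behind the same interface.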

The paper also motivates multi-step agent architectures: separate “generate” from “verify” from “repair.” The research doesn’t prescribe a toolchain, but its methodology makes the direction clear—agents improve when structural compliance is continuously tested rather than assumed.

## Trade-offs: what you shouldn’t expect

Constraint decay is, by definition, about the gap between plausible code and code that satisfies many interacting rules. Verification and decomposition reduce that gap, but they don’t erase it—especially when correctness depends on tacit framework norms and subtle data-layer invariants.

Two practical expectations to set from the paper’s findings:

- You’ll likely get more automation mileage in minimal/explicit frameworks than in convention-heavy ones.
- Data-layer correctness is a persistent hotspot; treat ORM and schema/transaction constraints as first-class citizens in your checks, not “we’ll catch it later.”
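
As an example of a first-class data-layer check, the sketch below asserts that a multi-statement write is atomic. It uses Python's built-in `sqlite3` as a stand-in for whatever database or ORM you actually run, and the `accounts` schema and `transfer` function are assumptions made up for illustration.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical accounts table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "id INTEGER PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 0)])
    conn.commit()
    return conn


def transfer(conn: sqlite3.Connection, src: int, dst: int, amount: int) -> None:
    """Credit `dst`, then debit `src`, as one transaction."""
    with conn:  # commits on success, rolls back if the block raises
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
        )
        # Overdrawing violates the CHECK constraint and raises IntegrityError,
        # which must also undo the credit above.
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
        )


def balances(conn: sqlite3.Connection) -> dict:
    """Return {account_id: balance} for every account."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))
```

A runtime assertion for this would attempt an overdrawn transfer, expect `sqlite3.IntegrityError`, and then confirm that the balances are unchanged. Generated code that drops the transaction wrapper would leave the credit applied and fail that assertion.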

## What to Watch

Watch for three developments implied by this work:

- More public benchmarks that combine behavioral testing with structural/static verification (the study’s dual evaluation is the main methodological contribution).
- Tooling that embeds static verifiers inside the generation loop, turning “structural compliance” into an iterative repair process rather than a final review step.
- Framework-specific performance reporting and adapters/plugins that explicitly encode conventions, because the paper’s framework sensitivity suggests that “general codegen skill” won’t automatically translate into Django/FastAPI reliability.

Sources: arxiv.org · semanticscholar.org · theneuralfeed.com · agentpatterns.ai · alphaxiv.org · medium.com

