The Black Box Problem: Why AI-Generated Code Stops Being Maintainable



A Pattern Across Teams

A pattern is forming across engineering teams that adopted AI coding tools in the last year. The first month is euphoric. Velocity doubles, features ship faster, stakeholders are thrilled. By month three, a different metric starts climbing: the time it takes to safely change anything that was generated.

The code itself keeps getting better: improved models produce more correct, more complete output with larger context windows. And yet the teams generating the most code are increasingly the ones requesting the most rewrites.

None of it makes sense until you look at structure.

A developer opens a module that was generated in a single AI session. Could be 200 lines, maybe 600, the length doesn’t matter. They realize the only thing that understood the relationships in this code was the context window that produced it. The function signatures don’t document their assumptions. Three services call each other in a specific order, but the reason for that ordering exists nowhere in the codebase. Every change requires full comprehension and deep review. That’s the black box problem.

What Makes AI-Generated Code a Black Box

AI-generated code isn’t bad code. But it has tendencies that become problems fast:

  • Everything in one place. AI has a strong bias toward monoliths and choosing the fast path. Ask for “a checkout page” and you’ll get cart rendering, payment processing, form validation, and API calls in a single file. It works, but it’s one unit. You can’t review, test, or change any part without dealing with all of it.
  • Circular and implicit dependencies. AI wires things together based on what it saw in the context window. Service A calls service B because they were in the same session. That coupling isn’t declared anywhere. Worse, AI often creates circular dependencies (A depends on B, B depends on A) because it doesn’t track the dependency graph across files. A few weeks later, removing B breaks A, and nobody knows why.
  • No contracts. Well-engineered systems have typed interfaces, API schemas, explicit boundaries. AI skips this. The “contract” is whatever the current implementation happens to do. Everything works until you need to change one piece.
  • Documentation that explains the implementation, not the usage. AI generates thorough descriptions of what the code does internally. What’s missing is the other side: usage examples, how to consume it, what depends on it, how it connects to the rest of the system. A developer reading the docs can understand the implementation but still has no idea how to actually use the component or what breaks if they change its interface.
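The “no contracts” point is the easiest to make concrete. Here is a minimal TypeScript sketch; the interface and stub names are illustrative, not from any real codebase:

```typescript
// Black-box version: the "contract" is whatever the implementation accepts.
//   function sendEmail(user: any, data: any): Promise<any> { ... }

// Composable version: a declared contract that consumers (and reviewers)
// can rely on, independent of any implementation.
export interface NotificationPayload {
  recipientId: string;
  subject: string;
  body: string;
}

export interface SendResult {
  delivered: boolean;
  providerMessageId?: string;
}

export interface NotificationChannel {
  send(payload: NotificationPayload): Promise<SendResult>;
}

// A stub that satisfies the contract; a real provider would wrap an SDK.
export const stubChannel: NotificationChannel = {
  async send(payload) {
    return { delivered: true, providerMessageId: `stub-${payload.recipientId}` };
  },
};
```

With the contract declared, swapping the email provider means writing a new `NotificationChannel` implementation, not editing the file that also happens to handle push notifications.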

A concrete example

Consider two ways an AI might generate a user notification system:

Unstructured generation produces a single module:

notifications/
├── index.ts          # 600 lines: templates, sending logic,
│                     #   user preferences, delivery tracking,
│                     #   retry logic, analytics events
├── helpers.ts        # Shared utilities (used by... everything?)
└── types.ts          # 40 interfaces, unclear which are public

Result: 1 file to understand everything. 1 file to change anything.

Dependencies are imported directly. Changing the email provider means editing the same file that handles push notifications. Testing requires mocking the entire system. A new developer needs to read all 600 lines to understand any single behavior.

Structured generation decomposes the same functionality:

notifications/
├── templates/        # Template rendering (pure functions, independently testable)
├── channels/         # Email, push, SMS, each with declared interface
├── preferences/      # User preference storage and resolution
├── delivery/         # Send logic with retry, depends on channels/
└── tracking/         # Delivery analytics, depends on delivery/

Result: 5 independent surfaces. Change one without reading the others.

Each subdomain declares its dependencies explicitly. Consumers import typed interfaces, not implementations. You can test, replace, or modify each piece on its own. A new developer can understand preferences/ without ever opening delivery/. The dependency graph is inspectable, so you don’t have to reconstruct it from scattered import statements.
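To see why this matters for testing, here is a hedged sketch of what delivery/ might look like: it depends on a channel interface, so a test can substitute a fake instead of mocking the whole system. All names are illustrative.

```typescript
// delivery/ depends only on the channel interface, never on a concrete provider.
interface Payload { recipientId: string; body: string; }
interface Channel { send(p: Payload): Promise<{ delivered: boolean }>; }

// Retry logic, parameterized by the channel it delivers through.
async function deliverWithRetry(
  channel: Channel,
  payload: Payload,
  maxAttempts = 3,
): Promise<boolean> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = await channel.send(payload);
    if (result.delivered) return true;
  }
  return false;
}

// In a test, a fake channel stands in for email/push/SMS.
// This one fails a fixed number of times, then succeeds.
const flakyChannel = (failures: number): Channel => {
  let calls = 0;
  return {
    async send() {
      calls++;
      return { delivered: calls > failures };
    },
  };
};
```

The retry behavior is now testable in complete isolation; nothing about email providers, user preferences, or analytics needs to exist for the test to run.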

Both implementations produce identical runtime behavior. The difference is entirely structural. And that structural difference is what determines whether the system is still maintainable a few months out.

The same notification system, two architectures. Unstructured generation couples everything into a single module. Structured generation decomposes into independent components with explicit, one-directional dependencies. Image by the author.

The Composability Principle

What separates these two outcomes is composability: building systems from components with well-defined boundaries, declared dependencies, and isolated testability.

None of this is new. Component-based architecture, microservices, microfrontends, plugin systems, module patterns. They all express some version of composability. What’s new is scale: AI generates code faster than anyone can manually structure it.

Composable systems have specific, measurable properties:

| Property | ✅ Composable (Structured) | 🛑 Black Box (Unstructured) |
| --- | --- | --- |
| Boundaries | Explicit (declared per component) | Implicit (convention, if any) |
| Dependencies | Declared and validated at build time | Hidden in import chains |
| Testability | Each component testable in isolation | Requires mocking the world |
| Replaceability | Safe (interface contract preserved) | Risky (unknown downstream effects) |
| Onboarding | Self-documenting via structure | Requires archaeology |

Here’s what matters: composability isn’t a quality attribute you add after generation. It’s a constraint that must exist during generation. If the AI generates into a flat directory with no constraints, the output will be unstructured regardless of how good the model is.

Most current AI coding workflows fall short here. The model is capable, but the target environment gives it no structural feedback. So you get code that runs but has no architectural intent.

What Structural Feedback Looks Like

So what would it take for AI-generated code to be composable by default?

It comes down to feedback, specifically structural feedback from the target environment during generation, not after.

When a developer writes code, they get signals: type errors, test failures, linting violations, CI checks. Those signals constrain the output toward correctness. AI-generated code typically gets none of this during generation. It’s produced in a single pass and evaluated after the fact, if at all.

What changes when the generation target provides real-time structural signals?

  • “This component has an undeclared dependency”, forcing explicit dependency graphs
  • “This interface doesn’t match its consumer’s expectations”, enforcing contracts
  • “This test fails in isolation”, catching hidden coupling
  • “This module exceeds its declared boundary”, preventing scope creep or cyclic dependencies
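The cyclic-dependency signal in particular is mechanical once dependencies are declared. A minimal sketch of the check, assuming components declare their dependencies as a simple graph (real tools like Bit, Nx, or madge do far more than this):

```typescript
// Map each component name to the components it declares as dependencies.
type DepGraph = Record<string, string[]>;

// Depth-first search with a "currently visiting" set: revisiting a node
// on the current path means the declared graph contains a cycle.
function findCycle(graph: DepGraph): string[] | null {
  const visiting = new Set<string>();
  const done = new Set<string>();
  const path: string[] = [];

  const visit = (node: string): string[] | null => {
    if (done.has(node)) return null;
    if (visiting.has(node)) return [...path.slice(path.indexOf(node)), node];
    visiting.add(node);
    path.push(node);
    for (const dep of graph[node] ?? []) {
      const cycle = visit(dep);
      if (cycle) return cycle;
    }
    path.pop();
    visiting.delete(node);
    done.add(node);
    return null;
  };

  for (const node of Object.keys(graph)) {
    const cycle = visit(node);
    if (cycle) return cycle;
  }
  return null;
}
```

Run during generation, a check like this rejects “A depends on B depends on A” the moment it appears, rather than weeks later when someone tries to remove B.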

Tools like Bit and Nx already provide these signals to human developers. The shift is providing them during generation, so the AI can correct course before the structural damage is done.

In my work at Bit Cloud, we’ve built this feedback loop into the generation process itself. When our AI generates components, each one is validated against the platform’s structural constraints in real time: boundaries, dependencies, tests, typed interfaces. The AI doesn’t get to produce a 600-line module with hidden coupling, because the environment rejects it before it’s committed. That’s architecture enforcement at generation time.

Structure has to be a first-class constraint during generation, not something you review afterward.

The Real Question: How Fast Can You Get to Production and Stay in Control?

We tend to measure AI productivity by generation speed. But the question that actually matters is: how fast can you go from AI-generated code to production and still be able to change things next week?

That breaks down into a few concrete problems. Can you review what the AI generated? Not just read it, actually review it, the way you’d review a pull request. Can you understand the boundaries, the dependencies, the intent? Can a teammate do the same?

Then: can you ship it? Does it have tests? Are the contracts explicit enough that you trust it in production? Or is there a gap between “it works locally” and “we can deploy this”?

And after it’s live: can you keep changing it? Can you add a feature without re-reading the whole module? Can a new team member make a safe change without archaeology?

If AI saves you 10 hours writing code but you spend 40 getting it to production-quality, or you ship it fast but lose control of it a month later, you haven’t gained anything. The debt starts on day two and it compounds.

The teams that actually move fast with AI are the ones who can answer yes to all three: reviewable, shippable, changeable. That’s not about the model. It’s about what the code lands in.

Practical Implications

For code you’re generating now

Treat every AI generation as a boundary decision. Before prompting, define: what is this component responsible for? What does it depend on? What is its public interface? These constraints in the prompt produce better output than open-ended generation. You’re giving the AI architectural intent, not just functional requirements.
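One way to make that boundary decision concrete is to capture it as data before prompting, then check the generated output against it. A hypothetical sketch; none of these field names come from any tool:

```typescript
// The boundary decision, written down before generation starts.
interface ComponentSpec {
  name: string;
  responsibility: string;     // one sentence, one responsibility
  dependsOn: string[];        // declared upfront
  publicInterface: string[];  // exported names consumers may rely on
}

const spec: ComponentSpec = {
  name: "notifications/preferences",
  responsibility: "Resolve a user's notification preferences per channel",
  dependsOn: ["notifications/types"],
  publicInterface: ["resolvePreferences", "PreferenceSet"],
};

// After generation, imports found in the output can be diffed against the spec:
// anything not declared is a boundary violation to fix or to justify.
function undeclaredImports(spec: ComponentSpec, found: string[]): string[] {
  return found.filter((imp) => !spec.dependsOn.includes(imp));
}
```

Even without tooling support, pasting the spec into the prompt gives the AI architectural intent; with tooling support, the same spec becomes an enforceable check.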

For systems you’ve already generated

Audit for implicit coupling. The highest-risk code isn’t code that doesn’t work, it’s code that works but can’t be maintained. Look for modules with mixed responsibilities, circular dependencies, components that can’t be tested without spinning up the full application. Pay special attention to code generated in a single AI session. You can also leverage AI for wide reviews on specific standards you care about.

For choosing tools and platforms

Evaluate AI coding tools by what happens after generation. Can you review the output structurally? Are dependencies declared or inferred? Can you test a single generated unit in isolation? Can you inspect the dependency graph? The answers determine whether you’ll get to production fast and stay in control, or get there fast and lose it.

Conclusion

AI-generated code isn’t the problem. Unstructured AI-generated code is.

The black box problem is solvable, but not by better prompting alone. It requires generation environments that enforce structure: explicit component boundaries, validated dependency graphs, per-component testing, and interface contracts.

What that looks like in practice: a single product description in, hundreds of tested, governed components out. That’s the subject of a follow-up article.

The black box is real. But it’s an environment problem, not an AI problem. Fix the environment, and the AI generates code you can actually ship and maintain.


Yonatan Sason is co-founder at Bit Cloud, where his team builds infrastructure for structured AI-assisted development. Yonatan has spent the last decade working on component-based architecture and the last two years applying it to AI-generated platforms. The patterns in this article come from that work.

Bit is open source. For more on composable architecture and structured AI generation, visit bit.dev.

The owner of Towards Data Science, Insight Partners, also invests in Bit Cloud. As a result, Bit Cloud receives preference as a contributor. 
