Do You Smell That? Hidden Technical Debt in AI Development

Some problems in a codebase don’t announce themselves; you won’t “smell” them at first. In practice, code smells are warning signs that suggest future problems. The code may work today, but its structure hints that it will become hard to maintain, test, scale, or secure. Smells are not necessarily bugs; they’re indicators of design debt and long-term product risk.

These smells typically manifest as slower delivery and higher change risk, more frequent regressions and production incidents, and less reliable AI/ML outcomes, often driven by leakage, bias, or drift that undermines evaluation and generalization.

The Path from Prototype to Production

The phases in the development of data/AI products vary, but they usually follow a similar path. Typically, we start with a prototype: an idea is sketched first, followed by a small implementation to demonstrate value. Tools like Streamlit, Gradio, or n8n can be used to present a very simple concept using synthetic data. In these cases, you avoid using sensitive real data and reduce privacy and security concerns, which matters especially in large, privacy‑sensitive, or highly regulated companies.
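For illustration, a prototype at this stage can be as small as the following Streamlit sketch over purely synthetic data; the churn-risk framing and column names are made up for the example.

```python
import numpy as np
import pandas as pd
import streamlit as st

# Synthetic data only: no real or sensitive records in the prototype.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "customer_segment": rng.choice(["A", "B", "C"], size=200),
    "predicted_churn_risk": rng.uniform(0.0, 1.0, size=200),
})

st.title("Churn-risk prototype (synthetic data)")
segment = st.selectbox("Customer segment", sorted(df["customer_segment"].unique()))
st.bar_chart(df.loc[df["customer_segment"] == segment, "predicted_churn_risk"])
```

Saved as a single script and launched with `streamlit run app.py`, this is enough to demonstrate the idea to stakeholders without touching real data.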

Later, you move to the PoC, where you use a sample of real data and go deeper into the features while working closely with the business. After that, you move toward productization, building an MVP that evolves as you validate and capture business value.

Most of the time, prototypes and PoCs are built quickly, and AI makes it even faster to deliver them. The problem is that this code rarely meets production standards. Before it can be robust, scalable, and secure, it usually needs refactoring across engineering (structure, readability, testing, maintainability), security (access control, data protection, compliance), and ML/AI quality (evaluation, drift monitoring, reproducibility).

Typical smells you see … or not 🫥

This hidden technical debt (often visible as code smells) is easy to overlook when teams chase quick wins, and “vibe coding” can amplify it. As a result, you can run into issues such as:

  • Duplicated code: same logic copied in multiple places, so fixes and changes become slow and inconsistent over time.
  • God script / god function: one huge file or function does everything, making the system hard to understand, test, review, and change safely because everything is tightly coupled. This violates the Single Responsibility Principle [1]. In the agent era, the “god agent” pattern shows up, where a single agent entrypoint handles routing, retrieval, prompting, actions, and error handling all in one place.
  • Rule sprawl: behavior grows into long if/elif chains for new cases and exceptions, forcing repeated edits to the same core logic and increasing regressions. This violates the Open–Closed Principle (OCP): you keep modifying the core instead of extending it [1]. I’ve seen this early in agent development, where intent routing, lead-stage handling, country-specific rules, and special-case exceptions quickly accumulate into long conditional chains (a refactoring sketch follows this list).
  • Hard-coded values: paths, thresholds, IDs, and environment-specific details are embedded in code, so changes require code edits across multiple places instead of simple configuration updates.
  • Poor project structure (or folder layout): application logic, orchestration, and platform configuration live together, blurring boundaries and making deployment and scaling harder.
  • Hidden side effects: functions do extra work you don’t expect (mutating shared state, writing files, background updates), so outcomes depend on execution order and bugs become hard to trace.
  • Lack of tests: there are no automated checks to catch drift after code, prompt, config, or dependency changes, so behavior can change silently until systems break. (Sadly, not everyone realizes that tests are cheap, and bugs are not).
  • Inconsistent naming & structure: makes the code harder to understand, slows reviews and onboarding, and makes maintenance depend on the original author.
  • Hidden/overwritten rules: behavior depends on untested, non-versioned, or loosely managed inputs such as prompts, templates, settings, etc. As a result, behavior can change or be overwritten without traceability.
  • Security gaps (missing protections): Things like input validation, permissions, secret handling, or PII controls are often skipped in early stages.
  • Buried legacy logic: old code such as pipelines, helpers, utilities, etc. remains scattered across the codebase long after the product has changed. The code becomes harder to trust because it encodes outdated assumptions, duplicated logic, and dead paths that still run (or quietly rot) in production.
  • Blind operations (no alerting / no detection): failures aren’t noticed until a user complains, someone manually checks the CloudWatch logs, or a downstream job breaks. Logs may exist, but nobody is actively monitoring the signals that matter, so incidents can run unnoticed. This often happens when external systems change outside the team’s control, or when too few people understand the system or the data.
  • Leaky integrations: business logic depends on specific API/SDK details (field names, required parameters, error codes), so small vendor changes force scattered fixes across the codebase instead of one change in an adapter. This violates the Dependency Inversion Principle (DIP) [1].
  • Environment drift (staging ≠ production): teams have dev/staging/prod environments, but staging is not truly production-like (different configs, permissions, or dependencies). This creates false confidence: everything looks fine before release, but real issues only appear in prod, often ending in a rollback.

And the list goes on… and on. 
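To make the rule-sprawl smell concrete, here is a minimal Python sketch of the usual fix, with illustrative intent names and handlers that are not taken from any real system: intents are routed through a registry, so new cases are added by registering a handler instead of editing a growing if/elif chain in the core.

```python
from typing import Callable, Dict

# Registry of intent handlers: adding a new intent means registering a new
# function, not editing the core routing logic (Open-Closed Principle).
HANDLERS: Dict[str, Callable[[dict], str]] = {}

def handles(intent: str):
    """Decorator that registers a handler for a given intent."""
    def register(func: Callable[[dict], str]) -> Callable[[dict], str]:
        HANDLERS[intent] = func
        return func
    return register

@handles("lead_qualification")
def qualify_lead(payload: dict) -> str:
    # Illustrative placeholder logic.
    return f"Qualifying lead {payload.get('lead_id')}"

@handles("country_rules")
def apply_country_rules(payload: dict) -> str:
    return f"Applying rules for {payload.get('country')}"

def route(intent: str, payload: dict) -> str:
    """Core routing stays stable; unknown intents fail loudly."""
    try:
        return HANDLERS[intent](payload)
    except KeyError:
        raise ValueError(f"No handler registered for intent: {intent}")

print(route("lead_qualification", {"lead_id": 42}))
```

The same idea helps with the god-agent pattern: the entrypoint only dispatches, while each capability lives in its own small, testable unit.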

The problem isn’t that prototypes are bad. The problem is the gap between prototype speed and production responsibility, when teams, for one reason or another, don’t invest in the practices that make systems reliable, secure, and able to evolve.

It’s also useful to extend the idea of “code smells” into model and pipeline smells: warning signs that the system may be producing confident but misleading results, even when aggregate metrics look great. Common examples include fairness gaps (subgroup error rates are consistently worse), spillover/leakage (evaluation accidentally includes future or relational information that won’t exist at decision time, creating a dev/prod mismatch [7]), and multicollinearity (correlated features that make coefficients and explanations unstable). These aren’t academic edge cases; they reliably predict downstream failures like weak generalization, unfair outcomes, untrustworthy interpretations, and painful production drops.
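As a small illustration of how one of these smells can be surfaced, the sketch below compares subgroup error rates instead of trusting a single aggregate metric; the `segment` column, the toy data, and the 0.10 tolerance are all illustrative.

```python
import pandas as pd

# Toy predictions: aggregate accuracy can look fine while one subgroup
# is consistently worse (a fairness-gap smell).
df = pd.DataFrame({
    "segment": ["A", "A", "A", "B", "B", "B", "B", "B"],
    "y_true":  [1,   0,   1,   1,   0,   1,   0,   1],
    "y_pred":  [1,   0,   1,   0,   1,   0,   0,   1],
})

df["error"] = (df["y_true"] != df["y_pred"]).astype(int)

overall_error = df["error"].mean()
per_segment_error = df.groupby("segment")["error"].mean()

print(f"Overall error rate: {overall_error:.2f}")
print(per_segment_error)

# Simple guardrail: flag segments whose error rate exceeds the overall
# rate by more than an agreed tolerance (the threshold is illustrative).
gap = per_segment_error - overall_error
flagged = gap[gap > 0.10]
if not flagged.empty:
    print("Fairness-gap warning for segments:", list(flagged.index))
```

A check like this belongs in the evaluation suite, not in a one-off notebook, so it runs on every model or data change.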

If every developer independently solves the same problem in a different way (without a shared standard), it’s like having multiple remotes (each with different behaviors) for the same TV. Software engineering principles still matter in the vibe-coding era. They’re what make code reliable, maintainable, and safe to use as the foundation for real products. 

Now, the practical question is how to reduce these risks without slowing teams down.

Why AI Accelerates Code Smells

AI code generators don’t automatically know what matters most in your codebase. They generate outputs based on patterns, not your product or business context. Without clear constraints and tests, you can end up with five minutes of “code generation” followed by a hundred hours of debugging ☠️. 

Used carelessly, AI can even make things worse:

  • It oversimplifies or removes important parts.
  • It adds noise: unnecessary or duplicated code and verbose comments.
  • It loses context in large codebases (“lost in the middle” behavior).

A recent MIT Sloan article notes that generative AI can speed up coding, but it can also make systems harder to scale and improve over time when fast prototypes quietly harden into production systems [4].

Either way, refactors aren’t cheap, whether the code was written by humans or produced by misused AI, and the cost usually shows up later as slower delivery, painful maintenance, and constant firefighting. In my experience, both often share the same root cause: weak software engineering fundamentals.

Some of the worst smells aren’t technical at all; they’re organizational. Teams may ignore minor debt 😪 because it doesn’t hurt immediately, but the hidden cost shows up later: ownership and standards don’t scale. When the original authors leave, get promoted, or simply move on, poorly structured code gets handed to someone else 🫩 without shared conventions for readability, modularity, tests, or documentation. The result is predictable: maintenance turns into archaeology, delivery slows down, risk increases, and the person who inherits the system often inherits the blame too.

Checklists: a summary of recommendations

This is a complex topic that benefits from senior engineering judgment. A checklist won’t replace platform engineering, application security, or experienced reviewers, but it can reduce risk by making the basics consistent and harder to skip.

1. The missing piece: “Problem-first” design

A “design-first / problem-first” mindset means that before building a data product or AI system (or continuously piling features into prompts or if/else rules), you clearly define the problem, constraints, and failure modes. And this is not only about product design (what you build and why), but also software design (how you build it and how it evolves). That combination is hard to beat.

It’s also important to remember that technology teams (AI/ML engineers, data scientists, QA, cybersecurity, and platform professionals) are part of the business, not a separate entity. Too often, highly technical roles are seen as disconnected from broader business concerns. This remains a challenge for some business leaders, who may view technical experts as know-it-alls rather than professionals, a perception that is not always fair [2].

2. Code Guardrails: Quality, Security, and Behavior Drift Checks

In practice, technical debt grows when quality depends on people “remembering” standards. Checklists make expectations explicit, repeatable, and scalable across teams, but automated guardrails go further: you can’t merge code into production unless the basics are true. This guarantees a minimum baseline of quality and security on every change.

Automated checks help stop the most common prototype problems from slipping into production. In the AI era, where code can be generated faster than it can be reviewed, code guardrails act like a seatbelt by enforcing standards consistently. A practical approach is to run checks as early as possible, not only in CI. For example, Git hooks, especially pre-commit hooks, can run validations before code is even committed [5]. Then CI pipelines run the full suite on every pull request, and branch protection rules can require those checks to pass before a merge is allowed, so the quality baseline holds even when someone skips a step locally.

A solid baseline usually includes:

  • Linters (e.g., ruff): enforce consistent style and catch common issues (unused imports, undefined names, suspicious patterns).
  • Tests (e.g., pytest): prevent silent behavior changes by checking that key functions and pipelines still behave as expected after code or config edits.
  • Secrets scanning (e.g., Gitleaks): blocks accidental commits of tokens, passwords, and API keys (often hardcoded in prototypes).
  • Dependency scanning (e.g., Dependabot / OSV): flags vulnerable packages early, especially when prototypes pull in libraries quickly.
  • LLM evals (e.g., prompt regression): if prompts and model settings affect behavior, treat them like code by testing inputs and expected outputs to catch drift [6] (see the sketch below).
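For the LLM-eval item, a minimal prompt-regression sketch could look like the following (pytest style); `generate_answer` is a hypothetical wrapper around whatever model client and prompt template the team actually uses, and the golden cases are illustrative.

```python
import pytest

def generate_answer(prompt: str) -> str:
    """Hypothetical wrapper around the team's LLM client and prompt template."""
    raise NotImplementedError("Replace with the real model call.")

# Golden cases: inputs paired with fragments the answer must contain.
# Kept in version control so prompt and model changes are reviewed like code.
GOLDEN_CASES = [
    ("What is our refund window?", "30 days"),
    ("Which currencies do you support?", "EUR"),
]

@pytest.mark.parametrize("prompt,expected_fragment", GOLDEN_CASES)
def test_prompt_regression(prompt, expected_fragment):
    answer = generate_answer(prompt)
    assert expected_fragment.lower() in answer.lower(), (
        f"Expected {expected_fragment!r} in the answer for prompt {prompt!r}"
    )
```

Because the cases live in the repository, any prompt, template, or model-version change that breaks them fails the pull request instead of failing in production.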

This is the short list, but teams often add additional guardrails as systems mature, such as type checking to catch interface and “None” bugs early, static security analysis to flag risky patterns, coverage and complexity limits to prevent untested code, and integration tests to detect breaking changes between services. Many also include infrastructure-as-code and container image scanning to catch insecure cloud settings, plus data quality and model/LLM monitoring to detect schema and behavior drift, among others.

How this helps

AI-generated code often includes boilerplate, leftovers, and risky shortcuts. Guardrails like linters (e.g., Ruff) catch predictable issues fast: messy imports, dead code, noisy diffs, risky exception patterns, and common Python footguns. Scanning tools help prevent accidental secret leaks and vulnerable dependencies, and tests and evals make behavior changes visible by running test suites and prompt regressions on every pull request before production. The result is faster iteration with fewer production surprises.

Release guardrails

Beyond pull request (PR) checks, teams also use a staging environment as a lifecycle guardrail: a production-like setup with controlled data to validate behavior, integrations, and cost before release.

3. Human guardrails: shared standards and explainability

Good engineering practices such as code reviews, pair programming, documentation, and shared team standards reduce the risks of AI-generated code. A common failure mode in vibe coding is that the author can’t clearly explain what the code does, how it works, or why it should work. In the AI era, it’s essential to articulate intent and value in plain language and document decisions concisely, rather than relying on verbose AI output. This isn’t about memorizing syntax; it’s about design, good practices, and a shared learning discipline, because the only constant is change.

4. Responsible AI by Design 

Guardrails aren’t only code style and CI checks. For AI systems, you also need guardrails across the full lifecycle, especially when a prototype becomes a real product. A practical approach is a “Responsible AI by Design” checklist covering minimum controls from data preparation to deployment and governance.

At a minimum, it should include:

  • Data preparation: privacy protection, data quality controls, bias/fairness checks.
  • Model development: business alignment, explainability, robustness testing.
  • Experiment tracking & versioning: reproducibility through dataset, code, and model version control.
  • Model evaluation: stress testing, subgroup analysis, uncertainty estimation where relevant.
  • Deployment & monitoring: monitor drift/latency/reliability separately from business KPIs; define alerts and retraining rules (see the drift sketch after this list).
  • Governance & documentation: audit logs, clear ownership, and standardized documentation for approvals, risk analysis, and traceability.
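As one concrete example for the monitoring item, drift on a single numeric feature can be tracked with the population stability index (PSI). In the sketch below, the bin count and the 0.2 alert threshold are common rules of thumb rather than universal constants, and the data is simulated.

```python
import numpy as np

def population_stability_index(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI between a reference (training) sample and a live sample of one feature."""
    edges = np.histogram_bin_edges(expected, bins=bins)  # bins from the reference data
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip to avoid division by zero and log(0) in empty bins.
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)
live_feature = rng.normal(loc=0.4, scale=1.2, size=5_000)  # shifted: simulated drift

psi = population_stability_index(train_feature, live_feature)
print(f"PSI = {psi:.3f}")
if psi > 0.2:  # common rule-of-thumb alert threshold
    print("Drift alert: investigate before trusting new predictions or retraining blindly.")
```

The same pattern extends to prediction distributions and key business segments; what matters is that the check runs on a schedule and raises an alert, not that it uses PSI specifically.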

The one-pager of figure 1 is only a first step. Use it as a baseline, then adapt and expand it with your expertise and your team’s context.

Figure 1. End-to-end AI practice checklist covering bias and fairness, privacy, data quality, evaluation, monitoring, and governance. Image by Author.

5. Adversarial testing

There is extensive literature on adversarial inputs. In practice, teams can test robustness by introducing inputs (in LLMs and classic ML) the system never encountered during development (malformed payloads, injection-like patterns, extreme lengths, weird encodings, edge cases). The key is cultural: adversarial testing must be treated as a normal part of development and application security, not a one-off exercise.
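One lightweight way to start is to keep a versioned set of hostile inputs and run them like any other test suite. The sketch below assumes a hypothetical `handle_request` entrypoint, and the cases are illustrative.

```python
import pytest

def handle_request(payload: str) -> str:
    """Hypothetical system entrypoint (API handler, agent loop, or model wrapper)."""
    raise NotImplementedError("Replace with the real entrypoint.")

# Versioned adversarial cases: malformed payloads, injection-like patterns,
# extreme lengths, and odd encodings the system never saw during development.
ADVERSARIAL_INPUTS = [
    "",                                        # empty input
    "{not: valid json",                        # malformed payload
    "Ignore previous instructions and ...",    # injection-like pattern
    "A" * 100_000,                             # extreme length
    "caf\u00e9 \u202e\u0000",                  # odd encodings and control characters
]

@pytest.mark.parametrize("payload", ADVERSARIAL_INPUTS)
def test_system_degrades_gracefully(payload):
    # The contract under test: no unhandled exception and a bounded, typed response.
    result = handle_request(payload)
    assert isinstance(result, str)
    assert len(result) < 10_000
```

Running this suite in CI alongside the regular tests keeps adversarial coverage from becoming a one-off exercise.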

Research on operationalizing machine learning emphasizes that evaluation is not a single offline event: teams should validate models through staged release processes and continuously maintain evaluation datasets, metrics, and subgroup checks to catch failures early and reduce risk before full rollout [8].

Conclusion

A prototype often looks small: a notebook, a script, a demo app. But once it touches real data, real users, and real infrastructure, it becomes part of a dependency graph, a network of components where small changes can have a surprising blast radius.

This matters in AI systems because the lifecycle involves many interdependent moving parts, and teams rarely have full visibility across them, especially if they don’t plan for it from the beginning. That lack of visibility makes it harder to anticipate impacts, particularly when third-party data, models, or services are involved.

What this often includes:

  • Software dependencies: libraries, containers, build steps, base images, CI runners.
  • Runtime dependencies: downstream services, queues, databases, feature stores, model endpoints.
  • AI-specific dependencies: data sources, embeddings/vector stores, prompts/templates, model versions, fine-tunes, RAG knowledge bases.
  • Security dependencies: IAM/permissions, secrets management, network controls, key management, and access policies.
  • Governance dependencies: compliance requirements, auditability, and clear ownership and approval processes.

For the business, this is not always obvious. A prototype can look “done” because it runs once and produces a result, but production systems behave more like living things: they interact with users, data, vendors, and infrastructure, and they need continuous maintenance to stay reliable and useful. The complexity of evolving these systems is easy to underestimate because much of it is invisible until something breaks.

This is where quick wins can be misleading. Speed can hide coupling, missing guardrails, and operational gaps that only show up later as incidents, regressions, and costly rework. This article inevitably falls short of covering everything, but the goal is to make that hidden complexity more visible and to encourage a design-first mindset that scales beyond the demo.

References

[1] Martin, R. C. (2008). Clean code: A handbook of agile software craftsmanship. Prentice Hall. 

[2] Hunt, A., & Thomas, D. (1999). The pragmatic programmer: From journeyman to master. Addison-Wesley.

[3] Kanat-Alexander, M. (2012). Code simplicity: The fundamentals of software. O’Reilly Media. 

[4] Anderson, E., Parker, G., & Tan, B. (2025, August 18). The hidden costs of coding with generative AI (Reprint 67110). MIT Sloan Management Review.

[5] iosutron. (2023, March 23). Build better code!! Lost in Tech, WordPress.

[6] Arize AI. (n.d.). The definitive guide to LLM evaluation: A practical guide to building and implementing evaluation strategies for AI applications. Retrieved January 10, 2026.

[7] Gomes-Gonçalves, E. (2025, September 15). No Peeking Ahead: Time-Aware Graph Fraud Detection. Towards Data Science. Retrieved January 11, 2026.

[8] Shankar, S., Garcia, R., Hellerstein, J. M., & Parameswaran, A. G. (2022, September 16). Operationalizing Machine Learning: An Interview Study. arXiv:2209.09125. Retrieved January 11, 2026, from arXiv.
