Vibe Coding with AI: Best Practices for Human-AI Collaboration in Software Development



Vibe coding — collaborating with an agentic AI-powered IDE to build software — is rapidly becoming a mainstream development approach. Tasks that once required weeks of engineering effort can now often be completed in hours or days. Modern AI-assisted development environments can generate structured, modular code across multiple languages, design architectures, write tests, and even debug issues with minimal human input.

A growing ecosystem of such tools has emerged, many built on top of familiar development environments such as VS Code. While these platforms offer similar capabilities, they are evolving so rapidly that any differentiating feature in one tool typically appears in competing tools within a short period of time. As a result, the specific tool an organization chooses is often less important than how effectively developers learn to work with these AI systems to maximize productivity while controlling cost and complexity.

This raises a pertinent question: if AI can generate high-quality code faster than most developers can write it manually, what role remains for the developer?

The challenge is no longer simply writing code. Instead, developers must learn how to collaborate effectively with AI coding agents:

  • How should developers structure instructions and prompts to guide the system toward the desired outcome?
  • Where should humans intervene in the development process?
  • How can teams validate AI-generated code to ensure it is reliable, maintainable, and production-ready?

In this article, we explore practical principles for working with AI-enhanced development environments. We will outline key risks associated with Vibe coding tools and look at ways to mitigate them. Rather than focusing on any specific tool, we will examine the broader human-AI collaboration model that enables teams to extract the most value from these systems.

To illustrate these ideas, we will walk through a simple but realistic use case: building an intelligent search system using Retrieval Augmented Generation (RAG) on a dataset of news articles. While the problem may appear straightforward, it reveals several subtle ways in which AI-generated architectures and code can drift toward unnecessary complexity without careful human oversight.

Through this example, we will examine both the strengths and limitations of AI-assisted development, and highlight the role that developers still play in guiding, validating, and refining the output of these powerful tools.

The Use Case

While the principles discussed here apply to any type of software development, let’s illustrate them with a practical example: building an intelligent AI-powered search system (RAG) over a dataset of news articles (CC0). The dataset contains business and sports news articles published in 2015 and 2016, along with their titles and publication dates.

The vibe coder used here is Google Antigravity, but as noted earlier, the specific tool matters little: competing tools function in very similar ways.

Risks associated with Vibe Coding

As with any powerful technology, vibe coding introduces a new set of risks that are easy to overlook—precisely because of how fast and capable the system appears.

In this example, as I worked through building a simple RAG system over news articles, three patterns became immediately apparent.

First, the classic garbage-in, garbage-out principle still applies. The AI generates code quickly and confidently, but when prompts are even slightly ambiguous, the output drifts away from what is actually needed. Speed does not guarantee correctness.

Second, prompting remains a core skill, even though the interface has changed. Instead of writing LLM system prompts directly, we are now prompting the IDE. But the responsibility remains the same: clear, precise instructions. In fact, poor prompting has a very tangible cost — developers quickly burn through Pro model limits without getting closer to a usable solution.

Third, and more subtly, over-engineering is a real risk. Because the system can generate complex architectures effortlessly and at little cost, it often does. Left unchecked, this leads to designs that are far more complex than the problem requires, introducing unnecessary components that are difficult to maintain later.

These risks are not theoretical—they directly influence how the system evolves. The question then becomes: how do we control them?

What Can Teams Do About Them?

To address these risks, here are a few core principles that should form the foundation of an AI-powered SDLC:

Start With Clear Requirements

Before asking the AI to generate architecture or code, it is important to establish at least a minimal definition of the problem. Ideally, this comes from an existing business requirements document. In many AI projects, however, the only requirement the customer provides is a pointer to a document repository and a loosely defined goal such as “Users should be able to ask questions about the news articles and receive contextual responses.” While this may seem like a reasonable starting point to a human, it is an extremely open-ended scope for an AI system to interpret and implement, and it qualifies as a garbage-in prompt. It is similar to operating an LLM without any guardrails: there is a good probability the output will not be what you expect.

A practical way to constrain the scope is to define a set of representative test queries that users are likely to ask. These queries give the AI an initial scope boundary and reduce the risk of unnecessary complexity in the resulting system.
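As a minimal sketch, such representative queries can be captured as a small fixture that later doubles as a smoke-test suite. The `kind` labels and the `answer_fn` parameter are my own illustrative names, not part of any tool:

```python
# Representative test queries that bound the scope of the RAG system.
# The "kind" labels (specific vs. broad) are illustrative; they hint at
# which retrieval strategy each query is expected to exercise.
TEST_QUERIES = [
    {"text": "What do the articles say about oil prices in 2015?", "kind": "specific"},
    {"text": "Tell me business highlights in 2016", "kind": "broad"},
    {"text": "What sport are most of the articles about?", "kind": "broad"},
    {"text": "In sports, are there mention of records being created?", "kind": "specific"},
]

def smoke_test(answer_fn):
    """Run every scoped query through a (hypothetical) answer_fn and
    return the queries that produced an empty response.
    answer_fn(query_text) -> str is assumed."""
    return [q["text"] for q in TEST_QUERIES if not answer_fn(q["text"]).strip()]
```

Checking every generated iteration of the system against this fixed list keeps the AI anchored to the agreed scope.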

Generate the Architecture Before Writing Code

Unless you are building a trivially simple prototype, it is prudent to ask the AI to create an architecture document first and, optionally, a task plan showing the sequence in which it will execute the key steps: data ingestion, agent build, test-case execution, and results validation. Use a large thinking model (such as Gemini-3-Pro in Planning mode) for this step. Even if you have an architecture in mind, rather than providing it upfront and creating a bias, ask the AI to design the architecture from a clean slate, and then use your own design to challenge, refine, and confirm the proposal. The developer’s role becomes one of critical evaluation, asking questions such as “What if we simplified this component?” or “What happens if the data size grew 10x?”. Through this iterative dialogue, the architecture gradually converges toward a design that balances capability, cost, and maintainability.

To illustrate these ideas for our news-articles use case, I started with the following initial prompt to set basic boundaries for the vibe coder. I specifically asked it to provide only an architecture document, without beginning to code.

We will begin on a new project in the C:\antigravity\VibeCode folder. All artifacts such as source code, outputs, architecture docs should be stored in appropriate folders within this. The Dataset folder has a file called Articles.xlsx. Which contain many news articles in business and sports along with heading and date published. Take a look into it if you want. There need to be a RAG built which can answer a diverse range of user questions from specific, easy to retrieve ones to broad ones that may require scanning all the articles, which might be impractical. I need a strategy and architecture how you will do it. Some test queries could be:
What do the articles say about oil prices in 2015 ?
Tell me business highlights in 2016
What sport are most of the articles about ?
In sports, are there mention of records being created ?
The LLM used will be gemini 2.5 flash, but for now lets have the architecture doc only

A detailed architecture document, RAG_Architecture.md, was produced, along with a rationale for why an agentic dual-engine RAG is needed:

To successfully answer your mix of specific questions and broad domain queries (like counting topics) without scanning all articles every time, a standard vector search RAG is not going to work. Instead, you need an Agentic dual-engine RAG architecture.

I have drafted a detailed strategy and architecture document that addresses this and saved it to the project folder as requested:

…\VibeCode\Architecture\RAG_Architecture.md

Here is the high-level summary of the strategy:

1. The Challenges with Your Dataset
Your dataset has limited structured columns (Date, Heading, NewsType). This means that questions like "What sport are most of the articles about?" are nearly impossible to answer natively using either pure SQL or pure Vector Search without extracting the specific "sport" entity first.

2. Dual-Engine RAG Setup
Our architecture relies on a Data Enrichment Pipeline during ingestion and an Intelligent Routing Agent at query time:

  • Ingestion (LLM Pre-processing): During the ingestion phase, we pass the text of each article to Gemini 2.5 Flash to automatically extract topics (e.g., cricket, football, oil) and entities, and optionally summarize the text. These new metadata fields are stored in a regular relational database (or Pandas dataframe) and attached as metadata tags to the text embeddings in your vector database.
  • Intelligent Router Agent (Gemini 2.5 Flash): An agent sits in front of the user's queries to determine which tool to use to retrieve the information:
      • The SQL/Data Analytics Engine Tool: If the query is aggregational or broad (e.g., "Tell me business highlights in 2016" or "What sport are most of the articles about?"), the LLM generates a SQL/Pandas query, counts/filters the extracted metadata, and synthesizes the exact answer without touching the vector engine.
      • The Vector Semantic Search Tool: If the query is highly qualitative and specific (e.g., "What do the articles say about oil prices in 2015?" or "In sports, are there mention of records being created?"), the Router uses vector search, but strictly uses the structured fields (like Date=2015 or NewsType='sports') to filter the embeddings prior to searching.
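The routing decision described above can be sketched in a few lines. In the actual architecture an LLM (Gemini 2.5 Flash) makes this decision; here a simple keyword heuristic stands in for it so the sketch is self-contained, and `AGGREGATE_HINTS` and `route_query` are hypothetical names:

```python
# Minimal stand-in for the routing agent: decide whether a query goes to
# the SQL/analytics tool or the filtered vector-search tool. In the real
# system this decision is an LLM tool-selection call; the keyword list
# below is only an illustration of the routing contract.
AGGREGATE_HINTS = ("most", "how many", "count", "highlights", "overall")

def route_query(query: str) -> str:
    q = query.lower()
    if any(hint in q for hint in AGGREGATE_HINTS):
        return "sql_analytics"   # aggregational / broad -> query the extracted metadata
    return "vector_search"       # qualitative / specific -> metadata-filtered embeddings
```

The value of making the contract explicit is that each test query can be asserted against the tool it is expected to trigger, which is exactly the kind of traceable behavior worth demanding from the generated agent.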

Validate the Design

It is worthwhile spending time reading the explanations provided for the architectural choices, and the responses to your suggestions and challenges. The AI will detail the pros and cons of each aspect of the design, but understanding and assuring oneself that it strikes the right balance between functionality, complexity, and maintainability remains the role of the developer, or the architect on the team, as the case may be.

Strengthen the Design through Edge Cases

Once we understand the architecture and its rationale, it's time to think about edge cases that might break it. For instance, here I asked: "What if the user asks to summarize the negative articles in the entire database, which may be tens of thousands of articles? How will you respond?"

In response, an amendment to the architecture was proposed: add sentiment extraction during ingestion, hierarchical summarization (generating one-sentence summaries of articles to avoid overflowing the LLM's context window), and strategic sampling with a SQL fallback. This refinement step can be repeated iteratively for other edge cases that come to mind.
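The hierarchical-summarization idea amounts to a recursive reduce over the corpus. A minimal sketch, where `summarize_fn` is a hypothetical stand-in for an LLM call (e.g., to Gemini 2.5 Flash) and the batching logic, not the model call, is the point:

```python
def hierarchical_summarize(texts, summarize_fn, batch_size=10):
    """Recursively reduce a large set of texts into one summary without
    overflowing the LLM context window. summarize_fn(list_of_texts) -> str
    is assumed to be an LLM summarization call."""
    if len(texts) <= batch_size:
        return summarize_fn(texts)
    # Summarize each batch, then summarize the batch summaries.
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    return hierarchical_summarize(
        [summarize_fn(batch) for batch in batches], summarize_fn, batch_size
    )
```

With 25 one-sentence summaries and a batch size of 10, this makes three first-level calls and one final call, keeping every individual prompt within the context window.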

Have the AI challenge itself

Try asking the AI to think of edge cases that break its own architecture. You can also select a different model, such as Claude Opus, for this step. A caveat: AI models often tend to be overly critical of another model’s work, so human judgment is essential in deciding which suggestions are worthwhile.

When I asked for this to be done, the model generated several sophisticated test queries that the architecture might not be able to handle, such as complex cross-document comparison, exact phrase/keyword matching, deep relational and network queries, and contradiction and claim verification. For each, it provided the reasoning and a suggested addition to the architecture, ranging from Map-Reduce and BM25 to a Knowledge Graph, a Timeline Synthesizer tool, and a Claim Analysis module.

Here are the queries it generated and the suggested amendments:

1. Complex Cross-Document Comparison (The "Map-Reduce" Problem)
    • Query: "Compare the primary reasons given for oil price drops in 2015 versus the reasons given in 2016. What are the key differences?"
    • Potential Addition: A Map-Reduce or Multi-Agent workflow where one agent summarizes 2015, another summarizes 2016, and a final agent compares the two summaries.
2. Exact Phrase / Lexical Keyword Matching (The "BM25" Problem)
    • Query: "Find the specific article that used the exact phrase 'unprecedented fiscal cliff' or mentions the obscure ticker symbol 'ZYXW'."
    • Potential Addition: Hybrid Search. Adding a traditional lexical search engine (like TF-IDF or BM25) and combining its scores with the vector search scores (Alpha blending or Reciprocal Rank Fusion).
3. Deep Relational & Network Queries (The "Knowledge Graph" Problem)
    • Query: "Which companies are most frequently mentioned in the same articles together when discussing the tech sector?"
    • Potential Addition: A Knowledge Graph database (like Neo4j) built during ingestion that maps Entity -> Co-occurs_with -> Entity.
4. Sequential and Causal Timelines (The "Temporal Reasoning" Problem)
    • Query: "Trace the chronological sequence of events that led to the resignation of the cricket team captain, week by week."
    • Potential Addition: A dedicated Timeline Synthesizer Tool that forces strict chronological retrieval and chains the narrative sequentially.
5. Contradiction and Claim Verification (The "Fact-Checking" Problem)
    • Query: "Are there any articles that contradict each other regarding the success of the recent business merger between Company A and Company B?"
    • Potential Addition: A Claim Analysis Module that specifically evaluates retrieved chunks for disagreeing viewpoints before passing them to the user.
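As background to the hybrid-search suggestion in item 2: lexical (BM25) and vector rankings are commonly combined with Reciprocal Rank Fusion. A minimal sketch, assuming each engine returns document IDs in rank order:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one ranking.
    Each doc scores sum(1 / (k + rank)) over the lists it appears in;
    k=60 is the smoothing constant commonly used with RRF."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A doc ranked highly by both engines (d2: rank 2 lexically, rank 1 by
# vector search) rises above a doc one engine ranks poorly (d1).
fused = reciprocal_rank_fusion([["d1", "d2", "d3"], ["d2", "d3", "d1"]])
```

Whether this extra layer is worth its latency for a news Q&A system is exactly the over-engineering question examined next.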

Detect Over-Engineering

Review critically whether the additional complexity is warranted by the additional functionality it might enable. In this case, for a simple problem like querying news articles, these changes seemed excessive. Let’s question the reasoning behind the complexity: “Will users ask for exact keyword search often enough to merit the latency of adding a BM25 layer?” Perhaps named entity recognition (NER) for metadata will be sufficient. “Are deep relational and network queries important enough for news articles, which are by definition standalone artifacts, to merit the complexity and cost of a knowledge graph?”

Challenge the Complexity

It is perfectly reasonable (and should be a best practice) to push back on AI-generated architecture or code when it seems overly complex. In such a case, ask for reasoning, workarounds, and simpler alternatives. Here, I switched the model to Claude and asked for a self-critique. The result, attached below, is an acknowledgment that the edge queries are unlikely or can be handled with simpler modifications, such as better tool usage by the agent. In this case, the AI concluded that none of the complex architectural changes were needed, and they were rolled back.

Here are the results of the self-critique:

1. Cross-Document Comparison (oil 2015 vs 2016): Yes, very likely. Comparative temporal questions are natural for a news corpus. But honestly, the existing architecture can handle this — the agent just calls the vector tool twice with different date filters and compares. It doesn't need a new "Map-Reduce module." I over-complicated it.
2. Exact Phrase / BM25: Unlikely. Users of a news Q&A system rarely search for exact phrases or ticker symbols. This is more of a search engine use case. I'd drop this as a concern.
3. Knowledge Graph / Co-occurrence: **Unlikely at this scale.** With 2,700 articles, co-occurrence analysis is a niche analytical query. A simple SQL JOIN on the `Key_Entities` column could handle the rare case. A Neo4j graph is massive overkill here.
4. Temporal/Causal Timelines: **Moderately likely.** "What happened leading up to X?" is a natural question for news. But the existing architecture can handle it: vector search to find the topic + SQL sort by date. No new module needed — just good agent prompting.
5. Contradiction Detection: **Unlikely.** Users rarely ask "do articles contradict each other?" This is an academic NLP task, not a typical user query.
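The observation in item 1, that a comparative query needs no new module, can be sketched as two calls to the existing tools followed by a synthesis step. Both `search_fn` and `synthesize_fn` are hypothetical stand-ins for the agent's vector-search tool and its LLM synthesis call:

```python
def compare_across_years(query, years, search_fn, synthesize_fn):
    """Handle a comparative temporal query with the existing tools:
    call the vector-search tool once per year filter, then ask the LLM
    to contrast the retrieved evidence. No new Map-Reduce module needed.
    Assumed signatures: search_fn(query, year) -> list[str],
    synthesize_fn(prompt) -> str."""
    evidence = {year: search_fn(query, year) for year in years}
    prompt = "Compare the findings across years:\n" + "\n".join(
        f"{year}: {' | '.join(chunks)}" for year, chunks in evidence.items()
    )
    return synthesize_fn(prompt)
```

This is the pattern visible in the execution log shown later: the same tool invoked twice with different date filters, with the comparison left to the final LLM call.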

The architecture is now ready for code generation, with the required checks and balances incorporated. The resulting code is far more likely to be accurate and explainable.

Key Insight

The above narrative is meant to establish a pattern of collaboration between human and AI. At every phase, the due diligence and oversight provided by an expert ensure the quality of the next stage’s outcome. To illustrate further, I instructed the vibe coder to publish detailed execution logs of the test queries (including the edge cases), showing the tool calls and their sequence in addition to the final output. I reviewed the code and results manually to understand the logic, even where the AI assisted in generating explanations for parts of it. This challenge-and-refine pattern repeats across phases, and it ensures that the generated application is maintainable, traceable, and aligned with the team's standards.

Here is an example of one of the query execution logs:

USER QUERY: Compare the primary reasons given for oil price drops in 2015 versus the reasons given in 2016. What are the key differences?
[TOOL PATH] -> search_articles_vector called with query='reasons for oil price drops', news_type='', specific_topic='Oil', year=2015
[TOOL PATH] -> search_articles_vector called with query='reasons for oil price drops', news_type='', specific_topic='Oil', year=2016

FINAL AGENT OUTPUT:
In 2015, the primary reasons for oil price drops included a **surging dollar**, **plentiful global crude supplies/oversupply**, **rising US crude inventories**, and **weak demand**. The International Energy Agency also issued a bearish report contributing to the decline.

In 2016, while **rising stockpiles**, **huge storage overhangs**, **near-record production**, and **slowing demand** continued to be factors, new and significant reasons emerged. The **row between Saudi Arabia and Iran** made cooperation among major exporters to cut output highly unlikely, directly impacting supply-side expectations. Additionally, **concerns over China's economy** and its **devaluation of the yuan** were explicitly cited as contributing to the price drops.

The key differences lie in the emergence of **geopolitical tensions between Saudi Arabia and Iran** directly affecting OPEC's ability to manage supply, and the more pronounced impact of **China's economic policies** on global demand and market sentiment in 2016 compared to 2015.

Takeaways and Conclusion

The purpose of this article is not to serve as a tutorial on vibe coding — numerous resources already cover that. Instead, the focus is on how humans should interact with AI-powered coding agents, which have the potential to overwhelm even experienced developers with the speed and complexity of the solutions they generate.

The central principle is clear: AI is a tool, not a replacement. Across every stage of development — from design and coding to unit testing, integration, and security validation — the recommended workflow is:

Prompt (Human) -> Generate (AI) -> Review (Human + AI) -> Feedback Prompt (Human) -> Iterate.

In this loop, the human remains the final arbiter. While the AI has visibility into requirements, architecture, code, and tests, only humans can assess the broader context: user expectations, business priorities, cost and latency constraints, reliability, maintainability, and explainability. These factors ultimately determine whether a system succeeds in production and is widely adopted by users.

Key Takeaways:

  • AI accelerates, humans validate: Speed does not replace judgment.
  • Start with architecture and clear requirements: Define boundaries and test cases before coding.
  • Beware of over-engineering: Not every AI suggestion is necessary; simplicity is a strategic choice.
  • Iterate through review and feedback: Maintain a human-in-the-loop approach at every stage.
  • Final responsibility lies with humans: Only humans can weigh trade-offs, ensure maintainability, and decide if the solution is fit for production.

By following these principles, developers can harness the full potential of vibe coding while maintaining control, ensuring systems are effective, understandable, and ultimately adopted by the users they are built for.

Connect with me and share your comments at www.linkedin.com/in/partha-sarkar-lets-talk-AI

Reference

News Articles — Dataset (CC0: Public Domain)

Images used in this article are generated using Google Gemini. Code created by me.
