It’s the Lessons We Learned Along the Way. Or, Is It?

Contents

Interns as explorers First, they came for the consultants, and I did not speak out…The AI setup The real treasure was the friends we made along the way The future can wait (a bit)See you, agentic cowboys

ChatGPT a typical month-long internship problem in the data space. The problem in some sense got “solved” but I’m not sure it means what I thought it would. For data and AI practitioners, this is now a very practical question. Many teams use interns or research spikes to explore ideas: is AI good enough now? Are these projects only about the final artifact?

Interns as explorers

Building a tech roadmap at an early-stage data startup is not that different from a typical video game map:

A video game map, where the world is revealed one step at a time. Image generated by the author with ChatGPT.

The roadmap is much bigger not only than what you can do, but also what you can see. If only we could peek over “the (product) horizon” by sending an explorer to clear the map, then we’d gain some awareness of what waits for you once you get there (the explorer may die, so the analogy is good up to a point).

Bauplan (which I co-founded in 2024) has made the unusual (for its size) choice of running summer internships from top institutions (Columbia University, CMU, University of Wisconsin–Madison) to peek over the horizon. It has worked very well so far. Aside from a better hiring funnel, community standing, and some social clout, explorations made their way into our product, and will be strategic assets as the company grows.

As I send out internship offers for the summer of 2026, half of my X feed is telling me that I’m going about it all wrong. Far from being an hypothetical problem, at different stages, sizes and constraints, all data and AI teams today are facing the same question: is there now a better way to do research spikes with agents? If yes, what is a good, tested AI setup that is easy to adapt?

In the hope our experience and perspective will be valuable for many data practitioners, this is our setup and the lessons we learned from a real research spike done by pairing with ChatGPT.

First, they came for the consultants, and I did not speak out…

At a time when AI threatens knowledge workers, the junior positions seem to be the ones falling first. Why should McKinsey hire an Ivy League analyst when a $200 subscription produces more reports, faster? Lately, my feed seems to indicate that AI may be coming after researchers too, with academics trying to automate themselves – “fully autonomous research from idea to paper” – and professors debating whether to hire assistants anymore.

There are obvious arguments to resist the trend. We can attack the outcome, and argue that the tech is still buggy so the promised “Ivy League” parity is just not there. We can argue that the social contract is being broken: sure, young researchers have always been (in some sense) a “burden”, but that burden was both a way to pay it forward and an investment in the next generation. We can also highlight the potential long-term damage of replacing a well-understood thought process with a new, untested workflow.

While there is weight behind all of these arguments, parallel arguments could be superficially construed for the invention of cars or what-have-you. There’s always a time and place for these debates, but my interest today is way more localized and personal: what would it feel like if I were to ditch my interns for a $200 subscription?

So (not unlike this physics experiment I recently discovered) I tried to squeeze the month of an intern into a weekend with ChatGPT.

While the exact problem is not terribly important, scoping the internship may be useful to get a feeling of the type of things interns do at Bauplan (feel free to skip!). Bauplan is a branching data platform: agents and humans can open Git-like branches on their tables. As a result, the same table may have different versions in different branches. In our motivating example, Acme Inc. is an online retailer in which a swarm of data agents is tasked with running different predictions on tomorrow’s sales:

Querying a “predictions” table when multiple versions co-exist. Image from the original paper by the author.

Ideally, a human would verify the work, compare and contrast the findings, and then merge the predictions table as the canonical data representation. But what if somebody asks a question before this happens?

Existing systems would just refuse to answer, even if this feels intuitively wasteful: two agents computing monthly revenue may disagree on the exact figure, yet both agree that revenue grew >10% quarter-over-quarter. In other words, even when there is no system-wide agreed-upon version of data, we could still answer many interesting questions.

Our internship goal is then to build a prototype of such a system. It requires learning about branching, picking up new math, designing a solution on top of Bauplan, and building both a text-to-SQL module (easy-ish) and a custom query path (hard-ish).

The AI setup

Bauplaners recently had the privilege of seeing the one and only Wes McKinney giving a live demo of his setup, so I decided to adopt it (with some minor tweaks):

ChatGPT 5.2 to plan and decide on strategies (i.e. how to design a benchmark highlighting the difference between engineering approaches);
Claude Code inside Visual Studio to do the actual development loop;
Roborev to locally review commits adversarially. Powered by Codex, the reviews highlight potential issues and suggest improvements;
Roborev reviews to tame project complexity every 10 commits or so: these reviews take an architectural point of view and help with cutting the bloat.

The real treasure was the friends we made along the way

Since I cannot bear the idea of having AI write for me (because, to be honest, I also cannot bear the idea of interns doing it), I did the final writing myself. As internships typically end by sharing results with the community, I ended up having enough things for an ACM SAO paper, “Querying Everything Everywhere All at Once”.

By some metrics, the X crowd was right: even granting a quality mismatch, I babysat the AI for 48 hours to do, say, 80% of what would have taken weeks. Interestingly, babysitting is of a different nature: the AI is so eager to please that it often ends up “cheating” to achieve surface-level results through hardcoded shortcuts. While many data and AI problems are superficially easy-to-verify, in our experience they are also easy-to-game: this is especially true whenever the interpretation of the experimental setup is nuanced, or the final metric is not straightforward: you should triple check if your AI agents is hill-climbing or just pretending.

On the other hand, the AI doesn’t need to be taught about Tarski’s models or truth gluts, as attaching a few papers is enough to hit the ground running. The results were also “tangible”: I have a good-looking web app without having to pick up D3.js again (10 years after my last time!), and a demo script simulating agentic pipelines and business questions over branches. If you believe (as I do) that prototypes generally beat PowerPoint (or papers), there is no doubt the AI stack delivered something.

What is harder to put into words is what was not delivered, or, to put it more precisely, what I lost in the process. For all the excitement about the incredible chart and the surprising benchmark, none of it really produced more understanding. I am not wiser for having gone through the research motion: I have a bit more intuition than before (e.g. how to better prompt for good SQL translations), but my mental models have largely the same resolution as when I started. Working with interns may be time-consuming and sometimes even frustrating, but it always produces better thoughts, in them and myself: by explaining and mentoring them, they also explain and mentor me back in some sense.

If I now get results without learning much, I feel uneasy mostly because it is not clear if that should matter. I don’t mean if it matters on a global, big-brain scale: of course if our children don’t learn anymore and our scientists offload thinking to a chat, that is bad. I’m now just modestly focused on this: does it matter for me, for my company, for my investors?

The local, personal answer – unless you have a very inflated sense of yourself – is less clear-cut. I know how to code and I could probably still teach some mathematical logic, so in a sense none of this project is breaking new ground anyway: perhaps, there is not that much to learn here (aside from the feasibility of it all, which of course I suspected in the first place), and the uneasiness I feel is the legacy of a past mindset. Or, perhaps, there is no task too humble to become a slightly better version of myself: doing the nitty-gritty work of connecting our APIs to a chart, failing to compile DataFusion 13 times, going back and forth on how to pick queries for a convincing benchmark where no other system can express – let alone compute – our query path. I feel uneasy because real-world projects for real-world, not-too-ego-inflated people have a very large surface of issues which are not obviously first-principles thinking, nor obvious implementation details.

I have no problem today (tomorrow, we will see…) with the simplistic view that humans should do the thinking and LLMs should fix matplotlib syntax. But I struggle with the large grey area in between, and the inner voice whispering that by treating everything as an implementation detail, my thoughts soon won’t be sharp anymore. Are we becoming like those VCs who “pattern match” and lose all the nuances? Is the point of a proof proving a theorem (however alien-looking the proof may be), or giving us novel understanding?

The future can wait (a bit)

Observing my decisions (and not my feelings) for the summer of 2026 does indeed reveal the consequences of this experiment. Bauplan has hired two (human) interns, two young, talented, motivated computer scientists in charge of exploring the edge of our product map with regard to end-to-end AI optimization (skills evolution with GEPA) and scaling git-for-data. From a practical perspective, I made the same decision I would have made before this project. However, I don’t believe I got out unchanged and unscathed from it: my feelings will at some point crystallize in new concepts and then influence my decisions.

On the one hand, as a big fan of the Little Prince, it is not lost on me that it was the time wasted on that rose that made it important: spending time with my interns this summer will (I believe) make them and our project together more important. On the other, this only partially captures my vibe these days. I had to dig into the Internet Archive to recover something I recently remembered from 2006 (mathematical logic is not the only thing I remember from my 20s, apparently). This is the #1 entry in Blender’s “50 Worst Things to Happen to Music”:

#01. KIDS TODAY

Back in our day, we didn’t have any of yer fancy iPods and ringtones and downloads. We didn’t have the luxury and convenience of your scrotum-rings and your World Wide Web logs. When we wanted to steal the new URIAH HEEP album, we couldn’t just troll the Internets for it, we had to do it the old-fashioned way—by hiking to the store (uphill, both ways) and shoving 12″ of vinyl under our sweaters (which we had to knit ourselves). That’s why you sniveling whipper-snappers don’t appreciate the real value of music. Or Uriah Heep. Now get the hell off our lawn!

Will we still appreciate the “real value of things” if we can now “steal them” from the comfort of our laptops?

See you, agentic cowboys

Thanks to Luca, Colin, Ethan for their comments on a previous draft of this article.

If you want to be a Bauplan intern and do cool data-and-AI stuff (like this or this or this), I still accept human candidates: get in touch!