Solving the Human Training Data Problem



Practice Makes Passing

Earning my degree in computer science was anything but easy. I vividly remember reaching a breaking point around the end of the tenth week of my first semester. With just a few weeks until my first final, I sat staring at Calc 1 practice problems, spiraling into despair. I’d always been good at math. I did all the homework and paid attention in all the lectures. So how could it be that I didn’t even know where to start? Why wasn’t anything clicking?

I often joked with friends about dropping out of the program, even well into my final semester. Week 10 of Semester 1 was the only time I very seriously considered it.

It was January 2022, right on the heels of the COVID tech hiring boom. I’d tried my hand at frontend development and had a pretty good grasp of React. None of the introductory math courses I was taking made any sense. Plenty of acquaintances and friends of friends had gotten cushy tech jobs without degrees, so why couldn’t I? What use was knowing how to prove a function was continuous out in the real world?

Excerpt from Calc 1 lecture notes, circa 2021. Image by the author.

In retrospect, I understand that this was exactly what I was supposed to feel. That was when I actually decided to pursue my degree, not when I applied a year earlier. That feeling of impending doom was what lit a fire under me and drove me to study like a man possessed for the next few months.

To this day, I’ve never been happier to get back a grade than when I opened the scan of my graded Calc 1 exam to see “61/100” staring back at me: a passing grade with a cool margin of 2 points above failing. But all that mattered was that it was a passing grade, especially when almost half the students had failed the class, many for the second or third time.

Calc 1 grade distribution. 42.6% fail rate and a failing average grade of 55.5. Image by the author.

By all accounts, my first semester of undergrad was rough. Yes, this was by design, and yes, I learned a lot from it, both in terms of the material itself and (mostly) about resilience and perseverance. But it took moving to Germany and starting my master’s for me to understand how good I really had it back then, at least in one particular regard.

The Human Training Data Problem

One of the biggest surprises to me at my new university was that past exams are much less of a thing here. For all the stress and anxiety I had during my bachelor’s, one thing I knew I could always count on was the existence of plentiful and easily accessible scans of past exams and exam-relevant problem sets, especially for introductory courses.

For Discrete Math, I solved dozens of past exams going back almost a decade. I distinctly remember warming up for Linear Algebra 1 with questions from the 1990s. This was so ingrained in the culture of my program that I completely took it for granted. The only reason I managed to pass Calc 1 (by the skin of my teeth) was because I had spent hours on end solving hundreds of questions from exams.

I was so accustomed to exams from past years being readily available that skimming over them had become part of my process for vetting classes I was considering taking. This meant that my rude awakening came fairly early on in my first semester of grad school, while trying to figure out my schedule.

So shocking was the revelation that I can map my reaction to the five stages of grief. At first, I was in denial, absolutely convinced that there must be some secret platform where all the past exams were hiding. Anger, bargaining, and depression soon followed. Acceptance never really came, but I was willing to postpone my concerns until finals approached at the end of the semester.

As my first two finals (on back-to-back days, no less) approached in a hurry, I found myself faced with what I like to call the Human Training Data Problem. Granted, the human brain and machines are (very!) substantially different. But I couldn’t help but liken my situation to that of a machine learning model with insufficient training data. I was completely stumped on how to bridge the gap between lecture notes and potential exam questions.

My undergrad experience had given me insight into what human underfitting looks like, both at training time (studying) and test time (on exam day). I vividly remember more than one class where, for one reason or another, I preferred more in-depth review of lecture slides or notes to solving practice problems.

This was an approach I quickly dropped during my freshman year, and for good reason: even in theory-heavy classes, it yielded disastrous results. Knowing the proofs for all 40 theorems the professor required was much less help in passing Linear Algebra 2 than practicing applying them to solve problems would have been. That’s not to say an adequate grasp of the theory isn’t necessary; it absolutely is. But being able to recite the lecture notes by heart won’t save you if you can’t answer questions like the ones on the final.

Proof of the Riesz representation theorem (for an inner product space with a finite orthonormal basis), written out one of many times while memorizing it during exam prep, circa 2022. Even while studying, this definitely didn’t feel like the best use of my time. Image by the author.

And so, armed with hundreds of slides and a vague idea of the structure of each exam, I racked my brain for ways to avoid the pitfall of going in blind without any practice problems. Denial crept back in, and I desperately searched for past exams I knew didn’t exist. Eventually, I shifted my attention from finding the Holy Grail to turning my problem into one an LLM might be able to solve.

Synthetic Training Data for Humans

Researchers at IBM define synthetic data as “information that’s been generated on a computer to augment or replace real data to improve AI models” [1]. It has many benefits, from mitigating privacy concerns to cutting costs, leading to its widespread adoption for uses as varied as tooling for financial institutions [1] and 3D content generation [2].

In my case, the motivation was simple: the real-world (human) training data I needed to study just wasn’t available in the wild.

Of course, using synthetic data only makes sense if that data accurately imitates the data our trained model will encounter in the real world. I knew I had to be very intentional about how I generated the mock exams I wanted to use. Just telling Claude to write a practice test or two wouldn’t cut it, even if I gave it all the slides and material I had to work with. Only when setting out to write an exam does one realize how many decisions there are to be made, well beyond what’s in and what’s out in terms of the material.

Luckily, I wasn’t flying completely blind on that front. For one class, I had information about the exam’s structure and the kinds of questions there were on it from students who had taken it the year prior. For the other, the professor provided a breakdown of the exam into sections and a small handful of open-ended review questions.
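The upshot of all this is that the decisions about exam structure should be made explicit rather than left to the model. As a rough sketch of the idea, everything known (or guessed) about the target exam can be encoded into a structured generation prompt before handing it to an LLM. The course details, section weights, and helper names below are all hypothetical, not the exact prompts I used:

```python
from dataclasses import dataclass, field


@dataclass
class ExamSpec:
    """Everything known (or guessed) about the target exam's structure."""
    course: str
    duration_min: int
    sections: list[tuple[str, int]]          # (topic, points) pairs
    style_notes: list[str] = field(default_factory=list)


def build_prompt(spec: ExamSpec, example_questions: list[str]) -> str:
    """Turn exam metadata and example questions into a generation prompt."""
    lines = [
        f"Write a realistic mock exam for {spec.course} "
        f"({spec.duration_min} minutes total).",
        "Sections and point values:",
    ]
    lines += [f"- {topic}: {points} points" for topic, points in spec.sections]
    if spec.style_notes:
        lines.append("Match these style constraints:")
        lines += [f"- {note}" for note in spec.style_notes]
    if example_questions:
        lines.append("Stay close to the style of these example questions:")
        lines += [f"- {q}" for q in example_questions]
    return "\n".join(lines)


# Example usage with made-up course details:
spec = ExamSpec(
    course="Course X",
    duration_min=90,
    sections=[("Topic A", 40), ("Topic B", 60)],
    style_notes=["no multiple choice; all questions are open-ended"],
)
prompt = build_prompt(spec, ["Define the term X and give one example."])
```

Writing the known constraints down like this also makes it obvious which decisions are your own guesses, which matters later when those guesses turn out to be wrong.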

Both classes had Q&A sessions after their respective final lectures. I paid special attention to anything that seemed like a hint as to what they might ask, which later proved to be very helpful.

Easy Mode: Replicating a Template

The first exam was straightforward since I had much more to work with. It also had a reputation for being relatively formulaic. I gave Claude the example questions and structure I had and asked it to stick to the same style.

Many of the questions lent themselves nicely to slight changes that made them novel enough to be worth solving for practice without straying too far from what was typical for the actual exam. Apart from a few LaTeX formatting hiccups, which were fairly easily resolved, it was smooth sailing.

To insure myself against any surprises, I also had it generate some trickier questions based on the lecture slides and my notes from the Q&A session. Even though nothing unexpected was asked in the end, doing some targeted review tailored to my own personal blind spots was a great confidence booster.

Although I definitely would have been able to study for the first exam without the help of LLMs, I still felt like I gained a lot by using Claude. I could absolutely imagine how helpful it would have been for some of the newer or more advanced courses I took in undergrad, where there were only a small handful of past exams available.

Hard Mode: Construction from Scratch

The second exam was a much tougher nut to crack. First of all, the breadth of the material was much wider. Secondly, the slides only very loosely reflected what was discussed in class. Most importantly, there was far less information available on what the exam would look like. The few details that existed were vague and hard to find.

The first two concerns were at least partially mitigated by the fact that I made an effort to take comprehensive notes throughout the semester. As for hints on the structure and style of the exam, I scoured every possible platform and collected anything that seemed even remotely relevant. In that vein, the Q&A session ended up being a godsend. Transcribing the professor’s answers and comments left me with a much better (albeit still incomplete) idea of what to expect.

Admittedly, I was initially pessimistic about the prospect of Claude being able to generate mock exams of much value. Though I had used it fairly extensively for guided material review, I had my doubts about how it would fare with the uncertainty at play. Still, I gave it everything I knew about the exam and hoped for the best.

I was pleasantly surprised at the results. Although the first few attempts produced exams that didn’t feel quite right, the core did seem promising. They did appear to adequately cover the material and to be challenging enough. After some back and forth, Claude started generating tests that I could have been convinced were real.

Overview of mock exams generated by Claude Sonnet 4.5 for Course #2. Note the (rather typical) yes-man commentary. Image by the author.

I solved the improved tests and asked Claude to correct my solutions. The very act of solving practice tests made me feel great about my grasp of the material. Claude’s usual sycophancy was the cherry on top. (It did point out mistakes, but was exceptionally soft on deducting points and overly excited about correct answers.) Ultimately, however, I wouldn’t know how well Claude had done training me until test time. With the fateful day fast approaching, I hoped for the best.

Generalizing to Test Data and Preventing Dataset Pollution

When Synthetic Data Alone Doesn’t Cut It

While synthetic data certainly has its benefits, it has a critical drawback. What a model learns based on synthetic data will, at best, model the simulated world from which that data is drawn. That simulated world could diverge from reality in ways we are completely unaware of until it’s too late [3].

As Dani Shanley puts it in “Synthetic data, real harm”:

“… just as generative AI models can produce plausible (but false) text or images, synthetic data generators may create datasets that appear statistically valid, while introducing subtle, hard-to-catch distortions and artificial patterns, or missing crucial real-world complexities.” [3]

Shanley also draws attention to the hidden and disproportionate impact of the individuals tasked with synthesizing data on how models ultimately behave. Largely arbitrary decisions on their part could have significant, possibly harmful, downstream effects [3].

I saw this impact in action while studying for my second exam. Slowly but surely, I had unintentionally skewed Claude’s outputs based on my personal interpretation of what the professor had said. My gut feeling on what the exam should look like became the arbiter of which questions were relevant and which weren’t.

It also became clearer as time went on that my training dataset was veering ever further into a biased take on reality. After the sixth mock exam, it was obvious that Claude had just settled on a fixed set of several dozen questions.

Even when prompted to introduce more variety, every output from there on out was just some cobbling together of questions I had already seen. Granted, these did include many key questions it was heavily implied would appear on the actual exam.

On test day, I was shocked at how much the exam resembled the ones I had solved for practice. The gimmes the professor had hinted at were indeed there, but so were an impressive number of non-trivial questions I had solved while studying. Roughly 60% of the questions were identical or very similar to ones I had practiced. Many of the rest were on topics I had at least touched on.

However, one part of the exam ended up being a significant blind spot. It was a section on topics we had discussed only briefly at the beginning of the semester. While studying, I was unreasonably confident in swiftly dismissing certain types of questions, be it because they seemed uncharacteristic (e.g., too mathematical) or because they were about things I had deemed too insignificant to include in the notes I took in class.

Unfortunately, those turned out to be the exact types of questions that were asked in that section. Some were about topics that only appeared on a single slide all semester. Others were deeply technical in a way I just didn’t expect. Though I did my best to answer them, I hadn’t trained my mental model on data that would enable it to generalize to these questions well enough.

The pill was all the more bitter to swallow since the kinds of questions I struggled with were ones Claude included in its first attempts at mock exams. These were precisely the ones I did away with early on based on little more than hunches.

In this case, the slip-up was far from catastrophic. In my opinion, it wasn’t even close to undoing the benefits of studying using synthetic mock exams. Even so, it serves as a cautionary tale that hearkens back to Shanley’s warnings about how synthetic data can insidiously exacerbate model subjectivity and bias [3].

Overcoming Overfitting: How to Make the Best of Synthetic Human Training Data

For many real-world applications, a synthetic dataset that yields a model with only 60% accuracy would probably be considered next to useless. With sufficient real-world data (i.e., actual past exams), there is no doubt in my mind that 90%+ accuracy would be achievable.

To be fair, though, the (human) model under consideration has flaws that machines don’t have and is, in many ways, much harder to train. I can say with confidence that that 60% would almost certainly surpass the accuracy of any other method I could have attempted.

I will absolutely stick to this method for future exams, with three key takeaways I plan to implement:

  1. Separate chats are the way to go. The feedback loop that led Claude to converge on specific questions undoubtedly had a lot to do with me running the entire cycle of generating tests and checking answers in one big, long context. This meant any new mock exam was directly based on all of the previous ones. Beyond that, Claude tried to be helpful by tailoring the questions to what it thought were my weak spots, leading it to become even more entrenched in what it thought should be asked. General context rot(1) was also probably an important factor.
  2. Keep an open mind. As mentioned above, the major blind spot I developed was largely the result of putting too much stock in my subjective assessment of what material would or should make the cut. Instead of challenging my assumptions and devoting some time to covering minor topics that seemed like long shots, I leaned into my biases.
  3. Augment with real-world training data! This is, of course, easier said than done. It somewhat contradicts the very premise of this article. But what you can do as a student (or as an educator) is enrich the bank of known questions for future students. I managed to remember most of the questions that were on my second exam and document them for future students to use when studying.
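Takeaway #1 can be sketched in code. The snippet below is a hypothetical illustration, not the workflow I actually scripted: each mock exam request starts from the same base context (slides, notes, exam structure), so earlier generated exams can never leak into, or bias, later ones. The message-dict shape mirrors what typical chat APIs expect:

```python
def fresh_context_requests(base_context: str, n_exams: int) -> list[list[dict]]:
    """Build n independent message histories, one per mock exam.

    Each history contains only the shared study material, so the model
    never sees (or recycles) its own previously generated exams.
    """
    return [
        [{"role": "user",
          "content": f"{base_context}\n\nGenerate mock exam #{i + 1}."}]
        for i in range(n_exams)
    ]


# Contrast: one long chat, where every new request sees all previous
# exams and feedback -- the setup that led to convergence and context rot.
def single_chat_requests(base_context: str, n_exams: int) -> list[dict]:
    history = [{"role": "user", "content": base_context}]
    for i in range(n_exams):
        history.append(
            {"role": "user", "content": f"Generate mock exam #{i + 1}."}
        )
    return history
```

The single-chat history grows with every exam generated, while each fresh-context history stays at a constant size; that difference is exactly what keeps the later mock exams independent of the earlier ones.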

Afterword: My Thoughts on LLMs as a Learning Aid

The elephant in the room is that none of the exam preparation workflow I described would have been even remotely feasible when I started my bachelor’s in late 2021. Maybe this is what made the process feel almost magical to me.

I remember wishing I had a way to automatically check and correct my answers on mock exams when studying in my freshman year. If you had told me back then that an AI tool, let alone a free one, would be able to do that (however imperfectly) in 2026, I would have thought you were crazy.

Much has been written about the new problems LLMs have brought about. Many of the points that have been made are especially relevant to students. And indeed, I can’t argue that claims like “AI is making people dumber” are completely unfounded. I’ve seen firsthand how these tools let a person outsource thinking and eliminate any intellectual discomfort. For an ever-growing range of complex tasks, they represent the ultimate shortcut [4].

Concerningly, I believe people who resist the temptation to take those shortcuts are increasingly being penalized, at least in the short run. A friend who was the only one not to vibe-code assignments in a certain class comes to mind. While others cruised to perfect grades on their homework despite threats that AI-generated submissions would supposedly be rejected, he put in the work and ended up being docked significant points for minor errors, with little in the way of constructive feedback or recourse.

Still, in the long run, it is a well-established fact that growth, in its myriad forms, entails some kind of stress. One of those forms is learning, and the necessary stress comes in the form of active engagement with the material. Few things are more rewarding in my opinion than the lightbulb moment of finally understanding a difficult concept after struggling with it for hours or days. Experiencing such moments with Fourier series, reductions, metric spaces, and many other concepts was a major part of what led me to choose to pursue a master’s degree in the field.

LLMs undoubtedly enable would-be learners to deprive themselves of this stress and, in turn, of actual learning. Often, though, I think too little attention is paid to the other side of the coin: with the right approach, they can personalize and democratize learning like no invention since the internet has.

Having experienced higher education both pre- and post-ChatGPT, I feel enormously fortunate to have tools like Claude and Gemini at my fingertips. Their utility for exam preparation was just the tip of the iceberg. It felt like my productivity was boosted tenfold throughout the semester. Things clicked much faster than they ever would have otherwise. LLMs were a game changer for everything from strategy (when and how to study what) to reviewing slides and notes to developing genuine curiosity and interest in the material.

To summarize with a platitude: “With great power comes great responsibility.” LLMs are what you make of them. With the right approach, they can coach you to take on the heavy lifting instead of doing it for you.

If you enjoyed this article, please consider following me on LinkedIn to keep up with future articles and projects.


Footnotes

(1) Engineering at Anthropic defines context rot as a phenomenon where “as the number of tokens in the context window increases, the model’s ability to accurately recall information from that context decreases.” [5]

References

[1] K. Martineau and R. Feris, “What is synthetic data?,” IBM Research Blog, Feb. 7, 2023. https://research.ibm.com/blog/what-is-synthetic-data.

[2] Y. Shi, P. Wang, J. Ye, M. Long, K. Li, and X. Yang, “MVDream: Multi-view diffusion for 3D generation,” arXiv preprint arXiv:2308.16512, 2023. https://doi.org/10.48550/arXiv.2308.16512.

[3] D. Shanley, “Synthetic data, real harm,” Ada Lovelace Institute Blog, Sep. 18, 2025. https://www.adalovelaceinstitute.org/blog/synthetic-data-real-harm/.

[4] S. Bogdanov, “In the long run, LLMs make us dumber,” @desunit (Sergey Bogdanov), Aug. 12, 2025. https://desunit.com/blog/in-the-long-run-llms-make-us-dumber/.

[5] P. Rajasekaran, E. Dixon, C. Ryan, and J. Hadfield, “Effective context engineering for AI agents,” Engineering at Anthropic, Sep. 29, 2025. https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents.
