Data Pruning MNIST: How I Hit 99% Accuracy Using Half the Data | by Chris Lettieri | Jan, 2025



Notice how the selected samples capture more varied writing styles and edge cases.

In some examples, like clusters 1, 3, and 8, the furthest point just looks like a more varied version of the prototypical center.

Cluster 6 is an interesting case, showing how some images are difficult even for a human to identify. But you can still see how this image could land in a cluster whose centroid is an 8.

Recent research on neural scaling laws helps to explain why data pruning using a “furthest-from-centroid” approach works, especially on the MNIST dataset.

Data Redundancy

Many training examples in large datasets are highly redundant.

Think about MNIST: how many nearly identical ‘7’s do we really need? The key to data pruning isn’t having more examples — it’s having the right examples.

Selection Strategy vs Dataset Size

One of the most interesting findings from the above paper is how the optimal data selection strategy changes based on your dataset size:

  • With “a lot” of data: Select harder, more diverse examples (furthest from cluster centers).
  • With scarce data: Select easier, more typical examples (closest to cluster centers).

This explains why our “furthest-from-centroid” strategy worked so well.

With MNIST’s 60,000 training examples, we were in the “abundant data” regime where selecting diverse, challenging examples proved most beneficial.
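That rule of thumb reduces to a few lines once you have each sample's distance to its cluster centroid. Here's a minimal sketch (my own illustration, not code from the paper; the toy distances are made up):

```python
import numpy as np

def select_by_distance(dists, keep_frac, regime="abundant"):
    """Pick sample indices to keep, given each sample's distance to its
    cluster centroid.

    With abundant data, keep the furthest points (hard, diverse examples);
    with scarce data, keep the closest points (easy, typical examples).
    """
    n_keep = int(len(dists) * keep_frac)
    order = np.argsort(dists)            # indices sorted closest-first
    keep = order[len(order) - n_keep:] if regime == "abundant" else order[:n_keep]
    return np.sort(keep)

# Toy distances for six samples: the abundant-data regime keeps the
# three furthest points.
d = np.array([0.1, 0.9, 0.5, 0.2, 0.8, 0.3])
print(select_by_distance(d, 0.5, regime="abundant"))  # [1 2 4]
```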

Inspiration and Goals

I was inspired by these two recent papers (and the fact that I’m a data engineer):

Both explore various ways we can use data selection strategies to train performant models on less data.

Methodology

I used LeNet-5 as my model architecture.

Then, using one of the strategies below, I pruned MNIST's training dataset and trained a model. Testing was done against the full test set.

Due to time constraints, I only ran 5 tests per experiment.

Full code and results available here on GitHub.

Strategy #1: Baseline, Full Dataset

  • Standard LeNet-5 architecture
  • Trained using 100% of training data

Strategy #2: Random Sampling

  • Randomly sample individual images from the training dataset
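This baseline is a one-liner with NumPy (a sketch, not the exact code from my repo; 60,000 is the size of MNIST's training set):

```python
import numpy as np

def random_prune(n_samples, keep_frac, seed=0):
    """Uniformly sample a fraction of training indices, without replacement."""
    rng = np.random.default_rng(seed)
    n_keep = int(n_samples * keep_frac)
    return np.sort(rng.choice(n_samples, size=n_keep, replace=False))

# 30,000 unique indices into MNIST's 60,000-image training set.
idx = random_prune(60_000, 0.5)
```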

Strategy #3: K-means Clustering with Different Selection Strategies

Here’s how this worked:

  1. Preprocess the images with PCA to reduce the dimensionality. This just means each image was reduced from 784 values (28×28 pixels) into only 50 values. PCA does this while retaining the most important patterns and removing redundant information.
  2. Cluster using k-means. The number of clusters was fixed at 50 and 500 in different tests. My poor CPU couldn’t handle much beyond 500 given all the experiments.
  3. I then tested different selection methods once the data was clustered:
  • Closest-to-centroid — these represent a “typical” example of the cluster.
  • Furthest-from-centroid — more representative of edge cases.
  • Random from each cluster — randomly select within each cluster.
Example of Clustering Selection. Image by author.
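The whole pipeline — flatten, PCA down to 50 components, MiniBatchKMeans, then one of the three selection methods per cluster — can be sketched like this (a minimal sketch assuming scikit-learn, not my exact experiment code):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import MiniBatchKMeans

def cluster_prune(images, keep_frac=0.5, n_clusters=50,
                  method="furthest", seed=0):
    """Flatten -> PCA -> MiniBatchKMeans -> per-cluster selection."""
    X = images.reshape(len(images), -1).astype(np.float64)
    n_comp = min(50, X.shape[1], len(X))
    X = PCA(n_components=n_comp, random_state=seed).fit_transform(X)

    km = MiniBatchKMeans(n_clusters=n_clusters, random_state=seed,
                         n_init=10).fit(X)
    # Distance from each sample to its assigned centroid.
    dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

    rng = np.random.default_rng(seed)
    keep = []
    for c in range(n_clusters):
        members = np.flatnonzero(km.labels_ == c)
        if len(members) == 0:  # MiniBatchKMeans can leave a cluster empty
            continue
        n_keep = max(1, round(len(members) * keep_frac))
        if method == "closest":      # typical examples
            chosen = members[np.argsort(dists[members])[:n_keep]]
        elif method == "furthest":   # edge cases
            chosen = members[np.argsort(dists[members])[-n_keep:]]
        else:                        # "random": uniform within the cluster
            chosen = rng.choice(members, size=n_keep, replace=False)
        keep.append(chosen)
    return np.sort(np.concatenate(keep))
```

The returned indices can then be used to subset the training set before fitting LeNet-5.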
  • PCA reduced noise and computation time. At first I was just flattening the images. The results and compute both improved using PCA so I kept it for the full experiment.
  • I switched from standard K-means to MiniBatchKMeans clustering for better speed. The standard algorithm was too slow for my CPU given all the tests.
  • Setting up a proper test harness was key. Moving experiment configs to a YAML, automatically saving results to a file, and having o1 write my visualization code made life much easier.

Median Accuracy & Run Time

Here are the median results, comparing our baseline LeNet-5 trained on the full dataset with two different strategies that used 50% of the dataset.

Median Results. Image by author.
Median Accuracies. Image by author.

Accuracy vs. Run Time: Full Results

The charts below show the results of my four pruning strategies compared to the baseline (in red).

Median Accuracy across Data Pruning methods. Image by author.
Median Run time across Data Pruning methods. Image by author.

Key findings across multiple runs:

  • Furthest-from-centroid consistently outperformed other methods
  • There is definitely a sweet spot between compute time and model accuracy worth finding for your use case. More work needs to be done here.

I’m still shocked that just randomly reducing the dataset gives acceptable results if efficiency is what you’re after.

Future Plans

  1. Test this on my second brain. I want to fine-tune an LLM on my full Obsidian vault and test data pruning along with hierarchical summarization.
  2. Explore other embedding methods for clustering. I can try training an auto-encoder to embed the images rather than use PCA.
  3. Test this on more complex and larger datasets (CIFAR-10, ImageNet).
  4. Experiment with how model architecture impacts the performance of data pruning strategies.

These findings suggest we need to rethink our approach to dataset curation:

  1. More data isn’t always better; bigger datasets and bigger models show diminishing returns.
  2. Strategic pruning can actually improve results.
  3. The optimal strategy depends on your starting dataset size.

As people start sounding the alarm that we are running out of data, I can’t help but wonder if less data is actually the key to useful, cost-effective models.

I intend to continue exploring the space, please reach out if you find this interesting — happy to connect and talk more 🙂
