Why Package Installs Are Slow (And How to Fix It)

Everyone who installs software knows the wait. You type an install command and watch the cursor blink. The package manager churns through its index. Seconds stretch. You wonder if something broke.

This delay has a specific cause: metadata bloat. Many package managers maintain a monolithic index of every available package, version, and dependency. As ecosystems grow, these indexes grow with them. Conda-forge surpasses 31,000 packages across multiple platforms and architectures. Other ecosystems face similar scale challenges with hundreds of thousands of packages.

When package managers use monolithic indexes, your client downloads and parses the entire thing for every operation. You fetch metadata for packages you will never use. The problem compounds: more packages mean larger indexes, slower downloads, higher memory consumption, and unpredictable build times.

This is not unique to any single package manager. It is a scaling problem that affects any package ecosystem serving thousands of packages to millions of users.

The Architecture of Package Indexes

Conda-forge, like many package repositories, distributes its index as a single file per platform. This design has advantages: the solver gets all the information it needs upfront in a single request, enabling efficient dependency resolution without round-trip delays. When ecosystems were small, a 5 MB index downloaded in seconds and parsed with minimal memory.

At scale, the design breaks down.

Consider conda-forge, one of the largest community-driven package channels for scientific Python. Its repodata.json file, which contains metadata for all available packages, exceeds 47 MB compressed (363 MB uncompressed). Every environment operation requires parsing this file. When any package in the channel changes – which happens frequently with new builds – the entire file must be re-downloaded. A single new package version invalidates your entire cache. Users re-download 47+ MB to get access to one update.

The consequences are measurable: multi-second fetch times on fast connections, minutes on slower networks, memory spikes parsing the 363 MB JSON file, and CI pipelines that spend more time on dependency resolution than actual builds.

Sharding: A Different Approach

The solution borrows from database architecture. Instead of one monolithic index, you split metadata into many small pieces. Each package gets its own “shard” containing only its metadata. Clients fetch the shards they need and ignore the rest.

This pattern appears across distributed systems. Database sharding partitions data across servers. Content delivery networks cache assets by region. Search engines distribute indexes across clusters. The principle is consistent: when a single data structure becomes too large, divide it.

Applied to package management, sharding transforms metadata fetching from “download everything, use little” to “download what you need, use all of it.”
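To make the idea concrete, here is a minimal sketch in Python. The three-package index below is made up, and real repodata records carry far more fields; the point is only the shape of the transformation from one big mapping to many small compressed files.

```python
import gzip
import json

# A hypothetical monolithic index: package name -> list of version records.
monolithic_index = {
    "numpy": [{"version": "2.1.0", "depends": ["python >=3.10"]}],
    "scipy": [{"version": "1.14.1", "depends": ["numpy >=1.23"]}],
    "requests": [{"version": "2.32.3", "depends": ["urllib3", "idna"]}],
}

def shard_index(index):
    """Split the monolithic index into one compressed shard per package."""
    shards = {}
    for name, records in index.items():
        payload = json.dumps({name: records}).encode()
        shards[name] = gzip.compress(payload)
    return shards

shards = shard_index(monolithic_index)

# A client that only needs numpy decompresses a single small shard
# instead of parsing the entire index:
numpy_meta = json.loads(gzip.decompress(shards["numpy"]))
```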

The implementation works through a two-part system outlined in the diagram below. First, a lightweight manifest file, called the shard index, lists all available packages and maps each package name to a hash. Think of a hash as a unique fingerprint generated from the file’s content. If you change even one byte of the file, you get a completely different hash.

Structure of sharded repodata showing the manifest index and individual shard files. The small manifest maps package names to shard hashes, enabling efficient lookup of individual package metadata files. Image by author.

This hash is computed from the compressed shard file content, so each shard file is uniquely identified by its hash. The manifest is small (around 500 KB for conda-forge’s linux-64 subdirectory, which contains over 12,000 package names) and only needs updating when packages are added or removed. Second, individual shard files contain the actual package metadata. Each shard holds all versions of a single package name, stored as a separate compressed file.

The key insight is content-addressed storage. Each shard file is named after the hash of its compressed content. If a package hasn’t changed, its shard content stays the same, so the hash stays the same. This means clients can cache shards indefinitely without checking for updates. No round-trip to the server is required.
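Content addressing can be sketched in a few lines. Everything below is illustrative: the compression codec, hash choice, and records are stand-ins for a self-contained example, not the real conda format. zlib is used because its output is deterministic for identical input, which content addressing requires.

```python
import hashlib
import json
import zlib

def make_shard(name, records):
    """Compress a package's metadata; the file is named after its content hash."""
    # sort_keys makes the JSON bytes deterministic, so identical
    # metadata always produces identical compressed bytes and hash.
    blob = zlib.compress(json.dumps({name: records}, sort_keys=True).encode())
    return hashlib.sha256(blob).hexdigest(), blob

# Illustrative data only; real shards hold full build and dependency records.
packages = {
    "numpy": [{"version": "2.1.0"}],
    "scipy": [{"version": "1.14.1"}],
}
store = {}      # hash -> compressed shard (stands in for a CDN or object store)
manifest = {}   # the lightweight shard index clients download first
for name, records in packages.items():
    digest, blob = make_shard(name, records)
    manifest[name] = digest
    store[digest] = blob

# Unchanged content hashes to the same name, so a cached shard
# never needs revalidating against the server.
digest_again, _ = make_shard("numpy", packages["numpy"])
assert digest_again == manifest["numpy"]
```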

When you request a package, the client performs a dependency traversal mirroring the diagram below. It fetches the shard index to look up the package name and find its corresponding hash, then uses that hash to fetch the specific shard file. The shard contains dependency information, which the client uses to fetch the next batch of shards in parallel.

Client fetch process for NumPy using sharded repodata. The workflow shows how conda retrieves package metadata and recursively resolves dependencies through parallel shard fetching. Image by author.

This process discovers only the packages that could possibly be needed, typically 35 to 678 packages for common installs, rather than downloading metadata for all packages across all platforms in the channel. Your conda client only downloads the metadata it needs to update your environment.
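The traversal reduces to a breadth-first walk where each frontier of newly discovered dependencies is fetched in parallel. In this sketch the manifest and shards are hypothetical in-memory dictionaries standing in for HTTP requests, and the dependency graph is invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical "channel": a manifest plus shards keyed by content hash.
MANIFEST = {"numpy": "h1", "python": "h2", "libffi": "h3"}
SHARDS = {
    "h1": {"name": "numpy", "depends": ["python"]},
    "h2": {"name": "python", "depends": ["libffi"]},
    "h3": {"name": "libffi", "depends": []},
}

def fetch_shard(name):
    """Stand-in for an HTTP GET of an individual shard file (network elided)."""
    return SHARDS[MANIFEST[name]]

def resolve(root):
    """Breadth-first traversal: fetch each layer of dependencies in parallel."""
    seen, frontier = set(), {root}
    with ThreadPoolExecutor() as pool:
        while frontier:
            shards = list(pool.map(fetch_shard, frontier))
            seen |= frontier
            # Next frontier: dependencies we have not fetched yet.
            frontier = {d for s in shards for d in s["depends"]} - seen
    return seen

print(sorted(resolve("numpy")))  # ['libffi', 'numpy', 'python']
```

Only the transitive closure of the requested package is ever touched; shards for the thousands of unrelated packages in the channel are never downloaded.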

Measuring the Impact

The conda ecosystem recently implemented sharded repodata through CEP-16, a community specification developed collaboratively by engineers at prefix.dev, Anaconda, and Quansight. Conda-forge itself is a volunteer-maintained channel that hosts over 31,000 community-built packages independently of any single company. This makes it an ideal proving ground for infrastructure changes that benefit the broader ecosystem.

The benchmarks tell a clear story.

For metadata fetching and parsing, sharded repodata delivers a 10x speed improvement. Cold cache operations that previously took 18 seconds complete in under 2 seconds. Network transfer drops by a factor of 35. Installing Python previously required downloading 47+ MB of metadata. With sharding, you download roughly 2 MB. Peak memory usage decreases by 15 to 17x, from over 1.4 GB to under 100 MB.

Cache behavior also changes. With monolithic indexes, any channel update invalidates your entire cache. With sharding, only the affected package’s shard needs refreshing. This means more cache hits and fewer redundant downloads over time.

Design Tradeoffs

Sharding introduces complexity. Clients need logic to determine which shards to fetch. Servers need infrastructure to generate and serve thousands of small files instead of one large file. Cache invalidation becomes more granular but also more intricate.

The CEP-16 specification addresses these tradeoffs with a two-tier approach. A lightweight manifest file lists all available shards and their checksums. Clients download this manifest first, then fetch only the shards for packages they need to resolve. HTTP caching handles the rest. Unchanged shards return 304 responses. Changed shards download fresh.
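The revalidation flow can be sketched as a tiny server-side function. This is a simplification of standard HTTP conditional requests with a made-up ETag value, not code from any real server.

```python
def serve_shard(etag_on_server, blob, if_none_match=None):
    """Minimal sketch of HTTP ETag revalidation for a shard file."""
    if if_none_match == etag_on_server:
        return 304, b""          # client cache still valid: no body sent
    return 200, blob             # content changed (or first fetch): send bytes

# First request: the client has nothing cached, so the full shard is sent.
status, body = serve_shard("abc123", b"shard-bytes")
assert status == 200

# Revalidation: the client presents its cached ETag; the server answers 304
# and transfers no body at all.
status, body = serve_shard("abc123", b"shard-bytes", if_none_match="abc123")
assert status == 304 and body == b""
```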

This design keeps client logic simple while shifting complexity to the server, where it can be optimized once and benefit all users. For conda-forge, Anaconda’s infrastructure team handled this server-side work, meaning the 31,000+ package maintainers and millions of users benefit without changing their workflows.

Broader Applications

The pattern extends beyond conda-forge. Any package manager using monolithic indexes faces similar scaling challenges. The key insight is separating the discovery layer (what packages exist) from the resolution layer (what metadata do I need for my specific dependencies).

Different ecosystems have taken different approaches to this problem. Some use per-package APIs where each package’s metadata is fetched separately – this avoids downloading everything, but can result in many sequential HTTP requests during dependency resolution. Sharded repodata offers a middle ground: you fetch only the packages you need, but can batch-fetch related dependencies in parallel, reducing both bandwidth and request overhead.

For teams building internal package repositories, the lesson is architectural: design your metadata layer to scale independently of your package count. Whether you choose per-package APIs, sharded indexes, or another approach, the alternative is watching your build times grow with every package you add.

Trying It Yourself

Pixi already has support for sharded repodata with the conda-forge channel, which is included by default. Just use pixi normally and you’re already benefiting from it.

If you use conda with conda-forge, you can enable sharded repodata support:

conda install --name base 'conda-libmamba-solver>=25.11.0'
conda config --set plugins.use_sharded_repodata true

The feature is in beta for conda and the conda maintainers are collecting feedback before general availability. If you encounter issues, the conda-libmamba-solver repository on GitHub is the place to report them.

For everyone else, the takeaway is simpler: when your tooling feels slow, look at the metadata layer. The packages themselves may not be the bottleneck. The index often is.


The owner of Towards Data Science, Insight Partners, also invests in Anaconda. As a result, Anaconda receives preference as a contributor.
