
Git as Database Fails Package Managers – The Pattern is Clear

GitHub Just Told NixOS: Your Git Repo is Breaking Our Infrastructure

In November 2025, GitHub contacted the NixOS team with a problem: their Nixpkgs repository was causing infrastructure failures. Periodic maintenance jobs were crashing, and GitHub’s replicas couldn’t achieve consensus. The culprit? An 83GB package manager repository with 500,000 tree objects and 20,000 forks that had grown beyond what Git-as-a-database can handle.

This isn’t an isolated incident. It’s the latest chapter in a pattern that has repeated across at least six major package managers, including Cargo, Homebrew, CocoaPods, vcpkg, and Go modules, all of which tried using Git as a database for package metadata. They all hit the same scaling wall, and they all ended up migrating to HTTP-based solutions after performance collapsed.

The Graveyard: Six Package Managers That Ditched Git

The progression is consistent. Cargo started with a Git repository for the crates.io index, which users cloned in full just to resolve dependencies. RFC 2789 defined a sparse HTTP protocol that lets Cargo fetch only the metadata it needs, and that protocol reached stable Cargo in 2023. By 2025, 99% of crates.io traffic used sparse HTTP instead of Git. Git didn’t lose a debate; it lost in practice.
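To make “fetch only the metadata it needed” concrete, here is a minimal sketch of how a client could map a crate name to a single file in the sparse index. The directory layout follows Cargo’s documented index format; the function names are made up for this illustration, and the sketch builds the URL without performing any network request.

```python
# Sketch: map a crate name to its file in a Cargo-style sparse index.
# The layout (1/, 2/, 3/<first char>/, <first two>/<next two>/) follows
# Cargo's documented index format. The function names are hypothetical.

SPARSE_INDEX_BASE = "https://index.crates.io"


def sparse_index_path(crate_name: str) -> str:
    """Return the relative index path for a crate name."""
    name = crate_name.lower()  # index file paths are lowercase
    if not name:
        raise ValueError("crate name must not be empty")
    if len(name) == 1:
        return f"1/{name}"
    if len(name) == 2:
        return f"2/{name}"
    if len(name) == 3:
        return f"3/{name[0]}/{name}"
    return f"{name[:2]}/{name[2:4]}/{name}"


def sparse_index_url(crate_name: str, base: str = SPARSE_INDEX_BASE) -> str:
    """Return the URL a client would request for one crate's metadata."""
    return f"{base}/{sparse_index_path(crate_name)}"


if __name__ == "__main__":
    for crate in ("serde", "syn", "io"):
        print(crate, "->", sparse_index_url(crate))
```

Resolving 50 dependencies this way costs roughly 50 small, cacheable HTTP requests rather than a clone of the whole index.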

Homebrew faced the same crisis. Users were downloading 331MB just to update package lists, and the .git folder approached 1GB on many machines. Every brew update meant waiting for Git to grind through delta resolution. GitHub explicitly asked Homebrew to stop using shallow clones because of the infrastructure costs. In February 2023, Homebrew 4.0.0 switched to JSON downloads. The reasoning was blunt: “They are expensive to git fetch and GitHub would rather we didn’t do that.”

CocoaPods hit filesystem limits with 16,000+ pod directories crammed into a single Git repository, and installations slowed to a crawl. Version 1.8 migrated to a CDN hosted on Netlify, and installs became “nearly instant for new setups.” Go modules went from 18-minute dependency resolution to 12 seconds by adding a module proxy layer, which eliminated the need to clone entire repositories just to read a single go.mod file. The Grab Engineering team documented a 90x performance improvement just by not using Git for client access.
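As an illustration of reading a single go.mod file without cloning, the sketch below builds the URL a client could request from a Go module proxy. It assumes the GOPROXY URL layout (base/module/@v/version.mod) and its rule of encoding each uppercase letter as “!” plus the lowercase letter. The module path example.com/Acme/widgets is hypothetical, and the helper names are invented for this example.

```python
# Sketch: build the module-proxy URL for one go.mod file.
# Assumes the GOPROXY layout <base>/<module>/@v/<version>.mod and its
# case-encoding rule. Helper names and the example module are hypothetical.

GO_PROXY_BASE = "https://proxy.golang.org"


def escape_for_proxy(text: str) -> str:
    """Encode uppercase ASCII letters as '!' followed by the lowercase letter."""
    return "".join("!" + ch.lower() if "A" <= ch <= "Z" else ch for ch in text)


def go_mod_url(module_path: str, version: str, base: str = GO_PROXY_BASE) -> str:
    """Return the URL of the go.mod file for one module version."""
    return f"{base}/{escape_for_proxy(module_path)}/@v/{escape_for_proxy(version)}.mod"


if __name__ == "__main__":
    # Hypothetical module, used only to show the escaping.
    print(go_mod_url("example.com/Acme/widgets", "v1.2.3"))
```

The point of the design is that one small file is addressed directly by URL, so dependency resolution never needs the repository history.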

The pattern is clear: Git works fine at 100 packages, breaks at 10,000+, and forces a painful migration. The question isn’t whether it will fail but when.

Why Git Fails as a Database: Architecture Mismatch at Scale

Git was designed for distributed version control, not fast metadata queries, and the architectural mismatch becomes obvious at scale. Git’s protocol is built around full-document syncing: clone the entire repository, then pull updates as compressed delta packs. Package managers need targeted queries instead: “Give me metadata for these 50 specific packages.” Using Git as a database for this is like using a fire hose to fill a water glass.

Git also lacks the fundamental features databases provide. There are no CHECK constraints to validate version numbers, no UNIQUE constraints to prevent duplicate package IDs, and no indexes for fast queries like “find all packages by this author.” As a result, package managers end up building custom validation layers, essentially reimplementing database functionality on top of a version control system. That is inefficient and error-prone.
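For comparison, this is roughly what those three features look like when a database provides them. It is a minimal sketch using Python’s built-in sqlite3 module; the table layout, the deliberately loose version pattern, and the package names are assumptions made for the example, not any registry’s real schema.

```python
# Sketch: CHECK, UNIQUE, and an index in SQLite.
# The schema and the loose "digits.digits.digits" pattern are illustrative
# assumptions, not the schema of any real package registry.
import sqlite3


def make_registry() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE packages (
            name    TEXT NOT NULL,
            version TEXT NOT NULL
                    CHECK (version GLOB '[0-9]*.[0-9]*.[0-9]*'),
            author  TEXT NOT NULL,
            UNIQUE (name, version)
        )
        """
    )
    # Index that serves "find all packages by this author".
    conn.execute("CREATE INDEX idx_packages_author ON packages (author)")
    return conn


def try_publish(conn: sqlite3.Connection, name: str, version: str, author: str) -> bool:
    """Insert one release; return False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO packages (name, version, author) VALUES (?, ?, ?)",
                (name, version, author),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def packages_by_author(conn: sqlite3.Connection, author: str) -> list[str]:
    rows = conn.execute(
        "SELECT DISTINCT name FROM packages WHERE author = ? ORDER BY name",
        (author,),
    )
    return [row[0] for row in rows]
```

In a Git-backed index, each of these rules has to be enforced by custom CI scripts or review tooling instead of by the storage layer itself.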

Filesystem limitations compound the problem. CocoaPods hit directory size limits on some filesystems when the repository grew past 16,000 pod directories. Cross-platform path issues cause headaches too: Windows has a 260-character path limit that breaks deeply nested package structures, and case sensitivity differences between macOS, Windows, and Linux create subtle bugs that are hard to reproduce.
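The two cross-platform hazards above are the kind of thing a Git-backed registry ends up checking by hand. The sketch below shows one way such a check might look; the function names, the checkout root, and the example paths are hypothetical.

```python
# Sketch: flag repository paths that are risky on some platforms.
# Function names and example paths are hypothetical.
from collections import defaultdict

WINDOWS_MAX_PATH = 260  # classic limit; it includes the terminating NUL


def find_case_collisions(paths: list[str]) -> list[list[str]]:
    """Group paths that differ only by letter case.

    Such paths are distinct on a case-sensitive filesystem but clash on
    a case-insensitive one.
    """
    groups: dict[str, set[str]] = defaultdict(set)
    for path in paths:
        groups[path.casefold()].add(path)
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)


def find_long_paths(
    paths: list[str], checkout_root: str, limit: int = WINDOWS_MAX_PATH
) -> list[str]:
    """Return paths whose full checkout path would reach the length limit."""
    too_long = []
    for path in paths:
        full_length = len(checkout_root) + 1 + len(path)  # +1 for the separator
        if full_length >= limit:  # the NUL terminator needs one slot as well
            too_long.append(path)
    return too_long
```

A database-backed registry sidesteps both checks, because package names live in rows rather than in directory entries.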

Andrew Nesbitt, who analyzed this pattern across multiple package managers using Git as a database, put it bluntly: “Databases have migrations for schema changes; Git has ‘rewrite history and force everyone to re-clone.’”

The Community is Split (But the Pattern Speaks For Itself)

The Hacker News discussion on this topic (208 points, 123 comments) reveals sharp disagreement about whether Git is actually to blame. kibwen, a Rust community member, argues that Git’s data model (content-addressed Merkle trees) is sound and that the problem is the protocol. If Git had HTTP-based selective fetching like Cargo’s sparse index, the argument goes, the performance issues would disappear.

Others point to successful implementations as proof Git can work. MacPorts has used a Git backend with rsync-distributed indices since 2002. Julia’s package registry handles 95,000 packages using Git with heuristics to prevent excessive re-downloads. The argument: these failures are implementation issues, not fundamental Git limitations.

However, the counterexamples undermine this defense. MacPorts doesn’t use raw Git for client access; it generates indices and distributes them via rsync. Julia likewise relies on workarounds and caching layers. Every “successful” implementation turns out to be a hybrid: Git for version control and contribution workflow, HTTP or rsync for actual package metadata access.

The empirical evidence is overwhelming. Six major package managers tried Git. All six hit scaling problems. All six migrated to HTTP-based solutions. Whether it’s Git’s “fault” is academic. The pattern shows it doesn’t work for this use case at scale.

What Actually Works: HTTP, Databases, or Hybrid Layers

PyPI, npm, and RubyGems use dedicated databases behind HTTP APIs (PostgreSQL for PyPI and RubyGems; CouchDB, historically, for npm). This gives them query indexing and, with a relational store, native constraint enforcement and schema migration tools. The downside is cost: you need to run and maintain server infrastructure. For large registries with funding, though, this is the right choice.

Cargo and Go modules took the hybrid approach: Git remains the source of truth for contribution workflow (pull requests, version history), but clients access package metadata via HTTP proxies or sparse index servers. This maintains Git’s collaboration benefits while solving the performance problem. The tradeoff is complexity, since you’re running two systems instead of one.

CocoaPods and Homebrew went for static file generation with CDN distribution: generate JSON index files from the Git repository, host them on Netlify or a similar CDN, and serve them over HTTP. It’s cheap, fast, and requires no server logic. The limitation is flexibility, because complex queries are hard to implement on static files.
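The static-file approach can be sketched in a few lines. The example below is an assumption-laden illustration, not how CocoaPods or Homebrew actually generate their indexes: it takes package metadata that has already been read from the repository (a plain dict here), assumes package names are safe to use as file names, and writes one JSON file per package plus a list of all names. The output layout and function name are invented for the example.

```python
# Sketch: generate static JSON index files suitable for CDN hosting.
# The directory layout, file names, and input shape are assumptions for
# illustration. Package names are assumed to be filename-safe.
import json
from pathlib import Path


def write_static_index(packages: dict[str, list[dict]], out_dir: Path) -> list[Path]:
    """Write one JSON file per package plus a list of all package names."""
    package_dir = out_dir / "packages"
    package_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for name in sorted(packages):
        target = package_dir / f"{name}.json"
        document = {"name": name, "versions": packages[name]}
        target.write_text(json.dumps(document, sort_keys=True), encoding="utf-8")
        written.append(target)

    # Listing file so clients can discover names without a directory scan.
    listing = out_dir / "all_packages.json"
    listing.write_text(json.dumps(sorted(packages)), encoding="utf-8")
    written.append(listing)
    return written
```

A CI job could run a step like this after every merge and upload the output directory to the CDN, which keeps Git in the contribution workflow and out of the client’s path. Anything beyond lookup by name, such as search by author, would need extra precomputed files.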

The pragmatic lesson: if you’re building a new package manager, don’t repeat this mistake. Start with HTTP and a database if you have funding. Use static CDN-hosted files if you’re bootstrapping on a budget. Use Git for the contribution workflow only, as a backend and never as a client-facing access layer. Don’t assume you’ll “implement Git correctly” where six others failed. The pattern proves otherwise.

ByteBot
