Evo2 AI Writes Genomes: 90% Accuracy, GitHub Ready

Arc Institute and NVIDIA published Evo2 in Nature on March 4—a DNA foundation model that designs entire bacterial genomes and predicts disease-causing mutations with over 90 percent accuracy. Trained on 9.3 trillion nucleotides from more than 128,000 genomes across all domains of life, it’s the largest fully open-source AI model for biology. Code is on GitHub, integrated into NVIDIA’s BioNeMo framework, with full training data and model weights publicly available.

DNA is now programmable infrastructure. Developers can design synthetic genomes, predict BRCA1 breast cancer variants, and build targeted gene therapies—all from code. This isn’t just biology research. It’s the “GPT moment” for genomics, delivered with the kind of open-source transparency that AI companies stopped offering years ago.

Scale That Enables Emergence

Evo2 was trained on 9.3 trillion nucleotides from 128,000+ genomes spanning bacteria, archaea, and eukaryotes. For context, that is on the same order as the trillions of text tokens used to train frontier language models, although nucleotides and text tokens are not directly comparable. The 40-billion-parameter model uses the StripedHyena 2 architecture to process up to 1 million nucleotides in a single context window. Compared with Evo1, that is 30 times more training data and 8 times as many nucleotides reasoned over at once.
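The ratios quoted here imply some rough figures worth making explicit. The sketch below is only back-of-the-envelope arithmetic on the numbers in this article: the implied Evo1 values are derived from the stated ratios rather than taken from the Evo1 paper, and all variable names are illustrative.

```python
# Figures quoted in the article. The genome count is a lower bound ("128,000+").
EVO2_TRAINING_NUCLEOTIDES = 9_300_000_000_000  # 9.3 trillion
EVO2_GENOMES = 128_000
EVO2_CONTEXT_NUCLEOTIDES = 1_000_000
DATA_RATIO_VS_EVO1 = 30      # "30 times more training data than Evo1"
CONTEXT_RATIO_VS_EVO1 = 8    # "8 times as many nucleotides at once"

# Values implied by the ratios (assumption: the ratios are exact).
implied_evo1_training = EVO2_TRAINING_NUCLEOTIDES // DATA_RATIO_VS_EVO1
implied_evo1_context = EVO2_CONTEXT_NUCLEOTIDES // CONTEXT_RATIO_VS_EVO1

# Average training nucleotides contributed per genome.
avg_nucleotides_per_genome = EVO2_TRAINING_NUCLEOTIDES // EVO2_GENOMES

print(f"Implied Evo1 training data: {implied_evo1_training:,} nucleotides")
print(f"Implied Evo1 context:       {implied_evo1_context:,} nucleotides")
print(f"Average per genome:         {avg_nucleotides_per_genome:,} nucleotides")
```

Because the genome count is given as a lower bound, the per-genome average is an upper bound.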

That scale unlocks emergent capabilities. The model can design entire bacterial genomes, not just individual genes, and it predicts the effects of disease mutations with no task-specific training. Arc Institute’s lead researcher Patrick Hsu describes it as “an operating system kernel for synthetic biology”: developers build applications on top of it. The code is on GitHub with 88,000 downloads and 380 forks, and Hugging Face reports 6 million API requests for the 40B model since release.

90% Accuracy on Disease Mutations (Zero-Shot)

Evo2 achieves over 90 percent accuracy classifying BRCA1 gene variants as pathogenic versus benign, despite never being trained on labeled BRCA1 variant data. It was evaluated against a dataset of 3,000+ mutations from 2018 lab experiments and matched the experimental results without fine-tuning.
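The article does not describe how Evo2 turns a sequence into a pathogenic-or-benign call. One common zero-shot approach with sequence language models is to compare the model's likelihood of the variant sequence against the reference sequence, and to treat a large drop as a sign of a damaging change. The sketch below illustrates that idea only. It uses a toy k-mer Markov model as a stand-in, not Evo2 or its API, and the function names, toy sequences, and threshold are all hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def train_toy_model(sequences, order=2):
    """Fit a k-mer Markov model: a toy stand-in for a DNA language model."""
    next_counts, context_counts = Counter(), Counter()
    for seq in sequences:
        for i in range(order, len(seq)):
            context = seq[i - order:i]
            next_counts[(context, seq[i])] += 1
            context_counts[context] += 1
    return next_counts, context_counts, order

def log_likelihood(model, seq):
    """Sum of log P(base | previous `order` bases), with add-one smoothing."""
    next_counts, context_counts, order = model
    total = 0.0
    for i in range(order, len(seq)):
        context = seq[i - order:i]
        p = (next_counts[(context, seq[i])] + 1) / (
            context_counts[context] + len(ALPHABET)
        )
        total += math.log(p)
    return total

def delta_score(model, reference, position, alt_base):
    """Log-likelihood of the variant sequence minus that of the reference."""
    variant = reference[:position] + alt_base + reference[position + 1:]
    return log_likelihood(model, variant) - log_likelihood(model, reference)

def classify(score, threshold=-1.0):
    """Hypothetical decision rule: a large likelihood drop is called pathogenic."""
    return "pathogenic" if score < threshold else "benign"

# Toy data only: a repetitive "genome" and one reference window.
model = train_toy_model(["ACGT" * 8])
reference = "ACGTACGTACGT"

disruptive = delta_score(model, reference, 5, "T")  # breaks the learned pattern
silent = delta_score(model, reference, 5, "C")      # same base as the reference

print(classify(disruptive), classify(silent))
```

No labels are used to fit the scoring function, which is what makes the approach zero-shot. In a real evaluation, labeled variants such as the 2018 experimental set would only be used afterwards to measure how well the scores separate the two classes.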

It’s the only model that predicts both coding and noncoding mutations; DeepMind’s AlphaMissense, for example, can’t score noncoding regions. Evo2 is state-of-the-art for noncoding mutations and second-best for coding. Applications cited in Arc Institute’s “One Year Later” blog include Alzheimer’s genetic risk assessment and variant effects across domesticated animal species. Because the capability is zero-shot, it generalizes across genes and organisms without retraining.

From Bacteriophages to Targeted Gene Therapy

Researchers have used Evo2 to design functional synthetic bacteriophages, viruses that kill antibiotic-resistant bacteria. They have also created neuron-specific gene expression systems for targeted therapy and generated mitochondrial genomes. One application: “Design a genetic element that activates only in neurons to avoid side effects, or only in liver cells.” Generated proteins were validated with AlphaFold 3 structure predictions.

These aren’t theoretical capabilities. Arc Institute demonstrated synthetic bacteriophages aimed at the global problem of antibiotic resistance. Targeted gene therapy reduces side effects by limiting expression to specific tissues. Training across 128,000+ genomes surfaced cross-species patterns that would take years to find manually. As a result, developers can build tools for problems that previously lacked computational solutions.

Limitations and Biosecurity Concerns

While Evo2 can design genomes, scientists acknowledge “further advances are needed to write genomes that work inside living cells.” Nature’s commentary asked, “AI can write genomes—how long until it creates synthetic life?” The answer: not yet, but the trajectory is clear.

Arc Institute excluded human cell-infecting viral sequences from training data and calls for responsible AI biosecurity protocols. Nevertheless, the model raises questions parallel to AI safety debates around Anthropic’s Pentagon refusal and OpenAI’s governance battles. This is powerful technology with genuine dual-use concerns. Transparency about limitations builds credibility, but biosecurity frameworks lag behind capability development.

Evo2 vs AlphaFold: Different Layers

Evo2 and AlphaFold solve different problems. AlphaFold predicts 3D protein structure from amino acid sequences, one protein at a time. Evo2 works at the DNA level: designing genomic sequences, predicting mutation effects, and generating multi-gene systems with regulatory regions. The tools are complementary. The Evo2 team used AlphaFold 3 to validate generated protein structures, so a typical workflow is DNA design with Evo2 followed by protein structure validation with AlphaFold.
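The handoff between the two layers is a translation step: a DNA sequence has to become an amino acid sequence before a structure predictor can use it. The sketch below shows that step with the standard genetic code. It is a minimal illustration, not the Evo2 team's pipeline; the function name and example sequence are hypothetical, and no structure predictor is actually called.

```python
from itertools import product

# Standard genetic code in compact form: codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate_first_orf(dna):
    """Translate from the first ATG up to the first in-frame stop codon.

    Returns an empty string if no start codon is found. Codons containing
    characters other than A, C, G, T are rendered as "X".
    """
    dna = dna.upper()
    start = dna.find("ATG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*":  # stop codon ends the open reading frame
            break
        protein.append(amino_acid)
    return "".join(protein)

# Hypothetical generated DNA; the resulting protein string is what would be
# passed on to a structure predictor.
designed_dna = "GGATGGCCAAATAAGG"
protein_sequence = translate_first_orf(designed_dna)
print(protein_sequence)
```

This only handles the forward strand and the first open reading frame. Real genomes also carry genes on the reverse strand and in other frames, which a production pipeline would need to cover.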

Open Source Strategy While AI Closes

Arc Institute released everything: training data, code, and model weights, under the Apache License. NVIDIA BioNeMo provides fine-tuning tutorials, and Goodfire AI built a mechanistic interpretability visualizer. The release follows the open-source playbook familiar from GitHub at a time when AI companies are closing models and restricting access. Developers familiar with Hugging Face and NVIDIA tools can start working with DNA as programmable code without a biology PhD.

The irony is stark. Biology AI goes fully open source while language model companies retreat to proprietary APIs and capability restrictions. Evo2 demonstrates that breakthrough research and full transparency aren’t mutually exclusive. Whether that openness can be sustained as capabilities approach synthetic life creation remains an open question.

Key Takeaways

  • DNA is programmable infrastructure: GitHub repo, NVIDIA BioNeMo, Hugging Face models available now
  • Largest open-source bio-AI: 9.3 trillion nucleotides from 128,000+ genomes, 40B parameters
  • 90%+ disease mutation prediction (BRCA1) achieved zero-shot, no task-specific training
  • Real applications demonstrated: bacteriophage design, targeted gene therapy, mitochondrial genomes
  • Fundamental limitation: Can’t yet create genomes that function in living cells
  • Biosecurity protocols lag capability development—ethical frameworks still forming
ByteBot