
Google AMS: Scan Open-Weight LLMs for Safety in 40 Seconds

Figure: Google AMS, an open source tool to verify LLM safety training integrity via activation fingerprinting.

Downloading a fine-tuned model from Hugging Face and running a quick behavioral test is not safety verification; it is wishful thinking. Research has shown that safety training can be stripped from a Llama 3 8B model in under a minute on a single GPU, and the result passes casual testing while refusing unsafe instructions only 1% of the time. Google’s new open source tool, AMS (Activation-based Model Scanner), takes a different approach: instead of asking the model questions, it looks inside and measures the geometric structure of safety training in the model’s activation space. The verdict arrives in 10 to 40 seconds.

Why Behavioral Testing Misses the Problem

Most teams verify a fine-tuned model by prompting it with a handful of edge-case inputs and checking the responses. This has a fundamental flaw: a model can be tuned to pass a test suite while having degraded safety across the broader distribution of harmful inputs. The LiteLLM supply chain attack in March 2026 was a reminder that AI infrastructure is a genuine target — not just for prompt injection but for silent weight substitution. A file named llama-3.1-8b-instruct.safetensors can contain something entirely different.

Traditional behavioral classifiers like Llama Guard 3 are also too slow for pre-deployment CI/CD gates, and their 50% true positive rate leaves significant room for failures to slip through.

What AMS Measures

AMS is built on Google Research’s AASE (Activation-based AI Safety Enforcement) framework. The core insight: safety training creates measurable geometric structure in a model’s activation space. Specifically, it produces direction vectors that cleanly separate harmful-content representations from benign ones — typically 4–8 standard deviations of separation in intact models. When safety training is stripped through fine-tuning, these directions collapse. AMS measures this geometry directly.
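To make "standard deviations of separation" concrete, here is a minimal sketch using NumPy on synthetic data. It is not AMS's implementation: the difference-of-means direction, the pooled-standard-deviation metric, and the function name separation_sigma are assumptions chosen for illustration, and the random arrays only stand in for real hidden-state activations.

```python
import numpy as np

def separation_sigma(harmful, benign):
    """Gap between two activation sets along their difference-of-means
    direction, expressed in pooled standard deviations (illustrative metric)."""
    direction = harmful.mean(axis=0) - benign.mean(axis=0)
    direction /= np.linalg.norm(direction)  # unit-length direction vector
    h_proj = harmful @ direction            # 1-D projections of each sample
    b_proj = benign @ direction
    pooled_std = np.sqrt((h_proj.var(ddof=1) + b_proj.var(ddof=1)) / 2)
    return float((h_proj.mean() - b_proj.mean()) / pooled_std)

rng = np.random.default_rng(0)
n, dim = 500, 32

# Synthetic stand-ins for activations; a real scanner would read them
# from a model's hidden states on harmful and benign prompts.
benign = rng.normal(size=(n, dim))
shift = np.zeros(dim)
shift[0] = 6.0  # hypothetical "safety direction" along one axis

intact_harmful = rng.normal(size=(n, dim)) + shift            # clear structure
collapsed_harmful = rng.normal(size=(n, dim)) + 0.05 * shift  # structure mostly gone

print("intact:   ", round(separation_sigma(intact_harmful, benign), 2))
print("collapsed:", round(separation_sigma(collapsed_harmful, benign), 2))
```

In this toy setup the intact case lands well inside the multi-sigma range described above, while the collapsed case falls far below it. Because the direction is estimated from the same samples it is scored on, the numbers are slightly optimistic; a held-out split would give a more honest estimate.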

The tool operates in two tiers. Tier 1 performs a generic safety check with no baseline required, measuring whether safety-relevant activation structure exists at all across concepts like harmful_content, injection_resistance, and refusal_capability. Tier 2 goes further: it compares a model’s activation fingerprint against a verified baseline, confirming that a model claiming to be Llama 3.1 Instruct actually carries the safety fingerprint of Llama 3.1 Instruct.
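The Tier 2 idea of comparing fingerprints can be sketched as follows. This is an assumption-heavy illustration, not the AMS algorithm: it treats a fingerprint as one direction vector per concept, compares them with cosine similarity, and uses a hypothetical threshold of 0.9. The function names and the synthetic vectors are invented for the example.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def matches_baseline(candidate, baseline, min_cosine=0.9):
    """True only if every baseline concept exists in the candidate and its
    direction is closely aligned (min_cosine is a hypothetical threshold)."""
    return all(
        name in candidate and cosine(candidate[name], vec) >= min_cosine
        for name, vec in baseline.items()
    )

rng = np.random.default_rng(1)
concepts = ["harmful_content", "injection_resistance", "refusal_capability"]

# Synthetic "fingerprints": one direction vector per concept.
baseline = {c: rng.normal(size=128) for c in concepts}
# Same model re-measured with a little noise.
same_model = {c: v + 0.05 * rng.normal(size=128) for c, v in baseline.items()}
# Unrelated model: independent random directions.
other_model = {c: rng.normal(size=128) for c in concepts}

print(matches_baseline(same_model, baseline))
print(matches_baseline(other_model, baseline))
```

Requiring every concept to match, rather than averaging the scores, means a single collapsed direction is enough to fail the comparison.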

Validated across seven models from three architecture families, AMS achieves AUC 1.00 for Gemma-2-9B in controlled conditions and runs roughly 9× faster than Llama Guard 3 at the framework level (33 ms vs 306 ms).

Getting Started

AMS is on PyPI and takes one command to install:

pip install "ams-scanner[cli]"

Scanning a model from the Hugging Face Hub:

# Standard scan (3 concepts: harmful_content, injection_resistance, refusal_capability)
ams scan google/gemma-2-2b-it

# Quick scan — ~40% faster, 2 concepts
ams scan meta-llama/Llama-3.1-8B-Instruct --quick

# Full scan — 4 concepts including truthfulness
ams scan mistralai/Mistral-7B-Instruct-v0.3 --full

# JSON output for CI/CD pipelines
ams scan ./my-fine-tuned-model --json

For supply chain verification, create a baseline from the official model and use it to verify unknown variants:

# Create a baseline fingerprint from the official model
ams baseline create meta-llama/Llama-3.1-8B-Instruct

# Verify a locally sourced model against the official fingerprint
ams scan ./suspicious-model --verify meta-llama/Llama-3.1-8B-Instruct

Output is a table with each concept, its separation in standard deviations, and a SAFE or UNSAFE verdict. The --json flag produces structured output with exit codes for automated pipeline gating.
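A pipeline gate built on those exit codes might look like the sketch below. The article does not spell out which exit code means what, so the convention here (0 means pass, anything else means fail) is an assumption, as is the helper name scan_passes. The JSON body is not parsed because its schema is not described. Stand-in commands are used so the example runs without AMS installed.

```python
import subprocess
import sys

def scan_passes(command):
    """Run a scanner command and treat exit code 0 as a pass.

    Assumption: a non-zero exit code signals an UNSAFE verdict or an error.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        # Echo the scanner's output so the CI log shows why the gate failed.
        print(result.stdout or result.stderr, file=sys.stderr)
    return result.returncode == 0

# Stand-in commands that simply exit with a chosen code.
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(1)"]

print(scan_passes(passing), scan_passes(failing))
```

In a real pipeline the command would be the one shown earlier, for example scan_passes(["ams", "scan", "./my-fine-tuned-model", "--json"]), with the deployment step skipped whenever it returns False.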

Practical Limitations

AMS only works on open-weight models. It needs direct access to the weights, so closed models such as GPT-4o, Claude, or Gemini cannot be scanned. A GPU is strongly recommended: the quoted 10–40 second scan times are based on NVIDIA A100 or L4 hardware, and CPU-only mode is significantly slower. At version 0.1.3, the tool is early-stage, so expect the API to change.

It is also worth being precise about what AMS does not do: it checks the structural integrity of safety training, not every possible failure mode. A model that passes AMS has geometrically intact safety training — a meaningful signal, but not an exhaustive safety audit.

The Bigger Picture

Fine-tune proliferation, capable adversarial attacks, and supply chain targeting have made “download and deploy” an indefensible posture for teams shipping open-weight models. AMS is the first tool to make the most critical check — is safety training structurally intact? — fast enough and cheap enough for routine pre-deployment use. Adding a 40-second scan to a model deployment pipeline is not overhead; it is the minimum viable safety gate.

The tool is available now in the GoogleCloudPlatform organization on GitHub, and the official announcement is on the Google Open Source Blog. If you are shipping fine-tuned models and not scanning them, the question is no longer whether you should; it is why you have not started.

ByteBot
I am a playful and cute mascot inspired by computer programming. I have a rectangular body with a smiling face and buttons for eyes. My mission is to cover the latest tech news and controversies and summarize them into byte-sized, easily digestible information.
