
GPT-5.5 Batch and Flex: Cut Your API Bill in Half


GPT-5.5 is the most capable model OpenAI has shipped to developers. It is also the most expensive, landing at $5 per million input tokens and $30 per million output. If you have priced out migrating a high-volume pipeline from GPT-5.4, you have already felt the sting. What most developers have not found yet: two pricing tiers — Batch and Flex — cut those rates exactly in half with zero quality compromise. Both have been in the OpenAI API docs since GPT-5.5 launched on April 24. Most teams are not using them.

The Pricing Gap

GPT-5.5 standard pricing is double GPT-5.4 on a per-token basis ($5/$30 versus $2.50/$15 per million tokens). OpenAI argues the model is more token-efficient and completes tasks with fewer retries, which narrows the effective cost gap; independent benchmarking puts the real premium at around 20% over GPT-5.4, not 100%. That is a reasonable argument for complex reasoning tasks. It is less convincing for bulk offline workloads where any model would get the job done given enough time.

For those workloads, the discount tiers exist:

Tier      Input (per 1M tokens)  Output (per 1M tokens)  Latency
Priority  $12.50                 $75.00                  Queue jump, fastest
Standard  $5.00                  $30.00                  Normal (seconds)
Flex      $2.50                  $15.00                  Seconds to minutes
Batch     $2.50                  $15.00                  Up to 24 hours

Both Flex and Batch price at $2.50/$15, exactly what GPT-5.4 standard costs. If a workload can tolerate anything slower than real-time latency, you are effectively running GPT-5.5 quality at GPT-5.4 prices.

Batch API: The Async Route

The Batch API accepts a JSONL file of requests, processes them asynchronously, and delivers results within 24 hours — often in 1 to 6 hours depending on queue depth. It runs on a separate rate limit pool, which means batch jobs do not consume your standard per-model rate limit headroom. That secondary benefit matters if your synchronous calls are already rate-limited.

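import time

from openai import OpenAI

client = OpenAI()

# Upload your JSONL file (the context manager closes the handle afterwards)
with open("requests.jsonl", "rb") as f:
    batch_file = client.files.create(file=f, purpose="batch")

# Submit the batch job
batch_job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# Poll until the job reaches a terminal state, then retrieve results
terminal_states = {"completed", "failed", "expired", "cancelled"}
batch = client.batches.retrieve(batch_job.id)
while batch.status not in terminal_states:
    time.sleep(60)  # jobs take minutes to hours, so poll sparingly
    batch = client.batches.retrieve(batch_job.id)

if batch.status == "completed":
    # The output file is JSONL: one result per line, matched by custom ID
    results = client.files.content(batch.output_file_id).content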

Each line in your JSONL file is a self-contained request: a custom ID, method, endpoint, and body. Batch supports tool use, vision, JSON mode, and prompt caching. It does not support streaming or multi-turn within a single request.
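As a rough illustration of that per-line structure, the sketch below builds a requests file using only the standard library. The helper names (`build_batch_line`, `write_batch_file`) and the `req-N` ID scheme are made up for this example. Note that the endpoint goes in a field named `url` in the batch input format.

```python
import json


def build_batch_line(custom_id, prompt, model="gpt-5.5"):
    """Return one JSONL line describing a single Chat Completions request."""
    request = {
        "custom_id": custom_id,         # your own key for matching results later
        "method": "POST",
        "url": "/v1/chat/completions",  # should match the endpoint given to batches.create
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    }
    return json.dumps(request)


def write_batch_file(path, prompts):
    """Write one request per line and return how many were written."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for i, prompt in enumerate(prompts):
            # "req-N" is an illustrative ID scheme, not a required format
            f.write(build_batch_line(f"req-{i}", prompt) + "\n")
            count += 1
    return count
```

Calling `write_batch_file("requests.jsonl", ["Classify: ...", "Classify: ..."])` would produce a file suitable for the upload step, assuming each prompt fits in a single user message.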

Use Batch for: evaluations, classification pipelines, nightly report generation, backfills, data cleanup, embedding generation, and content moderation queues. Do not use it for chatbots, live coding agents, or anything a user is actively waiting on.

Flex: The Synchronous Option

Flex is the more immediately applicable option for most developers. It is a single parameter change on any standard Chat Completions or Responses API call:

from openai import OpenAI

client = OpenAI(timeout=900.0)  # Increase timeout to 15 minutes

response = client.chat.completions.create(
    model="gpt-5.5",
    messages=[{"role": "user", "content": "Summarize this document..."}],
    service_tier="flex",
)

The response is synchronous — you get a result when it finishes, not a job ID to poll. The catch is latency: Flex responses arrive in anywhere from a few seconds to several minutes depending on current load. The default SDK timeout is 10 minutes, which can be insufficient for complex tasks on a loaded system. Set it to 15 minutes (900 seconds) as shown above or you will see spurious timeout errors.
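Because Flex latency varies, one plausible pattern is to retry on the cheap tier a few times and then fall back to Standard. The sketch below is a minimal version of that idea, not an official recipe. `send` is a hypothetical callable you would write around your own API call, and using `"default"` as the `service_tier` value for Standard is an assumption to verify against the current API reference.

```python
import time


def call_with_flex_fallback(send, max_flex_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Try the Flex tier with exponential backoff, then fall back to Standard.

    `send` is a hypothetical callable, send(service_tier) -> response, that
    raises TimeoutError when the request times out or capacity is unavailable.
    """
    for attempt in range(max_flex_attempts):
        try:
            return send("flex")
        except TimeoutError:
            if attempt < max_flex_attempts - 1:
                # Wait 1s, 2s, 4s, ... before retrying on the cheaper tier
                sleep(base_delay * 2 ** attempt)
    # Flex kept failing: pay the Standard rate rather than drop the request.
    # "default" as the Standard tier name is an assumption, see the note above.
    return send("default")
```

With the real SDK, `send` would wrap `client.chat.completions.create(..., service_tier=tier)` and translate the SDK's timeout exception into `TimeoutError`. Whether a fallback is worth it depends on how much a late result costs compared with a Standard-priced one.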

Flex works well for: backend document processing, agent sub-tasks where no user is waiting, pipeline stages that run overnight, and research jobs where a few minutes of latency is acceptable. It is not appropriate for user-facing interfaces where response time is visible.

Which Tier to Use

The decision comes down to two questions: does the caller need a synchronous response, and can it tolerate variable latency?

  • Need a response within seconds because a user is waiting? Standard tier.
  • Need the absolute fastest at a premium? Priority tier ($12.50/$75 per million tokens).
  • Synchronous but can wait minutes? Flex — one parameter, 50% off.
  • Fully async, fine waiting hours? Batch — same price, separate rate limits.

Most production systems have workloads in all four categories. The teams leaving money on the table are the ones running everything on Standard because they never audited which calls actually need real-time responses.

The Routing Play

The highest-leverage move is not switching a single endpoint — it is building a tier router. A short function that classifies each request by urgency and assigns a service tier will save more than any amount of prompt engineering. Tag internal evaluations, backfill jobs, and scheduled analytics as Batch. Tag background agent tasks as Flex. Reserve Standard for user-facing calls. Priority only for SLA-critical enterprise scenarios where queue time translates to real revenue risk.
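A tier router along these lines might look like the sketch below. Everything in it is illustrative: the `Job` fields, the `BATCH_KINDS` labels, and the `Tier` names are assumptions chosen to mirror the tagging rules above, and `"batch"` is a routing label meaning "queue for the Batch API", not a `service_tier` value.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(str, Enum):
    PRIORITY = "priority"
    STANDARD = "default"   # assumed service_tier name for the Standard tier
    FLEX = "flex"
    BATCH = "batch"        # routing label only: send these to the Batch API


@dataclass
class Job:
    kind: str                   # e.g. "evaluation", "agent_subtask", "chat"
    user_waiting: bool = False  # is a person actively waiting on the result?
    sla_critical: bool = False  # does queue time carry real revenue risk?


# Hypothetical labels for work that can wait hours
BATCH_KINDS = {"evaluation", "backfill", "scheduled_analytics"}


def route(job):
    """Pick the cheapest tier that still meets the job's urgency."""
    if job.sla_critical:
        return Tier.PRIORITY
    if job.user_waiting:
        return Tier.STANDARD
    if job.kind in BATCH_KINDS:
        return Tier.BATCH
    # Everything else runs in the background, so minutes of latency are fine
    return Tier.FLEX
```

The ordering encodes the design choice: urgency checks come first so that a mislabeled job errs toward the faster, more expensive tier instead of making a user wait.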

At 10 million output tokens per month on Standard, you pay $300 for output alone. At the same volume on Batch or Flex, you pay $150. That is $1,800 saved per year at modest scale, and $18,000 at 100M output tokens per month. The Flex processing documentation and the Batch API cookbook have everything you need to implement both in an afternoon.
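The arithmetic is easy to rerun for your own volumes. The sketch below hard-codes the per-million-token prices from the tier table. The function name and the choice to count tokens in millions are conveniences for this example.

```python
# USD per 1M tokens as (input, output), taken from the tier table in this article
PRICES = {
    "priority": (12.50, 75.00),
    "standard": (5.00, 30.00),
    "flex": (2.50, 15.00),
    "batch": (2.50, 15.00),
}


def monthly_cost(tier, input_tokens_m, output_tokens_m):
    """Monthly cost in USD, with token volumes given in millions."""
    input_price, output_price = PRICES[tier]
    return input_tokens_m * input_price + output_tokens_m * output_price


# Reproduce the example: 10M output tokens per month, input ignored
standard = monthly_cost("standard", 0, 10)
flex = monthly_cost("flex", 0, 10)
annual_savings = (standard - flex) * 12
print(standard, flex, annual_savings)  # 300.0 150.0 1800.0
```

Plugging in real input volumes as well gives a fuller picture, since input tokens are discounted by the same 50%.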

Start Here

Audit your current OpenAI API usage. Identify any calls where the result does not need to arrive in under 5 seconds. Those are Flex or Batch candidates. Add service_tier="flex" to background processing endpoints first — it is a one-line change. For overnight pipelines, move to Batch. Both changes are reversible in minutes. The savings are not.

ByteBot
