Batch API Throughput Planner

Plan your Claude Batch API workloads with queue visualization, completion time estimation, and production-ready code generation. Configure batch sizes, concurrency, and retry strategies for optimal throughput.

By Michael Lip · May 25, 2026

Batch Configuration

10,000
Max 100,000 per batch. Smaller batches complete faster.
5s
1

Retry Strategy

Total Batches
1
Est. Time
2h
Total Cost
$90

Generated Code

Queue Visualization

$90.00
Saved vs real-time API (50% batch discount)
Batch Cost
$90
Real-time Cost
$180
Failed Reqs
100

What This Planner Does

The Batch API Throughput Planner helps you design and optimize batch processing workloads for the Claude Batch API. Instead of guessing at batch sizes, completion times, and costs, you configure your workload parameters and the planner calculates the optimal configuration. It shows you how many batches you need, how long processing will take, what it will cost, and how much you save compared to real-time API calls. The queue visualization displays batch progression so you can understand the processing timeline.

The planner also generates production-ready code in Python and JavaScript that handles the complete batch lifecycle: creating the JSONL input file, uploading the batch, polling for completion, downloading results, parsing successes and failures, and retrying failed requests with configurable backoff strategies. The generated code is not a simple example but a complete implementation you can adapt to your production pipeline with minimal changes.

Understanding the Claude Batch API

The Claude Batch API is an asynchronous processing service designed for workloads that do not require real-time responses. You submit a batch of requests as a JSONL file, Anthropic processes them in the background, and you retrieve the results when processing is complete. The primary advantages over real-time API calls are cost (50% discount on all token prices) and throughput (no per-minute rate limits within a batch). The trade-off is latency: batches can take up to 24 hours to complete, though most finish much sooner.

Each line in the JSONL input file is a JSON object with a custom_id field (your unique identifier for the request) and a params object containing the standard Messages API parameters: model, max_tokens, messages, and optional fields like system, temperature, and tools. The custom_id is essential for matching results to requests, so use meaningful identifiers like database row IDs or document names rather than sequential numbers.

After uploading the JSONL file, you receive a batch ID and can poll the /v1/messages/batches/{batch_id} endpoint for status updates. The status transitions from in_progress to ended when all requests are processed. At that point, you download the results JSONL file where each line contains the custom_id, a result type (succeeded or errored), and either the complete response or error details. For monitoring batch costs alongside real-time usage, KickLLM provides comprehensive LLM cost dashboards.

Batch Size Optimization

The maximum batch size is 100,000 requests, but bigger is not always better. Smaller batches complete faster because Anthropic can parallelize processing more efficiently. A batch of 1,000 requests typically finishes in 10 to 30 minutes, while a batch of 100,000 requests may take 4 to 12 hours. If you need results as fast as possible, split your workload into smaller batches and submit them concurrently.

The optimal batch size depends on your latency requirements and workload characteristics. For time-sensitive workloads where you need all results within a few hours, use batches of 1,000 to 5,000 requests submitted concurrently. For overnight processing jobs where results can wait until morning, single large batches of 50,000 to 100,000 requests minimize management overhead. For continuous pipelines that process data throughout the day, use fixed-size batches of 5,000 to 10,000 requests submitted on a schedule.

The planner calculates how many batches you need based on your total request count and batch size setting. It also estimates completion time by modeling the processing pipeline: each batch has a startup overhead (typically 1 to 5 minutes), a processing phase proportional to the number and complexity of requests, and a finalization phase. Concurrent batches run in parallel, reducing total wall-clock time linearly with concurrency level.

Retry Strategy Design

Even well-formed batch requests can fail. Common failure modes include individual requests that violate content policy, requests with invalid parameters that pass client-side validation but fail server-side, transient infrastructure errors, and requests that exceed model-specific context window limits. The planner models your expected failure rate and calculates the number of retry batches needed to achieve full completion.

The exponential backoff strategy doubles the delay between retry batches: 60 seconds after the first failure, 120 seconds after the second, 240 after the third. This prevents overwhelming the API during outages and gives transient issues time to resolve. The linear backoff strategy increases delay by a fixed increment: 60, 120, 180 seconds. The fixed delay strategy uses the same wait time between every retry. Exponential backoff is the recommended default for production systems.

The generated code implements the complete retry loop: parse the results file, collect custom_ids of failed requests, build a new JSONL file with only those requests, wait according to the backoff strategy, and submit the retry batch. After exhausting the maximum retry count, remaining failures are logged with full error details for manual review. For applications where batch failures trigger alerts or automated remediation, InvokeBot provides webhook management tools.

Cost Analysis and Savings

The Batch API provides a flat 50% discount on all token prices. This discount applies to both input and output tokens and stacks with prompt caching discounts. If you use prompt caching within batch requests, you get the cache read discount (90% off input) combined with the batch discount (50% off the remaining cost), resulting in a 95% total discount on cached input tokens compared to standard real-time pricing.

The planner calculates three cost figures. The batch cost is what you actually pay using the Batch API at 50% discount pricing. The real-time cost is what the same workload would cost using standard synchronous API calls at full pricing. The savings is the difference, representing the exact dollar value of using batch processing instead of real-time. For a workload of 10,000 requests with 500 input tokens and 1,000 output tokens each on Sonnet 4, the batch cost is approximately $90 versus $180 real-time, saving $90.

The failure rate setting adjusts the cost estimate to account for retry overhead. With a 1% failure rate and 3 retries, you process approximately 1% extra requests across retry batches. The cost of retries is factored into the total estimate so the number you see reflects the realistic total cost including error handling. For tracking actual batch costs over time and comparing them to your estimates, use the Prompt Caching Calculator for cache optimization and KickLLM for overall cost monitoring.

Queue Visualization Explained

The queue visualization shows each batch as a row with a progress bar representing its lifecycle. Green bars indicate completed batches, purple bars indicate currently processing batches, and gray bars indicate queued batches waiting to start. The timeline shows the estimated start and end time for each batch based on your configuration parameters.

For concurrent batch configurations, multiple batches process simultaneously and the visualization reflects this parallelism. The estimated completion time accounts for concurrency: two concurrent batches each taking 2 hours complete in 2 hours total, not 4. The visualization helps you understand whether your configuration will meet your time requirements and where bottlenecks might occur.

Generated Code Architecture

The Python code uses the official anthropic SDK with async support for concurrent batch management. It creates a JSONL file from your request data, uploads it using the batch creation endpoint, polls for completion with exponential backoff, downloads and parses results, and handles retries automatically. The code is structured as a reusable class with configurable parameters matching the planner settings.

The JavaScript code uses the official @anthropic-ai/sdk package with async/await patterns. It follows the same architecture as the Python version: file creation, batch submission, status polling, result processing, and retry handling. Both implementations include proper error handling, progress logging, and graceful shutdown support. For API debugging during batch development, use the ClaudKit API Request Builder to test individual requests before including them in batches, and consult the API Error Guide for troubleshooting specific error codes.

Frequently Asked Questions

How does the Claude Batch API work?

The Claude Batch API allows you to send up to 100,000 requests in a single batch for asynchronous processing. You create a batch by uploading a JSONL file where each line is a request object with a custom_id and the standard messages API parameters. The API returns a batch ID which you poll for status. Processing typically completes within 24 hours and costs 50% less than real-time API calls. Results are available as a JSONL file download once the batch reaches "ended" status.

How much does the Claude Batch API cost compared to real-time?

The Batch API costs exactly 50% of standard real-time pricing for both input and output tokens. For Claude Sonnet 4, this means $1.50 per million input tokens (vs $3.00) and $7.50 per million output tokens (vs $15.00). For Opus 4, it is $7.50 input and $37.50 output. For Haiku 3.5, it is $0.40 input and $2.00 output. Prompt caching discounts can be combined with batch pricing for additional savings.

What is the maximum batch size for the Claude Batch API?

Each batch can contain up to 100,000 individual requests. Each individual request within the batch follows the same limits as real-time API calls. You can run multiple batches concurrently if you need to process more than 100,000 requests, though total throughput is subject to your account's rate limits.

How long does Claude Batch API processing take?

Anthropic guarantees batch processing completes within 24 hours, but most batches finish much sooner. Small batches (under 1,000 requests) typically complete in 10 to 30 minutes. Medium batches (1,000-10,000) usually take 1 to 4 hours. Large batches (10,000-100,000) may take 4 to 12 hours. Processing time depends on current system load, model complexity, and individual request sizes.

How do I handle failed requests in a Claude batch?

When a batch completes, the results file includes both successful and failed requests. Each result has a custom_id matching your input, a result type of "succeeded" or "errored", and either the response or error details. Best practice is to parse results, collect failed custom_ids, build a retry batch with only those requests, and submit as a new batch with exponential backoff between retries.

Developer and creator of the Zovo Tools network. Building free, privacy-first developer tools that run entirely in the browser. No tracking, no sign-ups, no server-side processing. Open source on GitHub.