Claude API Streaming Responses Guide

Learn how streaming works with an interactive demo, then generate production-ready streaming code in Python, JavaScript, and curl. Configure event handling, error recovery, and token counting options.

By Michael Lip · May 25, 2026

Interactive Streaming Demo

Watch text appear character-by-character, simulating how Claude streams tokens via Server-Sent Events. Adjust the speed slider to control the animation rate.

30ms
Characters: 0
Est. tokens: 0
Elapsed: 0.0s
Status: Ready

Streaming Code Generator

Configuration Builder

Select the features you need and get customized streaming code with those capabilities built in.

Options

What Is Streaming in the Claude API?

Streaming allows the Claude API to deliver response tokens incrementally as they are generated, rather than waiting for the entire response to complete before sending anything. This is implemented using Server-Sent Events (SSE), a standard HTTP protocol for server-to-client streaming over a single persistent connection. When you enable streaming, the API opens a long-lived connection and pushes events to your client as each token or group of tokens is produced by the model.

The practical impact is dramatic. Without streaming, a request that generates 2,000 tokens might take 10 to 15 seconds before you receive any data. With streaming, the first token arrives in 200 to 500 milliseconds, and subsequent tokens flow in continuously. The total generation time is the same, but the perceived latency drops by an order of magnitude. For user-facing applications such as chatbots, writing assistants, and code completion tools, streaming is essential for a responsive experience.

How Server-Sent Events Work

When you set "stream": true in your request body, the API responds with a Content-Type: text/event-stream header instead of application/json. The response body is a sequence of SSE events, each consisting of an event: type line and a data: payload line. The Claude API sends six distinct event types during a streaming response:

For most applications, you only need to handle three events: content_block_delta for displaying text, message_delta for usage stats, and checking for the presence of message_stop to confirm completion. The Python and JavaScript SDKs abstract this into convenient iterator patterns, but understanding the raw event structure is valuable for debugging and for implementations in other languages. For a visual way to experiment with these events, the ClaudKit playground provides an interactive testing environment.

Implementing Streaming in Python

The official anthropic Python SDK provides two approaches for streaming. The client.messages.stream() method returns a context manager that yields events, while passing stream=True to client.messages.create() returns a raw event stream. The recommended approach is client.messages.stream() because it handles connection management and provides helper methods for common patterns.

The basic pattern is straightforward: open a stream, iterate over text events, and process each chunk. The SDK handles SSE parsing, reconnection on transient errors, and proper resource cleanup. For production use, wrap the stream in error handling and add logging for the usage statistics from the final message_delta event. The Python SDK also supports async streaming with async_client.messages.stream(), which is essential for asyncio-based applications such as FastAPI or aiohttp backends.

Implementing Streaming in JavaScript

The @anthropic-ai/sdk JavaScript SDK provides async iterable streams. Call client.messages.stream() and iterate with a for await loop. Each iteration yields a stream event that you can inspect for its type and data. The SDK also provides convenience methods like stream.on('text', callback) for simplified text extraction. If you need to pipe the stream to multiple consumers, tools like ClaudFlow can help manage multi-step streaming workflows.

For browser-based applications, you cannot call the Claude API directly from the client due to CORS restrictions and the need to protect your API key. Instead, set up a server-side proxy (Node.js, Deno, or a serverless function) that streams from the Claude API and re-streams to the browser using SSE or WebSockets. The pattern is: browser connects to your server via SSE, your server connects to Claude via the SDK's streaming method, and each text delta is forwarded to the browser in real time.

Streaming with curl

For quick testing, you can stream Claude responses directly in the terminal with curl. Add "stream": true to the JSON body and use the -N flag to disable output buffering. The raw SSE events will print to your terminal as they arrive. This is useful for debugging the event format, testing network connectivity, and verifying that your API key and model parameters are correct before writing application code.

The curl output shows the raw SSE protocol: each event has an event: line with the type and a data: line with the JSON payload. Events are separated by blank lines. You can pipe curl output through jq or a custom parser to extract just the text content if you want a cleaner terminal experience.

Error Recovery and Production Patterns

Streams can fail mid-response due to network issues, API overload (529 errors), or timeout. Your error recovery strategy depends on whether the failure is transient or persistent. For transient failures (network blips, 529 overload), implement exponential backoff with jitter and retry the full request. For persistent failures (400 bad request, 401 auth error), do not retry and surface the error to the user. If you need to track error patterns across your API integrations, LochBot provides security and monitoring tools for API endpoints.

A robust production pattern includes: (1) set an explicit connection timeout so you do not wait indefinitely, (2) set an inter-event timeout that fires if no event arrives within a reasonable interval (30 seconds is a common choice), (3) accumulate partial responses so you can present what was received even if the stream fails, (4) log the message_delta usage statistics for cost tracking, and (5) implement a circuit breaker that stops retrying and falls back to a cached or default response after a configurable number of consecutive failures.

Token counting from the stream is straightforward. The message_start event includes usage.input_tokens, and the message_delta event includes usage.output_tokens. These are the authoritative values for billing and should be logged with every request. Do not estimate token counts from the text length of content_block_delta events, as the relationship between text characters and tokens varies by language and content type. For detailed cost analysis of your streaming usage, KickLLM offers LLM cost tracking tools that support token-level granularity.

When to Use Streaming vs Non-Streaming

Use streaming for any user-facing application where perceived responsiveness matters. Chatbots, writing assistants, code completion, and interactive Q&A tools should always stream. The time-to-first-token improvement transforms the user experience from "waiting for a loading spinner" to "watching the answer appear in real time."

Non-streaming is appropriate for batch processing, background jobs, and automation pipelines where the consumer is another program rather than a human. Non-streaming responses are simpler to handle (a single JSON object instead of an event stream) and are easier to log, cache, and retry. For testing API requests before implementing streaming, use the ClaudKit API Request Builder to generate and test non-streaming requests first, then add streaming once the basic integration works.

Anthropic also provides a Batch API for high-volume non-time-sensitive workloads at 50% of standard pricing. If you are processing thousands of requests and do not need results in real time, the Batch API is more cost-effective than either streaming or standard requests. For choosing the right model for your streaming workload, the Claude Model Picker can help you evaluate speed, cost, and capability tradeoffs.

Frequently Asked Questions

How do I enable streaming in the Claude API?

Add "stream": true to your request body. The API will respond with a stream of Server-Sent Events (SSE) instead of a single JSON response. In the Python SDK, use client.messages.stream() which returns an iterator. In JavaScript, use client.messages.stream() which returns an async iterable. In curl, add the -N flag for unbuffered output and parse the SSE events.

What events does the Claude streaming API send?

The Claude streaming API sends six event types: message_start (contains the message object with model and usage info), content_block_start (signals the beginning of a content block), content_block_delta (contains the actual text tokens as they are generated), content_block_stop (signals end of a content block), message_delta (contains stop_reason and final usage statistics), and message_stop (signals the stream is complete). Most applications only need to process content_block_delta for the text and message_delta for usage info.

How do I handle streaming errors and connection drops?

Implement retry logic with exponential backoff. If the connection drops before receiving a message_stop event, the response was interrupted. Store the partial response and decide whether to retry the full request or present the partial result. The Python SDK's stream() method handles basic connection management. For production, wrap the stream in a try/except block and implement a circuit breaker for repeated failures.

Is Claude API streaming faster than non-streaming?

Streaming delivers the same total content but with much lower perceived latency. The time-to-first-token (TTFT) with streaming is typically 200-500ms compared to waiting 5-30 seconds for a complete non-streaming response. Total generation time is identical. The advantage is that users see output immediately and can start reading while the rest generates.

How do I count tokens from a Claude streaming response?

Token usage information is included in the stream events. The message_start event contains usage.input_tokens. The message_delta event (sent near the end of the stream) contains usage.output_tokens. You do not need to count tokens manually from the text deltas. Use these values for cost calculation and logging.

Developer and creator of the Zovo Tools network. Building free, privacy-first developer tools that run entirely in the browser. No tracking, no sign-ups, no server-side processing. Open source on GitHub.