When Streaming Lies
Debugging a silent failure across LangGraph, SSE, API Gateways, and Azure OpenAI
Last month I was working on a fairly standard LLM pipeline: a LangGraph-based agent (with MCP tools), Azure OpenAI as the model provider (deployed on the client’s infrastructure), and a Flask API layer streaming responses back to the frontend using astream_events (v2) over SSE.
This setup had been stable. No recent code changes, no dependency upgrades, no configuration drift on our side. Then, without warning, the system started failing.
The failure surfaced as a combination of:
HTTP 400 responses from the OpenAI layer, and
a LangGraph exception:
ValueError: No generations found in stream
Initially it looked like a typical integration issue: malformed requests, a model-side error, or something breaking in the MuleSoft DataWeave layer sitting in front of the client’s OpenAI deployment. All of these were plausible; none of them turned out to be correct.
Why This Was Non-Trivial
The key constraint here was that nothing had changed in our codebase. That removes the most common debugging axis. When a previously working system fails without code changes, the only reasonable assumption is:
something outside your control has changed.
That immediately shifts the debugging strategy. Instead of tweaking code or configs, you have to isolate the system layer by layer and validate assumptions.
System Overview (Where the Assumptions Lived)
The flow looked roughly like this:
Flask endpoint receives a request
LangGraph agent executes asynchronously
The agent calls Azure OpenAI through a proxy/gateway
Responses are streamed via astream_events
Events are pushed through a thread-safe queue
Flask yields them as SSE to the frontend
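To ground the later debugging steps, here is a minimal sketch of that wiring. The route name, payload shape, and run_agent stub are illustrative placeholders, not the actual code; the real coroutine is the LangGraph agent described above.

```python
import asyncio
import json
import queue
import threading

from flask import Flask, Response, request

app = Flask(__name__)


async def run_agent(payload: dict, q: queue.Queue) -> None:
    """Hypothetical stand-in for the LangGraph agent.

    The real agent pushes intermediate events onto the queue as it runs;
    a None sentinel tells the SSE generator that the run is finished.
    """
    q.put({"event": "started", "input": payload})
    q.put(None)


@app.route("/chat", methods=["POST"])
def chat() -> Response:
    q: queue.Queue = queue.Queue()
    payload = request.get_json()

    def worker() -> None:
        # The async agent runs on its own event loop in a background
        # thread, so it never touches Flask's WSGI worker.
        asyncio.run(run_agent(payload, q))

    threading.Thread(target=worker, daemon=True).start()

    def sse_stream():
        while True:
            event = q.get()       # blocks until the agent emits something
            if event is None:     # sentinel: agent run finished
                break
            yield f"data: {json.dumps(event)}\n\n"

    return Response(sse_stream(), mimetype="text/event-stream")
```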
The implicit contract in this setup is straightforward:
If stream=True, the upstream system will deliver tokens incrementally.
Everything downstream (LangGraph, the queue, the SSE layer) depends on that being true.
Debugging by Elimination, Not Guesswork
Rather than chasing symptoms, I started validating each layer independently.
Application Layer
First, I verified that the core logic was intact:
Synchronous LLM calls worked as expected
The agent execution pipeline was correct
So this ruled out:
prompt issues
agent orchestration bugs
MCP tool integration problems
Concurrency Layer
Given that Flask (WSGI) and LangGraph (async) operate on different concurrency models, I suspected event-loop interference.
I refactored the execution model to:
run LangGraph in a dedicated background thread
assign a separate event loop per request
communicate via a thread-safe queue
This is the standard thread + queue isolation pattern to avoid mixing blocking I/O (yield in Flask) with async execution (asyncio).
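In isolation, that pattern looks roughly like the sketch below; the coroutine passed in stands in for the actual LangGraph run, and the helper name is mine.

```python
import asyncio
import queue
import threading


def run_in_isolated_loop(coro, q: queue.Queue) -> threading.Thread:
    """Run an async workload on a dedicated event loop in its own thread.

    `coro` is the coroutine to execute (in the real system, the LangGraph
    run); results flow back through the thread-safe queue, and a None
    sentinel marks completion for the Flask generator draining it.
    """
    def target() -> None:
        loop = asyncio.new_event_loop()   # fresh loop, never shared with Flask
        asyncio.set_event_loop(loop)
        try:
            loop.run_until_complete(coro)
        finally:
            q.put(None)
            loop.close()

    thread = threading.Thread(target=target, daemon=True)
    thread.start()
    return thread
```

Whether you use asyncio.run or an explicit new_event_loop, the point is the same: the async side never shares a loop with the WSGI worker.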
The result: no change. The failure persisted.
So I ruled out concurrency issues.
SDK / Streaming Behaviour
Next, I stripped the system down further and tested the Azure OpenAI SDK directly.
With stream=False, responses were correct
With stream=True, the connection was established, but zero chunks were received
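The isolation test was essentially the following, with the endpoint, key, API version, and deployment name as placeholders for the client’s environment:

```python
# Minimal reproduction against the Azure OpenAI SDK, with LangGraph
# removed entirely.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<resource>.openai.azure.com",
    api_key="<key>",
    api_version="2024-02-01",
)

messages = [{"role": "user", "content": "Say hello"}]

# stream=False: a complete response comes back as expected.
resp = client.chat.completions.create(model="<deployment>", messages=messages)
print(resp.choices[0].message.content)

# stream=True: the connection opens, but the loop body never runs.
chunk_count = 0
for chunk in client.chat.completions.create(
    model="<deployment>", messages=messages, stream=True
):
    chunk_count += 1
print("chunks received:", chunk_count)  # 0 in the failing environment
```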
This is where things got interesting.
At this point, the model was reachable, the request was valid, and the connection was technically working. But no data was flowing.
That combination doesn’t point to an application bug. It points to a broken transport.
The Actual Failure: Streaming Without Data
The only remaining layer was the network boundary between our service and Azure OpenAI, which was controlled by the client (API Gateway / proxy layer).
What was happening in practice:
The gateway accepted requests with stream=True
It established the connection
But it did not support (or was misconfigured for) SSE / chunked transfer encoding
The connection was closed immediately without forwarding any chunks
From our system’s perspective:
the stream succeeded, but contained no data
This is a particularly dangerous failure mode because it doesn’t raise a clear infrastructure error. Everything looks connected, but the contract is already broken.
Why LangGraph Failed the Way It Did
LangGraph’s astream_events is designed around the assumption that a streaming response will produce at least one token.
When that assumption is violated:
the stream parser receives nothing
it treats this as a failure
and raises:
No generations found
So the error I saw was not the root cause. It was a downstream effect of an upstream contract violation.
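In hindsight, a cheap guard at the point where the stream is consumed would have surfaced the real problem immediately. A sketch, assuming the compiled graph exposes astream_events (v2) as above and q is the thread-safe queue; the EmptyStreamError name is mine, not LangGraph’s:

```python
class EmptyStreamError(RuntimeError):
    """Raised when a streaming call completes without yielding anything."""


async def consume_events(graph, inputs: dict, q) -> None:
    received = 0
    async for event in graph.astream_events(inputs, version="v2"):
        received += 1
        q.put(event)
    if received == 0:
        # The connection "succeeded", but the streaming contract was
        # violated upstream: report it as a transport failure instead of
        # letting it surface later as "No generations found in stream".
        raise EmptyStreamError("stream closed without delivering any chunks")
```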
What Actually Changed
Although I never got explicit confirmation, the behaviour strongly indicates a change in the client’s gateway layer, likely around:
SSE handling
response buffering
or transfer encoding policies
The important part is not what exactly changed, but where the system became unreliable.
The Decision: Move Control Up the Stack
At this point, I had two options:
Depend on the client’s infrastructure team to fix streaming support
Remove the dependency on network level streaming entirely
The first option introduces uncertainty and external dependency. The second gives control.
I chose the second.
From Token Streaming to State Streaming
Instead of relying on token-level streaming from the model, I made two changes:
Disabled streaming at the LLM layer (streaming=False)
Switched LangGraph from event streaming to state updates (stream_mode="updates")
This effectively converts the model interaction into a standard synchronous request, while still allowing us to stream meaningful progress back to the frontend.
The system now works like this:
The model returns a fully buffered response
LangGraph emits intermediate state transitions
These updates are pushed through the queue
Flask streams them to the frontend via SSE
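A sketch of the reworked loop, assuming graph is the compiled LangGraph agent (built with an LLM configured with streaming=False) and q is the same thread-safe queue feeding the SSE layer:

```python
import queue


async def run_agent_with_updates(graph, inputs: dict, q: queue.Queue) -> None:
    # stream_mode="updates" yields one dict per node transition,
    # e.g. {"agent": {...}} or {"tools": {...}}, rather than raw tokens.
    async for update in graph.astream(inputs, stream_mode="updates"):
        for node_name, node_state in update.items():
            q.put({"node": node_name, "update": node_state})
    q.put(None)  # sentinel: tell the SSE generator the run is finished
```

On the Flask side nothing changes: the SSE generator keeps draining the same queue; it simply receives coarser, application-defined events instead of model tokens.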
So the user still experiences a responsive interface, but the streaming is now:
controlled by the application, not the network
What This Changes Architecturally
Previously, streaming depended on:
correct behaviour from the gateway
proper handling of chunked responses
uninterrupted token flow
Now, streaming depends only on:
the application logic
the queue
the SSE layer
This effectively moved streaming from:
a transport-layer concern to an application-layer concern
Trade-offs
This is not a free win.
I gave up:
token-level granularity
true real-time generation
And accepted:
slightly higher latency (waiting for the full response)
higher memory usage (buffering responses)
In return, I gained:
determinism
reliability across environments
significantly easier debugging
This is a deliberate trade.
Takeaways
There are a few lessons here that generalize beyond this specific stack:
Streaming is not just a feature; it is a multi-layer contract. If any layer violates it, the failure propagates unpredictably.
Enterprise gateways are often not transparent. Even when they support streaming, they may buffer or terminate connections silently.
A stream that yields zero chunks should be treated as a transport failure, not a valid response.
When you don’t control the infrastructure, pushing critical behaviour (like streaming) into the network layer is risky.
The broader takeaway is simple:
If you don’t control the network, you don’t control streaming.
Closing Thought
What initially looked like a model or framework issue turned out to be a failure at the system boundary. The resolution wasn’t deeper debugging; it was deciding where control should live in the architecture.
Once that decision was made, the fix became straightforward.
If you’ve come this far, I’d love to hear your thoughts, ideas, or suggestions on this article.