

Isaac Lacort Magán

Apr 11, 2026

When gRPC Was Innocent: Debugging File Corruption in an Async Java Save Flow

Why this architectural choice mattered

Some time ago, I built a centralized gRPC service for file management. Instead of allowing microservices to exchange files directly, they would all read from and write to a single dedicated file service. The goal was to centralize access control, security rules, and file operations in one place, effectively turning that service into an internal object-storage layer for the rest of the system.

I chose gRPC because it gave me strict contracts, bounded operations, and predictable message boundaries. That mattered because this service could face heavy load peaks, and I needed a deterministic flow with bounded memory usage. To achieve that, I designed downloads as a chunked stream with backpressure: the server reads the file progressively, each message contains a bounded chunk, the client consumes chunks one by one, and the full file is never preloaded into memory. In practice, this gave me a controlled and memory-safe transfer model.
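
To make that concrete, here is a minimal sketch of what such a demand-driven server loop can look like with grpc-java's manual flow control. The service, message, and helper names (FileServiceGrpc, DownloadRequest, FileChunk, openFile) are placeholders for this article, not the real contract:

import com.google.protobuf.ByteString;
import io.grpc.Status;
import io.grpc.stub.ServerCallStreamObserver;
import io.grpc.stub.StreamObserver;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative server-side handler for the chunked download.
public class FileDownloadService extends FileServiceGrpc.FileServiceImplBase {

    private static final int CHUNK_SIZE = 64 * 1024;  // bounded per-message size

    @Override
    public void download(DownloadRequest request, StreamObserver<FileChunk> responseObserver) {
        ServerCallStreamObserver<FileChunk> observer =
                (ServerCallStreamObserver<FileChunk>) responseObserver;
        InputStream in = openFile(request.getFileId());   // assumed storage lookup
        byte[] buffer = new byte[CHUNK_SIZE];
        AtomicBoolean done = new AtomicBoolean(false);

        // Send chunks only while the transport reports it is ready; gRPC re-runs
        // this handler when the client has drained its side, which is the
        // backpressure loop described above.
        Runnable drain = () -> {
            if (done.get()) {
                return;
            }
            try {
                while (observer.isReady()) {
                    int read = in.read(buffer);
                    if (read < 0) {
                        done.set(true);
                        in.close();
                        observer.onCompleted();
                        return;
                    }
                    observer.onNext(FileChunk.newBuilder()
                            .setData(ByteString.copyFrom(buffer, 0, read))
                            .build());
                }
            } catch (IOException e) {
                done.set(true);
                observer.onError(Status.INTERNAL.withCause(e).asRuntimeException());
            }
        };
        observer.setOnReadyHandler(drain);
        drain.run();  // kick off the first round of sends
    }

    private InputStream openFile(String fileId) {
        throw new UnsupportedOperationException("storage lookup omitted in this sketch");
    }
}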

The modules around the file service

Around this service, I also created several client modules: a standard client for Java 8 applications, a reactive client for Java 21 applications, and a reactive adapter that transforms the stream into reactive types. These reactive modules were designed for non-blocking applications. The client handled the publisher-subscriber side, while the adapter exposed a more convenient reactive API to the rest of the application.
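
As a rough illustration of the adapter's role, a gRPC stream can be bridged into a Reactor Flux along these lines; the stub and message names are assumptions, and the request-by-request flow control of the real client is omitted here for brevity:

import com.google.protobuf.ByteString;
import io.grpc.stub.StreamObserver;
import reactor.core.publisher.Flux;

// Minimal sketch: forward the gRPC stream callbacks into a Flux of chunk payloads.
public class ReactiveDownloadAdapter {

    private final FileServiceGrpc.FileServiceStub stub;

    public ReactiveDownloadAdapter(FileServiceGrpc.FileServiceStub stub) {
        this.stub = stub;
    }

    public Flux<ByteString> download(DownloadRequest request) {
        return Flux.create(sink -> stub.download(request, new StreamObserver<FileChunk>() {
            @Override public void onNext(FileChunk chunk) { sink.next(chunk.getData()); }
            @Override public void onError(Throwable error) { sink.error(error); }
            @Override public void onCompleted()            { sink.complete(); }
        }));
    }
}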

For both uploads and downloads, I also added a checksum as the last message of the stream. That checksum was calculated incrementally on both sides while chunks were being processed, so the receiver could verify that the streamed content had arrived correctly. At that point, the whole design looked safe: bounded memory, streaming, checksum verification, and a controlled contract.
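
The incremental part is the important detail: neither side ever hashes a fully buffered file. A minimal sketch of that idea, using SHA-256 (the algorithm that appears later in the reproduction):

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Both sender and receiver feed each chunk into a digest while the stream flows.
// The final stream message carries the sender's digest for comparison.
public final class StreamingChecksum {

    private final MessageDigest digest;

    public StreamingChecksum() throws NoSuchAlgorithmException {
        this.digest = MessageDigest.getInstance("SHA-256");
    }

    // Called once per chunk, in stream order, on both sides of the transfer.
    public void update(byte[] chunk, int offset, int length) {
        digest.update(chunk, offset, length);
    }

    // Called when the final message arrives: compare the locally accumulated
    // digest against the checksum the other side sent.
    public boolean matches(byte[] expected) {
        return MessageDigest.isEqual(digest.digest(), expected);
    }
}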

The scenario where the problem appeared

The issue appeared when I used this client from a REST API that had to download several documents in the same request. At first, everything seemed fine. However, during repeated executions of the same request body, some downloaded documents turned out to be malformed.

What made the bug especially interesting was that the failures were not random. The same files tended to fail under the same pressure conditions. If this had been a generic transport problem, corruption should have appeared in any file, or at least in any file with similar characteristics. But that was not what I observed. The pattern was too consistent to ignore, so I started investigating.

I debugged the gRPC flow and found nothing suspicious. Then I ran more load tests focused specifically on those failing files, and eventually I reached an important conclusion: the corruption was not happening during transfer. It was happening when the file was being saved.

What the checksum proved

To confirm that suspicion, I compared the checksum received in the final gRPC message with the checksum calculated on the client while receiving the stream. They matched, and that meant something very important: the document had arrived correctly in memory.

I then added one more validation step. After the file was fully saved to disk, I calculated the checksum again over the saved file. That checksum was sometimes different. This was the key discovery. The transport layer was correct; the corruption appeared after the stream had already been received successfully.
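
Conceptually, that extra step amounted to hashing the bytes on disk, independently of anything kept in memory, and comparing the result with the checksum accumulated while receiving the stream. A minimal sketch with illustrative method names:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hash the persisted bytes and compare them with the stream checksum.
final class SavedFileValidator {

    static byte[] sha256Of(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[64 * 1024];
            int read;
            while ((read = in.read(buffer)) >= 0) {
                digest.update(buffer, 0, read);
            }
        }
        return digest.digest();
    }

    // Transfer integrity is the stream checksum; persistence integrity is this one.
    static boolean persistedCorrectly(Path file, byte[] streamChecksum)
            throws IOException, NoSuchAlgorithmException {
        return MessageDigest.isEqual(sha256Of(file), streamChecksum);
    }
}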

What made the case even stronger was that when the saved-file checksum differed, it was still the same wrong checksum for the same document in every failing execution. This strongly suggested deterministic corruption under specific pressure conditions, not random data loss. In other words, the stream checksum proved transfer integrity, while the saved-file checksum proved persistence integrity. The bug was somewhere between those two moments.

The architecture at the time

At a high level, the architecture looked like this:

REST API
┌─────────────────────────────────┐
│ Reactive application            │
│ ┌─────────────────────────────┐ │
│ │ Reactive Adapter            │ │
│ └─────────────────────────────┘ │
│ ┌─────────────────────────────┐ │
│ │ Reactive Client             │ │
│ └─────────────────────────────┘ │
└─────────────────────────────────┘
                │
                ▼
         gRPC File Service
                │
                ▼
             Filesystem

The save step used an asynchronous Java library so that the pipeline remained non-blocking. On paper, that design looked reasonable. Chunks arrived in order, were processed progressively, and were written asynchronously. In practice, though, that was exactly where the bug lived.

[Sequence diagram: Protocol Flow - one prepared chunk, demand-driven send. Chunks N, N+1 and N+2 move through Server Read → Prepared (mem) → Client Receive → Async Write → Disk; the server waits until the client asks for the next chunk, and that request unlocks the next read.]

The actual problem

The chunks were reaching the save layer in the correct order. The real problem was that the asynchronous write operations were not being coordinated safely.

At first, I considered another possibility: maybe the application continued processing the document before the final async write had completed. That would have been a classic race condition caused by missing synchronization or missing await semantics. But I ruled that out fairly quickly. Documents were processed only after all downloads had completed, the final file size was correct, and the corruption pattern was deterministic rather than random.

That left one explanation standing. For some documents, the final chunk was much smaller than the others. Under pressure, that smaller chunk could be written faster than the previous one. If the persistence layer relied on append-like sequencing instead of strict positional writes or serialized write confirmation, then completion order could differ from arrival order. In that scenario, the file could be corrupted even if the stream itself was perfectly correct.

That was the real lesson: the issue was not that gRPC failed, and not even that asynchronous I/O was inherently wrong. The issue was that asynchronous writes were launched in a way that allowed ordering assumptions to break.
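
To illustrate that failure mode, here is a deliberately broken sketch (a reconstruction, not the original persistence code): every chunk becomes an independent, append-style async task, and nothing forces chunk N+1 to wait for chunk N's confirmation, so a small final chunk can reach the disk first.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.CompletableFuture;

final class UnsafeAsyncWriter {

    private final Path target;

    UnsafeAsyncWriter(Path target) {
        this.target = target;
    }

    // Called once per received chunk, in arrival order.
    CompletableFuture<Void> writeChunk(byte[] chunk) {
        // BUG: the write is scheduled, not sequenced. The effective position is
        // whatever the file size happens to be when this task finally runs, so
        // a small, fast final chunk can be appended before its predecessor.
        return CompletableFuture.runAsync(() -> {
            try (FileChannel channel = FileChannel.open(target,
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
                channel.write(ByteBuffer.wrap(chunk));
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        });
    }
}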

[Sequence diagram: Bug Flow - a chunk enters the async write stage without confirmation of the previous write. Chunks N, N+1 and N+2 move through the same stages, but the write queue receives the next chunk too early, so writes overlap.]

How I validated the theory

To validate that theory, I built a focused test. I received chunks in pairs, and whenever the second chunk was smaller than usual, I intentionally saved it before the first one in that iteration.
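
A sketch of that pairing logic (the chunk size and file handling here are illustrative, not the original test code):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Consume chunks two at a time and, whenever the second chunk of the pair is
// smaller than the nominal chunk size, persist it before its predecessor,
// simulating the write that "wins" the race.
final class OutOfOrderReproduction {

    static void appendPair(Path target, byte[] first, byte[] second, int nominalChunkSize)
            throws IOException {
        try (FileChannel channel = FileChannel.open(target,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            if (second.length < nominalChunkSize) {
                // Deliberately commit the small trailing chunk before the one that arrived first.
                channel.write(ByteBuffer.wrap(second));
                channel.write(ByteBuffer.wrap(first));
            } else {
                channel.write(ByteBuffer.wrap(first));
                channel.write(ByteBuffer.wrap(second));
            }
        }
    }
}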

The result was an exact match. The SHA-256 of the saved PDF was identical to the wrong checksum I had been seeing in the real failure cases. In other words, I was able to reproduce the corruption pattern on purpose. At that point, the problem became clear: the transfer was correct, but the persistence layer could commit chunks in the wrong effective order.

Why gRPC was innocent

This is the part I find most valuable in the whole debugging story. When files arrive corrupted, it is very easy to blame the transport layer first. In this case, however, gRPC was doing exactly what it should do. The stream arrived correctly, the checksum matched, and the chunks were delivered in order.

The corruption happened later, in the client-side persistence layer. So the real problem was not “gRPC + async” in general. It was a correct streaming pipeline feeding an incorrectly coordinated asynchronous write process. That distinction matters, because it changes both the diagnosis and the fix.

The fix

Once the real cause became clear, the fix was straightforward: the pipeline could not start writing the next chunk until the previous write had been fully confirmed.

In practical terms, that meant changing the flow from overlapping asynchronous writes to a strictly coordinated one. After receiving chunk N, the client could dispatch its write operation, but chunk N+1 was not allowed to advance into the write stage until that previous write had completed successfully. The rule became simple: receive, write, confirm, then continue.
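
A minimal sketch of that coordination, assuming Java NIO's AsynchronousFileChannel and CompletableFuture chaining (the real code may differ): each write is chained onto the confirmation of the previous one, and the write position only advances once that confirmation arrives.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.channels.CompletionHandler;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.CompletableFuture;

// Receive, write, confirm, then continue: writes stay asynchronous, but they
// are strictly sequenced through a chain of futures.
final class SequencedAsyncWriter {

    private final AsynchronousFileChannel channel;
    private CompletableFuture<Long> tail;  // completes with the next write position

    SequencedAsyncWriter(Path target) throws IOException {
        this.channel = AsynchronousFileChannel.open(target,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING);
        this.tail = CompletableFuture.completedFuture(0L);
    }

    // Chunk N+1 cannot enter the write stage until chunk N's future completes.
    // This is also the point where the client would request the next chunk
    // from the stream.
    synchronized CompletableFuture<Long> writeChunk(byte[] chunk) {
        tail = tail.thenCompose(position -> writeAt(position, chunk));
        return tail;
    }

    private CompletableFuture<Long> writeAt(long position, byte[] chunk) {
        CompletableFuture<Long> confirmed = new CompletableFuture<>();
        // Note: a production version would also loop on partial writes; omitted here.
        channel.write(ByteBuffer.wrap(chunk), position, null, new CompletionHandler<Integer, Void>() {
            @Override
            public void completed(Integer written, Void attachment) {
                confirmed.complete(position + written);  // confirmed: the next chunk may advance
            }

            @Override
            public void failed(Throwable error, Void attachment) {
                confirmed.completeExceptionally(error);
            }
        });
        return confirmed;
    }
}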

That change preserved correctness across the whole pipeline. It also meant that end-to-end throughput was now limited by the slowest step in the sequence—server-side read, chunk preparation, transmission, client-side reception, and finally filesystem persistence—but that trade-off was the right one. In this case, file integrity mattered more than unconstrained parallelism.

[Sequence diagram: Fixed Flow - each chunk waits for the client's next request and for confirmation of the previous write before entering the write stage, moving through Server Read → Prepared (mem) → Client Receive → Async Write → Disk.]

Conclusion

This bug was a headache, but it was also a useful reminder. In distributed systems, a correct transport layer does not guarantee a correct persistence layer. A stream can be valid in memory and still become corrupted on disk if asynchronous boundaries are not designed carefully.

It also reminded me of something important in engineering work: common patterns, popular libraries, and “normal” implementations are not absolute truths. They still have to be evaluated against the real architecture, the real pressure conditions, and the real guarantees your system needs.

Spring Boot · gRPC · Async · Reactive