
Why TCP Fails for AI Data Transfer (and What to Use Instead)

TCP was designed for reliability on lossy 1980s networks, not for saturating 10 Gbps links across continents. Here is the math that proves it, and what the alternative looks like.

The Problem in One Equation

Every TCP connection has a hard throughput ceiling defined by the bandwidth-delay product (BDP):

BDP = Bandwidth × Round-Trip Time (RTT)

Maximum throughput = TCP Window Size / RTT

TCP can keep only a limited number of unacknowledged bytes in flight at any time, capped by the receive window. Until the sender gets an ACK back, it cannot send more data. On a high-latency link, this creates a hard ceiling that has nothing to do with your actual bandwidth.

Real Numbers: TCP on a 10 Gbps Link

Consider a 10 Gbps connection between two AI data centers separated by 100 ms of round-trip latency (roughly US East Coast to West Coast):

BDP Calculation

  • Bandwidth: 10 Gbps = 1.25 GB/s
  • RTT: 100 ms = 0.1 s
  • BDP: 1.25 GB/s × 0.1 s = 125 MB

TCP needs a 125 MB window to fully utilize this link. The default TCP window on most systems is 64 KB to 4 MB. Even with window scaling enabled (RFC 7323), the maximum is 1 GB, but kernel defaults and buffer tuning rarely get close.
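The window ceiling is easy to check directly. A minimal sketch applying the throughput formula above to the 10 Gbps / 100 ms example (window sizes are illustrative defaults, not measured values):

```python
# Throughput ceiling imposed by the TCP window:
#   max_throughput = window_size / RTT
# Link parameters match the 10 Gbps / 100 ms example above.

RTT = 0.100  # seconds

def ceiling_gbps(window_bytes: float, rtt: float = RTT) -> float:
    """Maximum throughput in Gbps for a given TCP window size."""
    return window_bytes * 8 / rtt / 1e9

bdp_bytes = 10e9 / 8 * RTT  # 125 MB bandwidth-delay product

for label, window in [("64 KB default", 64 * 1024),
                      ("4 MB tuned", 4 * 1024**2),
                      ("125 MB (full BDP)", bdp_bytes)]:
    print(f"{label:>18}: {ceiling_gbps(window):6.3f} Gbps")
```

A 64 KB window caps out around 5 Mbps on this link; a 4 MB window around 335 Mbps; only the full 125 MB BDP reaches line rate.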

In practice, a single TCP stream on this link typically achieves 200-400 Mbps, roughly 2-4% of available bandwidth. You are paying for 10 Gbps and using 300 Mbps. For AI workloads moving multi-terabyte datasets, that turns a 20-minute transfer into an 8-hour one.

What About Parallel TCP Streams?

The standard workaround is opening many TCP connections in parallel. Tools like GridFTP use 8-16 parallel streams. This helps, but it introduces new problems:

  • Congestion unfairness: 16 TCP streams compete with each other and with other traffic. Each stream independently runs congestion control, leading to oscillating throughput and packet loss.
  • Head-of-line blocking: If one stream stalls (packet loss, retransmission timeout), the application layer must wait or manage complex out-of-order reassembly.
  • Connection overhead: Each stream requires its own three-way handshake, TLS negotiation, and kernel buffer allocation. On a 100 ms RTT link, each connection takes 300+ ms just to establish.
  • Diminishing returns: Beyond 8-12 streams, you hit buffer limits and context-switching overhead. Real-world tests show 16 parallel TCP streams on a 10 Gbps / 100 ms link achieve roughly 2-5 Gbps, still well below line rate.
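The best case for parallel streams can be estimated the same way: N streams put N windows in flight, so the ceiling rises linearly but still falls short of line rate. A back-of-envelope sketch using the article's 4 MB tuned window and 100 ms RTT:

```python
# Best-case aggregate ceiling of N parallel TCP streams:
# each stream contributes one window per RTT. Congestion
# interference in practice pushes real numbers below this.

RTT = 0.100           # seconds
WINDOW = 4 * 1024**2  # 4 MB tuned window per stream

def aggregate_gbps(n_streams: int) -> float:
    """Best-case aggregate throughput of n parallel TCP streams."""
    return n_streams * WINDOW * 8 / RTT / 1e9

for n in (1, 8, 16):
    print(f"{n:2d} streams: {aggregate_gbps(n):5.2f} Gbps best case")
```

Sixteen streams top out near 5.4 Gbps even before accounting for the congestion oscillation described above, consistent with the 2-5 Gbps observed in practice.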

TCP Congestion Control: Built to Slow Down

TCP's congestion control algorithms (Reno, CUBIC, BBR) share a fundamental design goal: back off when there is any signal of congestion. This is good citizenship on shared networks. It is terrible for dedicated high-bandwidth transfers.

CUBIC, the default on Linux, uses a cubic function to probe for available bandwidth after a loss event. A single dropped packet cuts the congestion window by 30% (CUBIC's multiplicative-decrease factor is 0.7), and recovery to the previous rate takes seconds to minutes depending on the window size. On a 10 Gbps link with 0.01% packet loss (common on long-haul routes), CUBIC will never come close to line rate.
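The recovery time follows directly from CUBIC's window-growth function in RFC 8312, W(t) = C·(t−K)³ + W_max with K = ∛(W_max·(1−β)/C). A back-of-envelope sketch for the 10 Gbps / 100 ms example (1500-byte segments assumed; this is not a kernel simulation):

```python
# Time for CUBIC to climb back to W_max after a single loss,
# per RFC 8312: K = cbrt(W_max * (1 - beta) / C), where the
# window was cut to beta * W_max at the loss event.

C = 0.4     # CUBIC scaling constant (RFC 8312 default)
BETA = 0.7  # multiplicative-decrease factor (the 30% drop)

def recovery_seconds(w_max_segments: float) -> float:
    """Seconds for the cubic curve to return to W_max."""
    return (w_max_segments * (1 - BETA) / C) ** (1 / 3)

# Window needed for 10 Gbps at 100 ms RTT, in 1500-byte segments
w_max = (10e9 / 8 * 0.100) / 1500  # ~83,000 segments
print(f"Recovery after one loss: ~{recovery_seconds(w_max):.0f} s")
```

Roughly 40 seconds to recover from a single lost packet at this window size. With losses arriving every few seconds at 0.01% loss rates, the window never gets back to full.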

BBR (Bottleneck Bandwidth and RTT) is better, modeling the path rather than reacting to loss. But BBR still operates within TCP's ACK-clocked framework. It cannot fundamentally escape the BDP constraint because it still waits for acknowledgments.

Head-of-Line Blocking: TCP's Hidden Tax

TCP guarantees in-order delivery. If packet 1,000 is lost, packets 1,001 through 2,000 sit in the receive buffer waiting, even though they arrived fine. The application sees nothing until the retransmission completes.

For file transfer, this means a single lost packet on a 10 Gbps link stalls the entire stream for at least one RTT (100 ms in our example). During that 100 ms stall, 125 MB of potential throughput is lost. This effect compounds: on links with 0.1% loss, head-of-line blocking can reduce effective throughput by 40-60%.

Why UDP-Based Protocols Win

UDP-based transfer protocols like FASP and Handrive's protocol solve these problems by discarding TCP's assumptions:

  • No ACK-clocked sending: The sender controls its own rate based on measured path characteristics, not waiting for per-packet acknowledgments. This eliminates the BDP ceiling entirely.
  • Application-layer reliability: Selective retransmission at the application layer means lost packets are re-requested without stalling the entire stream. No head-of-line blocking.
  • Rate-based congestion control: Instead of backing off on any loss signal, UDP protocols measure actual available bandwidth and adjust smoothly. A single lost packet does not trigger a 30% window reduction.
  • Latency independence: Throughput is determined by the bottleneck bandwidth, not by RTT. A 10 Gbps link delivers 10 Gbps whether RTT is 1 ms or 500 ms.
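Selective retransmission is the key to avoiding head-of-line blocking: the receiver tracks which sequence numbers arrived and asks only for the gaps, while everything else is delivered immediately. A minimal illustrative model of the gap-tracking logic (not Handrive's actual wire protocol):

```python
# Application-layer selective retransmission, sketched: build a
# NACK from the gaps in received sequence numbers. Packets after
# a gap are usable right away -- no head-of-line blocking.

def missing_ranges(received: set[int], highest: int) -> list[tuple[int, int]]:
    """Return inclusive (start, end) ranges of lost sequence
    numbers up to `highest`, suitable for a NACK message."""
    gaps, start = [], None
    for seq in range(highest + 1):
        if seq not in received and start is None:
            start = seq                    # gap opens
        elif seq in received and start is not None:
            gaps.append((start, seq - 1))  # gap closes
            start = None
    if start is not None:
        gaps.append((start, highest))      # gap runs to the end
    return gaps

# Packets 3 and 7-8 were lost; everything else is usable immediately.
received = {0, 1, 2, 4, 5, 6, 9, 10}
print(missing_ranges(received, 10))  # [(3, 3), (7, 8)]
```

The sender re-sends only those ranges; a TCP receiver in the same situation would buffer packets 4 onward and deliver nothing until packet 3 was retransmitted.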

Throughput Comparison: 10 Gbps Link, 100 ms RTT

Method                          Throughput      Utilization
Single TCP (default buffers)    200-400 Mbps    2-4%
Single TCP (tuned buffers)      1-3 Gbps        10-30%
16x parallel TCP                2-5 Gbps        20-50%
UDP-based protocol              8-9.5 Gbps      80-95%

What This Means for AI Workloads

AI data transfer has characteristics that make TCP's weaknesses especially painful:

  • Large sequential transfers: Training datasets, model checkpoints, and inference batches are multi-gigabyte to multi-terabyte. TCP's slow-start alone wastes minutes on each transfer.
  • High-latency paths: Data moves between edge collection points, regional data centers, and GPU clusters that may be continents apart. Some paths involve satellite links with 500+ ms RTT.
  • Time-sensitive pipelines: Training jobs are blocked waiting for data. A transfer that takes 8 hours instead of 20 minutes means idle GPU time at $2-8/hr per GPU across hundreds of GPUs.
  • Cost multiplication: Slow transfers mean longer cloud compute bills. Moving 50 TB over TCP at 3 Gbps takes ~37 hours. At UDP rates (9 Gbps), it takes ~12 hours. On a 256-GPU cluster at $3/GPU/hr, that 25-hour difference costs $19,200 in idle compute.
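The cost arithmetic in the last bullet, made explicit (GPU count and the $3/GPU/hr price are the article's assumed inputs):

```python
# Idle-GPU cost of transferring 50 TB at TCP vs UDP rates.
# The article rounds the ~24.7-hour gap to 25 hours ($19,200);
# the unrounded figure is just under $19,000.

DATASET_BITS = 50e12 * 8  # 50 TB
SECONDS_PER_HOUR = 3600

def transfer_hours(rate_bps: float) -> float:
    return DATASET_BITS / rate_bps / SECONDS_PER_HOUR

tcp_h = transfer_hours(3e9)  # ~37 hours at 3 Gbps
udp_h = transfer_hours(9e9)  # ~12 hours at 9 Gbps
idle_cost = (tcp_h - udp_h) * 256 * 3.0  # 256 GPUs at $3/GPU/hr

print(f"TCP: {tcp_h:.1f} h, UDP: {udp_h:.1f} h, "
      f"idle cost: ${idle_cost:,.0f}")
```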

Handrive's Approach

Handrive uses a UDP-based, rate-controlled protocol originally engineered for satellite links, where latency is extreme (600+ ms RTT) and packet loss is routine. The same properties that make it work over satellite make it optimal for AI data center transfers:

  • Throughput is independent of latency. Full bandwidth utilization on any path, any RTT.
  • Selective retransmission with no head-of-line blocking.
  • End-to-end encryption built into the protocol, not layered on top as TLS-over-TCP.
  • Direct peer-to-peer delivery. No intermediate servers adding latency hops.

Combined with NAT traversal that works across firewalls without port forwarding, this means any two machines can transfer at line rate regardless of network topology.

When TCP Is Still Fine

TCP is not universally bad. It works well for small files on low-latency links (under 20 ms RTT), web traffic, API calls, and any scenario where the BDP fits comfortably within default window sizes. If you are transferring files under 1 GB on the same continent with modern buffer settings, TCP is adequate.

The breakpoint is roughly: if your transfer size in gigabytes times your RTT in milliseconds exceeds 10,000, TCP will meaningfully underperform. For AI workloads, that threshold is crossed constantly.
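The rule of thumb above fits in one line of code (the 10,000 threshold is the article's heuristic, not a measured constant):

```python
# Heuristic from the paragraph above: TCP meaningfully
# underperforms once transfer_GB * RTT_ms exceeds 10,000.

def tcp_will_underperform(transfer_gb: float, rtt_ms: float) -> bool:
    return transfer_gb * rtt_ms > 10_000

print(tcp_will_underperform(1, 15))     # small file, same continent
print(tcp_will_underperform(500, 100))  # checkpoint sync, coast to coast
```

A 1 GB file at 15 ms RTT scores 15 and is fine over TCP; a 500 GB checkpoint at 100 ms scores 50,000 and is firmly past the breakpoint.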


Stop Leaving Bandwidth on the Table

Handrive's UDP-based protocol delivers full line-rate transfers regardless of latency. Free, encrypted, no infrastructure required.

Download Handrive