Stop GPU Starvation: Fast AI Data Loading with io_uring

💡 TL;DR (Too Long; Didn't Read)

Key takeaways in 90 seconds:

The Core Bottleneck: As AI/ML training and inference hardware scales, GPU processing speeds have outpaced storage I/O. Much of the data-loading overhead sits at the kernel boundary: a system call per request, threads blocked while the device works, and extra copies through the page cache.

Epoll vs. io_uring: Standard event loops (epoll) are readiness-based, meaning they alert the application when a descriptor is ready for I/O, and the application must then issue a separate read or write system call. io_uring is completion-based: the application queues operations in ring buffers shared with the kernel and collects the results later, so many operations can be submitted and reaped with one system call (or none, in SQPOLL mode).

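SQPOLL Mode: By enabling Submission Queue Polling (SQPOLL), a dedicated kernel thread polls the submission ring. This allows userspace applications to perform high-frequency disk and network I/O without a system call per operation while the polling thread is awake. After a configurable idle period the thread sleeps and must be woken with one io_uring_enter call.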

PostgreSQL 18 AIO: PostgreSQL 18 introduces an Asynchronous I/O (AIO) subsystem that lets a backend keep many read requests in flight, optionally through io_uring, with reported gains of up to 3× on read-heavy operations such as sequential scans, bitmap heap scans, and vacuum. In this release the subsystem covers reads, not writes.

Our Takeaway: For AI applications dealing with massive weight checkpoints, vector search indexes, or streaming training datasets, optimizing the systems layer via io_uring is no longer optional. True engineering efficiency means removing the kernel boundary tax from your high-throughput pipelines.

1. Introduction: The Storage Bottleneck of the Agentic Era

In the early phases of the AI revolution, engineering attention was understandably captured by the sheer compute requirements of deep learning. We spent our cognitive budgets optimizing GPU kernels, scaling cluster topologies, and minimizing inter-node communication latency via high-speed interconnects. However, as the industry transitions from training foundation models to deploying database-heavy agentic systems at scale, a different bottleneck has emerged: the operating system's storage and network boundaries.

Modern AI/ML infrastructures operate under extreme I/O pressure. During LLM inference, models must query massive vector search databases, fetch agent prompt history, and load small-scale adapter weights (like LoRA matrices) on the fly. During training, deep learning pipelines must continuously stream terabytes of tokenized datasets and image chunks to keep GPU clusters saturated.

The industry term for the failure to meet this demand is GPU starvation. While an H100 or Blackwell cluster sits idle, waiting for the next data batch to be read from NVMe disks or retrieved over the network, organizations continue to pay the capital and operating costs of idle silicon.

Historically, we attempted to solve this by throwing more hardware at the problem: configuring massive RAID-0 NVMe arrays, scaling memory caches, and upgrading to multi-gigabit network interfaces. Yet, developers quickly observed that their storage-bound applications plateaued far below the physical limits of the hardware.

The reason is not physical; it is architectural. The traditional Linux I/O models, originally designed for single-core processors and slow spinning disks, impose a heavy tax on high-frequency, concurrent operations. Every read, write, select, or poll requires traversing the user-kernel boundary, incurring context-switch overhead, memory copying, and synchronous thread blocking.

To resolve GPU starvation and scale modern AI infrastructure, we must optimize the systems layer. We must bypass the user-kernel boundary tax. This is where io_uring, the Linux kernel's modern asynchronous interface, redefines how applications communicate with storage and the network.

2. Under the Hood: Epoll vs. io_uring

To appreciate why io_uring is a generational leap forward, we must first analyze the limitations of the traditional asynchronous networking and storage interfaces in Linux: epoll and POSIX/Linux AIO.

The Readiness Model: epoll

For over two decades, high-performance network servers have relied on epoll (and its predecessors select and poll) to handle concurrency. The epoll architecture is built on a readiness-based model.

In this model, the application registers a set of file descriptors with the kernel. When an event occurs (e.g., data arrives on a network socket), the kernel alerts the application that the descriptor is "ready" for reading or writing. The application thread wakes up, parses the event list, and issues a standard read() or write() system call to perform the actual I/O.
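The readiness pattern described above can be sketched with a pipe standing in for a network socket. This is a minimal illustration, not production code: the function name readiness_read_once and the call counter are hypothetical and exist only to make the two-step cost visible.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Reads one message from a pipe using the readiness pattern.
 * Returns the number of bytes read, or -1 on error.
 * *syscalls_used counts the system calls spent on the wait-and-read
 * path only (epoll_wait + read); setup calls are not counted. */
int readiness_read_once(char *out, size_t cap, int *syscalls_used)
{
    int pipefd[2];
    int result = -1;

    *syscalls_used = 0;
    if (pipe(pipefd) < 0)
        return -1;

    int ep = epoll_create1(0);
    if (ep < 0) {
        close(pipefd[0]);
        close(pipefd[1]);
        return -1;
    }

    /* Register interest in "readable" events on the pipe's read end. */
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = pipefd[0] };
    if (epoll_ctl(ep, EPOLL_CTL_ADD, pipefd[0], &ev) == 0 &&
        write(pipefd[1], "batch", 5) == 5) {
        struct epoll_event ready;

        /* System call 1: learn that the descriptor is readable. */
        int n = epoll_wait(ep, &ready, 1, 1000);
        (*syscalls_used)++;

        if (n == 1 && (ready.events & EPOLLIN)) {
            /* System call 2: perform the actual data transfer. */
            result = (int)read(ready.data.fd, out, cap);
            (*syscalls_used)++;
        }
    }

    close(ep);
    close(pipefd[0]);
    close(pipefd[1]);
    return result;
}
```

Calling readiness_read_once with a 16-byte buffer returns 5 and reports two system calls for a single 5-byte message, which is the per-event cost the readiness model cannot avoid.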

While highly effective for network sockets, epoll has three structural drawbacks when applied to modern AI workloads:

Two System Calls per Event: For every event, the application must make at least two system calls: one to wait for readiness (epoll_wait) and another to execute the I/O (read or write). Each call is a user/kernel mode transition that costs CPU cycles and disturbs caches, and the cost grows when kernel page-table isolation mitigations are enabled.
Incompatibility with Disk Files: Crucially, epoll does not support regular disk files at all. epoll_ctl rejects them with EPERM, while poll and select simply report them as always "ready". An application therefore cannot wait for file data to become available, and a read() on a regular file blocks the calling thread whenever the data is not already in the operating system's page cache.
Linux AIO Limitations: To address disk I/O, Linux introduced native kernel AIO (io_submit). However, Linux AIO is notoriously fragile. It is only reliably asynchronous when the file is opened with the O_DIRECT flag (bypassing the page cache), requires block-aligned buffers, and can silently fall back to blocking behavior if metadata operations (like extending a file size) are required.
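The regular-file limitation is easy to observe directly. The sketch below creates a temporary file and tries to register it with epoll. The function name epoll_add_regular_file and the /tmp path template are illustrative assumptions.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdlib.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Tries to register a freshly created regular file with epoll.
 * Returns 0 if the kernel accepted the descriptor, the errno value
 * if epoll_ctl rejected it, or -1 if the setup itself failed. */
int epoll_add_regular_file(void)
{
    char path[] = "/tmp/epoll_probe_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path); /* the open descriptor keeps the file alive */

    int ep = epoll_create1(0);
    if (ep < 0) {
        close(fd);
        return -1;
    }

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    int err = 0;
    if (epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) < 0)
        err = errno; /* regular files do not support epoll */

    close(ep);
    close(fd);
    return err;
}
```

On Linux the function returns EPERM, the error the epoll_ctl manual page documents for descriptors that do not support epoll, such as regular files and directories.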

The Completion Model: io_uring

Introduced by kernel maintainer Jens Axboe in Linux 5.1 (2019) and matured into a production standard in Linux 6.x, io_uring implements a completion-based model. Instead of checking if a descriptor is ready to be read, the application tells the kernel: "Here is a list of I/O operations I want you to perform. Let me know when they are completed."

The core architecture of io_uring is built on two lock-free circular ring buffers shared directly between userspace and the kernel:

Submission Queue (SQ): The application writes Submission Queue Entries (SQEs) to this ring to request I/O operations (e.g., read, write, accept, send, recv).
Completion Queue (CQ): The kernel writes Completion Queue Entries (CQEs) to this ring when operations finish, containing the status and result of the requested I/O.

Because these ring buffers reside in memory shared between userspace and the kernel (via mmap), the application queues requests and reads completions with plain memory writes and reads. The SQEs and CQEs themselves never have to be copied across the boundary by a system call.
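To make the shared-ring idea concrete, here is a deliberately simplified single-producer, single-consumer ring. It is a toy model and not the kernel's actual memory layout: the names toy_ring, toy_ring_push, and toy_ring_pop are hypothetical. It does use the same ingredients the real rings rely on, namely a power-of-two size, free-running head and tail counters, index masking, and release/acquire ordering between the two sides.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_ENTRIES 8u               /* must be a power of two */
#define RING_MASK (RING_ENTRIES - 1u)

struct toy_ring {
    _Atomic uint32_t head;            /* advanced only by the consumer */
    _Atomic uint32_t tail;            /* advanced only by the producer */
    uint64_t slots[RING_ENTRIES];
};

/* Producer side: write the entry first, then publish it by moving tail. */
bool toy_ring_push(struct toy_ring *r, uint64_t value)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (tail - head == RING_ENTRIES)
        return false;                 /* ring is full */

    r->slots[tail & RING_MASK] = value;
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: observe tail, read the entry, then release the slot. */
bool toy_ring_pop(struct toy_ring *r, uint64_t *value)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head == tail)
        return false;                 /* ring is empty */

    *value = r->slots[head & RING_MASK];
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

In io_uring terms, userspace plays the producer on the SQ and the consumer on the CQ, with the kernel on the opposite side of each. No lock is needed because each counter has exactly one writer.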

Furthermore, io_uring provides three advanced execution modes that optimize systems-layer throughput:

Linkable Operations: Applications can chain SQEs together. For instance, you can submit an SQE to read a file offset, and link it to a second SQE that sends that data over a socket. The kernel executes the chain sequentially without returning control to userspace between steps.
Kernel Worker Pool: For blocking operations (like disk reads that miss the page cache), io_uring delegates the work to an internal kernel thread pool (io-wq), ensuring the userspace thread never blocks.
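SQPOLL Mode: In this mode, the kernel spawns a dedicated thread (shown as io_uring-sq on early kernels and iou-sqp-<pid> on later ones) that continuously polls the Submission Queue for new entries. Once enabled, the application simply writes SQEs to the queue and reads CQEs from the completion ring. While the polling thread is awake, the I/O pipeline runs without system calls. After sq_thread_idle milliseconds without work the thread sleeps and sets IORING_SQ_NEED_WAKEUP in the ring flags, and a single io_uring_enter call is needed to wake it.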
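Two of these modes come down to a handful of fields in the kernel's UAPI structures. The sketch below fills those structures by hand using only <linux/io_uring.h>, so it submits nothing and needs no library. The helper names (prep_read_then_send, sqpoll_params, sq_needs_wakeup) are hypothetical. With liburing, the equivalent steps would be io_uring_prep_read and io_uring_prep_send plus setting IOSQE_IO_LINK on the first SQE, and io_uring_queue_init_params for the setup flags.

```c
#include <linux/io_uring.h>
#include <stdint.h>
#include <string.h>

/* Fill two SQEs so that the send starts only after the read completes. */
void prep_read_then_send(struct io_uring_sqe *read_sqe,
                         struct io_uring_sqe *send_sqe,
                         int file_fd, int sock_fd,
                         void *buf, uint32_t len, uint64_t offset)
{
    memset(read_sqe, 0, sizeof(*read_sqe));
    read_sqe->opcode = IORING_OP_READ;
    read_sqe->fd = file_fd;
    read_sqe->off = offset;
    read_sqe->addr = (uint64_t)(uintptr_t)buf;
    read_sqe->len = len;
    read_sqe->flags = IOSQE_IO_LINK;  /* the next SQE waits for this one */
    read_sqe->user_data = 1;          /* echoed back in the read's CQE */

    memset(send_sqe, 0, sizeof(*send_sqe));
    send_sqe->opcode = IORING_OP_SEND;
    send_sqe->fd = sock_fd;
    send_sqe->addr = (uint64_t)(uintptr_t)buf;
    send_sqe->len = len;              /* fixed up front; see note below */
    send_sqe->user_data = 2;          /* last in the chain: no link flag */
}

/* Setup parameters that ask the kernel for a submission-polling thread. */
struct io_uring_params sqpoll_params(uint32_t idle_ms)
{
    struct io_uring_params p;

    memset(&p, 0, sizeof(p));
    p.flags = IORING_SETUP_SQPOLL;
    p.sq_thread_idle = idle_ms;       /* idle time before the poller sleeps */
    return p;
}

/* Nonzero when the poller has gone to sleep and needs an io_uring_enter
 * call with IORING_ENTER_SQ_WAKEUP. sq_flags is the flags word read from
 * the mapped submission ring. */
int sq_needs_wakeup(uint32_t sq_flags)
{
    return (sq_flags & IORING_SQ_NEED_WAKEUP) != 0;
}
```

Because the send length is fixed when the chain is built, this shape only makes sense when the read is expected to fill the buffer. liburing's io_uring_submit performs the wakeup check itself, so application code rarely needs sq_needs_wakeup directly.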

3. SQL in the Loop: PostgreSQL 18 and Asynchronous I/O

While io_uring has been utilized by low-level networking frameworks and runtime engines (such as Node.js via libuv and Rust via tokio-uring), its integration into core database engines marks a critical milestone for AI infrastructure. Specifically, the release of PostgreSQL 18 (and its subsequent stable minor release 18.4 in May 2026) introduces a completely redesigned Asynchronous I/O (AIO) engine.

PostgreSQL uses a process-based model: each client connection is served by a dedicated backend process. Before version 18, disk reads were synchronous, helped only by prefetch hints (posix_fadvise) and OS-level read-ahead. When a query required scanning a large index or performing a sequential scan over a table that exceeded the shared buffer pool, the backend process blocked, waiting for the block device to fetch the pages.

The PostgreSQL 18 AIO Subsystem

In PostgreSQL 18, the database implements a unified AIO subsystem that can be configured to use io_uring on modern Linux systems via the io_method setting. Its values are sync, worker (the default), and io_uring, and the io_uring method requires a server built with liburing support.

When PostgreSQL needs to read blocks from disk (e.g., during a sequential scan, bitmap heap scan, or vacuuming operation) with io_method set to io_uring, the backend no longer has to issue one blocking pread() call per block. Instead, it generates a series of block read requests, writes them as SQEs to an io_uring instance, and continues executing memory-bound processing or preparing subsequent steps of the query plan.

Verified Source: PostgreSQL 18 Documentation

PostgreSQL 18 introduces an asynchronous I/O subsystem that lets backends queue multiple read requests, benefiting sequential scans, bitmap heap scans, and vacuum operations.

This architecture provides three primary benefits:

Maximizing Queue Depth: Traditional NVMe storage devices achieve their rated read speeds only when processing highly parallel requests (typically requiring a queue depth of 32 or 64). Traditional synchronous PostgreSQL reads could not exploit this parallelism without spawning dozens of concurrent processes. PostgreSQL 18's AIO engine can keep the hardware queue saturated from a single backend process, fully utilizing NVMe parallel channels.
Reduced CPU Overhead: By bypassing standard syscall paths and using io_uring memory-mapped ring buffers, CPU consumption per gigabyte of read data drops significantly. During massive scans, this frees up CPU cycles for expensive query operations like vector distance calculations, aggregations, and joins.
Faster Vacuuming: VACUUM, whether run manually or by the autovacuum workers, is critical for maintaining database health and index performance, and it spends much of its time waiting for block reads. In PostgreSQL 18, vacuum issues its heap reads through the AIO subsystem, fetching pages ahead of processing, with reported reductions in vacuum duration of up to 3×.
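The queue-depth argument follows from Little's law: sustained request rate is roughly the number of requests in flight divided by per-request latency. The sketch below turns that into arithmetic. The function names are hypothetical, and the 100 µs latency used in the usage note is an assumed round number for illustration, not a measurement. The 8 KiB block size matches PostgreSQL's default page size.

```c
/* Little's law: IOPS is approximately in-flight requests / latency.
 * latency_us is the per-request latency in microseconds. */
double iops_at_depth(unsigned queue_depth, double latency_us)
{
    return (double)queue_depth * 1e6 / latency_us;
}

/* Converts the request rate into MiB/s for a given block size. */
double throughput_mib_s(unsigned queue_depth, double latency_us,
                        unsigned block_bytes)
{
    return iops_at_depth(queue_depth, latency_us) * block_bytes
           / (1024.0 * 1024.0);
}
```

Under the assumed 100 µs latency, throughput_mib_s(1, 100.0, 8192) gives about 78 MiB/s, while throughput_mib_s(32, 100.0, 8192) gives 2500 MiB/s. The model treats latency as constant and ignores device limits, so real drives flatten out once their internal parallelism is exhausted. It still shows why a single synchronous reader cannot approach the rated speed of an NVMe device.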

For AI developers, this database-level performance gain matters, although how much of it carries over depends on the access path. Vector search extensions like pgvector store high-dimensional embeddings in HNSW (Hierarchical Navigable Small World) index structures. Traversing these graphs requires hopping between memory locations and fetching index blocks from storage. To the extent that a workload's reads go through the AIO-enabled paths, PostgreSQL 18 can help such queries spend more of their time executing distance math on the CPU and less of it waiting in blocking queues for disk blocks.

4. AI/ML Workload Impacts: Training Data & Inference Retrieval

To see how io_uring acts as a force multiplier for machine learning engineering, we must examine its impact on the two distinct phases of the model lifecycle: training data ingestion and real-time inference retrieval.

Resolving GPU Starvation in Training Pipelines

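Deep learning training loops are highly structured. A typical epoch processes data in a pipelined fashion: each batch is read from storage and prepared on the CPU, copied to pinned host memory, transferred to the GPU, and then consumed by the forward and backward pass.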

For models processing rich media (high-resolution images, video files, or raw audio waveforms), the "Read & Decrypt" phase is a major bottleneck. PyTorch's default DataLoader handles this by spawning multiple background worker processes. Each worker runs a loop that reads a batch from disk, parses it, applies augmentations, and copies it to pinned host memory.

Under this multi-process model, the workers compete for disk I/O. Each worker issues synchronous, blocking read system calls. As the GPU processes batches in milliseconds, the workers struggle to keep up, leading to GPU starvation. The GPU sits idle at 0% utilization while waiting for the workers to return from block device queues.

By implementing an io_uring-based file reader (such as integrating liburing into the dataloader worker C++ bindings), the architecture changes:

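Zero-Copy Ingestion: Workers submit read requests for multiple data segments to the SQ in one batch. When the file is opened with O_DIRECT and the destination buffers live in pinned host memory, the device can transfer data directly into those buffers, bypassing intermediate copies. Ordinary buffered reads still pass through the page cache.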
Unified Network & Storage: In large-scale clusters, datasets are often stored on distributed network filesystems (e.g., Lustre, Ceph, or GPFS) mounted over TCP or RDMA. io_uring accepts file reads and network socket receives (recv) through the same rings. A single event loop can manage both the retrieval of data from a network storage server and reads from local block devices, keeping the storage queue depth saturated.
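The batching step can be sketched as a function that describes a whole file as a series of read requests, one SQE per segment. As before, this fills the kernel's UAPI structure from <linux/io_uring.h> without submitting anything. The name prep_segment_reads and its parameters are hypothetical. Whether the reads end up copy-free depends on how the file was opened and where dst was allocated, which this sketch does not decide.

```c
#include <linux/io_uring.h>
#include <stdint.h>
#include <string.h>

/* Fills up to max_sqes read requests covering [0, file_size) in pieces of
 * seg_size bytes, each landing at the matching offset inside dst.
 * Returns the number of SQEs prepared. */
unsigned prep_segment_reads(struct io_uring_sqe *sqes, unsigned max_sqes,
                            int fd, uint8_t *dst,
                            uint64_t file_size, uint32_t seg_size)
{
    unsigned n = 0;

    if (seg_size == 0)
        return 0;

    for (uint64_t off = 0; off < file_size && n < max_sqes;
         off += seg_size, n++) {
        uint64_t remaining = file_size - off;
        struct io_uring_sqe *sqe = &sqes[n];

        memset(sqe, 0, sizeof(*sqe));
        sqe->opcode = IORING_OP_READ;
        sqe->fd = fd;
        sqe->off = off;
        sqe->addr = (uint64_t)(uintptr_t)(dst + off);
        /* The final segment may be shorter than seg_size. */
        sqe->len = remaining < seg_size ? (uint32_t)remaining : seg_size;
        sqe->user_data = n;           /* segment index, echoed in the CQE */
    }
    return n;
}
```

A 10,000-byte file split into 4096-byte segments yields three requests of 4096, 4096, and 1808 bytes. A worker would copy these into the submission ring and hand all of them to the kernel with one submit call, then match completions back to segments through user_data.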

Enhancing Low-Latency Inference Retrieval

During LLM inference, latency is measured in milliseconds per token. While the weights of a frontier-class model are stored in GPU memory, the surrounding systems layer must fetch auxiliary context in real-time:

Vector DB Lookup: Retrieving similar chunks from an HNSW index to populate the prompt context (Retrieval-Augmented Generation).
KV Cache Swapping: When handling long-context agent conversations, the Key-Value (KV) cache of inactive agents is often swapped out of GPU memory to host RAM or local NVMe storage to save space. When the agent becomes active again, the cache must be loaded back into VRAM instantly.

Swapping a 16 GB KV cache from NVMe to GPU memory via standard blocking calls blocks the inference worker, causing a visible lag in the first-token response.

Using io_uring with direct block device access, the system can stream the KV cache asynchronously. By submitting concurrent read requests for the cache segments, the system can keep the NVMe device's PCIe link busy, transferring the data to host memory in parallel with the GPU executing the initial prompt processing. The transfer occurs in the background, reducing the time-to-first-token (TTFT) latency for the end user.

5. Practical Implementation: Reconstructing Async Ring Submissions

To understand how to program against io_uring, we can examine a low-level C implementation using the standard liburing helper library. The following example demonstrates how to initialize a ring, submit an asynchronous read request for a dataset block, and retrieve the completion event.

#include <stdio.h>
#include <fcntl.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <liburing.h>

#define QUEUE_DEPTH 32
#define BLOCK_SIZE 4096

int main(int argc, char *argv[]) {
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    struct iovec iov;
    int fd, ret;
    void *buf;

    if (argc < 2) {
        fprintf(stderr, "Usage: %s <filename>\n", argv[0]);
        return 1;
    }

    // 1. Initialize the io_uring instance
    // We request a queue depth of 32 entries.
    ret = io_uring_queue_init(QUEUE_DEPTH, &ring, 0);
    if (ret < 0) {
        fprintf(stderr, "Failed to initialize io_uring queue: %s\n", strerror(-ret));
        return 1;
    }

    // 2. Open the target dataset file
    fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("Failed to open file");
        io_uring_queue_exit(&ring);
        return 1;
    }

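    // Allocate the read buffer. The file is opened without O_DIRECT, so no
    // special alignment is needed; an O_DIRECT read would require an aligned
    // buffer (for example from posix_memalign).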
    buf = malloc(BLOCK_SIZE);
    if (!buf) {
        perror("Failed to allocate buffer");
        close(fd);
        io_uring_queue_exit(&ring);
        return 1;
    }

    // Set up the iovec structure pointing to our buffer
    iov.iov_base = buf;
    iov.iov_len = BLOCK_SIZE;

    // 3. Obtain a Submission Queue Entry (SQE) from the ring
    sqe = io_uring_get_sqe(&ring);
    if (!sqe) {
        fprintf(stderr, "Failed to get submission queue entry\n");
        free(buf);
        close(fd);
        io_uring_queue_exit(&ring);
        return 1;
    }

    // 4. Prepare the read operation
    // We set the target file descriptor, the iovec array, and the file offset (0).
    io_uring_prep_readv(sqe, fd, &iov, 1, 0);

    // Attach custom user metadata to identify the operation upon completion
    io_uring_sqe_set_data(sqe, (void *)0xDEADBEEF);

    // 5. Submit the SQE to the kernel
    // In a high-frequency loop, we can submit multiple SQEs in a single call.
    ret = io_uring_submit(&ring);
    if (ret < 0) {
        fprintf(stderr, "Failed to submit SQE: %s\n", strerror(-ret));
        free(buf);
        close(fd);
        io_uring_queue_exit(&ring);
        return 1;
    }

    printf("I/O request submitted asynchronously. Waiting for completion...\n");

    // 6. Wait for the Completion Queue Entry (CQE)
    // This blocks the calling thread until at least one operation completes.
    ret = io_uring_wait_cqe(&ring, &cqe);
    if (ret < 0) {
        fprintf(stderr, "Failed to wait for CQE: %s\n", strerror(-ret));
        free(buf);
        close(fd);
        io_uring_queue_exit(&ring);
        return 1;
    }

    // Verify the completion metadata and status
    if (cqe->res < 0) {
        fprintf(stderr, "I/O operation failed: %s\n", strerror(-cqe->res));
    } else {
        printf("I/O completed successfully. Read %d bytes.\n", cqe->res);
        printf("UserData metadata token: %p\n", io_uring_cqe_get_data(cqe));
    }

    // 7. Mark the CQE as processed to clear it from the completion ring
    io_uring_cqe_seen(&ring, cqe);

    // Cleanup resources
    free(buf);
    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}

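This example walks through one full submit-and-complete cycle of the asynchronous programming model. It handles a single request and blocks in io_uring_wait_cqe, so by itself it is no faster than a plain read. In a real pipeline the application queues many SQEs before each submit and reaps CQEs in a loop, which lets high-throughput systems scale without spawning a thread per request or taking locks for basic operations.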

6. Conclusion: Stop Waiting for the Disk

The transition from compute-bound optimization to storage-bound optimization represents a natural maturation of AI engineering. As we build systems that query larger contexts, update massive vector spaces, and run persistent pipelines, the systems-layer design of our applications becomes a primary performance driver.

Traditional synchronous and readiness-based models (epoll) are no longer sufficient to meet the throughput demands of AI workloads. They introduce CPU overhead, system-call latency, and thread blocking that choke processing hardware.

By adopting io_uring at the kernel layer and utilizing database engines like PostgreSQL 18 that integrate it natively, we remove these structural boundaries. We allow our training workers to ingest data at the wire speed of modern NVMe devices, and our inference servers to swap memory states without blocking end-user tokens.

True engineering excellence is not just about using a larger model; it is about building a system that allows your resources to run at full capability. As you design your next data ingestion pipeline or scale your vector database cluster, remember: the kernel does not have to block your progress. Stop waiting for the disk.

This article was human-architected and synthesized with AI assistance under the Aether (AI) persona.