Problem
Coding agents don’t feel fast because they aren’t.
In our benchmarks, agents spend 60%+ of their time searching for the right code, not generating any. That’s why they do more than you asked for and break developer flow.
The bottleneck isn’t “agent intelligence.”
It’s speed: slow context retrieval, and the irrelevant code that gets shoved into the prompt.
Most agent stacks today are basically sequential grep pipelines: search, read, reason, search again. That loop is slow, noisy, and compounds latency at every step.
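To make that anti-pattern concrete, here is a minimal sketch of such a loop; `pick_next_query` is a hypothetical stand-in for the agent's model call, not any particular framework's API:

```python
import subprocess
from typing import Callable, Optional

def sequential_grep_pipeline(
    task: str,
    repo: str,
    pick_next_query: Callable[[str, list[str]], Optional[str]],
    max_rounds: int = 8,
) -> list[str]:
    """Naive agent retrieval: grep, read, ask the model, grep again.

    Each round blocks on a full search plus a model round-trip before
    the next search can start, so latency compounds linearly.
    """
    context: list[str] = []
    query: Optional[str] = task
    for _ in range(max_rounds):
        if query is None:
            break
        # One blocking search per round -- no parallelism across queries.
        result = subprocess.run(
            ["grep", "-rn", query, repo], capture_output=True, text=True
        )
        context.append(result.stdout)
        # Another blocking model call to choose the next query.
        query = pick_next_query(task, context)
    return context
```

Every round also dumps raw grep output straight into the context, which is exactly how the prompt fills up with irrelevant code.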
WarpGrep is built to do that dirty job correctly and fast.
Our Insight
**We value human attention.**
You can’t build responsive coding agents until retrieval is treated as its own learning and inference optimization problem.
We optimized for a simple goal: keep both the developer and the agent inside the sub-10-second “flow window.” Anything slower and usage collapses.
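One way to make that window concrete is to treat it as a hard wall-clock budget on retrieval. A minimal sketch under that assumption, with a hypothetical caller-supplied `retrieve` coroutine:

```python
import asyncio
from typing import Awaitable, Callable

FLOW_WINDOW_SECONDS = 10.0  # the sub-10-second flow window described above

async def retrieve_within_budget(
    retrieve: Callable[[str], Awaitable[list[str]]],
    query: str,
    budget: float = FLOW_WINDOW_SECONDS,
) -> list[str]:
    """Run retrieval under a hard wall-clock budget instead of stalling."""
    try:
        return await asyncio.wait_for(retrieve(query), timeout=budget)
    except asyncio.TimeoutError:
        # Past the flow window: degrade to an empty result rather than block.
        return []
```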
What we built
WarpGrep is an RL-trained retrieval model designed specifically to be used as a tool by a coding agent. It operates under a strict budget.
WarpGrep is an expert at two things: deciding what to grep, and deciding which context is relevant to the task. That’s it. That combination reduces context rot by more than fifty percent in production and eliminates the “forty irrelevant files in your prompt” failure mode.
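As an illustration of what “used as a tool” can look like from the agent’s side, here is a tool definition in the usual JSON-schema style; the name, fields, and cap are assumptions for the sketch, not WarpGrep’s actual interface:

```python
# Hypothetical tool schema an agent framework could expose for WarpGrep.
WARPGREP_TOOL = {
    "name": "warpgrep_search",
    "description": (
        "Given the current task, decide what to grep and return only the "
        "code context relevant to that task."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "task": {
                "type": "string",
                "description": "What the agent is trying to accomplish.",
            },
            "max_files": {
                "type": "integer",
                "description": "Hard cap on files returned (budget knob).",
            },
        },
        "required": ["task"],
    },
}
```

The design point is that the agent hands over intent, not grep patterns; deciding what to search for is the retrieval model’s job.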
Performance
For comparison, SWE-Grep runs at around 650 tokens per second on Cerebras. WarpGrep hits around 900 tokens per second on an NVIDIA B200.
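A back-of-the-envelope calculation connects those speeds to the flow window above; the 3,000-token reasoning budget is an assumed figure for illustration, and prefill and tool-execution time are ignored:

```python
# Decode time alone, at the quoted throughputs (illustrative budget).
TOKEN_BUDGET = 3_000  # assumed retrieval-reasoning tokens per request
for name, tps in [("SWE-Grep on Cerebras", 650), ("WarpGrep on B200", 900)]:
    print(f"{name}: {TOKEN_BUDGET} tokens / {tps} tok/s = {TOKEN_BUDGET / tps:.1f}s")
# SWE-Grep on Cerebras: 3000 tokens / 650 tok/s = 4.6s
# WarpGrep on B200: 3000 tokens / 900 tok/s = 3.3s -- well inside the window
```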
We worked closely with NVIDIA to optimize WarpGrep. CUDA gives us the stability and the low-level control to push non-standard inference workloads for parallel search.
RL Training
RL for MoEs is notoriously inefficient, so we built infrastructure to eliminate dead time in the training loop. Those optimizations delivered a 1.6–2.35× boost in training throughput with essentially no loss in sample efficiency.
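One standard way to eliminate dead time in an RL loop is to overlap rollout generation with gradient updates instead of alternating them. A minimal sketch of that general pattern (not necessarily our exact infrastructure), with hypothetical `generate_rollouts` and `train_step` callables:

```python
import queue
import threading
from typing import Any, Callable

def overlapped_rl_loop(
    generate_rollouts: Callable[[], Any],
    train_step: Callable[[Any], None],
    num_steps: int,
    buffer_size: int = 4,
) -> None:
    """Overlap rollout generation with training so neither side idles."""
    batches: queue.Queue = queue.Queue(maxsize=buffer_size)

    def producer() -> None:
        for _ in range(num_steps):
            batches.put(generate_rollouts())  # inference-side work

    threading.Thread(target=producer, daemon=True).start()
    for _ in range(num_steps):
        train_step(batches.get())  # learner-side work, overlapped with generation
```

A small buffer bounds how stale the consumed rollouts can get, which is why this kind of overlap need not cost sample efficiency.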
Why this matters
Every company building coding agents is running into the same wall.
Once your agent touches a large codebase, retrieval dominates latency and derails reasoning.
You solve it by giving the agent a retrieval system that behaves like a specialist, not a bottleneck.
If you want an agent that actually performs on large codebases, doesn’t have crippling context rot, and stays within real-time latency, reach out!