Software engineering capabilities have already emerged in frontier LLMs. Across SWE-Bench, Terminal-Bench, and OS-World, model capabilities keep rising.
Scaling pre-training has let models learn the fundamental components of reality, but the last edge of performance comes from agents learning specialized tools very well in training gyms, whether bash, file editing, or computer use.
Some fun things we’re working on…
We’ve archived the web to generate the next frontier of Computer Use environments with tasks that elicit failure modes across the frontier models.
We’re also pioneering mechanistic interpretability methods to deterministically elicit failure modes in the coding capabilities of open-source models.
Our team is ex-Uber ML and ex-Amazon scraping, and we previously turned down seven-figure-per-year job offers from Scale AI to start Refresh.
We’re actively working with frontier labs to provide the highest-quality simulation environments for training computer use and software engineering capabilities. If you’re training frontier intelligence, we’d love to talk!
Get in touch at founders@refresh.dev :)