I’m Kevin. I studied AI + Math at MIT before dropping out at 19 to build Soren AI.
TL;DR: Evals are essential for building trust in AI systems, but they’re hard to design and even harder to maintain. Soren makes it easy for teams to build, update, and improve their evals at scale.
We’re already working with teams powering millions of real-world AI interactions every week.
Every update to a model, prompt, or tool can unintentionally cause new errors or regressions. Teams spend hundreds of hours building and maintaining evals, then manually comb through the results to figure out what went wrong.
The result is wasted engineering time, slower iteration cycles, and less confidence with every deployment.
Soren automates the painful, manual process of evaluating and testing AI systems.
Instead of engineers spending days triaging failed cases and writing new tests, Soren’s AI agents do it for them — continuously and at scale.
Here’s how:
1) Adaptive Testing: Every time your model, prompt, or tool changes, Soren updates its tests to stress-test the new version.
2) Root Cause Analysis: When something breaks, Soren clusters similar failures, connects them to real-world examples, and helps pinpoint the exact component that needs improvement (a rough sketch of the clustering idea follows below).
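To make the clustering idea in (2) concrete, here is a minimal sketch of how failed eval cases could be grouped by textual similarity, so a reviewer scans a handful of failure themes instead of a raw list. This is an illustration only, not how Soren works under the hood: the example failure descriptions and the TF-IDF + k-means approach are stand-ins for whatever representation and clustering method a real system would use.

```python
# Minimal sketch: group failed eval cases by textual similarity so related
# failures can be reviewed together. Illustrative only, not Soren's implementation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical failed eval cases: short descriptions of what went wrong.
failed_cases = [
    "Refund request: model quoted the wrong refund window (30 days vs 14).",
    "Refund request: model invented a 'no questions asked' policy.",
    "Date math: model said the meeting is on a Tuesday; it falls on a Wednesday.",
    "Date math: model added 7 days instead of 5 business days.",
    "Tool call: model passed a string where the API expects an integer ID.",
    "Tool call: model omitted the required 'customer_id' argument.",
]

# Represent each failure description as a TF-IDF vector.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(failed_cases)

# Cluster into a small number of groups; a real system would pick the
# number of clusters automatically or by inspection.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Print each cluster with its member failures, so one representative
# example per group can be triaged instead of every failed case.
for cluster_id in range(kmeans.n_clusters):
    print(f"\nCluster {cluster_id}:")
    for case, label in zip(failed_cases, labels):
        if label == cluster_id:
            print(f"  - {case}")
```

In practice you would cluster richer signals (full traces, tool calls, judge rationales) rather than one-line descriptions, but the workflow is the same: group similar failures, inspect one example per group, and fix the underlying cause.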
While studying at MIT, I published two research papers with collaborators at Harvard and built one of the world’s largest LLM benchmarks. In the process, I saw firsthand how fragmented and painful AI evaluation is.
AI is racing ahead, but the systems meant to test and trust it are stuck in the past. I left MIT at 19 to change that.
If you’re frustrated with your evals or want to level them up, I’d love to chat. Grab some time with me here or send me an email at kevin@soren-ai.com.