I’m Kevin. I studied AI + Math at MIT before dropping out at 19 to build Soren AI.
TL;DR: Evals are essential for building trust in AI systems, but they’re hard to design and even harder to maintain. Soren makes it easy for teams to build, update, and improve their evals at scale.
We’re already working with teams powering millions of real-world AI interactions every week.
Every update to a model, prompt, or tool can unintentionally cause new errors or regressions. Teams spend hundreds of hours building and maintaining evals, then manually comb through the results to figure out what went wrong.
The result is wasted engineering time, slower iteration cycles, and less confidence with every deployment.
Soren automates the painful, manual process of evaluating and testing AI systems.
Instead of engineers spending days triaging failed cases and writing new tests, Soren’s AI agents do it for them — continuously and at scale.
Here’s how:
1) Adaptive Testing: Every time your model, prompt, or tool changes, Soren updates its tests to stress-test the new version.
2) Root Cause Analysis: When something breaks, Soren clusters similar failures, connects them to real-world examples, and helps pinpoint the exact component that needs improvement (a rough sketch of the clustering idea follows below).
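To make the clustering idea in (2) concrete, here is a minimal sketch of how failed eval cases could be grouped by textual similarity, so a reviewer scans a handful of failure themes instead of a raw list. This is an illustration only, not how Soren works under the hood: the example failure descriptions and the TF-IDF + k-means approach are stand-ins for whatever representation and clustering method a real system would use.

```python
# Minimal sketch: group failed eval cases by textual similarity so related
# failures can be reviewed together. Illustrative only, not Soren's implementation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical failed eval cases: short descriptions of what went wrong.
failed_cases = [
    "Refund request: model quoted the wrong refund window (30 days vs 14).",
    "Refund request: model invented a 'no questions asked' policy.",
    "Date math: model said the meeting is on a Tuesday; it falls on a Wednesday.",
    "Date math: model added 7 days instead of 5 business days.",
    "Tool call: model passed a string where the API expects an integer ID.",
    "Tool call: model omitted the required 'customer_id' argument.",
]

# Represent each failure description as a TF-IDF vector.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(failed_cases)

# Cluster into a small number of groups; a real system would pick the
# number of clusters automatically or by inspection.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Print each cluster with its member failures, so one representative
# example per group can be triaged instead of every failed case.
for cluster_id in range(kmeans.n_clusters):
    print(f"\nCluster {cluster_id}:")
    for case, label in zip(failed_cases, labels):
        if label == cluster_id:
            print(f"  - {case}")
```

In practice you would cluster richer signals (full traces, tool calls, judge rationales) rather than one-line descriptions, but the workflow is the same: group similar failures, inspect one example per group, and fix the underlying cause.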
While studying at MIT, I published two research papers with collaborators at Harvard and built one of the world’s largest LLM benchmarks. In the process, I saw firsthand how fragmented and painful AI evaluation is.
AI is racing ahead, but the systems meant to test and trust it are stuck in the past. I left MIT at 19 to change that.
If you’re frustrated with your evals or want to level them up, I’d love to chat. Grab some time with me here or send me an email at kevin@soren-ai.com.