TL;DR:
Lemma is the first evaluation + observability platform built not just to measure performance, but to improve it automatically. We help AI agents learn from real user feedback and production data, closing the loop so your prompts and agents continuously optimize themselves over time.
Launch Video: https://www.youtube.com/watch?v=E4_v-pY_4fs
Hey everyone! We’re Jerry and Cole, co-founders of Lemma (YC F25).
The Problem:
AI agents don’t learn from their mistakes. In fact, they get worse with use.
In production, prompts and agents degrade as real-world inputs drift (new user behaviors, unseen edge cases). Agent performance often drops ~40% within a few weeks, and suddenly what worked in testing breaks in front of customers.
When that happens, engineers are forced to dig through logs, collect failing examples, and manually tweak prompts rather than building core product features.
Solution:
That’s why we built Lemma: the first end-to-end system that closes the loop between agent deployment and improvement.
Here's what that means:
Step 1: Lemma detects failed outcomes directly from live traffic and automatically pinpoints the exact cause within your agent chain.
Step 2: Lemma alerts you, and with one click it runs targeted prompt optimizations to fix the failing behavior, with no manual tracing or guesswork.
Step 3: We hand you back an improved prompt and automatically open a PR in your codebase so your prompts live where you want them. Alternatively, you can fetch the prompt from the Lemma dashboard.
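To make the loop concrete, here's a minimal, purely illustrative sketch of what an integration could look like in code. The `LemmaClient` class, its `get_prompt` and `log_trace` methods, the `LEMMA_API_KEY` variable, and the "support-triage" prompt name are all hypothetical placeholders, not Lemma's actual SDK; they only show the shape of the fetch-prompt / run-agent / log-feedback cycle described above.

```python
# Purely illustrative sketch -- everything named here is a hypothetical
# placeholder, not Lemma's real SDK or API.
import os


class LemmaClient:
    """Hypothetical stand-in for a client that serves prompts and logs traces."""

    def __init__(self, api_key: str):
        self.api_key = api_key

    def get_prompt(self, name: str, default: str) -> str:
        # A real integration would fetch the latest optimized prompt version
        # (e.g., the one merged from an auto-opened PR); here we just return
        # the default so the sketch stays self-contained.
        return default

    def log_trace(self, prompt_name: str, inputs: dict, output: str,
                  feedback: str | None = None) -> None:
        # A real client would ship this trace plus user feedback to the
        # platform, where failed outcomes are detected and attributed to a
        # specific step in the agent chain. This stub does nothing.
        pass


client = LemmaClient(api_key=os.environ.get("LEMMA_API_KEY", ""))

# 1. Fetch the current prompt; optimized versions would replace the default over time.
support_prompt = client.get_prompt(
    name="support-triage",
    default="You are a support triage agent. Classify the ticket and draft a reply.",
)

# 2. Run your agent as usual, then log the trace and any user feedback so
#    failures can be detected from live traffic and fed back into optimization.
ticket = {"subject": "Refund not processed", "body": "I was charged twice."}
agent_output = "Category: billing. Draft: Sorry about the double charge..."  # your agent call goes here
client.log_trace(
    prompt_name="support-triage",
    inputs=ticket,
    output=agent_output,
    feedback="thumbs_down",  # real user feedback is what closes the loop
)
```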
Plus, Lemma provides all the LLM eval and observability features you already rely on, reimagined for continuous learning.
Teams using Lemma cut manual prompt iteration by 90%, resolve production drift in minutes instead of days, and improve model performance by ~2–5% with every optimization cycle.
Our Story:
We met freshman year at USC and have been building together ever since, instead of going to class.
Before starting Lemma, we were engineers at two high-growth, AI-native startups: Tandem (AI for healthcare) and Chipstack (AI agents for chip design). At both companies, setting up evaluations meant clunky Retool dashboards and multiple engineers manually tweaking experiments. We built internal systems that automated both the evaluation runs and the error-driven feedback loop. The result: 2x improvements in accuracy and iteration speed.
We soon realized every AI company was reinventing the same internal tooling in-house. So we left college, joined YC, and are now bringing continuous learning infrastructure to everyone else.
Ask:
Try our platform - If you’re building with LLMs and run a ton of prompt or eval experiments, we’d love to work with you.
Introductions - If you know a Head of AI/Eng or CTO at a pre-seed to Series A startup, we owe you lunch :)
Please reach out at jerry@uselemma.ai or book a live demo on our website uselemma.ai. All help is appreciated - thank you!