{"id":87657,"title":"Confident AI: The Evals Platform for AI Quality \u0026 Observability","tagline":"An evaluation platform for engineers, QAs, and PMs to pinpoint which iteration of their AI to put in production.","body":"**TL;DR: Confident AI is a cloud-based evaluation platform that helps teams build reliable AI products. Its evals are powered by DeepEval, an open-source LLM evaluation framework that has over 12k stars and 3 million monthly downloads, runs 2 million evals per day, and is most commonly found in the CI/CD pipelines of enterprises such as BCG, AstraZeneca, AXA, and Microsoft.** \n\n**Confident AI allows engineering teams to iterate on their LLM app 10x faster by bringing best-in-class evals to traditional AI observability.**\n\n[Try Confident AI today (setup in 5 min)](https://docs.confident-ai.com/confident-ai/confident-ai-introduction)\n\n![uploaded image](/media/?type=post\u0026id=87657\u0026key=user_uploads/1314014/fc01311e-e55b-4632-8256-75115c04f3f5)\n\n### **The Problem**\n\nLLM evaluation has many solutions on the market, yet it remains unsolved. General LLMOps observability platforms that offer evals lack robust metrics and are better suited to debugging through tracing UIs, while evaluation-focused frameworks don’t offer enough control for users to customize and make metrics reliable for specific use cases.\n\nAs a result, developers often build custom evaluation metrics and pipelines from scratch, writing hundreds or even thousands of lines of code to test their LLM apps. The worst part? Once they’ve fine-tuned their metrics and are ready to deploy them across the organization, they hit a roadblock: there’s no easy way to collaborate. 
Because these custom metrics live in scattered code rather than an integrated ecosystem, rolling them out for team-wide adoption becomes frustrating and inefficient.\n\n### **The Solution**\n\nWe built DeepEval for engineers to create **use-case-specific, deterministic** LLM evaluation metrics, and when you're ready, Confident AI brings these evaluation results to the cloud for collaboration. This allows teams to collaborate on LLM app iteration with no extra setup required.\n\n1. Curate your evaluation dataset on Confident AI.\n2. Run evaluations locally with DeepEval's metrics, pulling datasets from Confident AI.\n3. View and share testing reports to compare prompts and models and refine your LLM application.\n\nConfident AI continuously evaluates monitored LLM outputs in production, automatically enriching your dataset with real-world, adversarial test cases. This keeps your evaluation data high-quality and reflective of your use case.\n\n\u003chttps://youtu.be/yLIhVn3B8Wg\u003e\n\n### **How are we different?**\n\nWhat we've learned is that to get the legitimate evaluation results that benchmark-driven iteration requires, you need extremely high-quality metrics and datasets. 
That's why we've built specifically for the ideal LLM evaluation workflow:\n\n* **DeepEval** handles the robust, deterministic metrics required for rigorous, use-case-tailored validation.\n* **Confident AI** gives both technical and non-technical teams the ability to collaborate on pre- and post-deployment evals and observability.\n\n\u003e If your evaluation results are 100% reflective of your LLM application's performance, what's stopping you from shipping the best version of your LLM app?\n\n### **Customer ROI metrics**\n\nWe've been working with companies of all sizes, and some of the ROI metrics include:\n\n* Decreasing LLM cost by **more than 70%** by using evaluation to safely switch from GPT-4o to cheaper models.\n* Decreasing time to deployment for a team of 7 from **1-2 weeks to 2-3 hours**.\n* Helping a team of 30 customer support agents **save 200+ hours a week** by having a centralized place to analyze LLM performance live.\n\nThere's more to come as we start publishing some case studies on our website, so stay tuned!\n\n### **Our Ask**\n\nThanks for sticking with us to the end! We’re on a mission to help companies get the most ROI out of their LLM use cases, and we believe this is only achievable through rigorous LLM evaluation at scale. 
If our mission resonates with you, Confident AI is always here and available to try immediately (coding required, but free to try): \u003chttps://www.confident-ai.com/docs\u003e\n\nIf you want to explore our enterprise offering, you can always [talk to us here.](https://calendly.com/d/cqbp-t88-y4j/confident-ai-intro-call)\n\n### **About Us**\n\nConfident AI was founded by **Jeffrey Ip**, a SWE formerly at Google, where he scaled YouTube's creators studio infrastructure, and at Microsoft, where he built document recommenders for Office 365, and **Kritin Vongthongsri**, an AI researcher and CHI-published author who previously built NLP pipelines for fintech startups and researched self-driving cars/HCI during his time at Princeton, where he studied ORFE and CS.","slug":"Mnp-confident-ai-the-evals-platform-for-ai-quality-observability","created_at":"2025-02-13T17:44:57.361Z","updated_at":"2026-07-22T14:21:11.342Z","total_vote_count":374,"url":"https://www.ycombinator.com/launches/Mnp-confident-ai-the-evals-platform-for-ai-quality-observability","share_image_url":"https://www.ycombinator.com/media/?type=post\u0026id=87657\u0026key=user_uploads/1314014/fc01311e-e55b-4632-8256-75115c04f3f5","company":{"id":30283,"name":"Confident AI","slug":"confident-ai","url":"https://confident-ai.com","logo":"https://bookface-images.s3.amazonaws.com/small_logos/2e739bf439400b44a89bc15e023cbe6bca3f9e00.png","batch":"Winter 2025","industry":"B2B","tags":["Developer Tools","Generative AI","Open Source","AI"],"search_path":"https://bookface.ycombinator.com/company/30283"}}