Overview
Confident AI is building the infrastructure that makes AI trustworthy. We're an observability platform used by engineering teams who need to understand what their AI systems are actually doing. We have strong product-market fit and a growing base of customers deploying in production.
We're looking for a Founding Infrastructure Engineer to own the reliability, scalability, and infrastructure that our platform runs on. Today we operate in the cloud. Soon, our largest customers will deploy us on-prem in environments we don't control — high-traffic, high-ingest workloads where we can't just hotfix in production. You'll be the person who makes sure we survive that transition and thrive on the other side.
This is a foundational hire. You'll design the systems that let an observability platform handle massive data volumes without breaking, build the deployment story for on-prem and hybrid environments, and set the infrastructure standards that the rest of engineering builds on. If you want to own the hardest scaling and reliability problems at an early-stage company, this is the role.
What you'll be doing
- Own the reliability and scalability of our platform. You'll design and operate infrastructure for the sustained, high-throughput data ingestion that observability workloads generate.
- Build our on-prem and hybrid deployment story from scratch. This means packaging the entire service stack — ClickHouse, Postgres, Redis, application services — for customer environments, plus configuration management, upgrade paths, and operational runbooks for deployments where you have limited visibility and no ability to push quick fixes.
- Design and implement the observability, monitoring, and alerting for our own infrastructure — yes, the observability platform needs to observe itself.
- Own our Kubernetes infrastructure, CI/CD pipelines, and cloud-native architecture across AWS/GCP. Make sure engineers can ship fast without breaking things.
- Architect our data layer for scale. Our stack runs on ClickHouse for high-volume analytical ingestion, Postgres for transactional data, and Redis for caching and real-time workloads. You'll make sure none of these become the bottleneck as traffic grows by orders of magnitude.
- Establish infrastructure-as-code practices, deployment automation, and incident response processes that let a small team operate with the reliability of a much larger one.
What we're looking for
- 5+ years in platform engineering, infrastructure, or SRE roles — ideally at a company where uptime and data throughput were existential.
- Deep experience with Kubernetes, container orchestration, and cloud-native infrastructure (AWS and/or GCP).
- Strong experience with ClickHouse, Postgres, or similar — performance tuning, replication, schema design at scale, not just writing queries. Redis experience is a plus.
- You've built or significantly contributed to on-prem or hybrid deployment systems — packaging, shipping, and supporting a multi-service stack in environments you don't control.
- You think in systems, not just services. You understand how failures cascade and you design to prevent that.
- Experience with high-throughput data pipelines — ingestion, processing, and storage at volumes where naive approaches break.
- Comfortable with infrastructure-as-code (Terraform, Pulumi, or similar) and CI/CD automation.
- You've been on-call and you've built the systems that make on-call less painful.
- Self-directed and high-agency. You identify what's about to break before it does, and you fix it without being asked.
- Comfortable with ambiguity and fast iteration. We're a seed-stage startup; you'll be building the foundation, not maintaining someone else's.
- You use AI tools (LLMs, copilots, automation) to move faster. At our size, every engineer needs to deliver the output of several.
Your work will
- Be the reason our platform stays up when a Fortune 500 deploys us on-prem with 100x the traffic we've seen before.
- Define the infrastructure and deployment architecture that scales with the company from seed to market leader.
- Let the rest of the engineering team ship product fast because the platform underneath them is solid.