Founding Site Reliability Engineer (Remote - US) at Jobgether

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Founding Site Reliability Engineer in the United States.

This is a unique opportunity to join a rapidly growing AI company as the first SRE hire in the San Francisco office. In this role, you will define and scale the Site Reliability Engineering discipline, ensuring the platform is reliable, secure, and performant at enterprise scale. You will work closely with engineering leads, product teams, and company founders to build infrastructure, establish best practices, and drive the organization’s reliability culture. The role involves hands-on system design, automation, and observability work, while providing leadership and strategic input to shape long-term operational excellence. Ideal candidates are technically strong, highly collaborative, and motivated by building world-class systems from the ground up.

Accountabilities
- Establish and scale the SRE discipline, including best practices, tooling, and culture.
- Ensure >99.9% uptime of production systems and maintain global platform reliability.
- Architect, automate, and manage AWS infrastructure using Terraform, CI/CD pipelines, and Infrastructure as Code.
- Design and implement observability systems across microservices, APIs, and vector workloads, including metrics, tracing, and logging.
- Lead incident management, reducing MTTR through runbooks, alerts, and postmortems.
- Collaborate with engineering teams to embed reliability principles into the software development lifecycle.
- Influence organizational strategy and culture as a founding voice in the engineering team.

Requirements
- 5+ years of experience in SRE, DevOps, or infrastructure roles, ideally in enterprise SaaS environments.
- Expertise in AWS services (EC2, ECS/EKS, Lambda, RDS, VPC, IAM).
- Proven experience with Infrastructure as Code (Terraform, Kubernetes/EKS, CDK, or CloudFormation).
- Hands-on experience with observability and monitoring stacks (CloudWatch, Grafana, Prometheus, Datadog).
- Experience in incident management, on-call responsibilities, and postmortem-driven reliability improvements.
- Strong problem-solving, communication, and collaboration skills.
- Bonus: exposure to AI/ML platforms, data-heavy systems, or multi-agent workloads.

Company Location: United States.