Staff AWS Site Reliability Engineer (Remote - US) at Jobgether

This position is posted by Jobgether on behalf of SearchStax. We are currently looking for a Staff AWS Site Reliability Engineer in the United States.

This role is ideal for a hands-on infrastructure leader who thrives in fast-paced, high-growth environments. You will be responsible for building and scaling highly available, resilient, and cost-efficient cloud infrastructure that supports thousands of servers and enterprise-grade workloads. The position involves leading reliability and performance initiatives, automating operations, and mentoring engineers to ensure the platform continues to scale with customer and business growth. You will work cross-functionally with development, QA, and product teams to solve complex problems at scale, implement best practices for observability and automation, and help shape the future of a rapidly growing cloud-native platform.

Accountabilities:

In this role, you will:
· Take ownership of the scalability, reliability, and performance of cloud infrastructure supporting high-traffic workloads.
· Design and implement automation frameworks for provisioning, monitoring, logging, scaling, and recovery to minimize manual operations.
· Continuously evaluate and tune systems for latency, throughput, cost efficiency, and reliability.
· Build resilient, self-healing, and observable systems using SLOs, error budgets, and reliability engineering best practices.
· Collaborate closely with development, QA, and product engineering teams to deliver highly available and performant services.
· Lead incident management processes, perform root cause analysis, and implement preventive measures.
· Mentor and guide other engineers, establishing standards for infrastructure and reliability best practices.

The ideal candidate will have:
· 7+ years of experience in Site Reliability, DevOps, or Infrastructure Engineering roles, ideally in startup or high-growth environments.
· Deep hands-on expertise with AWS services, including EC2, EKS, RDS, S3, CloudFront, VPC, IAM, and multi-region architectures.
· Strong experience with Infrastructure as Code tools such as Terraform or CloudFormation.
· Proficiency in scripting or programming languages such as Python or Go for automation.
· Expertise in monitoring and observability tools such as Prometheus, Grafana, Loki, ELK/EFK, or Datadog.
· Experience with CI/CD pipelines, containers, and orchestration technologies such as Docker, Kubernetes, Jenkins, or GitHub Actions.
· Strong performance engineering skills to optimize systems for scalability and efficiency.
· Proven ability to diagnose complex production issues and implement long-term solutions.
· A startup mindset: comfortable wearing multiple hats, moving fast, and balancing pragmatism with technical excellence.

Bonus Skills:
· Experience managing large-scale Elasticsearch or Apache Solr deployments.
· Background in compliance-sensitive environments (SOC2, HIPAA).
· Experience with high-throughput distributed data platforms.
· Leadership experience managing SRE or Infrastructure teams in SaaS companies.

Company Location: United States.