Senior Engineer, Production Operations (Remote - US) at Jobgether

This position is posted by Jobgether on behalf of Greenlight. We are currently looking for a Senior Engineer, Production Operations in the United States.

This role offers the opportunity to shape and maintain highly reliable, scalable, and performant cloud systems that support mission-critical services in a fast-growing fintech environment. The Senior Engineer will focus on driving operational excellence through automation, Infrastructure as Code, and robust monitoring practices while collaborating with development and security teams. This position empowers a seasoned engineer to tackle complex production challenges, enhance system reliability, and improve operational efficiency across a microservices ecosystem. Success in this role directly impacts the stability and scalability of services used by millions of end users, making it both high-impact and highly rewarding.

Accountabilities:

- Design, implement, and maintain core cloud infrastructure and Site Reliability Engineering (SRE) practices to ensure high availability and performance.
- Develop and optimize cloud infrastructure using Infrastructure as Code (primarily Terraform) and automation tools.
- Collaborate with development and security teams to integrate SRE principles into the software development lifecycle.
- Design and manage monitoring, logging, and alerting solutions to provide clear visibility into system health.
- Participate in incident response, conduct root cause analyses, and contribute to blameless postmortems.
- Identify and implement architectural improvements to enhance system reliability, resilience, and efficiency.
- Automate operational tasks and processes to reduce toil and improve productivity.
- Research, evaluate, and advocate for new tools or technologies to improve operational posture.
- Enhance engineering tooling, processes, and standards for consistent and repeatable application delivery.

Requirements:

- 5+ years of experience in Site Reliability Engineering, Production Operations, or similar roles focused on cloud infrastructure and distributed systems.
- Proven experience architecting and maintaining highly available, secure, and scalable systems in a public cloud environment (AWS preferred).
- Strong proficiency with Infrastructure as Code tools, particularly Terraform.
- Experience automating operational tasks using scripting languages (Python, Go, Bash) and automation platforms.
- Expertise in monitoring, logging, and alerting solutions (Datadog, Prometheus, Grafana, ELK stack).
- Solid understanding of incident response best practices and troubleshooting complex production issues.
- Knowledge of distributed systems, microservices architectures, and containerization technologies (Docker, Kubernetes/EKS).
- Exceptional analytical, problem-solving, and collaboration skills, with the ability to communicate technical concepts effectively to technical and non-technical stakeholders.
- Passion for improving system reliability, performance, and operational efficiency.

Bonus Points:

- Experience with payments infrastructure or high-volume transactional systems.
- Familiarity with database technologies (PostgreSQL, Cassandra, DynamoDB).
- Experience with CI/CD pipelines and automation of software delivery.

Company Location: United States.