Principal Site Reliability Engineer (Remote - US) at Jobgether

This position is posted by Jobgether on behalf of General Motors. We are currently looking for a Principal Site Reliability Engineer in the United States.

This role offers a critical opportunity to enhance the reliability, scalability, and efficiency of large-scale distributed systems within a dynamic automotive software environment. As a hands-on individual contributor, you will blend software and systems engineering skills to develop automated solutions, improve observability, and maintain service health. Collaborating closely with development teams, you will help drive operational excellence and cost-efficiency while responding proactively to incidents. This position supports a hybrid remote work setup, with occasional on-site collaboration expected for candidates living near designated locations.

Accountabilities:

- Develop automation tools and software to streamline operational processes and improve system reliability.
- Lead and enhance observability frameworks to detect and resolve issues proactively.
- Participate in on-call rotations to troubleshoot production incidents, minimizing downtime.
- Collaborate closely with software developers to ensure service scalability, reliability, and quality.
- Manage service level indicators (SLIs), objectives (SLOs), and agreements (SLAs) to meet reliability goals.
- Conduct post-incident reviews and failure analyses to foster continuous improvement.
- Identify and implement cost-saving optimizations while maintaining high service standards.

Requirements:

- 8+ years of experience in site reliability engineering, systems engineering, or related fields.
- Proficiency in at least one programming language such as Python, Go, or Java.
- Strong understanding of operating systems, networking, distributed systems, databases, and storage architectures.
- Experience managing production incidents, root cause analysis, and mitigation strategies.
- Familiarity with cloud platforms like AWS, GCP, or Azure, and container orchestration tools such as Kubernetes.
- Excellent communication skills, capable of engaging both technical and non-technical stakeholders.
- Bachelor’s degree in Computer Science or a related field, or equivalent experience.
- Self-motivated and collaborative, with a commitment to shared ownership of services.

Company Location: United States.