Staff Site Reliability Engineer - Platform (Remote - US) at Jobgether

Source: https://jobs.workable.com/view/puCM2PHrBHquu6G1CPUAqf/staff-site-reliability-engineer---platform-(remote---us)-in-united-states-at-jobgether

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Staff Site Reliability Engineer - Platform in the United States.

We are seeking a highly skilled and experienced Staff Site Reliability Engineer to join a dynamic engineering team focused on building and maintaining highly available, scalable, and performant cloud-based infrastructure. In this role, you will design, implement, and operate critical systems that support cutting-edge computing platforms, ensuring reliability, efficiency, and security. You will work closely with product and engineering teams to automate operations, improve system observability, and resolve complex incidents. The ideal candidate thrives in a fast-paced, collaborative environment and takes ownership of both infrastructure and process improvements. This position offers an opportunity to make a direct impact on the performance and reliability of high-value, innovative systems while mentoring junior engineers.

Accountabilities:

- Build, manage, and support cloud and on-prem infrastructure to ensure high availability, scalability, and reliability.
- Maintain and enhance monitoring, alerting, and instrumentation systems deployed on Kubernetes clusters and Linux environments.
- Automate operational tasks to reduce toil and increase efficiency across engineering and product teams.
- Troubleshoot complex infrastructure and application issues, performing root cause analysis and implementing long-term solutions.
- Lead incident management processes, including investigation, resolution, and post-mortem documentation.
- Collaborate with software engineers, product managers, and other SREs to define best practices, standards, and service-level objectives.
- Mentor junior engineers, promoting knowledge sharing and fostering a culture of reliability and operational excellence.

Requirements:

- BS degree in Computer Science, Computer Engineering, or equivalent practical experience.
- 8+ years of professional experience, including 5+ years in site reliability engineering.
- 3+ years of hands-on experience with Kubernetes and containerized environments.
- Strong expertise in Unix/Linux OS internals, networking (TCP/IP, routing, SDN), and virtualized environments.
- Proficiency in scripting languages (Shell, Python, or similar) and automation of operational tasks.
- Experience with incident management, performance tuning, and system observability.
- Excellent written and verbal communication skills, with the ability to drive best practices across teams.
- Experience mentoring and guiding junior engineers.

Preferred Qualifications:

- 10+ years of software development experience.
- Experience with VMware, Terraform, Google Cloud, and scaling databases or applications.
- Knowledge of deploying bare-metal Kubernetes and advanced incident research techniques.

Company Location: United States.