Site Reliability Engineer (Remote - US) at Jobgether

This position is posted by Jobgether on behalf of McAfee. We are currently looking for a Site Reliability Engineer in the United States.

In this role, you will be instrumental in maintaining high service levels, including availability, latency, and reliability, to meet customer needs while reducing friction in managing changes. You will collaborate closely with DevOps, Engineering, and support teams to ensure services are scalable, secure, and performant. This hands-on position involves monitoring critical production environments, troubleshooting incidents, automating processes, and continuously improving service reliability. Working in a hybrid environment, you will support mission-critical applications with a focus on observability, incident response, and seamless integration with IT service operations.

Accountabilities:

· Proactively monitor production environments to detect and respond to issues quickly.
· Troubleshoot, debug, and escalate problems to ensure maximum uptime and customer satisfaction.
· Manage the incident lifecycle, including detection, triage, resolution, and retrospectives.
· Collaborate with engineering and support teams to maintain service reliability and meet SLAs.
· Automate processes to reduce Mean Time to Detect (MTTD) and Mean Time to Restore (MTTR).
· Maintain security event responsiveness and compliance with operational procedures.
· Participate early in the software development lifecycle to embed reliability best practices.
· Document processes and update operational knowledge bases regularly.
· Communicate effectively with stakeholders and leadership regarding high-priority incidents and service status.

Requirements:

· 1 to 3+ years of experience in software development, SRE, DevOps, or systems engineering roles.
· Proven track record managing large-scale, highly available production systems (>99.95% SLA), preferably in cloud environments.
· Strong troubleshooting, debugging, and root cause analysis skills.
· Experience with monitoring, logging, and application performance management tools such as Grafana, CloudWatch, or similar.
· Familiarity with CI/CD tools like Git, Jenkins, or Harness.
· Hands-on experience with container technologies, including Kubernetes and Docker.
· Comfortable working with both Windows and Linux operating systems.
· Solid understanding of AWS cloud services, including serverless and containerized workloads.
· Excellent communication skills and ability to collaborate across teams and time zones.
· Preferred certifications: ITIL, HDI, AWS, or other cloud-related credentials.
· Willingness to work some non-standard hours to support global teams.

Company Location: United States.