Site Reliability Engineer (Remote - US) at Jobgether

Source: https://jobs.workable.com/view/iYnJt9jiKhDsr32MdDssPc/site-reliability-engineer-(remote---us)-in-united-states-at-jobgether


This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Site Reliability Engineer in the United States.

This role offers the opportunity to ensure the reliability, stability, and performance of critical streaming and broadcast systems in a high-traffic, 24x7 environment. You will be responsible for monitoring live channel distribution, troubleshooting complex system issues, and implementing automation to prevent downtime. Collaborating with cross-functional teams and third-party vendors, you will identify root causes, deploy solutions, and document processes to optimize operational efficiency. The position requires technical expertise, a proactive mindset, and a dedication to delivering seamless media experiences at scale. Your work will have a direct impact on delivering high-quality streaming content to millions of viewers.

Accountabilities

· Investigate and resolve issues within broadcast and streaming systems, identifying root causes and implementing solutions.
· Serve as a Level 2 escalation resource for live channel distribution and report findings to leadership and operations teams.
· Collaborate with internal teams and external vendors to troubleshoot unresolved issues and ensure timely resolutions.
· Develop and maintain comprehensive documentation detailing system issues, root causes, and effective solutions.
· Assist with deployment and testing of patches and fixes in development and production environments.
· Support on-air system integrations, rollouts, and special broadcast events, providing 24x7 operational coverage as needed.
· Participate in daily operations review meetings, providing updates on system status, new issues, and planned fixes.
· Contribute to the design, analysis, and evaluation of projects using sound engineering principles and best practices.

Requirements

· Bachelor’s degree in Engineering, Computer Science, or a related field.
· 5+ years of DevOps/Site Reliability Engineering experience in high-traffic, cloud-hosted environments (AWS preferred).
· 3–5 years of experience in support, analysis, or operations roles within technology systems.
· Strong experience with deployment automation tools such as CloudFormation, Terraform, and Ansible.
· Familiarity with containerization and orchestration technologies (Kubernetes, Docker).
· Proficiency in CI/CD orchestration and deployment tools (e.g., GitHub Actions, Jenkins).
· 3–5 years of Linux system administration experience.
· Programming experience in Go, Python, Ruby, Java, or shell scripting.
· Skilled at designing automation tools and processes for large-scale systems.
· Experience with log and metric aggregation software (e.g., CloudWatch, Elasticsearch + Kibana, Splunk, Grafana).
· Strong problem-solving, analytical, and troubleshooting skills.
· Knowledge of networking concepts and the OSI model; able to troubleshoot network issues.
· Preferred: 3+ years in Media & Entertainment or 24x7 production environments, supporting IT/broadcast systems, and familiarity with live TV broadcasting, OTT streaming, codecs, and ARQ technologies.

Company Location: United States.