Senior Site Reliability Engineer (Remote - India) at Jobgether

This position is posted by Jobgether on behalf of Dremio. We are currently looking for a Senior Site Reliability Engineer in India.

As a Senior Site Reliability Engineer, you’ll join a high-performing team focused on maintaining and improving mission-critical systems in a cloud-native environment. You’ll be instrumental in designing reliable infrastructure, automating deployment processes, and ensuring services scale seamlessly across multiple cloud providers. This role offers deep technical engagement with Kubernetes, service meshes, and observability tools while promoting a strong culture of resilience and continuous improvement. It’s an opportunity to shape the backbone of a large-scale distributed system used by global enterprises, within a hybrid, collaborative, and forward-thinking setting.

Accountabilities:

- Lead continuous improvements in Kubernetes usage, GitOps deployment strategies, and service mesh configuration across cloud platforms (AWS, GCP, Azure).
- Extend cross-cloud networking and connectivity solutions, including VPNs, BGP, and partner interconnects.
- Collaborate closely with Engineering teams to ensure systems are production-ready via design consultation, capacity planning, and service reviews.
- Define and implement Service Level Indicators (SLIs) and Service Level Objectives (SLOs), establishing reliable on-call strategies for internal teams.
- Drive observability efforts by enhancing logging, metrics, tracing, and system profiling.
- Optimize and debug code, automate recurring tasks, and identify sustainable ways to improve reliability and deployment velocity.
- Advocate for reliability engineering practices throughout the organization and foster a culture of continuous delivery.
- Participate in an on-call rotation and lead incident response with a focus on blameless post-mortem analysis.
- Promote scalable practices and support the transformation toward true continuous delivery within the engineering ecosystem.

Requirements:

- 10+ years of experience in Site Reliability Engineering, DevOps, or Cloud Infrastructure, with deep exposure to distributed systems.
- Advanced proficiency in Kubernetes, Istio, Terraform, Terragrunt, and ArgoCD/Flux.
- Strong understanding of cloud-native networking, VPNs, and multi-cloud connectivity solutions.
- Demonstrated hands-on experience with cloud platforms including GCP, AWS, and Azure.
- Skilled in Python or Go, with the ability to debug and review Java when necessary.
- Proven ability to design, analyze, and troubleshoot large-scale distributed architectures.
- Strong communication, ownership, and problem-solving abilities, with a mindset focused on resilience and automation.

Bonus points for experience with:

- Managing Kubernetes clusters at large scale (1,000+ nodes).
- Developing and managing production-grade SLIs/SLOs.

Company Location: India.