Tech Lead - Site Reliability Engineer at Ditto

Location Information: APAC

About the role

Ditto is at an inflection point. As we scale to meet the growing demands of our enterprise customers, we need experienced SRE Leads to drive and mature our Site Reliability Engineering practice.

This is a unique opportunity to play a leading role in shaping enterprise-grade reliability, observability, and incident management to ensure Ditto's systems meet the high standards our customers expect.

As a Lead SRE for one of our three globally distributed SRE squads, you'll set the standard for best-in-class reliability engineering while leading and mentoring your squad members. You'll partner closely with product engineering teams to improve system resilience and operational excellence.

As a Lead Site Reliability Engineer, you will:

- Line manage your regional squad of SREs, providing leadership and setting the standard for enterprise-ready reliability.
- Develop a high-performing team through mentoring, coaching, and creating growth opportunities for engineers.
- Engage with incident management and escalations, ensuring your squad sees continual improvement in incident response and actively owns follow-ups.
- Architect enterprise-grade observability solutions across complex distributed systems.
- Actively lead and manage SRE initiatives, coordinating across teams where needed.
- Guide the implementation of SLIs, SLOs, and SLAs that align with business objectives.
- Establish best practices for documentation, runbooks, and knowledge sharing across engineering.
- Play an active role in on-call, and manage your squad's rotation.

What you'll need:

- 7+ years of experience in Site Reliability Engineering or similar DevOps roles with a focus on system reliability and incident management.
- 3+ years of experience leading and mentoring technical teams.
- Strong experience with modern monitoring stacks including Prometheus, Grafana, and Datadog.
- Proficiency in at least one systems programming language, such as Go, Rust, C, or Java.
- Experience with Infrastructure as Code tools, like Terraform and Helm.
- Hands-on experience architecting applications for Kubernetes and managing Kubernetes infrastructure.
- Experience with AWS and at least one other major cloud service provider (GCP, Azure).
- Excellent communication skills; you'll set the standard for clear and succinct communication in incidents, hand-offs, and project updates.
- Experience maintaining on-call rotations and incident response procedures.
- A high degree of agency, taking ownership of problems and identifying initiatives and improvements.
- Proven project management skills and the ability to balance competing priorities and interrupts.
- Understanding of security best practices in cloud environments.

Nice to have:

- Experience directly line managing SREs.
- Experience building or operating multi-tenant, multi-cloud SaaS/DBaaS platforms.
- Familiarity with edge computing or mesh networking.
- Experience instrumenting advanced observability practices (tracing, profiling) in distributed systems.
- Experience working with globally distributed teams across EMEA and APAC regions.