Staff Engineer - DevOps at Weekday AI

Source: https://jobs.workable.com/view/bFbBPT9FjYQHuMPyvh9cco/remote-staff-engineer---devops-in-india-at-weekday-ai

This role is for one of Weekday's clients.

Min Experience: 9 years
Location: Remote (India)
JobType: full-time

As a Staff Engineer, you will architect and evolve our DevOps ecosystem, champion cloud cost governance, and implement best-in-class container orchestration practices. You will work cross-functionally with engineering, security, and finance teams to ensure operational excellence while proactively managing infrastructure spend.

Key Responsibilities

DevOps Leadership & Architecture
- Lead end-to-end DevOps strategy, including CI/CD pipelines, automation, infrastructure-as-code, and release engineering.
- Design scalable, resilient cloud-native architectures aligned with business growth.
- Establish DevOps best practices, reliability standards, and operational governance.

Kubernetes & Containerization
- Architect and manage large-scale Kubernetes environments for production workloads.
- Optimize workloads across clusters for performance, reliability, and cost efficiency.
- Build and maintain containerized applications using Docker and Kubernetes, ensuring portability and scalability.
- Drive multi-cluster, multi-region deployments where necessary.

Cost Savings & Cost Planning
- Own infrastructure cost visibility and optimization initiatives.
- Implement cloud cost-saving strategies including rightsizing, reserved capacity planning, auto-scaling optimization, and workload scheduling.
- Partner with finance teams for budgeting, forecasting, and cost planning.
- Create dashboards and reporting mechanisms to track infrastructure ROI and spend trends.
- Continuously identify inefficiencies and implement measurable cost-reduction initiatives without compromising performance.

Monitoring & Observability
- Design and implement comprehensive monitoring systems using Grafana and related observability tools.
- Build real-time dashboards for system health, performance metrics, and cost insights.
- Establish alerting frameworks to minimize downtime and improve incident response.
- Drive improvements in system reliability through data-driven monitoring and post-incident analysis.

Automation & Reliability
- Automate provisioning, deployments, scaling, and recovery processes.
- Improve system resilience, availability, and disaster recovery strategies.
- Lead root cause analysis for major incidents and implement preventive measures.

Required Qualifications
- 9–15 years of experience in DevOps, Site Reliability Engineering, or Cloud Infrastructure roles.
- Deep expertise in Kubernetes, container orchestration, and production-grade Docker and Kubernetes implementations.
- Strong hands-on experience with Grafana, monitoring systems, and observability frameworks.
- Proven track record in cost savings initiatives and infrastructure cost planning in cloud environments.
- Experience designing highly available, scalable systems in AWS, Azure, or GCP.
- Strong understanding of Infrastructure-as-Code (Terraform, CloudFormation, etc.).
- Expertise in CI/CD automation and release management.
- Solid knowledge of networking, security best practices, and cloud architecture patterns.

Preferred Attributes
- Experience managing large-scale production environments with strict SLAs.
- Strong analytical skills with the ability to translate technical metrics into financial impact.
- Leadership mindset with experience mentoring engineers and influencing cross-functional teams.
- Excellent communication and stakeholder management skills.

Company Location: India