Site Reliability Engineer - Platform Engineering at Weekday AI

This role is for one of Weekday’s clients.

Min Experience: 4 years
JobType: full-time

We are looking for an experienced and motivated Site Reliability Engineer (SRE) – Platform Engineering to join our growing technology team. In this role, you will be responsible for designing, building, and maintaining scalable, resilient, and secure infrastructure platforms that support business-critical applications and services. The SRE will work at the intersection of software development and systems engineering to ensure the availability, performance, and reliability of our platforms.

This role requires deep expertise in automation, cloud-native technologies, monitoring, and platform operations. The ideal candidate is passionate about solving complex infrastructure challenges, streamlining deployment pipelines, and building highly reliable systems.

Key Responsibilities

Platform Engineering: Design, implement, and optimize platform services and infrastructure to ensure high availability, scalability, and performance.
Reliability & Resilience: Build self-healing and fault-tolerant systems while proactively identifying and eliminating reliability risks.
Automation: Develop Infrastructure as Code (IaC) solutions using tools like Terraform, Ansible, or CloudFormation to automate infrastructure provisioning and configuration.
Monitoring & Observability: Implement monitoring, logging, and alerting systems using tools such as Prometheus, Grafana, ELK, or Datadog to track platform health and performance.
Incident Management: Troubleshoot incidents, perform root cause analysis, and ensure timely resolution while minimizing downtime and customer impact.
DevOps & CI/CD: Collaborate with development teams to enhance CI/CD pipelines for seamless deployment and integration, ensuring reliability in production environments.
Cloud Infrastructure: Manage cloud environments (AWS, Azure, or GCP) and optimize for cost, security, and performance.
Security & Compliance: Implement security best practices, monitor vulnerabilities, and ensure compliance with industry standards across infrastructure and platforms.
Collaboration: Partner with software engineers, product teams, and IT operations to align infrastructure capabilities with business requirements.
Continuous Improvement: Analyze existing infrastructure and processes, identify areas for improvement, and implement best practices for operational efficiency.
Capacity Planning: Forecast infrastructure requirements, ensuring the platform is always prepared to handle current and future workloads.

Qualifications & Skills

Bachelor’s degree in Computer Science, Information Technology, or related field. Equivalent practical experience may be considered.
4+ years of experience in Site Reliability Engineering, DevOps, or Platform Engineering.
Strong proficiency with cloud platforms (AWS, Azure, or GCP).
Hands-on experience with Infrastructure as Code (Terraform, Ansible, or CloudFormation).
Solid understanding of Linux systems administration, networking, and container orchestration (Docker, Kubernetes).
Experience with CI/CD pipelines (Jenkins, GitLab CI, or similar tools).
Proficiency in scripting/programming languages such as Python, Go, Bash, or Java.
Strong knowledge of monitoring and observability tools (Prometheus, Grafana, ELK, Datadog, Splunk).
Familiarity with incident response and on-call support practices.
Knowledge of security best practices and compliance frameworks.
Excellent problem-solving, debugging, and analytical skills.
Strong communication and collaboration abilities to work effectively across cross-functional teams.

Company Location: India