Lead Site Reliability Engineer - Data Platforms at Jobgether

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Lead Site Reliability Engineer - Data Platforms in the United States.

We are seeking a Lead Site Reliability Engineer to manage and optimize data platform operations, ensuring high availability, scalability, and performance. In this role, you will oversee cloud infrastructure, end-to-end data pipelines, and containerized applications while collaborating closely with data science, ML/GenAI, and development teams. You will implement Infrastructure as Code, monitor systems for observability, and troubleshoot complex issues to maintain operational excellence. The ideal candidate thrives in a fast-paced environment, embraces automation, and drives innovation across cloud and data platforms. This role combines hands-on technical expertise with strategic system design, delivering measurable impact on business-critical data workflows.

Accountabilities

- Manage cloud-based infrastructure, including AWS services (S3, EMR, Redshift) and containerized environments (ECS, Docker), to support data pipelines and ML/GenAI workloads.
- Design, deploy, and maintain automated infrastructure using tools like Terraform, Chef, Ansible, and CI/CD pipelines.
- Monitor and enhance observability across data systems, applications, and platforms.
- Collaborate with engineering and ML teams to optimize the performance, reliability, and scalability of data and AI systems.
- Participate in code/design reviews, troubleshoot complex system issues, and document root cause analyses (RCAs).
- Support release planning, on-call rotation, and problem resolution to ensure uninterrupted data operations.

Requirements

- 8+ years of experience with Big Data technologies, data pipelines, and Linux administration.
- Strong scripting proficiency in Bash or Python.
- 5+ years managing cloud platforms (AWS, Azure) with hands-on experience in ECS, EKS, AKS, Terraform, and Helm.
- Experience with Infrastructure as Code, CI/CD tools (Chef, Ansible, Jenkins), and version control systems (Git).
- Familiarity with Generative AI platforms (SageMaker, Bedrock, Azure ML) and vector databases.
- Solid knowledge of networking (DNS, load balancers), MySQL, Apache Spark, and BI/data lake platforms.
- Excellent communication skills; self-driven and able to resolve complex issues independently and deliver projects on time.
- Strong interest in AI technologies and continuous improvement of operational practices.

Company Location: United States.