
Lead Site Reliability Engineer at Kontakt.io
Location Information: USA

Kontakt.io is building the platform that care operations run on. We reduce waste, cut costs, and improve revenue by improving throughput, asset utilization and staff productivity. Our platform uses AI, RTLS, and EHR data to enable self-learning agents to automate workflows, adapt in real time, and orchestrate all care delivery operations. Easy to deploy and scale, it gives a clear picture of spaces, equipment, and people, eliminating inefficiencies and enhancing the patient experience. With measurable 10X ROI and more than 20 use cases, Kontakt.io is the go-to platform for better and faster care delivery operations.

We’re looking for a Lead Site Reliability Engineer to own the reliability, performance, and automation of our cloud-based, real-time platform. This role will focus on keeping our platform running smoothly 24/7, minimizing downtime and improving observability, incident response, and self-healing automation. You will lead and scale the SRE team to ensure our infrastructure stays ahead of demand, operates efficiently, and meets the needs of our growing healthcare customers.

Responsibilities
Ensure 99.99% uptime across our cloud platform, meeting strict SLAs for healthcare customers.
Leverage your software engineering expertise to write high-quality, maintainable code that improves system reliability and operational efficiency.
Design and implement self-healing, fault-tolerant systems to prevent failures before they happen.
Define SLIs, SLOs, and SLAs, ensuring proactive performance monitoring and incident resolution.
Architect and manage scalable cloud infrastructure (AWS) for massive real-time data processing.
Optimize containerized environments (Kubernetes, Docker) to support multi-region deployments.
Lead the adoption of infrastructure as code (Terraform) to fully automate infrastructure management.
Build and refine a world-class monitoring, alerting, and logging system using Prometheus, Grafana, OpenTelemetry, and Datadog.
Lead incident response and on-call operations, reducing mean time to detection (MTTD) and mean time to resolution (MTTR).
Conduct blameless postmortems and continuously improve system resilience.
Reduce manual intervention through automated deployment, scaling, and failover mechanisms.
Partner with Security & Compliance teams to ensure infrastructure meets HIPAA and SOC 2 standards.
Lead disaster recovery and business continuity planning to ensure critical healthcare services are always available.
Drive technical strategy and roadmap for scalability, monitoring, and reliability engineering.
Collaborate with Product, Engineering, and Infrastructure teams to align SRE initiatives with business priorities.

What You Bring
10+ years of experience in Site Reliability Engineering or Cloud Infrastructure.
2+ years of software engineering experience.
Proven success scaling high-traffic, mission-critical platforms in SaaS, IoT, or healthcare.
Deep expertise in cloud platforms (AWS), Kubernetes, and distributed systems.
Strong background in monitoring, logging, and observability with Prometheus, OpenTelemetry, or similar tools.
Hands-on experience with incident management, postmortems, and building resilient systems.
Deep knowledge of CI/CD automation, GitOps, and infrastructure as code (Terraform, etc.).
A mature leadership approach, with the ability to drive technical strategy while growing and mentoring a high-performance SRE team.
Strong understanding of network security, access management, and compliance frameworks (HIPAA, SOC 2).

Bonus Points If You Have:
Experience with healthcare IT, including EHR data, FHIR, and HL7 interoperability.
Expertise in real-time distributed systems, event-driven architectures, or large-scale data pipelines.
Prior experience leading on-call rotations and major incident management processes.

Why You'll Love It Here
Own Mission-Critical Reliability: Ensure hospitals and care facilities always stay online with a 99.99% uptime healthcare platform.
Scale AI-Powered Infrastructure: Work on real-time automation and self-healing cloud systems that orchestrate care delivery.
Drive Big Impact in Healthcare: Help reduce waste, optimize resources, and improve patient care with technology that delivers 10X ROI.
Automation-First Culture: Minimize manual ops with cutting-edge automation, observability, and incident response strategies.
Join a High-Performing Team: Work with top engineers, AI experts, and healthcare innovators solving real-world challenges.

$190,000 - $230,000 a year

Ready to Build the Future of Healthcare?
Apply now and help scale the platform that care operations run on. 🚀