Senior Site Reliability Engineer at Fixify

Source: https://job-boards.greenhouse.io/embed/job_app?for=fixify&token=5120169008

Location: Ireland, Remote

You remember the first time a system you built stayed up through the night. No alerts. No pages. Just quiet confirmation that the automation, monitoring, and guardrails you designed worked exactly as intended. Maybe it was a deployment pipeline that finally ran without intervention, or monitoring that caught an issue before users even noticed. That moment when infrastructure became invisible because it just worked: that's the feeling we're chasing.

At Fixify, we believe reliability isn't just about uptime percentages - it's about enabling people to do their best work without technology getting in the way. We're on a mission to reimagine how IT teams support companies, and we need a Senior Site Reliability Engineer who finds joy in building systems that fade into the background, empowering product engineers to ship with confidence and our customers to work without interruption.

We think the best infrastructure is the kind you don't have to think about. As a Senior SRE, you'll be the architect of that invisible magic, and we’ll be its celebrants. You'll design the operational frameworks, observability systems, and automation that let our engineering teams move fast while keeping our SaaS platform rock-solid reliable. You'll be working at the intersection of infrastructure, product engineering, and data science, ensuring that every new feature meets our reliability standards before it sees production.

Your primary customers are our product engineers, who count on you to provide the infrastructure, tooling, and practices that make deployments feel safe and recovery feel seamless. You'll instrument systems with monitoring that surfaces problems before they become incidents, define SLOs that align with business commitments, and build automation that turns manual toil into elegant, repeatable processes.

This position is expected to be remote, though we do have an office in Cork you’re more than welcome to use, and the engineering team meets up every other month to collaborate in person.

If you care about building reliable systems that support ambitious product teams and you take pride in infrastructure that performs under pressure, then we should talk.

What we can do for you

- Give you ownership over infrastructure that powers a globally-used platform, with clear visibility into how your work drives collaboration and productivity.
- Provide meaningful opportunities to learn and grow, whether that's diving deeper into distributed systems, exploring new observability paradigms, or mastering the latest cloud-native technologies.
- Surround you with a team that values blameless postmortems, continuous improvement, and the kind of operational culture where everyone learns from every incident.
- Share the "why" behind architectural decisions and give you a voice in shaping Fixify's reliability engineering principles as we scale.
- Connect you directly with product engineers and users, so you see firsthand how reliable infrastructure translates into delighted customers.
- Let you work across a hybrid container and serverless infrastructure environment, using what works best and leaning into each service’s strengths.

What you can do for us

- Design and maintain scalable, fault-tolerant infrastructure that supports our SaaS platform and keeps pace with business growth.
- Implement observability best practices, embracing tracing-first approaches, meaningful metrics, and monitoring that actually helps during incidents.
- Define, document, and maintain SLIs, SLOs, and SLAs in partnership with product engineering, translating business commitments into technical guardrails.
- Build automation that eliminates manual intervention across CI/CD, deployments, configuration management, and recovery, because your time is better spent on strategic problems.
- Lead incident response with steady judgment, facilitate blameless postmortems, and drive remediation efforts that prevent recurrence.
- Partner with engineering and product teams during design reviews to ensure new features are production-ready and operationally scalable.
- Optimize infrastructure costs through performance tuning, capacity planning, and smart use of cloud resources.
- Mentor engineers on operational best practices and champion reliability thinking across the organization.
- Document infrastructure architecture clearly and maintain the kind of runbooks that your future self will thank you for.

What you should bring with you

- 4+ years of experience in SRE, DevOps, or infrastructure engineering roles, with demonstrated experience supporting SaaS platforms in production.
- Expert-level knowledge of an infrastructure-as-code framework (Pulumi, Terraform, CDK). You should be the kind of person who thinks "if it's not in code, it doesn't exist."
- Strong working knowledge of AWS (or equivalent cloud platforms), including designing for availability, scalability, and security.
- Proficiency in TypeScript or Python for infrastructure automation and tooling.
- Experience with containerization and orchestration (ECS Fargate, Kubernetes, or similar).
- Deep familiarity with observability tools and practices (OpenTelemetry, CloudWatch, Honeycomb). Bonus points if you embrace a tracing-first philosophy.
- Solid understanding of networking, load balancing, and distributed systems concepts.
- Experience with CI/CD tooling (GitHub Actions, CodeBuild, or equivalent).
- The ability to communicate complex operational issues clearly to both technical and non-technical stakeholders.
- Calm effectiveness during high-pressure incidents and the judgment to balance competing priorities like performance, cost, and reliability.
- A collaborative spirit and the ability to build strong relationships with engineering, product, and operations teams.
- Prior experience working closely with product engineering teams is a strong plus, since this role thrives on cross-disciplinary understanding.
- A commitment to continuous learning and improving team practices, systems, and culture.

Our stack

TypeScript runs the majority of our services, with some machine learning orchestration in Python. All our infrastructure is defined in TypeScript using Pulumi, and we run on AWS. We're proficient users of serverless and make heavy use of SQS, Aurora, DynamoDB, Valkey, Lambda, ECS Fargate, and Step Functions. Our monitoring stack is CloudWatch for logs and Honeycomb for metrics and tracing; we embrace a tracing-first philosophy. Our CI/CD runs on GitHub Actions, with some internal actions through CodeBuild.