Cloud Operations Engineer at Cloudbeds

Location Information: Europe, Latin America
Job Title: Cloud Operations Engineer

How You'll Make an Impact:

As a Cloud Operations Engineer, you’ll be the frontline support for our global infrastructure, playing a key role in ensuring 24/7 operational stability across our AWS-based environment. Your core responsibilities will include monitoring critical systems through platforms such as Datadog, PagerDuty, and CloudWatch, rapidly validating alerts, and escalating verified incidents based on clearly defined protocols.

You’ll execute operational tasks, follow documented procedures for common issues, and manage standard maintenance activities. You'll also have opportunities to collaborate directly with senior engineers across SRE, DevOps, and Infrastructure teams, contributing to the resolution of a wide range of technical challenges and gaining exposure to complex, real-world systems.

Acting as the central communication point during incidents, you’ll maintain clear, timely updates to stakeholders and facilitate smooth transitions between engineering and support teams.

Our Network Operations Team:

You’ll be joining a brand-new team at the ground level, helping shape the future of SaaS operations for a company undergoing exciting growth. Working closely with SRE, DevOps, Security, and various Workload teams, you’ll be at the heart of collaborative problem-solving and operational innovation. It’s a rare chance to build, influence, and grow in a highly visible and impactful role.

This role offers a rare opportunity to gain deep, hands-on experience in cloud operations and incident management while working alongside high-performing engineering teams. You'll build the foundation for growth into specialized areas like SRE, DevOps, or Infrastructure Engineering, with direct exposure to real-world systems at scale.

What You Bring to the Team:
- Support Kubernetes (EKS) environments by performing operational checks, validating pod health, reviewing logs, and assisting with incident triage during deployments and scaling events.
- Assist with CI/CD pipeline operations by supporting deployments, rollbacks, and release verification in collaboration with DevOps and platform engineering teams using ArgoCD and GitHub.
- Execute Infrastructure as Code changes and standard operating procedures using Terraform across cloud infrastructure and application services.
- Monitor, triage, and validate incidents using observability and alerting tools such as PagerDuty, Datadog, Amazon CloudWatch, Prometheus, and Grafana, escalating to SRE, DevOps, or application teams as appropriate.
- Execute documented runbooks and SOPs to resolve common operational issues, including basic AWS troubleshooting, infrastructure access requests (SSO, VPN, IAM), and deployment support.
- Perform routine operational tasks such as configuration changes, maintenance activities, and standard change requests across cloud infrastructure and application services.
- Contribute to operational excellence by maintaining and improving runbooks, updating documentation, and participating in post-incident reviews (RCA) to drive reliability improvements.

What Sets You Up for Success:

- 3-4 years of hands-on experience in DevOps, Site Reliability Engineering (SRE), or related operational roles with a focus on cloud infrastructure.
- Practical experience with Amazon EKS (Elastic Kubernetes Service) or other managed Kubernetes platforms, including container orchestration and operational management.
- Hands-on experience with CI/CD and GitOps deployment tools, particularly ArgoCD, Flux, or similar automation platforms.
- Experience using Infrastructure as Code tools, specifically Terraform, for managing and automating cloud infrastructure.
- Foundational understanding of the AWS service ecosystem, including core infrastructure services (EC2, S3, RDS, IAM, VPC).
- Strong written and verbal communication skills in English, with the ability to provide clear, timely updates during high-pressure incidents.
- Detail-oriented, with strong documentation skills and the ability to collaborate effectively across multiple teams in a fully remote, global environment.

Bonus Skills to Stand Out (Optional):

- Experience with monitoring and observability tools such as Datadog, Prometheus, Grafana, or Amazon CloudWatch.
- Prior experience working in a 24/7 operations environment with hands-on use of PagerDuty or similar on-call and alerting systems.
- Ability to write (not just read) Bash or Python scripts for automation tasks.