Senior Database Reliability Engineer (DBRE) & Architect at Alex Staff Agency

This position is open at a global product-led IT company specializing in infrastructure stability and security solutions. Their products are recognized as the industry standard in the Hosting and Enterprise segments, powering over 500,000 servers worldwide.

In 2025, the company is evolving its data management strategy, shifting from traditional database administration to an Internal Database-as-a-Service (DBaaS) model. This role requires a visionary engineer to design resilient distributed systems, automate infrastructure through code, and transform databases into a reliable service for product teams. This is an ideal opportunity for those ready to handle petabytes of data and build high-scale platform solutions.

Key Challenges & Responsibilities:
- Designing and implementing a self-service platform (Terraform + Ansible) for deploying HA clusters (PostgreSQL, ClickHouse, MongoDB, Redis) in a heterogeneous environment (Bare Metal, OpenNebula, K8s, Public Clouds).
- Managing rapidly growing analytics clusters (12+ clusters, tens of terabytes), focusing on sharding, ReplicatedMergeTree, and building reliable S3 backup pipelines under high load.
- Maintaining and scaling infrastructure for Apache Airflow and Redash, ensuring the reliability of ETL pipelines and visualization tools.
- Implementing SRE practices in data management: replacing manual incident response with automated self-healing mechanisms and defining SLOs/SLIs.
- Migrating legacy solutions to modern cloud patterns and implementing Kubernetes operators for stateful workloads.
- Serving as a technical authority for product teams to optimize data schemas and SQL queries for high-load systems.

Tech Stack:
- DB: PostgreSQL 15+ (Patroni, PgBouncer), ClickHouse (Sharded/Replicated), MongoDB, Redis, Kafka.
- Data & Analytics: Apache Airflow, Redash.
- Infrastructure: Hybrid Cloud (3+ private DCs, OpenNebula, K8s, Bare Metal, AWS, GCP, Azure, DO).
- IaC & CI/CD: Terraform, Ansible, Python/Go, GitLab, Jenkins, Gerrit.
- Observability: VictoriaMetrics, Grafana, Loki.

Must have:
- 5+ years of PostgreSQL expertise: deep knowledge of MVCC, locking mechanics, expert-level Patroni/PgBouncer configuration, and experience with seamless major version upgrades under load.
- ClickHouse mastery: experience operating large clusters, understanding ZooKeeper/ClickHouse Keeper, sharding, replication internals, and performance diagnostics at the data-part level.
- Engineering mindset (SRE/DevOps): experience writing complex Terraform modules and Ansible roles; proficiency in Python or Go for automation is a major asset.
- Hybrid environment experience: understanding the nuances of running DBs on Bare Metal vs. Kubernetes vs. Public Cloud, with the ability to optimize TCO and disk subsystem performance (NVMe, Network Storage).
- Systems approach: understanding the full stack from network packets to business logic, including security standards (FIPS, Audit logs) and Disaster Recovery.

Nice to Have:
- Experience building an Internal Developer Platform (IDP).
- Experience operating databases in Kubernetes via operators (CloudNativePG, Altinity Operator).
- Background working with Cloud or Hosting providers on similar services.

Company Location: Portugal.