Senior Site Reliability Engineer (REMOTE) at Discogs

The Discogs Platform team is focused on several objectives: building and supporting performant, cost-effective, reliable infrastructure; developer experience tooling and mentorship; and creating "golden paths" for organization-wide standards and velocity. As a Platform member, the Senior Site Reliability Engineer will contribute to the Platform team’s centralized infrastructure, including maintenance, monitoring, and automation of services ranging from databases to Kubernetes; lead incident response and postmortem efforts; and work closely with other engineering teams to understand their needs and drive improvements to both our technologies and processes.

Location
This is a remote position. Open to candidates located in OR, WA, CA, CO, TX, IL.

Compensation
Starting Base Salary Range: $130,000 - $140,000 yearly.

Who We Are
We are dedicated to supporting a global community of music fans and collectors who share the value, culture, connection, and joy of record collecting. Fostering the exchange of knowledge, records, and curation, we help people help each other deepen their relationship with music. Leveraging the power of community, we are committed to enabling people to explore artists and their recorded works through the world's definitive music discography, stay informed with record collection and sales history data, get organized with specialized collection management tools, and stay connected to a global community of fellow record collectors and sellers. Providing this essential set of resources, tools, and access, we aim to unleash boundless opportunities for people to dig into the depths of their musical interests, build and fortify their record collections, cultivate and bridge communities, and elevate their connection to music and record collecting.

What You’ll Accomplish
Reasonable accommodations may be made to enable individuals with disabilities to perform the essential functions.
- Maintaining the organization’s cloud presence in AWS
- Automating and deploying infrastructure configurations using Infrastructure as Code (IaC)
- Mentoring engineering squads on Platform best practices for Kubernetes, MySQL, Kafka, and other software development lifecycle areas
- Assisting engineering squads with capacity planning, infrastructure budgeting, and production readiness
- Writing documentation and runbooks that contribute to the engineering organization’s knowledge base
- Implementing monitoring and alerting systems with Discogs observability tools
- Working in a containerized, orchestrated environment
- Participating in on-call rotation, responding to incidents, and troubleshooting data and other operations issues
- Contributing to efforts on the reliability and design patterns of our Kafka, Kafka Connect, and database implementations

What You’ll Contribute

Minimum Education and Experience
- A Bachelor's Degree in Computer Science or similar area of focus, or equivalent relevant work experience.
- 5+ years of experience in Ops, DevOps, Site Reliability, Platform, or other systems roles.

Required Skills & Abilities:
- Infrastructure-as-code (Terraform)
- CI/CD (GitHub Actions)
- GitOps (ArgoCD)
- Kubernetes (EKS, Kustomize, Karpenter, administration, application manifests)
- AWS and cloud development (VPC, EKS, RDS, S3)
- FinOps and cloud cost optimization
- Observability (Datadog, Sentry)
- Scripting (Shell, Python)
- Track record of collaboration and mentorship
- Excellent written communication and documentation skills
- Continuous learning
- Ownership and proactive approach to solving large problems

Preferred:
- Kafka: Cluster administration (Strimzi), Kafka Connect (Debezium, JDBC)
- Relational database administration and performance (MySQL, Percona Server, AWS RDS)
- Elasticsearch (ECK administration, scaling, performance)
- Python (SQLAlchemy, FastAPI)
- GraphQL (schema design, Apollo federation)
- REST API
- HashiCorp Vault
- Redis
- Memcached

The Platform team covers a wide range of technical topics and we'd love to hear about your skills beyond this list!

Company Location: United States.