Senior/Staff DevOps Engineer at Ethos
Location: Remote

About Ethos
Ethos is on a mission to bridge the human readiness gap by transforming how training is developed, consumed, and aligned with strategic business outcomes. As a well-funded Series A startup ($40M+ raised), we’re a trusted partner to 150+ enterprise customers across the U.S. military, life sciences, manufacturing, supply chain, and professional sports.

We’re expanding our engineering team to deliver a best-in-class learning platform: smarter, faster, and more optimized. We’ve gone all-in on AI tooling in our development process, and we’re adopting and expanding upon the best new practices for creating software in this era.

About the Role
You’ll lead the deployment and operationalization of our SaaS products across Commercial Cloud, government networks, and bespoke/air-gapped customer environments. As a Senior engineer, you’ll own end-to-end infrastructure delivery, elevate DevOps practices, and collaborate closely with Software and Product. As a Staff engineer, you’ll additionally shape platform engineering strategy, set technical direction for distributed systems at scale, and influence design patterns that enable AI workloads and complex data pipelines. You’ll treat AI tooling as core to your daily workflow (for IaC, pipelines, incident response, and toil reduction) and help shape the agentic operations patterns and AI workloads our platform runs.

If you love solving hard deployment problems, care deeply about security and reliability, can scale modern cloud platforms with rigor, and embrace AI-augmented operations as the way forward, this role is for you.

What You’ll Do
Design & Operate the Platform: Architect, implement, and run secure, scalable, multi-tenant infrastructure (infrastructure as code, immutable artifacts, GitOps).
AI-Augmented Operations & Platform Work: Use AI coding and agentic tools (Claude Code, Cursor, Copilot, MCP-based ops agents) for IaC authoring, pipeline development, log/trace analysis, postmortem drafting, and toil reduction; build and improve agentic workflows for the team.
CI/CD & Release Engineering: Build and harden pipelines (build, test, scan, sign, promote, deploy) for multi-environment delivery, including disconnected/air-gapped workflows.
Observability & Reliability: Establish SLOs; instrument systems for metrics/logs/traces; drive incident response and postmortems; reduce MTTR and change failure rate.
Security & Compliance by Design: Integrate supply-chain security (SBOMs, signing, provenance), secrets management, and baseline hardening (CIS/STIG-aligned).
Cost & Performance: Optimize infrastructure spend and performance (capacity planning, autoscaling, right-sizing, storage/egress strategies).
Technical Leadership: Lead design reviews, author RFCs, mentor engineers, and raise the quality bar for platform changes.
Gov/Constrained Deployments: Support IL-4/IL-5-aligned patterns, RMF documentation support, and offline artifact promotion processes where needed.
(Staff) Strategy & Standards: Define platform roadmaps, establish consistent deployment and infrastructure patterns, and guide cross-team adoption of best practices.

Measures of Success (First 6–12 Months)
Availability & Reliability: Meet or exceed service SLOs; reduce MTTR by ≥30%.
Delivery Velocity: Increase deployment frequency by ≥2× while keeping change failure rate ≤15%.
Pipeline Efficiency: Cut CI pipeline duration by ≥25% and reduce flaky tests significantly.
Security Posture: Achieve ≥95% pass rate for supply-chain/security gates (image signing, SBOM scans, vulnerability thresholds); reduce MTTR for high-severity CVEs to ≤14 days.
Cost & Drift: Deliver ≥15% infra cost savings without performance regressions; keep infra drift near zero via GitOps and policy as code.
Gov/Offline Readiness: Stand up an artifact promotion flow (build → scan → sign → export) suitable for disconnected deployments with documented runbooks.

30/60/90 Day Plan

First 30 Days: Map & Baseline
Deep-dive on current cloud topology, CI/CD, observability, security controls, and on-call.
Inventory build and runtime artifacts; document deployment environments and promotion paths.
Baseline reliability and delivery metrics (SLOs, MTTR, deploy frequency, CFR, pipeline timing).
Establish and prove the effectiveness of your personal workflow with AI tooling.

60 Days: Design & Deliver
Harden CI/CD: add SBOM generation, signing (e.g., Cosign/Sigstore), and policy gates.
Implement or refine infrastructure modules (Terraform) and Helm/Kustomize charts with GitOps flows.
Establish service SLOs and golden signals; wire alerts and dashboards for top services.
Pilot artifact export/import flow for air-gapped/disconnected deployments; write runbooks.

90 Days: Scale & Standardize
Standardize CI/CD pipelines and infrastructure modules across existing services.
Migrate priority services to hardened delivery paths; deprecate legacy workflows.
Land cost/performance wins (e.g., autoscaling policies, instance/storage class right-sizing).

Basic Qualifications
5+ years building and operating cloud platforms; 3+ years deploying SaaS in production.
Strong with Terraform, Helm/Kustomize, and containers (Docker, Kubernetes).
Deep AWS experience (e.g., VPC, EKS, EC2, S3, RDS, ECR, IAM/KMS, Route 53; CloudFront desirable).
CI/CD expertise (e.g., GitHub Actions, CircleCI, or Argo Workflows) and GitOps (Argo CD or Flux).
Observability across metrics, logs, and traces (e.g., Prometheus/Grafana, OpenTelemetry, ELK).
Proven track record in IaC, scalable system design, and quality tooling (automated tests, canaries/blue-green, feature flags).
Excellent communication; comfortable partnering with Product, Security, and Customer teams.
Thrives in a startup environment: ownership, autonomy, and pragmatic delivery.
Active, fluent use of AI development/operations tools as part of your daily workflow.
Secret Clearance or eligibility and willingness to obtain one.

Preferred Qualifications
Supply-chain security (SBOMs, SLSA concepts, image signing, provenance) and vulnerability management (e.g., Trivy/Grype, Snyk; Chainguard experience a plus).
Experience identifying/mitigating CVEs and setting policy thresholds.
Background with DoD/regulated customers; familiarity with IL-4/IL-5, Platform One patterns, and RMF documentation workflows.
Knowledge of STIG/CIS hardening, air-gapped architectures, and offline update mechanisms.
Experience operating AI/ML workloads in production (GPU scheduling, model artifact management, inference serving, vector DBs, queuing/streaming) or building agentic ops workflows / MCP-based integrations (alert triage, runbook automation, IaC review agents).

Tooling you might touch
We use technologies similar to and including some of these to build our products:
AI development tools (Claude Code, Cursor, GitHub Copilot, MCP servers); Terraform modules; Helm/Kustomize; Kubernetes (EKS); GitHub Actions/Workflows; Argo CD/Flux; Docker/OCI; Prometheus/Grafana, Datadog, OpenTelemetry; Loki/ELK; LaunchDarkly/Flagsmith; Cosign/Sigstore, Trivy/Grype/Snyk; AWS (VPC, EKS, EC2, S3, RDS, ECR, IAM/KMS, Route 53, CloudFront); HashiCorp Vault/Parameter Store/Secrets Manager.

Compensation & Benefits
Competitive base salary (Senior: $150k-$190k; Staff: $170k-$210k), based on location and experience, with significant equity upside.
Subsidized health insurance, 401(k), life insurance, and cell phone stipend.
Remote-first culture with up to 10% travel for offsites.
Work eligibility: Applicants must be authorized to work in the U.S.

One Final Note
We’re committed to building a diverse, inclusive, and authentic workplace. If you’re excited about this role but your experience doesn’t perfectly align with every qualification, please apply; you may be just the right candidate.

EEO & accommodations: Ethos is an Equal Opportunity Employer. We welcome applicants of all backgrounds and provide reasonable accommodations throughout the hiring process.