LLM Ops Engineer - Serverless & CI/CD (AWS) at Expedite Commerce

This isn't your average DevOps role. This isn't just about pipelines or cloud provisioning. This is about engineering the backbone of Agentic AI systems that drive the next generation of enterprise SaaS, where conversational interfaces, dynamic UIs, and intelligent agents operate seamlessly on AWS Serverless infrastructure, with deep integration into Salesforce and cross-agent protocols.

This is for builders with something to prove. For engineers who’ve gone beyond cloud fluency to orchestrate complex, multi-agent ecosystems, and who want to shape how enterprise applications are deployed, debugged, scaled, and observed in real time.

If you’re driven by deep automation, passionate about creating fault-tolerant agentic systems, and thrive where innovation is the expectation, not the exception, you’re in the right place. Join us to redefine SaaS infrastructure and champion a new era of AI-powered, product-led enterprise experiences.

The Role

We are seeking a hands-on Agentic AI Ops Engineer who thrives at the intersection of cloud infrastructure, AI agent systems, and DevOps automation. In this role, you will build and maintain the CI/CD infrastructure for Agentic AI solutions using Terraform on AWS, while also developing, deploying, and debugging intelligent agents and their associated tools. This position is critical to ensuring scalable, traceable, and cost-effective delivery of agentic systems in production environments.

The Responsibilities

CI/CD Infrastructure for Agentic AI
- Design, implement, and maintain CI/CD pipelines for Agentic AI applications using Terraform, AWS CodePipeline, CodeBuild, and related tools.
- Automate deployment of multi-agent systems and associated tooling, ensuring version control, rollback strategies, and consistent environment parity across dev/test/prod.

Agent Development & Debugging
- Collaborate with ML/NLP engineers to develop and deploy modular, tool-integrated AI agents in production.
- Lead the effort to create debuggable agent architectures, with structured logging, standardized agent behaviors, and feedback integration loops.
- Build agent lifecycle management tools that support quick iteration, rollback, and debugging of faulty behaviors.

Monitoring, Tracing & Reliability
- Implement end-to-end observability for agents and tools, including runtime performance metrics, tool invocation traces, and latency/accuracy tracking.
- Design dashboards and alerting mechanisms to capture agent failures, degraded performance, and tool bottlenecks in real time.
- Build lightweight tracing systems that help visualize agent workflows and simplify root cause analysis.

Cost Optimization & Usage Analysis
- Monitor and manage cost metrics associated with agentic operations, including API call usage, toolchain overhead, and model inference costs.
- Set up proactive alerts for usage anomalies, implement cost dashboards, and propose strategies for reducing operational expenses without compromising performance.

Collaboration & Continuous Improvement
- Work closely with product, backend, and AI teams to evolve the agentic infrastructure design and tool orchestration workflows.
- Drive the adoption of best practices for Agentic AI DevOps, including retraining automation, secure deployments, and compliance in cloud-hosted environments.
- Participate in design reviews, postmortems, and architectural roadmap planning to continuously improve reliability and scalability.

The Requirements

- 2+ years of experience in DevOps, MLOps, or Cloud Infrastructure with exposure to AI/ML systems.
- Deep expertise in AWS serverless architecture, including hands-on experience with:
  - AWS Lambda: function design, performance tuning, cold-start optimization.
  - Amazon API Gateway: managing REST/HTTP APIs and integrating with Lambda securely.
  - Step Functions: orchestrating agentic workflows and managing execution states.
  - S3, DynamoDB, EventBridge, SQS: event-driven and storage patterns for scalable AI systems.
- Strong proficiency in Terraform to build and manage serverless AWS environments using reusable, modular templates.
- Experience deploying and managing CI/CD pipelines for serverless and agent-based applications using AWS CodePipeline, CodeBuild, CodeDeploy, or GitHub Actions.
- Hands-on experience with agent and tool development in Python, including debugging and performance tuning in production.
- Solid understanding of IAM roles and policies, VPC configuration, and least-privilege access control for securing AI systems.
- Deep understanding of monitoring, alerting, and distributed tracing systems (e.g., CloudWatch, Grafana, OpenTelemetry).
- Ability to manage environment parity across dev, staging, and production using automated infrastructure pipelines.
- Excellent debugging, documentation, and cross-team communication skills.

Company Location: India.