Infrastructure/GPU Engineer at Cognizant

Source: https://remotive.com/remote-jobs/devops/infrastructure-gpu-engineer-2070022

Location: USA

Cognizant is seeking a highly skilled, hands-on Infrastructure Engineer with proven experience in the physical and technical deployment of AI-ready environments optimized for AI and machine learning workloads. This role focuses on NVIDIA DGX or similar systems, GPU-accelerated compute clusters, high-speed networking, and scalable storage solutions. The ideal candidate will have deep expertise in infrastructure design, deployment, workload orchestration, and performance optimization in enterprise environments.

This is a remote role in the US. The salary range for this role is $99,000 to $116,000, depending on the candidate's skills and qualifications. Applications will be accepted until 10/21/2025.

Key Responsibilities

System Design & Deployment
Help right-size GPU investments.
Architect and deploy NVIDIA DGX systems and GPU-based compute clusters.
Design and implement scalable parallel filesystems (e.g., Lustre, BeeGFS, GPFS).
Integrate high-speed interconnects using InfiniBand, RoCE, and RDMA.
Collaborate on rack planning and airflow optimization.

Cluster & Infrastructure Management
Configure and manage Slurm Workload Manager for job scheduling.
Deploy and maintain cluster orchestration tools.
Automate provisioning using PXE boot, Terraform, Redfish, and Kubernetes.
Perform firmware updates, BIOS/IPMI/BMC configuration, and OS provisioning.
Knowledge of Run.ai, ClearML, or similar platforms.

Networking & Performance Optimization
Design and validate network topologies, including IPMI, internal/external networks, and InfiniBand fabrics.
Optimize RDMA and RoCE configurations for low-latency, high-throughput data transfers.
Conduct performance benchmarking using GPU-Burn, NCCL, and NVSM.

Monitoring & Troubleshooting
Implement system health checks and diagnostics across compute, storage, and network layers.
Troubleshoot hardware/software issues and ensure reliable infrastructure operation.

Required Skills & Qualifications

Technical Expertise
Deep understanding of NVIDIA DGX architecture, CUDA, and GPU compute.
Strong Linux system administration and shell scripting skills.
Experience with Slurm, parallel filesystems, and high-speed networking (InfiniBand/RDMA/RoCE).
Familiarity with containerization (Docker), orchestration (Kubernetes), and automation tools (Ansible, Redfish).

Preferred Qualifications
Experience with BBCM and DGX BasePOD/SuperPOD configuration.
Certifications from NVIDIA or an equivalent OEM.