AI Cluster Architect at Vultr

Location: Remote - United States

Who We Are

Vultr is on a mission to make high-performance cloud infrastructure easy to use, affordable, and locally accessible for enterprises and AI innovators around the world. With 32 global cloud data center locations, Vultr is trusted by hundreds of thousands of active customers across 185 countries for its flexible, scalable, global Cloud Compute, Cloud GPU, Bare Metal, and Cloud Storage solutions. In December 2024 Vultr announced an equity financing at a $3.5 billion valuation. Founded by David Aninowsky and self-funded for over a decade, Vultr has grown to become the world’s largest privately held cloud infrastructure company.

Vultr Cares

Excellent medical benefits with 100% company-paid premiums for the employee-only plan, plus 100% company-paid dental and vision premiums.
401(k) plan that matches 100% up to 4% with immediate vesting.
Professional development reimbursement of $2,500 each year.
11 holidays + paid time off accrual + rollover plan + take your birthday off.
Commitment matters to Vultr! Increased PTO at 3-year and 10-year anniversaries + 1 month paid sabbatical every 5 years + anniversary bonus each year.
$500 first-year remote office setup + $400 each following year for new equipment.
Internet reimbursement up to $75 per month.
Gym membership reimbursement up to $50 per month.
Company-paid Wellable subscription.

Join Vultr

Vultr is looking for an AI Cluster Architect who will be responsible for creating and refining large-scale GPU cluster architectures within strict power and infrastructure limits. This role focuses heavily on power-aware design: starting from a fixed power envelope, the architect determines the optimal number of GPUs while accounting for the full stack of services that must be deployed, including compute nodes, storage systems, networking fabric, cooling, and facility constraints.
This role requires deep experience navigating heterogeneous environments, multiple generations of hardware, and end-user requirements.

The architect must understand how different GPU SKUs, NICs, switches, and fabrics interact at scale, including their individual and aggregate power and thermal characteristics. They will evaluate multi-plane, rail-optimized, and tiered fabric designs across technologies like InfiniBand, RoCE, and Spectrum-X to ensure the networking architecture supports the intended GPU count without overrunning facility limits or switch radix and/or topology constraints. This role balances customer-specific requirements for compute, storage, and service density, ensuring that the final cluster design maintains acceptable levels of GPU and fabric performance while maximizing the number of usable GPUs within the total power budget.

Key Responsibilities

Architect large-scale GPU clusters within fixed site power budgets, optimizing for maximum GPU density while reserving necessary headroom for compute services, storage, and networking.
Model and validate power consumption across the full cluster bill of materials (GPUs, CPUs, NICs, switches, fabric components, storage, and facility limits).
Evaluate tradeoffs across multiple fabric networking architectures (InfiniBand, RoCE, Spectrum-X) as well as multi-plane, 2-tier/3-tier, and rail-optimized topologies.
Determine network scale limits based on switch radix, link speed, topology, and blocking requirements.
Gather, interpret, and maintain detailed SKU-level power and thermal specifications for GPUs, NICs, switches, DPUs, storage, and server platforms.
Develop power-aware cluster configuration templates and capacity-planning models that can scale across sites with varying constraints and allow for quick iteration and ideation.
Document architecture, design choices, tradeoff analyses, and operational considerations for deployment and lifecycle management.
Provide guidance on future-proofing, including the ability to incorporate next-gen GPUs, NICs, or fabrics.
Collaborate with vendors on novel fabric architectures that enable large-scale cluster deployments (100k+ GPUs).

Qualifications

7+ years designing or building large-scale HPC, AI, or hyperscale GPU clusters.
Expert understanding of GPU and accelerator system design, including node topology, PCIe/NVLink/NVSwitch/ROCm, and NIC-to-GPU affinity considerations.
Strong familiarity with InfiniBand, RoCE, and Spectrum-X networking, including multi-tier, multi-plane, Clos/dragonfly variants, and large-radix switch design.
Demonstrated experience modeling power draw and thermal characteristics of servers, GPUs, NICs, switches, optics, and storage systems.
Ability to design networks that maintain full non-blocking performance or intentionally introduce over/under-subscription while understanding impacts on workload performance.
Proven ability to gather and analyze vendor SKU-level specifications and incorporate them into scalable cluster architectures.
Experience balancing customer-driven requirements for compute, storage, and service density in combination with overall GPU count.
Strong documentation, communication, and cross-functional collaboration skills.

Compensation

$165,000 - $185,000. This salary can vary based on location, years of experience, background, and skill set.

Inclusion & Privacy

We are an equal opportunity employer and are committed to creating an inclusive environment for all employees. We welcome applications from individuals of all backgrounds and experiences, and we prohibit discrimination based on race, color, religion, sex, sexual orientation, gender identity, national origin, age, disability, veteran status, or any other protected status under applicable laws.
Vultr will consider qualified applicants with arrest or conviction records in accordance with applicable laws and will not conduct a background check until after an offer of employment has been extended and accepted.

We also take your privacy seriously. We handle personal information responsibly and follow applicable laws, including U.S. privacy rules and India’s Digital Personal Data Protection Act, 2023. Your data is used only for legitimate business purposes and is protected with proper security measures.

Where allowed by law, applicants may request details about the data we collect, access or delete their information, withdraw consent for its use, and opt out of nonessential communications. For more details, please see our Privacy Policy.