Peach Pilot — Principal QA Engineer (AI Systems & Platform) Remote — Latin America at Hive Financial Systems

Source: https://job-boards.greenhouse.io/embed/job_app?for=hivefinancialsystems&token=5102879007

Location: Argentina
Remote (Latin America) | Full-Time Contract | US Eastern Timezone Overlap Required (5+ hours daily)

The Mission: Trust Has to Be Earned, Every Release
95% of enterprise AI pilots fail, not because the technology is broken, but because users don't trust it. At Peach Pilot we are building an enterprise AI operating system where trust is the product. That means every feature we ship must work exactly as the user expects, every time. One broken interaction at the wrong moment can undo months of adoption. You are the last line of defense before our platform reaches a CFO's desk.

Peach Pilot is a funded US-based AI startup building an enterprise AI operating system for business leaders. We are closing the AI trust gap: making powerful AI feel effortless and reliable for the people who run companies, not just the engineers who build software.

We are an early-stage founding team moving fast and hiring remotely across Latin America.

The Role
This is a hands-on, high-ownership role. You will build and own the QA function at Peach Pilot: writing test code, designing eval pipelines, and setting the quality bar as we move from early-stage development into full production and enterprise deployment. We are not looking for someone who manages spreadsheets and delegates everything. We are looking for someone who can do the work, knows what good looks like, and raises the bar across the entire engineering team.

This is a fully remote contract role based in Latin America. As the company scales, there is a path to a larger leadership role. For now the focus is getting the product right.

You will work directly with the US-based founding engineering team and must be available during US Eastern business hours with a minimum of 5 hours of daily overlap.

The Challenge: QA for AI Is a Different Problem
Traditional QA assumes deterministic outputs. LLMs don't give you that. You will be building a quality function from scratch in an environment where:
- Multi-model routing (Claude, GPT-4o, Grok, Gemini) means the same input can produce different outputs depending on which model handled it.
- Agent orchestration and governance agents must maintain a structurally separate audit trail; any drift between execution and governance is a critical failure.
- The file ingestion pipeline (Word, Excel, PowerPoint, PDF) must survive edge cases that enterprise clients will find within the first week of deployment.
- Your users are CEOs and operations leaders who have never used a terminal. A confusing error state isn't a minor bug; it kills adoption.

What You Will Own & Build

First 90 Days: Build the QA Foundation
- Establish the testing framework from zero: unit, integration, end-to-end, and LLM-specific evaluation pipelines.
- Define quality standards, test coverage requirements, and documentation practices in partnership with the VP of Engineering.
- Audit the existing platform and identify the highest-risk surfaces before the next major customer deployment.
- Own the QA function end to end and be the voice of quality across the engineering team.

AI & Agent Testing
- Design evaluation frameworks for non-deterministic LLM outputs, including prompt regression testing, model drift detection, and output quality scoring across Claude, GPT-4o, Grok, and Gemini.
- Build automated test suites for the agent orchestration layer, including governance agent audit trail integrity and human-override behavior.
- Validate the Enterprise Knowledge Graph (Neo4j + vector search) for data accuracy, retrieval quality, and failure modes under real enterprise data conditions.

Platform & Integration Testing
- Own end-to-end testing of the file ingestion pipeline across document types (Word, Excel, PowerPoint, PDF), including encryption, formatting edge cases, and audit trail continuity.
- Validate streaming response handling, latency thresholds, and graceful degradation when a model is unavailable or slow.
- Test multi-model routing logic to confirm cost-optimized task allocation behaves correctly across LLM providers.

UX Quality
- Partner with the Full-Stack Engineer to define and test trust-layer UX standards: onboarding flows, progressive disclosure, uncertainty states, and real-time document viewers.
- Act as the internal advocate for the non-technical enterprise user: if a CEO would be confused by it, it doesn't ship.

Who You Are
- 7+ years of QA engineering experience, with at least 3 years in a lead or senior role where you both wrote test code and owned quality outcomes.
- Hands-on experience testing LLM-powered applications. You understand prompt sensitivity, output variance, and how to build eval pipelines that catch regressions across model updates.
- You write test code. Python is your primary tool.
- Experience building and maintaining CI/CD-integrated test suites.
- Comfortable testing complex API chains, async/streaming responses, and multi-service workflows.
- Built or significantly improved a QA function in an early-stage or fast-moving environment.
- Strong English communication skills, written and verbal.
- Available during US Eastern business hours with a minimum of 5 hours of daily overlap.

Even Better If
- Experience with LLM evaluation frameworks such as LangSmith, PromptFlow, or custom eval pipelines.
- Experience testing agent frameworks such as LangChain or CrewAI.
- Background in enterprise software or regulated industries where audit trail integrity is non-negotiable.
- Insurance industry background is a strong plus.

The Stack You'll Test Against
- AI/LLM: Anthropic Claude, OpenAI GPT-4o, xAI Grok, Gemini
- Frontend: React/Next.js, TypeScript, Tailwind CSS
- Backend: Python, Node.js/TypeScript (FastAPI/Express)
- Data & Graph: Neo4j, Snowflake, Azure Cosmos DB, Azure AI Search
- Infrastructure: Azure (Functions, Key Vault), CI/CD pipelines
- Visualization: Plotly, D3, Recharts, Mermaid

Compensation
Competitive contractor rate commensurate with experience. Paid monthly via Deel in USD.

The Clincher
Tell us about a quality failure: one you caught before it shipped, or one that got through. What did you build or change after it, and how did you make sure your team could catch the next one without you?