About the Role
Overview
We are building the next generation of agentic AI to transform how the agency accelerates research, makes decisions, and ships products at scale.
We are a small, startup-minded team that ships fast and owns what we build end-to-end. We are looking for a senior SDE III who can own the backend infrastructure GRACE runs on, while also being a first-principles builder of the agentic AI systems that run on top of it. On a lean team, infra and AI are not separate concerns. You will own both, and you will treat production reliability, token economics, security, and observability as non-negotiable from day one.
The best person for this role starts with the user. They ask why before they ask how. They communicate clearly, give and receive feedback well, and make the people around them better. They have a high bar, a high sense of urgency, and they play well with others.
What You Will Own
Backend Infrastructure
• Own the end-to-end backend infrastructure on Microsoft Azure: Azure Functions, Azure API Management, Azure Container Apps, and Azure OpenAI Service
• Own the data layer: storage, retrieval pipelines, vector databases, and document indexing that power GRACE's internal knowledge search
• Own authentication and identity integration, including Entra ID and application-level access control
• Implement and maintain infrastructure as code for all environments; no manual snowflakes
• Own CI/CD pipelines, deployment automation, and release processes including canary and gradual rollouts
• Own the basics that are non-negotiable on any production system: monitoring, alerting, logging, distributed tracing, SLOs, and incident response runbooks
• Manage secrets, API keys, and credential rotation across all integrations with external providers
• Own cost and token economics across all LLM providers; track spend, set budgets, build guardrails, and optimize for cost-per-query without sacrificing quality
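The cost and token-economics responsibility above could start from something as simple as a per-day budget guardrail. This is a purely illustrative sketch; the class, method names, and prices are hypothetical placeholders, not part of any provider SDK:

```python
# Illustrative sketch of a daily LLM spend guardrail.
# TokenBudget, record_usage, and the prices are hypothetical examples.

class BudgetExceeded(Exception):
    """Raised when a call would push spend past the daily cap."""

class TokenBudget:
    """Tracks token spend per provider against a daily USD cap."""

    def __init__(self, daily_cap_usd: float, price_per_1k_tokens: dict[str, float]):
        self.daily_cap_usd = daily_cap_usd
        self.price_per_1k = price_per_1k_tokens
        self.spend_usd = 0.0

    def record_usage(self, provider: str, tokens: int) -> float:
        """Record usage for a call; refuse it if the cap would be exceeded."""
        cost = tokens / 1000 * self.price_per_1k[provider]
        if self.spend_usd + cost > self.daily_cap_usd:
            raise BudgetExceeded(f"{provider}: daily cap of ${self.daily_cap_usd} reached")
        self.spend_usd += cost
        return cost
```

In practice the tracker would persist spend and reset on a schedule; the point is that budget enforcement sits in the request path, not in a monthly billing review.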
Agentic AI & Protocol Infrastructure
• Own the backend implementation of the Model Context Protocol (MCP), including MCP server hosting, tool registration, versioning, and lifecycle management on Azure
• Implement and evolve agent-to-agent (A2A) communication patterns, enabling GRACE agents to interoperate with each other and with external agent systems
• Design and maintain LLM orchestration, routing, and multi-model switching infrastructure across OpenAI GPT, Anthropic Claude, and Google Gemini families
• Build and operate RAG pipelines: document ingestion, chunking, embedding, and semantic search
• Implement robust fallback, retry, circuit-breaker, and graceful degradation patterns for all AI service dependencies
• Own tool-calling infrastructure: registration, execution, error handling, and observability for all GRACE tools
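The fallback and retry patterns named above can be sketched roughly as follows. `call_model`, the provider list, and the retry policy are illustrative assumptions, not a prescribed implementation:

```python
# Illustrative retry-with-fallback across model providers.
# call_model and the provider names are hypothetical placeholders.

import time

class AllProvidersFailed(Exception):
    """Raised when every provider and retry has been exhausted."""

def call_with_fallback(prompt, providers, call_model, retries=2, backoff_s=0.0):
    """Try each provider in order; retry failures with exponential
    backoff, then fall back to the next provider in the list."""
    errors = []
    for provider in providers:
        for attempt in range(retries + 1):
            try:
                return call_model(provider, prompt)
            except Exception as exc:  # in production, catch transient errors only
                errors.append((provider, attempt, exc))
                time.sleep(backoff_s * (2 ** attempt))
    raise AllProvidersFailed(errors)
```

A production version would add a circuit breaker that skips a provider entirely after repeated failures, rather than paying the retry cost on every request.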
Observability & Production Quality
• Build and maintain end-to-end observability for agentic workflows: latency, throughput, error rates, token usage, and LLM quality metrics
• Implement LLM evaluation pipelines including safety checks, regression monitoring, and grounding assessment
• Define and enforce system-level SLOs for availability, response time, and tool call reliability
• Own alerting and on-call runbooks; be the person who knows what broke and why
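The observability work above might start from something as small as a latency-and-success decorator around tool calls. The in-memory `METRICS` sink and the names here are placeholders for a real metrics backend:

```python
# Minimal sketch of per-call latency and success metrics for tool calls.
# METRICS is an in-memory stand-in for a real metrics/tracing backend.

import time
from functools import wraps

METRICS: list[dict] = []

def observe(tool_name):
    """Decorator that records latency and success/failure for a tool call."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            ok = True
            try:
                return fn(*args, **kwargs)
            except Exception:
                ok = False
                raise
            finally:
                METRICS.append({
                    "tool": tool_name,
                    "latency_s": time.perf_counter() - start,
                    "ok": ok,
                })
        return wrapper
    return decorator
```

The same wrapper is the natural place to attach trace IDs and token counts so that per-tool SLOs can be computed from one event stream.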
Engineering Excellence & Team
• Establish and improve coding standards, design review processes, and testing practices
• Communicate technical decisions clearly, in writing and in conversation, to both engineers and non-engineers
• Mentor and unblock other engineers with a bias toward ownership and speed
• Work backward from the user: understand the problem being solved before proposing a solution
• Ensure strong privacy, security, and compliance in all systems, integrations, and data handling
Basic Qualifications
• Bachelor's or Master's in Computer Science, Software Engineering, a related field, or equivalent experience
• 7+ years of professional software engineering experience building and operating production systems
• Proven experience in high-velocity environments where you owned and shipped complex products end-to-end
• Strong proficiency in Python and at least one other backend language; familiarity with modern backend frameworks and async patterns
• Solid understanding of distributed systems, APIs, data pipelines, and software design patterns
• Hands-on experience with Microsoft Azure: Azure Functions, API Management, Container Apps, and Azure OpenAI Service
• Experience with containerization, CI/CD, and infrastructure as code
• Strong understanding of authentication and identity systems (OAuth2, OIDC, Azure Entra ID or equivalent)
• Demonstrated ability to own production systems: you have been on-call, debugged incidents, and written the postmortem
• Clear, direct communicator who gives and receives feedback well and makes the people around them better
Preferred Qualifications
• Hands-on experience building and operating MCP servers in production, including tool registration, versioning, and hosting on Azure Functions or equivalent serverless infrastructure
• Experience implementing A2A communication patterns