About the Role
Remote, full-time (40 hours/week) with strong overlap during PK / Dubai working hours
Compensation: Competitive, typically in the range of 2,200–3,000 USD per month (flexible for exceptional candidates)
We're a fast-growing company shifting heavily toward AI-driven automation across the business: customer support, marketing, product, and operations. This role is for an engineer who designs and builds real production AI systems that move work through our company faster, cheaper, and with fewer humans in the loop.
This is a hands-on engineering role spanning backend systems, LLM integrations, agentic workflows, retrieval pipelines, and the data plumbing that makes AI actually work in production. You'll work in a real-world environment where models hallucinate, APIs fail, prompts drift, costs spike, and engineers are expected to take ownership of the systems they ship.
This is not a prompt engineer role. This is not a research-only role. This is not "play with ChatGPT and report back" work. We move fast, ship often, debug real production issues, and expect engineers to own AI systems end to end. We use AI daily ourselves, but we care deeply about engineers who read, understand, and validate what their systems are doing, not those who treat LLMs as magic boxes.
What You'll Actually Do (Day to Day)
Design, build, and maintain AI-powered features and internal systems across one or more business areas (support automation, marketing workflows, internal research tools, ops automations, voice/email agents)
Build production integrations with LLMs across both hosted APIs (OpenAI, Anthropic, Gemini) and open-source models (Llama, Qwen, Mistral, DeepSeek, etc.) running on inference providers (Together, Groq, Replicate, Hugging Face, Fireworks) or self-hosted (vLLM, Ollama). Real systems with proper error handling, retries, timeouts, structured outputs, cost controls, and fallbacks
Pick the right model for the job. Frontier closed models when capability matters, smaller or open-source models when cost, latency, privacy, or customization matters. Fine-tune smaller models (LoRA / QLoRA) when prompting alone isn't enough and the use case is narrow and stable
Design and ship agentic workflows: multi-step LLM pipelines, tool-using agents, decision logic, task orchestration, and human-in-the-loop checkpoints
Build and maintain RAG systems end to end: ingestion, chunking, embedding generation, vector search, re-ranking, and retrieval quality evaluation
Work with vector databases (Pinecone, Qdrant, pgvector, Chroma, Weaviate, etc.) at a practical level
Build backend services and APIs that expose AI capabilities to internal tools, integrations, and external systems
Build automation pipelines that connect AI workflows to the rest of the stack (CRMs, support platforms, marketing tools, internal databases, webhooks)
Own reliability of AI systems in production: monitoring outputs, catching regressions, building eval harnesses, alerting, and debugging when behavior changes
Evaluate AI outputs systematically. Build the test sets, scoring rubrics, and feedback loops that tell you whether a system is actually working
Prepare and normalize real-world data for AI use: cleaning call transcripts, structuring support conversations, deduplicating documents, removing PII, extracting structured fields from messy inputs, and shaping data into RAG indexes, fine-tuning datasets, or evaluation sets. This is often the highest-leverage work in an AI project, and we treat it as core engineering, not preprocessing grunt work
Handle structured and unstructured data more broadly: parsing documents, transcripts, emails, scraped content, API responses, and turning messy inputs into useful structured outputs
Debug real production issues where AI behavior, data integrity, latency, or cost is impacted
Collaborate asynchronously with a remote engineering and operations team
Non-Negotiable Requirements
You must have hands-on experience with all of the following:
Strong backend engineering fundamentals. You can design, build, and ship a real backend service from scratch (Python or Node.js strongly preferred), including database design, API design, and proper error handling
Production experience with LLMs in real systems. You've shipped systems using hosted APIs (OpenAI, Anthropic, Gemini, etc.), open-source models via inference providers (Together, Groq, Replicate, Fireworks, etc.), or self-hosted models (vLLM, Ollama, Hugging Face). Real workflows that real people or real customers depend on, not just demos or side projects
Real prompt design experience. Iterating on prompts under production conditions, structuring outputs (JSON, function calls, schemas), handling edge cases, and con