
[Remote] Principal Machine Learning Engineer, ML Platform

frontendnode-production.up.railway.app · Anywhere

Full-time · Staff+ · Kubernetes


About the Role

Note: This is a remote job open to candidates in the USA.

Shippo is on a mission to make every merchant successful through excellent shipping and logistics technology. They are seeking a Principal Machine Learning Engineer for their ML Platform to build a standardized, production-grade ML platform that enhances model reliability and speeds up product development.

Responsibilities

- Set technical strategy and drive a multi-quarter roadmap for ML platform capabilities aligned to Shippo's business priorities
- Own cross-team architecture decisions, RFCs, and design reviews for ML lifecycle and inference
- Raise the engineering bar through mentorship, production readiness standards, and reusable platform primitives
- Be accountable for platform adoption, reliability, and cost-performance outcomes
- Build and operate core ML platform components:
  + ML lifecycle foundation (experiment tracking, reproducibility, artifact management, model registry, versioning, and controlled promotion workflows using MLflow or equivalent)
  + Training and experimentation enablement (standardized environments, reusable pipelines/templates, evaluation harnesses, and repeatable workflows that let data scientists move from exploration to production with confidence)
  + Kubernetes-native model serving for real-time inference (safe rollout and rollback, autoscaling, reliability practices, and cost controls)
  + Batch inference and scoring pipelines (repeatable backfills, retraining triggers, consistent packaging between training and inference)
  + Observability for ML systems (service health metrics, alerting, and model-quality signals such as drift and data quality)
  + Developer experience (templates, reference implementations, documentation, and self-service workflows)
- Evaluate and recommend inference frameworks and deployment patterns, and document tradeoffs for Shippo's workloads
- Identify and resolve performance bottlenecks across the inference stack (model runtime, compute utilization, networking, serialization, and autoscaling behavior)
- Establish ML engineering standards across training, evaluation, testing, model packaging, CI/CD, production readiness, and incident response
- Partner with Data Science teams to bridge research and production environments by creating repeatable frameworks, shared standards for code quality and reproducibility, and self-serve paths to deploy models safely
- Collaborate with Data and Engineering teams to ensure the platform supports real workflows, drives adoption, and meets reliability expectations
- Mentor engineers through design reviews, architecture guidance, and shared best practices across platform and ML development

Skills

- 15+ years of software engineering experience, including ownership of production systems (platform, infrastructure, or distributed systems)
- 4+ years owning ML systems end-to-end in production, including on-call and incident response, and making architecture decisions based on operational constraints (latency, throughput, availability, and cost)
- Strong experience building and running services on Kubernetes, including deployments, autoscaling, and observability
- Hands-on experience with ML lifecycle tooling such as MLflow or equivalent (tracking, registry, packaging, and promotion workflows)
- Demonstrated ability to evaluate inference tradeoffs across batch and real-time serving, CPU versus GPU, latency and throughput, cost, and operational complexity
- Demonstrated Principal-level technical leadership, including setting technical direction, driving cross-team alignment via RFCs/design reviews, and delivering multi-quarter roadmaps
- Proven ownership of reliability and operational outcomes for production systems (SLOs, incident response, and measurable improvements in stability and performance)
- Demonstrated ability to ship incrementally, prioritize production reliability over perfect solutions, and drive adoption through pragmatic platform design
- Experience working with or evaluating managed ML platforms (Databricks, SageMaker, Vertex AI, or similar), with clear judgement on strengths, limitations, and build-vs-buy decisions
- Databricks experience (useful, not required), including Databricks workflows and ML tooling integration
- Experience with inference and serving frameworks
- Experience with feature store patterns, online and offline consistency, and model evaluation at scale
- Experience supporting optimization systems and decision engines in production
- LLM or agent workflow experience, especially evaluation harnesses, deployment patterns, guardrails, and monitoring

Benefits

Healthcare coverage for medical, dental, and vision (90% covered by the company, incl. dependents). Pets coverage is also avail


