
CAST AI

Senior ML Engineer - Kimchi (LLM Inference Optimization)

Reposted 22 Days Ago
Remote
Hiring Remotely in United Kingdom
Senior level
Why Cast AI?

Cast AI is an automation platform that operates cloud-native and AI infrastructure at scale. By embedding autonomous decision-making directly into Kubernetes and cloud environments, Cast AI continuously optimizes performance, reliability, and efficiency in production.
The old way doesn't work. As Kubernetes and AI environments grow, manual decision-making can't keep up. Cast AI replaces tickets, alerts, and manual tuning with continuous automation that adapts infrastructure as conditions change. Efficiency and cost savings follow naturally from that automation.
Over 2,100 companies already rely on Cast AI, including Akamai, BMW, Cisco, FICO, HuggingFace, NielsenIQ, Swisscom, and TGS.
Global team, diverse perspectives

We're headquartered in Miami, but our impact is international. We take a global and intentional approach to diversity. Today, Cast AI operates across 34 countries spanning Europe, North America, Latin America, and APAC, bringing a wide range of perspectives into how we build and lead.
Unicorn momentum

In January 2026, we achieved unicorn status with a strategic investment from Pacific Alliance Ventures, the corporate venture arm of Shinsegae Group (a $50+ billion Korean conglomerate). Our valuation now exceeds $1 billion, and we're just getting started.

Join us as we build the future of autonomous infrastructure.

About the role

Throughput. Latency. KV cache utilization.

Move those three numbers in the right direction, and two things happen: customers get faster, cheaper inference, and our margins improve. That's the entire thesis of this role. Every kernel you tune, every quantization scheme you ship, every scheduler tweak you land shows up directly in a customer's p99 and on our P&L.
This is a high-impact seat. It is also a high-autonomy seat: you'll have the room to lead the technical direction of inference optimization at Kimchi rather than execute someone else's roadmap.

The problem: running LLMs in production is a moving target. The "right" model and serving configuration for a workload depend on traffic shape, sequence-length distribution, batch dynamics, GPU SKU, memory bandwidth, quantization tolerance, and a dozen other variables that shift week to week. Most teams pick a model once, over-provision GPUs, and absorb the cost. Kimchi is the system that makes that decision automatically - continuously matching workloads to the most cost-efficient, best-performing LLM and serving configuration on a customer's infrastructure. We're building the optimization layer between the model and the hardware, and we need engineers who understand both sides deeply.
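The selection problem above can be sketched in a few lines. This is illustrative only — the class, field names, and numbers are hypothetical and are not Kimchi's actual implementation; the real system weighs far more variables (traffic shape, sequence lengths, batch dynamics, GPU SKU) than this toy cost model does:

```python
from dataclasses import dataclass

@dataclass
class ServingConfig:
    """One candidate (model, engine, GPU) combination with measured numbers."""
    name: str
    tokens_per_sec: float    # measured throughput on the target GPU
    p99_ttft_ms: float       # measured p99 time-to-first-token
    gpu_cost_per_hour: float # dollars per GPU-hour

def cost_per_million_tokens(cfg: ServingConfig) -> float:
    # dollars per 1M generated tokens at the measured throughput
    return cfg.gpu_cost_per_hour / (cfg.tokens_per_sec * 3600) * 1_000_000

def pick_config(candidates, ttft_slo_ms: float):
    # cheapest configuration that still meets the latency SLO; None if none do
    viable = [c for c in candidates if c.p99_ttft_ms <= ttft_slo_ms]
    return min(viable, key=cost_per_million_tokens) if viable else None
```

The point of the sketch: the "best" configuration flips as the SLO changes, which is why the decision has to be continuous rather than a one-time model choice.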

Stack

Python · vLLM · SGLang · TensorRT-LLM · PyTorch · CUDA-adjacent tooling · Kubernetes · gRPC · ClickHouse · PostgreSQL · GCP Pub/Sub · AWS / GCP / Azure · GitLab CI · ArgoCD · Prometheus · Grafana · Loki · Tempo.

Requirements:
  • 5+ years building real ML systems, with a portfolio that shows depth in inference or training infrastructure (not just model training notebooks).
  • Strong Python - production services, not scripts.
  • Hands-on experience with at least one of vLLM, SGLang, or TensorRT-LLM, and a working mental model of why an inference engine performs the way it does on a given GPU.
  • Fluency with quantization tradeoffs - you've measured quality regressions, not just compression ratios.
  • Comfort with distributed systems: collective communication, sharding strategies, and the practical failure modes of multi-GPU and multi-node setups.
  • A bias toward measurement. You instrument before you optimize, and you can tell the difference between a real win and a benchmark artifact.
  • Self-direction. This role comes with a wide mandate; you should be excited by that, not unsettled by it.
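"A bias toward measurement" starts with agreeing on definitions. A minimal sketch of the two latency metrics this role attacks separately, plus the percentile a customer actually feels — function names here are hypothetical, not an existing API:

```python
import math

def ttft_ms(request_start: float, first_token_at: float) -> float:
    # time-to-first-token: prefill plus scheduling delay, in milliseconds
    return (first_token_at - request_start) * 1000.0

def tpot_ms(first_token_at: float, last_token_at: float, n_output_tokens: int) -> float:
    # time-per-output-token, averaged over the decode phase only
    if n_output_tokens <= 1:
        return 0.0
    return (last_token_at - first_token_at) * 1000.0 / (n_output_tokens - 1)

def p99(samples: list) -> float:
    # nearest-rank p99 over raw samples; means hide the tail that users notice
    ordered = sorted(samples)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]
```

Separating TTFT from TPOT matters because they bottleneck differently: TTFT is usually compute-bound prefill and queueing, TPOT is usually memory-bandwidth-bound decode.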
Responsibilities:
  • Push throughput. Continuous batching, speculative decoding, chunked prefill, kernel-level tuning across vLLM, SGLang, and TensorRT-LLM. Find the ceiling on each GPU SKU, then raise it.
  • Cut latency. Attack TTFT and TPOT separately. Profile, identify the actual bottleneck (compute, memory bandwidth, scheduling, networking), and fix it - not the bottleneck you assumed.
  • Get more out of the KV cache. Paged attention, prefix caching, eviction policies, cache reuse across requests, quantized KV. This is where a lot of the unrealized throughput lives, and it's an area you'll own.
  • Quantize without regressing quality. INT8, INT4, FP8 across weights, activations, and KV. Empirical work: measure quality on real workloads, not just perplexity benchmarks.
  • Shrink cold starts and memory footprint. Faster init, smarter weight loading, tighter memory accounting - the difference between a model that scales and one that doesn't.
  • Scale across nodes. Distributed inference topologies, network-aware placement, checkpointing strategies that don't bottleneck on storage or interconnect.
  • Set the technical direction. Decide what we benchmark, what we adopt, and what we build ourselves. Bring the team along with strong writeups and reproducible experiments.
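To make the KV-cache bullet concrete, here is a toy prefix cache in the spirit of paged-attention block management: token IDs are grouped into fixed-size blocks, each block is keyed by a hash chained over its prefix, and reuse is counted in blocks. This is an assumption-laden sketch (block size, LRU policy, and the chained-hash keying are illustrative), not vLLM's or SGLang's implementation:

```python
from collections import OrderedDict

BLOCK = 16  # tokens per KV block, as in paged-attention-style allocators

class PrefixCache:
    """Toy LRU cache of KV blocks keyed by prefix-chained block hashes."""

    def __init__(self, capacity_blocks: int):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()  # block-hash -> present (LRU order)

    def lookup(self, token_ids):
        """Return how many leading full blocks were reusable; cache the rest."""
        hits = 0
        prefix_hash = 0
        matching = True  # once a block misses, later blocks cannot hit
        full = len(token_ids) - len(token_ids) % BLOCK
        for i in range(0, full, BLOCK):
            # chain the hash so a block's key depends on its entire prefix
            prefix_hash = hash((prefix_hash, tuple(token_ids[i:i + BLOCK])))
            if matching and prefix_hash in self.blocks:
                self.blocks.move_to_end(prefix_hash)  # refresh LRU position
                hits += 1
            else:
                matching = False
                self.blocks[prefix_hash] = True
                if len(self.blocks) > self.capacity:
                    self.blocks.popitem(last=False)  # evict least recent
        return hits
```

Even this toy shows why shared system prompts are such a throughput lever: every reused block is prefill compute and KV memory you never spend again.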
What’s in it for you?
  • Competitive salary (depending on experience).
  • Enjoy a flexible, remote-first global environment.
  • Collaborate with a global team of cloud experts and innovators who are passionate about pushing the boundaries of Kubernetes technology.
  • Equity options.
  • Get quick feedback with a fast-paced workflow. Most feature projects are completed in 1 to 4 weeks.
  • Spend 10% of your work time on personal projects or self-improvement. 
  • Learning budget for professional and personal development - including access to international conferences and courses that elevate your skills.
  • Annual hackathon to spark new ideas and strengthen team bonds.
  • Team-building budget and company events to connect with your colleagues.
  • Equipment budget to ensure you have everything you need.
  • Extra days off to help maintain a healthy work-life balance.
Hiring process
  • Screening call with Recruiter
  • Hiring Manager interview
  • Technical interview (system design)
  • Live coding
  • Culture Check interview with an executive

*As part of our standard hiring process, a background check may be conducted at the final stage of recruitment through our third-party provider, Checkr.
*Please note that Cast AI does not provide any form of visa sponsorship/work permit.

#LI-Remote
