Convergence

ML Engineer - Infrastructure

Posted 15 Days Ago
London, Greater London, England
Mid level
Design and maintain ML-focused cloud infrastructure, manage HPC clusters, implement automation tools, and optimise data storage for ML systems.

About Us

At Convergence, we're transforming the way AI integrates into our daily lives. Our team is developing the next generation of AI agents that don't just process information but take actions, learn from experience, and collaborate with humans. By introducing Large Meta Learning Models (LMLMs) that integrate memory as a core component, we're enabling AI to improve continuously through user feedback and acquire new skills during real-time use.

We believe in freeing individuals and businesses from mundane, repetitive tasks, allowing them to focus on innovative and creative work that truly matters. Our personalised AI assistant, Proxy, collaborates with users to enhance productivity and creativity. With $12 million in pre-seed funding from Balderton Capital, Salesforce Ventures, and Shopify Ventures, we're poised to make a significant impact in the AI space. Join us in shaping the future of human-AI collaboration and be part of our mission to transform the AI landscape.

Responsibilities

  • Design, implement, and maintain our ML-focused cloud infrastructure on GCP using Infrastructure as Code (Terraform)

  • Build and manage HPC clusters with Slurm for distributed ML workloads, focusing on GPU/TPU utilisation and job scheduling

  • Develop and maintain ML pipeline automation tools and ML-specific CI/CD workflows in Python

  • Design and optimise data storage solutions for ML datasets, model artefacts, and feature stores

  • Implement comprehensive monitoring, logging, and alerting solutions for ML model performance and infrastructure health

  • Collaborate with ML engineers and data scientists to provide robust infrastructure for model training and deployment

  • Lead and implement security best practices for ML systems, including model security and data protection

Requirements

  • 3+ years of experience in ML infrastructure or ML platform engineering

  • Strong proficiency in Python for ML pipeline automation and tooling

  • Extensive experience with Slurm cluster management for large-scale ML workloads

  • Proven track record with Terraform and Infrastructure as Code for ML environments

  • Solid understanding of GCP's ML-specific services (Vertex AI, AI Platform, etc.)

  • Experience with distributed training systems and model serving infrastructure

  • Experience with ML observability tools and performance monitoring

  • Excellent problem-solving skills with a focus on ML system reliability and optimisation

Bonus Qualifications

  • Experience scaling large language model (LLM) infrastructure

  • Knowledge of ML-specific orchestration tools (e.g., MLflow, Ray)

  • Experience with high-performance computing for ML training

  • Contributions to ML infrastructure-related open-source projects

  • Experience with GPU/TPU cluster management and optimisation

  • Background in ML operations (MLOps) or AI reliability engineering

  • Familiarity with vector databases and efficient embedding storage/retrieval

Why Join Us?

  • Be at the cutting edge of AI and LLM technology

  • Work on challenging problems that impact users' daily lives

  • Collaborative and innovative work environment

  • Opportunities for professional growth and learning

  • Competitive salary and benefits package

Top Skills

GCP
MLflow
Python
Ray
Slurm
Terraform

Convergence London, England Office

London, United Kingdom
