Thoughtworks

Lead Machine Learning Engineer

Posted 2 Days Ago
Manchester, Greater Manchester, England
Senior level
The Lead Machine Learning Engineer will oversee white-glove support service for managing large GPU clusters, contribute to model training and optimization, and collaborate on automation improvements. Responsibilities include ensuring training readiness, facilitating problem solving, and aligning ML strategies with business goals while providing exceptional client-facing service.
The summary above was generated by AI

The Team

This team will provide 24x7 white-glove support to people using large blocks of GPUs (6,000+ contiguous GPUs) for a short period of time (e.g. 6 or 12 weeks) to perform Managed Post Training. This includes help with preparation, 24x7 support during training to ensure full utilization of the GPU clusters, and off-boarding. The team is spread across three time zones (US, Europe and India), with hand-off protocols to enable 24x7 support.

The Role

While you can be a specialist in machine learning engineering (MLE), you also need a working knowledge of cluster operations.

Location

This role can be based at any of our offices across Europe.

Job responsibilities

  • You will help shape and iterate this new white-glove support service.
  • You will work in close collaboration with a Lead Cluster Operations Support Engineer.
  • You will contribute to accelerator development: identify gaps in the tooling, missing automation, or recurring patterns for which we could develop accelerators that make the next round of this service faster and more efficient. For example, we may need to improve observability, automate user onboarding, or bring in a new tool that everyone seems to want to use.
  • You will help assess the model training readiness and data preparation.
  • You will provide model training support on rotating daytime weekend shifts, with pagers, for any issues users may encounter. These can range from infrastructure issues to data science issues or anything in between, e.g. AWS changes a configuration in EKS that affects the training.
  • You will facilitate collaborative problem solving within the team by actively listening, communicating effectively and mentoring other engineers.
  • You will contribute to the development and execution of the team's overall ML strategy, aligning technical capabilities with business objectives.
  • You will proactively identify and address challenges related to the white-glove service for continued pre-training, proposing solutions and implementing improvements.

Job qualifications

Technical Skills
  • You have proven experience in distributed training of large language models (LLMs) across multiple worker nodes and GPUs.
  • You have a deep understanding of LLM architectures, including transformer-based models, and a demonstrated ability to design and implement custom models.
  • You have expertise in monitoring large training jobs in a distributed environment and the ability to debug job failures.
  • You have deep expertise in PyTorch (or TensorFlow) and in debugging training failure modes.
  • You have deep knowledge of fine-tuning or training with open-weight Gen AI models (e.g. Llama, Mistral, Gemma).
  • You have previous experience with Weights & Biases, Run.ai, PyTorch, TensorFlow and Hugging Face libraries.
  • You have experience with, but not limited to, the NVIDIA NeMo stack (for both training and inference).

Professional Skills

  • You will be part of a client-facing white-glove service where a high level of professionalism is required.
  • You understand the importance of stakeholder management and can easily liaise between clients and other key stakeholders throughout projects, ensuring buy-in and gaining trust along the way.
  • You are resilient in ambiguous situations and can adapt your role to approach challenges from multiple perspectives.
  • You don’t shy away from risks or conflicts; instead, you take them on and skillfully manage them.
  • You are eager to coach, mentor and motivate others and you aspire to influence teammates to take positive action and accountability for their work.
  • You enjoy influencing others and always advocate for technical excellence while being open to change when needed.

Other things to know

Learning & Development

There is no one-size-fits-all career path at Thoughtworks: however you want to develop your career is entirely up to you. But we also balance autonomy with the strength of our cultivation culture. This means your career is supported by interactive tools, numerous development programs and teammates who want to help you grow. We see value in helping each other be our best and that extends to empowering our employees in their career journeys.

About Thoughtworks

Thoughtworks is a dynamic and inclusive community of bright and supportive colleagues who are revolutionizing tech. As a leading technology consultancy, we’re pushing boundaries through our purposeful and impactful work. For 30+ years, we’ve delivered extraordinary impact together with our clients by helping them solve complex business problems with technology as the differentiator. Bring your brilliant expertise and commitment to continuous learning to Thoughtworks. Together, let’s be extraordinary.

Top Skills

PyTorch
TensorFlow
