JetBrains

Research Engineer (LLM Training and Performance)

Posted 23 Days Ago
In-Office
9 Locations
Senior level
As a Research Engineer, you will enhance LLM training performance, manage the training stack, and optimize multi-node pipelines for large-scale machine learning models.
The summary above was generated by AI

At JetBrains, code is our passion. Ever since we started back in 2000, we have been striving to make the strongest, most effective developer tools on earth. By automating routine checks and corrections, our tools speed up production, freeing developers to grow, discover, and create.

We’re looking for a Research Engineer who will own the training stack and model architecture for our Mellum LLM family. Your job is easier said than done: make training faster, cheaper, and more stable at a large scale. You’ll profile, design, and implement changes to the training pipeline – from architecture to custom GPU kernels, as needed.

As part of our team, you will:
  • Improve end-to-end performance of multi-node LLM pre-training and post-training pipelines.
  • Profile hotspots (Nsight Systems/Compute, NVTX) and fix them using compute/comm overlap, kernel fusion, scheduling, etc.
  • Design and evaluate architecture choices (depth/width, attention variants including GQA/MQA/MLA/Flash-style, RoPE scaling/NTK, and MoE routing and load-balancing).
  • Implement custom ops (Triton and/or CUDA C++), integrate via PyTorch extensions, and upstream when possible.
  • Push memory/perf levers: FSDP/ZeRO, activation checkpointing, FP8/TE, tensor/pipeline/sequence/expert parallelism, NCCL tuning.
  • Harden large runs by building elastic and fault-tolerant training setups, ensuring robust checkpointing, strengthening reproducibility, and improving resilience to preemption.
  • Keep the data path fast with streaming and sharded data loaders and tokenizer pipelines, and improve overall throughput and cache efficiency.
  • Define the right metrics, build dashboards, and deliver steady improvements.
  • Run both pre-training and post-training (including SFT, RLHF, and GRPO-style methods) efficiently across sizable clusters.
We’ll be happy to bring you on board if you have:
  • Strong PyTorch and PyTorch Distributed experience, having run multi-node jobs with tens to hundreds of GPUs.
  • Hands-on experience with Megatron-LM/Megatron-Core/NeMo, DeepSpeed, or serious FSDP/ZeRO expertise.
  • Real profiling expertise (Nsight Systems/Compute, nvprof) and experience with NVTX-instrumented workflows.
  • GPU programming skills with Triton and/or CUDA, and the ability to write, test, and debug kernels.
  • A solid understanding of NCCL collectives, as well as topology and fabric effects (IB/RoCE), and how they show up in traces.
Our ideal candidate would have experience with:
  • FlashAttention-2 and 3, CUTLASS and CuTe, TransformerEngine and FP8, Inductor, AOTAutograd, and torch.compile.
  • MoE at scale (expert parallel, router losses, capacity management) and long-context tricks (ALiBi/YaRN/NTK scaling).
  • Kubernetes or SLURM at scale, placement and affinity tuning, as well as AWS, GCP, and Azure GPU fleets.
  • Web-scale data plumbing (streaming datasets, Parquet and TFRecord, tokenizer perf), eval harnesses, and benchmarking.
  • Safety and post-training methods, such as DPO, ORPO, GRPO, and reward models.
  • Inference ecosystems such as vLLM and paged KV.

We process the data provided in your job application in accordance with the Recruitment Privacy Policy.

Top Skills

CUDA
DeepSpeed
FSDP
Megatron-Core
Megatron-LM
NCCL
NeMo
Nsight Compute
Nsight Systems
PyTorch
Triton
ZeRO
