As a Research Intern at Atla, you'll conduct machine learning research, contribute to product development, and produce publications and datasets while collaborating with engineers on projects spanning iterative model improvement and evaluation systems.
About Atla
Atla is committed to engineering safe, beneficial AI systems that will have a massive positive impact on the future of humanity. We are a London-based start-up building the most capable AI evaluation models. Become part of our growing world-class team, backed by Y Combinator, Creandum, and the founders of Reddit, Cruise, Rappi, Instacart and more.
Role
As Atla’s research intern, you will collaborate with our researchers and gain deep experience at a growing AI startup. As part of your role, you will:
- Conduct cutting-edge machine learning research, contributing to research initiatives that have practical applications in our product development.
- Disseminate your research results through the production of publications, datasets, and code.
Our ongoing research projects encompass but are not limited to:
Iterative Self-Improvement
This project applies iterative self-improvement to enhance our general-purpose evaluator. This involves using the model’s outputs to refine its training data iteratively, rather than relying on fixed datasets. Prior work [1, 2, 3, 4] demonstrates the effectiveness of this approach, and we aim to extend it to evaluation systems.
We will leverage our internal training data, infrastructure, and benchmarks to iteratively refine the evaluator. You will collaborate with engineers to build infrastructure for iteratively generating better and more informative data. Techniques from our research on multi-stage synthetic data generation will be incorporated to improve data quality.
Key challenges include addressing bias amplification, semantic drift, and maintaining diversity of data to ensure model stability and alignment. This project aims to advance safe iterative training methodologies and deliver a more capable evaluator, with findings targeted for a top-tier conference. The scope can be tailored to your skills and interests.
[1] Wang, Y., et al. (2023). SELF-INSTRUCT: Aligning Language Models with Self-Generated Instructions.
[2] Yuan, W., et al. (2024). Self-Rewarding Language Models.
[3] Wang, T., et al. (2024). Self-Taught Evaluators.
[4] Li, X., et al. (2024). MONTESSORI-INSTRUCT: Generate Influential Training Data Tailored for Student Learning.
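The loop described above can be sketched in a few lines. This is a minimal illustration with toy stand-ins, not Atla's actual pipeline: `generate_candidates` and `score` are hypothetical placeholders for sampling from the model and for the evaluator's judgment.

```python
import random

def generate_candidates(model, prompts, n=4):
    # Toy stand-in for sampling n candidate responses per prompt
    # from the current model version.
    return {p: [f"{p}::cand{i}::v{model['version']}" for i in range(n)]
            for p in prompts}

def score(prompt, candidate):
    # Toy stand-in for the evaluator's judgment of a candidate response.
    random.seed(hash((prompt, candidate)) % (2**32))
    return random.random()

def self_improve(model, prompts, rounds=3):
    # Each round: sample candidates, score them with the current evaluator,
    # keep the top-scoring candidate per prompt as new training data,
    # then "retrain" (here: just bump a version counter and extend the data).
    for _ in range(rounds):
        candidates = generate_candidates(model, prompts)
        best = [max(cands, key=lambda c: score(p, c))
                for p, cands in candidates.items()]
        model = {"version": model["version"] + 1, "data": model["data"] + best}
    return model

model = self_improve({"version": 0, "data": []},
                     ["judge this patch", "rate this proof"])
```

The challenges noted above show up directly in this loop: if `score` is biased, the kept candidates amplify that bias round after round, which is why diversity and drift checks sit between scoring and retraining in practice.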
Inference Time Compute
This project explores inference-time compute scaling to enhance our general-purpose evaluator, particularly for complex tasks like coding, which benefit from longer reasoning chains. Recent research [1, 2] has shown the effectiveness of inference-time compute in improving performance on reasoning and mathematical tasks by leveraging more tokens during inference.
We will investigate methods to train models capable of utilising additional tokens effectively for reasoning. This involves experimenting with reinforcement learning (RL) approaches, such as group relative policy optimisation (GRPO), to encourage self-verification and reasoning strategies. You will work with engineers to develop the necessary training infrastructure.
Key challenges include addressing trade-offs between token efficiency and performance while mitigating common issues such as verbose or degenerate reasoning chains. The project aims to develop robust methods for inference-time compute scaling and contribute findings to a top-tier conference. The scope can be tailored to your skills and interests.
[1] Guo, D., et al. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.
[2] Snell, C., et al. (2024). Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters.
Agentic Evaluation
This project investigates how to evaluate agentic systems using an LLM-as-a-Judge framework. Agents introduce new challenges due to their ability to reason, plan, and interact with external tools [1,2]. Evaluating their capabilities and safety requires new approaches, with potential directions including:
- Agent-as-a-Judge: Using agentic systems to evaluate other agentic systems, reducing reliance on human judgment and enabling automated, scalable evaluation frameworks [3].
- Task-driven and multi-step evaluation: Moving beyond single-action accuracy to assess long-horizon reasoning, adaptability, and decision-making in dynamic environments [4].
AI agents are becoming the next major AI paradigm, with 2025 set to be a pivotal year for their development. As models evolve from passive assistants to autonomous agents, rigorous evaluation is essential to ensure their reliability and safety [5,6].
This project aims to develop a framework for evaluating agents, create benchmarks, and contribute findings to a top-tier conference. The scope can be tailored to your skills and interests.
[1] Yao, S., et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models.
[2] Deng, X., et al. (2023). MIND2WEB: Towards a Generalist Agent for the Web.
[3] Zhuge, M., et al. (2024). Agent-as-a-Judge: Evaluate Agents with Agents.
[4] Nathani, D., et al. (2025). MLGym: A New Framework and Benchmark for Advancing AI Research Agents.
[5] Altman, S. (2024). The Intelligence Age.
[6] Heikkilä, M., & Heaven, W. D. (2025). Anthropic’s Chief Scientist on 4 Ways Agents Will Be Even Better in 2025. MIT Technology Review.
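The step-wise direction above can be made concrete: instead of judging only the final answer, a judge walks an agent's trajectory and aggregates per-step verdicts. This is an illustrative sketch, not the Agent-as-a-Judge implementation; `judge_step` stands in for prompting an LLM judge with the step and the overall task.

```python
from dataclasses import dataclass

@dataclass
class Step:
    action: str
    observation: str

def judge_step(step):
    # Toy per-step verdict; in practice an LLM judge would be prompted
    # with the step's action, its observation, and the overall task.
    return "error" not in step.observation

def judge_trajectory(steps, task):
    # Task-driven, multi-step evaluation: score every step of a trajectory,
    # then aggregate into a task-level verdict rather than checking only
    # the final action's accuracy.
    verdicts = [judge_step(s) for s in steps]
    return {"task": task,
            "step_pass_rate": sum(verdicts) / len(verdicts),
            "passed": all(verdicts)}

report = judge_trajectory(
    [Step("search('flights')", "3 results"),
     Step("book(flight_2)", "error: sold out"),
     Step("book(flight_1)", "confirmed")],
    task="book a flight")
```

Even when the final step succeeds, the trajectory-level report surfaces the failed intermediate action, which is exactly the signal single-action accuracy misses.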
Qualifications
Evidence of exceptional research engineering ability:
- Currently pursuing a PhD in Machine Learning, NLP, Artificial Intelligence, or a related discipline; we will also consider exceptional non-PhD candidates.
- Proven track record in empirical research, including designing and executing experiments, and effectively writing up and communicating findings.
- Publications in top AI conferences.
- Aptitude for distilling and applying ideas from complex research papers.
Nice to have
- Previous internship experience at leading AI research labs (OpenAI, DeepMind, Meta, Anthropic, etc.).
- Experience with large-scale distributed training strategies, data annotation and evaluation pipelines, or implementing state-of-the-art ML models.
- Interested in and thoughtful about the impacts of AI technology.
About you
You'll work by and thrive through our core principles:
Own the Outcome
- Create real value: Every action should deliver tangible, meaningful value for the people who use what we build.
- Drive to completion: Do the second 90%.
- Do fewer things, better: Prioritize focus over breadth.
Back the Team
- Collaborate for excellence: The whole is greater than the sum of its parts.
- Seek truth: Let the best ideas win, no matter where they come from, and let go of ego.
- Argue passionately, then commit fully: Debate fiercely, but once a decision is made, own it like it’s yours.
Drive the Mission
- Advance AI safety: Every action should contribute towards the safe development of AI.
- Go big or go home: “The people who are crazy enough to think they can change the world are the ones who do.”
Compensation
Highly competitive.
Top Skills
Artificial Intelligence
Machine Learning
NLP
Atla Office: London, England, United Kingdom