NVIDIA

AI Test Architect

Reposted Yesterday
Remote
6 Locations
Expert/Leader
The AI Test Architect will profile, benchmark, and optimize deep learning models and training pipelines while focusing on high-performance networking for large-scale supercomputing solutions.

We are looking for an AI Test Architect to join the E2E Verification group and profile innovative large-scale distributed training on NVIDIA AI end-to-end solutions in large-scale supercomputing clusters.

You will provide insights on at-scale system design and tuning mechanisms for large-scale compute runs. You will work with the latest Accelerated Computing and Deep Learning software and hardware platforms, and with researchers, developers, and customers, to craft improved workflows and develop new, leading, differentiated solutions. You will interact with HPC, OS, Switch, HCA, CPU and GPU compute, and systems specialists to architect, develop, and bring up large-scale performance platforms.

What you’ll be doing:

  • Profiling, benchmarking, and analyzing deep learning models to identify areas for optimization and improvement in terms of performance, efficiency, and accuracy, with a strong emphasis on networking aspects.

  • Collaborating closely with data scientists, researchers, and development and automation teams to design and implement scalable training pipelines and frameworks that demonstrate large-scale, high-performance networking capabilities.

  • Staying up-to-date with the latest advancements in deep learning algorithms, architectures, NVIDIA GPU technologies, and high-performance networking solutions.

  • Optimizing deep learning models for performance, memory usage, and power efficiency while maximizing high-performance networking features on NVIDIA supercomputers.

  • Providing insights and recommendations based on the analysis of large-scale training results, specifically focusing on networking bottlenecks and optimizations, to improve model outcomes and achieve business objectives.

  • Collaborating with hardware engineers to guide the development and integration of efficient networking solutions for deep learning, including exploring network architecture optimizations and applying technologies such as RDMA or InfiniBand.

What we need to see:

  • B.Sc. in Computer Science, Software Engineering, or equivalent experience.

  • Strong understanding and practical experience with machine learning algorithms and techniques, with a specialization in deep learning and expertise in high-performance networking.

  • 8+ years of overall experience, including CUDA programming for deep learning frameworks such as TensorFlow and PyTorch, combined with expertise in networking libraries and protocols.

  • Ability to profile and optimize deep learning workflows, focusing on networking-related bottlenecks and optimizations, to improve overall performance and efficiency.

  • Exceptional analytical and problem-solving skills, with keen attention to detail, particularly in identifying and resolving networking performance issues.

  • Excellent communication and collaboration skills, enabling effective teamwork and cooperation.

  • Familiarity with supercomputers, parallel computing, distributed systems, and high-performance networking technologies like RDMA or InfiniBand.

Ways to stand out from the crowd:

  • Demonstrated experience in successfully profiling and optimizing large-scale deep learning training on NVIDIA supercomputers, with a significant focus on high-performance networking enhancements.

  • Experience with distributed deep learning, distributed training frameworks, or large-scale data pipelines enhanced by high-performance networking solutions.

  • Expertise in optimizing networking parameters, such as bandwidth, latency, or congestion control, for deep learning workloads.

  • Familiarity with NVIDIA's networking technologies, such as Mellanox InfiniBand, and their integration with deep learning workflows.

  • Strong understanding of high-performance networking protocols and standards and their application to deep learning.

Top Skills

CUDA
InfiniBand
PyTorch
RDMA
TensorFlow

NVIDIA London, England Office

13th Floor One Angel Court, London, United Kingdom, EC2R 7HJ
