Northern Data Group

Operations Engineer (m/f/d)

Sorry, this job was removed at 06:57 p.m. (GMT) on Monday, Feb 17, 2025
London, Greater London, England

Job Description

We are a leading GPU cloud company specializing in high-performance computing (HPC) clusters powered by NVIDIA GPUs and advanced networking technologies. Our solutions empower industries like AI, machine learning, and scientific research to tackle complex computational challenges.

We’re looking for a skilled Operations Engineer to join our Support & Operations team, working to ensure the reliability, performance, and availability of our GPU-accelerated HPC infrastructure. You’ll play a key role in monitoring and maintaining system health, troubleshooting issues, and optimizing infrastructure performance in collaboration with our Infrastructure and Support teams.

YOUR RESPONSIBILITIES:

  • Continuously monitor GPU-accelerated HPC clusters to ensure system health, performance, and availability.
  • Proactively identify potential issues, respond to alerts, and troubleshoot to resolve system outages and performance bottlenecks.
  • Participate in an on-call rotation to provide 24/7 support for critical infrastructure issues and customer escalations.
  • Manage incidents, perform root cause analysis, and implement long-term solutions.
  • Develop and maintain documentation for operational procedures, troubleshooting guides, and best practices for HPC cluster management.
  • Create and update SOPs for routine tasks, including upgrades, system patching, and hardware replacements.
  • Coordinate with the Infrastructure team on deployments, upgrades, and system enhancements to align on operational best practices.
  • Support the Customer Support team in resolving complex technical issues related to GPU-powered infrastructure and HPC systems.

YOUR QUALIFICATIONS:

  • 3+ years of experience in infrastructure operations, system administration, or technical support, ideally within HPC or GPU-accelerated environments.
  • Strong troubleshooting skills with high-performance networking technologies (InfiniBand, RDMA, or similar).
  • Familiarity with NVIDIA GPU technology, HPC architectures, storage solutions and high-performance file systems.
  • Hands-on experience with monitoring tools and system management for large-scale infrastructure.
  • Proficiency in Ansible and scripting (e.g., Python, Bash) to automate tasks and improve operational efficiency.
  • Experience in a 24/7 support environment and incident management.

WHAT WE OFFER:

With us, you will work towards the future of HPC: from new, sustainable building methods for data centers, to cooling concepts, to software solutions for accelerated computing.

Your approaches count: In official exchange formats or spontaneously at the coffee machine. At Northern Data, it's the best idea that counts, not the hierarchy. We’re looking forward to your input!

You make the difference in the company: Unlike in established corporations, at Northern Data you will really help shape things, from setting up new departments to optimizing processes and culture.

Best-in-class partners: The best work with Northern Data. This gives you a head start in knowledge and time that benefits your career and our customers equally.

Green by heart: Sustainability is at the core of Northern Data. With us, you actively work on the carbon neutrality of data centers worldwide. Beginning with our infrastructure and continuing with the solutions for our clients, we work towards a green future.

Home Office facts: Work flexibly from home with our international, virtual team. And of course, your hardware wishes will be fulfilled to make your ideas for next-level HPC come true.

Your wellness matters: At Northern Data we have regular wellbeing initiatives that are designed to promote wellness, diversity, inclusion, and much more, ensuring a supportive and enriching environment for our global team.
