Carbon3.ai

Platform Engineer

Posted 15 Days Ago
Remote
Hiring Remotely in United Kingdom
Mid level
The Platform Engineer will design, deploy, and manage HPC and GPU-accelerated clusters, optimize network topologies, and ensure high availability while collaborating with vendor engineering teams to support seamless operations.

Era4 develops, owns and operates AI infrastructure across the UK, powered by renewable energy. By converting legacy industrial and energy sites into modern data-centre facilities, Era4 combines brownfield regeneration with cleaner, more efficient, scalable compute capacity for healthcare, research, finance, enterprise, and public-sector organisations.


Role Summary:  

We are looking for a Platform Engineer (HPC & AI) who can help shape our new Platform team. This role is customer-facing and involves technical troubleshooting and collaboration with vendor engineering teams to ensure seamless AI platform operations.

  

Responsibilities:  

  • Designing, deploying, and managing large‑scale HPC and GPU‑accelerated clusters, including NVIDIA-based compute environments.
  • Implementing and administering HPC scheduling and resource‑management systems (e.g., Slurm), including GPU partitioning, workload scheduling, and capacity planning. 
  • Architecting and optimising InfiniBand and Ethernet network topologies. 
  • Ensuring high availability and resilience through failover strategies, planned maintenance coordination, and proactive risk mitigation. 
  • Automating provisioning, configuration, monitoring, and operational workflows across multi‑vendor HPC hardware and software stacks. 
  • Monitoring real‑time performance and leading troubleshooting efforts across compute, storage, interconnect, drivers, and node failures, engaging vendor support for critical issues. 
  • Incident response: managing node failures, network issues, and driver issues; troubleshooting common problems and working with vendor support to resolve critical ones.
  • Security and access control: managing user permissions, RBAC, security hardening, and data protection.

 

Required Skills & Experience:  

  • Experience supporting HPE PCAI or other AI/HPC infrastructure and platforms.  
  • System administration experience with operating systems such as RHEL/CentOS and Ubuntu, including Linux kernel tuning.
  • Proficiency with Ansible, NVIDIA and CUDA toolkits, Kubernetes and container orchestration.
  • Understanding of automation, monitoring, and security in a GPU-as-a-service context.
  • Extensive experience in system engineering, platform operations or SRE. 
  • Experience with GPU resource allocation (across instances, GPU count, and time).
  • Advanced networking skills, including high-performance networking, troubleshooting, and fine-tuning.
  • Familiarity with cloud-based platforms, APIs, and distributed systems. 
  • Understanding of AI/ML concepts and tooling (basics of model training, inference, and data pipelines).
  • Experience with monitoring/logging tools (e.g., Grafana, Kibana, Splunk).  
  • Excellent communication skills to interface with both customers and internal / vendor teams.  
  • Good understanding of the tooling requirements of ML engineers and data scientists, and how to optimise their experience.


Why Join Era4:

You’ll be joining a mission-driven start-up building critical national infrastructure, where operational excellence directly enables growth. This role offers high visibility with leadership, real autonomy, and the chance to shape how a next-generation company operates at scale. 

 

Diversity & Inclusion:  

Era4 is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.  

 

Note: 

We appreciate that this is a relatively new skill set, and we are open to candidates who may not tick all the boxes but are willing to learn and develop their skills.

Top Skills

AI
Ansible
CentOS
CUDA
Grafana
HPC
Kibana
Kubernetes
Nvidia
RHEL
Slurm
Splunk
Ubuntu
