
Carbon3.ai

Platform Engineer

Reposted Yesterday
In-Office
London, Greater London, England
Mid level
The HPC Platform Engineer will manage AI platform operations, providing L1 and L2 support, coordinating with vendors, monitoring systems, and optimizing resource allocation.
The summary above was generated by AI

Carbon3 develops, owns and operates AI infrastructure across the UK, powered by renewable energy. Converting legacy industrial and energy sites into modern data-centre facilities, Carbon3 is combining brownfield regeneration opportunities with cleaner, efficient, scalable compute capacity for healthcare, research, finance, enterprise, and public-sector organisations.


We are looking for a Platform Engineer (HPC & AI) who can help shape our new Platform team. This role is customer-facing and involves technical troubleshooting and collaboration with vendor engineering teams to ensure seamless AI platform operations.

 

Key Responsibilities:

  • Escalate complex (L3) issues to vendor product/engineering teams, coordinate their resolution, and manage vendor responses.
  • Monitor system health, alerts, and customer usage patterns.
  • Document solutions, workarounds, and support procedures; create and maintain the knowledge base.
  • Automate common tasks and fixes. 
  • Configure and integrate tooling to support optimal operation of the platform, and support tool selection. 
  • Assist customers with platform configuration, onboarding, and usage best practices. 
  • Collaborate with platform and infrastructure support/engineering teams to resolve platform integration issues. 
  • Ensure SLAs and customer satisfaction targets are met.
  • Provide L1 support for customer-reported issues and requests.
  • Provide L2 support by diagnosing, replicating, and troubleshooting issues across platform and infrastructure.
  • Work with customers and multiple stakeholders to understand requirements and challenges, and provide reporting on usage, workflow, and billing.


Technical responsibilities:

  • Cluster infrastructure management: Manage the Nvidia GPU cluster.
  • High availability and resilience: Implement failover strategies and manage maintenance events to minimise downtime.
  • Resource allocation and optimisation: Resource partitioning (GPU resources), workload scheduling, capacity planning. 
  • Performance monitoring and troubleshooting: Performance analysis and real-time monitoring with available Nvidia and HPE tools.
  • Incident response: Manage node failures, network issues, and driver issues; troubleshoot common issues and work with vendor support to resolve critical ones.
  • Security and access control: Manage user permissions, RBAC, security hardening, data protection.  

   

Required Skills & Experience: 

  • Extensive experience in technical support, system engineering, or platform operations.
  • Solid understanding of L1 and L2 support processes (ticketing, escalation, troubleshooting).
  • Familiarity with cloud-based platforms, APIs, and distributed systems.
  • Understanding of AI/ML concepts and tooling (basics of model training, inference, and data pipelines).
  • Experience with monitoring/logging tools (e.g., Grafana, Kibana, Splunk). 
  • Excellent communication skills to interface with both customers and internal / vendor teams. 
  • Good understanding of the tooling requirements of ML engineers and data scientists, and how to optimise their experience.

  

Core Technical skills: 

  • System administration experience with operating systems such as RHEL/CentOS and Ubuntu, including Linux kernel tuning.
  • Proficiency with Ansible, Nvidia and CUDA toolkits, Kubernetes and container orchestration.
  • Understanding of automation, monitoring, and security in a GPU-as-a-service environment.

   

Preferred experience:

  • Experience supporting HPE PCAI or other AI/HPC infrastructure and platforms. 
  • Experience with GPU resource allocation (across instances, GPU count, and time).
  • Advanced networking skills, including high-performance networking, troubleshooting, and fine-tuning.
  • Background in DevOps or SRE practices. 
  • ITIL familiarity. 

  

Success Metrics: 

  • Customers receive timely, effective support with minimal escalations. 
  • Issues are resolved or routed correctly with high-quality documentation. 
  • The platform maintains strong uptime and customer satisfaction. 

 

Why Join Carbon3.ai:

You’ll be joining a mission-driven start-up building critical national infrastructure, where operational excellence directly enables growth. This role offers high visibility with leadership, real autonomy, and the chance to shape how a next-generation company operates at scale.

Top Skills

Ansible
CUDA
Grafana
Kibana
Kubernetes
Nvidia GPU
RHEL/CentOS
Splunk
Ubuntu
