
Carbon3.ai

SRE / AIOps Engineer

Reposted 6 Days Ago
In-Office
London, Greater London, England, GBR
Mid level
The SRE/AIOps Engineer will develop autonomous workflows and operational tooling for AI-driven platform operations, focusing on automation, observability, and continuous improvement.
The summary above was generated by AI

Era4 develops, owns and operates AI infrastructure across the UK, powered by renewable energy. By converting legacy industrial and energy sites into modern data-centre facilities, Era4 combines brownfield regeneration with cleaner, efficient, scalable compute capacity for healthcare, research, finance, enterprise, and public-sector organisations.

**This is a greenfield role, building a modern Agentic approach to Client and Infrastructure Operations**.

 

Role Summary:

We are seeking Automation & AIOps Engineers who sit at the intersection of Site Reliability Engineering and modern AI-driven operations. Embedded within Era4's engineering-led Operations Centre, this role exists to build a modern AI Platform Operations function from scratch, designing tooling and agentic workflows with no legacy to deal with.

 

Key Responsibilities:

 

Runbook Automation & Agent Development:

  • Build agentic, executable workflows capable of triaging, diagnosing, and, where appropriate, autonomously remediating known failure patterns.
  • Build and maintain LLM-backed agents targeting the observability stack, ITSM platform, and infrastructure APIs (e.g. DCIM, IPAM, hypervisor layers).
  • Develop auditable, client-focused automations for client interactions and workflows, with appropriate controls.
  • Develop safe, auditable automation with appropriate controls for higher-risk platform actions.

 

Operational Tooling & Self-Service Enablement:

  • Build internal tooling that empowers engineers and service desk analysts: CLI utilities, ChatOps integrations (Slack/Teams bots), status dashboards, and self-service automation hooks.
  • Reduce dependency on DevSecOps and engineering teams for routine operational tasks through automation.
  • Maintain and contribute to a library of automation assets, agent prompts, and runbook-as-code artefacts, all version-controlled and peer-reviewed.

 

Event & Alert Intelligence:

  • Develop the automation layer around monitoring and event management: alert suppression logic, enrichment pipelines, correlation rules, and alert-to-ticket integrations.
  • Continuously tune signal-to-noise ratios across monitoring tooling (Prometheus, Mimir, Grafana, or equivalent) to improve situational awareness.
  • Design and implement event correlation and deduplication logic to reduce alert storms and improve incident context.

 

Continuous Improvement & Knowledge Capture:

  • Identify common operational patterns and tasks as candidates for automation; maintain and prioritise a toil reduction backlog.
  • Participate in post-incident reviews and translate findings into updated automation, runbooks, or agent logic.
  • Contribute to the evolution of Era4's operational standards, tooling architecture, and agent framework.

 

Essential Experience:

 

Technical – Core Element:

  • Strong Python development skills, including scripting for automation, API integration, and data processing.
  • Hands-on experience with observability and monitoring platforms: Prometheus, Grafana, Mimir, or equivalent.
  • Experience integrating with ITSM platforms (ServiceNow, Halo, Jira Service Management, or similar) via API.
  • Solid understanding of event-driven architectures, message queues, and webhook-based automation patterns.
  • Strong understanding of managing GPU infrastructure in production, including key signals and metrics and the automation of workflows.
  • Familiarity with Infrastructure-as-Code principles and cloud-native environments (Kubernetes, Terraform, or similar).

 

Technical – Agent & AI:

  • Demonstrable experience building LLM-powered agents or automation using frameworks such as LangChain, LlamaIndex, the Anthropic SDK, OpenAI function calling, or comparable tooling.
  • Understanding of agentic design patterns: tool use, structured output, human-in-the-loop controls, and chain-of-thought reasoning for operational tasks.
  • Comfort operating in an API-first environment, integrating agents with infrastructure APIs, DCIM, IPAM, and hypervisor control planes.

 

Operational:

  • Prior experience in an SRE, Senior Operations, or Platform Engineering environment, with exposure to on-call operations and incident management processes.
  • Experience converting narrative runbooks into executable automation or codified decision trees.
  • Understanding of ITIL-aligned incident and change management principles and ITSM tooling.

 

One or more would be an advantage:

  • Exposure to data centre or colocation operations, particularly high-density compute or GPU infrastructure environments.
  • Experience with ChatOps tooling: building Slack or Microsoft Teams bots for operational workflows.
  • Familiarity with DCIM platforms and telemetry pipelines (power, thermal, network).
  • Knowledge of OpenTelemetry, distributed tracing, or log aggregation platforms (Loki, ELK, Splunk).
  • Contributions to open-source observability or automation tooling.
  • Experience in a start-up or scale-up environment where tooling is built from scratch.

 

Why Join Era4:

You’ll be joining a mission-driven start-up building critical national infrastructure, where operational excellence directly enables growth. This role offers high visibility with leadership, real autonomy, and the chance to shape how a next-generation company operates at scale.

 

Diversity & Inclusion:

Era4 is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. 

 
