Era4 develops, owns and operates AI infrastructure across the UK, powered by renewable energy. By converting legacy industrial and energy sites into modern data-centre facilities, Era4 combines brownfield regeneration with cleaner, more efficient, scalable compute capacity for healthcare, research, finance, enterprise, and public-sector organisations.
**This is a greenfield role: building a modern agentic approach to Client and Infrastructure Operations.**
Role Summary:
We are seeking Automation & AIOps Engineers who sit at the intersection of Site Reliability Engineering and modern AI-driven operations. Embedded within Era4's engineering-led Operations Centre, this role exists to build a modern AI Platform Operations function from scratch, designing tooling and agentic workflows. There is no legacy to deal with.
Key Responsibilities:
Runbook Automation & Agent Development:
- Build agentic, executable workflows capable of triaging, diagnosing, and where appropriate autonomously remediating known failure patterns.
- Build and maintain LLM-backed agents targeting the observability stack, ITSM platform, and infrastructure APIs (e.g. DCIM, IPAM, hypervisor layers).
- Develop auditable, client-focused automations for client interactions and workflows, with appropriate controls.
- Develop safe, auditable automation with appropriate controls for higher-risk platform actions.
Operational Tooling & Self-Service Enablement:
- Build internal tooling that empowers engineers and service desk analysts: CLI utilities, ChatOps integrations (Slack/Teams bots), status dashboards, and self-service automation hooks.
- Reduce dependency on DevSecOps and engineering teams for routine operational tasks through automation.
- Maintain and contribute to a library of automation assets, agent prompts, and runbook-as-code artefacts, all version-controlled and peer-reviewed.
Event & Alert Intelligence:
- Develop the automation layer around monitoring and event management: alert suppression logic, enrichment pipelines, correlation rules, and alert-to-ticket integrations.
- Continuously tune signal-to-noise ratios across monitoring tooling (Prometheus, Mimir, Grafana, or equivalent) to improve situational awareness.
- Design and implement event correlation and deduplication logic to reduce alert storms and improve incident context.
Continuous Improvement & Knowledge Capture:
- Identify common operational patterns and tasks as candidates for automation; maintain and prioritise a toil-reduction backlog.
- Participate in post-incident reviews and translate findings into updated automation, runbooks, or agent logic.
- Contribute to the evolution of Era4's operational standards, tooling architecture, and agent framework.
Essential Experience:
Technical – Core Element:
- Strong Python development skills, including scripting for automation, API integration, and data processing.
- Hands-on experience with observability and monitoring platforms: Prometheus, Grafana, Mimir, or equivalent.
- Experience integrating with ITSM platforms (ServiceNow, Halo, Jira Service Management, or similar) via API.
- Solid understanding of event-driven architectures, message queues, and webhook-based automation patterns.
- Strong understanding of managing GPU infrastructure in production, including its key signals and metrics and the automation of related workflows.
- Familiarity with Infrastructure-as-Code principles and cloud-native environments (Kubernetes, Terraform, or similar).
Technical – Agent & AI:
- Demonstrable experience building LLM-powered agents or automation using frameworks such as LangChain, LlamaIndex, the Anthropic SDK, OpenAI function calling, or comparable tooling.
- Understanding of agentic design patterns: tool use, structured output, human-in-the-loop controls, and chain-of-thought reasoning for operational tasks.
- Comfort operating in an API-first environment, integrating agents with infrastructure APIs, DCIM, IPAM, and hypervisor control planes.
Operational:
- Prior experience in an SRE, Senior Operations, or Platform Engineering environment, with exposure to on-call operations and incident management processes.
- Experience converting narrative runbooks into executable automation or codified decision trees.
- Understanding of ITIL-aligned incident and change management principles and ITSM tooling.
One or more of the following would be an advantage:
- Exposure to data centre or colocation operations, particularly high-density compute or GPU infrastructure environments.
- Experience with ChatOps tooling: building Slack or Microsoft Teams bots for operational workflows.
- Familiarity with DCIM platforms and telemetry pipelines (power, thermal, network).
- Knowledge of OpenTelemetry, distributed tracing, or log aggregation platforms (Loki, ELK, Splunk).
- Contributions to open-source observability or automation tooling.
- Experience in a start-up or scale-up environment where tooling is built from scratch.
Why Join Era4:
You’ll be joining a mission-driven start-up building critical national infrastructure, where operational excellence directly enables growth. This role offers high visibility with leadership, real autonomy, and the chance to shape how a next-generation company operates at scale.
Diversity & Inclusion:
Era4 is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.



