Platform Engineer - Observability (Contract)

Carbon3.ai

Posted 5 Days Ago

Remote

Hiring Remotely in United Kingdom

Senior level

Design, implement, and operate a multi-site, multi-tenant observability platform using the Grafana stack and related tooling. Configure telemetry ingestion, dashboards, alerting, SLOs, and automation; ensure scalability, tenant isolation, long-term storage and DR; integrate telemetry sources and collaborate with application teams on onboarding and tracing.

The summary above was generated by AI

Era4 develops, owns and operates AI infrastructure across the UK, powered by renewable energy. By converting legacy industrial and energy sites into modern data-centre facilities, Era4 combines brownfield regeneration with cleaner, more efficient, scalable compute capacity for healthcare, research, finance, enterprise, and public-sector organisations.

Initial 6-month contract.

June start date.

Competitive day rate.

Key Responsibilities:

Observability Platform Implementation:

Deliver the implementation of Era4's observability platform based on Grafana Mimir, Loki, Tempo, Grafana Alloy and Grafana Enterprise tooling.
Design and implement highly available observability services across multiple co-location and production sites.
Configure telemetry ingestion pipelines for metrics, logs, and future distributed tracing workloads.
Develop and maintain observability architecture documentation, high-level designs, low-level designs, and operational runbooks.
Define platform standards for telemetry collection, labelling, metadata enrichment, retention policies, and data governance.
Implement multi-tenant observability controls and tenant isolation strategies.
Configure and maintain object-storage-backed telemetry platforms for long-term retention and scalability.

Telemetry Collection & Integration:

Deploy and manage Grafana Alloy collectors across Kubernetes clusters, Linux hosts, network infrastructure, storage platforms, and hardware management systems.
Integrate telemetry from Kubernetes, GPU infrastructure, HPE hardware, storage platforms, network devices, and cloud-native services.
Develop and maintain observability integrations using OpenTelemetry standards and protocols.
Establish onboarding processes for new platforms, applications, and infrastructure services.
Collaborate with application teams to define observability requirements and future tracing adoption strategies.

Alerting & Operational Insights:

Design and implement alerting frameworks using recording rules, Alertmanager, and operational best practices.
Develop operational dashboards and service health views for infrastructure, platform, and application services.
Support integration of observability events with ITSM and incident-management platforms.
Define SLIs, SLOs, alert thresholds, and operational KPIs.
Continuously improve platform observability, incident detection, and root-cause analysis capabilities.

Reliability & Automation:

Implement Infrastructure-as-Code and GitOps practices for observability platform deployment and configuration management.
Develop automation for dashboard provisioning, alert deployment, tenant onboarding, and telemetry configuration.
Design and validate disaster recovery, resilience, and failover capabilities across observability services.
Contribute to platform security, compliance, and operational governance initiatives.
Work with operational teams to ensure observability services remain reliable, scalable, and maintainable.

Required Experience & Skills:

Significant experience implementing and operating enterprise observability or monitoring platforms.
Strong understanding of metrics, logs, traces, OpenTelemetry, and modern observability principles.
Experience with Grafana ecosystem technologies including Grafana, Prometheus, Grafana Mimir, Grafana Loki, Grafana Tempo, and Grafana Alloy.
Experience designing Kubernetes-native solutions and operating distributed platforms at scale.
Knowledge of Linux systems administration and cloud-native infrastructure.
Experience implementing Infrastructure-as-Code and GitOps approaches (preferably including Ansible).
Skilled in developing automation and operational tooling using Python and/or Go.
Previous exposure to creating technical architecture, operational documentation, and deployment designs.
Experience with object storage technologies and distributed data platforms.
Strong understanding of monitoring, alerting, and operational event management.

One or more of the following would be advantageous:

Experience implementing OpenTelemetry-based observability solutions.
Experience operating observability platforms in service-provider, cloud, or large-scale enterprise environments.
Experience supporting GPU, AI/ML, or high-performance computing environments.
Experience integrating observability platforms with ITSM solutions.
Experience with multi-tenant platform architectures.
Knowledge of networking, storage, and data-centre infrastructure monitoring.
Understanding of distributed tracing and application performance monitoring.

Why Join Era4:

You’ll be joining a mission-driven start-up building critical national infrastructure, where operational excellence directly enables growth. This role offers high visibility with leadership, real autonomy, and the chance to shape how a next-generation company operates at scale.

Diversity & Inclusion:

Era4 is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.
