Cepheid

Reliability Engineer

Posted 8 Days Ago

In-Office

Cambridge, Cambridgeshire, England

Mid level

The Reliability Engineer will ensure production systems' stability, performance, and reliability, manage incidents, automate tasks, and improve system resilience.

For over 25 years, Abcam has been providing tools the scientific community needs to enable faster breakthroughs in critical areas like cancer, neurological disorders, infectious diseases, and metabolic disorders.

We believe that to continue making progress, we need to work together, each bringing our own unique perspectives to make an impact on the world. This community needs people like you: dedicated, agile and above all audacious so we can truly drive science forward.

Role Summary

We are seeking a highly motivated Reliability Engineer to join our team. As a Reliability Engineer, you will play a crucial role in ensuring the stability, performance, and reliability of our production systems. Your responsibilities will include proactively identifying and resolving technical issues, leading major incident responses, and implementing best practices for system reliability. You will work closely with cross-functional teams to develop and maintain robust monitoring and automation solutions. This position reports directly to the Global Reliability Manager.

In this role, you will have the opportunity to:

• Shape system reliability at scale by monitoring performance, spotting trends, and preventing issues before they impact users.

• Take charge during critical moments, leading major incident responses and driving rapid service restoration.

• Solve complex problems for the long term, collaborating across teams to implement robust, sustainable solutions.

• Automate and innovate, building tools and processes that streamline operations and reduce manual work.

• Drive continuous improvement, using data insights and post-incident learnings to make systems more resilient every day.

The essential requirements of the job include:

• Automation & Scripting: Ability to code repeatable tasks using PowerShell, Bash, or Python, and familiarity with infrastructure-as-code tools such as Terraform and configuration management tools such as Puppet.

• Cloud & Infrastructure: Strong knowledge of AWS Cloud services, networking, security, and storage solutions, both on-premises and in the cloud.

• Reliability & Scalability: High-level understanding of High Availability, Disaster Recovery, scalability solutions, and web infrastructure troubleshooting using logs.

• Monitoring & Incident Management: Proficiency with monitoring dashboards (Grafana, Humio, CloudWatch) and incident management tools like ServiceNow and PagerDuty.

• Database & Pipelines: Good understanding of SQL Server, Oracle, PostgreSQL (including DML), and familiarity with CI/CD pipelines such as GitLab CI.

It would be a plus if you also have:

• EKS troubleshooting knowledge

• Application support experience

• Linux OS troubleshooting experience

• Oracle Cloud Infrastructure knowledge

This role includes participating in an on-call rotation to provide 24/7 support for critical systems and responding to incidents as needed.

Join our winning team today. Together, we’ll accelerate the real-life impact of tomorrow’s science and technology. We partner with customers across the globe to help them solve their most complex challenges, architecting solutions that bring the power of science to life.

For more information, visit www.danaher.com.

Top Skills

AWS

Bash

CloudWatch

GitLab CI

Grafana

Humio

Oracle

PagerDuty

PostgreSQL

PowerShell

Puppet

Python

ServiceNow

SQL Server

Terraform
