Senior Site Reliability Engineer

NatWest Group

Posted 3 Hours Ago

In-Office

Edinburgh, City of Edinburgh, Scotland

Senior level

The Senior Site Reliability Engineer ensures reliability and performance of production platforms, leads SRE practices, incident management, and automation using AWS and Kubernetes.

Join us as a Senior Site Reliability Engineer

In this key role, you’ll improve and drive the availability, performance, efficiency, change management, monitoring, security, incident response, and capacity planning for our products and services
You’ll enjoy significant stakeholder interaction, working in collaboration with engineers to ensure a principled approach to delivering change in a safe and secure way
This is a chance to join an inclusive team with a collaborative ethos and a commitment to innovation and professional development
You’ll need to have the flexibility to support the team by working shifts and weekends on rotation

What you'll do

As a Senior Site Reliability Engineer, you’ll act as a hands-on expert responsible for ensuring the reliability, availability, and performance of critical production platforms. You’ll lead the adoption of Site Reliability Engineering (SRE) practices, embedding resilience, observability, and operational excellence into distributed systems running on AWS and Kubernetes. You’ll also take ownership of 24/7 production support models, ensuring systems are highly available and that incidents are effectively managed and learned from.

We’ll also expect you to design and operate highly resilient AWS-based Kubernetes platforms (EKS) aligned with enterprise standards, while owning and continuously improving production reliability, availability, and Service Level Agreement and Service Level Objective (SLA/SLO) frameworks. You’ll lead incident management, escalation, and 24/7 on-call practices, including post-incident reviews, and embed SRE principles such as error budgets, toil reduction, and reliability engineering into delivery teams. Furthermore, you’ll implement infrastructure and platform automation using Terraform and GitOps methodologies and drive self-healing, auto-scaling, and failure recovery mechanisms using tools such as Karpenter.

In addition to this, you’ll be:

Building secure and scalable networking and service communication using tools such as Cilium
Defining and operating observability platforms using Grafana, Prometheus, Loki, and Tempo
Partnering with DevOps and engineering teams to ensure production readiness and operational excellence
Leading complex troubleshooting across distributed systems and cloud-native environments
Developing reusable “golden paths,” operational runbooks, and reliability patterns
Ensuring platforms meet regulatory, security, and operational risk requirements
Using data, Service Level Indicators (SLIs), and metrics to drive continuous improvement and proactive reliability enhancements

The skills you'll need

We’re looking for a highly experienced Site Reliability Engineer with a strong background in operating large-scale, business-critical platforms and a passion for reliability engineering. You must also have deep expertise in managing production systems on AWS and Kubernetes (EKS), along with strong experience in 24/7 support models, incident management, and on-call leadership.

Moreover, you’ll need to demonstrate advanced knowledge of SRE principles such as SLIs, SLOs, error budgets, and toil reduction, as well as proficiency in Terraform, GitOps, and cloud automation practices. Hands-on experience with GitLab continuous integration and continuous delivery pipelines and Argo CD is also essential.

In addition, you’ll have to bring:

A strong understanding of Kubernetes networking, security, and service mesh technologies, ideally using Cilium
Experience scaling infrastructure using Karpenter and auto-scaling strategies
Expertise in observability tooling, including Grafana, Prometheus, Loki and Tempo
A proven ability to troubleshoot and resolve complex, cross-system production issues
Experience operating in regulated or high-security environments
Strong leadership, mentoring, and stakeholder engagement capabilities
The ability to balance reliability, risk, and delivery in a fast-paced environment

Hours

Job Posting Closing Date: 03/06/2026

Ways of Working: Remote First

250 Bishopsgate, London, United Kingdom, EC2M 4AA
