Lyrebird Health

Staff Site Reliability Engineer

Reposted 2 Days Ago

In-Office or Remote

Hiring Remotely in Onley, Northamptonshire, England

Expert/Leader

The role focuses on enhancing the reliability and security of Lyrebird's platform by improving production systems, incident response, observability, and operational excellence while collaborating across teams.

The Role

We’re looking for a Staff Site Reliability Engineer (SRE) to raise the reliability, scalability, and security bar across the Lyrebird platform.

This is a senior, high-impact role focused on designing and evolving the systems and practices that keep Lyrebird fast, safe, and available. You’ll work across infrastructure, application reliability, observability, incident response, and platform enablement - partnering closely with Engineering, Security, and Product.

This is not a “keep the lights on” role. You’ll drive meaningful improvements to how we build, deploy, and operate our services in production - with real autonomy and ownership.

About Lyrebird Health

Lyrebird Health is transforming the quality and accessibility of healthcare by automating clinicians’ most time-consuming tasks. Thousands of clinicians across many disciplines already use Lyrebird — and that number is growing every day.

They trust us to deliver a fast, reliable, and secure experience. We value that trust above all else and strive to earn it while continuing to amaze our users.

What You'll Do

Reliability & Production Engineering
Own reliability outcomes across core services and customer-facing systems
Define, implement, and evolve SLOs/SLIs, alerting strategy, and error budgets
Lead initiatives to improve uptime, latency, and overall system resilience
Proactively identify reliability risks and drive mitigation plans to completion
Observability & Incident Response
Improve end-to-end observability (metrics, logs, traces) so issues are detected early and diagnosed quickly
Lead incident response for high-severity events and guide teams through calm, effective mitigation
Drive post-incident reviews that result in measurable, lasting improvements
Build a culture of operational excellence: fewer incidents, faster recovery, better learning
Platform Enablement
Develop internal tooling and paved paths that make “doing the right thing” the easiest option
Improve the developer experience around deployments, rollbacks, environment consistency, and service ownership
Partner with engineers to uplift production-readiness across new and existing services
Infrastructure & Automation
Improve infrastructure reliability and maintainability using Infrastructure as Code
Strengthen deployment workflows and reduce operational toil through automation
Help shape architecture decisions with a reliability and scalability lens
Security & Compliance Support
Embed security and compliance principles into platform practices (access controls, auditability, safe-by-default designs)
Work closely with Security and Engineering leadership to support regulatory and enterprise requirements without slowing down delivery

What We’re Looking For:

8+ years of engineering experience, with significant depth in SRE/platform/production systems
Strong experience operating and improving systems in production (including incident response)
Proven ability to lead cross-team initiatives and influence engineering standards
Technical Strength
You don’t need to tick every box, but you should be strong across most:
Cloud/Infrastructure
AWS (ECS, EC2, VPC, IAM, RDS/Aurora, S3, CloudWatch)
Infrastructure as Code (Terraform)
Observability
Strong grasp of monitoring and alerting principles
Experience with logs + metrics + tracing and building meaningful dashboards
Familiar with OpenTelemetry and modern observability tooling
Systems & Operational Excellence
Knowledge of reliability patterns: graceful degradation, retries, backoff, timeouts, load shedding, capacity planning
Strong debugging instincts across distributed systems
Practical approach to risk management and tradeoffs
Software Engineering
Ability to build tools and automation (TypeScript, Go, Python, or similar)
Familiarity with CI/CD and safe rollout strategies (feature flags, canary, blue/green)

Bonus Skills (Nice to Have):

Experience supporting security frameworks (SOC 2, ISO 27001, HIPAA-style environments)
Experience with service mesh patterns, multi-account AWS environments, or multi-region design
Experience working with healthcare or regulated domains
Experience scaling engineering org practices as the company grows

Who You Are:

You’re deeply accountable - you take ownership of outcomes, not just tasks
You value simplicity and reliability over cleverness
You’re calm and effective in incidents, and you raise the quality bar afterward
You communicate clearly across engineering and non-engineering stakeholders
You’re pragmatic: you know when to move fast, and when to slow down to reduce risk

Why This Role Is Different:

Staff-level scope with real influence across engineering
Direct impact on reliability for a product clinicians depend on every day
Work on meaningful problems where security, performance, and trust matter
High ownership environment with room to shape how the company operates at scale

At Lyrebird, you won’t just respond to incidents - you’ll design the systems and standards that prevent them.

We’re building a team that reflects the diversity of the people who’ll benefit from our work. If you’re from an underrepresented background in tech, we especially encourage you to apply - even if you don’t meet every single requirement.

Top Skills

Aurora, AWS, CloudWatch, EC2, ECS, IAM, OpenTelemetry, Python, RDS, Terraform, TypeScript, VPC
