Lyrebird Health

Staff Site Reliability Engineer

Posted 6 Days Ago
In-Office or Remote
Hiring Remotely in Onley, Northamptonshire, England
Expert/Leader
The role focuses on enhancing reliability and security of Lyrebird's platform by improving production systems, incident response, observability, and operational excellence while collaborating across teams.
The Role

We’re looking for a Staff Site Reliability Engineer (SRE) to raise the reliability, scalability, and security bar across the Lyrebird platform.

This is a senior, high-impact role focused on designing and evolving the systems and practices that keep Lyrebird fast, safe, and available. You’ll work across infrastructure, application reliability, observability, incident response, and platform enablement - partnering closely with Engineering, Security, and Product.

This is not a “keep the lights on” role. You’ll drive meaningful improvements to how we build, deploy, and operate our services in production - with real autonomy and ownership.


About Lyrebird Health

Lyrebird Health is transforming the quality and accessibility of healthcare by automating clinicians’ most time-consuming tasks. Thousands of clinicians across many disciplines already use Lyrebird — and that number is growing every day.

They trust us to deliver a fast, reliable, and secure experience. We value that trust above all else and strive to earn it while continuing to amaze our users.

What You'll Do

  • Reliability & Production Engineering
      • Own reliability outcomes across core services and customer-facing systems
      • Define, implement, and evolve SLOs/SLIs, alerting strategy, and error budgets
      • Lead initiatives to improve uptime, latency, and overall system resilience
      • Proactively identify reliability risks and drive mitigation plans to completion
  • Observability & Incident Response
      • Improve end-to-end observability (metrics, logs, traces) so issues are detected early and diagnosed quickly
      • Lead incident response for high-severity events and guide teams through calm, effective mitigation
      • Drive post-incident reviews that result in measurable, lasting improvements
      • Build a culture of operational excellence: fewer incidents, faster recovery, better learning
  • Platform Enablement
      • Develop internal tooling and paved paths that make “doing the right thing” the easiest option
      • Improve the developer experience around deployments, rollbacks, environment consistency, and service ownership
      • Partner with engineers to uplift production-readiness across new and existing services
  • Infrastructure & Automation
      • Improve infrastructure reliability and maintainability using Infrastructure as Code
      • Strengthen deployment workflows and reduce operational toil through automation
      • Help shape architecture decisions with a reliability and scalability lens
  • Security & Compliance Support
      • Embed security and compliance principles into platform practices (access controls, auditability, safe-by-default designs)
      • Work closely with Security and Engineering leadership to support regulatory and enterprise requirements without slowing down delivery

What We’re Looking For:

  • 8+ years of engineering experience, with significant depth in SRE, platform, and production systems
  • Strong experience operating and improving systems in production (including incident response)
  • Proven ability to lead cross-team initiatives and influence engineering standards

Technical Strength

You don’t need to tick every box, but you should be strong across most:

  • Cloud/Infrastructure
      • AWS (ECS, EC2, VPC, IAM, RDS/Aurora, S3, CloudWatch)
      • Infrastructure as Code (Terraform)
  • Observability
      • Strong grasp of monitoring and alerting principles
      • Experience with logs, metrics, and tracing, and with building meaningful dashboards
      • Familiarity with OpenTelemetry and modern observability tooling
  • Systems & Operational Excellence
      • Knowledge of reliability patterns: graceful degradation, retries, backoff, timeouts, load shedding, capacity planning
      • Strong debugging instincts across distributed systems
      • Practical approach to risk management and tradeoffs
  • Software Engineering
      • Ability to build tools and automation (TypeScript, Go, Python, or similar)
      • Familiarity with CI/CD and safe rollout strategies (feature flags, canary, blue/green)

Bonus Skills (Nice to Have):

  • Experience supporting security frameworks (SOC 2, ISO 27001, HIPAA-style environments)
  • Experience with service mesh patterns, multi-account AWS environments, or multi-region design
  • Experience working with healthcare or regulated domains
  • Experience scaling engineering org practices as the company grows

Who You Are:

  • You’re deeply accountable - you take ownership of outcomes, not just tasks
  • You value simplicity and reliability over cleverness
  • You’re calm and effective in incidents, and you raise the quality bar afterward
  • You communicate clearly across engineering and non-engineering stakeholders
  • You’re pragmatic: you know when to move fast, and when to slow down to reduce risk

Why This Role Is Different:

  • Staff-level scope with real influence across engineering
  • Direct impact on reliability for a product clinicians depend on every day
  • Work on meaningful problems where security, performance, and trust matter
  • High ownership environment with room to shape how the company operates at scale

At Lyrebird, you won’t just respond to incidents - you’ll design the systems and standards that prevent them.

We’re building a team that reflects the diversity of the people who’ll benefit from our work. If you’re from an underrepresented background in tech, we especially encourage you to apply - even if you don’t meet every single requirement.

Top Skills

Aurora
AWS
CloudWatch
EC2
ECS
Go
IAM
OpenTelemetry
Python
RDS
S3
Terraform
TypeScript
VPC


