The Role
We’re looking for a Staff Site Reliability Engineer (SRE) to raise the reliability, scalability, and security bar across the Lyrebird platform.
This is a senior, high-impact role focused on designing and evolving the systems and practices that keep Lyrebird fast, safe, and available. You’ll work across infrastructure, application reliability, observability, incident response, and platform enablement - partnering closely with Engineering, Security, and Product.
This is not a “keep the lights on” role. You’ll drive meaningful improvements to how we build, deploy, and operate our services in production - with real autonomy and ownership.
About Lyrebird Health
Lyrebird Health is transforming the quality and accessibility of healthcare by automating clinicians’ most time-consuming tasks. Thousands of clinicians across many disciplines already use Lyrebird — and that number is growing every day.
They trust us to deliver a fast, reliable, and secure experience. We value that trust above all else and strive to earn it while continuing to amaze our users.
What You'll Do
- Reliability & Production Engineering
  - Own reliability outcomes across core services and customer-facing systems
  - Define, implement, and evolve SLOs/SLIs, alerting strategy, and error budgets
  - Lead initiatives to improve uptime, latency, and overall system resilience
  - Proactively identify reliability risks and drive mitigation plans to completion
- Observability & Incident Response
  - Improve end-to-end observability (metrics, logs, traces) so issues are detected early and diagnosed quickly
  - Lead incident response for high-severity events and guide teams through calm, effective mitigation
  - Drive post-incident reviews that result in measurable, lasting improvements
  - Build a culture of operational excellence: fewer incidents, faster recovery, better learning
- Platform Enablement
  - Develop internal tooling and paved paths that make “doing the right thing” the easiest option
  - Improve the developer experience around deployments, rollbacks, environment consistency, and service ownership
  - Partner with engineers to uplift production-readiness across new and existing services
- Infrastructure & Automation
  - Improve infrastructure reliability and maintainability using Infrastructure as Code
  - Strengthen deployment workflows and reduce operational toil through automation
  - Help shape architecture decisions with a reliability and scalability lens
- Security & Compliance Support
  - Embed security and compliance principles into platform practices (access controls, auditability, safe-by-default designs)
  - Work closely with Security and Engineering leadership to support regulatory and enterprise requirements without slowing down delivery
What We’re Looking For:
- 8+ years of engineering experience, with significant depth in SRE, platform, or production systems
- Strong experience operating and improving systems in production (including incident response)
- Proven ability to lead cross-team initiatives and influence engineering standards
Technical Strength
You don’t need to tick every box, but you should be strong across most:
- Cloud/Infrastructure
  - AWS (ECS, EC2, VPC, IAM, RDS/Aurora, S3, CloudWatch)
  - Infrastructure as Code (Terraform)
- Observability
  - Strong grasp of monitoring and alerting principles
  - Experience with logs, metrics, and tracing, and with building meaningful dashboards
  - Familiarity with OpenTelemetry and modern observability tooling
- Systems & Operational Excellence
  - Knowledge of reliability patterns: graceful degradation, retries, backoff, timeouts, load shedding, capacity planning
  - Strong debugging instincts across distributed systems
  - Practical approach to risk management and tradeoffs
- Software Engineering
  - Ability to build tools and automation (TypeScript, Go, Python, or similar)
  - Familiarity with CI/CD and safe rollout strategies (feature flags, canary, blue/green)
Bonus Skills (Nice to Have):
- Experience supporting security frameworks (SOC 2, ISO 27001, HIPAA-style environments)
- Experience with service mesh patterns, multi-account AWS environments, or multi-region design
- Experience working with healthcare or regulated domains
- Experience scaling engineering org practices as the company grows
Who You Are:
- You’re deeply accountable - you take ownership of outcomes, not just tasks
- You value simplicity and reliability over cleverness
- You’re calm and effective in incidents, and you raise the quality bar afterward
- You communicate clearly across engineering and non-engineering stakeholders
- You’re pragmatic: you know when to move fast, and when to slow down to reduce risk
Why This Role Is Different:
- Staff-level scope with real influence across engineering
- Direct impact on reliability for a product clinicians depend on every day
- Work on meaningful problems where security, performance, and trust matter
- High-ownership environment with room to shape how the company operates at scale
At Lyrebird, you won’t just respond to incidents - you’ll design the systems and standards that prevent them.
We’re building a team that reflects the diversity of the people who’ll benefit from our work. If you’re from an underrepresented background in tech, we especially encourage you to apply - even if you don’t meet every single requirement.
Top Skills
Aurora
AWS
CloudWatch
EC2
ECS
Go
IAM
OpenTelemetry
Python
RDS
S3
Terraform
TypeScript
VPC