Site Reliability Engineer (SRE)

Monstro

Posted 3 Days Ago

In-Office

London, Greater London, England, GBR

Mid level

The Site Reliability Engineer will build and maintain the reliability and observability of Monstro's platform on Google Cloud, manage incidents, and automate processes to reduce toil.

About Monstro

Monstro is the operating system for governed financial intelligence. We build governance and intelligence infrastructure that enables artificial intelligence to operate safely, explainably, and at institutional scale.
We exist because the level of financial guidance historically available to a small group should be accessible to many more people. By combining AI with deep institutional infrastructure, we help financial institutions deliver more personalized, responsible, and life-changing financial support to millions of individuals.
We’re building mission-critical systems in a highly regulated domain, and we care deeply about doing it right. If you’re motivated by meaningful problems, high standards, and shaping infrastructure that improves financial outcomes, you’ll feel at home here.

About the Role

Monstro is building a secure, multi-tenant platform on Google Cloud, and we’re hiring a Site Reliability Engineer to own the reliability and observability of that platform end-to-end.

This is a hands-on role for someone who wants to do real SRE work, not a rebrand of L1 support. You’ll write the dashboards, define the SLOs, build the automation that kills toil, and take your turn on the on-call rotation that proves it all works. When something breaks at 2 AM, you’re the person who keeps it running; when nothing’s breaking, you’re the person making sure the next break is smaller, shorter, or doesn’t happen at all.

What You’ll Do

Observability and reliability engineering

Define and maintain SLOs and SLIs for our tier-1 services: API gateway, application services, identity, and edge availability
Build canonical dashboards and alerts in Google Cloud Monitoring, backed by structured logs and BigQuery log analytics
Tune alert routing so every page is actionable, and eliminate the rest
Instrument services for distributed tracing and structured logging; push back on services that ship without it
Own error budgets and use them to prioritize reliability work over feature work when a budget is burned
Reduce toil: automate away the top recurring page from the previous quarter
Maintain runbooks so every page maps to a runbook within a cycle of its first occurrence

On-call rotation and incident response

Act as first responder for production alerts across monitoring, API gateway, edge defense, and CI
Triage severity, run the incident bridge, drive mitigation (revision rollback, traffic shift, scaling, edge block, credential rotation)
Own internal and external incident comms during your shift
Drive postmortems to closure with action items tracked as audit evidence
Write clean handoffs at the end of each shift

Our stack

Google Cloud Platform across multiple environments
Apigee X for API management
Cloud Run, GKE Autopilot, Cloud SQL
Identity Platform for customer identity
Cloud Armor, Cloud IDS, Security Command Center for edge and posture
BigQuery-backed log analytics from an org-level log sink
OpenTofu / Terraform for everything; GitHub Actions for CI/CD
Linear for work tracking

What You Bring

Required:

Solid production experience on GCP (or comparable AWS/Azure depth with willingness to ramp on GCP fast)
Comfortable on-call: you’ve run incidents, written postmortems, and shipped the action items
Strong observability fundamentals: SLOs, log-based metrics, alert hygiene, dashboard discipline
Working knowledge of Kubernetes, API gateways, identity systems, and at least one IaC tool
Scripting / coding fluency (Python, Go, Bash) for automation and tooling
Good written communication — handoffs, postmortems, and runbooks are part of the job
Bias toward fixing the system, not the symptom

Nice to Have:

Apigee or another enterprise API gateway in production
BigQuery for log analytics or audit
Experience standing up observability from scratch, not just maintaining inherited dashboards
SOC2 or similar compliance environments

Why Join Us

You’ll be at the centre of how we bring Monstro to life for our institutional clients. Your work directly shapes the success of every implementation—getting requirements right means we deliver faster, smoother, and with fewer surprises. You’ll be joining at a foundational moment, helping to build the delivery practice from the ground up alongside a Delivery Manager who will rely on you as a critical partner from day one.

If you enjoy the puzzle of understanding complex environments, the satisfaction of a well-organised document, and the energy of working directly with clients, this is your role.

Why Monstro

Ownership & Impact: Shape the future of AI-powered finance by building a category-defining product used by consumers and institutions around the world.
Experienced Team: Join a team with leadership that has a track record of scaling companies from early stage to major exits.
Principles-Driven Culture: Work in a culture that values speed, ownership, and impact. What most companies achieve in 90 days, we do in 45.
Competitive Salary: With potential for expanded compensation and benefits as our new office grows.

A Note on Interviewing: We sometimes use AI note-takers to help us transcribe interview notes, so we can be more present in your interview. If you’d like to opt out of us using automatic transcribers, please note this in the free text field in your application; otherwise, we’ll take your application as confirmation that you’re happy for us to use note-takers (whether added to video calls or in the background).

We are an equal opportunity employer and value diversity. We do not discriminate on the basis of race, religion, colour, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.
