Senior SRE Engineer

Encord

Reposted 2 Days Ago

In-Office

London, Greater London, England, GBR

Senior level

Lead and operate Encord's core infrastructure to ensure performant, reliable, observable, and scalable services. Profile and optimize large-scale data pipelines, perform capacity planning, define SLIs/SLOs, build incident response and runbooks, manage Kubernetes and cloud infrastructure (GCP/AWS) at petabyte scale, and drive automation, observability, and reliability best practices across engineering squads.

The summary above was generated by AI

About us

Encord is the universal data layer for AI that helps 300+ AI teams train and run models on the right data. Our platform indexes, curates, annotates, and evaluates data across the full AI lifecycle, from development through production.

Trusted by Woven by Toyota, AXA, UiPath, Zipline, and more. We're an ambitious team of 100+ working at the frontier of AI and have raised $60M in Series C funding from Wellington Management, CRV, Next47 and Y Combinator.

The role

We're looking for a Senior Site Reliability Engineer to join our growing platform engineering team. You'll be embedded in the teams building and operating Encord's core infrastructure, ensuring our platform is performant, reliable, observable, and scalable.

You will lead the planning and execution of the work needed as we grow our customer base from hundreds to thousands of AI teams worldwide, and as the volume of AI training and supervision data managed by our platform grows from terabytes to petabytes.

You'll drive a culture of performant and resilient software through individual contributions and collaboration with multiple squads.

What You'll Do

Performance & Capacity — Profile and optimise services handling large-scale data pipelines; perform capacity planning for storage and compute-intensive workloads; work with squads to establish performance benchmarks and expectations.
Collaboration — Partner closely with backend and ML engineers to improve deployment pipelines (CI/CD), review infrastructure changes, and champion reliability best practices.
Reliability & Availability — Define and own SLIs/SLOs/SLAs for critical services; build alerting, runbooks, and incident response processes; lead postmortems with a blameless culture.
Infrastructure & Cloud — Design, deploy, and maintain cloud infrastructure on GCP and AWS; manage Kubernetes clusters, networking, and storage at petabyte scale.
Automation & Tooling — Work to improve developer productivity and guide and review automation and tooling efforts across the engineering group.
Observability — Instrument services with distributed tracing, logging, and metrics (Prometheus, Grafana, OpenTelemetry, Datadog or similar); build infrastructure, define best practices and work with each squad to ensure every service is observable before it goes to production.

What We're Looking For

Hands-on experience in SRE, DevOps, platform engineering, or a similar role in a production environment.
Strong fundamentals in designing, building, and maintaining resilient distributed and/or high-performance systems.
Solid understanding of networking, operating systems, and database technologies.
Experience with observability fundamentals — metrics, logs, traces, and alerting.
Comfortable with on-call rotations and incident management.

Tech stack

We are technology agnostic at Encord and not looking for experience across all of these — as long as you're open to learning, please apply.
- Backend: Python and Rust
- Frontend: TypeScript and React
- Deployment: Kubernetes
- Infrastructure: GCP

Why Encord

Competitive salary, commission, and meaningful equity in a high-growth startup
Strong in-person culture — most of the team works from our London office 4+ days/week
25 days annual leave + UK public holidays
Annual learning & development budget
Travel for customer visits, events, and conferences across the UK and Europe
Company lunches twice a week
Monthly socials & bi-annual team offsites

Eastcastle St, London, United Kingdom, W1W 8DE
