Thredd

Site Reliability Engineer

Posted 2 Days Ago
London, Greater London, England
Entry level
As the first Site Reliability Engineer at Thredd, you will build CI/CD pipelines, drive automation and observability, create scalable cloud infrastructure, and implement reliability strategies.

Are you passionate about building reliable, scalable, and high-performing systems? Do you thrive on solving complex infrastructure challenges while driving automation and observability best practices? If so, we want to hear from you!

At Thredd, we’re looking for a Site Reliability Engineer to act as a North Star for this evolving discipline. As our first engineer in this role, you’ll have the unique opportunity to shape our SRE strategy, establish best practices, and set the standard for service reliability and performance.

What You’ll Do
Define strategies for Application Performance Monitoring, Unit Cost, and Chaos Engineering.
Continuously optimize production environments to enhance reliability and efficiency.
Implement and apply MTTR, SLO, and SLI principles to ensure high service standards.
Respond to incidents, analyze root causes, and drive long-term improvements.
Maintain fault-tolerant, scalable, and cost-effective infrastructure and services.
Monitor availability, latency, and system health to keep our platform running smoothly.
Lead blameless postmortems and refine our incident response processes.
Provide feedback loops to development teams on operational gaps and resiliency concerns.
Support services before they go live with system design consulting, capacity planning, and launch reviews.
Scale systems sustainably through automation and infrastructure evolution.
Deeply understand our customers’ needs and the critical role Thredd plays in their businesses.

What You’ll Be Working On
Building and maintaining the infrastructure, tooling, and technical foundation of Thredd.
Ensuring high service uptime and reliability so product teams can innovate effectively.
Playing a key role in shaping the core technology layers that drive our platform’s success.

What You Need
Proven experience implementing SRE principles at scale, including a deep understanding of how SLIs, SLOs, and SLAs differ.
A product engineering background with strong coding skills in Python, C#, or similar.
Experience with incident management frameworks and evolving them for efficiency.
Expertise in cloud platforms (AWS preferred), containers, and container orchestration (Docker, Kubernetes, ECS).
Solid understanding of microservices, service mesh, and modern architectural concepts.
A collaborative mindset – you thrive on helping others and driving company-wide impact.

Nice to Have
Experience working in regulated industries (e.g., under PCI compliance requirements).
Background in capacity planning, performance, and load testing.
Sysadmin skills for troubleshooting disk, network, and infrastructure issues.

Why Join Thredd?
The chance to define and lead SRE best practices from the ground up.
A high-impact role in a rapidly growing company.
A collaborative, innovation-driven culture where your expertise will shape our platform’s future.
If you’re excited about scaling infrastructure, improving reliability, and making a real impact, apply now and help us build the future of Thredd! 🚀



Top Skills

Java

HQ

Thredd London, England Office

Kingsbourne House, 229-231 High Holborn, London, United Kingdom, WC1V 7DA
