xAI

Site Reliability Engineer (SRE)

Posted 23 Days Ago
3 Locations
Mid level
The role involves enhancing observability, creating dashboards, developing alerts, managing on-call rotations, and refining the deployment process for reliability in a dynamic environment.
The summary above was generated by AI

About xAI

The xAI London team is a group of software engineers focused on large-scale, highly reliable distributed systems. We work across many levels of the stack, from build systems to production backend infrastructure and frontend development. For example, we built large parts of the Grok production stack. We focus on building high-quality software and aren’t afraid to delve into technically complex topics to solve problems the right way.

About the role

We’re looking for an experienced site reliability engineer (SRE) who can thrive in a dynamic start-up environment. The main responsibilities for this role are:

  1. Improving our observability by adding/adjusting metrics,
  2. Building easily parsable dashboards,
  3. Building reliable alerts,
  4. Designing and overseeing our on-call rotations,
  5. Improving our deployment process to increase reliability.

An ideal candidate meets at least the following requirements:

  1. Expert in at least one programming language that compiles to machine code such as Rust, C++, or Go. Rust or C++ experience is preferred,
  2. Expert knowledge of monitoring technologies such as Prometheus, Grafana, and PagerDuty,
  3. Expert knowledge of deployment technologies such as Pulumi or Terraform,
  4. Expert knowledge of Kubernetes.

Location

The role is based in our London office close to Piccadilly Circus underground station. We usually work from the office 5 days a week but allow for work-from-home days when required. Candidates must be willing to attend late meetings at least twice a week to coordinate with the rest of our team, which is based in California. This role includes semi-regular business trips to California.

Interview process

After you submit your application, the team reviews your CV and statement of exceptional work. If your application passes this stage, you will be invited to a 15-minute interview (“phone interview”) during which a member of our team will ask some basic questions. If you clear the initial phone interview, you will enter the main process, which consists of four technical interviews:

  1. Coding interview in Rust, C++, or Go,
  2. Monitoring & deployment design interview,
  3. Distributed systems design interview,
  4. Meet the wider team and give a 20-minute presentation about the most difficult technical problems you have solved.

Our goal is to finish the process within one week. We don’t rely on recruiters for assessments. Every application is reviewed by a member of our technical team. All interviews will be conducted via Google Meet.

Benefits

  • Competitive cash-based compensation
  • xAI equity
  • Private health and dental insurance
  • Unlimited time off subject to prior approval

Top Skills

C++
Go
Rust
