Site Reliability Engineer

London, ENG, GB, United Kingdom

Job Description

Our purpose is to make great financial decision making a breeze for everyone, and that purpose drives us every day.

It's why we're on a mission to create an automated quoting engine, with the simplest of experiences, wrapped in a brand everyone loves!

We change lives by making it simple to switch and save money. So, when it comes to getting a better deal, it's never been more blindingly obvious why you would choose Compare the Market.


We'd love you to be part of our journey.



As the Site Reliability Engineer, you will ensure the highest levels of system uptime and performance, contributing directly to the trust and reliability of our services and infrastructure. You will work collaboratively with engineering teams to design, implement, and maintain robust systems that can withstand the challenges of a rapidly evolving technology landscape. You will bridge the gap between development and operations, with a focus on automation, scalability, and system stability.


Everyone is welcome. Be you.





We have a culture of creativity. We approach our work passionately, improve constantly and celebrate our wins at every turn. We are an inclusive workplace, and our employees are comfortable bringing their authentic, whole selves to work.

This means we're excited to hear from people with a range of skills, experiences, and ideas. We don't expect you to tick all the boxes but would love to hear what makes you great for this role.


Some of the great things you'll be doing:



As a Site Reliability Engineer, you'll help shape and maintain the reliability, performance, and scalability of our systems. You'll work closely with software engineering and platform teams to embed SRE practices and improve how we measure, manage, and automate service health.


You'll be responsible for:



  • Service Reliability & SLOs - Partner with product and platform teams to define, monitor, and uphold meaningful service-level objectives (SLOs). Use error budgets to guide decisions on stability vs. change velocity.
  • Incident Response & Readiness - Participate in and lead incident response, drive blameless postmortems, and ensure meaningful follow-ups that prevent recurrence.
  • Observability & Insights - Strengthen how teams instrument and monitor systems, ensuring actionable metrics, logs, and traces are in place to catch issues early.
  • Resilience Engineering - Help teams design systems that handle partial failure gracefully, with retry strategies, fallbacks, circuit breakers, and chaos testing where appropriate.
  • Toil Reduction & Automation - Identify manual and repetitive work, and engineer it away through tooling, automation, or process improvements.
  • Capacity & Risk Management - Contribute to capacity planning efforts and reliability risk reviews, helping the business scale safely.
  • Collaboration & Enablement - Support and mentor teams by sharing reliability knowledge, participating in design reviews, and driving consistency in operational excellence.

What We're Looking For



We're looking for candidates with a mix of the following experience and qualities. Not all are required -- if you meet most, we encourage you to apply.

  • Demonstrated ability to define and implement SLOs and error budgets, and to use them to drive service reliability decisions.
  • Experience leading or contributing significantly to incident response and postmortem processes, with a focus on continuous learning and systemic improvement.
  • A passion for building internal tools that reduce cognitive load and make reliability scalable across teams.
  • A mindset of continuous improvement, ownership, and collaboration -- you're proactive about solving problems and raising the bar for how we operate.
  • Experience using observability tools (e.g., OpenTelemetry or Dynatrace) to debug complex systems and guide reliability improvements.

Bonus Points For

  • A strong grasp of Linux systems and networking fundamentals, and experience managing production workloads in a cloud environment (preferably AWS).
  • Proficiency in at least one programming language (e.g., Python, Go, JavaScript/Node.js, or C#), with a focus on building automation and reliability tooling.
  • Familiarity with Kubernetes and the ability to support containerized services with an eye toward resilience and fault tolerance.
  • Familiarity with postmortem culture, chaos testing, or game days.
  • A passion for solving operational problems through engineering, not just manual intervention.

There's something for everyone.



We're a place of opportunity. You'll have the tools and autonomy to drive your own career, supported by a team of amazingly talented people.

And then there are our benefits. For us, it's not just about a competitive salary and hybrid working; we care about what matters to you. From a generous holiday allowance and private healthcare to an electric car scheme and paid CSR days, we've pretty much got you covered!





Job Detail

  • Job Id
    JD3444362
  • Industry
    Not mentioned
  • Total Positions
    1
  • Job Type
    Full Time
  • Salary
    Not mentioned
  • Employment Status
    Permanent
  • Job Location
    London, ENG, GB, United Kingdom
  • Education
    Not mentioned