to lead modernization initiatives across IT operations by establishing robust observability practices and automating manual processes (toil). The ideal candidate will combine strategic thinking with deep hands-on expertise to drive reliability, scalability, and efficiency across complex technology landscapes. This role requires strong leadership, advanced technical proficiency, and the ability to foster a culture of reliability and continuous improvement.
Primary Responsibilities
Operational Modernization & Strategy
Collaborate with product engineering teams to define and implement strategies that modernize IT operations, enhance observability, and reduce toil.
Architect, deploy, and optimize observability platforms to monitor system health, performance, and reliability.
Define and drive strategies for AI-driven alerting, proactive anomaly detection, and event correlation to reduce MTTD and MTTR.
Develop and implement SRE practices including Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budget policies.
Create and maintain an AIOps roadmap to improve operational efficiency and accelerate automation initiatives.
Automation & Reliability Engineering
Automate repetitive processes using scripting, orchestration tools, and AI/ML-driven automation models.
Drive initiatives for automated incident response, self-healing workflows, and autonomous operations.
Enable shift-left engineering practices by partnering with engineering, architecture, and product teams to improve system reliability early in the development lifecycle.
Lead continuous improvement initiatives focusing on reducing operational burden and improving resilience across systems and services.
Incident Management & Root Cause Analysis
Oversee and enhance incident management processes through automation and structured problem-solving.
Conduct root cause analyses and drive remediation efforts to prevent recurrence and strengthen system reliability.
Collaboration & Leadership
Work cross-functionally to ensure systems are built to be scalable, resilient, and maintainable.
Mentor teams in adopting SRE principles, tools, and modern operational practices.
Champion a culture of automation, observability, and reliability across the organization.
Key Skills & Technical Expertise
Core Competencies
Strong proficiency in applying SRE principles across large-scale environments.
Advanced hands-on experience with observability tools, specifically Dynatrace and Datadog.
Expertise in automation and scripting using Python and Ansible.
Robust experience with cloud platforms including AWS and Azure.
Deep understanding of containerization and orchestration using Docker and Kubernetes.
Strong knowledge of cloud-native architectures and distributed systems.
Exposure to AI/ML-driven predictive analytics, anomaly detection, and automated remediation.
Familiarity with CI/CD pipelines and automated release and deployment practices.
Desirable Skills
Experience with chaos engineering platforms such as Gremlin or Chaos Monkey.
Knowledge of resilience testing frameworks and reliability scoring models.
Ability to manage multiple initiatives simultaneously in fast-moving environments.
Excellent communication, collaboration, analytical, and decision-making skills.
Strategic mindset that balances technical innovation with business priorities.
Preferred Qualifications
12+ years of experience in SRE, DevOps, or IT operations roles.
Proven track record implementing observability, AIOps, and automation solutions at enterprise scale.
Certifications in cloud platforms, observability tools, or SRE-related disciplines.
Job Type: Fixed-term contract
Contract length: 12 months
Pay: 80,000.00-85,000.00 per year
Benefits:
Life insurance
Sabbatical
Work Location: Hybrid remote in Hove BN3 3YU