with a strong systems-thinking mindset and expertise in Site Reliability Engineering (SRE) principles. The primary focus will be to uplift capacity planning and observability controls across a complex technology estate. This role combines deep technical engineering skills with architectural vision and aims to enhance performance, resilience, and operational control.
The ideal candidate will possess a solid blend of hands-on expertise and strategic leadership to align technology capabilities with internal control frameworks and regulatory expectations.
Key Responsibilities:
Lead the design and technical assessment of capacity management, utilization monitoring, and observability controls.
Apply SRE best practices to identify control gaps, performance risks, and automation opportunities.
Evaluate existing tooling, data flows, and operations to propose and implement control remediations.
Collaborate with engineering, infrastructure, architecture, and risk teams to validate technical solutions.
Define reusable technical patterns and tooling strategies for enhanced operational readiness.
Contribute to roadmap planning, tooling evaluations, and documentation for governance and operational preparedness.
Required Skills & Experience:
10+ years in engineering, infrastructure, or architecture roles in complex technology environments.
Strong understanding of compute, storage, and network capacity planning across hybrid/cloud platforms.
Hands-on experience with SRE principles, including observability, SLIs/SLOs, and task automation.
Skilled in interpreting control requirements and embedding them into technical designs.
Experience with performance monitoring and diagnostic tools (e.g., Geneos, Prometheus, Grafana, AppDynamics).
Excellent communication skills with the ability to influence senior stakeholders and risk/control teams.
Desirable:
Experience uplifting operational controls (capacity, availability, performance).
Familiarity with