We are seeking a highly skilled Cloud Site Reliability Engineer to join our dynamic IT team. This role is integral to maintaining, optimizing, and scaling our cloud infrastructure and services across multiple platforms including AWS, Google Cloud Platform, Azure, and private cloud environments. The ideal candidate will possess extensive experience in cloud architecture, virtualization, and automation, ensuring the reliability and performance of our cloud-based applications and services. This position offers an exciting opportunity to work on cutting-edge cloud development projects within a collaborative and innovative environment.
Duties
Design, implement, and manage scalable, reliable cloud infrastructure using platforms such as AWS, GCP, Azure, VMware, OpenStack, Rackspace, and Citrix.
Develop automation scripts and tools utilizing Python, Bash, PowerShell, Ansible, Puppet, Chef, Terraform, and other DevOps tools to streamline deployment processes.
Monitor system health and performance metrics; troubleshoot issues related to web services, SaaS solutions, PaaS platforms, and IaaS environments.
Perform system hardening and apply security best practices to ensure compliance with industry standards across Linux, Windows, and UNIX systems.
Collaborate with development teams to implement CI/CD pipelines using Jenkins and Git, deploying to Docker containers and Kubernetes clusters in a microservices architecture.
Manage databases such as MySQL, PostgreSQL, Oracle, and SQL Server, with expertise in SQL, T-SQL, and PL/SQL for data integrity and performance tuning.
Architect and maintain RESTful APIs and web services that support enterprise applications, including IoT integrations.
Ensure high availability through effective system management of virtualization technologies like VMware and OpenStack; optimize cloud architecture for cost-efficiency and scalability.
Implement system security measures, including VPNs, system hardening techniques, and network segmentation using Meraki or similar tools.
Participate in Agile SDLC processes to deliver reliable software solutions; contribute to incident response planning and disaster recovery strategies.
Experience
A minimum of 14 years of experience, with a strong grounding in SRE principles and hands-on experience with Azure cloud landing zones. Candidates with the following skills would be the best fit for the SRE lead position.
Deep understanding of Site Reliability Engineering concepts:
SLIs, SLOs, SLAs, and Error Budgets.
Incident management and blameless postmortems.
Proven ability to design and implement reliability strategies in Azure environments.
Expertise in observability frameworks: Metrics, Logs, Traces for distributed systems.
Hands-on experience with:
Azure Monitor, Log Analytics, Application Insights.
Integration with Prometheus, Grafana, OpenTelemetry.
Ability to define custom SLIs and build dashboards for real-time health monitoring.
Strong skills in Infrastructure as Code (IaC): Terraform/OpenTofu, Bicep, or ARM templates.
Automation of operational tasks using PowerShell, Python, or Azure CLI.
Experience with CI/CD pipelines (Azure DevOps, GitHub Actions).
Advanced knowledge of Azure services: Compute (VMs, AKS), Networking (VNet, Load Balancer), Storage, and other IaaS and PaaS services.
Capacity planning, performance tuning, and auto-scaling strategies for Azure landing zones.
Skilled in incident detection and resolution, including identifying recurring patterns and eliminating false positives.
Disaster recovery planning and chaos engineering practices.
Familiarity with Microsoft Defender for Cloud (formerly Azure Security Center).
Implementing RBAC, identity governance, and compliance frameworks.
Job Type: Fixed term contract
Contract length: 12 months
Pay: 385.00-425.00 per day
Work Location: Remote