We are seeking a highly skilled Site Reliability Engineer to join our dynamic IT team. The successful candidate will be responsible for ensuring the stability, scalability, and performance of our cloud-based and on-premises systems. This role involves developing automation solutions, managing infrastructure, and supporting software deployment processes across diverse environments. The ideal applicant will possess a strong background in system administration, cloud computing, and software development, with a keen eye for troubleshooting and incident management. This is an excellent opportunity for professionals passionate about maintaining high-availability systems and driving continuous improvement in system reliability.
Duties
Design, implement, and maintain scalable and reliable infrastructure using tools such as Kubernetes, Terraform, Ansible, Puppet, Chef, and VMware.
Monitor system performance with tools like New Relic, Splunk, Elasticsearch, and Nagios to proactively identify issues before they impact users.
Automate deployment pipelines leveraging Jenkins, GitLab CI/CD, TFS, and other continuous integration tools to streamline software releases.
Manage cloud environments including AWS, Azure, Google Cloud Platform (GCP), and OpenStack to optimise resource utilisation and cost-efficiency.
Develop scripts using PowerShell, Bash (Unix shell), Python, Ruby, Perl, Groovy, or Go to automate routine tasks and improve operational efficiency.
Troubleshoot complex issues related to web services such as REST APIs, web servers such as NGINX, and application servers including WebSphere, WebLogic, or JBoss.
Implement disaster recovery plans and perform incident response activities to minimise downtime during outages or security breaches.
Collaborate with development teams on requirements gathering for new features or system upgrades following SDLC best practices.
Maintain comprehensive documentation of system configurations and procedures aligned with ITIL standards for release management and change control.
Experience
Proven experience in a Site Reliability Engineering or DevOps role within a large-scale enterprise environment.
Extensive knowledge of containerisation and orchestration technologies such as Docker and Kubernetes.
Hands-on experience with cloud platforms including AWS (Amazon S3, EC2), Azure (Virtual Machines), Google Cloud Platform (GCP), or OpenStack.
Strong proficiency in scripting languages such as Python, PowerShell, Bash (Unix shell), Ruby, or Groovy for automation tasks.
Familiarity with configuration management tools like Ansible, Puppet, and Chef; version control with Git on platforms such as GitHub or GitLab; and CI/CD pipelines using Jenkins or TFS.
Experience managing distributed systems architecture involving microservices and APIs over TCP/IP networks.
Knowledge of databases including MySQL, Microsoft SQL Server (T-SQL), and Oracle Database, along with experience in SQL optimisation and disaster recovery planning.
Understanding of computer networking concepts such as DNS, TCP/IP protocols, firewalls, and LAN/WAN configurations.
Ability to troubleshoot software issues across various platforms including Linux (CentOS/Ubuntu) and Windows Server environments.
This role offers an engaging environment where technical expertise is valued and professional growth is encouraged through exposure to cutting-edge technology stacks and best practices in system reliability engineering.
Job Types: Full-time, Permanent, Temp to perm
Contract length: 12 months
Pay: £47,034.80 to £121,866.38 per year
Work Location: Hybrid remote in London E16