Nexsys analytics is a consultancy and software company specialising in water resources management, infrastructure investment, decision-making under uncertainty, and trade-off analysis. We are currently investing in the development of a series of large-scale optimisation projects, with the goal of identifying the most efficient ways to invest in water infrastructure to safeguard future generations.
Description of the project
We are seeking a UK-based DevOps engineer, preferably located in the North of England, with expertise in cluster job scheduling (Slurm or similar), Docker, and big data storage, to deliver a 3-month project. The goal is to build a system that allows optimisation jobs to be submitted via our existing API and deployed automatically on a compute cluster with the requested number of cores (if available), with the results stored in a big data database for later retrieval and analysis.
The project objectives
We have an existing task/job management system which deploys jobs via Docker containers on our internal cluster. We seek to expand and augment this system and its API to be more scalable and robust, using a cluster job scheduler such as Slurm.
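To make the intended integration concrete, here is a minimal Python sketch of handing a containerised job to Slurm via `sbatch`. The function name, image tag, and mount path are hypothetical placeholders, not part of our existing API; the actual design is part of the project scope.

```python
"""Illustrative sketch only: handing a containerised optimisation job to Slurm.

Assumes Docker (or a rootless runtime) is available on the compute nodes.
"""
import subprocess
import textwrap

CONTAINER_IMAGE = "nexsys/optimiser:latest"  # hypothetical image name

def submit_job(job_id: str, cores: int, params_path: str) -> str:
    """Submit a containerised run to Slurm and return the Slurm job ID."""
    script = textwrap.dedent(f"""\
        #!/bin/bash
        #SBATCH --job-name={job_id}
        #SBATCH --cpus-per-task={cores}
        docker run --rm --cpus={cores} -v {params_path}:/input \\
            {CONTAINER_IMAGE} /input
    """)
    # sbatch reads the batch script from stdin when no file is given;
    # --parsable makes it print just the job ID rather than a sentence.
    result = subprocess.run(
        ["sbatch", "--parsable"],
        input=script, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```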
The responsibilities of the role will include:
- Cluster Setup & Orchestration - Deploy and configure a job scheduler (Slurm preferred) for scheduling containerised jobs. Implement fair resource allocation (CPU cores, memory, job queues).
- Containerised Execution - Develop Docker images for running optimisation workflows. Ensure reproducibility and portability of the runtime environment.
- Data Storage & Management - Design a schema for storing large optimisation outputs in a scalable database (e.g. PostgreSQL + TimescaleDB, MongoDB, or an HDFS/S3-backed store); see the schema sketch after this list. Implement pipelines for ingesting results directly from job execution.
- API Augmentation - Extend the existing REST API (see the API sketch after this list) with endpoints to: submit a job request (with parameters, including the number of cores); retrieve job status (queued, running, finished, failed); and access job outputs from the database.
- Monitoring & Observability - Integrate monitoring and logging (Grafana/Prometheus/ELK). Provide dashboards for cluster utilisation and job performance (see the metrics sketch after this list).
- Documentation & Handover - Deliver Infrastructure-as-Code templates (Terraform/Ansible preferred). Provide technical documentation and a runbook for long-term operation.
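For the data storage item above, the following is a minimal schema sketch, assuming the PostgreSQL + TimescaleDB option is chosen. Table, column, and connection names are hypothetical placeholders, not a fixed design.

```python
"""Illustrative schema sketch for optimisation outputs (PostgreSQL + TimescaleDB)."""
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS optimisation_results (
    job_id      UUID             NOT NULL,
    recorded_at TIMESTAMPTZ      NOT NULL,
    metric      TEXT             NOT NULL,  -- e.g. objective value, constraint slack
    value       DOUBLE PRECISION,
    metadata    JSONB,                      -- solver settings, core count, etc.
    PRIMARY KEY (job_id, recorded_at, metric)
);
-- A TimescaleDB hypertable partitions the results by time for scalable ingest.
SELECT create_hypertable('optimisation_results', 'recorded_at', if_not_exists => TRUE);
"""

with psycopg2.connect("dbname=results") as conn:  # connection string is a placeholder
    with conn.cursor() as cur:
        cur.execute(DDL)
```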
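For the API augmentation item, here is a minimal FastAPI sketch of the three endpoint families named above. Routes, models, and response shapes are hypothetical; the bodies are stubs where the scheduler and database integrations would plug in.

```python
"""Illustrative FastAPI sketch of job submission, status, and results endpoints."""
from enum import Enum
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class JobRequest(BaseModel):
    cores: int        # requested number of CPU cores
    parameters: dict  # optimisation parameters; schema to be agreed

class JobStatus(str, Enum):
    queued = "queued"
    running = "running"
    finished = "finished"
    failed = "failed"

@app.post("/jobs")
def submit_job(req: JobRequest) -> dict:
    # Hand the request to the scheduler (see the Slurm sketch above).
    job_id = "..."  # would be returned by the scheduler integration
    return {"job_id": job_id, "status": JobStatus.queued}

@app.get("/jobs/{job_id}/status")
def job_status(job_id: str) -> dict:
    # Query the scheduler (e.g. via squeue/sacct) for the current state.
    return {"job_id": job_id, "status": JobStatus.running}

@app.get("/jobs/{job_id}/results")
def job_results(job_id: str) -> dict:
    # Fetch outputs from the results database once the job has finished.
    return {"job_id": job_id, "results": []}
```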
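For the monitoring item, a minimal metrics sketch using the prometheus_client library is shown below. Metric names and the scrape port are placeholders; in practice the values would be read from the scheduler or the job lifecycle events.

```python
"""Illustrative sketch: exposing cluster metrics for Prometheus to scrape."""
import time
from prometheus_client import Gauge, start_http_server

# Hypothetical gauges Grafana could plot for cluster utilisation.
QUEUED_JOBS = Gauge("nexsys_jobs_queued", "Jobs waiting in the Slurm queue")
RUNNING_JOBS = Gauge("nexsys_jobs_running", "Jobs currently executing")

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics on port 8000
    while True:
        # In practice these would be read from squeue/sacct or the REST API.
        QUEUED_JOBS.set(0)
        RUNNING_JOBS.set(0)
        time.sleep(15)
```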
The required skills are:
- Strong experience with Slurm (or equivalent job schedulers)
- Deep knowledge of Docker (Kubernetes experience a plus)
- Big data database design (PostgreSQL, MongoDB, or Hadoop/S3)
- Proficiency in Python/Flask/FastAPI (for API integration)
- Infrastructure-as-Code (Terraform, Ansible, or similar)
- Experience with observability stacks (Prometheus, Grafana, ELK)
Project deliverables:
- Working Slurm (or equivalent) cluster integrated with Docker
- Database backend for storing optimisation outputs
- Extended API with job submission, monitoring, and results retrieval