Senior MLOps/LLMOps Engineer

London, ENG, GB, United Kingdom

Job Description

About Us



At FDJ UNITED, we don't just follow the game, we reinvent it.



FDJ UNITED is one of Europe's leading betting and gaming operators, with a vast portfolio of iconic brands and a reputation for technological excellence. With more than 5,000 employees and a presence in around fifteen regulated markets, the Group offers a diversified, responsible range of games, both under exclusive rights and open to competition. We set new standards, proving that entertainment and safety can go hand in hand. Here, you'll work alongside a team of passionate individuals dedicated to delivering the best and safest entertaining experiences for our customers every day.



We're looking for bold people who are eager to succeed and ready to level-up the game. If you thrive on innovation, embrace challenges, and want to make a real impact at all levels, FDJ UNITED is your playing field.



Join us in shaping the future of gaming. Are you ready to LEVEL-UP THE GAME?



The Role



As a Senior MLOps/LLMOps Engineer, you will be at the forefront of building and scaling our AI/ML infrastructure, bridging the gap between cutting-edge large language models and production-ready systems. You will play a pivotal role in designing, deploying, and operating the platforms that power our AI-driven products, working at the intersection of DevOps, MLOps, and emerging LLM technologies.



In this role, you'll architect robust, scalable infrastructure for deploying and monitoring large language models (LLMs) such as GPT and Claude-family models on AWS Bedrock and Azure AI Foundry, while ensuring security, observability, and reliability across multi-tenant ML workloads. You will collaborate closely with data scientists, ML engineers, platform teams, and product stakeholders to create seamless, self-serve experiences that accelerate AI innovation across the organization.



This is a hands-on leadership role that blends strategic thinking with deep technical execution. You'll own the end-to-end ML platform lifecycle, from infrastructure provisioning and CI/CD automation to model deployment, monitoring, and cost optimization. As a senior technical leader, you'll champion best practices, mentor team members, and drive a culture of continuous improvement, experimentation, and operational excellence.



Key Responsibilities



Platform Infrastructure & Deployment



Run and evolve our ML/LLM compute infrastructure on Kubernetes/EKS (CPU/GPU) for multi-tenant workloads, ensuring portability across AWS and Azure AI Foundry regions with region-aware scheduling, cross-region data access, and artifact management

Engage with platform and infrastructure teams to provision and maintain access to cloud environments (AWS, Azure), ensuring seamless integration with existing systems

Set up and maintain deployment workflows for LLM-powered applications, handling environment-specific configurations across development, staging/UAT, and production

Build and operate GitOps-native delivery pipelines using GitLab CI, Jenkins, ArgoCD, Helm, and FluxCD to enable fast, safe rollouts and automated rollbacks



LLM Operations & Optimization



Deploy, scale, and optimize large language models (GPT, Claude, and similar) with deep consideration for prompt engineering, latency/performance tradeoffs, and cost efficiency

Operate and maintain Argo Workflows as a reliable, self-serve orchestration platform for data preparation, model training, evaluation, and large-scale batch compute

Implement and evaluate models using AI observability frameworks to track model performance, drift, and quality in production



CI/CD & Infrastructure as Code



Design and maintain robust CI/CD pipelines with isolated development, staging, and production environments to support safe iteration, reproducibility, and full lifecycle observability

Implement Infrastructure as Code (IaC) using Terraform, CloudFormation, and Helm to automate provisioning, configuration, and scaling of cloud resources

Manage container orchestration, secrets management (e.g., AWS Secrets Manager), and secure deployment practices across all environments



Observability, Monitoring & Reliability



Set up and analyze comprehensive observability stacks using Prometheus/Grafana and Splunk to monitor model health, infrastructure performance, and system reliability

Support system monitoring for health, usage, and cost across AWS and Azure environments, including CloudWatch, ELK Stack, and custom alerting solutions

Implement sensible alerting strategies to proactively detect and resolve incidents, minimizing downtime and ensuring high availability

Proactively troubleshoot production issues, manage release cycles, and provide on-call support as necessary



Data Platform & Experiment Reproducibility



Design and maintain a modern data platform built on Apache Iceberg to enable experiment reproducibility, data lineage tracking, and automated governance

Build data pipelines with strong principles of idempotency, retries, backfills, and reproducibility to support ML workflows

Collaborate with data engineers to ensure seamless integration between data ingestion, transformation, and model training processes



Developer Experience & Enablement



Own developer experience by creating intuitive APIs, CLIs, and minimal UIs that enable engineers and data scientists to self-serve infrastructure and deployment needs

Develop comprehensive, modular documentation covering system architecture, deployment processes, model usage guidelines, onboarding playbooks, and operational runbooks

Treat the ML platform as a product: engage with internal users (engineers, data scientists), gather feedback, remove friction points, and continuously improve usability

Create reusable templates, standards, and best practices to ensure maintainability, consistency, and scalability across teams



Architecture, Security & Governance



Define and refine platform architecture with a focus on scalability, security, and compliance with organizational and regulatory standards

Engage in security approval conversations, ensuring that infrastructure, deployments, and data handling meet security and governance requirements

Implement FinOps best practices, including cost attribution, budget monitoring, and optimization strategies for multi-tenant ML infrastructure

Champion a culture of continuous integration, continuous delivery, and continuous improvement across engineering teams



Skills, Knowledge, and Experience



Essential Experience



8+ years of experience in DevOps, Platform Engineering, or Site Reliability Engineering, including 2+ years focused on MLOps/LLMOps

Deep hands-on expertise with AWS services, including Bedrock, S3, EC2, EKS, RDS/PostgreSQL, ECR, IAM, Lambda, Step Functions, and CloudWatch

Production experience managing Kubernetes workloads in EKS, including GPU workloads, autoscaling, resource quotas, and multi-tenant configurations

Proficient in container orchestration (Docker, Kubernetes), secrets management, and implementing GitOps-style deployments using Jenkins, ArgoCD, FluxCD, or similar tools

Practical understanding of deploying and scaling LLMs (e.g., GPT and Claude-family models), including prompt engineering, latency/performance tradeoffs, and model evaluation

Strong programming skills in Python (FastAPI, Django, Pydantic, boto3, Pandas, NumPy) with solid computer science fundamentals (performance, concurrency, data structures)

Working knowledge of Machine Learning techniques and frameworks (e.g., scikit-learn, TensorFlow, PyTorch)

Experience building and operating data pipelines with principles of idempotency, retries, backfills, and reproducibility

Expertise in Infrastructure as Code (IaC) using Terraform, CloudFormation, and Helm

Proven track record designing and maintaining CI/CD pipelines with GitLab CI, Jenkins, or similar tools

Observability experience with Prometheus/Grafana, Splunk, Datadog, Loki/Promtail, OpenTelemetry, and Sentry, including implementing sensible alerting strategies

Strong grasp of networking, security concepts, and Linux systems administration

Excellent communication skills with the ability to collaborate across development, QA, operations, and product teams

Self-motivated and proactive, with a strong sense of ownership and a passion for removing friction and improving developer experience



Nice to Have



Experience with distributed compute frameworks such as Dask, Spark, or Ray

Familiarity with NVIDIA Triton, TorchServe, or other inference servers

Experience with ML experiment tracking platforms like Weights & Biases, MLflow, or Kubeflow

FinOps best practices and cost attribution strategies for multi-tenant ML infrastructure

Exposure to multi-region and multi-cloud designs, including dataset replication strategies, compute placement, and latency optimization

Experience with LakeFS, Apache Iceberg, or Delta Lake for data versioning and lakehouse architectures

Knowledge of data transformation tools such as dbt

Experience with data pipeline orchestration tools like Airflow or Prefect

Familiarity with Snowflake or other cloud data warehouses

Understanding of responsible AI practices, model governance, and compliance frameworks



Our Way Of Working



Our world is hybrid.



A career is not a sprint. It's a marathon. One of the perks of joining us is that we value you as a person first. Our hybrid world allows you to focus on your goals and responsibilities and lets you self-organize to improve your deliveries and get the work done in your own way.



Application Process



We believe talent knows no boundaries. Our hiring process focuses solely on your skills, experience, and potential to contribute to our team. We welcome applicants from all backgrounds and evaluate each candidate based on merit, regardless of personal characteristics such as age, gender, origin, religion, sexual orientation, neurodiversity, or disability.



Why Join FDJ UNITED?



Work on cutting-edge AI/ML technologies at scale in a regulated, high-stakes industry

Technical leadership opportunities with visibility across the organization

Collaborate with world-class engineers, data scientists, and product teams

Influence the architecture and strategy of our AI platform from the ground up

Continuous learning environment with access to the latest tools, technologies, and practices




Details

Hybrid

London, Stockholm

Full Time Permanent

TEC2682

Location

London

Stockholm

Kindred House, 17-25 Hartfield Road, Wimbledon, London, United Kingdom, SW19 3SE



Benefits

Well-being allowance

Learning and development opportunities

Inclusion networks

Charity days

Long service awards

Social events and activities

Private medical insurance

Life assurance and income protection

Employee Assistance Programme

Pension

Meet the recruiter

Prachi Arya



prachi.arya@kindredgroup.com


