About Us
At FDJ UNITED, we don't just follow the game; we reinvent it.
FDJ UNITED is one of Europe's leading betting and gaming operators, with a vast portfolio of iconic brands and a reputation for technological excellence. With more than 5,000 employees and a presence in around fifteen regulated markets, the Group offers a diversified, responsible range of games, both under exclusive rights and open to competition. We set new standards, proving that entertainment and safety can go hand in hand. Here, you'll work alongside a team of passionate individuals dedicated to delivering the best and safest entertaining experiences for our customers every day.
We're looking for bold people who are eager to succeed and ready to level-up the game. If you thrive on innovation, embrace challenges, and want to make a real impact at all levels, FDJ UNITED is your playing field.
Join us in shaping the future of gaming. Are you ready to LEVEL-UP THE GAME?
The Role
As a Senior MLOps/LLMOps Engineer, you will be at the forefront of building and scaling our AI/ML infrastructure, bridging the gap between cutting-edge large language models and production-ready systems. You will play a pivotal role in designing, deploying, and operating the platforms that power our AI-driven products, working at the intersection of DevOps, MLOps, and emerging LLM technologies.
In this role, you'll architect robust, scalable infrastructure for deploying and monitoring large language models (LLMs) such as GPT- and Claude-family models in AWS Bedrock and Azure AI Foundry, while ensuring security, observability, and reliability across multi-tenant ML workloads. You will collaborate closely with data scientists, ML engineers, platform teams, and product stakeholders to create seamless, self-serve experiences that accelerate AI innovation across the organization.
This is a hands-on leadership role that blends strategic thinking with deep technical execution. You'll own the end-to-end ML platform lifecycle, from infrastructure provisioning and CI/CD automation to model deployment, monitoring, and cost optimization. As a senior technical leader, you'll champion best practices, mentor team members, and drive a culture of continuous improvement, experimentation, and operational excellence.
Key Responsibilities
Platform Infrastructure & Deployment
Run and evolve our ML/LLM compute infrastructure on Kubernetes/EKS (CPU/GPU) for multi-tenant workloads, ensuring portability across AWS and Azure AI Foundry regions with region-aware scheduling, cross-region data access, and artifact management
Engage with platform and infrastructure teams to provision and maintain access to cloud environments (AWS, Azure), ensuring seamless integration with existing systems
Set up and maintain deployment workflows for LLM-powered applications, handling environment-specific configurations across development, staging/UAT, and production
Build and operate GitOps-native delivery pipelines using GitLab CI, Jenkins, ArgoCD, Helm, and FluxCD to enable fast, safe rollouts and automated rollbacks
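The automated-rollback responsibility above usually hinges on a health gate that a rollout controller evaluates between canary steps. A minimal Python sketch of that decision logic, with hypothetical metric names and thresholds (real pipelines would source these from Prometheus or a similar backend):

```python
from dataclasses import dataclass

@dataclass
class CanaryMetrics:
    """Metrics sampled from one deployment variant during a rollout step."""
    error_rate: float      # fraction of failed requests, 0.0-1.0
    p95_latency_ms: float  # 95th-percentile request latency

def should_rollback(canary: CanaryMetrics, baseline: CanaryMetrics,
                    max_error_delta: float = 0.01,
                    max_latency_ratio: float = 1.5) -> bool:
    """Return True if the canary looks unhealthy relative to the stable baseline."""
    # Roll back if the canary's error rate exceeds the baseline by more
    # than the allowed margin.
    if canary.error_rate - baseline.error_rate > max_error_delta:
        return True
    # Roll back if canary latency has regressed beyond the allowed ratio.
    if baseline.p95_latency_ms > 0 and \
            canary.p95_latency_ms / baseline.p95_latency_ms > max_latency_ratio:
        return True
    return False
```

In a GitOps setup, a gate like this would typically run as an automated analysis step (e.g. an Argo Rollouts analysis) rather than hand-rolled code; the sketch only illustrates the shape of the check.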
LLM Operations & Optimization
Deploy, scale, and optimize large language models (GPT, Claude, and similar) with deep consideration for prompt engineering, latency/performance tradeoffs, and cost efficiency
Operate and maintain Argo Workflows as a reliable, self-serve orchestration platform for data preparation, model training, evaluation, and large-scale batch compute
Instrument and evaluate models using AI observability frameworks to track model performance, drift, and quality in production
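The drift tracking mentioned above is often quantified with a statistic such as the Population Stability Index (PSI), computed between a reference distribution and live traffic. A minimal sketch, not tied to any particular observability framework, over pre-bucketed distributions:

```python
import math

def psi(expected_fracs, actual_fracs, eps=1e-6):
    """Population Stability Index between two bucketed distributions.

    Both inputs are sequences of bucket proportions summing to ~1.0.
    Values near 0 mean no drift; > 0.2 is commonly treated as significant.
    """
    total = 0.0
    for e, a in zip(expected_fracs, actual_fracs):
        e = max(e, eps)  # floor empty buckets to avoid log(0)
        a = max(a, eps)
        total += (a - e) * math.log(a / e)
    return total
```

In production the bucket proportions would come from the feature or score distributions logged at training time versus those observed in serving, and the PSI value would feed the alerting stack rather than be inspected by hand.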
CI/CD & Infrastructure as Code
Design and maintain robust CI/CD pipelines with isolated development, staging, and production environments to support safe iteration, reproducibility, and full lifecycle observability
Implement Infrastructure as Code (IaC) using Terraform, CloudFormation, and Helm to automate provisioning, configuration, and scaling of cloud resources
Manage container orchestration, secrets management (e.g., AWS Secrets Manager), and secure deployment practices across all environments
Observability, Monitoring & Reliability
Set up and analyze comprehensive observability stacks using Prometheus/Grafana and Splunk to monitor model health, infrastructure performance, and system reliability
Support system monitoring for health, usage, and cost across AWS and Azure environments, including CloudWatch, ELK Stack, and custom alerting solutions
Implement sensible alerting strategies to proactively detect and resolve incidents, minimizing downtime and ensuring high availability
Proactively troubleshoot production issues, manage release cycles, and provide on-call support as necessary
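The "sensible alerting strategies" above are often implemented as error-budget burn-rate alerts that page only when both a short and a long window burn fast, which suppresses flapping. A hedged Python sketch of the core arithmetic (the 14.4x threshold is the conventional fast-burn figure for a 30-day SLO; all names are illustrative):

```python
def burn_rate(errors: int, requests: int, slo_target: float = 0.999) -> float:
    """Error-budget burn rate over a window.

    1.0 means the budget would be consumed exactly over the full SLO
    window; higher values mean it is being consumed faster.
    """
    if requests == 0:
        return 0.0
    error_rate = errors / requests
    budget = 1.0 - slo_target  # allowed error rate under the SLO
    return error_rate / budget

def should_page(short_window_br: float, long_window_br: float,
                threshold: float = 14.4) -> bool:
    """Page only when BOTH windows burn fast, to avoid noisy one-off spikes."""
    return short_window_br >= threshold and long_window_br >= threshold
```

In practice these expressions live as recording and alerting rules in Prometheus rather than application code; the sketch just makes the thresholds concrete.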
Data Platform & Experiment Reproducibility
Design and maintain a modern data platform built on Apache Iceberg to enable experiment reproducibility, data lineage tracking, and automated governance
Build data pipelines designed for idempotency, retries, backfills, and reproducibility to support ML workflows
Collaborate with data engineers to ensure seamless integration between data ingestion, transformation, and model training processes
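The idempotency, retry, and backfill principles above can be sketched in a few lines: keying every write to its partition makes reruns overwrite rather than duplicate, which in turn makes retries and backfills safe. A minimal Python sketch with hypothetical `extract`/`load` callables (an Iceberg-backed pipeline would get the overwrite-by-partition semantics from the table format itself):

```python
import time

def run_partition(partition_date, extract, load,
                  max_retries=3, base_delay=1.0):
    """Process one date partition idempotently.

    Output is keyed by partition_date, so rerunning the same partition
    overwrites the previous result instead of appending duplicates.
    """
    for attempt in range(max_retries):
        try:
            rows = extract(partition_date)
            load(partition_date, rows)  # overwrite-by-partition semantics
            return len(rows)
        except Exception:
            if attempt == max_retries - 1:
                raise  # exhausted retries; surface the failure
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff

def backfill(dates, extract, load):
    """Rerun a range of partitions; safe because each run is idempotent."""
    return {d: run_partition(d, extract, load) for d in dates}
```

Because each partition run is deterministic and self-contained, the same function serves scheduled runs, ad hoc reruns, and historical backfills without special casing.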
Developer Experience & Enablement
Own developer experience by creating intuitive APIs, CLIs, and minimal UIs that enable engineers and data scientists to self-serve infrastructure and deployment needs
Develop comprehensive, modular documentation covering system architecture, deployment processes, model usage guidelines, onboarding playbooks, and operational runbooks
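The self-serve CLIs mentioned above tend to expose a small, discoverable command surface. A minimal sketch of what such a tool's argument parsing might look like, with an entirely hypothetical `mlctl` command and flags:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """A minimal self-serve CLI surface: deploy a model to an environment."""
    parser = argparse.ArgumentParser(
        prog="mlctl",
        description="Self-serve ML platform operations (illustrative sketch)",
    )
    sub = parser.add_subparsers(dest="command", required=True)

    deploy = sub.add_parser("deploy", help="Deploy a model version")
    deploy.add_argument("model", help="Model name to deploy")
    deploy.add_argument("--env", choices=["dev", "staging", "prod"],
                        default="dev", help="Target environment")
    deploy.add_argument("--replicas", type=int, default=1,
                        help="Number of serving replicas")
    return parser
```

Constraining choices (environments, sane defaults) in the CLI itself is one way such tools encode platform guardrails, so users cannot request configurations the platform does not support.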