AI Infrastructure Architect

London, United Kingdom

Job Description

Project description
We are seeking an experienced AI Infrastructure Architect with deep expertise in designing and operating scalable, secure, high-performance cloud environments for Generative AI and LLM workloads. This role is ideal for someone who combines strong AWS architectural skills with hands-on experience in GPU compute, MLOps/LLMOps, and enterprise-grade AI platform design. You should bring extensive experience building cloud-native AI infrastructure, optimizing large-scale model training and inference environments, designing complex AI systems, and creating detailed technical specifications, along with a track record of collaborating closely with AI/ML and other multidisciplinary teams to enable advanced GenAI capabilities and ensure seamless implementation.
Responsibilities

  • Design and implement scalable AWS infrastructure to support Generative AI and LLM workloads, including training, fine-tuning, and inference.
  • Architect secure, high-performance environments using AWS core services such as Amazon SageMaker, Amazon Bedrock, Amazon EKS, AWS Lambda, and related cloud-native components.
  • Design GPU-based compute environments (e.g., EC2 P-series and G-series instances) optimized for distributed training, fine-tuning, and low-latency inference.
  • Implement secure VPC architectures, private endpoints, IAM policies, encryption with KMS, and enterprise-grade data governance controls.
  • Build and govern MLOps/LLMOps pipelines using SageMaker Pipelines, CodePipeline, and CI/CD best practices.
  • Architect RAG infrastructure, including vector databases (OpenSearch, Aurora PostgreSQL with pgvector) and scalable storage on S3.
  • Establish monitoring and observability using CloudWatch, model monitoring tools, logging frameworks, and performance dashboards.
  • Optimize infrastructure for latency, autoscaling, high availability, and cost efficiency, leveraging Spot Instances, Savings Plans, and right-sizing strategies.
  • Define disaster recovery (DR) and backup strategies across multi-AZ and multi-Region AWS deployments.
  • Implement Infrastructure as Code (IaC) using Terraform or CloudFormation for consistent, repeatable provisioning of AI environments.
  • Collaborate with AI/ML teams to support LLM fine-tuning, prompt orchestration, inference endpoints, and model deployment workflows.
  • Stay current with AWS GenAI advancements, evaluating new services, architectural patterns, and best practices for enterprise adoption.
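To give candidates a concrete sense of the RAG retrieval step mentioned above, here is a minimal, self-contained Python sketch of similarity search over document embeddings. The document names and vectors are invented for illustration; a production system would use model-generated embeddings stored in a vector database such as OpenSearch or Aurora PostgreSQL with pgvector, not in-memory Python lists.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, docs, k=2):
    """Rank document embeddings by cosine similarity to the query embedding."""
    ranked = sorted(docs.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

# Toy 3-dimensional embeddings (hypothetical document IDs).
docs = {
    "runbook": [0.9, 0.1, 0.0],
    "faq":     [0.2, 0.8, 0.1],
    "policy":  [0.1, 0.2, 0.9],
}
query = [0.85, 0.15, 0.05]
print(top_k(query, docs, k=2))  # → ['runbook', 'faq']
```

In a real RAG pipeline, the top-k documents retrieved this way are appended to the prompt sent to the LLM; the vector store simply performs this ranking at scale with approximate-nearest-neighbour indexes.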
SKILLS
Must have
  • Extensive experience (typically 7+ years) in cloud architecture, infrastructure engineering, or platform engineering, with a strong focus on AWS.
  • Proven expertise designing and operating AI/ML and Generative AI infrastructure at scale.
  • Deep knowledge of AWS services relevant to AI workloads (SageMaker, Bedrock, EKS, EC2 GPU instances, Lambda, VPC, IAM, KMS, S3).
  • Hands-on experience with GPU compute, distributed training, and high-performance inference environments.
  • Strong understanding of MLOps/LLMOps practices, CI/CD pipelines, and model deployment workflows.
  • Experience architecting secure, compliant, and highly available cloud environments.
  • Proficiency with Infrastructure as Code (Terraform or CloudFormation).
  • Familiarity with vector databases, RAG architectures, and scalable data storage patterns.
  • Strong collaboration skills and the ability to work closely with AI/ML, DevOps, and engineering teams.
  • Excellent documentation and communication skills.
Nice to have
n/a



Job Detail

  • Job Id
    JD4603328
  • Industry
    Not mentioned
  • Total Positions
    1
  • Job Type
    Full Time
  • Salary
    Not mentioned
  • Employment Status
    Full Time
  • Job Location
    London, United Kingdom
  • Education
    Not mentioned