AI Infrastructure Architect

London, United Kingdom

Job Description

Project description
We are seeking an experienced AI Infrastructure Architect with deep expertise in designing and operating scalable, secure, high-performance cloud environments for Generative AI and LLM workloads. This role is ideal for someone who combines strong AWS architectural skills with hands-on experience in GPU compute, MLOps/LLMOps, and enterprise-grade AI platform design. You should bring extensive experience building cloud-native AI infrastructure, optimizing large-scale model training and inference environments, designing complex AI systems, and creating detailed technical specifications, along with a track record of collaborating closely with AI/ML and other multidisciplinary teams to enable advanced GenAI capabilities and ensure seamless implementation.
Responsibilities

  • Design and implement scalable AWS infrastructure to support Generative AI and LLM workloads, including training, fine-tuning, and inference.
  • Architect secure, high-performance environments using AWS core services such as Amazon SageMaker, Amazon Bedrock, Amazon EKS, AWS Lambda, and related cloud-native components.
  • Design GPU-based compute environments (e.g., EC2 P-series and G-series instances) optimized for distributed training, fine-tuning, and low-latency inference.
  • Implement secure VPC architectures, private endpoints, IAM policies, encryption with KMS, and enterprise-grade data governance controls.
  • Build and govern MLOps/LLMOps pipelines using SageMaker Pipelines, CodePipeline, and CI/CD best practices.
  • Architect RAG infrastructure, including vector databases (OpenSearch, Aurora PostgreSQL with pgvector) and scalable storage on S3.
  • Establish monitoring and observability using CloudWatch, model monitoring tools, logging frameworks, and performance dashboards.
  • Optimize infrastructure for latency, autoscaling, high availability, and cost efficiency, leveraging Spot Instances, Savings Plans, and right-sizing strategies.
  • Define disaster recovery (DR) and backup strategies across multi-AZ and multi-Region AWS deployments.
  • Implement Infrastructure as Code (IaC) using Terraform or CloudFormation for consistent, repeatable provisioning of AI environments.
  • Collaborate with AI/ML teams to support LLM fine-tuning, prompt orchestration, inference endpoints, and model deployment workflows.
  • Stay current with AWS GenAI advancements, evaluating new services, architectural patterns, and best practices for enterprise adoption.
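To give candidates a concrete sense of the RAG retrieval step mentioned above, here is a minimal, self-contained Python sketch of similarity search over document embeddings. The document names and vectors are invented for illustration; a production system would use model-generated embeddings stored in a vector database such as OpenSearch or Aurora PostgreSQL with pgvector, not in-memory Python lists.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, docs, k=2):
    """Rank document embeddings by cosine similarity to the query embedding."""
    ranked = sorted(docs.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

# Toy 3-dimensional embeddings (hypothetical document IDs).
docs = {
    "runbook": [0.9, 0.1, 0.0],
    "faq":     [0.2, 0.8, 0.1],
    "policy":  [0.1, 0.2, 0.9],
}
query = [0.85, 0.15, 0.05]
print(top_k(query, docs, k=2))  # → ['runbook', 'faq']
```

In a real RAG pipeline, the top-k documents retrieved this way are appended to the prompt sent to the LLM; the vector store simply performs this ranking at scale with approximate-nearest-neighbour indexes.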
SKILLS
Must have
  • Extensive experience (typically 7+ years) in cloud architecture, infrastructure engineering, or platform engineering, with a strong focus on AWS.
  • Proven expertise designing and operating AI/ML and Generative AI infrastructure at scale.
  • Deep knowledge of AWS services relevant to AI workloads (SageMaker, Bedrock, EKS, EC2 GPU instances, Lambda, VPC, IAM, KMS, S3).
  • Hands-on experience with GPU compute, distributed training, and high-performance inference environments.
  • Strong understanding of MLOps/LLMOps practices, CI/CD pipelines, and model deployment workflows.
  • Experience architecting secure, compliant, and highly available cloud environments.
  • Proficiency with Infrastructure as Code (Terraform or CloudFormation).
  • Familiarity with vector databases, RAG architectures, and scalable data storage patterns.
  • Strong collaboration skills and the ability to work closely with AI/ML, DevOps, and engineering teams.
  • Excellent documentation and communication skills.
Nice to have
n/a



Job Detail

  • Job Id
    JD4603328
  • Industry
    Not mentioned
  • Total Positions
    1
  • Job Type
    Full Time
  • Salary
    Not mentioned
  • Employment Status
    Full Time
  • Job Location
    London, United Kingdom
  • Education
    Not mentioned