We're building the data backbone for Orbital, an industrial AI system that ingests and learns from complex refinery and process data in real time. As our Data Engineer, you'll architect and maintain pipelines that bring high-frequency time-series, lab, and historian data into a scalable Lakehouse architecture. You'll work across AWS (EKS, S3, EBS, KMS, CloudWatch) and Databricks/PySpark, ensuring data is contextualised, synchronised, and optimised for both deep learning models and real-time LLM workloads.
This isn't a traditional ETL role: you'll be solving problems at the intersection of control systems, industrial data engineering, and AI enablement.
Location:
Whilst you will be based in Europe, or eligible to work there, this role will involve travel to other locations in India and the USA.
Core Responsibilities
Ingest & Contextualise Data
Ingest from OPC UA servers, process historians, IoT sensors, LIMS systems, alarms/events, and P&IDs.
Map signals to their physical processes (tags, units, hierarchies) for interpretability in AI pipelines.
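For flavour, here is a minimal Python sketch of the kind of tag-to-context mapping this involves; the tag names, units, and hierarchy paths below are hypothetical, not our actual plant model:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class TagContext:
        """Physical context for a raw historian/OPC UA signal."""
        tag: str          # signal identifier as it appears in the historian
        unit: str         # engineering unit of the measurement
        asset_path: str   # position in the plant hierarchy (site/unit/equipment)
        description: str  # human-readable meaning, usable by LLM pipelines

    # Hypothetical mapping from raw tags to physical context.
    TAG_CONTEXT = {
        "FIC-101.PV": TagContext("FIC-101.PV", "m3/h", "refinery/cdu/feed",
                                 "Crude feed flow"),
        "TI-204.PV": TagContext("TI-204.PV", "degC", "refinery/cdu/furnace",
                                "Furnace outlet temperature"),
    }

    def contextualise(tag: str) -> TagContext:
        """Attach physical meaning to a raw signal for downstream AI."""
        return TAG_CONTEXT[tag]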
Data Movement & Accessibility
Build pipelines that handle real-time streaming and batch ingestion into the Lakehouse (see the streaming sketch after this list).
Manage synchronisation between historian archives, unstructured files, and AWS storage (S3/EBS).
Orchestrate Databricks Lakeflow/Connectors for integrating data into Lakebase/Lakehouse.
Handle secure, high-throughput transfers between historian archives and sandbox/live environments.
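As an illustration of the streaming half, a minimal PySpark Structured Streaming sketch that lands sensor events in a Delta table; the Kafka broker, topic, event schema, and paths are placeholder assumptions, not our actual setup:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import (DoubleType, StringType, StructField,
                                   StructType, TimestampType)

    spark = SparkSession.builder.appName("sensor-ingest").getOrCreate()

    # Hypothetical event schema for JSON sensor readings.
    schema = StructType([
        StructField("tag", StringType()),
        StructField("ts", TimestampType()),
        StructField("value", DoubleType()),
    ])

    # Read from a (placeholder) Kafka topic, parse payloads, and append
    # to a Delta table; the checkpoint makes the stream restartable.
    (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "sensor-readings")
        .load()
        .select(from_json(col("value").cast("string"), schema).alias("e"))
        .select("e.*")
        .writeStream
        .format("delta")
        .option("checkpointLocation", "s3://bucket/checkpoints/sensor-ingest")
        .toTable("lakehouse.raw.sensor_readings"))

A batch backfill can reuse the same parse-and-write logic via spark.read, which keeps the two ingestion modes from drifting apart.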
Change Tracking & Integrity
Detect and manage schema changes, signal drift, and inconsistencies across time.
Implement lineage and audit trails across Spark/Databricks and AWS pipelines.
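A minimal sketch of the schema-change detection mentioned above, comparing an incoming batch against the target table's schema so drift can be quarantined rather than propagated:

    from pyspark.sql import DataFrame

    def detect_schema_drift(incoming: DataFrame, expected: DataFrame) -> dict:
        """Diff an incoming batch's schema against the target table's schema.

        Returns added/removed columns and columns whose type changed, so the
        pipeline can quarantine the batch instead of breaking downstream jobs.
        """
        inc = {f.name: f.dataType for f in incoming.schema.fields}
        exp = {f.name: f.dataType for f in expected.schema.fields}
        return {
            "added": sorted(set(inc) - set(exp)),
            "removed": sorted(set(exp) - set(inc)),
            "retyped": sorted(c for c in set(inc) & set(exp) if inc[c] != exp[c]),
        }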
Data Preparation for AI
Build and maintain dual pipelines:
+ Training: large-scale historical data prep for time-series and LLM training.
+ Inference: low-latency, real-time pipelines for anomaly detection, optimisation, and LLM search.
Support heterogeneous AI workloads (time-series forecasting and retrieval-augmented LLMs).
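One common way to keep those two paths consistent (a sketch under assumed table names, not a prescription) is a single feature transform shared by the batch and streaming jobs:

    from pyspark.sql import DataFrame, SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("dual-pipelines").getOrCreate()

    def add_features(df: DataFrame) -> DataFrame:
        """Feature logic shared by training and inference, so both paths
        see identically prepared data (no train/serve skew)."""
        return df.withColumn(
            "value_z", (F.col("value") - F.col("tag_mean")) / F.col("tag_std")
        )

    # Training: one large batch scan over historical data (hypothetical table).
    train_df = add_features(spark.read.table("lakehouse.curated.sensor_enriched"))

    # Inference: the same transform applied to the live stream at low latency.
    live_df = add_features(spark.readStream.table("lakehouse.curated.sensor_enriched"))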
Database Performance & Optimisation
Tune PostgreSQL and Spark for high-throughput time-series workloads (partitioning, indexing, query optimisation; sketched below).
Optimise pipelines for both fast analytical queries and high-efficiency model training.
Deploy and manage data pipelines in AWS EKS (Kubernetes) with persistent EBS-backed storage.
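By way of illustration, the kind of PostgreSQL layout the tuning work points at, applied from Python with psycopg2; the table, partition, and connection details are placeholders:

    import psycopg2

    # Hypothetical DSN; in practice credentials come from KMS/IAM-backed secrets.
    conn = psycopg2.connect("dbname=orbital user=etl host=localhost")

    DDL = """
    CREATE TABLE IF NOT EXISTS sensor_readings (
        tag   text        NOT NULL,
        ts    timestamptz NOT NULL,
        value double precision
    ) PARTITION BY RANGE (ts);

    -- Monthly partitions keep scans and maintenance bounded for time series.
    CREATE TABLE IF NOT EXISTS sensor_readings_2025_01
        PARTITION OF sensor_readings
        FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');

    -- Composite index matches the dominant access pattern: one tag, one time range.
    CREATE INDEX IF NOT EXISTS idx_sensor_tag_ts ON sensor_readings (tag, ts);
    """

    with conn, conn.cursor() as cur:
        cur.execute(DDL)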
Technical Requirements
Deep expertise in PostgreSQL (partitioning, indexing, query optimisation, storage design).
Strong proficiency in Python for data processing, scripting, and pipeline orchestration.
Hands-on experience with AWS (EKS, S3, EBS, IAM, KMS, CloudWatch, etc.) for secure and scalable data pipelines.
Proven ability to work with Databricks and PySpark for large-scale distributed data processing.
Familiarity with time-series industrial data (control systems, DCS/SCADA logs, process historians).
Experience in unstructured data sync and management within hybrid cloud/on-prem environments.
Bonus: Knowledge of streaming frameworks (Kafka, Flink, Spark Streaming) or MLOps stacks for data versioning and lineage.
What Success Looks Like
Live data streams are contextualised, queryable, and AI-ready.
Schema changes and signal drift are detected and handled without breaking downstream workflows.
Training and inference pipelines run smoothly in parallel, optimised for scale and latency.
AI teams can focus on modelling because the data backbone is robust, fast, and reliable.