Era4 develops, owns and operates AI infrastructure across the UK, powered by renewable energy. By converting legacy industrial and energy sites into modern data-centre facilities, Era4 combines brownfield regeneration with cleaner, more efficient, scalable compute capacity for healthcare, research, finance, enterprise, and public-sector organisations.
Initial 6-month contract
Start date - 13th July
Competitive day rate
If you are a contractor open to a permanent role, please include your salary expectations in your application.
Role Summary
We are seeking an AI Infrastructure Validation Engineer to join our fast-scaling team. This role sits within Product but works across Product, Engineering and Operations. You will design, build, and orchestrate the automated preflight validation suites and performance benchmarks that continuously verify our bare-metal APIs, Kubernetes environments, and multi-node GPU clusters under enterprise-scale workloads.
You will ensure that every platform release, infrastructure change or hardware deployment is tested, validated and production-ready before reaching customers. This is an opportunity to join a mission-led AI business that is redefining infrastructure, intelligence, and impact for enterprise customers.
Key Responsibilities
Software-Defined Infrastructure Validation & Preflight Automation
- Design and build zero-dependency, Python-based preflight verification tools to validate multi-node distributed initialization, master-to-worker rendezvous routing, and correct GPU-to-CPU process affinity prior to launching massive model-training workloads.
- Write and maintain Infrastructure as Code to provision, configure, test, and tear down complex bare-metal and containerized compute environments.
- Implement resilient execution: construct adaptive test orchestration harnesses that autonomously detect environment drift and analyse platform changes.
GPU Platform & Low-Latency Network Validation
- Automate the execution and results aggregation of cluster-level benchmarking suites and high-performance storage benchmarks to validate node-to-node throughput limits.
- Build validation routines to monitor high-throughput network fabrics, evaluating traffic patterns and congestion control parameters.
- Script low-level automated checks to validate server-node topology, PCIe link speeds, HBM memory status, secure boot parameters, and firmware performance via BMC, IPMI, or Redfish interfaces.
Continuous Integration & Observability
- Integrate automated infrastructure validation suites directly into CI/CD pipelines.
- Configure and maintain observability pipelines that route real-time diagnostic logs and hardware execution metrics, so that slow, misconfigured, or degrading compute nodes can be isolated quickly.
- Partner with the Platform team, Network Engineers, and Datacentre Operations to lead root-cause analysis on complex platform regressions, hardware-software boundary issues, and distributed interconnect bottlenecks.
Essential Experience
- Strong proficiency in Python for building zero-dependency verification tools, automated test orchestration harnesses, and low-level system checks.
- Deep hands-on experience writing and maintaining Infrastructure as Code to provision, configure, and tear down complex bare-metal and containerized compute environments.
- Proven experience working within Kubernetes environments and validating enterprise-scale, multi-node distributed systems.
- Experience scripting automated checks for server-node topology, PCIe link speeds, HBM memory status, and firmware performance, and interfacing with hardware via BMC, IPMI, or Redfish.
- Demonstrated capability integrating automated infrastructure validation suites into CI/CD pipelines.
- Experience configuring observability pipelines for real-time diagnostic logs and hardware metrics.
- History of partnering across Engineering, Product, and Datacentre Operations to conduct root-cause analysis on complex platform regressions and hardware-software boundary issues.
Preferred Experience
- Prior experience validating infrastructure specifically optimized for massive model-training workloads, including a solid understanding of GPU-to-CPU process affinity and master-to-worker rendezvous routing.
- Deep understanding of high-throughput network fabrics, traffic patterns, and congestion control parameters in multi-node GPU clusters.
- Background in building and executing cluster-level benchmarking suites and high-performance storage benchmarks to isolate node-to-node throughput limits.
- Experience designing intelligent test systems capable of autonomously detecting environment drift and analysing large-scale platform changes.
Why Join Era4
You’ll be joining a mission-driven start-up building critical national infrastructure, where operational excellence directly enables growth. This role offers high visibility with leadership, real autonomy, and the chance to shape how a next-generation company operates at scale.
Diversity & Inclusion
Era4 is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.