Era4 develops, owns and operates AI infrastructure across the UK, powered by renewable energy. By converting legacy industrial and energy sites into modern data-centre facilities, Era4 combines brownfield regeneration with cleaner, more efficient, scalable compute capacity for healthcare, research, finance, enterprise, and public-sector organisations.
Initial 6-month contract.
June start date.
Competitive day rate.
Role Summary:
The Identity & Platform Engineer is responsible for designing, implementing and operating the core platform services that provide:
- Kubernetes platform services
- Sovereign identity management
- Federation and authentication services
- Privileged access management
- Secrets management
- Customer identity integration
- Platform security and governance
The successful candidate will play a key role in delivering a Zero Trust, sovereign cloud platform built around FreeIPA, Teleport, authentik, Bitwarden and Kubernetes.
Key Responsibilities:
Observability Platform Implementation:
- Deliver the implementation of Era4's observability platform based on Grafana Mimir, Loki, Tempo, Grafana Alloy and Grafana Enterprise tooling.
- Design and implement highly available observability services across multiple co-location and production sites.
- Configure telemetry ingestion pipelines for metrics, logs, and future distributed tracing workloads.
- Develop and maintain observability architecture documentation, high-level designs, low-level designs, and operational runbooks.
- Define platform standards for telemetry collection, labelling, metadata enrichment, retention policies, and data governance.
- Implement multi-tenant observability controls and tenant isolation strategies.
- Configure and maintain object-storage-backed telemetry platforms for long-term retention and scalability.
Telemetry Collection & Integration:
- Deploy and manage Grafana Alloy collectors across Kubernetes clusters, Linux hosts, network infrastructure, storage platforms, and hardware management systems.
- Integrate telemetry from Kubernetes, GPU infrastructure, HPE hardware, storage platforms, network devices, and cloud-native services.
- Develop and maintain observability integrations using OpenTelemetry standards and protocols.
- Establish onboarding processes for new platforms, applications, and infrastructure services.
- Collaborate with application teams to define observability requirements and future tracing adoption strategies.
Alerting & Operational Insights:
- Design and implement alerting frameworks using recording rules, Alertmanager, and operational best practices.
- Develop operational dashboards and service health views for infrastructure, platform, and application services.
- Support integration of observability events with ITSM and incident-management platforms.
- Define SLIs, SLOs, alert thresholds, and operational KPIs.
- Continuously improve platform observability, incident detection, and root-cause analysis capabilities.
Reliability & Automation:
- Implement Infrastructure-as-Code and GitOps practices for observability platform deployment and configuration management.
- Develop automation for dashboard provisioning, alert deployment, tenant onboarding, and telemetry configuration.
- Design and validate disaster recovery, resilience, and failover capabilities across observability services.
- Contribute to platform security, compliance, and operational governance initiatives.
- Work with operational teams to ensure observability services remain reliable, scalable, and maintainable.
Required Experience & Skills:
- Significant experience implementing and operating enterprise observability or monitoring platforms.
- Strong understanding of metrics, logs, traces, OpenTelemetry, and modern observability principles.
- Experience with Grafana ecosystem technologies including Grafana, Prometheus, Grafana Mimir, Grafana Loki, Grafana Tempo, and Grafana Alloy.
- Experience designing Kubernetes-native solutions and operating distributed platforms at scale.
- Knowledge of Linux systems administration and cloud-native infrastructure.
- Experience implementing Infrastructure-as-Code and GitOps approaches (preferably including Ansible).
- Skilled in developing automation and operational tooling using Python and/or Go.
- Previous exposure to creating technical architecture, operational documentation, and deployment designs.
- Experience with object storage technologies and distributed data platforms.
- Strong understanding of monitoring, alerting, and operational event management.
One or more of the following would be advantageous:
- Experience implementing OpenTelemetry-based observability solutions.
- Experience operating observability platforms in service-provider, cloud, or large-scale enterprise environments.
- Experience supporting GPU, AI/ML, or high-performance computing environments.
- Experience integrating observability platforms with ITSM solutions.
- Experience with multi-tenant platform architectures.
- Knowledge of networking, storage, and data-centre infrastructure monitoring.
- Understanding of distributed tracing and application performance monitoring.
Why Join Era4:
You’ll be joining a mission-driven start-up building critical national infrastructure, where operational excellence directly enables growth. This role offers high visibility with leadership, real autonomy, and the chance to shape how a next-generation company operates at scale.
Diversity & Inclusion:
Era4 is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.