Senior Platform & Site Reliability Engineer
Location: Remote
Employment Type: Contract
The Role
This role carries full architectural and operational ownership of the platform layer across a growing SaaS portfolio. The Cloud Architect owns AWS infrastructure standards — VPCs, account structures, networking, and compute design. Everything outside that lane is yours: the CI/CD platform, the observability and reliability stack, the event streaming infrastructure, the deployment pipelines, and the incident engineering model.
Architectural decisions are yours to make and defend, standards are yours to define and enforce, and the reliability of 20+ enterprise SaaS products depends on what you and your team build.
This is an AI-native engineering organisation. Where it is practical and safe to do so, you are expected to use automation and AI-assisted tooling to reduce toil — in CI/CD triage, infrastructure provisioning, observability workflows, and acquisition onboarding. The expectation is not to replace engineering judgement with automation, but to free it up for the problems that genuinely require it.
The Scale You Will Operate At
The portfolio consists of 20+ live, enterprise-grade SaaS solutions running concurrently. Each product serves enterprise customers and processes millions to billions of real-time requests. The architecture is serious: event streaming for real-time data pipelines, batch processing workloads running alongside live transaction flows, and multi-tenant enterprise-grade reliability expectations across every product.
You will design and operate the platform infrastructure that underpins all of it — scaling horizontally as each new acquisition joins the portfolio, without proportionally scaling cost, complexity, or headcount.
What You Will Own
Platform Architecture
- Full architectural ownership of the non-AWS toolchain: CI/CD, observability, event streaming, automation, secrets, and deployment infrastructure
- Define, build, and enforce platform standards across portfolio products
- Terraform IaC for all infrastructure — nothing provisioned manually, everything versioned and reviewed
- Self-service developer platform so product teams can ship without waiting on the platform team
Event Streaming & Pipeline Infrastructure
- Own the event streaming architecture, operational standards, and health monitoring across all products using real-time pipelines
- Design and maintain batch processing infrastructure alongside live event flows
- Ensure pipeline reliability, throughput, and cost are actively managed at scale
CI/CD & Deployment
- Build and maintain CI/CD pipelines (GitHub Actions) across all portfolio products
- Automate triage and retry logic for known failure classes — flaky tests, dependency timeouts, OOM kills — so engineers are only paged for genuinely novel failures
- Deployment standards: release management, rollback mechanisms, canary and blue-green patterns where justified
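To make the triage expectation concrete, here is a minimal sketch of how a failed CI job might be sorted into one of the known failure classes and either retried or escalated. Everything in it is an assumption for illustration: the log patterns, the `MAX_RETRIES` budget, and the `triage` function are hypothetical, not an existing part of the platform.

```python
import re
from dataclasses import dataclass

# Hypothetical log patterns for the three failure classes named above.
# Real patterns would be derived from the pipelines' actual logs.
KNOWN_FAILURE_PATTERNS = {
    "flaky_test": re.compile(r"flaky|intermittent", re.IGNORECASE),
    "dependency_timeout": re.compile(r"timed? ?out", re.IGNORECASE),
    "oom_kill": re.compile(r"out of memory|OOMKilled|exit code 137", re.IGNORECASE),
}

MAX_RETRIES = 2  # assumed retry budget per job


@dataclass(frozen=True)
class TriageDecision:
    failure_class: str  # "novel" when no known pattern matched
    action: str         # "retry" or "page"


def triage(log_text: str, attempts_so_far: int) -> TriageDecision:
    """Retry known failure classes within budget; page a human otherwise."""
    for name, pattern in KNOWN_FAILURE_PATTERNS.items():
        if pattern.search(log_text):
            action = "retry" if attempts_so_far < MAX_RETRIES else "page"
            return TriageDecision(name, action)
    # Nothing matched: treat as genuinely novel and escalate.
    return TriageDecision("novel", "page")
```

A known failure that keeps recurring after the retry budget is spent is paged as well, so automation never hides a persistent problem.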
Observability & Reliability
- Own the full observability stack: Grafana, Prometheus, and Loki across all products
- SLOs and error budgets defined per product; reliability tracked consistently
- Build alerting that correlates signals and surfaces diagnostic context alongside notifications — so on-call engineers arrive at an incident with hypotheses, not a blank screen
- Incident response: on-call design, escalation playbooks, post-mortem facilitation
- Automated remediation scoped to safe, idempotent actions — container restarts, ECS task scaling, known rollback patterns; novel or ambiguous failures escalate to a human with full context attached
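For candidates less familiar with the terminology, the sketch below shows the usual arithmetic behind a request-based error budget. The function name, the 99.9% target, and the request counts are illustrative assumptions, not the SLOs of any product in the portfolio.

```python
def error_budget_remaining(slo_target: float,
                           total_requests: int,
                           failed_requests: int) -> float:
    """Return the unspent fraction of the error budget.

    1.0 means the budget is untouched, 0.0 means it is exhausted,
    and a negative value means the SLO has been breached.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The budget is the number of failures the SLO tolerates in the window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


# Assumed example: a 99.9% SLO over 1,000,000 requests tolerates about
# 1,000 failures, so 250 failures leave roughly 75% of the budget.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
```

Tracking this one number per product is one way to keep reliability comparable across the portfolio.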
Acquisition Onboarding
- Platform audit and gap analysis for every new acquisition — assessing CI/CD maturity, IaC coverage, observability gaps, and security posture
- Migration plan and execution for each portfolio company joining the platform — sequenced to avoid disrupting live operations
- Target: full platform integration within a defined window per acquisition
A Note on Automation
Where automation is safe and failure modes are well understood — routine provisioning, known CI/CD failure classes, secrets rotation, cost anomaly flagging — aggressive automation is expected. Where automation would act on ambiguous signals or carry significant blast radius, human judgement stays in the loop. The goal is to reduce toil on solved problems, not to automate decisions that require engineering expertise.
Platform Stack
Area: Stack / Standard
- IaC: Terraform OSS / OpenTofu
- CI/CD: GitHub Actions
- Event Streaming: Architecture and tooling chosen for the workload
- Observability: Grafana, Prometheus, Loki
- Log Management: AWS CloudWatch, Grafana Loki
- Incident Management: OpsGenie (startup tier) or Better Uptime
- Secrets: AWS Secrets Manager / HashiCorp Vault OSS
- Containers: ECS (default), EKS only where justified
- Cost Monitoring: AWS Cost Explorer with custom dashboards
What We’re Looking For
- 8–12 years in platform engineering, DevOps, or SRE — with clear evidence of increasing ownership over time
- Strong Terraform depth across multi-environment, multi-account setups
- CI/CD ownership across a multi-product environment with GitHub Actions
- Experience with event streaming infrastructure at production scale — design, operations, reliability, and cost management
- Hands-on Grafana, Prometheus, and Loki in production
- AWS operational depth: ECS, EKS, RDS, IAM, VPC, CloudWatch, Cost Explorer
- SRE fundamentals: SLOs, error budgets, on-call design, post-mortem culture
- Acquisition or greenfield platform integration experience strongly preferred
How You Work
- Comfortable operating across multiple products simultaneously — context-switching without dropping standards
- Cost-efficiency instinct — you optimise spend as a habit, not as a project
- You treat automation as a tool for eliminating toil, not a substitute for engineering judgement
- You document decisions, enforce standards through code, and build platforms that other engineers find intuitive to use
Why This Role
The platform function is being built from the ground up. You will have architectural ownership of the entire non-AWS platform layer across a growing portfolio of enterprise SaaS products, with the freedom — and responsibility — to build the reliability and delivery culture of the organisation.
This is not a role that inherits someone else’s decisions and maintains them. Every major architectural choice is still to be made. If you want to build something that lasts and that other engineers depend on, this is the role.