Why this role exists
Our infrastructure footprint is growing faster than our headcount, and we believe most of that
gap should be closed by automation and AI agents — not by hiring more humans to do toil. We
need someone early in their career who treats manual work as a bug, ships scripts and agents
instead of tickets, and wants to grow into deeper ownership over the next two years.
You will not be the most senior person on the team. You will be the one who multiplies the team.
What you'll own
In your first 1 months
• Take ownership of one slice of our CI/CD pipeline and make it measurably
faster, more reliable, or cheaper. We expect a number on a dashboard to move.
• Build at least three internal automations that replace manual ops toil —
using AI agents (Claude Code, agentic CLIs, scripted LLM workflows) as your force
multiplier.
• Be the first responder for a defined set of alerts. Write the runbooks. Drive
the alert volume down.
• Support senior engineers on AI/ML infrastructure (GPU nodes, inference
services, model deployment) — observe, document, and gradually take on contained
changes under review.
By 3 months you should be
• The go-to person for at least two production systems.
• Shipping routine infrastructure changes without needing senior review.
• Treating "manual" as a code smell.
Required (we will reject without these)
• 0–3 years hands-on experience with one major cloud (AWS, GCP, or
Azure — one is fine, depth beats breadth).
• Fluent in Linux command line, bash, and at least one scripting language
(Python or Go preferred).
• Have shipped something to production that real users hit. A side project
counts; a graded coursework lab does not.
• Comfortable with Docker — you can explain what an image vs. a
container is and why it matters.
• Working knowledge of networking fundamentals: DNS, HTTP/HTTPS,
TLS, ports, basic subnets — enough to debug "it works on my machine."
• Git fluency: branches, merges, rebases, conflict resolution.
• CI/CD pipelines — you have authored or substantially modified pipelines
in GitHub Actions, GitLab CI, ArgoCD, Jenkins, or similar. Not just "I clicked Re-run."
• Kubernetes basics — kubectl for real work, can read pod logs,
understand deployments and services, can debug a CrashLoopBackOff without
panicking. You do not need to have run a cluster; you do need to have lived inside one.
• Active user of AI coding agents (Claude Code, Cursor, Copilot, agentic
CLIs, etc.). You should be able to walk us through specific tasks where they made you
faster, and specific tasks where they failed you and how you noticed. "I have tried it" is
not enough.
Bonus (real plus, not required)
• Infrastructure as Code: Terraform, Pulumi, or Ansible.
• Observability: Prometheus/Grafana, Datadog, OpenTelemetry, any APM.
• Have built or extended an LLM-based agent — a custom MCP server, a
scripted multi-step workflow, an internal tool that calls models in a loop. Anything beyond
chat-with-Claude.
• Exposure to GPU workloads, model serving (vLLM, Triton, TGI, etc.), or
ML pipelines.
What we don't care about
• Whether your degree is in CS — or whether you have a degree at all.
• Brand-name companies on your resume.
• Certifications. They are fine. They do not substitute for having shipped.
How we work
• We default to automation. If you do something manually twice, the third
time you script it or hand it to an agent.
• AI agents are part of the workflow, not a novelty. Expect interview
questions about exactly how you use them — and where you have caught them being
wrong.
• Small, reversible changes beat big-bang rollouts.
• Postmortems are blameless and written down.
• We push back on each other. If you only execute, you will be unhappy
here.
How to apply
Send:
• Your resume.
• A short note (≤200 words) describing one infra or automation problem you
solved, and how AI agents factored in — or did not, and why. We read these. Generic
notes get rejected.
Internal note — delete before posting externally
• Comp band, location policy, team name, and reporting line marked
[CONFIRM] need to be filled in before this goes external.
• The Required list is intentionally tight: CI/CD and Kubernetes basics
promoted from bonus. Expect this to filter ~80% of typical junior DevOps applicants. The
remaining pool will skew toward people who have actually shipped infra at a startup, not
bootcamp grads or pure cloud-cert holders.
• IaC, observability, agent-building, and GPU/ML serving stay as bonus.
Promoting any of these to required at 0–3 yrs collapses the pool to near-zero or forces
hiring senior people at junior comp. If you want IaC required, re-level this to mid (3–5
yrs) and raise the band.
• Screening implication: the resume screen should explicitly check for
CI/CD pipeline authorship and any K8s-touching production work. If neither is on the
resume, reject at screen. Do not waste interview slots.
• Pipeline watch: if fewer than ~15 qualified resumes after 2 weeks of
active sourcing, the first thing to relax is the AI-agent-fluency bar (move to bonus and
screen for it in interview instead). Do not relax the "shipped to production" requirement
— that is the load-bearing filter.