Role: Full-Time, Long-Term Required: Docker, GCP, CI/CD Preferred: Experience with ML pipelines
OVERVIEW
We are seeking a DevOps engineer to join as a core member of our technical team. This is a long-term position for someone who wants to own infrastructure and deployment for a production machine learning system. You will ensure our prediction pipeline runs reliably, deploys smoothly, and scales as needed.
The ideal candidate thinks about failure modes obsessively, automates everything possible, and builds systems that run without constant attention.
CORE TECHNICAL REQUIREMENTS
Docker (Required): Deep experience with containerization. Efficient Dockerfiles, layer caching, multi-stage builds, debugging container issues. Experience with Docker Compose for local development.
Google Cloud Platform (Required): Strong GCP experience: Cloud Run for serverless containers, Compute Engine for VMs, Artifact Registry for images, Cloud Storage, IAM. You can navigate the console but prefer scripting everything.
CI/CD (Required): Build and maintain deployment pipelines. GitHub Actions required. You automate testing, building, pushing, and deploying. You understand the difference between continuous integration and continuous deployment.
Linux Administration (Required): Comfortable on the command line. SSH, diagnose problems, manage services, read logs, fix things. Bash scripting is second nature.
PostgreSQL (Required): Database administration basics—backups, monitoring, connection management, basic performance tuning. Not a DBA, but comfortable keeping a production database healthy.
Infrastructure as Code (Preferred): Terraform, Pulumi, or similar. Infrastructure should be versioned, reviewed, and reproducible—not clicked together in a console.
WHAT YOU WILL OWN
Deployment Pipeline: Maintaining and improving deployment scripts and CI/CD workflows. Code moves from commit to production reliably with appropriate testing gates.
Cloud Run Services: Managing deployments for model fitting, data cleansing, and signal discovery services. Monitor health, optimize cold starts, handle scaling.
VM Infrastructure: PostgreSQL and Streamlit on GCP VMs. Instance management, updates, backups, security.
Container Registry: Managing images in GitHub Container Registry and Google Artifact Registry. Cleanup policies, versioning, access control.
Monitoring and Alerting: Building observability. Logging, metrics, health checks, alerting. Know when things break before users tell us.
Environment Management: Configuration across local and production. Secrets management. Environment parity where it matters.
WHAT SUCCESS LOOKS LIKE
Deployments are boring—no drama, no surprises. Systems recover automatically from transient failures. Engineers deploy with confidence. Infrastructure changes are versioned and reproducible. Costs are reasonable and resources scale appropriately.
ENGINEERING STANDARDS
Automation First: If you do something twice, automate it. Manual processes are bugs waiting to happen.
Documentation: Runbooks, architecture diagrams, deployment guides. The next person can understand and operate the system.
Security Mindset: Secrets never in code. Least-privilege access. You think about attack surfaces.
Reliability Focus: Design for failure. Backups are tested. Recovery procedures exist and work.
CURRENT ENVIRONMENT
GCP (Cloud Run, Compute Engine, Artifact Registry, Cloud Storage), Docker, Docker Compose, GitHub Actions, PostgreSQL 16, Bash deployment scripts with Python wrapper.
WHAT WE ARE LOOKING FOR
Ownership Mentality: You see a problem, you fix it. You do not wait for assignment.
Calm Under Pressure: When production breaks, you diagnose methodically.
Communication: You explain infrastructure decisions to non-infrastructure people. You document what you build.
Long-Term Thinking: You build systems maintained for years, not quick fixes creating tech debt.
EDUCATION
University degree in Computer Science, Engineering, or related field preferred. Equivalent demonstrated expertise also considered.
TO APPLY
Include: (1) CV/resume, (2) Brief description of infrastructure you built or maintained, (3) Links to relevant work if available, (4) Availability and timezone.