Role: Full-Time, Long-Term Required: Python, SQL Preferred: Experience with financial or crypto data
OVERVIEW
We are seeking a data engineer to join as a core member of our technical team. This is a long-term position for someone who wants to build robust, production-grade data infrastructure and grow with a small, focused team. You will own the data layer that feeds our machine learning pipeline—from ingestion and validation through transformation, storage, and delivery.
The ideal candidate is meticulous about data quality, thinks deeply about failure modes, and builds systems that run reliably without constant attention. You understand that downstream ML models are only as good as the data they consume.
CORE TECHNICAL REQUIREMENTS
Python (Required): Professional-level proficiency. You write clean, maintainable code for data pipelines—not throwaway scripts. Comfortable with Pandas, NumPy, and their performance characteristics. You know when to use Python versus push computation to the database.
SQL (Required): Advanced SQL skills. Complex queries, query optimization, schema design, execution plans. PostgreSQL experience strongly preferred. You think about indexing, partitioning, and query performance as second nature.
Data Pipeline Design (Required): You build pipelines that handle real-world messiness gracefully. You understand idempotency, exactly-once semantics, backfill strategies, and incremental versus full recomputation tradeoffs. You design for failure—what happens when an upstream source is late, returns malformed data, or goes down entirely. Experience with workflow orchestration required: Airflow, Prefect, Dagster, or similar.
Data Quality (Required): You treat data quality as a first-class concern. You implement validation checks, anomaly detection, and monitoring. You know the difference between data that is missing versus data that should not exist. You build systems that catch problems before they propagate downstream.
WHAT YOU WILL BUILD
Data Ingestion: Pipelines pulling from diverse sources—crypto exchanges, traditional market feeds, on-chain data, alternative data. Handling rate limits, API quirks, authentication, and source-specific idiosyncrasies.
Data Validation: Checks ensuring completeness, consistency, and correctness. Schema validation, range checks, freshness monitoring, cross-source reconciliation.
Transformation Layer: Converting raw data into clean, analysis-ready formats. Time series alignment, handling different frequencies and timezones, managing gaps.
Storage and Access: Schema design optimized for both write patterns (ingestion) and read patterns (ML training, feature computation). Data lifecycle and retention management.
Monitoring and Alerting: Observability into pipeline health. Knowing when something breaks before it affects downstream systems.
DOMAIN EXPERIENCE
Preference for candidates with experience in financial or crypto data—understanding market data conventions, exchange-specific quirks, and point-in-time correctness. You know why look-ahead bias is dangerous and how to prevent it.
Time series data at scale—hundreds of symbols with years of history, multiple frequencies, derived features. You understand temporal joins, windowed computations, and time-aligned data challenges.
High-dimensional feature stores—we work with hundreds of thousands of derived features. Experience managing, versioning, and serving large feature sets is valuable.
ENGINEERING STANDARDS
Reliability: Pipelines run unattended. Failures are graceful with clear errors, not silent corruption. Recovery is straightforward.
Reproducibility: Same inputs and code version produce identical outputs. You version schemas, track lineage, and can reconstruct historical states.
Documentation: Schemas, data dictionaries, pipeline dependencies, operational runbooks. Others can understand and maintain your systems.
Testing: You write tests for pipelines—validation logic, transformation correctness, edge cases. Untested pipelines are broken pipelines waiting to happen.
TECHNICAL ENVIRONMENT
PostgreSQL, Python, workflow orchestration (flexible on tool), cloud infrastructure (GCP preferred but flexible), Git.
WHAT WE ARE LOOKING FOR
Attention to Detail: You notice when something is slightly off and investigate rather than ignore.
Defensive Thinking: You assume sources will send bad data, APIs will fail, schemas will change. You build accordingly.
Self-Direction: You identify problems, propose solutions, and execute without waiting to be told.
Long-Term Orientation: You build systems you will maintain for years.
Communication: You document clearly, explain data issues to non-engineers, and surface problems early.
EDUCATION
University degree in a quantitative/technical field preferred: Computer Science, Mathematics, Statistics, Engineering. Equivalent demonstrated expertise also considered.
TO APPLY
Include: (1) CV/resume, (2) Brief description of a data pipeline you built and maintained, (3) Links to relevant work if available, (4) Availability and timezone.