What Is a Data Lakehouse Engineer?
Data lakehouse engineers build and maintain the storage, catalog, and compute layers of modern data platforms. They work with open table formats (Apache Iceberg, Delta Lake, Apache Hudi), catalog services (Unity Catalog, Project Nessie, Polaris), compute engines (Spark, Flink, Trino, DuckDB), and transformation tools (dbt, SQLMesh). They often control or strongly influence data infrastructure spending, and much of their tool evaluation is visible on GitHub through stars, issues, and pull requests.
GitHub Signals That Identify Data Lakehouse Engineers
- New stars on apache/iceberg, delta-io/delta, apache/hudi — table format evaluators
- Stars on unitycatalog/unitycatalog, projectnessie/nessie, apache/polaris — catalog evaluators
- Stars on trinodb/trino, apache/flink, apache/spark — compute engine users
- Stars on dbt-labs/dbt-core, TobikoData/sqlmesh, SDF-Labs/sdf — transformation engineers
- Issues or PRs mentioning "Iceberg REST catalog", "table format migration", "partition evolution"
- Keyword mentions: "data lakehouse", "open table format", "ACID transactions", "time travel queries", "schema evolution"
- Stars on apache/iceberg-python (PyIceberg), apache/iceberg-go — language-specific SDK evaluators
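The star signals above amount to a lookup from repository to evaluator segment. A minimal sketch of that lookup follows; the `REPO_SEGMENTS` table and the `segments_for` helper are hypothetical names chosen for illustration, and the table covers only a subset of the repos listed.

```python
# Hypothetical mapping from starred repos to the evaluator segments named above.
REPO_SEGMENTS = {
    "apache/iceberg": "table format",
    "delta-io/delta": "table format",
    "apache/hudi": "table format",
    "unitycatalog/unitycatalog": "catalog",
    "projectnessie/nessie": "catalog",
    "apache/polaris": "catalog",
    "trinodb/trino": "compute engine",
    "apache/flink": "compute engine",
    "apache/spark": "compute engine",
    "dbt-labs/dbt-core": "transformation",
    "TobikoData/sqlmesh": "transformation",
    "SDF-Labs/sdf": "transformation",
}

# GitHub repo names are case-insensitive, so compare on lowercase keys.
_LOOKUP = {name.lower(): segment for name, segment in REPO_SEGMENTS.items()}


def segments_for(starred_repos):
    """Return the sorted, de-duplicated segments a user's starred repos fall into."""
    return sorted({_LOOKUP[r.lower()] for r in starred_repos if r.lower() in _LOOKUP})
```

A user who starred both a table format repo and a catalog repo lands in two segments, which may be a stronger signal than either star alone.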
Key Repos to Track for Lakehouse Signal Capture
Add these repos to your GitLeads tracked repositories to capture data lakehouse signals continuously:
- apache/iceberg — the primary Apache Iceberg repo; stars signal format adoption
- delta-io/delta — Delta Lake core; Databricks ecosystem signal
- apache/hudi — Hudi format; AWS ecosystem signal
- unitycatalog/unitycatalog — Databricks open-source catalog
- projectnessie/nessie — Git-for-data catalog (Dremio ecosystem)
- apache/polaris (incubating) — Snowflake-contributed Iceberg REST catalog
- dbt-labs/dbt-core — the dominant transformation layer
- TobikoData/sqlmesh — dbt alternative; evaluators are tech-forward data teams
- apache/gravitino — metadata lake and federated catalog originally developed by Datastrato
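How GitLeads captures stars internally is not described here. If you wanted to approximate "new stars" yourself, one option is the GitHub REST API stargazers endpoint (`GET /repos/{owner}/{repo}/stargazers`), which includes a `starred_at` timestamp when requested with the `application/vnd.github.star+json` media type. The sketch below assumes you have already fetched and JSON-decoded such a response; `new_stargazers` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def new_stargazers(payload, since):
    """Return logins whose star is newer than `since`.

    `payload` is a decoded stargazers response in the star+json shape:
    a list of {"starred_at": "...Z", "user": {"login": "..."}} entries.
    `since` must be a timezone-aware datetime.
    """
    logins = []
    for entry in payload:
        # GitHub timestamps are ISO 8601 in UTC with a trailing "Z".
        starred_at = datetime.strptime(
            entry["starred_at"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        if starred_at > since:
            logins.append(entry["user"]["login"])
    return logins
```

Storing the timestamp of the last poll per repo and passing it as `since` keeps each run limited to stars that arrived after the previous one.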
Keyword Signals to Monitor in GitHub Issues and Code
- "iceberg REST catalog" OR "iceberg catalog" — platform integration signal
- "table format migration" OR "migrate to iceberg" — active migration project
- "partition evolution" OR "schema evolution" — power user signal
- "data lakehouse" OR "open lakehouse" — architecture evaluation
- "ACID transactions" OR "merge-on-read" OR "copy-on-write" — format decision signal
- "Unity Catalog" OR "HMS" OR "Glue catalog" — catalog evaluation signal
- "dbt incremental" OR "dbt model" — active dbt engineering
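The keyword groups above can be read as a small rule table: each signal fires when any of its phrases appears in an issue, PR, or code snippet. The sketch below is one way to express that matching, assuming case-insensitive whole-phrase matching; `KEYWORD_SIGNALS` and `match_signals` are hypothetical names, not part of any GitLeads API.

```python
import re

# Signal label -> phrases joined by OR in the list above.
KEYWORD_SIGNALS = {
    "platform integration": ["iceberg REST catalog", "iceberg catalog"],
    "active migration": ["table format migration", "migrate to iceberg"],
    "power user": ["partition evolution", "schema evolution"],
    "architecture evaluation": ["data lakehouse", "open lakehouse"],
    "format decision": ["ACID transactions", "merge-on-read", "copy-on-write"],
    "catalog evaluation": ["Unity Catalog", "HMS", "Glue catalog"],
    "active dbt engineering": ["dbt incremental", "dbt model"],
}


def match_signals(text):
    """Return the signal labels whose phrases occur in `text`, in table order."""
    hits = []
    for signal, phrases in KEYWORD_SIGNALS.items():
        for phrase in phrases:
            # Lookarounds keep short terms like "HMS" from matching inside
            # longer words such as "algorithms".
            pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
            if re.search(pattern, text, flags=re.IGNORECASE):
                hits.append(signal)
                break  # one hit per signal is enough
    return hits
```

Short acronyms such as "HMS" remain noisy even with word boundaries, so it may be worth requiring a second signal before acting on them.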
What Data Lakehouse Engineers Buy
This audience controls or strongly influences decisions in:
- Managed Iceberg table services (Tabular, acquired by Databricks in 2024; Snowflake Open Catalog; Iceberg tables in AWS Glue)
- Lakehouse query engines (Starburst Enterprise, Starburst Galaxy, Dremio Cloud)
- Data catalog platforms (Atlan, DataHub, Alation, Collibra)
- dbt Cloud — the managed version of dbt-core they're already using
- ETL/ELT pipelines (Airbyte, Fivetran, dltHub)
- Table maintenance and storage optimization tools (Iceberg compaction, Delta Lake OPTIMIZE services)
- Data observability platforms (Monte Carlo, Elementary, Bigeye)
Routing Lakehouse Leads to Your Sales Stack
- Iceberg repo star + company email from data/cloud domain → HubSpot deal + data team AE
- Unity Catalog or Nessie keyword → Salesforce account match to check whether the lead belongs to an existing enterprise account
- dbt-core star with public email → Clay enrichment + Smartlead sequence for dbt Cloud pitch
- High-follower data engineer → Slack alert for DevRel partnership outreach
- SQLMesh or SDF star → early-adopter signal; fast-track to founder sales call
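The routing rules above can be sketched as a plain rule function that turns a lead into a list of actions. This is an illustration only: the `Lead` fields, the `route` function, and the follower threshold are assumptions, the email rules are simplified to "any known email", and the returned strings are labels rather than calls to HubSpot, Salesforce, Clay, Smartlead, or Slack.

```python
from dataclasses import dataclass


@dataclass
class Lead:
    repo: str = ""       # repo the user starred, if any
    keyword: str = ""    # keyword phrase that matched, if any
    email: str = ""      # public or company email, if known
    followers: int = 0   # GitHub follower count


# Assumption: the article does not define "high-follower"; 1000 is a placeholder.
HIGH_FOLLOWER_THRESHOLD = 1000


def route(lead):
    """Return the routing actions that apply to a lead, in rule order."""
    actions = []
    repo = lead.repo.lower()
    keyword = lead.keyword.lower()
    if repo in ("tobikodata/sqlmesh", "sdf-labs/sdf"):
        actions.append("fast-track to founder sales call")
    if repo == "apache/iceberg" and lead.email:
        actions.append("HubSpot deal + data team AE")
    if "unity catalog" in keyword or "nessie" in keyword:
        actions.append("Salesforce account match")
    if repo == "dbt-labs/dbt-core" and lead.email:
        actions.append("Clay enrichment + Smartlead sequence")
    if lead.followers >= HIGH_FOLLOWER_THRESHOLD:
        actions.append("Slack alert for DevRel outreach")
    return actions
```

Returning a list rather than a single destination lets one lead trigger several routes, for example a dbt-core stargazer who is also a high-follower engineer.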