
Snowflake pg_lake: Open-Source Postgres Data Lake Tool

Snowflake Just Open-Sourced a Tool That Makes Postgres More Powerful

Snowflake Labs dropped pg_lake on November 4, and it’s already got over 1,000 GitHub stars. The extension lets you query Apache Iceberg data lakes directly from PostgreSQL—no ETL, no data duplication, no separate analytics system. Just Postgres.

Here’s the twist: Snowflake built its business selling proprietary data warehouses. Now they’re open-sourcing tools that make Postgres—their competitor—better at analytics. The question isn’t just what pg_lake does. It’s why Snowflake is doing this.

What pg_lake Actually Does

Most developers hit this wall eventually: operational data lives in Postgres, but analytics needs a data warehouse or data lake. Cue the ETL pipelines, the data replication, the version conflicts, the latency. It’s a mess.

pg_lake tears down that wall. It’s a Postgres extension that integrates Apache Iceberg—the open table format that’s becoming the data lake standard—directly into your database. Query Parquet files in S3? Done. Create transactional Iceberg tables from Postgres? Done. Join your operational tables with analytical datasets in one SQL statement? Also done.
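
Here’s roughly what that looks like in practice. Treat this as a sketch rather than a reference: the table, columns, and join below are invented, and the USING iceberg clause follows the syntax Crunchy Data Warehouse documented, so check the pg_lake docs for the exact spelling:

-- Create a transactional Iceberg table; its data files live in object storage.
CREATE TABLE events_lake (
    event_id   bigint,
    user_id    bigint,
    event_type text,
    created_at timestamptz
) USING iceberg;

-- Join a regular operational table with the Iceberg table in one statement.
SELECT u.email, count(*) AS events
FROM users u
JOIN events_lake e ON e.user_id = u.id
GROUP BY u.email
ORDER BY events DESC
LIMIT 10;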

Under the hood, pg_lake uses a clever architecture. Postgres handles planning and transactions, while a separate process called pgduck_server embeds DuckDB, a fast analytical query engine, and talks to Postgres over the standard wire protocol. Analytical query execution gets handed off to DuckDB, so you get Postgres’s familiar interface with DuckDB’s columnar performance. The result is what people are calling a “lakehouse”: operational and analytical workloads in one system.

This isn’t vaporware. The codebase is mature: Crunchy Data, which Snowflake acquired earlier this year, has been running it in production as Crunchy Data Warehouse since 2024. The November 4 release is the open-source debut of production-proven technology.

Why Developers Should Care

The practical value is immediate. For instance: need to analyze Parquet files sitting in S3 without importing them? Create a foreign table in Postgres and let pg_lake auto-detect the schema:

-- Empty column list: pg_lake infers the schema from the Parquet files themselves.
CREATE FOREIGN TABLE sales_data ()
SERVER pg_lake
OPTIONS (path 's3://mybucket/sales/*.parquet');

-- Query the S3 data in place, like any other Postgres table.
SELECT region, SUM(revenue) FROM sales_data GROUP BY region;

That’s it. No schema definition, no data movement, no separate query engine to learn.

The bigger win is architectural simplification. Startups can delay expensive data warehouse adoption. Mid-size teams can eliminate ETL complexity. Furthermore, Iceberg tables you create in Postgres are readable by Spark, Trino, and Snowflake itself—true interoperability without vendor lock-in.
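
That openness extends to plain files as well. As a rough sketch (assuming pg_lake keeps the COPY-to-object-storage support Crunchy Data Warehouse shipped; the bucket path is invented), you can push query results straight to Parquet that Spark, Trino, DuckDB, or Snowflake can read:

-- Export a query result as a Parquet file in S3; the format is assumed to be
-- inferred from the .parquet extension. Any Parquet-aware engine can read it.
COPY (
    SELECT region, SUM(revenue) AS revenue
    FROM sales_data
    GROUP BY region
) TO 's3://mybucket/exports/revenue_by_region.parquet';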

The Snowflake Paradox

Now the interesting part. Snowflake sells data warehouses. Postgres plus data lakes is the open-source alternative to Snowflake. So why would they make the alternative better?

The most obvious answer: they bought Crunchy Data in June 2025. Crunchy built pg_lake. Open-sourcing it generates ecosystem goodwill and drives adoption. Classic post-acquisition PR.

But there’s a deeper play. Snowflake has been pushing Apache Iceberg support since 2023. More tools writing Iceberg data means more data Snowflake can ingest. By making Postgres—the world’s most popular open-source database—an Iceberg citizen, they’re essentially expanding the pool of Iceberg-compatible data. When those mid-size companies eventually need Snowflake’s scale, the data is already in a format Snowflake understands.

It’s also a challenge to BigQuery and Redshift, neither of which has an open-source equivalent. Snowflake gets to position itself as the “open” data platform while Google and AWS are stuck with proprietary systems. Smart positioning.

Or maybe it’s simpler: Snowflake knows customers use multiple databases. Fighting that reality is futile. Better to be the company that helps you connect everything than the one that tries to lock you in.

What This Means for Data Architecture

pg_lake fits into a bigger trend: the lakehouse. Data warehouses gave you fast queries but expensive, proprietary storage. Data lakes gave you cheap object storage but required separate query engines. Lakehouse architectures combine both—cheap storage, fast queries, open formats.

Databricks has been pushing this with Delta Lake. Snowflake is doing it with Iceberg support across platforms. Now Postgres developers can do it without buying either.

The industry is also coalescing around open table formats. Iceberg, created at Netflix and backed by companies like Apple and LinkedIn, is winning that race. Tools like pg_lake accelerate the shift. Consequently, if you’re starting a new analytics project today, betting on proprietary formats is risky. Iceberg gives you portability.

Should You Use It?

pg_lake is three days old. Exercise appropriate caution.

That said, it’s built on technology that’s been in production since 2024. The modular design—ten separate extensions, each handling a specific function—suggests professional engineering. The Apache 2.0 license means you can fork it if Snowflake loses interest.

If you’re running Postgres and hitting the analytics wall, pg_lake is worth testing. The GitHub repo has comprehensive docs, architecture diagrams, and Docker setup for quick experiments. Just don’t bet the company on it until the ecosystem matures.
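
Once the Docker environment is up, a minimal smoke test might look like the following. Same caveat as earlier: the USING iceberg syntax is borrowed from Crunchy Data Warehouse and the table is a throwaway, so adjust to whatever the repo’s docs actually show:

-- Hypothetical smoke test: round-trip a row through an Iceberg-backed table.
CREATE TABLE smoke_test (id int, note text) USING iceberg;
INSERT INTO smoke_test VALUES (1, 'hello lakehouse');
SELECT * FROM smoke_test;
DROP TABLE smoke_test;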

What’s clear is this: the wall between operational and analytical databases is crumbling. Whether pg_lake becomes the standard or just accelerates the trend, the direction is set. One database, multiple storage formats, no ETL tax. That future just got closer.
