Sponsored Content
The landscape of big data analytics is constantly evolving, with organizations seeking more flexible, scalable, and cost-effective ways to manage and analyze vast amounts of data. This pursuit has led to the rise of the data lakehouse paradigm, which combines the low-cost storage and flexibility of data lakes with the data management capabilities and transactional consistency of data warehouses. At the heart of this revolution are open table formats like Apache Iceberg and powerful processing engines like Apache Spark, all empowered by the robust infrastructure of Google Cloud.
The Rise of Apache Iceberg: A Game-Changer for Data Lakes
For years, data lakes, typically built on cloud object storage like Google Cloud Storage (GCS), offered unparalleled scalability and cost efficiency. However, they often lacked the crucial features found in traditional data warehouses, such as transactional consistency, schema evolution, and performance optimizations for analytical queries. This is where Apache Iceberg shines.
Apache Iceberg is an open table format designed to address these limitations. It sits on top of your data files (like Parquet, ORC, or Avro) in cloud storage, providing a layer of metadata that transforms a collection of files into a high-performance, SQL-like table. Here’s what makes Iceberg so powerful:
- ACID Compliance: Iceberg brings Atomicity, Consistency, Isolation, and Durability (ACID) properties to your data lake. This means that data writes are transactional, ensuring data integrity even with concurrent operations. No more partial writes or inconsistent reads.
- Schema Evolution: One of the biggest pain points in traditional data lakes is managing schema changes. Iceberg handles schema evolution seamlessly, allowing you to add, drop, rename, or reorder columns without rewriting the underlying data. This is critical for agile data development.
- Hidden Partitioning: Iceberg intelligently manages partitioning, abstracting away the physical layout of your data. Users no longer need to know the partitioning scheme to write efficient queries, and you can evolve your partitioning strategy over time without data migrations.
- Time Travel and Rollback: Iceberg maintains a complete history of table snapshots. This enables “time travel” queries, allowing you to query data as it existed at any point in the past. It also provides rollback capabilities, letting you revert a table to a previous good state, which is invaluable for debugging and data recovery (a short sketch follows below).
- Performance Optimizations: Iceberg’s rich metadata allows query engines to prune irrelevant data files and partitions efficiently, significantly accelerating query execution. It avoids costly file listing operations, directly jumping to the relevant data based on its metadata.
By providing these data warehouse-like features on top of a data lake, Apache Iceberg enables the creation of a true “data lakehouse,” offering the best of both worlds: the flexibility and cost-effectiveness of cloud storage combined with the reliability and performance of structured tables.
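To make snapshots and time travel concrete, here is a minimal, self-contained sketch using Spark SQL with Iceberg. The catalog name (demo), local warehouse path, and table are illustrative assumptions, and the snippet expects the Iceberg Spark runtime package on the classpath; the Google Cloud-specific catalog setup is covered later in this article.
Python
from pyspark.sql import SparkSession

# Local Spark session with an Iceberg catalog backed by a local warehouse path.
# Assumes the iceberg-spark-runtime package is available on the classpath.
spark = (
    SparkSession.builder
    .appName("iceberg-time-travel-sketch")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

spark.sql("CREATE NAMESPACE IF NOT EXISTS demo.db")
spark.sql("CREATE TABLE IF NOT EXISTS demo.db.events (id BIGINT, name STRING) USING iceberg")
spark.sql("INSERT INTO demo.db.events VALUES (1, 'first')")

# Every commit produces a snapshot; the snapshots metadata table lists them.
latest = spark.sql(
    "SELECT snapshot_id FROM demo.db.events.snapshots ORDER BY committed_at DESC LIMIT 1"
).first()[0]

# Time travel: read the table as of a specific snapshot (TIMESTAMP AS OF also works).
spark.sql(f"SELECT * FROM demo.db.events VERSION AS OF {latest}").show()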
Google Cloud’s BigLake tables for Apache Iceberg in BigQuery offer a fully managed table experience similar to standard BigQuery tables, while all of the data is stored in customer-owned storage buckets. Supported features include:
- Table mutations via GoogleSQL data manipulation language (DML)
- Unified batch and high throughput streaming using the Storage Write API through BigLake connectors such as Spark
- Iceberg V2 snapshot export and automatic refresh on each table mutation
- Schema evolution to update column metadata
- Automatic storage optimization
- Time travel for historical data access
- Column-level security and data masking
Here’s an example of how to create an empty BigLake Iceberg table using GoogleSQL:
SQL
CREATE TABLE PROJECT_ID.DATASET_ID.my_iceberg_table (
  name STRING,
  id INT64
)
WITH CONNECTION PROJECT_ID.REGION.CONNECTION_ID
OPTIONS (
  file_format = "PARQUET",
  table_format = "ICEBERG",
  storage_uri = "gs://BUCKET/PATH");
You can then load data into the table, either with LOAD DATA INTO to import data from files or with INSERT INTO ... SELECT to copy rows from another table.
SQL
# Load from file
LOAD DATA INTO PROJECT_ID.DATASET_ID.my_iceberg_table
FROM FILES (
uris=['gs://bucket/path/to/data'],
format="PARQUET");
# Load from table
INSERT INTO PROJECT_ID.DATASET_ID.my_iceberg_table
SELECT name, id
FROM PROJECT_ID.DATASET_ID.source_table
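Once loaded, the table supports mutations and historical reads through GoogleSQL. As a minimal sketch using the BigQuery Python client library (the project, dataset, and table names from the example above are placeholders, and the time-travel query assumes the table has existed for at least an hour):
Python
from google.cloud import bigquery

# Placeholder project and table names; substitute your own.
client = bigquery.Client(project="PROJECT_ID")
table = "PROJECT_ID.DATASET_ID.my_iceberg_table"

# DML mutation: committed atomically against the Iceberg table.
client.query(f"UPDATE `{table}` SET name = 'renamed' WHERE id = 1").result()

# Time travel: read the table as it looked one hour ago.
rows = client.query(
    f"SELECT name, id FROM `{table}` "
    "FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)"
).result()
for row in rows:
    print(row.name, row.id)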
In addition to the fully-managed offering, Apache Iceberg is also supported as a read-only external table in BigQuery. Use this to point BigQuery at an existing path that already contains Iceberg data files.
SQL
CREATE OR REPLACE EXTERNAL TABLE PROJECT_ID.DATASET_ID.my_external_iceberg_table
WITH CONNECTION PROJECT_ID.REGION.CONNECTION_ID
OPTIONS (
  format = "ICEBERG",
  uris = ['gs://BUCKET/PATH/TO/DATA'],
  require_partition_filter = FALSE);
Apache Spark: The Engine for Data Lakehouse Analytics
While Apache Iceberg provides the structure and management for your data lakehouse, Apache Spark is the processing engine that brings it to life. Spark is a powerful open-source, distributed processing system renowned for its speed, versatility, and ability to handle diverse big data workloads. Spark’s in-memory processing, robust ecosystem of tools including ML and SQL-based processing, and deep Iceberg support make it an excellent choice.
Apache Spark is deeply integrated into the Google Cloud ecosystem. Benefits of using Apache Spark on Google Cloud include:
- Access to a true serverless Spark experience without cluster management using Google Cloud Serverless for Apache Spark (a submission sketch follows this list).
- Fully managed Spark experience with flexible cluster configuration and management via Dataproc.
- Accelerate Spark jobs using the new Lightning Engine for Apache Spark preview feature.
- Configure your runtime with GPUs and drivers preinstalled.
- Run AI/ML jobs using a robust set of libraries available by default in Spark runtimes, including XGBoost, PyTorch and Transformers.
- Write PySpark code directly inside BigQuery Studio via Colab Enterprise notebooks along with Gemini-powered PySpark code generation.
- Easily connect to your data in BigQuery native tables, BigLake Iceberg tables, external tables, and GCS.
- Integration with Vertex AI for end-to-end MLOps.
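As an example of the serverless option, here is a minimal sketch that submits a PySpark job to Google Cloud Serverless for Apache Spark using the google-cloud-dataproc client library. The project, region, bucket, and script path are placeholders.
Python
from google.cloud import dataproc_v1

# Placeholders; substitute your own project, region, and job script location.
project_id = "PROJECT_ID"
region = "REGION"

client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

batch = dataproc_v1.Batch(
    pyspark_batch=dataproc_v1.PySparkBatch(
        main_python_file_uri="gs://BUCKET/jobs/iceberg_job.py"
    )
)

# Submit the batch; the service provisions and tears down Spark infrastructure for you.
operation = client.create_batch(
    parent=f"projects/{project_id}/locations/{region}",
    batch=batch,
    batch_id="iceberg-demo-batch",
)
print(operation.result().state)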
Iceberg + Spark: Better Together
Together, Iceberg and Spark form a potent combination for building performant and reliable data lakehouses. Spark can leverage Iceberg’s metadata to optimize query plans, perform efficient data pruning, and ensure transactional consistency across your data lake.
Your Iceberg tables and BigQuery native tables are accessible via BigLake metastore. This exposes your tables to open source engines such as Spark while keeping them compatible with BigQuery.
Python
from pyspark.sql import SparkSession
# Create a spark session
spark = SparkSession.builder \
.appName("BigLake Metastore Iceberg") \
.config("spark.sql.catalog.CATALOG_NAME", "org.apache.iceberg.spark.SparkCatalog") \
.config("spark.sql.catalog.CATALOG_NAME.catalog-impl", "org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog") \
.config("spark.sql.catalog.CATALOG_NAME.gcp_project", "PROJECT_ID") \
.config("spark.sql.catalog.CATALOG_NAME.gcp_location", "LOCATION") \
.config("spark.sql.catalog.CATALOG_NAME.warehouse", "WAREHOUSE_DIRECTORY") \
.getOrCreate()
# Required by the Spark BigQuery connector to read query results and views
spark.conf.set("viewsEnabled", "true")

# Use the BigLake metastore catalog
spark.sql("USE `CATALOG_NAME`;")
spark.sql("USE NAMESPACE DATASET_NAME;")

# Configure a dataset for materializing temporary query results
spark.sql("CREATE NAMESPACE IF NOT EXISTS MATERIALIZATION_NAMESPACE")
spark.conf.set("materializationDataset", "MATERIALIZATION_NAMESPACE")

# List the tables in the dataset
df = spark.sql("SHOW TABLES;")
df.show()
# Query the tables
sql = """SELECT * FROM DATASET_NAME.TABLE_NAME"""
df = spark.read.format("bigquery").load(sql)
df.show()
sql = """SELECT * FROM DATASET_NAME.ICEBERG_TABLE_NAME"""
df = spark.read.format("bigquery").load(sql)
df.show()
sql = """SELECT * FROM DATASET_NAME.READONLY_ICEBERG_TABLE_NAME"""
df = spark.read.format("bigquery").load(sql)
df.show()
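With the metastore catalog configured, you can also create and write Iceberg tables directly from Spark. A hedged sketch, continuing with the Spark session above and assuming you have permission to create tables in the dataset (the table and column names are illustrative):
Python
# Create a new Iceberg table in the current catalog/namespace (names are placeholders).
spark.sql("""
  CREATE TABLE IF NOT EXISTS DATASET_NAME.new_iceberg_table (id BIGINT, name STRING)
  USING iceberg
""")

# Append a small DataFrame; Iceberg commits the write as a new table snapshot.
df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "name"])
df.writeTo("DATASET_NAME.new_iceberg_table").append()

spark.sql("SELECT * FROM DATASET_NAME.new_iceberg_table").show()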
The Iceberg REST catalog (in preview) extends BigLake metastore, giving any data processing engine access to your Iceberg data. Here’s how to connect to it using Spark:
Python
import google.auth
from google.auth.transport.requests import Request
from google.oauth2 import service_account
import pyspark
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
catalog = ""
spark = SparkSession.builder.appName("") \
.config("spark.sql.defaultCatalog", catalog) \
.config(f"spark.sql.catalog.{catalog}", "org.apache.iceberg.spark.SparkCatalog") \
.config(f"spark.sql.catalog.{catalog}.type", "rest") \
.config(f"spark.sql.catalog.{catalog}.uri",
"https://biglake.googleapis.com/iceberg/v1beta/restcatalog") \
.config(f"spark.sql.catalog.{catalog}.warehouse", "gs://") \
.config(f"spark.sql.catalog.{catalog}.token", "") \
.config(f"spark.sql.catalog.{catalog}.oauth2-server-uri", "https://oauth2.googleapis.com/token") \ .config(f"spark.sql.catalog.{catalog}.header.x-goog-user-project", "") \ .config("spark.sql.extensions","org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
.config(f"spark.sql.catalog.{catalog}.io-impl","org.apache.iceberg.hadoop.HadoopFileIO") \ .config(f"spark.sql.catalog.{catalog}.rest-metrics-reporting-enabled", "false") \
.getOrCreate()
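Once the session is created, standard Spark SQL works against the REST catalog. A short sketch, continuing from the session above (the namespace and table names are illustrative, and creating tables requires appropriate permissions on the warehouse bucket):
Python
# Browse what the REST catalog exposes.
spark.sql("SHOW NAMESPACES").show()

# Create an Iceberg table and query it (placeholder namespace/table names).
spark.sql("CREATE NAMESPACE IF NOT EXISTS demo_ns")
spark.sql("CREATE TABLE IF NOT EXISTS demo_ns.events (id BIGINT, name STRING) USING iceberg")
spark.sql("INSERT INTO demo_ns.events VALUES (1, 'first')")
spark.sql("SELECT * FROM demo_ns.events").show()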
Completing the Lakehouse
Google Cloud provides a comprehensive suite of services that complement Apache Iceberg and Apache Spark, enabling you to build, manage, and scale your data lakehouse with ease while leveraging many of the open-source technologies you already use:
- Dataplex Universal Catalog: Dataplex Universal Catalog provides a unified data fabric for managing, monitoring, and governing your data across data lakes, data warehouses, and data marts. It integrates with BigLake Metastore, ensuring that governance policies are consistently enforced across your Iceberg tables, and enabling capabilities like semantic search, data lineage, and data quality checks.
- Google Cloud Managed Service for Apache Kafka: Run fully-managed Kafka clusters on Google Cloud, including Kafka Connect. Data streams can be written directly into BigQuery, including into managed Iceberg tables, with low-latency reads.
- Cloud Composer: A fully managed workflow orchestration service built on Apache Airflow (see the DAG sketch after this list).
- Vertex AI: Use Vertex AI to manage the full end-to-end ML Ops experience. You can also use Vertex AI Workbench for a managed JupyterLab experience to connect to your serverless Spark and Dataproc instances.
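For instance, a Cloud Composer DAG can orchestrate the serverless Spark batch shown earlier. A minimal sketch using the Google provider's DataprocCreateBatchOperator (the project, region, and script path are placeholders):
Python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocCreateBatchOperator

# Placeholders: substitute your own project, region, and PySpark job location.
with DAG(
    dag_id="iceberg_lakehouse_pipeline",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    run_spark_batch = DataprocCreateBatchOperator(
        task_id="run_iceberg_spark_batch",
        project_id="PROJECT_ID",
        region="REGION",
        batch={
            "pyspark_batch": {"main_python_file_uri": "gs://BUCKET/jobs/iceberg_job.py"}
        },
        batch_id="iceberg-job-{{ ds_nodash }}",
    )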
Conclusion
The combination of Apache Iceberg and Apache Spark on Google Cloud offers a compelling solution for building modern, high-performance data lakehouses. Iceberg provides the transactional consistency, schema evolution, and performance optimizations that were historically missing from data lakes, while Spark offers a versatile and scalable engine for processing these large datasets.
To learn more, check out our free webinar on July 8th at 11AM PST where we’ll dive deeper into using Apache Spark and supporting tools on Google Cloud.
Author: Brad Miro, Senior Developer Advocate – Google