Add IcebergDocument as one type of the operator result storage (#3147)
# Implement Apache Iceberg for Result Storage

<img width="556" alt="Screenshot 2025-01-06 at 3 18 19 PM" src="https://github.com/user-attachments/assets/4edadb64-ee28-48ee-8d3c-1d1891d69d6a" />

## How to Enable Iceberg Result Storage

1. Update `storage-config.yaml`:
   - Set `result-storage-mode` to `iceberg`.

## Major Changes

- **Introduced `IcebergDocument`**: a thread-safe `VirtualDocument` implementation for storing and reading results in Iceberg tables.
- **Introduced `IcebergTableWriter`**: an append-only writer for Iceberg tables with a configurable buffer size.
- **Catalog and data storage for Iceberg**: uses the local file system (`file:/`) via `HadoopCatalog` and `HadoopFileIO`, so Iceberg operates without relying on external storage services.
- Added a new `workerId` parameter to `ProgressiveSinkOpExec`; each result storage writer takes this `workerId` as a new parameter.

## Dependencies

- Added Apache Iceberg-related libraries.
- Introduced Hadoop-related libraries to support Iceberg's `HadoopCatalog` and `HadoopFileIO`. These libraries are used for placeholder configuration and do not introduce a runtime dependency on HDFS.

## Overview of Iceberg Components

### `IcebergDocument`

- Manages reading and organizing data in Iceberg tables.
- Supports iterator-based incremental reads, with thread-safe operations for reading and clearing data.
- Initializes or overrides the Iceberg table during construction.

### `IcebergTableWriter`

- Writes data as immutable Parquet files in an append-only manner.
- Each writer uniquely prefixes its files to avoid conflicts (`workerIndex_fileIndex` format).
- Not thread-safe; single-threaded access is recommended.

## Data Storage via Iceberg Tables

- **Write**:
  - Tables are created per storage key.
  - Writers append Parquet files to the table, ensuring immutability (a rough append sketch follows this commit message).
- **Read**:
  - Readers use `IcebergDocument.get` to fetch data via an iterator.
  - The iterator reads data incrementally while ensuring the data order matches the commit sequence of the data files.

## Data Reading Using File Metadata

Data files are read using `getUsingFileSequenceOrder`, which:

- Retrieves file scan tasks (`FileScanTask`) from the table metadata and sorts them by sequence number (see the ordering sketch after this commit message).
- Reads records sequentially, skipping files or records as needed.
- Supports range-based reading (`from`, `until`) and incremental reads.
- Sorting ensures data consistency and order preservation.

## Hadoop Usage Without HDFS

- The `HadoopCatalog` uses an empty Hadoop configuration, defaulting to the local file system (`file:/`).
- This enables management of Iceberg tables on local or network file systems without requiring HDFS infrastructure.

---------

Co-authored-by: Shengquan Ni <[email protected]>
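The write path described above can be illustrated with a minimal sketch using Iceberg's generic data module. This is not the PR's `IcebergTableWriter`: the `AppendSketch` helper, the UUID-based file name, and the `Map[String, AnyRef]` row shape are made up for illustration, the table is assumed to be unpartitioned, and exact API details may differ across Iceberg versions.

```scala
import org.apache.iceberg.{FileFormat, Table}
import org.apache.iceberg.data.{GenericAppenderFactory, GenericRecord}
import org.apache.iceberg.encryption.EncryptedFiles

// Hypothetical helper: appends one immutable Parquet data file to an Iceberg table.
object AppendSketch {
  def appendRows(table: Table, rows: Seq[Map[String, AnyRef]]): Unit = {
    val appenderFactory = new GenericAppenderFactory(table.schema(), table.spec())

    // Every data file needs a unique name; the PR prefixes files with
    // `workerIndex_fileIndex`, a UUID stands in for that here.
    val path = table.locationProvider().newDataLocation(s"${java.util.UUID.randomUUID()}.parquet")
    val outputFile = EncryptedFiles.plainAsEncryptedOutput(table.io().newOutputFile(path))

    // null partition: assumes an unpartitioned result table.
    val writer = appenderFactory.newDataWriter(outputFile, FileFormat.PARQUET, null)
    try {
      rows.foreach { row =>
        val record = GenericRecord.create(table.schema())
        row.foreach { case (name, value) => record.setField(name, value) }
        writer.write(record)
      }
    } finally {
      writer.close()
    }

    // Committing the append makes the new Parquet file visible to readers;
    // existing files are never modified.
    table.newAppend().appendFile(writer.toDataFile()).commit()
  }
}
```

The ordering idea behind `getUsingFileSequenceOrder` can likewise be sketched as sorting the planned `FileScanTask`s by the sequence number Iceberg assigns at commit time. This is a simplification, assuming an Iceberg version where `ContentFile` exposes `dataSequenceNumber()`; the real method also handles `from`/`until` ranges and incremental reads.

```scala
import scala.jdk.CollectionConverters._
import org.apache.iceberg.{FileScanTask, Table}

object OrderedScanSketch {
  // Plans the table's data files and returns their scan tasks in commit order.
  def planTasksInCommitOrder(table: Table): Seq[FileScanTask] = {
    // A real implementation should close the CloseableIterable returned by planFiles().
    val tasks = table.newScan().planFiles().iterator().asScala.toSeq
    // Files without a sequence number (older metadata) are pushed to the end.
    tasks.sortBy { task =>
      Option(task.file().dataSequenceNumber()).map(_.longValue()).getOrElse(Long.MaxValue)
    }
  }
}
```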
1 parent a1186f8 · commit 7debf45
Showing 21 changed files with 1,525 additions and 18 deletions.
### `.../workflow-core/src/main/scala/edu/uci/ics/amber/core/storage/IcebergCatalogInstance.scala` (43 additions, 0 deletions)
```scala
package edu.uci.ics.amber.core.storage

import edu.uci.ics.amber.util.IcebergUtil
import org.apache.iceberg.catalog.Catalog

/**
  * IcebergCatalogInstance is a singleton that manages the Iceberg catalog instance.
  * - Provides a single shared catalog for all Iceberg table-related operations in the Texera application.
  * - Lazily initializes the catalog on first access.
  * - Supports replacing the catalog instance, primarily for testing or reconfiguration.
  */
object IcebergCatalogInstance {

  private var instance: Option[Catalog] = None

  /**
    * Retrieves the singleton Iceberg catalog instance.
    * - If the catalog is not initialized, it is lazily created using the configured properties.
    * @return the Iceberg catalog instance.
    */
  def getInstance(): Catalog = {
    instance match {
      case Some(catalog) => catalog
      case None =>
        val hadoopCatalog = IcebergUtil.createHadoopCatalog(
          "texera_iceberg",
          StorageConfig.fileStorageDirectoryPath
        )
        instance = Some(hadoopCatalog)
        hadoopCatalog
    }
  }

  /**
    * Replaces the existing Iceberg catalog instance.
    * - This method is useful for testing or dynamically updating the catalog.
    *
    * @param catalog the new Iceberg catalog instance to replace the current one.
    */
  def replaceInstance(catalog: Catalog): Unit = {
    instance = Some(catalog)
  }
}
```
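A usage sketch of the singleton above (not taken from the PR): it creates a result table through the shared catalog and shows how a test might swap in a different catalog. The `CatalogUsageSketch` object, the `results` namespace, the `some_storage_key` table name, and the `/tmp/iceberg-test-warehouse` path are made up; `HadoopCatalog` with an empty `Configuration` falls back to the local file system (`file:/`), matching the "Hadoop Usage Without HDFS" note in the commit message.

```scala
import edu.uci.ics.amber.core.storage.IcebergCatalogInstance
import org.apache.hadoop.conf.Configuration
import org.apache.iceberg.Schema
import org.apache.iceberg.catalog.TableIdentifier
import org.apache.iceberg.hadoop.HadoopCatalog
import org.apache.iceberg.types.Types

object CatalogUsageSketch {
  def main(args: Array[String]): Unit = {
    // Obtain the lazily created, process-wide catalog.
    val catalog = IcebergCatalogInstance.getInstance()

    // Create a table keyed by a (hypothetical) storage key.
    val schema = new Schema(
      Types.NestedField.required(1, "id", Types.LongType.get()),
      Types.NestedField.optional(2, "payload", Types.StringType.get())
    )
    catalog.createTable(TableIdentifier.of("results", "some_storage_key"), schema)

    // Tests can replace the catalog, e.g. with a HadoopCatalog rooted at a temporary
    // directory; an empty Hadoop Configuration defaults to the local file system.
    val testCatalog = new HadoopCatalog(new Configuration(), "/tmp/iceberg-test-warehouse")
    IcebergCatalogInstance.replaceInstance(testCatalog)
  }
}
```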