Cloudera recently announced the general availability of Apache Iceberg in the Cloudera Data Platform (CDP). This article provides some background on data lake storage, the challenges of organizing data within data lake storage, the emergence of Apache Iceberg as the standard for managing data in data lakes, and finally, the benefits for existing and potential CDP users.
Challenges of organizing data within data lake storage
Data lakes deliver virtually unlimited storage for structured and unstructured data. A data lake is a shared data repository that an organization's applications access for various tasks, including reporting, analytics, and processing.
The Apache Hadoop Distributed File System (HDFS), where Cloudera has its roots, formed the basis for traditional data lakes. Today, the trend is towards cloud data lakes that use object storage systems such as Amazon S3 and Microsoft Azure Data Lake Storage (ADLS).
Data is stored in the data lake precisely as it is collected. A structured dataset maintains its original structure without further indexing or metadata. Similarly, unstructured data such as social media posts, images, and MP3 files lands in its original native format.
A data lake can only work if data can be extracted and used for analysis, which requires data governance. Data catalogs, such as Hive Metastore (HMS), apply metadata and a hierarchical logic to incoming data, so datasets receive the necessary context and trackable lineage.
The limitations of a catalog
While catalogs provide a shared definition of the dataset structure within data lake storage, data changes or schema evolution between applications go untracked. For example, Hive can catalog the structure of a large dataset, including column names and data types, but it does not record which data files make up the dataset. As a result, applications must read file metadata to identify which files are part of a dataset at any given time.
Data integrity is not much of an issue if the dataset is static. But when one application writes to and modifies the dataset, another application that reads from the same dataset must be in sync with the changes. For example, if an ETL (Extract, Transform, Load) process updates the dataset by adding and removing several files from storage, another application reading the dataset at the same time may process a partial or inconsistent view of it and generate incorrect results.
What is Apache Iceberg?
Apache Iceberg is a new open table format that enables multiple applications to work together on the same data transactionally. It tracks the state of dataset evolution and changes over time.
Those conversant with traditional SQL tables will immediately recognize the Iceberg table format. It is open and accessible so multiple engines can operate on the same dataset.
HMS keeps track of data at the “folder” level, which requires file list operations when working with data in a table and can often lead to performance degradation.
Iceberg avoids this by keeping track of a complete list of all files within a table using a persistent tree structure.
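To make the idea of a persistent tree more concrete, here is a minimal Python sketch. It is a toy model, not Iceberg's actual metadata format or API: the `Snapshot` and `Manifest` classes, the `append_files` helper, and the file paths are all hypothetical names chosen for illustration. It shows a reader obtaining the full file list by walking the tree rather than listing a folder, and a new table state reusing the unchanged parts of the old one.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Manifest:
    """Hypothetical immutable group of data file paths."""
    data_files: Tuple[str, ...]


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical root of the tree: one complete table state."""
    snapshot_id: int
    manifests: Tuple[Manifest, ...]

    def files(self) -> Tuple[str, ...]:
        # Walk the tree; no directory listing against storage is needed.
        return tuple(f for m in self.manifests for f in m.data_files)


def append_files(base: Snapshot, new_files: Tuple[str, ...]) -> Snapshot:
    """Build a new snapshot that reuses the old manifests unchanged."""
    return Snapshot(base.snapshot_id + 1, base.manifests + (Manifest(new_files),))


s1 = Snapshot(1, (Manifest(("data/a.parquet", "data/b.parquet")),))
s2 = append_files(s1, ("data/c.parquet",))

print(s1.files())  # the old state is still intact
print(s2.files())  # the new state sees all three files
```

Because nothing is modified in place, a reader holding `s1` keeps a consistent view while a writer produces `s2`.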
Apache Iceberg was developed at Netflix to solve issues with huge, petabyte-scale tables and was given to the open-source community in 2018 as an Apache Incubator project.
The benefits for Cloudera CDP users
General availability covers Iceberg running within essential data services in the Cloudera Data Platform (CDP), including Cloudera Data Warehouse (CDW), Cloudera Data Engineering (CDE), and Cloudera Machine Learning (CML).
Cloudera has integrated Iceberg into CDP's SDX (Shared Data Experience) layer, so the productivity and performance benefits of the open table format are available right out of the box. The native Iceberg integration also benefits from enterprise-grade SDX features such as data lineage, audit, and security.
The Iceberg tables in CDP integrate with the SDX Metastore for table structure and access validation, allowing for the creation of auditing and fine-grained policies. Iceberg enables CDP to expose the same dataset to multiple analytical engines, including Spark, Hive, Impala, and Presto.
The CDP Iceberg integration offers four other benefits that users will like:
In-place table evolution saves time
Users can evolve a table schema or change the partition layout with a single command, much as they would with SQL. Iceberg does not require laborious, costly processes, like rewriting table data or migrating to a new table.
Time travel for forensic visibility and regulatory compliance
Iceberg keeps a log of previous table snapshots, so users can run time travel queries against an earlier state of a table or roll the table back to it.
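As a rough illustration of how a snapshot log makes time travel and rollback possible, the following Python sketch keeps every committed state and looks one up by timestamp. It is a toy under stated assumptions, not Iceberg's API: the `SnapshotLog` class and its methods are hypothetical, timestamps are plain integers, and commits are assumed to arrive in ascending time order.

```python
from bisect import bisect_right
from typing import List, Tuple


class SnapshotLog:
    """Hypothetical log of table states, one entry per commit."""

    def __init__(self) -> None:
        self._times: List[int] = []                # commit timestamps, ascending
        self._states: List[Tuple[str, ...]] = []   # file list for each snapshot
        self.current = -1                          # index of the current snapshot

    def commit(self, ts: int, files: Tuple[str, ...]) -> None:
        self._times.append(ts)
        self._states.append(files)
        self.current = len(self._states) - 1

    def as_of(self, ts: int) -> Tuple[str, ...]:
        """Time travel: the table state as it was at time ts."""
        i = bisect_right(self._times, ts) - 1
        if i < 0:
            raise ValueError("no snapshot at or before that time")
        return self._states[i]

    def rollback(self, index: int) -> None:
        """Make an earlier snapshot current without deleting history."""
        if not 0 <= index < len(self._states):
            raise IndexError("unknown snapshot")
        self.current = index

    def read(self) -> Tuple[str, ...]:
        return self._states[self.current]


log = SnapshotLog()
log.commit(100, ("a.parquet",))
log.commit(200, ("a.parquet", "b.parquet"))

print(log.as_of(150))  # state between the two commits
log.rollback(0)
print(log.read())      # the table now reads as the first snapshot
```

In this sketch a rollback only moves a pointer, which is why the later snapshot remains available for forensic queries.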
Multi-function analytics from the edge to AI
Iceberg enables seamless integration between different streaming and processing engines while maintaining data integrity between them. Multiple engines can concurrently change the table, even with partial writes, without correctness issues or the need for expensive read locks.
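One common way to let several writers share a table without read locks is optimistic concurrency: each writer prepares a new table state and then asks the catalog to swap it in only if nobody else has committed in the meantime. The Python sketch below illustrates that general technique; it is not CDP's or Iceberg's implementation. The `Catalog` class, its methods, and the `append` helper are hypothetical, and the `threading.Lock` merely stands in for whatever atomic swap a real catalog provides. Readers in this sketch never take the lock.

```python
import threading
from typing import Tuple

State = Tuple[int, Tuple[str, ...]]  # (version, data files)


class Catalog:
    """Hypothetical catalog holding a pointer to the current table state."""

    def __init__(self) -> None:
        self._swap_lock = threading.Lock()  # emulates an atomic swap
        self._state: State = (0, ())

    def current(self) -> State:
        # Readers grab one immutable tuple; no lock is taken.
        return self._state

    def compare_and_swap(self, expected_version: int,
                         new_files: Tuple[str, ...]) -> bool:
        with self._swap_lock:
            version, _ = self._state
            if version != expected_version:
                return False  # another writer committed first
            self._state = (version + 1, new_files)
            return True


def append(catalog: Catalog, new_file: str, max_retries: int = 10) -> int:
    """Retry the commit against the latest state until it succeeds."""
    for _ in range(max_retries):
        version, files = catalog.current()
        if catalog.compare_and_swap(version, files + (new_file,)):
            return version + 1
    raise RuntimeError("too many conflicting commits")


catalog = Catalog()
first = append(catalog, "a.parquet")
stale = catalog.compare_and_swap(0, ("x.parquet",))  # based on an old version
second = append(catalog, "b.parquet")
print(first, stale, second, catalog.current())
```

The stale commit is rejected rather than overwriting newer data, so a writer that loses the race simply retries on top of the latest state.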
Improved performance with very large-scale data sets
Partitioning makes queries faster by grouping similar rows together at write time, dividing a table into parts based on the values of certain attributes.
Iceberg simplifies partitioning by implementing hidden partitioning, which handles the details of partitioning and querying so that users do not need to know how a table is partitioned.
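The sketch below gives a feel for what "hidden" means here. It is a toy Python model, not Iceberg code: the `DayPartitionedTable` class and its methods are hypothetical, and partitioning by day of a timestamp is just one assumed example of a partition rule. The writer supplies only the timestamp, the table derives the partition itself, and a query filters on the timestamp while the table works out which partitions it can skip.

```python
from collections import defaultdict
from datetime import date, datetime
from typing import Dict, List, Tuple

Row = Tuple[datetime, str]  # (event timestamp, payload)


class DayPartitionedTable:
    """Hypothetical table that partitions rows by day behind the scenes."""

    def __init__(self) -> None:
        self._partitions: Dict[date, List[Row]] = defaultdict(list)

    def write(self, row: Row) -> None:
        # The partition value is derived from the timestamp column;
        # the writer never supplies a separate partition column.
        self._partitions[row[0].date()].append(row)

    def scan(self, start: datetime, end: datetime) -> Tuple[List[Row], int]:
        """Return rows with start <= ts < end and the partitions read."""
        touched = [d for d in self._partitions
                   if start.date() <= d <= end.date()]
        rows = [r for d in touched for r in self._partitions[d]
                if start <= r[0] < end]
        return rows, len(touched)


table = DayPartitionedTable()
table.write((datetime(2023, 1, 1, 9), "a"))
table.write((datetime(2023, 1, 2, 9), "b"))
table.write((datetime(2023, 1, 3, 9), "c"))

# The query mentions only timestamps, never the partition layout.
rows, partitions_read = table.scan(datetime(2023, 1, 2, 0),
                                   datetime(2023, 1, 2, 12))
print([payload for _, payload in rows], partitions_read)
```

Only one of the three day partitions is read for this query, even though the caller never referred to a partition.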
Wrapping up
I like what Cloudera has done here. Analysts and data scientists can easily collaborate on the same data using different tools and analytic engines. Getting the benefits of Iceberg as part of CDP requires no extra effort. No more lock-in, unnecessary data transformations, or data movement across tools and clouds to extract insights from the data.
It is true to the Cloudera strategy: to take open-source technologies and add enterprise-grade quality and stability. The biggest enterprises with large amounts of data see Cloudera as the company to manage that data end to end, whether it sits on-premises, in the public cloud, or even comes in through a SaaS application. Cloudera is doing an excellent job of pulling it all together as a one-stop shop for data management.
Note: Moor Insights & Strategy writers and editors may have contributed to this article.
Moor Insights & Strategy, like all research and tech industry analyst firms, provides or has provided paid services to technology companies. These services include research, analysis, advising, consulting, benchmarking, acquisition matchmaking, and speaking sponsorships. The company has had or currently has paid business relationships with 8×8, Accenture, A10 Networks, Advanced Micro Devices, Amazon, Amazon Web Services, Ambient Scientific, Anuta Networks, Applied Brain Research, Applied Micro, Apstra, Arm, Aruba Networks (now HPE), Atom Computing, AT&T, Aura, Automation Anywhere, AWS, A-10 Strategies, Bitfusion, Blaize, Box, Broadcom, C3.AI, Calix, Campfire, Cisco Systems, Clear Software, Cloudera, Clumio, Cognitive Systems, CompuCom, Cradlepoint, CyberArk, Dell, Dell EMC, Dell Technologies, Diablo Technologies, Dialogue Group, Digital Optics, Dreamium Labs, D-Wave, Echelon, Ericsson, Extreme Networks, Five9, Flex, Foundries.io, Foxconn, Frame (now VMware), Fujitsu, Gen Z Consortium, Glue Networks, GlobalFoundries, Revolve (now Google), Google Cloud, Graphcore, Groq, Hiregenics, Hotwire Global, HP Inc., Hewlett Packard Enterprise, Honeywell, Huawei Technologies, IBM, Infinidat, Infosys, Inseego, IonQ, IonVR, Inseego, Infosys, Infiot, Intel, Interdigital, Jabil Circuit, Keysight, Konica Minolta, Lattice Semiconductor, Lenovo, Linux Foundation, Lightbits Labs, LogicMonitor, Luminar, MapBox, Marvell Technology, Mavenir, Marseille Inc, Mayfair Equity, Meraki (Cisco), Merck KGaA, Mesophere, Micron Technology, Microsoft, MiTEL, Mojo Networks, MongoDB, MulteFire Alliance, National Instruments, Neat, NetApp, Nightwatch, NOKIA (Alcatel-Lucent), Nortek, Novumind, NVIDIA, Nutanix, Nuvia (now Qualcomm), onsemi, ONUG, OpenStack Foundation, Oracle, Palo Alto Networks, Panasas, Peraso, Pexip, Pixelworks, Plume Design, PlusAI, Poly (formerly Plantronics), Portworx, Pure Storage, Qualcomm, Quantinuum, Rackspace, Rambus, Rayvolt E-Bikes, Red Hat, Renesas, Residio, Samsung 
Electronics, Samsung Semi, SAP, SAS, Scale Computing, Schneider Electric, SiFive, Silver Peak (now Aruba-HPE), SkyWorks, SONY Optical Storage, Splunk, Springpath (now Cisco), Spirent, Sprint (now T-Mobile), Stratus Technologies, Symantec, Synaptics, Syniverse, Synopsys, Tanium, Telesign, TE Connectivity, TensTorrent, Tobii Technology, Teradata, T-Mobile, Treasure Data, Twitter, Unity Technologies, UiPath, Verizon Communications, VAST Data, Ventana Micro Systems, Vidyo, VMware, Wave Computing, Wellsmith, Xilinx, Zayo, Zebra, Zededa, Zendesk, Zoho, Zoom, and Zscaler. Moor Insights & Strategy founder, CEO, and Chief Analyst Patrick Moorhead is an investor in dMY Technology Group Inc. VI, Dreamium Labs, Groq, Luminar Technologies, MemryX, and Movandi.