Here’s my first cut at how a data lake might work:
First, the processes known as ETL (extraction, transformation, and loading) will stay with us. ETL takes many data sources and creates a single normalized version of the truth. These processes will expand from structured transactional data to less structured machine data and unstructured data. In a data lake, this normalized, cleaned data will be only one source of information alongside machine data and unstructured data.
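To make the idea concrete, here is a minimal ETL sketch in Python. The sources, field names, and schema are invented for illustration; real ETL pipelines add validation, error handling, and a real load target.

```python
# Minimal ETL sketch (illustrative only): two hypothetical sources with
# different field names are normalized into one cleaned, common schema.

def extract():
    # Imagined source systems; in practice these would be queries or file reads.
    crm = [{"Name": "Acme Corp", "Revenue": "1,200"}]
    billing = [{"customer": "acme corp", "rev_usd": 800}]
    return crm, billing

def transform(crm, billing):
    # Normalize both sources to the same schema: customer, revenue_usd.
    normalized = []
    for row in crm:
        normalized.append({
            "customer": row["Name"].strip().lower(),
            "revenue_usd": int(row["Revenue"].replace(",", "")),
        })
    for row in billing:
        normalized.append({
            "customer": row["customer"].strip().lower(),
            "revenue_usd": int(row["rev_usd"]),
        })
    return normalized

def load(rows, warehouse):
    # "Load" here is just an in-memory table standing in for a warehouse.
    warehouse.extend(rows)

warehouse = []
load(transform(*extract()), warehouse)
print(warehouse)
```

The point is the shape of the work: many inputs, one normalized output, and a lot of per-source cleanup logic in the middle.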
Second, end-users will be given tools to explore structured data, machine data, and unstructured data. The Splunk Search Language, which I am studying now for a book project, and the ThingWorx SQUEAL™ search language were both created for this purpose. To use and understand the value of new types of data, the people who know what questions are worth answering must be able to play with the data.
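The style of exploration these languages enable can be sketched in Python. The two-stage "filter, then aggregate" pipeline below is only in the spirit of a search language; the query shown in the comment is invented shorthand, not actual Splunk or SQUEAL syntax.

```python
# Toy "search pipeline" over machine data: a filter stage piped into an
# aggregation stage, loosely in the spirit of a search language.

events = [
    {"host": "web-1", "status": 500},
    {"host": "web-1", "status": 200},
    {"host": "web-2", "status": 500},
]

def search(events, **conditions):
    # Filter stage: keep events matching every field=value condition.
    return [e for e in events
            if all(e.get(k) == v for k, v in conditions.items())]

def stats_count_by(events, field):
    # Aggregation stage: count events per value of `field`.
    counts = {}
    for e in events:
        counts[e[field]] = counts.get(e[field], 0) + 1
    return counts

# Roughly equivalent in spirit to: search status=500 | stats count by host
result = stats_count_by(search(events, status=500), "host")
print(result)  # {'web-1': 1, 'web-2': 1}
```

What matters for end-users is that each stage is small and composable, so new questions can be asked by rearranging stages rather than waiting for a new report.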
Third, new forms of summary data will arrive alongside the normalized data created by ETL. Some of these summaries will find their way into new kinds of databases: NoSQL databases such as Cassandra or MarkLogic, or the graph database used by ThingWorx. (See Databases for the Data Lake.) Other summaries will be just consolidated forms of data, like the summary indexes used by Splunk. I believe that data cubes will be replaced by other forms of summaries that are easier to create and require less intermediation. End-users will be able to define what a summary should look like using a search language, and then an automated, optimized implementation can be created for the summaries that are used often. The summaries will be easily linked to the detail records used to create them.
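A sketch of that last property, assuming each raw event carries an id: each summary row keeps the ids of the detail records behind it, so a user can drill from the summary back down to the raw data. The field names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical raw events (detail records), each with an id.
events = [
    {"id": 1, "host": "web-1", "status": 500},
    {"id": 2, "host": "web-1", "status": 200},
    {"id": 3, "host": "web-2", "status": 500},
    {"id": 4, "host": "web-2", "status": 500},
]

def summarize(events, key):
    """Build summary rows grouped by `key`, keeping the ids of the
    detail records behind each row so a user can drill back down."""
    groups = defaultdict(list)
    for e in events:
        groups[e[key]].append(e["id"])
    return [
        {key: k, "count": len(ids), "detail_ids": ids}
        for k, ids in sorted(groups.items())
    ]

summary = summarize(events, "host")
print(summary)
```

Unlike a precomputed cube, a summary like this is cheap to define and regenerate, and the `detail_ids` link is what makes it explorable rather than a dead end.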
Fourth, both raw data from all sources and summary data will be analyzed using in-memory visualization systems like QlikView, TIBCO Spotfire, Tableau, SAP HANA, and others. These systems allow end-users to explore and navigate the data, free from the intermediation that causes bottlenecks in data warehouses.
Fifth, the exploratory environments will increasingly include the ability to relate unstructured data to the information on a dashboard or some other visualization. A data lake assumes that it can tell only part of the whole story and seeks to be friendly to adding new sources of data.
Sixth, as more is understood about how to capture important events and show key trends from the data, the analysis will be automated and captured in dashboards and other reports. Data warehouses do a good job of distributing reports that have proven useful, and that function will remain important. As the reports delivered become more visual and interactive, the number of reports will drop dramatically.
Seventh, the way IT supports end-users will change. IT will observe how end-users use data and how they add data to the data lake. When a report starts to be used heavily, IT will step in and optimize its execution. IT will do its noble work after the value has been proven. The bottleneck in IT may well still exist, but it won’t get in the way of end-users asking and answering questions. It will be a bottleneck only with regard to optimizing how questions are answered.
So, as you see, a data lake doesn’t kill all of a data warehouse. It just cuts away the weakest parts and replaces them with new, more flexible structures that allow new forms of information to be put to use. Should you kill your data warehouse today? Probably not. Should you be planning its funeral? Absolutely.
Dan Woods - Forbes