Advances in data preparation and integration will have a major impact on BI, visual analytics, and data discovery. Here are three points to consider as you evaluate and deploy these recent software entries.
David Stodder, dstodder@tdwi.org
Related event: TDWI BI Symposium London 2015, 7-9 September
As users seek to go beyond canned reports, dashboards, and spreadsheets to employ sophisticated visual analytics to drive decisions and actions, traditional processes for preparing data are under pressure. These include steps for data quality and profiling, data transformation, and other forms of enrichment.
To perform business-driven analytics, users want flexibility in data preparation; they don’t want to wait for long cycles of extraction, transformation, and loading (ETL) only to gain access to a limited selection in the data warehouse. In today’s brave new big data world full of Hadoop clusters and nontraditional data types, users want to explore it all with less restraint and more self-service.
The established world of ETL and data integration is thus in the midst of a shakeup. Innovators are coming out of Google, Facebook, and leading universities to launch new companies with the backing of top-flight venture capitalists (VCs). Traditional vendors have had to adjust quickly and introduce solutions geared more to ad hoc, on-the-fly data integration, transformation, quality improvement, and other preparation for analytical activities. One example of such an activity is blending internal and external data to gain insights into competitive pricing. Newer solutions employ machine learning and other advanced analytics to help users learn about the data faster, with algorithms for finding relevant data relationships and anomalies.
As always happens in our industry, new technologies bring new terminology with them to draw distinctions from the old. Rather than ETL and data integration, the latest data preparation and integration technologies apply terms such as “data blending,” “data munging,” and “data wrangling.” Although the vendors use them somewhat differently, the terms generally stand for easier and faster data preparation and integration of a wider range of sources, usually through automated processes driven by advanced analytics.
Hiding Complexity
Being able to integrate and prepare a wider variety of data types is a major distinction between the newer solutions and the old. Both inexperienced and expert analysts today increasingly want to blend views of disparate data types, including geospatial, text, and demographic data, with their more traditionally structured transactional data. These nonstandard data types are often voluminous, varied, and messy; to gain business value from them sooner, manual work must be replaced by automated methods.
Although experienced data scientists and analysts may still prefer to get their hands dirty and write code to analyze the data based on intimate knowledge of the sources, most users need automation to run queries and models against what could be petabytes of highly varied data.
To hide the complexity of selecting, blending, and accessing data sources, many of the newer tools provide graphical user interfaces of their own or the ability to embed icons in leading business intelligence and visual analytics solutions. Users can work with icons rather than code to perform data mashups, set filters, or create custom data blends for their immediate analytic needs. The tools are thus fueling the trend toward self-service data integration, taking tasks out of the hands of IT to enable business analysts and other nontechnical users to work on their own to develop variables, build models, or query sources to find data patterns and correlations. (For an in-depth discussion of data blending, see Fern Halper’s TDWI Checklist report, Seven Keys to Data Blending.)
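For readers who want to see what a graphical tool hides, here is a minimal sketch of a data blend with a filter, written in plain Python. It is an illustration only, not how any particular product works: the sample records, the field names (`sku`, `region`, `our_price`, `competitor_price`), and the `blend` helper are all hypothetical.

```python
# Hypothetical sample data; every name and value here is illustrative only.
internal_sales = [
    {"sku": "A100", "region": "EMEA", "our_price": 19.99},
    {"sku": "B200", "region": "EMEA", "our_price": 5.49},
    {"sku": "C300", "region": "APAC", "our_price": 12.00},
]
external_prices = [
    {"sku": "A100", "competitor_price": 18.50},
    {"sku": "C300", "competitor_price": 12.75},
]


def blend(internal, external, key, row_filter=None):
    """Left-join internal rows to external rows on `key`, then filter."""
    lookup = {row[key]: row for row in external}
    blended = []
    for row in internal:
        # Internal rows with no external match are kept unchanged.
        merged = {**row, **lookup.get(row[key], {})}
        if row_filter is None or row_filter(merged):
            blended.append(merged)
    return blended


# Blend internal sales with external pricing, keeping only EMEA rows.
emea_blend = blend(
    internal_sales,
    external_prices,
    key="sku",
    row_filter=lambda r: r["region"] == "EMEA",
)

for row in emea_blend:
    gap = None
    if "competitor_price" in row:
        gap = round(row["our_price"] - row["competitor_price"], 2)
    print(row["sku"], gap)
```

The join key and the filter are the two decisions a user would otherwise make by dragging icons; the sketch keeps unmatched internal rows so that missing external data stays visible rather than silently dropping out.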
Of course, much of this innovation is aimed at enabling organizations to gain more value out of the growing “lake” of data stored in Hadoop clusters. Organizations need tools geared to the “schema-on-read” style of data analysis prevalent with Hadoop, in which schema, transformation, and other steps are applied to data when it is accessed rather than as it enters the system (the schema-on-write approach typical of traditional BI and data warehousing systems). Because no one vendor is entrenched as the market leader for data preparation on Hadoop, the new firms and their VC backers see a major opportunity.
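The schema-on-read idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a Hadoop implementation: the raw lines, the `schema` mapping, and the `read_with_schema` function are hypothetical, and a real cluster would read from distributed files rather than an in-memory list.

```python
import json

# Raw records land in storage untouched (hypothetical sample lines).
raw_lines = [
    '{"user": "u1", "amount": "42.5", "ts": "2015-04-14"}',
    '{"user": "u2", "amount": "7", "extra": "ignored"}',
    "not valid json",
]

# The schema is declared by the reader, not enforced at load time.
schema = {"user": str, "amount": float}


def read_with_schema(lines, schema):
    """Yield records cast to `schema`; skip lines that do not fit."""
    for line in lines:
        try:
            record = json.loads(line)
            yield {field: cast(record[field]) for field, cast in schema.items()}
        except (ValueError, KeyError, TypeError):
            # Malformed or incomplete line: skipped by this reader only.
            continue


rows = list(read_with_schema(raw_lines, schema))
print(rows)
```

Because the raw lines are never altered, a different analyst could read the same data with a different schema, for example one that also uses the `ts` field. That flexibility is the appeal, and the silently skipped line is a reminder of why governance still matters.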
Many of the new data preparation software providers are led by technologists with deep experience in using Hadoop, MapReduce, Spark, and related Apache open source technologies. Offloading of ETL jobs to cheaper Hadoop systems has already been growing and will likely accelerate as Spark and commercial SQL-on-Hadoop options mature. Over time, these trends will make it easier for organizations to view Hadoop as an appropriate platform for a greater share of their data preparation, enrichment, and integration tasks. (For more on ETL and Hadoop, see Philip Russom’s recent article, Can Hadoop Replace My ETL Tool?)
Advances in data preparation and integration will have a major impact on BI, visual analytics, and data discovery. Here are three points to consider as you evaluate and deploy these recent software entries.
#1. Ensure good governance. One potential danger of breaking away from IT control and increasing users’ self-service with data preparation is that proper data governance can become more difficult. Data preparation and integration tools are increasingly providing data lineage tracking capabilities, which can be helpful for data governance. Users and IT should work together to set rules and ensure that they are followed.
#2. Manage performance carefully. Whether users rely on traditional ETL or newer data preparation and integration software, better performance is always a high priority. Look carefully at how vendors currently employ, or plan to employ, in-database and in-memory processing to improve the performance of data preparation and analytics.
#3. Make it easier for users, not harder. Many newer technologies are attempting a transition from Hadoop’s developer-oriented culture to the world of nontechnical users who generally do not want to code and are more focused on solving business problems. Graphical interfaces help, but they can also mask confusion. Ensure that users are properly trained and guided as they move toward self-service data preparation and integration.
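To make the data lineage tracking mentioned under point #1 more concrete, here is a minimal sketch of a preparation pipeline that records what each step did. It is an assumption-laden illustration rather than any vendor's feature: the `run_pipeline` helper, the step functions, and the sample records are hypothetical.

```python
def run_pipeline(rows, steps):
    """Apply each (name, function) step in order and record lineage."""
    lineage = []
    for name, step in steps:
        before = len(rows)
        rows = step(rows)
        lineage.append({"step": name, "rows_in": before, "rows_out": len(rows)})
    return rows, lineage


def drop_missing_price(rows):
    """Remove records that have no price value."""
    return [r for r in rows if r.get("price") is not None]


def add_price_with_tax(rows):
    """Add a derived field; the 20% rate is an arbitrary illustrative value."""
    return [{**r, "price_with_tax": round(r["price"] * 1.2, 2)} for r in rows]


# Hypothetical source records.
source = [{"sku": "A100", "price": 10.0}, {"sku": "B200", "price": None}]

result, lineage = run_pipeline(
    source,
    [
        ("drop_missing_price", drop_missing_price),
        ("add_price_with_tax", add_price_with_tax),
    ],
)

for entry in lineage:
    print(entry)
```

Even a log this simple lets users and IT answer a basic governance question: which step removed or changed records, and how many. Commercial tools typically capture far more detail, but the principle is the same.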
New Opportunities, New Responsibilities
New technologies entering the market mean that these are exciting times for users who have been frustrated with traditional ETL and data integration and seek more flexibility and control. However, as Uncle Ben so famously said in Spider-Man, “With great power comes great responsibility.” Users and IT must adjust rules, practices, and their relationship to make productive use of new data preparation technologies and avoid potential pitfalls.
David Stodder is director of TDWI Research for business intelligence. He focuses on providing research-based insight and best practices for organizations implementing BI, analytics, performance management, data discovery, data visualization, and related technologies and methods. He is the author of TDWI Best Practices Reports on mobile BI and customer analytics in the age of social media, as well as TDWI Checklist Reports on data discovery and information management. He has chaired TDWI conferences on BI agility and big data analytics. Stodder has provided thought leadership on BI, information management, and IT management for over two decades. He has served as vice president and research director with Ventana Research, and he was the founding chief editor of Intelligent Enterprise, where he served as editorial director for nine years. You can reach him at dstodder@tdwi.org.
Previously published:
http://www.tdwi.org/articles/2015/04/14/Preparing-Data-for-Analytics.aspx
All articles are © 2015 by the authors.