Data is an essential asset of every organization.

As a consequence, interest in data preparation tools is heating up within the Big Data market.

Today, thanks to technological advances, large amounts of data come from countless disparate sources. Collecting and archiving that data costs a great deal of money, and it is expensive to process it into useful insights that answer “how much”, “when”, and “what”.

The bigger issue is that understanding the “why” still requires additional work: implementing context awareness, through which raw data can be shaped into relevant information.


According to a recent WIPRO study: “Experts now point to a 4300% increase in annual data generation by 2020, of which 80% is unstructured data in the form of audio recordings, PDFs and texts. This data explosion is resulting in the creation of an overwhelming Data lake. IDC suggests that about 90% of this data is ‘dark’ and unstructured. In fact, companies use only a mere 12% of the available data to derive business insights and the rest are just stored as there is no proper means to access this data. This also means that the process for the data to run through its lifecycle, is quite elongated as of date.

Gartner research suggests that by 2018, 90% of the deployed data lakes would be useless as they are overwhelmed with information. However, companies spend millions of dollars in storing this data in the repository. This growing need for fast data discovery has been identified by companies like Paxata, Trifacta and others. They provide a self-service Data Preparation Tool which ‘swims’ through the huge data lakes to fetch all the relevant data and helps analysts by providing clean, standardised and enriched data set collated from various data sources. According to New York Times, Analysts spend about 80% of their time in preparing data. These tools would thus bring a revolution in the world of analysts by helping them save a lot of time and efforts. These tools are dynamic and visual with great user-interface with additional capabilities of smart-data discovery, inbuilt semantic library, data quality assurance etc.”


Even if data mining can be considered an old discipline, after more than 25 years something has finally changed: today we have plenty of data to work with. As a consequence, data integration vendors and some incumbent players are developing software technologies to meet the needs of data scientists.

Our understanding of where data preparation is going is that most vendors are building convenience features that attempt to reduce the time spent preparing data for the “next step”.

Looking at the offerings of popular names in the segment, the focus seems to be on sophisticated GUIs and workflows that support data accessibility, transformation, reduction, and integration with back ends and front ends (including advanced analytics and BI).

Frankly, we don’t know whether this is the right direction. The features above are natural extensions of already mature advanced analytics platforms (Alpine Data Labs, KNIME, SAS, SPSS…); nothing new there.

So what?

We have worked with many statisticians and data mining experts, and we have a message for the IT community: data scientists aren’t end users.

Most of them are sophisticated programmers: they work in C to achieve computing performance, they are not interested in comfortable “nice to have” features, and they are looking for efficiency.

They are fast and work at the command line; for example, many use line editors rather than full text editors.

We really believe data preparation will be the next big thing in the data management space, because of the growing size of the data scientist community. Next-generation data preparation tools should focus on specific features:

DATA COMPLEX: a schema-on-read smart data store containing all the data, value tokens, and related projections

DATA MOVEMENT: answering the need to move large data sets efficiently

GRID UNIFICATION: unifying data while leaving them in their data stores, i.e. without moving them or changing their schema

COLLABORATION: sharing results, embedding domain knowledge and reference data, and automating narrative reporting and storytelling

GARBAGE COLLECTION AND RECYCLING: considering that data scientists often find gold mines in garbage data

MULTIMEDIA DISTRIBUTED DATA: an effective answer to the requirement of mining the multimedia world
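To make the schema-on-read idea behind the first feature concrete, here is a minimal Python sketch of a hypothetical store (the class and method names are our own illustration, not any vendor's API): records of any shape are ingested as-is, and a schema, in the form of a projection, is applied only when the data is read.

```python
import json

# Hypothetical sketch of a schema-on-read store: raw records are kept
# untouched at ingest time, and a schema (a projection over fields) is
# applied only when the data is read.
class SchemaOnReadStore:
    def __init__(self):
        self._raw = []  # raw records; no schema enforced on write

    def ingest(self, record_json):
        # Store the payload as-is: no validation, no fixed columns.
        self._raw.append(json.loads(record_json))

    def read(self, projection):
        # Apply the caller's schema at read time, selecting the
        # requested fields and tolerating records where one is missing.
        for record in self._raw:
            yield {field: record.get(field) for field in projection}

store = SchemaOnReadStore()
store.ingest('{"user": "ada", "amount": 12.5, "note": "ok"}')
store.ingest('{"user": "bob", "country": "IT"}')  # different shape: fine

rows = list(store.read(["user", "amount"]))
# Missing fields surface as None at read time instead of failing at
# ingest time.
print(rows)
```

The design choice this illustrates is the one the feature list argues for: ingestion never rejects data for not fitting a schema, so heterogeneous sources can land in the same store, and each analysis imposes only the structure it actually needs.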