The 11 challenges of data preparation and data wrangling
The events of 2020 have accelerated the shift to telecommuting and digital relationships. With digitalization, another transformation is also underway: the analytics transformation. Yet, facing this data revolution, I find that businesses do not necessarily have the right tools for preparing and analyzing data (data preparation, also known as data wrangling).
In this article, I want to put the role of these tools within the business into perspective. I believe they can contribute to real efficiency gains, which I will try to demonstrate. To that end, I have identified 11 characteristics that define the ideal data preparation solution.
Summary
- Data transformation: a challenge for human resources
- The 11 characteristics of an exemplary data wrangling/data preparation tool
- Conclusions
Data transformation: a challenge for human resources
Historically, generating value from data was the preserve of specialized and expensive profiles (PhDs in data science, R/Python coders). While outsourcing is always possible for specific needs, it has one major drawback: external consultants do not understand your business and need time to grasp the hidden subtleties of your data.
To take full advantage of your data and quickly extract all the knowledge and value from it, it is essential to internalize data expertise and make tools available that as many people as possible can use. The use of self-service data preparation tools by all analysts seems essential to me. These are the people who understand the business problem to be solved and who are therefore the most likely to solve it if given the technical means.
Internalizing data expertise and making tools available that as many people as possible can use is essential.
This allows analytics to break out of the straitjacket of the “Data Science” department. It is also the first step toward instilling a company-wide analytics culture.
There are many examples of people within businesses who already manipulate data. In 99% of cases, everything is done in Excel, and this is where substantial efficiency gains can be realized. Indeed, good data preparation tools address all of Excel’s limitations (complex transformations, multiple data formats, volume, …).
In the following sections, I explain what I believe are the 11 characteristics of an ideal data preparation tool.
The data wrangling tool must be a “self-service” tool
A self-service data preparation tool enables your analysts to solve business problems faster. They understand the data and its context, and with the right tool, they can solve those problems autonomously.
Data preparation must be fast
Typically, data scientists spend more than 85% of their time doing data preparation. A tool that increases the speed and productivity of data preparation is therefore welcome!
In particular, the most experienced data scientists have long realized that they need to reduce the time spent on data preparation to do a better job. This is why, just like business analysts, they are asking for a mouse-driven, “self-service” tool: it gives them an immense gain in productivity and time during the always costly “data preparation” phase.
The chosen tool must be federative
A tool that promotes and simplifies collaboration between business analysts (business-oriented) and data scientists (technical-oriented) seems to me to be essential. Without this unifying aspect, it is difficult to achieve a global analytical culture within your business.
The federating aspect of a “data preparation” tool is perhaps the most difficult to achieve because the needs of business users are often far removed from the requirements of data scientists:
- Business analysts avoid code and want easy and immediate access to the desired information. They often work with small volumes of data and do not use highly complex algorithms.
- Data scientists like to code, and writing a thousand lines of code in R/Python is not going to stop them (after all, it’s only a day’s work!). They often work with larger volumes and use complex algorithms. To meet the requirements of the most advanced data scientists, the data preparation tool used in a company must therefore make it possible to:
- manipulate large amounts of data,
- code in R/Python,
- create complex features (for feature engineering; see the sketch below),
- use the latest generation of AI algorithms.
Without these properties, you will not achieve this “unifying” aspect, and a global analytical culture within your company will be out of reach.
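To make the “complex features” point concrete, here is a minimal sketch of the kind of feature engineering an advanced data scientist typically expects to be able to do. It is written in plain pandas rather than in any particular tool, and the table and column names (customer_id, amount, order_date) are invented for the example.

```python
import pandas as pd

# Invented example data: one row per order.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "amount":      [20.0, 35.0, 5.0, 12.5, 7.5, 100.0],
    "order_date":  pd.to_datetime([
        "2021-01-03", "2021-02-11", "2021-01-20",
        "2021-03-02", "2021-03-15", "2021-02-28",
    ]),
})

# Aggregate the order history into one feature row per customer:
# total spend, average basket, number of orders, days since last order.
features = (
    orders.groupby("customer_id")
          .agg(total_spend=("amount", "sum"),
               avg_basket=("amount", "mean"),
               n_orders=("amount", "size"),
               last_order=("order_date", "max"))
)
features["days_since_last_order"] = (
    pd.Timestamp("2021-04-01") - features["last_order"]
).dt.days

print(features)
```

The point is not this particular snippet, but that the tool must let analysts build this kind of derived feature, on large volumes, without leaving their working environment.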
The possibility of iterative work
Work on data is never finished and should be seen as a cycle. It is constantly fed by new data: updated data on the one hand, and data from new sources on the other.
Data processing is not an end in itself. It is a cycle that starts over endlessly. The tools used must therefore be able to manage these update cycles.
Example of iterative data processing
Many businesses use Excel files to collect (and exchange) data from their employees. What happens when an employee decides to slightly modify the structure of an Excel file to fit the business problem at hand better?
This small change requires updating the data wrangling process that collects data from all those copies of the Excel file. Suppose this data wrangling process is opaque (because it is programmed in an incomprehensible language that only insiders understood, and only on the day they wrote the code). In that case, the whole data collection process is compromised. This leads to the production of invalid data and, ultimately, to wrong decisions. How many times have you heard a colleague tell you that a KPI is absurd and should be ignored? This (justified) lack of confidence in analytical results is often due to data wrangling tools that are not transparent enough.
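By way of illustration, here is a minimal, hypothetical sketch of the kind of transparent safeguard such a collection process can include: before an employee’s Excel file is appended, the script checks that the expected columns are still there and reports exactly what changed. The file and column names are invented, and the sketch assumes pandas (with an Excel reader such as openpyxl) is available.

```python
import pandas as pd

# Columns the collection process was built around (invented for the example).
EXPECTED_COLUMNS = {"employee_id", "week", "project", "hours"}

def load_timesheet(path: str) -> pd.DataFrame:
    """Read one employee's Excel file and fail loudly if its structure changed."""
    df = pd.read_excel(path)
    missing = EXPECTED_COLUMNS - set(df.columns)
    extra = set(df.columns) - EXPECTED_COLUMNS
    if missing:
        # A renamed or deleted column is caught here instead of silently
        # producing invalid data downstream.
        raise ValueError(f"{path}: missing columns {sorted(missing)}, "
                         f"unexpected columns {sorted(extra)}")
    return df[sorted(EXPECTED_COLUMNS)]

# Usage: collect all files, stopping at the first structural change.
# frames = [load_timesheet(p) for p in ["alice.xlsx", "bob.xlsx"]]
# all_data = pd.concat(frames, ignore_index=True)
```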
Adaptation to large volumes of data
I defend the idea of tools adapted to the fast processing of large volumes of data (see the blog article “TIMi vs Spark”). Processing time is correlated with the volume of data to be processed, yet all too often the tools available to business analysts are too slow or cannot handle large volumes of data.
A powerful tool independent of resources in the cloud
The question of computing power in data preparation is, in my opinion, central. To be autonomous, analysts must be able to answer all their questions without depending on whether they have access to a cluster of machines in the cloud. If the available computing power constrains your analyses, you will get frustrated, and an analytical culture will never take hold.
Although it is now very easy to create clusters of machines in the cloud, the price remains high. Because of this, a “normal” company will limit itself to one (or perhaps two) cloud clusters, and their use will be reserved for a small number of data scientists.
It is easy to understand that, under these conditions, the development of a global analytical culture is compromised. If only two people have access to the cluster, how can you increase efficiency and avoid creating bottlenecks?
In addition to the prohibitive price of the cloud, there is also the issue of data sovereignty when data is stored on a US cloud. For more on this topic: see the ‘Schrems II’ decision in case C-311/18 here.
A tool with fixed costs
I’m pretty allergic to the idea of not knowing in advance what a data analysis is going to cost me. Yet that is the situation every business that uses AWS or Azure for data processing finds itself in: the variability of costs makes it impossible to anticipate the next bill. In other words, there is a variable cost associated with each analytical question.
This “variable cost” is an inherent property of cloud clusters, and it is perhaps the key selling point of the cloud giants’ offerings: “You only pay for what you use.”
A motivated data scientist will thus generate higher variable costs, because they will make intensive use of the cloud cluster to understand the data as well as possible. Conversely, a less motivated data scientist will generate lower variable costs.
The monitoring of processing costs in the cloud penalizes the most active and motivated data scientists.
When the assessment of data scientists is based on the costs they generate, we end up in a paradoxical situation: those who work the most and are the most motivated are penalized, and “parsimonious” use of the cluster becomes the norm. As this situation happens all the time, there are now many tools specialized in “cloud monitoring” that make it possible to penalize, and even cut off access to computing resources for, any data scientist who shows a little too much motivation. Under these conditions, it is easy to understand that keeping data scientists motivated is likely to be difficult.
To finish on this subject, a data wrangling tool that works on “variable costs” (like 99% of cloud solutions) has the effect of penalizing, discouraging, and ultimately preventing your best people from working.
Simplifying industrialization
A good data wrangling tool should allow you to easily industrialize and automate the “recipes” developed by your teams.
Here are some specific features that are important for the peace of mind of your IT teams:
- easy integration with any scheduling program (e.g., MS-Windows task scheduler or Jenkins).
- easy integration with your existing IT infrastructure, with other languages, and with other data frameworks. For example, calling a data wrangling procedure from a small Python script is a must (see the sketch after this list).
- robust enough to handle a sudden increase in data volume without crashing.
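As a minimal sketch of the second point, the following Python script wraps a data-wrangling job so that any scheduler (MS-Windows Task Scheduler, Jenkins, cron) can run it and react to failures through the exit code. The command line shown is a placeholder, not the actual syntax of any specific tool.

```python
import subprocess
import sys

# Placeholder command: replace with your tool's actual command line
# for running a saved data-preparation "recipe".
COMMAND = ["your_etl_tool", "run", "monthly_sales_recipe.graph"]

def main() -> int:
    """Run the wrangling job and propagate its exit code to the scheduler."""
    result = subprocess.run(COMMAND, capture_output=True, text=True)
    if result.returncode != 0:
        # Jenkins / Task Scheduler will see the non-zero exit code and flag the run.
        print(result.stderr, file=sys.stderr)
    return result.returncode

if __name__ == "__main__":
    sys.exit(main())
```

Because the script only relies on the exit code, the same wrapper works unchanged whatever scheduler your IT team prefers.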
Integrated tool
Some so-called ETL solutions are not truly ETL, because the “transformation” part (the “T”) is absent or insufficiently developed. I give priority to tools that cover all 3 parts of the data cycle: acquisition, storage, and exploitation. It must be possible to move from one to the other without difficulty or any accidental loss of information.
Multiplicity of connectors
I have already talked many times about the importance of having as many connectors as possible. Today, data comes from everywhere, in more and more formats, some of them proprietary.
Numerous data transformation features
This is a point I have made many times before, and I think it is essential. You need a maximum number of pre-programmed “boxes” to cover everyday data transformation operations.
Of course, all ETL solutions worthy of the name have a wide range of “boxes” available. But some have more than others: Anatella, for example.
This last point seems obvious, but surprisingly few software solutions actually meet this need. Many software vendors are content to provide broad connectivity and completely neglect the “data transformation” aspect, which is at least as necessary a component, if not more so. Would you like some examples? Here are two.
Fuzzy matching
Fuzzy matching is extremely useful for reconciling two databases when data quality is not excellent. Yet, only Anatella offers it natively (see here for more info).
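To make the idea concrete, here is a minimal illustration of fuzzy matching in plain Python, using the standard library’s difflib. It is a sketch of the concept only, not Anatella’s implementation, and the customer names are invented.

```python
from difflib import SequenceMatcher

# Two customer lists that should refer to the same people,
# but whose spellings do not match exactly (invented data).
crm_names     = ["Jean Dupont", "ACME Corp.", "Marie-Claire Martin"]
billing_names = ["DUPONT Jean", "Acme Corporation", "M. C. Martin"]

def similarity(a: str, b: str) -> float:
    """Crude similarity score between 0 and 1, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# For each CRM name, keep the billing name with the highest score.
for name in crm_names:
    best = max(billing_names, key=lambda other: similarity(name, other))
    print(f"{name!r:25} -> {best!r:22} (score {similarity(name, best):.2f})")
```

A real reconciliation would also normalize accents, word order, and legal suffixes before scoring, but the principle is the same: approximate matches instead of exact joins.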
The pivot function
If you are doing data visualization, being able to rotate your data is fundamental. However, here again, 99% of data wrangling solutions do not offer it (in Anatella, the boxes are called “flatten” and “unflatten”).
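As an illustration of what “rotating” data means, here is a minimal pandas sketch that goes from a long (“flat”) table to a wide one and back. The pandas functions are generic equivalents of what Anatella exposes as “flatten”/“unflatten” boxes; the column names are invented.

```python
import pandas as pd

# Long ("flat") format: one row per (month, KPI) pair -- invented data.
long_df = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb"],
    "kpi":   ["revenue", "orders", "revenue", "orders"],
    "value": [1000, 42, 1200, 51],
})

# Pivot ("unflatten"): one row per month, one column per KPI --
# the shape most dashboards and charts expect.
wide_df = long_df.pivot(index="month", columns="kpi", values="value")

# And back ("flatten"): melt the wide table into the long format again.
back_to_long = wide_df.reset_index().melt(
    id_vars="month", var_name="kpi", value_name="value"
)

print(wide_df)
print(back_to_long)
```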
Conclusions
Currently, Anatella is the only tool that fully meets the requirements of the ideal data preparation tool. Indeed, Anatella is:
- …a fully self-service, mouse-driven tool.
- …an extremely fast tool.
- …a federating tool that favours the development of a global analytical culture by helping collaboration between business analysts (who are more business-oriented) and data scientists (who are more code and algorithm-oriented). For more details on this subject, see here.
- …a tool that can handle large volumes of data, code in R/Python, create complex features (for feature engineering), and use the latest generation of AI algorithms, to satisfy the needs of the most advanced data scientists.
- …a tool that allows iterative work thanks to:
- its graphical, easily understandable representation of data preparation processes,
- its unique metadata-free functionality.
- …a tool that is powerful enough to make you independent of “cloud clusters”.
- …a tool that works without variable costs.
- …a tool that allows you to easily industrialise and automate the data-wrangling processes developed by your teams.
- …a tool that offers a high number of connectors to capture virtually any data from anywhere.
- …an “all-in-one” tool that covers virtually everything that can be done with data in a data-driven organisation that wants to develop an analytical culture.
- …the tool on the market that offers the widest range of data transformation features: Anatella has almost 400 data transformation operators. With Anatella, you are guaranteed access to all the functionality you need to clean, structure, and enrich your data.
Source: IntoTheMinds