NO DATA HOSTAGE
You probably think that such a “hostage” situation will never happen to you. You would be surprised how often it does. You can encounter this phenomenon in many different, subtle forms:
Proprietary file format
The most common (and most obvious) form of “hostage” situation is the “proprietary file format”. Common examples include SAS, Tableau, Qlikview and most “custom” BI solutions.
For a very long time, if you were using .sas7bdat files (from SAS), you were in a “lock-in” situation, forced to keep buying SAS because exporting the data stored inside .sas7bdat files to any other “open” format was really painful and slow. Nowadays, the .sas7bdat file format has been reverse-engineered, so you can (finally!) escape the dominion of SAS (…and this is good news!). For example, you can now read .sas7bdat files directly from Anatella (faster than in SAS!).
If you are storing your data in the .tde file format (used by Tableau), you are in a “lock-in” situation: you are forced to keep buying Tableau forever because there is no way to retrieve the data stored inside .tde files. More precisely, Tableau can export aggregates computed from .tde files, but you’ll never have access to the real, non-aggregated data contained inside them. There have been some attempts to reverse-engineer the .tde file format, but Tableau threatened the developers with legal action, and all these attempts stopped. Since you cannot extract the non-aggregated, raw data from a .tde file, you will never be able to do any predictive/advanced analytics on it (predictive modeling needs the non-aggregated, raw data).
If you are storing your data in the .qvb file format (used by Qlikview), you are in a “lock-in” situation: you are forced to keep buying Qlikview forever because there is no “official” way to retrieve the data stored inside .qvb files (although some unofficial, unsupported Chinese tools of borderline legality can still export your .qvb files to .txt files).
If you are storing your data in one of the many “Business Intelligence” tools (i.e. a clone of Tableau or Qlikview) currently available on the market (e.g. Lily), you are most certainly in a “lock-in”/hostage situation again because, typically, these BI tools won’t let you get your non-aggregated, raw data back (e.g. to do predictive analytics). If you still want to use such tools, my advice is:
- Ask the provider to demonstrate how to export your raw data back to a “plain text” file, and run the experiment yourself on a data file containing a few million records.
- Keep a “plain text” copy of all the data injected into the tool, so that you can always “go back” to the raw data if you need to (e.g. for predictive analytics).
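The two pieces of advice above can be rehearsed on a small scale before you commit to a tool. The following is only a minimal sketch, assuming the data sits in a SQL database: an in-memory SQLite database stands in for your real platform, and the `customers` table, the file name and both helper functions are hypothetical names chosen for the illustration. It dumps a table to a plain-text CSV file and checks that no record was lost on the way.

```python
import csv
import os
import sqlite3
import tempfile

def export_table_to_csv(conn, table, path):
    """Dump every row of `table` to a plain-text CSV file; return the row count."""
    # Table names cannot be bound as SQL parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    cursor = conn.execute(f"SELECT * FROM {quoted}")
    header = [col[0] for col in cursor.description]
    written = 0
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for row in cursor:
            writer.writerow(row)
            written += 1
    return written

def count_csv_rows(path):
    """Count the data rows of a CSV file (the header line is excluded)."""
    with open(path, newline="", encoding="utf-8") as f:
        return sum(1 for _ in csv.reader(f)) - 1

# Hypothetical sample data standing in for the real raw data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, gender TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "F"), (2, "Male"), (3, "Woman")],
)

csv_path = os.path.join(tempfile.mkdtemp(), "customers_raw.csv")
exported = export_table_to_csv(conn, "customers", csv_path)

# The plain-text copy must contain exactly as many records as the source.
assert exported == count_csv_rows(csv_path)
```

On a real system, the same round-trip check should be run on a few million records, since that is where slow or truncated exports usually show up.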
A more “subtle” way of taking your data hostage is to “obfuscate the meta-data”. Even if you are using an open platform (such as Hadoop, Teradata, Oracle, etc.) to store and manage your data, the consultant who created your analytic system can still “obfuscate” the table names and the column names. More precisely, without the “dictionary”, how can you know what’s inside column “C4” of table “T15”?
In such a situation, you are again in a “lock-in”/hostage situation: you’ll be forced to pay indefinitely for consultancy services because the consultants are the only ones who have the “dictionary” needed to understand what’s stored inside your databases.
This is more common than you might think: in 5 years of experience in this business, we have already seen this situation at two large telecoms (VOO in Belgium and SFR in France), one large bank (Belfius) and one state institution (BPost). This hostage situation seems to happen more frequently when the platform used to manage the data is “SAP”. Indeed, when using SAP, it’s very easy to create a completely incomprehensible/obfuscated system (unless you are one of the “initiated” who have the “dictionary”).
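One way to protect yourself is to require the “dictionary” as a plain, machine-readable deliverable and to check it against the actual schema. The sketch below only illustrates that idea: the CSV layout, the column meanings and the `schema` contents are all made up for the example (only the names “T15” and “C4” come from the text above).

```python
import csv
import io

# Hypothetical dictionary file: one line per (table, column) pair.
DATA_DICTIONARY_CSV = """table,column,meaning
T15,C1,customer_id
T15,C4,monthly_invoice_amount_eur
T22,C2,contract_start_date
"""

def load_dictionary(text):
    """Parse the dictionary into {(table, column): meaning}."""
    reader = csv.DictReader(io.StringIO(text))
    return {(r["table"], r["column"]): r["meaning"] for r in reader}

def undocumented_columns(schema, dictionary):
    """Return the (table, column) pairs that have no dictionary entry."""
    return sorted(
        (table, column)
        for table, columns in schema.items()
        for column in columns
        if (table, column) not in dictionary
    )

dictionary = load_dictionary(DATA_DICTIONARY_CSV)

# Hypothetical schema, as it could be read from the database catalog.
schema = {"T15": ["C1", "C4", "C7"], "T22": ["C2"]}
missing = undocumented_columns(schema, dictionary)
```

Here `missing` lists every column that nobody outside the consultancy could interpret. Keeping that list empty, and keeping the dictionary file in your own hands, is what removes the lock-in.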
An even more “subtle” way of taking your data hostage is to let some consultants “recode your data” using a tool that only they control.
A simple example makes the logic clearer. Let’s assume the following:
- You just purchased a database (Oracle, Teradata, SQL Server) for a reasonable price and you now want to create a complete analytic system, using the data stored inside your new database. Your first analytic project is the creation of a dashboard. You received the dashboarding tool “for free” (for one year) when you purchased the database.
- Your database contains a column named “Gender”. The “Gender” column is quite “dirty” because it contains the following strings: “F”, “Female”, “W”, “Woman”, “M”, “Man”, “Male”. Such a “dirty” database is not good when you want to create a dashboard. So, you clean the “Gender” column to only have two different strings (e.g. “Woman”, “Man”). You must “clean” all your columns in the same way. This “cleaning procedure” can become a lot more complex over time (e.g. it can include complex business rules based on many different columns). It’s very common and normal for consultants to work for one complete year to create all the business rules required to obtain a “clean” dashboard.
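The “Gender” rule described above is small enough to write down explicitly. This is a minimal sketch of such a business rule, not the rule of any particular tool: the name `clean_gender`, the case-insensitive matching and the choice to return `None` for unknown values are assumptions made for the illustration.

```python
# Mapping from every "dirty" spelling listed above to one of two clean values.
GENDER_MAP = {
    "f": "Woman", "female": "Woman", "w": "Woman", "woman": "Woman",
    "m": "Man", "man": "Man", "male": "Man",
}

def clean_gender(raw):
    """Map a raw value to "Woman" or "Man"; return None for anything unknown."""
    if raw is None:
        return None
    # Assumption: surrounding spaces and letter case carry no meaning.
    return GENDER_MAP.get(raw.strip().lower())

dirty = ["F", "Female", "W", "Woman", "M", "Man", "Male"]
cleaned = [clean_gender(value) for value in dirty]
```

Written this way, the rule is a few lines of plain text that you own and can re-apply anywhere. The real question, discussed next, is where the consultants store it.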
Let’s assume that your first project is now complete and that you now want to do predictive analytics, i.e. you now need access to all your raw data. Two situations can occur:
- The consultants created business rules that clean the raw source data inside your database. This is the “safe situation”: you can directly export your (clean) data from your database, import it into your favorite predictive modeling tool and start creating predictive models right away.
- All the business rules that do the “data cleaning” were created inside the dashboarding tool (e.g. the consultants created “calculated variables” inside the dashboarding tool). More precisely, the raw data in the database is still dirty: the consultants only created business rules that correct the display of the data on the dashboard, not the source data. You are now in a “lock-in” situation: you must continue to buy the dashboarding tool because you already invested one year of consultancy work to create all the business rules (inside your dashboarding tool) that “clean” your data, and you don’t want to lose that investment. Furthermore, since you didn’t pay attention to the price of the dashboarding tool (remember: it was initially free!), the consultants can now increase its price as they see fit. Finally, since your database is still completely dirty, you cannot do any predictive analytics: the only tool that you can use to analyze your data is your stupid dashboarding tool. Since it’s the only usable tool, you’ll invest more and more money into it (and it will be expensive since you didn’t negotiate its price “up front”!), up to the point where you are 100% locked in, with so much money invested in the dashboarding tool that getting rid of it becomes unthinkable.
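To make the “safe situation” concrete, here is a minimal sketch in which the cleaning rule is stored in the database itself rather than in a dashboarding tool. It makes several assumptions: SQLite stands in for your real database, the `customers` table and the `customers_clean` view are hypothetical names, and a view is only one possible way to keep a rule next to the data (an `UPDATE` or a cleaned table would serve the same purpose).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, gender TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "F"), (2, "Male"), (3, "Woman"), (4, "M")],
)

# The business rule lives in the database, so every tool that can run SQL
# (dashboard, predictive modeling tool, export script) sees the same clean data.
conn.execute("""
    CREATE VIEW customers_clean AS
    SELECT id,
           CASE
               WHEN lower(trim(gender)) IN ('f', 'female', 'w', 'woman') THEN 'Woman'
               WHEN lower(trim(gender)) IN ('m', 'man', 'male') THEN 'Man'
           END AS gender
    FROM customers
""")

clean_rows = conn.execute(
    "SELECT id, gender FROM customers_clean ORDER BY id"
).fetchall()
```

With this layout, replacing the dashboarding tool costs nothing in terms of business rules: the new tool simply reads `customers_clean`.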
Be wary of “free gifts”! They often come with a hidden agenda.
This is more common than you might think: in 5 years of experience in this business, we have already seen this situation at one large insurance company (Partena), one state institution (BPost) and one HR company (USG People).
We still encourage you to use our own file formats (.gel_anatella files or .cgel_anatella files) to store your datasets because these two file formats are more efficient than any other file format (e.g. you can read .gel_anatella files at a speed of 800 MB/sec of uncompressed data on a $2,000 laptop).
Since we don’t want to take your data hostage, the free “Anatella Community Edition” always includes the possibility to easily and rapidly convert all your .gel_anatella files (or .cgel_anatella files) back to any other file format (or to any database), if you wish.