Data warehousing

Data Warehousing

Historically, a “Data warehouse” was simply a place where you would find most of the structured data of your company.

“Data Warehouses” are actually built using database engines: SQL Server, Oracle, Teradata, MySQL,…

Nowadays, most “data workers” (business analysts and data scientists) are storing their data inside a Data Lake (and thereafter, they use this Data Lake to produce all their results: dashboards, kpi, models, etc.). For “data workers”, the Data Lake approach has several advantages over the older, outdated “Data warehouse” approach: it’s faster, more versatile, more resilient, less expensive, connects to more data sources, etc. (for more details, see, for example, all the advantages of Anatella over a traditional data base engine).

Thus, in the modern enterprise, the role of the “Data warehouse” is very much reduced: it’s mainly useful for interoperability reasons: it allows the exchange of data between different software components. These components are, typically: “Data viz” components (“BI tool” or “reporting tool”), operational CRM components, ERP components (such as SAP, Odoo, etc.), websites, etc.

If you are using Anatella to manage your Data Lake, you don’t even need a “Data warehouse” to exchange data with your “Data viz” component because Anatella natively creates at high speed the internal, proprietary file format used by the major “Data viz” solutions: Tableau, Qlik, Kibella and Kibana. This means that, in the end, a “Data warehouse” is not very useful anymore: it’s usually reduced to its most simple expression.

There still exists some specific use-cases where a “Data Warehouse” make more sense than a “Data Lake“: see here for more details about this subject.

There are two major technologies when it comes to Database engines

Non-clustered databases (i.e. the old, standard approach).
These databases are running on one unique “big” server.

Pro: Lower acquisition & maintenance price.
Con: Limited storage (i.e. a few Terabyte at max.) and limited Computing power (i.e. complex SQL queries runs for a very long time and sometime do not execute at all).

At first sight, the relatively small&limited storage might seem like a very negative property of this technology. But, if your primary storage is a “Data Lake” (which is the case for most modern enterprises), you don’t need a large storage inside your Database engine anymore and you won’t fall into any storage difficulties.
Distributed (or clustered) databases
A clustered database is a database that is running simultaneously on many different servers connected together with high-speed data cables (e.g. infiny band).

Pro: Potentially unlimited storage and unlimited Computing power (i.e. If you need more computing power or more storage-space, you simply add new servers inside the cluster).
Con: Higher acquisition & maintenance price.

The main “pro” argument in favor of the Distributed Databases is the supposedly higher computing-power offered by distributed databases. Actually, this argument does not hold true. You can only offer a high computing power (using a distributed computation approach) if the incompressible time of your Database Engine is small. …and this is not the case for most databases: i.e. Most database have a very high incompressible time. This means that they won’t be able to deliver a high computing power, even if you use many servers. This is especially true for low-grade databases such a Redshift, Cassandra, etc. You’ll find more details on this subject here (about the incompressible time).

The differences between all these database systems are:

Speed, Scalability, Versatility, Cost of Ownership

All major database systems are equally able to store vast amounts of information. For Non-clustered databases, the storage space is limited to a few Terabytes. A complete analysis of the advantages of each database system is out of the scope of this simple document.

If your “Data workers” are already using a “Data Lake” to manage most of your data, you most certainly only need a very simple (and inexpensive) Non-clustered Database engine.

For more information on this very vast subject, see here:
http://en.wikipedia.org/wiki/Comparison_of_relational_database_management_systems

Differences in results between tools

The differences in result between all the Non-clustered databases are:

None: given enough time and enough human resources, you can achieve similar results (similar ROI) with all theNon-clustered databases.

The differences in result between all the Distributed (or clustered) databases are:

None: given enough time and enough human resources, you can achieve similar results (similar ROI) with all theDistributed (or clustered) databases.

If you apply on the same tables the same set of transformations, you’ll obtain the same results whatever the datawarehouse tool used. This seems logical. The same is not true about Advanced Analytic Software: When working on the same dataset and using the same modeling methodology, you’ll still obtain different predictive models, with different ROI, depending of the modeling tool used. This difference can be very significative: On most cases, TIMi give 5 to 15% more ROI.

Although the differences between the different database tools are inexistant in terms of end-result, you can still observe large differences when it comes to: speed, cost of ownership, quality of the interface (“user-friendliness”), energy efficiency, etc.