Why you need more data engineers, but not for the reasons you think
The role of the data scientist has evolved quite a bit over the last few years. In some organizations, it grew out of groups of software engineers and other IT specialists who soon realized that building models involves more than linking to a library; in others, out of groups of statisticians who learned the hard way that data doesn't come prepared. A data team in a company is now – in most cases – a group of mixed specialists with complementary skills.
Many years ago, it involved a lot of coding: we could choose between SAS, Matlab, and SPSS, or we wrote code directly, implementing ideas from academic papers. Today, it mostly involves knowing which library to call, plus basic scripting – and lots (and lots) of data preparation, for two purposes: to create a model, and to put that model in production.
This has led to an interesting split in roles: we increasingly hear about a separation between data scientists and data engineers, to the point that some believe data scientists should not deal with data preparation and data transformation at all. While the two roles approach these tasks differently, such a strict separation is dangerous. Let me explain.
Data scientists sometimes seem to live in a bubble universe where their role is to create algorithms or improve on existing ones. While such data scientists certainly exist – and I count myself NOT among them – they are a rare breed. Which one of us could claim to improve on the core algorithms of Friedman, optimize the code of Tibshirani, or come up with something better than WoE to improve a logistic regression? Very few, certainly. Each of these tasks is fit for a PhD thesis, and most PhD theses in data science do not even go that far. More to the point, which company is willing to wait months or years of R&D for such an improvement to happen?
So, as “data scientists”, I think it is fair to accept that we do not invent algorithms; we apply them. And while it is critical to understand how they work, our daily job does not revolve around inventing equations or reading academic papers (it did, 15 years ago, when data science was in its infancy and we had to code things to make the magic happen).
So, how are we different from traditional analysts and data engineers?
Data engineers put things in production and worry about efficiency and scalability (as data scientists, we do too, but they understand these concerns better). They enjoy building automation tasks and performing system integration. To do this, they use ETL tools, SQL, schedulers such as Jenkins, and version control with Git or Tortoise clients. They write much cleaner code than most of us can even dream of, and they can also build some predictive models, because honestly, running the code for a Random Forest after setting a target isn't rocket science.
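To make that last claim concrete: once a target column is defined, fitting a Random Forest really is a handful of lines. Here is a minimal sketch, assuming scikit-learn and a synthetic dataset (the data, parameters, and split are illustrative, not a recipe for production modeling):

```python
# Minimal sketch: fitting a Random Forest once the target is set.
# Uses scikit-learn with a synthetic classification dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a prepared modeling table with a known target.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The "modeling" step: a few lines, default-ish settings.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(f"Holdout accuracy: {model.score(X_test, y_test):.2f}")
```

The hard part, as the rest of this article argues, is everything that happens before and after these few lines: getting the data into that table, and getting the model out into production.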
What should a data scientist do? Mostly, we create variables (we do “feature engineering”) and we “tweak” models. And there are only two ways to do this: within a modeling tool (and then forget about putting it into production) or with an ETL tool (or in SQL, but good luck). So, there is a clear overlap between the two roles.
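As an illustration of the feature engineering mentioned above, here is a hedged sketch in pandas: rolling raw transaction rows up into customer-level variables that could feed a model. The table and column names are hypothetical, invented for this example:

```python
import pandas as pd

# Hypothetical raw transactions, as they might come out of an ETL step.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "amount": [10.0, 25.0, 5.0, 7.5, 12.0, 40.0],
    "channel": ["web", "store", "web", "web", "store", "store"],
})

# Feature engineering: aggregate raw rows into one record per customer.
features = transactions.groupby("customer_id").agg(
    n_transactions=("amount", "size"),
    total_spend=("amount", "sum"),
    avg_spend=("amount", "mean"),
    web_share=("channel", lambda s: (s == "web").mean()),
).reset_index()

print(features)
```

This is exactly the kind of construct a data scientist iterates on dozens of times during exploration – which is why, as argued below, cutting them off from the data transformation layer is a mistake.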
In terms of modeling, we have reached a point where the choice of algorithm is less relevant than it used to be. We all have our favorite Swiss army knife: for some, it is a random forest; for others, a forest of stumps or TreeNet; for other analysts, it's a variation of penalized regression; and there even exist some dinosaurs who still think in terms of logistic regression or single trees such as CART or CHAID. And then there are analysts who like Deep Learning because they have an unlimited budget, unlimited time, and a belief that everything is a picture. I wager that 80% of business problems can be solved by such an approach (if all you need is a score). And the data scientist is the one who knows when it doesn't work – the one who recognizes the patterns in the model output suggesting that their favorite toolbox is failing. When the problem goes beyond “simple” scoring, the data scientist's eyes start to shine, while the data engineer's eyes glaze over.
So, yes, they are different. But should there be a clear separation in the data processing tasks?
In my humble opinion, this would be a huge mistake. I have never been in a situation where I knew beforehand exactly which variables I would use and which data sources I would extract from. Unless you are working with very stable models that only need recalibration, you are probably in the same situation. When we build models, we proceed iteratively: first a simple idea, and we see where it goes. Then we try additional variables, additional sources, additional constructs, and we go back and forth. Asking a data scientist to write SQL queries to answer simple business questions is probably the worst use of their time.
It is critical that, during this phase of exploration and model generation, the data scientist is in control of the entire process, has access to all data sources, and has access to an ETL tool that allows them to create constructs.
When it is time to put the data scientist's work in production, should the company choose a different platform than the one used by the data scientists, it makes absolutely no sense to ask the data scientists to automate the process: they will not enjoy it, and they will not be efficient at it. Once we know which variables should be created, it is no longer a big issue to do it in Python or SQL (although Anatella will always be much faster and much more reliable), and once we know which model should be applied, it is easy to generate the SQL code for it, or to set up a Jenkins task to run Anatella. A team of engineers can then optimize the resources so that 300,000,000,000 rows can be processed in a few days rather than weeks. And this is not a fun task for your everyday data scientist.
So, while it is true that they have separate roles, it doesn't mean that they should not both be doing ETL and data wrangling. Moreover, if data scientists never experience the struggles of accessing the data, they will never fully grasp how hard it will be to put their models in production, and they will run a high risk of having limited impact, despite their brilliant minds and exceptional models. Also, both can use the same tools: Anatella is one of the very few ETL tools that is well suited to both data scientists and data engineers. Unless, of course, you want them to work slower, cost more, and use more servers.