Data Science, Data Analysis, Data Engineering. Data this, data that. A side effect of the rise of Artificial Intelligence (AI) in Industry 4.0 is that the word Data gets put in front of everything, in the hope that it will sound more relevant and sell more. When we overuse a term it tends to lose meaning, becoming different things for different people. We started using Data (insert-word-here) to designate several things and, in the end, we may have stopped understanding each other.
Of course data is important for our current technological society. We are learning a lot by acquiring, understanding and using data to develop better solutions. In the DIKIW pyramid, data (here, computer data in particular) is the base from which wisdom for decision-making is distilled. The ability to deal with lots of data and make sense of it all is amazing, but we should not lose sight of reality and, in the process, do more harm than good.
That is what’s been going on with Data Scientists today. The breadth of Data Science has become so wide that some are forgetting that we are… humans! As the title “Data Scientist” loosens, it absorbs more and more tasks, ending up as a nearly unreachable standard. So I believe it’s time we dive a little deeper into the subject, but now in the plural form. Data Science(s) ain’t no magic. So let’s put our feet back on the ground.
Data Science as a whole
The process of delivering Data Science solutions is complex and, as the name indicates, a science in itself. Like any science, it is a very exploratory effort involving a great deal of knowledge acquisition, representation, discussion, implementation and, above all, iteration. A framework I particularly like when talking about the Data Science process is CRISP-DM (Cross-Industry Standard Process for Data Mining), a methodology developed in the 90s and very well discussed by Provost and Fawcett. It consists of 6 phases:
- Business understanding: knowledge acquisition and interpretation to gain insight into the subject matter and the process we are venturing into. We need to know what to deliver, what stakeholders value, and how the development process can be planned;
- Data understanding: runs in parallel with business understanding, each feeding the other. To understand the problems and how to deliver value, we must understand the data available and how to interact with them. Oftentimes data have to be collected from new sources, restructured, enriched, or even bought;
- Data preparation: data comes in various ways and can be stored in various forms. While most Data Science courses teach with well-behaved data, in almost all real cases a good part of the Data Science process is transforming raw data into something usable to train models or generate knowledge;
- Modeling: the most glamorous part, but often also one of the quickest. With a good intuition about the objectives, data pipelines, and clean datasets, modeling becomes exploratory work to find the best way to extract knowledge from information. Here, Machine Learning models are glorified, but rule systems, dashboard visualizations or even data analysis can deliver astonishing value to clients;
- Evaluation: models eat numbers and deliver numbers. We are the ones who have to make sense of them all. Models can be biased or contradictory, and there is always the risk of treating coincidental correlations as meaningful. We always need to check scores, but also validate outputs with domain experts or users;
- Deployment: with a model developed and validated, now we need to put it to use. This involves tasks such as operating, monitoring and managing the final solution. Understanding infrastructure and information architectures is essential here.
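To make the hand-offs between the later phases a bit more concrete, here is a minimal Python sketch of preparation, modeling, evaluation and a deployment entry point. Everything in it is made up for illustration: the churn scenario, the tiny dataset, and the function names are my assumptions, and the one-rule threshold "model" simply stands in for whatever a real project would use. It is not part of CRISP-DM itself.

```python
from statistics import mean

# Hypothetical raw records: (monthly_usage_hours, churned).
# None marks a missing value, as raw data often has.
RAW = [(2.0, 1), (3.5, 1), (None, 0), (10.0, 0),
       (12.5, 0), (1.0, 1), (9.0, 0), (4.0, 1)]


def prepare(rows):
    """Data preparation: drop records with missing fields."""
    return [(x, y) for x, y in rows if x is not None and y is not None]


def fit_threshold(rows):
    """Modeling: a one-rule model, the midpoint between the two class means."""
    churned = mean(x for x, y in rows if y == 1)
    stayed = mean(x for x, y in rows if y == 0)
    return (churned + stayed) / 2


def predict(threshold, usage):
    """Deployment entry point: usage below the threshold is predicted as churn (1)."""
    return 1 if usage < threshold else 0


def accuracy(threshold, rows):
    """Evaluation: share of held-out records predicted correctly."""
    return mean(1 if predict(threshold, x) == y else 0 for x, y in rows)


clean = prepare(RAW)
train, holdout = clean[:5], clean[5:]  # naive split, fine for a toy example
threshold = fit_threshold(train)
score = accuracy(threshold, holdout)
print(f"threshold={threshold:.2f} holdout accuracy={score:.2f}")
```

Note that the model here is a plain rule, not Machine Learning, and that the score alone says nothing about whether churn is the right thing to predict. That part still belongs to business understanding and to validation with real users.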
Data Science as part of a process
First we need to understand that this framework is not isolated from other parts of the business process inside an organization. It is not an end in itself. To reach a good business understanding we need to dive into Design and Management practices, which makes this phase an intersection between disciplines. On the other end, deployment involves several skills in software development and DevSecOps, packing everything created in the other phases into an optimized and fail-proof system. Then it all starts again, since the solution should be validated with end-users and monitored.
On another note, it is very important to keep in mind that this is a great framework to understand the process, but not in any way a step-by-step prescription. The phases are fluid and iterative. Normally there is no clear division between them, and teammates can work on them in parallel. Agile methodologies are usually a good match here, since the Data Science process is intense and iterative. Just keep in mind that it is not usually possible to create a solid backlog for AI solutions. The iterations can spin out of control. A good business understanding is fundamental to keep one foot on the ground, and Service Design is great to manage expectations and get a good view of the solution at an early stage. If we don’t have a concept in mind, the team may end up with an amazing solution that does not solve the problem.
From theory to practice
Great, now we have a common understanding of the Data Science process. You may have noticed that the breadth of knowledge needed to perform the whole cycle I described above is very wide. You may also think “oh, if a Data Scientist can do all that, they may as well be superhuman!”
No, we are not.
As I mentioned, Data Science is still a fluid term, and in my experience working with innovation projects no one can excel in all phases simultaneously. Naturally it is important for all Data Scientists to grasp the whole process and be able to perform the whole cycle at a shallow level, but in practice what we have are different professionals shining in different parts. A much more humanly possible division for Data Science professionals is threefold:
Data Scientist — Insights
Someone with one foot firmly in Data Science and the other in design or management. This expertise has been called Data Translator, which seems fitting to me since they act as interpreters between business, clients and tech teams. Generating insights includes understanding the problem, collecting qualitative information, envisioning solutions, and sparking possibilities throughout the process. This expertise is more intense at the beginning of the project, but remains active and important until the final delivery, sometimes also being involved in validation and UX design processes.
Data Scientist — Machine Learning
These are the statistics geeks. They can build deep learning models before breakfast and do complex calculus in the shower. Jokes aside, the ML expert is responsible for delivering the model, from preprocessing to testing. They interact often with Data Engineers and live at the intersection of raw data, analysis and knowledge extraction. In some cases, they are required to improve or develop ML models and pipelines by themselves, since some applications require less generalist approaches and high efficiency.
Data Scientist — Product
Deploying models in production demands a great deal of work. As the project approaches its end, these people become more and more important. They are responsible for taking the model, packaging everything up, and putting it to use inside an adequate architecture. For this part, not only developer skills are needed, but also a keen eye for optimization and a strong knowledge of infrastructure and the DevSecOps cycle. But their presence is important from the start. The architecture and solution concepts envisioned at the beginning require inputs from someone with this expertise.
The first acts in the first diamond, linking business and tech. Without it, it is easy to lose scope, with black-screen geeks developing what they think is good, not what delivers value. The second acts in development, transforming data into knowledge. Without it there is no AI, and we become hostages of by-the-book models and pipelines. The last acts in delivery, putting optimal solutions to use. Without it a great solution ends up losing its wow factor to performance and security problems.
Data Science is exciting and extremely valuable in today’s information world. Many surprising things can be discovered with a well-planned and well-managed process, things that no human could perceive alone due to cognitive limitations. But Data Science is rarely an individual effort. It requires discussion and integration within a multidisciplinary team that brings together Data Scientists with different backgrounds, Service Designers, UX Designers, UI Designers, Front and Back-End Developers, DevOps, Machine Learning and Data Engineers, Scrum Masters, Product Owners. Of course that would be a lot of people, so think about them as skills, not as jobs.
In the end, the delivery of value for clients and users is what matters, but it is not OK to expect one Data Scientist to do it all alone. Maybe it’s time to take a little more care with names and expectations. Don’t you agree?