Data-driven does not mean data drives itself
For all the hype around data and data-hyphenated terms (like “data-driven”), it is important to remember that data is a raw resource: it has no realized value until it is integrated into a product that uses it to generate a meaningful output. Though the specific roles and responsibilities of data scientists vary from organization to organization (and even within organizations), data scientists are generally the ones responsible for transforming data from a raw resource into something of value for product-users. Indeed, data science is not so much the scientific study of data itself as it is the practice of creating things that use data to generate inferences and/or predictions that satisfy unmet needs or wants [3,4]. An organization’s investment in data science, therefore, is often intended to create what I refer to as “data-fueled product features”: functionalities in a product that transform data inputs into valuable outputs with little or no human intervention between input and output. However, for data science projects to succeed and run efficiently, it is not enough to possess piles of data and expect data scientists to come up with relevant data-fueled features. Rather, a clear understanding of the desired data-fueled feature is needed before any technical work (e.g., data wrangling and modeling) begins.
Products, Features, and Benefits, Oh My!
Before going forward, let’s establish some definitions. A product is an item or service intended to satisfy customers’ needs or wants [1,2]. The definition of a product is expansive and can include examples ranging from physical objects such as televisions and cars to customer services such as credit monitoring and financial advice. In the context of a product, a feature is a specific functional part of a product that is intended to result in a particular benefit, where a benefit is some type of value the user obtains through their use of the product. A data-fueled feature, in particular, is a term I use to describe a product feature that is powered by data and would be useless without data inputs. Examples of data-fueled features include recommender engines on e-commerce websites and applications, speech recognition models in voice assistant devices, image classifiers that can distinguish a hot dog from something that is not a hot dog, and fraud detection models that are part of a credit card’s customer service.
I place so much emphasis on product features as opposed to the product itself here because the creation of a product, even what might be termed a “data product” [4–6], is the result of multiple teams with complementary skill sets, not just data scientists. However, when the focus shifts to the specific data-fueled features of a product, I believe data science responsibilities become clearer and organizing a data science project becomes more straightforward.
From Idea to Execution
If data science is to produce any benefit, direction is necessary. Establishing a clear idea for a data-fueled feature provides such direction. Why? When a clear idea for a data-fueled feature is achieved, the next steps for the data science project are likely to become clearer.
I can think of at least three ways in which understanding the feature that needs to be created helps direct other steps in a project (and if you have additional ways, please leave a comment). First, after an idea for a data-fueled feature is identified, it is easier to identify the data that is needed and, equally important, the data that is not needed. Now, this does not mean that the needed data will be available or ready to use in a machine learning model right away (in most cases it will not be), but being able to articulate data needs is a big step in the right direction. Second, understanding the desired feature also helps to identify what kind of method needs to be applied to the data. For example, if the feature in question is intended to predict whether some kind of event has occurred (say, fraudulent behavior), you now have a good hunch that you are going to need a classification model. In addition, you are now in a position to refine your modeling strategy further by assessing whether the prediction generation process needs to be highly interpretable. If so, then some kind of interpretable method, or a mechanism that allows one to make sense of a model’s decision process, needs to be employed. Third, understanding the product feature driving a data science project can help in choosing and evaluating performance metrics. For instance, if the feature is a supervised machine learning model, having a sense of the feature’s purpose for the product can help determine whether false positives or false negatives are more costly to the product’s users.
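To make the third point concrete, here is a minimal sketch in Python of how the relative cost of false positives and false negatives can change the decision threshold one would pick for a fraud classifier. Everything in it is an assumption for illustration: the labels, the model scores, the candidate thresholds, the cost values, and the helper names (`confusion_counts`, `total_cost`, `best_threshold`) are all hypothetical and not taken from any real product or library.

```python
# Hypothetical illustration: labels, scores, thresholds, and costs are invented.

def confusion_counts(labels, scores, threshold):
    """Return (false_positives, false_negatives) when flagging scores >= threshold."""
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    return fp, fn

def total_cost(labels, scores, threshold, cost_fp, cost_fn):
    """Weight each kind of error by what it costs the product's users."""
    fp, fn = confusion_counts(labels, scores, threshold)
    return fp * cost_fp + fn * cost_fn

def best_threshold(labels, scores, cost_fp, cost_fn, candidates):
    """Pick the candidate threshold with the lowest total error cost."""
    return min(
        candidates,
        key=lambda t: total_cost(labels, scores, t, cost_fp, cost_fn),
    )

# 1 = fraudulent transaction, 0 = legitimate transaction (made-up data).
labels = [0, 0, 0, 0, 1, 0, 1, 1]
# Made-up model scores: higher means "more likely fraud".
scores = [0.05, 0.10, 0.30, 0.45, 0.40, 0.60, 0.70, 0.90]
candidates = [0.2, 0.35, 0.5, 0.65, 0.8]

# Feature purpose A: missing fraud is far worse than a false alarm.
t_fn_costly = best_threshold(labels, scores, cost_fp=1, cost_fn=20,
                             candidates=candidates)

# Feature purpose B: wrongly blocking a customer is far worse than missing fraud.
t_fp_costly = best_threshold(labels, scores, cost_fp=20, cost_fn=1,
                             candidates=candidates)

print(t_fn_costly)  # 0.35: a lower threshold that catches every fraud case here
print(t_fp_costly)  # 0.65: a higher threshold that raises no false alarms here
```

The same model and the same data lead to different operating points depending only on which error the feature's users can least afford, and that is something the feature idea has to tell you.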
Today, many organizations are investing in data science. Such efforts should generally be applauded (assuming that these efforts are founded on ethical and responsible data practices). Yet, I write this piece to stress that investment in data science presumes, or should presume, that an organization has strong ideas for data-fueled features, or is actively developing ideas for features powered by data. Remember, data scientists can throw any data into a model and obtain some result. However, meaningful results are driven by a clear idea that guides data scientists’ efforts to use data to produce some kind of benefit for product-users.
3. Kozyrkov C. What on earth is data science? In: Hackernoon [Internet]. 10 Aug 2018 [cited 7 Sep 2020]. Available: https://hackernoon.com/what-on-earth-is-data-science-eb1237d8cb37
4. O’Neil C, Schutt R. Doing Data Science: Straight Talk from the Frontline. O’Reilly Media, Inc.; 2013.
6. O’Regan S. Designing Data Products. In: Towards Data Science [Internet]. 16 Aug 2018 [cited 8 Feb 2021]. Available: https://towardsdatascience.com/designing-data-products-b6b93edf3d23
8. Koehrsen W. Beyond Accuracy: Precision and Recall. In: Towards Data Science [Internet]. 3 Mar 2018 [cited 20 Feb 2021]. Available: https://towardsdatascience.com/beyond-accuracy-precision-and-recall-3da06bea9f6c
Focus on Data-Fueled Features to Move Data Science Projects Forward was originally published in Towards Data Science on Medium.