Communication is key…with other business skills and advice
Table of Contents
- Reducing DS Jargon for Stakeholders
- Do Not Overpromise
- Build a Relationship with a Software Engineer
- Master SQL Optimization
- Git with Git
As data scientists or future data scientists, we often see the same skills described as important, and they are. However, I want to bring up five skills and pieces of advice that are less commonly discussed, so that you can hopefully benefit from these examples and apply them as you move forward in your career. The skills below cover working with stakeholders as well as some programming tips and advice. Keep reading if you would like to learn more about these five unique data science skills.
Reducing Data Science Jargon for Stakeholders
This skill is incredibly important if you work at a company where data science is more customer-facing. Customer can mean two things, primarily. The first type is external, which includes data science clients. The second type, which you may interact with more depending on your specific job, is the company's stakeholders. They usually hold roles like product manager, business analyst, and even software engineer. These are the people to whom you will have to explain your complex data science model in a way that is easy to understand.
In order to even start a data science project at a company, you will need to reduce the amount of data science-specific terms so that the people you are collaborating with can understand the concepts and approach you are taking.
Here is an example of reducing jargon to a product manager:
- BAD — “we need to use an XGBoost Machine Learning algorithm to reduce the root mean square error by 12.68% for our end-users”
A product manager, in this case, might not understand what XGBoost means unless you explain it further. Still, the idea is not to teach data science to your stakeholders: most people will not know of RMSE, and even when it is explained, it can be tricky to understand, even for data scientists. Instead, summarize the model’s effects on the business.
- GOOD — “we will use a new algorithm that has several benefits, and the main one is that it will decrease our prediction error, which will save X amount of money next quarter”
This example is much better because the name of the algorithm is usually not important to a stakeholder (unless it is a software engineer working to deploy your model). This example also highlights the effect that the model will have on the business. While error metrics can be useful, usually they end up being translated into KPI (Key Performance Indicator) metrics that the business is already used to understanding.
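To make the translation from an error metric to a business number more concrete, here is a minimal sketch. It is only an illustration under loud assumptions: the function names, the sample numbers, and especially the idea that every unit of forecast error costs a fixed amount (`cost_per_unit_error`) are hypothetical, and your business will likely need its own cost model.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def mean_abs_error(actual, predicted):
    """Mean absolute error, used here because cost is assumed linear in error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def stakeholder_summary(actual, old_pred, new_pred,
                        cost_per_unit_error, forecasts_per_quarter):
    """Turn model metrics into numbers a stakeholder can act on.

    Assumption (hypothetical): each unit of absolute error costs
    `cost_per_unit_error` in the business's currency.
    """
    old_rmse = rmse(actual, old_pred)
    new_rmse = rmse(actual, new_pred)
    saved_per_forecast = (
        mean_abs_error(actual, old_pred) - mean_abs_error(actual, new_pred)
    ) * cost_per_unit_error
    return {
        "rmse_reduction_pct": 100 * (old_rmse - new_rmse) / old_rmse,
        "estimated_quarterly_saving": saved_per_forecast * forecasts_per_quarter,
    }


# Made-up sample data: the old model is off by 10 units, the new one by 5.
actual = [100, 120, 90, 110]
old_pred = [110, 110, 100, 100]
new_pred = [105, 115, 95, 105]

summary = stakeholder_summary(
    actual, old_pred, new_pred,
    cost_per_unit_error=2.0, forecasts_per_quarter=1000,
)
print(summary)
```

The first number is the one you keep for yourself and your team; the second is the one that belongs in the stakeholder conversation.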
Do Not Overpromise
As data scientists, we may get excited that our accuracy was 96% in development and want to shout it from the rooftops. While this is a great result, we want to make sure that it is representative of the data that is actually available in production, as well as of the specific subset of data that we care about the most. Perhaps the accuracy is closer to 92% for that subset, such as edge cases, where accuracy is more critical.
Here are some of the things to consider when presenting results for the first time to stakeholders:
- What was the testing period?
- Do we expect different testing periods to have the same or different accuracies/errors?
- Is this data available in the production environment?
- Can our training size be the same in production?
- Can we actually predict as frequently in production?
- How much will this model cost versus save?
- Can we afford to train this model frequently?
- What happens when there is missing data for our predictions?
- What happens when outlier events cause the accuracy to drop significantly for particular time periods or groups?
- Test, test, and test! Make sure these predictions are stable.
As you can see, there is quite a bit to consider when relaying results to stakeholders, and above are just some of those considerations.
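One simple way to check a headline number before sharing it is to break accuracy down by group or by testing period. The sketch below assumes a hypothetical list of `(group, actual, predicted)` records with made-up values; the same idea applies if the first field is a month or a customer segment instead.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return accuracy per group from (group, actual, predicted) records."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group, actual, predicted in records:
        totals[group] += 1
        hits[group] += int(actual == predicted)
    return {group: hits[group] / totals[group] for group in totals}


# Made-up evaluation records: the model does well on typical rows
# but struggles on the edge cases we care about most.
records = [
    ("typical", 1, 1), ("typical", 0, 0), ("typical", 1, 1), ("typical", 0, 0),
    ("edge_case", 1, 0), ("edge_case", 1, 1), ("edge_case", 0, 1), ("edge_case", 0, 0),
]

overall = sum(a == p for _, a, p in records) / len(records)
per_group = accuracy_by_group(records)

print(f"overall accuracy: {overall:.2f}")
for group, acc in per_group.items():
    print(f"{group}: {acc:.2f}")
```

If the per-group numbers differ a lot from the overall number, that difference is worth mentioning to stakeholders before they hear only the best-case figure.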
Build a Relationship with a Software Engineer
This skill may be more of a piece of advice, but it is nevertheless incredibly important to data scientists. It is worth considering even more if your background is not in software engineering. Many data scientists do not come from a software engineering or programming background, and instead have backgrounds in finance, earth sciences, statistics, or mathematics. These areas are beneficial in some respects, but where data scientists often struggle is the complex coding around machine learning algorithms and data science model deployment. If this is you, then it might be wise to have someone you are comfortable with to discuss ways of writing more efficient code, and who, not being another data scientist, can review and approve your pull requests.
Here are some of the benefits of working alongside a Software Engineer:
- A better understanding of object-oriented programming (OOP).
- Can deploy models faster.
- Can integrate with the business faster.
- Can have another person to double-check your code if needed — a second set of eyes.
It is important to keep in mind that some data science roles already cover the entire process, including all of the software engineering and programming involved, so this tip is mostly for data scientists who have previously focused solely on building models locally.
Master SQL Optimization
Oftentimes, just as with general programming in a language like Python, data scientists might struggle with optimizing SQL queries. This skill is about understanding what to apply to your query, and where, so that it runs more quickly (and correctly). That can make testing models faster (for example, if you are constantly updating your query) and make training in production more efficient.
Here are some examples of SQL optimization techniques:
- Filtering with dates
- Filtering with categorical data
- Removing ordering
- Performing inner joins
Of course, there are more ways to optimize queries, but the above are some easy ways to drastically improve your query run time.
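The techniques listed above can be sketched with a small, self-contained example. It uses Python's built-in `sqlite3` module with an in-memory database; the `orders` and `customers` tables, their columns, and the rows are all hypothetical, and how much each technique helps will depend on your database engine and data volume.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER,
    order_date TEXT,   -- ISO dates (YYYY-MM-DD) compare correctly as text
    category TEXT,
    amount REAL
);
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "EU"), (2, "US")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?, ?)", [
    (1, 1, "2021-01-15", "books", 20.0),  # outside the date range
    (2, 1, "2021-03-02", "books", 35.0),
    (3, 2, "2021-03-10", "books", 50.0),
    (4, 2, "2021-03-11", "toys", 15.0),   # wrong category
    (5, 3, "2021-03-12", "books", 99.0),  # no matching customer
])

# - Filter with dates: only read the period the model needs.
# - Filter with categorical data: only read the category of interest.
# - Inner join: keep only orders that have a matching customer.
# - No ORDER BY: sorting is skipped because the result is not order-dependent.
query = """
SELECT c.region, SUM(o.amount) AS total
FROM orders AS o
INNER JOIN customers AS c ON c.customer_id = o.customer_id
WHERE o.order_date >= ? AND o.order_date < ?
  AND o.category = ?
GROUP BY c.region
"""
rows = conn.execute(query, ("2021-03-01", "2021-04-01", "books")).fetchall()

# Row order is not guaranteed without ORDER BY, so store the result by key.
totals = dict(rows)
print(totals)
conn.close()
```

Passing the dates and category as parameters also makes it easy to rerun the same query for a different testing period without editing the SQL text.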
Git with Git
When we learn data science in an academic setting, Git and GitHub practices are sometimes left behind. There are plenty of common commands to become familiar with so that you can update your model code quickly, as well as some more unique commands, which are included below. The Atlassian Git cheat sheet cited at the end of this article is a great example of a cheat sheet covering several different types of Git commands, such as undoing changes, rewriting Git history, Git branches, and remote repositories, to name a few.
Here are some of those unique Git commands:
* git status
* git diff
* git branch
* git commit
* git fetch
* git remote add
* git reset --hard
* git rebase -i <base>
The mentioned skills and advice are ones that I have benefitted from myself, and hopefully, you can as well. We have discussed how to work with other stakeholders in your company, as well as how to create more efficient code. All in all, these are just some of the skills that every data scientist should know that could be unique and new to you.
As a summary, here are the top five skills every data scientist should know:
* Reducing DS Jargon for Stakeholders
* Do Not Overpromise
* Build a Relationship with a Software Engineer
* Master SQL Optimization
* Git with Git
Thank you for reading! Please feel free to reach out to me if you have any questions, and comment down below if you have any experiences, agreements, or disagreements with the skills seen above. What other skills or advice can you think of that every data scientist should know?
I am not affiliated with any of the mentioned companies.
Atlassian, Git cheat sheet (2021)