How I Actually Use Statistics as a Data Scientist




Image by Ideogram

 

Introduction

 
When you hear the term data science, you probably think of two things: programming and statistics. In fact, the prerequisite of learning statistics often discourages people from pursuing a career in data. It doesn’t help that most data science job descriptions make it seem like you need a PhD in statistics to thrive in the role, when the reality is entirely different.

In most data science positions, especially at tech companies focused on product development, what you need is applied statistics: using existing statistical frameworks to solve business problems. This is different from academic statistics (think calculating complex formulas by hand). You need to understand what a concept means, how to calculate it using existing libraries, and how to interpret the result. For example, in most practical data science scenarios, it is enough to understand what a p-value of 0.03 means and how to use it to make a business decision; you don’t need to know how to calculate it by hand.
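To make the p-value example concrete, here is a minimal sketch using only Python’s standard library. The conversion counts, the `two_proportion_p_value` helper, and the 0.05 threshold are all illustrative assumptions, not figures from a real experiment. In practice you would likely call an existing statistics library rather than write the test yourself.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rates (pooled z-test)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical experiment: 10,000 users per variant (made-up numbers).
p_value = two_proportion_p_value(conv_a=1000, n_a=10_000, conv_b=1095, n_b=10_000)

alpha = 0.05  # assumed significance level agreed on before the test
decision = "ship the variant" if p_value < alpha else "keep the control"
print(f"p-value = {p_value:.3f} -> {decision}")
```

The part that matters for the job is the last three lines: reading the p-value against a threshold chosen in advance and turning it into a decision, ideally alongside the size of the effect.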

In this article, I will give you examples of how I use statistics in my data science job, along with the resources I used to gain this knowledge.

 

How I Use Statistics in My Data Science Job

 

// Experimentation

Many large tech companies (such as Google, Meta, and Spotify) have a strong experimentation culture. They test rigorously before making feature changes.

When performing A/B tests, I need to know statistical concepts like:

  • Statistical power to determine the sample size required for the experiment
  • Significance levels, p-values, and confidence intervals for decision-making
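The two concepts above can be sketched with the standard library alone. The snippet below uses the common normal-approximation formulas for comparing two conversion rates. The function names, the baseline and target rates, and the experiment counts are hypothetical and chosen only for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(p_control, p_variant, alpha=0.05, power=0.80):
    """Approximate users needed per variant to detect p_control -> p_variant."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = NormalDist().inv_cdf(power)          # desired statistical power
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    effect = p_variant - p_control
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Normal-approximation confidence interval for the lift (rate B - rate A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se


# Before the test: how many users to detect a lift from 10% to 11%?
n_required = sample_size_per_group(p_control=0.10, p_variant=0.11)
print(f"Users needed per variant: {n_required}")

# After the test: a range of plausible lifts (made-up results).
low, high = diff_confidence_interval(1000, 10_000, 1095, 10_000)
print(f"95% CI for the lift: [{low:.4f}, {high:.4f}]")
```

If the interval excludes zero, the result is significant at the matching level, and the width of the interval shows how precisely the lift has been measured.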

Sometimes p-values do not tell the full story, and you will need more complex forms of analysis such as Difference-in-Differences (DiD) estimation. These are concepts I picked up on the job by reading articles, asking questions, and talking with senior colleagues. You cannot possibly learn and remember every concept you will need through courses or even a university degree. I suggest learning the core concepts required to get through the data science interview and picking up the rest on the job.
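For readers who have not met DiD before, the core arithmetic is simple: compare how much the treated group changed with how much an untreated comparison group changed over the same period. The sketch below uses made-up weekly metrics and a hypothetical `did_estimate` helper. It relies on the usual assumption that both groups would have followed parallel trends without the change, and a real analysis would typically fit a regression to get standard errors as well.

```python
from statistics import mean


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Difference-in-Differences: treated change minus control change."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical weekly orders per user, before and after a feature launch.
treated_pre, treated_post = [10, 11, 9, 10], [13, 14, 12, 13]
control_pre, control_post = [8, 9, 8, 7], [9, 10, 9, 8]

effect = did_estimate(treated_pre, treated_post, control_pre, control_post)
print(f"Estimated effect of the launch: {effect} orders per user")
```

Here the treated group rose by 3 and the control group by 1, so the estimate attributes a lift of 2 to the launch rather than the full 3.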

 

// Modeling

Building machine learning models requires knowledge of statistics. However, in my experience, it has been sufficient to have a working knowledge of machine learning models rather than having to learn the theory behind these algorithms and how they are created.

Of course, this doesn’t apply to every industry. A data scientist working in a specialized sector like forecasting, biostatistics, or econometrics must possess deep statistical knowledge pertaining to their field.

In my experience, however, when working in product or tech companies, the focus is more on the business impact and interpretation of these models rather than the mathematical rigor behind them.

 

// Data Analysis

I also spend a significant amount of time analyzing data to understand how users interact with the product and to recommend how that experience can be improved. This typically involves descriptive statistics: I create visualizations, perform customer segmentation, and compare data distributions. Most data-related questions, such as “why did customer retention drop in the past 3 months?”, can be answered with simple visualizations and don’t require sophisticated statistical methods.

In fact, if you know the difference between the mean, median, and mode and can build visualizations like histograms and box plots, you are already equipped to perform this type of analysis. On rare occasions, you might need an advanced regression technique or a time-series model. Again, this is something I usually learn on the job from senior colleagues, documentation, and online tutorials.
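As a small illustration of why the difference between these measures matters, the sketch below summarizes a made-up list of session lengths with one extreme value. The data and variable names are hypothetical. The quartiles and the 1.5 × IQR rule are the same quantities a box plot draws.

```python
from statistics import mean, median, mode, quantiles

# Hypothetical session lengths in minutes, with one extreme session.
session_minutes = [3, 5, 5, 6, 7, 8, 9, 12, 15, 95]

avg = mean(session_minutes)          # pulled upward by the 95-minute session
mid = median(session_minutes)        # robust to the extreme value
most_common = mode(session_minutes)  # most frequent value

# Quartiles and the 1.5 * IQR rule that box plots use to flag outliers.
q1, q2, q3 = quantiles(session_minutes, n=4)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in session_minutes if x > upper_fence]

print(f"mean={avg}, median={mid}, mode={most_common}")
print(f"Q1={q1}, Q3={q3}, outliers={outliers}")
```

The mean lands well above the median here, which is exactly the kind of gap that tells you to report the median (or look at the distribution) instead of quoting an average.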

 

Three Resources to Learn Statistics for Data Science

 
I have a computer science degree and was taught little to no statistics. All of my statistics knowledge comes from resources I’ve found online, and I’ve compiled a list of the most helpful ones:

  • Udacity’s Intro to Statistics is recommended for complete beginners and covers descriptive statistics, inferential statistics, and probability.
  • StatQuest is helpful when you want to learn specific concepts. For example, if you want to learn how regression works, you can find 20-minute tutorials on that exact topic on this channel.
  • Statistical Learning on edX is another great course that you can audit for free. This course teaches you to apply statistical concepts in Python, making it relevant to most data science jobs.

 

Takeaways

 
While the idea of having to learn statistics for data science might sound intimidating, most data science jobs only require applied statistics: the ability to apply statistical concepts to solve business problems. In my experience, this knowledge can be acquired through online courses and doesn’t require a master’s degree in statistics.

The resources listed in this article should suffice to get you through the statistics portion of data science interviews. Any knowledge beyond this can be acquired on the job by continuously reading articles and papers on the subject, working with existing frameworks in your organization, and learning from senior data scientists.

 
 

Natassha Selvaraj is a self-taught data scientist with a passion for writing. Natassha writes on everything data science-related, a true master of all data topics. You can connect with her on LinkedIn or check out her YouTube channel.