How much statistics is enough to do data science?

So, you’re at your desk thinking about getting into data science. But you’ve heard it involves statistics and it’s been a while since school. You’re wondering if you can do it.

I’m here to tell you that you are already doing it. You are doing it every day.

Think about going grocery shopping. You are at the checkout line and you try to predict which line will move fastest. This could be based on the number of items in other shoppers' baskets or the perceived speed of the cashier. This decision-making involves a form of statistical reasoning and probability estimation.

Now, consider the last time you planned a trip. You looked up online reviews or ratings for accommodations, restaurants, and attractions. These reviews and ratings summarize other people's past experiences. They let you judge the quality of different options, which is similar to what we call statistical sampling and inference.

From planning meals to managing finances, statistical thinking permeates our daily lives, guiding decision-making, influencing choices, and shaping outcomes, even when we may not consciously recognize it.

Delving deeper into these everyday habits can refine your intuition and give you a more structured understanding. This blog post serves as a foundation for getting into data science, giving you a familiar starting point for your journey into the field.

So, let’s dive right in!

The impact of statistical thinking

Some people say, “The data speaks for itself.”

But the data never speaks. It needs interpreters.

Data science and statistics go hand in hand. In data science, we collect, analyze, and visualize data. Statistics give us a lens on our data to spot patterns, trends, and connections. Statistics help us see when our analysis is off and ensure our analysis isn’t just based on intuition but grounded in fact.

Statistical thinking can help you decide whether ideas, commonly believed to be sound and intuitive, are perhaps not as rational as initially perceived. While the outcome of an analysis might seem straightforward on the surface, statistical analysis allows us to delve deeper.

So when someone says “The data speaks for itself,” be wary: the claim ignores the enormous role that chance plays in any given outcome. Statistical methods help us decide whether an outcome is statistically significant, meaning it would be unlikely to occur by chance alone, or whether chance could plausibly explain it. If you just look at an outcome without thinking about the statistical context, you might misinterpret the results or draw incorrect conclusions.
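To make the role of chance concrete, here is a small Python sketch. It assumes a made-up scenario that is not from any real dataset: a coin lands heads 60 times in 100 flips, and we ask how often a perfectly fair coin would do at least that well. The function name `prob_at_least` is just an illustrative helper.

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """Probability of k or more successes in n independent trials,
    each succeeding with probability p (binomial model)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical observation: 60 heads out of 100 flips.
# How likely is a result at least this extreme if the coin is fair?
p_value = prob_at_least(60, 100)
print(f"Chance of 60+ heads from a fair coin: {p_value:.4f}")
```

If that probability is small, chance alone is a poor explanation for what you saw. If it is large, the "pattern" may be nothing more than luck.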

Statistical thinking takes us beyond the surface level, prompting us to question our assumptions, acknowledge variability, and appreciate the nuances that shape the outcome of our data analysis.

When you start to develop a statistical mindset, you cultivate a more disciplined and rigorous approach to understanding and interpreting your data. This can help with continuous learning, making ethical decisions, and having confidence in your conclusions.

The 3 key statistics concepts to get you started doing data science

I will give you the most useful statistics concepts to help you start working with data. They will serve as building blocks for understanding your data and analyzing it efficiently.

The three concepts are descriptive statistics, probability, and distributions.

Again, you have been doing and thinking about these concepts regularly. Whether you're reviewing your bank statement to track monthly expenses (descriptive statistics), deciding whether to bring an umbrella based on the weather forecast (probability), or observing how most ratings for a new carpet tend to cluster around a certain value with fewer extreme ratings (distributions), statistical concepts are already at play in your everyday life, even if not formally articulated in technical terms.

These three concepts are fundamental to data analysis. Let's jump in!

1. Descriptive statistics

Descriptive statistics will give you a quick snapshot of what is going on in your dataset. In simple terms, descriptive statistics help you see the big picture of your data without getting lost in all the numbers. They're like a summary that highlights the main characteristics, like what's typical and how much things vary.

Let's say you are analyzing customer purchase data: calculating the average spending, identifying the most frequently purchased items, and so on.

Some important characteristics of your data are:

Average spending (mean)

Imagine you have the sales amounts of different customers. To find the average spending, you add up all the sales and then divide that total by the number of customers. This gives you an idea of the typical amount a customer spends.

Typical purchase amount (median)

If you arrange all the purchase amounts from the lowest to the highest, the median is the middle value. It's like finding the sales amount that's right in the middle: half of the purchases are below it and half are above it. Unlike the mean, it is not pulled up or down by a few unusually large or small purchases.

Most frequently spent amount (mode)

The mode is the sales amount that appears most often. If there's a specific amount that many customers tend to spend, that's the mode. It gives insight into the most common spending pattern.
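As a rough illustration of these three measures, here is a minimal Python sketch using the standard library's `statistics` module. The purchase amounts are made up for the example, with one very large purchase included on purpose.

```python
import statistics

# Hypothetical purchase amounts in euros, one per customer (made-up data).
purchases = [20, 35, 35, 40, 50, 60, 35, 500]

mean_spend = statistics.mean(purchases)      # total spending / number of customers
median_spend = statistics.median(purchases)  # middle value of the sorted amounts
mode_spend = statistics.mode(purchases)      # amount that appears most often

print("mean:", mean_spend)
print("median:", median_spend)
print("mode:", mode_spend)
```

Notice how the single 500-euro purchase drags the mean far above the median and the mode. This is why it helps to look at all three rather than just one.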

Spread of values (range)

The range looks at the difference between the highest and lowest purchase amounts. If the range is big, it means there's a wide spread of spending amounts among customers.

Variation in spending (variance)

Variance is a measure of how far customers' spending amounts are from the average. It is calculated as the average of the squared differences between each amount and the mean. If the variance is high, it means customers are spending quite differently from each other.

Variation around the mean (standard deviation)

Standard deviation is the square root of the variance, which puts it back in the same units as your data (euros rather than squared euros). It tells you how much the spending amounts vary or spread out around the average spending. If the standard deviation is high, it means customers' spending amounts differ a lot from the average; if it's low, most customers spend similar amounts.
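The three measures of spread can be sketched in Python like this. The data is made up, and the example assumes you treat the list as the whole population (`pvariance`, `pstdev`); for a sample drawn from a larger group you would typically use `statistics.variance` and `statistics.stdev` instead.

```python
import statistics

# Hypothetical purchase amounts in euros (made-up data).
purchases = [20, 35, 35, 40, 50, 60, 35, 500]

spend_range = max(purchases) - min(purchases)  # highest minus lowest amount
variance = statistics.pvariance(purchases)     # average squared distance from the mean
std_dev = statistics.pstdev(purchases)         # square root of the variance, in euros

print("range:", spend_range)
print("variance:", variance)
print("standard deviation:", round(std_dev, 2))
```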

Tilt in the data (skewness)

Skewness tells you whether your customer sales data leans to one side. If it's skewed to the right (positive skew), most sales amounts sit on the lower side and a long tail stretches toward a few high amounts; if it's skewed to the left (negative skew), most amounts sit on the higher side and the tail stretches toward a few low amounts. It helps you understand the shape of your sales distribution.

Peakiness of the data (kurtosis)

Kurtosis is often described as how 'peaked' or 'flat' your customer sales data is, but it mostly reflects how heavy the tails of the distribution are. High kurtosis means extreme values (very high or very low sales) show up more often than they would in a normal distribution; low kurtosis means extreme values are rare. It gives you insight into how prone your sales amounts are to outliers.
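Python's standard library has no built-in skewness or kurtosis function, so the sketch below computes them by hand from standardized values. It assumes the simple population formulas (other tools may apply small-sample corrections and give slightly different numbers), and the helper names and data are made up for illustration.

```python
import statistics

def skewness(values):
    """Average cubed z-score: about 0 for symmetric data,
    positive when the long tail is on the high side."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)

def excess_kurtosis(values):
    """Average z-score to the fourth power, minus 3 so that a
    normal distribution scores roughly 0."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 4 for v in values) / len(values) - 3

# Hypothetical purchase amounts with one very large purchase (made-up data).
purchases = [20, 35, 35, 40, 50, 60, 35, 500]
symmetric = [1, 2, 3, 4, 5]

print("skewness (purchases):", round(skewness(purchases), 2))
print("skewness (symmetric):", round(skewness(symmetric), 2))
print("excess kurtosis (purchases):", round(excess_kurtosis(purchases), 2))
```

The single large purchase gives the made-up sales data a positive skew and heavy tails, while the evenly spaced list has no skew at all.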

Read the blog post “Know Your Data with Descriptive Statistics in KNIME” to find out more about descriptive statistics and how to use the low-code KNIME Analytics Platform to extract insights from your data by creating graphical and numerical summaries.

2. Probability

Probability is the study of likelihood and uncertainty. Whenever you think about the outcomes of an event, you refer to the probability of a certain outcome occurring. In other words, how likely is it for a particular event to occur?

For example, let’s say you want to know the chances of a customer making a high-value purchase of over 2000 euros. You analyze your past sales data and find that 10% of customers made purchases over 2000 euros.

If a new customer walks into your store, based on your past data, there's a 10% probability that they will make a purchase exceeding 2000 euros. This information helps you anticipate and plan for the likelihood of high-value transactions.
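Here is a minimal Python sketch of that estimate. The list of past purchases is invented so that exactly 2 out of 20 exceed 2000 euros, and `share_above` is just an illustrative helper name.

```python
def share_above(amounts, threshold):
    """Fraction of past purchases strictly above the threshold,
    used as an estimate of the probability for a new customer."""
    return sum(1 for amount in amounts if amount > threshold) / len(amounts)

# Hypothetical past purchases in euros (made-up data): 2 of 20 exceed 2000.
past_purchases = [
    150, 2400, 80, 320, 45, 990, 1200, 60, 75, 3100,
    410, 220, 35, 1800, 640, 95, 130, 500, 870, 1500,
]

p_high = share_above(past_purchases, 2000)
print(f"Estimated probability of a purchase over 2000 euros: {p_high:.0%}")

# If 50 customers visit, this is roughly how many high-value purchases to plan for.
expected_high = p_high * 50
print("Expected high-value purchases among 50 customers:", expected_high)
```

Keep in mind this is an estimate based on past data. The more past purchases you have, the more you can trust it.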

Probability concepts will help you quantitatively measure how likely events are to occur. In data science, knowing the chance of an event occurring will help you make informed decisions and predictions. Beyond predicting the outcomes of flipping a coin or rolling a die, probability concepts are useful for pretty much everything: forecasting the weather, medical diagnosis, and even working out your odds of winning the lottery!

Read the blog post “Know Your Chances: Calculate Probability in KNIME” to gain a theoretical background in probability and learn how to build a workflow to calculate probabilities using the KNIME Analytics Platform.

3. Distributions

Your data can be categorized as either continuous or discrete. If your data can take any value within a range, it's continuous data. For example, the weight of a pumpkin can be 10.4 kg, 10.43 kg, 10.438 kg, and so on. You can measure with ever greater precision, so there are too many possible values to count individually.

On the other hand, if your data can only take specific, distinct values, it's discrete data. For example, the number of students with each hair color is discrete because you can count them individually. Depending on the type of data you use, distributions are likewise categorized as discrete or continuous.

A distribution describes how the values in a dataset are spread or arranged. It helps you understand the different values your data can take and how often each one occurs. For example, in customer sales, a distribution could show how different purchase amounts are spread out among the customers. It helps you see which amounts are more common, which ones are rare, and generally how the numbers behave in your data.
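One simple way to look at a distribution is to group the values into bins and count how many fall into each. The Python sketch below does that for made-up purchase amounts. The bin width of 50 euros and the helper name `frequency_table` are arbitrary choices for the example.

```python
from collections import Counter

def frequency_table(amounts, width):
    """Group amounts into bins of the given width and count each bin.
    Keys are the lower edge of each bin; empty bins are omitted."""
    counts = Counter((amount // width) * width for amount in amounts)
    return dict(sorted(counts.items()))

# Hypothetical purchase amounts in euros (made-up data).
purchases = [12, 18, 25, 27, 31, 33, 38, 41, 44, 47, 52, 58, 95, 240]
bin_width = 50

table = frequency_table(purchases, bin_width)

# Print a small text histogram: one '#' per purchase in the bin.
for start, count in table.items():
    print(f"{start:>4}-{start + bin_width - 1:<4} {'#' * count}")
```

In this made-up data, most purchases fall in the lowest bin and only one is far out on the high side, which is exactly the kind of picture a distribution is meant to give you.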

When you understand the different distributions, you can more easily select an appropriate statistical method and model. In other words, knowing how your data is distributed helps you choose statistical tests whose assumptions fit your data.

Read the blog posts - Know Your Data with Continuous Probability Distribution, Identify Continuous Probability Distributions with KNIME, and Know Your Data with Discrete Probability Distribution to learn different kinds of continuous and discrete distributions and how to model and visualize them using the KNIME Analytics Platform.

Finding the statistical sweet spot: How much statistics is enough?

Simply put, there is no fixed threshold for “enough” statistical knowledge. As your curiosity grows and you begin to work with increasingly complex datasets and encounter diverse challenges, your statistical toolbox will guide you and help you adapt. That means adjusting your approach when working with big datasets, embracing emerging technologies to analyze your data better, and keeping up with how things are done in your field as you explore data science.

In this blog post, I wanted to emphasize that you already engage in statistical thinking every day, and that it shapes many of your decisions.

A low-code tool like KNIME Analytics Platform lets you focus on learning data science by applying different statistical methods and techniques without the added complexity of coding.

Try it out yourself.
