Starting out in data science can feel overwhelming, but getting familiar with its terminology can help you build a solid foundation.
Regardless of your professional title, you’ll likely need to know a thing or two about data science at some point in your career.
So whether you’re advancing your skills as a data scientist or navigating new concepts, this comprehensive glossary can beef up your knowledge about everything from “accuracy” to “z-score.”
📌 Pro-tip: Bookmark this article to refer back to anytime you have a data science-related question.
A
Accuracy
Accuracy is a measure of how often a predictive model is correct.
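For instance, with a handful of made-up labels, accuracy is simply the share of predictions that match the actual values. A rough sketch in Python:

```python
# Hypothetical labels for illustration only.
y_true = [1, 0, 1, 1, 0, 1]   # actual labels
y_pred = [1, 0, 0, 1, 0, 1]   # model predictions

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(accuracy)  # 0.833..., since 5 of the 6 predictions are correct
```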
Adam optimization
Adam (adaptive moment estimation) optimization is a popular algorithm used in the training of deep learning models. It is known for its efficiency and low memory requirements.
Algorithm
An algorithm is a set of rules or steps designed to solve a problem or perform a specific task. In data science, algorithms analyze data, make predictions, and uncover patterns.
Alternative hypothesis
An alternative hypothesis is a claim that contradicts the null hypothesis and suggests an effect or a relationship in the data.
Anomaly detection
Anomaly detection is the process of identifying rare items, events, or observations that raise suspicions by differing significantly from the majority of the data.
ANOVA
Analysis of variance (ANOVA) is a statistical method used to analyze the differences among group means in a sample.
Apache Spark
Apache Spark is a distributed computing system that provides an interface for programming entire clusters with implicit data parallelism.
API
An application programming interface (API) is a set of rules and protocols that allows different software applications to communicate with each other, enabling integration and data exchange.
Artificial intelligence
Artificial intelligence (AI) is the simulation of human intelligence processes by machines, especially computer systems. These processes include learning, reasoning, and self-correction.
Artificial neural networks
Artificial neural networks are computational models inspired by the human brain's neural structure and used to recognize patterns and make predictions.
Auto-regression
Auto-regression is a time series model where a variable is regressed on its own past values.
B
Backpropagation
Backpropagation is a method for training neural networks by adjusting their weights to reduce errors. It works by moving the error backward through the network to improve accuracy.
Bagging
Bootstrap aggregating (bagging) is a machine learning ensemble meta-algorithm that improves model accuracy and stability by training multiple models on bootstrapped samples of the data and aggregating their predictions.
Bar chart
A bar chart is a graphical representation of data where the length of bars represents the frequency or value of data.
Bayes’ theorem
Bayes’ theorem is a fundamental theorem in probability theory. It quantifies the probability of an event based on prior knowledge of conditions related to the event.
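As a quick illustration with invented numbers, Bayes' theorem can update the probability of having a disease after a positive test result:

```python
# All rates below are assumptions chosen for the example.
p_disease = 0.01              # prior: 1% of people have the disease
p_pos_given_disease = 0.99    # test sensitivity
p_pos_given_healthy = 0.05    # false positive rate

# Total probability of testing positive.
p_positive = (p_pos_given_disease * p_disease
              + p_pos_given_healthy * (1 - p_disease))

# Bayes' theorem: P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
p_disease_given_pos = p_pos_given_disease * p_disease / p_positive
print(round(p_disease_given_pos, 3))  # ~0.167: a positive test is far from a sure thing
```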
Bayesian statistics
Bayesian statistics is a mathematical approach that applies probability to statistical problems, updating beliefs in light of new evidence.
Bernoulli trial
A Bernoulli trial is a random experiment with exactly two possible outcomes, "success" and "failure", in which the probability of success is the same every time the experiment is conducted.
Bias
Bias is the systematic error in a model that affects its predictions by consistently skewing results in one direction.
Bias-variance trade-off
Bias-variance trade-off is the balance between errors from bias and variance to optimize model performance.
Big data
Big data is the term for extremely large datasets that can be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions.
BigQuery
BigQuery is Google's fully managed and serverless data warehouse. It enables very fast SQL queries using the processing power of Google’s infrastructure.
Binary variable
A binary variable is a variable that has only two possible values, such as true/false or yes/no.
Binary classification
Binary classification is a type of predictive modeling that categorizes data into two distinct classes.
Binomial distribution
Binomial distribution is the probability distribution of the number of successes in a fixed number of independent Bernoulli trials.
Boolean
Boolean is a data type with two possible values: true or false.
Boosting
Boosting is an ensemble learning method that sequentially combines weak learners to create a strong learner, primarily to reduce bias and variance in machine learning models.
Bootstrapping
Bootstrapping is a statistical technique for sampling data with replacement to estimate uncertainty in a statistic.
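Here is a minimal sketch, using made-up data and assuming NumPy is available, of how bootstrapping can estimate a confidence interval for the mean:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=200)   # simulated sample

# Draw 1,000 bootstrap samples (with replacement) and record each sample's mean.
boot_means = [rng.choice(data, size=len(data), replace=True).mean()
              for _ in range(1000)]

# An approximate 95% confidence interval for the mean from the bootstrap distribution.
low, high = np.percentile(boot_means, [2.5, 97.5])
print(low, high)
```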
Box plot
A box plot is a graphical tool to visualize the distribution of a dataset and display a statistical summary such as the median, quartiles, and potential outliers.
Business intelligence
Business intelligence (BI) involves strategies and technologies used by enterprises for data analysis and management of business information. BI tools and techniques help in transforming raw data into meaningful and useful information.
C
Categorical variable
A categorical variable is a variable that can take on only specific, distinct values representing different groups or categories, which are mutually exclusive and exhaustive.
Chi-square test
The chi-square test is a statistical test used to determine the association between categorical variables.
Classification
Classification is a supervised learning technique where models learn to assign input data points to distinct categories or classes based on their features.
Cluster analysis
Cluster analysis is a technique used to group similar data points into clusters based on their characteristics. This method helps in understanding the structure of data and identifying patterns.
Computer vision
Computer vision is the field of study enabling computers to interpret and understand the visual world.
Concatenate
To concatenate is to join two or more strings or arrays end-to-end.
Concordant-discordant ratio
The concordant-discordant ratio is a measure of agreement in the ranking of paired observations.
Confidence interval
A confidence interval is a statistical range indicating where a population parameter is likely to lie based on sample data and a specified level of certainty.
Confusion matrix
A confusion matrix is a table used to evaluate the performance of a classification algorithm. It shows the actual versus predicted classifications, helping to visualize errors and accuracy.
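A small sketch with made-up labels, assuming scikit-learn is installed:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical actual classes
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # hypothetical predicted classes

# Rows are actual classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
print(confusion_matrix(y_true, y_pred))
```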
Continuous probability distribution
Continuous probability distribution describes the probabilities of a continuous random variable’s possible values.
Continuous random variable
A continuous random variable is a type of variable that can take on any numerical value within a specified range.
Convergence
Convergence is the state where an optimization algorithm's updates become negligible, indicating it has settled on a solution (often a minimum of the loss function).
Convex function
A convex function is a mathematical function in which the line segment between any two points on the graph lies on or above the graph.
Correlation
Correlation is a statistical measure that describes the strength and direction of the relationship between two variables.
Cosine similarity
Cosine similarity is a metric to measure the similarity between two non-zero vectors.
Cost function
A cost function is a mathematical function that maps the values of one or more variables to a real number representing a cost. In machine learning, it quantifies how far a model's predictions are from the actual values, and training aims to minimize it.
Covariance
Covariance is a measure of how two variables vary together; a positive covariance means they tend to move in the same direction, while a negative covariance means they tend to move in opposite directions.
Cross-validation
Cross-validation is a model evaluation method used to assess how a predictive model will generalize to an independent dataset. It involves partitioning data into subsets, training the model on some subsets, and testing it on others.
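For illustration, a minimal 5-fold cross-validation sketch using scikit-learn's built-in Iris dataset (assuming scikit-learn is installed):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Split the data into 5 folds; train on 4 and test on the held-out fold, 5 times.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())  # average accuracy across the 5 folds
```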
D
Dashboard
A dashboard is a dynamic interface that visually presents key metrics and data trends in real time. It helps users monitor performance and make data-driven decisions by transforming complex data into intuitive visualizations.
Data analytics
Data analytics is the process of examining datasets to draw meaningful insights, identify patterns, and make data-driven decisions.
Data cleaning
Data cleaning is the process of identifying and correcting or removing errors, inconsistencies, or irrelevant information from datasets to ensure data quality.
Data engineering
Data engineering is the practice of designing, developing, and maintaining the infrastructure and systems necessary for storing, processing, and analyzing large volumes of data.
Data governance
Data governance is the framework and set of practices that ensure data is properly managed, protected, and available for authorized use within an organization.
Data lake
A data lake is a storage repository that holds vast amounts of raw and unprocessed data, allowing for flexible analysis and processing.
Data mining
Data mining is the process of discovering patterns, correlations, and anomalies within large data sets to predict outcomes. It combines statistical analysis, machine learning, and database systems.
Data modeling
Data modeling is the process of creating a mathematical or logical representation of real-world data to better understand its structure, relationships, and behavior.
Data pipeline
A data pipeline is the sequence of steps and processes that move and transform data from its source to its destination, often involving data ingestion, processing, integration, and storage.
Data preparation
Data preparation is the process of transforming raw data into a format suitable for analysis or modeling, including tasks like cleaning, formatting, and feature engineering.
Data science
Data science is an interdisciplinary field that combines scientific methods, algorithms, and systems to extract knowledge and insights from structured and unstructured data.
Data science life cycle
The data science life cycle is the series of steps involved in a data science project, including problem formulation, data acquisition, data preparation, modeling, evaluation, deployment, and maintenance.
Data storytelling
Data storytelling is the art of conveying insights and narratives by using data visualizations, storytelling techniques, and persuasive communication to make data more engaging and understandable.
Data structure
A data structure is a way of organizing and storing data to facilitate efficient access, manipulation, and retrieval, such as arrays, lists, and trees.
Data transformation
Data transformation is the process of converting or mapping data from one format, structure, or representation to another to meet specific requirements.
Data type
A data type is a classification that categorizes the kind of data a variable can hold, such as numerical, categorical, textual, or date.
Data visualization
Data visualization is the graphical representation of data through charts, graphs, maps, or other visual elements to facilitate data exploration, analysis, and communication.
Data warehouse
A data warehouse is a centralized repository that stores large volumes of structured data from multiple sources. It supports query and analysis, helping organizations make informed decisions.
Data wrangling
Data wrangling is the process of cleaning, transforming, and organizing raw data into a format suitable for analysis, modeling, or further processing.
Database
A database is a structured collection of data that is organized, stored, and managed in a way that enables efficient retrieval, updating, and querying.
Dataframe
A dataframe is a tabular data structure that organizes data in rows and columns, similar to spreadsheets or database tables, commonly used in data manipulation and analysis.
Dataset
A dataset is a collection of related data that is organized and stored together, often used for analysis, modeling, or training machine learning models.
DBSCAN
Density-based spatial clustering of applications with noise (DBSCAN) is a popular clustering algorithm used to group data points based on their density in a given space.
Decision boundary
A decision boundary is the dividing line or surface that separates different classes or regions in a classification problem.
Decision tree
A decision tree is a supervised learning algorithm that uses a tree-like model to make decisions or predictions by splitting data based on feature conditions.
Deep learning
Deep learning is a subset of machine learning involving neural networks with many layers. These networks can model complex patterns in data, enabling advancements in areas like image and speech recognition.
Decile
A decile is a statistical measure that divides a ranked dataset into ten equal parts; the k-th decile is the value below which k tenths of the data fall.
Degree of freedom
Degrees of freedom are the number of independent values in a calculation that are free to vary when estimating a parameter or testing a statistical hypothesis.
Dependent variable
A dependent variable is a number, quantity, or characteristic (variable) that is predicted or influenced by one or more independent variables in a statistical analysis.
Descriptive statistics
Descriptive statistics are statistical measures that summarize and describe the main features, patterns, and characteristics of a dataset.
Dimensionality reduction
Dimensionality reduction is the process of reducing the number of variables or features in a dataset while preserving as much information as possible.
Discrete distribution
Discrete distribution is a probability distribution that describes the probability of occurrence of each discrete or countable random variable, such as Poisson or binomial distributions.
Discrete random variable
A discrete random variable takes on distinct, separate values without continuity between them, such as the number of children in a family.
dplyr
dplyr is a key R package for intuitive and user-friendly manipulation of data frames. Part of the tidyverse, it offers a consistent set of verbs to address common data manipulation challenges.
Dummy variable
A dummy variable is a binary variable used to represent categories or levels of a categorical variable in a statistical model.
E
Early stopping
Early stopping is a technique used in training machine learning models to prevent overfitting. It halts the training process when the model's performance on a validation set no longer improves.
EDA
Exploratory data analysis (EDA) is the preliminary examination and visualization of data to understand its main features, patterns, and distributions.
Ensemble learning
Ensemble learning is a machine learning approach that combines the predictions of multiple models to obtain more robustness and better overall performance.
ETL
Extract, Transform, Load (ETL) is the process of extracting data from various sources or systems, transforming it into a consistent format, and loading it into a target system for further analysis.
Evaluation metrics
Evaluation metrics are measures used to assess and quantify the performance, reliability, and quality of a predictive model, such as accuracy, precision, recall, or F1-score.
F
F-score
The F-score is a measure that combines precision and recall (as their harmonic mean) to evaluate the performance of a classification model.
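With hypothetical precision and recall values, the F1 score (the most common F-score) works out as follows:

```python
# Assumed values for illustration.
precision, recall = 0.8, 0.6

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.686
```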
Factor analysis
Factor analysis is a statistical method used to identify latent factors or underlying dimensions in a dataset and explain the relationships between observed variables.
False negative
False negative is a type of prediction error in binary classification in which a positive case is incorrectly classified as negative.
False positive
False positive is a binary classification prediction error in which a negative case is incorrectly classified as positive.
Feature engineering
Feature engineering is the process of using domain knowledge to create new input features for machine learning models. It involves transforming raw data into meaningful attributes that enhance model performance.
Feature hashing
Feature hashing is a technique used to convert categorical features into a numerical representation by applying a hash function.
Feature reduction
Feature reduction is the process of reducing the number of features or variables in a dataset while preserving relevant information and minimizing redundancy.
Feature selection
Feature selection is the process of selecting a subset of relevant features or variables from a larger set to build more interpretable and efficient models.
Few-shot learning
Few-shot learning is a machine learning approach that aims to learn new concepts or classes with limited training data or few examples.
Float
A float is a data type that represents floating-point numbers or decimal numbers with fractional parts.
Flow variable
A flow variable is used to propagate node parameters and settings from one node to another within a data processing workflow or pipeline.
Fourier transform
The Fourier transform is a mathematical technique that converts a function or signal into its constituent frequencies, enabling analysis in the frequency domain.
Frequentist statistics
Frequentist statistics is a statistical framework that focuses on the frequencies of events or outcomes based on repeated trials or observations.
Front end
The front end is the part of a software system or application that interacts directly with users and provides the user interface.
Fuzzy algorithms
Fuzzy algorithms are computational procedures that leverage fuzzy logic and approximation techniques to manage uncertainty and imprecision within data processing or decision-making tasks.
Fuzzy c-means
Fuzzy c-means is a clustering algorithm based on fuzzy logic that assigns data points to multiple clusters with varying degrees of membership.
Fuzzy logic
Fuzzy logic is a branch of logic that allows for degrees of truth rather than strict true or false values, incorporating uncertainty and ambiguity.
G
Gated recurrent unit
A gated recurrent unit (GRU) is a type of recurrent neural network (RNN) architecture that uses gating mechanisms to selectively update and forget information in sequence modeling tasks.
Gaussian distribution
The Gaussian distribution, also known as the normal distribution, is a symmetric probability distribution with a bell-shaped curve defined by its mean and standard deviation.
Geospatial analytics
Geospatial analytics is the practice of analyzing and interpreting geographic or spatial data to uncover insights, identify patterns, and comprehend relationships within the physical environment.
Goodness of fit
Goodness of fit is a statistical measure that evaluates how well an observed data distribution matches an expected distribution or model.
Gradient descent
Gradient descent is an optimization algorithm used to minimize the loss function in machine learning models. It iteratively adjusts model parameters to find the best fit for the data.
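A toy sketch of the idea, minimizing a simple one-dimensional function rather than a real model's loss:

```python
# Minimize f(w) = (w - 3)^2; its gradient is 2 * (w - 3).
w = 0.0              # arbitrary starting point
learning_rate = 0.1

for _ in range(100):
    gradient = 2 * (w - 3)
    w -= learning_rate * gradient   # step in the direction opposite the gradient

print(round(w, 4))  # approaches the minimum at w = 3
```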
Greedy algorithms
Greedy algorithms are algorithms that make the locally optimal choice at each stage, aiming to reach a global optimum in optimization problems.
H
Hadoop
Hadoop is an open-source framework that allows for distributed processing and storage of large datasets across multiple computing nodes using a cluster of commodity hardware.
Heatmap
A heatmap is a graphical representation of data where colors indicate the intensity or density of values in a matrix or grid.
Hidden Markov model
A hidden Markov model (HMM) is a probabilistic model used to model sequential data, assuming that the system being modeled is a Markov process with unobservable states.
Hierarchical clustering
Hierarchical clustering is a clustering technique that progressively joins proximate data points into clusters, resulting in a hierarchy of clusters based on the distance between them.
Histogram
A histogram is a graphical representation of the distribution of numerical data, dividing the data into bins or intervals and showing the frequency of values in each bin as a bar.
Holdout sample
A holdout sample is a subset of data set aside from the training data to evaluate the model's performance on unseen data.
Holt-Winters forecasting
Holt-Winters forecasting is a time series forecasting method that applies exponential smoothing to capture trends and seasonality.
Human-in-the-loop
Human-in-the-loop refers to systems in which human input is integrated into the machine learning process, often to improve model performance or ensure ethical considerations.
Hyperparameter
A hyperparameter is a parameter whose value is set before the learning process begins, controlling the behavior of the training algorithm.
Hyperparameter tuning
Hyperparameter tuning is the process of selecting the best parameters for a machine learning model. These parameters, set before training, control the learning process and model complexity.
Hyperplane
A hyperplane is a flat affine subspace of a higher-dimensional space, commonly used in machine learning for separating data points in classification tasks. In n-dimensional space, a hyperplane is a subspace of dimension n-1.
Hypothesis
A hypothesis is a proposed explanation made on the basis of limited evidence, serving as a starting point for further investigation. In data science, hypotheses are often tested through statistical methods to validate assumptions about data.
I
Imputation
Imputation is the process of replacing missing data with substituted values. This technique helps in maintaining the dataset's completeness and is essential for accurate data analysis and model training.
Inferential statistics
Inferential statistics is the branch of statistics that uses sample data to make predictions or inferences about a population. It includes hypothesis testing, confidence intervals, and regression analysis.
Independent variable
An independent variable is a variable that is manipulated or categorized to observe its effect on a dependent variable. It is the presumed cause in a cause-and-effect relationship.
Integer
An integer is a whole number that can be positive, negative, or zero. Integers are used in data science for various purposes, including indexing and categorical data representation.
Interquartile range
The interquartile range (IQR) is a measure of statistical dispersion and is calculated as the difference between the third quartile (Q3) and the first quartile (Q1). It is used to describe the spread of the middle 50% of a dataset.
Iteration
Iteration is the process of repeating a set of operations until a specific condition is met. In data science, iterations are used in algorithms and model training to progressively improve performance.
J
Joint probability
Joint probability is the probability of two events occurring simultaneously. It is a key concept in probability theory and statistics and is useful in understanding relationships between variables.
Julia
Julia is a high-level, high-performance programming language designed for technical computing. It is particularly popular in data science for its speed and ease of use in numerical analysis and computational science.
K
Keras
Keras is an open-source Python neural network library, used for creating and experimenting with deep learning models. It acts as an interface for the TensorFlow library.
K-means
K-means is a clustering algorithm that partitions data into k distinct clusters based on similarity. It minimizes the variance within each cluster and is widely used for data segmentation.
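A minimal sketch with a handful of made-up 2-D points, assuming scikit-learn is installed:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two small, well-separated groups of points.
X = np.array([[1, 2], [1, 4], [2, 3],
              [8, 8], [9, 10], [8, 9]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # coordinates of the two cluster centers
```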
K-nearest neighbors
K-nearest neighbors (KNN) is a simple, supervised machine learning algorithm used for classification and regression. It assigns a class based on the majority vote of the k-nearest data points in the feature space.
Kurtosis
Kurtosis is a measure of the tailedness of the probability distribution of a real-valued random variable. High kurtosis indicates heavy tails, while low kurtosis indicates light tails relative to a normal distribution.
L
Labeled data
Labeled data are datasets that have been tagged with one or more labels identifying the target or outcome. This data is essential for training supervised learning models.
Lasso regression
Lasso regression is a type of linear regression that includes a penalty term to enforce sparsity in the model coefficients. It helps in feature selection by shrinking less important feature coefficients to zero.
Line chart
A line chart is a type of data visualization that displays information as a series of data points called “markers” connected by straight-line segments. It is commonly used to track changes over intervals of time.
Linear regression
Linear regression is a statistical method for predicting one value based on other related values. It works by finding the best straight line that fits through the data points.
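A short sketch using made-up data that roughly follows a straight line, assuming scikit-learn is installed:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data that approximately follows y = 2x + 1.
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)  # slope and intercept close to 2 and 1
print(model.predict([[6]]))              # prediction for a new input
```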
Log likelihood
Log likelihood is the natural logarithm of the likelihood function. It assesses how probable the observed data is under given parameter values; working on the log scale turns products into sums and improves numerical stability during estimation.
Log loss
Log loss, or logistic loss, is a performance metric for evaluating the accuracy of a classification model. It quantifies the uncertainty of the model's predictions, penalizing confident but incorrect predictions heavily; lower log loss values indicate better model performance.
Logistic regression
Logistic regression is a statistical method for analyzing a dataset in which there are one or more independent variables that determine an outcome. The outcome is usually binary (0 or 1).
Long short-term memory
Long short-term memory (LSTM) is a type of recurrent neural network (RNN) architecture used in deep learning. It is designed to model sequences and capture long-term dependencies, making it effective for tasks like time series prediction.
Loops
Loops are control structures that repeatedly execute a block of code or a workflow snippet as long as a specified condition is met. They are fundamental in automating repetitive tasks.
M
Machine learning
Machine learning is a branch of artificial intelligence that enables systems to learn from data and improve from experience without being explicitly programmed. It involves algorithms that can make predictions or decisions.
MapReduce
MapReduce is a programming model for processing and generating large datasets with a parallel, distributed algorithm on a cluster. It consists of a map step that filters and sorts data and a reduce step that performs a summary operation.
Matplotlib
Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. It is widely used for plotting data and creating graphs and charts.
Market basket analysis
Market basket analysis is a data mining technique used to discover associations between products purchased together. It is commonly used in retail to understand customer purchasing behavior.
Market mix modeling
Market mix modeling is a statistical analysis technique used to estimate the impact of various marketing tactics on sales and to forecast the impact of future marketing strategies.
Maximum likelihood estimation
Maximum likelihood estimation (MLE) is a method used to estimate the parameters of a statistical model. It finds the parameter values that maximize the likelihood of the observed data given the model.
Mean
The mean is the arithmetic average of a set of numbers, calculated by summing all the values and dividing by the count. It is a measure of central tendency in data.
Mean (average, expected value)
The mean, or expected value, is a measure of the central tendency of a probability distribution. It is the weighted average of all possible values that a random variable can take on.
Mean absolute error
Mean absolute error (MAE) is a measure of prediction accuracy in regression analysis. It calculates the average absolute differences between predicted values and actual values.
Mean squared error
Mean squared error (MSE) is a measure of the quality of an estimator. It calculates the average squared differences between predicted values and actual values, penalizing larger errors more than smaller ones.
Median
The median is the middle value in a data set when the values are arranged in ascending or descending order. It is a robust measure of central tendency that is not affected by outliers.
MLOps
MLOps (machine learning operations) is a set of practices for managing machine learning projects from start to finish. It combines machine learning work with software development and operations practices to ensure models work reliably in real-world use.
Mode
The mode is the value that appears most frequently in a data set. It is a measure of central tendency that is useful for categorical data.
Model selection
Model selection is the process of choosing the most appropriate model from a set of candidate models for a given dataset. It involves evaluating model performance using criteria like cross-validation.
Monte Carlo simulation
Monte Carlo simulation is a computational technique that uses random sampling to obtain numerical results. It is used to model the probability of different outcomes in complex systems.
Multi-class classification
Multi-class classification is a type of classification task where the goal is to categorize instances into one of three or more classes. Common algorithms include decision trees, support vector machines (SVMs), and neural networks.
Multivariate analysis
Multivariate analysis is a process that involves examining multiple variables to understand relationships and effects among them. It includes techniques like multivariate regression, factor analysis, and multivariate analysis of variance (MANOVA).
Multivariate regression
Multivariate regression is an extension of linear regression that models the relationship between multiple independent variables and multiple dependent variables.
N
Naive Bayes
Naive Bayes is a probabilistic classifier based on Bayes' theorem, assuming independence between predictors. It is highly effective for text classification tasks like spam detection.
NaN
NaN stands for "not a number" and represents undefined or unrepresentable numerical results in computing. It is commonly encountered in data cleaning and preprocessing.
Natural language processing
Natural language processing (NLP) is a field of artificial intelligence that focuses on the interaction between computers and human language. It involves tasks like speech recognition, text analysis, and language generation.
Nominal variable
A nominal variable is a categorical variable with no intrinsic ordering among its categories. Examples include gender, nationality, and color.
Non-relational database
A non-relational database, or NoSQL database, is designed to handle large volumes of unstructured or semi-structured data. It offers flexibility and scalability compared to traditional relational databases.
Normal distribution
Normal distribution is a continuous probability distribution characterized by its bell-shaped curve, symmetric about the mean. It is foundational in statistics for many inferential techniques.
Normalization
Normalization is the process of scaling individual data points to have a standard range, often between 0 and 1. It improves the performance of machine learning algorithms.
NoSQL
NoSQL is a class of database management systems that do not adhere to the traditional relational database model. They are designed for distributed data storage and horizontal scaling.
Numeric prediction
Numeric prediction is a process for predicting a numerical value based on input data. Techniques include regression analysis and time series forecasting.
Null hypothesis
A null hypothesis is a statement that there is no effect or relationship between variables. It serves as the default assumption that researchers aim to test against using statistical methods.
NumPy
NumPy is a fundamental package for scientific computing in Python. It provides support for arrays, matrices, and a collection of mathematical functions to operate on these data structures.
O
Open source
Open source software is software with source code that anyone can inspect, modify, and enhance. It fosters collaborative development and innovation in the tech community.
One-hot encoding
One-hot encoding is a technique for converting categorical variables into a binary matrix. Each category value is represented as a one-hot vector, which improves compatibility with machine learning algorithms.
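For example, a minimal sketch using pandas (assuming it is installed) on a hypothetical column:

```python
import pandas as pd

# A made-up categorical column.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Each category becomes its own 0/1 column.
encoded = pd.get_dummies(df, columns=["color"], dtype=int)
print(encoded)
```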
One-shot learning
One-shot learning is a model’s ability to learn information about a task from a single training example. It is particularly useful in scenarios with limited data availability.
Ordinal variable
An ordinal variable is a categorical variable with a clear ordering among its categories. Examples include education level, satisfaction rating, and income brackets.
Outlier
An outlier is a data point that significantly differs from other observations. Outliers can indicate anomalies in measurement or experimental errors, and they often require special handling in analysis.
Overfitting
Overfitting is a situation in which a model learns the training data too well, capturing noise and anomalies instead of the underlying pattern. This results in poor performance on new, unseen data.
P
Pandas
Pandas is a powerful data manipulation and analysis library for Python. It provides data structures like DataFrames and series for handling structured data with ease.
Parameters
Parameters are variables in a model that are learned from the training data. They define the model's function and are adjusted during the training process to minimize error.
Pattern recognition
Pattern recognition is the process of identifying patterns and regularities in data. It is a fundamental aspect of machine learning and is used in various applications like image and speech recognition.
Pearson correlation coefficient
The Pearson correlation coefficient (PCC) is the measurement of the linear relationship between two variables. It ranges from -1 to 1, with 1 indicating a perfect positive relationship and -1 a perfect negative relationship.
Pie chart
A pie chart is a circular statistical graphic divided into slices to illustrate numerical proportions. Each slice represents a category's contribution to the whole.
Plotly
Plotly is an open-source graphing library that makes interactive, publication-quality graphs online. It supports a wide range of visualizations, including line charts, scatter plots, and 3D charts.
Poisson distribution
The Poisson distribution predicts how often rare events happen in a specific time or space, like the number of customer complaints a store might receive in a day.
Polynomial regression
Polynomial regression is a method to find patterns in data that aren't straight lines. It uses curved lines (like parabolas) to predict one value based on another, allowing for more complex relationships than simple straight-line predictions.
Pre-trained model
A pre-trained model is a machine learning model that has been previously trained on a large dataset and can be fine-tuned for specific tasks. It saves time and resources when training new models.
Precision
Precision is a metric used to evaluate the performance of a classification model. Also known as the positive predictive value, precision is the proportion of predicted positives that are actually positive (true positives divided by all positive predictions).
Predictive analytics
Predictive analytics is a process that uses statistical techniques and machine learning algorithms to analyze current and historical data to make predictions about future events and trends.
Predictive model
A predictive model is an algorithm that forecasts future outcomes using historical data, helping businesses anticipate trends, optimize strategies, and make informed decisions through statistical techniques and machine learning.
Predictor variable
A predictor variable is an independent variable used in regression analysis to predict the outcome of the dependent variable. It is also known as an explanatory variable.
Principal component analysis
Principal component analysis (PCA) is a dimensionality reduction technique that transforms data into a set of orthogonal components. It is used to reduce the complexity of data while preserving its variance.
Probability distribution
A probability distribution is a statistical function that describes all the possible values and probabilities for a random variable within a given range. There are two types of probability distributions: continuous probability distribution and discrete probability distribution.
Program
A program is a set of instructions that a computer follows to perform a specific task. It is written in a programming language and executed by the computer's processor.
Programming language
A programming language is a formal system of instructions used to create software. It provides a structured way to communicate complex commands to computers, resulting in specific outputs or behaviors. Examples include Python, Java, and C++.
P-value
P-value is a measure of the strength of evidence against the null hypothesis in a statistical test. A lower p-value indicates stronger evidence in favor of the alternative hypothesis.
Python
Python is a high-level, interpreted programming language known for its readability and versatility. It is widely used in data science, web development, automation, and scientific computing.
PyTorch
PyTorch is an open-source machine learning library based on the Torch library. It is widely used for applications such as deep learning research and natural language processing.
Q
Quartile
A quartile is a type of quantile that divides a ranked dataset into four equal parts. The first quartile (Q1) is the median of the lower half, and the third quartile (Q3) is the median of the upper half.
Q-Q plot
A Q-Q plot, or quantile-quantile plot, is a graphical tool to compare two probability distributions by plotting their quantiles against each other. It helps to assess whether a dataset follows a particular distribution.
R
R
R is a programming language and environment commonly used for statistical computing and graphics. It provides a wide variety of statistical techniques and graphical capabilities.
Random forest
Random forest is an ensemble learning method used for classification and regression. It operates by constructing multiple decision trees and combining their outputs for more accurate predictions.
Random sample
A random sample is a subset of individuals chosen from a larger set where each individual has an equal chance of being selected. It helps in obtaining a representative sample for statistical analysis.
Random variable
A random variable is a numerical representation of possible outcomes from an unpredictable event or process. It can be discrete (countable outcomes) or continuous (any value in a range).
Range
The range is the difference between the maximum and minimum values in a dataset. It provides a measure of the spread or dispersion of the data.
Recall
Recall is a metric used to evaluate the performance of a classification model. Also known as sensitivity, recall is the proportion of actual positives that the model correctly identifies (true positives divided by all actual positives).
Recommendation engine
A recommendation engine is a system that suggests products, services, or information to users based on analysis of data. It is widely used in e-commerce, streaming services, and social media.
Regression
Regression is a statistical technique that models relationships between a dependent variable and one or more independent variables to predict outcomes or forecast trends.
Regression spline
Regression spline is a regression analysis technique that fits piecewise polynomial functions to data. It provides flexibility in modeling non-linear relationships.
Regularization
Regularization is a technique used to prevent overfitting in machine learning models by adding a penalty to the loss function. Common regularization methods include lasso and ridge regression.
Reinforcement learning
Reinforcement learning is a type of machine learning where an agent learns to make decisions by performing actions and receiving rewards. Its goal is to maximize cumulative rewards over time.
Relational database
A relational database is a type of database that stores data in tables with rows and columns. It uses SQL for querying and managing data and ensures data integrity through relationships.
Retrieval augmented generation
Retrieval augmented generation (RAG) is a hybrid approach in natural language processing that combines retrieval-based and generation-based methods. It retrieves relevant information to augment the generation of more accurate and informative responses.
Resampling
Resampling is the process of drawing repeated samples from a dataset to assess the variability of a statistic. Techniques include bootstrapping and cross-validation, which are used to estimate accuracy and model performance.
Residuals
Residuals are the differences between observed and predicted values in a regression analysis. They help in diagnosing model fit and identifying potential outliers.
Response variable
The response variable is the dependent variable in a regression analysis. It is the variable that the model aims to predict or explain based on the independent variables.
Ridge regression
Ridge regression is a type of linear regression that includes a penalty term to shrink model coefficients. It helps to prevent overfitting and multicollinearity.
ROC-AUC
ROC-AUC is an acronym that stands for receiver operating characteristic – area under the curve. It is a performance measurement for classification models, indicating the model's ability to distinguish between classes.
ROC curve
The ROC curve is a graphical representation of a classification model's performance. It plots the true positive rate against the false positive rate at various threshold settings.
Root mean squared error
Root mean squared error (RMSE) is a measure of the differences between predicted and observed values in a regression analysis. It calculates the square root of the average squared differences.
Rotational invariance
Rotational invariance is the property of an algorithm to remain effective regardless of the rotation of the input data. It is important in image and pattern recognition tasks.
S
Sample
A sample is a subset of individuals or observations selected from a larger population. It is used to make inferences about the population without examining every member.
Sampling error
Sampling error is the error caused by observing a sample instead of the whole population. It reflects the difference between the sample statistic and the actual population parameter.
Scatter plot
A scatter plot is a type of data visualization that typically displays values for two variables in a set of data. It uses Cartesian coordinates to show the relationship between the variables.
Scikit-learn
Scikit-learn is an open-source machine learning library for Python. It provides simple and efficient tools for data mining and data analysis, including classification, regression, and clustering algorithms.
Seaborn
Seaborn is a Python visualization library based on Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
Semi-supervised learning
Semi-supervised learning is a type of machine learning that uses a combination of labeled and unlabeled data for training. It leverages the unlabeled data to improve model performance.
Skewness
Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable. Positive skewness indicates a distribution with a tail on the right, while negative skewness indicates a tail on the left.
SMOTE
SMOTE (synthetic minority over-sampling technique) is a technique for addressing class imbalance in datasets. It generates synthetic examples for the minority class to balance the class distribution.
Spatial-temporal reasoning
Spatial-temporal reasoning is the ability to reason about space (spatial) and time (temporal). It is an area of AI used in applications like video analysis, navigation systems, and environmental modeling.
Spearman rank correlation
Spearman rank correlation is a non-parametric measure of the strength and direction of association between two ranked variables. It assesses how well the relationship between variables can be described by a monotonic function.
SQL
Structured query language (SQL) is a standardized language for managing and manipulating relational databases. It provides commands for querying, updating, and managing data.
Standard deviation
Standard deviation is a measure of the dispersion of a dataset relative to its mean. It quantifies the amount of variation or spread in the data.
Standard error
Standard error is the standard deviation of the sampling distribution of a statistic, typically the mean. It measures the precision of the sample mean as an estimate of the population mean.
Standardization
Standardization is the process of scaling data to have a mean of zero and a standard deviation of one. It helps ensure that features measured on different scales contribute comparably during model training.
Statistics
Statistics is the science of collecting, analyzing, interpreting, and presenting data. It encompasses a wide range of techniques for making inferences about populations based on sample data.
Stratified sampling
Stratified sampling is a process that involves dividing a population into subgroups (strata) and taking a random sample from each stratum. It ensures that different segments of the population are adequately represented.
Stochastic gradient descent
Stochastic gradient descent (SGD) is an iterative optimization algorithm used for minimizing an objective function. It updates model parameters incrementally using a randomly selected subset of the data.
String
A string is a sequence of characters used to represent text. In programming, strings are a common data type used for storing and manipulating text.
Structured data
Structured data is data that adheres to a predefined format, making it easy to search, organize, and analyze. Examples include data in relational databases and spreadsheets.
Summary statistics
Summary statistics are descriptive statistics that quantitatively describe the main features of a dataset. They include measures like mean, median, mode, standard deviation, and range.
Sunburst chart
A sunburst chart is a visualization that represents hierarchical data using concentric circles. Each level of the hierarchy is represented by a ring, with the central circle representing the root.
Supervised learning
Supervised learning is a type of machine learning where the model is trained on labeled data. The goal is to learn a mapping from inputs to outputs, allowing the model to make predictions on new, unseen data.
Support vector machine
Support vector machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks. It works by finding the hyperplane that best separates different classes in the feature space.
Synthetic data
Synthetic data is artificially generated data that mimics the statistical properties of real-world data. It is used for testing and training machine learning models when real data is scarce or sensitive.
T
TensorFlow
TensorFlow is an open-source machine learning framework developed by Google. It is widely used for building and deploying deep learning models.
Time series analysis
Time series analysis involves analyzing data points collected or recorded at specific time intervals. It is used to identify patterns, trends, and seasonal variations in time-dependent data.
Tokenization
Tokenization is the process of breaking text into individual units called tokens, typically words or phrases. It is a fundamental step in natural language processing tasks.
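A very simple whitespace tokenizer sketched in plain Python; real NLP libraries use far more sophisticated rules:

```python
text = "Data science is fun"
tokens = text.lower().split()   # split on whitespace after lowercasing
print(tokens)  # ['data', 'science', 'is', 'fun']
```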
Training and testing
Training and testing are stages in the machine learning workflow. The training phase involves fitting a model to a dataset, while the testing phase evaluates the model's performance on new, unseen data.
Transfer learning
Transfer learning is a machine learning technique that applies knowledge from a pre-trained model to a new, related task, accelerating learning and improving performance. It is useful when data for the second task is limited.
True negative
A true negative is an outcome where the model correctly predicts the absence of a condition. It is used to evaluate the performance of classification models.
True positive
A true positive is an outcome where the model correctly predicts the presence of a condition. It is a key metric for assessing the accuracy of classification models.
T-test
A t-test is a statistical test used to determine if there is a significant difference between the means of two groups. It is commonly used in hypothesis testing.
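A minimal sketch with two made-up groups, assuming SciPy is installed:

```python
from scipy import stats

# Hypothetical measurements from two groups.
group_a = [23, 25, 27, 22, 26, 24]
group_b = [30, 31, 29, 32, 28, 33]

result = stats.ttest_ind(group_a, group_b)
print(result.statistic, result.pvalue)  # a small p-value suggests the means differ
```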
Type I error
A type I error is a statistical error that occurs when a true null hypothesis is incorrectly rejected. It is also known as a false positive, indicating that an effect is detected when none exists.
Type II error
A type II error is a statistical error that occurs when a false null hypothesis is not rejected. It is also known as a false negative, indicating that an effect is missed when it is actually present.
U
Underfitting
Underfitting occurs when a model is too simple to capture the underlying pattern in the data. It results in poor performance on both training and testing data.
Univariate analysis
Univariate analysis is a process that involves analyzing a single variable to summarize and find patterns. Techniques include calculating summary statistics and visualizing the data with histograms or box plots.
Unstructured data
Unstructured data is data that does not have a predefined format or organization. Examples include text, images, and audio files, which require special processing techniques to analyze.
Unsupervised learning
Unsupervised learning is a type of machine learning where the model is trained on unlabeled data. The goal is to discover hidden patterns or intrinsic structures within the data.
UDF
A user-defined function (UDF) is a custom function created by users to perform specific tasks not covered by standard functions in software or programming languages. In data science, UDFs let professionals tailor data analysis and manipulation to their specific needs.
V
Variance
Variance is a measure of the dispersion of a set of data points around their mean value. It indicates how much the values differ from the mean.
Vega-Altair
Vega-Altair is a Python library for declarative statistical visualization. It enables easy creation of interactive and informative data graphics.
Violin plot
A violin plot is a data visualization that combines aspects of box plots and density plots. It shows the distribution of the data across different categories.
W
Web scraping
Web scraping is the process of extracting data from websites. It involves fetching web pages and parsing the content to collect structured information.
X
XGBoost
XGBoost is an optimized gradient-boosting framework for machine learning. It is designed for speed and performance and is widely used in data science.
Z
Z-test
A z-test is a statistical test used to determine if there is a significant difference between sample and population means. It is used when the sample size is large, and the population variance is known.
Z-score
A z-score is a measurement in statistics that shows the number of standard deviations a data point is from the mean. It is used to standardize data and identify outliers.
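A quick sketch with made-up numbers, assuming NumPy is available:

```python
import numpy as np

data = np.array([12, 15, 14, 10, 18, 55])   # 55 looks extreme

z_scores = (data - data.mean()) / data.std()
print(z_scores.round(2))

# A common rule of thumb flags |z| greater than 2 or 3 as a potential outlier.
print(data[np.abs(z_scores) > 2])  # [55]
```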
Bookmark this comprehensive data science glossary so you can return to it as needed.
Familiarizing yourself with these essential terms and definitions will help you develop a broad data science knowledge base from which to grow. If you’re interested in learning more about becoming a data professional, KNIME can help you get started.