EXPLAINER

130 Data Science Terms Explained in Plain English

FARPOINT RESEARCH

Every field has its own set of jargon.

Biopsy, prognosis, embolism in healthcare.

Affidavit, verdict, and litigation in legal.

And just like every other field, Data Science has its own list of jargon. While 130 terms sound like a lot, you'll find that most are concepts you've already encountered.

This explainer should help with 3 things:

  1. Refresh your memory on things you already know
  2. Highlight the things you should or need to know
  3. Develop a better understanding of concepts that may have been confusing at first

Everything is arranged alphabetically so feel free to save and come back to this as a resource.

A

1. A/B Testing: A statistical method used to compare two versions of a product, webpage, or strategy to determine which performs better.
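
As a quick, hedged illustration (the counts below are made up, not from this article), a basic A/B comparison of two conversion rates can be run as a chi-square test of independence with SciPy:

```python
# Hypothetical example: compare conversion rates of page variants A and B.
# The visitor and conversion counts are invented for illustration.
from scipy.stats import chi2_contingency

# rows = variant, columns = [converted, did not convert]
table = [[120, 880],   # variant A: 120 conversions out of 1,000 visitors
         [150, 850]]   # variant B: 150 conversions out of 1,000 visitors

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"p-value = {p_value:.4f}")  # a small p-value suggests the variants really differ
```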

2. Accuracy: The measure of how often a classification model correctly predicts outcomes among all instances it evaluates.

3. Adaboost: An ensemble learning algorithm that combines weak classifiers to create a strong classifier.

4. Algorithm: A step-by-step set of instructions or rules followed by a computer to solve a problem or perform a task.

5. Analytics: The process of interpreting and examining data to extract meaningful insights.

6. Anomaly Detection: Identifying unusual patterns or outliers in data.

7. ANOVA (Analysis of Variance): A statistical method used to analyze the differences among group means in a sample.

8. API (Application Programming Interface): A set of rules that allows one software application to interact with another.

9. AUC-ROC (Area Under the ROC Curve): A metric that tells us how well a classification model is doing overall, considering different ways of deciding what counts as a positive or negative prediction.
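
To see the metric in action, here is a minimal sketch using scikit-learn's roc_auc_score on a handful of made-up labels and predicted probabilities:

```python
# Hypothetical ground-truth labels and predicted probabilities, for illustration only.
from sklearn.metrics import roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 1, 0]                    # actual classes
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.3]  # predicted probability of class 1

print("AUC-ROC:", roc_auc_score(y_true, y_score))  # 1.0 = perfect ranking, 0.5 = random guessing
```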

B

10. Batch Gradient Descent: An optimization algorithm that updates model parameters using the gradient computed over the entire training dataset at each step (in contrast to mini-batch gradient descent, which uses a small subset per step).

11. Bayesian Statistics: A statistical approach that combines prior knowledge with observed data.

12. BI (Business Intelligence): Technologies, processes, and tools that help organizations make informed business decisions.

13. Bias: A systematic error that causes a model's predictions to consistently deviate from the true values.

14. Bias-Variance Tradeoff: The balance between the error introduced by bias and variance in a model.

15. Big Data: Large and complex datasets that cannot be easily processed using traditional data processing methods.

16. Binary Classification: Categorizing data into two groups, such as spam or not spam.

17. Bootstrap Sampling: A resampling technique where random samples are drawn with replacement from a dataset.
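
A minimal sketch of the idea, assuming a small made-up sample and NumPy: resample with replacement many times, then use the spread of the resampled means as a rough confidence interval.

```python
# Bootstrap a 95% confidence interval for the mean of a small (invented) sample.
import numpy as np

rng = np.random.default_rng(42)
data = np.array([4.2, 5.1, 6.3, 5.8, 4.9, 6.1, 5.5, 4.7])  # hypothetical observations

boot_means = [rng.choice(data, size=len(data), replace=True).mean()
              for _ in range(10_000)]

lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap CI for the mean: ({lo:.2f}, {hi:.2f})")
```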

C

18. Categorical Data: Variables that represent categories or groups and can take on a limited, fixed number of distinct values.

19. Chi-Square Test: A statistical test used to determine if there is a significant association between two categorical variables.

20. Classification: Categorizing data points into predefined classes or groups.

21. Clustering: Grouping similar data points together based on certain criteria.

22. Confidence Interval: A range of values used to estimate the true value of a parameter with a certain level of confidence.

23. Confusion Matrix: A table used to evaluate the performance of a classification algorithm.

24. Correlation: A statistical measure that describes the degree of association between two variables.

25. Covariance: A measure of how much two random variables change together.

26. Cross-Entropy Loss: A loss function commonly used in classification problems.

27. Cross-Validation: A technique to assess the performance of a model by splitting the data into multiple subsets for training and testing.
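
For example, a 5-fold cross-validation in scikit-learn can be as short as the sketch below (the iris toy dataset and logistic regression are just placeholder choices):

```python
# Score a model on 5 different train/test splits instead of a single one.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)  # accuracy on each of the 5 held-out folds
print(scores, "mean:", scores.mean())
```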

D

28. Data Cleaning: The process of identifying and correcting errors or inconsistencies in datasets.

29. Data Mining: Extracting valuable patterns or information from large datasets.

30. Data Preprocessing: Cleaning and transforming raw data into a format suitable for analysis.

31. Data Visualization: Presenting data in graphical or visual formats to aid understanding.

32. Decision Boundary: The dividing line that separates different classes in a classification problem.

33. Decision Tree: A tree-like model that makes decisions based on a set of rules.

34. Dimensionality Reduction: Reducing the number of features in a dataset while retaining important information.

E

35. Eigenvalue and Eigenvector: Concepts used in linear algebra, often employed in dimensionality reduction to transform and simplify complex datasets.

36. Elastic Net: A regularization technique that combines L1 and L2 penalties.

37. Ensemble Learning: Combining multiple models to improve overall performance and accuracy.

38. Exploratory Data Analysis (EDA): Analyzing and visualizing data to understand its characteristics and relationships.

F

39. F1 Score: A metric that combines precision and recall in classification models.

40. False Positive and False Negative: Incorrect predictions in binary classification.

41. Feature: A data column (input variable) used by an ML model to make predictions.

42. Feature Engineering: Creating new features from existing ones to improve model performance.

43. Feature Extraction: Reducing the dimensionality of data by selecting important features.

44. Feature Importance: Assessing the contribution of each feature to the model’s predictions.

45. Feature Selection: Choosing the most relevant features for a model.

G

46. Gaussian Distribution: A type of probability distribution often used in statistical modeling.

47. Geospatial Analysis: Analyzing and interpreting patterns and relationships within geographic data.

48. Gradient Boosting: An ensemble learning technique where weak models are trained sequentially, each correcting the errors of the previous one.

49. Gradient Descent: An optimization algorithm used to minimize the error in a model by adjusting its parameters.
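
A bare-bones sketch of the update rule, using NumPy and made-up data for a one-feature linear regression (this is illustrative, not a production optimizer):

```python
# Fit y ≈ w*x + b by repeatedly stepping against the gradient of the mean squared error.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.1, 6.2, 8.0, 9.9])    # roughly y = 2x

w, b, lr = 0.0, 0.0, 0.01                  # parameters and learning rate
for _ in range(2000):
    y_pred = w * x + b
    grad_w = 2 * np.mean((y_pred - y) * x) # d(MSE)/dw
    grad_b = 2 * np.mean(y_pred - y)       # d(MSE)/db
    w -= lr * grad_w
    b -= lr * grad_b

print(f"w ≈ {w:.2f}, b ≈ {b:.2f}")         # should end up near w = 2, b = 0
```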

50. Grid Search: A method for tuning hyperparameters by evaluating models at all possible combinations.
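
In scikit-learn this is usually done with GridSearchCV; the parameter grid and toy dataset below are arbitrary choices for illustration:

```python
# Try every combination of C and kernel, each scored with 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}  # 6 combinations in total

search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```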

H

51. Heteroscedasticity: Unequal variability of errors in a regression model.

52. Hierarchical Clustering: A method of cluster analysis that organizes data into a tree-like structure of clusters, where each level of the tree shows the relationships and similarities between different groups of data points.

53. Hyperparameter: A parameter whose value is set before the training process begins.

54. Hypothesis Testing: A statistical method to test a hypothesis about a population parameter based on sample data.

I

55. Imputation: Filling in missing values in a dataset using various techniques.

56. Inferential Statistics: A branch of statistics that involves making inferences about a population based on a sample of data.

57. Information Gain: A measure used in decision trees to assess the effectiveness of a feature in classifying data.

58. Interquartile Range (IQR): A measure of statistical dispersion, representing the range between the first and third quartiles.
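
The IQR is also the basis of the common 1.5 × IQR rule of thumb for flagging outliers; here is a small NumPy sketch on invented numbers:

```python
# Flag points that fall more than 1.5 * IQR outside the middle 50% of the data.
import numpy as np

data = np.array([12, 14, 14, 15, 16, 17, 18, 19, 21, 45])  # 45 looks suspicious

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print("IQR:", iqr, "outliers:", data[(data < lower) | (data > upper)])
```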

J

59. Joint Plot: A type of data visualization in Seaborn used for exploring relationships between two variables and their individual distributions.

60. Joint Probability: The probability of two or more events happening at the same time, often used in statistical analysis.

61. Jupyter Notebook: An open-source web application for creating and sharing documents containing live code, equations, visualizations, and narrative text.

K

62. K-Means Clustering: A popular algorithm for partitioning a dataset into distinct, non-overlapping subsets.
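
A minimal scikit-learn sketch on two synthetic blobs of points (the data is generated just for this example):

```python
# Cluster 100 synthetic 2-D points into k = 2 groups.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),    # blob around (0, 0)
               rng.normal(5, 0.5, size=(50, 2))])   # blob around (5, 5)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # centers should land near (0, 0) and (5, 5)
```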

63. K-Nearest Neighbors (KNN): A simple and widely used classification algorithm based on how close a new data point is to other data points.

L

64. L1 Regularization (Lasso): Adding the absolute values of coefficients as a penalty term to the loss function.

65. L2 Regularization (Ridge): Adding the squared values of coefficients as a penalty term to the loss function.
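
To see the practical difference between the two penalties, here is a hedged sketch using scikit-learn's Lasso (L1) and Ridge (L2) on made-up data where only two of five features actually matter:

```python
# L1 tends to push irrelevant coefficients all the way to zero; L2 only shrinks them.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)  # features 3-5 are noise

print("Lasso coefficients:", Lasso(alpha=0.1).fit(X, y).coef_.round(3))
print("Ridge coefficients:", Ridge(alpha=1.0).fit(X, y).coef_.round(3))
```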

66. Linear Regression: A statistical method for modeling the relationship between a dependent variable and one or more independent variables.

67. Log Likelihood: The logarithm of the likelihood function, often used in maximum likelihood estimation.

68. Logistic Function: A sigmoid function used in logistic regression to model the probability of a binary outcome.

69. Logistic Regression: A statistical method for predicting the probability of a binary outcome.

M

70. Machine Learning: A subset of artificial intelligence that enables systems to learn and make predictions from data.

71. Mean Absolute Error (MAE): A measure of the average absolute differences between predicted and actual values.

72. Mean Squared Error (MSE): A measure of the average squared difference between predicted and actual values.

73. Mean: The average value of a set of numbers.

74. Median: The middle value in a set of sorted numbers.

75. Metrics: Criteria used to assess the performance of a machine learning model, such as accuracy, precision, recall, and F1 score.

76. Model Evaluation: Assessing the performance of a machine learning model using various metrics.

77. Multicollinearity: The presence of a high correlation between independent variables in a regression model.

78. Multi-Label Classification: Assigning multiple labels to an input, as opposed to just one.

79. Multivariate Analysis: Analyzing data with multiple variables to understand relationships between them.

N

80. Naive Bayes: A probabilistic algorithm based on Bayes’ theorem used for classification.

81. Normalization: Scaling numerical variables to a standard range.

82. Null Hypothesis: A statistical hypothesis that assumes there is no significant difference between observed and expected results.

O

83. One-Hot Encoding: A technique to convert categorical variables into a binary matrix for machine learning models.
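
With pandas this is a one-liner via get_dummies; the tiny DataFrame below is made up for illustration:

```python
# Turn a single categorical column into one indicator column per category.
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
encoded = pd.get_dummies(df, columns=["color"])
print(encoded)   # columns color_blue, color_green, color_red with 0/1 (or True/False) flags
```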

84. Ordinal Variable: A categorical variable with a meaningful order but not necessarily equal intervals.

85. Outlier: An observation that deviates significantly from other observations in a dataset.

86. Overfitting: A model that performs well on the training data but poorly on new, unseen data.

P

87. Pandas: The standard Python library for manipulating and analyzing structured (tabular) data.

88. Pearson Correlation Coefficient: A measure of the linear relationship between two variables.

89. Poisson Distribution: A discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space.

90. Precision: The ratio of true positive predictions to the total number of positive predictions made by a classification model.

91. Predictive Analytics: Using data, statistical algorithms, and machine learning techniques to identify the likelihood of future outcomes.

92. Principal Component Analysis (PCA): A dimensionality reduction technique that transforms data into a new framework of features, simplifying the information while preserving its fundamental patterns.

93. Principal Component: The axis that captures the most variance in a dataset in principal component analysis.
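
As a quick illustration of the two entries above, here is a scikit-learn sketch that projects the 4-dimensional iris dataset onto its first two principal components (the dataset choice is arbitrary):

```python
# Reduce 4 features to 2 principal components and check how much variance they keep.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)                     # (150, 2)
print(pca.explained_variance_ratio_)  # share of total variance captured by each component
```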

94. P-value: The probability of obtaining a result as extreme as, or more extreme than, the observed result during hypothesis testing.

Q

95. Q-Q Plot (Quantile-Quantile Plot): A graphical tool to assess if a dataset follows a particular theoretical distribution.

96. Quantile: A cut point that divides a dataset or distribution into intervals containing equal proportions of the data.

R

97. Random Forest: An ensemble learning method that constructs a multitude of decision trees and merges them together for more accurate and stable predictions.

98. Random Sample: A sample where each member of the population has an equal chance of being selected.

99. Random Variable: A variable whose possible values are outcomes of a random phenomenon.

100. Recall: The ratio of true positive predictions to the total number of actual positive instances in a classification model.
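
Precision, recall, and the F1 score that combines them can all be computed with scikit-learn; the labels and predictions below are invented for illustration:

```python
# Compare made-up predictions against made-up ground truth.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1:       ", f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```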

101. Regression Analysis: A statistical method used for modeling the relationship between a dependent variable and one or more independent variables.

102. Regularization: Adding a penalty term to the cost function to prevent overfitting in machine learning models.

103. Resampling: Techniques like bootstrapping or cross-validation to assess the performance of a model.

104. ROC Curve (Receiver Operating Characteristic Curve): A graphical representation of the trade-off between true positive rate and false positive rate for different thresholds in a classification model.

105. Root Mean Square Error (RMSE): A measure of the difference between predicted and actual values.

106. R-squared: A statistical measure that represents the proportion of the variance in the dependent variable explained by the independent variables in a regression model.

S

107. Sampling Bias: A bias in the selection of participants or data points that may affect the generalizability of results.

108. Sampling: The process of selecting a subset of data points from a larger dataset.

109. Scalability: The ability of a system to handle increasing amounts of data or workload.

110. Sigmoid Function: An S-shaped function that maps any real number to a value between 0 and 1, commonly used to turn model outputs into probabilities in binary classification.

111. Silhouette Score: A metric used to evaluate the quality of a clustering, based on how similar each point is to its own cluster compared with other clusters.

112. Singular Value Decomposition (SVD): A matrix factorization technique used in dimensionality reduction.

113. Spearman Rank Correlation: A non-parametric measure of correlation between two variables.

114. Standard Deviation: A measure of the amount of variation or dispersion in a set of values.

115. Stationarity: A property of time series data where statistical properties remain constant over time.

116. Stratified Sampling: A sampling method that ensures proportional representation of subgroups within a population.

117. Supervised Learning: Learning from labeled data where the algorithm is trained on a set of input-output pairs.

118. Support Vector Machine (SVM): A supervised machine learning algorithm used for classification and regression analysis.

T

119. t-Distribution: A probability distribution used in hypothesis testing when the sample size is small or the population standard deviation is unknown.

120. Time Series Analysis: Analyzing data collected over time to identify patterns and trends.

121. t-test: A statistical test used to determine if there is a significant difference between the means of two groups.

122. Two-sample t-test: A statistical test used to compare the means of two independent samples.
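
A sketch with SciPy on two small made-up groups (Welch's version, which does not assume equal variances):

```python
# Test whether the two groups plausibly share the same mean.
from scipy.stats import ttest_ind

group_a = [23.1, 21.4, 25.0, 22.8, 24.3, 23.7]
group_b = [26.2, 27.5, 25.9, 28.1, 26.8, 27.0]

t_stat, p_value = ttest_ind(group_a, group_b, equal_var=False)  # Welch's t-test
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # a small p suggests the means differ
```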

U

123. Underfitting: A model that is too simple to capture the underlying patterns in the data.

124. Univariate Analysis: Analyzing the variation of a single variable in the dataset.

125. Unsupervised Learning: Learning from unlabeled data where the algorithm identifies patterns and relationships on its own.

V

126. Validation Set: A subset of data used to assess the performance of a model during training.

127. Variance: The degree of spread or dispersion in a set of values; in machine learning, it also refers to how much a model's predictions change when it is trained on different samples of data.

X

128. XGBoost: An open-source library for gradient-boosted decision trees designed for speed and performance.

Z

129. Zero-shot Learning: Having a model perform a task for which it has seen no labeled training examples, relying instead on knowledge learned from related tasks.

130. Z-Score: A standardized score that represents the number of standard deviations a data point is from the mean.
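
Z-scores are easy to compute by hand; here is a NumPy sketch on a handful of invented exam scores:

```python
# Standardize values: subtract the mean, divide by the standard deviation.
import numpy as np

data = np.array([62.0, 70.0, 74.0, 80.0, 95.0])
z = (data - data.mean()) / data.std()   # how many standard deviations each value is from the mean
print(z.round(2))
```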