Descriptive Statistics with Python

Overview

This repository contains a comprehensive implementation of fundamental Descriptive Statistics concepts using Python. The notebook explores statistical measures, probability distributions, data visualization techniques, outlier detection methods, and relationship analysis between variables.

The project is designed to provide both theoretical understanding and practical implementation of statistical concepts that serve as the foundation for Data Science, Machine Learning, and Data Analysis.

Objectives

Understand the structure and characteristics of datasets.
Summarize data using statistical measures.
Detect and analyze outliers.
Visualize distributions and patterns.
Explore probability distributions.
Analyze relationships between variables.
Build intuition for data preprocessing in Machine Learning.

Recommended Prerequisites

To gain the most value from this notebook, readers should have basic familiarity with:

Python Programming
NumPy
Pandas
Basic Probability
Elementary Algebra
Data Visualization Concepts

No prior Machine Learning knowledge is required.

Learning Outcomes

After completing this notebook, readers will be able to:

Compute and interpret measures of central tendency
Analyze data dispersion using variance and standard deviation
Understand quartiles and the five-number summary
Detect outliers using Z-Score and IQR methods
Visualize distributions using histograms and KDE plots
Interpret PDF and CDF curves
Explore Normal, Log-Normal, and Pareto distributions
Analyze relationships using covariance and correlation
Apply descriptive statistics to real-world datasets such as Iris

Why Descriptive Statistics Matters in Machine Learning

Descriptive statistics forms the foundation of Exploratory Data Analysis (EDA). Before training machine learning models, data scientists use descriptive statistical techniques to:

Understand data distributions
Detect anomalies and outliers
Identify skewness and variability
Explore feature relationships
Guide feature engineering decisions
Improve model reliability

A strong understanding of descriptive statistics is essential for effective data preprocessing and machine learning workflows.

Topics Covered

1. Measures of Central Tendency

Measures of central tendency describe the central or typical value of a dataset.

Mean

The arithmetic average of observations.

Properties:

Uses all observations.
Sensitive to outliers.
Most commonly used measure of center.

Median

The middle value of an ordered dataset.

Properties:

Robust against outliers.
Suitable for skewed distributions.

Mode

The most frequently occurring observation.

Properties:

Can be used for both numerical and categorical data.
A dataset may have multiple modes.
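
The three measures above can be compared on a small sample. The sketch below uses Python's standard-library `statistics` module, which is not necessarily what the notebook uses, and the values are made up for illustration. It shows how a single extreme value pulls the mean upward while the median stays put.

```python
import statistics

# Hypothetical sample with one extreme value (illustrative data only).
values = [2, 3, 3, 4, 5, 6, 100]

mean_value = statistics.mean(values)      # uses every observation
median_value = statistics.median(values)  # middle value of the sorted data
modes = statistics.multimode(values)      # list of all most frequent values

print("mean:", round(mean_value, 2))
print("median:", median_value)
print("mode(s):", modes)
```

`multimode` returns a list because, as noted above, a dataset may have more than one mode.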

2. Measures of Dispersion

Dispersion measures quantify the spread of data around a central value.

Range

[ Range = Maximum - Minimum ]

Variance

Measures the average squared deviation from the mean.

Standard Deviation

Square root of variance.

Properties:

Expressed in the same unit as the data.
Indicates consistency or variability.

Coefficient of Variation

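The ratio of the standard deviation to the mean. Because it is unitless, it is used for comparing variability across datasets with different units or scales.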
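
A minimal sketch of these dispersion measures with NumPy, on illustrative values. It assumes the sample versions of variance and standard deviation (dividing by n - 1, selected with `ddof=1`); NumPy's default is the population version (`ddof=0`).

```python
import numpy as np

# Illustrative values only.
data = np.array([10.0, 12.0, 14.0, 16.0, 18.0])

data_range = data.max() - data.min()   # Maximum - Minimum
sample_variance = data.var(ddof=1)     # average squared deviation, dividing by n - 1
sample_std = data.std(ddof=1)          # square root of the sample variance
cv = sample_std / data.mean()          # coefficient of variation (unitless)

print("range:", data_range)
print("variance:", sample_variance)
print("standard deviation:", round(sample_std, 4))
print("coefficient of variation:", round(cv, 4))
```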

3. Five Number Summary

A concise summary describing distribution characteristics.

Components:

Minimum
First Quartile (Q1)
Median (Q2)
Third Quartile (Q3)
Maximum

Provides insights into spread, skewness, and outliers.
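
As a rough illustration, the five components can be computed in one call to `np.percentile`. The data are made up, and the quartiles use NumPy's default linear interpolation; other quartile conventions can give slightly different values on small samples.

```python
import numpy as np

# Illustrative values only.
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# 0th, 25th, 50th, 75th and 100th percentiles.
minimum, q1, median, q3, maximum = np.percentile(data, [0, 25, 50, 75, 100])

print("min:", minimum, "Q1:", q1, "median:", median, "Q3:", q3, "max:", maximum)
```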

4. Box Plot Analysis

Box plots visualize distribution characteristics using quartiles.

Features:

Median
Interquartile Range (IQR)
Potential outliers
Data spread
Skewness

Interquartile Range

[ IQR = Q3 - Q1 ]

Outlier Boundaries:

[ Lower = Q1 - 1.5(IQR) ]

[ Upper = Q3 + 1.5(IQR) ]
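
The boundaries above can be sketched in a few lines of NumPy. The sample is hypothetical, and the quartiles again rely on NumPy's default interpolation rule.

```python
import numpy as np

# Hypothetical sample with one unusually large value.
data = np.array([10, 12, 13, 14, 15, 16, 18, 50])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr   # lower fence
upper = q3 + 1.5 * iqr   # upper fence

# Observations outside the fences are flagged as potential outliers.
outliers = data[(data < lower) | (data > upper)]

print("IQR:", iqr, "fences:", (lower, upper), "outliers:", outliers)
```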

5. Outlier Detection

Outliers are observations significantly different from the majority of data.

Z-Score Method

Measures the distance of an observation from the mean in standard deviation units. An observation whose absolute z-score is 3 or more, that is, at least 3 standard deviations on either side of the mean, is considered an outlier.

Advantages

Simple and computationally efficient.
Works well for approximately normal distributions.

Limitations

The mean and standard deviation are themselves sensitive to extreme values.
May perform poorly on highly skewed distributions.
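
A minimal sketch of the Z-Score method, assuming a threshold of 3 and the population standard deviation (NumPy's default). The data are made up: twenty values close to 50 plus one planted extreme value.

```python
import numpy as np

# Hypothetical data: twenty values near 50 and one extreme value.
data = np.array([48, 49, 50, 51, 52] * 4 + [100], dtype=float)

# Distance from the mean in standard deviation units.
z_scores = (data - data.mean()) / data.std()

# Flag observations at least 3 standard deviations from the mean.
outliers = data[np.abs(z_scores) >= 3]

print("largest |z|:", round(np.abs(z_scores).max(), 2))
print("outliers:", outliers)
```

Note that the extreme value itself inflates the standard deviation used in the calculation, which is one reason the method can miss outliers in small or heavily contaminated samples.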

IQR Method

Based on quartiles and resistant to extreme values. Suitable for skewed datasets.

Interpretation

Values within the fences are considered normal observations.
Values outside the fences are considered potential outliers.
Box plots visually represent these outliers as individual points beyond the whiskers.

Advantages

Robust to extreme values
Does not assume normal distribution
Widely used in Exploratory Data Analysis (EDA)

6. Histogram

A graphical representation of frequency distribution.

Applications:

Understanding shape of data.
Detecting skewness.
Identifying multiple peaks.
Detecting potential outliers.
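
To show what a histogram actually counts, the sketch below bins a small made-up sample with `np.histogram` and prints a text version of the bars. The bin count and range are arbitrary choices for this example; a plotting library would normally draw the same counts as bars.

```python
import numpy as np

# Illustrative values only.
data = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])

# Four equal-width bins between 1 and 9.
counts, bin_edges = np.histogram(data, bins=4, range=(1, 9))

# Text rendering: one '#' per observation in each bin.
# NumPy treats the last bin as closed on the right, so it includes the value 9.
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {'#' * int(count)}")
```

The isolated bar at the right end is the kind of pattern that hints at a potential outlier.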

7. Probability Density Function (PDF)

Describes the relative likelihood of a continuous random variable taking a value near a given point.

Interpretation

PDF values themselves are not probabilities.
Probability is represented by the area under the curve.
Larger density indicates a higher likelihood of observing values in that region.

Applications

Distribution analysis
Statistical modeling
Machine Learning
Risk analysis
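
The first two interpretation points can be checked numerically. This sketch uses `statistics.NormalDist` from the standard library with an arbitrarily chosen narrow normal distribution (mean 0, standard deviation 0.1). The density at the peak exceeds 1, yet the area under the curve over an interval is still a valid probability.

```python
from statistics import NormalDist

# Arbitrary narrow normal distribution chosen for illustration.
narrow = NormalDist(mu=0.0, sigma=0.1)

# A density value, not a probability: it is larger than 1 here.
peak_density = narrow.pdf(0.0)

# Probability of falling in [-0.1, 0.1], approximated as the area under
# the PDF with a midpoint Riemann sum.
a, b, steps = -0.1, 0.1, 1000
width = (b - a) / steps
area = sum(narrow.pdf(a + (i + 0.5) * width) for i in range(steps)) * width

# The same probability obtained from the CDF, for comparison.
exact = narrow.cdf(b) - narrow.cdf(a)

print("peak density:", round(peak_density, 3))
print("area under curve:", round(area, 4), "from CDF:", round(exact, 4))
```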


8. Cumulative Distribution Function (CDF)

Represents cumulative probability up to a value.

Properties:

Monotonically non-decreasing.
Ranges from 0 to 1.

Interpretation

CDF gives cumulative probability up to a point.
The value of the CDF is always between 0 and 1.
It represents the proportion of observations less than or equal to a given value.

Applications

Percentile calculations
Probability estimation
Statistical inference
Reliability analysis
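
For observed data, the CDF is usually estimated by the empirical CDF: the proportion of observations less than or equal to a given value. A minimal sketch follows; `empirical_cdf` is a helper name introduced here for illustration, and the data are made up.

```python
import numpy as np

# Illustrative values only.
data = np.array([3, 1, 4, 1, 5, 9, 2, 6])

def empirical_cdf(sample, x):
    """Proportion of observations less than or equal to x."""
    return float(np.mean(sample <= x))

for x in (0, 4, 9):
    print(f"F({x}) = {empirical_cdf(data, x)}")
```

Evaluating the function at increasing values of x never produces a smaller result, and the output always stays between 0 and 1, matching the properties listed above.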

9. Kernel Density Estimation (KDE)

A non-parametric technique used to estimate probability density.

Advantages:

Smooth alternative to histograms.
Better visualization of distribution shape.
Useful for identifying multimodal behavior.
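
To make the idea concrete, here is a hand-rolled Gaussian KDE in NumPy: the estimate at each grid point is the average of Gaussian bumps centred on the observations. The function name, the sample, and the bandwidth of 0.5 are assumptions for this sketch; in practice a library routine such as `scipy.stats.gaussian_kde` or `seaborn.kdeplot` would normally be used and would choose the bandwidth automatically.

```python
import numpy as np

def gaussian_kde_1d(sample, grid, bandwidth):
    """Average of Gaussian kernels centred on each observation."""
    diffs = (grid[:, None] - sample[None, :]) / bandwidth
    kernels = np.exp(-0.5 * diffs ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# Hypothetical sample with two clusters.
sample = np.array([1.0, 1.2, 1.5, 4.8, 5.0, 5.3])
grid = np.linspace(-2, 9, 1101)
density = gaussian_kde_1d(sample, grid, bandwidth=0.5)

# The estimated density should integrate to roughly 1.
area = density.sum() * (grid[1] - grid[0])
print("area under estimate:", round(area, 3))
```

With this sample the estimate has two peaks, one near each cluster, which is the multimodal behaviour a single summary statistic would hide.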

10. Iris Dataset Analysis

The Iris dataset is one of the most widely used datasets in Data Science.

Features:

Sepal Length
Sepal Width
Petal Length
Petal Width

Classes:

Setosa
Versicolor
Virginica

Applications:

Exploratory Data Analysis
Classification
Statistical Visualization

Probability Distributions

11. Normal Distribution

A symmetric bell-shaped distribution.

Characteristics:

Mean = Median = Mode
Symmetric around the mean
Foundation of many statistical methods
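
These characteristics can be verified with `statistics.NormalDist`. The parameters below (mean 5, standard deviation 2) are arbitrary. The last line also shows why a z-score of 3 is a common outlier threshold: nearly all of the probability mass lies within 3 standard deviations of the mean.

```python
from statistics import NormalDist

# Arbitrary parameters chosen for illustration.
dist = NormalDist(mu=5.0, sigma=2.0)

# Symmetry: equal density at equal distances on either side of the mean.
left_density = dist.pdf(dist.mean - 1.5)
right_density = dist.pdf(dist.mean + 1.5)

# The median (50th percentile) coincides with the mean.
median = dist.inv_cdf(0.5)

# Probability mass within 3 standard deviations of the mean.
within_3_sd = dist.cdf(dist.mean + 3 * dist.stdev) - dist.cdf(dist.mean - 3 * dist.stdev)

print("densities:", round(left_density, 5), round(right_density, 5))
print("median:", median, "mean:", dist.mean)
print("within 3 SD:", round(within_3_sd, 4))
```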

12. Log-Normal Distribution

A variable follows a log-normal distribution if its logarithm follows a normal distribution.

Applications:

Income distributions
Biological measurements
Financial data

Characteristics:

Positively skewed
Strictly positive values
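
A small simulation illustrates the definition. The sketch draws log-normal values with NumPy (the seed, sample size, and parameters are arbitrary), then checks that the logarithm of the sample behaves like a normal distribution and that the raw sample is right-skewed, with the mean above the median.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Underlying normal has mean 0 and standard deviation 1 (arbitrary choice).
sample = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)
log_sample = np.log(sample)

print("smallest value:", sample.min())
print("mean vs median:", round(sample.mean(), 3), round(float(np.median(sample)), 3))
print("log-sample mean and SD:", round(log_sample.mean(), 3), round(log_sample.std(), 3))
```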

13. Pareto Distribution

Models situations where a small percentage contributes to a large proportion of outcomes.

Commonly associated with the 80-20 Principle (Pareto principle).

Examples:

Wealth distribution
Business sales
Website traffic

Characteristics:

Heavy-tailed distribution
Significant extreme events
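
The concentration effect can be seen in a simulation. This sketch assumes a shape parameter of log(5)/log(4), about 1.16, which is the value for which the theoretical split is exactly 80-20, and a minimum value of 1. NumPy's `Generator.pareto` samples the Lomax (Pareto II) form, so the code shifts it to obtain the classical Pareto. Because the tail is so heavy, the share computed from a finite sample fluctuates from run to run and will not match 80% exactly.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the run is repeatable

alpha = np.log(5) / np.log(4)    # shape tied to a theoretical 80-20 split
x_min = 1.0                      # assumed minimum value

# Shift NumPy's Lomax draws to the classical Pareto with minimum x_min.
sample = (rng.pareto(alpha, size=100_000) + 1.0) * x_min

# Share of the total held by the largest 20% of observations.
sorted_desc = np.sort(sample)[::-1]
top_count = len(sample) // 5
top_share = sorted_desc[:top_count].sum() / sample.sum()

print("share held by top 20%:", round(top_share, 3))
print("mean vs median:", round(sample.mean(), 2), round(float(np.median(sample)), 2))
```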

Distribution Visualizations

The notebook includes graphical representations of:

Normal Distribution
Log-Normal Distribution
Pareto Distribution

These visualizations help understand:

Symmetry
Skewness
Tail behavior
Probability density characteristics

Relationship Analysis

14. Covariance

Measures the direction of the linear relationship between two variables.

[ Cov(X,Y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) ]

Interpretation:

Positive → variables increase together.
Negative → inverse relationship.
Near zero → weak linear relationship.
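
A short sketch of the sample covariance (dividing by n - 1), computed by hand and with `np.cov`. The variable names and values are hypothetical.

```python
import numpy as np

# Hypothetical paired observations.
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
scores = np.array([52.0, 55.0, 61.0, 64.0, 68.0])

# Average product of deviations from each mean, dividing by n - 1.
manual_cov = np.sum((hours - hours.mean()) * (scores - scores.mean())) / (len(hours) - 1)

# np.cov returns the 2x2 covariance matrix; the off-diagonal entry is Cov(X, Y).
numpy_cov = np.cov(hours, scores)[0, 1]

print("manual:", manual_cov, "numpy:", numpy_cov)
```

The result is positive because the two variables increase together. Its size depends on the units of both variables, which is why correlation is usually preferred for judging strength.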

15. Pearson Correlation

Measures linear association.

Interpretation:

+1 : Perfect positive relationship
0 : No linear relationship
-1 : Perfect negative relationship
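
Pearson correlation is the covariance rescaled by both standard deviations, which confines it to the range -1 to +1. The sketch below uses made-up data and `np.corrcoef`; the second pair is an exact decreasing line, so its coefficient is -1 up to floating-point rounding.

```python
import numpy as np

# Hypothetical paired observations.
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
scores = np.array([52.0, 55.0, 61.0, 64.0, 68.0])

# Covariance divided by the product of the sample standard deviations.
manual_r = np.cov(hours, scores)[0, 1] / (hours.std(ddof=1) * scores.std(ddof=1))
r = np.corrcoef(hours, scores)[0, 1]

# An exact decreasing linear relationship.
r_negative = np.corrcoef(hours, -2.0 * hours + 3.0)[0, 1]

print("r:", round(r, 4), "manual:", round(manual_r, 4))
print("perfect negative:", round(r_negative, 4))
```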

16. Spearman Rank Correlation

Measures monotonic relationships using ranks.

Useful when:

Data is not normally distributed.
Relationship is monotonic but non-linear.
Outliers are present.
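
The difference from Pearson correlation shows up on a relationship that is monotonic but curved. In this sketch, Spearman correlation is computed as the Pearson correlation of the ranks. The `ranks` helper is introduced here for illustration and assumes there are no tied values; `scipy.stats.spearmanr` handles ties and would be the usual choice in practice.

```python
import numpy as np

def ranks(values):
    """Rank positions 0..n-1 (assumes no tied values)."""
    return np.argsort(np.argsort(values))

# Illustrative data: y grows exponentially with x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.exp(x)

pearson_r = np.corrcoef(x, y)[0, 1]
spearman_rho = np.corrcoef(ranks(x), ranks(y))[0, 1]

print("Pearson:", round(pearson_r, 3))
print("Spearman:", round(spearman_rho, 3))
```

Spearman correlation is 1 because the ordering of y matches the ordering of x exactly, while Pearson correlation is lower because the relationship is not a straight line.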

Libraries Used

NumPy
Pandas
Matplotlib
Seaborn
SciPy
Scikit-Learn

Project Structure

Descriptive_Statistics/
│
├── .gitignore
├── Descriptive_Stat.ipynb
├── LICENSE
├── README.md
├── iris.csv
├── requirements.txt
└── tips.csv

Future Enhancements

Inferential Statistics
Confidence Intervals
Hypothesis Testing
ANOVA
Chi-Square Tests
Probability Theory
Feature Engineering

Acknowledgment

Some statistical concepts and initial implementation ideas were inspired by publicly available educational notebooks, tutorials, and open learning resources. The documentation, explanations, code organization, analysis workflow, visualizations, and learning notes were independently developed and adapted for educational purposes as part of my learning journey in Statistics, Data Analysis, and Machine Learning.

Recommended Learning Resources

For readers interested in exploring these topics further, the following books are highly recommended:

Statistics & Probability

Practical Statistics for Data Scientists — Peter Bruce, Andrew Bruce, and Peter Gedeck
Think Stats — Allen B. Downey
An Introduction to Statistical Learning (ISLR) — Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, and Jonathan Taylor

Data Analysis with Python

Python for Data Analysis — Wes McKinney
Data Science from Scratch — Joel Grus

Machine Learning Foundations

Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow — Aurélien Géron
Pattern Recognition and Machine Learning — Christopher M. Bishop

Visualization

Storytelling with Data — Cole Nussbaumer Knaflic

This repository was created as part of my ongoing journey to build a strong foundation in Statistics, Data Analysis, and Artificial Intelligence through hands-on implementation and practical exploration.

Author

Akinchan Nayek

Exploring the foundations of Data Science, Machine Learning, and Statistical Analysis through practical Python implementations.
