This repository contains a comprehensive implementation of fundamental Descriptive Statistics concepts using Python. The notebook explores statistical measures, probability distributions, data visualization techniques, outlier detection methods, and relationship analysis between variables.
The project is designed to provide both theoretical understanding and practical implementation of statistical concepts that serve as the foundation for Data Science, Machine Learning, and Data Analysis.
- Understand the structure and characteristics of datasets.
- Summarize data using statistical measures.
- Detect and analyze outliers.
- Visualize distributions and patterns.
- Explore probability distributions.
- Analyze relationships between variables.
- Build intuition for data preprocessing in Machine Learning.
To gain the most value from this notebook, readers should have basic familiarity with:
- Python Programming
- NumPy
- Pandas
- Basic Probability
- Elementary Algebra
- Data Visualization Concepts
No prior Machine Learning knowledge is required.
After completing this notebook, readers will be able to:
- Compute and interpret measures of central tendency
- Analyze data dispersion using variance and standard deviation
- Understand quartiles and the five-number summary
- Detect outliers using Z-Score and IQR methods
- Visualize distributions using histograms and KDE plots
- Interpret PDF and CDF curves
- Explore Normal, Log-Normal, and Pareto distributions
- Analyze relationships using covariance and correlation
- Apply descriptive statistics to real-world datasets such as Iris
Descriptive statistics forms the foundation of Exploratory Data Analysis (EDA). Before training machine learning models, data scientists use descriptive statistical techniques to:
- Understand data distributions
- Detect anomalies and outliers
- Identify skewness and variability
- Explore feature relationships
- Guide feature engineering decisions
- Improve model reliability
A strong understanding of descriptive statistics is essential for effective data preprocessing and machine learning workflows.
Measures of central tendency describe the central or typical value of a dataset.
The mean is the arithmetic average of the observations.
Properties:
- Uses all observations.
- Sensitive to outliers.
- Most commonly used measure of center.
The median is the middle value of an ordered dataset.
Properties:
- Robust against outliers.
- Suitable for skewed distributions.
The mode is the most frequently occurring observation.
Properties:
- Can be used for both numerical and categorical data.
- A dataset may have multiple modes.
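The three measures above can be compared on a small sample. This is a minimal sketch using Python's standard-library `statistics` module rather than the notebook's own code; the `orders` and `colors` data are made up for illustration.

```python
import statistics

# Hypothetical sample: daily order counts with one unusually large day.
orders = [12, 15, 15, 18, 20, 22, 95]

mean_orders = statistics.mean(orders)        # uses every observation
median_orders = statistics.median(orders)    # middle value of the sorted data
modes_orders = statistics.multimode(orders)  # list of all most frequent values

# mean ≈ 28.14, median = 18, modes = [15]
print(round(mean_orders, 2), median_orders, modes_orders)

# The mode also works for categorical data, and there can be several modes.
colors = ["red", "blue", "red", "green", "blue"]
color_modes = statistics.multimode(colors)
print(color_modes)  # ['red', 'blue']
```

The single value 95 pulls the mean well above the median, which illustrates why the median is the more robust choice for skewed data.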
Dispersion measures quantify the spread of data around a central value.
$$\text{Range} = \text{Maximum} - \text{Minimum}$$
Variance measures the average squared deviation from the mean.
Standard deviation is the square root of the variance.
Properties:
- Expressed in the same unit as the data.
- Indicates consistency or variability.
The coefficient of variation (standard deviation divided by the mean) is used for comparing variability across datasets.
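The dispersion measures can be computed side by side as follows. This sketch uses the standard-library `statistics` module and made-up `scores`; treating the final measure as the coefficient of variation (standard deviation divided by the mean) is an assumption about what the text refers to.

```python
import statistics

scores = [4, 8, 6, 5, 3, 10]  # hypothetical values

data_range = max(scores) - min(scores)         # 10 - 3 = 7
pop_variance = statistics.pvariance(scores)    # divides by n
sample_variance = statistics.variance(scores)  # divides by n - 1
sample_std = statistics.stdev(scores)          # square root of sample variance

# Assumed measure: coefficient of variation, a unit-free ratio.
coeff_variation = sample_std / statistics.mean(scores)

print(data_range, round(pop_variance, 3), round(sample_variance, 3))
print(round(sample_std, 3), round(coeff_variation, 3))
```

Because the coefficient of variation has no unit, it can be compared across datasets measured on different scales, provided the mean is not close to zero.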
The five-number summary is a concise description of a distribution.
Components:
- Minimum
- First Quartile (Q1)
- Median (Q2)
- Third Quartile (Q3)
- Maximum
Provides insights into spread, skewness, and outliers.
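A five-number summary can be sketched in a few lines. The helper name `five_number_summary` is hypothetical, and the quartiles use the `inclusive` interpolation method of `statistics.quantiles`; other quartile conventions can give slightly different values.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

values = [1, 3, 5, 7, 9, 11, 13, 15, 17]  # hypothetical data
summary = five_number_summary(values)
print(summary)  # (1, 5.0, 9.0, 13.0, 17)
```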
Box plots visualize distribution characteristics using quartiles.
Features:
- Median
- Interquartile Range (IQR)
- Potential outliers
- Data spread
- Skewness
$$\text{IQR} = Q_3 - Q_1$$
Outlier Boundaries:
$$\text{Lower} = Q_1 - 1.5 \times \text{IQR}$$
$$\text{Upper} = Q_3 + 1.5 \times \text{IQR}$$
Outliers are observations significantly different from the majority of data.
The Z-score measures how far an observation lies from the mean, in units of standard deviation. An observation whose Z-score has an absolute value of 3 or more, on either side of the mean, is considered an outlier.
- Simple and computationally efficient.
- Works well for approximately normal distributions.
- Sensitive to extreme values.
- May perform poorly on highly skewed distributions.
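The Z-score rule can be sketched as below. The function name `zscore_outliers` and the `readings` data are hypothetical, and the sketch assumes the sample standard deviation (dividing by n - 1); the notebook may use a different convention.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score is at least `threshold`."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (assumption)
    return [x for x in values if abs(x - mean) / std >= threshold]

# Hypothetical sensor readings: twenty typical values and one extreme one.
readings = [48, 49, 50, 51, 52] * 4 + [120]
print(zscore_outliers(readings))  # [120]
```

The extreme value itself inflates both the mean and the standard deviation, which is the sensitivity noted above. With the sample standard deviation, a dataset of ten or fewer points can never reach an absolute Z-score of 3, so the rule needs a reasonably large sample.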
The IQR method is based on quartiles and is resistant to extreme values, which makes it suitable for skewed datasets.
- Values within the fences are considered normal observations.
- Values outside the fences are considered potential outliers.
- Box plots visually represent these outliers as individual points beyond the whiskers.
- Robust to extreme values
- Does not assume normal distribution
- Widely used in Exploratory Data Analysis (EDA)
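The fence rule (Q1 - 1.5 × IQR and Q3 + 1.5 × IQR) can be sketched as follows. The helper names `iqr_fences` and `iqr_outliers` are hypothetical, and the quartiles use the `inclusive` method of `statistics.quantiles`, which is one of several common conventions.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_outliers(values, k=1.5):
    """Return the values that fall outside the fences."""
    lower, upper = iqr_fences(values, k)
    return [x for x in values if x < lower or x > upper]

sample = [2, 4, 4, 5, 6, 7, 8, 9, 40]  # hypothetical data
print(iqr_fences(sample))    # (-2.0, 14.0)
print(iqr_outliers(sample))  # [40]
```

Because the quartiles barely move when 40 is replaced by an even larger value, the fences stay stable, unlike the mean and standard deviation used by the Z-score rule.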
A histogram is a graphical representation of a frequency distribution.
Applications:
- Understanding shape of data.
- Detecting skewness.
- Identifying multiple peaks.
- Detecting potential outliers.
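The counting step behind a histogram can be sketched without a plotting library. The helper `histogram_counts` is hypothetical and assumes integer data with an integer bin width; a plotting library would normally handle binning and drawing.

```python
from collections import Counter

def histogram_counts(values, bin_width):
    """Map each bin start to its count; a bin covers [start, start + bin_width)."""
    counts = Counter((x // bin_width) * bin_width for x in values)
    first, last = min(counts), max(counts)
    # Include empty bins so that gaps in the data remain visible.
    return {start: counts[start] for start in range(first, last + bin_width, bin_width)}

ages = [21, 22, 25, 31, 33, 34, 35, 38, 47, 62]  # hypothetical data
bins = histogram_counts(ages, 10)

for start, count in bins.items():
    print(f"{start}-{start + 9}: {'#' * count}")
```

The empty 50-59 bin followed by a single observation at 62 is the kind of gap that hints at a potential outlier.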
The probability density function (PDF) describes the relative likelihood of a continuous random variable taking values near a given point.
- PDF values themselves are not probabilities.
- Probability is represented by the area under the curve.
- Larger density indicates a higher likelihood of observing values in that region.
- Distribution analysis
- Statistical modeling
- Machine Learning
- Risk analysis
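The difference between a density value and a probability can be checked numerically. This sketch uses `statistics.NormalDist` from the standard library with an illustrative narrow normal distribution; the helper `area_under_pdf` is hypothetical and uses a simple trapezoidal rule.

```python
from statistics import NormalDist

narrow = NormalDist(mu=0.0, sigma=0.1)  # illustrative parameters

# A density value can exceed 1, so it cannot be a probability.
density_at_zero = narrow.pdf(0.0)  # about 3.99

def area_under_pdf(dist, a, b, steps=1000):
    """Approximate the area under dist.pdf between a and b (trapezoidal rule)."""
    width = (b - a) / steps
    heights = [dist.pdf(a + i * width) for i in range(steps + 1)]
    return width * (sum(heights) - 0.5 * (heights[0] + heights[-1]))

# The probability of an interval is the area under the curve over it.
area = area_under_pdf(narrow, -0.1, 0.1)
exact = narrow.cdf(0.1) - narrow.cdf(-0.1)  # about 0.683
print(round(density_at_zero, 2), round(area, 3), round(exact, 3))
```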
The cumulative distribution function (CDF) gives the cumulative probability up to a value.
Properties:
- Monotonically non-decreasing.
- Ranges from 0 to 1.
- At a point x, the CDF gives the probability P(X ≤ x).
- For sample data, it represents the proportion of observations less than or equal to a given value.
- Percentile calculations
- Probability estimation
- Statistical inference
- Reliability analysis
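For sample data, the proportion of observations less than or equal to a value can be computed with an empirical CDF. The helper `make_ecdf` and the `wait_times` data are hypothetical; this is a sketch, not the notebook's implementation.

```python
from bisect import bisect_right

def make_ecdf(sample):
    """Return a function F where F(x) is the share of observations <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def ecdf(x):
        # bisect_right counts how many sorted observations are <= x.
        return bisect_right(ordered, x) / n

    return ecdf

wait_times = [2, 3, 3, 5, 8, 13]  # hypothetical data
F = make_ecdf(wait_times)

print(F(1), F(3), F(13))  # 0.0 0.5 1.0
```

Reading the function in reverse gives percentiles: here half of the observations are at or below 3, so 3 is a median-level value for this sample.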
Kernel density estimation (KDE) is a non-parametric technique used to estimate a probability density.
Advantages:
- Smooth alternative to histograms.
- Better visualization of distribution shape.
- Useful for identifying multimodal behavior.
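The idea behind a kernel density estimate can be sketched by hand: place a small Gaussian bump on every observation and average the bumps. The helper `make_gaussian_kde`, the sample, and the fixed bandwidth of 0.4 are all assumptions for illustration; library implementations typically choose the bandwidth automatically.

```python
import math

def make_gaussian_kde(sample, bandwidth):
    """Return a density function built from Gaussian kernels of fixed bandwidth."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )

    return density

# Hypothetical sample with two clusters, near 1 and near 4.
sample = [1.0, 1.2, 0.8, 1.1, 4.0, 4.2, 3.9]
density = make_gaussian_kde(sample, bandwidth=0.4)

# The estimate is high near each cluster and low in the gap between them.
print(density(1.0) > density(2.5), density(4.0) > density(2.5))  # True True
```

The two separate peaks are what a KDE plot would show as multimodal behavior. A larger bandwidth would smooth them together, so the bandwidth choice matters.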
The Iris dataset is one of the most widely used datasets in Data Science.
Features:
- Sepal Length
- Sepal Width
- Petal Length
- Petal Width
Classes:
- Setosa
- Versicolor
- Virginica
Applications:
- Exploratory Data Analysis
- Classification
- Statistical Visualization
The normal distribution is a symmetric, bell-shaped distribution.
Characteristics:
- Mean = Median = Mode
- Symmetric around the mean
- Foundation of many statistical methods
A variable follows a log-normal distribution if its logarithm follows a normal distribution.
Applications:
- Income distributions
- Biological measurements
- Financial data
Characteristics:
- Positively skewed
- Strictly positive values
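The defining property and the skew can be checked on simulated data. This sketch draws values with `random.lognormvariate` from the standard library; the parameters (0 and 1), the seed, and the name `incomes` are arbitrary choices for illustration.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical "incomes" whose logarithm is normal with mean 0 and std 1.
incomes = [rng.lognormvariate(0.0, 1.0) for _ in range(10_000)]
log_incomes = [math.log(x) for x in incomes]

# All values are positive, and the long right tail pulls the mean above the median.
print(min(incomes) > 0)
print(statistics.mean(incomes) > statistics.median(incomes))

# Taking logs recovers an approximately normal sample.
print(round(statistics.mean(log_incomes), 1), round(statistics.stdev(log_incomes), 1))
```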
The Pareto distribution models situations where a small percentage of causes contributes a large proportion of outcomes.
It is closely associated with the Pareto principle, commonly known as the 80-20 rule.
Examples:
- Wealth distribution
- Business sales
- Website traffic
Characteristics:
- Heavy-tailed distribution
- Significant extreme events
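The concentration of outcomes in a heavy tail can be illustrated with simulated data. This sketch uses `random.paretovariate` from the standard library; the shape value log(5)/log(4) ≈ 1.16 is the one for which the theoretical split is 80-20, while the seed and the name `wealth` are arbitrary. A finite sample will not reproduce the theoretical share exactly.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Shape parameter for which the top 20% theoretically hold 80% of the total.
alpha = math.log(5) / math.log(4)  # about 1.16

# Hypothetical "wealth" values, sorted from largest to smallest.
wealth = sorted((rng.paretovariate(alpha) for _ in range(10_000)), reverse=True)

top_20_percent = wealth[: len(wealth) // 5]
share = sum(top_20_percent) / sum(wealth)

print(f"Top 20% of observations hold {share:.0%} of the total")
```

Rerunning with a different seed can change the share noticeably, because a few extreme draws dominate the total. That instability is itself a sign of a heavy tail.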
The notebook includes graphical representations of:
- Normal Distribution
- Log-Normal Distribution
- Pareto Distribution
These visualizations help understand:
- Symmetry
- Skewness
- Tail behavior
- Probability density characteristics
Covariance measures the direction of the linear relationship between two variables.
Interpretation:
- Positive → variables increase together.
- Negative → inverse relationship.
- Near zero → weak linear relationship.
The correlation coefficient (Pearson's r) measures the strength and direction of linear association on a scale from -1 to +1.
Interpretation:
- +1 : Perfect positive relationship
- 0 : No linear relationship
- -1 : Perfect negative relationship
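Covariance and correlation can be computed from their definitions. This is a sketch with hypothetical helper names (`covariance`, `pearson`) and made-up data, assuming the sample convention (dividing by n - 1) and Pearson's coefficient for the correlation.

```python
import math

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def pearson(xs, ys):
    """Pearson correlation: covariance rescaled by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

hours = [1, 2, 3, 4, 5]        # hypothetical study hours
score = [52, 55, 61, 64, 68]   # hypothetical exam scores

print(covariance(hours, score))         # 10.25
print(round(pearson(hours, score), 3))  # 0.994

# Rescaling a variable changes the covariance but not the correlation.
score_x10 = [10 * s for s in score]
print(covariance(hours, score_x10))         # 102.5
print(round(pearson(hours, score_x10), 3))  # 0.994
```

The covariance depends on the units of both variables, so only its sign is easy to interpret. The correlation removes the units, which is why it is bounded between -1 and +1.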
Rank correlation (for example, Spearman's) measures monotonic relationships using ranks.
Useful when:
- Data is not normally distributed.
- Relationship is monotonic but non-linear.
- Outliers are present.
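A rank-based correlation can be sketched by ranking both variables and applying Pearson's formula to the ranks. This assumes Spearman's coefficient is the measure meant here; the helper names (`average_ranks`, `pearson`, `spearman`) and the data are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation computed from sums of deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

x = [1, 2, 3, 4, 5, 6]
y = [math.exp(v) for v in x]  # monotonic but strongly non-linear

print(round(pearson(x, y), 2))   # about 0.85
print(round(spearman(x, y), 2))  # 1.0
```

Because ranks ignore how far apart the values are, the rank correlation reaches 1 for any strictly increasing relationship, while Pearson's coefficient stays below 1 when the relationship is curved.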
- NumPy
- Pandas
- Matplotlib
- Seaborn
- SciPy
- Scikit-Learn
Descriptive_Statistics/
│
├── Descriptive_Statistics.ipynb
├── requirements.txt
├── MIT LICENSE
├── iris.csv
├── tips.csv.txt
├── README.md
- Inferential Statistics
- Confidence Intervals
- Hypothesis Testing
- ANOVA
- Chi-Square Tests
- Probability Theory
- Feature Engineering
Some statistical concepts and initial implementation ideas were inspired by publicly available educational notebooks, tutorials, and open learning resources. The documentation, explanations, code organization, analysis workflow, visualizations, and learning notes were independently developed and adapted for educational purposes as part of my learning journey in Statistics, Data Analysis, and Machine Learning.
For readers interested in exploring these topics further, the following books are highly recommended:
- Practical Statistics for Data Scientists — Peter Bruce, Andrew Bruce, and Peter Gedeck
- Think Stats — Allen B. Downey
- An Introduction to Statistical Learning (ISLR) — Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, and Jonathan Taylor
- Python for Data Analysis — Wes McKinney
- Data Science from Scratch — Joel Grus
- Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow — Aurélien Géron
- Pattern Recognition and Machine Learning — Christopher M. Bishop
- Storytelling with Data — Cole Nussbaumer Knaflic
This repository was created as part of my ongoing journey to build a strong foundation in Statistics, Data Analysis, and Artificial Intelligence through hands-on implementation and practical exploration.
Akinchan Nayek
Exploring the foundations of Data Science, Machine Learning, and Statistical Analysis through practical Python implementations.