This project analyzes weekly avocado sales and pricing data in the United States between 2015 and 2020. The objective is to understand price elasticity, compare organic vs. conventional product behavior, identify regional patterns, and evaluate a simple forecasting approach using time series modelling.
The dataset contains 33,045 records and was obtained from Kaggle (Avocado Prices 2020). All analysis was performed in RStudio.
- Quantify the relationship between price and sales volume for organic and conventional avocados.
- Examine regional pricing behavior, using Albany as a case study.
- Detect outliers in price and volume and assess their potential impact.
- Build a baseline 12‑week price forecast using an ARIMA model.
- Organic avocados hold demand better: for organic avocados, a 10% price increase was associated with roughly a 7.7% drop in volume, while for conventional avocados the drop was around 13.2%. That suggests customers buying organic are less price-sensitive.
- Albany prices are relatively stable. The time series doesn't show wild swings, which could be useful for inventory planning.
- There are clear outliers in price and volume, especially on the organic side. In some weeks, prices or volumes fall far from the typical range; these would be worth investigating further if this were a live business question.
None of these numbers are meant as universal rules; they come from a linear model on log-transformed data and should be interpreted with that in mind.
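The elasticity figures above rest on a regression in logs. As a minimal sketch, assuming a log-log specification (the text only says "log-transformed data") and using synthetic data with a hypothetical true elasticity of -0.8, the slope on log(price) can be read as the approximate percent change in volume per 1% change in price. The variable names here are illustrative and are not taken from the project script.

```r
# Synthetic data only: prices, volumes and the "true" elasticity are hypothetical.
set.seed(1)
n <- 500
true_elasticity <- -0.8
price  <- runif(n, min = 0.8, max = 2.5)
volume <- exp(10 + true_elasticity * log(price) + rnorm(n, sd = 0.2))

# Log-log linear model: the coefficient on log(price) estimates the elasticity.
fit <- lm(log(volume) ~ log(price))
elasticity <- unname(coef(fit)["log(price)"])

# Two ways to translate the elasticity into the effect of a 10% price increase:
approx_pct_change <- elasticity * 10              # linear approximation
exact_pct_change  <- (1.10^elasticity - 1) * 100  # implied by the log-log form

print(round(c(elasticity = elasticity,
              approx = approx_pct_change,
              exact = exact_pct_change), 2))
```

On the real data, the same model would presumably be fitted separately to the organic and conventional rows, with the column names adjusted to match the CSV.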
avocado-analysis/
│
├── data/
│ └── avocado-updated-2020.csv
│
├── scripts/
│ └── avocado-analysis.R
│
├── imgs/
│ ├── albany_organic_price_forecast_3months.png
│ ├── albany_organic_prices_decomposition.png
│ ├── average_price_boxplot.png
│ ├── boxplot_precios.png
│ ├── price_by_type_boxplot.png
│ ├── series_temporales.png
│ └── total_volume_boxplot.png
│
└── README.md
The R script covers the full pipeline: data loading, cleaning, exploration, regression models, and the ARIMA forecast. All plots were generated with ggplot2, and some were made interactive with plotly during the analysis phase.
The chart below shows the 12-week forecast for organic avocado prices in Albany. The model captures the general price level well, and the confidence bands widen with the forecast horizon, as expected.
It's a simple univariate ARIMA fitted automatically. With more domain context (seasonality of supply, weather events) the forecast could be improved, but for a first pass it gives a useful baseline.
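The forecasting step can be illustrated with a small, dependency-free sketch. The project fitted the model automatically with the forecast package; the version below instead uses base R's stats::arima() with a fixed AR(1) order, which is an assumption made only to keep the example self-contained. The series is synthetic and is not the Albany data.

```r
# Synthetic weekly price series (hypothetical values, not the Albany data).
set.seed(42)
n_weeks <- 156
prices <- 1.5 + as.numeric(arima.sim(model = list(ar = 0.6), n = n_weeks, sd = 0.05))
price_ts <- ts(prices, frequency = 52)

# Fixed-order AR(1) with a mean term; the order is an assumption, not the
# order chosen by the automatic fit in the project.
fit <- arima(price_ts, order = c(1, 0, 0))

# 12-step-ahead point forecasts and their standard errors.
fc <- predict(fit, n.ahead = 12)

# Approximate 95% bands from the standard errors.
lower <- fc$pred - 1.96 * fc$se
upper <- fc$pred + 1.96 * fc$se
band_width <- as.numeric(upper - lower)

print(round(band_width, 3))  # the widths grow with the forecast horizon
```

With the forecast package installed, the analogous calls would be forecast::auto.arima(price_ts) followed by forecast::forecast(fit, h = 12), which selects the order automatically and returns the intervals directly.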
Everything was written in R. The main libraries used:
- Data handling: readr, dplyr, tidyr
- Visualization: ggplot2, plotly
- Modeling: stats::lm(), forecast (for ARIMA)
- Summary statistics: base R functions like cor(), summary(), boxplot()
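As an illustration of how the base R functions above can support the outlier check, here is a minimal sketch of the 1.5 × IQR rule, which is similar to the rule boxplot() uses to draw points beyond its whiskers. The price vector is synthetic and the variable names are hypothetical; the project script may flag outliers differently.

```r
# Synthetic average prices with three planted extreme values (hypothetical data).
set.seed(7)
avg_price <- c(rnorm(200, mean = 1.4, sd = 0.2), 3.1, 3.4, 0.2)

# Quartiles and interquartile range.
q <- quantile(avg_price, probs = c(0.25, 0.75))
iqr <- q[2] - q[1]

# Fences at 1.5 * IQR beyond the quartiles.
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Flag values outside the fences.
is_outlier <- avg_price < lower_fence | avg_price > upper_fence
outliers <- avg_price[is_outlier]

print(sort(outliers))
```

Applied per avocado type, a check like this would show whether the flagged weeks concentrate on the organic side.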
I wanted to practice working with a real dataset end-to-end, from messy CSV to a presentable result, and to show how an analyst might approach a retail pricing question. No dashboard is included here; the output is the R script itself and the plots it produces.
The dataset is from 2020 and will not be updated, so the findings are a snapshot of that period.
