Jupyter Notebook
- Jupyter Notebook (for running the pipeline - extracting and loading raw CSV files to a MySQL server)
Executable Python Script
- Python Script (a deployable infrastructure as code (IaC) version of the Jupyter Notebook - for importing the raw CSV Files, transforming the raw files, building a MySQL database and related tables, and loading the transformed data into the newly constructed tables)
Raw CSV Files:
- Invoices.CSV
- OrderLeads.CSV
- SalesTeam.CSV
- Tableau Dashboard and a packaged Tableau Workbook
- Download either the Jupyter Notebook or Python Script.
- Execute the downloaded file in your preferred Python environment (Anaconda, VS Code, etc.). Note: The pipeline assumes that the server is "localhost" and that the port number is 3306. If the server name or port number differs, the script will need to be modified.
- When prompted, enter your MySQL username (usually "root") and your corresponding MySQL password.
- Let the pipeline run to completion.
- To view an example Tableau Dashboard that can be created from the output, click on the Tableau link here, or download the Tableau workbook.
- Print statements, such as print('Invoice Table Created!'), are included at critical points to confirm that the pipeline is functioning as expected.
- If a step fails, the pipeline stops and displays an error message instead of printing the confirmation for that cell.
- The raw CSV files (Invoices.CSV, OrderLeads.CSV, and SalesTeam.CSV) are loaded into a Python Jupyter Notebook and converted to Pandas dataframe objects.
- General pre-processing steps are taken.
- Null values are assessed; in their current state, the files contain no missing values.
- Data types for each column are evaluated; most are imported as strings or integers.
- The following transformations are applied to the dataframes:
- White spaces in column names with multiple words are replaced with underscores (for consistency and to prevent syntax issues).
- To the invoices dataframe:
- The Date and Date_of_Meal fields are converted to datetime datatypes.
- Additional timezone information ("+00:00:00") is dropped to standardize all times to the UTC timezone.
- The hour is extracted from Date_of_Meal and mapped to a part of the day (e.g. Early Morning, Late Morning, Early Afternoon, etc.), and a new field ("Part_of_Day") is created.
- The number of participants is derived from the Participants column and added as a new column.
- A new dataframe (customer_order) is created to link every order_id to the participating customer_id(s).
- A last_updated column is added to represent the date on which the CSV file was last imported.
- To the orders dataframe:
- The date column is converted to a date datatype.
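The pre-processing checks and invoice transformations above can be sketched in pandas as follows. This is a minimal illustration, not the pipeline's actual code: the sample rows, the quoted-name format of the Participants column, the hour boundaries and the labels beyond the three named above, and the Number_of_Participants column name are all assumptions.

```python
import pandas as pd

# Illustrative rows only; the real Invoices.CSV may use other columns and formats.
invoices = pd.DataFrame({
    "Order Id": ["A1", "B2"],
    "Date of Meal": ["2016-05-13 07:00:00+00:00", "2018-10-01 20:00:00+00:00"],
    "Participants": ["['Ann Lee']", "['Bo Chan' 'Cy Diaz']"],
})

# Pre-processing checks: missing values per column and the imported dtypes.
missing_per_column = invoices.isnull().sum()
print(missing_per_column)
print(invoices.dtypes)

# Replace white space in column names with underscores.
invoices.columns = invoices.columns.str.replace(" ", "_", regex=False)

# Parse the timestamps as UTC, then drop the timezone information.
invoices["Date_of_Meal"] = (
    pd.to_datetime(invoices["Date_of_Meal"], utc=True).dt.tz_localize(None)
)


def part_of_day(hour):
    """Map an hour (0-23) to a label; the boundaries here are assumed."""
    if 5 <= hour < 9:
        return "Early Morning"
    if 9 <= hour < 12:
        return "Late Morning"
    if 12 <= hour < 15:
        return "Early Afternoon"
    if 15 <= hour < 18:
        return "Late Afternoon"  # hypothetical label
    if 18 <= hour < 22:
        return "Evening"  # hypothetical label
    return "Night"  # hypothetical label


invoices["Part_of_Day"] = invoices["Date_of_Meal"].dt.hour.map(part_of_day)

# Count the quoted names in each Participants string (assumed format).
invoices["Number_of_Participants"] = invoices["Participants"].str.count(r"'[^']*'")

print(invoices)
```

Counting quoted substrings is only reliable if names never contain an apostrophe; if the real Participants column uses a different layout, the counting step would need to change accordingly.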
- The pipeline attempts to connect to the user's local MySQL server, prompting the user to enter their MySQL username and password.
Note: The pipeline assumes that the server is "localhost" and that the port number is 3306.
- The database is created inside a try-except block: the script first tries to drop the database so that it can be rebuilt from scratch; if the drop fails because the database does not exist yet, the script simply creates it.
- Four tables are created, one for each of the dataframes (invoice, orders, saleslead, and customer_order).
- For scalability reasons, the invoice table is partitioned by year.
- The dataframes are loaded into the MySQL tables via a for loop that executes an INSERT INTO statement for each row of each dataframe.
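The database steps above can be sketched as three small helpers. This is a hedged illustration rather than the pipeline's own code: the function names, the database name, the column names in the table definition, and the RecordingCursor stand-in (used so the example runs without a MySQL server) are all hypothetical. The helpers only assume a DB-API style cursor whose execute() accepts %s placeholders, as MySQL drivers such as MySQL Connector/Python do.

```python
class RecordingCursor:
    """Hypothetical stand-in for a MySQL cursor; it only records statements."""

    def __init__(self, existing_databases=()):
        self.existing = set(existing_databases)
        self.executed = []

    def execute(self, statement, params=None):
        self.executed.append((statement, params))
        if statement.startswith("DROP DATABASE "):
            name = statement.split()[-1]
            if name not in self.existing:
                raise RuntimeError(f"database {name} does not exist")
            self.existing.discard(name)


def recreate_database(cursor, name):
    """Try to drop the database, then create it from scratch."""
    try:
        cursor.execute(f"DROP DATABASE {name}")
    except Exception:
        # Assumed failure mode: nothing to drop yet. A real script would
        # catch the driver's own error class instead of Exception.
        pass
    cursor.execute(f"CREATE DATABASE {name}")


def partitioned_invoice_ddl(years):
    """Build a CREATE TABLE statement partitioned by year (columns assumed)."""
    partitions = ",\n".join(
        f"    PARTITION p{year} VALUES LESS THAN ({year + 1})"
        for year in sorted(years)
    )
    return (
        "CREATE TABLE invoice (\n"
        "    Order_Id VARCHAR(50) NOT NULL,\n"
        "    `Date` DATE NOT NULL,\n"
        # MySQL requires the partitioning column in every unique key.
        "    PRIMARY KEY (Order_Id, `Date`)\n"
        ")\n"
        "PARTITION BY RANGE (YEAR(`Date`)) (\n"
        f"{partitions},\n"
        "    PARTITION p_future VALUES LESS THAN MAXVALUE\n"
        ")"
    )


def insert_rows(cursor, table, columns, rows):
    """Execute one parameterized INSERT INTO statement per row."""
    placeholders = ", ".join(["%s"] * len(columns))
    statement = (
        f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    )
    for row in rows:
        cursor.execute(statement, tuple(row))
    return statement


cursor = RecordingCursor()
recreate_database(cursor, "sales_pipeline")  # hypothetical database name
cursor.execute(partitioned_invoice_ddl([2017, 2018]))
insert_rows(cursor, "invoice", ["Order_Id", "`Date`"], [("A1", "2017-03-01")])
for statement, params in cursor.executed:
    print(statement, params)
```

With a pandas dataframe, the rows argument could be supplied as `df.itertuples(index=False, name=None)`. Table and database names are interpolated directly into the SQL text, so they should only ever come from trusted, hard-coded values; row values go through placeholders.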
- SQL Transformations:
- Views:
- Average Meal Price: Average meal price by type of meal.
- Average Participants: Average number of participants by meal type.
- Company Metrics: For each company, the monthly total and average invoice amounts are shown per meal type (with the meal type displayed). In addition, the year-to-date amount collected and the yearly total are presented.
- Customer Purchases: Customer_Name, Part_of_Day, Company_Name, Number_of_Purchases, Total_Spent.
- Customer Stats: total number of orders by each customer, total amount each customer spent, and the average amount each spent.
- Difference Days: Difference in days between the date of meal and date the order was placed.
- Percent Converted: For every company, shows the number of orders, the number converted to an order (as a count and as a proportion), and the number not converted (as a count).
- Sales by Year: Number of invoices each year.
- Sales Rep Performance: Sales_Rep, Sales_Rep_Id, Company_Name, Company_Id, Profit_by_Sales_Rep.
- Total Sales: Total sales by type of meal price for each year.
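As an illustration of how one of these views might be defined, the sketch below builds an "Average Meal Price" view. It uses Python's built-in sqlite3 module as a stand-in so the example runs without a MySQL server; the CREATE VIEW ... AS SELECT form shown is also valid in MySQL. The column names (Type_of_Meal, Meal_Price), the view name, and the sample rows are assumptions, not the pipeline's actual schema.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the MySQL server.
conn = sqlite3.connect(":memory:")

# Hypothetical, reduced invoice table with made-up rows.
conn.execute(
    "CREATE TABLE invoice (Order_Id TEXT, Type_of_Meal TEXT, Meal_Price REAL)"
)
conn.executemany(
    "INSERT INTO invoice VALUES (?, ?, ?)",
    [
        ("A1", "Breakfast", 10.0),
        ("B2", "Breakfast", 20.0),
        ("C3", "Dinner", 40.0),
    ],
)

# The view: average meal price by type of meal.
conn.execute(
    """
    CREATE VIEW average_meal_price AS
    SELECT Type_of_Meal, AVG(Meal_Price) AS Average_Meal_Price
    FROM invoice
    GROUP BY Type_of_Meal
    """
)

rows = conn.execute(
    "SELECT * FROM average_meal_price ORDER BY Type_of_Meal"
).fetchall()
print(rows)  # [('Breakfast', 15.0), ('Dinner', 40.0)]
```

Because a view stores only the query, it reflects new rows in the underlying table each time it is read, which suits a dashboard that is refreshed after every pipeline run.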