- Student Name: Brian Vu
- Student ID: 1053531
- Due Date: Friday 13th of August 11:59:00 am (AEST).
- Language: Python 3.8.3
- Packages / Libraries: pandas, pyspark, geopandas, numpy, folium, rtree, pygeos, statsmodels, sklearn
- NYC TLC: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
- External dataset 1: US Census Bureau (various datasets): https://data.census.gov/cedsci/
- External dataset 2: Census Tracts data: https://www1.nyc.gov/assets/planning/download/zip/data-maps/open-data/nycb2010_21b.zip
- raw_data: Contains the raw data files that are too large to upload to git. This includes only the taxi datasets.
- raw_data_lite: Contains all other raw data that can be uploaded to git. This includes the shapefiles for the census tracts and taxi zones, as well as the census dataset that was used.
- preprocessed_data: Contains all the preprocessed data files.
- plots: Contains all plots and figures.
- code: Contains the notebooks. Order to run them:
  1. Extracting and serializing data. Some data was uploaded manually, but links to download it are provided.
  2. Preprocessing census data.
  3. Preprocessing taxi data 1.
  4. Preprocessing taxi data 2.
  5. Merging and visualizing.
  6. Statistical modelling.
- deprecated: Contains discarded plots and some leftover code (most of it was deleted).
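
The notebook order listed under code could also be run from a script rather than by hand. The following is only a minimal sketch: it assumes the notebooks live in the `code` directory and that each filename starts with its step number (for example `1_...ipynb`), which this README does not confirm. The helper names `ordered_notebooks`, `build_command` and `run_all` are hypothetical. It relies on the `jupyter nbconvert --execute` command, so Jupyter would need to be installed alongside the packages listed above.

```python
import re
import subprocess
from pathlib import Path


def ordered_notebooks(code_dir):
    """Return the notebooks in code_dir sorted by their leading step number.

    Assumption: filenames begin with the step number, e.g. "3_taxi.ipynb".
    Notebooks without a leading number are placed last.
    """
    def sort_key(path):
        match = re.match(r"(\d+)", path.name)
        step = int(match.group(1)) if match else float("inf")
        return (step, path.name)

    return sorted(Path(code_dir).glob("*.ipynb"), key=sort_key)


def build_command(notebook):
    """Build the nbconvert command that executes one notebook in place."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        "--inplace",
        str(notebook),
    ]


def run_all(code_dir="code"):
    """Execute every notebook in order, stopping at the first failure."""
    for notebook in ordered_notebooks(code_dir):
        subprocess.run(build_command(notebook), check=True)
```

Calling `run_all("code")` from the repository root would then execute the six steps in sequence. Sorting on the parsed integer rather than the raw filename keeps a step numbered 10 after step 2, should more notebooks be added later.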