NYC taxi and rideshare drivers work long, exhausting shifts, but most are leaving money on the table. Without a clear strategy, they accept nearly every trip that comes their way, wasting precious time on low-value rides that drag down their hourly wage.
Increase average taxi driver earnings by 20% without working more hours.
We focused on high-volume for-hire vehicles (Uber, Lyft, Juno, Via) operating within Manhattan, Brooklyn, and Queens, the busiest boroughs in NYC.
We built a data-driven decision system that answers two simple but powerful questions:
- Which trips should I accept? A machine learning model (XGBoost) classifies each trip as high-value or low-value in real time. The winning strategy: our model recommends accepting only trips that have a 90%+ chance of being among the top 25% most profitable in the next few minutes. This alone adds +$8/hour.
- When and with whom should I start my shift? Through simulation, we discovered that the starting zone barely matters, but choosing Uber over Lyft and working night shifts adds another +$6 per hour.
Combine both strategies: +$14/hour, about 25% more earnings, roughly $2,200 extra per month.
No extra hours. No extra effort. Just smarter decisions.
In our simulations, drivers following the full strategy saw their average hourly wage rise from $55.09 to $69.07 (about 25%), exceeding the 20% target.
Working 8-hour days, 5 days a week, this improvement translates to roughly $2,200 in additional monthly earnings for a full-time driver.
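As a quick sanity check, the figures above can be reproduced with a few lines of shell. This sketch is not part of the project: it takes the two reported wages and assumes 4 working weeks per month, which is an assumption made here because the text only gives a rounded monthly figure.

```shell
#!/bin/sh
# Sanity check of the reported uplift. The two wages come from the text above;
# 4 working weeks per month is an assumption made for this sketch.
wage_uplift() {
    # usage: wage_uplift BASE POLICY HOURS_PER_DAY DAYS_PER_WEEK WEEKS_PER_MONTH
    awk -v base="$1" -v policy="$2" -v hours="$3" -v days="$4" -v weeks="$5" 'BEGIN {
        gain = policy - base
        printf "hourly gain: $%.2f (%.1f%%)\n", gain, 100 * gain / base
        printf "monthly gain: $%.2f\n", gain * hours * days * weeks
    }'
}

wage_uplift 55.09 69.07 8 5 4
```

Under these assumptions the script reports an hourly gain of $13.98 (25.4%) and a monthly gain of $2236.80, which is consistent with the rounded $2,200 quoted above.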
The true test came when we validated our policy on 2024 data, a year the models were not trained on. The performance held strong, with the policy achieving an average hourly wage of $67.87. More importantly, it consistently outperformed the baseline in every single month, maintaining a solid advantage of $7 to $17 per hour. This confirms the strategy is reliable and adaptable to changing conditions.
Our recommendations are practical and implementable:
- Drive for Uber if possible
- Work nights rather than mornings
- Reject trips that fall below our profitability threshold
- Avoid Mondays and Fridays if you have the flexibility
- Where you start doesn't matter as much as when you start
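The trip rejection rule can be illustrated with a small filter. This is only a sketch in shell, not the project's R implementation: the CSV layout (trip_id, p_high_value), the sample rows, and the accept_trips name are all made up for illustration, and the probabilities stand in for the output of the XGBoost model.

```shell
#!/bin/sh
# Threshold rule sketch: keep only trips whose predicted probability of being
# high-value meets the cutoff. The input format (trip_id,probability) is hypothetical.
accept_trips() {
    # usage: accept_trips THRESHOLD < predictions.csv
    # NR > 1 skips the header row; "+ 0" forces a numeric comparison.
    awk -F, -v threshold="$1" 'NR > 1 && $2 + 0 >= threshold + 0 { print $1 }'
}

# Made-up example predictions, not project data.
printf '%s\n' 'trip_id,p_high_value' 'A,0.95' 'B,0.42' 'C,0.90' 'D,0.89' |
    accept_trips 0.90
```

With the 0.90 cutoff described earlier, only trips A and C are printed; B and D would be rejected.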
- We Simulated Over 65,000 Workdays: Without actual driver IDs in the data (anonymized trip records), we built a simulation engine that modeled driver behavior across tens of thousands of scenarios, effectively "playing out" entire days to test what strategies would work best in the real world.
- We Combined Data Science with Decision Theory:
  - Machine learning (XGBoost) to classify trip quality
  - Sequential decision modeling to simulate day-long driver behavior
  - Spatial analysis of NYC neighborhoods
  - Demographic data from the US Census to understand local patterns
- We Processed Massive Datasets: Working with over 55 GB of NYC taxi trip data, we used modern data engineering techniques (DuckDB, parallel processing, caching) to make this analysis feasible.
- All Results Are Reproducible: The entire analysis is documented, unit-tested, and containerized. Every step, from data collection to model evaluation, can be reproduced by anyone with the right tools.
To find the optimal solution for those questions, we followed the methodology proposed by Warren B. Powell (2022) in Sequential Decision Analytics and Modeling: Modeling with Python and combined it with the Cross-Industry Standard Process for Data Mining (CRISP-DM) to define a machine learning model that powers the sequential decision.
Following the steps of both methodologies, we organized the articles created in this portfolio website:
- Interactive Demo: A Shiny web app where drivers can simulate their own earnings under different strategies.
- Real-World Pilot: Testing the policy with a small group of NYC drivers.
- Expansion: Adapting the model for other cities with similar trip data.
In this project, we used a subset of the data available in the TLC Trip Record Data from 2022-2023 for High Volume For-Hire Vehicle, which covers the Juno, Uber, Via and Lyft trips within our project scope, with the columns described in its data dictionary.
This project was completed under strong assumptions given that the data used in the analysis does not provide any unique identifier for taxi drivers, which limits the realism of some results.
Additionally, this project aims to increase taxi driver earnings at the individual level. However, if applied extensively, it could also produce the following unintended consequences:
- Reduced service quality: Drivers focusing solely on maximizing earnings may avoid less profitable areas or times, potentially leaving some passengers underserved.
- Increased congestion: Drivers congregating in high-profit areas could worsen traffic in already busy parts of the city.
This project is intended as a demonstration of data science methodology rather than a prescriptive business recommendation, and these considerations should be carefully weighed before any real-world implementation.
Reproducibility and long-term maintainability were core priorities from the start, which shaped every tooling decision in this project. The following tools were used to achieve this:
- We use git to manage changes in the code and provide an interface to share the project on GitHub.
- Docker and Nix are used to build a reproducible dev-container based on default.nix. The container can be connected via SSH using a public and private key pair as defined in setup.sh, and the .envrc sets the Nix environment to use in the Positron console.
- For modeling, we used the tidymodels framework to ensure we are following good modeling practices.
- Since the project follows the basic structure of an R package, we were able to document and create unit tests for custom functions using testthat, roxygen2 and devtools. This was especially important to ensure that the simulation function and the custom step function (which extends the recipes package) work correctly.
- The project also follows the structure of a Quarto project and renders all articles into the docs folder, giving us full control over the format used to present each article. Results are hosted on GitHub Pages, so they can be shared at no cost.
- The .Rprofile overrides install.packages, update.packages and remove.packages to make clear that R packages must be defined in default.nix to ensure reproducibility.
- To manage data larger than RAM, we use duckdb and keep large files in a separate folder named NycTaxiBigFiles under the same parent directory as this repo.
- To cache results generated during the investigation process, we use .qs2 files and track them with pins, stored under the folder NycTaxiPins in the same parent directory as this repo.
- We use the air extension to ensure consistent code formatting across the project.
The result is a hybrid structure that combines an R package (with documented functions and unit tests) and a Quarto website (with rendered articles and hosted results), which was one of the most challenging aspects of the project to set up correctly:
tree -L 3
.
├── air.toml
├── default.nix
├── DESCRIPTION
├── docker-compose.yml
├── Dockerfile
├── docs
│   ├── figures
│   │   ├── CRISP-DM_Process_Diagram.png
│   │   ├── Hour Tree Explanation-1.png
│   │   ├── htop_parallel_process.png
│   │   ├── logo-generated.jpeg
│   │   ├── Mean Hourly Wage after policy-1.png
│   │   ├── model_benefit_curve.png
│   │   ├── model-benefit.jpg
│   │   ├── nyc-taxi-navbar-logo.png
│   │   ├── nyc-taxi-navbar-logo.xcf
│   │   ├── screenshot-ui.png
│   │   ├── Sequential-Decision-Modeling-Framework.png
│   │   └── simulated_wage_vs_threshold.png
│   ├── index.html
│   ├── investigation-phases
│   │   ├── 01-business-understanding.html
│   │   ├── 02-data-collection-process.html
│   │   ├── 03-initial-exploration_files
│   │   ├── 03-initial-exploration.html
│   │   ├── 04-base-line_files
│   │   ├── 04-base-line.html
│   │   ├── 05-lookahead-labeling_files
│   │   ├── 05-lookahead-labeling.html
│   │   ├── 06-expanding-geospatial-data_files
│   │   ├── 06-expanding-geospatial-data.html
│   │   ├── 07-expanding-transportation-socioeconomic_files
│   │   ├── 07-expanding-transportation-socioeconomic.html
│   │   ├── 08-policy-function-approximation_files
│   │   ├── 08-policy-function-approximation.html
│   │   ├── 09-from-predictions-to-policies_files
│   │   ├── 09-from-predictions-to-policies.html
│   │   ├── 10-optimal-starting-states_files
│   │   └── 10-optimal-starting-states.html
│   ├── man
│   │   └── figures
│   ├── search.json
│   └── site_libs
│       ├── bootstrap
│       ├── clipboard
│       ├── DiagrammeR-styles-0.2
│       ├── ggiraphjs-0.9.2
│       ├── girafe-binding-0.9.2
│       ├── grViz-binding-1.0.11
│       ├── htmltools-fill-0.5.8.1
│       ├── htmlwidgets-1.6.4
│       ├── jquery-3.6.0
│       ├── leaflet-1.3.1
│       ├── leaflet-binding-2.2.3
│       ├── leafletfix-1.0.0
│       ├── Leaflet.glify-3.2.0
│       ├── leaflet-providers-2.0.0
│       ├── leaflet-providers-plugin-2.2.3
│       ├── proj4-2.6.2
│       ├── Proj4Leaflet-1.0.1
│       ├── quarto-html
│       ├── quarto-nav
│       ├── quarto-search
│       ├── rstudio_leaflet-1.3.1
│       └── viz-1.8.2
├── figures
│   ├── CRISP-DM_Process_Diagram.png
│   ├── Hour Tree Explanation-1.png
│   ├── htop_parallel_process.png
│   ├── Mean Hourly Wage after policy-1.png
│   ├── model_benefit_curve.png
│   ├── nyc-taxi-navbar-logo.png
│   ├── nyc-taxi-navbar-logo.xcf
│   ├── Sequential-Decision-Modeling-Framework.png
│   └── simulated_wage_vs_threshold.png
├── index.qmd
├── investigation-phases
│   ├── 01-business-understanding.qmd
│   ├── 02-data-collection-process.qmd
│   ├── 03-initial-exploration.qmd
│   ├── 04-base-line.qmd
│   ├── 05-lookahead-labeling.qmd
│   ├── 06-expanding-geospatial-data.qmd
│   ├── 07-expanding-transportation-socioeconomic.qmd
│   ├── 08-policy-function-approximation.qmd
│   ├── 09-from-predictions-to-policies.qmd
│   └── 10-optimal-starting-states.qmd
├── man
│   ├── add_performance_variables.Rd
│   ├── add_pred_class.Rd
│   ├── add_take_current_trip.Rd
│   ├── calculate_costs.Rd
│   ├── collect_predictions_best_config.Rd
│   ├── compare_model_predictions.Rd
│   ├── compute_power.Rd
│   ├── figures
│   │   ├── logo.hex
│   │   ├── logo-image.png
│   │   ├── logo.png
│   │   └── Logo-source.txt
│   ├── NycTaxi-package.Rd
│   ├── optimize_trip_start_time.Rd
│   ├── plot_bar.Rd
│   ├── plot_box.Rd
│   ├── plot_heap_map.Rd
│   ├── plot_num_distribution.Rd
│   ├── required_pkgs.step_join_geospatial_features.Rd
│   ├── sim_start_trip_summary.Rd
│   ├── simulate_trips.Rd
│   └── step_join_geospatial_features.Rd
├── multicore-scripts
│   ├── 01-fine-tune-future-process.R
│   ├── 02-add-target.R
│   ├── 02-run_add_target.sh
│   ├── 03a-tuning-simple-models.R
│   ├── 03b-tuning-dimreduction-models.R
│   └── 03c-tuning-tree-models.R
├── NAMESPACE
├── nix
│   ├── pkgs.nix
│   ├── r-core.nix
│   ├── r-custom.nix
│   ├── r-data.nix
│   ├── r-geo.nix
│   ├── r-ml.nix
│   └── system.nix
├── params.yml
├── _quarto.yml
├── R
│   ├── add_take_current_trip.R
│   ├── calculate_costs.R
│   ├── compare_model_predictions.R
│   ├── compute_power.R
│   ├── NycTaxi-package.R
│   ├── optimize_trip_start_time.R
│   ├── plot_bar.R
│   ├── plot_box.R
│   ├── plot_heap_map.R
│   ├── plot_num_distribution.R
│   ├── sim_start_trip_summary.R
│   ├── simulate_trips.R
│   ├── step_join_geospatial_features.R
│   └── utils.R
├── README.md
├── setup.sh
└── tests
    ├── testthat
    │   ├── fixtures
    │   ├── test-add_take_current_trip.R
    │   ├── test-calculate_costs.R
    │   ├── test-plot_box.R
    │   ├── test-sim_start_trip_summary.R
    │   ├── test-simulate_trips.R
    │   └── test-step_join_geospatial_features.R
    └── testthat.R

47 directories, 109 files
To reproduce the results of this project, follow these steps to set up the same environment using Docker and Nix.
You need Docker and Docker Compose. Choose the appropriate installation method for your operating system:
- Windows or macOS: Install Docker Desktop (includes Docker Compose).
- Linux: Install the Docker Engine and then Docker Compose.
For Debian 13 (as an example), run the following as root:
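# Install the prerequisites for fetching Docker's signing key
apt update
apt install -y ca-certificates curl
# Store the key in a dedicated keyring (apt-key is not available on Debian 13)
install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/debian/gpg -o /etc/apt/keyrings/docker.asc
chmod a+r /etc/apt/keyrings/docker.asc
# Register the Docker repository for trixie, signed by that key
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/debian trixie stable" > /etc/apt/sources.list.d/docker.list
apt update
apt install -y docker-ce docker-compose-plugin
systemctl enable docker && systemctl start docker
# Allow your regular user to run docker without root
usermod -aG docker <YOUR-USER>
su - <YOUR-USER>
Note: Replace <YOUR-USER> with your actual username.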
Navigate to the parent directory where you want to store the project and the data folders. Then run:
cd <parent-dir-path>
mkdir NycTaxiBigFiles
mkdir NycTaxiPins
git clone https://github.com/AngelFelizR/NycTaxi
Your directory structure should look like:
<parent-dir-path>/
├── NycTaxi/            # cloned repository
├── NycTaxiBigFiles/    # large data files (mounted into container)
└── NycTaxiPins/        # pin board storage (mounted into container)
The repository includes a setup.sh script that automates all remaining
steps: pulling the image, starting the container, and configuring SSH
key-based authentication using your existing ~/.ssh/id_rsa.pub.
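Because setup.sh relies on the three sibling folders and on an existing public key, it can help to check both before running it. The following is a minimal sketch, not part of the repository: the preflight function name and its messages are illustrative.

```shell
#!/bin/sh
# Preflight sketch: confirm the sibling folders and the public key exist
# before running setup.sh. The function name and messages are illustrative.
preflight() {
    # usage: preflight PARENT_DIR PUBLIC_KEY_FILE
    rc=0
    for dir in NycTaxi NycTaxiBigFiles NycTaxiPins; do
        if [ ! -d "$1/$dir" ]; then
            echo "missing directory: $1/$dir"
            rc=1
        fi
    done
    # -s is true only when the file exists and is not empty
    if [ ! -s "$2" ]; then
        echo "missing public key: $2 (ssh-keygen can create one)"
        rc=1
    fi
    return "$rc"
}

# Example call from inside the cloned NycTaxi folder:
# preflight .. "$HOME/.ssh/id_rsa.pub"
```

If the key is reported missing, a new pair can be generated with ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa before running the script.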
From inside the NycTaxi folder, run:
cd NycTaxi
chmod +x setup.sh
./setup.sh
The script will:
- Pull the pre-built image angelfelizr/nyc-taxi:4.5.2 from Docker Hub.
- Start the container in detached mode, mapping port 2222 for SSH and mounting the three directories under /root/.
- Register your public key (~/.ssh/id_rsa.pub) inside the container so you can connect without a password.
#!/bin/bash
docker compose pull
docker compose up -d
docker compose cp ~/.ssh/id_rsa.pub nyc-taxi:/root/.ssh/authorized_keys
docker compose exec nyc-taxi chown root:root /root/.ssh/authorized_keys
docker compose exec nyc-taxi chmod 600 /root/.ssh/authorized_keys
echo "Ready! Connect with: ssh NycTaxi"
You can verify the container is running with docker compose ps.
Add the following to your ~/.ssh/config so you can connect with a
simple alias:
Host NycTaxi
HostName 127.0.0.1
User root
Port 2222
IdentityFile ~/.ssh/id_rsa
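If you prefer not to edit the file by hand, the host entry above can be appended from the shell. This is a sketch under the assumption that the entry should be added only once; the add_ssh_alias function is hypothetical and not shipped with the repository.

```shell
#!/bin/sh
# Append the NycTaxi host alias to an SSH config file only if it is not
# already there. The function name is illustrative, not part of the repository.
add_ssh_alias() {
    # usage: add_ssh_alias CONFIG_FILE
    if [ -f "$1" ] && grep -q '^Host NycTaxi$' "$1"; then
        echo "alias already present in $1"
        return 0
    fi
    # The leading blank line keeps the entry separate from existing content.
    cat >> "$1" <<'EOF'

Host NycTaxi
    HostName 127.0.0.1
    User root
    Port 2222
    IdentityFile ~/.ssh/id_rsa
EOF
}

# Example call:
# add_ssh_alias "$HOME/.ssh/config"
```

Running the function twice leaves a single entry in the file, so it is safe to call again after re-cloning the project.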
Then connect with:
ssh NycTaxi
Since direnv is configured via the .envrc file in the repository,
you can use Positron with the SSH remote development feature to work
directly inside the container.
- In Positron, select "Connect to Host..." (or use the Remote Explorer).
- Enter root@localhost:2222 and authenticate using your SSH key (configured in Step 3).
- Once connected, open the folder /root/NycTaxi.
- Install the direnv extension by mkhl from the Open VSX Registry. This extension automatically activates direnv when you open a folder containing an .envrc file.
After the extension loads, you should see a notification confirming that direnv is active. At that point, any terminal you open inside Positron will have the Nix environment loaded automatically.
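One way to confirm that a terminal picked up the Nix environment is to look at where R resolves from. The sketch below assumes the default Nix store location (/nix/store) and that the Nix shell puts store paths directly on PATH, which is typical but not guaranteed for every setup; is_nix_path is a helper name chosen for this example.

```shell
#!/bin/sh
# Check whether a command resolves to a binary inside the Nix store.
# Assumes the default store location, /nix/store.
is_nix_path() {
    case "$1" in
        /nix/store/*) return 0 ;;
        *) return 1 ;;
    esac
}

# "|| true" keeps the script going when R is not on PATH at all.
r_path=$(command -v R || true)
if is_nix_path "$r_path"; then
    echo "R comes from the Nix environment: $r_path"
else
    echo "R is not provided by Nix (found: ${r_path:-nothing})"
fi
```

If the second message appears inside the container, reopening the terminal after the direnv notification is usually the first thing to try.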
To make the R interactive console use the Nix environment instead of
the system default, open the Positron command palette and switch the
active R interpreter to the one provided by the Nix shell. Once
selected, the console will have access to all the R packages defined in
default.nix.
If you need to use the shared pin board, create a cache directory on your host (outside the container) and then, inside R, set up the board as follows:
# On your host (in <parent-dir-path>)
mkdir NycTaxiBoardCache
In your R session (inside the Nix shell), use:
# board_url() comes from the pins package
library(pins)
BoardRemote <- board_url(
  "https://raw.githubusercontent.com/AngelFelizR/NycTaxiPins/refs/heads/main/Board/",
  cache = here::here("../NycTaxiBoardCache")
)
The cache directory is mounted into the container at
/root/NycTaxiBoardCache, so pins will be stored on your host and
persist between container restarts.


