AIS Collision Detection with Apache Spark

This project identifies the most likely vessel collision, or the closest physical proximity event indicating a collision, in Danish AIS data for December 2021. The solution is implemented as a containerized Apache Spark / PySpark application and processes raw AIS CSV files directly inside Docker.

The detected collision-like event is between KARIN HOEJ and MV SCOT CARRIER. The closest AIS-based proximity was detected on 2021-12-13 02:27:43 at approximately 55.2230795, 14.2437065, with an estimated AIS-position distance of 4.08 meters. The project also generates an interactive HTML map and trajectory data for the 20-minute window around the event.

Assignment Scope

The analysis is restricted to the exact period from 2021-12-01 to 2021-12-31. The spatial search area is a circle with a radius of 50 nautical miles around the assignment center coordinate.

Parameter	Value
Center latitude	`55.225000`
Center longitude	`14.245000`
Radius	`50 nautical miles`
Radius in kilometers	`92.6 km`

The raw input data is the Danish AIS dataset from the Danish Maritime Authority AIS archive. The raw AIS files are not included in this repository because of their size. The expected input is the extracted set of daily AIS CSV files for December 2021:

aisdk-2021-12-01.csv
aisdk-2021-12-02.csv
...
aisdk-2021-12-31.csv

The repository contains the source code, Docker configuration, auxiliary port data, final lightweight result artifacts, and map preview images.

Repository Structure

.
├── data/
│   └── ports_dma_region.csv
├── figures/
│   ├── figure1.png
│   └── figure2.png
├── outputs/
│   ├── collision_trajectory_map.html
│   └── final_collision_event.json
├── src/
│   ├── candidates.py
│   ├── config.py
│   ├── events.py
│   ├── geo.py
│   ├── h3_indexing.py
│   ├── main.py
│   ├── ports.py
│   ├── preprocessing.py
│   ├── refinement.py
│   ├── schema.py
│   └── visualization.py
├── Dockerfile
├── requirements.txt
├── .dockerignore
├── .gitignore
├── LICENSE
└── README.md

The src/ directory contains the Spark pipeline. The data/ directory contains auxiliary port coordinates used for excluding normal port-area proximity events. The figures/ directory contains static screenshots of the generated map for GitHub README preview. The outputs/ directory contains lightweight final artifacts that can be inspected without rerunning the full monthly pipeline.

Data Sources

AIS Data

This project is designed to process real-world AIS (Automatic Identification System) records from the Danish Maritime Authority AIS data archive:

http://aisdata.ais.dk/

The analysis uses the daily Danish AIS CSV files for the period from 2021-12-01 to 2021-12-31. Due to the large size of the original dataset, raw AIS files are not included in this repository. The Docker container expects the extracted daily CSV files to be mounted as an external input directory at runtime.

The pipeline is implemented with Apache Spark / PySpark and is intended for large AIS CSV datasets. Raw AIS records are read by Spark, cleaned, spatially filtered, and processed through the collision-detection pipeline inside the container.

Port Data

Auxiliary port location data is sourced from:

https://github.com/tayljordan/ports
Author: Jordan Taylor (GitHub: tayljordan)

At the time of use, no explicit license was provided for this dataset. The data is used for research and educational purposes only.

The port data is used only as an auxiliary dataset for filtering normal port-area proximity events. Close vessel positions inside or near ports often correspond to docking, mooring, harbor maneuvers, or other normal operational patterns rather than open-water collision events. For this project, a small fixed exclusion radius is applied around relevant port coordinates.

The included port file is:

data/ports_dma_region.csv

The port list was manually extended with Hvide Sande using the following coordinates:

Hvide Sande, Central Denmark Region, Denmark
Latitude: 56.002301
Longitude: 8.123969

Methodology

The pipeline starts from raw AIS CSV files and applies a sequence of cleaning, spatial filtering, candidate generation, refinement, and validation steps. The goal is not only to find the minimum distance between two AIS points, but also to avoid false positives caused by ports, pilot operations, stationary adjacency, and GPS noise.

The first stage reads the raw CSV files using a predefined Spark schema. Text placeholders such as Unknown, Undefined, Unknown value, and empty strings are normalized to null values. Invalid AIS numeric values are also cleaned. For example, unavailable ROT values, invalid SOG values, invalid COG values, unavailable headings, and invalid latitude or longitude values are removed or normalized.
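
The cleaning rules above can be illustrated with a minimal plain-Python sketch. It is not the Spark code from src/preprocessing.py: the field names (`name`, `ship_type`, `sog`, `cog`, `heading`) and the exact valid ranges are assumptions chosen for illustration, based on the usual AIS "not available" sentinel values (91 and 181 for position, 102.3 knots for SOG, 360 for COG, 511 for heading).

```python
from typing import Optional

# Placeholder strings named in the text; matching them case-insensitively is an assumption.
TEXT_PLACEHOLDERS = {"unknown", "undefined", "unknown value", ""}


def clean_text(value: Optional[str]) -> Optional[str]:
    """Return None for placeholder strings, otherwise the stripped text."""
    if value is None:
        return None
    stripped = value.strip()
    return None if stripped.lower() in TEXT_PLACEHOLDERS else stripped


def clean_number(value: Optional[float], low: float, high: float) -> Optional[float]:
    """Return None when the value is missing, NaN, or outside [low, high]."""
    if value is None or not (low <= value <= high):
        return None
    return value


def clean_record(rec: dict) -> Optional[dict]:
    """Normalize one AIS record; return None when the position is invalid."""
    lat = clean_number(rec.get("lat"), -90.0, 90.0)
    lon = clean_number(rec.get("lon"), -180.0, 180.0)
    if lat is None or lon is None:
        return None  # a record without a usable position is dropped
    return {
        "name": clean_text(rec.get("name")),
        "ship_type": clean_text(rec.get("ship_type")),
        "lat": lat,
        "lon": lon,
        # Assumed valid ranges; values outside them become None.
        "sog": clean_number(rec.get("sog"), 0.0, 102.2),
        "cog": clean_number(rec.get("cog"), 0.0, 359.9),
        "heading": clean_number(rec.get("heading"), 0.0, 359.0),
    }
```

In this sketch, invalid positions remove the whole record, while invalid speed, course, and heading values are only set to null so the position can still be used.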

The pipeline restricts the dataset to AIS Class A vessels. This keeps the analysis focused on the vessel class most relevant to larger commercial maritime traffic in the assignment dataset.

After basic cleaning, the data is spatially filtered in two stages. First, a bounding box around the assignment circle is used as a cheap coarse filter. Then a Haversine distance calculation keeps only AIS points within the exact 50-nautical-mile radius around the assignment center. This avoids running more expensive spatial logic on the full raw dataset.
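
The two-stage spatial filter can be sketched in plain Python as follows. The center and radius come from the Assignment Scope table; the function names and the mean Earth radius constant are assumptions for illustration, and the real pipeline expresses the same idea with Spark column expressions.

```python
import math

CENTER_LAT, CENTER_LON = 55.225, 14.245  # assignment center
RADIUS_KM = 50 * 1.852                   # 50 nautical miles = 92.6 km
EARTH_RADIUS_KM = 6371.0088              # mean Earth radius (assumed constant)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def bounding_box(lat, lon, radius_km):
    """Smallest lat/lon box that contains the circle (valid away from the poles)."""
    delta = radius_km / EARTH_RADIUS_KM  # angular radius in radians
    dlat = math.degrees(delta)
    dlon = math.degrees(math.asin(math.sin(delta) / math.cos(math.radians(lat))))
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon


BBOX = bounding_box(CENTER_LAT, CENTER_LON, RADIUS_KM)


def in_search_area(lat, lon):
    """Cheap bounding-box test first, exact Haversine test second."""
    lat_min, lat_max, lon_min, lon_max = BBOX
    if not (lat_min <= lat <= lat_max and lon_min <= lon <= lon_max):
        return False
    return haversine_km(CENTER_LAT, CENTER_LON, lat, lon) <= RADIUS_KM
```

The bounding box rejects most distant points with four comparisons, so the trigonometric Haversine call only runs for points that are already close to the circle.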

Port-zone filtering is applied before collision candidate generation. The auxiliary port dataset contains port names, countries, regions, coordinates, and port-radius metadata. In this project, the original broad regional radii are not used directly as collision-exclusion radii because they would remove too much open water. Instead, the pipeline applies a small fixed exclusion radius around relevant port coordinates. This removes many normal harbor-adjacent situations before H3 candidate generation.

The next stage removes GPS anomalies. AIS data can contain sudden GPS jumps, duplicated positions, malformed messages, and physically impossible movement. The pipeline detects jumps by comparing each vessel point to the previous point for the same MMSI and date. The implied speed between consecutive points is calculated from the Haversine distance and the time difference. A point is treated as a GPS jump if the implied speed exceeds a reasonable vessel-speed threshold and the jump distance is non-trivial. Partitioning this check by MMSI and date keeps the monthly preprocessing tractable while still removing the main within-day GPS noise that can create false collision candidates.
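
A minimal plain-Python sketch of the jump check is shown below. The two thresholds are hypothetical values, not the ones used in the repository, and the loop imitates a lag-style comparison against the previous raw point within each (MMSI, date) group.

```python
import math
from collections import defaultdict
from datetime import datetime, timedelta

MAX_SPEED_KNOTS = 60.0  # hypothetical speed threshold
MIN_JUMP_KM = 1.0       # hypothetical minimum jump distance


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers (mean Earth radius assumed)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0088 * math.asin(math.sqrt(a))


def drop_gps_jumps(points):
    """Keep points whose implied speed from the previous raw point is plausible."""
    groups = defaultdict(list)
    for p in points:
        groups[(p["mmsi"], p["ts"].date())].append(p)  # partition by MMSI and date
    kept = []
    for group in groups.values():
        group.sort(key=lambda p: p["ts"])
        prev = None
        for p in group:
            is_jump = False
            if prev is not None:
                hours = (p["ts"] - prev["ts"]).total_seconds() / 3600.0
                dist_km = haversine_km(prev["lat"], prev["lon"], p["lat"], p["lon"])
                speed_kn = dist_km / 1.852 / hours if hours > 0 else float("inf")
                is_jump = speed_kn > MAX_SPEED_KNOTS and dist_km > MIN_JUMP_KM
            if not is_jump:
                kept.append(p)
            prev = p  # always compare against the previous raw point
    return kept


# Usage: the second point is a 55 km spike; it and the return point are flagged.
t0 = datetime(2021, 12, 13, 2, 0, 0)
sample = [
    {"mmsi": 1, "ts": t0, "lat": 55.20, "lon": 14.20},
    {"mmsi": 1, "ts": t0 + timedelta(seconds=10), "lat": 55.70, "lon": 14.20},
    {"mmsi": 1, "ts": t0 + timedelta(seconds=20), "lat": 55.201, "lon": 14.20},
    {"mmsi": 1, "ts": t0 + timedelta(seconds=30), "lat": 55.202, "lon": 14.20},
]
clean = drop_gps_jumps(sample)
```

Because each point is compared with the previous raw point, a single spike removes two points in this sketch: the spike itself and the point that returns to the true track.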

To reduce computational cost, the pipeline does not calculate pairwise distances for all vessels. Instead, it creates a smaller detection dataset containing only moving, identifiable vessels. Points with a null MMSI, null vessel name, null or undefined ship type, non-moving status, pilot ship type, or HSC (high-speed craft) ship type are excluded from candidate generation. Pilot and HSC vessels are excluded because their close approaches usually reflect normal operations such as pilot transfer, crew transfer, ferry-like maneuvers, or service work. In particular, a very close approach between a pilot boat and a cargo or tanker vessel is routine during pilot transfer and does not by itself indicate a collision.

Candidate generation is based on temporal bucketing and H3 spatial indexing. Each AIS point is assigned to a time bucket and an H3 cell. Neighboring H3 cells and neighboring time buckets are considered so that close approaches near cell or bucket boundaries are not missed. Candidate vessel pairs are generated only when two vessels occur in compatible time buckets and nearby H3 cells. Exact Haversine distances are computed only for this reduced candidate set, avoiding an unoptimized Cartesian product across all AIS records.
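
The bucketing idea can be sketched without any third-party library. In the example below a square latitude/longitude grid stands in for H3, so it is not the indexing used by src/h3_indexing.py; the bucket length, the cell size, and the function names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product

TIME_BUCKET_S = 60  # hypothetical bucket length in seconds
CELL_DEG = 0.01     # hypothetical square-cell size in degrees (stand-in for an H3 resolution)


def cell_key(ts_s, lat, lon):
    """Map a point to (time bucket, grid row, grid column)."""
    return (
        math.floor(ts_s / TIME_BUCKET_S),
        math.floor(lat / CELL_DEG),
        math.floor(lon / CELL_DEG),
    )


def candidate_pairs(points):
    """points: iterable of (mmsi, ts_seconds, lat, lon).

    Returns the set of (mmsi_a, mmsi_b) pairs, with mmsi_a < mmsi_b, that share
    the same or a neighboring time bucket and the same or a neighboring cell.
    """
    index = defaultdict(set)
    for mmsi, ts, lat, lon in points:
        index[cell_key(ts, lat, lon)].add(mmsi)

    pairs = set()
    for (bucket, row, col), vessels in list(index.items()):
        # Look at the 3 x 3 x 3 neighborhood so boundary cases are not missed.
        for db, dr, dc in product((-1, 0, 1), repeat=3):
            neighbours = index.get((bucket + db, row + dr, col + dc), ())
            for a in vessels:
                for b in neighbours:
                    if a < b:
                        pairs.add((a, b))
    return pairs


# Usage: vessels 1 and 2 are close in space and time; 3 is far away; 4 is an hour later.
sample = [
    (1, 100, 55.2231, 14.2437),
    (2, 130, 55.2232, 14.2438),
    (3, 100, 56.0000, 15.0000),
    (4, 3700, 55.2231, 14.2437),
]
pairs = candidate_pairs(sample)
```

Only the pairs returned by such a lookup need an exact distance calculation, which is what keeps the search far below a full Cartesian product.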

Candidate pairs are then summarized by vessel pair. Pairs are retained if their minimum distance is below the close-distance threshold, if the close proximity is not persistent for an excessive duration, and if there are enough close observations to avoid single-point ghosts. This shortlist is refined using the original cleaned AIS points rather than the representative H3-bucket points. The refinement stage performs a temporal join for the shortlisted vessel pairs within a ±30 second tolerance and recalculates exact Haversine distances to find the closest observed approach.
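
The refinement step can be illustrated with a small plain-Python sketch. The ±30 second tolerance is taken from the text; the track representation and function names are assumptions, and the nested loop is only reasonable here because it runs on a short list of candidate pairs rather than on the full dataset.

```python
import math

TOLERANCE_S = 30  # the +/-30 second tolerance stated in the text


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters (mean Earth radius assumed)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * 6371008.8 * math.asin(math.sqrt(a))


def closest_approach(track_a, track_b, tolerance_s=TOLERANCE_S):
    """tracks: lists of (ts_seconds, lat, lon) for one vessel each.

    Returns (distance_m, ts_a, ts_b) for the closest pair of points whose
    timestamps differ by at most tolerance_s, or None if no pair qualifies.
    """
    best = None
    for ts_a, lat_a, lon_a in track_a:
        for ts_b, lat_b, lon_b in track_b:
            if abs(ts_a - ts_b) <= tolerance_s:
                d = haversine_m(lat_a, lon_a, lat_b, lon_b)
                if best is None or d < best[0]:
                    best = (d, ts_a, ts_b)
    return best


# Usage: only the points at t=60 and t=50 fall within the tolerance.
track_a = [(0, 55.2200, 14.2400), (60, 55.2230, 14.2437)]
track_b = [(50, 55.2230, 14.2437), (200, 55.2200, 14.2400)]
result = closest_approach(track_a, track_b)
```

The time tolerance matters because AIS messages from two vessels are not synchronized; comparing positions reported minutes apart would produce meaningless small distances.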

The final event selection includes additional validation. Both vessels must have enough AIS trajectory support in the ±10 minute event window. This removes one-point AIS ghosts and ensures that the final visualization contains meaningful trajectories for both vessels. Operational service events are also excluded from the final selection. In particular, pilot-vessel events are excluded, and pairs where both vessels are service craft such as SAR, law enforcement, or pilot vessels are not considered final collision events.
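
The trajectory-support check amounts to counting points per vessel inside the event window. A minimal sketch, assuming timestamps in seconds and a hypothetical minimum of five points per vessel (the repository's actual threshold is not stated here):

```python
WINDOW_S = 10 * 60  # the +/-10 minute event window stated in the text
MIN_POINTS = 5      # hypothetical minimum number of points per vessel


def has_trajectory_support(times_a, times_b, event_ts, min_points=MIN_POINTS):
    """Return True when both vessels have enough AIS points around the event."""
    def count(times):
        return sum(1 for ts in times if abs(ts - event_ts) <= WINDOW_S)

    return count(times_a) >= min_points and count(times_b) >= min_points


# Usage: vessel A reports every minute; the "ghost" vessel has a single point.
times_a = list(range(0, 1200, 60))
ghost = [600]
real = [480, 540, 600, 660, 720]
```

With these inputs, the single-point track fails the check while the five-point track passes it, which is the behavior the validation step relies on to reject AIS ghosts.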

Computational Strategy

The implementation is designed to use Spark transformations efficiently. The expensive exact distance calculation is delayed until after spatial-temporal pruning. H3 indexing and time buckets reduce the search space before Haversine distances are calculated. This is important because a naive global Cartesian product of AIS points would be computationally infeasible and would not satisfy the assignment’s efficiency requirements.

The monthly raw CSV input is large, so the application supports an optional Parquet cache for preprocessed AIS points. This cache is created from the raw CSV files after cleaning, spatial filtering, port filtering, and GPS jump filtering. It does not replace the raw-data pipeline; it only avoids repeated parsing and preprocessing during reruns. A full reproducible run can always be forced with --rebuild-cache.
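
The cache decision described above can be summarized as a small function. This is a sketch of the documented behavior of --cache-dir and --rebuild-cache only; the function name and the rule that any existing directory counts as a usable cache are assumptions.

```python
import os


def cache_mode(cache_dir, rebuild_cache):
    """Return 'disabled', 'rebuild', 'reuse', or 'build' for the given settings."""
    if not cache_dir:
        return "disabled"  # no --cache-dir: always preprocess from raw CSV
    if rebuild_cache:
        return "rebuild"   # --rebuild-cache: ignore any existing cache
    if os.path.isdir(cache_dir):
        return "reuse"     # existing cache directory is read instead of raw CSV
    return "build"         # first run: preprocess raw CSV and write the cache
```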

Final Result

The detected collision-like event is:

Field	Value
Vessel A	`KARIN HOEJ`
MMSI A	`219021240`
IMO A	`8685844`
Ship type A	`Other`
Vessel B	`MV SCOT CARRIER`
MMSI B	`232018267`
IMO B	`9841782`
Ship type B	`Cargo`
Event timestamp	`2021-12-13 02:27:43`
Event latitude	`55.2230795`
Event longitude	`14.2437065`
AIS-position distance	`4.08 m`

The event was detected west of Bornholm in the Baltic Sea. The following figures show the same trajectory window at two zoom levels.

Regional map view

Detailed trajectory view

The interactive HTML map is available in:

outputs/collision_trajectory_map.html

The output JSON for the final event is available in:

outputs/final_collision_event.json

The map visualizes the trajectories of both vessels from 10 minutes before to 10 minutes after the detected closest approach.

Docker Image

The Docker image is published on Docker Hub:

abannikovgeo/ais-collision-detection:latest

Pull the image with:

docker pull abannikovgeo/ais-collision-detection:latest

Alternatively, the image can be built locally from the repository:

docker build -t ais-collision-detection .

Running the Pipeline

The container uses up to three mounted locations. The first is the raw AIS input directory containing the extracted December 2021 CSV files. The second is an optional cache directory for preprocessed Parquet data. The third is the output directory where the final result files are written.

Example Windows PowerShell command for the full December 2021 run:

docker run --rm `
  -v "E:\ais_data:/app/data_raw:ro" `
  -v "D:\ais_cache:/app/cache" `
  -v "C:\BDA\ais-collision-detection\outputs:/app/outputs" `
  abannikovgeo/ais-collision-detection:latest `
    --input "/app/data_raw/*.csv" `
    --ports "/app/data/ports_dma_region.csv" `
    --output "/app/outputs/docker_run_december_2021" `
    --cache-dir "/app/cache/cache_december_2021_clean_points" `
    --rebuild-cache `
    --top-candidates 200 `
    --spark-shuffle-partitions 1000

For a local image built from the repository, replace the image name with:

ais-collision-detection

If the cache already exists and should be reused, omit --rebuild-cache:

docker run --rm `
  -v "E:\ais_data:/app/data_raw:ro" `
  -v "D:\ais_cache:/app/cache" `
  -v "C:\BDA\ais-collision-detection\outputs:/app/outputs" `
  abannikovgeo/ais-collision-detection:latest `
    --input "/app/data_raw/*.csv" `
    --ports "/app/data/ports_dma_region.csv" `
    --output "/app/outputs/docker_run_december_2021" `
    --cache-dir "/app/cache/cache_december_2021_clean_points" `
    --top-candidates 200 `
    --spark-shuffle-partitions 1000

A smaller test run can be performed by restricting the input to selected daily CSV files, either by mounting a folder that contains only those files or by narrowing the --input glob. For example, to process only the December 12–14 files:

docker run --rm `
  -v "E:\ais_data:/app/data_raw:ro" `
  -v "D:\ais_cache:/app/cache" `
  -v "C:\BDA\ais-collision-detection\outputs:/app/outputs" `
  abannikovgeo/ais-collision-detection:latest `
    --input "/app/data_raw/aisdk-2021-12-1[2-4].csv" `
    --ports "/app/data/ports_dma_region.csv" `
    --output "/app/outputs/docker_run_2021_12_12_14" `
    --cache-dir "/app/cache/cache_2021_12_12_14_clean_points" `
    --rebuild-cache `
    --top-candidates 200 `
    --spark-shuffle-partitions 600

Runtime Arguments

Argument	Required	Default	Description
`--input`	Yes	none	Input AIS CSV path or glob inside the container, for example `/app/data_raw/*.csv`.
`--ports`	Yes	none	Path to the port-zone CSV file inside the container. The Docker image includes `/app/data/ports_dma_region.csv`.
`--output`	Yes	none	Output directory where final CSV, JSON, and HTML files are written.
`--top-candidates`	No	`50`	Number of top H3 candidate vessel pairs to refine with the original cleaned AIS points. Higher values are safer for long periods but slower.
`--spark-shuffle-partitions`	No	`200`	Spark SQL shuffle partition count used for joins and aggregations. Larger monthly runs may require higher values such as `600`, `800`, or `1000`.
`--cache-dir`	No	none	Optional Parquet cache directory for preprocessed AIS points. Useful for large runs and repeated experiments.
`--rebuild-cache`	No	false	Rebuilds the Parquet cache from raw CSV even if the cache directory already exists.
`--save-candidates`	No	false	Saves the refined top candidate events. Disabled by default for large runs to avoid extra sorting and output cost.
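
For reference, the arguments in the table map naturally onto an argparse definition. The sketch below reproduces the names, required flags, and defaults from the table; whether src/main.py actually uses argparse, and the help texts, are assumptions.

```python
import argparse


def build_parser():
    """Argument parser mirroring the Runtime Arguments table (illustrative only)."""
    parser = argparse.ArgumentParser(description="AIS collision detection (argument sketch)")
    parser.add_argument("--input", required=True, help="AIS CSV path or glob inside the container")
    parser.add_argument("--ports", required=True, help="port-zone CSV file")
    parser.add_argument("--output", required=True, help="output directory")
    parser.add_argument("--top-candidates", type=int, default=50,
                        help="number of candidate pairs to refine")
    parser.add_argument("--spark-shuffle-partitions", type=int, default=200,
                        help="Spark SQL shuffle partition count")
    parser.add_argument("--cache-dir", default=None, help="optional Parquet cache directory")
    parser.add_argument("--rebuild-cache", action="store_true",
                        help="rebuild the cache even if it already exists")
    parser.add_argument("--save-candidates", action="store_true",
                        help="also save the refined candidate events")
    return parser


# Usage: only the three required arguments are given, so defaults apply to the rest.
args = build_parser().parse_args(
    ["--input", "/app/data_raw/*.csv",
     "--ports", "/app/data/ports_dma_region.csv",
     "--output", "/app/outputs/run"]
)
```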

Output Files

A successful run creates the following outputs:

final_collision_event/
final_collision_event.json
final_collision_trajectory/
collision_trajectory_map.html

If --save-candidates is provided, the run also creates:

refined_candidate_events/

The final_collision_event.json file contains the selected vessel pair, MMSI numbers, IMO numbers, timestamp, coordinates, speed/course fields, and calculated AIS-position distance. The final_collision_trajectory/ directory contains the AIS points for both vessels in the 20-minute visualization window. The collision_trajectory_map.html file contains the interactive map used for visual validation.

Limitations

AIS-based collision detection is inherently approximate because AIS messages are discrete, asynchronous, and sometimes noisy. The detected distance is calculated between reported AIS positions, not between physical hull geometries. A zero or near-zero AIS distance is therefore not automatically treated as a collision. The final event must also pass trajectory-support checks and operational-context filters.

The method excludes pilot-vessel proximity events from the final collision selection because pilot transfer requires very close and intentional approach between a pilot boat and a larger vessel. Port-zone points are also excluded to reduce false positives from docked vessels, harbor operations, and normal adjacency in ports. These choices make the pipeline better suited to identifying collision-like events in open water, but they also mean that actual collisions occurring inside ports or involving pilot/service vessels would require a different configuration.

The GPS anomaly filter is calculated within each MMSI and date partition. This makes the monthly processing more stable and avoids very large month-long per-vessel window operations, but it means that jumps across midnight boundaries are not explicitly compared. This is acceptable for the present collision-detection task because candidate validation and trajectory extraction are focused on short local event windows rather than full-month vessel reconstruction.

Conclusion

The work completed in this project covers the full processing pipeline required by the assignment. Raw Danish AIS CSV files for December 2021 are loaded with Apache Spark, cleaned, restricted to the required 50-nautical-mile search area, filtered for port-zone proximity and GPS anomalies, and then processed through an H3-based spatial-temporal candidate search. The final event is selected only after refined distance calculation, trajectory-support validation, and operational-context filtering.

The final result matches the historically known collision between KARIN HOEJ and MV SCOT CARRIER near Bornholm on 2021-12-13. The generated trajectory visualization provides an additional qualitative check that both vessels were present in the event window and that their tracks converge at the detected closest approach.
