Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
157 changes: 127 additions & 30 deletions NESP/README.md
Original file line number Diff line number Diff line change
@@ -1,43 +1,140 @@
# Intro
# NESP 5.9 — User Code Library

## Pre-requisites
Python package and environment manager. We highly recommend [uv](https://docs.astral.sh/uv/#installation).
This directory contains notebooks and utilities for loading, exploring, and analysing the **NESP 5.9 marine datasets** stored in AWS S3.

## Usage
1. Navigate to the NESP directory of the repo
Each dataset has its own subdirectory with a dedicated README, a Python notebook (`.ipynb`), and an R notebook (`.Rmd`).

2. Create and activate the python environment
---

## Datasets

| Dataset | Description | Folder | S3 Key |
|---------|-------------|--------|--------|
| AMSA | AIS Vessel Tracking (2012–2025) | [`amsa/`](amsa/) | `stored/datauplift/amsa/year=*/source=*/*.parquet` |
| Kelp | Squidle+ Kelp Annotations | [`kelp/`](kelp/) | `stored/datauplift/kelp/kelp.parquet` |
| NRMN | NRMN Reef Life Surveys | [`nrmn/`](nrmn/) | `stored/datauplift/nrmn/nrmn.parquet` |
| Seabird | Seabird Observations and Tracking | [`seabird/`](seabird/) | `stored/datauplift/seabird/seabird.parquet` |
| Seagrass | Seagrass Surveys | [`seagrass/`](seagrass/) | `stored/datauplift/seagrass/seagrass.parquet` |
| Dugongs | Dugong Sightings and Distribution | [`dugongs/`](dugongs/) | `stored/datauplift/dugongs/dugongs.parquet` |

All datasets are publicly accessible in the `data-uplift-public` S3 bucket (region: `ap-southeast-2`).

---

## Repository Structure

```
NESP/
├── nesp/
│ ├── __init__.py
│ └── util.py # Shared utilities (H3 mapping, color scales, schema display)
├── h3.md # H3 hexagonal indexing reference
├── pyproject.toml # Python project and dependency definition
├── requirements.txt # Pinned Python dependencies
├── uv.lock # uv lockfile
├── OffshoreRenewable_Energy_Infrastructure_Regions.zip # Shapefile used in spatial analyses
├── amsa/
│ ├── README.md
│ ├── amsa.ipynb
│ └── amsa.Rmd
├── kelp/
│ └── README.md
├── nrmn/
│ └── README.md
├── seabird/
│ ├── README.md
│ ├── seabird.ipynb
│ └── seabird.Rmd
├── seagrass/
│ ├── README.md
│ ├── seagrass.ipynb
│ └── seagrass.Rmd
└── dugongs/
└── README.md
```

---

## Python Environment Setup

Python 3.12 or later is required. We recommend [uv](https://docs.astral.sh/uv/) for environment and package management.

### With `uv` (recommended)

```bash
# From the NESP/ directory:
uv venv && source .venv/bin/activate && uv pip install .
```

### With `pip`

```bash
uv venv \
&& source .venv/bin/activate \
&& uv pip install .
python -m venv .venv && source .venv/bin/activate && pip install .
```

3. Launch the jupyter server
### Running notebooks

```bash
jupyter notebook
```

## NESP 5.9 Datasets

| Dataset name | Description | Metadata | S3 URL |
| ------------ | ----------- | -------- | ------ |
| AMSA | AMSA Vessel Tracking | - | [link](s3://data-uplift-public/stored/datauplift/amsa/) |
| Kelp | Squidle+ Kelp Annotations | - | [link](s3://data-uplift-public/stored/datauplift/kelp/kelp.parquet) |
| NRMN | NRMN Reef Life Surveys | - | [link](s3://data-uplift-public/stored/datauplift/nrmn/nrmn.parquet) |
| Seabird | Seabird Observations and Tracking | - | [link](s3://data-uplift-public/stored/datauplift/seabird/seabird.parquet) |
| Seagrass | Seagrass Surveys | - | [link](s3://data-uplift-public/stored/datauplift/seagrass/seagrass.parquet) |

### S3 Details\n"
NESP 5.9 datasets are currently stored in S3 and are publicly available as per the following table:\n"

| Bucket | Key | Partitioned |
| -------------------- | -------------------------------------------------------- | ----------- |
| `data-uplift-public` | `stored/datauplift/amsa/year=*/source=*/*.parquet` | [x] |
| `data-uplift-public` | `stored/datauplift/kelp/kelp.parquet` | [ ] |
| `data-uplift-public` | `stored/datauplift/nrmn/nrmn.parquet` | [ ] |\n",
| `data-uplift-public` | `stored/datauplift/seabird/seabird.parquet` | [ ] |
| `data-uplift-public` | `stored/datauplift/seagrass/seagrass.parquet` | [ ] |
Or open any `.ipynb` in your IDE and select the `.venv` as the kernel.

### Key Python packages

| Package | Purpose |
|---------|---------|
| `pyarrow` | Parquet I/O and S3 dataset connection |
| `polars` | DataFrame and LazyFrame computation |
| `polars-h3` | H3 spatial indexing within Polars |
| `polars-st` | Spatial (geometry) operations in Polars |
| `pydeck` | GPU-accelerated H3 hexagon map rendering |
| `h3` | H3 indexing for polygon-to-cell conversion |
| `geopandas` | Shapefile loading and CRS handling |
| `matplotlib` / `seaborn` | Statistical plotting |
| `rich` | Schema and table display in notebooks |

> **Note:** Shared utilities live in the `nesp` package (`nesp/util.py`) and are installed when you run `uv pip install .` or `pip install .`. Notebooks import them as `from nesp import util` — no path manipulation needed.

---

## R Environment Setup

R 4.x or later is required. Install the following packages from CRAN and GitHub before running any `.Rmd` notebook.

### CRAN packages

```r
install.packages(c(
"arrow", # S3 dataset connection and Parquet I/O
"sf", # Spatial features
"dplyr", # Data manipulation
"tidyr", # Data reshaping
"stringr", # String operations
"lubridate", # Date/time handling
"ggplot2", # Plotting
"leaflet" # Interactive maps
))
```

### H3 for R (from GitHub)

The `h3-r` package is recommended as it bundles the underlying C library automatically:

```r
# install.packages("remotes")
remotes::install_github("crazycapivara/h3-r")
```

> See the [h3-r documentation](https://crazycapivara.github.io/h3-r/articles/h3.html) for usage examples.

### Running notebooks

Open any `.Rmd` file in RStudio and click **Knit**, or run chunks interactively.

---

## H3 Spatial Indexing

All datasets are spatially aggregated using the [Uber H3](https://h3geo.org/) hexagonal grid system. See [`h3.md`](h3.md) for an overview of H3 concepts, resolution levels, and why hexagons are used over traditional grids.
26 changes: 26 additions & 0 deletions NESP/amsa/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
# AMSA — Vessel Tracking

## Description
AIS (Automatic Identification System) vessel tracking records collected by the Australian Maritime Safety Authority (AMSA), covering Australian waters from 2012 to 2025.

## Dataset Details

| Property | Value |
|-------------|-------|
| Bucket | `data-uplift-public` |
| Key | `stored/datauplift/amsa/year=*/source=*/*.parquet` |
| Partitioned | Yes (by year and source) |
| Format | Parquet |

## Notebooks

| Notebook | Language | Description |
|----------|----------|-------------|
| `amsa.ipynb` | Python | Vessel density mapping with H3 aggregation, longitudinal data health analysis, and regional deep-dives (e.g. Sydney Harbour) |
| `amsa.Rmd` | R | Equivalent analysis using the `arrow`, `h3`, and `ggplot2` R packages |

## Dataset-Specific Notes
- The full dataset exceeds typical RAM limits. The Python notebook uses a Polars `LazyFrame` and streaming aggregation to avoid loading all data into memory.
- Vessel density follows a power-law distribution. The Python notebook applies a log₁₀ transform before coloring to make both busy ports and open-ocean routes visible.
- Records include an `australianMarineRegionsTags` column for pre-computed spatial region labels.
- The AMSA dataset is already indexed at H3 Resolution 8 (`h3Index` column). For finer-grained regional analysis, re-index to Resolution 9–12.
File renamed without changes.
4 changes: 2 additions & 2 deletions NESP/amsa.ipynb → NESP/amsa/amsa.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -160,7 +160,7 @@
"import pyarrow\n",
"import pyarrow.fs\n",
"import pyarrow.dataset\n",
"import util\n",
"from nesp import util\n",
"\n",
"# --- FileSystem Configuration ---\n",
"# Initialize S3FileSystem for ap-southeast-2 region.\n",
Expand All @@ -185,7 +185,7 @@
"rich.print('Total Inspected files: ',len(list(ds.get_fragments())))\n",
"\n",
"# Display schema definition and units.\n",
"schema_rich_table = util.generate_schema_rich_table(\n",
"schema_rich_table = util.print_schema_rich_table(\n",
" schema=ds.schema,\n",
" metadata_keys=[\n",
" \"definition\",\n",
Expand Down
22 changes: 22 additions & 0 deletions NESP/dugongs/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
# Dugongs

## Description
Dugong (*Dugong dugon*) sighting and distribution records from Australian marine surveys. Dugongs are a protected marine mammal and an indicator species for seagrass ecosystem health.

## Dataset Details

| Property | Value |
|-------------|-------|
| Bucket | `data-uplift-public` |
| Key | `stored/datauplift/dugongs/dugongs.parquet` *(placeholder — confirm S3 path)* |
| Partitioned | No |
| Format | Parquet |

## Notebooks

> Notebooks for this dataset are yet to be created. Use the existing `amsa`, `seabird`, or `seagrass` notebooks as a reference template.

| Notebook | Language | Description |
|----------|----------|-------------|
| `dugongs.ipynb` | Python | *(to be created)* |
| `dugongs.Rmd` | R | *(to be created)* |
22 changes: 22 additions & 0 deletions NESP/kelp/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
# Kelp — Squidle+ Annotations

## Description
Kelp presence/absence annotations sourced from the Squidle+ platform, covering benthic imagery surveys across Australian coastal waters.

## Dataset Details

| Property | Value |
|-------------|-------|
| Bucket | `data-uplift-public` |
| Key | `stored/datauplift/kelp/kelp.parquet` |
| Partitioned | No |
| Format | Parquet |

## Notebooks

> Notebooks for this dataset are yet to be created. Use the existing `seagrass` or `seabird` notebooks as a reference template.

| Notebook | Language | Description |
|----------|----------|-------------|
| `kelp.ipynb` | Python | *(to be created)* |
| `kelp.Rmd` | R | *(to be created)* |
Empty file added NESP/nesp/__init__.py
Empty file.
File renamed without changes.
22 changes: 22 additions & 0 deletions NESP/nrmn/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
# NRMN — Reef Life Surveys

## Description
Reef Life Survey (RLS) data from the National Reef Monitoring Network (NRMN), covering fish and invertebrate abundance, species richness, and community composition across Australian reef systems.

## Dataset Details

| Property | Value |
|-------------|-------|
| Bucket | `data-uplift-public` |
| Key | `stored/datauplift/nrmn/nrmn.parquet` |
| Partitioned | No |
| Format | Parquet |

## Notebooks

> Notebooks for this dataset are yet to be created. Use the existing `seabird` or `seagrass` notebooks as a reference template.

| Notebook | Language | Description |
|----------|----------|-------------|
| `nrmn.ipynb` | Python | *(to be created)* |
| `nrmn.Rmd` | R | *(to be created)* |
7 changes: 7 additions & 0 deletions NESP/pyproject.toml
Original file line number Diff line number Diff line change
@@ -1,3 +1,10 @@
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build.targets.wheel]
packages = ["nesp"]

[project]
name = "nesp"
version = "1.0.0"
Expand Down
24 changes: 24 additions & 0 deletions NESP/seabird/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
# Seabird — Observations and Tracking

## Description
Seabird observation and tracking records aggregated from Australian monitoring programs. Covers species occurrence, movement, and spatial distribution across Australian marine environments.

## Dataset Details

| Property | Value |
|-------------|-------|
| Bucket | `data-uplift-public` |
| Key | `stored/datauplift/seabird/seabird.parquet` |
| Partitioned | No |
| Format | Parquet |

## Notebooks

| Notebook | Language | Description |
|----------|----------|-------------|
| `seabird.ipynb` | Python | Data exploration, H3 spatial aggregation, and mapping of seabird distribution |
| `seabird.Rmd` | R | Equivalent analysis using the `arrow`, `h3`, and `ggplot2` R packages |

## Authors
- Thomas Galindo (Python notebook)
- Denisse Fierro Arcos (R translation)
File renamed without changes.
41 changes: 2 additions & 39 deletions NESP/seabird.ipynb → NESP/seabird/seabird.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -136,44 +136,7 @@
"id": "6",
"metadata": {},
"outputs": [],
"source": [
"import pyarrow\n",
"import pyarrow.fs\n",
"import pyarrow.dataset\n",
"import util\n",
"\n",
"# Construct the anonymous file system responsible for reading from the public S3 bucket\n",
"FILE_SYSTEM = pyarrow.fs.S3FileSystem(\n",
" region=\"ap-southeast-2\", \n",
" anonymous=True,\n",
")\n",
"\n",
"# Create the dataset connection\n",
"# By convention, datasets are labelled `ds`\n",
"ds = pyarrow.dataset.dataset(\n",
" source=\"data-uplift-public/stored/datauplift/seabird/seabird.parquet\",\n",
" filesystem=FILE_SYSTEM,\n",
")\n",
"\n",
"# The comprising files of the dataset can be inspected\n",
"print(list(ds.get_fragments()))\n",
"\n",
"# The schema of the dataset can be inspected\n",
"util.print_schema_rich_table(\n",
" schema=ds.schema,\n",
" metadata_keys=[\"definition\", \"units\", \"long_name\", \"resolution\"]\n",
")\n",
"\n",
"# For the full table scheme use native pyarrow representation\n",
"# rich.print(ds.schema)\n",
"\n",
"# Estimate dataset size\n",
"estimated_megabytes = util.estimate_dataset_size(\n",
" ds=ds,\n",
" n_samples=10_000,\n",
")\n",
"print(f\"DS estimated size: {round(estimated_megabytes, 2)}MB\")"
]
"source": "import pyarrow\nimport pyarrow.fs\nimport pyarrow.dataset\nfrom nesp import util\n\n# Construct the anonymous file system responsible for reading from the public S3 bucket\nFILE_SYSTEM = pyarrow.fs.S3FileSystem(\n region=\"ap-southeast-2\", \n anonymous=True,\n)\n\n# Create the dataset connection\n# By convention, datasets are labelled `ds`\nds = pyarrow.dataset.dataset(\n source=\"data-uplift-public/stored/datauplift/seabird/seabird.parquet\",\n filesystem=FILE_SYSTEM,\n)\n\n# The comprising files of the dataset can be inspected\nprint(list(ds.get_fragments()))\n\n# The schema of the dataset can be inspected\nutil.print_schema_rich_table(\n schema=ds.schema,\n metadata_keys=[\"definition\", \"units\", \"long_name\", \"resolution\"]\n)\n\n# For the full table scheme use native pyarrow representation\n# rich.print(ds.schema)\n\n# Estimate dataset size\nestimated_megabytes = util.estimate_dataset_size(\n ds=ds,\n n_samples=10_000,\n)\nprint(f\"DS estimated size: {round(estimated_megabytes, 2)}MB\")"
},
{
"cell_type": "markdown",
Expand Down Expand Up @@ -1332,4 +1295,4 @@
},
"nbformat": 4,
"nbformat_minor": 5
}
}
24 changes: 24 additions & 0 deletions NESP/seagrass/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
# Seagrass — Surveys

## Description
Seagrass survey records from Australian coastal and marine environments, covering species presence, cover, and site-level habitat assessments.

## Dataset Details

| Property | Value |
|-------------|-------|
| Bucket | `data-uplift-public` |
| Key | `stored/datauplift/seagrass/seagrass.parquet` |
| Partitioned | No |
| Format | Parquet |

## Notebooks

| Notebook | Language | Description |
|----------|----------|-------------|
| `seagrass.ipynb` | Python | Data exploration, H3 spatial aggregation, and mapping of seagrass survey sites |
| `seagrass.Rmd` | R | Equivalent analysis using the `arrow`, `h3`, and `ggplot2` R packages |

## Authors
- Thomas Galindo (Python notebook)
- Denisse Fierro Arcos (R translation)
File renamed without changes.
Loading