GitHub - anyscale/transaction-foundation-model-with-ray

Build Your Own Transaction Foundation Model with Ray on Anyscale

This repository is based on NVIDIA's Build Your Own Transaction Foundation Model developer example. It runs the same five-notebook pipeline and produces the same results, but swaps the single-node execution engine for a distributed one: Ray Data handles streamed, out-of-core ingest, tokenization, and inference, and Ray Train handles distributed, fault-tolerant pretraining. NVIDIA RAPIDS cuDF/cuML and the HuggingFace decoder are unchanged from the original; they are now driven by Ray and run inside Ray tasks/actors on GPU workers.

Financial transaction data is one of the richest signals available in the enterprise. Every swipe, transfer, and payment encodes patterns of human behavior, from daily spending habits to subtle shifts that precede fraud. Traditional approaches rely on hand-crafted features and rules that are brittle, slow to adapt, and blind to the deep sequential structure in transaction histories. Foundation models change this equation. By pretraining on large volumes of unlabeled transaction sequences, they learn general-purpose representations of financial behavior that transfer to a wide range of downstream tasks, such as fraud detection, anomaly scoring, customer segmentation, and personalized financial services.

This developer example shows how to build such a model end-to-end, distributed across a GPU cluster with Ray:

Custom GPU-accelerated tokenizer on Ray Data: The original modular, RAPIDS-powered tokenizer (src/tokenizer/) is streamed through Ray Data (batch_format="cudf"), converting heterogeneous tabular fields (merchant category, amount, time deltas, and more) into domain-specific token sequences across the cluster. The pipeline stays flexible: swap or add tokenizer components to match any transaction schema.
Distributed pretraining with Ray Train: The same decoder-only foundation model is trained with causal language modeling, but the training loop runs under Ray Train's TorchTrainer. This gives cross-node data-parallel training (DDP), fault tolerance, checkpointing, and elastic scaling out of the box, from a single GPU to a multi-node cluster, while keeping a standard HuggingFace/PyTorch training step. The model is unchanged (Llama by default), and the pipeline remains architecture-agnostic.
Embedding extraction and downstream evaluation: Learned embeddings are extracted via last-token pooling in a pipelined Ray Data job (tokenize ‖ GPU infer) and evaluated on fraud detection with XGBoost, demonstrating clear lift over hand-crafted feature baselines.
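
To make "last-token pooling" concrete, here is a minimal pure-Python sketch. It is not the repository's implementation (which runs on GPU tensors inside Ray Data); the function name and the nested-list layout are assumptions chosen for illustration. The idea is that in a causal decoder the final real token has attended to the whole sequence, so its hidden state serves as the sequence embedding.

```python
def last_token_pool(hidden_states, attention_mask):
    """Pick, for each sequence, the hidden vector of its last non-padding token.

    hidden_states:  [batch][seq_len][hidden_size] nested lists (illustrative layout).
    attention_mask: [batch][seq_len] with 1 for real tokens and 0 for padding.
    Raises ValueError if a sequence contains no real tokens.
    """
    pooled = []
    for states, mask in zip(hidden_states, attention_mask):
        # Index of the last position that holds a real token.
        last = max(i for i, m in enumerate(mask) if m == 1)
        pooled.append(states[last])
    return pooled


# Two toy sequences with hidden size 2; the first is padded at the end.
hidden = [
    [[0.1, 0.2], [0.3, 0.4], [0.0, 0.0]],
    [[1.0, 1.1], [1.2, 1.3], [1.4, 1.5]],
]
mask = [
    [1, 1, 0],
    [1, 1, 1],
]
embeddings = last_token_pool(hidden, mask)
print(embeddings)
```

In the actual pipeline the pooled vectors are 512-dimensional, matching the model's hidden size.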

Blueprint architecture (NVIDIA):

Ray / Anyscale execution of the same pipeline:

Software Components

Ray (Anyscale):

Ray Data: Streamed, distributed, out-of-core data ingest, tokenization, and batch GPU inference
Ray Train: Distributed, fault-tolerant pretraining (TorchTrainer, DDP) with checkpointing
Anyscale Workspace: Managed Ray cluster with on-demand GPU autoscaling

NVIDIA RAPIDS:

NVIDIA RAPIDS (cuDF, cuML): GPU-accelerated data processing, tokenization, and UMAP

3rd Party Software:

PyTorch 2.x: Deep learning framework
HuggingFace Transformers: Model definition, checkpointing, and loading
XGBoost: Gradient-boosted trees for fraud detection
scikit-learn: Classical ML preprocessing, metrics, and baseline utilities
pandas: CPU dataframe operations and interoperability with GPU pipelines
NumPy: Array operations used across preprocessing and inference
CuPy: GPU array operations for tokenizer and embedding workflows
matplotlib: Static visualizations
plotly: Interactive 3D embedding visualization

Notebooks

| # | Notebook | Description |
|---|---|---|
| 1 | `01_dataset_baseline_ray.ipynb` | Ingest the TabFormer financial transaction dataset with Ray Data, run cuDF feature engineering, create temporal train/val/test splits, and train a GPU-accelerated XGBoost baseline for fraud detection. |
| 2 | `02_seq_preproc_tokenization_ray.ipynb` | Stream the custom GPU-accelerated cuDF tokenizer through Ray Data to convert transaction records into domain-specific token sequences (Parquet). |
| 3 | `03_foundation_model_training_ray.ipynb` | Pretrain a decoder-only foundation model (~29M parameters) on tokenized sequences using Ray Train (`TorchTrainer`, cross-node DDP) with causal language modeling. |
| 4 | `04_inference_embedding_extraction_ray.ipynb` | Load the pretrained model, run pipelined Ray Data GPU inference, extract 512-dimensional embeddings via last-token pooling, and visualize with cuML UMAP (2D + interactive 3D). |
| 5 | `05_xgboost_fraud_detection_ray.ipynb` | Compare XGBoost fraud detection using raw features, foundation model embeddings, and combined features. |
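
Notebook 01 creates temporal splits, meaning rows are assigned to train, validation, and test by time rather than at random, so the model is always evaluated on transactions later than those it trained on. A minimal sketch of the idea follows; the column name `timestamp` and the cutoff dates are hypothetical, and the notebook itself does this with cuDF rather than plain Python.

```python
from datetime import datetime


def temporal_split(rows, val_start, test_start, time_key="timestamp"):
    """Split rows into train/val/test by time, never by random shuffling."""
    ordered = sorted(rows, key=lambda r: r[time_key])
    train = [r for r in ordered if r[time_key] < val_start]
    val = [r for r in ordered if val_start <= r[time_key] < test_start]
    test = [r for r in ordered if r[time_key] >= test_start]
    return train, val, test


# Toy rows; the field names are illustrative only.
rows = [
    {"timestamp": datetime(2019, 3, 1), "amount": 12.50},
    {"timestamp": datetime(2018, 6, 1), "amount": 80.00},
    {"timestamp": datetime(2019, 9, 1), "amount": 5.25},
    {"timestamp": datetime(2017, 1, 1), "amount": 42.00},
]
train, val, test = temporal_split(
    rows,
    val_start=datetime(2019, 1, 1),   # hypothetical cutoff
    test_start=datetime(2019, 7, 1),  # hypothetical cutoff
)
print(len(train), len(val), len(test))
```

Splitting by time avoids leaking future behavior into training, which matters for fraud detection because fraud patterns drift.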

Quickstart

Create an Anyscale Workspace with GPU workers (see Create an Anyscale Workspace).
Clone this repository into the workspace and install the head-node dependencies:
```
pip install -r requirements.txt
```
Open 01_dataset_baseline_ray.ipynb and run it top to bottom, then continue through notebooks 02–05 sequentially.

Deployment

Prerequisites

| Component | Requirement |
|---|---|
| Platform | Anyscale account (or any Ray 2.55+ cluster with GPU nodes) |
| GPU | 2× NVIDIA A10G (or larger) GPU worker nodes; a GPU head node is recommended for the cuML UMAP step in NB04 |
| Ray | 2.55.1 |
| Python | 3.12 |
| Image | `anyscale/ray:2.55.1-slim-py312-cu129` |
| CUDA | CUDA 12.9 (provided by the Anyscale Ray image) |

Create an Anyscale Workspace

Create an Anyscale Workspace with the configuration below, then open its IDE:

| Setting | Value |
|---|---|
| Ray image | `anyscale/ray:2.55.1-slim-py312-cu129` (Ray 2.55.1, Python 3.12, CUDA 12.9) |
| Head node | `1xA10G` (NVIDIA A10G, 23 GB); gives the head a GPU for the NB04 cuML UMAP step |
| Worker group | `1xa10g-64cpu-256gb`, min 0 / max 2 (each a single-GPU A10G node → 2-way cross-node DDP in NB03) |
| Region | `us-west-2` |
| Autoscaling | Enabled; workers scale up on the first GPU op and scale back to 0 when idle |

Anyscale automatically provisions shared cluster storage at /mnt/cluster_storage, which the notebooks use to pass data between stages.

Steps

Open the workspace IDE and clone this repository (or upload it).
Install the head-node dependencies:
```
pip install -r requirements.txt
```
GPU worker dependencies (torch, transformers, cuDF, CuPy, XGBoost) are installed automatically on each worker node via the Ray runtime_env declared in src/ray_common.py; you do not install them on the head.
Open and run the notebooks in order (01 → 05), each from top to bottom, in the workspace IDE. Notebooks 02–05 are idempotent: a re-run skips any stage whose output already exists on shared storage. To force a fresh run of a stage, delete its directory under /mnt/cluster_storage/tfm_ray/.
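
The skip-if-output-exists behavior can be pictured with the sketch below. It is only an illustration of the pattern: the helper name `run_stage` and the `_SUCCESS` marker file are assumptions, and the notebooks may detect finished stages differently (for example by checking for the output directory itself). A temporary directory stands in for /mnt/cluster_storage/tfm_ray/.

```python
import shutil
import tempfile
from pathlib import Path


def run_stage(output_dir: Path, stage_fn) -> bool:
    """Run stage_fn(output_dir) unless the stage already completed.

    Returns True if the stage ran and False if it was skipped.
    The _SUCCESS marker is a hypothetical convention for this sketch.
    """
    marker = output_dir / "_SUCCESS"
    if marker.exists():
        return False  # output already present: skip
    output_dir.mkdir(parents=True, exist_ok=True)
    stage_fn(output_dir)
    marker.touch()  # written last, so a crashed run is retried
    return True


calls = []


def fake_tokenize(out_dir: Path) -> None:
    # Stand-in for an expensive stage such as tokenization.
    (out_dir / "tokens.txt").write_text("MCC_5411 AMT_1")
    calls.append("tokenize")


with tempfile.TemporaryDirectory() as root:
    stage_dir = Path(root) / "tfm_ray" / "tokenized"
    first = run_stage(stage_dir, fake_tokenize)   # runs the stage
    second = run_stage(stage_dir, fake_tokenize)  # skipped on re-run
    shutil.rmtree(stage_dir)                      # force a fresh run
    third = run_stage(stage_dir, fake_tokenize)   # runs again

print(first, second, third)
```

Writing the marker only after the stage finishes means a stage interrupted halfway is not mistaken for a completed one.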

Data. By default the notebooks generate a self-contained synthetic dataset (src/data_gen.py) that reproduces the TabFormer schema, so the pipeline runs anywhere with no download. To run on the real TabFormer dataset, point the notebooks at the CSV:

export TFM_REAL_CSV=/path/to/card_transaction.v1.csv
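
One way a notebook could honor this variable is sketched below. The function name and return shape are hypothetical; only the variable name TFM_REAL_CSV and the synthetic default come from this README. Note that an `export` in a shell only affects processes started from that shell, so the variable likely needs to be set before the notebook kernel is launched.

```python
import os


def resolve_data_source(env=None):
    """Return ("real", path) if TFM_REAL_CSV is set, else ("synthetic", None)."""
    env = os.environ if env is None else env
    csv_path = env.get("TFM_REAL_CSV", "").strip()
    if csv_path:
        return ("real", csv_path)
    return ("synthetic", None)  # fall back to the generated dataset


# Passing a dict instead of os.environ keeps the example side-effect free.
default_source = resolve_data_source({})
real_source = resolve_data_source(
    {"TFM_REAL_CSV": "/path/to/card_transaction.v1.csv"}
)
print(default_source, real_source)
```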

How It Works

The head node orchestrates Ray; GPU code (cuDF, PyTorch, Transformers, XGBoost) runs inside Ray tasks/actors on GPU workers and returns plain Python objects to the head. Worker dependencies are declared once in a Ray runtime_env (src/ray_common.py) and installed per node on first use. Data flows between notebooks through shared cluster storage (/mnt/cluster_storage/tfm_ray). The runtime environments are split as follows:

JOB_RUNTIME_ENV ships the src package to every node via py_modules (code only).
GPU_RUNTIME_ENV / TRAIN_JOB_ENV install the GPU wheels (torch, transformers, cuDF, …) on worker nodes.
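
As a rough picture of that split, the two environments could look like the dictionaries below. The keys `py_modules` and `pip` are standard Ray runtime_env fields, but the contents are assumptions: the package names and the absence of version pins are illustrative, and the real definitions live in src/ray_common.py.

```python
# Code only: ship the local `src` package to every node.
JOB_RUNTIME_ENV = {
    "py_modules": ["src"],  # path to a local directory (assumed layout)
}

# Heavy GPU wheels: installed on worker nodes on first use.
# ASSUMPTION: package names and missing version pins are illustrative only.
GPU_RUNTIME_ENV = {
    "pip": [
        "torch",
        "transformers",
        "cudf-cu12",
        "cupy-cuda12x",
        "xgboost",
    ],
}

# Typical usage (not executed here):
#   ray.init(runtime_env=JOB_RUNTIME_ENV)              # job-level environment
#   SomeActor.options(runtime_env=GPU_RUNTIME_ENV)     # per-actor environment
print(sorted(JOB_RUNTIME_ENV), sorted(GPU_RUNTIME_ENV))
```

Keeping the code-only environment separate from the wheel-installing one means the head node never has to install the GPU stack.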

GPU library code is loaded lazily inside the Ray actors so the modules stay importable on a CPU head. The one head-side GPU step is NB04's cuML UMAP visualization.
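
The lazy-loading pattern can be sketched as follows. The class name is hypothetical and the standard-library `statistics` module stands in for a GPU library so the example runs anywhere; in the repository the deferred imports would be libraries such as cuDF or PyTorch, loaded inside the Ray actor.

```python
import importlib


class LazyLibraryWorker:
    """Defers importing a heavy library until the first real call."""

    def __init__(self, module_name: str):
        self.module_name = module_name
        self._module = None  # nothing is imported at construction time

    def _lib(self):
        # Import on first use, then cache the module object.
        if self._module is None:
            self._module = importlib.import_module(self.module_name)
        return self._module

    def call(self, func_name: str, *args):
        return getattr(self._lib(), func_name)(*args)


# Constructing a worker for a GPU library is safe on a CPU-only machine,
# because the import has not happened yet.
gpu_worker = LazyLibraryWorker("cudf")

# Stand-in run using a stdlib module so the sketch is executable anywhere.
stand_in = LazyLibraryWorker("statistics")
result = stand_in.call("mean", [1, 2, 3])
print(result)
```

With Ray Data, a class like this would typically be passed to `map_batches` as a callable class, so the import cost is paid once per actor on the GPU worker rather than on the head.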

Customization

The example is designed for extensibility:

Tokenizer: The modular tokenizer pipeline (src/tokenizer/, unchanged from the original) can be adapted to different transaction schemas by adding or replacing individual tokenizer components. The Ray Data wrapper is in src/ray_tokenize.py.
Model Architecture: The decoder configuration lives in src/ray_common.py (MODEL_CONFIG). Swap in any HuggingFace-compatible decoder architecture by editing it.
Training scale: In 03_foundation_model_training_ray.ipynb, change NUM_WORKERS to scale the DDP world size across more GPUs/nodes, and change MAX_STEPS to set the training budget. Ray Train handles the distribution.
Cluster scale: Raise the worker group's max replicas to ingest, tokenize, and train over more data without code changes.
Downstream Tasks: Replace XGBoost with any classifier that accepts fixed-length feature vectors.
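
To illustrate what "adding or replacing individual tokenizer components" can mean, here is a small pure-Python sketch of a component-based tokenizer. It does not reproduce the API in src/tokenizer/; the class names, field names, bucket edges, and token formats are all hypothetical, and the real components operate on cuDF columns rather than single records.

```python
import bisect


class CategoryTokenizer:
    """Maps a categorical field to a token such as MCC_5411."""

    def __init__(self, field, prefix):
        self.field, self.prefix = field, prefix

    def __call__(self, record):
        return f"{self.prefix}_{record[self.field]}"


class BucketTokenizer:
    """Maps a numeric field to the index of the bucket it falls into."""

    def __init__(self, field, prefix, edges):
        self.field, self.prefix, self.edges = field, prefix, sorted(edges)

    def __call__(self, record):
        bucket = bisect.bisect_right(self.edges, record[self.field])
        return f"{self.prefix}_{bucket}"


def tokenize(record, components):
    """Apply each component in order; one token per component."""
    return [component(record) for component in components]


# Adapting to a new schema means editing this list, not the pipeline.
components = [
    CategoryTokenizer("mcc", "MCC"),
    BucketTokenizer("amount", "AMT", edges=[10, 50, 200]),
    BucketTokenizer("minutes_since_prev", "DT", edges=[60, 1440]),
]
record = {"mcc": 5411, "amount": 42.0, "minutes_since_prev": 90}
tokens = tokenize(record, components)
print(tokens)
```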

Model Architecture

The included example uses a Llama decoder architecture, but any HuggingFace-compatible decoder model works. The configuration is defined in src/ray_common.py.

| Parameter | Value |
|---|---|
| Architecture | Llama (decoder-only transformer) |
| Parameters | ~29M |
| Hidden size | 512 |
| Layers | 8 |
| Attention | Grouped Query Attention (8 query heads, 2 KV heads) |
| Context window | 8,192 tokens (RoPE) |
| Activation | SwiGLU |
| Normalization | RMSNorm |
| Vocabulary | ~6,251 domain-specific tokens |
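
As a sanity check on the ~29M figure, the sketch below counts parameters for a Llama-style decoder from the values in the table. The key names follow HuggingFace's LlamaConfig fields, but two values are not stated in this README and are assumptions: the MLP `intermediate_size` (1408 here) and untied input/output embeddings. The actual MODEL_CONFIG in src/ray_common.py may differ.

```python
MODEL_CONFIG_SKETCH = {
    "hidden_size": 512,
    "num_hidden_layers": 8,
    "num_attention_heads": 8,
    "num_key_value_heads": 2,         # grouped query attention
    "max_position_embeddings": 8192,  # RoPE adds no learned parameters
    "vocab_size": 6251,
    "hidden_act": "silu",             # gate activation used by SwiGLU
    "intermediate_size": 1408,        # ASSUMPTION: not given in the README
    "tie_word_embeddings": False,     # ASSUMPTION: not given in the README
}


def count_llama_params(cfg):
    """Count weights of a bias-free Llama-style decoder."""
    h = cfg["hidden_size"]
    head_dim = h // cfg["num_attention_heads"]
    kv_dim = head_dim * cfg["num_key_value_heads"]
    attention = h * h + 2 * h * kv_dim + h * h  # q, k, v, o projections
    mlp = 3 * h * cfg["intermediate_size"]      # gate, up, down projections
    norms = 2 * h                               # two RMSNorms per layer
    per_layer = attention + mlp + norms
    embed = cfg["vocab_size"] * h
    lm_head = 0 if cfg["tie_word_embeddings"] else embed
    final_norm = h
    return embed + lm_head + cfg["num_hidden_layers"] * per_layer + final_norm


total_params = count_llama_params(MODEL_CONFIG_SKETCH)
print(f"{total_params / 1e6:.1f}M parameters")
```

Under these assumptions the count lands close to 29M; a different intermediate size or tied embeddings would shift it by a few million.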

Differences From the Original

| Stage | Original (NVIDIA) | With Ray (Anyscale) |
|---|---|---|
| Ingest / feature engineering | pandas + cuDF on one node | Ray Data streamed read + cuDF `map_batches` |
| Tokenization | cuDF on one node | same cuDF tokenizer, streamed via Ray Data (`batch_format="cudf"`) |
| Pretraining | NeMo AutoModel, single node | Ray Train `TorchTrainer`, cross-node DDP, fault tolerance, checkpointing |
| Embedding extraction | head-side loop | pipelined Ray Data GPU inference |
| Checkpoint | downloaded pretrained checkpoint | produced by notebook 03 to shared storage |
| Dependencies | installed in a container | head deps in `requirements.txt`; GPU worker deps via Ray `runtime_env` |
