ScalaHeaven/spark-devcontainer

Scala 2 Spark Devcontainer

This repository is a ready-to-open Scala 2 development workspace for building and running an Apache Spark CSV pipeline. It includes:

  • Scala 2.13.18
  • Apache Spark SQL 4.1.1
  • sbt 1.12.11
  • a VS Code Dev Containers setup with JDK 21, Scala CLI, sbt, Metals, Codex, and Metals MCP
  • JDK source archives linked into the devcontainer JDK so Metals can navigate into Java standard library classes
  • a production Dockerfile that builds a runnable assembly JAR

The example application reads a transaction CSV with a fixed schema, validates and enriches rows, aggregates revenue metrics, and writes partitioned Parquet output.

Quick Start

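Open the folder in VS Code and run the "Dev Containers: Reopen in Container" command.

Inside the container, run the pipeline with the default paths, or pass the input and output paths explicitly: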

sbt -Dsbt.batch=true run
sbt -Dsbt.batch=true "run data/input/transactions.csv target/spark-output/transaction-summary"

Build and run the application image:

docker build -t spark-devcontainer .
docker run --rm spark-devcontainer

The default input is:

data/input/transactions.csv

The default output is:

target/spark-output/transaction-summary

CSV Contract

The pipeline uses an explicit Spark StructType; it does not infer CSV types. Input files must include this header:

transaction_id,customer_id,event_ts,region,country,product_id,product_category,quantity,unit_price,discount_pct,payment_method,status

Field meanings:

  • transaction_id: unique transaction identifier
  • customer_id: customer identifier
  • event_ts: timestamp in yyyy-MM-dd'T'HH:mm:ss format
  • region: business region, such as NA, EMEA, or APAC
  • country: ISO-like country code used for reporting
  • product_id: product SKU
  • product_category: reporting category
  • quantity: positive integer quantity
  • unit_price: non-negative item price
  • discount_pct: decimal discount from 0.0 through 1.0
  • payment_method: payment channel label
  • status: transaction status, such as completed or refunded
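
As a rough illustration of the contract above, an explicit schema might be declared as follows. This is a minimal sketch, not the repository's actual code: the object name `TransactionSchema`, the `read` helper, and every column type are assumptions, since the README fixes only the column names, their order, and the timestamp format. It assumes the Spark SQL dependency from build.sbt is on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types._

// Hypothetical holder for the CSV schema; names and types are illustrative.
object TransactionSchema {
  // Column order follows the documented header. Types are assumptions.
  val schema: StructType = StructType(Seq(
    StructField("transaction_id", StringType),
    StructField("customer_id", StringType),
    StructField("event_ts", TimestampType),
    StructField("region", StringType),
    StructField("country", StringType),
    StructField("product_id", StringType),
    StructField("product_category", StringType),
    StructField("quantity", IntegerType),
    StructField("unit_price", DoubleType),
    StructField("discount_pct", DoubleType),
    StructField("payment_method", StringType),
    StructField("status", StringType)
  ))

  // Reads a headered CSV with the fixed schema instead of type inference.
  def read(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("header", "true")
      .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss")
      .schema(schema)
      .csv(path)
}
```

Supplying the schema up front avoids the extra pass over the file that type inference needs, and it keeps column types stable across input files.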

Rows are excluded before aggregation when they are malformed or contain a missing identifier, an invalid timestamp, a non-positive quantity, a negative price, or a discount outside the range 0.0 through 1.0.
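
The exclusion rules can be restated as a plain Scala predicate. This is a sketch under stated assumptions: `RowRules` and `RawRow` are hypothetical names, "identifiers" is taken to mean transaction_id and customer_id, and the pipeline itself may express the same checks as Spark column filters instead.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

object RowRules {
  // Hypothetical raw row: fields are optional because CSV cells may be
  // empty or fail to parse.
  final case class RawRow(
      transactionId: Option[String],
      customerId: Option[String],
      eventTs: Option[String],
      quantity: Option[Int],
      unitPrice: Option[Double],
      discountPct: Option[Double]
  )

  // Timestamp format documented for event_ts.
  private val tsFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss")

  def parseTs(s: String): Option[LocalDateTime] =
    Try(LocalDateTime.parse(s, tsFormat)).toOption

  // True only when every documented rule holds.
  def isValid(r: RawRow): Boolean =
    r.transactionId.exists(_.trim.nonEmpty) &&
      r.customerId.exists(_.trim.nonEmpty) &&
      r.eventTs.flatMap(parseTs).isDefined &&
      r.quantity.exists(_ > 0) &&
      r.unitPrice.exists(_ >= 0.0) &&
      r.discountPct.exists(d => d >= 0.0 && d <= 1.0)
}
```

A row such as `RawRow(Some("t1"), Some("c1"), Some("2024-01-15T10:30:00"), Some(2), Some(9.99), Some(0.1))` passes, while the same row with a quantity of 0 or a timestamp written with a space instead of `T` is rejected.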

Pipeline Output

The job writes Parquet files partitioned by event_date. Each output row summarizes one combination of:

  • event_date
  • region
  • country
  • product_category
  • status

Metrics written:

  • transaction_count
  • unique_customers
  • units_sold
  • gross_revenue
  • net_revenue
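
The grouping and metrics listed above could be computed roughly as follows. This is a sketch, not the code in Main.scala: the object name `TransactionSummary` is hypothetical, and the README names the metrics without defining them, so the formulas here (gross revenue as quantity times unit_price, net revenue as gross revenue reduced by discount_pct, and event_date derived from event_ts) are assumptions.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count, countDistinct, lit, sum, to_date}

object TransactionSummary {
  // Grouping keys as documented in the README.
  val groupingKeys: Seq[String] =
    Seq("event_date", "region", "country", "product_category", "status")

  // Aggregates already-validated rows into one row per grouping key.
  def summarize(valid: DataFrame): DataFrame = {
    // Assumed metric definitions; the README does not spell them out.
    val gross = col("quantity") * col("unit_price")
    val net = gross * (lit(1.0) - col("discount_pct"))

    valid
      .withColumn("event_date", to_date(col("event_ts")))
      .groupBy(groupingKeys.map(col): _*)
      .agg(
        count(col("transaction_id")).as("transaction_count"),
        countDistinct(col("customer_id")).as("unique_customers"),
        sum(col("quantity")).as("units_sold"),
        sum(gross).as("gross_revenue"),
        sum(net).as("net_revenue")
      )
  }

  // Writes one directory per event_date value under outputPath.
  def write(summary: DataFrame, outputPath: String): Unit =
    summary.write.partitionBy("event_date").parquet(outputPath)
}
```

With the default save mode, Spark refuses to write if the output path already exists, so a rerun into the same directory would need an explicit mode such as `.mode("overwrite")`.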

For a large CSV, pass the input and output paths explicitly:

sbt -Dsbt.batch=true "run /data/landing/transactions.csv /data/curated/transaction-summary"

To target a Spark master instead of local mode, pass a third argument:

sbt -Dsbt.batch=true "run /data/landing/transactions.csv /data/curated/transaction-summary spark://spark-master:7077"
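
One way to resolve the three positional arguments with the documented defaults is sketched below. `JobArgs` and `Config` are hypothetical names, and falling back independently for each missing position is an assumption about how the application treats a partial argument list.

```scala
object JobArgs {
  final case class Config(inputPath: String, outputPath: String, master: String)

  // Defaults documented in this README.
  val DefaultInput = "data/input/transactions.csv"
  val DefaultOutput = "target/spark-output/transaction-summary"
  val DefaultMaster = "local[*]"

  // Positional arguments: input path, output path, Spark master URL.
  def parse(args: Array[String]): Config = Config(
    inputPath = args.lift(0).getOrElse(DefaultInput),
    outputPath = args.lift(1).getOrElse(DefaultOutput),
    master = args.lift(2).getOrElse(DefaultMaster)
  )
}
```

The resulting `master` value would then be passed to `SparkSession.builder().master(...)` when the session is created.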

Build And Validation

Run sbt commands serially. Starting multiple sbt processes at the same time can hit the sbt boot socket lock and fail with ServerAlreadyBootingException.

sbt -Dsbt.batch=true compile
sbt -Dsbt.batch=true run
sbt -Dsbt.batch=true assembly
sbt -Dsbt.batch=true scalafmtCheckAll

Format Scala sources with:

sbt -Dsbt.batch=true scalafmtAll

Important Files

  • build.sbt: pins Scala, adds Spark SQL, configures Java module options for Spark, and builds an assembly JAR.
  • src/main/scala/Main.scala: Spark CSV pipeline entry point.
  • data/input/transactions.csv: small sample file with the production CSV structure.
  • .devcontainer/Dockerfile: development image with JDK 21, JDK sources for Metals Java navigation, and Scala tools.
  • Dockerfile: production multi-stage build for the Spark application.
  • .vscode/launch.json: Metals/Scala debug launch config for Main.
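
For orientation, a build.sbt covering the responsibilities listed above might look roughly like this. It is a sketch rather than the repository's file: it assumes the sbt-assembly plugin is enabled in project/plugins.sbt, the project name and JAR name are placeholders, and the single `--add-opens` flag stands in for whatever module options the real build sets.

```scala
// build.sbt (illustrative sketch; assumes sbt-assembly is enabled in project/plugins.sbt)
ThisBuild / scalaVersion := "2.13.18"

lazy val root = (project in file("."))
  .settings(
    name := "spark-devcontainer", // placeholder project name
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "4.1.1",
    // Fork a separate JVM so the options below apply to `sbt run`.
    run / fork := true,
    // Example of a module flag Spark needs on recent JDKs; the full list is an assumption.
    run / javaOptions += "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED",
    // Entry point taken from src/main/scala/Main.scala.
    assembly / mainClass := Some("Main"),
    assembly / assemblyJarName := "spark-devcontainer-assembly.jar" // placeholder JAR name
  )
```

Bundling Spark into an assembly JAR usually also requires an `assembly / assemblyMergeStrategy` setting to resolve duplicate files across dependencies; it is omitted here for brevity.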

Notes

Spark runs in local[*] mode by default, which suits development and single-machine processing. For cluster execution, pass the Spark master URL as the third application argument and make sure the input and output paths are accessible to both the driver and the executors.

About

A small devcontainer for experimenting with Spark.
