This repository is a ready-to-open Scala 2 development workspace for building and running an Apache Spark CSV pipeline. It includes:
- Scala 2.13.18
- Apache Spark SQL 4.1.1
- sbt 1.12.11
- a VS Code Dev Containers setup with JDK 21, Scala CLI, sbt, Metals, Codex, and Metals MCP
- JDK source archives linked into the devcontainer JDK so Metals can navigate into Java standard library classes
- a production Dockerfile that builds a runnable assembly JAR
The example application reads a transaction CSV with a fixed schema, validates and enriches rows, aggregates revenue metrics, and writes partitioned Parquet output.
Open the folder in VS Code and run Dev Containers: Reopen in Container.
Inside the container:
sbt -Dsbt.batch=true run
sbt -Dsbt.batch=true "run data/input/transactions.csv target/spark-output/transaction-summary"
Build and run the application image:
docker build -t spark-devcontainer .
docker run --rm spark-devcontainer
The default input is:
data/input/transactions.csv
The default output is:
target/spark-output/transaction-summary
The pipeline uses an explicit Spark StructType; it does not infer CSV types.
Input files must include this header:
transaction_id,customer_id,event_ts,region,country,product_id,product_category,quantity,unit_price,discount_pct,payment_method,status
Field meanings:
- transaction_id: unique transaction identifier
- customer_id: customer identifier
- event_ts: timestamp in yyyy-MM-dd'T'HH:mm:ss format
- region: business region, such as NA, EMEA, or APAC
- country: ISO-like country code used for reporting
- product_id: product SKU
- product_category: reporting category
- quantity: positive integer quantity
- unit_price: non-negative item price
- discount_pct: decimal discount from 0.0 through 1.0
- payment_method: payment channel label
- status: transaction status, such as completed or refunded
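The header and field meanings above could map to an explicit schema roughly as follows. This is a minimal sketch, not the repository's actual Main.scala: the object name TransactionSchema, the Spark type chosen for each column (for example DoubleType for unit_price and discount_pct), and the DROPMALFORMED read mode are all assumptions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types._

// Hypothetical helper: names and column types are assumptions derived from
// the field descriptions, not copied from the repository.
object TransactionSchema {
  val schema: StructType = StructType(Seq(
    StructField("transaction_id", StringType, nullable = true),
    StructField("customer_id", StringType, nullable = true),
    StructField("event_ts", TimestampType, nullable = true),
    StructField("region", StringType, nullable = true),
    StructField("country", StringType, nullable = true),
    StructField("product_id", StringType, nullable = true),
    StructField("product_category", StringType, nullable = true),
    StructField("quantity", IntegerType, nullable = true),
    StructField("unit_price", DoubleType, nullable = true),
    StructField("discount_pct", DoubleType, nullable = true),
    StructField("payment_method", StringType, nullable = true),
    StructField("status", StringType, nullable = true)
  ))

  // Reads a CSV that has a header row, applying the explicit schema
  // instead of letting Spark infer column types.
  def read(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("header", "true")
      .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss")
      .option("mode", "DROPMALFORMED") // assumption: drop rows that fail to parse
      .schema(schema)
      .csv(path)
}
```

Supplying the schema up front avoids the extra pass over the file that type inference would need, which matters for large inputs.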
Malformed rows are excluded before aggregation, as are rows with missing
identifiers, invalid timestamps, non-positive quantities, negative prices, or
discounts outside 0.0 through 1.0.
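The exclusion rules above can be expressed as a single Spark filter condition. The sketch below is illustrative only: the object name TransactionValidation, the set of columns treated as identifiers, and the idea that unparseable timestamps arrive as nulls are assumptions, not details taken from the repository.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

// Hypothetical validation helper; the actual pipeline may structure this differently.
object TransactionValidation {
  // Assumption: these three columns are the "identifiers" that must be present.
  private val identifierColumns = Seq("transaction_id", "customer_id", "product_id")

  val isValid: Column = {
    val idsPresent = identifierColumns
      .map(name => col(name).isNotNull && col(name) =!= "")
      .reduce(_ && _)

    idsPresent &&
      col("event_ts").isNotNull &&            // assumes invalid timestamps load as null
      col("quantity") > 0 &&                  // positive quantity only
      col("unit_price") >= 0 &&               // non-negative price
      col("discount_pct").between(0.0, 1.0)   // inclusive range 0.0 through 1.0
  }

  // Rows where the condition is false or null are dropped.
  def keepValid(df: DataFrame): DataFrame = df.filter(isValid)
}
```

Because Spark's filter keeps only rows where the condition evaluates to true, a null in quantity, unit_price, or discount_pct also removes the row.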
The job writes Parquet files partitioned by event_date. Each output row
represents one group, keyed by:
- event_date
- region
- country
- product_category
- status
Metrics written:
- transaction_count
- unique_customers
- units_sold
- gross_revenue
- net_revenue
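The grouping keys and metrics listed above could be computed along these lines. This is a sketch under stated assumptions: the README does not define the revenue formulas, so gross revenue is taken here as quantity * unit_price and net revenue as gross revenue reduced by discount_pct. The object name TransactionSummary and the overwrite save mode are also hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.functions.{col, count, countDistinct, lit, sum, to_date}

// Hypothetical aggregation helper; formulas are assumptions, not the repository's definitions.
object TransactionSummary {
  def summarize(valid: DataFrame): DataFrame = {
    // Assumed formulas: gross = quantity * unit_price; net = gross * (1 - discount_pct).
    val gross = col("quantity") * col("unit_price")
    val net = gross * (lit(1.0) - col("discount_pct"))

    valid
      .withColumn("event_date", to_date(col("event_ts")))
      .groupBy("event_date", "region", "country", "product_category", "status")
      .agg(
        count(lit(1)).as("transaction_count"),
        countDistinct(col("customer_id")).as("unique_customers"),
        sum(col("quantity")).as("units_sold"),
        sum(gross).as("gross_revenue"),
        sum(net).as("net_revenue")
      )
  }

  // Writes one directory per event_date value under outputPath.
  def write(summary: DataFrame, outputPath: String): Unit =
    summary.write
      .mode(SaveMode.Overwrite) // assumption: replace any previous output
      .partitionBy("event_date")
      .parquet(outputPath)
}
```

Partitioning by event_date lets downstream readers that filter on a date range skip unrelated directories.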
For a large CSV, pass the input and output paths explicitly:
sbt -Dsbt.batch=true "run /data/landing/transactions.csv /data/curated/transaction-summary"
To target a Spark master instead of local mode, pass a third argument:
sbt -Dsbt.batch=true "run /data/landing/transactions.csv /data/curated/transaction-summary spark://spark-master:7077"
Run sbt commands serially. Starting multiple sbt processes at the same time can
hit the sbt boot socket lock and fail with ServerAlreadyBootingException.
sbt -Dsbt.batch=true compile
sbt -Dsbt.batch=true run
sbt -Dsbt.batch=true assembly
sbt -Dsbt.batch=true scalafmtCheckAll
Format Scala sources with:
sbt -Dsbt.batch=true scalafmtAll
- build.sbt: pins Scala, adds Spark SQL, configures Java module options for Spark, and builds an assembly JAR.
- src/main/scala/Main.scala: Spark CSV pipeline entry point.
- data/input/transactions.csv: small sample file with the production CSV structure.
- .devcontainer/Dockerfile: development image with JDK 21, JDK sources for Metals Java navigation, and Scala tools.
- Dockerfile: production multi-stage build for the Spark application.
- .vscode/launch.json: Metals/Scala debug launch config for Main.
Spark runs in local[*] mode by default, which is useful for development and
single-machine processing. For cluster execution, pass the Spark master URL as
the third application argument and ensure the input and output paths are
accessible to both the driver and the executors.
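The three positional arguments and their defaults could be resolved as in the sketch below. JobConfig and fromArgs are hypothetical names introduced for illustration; the repository's Main.scala may handle its arguments differently.

```scala
// Hypothetical argument handling; not taken from the repository's Main.scala.
final case class JobConfig(inputPath: String, outputPath: String, master: String)

object JobConfig {
  val DefaultInput = "data/input/transactions.csv"
  val DefaultOutput = "target/spark-output/transaction-summary"
  val DefaultMaster = "local[*]"

  // Positional arguments: input path, output path, Spark master URL.
  // Any argument that is not supplied falls back to its default.
  def fromArgs(args: Array[String]): JobConfig =
    JobConfig(
      inputPath = args.lift(0).getOrElse(DefaultInput),
      outputPath = args.lift(1).getOrElse(DefaultOutput),
      master = args.lift(2).getOrElse(DefaultMaster)
    )
}
```

The resulting master value would then typically be passed to SparkSession.builder().master(...) when the session is created.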