This repository is a ready-to-open Scala 2 development workspace for building and running an Apache Spark CSV pipeline. It includes:
- Scala 2.13.18
- Apache Spark SQL 4.1.1
- sbt 1.12.11
- a VS Code Dev Containers setup with JDK 21, Scala CLI, sbt, Metals, Codex, and Metals MCP
- JDK source archives linked into the devcontainer JDK so Metals can navigate into Java standard library classes
- a production Dockerfile that builds a runnable assembly JAR
The example application starts a local Spark standalone cluster with one master and three worker nodes, reads a transaction CSV with a fixed schema, validates and enriches rows, aggregates revenue metrics, and writes partitioned Parquet output.
Open the folder in VS Code and run Dev Containers: Reopen in Container.
The devcontainer opens at /workspaces/spark-cluster-devcontainer and repairs
workspace ownership on start so Metals can write .metals/metals.log as the
non-root vscode user.
Inside the container:
scala-cli scripts/GenerateTransactions.scala -- data/input/transactions.csv 100000
sbt -Dsbt.batch=true run
sbt -Dsbt.batch=true "run data/input/transactions.csv target/spark-output/transaction-summary"
Build and run the application image:
docker build -t spark-devcontainer .
docker run --rm spark-devcontainer
The default input is:
data/input/transactions.csv
The default output is:
target/spark-output/transaction-summary
The default Spark master is:
local-cluster[3,1,4096]
That Spark master starts one local standalone master and three worker JVMs, each
with one core and 4096 MiB of worker memory. Use this mode when you want the
template to behave like a small cluster without managing external services.
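To make the three numbers in the master string concrete, here is a minimal sketch that splits a local-cluster[N,C,M] string into worker count, cores per worker, and memory per worker in MiB. `LocalClusterSpec` is a hypothetical helper written for illustration; it is not part of this repository and Spark does not need it.

```scala
// Hypothetical helper (not in the repository): it only illustrates what the
// three numbers in a local-cluster master string mean.
object LocalClusterSpec {
  final case class Spec(workers: Int, coresPerWorker: Int, memoryPerWorkerMiB: Int) {
    def totalCores: Int = workers * coresPerWorker
    def totalWorkerMemoryMiB: Int = workers * memoryPerWorkerMiB
  }

  // Matches the whole string local-cluster[N,C,M], allowing spaces around the numbers.
  private val Pattern = """local-cluster\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]""".r

  // Returns None for any other master URL, such as spark://host:7077.
  def parse(master: String): Option[Spec] = master match {
    case Pattern(n, c, m) => Some(Spec(n.toInt, c.toInt, m.toInt))
    case _                => None
  }
}
```

For the default master, `LocalClusterSpec.parse("local-cluster[3,1,4096]")` yields three workers with one core each, so 3 cores and 12288 MiB of worker memory in total.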
The Spark executors are configured with spark.executor.memory=4g, and the
application driver JVM uses -Xmx4g for sbt run, VS Code debug launches, and
the production Docker image.
The pipeline uses an explicit Spark StructType; it does not infer CSV types.
Input files must include this header:
transaction_id,customer_id,event_ts,region,country,product_id,product_category,quantity,unit_price,discount_pct,payment_method,status
Field meanings:
- transaction_id: unique transaction identifier
- customer_id: customer identifier
- event_ts: timestamp in yyyy-MM-dd'T'HH:mm:ss format
- region: business region, such as NA, EMEA, or APAC
- country: ISO-like country code used for reporting
- product_id: product SKU
- product_category: reporting category
- quantity: positive integer quantity
- unit_price: non-negative item price
- discount_pct: decimal discount from 0.0 through 1.0
- payment_method: payment channel label
- status: transaction status, such as completed or refunded
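As a rough pre-flight check outside Spark, the header and the event_ts format can be verified with plain Scala. The sketch below is an assumption-laden illustration: `CsvContract`, `headerMatches`, and `parseEventTs` are hypothetical names, and the pipeline's own parsing in Main.scala may behave differently.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

// Hypothetical helper (not in the repository) for checking a CSV file
// against the documented header and timestamp format before running the job.
object CsvContract {
  val ExpectedHeader: String =
    "transaction_id,customer_id,event_ts,region,country,product_id," +
      "product_category,quantity,unit_price,discount_pct,payment_method,status"

  val ExpectedColumns: Vector[String] = ExpectedHeader.split(",").toVector

  // Same pattern the field list gives for event_ts.
  private val EventTsFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss")

  // True when the first line names exactly the expected columns, in order.
  def headerMatches(firstLine: String): Boolean =
    firstLine.trim.split(",", -1).map(_.trim).toVector == ExpectedColumns

  // None when the text does not follow the event_ts pattern.
  def parseEventTs(raw: String): Option[LocalDateTime] =
    Try(LocalDateTime.parse(raw, EventTsFormat)).toOption
}
```

Reading only the first line of a large file and passing it to `CsvContract.headerMatches` is a cheap way to catch a wrong export before starting the cluster.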
Malformed rows, rows with missing identifiers, invalid timestamps, non-positive
quantities, negative prices, or discounts outside 0.0 through 1.0 are
excluded before aggregation.
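The exclusion rules above can be read as a single predicate over a parsed row. The following is a minimal sketch in plain Scala, not the repository's Spark implementation: `RawRow` and `isValid` are hypothetical names, a `None` field stands in for a value that failed to parse, and treating transaction_id and customer_id as the required identifiers is an assumption.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

// Hypothetical model of the validation rules (not the code in Main.scala).
object RowValidation {
  // None represents a missing or unparseable value.
  final case class RawRow(
      transactionId: Option[String],
      customerId: Option[String],
      eventTs: Option[String],
      quantity: Option[Int],
      unitPrice: Option[BigDecimal],
      discountPct: Option[BigDecimal]
  )

  private val TsFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss")
  private val Zero = BigDecimal(0)
  private val One = BigDecimal(1)

  private def nonBlank(value: Option[String]): Boolean = value.exists(_.trim.nonEmpty)

  def isValid(row: RawRow): Boolean = {
    // Assumption: the required identifiers are transaction_id and customer_id.
    val hasIds = nonBlank(row.transactionId) && nonBlank(row.customerId)
    val hasTimestamp = row.eventTs.exists(ts => Try(LocalDateTime.parse(ts, TsFormat)).isSuccess)
    val positiveQuantity = row.quantity.exists(_ > 0)
    val nonNegativePrice = row.unitPrice.exists(_ >= Zero)
    // Both bounds are inclusive: 0.0 through 1.0.
    val discountInRange = row.discountPct.exists(d => d >= Zero && d <= One)
    hasIds && hasTimestamp && positiveQuantity && nonNegativePrice && discountInRange
  }
}
```

A Spark version would typically express the same conditions as column filters; the predicate form is used here only because it is easy to test in isolation.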
Regenerate the checked-in 100,000-row sample file with:
scala-cli scripts/GenerateTransactions.scala -- data/input/transactions.csv 100000
The job writes Parquet files partitioned by event_date. Each output row is
grouped by:
- event_date
- region
- country
- product_category
- status
Metrics written:
- transaction_count
- unique_customers
- units_sold
- gross_revenue
- net_revenue
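The grouping and metrics can be sketched on ordinary Scala collections to show the shape of one output row. This is an illustration under stated assumptions, not the job's Spark code: the names `Txn`, `GroupKey`, `Metrics`, and `summarize` are hypothetical, and the revenue formulas (gross = quantity × unit_price, net = gross × (1 − discount_pct)) are assumed because this README does not define them.

```scala
import java.time.LocalDate

// Hypothetical in-memory model of the aggregation (not the code in Main.scala).
object SummarySketch {
  final case class Txn(
      eventDate: LocalDate,
      region: String,
      country: String,
      productCategory: String,
      status: String,
      customerId: String,
      quantity: Int,
      unitPrice: BigDecimal,
      discountPct: BigDecimal
  )

  final case class GroupKey(
      eventDate: LocalDate,
      region: String,
      country: String,
      productCategory: String,
      status: String
  )

  final case class Metrics(
      transactionCount: Long,
      uniqueCustomers: Long,
      unitsSold: Long,
      grossRevenue: BigDecimal,
      netRevenue: BigDecimal
  )

  def summarize(rows: Seq[Txn]): Map[GroupKey, Metrics] =
    rows
      .groupBy(t => GroupKey(t.eventDate, t.region, t.country, t.productCategory, t.status))
      .map { case (key, group) =>
        // Assumed formulas; the README does not state how revenue is computed.
        val gross = group.map(t => t.unitPrice * BigDecimal(t.quantity)).sum
        val net = group.map { t =>
          t.unitPrice * BigDecimal(t.quantity) * (BigDecimal(1) - t.discountPct)
        }.sum
        key -> Metrics(
          transactionCount = group.size.toLong,
          uniqueCustomers = group.map(_.customerId).distinct.size.toLong,
          unitsSold = group.map(_.quantity.toLong).sum,
          grossRevenue = gross,
          netRevenue = net
        )
      }
}
```

Under these assumptions, two completed transactions in the same group (2 units at 10.00 with a 0.1 discount, and 1 unit at 5.00 with no discount) give a gross revenue of 25 and a net revenue of 23.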
For a large CSV, pass the input and output paths explicitly:
sbt -Dsbt.batch=true "run /data/landing/transactions.csv /data/curated/transaction-summary"
To target a different Spark master, pass a third argument:
sbt -Dsbt.batch=true "run /data/landing/transactions.csv /data/curated/transaction-summary spark://spark-master:7077"
Run sbt commands serially. Starting multiple sbt processes at the same time can
hit the sbt boot socket lock and fail with ServerAlreadyBootingException.
sbt -Dsbt.batch=true compile
sbt -Dsbt.batch=true run
sbt -Dsbt.batch=true assembly
sbt -Dsbt.batch=true scalafmtCheckAll
Format Scala sources with:
sbt -Dsbt.batch=true scalafmtAll
- build.sbt: pins Scala, adds Spark SQL, configures the sbt run JVM heap, configures Java module options for Spark, and builds an assembly JAR.
- src/main/scala/Main.scala: Spark CSV pipeline entry point.
- scripts/GenerateTransactions.scala: deterministic generator for the checked-in transaction CSV sample.
- data/input/transactions.csv: 100,000-row sample file with the production CSV structure.
- .devcontainer/Dockerfile: development image with JDK 21, JDK sources for Metals Java navigation, Scala tools, and a minimal /opt/spark home used by local cluster executors.
- Dockerfile: production multi-stage build for the Spark application. It copies the same minimal Spark home into the runtime image so local-cluster[3,1,4096] can launch worker executors.
- .vscode/launch.json: Metals/Scala debug launch config for Main.
Spark runs in local-cluster[3,1,4096] mode by default, which starts one local
standalone master and three workers for development. For an external cluster,
pass the Spark master URL as the third application argument and ensure the
input/output paths are accessible to the driver and executors.
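The positional arguments and their defaults can be pictured as follows. This is a minimal sketch of one way to resolve them, assuming the order input path, output path, Spark master described in this README; `JobArgs` and `Config` are hypothetical names, and the actual parsing in Main.scala may differ.

```scala
// Hypothetical argument resolver (not the code in Main.scala).
object JobArgs {
  final case class Config(inputPath: String, outputPath: String, master: String)

  // Defaults as documented in this README.
  val DefaultInput = "data/input/transactions.csv"
  val DefaultOutput = "target/spark-output/transaction-summary"
  val DefaultMaster = "local-cluster[3,1,4096]"

  // Returns the argument at the given position, or the default when it is absent.
  private def argOrElse(args: Array[String], index: Int, default: String): String =
    if (args.length > index) args(index) else default

  def parse(args: Array[String]): Config =
    Config(
      inputPath = argOrElse(args, 0, DefaultInput),
      outputPath = argOrElse(args, 1, DefaultOutput),
      master = argOrElse(args, 2, DefaultMaster)
    )
}
```

With no arguments every default applies; with two arguments only the master falls back to local-cluster[3,1,4096]; a third argument such as spark://spark-master:7077 overrides it.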