LucaCanali
diff --git a/docs/Flight_recorder_mode_KafkaSink.md b/docs/Flight_recorder_mode_KafkaSink.md
Lines changed: 74 additions & 0 deletions
diff --git a/docs/Reference_SparkMeasure_API_and_Configs.md b/docs/Reference_SparkMeasure_API_and_Configs.md
Lines changed: 48 additions & 0 deletions
diff --git a/src/main/scala/ch/cern/sparkmeasure/KafkaSink.scala b/src/main/scala/ch/cern/sparkmeasure/KafkaSink.scala
Lines changed: 2 additions & 131 deletions
@@ -14,6 +14,28 @@ provided by the user. Use this mode to monitor Spark execution workload.
 Notes:
 - KafkaSink: the amount of data generated is relatively small in most applications: O(number_of_stages)
 - KafkaSinkExtended can generate a large amount of data O(Number_of_tasks), use with care
+
+## KafkaSinkV2 and KafkaSinkV2Extended
+
+**KafkaSinkV2** is an enhanced version of KafkaSink that adds application-level aggregated metrics and support for custom labels.  
+It collects all stage/executor/query metrics from the base KafkaSink, plus additional application lifecycle events and counters.  
+**KafkaSinkV2Extended** extends KafkaSinkV2 to also record detailed metrics for each executed task.
+
+**Compatibility Note:** KafkaSinkV2 is backward compatible with KafkaSink in terms of configuration and basic event types.
+Existing consumers of KafkaSink events will continue to work as expected. KafkaSinkV2 adds two new event types
+(`applications_started` and `applications_ended`) with additional metadata that existing consumers can safely ignore
+if not needed.
+
+### Key Differences from KafkaSink
+
+1. **Application-Level Metrics:** KafkaSinkV2 emits `applications_started` and `applications_ended` events with aggregated counters
+2. **Custom Labels:** Support for custom metadata labels via `spark.sparkmeasure.appLabels.*` configuration
+3. **Enhanced Application End Event:** Includes executor counts, job/stage/task counters, and selected Spark configurations
+4. **Counter Tracking:** Tracks success/failure counts for jobs, stages, and tasks throughout the application lifecycle
+
+## Configuration
+
+### KafkaSink / KafkaSinkExtended Configuration
 
 How to use: attach the KafkaSink to a Spark Context using the extra listener infrastructure. Example:
   - `--conf spark.extraListeners=ch.cern.sparkmeasure.KafkaSink`
@@ -40,6 +62,29 @@ Configuration - KafkaSink parameters:
        Example: --conf spark.sparkmeasure.kafka.ssl.keystore.location=/var/private/ssl/kafka.server.keystore.jks
 ```
 
+### KafkaSinkV2 / KafkaSinkV2Extended Configuration
+
+How to use: attach the KafkaSinkV2 listener to the Spark Context:
+
+```
+# Start the listener for KafkaSinkV2 (recommended):
+--conf spark.extraListeners=ch.cern.sparkmeasure.KafkaSinkV2
+
+# Or use KafkaSinkV2Extended for task-level metrics (generates more data):
+--conf spark.extraListeners=ch.cern.sparkmeasure.KafkaSinkV2Extended
+
+# Required Kafka configuration (same as KafkaSink):
+--conf spark.sparkmeasure.kafkaBroker = Kafka broker endpoint URL
+       Example: --conf spark.sparkmeasure.kafkaBroker=kafka.your-site.com:9092
+--conf spark.sparkmeasure.kafkaTopic = Kafka topic
+       Example: --conf spark.sparkmeasure.kafkaTopic=sparkmeasure-metrics
+
+# Optional - Custom application labels:
+--conf spark.sparkmeasure.appLabels.<labelKey> = Custom metadata value
+       Example: --conf spark.sparkmeasure.appLabels.project=my-project
+       Example: --conf spark.sparkmeasure.appLabels.environment=production
+```
+
 This code depends on "kafka-clients". If you deploy sparkMeasure from maven central,
 the dependency is being taken care of.
 If you run sparkMeasure from a jar instead, you may need to add the dependency manually
@@ -102,3 +147,32 @@ bin/spark-shell \
   "jobId" : "0",
   "appId" : "local-1660057441489"
 ...
+```
+
+**Note:** KafkaSinkV2 also emits all the standard events, such as `stages_started`, `stages_ended`, and `stage_metrics`,
+exactly as the original KafkaSink does, ensuring backward compatibility.
+
+## Migration Guide: KafkaSink to KafkaSinkV2
+
+If you are currently using KafkaSink and want to migrate to KafkaSinkV2:
+
+1. **Configuration Change:** Simply replace the listener class name:
+   ```
+   # Old:
+   --conf spark.extraListeners=ch.cern.sparkmeasure.KafkaSink
+   
+   # New:
+   --conf spark.extraListeners=ch.cern.sparkmeasure.KafkaSinkV2
+   ```
+
+2. **Backward Compatibility:** All existing event types and their schemas remain unchanged. Your existing Kafka consumers will continue to work.
+
+3. **New Events:** KafkaSinkV2 adds two new event types (`applications_started` and `applications_ended`). If your consumers don't need these, they can simply ignore events with these names.
+
+4. **Optional Custom Labels:** Add custom labels for better filtering and organization:
+   ```
+   --conf spark.sparkmeasure.appLabels.project=my-project
+   --conf spark.sparkmeasure.appLabels.environment=staging
+   ```
+
+5. **Benefits:** You gain application-level aggregated metrics, custom labels, and selected Spark configurations without losing any existing functionality.
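As a minimal sketch of migration step 3, an existing consumer can drop the two new event types before its current handling logic. The sketch assumes events have already been decoded from JSON into `Map[String, Any]`, with the event type in the `name` field as in the sample output above; JSON decoding and the Kafka consumer loop are out of scope. `SparkMeasureEventFilter`, `isV2Only`, and `legacyEvents` are hypothetical names, not part of sparkMeasure.

```scala
// Hypothetical consumer-side helper (not part of sparkMeasure).
// Assumes each Kafka record was already decoded into a Map[String, Any].
object SparkMeasureEventFilter {
  // Event types that KafkaSinkV2 adds on top of the KafkaSink event set
  val V2OnlyEvents: Set[String] = Set("applications_started", "applications_ended")

  // True when the event's "name" field is one of the V2-only event types
  def isV2Only(event: Map[String, Any]): Boolean =
    event.get("name").exists(name => V2OnlyEvents.contains(name.toString))

  // Keep only the event types that an existing KafkaSink consumer already handles
  def legacyEvents(events: Seq[Map[String, Any]]): Seq[Map[String, Any]] =
    events.filterNot(isV2Only)
}
```

Filtering on `name` alone keeps the consumer indifferent to any extra fields the new events carry, which is what makes ignoring them safe.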
@@ -433,6 +433,54 @@ This code depends on "kafka-clients", you may need to add the dependency explici
   --packages org.apache.kafka:kafka-clients:3.7.0
 ```
 
+## KafkaSinkV2 and KafkaSinkV2Extended
+
+```
+class KafkaSinkV2(conf: SparkConf) extends KafkaSink(conf)
+class KafkaSinkV2Extended(conf: SparkConf) extends KafkaSinkV2(conf)
+
+**KafkaSinkV2** is an enhanced version of KafkaSink that extends the SparkListener infrastructure
+with application-level aggregated metrics and custom labels support.
+
+Key Features:
+1. All stage/executor/query metrics from the base KafkaSink
+2. Application-level aggregated counters (executor/job/stage/task counts)
+3. Custom labels via spark.sparkmeasure.appLabels.* configurations
+4. New applications_started and applications_ended events with application metadata
+5. Automatic capture of selected Spark configurations
+
+**Backward Compatibility:** KafkaSinkV2 emits all the same event types as KafkaSink,
+ensuring existing consumers continue to work. It adds two new event types that can be
+safely ignored by consumers that don't need them.
+
+How to use:
+  --conf spark.extraListeners=ch.cern.sparkmeasure.KafkaSinkV2
+
+**KafkaSinkV2Extended** adds verbose task-level metrics (can generate O(Number_of_tasks) data).
+How to use:
+  --conf spark.extraListeners=ch.cern.sparkmeasure.KafkaSinkV2Extended
+
+Configuration - KafkaSinkV2 parameters:
+
+Required:
+--conf spark.sparkmeasure.kafkaBroker = Kafka broker endpoint URL
+       Example: --conf spark.sparkmeasure.kafkaBroker=kafka.your-site.com:9092
+--conf spark.sparkmeasure.kafkaTopic = Kafka topic
+       Example: --conf spark.sparkmeasure.kafkaTopic=spark-metrics
+
+Optional - Custom labels (recommended for filtering and organization):
+--conf spark.sparkmeasure.appLabels.<labelKey> = Custom metadata value
+       Example: --conf spark.sparkmeasure.appLabels.project=my-project
+       Example: --conf spark.sparkmeasure.appLabels.environment=production
+       Example: --conf spark.sparkmeasure.appLabels.team=data-engineering
+
+For detailed event schemas and examples, see:
+  - docs/Flight_recorder_mode_KafkaSink.md
+
+This code depends on "kafka-clients"; you may need to add the dependency explicitly, for example:
+  --packages org.apache.kafka:kafka-clients:3.7.0
+```
+
 ## Prometheus PushGatewaySink
 ```
 class PushGatewaySink(conf: SparkConf) extends SparkListener
 
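The custom-label handling described in the KafkaSinkV2 reference entry above can be sketched in plain Scala. This mirrors the `extractAppLabels` helper that this diff removes from KafkaSink.scala, assuming KafkaSinkV2 keeps the same logic: only keys under `spark.sparkmeasure.appLabels.` are kept, and the `spark.sparkmeasure.` prefix is stripped, so a label surfaces in the event as a field such as `appLabels.project`. `AppLabelsSketch` and `withLabels` are hypothetical names, and a plain sequence of key/value pairs stands in for `SparkConf.getAll`.

```scala
// Hypothetical sketch (not the sparkMeasure API); assumes KafkaSinkV2 keeps
// the label logic that this diff removes from KafkaSink.
object AppLabelsSketch {
  // Only keys under this prefix are treated as custom labels
  private val LabelPrefix = "spark.sparkmeasure.appLabels."

  // Keep the label entries and strip "spark.sparkmeasure." so that, e.g.,
  // spark.sparkmeasure.appLabels.project becomes the field appLabels.project
  def extractAppLabels(confEntries: Iterable[(String, String)]): Map[String, String] =
    confEntries
      .filter { case (key, _) => key.startsWith(LabelPrefix) }
      .map { case (key, value) => (key.stripPrefix("spark.sparkmeasure."), value) }
      .toMap

  // Merge labels into an event's fields, as the removed code did with `++ appLabels`
  def withLabels(event: Map[String, Any], labels: Map[String, String]): Map[String, Any] =
    event ++ labels
}
```

With a real `SparkConf`, the same function could be applied to `conf.getAll.toSeq`; non-label settings such as `spark.sparkmeasure.kafkaTopic` are left out of the result.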
@@ -2,15 +2,14 @@ package ch.cern.sparkmeasure
 
 import org.apache.kafka.clients.producer.{KafkaProducer, Producer, ProducerRecord}
 import org.apache.kafka.common.serialization.ByteArraySerializer
-import org.apache.spark.{SparkConf, TaskFailedReason, TaskKilled}
+import org.apache.spark.SparkConf
 import org.apache.spark.scheduler._
 import org.apache.spark.sql.SparkSession
 import org.apache.spark.sql.execution.ui.{SparkListenerSQLExecutionEnd, SparkListenerSQLExecutionStart}
 import org.slf4j.{Logger, LoggerFactory}
 
 import java.nio.charset.StandardCharsets
 import java.util.Properties
-import scala.collection.mutable
 import scala.util.Try
 
 /**
@@ -30,8 +29,6 @@ import scala.util.Try
  * example: --conf spark.sparkmeasure.kafkaTopic=sparkmeasure-stageinfo
  * spark.sparkmeasure.kafka.* = Other kafka properties
  * example: --conf spark.sparkmeasure.kafka.ssl.keystore.location=/var/private/ssl/kafka.server.keystore.jks
- * spark.sparkmeasure.appLabels.* = Custom labels to include in application start and end events
- * example: --conf spark.sparkmeasure.appLabels.environment=production
  *
  * This code depends on "kafka clients", you may need to add the dependency:
  * --packages org.apache.kafka:kafka-clients:3.2.1
@@ -52,29 +49,6 @@ class KafkaSink(conf: SparkConf) extends SparkListener {
     case _ => "noAppId"
   }
 
-  // Application tracking
-  private var appName: String = "noAppName"
-  private var startTime: Long = 0L
-
-  // Executor tracking
-  private var executorsFailed: Int = 0
-  private var executorsKilled: Int = 0
-
-  // Job tracking
-  private var totalJobsCompleted: Int = 0
-  private var succeededJobsCount: Int = 0
-  private var failedJobsCount: Int = 0
-
-  // Stage tracking
-  private var totalStagesCompleted: Int = 0
-  private var succeededStagesCount: Int = 0
-  private var failedStagesCount: Int = 0
-
-  // Task tracking
-  private var totalTaskCount: Int = 0
-  private var numTaskFailed: Int = 0
-  private var numTaskKilled: Int = 0
-
   override def onExecutorAdded(executorAdded: SparkListenerExecutorAdded): Unit = {
     val executorInfo = executorAdded.executorInfo
     val epochMillis = System.currentTimeMillis()
@@ -90,17 +64,6 @@ class KafkaSink(conf: SparkConf) extends SparkListener {
     report(metrics)
   }
 
-  override def onExecutorRemoved(executorRemoved: SparkListenerExecutorRemoved): Unit = {
-    if (executorRemoved != null && executorRemoved.reason != null) {
-      executorRemoved.reason match {
-        case reason if reason.toLowerCase.contains("kill") =>
-          executorsKilled += 1
-        case _ =>
-          executorsFailed += 1
-      }
-    }
-  }
-
   override def onStageSubmitted(stageSubmitted: SparkListenerStageSubmitted): Unit = {
     val submissionTime = stageSubmitted.stageInfo.submissionTime.getOrElse(0L)
     val attemptNumber = stageSubmitted.stageInfo.attemptNumber()
@@ -118,6 +81,7 @@ class KafkaSink(conf: SparkConf) extends SparkListener {
     report(metrics)
   }
 
+
   override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
     val stageId = stageCompleted.stageInfo.stageId.toString
     val submissionTime = stageCompleted.stageInfo.submissionTime.getOrElse(0L)
@@ -176,17 +140,6 @@ class KafkaSink(conf: SparkConf) extends SparkListener {
     )
 
     report(stageTaskMetrics)
-
-    if (stageCompleted != null && stageCompleted.stageInfo != null) {
-      val stageInfo = stageCompleted.stageInfo
-      totalStagesCompleted += 1
-
-      if (stageInfo.failureReason.isDefined) {
-        failedStagesCount += 1
-      } else {
-        succeededStagesCount += 1
-      }
-    }
   }
 
   override def onOtherEvent(event: SparkListenerEvent): Unit = {
@@ -250,69 +203,13 @@ class KafkaSink(conf: SparkConf) extends SparkListener {
       "epochMillis" -> epochMillis
     )
     report(jobEndMetrics)
-
-    if (jobEnd != null) {
-      totalJobsCompleted += 1
-
-      jobEnd.jobResult match {
-        case org.apache.spark.scheduler.JobSucceeded =>
-          succeededJobsCount += 1
-        case _ =>
-          failedJobsCount += 1
-      }
-    }
   }
 
   override def onApplicationStart(applicationStart: SparkListenerApplicationStart): Unit = {
     appId = applicationStart.appId.getOrElse("noAppId")
-    appName = applicationStart.appName
-    startTime = applicationStart.time
-    val appLabels = extractAppLabels(conf)
-    val epochMillis = System.currentTimeMillis()
-
-    val appStartMetrics = Map[String, Any](
-      "name" -> "applications_started",
-      "appId" -> appId,
-      "appName" -> appName,
-      "startTime" -> startTime,
-      "epochMillis" -> epochMillis
-    ) ++ appLabels
-
-    report(appStartMetrics)
   }
 
   override def onApplicationEnd(applicationEnd: SparkListenerApplicationEnd): Unit = {
-    val completionTime = applicationEnd.time
-    val safeEndTime = if (completionTime > 0) completionTime else System.currentTimeMillis()
-    val duration = if (startTime > 0) safeEndTime - startTime else 0L
-    val epochMillis = System.currentTimeMillis()
-    val configurations = conf.getAll.toMap
-    val appLabels = extractAppLabels(conf)
-
-    val appEndMetrics = Map[String, Any](
-      "name" -> "applications_ended",
-      "appId" -> appId,
-      "appName" -> appName,
-      "startTime" -> startTime,
-      "completionTime" -> completionTime,
-      "duration" -> duration,
-      "executorsFailed" -> executorsFailed,
-      "executorsKilled" -> executorsKilled,
-      "totalJobsCompleted" -> totalJobsCompleted,
-      "succeededJobsCount" -> succeededJobsCount,
-      "failedJobsCount" -> failedJobsCount,
-      "numStagesCompleted" -> totalStagesCompleted,
-      "numSucceededStages" -> succeededStagesCount,
-      "numFailedStages" -> failedStagesCount,
-      "totalTaskCount" -> totalTaskCount,
-      "numTaskFailed" -> numTaskFailed,
-      "numTaskKilled" -> numTaskKilled,
-      "epochMillis" -> epochMillis,
-      "configurations" -> configurations
-    ) ++ appLabels
-
-    report(appEndMetrics)
-
     logger.info(s"Spark application ended, timestamp = ${applicationEnd.time}, closing Kafka connection.")
     synchronized(
       if (Option(producer).isDefined) {
@@ -326,22 +223,6 @@ class KafkaSink(conf: SparkConf) extends SparkListener {
     )
   }
 
-  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
-    if (taskEnd != null) {
-      totalTaskCount += 1
-
-      if (taskEnd.reason != null) {
-        taskEnd.reason match {
-          case _: TaskKilled =>
-            numTaskKilled += 1
-          case _: TaskFailedReason =>
-            numTaskFailed += 1
-          case _ =>
-        }
-      }
-    }
-  }
-
   protected def report[T <: Any](metrics: Map[String, T]): Unit = {
     val result: Unit = Try {
       ensureProducer()
@@ -375,14 +256,6 @@ class KafkaSink(conf: SparkConf) extends SparkListener {
     )
   }
 
-  private def extractAppLabels(conf: SparkConf): Map[String, String] = {
-    Try {
-      conf.getAll
-        .filter { case (key, _) => key.startsWith("spark.sparkmeasure.appLabels.") }
-        .map { case (key, value) => (key.stripPrefix("spark.sparkmeasure."), value) }
-        .toMap
-    }.getOrElse(Map.empty[String, String])
-  }
 }
 
 /**
@@ -409,8 +282,6 @@ class KafkaSinkExtended(conf: SparkConf) extends KafkaSink(conf) {
   }
 
   override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
-    super.onTaskEnd(taskEnd)
-
     val taskInfo = taskEnd.taskInfo
     val taskmetrics = taskEnd.taskMetrics
     val epochMillis = System.currentTimeMillis()