Spark Connect
Support for Apache Spark Connect is experimental and subject to change in future releases. Spark Connect is a young technology, and important questions remain to be answered, mostly related to security and multi-tenancy.
Apache Spark Connect is a remote procedure call (RPC) server that allows clients to run Spark applications on a remote cluster. Clients can connect to the Spark Connect server using a variety of programming languages, editors and IDEs without needing to install Spark locally.
The Stackable Spark operator can set up Spark Connect servers backed by Kubernetes as a distributed execution engine.
Deployment
The example below demonstrates how to set up a Spark Connect server and apply some customizations.
---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkConnectServer
metadata:
  name: spark-connect (1)
spec:
  image:
    productVersion: "3.5.5" (2)
    pullPolicy: IfNotPresent
  args:
    - "--packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1" (3)
  server:
    podOverrides:
      spec:
        containers:
          - name: spark
            env:
              - name: DEMO_GREETING (4)
                value: "Hello"
    jvmArgumentOverrides:
      add:
        - -Dmy.custom.jvm.arg=customValue (5)
    config:
      logging:
        enableVectorAgent: false
        containers:
          spark:
            custom:
              configMap: spark-connect-log-config (6)
    configOverrides:
      spark-defaults.conf:
        spark.driver.cores: "3" (7)
  executor:
    configOverrides:
      spark-defaults.conf:
        spark.executor.memoryOverhead: "1m" (8)
        spark.executor.instances: "3"
    config:
      logging:
        enableVectorAgent: false
        containers:
          spark:
            custom:
              configMap: spark-connect-log-config
(1) The name of the Spark Connect server.
(2) Version of the Spark Connect server.
(3) Additional package to install when starting the Spark Connect server and executors.
(4) Environment variable created via `podOverrides`. Alternatively, the environment variable can be set in the `spec.server.envOverrides` section.
(5) Additional argument added to the JVM settings of the Spark Connect server. Do not use this to tweak heap settings; use `spec.server.jvmOptions` instead.
(6) A custom log4j configuration file to be used by the Spark Connect server. The config map must have an entry called `log4j.properties`. A minimal sketch of such a config map is shown after this list.
(7) Customize the driver properties in the `server` role. The number of cores here is not related to Kubernetes cores!
(8) Customize `spark.executor.*` and `spark.kubernetes.executor.*` properties in the `executor` role.
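For completeness, here is a minimal sketch of the `spark-connect-log-config` config map referenced above. The logger settings themselves are illustrative assumptions, written in the Log4j 2 properties syntax that Spark 3.5 reads; adjust levels and layout to your needs.

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: spark-connect-log-config
data:
  log4j.properties: |-
    # Illustrative settings: log everything at INFO to the console.
    appender.console.type = Console
    appender.console.name = CONSOLE
    appender.console.target = SYSTEM_ERR
    appender.console.layout.type = PatternLayout
    appender.console.layout.pattern = %d{ISO8601} %p [%t] %c - %m%n
    rootLogger.level = INFO
    rootLogger.appenderRef.console.ref = CONSOLE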
Metrics
The server pod exposes Prometheus metrics at the following endpoints:

- `/metrics/prometheus` for driver instances.
- `/metrics/executors/prometheus` for executor instances.
To customize the metrics configuration, use `spec.server.configOverrides` like this:
spec:
  server:
    configOverrides:
      metrics.properties:
        applications.sink.prometheusServlet.path: "/metrics/applications/prometheus"
The example above adds a new endpoint for application metrics.
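To scrape these endpoints, a plain Prometheus scrape job along the following lines can be used. This is a sketch: the job name and target address are assumptions for illustration, based on the fact that the driver's metrics servlet is served on the Spark UI port, 4040 by default.

scrape_configs:
  - job_name: spark-connect  # hypothetical job name
    metrics_path: /metrics/prometheus
    static_configs:
      # hypothetical service DNS name; 4040 is the default Spark UI port
      - targets: ["spark-connect-server.default.svc.cluster.local:4040"]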
Spark History Server
Unfortunately, integration with the Spark History Server is not supported yet. The Connect server seems to ignore the `spark.eventLog` properties while also preventing clients from setting them programmatically.
Notable Omissions
The following features are not supported by the Stackable Spark operator yet:
- Authorization and authentication. Currently, anyone with access to the Spark Connect service can run jobs.
- Volumes and volume mounts can be added only with pod overrides (see the sketch after this list).
- Job dependencies must be provisioned as custom images or via `--packages` or `--jars` arguments.
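As an illustration of the second point, a volume and its mount can be attached through `podOverrides` on the server role, using standard Kubernetes pod spec fields. The volume name and claim below are hypothetical:

---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkConnectServer
metadata:
  name: spark-connect
spec:
  server:
    podOverrides:
      spec:
        containers:
          - name: spark
            volumeMounts:
              - name: extra-data        # hypothetical volume name
                mountPath: /data
        volumes:
          - name: extra-data
            persistentVolumeClaim:
              claimName: extra-data-pvc # hypothetical existing PVC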