6 changes: 6 additions & 0 deletions content/en/docs/components/spark-operator/getting-started.md
@@ -79,6 +79,12 @@ This removes all the Kubernetes resources associated with the chart and deletes

See [helm uninstall](https://helm.sh/docs/helm/helm_uninstall) for command documentation.

### Additional Steps to Integrate Jupyter Notebooks

You can integrate Jupyter Notebooks with the Spark Operator to run big data or distributed machine learning jobs with PySpark.

See [Integration with Notebooks](../user-guide/notebooks-spark-operator) for further details.

## Running the Examples

To run the Spark PI example, run the following command:
@@ -1,7 +1,7 @@
---
title: Monitoring Spark Applications with Prometheus and JMX Exporter
description: Using the Spark Operator to set up the JMX Prometheus Exporter and send metrics to Prometheus
weight: 110
weight: 120
---

Spark Operator supports exporting Spark metrics in Prometheus format using the [JMX Prometheus Exporter](https://github.com/prometheus/jmx_exporter). This allows detailed monitoring of your Spark drivers and executors with tools like Prometheus and Grafana.
@@ -0,0 +1,188 @@
---
title: Integration with Notebooks
description: Integrating Jupyter Notebooks with the Spark Operator
weight: 110
---

If you're using Kubeflow Notebooks and want to run big data or distributed machine learning jobs with PySpark, you can now do so directly from your notebook.

The Spark Operator streamlines the deployment of Apache Spark applications on Kubernetes. By integrating it with [Jupyter Enterprise Gateway (JEG)](https://github.com/jupyter-server/enterprise_gateway) and Kubeflow Notebooks, users can now run PySpark workloads at scale directly from a notebook interface, without worrying about the underlying Spark infrastructure.

This integration enables a seamless workflow for data scientists and ML engineers, allowing users to write PySpark code in their notebooks, which is then executed remotely using Kubernetes resources via the Spark Operator and JEG.

## Architecture

The following diagram illustrates how the components work together:

<img src="/docs/images/spark-operator/notebooks-spark.png"
     alt="Architecture diagram showing Kubeflow notebooks integrated with Spark Operator"
     class="mt-3 mb-3 border rounded">

---

## Overview

In a typical Kubeflow setup, users access JupyterLab Notebooks through the central dashboard. These notebooks can now be configured to run PySpark code remotely through kernels managed by Jupyter Enterprise Gateway (JEG).

Behind the scenes:

1. JEG receives execution requests from notebooks.
2. JEG creates and submits `SparkApplication` Custom Resources.
3. The Spark Operator handles the lifecycle of Spark driver and executor pods in Kubernetes.

This architecture enables scalable, elastic execution of big data or distributed ML workloads.

## Prerequisites

- A running Kubeflow deployment with Notebook Controller enabled
- Spark Operator installed and configured in the cluster
- Helm installed locally
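
You can optionally confirm these prerequisites from the command line. The check below is a minimal sketch; it assumes the Kubeflow Spark Operator's default CRD name under the `sparkoperator.k8s.io` API group:

```sh
# Confirm the SparkApplication CRD registered by the Spark Operator is present
kubectl get crd sparkapplications.sparkoperator.k8s.io

# Confirm Helm is available locally
helm version
```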

---

## Step 1: Deploy Enterprise Gateway

This step creates a dedicated Kubernetes namespace (`enterprise-gateway`) and sets up a local PersistentVolume and PersistentVolumeClaim backed by a `hostPath`.

Begin by creating the necessary storage resources. Save the following manifest as `enterprise-gateway-storage.yaml`:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: enterprise-gateway
  labels:
    app: enterprise-gateway
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-kernelspecs
  labels:
    app: enterprise-gateway
spec:
  storageClassName: standard
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/jupyter-gateway/kernelspecs"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-kernelspecs
  namespace: enterprise-gateway
spec:
  storageClassName: standard
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 1Gi
```

Apply it:
```sh
kubectl apply -f enterprise-gateway-storage.yaml
```
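
You can optionally verify that the storage resources were created. The commands below assume the names from the manifest above; depending on your storage class's binding mode, the claim may stay `Pending` until a pod consumes it:

```sh
# Check the namespace, PersistentVolume, and PersistentVolumeClaim defined above
kubectl get namespace enterprise-gateway
kubectl get pv pvc-kernelspecs
kubectl -n enterprise-gateway get pvc pvc-kernelspecs
```
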
Next, deploy Jupyter Enterprise Gateway with support for remote kernel management and persistent kernelspec storage.

Save the following Helm values file as `enterprise-gateway-helm.yaml`; it provides a basic configuration for the gateway.

```yaml
image: elyra/enterprise-gateway:3.2.3
imagePullPolicy: Always
logLevel: DEBUG

kernel:
  shareGatewayNamespace: true
  launchTimeout: 300
  cullIdleTimeout: 3600
  allowedKernels:
    - pyspark
    - python3
  defaultKernelName: pyspark

kernelspecsPvc:
  enabled: true
  name: pvc-kernelspecs

kip:
  enabled: false
  image: elyra/kernel-image-puller:3.2.3
  imagePullPolicy: Always
  pullPolicy: Always
  defaultContainerRegistry: quay.io
```

> **Contributor Author:** We're currently setting the `JUPYTER_GATEWAY_URL` to the service address, which allows us to avoid enabling ingress. What is your current use case that requires ingress? @Shekharrajak
>
> **Member:** Still thinking if this service will be available to call in another namespace: `kubectl -n enterprise-gateway get svc enterprise-gateway -o yaml` shows `type: NodePort`.

Then deploy Enterprise Gateway using Helm:

The command below uses the `enterprise-gateway-helm.yaml` values file with the example configuration shown above.

```sh
helm upgrade --install enterprise-gateway \
  https://github.com/jupyter-server/enterprise_gateway/releases/download/v3.2.3/jupyter_enterprise_gateway_helm-3.2.3.tar.gz \
  --namespace enterprise-gateway \
  --values enterprise-gateway-helm.yaml \
  --create-namespace \
  --wait
```
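
Once the release is installed, you can check that the gateway is running and inspect its logs while testing kernel startup. The resource names below are assumptions based on the release name and namespace used in this guide; adjust them if your chart version names things differently:

```sh
# Check the Enterprise Gateway pod and the service that notebooks will connect to
kubectl -n enterprise-gateway get pods
kubectl -n enterprise-gateway get svc enterprise-gateway

# Follow the gateway logs (deployment name may vary by chart version)
kubectl -n enterprise-gateway logs deployment/enterprise-gateway -f
```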

## Step 2: Configure the Notebook to connect to the Jupyter Gateway

Each user must edit their Kubeflow Notebook custom resource to set the following environment variables, which allow the notebook to connect to the deployed Jupyter Enterprise Gateway.

```yaml
env:
  - name: JUPYTER_GATEWAY_URL
    value: http://enterprise-gateway.enterprise-gateway:8888
  - name: JUPYTER_GATEWAY_REQUEST_TIMEOUT
    value: "120"
```

You can do this with a tool such as Lens, or with the `kubectl` command below.

Replace `<NOTEBOOK_NAME>` with the name of the notebook created in the Kubeflow Notebooks workspace.

```sh
kubectl patch notebook <NOTEBOOK_NAME> \
  -n kubeflow-user-example-com \
  --type='json' \
  -p='[
    {
      "op": "add",
      "path": "/spec/template/spec/containers/0/env",
      "value": [
        {
          "name": "JUPYTER_GATEWAY_URL",
          "value": "http://enterprise-gateway.enterprise-gateway:8888"
        },
        {
          "name": "JUPYTER_GATEWAY_REQUEST_TIMEOUT",
          "value": "120"
        }
      ]
    }
  ]'

```
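
To confirm the patch was applied, you can read the environment variables back from the Notebook resource, using the same placeholder and namespace as above:

```sh
# Print the env section of the notebook container after patching
kubectl -n kubeflow-user-example-com get notebook <NOTEBOOK_NAME> \
  -o jsonpath='{.spec.template.spec.containers[0].env}'
```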

These variables configure JupyterLab to forward kernel execution to JEG, which then runs PySpark jobs via the Spark Operator.

## What Happens Next

Once everything is set up:

- Launch a notebook from the Kubeflow UI
- Select the `pyspark` kernel
- Write and run PySpark code
- Your notebook submits Spark jobs via JEG → Spark Operator → Kubernetes
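
To see this pipeline in action, you can watch the Kubernetes resources involved while a notebook cell runs. This is a rough sketch; with `shareGatewayNamespace: true`, kernel and Spark pods typically land in the `enterprise-gateway` namespace, but the exact namespaces depend on your kernelspec configuration:

```sh
# List the SparkApplication custom resources submitted on your behalf
kubectl get sparkapplications --all-namespaces

# Watch driver and executor pods appear while a notebook cell executes
kubectl -n enterprise-gateway get pods --watch
```
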
> **Member:** Can you please add debugging using the Spark UI as well?
