We support multiple kernels in the code editor.
Python3 is the default kernel. You can prototype and transform small to medium sized datasets with this kernel. Pipelines built with this kernel can be executed in Python environments.
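For example, a transformer block in the Python3 kernel typically operates on a pandas DataFrame. Here is a minimal sketch of such a block; the column names and transformation are hypothetical:

```python
import pandas as pd

if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer


@transformer
def transform(df: pd.DataFrame, *args, **kwargs) -> pd.DataFrame:
    # Hypothetical transformation: drop incomplete rows and add a derived column.
    df = df.dropna()
    df['amount_usd'] = df['amount'] * df['exchange_rate']
    return df
```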
We also support the PySpark kernel for prototyping with large datasets and building pipelines that transform large datasets.
Instructions for running the PySpark kernel:
- Launch the editor with Docker:
  `docker run -it -p 6789:6789 -v $(pwd):/home/src mageai/mageai mage start [project_name]`
- Specify the PySpark kernel related config in your project's `metadata.yaml` file (see the example config below).
- Launch a remote AWS EMR Spark cluster. Install the `mage_ai` library via bootstrap actions, and make sure the EMR cluster is publicly accessible.
  - You can use the `create_cluster.py` script under the `scripts/spark` folder to launch a new EMR cluster. Example: `python3 create_cluster.py [project_path]`. Make sure your AWS credentials are provided in the `~/.aws/credentials` file or via environment variables (`AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`) when executing the script; an example credentials file is shown after this list.
- Connect to the remote Spark cluster with the command:
  `ssh -i [path_to_key_pair] -L 0.0.0.0:9999:localhost:8998 [master_ec2_public_dns_name]`
  - `path_to_key_pair` is the path to the key file of the `ec2_key_pair_name` configured in the `metadata.yaml` file.
  - Find the `master_ec2_public_dns_name` on your newly created EMR cluster page under the attribute `Master public DNS`.
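For reference, the `~/.aws/credentials` file mentioned above uses the standard AWS credentials format; the values below are placeholders:

```ini
[default]
aws_access_key_id = <your_access_key_id>
aws_secret_access_key = <your_secret_access_key>
```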
Once you finish using the PySpark kernel, remember to terminate the cluster in the AWS console to avoid unnecessary costs.
When using the PySpark kernel, you need to specify an S3 path as the remote variables directory, which is used to store the output of each block. You also need to provide EMR cluster related config fields for cluster creation. These config fields can be set in the project's `metadata.yaml` file. Example:
remote_variables_dir: s3://bucket/path
emr_config:
  master_security_group: 'sg-xxxxxxxxxxxx' # Optional. Uses the EMR-managed security group by default.
  slave_security_group: 'sg-yyyyyyyyyyyy' # Optional. Uses the EMR-managed security group by default.
  master_instance_type: 'r5.4xlarge' # Optional. Default value: r5.4xlarge
  slave_instance_type: 'r5.4xlarge' # Optional. Default value: r5.4xlarge
  # ec2_key_name must be configured during cluster launch to enable SSH access.
  # You can create a key pair at https://console.aws.amazon.com/ec2#KeyPairs and download the key file.
  ec2_key_name: '[ec2_key_pair_name]'
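With the kernel and cluster configured, blocks run in the PySpark kernel operate on Spark DataFrames rather than pandas DataFrames. Below is a minimal sketch of a transformer block, assuming the block receives and returns a `pyspark.sql.DataFrame`; the column names and transformation are hypothetical:

```python
from pyspark.sql import DataFrame
import pyspark.sql.functions as F

if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer


@transformer
def transform(df: DataFrame, *args, **kwargs) -> DataFrame:
    # Hypothetical transformation: drop incomplete rows and add a derived column.
    # The computation runs on the remote EMR Spark cluster.
    return (
        df
        .dropna()
        .withColumn('amount_usd', F.col('amount') * F.col('exchange_rate'))
    )
```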
Pipelines built with this kernel can be executed in PySpark environments.
We support executing PySpark pipelines on an EMR cluster automatically. You'll need to specify the EMR config fields in the project's `metadata.yaml` file (see the `emr_config` example above). We'll then launch an EMR cluster when you execute the pipeline with the `mage run [project_name] [pipeline]` command.
To switch kernels, you first need to follow the steps in the corresponding section above. Then, you can switch kernels through the kernel selection menu in the UI.
If you're switching back to the Python3 kernel, you may need to undo the configuration changes needed for the other kernel types.