---
meta:
  title: Distributed Data Lab FAQ
  description: Discover Scaleway Distributed Data Lab powered by Apache Spark, and how to use it.
content:
  h1: Distributed Data Lab FAQ
dates:
  validation: 2024-07-31
category: managed-services
productIcon: DistributedDataLabProductIcon
---

## How can I register for the Distributed Data Lab private beta?

You can request access to the Distributed Data Lab private beta by email via the Scaleway betas page.

## What is Apache Spark?

Apache Spark is an open-source unified analytics engine designed for large-scale data processing. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Spark offers high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs.
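
As a minimal sketch of what the Python API looks like, here is a classic word count run on a local machine. The input path `input.txt` is a placeholder, and this assumes the `pyspark` package is installed (`pip install pyspark`):

```python
# Minimal PySpark example: count word frequencies in a local text file.
from pyspark.sql import SparkSession

# Create (or reuse) a local Spark session; on a real cluster, the master
# URL would point at the cluster manager instead of local[*].
spark = SparkSession.builder.appName("word-count").master("local[*]").getOrCreate()

# Read lines, split them into words, and count occurrences in parallel.
lines = spark.read.text("input.txt").rdd.map(lambda row: row[0])  # "input.txt" is a placeholder path
counts = (
    lines.flatMap(lambda line: line.split())
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)
)

print(counts.collect())
spark.stop()
```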

## How does Apache Spark work?

Apache Spark processes data in memory, which allows it to perform tasks up to 100 times faster than traditional disk-based processing frameworks like Hadoop MapReduce. It uses Resilient Distributed Datasets (RDDs) to store data across multiple nodes in a cluster and perform parallel operations on this data.
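
The short sketch below illustrates both ideas with PySpark: `parallelize` distributes a dataset across partitions that can be processed concurrently, and `cache()` keeps the computed RDD in memory so later actions avoid recomputation. The partition count and dataset are illustrative:

```python
# Illustrative sketch of RDD parallelism and in-memory caching.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").master("local[4]").getOrCreate()
sc = spark.sparkContext

# Distribute a dataset across 4 partitions; each partition can be
# processed by a different worker in parallel.
rdd = sc.parallelize(range(1_000_000), numSlices=4)

# cache() keeps the result in memory after the first computation,
# so subsequent actions reuse it instead of recomputing.
squares = rdd.map(lambda x: x * x).cache()

print(squares.count())  # first action: computes and caches the RDD
print(squares.sum())    # second action: served from memory
spark.stop()
```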

## How am I billed for Distributed Data Lab?

Distributed Data Lab is billed based on two factors, as illustrated in the sketch after this list:

- the main node configuration selected
- the worker node configuration selected, and the number of worker nodes in the cluster
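
As a purely hypothetical arithmetic sketch of how these two factors combine (the rates below are made-up placeholders, not Scaleway prices):

```python
# Hypothetical cost sketch for a Data Lab cluster.
MAIN_NODE_RATE = 0.10    # placeholder rate per hour for the main node (not a real price)
WORKER_NODE_RATE = 0.08  # placeholder rate per hour per worker node (not a real price)
n_workers = 4

hourly_cost = MAIN_NODE_RATE + WORKER_NODE_RATE * n_workers
print(f"Estimated hourly cost: {hourly_cost:.2f}")  # 0.42 with these placeholder rates
```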

## Can I upscale or downscale a Distributed Data Lab?

Yes, you can upscale a Data Lab cluster to distribute your workloads across a greater number of worker nodes for faster processing. You can also scale it down to zero to reduce costs, while retaining your configuration and context.

You can still access the notebook of a Data Lab cluster that has zero worker nodes, but you cannot run any computations. You can resume your cluster's activity by provisioning at least one worker node.