PySpark Elastic provides python support for Apache Spark's Resilient Distributed Datasets from Elastic Search documents, using Elasticsearch Hadoop within PySpark, both in the interactive shell and in python programs submitted with spark-submit.
Contents:
- Compatibility
- Using with PySpark
- Using with PySpark shell
- Building
- API
- Examples
- Problems / ideas?
- Contributing
PySpark Elastic is tested to be compatible with Spark 1.4, 1.5 and 1.6. Feedback on (in-)compatibility is much appreciated.
PySpark Elastic is tested with Elastic Search 2.2.
PySpark Elastic is tested with Python 2.7 and Python 3.4.
PySpark Elastic is published at Spark Packages. This allows easy usage with Spark through:

```
spark-submit \
	--packages TargetHolding/pyspark-elastic:<version> \
	--conf spark.es.nodes=your,elastic,node,names \
	yourscript.py
```

Or, when using locally built artifacts:

```
spark-submit \
	--jars /path/to/pyspark_elastic-<version>.jar \
	--driver-class-path /path/to/pyspark_elastic-<version>.jar \
	--py-files target/pyspark_elastic_<version>-<python version>.egg \
	--conf spark.es.nodes=your,elastic,node,names \
	--master spark://spark-master:7077 \
	yourscript.py
```

(Note that the --driver-class-path option is required due to SPARK-5185.)
Replace spark-submit with pyspark to start the interactive shell (and don't provide a script as argument), then import PySpark Elastic. Note that when performing this import the sc variable in pyspark is augmented with the esRDD(...) and esJsonRDD(...) methods:
```
import pyspark_elastic
```

For Spark Packages, PySpark Elastic can be compiled using:

```
sbt compile
```

The package can be published locally with:

```
sbt spPublishLocal
```

The package can be published to Spark Packages with (requires authentication and authorization):

```
sbt spPublish
```

A Java / JVM library as well as a python library is required to use PySpark Elastic. They can be built with:

```
make dist
```

This creates 1) a fat jar with the Elasticsearch Hadoop library and additional classes for bridging Spark and PySpark for Elastic Search data, and 2) a python source distribution, at:

```
target/scala-2.10/pyspark-elastic-assembly-<version>.jar
target/pyspark_elastic_<version>-<python version>.egg
```
The PySpark Elastic API aims to stay close to the Java / Scala APIs provided by Elastic Search. Reading its documentation is a good place to start.
An EsSparkContext is very similar to a regular SparkContext. It is created in the same way and can be used to read files, parallelize local data, broadcast a variable, etc. See the Spark Programming Guide for more details. But it exposes additional methods:
- esRDD(resource_read, query, **kwargs): Returns an EsRDD for the resource and query, with the JSON documents from Elastic parsed with `json.loads` (or with `cjson` or `ujson` if available). Arguments which can be provided:
  - `resource` is the index and document type separated by a forward slash (/)
  - `query` is the query string to apply in searching Elastic Search for data for the RDD
  - `**kwargs`: any configuration item listed in the Elastic Search documentation, see also the configuration section below
- esObjRDD(resource_read, query, **kwargs): As esRDD(...), but the RDD contains JSON documents from Elastic parsed with `json.loads`, where each dict is parsed into a pyspark_elastic.types.AttrDict so that objects can be accessed by attribute as well as by key: e.g. sc.esObjRDD(...).first().field.
- esJsonRDD(resource_read, query, **kwargs): As esRDD(...), but the RDD contains JSON documents as strings.
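The attribute-style access offered by esObjRDD(...) can be illustrated with a minimal sketch; this AttrDict is a simplified stand-in for illustration, not the actual pyspark_elastic.types.AttrDict implementation:

```python
import json

class AttrDict(dict):
    # Simplified stand-in for pyspark_elastic.types.AttrDict:
    # attribute lookup falls through to key lookup.
    __getattr__ = dict.__getitem__

# object_hook applies AttrDict to every JSON object while parsing,
# so nested objects support attribute access too.
doc = json.loads('{"user": {"name": "alice"}, "title": "hello"}',
                 object_hook=AttrDict)

print(doc.title)      # -> hello
print(doc.user.name)  # -> alice
```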
The configuration options from the Elastic Search documentation can be provided to the methods above without the es. prefix and with underscores instead of dots. The latter allows using normal keywords instead of resorting to constructs such as esRDD(..., **{'es.configuration.option': 'xyz'}); use esRDD(..., configuration_option='xyz') instead.
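The keyword translation described above can be sketched as follows; the helper name to_es_conf is hypothetical, purely for illustration:

```python
def to_es_conf(kwargs):
    # Hypothetical illustration of the keyword translation: underscores
    # become dots and the 'es.' prefix is restored.
    return {'es.' + key.replace('_', '.'): value
            for key, value in kwargs.items()}

print(to_es_conf({'configuration_option': 'xyz', 'nodes': 'host1,host2'}))
```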
PySpark Elastic supports saving arbitrary RDD's to Elastic using:
- rdd.saveToEs(resource, **kwargs): Saves an RDD to resource (which is a / separated index and document type) by dumping the RDD elements using json.dumps.
- rdd.saveJsonToEs(resource, **kwargs): Saves an RDD to resource (which is a / separated index and document type) directly. The RDD must contain strings.
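The difference between the two methods is only in who serializes: saveToEs dumps each element with json.dumps, while saveJsonToEs expects strings that have already been serialized. A cluster-free sketch of that serialization step:

```python
import json

docs = [{'title': x, 'body': x} for x in ['a', 'b', 'c']]

# What saveToEs does per element, conceptually; the resulting strings
# are the kind of input saveJsonToEs expects.
as_json = [json.dumps(d) for d in docs]

print(as_json)
```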
Not yet implemented
Creating a SparkContext with Elastic Search support:

```python
from pyspark import SparkConf
from pyspark_elastic import EsSparkContext

conf = SparkConf() \
	.setAppName("PySpark Elastic Test") \
	.setMaster("spark://spark-master:7077") \
	.set("spark.es.host", "elastic-1")

sc = EsSparkContext(conf=conf)
```

Reading from an index as JSON strings:
```python
rdd = sc.esJsonRDD('test/tweets')
rdd...
```

Reading from an index as deserialized JSON (dicts, lists, etc.):
```python
rdd = sc.esRDD('test/tweets')
rdd...
```

Storing data in Elastic Search:
```python
rdd = sc.parallelize([
	{ 'title': x, 'body': x }
	for x in ['a', 'b', 'c']
])
rdd.saveToEs('test/docs')
```

Feel free to use the issue tracker to propose new functionality and/or report bugs.
- Fork it
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create new Pull Request