[SPARK-10032] [PYSPARK] [DOC] Add Python example for mllib LDAModel user guide

Add Python example for mllib LDAModel user guide

Author: Yanbo Liang <[email protected]>

Closes apache#8227 from yanboliang/spark-10032.
yanboliang authored and mengxr committed Aug 18, 2015
1 parent f4fa61e commit 747c2ba
Showing 1 changed file with 28 additions and 0 deletions.
28 changes: 28 additions & 0 deletions docs/mllib-clustering.md
@@ -564,6 +564,34 @@ public class JavaLDAExample {
{% endhighlight %}
</div>

<div data-lang="python" markdown="1">
{% highlight python %}
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors

# Load and parse the data
data = sc.textFile("data/mllib/sample_lda_data.txt")
parsedData = data.map(lambda line: Vectors.dense([float(x) for x in line.strip().split(' ')]))
# Index documents with unique IDs
corpus = parsedData.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()

# Cluster the documents into three topics using LDA
ldaModel = LDA.train(corpus, k=3)

# Output topics. Each is a distribution over words (matching word count vectors)
print("Learned topics (as distributions over vocab of " + str(ldaModel.vocabSize()) + " words):")
topics = ldaModel.topicsMatrix()
for topic in range(3):
    print("Topic " + str(topic) + ":")
    for word in range(0, ldaModel.vocabSize()):
        print(" " + str(topics[word][topic]))

# Save and load model
ldaModel.save(sc, "myModelPath")
sameModel = LDAModel.load(sc, "myModelPath")
{% endhighlight %}
</div>

</div>
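The parse-and-index steps in the example above can be sketched in plain Python, without a SparkContext. This is a minimal illustration only: the two sample lines below are hypothetical and merely mirror the space-separated word-count format of `data/mllib/sample_lda_data.txt`, and the list comprehensions stand in for the `map` and `zipWithIndex` RDD operations.

```python
# Hypothetical input lines in the same format as data/mllib/sample_lda_data.txt
lines = ["1 2 6 0 2 3", "1 3 0 1 3 0"]

# parsedData equivalent: one list of word-count floats per document
parsed = [[float(x) for x in line.strip().split(' ')] for line in lines]

# corpus equivalent: zipWithIndex pairs each vector with its index,
# and the map swaps the pair into [id, vector] form expected by LDA.train
corpus = [[doc_id, vec] for doc_id, vec in enumerate(parsed)]

print(corpus[0])  # [0, [1.0, 2.0, 6.0, 0.0, 2.0, 3.0]]
```

This shows why the `map(lambda x: [x[1], x[0]])` swap is needed: `zipWithIndex` emits `(vector, id)`, but the LDA trainer expects the document id first.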

## Streaming k-means
