1 parent a76a093, commit 32fd062
Showing 10 changed files with 60 additions and 28 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
## This is an implementation of the Two-Phase Matrix Multiplication algorithm in Spark 2.1.1 with Python 2.7 | ||
Matrix multiplication: a two-phase approach for multiplying very large matrices on the Spark platform. | ||
|
||
## Algorithm: Two-Phase Matrix Multiplication | ||
|
||
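The two phases can be illustrated with a minimal sketch in plain Python (no Spark), assuming the usual two-step formulation: phase one joins entries of A and B on their shared index and emits partial products, and phase two sums the partial products per output cell. The `(row, col, value)` triple layout and the function names `phase_one`, `phase_two`, and `two_phase_multiply` are assumptions made for illustration; the format of `values.txt` and the structure of `TwoPhase_Matrix_Multiplication.py` are not shown here.

```python
from collections import defaultdict


def phase_one(a_entries, b_entries):
    """Join A(i, j) with B(j, k) on the shared index j and emit partial products."""
    a_by_j = defaultdict(list)
    b_by_j = defaultdict(list)
    for i, j, value in a_entries:
        a_by_j[j].append((i, value))
    for j, k, value in b_entries:
        b_by_j[j].append((k, value))

    partial = []
    for j, a_items in a_by_j.items():
        for i, a in a_items:
            # .get avoids creating empty entries for indices missing from B
            for k, b in b_by_j.get(j, []):
                partial.append(((i, k), a * b))
    return partial


def phase_two(partial):
    """Sum the partial products that share the same output cell (i, k)."""
    totals = defaultdict(int)
    for cell, product in partial:
        totals[cell] += product
    return dict(totals)


def two_phase_multiply(a_entries, b_entries):
    """Hypothetical driver: run both phases on sparse (row, col, value) triples."""
    return phase_two(phase_one(a_entries, b_entries))


# Example: A = [[1, 2], [3, 4]], B = [[5, 6], [7, 8]]
A = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
B = [(0, 0, 5), (0, 1, 6), (1, 0, 7), (1, 1, 8)]
product = two_phase_multiply(A, B)
```

On Spark, phase one would roughly correspond to a join keyed on `j` and phase two to a sum per `(i, k)` key; the sketch only shows the data flow.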
## Task: | ||
The task is to implement the SON algorithm in Apache Spark using Python. | ||
Given a set of baskets, the SON algorithm divides them into chunks/partitions and then proceeds in two stages. | ||
First, local frequent itemsets are collected, which form the candidates; | ||
next, it makes a second pass through the data to determine which candidates are globally frequent. | ||
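The two stages described above can be sketched in plain Python (no Spark). This is an illustration under stated assumptions, not the code from this commit: the names `local_frequent` and `son`, the brute-force enumeration of itemsets up to `max_size`, and the proportional scaling of the support threshold per chunk are all choices made for the example.

```python
import math
from collections import Counter
from itertools import combinations


def local_frequent(chunk, local_support, max_size):
    """Stage 1 helper: brute-force count of itemsets inside one chunk."""
    counts = Counter()
    for basket in chunk:
        items = sorted(set(basket))
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return {s for s, c in counts.items() if c >= local_support}


def son(baskets, support, num_chunks=2, max_size=2):
    """Return globally frequent itemsets (as sorted tuples) with their counts."""
    if not baskets:
        return {}
    chunk_len = int(math.ceil(len(baskets) / float(num_chunks)))
    chunks = [baskets[i:i + chunk_len] for i in range(0, len(baskets), chunk_len)]

    # Stage 1: the union of locally frequent itemsets forms the candidate set.
    candidates = set()
    for chunk in chunks:
        # Assumption: scale the global threshold by the chunk's share of the data.
        threshold = support * len(chunk) / float(len(baskets))
        candidates |= local_frequent(chunk, threshold, max_size)

    # Stage 2: second pass over all baskets, counting only the candidates.
    counts = Counter()
    for basket in baskets:
        items = set(basket)
        for candidate in candidates:
            if items.issuperset(candidate):
                counts[candidate] += 1
    return {c: n for c, n in counts.items() if n >= support}


baskets = [["a", "b"], ["a", "c"], ["a", "b", "c"], ["b", "c"]]
frequent = son(baskets, support=3, num_chunks=2)
```

Because an itemset that is globally frequent must be frequent in at least one chunk at the scaled threshold, stage 1 produces no false negatives; stage 2 removes the false positives.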
|
||
#### Usage: bin/spark-submit TwoPhase_Matrix_Multiplication.py <mat-A/values.txt> <mat-B/values.txt> <output.txt> | ||
|
||
|
||
#### Input: Takes two input files, mat-A/values.txt and mat-B/values.txt | ||
|
||
#### Output: Saves all results into one text file. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
## This is an implementation of the TF-IDF algorithm with cosine similarity in Spark 2.1.1 with Python 2.7 | ||
An implementation of TF-IDF with cosine similarity on the Spark platform, used as the similarity measure for K-Means. The implementation of K-Means is provided by Spark in examples/src/main/python/ml/kmeans_example.py. | ||
|
||
## Algorithm: TF-IDF with cosine similarity | ||
|
||
## Task: | ||
The task is to implement TF-IDF with cosine similarity in Apache Spark using Python. | ||
Given a set of vectors, each representing a document, compute the TF-IDF weights and use cosine similarity to cluster the documents. | ||
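As a rough illustration of the task, the sketch below computes TF-IDF weights and cosine similarity in plain Python (no Spark). It assumes one common TF-IDF variant (term frequency normalised by document length, IDF as the natural log of N / document frequency) and represents each document as a hypothetical term-to-count dictionary; the weighting actually used in `kmeans.py` may differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Turn a list of {term: count} dicts into a list of {term: weight} dicts."""
    n_docs = len(docs)
    doc_freq = Counter()
    for counts in docs:
        doc_freq.update(counts.keys())  # each document counts once per term

    weighted = []
    for counts in docs:
        total = float(sum(counts.values()))
        weighted.append({
            term: (count / total) * math.log(n_docs / float(doc_freq[term]))
            for term, count in counts.items()
        })
    return weighted


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Toy corpus (made up for the example)
docs = [
    {"spark": 2, "python": 1},
    {"spark": 1, "matrix": 1},
    {"matrix": 3},
]
weights = tf_idf(docs)
similarity = cosine_similarity(weights[1], weights[2])
```

A term that appears in every document gets an IDF of zero under this variant, so it contributes nothing to the similarity.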
|
||
#### Usage: bin/spark-submit kmeans.py <file> <k> <convergeDist> [outputfile.txt] | ||
k - the number of clusters | ||
convergeDist - the convergence distance/similarity threshold at which the program stops iterating. | ||
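To show how the `k` and `convergeDist` parameters typically interact, here is a minimal plain-Python sketch of a k-means loop that stops once the centers move less than the threshold. It assumes squared Euclidean distance and caller-supplied initial centers for simplicity; the README's variant uses cosine similarity, and the names `closest`, `kmeans`, and `max_iter` are hypothetical.

```python
def closest(point, centers):
    """Index of the center nearest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
    )


def kmeans(points, centers, converge_dist, max_iter=100):
    """Iterate until the total squared movement of the centers is <= converge_dist."""
    centers = [list(c) for c in centers]  # k is the number of initial centers
    for _ in range(max_iter):
        # Assignment step: group points by their nearest center.
        groups = {}
        for point in points:
            groups.setdefault(closest(point, centers), []).append(point)

        # Update step: move each center to the mean of its group.
        moved = 0.0
        for idx, members in groups.items():
            new_center = [sum(col) / float(len(members)) for col in zip(*members)]
            moved += sum((n - o) ** 2 for n, o in zip(new_center, centers[idx]))
            centers[idx] = new_center

        if moved <= converge_dist:
            break
    return centers


points = [[0.0], [1.0], [10.0], [11.0]]
final_centers = kmeans(points, centers=[[0.0], [10.0]], converge_dist=0.00001)
```

A smaller `converge_dist` means more iterations before the loop stops; `max_iter` is only a safety cap added for the sketch.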
|
||
example: bin\spark-submit .\kmeans.py .\docword.enron_s.txt 10 0.00001 kmeans_output.txt | ||
|
||
#### Input: Takes a single input file from a folder | ||
|
||
#### Output: Saves all results into one text file. | ||
|
||
kmeans_output.txt |
This file was deleted.
This file was deleted.
File renamed without changes.
This file was deleted.
This file was deleted.
Binary file not shown.