
Commit
update readme
Cheng-Lin-Li committed Oct 16, 2017
1 parent a76a093 commit 32fd062
Showing 10 changed files with 60 additions and 28 deletions.
2 changes: 1 addition & 1 deletion ALS/README.md
@@ -3,7 +3,7 @@
## Algorithm: Alternating Least Squares (ALS) Algorithm

## Task:
- The task is to modify the parallel implementation of ALS (alternating least squares) algorithm in Spark, so that it takes a utility matrix as the input, and output the root-mean-square deviation (RMSE) into standard output or a file after each iteration. The code for the algorithm is als.py under the <spark-2.1.0 installation directory>/examples/src/main/python.
+ The task is to modify the parallel implementation of the ALS (alternating least squares) algorithm in Spark so that it takes a utility matrix as the input, processes it by UV decomposition, and outputs the root-mean-square error (RMSE) to standard output or a file after each iteration. The code for the algorithm is als.py under the <spark-2.1.0 installation directory>/examples/src/main/python.

#### Usage: bin/spark-submit ALS.py input-matrix n m f k p [output-file]
1. n is the number of rows (users) of the matrix
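The alternate-and-report loop described above can be sketched outside Spark. This is an illustrative NumPy version on a toy dense matrix (the sizes, rank `f`, and ridge constant are invented for the example); it is not the modified als.py:

```python
import numpy as np

# Illustrative ALS on a small dense utility matrix; not Spark's als.py.
rng = np.random.default_rng(0)
n, m, f = 6, 5, 2                  # users, items, latent rank (example values)
R = rng.random((n, m))             # toy utility matrix
U = rng.random((n, f))
V = rng.random((m, f))
lam = 0.1                          # ridge term keeps each solve well-posed

rmses = []
for it in range(5):
    # Fix V, solve every user row u_i = (V^T V + lam*I)^(-1) V^T r_i at once.
    U = np.linalg.solve(V.T @ V + lam * np.eye(f), V.T @ R.T).T
    # Fix U, solve every item row symmetrically.
    V = np.linalg.solve(U.T @ U + lam * np.eye(f), U.T @ R).T
    # RMSE of the reconstruction, reported after each iteration.
    rmse = float(np.sqrt(np.mean((R - U @ V.T) ** 2)))
    rmses.append(rmse)
    print(f"iteration {it}: RMSE = {rmse:.6f}")
```

Each half-iteration is a batched regularized least-squares solve, which is why the RMSE printed after every round is non-increasing in practice.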
17 changes: 17 additions & 0 deletions Matrix_Multiplication/README.md
@@ -0,0 +1,17 @@
## This is an implementation of the Two-Phase Matrix Multiplication algorithm in Spark 2.1.1 with Python 2.7
Matrix Multiplication: a two-phase approach for multiplying huge matrices on the Spark platform

## Algorithm: Matrix Multiplication: Two-Phase approach

## Task:
The task is to implement the two-phase matrix multiplication algorithm in Apache Spark using Python.
Given two matrices stored as (row, column, value) entries, the algorithm proceeds in two phases.
First, the entries of A and B are joined on the shared dimension, producing partial products keyed by the output cell (i, j);
next, a second pass reduces by key, summing the partial products to obtain each entry of the product matrix.

#### Usage: bin/spark-submit TwoPhase_Matrix_Multiplication.py <mat-A/values.txt> <mat-B/values.txt> <output.txt>


#### Input: Takes two folders, mat-A and mat-B, each containing a values.txt file, as the input

#### Output: Saves all results into one text file.
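The two phases can be sketched without Spark, with plain Python dicts standing in for the grouping and reduce-by-key steps (the triples below are made-up example entries, not the repo's values.txt data):

```python
from collections import defaultdict

# Toy (row, col, value) entries: A is 2x2, B is 2x2.
A = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0)]   # entries (i, k, a) of A
B = [(0, 0, 4.0), (0, 1, 5.0), (1, 1, 6.0)]   # entries (k, j, b) of B

# Phase 1: group both matrices by the shared dimension k.
by_k_A = defaultdict(list)
for i, k, a in A:
    by_k_A[k].append((i, a))
by_k_B = defaultdict(list)
for k, j, b in B:
    by_k_B[k].append((j, b))

# Join on k, emitting partial products keyed by the output cell (i, j),
# then (phase 2) reduce by key, summing the partial products.
partials = defaultdict(float)
for k in by_k_A:
    for i, a in by_k_A[k]:
        for j, b in by_k_B.get(k, []):
            partials[(i, j)] += a * b

print(sorted(partials.items()))
```

For these toy matrices A = [[1,2],[3,0]] and B = [[4,5],[0,6]], the reduce step yields C = [[4,17],[12,15]], matching A·B.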
21 changes: 21 additions & 0 deletions MinHash_LSH/README.md
@@ -0,0 +1,21 @@
## This is an implementation of the TF-IDF algorithm with cosine similarity in Spark 2.1.1 with Python 2.7
An implementation of TF-IDF with cosine similarity on the Spark platform, used as the distance measure for K-Means. The implementation of k-means is provided by Spark in examples/src/main/python/ml/kmeans_example.py.

## Algorithm: TF-IDF with cosine similarity

## Task:
The task is to implement TF-IDF with cosine similarity in Apache Spark using Python.
Given a set of vectors, each representing a document, compute the TF-IDF weights and use cosine similarity to cluster the documents.

#### Usage: bin/spark-submit kmeans.py <file> <k> <convergeDist> [outputfile.txt]
k - the number of clusters
convergeDist - the convergence distance/similarity at which the program stops iterating

example: bin\spark-submit .\kmeans.py .\docword.enron_s.txt 10 0.00001 kmeans_output.txt

#### Input: Takes an input file from a folder as the input

#### Output: Saves all results into one text file.

kmeans_output.txt
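As a rough illustration of the measure itself, here is TF-IDF weighting followed by cosine similarity in pure Python (the three toy documents are invented for the example; this is not the Spark job):

```python
import math
from collections import Counter

# Three toy documents as token lists (invented example data).
docs = [["spark", "mapreduce", "spark"],
        ["spark", "kmeans"],
        ["kmeans", "cluster", "cluster"]]

N = len(docs)
# Document frequency: how many documents each term appears in.
df = Counter(term for d in docs for term in set(d))
vocab = sorted(df)

def tfidf(doc):
    # TF-IDF weight: term frequency times log(N / document frequency).
    tf = Counter(doc)
    return [tf[t] / len(doc) * math.log(N / df[t]) for t in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

vecs = [tfidf(d) for d in docs]
print(round(cosine(vecs[0], vecs[1]), 4))
```

Documents sharing weighted terms score above zero; documents with no terms in common score exactly zero, which is what makes the measure usable as a clustering distance.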
21 changes: 21 additions & 0 deletions TF-IDF_KMeans/README.md
@@ -0,0 +1,21 @@
## This is an implementation of the TF-IDF algorithm with cosine similarity in Spark 2.1.1 with Python 2.7
An implementation of TF-IDF with cosine similarity on the Spark platform, used as the distance measure for K-Means. The implementation of k-means is provided by Spark in examples/src/main/python/ml/kmeans_example.py.

## Algorithm: TF-IDF with cosine similarity

## Task:
The task is to implement TF-IDF with cosine similarity in Apache Spark using Python.
Given a set of vectors, each representing a document, compute the TF-IDF weights and use cosine similarity to cluster the documents.

#### Usage: bin/spark-submit kmeans.py <file> <k> <convergeDist> [outputfile.txt]
k - the number of clusters
convergeDist - the convergence distance/similarity at which the program stops iterating

example: bin\spark-submit .\kmeans.py .\docword.enron_s.txt 10 0.00001 kmeans_output.txt

#### Input: Takes an input file from a folder as the input

#### Output: Saves all results into one text file.

kmeans_output.txt
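The `<convergeDist>` stopping rule can be sketched as a plain NumPy k-means loop that assigns points by cosine similarity. This is an illustrative stand-in, not Spark's kmeans_example.py; the point set, dimensionality, and iteration cap are invented:

```python
import numpy as np

# Illustrative k-means with cosine-similarity assignment and a
# convergeDist stopping rule; not the Spark implementation.
rng = np.random.default_rng(1)
points = rng.random((20, 4))           # toy document vectors
k, converge_dist = 3, 1e-5

def unit(M):
    # Row-normalize so the dot product below is cosine similarity.
    return M / np.linalg.norm(M, axis=1, keepdims=True)

centers = points[rng.choice(len(points), k, replace=False)].copy()
for it in range(100):                  # safety cap on iterations
    sims = unit(points) @ unit(centers).T     # cosine similarity matrix
    labels = sims.argmax(axis=1)              # most-similar center wins
    new_centers = np.array([
        points[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
        for c in range(k)])
    shift = float(np.sum((new_centers - centers) ** 2))
    centers = new_centers
    if shift < converge_dist:          # centers stopped moving: converged
        break
print(labels)
```

The loop mirrors the command line above: `k` fixes the number of centers, and iteration stops once the total squared center movement falls below `convergeDist`.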
8 changes: 0 additions & 8 deletions TF-IDF_KMeans/Readme.txt

This file was deleted.

8 changes: 0 additions & 8 deletions TF-IDF_KMeans/command.txt

This file was deleted.

File renamed without changes.
5 changes: 0 additions & 5 deletions TF-IDF_KMeans/hca_output.txt

This file was deleted.

6 changes: 0 additions & 6 deletions TF-IDF_KMeans/sample.txt

This file was deleted.

Binary file removed TF-IDF_KMeans/test.xlsx

