
Commit
update readme
Cheng-Lin-Li committed Oct 16, 2017
1 parent a76a093 commit 32fd062
Showing 10 changed files with 60 additions and 28 deletions.
2 changes: 1 addition & 1 deletion ALS/README.md
@@ -3,7 +3,7 @@
## Algorithm: Alternating Least Squares (ALS) Algorithm

## Task:
- The task is to modify the parallel implementation of ALS (alternating least squares) algorithm in Spark, so that it takes a utility matrix as the input, and output the root-mean-square deviation (RMSE) into standard output or a file after each iteration. The code for the algorithm is als.py under the <spark-2.1.0 installation directory>/examples/src/main/python.
+ The task is to modify the parallel implementation of the ALS (alternating least squares) algorithm in Spark so that it takes a utility matrix as the input, processes it by UV decomposition, and outputs the root-mean-square error (RMSE) to standard output or a file after each iteration. The code for the algorithm is als.py under the <spark-2.1.0 installation directory>/examples/src/main/python.

#### Usage: bin/spark-submit ALS.py input-matrix n m f k p [output-file]
1. n is the number of rows (users) of the matrix
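The alternate-and-report loop described above can be sketched outside Spark. This is an illustrative NumPy version on a toy dense matrix (the sizes, rank `f`, and ridge constant are invented for the example); it is not the modified als.py:

```python
import numpy as np

# Illustrative ALS on a small dense utility matrix; not Spark's als.py.
rng = np.random.default_rng(0)
n, m, f = 6, 5, 2                  # users, items, latent rank (example values)
R = rng.random((n, m))             # toy utility matrix
U = rng.random((n, f))
V = rng.random((m, f))
lam = 0.1                          # ridge term keeps each solve well-posed

rmses = []
for it in range(5):
    # Fix V, solve every user row u_i = (V^T V + lam*I)^(-1) V^T r_i at once.
    U = np.linalg.solve(V.T @ V + lam * np.eye(f), V.T @ R.T).T
    # Fix U, solve every item row symmetrically.
    V = np.linalg.solve(U.T @ U + lam * np.eye(f), U.T @ R).T
    # RMSE of the reconstruction, reported after each iteration.
    rmse = float(np.sqrt(np.mean((R - U @ V.T) ** 2)))
    rmses.append(rmse)
    print(f"iteration {it}: RMSE = {rmse:.6f}")
```

Each half-iteration is a batched regularized least-squares solve, which is why the RMSE printed after every round is non-increasing in practice.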
17 changes: 17 additions & 0 deletions Matrix_Multiplication/README.md
@@ -0,0 +1,17 @@
## This is an implementation of the Two-Phase Matrix Multiplication algorithm in Spark 2.1.1 with Python 2.7
Matrix Multiplication: a two-phase approach for multiplying huge matrices on the Spark platform

## Algorithm: Matrix Multiplication: Two-Phase approach

## Task:
The task is to implement the two-phase matrix multiplication algorithm in Apache Spark using Python.
Given two matrices stored as (row, column, value) entries, the algorithm proceeds in two phases.
First, the entries of A and B are joined on the shared dimension, producing partial products keyed by the output cell (i, j);
next, a second pass reduces by key, summing the partial products to obtain each entry of the product matrix.

#### Usage: bin/spark-submit TwoPhase_Matrix_Multiplication.py <mat-A/values.txt> <mat-B/values.txt> <output.txt>


#### Input: Takes two folders, mat-A and mat-B, each containing a values.txt file, as the input

#### Output: Saves all results into one text file.
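The two phases can be sketched without Spark, with plain Python dicts standing in for the grouping and reduce-by-key steps (the triples below are made-up example entries, not the repo's values.txt data):

```python
from collections import defaultdict

# Toy (row, col, value) entries: A is 2x2, B is 2x2.
A = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0)]   # entries (i, k, a) of A
B = [(0, 0, 4.0), (0, 1, 5.0), (1, 1, 6.0)]   # entries (k, j, b) of B

# Phase 1: group both matrices by the shared dimension k.
by_k_A = defaultdict(list)
for i, k, a in A:
    by_k_A[k].append((i, a))
by_k_B = defaultdict(list)
for k, j, b in B:
    by_k_B[k].append((j, b))

# Join on k, emitting partial products keyed by the output cell (i, j),
# then (phase 2) reduce by key, summing the partial products.
partials = defaultdict(float)
for k in by_k_A:
    for i, a in by_k_A[k]:
        for j, b in by_k_B.get(k, []):
            partials[(i, j)] += a * b

print(sorted(partials.items()))
```

For these toy matrices A = [[1,2],[3,0]] and B = [[4,5],[0,6]], the reduce step yields C = [[4,17],[12,15]], matching A·B.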
21 changes: 21 additions & 0 deletions MinHash_LSH/README.md
@@ -0,0 +1,21 @@
## This is an implementation of the TF-IDF algorithm with cosine similarity in Spark 2.1.1 with Python 2.7
An implementation of TF-IDF with cosine similarity on the Spark platform, used as the distance measure for K-Means. The implementation of k-means is provided by Spark in examples/src/main/python/ml/kmeans_example.py.

## Algorithm: TF-IDF with cosine similarity

## Task:
The task is to implement TF-IDF with cosine similarity in Apache Spark using Python.
Given a set of vectors, each representing a document, compute the TF-IDF weights and use cosine similarity to cluster the documents.

#### Usage: bin/spark-submit kmeans.py <file> <k> <convergeDist> [outputfile.txt]
k - the number of clusters
convergeDist - the convergence distance/similarity at which the program stops iterating

example: bin\spark-submit .\kmeans.py .\docword.enron_s.txt 10 0.00001 kmeans_output.txt

#### Input: Takes an input file from a folder as the input

#### Output: Saves all results into one text file.

kmeans_output.txt
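As a rough illustration of the measure itself, here is TF-IDF weighting followed by cosine similarity in pure Python (the three toy documents are invented for the example; this is not the Spark job):

```python
import math
from collections import Counter

# Three toy documents as token lists (invented example data).
docs = [["spark", "mapreduce", "spark"],
        ["spark", "kmeans"],
        ["kmeans", "cluster", "cluster"]]

N = len(docs)
# Document frequency: how many documents each term appears in.
df = Counter(term for d in docs for term in set(d))
vocab = sorted(df)

def tfidf(doc):
    # TF-IDF weight: term frequency times log(N / document frequency).
    tf = Counter(doc)
    return [tf[t] / len(doc) * math.log(N / df[t]) for t in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

vecs = [tfidf(d) for d in docs]
print(round(cosine(vecs[0], vecs[1]), 4))
```

Documents sharing weighted terms score above zero; documents with no terms in common score exactly zero, which is what makes the measure usable as a clustering distance.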
21 changes: 21 additions & 0 deletions TF-IDF_KMeans/README.md
@@ -0,0 +1,21 @@
## This is an implementation of the TF-IDF algorithm with cosine similarity in Spark 2.1.1 with Python 2.7
An implementation of TF-IDF with cosine similarity on the Spark platform, used as the distance measure for K-Means. The implementation of k-means is provided by Spark in examples/src/main/python/ml/kmeans_example.py.

## Algorithm: TF-IDF with cosine similarity

## Task:
The task is to implement TF-IDF with cosine similarity in Apache Spark using Python.
Given a set of vectors, each representing a document, compute the TF-IDF weights and use cosine similarity to cluster the documents.

#### Usage: bin/spark-submit kmeans.py <file> <k> <convergeDist> [outputfile.txt]
k - the number of clusters
convergeDist - the convergence distance/similarity at which the program stops iterating

example: bin\spark-submit .\kmeans.py .\docword.enron_s.txt 10 0.00001 kmeans_output.txt

#### Input: Takes an input file from a folder as the input

#### Output: Saves all results into one text file.

kmeans_output.txt
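The `<convergeDist>` stopping rule can be sketched as a plain NumPy k-means loop that assigns points by cosine similarity. This is an illustrative stand-in, not Spark's kmeans_example.py; the point set, dimensionality, and iteration cap are invented:

```python
import numpy as np

# Illustrative k-means with cosine-similarity assignment and a
# convergeDist stopping rule; not the Spark implementation.
rng = np.random.default_rng(1)
points = rng.random((20, 4))           # toy document vectors
k, converge_dist = 3, 1e-5

def unit(M):
    # Row-normalize so the dot product below is cosine similarity.
    return M / np.linalg.norm(M, axis=1, keepdims=True)

centers = points[rng.choice(len(points), k, replace=False)].copy()
for it in range(100):                  # safety cap on iterations
    sims = unit(points) @ unit(centers).T     # cosine similarity matrix
    labels = sims.argmax(axis=1)              # most-similar center wins
    new_centers = np.array([
        points[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
        for c in range(k)])
    shift = float(np.sum((new_centers - centers) ** 2))
    centers = new_centers
    if shift < converge_dist:          # centers stopped moving: converged
        break
print(labels)
```

The loop mirrors the command line above: `k` fixes the number of centers, and iteration stops once the total squared center movement falls below `convergeDist`.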
8 changes: 0 additions & 8 deletions TF-IDF_KMeans/Readme.txt

This file was deleted.

8 changes: 0 additions & 8 deletions TF-IDF_KMeans/command.txt

This file was deleted.

File renamed without changes.
5 changes: 0 additions & 5 deletions TF-IDF_KMeans/hca_output.txt

This file was deleted.

6 changes: 0 additions & 6 deletions TF-IDF_KMeans/sample.txt

This file was deleted.

Binary file removed TF-IDF_KMeans/test.xlsx

