enrich information

Cheng-Lin-Li · Apr 1, 2018 · c5e7543
1 parent 442259f
Showing 1 changed file with 27 additions and 0 deletions.
diff --git a/TF-IDF_KMeans/README.md b/TF-IDF_KMeans/README.md
@@ -15,7 +15,34 @@ Given a set of vectors to present a document as input, calculating the TF-IDF wi
 
 #### Input: Takes input file from folder as the input
 
+The input file has the following format:
+
+	39861 
+	28102 
+	3710420 
+	1 118 1 
+	1 285 1 
+	1 1229 1 
+	1 1688 1 
+	1 2068 1 
+	…
+The first line is the number of documents in the collection (39861). The second line is the number of words in the vocabulary (28102). Note that the vocabulary contains only words that appear in at least 10 documents. The third line (3710420) is the number of (document, word) pairs in which the word appears in the document, that is, the number of entries that follow.
+
+Starting from the fourth line, each line has the form [document id] [word id] [tf], where tf is the term frequency.
+
+For example, the line `1 118 1` means that word #118 (i.e., the word on line 118 of the vocabulary file) occurs once in document #1.
+
 		
 #### Output: Save all results into one text file. 
 
 kmeans_output.txt
+
+For each final center, output the number of its nonzero values, one per line, as follows:
+
+	87 
+	60 
+	50 
+	56 
+	…
+
+In this example there are 4 clusters in total, and the center of cluster #0 is a sparse vector with 87 nonzero values. The order of the lines doesn't matter.
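
The input and output formats described in this README change can be illustrated with a short Python sketch. It is not taken from the repository: the function names (`read_docword`, `count_nonzero`, `write_counts`) and the dictionary-based sparse representation are assumptions made for illustration, and the TF-IDF weighting and K-Means steps themselves are left out.

```python
import io
from collections import defaultdict


def read_docword(stream):
    """Parse three header lines, then '[document id] [word id] [tf]' entries."""
    n_docs = int(stream.readline())
    n_vocab = int(stream.readline())
    third_header_value = int(stream.readline())  # third header line, kept as-is
    docs = defaultdict(dict)  # document id -> {word id: tf}
    for line in stream:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        doc_id, word_id, tf = map(int, parts)
        docs[doc_id][word_id] = tf
    return n_docs, n_vocab, third_header_value, dict(docs)


def count_nonzero(centers):
    """Count nonzero values in each center, given as {word id: weight} dicts."""
    return [sum(1 for weight in center.values() if weight != 0)
            for center in centers]


def write_counts(centers, out):
    """Write one nonzero count per line to a text stream (e.g. an open file)."""
    for count in count_nonzero(centers):
        out.write(f"{count}\n")


# Usage with the first few lines of the sample input shown in the README.
sample = io.StringIO("39861\n28102\n3710420\n1 118 1\n1 285 1\n1 1229 1\n")
n_docs, n_vocab, third_header_value, docs = read_docword(sample)
```

To produce the output file, `write_counts` could be called with `open("kmeans_output.txt", "w")` and the final centers. Storing each vector as a dictionary keeps only the nonzero entries, which matches the sparse data described above.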