Commit

enrich information
Cheng-Lin-Li committed Apr 1, 2018
1 parent 442259f commit c5e7543
Showing 1 changed file with 27 additions and 0 deletions.
27 changes: 27 additions & 0 deletions TF-IDF_KMeans/README.md
@@ -15,7 +15,34 @@ Given a set of vectors to present a document as input, calculating the TF-IDF wi

#### Input: Takes a file from a folder as the input

The input file has the following format:

39861
28102
3710420
1 118 1
1 285 1
1 1229 1
1 1688 1
1 2068 1
The first line is the number of documents in the collection (39861). The second line is the number of words in the vocabulary (28102). Note that the vocabulary only contains the words that appear in at least 10 documents. The third line (3710420) is the number of nonzero (document, word) entries, i.e., the number of data lines that follow. It cannot be a count of distinct words, because it exceeds the vocabulary size.

Starting from the fourth line, each line has the form [document id] [word id] [tf], where tf is the term frequency, i.e., the number of times the word occurs in the document.

For example, the line "1 118 1" means that word #118 (the word id is the word's line number in the vocabulary file) occurs once in document #1.
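The format described above can be read with a few lines of Python. This is only a minimal sketch and not the repository's own loader: the function name `parse_docword` and the nested-dictionary layout are assumptions made for illustration, and the sample is the excerpt shown above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def parse_docword(
    lines: Iterable[str],
) -> Tuple[int, int, int, Dict[int, Dict[int, int]]]:
    """Parse three header lines, then [document id] [word id] [tf] lines.

    Hypothetical helper: returns the three header values unchanged and a
    mapping {document id: {word id: tf}}.
    """
    it = iter(lines)
    num_docs = int(next(it))      # first line: number of documents
    vocab_size = int(next(it))    # second line: vocabulary size
    third_header = int(next(it))  # third line, kept as-is

    docs: Dict[int, Dict[int, int]] = defaultdict(dict)
    for line in it:
        parts = line.split()
        if not parts:
            continue  # tolerate blank lines
        doc_id, word_id, tf = (int(p) for p in parts)
        docs[doc_id][word_id] = tf
    return num_docs, vocab_size, third_header, dict(docs)


# The excerpt from the README, used here as an in-memory sample.
sample = """39861
28102
3710420
1 118 1
1 285 1
1 1229 1
1 1688 1
1 2068 1
""".splitlines()

num_docs, vocab_size, third_header, docs = parse_docword(sample)
print(num_docs, vocab_size, third_header)
print(docs[1])
```

Storing each document as a `{word id: tf}` dictionary keeps the representation sparse, which matters when only a handful of the 28102 vocabulary words occur in any one document.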

#### Output: Save all results into one text file.

kmeans_output.txt

For each final center, output the number of its nonzero values as follows:

87
60
50
56

This means there are 4 clusters in total, and the center of cluster #0 is a sparse vector that has 87 nonzero values. The order of the lines doesn’t matter.
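As a rough illustration of this output step, the sketch below counts the nonzero values of each center and produces one line per center. It assumes that centers are stored as `{word id: weight}` dictionaries; that layout, the helper names, and the two sample centers are made up for this example and are not taken from the repository.

```python
from typing import Dict, List


def count_nonzero(center: Dict[int, float]) -> int:
    """Number of entries in a sparse center whose value is not zero."""
    return sum(1 for value in center.values() if value != 0.0)


def format_output(centers: List[Dict[int, float]]) -> str:
    """One line per center: the count of its nonzero values."""
    return "\n".join(str(count_nonzero(c)) for c in centers) + "\n"


def save_output(centers: List[Dict[int, float]], path: str) -> None:
    """Write the formatted counts to a text file."""
    with open(path, "w") as f:
        f.write(format_output(centers))


# Two made-up centers; an explicit 0.0 entry is not counted.
centers = [
    {118: 0.42, 285: 0.0, 1229: 0.13},
    {1688: 0.5},
]
text = format_output(centers)
print(text, end="")
```

Calling `save_output(centers, "kmeans_output.txt")` would then write the same lines to the output file named above.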
