******************
Seer Readme
******************
Notes: Syntactically and semantically unbiased; core features will be based purely on statistics. Language-specific functions will be confined to modules.
Current Priority:
-Figure out how to collect a large sample of feed entries (n=5000). Most feeds currently expose only their latest 10 entries; how can we get more?
Usage:
-----------------------------
The Rails environment is basically set up, but all the "real" code currently lives in /ruby for testing.
******************
samplegetter.rb
******************
Run from / (the project root) on the command line:
$ ruby ruby/samplegetter.rb
This reads a list of feeds from ruby/samplefeeds.txt and saves each entry's content as a .txt file in ruby/testdata.
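For reference, a minimal sketch of what this fetch loop might look like with Feedzirra (the one-URL-per-line format of samplefeeds.txt and the output file names are assumptions; the real script may differ):

require 'rubygems'
require 'feedzirra'

# Assumes ruby/samplefeeds.txt lists one feed URL per line.
feed_urls = File.readlines('ruby/samplefeeds.txt').map(&:strip).reject(&:empty?)

feed_urls.each_with_index do |url, i|
  feed = Feedzirra::Feed.fetch_and_parse(url)
  next unless feed.respond_to?(:entries)   # skip feeds that failed to fetch or parse

  feed.entries.each_with_index do |entry, j|
    text = entry.content || entry.summary || ''
    File.open("ruby/testdata/feed#{i}_entry#{j}.txt", 'w') { |f| f.write(text) }
  end
end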
******************
textsplitter.rb
******************
Run from / (the project root) on the command line:
$ ruby ruby/textsplitter.rb
Takes the .txt files created by samplegetter.rb (ruby/testdata/*.txt) and analyzes them: prints each unique word with its frequency, mean, and standard deviation, saves that output to ruby/testdata.csv, estimates by z-score which words belong on the blacklist, and writes that list to ruby/blacklist.txt.
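A rough sketch of the kind of analysis involved (the tokenizer, the statistic the z-score is taken over, and the 2.0 cutoff are all assumptions here, not necessarily what textsplitter.rb does):

require 'csv'

Z_CUTOFF = 2.0   # assumed blacklisting threshold

# Per-document word counts, using a naive tokenizer.
docs = Dir['ruby/testdata/*.txt'].map do |path|
  File.read(path).downcase.scan(/[a-z']+/).each_with_object(Hash.new(0)) { |w, h| h[w] += 1 }
end

vocabulary = docs.flat_map(&:keys).uniq

# For each word: total count, mean count per document, standard deviation.
stats = vocabulary.map do |word|
  counts = docs.map { |d| d[word] }
  mean   = counts.inject(:+).to_f / counts.size
  sd     = Math.sqrt(counts.map { |c| (c - mean)**2 }.inject(:+) / counts.size)
  [word, counts.inject(:+), mean, sd]
end

CSV.open('ruby/testdata.csv', 'w') do |csv|
  csv << %w[word total mean sd]
  stats.each { |row| csv << row }
end

# Blacklist words whose mean per-document count sits far above the corpus-wide average.
means     = stats.map { |_, _, mean, _| mean }
mean_mean = means.inject(:+) / means.size
mean_sd   = Math.sqrt(means.map { |m| (m - mean_mean)**2 }.inject(:+) / means.size)

blacklist = stats.select { |_, _, mean, _| (mean - mean_mean) / mean_sd > Z_CUTOFF }
File.open('ruby/blacklist.txt', 'w') { |f| blacklist.each { |word, *| f.puts word } }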
Unresolved Issues:
-----------------------------
-Alternate spellings & misspellings
-Acronyms
-Special characters
-Plurals
-Conjugations
Future Features:
-----------------------------
-Get full posts from feeds that only show summaries (e.g. via http://fulltextrssfeed.com/ & http://www.wizardrss.com/)
Development Roadmap:
-----------------------------
-Feedzirra error handling (e.g. can't reach the server)
-Proper merging of sample means into the population moving average (see the sketch after this list)
-Word & feed trial periods. Start with some training data or just a training period?
-Finish basic feed parsing & analysis functions, test
-Figure out/test automated blacklisting
-Automated feed pruning
-Develop "conceptual maps"/"relational maps" for words using Wikipedia database, and weighting based on those maps
-Integrate the code into Rails
-Incorporate the presence of HTML tags into word-importance weighting?
-Determine the ideal p threshold (if this is even the best approach)
-Feed weighting according to significant correlation with "trending" words?
-Automated article recommendations based on trending words
-Automated feed discovery
-Phrase detection with markov chains (or a better approach)
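On the "merging of sample means" item above: one standard way to fold a new sample's mean and variance into running population statistics is the pairwise-combination formula (Chan et al.). A sketch, with hypothetical names since nothing like this exists in the repo yet:

# m2 is the running sum of squared deviations from the mean.
PopStats = Struct.new(:n, :mean, :m2) do
  # sample_var is the sample's population variance (sum of squared deviations / sample_n).
  def merge(sample_n, sample_mean, sample_var)
    delta    = sample_mean - mean
    new_n    = n + sample_n
    new_mean = mean + delta * sample_n / new_n.to_f
    new_m2   = m2 + sample_var * sample_n + delta**2 * n * sample_n / new_n.to_f
    PopStats.new(new_n, new_mean, new_m2)
  end

  def variance
    n > 0 ? m2 / n : 0.0
  end
end

stats = PopStats.new(0, 0.0, 0.0)
stats = stats.merge(10, 3.2, 1.5)   # fold in a 10-entry sample
stats = stats.merge(25, 2.7, 0.9)   # then a 25-entry sample
stats.mean                          # running population mean
stats.variance                      # running population variance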
Considerations:
-----------------------------
-It may be worthwhile NOT to include a spellchecker. Misspellings are likely to be statistically insignificant, and a spellchecker would likely flag some valid words as misspelled.
-For plurals, use the pluralization engine that ships with Rails (ActiveSupport's inflector). It is not perfect but may suffice (see the sketch below).
-For conjugations, consider using: https://github.com/ged/linguistics
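As an illustration of the plural-normalization idea (requires the activesupport gem when used outside a full Rails app):

require 'active_support/inflector'

%w[feeds entries analyses].map { |w| ActiveSupport::Inflector.singularize(w) }
# => ["feed", "entry", "analysis"]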