GitHub

#Seer Readme

Notes: Syntatically and semantically unbiased; core features will be based purely on statistics. Language-specific functions will be contained to modules.

###Current Priority

Figure out how to collect a large sample of feed entries (n=5000). Currently most feeds only show latest 10 entries, how to get more entries?

###Usage: Rails environment is basically setup but all the "real" code is currently in /ruby for testing.

####samplegetter.rb run from the command line with:

$ ruby ./ruby/samplegetter.rb

This pulls from an array of feeds from ruby/samplefeeds.txt and saves their entries' contents as .txt files to ruby/testdata

####textsplitter.rb run in the command line with:

$ ruby ./ruby/textsplitter.rb

Takes the .txt files created by samplegetter.rb (ruby/testdata/*.txt) and analyzes them. Prints each unique word and its frequency, mean, and standard deviation, and saves all this output to ruby/testdata.csv. Estimates which words should belong on the blacklist by z-score and outputs the list to ruby/blacklist.txt.

###Unresolved Issues

Alternate spellings & misspellings
Acronyms
Special characters
Plurals
Conjugations

###Future Features

Get full posts from feeds that only show summaries (like: http://fulltextrssfeed.com/ & http://www.wizardrss.com/)

###Development Roadmap

Feedzirra error handling (i.e. can't reach server)
Proper merging of sample means into population moving average
Ensure feed entries have good content length (i.e. how to detect that it's the full post and not just a summary/one-liner?)
Word & feed trial periods. Start with some training data or just a training period?
Figure out/test automated blacklisting
Implement code into rails
Finish basic feed parsing & analysis functions, test
Determine ideal p threshold. If this is even the best approach?
Automated feed pruning
Develop "conceptual maps"/"relational maps" for words using Wikipedia database, and weighting based on those maps
Incorporating presence of HTML tags in weighting word importance?
Feed weighting according to significant correlation with "trending" words?
Automated article recommendations based on trending words
Automated feed discovery
Phrase detection with markov chains (or a better approach)

###Considerations

It may be worthwhile to NOT include a spellchecker. Misspellings are likely to be statistically insignifcant, and would likely mean that some words are mistakenly flagged as misspelled.
For plurals, use Ruby's built-in pluralization engine. It is not perfect but may suffice
For conjugations, consider using: https://github.com/ged/linguistics
So far, the biggest contributor to program time is downloading all the entries. With enough feeds to check, it may not be feasible to check multiple categories each hour.

##Classes

Word => { 
	belongs_to => category,
	name => string,
	history => serialized text => float array,
	pop_mean => float,
	pop_sd => float,
	alert => int,
	category_id => int,
	hour_count => int array (TEMPORARY, NOT AN ATTRIBUTE)
}

Category => {
	has_many => [feeds, words],
	name => string,
	iterations => integer,
	alert_threshold => float,
	word_settings => serialized text => hash [word trial length, blacklist z-score],
	feed_settings => serialized text => hash [feed trial length, minimum post rate, prune threshold],
	word_stats => serialized text => hash [pop mean, pop standard deviation],
	blacklist => serialized text => set (of words), 
	map (moving average parameters) => serialized text => hash [type, subset size, alpha coefficient]
}

Feed => {
	belongs_to => category,
	has_many => feed_entries,
	name => string,
	url => string,
	alert => int,
	pop_mean => float,
	pop_sd => float,
	sample_mean => float,
	history => serialized text => int array,
	category_id => int
}

Feed Entry => {
	belongs_to => feed,
	name => string,
	summary => text,
	content => text,
	url => string,
	published_at => datetime,
	guid => int,
	feed_id => int
}

Name		Name	Last commit message	Last commit date
Latest commit History 43 Commits
app		app
config		config
db		db
doc		doc
fulltext		fulltext
lib		lib
log		log
public		public
ruby		ruby
script		script
ssri-seer		ssri-seer
test		test
vendor		vendor
.gitignore		.gitignore
Gemfile		Gemfile
Gemfile.lock		Gemfile.lock
README		README
README.md		README.md
Rakefile		Rakefile
config.ru		config.ru

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About

Releases

Packages

Languages

ewchow/Seer

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages