ewchow/Seer

#Seer Readme

Notes: Syntactically and semantically unbiased; core features will be based purely on statistics. Language-specific functions will be confined to modules.

###Current Priority

  • Figure out how to collect a large sample of feed entries (n=5000). Most feeds currently expose only their latest 10 entries, so a way to retrieve older entries is still needed.

###Usage: The Rails environment is basically set up, but all the "real" code is currently in /ruby for testing.

####samplegetter.rb run from the command line with:

$ ruby ./ruby/samplegetter.rb

This reads the list of feeds in ruby/samplefeeds.txt and saves each entry's content as a .txt file in ruby/testdata.

####textsplitter.rb run from the command line with:

$ ruby ./ruby/textsplitter.rb

Takes the .txt files created by samplegetter.rb (ruby/testdata/*.txt) and analyzes them. Prints each unique word and its frequency, mean, and standard deviation, and saves all this output to ruby/testdata.csv. Estimates which words should belong on the blacklist by z-score and outputs the list to ruby/blacklist.txt.
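
The analysis step described above can be sketched roughly as follows. This is a minimal illustration, not the code in textsplitter.rb: the helper names (`mean_and_sd`, `word_stats`, `blacklist_candidates`), the tokenizing regex, and the choice to z-score each word's total count against the totals of all other words are assumptions. It needs Ruby 2.7+ for `tally`.

```ruby
# Hypothetical helpers; names and tokenization are assumptions, not taken from textsplitter.rb.

# Mean and population standard deviation of a numeric array.
def mean_and_sd(values)
  mean = values.sum.to_f / values.size
  variance = values.sum { |v| (v - mean)**2 } / values.size
  [mean, Math.sqrt(variance)]
end

# docs: array of strings, one per feed entry.
# Returns { word => { total:, mean:, sd: } }, where mean and sd are per-entry counts.
def word_stats(docs)
  counts = docs.map { |text| text.downcase.scan(/[a-z']+/).tally }
  vocabulary = counts.flat_map(&:keys).uniq
  vocabulary.to_h do |word|
    per_doc = counts.map { |c| c.fetch(word, 0) }
    mean, sd = mean_and_sd(per_doc)
    [word, { total: per_doc.sum, mean: mean, sd: sd }]
  end
end

# Words whose total count lies more than z_cutoff standard deviations
# above the average total count across the whole vocabulary.
def blacklist_candidates(stats, z_cutoff)
  totals = stats.values.map { |s| s[:total] }
  mean, sd = mean_and_sd(totals)
  return [] if sd.zero?

  stats.select { |_, s| (s[:total] - mean) / sd > z_cutoff }.keys
end

docs = ["the cat and the dog", "the bird", "the the fish"]
blacklist_candidates(word_stats(docs), 2.0) # => ["the"]
```

In this toy sample "the" appears 5 times while every other word appears once, so it is the only word above the cutoff. The cutoff value itself is arbitrary here.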


###Unresolved Issues

  • Alternate spellings & misspellings
  • Acronyms
  • Special characters
  • Plurals
  • Conjugations

###Future Features

###Development Roadmap

  • Feedzirra error handling (e.g. can't reach server)
  • Proper merging of sample means into population moving average
  • Ensure feed entries have good content length (i.e. how to detect that it's the full post and not just a summary/one-liner?)
  • Word & feed trial periods. Start with some training data or just a training period?
  • Figure out/test automated blacklisting
  • Implement code into rails
  • Finish basic feed parsing & analysis functions, test
  • Determine the ideal p threshold, and whether this is even the best approach
  • Automated feed pruning
  • Develop "conceptual maps"/"relational maps" for words using Wikipedia database, and weighting based on those maps
  • Incorporating presence of HTML tags in weighting word importance?
  • Feed weighting according to significant correlation with "trending" words?
  • Automated article recommendations based on trending words
  • Automated feed discovery
  • Phrase detection with markov chains (or a better approach)
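
For the roadmap item on merging sample means into the population statistics, one possible approach is to keep a running summary (count, mean, sum of squared deviations) and combine summaries with the standard pooled formula, so the raw observations never need to be revisited. The sketch below is an assumption about how this could be done, not the project's method; `RunningStats` is a hypothetical name. It weights every observation equally, so a true moving average that discounts older data would need a different update.

```ruby
# Hypothetical running summary: observation count, mean, and
# sum of squared deviations from the mean (m2).
RunningStats = Struct.new(:n, :mean, :m2) do
  # Build a summary from raw sample values.
  def self.from_sample(values)
    return new(0, 0.0, 0.0) if values.empty?

    n = values.size
    mean = values.sum.to_f / n
    m2 = values.sum { |v| (v - mean)**2 }
    new(n, mean, m2)
  end

  # Population standard deviation.
  def sd
    n.zero? ? 0.0 : Math.sqrt(m2 / n)
  end

  # Combine two summaries as if their samples had been pooled.
  def merge(other)
    return other if n.zero?
    return self if other.n.zero?

    total = n + other.n
    delta = other.mean - mean
    RunningStats.new(total,
                     mean + delta * other.n / total,
                     m2 + other.m2 + delta**2 * n * other.n / total)
  end
end

population = RunningStats.from_sample([1, 2, 3, 4])
sample     = RunningStats.from_sample([10, 20])
merged     = population.merge(sample)
# merged.mean and merged.sd match the statistics of [1, 2, 3, 4, 10, 20]
```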

###Considerations

  • It may be worthwhile to NOT include a spellchecker. Misspellings are likely to be statistically insignificant, and a spellchecker would likely flag some correctly spelled words as misspelled.
  • For plurals, use Rails' built-in inflector (ActiveSupport's pluralize/singularize). It is not perfect but may suffice.
  • For conjugations, consider using: https://github.com/ged/linguistics
  • So far, the biggest contributor to program time is downloading all the entries. With enough feeds to check, it may not be feasible to check multiple categories each hour.
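
Since downloading is I/O-bound, one way to reduce the time it takes might be to fetch feeds from a small pool of threads. The following is only a sketch under that assumption: `fetch_all` is a hypothetical helper, and the actual download is passed in as a block so that nothing is assumed about the feed library's API. Errors are stored per URL so that one unreachable server does not stop the rest.

```ruby
# Hypothetical helper: run the given block once per URL on a pool of worker threads.
# Returns { url => result }, where result is the block's return value,
# or the exception object if the block raised.
def fetch_all(urls, workers: 4, &fetcher)
  queue = Queue.new
  urls.each { |url| queue << url }
  queue.close # once the queue is empty, pop returns nil instead of blocking

  results = {}
  lock = Mutex.new

  threads = Array.new(workers) do
    Thread.new do
      while (url = queue.pop)
        result = begin
          fetcher.call(url)
        rescue StandardError => e
          e # keep the error alongside the successful results
        end
        lock.synchronize { results[url] = result }
      end
    end
  end

  threads.each(&:join)
  results
end

# Usage with a stand-in block; a real caller would download and parse the feed here.
results = fetch_all(%w[feed-a feed-b], workers: 2) { |url| url.upcase }
```

Whether this helps in practice depends on how many feeds there are and how the servers respond, so it would need to be measured against the sequential version.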

##Classes

Word => { 
	belongs_to => category,
	name => string,
	history => serialized text => float array,
	pop_mean => float,
	pop_sd => float,
	alert => int,
	category_id => int,
	hour_count => int array (TEMPORARY, NOT AN ATTRIBUTE)
}
Category => {
	has_many => [feeds, words],
	name => string,
	iterations => integer,
	alert_threshold => float,
	word_settings => serialized text => hash [word trial length, blacklist z-score],
	feed_settings => serialized text => hash [feed trial length, minimum post rate, prune threshold],
	word_stats => serialized text => hash [pop mean, pop standard deviation],
	blacklist => serialized text => set (of words), 
	map (moving average parameters) => serialized text => hash [type, subset size, alpha coefficient]
}
Feed => {
	belongs_to => category,
	has_many => feed_entries,
	name => string,
	url => string,
	alert => int,
	pop_mean => float,
	pop_sd => float,
	sample_mean => float,
	history => serialized text => int array,
	category_id => int
}
Feed Entry => {
	belongs_to => feed,
	name => string,
	summary => text,
	content => text,
	url => string,
	published_at => datetime,
	guid => int,
	feed_id => int
}
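
To illustrate how the Word and Category fields might work together, here is a plain-Ruby sketch that avoids Rails entirely. Several things are assumptions and not confirmed by the schema: that `alert_threshold` is a z-score cutoff, that `alert` counts how many times the cutoff was exceeded, and that `alpha` is the coefficient of an exponential moving average taken from the category's moving average parameters. `record_observation` is a hypothetical helper.

```ruby
# Plain-Ruby stand-ins for the records; field names follow the schema above.
Word = Struct.new(:name, :history, :pop_mean, :pop_sd, :alert, keyword_init: true)
Category = Struct.new(:name, :alert_threshold, :alpha, keyword_init: true)

# Record one new observation (e.g. an hourly frequency) for a word.
# ASSUMPTIONS: alert_threshold is a z-score cutoff, alert is a counter,
# and alpha is an exponential-moving-average coefficient.
# Returns the z-score of the observation against the previous statistics.
def record_observation(word, category, value)
  z = word.pop_sd.zero? ? 0.0 : (value - word.pop_mean) / word.pop_sd
  word.alert += 1 if z > category.alert_threshold
  word.history << value

  # Exponentially weighted update of the mean and variance.
  delta = value - word.pop_mean
  word.pop_mean += category.alpha * delta
  variance = (1 - category.alpha) * (word.pop_sd**2 + category.alpha * delta**2)
  word.pop_sd = Math.sqrt(variance)
  z
end

category = Category.new(name: "tech", alert_threshold: 3.0, alpha: 0.5)
word = Word.new(name: "ruby", history: [], pop_mean: 2.0, pop_sd: 1.0, alert: 0)
record_observation(word, category, 6.0) # => 4.0, and word.alert becomes 1
```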
