(Currently, mainly for myself, so not yet well organized or documented)
Collection of R libraries and resources for text analysis (mining) and visualization,
including processing of multilingual texts (in UTF-8, specifically Russian and French).
Easy-to-run code is provided, including code for getting texts from the web and splitting them by sections for easier section-by-section analysis.
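As a rough illustration of the section-splitting idea, here is a minimal base-R sketch. It is not the code from this repository: the function name `split_by_sections`, the toy text, and the `CHAPTER` heading pattern are all assumptions made for the example. In practice the lines would more likely come from `readLines()` on a file or URL.

```r
# Toy document (hypothetical); real input might come from readLines().
lines <- c(
  "Some preamble before the first heading.",
  "CHAPTER I",
  "It was a bright cold day.",
  "The clocks were striking.",
  "CHAPTER II",
  "A second chapter follows.",
  "CHAPTER III",
  "And a third one ends the text."
)

# Split a character vector of lines into a named list of sections.
# heading_pattern is an assumed convention; adjust it to the actual text.
split_by_sections <- function(lines, heading_pattern = "^CHAPTER\\s+[IVXLC]+$") {
  is_heading <- grepl(heading_pattern, lines)
  # Running count of headings seen so far gives a section index per line
  section_id <- cumsum(is_heading)
  # Drop the preamble (section_id == 0) and the heading lines themselves
  keep <- section_id > 0 & !is_heading
  sections <- split(lines[keep], section_id[keep])
  # Name each section after its heading line
  names(sections) <- lines[is_heading][as.integer(names(sections))]
  sections
}

sections <- split_by_sections(lines)
print(names(sections))
```

Each element of `sections` can then be analysed on its own, for example with `lapply(sections, ...)`.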
Books:
Text mining packages (used in tidytext book)
- tm
- quanteda
- lexicon?
- qdap
- syuzhet
- https://github.com/trinker/sentimentr (see the vignette: it compares sentimentr, syuzhet, meanr, and Stanford)
DataCamp courses:
- https://www.datacamp.com/courses/intro-to-text-mining-bag-of-words
- https://www.datacamp.com/courses/string-manipulation-in-r-with-stringr
- https://www.datacamp.com/courses/sentiment-analysis-in-r-the-tidy-way - by Julia Silge - Ch1 DONE
Includes:
- various sentiment/emotion analysis techniques.
- runnable code from the udpipe and quanteda vignettes, all redone with Russian texts.
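To give a feel for the simplest kind of sentiment technique mentioned above, here is a minimal lexicon-based scoring sketch in base R. It does not use any of the listed packages: the toy lexicon, its scores, and the function name `score_sentiment` are invented for illustration. Packages such as syuzhet or sentimentr ship real lexicons and handle negation and other subtleties that this sketch ignores.

```r
# Hypothetical toy lexicon: word -> polarity score (not from any real package)
toy_lexicon <- c(good = 1, great = 2, happy = 1, bad = -1, terrible = -2, sad = -1)

# Sum the lexicon scores of all words found in the text.
score_sentiment <- function(text, lexicon) {
  # Lowercase, then split on anything that is not a letter or apostrophe
  tokens <- unlist(strsplit(tolower(text), "[^[:alpha:]']+"))
  tokens <- tokens[nzchar(tokens)]
  # Indexing by name yields NA for words that are not in the lexicon
  matched <- lexicon[tokens]
  sum(matched, na.rm = TRUE)
}

texts <- c(
  "A great day, I am happy.",
  "Terrible weather and a bad mood.",
  "The train leaves at noon."
)
scores <- vapply(texts, score_sentiment, numeric(1),
                 lexicon = toy_lexicon, USE.NAMES = FALSE)
print(scores)
```

A positive score suggests positive wording, a negative score negative wording, and zero means no lexicon word was found.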
Based on:
See also https://github.com/gorodnichy/LA-R-Keras for using neural network (TensorFlow) based techniques for text classification.
Plagiarism detection:
- https://cran.r-project.org/web/packages/RNewsflow/vignettes/RNewsflow.html
- https://cran.r-project.org/web/packages/corpustools/vignettes/corpustools.html
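One common way to flag possibly copied text is to compare documents by their overlapping word n-grams ("shingles"). The base-R sketch below shows that general idea with a Jaccard similarity. It is an assumption-laden illustration, not a description of how RNewsflow or corpustools work internally: the function names `word_shingles` and `jaccard` and the example sentences are made up here.

```r
# Break a text into its set of unique word n-grams (shingles).
word_shingles <- function(text, n = 3) {
  tokens <- unlist(strsplit(tolower(text), "[^[:alnum:]]+"))
  tokens <- tokens[nzchar(tokens)]
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  unique(vapply(starts,
                function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
                character(1)))
}

# Jaccard similarity of two sets: |intersection| / |union|.
jaccard <- function(a, b) {
  if (length(a) == 0 && length(b) == 0) return(0)
  length(intersect(a, b)) / length(union(a, b))
}

original  <- "The quick brown fox jumps over the lazy dog"
copied    <- "A quick brown fox jumps over the lazy cat"
unrelated <- "Text mining in R is fun"

sim_copy      <- jaccard(word_shingles(original), word_shingles(copied))
sim_unrelated <- jaccard(word_shingles(original), word_shingles(unrelated))
print(c(copy = sim_copy, unrelated = sim_unrelated))
```

Values near 1 indicate heavy overlap and values near 0 indicate little shared wording. The n-gram size and any decision threshold would need tuning on real data.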
Data-sets:
- https://cran.r-project.org/web/packages/Rpoet - Wrapper for the 'PoetryDB' API http://poetrydb.org
- https://cran.r-project.org/web/packages/gutenbergr/vignettes/intro.html - Wrapper for http://www.gutenberg.org/
- https://cran.rstudio.com/web/packages/rplos/index.html
You may also find these resources useful:
- The CRAN Natural Language Processing Task View (https://cran.r-project.org/web/views/NaturalLanguageProcessing.html) suggests many R packages related to text mining, especially around the tm package.
- You could match the wikipedia column in gutenberg_author to Wikipedia content with the WikipediR package - https://cran.r-project.org/web/packages/WikipediR/index.html or to pageview statistics with the wikipediatrend package - https://cran.r-project.org/web/packages/wikipediatrend/index.html
- If you’re considering an analysis based on author name, you may find the humaniformat (for extraction of first names) and gender (prediction of gender from first names) packages useful. (Note that humaniformat has a format_reverse function for reversing “Last, First” names.)
- Facebook: Rfacebook provides an interface to the Facebook API. (K)
- Google+: plusser has been designed to facilitate the retrieval of Google+ profiles, pages and posts. It also provides search facilities. Currently a Google+ API key is required for accessing Google+ data.
- YouTube: tuber provides bindings for the YouTube API. Only on GitHub for now. (K)
- RedditExtractoR can retrieve data from the Reddit API.
- LinkedIn: Rlinkedin is an R client for the LinkedIn API.
- Tumblr: tumblR (GitHub) is an R client for the Tumblr API (https://www.tumblr.com/docs/en/api/v2). Tumblr is a microblogging platform and social networking website (https://www.tumblr.com). (K)
- Twitter: RTwitterAPI (not on CRAN) and twitteR provide an interface to the Twitter web API. streamR: This package provides a series of functions that allow R users to access Twitter's filter, sample, and user streams, and to parse the output into data frames. OAuth authentication is supported. (K) Additionally, RKlout is an interface to Klout API v2. It fetches Klout Score for a Twitter Username/handle in real time. Klout is a silly ranking of Twitter influence.
- SocialMediaLab provides a convenient wrapper around many other social media clients and enables the construction of network structures from those data.
- SocialMediaMineR is an analytic tool that returns information about the popularity of a URL on social media sites.
https://cran.r-project.org/web/views/:
- https://cran.r-project.org/web/views/WebTechnologies.html
- https://cran.r-project.org/web/views/NaturalLanguageProcessing.html
- Hands-on: a five-day text mining course for humanists and social scientists in R
- Excellent one: https://tm4ss.github.io/docs/Tutorial_1_Web_scraping.html - https://github.com/tm4ss / https://github.com/tm4ss/tm4ss.github.io
  - Tutorial 1: Web crawling and scraping
  - Tutorial 2: Processing of textual data
  - Tutorial 3: Frequency analysis
  - Tutorial 4: Key term extraction
  - Tutorial 5: Co-occurrence analysis
  - Tutorial 6: Topic Models
  - Tutorial 7: Classification
  - Tutorial 8: Part-of-Speech tagging / Named Entity Recognition
- Another one: https://slcladal.github.io/topicmodels.html#ref-silge2017text - https://github.com/SLCLADAL / https://github.com/SLCLADAL/SLCLADAL.github.io
  - Text Analysis and Distant Reading
  - Concordancing (keywords-in-context)
  - Network Analysis
  - Co-occurrence and Collocation Analysis
  - Topic Modeling
  - Sentiment Analysis
  - Tagging and Parsing
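
Concordancing (keywords-in-context), one of the topics covered by the tutorials above, can be sketched in a few lines of base R. This is only an illustration and is not taken from either tutorial: the function name `kwic_simple`, its `window` parameter, and the sample sentence are assumptions. quanteda provides a full-featured `kwic()` function for real work.

```r
# Return each occurrence of `keyword` with up to `window` words of context
# on either side.
kwic_simple <- function(text, keyword, window = 2) {
  tokens <- unlist(strsplit(tolower(text), "[^[:alpha:]]+"))
  tokens <- tokens[nzchar(tokens)]
  hits <- which(tokens == tolower(keyword))
  data.frame(
    # Last `window` tokens before the hit (fewer near the start of the text)
    left = vapply(hits, function(i)
      paste(tail(tokens[seq_len(i - 1)], window), collapse = " "), character(1)),
    keyword = tokens[hits],
    # First `window` tokens after the hit (fewer near the end of the text)
    right = vapply(hits, function(i)
      paste(head(tokens[-seq_len(i)], window), collapse = " "), character(1)),
    stringsAsFactors = FALSE
  )
}

text <- "The cat sat on the mat and the cat slept"
concordance <- kwic_simple(text, "cat", window = 2)
print(concordance)
```

Each row of the result is one match, so scanning the `left` and `right` columns shows how the keyword is used across the text.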