added functions to preprocess; corrected terminology
Showing 28 changed files with 1,316 additions and 483 deletions.
DESCRIPTION
@@ -1,27 +1,21 @@
Package: LDAtools
Title: Tools to fit a topic model using Latent Dirichlet Allocation (LDA)
Version: 0.1
Authors@R: c(person("Carson", "Sievert", role = "aut", email = "[email protected]"),
    person("Kenny", "Shirley", role = c("aut", "cre"), email = "[email protected]"))
Authors@R: c(person("Carson", "Sievert", role = "aut", email =
    "[email protected]"), person("Kenny", "Shirley", role = c("aut", "cre"),
    email = "[email protected]"))
Description: This package implements a collapsed Gibbs Sampler algorithm to fit
    a topic model to a set of unstructured text documents. It contains three basic groups of
    functions: (1) pre-processing of unstructured text, including
    substitutions, tokenization, and stemming, (2) fitting the Latent Dirichlet
    Allocation (LDA) topic model to training data and making model-based
    predictions on test data, and (3) visualizing and summarizing the fitted
    model.
    a topic model to a set of unstructured text documents. It contains three
    basic groups of functions: (1) pre-processing of unstructured text,
    including substitutions, tokenization, and stemming, (2) fitting the Latent
    Dirichlet Allocation (LDA) topic model to training data and making
    model-based predictions on test data, and (3) visualizing and summarizing
    the fitted model.
Depends:
    R (>= 2.15),
    SnowballC
License: MIT
License:
Imports:
    foreach
License:
Suggests:
    shiny
Collate:
    'help.r'
    'preprocess.R'
    'fitLDA.R'
    'postprocess.R'
    'utils.R'
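A minimal sketch of the three-step workflow the Description outlines. The function names (preprocess, fitLDA, plotLoglik) come from the package NAMESPACE, but the argument names, return-value structure, and toy data used below are assumptions for illustration only; consult the package help pages for the actual signatures.

library(LDAtools)

# Toy corpus of raw text documents (made up for this sketch).
docs <- c("the cat sat on the mat", "the dog chased the cat")

# (1) Pre-process raw text into token/document indices and a vocabulary.
#     The argument name and returned elements here are hypothetical.
prep <- preprocess(docs)

# (2) Fit the LDA topic model via the collapsed Gibbs sampler.
#     The argument names (term/doc IDs, k topics) are hypothetical.
fit <- fitLDA(prep$term.id, prep$doc.id, k = 2)

# (3) Summarize / visualize the fitted model.
plotLoglik(fit)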
NAMESPACE
@@ -1,19 +1,22 @@
export(KL)
export(bigram.table)
export(collapse.bigrams)
export(ent)
export(fitLDA)
export(flag.exact)
export(flag.partial)
export(getProbs)
export(jsviz)
export(lu)
export(norm)
export(normalize)
export(perplexity.bounds)
export(plotLoglik)
export(plotTokens)
export(predictLDA)
export(preprocess)
export(preprocess.newdocs)
export(remap.terms)
export(su)
export(sum.na)
export(token.rank)
export(topdocs)
useDynLib(LDAtools)
useDynLib(LDAviz)
R/help.r
@@ -1,5 +1,5 @@
#' LDAviz
#' LDAtools
#'
#' @name LDAviz
#' @name LDAtools
#' @docType package
NULL
(Large diff for one file not rendered.)

@@ -4,6 +4,6 @@
\alias{LDAviz-package}
\title{LDAviz}
\description{
LDAviz
LDAviz
}
man/bigram.table.Rd
@@ -0,0 +1,43 @@
\name{bigram.table}
\alias{bigram.table}
\title{Compute table of bigrams}
\usage{
bigram.table(term.id = integer(), doc.id = integer(), vocab = character(),
  n = integer())
}
\arguments{
\item{term.id}{an integer vector containing the term ID
number of every token in the corpus. Should take values
between 1 and W, where W is the number of terms in the
vocabulary.}

\item{doc.id}{an integer vector containing the document
ID number of every token in the corpus. Should take
values between 1 and D, where D is the total number of
documents in the corpus.}

\item{vocab}{a character vector of length W, containing
the terms in the vocabulary. This vector must align with
\code{term.id}, such that a term.id of 1 indicates the
first element of \code{vocab}, a term.id of 2 indicates
the second element of \code{vocab}, etc.}

\item{n}{an integer specifying how large the bigram table
should be. The function will return the top n most
frequent bigrams. This argument is here because the
number of bigrams can be as large as W^2.}
}
\value{
a data frame with three columns and \code{n} rows,
containing the bigrams (column 2), their frequencies
(column 3), and their rank in decreasing order of frequency
(column 1). The table is sorted by default in decreasing
order of frequency.
}
\description{
This function counts the bigrams in the data. It is based on
the vector of term IDs and document IDs -- that is, the
vocabulary has already been established, and this function
simply counts occurrences of consecutive terms in the data.
}
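A toy illustration of the interface documented above, assuming the package is attached. The corpus values are made up for the example, and the exact column names of the returned data frame may differ from what the comments suggest.

library(LDAtools)

# Toy corpus: two documents over a three-term vocabulary.
vocab   <- c("topic", "model", "fit")
term.id <- c(1, 2, 3, 1, 2)   # tokens: topic model fit | topic model
doc.id  <- c(1, 1, 1, 2, 2)

# Request the top 5 most frequent bigrams; "topic-model" occurs twice.
bigram.table(term.id = term.id, doc.id = doc.id, vocab = vocab, n = 5)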
man/collapse.bigrams.Rd
@@ -0,0 +1,51 @@
\name{collapse.bigrams}
\alias{collapse.bigrams}
\title{Replace specified bigrams with terms representing the bigrams}
\usage{
collapse.bigrams(bigrams = character(), doc.id = integer(),
  term.id = integer(), vocab = character())
}
\arguments{
\item{bigrams}{A character vector, each element of which
is a bigram represented by two terms separated by a
hyphen, such as 'term1-term2'. Every consecutive
occurrence of 'term1' and 'term2' in the data will be
replaced by a single token representing this bigram.}

\item{doc.id}{an integer vector containing the document
ID number of every token in the corpus. Should take
values between 1 and D, where D is the total number of
documents in the corpus.}

\item{term.id}{an integer vector containing the term ID
number of every token in the corpus. Should take values
between 1 and W, where W is the number of terms in the
vocabulary.}

\item{vocab}{a character vector of length W, containing
the terms in the vocabulary. This vector must align with
\code{term.id}, such that a term.id of 1 indicates the
first element of \code{vocab}, a term.id of 2 indicates
the second element of \code{vocab}, etc.}
}
\value{
Returns a list of length three. The first element,
\code{new.vocab}, is a character vector containing the new
vocabulary. The second element, \code{new.term.id}, is the
new vector of term ID numbers for all tokens in the data,
taking integer values from 1 to the length of the new
vocabulary. The third element is \code{new.doc.id}, which
is the new version of the document ID vector. If any of the
specified bigrams were present in the data, then
\code{new.term.id} and \code{new.doc.id} will be shorter
vectors than the original \code{term.id} and \code{doc.id}
vectors.
}
\description{
After tokenization, use this function to replace all
occurrences of a given bigram with a single token
representing the bigram, and 'delete' the occurrences of
the two individual tokens that comprised the bigram (so
that it is still a generative model for text).
}
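A small sketch of the interface documented above, reusing the toy corpus from the bigram.table example. The package is assumed to be attached, and the data are made up for illustration.

library(LDAtools)

# Toy corpus in which the bigram "topic-model" occurs in both documents.
vocab   <- c("topic", "model", "fit")
term.id <- c(1, 2, 3, 1, 2)
doc.id  <- c(1, 1, 1, 2, 2)

collapsed <- collapse.bigrams(bigrams = "topic-model", doc.id = doc.id,
                              term.id = term.id, vocab = vocab)

# Per the documented return value: a new vocabulary plus shorter
# term/doc ID vectors, since each bigram occurrence becomes one token.
collapsed$new.vocab
collapsed$new.term.id
collapsed$new.doc.id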