Commit
Precompute vignette, disable examples
To please BDR
Showing 8 changed files with 508 additions and 43 deletions.
@@ -0,0 +1,271 @@
---
title: "The hunspell package: High-Performance Stemmer, Tokenizer, and Spell Checker for R"
date: "`r Sys.Date()`"
output:
  html_document:
    fig_caption: false
    toc: true
    toc_float:
      collapsed: false
      smooth_scroll: false
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{The hunspell package: High-Performance Stemmer, Tokenizer, and Spell Checker for R}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo = FALSE, message = FALSE}
knitr::opts_chunk$set(comment = "")
```

Hunspell is the spell checker library used by LibreOffice, OpenOffice, Mozilla Firefox, Google Chrome, Mac OS-X, InDesign, Opera, RStudio and many others. It provides a system for tokenizing, stemming and spell checking in almost any language or alphabet. The R package exposes the high-level spell checker as well as the low-level stemmers and tokenizers, which analyze or extract individual words from various formats (text, html, xml, latex).

Hunspell uses a special dictionary format that defines which characters, words and conjugations are valid in a given language. The examples below use the (default) `"en_US"` dictionary. However, each function can be used with another language by setting a custom dictionary via the `dict` parameter. See the section on [dictionaries](#hunspell_dictionaries) below.

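For example, since the package also ships an `en_GB` dictionary (see the dictionaries section below), the same check can be run against British English. A minimal sketch, not evaluated here:

```{r, eval = FALSE}
library(hunspell)
hunspell_check(c("colour", "color"))                  # default en_US dictionary
hunspell_check(c("colour", "color"), dict = "en_GB")  # bundled British English dictionary
```
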
## Spell Checking

Spell checking text consists of the following steps:

1. Parse a document by extracting (**tokenizing**) the words that we want to check
2. Analyze each word by breaking it down into its root (**stemming**) and conjugation affix
3. Look up in a **dictionary** whether the word + affix combination is valid for your language
4. (optional) For incorrect words, **suggest** corrections by finding similar (correct) words in the dictionary

We can do each of these steps manually or have Hunspell do them automatically.

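As a rough sketch of the manual route (not evaluated here; the example sentence is only an illustration), the lower-level functions described in the rest of this vignette map onto these steps as follows:

```{r, eval = FALSE}
library(hunspell)
text <- "spell checkers are not neccessairy for langauge ninjas"

# 1. Tokenize: extract the individual words from the text
words <- hunspell_parse(text)[[1]]

# 2. Stem/analyze: break each word into its root and affix
hunspell_analyze(words)

# 3. Dictionary lookup: which word + affix combinations are valid?
ok <- hunspell_check(words)

# 4. Suggest corrections for the invalid ones
hunspell_suggest(words[!ok])
```
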
### Check Individual Words

The `hunspell_check` and `hunspell_suggest` functions test individual words for correctness, and suggest correct words that look similar to a given (incorrect) word.

```{r}
library(hunspell)

# Check individual words
words <- c("beer", "wiskey", "wine")
correct <- hunspell_check(words)
print(correct)

# Find suggestions for incorrect words
hunspell_suggest(words[!correct])
```

### Check Documents

In practice we often want to spell check an entire document at once by searching for incorrect words. This is done using the `hunspell` function:

```{r}
bad <- hunspell("spell checkers are not neccessairy for langauge ninjas")
print(bad[[1]])
hunspell_suggest(bad[[1]])
```

Besides plain text, `hunspell` supports various document formats, such as html or latex:

```{r}
download.file("https://arxiv.org/e-print/1406.4806v1", "1406.4806v1.tar.gz", mode = "wb")
untar("1406.4806v1.tar.gz", "content.tex")
text <- readLines("content.tex", warn = FALSE)
bad_words <- hunspell(text, format = "latex")
sort(unique(unlist(bad_words)))
```

### Check PDF files

Use text extraction from the `pdftools` package to spell check text from PDF files!

```{r, eval = require('pdftools')}
text <- pdftools::pdf_text('https://www.gnu.org/licenses/quick-guide-gplv3.pdf')
bad_words <- hunspell(text)
sort(unique(unlist(bad_words)))
```

### Check Manual Pages

The `spelling` package builds on hunspell and has a wrapper to spell-check manual pages from R packages. Results might contain a lot of false positives for technical jargon, but you might also catch a typo or two. Point it to the root of your source package:

```{r, eval=FALSE}
spelling::spell_check_package("~/workspace/V8")
```
```
  WORD          FOUND IN
  ECMA          V8.Rd:16, description:2,4
  ECMAScript    description:2
  emscripten    description:5
  htmlwidgets   JS.Rd:16
  JSON          V8.Rd:33,38,39,57,58,59,120
  jsonlite      V8.Rd:42
  Ooms          V8.Rd:41,120
  Xie           JS.Rd:26
  Yihui         JS.Rd:26
```

## Morphological Analysis

In order to look up a word in the dictionary, hunspell needs to break it down into a stem (**stemming**) and a conjugation affix. The `hunspell` function does this automatically, but we can also do it manually.

### Stemming Words

The `hunspell_stem` function looks up words from the dictionary which match the root of the given word. Note that the function returns a list, because some words can have multiple matches.

```{r}
# Stemming
words <- c("love", "loving", "lovingly", "loved", "lover", "lovely")
hunspell_stem(words)
```

### Analyzing Words

The `hunspell_analyze` function is similar, but it returns both the stem and the affix syntax of the word:

```{r}
hunspell_analyze(words)
```

## Tokenizing

To support spell checking of documents, Hunspell includes parsers for various document formats, including *text*, *html*, *xml*, *man* and *latex*. The hunspell package also exposes these tokenizers directly, so they can be used for applications other than spell checking.

```{r}
text <- readLines("content.tex", warn = FALSE)
allwords <- hunspell_parse(text, format = "latex")

# Third line (title) only
print(allwords[[3]])
```

### Summarizing Text

In text analysis we often want to summarize text via its stems. For example, we can count words for display in a wordcloud:

```{r}
allwords <- hunspell_parse(janeaustenr::prideprejudice)
stems <- unlist(hunspell_stem(unlist(allwords)))
words <- sort(table(stems), decreasing = TRUE)
print(head(words, 30))
```

Most of these are stop words. Let's filter these out:

```{r}
df <- as.data.frame(words)
df$stems <- as.character(df$stems)
stops <- df$stems %in% stopwords::stopwords(source = "stopwords-iso")
wcdata <- head(df[!stops, ], 150)
print(wcdata, max = 40)
```

```{r}
library(wordcloud2)
names(wcdata) <- c("word", "freq")
wcdata$freq <- (wcdata$freq)^(2/3)
wordcloud2(wcdata)
```

## Hunspell Dictionaries

Hunspell is based on MySpell and is backward-compatible with MySpell and aspell dictionaries. Chances are dictionaries for your language are already available on your system!

A Hunspell dictionary consists of two files:

- The `[lang].aff` file specifies the affix syntax for the language
- The `[lang].dic` file contains a wordlist formatted using the syntax from the aff file

Typically both files are located in the same directory and share the same filename, for example `en_GB.aff` and `en_GB.dic`. The `list_dictionaries()` function lists the available dictionaries in the current directory and in the standard system paths where dictionaries are usually installed.

```{r}
list_dictionaries()
```

The `dictionary` function is then used to load any of these available dictionaries:

```{r}
dictionary("en_GB")
```

If the files are not in one of the standard paths, you can also specify the full path to either or both the `.dic` and `.aff` files:

```{r, eval = FALSE}
dutch <- dictionary("~/workspace/Dictionaries/Dutch.dic")
print(dutch)
```
```
<hunspell dictionary>
affix: /Users/jeroen/workspace/Dictionaries/Dutch.aff
dictionary: /Users/jeroen/workspace/Dictionaries/Dutch.dic
encoding: UTF-8
wordchars: '-./0123456789\ij’
```

### Setting a Language

The hunspell R package includes dictionaries for `en_US` and `en_GB`. So if you don't speak `en_US` you can always switch to British English:

```{r}
hunspell("My favourite colour to visualise is grey")
hunspell("My favourite colour to visualise is grey", dict = 'en_GB')
```

If you want to use another language, you need to make sure that the dictionary is available on your system. The `dictionary` function is used to read in the dictionary:

```{r, eval = FALSE}
dutch <- dictionary("~/workspace/Dictionaries/Dutch.dic")
hunspell("Hij heeft de klok wel horen luiden, maar weet niet waar de klepel hangt", dict = dutch)
```

Note that if the `dict` argument is a string, it will be passed on to the `dictionary` function.

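In other words, the two calls in the following sketch should behave identically (not evaluated here):

```{r, eval = FALSE}
# Passing a string...
hunspell("My favourite colour is grey", dict = "en_GB")
# ...is shorthand for passing a dictionary object
hunspell("My favourite colour is grey", dict = dictionary("en_GB"))
```
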
### Installing Dictionaries in RStudio

RStudio users can install various dictionaries via the "Global Options" menu of the IDE. Once these dictionaries are installed, they become available to the `hunspell` and `spelling` packages as well.

<img data-external="1" src="https://jeroen.github.io/images/rs-hunspell.png" alt="rstudio screenshot">

### Dictionaries on Linux

The best way to install dictionaries on __Linux__ is via the system package manager. For example, to install the Austrian German dictionary on **Debian** or **Ubuntu** you need either the [`hunspell-de-at`](https://packages.debian.org/testing/hunspell-de-at) or the [`myspell-de-at`](https://packages.debian.org/testing/myspell-de-at) package:

```sh
sudo apt-get install hunspell-de-at
```

On **Fedora** and **CentOS** / **RHEL** all German dialects are included in the [`hunspell-de`](https://src.fedoraproject.org/rpms/hunspell-de) package:

```sh
sudo yum install hunspell-de
```

After installing it, you should be able to load the dictionary:

```r
dict <- dictionary('de_AT')
```

If that does not work, verify that the dictionary files were installed in one of the system directories (usually `/usr/share/myspell` or `/usr/share/hunspell`).

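A quick way to check this from R, assuming the usual system locations (a sketch; paths differ per distribution):

```{r, eval = FALSE}
# Look for installed .aff/.dic files in the common system directories
list.files(c("/usr/share/hunspell", "/usr/share/myspell"),
           pattern = "\\.(aff|dic)$")

# Or ask hunspell which dictionaries it can find on its search path
list_dictionaries()
```
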
### Custom Dictionaries

If your system does not provide standard dictionaries, you need to download them yourself. There are many places that provide quality dictionaries:

- [SCOWL](http://wordlist.aspell.net/dicts/)
- [OpenOffice](http://openoffice.cs.utah.edu/contrib/dictionaries/)
- [LibreOffice](http://archive.ubuntu.com/ubuntu/pool/main/libr/libreoffice-dictionaries/?C=S;O=D)
- [titoBouzout](https://github.com/titoBouzout/Dictionaries)
- [wooorm](https://github.com/wooorm/dictionaries)

On OS-X it is [recommended](https://github.com/Homebrew/homebrew-core/blob/master/Formula/h/hunspell.rb#L40-L48) to put the files in `~/Library/Spelling/` or `/Library/Spelling/`. However, you can also put them in your project working directory, or in any of the other standard locations. If you wish to store your dictionaries somewhere else, you can make hunspell find them by setting the `DICPATH` environment variable. The `hunspell:::dicpath()` function shows which locations your system searches:

```{r}
Sys.setenv(DICPATH = "/my/custom/hunspell/dir")
hunspell:::dicpath()
```

```{r, echo = FALSE, message = FALSE}
unlink(c("1406.4806v1.tar.gz", "content.tex"))
```

@@ -0,0 +1 @@
knitr::knit("intro.Rmd.in", "intro.Rmd")