GreynirCorrect can be used in three modes, depending on your requirements.
- You can use it as a Python package to proofread text at the token (word) level, correcting and annotating individual errors in the token stream.
- It can also perform more extensive grammar checking at the sentence level, returning a list of annotations for each sentence.
- Finally, it can be used as a command-line tool, at the token level, to consume an input file stream (or stdin) and write to a corrected output file stream (or stdout).
These modes are described below. The :ref:`Reference` section contains further detail.
GreynirCorrect can tokenize text and return an automatically corrected token stream. This catches token-level errors, such as spelling errors and erroneous fixed phrases (?að ýmsu leiti → að ýmsu leyti), but not grammatical errors. The returned token objects are annotated with explanations of each correction or suggestion.
Token-level correction is relatively fast.
GreynirCorrect can analyze text grammatically by attempting to parse each sentence in turn, after token-level correction. The parsing is done according to Greynir's context-free grammar for Icelandic, augmented with additional production rules for common grammatical errors (?Manninum á verkstæðinu vantaði hamar → Manninn á verkstæðinu vantaði hamar). The analysis returns each sentence along with a set of annotations (errors and suggestions) that apply to spans (consecutive tokens) within the sentence, in addition to individual token annotations.
Full grammar analysis of sentences is slower than token-level correction.
GreynirCorrect can be invoked as a command-line tool to perform token-level correction. The command is ``correct infile.txt outfile.txt``, where ``infile.txt`` and ``outfile.txt`` are the input and output filenames, respectively.
The command-line tool is further documented :ref:`here <commandline>`.
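When the tool has to be driven from a script rather than typed at a shell, the invocation can be wrapped with the standard library's ``subprocess`` module. The following is a minimal sketch, assuming the ``correct`` executable is installed and on the ``PATH``. The helper names ``build_correct_command`` and ``run_correct`` are hypothetical and not part of GreynirCorrect.

```python
import shutil
import subprocess
from typing import List


def build_correct_command(infile: str, outfile: str) -> List[str]:
    """Return the argument list for `correct infile outfile` (hypothetical helper)."""
    return ["correct", infile, outfile]


def run_correct(infile: str, outfile: str) -> int:
    """Run the command-line tool and return its exit code.

    Raises FileNotFoundError if the `correct` executable cannot be found.
    """
    if shutil.which("correct") is None:
        raise FileNotFoundError("the 'correct' command was not found on PATH")
    # Passing a list (not a shell string) avoids quoting problems in filenames.
    completed = subprocess.run(build_correct_command(infile, outfile), check=False)
    return completed.returncode


# Building the command does not require the tool to be installed.
print(build_correct_command("infile.txt", "outfile.txt"))
```

Only ``run_correct`` touches the executable, so the sketch can be imported and inspected on a machine where GreynirCorrect is not installed.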
To perform token-level correction from Python code:
from reynir_correct import tokenize
g = tokenize("Af gefnu tilefni fékk fékk daninn vilja sýnum "
    "framgengt í auknu mæli.")
for tok in g:
    print("{0:10} {1}".format(tok.txt or "", tok.error_description))
Output:
Að         Orðasambandið 'Af gefnu tilefni' var leiðrétt í 'að gefnu tilefni'
gefnu
tilefni
fékk       Endurtekið orð ('fékk') var fellt burt
Daninn     Orð á að byrja á hástaf: 'daninn'
vilja      Orðasambandið 'vilja sýnum framgengt' var leiðrétt í 'vilja sínum framgengt'
sínum
framgengt
í          Orðasambandið 'í auknu mæli' var leiðrétt í 'í auknum mæli'
auknum
mæli
.
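To list only the tokens that were changed, the token stream can be filtered on ``error_description``. The following is a minimal sketch, assuming that tokens without a correction carry an empty (falsy) description, as the output suggests. ``corrections`` is a hypothetical helper, and the ``SimpleNamespace`` objects only stand in for real tokens so that the sketch runs without the package installed.

```python
from types import SimpleNamespace
from typing import Iterable, List, Tuple


def corrections(tokens: Iterable) -> List[Tuple[str, str]]:
    """Collect (text, description) pairs for tokens that carry an error description."""
    result = []
    for tok in tokens:
        description = getattr(tok, "error_description", "")
        if description:
            result.append((tok.txt or "", description))
    return result


# Stand-in tokens mimicking part of the example output (not real library objects).
sample = [
    SimpleNamespace(txt="gefnu", error_description=""),
    SimpleNamespace(txt="fékk", error_description="Endurtekið orð ('fékk') var fellt burt"),
    SimpleNamespace(txt="Daninn", error_description="Orð á að byrja á hástaf: 'daninn'"),
]

for text, description in corrections(sample):
    print("{0:10} {1}".format(text, description))
```

With the package installed, passing the generator returned by ``tokenize`` instead of ``sample`` should behave the same way, provided the tokens expose the ``txt`` and ``error_description`` attributes used in the example.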
To perform full spelling and grammar analysis of a sentence from Python code:
from reynir_correct import check_single
sent = check_single("Páli, vini mínum, langaði að horfa á sjónnvarpið.")
for annotation in sent.annotations:
    print("{0}".format(annotation))
Output:
000-004: P_WRONG_CASE_þgf_þf Á líklega að vera 'Pál, vin minn' / [Pál , vin minn]
009-009: S004 Orðið 'sjónnvarpið' var leiðrétt í 'sjónvarpið'
>>> sent.tidy_text
Output:
'Páli, vini mínum, langaði að horfa á sjónvarpið.'
Note that the ``annotation.start`` and ``annotation.end`` properties (here ``start`` is 0 and ``end`` is 4) contain the indices of the first and last tokens to which the annotation applies. ``P_WRONG_CASE_þgf_þf`` and ``S004`` are error codes.
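Because ``start`` and ``end`` are both inclusive, mapping an annotation back to its tokens needs a slice that ends at ``end + 1``. The following is a minimal sketch that uses a plain list of token texts in place of a parsed sentence. ``span_tokens`` is a hypothetical helper, and the token list is written out by hand from the example sentence, on the assumption that punctuation marks count as separate tokens.

```python
from typing import List


def span_tokens(tokens: List[str], start: int, end: int) -> List[str]:
    """Return the tokens covered by an annotation with inclusive start/end indices."""
    if not 0 <= start <= end < len(tokens):
        raise ValueError("annotation span lies outside the token list")
    return tokens[start : end + 1]


# Hand-written token texts of the example sentence (an assumption, not library output).
tokens = [
    "Páli", ",", "vini", "mínum", ",",
    "langaði", "að", "horfa", "á", "sjónnvarpið", ".",
]

print(span_tokens(tokens, 0, 4))  # tokens covered by annotation 000-004
print(span_tokens(tokens, 9, 9))  # token covered by annotation 009-009
```

Under that assumption, the span 0 to 4 covers the dative phrase together with both commas, and the span 9 to 9 covers the single misspelled word.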