2 changes: 1 addition & 1 deletion data/xml/2020.iwclul.xml
@@ -3,7 +3,7 @@
 <volume id="1" ingest-date="2020-03-30" type="proceedings">
 <meta>
 <booktitle>Proceedings of the Sixth International Workshop on Computational Linguistics of <fixed-case>U</fixed-case>ralic Languages</booktitle>
-<editor><first>Tommi A</first><last>Pirinen</last></editor>
+<editor><first>Tommi A.</first><last>Pirinen</last></editor>
 <editor><first>Francis M.</first><last>Tyers</last></editor>
 <editor><first>Michael</first><last>Rießler</last></editor>
 <publisher>Association for Computational Linguistics</publisher>
6 changes: 3 additions & 3 deletions data/xml/2020.lrec.xml
@@ -5818,9 +5818,9 @@
 <paper id="474">
 <title>An Unsupervised Method for Weighting Finite-state Morphological Analyzers</title>
 <author><first>Amr</first><last>Keleg</last></author>
-<author><first>Francis</first><last>Tyers</last></author>
-<author><first>Nick</first><last>Howell</last></author>
-<author><first>Tommi</first><last>Pirinen</last></author>
+<author><first>Francis M.</first><last>Tyers</last></author>
+<author><first>Nicholas</first><last>Howell</last></author>
+<author><first>Tommi A.</first><last>Pirinen</last></author>
 <pages>3842–3850</pages>
 <abstract>Morphological analysis is one of the tasks that have been studied for years. Different techniques have been used to develop models for performing morphological analysis. Models based on finite state transducers have proved to be more suitable for languages with low available resources. In this paper, we have developed a method for weighting a morphological analyzer built using finite state transducers in order to disambiguate its results. The method is based on a word2vec model that is trained in a completely unsupervised way using raw untagged corpora and is able to capture the semantic meaning of the words. Most of the methods used for disambiguating the results of a morphological analyzer relied on having tagged corpora that need to manually built. Additionally, the method developed uses information about the token irrespective of its context unlike most of the other techniques that heavily rely on the word’s context to disambiguate its set of candidate analyses.</abstract>
 <url hash="cffce4b4">2020.lrec-1.474</url>
2 changes: 1 addition & 1 deletion data/xml/2021.ranlp.xml
@@ -1786,7 +1786,7 @@
 <paper id="171">
 <title>Rules Ruling Neural Networks - Neural vs. Rule-Based Grammar Checking for a Low Resource Language</title>
 <author><first>Linda</first><last>Wiechetek</last></author>
-<author><first>Flammie</first><last>Pirinen</last></author>
+<author><first>Flammie A</first><last>Pirinen</last></author>
 <author><first>Mika</first><last>Hämäläinen</last></author>
 <author><first>Chiara</first><last>Argese</last></author>
 <pages>1526–1535</pages>
2 changes: 1 addition & 1 deletion data/xml/2022.computel.xml
@@ -260,7 +260,7 @@
 <title>Reusing a Multi-lingual Setup to Bootstrap a Grammar Checker for a Very Low Resource Language without Data</title>
 <author><first>Inga</first><last>Lill Sigga Mikkelsen</last></author>
 <author><first>Linda</first><last>Wiechetek</last></author>
-<author><first>Flammie</first><last>A Pirinen</last></author>
+<author><first>Flammie A</first><last>Pirinen</last></author>
 <pages>149-158</pages>
 <abstract>Grammar checkers (GEC) are needed for digital language survival. Very low resource languages like Lule Sámi with less than 3,000 speakers need to hurry to build these tools, but do not have the big corpus data that are required for the construction of machine learning tools. We present a rule-based tool and a workflow where the work done for a related language can speed up the process. We use an existing grammar to infer rules for the new language, and we do not need a large gold corpus of annotated grammar errors, but a smaller corpus of regression tests is built while developing the tool. We present a test case for Lule Sámi reusing resources from North Sámi, show how we achieve a categorisation of the most frequent errors, and present a preliminary evaluation of the system. We hope this serves as an inspiration for small languages that need advanced tools in a limited amount of time, but do not have big data.</abstract>
 <url hash="e9e8ccc9">2022.computel-1.19</url>
2 changes: 1 addition & 1 deletion data/xml/2022.konvens.xml
@@ -174,7 +174,7 @@
 </paper>
 <paper id="18">
 <title>Building an Extremely Low Resource Language to High Resource Language Machine Translation System from Scratch</title>
-<author><first>Flammie</first><last>Pirinen</last></author>
+<author><first>Flammie A</first><last>Pirinen</last></author>
 <author><first>Linda</first><last>Wiechetek</last></author>
 <pages>150–155</pages>
 <url hash="adfc2f98">2022.konvens-1.18</url>
4 changes: 2 additions & 2 deletions data/xml/2022.lrec.xml
@@ -1533,8 +1533,8 @@
 <author><first>Linda</first><last>Wiechetek</last></author>
 <author><first>Katri</first><last>Hiovain-Asikainen</last></author>
 <author><first>Inga Lill Sigga</first><last>Mikkelsen</last></author>
-<author><first>Sjur</first><last>Moshagen</last></author>
-<author><first>Flammie</first><last>Pirinen</last></author>
+<author><first>Sjur N.</first><last>Moshagen</last></author>
+<author><first>Flammie A.</first><last>Pirinen</last></author>
 <author><first>Trond</first><last>Trosterud</last></author>
 <author><first>Børre</first><last>Gaup</last></author>
 <pages>1167–1177</pages>
4 changes: 2 additions & 2 deletions data/xml/2023.humeval.xml
@@ -23,8 +23,8 @@
 <paper id="1">
 <title>A Manual Evaluation Method of Neural <fixed-case>MT</fixed-case> for Indigenous Languages</title>
 <author><first>Linda</first><last>Wiechetek</last></author>
-<author><first>Flammie</first><last>Pirinen</last></author>
-<author><first>Per</first><last>Kummervold</last></author>
+<author><first>Flammie A.</first><last>Pirinen</last></author>
+<author><first>Per E</first><last>Kummervold</last></author>
 <pages>1–10</pages>
 <abstract>Indigenous language expertise is not encoded in written text in the same way as it is for languages that have a long literal tradition. In many cases it is, on the contrary, mostly conserved orally. Therefore the evaluation of neural MT systems solely based on an algorithm learning from written texts is not adequate to measure the quality of a system that is used by the language community. If extensively using tools based on a big amount of non-native language this can even contribute to language change in a way that is not desired by the language community. It can also pollute the internet with automatically created texts that outweigh native texts. We propose a manual evaluation method focusing on flow and content separately, and additionally we use existing rule-based NLP to evaluate other factors such as spelling, grammar and grammatical richness. Our main conclusion is that language expertise of a native speaker is necessary to properly evaluate a given system. We test the method by manually evaluating two neural MT tools for an indigenous low resource language. We present an experiment on two different neural translations to and from North Sámi, an indigenous language of North Europe.</abstract>
 <url hash="e2b27a34">2023.humeval-1.1</url>
4 changes: 2 additions & 2 deletions data/xml/2023.nodalida.xml
@@ -670,8 +670,8 @@
 </paper>
 <paper id="63">
 <title><fixed-case>G</fixed-case>iella<fixed-case>LT</fixed-case> — a stable infrastructure for <fixed-case>N</fixed-case>ordic minority languages and beyond</title>
-<author><first>Flammie</first><last>Pirinen</last><affiliation>Norgga árktalaš universitehta</affiliation></author>
-<author><first>Sjur</first><last>Moshagen</last></author>
+<author><first>Flammie A</first><last>Pirinen</last><affiliation>Norgga árktalaš universitehta</affiliation></author>
+<author><first>Sjur N.</first><last>Moshagen</last></author>
 <author><first>Katri</first><last>Hiovain-Asikainen</last></author>
 <pages>643-649</pages>
 <abstract>Long term language technology infrastructures are critical for continued maintenance of language technology based software that is used to support the use of languages in digital world. In Nordic area we have languages ranging from well-resourced national majority languages like Norwegian, Swedish and Finnish as well as minoritised, unresourced and indigenous languages like Sámi languages. We present an infrastructure that has been build in over 20 years time that supports building language technology and tools for most of the Nordic languages as well as many of the languages all over the world, with focus on Sámi and other indigenous, minoritised and unresourced languages. We show that one common infrastructure can be used to build tools from keyboards and spell-checkers to machine translators, grammar checkers and text-to-speech as well as automatic speech recognition.</abstract>
2 changes: 1 addition & 1 deletion data/xml/2024.iwclul.xml
@@ -158,7 +158,7 @@
 </paper>
 <paper id="16">
 <title>Keeping Up Appearances—or how to get all <fixed-case>U</fixed-case>ralic languages included into bleeding edge research and software: generate, convert, and <fixed-case>LLM</fixed-case> your way into multilingual datasets</title>
-<author><first>Flammie</first><last>A Pirinen</last><affiliation>Divvun, UiT—Norgga árktalaš universitehta, Tromsø, Norway</affiliation></author>
+<author><first>Flammie A</first><last>Pirinen</last><affiliation>Divvun, UiT—Norgga árktalaš universitehta, Tromsø, Norway</affiliation></author>
 <pages>123-131</pages>
 <abstract>The current trends in natural language processing strongly favor large language models and generative AIs as the basis for everything. For Uralic languages that are not largely present in publically available data on the Internet, this can be problematic. In the current computational linguistic scene, it is very important to have representation of your language in popular datasets. Languages that are included in well-known datasets are also included in shared tasks, products by large technology corporations, and so forth. This inclusion will become especially important for under-resourced, under-studied minority, and Indigenous languages, which will otherwise be easily forgotten. In this article, we present the resources that are often deemed necessary for digital presence of a language in the large language model obsessed world of today. We show that there are methods and tricks available to alleviate the problems with a lack of data and a lack of creators and annotators of the data, some more successful than others.</abstract>
 <url hash="4e521931">2024.iwclul-1.16</url>
8 changes: 4 additions & 4 deletions data/xml/2024.lrec.xml
@@ -16257,11 +16257,11 @@
 <paper id="1383">
 <title>The Ethical Question – Use of Indigenous Corpora for Large Language Models</title>
 <author><first>Linda</first><last>Wiechetek</last></author>
+<author><first>Flammie A.</first><last>Pirinen</last></author>
-<author><first>Børre</first><last>Gaup</last></author>
-<author><first>Trond</first><last>Trosterud</last></author>
-<author><first>Flammie</first><last>Pirinen</last></author>
 <author><first>Maja Lisa</first><last>Kappfjell</last></author>
-<author><first>Sjur</first><last>Moshagen</last></author>
+<author><first>Trond</first><last>Trosterud</last></author>
+<author><first>Børre</first><last>Gaup</last></author>
+<author><first>Sjur Nørstebø</first><last>Moshagen</last></author>
 <pages>15922–15931</pages>
 <abstract>Creating language technology based on language data has become very popular with the recent advances of large language models and neural network technologies. This makes language resources very valuable, and especially in case of indigenous languages, the scarce resources are even more precious. Given the good results of simply fetching everything you can from the internet and feeding it to neural networks in English, there has been more work on doing the same for all languages. However, indigenous language resources as they are on the web are not comparable in that they would encode the most recent normativised language in all its aspects. This problematic is further due to not understanding the texts input to models or output by models by the people who work on them. Corpora also have intelligent property rights and copyrights that are not respected. Furthermore, the web is filled with the result of language model -generated texts. In this article we describe an ethical and sustainable way to work with indigenous languages.</abstract>
 <url hash="8076593b">2024.lrec-main.1383</url>
2 changes: 1 addition & 1 deletion data/xml/L14.xml
@@ -6525,7 +6525,7 @@
 <paper id="609">
 <author><first>Senka</first><last>Drobac</last></author>
 <author><first>Krister</first><last>Lindén</last></author>
-<author><first>Tommi</first><last>Pirinen</last></author>
+<author><first>Tommi A</first><last>Pirinen</last></author>
 <author><first>Miikka</first><last>Silfverberg</last></author>
 <title>Heuristic Hyper-minimization of Finite State Lexicons</title>
 <pages>3319–3324</pages>
2 changes: 1 addition & 1 deletion data/xml/W11.xml
@@ -8370,7 +8370,7 @@
 </paper>
 <paper id="44">
 <title>Modularisation of <fixed-case>F</fixed-case>innish Finite-State Language Description – Towards Wide Collaboration in Open Source Development of a Morphological Analyser</title>
-<author><first>Tommi</first><last>Pirinen</last></author>
+<author><first>Tommi A</first><last>Pirinen</last></author>
 <pages>299–302</pages>
 <url hash="578ca53b">W11-4644</url>
 <bibkey>pirinen-2011-modularisation</bibkey>
2 changes: 1 addition & 1 deletion data/xml/W13.xml
@@ -9508,7 +9508,7 @@
 <paper id="31">
 <title>Building an Open-Source Development Infrastructure for Language Technology Projects</title>
 <author><first>Sjur N.</first><last>Moshagen</last></author>
-<author><first>Tommi</first><last>Pirinen</last></author>
+<author><first>Tommi A</first><last>Pirinen</last></author>
 <author><first>Trond</first><last>Trosterud</last></author>
 <pages>343–352</pages>
 <url hash="f5b766e0">W13-5631</url>
10 changes: 5 additions & 5 deletions data/xml/W17.xml
@@ -378,15 +378,15 @@
 </paper>
 <paper id="14">
 <title><fixed-case>N</fixed-case>orth-<fixed-case>S</fixed-case>ámi to <fixed-case>F</fixed-case>innish rule-based machine translation system</title>
-<author><first>Tommi</first><last>Pirinen</last></author>
-<author><first>Francis M.</first><last>Tyers</last></author>
-<author><first>Trond</first><last>Trosterud</last></author>
 <author><first>Ryan</first><last>Johnson</last></author>
-<author><first>Kevin</first><last>Unhammer</last></author>
+<author><first>Tommi A</first><last>Pirinen</last></author>
 <author><first>Tiina</first><last>Puolakainen</last></author>
+<author><first>Francis</first><last>Tyers</last></author>
+<author><first>Trond</first><last>Trosterud</last></author>
+<author><first>Kevin</first><last>Unhammer</last></author>
 <pages>115–122</pages>
 <url hash="0afc21cd">W17-0214</url>
-<bibkey>pirinen-etal-2017-north</bibkey>
+<bibkey>johnson-etal-2017-north</bibkey>
 </paper>
 <paper id="15">
 <title>Machine translation with North Saami as a pivot language</title>
2 changes: 1 addition & 1 deletion data/xml/W19.xml
@@ -11610,7 +11610,7 @@ One of the references was wrong therefore it is corrected to cite the appropriat
 </paper>
 <paper id="36">
 <title>Apertium-fin-eng–Rule-based Shallow Machine Translation for <fixed-case>WMT</fixed-case> 2019 Shared Task</title>
-<author><first>Tommi</first><last>Pirinen</last></author>
+<author><first>Tommi A</first><last>Pirinen</last></author>
 <pages>335–341</pages>
 <abstract>In this paper we describe a rule-based, bi-directional machine translation system for the Finnish—English language pair. The baseline system was based on the existing data of FinnWordNet, omorfi and apertium-eng. We have built the disambiguation, lexical selection and translation rules by hand. The dictionaries and rules have been developed based on the shared task data. We describe in this article the use of the shared task data as a kind of a test-driven development workflow in RBMT development and show that it suits perfectly to a modern software engineering continuous integration workflow of RBMT and yields big increases to BLEU scores with minimal effort.</abstract>
 <url hash="1d2cec39">W19-5336</url>
7 changes: 6 additions & 1 deletion data/yaml/name_variants.yaml
@@ -8115,10 +8115,15 @@
   id: stelios-piperidis
   variants:
   - {first: Stelios, last: Piperdis}
-- canonical: {first: Tommi A., last: Pirinen}
+- canonical: {first: Flammie A., last: Pirinen}
+  orcid: 0000-0003-1207-5395
+  degree: University of Helsinki
   variants:
+  - {first: Flammie, last: Pirinen}
+  - {first: Flammie A, last: Pirinen}
   - {first: Tommi, last: Pirinen}
   - {first: Tommi A, last: Pirinen}
+  - {first: Tommi A., last: Pirinen}
 - canonical: {first: John F., last: Pitrelli}
   variants:
   - {first: John, last: Pitrelli}
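The name_variants.yaml entry is what ties all the per-file XML edits together: every variant form of a name resolves to one canonical form, so the Anthology can display and index a single author consistently. A minimal sketch of that lookup logic, with the variant data inlined as Python literals (`build_index` and `canonicalize` are hypothetical helper names, not the Anthology's actual ingestion code, which reads data/yaml/name_variants.yaml directly):

```python
# Mirror of the corrected name_variants.yaml entry, as Python literals.
NAME_VARIANTS = [
    {
        "canonical": {"first": "Flammie A.", "last": "Pirinen"},
        "variants": [
            {"first": "Flammie", "last": "Pirinen"},
            {"first": "Flammie A", "last": "Pirinen"},
            {"first": "Tommi", "last": "Pirinen"},
            {"first": "Tommi A", "last": "Pirinen"},
            {"first": "Tommi A.", "last": "Pirinen"},
        ],
    },
]

def build_index(entries):
    """Map every (first, last) variant pair to its canonical (first, last) pair."""
    index = {}
    for entry in entries:
        canonical = (entry["canonical"]["first"], entry["canonical"]["last"])
        for variant in entry.get("variants", []):
            index[(variant["first"], variant["last"])] = canonical
    return index

def canonicalize(first, last, index):
    """Return the canonical name; names with no variant entry pass through unchanged."""
    return index.get((first, last), (first, last))

index = build_index(NAME_VARIANTS)
print(canonicalize("Tommi", "Pirinen", index))    # ('Flammie A.', 'Pirinen')
print(canonicalize("Linda", "Wiechetek", index))  # ('Linda', 'Wiechetek')
```

This pass-through behaviour is why only names with variant spellings need an entry in the YAML file; every other author string is already canonical by definition.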