data/xml/2011.mtsummit.xml (2 changes: 1 addition & 1 deletion)
@@ -429,7 +429,7 @@
<paper id="46">
<title>Generating Virtual Parallel Corpus: A Compatibility Centric Method</title>
<author><first>Jia</first><last>Xu</last></author>
-<author><first>Weiwei</first><last>Sun</last></author>
+<author id="weiwei-sun"><first>Weiwei</first><last>Sun</last></author>
<url hash="09f1e154">2011.mtsummit-papers.46</url>
<bibkey>xu-sun-2011-generating</bibkey>
</paper>
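
Every hunk in this PR makes the same one-line change: a bare Weiwei Sun <author> or <editor> entry gains an id attribute that distinguishes two researchers who share the name. Judging by the affiliations visible in the hunks, weiwei-sun marks the Cambridge-affiliated computational linguist and weiwei-sun-sd the Carnegie Mellon-affiliated researcher (the -sd suffix plausibly points to an earlier Shandong affiliation); the ids presumably resolve against the Anthology's person registry (data/yaml/name_variants.yaml), which this diff does not touch. Below is a minimal Python sketch of how such a repeated edit could be scripted; the function name and the use of lxml are assumptions, not part of this PR, and assigning one id per file is a simplification, since the real disambiguation was made paper by paper.

# Illustrative sketch only, not part of this PR. Adds an id to every
# un-tagged Weiwei Sun credit in one Anthology XML file. Assumes the
# lxml package is installed; real disambiguation must be reviewed
# paper by paper before running anything like this.
from lxml import etree

def tag_weiwei_sun(xml_path: str, person_id: str) -> int:
    """Set id=person_id on each un-tagged <author>/<editor> named Weiwei Sun."""
    tree = etree.parse(xml_path)
    n = 0
    for tag in ("author", "editor"):
        for node in tree.iter(tag):
            if (node.findtext("first") == "Weiwei"
                    and node.findtext("last") == "Sun"
                    and node.get("id") is None):
                node.set("id", person_id)
                n += 1
    tree.write(xml_path, encoding="UTF-8", xml_declaration=True)
    return n

# Example: tag_weiwei_sun("data/xml/2020.iwpt.xml", "weiwei-sun")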
data/xml/2020.acl.xml (6 changes: 3 additions & 3 deletions)
@@ -5081,7 +5081,7 @@
<paper id="377">
<title>Exact yet Efficient Graph Parsing, Bi-directional Locality and the Constructivist Hypothesis</title>
<author><first>Yajie</first><last>Ye</last></author>
-<author><first>Weiwei</first><last>Sun</last></author>
+<author id="weiwei-sun"><first>Weiwei</first><last>Sun</last></author>
<pages>4100–4110</pages>
<abstract>A key problem in processing graph-based meaning representations is graph parsing, i.e. computing all possible derivations of a given graph according to a (competence) grammar. We demonstrate, for the first time, that exact graph parsing can be efficient for large graphs and with large Hyperedge Replacement Grammars (HRGs). The advance is achieved by exploiting locality as terminal edge-adjacency in HRG rules. In particular, we highlight the importance of 1) a terminal edge-first parsing strategy, 2) a categorization of a subclass of HRG, i.e. what we call Weakly Regular Graph Grammar, and 3) distributing argument-structures to both lexical and phrasal rules.</abstract>
<url hash="231b0280">2020.acl-main.377</url>
@@ -8167,7 +8167,7 @@
<paper id="605">
<title>Parsing into Variable-in-situ Logico-Semantic Graphs</title>
<author><first>Yufei</first><last>Chen</last></author>
-<author><first>Weiwei</first><last>Sun</last></author>
+<author id="weiwei-sun"><first>Weiwei</first><last>Sun</last></author>
<pages>6772–6782</pages>
<abstract>We propose variable-in-situ logico-semantic graphs to bridge the gap between semantic graph and logical form parsing. The new type of graph-based meaning representation allows us to include analysis for scope-related phenomena, such as quantification, negation and modality, in a way that is consistent with the state-of-the-art underspecification approach. Moreover, the well-formedness of such a graph is clear, since model-theoretic interpretation is available. We demonstrate the effectiveness of this new perspective by developing a new state-of-the-art semantic parser for English Resource Semantics. At the core of this parser is a novel neural graph rewriting system which combines the strengths of Hyperedge Replacement Grammar, a knowledge-intensive model, and Graph Neural Networks, a data-intensive model. Our parser achieves an accuracy of 92.39% in terms of elementary dependency match, which is a 2.88 point improvement over the best data-driven model in the literature. The output of our parser is highly coherent: at least 91% graphs are valid, in that they allow at least one sound scope-resolved logical form.</abstract>
<url hash="bd78729c">2020.acl-main.605</url>
@@ -8179,7 +8179,7 @@
<paper id="606">
<title>Semantic Parsing for <fixed-case>E</fixed-case>nglish as a Second Language</title>
<author><first>Yuanyuan</first><last>Zhao</last></author>
-<author><first>Weiwei</first><last>Sun</last></author>
+<author id="weiwei-sun"><first>Weiwei</first><last>Sun</last></author>
<author><first>Junjie</first><last>Cao</last></author>
<author><first>Xiaojun</first><last>Wan</last></author>
<pages>6783–6794</pages>
data/xml/2020.emnlp.xml (2 changes: 1 addition & 1 deletion)
@@ -1405,7 +1405,7 @@
<title>Coding Textual Inputs Boosts the Accuracy of Neural Networks</title>
<author><first>Abdul Rafae</first><last>Khan</last></author>
<author><first>Jia</first><last>Xu</last></author>
-<author><first>Weiwei</first><last>Sun</last></author>
+<author id="weiwei-sun"><first>Weiwei</first><last>Sun</last></author>
<pages>1350–1360</pages>
<abstract>Natural Language Processing (NLP) tasks are usually performed word by word on textual inputs. We can use arbitrary symbols to represent the linguistic meaning of a word and use these symbols as inputs. As “alternatives” to a text representation, we introduce Soundex, MetaPhone, NYSIIS, logogram to NLP, and develop fixed-output-length coding and its extension using Huffman coding. Each of those codings combines different character/digital sequences and constructs a new vocabulary based on codewords. We find that the integration of those codewords with text provides more reliable inputs to Neural-Network-based NLP systems through redundancy than text-alone inputs. Experiments demonstrate that our approach outperforms the state-of-the-art models on the application of machine translation, language modeling, and part-of-speech tagging. The source code is available at <url>https://github.com/abdulrafae/coding_nmt</url>.</abstract>
<url hash="53b5562f">2020.emnlp-main.104</url>
data/xml/2020.iwpt.xml (2 changes: 1 addition & 1 deletion)
@@ -8,7 +8,7 @@
<editor><first>Stephan</first><last>Oepen</last></editor>
<editor><first>Kenji</first><last>Sagae</last></editor>
<editor><first>Djamé</first><last>Seddah</last></editor>
-<editor><first>Weiwei</first><last>Sun</last></editor>
+<editor id="weiwei-sun"><first>Weiwei</first><last>Sun</last></editor>
<editor><first>Anders</first><last>Søgaard</last></editor>
<editor><first>Reut</first><last>Tsarfaty</last></editor>
<editor><first>Dan</first><last>Zeman</last></editor>
data/xml/2021.bea.xml (2 changes: 1 addition & 1 deletion)
@@ -27,7 +27,7 @@
<author><first>Mengyu</first><last>Zhang</last></author>
<author><first>Weiqi</first><last>Wang</last></author>
<author><first>Shuqiao</first><last>Sun</last></author>
-<author><first>Weiwei</first><last>Sun</last></author>
+<author id="weiwei-sun"><first>Weiwei</first><last>Sun</last></author>
<pages>1–10</pages>
<abstract>This paper studies Negation Scope Resolution (NSR) for Chinese as a Second Language (CSL), which shows many unique characteristics that distinguish itself from “standard” Chinese. We annotate a new moderate-sized corpus that covers two background L1 languages, viz. English and Japanese. We build a neural NSR system, which achieves a new state-of-the-art accuracy on English benchmark data. We leverage this system to gauge how successful NSR for CSL can be. Different native language backgrounds of language learners result in unequal cross-lingual transfer, which has a significant impact on processing second language data. In particular, manual annotation, empirical evaluation and error analysis indicate two non-obvious facts: 1) L2-Chinese, L1-Japanese data are more difficult to analyze and thus annotate than L2-Chinese, L1-English data; 2) computational models trained on L2-Chinese, L1-Japanese data perform better than models trained on L2-Chinese, L1-English data.</abstract>
<url hash="f3f9c66d">2021.bea-1.1</url>
data/xml/2021.cl.xml (2 changes: 1 addition & 1 deletion)
@@ -35,7 +35,7 @@
<title>Comparing Knowledge-Intensive and Data-Intensive Models for <fixed-case>E</fixed-case>nglish Resource Semantic Parsing</title>
<author><first>Junjie</first><last>Cao</last></author>
<author><first>Zi</first><last>Lin</last></author>
-<author><first>Weiwei</first><last>Sun</last></author>
+<author id="weiwei-sun"><first>Weiwei</first><last>Sun</last></author>
<author><first>Xiaojun</first><last>Wan</last></author>
<doi>10.1162/coli_a_00395</doi>
<abstract>In this work, we present a phenomenon-oriented comparative analysis of the two dominant approaches in English Resource Semantic (ERS) parsing: classic, knowledge-intensive and neural, data-intensive models. To reflect state-of-the-art neural NLP technologies, a factorization-based parser is introduced that can produce Elementary Dependency Structures much more accurately than previous data-driven parsers. We conduct a suite of tests for different linguistic phenomena to analyze the grammatical competence of different parsers, where we show that, despite comparable performance overall, knowledge- and data-intensive models produce different types of errors, in a way that can be explained by their theoretical properties. This analysis is beneficial to in-depth evaluation of several representative parsing techniques and leads to new directions for parser development.</abstract>
data/xml/2021.naacl.xml (2 changes: 1 addition & 1 deletion)
@@ -5929,7 +5929,7 @@
<author><first>Yiyang</first><last>Hou</last></author>
<author><first>Yajie</first><last>Ye</last></author>
<author><first>Li</first><last>Liang</last></author>
-<author><first>Weiwei</first><last>Sun</last></author>
+<author id="weiwei-sun"><first>Weiwei</first><last>Sun</last></author>
<pages>5554–5566</pages>
<abstract>Universal Semantic Tagging aims to provide lightweight unified analysis for all languages at the word level. Though the proposed annotation scheme is conceptually promising, the feasibility is only examined in four Indo–European languages. This paper is concerned with extending the annotation scheme to handle Mandarin Chinese and empirically study the plausibility of unifying meaning representations for multiple languages. We discuss a set of language-specific semantic phenomena, propose new annotation specifications and build a richly annotated corpus. The corpus consists of 1100 English–Chinese parallel sentences, where compositional semantic analysis is available for English, and another 1000 Chinese sentences which has enriched syntactic analysis. By means of the new annotations, we also evaluate a series of neural tagging models to gauge how successful semantic tagging can be: accuracies of 92.7% and 94.6% are obtained for Chinese and English respectively. The English tagging performance is remarkably better than the state-of-the-art by 7.7%.</abstract>
<url hash="63148334">2021.naacl-main.440</url>
data/xml/2023.cxgsnlp.xml (2 changes: 1 addition & 1 deletion)
@@ -56,7 +56,7 @@
<paper id="5">
<title>Constructivist Tokenization for <fixed-case>E</fixed-case>nglish</title>
<author><first>Allison</first><last>Fan</last></author>
-<author><first>Weiwei</first><last>Sun</last></author>
+<author id="weiwei-sun"><first>Weiwei</first><last>Sun</last></author>
<pages>36-40</pages>
<abstract>This paper revisits tokenization from a theoretical perspective, and argues for the necessity of a constructivist approach to tokenization for semantic parsing and modeling language acquisition. We consider two problems: (1) (semi-) automatically converting existing lexicalist annotations, e.g. those of the Penn TreeBank, into constructivist annotations, and (2) automatic tokenization of raw texts. We demonstrate that (1) a heuristic rule-based constructivist tokenizer is able to yield relatively satisfactory accuracy when gold standard Penn TreeBank part-of-speech tags are available, but that some manual annotations are still necessary to obtain gold standard results, and (2) a neural tokenizer is able to provide accurate automatic constructivist tokenization results from raw character sequences. Our research output also includes a set of high-quality morpheme-tokenized corpora, which enable the training of computational models that more closely align with language comprehension and acquisition.</abstract>
<url hash="0963dbef">2023.cxgsnlp-1.5</url>
data/xml/2024.acl.xml (4 changes: 2 additions & 2 deletions)
@@ -1815,7 +1815,7 @@
<paper id="129">
<title><fixed-case>MEFT</fixed-case>: Memory-Efficient Fine-Tuning through Sparse Adapter</title>
<author><first>Jitai</first><last>Hao</last></author>
-<author><first>Weiwei</first><last>Sun</last><affiliation>Carnegie Mellon University</affiliation></author>
+<author id="weiwei-sun-sd"><first>Weiwei</first><last>Sun</last><affiliation>Carnegie Mellon University</affiliation></author>
<author><first>Xin</first><last>Xin</last></author>
<author><first>Qi</first><last>Meng</last><affiliation>Microsoft and Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Chinese Academy of Sciences</affiliation></author>
<author><first>Zhumin</first><last>Chen</last><affiliation>Shandong University</affiliation></author>
@@ -5509,7 +5509,7 @@
<title>Generate-then-Ground in Retrieval-Augmented Generation for Multi-hop Question Answering</title>
<author><first>Zhengliang</first><last>Shi</last></author>
<author><first>Shuo</first><last>Zhang</last></author>
-<author><first>Weiwei</first><last>Sun</last><affiliation>Carnegie Mellon University</affiliation></author>
+<author id="weiwei-sun-sd"><first>Weiwei</first><last>Sun</last><affiliation>Carnegie Mellon University</affiliation></author>
<author><first>Shen</first><last>Gao</last><affiliation>University of Electronic Science and Technology of China</affiliation></author>
<author><first>Pengjie</first><last>Ren</last><affiliation>Shandong University</affiliation></author>
<author><first>Zhumin</first><last>Chen</last><affiliation>Shandong University</affiliation></author>
data/xml/2024.cl.xml (2 changes: 1 addition & 1 deletion)
@@ -169,7 +169,7 @@
<author><first>Wenxi</first><last>Li</last></author>
<author><first>Yutong</first><last>Zhang</last></author>
<author><first>Guy</first><last>Emerson</last></author>
-<author><first>Weiwei</first><last>Sun</last></author>
+<author id="weiwei-sun"><first>Weiwei</first><last>Sun</last></author>
<doi>10.1162/coli_a_00504</doi>
<abstract>Divergence of languages observed at the surface level is a major challenge encountered by multilingual data representation, especially when typologically distant languages are involved. Drawing inspiration from a formalist Chomskyan perspective towards language universals, Universal Grammar (UG), this article uses deductively pre-defined universals to analyze a multilingually heterogeneous phenomenon, event nominals. In this way, deeper universality of event nominals beneath their huge divergence in different languages is uncovered, which empowers us to break barriers between languages and thus extend insights from some synthetic languages to a non-inflectional language, Mandarin Chinese. Our empirical investigation also demonstrates this UG-inspired schema is effective: With its assistance, the inter-annotator agreement (IAA) for identifying event nominals in Mandarin grows from 88.02% to 94.99%, and automatic detection of event-reading nominalizations on the newly-established data achieves an accuracy of 94.76% and an F1 score of 91.3%, which significantly surpass those achieved on the pre-existing resource by 9.8% and 5.2%, respectively. Our systematic analysis also sheds light on nominal semantic role labeling. By providing a clear definition and classification on arguments of event nominal, the IAA of this task significantly increases from 90.46% to 98.04%.</abstract>
<pages>535–561</pages>
data/xml/2024.emnlp.xml (2 changes: 1 addition & 1 deletion)
@@ -10858,7 +10858,7 @@
</paper>
<paper id="778">
<title><fixed-case>MAIR</fixed-case>: A Massive Benchmark for Evaluating Instructed Retrieval</title>
-<author><first>Weiwei</first><last>Sun</last><affiliation>Carnegie Mellon University</affiliation></author>
+<author id="weiwei-sun-sd"><first>Weiwei</first><last>Sun</last><affiliation>Carnegie Mellon University</affiliation></author>
<author><first>Zhengliang</first><last>Shi</last></author>
<author><first>Wu Jiu</first><last>Long</last></author>
<author><first>Lingyong</first><last>Yan</last><affiliation>Baidu Inc.</affiliation></author>
data/xml/2024.lchange.xml (2 changes: 1 addition & 1 deletion)
@@ -144,7 +144,7 @@
<paper id="12">
<title><fixed-case>E</fixed-case>tymo<fixed-case>L</fixed-case>ink: A Structured <fixed-case>E</fixed-case>nglish Etymology Dataset</title>
<author><first>Yuan</first><last>Gao</last><affiliation>University of Cambridge</affiliation></author>
-<author><first>Weiwei</first><last>Sun</last><affiliation>University of Cambridge</affiliation></author>
+<author id="weiwei-sun"><first>Weiwei</first><last>Sun</last><affiliation>University of Cambridge</affiliation></author>
<pages>126-136</pages>
<abstract/>
<url hash="183478e8">2024.lchange-1.12</url>
data/xml/2024.lrec.xml (4 changes: 2 additions & 2 deletions)
@@ -8545,7 +8545,7 @@
<paper id="722">
<title>How Large Language Models Encode Context Knowledge? A Layer-Wise Probing Study</title>
<author><first>Tianjie</first><last>Ju</last></author>
-<author><first>Weiwei</first><last>Sun</last></author>
+<author id="weiwei-sun-sd"><first>Weiwei</first><last>Sun</last></author>
<author><first>Wei</first><last>Du</last></author>
<author><first>Xinwei</first><last>Yuan</last></author>
<author><first>Zhaochun</first><last>Ren</last></author>
@@ -9217,7 +9217,7 @@
<title>Improving the Robustness of Large Language Models via Consistency Alignment</title>
<author><first>Yukun</first><last>Zhao</last></author>
<author><first>Lingyong</first><last>Yan</last></author>
-<author><first>Weiwei</first><last>Sun</last></author>
+<author id="weiwei-sun-sd"><first>Weiwei</first><last>Sun</last></author>
<author><first>Guoliang</first><last>Xing</last></author>
<author><first>Shuaiqiang</first><last>Wang</last></author>
<author><first>Chong</first><last>Meng</last></author>
data/xml/2024.naacl.xml (2 changes: 1 addition & 1 deletion)
@@ -5577,7 +5577,7 @@
<title>Knowing What <fixed-case>LLM</fixed-case>s <fixed-case>DO</fixed-case> <fixed-case>NOT</fixed-case> Know: A Simple Yet Effective Self-Detection Method</title>
<author><first>Yukun</first><last>Zhao</last></author>
<author><first>Lingyong</first><last>Yan</last><affiliation>Baidu Inc.</affiliation></author>
-<author><first>Weiwei</first><last>Sun</last></author>
+<author id="weiwei-sun-sd"><first>Weiwei</first><last>Sun</last></author>
<author><first>Guoliang</first><last>Xing</last></author>
<author><first>Chong</first><last>Meng</last><affiliation>Baidu</affiliation></author>
<author><first>Shuaiqiang</first><last>Wang</last><affiliation>Baidu Inc.</affiliation></author>
data/xml/2025.babylm.xml (2 changes: 1 addition & 1 deletion)
@@ -172,7 +172,7 @@
<author orcid="0000-0001-7201-7387"><first>Suchir</first><last>Salhan</last><affiliation>University of Cambridge</affiliation></author>
<author><first>Andrew</first><last>Caines</last><affiliation>University of Cambridge</affiliation></author>
<author><first>Paula</first><last>Buttery</last><affiliation>University of Cambridge</affiliation></author>
-<author><first>Weiwei</first><last>Sun</last><affiliation>University of Cambridge</affiliation></author>
+<author id="weiwei-sun"><first>Weiwei</first><last>Sun</last><affiliation>University of Cambridge</affiliation></author>
<pages>160-174</pages>
<abstract>Cross-lingual extensions of the BabyLM Shared Task beyond English incentivise the development of Small Language Models that simulate a much wider range of language acquisition scenarios, including code-switching, simultaneous and successive bilingualism and second language acquisition. However, to our knowledge, there is no benchmark of the formal competence of cognitively-inspired models of L2 acquisition, or <b>L2LMs</b>. To address this, we introduce a <b>Benchmark of Learner Interlingual Syntactic Structure (BLiSS)</b>. BLiSS consists of 1.5M naturalistic minimal pairs dataset derived from errorful sentence–correction pairs in parallel learner corpora. These are systematic patterns –overlooked by standard benchmarks of the formal competence of Language Models – which we use to evaluate L2LMs trained in a variety of training regimes on specific properties of L2 learner language to provide a linguistically-motivated framework for controlled measure of the interlanguage competence of L2LMs.</abstract>
<url hash="2e7f6503">2025.babylm-main.13</url>