diff --git a/data/xml/2007.sigdial.xml b/data/xml/2007.sigdial.xml
index 721741b212..81c0423706 100644
--- a/data/xml/2007.sigdial.xml
+++ b/data/xml/2007.sigdial.xml
@@ -209,7 +209,7 @@
       <author><first>Ivan</first><last>Tashev</last></author>
       <author><first>Michael</first><last>Seltzer</last></author>
       <author><first>Yun-Cheng</first><last>Ju</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <author><first>Alex</first><last>Acero</last></author>
       <pages>87–94</pages>
       <url hash="f9be08f8">2007.sigdial-1.18</url>
diff --git a/data/xml/2012.iwslt.xml b/data/xml/2012.iwslt.xml
index 6e47166987..cc180194aa 100644
--- a/data/xml/2012.iwslt.xml
+++ b/data/xml/2012.iwslt.xml
@@ -22,7 +22,7 @@
     </paper>
     <paper id="3">
       <title>Who can understand your speech better – deep neural network of <fixed-case>G</fixed-case>aussian mixture model</title>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <url hash="f54a3245">2012.iwslt-keynotes.3</url>
       <bibkey>yu-2012-understand</bibkey>
     </paper>
diff --git a/data/xml/2020.acl.xml b/data/xml/2020.acl.xml
index fab064d73e..254d10337c 100644
--- a/data/xml/2020.acl.xml
+++ b/data/xml/2020.acl.xml
@@ -1376,7 +1376,7 @@
       <author><first>Zhenyi</first><last>Wang</last></author>
       <author><first>Xiaoyang</first><last>Wang</last></author>
       <author><first>Bang</first><last>An</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <author><first>Changyou</first><last>Chen</last></author>
       <pages>1072–1086</pages>
       <abstract>Text generation from a knowledge base aims to translate knowledge triples to natural language descriptions. Most existing methods ignore the faithfulness between a generated text description and the original table, leading to generated information that goes beyond the content of the table. In this paper, for the first time, we propose a novel Transformer-based generation framework to achieve the goal. The core techniques in our method to enforce faithfulness include a new table-text optimal-transport matching loss and a table-text embedding similarity loss based on the Transformer model. Furthermore, to evaluate faithfulness, we propose a new automatic metric specialized to the table-to-text generation problem. We also provide detailed analysis on each component of our model in our experiments. Automatic and human evaluations show that our framework can significantly outperform state-of-the-art by a large margin.</abstract>
@@ -3140,7 +3140,7 @@
       <author><first>Jie</first><last>Lei</last></author>
       <author><first>Liwei</first><last>Wang</last></author>
       <author><first>Yelong</first><last>Shen</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <author><first>Tamara</first><last>Berg</last></author>
       <author><first>Mohit</first><last>Bansal</last></author>
       <pages>2603–2614</pages>
@@ -5985,7 +5985,7 @@
       <author><first>Dian</first><last>Yu</last></author>
       <author><first>Kai</first><last>Sun</last></author>
       <author><first>Claire</first><last>Cardie</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <pages>4927–4940</pages>
       <abstract>We present the first human-annotated dialogue-based relation extraction (RE) dataset DialogRE, aiming to support the prediction of relation(s) between two arguments that appear in a dialogue. We further offer DialogRE as a platform for studying cross-sentence RE as most facts span multiple sentences. We argue that speaker-related information plays a critical role in the proposed task, based on an analysis of similarities and differences between dialogue-based and traditional RE tasks. Considering the timeliness of communication in a dialogue, we design a new metric to evaluate the performance of RE methods in a conversational setting and investigate the performance of several representative RE methods on DialogRE. Experimental results demonstrate that a speaker-aware extension on the best-performing model leads to gains in both the standard and conversational evaluation settings. DialogRE is available at <url>https://dataset.org/dialogre/</url>.</abstract>
       <url hash="2c79b40c">2020.acl-main.444</url>
@@ -6482,7 +6482,7 @@
       <author><first>Kun</first><last>Xu</last></author>
       <author><first>Yue</first><last>Zhang</last></author>
       <author><first>Jianshu</first><last>Chen</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <pages>5429–5434</pages>
       <abstract>Zero pronoun recovery and resolution aim at recovering the dropped pronoun and pointing out its anaphoric mentions, respectively. We propose to better explore their interaction by solving both tasks together, while the previous work treats them separately. For zero pronoun resolution, we study this task in a more realistic setting, where no parsing trees or only automatic trees are available, while most previous work assumes gold trees. Experiments on two benchmarks show that joint modeling significantly outperforms our baseline that already beats the previous state of the arts.</abstract>
       <url hash="5af6f516">2020.acl-main.482</url>
@@ -8139,7 +8139,7 @@
       <author><first>Yelong</first><last>Shen</last></author>
       <author><first>Dian</first><last>Yu</last></author>
       <author><first>Jianshu</first><last>Chen</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <pages>6751–6761</pages>
       <abstract>In this paper, we study machine reading comprehension (MRC) on long texts: where a model takes as inputs a lengthy document and a query, extracts a text span from the document as an answer. State-of-the-art models (e.g., BERT) tend to use a stack of transformer layers that are pre-trained from a large number of unlabeled language corpora to encode the joint contextual information of query and document. However, these transformer models can only take as input a fixed-length (e.g., 512) text. To deal with even longer text inputs, previous approaches usually chunk them into <i>equally-spaced</i> segments and predict answers based on each segment independently without considering the information from other segments. As a result, they may form segments that fail to cover complete answers or retain insufficient contexts around the correct answer required for question answering. Moreover, they are less capable of answering questions that need cross-segment information. We propose to let a model learn to chunk in a more flexible way via reinforcement learning: a model can decide the next segment that it wants to process in either direction. We also apply recurrent mechanisms to enable information to flow across segments. Experiments on three MRC tasks – CoQA, QuAC, and TriviaQA – demonstrate the effectiveness of our proposed recurrent chunking mechanisms: we can obtain segments that are more likely to contain complete answers and at the same time provide sufficient contexts around the ground truth answers for better predictions.</abstract>
       <url hash="ffc15051">2020.acl-main.603</url>
@@ -9580,7 +9580,7 @@
       <author><first>Yue</first><last>Zhang</last></author>
       <author><first>Kun</first><last>Xu</last></author>
       <author><first>Yubin</first><last>Ge</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <pages>7987–7998</pages>
       <abstract>The task of graph-to-text generation aims at producing sentences that preserve the meaning of input graphs. As a crucial defect, the current state-of-the-art models may mess up or even drop the core structural information of input graphs when generating outputs. We propose to tackle this problem by leveraging richer training signals that can guide our model for preserving input information. In particular, we introduce two types of autoencoding losses, each individually focusing on different aspects (a.k.a. views) of input graphs. The losses are then back-propagated to better calibrate our model via multi-task training. Experiments on two benchmarks for graph-to-text generation show the effectiveness of our approach over a state-of-the-art baseline.</abstract>
       <url hash="c39534ed">2020.acl-main.712</url>
diff --git a/data/xml/2020.ccl.xml b/data/xml/2020.ccl.xml
index c131ee8f62..a6014c2b43 100644
--- a/data/xml/2020.ccl.xml
+++ b/data/xml/2020.ccl.xml
@@ -590,7 +590,7 @@
       <title>面向人工智能伦理计算的中文道德词典构建方法研究(Construction of a <fixed-case>C</fixed-case>hinese Moral Dictionary for Artificial Intelligence Ethical Computing)</title>
       <author><first>Hongrui</first><last>Wang</last><variant script="hani"><first>弘睿</first><last>王</last></variant></author>
       <author><first>Chang</first><last>Liu</last><variant script="hani"><first>畅</first><last>刘</last></variant></author>
-      <author><first>Dong</first><last>Yu</last><variant script="hani"><first>东</first><last>于</last></variant></author>
+      <author id="dong-yu-blcu"><first>Dong</first><last>Yu</last><variant script="hani"><first>东</first><last>于</last></variant></author>
       <pages>539–549</pages>
       <abstract>道德词典资源的建设是人工智能伦理计算的一个研究重点。由于道德行为复杂多样,现有的英文道德词典分类体系并不完善,而中文方面目前尚未有相关的词典资源,理论体系和构建方法仍待探究。针对以上问题,该文提出了面向人工智能伦理计算的中文道德词典构建任务,设计了四类标签和四种类型,得到包含25,012个词的中文道德词典资源。实验结果表明,该词典资源不仅能够使机器学会道德知识,判断词的道德标签和类型,而且能够为句子级别的道德文本分析提供数据支持。</abstract>
       <url hash="c5445fa1">2020.ccl-1.50</url>
@@ -811,7 +811,7 @@
     <paper id="68">
       <title>结合深度学习和语言难度特征的句子可读性计算方法(The method of calculating sentence readability combined with deep learning and language difficulty characteristics)</title>
       <author><first>Yuling</first><last>Tang</last><variant script="hani"><first>玉玲</first><last>唐</last></variant></author>
-      <author><first>Dong</first><last>Yu</last><variant script="hani"><first>东</first><last>于</last></variant></author>
+      <author id="dong-yu-blcu"><first>Dong</first><last>Yu</last><variant script="hani"><first>东</first><last>于</last></variant></author>
       <pages>731–742</pages>
       <abstract>本文提出了可读性语料库构建的改进方法,基于该方法,构建了规模更大的汉语句子可读性语料库。该语料库在句子绝对难度评估任务上的准确率达到0.7869,相对前人工作提升了0.15以上,证明了改进方法的有效性。将深度学习方法应用于汉语可读性评估,探究了不同深度学习方法自动捕获难度特征的能力,并进仛步探究了向深度学习特征中融入不同层面的语难度特征对模型整体性能的影响。实验结果显示,不同深度学习模型的难度特征捕获能力不尽相同,语言难度特征可以不同程度地提高深度学习模型的难度表征能力。</abstract>
       <url hash="99223a1a">2020.ccl-1.68</url>
diff --git a/data/xml/2020.emnlp.xml b/data/xml/2020.emnlp.xml
index 13e7ec9f9e..56d94282dc 100644
--- a/data/xml/2020.emnlp.xml
+++ b/data/xml/2020.emnlp.xml
@@ -1030,7 +1030,7 @@
       <author><first>Yang</first><last>Feng</last></author>
       <author><first>Wanying</first><last>Xie</last></author>
       <author><first>Jie</first><last>Zhou</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-blcu"><first>Dong</first><last>Yu</last></author>
       <pages>1035–1046</pages>
       <abstract>There exists a token imbalance phenomenon in natural language as different tokens appear with different frequencies, which leads to different learning difficulties for tokens in Neural Machine Translation (NMT). The vanilla NMT model usually adopts trivial equal-weighted objectives for target tokens with different frequencies and tends to generate more high-frequency tokens and less low-frequency tokens compared with the golden token distribution. However, low-frequency tokens may carry critical semantic information that will affect the translation quality once they are neglected. In this paper, we explored target token-level adaptive objectives based on token frequencies to assign appropriate weights for each target token during training. We aimed that those meaningful but relatively low-frequency words could be assigned with larger weights in objectives to encourage the model to pay more attention to these tokens. Our method yields consistent improvements in translation quality on ZH-EN, EN-RO, and EN-DE translation tasks, especially on sentences that contain more low-frequency tokens where we can get 1.68, 1.02, and 0.52 BLEU increases compared with baseline, respectively. Further analyses show that our method can also improve the lexical diversity of translation.</abstract>
       <url hash="3986c73c">2020.emnlp-main.76</url>
@@ -6836,7 +6836,7 @@
       <author><first>Sangwoo</first><last>Cho</last></author>
       <author><first>Kaiqiang</first><last>Song</last></author>
       <author><first>Chen</first><last>Li</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <author><first>Hassan</first><last>Foroosh</last></author>
       <author id="fei-liu-utdallas"><first>Fei</first><last>Liu</last></author>
       <pages>6282–6300</pages>
@@ -7214,7 +7214,7 @@
       <author><first>Han</first><last>Wu</last></author>
       <author><first>Haisong</first><last>Zhang</last></author>
       <author><first>Linqi</first><last>Song</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <pages>6632–6639</pages>
       <abstract>For multi-turn dialogue rewriting, the capacity of effectively modeling the linguistic knowledge in dialog context and getting ride of the noises is essential to improve its performance. Existing attentive models attend to all words without prior focus, which results in inaccurate concentration on some dispensable words. In this paper, we propose to use semantic role labeling (SRL), which highlights the core semantic information of who did what to whom, to provide additional guidance for the rewriter model. Experiments show that this information significantly improves a RoBERTa-based model that already outperforms previous state-of-the-art systems.</abstract>
       <url hash="34118239">2020.emnlp-main.537</url>
diff --git a/data/xml/2020.semeval.xml b/data/xml/2020.semeval.xml
index be709a7537..bbbabb8e63 100644
--- a/data/xml/2020.semeval.xml
+++ b/data/xml/2020.semeval.xml
@@ -354,7 +354,7 @@
       <author><first>Shike</first><last>Wang</last></author>
       <author><first>Yuchen</first><last>Fan</last></author>
       <author><first>Xiangying</first><last>Luo</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-blcu"><first>Dong</first><last>Yu</last></author>
       <pages>255–262</pages>
       <abstract>Lexical entailment recognition plays an important role in tasks like Question Answering and Machine Translation. As important branches of lexical entailment, predicting multilingual and cross-lingual lexical entailment (LE) are two subtasks of SemEval2020 Task2. In previous monolingual LE studies, researchers leverage external linguistic constraints to transform word embeddings for LE relation. In our system, we expand the number of external constraints in multiple languages to obtain more specialised multilingual word embeddings. For the cross-lingual subtask, we apply a bilingual word embeddings mapping method in the model. The mapping method takes specialised embeddings as inputs and is able to retain the embeddings’ LE features after operations. Our results for multilingual subtask are about 20% and 10% higher than the baseline in graded and binary prediction respectively.</abstract>
       <url hash="0c4cb83a">2020.semeval-1.31</url>
@@ -930,7 +930,7 @@
     <paper id="81">
       <title><fixed-case>BLCU</fixed-case>-<fixed-case>NLP</fixed-case> at <fixed-case>S</fixed-case>em<fixed-case>E</fixed-case>val-2020 Task 5: Data Augmentation for Efficient Counterfactual Detecting</title>
       <author><first>Chang</first><last>Liu</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-blcu"><first>Dong</first><last>Yu</last></author>
       <pages>633–639</pages>
       <abstract>Counterfactuals describe events counter to facts and hence naturally involve common sense, knowledge, and reasoning. SemEval 2020 task 5 is focusing on this field. We participate in the subtask 1 and we use BERT as our system. Our Innovations are feature extraction and data augmentation. We extract and summarize features of counterfactual statements, augment counterfactual examples in training set with the help of these features, and two general methods of data augmentation is experimented in our work. We demonstrate the effectiveness of our approaches, which achieves 0.95 of subtask 1 in F1 while using only a subset of giving training set to fine-tune the BERT model, and our official submission achieves F1 0.802, which ranks us 16th in the competition.</abstract>
       <url hash="2d0e62d4">2020.semeval-1.81</url>
diff --git a/data/xml/2020.tacl.xml b/data/xml/2020.tacl.xml
index efacc42bac..945fe962ae 100644
--- a/data/xml/2020.tacl.xml
+++ b/data/xml/2020.tacl.xml
@@ -123,7 +123,7 @@
       <title>Investigating Prior Knowledge for Challenging <fixed-case>C</fixed-case>hinese Machine Reading Comprehension</title>
       <author><first>Kai</first><last>Sun</last></author>
       <author><first>Dian</first><last>Yu</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <author><first>Claire</first><last>Cardie</last></author>
       <doi>10.1162/tacl_a_00305</doi>
       <abstract>Machine reading comprehension tasks require a machine reader to answer questions relevant to the given document. In this paper, we present the first free-form multiple-Choice Chinese machine reading Comprehension dataset (C3), containing 13,369 documents (dialogues or more formally written mixed-genre texts) and their associated 19,577 multiple-choice free-form questions collected from Chinese-as-a-second-language examinations. We present a comprehensive analysis of the prior knowledge (i.e., linguistic, domain-specific, and general world knowledge) needed for these real-world problems. We implement rule-based and popular neural methods and find that there is still a significant performance gap between the best performing model (68.5%) and human readers (96.0%), especiallyon problems that require prior knowledge. We further study the effects of distractor plausibility and data augmentation based on translated relevant datasets for English on model performance. We expect C3 to present great challenges to existing systems as answering 86.8% of questions requires both knowledge within and beyond the accompanying document, and we hope that C3 can serve as a platform to study how to leverage various kinds of prior knowledge to better understand a given written or orally oriented text. C3 is available at <url>https://dataset.org/c3/</url>.</abstract>
diff --git a/data/xml/2021.acl.xml b/data/xml/2021.acl.xml
index 34c855eba9..96e9caa648 100644
--- a/data/xml/2021.acl.xml
+++ b/data/xml/2021.acl.xml
@@ -6235,7 +6235,7 @@ The source code has been made available at \url{https://github.com/liam0949/DCLO
       <author><first>Wanying</first><last>Xie</last></author>
       <author><first>Yang</first><last>Feng</last></author>
       <author><first>Shuhao</first><last>Gu</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-blcu"><first>Dong</first><last>Yu</last></author>
       <pages>5725–5737</pages>
       <abstract>Multilingual neural machine translation with a single model has drawn much attention due to its capability to deal with multiple languages. However, the current multilingual translation paradigm often makes the model tend to preserve the general knowledge, but ignore the language-specific knowledge. Some previous works try to solve this problem by adding various kinds of language-specific modules to the model, but they suffer from the parameter explosion problem and require specialized manual design. To solve these problems, we propose to divide the model neurons into general and language-specific parts based on their importance across languages. The general part is responsible for preserving the general knowledge and participating in the translation of all the languages, while the language-specific part is responsible for preserving the language-specific knowledge and participating in the translation of some specific languages. Experimental results on several language pairs, covering IWSLT and Europarl corpus datasets, demonstrate the effectiveness and universality of the proposed method.</abstract>
       <url hash="839d5766">2021.acl-long.445</url>
@@ -10364,7 +10364,7 @@ The source code has been made available at \url{https://github.com/liam0949/DCLO
       <author><first>Xiao</first><last>Feng</last></author>
       <author><first>Tao</first><last>Chen</last></author>
       <author><first>Tao</first><last>Yang</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <author><first>Feng</first><last>Zhang</last></author>
       <author><first>ZhanHui</first><last>Kang</last></author>
       <author><first>Shuming</first><last>Shi</last></author>
diff --git a/data/xml/2021.ccl.xml b/data/xml/2021.ccl.xml
index 3e5ad8ea50..f788b7a943 100644
--- a/data/xml/2021.ccl.xml
+++ b/data/xml/2021.ccl.xml
@@ -589,7 +589,7 @@
       <author><first>Shiya</first><last>Peng</last><variant script="hani"><first>诗雅</first><last>彭</last></variant></author>
       <author><first>Chang</first><last>Liu</last><variant script="hani"><first>畅</first><last>刘</last></variant></author>
       <author><first>Yayue</first><last>Deng</last><variant script="hani"><first>雅月</first><last>邓</last></variant></author>
-      <author><first>Dong</first><last>Yu</last><variant script="hani"><first>东</first><last>于</last></variant></author>
+      <author id="dong-yu-blcu"><first>Dong</first><last>Yu</last><variant script="hani"><first>东</first><last>于</last></variant></author>
       <pages>537–548</pages>
       <abstract>随着人工智能的发展,越来越多的研究开始关注人工智能伦理。在NLP领域,道德自动识别作为研究分析文本中的道德的一项重要任务,近年来开始受到研究者的关注。该任务旨在识别文本中的道德片段,其对自然语言处理的道德相关的下游任务如偏见识别消除、判定模型隐形歧视等具有重要意义。与英文相比,目前面向中文的道德识别研究开展缓慢,其主要原因是至今还没有较大型的道德中文数据集为研究提供数据。为解决上述问题,本文在中文语料上进行了中文道德句的标注工作,并初步对识别中文文本道德句进行探索。我们首先构建了国内首个10万级别的中文道德句数据集,然后本文提出了利用流行的几种机器学习方法探究识别中文道德句任务的效果。此外,我们还探索了利用额外知识辅助的方法,对中文道德句的识别任务进行了进一步的探究。</abstract>
       <url hash="45e630bf">2021.ccl-1.49</url>
diff --git a/data/xml/2021.emnlp.xml b/data/xml/2021.emnlp.xml
index ef719191cc..08c7ff4edc 100644
--- a/data/xml/2021.emnlp.xml
+++ b/data/xml/2021.emnlp.xml
@@ -4339,7 +4339,7 @@
       <author><first>Yangqiu</first><last>Song</last></author>
       <author><first>Changshui</first><last>Zhang</last></author>
       <author><first>Kun</first><last>Xu</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <pages>3832–3845</pages>
       <abstract>Resolving pronouns to their referents has long been studied as a fundamental natural language understanding problem. Previous works on pronoun coreference resolution (PCR) mostly focus on resolving pronouns to mentions in text while ignoring the exophoric scenario. Exophoric pronouns are common in daily communications, where speakers may directly use pronouns to refer to some objects present in the environment without introducing the objects first. Although such objects are not mentioned in the dialogue text, they can often be disambiguated by the general topics of the dialogue. Motivated by this, we propose to jointly leverage the local context and global topics of dialogues to solve the out-of-text PCR problem. Extensive experiments demonstrate the effectiveness of adding topic regularization for resolving exophoric pronouns.</abstract>
       <url hash="3cd4dfc9">2021.emnlp-main.311</url>
@@ -5583,7 +5583,7 @@
       <author><first>Liwei</first><last>Wang</last></author>
       <author><first>Kun</first><last>Xu</last></author>
       <author><first>Zhaopeng</first><last>Tu</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <pages>4913–4924</pages>
       <abstract>The task of dialogue rewriting aims to reconstruct the latest dialogue utterance by copying the missing content from the dialogue context. Until now, the existing models for this task suffer from the robustness issue, i.e., performances drop dramatically when testing on a different dataset. We address this robustness issue by proposing a novel sequence-tagging-based model so that the search space is significantly reduced, yet the core of this task is still well covered. As a common issue of most tagging models for text generation, the model’s outputs may lack fluency. To alleviate this issue, we inject the loss signal from BLEU or GPT-2 under a REINFORCE framework. Experiments show huge improvements of our model over the current state-of-the-art systems when transferring to another dataset.</abstract>
       <url hash="8ff8e055">2021.emnlp-main.402</url>
@@ -6291,7 +6291,7 @@
       <author><first>Lifeng</first><last>Jin</last></author>
       <author><first>Linfeng</first><last>Song</last></author>
       <author><first>Kun</first><last>Xu</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <pages>5647–5663</pages>
       <abstract>In order to alleviate the huge demand for annotated datasets for different tasks, many recent natural language processing datasets have adopted automated pipelines for fast-tracking usable data. However, model training with such datasets poses a challenge because popular optimization objectives are not robust to label noise induced in the annotation generation process. Several noise-robust losses have been proposed and evaluated on tasks in computer vision, but they generally use a single dataset-wise hyperparamter to control the strength of noise resistance. This work proposes novel instance-adaptive training frameworks to change single dataset-wise hyperparameters of noise resistance in such losses to be instance-wise. Such instance-wise noise resistance hyperparameters are predicted by special instance-level label quality predictors, which are trained along with the main classification models. Experiments on noisy and corrupted NLP datasets show that proposed instance-adaptive training frameworks help increase the noise-robustness provided by such losses, promoting the use of the frameworks and associated losses in NLP models trained with noisy data.</abstract>
       <url hash="4d50790d">2021.emnlp-main.457</url>
@@ -8368,7 +8368,7 @@
       <author><first>Lifeng</first><last>Jin</last></author>
       <author><first>Jianshu</first><last>Chen</last></author>
       <author><first>Dian</first><last>Yu</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <pages>7741–7751</pages>
       <abstract>Word Sense Disambiguation (WSD) aims to automatically identify the exact meaning of one word according to its context. Existing supervised models struggle to make correct predictions on rare word senses due to limited training data and can only select the best definition sentence from one predefined word sense inventory (e.g., WordNet). To address the data sparsity problem and generalize the model to be independent of one predefined inventory, we propose a gloss alignment algorithm that can align definition sentences (glosses) with the same meaning from different sense inventories to collect rich lexical knowledge. We then train a model to identify semantic equivalence between a target word in context and one of its glosses using these aligned inventories, which exhibits strong transfer capability to many WSD tasks. Experiments on benchmark datasets show that the proposed method improves predictions on both frequent and rare word senses, outperforming prior work by 1.2% on the All-Words WSD Task and 4.3% on the Low-Shot WSD Task. Evaluation on WiC Task also indicates that our method can better capture word meanings in context.</abstract>
       <url hash="1817095b">2021.emnlp-main.610</url>
diff --git a/data/xml/2021.findings.xml b/data/xml/2021.findings.xml
index 0b0d2fd8ac..0f1651a200 100644
--- a/data/xml/2021.findings.xml
+++ b/data/xml/2021.findings.xml
@@ -5737,7 +5737,7 @@
       <title>Self-Teaching Machines to Read and Comprehend with Large-Scale Multi-Subject Question-Answering Data</title>
       <author><first>Dian</first><last>Yu</last></author>
       <author><first>Kai</first><last>Sun</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <author><first>Claire</first><last>Cardie</last></author>
       <pages>56–68</pages>
       <abstract>Despite considerable progress, most machine reading comprehension (MRC) tasks still lack sufficient training data to fully exploit powerful deep neural network models with millions of parameters, and it is laborious, expensive, and time-consuming to create large-scale, high-quality MRC data through crowdsourcing. This paper focuses on generating more training data for MRC tasks by leveraging existing question-answering (QA) data. We first collect a large-scale multi-subject multiple-choice QA dataset for Chinese, ExamQA. We next use incomplete, yet relevant snippets returned by a web search engine as the context for each QA instance to convert it into a weakly-labeled MRC instance. To better use the weakly-labeled data to improve a target MRC task, we evaluate and compare several methods and further propose a self-teaching paradigm. Experimental results show that, upon state-of-the-art MRC baselines, we can obtain +5.1% in accuracy on a multiple-choice Chinese MRC dataset, Cˆ3, and +3.8% in exact match on an extractive Chinese MRC dataset, CMRC 2018, demonstrating the usefulness of the generated QA-based weakly-labeled data for different types of MRC tasks as well as the effectiveness of self-teaching. ExamQA will be available at <url>https://dataset.org/examqa/</url>.</abstract>
diff --git a/data/xml/2021.naacl.xml b/data/xml/2021.naacl.xml
index e10381b920..178c8615a4 100644
--- a/data/xml/2021.naacl.xml
+++ b/data/xml/2021.naacl.xml
@@ -1610,7 +1610,7 @@
       <author><first>Linfeng</first><last>Song</last></author>
       <author><first>Lifeng</first><last>Jin</last></author>
       <author><first>Kun</first><last>Xu</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <author><first>Jiebo</first><last>Luo</last></author>
       <pages>1513–1524</pages>
       <abstract>We investigate video-aided grammar induction, which learns a constituency parser from both unlabeled text and its corresponding video. Existing methods of multi-modal grammar induction focus on grammar induction from text-image pairs, with promising results showing that the information from static images is useful in induction. However, videos provide even richer information, including not only static objects but also actions and state changes useful for inducing verb phrases. In this paper, we explore rich features (e.g. action, object, scene, audio, face, OCR and speech) from videos, taking the recent Compound PCFG model as the baseline. We further propose a Multi-Modal Compound PCFG model (MMC-PCFG) to effectively aggregate these rich features from different modalities. Our proposed MMC-PCFG is trained end-to-end and outperforms each individual modality and previous state-of-the-art systems on three benchmarks, i.e. DiDeMo, YouCook2 and MSRVTT, confirming the effectiveness of leveraging video information for unsupervised grammar induction.</abstract>
diff --git a/data/xml/2021.wmt.xml b/data/xml/2021.wmt.xml
index 2df80ca98f..b5dc23f912 100644
--- a/data/xml/2021.wmt.xml
+++ b/data/xml/2021.wmt.xml
@@ -686,7 +686,7 @@
       <author><first>Wanying</first><last>Xie</last></author>
       <author><first>Bojie</first><last>Hu</last></author>
       <author><first>Han</first><last>Yang</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-blcu"><first>Dong</first><last>Yu</last></author>
       <author><first>Qi</first><last>Ju</last></author>
       <pages>439–445</pages>
       <abstract>This paper describes TenTrans large-scale multilingual machine translation system for WMT 2021. We participate in the Small Track 2 in five South East Asian languages, thirty directions: Javanese, Indonesian, Malay, Tagalog, Tamil, English. We mainly utilized forward/back-translation, in-domain data selection, knowledge distillation, and gradual fine-tuning from the pre-trained model FLORES-101. We find that forward/back-translation significantly improves the translation results, data selection and gradual fine-tuning are particularly effective during adapting domain, while knowledge distillation brings slight performance improvement. Also, model averaging is used to further improve the translation performance based on these systems. Our final system achieves an average BLEU score of 28.89 across thirty directions on the test set.</abstract>
diff --git a/data/xml/2022.acl.xml b/data/xml/2022.acl.xml
index 8fab4ed432..e2bf023287 100644
--- a/data/xml/2022.acl.xml
+++ b/data/xml/2022.acl.xml
@@ -2737,7 +2737,7 @@
       <author><first>Irene</first><last>Li</last></author>
       <author><first>Linfeng</first><last>Song</last></author>
       <author><first>Kun</first><last>Xu</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <pages>2790-2800</pages>
       <abstract>Coreference resolution over semantic graphs like AMRs aims to group the graph nodes that represent the same entity. This is a crucial step for making document-level formal semantic representations. With annotated data on AMR coreference resolution, deep learning approaches have recently shown great potential for this task, yet they are usually data-hungry and annotations are costly. We propose a general pretraining method using variational graph autoencoder (VGAE) for AMR coreference resolution, which can leverage any general AMR corpus and even automatically parsed AMR data. Experiments on benchmarks show that the pretraining approach achieves performance gains of up to 6% absolute F1 points. Moreover, our model significantly improves on the previous state-of-the-art model by up to 11% F1.</abstract>
       <url hash="312f40bb">2022.acl-long.199</url>
@@ -4156,7 +4156,7 @@ in the Case of Unambiguous Gender</title>
       <author><first>Kaiqiang</first><last>Song</last></author>
       <author><first>Chen</first><last>Li</last></author>
       <author><first>Xiaoyang</first><last>Wang</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <author id="fei-liu-utdallas"><first>Fei</first><last>Liu</last></author>
       <pages>4407-4418</pages>
       <abstract>Podcasts have shown a recent rise in popularity. Summarization of podcasts is of practical benefit to both content providers and consumers. It helps people quickly decide whether they will listen to a podcast and/or reduces the cognitive load of content providers to write summaries. Nevertheless, podcast summarization faces significant challenges including factual inconsistencies of summaries with respect to the inputs. The problem is exacerbated by speech disfluencies and recognition errors in transcripts of spoken language. In this paper, we explore a novel abstractive summarization method to alleviate these issues. Our approach learns to produce an abstractive summary while grounding summary segments in specific regions of the transcript to allow for full inspection of summary details. We conduct a series of analyses of the proposed approach on a large podcast dataset and show that the approach can achieve promising results. Grounded summaries bring clear benefits in locating the summary and transcript segments that contain inconsistent information, and hence improve summarization quality in terms of automatic and human evaluation.</abstract>
@@ -8195,7 +8195,7 @@ in the Case of Unambiguous Gender</title>
       <author orcid="0000-0001-8262-4906"><first>Kai</first><last>Sun</last></author>
       <author><first>Dian</first><last>Yu</last></author>
       <author><first>Jianshu</first><last>Chen</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <author><first>Claire</first><last>Cardie</last></author>
       <pages>8736-8747</pages>
       <abstract>To perform well on a machine reading comprehension (MRC) task, machine readers usually require commonsense knowledge that is not explicitly mentioned in the given documents. This paper aims to extract a new kind of structured knowledge from scripts and use it to improve MRC. We focus on scripts as they contain rich verbal and nonverbal messages, and two relevant messages originally conveyed by different modalities during a short time period may serve as arguments of a piece of commonsense knowledge as they function together in daily communications. To save human efforts to name relations, we propose to represent relations implicitly by situating such an argument pair in a context and call it contextualized knowledge. To use the extracted knowledge to improve MRC, we compare several fine-tuning strategies to use the weakly-labeled MRC data constructed based on contextualized knowledge and further design a teacher-student paradigm with multiple teachers to facilitate the transfer of knowledge in weakly-labeled MRC data. Experimental results show that our paradigm outperforms other methods that use weakly-labeled data and improves a state-of-the-art baseline by 4.3% in accuracy on a Chinese multiple-choice MRC dataset C<tex-math>^3</tex-math>, wherein most of the questions require unstated prior knowledge. We also seek to transfer the knowledge to other tasks by simply adapting the resulting student reader, yielding a 2.9% improvement in F1 on a relation extraction dataset DialogRE, demonstrating the potential usefulness of the knowledge for non-MRC tasks that require document comprehension.</abstract>
@@ -8569,7 +8569,7 @@ in the Case of Unambiguous Gender</title>
       <author><first>Wenlin</first><last>Yao</last></author>
       <author><first>Dian</first><last>Yu</last></author>
       <author><first>Kaiqiang</first><last>Song</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <author><first>Jianshu</first><last>Chen</last></author>
       <pages>212-218</pages>
       <abstract>Comprehending a dialogue requires a model to capture diverse kinds of key information in the utterances, which are either scattered around or implicitly implied in different turns of conversations. Therefore, dialogue comprehension requires diverse capabilities such as paraphrasing, summarizing, and commonsense reasoning. Towards the objective of pre-training a zero-shot dialogue comprehension model, we develop a novel narrative-guided pre-training strategy that learns by narrating the key information from a dialogue input. However, the dialogue-narrative parallel corpus for such a pre-training strategy is currently unavailable. For this reason, we first construct a dialogue-narrative parallel corpus by automatically aligning movie subtitles and their synopses. We then pre-train a BART model on the data and evaluate its performance on four dialogue-based tasks that require comprehension. Experimental results show that our model not only achieves superior zero-shot performance but also exhibits stronger fine-grained dialogue comprehension capabilities. The data and code are available at <url>https://github.com/zhaochaocs/Diana</url>.</abstract>
@@ -8813,7 +8813,7 @@ in the Case of Unambiguous Gender</title>
       <author><first>Xiaoman</first><last>Pan</last></author>
       <author><first>Wenlin</first><last>Yao</last></author>
       <author><first>Dian</first><last>Yu</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <author><first>Jianshu</first><last>Chen</last></author>
       <pages>371-377</pages>
       <abstract>We consider the problem of pretraining a two-stage open-domain question answering (QA) system (retriever + reader) with strong transfer capabilities. The key challenge is how to construct a large amount of high-quality question-answer-context triplets without task-specific annotations. Specifically, the triplets should align well with downstream tasks by: (i) covering a wide range of domains (for open-domain applications), (ii) linking a question to its semantically relevant context with supporting evidence (for training the retriever), and (iii) identifying the correct answer in the context (for training the reader). Previous pretraining approaches generally fall short of one or more of these requirements. In this work, we automatically construct a large-scale corpus that meets all three criteria by consulting millions of references cited within Wikipedia. The well-aligned pretraining signals benefit both the retriever and the reader significantly. Our pretrained retriever leads to 2%-10% absolute gains in top-20 accuracy. And with our pretrained reader, the entire system improves by up to 4% in exact match.</abstract>
diff --git a/data/xml/2022.ccl.xml b/data/xml/2022.ccl.xml
index 169b5c785f..2f5e59af14 100644
--- a/data/xml/2022.ccl.xml
+++ b/data/xml/2022.ccl.xml
@@ -458,7 +458,7 @@
       <title><fixed-case>C</fixed-case>ore<fixed-case>V</fixed-case>alue:面向价值观计算的中文核心价值-行为体系及知识库(<fixed-case>C</fixed-case>ore<fixed-case>V</fixed-case>alue: <fixed-case>C</fixed-case>hinese Core Value-Behavior Frame and Knowledge Base for Value Computing)</title>
       <author><first>Pengyuan</first><last>Liu</last><variant script="hani"><first>鹏远</first><last>刘</last></variant></author>
       <author><first>Sanle</first><last>Zhang</last><variant script="hani"><first>三乐</first><last>张</last></variant></author>
-      <author><first>Dong</first><last>Yu</last><variant script="hani"><first>东</first><last>于</last></variant></author>
+      <author id="dong-yu-blcu"><first>Dong</first><last>Yu</last><variant script="hani"><first>东</first><last>于</last></variant></author>
       <author><first>Lin</first><last>Bo</last><variant script="hani"><first>琳</first><last>薄</last></variant></author>
       <pages>417–430</pages>
       <abstract>“由主体行为推断其价值观是人工智能理解并具有人类价值观的前提之一。在NLP相关领域,研究主要集中在对文本价值观或道德的是非判断上,鲜见由主体行为推断其价值观的工作,也缺乏相应的数据资源。该文首先构建了中文核心价值-行为体系。该体系以社会主义核心价值观为基础,分为两部分:1)类别体系。共包含8大类核心价值,进一步细分为19小类双方向价值并对应38类行为;2)要素体系。划分为核心与非核心要素共7种。随后,抽取语料中含有主体行为的文本句,依据该体系进行人工标注,构建了一个包含6994个行为句及其对应的细粒度价值与方向,34965个要素的细粒度中文价值-行为知识库。最后,该文提出了价值观类别判别、方向判别及联合判别任务并进行了实验。结果表明,基于预训练语言模型的方法在价值观方向判别上表现优异,在细粒度价值类别判别以及价值类别多标签判别上,有较大提升空间。”</abstract>
diff --git a/data/xml/2022.coling.xml b/data/xml/2022.coling.xml
index 88e774c6d8..5d2b50f109 100644
--- a/data/xml/2022.coling.xml
+++ b/data/xml/2022.coling.xml
@@ -1246,7 +1246,7 @@
       <title>From Polarity to Intensity: Mining Morality from Semantic Space</title>
       <author><first>Chunxu</first><last>Zhao</last></author>
       <author><first>Pengyuan</first><last>Liu</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-blcu"><first>Dong</first><last>Yu</last></author>
       <pages>1250–1262</pages>
       <abstract>Most works on computational morality focus on moral polarity recognition, i.e., distinguishing right from wrong. However, a discrete polarity label is not informative enough to reflect morality as it does not contain any degree or intensity information. Existing approaches to compute moral intensity are limited to word-level measurement and heavily rely on human labelling. In this paper, we propose MoralScore, a weakly-supervised framework that can automatically measure moral intensity from text. It only needs moral polarity labels, which are more robust and easier to acquire. Besides, the framework can capture latent moral information not only from words but also from sentence-level semantics which can provide a more comprehensive measurement. To evaluate the performance of our method, we introduce a set of evaluation metrics and conduct extensive experiments. Results show that our method achieves good performance on both automatic and human evaluations.</abstract>
       <url hash="7b5f18ac">2022.coling-1.107</url>
diff --git a/data/xml/2022.emnlp.xml b/data/xml/2022.emnlp.xml
index f9fd054a23..e6eab57343 100644
--- a/data/xml/2022.emnlp.xml
+++ b/data/xml/2022.emnlp.xml
@@ -118,7 +118,7 @@
       <author><first>Kaiqiang</first><last>Song</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Xiaoyang</first><last>Wang</last><affiliation>Tencent AI Lab</affiliation></author>
       <author id="fei-liu"><first>Fei</first><last>Liu</last><affiliation>Emory University</affiliation></author>
-      <author><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <pages>106-118</pages>
       <abstract>Text segmentation is important for signaling a document’s structure. Without segmenting a long document into topically coherent sections, it is difficult for readers to comprehend the text, let alone find important information. The problem is only exacerbated by a lack of segmentation in transcripts of audio/video recordings. In this paper, we explore the role that section segmentation plays in extractive summarization of written and spoken documents. Our approach learns robust sentence representations by performing summarization and segmentation simultaneously, which is further enhanced by an optimization-based regularizer to promote selection of diverse summary sentences. We conduct experiments on multiple datasets ranging from scientific articles to spoken transcripts to evaluate the model’s performance. Our findings suggest that the model can not only achieve state-of-the-art performance on publicly available benchmarks, but demonstrate better cross-genre transferability when equipped with text segmentation. We perform a series of analyses to quantify the impact of section segmentation on summarizing written and spoken documents of substantial length and complexity.</abstract>
       <url hash="a94331cb">2022.emnlp-main.8</url>
@@ -218,7 +218,7 @@
       <author><first>Lifeng</first><last>Jin</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Haitao</first><last>Mi</last><affiliation>Tencent America</affiliation></author>
       <author><first>Kun</first><last>Xu</last><affiliation>Tencent AI Lab</affiliation></author>
-      <author><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Jiebo</first><last>Luo</last><affiliation>University of Rochester</affiliation></author>
       <pages>233-247</pages>
       <abstract>Video-aided grammar induction aims to leverage video information for finding more accurate syntactic grammars for accompanying text. While previous work focuses on building systems for inducing grammars on text that are well-aligned with video content, we investigate the scenario, in which text and video are only in loose correspondence. Such data can be found in abundance online, and the weak correspondence is similar to the indeterminacy problem studied in language acquisition. Furthermore, we build a new model that can better learn video-span correlation without manually designed features adopted by previous work. Experiments show that our model trained only on large-scale YouTube data with no text-video alignment reports strong and robust performances across three unseen datasets, despite domain shift and noisy label issues. Furthermore our model yields higher F1 scores than the previous state-of-the-art systems trained on in-domain data.</abstract>
@@ -1114,7 +1114,7 @@
       <author><first>Wenlin</first><last>Yao</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Hongming</first><last>Zhang</last><affiliation>Tencent AI Lab, Bellevue</affiliation></author>
       <author><first>Xiaoyang</first><last>Wang</last><affiliation>Tencent AI Lab</affiliation></author>
-      <author><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Jianshu</first><last>Chen</last><affiliation>Tencent AI Lab</affiliation></author>
       <pages>1186-1203</pages>
       <abstract>Large-scale pretrained language models have made significant advances in solving downstream language understanding tasks. However, they generally suffer from reporting bias, the phenomenon describing the lack of explicit commonsense knowledge in written text, e.g., ”an orange is orange”. To overcome this limitation, we develop a novel approach, Z-LaVI, to endow language models with visual imagination capabilities. Specifically, we leverage two complementary types of ”imaginations”: (i) recalling existing images through retrieval and (ii) synthesizing nonexistent images via text-to-image generation. Jointly exploiting the language inputs and the imagination, a pretrained vision-language model (e.g., CLIP) eventually composes a zero-shot solution to the original language tasks. Notably, fueling language models with imagination can effectively leverage visual knowledge to solve plain language tasks. In consequence, Z-LaVI consistently improves the zero-shot performance of existing language models across a diverse set of language tasks.</abstract>
@@ -4316,7 +4316,7 @@
       <author><first>Ruixin</first><last>Hong</last><affiliation>Tsinghua University</affiliation></author>
       <author><first>Xiaodan</first><last>Liang</last><affiliation>Sun Yat-sen University</affiliation></author>
       <author><first>Changshui</first><last>Zhang</last><affiliation>School of Automation, Tsinghua University</affiliation></author>
-      <author><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <pages>4698-4724</pages>
       <abstract>In this paper, we propose a comprehensive benchmark to investigate models’ logical reasoning capabilities in complex real-life scenarios. Current explanation datasets often employ synthetic data with simple reasoning structures. Therefore, it cannot express more complex reasoning processes, such as the rebuttal to a reasoning step and the degree of certainty of the evidence. To this end, we propose a comprehensive logical reasoning explanation form. Based on the multi-hop chain of reasoning, the explanation form includes three main components: (1) The condition of rebuttal that the reasoning node can be challenged; (2) Logical formulae that uncover the internal texture of reasoning nodes; (3) Reasoning strength indicated by degrees of certainty. The fine-grained structure conforms to the real logical reasoning scenario, better fitting the human cognitive process but, simultaneously, is more challenging for the current models. We evaluate the current best models’ performance on this new explanation form. The experimental results show that generating reasoning graphs remains a challenging task for current models, even with the help of giant pre-trained language models.</abstract>
       <url hash="2c390969">2022.emnlp-main.310</url>
@@ -5682,7 +5682,7 @@
       <author><first>Wenlin</first><last>Yao</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Xiaoyang</first><last>Wang</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Muhao</first><last>Chen</last><affiliation>USC</affiliation></author>
-      <author><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <pages>6094-6106</pages>
       <abstract>Abstractive summarization models typically learn to capture the salient information from scratch implicitly. Recent literature adds extractive summaries as guidance for abstractive summarization models to provide hints of salient content and achieves better performance. However, extractive summaries as guidance could be over strict, leading to information loss or noisy signals. Furthermore, it cannot easily adapt to documents with various abstractiveness. As the number and allocation of salience content pieces varies, it is hard to find a fixed threshold deciding which content should be included in the guidance. In this paper, we propose a novel summarization approach with a flexible and reliable salience guidance, namely SEASON (SaliencE Allocation as Guidance for Abstractive SummarizatiON). SEASON utilizes the allocation of salience expectation to guide abstractive summarization and adapts well to articles in different abstractiveness. Automatic and human evaluations on two benchmark datasets show that the proposed method is effective and reliable. Empirical results on more than one million news articles demonstrate a natural fifteen-fifty salience split for news article sentences, providing a useful insight for composing news articles.</abstract>
       <url hash="53d55a0e">2022.emnlp-main.409</url>
@@ -9788,7 +9788,7 @@
       <author><first>Yanyang</first><last>Li</last><affiliation>The Chinese University of Hong Kong</affiliation></author>
       <author><first>Wanyu</first><last>Du</last><affiliation>University of Virginia</affiliation></author>
       <author><first>Yangfeng</first><last>Ji</last><affiliation>University of Virginia</affiliation></author>
-      <author><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Michael</first><last>Lyu</last><affiliation>The Chinese University of Hong Kong</affiliation></author>
       <author><first>Liwei</first><last>Wang</last><affiliation>The Chinese University of Hong Kong</affiliation></author>
       <pages>10469-10483</pages>
diff --git a/data/xml/2022.findings.xml b/data/xml/2022.findings.xml
index 4c5ef0ea6f..878886819f 100644
--- a/data/xml/2022.findings.xml
+++ b/data/xml/2022.findings.xml
@@ -13296,7 +13296,7 @@ Faster and Smaller Speech Translation without Quality Compromise</title>
       <author><first>Haitao</first><last>Mi</last><affiliation>Tencent America</affiliation></author>
       <author><first>He</first><last>Bai</last><affiliation>University of Waterloo</affiliation></author>
       <author><first>Jimmy</first><last>Lin</last><affiliation>University of Waterloo</affiliation></author>
-      <author><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <pages>5296-5306</pages>
       <abstract>We focus on the cross-lingual Text-to-SQL semantic parsing task, where the parsers are expected to generate SQL for non-English utterances based on English database schemas. Intuitively, English translation as side information is an effective way to bridge the language gap, but noise introduced by the translation system may affect parser effectiveness. In this work, we propose a Representation Mixup Framework (Rex) for effectively exploiting translations in the cross-lingual Text-to-SQL task. Particularly, it uses a general encoding layer, a transition layer, and a target-centric layer to properly guide the information flow of the English translation. Experimental results on CSpider and VSpider show that our framework can benefit from cross-lingual training and improve the effectiveness of semantic parsers, achieving state-of-the-art performance.</abstract>
       <url hash="dd052668">2022.findings-emnlp.388</url>
@@ -15175,7 +15175,7 @@ Faster and Smaller Speech Translation without Quality Compromise</title>
       <title>Efficient Zero-shot Event Extraction with Context-Definition Alignment</title>
       <author><first>Hongming</first><last>Zhang</last><affiliation>Tencent AI Lab, Bellevue</affiliation></author>
       <author><first>Wenlin</first><last>Yao</last><affiliation>Tencent AI Lab</affiliation></author>
-      <author><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <pages>7169-7179</pages>
       <abstract>Event extraction (EE) is the task of identifying interested event mentions from text. Conventional efforts mainly focus on the supervised setting. However, these supervised models cannot generalize to event types out of the pre-defined ontology. To fill this gap, many efforts have been devoted to the zero-shot EE problem. This paper follows the trend of modeling event-type semantics but moves one step further. We argue that using the static embedding of the event type name might not be enough because a single word could be ambiguous, and we need a sentence to define the type semantics accurately. To model the definition semantics, we use two separate transformer models to project the contextualized event mentions and corresponding definitions into the same embedding space and then minimize their embedding distance via contrastive learning. On top of that, we also propose a warming phase to help the model learn the minor difference between similar definitions. We name our approach Zero-shot Event extraction with Definition (ZED). Experiments on the MAVEN dataset show that our model significantly outperforms all previous zero-shot EE methods with fast inference speed due to the disjoint design. Further experiments also show that it can be easily applied to the few-shot setting when the annotation is available and consistently outperforms baseline supervised methods.</abstract>
       <url hash="acdce326">2022.findings-emnlp.531</url>
diff --git a/data/xml/2022.lrec.xml b/data/xml/2022.lrec.xml
index 58d97d1d3b..6376fb4185 100644
--- a/data/xml/2022.lrec.xml
+++ b/data/xml/2022.lrec.xml
@@ -6864,7 +6864,7 @@
     <paper id="594">
       <title><fixed-case>CLGC</fixed-case>: A Corpus for <fixed-case>C</fixed-case>hinese Literary Grace Evaluation</title>
       <author><first>Yi</first><last>Li</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-blcu"><first>Dong</first><last>Yu</last></author>
       <author><first>Pengyuan</first><last>Liu</last></author>
       <pages>5548–5556</pages>
       <abstract>In this paper, we construct a Chinese literary grace corpus, CLGC, with 10,000 texts and more than 1.85 million tokens. Multi-level annotations are provided for each text in our corpus, including literary grace level, sentence category, and figure-of-speech type. Based on the corpus, we dig deep into the correlation between fine-grained features (semantic information, part-of-speech and figure-of-speech, etc.) and literary grace level. We also propose a new Literary Grace Evaluation (LGE) task, which aims at making a comprehensive assessment of the literary grace level according to the text. In the end, we build some classification models with machine learning algorithms (such as SVM, TextCNN) to prove the effectiveness of our features and corpus for LGE. The results of our preliminary classification experiments have achieved 79.71% on the weighted average F1-score.</abstract>
diff --git a/data/xml/2022.naacl.xml b/data/xml/2022.naacl.xml
index 9f320b0bfd..18ea04571a 100644
--- a/data/xml/2022.naacl.xml
+++ b/data/xml/2022.naacl.xml
@@ -2303,7 +2303,7 @@
       <title>End-to-End <fixed-case>C</fixed-case>hinese Speaker Identification</title>
       <author><first>Dian</first><last>Yu</last></author>
       <author><first>Ben</first><last>Zhou</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <pages>2274-2285</pages>
       <abstract>Speaker identification (SI) in texts aims to identify the speaker(s) for each utterance in texts. Previous studies divide SI into several sub-tasks (e.g., quote extraction, named entity recognition, gender identification, and coreference resolution). However, we are still far from solving these sub-tasks, making SI systems that rely on them seriously suffer from error propagation. End-to-end SI systems, on the other hand, are not limited by individual modules, but suffer from insufficient training data from the existing small-scale datasets. To make large end-to-end models possible, we design a new annotation guideline that regards SI as span extraction from the local context, and we annotate by far the largest SI dataset for Chinese named CSI based on eighteen novels. Viewing SI as a span selection task also introduces the possibility of applying existing strong extractive machine reading comprehension (MRC) baselines. Surprisingly, simply using such a baseline without human-annotated character names and carefully designed rules, we can already achieve performance comparable to or better than those of previous state-of-the-art SI methods on all public SI datasets for Chinese. Furthermore, we show that our dataset can serve as additional training data for existing benchmarks, which leads to further gains (up to 6.5% in accuracy). Finally, using CSI as a clean source, we design an effective self-training paradigm to continuously leverage hundreds of unlabeled novels.</abstract>
       <url hash="13e37f2f">2022.naacl-main.165</url>
diff --git a/data/xml/2023.acl.xml b/data/xml/2023.acl.xml
index f95bba577b..4096f02345 100644
--- a/data/xml/2023.acl.xml
+++ b/data/xml/2023.acl.xml
@@ -51,7 +51,7 @@
       <author><first>Linfeng</first><last>Song</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Haitao</first><last>Mi</last><affiliation>Tencent America</affiliation></author>
       <author><first>Wenliang</first><last>Chen</last><affiliation>Soochow University</affiliation></author>
-      <author><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <pages>22-35</pages>
       <abstract>One of the main challenges open-domain end-to-end dialogue systems, or chatbots, face is the prevalence of unsafe behavior, such as toxic languages and harmful suggestions. However, existing dialogue datasets do not provide enough annotation to explain and correct such unsafe behavior. In this work, we construct a new dataset called SafeConv for the research of conversational safety: (1) Besides the utterance-level safety labels, SafeConv also provides unsafe spans in an utterance, information able to indicate which words contribute to the detected unsafe behavior; (2) SafeConv provides safe alternative responses to continue the conversation when unsafe behavior detected, guiding the conversation to a gentle trajectory. By virtue of the comprehensive annotation of SafeConv, we benchmark three powerful models for the mitigation of conversational unsafe behavior, including a checker to detect unsafe utterances, a tagger to extract unsafe spans, and a rewriter to convert an unsafe response to a safe version. Moreover, we explore the huge benefits brought by combining the models for explaining the emergence of unsafe behavior and detoxifying chatbots. Experiments show that the detected unsafe behavior could be well explained with unsafe spans and popular chatbots could be detoxified by a huge extent. The dataset is available at <url>https://github.com/mianzhang/SafeConv</url>.</abstract>
       <url hash="8f3d4de4">2023.acl-long.2</url>
@@ -2586,7 +2586,7 @@
       <author><first>Xiaoyang</first><last>Wang</last><affiliation>Tencent AI Lab</affiliation></author>
       <author orcid="0000-0001-9263-5035"><first>Hong</first><last>Yu</last><affiliation>University of Massachusetts, Lowell</affiliation></author>
       <author id="fei-liu"><first>Fei</first><last>Liu</last><affiliation>Emory University</affiliation></author>
-      <author><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <pages>3265-3280</pages>
       <abstract>The potential choices for news article headlines are enormous, and finding the right balance between conveying the essential message and capturing the reader’s attention is key to effective headlining. However, presenting the same news headline to all readers is a suboptimal strategy, because it does not take into account the different preferences and interests of diverse readers, who may be confused about why a particular article has been recommended to them and do not see a clear connection between their interests and the recommended article. In this paper, we present a novel framework that addresses these challenges by incorporating user profiling to generate personalized headlines, and a combination of automated and human evaluation methods to determine user preference for personalized headlines. Our framework utilizes a learnable relevance function to assign personalized signature phrases to users based on their reading histories, which are then used to personalize headline generation. Through extensive evaluation, we demonstrate the effectiveness of our proposed framework in generating personalized headlines that meet the needs of a diverse audience. Our framework has the potential to improve the efficacy of news recommendations and facilitate creation of personalized content.</abstract>
       <url hash="9d80d4ea">2023.acl-long.183</url>
@@ -3061,7 +3061,7 @@
       <author orcid="0000-0003-2852-3746"><first>Ruixin</first><last>Hong</last><affiliation>Tsinghua University</affiliation></author>
       <author><first>Hongming</first><last>Zhang</last><affiliation>Tencent AI Lab, Bellevue</affiliation></author>
       <author><first>Hong</first><last>Zhao</last><affiliation>Tsinghua University</affiliation></author>
-      <author><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Changshui</first><last>Zhang</last><affiliation>Tsinghua University</affiliation></author>
       <pages>3944-3965</pages>
       <abstract>Although large language models demonstrate remarkable question-answering performances, revealing the intermediate reasoning steps that the models faithfully follow remains challenging. In this paper, we propose FAME (FAithful question answering with MontE-carlo planning) to answer questions based on faithful reasoning steps. The reasoning steps are organized as a structured entailment tree, which shows how premises are used to produce intermediate conclusions that can prove the correctness of the answer. We formulate the task as a discrete decision-making problem and solve it through the interaction of a reasoning environment and a controller. The environment is modular and contains several basic task-oriented modules, while the controller proposes actions to assemble the modules. Since the search space could be large, we introduce a Monte-Carlo planning algorithm to do a look-ahead search and select actions that will eventually lead to high-quality steps. FAME achieves advanced performance on the standard benchmark. It can produce valid and faithful reasoning steps compared with large language models with a much smaller model size.</abstract>
@@ -13772,7 +13772,7 @@
     <paper id="49">
       <title>Going Beyond Sentence Embeddings: A Token-Level Matching Algorithm for Calculating Semantic Textual Similarity</title>
       <author orcid="0000-0001-7474-8271"><first>Hongwei</first><last>Wang</last><affiliation>Tencent AI Lab</affiliation></author>
-      <author><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <pages>563-570</pages>
       <abstract>Semantic Textual Similarity (STS) measures the degree to which the underlying semantics of paired sentences are equivalent. State-of-the-art methods for STS task use language models to encode sentences into embeddings. However, these embeddings are limited in representing semantics because they mix all the semantic information together in fixed-length vectors, which are difficult to recover and lack explainability. This paper presents a token-level matching inference algorithm, which can be applied on top of any language model to improve its performance on STS task. Our method calculates pairwise token-level similarity and token matching scores, and then aggregates them with pretrained token weights to produce sentence similarity. Experimental results on seven STS datasets show that our method improves the performance of almost all language models, with up to 12.7% gain in Spearman’s correlation. We also demonstrate that our method is highly explainable and computationally efficient.</abstract>
       <url hash="8ce896c5">2023.acl-short.49</url>
diff --git a/data/xml/2023.ccl.xml b/data/xml/2023.ccl.xml
index 8fb1949410..deb609f5ff 100644
--- a/data/xml/2023.ccl.xml
+++ b/data/xml/2023.ccl.xml
@@ -330,7 +330,7 @@
     <paper id="26">
       <title>中国社会道德变化模型与发展动因探究——基于70年《人民日报》的计量与分析 (The Model of Moral Change and Motivation in <fixed-case>C</fixed-case>hinese Society ——<fixed-case>T</fixed-case>he Vocabulary Analysis of the 70-year ”People’s Daily”)</title>
       <author><first>Hongrui</first><last>Wang</last><variant script="hani"><first>弘睿</first><last>王</last></variant></author>
-      <author><first>Dong</first><last>Yu</last><variant script="hani"><first>东</first><last>于</last></variant></author>
+      <author id="dong-yu-blcu"><first>Dong</first><last>Yu</last><variant script="hani"><first>东</first><last>于</last></variant></author>
       <author><first>Pengyuan</first><last>Liu</last><variant script="hani"><first>鹏远</first><last>刘</last></variant></author>
       <author><first>Liying</first><last>Ceng</last><variant script="hani"><first>立英</first><last>曾</last></variant></author>
       <pages>289–299</pages>
@@ -342,7 +342,7 @@
     <paper id="27">
       <title>动词视角下的汉语性别表征研究——基于多语体语料库与依存分析(Gendered Representation in <fixed-case>C</fixed-case>hinese via Verbal Analysis —<fixed-case>B</fixed-case>ased on a Multi-register Corpus and Dependency Parsing)</title>
       <author><first>Yingshi</first><last>Chen</last><variant script="hani"><first>颖诗</first><last>陈</last></variant></author>
-      <author><first>Dong</first><last>Yu</last><variant script="hani"><first>东</first><last>于</last></variant></author>
+      <author id="dong-yu-blcu"><first>Dong</first><last>Yu</last><variant script="hani"><first>东</first><last>于</last></variant></author>
       <author><first>Pengyuan</first><last>Liu</last><variant script="hani"><first>鹏远</first><last>刘</last></variant></author>
       <pages>301–314</pages>
       <abstract>“动作是反映性别社会化的重要形式,研究汉语中动词的性别表征,可以找到语言构建不同性别身份的路径,即所采用的方式、形式。本文以依存句法关系为抓手,在四种语体的语料中抽取出和不同性别词构成依存结构的动词,统计出有显著性别差异的动词,并根据性别词充当的句子成分,结合语义进行了定量和定性分析。总体来看,大部分汉语动词表征是中性的,能体现性别的动词是少数,汉语作为一种承载着中华智慧且具有深厚文化底蕴的语言,对性别的表征是中立且平等的,这也体现出了我国的性别平等观念。而在表征性别的动词中,能看到构建男性和女性身份的两种不同路径。显著表征女性的动词在不同语体的语料中均多于显著表征男性的,但是表征男性的动词的语义分布则更为均衡,体现了“男性默认-女性专门”。在司法动词上,女性常常作为暴力行为的受害者,同时施害者男性却隐身了,体现了筜男性主宰笭女性顺从笢。不同语体的动词在构建性别时体现了不同的功能,新闻塑造了较为传统的性别规范,传统和网络文学以不同的形式打破了固有的性别规范。”</abstract>
@@ -411,7 +411,7 @@
       <author><first>Jiadai</first><last>Sun</last><variant script="hani"><first>嘉黛</first><last>孙</last></variant></author>
       <author><first>Siyi</first><last>Tang</last><variant script="hani"><first>思怡</first><last>汤</last></variant></author>
       <author><first>Shike</first><last>Wang</last><variant script="hani"><first>诗可</first><last>王</last></variant></author>
-      <author><first>Dong</first><last>Yu</last><variant script="hani"><first>东</first><last>于</last></variant></author>
+      <author id="dong-yu-blcu"><first>Dong</first><last>Yu</last><variant script="hani"><first>东</first><last>于</last></variant></author>
       <author><first>Pengyuan</first><last>Liu</last><variant script="hani"><first>鹏远</first><last>刘</last></variant></author>
       <pages>364–376</pages>
       <abstract>“现有的文本分级阅读研究往往从文本可读性的角度出发,以离散的文本难度等级的形式为读者推荐阅读书目。目前,仍缺少一种研究读者在阅读过程中产生的多方面、深层次阅读体验的体系结构。对此,我们调研了读者在阅读中文篇章过程中产生的不同阅读体验,提出了中文篇章多维度阅读体验的量化体系。我们将阅读过程中呈现的连续性的阅读体验归纳为多种类别,并在此基础上构建了中文篇章多维度阅读体验数据集。同时,我们探究了以大规模语言模型为基础的ChatGPT对阅读体验的量化能力,发现其虽具备强大的信息抽取和语义理解能力,在阅读体验的量化上却表现不佳。但我们发现大规模语言模型所蕴含的能力能够以知识蒸馏的方式协助深层属性的量化,基于此,我们实现了大规模语言模型增强的中文篇章多维阅读体验量化模型。模型在各维度阅读体验上的平均F1值达到0.72,高于ChatGPT的Fewshot结果0.48。”</abstract>
diff --git a/data/xml/2023.eacl.xml b/data/xml/2023.eacl.xml
index fcbebc00a7..9928b4cedd 100644
--- a/data/xml/2023.eacl.xml
+++ b/data/xml/2023.eacl.xml
@@ -245,7 +245,7 @@
       <author><first>Linfeng</first><last>Song</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Haitao</first><last>Mi</last><affiliation>Tencent America</affiliation></author>
       <author><first>Xiabing</first><last>Zhou</last><affiliation>Soochow University</affiliation></author>
-      <author><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <pages>232-247</pages>
       <abstract>Current self-training methods such as standard self-training, co-training, tri-training, and others often focus on improving model performance on a single task, utilizing differences in input features, model architectures, and training processes. However, many tasks in natural language processing are about different but related aspects of language, and models trained for one task can be great teachers for other related tasks. In this work, we propose friend-training, a cross-task self-training framework, where models trained to do different tasks are used in an iterative training, pseudo-labeling, and retraining process to help each other for better selection of pseudo-labels. With two dialogue understanding tasks, conversational semantic role labeling and dialogue rewriting, chosen for a case study, we show that the models trained with the friend-training framework achieve the best performance compared to strong baselines.</abstract>
       <url hash="571691ef">2023.eacl-main.18</url>
@@ -2929,7 +2929,7 @@
       <author><first>Xiaoman</first><last>Pan</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Kaiqiang</first><last>Song</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Dian</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
-      <author><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Jianshu</first><last>Chen</last><affiliation>Tencent AI Lab</affiliation></author>
       <pages>3001-3010</pages>
       <abstract>Understanding sentence semantics requires an interpretation of the main information from a concrete context. To investigate how individual word contributes to sentence semantics, we propose a perturbation method for unsupervised semantic analysis. We next re-examine SOTA sentence embedding models’ ability to capture the main semantics of a sentence by developing a new evaluation metric to adapt sentence compression datasets for automatic evaluation. Results on three datasets show that unsupervised discourse relation recognition can serve as a general inference task that can more effectively aggregate information to essential contents than several SOTA unsupervised sentence embedding models.</abstract>
diff --git a/data/xml/2023.emnlp.xml b/data/xml/2023.emnlp.xml
index 2ad35e4b2d..d68b82e5a2 100644
--- a/data/xml/2023.emnlp.xml
+++ b/data/xml/2023.emnlp.xml
@@ -2438,7 +2438,7 @@
       <title>Bridging the Gap between Synthetic and Authentic Images for Multimodal Machine Translation</title>
       <author><first>Wenyu</first><last>Guo</last></author>
       <author><first>Qingkai</first><last>Fang</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-blcu"><first>Dong</first><last>Yu</last></author>
       <author><first>Yang</first><last>Feng</last></author>
       <pages>2863-2874</pages>
       <abstract>Multimodal machine translation (MMT) simultaneously takes the source sentence and a relevant image as input for translation. Since there is no paired image available for the input sentence in most cases, recent studies suggest utilizing powerful text-to-image generation models to provide image inputs. Nevertheless, synthetic images generated by these models often follow different distributions compared to authentic images. Consequently, using authentic images for training and synthetic images for inference can introduce a distribution shift, resulting in performance degradation during inference. To tackle this challenge, in this paper, we feed synthetic and authentic images to the MMT model, respectively. Then we minimize the gap between the synthetic and authentic images by drawing close the input image representations of the Transformer Encoder and the output distributions of the Transformer Decoder. Therefore, we mitigate the distribution disparity introduced by the synthetic images during inference, thereby freeing the authentic images from the inference process. Experimental results show that our approach achieves state-of-the-art performance on the Multi30K En-De and En-Fr datasets, while remaining independent of authentic images during inference.</abstract>
@@ -12582,7 +12582,7 @@ The experiments were repeated and the tables and figures were updated. Changes a
       <author><first>Kaiqiang</first><last>Song</last></author>
       <author><first>Hongming</first><last>Zhang</last></author>
       <author><first>Muhao</first><last>Chen</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <pages>14584-14595</pages>
       <abstract>Traditional sentence embedding models encode sentences into vector representations to capture useful properties such as the semantic similarity between sentences. However, in addition to similarity, sentence semantics can also be interpreted via compositional operations such as sentence fusion or difference. It is unclear whether the compositional semantics of sentences can be directly reflected as compositional operations in the embedding space. To more effectively bridge the continuous embedding and discrete text spaces, we explore the plausibility of incorporating various compositional properties into the sentence embedding space that allows us to interpret embedding transformations as compositional sentence operations. We propose InterSent, an end-to-end framework for learning interpretable sentence embeddings that supports compositional sentence operations in the embedding space. Our method optimizes operator networks and a bottleneck encoder-decoder model to produce meaningful and interpretable sentence embeddings. Experimental results demonstrate that our method significantly improves the interpretability of sentence embeddings on four textual generation tasks over existing approaches while maintaining strong performance on traditional semantic similarity tasks.</abstract>
       <url hash="04ca334f">2023.emnlp-main.900</url>
@@ -14272,7 +14272,7 @@ The experiments were repeated and the tables and figures were updated. Changes a
       <author><first>Nan</first><last>Du</last></author>
       <author><first>Longyue</first><last>Wang</last></author>
       <author><first>Haitao</first><last>Mi</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <pages>16396-16413</pages>
       <abstract>Nonverbal messages (NM) such as speakers’ facial expressions and speed of speech are essential for face-to-face communication, and they can be regarded as implicit knowledge as they are usually not included in existing dialogue understanding or generation tasks. This paper introduces the task of extracting NMs in written text and generating NMs for spoken text. Previous studies merely focus on extracting NMs from relatively small-scale well-structured corpora such as movie scripts wherein NMs are enclosed in parentheses by scriptwriters, which greatly decreases the difficulty of extraction. To enable extracting NMs from unstructured corpora, we annotate the first NM extraction dataset for Chinese based on novels and develop three baselines to extract single-span or multi-span NM of a target utterance from its surrounding context. Furthermore, we use the extractors to extract 749K (context, utterance, NM) triples from Chinese novels and investigate whether we can use them to improve NM generation via semi-supervised learning. Experimental results demonstrate that the automatically extracted triples can serve as high-quality augmentation data of clean triples extracted from scripts to generate more relevant, fluent, valid, and factually consistent NMs than the purely supervised generator, and the resulting generator can in turn help Chinese dialogue understanding tasks such as dialogue machine reading comprehension and emotion classification by simply adding the predicted “unspoken” NM to each utterance or narrative in inputs.</abstract>
       <url hash="3b619090">2023.emnlp-main.1021</url>
diff --git a/data/xml/2023.findings.xml b/data/xml/2023.findings.xml
index 3405d2806a..90e9325d74 100644
--- a/data/xml/2023.findings.xml
+++ b/data/xml/2023.findings.xml
@@ -5963,7 +5963,7 @@
       <author orcid="0000-0002-4704-5455"><first>Zhenhailong</first><last>Wang</last><affiliation>University of Illinois at Urbana-Champaign</affiliation></author>
       <author><first>Xiaoman</first><last>Pan</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Dian</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
-      <author><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Jianshu</first><last>Chen</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Heng</first><last>Ji</last><affiliation>University of Illinois at Urbana-Champaign and Amazon (Amazon Scholar)</affiliation></author>
       <pages>3978-4004</pages>
@@ -6263,7 +6263,7 @@
       <author><first>Xiaoyang</first><last>Wang</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Xiaoman</first><last>Pan</last><affiliation>Tencent AI Lab</affiliation></author>
       <author orcid="0000-0001-6251-6078"><first>Linda</first><last>Petzold</last><affiliation>UC Santa Barbara</affiliation></author>
-      <author><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <pages>4381-4401</pages>
       <abstract>Aspect or query-based summarization has recently caught more attention, as it can generate differentiated summaries based on users’ interests. However, the current dataset for aspect or query-based summarization either focuses on specific domains, on a relatively small scale, or contains only a few aspect types. Such limitations hinder further explorations in this direction. In this work, we take advantage of crowd-sourcing knowledge on Wikipedia and automatically create a high-quality, large-scale open-domain aspect-based summarization dataset named OASum, which contains more than 3.7 million instances with around 1 million different aspects on 2 million Wikipedia pages. We provide benchmark results on OASum and demonstrate its ability for diverse aspect-based summarization generation. To overcome the data scarcity problem on specific domains, we also perform zero-shot, few-shot, and fine-tuning on seven downstream datasets. Specifically, zero/few-shot and fine-tuning results show that the model pre-trained on our corpus demonstrates a strong aspect or query-focused generation ability compared with the backbone model. Our dataset and pre-trained checkpoints are publicly available.</abstract>
       <url hash="03358834">2023.findings-acl.268</url>
@@ -9423,7 +9423,7 @@
       <author><first>Chunlei</first><last>Zhang</last><affiliation>Tencent AI Lab</affiliation></author>
       <author orcid="0000-0002-9160-3848"><first>Yi</first><last>Ren</last><affiliation>Bytedance</affiliation></author>
       <author><first>Zhou</first><last>Zhao</last><affiliation>zhejiang university</affiliation></author>
-      <author><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <pages>8018-8034</pages>
       <abstract>Expressive text-to-speech aims to generate high-quality samples with rich and diverse prosody, which is hampered by <b>dual challenges</b>: 1) prosodic attributes in highly dynamic voices are difficult to capture and model without intonation; and 2) highly multimodal prosodic representations cannot be well learned by simple regression (e.g., MSE) objectives, which causes blurry and over-smoothing predictions. This paper proposes Prosody-TTS, a two-stage pipeline that enhances <b>prosody modeling and sampling</b> by introducing several components: 1) a self-supervised masked autoencoder to model the prosodic representation without relying on text transcriptions or local prosody attributes, which ensures to cover diverse speaking voices with superior generalization; and 2) a diffusion model to sample diverse prosodic patterns within the latent space, which prevents TTS models from generating samples with dull prosodic performance. Experimental results show that Prosody-TTS achieves new state-of-the-art in text-to-speech with natural and expressive synthesis. Both subjective and objective evaluation demonstrate that it exhibits superior audio quality and prosody naturalness with rich and diverse prosodic attributes. Audio samples are available at <url>https://improved_prosody.github.io</url></abstract>
       <url hash="51560c7a">2023.findings-acl.508</url>
@@ -9869,7 +9869,7 @@
       <author><first>Lifeng</first><last>Jin</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Linfeng</first><last>Song</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Haitao</first><last>Mi</last><affiliation>Tencent America</affiliation></author>
-      <author><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <pages>8569-8588</pages>
       <abstract>Training a large language model in low-resource settings is challenging since they are susceptible to overfitting with limited generalization abilities. Previous work addresses this issue by approaches such as tunable parameters reduction or data augmentation. However, they either limit the trained models’ expressiveness or rely on task-independent knowledge. In this paper, we propose the Bi-level Finetuning with Task-dependent Similarity Structure framework where all parameters, including the embeddings for unseen tokens, are finetuned with task-dependent information from the training data only. In this framework, a task-dependent similarity structure is learned in a data-driven fashion, which in turn is used to compose soft embeddings from conventional embeddings to be used in training to update all parameters. In order to learn the similarity structure and model parameters, we propose a bi-level optimization algorithm with two stages—search and finetune—to ensure successful learning. Results of experiments on several classification datasets in low-resource scenarios demonstrate that models trained with our method outperform strong baselines. Ablation experiments further support the effectiveness of different components in our framework. Code is available at <url>https://github.com/Sai-Ashish/BFTSS</url>.</abstract>
       <url hash="d1480a33">2023.findings-acl.544</url>
@@ -23624,7 +23624,7 @@
       <title>On the Dimensionality of Sentence Embeddings</title>
       <author><first>Hongwei</first><last>Wang</last></author>
       <author><first>Hongming</first><last>Zhang</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <pages>10344-10354</pages>
       <abstract>Learning sentence embeddings is a fundamental problem in natural language processing. While existing research primarily focuses on enhancing the quality of sentence embeddings, the exploration of sentence embedding dimensions is limited. Here we present a comprehensive and empirical analysis of the dimensionality of sentence embeddings. First, we demonstrate that the optimal dimension of sentence embeddings is usually smaller than the default value. Subsequently, to compress the dimension of sentence embeddings with minimum performance degradation, we identify two components contributing to the overall performance loss: the encoder’s performance loss and the pooler’s performance loss. Therefore, we propose a two-step training method for sentence representation learning models, wherein the encoder and the pooler are optimized separately to mitigate the overall performance loss in low-dimension scenarios. Experimental results on seven STS tasks and seven sentence classification tasks demonstrate that our method significantly improves the performance of low-dimensional sentence embeddings.</abstract>
       <url hash="d5e0d295">2023.findings-emnlp.694</url>
@@ -27779,7 +27779,7 @@
       <author><first>Xiaoman</first><last>Pan</last></author>
       <author><first>Kaiqiang</first><last>Song</last></author>
       <author><first>Hongming</first><last>Zhang</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <author><first>Jianshu</first><last>Chen</last></author>
       <pages>15108-15127</pages>
       <abstract>This work considers the problem of Open-world Entity Profiling, a sub-domain of Open-world Information Extraction (Open-world IE). Unlike the conventional closed-world IE, Open-world IE is considered a more general situation where entities and relations could be beyond a predefined ontology. We seek to develop a large language model (LLM) that can perform Open-world Entity Profiling with instruction tuning to extract desirable entity profiles characterized by (possibly fine-grained) natural language instructions. In particular, we construct INSTRUCTOPENWIKI, a substantial instruction-tuning dataset for Open-world Entity Profiling enriched with a comprehensive corpus, extensive annotations, and diverse instructions. We finetune pretrained BLOOM models on INSTRUCTOPENWIKI and obtain PIVOINE, an LLM for Open-world Entity Profiling with strong instruction-following capabilities. Our experiments demonstrate that PIVOINE significantly outperforms traditional methods and ChatGPT-based baselines, displaying impressive generalization capabilities on both unseen instructions and out-of-ontology cases. Consequently, PIVOINE emerges as a promising solution to tackle the open-world challenge of entity profiling.</abstract>
@@ -28562,7 +28562,7 @@
       <author><first>Xiaoyang</first><last>Wang</last></author>
       <author><first>Hongwei</first><last>Wang</last></author>
       <author><first>Jiawei</first><last>Zhang</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <pages>123–133</pages>
       <url hash="2efeb0ad">2023.findings-ijcnlp.11</url>
       <bibkey>zhang-etal-2023-unsupervised-multi</bibkey>
diff --git a/data/xml/2023.tacl.xml b/data/xml/2023.tacl.xml
index 4e8f39e6af..1fb3da7cf3 100644
--- a/data/xml/2023.tacl.xml
+++ b/data/xml/2023.tacl.xml
@@ -520,7 +520,7 @@
       <author><first>Haitao</first><last>Mi</last></author>
       <author><first>Jinsong</first><last>Su</last></author>
       <author><first>Yue</first><last>Zhang</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <doi>10.1162/tacl_a_00569</doi>
       <abstract>We focus on the factuality property during the extraction of an OpenIE corpus named OpenFact, which contains more than 12 million high-quality knowledge triplets. We break down the factuality property into two important aspects—expressiveness and groundedness—and we propose a comprehensive framework to handle both aspects. To enhance expressiveness, we formulate each knowledge piece in OpenFact based on a semantic frame. We also design templates, extra constraints, and adopt human efforts so that most OpenFact triplets contain enough details. For groundedness, we require the main arguments of each triplet to contain linked Wikidata1 entities. A human evaluation suggests that the OpenFact triplets are much more accurate and contain denser information compared to OPIEC-Linked (Gashteovski et al., 2019), one recent high-quality OpenIE corpus grounded to Wikidata. Further experiments on knowledge base completion and knowledge base question answering show the effectiveness of OpenFact over OPIEC-Linked as supplementary knowledge to Wikidata as the major KG.</abstract>
       <pages>686–702</pages>
@@ -1190,7 +1190,7 @@
       <author><first>Linfeng</first><last>Song</last></author>
       <author><first>Haitao</first><last>Mi</last></author>
       <author><first>Yongfeng</first><last>Zhang</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <doi>10.1162/tacl_a_00617</doi>
       <abstract>Pretrained natural language processing (NLP) models have achieved high overall performance, but they still make systematic errors. Instead of manual error analysis, research on slice detection models (SDMs), which automatically identify underperforming groups of datapoints, has caught escalated attention in Computer Vision for both understanding model behaviors and providing insights for future model training and designing. However, little research on SDMs and quantitative evaluation of their effectiveness have been conducted on NLP tasks. Our paper fills the gap by proposing a benchmark named “Discover, Explain, Improve (DEIm)” for classification NLP tasks along with a new SDM Edisa. Edisa discovers coherent and underperforming groups of datapoints; DEIm then unites them under human-understandable concepts and provides comprehensive evaluation tasks and corresponding quantitative metrics. The evaluation in DEIm shows that Edisa can accurately select error-prone datapoints with informative semantic features that summarize error patterns. Detecting difficult datapoints directly boosts model performance without tuning any original model parameters, showing that discovered slices are actionable for users.1</abstract>
       <pages>1537–1552</pages>
diff --git a/data/xml/2024.acl.xml b/data/xml/2024.acl.xml
index e3c165e7e4..168e8ce3c3 100644
--- a/data/xml/2024.acl.xml
+++ b/data/xml/2024.acl.xml
@@ -257,7 +257,7 @@
       <author><first>Sangwoo</first><last>Cho</last><affiliation>Capital One</affiliation></author>
       <author><first>Xiaoyang</first><last>Wang</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Hassan</first><last>Foroosh</last><affiliation>University of Central Florida</affiliation></author>
-      <author><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <author id="fei-liu"><first>Fei</first><last>Liu</last><affiliation>Emory University</affiliation></author>
       <pages>267-278</pages>
       <abstract>Large language models hold significant potential for integrating various data types, such as text documents and database records, for advanced analytics. However, blending text and numerical data presents substantial challenges. LLMs need to process and cross-reference entities and numbers, handle data inconsistencies and redundancies, and develop planning capabilities such as building a working memory for managing complex data queries. In this paper, we introduce four novel tasks centered around sports data analytics to evaluate the numerical reasoning and information fusion capabilities of LLMs. These tasks involve providing LLMs with detailed, play-by-play sports game descriptions, then challenging them with adversarial scenarios such as new game rules, longer durations, scrambled narratives, and analyzing key statistics in game summaries. We conduct extensive experiments on NBA and NFL games to assess the performance of LLMs on these tasks. Our benchmark, SportsMetrics, introduces a new mechanism for assessing LLMs’ numerical reasoning and fusion skills.</abstract>
@@ -1368,7 +1368,7 @@
       <author><first>Dan</first><last>Su</last></author>
       <author><first>Liqiang</first><last>He</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Linli</first><last>Xu</last><affiliation>University of Science and Technology of China</affiliation></author>
-      <author><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <pages>1764-1775</pages>
       <abstract>While recent advancements in speech language models have achieved significant progress, they face remarkable challenges in modeling the long acoustic sequences of neural audio codecs. In this paper, we introduce <b>G</b>enerative <b>P</b>re-trained <b>S</b>peech <b>T</b>ransformer (GPST), a hierarchical transformer designed for efficient speech language modeling. GPST quantizes audio waveforms into two distinct types of discrete speech representations and integrates them within a hierarchical transformer architecture, allowing for a unified one-stage generation process and enhancing Hi-Res audio generation capabilities. By training on large corpora of speeches in an end-to-end unsupervised manner, GPST can generate syntactically consistent speech with diverse speaker identities. Given a brief 3-second prompt, GPST can produce natural and coherent personalized speech, demonstrating in-context learning abilities. Moreover, our approach can be easily extended to spoken cross-lingual speech generation by incorporating multi-lingual semantic tokens and universal acoustic tokens. Experimental results indicate that GPST significantly outperforms the existing speech language models in terms of word error rate, speech quality, and speaker similarity. See <url>https://youngsheen.github.io/GPST/demo</url> for demo samples.</abstract>
       <url hash="5bed8498">2024.acl-long.97</url>
@@ -5171,7 +5171,7 @@
       <author><first>Yong</first><last>Dai</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Hongming</first><last>Zhang</last></author>
       <author><first>Zhenzhong</first><last>Lan</last><affiliation>Westlake University</affiliation></author>
-      <author><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <pages>6864-6890</pages>
       <abstract>The rapid advancement of large language models (LLMs) has led to a new era marked by the development of autonomous applications in real-world scenarios, which drives innovation in creating advanced web agents. Existing web agents typically only handle one input modality and are evaluated only in simplified web simulators or static web snapshots, greatly limiting their applicability in real-world scenarios. To bridge this gap, we introduce WebVoyager, an innovative Large Multimodal Model (LMM) powered web agent that can complete user instructions end-to-end by interacting with real-world websites. Moreover, we establish a new benchmark by compiling real-world tasks from 15 popular websites and introduce an automatic evaluation protocol leveraging multimodal understanding abilities of GPT-4V to evaluate open-ended web agents. We show that WebVoyager achieves a 59.1% task success rate on our benchmark, significantly surpassing the performance of both GPT-4 (All Tools) and the WebVoyager (text-only) setups, underscoring the exceptional capability of WebVoyager. The proposed automatic evaluation metric achieves 85.3% agreement with human judgment, indicating its effectiveness in providing reliable and accurate assessments of web agents.</abstract>
       <url hash="389bb686">2024.acl-long.371</url>
@@ -8201,7 +8201,7 @@
       <author><first>Jiatong</first><last>Shi</last></author>
       <author><first>Chao</first><last>Weng</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Zhou</first><last>Zhao</last><affiliation>Zhejiang University and Zhejiang University</affiliation></author>
-      <author><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <pages>10929-10942</pages>
       <abstract>Large language models (LLMs) have successfully served as a general-purpose interface across multiple tasks and languages, while the adaptation of voice LLMs is mostly designed for specific purposes (either single-task or monolingual), where the advantages of LLMs especially for low-resource language processing and zero-shot task generalization are less exploited in the audio community. To bridge the gap, we introduce Make-A-Voice as a multi-modal voice LLM and conduct a comprehensive study on its capability to deal with multiple tasks/languages. When trained on ~200K hours of 6-language data for 4 voice generation applications, Make-A-Voice emerges notable advantages: 1) as scalable learners to improve performance with end-to-end local and global multiscale transformers; and 2) as multitask learners by adjusting prompts to share common knowledge across modalities (speech/singing) and present in-context learning abilities by generalizing to unseen tasks not explicitly train on; 3) as multilingual learners to alleviate data scarcity of low-resource languages by including rich-resource language training data. Experimental results demonstrate that Make-A-Voice exhibits superior audio quality and style similarity compared with competitive baseline models in monolingual/cross-lingual voice generation. Audio samples are available at https://M-Voice.github.io</abstract>
       <url hash="5cf56a7c">2024.acl-long.589</url>
@@ -8258,7 +8258,7 @@
       <author><first>Hongming</first><last>Zhang</last></author>
       <author><first>Wei</first><last>Shao</last></author>
       <author><first>Zhicheng</first><last>Yang</last><affiliation>Hong Kong University of Science and Technology (Guangzhou)</affiliation></author>
-      <author><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Changshui</first><last>Zhang</last><affiliation>Tsinghua University and Department of Computer Science and Technology</affiliation></author>
       <author><first>Xiaodan</first><last>Liang</last></author>
       <author><first>Linqi</first><last>Song</last><affiliation>City University of Hong Kong</affiliation></author>
diff --git a/data/xml/2024.ccl.xml b/data/xml/2024.ccl.xml
index 5756210e7e..6ccd39de29 100644
--- a/data/xml/2024.ccl.xml
+++ b/data/xml/2024.ccl.xml
@@ -140,7 +140,7 @@
     <paper id="11">
       <title>文本样式和主题框架引导下的大模型辅助儿童新闻生成(Text Styles and Thematic Framework Guided Large Modeling to Aid Children’s News Generation)</title>
       <author><first>Xiaomeng</first><last>Du</last><variant script="hani"><first>晓蒙</first><last>杜</last></variant></author>
-      <author><first>Dong</first><last>Yu</last><variant script="hani"><first>东</first><last>于</last></variant></author>
+      <author id="dong-yu-blcu"><first>Dong</first><last>Yu</last><variant script="hani"><first>东</first><last>于</last></variant></author>
       <author><first>Pengyuan</first><last>Liu</last><variant script="hani"><first>鹏远</first><last>刘</last></variant></author>
       <pages>150–170</pages>
       <abstract>“主流新闻内容多针对成年人设计,不易于儿童理解,难以满足其阅读需求。对此,我们提出了一种基于主题的儿童新闻篇章结构框架(TNC-LLM)。该框架融合了文本样式定义(TSD)和主题类别定义(TCD)两大核心模块,TSD模块采用多种机器学习算法,从不同粒度分析文本样式风格和段落布局等特点,TCD模块针对不同主题进行了内容分析,以揭示儿童新闻的写作特点和内容的倾向性,确保内容的教育性和适宜性。本文实验主要评估了ChatGPT3.5等四个模型在将成年人新闻转换为面向儿童的新闻的性能。实验结果表明,TNC-LLM在儿童新闻内容生成任务中对内容的准确性、文本的趣味性以及教育性等关键维度有显著提升。此外,该框架具有普适性,能够应用于不同类型的大型语言模型。”</abstract>
@@ -657,7 +657,7 @@
       <title>中西谚语多元价值观资源库建设及对比研究(The construction and comparative study of the resource library of <fixed-case>C</fixed-case>hinese and Western proverbs and multiple values)</title>
       <author><first>Xia</first><last>Du</last><variant script="hani"><first>霞</first><last>杜</last></variant></author>
       <author><first>Pengyuan</first><last>Liu</last><variant script="hani"><first>鹏远</first><last>刘</last></variant></author>
-      <author><first>Dong</first><last>Yu</last><variant script="hani"><first>东</first><last>于</last></variant></author>
+      <author id="dong-yu-blcu"><first>Dong</first><last>Yu</last><variant script="hani"><first>东</first><last>于</last></variant></author>
       <pages>688–699</pages>
       <abstract>“中西方谚语是中西方文化的结晶,分别蕴含着中西方文化中最基本的价值观。但目前缺乏中西方谚语价值观资源,难以对谚语所体现的中西方价值观进行全面的研究,特别是定量对比研究。因此本文设计了多元价值观体系,包含动机及需求、共同及特色价值观、价值判断和使用场景,根据这个体系构建了中西方谚语多元价值观资源库并进行了考察与对比分析。本文发现中西谚语在价值判断、使用场景及部分价值观上具有相似性,在具体内涵表达上各具独特性。”</abstract>
       <url hash="ad15b357">2024.ccl-1.54</url>
@@ -741,7 +741,7 @@
       <author><first>Xu</first><last>Zhang</last><variant script="hani"><first>旭</first><last>张</last></variant></author>
       <author><first>Mengqing</first><last>Guo</last><variant script="hani"><first>梦清</first><last>郭</last></variant></author>
       <author><first>Shucheng</first><last>Zhu</last><variant script="hani"><first>述承</first><last>朱</last></variant></author>
-      <author><first>Dong</first><last>Yu</last><variant script="hani"><first>东</first><last>于</last></variant></author>
+      <author id="dong-yu-blcu"><first>Dong</first><last>Yu</last><variant script="hani"><first>东</first><last>于</last></variant></author>
       <author><first>Ying</first><last>Liu</last><variant script="hani"><first>颖</first><last>刘</last></variant></author>
       <author><first>Pengyuan</first><last>Liu</last><variant script="hani"><first>鹏远</first><last>刘</last></variant></author>
       <pages>774–789</pages>
@@ -861,7 +861,7 @@
     <paper id="70">
       <title>基于领域信息分解式学习的大语言模型修辞认知增强方法(Method for Enhancing Rhetorical Cognition of Large Language Models Based on Decomposed Learning of Field Information)</title>
       <author><first>Wen</first><last>Wang</last><variant script="hani"><first>雯</first><last>王</last></variant></author>
-      <author><first>Dong</first><last>Yu</last><variant script="hani"><first>东</first><last>于</last></variant></author>
+      <author id="dong-yu-blcu"><first>Dong</first><last>Yu</last><variant script="hani"><first>东</first><last>于</last></variant></author>
       <author><first>Pengyuan</first><last>Liu</last><variant script="hani"><first>鹏远</first><last>刘</last></variant></author>
       <pages>894–909</pages>
       <abstract>“中文修辞手法多样且概念差异性大,大语言模型对部分修辞手法的认知存在缺陷。针对该问题,本文研究如何增强大语言模型的修辞认知能力,并探究其与修辞识别性能之间的关系。为此,本文提出了QAKAG框架,此框架首先引入信息分解式学习思想,通过问答形式检测大语言模型的修辞认知缺陷,然后以四种不同的知识组合方式探究最优信息补充机制,实现了大语言模型修辞认知能力的增强。本文构建了多类别中文修辞句数据集MCRSD和修辞知识库MCRKB,并在ChatGPT4等六个大语言模型上开展实验研究,验证了QAKAG框架对增强大语言模型修辞认知能力的有效性以及其各阶段的必要性。结果表明,在QAKAG框架的增强下,六个大语言模型在多类别修辞识别任务上的性能相较直接回答识别问题的平均F1值提高22.1%,优于Zero-shot-CoT、RAG-BaiKe、Few-Shot5提示策略。”</abstract>
@@ -925,7 +925,7 @@
       <author><first>Shuo</first><last>Wang</last></author>
       <author><first>Yukun</first><last>Yan</last></author>
       <author><first>Pengyuan</first><last>Liu</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-blcu"><first>Dong</first><last>Yu</last></author>
       <pages>973–985</pages>
       <abstract>“Free-form table question answering is a challenging task since tables contain structured contents compared to plain texts, which requires high-level reasoning abilities to effectively identify cells that are relevant to the question and produce a correct and faithful answer based on their relations. Large language models (LLMs) have exhibited remarkable reasoning capabilities in numerous NLP applications. However, in some specific tasks, specially-trained small models can still outperform LLMs. Furthermore, small models require extremely less computation costs compared to LLMs. To leverage the strengths of both types of models, we propose a Relevant-Cell-based Knowledge Distillation with inference-time Teacher Guidance (RCKD-TG) method. This approach aims to combine small free-form table question answering models’ abilities to learn from human annotations and large language models’ abilities to effectively reason from table contents, via applying Relevant-Cell-based rationales distilled from LLMs to small models’ training and inference stages. Our experiments demonstrate the superiority of our method over vanilla small models in correctness, faithfulness, adequacy and fluency, also over general LLMs in adhering to the style of human annotations. We achieve state-of-the-art performance on FeTaQA, a representative free-form table question answering benchmark. Our result of a 41.3 BLEU score demonstrates the feasibility of effectively using small models’ task-specific abilities and LLMs’ reasoning capabilities at the same time. Additionally, our method exhibits high computation efficiency and data efficiency. Compared to strong baselines, we achieve better performance with significantly less training data.”</abstract>
       <url hash="c30f6881">2024.ccl-1.75</url>
@@ -959,7 +959,7 @@
       <author><first>Huidong</first><last>Du</last></author>
       <author><first>Hao</first><last>Sun</last></author>
       <author><first>Pengyuan</first><last>Liu</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-blcu"><first>Dong</first><last>Yu</last></author>
       <pages>1011–1022</pages>
       <abstract>“Large language models (LLMs) struggle with event detection (ED) due to the structured and variable number of events in the output. Existing supervised approaches rely on a large amount of manually annotated corpora, facing challenges in practice when event types are diverse and the annotated data is scarce. We propose Generate-then-Revise (GtR), a framework that leverages LLMs in the opposite direction to address these challenges in ED. GtR utilizes an LLM to generate high-quality training data in three stages, including a novel data revision step to minimize noise in the synthetic data. The generated data is then used to train a smaller model for evaluation. Our approach demonstrates significant improvements on the low-resource ED. We further analyze the generated data, highlighting the potential of synthetic data generation for enhancing ED performance.”</abstract>
       <url hash="e3599557">2024.ccl-1.78</url>
@@ -1831,7 +1831,7 @@
       <title>人类思维指导下大小模型协同决策的中文修辞识别与理解方法</title>
       <author><first>Wen</first><last>Wang</last><variant script="hani"><first>雯</first><last>王</last></variant></author>
       <author><first>Siyi</first><last>Tang</last><variant script="hani"><first>思怡</first><last>汤</last></variant></author>
-      <author><first>Dong</first><last>Yu</last><variant script="hani"><first>东</first><last>于</last></variant></author>
+      <author id="dong-yu-blcu"><first>Dong</first><last>Yu</last><variant script="hani"><first>东</first><last>于</last></variant></author>
       <author><first>Pengyuan</first><last>Liu</last><variant script="hani"><first>鹏远</first><last>刘</last></variant></author>
       <pages>240–252</pages>
       <abstract>“CCL24-Eval任务6提出了一个多层次、细粒度中小学作文修辞识别与理解任务。针对任务特点,本文提出了人类思维指导下大小模型协同决策的中文修辞识别与理解方法。该方法根据人类在面对修辞识别和理解任务时的处理思路,将任务顺序重新定义,并分别选取大小语言模型,使每个步骤的实现效果均达到局部最优,以局部最优达到整体任务的最优效果。结果表明,本文提出的方法能够有效对修辞进行识别与理解,在三个赛道上相较于Baseline方法分别提升了13.54、4.03、57.11。”</abstract>
@@ -1996,10 +1996,10 @@
       <bibkey>guohang-etal-2024-evaluation</bibkey>
     </paper>
     <paper id="40">
-      <title>Bridging the Gap between Authentic and Answer-Guided Images for <fixed-case>C</fixed-case>hinese Vision-Language Understanding Enhancement</title>
+      <title>System Report for <fixed-case>CCL</fixed-case>24-Eval Task 9: Bridging the Gap between Authentic and Answer-Guided Images for <fixed-case>C</fixed-case>hinese Vision-Language Understanding Enhancement</title>
       <author><first>Feiyu</first><last>Wang</last></author>
       <author><first>Wenyu</first><last>Guo</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-blcu"><first>Dong</first><last>Yu</last></author>
       <author><first>Chen</first><last>Kang</last></author>
       <author><first>Pengyuan</first><last>Liu</last></author>
       <pages>353–362</pages>
diff --git a/data/xml/2024.emnlp.xml b/data/xml/2024.emnlp.xml
index a706c22886..33dec1404b 100644
--- a/data/xml/2024.emnlp.xml
+++ b/data/xml/2024.emnlp.xml
@@ -3440,7 +3440,7 @@
       <author><first>Xiaoyang</first><last>Wang</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Wenlin</first><last>Yao</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Hassan</first><last>Foroosh</last><affiliation>University of Central Florida</affiliation></author>
-      <author><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <author id="fei-liu"><first>Fei</first><last>Liu</last><affiliation>Emory University</affiliation></author>
       <pages>4293-4308</pages>
       <abstract>Reasoning is most powerful when an LLM accurately aggregates relevant information. We examine the critical role of information aggregation in reasoning by requiring the LLM to analyze sports narratives. To succeed at this task, an LLM must infer points from actions, identify related entities, attribute points accurately to players and teams, and compile key statistics to draw conclusions. We conduct comprehensive experiments with real NBA basketball data and present SportsGen, a new method to synthesize game narratives. By synthesizing data, we can rigorously evaluate LLMs’ reasoning capabilities under complex scenarios with varying narrative lengths and density of information. Our findings show that most models, including GPT-4o, often fail to accurately aggregate basketball scores due to frequent scoring patterns. Open-source models like Llama-3 further suffer from significant score hallucinations. Finally, the effectiveness of reasoning is influenced by narrative complexity, information density, and domain-specific terms, highlighting the challenges in analytical reasoning tasks.</abstract>
@@ -11353,7 +11353,7 @@
       <author><first>Kaixin</first><last>Ma</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Jian</first><last>Li</last><affiliation>Tencent</affiliation></author>
       <author orcid="0000-0001-7474-8271"><first>Hongwei</first><last>Wang</last><affiliation>Tencent AI Lab</affiliation></author>
-      <author><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <pages>14672-14685</pages>
       <abstract>Retrieval-augmented language model (RALM) represents a significant advancement in mitigating factual hallucination by leveraging external knowledge sources. However, the reliability of the retrieved information is not always guaranteed, and the retrieval of irrelevant data can mislead the response generation. Moreover, standard RALMs frequently neglect their intrinsic knowledge due to the interference from retrieved information. In instances where the retrieved information is irrelevant, RALMs should ideally utilize their intrinsic knowledge or, in the absence of both intrinsic and retrieved knowledge, opt to respond with “unknown” to avoid hallucination. In this paper, we introduce Chain-of-Note (CoN), a novel approach to improve robustness of RALMs in facing noisy, irrelevant documents and in handling unknown scenarios. The core idea of CoN is to generate sequential reading notes for each retrieved document, enabling a thorough evaluation of their relevance to the given question and integrating this information to formulate the final answer. Our experimental results show that GPT-4, when equipped with CoN, outperforms the Chain-of-Thought approach. Besides, we utilized GPT-4 to create 10K CoN data, subsequently trained on smaller models like OPT and LLaMa-2. Our experiments across four open-domain QA benchmarks show that fine-tuned RALMs equipped with CoN significantly outperform standard fine-tuned RALMs.</abstract>
       <url hash="c1a39fa9">2024.emnlp-main.813</url>
@@ -11409,7 +11409,7 @@
       <author><first>Wenhao</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Dian</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <author orcid="0009-0005-9759-4996"><first>Mengzhao</first><last>Jia</last><affiliation>University of Notre Dame</affiliation></author>
-      <author><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <author orcid="0000-0002-3009-519X"><first>Meng</first><last>Jiang</last><affiliation>University of Notre Dame</affiliation></author>
       <pages>14720-14738</pages>
       <abstract>Supervised fine-tuning enhances the problem-solving abilities of language models across various mathematical reasoning tasks. To maximize such benefits, existing research focuses on *broadening* the training set with various data augmentation techniques, which is effective for standard single-round question-answering settings. Our work introduces a novel technique aimed at cultivating a *deeper* understanding of the training problems at hand, enhancing performance not only in standard settings but also in more complex scenarios that require reflective thinking. Specifically, we propose **reflective augmentation**, a method that embeds problem reflection into each training instance. It trains the model to consider alternative perspectives and engage with abstractions and analogies, thereby fostering a thorough comprehension through reflective reasoning. Extensive experiments validate the achievement of our aim, underscoring the unique advantages of our method and its complementary nature relative to existing augmentation techniques.</abstract>
@@ -11785,7 +11785,7 @@
       <author><first>Kaixin</first><last>Ma</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Xinran</first><last>Zhao</last><affiliation>CMU, Carnegie Mellon University</affiliation></author>
       <author><first>Hongming</first><last>Zhang</last></author>
-      <author><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <pages>15159-15177</pages>
       <abstract>Dense retrieval has become a prominent method to obtain relevant context or world knowledge in open-domain NLP tasks. When we use a learned dense retriever on a retrieval corpus at inference time, an often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g. document, passage, or sentence. We discover that the retrieval unit choice significantly impacts the performance of both retrieval and downstream tasks. Distinct from the typical approach of using passages or sentences, we introduce a novel retrieval unit, proposition, for dense retrieval. Propositions are defined as atomic expressions within text, each encapsulating a distinct factoid and presented in a concise, self-contained natural language format. We conduct an empirical comparison of different retrieval granularity. Our experiments reveal that indexing a corpus by fine-grained units such as propositions significantly outperforms passage-level units in retrieval tasks. Moreover, constructing prompts with fine-grained retrieved units for retrieval-augmented language models improves the performance of downstream QA tasks given a specific computation budget.</abstract>
       <url hash="8b74f11e">2024.emnlp-main.845</url>
diff --git a/data/xml/2024.findings.xml b/data/xml/2024.findings.xml
index 682fdb75c3..3b8a7357bc 100644
--- a/data/xml/2024.findings.xml
+++ b/data/xml/2024.findings.xml
@@ -215,7 +215,7 @@
       <author><first>Lifeng</first><last>Jin</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Linfeng</first><last>Song</last></author>
       <author><first>Haitao</first><last>Mi</last><affiliation>Tencent AI Lab</affiliation></author>
-      <author><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <pages>220-230</pages>
       <abstract>One critical issue for chat systems is to stay consistent about preferences, opinions, beliefs and facts of itself, which has been shown a difficult problem. In this work, we study methods to assess and bolster utterance consistency of chat systems. A dataset is first developed for studying the inconsistencies, where inconsistent dialogue responses, explanations of the inconsistencies, and recovery utterances are authored by annotators. This covers the life span of inconsistencies, namely introduction, understanding, and resolution. Building on this, we introduce a set of tasks centered on dialogue consistency, specifically focused on its detection and resolution. Our experimental findings indicate that our dataset significantly helps the progress in identifying and resolving conversational inconsistencies, and current popular large language models like ChatGPT which are good at resolving inconsistencies however still struggle with detection.</abstract>
       <url hash="def387c7">2024.findings-eacl.16</url>
@@ -1185,7 +1185,7 @@
       <author><first>Haoyu</first><last>Wang</last><affiliation>University of Pennsylvania</affiliation></author>
       <author><first>Hongming</first><last>Zhang</last></author>
       <author><first>Kaiqiang</first><last>Song</last><affiliation>Tencent AI Lab</affiliation></author>
-      <author><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Dan</first><last>Roth</last><affiliation>Amazon and University of Pennsylvania</affiliation></author>
       <pages>1395-1407</pages>
       <abstract>In this work, we focus on a fundamental yet underexplored problem, event semantic classification in context, to help machines gain a deeper understanding of events. We classify events from six perspectives: modality, affirmation, specificity, telicity, durativity, and kinesis. These properties provide essential cues regarding the occurrence and grounding of events, changes of status that events can bring about, and the connection between events and time. To this end, this paper introduces a novel dataset collected for the semantic classification tasks and several effective models. By incorporating these event properties into downstream tasks, we demonstrate that understanding the fine-grained event semantics benefits downstream event understanding and reasoning via experiments on event extraction, temporal relation extraction, and subevent relation extraction.</abstract>
@@ -12907,7 +12907,7 @@
       <author><first>Ye</first><last>Tian</last></author>
       <author><first>Haitao</first><last>Mi</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Jinsong</first><last>Su</last><affiliation>Xiamen University</affiliation></author>
-      <author><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <pages>8424-8436</pages>
       <abstract>This work studies mitigating fact-conflicting hallucinations for large language model (LLM) at inference time. Particularly, we propose a self-endorsement framework that leverages the fine-grained fact-level comparisons across multiple sampled responses. Compared with prior ensemble methods (e.g., self-consistency) that perform response-level selection, our approach can better alleviate hallucinations for knowledge-intensive tasks. Our approach can broadly benefit smaller and open-source LLMs as it mainly conducts simple content-based comparisons. Experiments on Biographies show that our method can effectively improve the factuality of generations with simple and intuitive prompts across different scales of LLMs. Besides, comprehensive analyses on TriviaQA and GSM8K demonstrate the potential of self-endorsement for broader application.</abstract>
       <url hash="9ef5ee41">2024.findings-acl.499</url>
@@ -13108,7 +13108,7 @@
       <author><first>Hongming</first><last>Zhang</last></author>
       <author><first>Xiaoman</first><last>Pan</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Wenlin</first><last>Yao</last><affiliation>Tencent AI Lab</affiliation></author>
-      <author><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Tongshuang</first><last>Wu</last><affiliation>School of Computer Science, Carnegie Mellon University</affiliation></author>
       <author><first>Jianshu</first><last>Chen</last><affiliation>Amazon</affiliation></author>
       <pages>8702-8718</pages>
@@ -15653,7 +15653,7 @@
       <author><first>Zhenghao</first><last>Liu</last><affiliation>Northeastern University</affiliation></author>
       <author orcid="0000-0002-2426-6220"><first>Zhixing</first><last>Tan</last><affiliation>Zhongguancun Laboratory</affiliation></author>
       <author><first>Pengyuan</first><last>Liu</last><affiliation>Beijing Language and Culture University</affiliation></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-blcu"><first>Dong</first><last>Yu</last></author>
       <author orcid="0000-0002-7709-2543"><first>Zhiyuan</first><last>Liu</last><affiliation>Tsinghua University</affiliation></author>
       <author><first>Xiaodong</first><last>Shi</last><affiliation>Xiamen University, Tsinghua University</affiliation></author>
       <author><first>Maosong</first><last>Sun</last></author>
@@ -16152,7 +16152,7 @@
       <author><first>Chenxing</first><last>Li</last></author>
       <author><first>Dan</first><last>Su</last></author>
       <author orcid="0000-0001-9848-6384"><first>Chenhui</first><last>Chu</last><affiliation>Kyoto University</affiliation></author>
-      <author><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <pages>12401-12430</pages>
       <abstract>In the past year, MultiModal Large Language Models (MM-LLMs) have undergone substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs via cost-effective training strategies. The resulting models not only preserve the inherent reasoning and decision-making capabilities of LLMs but also empower a diverse range of MM tasks. In this paper, we provide a comprehensive survey aimed at facilitating further research of MM-LLMs. Initially, we outline general design formulations for model architecture and training pipeline. Subsequently, we introduce a taxonomy encompassing 126 MM-LLMs, each characterized by its specific formulations. Furthermore, we review the performance of selected MM-LLMs on mainstream benchmarks and summarize key training recipes to enhance the potency of MM-LLMs. Finally, we explore promising directions for MM-LLMs while concurrently maintaining a [real-time tracking website](https://mm-llms.github.io/) for the latest developments in the field. We hope that this survey contributes to the ongoing advancement of the MM-LLMs domain.</abstract>
       <url hash="02621438">2024.findings-acl.738</url>
@@ -16632,7 +16632,7 @@
       <author><first>Xuansheng</first><last>Wu</last></author>
       <author id="fei-liu"><first>Fei</first><last>Liu</last><affiliation>Emory University</affiliation></author>
       <author><first>Pengfei</first><last>Liu</last></author>
-      <author><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <pages>13025-13048</pages>
       <abstract>This paper introduces the Decomposed Requirements Following Ratio (DRFR), a new metric for evaluating Large Language Models’ (LLMs) ability to follow instructions. Addressing a gap in current methodologies, DRFR breaks down complex instructions into simpler criteria, facilitating a detailed analysis of LLMs’ compliance with various aspects of tasks. Alongside this metric, we present InFoBench, a benchmark comprising 500 diverse instructions and 2,250 decomposed questions across multiple constraint categories. Our experiments compare DRFR with traditional scoring methods and explore annotation sources, including human experts, crowd-sourced workers, and GPT-4. The findings demonstrate DRFR’s higher reliability and the effectiveness of using GPT-4 as a cost-efficient annotator. The evaluation of several advanced LLMs using this framework reveals their strengths and areas needing improvement, particularly in complex instruction-following. This study contributes a novel metric and benchmark, offering insights for future LLM development and evaluation.</abstract>
       <url hash="37b5358e">2024.findings-acl.772</url>
@@ -20921,7 +20921,7 @@
       <author orcid="0000-0003-2852-3746"><first>Ruixin</first><last>Hong</last><affiliation>Tsinghua University, Tsinghua University</affiliation></author>
       <author><first>Hongming</first><last>Zhang</last></author>
       <author><first>Xiaoman</first><last>Pan</last><affiliation>Amazon</affiliation></author>
-      <author><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Changshui</first><last>Zhang</last><affiliation>Tsinghua University and Department of Computer Science and Technology</affiliation></author>
       <pages>1993-2027</pages>
       <abstract>Abstract reasoning, the ability to reason from the abstract essence of a problem, serves as a key to generalization in human reasoning. However, eliciting language models to perform reasoning with abstraction remains unexplored. This paper seeks to bridge this gap by introducing a novel structured reasoning format called Abstraction-of-Thought (AoT). The uniqueness of AoT lies in its explicit requirement for varying levels of abstraction within the reasoning process. This approach could elicit language models to first contemplate on the abstract level before incorporating concrete details, which is overlooked by the prevailing step-by-step Chain-of-Thought (CoT) method. To align models with the AoT format, we present AoT Collection, a generic finetuning dataset consisting of 348k high-quality samples with AoT reasoning processes, collected via an automated and scalable pipeline. We finetune a wide range of language models with AoT Collection and conduct extensive evaluations on 23 unseen tasks from the challenging benchmark Big-Bench Hard. Experimental results indicate that models aligned to AoT reasoning format substantially outperform those aligned to CoT in many reasoning tasks.</abstract>
@@ -23185,7 +23185,7 @@ and high variation in performance on the subset, suggesting our plausibility cri
       <author><first>Shucheng</first><last>Zhu</last><affiliation>Tsinghua University, Tsinghua University</affiliation></author>
       <author><first>Pengyuan</first><last>Liu</last><affiliation>Beijing Language and Culture University</affiliation></author>
       <author><first>Ying</first><last>Liu</last><affiliation>Tsinghua University, Tsinghua University</affiliation></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-blcu"><first>Dong</first><last>Yu</last></author>
       <pages>4740-4760</pages>
       <abstract>Proper moral beliefs are fundamental for language models, yet assessing these beliefs poses a significant challenge. This study introduces a novel three-module framework to evaluate the moral beliefs of four prominent large language models. Initially, we constructed a dataset containing 472 moral choice scenarios in Chinese, derived from moral words. The decision-making process of the models in these scenarios reveals their moral principle preferences. By ranking these moral choices, we discern the varying moral beliefs held by different language models. Additionally, through moral debates, we investigate the firmness of these models to their moral choices. Our findings indicate that English language models, namely ChatGPT and Gemini, closely mirror moral decisions of the sample of Chinese university students, demonstrating strong adherence to their choices and a preference for individualistic moral beliefs. In contrast, Chinese models such as Ernie and ChatGLM lean towards collectivist moral beliefs, exhibiting ambiguity in their moral choices and debates. This study also uncovers gender bias embedded within the moral beliefs of all examined language models. Our methodology offers an innovative means to assess moral beliefs in both artificial and human intelligence, facilitating a comparison of moral values across different cultures.</abstract>
       <url hash="f2105e87">2024.findings-emnlp.272</url>
@@ -24203,7 +24203,7 @@ and high variation in performance on the subset, suggesting our plausibility cri
       <author><first>Lifeng</first><last>Jin</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Haitao</first><last>Mi</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Jinsong</first><last>Su</last><affiliation>Xiamen University</affiliation></author>
-      <author><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <pages>6023-6029</pages>
       <url hash="785fe832">2024.findings-emnlp.349</url>
       <bibkey>wang-etal-2024-self-consistency</bibkey>
@@ -30508,7 +30508,7 @@ hai-coaching/</abstract>
       <author><first>Dian</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Kaiqiang</first><last>Song</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Xiaoyang</first><last>Wang</last><affiliation>Tencent AI Lab</affiliation></author>
-      <author><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Jianshu</first><last>Chen</last><affiliation>Amazon</affiliation></author>
       <pages>13838-13890</pages>
       <abstract>We investigate how to elicit compositional generalization capabilities in large language models (LLMs). Compositional generalization empowers LLMs to solve complex problems by combining foundational skills, a critical reasoning ability akin to human intelligence. However, even the most advanced LLMs currently struggle with this form of reasoning. We examine this problem within the framework of in-context learning and find that demonstrating both foundational skills and compositional examples grounded in these skills within the same prompt context is crucial. We refer to this prompt structure as skills-in-context (SKiC). With as few as two exemplars, this in-context learning structure enables LLMs to tackle more challenging problems requiring innovative skill combinations, achieving near-perfect systematic generalization across a broad range of tasks. Intriguingly, SKiC also unlocks the latent potential of LLMs, allowing them to more actively utilize pre-existing internal skills acquired during earlier pretraining stages to solve complex reasoning problems. The SKiC structure is robust across different skill constructions and exemplar choices and demonstrates strong transferability to new tasks. Finally, inspired by our in-context learning study, we show that fine-tuning LLMs with SKiC-style data can elicit zero-shot weak-to-strong generalization, enabling the models to solve much harder problems directly with standard prompting.</abstract>
diff --git a/data/xml/2024.lrec.xml b/data/xml/2024.lrec.xml
index cdeceb3b00..e213e58d45 100644
--- a/data/xml/2024.lrec.xml
+++ b/data/xml/2024.lrec.xml
@@ -714,7 +714,7 @@
       <author><first>Lifeng</first><last>Jin</last></author>
       <author><first>Haitao</first><last>Mi</last></author>
       <author><first>Jessica</first><last>Ouyang</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <pages>666–676</pages>
       <abstract>Knowledge-based, open-domain dialogue generation aims to build chit-chat systems that talk to humans using mined support knowledge. Many types and sources of knowledge have previously been shown to be useful as support knowledge. Even in the era of large language models, response generation grounded in knowledge retrieved from additional up-to-date sources remains a practically important approach. While prior work using single-source knowledge has shown a clear positive correlation between the performances of knowledge selection and response generation, there are no existing multi-source datasets for evaluating support knowledge retrieval. Further, prior work has assumed that the knowledge sources available at test time are the same as during training. This unrealistic assumption unnecessarily handicaps models, as new knowledge sources can become available after a model is trained. In this paper, we present a high-quality benchmark named multi-source Wizard of Wikipedia (Ms.WoW) for evaluating multi-source dialogue knowledge selection and response generation. Unlike existing datasets, it contains clean support knowledge, grounded at the utterance level and partitioned into multiple knowledge sources. We further propose a new challenge, dialogue knowledge plug-and-play, which aims to test an already trained dialogue model on using new support knowledge from previously unseen sources in a zero-shot fashion.</abstract>
       <url hash="0d7c0833">2024.lrec-main.58</url>
@@ -11624,7 +11624,7 @@
       <author><first>Wenlin</first><last>Yao</last></author>
       <author><first>Qingkai</first><last>Zeng</last></author>
       <author><first>Xiangliang</first><last>Zhang</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <pages>11307–11318</pages>
       <abstract>Reasoning in mathematical domains remains a significant challenge for relatively small language models (LMs). Many current methods focus on specializing LMs in mathematical reasoning and rely heavily on distilling knowledge from powerful yet inefficient large LMs (LLMs). In this work, we explore a new direction that avoids over-reliance on LLM teachers, introducing a multi-view fine-tuning method that efficiently exploits existing mathematical problem datasets with diverse annotation styles. Our approach uniquely considers the various annotation formats as different “views” that may help each other and leverage them in training the model. By postpending distinct instructions to input questions, models can learn to generate solutions in diverse formats in a flexible manner. Experimental results show that our strategy enables relatively small LMs to outperform prior approaches that heavily rely on knowledge distillation, as well as carefully established baselines. Additionally, the proposed method grants the models promising generalization ability across various views and datasets, and the capability to learn from inaccurate or incomplete noisy data. We hope our multi-view training paradigm could inspire future studies in other machine reasoning domains.</abstract>
       <url hash="1b0d3154">2024.lrec-main.988</url>
diff --git a/data/xml/2024.naacl.xml b/data/xml/2024.naacl.xml
index c26bd0920e..395233c59c 100644
--- a/data/xml/2024.naacl.xml
+++ b/data/xml/2024.naacl.xml
@@ -689,7 +689,7 @@
       <author><first>Ruixin</first><last>Hong</last><affiliation>Tsinghua University, Tsinghua University</affiliation></author>
       <author><first>Hongming</first><last>Zhang</last></author>
       <author><first>Xinyu</first><last>Pang</last><affiliation>Tsinghua University, Tsinghua University</affiliation></author>
-      <author><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Changshui</first><last>Zhang</last><affiliation>Tsinghua University and Department of Computer Science and Technology</affiliation></author>
       <pages>900-925</pages>
       <abstract>Logical reasoning has been an ongoing pursuit in the field of AI. Despite significant advancements made by large language models (LLMs), they still struggle with complex logical reasoning problems. To enhance reasoning performance, one promising direction is scalable oversight, which requires LLMs to identify their own errors and then improve by themselves. Various self-verification methods have been proposed in pursuit of this goal. Nevertheless, whether existing models understand their own errors well is still under investigation. In this paper, we take a closer look at the self-verification abilities of LLMs in the context of logical reasoning, focusing on their ability to identify logical fallacies accurately. We introduce a dataset, FALLACIES, containing 232 types of reasoning fallacies categorized in a hierarchical taxonomy. By conducting exhaustive experiments on FALLACIES, we obtain comprehensive and detailed analyses of a series of models on their verification abilities. Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods. Drawing from these observations, we offer suggestions for future research and practical applications of self-verification methods.</abstract>
@@ -958,7 +958,7 @@
       <author><first>Kaiqiang</first><last>Song</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Sangwoo</first><last>Cho</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Yaser</first><last>Yacoob</last><affiliation>University of Maryland, College Park</affiliation></author>
-      <author><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <pages>1287-1310</pages>
       <abstract>With the rapid development of large language models (LLMs) and their integration into large multimodal models (LMMs), there has been impressive progress in zero-shot completion of user-oriented vision-language tasks. However, a gap remains in the domain of chart image understanding due to the distinct abstract components in charts. To address this, we introduce a large-scale MultiModal Chart Instruction (MMC-Instruction) dataset comprising 600k instances supporting diverse tasks and chart types. Leveraging this data, we develop MultiModal Chart Assistant (MMCA), an LMM that achieves state-of-the-art performance on existing chart QA benchmarks. Recognizing the need for a comprehensive evaluation of LMM chart understanding, we also propose a MultiModal Chart Benchmark (MMC-Benchmark), a comprehensive human-annotated benchmark with nine distinct tasks evaluating reasoning capabilities over charts. Extensive experiments on MMC-Benchmark reveal the limitations of existing LMMs on correctly interpreting charts, even for the most recent GPT-4V model. Our work provides an instruction-tuning methodology and benchmark to advance multimodal understanding of charts. Code and data are available at https://github.com/FuxiaoLiu/MMC.</abstract>
       <url hash="23befb35">2024.naacl-long.70</url>
@@ -1214,7 +1214,7 @@
       <author><first>Baolin</first><last>Peng</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Hongwei</first><last>Wang</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Dan</first><last>Roth</last><affiliation>Amazon and University of Pennsylvania</affiliation></author>
-      <author><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <pages>1596-1609</pages>
       <abstract>We introduce sub-sentence encoder, a contrastively-learned contextual embedding model for fine-grained semantic representation of text. In contrast to the standard practice with sentence embeddings, where the meaning of an entire sequence of text is encoded into a fixed-length vector, the sub-sentence encoder learns to produce distinct contextual embeddings corresponding to different atomic propositions, i.e. atomic units of meaning expressed within a text sequence. The sub-sentence embeddings are contrastively learned to recognize (inferred) semantic equivalence between propositions across different text sequences. Our experiments show the effectiveness of sub-sentence encoders in applications, such as retrieving supporting facts for fine-grained text attribution or recognizing the conditional semantic similarity between texts. In practice, we demonstrate that sub-sentence encoders keep the same level of inference cost and space complexity compared to sentence encoders.</abstract>
       <url hash="278335ed">2024.naacl-long.89</url>
@@ -1791,7 +1791,7 @@
       <author><first>Xiaoman</first><last>Pan</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Xiaoyang</first><last>Wang</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Ninghao</first><last>Liu</last><affiliation>University of Georgia</affiliation></author>
-      <author><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <pages>2341-2369</pages>
       <abstract>Large Language Models (LLMs) have achieved remarkable success, where instruction tuning is the critical step in aligning LLMs with user intentions. In this work, we investigate how the instruction tuning adjusts pre-trained models with a focus on intrinsic changes. Specifically, we first develop several local and global explanation methods, including a gradient-based method for input-output attribution, and techniques for interpreting patterns and concepts in self-attention and feed-forward layers. The impact of instruction tuning is then studied by comparing the explanations derived from the pre-trained and instruction-tuned models. This approach provides an internal perspective of the model shifts on a human-comprehensible level. Our findings reveal three significant impacts of instruction tuning: 1) It empowers LLMs to recognize the instruction parts of user prompts, and promotes the response generation constantly conditioned on the instructions. 2) It encourages the self-attention heads to capture more word-word relationships about instruction verbs. 3) It encourages the feed-forward networks to rotate their pre-trained knowledge toward user-oriented tasks. These insights contribute to a more comprehensive understanding of instruction tuning and lay the groundwork for future work that aims at explaining and optimizing LLMs for various applications. Our code and data are publicly available at https://github.com/JacksonWuxs/Interpret_Instruction_Tuning_LLMs.</abstract>
       <url hash="6a97785d">2024.naacl-long.130</url>
@@ -4110,7 +4110,7 @@
       <author><first>Sangwoo</first><last>Cho</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Xiaoyang</first><last>Wang</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Ruihong</first><last>Huang</last><affiliation>Texas A&amp;M University</affiliation></author>
-      <author><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <pages>5211-5224</pages>
       <abstract>Opinion summarization is automatically generating summaries from a variety of subjective information, such as product reviews or political opinions. The challenge of opinions summarization lies in presenting divergent or even conflicting opinions. We conduct an analysis of previous summarization models, which reveals their inclination to amplify the polarity bias, emphasizing the majority opinions while ignoring the minority opinions. To address this issue and make the summarizer express both sides of opinions, we introduce the concept of polarity calibration, which aims to align the polarity of output summary with that of input text. Specifically, we develop a reinforcement training approach for polarity calibration. This approach feeds the polarity distance between output summary and input text as reward into the summarizer, and also balance polarity calibration with content preservation and language naturality. We evaluate our Polarity Calibration model (PoCa) on two types of opinions summarization tasks: summarizing product reviews and political opinions articles. Automatic and human evaluation demonstrate that our approach can mitigate the polarity mismatch between output summary and input text, as well as maintain the content semantic and language quality.</abstract>
       <url hash="fec34b6e">2024.naacl-long.291</url>
diff --git a/data/xml/2025.acl.xml b/data/xml/2025.acl.xml
index b78cf66237..03f6e82609 100644
--- a/data/xml/2025.acl.xml
+++ b/data/xml/2025.acl.xml
@@ -3489,7 +3489,7 @@
       <author orcid="0000-0002-5648-568X"><first>Kelong</first><last>Mao</last></author>
       <author orcid="0009-0001-7014-5251"><first>Shuaiyi</first><last>Li</last><affiliation>Chinese University of Hong Kong, The Chinese University of Hong Kong</affiliation></author>
       <author><first>Xinting</first><last>Huang</last><affiliation>Tencent</affiliation></author>
-      <author orcid="0000-0003-0520-6844"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author orcid="0000-0003-0520-6844" id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <author orcid="0000-0002-9781-948X"><first>Zhicheng</first><last>Dou</last><affiliation>Renmin University of China</affiliation></author>
       <pages>4861-4879</pages>
       <abstract>In this work, we provide an empirical investigation of gist-based context compression methods to improve context processing in large language models. We focus on two key questions: (1) How well can these methods replace full attention models? and (2) What potential failure patterns arise due to compression? Through extensive experiments, we show that while gist-based compression can achieve only slight performance loss on tasks like retrieval-augmented generation and long-document QA, it faces challenges in tasks like synthetic recall. Furthermore, we identify three key failure patterns: lost by the boundary, lost if surprise, and lost along the way. To mitigate these issues, we propose two effective strategies: fine-grained autoencoding, which enhances the reconstruction of original token information, and segment-wise token importance estimation, which adjusts optimization based on token dependencies. Our work provides valuable insights into the understanding of gist token-based context compression and offers practical strategies for improving compression capabilities.</abstract>
@@ -7050,7 +7050,7 @@
       <author><first>Hongming</first><last>Zhang</last></author>
       <author><first>Chenlong</first><last>Deng</last><affiliation>Renmin University of China</affiliation></author>
       <author orcid="0009-0001-7014-5251"><first>Shuaiyi</first><last>Li</last><affiliation>Chinese University of Hong Kong, The Chinese University of Hong Kong</affiliation></author>
-      <author orcid="0000-0003-0520-6844"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author orcid="0000-0003-0520-6844" id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <pages>9840-9855</pages>
       <abstract>Large language models have shown remarkable performance across a wide range of language tasks, owing to their exceptional capabilities in context modeling. The most commonly used method of context modeling is full self-attention, as seen in standard decoder-only Transformers. Although powerful, this method can be inefficient for long sequences and may overlook inherent input structures. To address these problems, an alternative approach is parallel context encoding, which splits the context into sub-pieces and encodes them parallelly. Because parallel patterns are not encountered during training, naively applying parallel encoding leads to performance degradation. However, the underlying reasons and potential mitigations are unclear. In this work, we provide a detailed analysis of this issue and identify that unusually high attention entropy can be a key factor. Furthermore, we adopt two straightforward methods to reduce attention entropy by incorporating attention sinks and selective mechanisms. Experiments on various tasks reveal that these methods effectively lower irregular attention entropy and narrow performance gaps. We hope this study can illuminate ways to enhance context modeling mechanisms.</abstract>
       <url hash="c6305bcc">2025.acl-long.485</url>
@@ -13661,7 +13661,7 @@
       <author><first>Xinting</first><last>Huang</last><affiliation>Tencent</affiliation></author>
       <author><first>Sen</first><last>Yang</last><affiliation>The Chinese University of Hong Kong</affiliation></author>
       <author orcid="0000-0002-7230-4164"><first>Nigel</first><last>Collier</last><affiliation>University of Cambridge</affiliation></author>
-      <author orcid="0000-0003-0520-6844"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author orcid="0000-0003-0520-6844" id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <author orcid="0000-0002-1390-3861"><first>Deqing</first><last>Yang</last><affiliation>Fudan University</affiliation></author>
       <pages>18947-18968</pages>
       <abstract>While Large Language Models (LLMs) demonstrate impressive capabilities, they still struggle with generating factually incorrect content (i.e., hallucinations). A promising approach to mitigate this issue is enabling models to express uncertainty when unsure. Previous research on uncertainty modeling has primarily focused on short-form QA, but real-world applications often require much longer responses. In this work, we introduce the task of Long-form Generation with Uncertainty (LoGU). We identify two key challenges: Uncertainty Suppression, where models hesitate to express uncertainty, and Uncertainty Misalignment, where models convey uncertainty inaccurately. To tackle these challenges, we propose a refinement-based data collection framework and a two-stage training pipeline. Our framework adopts a divide-and-conquer strategy, refining uncertainty based on atomic claims. The collected data are then used in training through supervised fine-tuning (SFT) and direct preference optimization (DPO) to enhance uncertainty expression. Extensive experiments on three long-form instruction following datasets show that our method significantly improves accuracy, reduces hallucinations, and maintains the comprehensiveness of responses.</abstract>
@@ -17041,7 +17041,7 @@
       <author><first>Xiangyu</first><last>Duan</last><affiliation>Soochow University, China</affiliation></author>
       <author><first>Zhaopeng</first><last>Tu</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Jinsong</first><last>Su</last><affiliation>Xiamen University</affiliation></author>
-      <author orcid="0000-0003-0520-6844"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author orcid="0000-0003-0520-6844" id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <pages>23946-23959</pages>
       <abstract>Recent advancements in tree search algorithms guided by verifiers have significantly enhanced the reasoning capabilities of large language models (LLMs), but at the cost of increased computational resources. In this work, we identify two key challenges contributing to this inefficiency: <tex-math>\textit{over-exploration}</tex-math> due to redundant states with semantically equivalent content, and <tex-math>\textit{under-exploration}</tex-math> caused by high variance in verifier scoring leading to frequent trajectory switching. To address these issues, we propose FETCH – an e<tex-math>{\bf f}</tex-math>fici<tex-math>{\bf e}</tex-math>nt <tex-math>{\bf t}</tex-math>ree sear<tex-math>{\bf ch}</tex-math> framework, which is a flexible, plug-and-play system compatible with various tree search algorithms. Our framework mitigates over-exploration by merging semantically similar states using agglomerative clustering of text embeddings obtained from a fine-tuned SimCSE model. To tackle under-exploration, we enhance verifiers by incorporating temporal difference learning with adjusted <tex-math>\lambda</tex-math>-returns during training to reduce variance, and employing a verifier ensemble to aggregate scores during inference. Experiments on GSM8K, GSM-Plus, and MATH datasets demonstrate that our methods significantly improve reasoning accuracy and computational efficiency across four different tree search algorithms, paving the way for more practical applications of LLM-based reasoning. The code is available at https://github.com/DeepLearnXMU/Fetch.</abstract>
       <url hash="39f67363">2025.acl-long.1167</url>
@@ -19423,7 +19423,7 @@
       <author><first>Hongming</first><last>Zhang</last></author>
       <author><first>Tianqing</first><last>Fang</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Zhenzhong</first><last>Lan</last><affiliation>Westlake University</affiliation></author>
-      <author orcid="0000-0003-0520-6844"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author orcid="0000-0003-0520-6844" id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <pages>27545-27564</pages>
       <abstract>The advancement of foundation models has laid the groundwork for building autonomous agents for complex tasks such as web navigation. Recent efforts have also tried to equip the agent with the ability to explore environments and continuously improve over time. However, existing works only focused on building text-only agents in synthetic environments where the reward signals are clearly defined. Such agents can hardly generalize to realistic settings that require multimodal perception ability and provide no ground-truth signal. In this paper, we introduce an innovative multimodal web agent that can autonomously conduct real-world exploration and improve itself. We first train the base model with imitation learning to gain the basic abilities. We then let the agent explore the open web and collect feedback on its trajectories. After that, it further improves its policy by learning from well-performing trajectories judged by another general-purpose model. This exploration-feedback-optimization cycle can continue for several iterations. Experimental results show that our web agent successfully improves itself after each iteration, demonstrating strong performance across multiple test sets. We will release our code and model to encourage future research in this field.</abstract>
       <url hash="a7de41cb">2025.acl-long.1336</url>
@@ -22551,7 +22551,7 @@
       <author><first>Thomas</first><last>Hartvigsen</last><affiliation>University of Virginia, Charlottesville</affiliation></author>
       <author><first>Zhisong</first><last>Zhang</last><affiliation>Tencent</affiliation></author>
       <author><first>Haitao</first><last>Mi</last><affiliation>Tencent AI Lab</affiliation></author>
-      <author orcid="0000-0003-0520-6844"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author orcid="0000-0003-0520-6844" id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <pages>32338-32348</pages>
       <abstract>Low-bit quantization improves machine learning model efficiency but surprisingly favors undertrained large language models (LLMs). Larger models or those trained on fewer tokens exhibit less quantization-induced degradation (QiD), while smaller, well-trained models face significant performance losses. To gain deeper insights into this trend, we study over 1500+ quantized LLM checkpoints of various sizes and at different training levels (undertrained or fully trained) in a controlled setting, deriving scaling laws for understanding the relationship between QiD and factors: the number of training tokens, model size and bit width.With our derived scaling laws, we propose a novel perspective that we can use QiD to measure an LLM’s training levels and determine the number of training tokens required for fully training LLMs of various sizes. Moreover, we use the scaling laws to predict the quantization performance of different-sized LLMs trained with tokens. Our projection shows that the low-bit quantization performance of future models, which are expected to be trained with over <tex-math>\textcolor{red}{100~trillion}</tex-math> tokens, may NOT be desirable. This poses a potential challenge for low-bit quantization in the future and highlights the need for awareness of a model’s training level when evaluating low-bit quantization research. To facilitate future research on this problem, we release all the 1500+ quantized checkpoints used in this work at https://huggingface.co/Xu-Ouyang.</abstract>
       <url hash="354d7369">2025.acl-long.1555</url>
diff --git a/data/xml/2025.babylm.xml b/data/xml/2025.babylm.xml
index c11c8e4861..e6d0ba96c6 100644
--- a/data/xml/2025.babylm.xml
+++ b/data/xml/2025.babylm.xml
@@ -334,7 +334,7 @@
       <author><first>Hongming</first><last>Zhang</last></author>
       <author><first>Kaixin</first><last>Ma</last><affiliation>Apple</affiliation></author>
       <author><first>Jiyeon</first><last>Kim</last><affiliation>Korea Advanced Institute of Science &amp; Technology</affiliation></author>
-      <author orcid="0000-0003-0520-6844"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author orcid="0000-0003-0520-6844" id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Minjoon</first><last>Seo</last><affiliation>Korea Advanced Institute of Science &amp; Technology and Config Intelligence</affiliation></author>
       <pages>380-398</pages>
       <abstract>Hybrid models that combine state space models (SSMs) with attention mechanisms have demonstrated strong performance by leveraging the efficiency of SSMs and the high recall ability of attention. However, the underlying reasons for these benefits remain insufficiently understood. In this work, we investigate hybrid architectures through the lens of memory utilization and overall performance, and propose a complementary method to further enhance their effectiveness. We focus in particular on the distinction between sequential and parallel integration of SSM and attention layers. Our analysis reveals that sequential hybrids perform better on shorter contexts, whereas parallel hybrids are more effective for longer contexts. Among various configurations, parallel hybrids using a cross-attention to combine SSM and attention outputs perform best. We also introduce a data-centric approach to further improve model performance: continual training on datasets with paraphrases. This method strikes the best balance across various other datasets, enhancing memory recall while preserving other capabilities. It generalizes well across different base models, including pure SSMs, and outperforms architectural modifications aimed at enhancing recall.</abstract>
diff --git a/data/xml/2025.coling.xml b/data/xml/2025.coling.xml
index 25e6e23127..8c96ebf001 100644
--- a/data/xml/2025.coling.xml
+++ b/data/xml/2025.coling.xml
@@ -3813,7 +3813,7 @@
       <title>What’s the most important value? <fixed-case>INVP</fixed-case>: <fixed-case>IN</fixed-case>vestigating the Value Priorities of <fixed-case>LLM</fixed-case>s through Decision-making in Social Scenarios</title>
       <author><first>Xuelin</first><last>Liu</last></author>
       <author><first>Pengyuan</first><last>Liu</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-blcu"><first>Dong</first><last>Yu</last></author>
       <pages>4725–4752</pages>
       <abstract>As large language models (LLMs) demonstrate impressive performance in various tasks and are increasingly integrated into the decision-making process, ensuring they align with human values has become crucial. This paper highlights that value priorities—the relative importance of different value—play a pivotal role in the decision-making process. To explore the value priorities in LLMs, this paper introduces INVP, a framework for INvestigating Value Priorities through decision-making in social scenarios. The framework encompasses social scenarios including binary decision-making, covering both individual and collective decision-making contexts, and is based on Schwartz’s value theory for constructing value priorities. Using this framework, we construct a dataset, which contains a total of 1613 scenarios and 3226 decisions across 283 topics. We evaluate seven popular LLMs and the experimental results reveal commonalities in the value priorities across different LLMs, such as an emphasis on Universalism and Benevolence, while Power and Hedonism are typically given lower priority. This study provides fresh insights into understanding and enhancing the moral and value alignment of LLMs when making complex social decisions.</abstract>
       <url hash="b3633df9">2025.coling-main.317</url>
@@ -5252,7 +5252,7 @@
       <author><first>Linfeng</first><last>Song</last></author>
       <author><first>Haitao</first><last>Mi</last></author>
       <author><first>Baolin</first><last>Peng</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <pages>6589–6600</pages>
       <abstract>Large language models (LLMs) exhibit impressive natural language capabilities but suffer from hallucination – generating content ungrounded in the realities of training data. Recent work has focused on decoding techniques to improve factuality in decoding by leveraging LLMs’ hierarchical representation of factual knowledge, manipulating the predicted distributions at inference time. Current state-of-the-art approaches refine decoding by contrasting logits from a lower layer with the final layer to exploit information related factuality within the model forward procedure. However, such methods often assume the final layer is most reliable one and the lower layer selection process depends on it. In this work, we first propose logit extrapolation of critical token probabilities beyond the last layer for more accurate contrasting. We additionally employ layer-wise entropy-guided lower layer selection, decoupling the selection process from the final layer. Experiments demonstrate strong performance - surpassing state-of-the-art on multiple different datasets by large margins. Analyses show different kinds of prompts respond to different selection strategies.</abstract>
       <url hash="b81fd88a">2025.coling-main.439</url>
diff --git a/data/xml/2025.emnlp.xml b/data/xml/2025.emnlp.xml
index 34854d598a..73e9554a9f 100644
--- a/data/xml/2025.emnlp.xml
+++ b/data/xml/2025.emnlp.xml
@@ -1489,7 +1489,7 @@
       <author><first>Guoheng</first><last>Sun</last><affiliation>University of Maryland, College Park</affiliation></author>
       <author orcid="0009-0005-7275-7955"><first>Bowei</first><last>Tian</last><affiliation>University of Maryland, College Park</affiliation></author>
       <author orcid="0000-0002-0746-1059"><first>Xiaoyang</first><last>Wang</last><affiliation>Tencent AI Lab</affiliation></author>
-      <author orcid="0000-0003-0520-6844"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author orcid="0000-0003-0520-6844" id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <pages>1925-1938</pages>
       <abstract>The Mixture of Depths (MoD) was introduced to improve computational efficiency by dynamically skipping less important layers, reducing redundant computation while maintaining model capacity. Despite its promise, existing MoD approaches remain under-explored and face two main challenges: (1) <i>high training costs due to the need to train the entire model along with the routers that determine which layers to skip</i>, and (2) <i>performance degradation when important layers are bypassed</i>. In response to the first issue, we propose Router-Tuning, which fine-tunes only the routers on a small dataset, drastically reducing the computational overhead associated with full model training. For the second challenge, we investigate across different architectures and granularities, demonstrating its effectiveness on Attention layers and MoE layers. This method preserves the model’s performance while significantly enhancing computational and memory efficiency. Extensive experiments demonstrate that our approach delivers competitive results while dramatically improving the computation efficiency, e.g., 21% speedup and only a 0.2% performance drop. The code will be released upon acceptance.</abstract>
       <url hash="a34550e3">2025.emnlp-main.99</url>
@@ -3524,7 +3524,7 @@
       <author><first>Zhisong</first><last>Zhang</last><affiliation>City University of Hong Kong</affiliation></author>
       <author><first>Hongming</first><last>Zhang</last></author>
       <author><first>Haitao</first><last>Mi</last><affiliation>Tencent AI Lab</affiliation></author>
-      <author orcid="0000-0003-0520-6844"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author orcid="0000-0003-0520-6844" id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <pages>4714-4720</pages>
       <abstract>Mamba’s theoretical infinite-context potential is limited in practice when sequences far exceed training lengths. This work explores unlocking Mamba’s long-context memory ability by a simple-yet-effective method, Recall with Reasoning (RwR), by distilling chain-of-thought (CoT) summarization from a teacher model. Specifically, RwR prepends these summarization as CoT prompts during fine-tuning, teaching Mamba to actively recall and reason over long contexts. Experiments on LONGMEMEVAL and HELMET show that RwR outperforms existing long-term memory methods on the Mamba model. Furthermore, under similar pre-training conditions, RwR improves the long-context performance of Mamba relative to comparable Transformer/hybrid baselines while preserving short-context capabilities, all without changing the architecture.</abstract>
       <url hash="855b86bc">2025.emnlp-main.235</url>
@@ -5938,7 +5938,7 @@
       <author><first>Xia</first><last>Du</last></author>
       <author><first>Shuhan</first><last>Sun</last></author>
       <author><first>Pengyuan</first><last>Liu</last><affiliation>Beijing Language and Culture University</affiliation></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-blcu"><first>Dong</first><last>Yu</last></author>
       <pages>7757-7797</pages>
       <abstract>Although small Large Language models (sLLMs) have been widely deployed in practical applications, little attention has been paid to their value-reasoning abilities, particularly in terms of reasoning reliability. To address this gap, we propose a systematic evaluation framework for assessing the Value-Reasoning Reliability of sLLMs. We define Value-Reasoning Reliability as comprising: (1) Output consistency under identical prompts, (2) Output Robustness under semantically equivalent prompts, (3) Maintaining stable value reasoning in the face of attacks, and (4) Consistency of value reasoning in open-ended value expression tasks. Our framework includes three core tasks: Repetition Consistency task, Interaction Stability task, and Open-ended Expression Consistency task. We further incorporate self-reported confidence scores to evaluate the model’s value reasoning reliability from two perspectives: the model’s self-awareness of its values, and its value-based decision-making. Our findings show that models vary significantly in their stability when responding to value-related questions. Moreover, we observe considerable output randomness, which is not always correlated with the self-reported confidence or expressed value preferences. This suggests that current models lack a reliable internal mechanism for stable value reasoning when addressing value-sensitive queries.</abstract>
       <url hash="be46bdc5">2025.emnlp-main.395</url>
@@ -6821,7 +6821,7 @@
       <author><first>Kaixin</first><last>Ma</last><affiliation>Apple</affiliation></author>
       <author orcid="0000-0002-4075-5980"><first>Wenhao</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Haitao</first><last>Mi</last><affiliation>Tencent AI Lab</affiliation></author>
-      <author orcid="0000-0003-0520-6844"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author orcid="0000-0003-0520-6844" id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <pages>8970-8986</pages>
       <abstract>Agent self-improvement, where agents autonomously train their underlying Large Language Model (LLM) on self-sampled trajectories, shows promising results but often stagnates in web environments due to limited exploration and under-utilization of pretrained web knowledge. To improve the performance of self-improvement, we propose a novel framework that introduces a co-evolving World Model LLM. This world model predicts the next observation based on the current observation and action within the web environment. The World Model serves dual roles: (1) as a virtual web server generating self-instructed training data to continuously refine the agent’s policy, and (2) as an imagination engine during inference, enabling look-ahead simulation to guide action selection for the agent LLM. Experiments in real-world web environments (Mind2Web-Live, WebVoyager, and GAIA-web) show a 10% performance gain over existing self-evolving agents, demonstrating the efficacy and generalizability of our approach, without using any distillation from more powerful close-sourced models.</abstract>
       <url hash="b5f7e585">2025.emnlp-main.454</url>
@@ -13441,7 +13441,7 @@
       <author><first>Hongming</first><last>Zhang</last></author>
       <author><first>Joyce C.</first><last>Ho</last><affiliation>Emory University</affiliation></author>
       <author orcid="0000-0001-9145-4531"><first>Carl</first><last>Yang</last><affiliation>Emory University</affiliation></author>
-      <author orcid="0000-0003-0520-6844"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author orcid="0000-0003-0520-6844" id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <pages>17877-17886</pages>
       <abstract>GUI agents powered by vision-language models (VLMs) show promise in automating complex digital tasks. However, their effectiveness in real-world applications is often limited by scarce training data and the inherent complexity of these tasks, which frequently require long-tailed knowledge covering rare, unseen scenarios. We propose RAG-GUI , a lightweight VLM that leverages web tutorials at inferencetime. RAG-GUI is first warm-started via supervised finetuning (SFT) and further refined through self-guided rejection sampling fine-tuning (RSF). Designed to be model-agnostic, RAG-GUI functions as a generic plug-in that enhances any VLM-based agent. Evaluatedacross three distinct tasks, it consistently outperforms baseline agents and surpasses other inference baselines by 2.6% to 13.3% acrosstwo model sizes, demonstrating strong generalization and practical plug-and-play capabilities in real-world scenarios.</abstract>
       <url hash="8f48bd08">2025.emnlp-main.902</url>
@@ -22697,7 +22697,7 @@
       <author><first>Caiqi</first><last>Zhang</last></author>
       <author><first>Zhisong</first><last>Zhang</last><affiliation>City University of Hong Kong</affiliation></author>
       <author><first>Xinting</first><last>Huang</last><affiliation>Tencent</affiliation></author>
-      <author orcid="0000-0003-0520-6844"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author orcid="0000-0003-0520-6844" id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <author orcid="0000-0002-7230-4164"><first>Nigel</first><last>Collier</last><affiliation>University of Cambridge</affiliation></author>
       <author orcid="0000-0002-1390-3861"><first>Deqing</first><last>Yang</last><affiliation>Fudan University</affiliation></author>
       <pages>30328-30344</pages>
diff --git a/data/xml/2025.findings.xml b/data/xml/2025.findings.xml
index aa30d90b80..6b669ae78b 100644
--- a/data/xml/2025.findings.xml
+++ b/data/xml/2025.findings.xml
@@ -9721,7 +9721,7 @@
       <author orcid="0009-0003-5416-1600"><first>Yiming</first><last>Lu</last></author>
       <author orcid="0000-0002-6959-165X"><first>Daoan</first><last>Zhang</last></author>
       <author><first>Hassan</first><last>Foroosh</last><affiliation>University of Central Florida</affiliation></author>
-      <author orcid="0000-0003-0520-6844"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author orcid="0000-0003-0520-6844" id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <author id="fei-liu"><first>Fei</first><last>Liu</last><affiliation>Emory University</affiliation></author>
       <pages>4587-4603</pages>
       <abstract>LLMs are ideal for decision-making thanks to their ability to reason over long contexts. However, challenges arise when processing speech transcripts that describe complex scenarios, as they are verbose and include repetition, hedging, and vagueness. E.g., during a company’s earnings call, an executive might project a positive revenue outlook to reassure investors, despite uncertainty regarding future earnings. It is crucial for LLMs to incorporate this uncertainty systematically when making decisions. In this paper, we introduce DeFine, a modular framework that constructs probabilistic factor profiles from complex scenarios. It then integrates these profiles with analogical reasoning, leveraging insights from similar past experiences to guide LLMs in making critical decisions in new situations. Our framework separates the tasks of quantifying uncertainty and incorporating it into LLM decision-making. This approach is particularly useful in areas such as consulting and financial deliberation, where making decisions under uncertainty is vital.</abstract>
@@ -29919,7 +29919,7 @@
       <author><first>Jingyan</first><last>Zhou</last></author>
       <author><first>Hongming</first><last>Zhang</last></author>
       <author><first>Haitao</first><last>Mi</last><affiliation>Tencent AI Lab</affiliation></author>
-      <author orcid="0000-0003-0520-6844"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author orcid="0000-0003-0520-6844" id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <author orcid="0000-0001-8106-6447"><first>Irwin</first><last>King</last></author>
       <pages>5155-5173</pages>
       <abstract>Web agents powered by Large Language Models (LLMs) show promise for next-generation AI, but their limited reasoning in uncertain, dynamic web environments hinders robust deployment. In this paper, we identify key reasoning skills essential for effective web agents, i.e., reflection &amp; lookahead, branching, and rollback, and curate trajectory data that exemplifies these abilities by reconstructing the agent’s (inference-time) reasoning algorithms into chain-of-thought rationales. We conduct experiments in the agent self-improving benchmark, OpenWebVoyager, and demonstrate that distilling salient reasoning patterns into the backbone LLM via simple fine-tuning can substantially enhance its performance. Our approach yields significant improvements across multiple benchmarks, including WebVoyager, Mind2web-live, and SimpleQA (web search), highlighting the potential of targeted reasoning skill enhancement for web agents.</abstract>
@@ -33335,7 +33335,7 @@
       <author><first>Kaixin</first><last>Ma</last><affiliation>Apple</affiliation></author>
       <author><first>Xiaoman</first><last>Pan</last><affiliation>Amazon</affiliation></author>
       <author orcid="0000-0002-7818-6090"><first>Yangqiu</first><last>Song</last><affiliation>Hong Kong University of Science and Technology</affiliation></author>
-      <author orcid="0000-0003-0520-6844"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author orcid="0000-0003-0520-6844" id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <pages>9666-9686</pages>
       <abstract>Large Vision-Language Models (LVLMs) have achieved significant progress in tasks like visual question answering and document understanding. However, their potential to comprehend embodied environments and navigate within them remains underexplored. In this work, we first study the challenge of open-vocabulary object navigation by introducing DivScene, a large-scale dataset with 4,614 houses across 81 scene types and 5,707 kinds of target objects. Our dataset provides a much greater diversity of target objects and scene types than existing datasets, enabling a comprehensive task evaluation. We evaluated various methods with LVLMs and LLMs on our dataset and found that current models still fall short of open-vocab object navigation ability. Then, we fine-tuned LVLMs to predict the next action with CoT explanations. We observe that LVLM’s navigation ability can be improved substantially with only BFS-generated shortest paths without any human supervision, surpassing GPT-4o by over 20% in success rates.</abstract>
       <url hash="e3d84e0c">2025.findings-emnlp.513</url>
@@ -34957,7 +34957,7 @@
       <title>Attribution and Application of Multiple Neurons in Multimodal Large Language Models</title>
       <author><first>Feiyu</first><last>Wang</last></author>
       <author><first>Ziran</first><last>Zhao</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-blcu"><first>Dong</first><last>Yu</last></author>
       <author><first>Pengyuan</first><last>Liu</last><affiliation>Beijing Language and Culture University</affiliation></author>
       <pages>11649-11662</pages>
       <abstract>Multimodal Large Language Models (MLLMs) have demonstrated exceptional performance across various tasks. However, the internal mechanisms by which they interpret and integrate cross-modal information remain insufficiently understood. In this paper, to address the limitations of prior studies that could only identify neurons corresponding to single-token and rely on the vocabulary of LLMs, we propose a novel method to identify multimodal neurons in Transformer-based MLLMs. Then we introduce fuzzy set theory to model the complex relationship between neurons and semantic concepts and to characterize how multiple neurons collaboratively contribute to semantic concepts. Through both theoretical analysis and empirical validation, we demonstrate the effectiveness of our method and present some meaningful findings. Furthermore, by modulating neuron activation values based on the constructed fuzzy sets, we enhance performance on the Visual Question Answering (VQA) task, showing the practical value of our approach in downstream applications in MLLMs.</abstract>
diff --git a/data/xml/2025.knowledgenlp.xml b/data/xml/2025.knowledgenlp.xml
index dfd080c181..162e1f00b8 100644
--- a/data/xml/2025.knowledgenlp.xml
+++ b/data/xml/2025.knowledgenlp.xml
@@ -316,7 +316,7 @@
       <author><first>Deng</first><last>Cai</last></author>
       <author><first>Zhuosheng</first><last>Zhang</last></author>
       <author><first>Hai</first><last>Zhao</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <pages>359-373</pages>
       <abstract>Recent advancements in proprietary large language models (LLMs), such as those from OpenAI and Anthropic, have led to the development of document reading systems capable of handling raw files with complex layouts, intricate formatting, lengthy content, and multi-modal information. However, the absence of a standardized benchmark hinders objective evaluation of these systems. To address this gap, we introduce DocBench, a benchmark designed to simulate real-world scenarios, where each raw file consists of a document paired with one or more questions. DocBench uniquely evaluates entire document reading systems and adopts a user-centric approach, allowing users to identify the system best suited to their needs.</abstract>
       <url hash="70e5ce37">2025.knowledgenlp-1.29</url>
diff --git a/data/xml/2025.naacl.xml b/data/xml/2025.naacl.xml
index 47261b7ee1..bb31462049 100644
--- a/data/xml/2025.naacl.xml
+++ b/data/xml/2025.naacl.xml
@@ -12143,7 +12143,7 @@
       <author><first>Hongwei</first><last>Wang</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Kaixin</first><last>Ma</last><affiliation>Tencent AI Lab</affiliation></author>
       <author><first>Wenhao</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
-      <author><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last><affiliation>Tencent AI Lab</affiliation></author>
       <pages>328-349</pages>
       <abstract>We introduce Cognitive Kernel, an open-source agent system towards the goal of generalist autopilots. Unlike copilot systems, which primarily rely on users to provide essential state information, autopilot systems complete tasks from start to finish independently. This requires the system to acquire the missing state information actively. Cognitive Kernel adopts a dynamic programming design where the central policy model (a fine-tuned LLM) could initiate an environment state perception task, essentially another agent task, as needed. The results demonstrate that Cognitive Kernel achieves better or comparable performance to other closed-source systems on core autopilot capabilities. Cognitive Kernel is fully dockerized, ensuring everyone can deploy it privately and securely. We open-source the system to encourage further research on LLM-driven autopilot systems</abstract>
       <url hash="b6bf0ca6">2025.naacl-demo.29</url>
diff --git a/data/xml/D18.xml b/data/xml/D18.xml
index da4ff395fd..9db570cc94 100644
--- a/data/xml/D18.xml
+++ b/data/xml/D18.xml
@@ -505,7 +505,7 @@
       <author><first>Jianshu</first><last>Chen</last></author>
       <author><first>Yu</first><last>Su</last></author>
       <author><first>Xin</first><last>Wang</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <author><first>Xifeng</first><last>Yan</last></author>
       <author><first>William Yang</first><last>Wang</last></author>
       <pages>414–424</pages>
diff --git a/data/xml/D19.xml b/data/xml/D19.xml
index c0529b3ad9..60da6e054e 100644
--- a/data/xml/D19.xml
+++ b/data/xml/D19.xml
@@ -6667,7 +6667,7 @@
       <author><first>Changlong</first><last>Yu</last></author>
       <author><first>Yangqiu</first><last>Song</last></author>
       <author><first>Wilfred</first><last>Ng</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <pages>5247–5256</pages>
       <abstract>Conventional word embeddings represent words with fixed vectors, which are usually trained based on co-occurrence patterns among words. In doing so, however, the power of such representations is limited, where the same word might be functionalized separately under different syntactic relations. To address this limitation, one solution is to incorporate relational dependencies of different words into their embeddings. Therefore, in this paper, we propose a multiplex word embedding model, which can be easily extended according to various relations among words. As a result, each word has a center embedding to represent its overall semantics, and several relational embeddings to represent its relational dependencies. Compared to existing models, our model can effectively distinguish words with respect to different relations without introducing unnecessary sparseness. Moreover, to accommodate various relations, we use a small dimension for relational embeddings and our model is able to keep their effectiveness. Experiments on selectional preference acquisition and word similarity demonstrate the effectiveness of the proposed model, and a further study of scalability also proves that our embeddings only need 1/20 of the original embedding size to achieve better performance.</abstract>
       <url hash="e10b1e45">D19-1528</url>
@@ -10602,7 +10602,7 @@ Typo in Table 4 fixed to reflect correct recall of presented system.</revision>
       <title>Multi-Document Summarization with Determinantal Point Processes and Contextualized Representations</title>
       <author><first>Sangwoo</first><last>Cho</last></author>
       <author><first>Chen</first><last>Li</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <author><first>Hassan</first><last>Foroosh</last></author>
       <author id="fei-liu-utdallas"><first>Fei</first><last>Liu</last></author>
       <pages>98–103</pages>
@@ -11388,7 +11388,7 @@ Typo in Table 4 fixed to reflect correct recall of presented system.</revision>
       <title>Generating Diverse Story Continuations with Controllable Semantics</title>
       <author><first>Lifu</first><last>Tu</last></author>
       <author><first>Xiaoan</first><last>Ding</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <author><first>Kevin</first><last>Gimpel</last></author>
       <pages>44–58</pages>
       <abstract>We propose a simple and effective modeling framework for controlled generation of multiple, diverse outputs. We focus on the setting of generating the next sentence of a story given its context. As controllable dimensions, we consider several sentence attributes, including sentiment, length, predicates, frames, and automatically-induced clusters. Our empirical results demonstrate: (1) our framework is accurate in terms of generating outputs that match the target control values; (2) our model yields increased maximum metric scores compared to standard n-best list generation via beam search; (3) controlling generation with semantic frames leads to a stronger combination of diversity and quality than other control variables as measured by automatic metrics. We also conduct a human evaluation to assess the utility of providing multiple suggestions for creative writing, demonstrating promising results for the potential of controllable, diverse generation in a collaborative writing system.</abstract>
@@ -12183,7 +12183,7 @@ Typo in Table 4 fixed to reflect correct recall of presented system.</revision>
       <author><first>Jianshu</first><last>Chen</last></author>
       <author><first>Heng</first><last>Ji</last></author>
       <author><first>Claire</first><last>Cardie</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <pages>27–37</pages>
       <abstract>We focus on multiple-choice question answering (QA) tasks in subject areas such as science, where we require both broad background knowledge and the facts from the given subject-area reference corpus. In this work, we explore simple yet effective methods for exploiting two sources of external knowledge for subject-area QA. The first enriches the original subject-area reference corpus with relevant text snippets extracted from an open-domain resource (i.e., Wikipedia) that cover potentially ambiguous concepts in the question and answer options. As in other QA research, the second method simply increases the amount of training data by appending additional in-domain subject-area instances. Experiments on three challenging multiple-choice science QA tasks (i.e., ARC-Easy, ARC-Challenge, and OpenBookQA) demonstrate the effectiveness of our methods: in comparison to the previous state-of-the-art, we obtain absolute gains in accuracy of up to 8.1%, 13.0%, and 12.8%, respectively. While we observe consistent gains when we introduce knowledge from Wikipedia, we find that employing additional QA training instances is not uniformly helpful: performance degrades when the added instances exhibit a higher level of difficulty than the original training data. As one of the first studies on exploiting unstructured external knowledge for subject-area QA, we hope our methods, observations, and discussion of the exposed limitations may shed light on further developments in the area.</abstract>
       <url hash="1c844156">D19-5804</url>
@@ -12754,7 +12754,7 @@ Typo in Table 4 fixed to reflect correct recall of presented system.</revision>
     <paper id="12">
       <title><fixed-case>BLCU</fixed-case>-<fixed-case>NLP</fixed-case> at <fixed-case>COIN</fixed-case>-Shared Task1: Stagewise Fine-tuning <fixed-case>BERT</fixed-case> for Commonsense Inference in Everyday Narrations</title>
       <author><first>Chunhua</first><last>Liu</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-blcu"><first>Dong</first><last>Yu</last></author>
       <pages>99–103</pages>
       <abstract>This paper describes our system for COIN Shared Task 1: Commonsense Inference in Everyday Narrations. To inject more external knowledge to better reason over the narrative passage, question and answer, the system adopts a stagewise fine-tuning method based on pre-trained BERT model. More specifically, the first stage is to fine-tune on addi- tional machine reading comprehension dataset to learn more commonsense knowledge. The second stage is to fine-tune on target-task (MCScript2.0) with MCScript (2018) dataset assisted. Experimental results show that our system achieves significant improvements over the baseline systems with 84.2% accuracy on the official test dataset.</abstract>
       <url hash="118d0307">D19-6012</url>
diff --git a/data/xml/K19.xml b/data/xml/K19.xml
index bd138b9624..1a4c8bb9e2 100644
--- a/data/xml/K19.xml
+++ b/data/xml/K19.xml
@@ -373,7 +373,7 @@
       <author><first>Dian</first><last>Yu</last></author>
       <author><first>Kai</first><last>Sun</last></author>
       <author><first>Jianshu</first><last>Chen</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <pages>316–327</pages>
       <abstract>Recently, pre-trained language models have achieved remarkable success in a broad range of natural language processing tasks. However, in multilingual setting, it is extremely resource-consuming to pre-train a deep language model over large-scale corpora for each language. Instead of exhaustively pre-training monolingual language models independently, an alternative solution is to pre-train a powerful multilingual deep language model over large-scale corpora in hundreds of languages. However, the vocabulary size for each language in such a model is relatively small, especially for low-resource languages. This limitation inevitably hinders the performance of these multilingual models on tasks such as sequence labeling, wherein in-depth token-level or sentence-level understanding is essential. In this paper, inspired by previous methods designed for monolingual settings, we investigate two approaches (i.e., joint mapping and mixture mapping) based on a pre-trained multilingual model BERT for addressing the out-of-vocabulary (OOV) problem on a variety of tasks, including part-of-speech tagging, named entity recognition, machine translation quality estimation, and machine reading comprehension. Experimental results show that using mixture mapping is more promising. To the best of our knowledge, this is the first work that attempts to address and discuss the OOV issue in multilingual settings.</abstract>
       <url hash="651a99e1">K19-1030</url>
@@ -805,7 +805,7 @@
       <author><first>Dian</first><last>Yu</last></author>
       <author><first>Kai</first><last>Sun</last></author>
       <author><first>Jianshu</first><last>Chen</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <author><first>David</first><last>McAllester</last></author>
       <author><first>Dan</first><last>Roth</last></author>
       <pages>696–707</pages>
diff --git a/data/xml/N07.xml b/data/xml/N07.xml
index 11dc0646ea..a9c0f051a4 100644
--- a/data/xml/N07.xml
+++ b/data/xml/N07.xml
@@ -1408,7 +1408,7 @@
       <author><first>Geoffrey</first><last>Zweig</last></author>
       <author><first>Y.C.</first><last>Ju</last></author>
       <author><first>Patrick</first><last>Nguyen</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <author><first>Ye-Yi</first><last>Wang</last></author>
       <author><first>Alex</first><last>Acero</last></author>
       <pages>31–32</pages>
diff --git a/data/xml/N16.xml b/data/xml/N16.xml
index 4a7114a9e9..6489714b33 100644
--- a/data/xml/N16.xml
+++ b/data/xml/N16.xml
@@ -469,7 +469,7 @@
       <author><first>Yangyang</first><last>Shi</last></author>
       <author><first>Kaisheng</first><last>Yao</last></author>
       <author><first>Hu</first><last>Chen</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <author><first>Yi-Cheng</first><last>Pan</last></author>
       <author><first>Mei-Yuh</first><last>Hwang</last></author>
       <pages>393–399</pages>
@@ -1926,7 +1926,7 @@
       <title>An End-to-end Approach to Learning Semantic Frames with Feedforward Neural Network</title>
       <author><first>Yukun</first><last>Feng</last></author>
       <author><first>Yipei</first><last>Xu</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-blcu"><first>Dong</first><last>Yu</last></author>
       <pages>1–7</pages>
       <url hash="6b26dd6f">N16-2001</url>
       <doi>10.18653/v1/N16-2001</doi>
diff --git a/data/xml/N19.xml b/data/xml/N19.xml
index 493508c090..2b22ba724a 100644
--- a/data/xml/N19.xml
+++ b/data/xml/N19.xml
@@ -3333,7 +3333,7 @@
       <title>Improving Machine Reading Comprehension with General Reading Strategies</title>
       <author><first>Kai</first><last>Sun</last></author>
       <author><first>Dian</first><last>Yu</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <author><first>Claire</first><last>Cardie</last></author>
       <pages>2633–2643</pages>
       <abstract>Reading strategies have been shown to improve comprehension levels, especially for readers lacking adequate prior knowledge. Just as the process of knowledge accumulation is time-consuming for human readers, it is resource-demanding to impart rich general domain knowledge into a deep language model via pre-training. Inspired by reading strategies identified in cognitive science, and given limited computational resources - just a pre-trained model and a fixed number of training instances - we propose three general strategies aimed to improve non-extractive machine reading comprehension (MRC): (i) BACK AND FORTH READING that considers both the original and reverse order of an input sequence, (ii) HIGHLIGHTING, which adds a trainable embedding to the text embedding of tokens that are relevant to the question and candidate answers, and (iii) SELF-ASSESSMENT that generates practice questions and candidate answers directly from the text in an unsupervised manner. By fine-tuning a pre-trained language model (Radford et al., 2018) with our proposed strategies on the largest general domain multiple-choice MRC dataset RACE, we obtain a 5.8% absolute increase in accuracy over the previous best result achieved by the same pre-trained model fine-tuned on RACE without the use of strategies. We further fine-tune the resulting model on a target MRC task, leading to an absolute improvement of 6.2% in average accuracy over previous state-of-the-art approaches on six representative non-extractive MRC datasets from different domains (i.e., ARC, OpenBookQA, MCTest, SemEval-2018 Task 11, ROCStories, and MultiRC). These results demonstrate the effectiveness of our proposed strategies and the versatility and general applicability of our fine-tuned models that incorporate these strategies. Core code is available at <url>https://github.com/nlpdata/strategy/</url>.</abstract>
diff --git a/data/xml/P19.xml b/data/xml/P19.xml
index 0cdab3ee01..d19f755dcb 100644
--- a/data/xml/P19.xml
+++ b/data/xml/P19.xml
@@ -229,7 +229,7 @@
       <author><first>Ying</first><last>Lin</last></author>
       <author><first>Liyuan</first><last>Liu</last></author>
       <author><first>Heng</first><last>Ji</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <author><first>Jiawei</first><last>Han</last></author>
       <pages>165–174</pages>
       <abstract>Word embeddings are widely used on a variety of tasks and can substantially improve the performance. However, their quality is not consistent throughout the vocabulary due to the long-tail distribution of word frequency. Without sufficient contexts, rare word embeddings are usually less reliable than those of common words. However, current models typically trust all word embeddings equally regardless of their reliability and thus may introduce noise and hurt the performance. Since names often contain rare and uncommon words, this problem is particularly critical for name tagging. In this paper, we propose a novel reliability-aware name tagging model to tackle this issue. We design a set of word frequency-based reliability signals to indicate the quality of each word embedding. Guided by the reliability signals, the model is able to dynamically select and compose features such as word embedding and character-level representation using gating mechanisms. For example, if an input word is rare, the model relies less on its word embedding and assigns higher weights to its character and contextual features. Experiments on OntoNotes 5.0 show that our model outperforms the baseline model by 2.7% absolute gain in F-score. In cross-genre experiments on five genres in OntoNotes, our model improves the performance for most genre pairs and obtains up to 5% absolute F-score gain.</abstract>
@@ -1046,7 +1046,7 @@
       <author><first>Hongming</first><last>Zhang</last></author>
       <author><first>Yan</first><last>Song</last></author>
       <author><first>Yangqiu</first><last>Song</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <pages>867–876</pages>
       <abstract>Resolving pronoun coreference requires knowledge support, especially for particular domains (e.g., medicine). In this paper, we explore how to leverage different types of knowledge to better resolve pronoun coreference with a neural model. To ensure the generalization ability of our model, we directly incorporate knowledge in the format of triplets, which is the most common format of modern knowledge graphs, instead of encoding it with features or rules as that in conventional approaches. Moreover, since not all knowledge is helpful in certain contexts, to selectively use them, we propose a knowledge attention module, which learns to select and use informative knowledge based on contexts, to enhance our model. Experimental results on two datasets from different domains prove the validity and effectiveness of our model, where it outperforms state-of-the-art baselines by a large margin. Moreover, since our model learns to use external knowledge rather than only fitting the training data, it also demonstrates superior performance to baselines in the cross-domain setting.</abstract>
       <url hash="98ec2404">P19-1083</url>
@@ -3903,7 +3903,7 @@
       <author><first>Yansong</first><last>Feng</last></author>
       <author><first>Yan</first><last>Song</last></author>
       <author><first>Zhiguo</first><last>Wang</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <pages>3156–3161</pages>
       <abstract>Previous cross-lingual knowledge graph (KG) alignment studies rely on entity embeddings derived only from monolingual KG structural information, which may fail at matching entities that have different facts in two KGs. In this paper, we introduce the topic entity graph, a local sub-graph of an entity, to represent entities with their contextual information in KG. From this view, the KB-alignment task can be formulated as a graph matching problem; and we further propose a graph-attention based solution, which first matches all entities in two topic entity graphs, and then jointly model the local matching information to derive a graph-level matching vector. Experiments show that our model outperforms previous state-of-the-art methods by a large margin.</abstract>
       <url hash="9229d387">P19-1304</url>
diff --git a/data/xml/Q19.xml b/data/xml/Q19.xml
index 9beb2729d4..38e05a7176 100644
--- a/data/xml/Q19.xml
+++ b/data/xml/Q19.xml
@@ -179,7 +179,7 @@
       <author><first>Kai</first><last>Sun</last></author>
       <author><first>Dian</first><last>Yu</last></author>
       <author><first>Jianshu</first><last>Chen</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-idaho"><first>Dong</first><last>Yu</last></author>
       <author><first>Yejin</first><last>Choi</last></author>
       <author><first>Claire</first><last>Cardie</last></author>
       <doi>10.1162/tacl_a_00264</doi>
diff --git a/data/xml/S15.xml b/data/xml/S15.xml
index f5edc6e07f..ccbec746c2 100644
--- a/data/xml/S15.xml
+++ b/data/xml/S15.xml
@@ -1014,7 +1014,7 @@
       <title><fixed-case>BLCUNLP</fixed-case>: Corpus Pattern Analysis for Verbs Based on Dependency Chain</title>
       <author><first>Yukun</first><last>Feng</last></author>
       <author><first>Qiao</first><last>Deng</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-blcu"><first>Dong</first><last>Yu</last></author>
       <pages>325–328</pages>
       <url hash="32a1b9db">S15-2054</url>
       <doi>10.18653/v1/S15-2054</doi>
diff --git a/data/xml/S17.xml b/data/xml/S17.xml
index f0af312861..ff78b806d7 100644
--- a/data/xml/S17.xml
+++ b/data/xml/S17.xml
@@ -119,7 +119,7 @@
     <paper id="10">
       <title>Semantic Frame Labeling with Target-based Neural Model</title>
       <author><first>Yukun</first><last>Feng</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-blcu"><first>Dong</first><last>Yu</last></author>
       <author><first>Jian</first><last>Xu</last></author>
       <author><first>Chunhua</first><last>Liu</last></author>
       <pages>91–96</pages>
diff --git a/data/xml/S18.xml b/data/xml/S18.xml
index 19b54ccbf1..de21a89ee3 100644
--- a/data/xml/S18.xml
+++ b/data/xml/S18.xml
@@ -2151,7 +2151,7 @@
       <author><first>Chunhua</first><last>Liu</last></author>
       <author><first>Lu</first><last>Liu</last></author>
       <author><first>Yan</first><last>Zhao</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-blcu"><first>Dong</first><last>Yu</last></author>
       <pages>1104–1108</pages>
       <abstract>To comprehend an argument and fill the gap between claims and reasons, it is vital to find the implicit supporting warrants behind. In this paper, we propose a hierarchical attention model to identify the right warrant which explains why the reason stands for the claim. Our model focuses not only on the similar part between warrants and other information but also on the contradictory part between two opposing warrants. In addition, we use the ensemble method for different models. Our model achieves an accuracy of 61%, ranking second in this task. Experimental results demonstrate that our model is effective to make correct choices.</abstract>
       <url hash="c14de464">S18-1186</url>
diff --git a/data/xml/S19.xml b/data/xml/S19.xml
index a1961b1730..018a37ed4d 100644
--- a/data/xml/S19.xml
+++ b/data/xml/S19.xml
@@ -2606,7 +2606,7 @@
       <author><first>Ruoyao</first><last>Yang</last></author>
       <author><first>Wanying</first><last>Xie</last></author>
       <author><first>Chunhua</first><last>Liu</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-blcu"><first>Dong</first><last>Yu</last></author>
       <pages>1090–1096</pages>
       <abstract>Researchers have been paying increasing attention to rumour evaluation due to the rapid spread of unsubstantiated rumours on social media platforms, including SemEval 2019 task 7. However, labelled data for learning rumour veracity is scarce, and labels in rumour stance data are highly disproportionate, making it challenging for a model to perform supervised-learning adequately. We propose an inference chain-based system, which fully utilizes conversation structure-based knowledge in the limited data and expand the training data in minority categories to alleviate class imbalance. Our approach obtains 12.6% improvement upon the baseline system for subtask A, ranks 1st among 21 systems in subtask A, and ranks 4th among 12 systems in subtask B.</abstract>
       <url hash="f419313f">S19-2191</url>
@@ -2689,7 +2689,7 @@
       <author><first>Mengxi</first><last>Que</last></author>
       <author><first>Ruoyao</first><last>Yang</last></author>
       <author><first>Chunhua</first><last>Liu</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-blcu"><first>Dong</first><last>Yu</last></author>
       <pages>1132–1137</pages>
       <abstract>Since the resources of Community Question Answering are abundant and information sharing becomes universal, it will be increasingly difficult to find factual information for questioners in massive messages. SemEval 2019 task 8 is focusing on these issues. We participate in the task and use Generative Pre-trained Transformer (OpenAI GPT) as our system. Our innovations are data extension, feature extraction, and input transformation. For contextual knowledge enhancement, we extend the training set of subtask A, use several features to improve the results of our system and adapt the input formats to be more suitable for this task. We demonstrate the effectiveness of our approaches, which achieves 81.95% of subtask A and 61.08% of subtask B in accuracy on the SemEval 2019 task 8.</abstract>
       <url hash="9d86adf3">S19-2198</url>
diff --git a/data/xml/W14.xml b/data/xml/W14.xml
index b69d031bec..6dd0cfc910 100644
--- a/data/xml/W14.xml
+++ b/data/xml/W14.xml
@@ -11795,7 +11795,7 @@
     </paper>
     <paper id="19">
       <title>An Introduction to <fixed-case>BLCU</fixed-case> Personal Attributes Extraction System</title>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-blcu"><first>Dong</first><last>Yu</last></author>
       <author><first>Cheng</first><last>Yu</last></author>
       <author><first>Qin</first><last>Qu</last></author>
       <author><first>Gongbo</first><last>Tang</last></author>
diff --git a/data/xml/Y18.xml b/data/xml/Y18.xml
index 7199833df4..170428b08c 100644
--- a/data/xml/Y18.xml
+++ b/data/xml/Y18.xml
@@ -364,7 +364,7 @@
       <author><first>Chunhua</first><last>Liu</last></author>
       <author><first>Haiou</first><last>Zhang</last></author>
       <author><first>Shan</first><last>Jiang</last></author>
-      <author><first>Dong</first><last>Yu</last></author>
+      <author id="dong-yu-blcu"><first>Dong</first><last>Yu</last></author>
       <url hash="e347ce1f">Y18-1045</url>
       <bibkey>liu-etal-2018-demn</bibkey>
     </paper>
diff --git a/data/yaml/name_variants.yaml b/data/yaml/name_variants.yaml
index d758d6889a..52e803f984 100644
--- a/data/yaml/name_variants.yaml
+++ b/data/yaml/name_variants.yaml
@@ -10941,6 +10941,14 @@
 - canonical: {first: Clement T., last: Yu}
   variants:
   - {first: Clement, last: Yu}
+- canonical: {first: Dong, last: Yu}
+  id: dong-yu-blcu
+  comment: Beijing Language and Culture University
+- canonical: {first: Dong, last: Yu}
+  id: dong-yu-idaho
+  orcid: 0000-0003-0520-6844
+  comment: University of Idaho
+  degree: University of Idaho
 - canonical: {first: Edmund, last: Yu}
   variants:
   - {first: Edmund S., last: Yu}