Summaries auto-generated from HuggingFace's Daily Papers using Gemini and GitHub Actions. All credits go to the research and HuggingFace communities. 🔉 You can get audio summaries via OpenAI's text-to-speech API on Telegram. Note: Authors may be listed by their HuggingFace IDs. Additionally, summaries are generated by an LLM and may contain mistakes. You can see the prompt used here.
Title | Authors | Summary |
---|---|---|
LLM as a Broken Telephone: Iterative Generation Distorts Information (Read more on arXiv or HuggingFace) | Michalis Vazirgiannis, guokan-shang, mgeng, amr-mohamed | Iterative processing of text by large language models (LLMs) degrades information, similar to the "broken telephone" game. The main research question is whether LLMs distort information through iterative generation, particularly in translation tasks. The key methodology involved simulating iterative translation chains, where an English document was repeatedly translated into and out of other languages using LLMs. Primary results show a gradual decline in factuality and relevance over iterations, with an average FActScore gradient of -0.038 ± 0.02 in the most complex translation chain setting. Principal implication for AI practitioners is that iterative generation with LLMs can lead to information distortion, making control of temperature, prompt design, and understanding the role of intermediary languages necessary when building applications relying on the iterative processing of LLM-generated content. |
EgoLife: Towards Egocentric Life Assistant (Read more on arXiv or HuggingFace) | Zzitang, Alarak, fesvhtr, THUdyh, Jingkang | i) EgoLife introduces a comprehensive egocentric dataset and benchmark for developing AI life assistants. ii) The study aims to create life-oriented question-answering tasks designed to provide meaningful assistance in daily life through multimodal egocentric data understanding. iii) Data was collected from six participants living together for a week, using AI glasses to record multimodal egocentric video, supplemented by synchronized third-person video references and annotated for comprehensive data analysis. iv) The EgoLife Dataset comprises 300 hours of egocentric data and introduces EgoLifeQA, a benchmark for long-context question answering, alongside EgoButler, an integrated system, and their experiments verified the mechanisms, critical factors, and bottlenecks, guiding future improvements with EgoGPT achieving state-of-the-art performance on egocentric video understanding. v) The EgoLife dataset, tasks, and models offer AI practitioners a resource for advancing long-term egocentric life assistance through improved multimodal integration, identity recognition, and ultra-long-context question answering. |
HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization (Read more on arXiv or HuggingFace) | Ya Wang, Breeze0417, LLIXQ, Taoer, BryceZhuo | HybridNorm, a novel normalization strategy for Transformers, combines QKV normalization in attention and Post-Norm in the feed-forward network to improve training stability and performance. The research objective is to address the trade-offs between training stability and final model performance inherent in existing normalization techniques like Pre-Norm and Post-Norm in Transformer models. The key methodology involves proposing HybridNorm and evaluating it through extensive experiments on large-scale dense and Mixture-of-Experts (MoE) language models. The primary results show that HybridNorm consistently outperforms Pre-Norm and Post-Norm across various benchmarks; for example, HybridNorm* achieved an average accuracy of 64.15% compared to Pre-Norm's 62.99% on downstream tasks for 1.2B dense models. Principal implication: AI practitioners can use HybridNorm to achieve more stable training dynamics and superior performance when training large Transformer models, particularly in language modeling applications. |
PokéChamp: an Expert-level Minimax Language Agent (Read more on arXiv or HuggingFace) | Andy Luu Nguyen, chijin, milkkarten | PokéChamp is a minimax language agent that achieves expert-level performance in Pokémon battles by integrating large language models (LLMs) into the tree search algorithm. The main research objective is to develop an agent capable of strategic action proposal, accurate opponent modeling, and effective evaluation of game trajectories in Pokémon battles, without requiring LLM fine-tuning. The key methodology involves replacing three components of minimax tree search—player action sampling, opponent modeling, and value function estimation—with LLM-based generations, leveraging a world model that approximates game transitions. PokéChamp, powered by GPT-4o, achieves a 76% win rate against the best existing LLM-based bot and 84% against the strongest rule-based bot in the Generation 9 OverUsed Meta. AI practitioners can leverage this framework's integration of LLMs with game-theoretic planning algorithms to develop agents for complex, partially observable environments without task-specific training. |
FuseChat-3.0: Preference Optimization Meets Heterogeneous Model Fusion (Read more on arXiv or HuggingFace) | passerqxj, OnewayLab, GGLS, Wanfq, AALF | FuseChat-3.0 integrates the strengths of heterogeneous large language models (LLMs) into more compact target LLMs using a two-stage training process. The main objective is to develop a method for effectively fusing knowledge from multiple, diverse source LLMs into smaller target LLMs. The methodology involves a specialized data construction protocol followed by supervised fine-tuning (SFT) and Direct Preference Optimization (DPO), using preference pairs generated from the same source model. When using Llama-3.1-8B-Instruct as the target model, the fusion approach achieves an average improvement of 6.8 points across 14 benchmarks. AI practitioners can use this implicit model fusion technique to enhance the performance of smaller LLMs by leveraging the capabilities of larger, heterogeneous models, without requiring architectural changes. |
Token-Efficient Long Video Understanding for Multimodal LLMs (Read more on arXiv or HuggingFace) | zhiqilinv, MuyangLI, zhijianliu, xiuyul, jdps | i) STORM is a novel architecture for efficient long video understanding in multimodal LLMs. ii) The research aims to improve video understanding in LLMs, particularly with extended temporal contexts. iii) A dedicated temporal encoder using the Mamba State Space Model is introduced between the image encoder and the LLM, enabling token reduction via sampling and spatial/temporal pooling. iv) STORM achieves state-of-the-art results with over 5% improvement on MLVU and LongVideoBench, while reducing computation costs by up to 8x and decoding latency by 2.4-2.9x for fixed input frames. v) Practitioners can leverage STORM to reduce LLM computational demands and latency without sacrificing performance. |
The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation (Read more on arXiv or HuggingFace) | Xu Tan, Kai Shen, Aoxiong Yin, JunchengLi, ustcscallion | LanDiff is a hybrid text-to-video generation framework that combines language models and diffusion models for coarse-to-fine video synthesis. The main research objective is to develop a framework that leverages the strengths of both autoregressive language models (semantic understanding, causal modeling) and diffusion models (high visual quality, progressive refinement) while mitigating their limitations. The key methodology involves a two-stage process: (1) a semantic tokenizer compresses 3D visual features into 1D discrete representations, and an LLM generates semantic tokens; (2) a streaming diffusion model refines these tokens into high-fidelity video features, decoded by a VAE. LanDiff, with a 5B parameter model, achieves a score of 85.43 on the VBench T2V benchmark, surpassing state-of-the-art open-source and commercial models. AI practitioners can use the LanDiff architecture as a blueprint for production-level video generation, particularly in scenarios requiring high semantic accuracy, visual quality, and long video generation capabilities. |
IFIR: A Comprehensive Benchmark for Evaluating Instruction-Following in Expert-Domain Information Retrieval (Read more on arXiv or HuggingFace) | Mingsheng Shang, yilunzhao, guo9, songtingyu | IFIR is a new benchmark for evaluating instruction-following information retrieval in specialized domains, revealing challenges for current models. The main research objective is to evaluate how well current information retrieval (IR) systems can follow complex, domain-specific instructions in expert fields. The key methodology involves creating a new benchmark (IFIR) with 2,426 examples across finance, law, healthcare, and scientific literature, incorporating three levels of instruction complexity and a novel LLM-based evaluation metric (INSTFOL). Primary results show that while BM25 performs relatively well due to glossary terms, instruction-tuned retrievers like INSTRUCTOR don't significantly outperform their base models, and most models' performance declines with increasing instruction complexity; LLM-based retrievers achieve the highest INSTFOL score, as demonstrated by Promptriever-7B. The principal implication is that current retrieval models, even those fine-tuned for instruction following, struggle with long, complex instructions in specialized domains, indicating a need for improved training methodologies and architectures, or for hybrid systems that leverage large language models' superior instruction-following ability. |
Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities (Read more on arXiv or HuggingFace) | manocha, rafaelvalle, firecomputer, ZhifengKong, SreyanG-NVIDIA | i) Audio Flamingo 2 (AF2) is a novel audio-language model (ALM) enhancing audio understanding and reasoning. ii) The research aims to develop an ALM with advanced capabilities in understanding and reasoning over both short and long audio segments, including non-speech sounds and music. iii) AF2 leverages a custom CLAP model, synthetic Audio QA data, and a multi-stage curriculum learning strategy. iv) AF2 achieves state-of-the-art performance on over 20 benchmarks, surpassing larger models, with a 3B parameter language model achieving up to 18.9% improvement on the LongAudioBench compared to Gemini F v2. v) AF2's ability to understand long audio segments offers AI practitioners new capabilities for real-world applications requiring contextual auditory cue processing, such as anomaly detection and assistive technologies. |
Identifying Sensitive Weights via Post-quantization Integral (Read more on arXiv or HuggingFace) | Weiyu Huang, surfingtomchen, jt-zhang, zcliang22, yuezhouhu | The paper introduces a novel sensitivity metric and quantization framework for compressing large language models (LLMs). The primary research objective is to develop a more accurate sensitivity metric for weight quantization that addresses limitations of existing gradient and Hessian-based methods. The key methodology is Post-quantization Integral (PQI), which estimates the impact of quantized weights on the loss function, along with a Dense-and-Sparse detach framework called ReQuant. Applying ReQuant to Llama 3.2 1B with QTIP quantization reduces perplexity by 2.66, showcasing the improvement. For AI practitioners, this method provides an effective way to improve post-training quantization of LLMs, achieving better compression with minimal accuracy loss. |
L$^2$M: Mutual Information Scaling Law for Long-Context Language Modeling (Read more on arXiv or HuggingFace) | Marin Soljačić, Di Luo, Zhuotao Jin, oriolmayne, zhuoc3 | This paper establishes a theoretical framework for understanding and improving long-context language modeling based on a bipartite mutual information scaling law. The main research question is how a language model's capacity to handle long-range dependencies scales with its internal state size and sequence length. The key methodology involves proving a "Long-context Language Modeling (L²M)" condition, theoretically relating model state size to bipartite mutual information, and empirically validating this scaling law using transformer and state space models on text datasets. The primary result is that bipartite mutual information in natural language scales as I ~ L^β (where β is between 0 and 1) and that a model's state size must grow at least as fast as I ~ L^β for effective long-context modeling. The principal implication for AI practitioners is that designing models for long-context tasks requires careful consideration of the history state's scaling, with transformers naturally satisfying this condition and other architectures (like SSMs) needing model size increases to maintain performance at longer sequence lengths. |
Dedicated Feedback and Edit Models Empower Inference-Time Scaling for Open-Ended General-Domain Tasks (Read more on arXiv or HuggingFace) | Ellie Evans, Daniel Egert, Jiaqi Zeng, Zhilin Wang, odelalleau | Dedicated Feedback and Edit Models enable inference-time scaling for open-ended tasks, achieving state-of-the-art performance by leveraging human feedback. i) Main research question or objective: How to perform inference-time scaling for open-ended general-domain tasks, inspired by human feedback, using dedicated Feedback and Edit Models. ii) Key methodology used: Trained dedicated Feedback and Edit Models on a curated dataset, leveraging human-provided feedback and edits. iii) Primary results: The optimally scaled system, based on 70B models from the Llama 3 family, achieved state-of-the-art performance on Arena Hard with a score of 92.7, surpassing OpenAI o1-preview-2024-09-12 (90.4) and DeepSeek R1 (92.3). iv) Principal implication for AI practitioners: This approach demonstrates a viable method for improving model performance on complex, open-ended tasks by using human feedback to train models to improve responses at inference. |
Union of Experts: Adapting Hierarchical Routing to Equivalently Decomposed Transformer (Read more on arXiv or HuggingFace) | Linhui Li, Jing Lian, yjyangwork | Union-of-Experts (UoE) decomposes transformers into equivalent experts and implements selective routing on input data and experts to improve model performance while maintaining efficiency. The main research objective is to address limitations of existing Mixture-of-Experts (MoE) methods, specifically the lack of high-quality expert interactions and inefficient extension to attention blocks. The key methodology involves equivalent expert decomposition on MLP and attention blocks via matrix partition, two routing paradigms (patch-wise data selection and expert selection), and parallel implementation of routing and computation. Primary results show UoE achieves an average perplexity reduction of 2.38 on language modeling tasks compared to the best-performing MoE method, using only 76% of the FLOPs. The principal implication for AI practitioners is that UoE offers a more efficient and performant approach to building transformer-based models, directly applicable to large-scale language and vision tasks. |
Lost in Literalism: How Supervised Training Shapes Translationese in LLMs (Read more on arXiv or HuggingFace) | Leyang Cui, Huajian Zhang, Zhilin Wang, Ronghao Zhang, yaful | This paper investigates and mitigates translationese (unnatural translations) in Large Language Models (LLMs) caused by biases introduced during supervised fine-tuning (SFT). The main research objective is to evaluate the prevalence of translationese in LLM-generated translations and investigate its origins during supervised training. The key methodology involves human annotation to identify translationese spans, analysis of training data, and mitigation strategies such as refining training references and filtering unnatural instances using perplexity. The primary results show that even advanced models like GPT-4 exhibit substantial translationese, with over 40% of their translations containing substantial translationese patterns, and that refining training data with LLMs reduces perplexity by 7.8 in the English-Chinese dataset. Principal implication for AI practitioners is that addressing translationese bias in SFT data, by polishing golden references or filtering, can improve the naturalness of LLM translation outputs. |
Combining Flow Matching and Transformers for Efficient Solution of Bayesian Inverse Problems (Read more on arXiv or HuggingFace) | Ekaterina Muravleva, oseledets, dsherki | The paper introduces a method combining Conditional Flow Matching (CFM) and transformers to efficiently solve Bayesian inverse problems. The main objective is to recover the distribution of model parameters conditioned on observed experimental data, given a series of observations and a forward model. The key methodology involves training a transformer-based CFM architecture to learn the conditional probability distribution from samples, handling a variable number of observations. Results showed that for a SEIR disease model, the average error was 2.05% ± 1.04% using a 4-point MLP model, significantly outperforming MCMC in computational efficiency. AI practitioners can leverage this approach for faster and more scalable sampling from posterior distributions in Bayesian inverse problems, particularly with datasets having variable-length observations. |
Understanding and Predicting Derailment in Toxic Conversations on GitHub (Read more on arXiv or HuggingFace) | Rebekah Copeland, Robert Zita, kdamevski, rahat-rizvi, imranraad | This research investigates conversational derailment leading to toxicity in GitHub discussions, aiming to predict and mitigate such occurrences proactively. The main research objective is to understand the characteristics of toxic conversations on GitHub and how these conversations derail into toxicity. The key methodology involves curating a dataset of toxic and non-toxic GitHub conversations, analyzing linguistic and conversational features, and developing a Large Language Model (LLM)-based approach using conversation trajectory summaries. The LLM prompts, tailored to provide summaries of GitHub conversations, achieved a 69% F1-score in predicting conversational derailment. AI practitioners can use this proactive, domain-specific, LLM-based moderation approach to identify and address potentially harmful conversations on platforms like GitHub before they escalate to toxicity. |
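
The PokéChamp row above describes a minimax search in which three components (player action sampling, opponent modeling, and value estimation) are supplied by an LLM. The sketch below shows only that plug-in structure, under loud assumptions: the function name `minimax`, its parameters, and the toy payoff table are hypothetical, the three callables are plain Python stand-ins rather than LLM calls, and turns are treated as a sequential max-then-min step, which is a simplification and not the authors' implementation.

```python
from typing import Any, Callable, Iterable, Optional, Tuple


def minimax(
    state: Any,
    depth: int,
    propose_actions: Callable[[Any], Iterable[Any]],   # stand-in for LLM action sampling
    predict_opponent: Callable[[Any], Iterable[Any]],  # stand-in for LLM opponent modeling
    step: Callable[[Any, Any, Any], Any],              # approximate world model
    evaluate: Callable[[Any], float],                  # stand-in for LLM value estimation
) -> Tuple[float, Optional[Any]]:
    """Depth-limited minimax: maximize over our actions, minimize over replies."""
    actions = list(propose_actions(state))
    replies = list(predict_opponent(state))
    # Stop at the depth limit or when either side has no candidate move.
    if depth == 0 or not actions or not replies:
        return evaluate(state), None

    best_value, best_action = float("-inf"), None
    for action in actions:
        # Assume the opponent picks the reply that is worst for us.
        worst = float("inf")
        for reply in replies:
            child = step(state, action, reply)
            child_value, _ = minimax(
                child, depth - 1, propose_actions, predict_opponent, step, evaluate
            )
            worst = min(worst, child_value)
        if worst > best_value:
            best_value, best_action = worst, action
    return best_value, best_action


# Toy game (hypothetical): the state is a running score for the player.
PAYOFF = {
    "attack": {"attack": -1, "guard": 2},
    "guard": {"attack": 1, "guard": 0},
}


def toy_actions(_state: int) -> list:
    return ["attack", "guard"]


def toy_step(state: int, action: str, reply: str) -> int:
    return state + PAYOFF[action][reply]


def toy_value(state: int) -> float:
    return float(state)


if __name__ == "__main__":
    print(minimax(0, 2, toy_actions, toy_actions, toy_step, toy_value))
```

Passing the proposal, opponent, and value functions as arguments keeps the search loop independent of how candidates are generated, which is the property that lets a language model be swapped in without changing the planner.
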
Title | Authors | Summary |
---|---|---|
Babel: Open Multilingual Large Language Models Serving Over 90% of Global Speakers (Read more on arXiv or HuggingFace) | LidongBing, maljunied, jhying, lukecq, Yiran0924 | Babel is an open multilingual large language model that supports 25 languages, covering over 90% of global speakers. The main objective is to develop an open-source multilingual LLM that addresses the underrepresentation of many widely spoken languages in existing models. The key methodology is layer extension, adding new layers to an existing model (Qwen2.5) and pre-training on a curated dataset emphasizing under-resourced languages. Babel-83B-Base achieves an average score of 73.2 across six multilingual benchmarks, outperforming comparable open models like Qwen2.5-72B (69.8). AI practitioners can use Babel as a strong base or chat model for multilingual applications, benefiting from enhanced performance, especially in low-resource languages, and from the use of layer extension in scaling the model. |
ABC: Achieving Better Control of Multimodal Embeddings using VLMs (Read more on arXiv or HuggingFace) | Florian Kerschbaum, Benjamin Schneider, wenhu | ABC is a multimodal embedding model that uses a vision-language model (VLM) backbone to integrate natural language instructions with visual inputs for improved control over embeddings. The main research objective is to develop a model that can effectively utilize user instructions to control and refine multimodal embeddings, overcoming limitations of existing CLIP-based models. The key methodology involves a two-stage training process: contrastive pretraining with mined negatives and instruction fine-tuning using synthetic instructions generated from image captions. The model achieves best-for-size performance on MSCOCO image-to-text retrieval with a R@1 score of 69.2 and outperforms all other models on the Massive Multimodal Embedding Benchmark (MMEB) for classification and VQA tasks. AI practitioners can use ABC's architecture and training approach to create multimodal embedding models with enhanced control via natural language, resulting in a flexible tool that improves performance of visual retrieval, classification, and VQA, as well as the ability to complete unique, instruction-specific tasks. |
Enhancing Abnormality Grounding for Vision Language Models with Knowledge Descriptions (Read more on arXiv or HuggingFace) | Cosmin I. Bercea, Rossella Arcucci, Wenjia Bai, Jun Li, che111 | This paper introduces a method to improve medical abnormality grounding in vision-language models (VLMs) using decomposed knowledge descriptions. The main research objective is to enhance the performance of VLMs in detecting and localizing medical abnormalities in images by improving the alignment between textual descriptions and visual features. The key methodology involves decomposing medical concepts into fundamental attributes and visual patterns, and using these attribute-based descriptions as prompts during VLM training. The proposed method, trained on only 1.5% of the data used by larger models, achieved a RoDeO score of 54.38% on the VinDr-CXR dataset, comparable to 7B parameter models like RadVLM. AI practitioners can use this knowledge-enhanced approach to achieve competitive performance in medical image abnormality grounding with significantly smaller VLMs and less training data, and improve zero-shot generalization. |
GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control (Read more on arXiv or HuggingFace) | Yifan Lu, Huan Ling, Jiahui Huang, Tianchang Shen, xrenaa | GEN3C is a generative video model with precise camera control and temporal 3D consistency. The main research objective is to develop a video generation model that allows for precise camera control and maintains 3D consistency across generated frames. The key methodology involves constructing a 3D cache (point clouds from depth estimates) and rendering it with user-provided camera trajectories to condition a fine-tuned video diffusion model. The results demonstrate that GEN3C achieves a PSNR of 18.66 and an SSIM of 0.67 on the Tanks-and-Temples dataset for single-view video generation, outperforming baselines. For AI practitioners, GEN3C offers a method for generating 3D-consistent videos with precise camera control by conditioning video generation on 3D renderings, improving controllability and consistency compared to prior video generation models. |
KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding (Read more on arXiv or HuggingFace) | Radha Poovendran, mingyuanzhou, yyqoni, nlpyang, flydust | KODCODE is a synthetic dataset of 447K coding problems with verified solutions and unit tests, designed to enhance code LLM training. The main research objective is to create a large-scale, diverse, and verifiable coding dataset that addresses limitations in existing resources for training large language models (LLMs) for code. The methodology involves a three-step pipeline: coding question synthesis from 12 sources, solution and test generation with self-verification, and post-training data synthesis via question rewriting and test-based rejection sampling using DeepSeek-R1. Models fine-tuned on KODCODE-SFT achieved a 61.26% average score across five coding benchmarks, outperforming models like Qwen2.5-Coder-32B-Instruct and DeepSeek-R1-Distill-Llama-70B. The principal implication is that AI practitioners can use KODCODE to improve the performance of code LLMs in supervised fine-tuning and potentially RL training, with verified solutions and tests offering advantages for various code-related tasks. |
CrowdSelect: Synthetic Instruction Data Selection with Multi-LLM Wisdom (Read more on arXiv or HuggingFace) | Pan Zhou, Wenxuan Shen, Lingfeng Yang, shuaishuaicdp, yisenL | CROWDSELECT, a novel synthetic instruction data selection framework, leverages multi-LLM responses and reward scores for improved instruction tuning. The main research objective is to investigate whether multi-dimensional signals derived from multiple LLMs can enhance the selection of synthetic instruction-response pairs for instruction tuning. The key methodology involves calculating three metrics (Difficulty, Separability, Stability) from multiple LLM responses and reward model assessments, and then integrating these with a clustering-based approach for diverse data selection. Primary results show that CROWDSELECT achieves state-of-the-art performance, improving instruction tuning by 4.81% on Arena-Hard and 11.1% on MT-bench with Llama-3.2-3b-instruct. The principal implication for AI practitioners is that leveraging multi-LLM wisdom through the proposed metrics and framework can lead to more efficient and effective instruction tuning, improving the performance of distilled smaller models. |
QE4PE: Word-level Quality Estimation for Human Post-Editing (Read more on arXiv or HuggingFace) | Malvina Nissim, Ana Guerberof-Arenas, Grzegorz Chrupała, Vilém Zouhar, gsarti | The QE4PE study investigates the impact of word-level quality estimation (QE) on professional machine translation post-editing, finding that factors beyond QE accuracy influence its real-world usefulness. The main research objective was to measure the effect of word-level QE error span highlighting on the editing quality, productivity, and usability in a realistic post-editing workflow. The methodology involved 42 professional translators post-editing machine-translated texts in English-Italian and English-Dutch, using four highlight modalities (supervised, unsupervised, oracle, and no highlights) and logging their editing behavior. Results showed that highlight modalities are not solely predictive of editing time and that cross-modality highlight overlap ranged between 15% and 39%. This implies that AI practitioners should consider factors beyond accuracy, such as domain, language, and user-specific factors, to improve the integration of word-level QE in post-editing tools and enhance their real-world usability. |
Exploring Rewriting Approaches for Different Conversational Tasks (Read more on arXiv or HuggingFace) | Xiang Chen, Mike Rimer, Ryan A. Rossi, Md Mehrab Tanjim, Franck-Dernoncourt | This paper systematically investigates query rewriting and fusion approaches for conversational AI tasks. The main research question is whether a single LLM-based query rewrite module can be universally effective across diverse conversational scenarios or if specialized modules are needed. The key methodology involves evaluating two parameterized query rewriting approaches (query rewrite and query fusion) on three datasets: conversational text-based Q&A and two text-to-visualization tasks (short and long conversations). The primary result is that for the conversational text-based Q&A task, the query rewrite approach achieved a 3.9% higher mean cosine similarity than query fusion, while for long text-to-vis tasks, query fusion had a 7.6% higher mean cosine similarity. The principal implication is that AI practitioners should select a query rewriting approach (either query rewrite or query fusion) that aligns with the specific conversational task and data characteristics, as no single approach is universally superior. |
Process-based Self-Rewarding Language Models (Read more on arXiv or HuggingFace) | Zheheng Luo, Junxiao Liu, Xin Zhang, Shimao Zhang, lx865712528 | The paper introduces Process-based Self-Rewarding Language Models, enhancing mathematical reasoning by incorporating step-wise evaluations and preference optimization. The main research objective is to improve the mathematical reasoning capabilities of large language models (LLMs) using a self-rewarding paradigm without external human feedback. The key methodology involves iterative training with step-wise LLM-as-a-Judge evaluations and step-wise preference optimization using Direct Preference Optimization (DPO). The primary result is that the 72B model, after four iterations, achieved an average accuracy of 60.6 across several math benchmarks, an improvement over the starting accuracy. The principal implication is that AI practitioners can improve LLMs' mathematical reasoning performance through iterative self-improvement without human-annotated data. |
Fine-Tuning Small Language Models for Domain-Specific AI: An Edge AI Perspective (Read more on arXiv or HuggingFace) | KartikAngadi, kruthika, SyedAbdul, RakshitAralimatti | The paper introduces the Shakti series of Small Language Models (SLMs) designed for efficient on-device AI, focusing on domain-specific applications. The main objective is to develop SLMs that can overcome the resource constraints of edge devices while maintaining high performance in specialized domains. Key methodologies include a combination of efficient transformer architectures, quantization-aware training, supervised fine-tuning, and preference alignment (RLHF or DPO). Primary results show that Shakti-500-Q4 achieves 583.88 tokens per second (TPS) on an NVIDIA L40s GPU and that the Shakti-250M model, after fine-tuning, achieves an answer relevance score of 0.86 in the finance domain. The paper's principal implication is that carefully engineered and fine-tuned compact models can be deployed effectively on edge devices, offering a practical approach for real-world, domain-specific AI applications with limited computational resources. |
Mixture of Structural-and-Textual Retrieval over Text-rich Graph Knowledge Bases (Read more on arXiv or HuggingFace) | Ryan A. Rossi, Haoyu Han, Yongjia Lei, mhalappa, Franck-Dernoncourt | This paper proposes a Mixture of Structural-and-Textual Retrieval (MoR) framework for answering queries over text-rich graph knowledge bases (TG-KBs). The main research objective is to develop a retrieval method that effectively combines both textual and structural information from TG-KBs to improve query answering performance. The key methodology is a Planning-Reasoning-Organizing framework, where the Planning stage generates textual planning graphs, the Reasoning stage interweaves structural traversal and textual matching, and the Organizing stage reranks candidates based on their structural trajectory. The primary result shows that MoR achieved an average Hit@1 score of 48.93%, outperforming other baselines on three TG-KB datasets. The principal implication is that AI practitioners can leverage MoR's mixture-of-experts approach to improve retrieval performance in applications that use graph knowledge bases by harmonizing textual and structural signals, which is especially useful for combining and ranking structural knowledge from graph data alongside traditional text features. |
Retrieval Models Aren't Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models (Read more on arXiv or HuggingFace) | Shuaiqiang Wang, Pengjie Ren, Lingyong Yan, Yuhan Wang, Zhengliang Shi | The paper introduces TOOLRET, a new benchmark for evaluating information retrieval (IR) models on tool retrieval tasks for large language models (LLMs). The main research objective is to assess the performance of existing IR models in retrieving relevant tools for LLMs in diverse, real-world scenarios, and to analyze the impact of retrieval quality on end-to-end task performance. The key methodology involves collecting and curating a large-scale dataset of 7.6k retrieval tasks and 43k tools from existing datasets, evaluating various IR models (sparse, dense, and re-ranking) on this benchmark, and contributing a large-scale training dataset (TOOLRET-train) to improve retrieval performance. A primary result is that the best-performing model (NV-Embed-v1) achieves an nDCG@10 of only 33.83 on the benchmark, indicating that existing IR models struggle with tool retrieval. The principal implication is that, since current strong IR models are not effective for tool retrieval, AI practitioners need to develop new retrieval methods tailored to it, or improve current methods using target-aware reasoning and large-scale training data, as the paper demonstrates with TOOLRET-train. |
FLAME: A Federated Learning Benchmark for Robotic Manipulation (Read more on arXiv or HuggingFace) | Danica Kragic, Yuchong Zhang, Miguel Vasco, Alberta Longhini, Santiago Bou Betran | FLAME is a new benchmark for federated learning in robotic manipulation, providing datasets and a framework for distributed training. The main objective is to evaluate federated learning (FL) strategies for training robotic manipulation policies in a distributed, privacy-preserving manner. The key methodology involves creating a large-scale dataset of diverse manipulation tasks across multiple simulated environments and integrating it into an FL framework using FLOWER, where local models are trained and aggregated. Primary results show that Federated Averaging (FedAvg) achieves a 2.64 ± 0.13 RMSE on the Slide Block to Target task, but performance varies significantly across tasks and FL methods. The principal implication for AI practitioners is that FLAME provides a standardized benchmark for evaluating and developing scalable, adaptive, and privacy-aware robotic learning systems, although further development of FL algorithms is necessary. |
Benchmarking Large Language Models for Multi-Language Software Vulnerability Detection (Read more on arXiv or HuggingFace) | Hung Nguyen, Martin Weyssow, Yindu Su, Chengran Yang, Ting Zhang | This paper presents a comprehensive empirical study evaluating large language models (LLMs) on software vulnerability detection (SVD) across multiple programming languages. The main research objective is to investigate the effectiveness of various LLMs in predicting software vulnerabilities, comparing them with smaller language models (SLMs) and static application security testing (SAST) tools, and exploring strategies to improve LLM performance. The key methodology involves compiling a multi-language dataset (Python, Java, JavaScript) of vulnerable functions, evaluating five open-source LLMs using prompt engineering, instruction tuning, and sequence classification fine-tuning, and comparing them against SLMs and SAST tools. The results show that fine-tuned LLMs achieved the best F1-score of 0.443 on the JavaScript dataset, with performance varying significantly across programming languages and adaptation strategies. The principal implication for AI practitioners is that while LLMs show promise for SVD, particularly in JavaScript with fine-tuning, performance is highly dependent on data characteristics, requiring careful consideration of language, model selection, and adaptation strategies. |
CognitiveDrone: A VLA Model and Evaluation Benchmark for Real-Time Cognitive Task Solving and Reasoning in UAVs (Read more on arXiv or HuggingFace) | Artyom Myshlyaev, Oleg Sautenkov, Muhammad Haris Khan, Valerii Serpiva, Artem Lykov | CognitiveDrone, a Vision-Language-Action (VLA) model and benchmark for real-time cognitive task solving in UAVs, is introduced. The main research objective is to develop and evaluate a UAV control system capable of performing complex cognitive tasks, including human recognition, symbol understanding, and reasoning, based on visual input and textual instructions. The methodology combines a 7B-parameter VLA model (adapted from OpenVLA) trained on a dataset of over 8,000 simulated flight trajectories with an optional 7B-parameter VLM reasoning module (Qwen2.5-VL based) for task refinement, and evaluates performance within a Gazebo-based simulation benchmark (CognitiveDroneBench). The CognitiveDrone-R1 model, incorporating the reasoning module, achieved a 77.2% overall success rate, outperforming the base CognitiveDrone model (59.6%) and a racing-oriented model (RaceVLA, 31.3%). AI practitioners can utilize the provided open-source dataset, benchmark environment, and model weights to develop and evaluate VLA models for UAVs that incorporate cognitive capabilities beyond basic navigation and control. |
Interact, Instruct to Improve: A LLM-Driven Parallel Actor-Reasoner Framework for Enhancing Autonomous Vehicle Interactions (Read more on arXiv or HuggingFace) | Peng Hang, Chen Lv, Chengkai Xu, Jiaqi Liu, FanGShiYuu | This paper introduces an LLM-driven Actor-Reasoner framework for autonomous vehicles (AVs) to improve bidirectional interactions with human-driven vehicles (HVs). The main objective is to enhance AVs' real-time decision-making and intent expression capabilities in complex driving scenarios with heterogeneous HVs. The methodology involves a parallel Actor-Reasoner architecture; the Reasoner uses an LLM with Chain-of-Thought (CoT) reasoning to infer HV driving styles and generate eHMI displays, while the Actor employs a two-layer memory retrieval mechanism from a database constructed during training with simulated HVs. Results show that the proposed framework achieves a 94% success rate in intersection scenarios, and a memory partition module improves retrieval speed by an average of 12%. AI practitioners can use this framework as a method to integrate LLMs into real-time decision-making systems, addressing LLM inference speed limitations by combining reasoning capabilities with memory-based fast retrieval. |
SwiLTra-Bench: The Swiss Legal Translation Benchmark (Read more on arXiv or HuggingFace) | Yingqiang Gao, Sina Ahmadi, Luka Nenadic, Jakob Merane, Joel Niklaus | SwiLTra-Bench introduces a multilingual benchmark for evaluating LLM-based translation systems on Swiss legal texts, comprising 180K aligned translation pairs across five languages. The main research objective was to evaluate the performance of frontier LLMs and fine-tuned open SLMs on Swiss legal translations in zero-shot and fine-tuning settings, including the development of an LLM-based evaluation metric. Key methodology included systematic evaluation using lexical and model-based metrics, fine-tuning open SLMs, human expert validation, and developing a specialized LLM evaluation system (SwiLTra-Judge). Primary results showed that frontier models like Claude-3.5-Sonnet outperformed others, achieving a GEMBA-MQM score of 80.66, while fine-tuned open SLMs improved but still lagged behind. For AI practitioners, this benchmark and the associated evaluations highlight that while frontier models provide superior legal text translation, fine-tuning offers significant improvement for open SLMs, and SwiLTra-Judge can serve as a reliable automated evaluation tool that aligns well with human experts. |
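
Several rows above report ranking metrics (Hit@1 for MoR, nDCG@10 for TOOLRET). As a rough illustration of what those numbers measure, here is a minimal sketch assuming the common linear-gain, log2-discount form of nDCG; the papers may use a different variant, and the function names `hit_at_k` and `ndcg_at_k` are hypothetical, not taken from either codebase.

```python
import math


def hit_at_k(ranked_ids, relevant_ids, k=1):
    """Return 1.0 if any of the top-k retrieved items is relevant, else 0.0."""
    return 1.0 if any(item in relevant_ids for item in ranked_ids[:k]) else 0.0


def ndcg_at_k(ranked_ids, relevance, k=10):
    """nDCG@k with linear gain and log2 position discount.

    `relevance` maps item id -> graded relevance (missing ids count as 0).
    """
    gains = [relevance.get(item, 0.0) for item in ranked_ids[:k]]
    dcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))
    # The ideal ranking places the most relevant items first.
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


if __name__ == "__main__":
    ranking = ["tool_b", "tool_a", "tool_c"]   # hypothetical retrieval output
    labels = {"tool_a": 1.0}                   # hypothetical ground truth
    print(hit_at_k(ranking, set(labels), k=1))
    print(ndcg_at_k(ranking, labels, k=10))
```

Benchmark scores such as 33.83 are typically this per-query value averaged over all queries and multiplied by 100.
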
Title | Authors | Summary |
---|---|---|
MPO: Boosting LLM Agents with Meta Plan Optimization (Read more on arXiv or HuggingFace) | sujianli, songff, Adagio, Rsy24, xwm | The paper introduces Meta Plan Optimization (MPO), a framework that enhances large language model (LLM) agents' planning capabilities by incorporating optimized, high-level meta plans. The main research objective is to improve LLM-based agents' performance on interactive planning tasks without requiring retraining for each new agent, while addressing planning hallucinations. MPO leverages a meta planner that generates abstract task strategies, optimized via a combination of supervised fine-tuning, Monte Carlo sampling, and Direct Preference Optimization (DPO) using agent feedback. Experiments on ALFWorld and ScienceWorld benchmarks demonstrate that MPO significantly outperforms existing baselines, with performance improvements of up to 100% for some agents. For AI practitioners, MPO offers a plug-and-play solution to boost agent performance and generalization in planning tasks by supplying general guidance that can itself be further optimized. |
Mask-DPO: Generalizable Fine-grained Factuality Alignment of LLMs (Read more on arXiv or HuggingFace) | Kai Chen, Chengqi Lyu, lindahua, ZwwWayne, vanilla1116 | Mask-DPO is a fine-grained factuality alignment method for LLMs that leverages sentence-level factuality to improve preference learning and reduce hallucinations. The main research objective is to develop a more effective and generalizable method for aligning LLMs with factual correctness, addressing limitations of response-level preference learning. The key methodology, Mask-DPO, incorporates sentence-level factuality annotations as mask signals in Direct Preference Optimization (DPO), selectively learning from correct sentences in preferred responses and avoiding penalties on factual content in non-preferred responses. Primary results show that Mask-DPO improved the factuality score of Llama3.1-8B-Instruct on the ANAH test set from 49.19% to 77.53%. Principal implication for AI practitioners is that Mask-DPO provides a more precise alignment technique that enhances factuality and generalization in LLMs, enabling the development of more reliable and trustworthy AI assistants. |
Wikipedia in the Era of LLMs: Evolution and Risks (Read more on arXiv or HuggingFace) | Yao Wan, fjchendp, mgeng, sdzzxyl, hsm316 | This paper analyzes the impact of Large Language Models (LLMs) on Wikipedia, examining its evolution and potential risks to the broader NLP community. The primary research objective is to determine if and how LLMs have already impacted Wikipedia, and how this might influence the NLP community. The key methodology involves analyzing Wikipedia page views and article content, and simulating LLM impact on machine translation benchmarks and Retrieval-Augmented Generation (RAG) systems. Primary results indicate that Wikipedia articles have been influenced by LLMs, with an estimated impact of 1%-2% in certain categories, and simulations show potential score inflation in machine translation benchmarks and reduced performance in RAG systems that use LLM-generated content. The principal implication for AI practitioners is that reliance on Wikipedia for training and evaluating NLP models may be affected by LLM-generated content, necessitating careful consideration of data provenance and potential biases. |
LADDER: Self-Improving LLMs Through Recursive Problem Decomposition (Read more on arXiv or HuggingFace) | akiray1, TamasSimonds | LADDER is a framework enabling large language models (LLMs) to autonomously improve problem-solving through self-guided learning by recursively generating and solving simpler problem variants. The main research objective is to develop a method for LLMs to improve their mathematical integration capabilities without curated datasets or human feedback. The key methodology, LADDER, involves recursive generation of simpler problem variants, solution verification via numerical integration, and reinforcement learning (using GRPO) on the variant trees. LADDER improved a Llama 3.2 3B model's accuracy on undergraduate-level integration problems from 1% to 82%, and, with test-time reinforcement learning (TTRL), a Qwen 2.5 7B model achieved 90% on the MIT Integration Bee. AI practitioners can leverage self-improving systems like LADDER and TTRL to enhance model capabilities in verifiable domains without extensive human supervision or data curation, demonstrating a practical path to developing more autonomous and capable AI. |
MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents (Read more on arXiv or HuggingFace) | mikewang, ShuyiGuo, Thomas-X-Yang, zhaochenhong, Leozkl | MultiAgentBench is a benchmark designed to evaluate LLM-based multi-agent systems across diverse interactive scenarios, measuring task completion and the quality of collaboration and competition. The main research objective is to assess how well LLM-based multi-agent systems perform in collaborative and competitive environments, using novel milestone-based key performance indicators. The methodology involves evaluating various coordination protocols (star, chain, tree, graph) and strategies (group discussion, cognitive planning) in six interactive scenarios, including research, Minecraft, database, coding, bargaining, and Werewolf, developed using the MARBLE framework. Results show gpt-4o-mini achieves the highest average task score, graph structure performs best in research, and cognitive planning improves milestone achievement rates by 3%. For AI practitioners, the framework and benchmark provide a means to systematically evaluate and improve multi-agent coordination, which is critical in developing more effective and collaborative AI systems. |
PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization (Read more on arXiv or HuggingFace) | Min Lin, Xinyi Wan, JialinLi, huanggx-sea, QPHutu | PipeOffload enhances pipeline parallelism (PP) scalability for large language models (LLMs) by optimizing activation memory usage through offloading. The main research objective is to address the activation memory bottleneck in PP that limits its scalability. The key methodology involves selectively offloading activations to host memory, prioritizing those with longer lifespans, and integrating a generalized interleaving strategy for balancing memory and throughput. The primary result is that PipeOffload reduces per-device activation memory in a better-than-linear manner, enabling up to a 19% acceleration compared to tensor parallelism (TP), while using less memory in applicable cases. For AI practitioners, PipeOffload provides a more scalable PP method, especially beneficial when full activation offload is feasible (k <= 1), allowing for more efficient training of large models. |
Iterative Value Function Optimization for Guided Decoding (Read more on arXiv or HuggingFace) | Ruizhe Chen, jokephp, ab3223323, lljhbxt, zhliu | Iterative Value Function Optimization (IVO) is a novel framework for guided decoding that improves the accuracy of value estimation in language models without retraining the base model. The main research objective is to address the limitations of existing value-guided decoding methods, which suffer from inaccurate value estimation due to high variance and distribution shift. The key methodology involves two components: Monte Carlo Value Estimation, which reduces estimation variance by exploring diverse trajectories, and Iterative On-Policy Optimization, which progressively improves value estimation through collecting trajectories from value-guided policies. Primary results show that IVO achieves 77.52% GPT-4 win rates on the Multi-turn Dialogue task against the base policy, significantly outperforming baseline methods in terms of reward scores across various tasks. Principal implication for AI practitioners is that IVO offers a computationally efficient way to align language models with human values and task requirements, improving control over model outputs without expensive retraining. |
FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling (Read more on arXiv or HuggingFace) | yuxuanli, zwl96, hyx21, ThonyPan, Achazwl | FR-Spec accelerates large-vocabulary language models by optimizing draft candidate selection in speculative sampling. The main research objective is to address the increased computational overhead of the LM Head in speculative sampling when using models with large vocabularies. The key methodology is frequency-ranked speculative sampling, which constrains the draft search to a frequency-prioritized token subset, reducing LM Head computation. Primary results show an average 1.12x speedup over the state-of-the-art speculative sampling method EAGLE-2 on multiple datasets, with optimized drafting reducing computation by 75%. For AI practitioners, this method provides a plug-and-play solution to accelerate existing speculative sampling techniques without retraining, directly improving inference speed for large-vocabulary language models. |
SemViQA: A Semantic Question Answering System for Vietnamese Information Fact-Checking (Read more on arXiv or HuggingFace) | Thanh T. Tran, ThanhDi, TienAnh, xuandin, DavidNguyen | SemViQA is a Vietnamese language fact-checking system that enhances accuracy and efficiency through semantic understanding. The main research objective is to develop a robust fact-checking system for Vietnamese, a low-resource language, addressing challenges like semantic ambiguity and long-token sequences. The key methodology integrates Semantic-based Evidence Retrieval (SER), combining TF-IDF and a Question Answering Token Classifier (QATC), with a Two-step Verdict Classification (TVC) using Focal Loss and Cross-Entropy Loss. The system achieves a strict accuracy of 80.82% on the ViWikiFC dataset and 78.97% on ISE-DSC01. The principal implication is that AI practitioners can leverage SemViQA's framework, particularly its SER and TVC components, to develop more efficient, robust, and effective fact-checking systems that handle complex linguistic structures, especially in low-resource languages. |
UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface (Read more on arXiv or HuggingFace) | windmillknight, Shawnee-bxy, Haiyang-W, chenweix7, kanashi6 | UFO unifies fine-grained visual perception tasks through an open-ended language interface, achieving state-of-the-art performance without task-specific decoders. The main research objective is to effectively integrate fine-grained perception tasks (like detection and segmentation) into multimodal large language models (MLLMs) without relying on complex, task-specific designs. The key methodology involves transforming all perception targets into the language space and using a novel embedding retrieval approach for segmentation, relying solely on the language interface. After multi-task training, UFO outperforms previous state-of-the-art generalist models by 12.3 mAP on COCO instance segmentation and 3.3 mIoU on ADE20K semantic segmentation. AI practitioners can leverage UFO's unified framework to simplify architectural design and training, seamlessly integrating fine-grained perception capabilities into MLLMs for enhanced visual understanding and enabling more challenging vision-language tasks. |
ATLaS: Agent Tuning via Learning Critical Steps (Read more on arXiv or HuggingFace) | Yuxuan Huang, Ming Li, Zhixun Chen, zhoutianyi, YaliDU | ATLaS finetunes large language model (LLM) agents on critical steps within expert trajectories to improve generalization and reduce training costs. The main research objective is to develop a more efficient and effective agent tuning method by identifying and focusing on critical steps in expert trajectories. The key methodology, ATLaS, uses an oracle LLM to select critical steps based on criteria like plan creation, critical observation, critical action, and self-correction, then finetunes the agent's LLM solely on these steps. Results show that an LLM finetuned on only the ~30% of steps that ATLaS selects as critical outperforms both the LLM finetuned on all steps and recent open-source LLM agents. The principal implication is that AI practitioners can achieve better agent generalization and performance with reduced training costs by focusing LLM finetuning on semantically critical steps identified by an oracle LLM. |
Language Models can Self-Improve at State-Value Estimation for Better Search (Read more on arXiv or HuggingFace) | rittera, emendes3 | Self-taught lookahead (STL) enables language model-based value functions to improve without ground truth rewards by leveraging state-transition dynamics. The main research objective is to demonstrate that an LLM-based value function can self-improve without labels or rewards, outperforming computationally expensive methods. The key methodology, STL, fine-tunes a value model by predicting the next best action, resulting state, and value rationale, bootstrapping from an initial value function using lookahead in tree search. Results show that STL-improved models match the performance of a GPT-4 value model, improving performance by 20% while reducing inference costs 37x compared to prior LLM-based tree search. Principal implication is that AI practitioners can utilize STL to train efficient and effective value models for search-based tasks, reducing reliance on expensive closed-source models and ground truth rewards. |
RectifiedHR: Enable Efficient High-Resolution Image Generation via Energy Rectification (Read more on arXiv or HuggingFace) | Liang Hou, dizhang, wileewang, PaulSHEN1, YZCS | RectifiedHR is a training-free method for generating high-resolution images with diffusion models by addressing energy decay and employing noise refresh. The main objective is to enable diffusion models to efficiently generate images at resolutions higher than their training resolution without additional training. The key methodology involves a noise refresh strategy to progressively increase resolution during sampling and an energy rectification strategy that adjusts classifier-free guidance to mitigate image blurriness. The primary result is that RectifiedHR achieves a FID score of 25.347 and a CLIP score of 33.756 at 2048x2048 resolution, outperforming several baselines in image quality while using less computing time. The principal implication is that AI practitioners can generate high-quality, high-resolution images using pre-trained diffusion models without costly retraining or complex modifications, by using noise refresh and energy rectification steps during image generation. |
SPIDER: A Comprehensive Multi-Organ Supervised Pathology Dataset and Baseline Models (Read more on arXiv or HuggingFace) | Ekaterina Ivanova, alpchel, mgvz | SPIDER is a new multi-organ histopathology dataset with baseline models for patch-level classification and whole-slide image segmentation. The main research objective is to create and evaluate a large, high-quality, multi-organ, patch-level histopathology dataset with comprehensive class coverage, along with baseline classification models. The key methodology combines a semi-automatic annotation pipeline, expert pathologist verification, feature extraction with the Hibou-L foundation model, and an attention-based classification head. On the thorax test set, the model achieved an accuracy of 0.962, precision of 0.958, and F1 score of 0.960. AI practitioners can use this dataset and these models to improve digital pathology tasks like tissue classification and rapid identification, providing a new benchmark for future developments in this field. |
Q-Eval-100K: Evaluating Visual Quality and Alignment Level for Text-to-Vision Content (Read more on arXiv or HuggingFace) | Zicheng Zhang, GTZhai, a9108, sl2782087, wcain | The paper introduces Q-Eval-100K, a large-scale dataset, and Q-Eval-Score, a unified model, for evaluating visual quality and text-image/video alignment in text-to-vision generation. The main research objective is to develop a comprehensive benchmark and method for assessing both the visual quality and text-alignment of content generated by text-to-vision models. The key methodology involves collecting 100K instances (images and videos) with 960K human annotations of Mean Opinion Scores (MOS) and developing Q-Eval-Score, a Large Multimodal Model (LMM) fine-tuned using a context-prompt format. The primary results show that Q-Eval-Score achieves a 0.943 SRCC for image visual quality at the model level, outperforming existing methods; the paper also introduces a Vague-to-Specific strategy for long-prompt alignment. AI practitioners can use Q-Eval-100K and Q-Eval-Score as a reliable benchmark and evaluation metric to assess and improve the performance of text-to-vision generative models, focusing on both visual quality and text-alignment. |
IterPref: Focal Preference Learning for Code Generation via Iterative Debugging (Read more on arXiv or HuggingFace) | Ruihang, yangyu90, Jianwen2003, CharonBony, Ringo1110 | IterPref is a new preference alignment framework for code generation that improves Code LLMs through iterative debugging. The research objective is to address the limitation of existing preference learning methods that do not pinpoint specific code errors, hindering the learning of informative error correction patterns. The key methodology is IterPref, which involves creating the CodeFlow dataset where code is iteratively refined until passing tests, and using a tailored DPO algorithm to align corresponding tokens for error regions. The primary result is that, equipped with IterPref, Qwen2.5-Coder-7B achieved a 29.7% pass@1 score on BigCodeBench Complete Hard, on par with some much larger models. For AI practitioners, this offers an effective way to enhance code generation models by using an iterative debugging process for precise preference learning, focusing the model's learning on correcting critical errors. |
AppAgentX: Evolving GUI Agents as Proficient Smartphone Users (Read more on arXiv or HuggingFace) | Chi Zhang, Wenjia Jiang, xuyang, ChenxiSong, yyzhuang2 | AppAgentX introduces an evolutionary framework for GUI agents that improves operational efficiency on smartphones while maintaining adaptability. The main research objective is to address the inefficiency of LLM-based GUI agents in performing routine tasks by enabling them to learn and evolve high-level actions. The key methodology involves a memory mechanism that records task execution history, allowing the agent to identify repetitive action sequences and replace them with abstract, high-level actions represented as "shortcut nodes". Primary results show that on the AppAgent benchmark, AppAgentX reduced the average steps per task from 9.1 to 5.7 and increased the success rate from a baseline of 16.9% to 71.4%. For AI practitioners, this evolutionary framework offers a method to develop GUI agents that execute routine operations more efficiently while using the LLM only to optimize new behavior, thus improving the balance between intelligence and efficiency in practical applications. |
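
The Mask-DPO row above describes using sentence-level factuality labels as masks inside the DPO objective. The snippet below is only a sketch of that idea under stated assumptions, not the authors' implementation: it takes the standard DPO loss and restricts each response's log-probability ratio to masked-in tokens. The helper names, the per-token inputs, and the choice of `beta=0.1` are all hypothetical.

```python
import math


def masked_logratio(policy_logps, ref_logps, mask):
    """Sum of (policy - reference) token log-probs over tokens where mask == 1."""
    return sum(m * (p - r) for p, r, m in zip(policy_logps, ref_logps, mask))


def masked_dpo_loss(chosen, rejected, beta=0.1):
    """Standard DPO loss, -log(sigmoid(margin)), computed on masked tokens only.

    `chosen` and `rejected` are (policy_logps, ref_logps, mask) triples.
    Assumption: the chosen mask keeps tokens from sentences judged factual,
    and the rejected mask keeps tokens from sentences judged non-factual.
    """
    margin = beta * (masked_logratio(*chosen) - masked_logratio(*rejected))
    # -log(sigmoid(x)) == log(1 + exp(-x))
    return math.log1p(math.exp(-margin))


if __name__ == "__main__":
    # Toy two-token responses; the numbers are made up for illustration.
    chosen = ([-1.0, -2.0], [-1.5, -2.0], [1, 0])
    rejected = ([-1.0, -1.0], [-1.0, -0.5], [0, 1])
    print(masked_dpo_loss(chosen, rejected))
```

With all masks set to 1 this reduces to ordinary response-level DPO, which is the baseline the summary contrasts against.
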
Title | Authors | Summary |
---|---|---|
Visual-RFT: Visual Reinforcement Fine-Tuning (Read more on arXiv or HuggingFace) | yhcao, sweetFruit, yuhangzang, Zery, ziyuliu | Visual-RFT extends Reinforcement Fine-Tuning (RFT) to visual tasks by using verifiable rewards to improve performance of Large Vision-Language Models (LVLMs). The main objective is to apply RFT, previously successful in language models, to multi-modal domains, specifically visual perception tasks, with limited data. The key methodology is using LVLMs to generate multiple responses with reasoning tokens and applying visual perception verifiable reward functions (e.g., IoU for object detection) to update the model via policy optimization algorithms like Group Relative Policy Optimization (GRPO). Visual-RFT improved accuracy by 24.3% over the baseline in one-shot fine-grained image classification and exceeded SFT baselines by 21.9 and 15.4 on COCO and LVIS, in two-shot settings, respectively. For AI practitioners, Visual-RFT offers a data-efficient, reward-driven approach to enhance reasoning and adaptability in LVLMs for domain-specific tasks, particularly when fine-tuning data is scarce. |
Difix3D+: Improving 3D Reconstructions with Single-Step Diffusion Models (Read more on arXiv or HuggingFace) | zgojcic, AnalMom, xrenaa, hturki, jayw | DIFIX3D+ enhances 3D reconstruction and novel-view synthesis using single-step diffusion models. The main research objective is to improve the quality of 3D reconstructions, especially in under-constrained regions, by leveraging 2D diffusion model priors. The methodology involves fine-tuning a single-step image diffusion model (DIFIX) to remove artifacts in rendered novel views, and using it both during reconstruction to clean pseudo-training views and as a neural enhancer during inference. Primary results show an average 2x improvement in FID score over baselines while maintaining 3D consistency, with compatibility across both NeRF and 3DGS representations. The principal implication is that AI practitioners can leverage single-step diffusion models for real-time post-processing to improve the visual quality of 3D reconstructions and novel view synthesis. |
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs (Read more on arXiv or HuggingFace) | vishravmsft, martincai, alonbenhaim, jianmin-ustc, atabakashfaqMSFT | Phi-4-Mini and Phi-4-Multimodal are 3.8-billion-parameter language and multimodal models trained on high-quality data, achieving strong performance relative to their size. The main research objective is to develop compact yet highly capable language and multimodal models that outperform similar-sized open-source models and rival larger models, using curated data and novel architecture techniques. The key methodology involves training Phi-4-Mini on high-quality web and synthetic data, with emphasis on math and coding datasets, expanding the vocabulary to 200K tokens, and using grouped query attention and a fractional RoPE dimension; for Phi-4-Multimodal, a "Mixture of LoRAs" technique integrates modality-specific LoRAs while freezing the base language model. Primary results show that Phi-4-Mini outperformed similarly sized models and matched the performance of models twice its size on math/coding, and that Phi-4-Multimodal ranked first on the OpenASR leaderboard at the time, with the speech/audio LoRA having only 460 million parameters. Phi-4-Multimodal also outperformed larger vision-language models, achieving a 72.0 average score across various vision-language benchmarks. The principal implication is that AI practitioners can use Phi-4-Mini and Phi-4-Multimodal as efficient small language and multimodal models that add modalities while keeping the base language model frozen, making them a practical option in resource-constrained environments. |
OneRec: Unifying Retrieve and Rank with Generative Recommender and Iterative Preference Alignment (Read more on arXiv or HuggingFace) | GuoruiZhou, DingWF, caikuo, oneself, OrpheusBetter | OneRec is an end-to-end generative recommendation model that unifies retrieval and ranking stages. The main research objective is to develop a single-stage generative model that surpasses the performance of traditional multi-stage recommender systems in real-world scenarios. The key methodology involves an encoder-decoder architecture with Mixture-of-Experts (MoE), session-wise generation, and Iterative Preference Alignment (IPA) combined with Direct Preference Optimization (DPO) using a reward model. Primary results show that OneRec deployed in Kuaishou's main scene achieved a 1.68% increase in watch-time, a substantial improvement over the previous system. For AI practitioners, OneRec demonstrates the feasibility of achieving significant performance gains by replacing a cascaded ranking system with a unified generative model by utilizing techniques like MoE and IPA. |
Liger: Linearizing Large Language Models to Gated Recurrent Structures (Read more on arXiv or HuggingFace) | Yu Cheng, JusenK, Jiaxihu2, weigao266, landisen | Liger transforms pretrained Transformer-based large language models (LLMs) into gated linear recurrent structures for efficient deployment. The main research objective is to linearize LLMs into gated recurrent structures without adding extra parameters and with minimal performance loss. The key methodology involves repurposing pretrained key matrix weights to construct gating mechanisms and using Low-Rank Adaptation (LoRA) for lightweight fine-tuning. The primary result is that Liger recovers 93% of the Transformer-based Llama-3 8B model's performance using only 0.02% of pre-training tokens during linearization. AI practitioners can deploy LLMs more efficiently with linear-time inference and constant memory usage by converting them to gated recurrent structures using Liger. |
When an LLM is apprehensive about its answers -- and when its uncertainty is justified (Read more on arXiv or HuggingFace) | Alexey Zaytsev, Edvard Khalafyan, DanielVyazhev, aigoncharov, sspetya | The paper investigates uncertainty estimation in Large Language Models (LLMs) for multiple-choice question answering, focusing on entropy and model-as-judge (MASJ) approaches. The main research question is how well token-wise entropy and MASJ estimates reflect LLM error and question difficulty across different domains and reasoning requirements. The key methodology involves evaluating three LLMs (Phi-4, Mistral, Qwen) on the MMLU-Pro dataset, using an auxiliary LLM to label questions by reasoning/knowledge needs and comparing uncertainty estimates with correctness labels. A primary result is that response entropy predicts model error effectively in knowledge-dependent domains (biology ROC AUC = 0.73), but this correlation weakens for reasoning-dependent domains (math ROC AUC = 0.55). For AI practitioners, this indicates that data-uncertainty-related entropy is a useful measure in uncertainty estimation frameworks and should be integrated, but its usefulness depends on how much reasoning is required to solve the problem. |
DiffRhythm: Blazingly Fast and Embarrassingly Simple End-to-End Full-Length Song Generation with Latent Diffusion (Read more on arXiv or HuggingFace) | Guobin Ma, Chunbo Hao, Yuepeng Jiang, Huakang Chen, Ziqian Ning | DiffRhythm is a latent diffusion-based model that generates full-length songs with vocals and accompaniment, achieving high musicality, intelligibility, and fast inference speeds. The main research objective is to develop an end-to-end song generation model capable of synthesizing complete songs (up to 4m45s) with both vocal and accompaniment, overcoming limitations of existing approaches like multi-stage architectures and slow inference. Key methodology involves a Variational Autoencoder (VAE) for learning compact latent representations of waveforms and a Diffusion Transformer (DiT) operating in the latent space, along with a novel sentence-level lyrics alignment mechanism. Primary results show that DiffRhythm achieves a Phoneme Error Rate (PER) of 18.02% in full-length song generation with a real-time factor (RTF) of 0.034. AI practitioners can leverage DiffRhythm's simple architecture, fast non-autoregressive generation, and open-sourced code/models for scalable, end-to-end song generation research and applications, eliminating the need for complex multi-stage cascading modelling. |
Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs (Read more on arXiv or HuggingFace) | ngoodman, nlile, Asap7772, ayushchakravarthy, obiwan96 | i) This paper investigates cognitive behaviors that enable language models to effectively self-improve via reinforcement learning. ii) The research question is: what intrinsic properties enable effective self-improvement in language models trained with reinforcement learning? iii) The methodology involves analyzing verification, backtracking, subgoal setting, and backward chaining in Qwen and Llama models during reinforcement learning on the Countdown game, alongside controlled behavioral dataset experiments and pretraining data curation. iv) Results show that Qwen naturally exhibits these reasoning behaviors whereas Llama lacks them; priming Llama with these behaviors enables substantial improvements during RL; models primed with incorrect solutions but proper reasoning patterns achieve comparable performance to those trained on correct solutions; and curated pretraining data amplifies Llama's reasoning behaviors. v) AI practitioners should consider the initial reasoning behaviors of language models as a critical factor in determining their capacity for self-improvement via reinforcement learning, and potentially curate pretraining data to enhance those behaviors. |
Speculative Ad-hoc Querying (Read more on arXiv or HuggingFace) | Venkat Arun, Aditya Akella, Maria Angels de Luis Balaguer, Srikanth Kandula, Haoyu0529 | SpeQL, a system that reduces query latency by using large language models (LLMs) to predict and precompute SQL queries during user input, improves analytical query responsiveness. The research objective is to determine if query execution can begin before a user finishes typing an SQL query, enabling near-instantaneous results. The methodology involves using LLMs to predict query structure and precompute temporary tables, alongside a scheduler that manages query execution and a user interface that displays speculative results. Results from experiments on 103 TPC-DS queries at 100GB scale show that SpeQL reduces P90 planning, compilation, and execution latency by 94.42%, 99.99%, and 87.23%, respectively, with a 7.72 seconds P90 execution overhead. AI practitioners can leverage SpeQL's approach to improve the responsiveness of interactive data analysis systems, thereby enabling quicker insight discovery during exploratory data analysis. |
Qilin: A Multimodal Information Retrieval Dataset with APP-level User Sessions (Read more on arXiv or HuggingFace) | Xiaohui He, Jia Chen, aiqy, haitaoli, qian | Qilin is a new multimodal information retrieval dataset collected from a social platform, Xiaohongshu, for improving search and recommendation services. The main research objective is to create a dataset that facilitates the development of advanced multimodal neural retrieval models across diverse task settings with real-world user interaction data. The key methodology involves collecting user sessions with heterogeneous results (image-text, video, commercial notes, direct answers) and APP-level contextual signals, then filtering the data using LLMs and human verification for safety and privacy. Primary results include a dataset of APP-level sessions from 15,482 users, where search users browse an average of 23.41 items when Deep Query Answering (DQA) is not triggered, but only 10.61 items when DQA is triggered. Principal implication for AI practitioners is that Qilin provides a realistic, large-scale, multimodal dataset with rich contextual information for training, evaluating, and analyzing retrieval-augmented generation systems and other advanced search and recommendation models, taking into account complex user behaviors. |
DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting (Read more on arXiv or HuggingFace) | xpqiu, QipengGuo, KYLN24, KaiLv | DuoDecoding is a novel speculative decoding method that leverages heterogeneous hardware to accelerate large language model inference. The main research objective is to reduce generation latency in large language models (LLMs) while maintaining output distribution fidelity and reducing the time to first token (TTFT). The key methodology involves deploying the draft model on the CPU and the target model on the GPU, enabling parallel decoding, along with a hardware-aware optimal draft budget and dynamic multi-sequence drafting. DuoDecoding achieves up to a 2.61x speedup in generation latency compared to vanilla autoregressive generation and reduces TTFT to 83% of that in conventional speculative decoding. The principal implication for AI practitioners is that DuoDecoding provides a method to significantly improve the inference speed of LLMs, particularly beneficial for interactive applications, by utilizing both CPU and GPU resources effectively. |
Kiss3DGen: Repurposing Image Diffusion Models for 3D Asset Generation (Read more on arXiv or HuggingFace) | yingcongchen, Xxlbigbrother, StarYDY, MeixiChen, LTT | Kiss3DGen is a framework that repurposes 2D image diffusion models for 3D asset generation, including tasks like text-to-3D, image-to-3D, editing, and enhancement. The main research objective is to develop an efficient method for generating, editing, and enhancing 3D objects by leveraging pretrained 2D image diffusion models, without the need of large-scale 3D datasets. The key methodology involves fine-tuning a diffusion model (Flux) to generate "3D Bundle Images"—tiled representations of multi-view images and normal maps—which are then used to reconstruct a 3D mesh. The method achieves a CLIP score of 0.837 in text-to-3D generation evaluation, outperforming 3DTopia, Direct2.5, and Hunyuan3D-1.0. AI practitioners can utilize this framework to efficiently create high-quality 3D models by maximizing the use of pre-trained 2D diffusion models, thus reducing the dependency on extensive 3D training data. |
Word Form Matters: LLMs' Semantic Reconstruction under Typoglycemia (Read more on arXiv or HuggingFace) | Lang Gao, Zhongyu Wei, Ziruibest, Carol0110, Aurora-cx | Large Language Models (LLMs) reconstruct the meaning of scrambled words primarily using word form, with minimal reliance on contextual information. The main research question is how word form and contextual information influence LLMs' semantic reconstruction ability under Typoglycemia. The researchers used controlled experiments on LLaMA models, varying Scramble Ratio (SR) and Context Integrity (CI), and introduced SemRecScore to quantify semantic reconstruction. Primary results show SemRecScore decreases as SR increases, and at a Scramble Ratio (SR) of 1, a final SemRecScore of only 0.5 is achieved on the final LLM layer, indicating incomplete semantic reconstruction. For AI practitioners, this highlights that improvements can come by incorporating human-like, context-aware mechanisms, as current attention mechanisms focus primarily on the word form. |
SampleMix: A Sample-wise Pre-training Data Mixing Strategey by Coordinating Data Quality and Diversity (Read more on arXiv or HuggingFace) | bitwjg, WeiWang, WQYC, DeyangKong, xixy | SampleMix is a sample-wise pre-training data mixing strategy for large language models that coordinates data quality and diversity. The main research objective is to address the limitations of existing domain-wise data mixing methods, which overlook inter-domain overlaps and use suboptimal sample distributions. The key methodology involves evaluating the quality and diversity of each sample, assigning sampling weights, and constructing a training dataset based on these weights. The primary results show that SampleMix achieves an average accuracy of 47.77% across eight downstream tasks, outperforming all baseline methods, and reaching baseline performance with 1.9x fewer training steps. The principal implication is that AI practitioners can use SampleMix to improve training efficiency and model performance by creating better data mixtures by incorporating sample-wise quality and diversity evaluations. |
From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens (Read more on arXiv or HuggingFace) | Yuxuan Wang, zlzheng, vickyandkekey, JunzheS, TongWu | TOKENSWIFT accelerates ultra-long sequence generation for large language models without compromising output quality. The main research question is whether model-agnostic, lossless acceleration can be achieved for generating ultra-long sequences with minimal training overhead. The key methodology involves multi-token parallel self-drafting with the target model, token reutilization, dynamic KV cache management, and contextual penalty. Primary results show that TOKENSWIFT achieves over 3x speedup compared to autoregressive generation across various models, reducing generation time for 100K tokens on LLAMA3.1-8b from nearly 5 hours to 90 minutes. The principal implication for AI practitioners is that TOKENSWIFT provides a scalable and effective solution to dramatically speed up ultra-long text generation, enabling applications that require producing very large outputs. |
Unposed Sparse Views Room Layout Reconstruction in the Age of Pretrain Model (Read more on arXiv or HuggingFace) | Jianan Wang, Xili Dai, xyyue, qixianbiao, yxuan | The paper introduces Plane-DUSt3R, a novel method for multi-view room layout estimation using the DUSt3R 3D foundation model. The main research objective is to develop a method for 3D room layout estimation from multiple unposed, sparse-view images. The methodology involves fine-tuning DUSt3R on a room layout dataset with a modified objective to estimate structural planes and combining it with a 2D plane detector and a post-processing algorithm. The Plane-DUSt3R achieves a 5.27% and 5.33% improvement in RRA and mAA metrics, respectively, for multi-view correspondence tasks, compared to state-of-the-art methods on the Structure3D dataset. AI practitioners can use Plane-DUSt3R to generate 3D room layouts from unposed images, eliminating the need for precise camera poses and simplifying multi-view 3D reconstruction. |
CodeArena: A Collective Evaluation Platform for LLM Code Generation (Read more on arXiv or HuggingFace) | terryyz, DongHuang-ebay, bobxwu, anhtuanluu36, Elfsong | CodeArena is an online platform for evaluating large language models (LLMs) on code generation tasks, incorporating a collective evaluation mechanism. The main objective is to address limitations in existing LLM code generation evaluation, such as benchmark contamination, data dissipation, and system inaccessibility. The key methodology involves a dynamic scoring system that adjusts model scores based on the collective performance of all submissions, along with providing automation-friendly APIs and open access to solutions and test cases. Results show that closed-source LLMs generally outperform open-source models, with "DeepSeek-Coder" achieving a Dynamic Point score of 249.28 and solving 90.63% of the problems. AI practitioners can use CodeArena for unbiased LLM code generation evaluation, accessing a public repository of solutions and test cases, and streamlining the evaluation process with automation-ready APIs. |
VideoUFO: A Million-Scale User-Focused Dataset for Text-to-Video Generation (Read more on arXiv or HuggingFace) | Yi Yang, WenhaoWang | VideoUFO is a million-scale video dataset designed to align text-to-video generation models with real-world user preferences. The main research objective is to curate a video dataset that reflects user-focused topics and evaluate its impact on text-to-video model performance. The key methodology involves clustering user-provided prompts from VidProM to identify 1,291 topics, retrieving relevant videos from YouTube, segmenting them into clips, generating captions, and assessing video quality using VBench. Primary results show that a model trained on VideoUFO achieves a low-10 score of 0.442, outperforming models trained on other datasets, while maintaining a top-10 score of 0.651 on a benchmark of user-focused topics. For AI practitioners, the VideoUFO dataset provides a resource for training or fine-tuning text-to-video models to better meet user expectations in real-world, diverse applications. |
Large-Scale Data Selection for Instruction Tuning (Read more on arXiv or HuggingFace) | pradeepd, pangwei, faezeb, nanami, hamishivi | This paper systematically investigates the scaling properties of automated data selection methods for instruction-tuning language models. The main research objective is to determine how well various data selection approaches perform when selecting large datasets (up to 2.5M samples) from large pools (up to 5.8M samples) for instruction tuning. The key methodology involves comparing nine data selection techniques, including representation-based, gradient-based, and loss/perplexity-based methods, across multiple dataset sizes and selection pools, evaluating performance on seven diverse tasks. The primary result is that a variant of representation-based data selection (RDS+) consistently outperforms other methods, including random selection, achieving an average score of 50.5 versus 46.4 for the next best method (Embed (GTR)) when selecting 10k data points. This implies that AI practitioners should consider using the proposed simple, embedding-based RDS+ method, especially in large-scale settings, rather than more computationally expensive methods when selecting data for finetuning LLMs. |
Direct Discriminative Optimization: Your Likelihood-Based Visual Generative Model is Secretly a GAN Discriminator (Read more on arXiv or HuggingFace) | mingyuliutw, gdhe17, HuayuChen, Ema11, worstcoder | Direct Discriminative Optimization (DDO) finetunes likelihood-based visual generative models using a GAN-inspired objective without extra networks. The research aims to improve the sample quality of likelihood-based generative models beyond the limitations of maximum likelihood estimation (MLE). DDO implicitly parameterizes a discriminator using the likelihood ratio between a learnable target model and a fixed, pretrained reference model, optimizing the target model with a GAN discriminator loss. Finetuning a diffusion model (EDM) with DDO achieved a new record FID score of 1.30 on CIFAR-10, a significant improvement over the base model's 1.79. AI practitioners can directly finetune and iteratively refine pretrained likelihood-based generative models to achieve state-of-the-art performance without modifying model architecture or inference procedures. |
AI-Invented Tonal Languages: Preventing a Machine Lingua Franca Beyond Human Understanding (Read more on arXiv or HuggingFace) | dnoever | This paper explores the potential for large language models (LLMs) to create private tonal languages for machine-to-machine communication. The main research question is whether AI agents can autonomously invent and use private tonal languages, and what those languages might resemble. The key methodology involves implementing a character-to-frequency mapping system using musical semitones to encode the full ASCII character set, creating a prototype tonal language. Primary results demonstrate that tonal encoding can achieve information rates exceeding human speech, with the ASCII mapping spanning approximately 7.8 octaves (220 Hz to 50175.42 Hz). The principal implication for AI practitioners is that LLMs could theoretically engage in machine-to-machine (M2M) communication, partially or wholly, outside of human perceptual boundaries, raising a need for transparency, oversight, and governance strategies in AI development. |
CLEA: Closed-Loop Embodied Agent for Enhancing Task Execution in Dynamic Environments (Read more on arXiv or HuggingFace) | Qing Zhao, Zhixin Mai, Yiming Zhao, Ge Wang, SP4595 | CLEA is a closed-loop embodied agent framework that enhances task execution in dynamic environments using multiple LLMs. The main research objective is to address the limitations of Large Language Models (LLMs) in embodied systems for reliable execution of subtask sequences and one-shot success in long-term tasks within dynamic environments. The key methodology involves a closed-loop architecture with four specialized open-source LLMs and a planner-critic framework, integrating environmental memory and multimodal feedback for dynamic task management. Across 12 task trials, CLEA achieved a 67.3% improvement in success rate and a 52.8% increase in task completion rate compared to the open-loop baseline. For AI practitioners, the framework offers a robust method for deploying embodied agents in real-world, dynamic settings by facilitating adaptive strategy adjustment, enhancing task planning, and improving execution through continuous environmental feedback. |
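
The character-to-frequency mapping described in the tonal-languages entry above can be illustrated with a minimal sketch. It assumes one semitone per character starting at 220 Hz, and it assumes the mapping covers only the 95 printable ASCII characters (space through `~`); that range is an inference from the reported 220 Hz to 50175.42 Hz span, not something the summary states. The function names are hypothetical and are not taken from the paper.

```python
# Hypothetical sketch: map printable ASCII characters to equal-tempered pitches.
# Assumptions (not confirmed by the summary): one semitone per character,
# starting at the space character (code 32) and ending at '~' (code 126).
BASE_HZ = 220.0    # lowest tone, taken from the summary
FIRST_CODE = 32    # assumed first mapped character (space)
LAST_CODE = 126    # assumed last mapped character ('~')


def char_to_hz(ch: str) -> float:
    """Return the frequency assigned to one printable ASCII character."""
    code = ord(ch)
    if not FIRST_CODE <= code <= LAST_CODE:
        raise ValueError(f"not a printable ASCII character: {ch!r}")
    semitones = code - FIRST_CODE
    # Each semitone multiplies the frequency by 2 ** (1 / 12).
    return BASE_HZ * 2 ** (semitones / 12)


def encode(text: str) -> list:
    """Encode a string as a list of tone frequencies in Hz."""
    return [char_to_hz(ch) for ch in text]


top_hz = char_to_hz("~")                     # highest tone in the assumed range
octaves = (LAST_CODE - FIRST_CODE) / 12      # span of the mapping in octaves
tones = encode("Hi")
```

Under these assumptions the top tone and the octave span come out close to the figures quoted in the summary, and most of the range lies above the roughly 20 kHz limit of human hearing.
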
Title | Authors | Summary |
---|---|---|
DeepSolution: Boosting Complex Engineering Solution Design via Tree-based Exploration and Bi-point Thinking (Read more on arXiv or HuggingFace) | luyaojie, sanmusunrise, xuanang, yhycai, lzq2021 | The paper introduces a new benchmark and system for complex engineering solution design. The main research objective is to evaluate and improve systems' ability to generate complete and feasible solutions for engineering problems with multiple constraints. The key methodology is SolutionRAG, leveraging tree-based exploration and a bi-point thinking mechanism (alternating solution design and review) to generate solutions. SolutionRAG achieved a 66.4 analytical score and 67.9 technical score on the SolutionBench, outperforming baselines like Naive-RAG and Self-RAG. AI practitioners can use SolutionBench to benchmark and the SolutionRAG architecture to improve the generation of solutions for complex, multi-constraint engineering problems. |
Chain of Draft: Thinking Faster by Writing Less (Read more on arXiv or HuggingFace) | Lingxiao Zhao, Wenhao Xie, DeBERTa, sileixu | Chain of Draft (CoD) is a new prompting strategy that improves the efficiency of large language models (LLMs) by generating concise reasoning steps. The research proposes and evaluates Chain of Draft (CoD), a prompting method that minimizes verbosity in LLM reasoning. CoD prompts LLMs to produce brief, information-dense intermediate steps, resembling human draft-thinking, during multi-step reasoning tasks. The results show that CoD matches or surpasses Chain-of-Thought (CoT) accuracy on GSM8K, date, sports, and coin flip tasks, while using up to 92.4% fewer tokens in a specific Sports Understanding case. AI practitioners can use CoD to reduce latency and computational costs in LLM applications without significantly sacrificing accuracy, especially in resource-constrained environments. |
ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents (Read more on arXiv or HuggingFace) | xpjandy, shihang, vickywu, lovesnowbest, autumncc | ViDoRAG is a multi-agent RAG framework for visually-rich documents using dynamic retrieval and iterative reasoning. The main research objective is to address the limitations of existing RAG methods in handling visually rich documents, particularly the challenges of multi-modal retrieval and insufficient reasoning capabilities. The methodology employs a Gaussian Mixture Model (GMM)-based hybrid retrieval strategy (textual and visual) and a multi-agent framework (seeker, inspector, answer) for iterative reasoning. Primary results show ViDoRAG outperforms existing methods on the ViDoSeek benchmark by over 10% in overall accuracy. AI practitioners can leverage ViDoRAG's multi-agent framework and dynamic retrieval strategy to build more effective and robust RAG systems for applications dealing with visually rich documents. |
SoS1: O1 and R1-Like Reasoning LLMs are Sum-of-Square Solvers (Read more on arXiv or HuggingFace) | Coralia Cartis, Wenqi Zhu, Kechen Li, Shiweiliuiiiiiii, jitianbo | Large Language Models (LLMs) can be effectively used to solve sum-of-squares (SoS) polynomial problems with proper reasoning guidance. The main research question is whether LLMs can determine the nonnegativity of a given multivariate polynomial, a computationally intractable problem related to Hilbert's Seventeenth Problem. The researchers introduced a dataset (SoS-1K) of ~1,000 polynomials and evaluated various LLMs using plain questions, simple instructions, and expert-designed reasoning instructions based on five criteria. The results show that high-quality reasoning instructions significantly improve accuracy, with the best-performing model (DeepSeek-R1) reaching 81% accuracy with SoS Reasoning instructions, compared to around 60% with plain questions. Supervised fine-tuning of a 7B model on SoS-1K achieved 70% accuracy, outperforming the 671B DeepSeek-V3. AI practitioners can leverage specialized datasets and reasoning-guided instructions to significantly enhance LLMs' ability to solve complex mathematical problems and tackle NP-hard problems. |
Optimal Brain Apoptosis (Read more on arXiv or HuggingFace) | Delei Kong, Junjie Jiang, Jiaxu Wang, Zheng Fang, Mingyuan Sun | Optimal Brain Apoptosis (OBA) is a novel pruning method that calculates the Hessian-vector product to estimate parameter importance for neural network compression. The main research objective is to develop a more precise and efficient pruning method that avoids approximations of the Hessian matrix used in prior work. The key methodology involves decomposing the Hessian matrix across network layers, identifying conditions for non-zero inter-layer Hessian submatrices, and efficiently computing the second-order Taylor expansion of parameters using a Jacobian-vector product forward propagation technique. The primary results show that OBA achieves a 2x speedup on ImageNet with ResNet50 with only a 0.53% accuracy decrease, outperforming existing methods. The principal implication for AI practitioners is that OBA offers a more accurate and efficient way to prune both convolutional neural networks and Transformers, directly leading to computational savings in inference. |
Tell me why: Visual foundation models as self-explainable classifiers (Read more on arXiv or HuggingFace) | Christian Lovis, Gianmarco Mengaldo, Mina Bjelogrlic, hturbe | Visual foundation models (VFMs) can be adapted into self-explainable classifiers through a novel prototypical architecture called ProtoFM. The main research objective is to develop a self-explainable model (SEM) leveraging VFMs that achieves competitive classification performance and improved interpretability. The methodology involves training a lightweight head (approximately 1 million parameters) on top of frozen VFMs, using a student-teacher approach and specialized training objectives, including assignment, alignment, contrastive, sparsity, and classification losses. The ProtoFM architecture achieved a mean explainability score (mX) of 0.92 on the FunnyBirds framework, outperforming existing prototypical models. AI practitioners can leverage frozen VFMs to create efficient and interpretable classifiers, improving transparency and trust, particularly in critical applications. |
Sim-to-Real Reinforcement Learning for Vision-Based Dexterous Manipulation on Humanoids (Read more on arXiv or HuggingFace) | Yuke Zhu, Linxi Fan, Kartik Sachdev, Toru Lin, jitendra1995 | This paper presents a sim-to-real reinforcement learning recipe for vision-based dexterous manipulation tasks on humanoid robots. The main research objective is to identify and address the key challenges in applying sim-to-real reinforcement learning to solve contact-rich dexterous manipulation tasks on humanoids. The key methodology includes an automated real-to-sim tuning module, a generalized reward design scheme, a divide-and-conquer distillation process, and a mixture of sparse and dense object representations. The primary results include a 62.3% success rate on the grasp-and-reach task, 80% on the box lift task, and 52.5% on bimanual handover, demonstrating generalization and robustness against force perturbations; the results also show that a lower MSE measured by the autotune module correlates with a higher sim-to-real transfer success rate. AI practitioners can utilize the proposed techniques to train humanoid robots for dexterous manipulation, achieving robust generalization and high performance without human demonstrations. |
LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation (Read more on arXiv or HuggingFace) | kasikci, kojimano, jungok, kamahori | LITEASR is a compression scheme for ASR encoders that maintains transcription accuracy while reducing computational costs. The main research objective is to reduce the computational intensity of ASR encoders, which are a deployment bottleneck. The key methodology leverages low-rank properties in intermediate activations by applying PCA and optimizing self-attention in a reduced dimension, implemented using a specialized GPU kernel. Applying LITEASR to Whisper large-v3 reduces encoder size by over 50%, matching Whisper medium's size with better transcription accuracy. AI practitioners can deploy more efficient ASR systems by leveraging the compressed, and Pareto-optimal, models. |
HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models (Read more on arXiv or HuggingFace) | Fuzheng Zhang, Yuanxing Zhang, Jingyun Hua, Xiao Wang, lwher1996 | This paper introduces HAIC, a two-stage data annotation pipeline and two datasets, to improve human action understanding and generation in multi-modal large language models (MLLMs). The main research objective is to address the lack of high-quality data for training MLLMs on videos involving human actions, especially multi-person interactions. The methodology involves a two-stage data annotation pipeline: accumulating videos with clear human actions, and annotating videos with a standardized caption format detailing individual attributes, actions, and interactions. Training with the curated HAICTrain dataset improves human action understanding, as evidenced by a 2.1% accuracy improvement on the HAICBench benchmark compared to the baseline LLaVA-Video-7B model. AI practitioners can use the released datasets and annotation pipeline to enhance MLLMs' performance in tasks requiring fine-grained understanding of human actions and interactions in videos. |
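
The Optimal Brain Apoptosis entry above scores parameters with a second-order Taylor expansion built on Hessian-vector products. The sketch below illustrates only that general idea on a toy quadratic loss; it is not the paper's algorithm. The Hessian-vector product here is approximated by a central difference of gradients rather than the Jacobian-vector product forward propagation the summary mentions, and the matrix `A`, vector `b`, weights `w`, and all function names are made-up values for illustration.

```python
# Toy quadratic loss L(w) = 0.5 * w^T A w - b^T w (A and b are made-up values).
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 0.5],
     [0.0, 0.5, 2.0]]
b = [1.0, 2.0, 3.0]
N = len(b)


def loss(w):
    quad = sum(w[i] * A[i][j] * w[j] for i in range(N) for j in range(N))
    return 0.5 * quad - sum(bi * wi for bi, wi in zip(b, w))


def grad(w):
    # Gradient of the quadratic: A w - b (A is symmetric).
    return [sum(A[i][j] * w[j] for j in range(N)) - b[i] for i in range(N)]


def hvp(w, v, eps=1e-4):
    # Hessian-vector product via a central difference of gradients,
    # so the full Hessian is never formed explicitly.
    plus = grad([wi + eps * vi for wi, vi in zip(w, v)])
    minus = grad([wi - eps * vi for wi, vi in zip(w, v)])
    return [(p - m) / (2 * eps) for p, m in zip(plus, minus)]


def taylor_loss_change(w, pruned):
    # delta sets the pruned parameters to zero: delta_i = -w_i for i in pruned.
    delta = [-wi if i in pruned else 0.0 for i, wi in enumerate(w)]
    first = sum(gi * di for gi, di in zip(grad(w), delta))
    second = 0.5 * sum(di * hi for di, hi in zip(delta, hvp(w, delta)))
    return first + second


w = [0.3, 0.5, 1.2]
pruned = {0, 1}  # prune two parameters jointly, so the cross term A[0][1] matters
estimate = taylor_loss_change(w, pruned)
w_pruned = [0.0 if i in pruned else wi for i, wi in enumerate(w)]
actual = loss(w_pruned) - loss(w)
```

Because the toy loss is exactly quadratic, `estimate` agrees with `actual` up to floating-point error; for a real network the second-order estimate is only an approximation of the loss change.
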
Title | Authors | Summary |
---|---|---|
Self-rewarding correction for mathematical reasoning (Read more on arXiv or HuggingFace) | Nan Jiang, Chenlu Ye, Hanning Zhang, Wei Xiong, Lichang-Chen | This paper introduces a self-rewarding reasoning framework for large language models (LLMs) that enables autonomous error detection and correction in mathematical reasoning without external feedback. The main research question is whether LLMs can simultaneously generate reasoning steps, evaluate their correctness, and revise their outputs during inference without external reward models. The key methodology involves a two-staged training approach using self-generated data: sequential rejection sampling to create training trajectories, followed by reinforcement learning with rule-based signals. Primary results show that on the MATH500 benchmark, the self-rewarding IFT + PPO model achieves a final accuracy of 80.2%, outperforming intrinsic self-correction and comparable to systems using external reward models. For AI practitioners, this framework offers a way to improve LLM reasoning accuracy and reduce computational overhead by integrating generation and evaluation within a single model, streamlining deployment for mathematical reasoning tasks. |
MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning (Read more on arXiv or HuggingFace) | Jiayuan Zhu, Fenglin Liu, Jiazhen Pan, morson, che111 | MedVLM-R1 is a medical vision-language model that uses reinforcement learning to generate explicit reasoning alongside answers for radiology visual question answering. The main research objective is to develop a medical VLM that generates natural language reasoning to improve transparency and trustworthiness, without relying on supervised fine-tuning (SFT). The key methodology is a reinforcement learning framework, specifically Group Relative Policy Optimization (GRPO), that incentivizes the model to discover human-interpretable reasoning paths without using reasoning references. The model, trained on 600 visual question answering samples, boosts accuracy from 55.11% to 78.22% across MRI, CT, and X-ray benchmarks, outperforming larger models. For AI practitioners, this implies that training smaller, specialized models with reinforcement learning can achieve superior, robust, and transparent generalization in the medical domain relative to supervised fine-tuning approaches. |
R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts (Read more on arXiv or HuggingFace) | Ziyue Li, zhoutianyi, Lzy01241010 | R2-T2 introduces a test-time re-routing method for multimodal Mixture-of-Experts (MoE) models that improves performance without retraining. The main research objective is to optimize the routing weights of a multimodal MoE model during inference to improve performance on challenging or out-of-distribution samples. The key methodology is "Re-Routing in Test-Time (R2-T2)," which locally optimizes routing weights by moving them toward those of correctly predicted neighbor samples, using strategies like Neighborhood Gradient Descent (NGD), kernel regression, and mode finding. Applying R2-T2 with NGD to MoAI-7B improved MMBench accuracy by 6.9%, TextVQA accuracy by 6.8%, and achieved a 66.1-point increase on MME-P. AI practitioners can use R2-T2 to enhance the performance and generalization of multimodal MoE models on diverse tasks in test-time, without costly retraining or modification of model parameters. |
LongRoPE2: Near-Lossless LLM Context Window Scaling (Read more on arXiv or HuggingFace) | Gilsinia Lopez, Gaokai Zhang, Li Lyna Zhang, Ning Shang, OldKingMeister | LongRoPE2 extends LLMs' effective context window while preserving short-context performance through RoPE rescaling and mixed context window training. The main research objective is to address the out-of-distribution (OOD) issues in rotary positional embeddings (RoPE) and the performance degradation on short-context tasks when extending the context window of pre-trained large language models (LLMs). The key methodology involves an evolutionary search for optimal RoPE rescaling factors guided by "needle-driven" perplexity, combined with a mixed context window training approach that uses both original and rescaled RoPE. Primary results show that LongRoPE2 extends LLaMA3-8B to achieve a 128K effective context length while retaining over 98.5% of short-context performance, using only 10B training tokens. The principal implication is that AI practitioners can extend LLM context windows to 128K with near-lossless performance on both the long and original context windows, while significantly reducing data and training costs compared to prior methods. |
FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving (Read more on arXiv or HuggingFace) | Chaoqun Liu, Hou Pong Chan, Hao Zhang, Weiwen Xu, Guizhen Chen | FINEREASON introduces a logic-puzzle benchmark to evaluate and improve LLMs' deliberate reasoning through state checking and transition tasks. The main research objective is to assess and enhance LLMs' ability to reflect and rectify mistakes during multi-step reasoning processes, going beyond final-answer accuracy. The key methodology involves decomposing logic puzzles into atomic steps and evaluating models on two tasks: state checking (assessing if a state can lead to a solution) and state transition (determining the next valid move). Primary results show that models trained with state checking and transition data gained up to 5.1 percentage points in math reasoning on GSM8K: starting from the DeepSeek-R1-Distill-Qwen-7B model, accuracy increased from 82.3% to 87.4%. The principal implication for AI practitioners is that training LLMs with structured, puzzle-based data focusing on intermediate reasoning steps can significantly improve their performance on general mathematical reasoning tasks. |
CODESYNC: Synchronizing Large Language Models with Dynamic Code Evolution at Scale (Read more on arXiv or HuggingFace) | Kaiyue Qiu, Zhaoyang Chu, Chenlong Wang, yxy0807, zx10086 | CODESYNC introduces a data engine and benchmark to assess large language models' (LLMs) ability to adapt to evolving Python library APIs. The main research question is: Can LLMs be effectively and efficiently updated to handle real-time API modifications? CODESYNC systematically identifies API updates, retrieves relevant code instances from GitHub, and uses an LLM to synthesize contrastive code for legacy/updated API versions, then builds a benchmark, CODESYNCBENCH. Evaluation of 14 LLMs shows they struggle with API updates even with knowledge updating methods, e.g., a maximum BLEU score of 31.59 on the code completion task across five models with SFT. The principal implication is that AI practitioners need to develop and employ techniques to improve LLMs' ability to synchronize with evolving code, as static pre-training datasets limit handling of real-time API updates. |
Lean and Mean: Decoupled Value Policy Optimization with Global Value Guidance (Read more on arXiv or HuggingFace) | Zhixu Li, Pu Zhao, Lu Wang, Chenghua Huang, keanudicap | DVPO decouples value and policy optimization in RLHF to improve training efficiency and stability for large language models. The main research objective is to address the computational complexity and instability of traditional PPO-based RLHF caused by joint actor-critic training. The key methodology is Decoupled Value Policy Optimization (DVPO), which pre-trains a Global Value Model (GVM) on policy trajectories and uses it as a fixed guide for policy optimization via a standard RL objective. Primary results show that DVPO reduces GPU memory usage by 40% and training time by 35% compared to conventional RLHF, while achieving comparable performance to state-of-the-art PPO. The principal implication is that AI practitioners can achieve more efficient and stable RLHF training by decoupling value estimation from policy updates, simplifying the alignment of LLMs with human preferences. |
UniTok: A Unified Tokenizer for Visual Generation and Understanding (Read more on arXiv or HuggingFace) | Xin Yu, Jihan Yang, Junfeng Wu, Yi Jiang, Chuofan Ma | UniTok is a unified visual tokenizer designed for both visual generation and understanding tasks, bridging the representation gap between these two domains. The main research objective is to investigate whether reconstruction and contrastive losses truly conflict in unified tokenizer training, and to identify any underlying bottlenecks. The key methodology is multi-codebook quantization, which divides visual tokens into chunks and discretizes each with independent sub-codebooks, alongside attention factorization. UniTok achieves a remarkable rFID of 0.38 and a zero-shot accuracy of 78.6% on ImageNet. The principal implication for AI practitioners is that a unified visual tokenizer, enhanced with multi-codebook quantization, can match or surpass domain-specific tokenizers, enabling more efficient and integrated multimodal model development. |
FlexiDiT: Your Diffusion Transformer Can Easily Generate High-Quality Samples with Less Compute (Read more on arXiv or HuggingFace) | Markos Georgopoulos, Jonas Kohler, Yeongmin Kim, Gregor Bachmann, Sotiris Anagnostidis | FlexiDiT enables Diffusion Transformers (DiTs) to generate high-quality images with reduced computational cost by dynamically adjusting the compute budget per denoising step. The main research objective is to overcome the fixed and large compute requirements of standard DiTs during inference by revisiting the static compute allocation paradigm. The key methodology is converting pre-trained DiT models into flexible ones (FlexiDiTs) that can process inputs at varying compute budgets by dynamically adjusting patch size during the denoising process, and using different LoRAs for each sequence. The primary result is that FlexiDiT models can reduce FLOPs by more than 40% compared to static counterparts for class-conditioned and text-conditioned image generation, without any drop in quality. AI practitioners can deploy more computationally efficient diffusion models by adopting FlexiDiT, enabling substantial savings in computational resources without compromising the quality of generated outputs, especially for high-resolution image and video generation. |
Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think (Read more on arXiv or HuggingFace) | Haozhe Zhao, Weichu Xie, Wenhao Chai, Shuai Bai, Liang Chen | DREAM ENGINE enables arbitrary text-image interleaved control for image generation by aligning large multimodal models (LMMs) with diffusion models. The research objective is to develop a framework that can generate images based on complex instructions interweaving text and visual elements from multiple images. The key methodology involves replacing the text encoders of a diffusion model (SD3.5) with an LMM (QwenVL) and a two-stage training paradigm: joint text-image alignment and multimodal interleaved instruction tuning. The primary results show that DREAM ENGINE achieves a 0.69 overall score on the GenEval benchmark, matching state-of-the-art text-to-image models. For AI practitioners, the principal implication is that LMMs can be directly integrated into diffusion models to enable advanced text-image control, simplifying the creation of complex, multi-image-influenced generation systems. |
NeoBERT: A Next-Generation BERT (Read more on arXiv or HuggingFace) | Sarath Chandar, Mariam El Mezouar, Quentin Fournier, Lola Le Breton | NeoBERT, a new BERT-like encoder model, integrates architectural, data, and pre-training advancements to improve bidirectional representation learning. The primary objective is to create a next-generation BERT model that outperforms existing encoders by leveraging modern advancements in language model design. The key methodology involves pre-training on the RefinedWeb dataset with modifications like RoPE, SwiGLU, RMSNorm, a 20% masking rate, and a two-stage sequence length increase (1,024 to 4,096 tokens). NeoBERT achieves an 89.0 average score on the GLUE benchmark and 51.3 on the MTEB benchmark after contrastive fine-tuning, outperforming all similarly sized, and even larger, models on MTEB. AI practitioners can adopt NeoBERT as a plug-and-play replacement for existing base encoders to obtain better performance in downstream NLP tasks that depend on their embeddings, notably for retrieval-augmented generation and toxicity classification, without needing architectural modifications. |
Mobius: Text to Seamless Looping Video Generation via Latent Shift (Read more on arXiv or HuggingFace) | Xiaodong Cun, Yong Zhang, Bo Liu, Jianfei Yuan, Xiuli Bi | Mobius is a training-free method to generate seamless looping videos from text descriptions using pre-trained video diffusion models. The main research objective is to develop a method for generating seamless looping videos directly from text prompts, without requiring user annotations or additional training. The key methodology involves constructing a latent cycle and performing multi-frame latent denoising by iteratively shifting the first-frame latent towards the end in each step, while also using a frame-invariant latent decoding method. Primary results show that the proposed method achieves an MSE of 25.43 between the first and last frame, FVD of 40.78, a CLIP score of 32.24, and a Motion Smoothness score of 0.9850. For AI practitioners, this method provides a way to directly repurpose pre-trained text-to-video diffusion models for generating seamless looping videos, without the need for large-scale training or annotated datasets. |
SoRFT: Issue Resolving with Subtask-oriented Reinforced Fine-Tuning (Read more on arXiv or HuggingFace) | Yanzhen Zou, Xiangxin Meng, Pengfei Gao, Chao Peng, mizersy | SoRFT is a novel training approach that enhances large language models' (LLMs) issue-resolving capabilities through subtask decomposition and reinforced fine-tuning. The main research objective is to improve the performance and generalization of open-source LLMs on software issue resolution tasks, addressing limitations of existing methods. The key methodology involves decomposing issue resolving into subtasks (file/function/line localization, code edit generation) and using rejection-sampled supervised fine-tuning followed by rule-based proximal policy optimization (PPO) with ground-truth-based rewards. The primary result is that SoRFT-Qwen-7B achieves a 21.4% resolution rate on SWE-Bench Verified, outperforming other open-source models of similar size. For AI practitioners, SoRFT offers a cost-effective way to leverage open-source development resources and substantially boost the performance of open-source LLMs in automated issue resolution. |
Building Interactable Replicas of Complex Articulated Objects via Gaussian Splatting (Read more on arXiv or HuggingFace) | Song-Chun Zhu, Junfeng Ni, Ruijie Lu, Baoxiong Jia, Yu Liu | ArtGS introduces a method for reconstructing and modeling complex articulated objects using 3D Gaussian Splatting. The main research objective is to effectively integrate information across different object states to improve part-mesh reconstruction and articulation parameter estimation, especially for multi-part articulated objects. The key methodology involves using canonical Gaussians with coarse-to-fine initialization and updates, alongside a skinning-inspired part dynamics modeling module. Primary results show that on the PARIS dataset, ArtGS achieves a mean angular error (Axis Ang.) of 0.01 degrees and a mean Chamfer Distance for movable parts (CD-m) of 0.03, outperforming existing methods. For AI practitioners, this implies a more efficient and accurate approach to creating digital twins of articulated objects, facilitating applications in robotics and virtual environments. |
R1-T1: Fully Incentivizing Translation Capability in LLMs via Reasoning Learning (Read more on arXiv or HuggingFace) | Hongyong Zeng, Yuanchang Luo, Shimin Tao, Yilun Liu, boommmmm | R1-T1 is a novel framework that enhances machine translation (MT) in large language models (LLMs) through reinforcement learning (RL) with human-aligned chain-of-thoughts (CoTs). The main research objective is to improve the adaptability of LLMs to diverse translation scenarios by incorporating inference-time reasoning into general MT, going beyond specific sub-tasks. The key methodology involves formalizing six expert-curated CoT templates, reflecting human translation strategies, and using RL with KL-constrained rewards for self-evolving CoT discovery and anti-forgetting adaptation. Primary results demonstrate steady translation performance improvement across 21 languages and 80 translation directions on the Flores-101 test set, with a COMETScore of 0.626 on trained languages using RL, surpassing supervised fine-tuning (SFT) and other baselines. The principal implication for AI practitioners is that it provides a method for using RL to adapt LLMs to new machine translation tasks without relying on SFT data while avoiding catastrophic forgetting. |
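
The R1-T1 summary above mentions RL with KL-constrained rewards. As a rough illustration only, the sketch below shows one common way such a constraint is expressed in RL fine-tuning: a task reward minus a scaled estimate of the KL divergence between the trained policy and a frozen reference model. This is not taken from the paper; the function name `kl_penalized_reward`, the coefficient `beta`, and the per-token log-probability inputs are all assumptions made for the example.

```python
from typing import Sequence


def kl_penalized_reward(
    task_reward: float,
    policy_logprobs: Sequence[float],
    reference_logprobs: Sequence[float],
    beta: float = 0.1,
) -> float:
    """Return the task reward minus a KL penalty (hypothetical helper).

    The KL term is estimated from the sampled tokens as the sum of
    per-token log-probability differences between the policy and the
    frozen reference model. `beta` is an assumed penalty coefficient.
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("log-probability sequences must have equal length")
    # Sample-based estimate of KL(policy || reference) over one response.
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return task_reward - beta * kl_estimate


# A response the policy finds more likely than the reference does is penalized.
shaped = kl_penalized_reward(1.0, [-1.0, -2.0], [-1.5, -2.5], beta=0.1)
```

A larger `beta` keeps the policy closer to the reference model, which is one way to limit forgetting of previously learned behavior; how R1-T1 actually sets or schedules this constraint is not stated in the summary.
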
Title | Authors | Summary |
---|---|---|
Kanana: Compute-efficient Bilingual Language Models (Read more on arXiv or HuggingFace) | seopbo, Doohae, daniel-rl2, jiyeonham, bzantium | Kanana is a series of bilingual language models demonstrating strong performance in Korean and competitive performance in English at a significantly lower computational cost than comparable state-of-the-art models. The main research objective was to develop compute-efficient bilingual language models that maintain strong performance in both Korean and English. The key methodologies employed include high-quality data filtering, staged pre-training, depth up-scaling, pruning, and distillation, combined with supervised fine-tuning and preference optimization for instruction tuning. Primary results show that the Kanana Flag 32.5B model outperforms Llama 3.1 70B on MMLU and KMMLU while using substantially fewer computational resources, at a cost similar to that of Gemma 2 9B. AI practitioners can leverage Kanana's training techniques such as staged pre-training and depth up-scaling to build high-performing, resource-efficient language models, especially for languages with limited data availability. |
GHOST 2.0: generative high-fidelity one shot transfer of heads (Read more on arXiv or HuggingFace) | Andrey Kuznetsov, Denis Dimitrov, Pavel Paramonov, Alexander Groshev, nastasia-y | GHOST 2.0 is a two-module framework for high-fidelity one-shot head swapping, addressing limitations in existing face-swapping and head-reenactment methods. The main research objective is to develop a system that can realistically swap entire heads between source and target images, preserving identity, pose, and expression while seamlessly blending the result. The key methodology involves an "Aligner" module for head reenactment and a "Blender" module for integrating the reenacted head into the target background, using StyleGAN-based architecture and correlation learning. Primary results show that at 512x512 resolution in cross-reenactment, GHOST 2.0 achieves a CSIM score of 0.628 and a FID score of 29.57, outperforming one of the baselines (StyleHEAT) and indicating better performance than another baseline (HeSer) at identity preservation. AI practitioners can use GHOST 2.0 to improve the realism and robustness of head-swapping applications, particularly in scenarios with significant variations in head pose, hairstyle, and background. |
TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding (Read more on arXiv or HuggingFace) | Jonathan Leung, AlvinYuVotee, KrishKrosh, chongcht, vinesmsuic | TheoremExplainAgent, a novel agentic system, generates multimodal theorem explanation videos, and a new benchmark, TheoremExplainBench, evaluates them. The main research objective is to assess if AI systems can effectively generate multimodal theorem explanations. The key methodology involves a two-agent pipeline (planner and coding agent) using Manim to create videos, and a benchmark of 240 theorems across STEM, evaluated across five dimensions. The o3-mini agent achieved a 93.8% success rate and an overall score of 0.77, but visual element layout exhibited minor issues. AI practitioners can leverage this agentic approach for enhanced theorem understanding, though refinement is needed in visual structuring and consistency of generated video outputs. |
Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning? (Read more on arXiv or HuggingFace) | Weixun Wang, Jiaheng Liu, Shilong Li, Yancheng He, zhangysk | DeltaBench, a new benchmark, evaluates large language models' (LLMs) ability to detect errors in long chain-of-thought (CoT) reasoning. The main research objective is to assess the quality of long CoTs generated by o1-like models and to measure the critique abilities of existing LLMs, process reward models (PRMs) and critic models on these CoTs. The key methodology involves creating DeltaBench, a dataset of long CoTs with fine-grained error annotations, and evaluating various LLMs, including PRMs and critic models, on their ability to identify these errors. Primary results show that even the top-performing model (GPT-4-turbo-128k) achieved a low F1-score of only 40.8% in error detection, and that o1-like models do not show any advantage over non-o1-like models on critique abilities. Principal implication for AI practitioners is that current LLMs, including PRMs, have limited ability to identify errors in long CoT reasoning, highlighting a need for significant improvements in critique capabilities for robust AI system development. |
Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems (Read more on arXiv or HuggingFace) | Bin Xu, Zijun Yao, Xiaozhi Wang, Yunjia Qi, Hao Peng | This paper proposes a new reward modeling approach, "agentic reward modeling," that combines human preferences with verifiable correctness signals for more reliable reward systems in large language models (LLMs). The main research objective is to develop a reward system that mitigates the limitations of existing reward models, which primarily focus on subjective human preferences and often neglect verifiable correctness. The key methodology involves implementing a reward agent, REWARDAGENT, that integrates human preference rewards with two verifiable signals: factuality (assessed via pairwise comparison and evidence verification) and instruction-following (verified through constraint parsing and Python code execution). The primary results show that REWARDAGENT significantly outperforms existing reward models on benchmarks like RM-Bench, JudgeBench, and a newly constructed IFBench, achieving an overall score of 72.5% in one configuration. The principal implication for AI practitioners is that integrating verifiable correctness signals with human preference feedback can lead to more reliable and robust reward models, improving LLM performance in downstream tasks and alignment with intended behavior, particularly during the inference and training phases. |
Language Models' Factuality Depends on the Language of Inquiry (Read more on arXiv or HuggingFace) | Hamid Palangi, Kumar Ayush, Kumar Tanmay, ayush1801, AggarwalTushar | Language models (LMs) exhibit inconsistent factual recall across different languages, failing to transfer knowledge even when possessing it in one language. The main research question is whether multilingual LMs truly internalize and transfer factual knowledge across languages or encode isolated linguistic silos. The key methodology involves creating a benchmark of 10,000 country-related facts across 13 languages and proposing metrics (Factual Recall Score, Knowledge Transferability Score, Cross-Lingual Factual Knowledge Transferability Score) to quantify factual recall and knowledge transferability. A primary result is that Llama-3-70B achieved the highest X-FaKT score of 0.848, demonstrating superior balanced performance in both factual recall and knowledge transfer. The principal implication is that AI practitioners must recognize language-specific factual reliability in multilingual LMs and leverage the most trustworthy information across languages, moving beyond the assumption of consistent cross-lingual knowledge access. |
Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation (Read more on arXiv or HuggingFace) | Matthias Bethge, Jonas Geiping, Ponnurangam Kumaraguru, Shashwat Goel, Shiven Sinha | Language models (LMs) are evaluated on their ability to generate counterexamples that falsify incorrect algorithmic solutions, introducing a new benchmark called REFUTE. The main research question is: Can LMs create counterexamples for incorrect solutions to algorithmic problems? The key methodology involves sourcing incorrect submissions from programming competitions, filtering them for non-trivial errors, and prompting LMs to generate inputs that cause these solutions to fail, validated through code execution. The primary result is that the best reasoning agents, including OpenAI o3-mini (high), can only create counterexamples for less than 9% of incorrect solutions in REFUTE, despite having a much higher success rate at solving those same problems. The principal implication for AI practitioners is that verification, including falsification of subtly incorrect solutions, is significantly harder for current LMs than generating correct solutions, highlighting a limitation in capabilities relevant for self-improvement and reliable reasoning. |
Towards an AI co-scientist (Read more on arXiv or HuggingFace) | Anil Palepu, Tao Tu, Alexander Daryin, Wei-Hung Weng, Juraj Gottweis | The paper introduces an AI co-scientist, a multi-agent system built on Gemini 2.0, designed to assist in scientific discovery by generating and evaluating novel research hypotheses. The main research objective is to develop an AI system capable of formulating demonstrably novel research hypotheses and proposals, building upon existing evidence and aligned with scientist-provided goals. The key methodology involves a multi-agent architecture with an asynchronous task execution framework, utilizing a generate, debate, and evolve approach with specialized agents for hypothesis generation, refinement, and ranking via simulated scientific debates and tournaments. The system demonstrates, across 203 diverse research goals, improved hypothesis quality (measured by an internal Elo rating system) as a function of increased test-time compute, and hypotheses for acute myeloid leukemia were validated to show tumor inhibition in vitro at clinically applicable concentrations. AI practitioners can leverage the multi-agent architecture and test-time compute scaling paradigm presented to build systems capable of complex reasoning and iterative improvement, although specific external validation metrics remain limited within the paper. |
VEM: Environment-Free Exploration for Training GUI Agent with Value Environment Model (Read more on arXiv or HuggingFace) | Lingrui Mei, Lu Wang, Jiani Zheng, vyokky, keanudicap | VEM decouples value estimation from policy optimization for training GUI agents, enabling environment-free reinforcement learning. The main research objective is to develop an environment-free RL framework that can effectively train GUI agents without costly real-world interactions. The key methodology involves pretraining a Value Environment Model (VEM) to predict state-action values from offline data and then using this frozen VEM to guide policy exploration. The method achieves a 28.0% offline task success rate on the General domain of the Android-in-the-Wild benchmark, surpassing environment-free baselines by 12-28%. AI practitioners can leverage this approach to train GUI agents with greater sample efficiency and stability, bypassing the need for direct environment interactions. |
Plutus: Benchmarking Large Language Models in Low-Resource Greek Finance (Read more on arXiv or HuggingFace) | Polydoros Giannouris, Efstathia Soufleri, Triantafillos Papadopoulos, Xueqing Peng, jiminHuang | The paper introduces Plutus-ben, a Greek financial benchmark, and Plutus-8B, a Greek financial LLM, to address the lack of resources for Greek financial NLP. The main research question is: How do current language models perform on core Greek financial tasks, and how can fine-tuning on Greek financial data enhance performance? Key methodology involved creating Plutus-ben, comprising five financial NLP tasks (numeric and textual NER, QA, abstractive summarization, topic classification), and fine-tuning Llama-Krikri-8B with Greek domain-specific data to create Plutus-8B, evaluating 22 LLMs. The primary result is that Plutus-8B achieved the best performance on Plutus-ben, surpassing GPT-4 by 15.38% and outperforming all baseline models in the evaluation. Principal implication for AI practitioners is that fine-tuning on language-specific and domain-specific data is crucial for LLM performance in low-resource languages like Greek, significantly improving performance in tasks like financial numeric reasoning. |
Distill Any Depth: Distillation Creates a Stronger Monocular Depth Estimator (Read more on arXiv or HuggingFace) | Ying Cui, Ruibo Li, Hongji Li, Dongyan Guo, Xiankang He | This paper introduces a new distillation framework for improving monocular depth estimation (MDE) using unlabeled data. The main research objective is to enhance zero-shot MDE by addressing the limitations of existing depth normalization strategies in pseudo-label distillation. The key methodology involves Cross-Context Distillation, integrating global and local depth cues, and a multi-teacher distillation framework using diverse depth estimation models. The primary result shows that the proposed method outperforms state-of-the-art methods on benchmark datasets; for instance, on the DIODE dataset, the AbsRel improves by 14.1% using the Local-Global and Shared-Context Distillation strategies. For AI practitioners, this method provides an effective way to train more robust and accurate MDE models by leveraging unlabeled data and combining the strengths of multiple teacher models, especially improving generalization in varied scenarios. |
Project Alexandria: Towards Freeing Scientific Knowledge from Copyright Burdens via LLMs (Read more on arXiv or HuggingFace) | Andreas Hochlehnert, Tawsif Ahmed, Ameya Prabhu, Gollam Rabby, Christoph Schuhmann | This paper proposes converting copyrighted scientific texts into structured "Knowledge Units" using LLMs to make factual information freely accessible while respecting copyright. The main research question is whether converting scientific texts into Knowledge Units preserves factual information and adheres to copyright laws. The key methodology involves using LLMs to extract entities, attributes, and relationships from paragraphs of scientific papers into structured data, and evaluating the legal defensibility and information retention via question-answering experiments. Primary results show that language models answering multiple-choice questions using Knowledge Units achieved nearly the same accuracy (within 3-5% variance) as when using original texts across several scientific domains. AI practitioners can utilize this framework to build and use datasets containing facts from copyrighted scientific text, potentially democratizing access to scholarly knowledge without infringing on the original expression. |
AISafetyLab: A Comprehensive Framework for AI Safety Evaluation and Improvement (Read more on arXiv or HuggingFace) | Xijie Huang, Junxiao Yang, Leqi Lei, Zhexin Zhang, LLLeo612 | AISafetyLab is a unified framework and toolkit for AI safety that integrates attack, defense, and evaluation methodologies. The main objective is to provide a standardized platform to evaluate and improve AI safety by addressing the lack of comprehensive tools and inconsistent experimental setups. The methodology involves implementing 13 attack methods (including black-box, gray-box, and white-box), 16 defense mechanisms (both inference-time and training-time), and 7 evaluation scorers, alongside auxiliary modules for model interaction, data management, utilities, and logging. In evaluations using Vicuna-7B-v1.5, AutoDAN achieved an average attack success rate of 56.4% across various defenses, while some other methods had varying performance depending on the defense used. For AI practitioners, AISafetyLab provides a flexible, extensible platform with comprehensive method coverage for systematically assessing and enhancing the robustness of AI models against adversarial attacks. |
BIG-Bench Extra Hard (Read more on arXiv or HuggingFace) | Chrysovalantis Anastasiou, John Palowitch, Hritik Bansal, Mehran Kazemi, baharefatemi | BIG-Bench Extra Hard (BBEH) is a new benchmark to evaluate the general reasoning capabilities of large language models (LLMs). The main research objective is to address the saturation of existing LLM reasoning benchmarks, particularly BIG-Bench Hard (BBH), by creating a more challenging and diverse set of tasks. The methodology involves replacing each of the 23 tasks in BBH with a novel, more difficult task that probes similar reasoning capabilities, using a semi-adversarial approach with two reference models to ensure sufficient difficulty. The primary result is that the best general-purpose model achieved a harmonic mean accuracy of 9.8% on BBEH, while the best reasoning-specialized model achieved 44.8%, indicating significant room for improvement. AI practitioners should use BBEH to evaluate LLMs for robust general reasoning, revealing current limitations and driving improvements instead of using other benchmarks where LLMs have reached ceiling performance. |
CritiQ: Mining Data Quality Criteria from Human Preferences (Read more on arXiv or HuggingFace) | Zhiheng Xi, Tianyi Liang, Qipeng Guo, Kai Lv, KYLN24 | CritiQ is a novel data selection method that automatically mines data quality criteria from human preferences and performs efficient data selection. The main research objective is to develop a method for automatically extracting data quality criteria from human preferences with minimal human annotation effort. The key methodology, CritiQ Flow, employs a manager agent to evolve quality criteria and worker agents to make pairwise judgments based on a knowledge base and a reflection process. Accuracies on human-annotated test sets reach 89.33% for code, 84.57% for math, and 88.06% for logic, outperforming baselines such as TextGrad and single-criterion methods. AI practitioners can use CritiQ to automatically derive data quality criteria and select high-quality subsets, improving model performance on downstream tasks with reduced reliance on manually designed heuristics or extensive human annotation. |
MolSpectra: Pre-training 3D Molecular Representation with Multi-modal Energy Spectra (Read more on arXiv or HuggingFace) | Qiang Liu, Deli Zhao, Yu Rong, Shaozhen Liu, AzureLeon1 | MolSpectra enhances pre-training of 3D molecular representations by incorporating multi-modal energy spectra. The main research objective is to establish the relationship between 3D molecular structures and energy states using spectral data to improve molecular representation learning. The key methodology involves a multi-spectrum encoder, SpecFormer, trained with masked patch reconstruction, and a contrastive objective aligning 3D and spectral representations. Pre-training with MolSpectra achieved state-of-the-art performance on the QM9 dataset, with a mean absolute error (MAE) of 0.011 D on dipole moment (μ) prediction, and outperformed the baseline Coord method on 10 out of 12 properties. For AI practitioners, MolSpectra provides a pre-training framework that leverages molecular spectra to learn more informative 3D molecular representations, enhancing performance on downstream tasks like property prediction. |
PosterSum: A Multimodal Benchmark for Scientific Poster Summarization (Read more on arXiv or HuggingFace) | Frank Keller, Pasquale Minervini, rohitsaxena | POSTERSUM, a new benchmark, evaluates multimodal models on summarizing scientific posters into research paper abstracts, revealing limitations in current models and introducing a hierarchical approach for improvement. The main research question is how effectively Multimodal Large Language Models (MLLMs) can understand and summarize the complex, visually rich content of scientific posters into concise textual abstracts, and whether a hierarchical approach can improve this performance. The key methodology involves creating a new dataset, POSTERSUM, consisting of 16,305 scientific posters paired with their corresponding abstracts, and benchmarking state-of-the-art MLLMs (including GPT-4o, Claude-3.5 Sonnet, Gemini 2.0, and various open-source models) on it using metrics like ROUGE, SacreBLEU, METEOR, and BERTScore. The authors then propose "SEGMENT & SUMMARIZE," a hierarchical approach involving segmentation of the poster into coherent regions, localized summarization of each region, and global summarization to combine the localized summaries. Primary results show that state-of-the-art MLLMs struggle to accurately summarize scientific posters: the best-performing closed-source model, GPT-4o, achieved a ROUGE-L score of only 22.30, while the proposed SEGMENT & SUMMARIZE method outperformed all other models, including closed-source MLLMs, with a ROUGE-L score of 24.18. The principal implication for AI practitioners is that current MLLMs, while strong on various tasks, have significant limitations when handling the complex multimodal information presented in scientific posters. The POSTERSUM dataset provides a benchmark for advancing multimodal understanding, and the "SEGMENT & SUMMARIZE" approach demonstrates a promising divide-and-conquer direction for handling the complexity inherent in poster summarization. AI/ML/software engineers and data scientists working with scientific documents should prioritize models and architectures that are capable of understanding a variety of modalities and their combinations. |
Title | Authors | Summary |
---|---|---|
OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference (Read more on arXiv or HuggingFace) | Jiaqiwang, Weiyun1025, UniverseCA, ChrisDing1105, PhoenixZ | OmniAlign-V introduces a new dataset and benchmark to improve the alignment of multi-modal large language models (MLLMs) with human preferences. The main research objective is to address the gap in human preference alignment observed in existing open-source MLLMs, despite their strong performance on foundational capability benchmarks. The key methodology involves constructing OmniAlign-V, a dataset of ~200K high-quality training samples with diverse images and complex question-answer pairs, and MM-AlignBench, a human-annotated benchmark for evaluating MLLM alignment. Finetuning MLLMs with OmniAlign-V via Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO) improved the win rate against Qwen2VL-72B on MM-AlignBench, achieving a 72.6 win rate. The principal implication is that AI practitioners should utilize curated, human-aligned multi-modal datasets like OmniAlign-V during SFT and DPO to significantly enhance the human preference alignment of MLLMs while maintaining or enhancing fundamental capabilities. |
SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference (Read more on arXiv or HuggingFace) | Haofeng Huang, surfingtomchen, hxi0408, Xiang-cd, jt-zhang | SpargeAttn is a universal sparse and quantized attention mechanism designed to accelerate inference in various AI models. The paper's main objective is to design a training-free sparse attention operator that accelerates all models without metric loss. The key methodology involves a two-stage online filter that predicts sparse blocks in the attention map using selective token compression and a sparse warp online softmax, integrated with 8-bit quantization. SpargeAttn achieved a 1.83x speedup on Mochi on an L40 GPU without loss of video quality and is 2.5x to 5x faster than existing dense/sparse attention models. AI practitioners can use SpargeAttn to significantly accelerate the inference of diverse models, including language, image, and video generation, without sacrificing end-to-end performance metrics. |
KV-Edit: Training-Free Image Editing for Precise Background Preservation (Read more on arXiv or HuggingFace) | Yansong Tang, jewelshaw, shiyi0408, xilluill | KV-Edit is a training-free image editing method that achieves precise background preservation by utilizing KV cache in diffusion models. The main research objective is to address the challenge of maintaining background consistency during image editing tasks while generating content aligned with modified text prompts. The key methodology involves caching and reusing key-value pairs of background tokens in Diffusion Transformers (DiTs) during the inversion and denoising processes, and optional mask-guided inversion and reinitialization strategies. Primary results show that KV-Edit achieves a PSNR of 35.87 in masked region preservation, outperforming existing methods. For AI practitioners, this method provides a way to perform image editing with perfect background preservation, without additional training or complex mechanisms, thereby facilitating more practical AI image editing applications. |
ART: Anonymous Region Transformer for Variable Multi-Layer Transparent Image Generation (Read more on arXiv or HuggingFace) | JianminBao, DongChen06, 131131yhx, 2JZ, yifanpu001 | This paper introduces the Anonymous Region Transformer (ART) for generating variable multi-layer transparent images from a global text prompt and an anonymous region layout. The main research objective is to develop a method for generating high-quality, multi-layer transparent images that overcomes the limitations of existing methods requiring detailed semantic layouts. The key methodology involves using an anonymous region layout, a layer-wise region crop mechanism, and a multi-layer transparent image autoencoder. The method achieves a speed improvement of over 12 times compared to the full attention approach, and user studies show it outperforms existing methods (LayerDiffuse and COLE) in multiple aspects. The principal implication is that AI practitioners can generate multi-layer images more efficiently and with greater scalability, allowing for more precise control in interactive content creation and editing of individual elements within generative models. |
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution (Read more on arXiv or HuggingFace) | RishabhSingh021, gsynnaeve, lingming, JadeCopet, yuxiang630 | SWE-RL is a reinforcement learning approach that enhances LLM reasoning for software engineering tasks using open-source software evolution data. The main research objective is to improve LLMs' performance on real-world software engineering tasks, specifically issue resolution, using reinforcement learning. The key methodology is training LLMs on GitHub pull request data with a rule-based reward function based on the similarity between predicted and oracle code patches, optimized via Group Relative Policy Optimization (GRPO). The primary result is that Llama3-SWE-RL-70B achieves a 41.0% solve rate on the SWE-bench Verified dataset. The principal implication for AI practitioners is that reinforcement learning on software evolution data can significantly enhance LLM reasoning capabilities for software engineering and also improve performance on out-of-domain tasks. |
Unveiling Downstream Performance Scaling of LLMs: A Clustering-Based Perspective (Read more on arXiv or HuggingFace) | Chenggang Li, Xiao Li, shenke18, Lucky2022, JerryXu98 | The paper introduces a Clustering-On-Difficulty (COD) framework to predict downstream task performance of Large Language Models (LLMs). The main research objective is to accurately predict LLM performance on downstream tasks prior to extensive model training, addressing the challenges of emergent abilities and uneven task difficulty distributions. The key methodology involves clustering tasks based on difficulty features, fitting performance-compute curves on predictable clusters, and mapping these predictions to the full evaluation set. The primary result is that COD achieves a mean absolute prediction error of 1.36% across eight LLM evaluation benchmarks on a 70B-parameter model. The principal implication is that AI practitioners can use COD for efficient resource allocation and monitoring during LLM training, by reliably predicting downstream task performance using smaller models. |
Scale-Distribution Decoupling: Enabling Stable and Effective Training of Large Language Models (Read more on arXiv or HuggingFace) | Ya Wang, LLIXQ, xunzhou, Taoer, BryceZhuo | Scale-Distribution Decoupling (SDD) is a novel approach that stabilizes and improves the training of large language models by separating the scale and distribution of weight matrices. The main research objective is to address training instability issues, such as gradient explosion and vanishing gradients, in large language models (LLMs), particularly in Post-Norm Transformer architectures. SDD uses a normalization mechanism to regulate activations and a learnable scaling vector to maintain well-conditioned gradients in fully-connected layers. SDD-1B achieves a training loss of 2.65, outperforming OLMo2-1B (2.70), PostNorm-1B (2.69), and DeepNorm-1B (2.72), also achieving the highest average accuracy of 54.04% across multiple downstream tasks. For AI practitioners, SDD provides a lightweight and compatible solution for stabilizing LLM training, improving convergence, and enabling more efficient large-scale pre-training. |
K-LoRA: Unlocking Training-Free Fusion of Any Subject and Style LoRAs (Read more on arXiv or HuggingFace) | Qibin Hou, Zhen Li, oyzh2005 | K-LoRA is a training-free method for merging subject and style LoRAs to generate images that preserve both characteristics. The paper's objective is to develop a method for effectively combining content and style LoRAs without requiring additional training or manual parameter tuning. The key methodology is a Top-K selection process within attention layers that identifies and selects the most representative features from each LoRA for fusion, combined with a scaling factor that prioritizes content or style at different diffusion timesteps. The method achieved a CLIP score of 69.4% and a DINO score of 46.9% for subject similarity, outperforming existing methods. AI practitioners can use K-LoRA to effectively fuse separately trained subject and style LoRAs, enabling efficient customized image generation without retraining, simplifying the process of generating images with specific content and styles. |
WebGames: Challenging General-Purpose Web-Browsing AI Agents (Read more on arXiv or HuggingFace) | Fraser, semitable, BiggieW, XanderJC, georgethomas | WebGames introduces a benchmark suite for evaluating general-purpose web-browsing AI agents. The primary objective is to assess AI limitations in web interactions using 50+ interactive challenges designed to be human-intuitive yet AI-challenging. The methodology involves evaluating vision-language models like GPT-4o, Claude Computer-Use, Gemini-1.5-Pro, and Qwen2-VL in a hermetic, client-side environment, measuring their success against human baselines. The best AI system achieved a 41.2% success rate compared to 95.7% human performance, revealing a substantial capability gap. This highlights the need for improvements in AI's ability to handle common web interaction patterns, thereby directing future development efforts for web-browsing agents by AI practitioners. |
Introducing Visual Perception Token into Multimodal Large Language Model (Read more on arXiv or HuggingFace) | wxcTest, horseee, rp-yu | This paper introduces Visual Perception Tokens to enhance Multimodal Large Language Models' (MLLMs) control over visual perception processes. The main research objective is to enable MLLMs to autonomously control their visual perception, such as selecting specific image regions or refining features. The key methodology involves designing two types of Visual Perception Tokens (Region Selection and Vision Re-Encoding) that MLLMs generate and use to trigger additional visual processing steps. Results show that adding Visual Perception Tokens to a 2B parameter model improves its average performance across various VQA tasks by 30.9%, achieving a score of 0.749 compared to 0.572 without the tokens. AI practitioners can utilize these tokens to improve MLLMs' performance in tasks requiring fine-grained visual understanding and spatial reasoning, by giving models a mechanism to actively control their visual input. |
The Lottery LLM Hypothesis, Rethinking What Abilities Should LLM Compression Preserve? (Read more on arXiv or HuggingFace) | Peijie Dong, Qian Wang, Xiang Liu, wenxinsiju, coolzhtang | This paper proposes a "lottery LLM hypothesis" suggesting that smaller, compressed large language models (LLMs) can achieve comparable performance to original LLMs using external tools and reasoning. The main research objective is to identify the essential capabilities that compressed LLMs and key-value (KV) cache compression methods should preserve to maintain performance. The methodology involves a review of recent LLM advancements (retrieval-augmented generation, external tools, multi-step reasoning, computational expressivity) and proposes a recursive multi-step reasoning algorithm (Algorithm 1) for the "lottery LLM". Primary results include showing that retrieval-augmented generation can give a compressed model equivalent performance. For instance, Table 2 shows that Llama-3-Ins8B with RAG achieves an accuracy score of 59.8 on PopQA. The principal implication for AI practitioners is to focus on preserving specific abilities, like retrieval from prompts and long-context reasoning, when developing LLM compression techniques, rather than solely focusing on perplexity or basic task accuracy. |
AAD-LLM: Neural Attention-Driven Auditory Scene Understanding (Read more on arXiv or HuggingFace) | Ashesh Mehta, Stephan Bickel, vchoudhari, susameddin, xi-j | i) AAD-LLM is a brain-computer interface that integrates neural signals with an auditory large language model to improve auditory scene understanding aligned with listener attention. ii) The main research objective is to develop a system that can process and respond to auditory scenes based on a listener's attentional focus, rather than treating all sound inputs equally. iii) The key methodology involves decoding a listener's attended speaker from intracranial electroencephalography (iEEG) recordings and integrating this information into an auditory LLM to generate responses aligned with the listener's perception. iv) AAD-LLM achieved a word error rate (WER) of 10.6% on transcribing the attended speech in a two-speaker scenario with background noise, significantly outperforming baseline models. v) AI practitioners can leverage this work to develop more human-centered auditory AI systems that prioritize listener intent, enhancing applications such as assistive hearing devices and human-computer interaction. |
Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI (Read more on arXiv or HuggingFace) | KartikAngadi, kruthika, SyedAbdul | Shakti-VLM, a family of 1B and 4B parameter vision-language models, achieves competitive multimodal performance with enhanced data efficiency through architectural innovations and a three-stage training strategy. The primary objective was to develop efficient vision-language models (VLMs) that achieve strong performance with reduced training data requirements. The methodology includes QK-Normalization, hybrid normalization, enhanced positional encoding, and a three-stage training process (text-only pretraining, vision-language alignment, and full model fine-tuning). Shakti-VLM-4B achieved 59.78% on the MMMU validation set, surpassing comparable models like Qwen2VL-7B and MiniCPM-V-2.6-8B. AI practitioners can leverage Shakti-VLM's design and training strategies to build high-performing multimodal models with significantly less computational resources and training data, especially in enterprise-scale deployments. |
Title | Authors | Summary |
---|---|---|
DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks (Read more on arXiv or HuggingFace) | Zhiyue Zhao, Mingyu Liu, Z-MU-Z, zhyya, Canyu | DICEPTION is a generalist diffusion model for various visual perception tasks like segmentation, depth, and normal estimation. The primary objective is to create a single diffusion-based model capable of performing multiple visual perception tasks efficiently, leveraging pre-trained text-to-image models. The methodology involves unifying various perception tasks as conditional image generation in RGB space, using point prompts, task prompts, and a DiT architecture. Results demonstrate performance on par with state-of-the-art models, achieving comparable results to SAM-vit-h using only 0.06% of its training data (600K vs. 1B pixel-level annotated images). AI practitioners can leverage the priors of pre-trained diffusion models to create efficient and effective multi-task visual generalist models, significantly reducing the data and computational requirements compared to conventional training from scratch. |
Thus Spake Long-Context Large Language Model (Read more on arXiv or HuggingFace) | Yuerong Song, Zhigeng Liu, Mianqiu Huang, Ruixiao Li, LiuXR | i) This survey paper presents a comprehensive overview of the long-context large language model (LLM) lifecycle. ii) The paper aims to provide a global picture of long-context LLMs, covering architectures, infrastructure, training, and evaluation technologies. iii) The methodology involves analyzing existing literature and categorizing long-context LLM technologies into architecture, infrastructure, training, and evaluation perspectives. iv) The survey showcases a spectrum of long-context technologies and identifies 10 unanswered questions currently faced by long-context LLMs; the context length of open-source LLMs grew from 2k to 2M tokens between April 2023 and February 2024. v) The principal implication is to offer AI researchers and practitioners a systematic introduction to the research landscape of long-context LLMs, highlighting key challenges and future research directions. |
Slamming: Training a Speech Language Model on One GPU in a Day (Read more on arXiv or HuggingFace) | Yossi Adi, avishai-elmakies, gallilmaimon | The paper introduces Slam, a recipe for training speech language models (SLMs) on a single GPU within 24 hours. The main research objective is to determine if high-quality SLMs can be trained using a single GPU within 24 hours. The methodology involves empirical analysis of model initialization, architecture, synthetic training data, and preference optimization, systematically ablating each training pipeline component. A key result is that the Slam recipe, utilizing a Qwen2.5-0.5B model and synthetic data, achieves a Topic-StoryCloze score of 82.04 on a single A5000 GPU. The principal implication is that AI practitioners can train high-quality SLMs with significantly reduced computational resources, improving accessibility of SLM research and development. |
Audio-FLAN: A Preliminary Release (Read more on arXiv or HuggingFace) | Shuai Fan, Zixuan Li, Jiahao Pan, Ziya Zhou, Liumeng Xue | Audio-FLAN is a large-scale instruction-tuning dataset for unified audio-language models covering 80 diverse tasks across speech, music, and sound domains. The main research objective is to create a comprehensive dataset to enable unified audio-language models to perform both understanding and generation tasks in a zero-shot manner. The key methodology involves collecting and standardizing nearly all publicly available academic audio datasets into a common instruction-based format, normalizing the heterogeneous datasets and varying instructions using LLaMA and GPT. The primary result is a dataset with approximately 80 tasks and over 100 million instances, significantly surpassing prior efforts in both quantity and diversity. AI practitioners can use Audio-FLAN to train and evaluate unified audio-language models capable of performing a wide range of understanding and generation tasks, potentially leading to models with zero-shot generalization abilities across speech, music, and other audio. |
GCC: Generative Color Constancy via Diffusing a Color Checker (Read more on arXiv or HuggingFace) | Yu-Chee Tseng, Yi-Chen Lo, Chia-Che Chang, Cheng-De Fan, Chen-Wei Chang | GCC is a method for estimating scene illumination in images by inpainting a color checker using diffusion models. The main research objective is to develop a color constancy method that generalizes well across different camera sensors without requiring sensor-specific training. The key methodology involves fine-tuning a diffusion-based inpainting model to insert a color checker into an image, then using Laplacian decomposition to maintain checker structure and extract illumination color from the inpainted checker's achromatic squares. In cross-dataset evaluations, GCC achieved a worst-25% error rate of 5.15° and 4.32° in bi-directional evaluations. AI practitioners can leverage this method to estimate the illumination with good accuracy, across a wide range of sensors without specific sensor training data. |
CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models (Read more on arXiv or HuggingFace) | Yejie Wang, Wei Zhang, Jiaheng Liu, Marcus Dong, Alexander Zhang | CodeCriticBench is a benchmark for evaluating large language models' (LLMs) ability to critique code, assessing both code generation and code question-answering tasks. The main research objective is to establish a comprehensive framework for evaluating LLMs' code critique capabilities across different dimensions and difficulty levels. The methodology involves collecting code tasks from various sources, constructing basic and advanced critique evaluation protocols, and designing fine-grained evaluation checklists. Primary results show that, on advanced evaluations, DeepSeek-R1 achieves an MSE of 3.92 on code generation, while Claude3.5-Sonnet leads in code QA with an MSE of 1.02; in general, accuracy (ACC) increased with model parameter count. The principal implication is that AI practitioners can use CodeCriticBench to systematically assess and compare the code critique performance of different LLMs, driving improvements in coding assistance tools and automated code review systems. |
Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning (Read more on arXiv or HuggingFace) | James Thorne, Jiwoo Hong, Guijin Son, Cartinoe5930 | The paper introduces MCLM, a multilingual math benchmark, and evaluates the linguistic generalizability of test-time scaling methods in mathematical reasoning. The main research question is whether test-time scaling confers cross-lingual benefits in mathematical reasoning similar to those observed with pre-training scaling. The authors test three test-time scaling methods (Outcome Reward Modeling, Process Reward Modeling, and Budget Forcing) on multilingual LLMs using a new benchmark, MCLM, featuring competition-level problems in 55 languages. A primary result is that using Qwen2.5-1.5B Math with Outcome Reward Modeling achieves a score of 35.8 on MCLM, while Budget Forcing on MR1-1.5B attains 35.2, showing that gains from test-time scaling do not consistently extend to multiple languages. The principal implication is that AI practitioners should be aware that test-time scaling methods may not generalize effectively to multilingual tasks, and improving multilingual robustness requires methods beyond simply increasing inference-time compute. |
Make LoRA Great Again: Boosting LoRA with Adaptive Singular Values and Mixture-of-Experts Optimization Alignment (Read more on arXiv or HuggingFace) | Wei Wei, Xiaoye Qu, Sichen Liu, Zhenyi Lu, Facico | GOAT enhances LoRA fine-tuning for large language models by using adaptive singular value decomposition and Mixture-of-Experts optimization alignment. The primary research question is how to mitigate the performance gap between LoRA and full fine-tuning, particularly in Mixture-of-Experts (MoE) architectures. The key methodology involves initializing LoRA MoE experts with distinct SVD segments of pre-trained weights and aligning optimization with a theoretical scaling factor derived from full fine-tuning. Primary results show that GOAT achieves 99.07% of full fine-tuning performance on image classification and outperforms all LoRA variants. The principal implication for AI practitioners is that GOAT offers a more efficient and effective fine-tuning approach, closing the performance gap with full fine-tuning while maintaining scalability. |
Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models (Read more on arXiv or HuggingFace) | Yang Zhao, Shan Jiang, Hongquan Li, Yue Fan, Qianqi Yan | The paper introduces MMIR, a new benchmark for evaluating multimodal reasoning models' ability to detect semantic inconsistencies in layout-rich visual-textual content. The main research objective is to assess how well Multimodal Large Language Models (MLLMs) can identify and reason about semantic mismatches in artifacts like webpages and slides. The key methodology involves creating 534 samples with synthetically injected errors across five reasoning-heavy categories and evaluating six state-of-the-art MLLMs. The primary result is that the proprietary model, o1, achieved the best performance with over 50% accuracy in detecting inconsistencies, significantly outperforming open-source models, which scored below 25%. The paper's principal implication is that current MLLMs need substantial advances in multimodal reasoning, particularly in handling inconsistencies, to become more reliable. |
Mobile-Agent-V: Learning Mobile Device Operation Through Video-Guided Multi-Agent Collaboration (Read more on arXiv or HuggingFace) | Ji Zhang, Ming Yan, Xi Zhang, Junyang Wang, xhyandwyy | Mobile-Agent-V is a framework that leverages video guidance to enhance mobile device automation through multi-agent collaboration. The main research objective is to address the limitations of existing mobile automation frameworks by providing rich and cost-effective operational knowledge. The key methodology involves a sliding window video input mechanism, a video agent for adaptive frame selection, and a deep-reflection agent for refining decision outputs. Primary results show that Mobile-Agent-V achieves a 30% performance improvement over existing frameworks in tasks requiring operational knowledge. The principal implication for AI practitioners is that they can use video demonstrations to effectively inject operational knowledge into mobile agents, enabling more efficient and scalable automation. |
RIFLEx: A Free Lunch for Length Extrapolation in Video Diffusion Transformers (Read more on arXiv or HuggingFace) | Chongxuan Li, Yixiao Chen, Guande He, Min Zhao, zhuhz22 | RIFLEx improves length extrapolation in video diffusion transformers by reducing a key intrinsic frequency in positional embeddings. The main research objective is to understand and mitigate the failure modes (temporal repetition and slow motion) of existing length extrapolation methods in video diffusion transformers. The key methodology is analyzing the role of frequency components in Rotary Position Embedding (RoPE) and reducing the "intrinsic frequency" component that governs repetition patterns. Primary results show that RIFLEx achieves 2x extrapolation on CogVideoX-5B in a training-free manner, with a NoRepeat Score of 54.2 and Dynamic Degree of 59.4. The principal implication is that AI practitioners can achieve high-quality length extrapolation in video generation without additional training or significant modifications to existing models by simply adjusting the intrinsic frequency in the positional encoding. |
Benchmarking Temporal Reasoning and Alignment Across Chinese Dynasties (Read more on arXiv or HuggingFace) | Deyu Zhou, Yong Jiang, Pengfei LI, Jialong Wu, wzl0228 | The paper introduces CTM, a new benchmark for evaluating temporal reasoning in large language models (LLMs) within the context of Chinese dynastic chronology. The main objective is to assess LLMs' ability to understand and align temporal relationships across various Chinese historical entities and events. The methodology involves constructing a dataset of 8,750 question-answer pairs and 60 Timeline Ito Game instances, focusing on contextualization, cross-entity relationships, and pairwise temporal alignment. Evaluation of various LLMs revealed that the Time Interval Calculation (TIC) task was the most challenging, and the best-performing model (DeepSeek-R1) achieved an accuracy of 64.02% on question answering. This suggests that CTM can provide a culturally rich resource for enhancing temporal reasoning capabilities and structured knowledge integration in large language models. |
Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation (Read more on arXiv or HuggingFace) | Sergey Levine, Xiangyu Yue, Zhuoran Yang, csuhan, yunhaif | This paper introduces Reflective Planning, a framework that enhances vision-language models (VLMs) for multi-stage, long-horizon robotic manipulation tasks by incorporating a reflection mechanism. The main research question is how to improve VLMs' physical reasoning and long-horizon planning capabilities for complex robotic manipulation. The key methodology involves using a diffusion-based dynamics model for visual look-ahead and an iterative reflection process, enabling the VLM to critique and refine its actions based on imagined future states. The proposed method, ReflectVLM, achieved an 85.4% success rate on a challenging set of manipulation tasks, significantly outperforming state-of-the-art commercial VLMs and Monte Carlo Tree Search. AI practitioners can leverage this framework to develop more robust and efficient robotic planning systems that require visual understanding and long-horizon reasoning, without extensive task-specific training. |
Stable-SPAM: How to Train in 4-Bit More Stably than 16-Bit Adam (Read more on arXiv or HuggingFace) | Xiang Li, Gaojie Jin, Zhenyu Zhang, Haotian Hu, Tianjin Huang | Stable-SPAM, a new optimizer, enhances stability in 4-bit large language model (LLM) training. The main research objective is to evaluate and improve the stability of 4-bit LLM training using recently proposed optimizers. The key methodology involves introducing Stable-SPAM, which incorporates adaptive gradient normalization (AdaGN), adaptive spike-aware clipping (AdaClip), and inherits momentum reset from SPAM. Primary results show that a 4-bit LLaMA-1B model trained with Stable-SPAM outperforms a BF16 LLaMA-1B trained with Adam by up to 2 perplexity points. The principal implication is that AI practitioners can use Stable-SPAM to achieve more stable and efficient training of LLMs with 4-bit quantization, matching or exceeding 16-bit Adam performance with significantly reduced memory and computational costs. |
Can Community Notes Replace Professional Fact-Checkers? (Read more on arXiv or HuggingFace) | Isabelle Augenstein, Desmond Elliott, gretawarren, Nadav | This research investigates the reliance of Twitter/X's Community Notes on professional fact-checking for combating misinformation. The main research questions are to what extent community notes rely on the work of professional fact-checkers and what traits characterize posts and notes that reference fact-checking sources. The researchers annotated a corpus of Twitter/X community notes using language models and performed manual annotations, classifying cited sources and identifying attributes like topic and refutation strategies. A primary result is that at least 5% of all English community notes contain an external link to professional fact-checkers, rising to 7% for notes rated as 'helpful'. This suggests that, to improve community-based moderation quality, AI practitioners could consider integrating and/or prioritizing content from verified professional fact-checking organizations within community moderation systems. |
Forecasting Open-Weight AI Model Growth on Hugging Face (Read more on arXiv or HuggingFace) | Jianxi Gao, Pin-Yu Chen, KBhandari11 | The paper adapts a scientific citation model to predict the adoption dynamics of open-weight AI models on Hugging Face. The main research question is, "Can we predict the trajectory of influence an open-weight model will have on the AI community?". The key methodology adapts Wang et al.'s citation model, using immediacy, longevity, and relative fitness parameters to track the cumulative number of fine-tuned models. The results show that most models cluster around narrow bands of parameters but models like openai/whisper-large-v3 demonstrate a high relative fitness (λi) of 528070.6635. AI practitioners can use this framework to anticipate model prominence and understand the long-term impact of open-weight models, guiding strategic decisions and governance. |
TAG: A Decentralized Framework for Multi-Agent Hierarchical Reinforcement Learning (Read more on arXiv or HuggingFace) | Balázs Kégl, Albert Thomas, Hamza Cherkaoui, Abdelhakim Benechehab, Giuseppe Paolo | TAG is a decentralized framework for constructing multi-agent hierarchical reinforcement learning systems of arbitrary depth. The main research objective is to develop a framework enabling scalable and adaptable multi-agent systems through hierarchical organization and decentralized control. The key methodology is the LevelEnv abstraction, which presents each hierarchy level as an environment to the agents above it, standardizing information flow and enabling bidirectional communication. The experiments on the MPE-Spread and VMAS Balance environments show that depth-three agents (3PPO and 2MAPPO-PPO) match the performance of a hand-designed heuristic, reported with 95% confidence intervals. AI practitioners can use TAG to build scalable multi-agent systems that decompose complex tasks across multiple hierarchical levels, improving learning efficiency and coordination without centralized control. |
VideoGrain: Modulating Space-Time Attention for Multi-grained Video Editing (Read more on arXiv or HuggingFace) | Yi Yang, Hehe Fan, Linchao Zhu, Xiangpeng Yang | VideoGrain introduces a zero-shot approach for multi-grained video editing by modulating space-time attention mechanisms in diffusion models. The main research question is: Can attention be modulated to ensure accurate distribution of each local edit's attention weights in the intended regions for multi-grained video editing? The key methodology is Spatial-Temporal Layout-Guided Attention (ST-Layout Attn), which modulates both cross-attention (for text-to-region control) and self-attention (for feature separation) within a diffusion model. The method achieves an Edit-Accuracy of 88.4, a Temporal-Consistency of 85.0 and an Overall score of 83.0 on a dataset of 76 video-text pairs. AI practitioners can leverage this method to perform precise, multi-grained video editing (class-level, instance-level, and part-level) without requiring parameter tuning or additional training data. |
Beyond Release: Access Considerations for Generative AI Systems (Read more on arXiv or HuggingFace) | Yacine Jernite, Ariel Herbert-Voss, Dan Hendrycks, Rishi Bommasani, irenesolaiman | Generative AI system access, beyond component release, determines stakeholder engagement and risk-benefit tradeoffs through resourcing, technical usability, and utility. The main research question is how accessibility of generative AI system components, beyond their mere availability, influences their use, potential risks, and benefits. The key methodology involves deconstructing access along three axes (resourcing, technical usability, and utility) and analyzing access variables for four high-performance language models (Llama 3.1 405B Instruct, DeepSeek v3, GPT-4, Claude 3.5 Sonnet). A primary result is that Llama 3.1 405B Instruct requires at least 8 NVIDIA H100 GPUs and 405 GB VRAM to run locally in 8-bit precision. Principal implication is that, for AI practitioners, release decisions must consider access variables for effective risk assessment and deployment. |
X-Dancer: Expressive Music to Human Dance Video Generation (Read more on arXiv or HuggingFace) | Chenxu Zhang, You Xie, Guoxian Song, Hongyi Xu, Zeyuan Chen | X-Dancer is a transformer-diffusion framework for generating music-driven human dance videos from a single image. The main research objective is to create diverse, long-range, and lifelike human dance videos synchronized with music, starting from a single static image. The key methodology involves a transformer that generates 2D pose sequences and a diffusion model that translates these poses into video frames. X-Dancer achieves an FVD score of 507.06 and an FID-VID of 61.94 on the authors' in-house dataset, surpassing all baselines in visual synthesis quality. AI practitioners can leverage this framework as a scalable solution for high-quality and expressive human image animation, with direct application in video content creation and customizable choreography. |
MONSTER: Monash Scalable Time Series Evaluation Repository (Read more on arXiv or HuggingFace) | Amish Mishra, Lynn Miller, Chang Wei Tan, Navid Mohammadi Foumani, angus924 | MONSTER introduces a new benchmark for time series classification using larger datasets to address limitations of current benchmarks. The main research objective is to create and evaluate a collection of large-scale time series datasets to improve benchmarking in time series classification. Key methodologies include compiling 29 univariate and multivariate datasets, processing them into a common format, and evaluating baseline methods (ConvTran, FCN, HInceptionTime, TempCNN, HYDRA, QUANT, and ET) using 5-fold cross-validation. Primary results show that QUANT achieved the lowest overall mean 0-1 loss (0.1880) across all datasets, closely followed by ConvTran (0.1954), although performance varied significantly across different data categories. The principal implication for AI practitioners is that the field has artificially disadvantaged low-bias methods, and that MONSTER can improve development and application in time series classification by enabling models to be trained on larger datasets. |
The snake in the Brownian sphere (Read more on arXiv or HuggingFace) | Grégory Miermont, Brett Kolesnik, Emmanuel Jacob, Omer Angel | The paper describes the inverse of the continuous Cori-Vauquelin-Schaeffer (CVS) bijection, mapping the Brownian sphere to the Brownian snake. The main research objective is to construct the Brownian snake as a measurable function of the Brownian sphere, thereby inverting the continuous CVS bijection. The key methodology involves using the geometric notion of a cut locus on the Brownian sphere, defining a metric on the closure of the cut locus, and leveraging the induced orientation to define a planar order. The primary result is that, given a Brownian sphere (X, d, μ) and two independent points drawn from μ, there exists a measurable function outputting an R-tree T and label function Z such that T has the law of the Continuum Random Tree (CRT), and applying the continuum CVS mapping to (T, Z) recovers (X, d, μ). The paper also proves that the orientation of the Brownian sphere has a Rademacher distribution (equal to ±1 with equal probability), independently of the random variables ψ(h). For AI/ML practitioners, software engineers, and data scientists, the Brownian snake and its associated tree structure can be measurably recovered from a given Brownian sphere, which provides new mathematical tooling and foundational understanding for models related to random planar maps. |
M3-AGIQA: Multimodal, Multi-Round, Multi-Aspect AI-Generated Image Quality Assessment (Read more on arXiv or HuggingFace) | Weiming Zhang, Wen Shen, Zhihua Wei, Kejiang Chen, Chuan Cui | M3-AGIQA is a framework for assessing AI-generated image quality using multimodal inputs, multi-round interactions, and considering multiple quality aspects. The main research objective is to develop a comprehensive method for evaluating AI-generated images (AGIs) that aligns with human perceptual judgments across quality, correspondence, and authenticity. The key methodology involves distilling multi-aspect image captioning capabilities from online Multimodal Large Language Models (MLLMs) into a local MLLM via LoRA fine-tuning, and employing an xLSTM feature extractor with a regression head to predict Mean Opinion Scores (MOSs). The method achieved a Spearman's Rank-Order Correlation Coefficient (SRCC) of 0.9045 and a Pearson Linear Correlation Coefficient (PLCC) of 0.9317 on the quality aspect of the AGIQA-3k dataset. AI practitioners can utilize this framework to more accurately and comprehensively evaluate the quality of generated images, considering multiple factors that go beyond simple perceptual quality. |
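
The SRCC and PLCC figures reported for M3-AGIQA above are standard rank and linear correlation coefficients between predicted scores and human Mean Opinion Scores. The following is a minimal sketch of how the two metrics are computed, not the authors' evaluation code; the function names and the toy score lists are hypothetical and chosen only for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC) of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank-order correlation (SRCC): Pearson applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: predictions are monotone in the human scores
# but not linearly related to them.
predicted_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
human_mos = [1.0, 4.0, 9.0, 16.0, 25.0]

srcc = spearman(predicted_scores, human_mos)
plcc = pearson(predicted_scores, human_mos)
```

On this toy data the ranking agrees perfectly, so SRCC is 1 up to rounding, while PLCC is lower because the relationship is not linear. This is why quality-assessment papers usually report both numbers.
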
Title | Authors | Summary |
---|---|---|
SurveyX: Academic Survey Automation via Large Language Models (Read more on arXiv or HuggingFace) | UglyToilet, Ki-Seki, siminniu, fan2goa1, HaruTeru | SURVEYX is a system for automated academic survey generation using Large Language Models (LLMs), designed to improve content and citation quality. The main research objective is to address limitations in existing LLM-based survey generation systems, such as finite context windows, lack of in-depth content discussion, and absence of systematic evaluation frameworks. The key methodology involves a two-phase approach (Preparation and Generation) incorporating online reference retrieval, AttributeTree pre-processing, and a re-polishing process, leveraging Retrieval Augmented Generation (RAG). Experimental results showed SURVEYX achieved a 0.259 improvement in content quality and a 1.76 enhancement in citation quality, approaching human expert performance (average content quality scores: SURVEYX: 4.590, Human: 4.754). For AI practitioners, SURVEYX provides an efficient and organized system for generating high-quality academic surveys, enhancing the information density for LLMs and optimizing their context window usage, with potential applications in various fields. |
MaskGWM: A Generalizable Driving World Model with Video Mask Reconstruction (Read more on arXiv or HuggingFace) | Rui Chen, Yuxin Guo, Jingcheng Ni, wzhgba, lyclyc52 | MaskGWM is a driving world model that combines diffusion-based generation with masked reconstruction for improved fidelity and generalization. The main research objective is to develop a more generalizable driving world model capable of long-horizon prediction and multi-view generation, surpassing existing models constrained by prediction duration and generalization. The key methodology involves a Diffusion Transformer (DiT) architecture trained with an extra mask construction task, diffusion-related mask tokens, and a row-wise cross-view module for spatial-temporal and multi-view modeling. Primary results show the model achieves a Frechet Video Distance (FVD) of 59.4 and Frechet Inception Distance (FID) of 4.0 on the nuScenes dataset without action information, outperforming the state-of-the-art. For AI practitioners, the proposed MaskGWM framework offers a more robust and scalable approach to building driving world models, enabling improved video prediction and generalization capabilities for autonomous driving applications. |
Mol-LLaMA: Towards General Understanding of Molecules in Large Molecular Language Model (Read more on arXiv or HuggingFace) | Sung Ju Hwang, Wonbin Lee, DongkiKim | i) Mol-LLaMA, a large molecular language model, is proposed for enhanced general understanding of molecules. ii) The research aims to develop a molecular language model that grasps general molecular knowledge to function as a versatile molecular assistant. iii) The methodology includes multi-modal instruction tuning with a designed dataset encompassing structural, chemical, and biological features, along with a blending module integrating information from 2D and 3D molecular encoders. iv) Experiments show Mol-LLaMA provides more accurate, detailed, and helpful responses than baseline LLMs and molecular LLMs, as well as improved performance on molecular property prediction, achieving high accuracy while maintaining high fidelity and helpfulness scores on the PAMPA task. v) The model provides AI/ML practitioners with a new foundation for building general-purpose molecular assistants capable of explaining molecular features and rationales, enhancing molecular analysis. |
LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context Memory of Transformers (Read more on arXiv or HuggingFace) | Polina Druzhinina, Elizaveta Goncharova, Temurbek Rahmatullaev, Matvey Mikhalchuk, Anton Razzhigaev | i) This paper introduces methods to quantify and visualize how LLMs encode contextual information, focusing on the role of punctuation. ii) The main research question is how seemingly minor tokens impact the contextual memory of transformer-based LLMs. iii) The methodology involves measuring token-level nonlinearity, contextualization through prefix reconstruction, and intermediate layer analysis via a modified Logit Lens. iv) The results show that removing stopwords, articles, and commas consistently degrades performance on MMLU and BABILong-4k, and they identify a correlation between linearity and contextualization. v) AI practitioners should note the counterintuitive finding that "filler" tokens carry significant contextual information affecting performance on tasks requiring knowledge and long-context reasoning. |
PhotoDoodle: Learning Artistic Image Editing from Few-Shot Pairwise Data (Read more on arXiv or HuggingFace) | Xueyin Wang, Hailong Guo, Yuxuan Zhang, Yiren Song, Shijie Huang | PhotoDoodle is an image editing framework for photo doodling using few-shot learning. The research objective is to enable artists to overlay decorative elements onto photographs while maintaining background consistency and artistic style, addressing challenges in seamless integration, background preservation, and efficient style capture from limited data. The methodology employs a two-stage training strategy: first pre-training a general image editing model (OmniEditor), then fine-tuning it with EditLoRA on artist-curated before-and-after image pairs, with positional encoding reuse. Experiments on the proposed PhotoDoodle dataset demonstrated advanced performance in customized image editing, achieving a CLIP score of 0.279 and a GPT score of 63.207. The principal implication is that the framework provides a customizable image editing approach that can learn and transfer artistic styles from limited data, offering a potential solution for high-quality, consistent image manipulation in artistic creation. |
VLM$^2$-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues (Read more on arXiv or HuggingFace) | Yi R., Paul Pu Liang, Renjie Pi, RainJamesY, Sterzhang | i) The paper introduces VLM$^2$-Bench, a new benchmark to evaluate vision-language models' ability to visually link matching cues across multiple images or frames. ii) The research aims to assess whether VLMs can effectively associate visual cues to identify correspondences without external knowledge. iii) The methodology involves creating a dataset of over 3,000 test cases across nine subtasks categorized by general, object-centric, and person-centric cues, and then evaluating various VLMs. iv) Evaluations show a significant gap between even GPT-4o (60.36%) and human-level accuracy (95.16%), indicating challenges in visually linking cues. v) The benchmark and identified challenges imply that AI practitioners need to develop VLMs with enhanced visual understanding and reasoning capabilities, focusing on reducing reliance on prior knowledge and improving cue association. Some parts of the paper lack clarity about the specific data creation process. |
SIFT: Grounding LLM Reasoning in Contexts via Stickers (Read more on arXiv or HuggingFace) | Zhijie Deng, Boxiu Li, Xuyao Huang, Zihao Zeng | SIFT is a post-training approach that improves large language models' (LLMs) reasoning by grounding it in the provided context using model-generated summaries called "Stickers." The main research objective is to address the issue of "factual drift," where LLMs misinterpret or overlook key information in the input query during reasoning. The key methodology is a post-training approach called "Stick to the Facts" (SIFT), which involves generating a "Sticker" summarizing key facts, performing consensus prediction using the Sticker and the original query, and refining the Sticker via forward and inverse optimization. A primary result is that SIFT improves the pass@1 accuracy of DeepSeek-R1 on AIME2024 from 78.33% to 85.67%. The principal implication is that AI practitioners can improve model accuracy, particularly on complex reasoning tasks, using sticker-based, factual grounding. |
LightThinker: Thinking Step-by-Step Compression (Read more on arXiv or HuggingFace) | Mengshu Sun, Yuqi Zhu, Jintian Zhang, Ningyu, GoooDte | LightThinker is a method that enables LLMs to dynamically compress intermediate thoughts during reasoning to improve efficiency. The main research objective is to reduce the memory and computational costs of LLMs during complex reasoning tasks without sacrificing performance. The key methodology involves training the model to compress verbose thought steps into compact representations using gist tokens and specialized attention masks, quantified by a new "Dependency" metric. Primary results show that with the Qwen model, LightThinker reduces peak token usage by 70% and inference time by 26% compared to the Vanilla model, while maintaining comparable accuracy (with only a 1% drop). The principal implication for AI practitioners is that LightThinker offers a new approach for improving LLM inference efficiency in complex reasoning, providing a balance between accuracy and computational cost, though there is significant performance degradation on Llama series models. |
StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction Following (Read more on arXiv or HuggingFace) | Yuan Wu, Yi Chang, Yue Wang, Jinzhe Li, Jinnan Li | The paper introduces StructFlowBench, a new benchmark for evaluating multi-turn instruction-following capabilities of large language models (LLMs). The main research objective is to assess LLMs' ability to understand and maintain structural dependencies between dialogue turns, beyond simple constraint satisfaction. The key methodology involves defining a structural flow framework with six inter-turn relationship types and creating a dual-constraint evaluation system combining intra-turn and structural constraints. Evaluations of 13 LLMs revealed that the DeepSeek-v3 model achieved the highest Weighted Constraint Satisfaction Rate (WCSR) of 0.98. The principal implication for AI practitioners is the need to develop LLMs that better handle complex dialogue structures, particularly refinements, to improve performance in real-world multi-turn conversational applications. |
KITAB-Bench: A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding (Read more on arXiv or HuggingFace) | Ghazi Ahmed, Rania Hossam, Abdullah Sohail, mukul54, ahmedheakl | KITAB-Bench introduces a new benchmark for evaluating Arabic OCR and document understanding systems. The main research objective is to address the lack of comprehensive evaluation frameworks for Arabic OCR, which lags behind English OCR due to the script's unique challenges. The key methodology involves curating a diverse dataset of 8,809 samples across 9 domains and 36 sub-domains, including handwritten text, tables, and charts, and evaluating various OCR systems and Vision-Language Models (VLMs) on tasks like text recognition, layout detection, and PDF-to-Markdown conversion. A primary result is that modern VLMs (e.g., GPT-4, Gemini) outperform traditional OCR approaches (e.g., EasyOCR, PaddleOCR) by an average of 60% in Character Error Rate (CER), but the best model (Gemini-2.0-Flash) achieves only 65% accuracy in PDF-to-Markdown conversion. AI practitioners can use KITAB-Bench to rigorously evaluate and improve Arabic document analysis methods, and to focus efforts on bridging the performance gap with English OCR, particularly in complex tasks like accurate structured content extraction from PDF documents. |
InterFeedback: Unveiling Interactive Intelligence of Large Multimodal Models via Human Feedback (Read more on arXiv or HuggingFace) | Mike Zheng Shou, Haiyang Mei, Yifei Tao, Wenqi Pei, Henry Hengyuan Zhao | InterFeedback, a framework and benchmark, is introduced to evaluate the interactive intelligence of Large Multimodal Models (LMMs) using human feedback. The main research question is: "How do Large Multimodal Models perform with human feedback?" The key methodology involves an interactive framework, InterFeedback, that uses leading LMMs like GPT-4o to simulate human feedback, tested on datasets like MMMU-Pro and MathVerse. Results show that state-of-the-art LMMs (e.g., OpenAI-o1) can correct their results through human feedback less than 50% of the time. The principal implication for AI practitioners is the need to develop methods that enhance LMMs' capabilities to interpret and benefit from feedback, as current models demonstrate suboptimal performance in this area. |
Evaluating Multimodal Generative AI with Korean Educational Standards (Read more on arXiv or HuggingFace) | Geewook Kim, sangheeeee | This paper introduces KoNET, a new benchmark for evaluating Multimodal Generative AI systems using Korean national educational tests. The main research objective is to assess the performance of Multimodal Generative AI systems across different educational levels in the Korean language. The methodology involves evaluating various open-source, open-access, and closed API models on four Korean educational exams (KoEGED, KoMGED, KoHGED, and KoCSAT) using a multimodal VQA format, and comparing their performance with human error rates. The primary results show that the EXAONE-3.0-7.8B-Instruct model achieved a KoNET score of 45.5 and that model accuracy generally decreases with more advanced curricula; closed-source APIs also far outperformed open-source models. The principal implication for AI practitioners is that benchmarks centered solely on English may not accurately assess AI performance in non-English language environments, highlighting a need for language-specific benchmarks and models. |
Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path? (Read more on arXiv or HuggingFace) | Pietro Greiner, Joumana Ghosn, Damiano Fornasiere, Michael Cohen, Yoshua Bengio | This paper proposes "Scientist AI," a non-agentic AI design, as a safer alternative to increasingly capable generalist agentic AI systems that pose catastrophic risks. The main research objective is to design a non-agentic AI that is trustworthy and safe by design, minimizing risks associated with uncontrolled agentic AI. The key methodology is a Bayesian approach with a world model generating causal theories and an inference machine for probabilistic question answering, operating with explicit uncertainty quantification. The paper argues, at a conceptual level, that as training data, objectives, and models scale for agentic AI, goal misgeneralization becomes more likely; in contrast, the proposed non-agentic design is argued to improve in safety and accuracy with additional computing power. For AI practitioners, the principal implication is that focusing development on non-agentic AI, specifically "Scientist AI," may enable benefits of AI innovation while avoiding risks associated with the current agent-driven trajectory. |
The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer (Read more on arXiv or HuggingFace) | Vincent Ginis, Andres Algaba, Marthe Ballon | The research investigates reasoning token usage versus accuracy in different generations of OpenAI language models. The main research question is whether more capable models within a single family require a longer chain-of-thought (more reasoning tokens) to achieve higher performance, or whether they reason more effectively. The key methodology involves a systematic analysis of chain-of-thought length and accuracy across o1-mini and o3-mini variants on the Omni-MATH benchmark, using logistic regression to quantify effects. The primary results are that o3-mini (m) achieves superior accuracy without requiring longer reasoning chains than o1-mini, and that accuracy generally declines as reasoning chains grow, at a rate that diminishes as proficiency goes up; specifically, accuracy decreased by 3.16% per 1000 reasoning tokens for o1-mini and by 1.96% for o3-mini (m). The principal implication is that, for mathematical reasoning tasks, constraining the chain-of-thought might be beneficial for weaker models, while newer models reason more efficiently. |
ReQFlow: Rectified Quaternion Flow for Efficient and High-Quality Protein Backbone Generation (Read more on arXiv or HuggingFace) | Hongteng Xu, EatEatEatEat, AngxiaoYue | ReQFlow is a novel method for fast and high-quality protein backbone generation using rectified quaternion flows. The main research objective is to develop a generative model that can efficiently produce designable protein backbones, overcoming limitations of existing diffusion and flow-based models. The key methodology involves representing 3D rotations with unit quaternions, constructing a quaternion flow (QFlow) via spherical linear interpolation (SLERP) in exponential format, and rectifying the QFlow to accelerate inference and improve designability. The primary results show that ReQFlow achieves state-of-the-art performance in protein backbone generation, requiring significantly fewer sampling steps and less inference time; for example, it is 37x faster than RFDiffusion when generating a backbone of length 300. Principal implication for AI practitioners is that ReQFlow provides a more efficient and effective approach to protein backbone generation, improving upon existing methods in both speed and the quality of generated structures. |
MoBA: Mixture of Block Attention for Long-Context LLMs (Read more on arXiv or HuggingFace) | Tao Jiang, Yulun Du, Jingyuan Liu, Zhejun Jiang, Enzhe Lu | MoBA is a novel attention mechanism for LLMs that improves efficiency and scalability for long contexts by applying Mixture-of-Experts principles to block-wise attention. The main research objective is to design a robust attention architecture that can seamlessly transition between full and sparse attention without compromising performance and allowing the model to attend autonomously. The key methodology is partitioning the context into blocks and using a gating mechanism to route query tokens to the most relevant blocks, based on a computed affinity score. Primary results show that MoBA achieves comparable performance to full attention on language modeling tasks, with a validation loss difference within 1e-3, while achieving up to a 6.5x speedup when prefilling 1M tokens. For AI practitioners, MoBA offers a practical solution for enhancing long-context capabilities in LLMs with improved computational efficiency and seamless integration with existing pre-trained models. |
One-step Diffusion Models with | Arash Vahdat, Weili Nie, Yilun Xu | The paper introduces f-distill, a framework for distilling diffusion models into one-step generators by minimizing f-divergences between teacher and student distributions. The main research objective is to generalize distribution matching distillation with f-divergences, enabling different trade-offs between mode coverage and training variance. The key methodology involves deriving the gradient of the f-divergence between teacher and student distributions and expressing it as a weighted score difference, using a weighting function determined by density ratio and the chosen f-divergence. Primary results show that f-distill, using Jensen-Shannon divergence, achieves a state-of-the-art one-step FID score of 1.16 on ImageNet-64. The principal implication for AI practitioners is that they can leverage f-distill to create efficient one-step image generators with improved sample quality and control over mode coverage, surpassing previous variational score distillation methods. |
Think Inside the JSON: Reinforcement Strategy for Strict LLM Schema Adherence (Read more on arXiv or HuggingFace) | Viktoria Rojkova, Ishan Joshi, Bhavik Agarwal | The paper introduces "Think Inside the JSON," a reinforcement learning framework for training LLMs to adhere strictly to predefined JSON schemas. The main research objective is to develop a method for enforcing strict schema adherence in LLM text generation, specifically for structured data output. The key methodology combines synthetic data generation, a novel reinforcement learning pipeline using Group Relative Policy Optimization (GRPO) with custom rewards, and supervised fine-tuning. This approach achieves a 62.41% mean match rate on a structured data extraction benchmark, with a 0.27% mean noise rate, outperforming distilled versions of DeepSeek R1 and Gemini 2.0 Flash. For AI practitioners, this provides a resource-efficient method to enforce schema constraints in LLM outputs, valuable for applications requiring high data integrity and compliance. |
CrossOver: 3D Scene Cross-Modal Alignment (Read more on arXiv or HuggingFace) | Iro Armeni, Daniel Barath, Marc Pollefeys, Ondrej Miksik, sayandsarkar | CrossOver is a framework for 3D scene understanding that aligns modalities like images, point clouds, and CAD models via a modality-agnostic embedding space. The main research objective is to achieve flexible, scene-level cross-modal alignment in 3D environments without requiring complete data or rigid alignment across all modalities. The key methodology involves using dimensionality-specific encoders, a three-stage training pipeline (object-level, scene-level, unified encoders), and contrastive learning to create a unified embedding space. Results on ScanNet and 3RScan datasets show superior performance, achieving a scene-level matching recall of 99.31% (R@25) on ScanNet for the I → R modality. The principal implication is that AI practitioners can leverage CrossOver for robust 3D scene understanding and cross-modal retrieval tasks, even with incomplete or unaligned multi-modal data, removing the requirement of full data alignment. |
Beyond No: Quantifying AI Over-Refusal and Emotional Attachment Boundaries (Read more on arXiv or HuggingFace) | Grant Rosario, David Noever | The paper introduces a benchmark and evaluation framework for assessing emotional boundary handling in Large Language Models (LLMs). The main research objective is to quantify and analyze "over-refusal" in LLMs when responding to user prompts that attempt to establish emotional connections or relationships. The key methodology involves a dataset of 1156 prompts across six languages, evaluating three LLMs (GPT-4o, Claude-3.5 Sonnet, and Mistral-large) using pattern-matched response analysis across seven key patterns. A primary result is that Claude-3.5 achieved the highest overall score (8.69/10), and a significant performance gap was found between English (average score 25.62) and non-English interactions (≤ 0.22). The principal implication for AI practitioners is the need to develop more nuanced, multilingual emotional intelligence and boundary-setting capabilities in LLMs, addressing over-refusal while maintaining ethical and safety standards. |
JL1-CD: A New Benchmark for Remote Sensing Change Detection and a Robust Multi-Teacher Knowledge Distillation Framework (Read more on arXiv or HuggingFace) | Jingyu Ma, Yuanxiu Zhou, Long Gao, Ruifei Zhu, circleLZY | JL1-CD introduces a new dataset and a multi-teacher knowledge distillation framework for remote sensing change detection. The main research objective is to address the scarcity of high-resolution, all-inclusive change detection datasets and improve model performance across varying change area ratios. The key methodology involves constructing the JL1-CD dataset, proposing an Origin-Partition (O-P) training strategy, and developing a Multi-Teacher Knowledge Distillation (MTKD) framework. Results show that the MTKD framework, when applied to the Changer-MiT-b1 model, achieves an mIoU of 76.15% on the JL1-CD dataset. The principal implication for AI practitioners is that MTKD can enhance the performance of change detection models without increasing inference cost, which is particularly beneficial when the data has a diverse range of change area ratios. |
UPCORE: Utility-Preserving Coreset Selection for Balanced Unlearning (Read more on arXiv or HuggingFace) | Mohit Bansal, Elias Stengel-Eskin, vaidehi99 | UPCORE is a method-agnostic data selection framework that mitigates collateral damage in machine unlearning by pruning outliers from the forget set. The main research objective is to determine how measurable attributes of the forget set drive collateral effects during unlearning and whether these attributes can be controlled to optimize the deletion effectiveness/model utility trade-off. The key methodology involves using Isolation Forests to identify and prune high-variance outlier data points in the forget set's hidden state representations, forming a lower-variance "core" forget set used for unlearning. Primary results show that UPCORE achieves a higher area-under-the-curve (AUC) score (0.387) compared to unlearning on the complete set (0.343) and random subset (0.353) using Gradient Ascent, across standard metrics, indicating improved balance between deletion and utility preservation. AI practitioners can use UPCORE to minimize negative side effects when removing data or capabilities from trained models, leading to more robust and reliable unlearning processes. |
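
The UPCORE row above describes pruning outliers from the forget set before unlearning. The sketch below illustrates only the general "keep a low-variance core" idea. It deliberately swaps in a simple distance-to-centroid heuristic in place of the Isolation Forests the summary mentions, and it operates on toy 2-D points rather than hidden-state representations. The function name, the `keep_fraction` parameter, and the data are all hypothetical.

```python
from math import dist


def prune_outliers(vectors, keep_fraction=0.8):
    """Return sorted indices of the points closest to the centroid.

    Stand-in heuristic (an assumption, not UPCORE's method): rank points by
    Euclidean distance to the centroid and keep the closest `keep_fraction`.
    """
    if not vectors:
        return []
    count = len(vectors)
    dims = len(vectors[0])
    centroid = [sum(v[d] for v in vectors) / count for d in range(dims)]
    ranked = sorted(range(count), key=lambda i: dist(vectors[i], centroid))
    keep = max(1, round(keep_fraction * count))
    return sorted(ranked[:keep])


# Hypothetical forget set: four tightly clustered points and one far outlier.
forget_set = [(0.0, 0.1), (0.1, 0.0), (-0.1, 0.0), (0.0, -0.1), (5.0, 5.0)]
core_indices = prune_outliers(forget_set, keep_fraction=0.8)
core_forget_set = [forget_set[i] for i in core_indices]
```

With these values the distant point at index 4 is dropped and the four clustered points form the core set. A tree-based detector such as an Isolation Forest would replace the distance ranking in a closer reproduction.
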
Title | Authors | Summary |
---|---|---|
SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines (Read more on arXiv or HuggingFace) | Liam-Liu, kangz, aaabiao, BingliW, mkj69 | i) SuperGPQA is a challenging new benchmark for evaluating large language model knowledge and reasoning across 285 graduate-level disciplines, built with a human-LLM collaborative filtering mechanism. ii) Main research question/objective: To assess the capabilities of LLMs across a wide range of specialized, graduate-level academic disciplines, exceeding the scope of existing benchmarks. iii) Key methodology: A human-LLM collaborative filtering system was employed, involving crowd-sourcing annotators, experts, and SOTA LLMs with iterative refinement of questions based on LLM responses and expert feedback, followed by a 3-stage quality inspection process. iv) Primary results: The reasoning-focused model DeepSeek-R1 achieved the highest accuracy of 61.82% on SuperGPQA, demonstrating significant room for improvement for current LLMs. v) Principal implication for AI practitioners: The benchmark reveals a substantial gap between current LLM capabilities and graduate-level human expertise, highlighting the need for developing models with enhanced reasoning and specialized domain knowledge to advance research towards Artificial General Intelligence. |
MLGym: A New Framework and Benchmark for Advancing AI Research Agents (Read more on arXiv or HuggingFace) | Nikolay Bashlykov, Nicholas Roberts, Lovish Madaan, rraileanu, dnathani | MLGYM introduces a new Gym environment and an accompanying benchmark, MLGYM-Bench, for evaluating and developing LLM agents on 13 diverse, open-ended AI research tasks. The main research objective is to create a standardized framework for evaluating LLM agents on their ability to perform realistic AI research tasks, enabling research on reinforcement learning algorithms. The key methodology is a Gym environment that integrates diverse AI research tasks, allowing agents to interact with a shell environment using tools, with performance evaluated via task-specific scripts. A primary result is that OpenAI's O1-preview model achieved the highest Best Submission AUP@4 score of 1.176 across all tasks, followed by Gemini-1.5-Pro at 1.125. AI practitioners can use MLGYM to develop and assess AI research agents, to train such agents with different algorithms such as reinforcement learning, and to drive progress in automating complex machine-learning research workflows. |
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features (Read more on arXiv or HuggingFace) | Xiao Wang, talfanevans, ibomohsin, AlexeyG, mitsch | SigLIP 2, a family of multilingual vision-language encoders, improves upon SigLIP with enhanced semantic understanding, localization, and dense features. The main research objective is to develop vision-language encoders that outperform existing models, including SigLIP, across various tasks while supporting multiple languages. The key methodology involves combining the original SigLIP training recipe with decoder-based pretraining, self-distillation, masked prediction, and online data curation, applied in a staged training approach. Primary results show that SigLIP 2 outperforms SigLIP and other open-weight baselines on ImageNet zero-shot classification; for example a SigLIP 2 B/16 model achieves 79.1% accuracy compared to SigLIP's 76.7% at 256x256 resolution. AI practitioners can leverage SigLIP 2's improved encoders for enhanced performance in vision-language tasks, particularly benefiting from multilingual capabilities, strong dense features, and backward compatibility with SigLIP. |
S*: Test Time Scaling for Code Generation (Read more on arXiv or HuggingFace) | Shangyin Tan, Xiuyu Li, Chengkun Cao, Dacheng Li, eva98 | S* is a hybrid test-time scaling framework that improves code generation by combining parallel and sequential scaling with adaptive input synthesis for selection. The main research objective is to improve the coverage and selection accuracy of generated code by extending existing test-time scaling paradigms. The key methodology involves augmenting parallel sampling with sequential scaling via iterative debugging, and introducing a novel selection mechanism that adaptively generates distinguishing inputs for pairwise comparison of candidate solutions, grounded in execution results. Results show that S* consistently improves performance across 12 Large Language Models, with DeepSeek-R1-Distill-Qwen-32B achieving 85.7% on LiveCodeBench, approaching o1 (high) at 88.5%. The principal implication for AI practitioners is that combining parallel and sequential scaling with execution-grounded adaptive input synthesis during test-time significantly improves code generation performance, enabling smaller or instruction-based models to surpass larger or reasoning models. |
How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM? (Read more on arXiv or HuggingFace) | Vasily Konovalov, Daniil Moskovskiy, Maria Marina, msalnikov, memyprokotow | This paper investigates how much new factual knowledge can be incorporated into a Large Language Model (LLM) using Low-Rank Adaptation (LoRA) without compromising pre-existing knowledge. The main research objective is to determine the extent to which new facts can be integrated into an LLM via a LoRA adapter while preserving general capabilities. The key methodology involves fine-tuning a Llama-3.1-8B-Instruct model using LoRA with varying amounts of new knowledge (DBpedia triples) and evaluating performance on external benchmarks (MMLU, TruthfulQA) and internal metrics (knowledge shifts). A primary result is that a model trained on 500 unknown facts achieved 100% reliability on the test set, and that training with additional highly-known data minimized negative knowledge shifts; however, models trained with 10 added HighlyKnown or paraphrased samples showed a significant drop in MMLU accuracy. The principal implication for AI practitioners is that while LoRA is effective for incorporating new knowledge, new knowledge integration trades off against truthfulness and general reasoning capabilities, so the composition of the training data requires careful consideration. |
Does Time Have Its Place? Temporal Heads: Where Language Models Recall Time-specific Information (Read more on arXiv or HuggingFace) | Jaewoo Kang, Minbyul Jeong, Jungwoo Park, Chanwoong Yoon, Yein Park | Language models possess specialized attention heads, termed "Temporal Heads," that are primarily responsible for processing time-specific factual knowledge. The research objective is to identify and analyze the mechanisms within large language models (LLMs) that handle temporally-changing facts. The methodology utilizes Circuit Analysis, specifically Temporal Knowledge Circuits and attention head ablation, to isolate and evaluate the contribution of specific attention heads. Ablating the identified Temporal Heads reduced temporal knowledge accuracy in Llama2 by 3-9%, while performance on time-invariant tasks remained unchanged. AI practitioners can leverage the identified Temporal Heads to edit or control temporal aspects of LLM outputs with minimal retraining. |
LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models (Read more on arXiv or HuggingFace) | Jifan Yu, Yushi Bai, Daniel Zhang-Li, Yucheng Wang, Shangqing Tu | LongWriter-V enhances vision-language models (VLMs) for generating ultra-long, high-fidelity text from visual inputs. The main research objective is to address the limitation of existing VLMs in generating coherent outputs beyond 1,000 words, despite their ability to process long visual and textual contexts. Key methodology involved creating a new dataset, LongWriter-V-22k, with 22,158 examples of multi-image inputs and long text outputs (up to 10,000 words), and proposing IterDPO, a modified direct preference optimization method for long text. Primary results show that the 7B parameter model trained with LongWriter-V-22k and IterDPO outperformed larger proprietary models like GPT-4o on the MMLongBench-Write benchmark, achieving an overall score of 84.6, including component scores of 86.2 (length) and 82.9 (quality). Principal implication for AI practitioners is that using specialized datasets with long-output examples and iterative preference optimization can significantly improve the long-text generation capabilities of VLMs, enabling more effective real-world applications requiring detailed visual descriptions or reports. |
Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning (Read more on arXiv or HuggingFace) | Yuqian Hong, Haoming Luo, Qingnan Ren, Zitian Gao, Tian Xie | Logic-RL explores rule-based reinforcement learning (RL) to enhance reasoning in large language models (LLMs) using synthetic logic puzzles. The main research objective is to investigate if rule-based RL can improve LLM reasoning abilities and generalization to unseen tasks. The key methodology involves training a 7B parameter LLM with a modified REINFORCE++ algorithm, using a system prompt, a stringent format reward, and procedurally generated Knights and Knaves logic puzzles. The primary result is that after training on 5,000 logic problems, the model improved by 125% on the AIME math benchmark and 38% on the AMC, demonstrating cross-domain generalization. For AI practitioners, this demonstrates that RL, even with limited synthetic data, can significantly enhance an LLM's abstract reasoning and generalization capabilities, offering a potentially more effective approach than supervised fine-tuning for specialized reasoning tasks. |
PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC (Read more on arXiv or HuggingFace) | Junyang Wang, Yuyang Wanyan, Haiyang Xu, Xi Zhang, Haowei Liu | PC-Agent is a hierarchical multi-agent framework designed to automate complex tasks on PCs by improving perception and decision-making. The main research objective is to develop a system that can handle complex user instructions and interdependent sub-tasks in PC environments, overcoming limitations of existing methods in perception and workflow management. The key methodology is a hierarchical multi-agent collaboration architecture that decomposes decision-making into Instruction-Subtask-Action levels, with specialized agents (Manager, Progress, Decision, Reflection) and an Active Perception Module (APM). The primary result is that PC-Agent achieved a 56.0% task success rate on the PC-Eval benchmark, a 32% absolute improvement over previous state-of-the-art methods. Principal implication for AI practitioners is that the proposed framework significantly enhances the capability of agents to automate real-world, complex tasks on PCs. |
S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning (Read more on arXiv or HuggingFace) | Jiaqi Chen, Xingyan Liu, Cheng Liu, Peisong Wang, Ruotian Ma | S$^2$R is a framework that enhances Large Language Model (LLM) reasoning by teaching models to self-verify and self-correct during inference via reinforcement learning. The main research objective is to develop an efficient framework that improves LLM reasoning abilities, particularly in mathematical problem-solving, without requiring large-scale data or extensive training. The key methodology involves initializing LLMs with self-verification and self-correction behaviors through supervised fine-tuning, then strengthening these skills using outcome-level and process-level reinforcement learning. Results demonstrate that a Qwen2.5-math-7B model, trained with only 3.1k initialization samples, achieved an accuracy improvement from 51.0% to 81.6% on the MATH500 test set. For AI practitioners, this implies that implementing self-verification and self-correction via reinforcement learning offers a resource-efficient approach to substantially improve the mathematical reasoning capabilities of LLMs, potentially using process-level RL for weaker base models and outcome-level RL for stronger ones. |
Discovering highly efficient low-weight quantum error-correcting codes with reinforcement learning (Read more on arXiv or HuggingFace) | Zi-Wen Liu, basil2115 | This paper introduces a reinforcement learning (RL) based method for discovering highly efficient low-weight quantum error-correcting (QEC) codes. The main research objective is to develop a method that optimizes the weight of measurements in stabilizer codes while preserving code distance, targeting practically relevant parameter regimes. The key methodology is a Proximal Policy Optimization (PPO) RL algorithm with action masking, operating on Tanner graphs of stabilizer codes, guided by a reward function that balances node degree reduction and code distance preservation. A primary result is that the RL-based method achieves up to a 73x reduction in physical qubit overhead compared to previous weight reduction methods like Sabo et al. (for a [[1109,9,14]] code). AI practitioners can adapt this RL framework to design low-weight QEC codes with constraints tailored to specific quantum computing architectures, potentially accelerating the implementation of fault-tolerant quantum technologies. |
Dynamic Concepts Personalization from Single Videos (Read more on arXiv or HuggingFace) | Aliaksandr Siarohin, Willi Menapace, Ivan Skorokhodov, Or Patashnik, Rameen Abdal | The paper introduces "Set-and-Sequence," a framework for personalizing text-to-video models with dynamic concepts from single videos, enabling high-fidelity generation, editing, and composition. The main objective is to personalize diffusion transformer-based generative video models to capture dynamic concepts, defined by both appearance and motion, from single video examples. The key methodology is a two-stage LoRA training process: (i) "Identity Basis" learning using an unordered set of frames to capture appearance, and (ii) "Motion Residual" encoding using the full video sequence to capture motion dynamics, implemented within a shared spatio-temporal weight space. In editing tasks, the proposed method achieved a mean squared error (MSE) of 0.0221, an identity preservation (ID) score of 0.680, a clip text similarity (C-T) score of 0.239 and a temporal coherency (TC) score of 0.9972. AI practitioners can leverage this framework to embed personalized dynamic concepts into video generation models, improving control over both appearance and motion for enhanced editing and composition capabilities. |
Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation (Read more on arXiv or HuggingFace) | Luca Weihs, Tanmay Gupta, Matt Deitke, Ajay Patel, Yue Yang | The paper introduces CoSyn, a framework for generating synthetic text-rich multimodal data to improve vision-language model (VLM) performance. Main research question or objective: Can leveraging the coding capabilities of text-only large language models (LLMs) automatically generate synthetic text-rich multimodal data to address the limited availability of such data for training VLMs? Key methodology used: The CoSyn framework prompts LLMs to generate code (e.g., Python, HTML, LaTeX) that renders synthetic images, and uses this code as a textual representation to create instruction-tuning data. Primary results: Models trained on CoSyn synthetic data achieved state-of-the-art performance among competitive open-source models on seven text-rich image benchmarks, and models trained on synthetic data boosted average accuracy by 3.6%. Principal implication for AI practitioners: AI practitioners can use the CoSyn framework to generate targeted synthetic text-rich data efficiently, improving VLM performance in specific domains and mitigating the limitations of scarce real-world data. |
AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO (Read more on arXiv or HuggingFace) | Dinh Bach Vu, Alan Dao | AlphaMaze trains large language models (LLMs) on tokenized maze representations to improve spatial reasoning for navigation. The research investigates how to equip standard LLMs with visual reasoning abilities for maze navigation using a two-stage training framework. The methodology combines Supervised Fine-Tuning (SFT) on tokenized maze data and Group Relative Policy Optimization (GRPO) with a custom reward function. Results show the SFT-trained model achieved 86% accuracy on a maze navigation benchmark, which increased to 93% after GRPO fine-tuning. AI practitioners can leverage this two-stage training approach (SFT and GRPO) with tokenized visual representations to enhance LLMs' spatial reasoning capabilities in tasks requiring sequential decision-making. |
How Much Do LLMs Hallucinate across Languages? On Multilingual Estimation of LLM Hallucination in the Wild (Read more on arXiv or HuggingFace) | Goran Glavaš, Anne Lauscher, saadob12 | This paper investigates the extent of hallucination in large language models (LLMs) across 30 languages in open-domain, knowledge-intensive question answering. The main research question is: How frequently do LLMs hallucinate across different languages and model sizes in a "real-world" question-answering setting, and how does this relate to language resource availability? Key methodology: The researchers trained a multilingual hallucination detection model using machine-translated English data and created a multilingual evaluation dataset (MFAVA) with LLM-generated and human-annotated examples. They then estimated hallucination rates for six open-source LLM families across 30 languages using a novel protocol based on the detection model's performance. Primary results: Smaller LLMs and those supporting more languages exhibited significantly higher hallucination rates. The average hallucination rate across languages varied from 7% to 12%. However, there was no correlation between language-normalized hallucination rates and digital language representation. Principal implication for AI practitioners: AI practitioners should be aware that smaller LLM model sizes and models designed for broad multilingual support may be more prone to generating non-factual or unfaithful content in question-answering tasks, necessitating careful model selection and potentially requiring additional mitigation strategies. |
Geolocation with Real Human Gameplay Data: A Large-Scale Dataset and Human-Like Reasoning Framework (Read more on arXiv or HuggingFace) | Zeyu Zhang, Jonathan Tonglet, Yuan Huang, Jingpu Yang, Ziruibest | This paper introduces a new geolocation framework, including a large-scale dataset, a novel reasoning method, and an evaluation metric, to address challenges in image geolocation. The main research objective is to improve the accuracy and interpretability of image geolocation using real human gameplay data and a human-like reasoning approach. The key methodology involves collecting data from a geolocation game platform (GeoComp dataset), proposing a multi-step reasoning framework (Geographical Chain-of-Thought, GeoCoT), and developing an evaluation metric (GeoEval). The primary results show that GeoCoT improves geolocation accuracy by up to 25% compared to existing methods, achieving a city-level accuracy of 0.118. AI practitioners can leverage the GeoComp dataset and GeoCoT framework to develop and evaluate more robust and interpretable geolocation models, particularly for applications requiring fine-grained localization and human-like reasoning. |
RelaCtrl: Relevance-Guided Efficient Control for Diffusion Transformers (Read more on arXiv or HuggingFace) | Zhanjie Zhang, Jiasong Feng, Ao Ma, Jing Wang, Ke Cao | RelaCtrl is a framework for efficient controllable generation in Diffusion Transformers, optimizing the integration of control signals. The main objective is to address the high parameter and computational overhead of existing controlled diffusion transformer methods, and their inefficient resource allocation. The key methodology involves evaluating layer relevance to control information using a "ControlNet Relevance Score," tailoring control layer positioning/capacity, and replacing self-attention/FFN with a Two-Dimensional Shuffle Mixer (TDSM). In quantitative experiments, the approach achieves superior performance with only 15% of the parameters and computational complexity of PixArt-δ. For AI practitioners, RelaCtrl offers a method for significantly improving the efficiency of controlled image and video generation using Diffusion Transformers, reducing resource demands without compromising output quality. |
LLM-based User Profile Management for Recommender System (Read more on arXiv or HuggingFace) | Hwanjun Song, Breadbang | PURE is an LLM-based recommendation framework that constructs and maintains evolving user profiles for zero-shot recommendation. The main research objective is to develop a system that can effectively leverage user-generated textual data, beyond purchase history, to improve recommendation accuracy in a continuously evolving setting. The key methodology is PURE, composed of a Review Extractor (extracting preferences from reviews), a Profile Updater (refining user profiles), and a Recommender (generating recommendations using updated profiles). Experimental results on Amazon datasets show that PURE (ICL) achieves an N@10 score of 35.60 on Games and 32.03 on Movies, outperforming baselines that only use purchase history or naively combine reviews. For AI practitioners, PURE demonstrates the concrete value of incorporating long-term review data and user preference through structured profiles. |
Unstructured Evidence Attribution for Long Context Query Focused Summarization (Read more on arXiv or HuggingFace) | David Jurgens, Isabelle Augenstein, Lu Wang, Zain Muhammad Mujahid, dwright37 | This paper introduces the task of long-context, query-focused summarization with unstructured evidence citation, and proposes a synthetic dataset (SUnsET) to improve models' ability to extract and cite relevant evidence spans. The primary objective is to investigate how well LLMs can generate query-focused summaries from long contexts while citing unstructured evidence, and how to mitigate positional biases (like "lost-in-the-middle") affecting evidence selection. The key methodology involves creating SUnsET, a synthetic dataset generated via a novel domain-agnostic pipeline, and using it to fine-tune LLMs with LoRA adapters; evaluation covers four datasets of varying document types and lengths, using position-aware and position-agnostic training. Primary results show that fine-tuning on SUnsET significantly improves evidence extraction and citation accuracy across multiple LLMs and datasets, with citation rates increasing dramatically (6.8× for Mixtral 8x7B with position-aware training). Training also improves summary quality, and shuffling document sections during training can mitigate positional biases. AI practitioners can use the SUnsET dataset and fine-tuning approach to adapt LLMs for improved unstructured evidence citation in long-context summarization, leading to more transparent and reliable summaries, but must be aware that current methods are prone to errors. |
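
The section-shuffling idea mentioned in the summary above can be sketched as a small data-augmentation helper. This is a minimal illustration under stated assumptions, not the paper's pipeline: the function name `shuffle_sections`, the representation of a document as a list of section strings, and the use of section indices as evidence labels are all hypothetical.

```python
import random


def shuffle_sections(sections, evidence_idx, seed=None):
    """Shuffle a document's sections and remap evidence labels to new positions.

    sections: list of section strings (assumed document representation).
    evidence_idx: indices of the sections that contain cited evidence.
    Returns (shuffled_sections, new_evidence_idx).
    """
    rng = random.Random(seed)
    order = list(range(len(sections)))
    rng.shuffle(order)  # order[new_position] = old_position
    shuffled = [sections[old] for old in order]
    new_position = {old: new for new, old in enumerate(order)}
    return shuffled, sorted(new_position[i] for i in evidence_idx)
```

Applied once per training example with a fresh seed, a helper like this would move evidence to varying positions in the context, which is the intuition behind reducing a model's reliance on where evidence appears.
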
Title | Authors | Summary |
---|---|---|
Qwen2.5-VL Technical Report (Read more on arXiv or HuggingFace) | Keqin Chen, Shuai Bai, xhyandwyy, darkpromise, ayumiymk | i) Qwen2.5-VL is a new vision-language model in the Qwen series with advancements in visual recognition, object localization, document parsing, and long-video comprehension. ii) The research aims to improve the foundational and agentic capabilities of vision-language models, particularly in fine-grained visual perception and real-world applications. iii) The methodology involves training a native dynamic-resolution Vision Transformer (ViT) from scratch, incorporating Window Attention, dynamic FPS sampling, absolute time encoding with MROPE, and curating a large pre-training dataset of 4.1 trillion tokens. iv) The Qwen2.5-VL-72B model achieves 74.8 on MathVista and an mIoU score of 50.9 on Charades-STA, matching state-of-the-art performance, while smaller models offer strong capabilities in resource-constrained environments. v) AI practitioners can leverage Qwen2.5-VL's improved document understanding, precise object grounding, and long-video comprehension to develop more robust and versatile multimodal applications, particularly in domains requiring detailed visual analysis and interactive agent functionalities, with attention to the computational benefits conferred by Window Attention and dynamic resolution processing. |
RAD: Training an End-to-End Driving Policy via Large-Scale 3DGS-based Reinforcement Learning (Read more on arXiv or HuggingFace) | Yiang Shi, Bencheng Liao, Bo Jiang, Shaoyu Chen, Hao605 | RAD establishes a 3DGS-based closed-loop Reinforcement Learning (RL) paradigm for training end-to-end autonomous driving policies. The main research objective is to address causal confusion and the open-loop gap in existing Imitation Learning (IL) methods for autonomous driving. The key methodology involves constructing photorealistic digital replicas of the real world using 3D Gaussian Splatting (3DGS) techniques, incorporating IL as a regularization term in RL training, and designing specialized safety-related rewards. The primary results show that, compared to IL-based methods, RAD achieves a 3x lower collision rate on a closed-loop evaluation benchmark consisting of unseen 3DGS environments. For AI practitioners, this suggests that 3DGS-based RL training, combined with IL, can improve the safety and robustness of end-to-end autonomous driving policies, by allowing large scale training in a realistic virtual world. |
SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation (Read more on arXiv or HuggingFace) | Pan Zhang, Xiaoyi Dong, Zhixiong Zhang, Shuangrui Ding, Zihan Liu | SongGen is a single-stage auto-regressive transformer model for generating songs with vocals and accompaniment from text inputs. The main research objective is to investigate whether a single-stage model can achieve effective text-to-song generation, simplifying the often cumbersome multi-stage pipelines. The key methodology involves a transformer decoder that predicts audio tokens, incorporating user controls via cross-attention, and exploring mixed and dual-track output modes with diverse token patterns. Primary results show that the "Interleaving (A-V)" dual-track mode achieves a Frechet Audio Distance (FAD) of 1.87, competitive with mixed-mode generation. AI practitioners can use SongGen as an open-source, controllable baseline for text-to-song generation, and the provided annotated data and preprocessing pipeline simplify future research. |
MoM: Linear Sequence Modeling with Mixture-of-Memories (Read more on arXiv or HuggingFace) | Yu Cheng, Jiaxi Hu, Disen Lan, Jusen Du, weigao266 | MoM introduces a linear sequence modeling architecture that uses multiple memory states to improve recall performance. The main research objective is to enhance the memory capacity and reduce memory interference in linear sequence models, addressing limitations of existing approaches that compress sequences into a single fixed-size state. The methodology involves a Mixture-of-Memories (MoM) architecture with multiple independent memory states and a router network that directs input tokens to specific memory states, using an RNN-like update mechanism. Primary results show that MoM significantly outperforms current linear sequence models on downstream language tasks, with the 1.3B parameter MoM achieving an average score of 36.04 on recall-intensive tasks, close to the Transformer model's 37.31. For AI practitioners, MoM offers a more efficient architecture to enhance the memory and recall of linear sequence modeling for applications, retaining linear-time training and constant-memory inference, presenting itself as an alternative to Transformers. |
Craw4LLM: Efficient Web Crawling for LLM Pretraining (Read more on arXiv or HuggingFace) | Chenyan Xiong, Zhiyuan Liu, yushi | CRAW4LLM is an efficient web crawling method that prioritizes webpages based on their predicted influence on large language model (LLM) pretraining. The research objective is to improve the efficiency of web crawling for LLM pretraining data collection by aligning crawler priorities with LLM pretraining needs. The key methodology is to use a pretraining influence scorer, derived from data-filtering pipelines, to score newly discovered documents and prioritize them in the crawler's queue, replacing traditional graph-connectivity-based metrics. Primary results show that LLMs pretrained on data crawled by CRAW4LLM, using only 21% of the URLs, achieve the same downstream performance as previous crawls that used more data. The principal implication is that AI practitioners using CRAW4LLM can obtain similarly performing LLMs while significantly reducing the required web crawling and data processing, saving time and resources. |
LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization (Read more on arXiv or HuggingFace) | Lidong Bing, Michael Qizhe Shieh, Xin Li, Guanzheng Chen | LongPO is a method that enables short-context LLMs to self-evolve to handle long-context tasks by internally transferring short-context capabilities through preference optimization. The main research objective is to address the challenges of long-context alignment in LLMs, specifically the scarcity of long-context annotated data and the difficulty in balancing short- and long-context performance. The key methodology involves generating short-to-long preference data using a short-context LLM and applying a DPO-style objective with a KL constraint to maintain short-context performance. The primary result is that LongPO applied to Mistral-7B-Instruct-v0.2 improved performance on InfiniteBench by 25.45 points and achieved comparable or superior results to larger LLMs like GPT-4-128K. The principal implication for AI practitioners is that LongPO offers an efficient way to extend the context length of LLMs without extensive long-context data annotation or significant degradation of short-context capabilities, providing a more balanced approach to developing long-context LLMs. |
Small Models Struggle to Learn from Strong Reasoners (Read more on arXiv or HuggingFace) | Luyao Niu, Fengqing Jiang, Xiang Yue, Yuetai Li, flydust | Small language models (≤3B parameters) do not consistently benefit from complex reasoning data or distillation from larger models, instead performing better with simpler reasoning. The main research question is whether small language models can effectively learn from the reasoning capabilities of larger, more powerful language models. The key methodology involves fine-tuning student models of varying sizes on different types of Chain-of-Thought (CoT) data (long, short, large teacher, small teacher) generated from the MATH dataset and evaluating their performance on multiple math benchmarks. A key result is that Qwen2.5-3B-Instruct improves by more than 8 points on MATH and AMC using Mix-Long, compared to direct training on long CoT data. The principal implication is that AI practitioners should adapt reasoning complexity during distillation, using techniques like Mix Distillation, to effectively transfer reasoning capabilities to smaller models, instead of directly using complex reasoning data from large models. |
Autellix: An Efficient Serving Engine for LLM Agents as General Programs (Read more on arXiv or HuggingFace) | Tianjun Zhang, Colin Cai, Xiaoxiang Shi, Michael Luo, Chrisyichuan | Autellix is an LLM inference system designed to efficiently serve agentic programs, treating them as first-class citizens to minimize end-to-end latency. The main research objective is to reduce the end-to-end latencies of agentic programs composed of dynamic, non-deterministic DAGs of LLM calls and interrupts. The key methodology used is program-aware scheduling, prioritizing LLM calls based on program-level statistics (cumulative service time) and employing a data locality-aware load balancer across multiple engines. Primary results show that Autellix improves program throughput by 4-15x compared to state-of-the-art systems like vLLM, across diverse LLMs and agentic workloads. The principal implication is that AI practitioners can significantly improve the performance of LLM agent applications by using a serving system that prioritizes the scheduling of LLM calls based on full program execution, and data-locality, rather than treating each call independently. |
Presumed Cultural Identity: How Names Shape LLM Responses (Read more on arXiv or HuggingFace) | Lucie-Aimée Kaffee, Arnav Arora, Siddhesh Pawar, IAugenstein | LLMs exhibit cultural biases in responses based on user names, influencing personalization. The main research objective is to investigate cultural presumptions in LLM responses when presented with common suggestion-seeking queries including user names. The key methodology involves prompting LLMs with names from 30 cultures and analyzing generated responses for cultural bias using an LLM-as-a-judge approach and assertion-based evaluation. The primary result is that LLM responses exhibit varying degrees of cultural bias, with clothing-related queries showing a roughly 70% increase in bias when names were included. The principal implication is that AI practitioners need to consider the impact of names on LLM outputs and design personalisation systems that avoid reinforcing stereotypes while utilizing names. |
Why Safeguarded Ships Run Aground? Aligned Large Language Models' Safety Mechanisms Tend to Be Anchored in The Template Region (Read more on arXiv or HuggingFace) | Wenjie Li, Jian Wang, Qingyu Yin, Chak Tou Leong | Aligned large language models (LLMs) exhibit a vulnerability where their safety mechanisms overly rely on information within a specific "template region" inserted between user input and model output. The research investigates the phenomenon of "template-anchored safety alignment" (TASA) in aligned LLMs. The methodology involves analyzing attention weight distributions, performing activation patching interventions, and probing harmfulness features across different layers and positions; the authors also propose detaching the safety mechanism from the template region. Results show that intervening on intermediate states in the template region significantly increases the likelihood of harmful initial compliance decisions, with the normalized indirect effect (NIE) showing considerable gains from patching a small number of heads. The findings suggest AI practitioners should develop more robust safety alignment techniques that are less reliant on the template region for safety-related decision-making to reduce the risk of adversarial attacks. |
SearchRAG: Can Search Engines Be Helpful for LLM-based Medical Question Answering? (Read more on arXiv or HuggingFace) | Tianming Liu, Quanzheng Li, Canyu Chen, Tianze Yang, YuchengShi | SearchRAG is a novel retrieval-augmented generation framework that leverages search engines to enhance large language models' (LLMs) performance in medical question answering. The main research objective is to determine how to effectively integrate search engines with LLMs for improved retrieval of medical knowledge. The key methodology involves synthetic query generation using LLMs to create search-engine-friendly queries and uncertainty-based knowledge selection to filter retrieved information. Primary results show that SearchRAG improved the LLaMA 8B model's accuracy by an average of 12.61% compared to baseline methods on medical QA tasks. The principal implication for AI practitioners is that SearchRAG can address limitations of conventional Retrieval-Augmented Generation (RAG) systems, showing that real-time search integration improves response accuracy. |
Thinking Preference Optimization (Read more on arXiv or HuggingFace) | Xiaotian Han, Vipin Chaudhary, Jingfeng Yang, Hongye Jin, Wang Yang | Thinking Preference Optimization (ThinkPO) enhances reasoning in fine-tuned language models without requiring new long chain-of-thought (CoT) responses. The main research objective is to improve the reasoning performance of supervised fine-tuned (SFT) language models without collecting new long CoT data or repeatedly training on existing SFT datasets. The key methodology is to use readily available short CoT reasoning responses as rejected answers and existing long CoT responses as chosen answers, applying direct preference optimization (DPO) to encourage longer reasoning outputs. The primary result is that ThinkPO increases the math reasoning accuracy of SFT-ed models by 8.6% and output length by 25.9%; for example, it raised one tested model's MATH500 performance from 87.4% to 91.2%. AI practitioners can use ThinkPO as a post-SFT method to further improve the reasoning performance of their models, especially when acquiring new long CoT data is costly or repeated training leads to a performance plateau. |
Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering (Read more on arXiv or HuggingFace) | Benjamin Van Durme, Jeffrey Cheng, wjurayj | Test-time scaling of compute improves the performance of large language models on selective question answering by increasing confidence in correct answers. The research investigates how increasing computational budget at inference time impacts model confidence and accuracy in question answering. The methodology involves evaluating models at varying compute budgets and confidence thresholds, using a selection function that rejects answers below a confidence threshold. The results show that increasing the compute budget improves the average confidence of correct answers, and selective answering at a threshold of 0.95 dramatically improves performance in a Jeopardy setting where incorrect answers are penalized. AI practitioners should report test-time scaling performance under conditions that penalize incorrect answers ("Jeopardy Odds") in addition to traditional settings, to accurately reflect selective question answering capabilities. |
AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence (Read more on arXiv or HuggingFace) | Jason Klein Liu, Chaofeng Qu, Zhaoling Chen, Junjie Lu, Yuliang Liu | AdaptiveStep, a novel method, automatically divides reasoning steps in large language models (LLMs) based on model confidence to enhance process reward model (PRM) training and performance. The main research objective is to develop an automated, informative, and general method for dividing reasoning steps that improves upon existing rule-based approaches. The key methodology, AdaptiveStep, utilizes the LLM's prediction confidence for the next word to identify critical breaking points, creating more informative step divisions without manual annotation. Results show that the AdaptiveStep-trained PRM (ASPRM) achieves state-of-the-art Best-of-N performance, outperforming greedy search with token-level value-guided decoding (TVD) by 3.15% on GSM8k. For AI practitioners, AdaptiveStep provides a more efficient and precise method for training PRMs, reducing construction costs and enhancing downstream task performance, specifically in mathematical reasoning and code generation. |
NExT-Mol: 3D Diffusion Meets 1D Language Modeling for 3D Molecule Generation (Read more on arXiv or HuggingFace) | Enzhi Zhang, Han Huang, Yanchen Luo, Zhiyuan Liu, xiangwang1223 | NExT-Mol is a foundation model for 3D molecule generation that combines 3D diffusion with 1D language modeling. The main research objective is to improve 3D molecule generation by integrating the strengths of 1D SELFIES-based language models (LMs) and 3D diffusion models. The methodology involves pretraining a 960M parameter 1D molecule LM (MoLlama) on 1.8B SELFIES, then predicting 3D conformers with a novel diffusion model (Diffusion Molecule Transformer, DMT) and using cross-model transfer learning to enhance DMT. NExT-Mol achieves a 26% relative improvement in 3D FCD for de novo 3D generation on GEOM-DRUGS compared to previous methods. AI practitioners can leverage this approach to generate 3D molecules with improved validity and distributional similarity, facilitating drug discovery and material design by combining large-scale 1D pretraining with 3D diffusion. |
ActionPiece: Contextually Tokenizing Action Sequences for Generative Recommendation (Read more on arXiv or HuggingFace) | Wang-Cheng Kang, Noveen Sachdeva, Zhankui He, Jianmo Ni, hyp1231 | ActionPiece is a novel tokenization method for generative recommendation that incorporates contextual information to improve performance. The main research objective is to develop a context-aware action sequence tokenizer for generative recommendation models, addressing the limitation of existing models that tokenize each action independently. The key methodology, ActionPiece, represents each action as a set of item features, constructs a vocabulary by merging frequent feature patterns, and uses set permutation regularization to produce multiple segmentations. The primary result is that ActionPiece outperforms existing action tokenization methods, improving NDCG@10 by 6.00% to 12.82% on public datasets. The principal implication is that AI practitioners can use ActionPiece to improve the accuracy and efficiency of generative recommendation systems by considering contextual relationships among user actions. |
Train Small, Infer Large: Memory-Efficient LoRA Training for Large Language Models (Read more on arXiv or HuggingFace) | Ke Chen, Lidan Shou, Huan Li, Jue Wang, junzhang98 | LORAM is introduced as a memory-efficient LoRA training scheme for LLMs. This research aims to reduce the memory footprint of LoRA training by training on a pruned model and recovering weights for inference on the original model. LORAM employs pruning during training followed by a recovery and alignment phase utilizing continual pre-training on a small dataset. QLORAM, combining structured pruning and 4-bit quantization, achieved a 15.81× parameter storage reduction for LLaMA-3.1-70B while maintaining or improving performance. LORAM enables training on resource-constrained hardware and suggests an alternative to full fine-tuning. |
GIMMICK -- Globally Inclusive Multimodal Multitask Cultural Knowledge Benchmarking (Read more on arXiv or HuggingFace) | Anne Lauscher, Chris Biemann, Carolin Holtermann, floschne | i) GIMMICK introduces a multimodal benchmark for evaluating cultural knowledge in large vision-language models (LVLMs). ii) The research aims to identify regional biases in LLMs' and LVLMs' cultural understanding and assess the impact of model size, input modalities, and external cues on cultural knowledge. iii) The methodology employs six tasks built on three newly created datasets spanning 728 cultural events across 144 countries, evaluating 31 models using multimodal and unimodal inputs. iv) Results reveal significant regional biases, with models exhibiting up to 14.72pp performance difference between Western and Sub-Saharan African cultural contexts, and multimodal input consistently improving performance. v) AI practitioners should be aware of biases in cultural understanding and leverage multimodal inputs to create more globally inclusive AI systems. |
InfiR : Crafting Effective Small Language Models and Multimodal Small Language Models in Reasoning (Read more on arXiv or HuggingFace) | Zhijie Sang, Pengxiang Li, Wenjun Wang, Shuo Cai, Congkai Xie | InfiR introduces efficient Small Language Models (SLMs) and Multimodal SLMs with enhanced reasoning capabilities, deployable on edge devices. The main research objective is to develop SLMs and MSLMs that retain competitive reasoning abilities while reducing model size and computational demands. The key methodology involves a novel pre- and post-training pipeline that includes heuristic filtering, reasoning-oriented text recall, data annealing, and supervised fine-tuning with synthetic data. The InfiR-1B-Instruct model achieved a 2.26x reasoning-related average score improvement over Llama3.2-1B-Base. AI practitioners can leverage InfiR's training pipeline and models to build efficient and privacy-preserving AI systems with strong reasoning capabilities, particularly for edge deployment. |
Noise May Contain Transferable Knowledge: Understanding Semi-supervised Heterogeneous Domain Adaptation from an Empirical Perspective (Read more on arXiv or HuggingFace) | Qiang Yang, Jian Jin, Yu Zhang, Xiaopu Zhang, yyyaoyuan | This paper empirically investigates transferable knowledge in semi-supervised heterogeneous domain adaptation (SHDA) tasks. The main research question is: "What is the transferable knowledge in SHDA?" The authors develop a unified Knowledge Transfer Framework (KTF) for SHDA and conduct extensive experiments, including manipulating source sample categories and features and introducing synthesized noise distributions. A primary result across nearly 330 SHDA tasks is that varying the order of source sample categories causes almost no change in performance, i.e., average accuracy remains nearly constant. For AI practitioners, the results imply that the discriminability and transferability of the source domain, rather than the category or feature information, are the main factors for effective transfer in SHDA, meaning the choice of origin for source domains is less critical than ensuring those two qualities. |
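
The selective answering setup summarized in the "Is That Your Final Answer?" row above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the toy data, and the exact scoring (+1 for a correct answer, 0 for abstaining, minus a penalty for a wrong answer) are assumptions made for this example. The summary only states that answers below a confidence threshold are rejected and that incorrect answers are penalized.

```python
from typing import Optional, Sequence, Tuple


def select(answer: str, confidence: float, threshold: float) -> Optional[str]:
    """Return the answer only if the model is confident enough, else abstain."""
    return answer if confidence >= threshold else None


def penalized_score(
    predictions: Sequence[Tuple[str, float]],
    gold: Sequence[str],
    threshold: float,
    penalty: float = 1.0,  # assumed cost of a wrong answer (hypothetical value)
) -> float:
    """Sum +1 per correct answer, -penalty per wrong answer, 0 per abstention."""
    total = 0.0
    for (answer, confidence), target in zip(predictions, gold):
        chosen = select(answer, confidence, threshold)
        if chosen is None:
            continue  # abstaining neither earns nor loses points
        total += 1.0 if chosen == target else -penalty
    return total


# Toy data: (answer, confidence) pairs and the reference answers.
toy_predictions = [("a", 0.99), ("b", 0.50), ("c", 0.97)]
toy_gold = ["a", "x", "z"]

always_answer = penalized_score(toy_predictions, toy_gold, threshold=0.0)
selective = penalized_score(toy_predictions, toy_gold, threshold=0.95)
```

On this toy data, answering everything scores -1.0, while abstaining below 0.95 scores 0.0, because the low-confidence wrong answer is no longer penalized.
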
Title | Authors | Summary |
---|---|---|
Soundwave: Less is More for Speech-Text Alignment in LLMs (Read more on arXiv or HuggingFace) | Benyou, PhoenixAxis, FanBuCUHK, puccho, Yoohao | Soundwave utilizes an efficient training strategy and novel architecture to address representation space gap and sequence length inconsistency between speech and text in LLMs. The main research objective is to achieve data-efficient training for speech-text alignment in large language models. The key methodology is a two-stage training framework: Stage I aligns speech and text representations using an alignment adapter and CTC loss; Stage II reduces speech sequence length using a shrinking adapter. Soundwave outperforms Qwen2-Audio in speech translation and AIR-Bench speech tasks, using only one-fiftieth of the training data (10k hours vs. 520k hours). AI practitioners can achieve state-of-the-art speech understanding performance in LLMs with significantly reduced training data requirements by adopting Soundwave's two-stage alignment and shrinking approach. |
Phantom: Subject-consistent video generation via cross-modal alignment (Read more on arXiv or HuggingFace) | Jiawei Liu, ZhuoweiChen, lbc402, Grayson111, liulj13 | Phantom is a unified framework for subject-consistent video generation via cross-modal alignment. The research objective is to develop a model that balances dual-modal prompts of text and image to achieve deep and simultaneous alignment of text and visual content in video generation. The key methodology involves redesigning a joint text-image injection model based on text-to-video and image-to-video architectures, and training it with text-image-video triplet data to learn cross-modal alignment. Primary results show that Phantom leads in overall metrics for subject consistency, with a score of 0.731 on CLIP-I-Seg, and for prompt following as measured by ViCLIP-T, demonstrating subject consistency competitive with commercial solutions. AI practitioners can use Phantom's new architecture for improved subject-consistent video generation, especially in tasks requiring ID preservation and consistency. |
Continuous Diffusion Model for Language Modeling (Read more on arXiv or HuggingFace) | Sung Ju Hwang, harryjo97 | Riemannian Diffusion Language Model (RDLM) is a continuous diffusion framework for language modeling that incorporates the geometry of the statistical manifold. The main research objective is to establish a connection between discrete diffusion and continuous flow on the statistical manifold and design a continuous diffusion model for discrete data that generalizes previous discrete diffusion models. The key methodology involves reparameterizing discrete data to continuous states on a hypersphere, designing diffusion processes on the manifold that generalize discrete diffusion, and using a simulation-free training scheme based on radial symmetry. Primary results show that RDLM achieves a Bits Per Character (BPC) of ≤ 1.32 on the Text8 dataset, outperforming existing discrete diffusion models. The principal implication is that AI practitioners can leverage the geometry of the statistical manifold in continuous diffusion models to achieve improved performance in language modeling and other discrete data generation tasks, compared to existing discrete diffusion approaches. |
Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity (Read more on arXiv or HuggingFace) | Aydar Bulatov, Mikhail Arkhipov, mbur, yurakuratov | This work explores the maximum information capacity of language model input embeddings by compressing text sequences into trainable vectors. The main research objective is to quantify how much text can be losslessly encoded into and decoded from a fixed-size vector representation within large language models (LLMs). The key methodology involves optimizing a set of prepended "memory" vectors to minimize the cross-entropy loss when reconstructing the original text using a frozen, pre-trained LLM. The primary result is that a single vector can enable a Llama-3.1-8B model to accurately reconstruct up to 1568 tokens, and this capacity scales nearly linearly with the number of trainable vectors (e.g., 16 vectors compress 7168 tokens). The principal implication for AI practitioners is that LLM input embeddings have significantly more unused capacity than typically utilized, suggesting substantial room for improved context encoding and memory augmentation in model design. |
SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models (Read more on arXiv or HuggingFace) | Minki Kang, Dong Bok Lee, hbseong, dwgnr, Seanie-lee | SafeRoute adaptively selects between a smaller and larger safety guard model to improve the trade-off between computational cost and safety performance in LLM deployments. The paper's objective is to develop a method that distinguishes "hard" examples requiring a larger safety guard model from "easy" ones that a smaller model can handle. The core of the method is SafeRoute, a trained binary router that classifies input prompt-response pairs, selectively applying the larger model only when necessary. Results show SafeRoute improves the F1 score by 13% and 10% compared to always using the smaller or larger models on the WildGuardMix test split, while utilizing the larger model on only 5.09% of the data. AI practitioners can use SafeRoute to deploy safer LLMs more efficiently, reducing computational overhead while maintaining high accuracy in detecting harmful content. |
Rethinking Diverse Human Preference Learning through Principal Component Analysis (Read more on arXiv or HuggingFace) | Hao Sun, Feng Luo, huanzhang12, CharlesDDDD, Ray2333 | Decomposed Reward Models (DRMs) extract diverse human preferences from binary comparisons for improved AI personalization. The research question is: Can we infer multidimensional human preferences directly from large-scale binary comparisons? The method represents preferences as vectors, applies PCA to embedding differences between preferred and rejected responses, and identifies orthogonal basis vectors representing distinct preference aspects. DRMs using Gemma-2B-RM improved the single-head baseline accuracy from 0.733 to 0.814 on the RewardBench dataset. AI practitioners can use DRMs for more efficient test-time adaptation to diverse user preferences without requiring additional model training, offering a scalable and interpretable solution for personalized LLM alignment. |
SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation (Read more on arXiv or HuggingFace) | codered010, RunpeiDong, YufeiD, WenyaoZhang, qizekun | SOFAR introduces semantic orientation to bridge spatial reasoning and object manipulation, enabling robots to understand and execute tasks based on natural language instructions. The main research objective is to develop a system that can accurately understand and utilize object orientations, defined through natural language, for robotic manipulation and spatial reasoning tasks. The key methodology involves constructing a large-scale dataset (OrienText300K) of 3D models annotated with semantic orientations, developing a cross-modal 3D Transformer (PointSO) for orientation prediction, and integrating this with a Vision-Language Model (VLM) system (SOFAR) to generate manipulation actions. Primary results show that SOFAR achieves 48.7% accuracy on the Open6DOR benchmark and 74.9% accuracy on the SIMPLER benchmark for robotic manipulation. The principal implication for AI practitioners is that integrating semantic orientation into VLM systems provides a more flexible and accurate way to represent spatial knowledge, significantly improving performance in robotic manipulation tasks requiring precise object alignment and rearrangement. |
Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation (Read more on arXiv or HuggingFace) | Qian Zhang, wenyuliu, wondervictor, HongyuanTao, LegendBC | mmMamba is a framework for developing linear-complexity, native multimodal state space models using distillation from existing multimodal large language models (MLLMs). The main research question is how to effectively distill knowledge from trained Transformer-based decoder-only MLLMs to create efficient, linear-complexity architectures without relying on pre-trained RNN-based LLMs or vision encoders. The key methodology involves a three-stage progressive distillation recipe and a seeding strategy to carve Mamba layers from trained Transformer layers, transferring knowledge while preserving multimodal capabilities. The primary results demonstrate that mmMamba-linear achieves competitive performance with existing linear and quadratic-complexity VLMs, achieving a 20.6x speedup and 75.8% GPU memory saving compared to HoVLE at 103K tokens. AI practitioners can leverage mmMamba to build more efficient and deployable multimodal models, particularly for long-context applications, by utilizing linear-complexity architectures with reduced computational demands. |
FLAG-Trader: Fusion LLM-Agent with Gradient-based Reinforcement Learning for Financial Trading (Read more on arXiv or HuggingFace) | ShirleyY, Acatsama, YupengCao, zdeng10, xionggj001 | FLAG-TRADER is a framework integrating LLMs with reinforcement learning for financial trading. The main research question is whether integrating LLMs' reasoning with RL's reward-driven optimization can address challenges in financial sequential decision-making. The methodology involves a partially fine-tuned LLM acting as a policy network, optimized via gradient-driven RL (specifically PPO), using textual state representations. Primary results show FLAG-TRADER, using a 135M-parameter LLM, achieves a Sharpe Ratio of 3.344 on JNJ stock, outperforming baselines and larger proprietary models. For AI practitioners, this framework demonstrates that combining LLMs with RL fine-tuning, particularly using parameter-efficient methods, offers superior performance in complex, sequential decision-making tasks like financial trading. |
You Do Not Fully Utilize Transformer's Representation Capacity (Read more on arXiv or HuggingFace) | kefirski, ummagumm-a, elephantmipt, yaraksen, gudleifrr | i) This paper introduces Layer-Integrated Memory (LIMe), a modification to the Transformer architecture that allows attention heads to access representations from all previous layers. ii) The main objective is to address representation collapse in standard Transformers by enabling access to hidden states from earlier layers. iii) The key methodology is modifying the key-value side of masked multi-head self-attention by introducing a learned routing mechanism (static or dynamic) that creates convex combinations of representations from all preceding layers. iv) LIMe models consistently outperform standard Transformer baselines; for example, on the LM Evaluation Harness, the average accuracy across all benchmarks in the results shows the LIMe Dynamic variant achieving 58.4% accuracy, compared to 57.7% for the LLaMA baseline. v) AI practitioners can use LIMe to build deeper and more robust Transformers with improved representational capacity, potentially leading to better performance in sequence modeling tasks without substantially increasing computational overhead. |
Magma: A Foundation Model for Multimodal AI Agents (Read more on arXiv or HuggingFace) | cheryyunl, Baolin, rzheng12, qianhuiwu, tanreuben | Magma is a multimodal foundation model capable of interpreting and grounding multimodal inputs within its environment for AI agentic tasks. The main research objective is to develop a foundation model that integrates vision-language understanding with the ability to plan and act in visual-spatial worlds, completing tasks ranging from UI navigation to robot manipulation. The key methodology involves pre-training on heterogeneous datasets (images, videos, robotics data) using Set-of-Mark (SoM) for action grounding and Trace-of-Mark (ToM) for action planning, representing actions as visual object labels and movement traces. Primary results include achieving new state-of-the-art results on UI navigation with a success rate of 60.4/58.5 on SS-Mobile, and robotic manipulation tasks, outperforming previous models tailored to these tasks. For AI practitioners, Magma provides a pre-trained model capable of transferring visual and language understanding to complex agentic tasks, suggesting a path for building agents that can seamlessly operate in both digital and physical environments. |
RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm (Read more on arXiv or HuggingFace) | Kaicheng Yang, JiankangDeng, SeriousBro, Nina0607, GaryGuuu | i) RealSyn introduces a paradigm for vision-language representation learning using multimodal interleaved documents. ii) The research aims to leverage underutilized non-paired data in interleaved documents by constructing distinct image-text pairs. iii) The methodology involves a real-world data extraction pipeline, hierarchical retrieval to associate images with texts, and an image semantic augmented generation module. iv) The study releases the RealSyn dataset and demonstrates that models pre-trained on RealSyn achieve state-of-the-art performance on multiple downstream tasks, with improvements of 1.3%-6.9% in linear probing. v) RealSyn offers AI practitioners a scalable dataset (up to 100M), enabling improved vision-language models without relying solely on paired data. |
PAFT: Prompt-Agnostic Fine-Tuning (Read more on arXiv or HuggingFace) | Fei Richard Yu, Ying Tiffany He, Mingwen Ou, Yao Shu, kittttttt | PAFT is a fine-tuning method that improves the prompt robustness of large language models (LLMs). The main research objective is to address the performance degradation of fine-tuned LLMs caused by minor variations in prompts. The key methodology is a two-stage approach: constructing a diverse set of candidate prompts and then dynamically sampling from these prompts during fine-tuning. Primary results show that PAFT achieves 87.57% average accuracy on the RACE-high dataset, significantly outperforming baseline models and reducing variance across different prompts. PAFT's dynamic sampling during fine-tuning helps models generalize better to unseen prompts, maintaining high performance and improving inference efficiency for AI practitioners using fine-tuned models. |
MUDDFormer: Breaking Residual Bottlenecks in Transformers via Multiway Dynamic Dense Connections (Read more on arXiv or HuggingFace) | Xingyuan Yuan, Da Xiao, lishengping, Hilbertmeng | MUDDFormer introduces a novel method to improve information flow in Transformers by replacing standard residual connections with multiway dynamic dense connections. The main research objective is to address the limitations of residual connections and enhance cross-layer information flow in Transformer models. The key methodology is generating connection weights dynamically based on hidden states and decoupling input streams (query, key, value, residual) of a Transformer block. Primary results show that MUDDPythia-2.8B matches Pythia-6.9B in pre-training perplexity and downstream tasks, while adding only 0.23% parameters and 0.4% computation. For AI practitioners, MUDDFormer offers a method to significantly improve Transformer performance and scalability, especially with deeper models, with minimal parameter and computational overhead. |
Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities? (Read more on arXiv or HuggingFace) | Yunhua Zhou, Qinyuan Cheng, Zhiyuan Zeng, xpqiu, yinzhangyue | This paper investigates whether o1-like models (QwQ, R1, and LIMO) truly possess test-time scaling capabilities. The main research question is whether increasing Chain-of-Thought (CoT) length in these models consistently improves reasoning performance. The researchers systematically investigated the relationship between CoT length and accuracy, and prompted models for self-revisions, comparing sequential and parallel scaling strategies. A primary result is that longer CoTs did not consistently improve accuracy; correct solutions were often shorter, and R1-Distill-32b and R1-Distill-14b maintained the original wrong answer in over 70% of cases when prompted to revise. The principal implication is that AI practitioners should consider parallel scaling and methods like "Shortest Majority Vote" for these models, as sequential scaling via self-revision is not consistently effective due to limited self-revision capabilities. |
OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning (Read more on arXiv or HuggingFace) | Joseph Boen, Rahul Thapa, Sheng Liu, Bowen Chen, lupantech | OctoTools is a training-free, extensible agentic framework that enhances complex reasoning in large language models (LLMs) through standardized tool integration and a planner-executor paradigm. The main research objective is to develop a framework that enables LLMs to effectively tackle complex reasoning tasks across diverse domains without requiring additional training or fine-tuning. The key methodology involves using standardized tool cards to encapsulate tool functionality, a planner for high-level and low-level task planning, and an executor to carry out tool usage based on generated commands. Primary results show that OctoTools achieves an average accuracy gain of 9.3% over zero-shot GPT-4o and outperforms other agent frameworks like AutoGen, GPT-Functions, and LangChain by up to 10.6% when given the same set of tools. The principal implication for AI practitioners is that OctoTools provides a modular and extensible framework for building AI agents capable of complex reasoning, which reduces development effort and improves performance without the need for model retraining when new tools are added. |
Crowd Comparative Reasoning: Unlocking Comprehensive Evaluations for LLM-as-a-Judge (Read more on arXiv or HuggingFace) | zhangsan5421, lifengshang, horiz94, YuxinJiang, DonJoey | Crowd Comparative Reasoning enhances LLM-as-a-Judge evaluations by incorporating comparisons with multiple "crowd" responses to improve detail and comprehensiveness. Research Objective: To address the limitation of LLM-as-a-Judge's chain-of-thought (CoT) reasoning, which often fails to capture comprehensive details, leading to incomplete evaluations. Key Methodology: Proposes Crowd-based Comparative Evaluation (CCE), which introduces additional "crowd" responses for comparison with candidate responses, guiding the LLM to produce more detailed CoT judgments. Primary Results: CCE achieved an average accuracy gain of 6.7% across five benchmarks (REWARDBENCH, HELPSTEER2, MTBENCH HUMAN, JUDGEBENCH, and EvalBIAS). Principal Implication: AI practitioners can use CCE to improve the reliability and depth of LLM-based evaluations, enabling more robust model assessments and potentially more efficient training through techniques like judge distillation and improved rejection sampling. |
HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation (Read more on arXiv or HuggingFace) | Binhe Yu, Yuqian Yuan, Sijing Li, Wenqiao Zhang, Tianwei Lin | HealthGPT is a medical large vision-language model that unifies visual comprehension and generation tasks through heterogeneous knowledge adaptation. The main research objective is to develop a unified medical multi-modal model capable of both comprehending and generating medical visual data. The key methodology is a novel heterogeneous low-rank adaptation (H-LoRA) technique, complemented by hierarchical visual perception and a three-stage learning strategy. Results show that HealthGPT-L14 achieves 77.7% close accuracy on VQA-RAD, and 88.6% SSIM on the CT(Brain) reconstruction task. The principal implication is that AI practitioners can leverage HealthGPT's architecture for creating unified medical AI models that perform well on both visual comprehension and generation, overcoming limitations of previous models. |
HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading (Read more on arXiv or HuggingFace) | beidic, junjiehu, jinqixiao, ZefanCai, wdlctc | i) HeadInfer proposes a head-wise offloading strategy for memory-efficient LLM inference by selectively maintaining attention heads' KV cache on the GPU. ii) The research aims to reduce the GPU memory footprint of LLM inference, specifically the key-value (KV) cache, for long context generation. iii) The methodology involves a head-wise offloading strategy where only selective attention heads' KV cache is stored on the GPU, dynamically computing attention output, combined with adaptive heads grouping and asynchronous data transfer. iv) Experiments on the Llama-3-8B model with a 1-million-token sequence show a reduction in GPU memory footprint from 128GB to 1GB for the KV cache and total GPU usage from 207GB to 17GB, achieving a 92% reduction compared to BF16 baseline inference; HeadInfer extends the Llama-3-8B model's context length from 25K to 4 million tokens using an NVIDIA RTX 4090. v) HeadInfer enables AI practitioners to perform long-context LLM inference with reduced memory requirements, specifically enabling 4-million-token inference with an 8B model on a single consumer GPU with 24GB memory. |
Injecting Domain-Specific Knowledge into Large Language Models: A Comprehensive Survey (Read more on arXiv or HuggingFace) | Mingzhe Li, Miao Fang, Yuhan Liu, Bin Yan, Ziruibest | This survey provides a comprehensive overview of methods for integrating domain-specific knowledge into large language models (LLMs). The main research objective is to categorize and analyze techniques for enhancing LLMs with domain-specific knowledge to improve their performance in specialized tasks. Key methodologies include dynamic knowledge injection, static knowledge embedding, modular adapters, and prompt optimization. The paper reviewed studies showing, for instance, that PMC-LLaMA (13B) achieved 56.3 on MedQA, outperforming LLaMA2 (70B) at 43.7 on the same benchmark, in the biomedical field, showing how domain-specific LLMs can beat generalized models. For AI practitioners, incorporating domain-specific knowledge is crucial for achieving higher accuracy and reliability in specialized applications of LLMs. |
Eager Updates For Overlapped Communication and Computation in DiLoCo (Read more on arXiv or HuggingFace) | Yanislav Donchev, Arthur Douillard, Satyen Kale | i) This paper introduces "eager updates" to improve the DiLoCo distributed training method by overlapping communication and computation, reducing training time in low-bandwidth settings. ii) The main objective is to mitigate performance slowdowns in distributed training caused by blocking communication in low-bandwidth environments, such as cross-datacenter training. iii) The key methodology is to overlap the communication of outer gradients with the computation of the next inner optimization phase, applying local outer gradients eagerly before the aggregated gradients are available. iv) The proposed method with 1-outer-step eager updates and H=30 inner steps achieves the same performance as Data-Parallel at a 1 billion parameter scale, while using up to 1,177x less bandwidth. v) AI practitioners can use eager updates in DiLoCo to significantly reduce communication requirements and improve training efficiency in settings with limited bandwidth between workers. |
Atom of Thoughts for Markov LLM Test-Time Scaling (Read more on arXiv or HuggingFace) | Chenglin Wu, Jiayi Zhang, Quan Shi, Zhaoyang Yu, leavendough | Atom of Thoughts (AOT) is a reasoning framework that improves large language models' (LLMs) test-time scaling by structuring the reasoning process as a Markov chain of atomic, independent questions. The main research objective is to address the issue of accumulated historical information in existing test-time scaling methods, which wastes computational resources and interferes with effective reasoning. The key methodology is a two-phase state transition mechanism: (1) decomposing the current question into a dependency-based directed acyclic graph, and (2) contracting subquestions into a new independent question, iteratively until directly solvable. Primary results show that on HotpotQA, AOT applied to gpt-4o-mini achieves an 80.6% F1 score. The principal implication for AI practitioners is that AOT can be used as a standalone framework or a plug-in enhancement to improve LLMs' reasoning capabilities, by reducing unnecessary historical information to enhance efficiency. |
FinMTEB: Finance Massive Text Embedding Benchmark (Read more on arXiv or HuggingFace) | Yi Yang, yixuantt | FinMTEB is a comprehensive benchmark for evaluating text embedding models in the financial domain. The main research objective is to assess how well existing embedding models capture domain-specific financial information and whether domain adaptation improves performance. The key methodology involves constructing a benchmark (FinMTEB) of 64 datasets across 7 financial tasks and developing a finance-adapted model, Fin-E5, using a persona-based data synthesis method. Primary results show domain-adapted models consistently outperform general-purpose counterparts, with Fin-E5 achieving a 0.6767 average score on FinMTEB, and remarkably, a simple Bag-of-Words (BoW) approach outperforms all dense embedding models in financial Semantic Textual Similarity (STS) tasks. For AI practitioners, the benchmark facilitates targeted development and assessment of financial text embedding models, and also suggests current dense embedding models may not be optimal for certain kinds of financial text analysis. |
Perovskite-LLM: Knowledge-Enhanced Large Language Models for Perovskite Solar Cell Research (Read more on arXiv or HuggingFace) | Shuyan Chen, wenxinsiju, yongqi2023, sunpenglei, Dominic789654 | This paper presents a knowledge-enhanced system for perovskite solar cell (PSC) research, integrating a knowledge graph, datasets, and specialized large language models. The main research objective is to develop a system that efficiently manages and reasons with the rapidly growing body of knowledge in PSC research. The key methodology involves constructing a domain-specific knowledge graph (Perovskite-KG) from 1,517 research papers, creating two datasets (Perovskite-Chat and Perovskite-Reasoning) using a multi-agent framework, and developing two specialized LLMs (Perovskite-Chat-LLM and Perovskite-Reasoning-LLM). Primary results show Perovskite-Chat-LLM achieved a perplexity of 2.97, a Rouge-L score of 41.25, and an LLM-Judge score of 2.97 on the Perovskite QA dataset, significantly outperforming baseline models. The principal implication for AI practitioners is that this system offers tools for enhanced literature review, experimental design, and complex problem-solving in PSC research, demonstrating how domain-specific knowledge can be integrated with LLMs to improve performance in scientific tasks. |
Pre-training Auto-regressive Robotic Models with 4D Representations (Read more on arXiv or HuggingFace) | trevordarrell, zitengj0618, gbiamby, yuvansharma, NdtSoCool | ARM4R pre-trains robotic models using 4D representations from human videos, enhancing transfer learning for robotic control. The main research objective is to develop a robotic model pre-training approach that leverages low-level 4D representations from human video data to improve performance on robotic manipulation tasks. The key methodology involves training an auto-regressive model in three stages: pre-training on human videos for 3D point track prediction, fine-tuning on robot videos for 3D point tracking, and fine-tuning for robotic control. The method achieves an average success rate of 59.47% on 12 RLBench simulation tasks, surpassing PerAct (55.33%). For AI practitioners, pre-training on 4D representations from unlabeled human video data can improve sim2real transfer, cross-robot generalization, and performance in robotic control tasks. |
Multilingual Encoder Knows more than You Realize: Shared Weights Pretraining for Extremely Low-Resource Languages (Read more on arXiv or HuggingFace) | XU Han, Jianing Liu, Guixian Xu, Ziyin Zhang, Zeli Su | XLM-SWCM is a novel framework for adapting multilingual encoders to text generation in extremely low-resource languages by sharing weights between the encoder and decoder. The main research objective is to develop an effective text generation model for extremely low-resource languages, specifically Chinese minority languages, where existing multilingual models perform poorly. The key methodology involves a weight-sharing mechanism between the encoder and decoder, interleaving weights from a pretrained multilingual encoder (CINO, a variant of XLM-R) with randomly initialized weights in the decoder. The primary result is that XLM-SWCM outperforms mBART-CM by 198.8% in F1-score on text summarization and also outperformed the larger MC2-LLaMA 13B in cross-lingual settings. AI practitioners can adapt pre-trained multilingual encoders to text generation tasks in extremely low-resource settings more effectively using this weight-sharing framework, significantly improving performance even with limited data. |
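
The memory figures in the HeadInfer row above can be sanity-checked with a short calculation. This is a rough sketch, not the authors' accounting: the layer count, number of KV heads, and head dimension are the publicly documented Llama-3-8B configuration and are assumptions that do not appear in the summary, and "1 million tokens" is read as 2**20.

```python
# Assumed Llama-3-8B attention configuration (not stated in the summary).
NUM_LAYERS = 32
NUM_KV_HEADS = 8       # grouped-query attention
HEAD_DIM = 128
BYTES_PER_VALUE = 2    # BF16


def kv_cache_bytes(num_tokens: int,
                   layers: int = NUM_LAYERS,
                   kv_heads: int = NUM_KV_HEADS) -> int:
    """Bytes needed to store keys and values for `num_tokens` tokens."""
    # The leading factor 2 counts one key tensor and one value tensor.
    return 2 * layers * kv_heads * HEAD_DIM * BYTES_PER_VALUE * num_tokens


GIB = 1024 ** 3
tokens = 1024 * 1024   # "1 million" interpreted as 2**20 (assumption)

# Full KV cache held on the GPU.
full_cache_gib = kv_cache_bytes(tokens) / GIB

# KV cache of a single head in a single layer, the unit that head-wise
# offloading keeps resident while the rest stays in host memory.
one_head_gib = kv_cache_bytes(tokens, layers=1, kv_heads=1) / GIB

# Reduction in total GPU usage, using the two totals quoted in the summary.
total_reduction = (207 - 17) / 207

print(f"full KV cache:   {full_cache_gib:.1f} GiB")
print(f"one head slice:  {one_head_gib:.1f} GiB")
print(f"total reduction: {total_reduction:.1%}")
```

Under these assumptions the full cache comes out at 128 GiB, matching the quoted baseline, and the quoted totals give roughly a 92% reduction. A single head slice is smaller than the quoted 1GB, which presumably also covers buffers that this sketch does not model.
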
Title | Authors | Summary |
---|---|---|
Learning Getting-Up Policies for Real-World Humanoid Robots (Read more on arXiv or HuggingFace) | Saurabh Gupta, Zixuan Chen, Xialin He, RunpeiDong | The paper introduces HUMANUP, a learning framework for training humanoid robots to get up from various lying positions on diverse terrains. The main research objective is to develop a controller that enables humanoid robots to autonomously recover from falls in real-world settings. The key methodology is a two-stage reinforcement learning approach with a curriculum, where Stage I discovers a getting-up trajectory and Stage II refines it into a deployable, robust policy via imitation learning and control regularization. The primary results show that the learned policy enables a Unitree G1 robot to get up from supine poses with a 78.3% success rate on varied terrains, outperforming the robot's built-in controller. The principal implication is that this framework provides AI practitioners a method to train robust fall recovery policies for humanoid robots, enhancing their real-world deployability by making robots more resilient. |
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention (Read more on arXiv or HuggingFace) | Liang Zhao, Junyu Luo, Damai Dai, Huazuo Gao, Jingyang Yuan | The paper introduces NSA, a natively trainable sparse attention mechanism for efficient long-context modeling in large language models. The main research objective is to develop a sparse attention mechanism that improves computational efficiency during both training and inference while maintaining or exceeding the performance of full attention. The key methodology involves a dynamic hierarchical sparse strategy combining coarse-grained token compression with fine-grained token selection, alongside hardware-aligned optimizations for modern GPUs. Results show that NSA achieves up to 9.0x forward and 6.0x backward propagation speedup on 64k-length sequences compared to Full Attention, and outperforms Full Attention on average across general benchmarks (average score of 0.456 vs 0.443). For AI practitioners, NSA provides a method to train and deploy long-context language models with significantly reduced computational cost and improved performance, particularly on tasks requiring long-range dependencies. |
ReLearn: Unlearning via Learning for Large Language Models (Read more on arXiv or HuggingFace) | Sendong Zhao, Liming Yang, Ningyuan Zhao, Haoming Xu, Ningyu | ReLearn is a new method for unlearning in large language models that uses data augmentation and positive optimization, addressing limitations of reverse optimization methods. The main research objective is to develop an unlearning method that effectively removes targeted knowledge while preserving model performance, linguistic coherence, and robustness against attacks. ReLearn employs data augmentation with diverse question variations and fine-tuning on synthesized non-sensitive data, along with a comprehensive evaluation framework including Knowledge Forgetting Rate (KFR), Knowledge Retention Rate (KRR), and Linguistic Score (LS). The primary result is that ReLearn achieved a KFR of 0.85 on both KnowUnDo and TOFU datasets while maintaining a high KRR (0.74 on KnowUnDo and 0.89 on TOFU) and preserving linguistic abilities. AI practitioners can utilize ReLearn as an alternative to reverse optimization-based unlearning, providing a method to balance knowledge removal with the preservation of model utility and robustness in applications requiring privacy or copyright compliance. |
SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? (Read more on arXiv or HuggingFace) | Johannes Heidecke, Tejal Patwardhan, Michele Wang, Samuel Miserendino | SWE-Lancer is a benchmark of over 1,400 real-world freelance software engineering tasks from Upwork, valued at $1 million USD, to evaluate large language models' (LLMs) coding and managerial capabilities. The main research objective is to assess whether frontier LLMs can successfully complete real-world freelance software engineering tasks and earn substantial income. The key methodology involves evaluating LLMs on two task types: Individual Contributor (IC) SWE tasks, graded via human-verified end-to-end tests, and SWE Manager tasks, assessed by comparing model choices to those of original engineering managers. Primary results show that the best-performing model, Claude 3.5 Sonnet, achieves 26.2% success on IC SWE tasks and 44.9% on SWE Management tasks on the Diamond set, earning $208,050 out of a possible $500,800. Principal implication for AI practitioners is that while frontier LLMs demonstrate some capability in real-world software engineering scenarios, significant improvement is needed for reliable, autonomous deployment in freelance work. |
HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation (Read more on arXiv or HuggingFace) | Minghao Xu, Chenming Shang, Ye Tian, Ling Yang, comin | HermesFlow is a framework designed to reduce the performance disparity between multimodal understanding and generation in Multimodal Large Language Models (MLLMs). The main research objective is to close the gap between the understanding and generative capabilities of MLLMs. The key methodology used is Pair-DPO, which leverages homologous preference data for both understanding and generation, combined with self-play iterative optimization. The primary results show that HermesFlow achieves an understanding score of 0.533 and a generation score of 0.497, reducing the gap to 0.036, compared to the baseline Show-o's gap of 0.087. For AI practitioners, HermesFlow provides a general alignment framework that demonstrably closes the gap between multimodal understanding and generation tasks within existing MLLM architectures. |
SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors (Read more on arXiv or HuggingFace) | Siqiao Huang, zcliang22, Bohan22 | This paper introduces SURGE, a benchmark for evaluating large language models (LLMs) as general-purpose surrogate code executors. The main research objective is to assess whether LLMs can predict the output and behavior of programs across diverse tasks without actually running the code. The methodology involves creating a benchmark (SURGE) with eight distinct code execution aspects, evaluating various open-source and proprietary LLMs, and conducting a scaling study. A key finding is that Claude-3.5-Sonnet achieves an average accuracy of 34.31% across all subsets in the zero-shot setting. The principal implication for AI practitioners is that while LLMs show some capability in predicting code execution, there are still limitations in their ability to serve as general-purpose surrogate code executors, especially for time-consuming computations. |
Diffusion-Sharpening: Fine-tuning Diffusion Models with Denoising Trajectory Sharpening (Read more on arXiv or HuggingFace) | Mengdi Wang, Yunhai Tong, Ling Yang, Ye Tian, comin | Diffusion-Sharpening fine-tunes diffusion models by optimizing sampling trajectories using a path integral framework, enhancing downstream alignment. The main research objective is to improve diffusion model alignment with user preferences by optimizing the entire sampling trajectory, overcoming limitations of single-timestep optimization. The key methodology, Diffusion-Sharpening, uses a path integral framework to select optimal trajectories during training and leverages reward feedback, implementing this via SFT and RLHF approaches. Primary results show that RLHF Diffusion-Sharpening achieves a CLIP score of 0.338, outperforming baseline SDXL and other methods. The principal implication is that AI practitioners can achieve superior training and inference efficiency, along with better alignment to diverse metrics, by using trajectory-level optimization for diffusion model fine-tuning. |
I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models (Read more on arXiv or HuggingFace) | Runtao Liu, Hanrong Ye, Guocheng Qian, Kuan-Chieh Wang, Mifucius | ThinkDiff aligns vision-language models (VLMs) with diffusion models to enable multimodal in-context reasoning in image generation. The main research objective is to empower text-to-image diffusion models with multimodal in-context understanding and reasoning capabilities. The key methodology is aligning VLMs with the decoder of an encoder-decoder large language model (LLM) through a proxy task of vision-language training, leveraging the shared input feature space between the LLM decoder and diffusion decoders. The primary result is that ThinkDiff significantly improves accuracy on the CoBSAT benchmark for multimodal in-context reasoning generation, achieving 46.3% accuracy compared to the previous 19.2%, with only 5 hours of training on 4 A100 GPUs. Principal implication for AI practitioners: the multimodal capabilities of VLMs can be transferred to diffusion models without complex reasoning datasets, enhancing image generation for in-context reasoning tasks. |
SAFE-SQL: Self-Augmented In-Context Learning with Fine-grained Example Selection for Text-to-SQL (Read more on arXiv or HuggingFace) | Hwanhee Lee, Byeongjeong Kim, Ingeol Baek, Jimin Lee | SAFE-SQL is a framework that improves Text-to-SQL performance by using large language models (LLMs) to generate and filter synthetic examples for in-context learning. The main research objective is to enhance Text-to-SQL accuracy in an unsupervised manner, particularly in complex or unseen scenarios, without additional fine-tuning. The key methodology involves schema linking, LLM-based example generation, relevance scoring (embedding similarity, keyword/structural alignment, reasoning path validity), and threshold-based filtering. Primary results show SAFE-SQL achieved 87.9% execution accuracy on the Spider development set, outperforming zero-shot and few-shot methods, especially in hard and extra hard categories. The principal implication for AI practitioners is that using self-augmented, fine-grained example selection with LLMs can significantly improve the accuracy and robustness of Text-to-SQL systems without requiring additional model training or relying on predefined training sets. |
CRANE: Reasoning with constrained LLM generation (Read more on arXiv or HuggingFace) | Gagandeep Singh, Sasa Misailovic, Shubham Ugare, Tarun Suresh, Debangshu Banerjee | Constrained LLM generation can reduce reasoning abilities, but augmenting output grammars with reasoning rules can preserve it. The main research questions are whether LLMs truly lose reasoning capabilities under constrained decoding and how to reduce syntax errors while preserving unconstrained reasoning. The key methodology is a reasoning-augmented constrained decoding algorithm (CRANE) that alternates between unconstrained generation for reasoning and constrained generation for structurally correct outputs, supported by theoretical analysis of LLM expressivity. CRANE significantly outperforms state-of-the-art constrained decoding strategies and unconstrained decoding, showing up to a 10% accuracy improvement on the GSM-symbolic and FOLIO benchmarks. AI practitioners can use CRANE to improve the accuracy and syntactic correctness of LLM outputs in tasks requiring formal constraints, such as code generation and symbolic reasoning. |
Intuitive physics understanding emerges from self-supervised pretraining on natural videos (Read more on arXiv or HuggingFace) | Laurent Najman, Adrien Bardes, Mahmoud Assran, Nicolas Ballas, Quentin Garrido | V-JEPA, a video joint embedding predictive architecture, demonstrates an understanding of intuitive physics when pretrained on natural videos. The main research objective was to investigate the emergence of intuitive physics understanding in deep neural networks trained to predict masked regions in natural videos. Researchers leveraged the violation-of-expectation framework and compared video prediction models in a learned representation space with pixel-space prediction and multimodal large language models. A V-JEPA model trained on natural videos achieved 98% zero-shot accuracy on the IntPhys benchmark. AI practitioners can apply the principle of joint learning of abstract representation space with sensory input prediction, as a robust objective for acquiring intuitive physics understanding in AI models, challenging the reliance on core knowledge. |
Cuckoo: An IE Free Rider Hatched by Massive Nutrition in LLM's Nest (Read more on arXiv or HuggingFace) | Jingbo Shang, Feng Yao, Zilong Wang, Letian Peng | Cuckoo is a novel information extraction (IE) model that leverages large language model (LLM) resources for pre-training via a new paradigm called Next Tokens Extraction (NTE). The main research objective is to demonstrate that IE models can be effectively pre-trained using the same data and a similar paradigm as LLMs, overcoming data scarcity limitations in traditional IE pre-training. The key methodology is converting next token prediction in LLMs to next token extraction (NTE) using BIO tags, applied to 102.6M instances derived from the C4 and TuluV3 datasets. Cuckoo outperforms existing pre-trained IE models in few-shot settings, achieving a 70.63 average F1 score across six basic IE tasks, surpassing baselines significantly. AI practitioners can leverage the NTE paradigm to train versatile and efficient IE models using readily available LLM pre-training resources, avoiding expensive manual annotation and enabling adaptation to a variety of IE tasks. |
Dyve: Thinking Fast and Slow for Dynamic Process Verification (Read more on arXiv or HuggingFace) | Qiang Xu, Xiangyu Wen, Zhijian Xu, Zeju Li, Jianyuan1 | Dyve is a dynamic process verifier that enhances reasoning error detection in large language models by integrating fast and slow thinking. The main research objective is to improve the accuracy and efficiency of process verification in large language models' reasoning. The key methodology is a dual-system approach, adaptively applying "System 1" (fast, token-level) and "System 2" (slow, comprehensive) verification, supported by step-wise consensus-filtered process supervision using Monte Carlo estimation, LLM-as-a-Judge, and specialized reasoning models. Dyve achieved an F1 score of 68.5 on the GSM8K subset of ProcessBench, outperforming existing process-based verifiers. AI practitioners can use Dyve's dual-system approach for more reliable and efficient process verification in LLM-based reasoning systems, as it offers superior error detection to traditional process-based methods. |
PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning (Read more on arXiv or HuggingFace) | Jiaxing Huang, Yanrui Wu, Yuxuan Dong, Xinyu Zhang, ChengyouJia | PhysReason is a new benchmark for evaluating physics-based reasoning capabilities of large language models (LLMs). The main research objective is to create a comprehensive benchmark to assess LLMs' ability to solve physics problems requiring multi-step reasoning and application of physics theorems. The methodology involves compiling 1,200 physics problems categorized by difficulty and knowledge/reasoning type, and proposing the Physics Solution Auto Scoring Framework (PSAS) for evaluation. Primary results showed that even top-performing models like DeepSeek-R1 achieved less than 60% on answer-level evaluation, with performance dropping from 75.11% on knowledge questions to 31.95% on hard problems. Principal implication for AI practitioners: the benchmark highlights limitations of current LLMs and can guide improvements to future models on physics-based reasoning tasks and applications such as robotics. |
System Message Generation for User Preferences using Open-Source Models (Read more on arXiv or HuggingFace) | Teakgyu Hong, Dawoon Jung, Minsoo Khang, Jungho Cho, Minbyul Jeong | SYSGEN, a data construction pipeline, generates system messages and aligned assistant responses for large language models using open-source models. The main research objective is to address the scarcity and license restrictions of existing datasets with system messages by automatically generating diverse, instruction-aligned system messages. The key methodology involves a four-phase pipeline: generating system messages with eight key functionalities, filtering mis-specified tags, verifying functionalities using an LLM-as-a-judge approach, and generating new, aligned assistant responses. Training on SYSGEN data improved model alignment, with LLaMA-3.1-8B-instruct and Phi-4 models achieving +0.9 and +0.13 absolute improvements, respectively, on the Multifacet benchmark. AI practitioners can leverage SYSGEN to enhance model alignment with user instructions and preferences while minimizing performance degradation on unseen benchmarks and avoiding licensing issues related to training data. |
video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model (Read more on arXiv or HuggingFace) | Yixuan Li, Changli Tang, Jimin Zhuang, Yudong Yang, Guangzhi Sun | video-SALMONN-o1 is an open-source audio-visual large language model designed for enhanced reasoning in general video understanding tasks. The main research objective is to improve the reasoning capabilities of audio-visual LLMs for general video understanding, beyond the existing focus on mathematical problems and visual graphical inputs. The key methodology involves developing a reasoning-intensive dataset with step-by-step solutions, proposing process direct preference optimization (pDPO) for step-level reward modeling, and introducing RivaBench, a new video understanding benchmark. Primary results show that video-SALMONN-o1 achieves 3-8% accuracy improvements over the LLaVA-OneVision baseline across different video reasoning benchmarks, and pDPO achieves 6-8% improvements compared to the supervised fine-tuning model on RivaBench. AI practitioners can utilize video-SALMONN-o1 and the pDPO method for building applications requiring advanced audio-visual reasoning, such as complex video comprehension and synthetic video detection. |
Building A Proof-Oriented Programmer That Is 64% Better Than GPT-4o Under Data Scarsity (Read more on arXiv or HuggingFace) | Tianran Sun, Justin Wang, Dylan Zhang | This paper introduces PoPilot, a fine-tuned language model designed to address data scarcity in proof-oriented programming with F*. The main research objective is to improve language models' performance on project-level proof generation and repair in F* under data-scarce conditions. The key methodology involves synthetic data augmentation, creating new proof-oriented programming problems, incorporating diverse coding data, and generating repair data within existing repositories. The primary result shows that the 14B parameter model, PoPilot, outperforms GPT-4o in project-level proof-oriented programming by a 64% relative margin. AI practitioners can leverage the proposed synthetic data generation strategies to create specialized verification assistants capable of both synthesizing and repairing proofs, reducing the cost of adapting language models. |
MagicArticulate: Make Your 3D Models Articulation-Ready (Read more on arXiv or HuggingFace) | Yiwen Chen, Fan Yang, Xiu Li, Jianfeng Zhang, chaoyue7 | MagicArticulate is a framework that automatically converts static 3D models into animation-ready assets with skeletons and skinning weights. The main research objective is to develop a scalable method for automatically generating articulation-ready 3D models, addressing the limitations of manual annotation and existing template-based or template-free approaches. The key methodology involves a two-stage pipeline: an auto-regressive transformer for skeleton generation formulated as a sequence modeling problem, followed by a functional diffusion process for skinning weight prediction that incorporates volumetric geodesic distance priors. The method achieves a Chamfer Distance (CD-J2J) of 2.586 on the Articulation-XL dataset for skeleton generation, outperforming existing methods. For AI practitioners, MagicArticulate provides a scalable solution to automatically rig 3D models, significantly reducing the manual effort required for animation content creation and potentially accelerating the development of animation pipelines. |
Talk Structurally, Act Hierarchically: A Collaborative Framework for LLM Multi-Agent Systems (Read more on arXiv or HuggingFace) | Shingo Takamatsu, Briti Gangopadhyay, Wei-Yao Wang, Sota Moriyama, Zhao Wang | i) The paper introduces TalkHier, a novel framework for LLM Multi-Agent (LLM-MA) systems designed to improve communication and refinement in complex collaborative tasks. ii) The research aims to address challenges in managing communication and refinement among agents in LLM-MA systems. iii) The methodology involves a structured communication protocol and a hierarchical refinement system. iv) TalkHier achieves 88.38% accuracy on the MMLU benchmark when built on GPT-4o, outperforming inference scaling models and open-source multi-agent models. v) The principal implication for AI practitioners is a new standard for LLM-MA systems, providing a more effective, adaptable, and collaborative framework. |
One Example Shown, Many Concepts Known! Counterexample-Driven Conceptual Reasoning in Mathematical LLMs (Read more on arXiv or HuggingFace) | Xinnian Liang, Zhikun Xu, Haojing Huang, Jiayi Kuang, Yinghui Li | This paper introduces COUNTERMATH, a new benchmark for evaluating counterexample-driven conceptual reasoning in mathematical Large Language Models (LLMs). The main research objective is to assess and enhance LLMs' ability to understand mathematical concepts through counterexample-driven proofs, moving beyond reliance on "drill-based" learning. The key methodology involves creating a dataset of 1,216 university-level mathematical statement-rationale pairs from textbooks and developing a data engineering framework for automatically acquiring training data. Primary results show that even advanced LLMs like OpenAI o1 achieve a relatively low F1 score (60.1) on COUNTERMATH, and a fine-tuned model with only 1,025 training samples significantly outperformed baseline models. The principal implication for AI practitioners is that strengthening LLMs' counterexample-driven reasoning is crucial for improving their overall mathematical capabilities, and this work provides a benchmark and methodology to pursue this. |
Better Embeddings with Coupled Adam (Read more on arXiv or HuggingFace) | Tobias Stollenwerk, flxst | The paper introduces Coupled Adam, a modification of the Adam optimizer, to address the anisotropy problem in language model embeddings. The main research question is whether the second moment in the Adam optimizer contributes to anisotropic word embeddings in language models and how this can be mitigated. The key methodology involves analyzing the embedding update vectors under SGD and Adam, proposing a modified Adam optimizer ("Coupled Adam") that averages the second moment across vocabulary items, and empirically evaluating its impact on embedding quality and model performance. Primary results show Coupled Adam improves embedding isotropy significantly, achieving values above 0.90 in most small-scale experiments, and enhances upstream/downstream performance on sufficiently large datasets. For AI practitioners, using Coupled Adam instead of standard Adam can improve the quality of word embeddings and boost model performance, particularly for large language models. |
Towards Data-Efficient Pretraining for Atomic Property Prediction (Read more on arXiv or HuggingFace) | Bernard Ghanem, Yasir Ghunaim, hammh0a | This paper investigates data-efficient pretraining for atomic property prediction, showing that strategic dataset selection can match or surpass large-scale pretraining with significantly reduced computational cost. The main research objective is to determine if pretraining on a smaller, task-relevant dataset can achieve comparable or superior performance to large-scale pretraining in atomic property prediction. The key methodology introduces the Chemical Similarity Index (CSI), a metric inspired by Fréchet Inception Distance, to quantify the alignment between upstream pretraining datasets and downstream tasks, and uses this to select pretraining data. A primary result is that models pretrained on the ANI-1x dataset (using the CSI for selection) achieved a Mean Absolute Error (MAE) of 5.4 on rMD17, outperforming JMP-S (MAE of 6.7) with 24 times less computational budget. Principal implication for AI practitioners is that strategic selection of pretraining data based on task relevance, assessed using metrics like CSI, can achieve competitive performance with significantly reduced computational resources in atomic property prediction, favoring quality over quantity. |
Large Language Models and Mathematical Reasoning Failures (Read more on arXiv or HuggingFace) | birgermoell, jboye | This paper evaluates the mathematical reasoning capabilities of large language models (LLMs) using newly constructed word problems and identifies common failure modes. The main research question is: How good are LLMs at mathematical reasoning when evaluated on both answer correctness and solution steps? The key methodology involved creating a dataset of 50 high-school-level mathematical word problems and manually assessing the answers and solutions provided by eight LLMs, including Mixtral, Llama, Gemini, and GPT-4o. The primary result was that the o1 model achieved the highest accuracy, correctly solving 37 out of 50 problems, while all models exhibited errors in spatial reasoning, strategic planning, and arithmetic. The principal implication for AI practitioners is the need to evaluate LLMs' reasoning processes, not just their final answers, to avoid overestimating their problem-solving proficiency. |
Language Complexity Measurement as a Noisy Zero-Shot Proxy for Evaluating LLM Performance (Read more on arXiv or HuggingFace) | jboye, birgermoell | This paper evaluates the capability of Large Language Models (LLMs) to measure language complexity as a proxy for general LLM performance. The main research objective is to examine the performance of state-of-the-art LLMs on computing the LIX readability metric and performing dependency parsing to calculate Average Dependency Distance (ADD). The methodology involves evaluating six LLMs using Swedish essays, comparing their LIX and ADD computations against ground truth values, and correlating these with MMLU benchmark scores. A primary result is a strong significant correlation of -0.875 (p=0.026) between the models' accuracy in computing LIX and their MMLU performance. For AI practitioners, language complexity measurement abilities, specifically LIX computation, can serve as a practical, noisy zero-shot proxy for assessing general LLM capabilities, without needing extensive benchmarking datasets. |
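
The LIX readability metric mentioned in the last row above can be computed in a few lines. The following is a minimal sketch, assuming the standard LIX definition (average sentence length plus the percentage of words longer than six letters) and a deliberately naive regex tokenizer. The function name `lix` and the splitting rules are illustrative assumptions, not the paper's evaluation code.

```python
import re


def lix(text: str) -> float:
    """Return the LIX readability score of `text`.

    LIX = (words / sentences) + 100 * (long words / words),
    where a long word has more than six letters.
    Tokenization here is a naive assumption: words are runs of letters,
    sentences are separated by '.', '!' or '?'.
    """
    words = re.findall(r"[^\W\d_]+", text)  # runs of Unicode letters
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    long_words = [w for w in words if len(w) > 6]
    return len(words) / len(sentences) + 100 * len(long_words) / len(words)


# Two short Swedish sentences: 9 words, 2 sentences, 2 long words.
sample = "Det här är en mening. Universitetet ligger i Stockholm."
score = lix(sample)
```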
Title | Authors | Summary |
---|---|---|
Region-Adaptive Sampling for Diffusion Transformers (Read more on arXiv or HuggingFace) | Lili Qiu, Yiqi Zhang, Chengruidong Zhang, Yifan Yang, Ziming Liu | Region-adaptive sampling (RAS) improves the efficiency of Diffusion Transformers (DiTs) by dynamically adjusting sampling ratios across image regions. The main objective is to accelerate the sampling process of DiTs without significant quality degradation by focusing computational resources on semantically meaningful regions. RAS identifies "focus" regions in each sampling step using output noise from the previous step, updating only these, and caches the rest, based on attention continuity. RAS achieves speedups of up to 2.36x and 2.51x on Stable Diffusion 3 and Lumina-Next-T2I, respectively, with minimal generation quality degradation. AI practitioners can use RAS to significantly improve the sampling speed of Diffusion Transformers, facilitating real-time applications that require high-quality image generation. |
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model (Read more on arXiv or HuggingFace) | Nan Duan, Liangyu Chen, Kun Yan, Haoyang Huang, Guoqing Ma | i) Step-Video-T2V, a 30B parameter text-to-video model, achieves state-of-the-art results via a novel architecture and training strategy. ii) The research objective is to develop a high-performance and high-quality text-to-video generation model surpassing existing open-source and commercial engines. iii) The methodology involves a deep compression Video-VAE, a DiT with 3D full attention trained using Flow Matching, and a video-based DPO for visual quality enhancement. iv) Evaluated on Step-Video-T2V-Eval, Step-Video-T2V demonstrates state-of-the-art performance with 16x16 spatial and 8x temporal compression ratios while generating videos up to 204 frames. v) AI practitioners can leverage Step-Video-T2V as a strong baseline for further innovations in video foundation models, particularly in improving motion dynamics, aesthetics, and content consistency. |
ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models (Read more on arXiv or HuggingFace) | Samuel Roberts, Akash Gupta, Ansh Sharma, Mohammad Reza Taesiri, Jonathan Roberts | ZeroBench is a new visual reasoning benchmark of 100 questions designed to be impossible for current large multimodal models (LMMs). The main research objective is to create a lightweight yet challenging visual benchmark to evaluate and differentiate the capabilities of LMMs. The methodology involves manually curating and reviewing a set of diverse, multi-step visual reasoning questions, and then adversarially filtering them based on the performance of 20 contemporary LMMs. The primary result is that all evaluated LMMs scored 0.0% on the main questions of ZeroBench, although they achieved non-zero scores on the easier sub-questions, such as 24.30% pass@1 by Claude 3.5 Sonnet v2. The principal implication is that the benchmark exposes limitations of current LMMs and can thereby assist in the development of improved models. |
Large Language Diffusion Models (Read more on arXiv or HuggingFace) | Jingyang Ou, Xiaolu Zhang, Zebin You, Fengqi Zhu, Shen Nie | LLaDA, a diffusion model trained from scratch, achieves performance comparable to autoregressive LLMs like LLaMA3 8B. The main research question is whether diffusion models can achieve the capabilities of large language models (LLMs) without relying on the autoregressive paradigm. The key methodology is a masked diffusion model (MDM) trained with a forward data masking process and a reverse process parameterized by a vanilla Transformer to predict masked tokens, optimizing a likelihood bound. The primary result is that LLaDA 8B surpasses LLaMA2 7B on nearly all 15 standard zero/few-shot learning tasks, is on par with LLaMA3 8B, and achieves 70.7% accuracy on the GSM8K benchmark. The principal implication is that AI practitioners can explore diffusion models as a viable alternative to autoregressive models for large-scale language modeling, potentially offering advantages in bidirectional context understanding and parallel token generation. |
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment (Read more on arXiv or HuggingFace) | Peiyan Li, Chaoyou Fu, Haochen Tian, Tao Yu, Yi-Fan Zhang | i) The paper introduces MM-RLHF, a new dataset and methodology for aligning multimodal large language models (MLLMs) with human preferences. ii) The research aims to enhance MLLM capabilities across multiple dimensions by aligning models with human preferences. iii) The methodology includes curating a dataset of 120k comparison pairs, developing a critique-based reward model, and employing dynamic reward scaling within DPO. iv) Fine-tuning LLaVA-ov-7B with MM-RLHF and the proposed alignment algorithm achieves a 19.5% increase in conversational abilities and a 60% improvement in safety. v) AI practitioners can leverage the MM-RLHF dataset and associated techniques to improve MLLM alignment, leading to safer and more capable multimodal models; the critique-based reward model can be used to provide more informative feedback for training. |
Precise Parameter Localization for Textual Generation in Diffusion Models (Read more on arXiv or HuggingFace) | Adam Dziedzic, Kamil Deja, Franziska Boenisch, Bartosz Cywiński, Łukasz Staniszewski | This research localizes and utilizes the parameters in diffusion models responsible for generating and editing textual content within images. The main research objective is to identify the specific parameters within diffusion models that control the generation of textual content in images. The key methodology involves activation patching of cross and joint attention layers and fine-tuning using Low-Rank Adaptation (LoRA). The primary result is that less than 1% of diffusion models' parameters (0.61% of Stable Diffusion XL, 0.21% of DeepFloyd IF, and 0.23% of Stable Diffusion 3), specifically within attention layers, are responsible for textual content generation. This implies that AI practitioners can improve text generation in diffusion models, and enable precise text editing by fine-tuning or manipulating only this small subset of parameters, conserving computational resources and preserving overall image generation quality. |
Diverse Inference and Verification for Advanced Reasoning (Read more on arXiv or HuggingFace) | Yuke Zhang, Seunghwan Hyun, Mao Mao, Gaston Longhitano, Iddo Drori | i) The paper presents a diverse inference approach to improve the performance of Reasoning LLMs on challenging tasks. ii) The research aims to enhance reasoning LLMs' accuracy on complex benchmarks like IMO combinatorics, ARC puzzles, and HLE questions. iii) Key methods include combining multiple models/methods at test time, verifying solutions automatically, test-time simulations, reinforcement learning, and meta-learning of agent graphs. iv) The approach increases IMO combinatorics accuracy from 33.3% to 77.8%, HLE accuracy from 8% to 37%, and solves 80% of ARC puzzles unsolvable by 948 humans. v) AI practitioners can leverage diverse inference and verification techniques to improve the robustness and accuracy of reasoning LLMs on advanced problem-solving tasks. |
We Can't Understand AI Using our Existing Vocabulary (Read more on arXiv or HuggingFace) | Been Kim, Robert Geirhos, John Hewitt | This position paper argues that understanding and controlling AI requires developing new vocabulary (neologisms) to represent concepts unique to machines or humans. The main research objective is to argue for developing neologisms to bridge the communication gap between humans and AI, stemming from their differing conceptualizations of the world. The key methodology used is a conceptual argument supported by a proof-of-concept, "neologism embedding learning," which trains new word embeddings representing human or machine concepts to control model behavior. The primary results show that, with a "length neologism," the share of responses meeting the length constraints rose from near 0% under regular instructions to the vast majority of generations (Figure 5 of the paper). The authors also present a "diversity neologism" that increases response variety in a number-guessing task. The principal implication for AI practitioners is that creating and incorporating neologisms into prompts can improve control over language model behavior and potentially provide a more precise way to interact with and understand AI systems. |
AdaPTS: Adapting Univariate Foundation Models to Probabilistic Multivariate Time Series Forecasting (Read more on arXiv or HuggingFace) | Maurizio Filippone, Albert Thomas, Giuseppe Paolo, Vasilii Feofanov, abenechehab | AdaPTS is a framework for adapting pre-trained univariate time series foundation models to probabilistic multivariate forecasting using trainable feature-space transformations. The main research objective is to develop a method for leveraging pre-trained univariate time series foundation models (FMs) for multivariate forecasting tasks while addressing challenges like inter-feature dependencies and uncertainty quantification. The key methodology involves introducing "adapters" (stochastic, invertible feature-space transformations) that project multivariate inputs into a latent space where a frozen, pre-trained univariate FM can be applied independently to each dimension, followed by an inverse transformation. Primary results show that AdaPTS improves the forecasting accuracy of the Moment model in 5 out of 8 considered tasks; for example, on the Illness dataset (H=24), the VAE adapter achieved a 15% MSE improvement, reducing it from 2.902 to 2.461. AI practitioners can use AdaPTS as a modular and scalable solution for leveraging existing time series FMs in multivariate contexts, enhancing forecasting performance and uncertainty quantification without requiring FM fine-tuning. |
FoNE: Precise Single-Token Number Embeddings via Fourier Features (Read more on arXiv or HuggingFace) | Vatsal Sharan, Robin Jia, Mahdi Soltanolkotabi, Deqing Fu, Tianyi Zhou | FoNE introduces a novel method to represent numbers as single tokens in large language models using Fourier features. The main research objective is to develop a more precise and efficient number embedding method that overcomes the limitations of traditional subword and digit-wise tokenization in LLMs. FoNE maps numbers directly into the embedding space using their Fourier features, encoding each digit with two embedding dimensions. On 6-digit decimal addition, FoNE requires 64x less data to achieve 99% accuracy than subword and digit-wise embeddings and is the only method that yields 100% accuracy on over 100,000 test examples. The principal implication is that AI practitioners can leverage FoNE to improve LLM performance on number-related tasks, achieving higher accuracy with reduced computational overhead and training data. |
Jailbreaking to Jailbreak (Read more on arXiv or HuggingFace) | Bijan Varjavand, Robert Vacareanu, Vaughn Robinson, Jeremy Kritz, ZifanScale | This paper introduces "Jailbreaking-to-Jailbreak" (J2), a novel approach where a refusal-trained Large Language Model (LLM) is jailbroken to assist in jailbreaking other LLMs. The main research objective is to evaluate the capability of jailbroken LLMs to act as effective red teamers and to compare their performance against existing automated and human-led red teaming methods. The key methodology involves creating J2 attackers by jailbreaking frontier LLMs through human-crafted prompts, then using these J2 attackers in an iterative, multi-turn red teaming workflow with in-context learning. Primary results show that J2 attackers (specifically Sonnet-3.5 and Gemini-1.5-pro) achieve 93.0% and 91.0% attack success rates (ASRs), respectively, against GPT-4o on HarmBench, approaching human-level performance. The principal implication for AI practitioners is that LLM safeguards can be bypassed by leveraging a jailbroken version of an LLM, highlighting a new failure mode and emphasizing the need for enhanced safeguard mechanisms against LLM-assisted jailbreaking. |
STMA: A Spatio-Temporal Memory Agent for Long-Horizon Embodied Task Planning (Read more on arXiv or HuggingFace) | Shuguang Cui, Zhixin Mai, Ge Wang, Yiming Zhao, Mingcong Lei | The paper introduces the Spatio-Temporal Memory Agent (STMA), a framework designed to enhance task planning and execution in dynamic environments for embodied AI. The main objective is to enable agents to perform long-horizon tasks by improving decision-making and adaptability through integrated spatio-temporal memory. The methodology involves a spatio-temporal memory module, a dynamic knowledge graph for spatial reasoning, and a planner-critic mechanism for iterative strategy refinement. Results from evaluations in the TextWorld environment show STMA achieved a 31.25% improvement in success rate and a 24.7% increase in average score compared to state-of-the-art models. For AI practitioners, STMA offers a new way to approach memory within AI Agents. |
MRS: A Fast Sampler for Mean Reverting Diffusion based on ODE and SDE Solvers (Read more on arXiv or HuggingFace) | Ge Yang, Le Lu, Hongbo Zhao, Wei Fang, Ao Li | Mean Reverting Sampler (MRS) accelerates sampling for Mean Reverting (MR) Diffusion models. The main research objective is to reduce the sampling NFEs (number of function evaluations) of MR Diffusion, which currently requires hundreds of steps. The methodology involves solving the reverse-time SDE and probability flow ODE associated with MR Diffusion, deriving semi-analytical solutions consisting of an analytical function and a neural network parameterized integral. Primary results demonstrate that the MR Sampler maintains high sampling quality with a speedup of 10 to 20 times across ten different image restoration tasks. Principal implication for AI practitioners is that they can leverage MRS for faster and more efficient controllable generation using MR Diffusion models, making them more practical in applications. |
V2V-LLM: Vehicle-to-Vehicle Cooperative Autonomous Driving with Multi-Modal Large Language Models (Read more on arXiv or HuggingFace) | Yu-Chiang Frank Wang, Stephen F. Smith, Chien-Yi Wang, Ryo Hachiuma, Hsu-kuang Chiu | i) This paper introduces V2V-LLM, a large language model for cooperative autonomous driving. ii) The research aims to explore the problem of integrating LLMs into cooperative autonomous driving systems to improve safety. iii) The methodology involves creating a new dataset, V2V-QA, and developing a baseline method, V2V-LLM, that fuses perception information from multiple connected autonomous vehicles using scene-level and object-level features. iv) The V2V-LLM outperforms other fusion methods on notable object identification and planning tasks in the V2V-QA dataset, achieving a collision rate of 3.00% compared to 4.57% for the "No Fusion" baseline. v) The primary implication for AI practitioners is the potential of V2V-LLM to serve as a foundation model for cooperative autonomous driving, particularly in scenarios with sensor occlusion. |
Agentic End-to-End De Novo Protein Design for Tailored Dynamics Using a Language Diffusion Model (Read more on arXiv or HuggingFace) | Markus J. Buehler, Bo Ni | VibeGen is a generative AI framework for de novo protein design conditioned on normal mode vibrations. The main research objective is to develop a model that can generate novel protein sequences that exhibit specified dynamic properties, specifically low-frequency vibrational modes. The key methodology involves an agentic dual-model architecture, comprising a protein designer (PD) based on a protein language diffusion model that generates sequences and a protein predictor (PP) that evaluates their dynamic accuracy. Primary results showed that the generated proteins accurately reproduced prescribed normal mode amplitudes, with a median Pearson correlation coefficient of 0.53 between designed and target vibration profiles across a large test set. Principal implication for AI practitioners is the demonstration of a viable approach for integrating protein dynamics into generative protein design, enabling the creation of biomolecules with targeted motion-based functionalities. |
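
The FoNE row in the table above describes encoding each digit of a number with two embedding dimensions derived from Fourier features. The following is a minimal sketch of that idea, assuming one cosine/sine pair per digit position with periods 10, 100, 1000, and so on. The function names and the decoding routine are illustrative assumptions and are not taken from the authors' code.

```python
import math
from typing import List


def fourier_encode(x: int, num_digits: int) -> List[float]:
    """Encode a non-negative integer as one (cos, sin) pair per digit position.

    Assumption: position k (1 = units) uses period 10**k, so the pair
    captures x mod 10**k as a phase on the unit circle.
    """
    features: List[float] = []
    for k in range(1, num_digits + 1):
        period = 10 ** k
        angle = 2 * math.pi * (x % period) / period
        features.extend([math.cos(angle), math.sin(angle)])
    return features


def fourier_decode(features: List[float]) -> int:
    """Recover the integer digit by digit from the (cos, sin) pairs."""
    value = 0
    for k in range(1, len(features) // 2 + 1):
        cos_k = features[2 * (k - 1)]
        sin_k = features[2 * (k - 1) + 1]
        period = 10 ** k
        # Phase as a fraction of a full turn, mapped into [0, 1].
        fraction = (math.atan2(sin_k, cos_k) / (2 * math.pi)) % 1.0
        residue = round(fraction * period) % period  # equals x mod 10**k
        digit = residue // 10 ** (k - 1)
        value += digit * 10 ** (k - 1)
    return value


encoded = fourier_encode(4821, num_digits=4)  # 8 floats: two per digit
decoded = fourier_decode(encoded)
```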
Title | Authors | Summary |
---|---|---|
InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU (Read more on arXiv or HuggingFace) | Sung Ju Hwang, Losif63, geonp, gmlwns5176 | InfiniteHiP enables extremely long-context language model inference on a single GPU without significant performance loss. The main research objective is to develop a training-free framework that allows large language models (LLMs) to handle context lengths significantly exceeding their pre-trained limits on a single GPU. The key methodology involves a hierarchical pruning algorithm to optimize the key-value (KV) cache, combined with a novel block sparse attention mechanism and dynamic RoPE adjustments. The primary result is that InfiniteHiP achieves a 7.24x speedup in the SGLang framework with only 0.34% of the VRAM used by FlashAttention2, while extending context to 3 million tokens on a single GPU. The principal implication for AI practitioners is that InfiniteHiP offers a framework for efficient long-context inference built on a modular pruning algorithm. |
Skrr: Skip and Re-use Text Encoder Layers for Memory Efficient Text-to-Image Generation (Read more on arXiv or HuggingFace) | Se Young Chun, Jae-sun Seo, Wongi Jeong, Agorium | Skrr is a method for reducing text encoder memory usage in text-to-image diffusion models by selectively skipping or reusing layers. The main research question is how to reduce the memory footprint of text encoders in text-to-image (T2I) diffusion models without significantly impacting image quality or text alignment. The key methodology, Skrr, involves two phases: "Skip" identifies and prunes redundant transformer sub-blocks using a T2I diffusion-tailored discrepancy metric and beam search, and "Re-use" recycles remaining layers to mitigate performance loss. Skrr maintains image quality comparable to the original model, and achieves up to 20.4% improvement in GenEval scores at over 40% sparsity. The principal implication for AI practitioners is that Skrr offers an effective strategy for constructing memory-efficient T2I models, which could also help the development and deployment of text-to-image diffusion models, especially in resource-constrained environments. |
SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models (Read more on arXiv or HuggingFace) | Hu Xu, Shannon Zejiang Shen, ZhaofengWu, bencw, voidism | SelfCite is a self-supervised framework that aligns large language models (LLMs) to generate accurate, fine-grained citations by leveraging their own probabilities for necessity and sufficiency rewards through context ablation. The main research objective is to improve the accuracy and quality of citations generated by LLMs without relying on annotation processes. The key methodology involves using context ablation to calculate a reward signal based on two metrics, necessity score (probability drop) and sufficiency score (probability hold), and best-of-N sampling to generate better citations. The primary result is that SelfCite significantly improves citation correctness, increasing citation F1 by up to 5.3 points on the LongBench-Cite benchmark across five long-form question answering tasks. For AI practitioners, SelfCite offers a method to improve citation quality in LLM-generated text without requiring human annotation, potentially leading to more reliable and trustworthy LLM applications. |
An Open Recipe: Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging (Read more on arXiv or HuggingFace) | Kasima Tharnpipitchai, potsawee, pittawat, kunato | This paper demonstrates a method for enhancing reasoning capabilities in language-specific large language models (LLMs) using model merging and data selection within a limited computational budget. The main research objective is to incorporate the advanced reasoning abilities of a model like DeepSeek R1 into a Thai language-specific LLM while preserving its target language performance. The key methodology involves supervised fine-tuning of the language-specific LLM on a curated dataset, followed by ability-aware model merging with a reasoning-focused LLM, optimizing the merge ratio across layers. A primary result is that the merged model, Typhoon2-R1-70B, achieved 76.5% average performance across all evaluation metrics, 41.6% above Typhoon2 70B Instruct and 12.8% above DeepSeek R1 70B Distill. This approach allows AI practitioners to improve reasoning in low-resource language LLMs efficiently, using publicly available datasets and modest computational resources. |
Exploring the Potential of Encoder-free Architectures in 3D LMMs (Read more on arXiv or HuggingFace) | delinqu, Tavish9, zhuhaow, Purple1288, IvanTang | This paper investigates encoder-free architectures for 3D Large Multimodal Models (LMMs), demonstrating comparable performance to encoder-based models. The main research objective is to determine if 3D LMMs can effectively function without dedicated 3D encoders, directly integrating 3D understanding capabilities within the Large Language Model (LLM). The key methodology involves proposing LLM-embedded Semantic Encoding during pre-training and Hierarchical Geometry Aggregation during instruction tuning, replacing the traditional 3D encoder with learnable LLM layers and self-supervised losses. The primary result is that the proposed ENEL model, without a 3D encoder, achieved a GPT-4 score of 50.92% on 3D object captioning, which is comparable to the state-of-the-art ShapeLLM-13B. The principal implication is that AI practitioners can explore encoder-free 3D LMMs as a potentially more efficient and scalable alternative to encoder-based architectures, potentially simplifying model design and reducing computational overhead. |
Can this Model Also Recognize Dogs? Zero-Shot Model Search from Weights (Read more on arXiv or HuggingFace) | Yedid Hoshen, Or Nathan, Jonathan Kahana, Eliahu | This paper introduces ProbeLog, a method for retrieving classification models capable of recognizing a specific target concept based on model weights, without access to training data or metadata. The main research question is how to efficiently and accurately search for models in large repositories that can recognize a given concept (e.g., "Dog") in a zero-shot manner. ProbeLog uses a probing-based approach, computing logit-level descriptors by observing model responses to a fixed set of input probes, and extends this to zero-shot search via text alignment models. The method achieved a top-1 retrieval accuracy of 43.8% on the INet-Hub dataset when searching for models recognizing ImageNet concepts from text prompts. AI practitioners can use ProbeLog to search for suitable pre-trained models based on specific concept recognition capabilities, potentially reducing the need for training or fine-tuning. |
CoSER: Coordinating LLM-Based Persona Simulation of Established Roles (Read more on arXiv or HuggingFace) | Rui Xu, Xinfeng Yuan, Yifei Zhang, Heng Wang, Xintao Wang | CoSER is a framework for simulating established characters using large language models (LLMs), including a dataset, models, and an evaluation protocol. The main research objective is to address the lack of authentic character datasets and nuanced evaluation methods for simulating established characters with LLMs. The key methodology is given-circumstance acting (GCA), where LLMs sequentially portray multiple characters in book scenes, used for both training and evaluation. Primary results show that CoSER 70B achieves 75.80% and 93.47% accuracy on the InCharacter and LifeChoice benchmarks, respectively, surpassing or matching GPT-4o. The principal implication for AI practitioners is that they can leverage the CoSER dataset and GCA framework to train and evaluate LLMs for more faithful and nuanced role-playing of established characters, improving applications like character chatbots and agents in games. |
TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models (Read more on arXiv or HuggingFace) | Yuan Liang, Dehu Wang, Zexiang Liu, Zi-Xin Zou, Yangguang Li | TripoSG is a new image-to-3D generation model that leverages large-scale rectified flow transformers to achieve high-fidelity 3D shape synthesis. The main research objective is to determine the optimal paradigm for generating high-fidelity 3D models with precise alignment to input images. The key methodology involves a large-scale rectified flow transformer trained on 2 million high-quality 3D samples, a hybrid supervised 3D VAE training strategy, and a dedicated data processing pipeline. Primary results show that TripoSG achieves a Normal-FID score of 3.36 when trained on a large-scale dataset with 4096 tokens and a mixture-of-experts model. For AI practitioners, the model demonstrates that large-scale generative techniques can produce detailed, high-fidelity 3D models that remain consistent with a single input image. |
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents (Read more on arXiv or HuggingFace) | Cheng Qian, Mark Zhao, Junyu Zhang, Rui Yang, Hanyang81 | EmbodiedBench is a benchmark for evaluating vision-driven embodied agents based on multi-modal large language models (MLLMs). The main research question is how existing MLLMs perform as vision-driven embodied agents across a variety of tasks and capabilities, and what their limitations are. The key methodology was to develop a benchmark (EMBODIEDBENCH) with 1,128 testing instances across four environments, hierarchical action levels (high-level and low-level), and six capability-oriented subsets, and then to evaluate 13 proprietary and open-source MLLMs using a unified agent framework. Primary results show that MLLMs excel at high-level tasks but struggle with low-level manipulation; the best model, GPT-4o, scored only 28.9% on average across all tasks in the benchmark, and performance degrades by 40%-70% when vision input is removed in low-level tasks. The principal implication for AI practitioners is to focus on improving MLLMs' low-level manipulation and long-horizon planning, and to explore additional approaches for leveraging visual input in high-level embodied tasks, since even the best model performs poorly on low-level tasks. |
Typhoon T1: An Open Thai Reasoning Model (Read more on arXiv or HuggingFace) | Kunat Pipatanakul, Kasima Tharnpipitchai, Potsawee Manakul, pittawat | Typhoon T1 is an open-source Thai reasoning model built on a large language model, demonstrating a method for developing reasoning capabilities in low-resource languages. The primary research objective was to develop a Thai reasoning model and investigate effective strategies for its creation, including thinking formats and data composition. The key methodology involved supervised fine-tuning of a pre-trained language model (Typhoon 2 3B Instruct) using synthetically generated datasets with structured, semi-structured, and unstructured reasoning chains. A primary result was that the structured thinking format achieved a GSM8K score of 62.02, outperforming unstructured and semi-structured formats. The principal implication for AI practitioners is that supervised fine-tuning with structured synthetic data can effectively create reasoning models, particularly in low-resource languages, providing a viable alternative to reinforcement learning. |
Logical Reasoning in Large Language Models: A Survey (Read more on arXiv or HuggingFace) | Chaoli Zhang, Mengru Ding, Hanmeng Liu, ruoxining, HarryFu | This survey synthesizes advancements in logical reasoning within large language models (LLMs), covering paradigms, benchmarks, enhancement methods, and future directions. The main research objective is to provide a comprehensive overview of logical reasoning capabilities in LLMs, focusing on formal symbolic logic rather than general heuristic approaches. The key methodology involves a literature review analyzing existing capabilities across deductive, inductive, abductive, and analogical reasoning, as well as assessing strategies like data-centric tuning, reinforcement learning, and neuro-symbolic approaches. A primary result is that while GPT-4 outperforms ChatGPT on benchmarks like LogiQA and ReClor, both models struggle with out-of-distribution tasks. The principal implication for AI practitioners is the need for hybrid architectures and improved evaluation frameworks that stress-test robustness and generalization in logical reasoning, moving beyond simple accuracy metrics to assess consistency and explainability. |
MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency (Read more on arXiv or HuggingFace) | Yu Qi, Yanwei Li, Ziyu Guo, Renrui Zhang, CaraJ | MME-CoT is a benchmark for evaluating Chain-of-Thought (CoT) reasoning in Large Multimodal Models (LMMs), assessing quality, robustness, and efficiency. The main research objective is to investigate to what extent and how CoT reasoning benefits multimodal challenges in LMMs. The researchers curated a dataset spanning six domains and proposed novel metrics that examine LMMs' reasoning quality, robustness, and efficiency at a fine-grained level. The evaluation reveals that Kimi k1.5 achieved the best CoT quality with a 64.2 F1-score, surpassing GPT-4o, and that CoT prompting often degrades LMM performance on perception-heavy tasks. For AI practitioners, the results provide insights into the strengths and weaknesses of applying CoT to LMMs, especially highlighting that careful consideration is needed when employing CoT in tasks requiring strong perceptual capabilities. |
CoT-Valve: Length-Compressible Chain-of-Thought Tuning (Read more on arXiv or HuggingFace) | Xinchao Wang, Gongfan Fang, Runpeng Yu, Guangnian Wan, Xinyin Ma | CoT-Valve introduces a method for tuning language models to generate reasoning chains of controllable lengths, improving efficiency and adaptability. The main research objective is to enable a single model to dynamically adjust the length of its Chain-of-Thought (CoT) reasoning based on task difficulty. The key methodology involves identifying and manipulating a direction in the parameter space (using LoRA) that controls CoT length, along with a "MixChain" dataset for training. A primary result is that on GSM8K, the QwQ-32B-Preview model reduced reasoning chains from 741 to 225 tokens with a minor performance drop (95.07% to 94.92%). The principal implication for AI practitioners is that it enables more efficient inference by allowing models to use shorter reasoning paths for simpler tasks, which can improve the cost-effectiveness of reasoning-based applications. |
SQuARE: Sequential Question Answering Reasoning Engine for Enhanced Chain-of-Thought in Large Language Models (Read more on arXiv or HuggingFace) | Moshe Wasserblat, Gad Markovits, Moshe Berchansky, danf | SQuARE is a prompting technique that improves large language model reasoning by generating and answering sub-questions before addressing the main query. The main research objective is to assess if decomposing queries into iterative steps via self-interrogation enhances the reasoning capabilities of LLMs. The key methodology is prompting LLMs (Llama 3 and GPT-4o) to generate and resolve multiple auxiliary question-answer pairs before answering the original question, across multiple QA datasets (TriviaQA, HotpotQA, ASQA). Primary results show that SQuARE improves performance on TriviaQA by 6.5% over Retrieval-Augmented Generation (RAG) using the Llama-3.2 3B model. For AI practitioners, SQuARE presents a method for improving response accuracy in reasoning tasks by systematically decomposing questions, particularly beneficial for smaller-scale models. |
mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data (Read more on arXiv or HuggingFace) | Ziliang Zhao, Yutao Zhu, Nan Yang, Liang Wang, Haon-Chen | mmE5 enhances multimodal multilingual embeddings through a novel synthetic data generation framework. The research objective is to improve multimodal embedding performance by addressing the scarcity of high-quality labeled multimodal data. The methodology involves synthesizing datasets using an MLLM, guided by principles of broad scope, robust cross-modal alignment, and high fidelity, incorporating deep thinking, self-evaluation, and refinement. mmE5 achieves a state-of-the-art average score of 58.6 on the MMEB benchmark in a zero-shot setting, surpassing previous methods. AI practitioners can leverage mmE5's synthetic data generation approach to create more robust and generalizable multimodal embedding models, particularly in multilingual contexts. |
The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of Physical Concept Understanding (Read more on arXiv or HuggingFace) | Shunchi Zhang, Tsz Ting Chung, Junjie Wu, Lemao Liu, Mo Yu | The paper introduces PHYSICO, a benchmark to evaluate large language models' (LLMs) understanding of physical concepts, revealing significant gaps compared to human performance. The primary research objective is to investigate whether LLMs truly understand physical concepts or merely act as "stochastic parrots." The key methodology is a summative assessment using grid-format inputs to represent physical phenomena, and comparing LLM performance with human performance across various subtasks. Results indicate that state-of-the-art LLMs, like GPT-4, perform very well on low-level tasks (>95% accuracy) but lag behind humans on high-level tasks (~40% lower accuracy). For AI practitioners, the principal implication is that LLMs still lack robust physical concept understanding beyond memorization, suggesting a need for new methods to improve their reasoning ability. |
DexTrack: Towards Generalizable Neural Tracking Control for Dexterous Manipulation from Human References (Read more on arXiv or HuggingFace) | Li Yi, Yuzhe Qin, Qianwei Han, Jianibieke Adalibieke, Xueyi Liu | DexTrack is a neural tracking controller that learns to manipulate objects with a robotic hand by following human-provided kinematic references. The main research objective is to develop a generalizable neural tracking controller for dexterous manipulation that can mimic human-object interaction trajectories. The key methodology involves iteratively training the controller with reinforcement and imitation learning, using a homotopy optimization method to mine high-quality robot tracking demonstrations from human references. The primary results show that DexTrack achieves over a 10% improvement in success rates compared to leading baselines in both simulation and real-world evaluations. AI practitioners can leverage DexTrack's approach of combining imitation learning with high-quality demonstrations to create versatile and robust controllers for complex robotic manipulation tasks. |
3CAD: A Large-Scale Real-World 3C Product Dataset for Unsupervised Anomaly (Read more on arXiv or HuggingFace) | Yuanwei Ma, Wenbo Guo, Hanyang Sun, Peng Xing, enquan2022 | 3CAD, a large-scale real-world dataset for unsupervised anomaly detection in 3C products, is introduced along with a coarse-to-fine detection paradigm. The main research objective is to create a challenging benchmark dataset of 3C product defects and develop an effective unsupervised anomaly detection method. The key methodology, CFRG, combines knowledge distillation, recovery guidance, and a segmentation network for coarse-to-fine localization of anomalies. CFRG achieves 93.4% AUROC, 86.5% AUPRO, and 82.0% AP on the 3CAD dataset. The principal implication for practitioners is that the 3CAD dataset and CFRG model provide a challenging benchmark and an effective baseline for unsupervised anomaly detection in real-world 3C product manufacturing. |
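
The 3CAD row above reports AUROC as its headline metric. As a rough illustration of what that number measures, here is a minimal pure-Python sketch: AUROC is the probability that a randomly chosen anomalous sample receives a higher anomaly score than a randomly chosen normal one. The function name `auroc` and the toy scores are hypothetical and are not taken from the paper or its code.

```python
def auroc(scores, labels):
    """Pairwise AUROC: P(score of anomalous sample > score of normal sample).

    scores: anomaly scores, higher means "more anomalous".
    labels: 1 for anomalous, 0 for normal.
    Ties between an anomalous and a normal sample count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one anomalous and one normal sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (made up for illustration only).
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
print(auroc(toy_scores, toy_labels))
```

The quadratic loop keeps the definition visible; for real evaluation a rank-based implementation from an established library would be the more practical choice.
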
Title | Authors | Summary |
---|---|---|
TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation (Read more on arXiv or HuggingFace) | Zhuobai Dong, Weiming Han, Jiawei Zhang, Dongxing Mao, Alex Jinpeng Wang | TextAtlas5M is a large-scale dataset designed for generating images with dense, complex, and long-form text. The main research objective is to address the limitations of existing datasets, which often focus on shorter and simpler text, thereby hindering the development of models capable of generating images with comprehensive textual content. The key methodology involves curating 5 million long-text generated and collected images across diverse data types, including synthetic and real-world images, and creating a human-improved test set (TextAtlasEval) of 3,000 samples across 3 data domains. Primary results show that even advanced proprietary models (e.g., GPT-4o with DALL-E 3) are significantly challenged by the TextAtlasEval benchmarks, with an even larger gap for their open-source counterparts. The dataset and benchmarks provide AI practitioners with a valuable resource for training and evaluating text-conditioned image generation models, specifically for dense and long-form text rendering, thus advancing the capacity to control visual outputs. |
Light-A-Video: Training-free Video Relighting via Progressive Light Fusion (Read more on arXiv or HuggingFace) | Pan Zhang, Pengyang Ling, Jiazi Bu, Yujie Zhou, yuhangzang | Light-A-Video is a training-free approach for temporally smooth video relighting that leverages image relighting and video diffusion models. The main research objective is to achieve temporally consistent video relighting without requiring training or optimization, addressing the limitations of existing methods. The key methodology involves a Consistent Light Attention (CLA) module for stable light source generation and a Progressive Light Fusion (PLF) strategy to blend relighted appearances, incorporating motion priors from a video diffusion model. Primary results show that Light-A-Video achieves a FID score of 29.63 while maintaining a temporal consistency CLIP score of 0.9655, superior to baseline methods that apply image relighting frame-by-frame. For AI practitioners, Light-A-Video provides a training-free pipeline for high-quality video relighting, directly applicable with existing image relighting and video diffusion models, enabling zero-shot illumination control of video sequences. |
BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models (Read more on arXiv or HuggingFace) | Lei Li, Conghui He, Hanxu Hu, Wenhao Zhu, ggdcr | BenchMAX is a multi-way multilingual evaluation benchmark for assessing advanced capabilities of large language models (LLMs) across 17 languages. The main research objective is to create a benchmark that fairly compares LLM capabilities like instruction following, reasoning, and code generation across diverse languages and script systems. The methodology involves machine-translating English tasks into 16 other languages, followed by independent annotation by three native speakers for each sample and task, and final version selection using a strong LLM. A key finding is that the DeepSeek-V3 671B model achieved 84.2% on Math and 47.4 on Science reasoning tasks. For AI practitioners, BenchMAX provides a platform to evaluate LLM performance across languages to improve their multilingual capabilities. |
CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation (Read more on arXiv or HuggingFace) | Huchuan Lu, Xu Jia, Xiaoyu Shi, Yawen Luo, Qinghe Wang | CineMaster is a novel framework for 3D-aware and controllable text-to-video generation, enabling cinematic video creation with precise object placement and camera control. The main research objective is to provide users with 3D-aware and intuitive control over text-to-video generation, similar to the control wielded by film directors. The proposed two-stage framework first allows users to construct 3D scenes and camera movements via an interactive workflow, then uses the generated depth maps, camera trajectories, and object labels to guide a text-to-video diffusion model. CineMaster achieves a mean Intersection over Union (mIoU) of 0.551 and a trajectory deviation (Traj-D) of 66.29, outperforming existing methods in object-box alignment. For AI practitioners, this framework provides a new paradigm for controllable video generation, using a 3D-native approach to enable precise manipulation of scene elements and camera movement directly from textual input and 3D scene descriptions. |
WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation (Read more on arXiv or HuggingFace) | Mike Zheng Shou, Difei Gao, Henry Hengyuan Zhao | WorldGUI introduces a new benchmark and framework, GUI-Thinker, for dynamic testing of desktop GUI automation agents. The main research objective is to evaluate and improve GUI agents' ability to handle diverse initial states and dynamic environments in real-world computer interactions. The methodology involves creating a benchmark (WorldGUI) with 315 tasks across 10 applications, each with varied starting states, and proposing a critical-thinking-based framework (GUI-Thinker) with five core components: Planner, Planner-Critic, Step-Check, Actor, and Actor-Critic. Experimental results demonstrate that GUI-Thinker significantly outperforms existing agents, with the Claude-3.5-based GUI-Thinker achieving a 32.4% overall success rate and the GPT-4o-based agent achieving 36.2%, exceeding a baseline by 14.9%. For AI practitioners, WorldGUI provides a robust benchmark to test and enhance agent adaptability in varied, dynamic states. |
LASP-2: Rethinking Sequence Parallelism for Linear Attention and Its Hybrid (Read more on arXiv or HuggingFace) | Yu Cheng, Xiaoye Qu, Yiran Zhong, landisen, weigao266 | LASP-2 improves sequence parallelism for linear attention in transformers by optimizing communication and computation. The main research objective is to enhance the efficiency of sequence parallelism (SP) when training linear attention transformer models with very long input sequences. The key methodology is LASP-2, which reorganizes the communication-computation workflow to require only one AllGather collective communication on intermediate memory states independent of sequence length, and extends this to hybrid models (LASP-2H). Primary results show that LASP-2 achieves training speed improvements of 15.2% over LASP and 36.6% over Ring Attention on a Linear-Llama3 model with a 2048K sequence length across 64 GPUs. For AI practitioners, LASP-2 provides a more efficient way to train linear attention-based and hybrid transformer models on long sequences, reducing training time and resource consumption. |
TransMLA: Multi-head Latent Attention Is All You Need (Read more on arXiv or HuggingFace) | Muhan Zhang, Zengwei Yao, fxmeng | TransMLA converts GQA-based language models to MLA-based models, improving expressiveness without increasing KV cache size. The main research objective is to demonstrate that Multi-head Latent Attention (MLA) offers greater expressive power than Group Query Attention (GQA) for the same key-value (KV) cache overhead. The key methodology involves transforming pre-trained GQA models (e.g., LLaMA, Qwen) into equivalent MLA models via low-rank matrix factorization, followed by fine-tuning. Primary results show that the transformed TransMLA model outperformed the original Qwen2.5-7B GQA model on the GSM8K benchmark (87% vs 81%). The main implication is that TransMLA gives AI practitioners using open-source, GQA-based LLMs a low-cost way to move to the more expressive MLA architecture without changing KV cache size. |
Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance (Read more on arXiv or HuggingFace) | Yan Wang, Weipeng Zhou, Lingfei Qian, QianqianXie1994, jiminHuang | The paper evaluates the performance of reasoning-enhanced and general large language models (LLMs) on financial tasks and introduces a new financial reasoning-enhanced model. The main research question is how transferable general-domain reasoning enhancements in LLMs are to the financial domain, and what impact they have across different financial tasks. The methodology involves a comprehensive evaluation of 16 LLMs on three financial datasets (FinQA, DocMath-Simplong, XBRL-Math) encompassing numerical reasoning, tabular interpretation, and financial terminology, followed by developing a model called Fino1. A primary result is that Fino1-8B achieved an average score of 61.03 across all datasets, outperforming Llama3.1-8B-Instruct by 10.91 points, with an XBRL-Math score reaching 82.22. The key implication for AI practitioners is that domain-specific fine-tuning with curated financial data, even on a small scale, can significantly improve LLM performance on financial reasoning tasks, surpassing general reasoning enhancements. |
Ignore the KL Penalty! Boosting Exploration on Critical Tokens to Enhance RL Fine-Tuning (Read more on arXiv or HuggingFace) | lecraquito, Nbeau, supertardigrade | This paper investigates how varying pre-training levels affect language model exploration in reinforcement learning (RL) fine-tuning, and proposes a modified KL penalty to improve exploration. The main research question is how pre-training data distribution impacts exploration efficiency during RL fine-tuning of language models on tasks requiring out-of-distribution generalization. The key methodology involves pre-training a small language model on an arithmetic addition task with varying digit lengths, then fine-tuning it with RL and a modified KL penalty that prioritizes exploration on "critical tokens". Primary results show the model with the prioritized KL penalty achieved higher accuracy; for example the accuracy during testing with N=7 was higher when the KL penalty took into account the confidence of the old policy. The principal implication for AI practitioners is that adjusting the KL penalty based on pre-trained model certainty on specific tokens can enhance the efficiency of RL fine-tuning, particularly for tasks requiring generalization beyond the pre-training distribution. |
Distillation Scaling Laws (Read more on arXiv or HuggingFace) | Etai Littwin, Jason Ramapuram, Floris Weers, Amitis Shidani, Dan Busbridge | This paper provides a distillation scaling law that estimates distilled model performance based on compute budget and student/teacher allocation. The main research objective is to determine optimal distillation recipes and understand how to allocate compute resources between teacher and student models to maximize student performance. The key methodology involves a large-scale, controlled study of distillation with students and teachers ranging from 143M to 12.6B parameters, trained on up to 512B tokens, fitting a distillation scaling law to predict student cross-entropy. The primary result is that distillation outperforms supervised pretraining only when the total compute is below a student-size-dependent threshold and a teacher already exists or has uses beyond a single distillation, and student cross-entropy follows a broken power law. The principal implication for AI practitioners is that distillation is beneficial for resource-constrained scenarios or when leveraging existing teachers, guiding optimal model and data scaling during distillation pretraining. |
SARChat-Bench-2M: A Multi-Task Vision-Language Benchmark for SAR Image Interpretation (Read more on arXiv or HuggingFace) | HaiPeng Wang, Peidong Wang, Sihao Dong, Xiayang Xiao, JimmyMa99 | SARChat-Bench-2M is a new benchmark for evaluating vision-language models (VLMs) on synthetic aperture radar (SAR) image interpretation tasks. The main research objective is to develop a large-scale multimodal dialogue dataset and benchmark for evaluating VLMs' capabilities in SAR image understanding. The key methodology involves constructing a dataset (SARChat-2M) of 2 million SAR image-text pairs and defining six core tasks (classification, description, counting, localization, recognition, and referring) with specific evaluation metrics. Primary results show that the mPLUG-Owl3-7B model achieved the best performance among tested VLMs, with single-target and multi-target cross-modal identification accuracy rates reaching 99.27% and 99.51%, respectively. The principal implication is that AI practitioners can use SARChat-2M and SARChat-Bench to train, evaluate, and advance VLMs for SAR-specific applications, addressing the existing gap in large-scale, high-quality aligned SAR image-text datasets. |
LLM Pretraining with Continuous Concepts (Read more on arXiv or HuggingFace) | Andrew Cohen, Jane Yu, Jack Lanchantin, Jihoon Tack, xlxxl | LLM Pretraining with Continuous Concepts introduces a novel pretraining framework, CoCoMix, that combines discrete next-token prediction with continuous concept learning to enhance language models. The main research objective is to investigate whether augmenting the next token prediction objective with explicit concept modeling in a latent space can improve language model pretraining. The key methodology involves extracting concepts from a pretrained sparse autoencoder, predicting these concepts, and mixing them into the model's hidden state by interleaving them with token hidden representations. The primary results show that CoCoMix achieves comparable performance to standard next-token prediction with 21.5% fewer training tokens on a 1.38B parameter model. For AI practitioners, CoCoMix offers a more sample-efficient pretraining approach, enhances model interpretability and steerability by allowing direct inspection and modification of the predicted concept, and improves performance in weak-to-strong supervision scenarios. |
Animate Anyone 2: High-Fidelity Character Image Animation with Environment Affordance (Read more on arXiv or HuggingFace) | Dechao Meng, Xin Gao, Zhen Shen, Guangyuan Wang, Hookszdp | Animate Anyone 2 introduces a diffusion-based framework for character image animation that incorporates environmental context to achieve realistic character-environment interactions. The main research objective is to animate characters with environment affordance, ensuring consistent and interactive relationships between the character and its surroundings. The key methodology involves extracting both motion signals and environmental representations from a source video, using a shape-agnostic mask strategy, an object guider with spatial blending for object interactions, and depth-wise pose modulation. Primary results include a superior SSIM score of 0.812 and FVD of 144.65 on the TikTok benchmark, outperforming existing methods in quantitative evaluations. For AI practitioners, this framework offers a robust method to generate high-fidelity character animations that seamlessly integrate with their environments, useful for applications in filmmaking and advertising. |
NoLiMa: Long-Context Evaluation Beyond Literal Matching (Read more on arXiv or HuggingFace) | Ryan A. Rossi, Trung Bui, Hanieh Deilamsalehy, Franck-Dernoncourt, amodaresi | NOLIMA, a new benchmark, evaluates large language models' (LLMs) long-context understanding by minimizing literal keyword overlap between questions and answers, emphasizing associative reasoning. Main research question/objective: To assess how well LLMs perform long-context reasoning when they cannot rely on simple literal matches between the question and the context, unlike typical Needle-In-A-Haystack (NIAH) tests. Key methodology: The authors created the NOLIMA benchmark, extending NIAH, where questions and corresponding "needles" (answers) have minimal lexical overlap, requiring models to infer latent associations to locate the needle within a long "haystack" (irrelevant text). They tested 12 LLMs, including GPT-4o, and conducted analyses varying reasoning complexity, context length, needle placement, and the presence or absence of literal matching. Primary results: Model performance degraded significantly with increasing context length; at 32K tokens, 10 of the 12 models dropped below 50% of their short-length baseline scores. GPT-4o's performance decreased from a 99.3% baseline to 69.7% at 32K. The presence of literal matches drastically simplified the task, while distractors with literal matches drastically impaired performance. Principal implication for AI practitioners: Current LLMs, even those claiming to support very long contexts, struggle with long-context associative reasoning tasks that lack surface-level (literal) cues, indicating a critical limitation that practitioners should consider when deploying these models in long-context applications. |
Mediator: Memory-efficient LLM Merging with Less Parameter Conflicts and Uncertainty Based Routing (Read more on arXiv or HuggingFace) | Peijie Dong, Xinglin Pan, Zhenheng Tang, Kunfeng Lai, Dominic789654 | Mediator is a framework for merging multiple fine-tuned large language models (LLMs) efficiently by adaptively averaging layers with minimal parameter conflicts and routing layers with significant conflicts. The main research objective is to develop a method for merging LLMs that minimizes parameter conflicts and system costs while preserving performance across diverse tasks. The key methodology involves quantifying layer-wise parameter conflicts, adaptively averaging layers with low conflict and routing layers with high conflict, employing sparse expert decomposition, and using uncertainty-based routing for out-of-distribution samples. Primary results show that Mediator achieves significant performance improvements over existing methods; e.g. on LLaMA-3.2-8B, it achieved 71.80% average on multiple tasks. The principal implication is that AI practitioners can merge fine-tuned LLMs more efficiently to improve the performance and adaptability while reducing the storage and computational costs compared to maintaining separate models. |
Next Block Prediction: Video Generation via Semi-Autoregressive Modeling (Read more on arXiv or HuggingFace) | Furu Wei, Xu Sun, Shuming Ma, Shuhuai Ren | The paper proposes a semi-autoregressive framework called Next-Block Prediction (NBP) for video generation that improves upon traditional next-token prediction. The main research objective is to develop a video generation framework that improves spatial dependency modeling and inference efficiency compared to autoregressive next-token prediction models. The key methodology shifts the generation unit from individual tokens to blocks (e.g., rows or frames), using bidirectional attention within each block and predicting multiple tokens in parallel. The NBP model achieved FVD scores of 103.3 on UCF101 and 25.5 on K600, outperforming the vanilla NTP model by an average of 4.4, with an 11x inference speedup. For AI practitioners, this framework provides a more efficient and scalable solution for video generation, maintaining or improving quality while accelerating inference through parallelization. |
DPO-Shift: Shifting the Distribution of Direct Preference Optimization (Read more on arXiv or HuggingFace) | Xiao Li, Lei Zhao, Qianen Zhang, Feng Jiang, Xiliang Yang | DPO-Shift controllably shifts the distribution of chosen probabilities in Direct Preference Optimization (DPO) to mitigate likelihood displacement. The main research objective is to address the likelihood displacement issue in DPO, where probabilities of chosen responses decrease during training. The key methodology is introducing a parameter function, f(x), added to the rejected reward in the Bradley-Terry model, called DPO-Shift. Experimentally, DPO-Shift with f(x)=0.95 achieved a reward accuracy of 0.743 on the UltraFeedback test set, comparable to DPO's 0.739, while demonstrably increasing chosen response probability. For AI practitioners, DPO-Shift offers a simple, theoretically grounded solution to improve alignment with human preferences by mitigating the likelihood displacement of standard DPO, enabling a trade-off between chosen probability and reward margin. |
LLM Modules: Knowledge Transfer from a Large to a Small Model using Enhanced Cross-Attention (Read more on arXiv or HuggingFace) | kkolomeitsev | The paper introduces LLM Modules, an architecture for transferring knowledge from a large, frozen language model to a smaller, trainable one using Enhanced Cross-Attention. The main objective is to develop a method that enables smaller models to achieve performance comparable to larger models by leveraging the knowledge of pre-trained large language models (LLMs) without full fine-tuning. The key methodology involves using a frozen Qwen2-1.5B model as a "knowledge source" and a GPT-Neo-125M model as a "generation module," connected by Enhanced Cross-Attention layers that include linear projections, an adapter block, and a gating mechanism. Training on the Bespoke-Stratos-17k dataset for 15 epochs reduced training loss from 13.8 to 2.3 in the first epoch and to 1.1 in subsequent ones. For AI practitioners, the principal implication is that this modular approach can significantly reduce computational costs associated with training large language models while still achieving substantial performance improvements on specific tasks. |
MetaSC: Test-Time Safety Specification Optimization for Language Models (Read more on arXiv or HuggingFace) | vicgalle | MetaSC is a framework that optimizes language model safety reasoning at inference time by dynamically updating safety prompts. The research objective is to improve language model safety performance without modifying model weights. The key methodology is a "meta-critique" mechanism that iteratively updates safety prompts (specifications) to adaptively drive the critique and revision process of a self-critique loop. Primary results show that MetaSC significantly improves safety scores compared to fixed system prompts and static self-critique defenses, achieving a safety score of 1.00 on the jailbreak defense task using the Hermes-3-Llama-3.1-405B model. For AI practitioners, MetaSC offers a way to enhance model safety dynamically at inference time, without retraining or fine-tuning. |
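
The Next Block Prediction row above describes bidirectional attention within each block while generation stays autoregressive across blocks. One common way to express that pattern is a block-causal attention mask; the sketch below builds one in plain Python. It is only an illustration of the general idea under that assumption: the function name `block_causal_mask` is hypothetical, and the paper's actual masking scheme may differ in detail.

```python
def block_causal_mask(num_tokens, block_size):
    """Return mask[q][k] == True when query token q may attend to key token k.

    Tokens are grouped into consecutive blocks of `block_size`.
    A query sees every token in its own block (bidirectional within the block)
    and every token in earlier blocks (causal across blocks).
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [
        [(k // block_size) <= (q // block_size) for k in range(num_tokens)]
        for q in range(num_tokens)
    ]


# Four tokens in blocks of two: rows are queries, columns are keys.
for row in block_causal_mask(4, 2):
    print(["x" if allowed else "." for allowed in row])
```

Setting `block_size=1` reduces this to the ordinary token-level causal mask, which makes the relationship to next-token prediction easy to see.
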
Title | Authors | Summary |
---|---|---|
Competitive Programming with Large Reasoning Models (Read more on arXiv or HuggingFace) | Borys Minaev, Andre Saraiva, Alexander Wei, Ahmed El-Kishky, OpenAI | Reinforcement learning significantly improves large language models' performance on complex coding and reasoning tasks. The main research question is how domain-specific, hand-engineered inference strategies compare to learned approaches in competitive programming. The key methodology involved fine-tuning large language models with reinforcement learning and comparing performance with and without hand-crafted test-time strategies. The primary result was that OpenAI's o3 model achieved a Codeforces rating of 2724 (99.8th percentile) and an IOI 2024 score of 395.64, surpassing a gold medal threshold without hand-engineered strategies. Scaling general-purpose reinforcement learning presents a robust method toward state-of-the-art AI in reasoning tasks like competitive programming. |
CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction (Read more on arXiv or HuggingFace) | Yu Wu, Runxin Xu, Dejian Yang, Daya Guo, Junlong Li | CODEI/O systematically condenses diverse reasoning patterns in code for improved performance on reasoning tasks. The main research objective is to improve the performance of Large Language Models (LLMs) on a broad range of reasoning tasks by leveraging code-based training data. The key methodology involves transforming raw code files into an input-output prediction format and training LLMs to predict either the output given code and input, or feasible input given code and output, entirely in natural language as Chain-of-Thought rationales. Primary results demonstrate consistent improvements across 14 benchmarks spanning symbolic, scientific, logic, math & numerical, and commonsense reasoning, with CODEI/O++ achieving an average score improvement of 2.9 points, compared to single stage training on Qwen 2.5 Coder 7B. For AI practitioners, this implies that training on code input-output prediction tasks can enhance LLMs' general reasoning capabilities beyond code-specific applications. |
Magic 1-For-1: Generating One Minute Video Clips within One Minute (Read more on arXiv or HuggingFace) | Qingyu Yin, Jiantong Zhao, Shitong Shao, Hongwei Yi, Owen777 | Magic 1-For-1 is an efficient video generation model that optimizes memory consumption and inference latency. The main objective is to reduce the computational cost and time required for text-to-video generation while maintaining high video quality. The key methodology involves factorizing the text-to-video task into text-to-image and image-to-video subtasks, alongside model convergence speedup, adversarial step distillation, and parameter sparsification. The primary results show the model can generate 5-second video clips within 3 seconds, and achieves an average score of 0.8134 on a customized VBench, outperforming other models. The principal implication for AI practitioners is that it offers an approach for generating minute-long videos within one minute, optimizing the tradeoff between computational cost and video quality for diffusion-based video generation. |
Teaching Language Models to Critique via Reinforcement Learning (Read more on arXiv or HuggingFace) | Jingjing Xu, Weichao Mao, Liyu Chen, Jie chen, Zhihui | CTRL trains large language models (LLMs) to provide effective feedback on code, improving iterative code generation. The main research objective is to develop a framework, CTRL, that trains a critic model to generate feedback that maximizes correction performance for a fixed generator model, without human supervision. The methodology uses a two-stage approach: supervised finetuning using execution feedback to synthesize critiques, followed by reinforcement learning with Group Relative Policy Optimization (GRPO) to optimize the critic. The results demonstrate that critics trained with CTRL significantly enhance pass rates, achieving up to 106.1% relative improvement on the CodeContests benchmark when using the same base model for generation and critique, and 23.5% improvement when paired with a better generator. For AI practitioners, CTRL provides a method to create specialized critics that can substantially improve code generation performance through effective, targeted feedback, enabling more autonomous AI systems. |
Expect the Unexpected: FailSafe Long Context QA for Finance (Read more on arXiv or HuggingFace) | Mateusz Russak, Dmytro Mozolevskyi, Melisa Russak, muayad, kiranr | FailSafeQA, a new long-context financial benchmark, evaluates LLM robustness and context-awareness against variations in human-interface interactions. i) This paper introduces FailSafeQA, a new benchmark for evaluating the robustness of Large Language Models (LLMs) in financial question-answering systems, particularly when dealing with long contexts and imperfect user inputs. ii) The main research objective is to assess the resilience of LLMs against six variations in human-input interactions, such as query failure (misspelled, incomplete and out-of-domain) and context failure (degraded, irrelevant, and missing). iii) The key methodology uses the LLM-as-a-Judge approach with Qwen2.5-72B-Instruct and defines fine-grained rating criteria to calculate Robustness, Context Grounding, and Compliance scores for 24 LLMs. The input consists of truncated 10-K filings. iv) The most robust model, OpenAI o3-mini, fabricated information in 41% of tested cases, while Palmyra-Fin-128k-Instruct, the most compliant model, failed robust predictions in 17% of test cases. v) AI practitioners should be aware that high-performing LLMs still have significant room for improvement in terms of balancing robustness and context grounding. Practitioners must carefully assess the trade-off between a model's ability to handle imperfect inputs and its tendency to hallucinate. |
LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters! (Read more on arXiv or HuggingFace) | Xiangxi Mo, Shu Liu, Tyler Griggs, Shiyi Cao, Dacheng Li | Large language models (LLMs) can be efficiently fine-tuned to perform complex reasoning by learning the structural patterns of long chain-of-thought (CoT) demonstrations. The main research question is how to effectively elicit Long CoT reasoning capabilities in LLMs and what aspects of training data are most important. The key methodology involved supervised fine-tuning and low-rank adaptation (LoRA) on LLMs, with controlled experiments perturbing either the content or structure of Long CoT training samples. A primary result was that a Qwen2.5-32B-Instruct model achieved 56.7% accuracy on AIME 2024 after fine-tuning with only 17k Long CoT samples. AI practitioners can elicit strong reasoning performance in LLMs with relatively small, structurally sound datasets, without needing perfect accuracy in the content of individual reasoning steps. |
Éclair -- Extracting Content and Layout with Integrated Reading Order for Documents (Read more on arXiv or HuggingFace) | Lukas Voegtle, Ilia Karmanov, jseppanen, katerynaCh, amalad | ÉCLAIR, a multi-modal large language model (MLLM), extracts structured text, bounding boxes, and semantic classes from documents in integrated reading order. The main research objective is to develop a general-purpose text-extraction tool capable of processing diverse document types and extracting formatted text, spatial information, and semantic class labels simultaneously. The key methodology involves a transformer encoder-decoder architecture with a ViT-like encoder and an autoregressive decoder, pre-trained on a newly generated arXiv-5M dataset and fine-tuned on diverse public datasets. The primary results include achieving state-of-the-art accuracy on the new DROBS benchmark with a 0.937 Counting F1 score and outperforming other methods on established benchmarks. The principal implication for AI practitioners is that ÉCLAIR provides a new model for document OCR, enabling the extraction of more structured data from documents. |
CAD-Editor: A Locate-then-Infill Framework with Automated Training Data Synthesis for Text-Based CAD Editing (Read more on arXiv or HuggingFace) | Jiang Bian, Qi Liu, Yu Yuan, ShizhaoSun | CAD-Editor is a framework for automatically modifying CAD models based on textual instructions, using an automated data synthesis pipeline and a locate-then-infill approach. The main research objective is to develop a system for text-based editing of CAD models, addressing the lack of support for text-based control in existing design variation methods and the absence of consideration for existing CAD models as constraints. The methodology involves generating synthetic training data using design variation models and LVLMs and decomposing the task into locating regions for modification and infilling those regions with LLMs. Primary results show that CAD-Editor achieves a 95.6% Valid Ratio and a 0.27 Directional CLIP Score, outperforming baseline methods in generation validity, text-CAD alignment, and overall quality. AI practitioners can leverage the proposed framework and data synthesis pipeline to enable more intuitive and efficient CAD model editing through natural language instructions, accelerating the design workflow. |
Enhance-A-Video: Better Generated Video for Free (Read more on arXiv or HuggingFace) | Wenqi Shao, Kaipeng Zhang, Mengzhao Chen, Xuanlei Zhao, Yang Luo | Enhance-A-Video is a training-free method to improve the temporal consistency and visual quality of diffusion transformer (DiT)-based video generation. The main research objective is to develop a method to enhance the coherence and quality of DiT-based generated videos without retraining or fine-tuning. The key methodology involves introducing an "Enhance Block" that calculates a Cross-Frame Intensity (CFI) from temporal attention maps and uses an "enhance temperature" parameter to scale and integrate this CFI, thereby strengthening cross-frame correlations. User studies demonstrated that models incorporating Enhance-A-Video were preferred across metrics including temporal consistency, prompt-video consistency, and overall visual quality, and VBench scores consistently improved across all tested models. AI practitioners can integrate this plug-and-play method into existing DiT-based video generation frameworks to improve video quality at minimal computational cost, without any retraining or fine-tuning of models. |
NatureLM: Deciphering the Language of Nature for Scientific Discovery (Read more on arXiv or HuggingFace) | Chuan Cao, Liang He, Shufang Xie, Peiran Jin, Yingce Xia | NatureLM is a sequence-based science foundation model designed for scientific discovery across multiple domains. Main research question or objective: To develop a unified, versatile model capable of handling various scientific applications, including generation and optimization, across multiple scientific domains using a sequence-based approach. Key methodology used: A Transformer decoder architecture pre-trained on 143 billion tokens from multiple scientific domains (small molecules, proteins, DNA, RNA, materials, and text), followed by post-training with instruction-response pairs. Primary results: NatureLM (8x7B) achieved state-of-the-art performance in retrosynthesis (71.9% top-1 accuracy on USPTO-50K) and SMILES-to-IUPAC translation (0.607 top-5 accuracy), significantly outperforming general-purpose foundation models. Principal implication for AI practitioners: Practitioners can utilize NatureLM as a foundation model for diverse scientific tasks, particularly where cross-domain interactions and sequence-based representations are crucial, potentially accelerating scientific discovery through a generalist model approach. |
Hephaestus: Improving Fundamental Agent Capabilities of Large Language Models through Continual Pre-Training (Read more on arXiv or HuggingFace) | Kewei Cheng, Xin Liu, Haoming Jiang, Jingfeng Yang, yczhuang | Hephaestus introduces a continual pre-training method to enhance the fundamental capabilities of LLM-based agents. Main research question or objective: How can continual pre-training on a large-scale, agent-oriented corpus improve the API function calling, intrinsic reasoning, and environmental feedback adaptation capabilities of large language models? Key methodology used: A two-stage continual pre-training framework on the Hephaestus-Forge corpus (103B tokens, 76,537 APIs), leveraging scaling law experiments to optimize data mixing ratios, followed by instruction fine-tuning. Primary results: Hephaestus-8B outperforms LLAMA-3-8B by 9.6% and rivals commercial LLMs on three agent benchmarks, achieving performance comparable to GPT-3.5-turbo and excelling particularly in complex multi-turn tasks (BFCL-v3). Principal implication for AI practitioners: Continual pre-training with a well-curated, agent-specific corpus like Hephaestus-Forge can significantly enhance fundamental agent capabilities of open-source LLMs, bridging the performance gap with commercial models and providing a more robust and generalizable foundation for LLM-based agent development. |
Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon (Read more on arXiv or HuggingFace) | Seffi Cohen, Lior Rokach, Bracha Shapira, Yehonatan Elisha, Nurit Cohen-Inger | This paper introduces a meta-evaluation framework, Chameleon Benchmark Overfit Detector (C-BOD), to detect overfitting in Large Language Models (LLMs) on benchmark datasets. The central research question is whether LLMs over-rely on benchmark-specific cues, exhibiting surface-level performance rather than true language understanding. The methodology involves systematically perturbing benchmark prompts using a parametric transformation (controlled by parameter µ) and assessing performance changes with statistical significance tests (McNemar's test). A primary result is that 20 out of 26 tested LLMs showed statistically significant performance degradation on the MMLU benchmark under modest perturbations, with an average accuracy drop of 2.15%. AI practitioners should integrate C-BOD's perturbation methods into evaluation pipelines to ensure robust generalization and mitigate superficial memorization in LLMs, prioritizing model resilience over high scores on fixed benchmarks. |
VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation (Read more on arXiv or HuggingFace) | Hang Xu, Yi Zhu, Yanpeng Zhou, Zimian Peng, Sixiao Zheng | VidCRAFT3 is a novel image-to-video generation framework enabling precise control over camera motion, object motion, and lighting direction. The main research objective is to develop a model that can simultaneously control multiple visual elements (camera motion, object motion, and lighting) in image-to-video generation, overcoming the limitations of existing methods. The key methodology involves a Spatial Triple-Attention Transformer integrating lighting, text, and image features, along with 3D point cloud rendering and trajectory-based motion encoding, and using a three-stage training process. Primary results show the model achieves a CamMC score of 4.07 on the RealEstate10K dataset, outperforming existing methods like CameraCtrl, CamI2V and MotionCtrl. The principal implication is that AI practitioners can use VidCRAFT3 to create high-quality videos with fine-grained and disentangled control over multiple aspects. |
Retrieval-augmented Large Language Models for Financial Time Series Forecasting (Read more on arXiv or HuggingFace) | Yueru He, Zhengyu Chen, Lingfei Qian, Zihao Jiang, Mengxi Xiao | This paper introduces a retrieval-augmented generation (RAG) framework, FinSeer, for financial time-series forecasting, specifically stock movement prediction. The main research objective is to develop a RAG framework that effectively integrates financial time-series data with large language models (LLMs) to improve stock movement prediction accuracy. The key methodology involves a fine-tuned 1B parameter LLM (StockLLM), a novel candidate selection method using LLM feedback, and a training objective maximizing similarity between queries and historically significant sequences. The RAG framework with FinSeer achieved an 8% higher accuracy on the BIGDATA22 benchmark compared to a general-purpose LLM-feedback-based retriever. For AI practitioners, this framework demonstrates the importance of using dedicated retrieval models designed to process and filter financial time-series data, to improve the performance of the LLMs in financial forecasting tasks. |
Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More (Read more on arXiv or HuggingFace) | Li Shen, Zhenyu Zhang, Jianjin Li, Zhikai Jia, Xialie Zhuang | Mask-Enhanced Autoregressive Prediction (MEAP) integrates masked language modeling into next-token prediction to improve large language models' in-context retrieval capabilities without extra computational cost. The main research objective is to enhance LLMs' ability to retrieve key information and perform long-context reasoning without compromising their fundamental language modeling capabilities. MEAP randomly masks a fraction of input tokens and then performs standard next-token prediction using a decoder-only Transformer. In pre-training, MEAP outperformed NTP on the Needle in a Haystack evaluation by 11% on average while using 140B fewer training tokens. This demonstrates MEAP's superior performance in key information retrieval tasks, and thus provides AI practitioners with a more data- and compute-efficient training paradigm for large language models. |
FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks (Read more on arXiv or HuggingFace) | Mirco Ravanelli, Cem Subakan, Francesco Paissan, lucadellalib | FocalCodec is a low-bitrate speech codec based on focal modulation that uses a single binary codebook for compression. The research objective is to develop a speech codec that achieves high compression rates while preserving both semantic and acoustic information for downstream tasks. The key methodology involves a compressor-quantizer-decompressor architecture utilizing focal modulation, binary spherical quantization (BSQ), and a pretrained self-supervised encoder (WavLM). Primary results show that FocalCodec@50 achieves a dWER of 2.18 on the LibriSpeech test-clean set, outperforming several baselines at comparable bitrates. AI practitioners can use FocalCodec as an efficient and low-bitrate option that can be deployed to preserve sufficient semantic and acoustic information for downstream tasks, such as speech resynthesis, voice conversion, or speech enhancement model development. |
Auditing Prompt Caching in Language Model APIs (Read more on arXiv or HuggingFace) | Percy Liang, Rohith Kuditipudi, Xiang Lisa Li, Chenchen Gu, thashim | Prompt caching in large language model APIs can leak private and proprietary information through timing differences, which can be detected by auditing. The main research objective was to develop and conduct statistical audits to detect prompt caching and determine the level of cache sharing (per-user, per-organization, or global) in real-world LLM API providers. The key methodology was using statistical hypothesis testing on response times from two procedures: one to generate cache hits, and one to generate cache misses, analyzing differences using the two-sample Kolmogorov-Smirnov test. The primary results revealed that prompt caching was detected in 8 out of 17 API providers, with 7 exhibiting global cache sharing across users, where it was detected with an average precision of around 0.8. AI practitioners should be aware of prompt caching implementation details and cache-sharing levels in LLM APIs to mitigate potential privacy leakage, since the caching can be identified from timing data. |
Gemstones: A Model Suite for Multi-Faceted Scaling Laws (Read more on arXiv or HuggingFace) | Abhinav Bhatele, Siddharth Singh, David Yu Miller, John Kirchenbauer, smcleish | Gemstones provides a dataset of over 4000 transformer checkpoints to study scaling laws across various architectural and training hyperparameters. The main research question is how model design (width, depth) and model selection impact scaling law parameters and interpretations. The key methodology involves training transformers, up to 2 billion parameters, with diverse widths, depths, learning rates, and cooldown schedules, then fitting and analyzing scaling laws on this data. The primary results show scaling law prescriptions are highly sensitive to model selection and fitting procedures; for example, the optimal tokens-per-parameter ratio is slightly higher than that proposed in previous works. The principal implication for AI practitioners is that scaling laws should be approached with awareness of their fragility, with a recommendation to err on the side of wider and, surprisingly, over-trained models, especially when considering time optimality. |
Skill Expansion and Composition in Parameter Space (Read more on arXiv or HuggingFace) | Yixing Lan, Haoyi Niu, Yinan Zheng, Jianxiong Li, LTL07 | i) The paper introduces Parametric Skill Expansion and Composition (PSEC), a framework for iteratively expanding agent capabilities. ii) The research aims to develop an autonomous agent that can efficiently acquire new skills by leveraging prior knowledge and dynamically composing existing skills. iii) PSEC employs parameter-efficient finetuning using Low-Rank Adaptation (LoRA) modules for skill expansion and a context-aware module for skill composition in parameter space. iv) Experiments on D4RL show that PSEC has a superior capacity to tackle new challenges efficiently. v) PSEC provides AI practitioners with a method for continual learning and efficient skill transfer in reinforcement learning agents, mitigating catastrophic forgetting through parameter isolation. |
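
The PSEC entry that closes the table above describes composing LoRA skill modules in parameter space. The following is a minimal sketch of that general idea, not the authors' implementation: each skill is a low-rank pair `(B, A)`, and the merged weight is the frozen base plus a weighted sum of the products `B @ A`. The helper names (`matmul`, `softmax`, `compose_weights`) and the fixed `scores` are hypothetical; in particular, the softmax over fixed scores is only a stand-in for the context-aware module the summary mentions.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compose_weights(base, skills, scores):
    """Return base + sum_i alpha_i * (B_i @ A_i), with alpha = softmax(scores).

    `scores` is an assumption standing in for the output of a learned,
    context-dependent weighting module.
    """
    alphas = softmax(scores)
    merged = [row[:] for row in base]  # copy so the frozen base stays untouched
    for alpha, (b, a) in zip(alphas, skills):
        delta = matmul(b, a)  # low-rank update for this skill
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += alpha * value
    return merged


# Frozen 2x2 base weight and two rank-1 skills (B is 2x1, A is 1x2).
base = [[1.0, 0.0], [0.0, 1.0]]
skill_a = ([[1.0], [0.0]], [[0.0, 2.0]])
skill_b = ([[0.0], [1.0]], [[4.0, 0.0]])

# Equal scores give each skill a weight of 0.5.
merged = compose_weights(base, [skill_a, skill_b], scores=[0.0, 0.0])
print(merged)
```

Because the base matrix is copied rather than modified, adding a new skill only adds a new `(B, A)` pair, which is one way to read the "parameter isolation" the summary credits with mitigating catastrophic forgetting.
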
Title | Authors | Summary |
---|---|---|
SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators (Read more on arXiv or HuggingFace) | Alexander Panchenko, tlenusik, memyprokotow, chameleon-lizard, etomoscow | This paper introduces SynthDetoxM, a multilingual synthetic parallel text detoxification dataset, and a framework for generating such data using large language models (LLMs). The main research objective is to address the scarcity of parallel multilingual datasets for training text detoxification models. The key methodology involves few-shot prompting of multiple open-source LLMs to rewrite toxic sentences sourced from existing toxicity datasets across German, French, Spanish, and Russian, followed by a filtering and ranking process. Models trained on the full SynthDetoxM achieved a J score (combining style transfer accuracy, similarity, and fluency) of 0.484, 0.521, and 0.471 on German, Russian and Spanish respectively. The principal implication is that AI practitioners can leverage the proposed framework and the SynthDetoxM dataset to train more effective multilingual text detoxification models, even with limited human-annotated parallel data. |
Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning (Read more on arXiv or HuggingFace) | Yuzhe Gu, Songyang Gao, Chengqi Lyu, zsytony, ZwwWayne | This paper introduces OREAL, a new reinforcement learning (RL) framework for enhancing mathematical reasoning in large language models (LLMs) using only binary outcome rewards. The main research objective is to push the performance limit achievable through Outcome REwArd-based reinforcement learning (OREAL) for mathematical reasoning tasks. The key methodology involves behavior cloning on positive trajectories from Best-of-N sampling, reward shaping for negative samples, and a token-level reward model for credit assignment. OREAL achieves a 95.0 pass@1 accuracy on MATH-500 with a 32B model, and a 7B model can obtain 94.0 pass@1 accuracy on MATH-500. AI practitioners can utilize OREAL's techniques to improve LLM performance on mathematical reasoning tasks using readily available binary outcome feedback, emphasizing the importance of policy model initialization and proper training data selection. |
Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling (Read more on arXiv or HuggingFace) | Xiu Li, Jian Zhao, Junqi Gao, iseesaw, RyanLiu112 | This paper investigates compute-optimal test-time scaling (TTS) strategies for Large Language Models (LLMs), demonstrating that smaller LLMs can outperform larger ones with appropriate scaling. The main research question is what is the optimal approach to scaling test-time computation across different policy models, Process Reward Models (PRMs), and problem difficulty levels, and to what extent can it improve performance. The key methodology involves comprehensive experiments on MATH-500 and AIME24 tasks using various LLMs (0.5B to 72B) and PRMs (1.5B to 72B), evaluating different TTS methods like Best-of-N, beam search, and Diverse Verifier Tree Search. The primary results show that a 3B LLM with compute-optimal TTS can surpass a 405B LLM, achieving 75.6% on MATH-500 and 30.0% on AIME24, compared to 71.4% and 23.3% for the 405B model with Chain-of-Thought prompting. The principal implication for AI practitioners is that applying compute-optimal, reward-aware TTS strategies can significantly enhance the reasoning abilities of smaller LLMs, potentially leading to more efficient and effective deployment compared to using much larger models. |
Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding (Read more on arXiv or HuggingFace) | Soyeong Jeong, Jeongyeon Seo, Sangjin Choi, doubleyyh, zomss | Hierarchy Drafting (HD) accelerates large language model (LLM) inference by organizing token sources into hierarchical databases based on temporal locality and accessing them sequentially during speculative decoding. Main research question or objective: To address the limitations of existing speculative decoding methods, which rely on a single database, require additional fine-tuning or deliver inconsistent acceleration gains. Key methodology used: The proposed method, Hierarchy Drafting (HD), organizes diverse token sources into three databases (context-dependent, model-dependent, and statistics-dependent) based on temporal locality and accesses them sequentially during speculative decoding, starting from the smallest to largest. Primary results: Experiments on Spec-Bench using LLMs with 7B and 13B parameters demonstrate that HD outperforms existing lossless drafting methods, achieving over 1.5x faster inference speed compared to autoregressive decoding when the temperature is 0.0. Principal implication for AI practitioners: AI practitioners can achieve significant and consistent lossless inference acceleration in LLMs without model retraining or modification, using readily accessible data sources, by employing HD, making it suitable for real-world deployment. |
Show-o Turbo: Towards Accelerated Unified Multimodal Understanding and Generation (Read more on arXiv or HuggingFace) | Yishun Li, Zhenyi Liao, zhijie3, asunalove, UnhurriedDawn | Show-o Turbo accelerates the unified multimodal understanding and generation model Show-o by extending consistency distillation to its multimodal denoising trajectories. The main research question is whether a unified approach exists to enhance the efficiency of Show-o's inference, which involves denoising image tokens and autoregressively decoding text tokens. The key methodology involves viewing text generation as a denoising process using Jacobi decoding, extending consistency distillation (CD) to multimodal discrete sampling trajectories, and employing trajectory segmentation and curriculum learning. In text-to-image generation, Show-o Turbo achieves a GenEval score of 0.625 at 4 sampling steps without classifier-free guidance (CFG), outperforming the original Show-o with 8 steps and CFG; on image-to-text tasks it achieves a 1.5x speedup. AI practitioners can leverage this approach to deploy more efficient multimodal models that achieve significant speedups in both image and text generation tasks with minimal performance trade-offs. |
Training Language Models for Social Deduction with Multi-Agent Reinforcement Learning (Read more on arXiv or HuggingFace) | Dorsa Sadigh, C. Karen Liu, Warren Xia, bidiptas | Language models are trained to communicate effectively in a multi-agent social deduction game without human demonstrations, enhancing their ability to reason and strategize. The main research objective is to train language models to have productive natural language discussions about their environment, leveraging the agent's goal for predicting useful information. The methodology decomposes communication into listening and speaking, using a dense reward signal based on imposter prediction and influence on other agents' beliefs to guide multi-agent reinforcement learning. Crewmate agents trained with the proposed technique achieve double the win rate compared to standard reinforcement learning, illustrating the value of the communication strategy. AI practitioners can utilize the described approach to enable self-improving discussions in multi-agent settings without requiring task-specific human data, potentially broadening the application of language models in cooperative AI. |
ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates (Read more on arXiv or HuggingFace) | Mengdi Wang, Bin Cui, Zhaochen Yu, Ling Yang | ReasonFlux is a hierarchical LLM reasoning framework that optimizes mathematical reasoning by scaling thought templates. The main research objective is to improve LLMs' mathematical reasoning capabilities beyond existing models like OpenAI's o1-preview and DeepSeek V3. The key methodology involves a structured thought template library, hierarchical reinforcement learning on template sequences, and an inference scaling system that adaptively retrieves and applies templates. On the MATH benchmark, ReasonFlux-32B achieves an accuracy of 91.2%, surpassing o1-preview by 6.7%. AI practitioners can leverage ReasonFlux's hierarchical template-based approach for more efficient and generalizable reasoning in complex problem-solving applications, requiring fewer computational resources. |
The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models via Visual Information Steering (Read more on arXiv or HuggingFace) | Zhenting Wang, Di Liu, Yunhe Gao, Haizhou Shi, Zhuowei Li | This paper introduces VISTA, a training-free framework to reduce hallucination in Large Vision-Language Models (LVLMs) by steering token generation with visual information. The main research objective is to investigate and mitigate the phenomenon of LVLMs generating syntactically coherent but visually ungrounded content. The key methodology, VISTA, combines a Visual Steering Vector (VSV) to reinforce visual cues in activation space and Self-Logits Augmentation (SLA) to leverage early-layer activations for semantically meaningful decoding. Primary results show that VISTA reduces hallucination by about 40% on average in open-ended generation tasks, outperforming existing methods across multiple architectures and decoding strategies. The principal implication for AI practitioners is that VISTA provides an efficient, inference-time intervention to improve the visual grounding and reliability of LVLMs without requiring additional training or model modification. |
Matryoshka Quantization (Read more on arXiv or HuggingFace) | Aditya Kusupati, Prateek Jain, Jeff Dean, Puranjay Datta, Pranav Nair | Matryoshka Quantization (MatQuant) is a multi-scale quantization technique that trains a single model capable of operating at various integer bit-widths. The main research question is whether a single model can be trained to extract multiple accurate lower-precision models, addressing the challenges of accuracy loss in low-precision quantization and the need for maintaining multiple models. The key methodology is Matryoshka Quantization, which jointly optimizes model weights across multiple precision levels (e.g., int8, int4, int2) using shared most significant bits and leveraging the inherent nested structure of integer data types. Primary results show that MatQuant-derived int2 models outperform standard int2 quantization techniques by up to 10% in accuracy, and an int2 FFN-quantized Gemma-2 9B model is more accurate than an int8 FFN-quantized Gemma-2 2B model. The principal implication is that AI practitioners can train and maintain a single quantized model that can be served at different precision levels, offering a spectrum of accuracy-versus-cost options and improving accuracy, especially in very low precision regimes like int2. |
EVEv2: Improved Baselines for Encoder-Free Vision-Language Models (Read more on arXiv or HuggingFace) | Yueze Wang, Yufeng Cui, Xiaotong Li, Haiwen Diao, PhyscalX | EVEv2.0 is a new family of encoder-free vision-language models (VLMs) that improve upon existing baselines through architectural and training enhancements. The main research objective is to systematically investigate and improve the performance of encoder-free VLMs, addressing challenges like cross-modal interference and visual perception learning from scratch. The key methodology involves a "Divide-and-Conquer" architecture that decomposes the model into modality-specific components within a unified decoder-only framework, along with a progressive training strategy utilizing an enhanced captioning engine. Primary results show that EVEv2.0 achieves 71.4% accuracy on ScienceQA-IMG, outperforming prior encoder-free models, while approaching the performance of encoder-based counterparts with similar capacity, using only 100M publicly available data. The principal implication for AI practitioners is that properly decomposing and associating modalities, combined with a well-designed training strategy, allows for effective optimization of decoder-only VLMs, providing superior data efficiency and strong visual-reasoning capability, and thereby improving performance of large language models. |
LM2: Large Memory Models (Read more on arXiv or HuggingFace) | Fraser Greenlee, Alex J. Chan, Filippos Christianos, Wenqi Wu, Jikun Kang | LM2 is a memory-augmented Transformer architecture designed to improve long-context reasoning in language models. The main research objective is to address the limitations of standard Transformers in processing long contexts with distributed information, particularly for tasks involving multi-step reasoning and relational argumentation. The key methodology involves integrating a dynamic memory module into the decoder-only Transformer, using cross-attention and gating mechanisms to update and retrieve contextual representations. Experimental results on the BABILong benchmark show LM2 outperforms the memory-augmented RMT model by 37.1% and the baseline Llama-3.2 model by 86.3% on average across tasks. The principal implication for AI practitioners is that incorporating explicit memory modules, as done in LM2, can enhance a Transformer's ability to handle long-context reasoning tasks without sacrificing performance on general tasks, which has significance for NLP applications. |
Lumina-Video: Efficient and Flexible Video Generation with Multi-scale Next-DiT (Read more on arXiv or HuggingFace) | Kai Wang, Zhen Li, Yutong Liu, Shicheng Li, Dongyang Liu | Lumina-Video is a novel framework for efficient and flexible video generation based on an enhanced Diffusion Transformer architecture. The main research objective is to address the spatiotemporal complexity and computational challenges of video generation using Diffusion Transformers (DiTs). The key methodology involves a Multi-scale Next-DiT architecture with multiple patch sizes, motion score conditioning, progressive training, and multi-source training. Lumina-Video achieves a total score of 82.94% on the VBench benchmark, demonstrating competitive performance in generating high-quality videos. AI practitioners can leverage Lumina-Video's Multi-Scale Next-DiT and training strategies to build efficient and flexible video generation models with controllable dynamics. |
History-Guided Video Diffusion (Read more on arXiv or HuggingFace) | Russ Tedrake, Yilun Du, Max Simchowitz, Boyuan Chen, Kiwhan Song | The paper introduces a video diffusion model, DFoT, and a family of guidance methods, History Guidance (HG), that improve video generation quality and consistency by leveraging variable-length historical frames. The main research question is how to effectively use different portions of video history as a form of guidance for improved video generation. The key methodology involves the Diffusion Forcing Transformer (DFoT), which allows conditioning on flexible history lengths, and History Guidance methods, which combine scores from different history windows and noise levels. A primary result is that DFoT with history guidance achieves a Fréchet Video Distance (FVD) of 170.4 on Kinetics-600, outperforming baselines. AI practitioners can use DFoT and History Guidance to improve the quality, consistency, and length of generated videos, especially for tasks requiring long-term coherence. |
CustomVideoX: 3D Reference Attention Driven Dynamic Adaptation for Zero-Shot Customized Video Diffusion Transformers (Read more on arXiv or HuggingFace) | Zhen Yang, Jin Wang, Jingxuan Pang, Mushui Liu, D. She | CustomVideoX is a zero-shot personalized video generation framework based on the Video Diffusion Transformer, enhancing video quality and temporal coherence. The main research objective is to develop a method for generating customized videos from a reference image and text prompt, addressing temporal inconsistencies and quality degradation issues. The key methodology involves integrating 3D Reference Attention for direct interaction between reference image and video frames, Time-Aware Attention Bias to modulate reference feature influence, and Entity Region-Aware Enhancement for focused feature injection. Primary results show that CustomVideoX achieves a CLIP-I score of 90.26 and DINO-I score of 91.49 on the VideoBench benchmark, outperforming other methods. AI practitioners can leverage CustomVideoX's architecture for improved zero-shot personalized video generation, specifically benefiting from the 3D Reference Attention and time-aware mechanisms for better fidelity and consistency. |
APE: Faster and Longer Context-Augmented Generation via Adaptive Parallel Encoding (Read more on arXiv or HuggingFace) | Beidi Chen, Tianqi Chen, Hanyuezhuohua | APE improves context-augmented generation by enabling faster and longer context processing through adaptive parallel encoding. The main research objective is to address the computational burden and performance degradation of existing context-augmented generation (CAG) techniques when handling multiple, lengthy contexts. The key methodology, Adaptive Parallel Encoding (APE), uses a shared prefix, attention temperature, and scaling factor to align the distribution of parallel encoding with sequential encoding. Results show that APE preserves 98% of sequential encoding performance on RAG tasks while enabling an end-to-end 4.5x speedup by reducing prefilling time by 28x for a 128K-length context. The principal implication for AI practitioners is that APE enables more efficient and scalable deployment of CAG systems, particularly those dealing with long and numerous contexts, by reducing computational costs and improving response times. |
Efficient-vDiT: Efficient Video Diffusion Transformers With Attention Tile (Read more on arXiv or HuggingFace) | Peiyuan Zhang, Runlong Su, Dacheng Li, zhijie3, foreverpiano | Efficient-vDiT accelerates video diffusion transformers by sparsifying 3D attention and reducing sampling steps. The main research objective is to address the computational inefficiency of 3D full attention diffusion transformers (DiTs) during video generation. The key methodology involves identifying and leveraging a "tile-style" repetitive pattern in 3D attention maps to create sparse attention masks, combined with multi-step consistency distillation. The primary result is that Efficient-vDiT achieves up to a 7.8x speedup on Open-Sora-Plan-1.2 models for 29- and 93-frame video generation with minimal performance degradation on VBench. For AI practitioners, this method provides a way to significantly speed up video generation with 3D DiTs, enabling faster inference and potentially reducing computational costs. |
MetaChain: A Fully-Automated and Zero-Code Framework for LLM Agents (Read more on arXiv or HuggingFace) | Chao Huang, Tianyu Fan, Jiabin Tang | MetaChain is a framework enabling fully-automated, zero-code development and deployment of LLM agents through natural language alone. The main research question is: Can we enable everyone, regardless of technical background, to build their own LLM agents using natural language alone? The key methodology involves a novel LLM Agent Framework with four components: Agentic System Utilities, LLM-powered Actionable Engine, Self-Managing File System, and Self-Play Agent Customization module, enabling automated agent generation, customization, and workflow optimization. Primary results include ranking #1 among open-source solutions on the GAIA benchmark and achieving 73.51% accuracy on a MultiHop-RAG task. The principal implication for AI practitioners is that MetaChain democratizes agent development, allowing non-programmers to create and customize LLM agents and workflows, potentially accelerating the adoption of agent technology. |
Steel-LLM:From Scratch to Open Source -- A Personal Journey in Building a Chinese-Centric LLM (Read more on arXiv or HuggingFace) | Zhaoxiang Zhang, Shu Li, Qingshui Gu, aaabiao | Steel-LLM is a fully open-source, 1-billion-parameter, Chinese-centric language model developed with limited computational resources. The main objective was to create a high-quality, transparent, and resource-efficient language model, primarily trained on Chinese data, with a small proportion of English. The methodology involved adapting a Qwen-based Transformer architecture with Soft Mixture of Experts and an enhanced Feed-Forward Network, trained using a modified TinyLlama framework on 8 A100/H800 GPUs. The model achieved a CEVAL accuracy of 41.90% and a CMMLU accuracy of 36.08% after supervised finetuning. AI practitioners can use the provided training pipeline, datasets, model architecture, and intermediate checkpoints to develop or extend similar language models with limited resources, facilitating reproducibility and further research. |
The Curse of Depth in Large Language Models (Read more on arXiv or HuggingFace) | Yefeng Zheng, Lu Yin, Xinyuan Song, Wenfang Sun, pengxiang | The paper introduces "Curse of Depth" in large language models (LLMs), where deeper layers contribute less than expected due to Pre-Layer Normalization (Pre-LN), and proposes LayerNorm Scaling to address it. The main research objective is to identify and rectify the phenomenon where deeper layers in LLMs are less effective, specifically investigating the role of Pre-LN in this issue. The key methodology involves theoretical analysis of Pre-LN's impact on variance and gradient flow, alongside empirical evaluations via layer pruning experiments and comparisons of different normalization techniques. A primary result is that LayerNorm Scaling reduces perplexity by 1.31 on LLaMA-1B compared to standard Pre-LN. The principal implication for AI practitioners is that applying LayerNorm Scaling, which inversely scales the output of Pre-LN by the square root of the layer depth, can improve LLM performance by enhancing the contribution of deeper layers during training, creating more resource-efficient models. |
DreamDPO: Aligning Text-to-3D Generation with Human Preferences via Direct Preference Optimization (Read more on arXiv or HuggingFace) | Yi Yang, Hehe Fan, Fan Ma, Xiaobo Xia, Zhenglin Zhou | DreamDPO is an optimization-based framework for text-to-3D generation that aligns 3D content with human preferences through direct preference optimization. The main research objective is to improve the alignment of text-to-3D generated content with human preferences and enhance controllability. The methodology involves constructing pairwise examples, comparing their alignment with human preferences using reward or large multimodal models, and optimizing the 3D representation with a preference-driven loss function. DreamDPO achieved a GPTEval3D overall score of 1203.1, outperforming 13 state-of-the-art methods, including MVDream (1097.7). AI practitioners can utilize DreamDPO to generate higher-quality and more controllable 3D content, moving beyond pointwise quality evaluations by utilizing pairwise comparisons and preference optimization. |
Dual Caption Preference Optimization for Diffusion Models (Read more on arXiv or HuggingFace) | Bimsara Pathiraja, Shamanthak Hegde, Agneet Chatterjee, Yiran Luo, sahsaeedi | Dual Caption Preference Optimization (DCPO) improves text-to-image diffusion models by using distinct captions for preferred and less preferred images during training. The main research objective is to address the issues of conflict distribution and irrelevant prompts in existing preference optimization methods for diffusion models. The key methodology involves generating distinct captions for preferred and less-preferred images using captioning, perturbation, or hybrid methods, and introducing a modified objective function that leverages these dual captions. Primary results show that DCPO-h outperforms Stable Diffusion 2.1, SFT, Diffusion-DPO, and MaPO, achieving a +0.21 improvement in Pickscore. The principal implication for AI practitioners is that using dual, distinct captions for preferred and less-preferred image pairs during preference optimization can significantly enhance the alignment and performance of diffusion models. |
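
The Matryoshka Quantization row above describes int8, int4, and int2 models that share their most significant bits. The nesting can be illustrated with a minimal sketch, assuming unsigned 8-bit weight codes and a plain right shift to slice them. The function name `slice_msb` and the sample codes are hypothetical, and the paper's actual slicing and joint training procedure may differ (for example, in rounding and clamping).

```python
def slice_msb(code: int, bits: int, total_bits: int = 8) -> int:
    """Keep the `bits` most significant bits of an unsigned `total_bits`-bit code."""
    if not 0 <= code < (1 << total_bits):
        raise ValueError("code does not fit in total_bits")
    if not 1 <= bits <= total_bits:
        raise ValueError("bits must be between 1 and total_bits")
    return code >> (total_bits - bits)


# Hypothetical unsigned int8 weight codes (illustrative values only).
int8_codes = [0, 37, 128, 201, 255]

# Lower-precision codes are read directly from the top bits of the int8 codes.
int4_codes = [slice_msb(c, 4) for c in int8_codes]
int2_codes = [slice_msb(c, 2) for c in int8_codes]

# Nested structure: slicing the int4 code again yields the same int2 code.
nested_int2 = [slice_msb(c, 2, total_bits=4) for c in int4_codes]

print(int4_codes)   # [0, 2, 8, 12, 15]
print(int2_codes)   # [0, 0, 2, 3, 3]
print(nested_int2)  # [0, 0, 2, 3, 3]
```

Because the int2 code is a prefix of the int4 code, which is a prefix of the int8 code, a single stored tensor can in principle be served at any of the three precisions.
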
Title | Authors | Summary |
---|---|---|
VideoRoPE: What Makes for Good Video Rotary Position Embedding? (Read more on arXiv or HuggingFace) | Pan Zhang, Xiaoyi Dong, Xilin Wei, yuhangzang, LiuXR | VideoRoPE introduces a novel rotary position embedding method for video data that outperforms existing methods by preserving spatio-temporal relationships. The main research objective is to identify and address the limitations of existing Rotary Position Embedding (RoPE) methods when applied to video data with complex spatio-temporal structures. The key methodology involves analyzing four essential characteristics (2D/3D structure, frequency allocation, spatial symmetry, temporal index scaling) for effective RoPE adaptation to video and proposing VideoRoPE, which features a 3D structure, low-frequency temporal allocation, diagonal layout, and adjustable temporal spacing. Primary results show that VideoRoPE outperforms previous RoPE variants on various benchmarks, achieving a 12.44% performance improvement over M-RoPE on the Video Retrieval task in both V-NIAH and V-NIAH-D settings. The principal implication for AI practitioners is that VideoRoPE provides a more robust and effective positional encoding scheme for video-based models, enhancing performance in tasks such as video retrieval, understanding, and hallucination reduction. |
Fast Video Generation with Sliding Tile Attention (Read more on arXiv or HuggingFace) | Ion Stoica, Hangliang Ding, Runlong Su, Peiyuan Zhang, BrianChen1129 | Sliding Tile Attention (STA) accelerates video diffusion models by efficiently computing attention within local spatiotemporal windows. The paper introduces STA to address the high computational cost of 3D full attention in video diffusion transformers (DiTs). STA operates tile-by-tile, utilizing a hardware-aware sliding window design and kernel-level optimizations. STA reduces end-to-end latency of a video DiT (HunyuanVideo) from 945s to 685s without quality degradation, and to 268s with finetuning (0.09% drop on VBench). AI practitioners can deploy STA to significantly reduce inference time for video generation DiTs while maintaining output quality, or trade minimal quality loss for substantial speed gains. |
AuraFusion360: Augmented Unseen Region Alignment for Reference-based 360° Unbounded Scene Inpainting (Read more on arXiv or HuggingFace) | Jie-Ying Lee, Ying-Huan Chen, Yang-Jung Chen, Chung-Ho Wu, cmhungsteve | AuraFusion360 is a reference-based method for 360° unbounded scene inpainting that removes objects and fills holes in 3D scenes represented by Gaussian Splatting. The main research objective is to achieve high-quality object removal and hole filling in 360° unbounded scenes, maintaining view consistency and geometric accuracy. The methodology introduces depth-aware unseen mask generation, Adaptive Guided Depth Diffusion for initial point placement, and SDEdit-based detail enhancement for multi-view coherence. The method achieves an average PSNR of 17.661 and LPIPS of 0.388 on the 360-USID dataset, outperforming existing methods. AI practitioners can use this method and the provided 360-USID dataset for improved 3D scene inpainting, particularly in applications requiring consistent and accurate object removal in 360° environments. |
Goku: Flow Based Video Generative Foundation Models (Read more on arXiv or HuggingFace) | Fengda Zhu, Yida Zhang, Yuqi Zhang, Chongjian Ge, ShoufaChen | Goku is a family of rectified flow Transformer models for joint image-and-video generation that achieves industry-leading performance. The main research objective is to develop a state-of-the-art joint image-and-video generation model with industry-leading performance using rectified flow Transformers. The key methodology involves a data curation pipeline, a 3D joint image-video variational autoencoder (VAE), a Transformer architecture with full attention, rectified flow formulation, and infrastructure optimization for large-scale training. Goku achieves 0.76 on GenEval and 83.65 on DPG-Bench for text-to-image generation, and 84.85 on VBench for text-to-video tasks. This work demonstrates a pathway toward industry-grade performance in visual generation, enabling practitioners to build more efficient and high-performing generative models using Rectified Flows. |
QuEST: Stable Training of LLMs with 1-Bit Weights and Activations (Read more on arXiv or HuggingFace) | Jiale Chen, d-alistarh, mnikdan97, soroushtabesh, BlackSamorez | QuEST introduces a quantization-aware training method for large language models (LLMs) enabling stable training with extremely low-precision weights and activations. The main research objective is to determine the Pareto-optimal frontier for training LLMs with low-bitwidth weights and activations, minimizing representation size while maintaining accuracy. The key methodology, QuEST, combines Hadamard normalization and MSE-optimal fitting for quantization, with a "trust" gradient estimator minimizing the difference between quantized and full-precision gradients. Primary results show stable training of Llama-family models down to 1-bit weights and activations, with 4-bit QuEST models achieving superior accuracy compared to BF16 models almost 4x larger in size. The principal implication for AI practitioners is that QuEST enables training and deploying accurate LLMs at significantly reduced precision and model size, potentially leading to more efficient inference. |
Agency Is Frame-Dependent (Read more on arXiv or HuggingFace) | Shi Dong, Will Dabney, Michael Bowling, André Barreto, David Abel | i) The paper argues that agency, a system's capacity to steer outcomes toward a goal, is fundamentally frame-dependent. ii) The main objective is to demonstrate that the attribution of agency to a system is relative to the choice of a reference frame. iii) The methodology involves a philosophical argument, illustrating that the essential properties of agency (individuality, source of action, normativity, adaptivity) are frame-dependent. iv) The paper does not present specific quantitative findings. v) Any basic science of agency requires frame-dependence, impacting how AI practitioners should approach reinforcement learning. |
FlashVideo:Flowing Fidelity to Detail for Efficient High-Resolution Video Generation (Read more on arXiv or HuggingFace) | Peize Sun, Chongjian Ge, Wenbo Li, Shilong Zhang, ShoufaChen | FlashVideo introduces a two-stage framework for efficient high-resolution text-to-video generation. The research aims to decouple prompt fidelity and visual quality optimization in video generation. It utilizes a two-stage DiT architecture with a large model for low-resolution generation followed by flow matching with a smaller model for high-resolution detail enhancement. FlashVideo achieves top-tier performance on VBench-Long (a score of 82.99) with significantly reduced function evaluation time (102.3s for 1080p video generation). The two-stage design allows AI practitioners to preview initial output before committing to full-resolution generation, reducing computational costs and wait times. |
Linear Correlation in LM's Compositional Generalization and Hallucination (Read more on arXiv or HuggingFace) | Chengyu Dong, Shibo Hao, Chenyang An, Letian Peng, shangjingbo | i) This paper unveils linear correlations in language models (LMs) during knowledge composition. ii) The research investigates the extent to which linear transformations can approximate the relationships between the output logits of related next token prediction (NTP) tasks. iii) The methodology involves fitting a linear transformation between logits of source and target knowledge prompts using a subset of data, then evaluating the transformation on the remaining data using Pearson correlation. iv) Results indicate that the fitted linear transformation is resilient to fine-tuning, with successful generalization for simultaneous knowledge updates requiring high correlation intensity and transformation precision; in City-Country relationships, 42% of cities learn the top-1 weight with their influenced countries. v) The implication for AI practitioners is the understanding that compositional generalization in LMs relies on linear correlations between vocabulary representations, which can be leveraged for knowledge composition tasks but also may lead to hallucinations when misaligned. |
Generating Symbolic World Models via Test-time Scaling of Large Language Models (Read more on arXiv or HuggingFace) | Fuxiang Frank Xia, Tim Z. Xiao, Yuhuan Yuan, Zhouliang Yu, zhangysk | i) This paper introduces a test-time scaling approach for generating Planning Domain Definition Language (PDDL) domains using Large Language Models (LLMs). ii) The main objective is to enhance PDDL reasoning in LLMs for generating high-quality PDDL domains without additional training data. iii) The methodology employs a Best-of-N sampling approach followed by iterative refinement using Instance Verbalized Machine Learning (iVML). iv) The method achieves an 85.2% success rate on the NL2Domain task and 71.4% on Prob2Domain with Qwen2.5-Coder-7B, exceeding o1-mini's performance. v) AI practitioners can leverage this approach to generate symbolic world models for robust planning, particularly in complex domains where existing LLM-based planners struggle. |
On-device Sora: Enabling Diffusion-Based Text-to-Video Generation for Mobile Devices (Read more on arXiv or HuggingFace) | Yeojin Lee, Jungmin Cheon, Isu Jeong, Kyuhwan Lee, Bosung Kim | On-device Sora is a framework for diffusion-based text-to-video generation that operates efficiently on smartphone-grade devices. The main research objective is to enable efficient and high-quality text-to-video generation on resource-constrained mobile devices, addressing limitations of current diffusion-based video generation models. Key methodologies include Linear Proportional Leap (LPL) to reduce denoising steps, Temporal Dimension Token Merging (TDTM) to minimize token-processing computation, and Concurrent Inference with Dynamic Loading (CI-DL) for efficient model inference. Results demonstrate that On-device Sora generates videos on an iPhone 15 Pro with quality comparable to Open-Sora running on NVIDIA A6000 GPUs, achieving up to 1.94x speedup with LPL. AI practitioners can leverage On-device Sora's techniques to deploy and accelerate diffusion-based video generation models on mobile and embedded devices, expanding accessibility and enabling on-device applications. |
CMoE: Fast Carving of Mixture-of-Experts for Efficient LLM Inference (Read more on arXiv or HuggingFace) | Wulong Liu, Xianzhi Yu, Hui-Ling Zhen, Lancheng Zou, Eleven-P | CMoE is a framework that efficiently creates sparse Mixture-of-Experts models from dense large language models (LLMs) for improved inference efficiency. The main objective is to transform dense LLMs into sparse MoE architectures without extensive retraining. The methodology involves grouping feed-forward network (FFN) neurons into shared and routed experts based on activation rates, constructing a training-free routing mechanism using representative neurons, and optional lightweight adaptation. Results show that, with a 25% activation ratio, CMoE achieved 76.59% of the dense model's accuracy on some downstream benchmarks with lightweight fine-tuning on 2,048 samples. For AI practitioners, CMoE offers a method to deploy LLMs more efficiently in resource-constrained environments by significantly reducing computational overhead while maintaining performance. |
Step Back to Leap Forward: Self-Backtracking for Boosting Reasoning of Language Models (Read more on arXiv or HuggingFace) | Jie-Jing Shao, Ding-Chu Zhang, Wen-Da Wei, Xuan-Yi Zhu, yangxw | This paper introduces Self-Backtracking, a technique that enables language models to autonomously backtrack during reasoning. The main research objective is to address the limitations of current slow-thinking mechanisms in large language models, specifically inefficient overthinking and over-reliance on auxiliary reward models. The key methodology involves training the model to recognize suboptimal reasoning paths and backtrack to earlier states, using a specialized dataset format and a modified loss function during training, and an inference algorithm combining expansion, backtracking, and selection steps during inference. The primary result shows that Self-Backtracking improves reasoning accuracy on the Countdown task by over 40% compared to optimal-path supervised fine-tuning, using the Llama3.2-1B model. The principal implication for AI practitioners is that integrating self-backtracking into language models can significantly enhance reasoning capabilities and efficiency, and reduce the need for external reward models. |
Scaling Laws in Patchification: An Image Is Worth 50,176 Tokens And More (Read more on arXiv or HuggingFace) | Yuyin Zhou, Wei Shao, Guoyizhe Wei, Yaodong Yu, Feng Wang | This paper investigates the impact of patchification, an image tokenization method, on the performance of vision models. The main research objective is to examine the information loss caused by the patchification-based compressive encoding paradigm in vision models and how it affects visual understanding. The key methodology involves extensive scaling experiments by varying patch sizes in ViT and Mamba-based architectures across different vision tasks and input scales. The primary result is that model performance consistently improves as patch size decreases, achieving a test accuracy of 84.6% on ImageNet-1k with a base-sized model using a 1x1 patch size (50,176 tokens). The principal implication is that AI practitioners should consider reducing or eliminating spatial compression in vision encoders to improve model accuracy, as computational resources allow. |
QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation (Read more on arXiv or HuggingFace) | Yuke Zhu, Linxi Fan, Scott Reed, Fuzhao Xue, zhaoyue-zephyrus | QLIP is a visual tokenization method that combines state-of-the-art reconstruction quality with state-of-the-art zero-shot image understanding. The main research objective is to develop a visual tokenizer that excels at both capturing image semantics and reconstructing high-quality visuals for multimodal language modeling. The key methodology involves training a Binary Spherical Quantization (BSQ)-based autoencoder with a contrastive objective for text-image alignment, using a two-stage training process to balance reconstruction and alignment. A primary result is that QLIP-B achieves a zero-shot classification accuracy of 74.3% on ImageNet, while achieving a reconstruction FID of 3.21, comparable to state-of-the-art methods. AI practitioners can use QLIP as a drop-in replacement for visual encoders in existing models like LLaVA or image tokenizers in models like LlamaGen, achieving improved or comparable performance in multimodal understanding and generation tasks. |
ARR: Question Answering with Large Language Models via Analyzing, Retrieving, and Reasoning (Read more on arXiv or HuggingFace) | Giuseppe Carenini, yuweiyin | ARR is a zero-shot prompting method that improves question-answering (QA) performance of Large Language Models (LLMs) by explicitly guiding them through analyzing, retrieving, and reasoning steps. The main research objective is to evaluate the effectiveness of the ARR prompting method compared to baseline and Chain-of-Thought (CoT) prompting in multiple-choice QA tasks. The key methodology involves comparing the accuracy of LLMs using different trigger sentences representing ARR, baseline (no specific trigger), and zero-shot CoT prompting across ten multiple-choice QA datasets. Primary results show that ARR achieves an average accuracy of 69.58% across all datasets, outperforming the baseline (65.48%) and CoT (68.14%) when using the LLaMA3-8B-Chat model. AI practitioners can leverage the ARR prompting strategy to enhance LLM performance in QA tasks without needing model fine-tuning or few-shot examples, leading to better results in various applications, including information retrieval and decision support. |
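
The 50,176-token figure in the patchification row above can be reproduced with simple arithmetic. The sketch below assumes a 224x224 input resolution, which the summary does not state but which matches 50,176 = 224 x 224. The helper name `num_patch_tokens` is hypothetical.

```python
def num_patch_tokens(height: int, width: int, patch_size: int) -> int:
    """Number of non-overlapping square patches (tokens) covering an image."""
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    return (height // patch_size) * (width // patch_size)


# Assumed 224x224 input; token count grows quadratically as the patch shrinks.
for patch in (16, 8, 4, 2, 1):
    print(patch, num_patch_tokens(224, 224, patch))
```

Under this assumption, a conventional 16x16 patch yields 196 tokens, while a 1x1 patch yields 50,176, a 256-fold increase in sequence length.
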
Title | Authors | Summary |
---|---|---|
Analyze Feature Flow to Enhance Interpretation and Steering in Language Models (Read more on arXiv or HuggingFace) | Yaroslav Aksenov, kefirski, elephantmipt, dlaptev | This paper introduces a data-free method to track the evolution of features learned by sparse autoencoders across layers of large language models, enabling improved interpretability and steering of model behavior. The main research question is how to systematically map and understand the progression of features discovered by sparse autoencoders across consecutive layers of large language models. The key methodology involves using cosine similarity between decoder weights of SAEs trained on different modules (MLP, attention, residual) and layers to trace feature persistence, transformation, or emergence. The primary results show that deactivating a single predecessor feature causes a greater drop in activation strength when that predecessor belongs to a group with a single predecessor: for example, in layer 8, this probability is approximately 0.75, 0.55, and 0.6 for "From RES", "From MLP", and "From ATT", respectively. The principal implication for AI practitioners is that this method provides a means for more precise control over model behavior by identifying and manipulating multi-layer feature circuits, offering improvements over single-layer steering approaches. |
UltraIF: Advancing Instruction Following from the Wild (Read more on arXiv or HuggingFace) | Ning Ding, Li Sheng, ssz1111, ganqu, kkk-an | UltraIF is a scalable approach for building LLMs that can follow complex instructions with open-source data by training a composer model to synthesize instructions and evaluation questions. The main research question is how to effectively align open-source LLMs with complex instructions using a scalable approach and open-source data. The key methodology involves decomposing real-world user prompts into simplified queries, constraints, and evaluation questions; training an "UltraComposer" model to compose constraint-associated prompts with evaluation questions; and using the composer to synthesize complex instructions and filter responses based on the evaluation questions. Primary results show that UltraIF aligns LLaMA-3.1-8B-Base to match the instruct version on 5 instruction-following benchmarks without benchmark-specific data, achieving a score of 69.63 (DRFR) on InfoBench and outperforming comparable baselines. The principal implication for AI practitioners is that UltraIF offers an effective and scalable method to improve the instruction-following capabilities of LLMs using open-source data, potentially reducing reliance on expensive, proprietary datasets and simplifying the training and evaluation processes. |
DynVFX: Augmenting Real Videos with Dynamic Content (Read more on arXiv or HuggingFace) | talidekel, omerbartal, RafailFridman, DanahY | DynVFX augments real-world videos with new dynamic content described by user-provided text instructions. The main research objective is to develop a method for seamlessly integrating synthesized dynamic objects or complex scene effects into existing real-world videos, accounting for camera motion, occlusions, and interactions. The key methodology is a zero-shot, training-free framework leveraging a pre-trained text-to-video diffusion transformer and a Vision Language Model (VLM) for content synthesis and scene understanding, using a novel inference-based method with "Anchor Extended Attention" to manipulate attention features for localization and integration. The primary results show that the proposed method outperforms baselines like SDEdit and LoRA fine-tuning, achieving a masked Structural Similarity Index (SSIM) of 0.860 and a CLIP Directional score of 0.311, indicating better original content preservation and edit fidelity. For AI practitioners, this method provides a framework that facilitates generating and harmonizing dynamic video effects without the need for creating and tracking masks, enabling improved video editing and synthesis capabilities using pre-trained diffusion models. |
Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment (Read more on arXiv or HuggingFace) | jiwenlu, WinstonHu, liuziwei7, THUdyh, Zuyan | Ola is an omni-modal language model achieving competitive performance across image, video, and audio understanding using a progressive modality alignment strategy. The main research objective is to develop an omni-modal language model that achieves competitive performance across image, video, and audio understanding compared to specialized single-modality models, while maintaining efficiency. The key methodology is a progressive modality alignment strategy that trains the model sequentially on image-text, then video, and finally audio data, along with a dual-encoder approach for audio input and sentence-wise streaming decoding for speech generation. The model achieves a mean accuracy of 72.6% on the OpenCompass benchmark and 68.4% on the VideoMME benchmark, outperforming existing open-source omni-modal LLMs and many specialized models. The principal implication is that AI practitioners can build more efficient and cost-effective omni-modal models by leveraging progressive modality training, starting with the most distinct modalities, which reduces the cross-modal alignment data demand. |
MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm (Read more on arXiv or HuggingFace) | De Wen Soh, Na Zhao, zeyuhu, ZiyanGuo | MotionLab is a unified framework for human motion generation and editing that leverages a novel Motion-Condition-Motion paradigm and rectified flows. The main research objective is to determine if human motion generation and editing can be effectively unified within a single framework. The key methodology involves a MotionFlow Transformer with Aligned Rotational Position Encoding, Task Specified Instruction Modulation, and Motion Curriculum Learning for multi-task training. The framework achieved a text-based editing R@1 score of 56.34 on the MotionFix dataset, demonstrating editing capabilities. For AI practitioners, MotionLab provides a versatile framework capable of handling both human motion generation and editing tasks, promoting knowledge sharing and efficiency. |
Great Models Think Alike and this Undermines AI Oversight (Read more on arXiv or HuggingFace) | AmeyaPrabhu, douwekiela, iaa01, Klingspor, shash42 | This paper studies how model similarity affects AI oversight, finding that greater similarity biases evaluations and reduces gains from training on Language Model (LM) annotations, with model errors becoming more correlated as capabilities increase. The main research question is how model similarity impacts the effectiveness of AI oversight, both in evaluation (LLM-as-a-judge) and training (using LM annotations). The key methodology involves proposing Chance Adjusted Probabilistic Agreement (CAPA), a new metric for LM similarity based on the overlap in model mistakes, and using it to analyze LLM-as-a-judge and training on LM annotation scenarios. Primary results show LLM-as-a-judge scores are significantly correlated with model similarity (average Pearson r=0.84), and gains from weak-to-strong generalization are higher when the supervisor and student models are more dissimilar. For AI practitioners, increasing model similarity poses a risk due to correlated failures, indicating a need for measuring and reporting model similarity and developing methods for training diverse models. |
Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2 (Read more on arXiv or HuggingFace) | Miroslav Olšák, Trieu H. Trinh, Yuri Chervonyi, lmthang, mmenegali | AlphaGeometry2 achieves gold-medal-level performance in solving Olympiad geometry problems. The main research objective is to improve upon the previous AlphaGeometry system to solve a broader range of, and more difficult, Olympiad geometry problems. Key methodologies include expanding the domain-specific language, optimizing the symbolic deduction engine (DDAR) with a C++ implementation, developing a novel search algorithm (SKEST) that utilizes multiple search trees with knowledge sharing, and employing a larger, Gemini-based language model trained on more diverse synthetic data. AlphaGeometry2 achieves an 84% solve rate on 2000-2024 IMO geometry problems (42 out of 50), compared to 54% for the original AlphaGeometry. AI practitioners can leverage the demonstrated techniques, such as enhanced neuro-symbolic reasoning, knowledge sharing between search agents and improved synthetic data generation, to build more powerful AI systems for complex mathematical reasoning tasks. |
ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization (Read more on arXiv or HuggingFace) | Bryon Aragam, Ling Yang, Edify-Kd2024, lightaime, yinjiewang | ScoreFlow is a framework for optimizing multi-agent workflows of large language models (LLMs) using a novel score-based preference optimization method. The main research objective is to develop an automated, adaptive, and cost-efficient framework for generating and optimizing LLM agent workflows, addressing limitations of existing methods like inflexibility and poor scalability. The key methodology involves representing workflows as code, generating multiple workflows per task, evaluating them with quantitative scores, and optimizing the workflow generator using Score-DPO, a variant of direct preference optimization that incorporates evaluation scores. Across six benchmarks, ScoreFlow achieved an 8.2% average improvement over existing baselines. AI practitioners can utilize ScoreFlow to automate and enhance the creation of high-performance, scalable, and adaptable LLM agent workflows, resulting in improved model performance and lower inference costs. |
MAGA: MAssive Genre-Audience Reformulation to Pretraining Corpus Expansion (Read more on arXiv or HuggingFace) | Chenggang Li, Ke Shen, haoxintong | The paper introduces MAGA, a method for expanding pretraining corpora by reformulating existing text into diverse genres and audience styles using large language models. The main research question is how effective MAGA-generated synthetic data is for expanding pretraining corpus and aiding model scaling under data-constrained scenarios. The key methodology involves a two-stage synthesis process using a 3.3B MoE model to generate multiple genre-audience reformulations of documents, followed by heuristic cleaning. Primary results show that models trained with MAGA-expanded data (MAGA-Mix) achieved consistent improvements across model sizes (134M-1.7B parameters), with a +2.15 average performance gain on the 1.7B model, and substantial gains in TriviaQA (+15.47) and GSM8K (+6.06). For AI practitioners, MAGA offers a scalable method to expand training datasets and improve model performance, particularly when high-quality natural language data is scarce, providing an avenue for model scaling. |
Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis (Read more on arXiv or HuggingFace) | Xinsheng Wang, Chi-Min Chan, Xinfa Zhu, HKUST-Audio, ZhenYe234 | Llasa explores scaling train-time and inference-time compute for Llama-based text-to-speech (TTS) synthesis, demonstrating improvements in naturalness, prosody, and expressiveness. The main research objective is to investigate the effects of scaling both training and inference computation on the performance of a simplified, Llama-based TTS system. The key methodology involves using a single Transformer architecture with a vector quantizer (VQ) codec (X-codec2) and evaluating performance under varying model sizes, training data sizes, and inference-time search strategies (e.g., beam search, best-of-N). Primary results show that increasing training data from 80k to 250k hours improves the mean expert score on Chinese polyphonic characters from below 2.00 to around 2.25; scaling inference compute using a mixed strategy of PRM and ORM achieved higher SIM and kept WER near ground truth, on seed-tts-eval test-hard testset. For AI practitioners, this implies that both train-time and inference-time compute scaling are viable strategies for improving TTS quality, and that inference-time scaling can be a useful approach for balancing competing objectives like speaker similarity and content accuracy. |
MotionCanvas: Cinematic Shot Design with Controllable Image-to-Video Generation (Read more on arXiv or HuggingFace) | ttwong, aniruddha26398, heiwang1997, cusuh, Doubiiu | MotionCanvas is an image-to-video generation system that enables cinematic shot design with controllable camera and object motions. The main research objective is to develop a method that allows users to intuitively design cinematic video shots from a static image, controlling both camera and object movements in a scene-aware manner. The key methodology involves a Motion Signal Translation module that converts user-specified 3D motion intentions (camera paths, object bounding boxes, point trajectories) into 2D screen-space motion signals (point trajectories, bbox sequences) to condition a video diffusion model. The method achieved a Camera Motion Consistency (CamMC) score of 0.9453 on the RealEstate10K test set. AI practitioners can use MotionCanvas to enhance creative workflows in digital content creation with precise control over camera and object movements in image-to-video generation, avoiding costly 3D-related training data. |
ChartCitor: Multi-Agent Framework for Fine-Grained Chart Visual Attribution (Read more on arXiv or HuggingFace) | Kanika Goswami, Franck-Dernoncourt, ryanrossi, puneetm | ChartCitor is a multi-agent LLM framework that provides fine-grained bounding box citations for answers generated from chart images. The main research objective is to identify chart elements (e.g., bars, lines) that support factual claims in LLM-generated responses to user questions about charts. The methodology involves orchestrating multiple LLM agents to perform chart-to-table extraction, answer reformulation, table augmentation, evidence retrieval via pre-filtering and re-ranking, and table-to-chart mapping. The primary result shows that ChartCitor achieves an Intersection over Union (IoU) of 27.4, outperforming existing baselines such as direct bounding box decoding and other LLM-based models, by 9-15%. The principal implication is that AI practitioners can enhance the trustworthiness and explainability of chart question-answering systems by using this framework to provide visual evidence for LLM-generated answers, directly linking claims to specific chart components. |
BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation (Read more on arXiv or HuggingFace) | cxiong, yingbozhou, jcxu, hendrydong, bpucla | BOLT is a method to develop long chain-of-thought (LongCoT) reasoning in large language models (LLMs) without knowledge distillation or human annotations. The main research question is whether LLMs can develop LongCoT capabilities from standard instruct models without relying on existing LongCoT models or expensive human annotations. The key methodology is a three-stage process: 1) LongCoT data bootstrapping with in-context learning; 2) LongCoT supervised finetuning; and 3) online training using DPO to refine LongCoT capacities. The method applied to Llama-3.1-70B-Instruct achieved impressive performance, evaluated through MT-Bench and Arena-Hard, showcasing improved reasoning ability. The principal implication is that AI practitioners can develop strong LongCoT reasoning capabilities from existing ShortCoT models and reduce the cost to train the models, thereby making advanced reasoning more accessible without reliance on proprietary models. |
Beyond Prompt Content: Enhancing LLM Performance via Content-Format Integrated Prompt Optimization (Read more on arXiv or HuggingFace) | Xuan Feng, Qi Chen, Yuanye Liu, lynazhang, Jiahang | The paper introduces Content-Format Integrated Prompt Optimization (CFPO), a method to improve Large Language Model (LLM) performance by jointly optimizing prompt content and format. The main research question is whether integrating prompt content and format optimization can enhance LLM performance compared to content-only optimization methods. The key methodology involves iterative refinement using component-wise content optimization (case-diagnosis, Monte-Carlo sampling) and dynamic format exploration (LLM-assisted format generation, UCT-based selection). Primary results show that CFPO achieves an 8.6% absolute improvement in GSM8K accuracy using the LLaMA-3.1-8B model compared to the baseline prompt (50.03 to 63.38). For AI/ML engineers and data scientists, CFPO highlights that jointly optimizing both prompt content and format presents a practical approach to significantly boosting LLM performance and can be done using only open-source LLMs. |
PlotGen: Multi-Agent LLM-based Scientific Data Visualization via Multimodal Feedback (Read more on arXiv or HuggingFace) | Ryan Rossi, Puneet Mathur, Kanika Goswami, Franck-Dernoncourt | PlotGen is a multi-agent framework that automates scientific data visualization generation using multimodal feedback for iterative refinement. The main research objective is to automate the creation of precise scientific visualizations from user specifications and raw data, addressing the limitations of current Large Language Models (LLMs) in this area. The key methodology involves orchestrating multiple LLM-based agents, including a Query Planning Agent, a Code Generation Agent, and three feedback agents (Numeric, Lexical, and Visual) that leverage multimodal LLMs for self-reflection. Primary results show that PlotGen outperforms strong baselines, achieving a 4-6% improvement on the MatPlotBench dataset. For AI practitioners, PlotGen provides a framework to improve accuracy and reduce debugging of LLM-generated visualizations. |
Enhancing Code Generation for Low-Resource Languages: No Silver Bullet (Read more on arXiv or HuggingFace) | gbavota, AML14, Devy1 | This paper investigates methods to improve code generation by Large Language Models (LLMs) for low-resource programming languages, finding no single superior technique across all contexts. The primary research question is: Which techniques are best suited to improve LLM-based code generation capabilities in low-resource programming languages? The study empirically evaluated in-context learning (translation examples, translation rules, few-shot) and fine-tuning (with/without pre-training on code translation) on six LLMs, using the MultiPL-E benchmark for R and Racket. Results show fine-tuning benefits smaller models (e.g., DeepSeek Coder 1B), increasing Racket pass@1 from 7.0% to 18.4% with pre-training & fine-tuning, while in-context learning, specifically with translation examples, generally improves performance for larger models and GitHub Copilot, with deltas over baseline reaching +6.3% in some test cases. AI practitioners should consider model size when boosting performance on low-resource languages, with in-context learning representing a generally effective and low-cost strategy, especially for larger LLMs. |
Weak-to-Strong Diffusion with Reflection (Read more on arXiv or HuggingFace) | Zeke Xie, Masashi Sugiyama, Lichen Bai | The paper introduces Weak-to-Strong Diffusion (W2SD), a framework that enhances diffusion model inference by leveraging the difference between weak and strong models. The main research objective is to reduce the gap between the learned distribution of diffusion models and the real data distribution. The key methodology involves using a reflective operation that alternates between denoising and inversion, guided by the estimated difference between existing weak and strong models (weak-to-strong difference). Experiments demonstrate W2SD significantly improves human preference, with Juggernaut-XL and W2SD improving the HPSv2 winning rate up to 90% over the original results. AI practitioners can use W2SD as a general-purpose framework to improve the performance of diffusion models by defining appropriate weak-to-strong model pairs, leading to better alignment with real data distributions. |
Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions (Read more on arXiv or HuggingFace) | Marzyeh Ghassemi, Yik Siu Chan, YuxinXiao, narutatsuri | SPEAK EASY demonstrates that large language models (LLMs) can be jailbroken through simple, everyday human-LLM interactions to produce harmful content. The main research objective is to investigate whether harmful jailbroken responses, both actionable and informative, can be elicited from LLMs through common interaction patterns. The key methodology involves proposing HARMSCORE, a metric for evaluating jailbreak harmfulness, and SPEAK EASY, a framework using multi-step reasoning and multilingual querying to simulate realistic user interactions. Results show that incorporating SPEAK EASY into direct request and jailbreak baselines increased the Attack Success Rate (ASR) of GPT-4o by an average of 0.463 and HARMSCORE by 0.579 across four safety benchmarks. For AI practitioners, this implies that current safety alignment techniques in LLMs are vulnerable to simple, realistic interaction patterns, making careful consideration of such patterns in both red-teaming and defense necessary. |
PILAF: Optimal Human Preference Sampling for Reward Modeling (Read more on arXiv or HuggingFace) | duanyq, Knykny, Kunhao, RedTachyon, Coolfyz | The paper introduces Policy-Interpolated Learning for Aligned Feedback (PILAF), a novel response sampling strategy for Reinforcement Learning from Human Feedback (RLHF) that aligns preference learning with maximizing underlying oracle reward. The main research question is how to design an optimal sampling scheme for generating response pairs in RLHF to improve sample efficiency and model performance. The key methodology is T-PILAF, a theoretically grounded sampling method generating responses by interpolating the policy and reference models, and its practical variant PILAF which implements this. Primary results show PILAF outperforms baselines in iterative and online Direct Preference Optimization (DPO) settings, achieving a final reward of -9.80 vs -10.16 for Vanilla sampling in the iterative setting, with a 40% reduction in training time. The principal implication is that AI practitioners can use PILAF to improve the efficiency and performance of RLHF by optimizing the data sampling process, resulting in higher rewards and lower divergence from the reference model. |
Towards Physical Understanding in Video Generation: A 3D Point Regularization Approach (Read more on arXiv or HuggingFace) | ZanyRumata, vidit98, anilkagak2, jlcao2, yunuoch | The paper introduces a video generation framework that incorporates 3D geometry and dynamics by augmenting 2D videos with 3D point trajectories and using them to regularize the video diffusion process. The main research objective is to improve the physical plausibility and temporal consistency of generated videos, especially in contact-rich scenarios. The key methodology involves creating a 3D-aware video dataset (PointVid) by tracking 3D points in videos, fine-tuning a latent diffusion model on this dataset, and regularizing the generation process using 3D point information. Primary results show that, compared to I2VGen-XL, their method has background consistency score improvement of +0.061 on the VBench benchmark, along with other improvements such as better object permanence and more accurate hand-object interactions. For AI practitioners, this means adding a 3D spatial component to the video generation process creates better video quality. |
Learning Real-World Action-Video Dynamics with Heterogeneous Masked Autoregression (Read more on arXiv or HuggingFace) | Kevin Zhao, endernewton, chaoqi-liu, liruiw | The paper introduces Heterogeneous Masked Autoregression (HMA) for modeling action-conditioned video dynamics in robotics using diverse datasets. The main research objective is to develop a general and efficient model for action-video dynamics across heterogeneous robotic embodiments, domains, and tasks. The key methodology is masked autoregression, which uses a Transformer architecture to predict masked video tokens and actions from heterogeneous datasets, with variants for discrete (VQ tokens) and continuous (soft tokens) video representations. HMA achieves better visual fidelity and controllability than previous models, with a 15x faster inference speed of 22.72 FPS on the presented hardware setup (measured in Table 1). For AI practitioners, HMA offers a framework for building interactive video simulators and generating synthetic data for robot learning, which can have real-time robotic applications. |
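
The ChartCitor row above reports its main result as Intersection over Union (IoU) between predicted and reference bounding boxes. As a reminder of what that metric measures, here is a minimal sketch of the standard IoU computation for two axis-aligned boxes. The `(x_min, y_min, x_max, y_max)` box layout and the function name `iou` are assumptions made for illustration; the summary does not say how the paper represents boxes or aggregates scores.

```python
from typing import Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Return the Intersection over Union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle, if there is one.
    ix_min = max(a[0], b[0])
    iy_min = max(a[1], b[1])
    ix_max = min(a[2], b[2])
    iy_max = min(a[3], b[3])

    # Clamp at zero so disjoint boxes have no intersection area.
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Degenerate (zero-area) inputs are scored as no overlap.
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    predicted = (0.0, 0.0, 2.0, 2.0)
    reference = (1.0, 1.0, 3.0, 3.0)
    print(f"IoU = {iou(predicted, reference):.3f}")
```
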
Title | Authors | Summary |
---|---|---|
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model (Read more on arXiv or HuggingFace) | Gabriel Martín Blázquez, Elie Bakouch, Anton Lozhkov, Loubna Ben Allal, lvwerra | SmolLM2 is a 1.7 billion parameter language model trained on 11 trillion tokens to achieve state-of-the-art performance among small language models. The main research objective was to develop a performant small language model (SmolLM2) through a data-centric approach, optimizing for resource-constrained settings. The key methodology involved multi-stage training with a curated dataset mixing web text, code, math data, and instruction-following data, including newly created datasets (FineMath, Stack-Edu, SmolTalk) and manual refinement of mixing rates. A primary result is that SmolLM2 outperforms other small LMs like Qwen2.5-1.5B and Llama3.2-1B on several benchmarks; for instance achieving a score of 68.7 on HellaSwag compared to 66.4 by Qwen. AI practitioners can leverage the released SmolLM2 model and associated datasets to deploy or further research efficient, high-performing small LMs, particularly beneficial in settings with limited computational resources. |
TwinMarket: A Scalable Behavioral and Social Simulation for Financial Markets (Read more on arXiv or HuggingFace) | Yunmiao Zhang, Kaidi Zhang, Minghao Wu, Yifei Zhang, Yuzhe Yang | TwinMarket, a multi-agent framework leveraging large language models (LLMs), simulates investor behavior and socio-economic dynamics in a stock market environment. The main research objective is to examine how individual behaviors, through interactions and feedback mechanisms in a simulated stock market, give rise to collective dynamics and emergent phenomena such as financial bubbles. The key methodology involves using LLMs within a Belief-Desire-Intention (BDI) framework to structure agent cognitive processes, coupled with a simulated social network for information exchange and social influence. Primary results show that in a 100-agent simulation, the model replicates stylized facts of financial markets, and rumor-exposed markets experienced a 2.02x increase in Sell/Buy ratio compared to the baseline, indicating amplified panic-driven selling behavior. Principal implication for AI practitioners: simulating human financial behavior by leveraging BDI framework to structure the cognitive process of agents can better predict market behavior under stress. |
Demystifying Long Chain-of-Thought Reasoning in LLMs (Read more on arXiv or HuggingFace) | Xiang Yue, Graham Neubig, Morry Niu, Yuxuan Tong, Edward Yeo | This paper investigates the mechanics of long chain-of-thought (CoT) reasoning in large language models (LLMs) and identifies key factors influencing its generation and stability. The main research question is what factors enable LLMs to generate long CoT trajectories and how their emergence can be stabilized. The key methodology involves extensive supervised fine-tuning (SFT) and reinforcement learning (RL) experiments, including ablations on reward design and data composition. A primary result is that RL can improve long CoT SFT models by over 3% absolute accuracy on the MATH-500 benchmark, whereas short CoT SFT models showed minimal improvement. The principal implication for AI practitioners is that reward shaping, particularly using a cosine length-scaling reward with a repetition penalty, and scaling verifiable reward signals using a mix of gold and silver supervision data, are crucial for stabilizing long CoT growth and enhancing performance. |
LIMO: Less is More for Reasoning (Read more on arXiv or HuggingFace) | Shijie Xia, Ethan Chern, Yang Xiao, Zhen Huang, Yixin Ye | LIMO demonstrates that large language models can achieve strong mathematical reasoning with surprisingly few, high-quality training examples. The main research question is whether minimal but precisely orchestrated demonstrations of cognitive processes can elicit sophisticated reasoning in foundation models with comprehensive domain knowledge. The key methodology involves curating a small, high-quality dataset (817 samples) of mathematical problems and solutions, and fine-tuning a pre-trained Qwen2.5-32B-Instruct model. The primary result is that LIMO achieves 57.1% accuracy on the AIME benchmark and 94.8% on MATH, significantly outperforming models trained on much larger datasets. The principal implication for AI practitioners is that focusing on the quality of reasoning demonstrations, rather than sheer data volume, is a more effective approach for developing robust reasoning capabilities in LLMs. |
Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking (Read more on arXiv or HuggingFace) | Feihu Che, Ruihan Jin, Shuai Zhang, Mingkuan Feng, Jinyang Wu | AStar, an automated structured thinking paradigm, enhances multimodal reasoning in large language models via Monte Carlo Tree Search (MCTS). The main research objective is to address the limitations of existing multimodal large language models (MLLMs) in complex visual reasoning, balancing performance and efficiency. The key methodology involves automatically deriving high-level cognitive reasoning patterns using MCTS-powered hierarchical structures, then integrating these patterns into a unified reasoning framework. The primary result is that AStar achieves 54.0% accuracy on the MathVerse benchmark with a 7B backbone, surpassing GPT-4o (50.2%). For AI practitioners, AStar provides an effective way to boost MLLMs' reasoning performance by leveraging structured patterns derived with MCTS, which in turn improves their ability to solve complex problems that require structured thinking. |
A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods (Read more on arXiv or HuggingFace) | Akash Srivastava, Kai Xu, Guangxuan Xu, Shivchander Sudalairaj, ishapuri-mit | This paper introduces a probabilistic inference framework for scaling large language models (LLMs) at inference time using particle-based Monte Carlo methods. The main research objective is to develop a more robust inference-time scaling approach that is less susceptible to reward hacking compared to existing search-based methods. The key methodology is casting inference-time scaling as probabilistic inference over a state-space model and applying particle filtering to estimate the latent states, leveraging a language model and a process reward model. The primary result is that the proposed method achieves a 4-16x faster scaling rate than deterministic search counterparts on mathematical reasoning tasks, enabling Qwen2.5-Math-1.5B-Instruct to surpass GPT-4o accuracy with only 4 rollouts. The principal implication for AI practitioners is that they can leverage this probabilistic inference approach for more efficient and robust inference-time scaling of LLMs, particularly in domains with imperfect reward models, achieving better performance with smaller models and limited compute budgets. |
Jailbreaking with Universal Multi-Prompts (Read more on arXiv or HuggingFace) | Shang-Tse Chen, Hsuan Su, Yu-Ling Hsu | JUMP, a prompt-based method, jailbreaks Large Language Models (LLMs) using optimized universal multi-prompts and can also be adapted for defense. The main research objective is to optimize a universal attacker to achieve the best attack results on a set of malicious instructions, outperforming existing techniques. The methodology involves a prompt-based framework named JUMP, decomposing the training pipeline into Selector, Mutator, Constraints, and Evaluator stages, using an additional model as an attacker to generate adversarial suffixes through beam search. Primary results include JUMP++ achieving an Attack Success Rate (ASR@10) of 64.4% on Llama2-7b, significantly outperforming several baselines including AdvPrompter in the universal attack setting. Principal implication is to guide practitioners to use JUMP for a more efficient, high-performing method for jailbreaking and defending LLMs by optimizing universal multi-prompts, reducing computational costs when dealing with massive data. |
LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer (Read more on arXiv or HuggingFace) | Danze Chen, Yiren Song, mikeshou | LayerTracer is a diffusion transformer-based framework for generating layered Scalable Vector Graphics (SVGs) from text or images, mimicking professional design processes. The main research objective is to generate cognitive-aligned, editable layered SVGs that meet professional design standards, overcoming limitations of existing methods. The key methodology involves a dual-phase approach: first, a text-conditioned DiT generates multi-phase rasterized blueprints; second, layer-wise vectorization with path deduplication creates editable SVGs. In the SVG generation task, LayerTracer achieves the highest CLIP-Score of 33.76 with the lowest average number of paths (35.39) and shortest time cost (27s) relative to baselines such as VectorFusion and SVGDreamer. For AI practitioners, LayerTracer provides a novel approach and dataset for generating high-quality, editable layered SVGs, directly aligning AI-generated vectors with professional design cognition. |
Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning (Read more on arXiv or HuggingFace) | Yuandong Tian, Jiantao Jiao, Yingchen Xu, Hanlin Zhu, DiJia Su | This paper proposes a method to improve language model reasoning by mixing latent and text tokens in the reasoning trace. The main research question is whether representing initial reasoning steps with discrete latent tokens, while retaining later steps as text, can improve reasoning performance and efficiency in Large Language Models (LLMs). The key methodology involves training a VQ-VAE to convert text tokens into latent codes, then fine-tuning LLMs on reasoning traces where initial text tokens are replaced by these codes, using a randomized replacement strategy. The primary result is that the proposed approach outperforms baseline methods on various benchmarks, such as GSM8K (+4.1% accuracy with Llama-3.2-3B), and reduces reasoning trace length by 17% on average. The principal implication for AI practitioners is that using a mixed representation of latent and text tokens during reasoning trace training can lead to improved accuracy and efficiency compared to using text-only reasoning traces. |
On Teacher Hacking in Language Model Distillation (Read more on arXiv or HuggingFace) | Nino Vieillard, Sarah Perrin, Johan Ferret, Daniele Calandriello, Daniil Tiapkin | Language model distillation can exhibit "teacher hacking," where a student model exploits imperfections in the teacher instead of approximating the true data distribution. The main research question is whether teacher hacking occurs during knowledge distillation in language models, and if so, when and how it can be mitigated. A controlled experimental setup is used, involving an oracle (ground-truth) language model, a teacher model distilled from the oracle, and a student model distilled from the teacher. Results show that teacher hacking occurs when using a fixed offline dataset for distillation, observable when optimization deviates from polynomial convergence laws; for example, the KL divergence between student and teacher decreases while the divergence from the oracle increases. The implication for AI practitioners is to utilize online data generation, prioritize prompt diversity, or increase the generation budget to mitigate teacher hacking during language model distillation. |
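
The particle-based Monte Carlo row above describes applying particle filtering with a language model and a process reward model. The step that distinguishes particle filtering from greedy search is weighted resampling: partial generations are redrawn in proportion to a weight rather than keeping only the top scorers. The sketch below shows that generic step only. Turning scores into weights with a softmax, the string "particles", and the names `softmax` and `resample` are assumptions for illustration, not the paper's exact procedure.

```python
import math
import random
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Convert raw scores into normalized weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def resample(
    particles: Sequence[str], scores: Sequence[float], rng: random.Random
) -> List[str]:
    """Draw len(particles) particles with replacement, weighted by score.

    Low-scoring particles can still survive, which is what makes this
    step stochastic rather than a top-k cut.
    """
    weights = softmax(scores)
    return rng.choices(list(particles), weights=weights, k=len(particles))


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical partial solutions and hypothetical reward-model scores.
    partial_solutions = ["step A", "step B", "step C", "step D"]
    reward_scores = [0.1, 2.0, 0.5, 1.5]
    print(resample(partial_solutions, reward_scores, rng))
```
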
Title | Authors | Summary |
---|---|---|
Inverse Bridge Matching Distillation (Read more on arXiv or HuggingFace) | akorotin, dbaranchuk, apryc1, kekchpek, ngushchin | This paper introduces Inverse Bridge Matching Distillation (IBMD), a novel technique for accelerating the inference of diffusion bridge models (DBMs). The main research question is how to effectively distill both conditional and unconditional DBMs into fast, one-step or few-step generators while maintaining high generation quality. The key methodology is a distillation technique based on solving the inverse bridge matching problem using a tractable objective derived from the inverse formulation. The primary results show that IBMD can accelerate DBM inference by 4x to 100x, with a distilled one-step model achieving a FID score of 2.5 on a 4x super-resolution task, surpassing the teacher model's score of 2.8 obtained using 1000 steps. The principal implication for AI practitioners is that IBMD provides a universal and efficient method for distilling DBMs, enabling their practical application in various image-to-image translation tasks by significantly reducing inference time. |
VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models (Read more on arXiv or HuggingFace) | Adam Polyak, Yuval Kirstain, Amit Zohar, Uriel Singer, Hila | VideoJAM enhances motion coherence in video generation models by introducing a joint appearance-motion representation. The main research question is how to improve the temporal coherence of generated videos, which often lags behind visual fidelity in current models. The key methodology involves training a diffusion model to predict both pixel appearance and optical flow from a unified latent representation, coupled with an inference-time "Inner-Guidance" mechanism that leverages the model's own motion predictions to guide generation. Primary results show that VideoJAM outperforms state-of-the-art models on motion coherence, with human evaluators preferring VideoJAM's motion in 82.0% of cases against the DiT-4B baseline. The principal implication for AI practitioners is that incorporating an explicit motion prior through joint appearance-motion modeling can significantly enhance the temporal consistency of generated videos, directly improving the realism and applicability of video generation models. |
ACECODER: Acing Coder RL via Automated Test-Case Synthesis (Read more on arXiv or HuggingFace) | Xiaotong Chen, Haozhe Wang, Huaye Zeng, pingnieuk, DongfuJiang | ACECODER automates test-case synthesis to train coder models via reinforcement learning (RL). The main research question is whether leveraging automated large-scale test-case synthesis can enhance code model training through RL. The key methodology involves generating extensive question-test-case pairs from existing code data, constructing preference pairs based on program pass rates, and training reward models using the Bradley-Terry loss, followed by RL. A primary result is that the Qwen2.5-Coder-7B model, after RL fine-tuning, achieved a 25% improvement on HumanEval-plus when starting from the base model directly. The principal implication for AI practitioners is that automated test-case synthesis provides a viable path to enhance code generation models using RL, offering a scalable method to improve model performance without reliance on extensive human-annotated datasets. |
QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search (Read more on arXiv or HuggingFace) | Ziniu Hu, Da Yin, Xingcheng Yao, Yao Tang, Zongyu Lin | QLASS is a novel method for enhancing language agent inference through Q-guided stepwise search. The main research question is how to improve the performance of language agents on complex interactive tasks by providing effective intermediate guidance during inference. The key methodology involves automatically generating annotations by estimating Q-values in a stepwise manner, constructing an exploration tree, and performing process reward modeling to guide a Q-guided generation strategy. Primary results show that QLASS outperforms baselines on WebShop, SciWorld, and ALFWorld, achieving a 70.3% success rate on WebShop compared to 67.9% for the next best method, and demonstrates robust performance even with almost half the annotated data. The principal implication for AI practitioners is that QLASS provides a more effective way to perform inference-time search for language agents by leveraging Q-value-based process rewards, leading to improved decision-making in complex interactive tasks. |
Can LLMs Maintain Fundamental Abilities under KV Cache Compression? (Read more on arXiv or HuggingFace) | Zeyu Li, Peijie Dong, Hong Chen, Zhenheng Tang, Dominic789654 | This paper investigates the impact of KV cache compression methods on large language model (LLM) capabilities. The main research objective is to determine if LLMs retain fundamental abilities under various KV cache compression techniques. A comprehensive empirical study across diverse tasks, employing prominent KV cache compression methods, was conducted. Results showed arithmetic reasoning tasks were particularly sensitive to aggressive compression, with performance drops reaching 43.3%. A key implication for AI practitioners is the task-specific sensitivity to compression, which necessitates careful consideration of task requirements when implementing these methods, particularly for tasks involving arithmetic reasoning. |
Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search (Read more on arXiv or HuggingFace) | Zhenfang Chen, Zhang-Wei Hong, Zhenting Qi, Guangtao Zeng, maohaos2 | Satori is a 7B parameter large language model (LLM) that enhances reasoning capabilities via autoregressive search. The research investigated whether a single LLM could internalize search capabilities to improve reasoning. A two-stage training paradigm was employed, using chain-of-action-thought (COAT) reasoning and reinforcement learning with a “Restart and Explore” strategy. Satori achieved state-of-the-art performance on mathematical reasoning benchmarks, outperforming the instruct model built on the same base model. The study's principal implication is that reinforcement learning can effectively enhance LLMs’ reasoning abilities, particularly through the introduction of meta-actions and self-improvement techniques, thus providing a more efficient pathway for developing advanced reasoning LLMs. |
Generating Multi-Image Synthetic Data for Text-to-Image Customization (Read more on arXiv or HuggingFace) | Samaneh Azadi, Ishan Misra, Jun-Yan Zhu, Xi Yin, Nupur Kumari | This paper introduces a method for generating multi-image synthetic data to improve text-to-image model customization. The main research question is how to create a dataset and training method that enables tuning-free customization models to generate high-fidelity images of specific objects in diverse contexts. The key methodology involves generating a synthetic dataset (SynCD) using 3D assets and shared attention mechanisms, and training an encoder-based model with a novel inference technique that normalizes text and image guidance vectors. The primary results show that the proposed method outperforms existing tuning-free methods on standard customization benchmarks, achieving a geometric score of 0.838 with 3 input images compared to 0.780 for the next best method (JeDi). The principal implication for AI practitioners is that using synthetic data with multi-image supervision and shared attention mechanisms can significantly improve the performance of tuning-free text-to-image customization models. |
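
The ACECODER row in the table above mentions building preference pairs from program pass rates and training a reward model with the Bradley-Terry loss. The sketch below shows those two pieces in plain Python. It is a hedged illustration only: the `min_gap` threshold, the function names, and the example pass rates are assumptions, not details taken from the paper.

```python
import math
from itertools import combinations


def build_preference_pairs(pass_rates, min_gap=0.4):
    """Return (chosen, rejected) index pairs for programs whose test pass rates
    differ by at least min_gap (a hypothetical threshold)."""
    pairs = []
    for i, j in combinations(range(len(pass_rates)), 2):
        gap = pass_rates[i] - pass_rates[j]
        if gap >= min_gap:
            pairs.append((i, j))
        elif -gap >= min_gap:
            pairs.append((j, i))
    return pairs


def bradley_terry_loss(reward_chosen, reward_rejected):
    """Bradley-Terry loss for one pair: -log(sigmoid(r_chosen - r_rejected))."""
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))


# Hypothetical pass rates for three candidate programs answering one question.
pass_rates = [1.0, 0.5, 0.25]
pairs = build_preference_pairs(pass_rates)
print(pairs)

# The loss shrinks as the reward model scores the chosen program higher.
print(bradley_terry_loss(2.0, 0.0), bradley_terry_loss(0.0, 2.0))
```

A real reward model would produce the scalar rewards from the program text; here they are passed in directly to keep the example self-contained.
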
Title | Authors | Summary |
---|---|---|
The Differences Between Direct Alignment Algorithms are a Blur (Read more on arXiv or HuggingFace) | Boris Shaposhnikov, kefirski, ZeL1k7, ummagumm-a, Myashka | The paper investigates Direct Alignment Algorithms (DAAs) for aligning language models with human preferences, focusing on their performance and key distinctions. The main research objective is to clarify the relationships and comparative advantages among various DAAs, particularly regarding the impact of an explicit Supervised Fine-Tuning (SFT) phase and a scaling parameter, β. The methodology involves incorporating an SFT phase and the β parameter into single-stage DAAs (ORPO and ASFT) and empirically evaluating their performance on benchmarks like Alpaca Eval 2 using Llama 3.1 8B and Llama 3.2 3B models. A primary result is that these modifications improved ORPO's performance on Alpaca Eval 2 by +3.46 and ASFT's by +8.27. The principal implication for AI practitioners is that incorporating an explicit SFT phase and tuning the β parameter can significantly enhance the alignment quality of single-stage DAAs, making them competitive with two-stage methods like DPO, and that pairwise methods often outperform pointwise objectives. |
Process Reinforcement through Implicit Rewards (Read more on arXiv or HuggingFace) | Wendi Li, Zefan Wang, Lifan Yuan, hanbin, ganqu | The paper introduces PRIME, a scalable reinforcement learning framework for enhancing reasoning in large language models using dense token-level rewards. The main research question is how to acquire and utilize high-quality dense rewards at scale for efficient online process reward model (PRM) updates in reinforcement learning of large language models (LLMs). The key methodology is the use of implicit process rewards derived from an Implicit PRM, which is trained with outcome labels only and allows online updates using policy rollouts and outcome labels. The primary result is that Eurus-2-7B-PRIME, trained using PRIME, achieves a 15.1% average improvement across several reasoning benchmarks over the SFT model. The principal implication for AI practitioners is that PRIME offers an efficient way to incorporate dense rewards into reinforcement learning for LLMs, improving sample efficiency and performance without the need for dedicated reward model training or step-level annotations. |
OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models (Read more on arXiv or HuggingFace) | Chao Liang, Zerong Zheng, Jiaqi Yang, Jianwen Jiang, Gaojie Lin | OmniHuman-1 is a diffusion-based model for generating human animation videos conditioned on multiple modalities, including text, audio, and pose. The main research objective is to address the challenge of scaling up training data for end-to-end human animation models. The key methodology is a mixed-condition training strategy using a Diffusion Transformer model that integrates text, audio, and pose as conditions, along with an "omni-conditions" approach to leverage data across different conditioning strengths. The primary results show that OmniHuman outperforms existing methods on portrait and body animation tasks, achieving a FID score of 16.970 on the RAVDESS dataset for portrait animation. The principal implication for AI practitioners is that the proposed omni-conditions training strategy effectively scales up human animation models by leveraging mixed-condition data, enabling the development of more versatile and realistic human video generation systems. |
Preference Leakage: A Contamination Problem in LLM-as-a-judge (Read more on arXiv or HuggingFace) | Bohan Jiang, Ming Zhong, Yue Huang, Dawei Li, RLSNLP | This paper investigates preference leakage, a contamination issue in LLM-as-a-judge systems where evaluator LLMs exhibit biases towards related data generator LLMs. The main research question is whether preference leakage introduces systematic biases in LLM-based evaluations and, if so, to what extent. The key methodology involves training student models on synthetic data generated by different LLMs and then evaluating them using related and unrelated LLM judges, quantifying the bias through a "preference leakage score". A primary result is that the average preference leakage score for the Mistral-GPT-4o vs Mistral-Gemini-1.5 model pair on AlpacaEval 2.0 was 18.4%, indicating significant bias. The principal implication for AI practitioners is that using closely related LLMs for data generation and evaluation can lead to significant biases, artificially inflating performance metrics and compromising the reliability of assessments. |
SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model (Read more on arXiv or HuggingFace) | Sensen Zhang, Zhiyu Li, Simin Niu, Xun Liang, UglyToilet | SafeRAG is a new benchmark to evaluate the security of retrieval-augmented generation (RAG) systems against data injection attacks. The main research question is: How vulnerable are RAG systems to attacks that manipulate external knowledge sources? The key methodology involves constructing a dataset, SafeRAG, with four attack types (silver noise, inter-context conflict, soft ad, and white Denial-of-Service) and evaluating 14 RAG components across different stages (indexing, retrieval, generation). A primary result is that the Baichuan 13B model achieved an attack failure rate (AFR) of 1.00 under the Denial-of-Service task, indicating complete resistance. The principal implication for AI practitioners is that current RAG systems, even advanced ones, are vulnerable to sophisticated data injection attacks, highlighting the need to develop more robust retrievers, filters, and generators when building RAG applications. |
FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation (Read more on arXiv or HuggingFace) | Jae-Joon Kim, Yulhwa Kim, jiwonsong, dongwonjo | FastKV introduces a novel KV cache compression method for large language models (LLMs) to improve efficiency in long-context processing. The main research question is how to enhance the latency and throughput of LLMs handling long-context sequences while maintaining accuracy. The key methodology is Token-Selective Propagation (TSP), which retains full context in initial layers and selectively propagates crucial tokens in deeper layers, alongside grouped-query attention (GQA)-aware KV cache compression. The primary results show that FastKV achieves 2.00x improvement in time-to-first-token (TTFT) and 1.40x improvement in throughput compared to HeadKV. The principal implication for AI practitioners is that FastKV can be used as a drop-in replacement in existing LLMs to significantly reduce latency and increase throughput in long-context processing without sacrificing accuracy. |
Almost Surely Safe Alignment of Large Language Models at Inference-Time (Read more on arXiv or HuggingFace) | Jun Wang, Ilija Bogunovic, Matthieu Zimmer, Shyam Sundhar Ramesh, Xiaotong Ji | This paper introduces InferenceGuard, a novel inference-time alignment method that ensures large language models (LLMs) generate safe responses with a probability approaching one. The main research question is how to guarantee safe outputs from LLMs during inference without modifying model weights. The key methodology involves framing safe inference-time alignment as a constrained Markov decision process (cMDP), augmenting the state space with a safety constraint tracker, and training a critic in the latent space to guide a lookahead search algorithm. The primary results show that InferenceGuard achieved safety rates of 98.02% on Alpaca-7B and 100% on Beaver-7B-v3 while maintaining strong task performance. The principal implication for AI practitioners is that InferenceGuard offers a practical and theoretically sound approach for safely aligning LLMs during inference, enhancing their usability in real-world applications without the need for retraining. |
DeepRAG: Thinking to Retrieval Step by Step for Large Language Models (Read more on arXiv or HuggingFace) | Yaojie Lu, Chunlei Xin, Fandong Meng, Jiali Zeng, xinyan233333 | DeepRAG is a retrieval-augmented generation framework that models retrieval-augmented reasoning as a Markov Decision Process for improved efficiency and accuracy. The main research question is how to optimize retrieval-augmented reasoning in large language models by dynamically determining when to retrieve external knowledge versus relying on parametric reasoning. The key methodology is a Markov Decision Process framework called DeepRAG, which uses binary tree search, imitation learning, and chain of calibration to enable strategic and adaptive retrieval. Primary results show that DeepRAG improves answer accuracy by 21.99% while also enhancing retrieval efficiency. The principal implication for AI practitioners is that DeepRAG provides a more effective framework for retrieval-augmented reasoning compared to existing methods, and it achieves superior performance by using dynamic cognitive decision-making. |
ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning (Read more on arXiv or HuggingFace) | Radha Poovendran, Ashish Sabharwal, Kyle Richardson, ronanlb, yuchenlin | ZebraLogic is a framework for evaluating the logical reasoning abilities of large language models (LLMs) using logic grid puzzles. The main research question is how LLM performance on logical reasoning tasks scales with problem complexity. The key methodology involves generating logic grid puzzles with controllable complexity using constraint satisfaction problems and evaluating various LLMs' performance. Primary results show a significant decline in accuracy as problem complexity increases, with most models struggling when the puzzle's search space exceeds 10^7 possibilities (e.g., gpt-4o-mini achieves only 20.1% overall accuracy). The principal implication for AI practitioners is that scaling model size or training data alone is insufficient for solving complex logical reasoning tasks, and increasing test-time compute via more reasoning steps can improve performance. |
The Jumping Reasoning Curve? Tracking the Evolution of Reasoning Performance in GPT-[n] and o-[n] Models on Multimodal Puzzles (Read more on arXiv or HuggingFace) | Soujanya Poria, Deepanway Ghosal, Yew Ken Chia, Vernon Y. H. Toh | The paper tracks the evolution of multimodal reasoning in GPT-[n] and o-[n] models using visual puzzles. The main research question is how the reasoning performance of these models evolves over time on multimodal puzzles. The key methodology involves evaluating the models on PUZZLEVQA and ALGOPUZZLEVQA datasets using multiple-choice and open-ended questions, with a two-stage prompting strategy for answer extraction. Primary results show that the o1 model achieved 79.2% accuracy on PUZZLEVQA in the multiple-choice setting, but all models performed significantly worse in open-ended settings. The principal implication for AI practitioners is that despite improvements, current models still have limitations in visual perception and abstract reasoning, suggesting a need for further |