diff --git a/.nojekyll b/.nojekyll new file mode 100644 index 0000000..e69de29 diff --git a/cache.json b/cache.json new file mode 100644 index 0000000..2081d20 --- /dev/null +++ b/cache.json @@ -0,0 +1 @@ +{"2025-01-07T00:00:00Z":{"Robotics":[{"id":"http://arxiv.org/abs/2501.04005v1","updated":"2025-01-07T18:59:59Z","published":"2025-01-07T18:59:59Z","title":"LargeAD: Large-Scale Cross-Sensor Data Pretraining for Autonomous\n Driving","summary":" Recent advancements in vision foundation models (VFMs) have revolutionized\nvisual perception in 2D, yet their potential for 3D scene understanding,\nparticularly in autonomous driving applications, remains underexplored. In this\npaper, we introduce LargeAD, a versatile and scalable framework designed for\nlarge-scale 3D pretraining across diverse real-world driving datasets. Our\nframework leverages VFMs to extract semantically rich superpixels from 2D\nimages, which are aligned with LiDAR point clouds to generate high-quality\ncontrastive samples. This alignment facilitates cross-modal representation\nlearning, enhancing the semantic consistency between 2D and 3D data. We\nintroduce several key innovations: i) VFM-driven superpixel generation for\ndetailed semantic representation, ii) a VFM-assisted contrastive learning\nstrategy to align multimodal features, iii) superpoint temporal consistency to\nmaintain stable representations across time, and iv) multi-source data\npretraining to generalize across various LiDAR configurations. Our approach\ndelivers significant performance improvements over state-of-the-art methods in\nboth linear probing and fine-tuning tasks for both LiDAR-based segmentation and\nobject detection. 
Extensive experiments on eleven large-scale multi-modal\ndatasets highlight our superior performance, demonstrating the adaptability,\nefficiency, and robustness in real-world autonomous driving scenarios.\n","authors":["Lingdong Kong","Xiang Xu","Youquan Liu","Jun Cen","Runnan Chen","Wenwei Zhang","Liang Pan","Kai Chen","Ziwei Liu"],"pdf_url":"https://arxiv.org/pdf/2501.04005v1.pdf","comment":"Preprint; 16 pages, 7 figures, 8 tables; Project Page at\n https://ldkong.com/LargeAD"},{"id":"http://arxiv.org/abs/2501.04004v1","updated":"2025-01-07T18:59:58Z","published":"2025-01-07T18:59:58Z","title":"LiMoE: Mixture of LiDAR Representation Learners from Automotive Scenes","summary":" LiDAR data pretraining offers a promising approach to leveraging large-scale,\nreadily available datasets for enhanced data utilization. However, existing\nmethods predominantly focus on sparse voxel representation, overlooking the\ncomplementary attributes provided by other LiDAR representations. In this work,\nwe propose LiMoE, a framework that integrates the Mixture of Experts (MoE)\nparadigm into LiDAR data representation learning to synergistically combine\nmultiple representations, such as range images, sparse voxels, and raw points.\nOur approach consists of three stages: i) Image-to-LiDAR Pretraining, which\ntransfers prior knowledge from images to point clouds across different\nrepresentations; ii) Contrastive Mixture Learning (CML), which uses MoE to\nadaptively activate relevant attributes from each representation and distills\nthese mixed features into a unified 3D network; iii) Semantic Mixture\nSupervision (SMS), which combines semantic logits from multiple representations\nto boost downstream segmentation performance. Extensive experiments across 11\nlarge-scale LiDAR datasets demonstrate our effectiveness and superiority. 
The\ncode and model checkpoints have been made publicly accessible.\n","authors":["Xiang Xu","Lingdong Kong","Hui Shuai","Liang Pan","Ziwei Liu","Qingshan Liu"],"pdf_url":"https://arxiv.org/pdf/2501.04004v1.pdf","comment":"Preprint; 26 pages, 17 figures, 7 tables; Project Page at\n https://ldkong.com/LiMoE"},{"id":"http://arxiv.org/abs/2501.04003v1","updated":"2025-01-07T18:59:55Z","published":"2025-01-07T18:59:55Z","title":"Are VLMs Ready for Autonomous Driving? An Empirical Study from the\n Reliability, Data, and Metric Perspectives","summary":" Recent advancements in Vision-Language Models (VLMs) have sparked interest in\ntheir use for autonomous driving, particularly in generating interpretable\ndriving decisions through natural language. However, the assumption that VLMs\ninherently provide visually grounded, reliable, and interpretable explanations\nfor driving remains largely unexamined. To address this gap, we introduce\nDriveBench, a benchmark dataset designed to evaluate VLM reliability across 17\nsettings (clean, corrupted, and text-only inputs), encompassing 19,200 frames,\n20,498 question-answer pairs, three question types, four mainstream driving\ntasks, and a total of 12 popular VLMs. Our findings reveal that VLMs often\ngenerate plausible responses derived from general knowledge or textual cues\nrather than true visual grounding, especially under degraded or missing visual\ninputs. This behavior, concealed by dataset imbalances and insufficient\nevaluation metrics, poses significant risks in safety-critical scenarios like\nautonomous driving. We further observe that VLMs struggle with multi-modal\nreasoning and display heightened sensitivity to input corruptions, leading to\ninconsistencies in performance. To address these challenges, we propose refined\nevaluation metrics that prioritize robust visual grounding and multi-modal\nunderstanding. 
Additionally, we highlight the potential of leveraging VLMs'\nawareness of corruptions to enhance their reliability, offering a roadmap for\ndeveloping more trustworthy and interpretable decision-making systems in\nreal-world autonomous driving contexts. The benchmark toolkit is publicly\naccessible.\n","authors":["Shaoyuan Xie","Lingdong Kong","Yuhao Dong","Chonghao Sima","Wenwei Zhang","Qi Alfred Chen","Ziwei Liu","Liang Pan"],"pdf_url":"https://arxiv.org/pdf/2501.04003v1.pdf","comment":"Preprint; 41 pages, 32 figures, 16 tables; Project Page at\n https://drive-bench.github.io/"},{"id":"http://arxiv.org/abs/2412.05313v3","updated":"2025-01-07T18:57:23Z","published":"2024-11-28T19:31:50Z","title":"λ: A Benchmark for Data-Efficiency in Long-Horizon Indoor Mobile\n Manipulation Robotics","summary":" Efficiently learning and executing long-horizon mobile manipulation (MoMa)\ntasks is crucial for advancing robotics in household and workplace settings.\nHowever, current MoMa models are data-inefficient, underscoring the need for\nimproved models that require realistic-sized benchmarks to evaluate their\nefficiency, which do not exist. To address this, we introduce the LAMBDA\n({\\lambda}) benchmark (Long-horizon Actions for Mobile-manipulation\nBenchmarking of Directed Activities), which evaluates the data efficiency of\nmodels on language-conditioned, long-horizon, multi-room, multi-floor,\npick-and-place tasks using a dataset of manageable size, more feasible for\ncollection. The benchmark includes 571 human-collected demonstrations that\nprovide realism and diversity in simulated and real-world settings. Unlike\nplanner-generated data, these trajectories offer natural variability and\nreplay-verifiability, ensuring robust learning and evaluation. 
We benchmark\nseveral models, including learning-based models and a neuro-symbolic modular\napproach combining foundation models with task and motion planning.\nLearning-based models show suboptimal success rates, even when leveraging\npretrained weights, underscoring significant data inefficiencies. However, the\nneuro-symbolic approach performs significantly better while being more data\nefficient. Findings highlight the need for more data-efficient learning-based\nMoMa approaches. {\\lambda} addresses this gap by serving as a key benchmark for\nevaluating the data efficiency of those future models in handling household\nrobotics tasks.\n","authors":["Ahmed Jaafar","Shreyas Sundara Raman","Yichen Wei","Sudarshan Harithas","Sofia Juliani","Anneke Wernerfelt","Benedict Quartey","Ifrah Idrees","Jason Xinyu Liu","Stefanie Tellex"],"pdf_url":"https://arxiv.org/pdf/2412.05313v3.pdf","comment":"8 pages"},{"id":"http://arxiv.org/abs/2412.20429v3","updated":"2025-01-07T18:24:45Z","published":"2024-12-29T10:46:08Z","title":"Multi-Scenario Reasoning: Unlocking Cognitive Autonomy in Humanoid\n Robots for Multimodal Understanding","summary":" To improve the cognitive autonomy of humanoid robots, this research proposes\na multi-scenario reasoning architecture to solve the technical shortcomings of\nmulti-modal understanding in this field. It draws on simulation based\nexperimental design that adopts multi-modal synthesis (visual, auditory,\ntactile) and builds a simulator \"Maha\" to perform the experiment. The findings\ndemonstrate the feasibility of this architecture in multimodal data. It\nprovides reference experience for the exploration of cross-modal interaction\nstrategies for humanoid robots in dynamic environments. In addition,\nmulti-scenario reasoning simulates the high-level reasoning mechanism of the\nhuman brain to humanoid robots at the cognitive level. This new concept\npromotes cross-scenario practical task transfer and semantic-driven action\nplanning. 
It heralds the future development of self-learning and autonomous\nbehavior of humanoid robots in changing scenarios.\n","authors":["Libo Wang"],"pdf_url":"https://arxiv.org/pdf/2412.20429v3.pdf","comment":"The main text is 5 pages, 2 figures, and 3 tables"},{"id":"http://arxiv.org/abs/2501.03972v1","updated":"2025-01-07T18:22:44Z","published":"2025-01-07T18:22:44Z","title":"MAD-BA: 3D LiDAR Bundle Adjustment -- from Uncertainty Modelling to\n Structure Optimization","summary":" The joint optimization of sensor poses and 3D structure is fundamental for\nstate estimation in robotics and related fields. Current LiDAR systems often\nprioritize pose optimization, with structure refinement either omitted or\ntreated separately using representations like signed distance functions or\nneural networks. This paper introduces a framework for simultaneous\noptimization of sensor poses and 3D map, represented as surfels. A generalized\nLiDAR uncertainty model is proposed to address degraded or less reliable\nmeasurements in varying scenarios. Experimental results on public datasets\ndemonstrate improved performance over most comparable state-of-the-art methods.\nThe system is provided as open-source software to support further research.\n","authors":["Krzysztof Ćwian","Luca Di Giammarino","Simone Ferrari","Thomas Ciarfuglia","Giorgio Grisetti","Piotr Skrzypczyński"],"pdf_url":"https://arxiv.org/pdf/2501.03972v1.pdf","comment":"8 pages, 6 figures, this work has been submitted to IEEE RA-L"},{"id":"http://arxiv.org/abs/2501.03971v1","updated":"2025-01-07T18:22:23Z","published":"2025-01-07T18:22:23Z","title":"Impact of Leg Stiffness on Energy Efficiency in One Legged Hopping","summary":" In the fields of robotics and biomechanics, the integration of elastic\nelements such as springs and tendons in legged systems has long been recognized\nfor enabling energy-efficient locomotion. 
Yet, a significant challenge\npersists: designing a robotic leg that perform consistently across diverse\noperating conditions, especially varying average forward speeds. It remains\nunclear whether, for such a range of operating conditions, the stiffness of the\nelastic elements needs to be varied or if a similar performance can be obtained\nby changing the motion and actuation while keeping the stiffness fixed. This\nwork explores the influence of the leg stiffness on the energy efficiency of a\nmonopedal robot through an extensive parametric study of its periodic hopping\nmotion. To this end, we formulate an optimal control problem parameterized by\naverage forward speed and leg stiffness, solving it numerically using direct\ncollocation. Our findings indicate that, compared to the use of a fixed\nstiffness, employing variable stiffness in legged systems improves energy\nefficiency by 20 % maximally and by 6.8 % on average across a range of speeds.\n","authors":["Iskandar Khemakhem","Dominik Tschemernjak","Maximilian Raff","C. David Remy"],"pdf_url":"https://arxiv.org/pdf/2501.03971v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03968v1","updated":"2025-01-07T18:06:27Z","published":"2025-01-07T18:06:27Z","title":"VLM-driven Behavior Tree for Context-aware Task Planning","summary":" The use of Large Language Models (LLMs) for generating Behavior Trees (BTs)\nhas recently gained attention in the robotics community, yet remains in its\nearly stages of development. In this paper, we propose a novel framework that\nleverages Vision-Language Models (VLMs) to interactively generate and edit BTs\nthat address visual conditions, enabling context-aware robot operations in\nvisually complex environments. A key feature of our approach lies in the\nconditional control through self-prompted visual conditions. Specifically, the\nVLM generates BTs with visual condition nodes, where conditions are expressed\nas free-form text. 
Another VLM process integrates the text into its prompt and\nevaluates the conditions against real-world images during robot execution. We\nvalidated our framework in a real-world cafe scenario, demonstrating both its\nfeasibility and limitations.\n","authors":["Naoki Wake","Atsushi Kanehira","Jun Takamatsu","Kazuhiro Sasabuchi","Katsushi Ikeuchi"],"pdf_url":"https://arxiv.org/pdf/2501.03968v1.pdf","comment":"10 pages, 11 figures, 5 tables. Last updated on January 7th, 2024"},{"id":"http://arxiv.org/abs/2408.01333v3","updated":"2025-01-07T17:32:29Z","published":"2024-08-02T15:30:51Z","title":"Incorporating Control Inputs in Continuous-Time Gaussian Process State\n Estimation for Robotics","summary":" Continuous-time batch state estimation using Gaussian processes is an\nefficient approach to estimate the trajectories of robots over time. In the\npast, relatively simple physics-motivated priors have been considered for such\napproaches, using assumptions such as constant velocity or acceleration. This\npaper presents an approach to incorporating exogenous control inputs, such as\nvelocity or acceleration commands, into the continuous Gaussian process\nstate-estimation framework. It is shown that this approach generalizes across\ndifferent domains in robotics, making it applicable to both the estimation of\ncontinuous-time trajectories for mobile robots and the estimation of\nquasi-static continuum robot shapes. Results show that incorporating control\ninputs leads to more informed priors, potentially requiring less measurements\nand estimation nodes to obtain accurate estimates. This makes the approach\nparticularly useful in situations in which limited sensing is available. 
For\nexample, in a mobile robot localization experiment with sparse landmark\ndistance measurements and frequent odometry control inputs, our approach\nprovides accurate trajectory estimates with root-mean-square errors around 3-4\ncm and 4-5 degrees, even with time intervals up to five seconds between\ndiscrete estimation nodes, which significantly reduces computation time.\n","authors":["Sven Lilge","Timothy D. Barfoot"],"pdf_url":"https://arxiv.org/pdf/2408.01333v3.pdf","comment":"21 pages, 7 figures, Accepted to Robotica"},{"id":"http://arxiv.org/abs/2410.23963v2","updated":"2025-01-07T16:48:25Z","published":"2024-10-31T14:15:54Z","title":"Exploiting Information Theory for Intuitive Robot Programming of Manual\n Activities","summary":" Observational learning is a promising approach to enable people without\nexpertise in programming to transfer skills to robots in a user-friendly\nmanner, since it mirrors how humans learn new behaviors by observing others.\nMany existing methods focus on instructing robots to mimic human trajectories,\nbut motion-level strategies often pose challenges in skills generalization\nacross diverse environments. This paper proposes a novel framework that allows\nrobots to achieve a higher-level understanding of human-demonstrated manual\ntasks recorded in RGB videos. By recognizing the task structure and goals,\nrobots generalize what observed to unseen scenarios. We found our task\nrepresentation on Shannon's Information Theory (IT), which is applied for the\nfirst time to manual tasks. IT helps extract the active scene elements and\nquantify the information shared between hands and objects. We exploit scene\ngraph properties to encode the extracted interaction features in a compact\nstructure and segment the demonstration into blocks, streamlining the\ngeneration of Behavior Trees for robot replicas. Experiments validated the\neffectiveness of IT to automatically generate robot execution plans from a\nsingle human demonstration. 
Additionally, we provide HANDSOME, an open-source\ndataset of HAND Skills demOnstrated by Multi-subjEcts, to promote further\nresearch and evaluation in this field.\n","authors":["Elena Merlo","Marta Lagomarsino","Edoardo Lamon","Arash Ajoudani"],"pdf_url":"https://arxiv.org/pdf/2410.23963v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03907v1","updated":"2025-01-07T16:22:12Z","published":"2025-01-07T16:22:12Z","title":"Implicit Coordination using Active Epistemic Inference","summary":" A Multi-robot system (MRS) provides significant advantages for intricate\ntasks such as environmental monitoring, underwater inspections, and space\nmissions. However, addressing potential communication failures or the lack of\ncommunication infrastructure in these fields remains a challenge. A significant\nportion of MRS research presumes that the system can maintain communication\nwith proximity constraints, but this approach does not solve situations where\ncommunication is either non-existent, unreliable, or poses a security risk.\nSome approaches tackle this issue using predictions about other robots while\nnot communicating, but these methods generally only permit agents to utilize\nfirst-order reasoning, which involves reasoning based purely on their own\nobservations. In contrast, to deal with this problem, our proposed framework\nutilizes Theory of Mind (ToM), employing higher-order reasoning by shifting a\nrobot's perspective to reason about a belief of others observations. Our\napproach has two main phases: i) an efficient runtime plan adaptation using\nactive inference to signal intentions and reason about a robot's own belief and\nthe beliefs of others in the system, and ii) a hierarchical epistemic planning\nframework to iteratively reason about the current MRS mission state. 
The\nproposed framework outperforms greedy and first-order reasoning approaches and\nis validated using simulations and experiments with heterogeneous robotic\nsystems.\n","authors":["Lauren Bramblett","Jonathan Reasoner","Nicola Bezzo"],"pdf_url":"https://arxiv.org/pdf/2501.03907v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03881v1","updated":"2025-01-07T15:44:06Z","published":"2025-01-07T15:44:06Z","title":"An LSTM-based Test Selection Method for Self-Driving Cars","summary":" Self-driving cars require extensive testing, which can be costly in terms of\ntime. To optimize this process, simple and straightforward tests should be\nexcluded, focusing on challenging tests instead. This study addresses the test\nselection problem for lane-keeping systems for self-driving cars. Road segment\nfeatures, such as angles and lengths, were extracted and treated as sequences,\nenabling classification of the test cases as \"safe\" or \"unsafe\" using a long\nshort-term memory (LSTM) model. The proposed model is compared against machine\nlearning-based test selectors. Results demonstrated that the LSTM-based method\noutperformed machine learning-based methods in accuracy and precision metrics\nwhile exhibiting comparable performance in recall and F1 scores. 
This work\nintroduces a novel deep learning-based approach to the road classification\nproblem, providing an effective solution for self-driving car test selection\nusing a simulation environment.\n","authors":["Ali Güllü","Faiz Ali Shah","Dietmar Pfahl"],"pdf_url":"https://arxiv.org/pdf/2501.03881v1.pdf","comment":"8 pages, 6 figures, 5 tables"},{"id":"http://arxiv.org/abs/2501.03859v1","updated":"2025-01-07T15:16:16Z","published":"2025-01-07T15:16:16Z","title":"A Synergistic Framework for Learning Shape Estimation and Shape-Aware\n Whole-Body Control Policy for Continuum Robots","summary":" In this paper, we present a novel synergistic framework for learning shape\nestimation and a shape-aware whole-body control policy for tendon-driven\ncontinuum robots. Our approach leverages the interaction between two Augmented\nNeural Ordinary Differential Equations (ANODEs) -- the Shape-NODE and\nControl-NODE -- to achieve continuous shape estimation and shape-aware control.\nThe Shape-NODE integrates prior knowledge from Cosserat rod theory, allowing it\nto adapt and account for model mismatches, while the Control-NODE uses this\nshape information to optimize a whole-body control policy, trained in a Model\nPredictive Control (MPC) fashion. This unified framework effectively overcomes\nlimitations of existing data-driven methods, such as poor shape awareness and\nchallenges in capturing complex nonlinear dynamics. 
Extensive evaluations in\nboth simulation and real-world environments demonstrate the framework's robust\nperformance in shape estimation, trajectory tracking, and obstacle avoidance.\nThe proposed method consistently outperforms state-of-the-art end-to-end,\nNeural-ODE, and Recurrent Neural Network (RNN) models, particularly in terms of\ntracking accuracy and generalization capabilities.\n","authors":["Mohammadreza Kasaei","Farshid Alambeigi","Mohsen Khadem"],"pdf_url":"https://arxiv.org/pdf/2501.03859v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03191v2","updated":"2025-01-07T15:04:47Z","published":"2024-12-04T10:23:27Z","title":"Soft Adaptive Feet for Legged Robots: An Open-Source Model for\n Locomotion Simulation","summary":" In recent years, artificial feet based on soft robotics and under-actuation\nprinciples emerged to improve mobility on challenging terrains. This paper\npresents the application of the MuJoCo physics engine to realize a digital twin\nof an adaptive soft foot developed for use with legged robots. We release the\nMuJoCo soft foot digital twin as open source to allow users and researchers to\nexplore new approaches to locomotion. The work includes the system modeling\ntechniques along with the kinematic and dynamic attributes involved. Validation\nis conducted through a rigorous comparison with bench tests on a physical\nprototype, replicating these experiments in simulation. Results are evaluated\nbased on sole deformation and contact forces during foot-obstacle interaction.\nThe foot model is subsequently integrated into simulations of the humanoid\nrobot COMAN+, replacing its original flat feet. Results show an improvement in\nthe robot's ability to negotiate small obstacles without altering its control\nstrategy. 
Ultimately, this study offers a comprehensive modeling approach for\nadaptive soft feet, supported by qualitative comparisons of bipedal locomotion\nwith state of the art robotic feet.\n","authors":["Matteo Crotti","Luca Rossini","Balint K. Hodossy","Anna Pace","Giorgio Grioli","Antonio Bicchi","Manuel G. Catalano"],"pdf_url":"https://arxiv.org/pdf/2412.03191v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03841v1","updated":"2025-01-07T14:50:33Z","published":"2025-01-07T14:50:33Z","title":"OmniManip: Towards General Robotic Manipulation via Object-Centric\n Interaction Primitives as Spatial Constraints","summary":" The development of general robotic systems capable of manipulating in\nunstructured environments is a significant challenge. While Vision-Language\nModels(VLM) excel in high-level commonsense reasoning, they lack the\nfine-grained 3D spatial understanding required for precise manipulation tasks.\nFine-tuning VLM on robotic datasets to create Vision-Language-Action\nModels(VLA) is a potential solution, but it is hindered by high data collection\ncosts and generalization issues. To address these challenges, we propose a\nnovel object-centric representation that bridges the gap between VLM's\nhigh-level reasoning and the low-level precision required for manipulation. Our\nkey insight is that an object's canonical space, defined by its functional\naffordances, provides a structured and semantically meaningful way to describe\ninteraction primitives, such as points and directions. These primitives act as\na bridge, translating VLM's commonsense reasoning into actionable 3D spatial\nconstraints. In this context, we introduce a dual closed-loop, open-vocabulary\nrobotic manipulation system: one loop for high-level planning through primitive\nresampling, interaction rendering and VLM checking, and another for low-level\nexecution via 6D pose tracking. This design ensures robust, real-time control\nwithout requiring VLM fine-tuning. 
Extensive experiments demonstrate strong\nzero-shot generalization across diverse robotic manipulation tasks,\nhighlighting the potential of this approach for automating large-scale\nsimulation data generation.\n","authors":["Mingjie Pan","Jiyao Zhang","Tianshu Wu","Yinghao Zhao","Wenlong Gao","Hao Dong"],"pdf_url":"https://arxiv.org/pdf/2501.03841v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19950v2","updated":"2025-01-07T14:35:01Z","published":"2024-12-27T23:10:32Z","title":"Data-driven tool wear prediction in milling, based on a\n process-integrated single-sensor approach","summary":" Accurate tool wear prediction is essential for maintaining productivity and\nminimizing costs in machining. However, the complex nature of the tool wear\nprocess poses significant challenges to achieving reliable predictions. This\nstudy explores data-driven methods, in particular deep learning, for tool wear\nprediction. Traditional data-driven approaches often focus on a single process,\nrelying on multi-sensor setups and extensive data generation, which limits\ngeneralization to new settings. Moreover, multi-sensor integration is often\nimpractical in industrial environments. To address these limitations, this\nresearch investigates the transferability of predictive models using minimal\ntraining data, validated across two processes. Furthermore, it uses a simple\nsetup with a single acceleration sensor to establish a low-cost data generation\napproach that facilitates the generalization of models to other processes via\ntransfer learning. 
The study evaluates several machine learning models,\nincluding convolutional neural networks (CNN), long short-term memory networks\n(LSTM), support vector machines (SVM) and decision trees, trained on different\ninput formats such as feature vectors and short-time Fourier transform (STFT).\nThe performance of the models is evaluated on different amounts of training\ndata, including scenarios with significantly reduced datasets, providing\ninsight into their effectiveness under constrained data conditions. The results\ndemonstrate the potential of specific models and configurations for effective\ntool wear prediction, contributing to the development of more adaptable and\nefficient predictive maintenance strategies in machining. Notably, the ConvNeXt\nmodel has an exceptional performance, achieving an 99.1% accuracy in\nidentifying tool wear using data from only four milling tools operated until\nthey are worn.\n","authors":["Eric Hirsch","Christian Friedrich"],"pdf_url":"https://arxiv.org/pdf/2412.19950v2.pdf","comment":"Preprint submitted to Robotics and Computer-Integrated Manufacturing\n ,14 pages, 9 figures"},{"id":"http://arxiv.org/abs/2501.03819v1","updated":"2025-01-07T14:32:36Z","published":"2025-01-07T14:32:36Z","title":"An innovative mixed reality approach for Robotics Surgery","summary":" Robotic-assisted procedures offer numerous advantages over traditional\napproaches, including improved dexterity, reduced fatigue, minimized trauma,\nand superior outcomes. However, the main challenge of these systems remains the\npoor visualization and perception of the surgical field. The goal of this paper\nis to provide an innovative approach concerning an application able to improve\nthe surgical procedures offering assistance in both preplanning and\nintraoperative steps of the surgery. 
The system has been designed to offer a\nbetter understanding of the patient through techniques that provide medical\nimages visualization, 3D anatomical structures perception and robotic planning.\nThe application was designed to be intuitive and user friendly, providing an\naugmented reality experience through the Hololens 2 device. It was tested in\nlaboratory conditions, yielding positive results.\n","authors":["Gabriela Rus","Nadim Al Hajjar","Ionut Zima","Calin Vaida","Corina Radu","Damien Chablat","Andra Ciocan","Doina Pîslă"],"pdf_url":"https://arxiv.org/pdf/2501.03819v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19153v3","updated":"2025-01-07T13:41:26Z","published":"2024-12-26T10:17:21Z","title":"Sketch-MoMa: Teleoperation for Mobile Manipulator via Interpretation of\n Hand-Drawn Sketches","summary":" To use assistive robots in everyday life, a remote control system with common\ndevices, such as 2D devices, is helpful to control the robots anytime and\nanywhere as intended. Hand-drawn sketches are one of the intuitive ways to\ncontrol robots with 2D devices. However, since similar sketches have different\nintentions from scene to scene, existing work needs additional modalities to\nset the sketches' semantics. This requires complex operations for users and\nleads to decreasing usability. In this paper, we propose Sketch-MoMa, a\nteleoperation system using the user-given hand-drawn sketches as instructions\nto control a robot. We use Vision-Language Models (VLMs) to understand the\nuser-given sketches superimposed on an observation image and infer drawn shapes\nand low-level tasks of the robot. We utilize the sketches and the generated\nshapes for recognition and motion planning of the generated low-level tasks for\nprecise and intuitive operations. We validate our approach using\nstate-of-the-art VLMs with 7 tasks and 5 sketch shapes. 
We also demonstrate\nthat our approach effectively specifies the detailed motions, such as how to\ngrasp and how much to rotate. Moreover, we show the competitive usability of\nour approach compared with the existing 2D interface through a user experiment\nwith 14 participants.\n","authors":["Kosei Tanada","Yuka Iwanaga","Masayoshi Tsuchinaga","Yuji Nakamura","Takemitsu Mori","Remi Sakai","Takashi Yamamoto"],"pdf_url":"https://arxiv.org/pdf/2412.19153v3.pdf","comment":"This work has been submitted to the IEEE for possible publication.\n Project Page: https://toyotafrc.github.io/SketchMoMa-Proj"},{"id":"http://arxiv.org/abs/2501.03763v1","updated":"2025-01-07T13:04:39Z","published":"2025-01-07T13:04:39Z","title":"3D Printable Gradient Lattice Design for Multi-Stiffness Robotic Fingers","summary":" Human fingers achieve exceptional dexterity and adaptability by combining\nstructures with varying stiffness levels, from soft tissues (low) to tendons\nand cartilage (medium) to bones (high). This paper explores developing a\nrobotic finger with similar multi-stiffness characteristics. Specifically, we\npropose using a lattice configuration, parameterized by voxel size and unit\ncell geometry, to optimize and achieve fine-tuned stiffness properties with\nhigh granularity. A significant advantage of this approach is the feasibility\nof 3D printing the designs in a single process, eliminating the need for manual\nassembly of elements with differing stiffness. Based on this method, we present\na novel, human-like finger, and a soft gripper. We integrate the latter with a\nrigid manipulator and demonstrate the effectiveness in pick and place tasks.\n","authors":["Siebe J. 
Schouten","Tomas Steenman","Rens File","Merlijn Den Hartog","Aimee Sakes","Cosimo Della Santina","Kirsten Lussenburg","Ebrahim Shahabi"],"pdf_url":"https://arxiv.org/pdf/2501.03763v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.12761v2","updated":"2025-01-07T11:18:08Z","published":"2024-03-19T14:27:31Z","title":"BTGenBot: Behavior Tree Generation for Robotic Tasks with Lightweight\n LLMs","summary":" This paper presents a novel approach to generating behavior trees for robots\nusing lightweight large language models (LLMs) with a maximum of 7 billion\nparameters. The study demonstrates that it is possible to achieve satisfactory\nresults with compact LLMs when fine-tuned on a specific dataset. The key\ncontributions of this research include the creation of a fine-tuning dataset\nbased on existing behavior trees using GPT-3.5 and a comprehensive comparison\nof multiple LLMs (namely llama2, llama-chat, and code-llama) across nine\ndistinct tasks. To be thorough, we evaluated the generated behavior trees using\nstatic syntactical analysis, a validation system, a simulated environment, and\na real robot. Furthermore, this work opens the possibility of deploying such\nsolutions directly on the robot, enhancing its practical applicability.\nFindings from this study demonstrate the potential of LLMs with a limited\nnumber of parameters in generating effective and efficient robot behaviors.\n","authors":["Riccardo Andrea Izzo","Gianluca Bardaro","Matteo Matteucci"],"pdf_url":"https://arxiv.org/pdf/2403.12761v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03666v1","updated":"2025-01-07T10:06:59Z","published":"2025-01-07T10:06:59Z","title":"Hybrid Machine Learning Model with a Constrained Action Space for\n Trajectory Prediction","summary":" Trajectory prediction is crucial to advancing autonomous driving and\nimproving safety and efficiency. 
Although end-to-end models based on deep learning have\ngreat potential, they often do not consider vehicle dynamic limitations,\nleading to unrealistic predictions. To address this problem, this work\nintroduces a novel hybrid model that combines deep learning with a kinematic\nmotion model. It is able to predict object attributes such as acceleration and\nyaw rate and generate trajectories based on them. A key contribution is the\nincorporation of expert knowledge into the learning objective of the deep\nlearning model. This constrains the available action space,\nthus enabling the prediction of physically feasible object attributes and\ntrajectories, thereby increasing safety and robustness. The proposed hybrid\nmodel facilitates enhanced interpretability, thereby reinforcing the\ntrustworthiness of deep learning methods and promoting the development of safe\nplanning solutions. Experiments conducted on the publicly available real-world\nArgoverse dataset demonstrate realistic driving behaviour, with benchmark\ncomparisons and ablation studies showing promising results.\n","authors":["Alexander Fertig","Lakshman Balasubramanian","Michael Botsch"],"pdf_url":"https://arxiv.org/pdf/2501.03666v1.pdf","comment":"Submitted to 2025 IEEE Intelligent Vehicles Symposium (IV)"},{"id":"http://arxiv.org/abs/2501.03606v1","updated":"2025-01-07T08:14:53Z","published":"2025-01-07T08:14:53Z","title":"VTAO-BiManip: Masked Visual-Tactile-Action Pre-training with Object\n Understanding for Bimanual Dexterous Manipulation","summary":" Bimanual dexterous manipulation remains a significant challenge in robotics\ndue to the high DoFs of each hand and their coordination. 
Existing single-hand\nmanipulation techniques often leverage human demonstrations to guide RL methods\nbut fail to generalize to complex bimanual tasks involving multiple sub-skills.\nIn this paper, we introduce VTAO-BiManip, a novel framework that combines\nvisual-tactile-action pretraining with object understanding to facilitate\ncurriculum RL to enable human-like bimanual manipulation. We improve prior\nlearning by incorporating hand motion data, providing more effective guidance\nfor dual-hand coordination than binary tactile feedback. Our pretraining model\npredicts future actions as well as object pose and size using masked multimodal\ninputs, facilitating cross-modal regularization. To address the multi-skill\nlearning challenge, we introduce a two-stage curriculum RL approach to\nstabilize training. We evaluate our method on a bottle-cap unscrewing task,\ndemonstrating its effectiveness in both simulated and real-world environments.\nOur approach achieves a success rate that surpasses existing visual-tactile\npretraining methods by over 20%.\n","authors":["Zhengnan Sun","Zhaotai Shi","Jiayin Chen","Qingtao Liu","Yu Cui","Qi Ye","Jiming Chen"],"pdf_url":"https://arxiv.org/pdf/2501.03606v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03585v1","updated":"2025-01-07T07:19:30Z","published":"2025-01-07T07:19:30Z","title":"Collision Risk Quantification and Conflict Resolution in Trajectory\n Tracking for Acceleration-Actuated Multi-Robot Systems","summary":" One of the pivotal challenges in a multi-robot system is how to give\nattention to accuracy and efficiency while ensuring safety. Prior arts cannot\nstrictly guarantee collision-free for an arbitrarily large number of robots or\nthe results are considerably conservative. Smoothness of the avoidance\ntrajectory also needs to be further optimized. 
This paper proposes an\nacceleration-actuated simultaneous obstacle avoidance and trajectory tracking\nmethod for arbitrarily large teams of robots that provides a nonconservative\ncollision avoidance strategy and gives approaches for deadlock avoidance. We\npropose two ways of deadlock resolution; one involves incorporating an\nauxiliary velocity vector into the error function of the trajectory tracking\nmodule, which is proven to have no influence on global convergence of the\ntracking error. Furthermore, unlike traditional methods that address\nconflicts after a deadlock occurs, our decision-making mechanism avoids\nnear-zero velocities, which is much safer and more efficient in crowded\nenvironments. Extensive comparisons show that the proposed method is superior to\nexisting studies when deployed in a large-scale robot system, with minimal\ninvasiveness.\n","authors":["Xiaoxiao Li","Zhirui Sun","Mansha Zheng","Hongpeng Wang","Shuai Li","Jiankun Wang"],"pdf_url":"https://arxiv.org/pdf/2501.03585v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03575v1","updated":"2025-01-07T06:55:50Z","published":"2025-01-07T06:55:50Z","title":"Cosmos World Foundation Model Platform for Physical AI","summary":" Physical AI needs to be trained digitally first. It needs a digital twin of\nitself, the policy model, and a digital twin of the world, the world model. In\nthis paper, we present the Cosmos World Foundation Model Platform to help\ndevelopers build customized world models for their Physical AI setups. We\nposition a world foundation model as a general-purpose world model that can be\nfine-tuned into customized world models for downstream applications. Our\nplatform covers a video curation pipeline, pre-trained world foundation models,\nexamples of post-training of pre-trained world foundation models, and video\ntokenizers. 
To help Physical AI builders solve the most critical problems of\nour society, we make our platform open-source and our models open-weight with\npermissive licenses available via https://github.com/NVIDIA/Cosmos.\n","authors":[" NVIDIA"," :","Niket Agarwal","Arslan Ali","Maciej Bala","Yogesh Balaji","Erik Barker","Tiffany Cai","Prithvijit Chattopadhyay","Yongxin Chen","Yin Cui","Yifan Ding","Daniel Dworakowski","Jiaojiao Fan","Michele Fenzi","Francesco Ferroni","Sanja Fidler","Dieter Fox","Songwei Ge","Yunhao Ge","Jinwei Gu","Siddharth Gururani","Ethan He","Jiahui Huang","Jacob Huffman","Pooya Jannaty","Jingyi Jin","Seung Wook Kim","Gergely Klár","Grace Lam","Shiyi Lan","Laura Leal-Taixe","Anqi Li","Zhaoshuo Li","Chen-Hsuan Lin","Tsung-Yi Lin","Huan Ling","Ming-Yu Liu","Xian Liu","Alice Luo","Qianli Ma","Hanzi Mao","Kaichun Mo","Arsalan Mousavian","Seungjun Nah","Sriharsha Niverty","David Page","Despoina Paschalidou","Zeeshan Patel","Lindsey Pavao","Morteza Ramezanali","Fitsum Reda","Xiaowei Ren","Vasanth Rao Naik Sabavat","Ed Schmerling","Stella Shi","Bartosz Stefaniak","Shitao Tang","Lyne Tchapmi","Przemek Tredak","Wei-Cheng Tseng","Jibin Varghese","Hao Wang","Haoxiang Wang","Heng Wang","Ting-Chun Wang","Fangyin Wei","Xinyue Wei","Jay Zhangjie Wu","Jiashu Xu","Wei Yang","Lin Yen-Chen","Xiaohui Zeng","Yu Zeng","Jing Zhang","Qinsheng Zhang","Yuxuan Zhang","Qingqing Zhao","Artur Zolkowski"],"pdf_url":"https://arxiv.org/pdf/2501.03575v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03535v1","updated":"2025-01-07T05:15:46Z","published":"2025-01-07T05:15:46Z","title":"SenseRAG: Constructing Environmental Knowledge Bases with Proactive\n Querying for LLM-Based Autonomous Driving","summary":" This study addresses the critical need for enhanced situational awareness in\nautonomous driving (AD) by leveraging the contextual reasoning capabilities of\nlarge language models (LLMs). 
Unlike traditional perception systems that rely\non rigid, label-based annotations, it integrates real-time, multimodal sensor\ndata into a unified, LLMs-readable knowledge base, enabling LLMs to dynamically\nunderstand and respond to complex driving environments. To overcome the\ninherent latency and modality limitations of LLMs, a proactive\nRetrieval-Augmented Generation (RAG) is designed for AD, combined with a\nchain-of-thought prompting mechanism, ensuring rapid and context-rich\nunderstanding. Experimental results using real-world Vehicle-to-everything\n(V2X) datasets demonstrate significant improvements in perception and\nprediction performance, highlighting the potential of this framework to enhance\nsafety, adaptability, and decision-making in next-generation AD systems.\n","authors":["Xuewen Luo","Fan Ding","Fengze Yang","Yang Zhou","Junnyong Loo","Hwa Hui Tew","Chenxi Liu"],"pdf_url":"https://arxiv.org/pdf/2501.03535v1.pdf","comment":"This paper has been accepted for presentation at WACV Workshop LLMAD\n 2025"},{"id":"http://arxiv.org/abs/2401.06949v2","updated":"2025-01-07T05:00:50Z","published":"2024-01-13T02:03:28Z","title":"ORGANA: A Robotic Assistant for Automated Chemistry Experimentation and\n Characterization","summary":" Chemistry experiments can be resource- and labor-intensive, often requiring\nmanual tasks like polishing electrodes in electrochemistry. Traditional lab\nautomation infrastructure faces challenges adapting to new experiments. To\naddress this, we introduce ORGANA, an assistive robotic system that automates\ndiverse chemistry experiments using decision-making and perception tools. It\nmakes decisions with chemists in the loop to control robots and lab devices.\nORGANA interacts with chemists using Large Language Models (LLMs) to derive\nexperiment goals, handle disambiguation, and provide experiment logs. ORGANA\nplans and executes complex tasks with visual feedback, while supporting\nscheduling and parallel task execution. 
We demonstrate ORGANA's capabilities in\nsolubility, pH measurement, recrystallization, and electrochemistry\nexperiments. In electrochemistry, it executes a 19-step plan in parallel to\ncharacterize quinone derivatives for flow batteries. Our user study shows\nORGANA reduces frustration and physical demand by over 50%, with users saving\nan average of 80.3% of their time when using it.\n","authors":["Kourosh Darvish","Marta Skreta","Yuchi Zhao","Naruki Yoshikawa","Sagnik Som","Miroslav Bogdanovic","Yang Cao","Han Hao","Haoping Xu","Alán Aspuru-Guzik","Animesh Garg","Florian Shkurti"],"pdf_url":"https://arxiv.org/pdf/2401.06949v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03515v1","updated":"2025-01-07T04:17:15Z","published":"2025-01-07T04:17:15Z","title":"Effects of Robot Competency and Motion Legibility on Human Correction\n Feedback","summary":" As robot deployments become more commonplace, people are likely to take on\nthe role of supervising robots (i.e., correcting their mistakes) rather than\ndirectly teaching them. Prior works on Learning from Corrections (LfC) have\nrelied on three key assumptions to interpret human feedback: (1) people correct\nthe robot only when there is significant task objective divergence; (2) people\ncan accurately predict if a correction is necessary; and (3) people trade off\nprecision and physical effort when giving corrections. In this work, we study\nhow two key factors (robot competency and motion legibility) affect how people\nprovide correction feedback and their implications on these existing\nassumptions. We conduct a user study ($N=60$) under an LfC setting where\nparticipants supervise and correct a robot performing pick-and-place tasks. We\nfind that people are more sensitive to suboptimal behavior by a highly\ncompetent robot compared to an incompetent robot when the motions are legible\n($p=0.0015$) and predictable ($p=0.0055$). 
In addition, people also tend to\nwithhold necessary corrections ($p < 0.0001$) when supervising an incompetent\nrobot and are more prone to offering unnecessary ones ($p = 0.0171$) when\nsupervising a highly competent robot. We also find that physical effort\npositively correlates with correction precision, providing empirical evidence\nto support this common assumption. We also find that this correlation is\nsignificantly weaker for an incompetent robot with legible motions than an\nincompetent robot with predictable motions ($p = 0.0075$). Our findings offer\ninsights for accounting for competency and legibility when designing robot\ninteraction behaviors and learning task objectives from corrections.\n","authors":["Shuangge Wang","Anjiabei Wang","Sofiya Goncharova","Brian Scassellati","Tesca Fitzgerald"],"pdf_url":"https://arxiv.org/pdf/2501.03515v1.pdf","comment":"to be published in the 2025 ACM/IEEE International Conference on\n Human-Robot Interaction (HRI)"},{"id":"http://arxiv.org/abs/2501.03467v1","updated":"2025-01-07T01:51:12Z","published":"2025-01-07T01:51:12Z","title":"FRESHR-GSI: A Generalized Safety Model and Evaluation Framework for\n Mobile Robots in Multi-Human Environments","summary":" Human safety is critical in applications involving close human-robot\ninteractions (HRI) and is a key aspect of physical compatibility between humans\nand robots. While measures of human safety in HRI exist, these mainly target\nindustrial settings involving robotic manipulators. Less attention has been\npaid to settings where mobile robots and humans share the space. This paper\nintroduces a new robot-centered directional framework of human safety. It is\nparticularly useful for evaluating mobile robots as they operate in\nenvironments populated by multiple humans. The framework integrates several key\nmetrics, such as each human's relative distance, speed, and orientation. 
The\ncore novelty lies in the framework's flexibility to accommodate different\napplication requirements while allowing for both the robot-centered and\nexternal observer points of view. We instantiate the framework by using RGB-D\nbased vision integrated with a deep learning-based human detection pipeline to\nyield a generalized safety index (GSI) that instantaneously assesses human\nsafety. We evaluate GSI's capability of producing appropriate, robust, and\nfine-grained safety measures in real-world experimental scenarios and compare\nits performance with extant safety models.\n","authors":["Pranav Pandey","Ramviyas Parasuraman","Prashant Doshi"],"pdf_url":"https://arxiv.org/pdf/2501.03467v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.04264v2","updated":"2025-01-07T22:42:46Z","published":"2024-11-06T21:17:06Z","title":"MonoRollBot: 3-DOF Spherical Robot with Underactuated Single Compliant\n Actuator Design","summary":" Spherical rolling robots have garnered significant attention in the field of\nmobile robotics for applications such as inspection and space exploration.\nDesigning underactuated rolling robots poses challenges in achieving\nmulti-directional propulsion with high degrees of freedom while utilizing a\nlimited number of actuators. This paper presents the MonoRollBot, a novel\n3-degree-of-freedom (DOF) spherical robot that utilizes an underactuated\nmechanism driven by only a single spring-motor system. Unlike conventional\nspherical robots, MonoRollBot employs a minimalist actuation approach, relying\non only one motor and a passive spring to control its locomotion. The robot\nachieves 3-DOF motion through an innovative coupling of spring dynamics and\nmotor control. In this work, we detail the design of the MonoRollBot and\nevaluate its motion capabilities through design studies. 
We also study\nits locomotion behaviours under changes in rotating mass and stiffness\nproperties.\n","authors":["Zhiwei Liu","Seyed Amir Tafrishi"],"pdf_url":"https://arxiv.org/pdf/2411.04264v2.pdf","comment":"6 pages, 11 figures, accepted at IEEE RoboSoft 2025"},{"id":"http://arxiv.org/abs/2501.04170v1","updated":"2025-01-07T22:40:37Z","published":"2025-01-07T22:40:37Z","title":"A Bayesian Modeling Framework for Estimation and Ground Segmentation of\n Cluttered Staircases","summary":" Autonomous robot navigation in complex environments requires robust\nperception as well as high-level scene understanding due to perceptual\nchallenges, such as occlusions, and uncertainty introduced by robot movement.\nFor example, a robot climbing a cluttered staircase can misinterpret clutter as\na step, misrepresenting the state and compromising safety. This requires robust\nstate estimation methods capable of inferring the underlying structure of the\nenvironment even from incomplete sensor data. In this paper, we introduce a\nnovel method for robust state estimation of staircases. To address the\nchallenge of perceiving occluded staircases extending beyond the robot's\nfield-of-view, our approach combines an infinite-width staircase representation\nwith a finite endpoint state to capture the overall staircase structure. This\nrepresentation is integrated into a Bayesian inference framework to fuse noisy\nmeasurements enabling accurate estimation of staircase location even with\npartial observations and occlusions. Additionally, we present a segmentation\nalgorithm that works in conjunction with the staircase estimation pipeline to\naccurately identify clutter-free regions on a staircase. 
Our method is\nextensively evaluated on a real robot across diverse staircases, demonstrating\nsignificant improvements in estimation accuracy and segmentation performance\ncompared to baseline approaches.\n","authors":["Prasanna Sriganesh","Burhanuddin Shirose","Matthew Travers"],"pdf_url":"https://arxiv.org/pdf/2501.04170v1.pdf","comment":"This work has been submitted to the IEEE for possible publication"},{"id":"http://arxiv.org/abs/2501.04169v1","updated":"2025-01-07T22:33:47Z","published":"2025-01-07T22:33:47Z","title":"Learning to Transfer Human Hand Skills for Robot Manipulations","summary":" We present a method for teaching dexterous manipulation tasks to robots from\nhuman hand motion demonstrations. Unlike existing approaches that solely rely\non kinematics information without taking into account the plausibility of robot\nand object interaction, our method directly infers plausible robot manipulation\nactions from human motion demonstrations. To address the embodiment gap between\nthe human hand and the robot system, our approach learns a joint motion\nmanifold that maps human hand movements, robot hand actions, and object\nmovements in 3D, enabling us to infer one motion component from others. Our key\nidea is the generation of pseudo-supervision triplets, which pair human,\nobject, and robot motion trajectories synthetically. Through real-world\nexperiments with robot hand manipulation, we demonstrate that our data-driven\nretargeting method significantly outperforms conventional retargeting\ntechniques, effectively bridging the embodiment gap between human and robotic\nhands. Website at https://rureadyo.github.io/MocapRobot/.\n","authors":["Sungjae Park","Seungho Lee","Mingi Choi","Jiye Lee","Jeonghwan Kim","Jisoo Kim","Hanbyul Joo"],"pdf_url":"https://arxiv.org/pdf/2501.04169v1.pdf","comment":"Preprint. 
Under Review"},{"id":"http://arxiv.org/abs/2401.14554v2","updated":"2025-01-07T20:44:10Z","published":"2024-01-25T22:49:13Z","title":"GCBF+: A Neural Graph Control Barrier Function Framework for Distributed\n Safe Multi-Agent Control","summary":" Distributed, scalable, and safe control of large-scale multi-agent systems is\na challenging problem. In this paper, we design a distributed framework for\nsafe multi-agent control in large-scale environments with obstacles, where a\nlarge number of agents are required to maintain safety using only local\ninformation and reach their goal locations. We introduce a new class of\ncertificates, termed graph control barrier function (GCBF), which are based on\nthe well-established control barrier function theory for safety guarantees and\nutilize a graph structure for scalable and generalizable distributed control of\nMAS. We develop a novel theoretical framework to prove the safety of an\narbitrary-sized MAS with a single GCBF. We propose a new training framework\nGCBF+ that uses graph neural networks to parameterize a candidate GCBF and a\ndistributed control policy. The proposed framework is distributed and is\ncapable of taking point clouds from LiDAR, instead of actual state information,\nfor real-world robotic applications. We illustrate the efficacy of the proposed\nmethod through various hardware experiments on a swarm of drones with\nobjectives ranging from exchanging positions to docking on a moving target\nwithout collision. Additionally, we perform extensive numerical experiments,\nwhere the number and density of agents, as well as the number of obstacles,\nincrease. Empirical results show that in complex environments with agents with\nnonlinear dynamics (e.g., Crazyflie drones), GCBF+ outperforms the hand-crafted\nCBF-based method with the best performance by up to 20% for relatively\nsmall-scale MAS with up to 256 agents, and leading reinforcement learning (RL)\nmethods by up to 40% for MAS with 1024 agents. 
Furthermore, the proposed method\ndoes not compromise on performance, in terms of goal reaching, for\nachieving high safety rates, which is a common trade-off in RL-based methods.\n","authors":["Songyuan Zhang","Oswin So","Kunal Garg","Chuchu Fan"],"pdf_url":"https://arxiv.org/pdf/2401.14554v2.pdf","comment":"20 pages, 15 figures; Accepted by IEEE Transactions on Robotics\n (T-RO)"},{"id":"http://arxiv.org/abs/2407.08213v2","updated":"2025-01-07T20:08:13Z","published":"2024-07-11T06:30:46Z","title":"PrefCLM: Enhancing Preference-based Reinforcement Learning with\n Crowdsourced Large Language Models","summary":" Preference-based reinforcement learning (PbRL) is emerging as a promising\napproach to teaching robots through human comparative feedback, sidestepping\nthe need for complex reward engineering. However, the substantial volume of\nfeedback required in existing PbRL methods often leads to reliance on synthetic\nfeedback generated by scripted teachers. This approach necessitates intricate\nreward engineering again and struggles to adapt to the nuanced preferences\nparticular to human-robot interaction (HRI) scenarios, where users may have\nunique expectations toward the same task. To address these challenges, we\nintroduce PrefCLM, a novel framework that utilizes crowdsourced large language\nmodels (LLMs) as simulated teachers in PbRL. We utilize Dempster-Shafer Theory\nto fuse individual preferences from multiple LLM agents at the score level,\nefficiently leveraging their diversity and collective intelligence. We also\nintroduce a human-in-the-loop pipeline that facilitates collective refinements\nbased on user interactive feedback. Experimental results across various general\nRL tasks show that PrefCLM achieves competitive performance compared to\ntraditional scripted teachers and excels in facilitating more natural and\nefficient behaviors. 
A real-world user study (N=10) further demonstrates its\ncapability to tailor robot behaviors to individual user preferences,\nsignificantly enhancing user satisfaction in HRI scenarios.\n","authors":["Ruiqi Wang","Dezhong Zhao","Ziqin Yuan","Ike Obi","Byung-Cheol Min"],"pdf_url":"https://arxiv.org/pdf/2407.08213v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.08637v2","updated":"2025-01-07T19:26:17Z","published":"2024-06-12T20:50:26Z","title":"A Game Between Two Identical Dubins Cars: Evading a Conic Sensor in\n Minimum Time","summary":" A fundamental task in mobile robotics is keeping an intelligent agent under\nsurveillance with an autonomous robot as it travels in the environment. This\nwork studies a theoretical version of that problem involving one of the most\npopular vehicle platforms in robotics. In particular, we consider two identical\nDubins cars moving on a plane without obstacles. One of them plays as the\npursuer, and it is equipped with a limited field-of-view detection region\nmodeled as a semi-infinite cone with its apex at the pursuer's position. The\npursuer aims to maintain the other Dubins car, which plays as the evader, as\nmuch time as possible inside its detection region. On the contrary, the evader\nwants to escape as soon as possible. In this work, employing differential game\ntheory, we find the time-optimal motion strategies near the game's end. The\nanalysis of those trajectories reveals the existence of at least two singular\nsurfaces: a Transition Surface (also known as a Switch Surface) and an Evader's\nUniversal Surface. 
We also found that the barrier's standard construction\nproduces a surface that partially lies outside the playing space.\n","authors":["Ubaldo Ruiz"],"pdf_url":"https://arxiv.org/pdf/2406.08637v2.pdf","comment":"35 pages, 16 figures"},{"id":"http://arxiv.org/abs/2501.05478v1","updated":"2025-01-07T16:01:25Z","published":"2025-01-07T16:01:25Z","title":"Language and Planning in Robotic Navigation: A Multilingual Evaluation\n of State-of-the-Art Models","summary":" Large Language Models (LLMs) such as GPT-4, trained on huge amounts of\ndata spanning multiple domains, exhibit significant reasoning,\nunderstanding, and planning capabilities across various tasks. This study\npresents the first-ever work in Arabic language integration within the\nVision-and-Language Navigation (VLN) domain in robotics, an area that has been\nnotably underexplored in existing research. We perform a comprehensive\nevaluation of state-of-the-art multi-lingual Small Language Models (SLMs),\nincluding GPT-4o mini, Llama 3 8B, and Phi-3 medium 14B, alongside the\nArabic-centric LLM, Jais. Our approach utilizes the NavGPT framework, a pure\nLLM-based instruction-following navigation agent, to assess the impact of\nlanguage on navigation reasoning through zero-shot sequential action prediction\nusing the R2R dataset. Through comprehensive experiments, we demonstrate that\nour framework is capable of high-level planning for navigation tasks when\nprovided with instructions in both English and Arabic. 
However, certain models\nstruggled with reasoning and planning in the Arabic language due to inherent\nlimitations in their capabilities, sub-optimal performance, and parsing issues.\nThese findings highlight the importance of enhancing planning and reasoning\ncapabilities in language models for effective navigation, emphasizing this as a\nkey area for further development while also unlocking the potential of\nArabic-language models for impactful real-world applications.\n","authors":["Malak Mansour","Ahmed Aly","Bahey Tharwat","Sarim Hashmi","Dong An","Ian Reid"],"pdf_url":"https://arxiv.org/pdf/2501.05478v1.pdf","comment":null}],"Systems and Control":[{"id":"http://arxiv.org/abs/2408.13510v2","updated":"2025-01-07T18:16:17Z","published":"2024-08-24T08:12:22Z","title":"Intelligent Router for LLM Workloads: Improving Performance Through\n Workload-Aware Load Balancing","summary":" Large Language Model (LLM) workloads have distinct prefill and decode phases\nwith different compute and memory requirements which should ideally be\naccounted for when scheduling input queries across different LLM instances in a\ncluster. However existing scheduling algorithms treat LLM workloads as\nmonolithic jobs without considering the distinct characteristics of the two\nphases in each workload. This leads to sub-optimal scheduling and increased\nresponse latency. In this work, we start by characterizing factors affecting\nthe response latency during LLM inference serving. We establish that better\nload balancing of inference requests across the available LLM instances can\nimprove the end-to-end latency to a larger extent than merely focusing on\noptimizing the instance-level scheduler. Motivated by our findings, we propose\na heuristic-guided reinforcement learning-based intelligent router for\ndata-driven and workload-aware scheduling. 
Our router schedules queries across\nLLM instances by leveraging a trainable response-length predictor, and a novel\nformulation for estimating the impact of mixing different workloads and\nachieves over 11% lower end-to-end latency than existing approaches on a mix of\npublic datasets and 7.8% lower end-to-end latency on real workload data with\ndiverse input and output trends from Cloud Provider X. Additionally, the\nproposed framework can also serve as a standard for benchmarking different LLM\ninference schedulers since it provides the best latency for a given model,\nhardware, and instance-level scheduler combination.\n","authors":["Kunal Jain","Anjaly Parayil","Ankur Mallick","Esha Choukse","Xiaoting Qin","Jue Zhang","Íñigo Goiri","Rujia Wang","Chetan Bansal","Victor Rühle","Anoop Kulkarni","Steve Kofsky","Saravan Rajmohan"],"pdf_url":"https://arxiv.org/pdf/2408.13510v2.pdf","comment":"16 pages, 10 figures"},{"id":"http://arxiv.org/abs/2307.12235v3","updated":"2025-01-07T17:46:06Z","published":"2023-07-23T05:58:04Z","title":"Optimal Time-Invariant Distributed Formation Tracking for Second-Order\n Multi-Agent Systems","summary":" This paper addresses the optimal time-invariant formation tracking problem\nwith the aim of providing a distributed solution for multi-agent systems with\nsecond-order integrator dynamics. In the literature, most of the results\nrelated to multi-agent formation tracking do not consider energy issues while\ninvestigating distributed feedback control laws. In order to account for this\ncrucial design aspect, we contribute by formalizing and proposing a solution to\nan optimization problem that encapsulates trajectory tracking, distance-based\nformation control and input energy minimization, through a specific and key\nchoice of potential functions in the optimization cost. 
To this end, we show\nhow to compute the inverse dynamics in a centralized fashion by means of the\nProjector-Operator-based Newton's method for Trajectory Optimization (PRONTO)\nand, more importantly, we exploit such an offline solution as a general\nreference to devise a stabilizing online distributed control law. Finally,\nnumerical examples involving a cubic formation following a chicane-like path in\nthe 3D space are provided to validate the proposed control strategies.\n","authors":["Marco Fabris","Giulio Fattore","Angelo Cenedese"],"pdf_url":"https://arxiv.org/pdf/2307.12235v3.pdf","comment":"35 pages, 3 figures, accepted on March 27th, 2024 by the European\n Journal of Control (first submission: June 23rd, 2023)"},{"id":"http://arxiv.org/abs/2312.14788v3","updated":"2025-01-07T17:07:18Z","published":"2023-12-22T16:00:42Z","title":"Harnessing Uncertainty for a Separation Principle in Direct Data-Driven\n Predictive Control","summary":" Model Predictive Control (MPC) is a powerful method for complex system\nregulation, but its reliance on an accurate model poses many limitations in\nreal-world applications. Data-driven predictive control (DDPC) aims at\novercoming this limitation, by relying on historical data to provide\ninformation on the plant to be controlled. In this work, we present a unified\nstochastic framework for direct DDPC, where control actions are obtained by\noptimizing the Final Control Error (FCE), which is directly computed from\navailable data only and automatically weighs the impact of uncertainty on the\ncontrol objective. Our framework allows us to establish a separation principle\nfor Predictive Control, elucidating the role that predictive models and their\nuncertainty play in DDPC. Moreover, it generalizes existing DDPC methods, like\nregularized Data-enabled Predictive Control (DeePC) and $\\gamma$-DDPC,\nproviding a path toward noise-tolerant data-based control with rigorous\noptimality guarantees. 
The theoretical investigation is complemented by a\nseries of experiments (code available on GitHub:\nhttps://github.com/marcofabris92/a-separation-principle-in-d3pc), revealing\nthat the proposed method consistently outperforms or, at worst, matches\nexisting techniques without requiring tuning regularization parameters as other\nmethods do.\n","authors":["Alessandro Chiuso","Marco Fabris","Valentina Breschi","Simone Formentin"],"pdf_url":"https://arxiv.org/pdf/2312.14788v3.pdf","comment":"17 pages, 2 figures, 1 table, accepted by Automatica on October 31st,\n 2024 (first submission: December 22nd, 2023)"},{"id":"http://arxiv.org/abs/2501.03894v1","updated":"2025-01-07T16:03:13Z","published":"2025-01-07T16:03:13Z","title":"Robust Moving-horizon Estimation for Nonlinear Systems: From Perfect to\n Imperfect Optimization","summary":" Robust stability of moving-horizon estimators is investigated for nonlinear\ndiscrete-time systems that are detectable in the sense of incremental\ninput/output-to-state stability and are affected by disturbances. The estimate\nof a moving-horizon estimator stems from the on-line solution of a\nleast-squares minimization problem at each time instant. The resulting\nstability guarantees depend on the optimization tolerance in solving such\nminimization problems. Specifically, two main contributions are established:\n(i) the robust stability of the estimation error, while supposing to solve\nexactly the on-line minimization problem; (ii) the practical robust stability\nof the estimation error with state estimates obtained by an imperfect\nminimization. 
Finally, the construction of such robust moving-horizon\nestimators and the performances resulting from the design based on the\ntheoretical findings are showcased with two numerical examples.\n","authors":["Angelo Alessandri"],"pdf_url":"https://arxiv.org/pdf/2501.03894v1.pdf","comment":"18 pages, 2 figures, 24 bibliographic references"},{"id":"http://arxiv.org/abs/2004.00159v3","updated":"2025-01-07T15:36:31Z","published":"2020-03-31T23:20:33Z","title":"Resilient Control of Dynamic Flow Networks Subject to Stochastic\n Cyber-Physical Disruptions","summary":" Modern network systems, such as transportation and communication systems, are\nprone to cyber-physical disruptions and thus suffer efficiency loss. This paper\nstudies network resiliency, in terms of throughput, and develops resilient\ncontrol to improve throughput. We consider single-commodity networks that admit\ncongestion propagation. We also apply a Markov process to model disruption\nswitches. For throughput analysis, we first use insights into congestion\nspillback to propose novel Lyapunov functions and then exploit monotone network\ndynamics to reduce computational costs of verifying stability conditions. For\ncontrol design, we show that (i) for a network with infinite link storage\nspace, there exists an open-loop control that attains the min-expected-cut\ncapacity; (ii) for a network with observable disruptions that restrict maximum\nsending and/or receiving flows, there exists a mode-dependent control that\nattains the expected-min-cut capacity; (iii) for general networks, there exists\na closed-loop control with throughput guarantees. 
We also derive lower bounds\nof resiliency scores for a set of numerical examples and verify resiliency\nimprovement with our method.\n","authors":["Yu Tang","Li Jin"],"pdf_url":"https://arxiv.org/pdf/2004.00159v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03191v2","updated":"2025-01-07T15:04:47Z","published":"2024-12-04T10:23:27Z","title":"Soft Adaptive Feet for Legged Robots: An Open-Source Model for\n Locomotion Simulation","summary":" In recent years, artificial feet based on soft robotics and under-actuation\nprinciples emerged to improve mobility on challenging terrains. This paper\npresents the application of the MuJoCo physics engine to realize a digital twin\nof an adaptive soft foot developed for use with legged robots. We release the\nMuJoCo soft foot digital twin as open source to allow users and researchers to\nexplore new approaches to locomotion. The work includes the system modeling\ntechniques along with the kinematic and dynamic attributes involved. Validation\nis conducted through a rigorous comparison with bench tests on a physical\nprototype, replicating these experiments in simulation. Results are evaluated\nbased on sole deformation and contact forces during foot-obstacle interaction.\nThe foot model is subsequently integrated into simulations of the humanoid\nrobot COMAN+, replacing its original flat feet. Results show an improvement in\nthe robot's ability to negotiate small obstacles without altering its control\nstrategy. Ultimately, this study offers a comprehensive modeling approach for\nadaptive soft feet, supported by qualitative comparisons of bipedal locomotion\nwith state of the art robotic feet.\n","authors":["Matteo Crotti","Luca Rossini","Balint K. Hodossy","Anna Pace","Giorgio Grioli","Antonio Bicchi","Manuel G. 
Catalano"],"pdf_url":"https://arxiv.org/pdf/2412.03191v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.02893v2","updated":"2025-01-07T13:21:10Z","published":"2025-01-06T10:15:21Z","title":"A Volumetric Approach to Privacy of Dynamical Systems","summary":" Information-theoretic metrics, such as mutual information, have been widely\nused to evaluate privacy leakage in dynamic systems. However, these approaches\nare typically limited to stochastic systems and face computational challenges.\nIn this paper, we introduce a novel volumetric framework for analyzing privacy\nin systems affected by unknown but bounded noise. Our model considers a dynamic\nsystem comprising public and private states, where an observation set of the\npublic state is released. An adversary utilizes the observed public state to\ninfer an uncertainty set of the private state, referred to as the inference\nattack. We define the evolution dynamics of these inference attacks and\nquantify the privacy level of the private state using the volume of its\nuncertainty sets. For linear scalar systems, we derive an explicit formulation\nof the uncertainty set. For multi-dimensional linear systems, we develop an\napproximate computation method leveraging interval analysis. We investigate the\nproperties of the proposed volumetric privacy measure and demonstrate that it\nis bounded by the information gain derived from the observation set.\nFurthermore, we propose an optimization approach to designing privacy filter\nusing randomization and linear programming based on the proposed privacy\nmeasure. 
The effectiveness of the optimal privacy filter design is evaluated\nthrough a production-inventory case study, illustrating its robustness against\nthe inference attack.\n","authors":["Chuanghong Weng","Ehsan Nekouei"],"pdf_url":"https://arxiv.org/pdf/2501.02893v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03746v1","updated":"2025-01-07T12:40:11Z","published":"2025-01-07T12:40:11Z","title":"A Multimodal Lightweight Approach to Fault Diagnosis of Induction Motors\n in High-Dimensional Dataset","summary":" An accurate AI-based diagnostic system for induction motors (IMs) holds the\npotential to enhance proactive maintenance, mitigating unplanned downtime and\ncurbing overall maintenance costs within an industrial environment. Notably,\namong the prevalent faults in IMs, a Broken Rotor Bar (BRB) fault is frequently\nencountered. Researchers have proposed various fault diagnosis approaches using\nsignal processing (SP), machine learning (ML), deep learning (DL), and hybrid\narchitectures for BRB faults. One limitation in the existing literature is the\ntraining of these architectures on relatively small datasets, risking\noverfitting when implementing such systems in industrial environments. This\npaper addresses this limitation by implementing large-scale data of BRB faults\nby using a transfer-learning-based lightweight DL model named ShuffleNetV2 for\ndiagnosing one, two, three, and four BRB faults using current and vibration\nsignal data. Spectral images for training and testing are generated using a\nShort-Time Fourier Transform (STFT). The dataset comprises 57,500 images, with\n47,500 used for training and 10,000 for testing. Remarkably, the ShuffleNetV2\nmodel exhibited superior performance, in less computational cost as well as\naccurately classifying 98.856% of spectral images. To further enhance the\nvisualization of harmonic sidebands resulting from broken bars, Fast Fourier\nTransform (FFT) is applied to current and vibration data. 
The paper also\nprovides insights into the training and testing times for each model,\ncontributing to a comprehensive understanding of the proposed fault diagnosis\nmethodology. The findings of our research provide valuable insights into the\nperformance and efficiency of different ML and DL models, offering a foundation\nfor the development of robust fault diagnosis systems for induction motors in\nindustrial settings.\n","authors":["Usman Ali"],"pdf_url":"https://arxiv.org/pdf/2501.03746v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.05022v5","updated":"2025-01-07T11:51:30Z","published":"2024-08-09T12:22:35Z","title":"Robust Backstepping Control of a Quadrotor Unmanned Aerial Vehicle Under\n Colored Noises","summary":" Advances in software and hardware technologies have facilitated the\nproduction of quadrotor unmanned aerial vehicles (UAVs). Nowadays, people\nactively use quadrotor UAVs in essential missions such as search and rescue,\ncounter-terrorism, firefighting, surveillance, and cargo transportation. While\nperforming these tasks, quadrotors must operate in noisy environments.\nTherefore, a robust controller design that can control the altitude and\nattitude of the quadrotor in noisy environments is of great importance. Many\nresearchers have focused only on white Gaussian noise in their studies, whereas\nresearchers need to consider the effects of all colored noises during the\noperation of the quadrotor. This study aims to design a robust controller that\nis resistant to all colored noises. Firstly, a nonlinear quadrotor model was\ncreated with MATLAB. Then, a backstepping controller resistant to colored\nnoises was designed. The designed backstepping controller was tested under\nGaussian white, pink, brown, blue, and purple noises. PID and Lyapunov-based\ncontroller designs were also carried out, and their time responses (rise time,\novershoot, settling time) were compared with those of the backstepping\ncontroller. 
In the simulations, time was in seconds, altitude was in meters,\nand roll, pitch, and yaw references were in radians. Rise and settling time\nvalues were in seconds, and overshoot value was in percent. When the obtained\nvalues are examined, simulations prove that the proposed backstepping\ncontroller has the least overshoot and the shortest settling time under all\nnoise types.\n","authors":["Mehmet Karahan"],"pdf_url":"https://arxiv.org/pdf/2408.05022v5.pdf","comment":"22 pages, 10 figures"},{"id":"http://arxiv.org/abs/2501.03691v1","updated":"2025-01-07T10:43:26Z","published":"2025-01-07T10:43:26Z","title":"Stabilization of Strictly Pre-Dissipative Receding Horizon Linear\n Quadratic Control by Terminal Costs","summary":" Asymptotic stability in receding horizon control is obtained under a strict\npre-dissipativity assumption, in the presence of suitable state constraints. In\nthis paper we analyze how terminal constraints can be replaced by suitable\nterminal costs. We restrict to the linear-quadratic setting as that allows us\nto obtain stronger results, while we analyze the full nonlinear case in a\nseparate contribution.\n","authors":["Mario Zanon","Lars Grüne"],"pdf_url":"https://arxiv.org/pdf/2501.03691v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03671v1","updated":"2025-01-07T10:18:37Z","published":"2025-01-07T10:18:37Z","title":"Imitation Learning of MPC with Neural Networks: Error Guarantees and\n Sparsification","summary":" This paper presents a framework for bounding the approximation error in\nimitation model predictive controllers utilizing neural networks. Leveraging\nthe Lipschitz properties of these neural networks, we derive a bound that\nguides dataset design to ensure the approximation error remains at chosen\nlimits. We discuss how this method can be used to design a stable neural\nnetwork controller with performance guarantees employing existing robust model\npredictive control approaches for data generation. 
Additionally, we introduce a\ntraining adjustment, which is based on the sensitivities of the optimization\nproblem and reduces dataset density requirements based on the derived bounds.\nWe verify that the proposed augmentation results in improvements to the\nnetwork's predictive capabilities and a reduction of the Lipschitz constant.\nMoreover, on a simulated inverted pendulum problem, we show that the approach\nresults in a closer match of the closed-loop behavior between the imitation and\nthe original model predictive controller.\n","authors":["Hendrik Alsmeier","Lukas Theiner","Anton Savchenko","Ali Mesbah","Rolf Findeisen"],"pdf_url":"https://arxiv.org/pdf/2501.03671v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03653v1","updated":"2025-01-07T09:33:50Z","published":"2025-01-07T09:33:50Z","title":"Study of Frictional and Impact Transients in Active-Passive Mechanical\n Pair","summary":" We consider an active-passive mechanical pair in which the relative motion of\nthe latter is constrained by the mechanical impact. The system dynamics is\ndescribed by the previously introduced modeling frameworks of force transition\nand dissipation through the nonlinear Coulomb friction and structural damping,\nthe latter in accord with Hertzian contact theory. The focus of the recent study\nis on combining both interaction mechanisms, and the detailed experimental\nevaluation which discloses validity of the modeling assumptions. Such\nmechanical pair interactions can be found in various mechatronic systems and\nmechanisms, like for example clutches, backlash elements, sliding items on the\nshaking and inclining surfaces, conveyor belts and others. 
This practical study\ndemonstrates and discusses the transients of a vibro-impact dynamics and shows\ntheoretical developments in line with experimental evaluation.\n","authors":["Michael Ruderman","Francesco De Rito"],"pdf_url":"https://arxiv.org/pdf/2501.03653v1.pdf","comment":"4 pages, 6 figures"},{"id":"http://arxiv.org/abs/2501.03628v1","updated":"2025-01-07T08:56:56Z","published":"2025-01-07T08:56:56Z","title":"A Novel Approach to Real-Time Short-Term Traffic Prediction based on\n Distributed Fiber-Optic Sensing and Data Assimilation with a Stochastic\n Cell-Automata Model","summary":" This paper demonstrates real-time short-term traffic flow prediction through\ndistributed fiber-optic sensing (DFOS) and data assimilation with a stochastic\ncell-automata-based traffic model. Traffic congestion on expressways is a\nsevere issue. To alleviate its negative impacts, it is necessary to optimize\ntraffic flow prior to becoming serious congestion. For this purpose, real-time\nshort-term traffic flow prediction is promising. However, conventional traffic\nmonitoring apparatus used in prediction methods faces a technical issue due to\nthe sparsity in traffic flow data. To overcome the issue for realizing\nreal-time traffic prediction, this paper employs DFOS, which enables to obtain\nspatially continuous and real-time traffic flow data along the road without\ndead zones. Using mean velocities derived from DFOS data as a feature\nextraction, this paper proposes a real-time data assimilation method for the\nshort-term prediction. As the theoretical model, the stochastic\nNishinari-Fukui-Schadschneider model is adopted. Future traffic flow is\nsimulated with the optimal values of model parameters estimated from observed\nmean velocities and the initial condition estimated as the latest microscopic\ntraffic state. This concept is validated using two congestion scenarios\nobtained in Japanese expressways. 
The results show that the mean absolute error\nof the predicted mean velocities is 10-15 km/h in the prediction horizon of 30\nminutes. Furthermore, the prediction error in congestion length and travel time\ndecreases by 40-84% depending on congestion scenarios when compared with\nconventional methods with traffic counters. This paper concludes that real-time\ndata assimilation using DFOS enables an accurate short-term traffic prediction.\n","authors":["Yoshiyuki Yajima","Hemant Prasad","Daisuke Ikefuji","Takemasa Suzuki","Shin Tominaga","Hitoshi Sakurai","Manabu Otani"],"pdf_url":"https://arxiv.org/pdf/2501.03628v1.pdf","comment":"22 pages, 11 figures"},{"id":"http://arxiv.org/abs/2501.03608v1","updated":"2025-01-07T08:19:27Z","published":"2025-01-07T08:19:27Z","title":"A 3D Continuous-Space Electromagnetic Channel Model for 6G Tri-Polarized\n Multi-user Communications","summary":" It is envisioned that the sixth generation (6G) and beyond 6G (B6G) wireless\ncommunication networks will enable global coverage in space, air, ground, and\nsea. In this case, both base stations and users can be mobile and will tend to\nmove continuously in three-dimensional (3D) space. Therefore, obtaining channel\nstate information (CSI) in 3D continuous-space is crucial for the design and\nperformance evaluation of future 6G and B6G wireless systems. On the other\nhand, new 6G technologies such as integrated sensing and communications (ISAC)\nwill also require prior knowledge of CSI in 3D continuous-space. In this paper,\na 3D continuous-space electromagnetic channel model is proposed for\ntri-polarized multi-user communications, taking into account scatterers and\nspherical wavefronts. Scattered fields are calculated using the method of\nmoments (MoM) with high accuracy. Spherical wave functions are utilized to\ndecompose the dyadic Green's functions that connect the transmitted source\ncurrents and the received electric fields. 
Simulation results demonstrate that\ntransmit power, apertures, scatterers, and sample intervals have significant\nimpacts on statistical properties and channel capacities, providing insights\ninto the performance of continuous-space electromagnetic channel models and the\ndesign of future wireless systems.\n","authors":["Yue Yang","Cheng-Xiang Wang","Jie Huang","John Thompson","H. Vincent Poor"],"pdf_url":"https://arxiv.org/pdf/2501.03608v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03577v1","updated":"2025-01-07T07:06:46Z","published":"2025-01-07T07:06:46Z","title":"Wireless Channel Measurements and Characterization in Industrial IoT\n Scenarios","summary":" Wireless Fidelity (Wi-Fi) communication technologies hold significant\npotential for realizing the Industrial Internet of Things (IIoT). In this\npaper, both Single-Input Single-Output (SISO) and polarized Multiple-Input\nMultiple-Output (MIMO) channel measurements are conducted in an IIoT scenario\nat the less congested Wi-Fi band, i.e., 5.5~GHz. The purpose is to investigate\nwireless characteristics of communications between access points and terminals\nmounted on automated guided vehicles as well as those surrounding manufacturing\nareas. For SISO channel measurements, statistical properties including the\ndelay Power Spectral Density (PSD), path loss, shadowing fading, delay spread,\nexcess delay, K-factor, and amplitude distribution of small-scale fading are\nanalyzed and compared with those observed in an office scenario. For MIMO\nchannel measurements, results show that there are multiple Dense Multipath\nComponent (DMC) processes in the delay PSD. An estimation algorithm based on\nthe algorithm for a single DMC process is proposed to effectively process the\nmulti-processes data. Moreover, delay, angular, power, and polarization\nproperties of DMCs are investigated and compared with those of specular\nmultipath components. 
Furthermore, effects of DMCs on Singular Values (SVs) and\nchannel capacities are explored. Ignoring DMCs can overestimate SVs and\nunderestimate channel capacities.\n","authors":["Li Zhang","Cheng-Xiang Wang","Zihao Zhou","Yuxiao Li","Jie Huang","Lijian Xin","Chun Pan","Dabo Zheng","Xiping Wu"],"pdf_url":"https://arxiv.org/pdf/2501.03577v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03552v1","updated":"2025-01-07T05:58:07Z","published":"2025-01-07T05:58:07Z","title":"Proxy Control Barrier Functions: Integrating Barrier-Based and\n Lyapunov-Based Safety-Critical Control Design","summary":" This work introduces a novel Proxy Control Barrier Function (PCBF) scheme\nthat integrates barrier-based and Lyapunov-based safety-critical control\nstrategies for strict-feedback systems with potentially unknown dynamics. The\nproposed method employs a modular design procedure, decomposing the original\nsystem into a proxy subsystem and a virtual tracking subsystem that are\ncontrolled by the control barrier function (CBF)-based and Lyapunov-based\ncontrollers, respectively. By integrating these separately designed\ncontrollers, the overall system's safety is ensured. Moreover, a new\nfilter-based disturbance observer is utilized to design a PCBF-based safe\ncontroller for strict-feedback systems subject to mismatched disturbances. This\napproach broadens the class of systems to which CBF-based methods can be\napplied and significantly simplifies CBF construction by requiring only the\nmodel of the proxy subsystem. 
The effectiveness of the proposed method is\ndemonstrated through numerical simulations.\n","authors":["Yujie Wang","Xiangru Xu"],"pdf_url":"https://arxiv.org/pdf/2501.03552v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03543v1","updated":"2025-01-07T05:37:59Z","published":"2025-01-07T05:37:59Z","title":"Distributionally Robust Joint Chance-Constrained Optimal Power Flow\n using Relative Entropy","summary":" Designing robust algorithms for the optimal power flow (OPF) problem is\ncritical for the control of large-scale power systems under uncertainty. The\nchance-constrained OPF (CCOPF) problem provides a natural formulation of the\ntrade-off between the operating cost and the constraint satisfaction rate. In\nthis work, we propose a new data-driven algorithm for the CCOPF problem, based\non distributionally robust optimization (DRO). We show that the\nproposed reformulation of the distributionally robust chance constraints is\nexact, whereas other approaches in the CCOPF literature rely on conservative\napproximations. We establish out-of-sample robustness guarantees for the\ndistributionally robust solution and prove that the solution is the most\nefficient among all approaches enjoying the same guarantees. We apply the\nproposed algorithm to the CCOPF problem and compare the performance of our\napproach with existing methods using simulations on IEEE benchmark power\nsystems.\n","authors":["Eli Brock","Haixiang Zhang","Javad Lavaei","Somayeh Sojoudi"],"pdf_url":"https://arxiv.org/pdf/2501.03543v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03503v1","updated":"2025-01-07T03:46:38Z","published":"2025-01-07T03:46:38Z","title":"Resilient Distributed Control for Uncertain Nonlinear Interconnected\n Systems under Network Anomaly","summary":" We address a distributed adaptive control methodology for nonlinear\ninterconnected systems possibly affected by network anomalies. 
In the framework\nof adaptive approximation, the distributed controller and parameter estimator\nare designed by exploiting a backstepping approach. The stability of the\ndistributed control system under anomalies is analyzed, where both local and\nneighboring anomaly effects are considered. To quantify the resilience of the\ninterconnected system under the action of network anomalies, we derive bounds\non the duration of each anomaly and the resting time between two consecutive\nanomalies. Specifically, when each anomaly duration is smaller than our\ndesigned upper bound, the interconnected system controlled by the distributed\napproximation-based controller remains asymptotically stable. Moreover, if the\nresting time between two consecutive anomalies is larger than the proposed\nbound, then all signals of the control system are guaranteed to be bounded. In\nthe paper, we show that under the action of the proposed distributed adaptive\ncontroller, the interconnected system remains stable in the presence of network\nanomalies, with both the qualitative and quantitative resilient conditions.\nExtensive simulation results show the effectiveness of our theoretical results.\n","authors":["Youqing Wang","Ying Li","Thomas Parisini","Dong Zhao"],"pdf_url":"https://arxiv.org/pdf/2501.03503v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03496v1","updated":"2025-01-07T03:35:14Z","published":"2025-01-07T03:35:14Z","title":"A Unified Attack Detection Strategy for Multi-Agent Systems over\n Transient and Steady Stages","summary":" This paper proposes a unified detection strategy against three kinds of\nattacks for multi-agent systems (MASs) which is applicable to both transient\nand steady stages. For attacks on the communication layer, a watermarking-based\ndetection scheme with Kullback-Leibler (KL) divergence is designed. 
Different\nfrom traditional communication schemes, each agent transmits a message set\ncontaining two state values with different types of watermarking. It is found\nthat the detection performance is determined by the relevant parameters of the\nwatermarking signal. Unlike the existing detection manoeuvres, such a scheme is\ncapable of transient and steady stages. For attacks on the agent layer, a\nconvergence rate related detection approach is put forward. It is shown that\nthe resilience of the considered system is characterized by the coefficient and\noffset of the envelope. For hybrid attacks, based on the above detection\nmechanisms, a general framework resorting to trusted agents is presented, which\nrequires weaker graph conditions and less information transmission. Finally, an\nexample associated with the platooning of connected vehicles is given to\nsupport the theoretical results.\n","authors":["Jinming Gao","Yijing Wang","Wentao Zhang","Rui Zhao","Yang Shi","Zhiqiang Zuo"],"pdf_url":"https://arxiv.org/pdf/2501.03496v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.09061v2","updated":"2025-01-07T01:53:54Z","published":"2024-06-13T12:50:57Z","title":"Joint Observer Gain and Input Design for Asymptotic Active Fault\n Diagnosis","summary":" This paper proposes a joint gain and input design method for observer-based\nasymptotic active fault diagnosis, which is based on a newly-defined notion\nnamed the excluding degree of the origin from a zonotope. Using the excluding\ndegree, a quantitative specification is obtained to characterize the\nperformance of set-based robust fault diagnosis. Furthermore, a single gain\ndesign method and a joint gain and input design method are proposed,\nrespectively. This is the first work to achieve a joint observer gain and input\ndesign for set-based active fault diagnosis. 
Compared with the existing methods\nthat design gains and input separately, the proposed joint gain and input\ndesign method has advantages to exploit the fault diagnosis potential of\nobserver-based schemes. Finally, several examples are used to illustrate the\neffectiveness of the proposed methods.\n","authors":["Feng Xu","Yiming Wan","Ye Wang","Vicenc Puig"],"pdf_url":"https://arxiv.org/pdf/2406.09061v2.pdf","comment":"Provisionally accepted by Automatica as Regular Paper"},{"id":"http://arxiv.org/abs/2501.03465v1","updated":"2025-01-07T01:47:49Z","published":"2025-01-07T01:47:49Z","title":"Extending Internet Access Over LoRa for Internet of Things and Critical\n Applications","summary":" LoRa bridges the gap between remote locations and mainstream networks,\nenabling large-scale Internet of Things (IoT) deployments. Despite the recent\nadvancements around LoRa, Internet access over this technology is still largely\nunexplored. Most existing solutions only handle packets within the local LoRa\nnetwork and do not interact with web applications. This limits the scalability\nand the ability to deliver essential web services in disconnected regions. This\nwork proposes and implements ILoRa to extend the public Internet to\ndisconnected areas for essential service delivery. ILoRa enables accessing\nApplication Programming Interfaces (APIs) and web pages on the Internet over a\nLoRa backbone network. It comprises an ILoRa coordinator code (ICN) and access\npoint nodes (APNs). The ICN interfaces the LoRa network with the public\nInternet and interprets content. The APN tethers a WiFi hotspot to which\ndevices connect and access the web content. This work further proposes data\nhandling methods for ICNs and APNs. An actual hardware-based implementation\nvalidates the proposed system. 
The implementation achieves a throughput of 1.06\nkbps tested for an Internet-based API returning JSON data of 930 B.\nFurthermore, the APN consumed approximately $0.162$A current, and the resource\nutilization on the ICN was minimal.\n","authors":["Atonu Ghosh","Devadeep Misra","Hirdesh Mewada"],"pdf_url":"https://arxiv.org/pdf/2501.03465v1.pdf","comment":"8 pages, 8 figures"},{"id":"http://arxiv.org/abs/2412.00252v2","updated":"2025-01-07T23:33:22Z","published":"2024-11-29T20:48:50Z","title":"Localization Phenomena in Large-Scale Networked Systems: Robustness and\n Fragility of Dynamics","summary":" We study phenomena where some eigenvectors of a graph Laplacian are largely\nconfined in small subsets of the graph. These localization phenomena are\nsimilar to those generally termed Anderson Localization in the Physics\nliterature, and are related to the complexity of the structure of large graphs\nin still unexplored ways. Using spectral perturbation theory and\npseudo-spectrum analysis, we explain how the presence of localized eigenvectors\ngives rise to fragilities (low robustness margins) to unmodeled node or link\ndynamics. Our analysis is demonstrated by examples of networks with relatively\nlow complexity, but with features that appear to induce eigenvector\nlocalization. The implications of this newly-discovered fragility phenomenon\nare briefly discussed.\n","authors":["Poorva Shukla","Bassam Bamieh"],"pdf_url":"https://arxiv.org/pdf/2412.00252v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04160v1","updated":"2025-01-07T22:19:06Z","published":"2025-01-07T22:19:06Z","title":"Collaborative Spacecraft Servicing under Partial Feedback using\n Lyapunov-based Deep Neural Networks","summary":" Multi-agent systems are increasingly applied in space missions, including\ndistributed space systems, resilient constellations, and autonomous rendezvous\nand docking operations. 
A critical emerging application is collaborative\nspacecraft servicing, which encompasses on-orbit maintenance, space debris\nremoval, and swarm-based satellite repositioning. These missions involve\nservicing spacecraft interacting with malfunctioning or defunct spacecraft\nunder challenging conditions, such as limited state information, measurement\ninaccuracies, and erratic target behaviors. Existing approaches often rely on\nassumptions of full state knowledge or single-integrator dynamics, which are\nimpractical for real-world applications involving second-order spacecraft\ndynamics. This work addresses these challenges by developing a distributed\nstate estimation and tracking framework that requires only relative position\nmeasurements and operates under partial state information. A novel\n$\\rho$-filter is introduced to reconstruct unknown states using locally\navailable information, and a Lyapunov-based deep neural network adaptive\ncontroller is developed that adaptively compensates for uncertainties stemming\nfrom unknown spacecraft dynamics. To ensure the collaborative spacecraft\nregulation problem is well-posed, a trackability condition is defined. A\nLyapunov-based stability analysis is provided to ensure exponential convergence\nof errors in state estimation and spacecraft regulation to a neighborhood of\nthe origin under the trackability condition. The developed method eliminates\nthe need for expensive velocity sensors or extensive pre-training, offering a\npractical and robust solution for spacecraft servicing in complex, dynamic\nenvironments.\n","authors":["Cristian F. Nino","Omkar Sudhir Patil","Christopher D. Petersen","Sean Phillips","Warren E. 
Dixon"],"pdf_url":"https://arxiv.org/pdf/2501.04160v1.pdf","comment":"24 pages, 4 Figures, Journal"},{"id":"http://arxiv.org/abs/2501.04120v1","updated":"2025-01-07T20:02:11Z","published":"2025-01-07T20:02:11Z","title":"Bridging Impulse Control of Piecewise Deterministic Markov Processes and\n Markov Decision Processes: Frameworks, Extensions, and Open Challenges","summary":" Control theory plays a pivotal role in understanding and optimizing the\nbehavior of complex dynamical systems across various scientific and engineering\ndisciplines. Two key frameworks that have emerged for modeling and solving\ncontrol problems in stochastic systems are piecewise deterministic Markov\nprocesses (PDMPs) and Markov decision processes (MDPs). Each framework has its\nunique strengths, and their intersection offers promising opportunities for\ntackling a broad class of problems, particularly in the context of impulse\ncontrols and decision-making in complex systems.\n The relationship between PDMPs and MDPs is a natural subject of exploration,\nas embedding impulse control problems for PDMPs into the MDP framework could\nopen new avenues for their analysis and resolution. Specifically, this\nintegration would allow leveraging the computational and theoretical tools\ndeveloped for MDPs to address the challenges inherent in PDMPs. On the other\nhand, PDMPs can offer a versatile and simple paradigm to model continuous time\nproblems that are often described as discrete-time MDPs parametrized by complex\ntransition kernels. This transformation has the potential to bridge the gap\nbetween the two frameworks, enabling solutions to previously intractable\nproblems and expanding the scope of both fields. This paper presents a\ncomprehensive review of two research domains, illustrated through a recurring\nmedical example. 
The example is revisited and progressively formalized within\nthe framework of the various concepts and objects introduced.\n","authors":["Alice Cleynen","Benoîte de Saporta","Orlane Rossini","Régis Sabbadin","Amélie Vernay"],"pdf_url":"https://arxiv.org/pdf/2501.04120v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04104v1","updated":"2025-01-07T19:24:11Z","published":"2025-01-07T19:24:11Z","title":"Security by Design Issues in Autonomous Vehicles","summary":" As autonomous vehicle (AV) technology advances towards maturity, it becomes\nimperative to examine the security vulnerabilities within these cyber-physical\nsystems. While conventional cyber-security concerns are often at the forefront\nof discussions, it is essential to delve deeper into the various layers of\nvulnerability that are often overlooked within mainstream frameworks. Our goal\nis to spotlight imminent challenges faced by AV operators and explore emerging\ntechnologies for comprehensive solutions. This research outlines the diverse\nsecurity layers, spanning physical, cyber, coding, and communication aspects,\nin the context of AVs. Furthermore, we provide insights into potential\nsolutions for each potential attack vector, ensuring that autonomous vehicles\nremain secure and resilient in an evolving threat landscape.\n","authors":["Martin Higgins","Devki Jha","David Blundell","David Wallom"],"pdf_url":"https://arxiv.org/pdf/2501.04104v1.pdf","comment":null}],"Optimization and Control":[{"id":"http://arxiv.org/abs/2409.08861v5","updated":"2025-01-07T18:12:27Z","published":"2024-09-13T14:22:14Z","title":"Adjoint Matching: Fine-tuning Flow and Diffusion Generative Models with\n Memoryless Stochastic Optimal Control","summary":" Dynamical generative models that produce samples through an iterative\nprocess, such as Flow Matching and denoising diffusion models, have seen\nwidespread use, but there have not been many theoretically-sound methods for\nimproving these models with reward fine-tuning. 
In this work, we cast reward\nfine-tuning as stochastic optimal control (SOC). Critically, we prove that a\nvery specific memoryless noise schedule must be enforced during fine-tuning, in\norder to account for the dependency between the noise variable and the\ngenerated samples. We also propose a new algorithm named Adjoint Matching which\noutperforms existing SOC algorithms, by casting SOC problems as a regression\nproblem. We find that our approach significantly improves over existing methods\nfor reward fine-tuning, achieving better consistency, realism, and\ngeneralization to unseen human preference reward models, while retaining sample\ndiversity.\n","authors":["Carles Domingo-Enrich","Michal Drozdzal","Brian Karrer","Ricky T. Q. Chen"],"pdf_url":"https://arxiv.org/pdf/2409.08861v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.08232v2","updated":"2025-01-07T17:49:59Z","published":"2024-08-15T15:57:59Z","title":"Characterizations of the Aubin Property of the Solution Mapping for\n Nonlinear Semidefinite Programming","summary":" In this paper, we study the Aubin property of the Karush-Kuhn-Tucker solution\nmapping for the nonlinear semidefinite programming (NLSDP) problem at a locally\noptimal solution. In the literature, it is known that the Aubin property\nimplies the constraint nondegeneracy by Fusek [SIAM J. Optim. 23 (2013), pp.\n1041-1061] and the second-order sufficient condition by Ding et al. [SIAM J.\nOptim. 27 (2017), pp. 67-90]. Based on the Mordukhovich criterion, here we\nfurther prove that the strong second-order sufficient condition is also\nnecessary for the Aubin property to hold. Consequently, several equivalent\nconditions including the strong regularity are established for NLSDP's Aubin\nproperty. Together with the recent progress made by Chen et al. on the\nequivalence between the Aubin property and the strong regularity for nonlinear\nsecond-order cone programming [SIAM J. 
Optim., in press; arXiv:2406.13798v3\n(2024)], this paper constitutes a significant step forward in characterizing\nthe Aubin property for general non-polyhedral $C^2$-cone reducible constrained\noptimization problems.\n","authors":["Liang Chen","Ruoning Chen","Defeng Sun","Liping Zhang"],"pdf_url":"https://arxiv.org/pdf/2408.08232v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.00568v2","updated":"2025-01-07T17:36:14Z","published":"2024-11-01T13:26:13Z","title":"Constrained Sampling with Primal-Dual Langevin Monte Carlo","summary":" This work considers the problem of sampling from a probability distribution\nknown up to a normalization constant while satisfying a set of statistical\nconstraints specified by the expected values of general nonlinear functions.\nThis problem finds applications in, e.g., Bayesian inference, where it can\nconstrain moments to evaluate counterfactual scenarios or enforce desiderata\nsuch as prediction fairness. Methods developed to handle support constraints,\nsuch as those based on mirror maps, barriers, and penalties, are not suited for\nthis task. This work therefore relies on gradient descent-ascent dynamics in\nWasserstein space to put forward a discrete-time primal-dual Langevin Monte\nCarlo algorithm (PD-LMC) that simultaneously constrains the target distribution\nand samples from it. We analyze the convergence of PD-LMC under standard\nassumptions on the target distribution and constraints, namely (strong)\nconvexity and log-Sobolev inequalities. To do so, we bring classical\noptimization arguments for saddle-point algorithms to the geometry of\nWasserstein space. We illustrate the relevance and effectiveness of PD-LMC in\nseveral applications.\n","authors":["Luiz F. O. Chamon","Mohammad Reza Karimi","Anna Korba"],"pdf_url":"https://arxiv.org/pdf/2411.00568v2.pdf","comment":"39 pages, 14 figures. 
Published at NeurIPS 2024"},{"id":"http://arxiv.org/abs/2501.00799v2","updated":"2025-01-07T17:32:19Z","published":"2025-01-01T10:50:35Z","title":"Follow The Approximate Sparse Leader for No-Regret Online Sparse Linear\n Approximation","summary":" We consider the problem of \\textit{online sparse linear approximation}, where\none predicts the best sparse approximation of a sequence of measurements in\nterms of linear combination of columns of a given measurement matrix. Such\nonline prediction problems are ubiquitous, ranging from medical trials to web\ncaching to resource allocation. The inherent difficulty of offline recovery\nalso makes the online problem challenging. In this letter, we propose\nFollow-The-Approximate-Sparse-Leader, an efficient online meta-policy to\naddress this online problem. Through a detailed theoretical analysis, we prove\nthat under certain assumptions on the measurement sequence, the proposed policy\nenjoys a data-dependent sublinear upper bound on the static regret, which can\nrange from logarithmic to square-root. Numerical simulations are performed to\ncorroborate the theoretical findings and demonstrate the efficacy of the\nproposed online policy.\n","authors":["Samrat Mukhopadhyay","Debasmita Mukherjee"],"pdf_url":"https://arxiv.org/pdf/2501.00799v2.pdf","comment":"12 pages, 5 figures, corrected title, added proof of a lemma in\n appendix"},{"id":"http://arxiv.org/abs/2501.03954v1","updated":"2025-01-07T17:26:35Z","published":"2025-01-07T17:26:35Z","title":"Learning to Relax Nonconvex Quadratically Constrained Quadratic Programs","summary":" Quadratically constrained quadratic programs (QCQPs) are ubiquitous in\noptimization: Such problems arise in applications from operations research,\npower systems, signal processing, chemical engineering, portfolio theory, among\nothers. 
Despite their flexibility in modeling real-life situations and the\nrecent effort to understand their properties, nonconvex QCQPs are hard to solve\nin practice. Most of the approaches in the literature are based on either\nLinear Programming (LP) or Semidefinite Programming (SDP) relaxations, each of\nwhich works very well for some problem subclasses but performs poorly on others.\nIn this paper, we develop a relaxation selection procedure for nonconvex QCQPs\nthat can adaptively decide whether an LP- or SDP-based approach is expected to\nbe more beneficial by considering the instance structure. The proposed\nmethodology relies on utilizing machine learning methods that involve features\nderived from spectral properties and sparsity patterns of data matrices, and\nonce trained appropriately, the prediction model is applicable to any instance\nwith an arbitrary number of variables and constraints. We train and test\nclassification and regression models over synthetically generated instances,\nand empirically show the efficacy of our approach.\n","authors":["Buket Ozen","Burak Kocuk"],"pdf_url":"https://arxiv.org/pdf/2501.03954v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.13798v3","updated":"2025-01-07T17:19:28Z","published":"2024-06-19T19:58:01Z","title":"Aubin Property and Strong Regularity Are Equivalent for Nonlinear\n Second-Order Cone Programming","summary":" This paper solves a fundamental open problem in variational analysis on the\nequivalence between the Aubin property and the strong regularity for nonlinear\nsecond-order cone programming (SOCP) at a locally optimal solution. We achieve\nthis by introducing a reduction approach to the Aubin property characterized by\nthe Mordukhovich criterion and a lemma of alternative choices on cones to\nreplace the S-lemma used in Outrata and Ram\\'irez [SIAM J. Optim. 21 (2011)\n789-823] and Opazo, Outrata, and Ram\\'irez [SIAM J. Optim. 
27 (2017)\n2141-2151], where the same SOCP was considered under the strict complementarity\ncondition except for possibly only one block of constraints. As a byproduct, we\nalso offer a new approach to the well-known result of Dontchev and Rockafellar\n[SIAM J. Optim. 6 (1996) 1087-1105] on the equivalence of the two concepts in\nconventional nonlinear programming.\n","authors":["Liang Chen","Ruoning Chen","Defeng Sun","Junyuan Zhu"],"pdf_url":"https://arxiv.org/pdf/2406.13798v3.pdf","comment":"To appear in SIAM Journal on Optimization"},{"id":"http://arxiv.org/abs/2501.03933v1","updated":"2025-01-07T16:49:01Z","published":"2025-01-07T16:49:01Z","title":"Data-driven Optimization for the Evolve-Filter-Relax regularization of\n convection-dominated flows","summary":" Numerical stabilization techniques are often employed in under-resolved\nsimulations of convection-dominated flows to improve accuracy and mitigate\nspurious oscillations. Specifically, the Evolve-Filter-Relax (EFR) algorithm is\na framework which consists in evolving the solution, applying a filtering step\nto remove high-frequency noise, and relaxing through a convex combination of\nfiltered and original solutions. The stability and accuracy of the EFR solution\nstrongly depend on two parameters, the filter radius $\\delta$ and the\nrelaxation parameter $\\chi$. Standard choices for these parameters are usually\nfixed in time, and related to the full order model setting, i.e., the grid size\nfor $\\delta$ and the time step for $\\chi$. This paper makes two significant\nimprovements to the standard EFR framework by proposing: (i) time-dependent\nparameters, (ii) data-driven adaptive optimization of the parameters in time,\nconsidering a fully-resolved simulation as a reference. In particular, we\npropose three different classes of Optimized-EFR strategies, aiming to optimize\none or both parameters. 
Moreover, we investigate the accuracy and efficiency of\nthe proposed optimization algorithms considering different objective functions,\nboth local (point-valued) and global (such as the kinetic energy). The new\nOptimized-EFR strategies are tested in the under-resolved simulation of a\nturbulent flow past a cylinder at $Re=1000$. The new Optimized-EFR results are\nmore accurate than the standard EFR solution while maintaining a similar\ncomputational time. In particular, we show that using a global objective\nfunction and including the $H^1$ velocity seminorm is crucial to accurately\nmatch the reference flow dynamics.\n","authors":["Anna Ivagnes","Maria Strazzullo","Michele Girfoglio","Traian Iliescu","Gianluigi Rozza"],"pdf_url":"https://arxiv.org/pdf/2501.03933v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05017v3","updated":"2025-01-07T16:21:59Z","published":"2024-12-06T13:09:34Z","title":"Reduction from the partition problem: Dynamic lot sizing problem with\n polynomial complexity","summary":" In this note, we polynomially reduce an instance of the partition problem to\na dynamic lot sizing problem, and show that solving the latter problem solves\nthe former problem. By solving the dynamic programming formulation of the\ndynamic lot sizing problem, we show that the instance of the partition problem\ncan be solved with pseudo-polynomial time complexity. Numerical results on\nsolving instances of the partition problem are also provided using an\nimplementation of the algorithm that solves the dynamic program.\n","authors":["Chee-Khian Sim"],"pdf_url":"https://arxiv.org/pdf/2412.05017v3.pdf","comment":"11 pages. 
Latest version contains improved arguments and results"},{"id":"http://arxiv.org/abs/2501.03906v1","updated":"2025-01-07T16:21:40Z","published":"2025-01-07T16:21:40Z","title":"A regularized transportation cost stemming from entropic approximation","summary":" We study the entropic regularizations of optimal transport problems under\nsuitable summability assumptions on the point-wise transport cost. These\nsummability assumptions already appear in the literature. However, we show that\nthe weakest compactness conditions that can be derived are already enough to\nobtain the convergence of the regularized functionals. This approach allows us\nto characterize the variational limit of the regularization even when it does\nnot converge to the original problem. The results apply also to problems with\nmore than two marginals.\n","authors":["Camilla Brizzi","Luigi De Pascale","Anna Kausamo"],"pdf_url":"https://arxiv.org/pdf/2501.03906v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03882v1","updated":"2025-01-07T15:44:16Z","published":"2025-01-07T15:44:16Z","title":"An obstruction to small-time local controllability for a bilinear\n Schrödinger equation","summary":" We consider the small-time local controllability in the vicinity of the\nground state of a bilinear Schr\\\"odinger equation with Neumann boundary\nconditions. We prove that, when the linearized system is not controllable, the\nnonlinear system is not controllable, due to a quadratic obstruction involving\nthe squared norm of the control's primitive. This obstruction has been known\nsince 1983 for ODEs and observed for some PDEs since 2006. However, our\nsituation is more intricate since the kernel describing the quadratic expansion\nof the solution is not twice differentiable. 
We thus follow a Fourier-based\napproach, closer to the one used for quadratic obstructions of fractional\nSobolev regularity.\n In this Fourier-based approach, a challenge is to formulate a necessary and\nsufficient condition on the convolution kernel, for the quadratic form to be\ncoercive. In previous studies, the coercivity was ensured by a signed\nasymptotic equivalent for the Fourier transform of the convolution kernel of\nthe form $\\widehat{K}(\\omega) \\sim \\omega^{-2}$ as $|\\omega| \\to \\infty$. In\nour case, $\\widehat{K}$ is a distribution which has singularities and changes\nsign up to infinity. We still prove coercivity because one of the signs appears\ntoo infrequently.\n","authors":["Karine Beauchard","Frédéric Marbach","Thomas Perrin"],"pdf_url":"https://arxiv.org/pdf/2501.03882v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.03560v2","updated":"2025-01-07T14:27:58Z","published":"2024-05-06T15:29:55Z","title":"Converse Lyapunov Results for Stability of Switched Systems with Average\n Dwell-Time","summary":" This article provides a characterization of stability for switched nonlinear\nsystems under average dwell-time constraints, in terms of necessary and\nsufficient conditions involving multiple Lyapunov functions. Earlier converse\nresults focus on switched systems with dwell-time constraints only, and the\nresulting inequalities depend on the flow of individual subsystems. With the\nhelp of a counterexample, we show that a lower bound that guarantees stability\nfor dwell-time switching signals may not necessarily imply stability for\nswitching signals with the same lower bound on the average dwell-time. Based on\nthese two observations, we provide a converse result for the average dwell-time\nconstrained systems in terms of inequalities which do not depend on the flow of\nindividual subsystems and are easier to check. 
The particular case of linear\nswitched systems is studied as a corollary to our main result.\n","authors":["Matteo Della Rossa","Aneel Tanwani"],"pdf_url":"https://arxiv.org/pdf/2405.03560v2.pdf","comment":"To appear in ESAIM: Control, Optimisation and Calculus of Variations\n (ESAIM: COCV)"},{"id":"http://arxiv.org/abs/2501.02098v2","updated":"2025-01-07T14:20:44Z","published":"2025-01-03T20:51:07Z","title":"Graph-Based Modeling and Decomposition of Hierarchical Optimization\n Problems","summary":" We present a graph-theoretic modeling approach for hierarchical optimization\nthat leverages the OptiGraph abstraction implemented in the Julia package\nPlasmo.jl. We show that the abstraction is flexible and can effectively capture\ncomplex hierarchical connectivity that arises from decision-making over\nmultiple spatial and temporal scales (e.g., integration of planning,\nscheduling, and operations in manufacturing and infrastructures). We also show\nthat the graph abstraction facilitates the conceptualization and implementation\nof decomposition and approximation schemes. Specifically, we propose a\ngraph-based Benders decomposition (gBD) framework that enables the exploitation\nof hierarchical (nested) structures and that uses graph\naggregation/partitioning procedures to discover such structures. In addition,\nwe provide a Julia implementation of gBD, which we call PlasmoBenders.jl. We\nillustrate the capabilities using examples arising in the context of energy and\npower systems.\n","authors":["David L. Cole","Filippo Pecci","Omar J. Guerra","Harsha Gangammanavar","Jesse D. Jenkins","Victor M. 
Zavala"],"pdf_url":"https://arxiv.org/pdf/2501.02098v2.pdf","comment":"66 pages, 3 tables, 28 figures, updated abstract"},{"id":"http://arxiv.org/abs/2501.03784v1","updated":"2025-01-07T13:45:29Z","published":"2025-01-07T13:45:29Z","title":"Optimal control of a nonlinear kinetic Fokker-Planck equation","summary":" A tracking type optimal control problem for a nonlinear and nonlocal kinetic\nFokker-Planck equation which arises as the mean field limit of an interacting\nparticle systems that is subject to distance dependent random fluctuations is\nstudied. As the equation of interest is only hypocoercive and the control\noperator is unbounded with respect to the canonical state space, classical\nvariational solution techniques cannot be utilized directly. Instead, the\nconcept of admissible control operators is employed. For the underlying\nnonlinearities, local Lipschitz estimates are derived and subsequently used\nwithin a fixed point argument to obtain local existence of solutions. Again,\ndue to hypocoercivity, existence of optimal controls requires non standard\ntechniques as (compensated) compactness arguments are not readily available.\n","authors":["Tobias Breiten","Karl Kunisch"],"pdf_url":"https://arxiv.org/pdf/2501.03784v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2205.01369v2","updated":"2025-01-07T13:41:00Z","published":"2022-05-03T08:36:05Z","title":"Improving the Convergence Rates for the Kinetic Fokker-Planck Equation\n by Optimal Control","summary":" The long time behavior and detailed convergence analysis of Langevin\nequations has received increased attention over the last years. Difficulties\narise from a lack of coercivity, usually termed hypocoercivity, of the\nunderlying kinetic Fokker-Planck operator which is a consequence of the\npartially deterministic nature of a second order stochastic differential\nequation. 
In this manuscript, the effect of controlling the confinement\npotential without altering the original invariant measure is investigated. This\nleads to an abstract bilinear control system with an unbounded but\ninfinite-time admissible control operator which, by means of an artificial\ndiffusion approach, is shown to possess a unique solution. The compactness of\nthe underlying semigroup is further used to define an infinite-horizon optimal\ncontrol problem on an appropriately reduced state space. Under smallness\nassumptions on the initial data, feasibility of and existence of a solution to\nthe optimal control problem are discussed. Numerical results based on a local\napproximation based on a shifted Riccati equation illustrate the theoretical\nfindings.\n","authors":["Tobias Breiten","Karl Kunisch"],"pdf_url":"https://arxiv.org/pdf/2205.01369v2.pdf","comment":"32 pages, 4 figures"},{"id":"http://arxiv.org/abs/2501.03773v1","updated":"2025-01-07T13:29:37Z","published":"2025-01-07T13:29:37Z","title":"The maximal angle between $3 \\times 3$ copositive matrices","summary":" In 2010, Hiriart-Urruty and Seeger posed the problem of finding the maximal\npossible angle $\\theta_n$ between two copositive matrices of order $n$. They\nproved that $\\theta_2=\\frac{3}{4}\\pi$. In this paper, we study the maximal\nangle between two copositive matrices of order 3. We show that\n$\\theta_3=\\frac{3}{4}\\pi$ and give all possible pairs of matrices achieving\nthis maximal angle. 
The proof is based on case analysis and uses optimization\nand basic linear algebra techniques.\n","authors":["Daniel Gourion"],"pdf_url":"https://arxiv.org/pdf/2501.03773v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.11639v3","updated":"2025-01-07T13:21:43Z","published":"2024-11-18T15:19:04Z","title":"Trade-off Invariance Principle for minimizers of regularized functionals","summary":" In this paper, we consider functionals of the form $H_\\alpha(u)=F(u)+\\alpha\nG(u)$ with $\\alpha\\in[0,+\\infty)$, where $u$ varies in a set $U\\neq\\emptyset$\n(without further structure). We first show that, excluding at most countably\nmany values of $\\alpha$, we have that $\\inf_{H_\\alpha^\\star}G=\n\\sup_{H_\\alpha^\\star}G$, where $H_\\alpha^\\star := \\arg \\min_U H_\\alpha$, which\nis assumed to be non-empty. We further prove a stronger result that concerns\nthe invariance of the limiting value of the functional $G$ along minimizing\nsequences for $H_\\alpha$. Moreover, we show to what extent these findings\ngeneralize to multi-regularized functionals and -- in the presence of an\nunderlying differentiable structure -- to critical points. 
Finally, the main\nresult implies an unexpected consequence for functionals regularized with\nuniformly convex norms: excluding again at most countably many values of\n$\\alpha$, it turns out that for a minimizing sequence, convergence to a\nminimizer in the weak or strong sense is equivalent.\n","authors":["Massimo Fornasier","Jona Klemenc","Alessandro Scagliotti"],"pdf_url":"https://arxiv.org/pdf/2411.11639v3.pdf","comment":"16 pages, extension to multi-regularization and to critical points"},{"id":"http://arxiv.org/abs/2410.24222v2","updated":"2025-01-07T13:04:51Z","published":"2024-10-31T17:59:56Z","title":"Robust Gaussian Processes via Relevance Pursuit","summary":" Gaussian processes (GPs) are non-parametric probabilistic regression models\nthat are popular due to their flexibility, data efficiency, and well-calibrated\nuncertainty estimates. However, standard GP models assume homoskedastic\nGaussian noise, while many real-world applications are subject to non-Gaussian\ncorruptions. Variants of GPs that are more robust to alternative noise models\nhave been proposed, and entail significant trade-offs between accuracy and\nrobustness, and between computational requirements and theoretical guarantees.\nIn this work, we propose and study a GP model that achieves robustness against\nsparse outliers by inferring data-point-specific noise levels with a sequential\nselection procedure maximizing the log marginal likelihood that we refer to as\nrelevance pursuit. We show, surprisingly, that the model can be parameterized\nsuch that the associated log marginal likelihood is strongly concave in the\ndata-point-specific noise variances, a property rarely found in either robust\nregression objectives or GP marginal likelihoods. This in turn implies the weak\nsubmodularity of the corresponding subset selection problem, and thereby proves\napproximation guarantees for the proposed algorithm. 
We compare the model's\nperformance relative to other approaches on diverse regression and Bayesian\noptimization tasks, including the challenging but common setting of sparse\ncorruptions of the labels within or close to the function range.\n","authors":["Sebastian Ament","Elizabeth Santorella","David Eriksson","Ben Letham","Maximilian Balandat","Eytan Bakshy"],"pdf_url":"https://arxiv.org/pdf/2410.24222v2.pdf","comment":"NeurIPS 2024 Article (https://openreview.net/forum?id=5FATPIlWUJ)"},{"id":"http://arxiv.org/abs/2501.03744v1","updated":"2025-01-07T12:38:21Z","published":"2025-01-07T12:38:21Z","title":"Hydrogen Network Expansion Planning considering the Chicken-and-egg\n Dilemma and Market Uncertainty","summary":" Green hydrogen is thought to be a game changer for reaching\nsustainability targets. However, the transition to a green hydrogen economy\nfaces a critical challenge known as the `chicken-and-egg dilemma', wherein\nestablishing a hydrogen supply network relies on demand, while demand only\ngrows with reliable supply. In addition, as the hydrogen market is in the early\nstage, predicting demand distributions is challenging due to lack of data\navailability. This paper addresses these complex issues through a risk-averse\nframework with the introduction of a distributionally robust hydrogen network\nexpansion planning problem under decision-dependent demand ambiguity. The\nproblem optimizes location and production capacity decisions of the suppliers\nconsidering the moments of the stochastic hydrogen demand as a function of\nthese investment decisions. To obtain tractable representations of this\nproblem, we derive two different reformulations that consider continuous and\ndiscrete hydrogen demand support sets under different forms of decision\ndependencies. 
To efficiently solve the reformulations, we develop a tailored\nalgorithm based on the column-and-constraint generation approach, and enhance\nthe computational performance through solving the master problems to a relative\noptimality gap, decomposing the subproblems, and integrating pre-generated\ncolumns and constraints. To validate the effectiveness of our approach, we\ninvestigate a real case study leveraging data from the ``Hydrogen Energy\nApplications in Valley Environments for Northern Netherlands (HEAVENN)\"\nproject. The results reveal that considering the chicken-and-egg dilemma under\nuncertain hydrogen market conditions leads to earlier and more diverse\ninvestments, providing critical insights for policymakers based on the degree\nof decision dependency.\n","authors":["Sezen Ece Kayacık","Beste Basciftci","Albert H. Schrotenboer","Iris F. A. Vis","Evrim Ursavas"],"pdf_url":"https://arxiv.org/pdf/2501.03744v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.16449v3","updated":"2025-01-07T12:16:43Z","published":"2024-05-26T06:33:11Z","title":"Reinforcement Learning for Jump-Diffusions, with Financial Applications","summary":" We study continuous-time reinforcement learning (RL) for stochastic control\nin which system dynamics are governed by jump-diffusion processes. We formulate\nan entropy-regularized exploratory control problem with stochastic policies to\ncapture the exploration--exploitation balance essential for RL. Unlike the pure\ndiffusion case initially studied by Wang et al. (2020), the derivation of the\nexploratory dynamics under jump-diffusions calls for a careful formulation of\nthe jump part. Through a theoretical analysis, we find that one can simply use\nthe same policy evaluation and $q$-learning algorithms in Jia and Zhou (2022a,\n2023), originally developed for controlled diffusions, without needing to check\na priori whether the underlying data come from a pure diffusion or a\njump-diffusion. 
However, we show that the presence of jumps ought to affect\nparameterizations of actors and critics in general. We investigate as an\napplication the mean--variance portfolio selection problem with stock price\nmodelled as a jump-diffusion, and show that both RL algorithms and\nparameterizations are invariant with respect to jumps. Finally, we present a\ndetailed study on applying the general theory to option hedging.\n","authors":["Xuefeng Gao","Lingfei Li","Xun Yu Zhou"],"pdf_url":"https://arxiv.org/pdf/2405.16449v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03718v1","updated":"2025-01-07T11:58:10Z","published":"2025-01-07T11:58:10Z","title":"Scalable Second-Order Optimization Algorithms for Minimizing Low-rank\n Functions","summary":" We present a random-subspace variant of cubic regularization algorithm that\nchooses the size of the subspace adaptively, based on the rank of the projected\nsecond derivative matrix. Iteratively, our variant only requires access to\n(small-dimensional) projections of first- and second-order problem derivatives\nand calculates a reduced step inexpensively. The ensuing method maintains the\noptimal global rate of convergence of (full-dimensional) cubic regularization,\nwhile showing improved scalability both theoretically and numerically,\nparticularly when applied to low-rank functions. 
When applied to the latter,\nour algorithm naturally adapts the subspace size to the true rank of the\nfunction, without knowing it a priori.\n","authors":["Edward Tansley","Coralia Cartis"],"pdf_url":"https://arxiv.org/pdf/2501.03718v1.pdf","comment":"Accepted at NeurIPS 2024 Workshop OPT2024: Optimization for Machine\n Learning"},{"id":"http://arxiv.org/abs/2501.03698v1","updated":"2025-01-07T11:03:11Z","published":"2025-01-07T11:03:11Z","title":"Computational complexity of sum-of-squares bounds for copositive\n programs","summary":" In recent years, copositive programming has received significant attention\nfor its ability to model hard problems in both discrete and continuous\noptimization. Several relaxations of copositive programs based on semidefinite\nprogramming (SDP) have been proposed in the literature, meant to provide\ntractable bounds. However, while these SDP-based relaxations are amenable to\nthe ellipsoid algorithm and interior point methods, it is not immediately\nobvious that they can be solved in polynomial time (even approximately). In\nthis paper, we consider the sum-of-squares (SOS) hierarchies of relaxations for\ncopositive programs introduced by Parrilo (2000), de Klerk & Pasechnik (2002)\nand Pe\\~na, Vera & Zuluaga (2006), which can be formulated as SDPs. We\nestablish sufficient conditions that guarantee the polynomial-time\ncomputability (up to fixed precision) of these relaxations. These conditions\nare satisfied by copositive programs that represent standard quadratic programs\nand their reciprocals. 
As an application, we show that the SOS bounds for the\n(weighted) stability number of a graph can be computed efficiently.\nAdditionally, we provide pathological examples of copositive programs (that do\nnot satisfy the sufficient conditions) whose SOS relaxations admit only\nfeasible solutions of doubly-exponential size.\n","authors":["Marilena Palomba","Lucas Slot","Luis Felipe Vargas","Monaldo Mastrolilli"],"pdf_url":"https://arxiv.org/pdf/2501.03698v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03691v1","updated":"2025-01-07T10:43:26Z","published":"2025-01-07T10:43:26Z","title":"Stabilization of Strictly Pre-Dissipative Receding Horizon Linear\n Quadratic Control by Terminal Costs","summary":" Asymptotic stability in receding horizon control is obtained under a strict\npre-dissipativity assumption, in the presence of suitable state constraints. In\nthis paper we analyze how terminal constraints can be replaced by suitable\nterminal costs. We restrict to the linear-quadratic setting as that allows us\nto obtain stronger results, while we analyze the full nonlinear case in a\nseparate contribution.\n","authors":["Mario Zanon","Lars Grüne"],"pdf_url":"https://arxiv.org/pdf/2501.03691v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03668v1","updated":"2025-01-07T10:12:01Z","published":"2025-01-07T10:12:01Z","title":"Controlling the low-temperature Ising model using spatiotemporal Markov\n decision theory","summary":" We introduce the spatiotemporal Markov decision process (STMDP), a special\ntype of Markov decision process that models sequential decision-making problems\nwhich are not only characterized by temporal, but also by spatial interaction\nstructures. To illustrate the framework, we construct an STMDP inspired by the\nlow-temperature two-dimensional Ising model on a finite, square lattice,\nevolving according to the Metropolis dynamics. 
We consider the situation in\nwhich an external decision maker aims to drive the system towards the all-plus\nconfiguration by flipping spins at specified moments in time. In order to\nanalyze this problem, we construct an auxiliary MDP by means of a reduction of\nthe configuration space to the local minima of the Hamiltonian. Leveraging the\nconvenient form of this auxiliary MDP, we uncover the structure of the optimal\npolicy by solving the Bellman equations in a recursive manner. Finally, we\nconduct a numerical study on the performance of the optimal policy obtained\nfrom the auxiliary MDP in the original Ising STMDP.\n","authors":["M. C. de Jongh","Richard J. Boucherie","M. N. M. van Lieshout"],"pdf_url":"https://arxiv.org/pdf/2501.03668v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.23016v2","updated":"2025-01-07T09:39:51Z","published":"2024-10-30T13:46:07Z","title":"Regularity and stability for the Gibbs conditioning principle on path\n space via McKean-Vlasov control","summary":" We consider a system of diffusion processes interacting through their\nempirical distribution. Assuming that the empirical average of a given\nobservable can be observed at any time, we derive regularity and quantitative\nstability results for the optimal solutions in the associated version of the\nGibbs conditioning principle. The proofs rely on the analysis of a\nMcKean-Vlasov control problem with distributional constraints. 
Some new\nestimates are derived for Hamilton-Jacobi-Bellman equations and the Hessian of\nthe log-density of diffusion processes, which are of independent interest.\n","authors":["Louis-Pierre Chaintron","Giovanni Conforti"],"pdf_url":"https://arxiv.org/pdf/2410.23016v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.00502v3","updated":"2025-01-07T08:50:35Z","published":"2024-06-01T17:10:56Z","title":"Non-geodesically-convex optimization in the Wasserstein space","summary":" We study a class of optimization problems in the Wasserstein space (the space\nof probability measures) where the objective function is nonconvex along\ngeneralized geodesics. Specifically, the objective exhibits some\ndifference-of-convex structure along these geodesics. The setting also\nencompasses sampling problems where the logarithm of the target distribution is\ndifference-of-convex. We derive multiple convergence insights for a novel semi\nForward-Backward Euler scheme under several nonconvex (and possibly nonsmooth)\nregimes. Notably, the semi Forward-Backward Euler is just a slight modification\nof the Forward-Backward Euler whose convergence is -- to our knowledge -- still\nunknown in our very general non-geodesically-convex setting.\n","authors":["Hoang Phuc Hau Luu","Hanlin Yu","Bernardo Williams","Petrus Mikkola","Marcelo Hartmann","Kai Puolamäki","Arto Klami"],"pdf_url":"https://arxiv.org/pdf/2406.00502v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09498v2","updated":"2025-01-07T06:49:09Z","published":"2024-12-12T17:47:08Z","title":"Gradient descent inference in empirical risk minimization","summary":" Gradient descent is one of the most widely used iterative algorithms in\nmodern statistical learning. 
However, its precise algorithmic dynamics in\nhigh-dimensional settings remain only partially understood, which has therefore\nlimited its broader potential for statistical inference applications.\n This paper provides a precise, non-asymptotic distributional characterization\nof gradient descent iterates in a broad class of empirical risk minimization\nproblems, in the so-called mean-field regime where the sample size is\nproportional to the signal dimension. Our non-asymptotic state evolution theory\nholds for both general non-convex loss functions and non-Gaussian data, and\nreveals the central role of two Onsager correction matrices that precisely\ncharacterize the non-trivial dependence among all gradient descent iterates in\nthe mean-field regime.\n Although the Onsager correction matrices are typically analytically\nintractable, our state evolution theory facilitates a generic gradient descent\ninference algorithm that consistently estimates these matrices across a broad\nclass of models. Leveraging this algorithm, we show that the state evolution\ncan be inverted to construct (i) data-driven estimators for the generalization\nerror of gradient descent iterates and (ii) debiased gradient descent iterates\nfor inference of the unknown signal. 
Detailed applications to two canonical\nmodels--linear regression and (generalized) logistic regression--are worked out\nto illustrate model-specific features of our general theory and inference\nmethods.\n","authors":["Qiyang Han","Xiaocong Xu"],"pdf_url":"https://arxiv.org/pdf/2412.09498v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03552v1","updated":"2025-01-07T05:58:07Z","published":"2025-01-07T05:58:07Z","title":"Proxy Control Barrier Functions: Integrating Barrier-Based and\n Lyapunov-Based Safety-Critical Control Design","summary":" This work introduces a novel Proxy Control Barrier Function (PCBF) scheme\nthat integrates barrier-based and Lyapunov-based safety-critical control\nstrategies for strict-feedback systems with potentially unknown dynamics. The\nproposed method employs a modular design procedure, decomposing the original\nsystem into a proxy subsystem and a virtual tracking subsystem that are\ncontrolled by the control barrier function (CBF)-based and Lyapunov-based\ncontrollers, respectively. By integrating these separately designed\ncontrollers, the overall system's safety is ensured. Moreover, a new\nfilter-based disturbance observer is utilized to design a PCBF-based safe\ncontroller for strict-feedback systems subject to mismatched disturbances. This\napproach broadens the class of systems to which CBF-based methods can be\napplied and significantly simplifies CBF construction by requiring only the\nmodel of the proxy subsystem. 
The effectiveness of the proposed method is\ndemonstrated through numerical simulations.\n","authors":["Yujie Wang","Xiangru Xu"],"pdf_url":"https://arxiv.org/pdf/2501.03552v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03543v1","updated":"2025-01-07T05:37:59Z","published":"2025-01-07T05:37:59Z","title":"Distributionally Robust Joint Chance-Constrained Optimal Power Flow\n using Relative Entropy","summary":" Designing robust algorithms for the optimal power flow (OPF) problem is\ncritical for the control of large-scale power systems under uncertainty. The\nchance-constrained OPF (CCOPF) problem provides a natural formulation of the\ntrade-off between the operating cost and the constraint satisfaction rate. In\nthis work, we propose a new data-driven algorithm for the CCOPF problem, based\non distributionally robust optimization (DRO). We show that the\nproposed reformulation of the distributionally robust chance constraints is\nexact, whereas other approaches in the CCOPF literature rely on conservative\napproximations. We establish out-of-sample robustness guarantees for the\ndistributionally robust solution and prove that the solution is the most\nefficient among all approaches enjoying the same guarantees. We apply the\nproposed algorithm to the CCOPF problem and compare the performance of our\napproach with existing methods using simulations on IEEE benchmark power\nsystems.\n","authors":["Eli Brock","Haixiang Zhang","Javad Lavaei","Somayeh Sojoudi"],"pdf_url":"https://arxiv.org/pdf/2501.03543v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03542v1","updated":"2025-01-07T05:30:06Z","published":"2025-01-07T05:30:06Z","title":"Turbulence modeling over riblets via domain transformation","summary":" Numerical and experimental studies have demonstrated the drag-reducing\npotential of carefully designed streamwise-elongated riblets in lowering\nskin-friction drag. 
To support the systematic design of such surface\ncorrugations, recent efforts have integrated simplified versions of the\ngoverning equations with innovative methods for representing the effects of\nrough boundaries on flow dynamics. Notably, the statistical response of the\neddy-viscosity-enhanced linearized Navier-Stokes equations has been shown to\neffectively capture the ability of riblets in suppressing turbulence, quantify\nthe influence of background turbulence on the mean velocity, and reproduce\nestablished drag-reduction trends. In this paper, we enhance the flexibility\nand computational efficiency of this simulation-free approach by implementing a\ndomain transformation for surface representation, along with a perturbation\nanalysis on a small geometric parameter of the riblets. While domain\ntransformation complicates the differential equations, it provides accurate\nboundary representations and facilitates the analysis of complex riblet shapes\nat high Reynolds numbers by enabling perturbation analysis to simplify the\ndimensional complexity of the governing equations. Our method successfully\npredicts drag reduction trends for semi-circular riblets, consistent with\nexisting literature. We further utilize our framework to investigate flow\nmechanisms influenced by riblets and extend our study to channel flows with\nfriction Reynolds numbers up to 2003. 
Our findings reveal the emergence of\nKelvin-Helmholtz rollers over large and sharp semi-circular riblets,\ncontributing to the degradation of drag reduction in these geometries.\nAdditionally, we examine the impact of riblets on near-wall flow structures,\nfocusing on their suppression of streamwise-elongated structures in flows over\nlarge riblets.\n","authors":["Mohammadamin Naseri","Armin Zare"],"pdf_url":"https://arxiv.org/pdf/2501.03542v1.pdf","comment":"40 pages, 26 figures"},{"id":"http://arxiv.org/abs/2411.13805v3","updated":"2025-01-07T03:08:57Z","published":"2024-11-21T03:09:18Z","title":"On Representing Convex Quadratically Constrained Quadratic Programs via\n Graph Neural Networks","summary":" Convex quadratically constrained quadratic programs (QCQPs) involve finding a\nsolution within a convex feasible region defined by quadratic constraints while\nminimizing a convex quadratic objective function. These problems arise in\nvarious industrial applications, including power systems and signal processing.\nTraditional methods for solving convex QCQPs primarily rely on matrix\nfactorization, which quickly becomes computationally prohibitive as the problem\nsize increases. Recently, graph neural networks (GNNs) have gained attention\nfor their potential in representing and solving various optimization problems\nsuch as linear programs and linearly constrained quadratic programs. In this\nwork, we investigate the representation power of GNNs in the context of QCQP\ntasks. Specifically, we propose a new tripartite graph representation for\ngeneral convex QCQPs and properly associate it with message-passing GNNs. We\ndemonstrate that there exist GNNs capable of reliably representing key\nproperties of convex QCQPs, including feasibility, optimal value, and optimal\nsolution. 
Our result deepens the understanding of the connection between QCQPs\nand GNNs, paving the way for future machine learning approaches to efficiently\nsolve QCQPs.\n","authors":["Chenyang Wu","Qian Chen","Akang Wang","Tian Ding","Ruoyu Sun","Wenguo Yang","Qingjiang Shi"],"pdf_url":"https://arxiv.org/pdf/2411.13805v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.19678v3","updated":"2025-01-07T03:01:22Z","published":"2024-09-29T12:10:35Z","title":"SymILO: A Symmetry-Aware Learning Framework for Integer Linear\n Optimization","summary":" Integer linear programs (ILPs) are commonly employed to model diverse\npractical problems such as scheduling and planning. Recently, machine learning\ntechniques have been utilized to solve ILPs. A straightforward idea is to train\na model via supervised learning, with an ILP as the input and an optimal\nsolution as the label. An ILP is symmetric if its variables can be permuted\nwithout changing the problem structure, resulting in numerous equivalent and\noptimal solutions. Randomly selecting an optimal solution as the label can\nintroduce variability in the training data, which may hinder the model from\nlearning stable patterns. In this work, we incorporate the intrinsic symmetry\nof ILPs and propose a novel training framework called SymILO. Specifically, we\nmodify the learning task by introducing solution permutation along with neural\nnetwork weights as learnable parameters and then design an alternating\nalgorithm to jointly optimize the loss function. 
We conduct extensive\nexperiments on ILPs involving different symmetries and the computational\nresults demonstrate that our symmetry-aware approach significantly outperforms\nthree existing methods -- achieving $50.3\\%$, $66.5\\%$, and $45.4\\%$ average\nimprovements, respectively.\n","authors":["Qian Chen","Tianjian Zhang","Linxin Yang","Qingyu Han","Akang Wang","Ruoyu Sun","Xiaodong Luo","Tsung-Hui Chang"],"pdf_url":"https://arxiv.org/pdf/2409.19678v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.13865v2","updated":"2025-01-07T02:50:02Z","published":"2024-02-21T15:13:43Z","title":"Variable Projection Algorithms: Theoretical Insights and A Novel\n Approach for Problems with Large Residual","summary":" This paper delves into an in-depth exploration of the Variable Projection\n(VP) algorithm, a powerful tool for solving separable nonlinear optimization\nproblems across multiple domains, including system identification, image\nprocessing, and machine learning. We first establish a theoretical framework to\nexamine the effect of the approximate treatment of the coupling relationship\namong parameters on the local convergence of the VP algorithm and theoretically\nprove that the Kaufman's VP algorithm can achieve a similar convergence rate as\nthe Golub \\& Pereyra's form. These studies fill the gap in the existing\nconvergence theory analysis, and provide a solid foundation for understanding\nthe mechanism of VP algorithm and broadening its application horizons.\nFurthermore, drawing inspiration from these theoretical revelations, we design\na refined VP algorithm for handling separable nonlinear optimization problems\ncharacterized by large residual, called VPLR, which boosts the convergence\nperformance by addressing the interdependence of parameters within the\nseparable model and by continually correcting the approximated Hessian matrix\nto counteract the influence of large residual during the iterative process. 
The\neffectiveness of this refined algorithm is corroborated through numerical\nexperimentation.\n","authors":["Guangyong Chen","Peng Xue","Min Gan","Jing Chen","Wenzhong Guo","C. L. Philip. Chen"],"pdf_url":"https://arxiv.org/pdf/2402.13865v2.pdf","comment":"18 pages, 8 figures"},{"id":"http://arxiv.org/abs/2402.07108v3","updated":"2025-01-07T02:14:56Z","published":"2024-02-11T05:35:50Z","title":"Decoupling Learning and Decision-Making: Breaking the\n $\\mathcal{O}(\\sqrt{T})$ Barrier in Online Resource Allocation with\n First-Order Methods","summary":" Online linear programming plays an important role in both revenue management\nand resource allocation, and recent research has focused on developing\nefficient first-order online learning algorithms. Despite the empirical success\nof first-order methods, they typically achieve a regret no better than\n$\\mathcal{O}(\\sqrt{T})$, which is suboptimal compared to the $\\mathcal{O}(\\log\nT)$ bound guaranteed by the state-of-the-art linear programming (LP)-based\nonline algorithms. This paper establishes several important facts about online\nlinear programming, which unveils the challenge for first-order-method-based\nonline algorithms to achieve beyond $\\mathcal{O}(\\sqrt{T})$ regret. To address\nthe challenge, we introduce a new algorithmic framework that decouples learning\nfrom decision-making. 
For the first time, we show that first-order methods can\nattain regret $\\mathcal{O}(T^{1/3})$ with this new framework.\n","authors":["Wenzhi Gao","Chunlin Sun","Chenyu Xue","Dongdong Ge","Yinyu Ye"],"pdf_url":"https://arxiv.org/pdf/2402.07108v3.pdf","comment":"Merged into arXiv:2501.02761"},{"id":"http://arxiv.org/abs/2501.03470v1","updated":"2025-01-07T02:12:00Z","published":"2025-01-07T02:12:00Z","title":"Positivstellensätze for polynomial matrices with universal quantifiers","summary":" This paper studies Positivstellens\\\"atze for a polynomial matrix subject to\npolynomial matrix inequality constraints with universal quantifiers. We first\npresent a Scherer-Hol-type Positivstellensatz under the Archimedean condition.\nWhen the objective is a scalar polynomial, we further provide a sparse\nScherer-Hol-type Positivstellensatz in the presence of correlative sparsity.\nNext, without assuming the Archimedean condition, we derive\nPutinar-Vasilescu-type, P\\'olya-type, and Lasserre-Netzer-type\nPositivstellens\\\"atze under the same setting. These results can be viewed as\ncommon generalizations of corresponding Positivstellens\\\"atze in the cases of\npolynomials, polynomials with universal quantifiers, and polynomial matrices.\nFor the proofs, techniques from *-algebra, real algebraic geometry, operator\ntheory, and convex optimization are employed. 
Applications of the established\nPositivstellens\\\"atze to robust polynomial matrix optimization are also\ndiscussed.\n","authors":["Feng Guo","Jie Wang"],"pdf_url":"https://arxiv.org/pdf/2501.03470v1.pdf","comment":"31 pages, 2 tables"},{"id":"http://arxiv.org/abs/2406.00612v3","updated":"2025-01-07T02:11:06Z","published":"2024-06-02T04:02:40Z","title":"Policy Iteration for Exploratory Hamilton--Jacobi--Bellman Equations","summary":" We study the policy iteration algorithm (PIA) for entropy-regularized\nstochastic control problems on an infinite time horizon with a large discount\nrate, focusing on two main scenarios. First, we analyze PIA with bounded\ncoefficients where the controls applied to the diffusion term satisfy a\nsmallness condition. We demonstrate the convergence of PIA based on a uniform\n$\\mathcal{C}^{2,\\alpha}$ estimate for the value sequence generated by PIA, and\nprovide a quantitative convergence analysis for this scenario. Second, we\ninvestigate PIA with unbounded coefficients but no control over the diffusion\nterm. In this scenario, we first provide the well-posedness of the exploratory\nHamilton--Jacobi--Bellman equation with linear growth coefficients and\npolynomial growth reward function. By such a well-posedness result we achieve\nPIA's convergence by establishing quantitative locally uniform\n$\\mathcal{C}^{1,\\alpha}$ estimates for the generated value sequence.\n","authors":["Hung Vinh Tran","Zhenhua Wang","Yuming Paul Zhang"],"pdf_url":"https://arxiv.org/pdf/2406.00612v3.pdf","comment":"21 pages"},{"id":"http://arxiv.org/abs/2501.03459v1","updated":"2025-01-07T01:21:07Z","published":"2025-01-07T01:21:07Z","title":"Convergence of a particle method for gradient flows on the\n $L^p$-Wasserstein space","summary":" We study the particle method to approximate the gradient flow on the\n$L^p$-Wasserstein space. 
This method relies on the discretization of the energy\nintroduced by [3] via nonoverlapping balls centered at the particles and\npreserves the gradient flow structure at the particle level. We prove the\nconvergence of the discrete gradient flow to the continuum gradient flow on the\n$L^p$-Wasserstein space over $\\mathbb R$, specifically to the doubly nonlinear\ndiffusion equation in one dimension.\n","authors":["Rong Lei"],"pdf_url":"https://arxiv.org/pdf/2501.03459v1.pdf","comment":"arXiv admin note: text overlap with arXiv:1605.08086 by other authors"},{"id":"http://arxiv.org/abs/2501.03443v1","updated":"2025-01-07T00:09:52Z","published":"2025-01-07T00:09:52Z","title":"Optimization Learning","summary":" This article introduces the concept of optimization learning, a methodology\nto design optimization proxies that learn the input/output mapping of\nparametric optimization problems. These optimization proxies are trustworthy by\ndesign: they compute feasible solutions to the underlying optimization\nproblems, provide quality guarantees on the returned solutions, and scale to\nlarge instances. Optimization proxies are differentiable programs that combine\ntraditional deep learning technology with repair or completion layers to\nproduce feasible solutions. The article shows that optimization proxies can be\ntrained end-to-end in a self-supervised way. It presents methodologies to\nprovide performance guarantees and to scale optimization proxies to large-scale\noptimization problems. 
The potential of optimization proxies is highlighted\nthrough applications in power systems and, in particular, real-time risk\nassessment and security-constrained optimal power flow.\n","authors":["Pascal Van Hentenryck"],"pdf_url":"https://arxiv.org/pdf/2501.03443v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04165v1","updated":"2025-01-07T22:25:40Z","published":"2025-01-07T22:25:40Z","title":"Unifying restart accelerated gradient and proximal bundle methods","summary":" This paper presents a novel restarted version of Nesterov's accelerated\ngradient method and establishes its optimal iteration-complexity for solving\nconvex smooth composite optimization problems. The proposed restart accelerated\ngradient method is shown to be a specific instance of the accelerated inexact\nproximal point framework introduced in \"An accelerated hybrid proximal\nextragradient method for convex optimization and its implications to\nsecond-order methods\" by Monteiro and Svaiter, SIAM Journal on Optimization,\n2013. Furthermore, this work examines the proximal bundle method within the\ninexact proximal point framework, demonstrating that it is an instance of the\nframework. Notably, this paper provides new insights into the underlying\nalgorithmic principle that unifies two seemingly disparate optimization\nmethods, namely, the restart accelerated gradient and the proximal bundle\nmethods.\n","authors":["Jiaming Liang"],"pdf_url":"https://arxiv.org/pdf/2501.04165v1.pdf","comment":"14 pages"},{"id":"http://arxiv.org/abs/2501.04160v1","updated":"2025-01-07T22:19:06Z","published":"2025-01-07T22:19:06Z","title":"Collaborative Spacecraft Servicing under Partial Feedback using\n Lyapunov-based Deep Neural Networks","summary":" Multi-agent systems are increasingly applied in space missions, including\ndistributed space systems, resilient constellations, and autonomous rendezvous\nand docking operations. 
A critical emerging application is collaborative\nspacecraft servicing, which encompasses on-orbit maintenance, space debris\nremoval, and swarm-based satellite repositioning. These missions involve\nservicing spacecraft interacting with malfunctioning or defunct spacecraft\nunder challenging conditions, such as limited state information, measurement\ninaccuracies, and erratic target behaviors. Existing approaches often rely on\nassumptions of full state knowledge or single-integrator dynamics, which are\nimpractical for real-world applications involving second-order spacecraft\ndynamics. This work addresses these challenges by developing a distributed\nstate estimation and tracking framework that requires only relative position\nmeasurements and operates under partial state information. A novel\n$\\rho$-filter is introduced to reconstruct unknown states using locally\navailable information, and a Lyapunov-based deep neural network adaptive\ncontroller is developed that adaptively compensates for uncertainties stemming\nfrom unknown spacecraft dynamics. To ensure the collaborative spacecraft\nregulation problem is well-posed, a trackability condition is defined. A\nLyapunov-based stability analysis is provided to ensure exponential convergence\nof errors in state estimation and spacecraft regulation to a neighborhood of\nthe origin under the trackability condition. The developed method eliminates\nthe need for expensive velocity sensors or extensive pre-training, offering a\npractical and robust solution for spacecraft servicing in complex, dynamic\nenvironments.\n","authors":["Cristian F. Nino","Omkar Sudhir Patil","Christopher D. Petersen","Sean Phillips","Warren E. 
Dixon"],"pdf_url":"https://arxiv.org/pdf/2501.04160v1.pdf","comment":"24 pages, 4 Figures, Journal"},{"id":"http://arxiv.org/abs/2501.04151v1","updated":"2025-01-07T21:37:41Z","published":"2025-01-07T21:37:41Z","title":"Efficient LP warmstarting for linear modifications of the constraint\n matrix","summary":" We consider the problem of computing the optimal solution and objective of a\nlinear program under linearly changing linear constraints. More specifically,\nwe want to compute the optimal solution of a linear optimization where the\nconstraint matrix linearly depends on a parameter that can take p different\nvalues. Based on the information given by a precomputed basis, we present three\nefficient LP warm-starting algorithms. Each algorithm is either based on the\neigenvalue decomposition, the Schur decomposition, or a tweaked eigenvalue\ndecomposition to evaluate the optimal solution and optimal objective of these\nproblems. The three algorithms have an overall complexity O(m^3 + pm^2) where m\nis the number of constraints of the original problem and p the number of values\nof the parameter that we want to evaluate. We also provide theorems related to\nthe optimality conditions to verify when a basis is still optimal and a local\nbound on the objective.\n","authors":["Guillaume Derval","Bardhyl Miftari","Damien Ernst","Quentin Louveaux"],"pdf_url":"https://arxiv.org/pdf/2501.04151v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04143v1","updated":"2025-01-07T21:12:08Z","published":"2025-01-07T21:12:08Z","title":"Linear Optimization for the Perfect Meal: A Data-Driven Approach to\n Optimising the Perfect Meal Using Gurobi","summary":" This study aims to optimize meal planning for nutritional health and cost\nefficiency using linear programming. Linear optimization provides an effective\nframework for addressing the problem of an optimal diet, as the composition of\nfood can be naturally modeled as a linearly additive system. 
Leveraging a\ncomprehensive nutrition dataset, our model minimizes meal costs while meeting\nspecific nutritional requirements. We explore additional complexities, such as\nfractional weights and nutrient ratio constraints, enhancing the robustness of\nthe solution. Case studies address common nutritional challenges, providing\ntailored diet plans. The significance lies in aiding individuals to form\nbalanced, cost-effective dietary schedules, considering fitness goals and\ncaloric needs. This research contributes to efficient, sustainable, and\ntime-sensitive meal planning, emphasizing the intersection of nutrition,\noptimization, and real-world applicability.\n","authors":["Utkarsh Prajapati","Tanushree Jain","Abhishek Machiraju","Divyam Kaushik"],"pdf_url":"https://arxiv.org/pdf/2501.04143v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04134v1","updated":"2025-01-07T20:46:59Z","published":"2025-01-07T20:46:59Z","title":"Mixing Times and Privacy Analysis for the Projected Langevin Algorithm\n under a Modulus of Continuity","summary":" We study the mixing time of the projected Langevin algorithm (LA) and the\nprivacy curve of noisy Stochastic Gradient Descent (SGD), beyond nonexpansive\niterations. Specifically, we derive new mixing time bounds for the projected LA\nwhich are, in some important cases, dimension-free and poly-logarithmic on the\naccuracy, closely matching the existing results in the smooth convex case.\nAdditionally, we establish new upper bounds for the privacy curve of the\nsubsampled noisy SGD algorithm. These bounds show a crucial dependency on the\nregularity of gradients, and are useful for a wide range of convex losses\nbeyond the smooth case. Our analysis relies on a suitable extension of the\nPrivacy Amplification by Iteration (PABI) framework (Feldman et al., 2018;\nAltschuler and Talwar, 2022, 2023) to noisy iterations whose gradient map is\nnot necessarily nonexpansive. 
This extension is achieved by designing an\noptimization problem which accounts for the best possible R\\'enyi divergence\nbound obtained by an application of PABI, where the tractability of the problem\nis crucially related to the modulus of continuity of the associated gradient\nmapping. We show that, in several interesting cases -- including the nonsmooth\nconvex, weakly smooth and (strongly) dissipative -- such optimization problem\ncan be solved exactly and explicitly. This yields the tightest possible\nPABI-based bounds, where our results are either new or substantially sharper\nthan those in previous works.\n","authors":["Mario Bravo","Juan P. Flores-Mella","Cristóbal Guzmán"],"pdf_url":"https://arxiv.org/pdf/2501.04134v1.pdf","comment":"40 pages, 2 figures"},{"id":"http://arxiv.org/abs/2401.14554v2","updated":"2025-01-07T20:44:10Z","published":"2024-01-25T22:49:13Z","title":"GCBF+: A Neural Graph Control Barrier Function Framework for Distributed\n Safe Multi-Agent Control","summary":" Distributed, scalable, and safe control of large-scale multi-agent systems is\na challenging problem. In this paper, we design a distributed framework for\nsafe multi-agent control in large-scale environments with obstacles, where a\nlarge number of agents are required to maintain safety using only local\ninformation and reach their goal locations. We introduce a new class of\ncertificates, termed graph control barrier function (GCBF), which are based on\nthe well-established control barrier function theory for safety guarantees and\nutilize a graph structure for scalable and generalizable distributed control of\nMAS. We develop a novel theoretical framework to prove the safety of an\narbitrary-sized MAS with a single GCBF. We propose a new training framework\nGCBF+ that uses graph neural networks to parameterize a candidate GCBF and a\ndistributed control policy. 
The proposed framework is distributed and is\ncapable of taking point clouds from LiDAR, instead of actual state information,\nfor real-world robotic applications. We illustrate the efficacy of the proposed\nmethod through various hardware experiments on a swarm of drones with\nobjectives ranging from exchanging positions to docking on a moving target\nwithout collision. Additionally, we perform extensive numerical experiments,\nwhere the number and density of agents, as well as the number of obstacles,\nincrease. Empirical results show that in complex environments with agents with\nnonlinear dynamics (e.g., Crazyflie drones), GCBF+ outperforms the hand-crafted\nCBF-based method with the best performance by up to 20% for relatively\nsmall-scale MAS with up to 256 agents, and leading reinforcement learning (RL)\nmethods by up to 40% for MAS with 1024 agents. Furthermore, the proposed method\ndoes not compromise on the performance, in terms of goal reaching, for\nachieving high safety rates, which is a common trade-off in RL-based methods.\n","authors":["Songyuan Zhang","Oswin So","Kunal Garg","Chuchu Fan"],"pdf_url":"https://arxiv.org/pdf/2401.14554v2.pdf","comment":"20 pages, 15 figures; Accepted by IEEE Transactions on Robotics\n (T-RO)"},{"id":"http://arxiv.org/abs/2402.16623v2","updated":"2025-01-07T20:06:34Z","published":"2024-02-26T14:53:39Z","title":"Generalized sparsity-promoting solvers for Bayesian inverse problems:\n Versatile sparsifying transforms and unknown noise variances","summary":" Bayesian hierarchical models can provide efficient algorithms for finding\nsparse solutions to ill-posed inverse problems. The models typically comprise a\nconditionally Gaussian prior model for the unknown which is augmented by a\ngeneralized gamma hyper-prior model for variance hyper-parameters. 
This\ninvestigation generalizes these models and their efficient maximum a posteriori\n(MAP) estimation using the iterative alternating sequential (IAS) algorithm in\ntwo ways: (1) General sparsifying transforms: Diverging from conventional\nmethods, our approach permits the use of sparsifying transformations with\nnontrivial kernels; (2) Unknown noise variances: We treat the noise variance as\na random variable that is estimated during the inference procedure. This is\nimportant in applications where the noise variance cannot be accurately\nestimated a priori. Remarkably, these augmentations neither significantly\nburden the computational expense of the algorithm nor compromise its efficacy.\nWe include convexity and convergence analysis for the method and demonstrate\nits efficacy in several numerical experiments.\n","authors":["Jonathan Lindbloom","Jan Glaubitz","Anne Gelb"],"pdf_url":"https://arxiv.org/pdf/2402.16623v2.pdf","comment":"27 pages, 6 figures"},{"id":"http://arxiv.org/abs/2405.10735v2","updated":"2025-01-07T19:59:26Z","published":"2024-05-17T12:29:48Z","title":"Variance-reduction for Variational Inequality Problems with Bregman\n Distance Function","summary":" In this paper, we address variational inequalities (VI) with a finite-sum\nstructure. We introduce a novel single-loop stochastic variance-reduced\nalgorithm, incorporating the Bregman distance function, and establish an\noptimal convergence guarantee under a monotone setting. Additionally, we\nexplore a structured class of non-monotone problems that exhibit weak Minty\nsolutions, and analyze the complexity of our proposed method, highlighting a\nsignificant improvement over existing approaches. 
Numerical experiments are\npresented to demonstrate the performance of our algorithm compared to\nstate-of-the-art methods\n","authors":["Zeinab Alizadeh","Erfan Yazdandoost Hamedani","Afrooz Jalilzadeh"],"pdf_url":"https://arxiv.org/pdf/2405.10735v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04105v1","updated":"2025-01-07T19:29:10Z","published":"2025-01-07T19:29:10Z","title":"DeepVIVONet: Using deep neural operators to optimize sensor locations\n with application to vortex-induced vibrations","summary":" We introduce DeepVIVONet, a new framework for optimal dynamic reconstruction\nand forecasting of the vortex-induced vibrations (VIV) of a marine riser, using\nfield data. We demonstrate the effectiveness of DeepVIVONet in accurately\nreconstructing the motion of an off--shore marine riser by using sparse\nspatio-temporal measurements. We also show the generalization of our model in\nextrapolating to other flow conditions via transfer learning, underscoring its\npotential to streamline operational efficiency and enhance predictive accuracy.\nThe trained DeepVIVONet serves as a fast and accurate surrogate model for the\nmarine riser, which we use in an outer--loop optimization algorithm to obtain\nthe optimal locations for placing the sensors. Furthermore, we employ an\nexisting sensor placement method based on proper orthogonal decomposition (POD)\nto compare with our data-driven approach. 
We find that while POD offers a\ngood approach for initial sensor placement, DeepVIVONet's adaptive capabilities\nyield more precise and cost-effective configurations.\n","authors":["Ruyin Wan","Ehsan Kharazmi","Michael S Triantafyllou","George Em Karniadakis"],"pdf_url":"https://arxiv.org/pdf/2501.04105v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.08637v2","updated":"2025-01-07T19:26:17Z","published":"2024-06-12T20:50:26Z","title":"A Game Between Two Identical Dubins Cars: Evading a Conic Sensor in\n Minimum Time","summary":" A fundamental task in mobile robotics is keeping an intelligent agent under\nsurveillance with an autonomous robot as it travels in the environment. This\nwork studies a theoretical version of that problem involving one of the most\npopular vehicle platforms in robotics. In particular, we consider two identical\nDubins cars moving on a plane without obstacles. One of them plays as the\npursuer, and it is equipped with a limited field-of-view detection region\nmodeled as a semi-infinite cone with its apex at the pursuer's position. The\npursuer aims to maintain the other Dubins car, which plays as the evader, as\nmuch time as possible inside its detection region. On the contrary, the evader\nwants to escape as soon as possible. In this work, employing differential game\ntheory, we find the time-optimal motion strategies near the game's end. The\nanalysis of those trajectories reveals the existence of at least two singular\nsurfaces: a Transition Surface (also known as a Switch Surface) and an Evader's\nUniversal Surface. 
We also found that the barrier's standard construction\nproduces a surface that partially lies outside the playing space.\n","authors":["Ubaldo Ruiz"],"pdf_url":"https://arxiv.org/pdf/2406.08637v2.pdf","comment":"35 pages, 16 figures"},{"id":"http://arxiv.org/abs/2501.02098v2","updated":"2025-01-07T14:20:44Z","published":"2025-01-03T20:51:07Z","title":"Graph-Based Modeling and Decomposition of Hierarchical Optimization\n Problems","summary":" We present a graph-theoretic modeling approach for hierarchical optimization\nthat leverages the OptiGraph abstraction implemented in the Julia package\nPlasmo$.$jl. We show that the abstraction is flexible and can effectively\ncapture complex hierarchical connectivity that arises from decision-making over\nmultiple spatial and temporal scales (e.g., integration of planning,\nscheduling, and operations in manufacturing and infrastructures). We also show\nthat the graph abstraction facilitates the conceptualization and implementation\nof decomposition and approximation schemes. Specifically, we propose a\ngraph-based Benders decomposition (gBD) framework that enables the exploitation\nof hierarchical (nested) structures and that uses graph\naggregation/partitioning procedures to discover such structures. In addition,\nwe provide a Julia implementation of gBD, which we call PlasmoBenders$.$jl. We\nillustrate the capabilities using examples arising in the context of energy and\npower systems.\n","authors":["David L. Cole","Filippo Pecci","Omar J. Guerra","Harsha Gangammanavar","Jesse D. Jenkins","Victor M. 
Zavala"],"pdf_url":"https://arxiv.org/pdf/2501.02098v2.pdf","comment":"66 pages, 3 tables, 28 figures, updated abstract"},{"id":"http://arxiv.org/abs/2112.02215v3","updated":"2025-01-07T20:32:52Z","published":"2021-12-04T01:40:34Z","title":"Deep Policy Iteration with Integer Programming for Inventory Management","summary":" We present a Reinforcement Learning (RL) based framework for optimizing\nlong-term discounted reward problems with large combinatorial action space and\nstate dependent constraints. These characteristics are common to many\noperations management problems, e.g., network inventory replenishment, where\nmanagers have to deal with uncertain demand, lost sales, and capacity\nconstraints that result in more complex feasible action spaces. Our proposed\nProgrammable Actor Reinforcement Learning (PARL) uses a deep-policy iteration\nmethod that leverages neural networks (NNs) to approximate the value function\nand combines it with mathematical programming (MP) and sample average\napproximation (SAA) to solve the per-step-action optimally while accounting for\ncombinatorial action spaces and state-dependent constraint sets. We show how\nthe proposed methodology can be applied to complex inventory replenishment\nproblems where analytical solutions are intractable. We also benchmark the\nproposed algorithm against state-of-the-art RL algorithms and commonly used\nreplenishment heuristics and find it considerably outperforms existing methods\nby as much as 14.7% on average in various complex supply chain settings. We\nfind that this improvement of PARL over benchmark algorithms can be directly\nattributed to better inventory cost management, especially in inventory\nconstrained settings. Furthermore, in the simpler setting where optimal\nreplenishment policy is tractable or known near optimal heuristics exist, we\nfind that the RL approaches can learn near optimal policies. 
Finally, to make\nRL algorithms more accessible for inventory management researchers, we also\ndiscuss the development of a modular Python library that can be used to test\nthe performance of RL algorithms with various supply chain structures and spur\nfuture research in developing practical and near-optimal algorithms for\ninventory management problems.\n","authors":["Pavithra Harsha","Ashish Jagmohan","Jayant Kalagnanam","Brian Quanz","Divya Singhvi"],"pdf_url":"https://arxiv.org/pdf/2112.02215v3.pdf","comment":"Prior shorter version accepted to NeurIPS 2021 Deep RL Workshop.\n Updated version to appear in MSOM journal. Authors are listed in alphabetical\n order"},{"id":"http://arxiv.org/abs/2501.05481v1","updated":"2025-01-07T19:43:35Z","published":"2025-01-07T19:43:35Z","title":"Blackwell Equilibrium in Repeated Games","summary":" We apply Blackwell optimality to repeated games. An equilibrium whose\nstrategy profile is sequentially rational for all high enough discount factors\nsimultaneously is a Blackwell (subgame-perfect, perfect public, etc.)\nequilibrium. The bite of this requirement depends on the monitoring structure.\nUnder perfect monitoring, a ``folk'' theorem holds relative to an appropriate\nnotion of minmax. Under imperfect public monitoring, absent a public\nrandomization device, any perfect public equilibrium generically involves pure\naction profiles or stage-game Nash equilibria only. 
Under private conditionally\nindependent monitoring, in a class of games that includes the prisoner's\ndilemma, the stage-game Nash equilibrium is played in every round.\n","authors":["Costas Cavounidis","Sambuddha Ghosh","Johannes Hörner","Eilon Solan","Satoru Takahashi"],"pdf_url":"https://arxiv.org/pdf/2501.05481v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2111.14104v2","updated":"2025-01-07T03:02:38Z","published":"2021-11-28T11:01:48Z","title":"Optimal Partition for Multi-Type Queueing System","summary":" We study an optimal server partition and customer assignment problem for an\nuncapacitated FCFS queueing system with heterogeneous types of customers. Each\ntype of customers is associated with a Poisson arrival, a certain service time\ndistribution, and a unit waiting cost. The goal is to minimize the expected\ntotal waiting cost by partitioning the server into sub-queues, each with a\nsmaller service capacity, and routing customer types probabilistically. First,\nwe show that by properly partitioning the queue, it is possible to reduce the\nexpected waiting costs by an arbitrarily large ratio. Then, we show that for\nany given server partition, the optimal customer assignment admits a certain\ngeometric structure, enabling an efficient algorithm to find the optimal\nassignment. Such an optimal structure also applies when minimizing the expected\nsojourn time. Finally, we consider the joint partition-assignment optimization\nproblem. The customer assignment under the optimal server partition admits a\nstronger structure. Specifically, if the first two moments of the service time\ndistributions satisfy certain properties, it is optimal to deterministically\nassign customer types with consecutive service rates to the same sub-queue.\nThis structure allows for more efficient algorithms. 
Overall, the common rule\nof thumb to partition customers into continuous segments ranked by service\nrates could be suboptimal, and our work is the first to comprehensively study\nthe queue partition problem based on customer types.\n","authors":["Shengyu Cao","Simai He","Zizhuo Wang","Yifan Feng"],"pdf_url":"https://arxiv.org/pdf/2111.14104v2.pdf","comment":null}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2501.04005v1","updated":"2025-01-07T18:59:59Z","published":"2025-01-07T18:59:59Z","title":"LargeAD: Large-Scale Cross-Sensor Data Pretraining for Autonomous\n Driving","summary":" Recent advancements in vision foundation models (VFMs) have revolutionized\nvisual perception in 2D, yet their potential for 3D scene understanding,\nparticularly in autonomous driving applications, remains underexplored. In this\npaper, we introduce LargeAD, a versatile and scalable framework designed for\nlarge-scale 3D pretraining across diverse real-world driving datasets. Our\nframework leverages VFMs to extract semantically rich superpixels from 2D\nimages, which are aligned with LiDAR point clouds to generate high-quality\ncontrastive samples. This alignment facilitates cross-modal representation\nlearning, enhancing the semantic consistency between 2D and 3D data. We\nintroduce several key innovations: i) VFM-driven superpixel generation for\ndetailed semantic representation, ii) a VFM-assisted contrastive learning\nstrategy to align multimodal features, iii) superpoint temporal consistency to\nmaintain stable representations across time, and iv) multi-source data\npretraining to generalize across various LiDAR configurations. Our approach\ndelivers significant performance improvements over state-of-the-art methods in\nboth linear probing and fine-tuning tasks for both LiDAR-based segmentation and\nobject detection. 
Extensive experiments on eleven large-scale multi-modal\ndatasets highlight our superior performance, demonstrating the adaptability,\nefficiency, and robustness in real-world autonomous driving scenarios.\n","authors":["Lingdong Kong","Xiang Xu","Youquan Liu","Jun Cen","Runnan Chen","Wenwei Zhang","Liang Pan","Kai Chen","Ziwei Liu"],"pdf_url":"https://arxiv.org/pdf/2501.04005v1.pdf","comment":"Preprint; 16 pages, 7 figures, 8 tables; Project Page at\n https://ldkong.com/LargeAD"},{"id":"http://arxiv.org/abs/2501.04004v1","updated":"2025-01-07T18:59:58Z","published":"2025-01-07T18:59:58Z","title":"LiMoE: Mixture of LiDAR Representation Learners from Automotive Scenes","summary":" LiDAR data pretraining offers a promising approach to leveraging large-scale,\nreadily available datasets for enhanced data utilization. However, existing\nmethods predominantly focus on sparse voxel representation, overlooking the\ncomplementary attributes provided by other LiDAR representations. In this work,\nwe propose LiMoE, a framework that integrates the Mixture of Experts (MoE)\nparadigm into LiDAR data representation learning to synergistically combine\nmultiple representations, such as range images, sparse voxels, and raw points.\nOur approach consists of three stages: i) Image-to-LiDAR Pretraining, which\ntransfers prior knowledge from images to point clouds across different\nrepresentations; ii) Contrastive Mixture Learning (CML), which uses MoE to\nadaptively activate relevant attributes from each representation and distills\nthese mixed features into a unified 3D network; iii) Semantic Mixture\nSupervision (SMS), which combines semantic logits from multiple representations\nto boost downstream segmentation performance. Extensive experiments across 11\nlarge-scale LiDAR datasets demonstrate our effectiveness and superiority. 
The\ncode and model checkpoints have been made publicly accessible.\n","authors":["Xiang Xu","Lingdong Kong","Hui Shuai","Liang Pan","Ziwei Liu","Qingshan Liu"],"pdf_url":"https://arxiv.org/pdf/2501.04004v1.pdf","comment":"Preprint; 26 pages, 17 figures, 7 tables; Project Page at\n https://ldkong.com/LiMoE"},{"id":"http://arxiv.org/abs/2501.04003v1","updated":"2025-01-07T18:59:55Z","published":"2025-01-07T18:59:55Z","title":"Are VLMs Ready for Autonomous Driving? An Empirical Study from the\n Reliability, Data, and Metric Perspectives","summary":" Recent advancements in Vision-Language Models (VLMs) have sparked interest in\ntheir use for autonomous driving, particularly in generating interpretable\ndriving decisions through natural language. However, the assumption that VLMs\ninherently provide visually grounded, reliable, and interpretable explanations\nfor driving remains largely unexamined. To address this gap, we introduce\nDriveBench, a benchmark dataset designed to evaluate VLM reliability across 17\nsettings (clean, corrupted, and text-only inputs), encompassing 19,200 frames,\n20,498 question-answer pairs, three question types, four mainstream driving\ntasks, and a total of 12 popular VLMs. Our findings reveal that VLMs often\ngenerate plausible responses derived from general knowledge or textual cues\nrather than true visual grounding, especially under degraded or missing visual\ninputs. This behavior, concealed by dataset imbalances and insufficient\nevaluation metrics, poses significant risks in safety-critical scenarios like\nautonomous driving. We further observe that VLMs struggle with multi-modal\nreasoning and display heightened sensitivity to input corruptions, leading to\ninconsistencies in performance. To address these challenges, we propose refined\nevaluation metrics that prioritize robust visual grounding and multi-modal\nunderstanding. 
Additionally, we highlight the potential of leveraging VLMs'\nawareness of corruptions to enhance their reliability, offering a roadmap for\ndeveloping more trustworthy and interpretable decision-making systems in\nreal-world autonomous driving contexts. The benchmark toolkit is publicly\naccessible.\n","authors":["Shaoyuan Xie","Lingdong Kong","Yuhao Dong","Chonghao Sima","Wenwei Zhang","Qi Alfred Chen","Ziwei Liu","Liang Pan"],"pdf_url":"https://arxiv.org/pdf/2501.04003v1.pdf","comment":"Preprint; 41 pages, 32 figures, 16 tables; Project Page at\n https://drive-bench.github.io/"},{"id":"http://arxiv.org/abs/2501.04002v1","updated":"2025-01-07T18:59:28Z","published":"2025-01-07T18:59:28Z","title":"Extraction Of Cumulative Blobs From Dynamic Gestures","summary":" Gesture recognition is a perceptual user interface, which is based on CV\ntechnology that allows the computer to interpret human motions as commands,\nallowing users to communicate with a computer without the use of hands, thus\nmaking the mouse and keyboard superfluous. Gesture recognition's main weakness\nis lighting conditions, because gesture control is based on computer vision,\nwhich heavily relies on cameras. These cameras are used to interpret gestures in 2D\nand 3D, so the extracted information can vary depending on the source of light.\nThe limitation is that the system cannot work in a dark environment. A simple night\nvision camera can be used as our camera for motion capture, as it also blasts\nout infrared light, which is not visible to humans but can be clearly seen with\na camera that has no infrared filter; this largely overcomes the limitation of\nsystems which cannot work in a dark environment. 
So, the video stream from the\ncamera is fed into a Raspberry Pi running a Python program with the OpenCV\nmodule, which is used for detecting, isolating, and tracking the path of the\ndynamic gesture; we then use a machine learning algorithm to recognize the\npattern drawn and accordingly control the GPIOs of the Raspberry Pi to perform\nsome activities.\n","authors":["Rishabh Naulakha","Shubham Gaur","Dhairya Lodha","Mehek Tulsyan","Utsav Kotecha"],"pdf_url":"https://arxiv.org/pdf/2501.04002v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04001v1","updated":"2025-01-07T18:58:54Z","published":"2025-01-07T18:58:54Z","title":"Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of\n Images and Videos","summary":" This work presents Sa2VA, the first unified model for dense grounded\nunderstanding of both images and videos. Unlike existing multi-modal large\nlanguage models, which are often limited to specific modalities and tasks,\nSa2VA supports a wide range of image and video tasks, including referring\nsegmentation and conversation, with minimal one-shot instruction tuning. Sa2VA\ncombines SAM-2, a foundation video segmentation model, with LLaVA, an advanced\nvision-language model, and unifies text, image, and video into a shared LLM\ntoken space. Using the LLM, Sa2VA generates instruction tokens that guide SAM-2\nin producing precise masks, enabling a grounded, multi-modal understanding of\nboth static and dynamic visual content. Additionally, we introduce Ref-SAV, an\nauto-labeled dataset containing over 72k object expressions in complex video\nscenes, designed to boost model performance. We also manually validate 2k video\nobjects in the Ref-SAV datasets to benchmark referring video object\nsegmentation in complex environments. 
Experiments show that Sa2VA achieves\nstate-of-the-art across multiple tasks, particularly in referring video object\nsegmentation, highlighting its potential for complex real-world applications.\n","authors":["Haobo Yuan","Xiangtai Li","Tao Zhang","Zilong Huang","Shilin Xu","Shunping Ji","Yunhai Tong","Lu Qi","Jiashi Feng","Ming-Hsuan Yang"],"pdf_url":"https://arxiv.org/pdf/2501.04001v1.pdf","comment":"Project page: https://lxtgh.github.io/project/sa2va"},{"id":"http://arxiv.org/abs/2501.03995v1","updated":"2025-01-07T18:52:05Z","published":"2025-01-07T18:52:05Z","title":"RAG-Check: Evaluating Multimodal Retrieval Augmented Generation\n Performance","summary":" Retrieval-augmented generation (RAG) improves large language models (LLMs) by\nusing external knowledge to guide response generation, reducing hallucinations.\nHowever, RAG, particularly multi-modal RAG, can introduce new hallucination\nsources: (i) the retrieval process may select irrelevant pieces (e.g.,\ndocuments, images) as raw context from the database, and (ii) retrieved images\nare processed into text-based context via vision-language models (VLMs) or\ndirectly used by multi-modal language models (MLLMs) like GPT-4o, which may\nhallucinate. To address this, we propose a novel framework to evaluate the\nreliability of multi-modal RAG using two performance measures: (i) the\nrelevancy score (RS), assessing the relevance of retrieved entries to the\nquery, and (ii) the correctness score (CS), evaluating the accuracy of the\ngenerated response. We train RS and CS models using a ChatGPT-derived database\nand human evaluator samples. Results show that both models achieve ~88%\naccuracy on test data. Additionally, we construct a 5000-sample human-annotated\ndatabase evaluating the relevancy of retrieved pieces and the correctness of\nresponse statements. Our RS model aligns with human preferences 20% more often\nthan CLIP in retrieval, and our CS model matches human preferences ~91% of the\ntime. 
Finally, we assess various RAG systems' selection and generation\nperformances using RS and CS.\n","authors":["Matin Mortaheb","Mohammad A. Amir Khojastepour","Srimat T. Chakradhar","Sennur Ulukus"],"pdf_url":"https://arxiv.org/pdf/2501.03995v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03992v1","updated":"2025-01-07T18:50:06Z","published":"2025-01-07T18:50:06Z","title":"NeuralSVG: An Implicit Representation for Text-to-Vector Generation","summary":" Vector graphics are essential in design, providing artists with a versatile\nmedium for creating resolution-independent and highly editable visual content.\nRecent advancements in vision-language and diffusion models have fueled\ninterest in text-to-vector graphics generation. However, existing approaches\noften suffer from over-parameterized outputs or treat the layered structure - a\ncore feature of vector graphics - as a secondary goal, diminishing their\npractical use. Recognizing the importance of layered SVG representations, we\npropose NeuralSVG, an implicit neural representation for generating vector\ngraphics from text prompts. Inspired by Neural Radiance Fields (NeRFs),\nNeuralSVG encodes the entire scene into the weights of a small MLP network,\noptimized using Score Distillation Sampling (SDS). To encourage a layered\nstructure in the generated SVG, we introduce a dropout-based regularization\ntechnique that strengthens the standalone meaning of each shape. We\nadditionally demonstrate that utilizing a neural representation provides an\nadded benefit of inference-time control, enabling users to dynamically adapt\nthe generated SVG based on user-provided inputs, all with a single learned\nrepresentation. 
Through extensive qualitative and quantitative evaluations, we\ndemonstrate that NeuralSVG outperforms existing methods in generating\nstructured and flexible SVG.\n","authors":["Sagi Polaczek","Yuval Alaluf","Elad Richardson","Yael Vinker","Daniel Cohen-Or"],"pdf_url":"https://arxiv.org/pdf/2501.03992v1.pdf","comment":"Project Page: https://sagipolaczek.github.io/NeuralSVG/"},{"id":"http://arxiv.org/abs/2406.14794v5","updated":"2025-01-07T18:49:42Z","published":"2024-06-20T23:51:32Z","title":"ImageFlowNet: Forecasting Multiscale Image-Level Trajectories of Disease\n Progression with Irregularly-Sampled Longitudinal Medical Images","summary":" Advances in medical imaging technologies have enabled the collection of\nlongitudinal images, which involve repeated scanning of the same patients over\ntime, to monitor disease progression. However, predictive modeling of such data\nremains challenging due to high dimensionality, irregular sampling, and data\nsparsity. To address these issues, we propose ImageFlowNet, a novel model\ndesigned to forecast disease trajectories from initial images while preserving\nspatial details. ImageFlowNet first learns multiscale joint representation\nspaces across patients and time points, then optimizes deterministic or\nstochastic flow fields within these spaces using a position-parameterized\nneural ODE/SDE framework. The model leverages a UNet architecture to create\nrobust multiscale representations and mitigates data scarcity by combining\nknowledge from all patients. We provide theoretical insights that support our\nformulation of ODEs, and motivate our regularizations involving high-level\nvisual features, latent space organization, and trajectory smoothness. We\nvalidate ImageFlowNet on three longitudinal medical image datasets depicting\nprogression in geographic atrophy, multiple sclerosis, and glioblastoma,\ndemonstrating its ability to effectively forecast disease progression and\noutperform existing methods. 
Our contributions include the development of\nImageFlowNet, its theoretical underpinnings, and empirical validation on\nreal-world datasets. The official implementation is available at\nhttps://github.com/KrishnaswamyLab/ImageFlowNet.\n","authors":["Chen Liu","Ke Xu","Liangbo L. Shen","Guillaume Huguet","Zilong Wang","Alexander Tong","Danilo Bzdok","Jay Stewart","Jay C. Wang","Lucian V. Del Priore","Smita Krishnaswamy"],"pdf_url":"https://arxiv.org/pdf/2406.14794v5.pdf","comment":"Accepted to ICASSP 2025"},{"id":"http://arxiv.org/abs/2501.03968v1","updated":"2025-01-07T18:06:27Z","published":"2025-01-07T18:06:27Z","title":"VLM-driven Behavior Tree for Context-aware Task Planning","summary":" The use of Large Language Models (LLMs) for generating Behavior Trees (BTs)\nhas recently gained attention in the robotics community, yet remains in its\nearly stages of development. In this paper, we propose a novel framework that\nleverages Vision-Language Models (VLMs) to interactively generate and edit BTs\nthat address visual conditions, enabling context-aware robot operations in\nvisually complex environments. A key feature of our approach lies in the\nconditional control through self-prompted visual conditions. Specifically, the\nVLM generates BTs with visual condition nodes, where conditions are expressed\nas free-form text. Another VLM process integrates the text into its prompt and\nevaluates the conditions against real-world images during robot execution. We\nvalidated our framework in a real-world cafe scenario, demonstrating both its\nfeasibility and limitations.\n","authors":["Naoki Wake","Atsushi Kanehira","Jun Takamatsu","Kazuhiro Sasabuchi","Katsushi Ikeuchi"],"pdf_url":"https://arxiv.org/pdf/2501.03968v1.pdf","comment":"10 pages, 11 figures, 5 tables. 
Last updated on January 7th, 2024"},{"id":"http://arxiv.org/abs/2501.03967v1","updated":"2025-01-07T18:05:24Z","published":"2025-01-07T18:05:24Z","title":"Temporal Feature Weaving for Neonatal Echocardiographic Viewpoint Video\n Classification","summary":" Automated viewpoint classification in echocardiograms can help\nunder-resourced clinics and hospitals in providing faster diagnosis and\nscreening when expert technicians may not be available. We propose a novel\napproach towards echocardiographic viewpoint classification. We show that\ntreating viewpoint classification as video classification rather than image\nclassification yields an advantage. We propose a CNN-GRU architecture with a novel\ntemporal feature weaving method, which leverages both spatial and temporal\ninformation to yield a 4.33\% increase in accuracy over baseline image\nclassification while using only four consecutive frames. The proposed approach\nincurs minimal computational overhead. Additionally, we publish the Neonatal\nEchocardiogram Dataset (NED), a professionally-annotated dataset providing\nsixteen viewpoints and associated echocardiography videos to encourage future\nwork and development in this field. Code available at:\nhttps://github.com/satchelfrench/NED\n","authors":["Satchel French","Faith Zhu","Amish Jain","Naimul Khan"],"pdf_url":"https://arxiv.org/pdf/2501.03967v1.pdf","comment":"Accepted to ISBI 2025"},{"id":"http://arxiv.org/abs/2501.03957v1","updated":"2025-01-07T17:37:57Z","published":"2025-01-07T17:37:57Z","title":"Vision Language Models as Values Detectors","summary":" Large Language Models integrating textual and visual inputs have introduced\nnew possibilities for interpreting complex data. Despite their remarkable\nability to generate coherent and contextually relevant text based on visual\nstimuli, the alignment of these models with human perception in identifying\nrelevant elements in images requires further exploration. 
This paper\ninvestigates the alignment between state-of-the-art LLMs and human annotators\nin detecting elements of relevance within home environment scenarios. We\ncreated a set of twelve images depicting various domestic scenarios and\nenlisted fourteen annotators to identify the key element in each image. We then\ncompared these human responses with outputs from five different LLMs, including\nGPT-4o and four LLaVA variants. Our findings reveal a varied degree of\nalignment, with LLaVA 34B showing the highest performance but still scoring\nlow. However, an analysis of the results highlights the models' potential to\ndetect value-laden elements in images, suggesting that with improved training\nand refined prompts, LLMs could enhance applications in social robotics,\nassistive technologies, and human-computer interaction by providing deeper\ninsights and more contextually relevant responses.\n","authors":["Giulio Antonio Abbo","Tony Belpaeme"],"pdf_url":"https://arxiv.org/pdf/2501.03957v1.pdf","comment":"13 pages, 2 figures"},{"id":"http://arxiv.org/abs/2405.18679v2","updated":"2025-01-07T17:00:36Z","published":"2024-05-29T01:01:19Z","title":"Vim-F: Visual State Space Model Benefiting from Learning in the\n Frequency Domain","summary":" In recent years, State Space Models (SSMs) with efficient hardware-aware\ndesigns, known as the Mamba deep learning models, have made significant\nprogress in modeling long sequences such as language understanding. Therefore,\nbuilding efficient and general-purpose visual backbones based on SSMs is a\npromising direction. Compared to traditional convolutional neural networks\n(CNNs) and Vision Transformers (ViTs), the performance of Vision Mamba (ViM)\nmethods is not yet fully competitive. To enable SSMs to process image data,\nViMs typically flatten 2D images into 1D sequences, inevitably ignoring some 2D\nlocal dependencies, thereby weakening the model's ability to interpret spatial\nrelationships from a global perspective. 
We use Fast Fourier Transform (FFT) to\nobtain the spectrum of the feature map and add it to the original feature map,\nenabling ViM to model a unified visual representation in both frequency and\nspatial domains. The introduction of frequency domain information enables ViM\nto have a global receptive field during scanning. We propose a novel model\ncalled Vim-F, which employs pure Mamba encoders and scans in both the frequency\nand spatial domains. Moreover, we question the necessity of position embedding\nin ViM and remove it accordingly in Vim-F, which helps to fully utilize the\nefficient long-sequence modeling capability of ViM. Finally, we redesign a\npatch embedding for Vim-F, leveraging a convolutional stem to capture more\nlocal correlations, further improving the performance of Vim-F. Code is\navailable at: \\url{https://github.com/yws-wxs/Vim-F}.\n","authors":["Juntao Zhang","Shaogeng Liu","Kun Bian","You Zhou","Pei Zhang","Wenbo An","Jun Zhou","Kun Shao"],"pdf_url":"https://arxiv.org/pdf/2405.18679v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03939v1","updated":"2025-01-07T17:00:35Z","published":"2025-01-07T17:00:35Z","title":"Visual question answering: from early developments to recent advances --\n a survey","summary":" Visual Question Answering (VQA) is an evolving research field aimed at\nenabling machines to answer questions about visual content by integrating image\nand language processing techniques such as feature extraction, object\ndetection, text embedding, natural language understanding, and language\ngeneration. With the growth of multimodal data research, VQA has gained\nsignificant attention due to its broad applications, including interactive\neducational tools, medical image diagnosis, customer service, entertainment,\nand social media captioning. 
Additionally, VQA plays a vital role in assisting\nvisually impaired individuals by generating descriptive content from images.\nThis survey introduces a taxonomy of VQA architectures, categorizing them based\non design choices and key components to facilitate comparative analysis and\nevaluation. We review major VQA approaches, focusing on deep learning-based\nmethods, and explore the emerging field of Large Visual Language Models (LVLMs)\nthat have demonstrated success in multimodal tasks like VQA. The paper further\nexamines available datasets and evaluation metrics essential for measuring VQA\nsystem performance, followed by an exploration of real-world VQA applications.\nFinally, we highlight ongoing challenges and future directions in VQA research,\npresenting open questions and potential areas for further development. This\nsurvey serves as a comprehensive resource for researchers and practitioners\ninterested in the latest advancements and future\n","authors":["Ngoc Dung Huynh","Mohamed Reda Bouadjenek","Sunil Aryal","Imran Razzak","Hakim Hacid"],"pdf_url":"https://arxiv.org/pdf/2501.03939v1.pdf","comment":"20"},{"id":"http://arxiv.org/abs/2501.00625v2","updated":"2025-01-07T16:49:29Z","published":"2024-12-31T19:53:27Z","title":"Gaussian Building Mesh (GBM): Extract a Building's 3D Mesh with Google\n Earth and Gaussian Splatting","summary":" Recently released open-source pre-trained foundational image segmentation and\nobject detection models (SAM2+GroundingDINO) allow for geometrically consistent\nsegmentation of objects of interest in multi-view 2D images. Users can use\ntext-based or click-based prompts to segment objects of interest without\nrequiring labeled training datasets. 
Gaussian Splatting allows for the learning\nof the 3D representation of a scene's geometry and radiance based on 2D images.\nCombining Google Earth Studio, SAM2+GroundingDINO, 2D Gaussian Splatting, and\nour improvements in mask refinement based on morphological operations and\ncontour simplification, we created a pipeline to extract the 3D mesh of any\nbuilding based on its name, address, or geographic coordinates.\n","authors":["Kyle Gao","Liangzhi Li","Hongjie He","Dening Lu","Linlin Xu","Jonathan Li"],"pdf_url":"https://arxiv.org/pdf/2501.00625v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03932v1","updated":"2025-01-07T16:48:47Z","published":"2025-01-07T16:48:47Z","title":"CoStruction: Conjoint radiance field optimization for urban scene\n reconStruction with limited image overlap","summary":" Reconstructing the surrounding surface geometry from recorded driving\nsequences poses a significant challenge due to the limited image overlap and\ncomplex topology of urban environments. SoTA neural implicit surface\nreconstruction methods often struggle in such setting, either failing due to\nsmall vision overlap or exhibiting suboptimal performance in accurately\nreconstructing both the surface and fine structures. To address these\nlimitations, we introduce CoStruction, a novel hybrid implicit surface\nreconstruction method tailored for large driving sequences with limited camera\noverlap. CoStruction leverages cross-representation uncertainty estimation to\nfilter out ambiguous geometry caused by limited observations. Our method\nperforms joint optimization of both radiance fields in addition to guided\nsampling achieving accurate reconstruction of large areas along with fine\nstructures in complex urban scenarios. 
Extensive evaluation on major driving\ndatasets demonstrates the superiority of our approach in reconstructing large\ndriving sequences with limited image overlap, outperforming concurrent SoTA\nmethods.\n","authors":["Fusang Wang","Hala Djeghim","Nathan Piasco","Moussab Bennehar","Luis Roldão","Dzmitry Tsishkou"],"pdf_url":"https://arxiv.org/pdf/2501.03932v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03931v1","updated":"2025-01-07T16:48:31Z","published":"2025-01-07T16:48:31Z","title":"Magic Mirror: ID-Preserved Video Generation in Video Diffusion\n Transformers","summary":" We present Magic Mirror, a framework for generating identity-preserved videos\nwith cinematic-level quality and dynamic motion. While recent advances in video\ndiffusion models have shown impressive capabilities in text-to-video\ngeneration, maintaining consistent identity while producing natural motion\nremains challenging. Previous methods either require person-specific\nfine-tuning or struggle to balance identity preservation with motion diversity.\nBuilt upon Video Diffusion Transformers, our method introduces three key\ncomponents: (1) a dual-branch facial feature extractor that captures both\nidentity and structural features, (2) a lightweight cross-modal adapter with\nConditioned Adaptive Normalization for efficient identity integration, and (3)\na two-stage training strategy combining synthetic identity pairs with video\ndata. Extensive experiments demonstrate that Magic Mirror effectively balances\nidentity consistency with natural motion, outperforming existing methods across\nmultiple metrics while requiring minimal parameters added. The code and model\nwill be made publicly available at:\nhttps://github.com/dvlab-research/MagicMirror/\n","authors":["Yuechen Zhang","Yaoyang Liu","Bin Xia","Bohao Peng","Zexin Yan","Eric Lo","Jiaya Jia"],"pdf_url":"https://arxiv.org/pdf/2501.03931v1.pdf","comment":"It is best viewed in Acrobat. 
Project Page:\n https://julianjuaner.github.io/projects/MagicMirror/"},{"id":"http://arxiv.org/abs/2501.03923v1","updated":"2025-01-07T16:35:29Z","published":"2025-01-07T16:35:29Z","title":"Explainable AI model reveals disease-related mechanisms in single-cell\n RNA-seq data","summary":" Neurodegenerative diseases (NDDs) are complex and lack effective treatment\ndue to their poorly understood mechanism. The increasingly used data analysis\nfrom Single nucleus RNA Sequencing (snRNA-seq) allows to explore transcriptomic\nevents at a single cell level, yet face challenges in interpreting the\nmechanisms underlying a disease. On the other hand, Neural Network (NN) models\ncan handle complex data to offer insights but can be seen as black boxes with\npoor interpretability. In this context, explainable AI (XAI) emerges as a\nsolution that could help to understand disease-associated mechanisms when\ncombined with efficient NN models. However, limited research explores XAI in\nsingle-cell data. In this work, we implement a method for identifying\ndisease-related genes and the mechanistic explanation of disease progression\nbased on NN model combined with SHAP. We analyze available Huntington's disease\n(HD) data to identify both HD-altered genes and mechanisms by adding Gene Set\nEnrichment Analysis (GSEA) comparing two methods, differential gene expression\nanalysis (DGE) and NN combined with SHAP approach. 
Our results show that DGE\nand SHAP approaches offer both common and differential sets of altered genes\nand pathways, reinforcing the usefulness of XAI methods for a broader\nperspective of disease.\n","authors":["Mohammad Usman","Olga Varea","Petia Radeva","Josep Canals","Jordi Abante","Daniel Ortiz"],"pdf_url":"https://arxiv.org/pdf/2501.03923v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03916v1","updated":"2025-01-07T16:31:10Z","published":"2025-01-07T16:31:10Z","title":"Dolphin: Closed-loop Open-ended Auto-research through Thinking,\n Practice, and Feedback","summary":" The scientific research paradigm is undergoing a profound transformation\nowing to the development of Artificial Intelligence (AI). Recent works\ndemonstrate that various AI-assisted research methods can largely improve\nresearch efficiency by improving data analysis, accelerating computation, and\nfostering novel idea generation. To further move towards the ultimate goal\n(i.e., automatic scientific research), in this paper, we propose Dolphin, the\nfirst closed-loop open-ended auto-research framework to further build the\nentire process of human scientific research. Dolphin can generate research\nideas, perform experiments, and get feedback from experimental results to\ngenerate higher-quality ideas. More specifically, Dolphin first generates novel\nideas based on relevant papers which are ranked by the topic and task\nattributes. Then, the codes are automatically generated and debugged with the\nexception-traceback-guided local code structure. Finally, Dolphin automatically\nanalyzes the results of each idea and feeds the results back to the next round\nof idea generation. Experiments are conducted on the benchmark datasets of\ndifferent topics and results show that Dolphin can generate novel ideas\ncontinuously and complete the experiment in a loop. 
We highlight that Dolphin\ncan automatically propose methods that are comparable to the state-of-the-art\nin some tasks such as 2D image classification and 3D point classification.\n","authors":["Jiakang Yuan","Xiangchao Yan","Botian Shi","Tao Chen","Wanli Ouyang","Bo Zhang","Lei Bai","Yu Qiao","Bowen Zhou"],"pdf_url":"https://arxiv.org/pdf/2501.03916v1.pdf","comment":"19 pages, 11 figures, and our homepage:\n https://unimodal4reasoning.github.io/Dolphin-project-page/"},{"id":"http://arxiv.org/abs/2501.03910v1","updated":"2025-01-07T16:24:43Z","published":"2025-01-07T16:24:43Z","title":"HYB-VITON: A Hybrid Approach to Virtual Try-On Combining Explicit and\n Implicit Warping","summary":" Virtual try-on systems have significant potential in e-commerce, allowing\ncustomers to visualize garments on themselves. Existing image-based methods\nfall into two categories: those that directly warp garment-images onto\nperson-images (explicit warping), and those using cross-attention to\nreconstruct given garments (implicit warping). Explicit warping preserves\ngarment details but often produces unrealistic output, while implicit warping\nachieves natural reconstruction but struggles with fine details. We propose\nHYB-VITON, a novel approach that combines the advantages of each method and\nincludes both a preprocessing pipeline for warped garments and a novel training\noption. These components allow us to utilize beneficial regions of explicitly\nwarped garments while leveraging the natural reconstruction of implicit\nwarping. 
A series of experiments demonstrates that HYB-VITON preserves garment\ndetails more faithfully than recent diffusion-based methods, while producing\nmore realistic results than a state-of-the-art explicit warping method.\n","authors":["Kosuke Takemoto","Takafumi Koshinaka"],"pdf_url":"https://arxiv.org/pdf/2501.03910v1.pdf","comment":"Accepted at IEEE ICASSP 2025"},{"id":"http://arxiv.org/abs/2402.16315v4","updated":"2025-01-07T16:05:16Z","published":"2024-02-26T05:43:51Z","title":"Finer: Investigating and Enhancing Fine-Grained Visual Concept\n Recognition in Large Vision Language Models","summary":" Recent advances in instruction-tuned Large Vision-Language Models (LVLMs)\nhave imbued the models with the ability to generate high-level, image-grounded\nexplanations with ease. While such capability is largely attributed to the rich\nworld knowledge contained within the Large Language Models (LLMs), our work\nreveals their shortcomings in fine-grained visual categorization (FGVC) across\nsix different benchmark settings. Most recent state-of-the-art LVLMs like\nLLaVa-1.5, InstructBLIP and GPT-4V not only severely deteriorate in terms of\nclassification performance, e.g., average drop of 65.58 in EM for Stanford Dogs\nfor LLaVA-1.5, but also struggle to generate an accurate explanation with\ndetailed attributes based on the concept that appears within an input image\ndespite their capability to generate holistic image-level descriptions.\nIn-depth analyses show that instruction-tuned LVLMs exhibit modality gap,\nshowing discrepancy when given textual and visual inputs that correspond to the\nsame concept, preventing the image modality from leveraging the rich parametric\nknowledge within the LLMs. 
In an effort to further the community's endeavor in\nthis direction, we propose a multiple granularity attribute-centric evaluation\nbenchmark, Finer, which aims to establish a ground to evaluate LVLMs'\nfine-grained visual comprehension ability and provide significantly improved\nexplainability.\n","authors":["Jeonghwan Kim","Heng Ji"],"pdf_url":"https://arxiv.org/pdf/2402.16315v4.pdf","comment":"EMNLP 2024; Main Conference"},{"id":"http://arxiv.org/abs/2501.03895v1","updated":"2025-01-07T16:03:14Z","published":"2025-01-07T16:03:14Z","title":"LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One\n Vision Token","summary":" The advent of real-time large multimodal models (LMMs) like GPT-4o has\nsparked considerable interest in efficient LMMs. LMM frameworks typically\nencode visual inputs into vision tokens (continuous representations) and\nintegrate them and textual instructions into the context of large language\nmodels (LLMs), where large-scale parameters and numerous context tokens\n(predominantly vision tokens) result in substantial computational overhead.\nPrevious efforts towards efficient LMMs always focus on replacing the LLM\nbackbone with smaller models, while neglecting the crucial issue of token\nquantity. In this paper, we introduce LLaVA-Mini, an efficient LMM with minimal\nvision tokens. To achieve a high compression ratio of vision tokens while\npreserving visual information, we first analyze how LMMs understand vision\ntokens and find that most vision tokens only play a crucial role in the early\nlayers of LLM backbone, where they mainly fuse visual information into text\ntokens. 
Building on this finding, LLaVA-Mini introduces modality pre-fusion to\nfuse visual information into text tokens in advance, thereby facilitating the\nextreme compression of vision tokens fed to LLM backbone into one token.\nLLaVA-Mini is a unified large multimodal model that can support the\nunderstanding of images, high-resolution images, and videos in an efficient\nmanner. Experiments across 11 image-based and 7 video-based benchmarks\ndemonstrate that LLaVA-Mini outperforms LLaVA-v1.5 with just 1 vision token\ninstead of 576. Efficiency analyses reveal that LLaVA-Mini can reduce FLOPs by\n77%, deliver low-latency responses within 40 milliseconds, and process over\n10,000 frames of video on the GPU hardware with 24GB of memory.\n","authors":["Shaolei Zhang","Qingkai Fang","Zhe Yang","Yang Feng"],"pdf_url":"https://arxiv.org/pdf/2501.03895v1.pdf","comment":"Code: https://github.com/ictnlp/LLaVA-Mini; Model:\n https://huggingface.co/ICTNLP/llava-mini-llama-3.1-8b"},{"id":"http://arxiv.org/abs/2501.03891v1","updated":"2025-01-07T15:54:03Z","published":"2025-01-07T15:54:03Z","title":"Superpixel Boundary Correction for Weakly-Supervised Semantic\n Segmentation on Histopathology Images","summary":" With the rapid advancement of deep learning, computational pathology has made\nsignificant progress in cancer diagnosis and subtyping. Tissue segmentation is\na core challenge, essential for prognosis and treatment decisions. Weakly\nsupervised semantic segmentation (WSSS) reduces the annotation requirement by\nusing image-level labels instead of pixel-level ones. However, Class Activation\nMap (CAM)-based methods still suffer from low spatial resolution and unclear\nboundaries. To address these issues, we propose a multi-level superpixel\ncorrection algorithm that refines CAM boundaries using superpixel clustering\nand floodfill. 
Experimental results show that our method achieves great\nperformance on breast cancer segmentation dataset with mIoU of 71.08%,\nsignificantly improving tumor microenvironment boundary delineation.\n","authors":["Hongyi Wu","Hong Zhang"],"pdf_url":"https://arxiv.org/pdf/2501.03891v1.pdf","comment":"7 pages, 4 figures"},{"id":"http://arxiv.org/abs/2405.03732v3","updated":"2025-01-07T15:46:25Z","published":"2024-05-06T10:53:13Z","title":"Deep Learning-based Accelerated MR Cholangiopancreatography without\n Fully-sampled Data","summary":" The purpose of this study was to accelerate MR cholangiopancreatography\n(MRCP) acquisitions using deep learning-based (DL) reconstruction at 3T and\n0.55T. A total of 35 healthy volunteers underwent conventional two-fold\naccelerated MRCP scans at field strengths of 3T and 0.55T. We trained DL\nreconstructions using two different training strategies, supervised (SV) and\nself-supervised (SSV), with retrospectively six-fold undersampled data obtained\nat 3T. We then evaluated the DL reconstructions against standard techniques,\nparallel imaging (PI) and compressed sensing (CS), focusing on peak\nsignal-to-noise ratio (PSNR) and structural similarity (SSIM) as metrics. We\nalso tested DL reconstructions with prospectively accelerated acquisitions and\nevaluated their robustness when changing fields strengths from 3T to 0.55T. DL\nreconstructions demonstrated a reduction in average acquisition time from\n599/542 to 255/180 seconds for MRCP at 3T/0.55T. In both retrospective and\nprospective undersampling, PSNR and SSIM of DL reconstructions were higher than\nthose of PI and CS. At the same time, DL reconstructions preserved the image\nquality of undersampled data, including sharpness and the visibility of\nhepatobiliary ducts. In addition, both DL approaches produced high-quality\nreconstructions at 0.55T. 
In summary, DL reconstructions trained for highly\naccelerated MRCP enabled a reduction in acquisition time by a factor of 2.4/3.0\nat 3T/0.55T while maintaining the image quality of conventional acquisitions.\n","authors":["Jinho Kim","Marcel Dominik Nickel","Florian Knoll"],"pdf_url":"https://arxiv.org/pdf/2405.03732v3.pdf","comment":"19 pages, 4 figures, 2 tables"},{"id":"http://arxiv.org/abs/2501.03880v1","updated":"2025-01-07T15:43:36Z","published":"2025-01-07T15:43:36Z","title":"SELMA3D challenge: Self-supervised learning for 3D light-sheet\n microscopy image segmentation","summary":" Recent innovations in light sheet microscopy, paired with developments in\ntissue clearing techniques, enable the 3D imaging of large mammalian tissues\nwith cellular resolution. Combined with the progress in large-scale data\nanalysis, driven by deep learning, these innovations empower researchers to\nrapidly investigate the morphological and functional properties of diverse\nbiological samples. Segmentation, a crucial preliminary step in the analysis\nprocess, can be automated using domain-specific deep learning models with\nexpert-level performance. However, these models exhibit high sensitivity to\ndomain shifts, leading to a significant drop in accuracy when applied to data\noutside their training distribution. To address this limitation, and inspired\nby the recent success of self-supervised learning in training generalizable\nmodels, we organized the SELMA3D Challenge during the MICCAI 2024 conference.\nSELMA3D provides a vast collection of light-sheet images from cleared mice and\nhuman brains, comprising 35 large 3D images-each with over 1000^3 voxels-and\n315 annotated small patches for finetuning, preliminary testing and final\ntesting. The dataset encompasses diverse biological structures, including\nvessel-like and spot-like structures. 
Five teams participated in all phases of\nthe challenge, and their proposed methods are reviewed in this paper.\nQuantitative and qualitative results from most participating teams demonstrate\nthat self-supervised learning on large datasets improves segmentation model\nperformance and generalization. We will continue to support and extend SELMA3D\nas an inaugural MICCAI challenge focused on self-supervised learning for 3D\nmicroscopy image segmentation.\n","authors":["Ying Chen","Rami Al-Maskari","Izabela Horvath","Mayar Ali","Luciano Höher","Kaiyuan Yang","Zengming Lin","Zhiwei Zhai","Mengzhe Shen","Dejin Xun","Yi Wang","Tony Xu","Maged Goubran","Yunheng Wu","Ali Erturk","Johannes C. Paetzold"],"pdf_url":"https://arxiv.org/pdf/2501.03880v1.pdf","comment":"1st version"},{"id":"http://arxiv.org/abs/2501.03879v1","updated":"2025-01-07T15:42:32Z","published":"2025-01-07T15:42:32Z","title":"CL3DOR: Contrastive Learning for 3D Large Multimodal Models via Odds\n Ratio on High-Resolution Point Clouds","summary":" Recent research has demonstrated that Large Language Models (LLMs) are not\nlimited to text-only tasks but can also function as multimodal models across\nvarious modalities, including audio, images, and videos. In particular,\nresearch on 3D Large Multimodal Models (3D LMMs) is making notable strides,\ndriven by the potential of processing higher-dimensional data like point\nclouds. However, upon closer examination, we find that the visual and textual\ncontent within each sample of existing training datasets lacks both high\ninformational granularity and clarity, which serve as a bottleneck for precise\ncross-modal understanding. To address these issues, we propose CL3DOR,\nContrastive Learning for 3D large multimodal models via Odds ratio on\nhigh-Resolution point clouds, designed to ensure greater specificity and\nclarity in both visual and textual content. 
Specifically, we increase the\ndensity of point clouds per object and construct informative hard negative\nresponses in the training dataset to penalize unwanted responses. To leverage\nhard negative responses, we incorporate the odds ratio as an auxiliary term for\ncontrastive learning into the conventional language modeling loss. CL3DOR\nachieves state-of-the-art performance in 3D scene understanding and reasoning\nbenchmarks. Additionally, we demonstrate the effectiveness of CL3DOR's key\ncomponents through extensive experiments.\n","authors":["Keonwoo Kim","Yeongjae Cho","Taebaek Hwang","Minsoo Jo","Sangdo Han"],"pdf_url":"https://arxiv.org/pdf/2501.03879v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03875v1","updated":"2025-01-07T15:39:02Z","published":"2025-01-07T15:39:02Z","title":"ZDySS -- Zero-Shot Dynamic Scene Stylization using Gaussian Splatting","summary":" Stylizing a dynamic scene based on an exemplar image is critical for various\nreal-world applications, including gaming, filmmaking, and augmented and\nvirtual reality. However, achieving consistent stylization across both spatial\nand temporal dimensions remains a significant challenge. Most existing methods\nare designed for static scenes and often require an optimization process for\neach style image, limiting their adaptability. We introduce ZDySS, a zero-shot\nstylization framework for dynamic scenes, allowing our model to generalize to\npreviously unseen style images at inference. Our approach employs Gaussian\nsplatting for scene representation, linking each Gaussian to a learned feature\nvector that renders a feature map for any given view and timestamp. By applying\nstyle transfer on the learned feature vectors instead of the rendered feature\nmap, we enhance spatio-temporal consistency across frames. 
Our method\ndemonstrates superior performance and coherence over state-of-the-art baselines\nin tests on real-world dynamic scenes, making it a robust solution for\npractical applications.\n","authors":["Abhishek Saroha","Florian Hofherr","Mariia Gladkova","Cecilia Curreli","Or Litany","Daniel Cremers"],"pdf_url":"https://arxiv.org/pdf/2501.03875v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03874v1","updated":"2025-01-07T15:38:13Z","published":"2025-01-07T15:38:13Z","title":"Neuromorphic Optical Tracking and Imaging of Randomly Moving Targets\n through Strongly Scattering Media","summary":" Tracking and acquiring simultaneous optical images of randomly moving targets\nobscured by scattering media remains a challenging problem of importance to\nmany applications that require precise object localization and identification.\nIn this work we develop an end-to-end neuromorphic optical engineering and\ncomputational approach to demonstrate how to track and image normally invisible\nobjects by combining an event detecting camera with a multistage neuromorphic\ndeep learning strategy. Photons emerging from dense scattering media are\ndetected by the event camera and converted to pixel-wise asynchronized spike\ntrains - a first step in isolating object-specific information from the\ndominant uninformative background. Spiking data is fed into a deep spiking\nneural network (SNN) engine where object tracking and image reconstruction are\nperformed by two separate yet interconnected modules running in parallel in\ndiscrete time steps over the event duration. Through benchtop experiments we\ndemonstrate tracking and imaging randomly moving objects in dense turbid media\nas well as image reconstruction of spatially stationary but optically dynamic\nobjects. Standardized character sets serve as representative proxies for\ngeometrically complex objects, underscoring the method's generality. 
The\nresults highlight the advantages of a fully neuromorphic approach in meeting a\nmajor imaging technology with high computational efficiency and low power\nconsumption.\n","authors":["Ning Zhang","Timothy Shea","Arto Nurmikko"],"pdf_url":"https://arxiv.org/pdf/2501.03874v1.pdf","comment":"22 pages, 6 figures"},{"id":"http://arxiv.org/abs/2412.12359v2","updated":"2025-01-07T15:36:54Z","published":"2024-12-16T21:14:11Z","title":"LLaVA Steering: Visual Instruction Tuning with 500x Fewer Parameters\n through Modality Linear Representation-Steering","summary":" Multimodal Large Language Models (MLLMs) have significantly advanced visual\ntasks by integrating visual representations into large language models (LLMs).\nThe textual modality, inherited from LLMs, equips MLLMs with abilities like\ninstruction following and in-context learning. In contrast, the visual modality\nenhances performance in downstream tasks by leveraging rich semantic content,\nspatial information, and grounding capabilities. These intrinsic modalities\nwork synergistically across various visual tasks. Our research initially\nreveals a persistent imbalance between these modalities, with text often\ndominating output generation during visual instruction tuning. This imbalance\noccurs when using both full fine-tuning and parameter-efficient fine-tuning\n(PEFT) methods. We then found that re-balancing these modalities can\nsignificantly reduce the number of trainable parameters required, inspiring a\ndirection for further optimizing visual instruction tuning. We introduce\nModality Linear Representation-Steering (MoReS) to achieve the goal. MoReS\neffectively re-balances the intrinsic modalities throughout the model, where\nthe key idea is to steer visual representations through linear transformations\nin the visual subspace across each model layer. To validate our solution, we\ncomposed LLaVA Steering, a suite of models integrated with the proposed MoReS\nmethod. 
Evaluation results show that the composed LLaVA Steering models\nrequire, on average, 500 times fewer trainable parameters than LoRA needs while\nstill achieving comparable performance across three visual benchmarks and eight\nvisual question-answering tasks. Last, we present the LLaVA Steering Factory,\nan in-house developed platform that enables researchers to quickly customize\nvarious MLLMs with component-based architecture for seamlessly integrating\nstate-of-the-art models, and evaluate their intrinsic modality imbalance.\n","authors":["Jinhe Bi","Yujun Wang","Haokun Chen","Xun Xiao","Artur Hecker","Volker Tresp","Yunpu Ma"],"pdf_url":"https://arxiv.org/pdf/2412.12359v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03848v1","updated":"2025-01-07T15:03:55Z","published":"2025-01-07T15:03:55Z","title":"Semise: Semi-supervised learning for severity representation in medical\n image","summary":" This paper introduces SEMISE, a novel method for representation learning in\nmedical imaging that combines self-supervised and supervised learning. By\nleveraging both labeled and augmented data, SEMISE addresses the challenge of\ndata scarcity and enhances the encoder's ability to extract meaningful\nfeatures. This integrated approach leads to more informative representations,\nimproving performance on downstream tasks. As a result, our approach achieved a\n12% improvement in classification and a 3% improvement in segmentation,\noutperforming existing methods. These results demonstrate the potential of\nSEMISE to advance medical image analysis and offer more accurate solutions for\nhealthcare applications, particularly in contexts where labeled data is\nlimited.\n","authors":["Dung T. 
Tran","Hung Vu","Anh Tran","Hieu Pham","Hong Nguyen","Phong Nguyen"],"pdf_url":"https://arxiv.org/pdf/2501.03848v1.pdf","comment":"Accepted for presentation at the 2025 IEEE 22nd International\n Symposium on Biomedical Imaging (ISBI)"},{"id":"http://arxiv.org/abs/2501.03847v1","updated":"2025-01-07T15:01:58Z","published":"2025-01-07T15:01:58Z","title":"Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video\n Generation Control","summary":" Diffusion models have demonstrated impressive performance in generating\nhigh-quality videos from text prompts or images. However, precise control over\nthe video generation process, such as camera manipulation or content editing,\nremains a significant challenge. Existing methods for controlled video\ngeneration are typically limited to a single control type, lacking the\nflexibility to handle diverse control demands. In this paper, we introduce\nDiffusion as Shader (DaS), a novel approach that supports multiple video\ncontrol tasks within a unified architecture. Our key insight is that achieving\nversatile video control necessitates leveraging 3D control signals, as videos\nare fundamentally 2D renderings of dynamic 3D content. Unlike prior methods\nlimited to 2D control signals, DaS leverages 3D tracking videos as control\ninputs, making the video diffusion process inherently 3D-aware. This innovation\nallows DaS to achieve a wide range of video controls by simply manipulating the\n3D tracking videos. A further advantage of using 3D tracking videos is their\nability to effectively link frames, significantly enhancing the temporal\nconsistency of the generated videos. 
With just 3 days of fine-tuning on 8 H800\nGPUs using less than 10k videos, DaS demonstrates strong control capabilities\nacross diverse tasks, including mesh-to-video generation, camera control,\nmotion transfer, and object manipulation.\n","authors":["Zekai Gu","Rui Yan","Jiahao Lu","Peng Li","Zhiyang Dou","Chenyang Si","Zhen Dong","Qifeng Liu","Cheng Lin","Ziwei Liu","Wenping Wang","Yuan Liu"],"pdf_url":"https://arxiv.org/pdf/2501.03847v1.pdf","comment":"Project page: https://igl-hkust.github.io/das/ Codes:\n https://github.com/IGL-HKUST/DiffusionAsShader"},{"id":"http://arxiv.org/abs/2403.18873v2","updated":"2025-01-07T14:52:34Z","published":"2024-03-26T14:42:46Z","title":"Predicting risk of cardiovascular disease using retinal OCT imaging","summary":" Cardiovascular diseases (CVD) are the leading cause of death globally.\nNon-invasive, cost-effective imaging techniques play a crucial role in early\ndetection and prevention of CVD. Optical coherence tomography (OCT) has gained\nrecognition as a potential tool for early CVD risk prediction, though its use\nremains underexplored. In this study, we investigated the potential of OCT as\nan additional imaging technique to predict future CVD events. We analysed\nretinal OCT data from the UK Biobank. The dataset included 612 patients who\nsuffered a myocardial infarction (MI) or stroke within five years of imaging\nand 2,234 controls without CVD (total: 2,846 participants). A self-supervised\ndeep learning approach based on Variational Autoencoders (VAE) was used to\nextract low-dimensional latent representations from high-dimensional 3D OCT\nimages, capturing distinct features of retinal layers. These latent features,\nalong with clinical data, were used to train a Random Forest (RF) classifier to\ndifferentiate between patients at risk of future CVD events (MI or stroke) and\nhealthy controls. 
Our model achieved an AUC of 0.75, sensitivity of 0.70,\nspecificity of 0.70, and accuracy of 0.70, outperforming the QRISK3 score (the\nthird version of the QRISK cardiovascular disease risk prediction algorithm;\nAUC = 0.60, sensitivity = 0.60, specificity = 0.55, accuracy = 0.55). The\nchoroidal layer in OCT images was identified as a key predictor of future CVD\nevents, revealed through a novel model explainability approach. This study\ndemonstrates that retinal OCT imaging is a cost-effective, non-invasive\nalternative for predicting CVD risk, offering potential for widespread\napplication in optometry practices and hospitals.\n","authors":["Cynthia Maldonado-Garcia","Rodrigo Bonazzola","Enzo Ferrante","Thomas H Julian","Panagiotis I Sergouniotis","Nishant Ravikumara","Alejandro F Frangi"],"pdf_url":"https://arxiv.org/pdf/2403.18873v2.pdf","comment":"New version - 26 pages for main manuscript, 7 figures, 7 pages for\n appendix and preprint for a journal"},{"id":"http://arxiv.org/abs/2501.03839v1","updated":"2025-01-07T14:49:12Z","published":"2025-01-07T14:49:12Z","title":"MedFocusCLIP : Improving few shot classification in medical datasets\n using pixel wise attention","summary":" With the popularity of foundational models, parameter efficient fine tuning\nhas become the defacto approach to leverage pretrained models to perform\ndownstream tasks. Taking inspiration from recent advances in large language\nmodels, Visual Prompt Tuning, and similar techniques, learn an additional\nprompt to efficiently finetune a pretrained vision foundational model. However,\nwe observe that such prompting is insufficient for fine-grained visual\nclassification tasks such as medical image classification, where there is large\ninter-class variance, and small intra-class variance. 
Hence, in this paper we\npropose to leverage the advanced segmentation capabilities of Segment Anything\nModel 2 (SAM2) as a visual prompting cue to help the visual encoder in CLIP\n(Contrastive Language-Image Pretraining) by guiding the attention of the CLIP\nvisual encoder to relevant regions in the image. This helps the model to focus\non highly discriminative regions, without getting distracted by visually\nsimilar background features, an essential requirement in a few-shot,\nfine-grained classification setting. We evaluate our method on diverse medical\ndatasets including X-rays, CT scans, and MRI images, and report an accuracy of\n(71%, 81%, 86%, 58%) from the proposed approach on (COVID, lung-disease,\nbrain-tumor, breast-cancer) datasets against (66%, 70%, 68%, 29%) from a\npretrained CLIP model after few-shot training. The proposed approach also makes\nit possible to obtain interpretable explanations of the classification\nperformance through the localization obtained using segmentation.\n","authors":["Aadya Arora","Vinay Namboodiri"],"pdf_url":"https://arxiv.org/pdf/2501.03839v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03838v1","updated":"2025-01-07T14:47:15Z","published":"2025-01-07T14:47:15Z","title":"LM-Net: A Light-weight and Multi-scale Network for Medical Image\n Segmentation","summary":" Current medical image segmentation approaches have limitations in deeply\nexploring multi-scale information and effectively combining local detail\ntextures with global contextual semantic information. This results in\nover-segmentation, under-segmentation, and blurred segmentation boundaries. To\ntackle these challenges, we explore multi-scale feature representations from\ndifferent perspectives, proposing a novel, lightweight, and multi-scale\narchitecture (LM-Net) that integrates advantages of both Convolutional Neural\nNetworks (CNNs) and Vision Transformers (ViTs) to enhance segmentation\naccuracy. 
LM-Net employs a lightweight multi-branch module to capture\nmulti-scale features at the same level. Furthermore, we introduce two modules\nto concurrently capture local detail textures and global semantics with\nmulti-scale features at different levels: the Local Feature Transformer (LFT)\nand Global Feature Transformer (GFT). The LFT integrates local window\nself-attention to capture local detail textures, while the GFT leverages global\nself-attention to capture global contextual semantics. By combining these\nmodules, our model achieves complementarity between local and global\nrepresentations, alleviating the problem of blurred segmentation boundaries in\nmedical image segmentation. To evaluate the feasibility of LM-Net, extensive\nexperiments have been conducted on three publicly available datasets with\ndifferent modalities. Our proposed model achieves state-of-the-art results,\nsurpassing previous methods, while only requiring 4.66G FLOPs and 5.4M\nparameters. These state-of-the-art results on three datasets with different\nmodalities demonstrate the effectiveness and adaptability of our proposed\nLM-Net for various medical image segmentation tasks.\n","authors":["Zhenkun Lu","Chaoyin She","Wei Wang","Qinghua Huang"],"pdf_url":"https://arxiv.org/pdf/2501.03838v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03836v1","updated":"2025-01-07T14:45:39Z","published":"2025-01-07T14:45:39Z","title":"SCC-YOLO: An Improved Object Detector for Assisting in Brain Tumor\n Diagnosis","summary":" Brain tumors can result in neurological dysfunction, alterations in cognitive\nand psychological states, increased intracranial pressure, and the occurrence\nof seizures, thereby presenting a substantial risk to human life and health.\nThe You Only Look Once(YOLO) series models have demonstrated superior accuracy\nin object detection for medical imaging. In this paper, we develop a novel\nSCC-YOLO architecture by integrating the SCConv attention mechanism into\nYOLOv9. 
The SCConv module reconstructs an efficient convolutional module by\nreducing spatial and channel redundancy among features, thereby enhancing the\nlearning of image features. We investigate the impact of integrating different\nattention mechanisms with the YOLOv9 model on brain tumor image detection using\nboth the Br35H dataset and our self-made dataset (Brain_Tumor_Dataset).\nExperimental results show that on the Br35H dataset, SCC-YOLO achieved a 0.3%\nimprovement in mAP50 compared to YOLOv9, while on our self-made dataset,\nSCC-YOLO exhibited a 0.5% improvement over YOLOv9. SCC-YOLO has reached\nstate-of-the-art performance in brain tumor detection. Source code is available\nat: https://jihulab.com/healthcare-information-studio/SCC-YOLO/-/tree/master\n","authors":["Runci Bai"],"pdf_url":"https://arxiv.org/pdf/2501.03836v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03830v1","updated":"2025-01-07T14:41:26Z","published":"2025-01-07T14:41:26Z","title":"MeshConv3D: Efficient convolution and pooling operators for triangular\n 3D meshes","summary":" Convolutional neural networks (CNNs) have been pivotal in various 2D image\nanalysis tasks, including computer vision, image indexing and retrieval or\nsemantic classification. Extending CNNs to 3D data such as point clouds and 3D\nmeshes raises significant challenges since the very basic convolution and\npooling operators need to be completely re-visited and re-defined in an\nappropriate manner to tackle irregular connectivity issues. In this paper, we\nintroduce MeshConv3D, a 3D mesh-dedicated methodology integrating specialized\nconvolution and face collapse-based pooling operators. MeshConv3D operates\ndirectly on meshes of arbitrary topology, without any need of prior\nre-meshing/conversion techniques. In order to validate our approach, we have\nconsidered a semantic classification task. 
The experimental results obtained on\nthree distinct benchmark datasets show that the proposed approach makes it\npossible to achieve equivalent or superior classification results, while\nminimizing the related memory footprint and computational load.\n","authors":["Germain Bregeon","Marius Preda","Radu Ispas","Titus Zaharia"],"pdf_url":"https://arxiv.org/pdf/2501.03830v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.16169v5","updated":"2025-01-07T14:39:31Z","published":"2024-03-24T14:24:13Z","title":"Gaze-guided Hand-Object Interaction Synthesis: Dataset and Method","summary":" Gaze plays a crucial role in revealing human attention and intention,\nparticularly in hand-object interaction scenarios, where it guides and\nsynchronizes complex tasks that require precise coordination between the brain,\nhand, and object. Motivated by this, we introduce a novel task: Gaze-Guided\nHand-Object Interaction Synthesis, with potential applications in augmented\nreality, virtual reality, and assistive technologies. To support this task, we\npresent GazeHOI, the first dataset to capture simultaneous 3D modeling of gaze,\nhand, and object interactions. This task poses significant challenges due to\nthe inherent sparsity and noise in gaze data, as well as the need for high\nconsistency and physical plausibility in generating hand and object motions. To\ntackle these issues, we propose a stacked gaze-guided hand-object interaction\ndiffusion model, named GHO-Diffusion. The stacked design effectively reduces\nthe complexity of motion generation. We also introduce HOI-Manifold Guidance\nduring the sampling stage of GHO-Diffusion, enabling fine-grained control over\ngenerated motions while maintaining the data manifold. Additionally, we propose\na spatial-temporal gaze feature encoding for the diffusion condition and select\ndiffusion results based on consistency scores between gaze-contact maps and\ngaze-interaction trajectories. 
Extensive experiments highlight the\neffectiveness of our method and the unique contributions of our dataset. More\ndetails in https://takiee.github.io/gaze-hoi/.\n","authors":["Jie Tian","Ran Ji","Lingxiao Yang","Suting Ni","Yuexin Ma","Lan Xu","Jingyi Yu","Ye Shi","Jingya Wang"],"pdf_url":"https://arxiv.org/pdf/2403.16169v5.pdf","comment":"Project Page: https://takiee.github.io/gaze-hoi/"},{"id":"http://arxiv.org/abs/2501.03825v1","updated":"2025-01-07T14:37:14Z","published":"2025-01-07T14:37:14Z","title":"Deep Sylvester Posterior Inference for Adaptive Compressed Sensing in\n Ultrasound Imaging","summary":" Ultrasound images are commonly formed by sequential acquisition of\nbeam-steered scan-lines. Minimizing the number of required scan-lines can\nsignificantly enhance frame rate, field of view, energy efficiency, and data\ntransfer speeds. Existing approaches typically use static subsampling schemes\nin combination with sparsity-based or, more recently, deep-learning-based\nrecovery. In this work, we introduce an adaptive subsampling method that\nmaximizes intrinsic information gain in-situ, employing a Sylvester Normalizing\nFlow encoder to infer an approximate Bayesian posterior under partial\nobservation in real-time. Using the Bayesian posterior and a deep generative\nmodel for future observations, we determine the subsampling scheme that\nmaximizes the mutual information between the subsampled observations, and the\nnext frame of the video. We evaluate our approach using the EchoNet cardiac\nultrasound video dataset and demonstrate that our active sampling method\noutperforms competitive baselines, including uniform and variable-density\nrandom sampling, as well as equidistantly spaced scan-lines, improving mean\nabsolute reconstruction error by 15%. Moreover, posterior inference and the\nsampling scheme generation are performed in just 0.015 seconds (66Hz), making\nit fast enough for real-time 2D ultrasound imaging applications.\n","authors":["Simon W. 
Penninga","Hans van Gorp","Ruud J. G. van Sloun"],"pdf_url":"https://arxiv.org/pdf/2501.03825v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.01460v2","updated":"2025-01-07T14:19:35Z","published":"2024-12-31T10:43:19Z","title":"GDSR: Global-Detail Integration through Dual-Branch Network with Wavelet\n Losses for Remote Sensing Image Super-Resolution","summary":" In recent years, deep neural networks, including Convolutional Neural\nNetworks, Transformers, and State Space Models, have achieved significant\nprogress in Remote Sensing Image (RSI) Super-Resolution (SR). However, existing\nSR methods typically overlook the complementary relationship between global and\nlocal dependencies. These methods either focus on capturing local information\nor prioritize global information, which results in models that are unable to\neffectively capture both global and local features simultaneously. Moreover,\ntheir computational cost becomes prohibitive when applied to large-scale RSIs.\nTo address these challenges, we introduce the novel application of Receptance\nWeighted Key Value (RWKV) to RSI-SR, which captures long-range dependencies\nwith linear complexity. To simultaneously model global and local features, we\npropose the Global-Detail dual-branch structure, GDSR, which performs SR\nreconstruction by paralleling RWKV and convolutional operations to handle\nlarge-scale RSIs. Furthermore, we introduce the Global-Detail Reconstruction\nModule (GDRM) as an intermediary between the two branches to bridge their\ncomplementary roles. In addition, we propose Wavelet Loss, a loss function that\neffectively captures high-frequency detail information in images, thereby\nenhancing the visual quality of SR, particularly in terms of detail\nreconstruction. 
Extensive experiments on several benchmarks, including AID,\nAID_CDM, RSSRD-QH, and RSSRD-QH_CDM, demonstrate that GDSR outperforms the\nstate-of-the-art Transformer-based method HAT by an average of 0.05 dB in PSNR,\nwhile using only 63% of its parameters and 51% of its FLOPs, achieving an\ninference speed 2.9 times faster. Furthermore, the Wavelet Loss shows excellent\ngeneralization across various architectures, providing a novel perspective for\nRSI-SR enhancement.\n","authors":["Qiwei Zhu","Kai Li","Guojing Zhang","Xiaoying Wang","Jianqiang Huang","Xilai Li"],"pdf_url":"https://arxiv.org/pdf/2501.01460v2.pdf","comment":"The experiments were conducted using private datasets that were\n incomplete as they did not include all the necessary copyrights.\n Additionally, the conclusions require further exploration as the work is\n still in progress"},{"id":"http://arxiv.org/abs/2501.03800v1","updated":"2025-01-07T14:06:57Z","published":"2025-01-07T14:06:57Z","title":"MADation: Face Morphing Attack Detection with Foundation Models","summary":" Despite the considerable performance improvements of face recognition\nalgorithms in recent years, the same scientific advances responsible for this\nprogress can also be used to create efficient ways to attack them, posing a\nthreat to their secure deployment. Morphing attack detection (MAD) systems aim\nto detect a specific type of threat, morphing attacks, at an early stage,\npreventing them from being considered for verification in critical processes.\nFoundation models (FM) learn from extensive amounts of unlabeled data,\nachieving remarkable zero-shot generalization to unseen domains. Although this\ngeneralization capacity might be weak when dealing with domain-specific\ndownstream tasks such as MAD, FMs can easily adapt to these settings while\nretaining the built-in knowledge acquired during pre-training. 
In this work, we\nrecognize the potential of FMs to perform well in the MAD task when properly\nadapted to its specificities. To this end, we adapt FM CLIP architectures with\nLoRA weights while simultaneously training a classification head. The\nproposed framework, MADation, surpasses our alternative FM and transformer-based\nframeworks and constitutes the first adaptation of FMs to the MAD task. MADation\npresents competitive results with current MAD solutions in the literature and\neven surpasses them in several evaluation scenarios. To encourage\nreproducibility and facilitate further research in MAD, we publicly release the\nimplementation of MADation at https://github.com/gurayozgur/MADation\n","authors":["Eduarda Caldeira","Guray Ozgur","Tahar Chettaoui","Marija Ivanovska","Fadi Boutros","Vitomir Struc","Naser Damer"],"pdf_url":"https://arxiv.org/pdf/2501.03800v1.pdf","comment":"Accepted at WACV 2025 workshops"},{"id":"http://arxiv.org/abs/2501.03786v1","updated":"2025-01-07T13:51:41Z","published":"2025-01-07T13:51:41Z","title":"KAnoCLIP: Zero-Shot Anomaly Detection through Knowledge-Driven Prompt\n Learning and Enhanced Cross-Modal Integration","summary":" Zero-shot anomaly detection (ZSAD) identifies anomalies without needing\ntraining samples from the target dataset, essential for scenarios with privacy\nconcerns or limited data. Vision-language models like CLIP show potential in\nZSAD but have limitations: relying on manually crafted fixed textual\ndescriptions or anomaly prompts is time-consuming and prone to semantic\nambiguity, and CLIP struggles with pixel-level anomaly segmentation, focusing\nmore on global semantics than local details. To address these limitations, we\nintroduce KAnoCLIP, a novel ZSAD framework that leverages vision-language\nmodels. 
KAnoCLIP combines general knowledge from a Large Language Model\n(GPT-3.5) and fine-grained, image-specific knowledge from a Visual Question\nAnswering system (Llama3) via Knowledge-Driven Prompt Learning (KnPL). KnPL\nuses a knowledge-driven (KD) loss function to create learnable anomaly prompts,\nremoving the need for fixed text prompts and enhancing generalization. KAnoCLIP\nincludes the CLIP visual encoder with V-V attention (CLIP-VV), Bi-Directional\nCross-Attention for Multi-Level Cross-Modal Interaction (Bi-CMCI), and\nConv-Adapter. These components preserve local visual semantics, improve local\ncross-modal fusion, and align global visual features with textual information,\nenhancing pixel-level anomaly detection. KAnoCLIP achieves state-of-the-art\nperformance in ZSAD across 12 industrial and medical datasets, demonstrating\nsuperior generalization compared to existing methods.\n","authors":["Chengyuan Li","Suyang Zhou","Jieping Kong","Lei Qi","Hui Xue"],"pdf_url":"https://arxiv.org/pdf/2501.03786v1.pdf","comment":"Accepted by ICASSP 2025"},{"id":"http://arxiv.org/abs/2410.23091v6","updated":"2025-01-07T13:44:57Z","published":"2024-10-30T15:06:44Z","title":"CausalDiff: Causality-Inspired Disentanglement via Diffusion Model for\n Adversarial Defense","summary":" Despite ongoing efforts to defend neural classifiers from adversarial\nattacks, they remain vulnerable, especially to unseen attacks. In contrast,\nhumans are difficult to be cheated by subtle manipulations, since we make\njudgments only based on essential factors. Inspired by this observation, we\nattempt to model label generation with essential label-causative factors and\nincorporate label-non-causative factors to assist data generation. 
For an\nadversarial example, we aim to discriminate the perturbations as non-causative\nfactors and make predictions only based on the label-causative factors.\nConcretely, we propose a causal diffusion model (CausalDiff) that adapts\ndiffusion models for conditional data generation and disentangles the two types\nof causal factors by learning towards a novel causal information bottleneck\nobjective. Empirically, CausalDiff has significantly outperformed\nstate-of-the-art defense methods on various unseen attacks, achieving an\naverage robustness of 86.39% (+4.01%) on CIFAR-10, 56.25% (+3.13%) on\nCIFAR-100, and 82.62% (+4.93%) on GTSRB (German Traffic Sign Recognition\nBenchmark). The code is available at\nhttps://github.com/CAS-AISafetyBasicResearchGroup/CausalDiff.\n","authors":["Mingkun Zhang","Keping Bi","Wei Chen","Quanrun Chen","Jiafeng Guo","Xueqi Cheng"],"pdf_url":"https://arxiv.org/pdf/2410.23091v6.pdf","comment":"accepted by NeurIPS 2024"},{"id":"http://arxiv.org/abs/2406.04280v3","updated":"2025-01-07T13:43:36Z","published":"2024-06-06T17:26:40Z","title":"xMIL: Insightful Explanations for Multiple Instance Learning in\n Histopathology","summary":" Multiple instance learning (MIL) is an effective and widely used approach for\nweakly supervised machine learning. In histopathology, MIL models have achieved\nremarkable success in tasks like tumor detection, biomarker prediction, and\noutcome prognostication. However, MIL explanation methods are still lagging\nbehind, as they are limited to small bag sizes or disregard instance\ninteractions. We revisit MIL through the lens of explainable AI (XAI) and\nintroduce xMIL, a refined framework with more general assumptions. We\ndemonstrate how to obtain improved MIL explanations using layer-wise relevance\npropagation (LRP) and conduct extensive evaluation experiments on three toy\nsettings and four real-world histopathology datasets. 
Our approach consistently\noutperforms previous explanation attempts with particularly improved\nfaithfulness scores on challenging biomarker prediction tasks. Finally, we\nshowcase how xMIL explanations enable pathologists to extract insights from MIL\nmodels, representing a significant advance for knowledge discovery and model\ndebugging in digital histopathology. Codes are available at:\nhttps://github.com/bifold-pathomics/xMIL.\n","authors":["Julius Hense","Mina Jamshidi Idaji","Oliver Eberle","Thomas Schnake","Jonas Dippel","Laure Ciernik","Oliver Buchstab","Andreas Mock","Frederick Klauschen","Klaus-Robert Müller"],"pdf_url":"https://arxiv.org/pdf/2406.04280v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.02285v2","updated":"2025-01-07T13:38:34Z","published":"2025-01-04T13:27:18Z","title":"Hyperbolic Contrastive Learning for Hierarchical 3D Point Cloud\n Embedding","summary":" Hyperbolic spaces allow for more efficient modeling of complex, hierarchical\nstructures, which is particularly beneficial in tasks involving multi-modal\ndata. Although hyperbolic geometries have been proven effective for\nlanguage-image pre-training, their capabilities to unify language, image, and\n3D Point Cloud modalities are under-explored. We extend the 3D Point Cloud\nmodality in hyperbolic multi-modal contrastive pre-training. Additionally, we\nexplore the entailment, modality gap, and alignment regularizers for learning\nhierarchical 3D embeddings and facilitating the transfer of knowledge from both\nText and Image modalities. These regularizers enable the learning of\nintra-modal hierarchy within each modality and inter-modal hierarchy across\ntext, 2D images, and 3D Point Clouds. 
Experimental results demonstrate that our\nproposed training strategy yields an outstanding 3D Point Cloud encoder, and\nthe obtained 3D Point Cloud hierarchical embeddings significantly improve\nperformance on various downstream tasks.\n","authors":["Yingjie Liu","Pengyu Zhang","Ziyao He","Mingsong Chen","Xuan Tang","Xian Wei"],"pdf_url":"https://arxiv.org/pdf/2501.02285v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03775v1","updated":"2025-01-07T13:30:54Z","published":"2025-01-07T13:30:54Z","title":"Strip R-CNN: Large Strip Convolution for Remote Sensing Object Detection","summary":" Despite rapid development, remote sensing object detection\nremains challenging for high aspect ratio objects. This paper shows\nthat large strip convolutions are good feature representation learners for\nremote sensing object detection and can detect objects of various aspect ratios\nwell. Based on large strip convolutions, we build a new network architecture\ncalled Strip R-CNN, which is simple, efficient, and powerful. Unlike recent\nremote sensing object detectors that leverage large-kernel convolutions with\nsquare shapes, our Strip R-CNN takes advantage of sequential orthogonal large\nstrip convolutions to capture spatial information. In addition, we enhance the\nlocalization capability of remote-sensing object detectors by decoupling the\ndetection heads and equipping the localization head with strip convolutions to\nbetter localize the target objects. Extensive experiments on several\nbenchmarks, e.g., DOTA, FAIR1M, HRSC2016, and DIOR, show that our Strip R-CNN\ncan largely improve upon previous works. 
Notably, our 30M model achieves 82.75% mAP\non DOTA-v1.0, setting a new state-of-the-art record. Code is available at\nhttps://github.com/YXB-NKU/Strip-R-CNN.\n","authors":["Xinbin Yuan","ZhaoHui Zheng","Yuxuan Li","Xialei Liu","Li Liu","Xiang Li","Qibin Hou","Ming-Ming Cheng"],"pdf_url":"https://arxiv.org/pdf/2501.03775v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03767v1","updated":"2025-01-07T13:14:25Z","published":"2025-01-07T13:14:25Z","title":"AutoFish: Dataset and Benchmark for Fine-grained Analysis of Fish","summary":" Automated fish documentation processes are expected in the near future to\nplay an essential role in sustainable fisheries management and in addressing\nthe challenges of overfishing. In this paper, we present a novel and publicly\navailable dataset named AutoFish designed for fine-grained fish analysis. The\ndataset comprises 1,500 images of 454 specimens of visually similar fish placed\nin various constellations on a white conveyor belt and annotated with instance\nsegmentation masks, IDs, and length measurements. The data was collected in a\ncontrolled environment using an RGB camera. The annotation procedure involved\nmanual point annotations, initial segmentation masks proposed by the Segment\nAnything Model (SAM), and subsequent manual correction of the masks. We\nestablish baseline instance segmentation results using two variations of the\nMask2Former architecture, with the best performing model reaching an mAP of\n89.15%. Additionally, we present two baseline length estimation methods, the\nbest performing being a custom MobileNetV2-based regression model reaching an\nMAE of 0.62cm in images with no occlusion and 1.38cm in images with occlusion.\nLink to project page: https://vap.aau.dk/autofish/.\n","authors":["Stefan Hein Bengtson","Daniel Lehotský","Vasiliki Ismiroglou","Niels Madsen","Thomas B. 
Moeslund","Malte Pedersen"],"pdf_url":"https://arxiv.org/pdf/2501.03767v1.pdf","comment":"In the 3rd Workshop on Maritime Computer Vision (MaCVi) at WACV'25"},{"id":"http://arxiv.org/abs/2501.02867v2","updated":"2025-01-07T13:13:17Z","published":"2025-01-06T09:19:23Z","title":"Diff-Lung: Diffusion-Based Texture Synthesis for Enhanced Pathological\n Tissue Segmentation in Lung CT Scans","summary":" Accurate quantification of the extent of lung pathological patterns\n(fibrosis, ground-glass opacity, emphysema, consolidation) is a prerequisite\nfor diagnosis and follow-up of interstitial lung diseases. However, segmentation is\nchallenging due to the significant class imbalance between healthy and\npathological tissues. This paper addresses this issue by leveraging a diffusion\nmodel for data augmentation applied during the training of an AI model. Our approach\ngenerates synthetic pathological tissue patches while preserving essential\nshape characteristics and intricate details specific to each tissue type. This\nmethod enhances the segmentation process by increasing the occurrence of\nunderrepresented classes in the training data. We demonstrate that our\ndiffusion-based augmentation technique improves segmentation accuracy across\nall pathological tissue types, particularly for the less common patterns. This\nadvancement contributes to more reliable automated analysis of lung CT scans,\npotentially improving clinical decision-making and patient outcomes.\n","authors":["Rezkellah Noureddine Khiati","Pierre-Yves Brillet","Radu Ispas","Catalin Fetita"],"pdf_url":"https://arxiv.org/pdf/2501.02867v2.pdf","comment":"accepted at ISBI 2025"},{"id":"http://arxiv.org/abs/2501.03765v1","updated":"2025-01-07T13:09:44Z","published":"2025-01-07T13:09:44Z","title":"Image Segmentation: Inducing graph-based learning","summary":" This study explores the potential of graph neural networks (GNNs) to enhance\nsemantic segmentation across diverse image modalities. 
We evaluate the\neffectiveness of a novel GNN-based U-Net architecture on three distinct\ndatasets: PascalVOC, a standard benchmark for natural image segmentation,\nWoodScape, a challenging dataset of fisheye images commonly used in autonomous\ndriving, introducing significant geometric distortions; and ISIC2016, a dataset\nof dermoscopic images for skin lesion segmentation. We compare our proposed\nUNet-GNN model against established convolutional neural networks (CNNs) based\nsegmentation models, including U-Net and U-Net++, as well as the\ntransformer-based SwinUNet. Unlike these methods, which primarily rely on local\nconvolutional operations or global self-attention, GNNs explicitly model\nrelationships between image regions by constructing and operating on a graph\nrepresentation of the image features. This approach allows the model to capture\nlong-range dependencies and complex spatial relationships, which we hypothesize\nwill be particularly beneficial for handling geometric distortions present in\nfisheye imagery and capturing intricate boundaries in medical images. 
Our\nanalysis demonstrates the versatility of GNNs in addressing diverse\nsegmentation challenges and highlights their potential to improve segmentation\naccuracy in various applications, including autonomous driving and medical\nimage analysis.\n","authors":["Aryan Singh","Pepijn Van de Ven","Ciarán Eising","Patrick Denny"],"pdf_url":"https://arxiv.org/pdf/2501.03765v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.09453v2","updated":"2025-01-07T13:00:57Z","published":"2024-10-12T09:16:09Z","title":"MMAD: The First-Ever Comprehensive Benchmark for Multimodal Large\n Language Models in Industrial Anomaly Detection","summary":" In the field of industrial inspection, Multimodal Large Language Models\n(MLLMs) have a high potential to renew the paradigms in practical applications\ndue to their robust language capabilities and generalization abilities.\nHowever, despite their impressive problem-solving skills in many domains,\nMLLMs' ability in industrial anomaly detection has not been systematically\nstudied. To bridge this gap, we present MMAD, the first-ever full-spectrum\nMLLMs benchmark in industrial Anomaly Detection. We defined seven key subtasks\nof MLLMs in industrial inspection and designed a novel pipeline to generate the\nMMAD dataset with 39,672 questions for 8,366 industrial images. With MMAD, we\nhave conducted a comprehensive, quantitative evaluation of various\nstate-of-the-art MLLMs. The commercial models performed the best, with the\naverage accuracy of GPT-4o models reaching 74.9%. However, this result falls\nfar short of industrial requirements. Our analysis reveals that current MLLMs\nstill have significant room for improvement in answering questions related to\nindustrial anomalies and defects. 
We further explore two training-free\nperformance enhancement strategies to help models improve in industrial\nscenarios, highlighting their promising potential for future research.\n","authors":["Xi Jiang","Jian Li","Hanqiu Deng","Yong Liu","Bin-Bin Gao","Yifeng Zhou","Jialin Li","Chengjie Wang","Feng Zheng"],"pdf_url":"https://arxiv.org/pdf/2410.09453v2.pdf","comment":"The code and data are available at https://github.com/jam-cc/MMAD"},{"id":"http://arxiv.org/abs/2409.18301v3","updated":"2025-01-07T12:44:48Z","published":"2024-09-26T21:16:51Z","title":"Wavelet-Driven Generalizable Framework for Deepfake Face Forgery\n Detection","summary":" The evolution of digital image manipulation, particularly with the\nadvancement of deep generative models, significantly challenges existing\ndeepfake detection methods, especially when the origin of the deepfake is\nobscure. To tackle the increasing complexity of these forgeries, we propose\n\\textbf{Wavelet-CLIP}, a deepfake detection framework that integrates wavelet\ntransforms with features derived from the ViT-L/14 architecture, pre-trained in\nthe CLIP fashion. Wavelet-CLIP utilizes Wavelet Transforms to deeply analyze\nboth spatial and frequency features from images, thus enhancing the model's\ncapability to detect sophisticated deepfakes. To verify the effectiveness of\nour approach, we conducted extensive evaluations against existing\nstate-of-the-art methods for cross-dataset generalization and detection of\nunseen images generated by standard diffusion models. Our method showcases\noutstanding performance, achieving an average AUC of 0.749 for cross-data\ngeneralization and 0.893 for robustness against unseen deepfakes, outperforming\nall compared methods. 
The code can be reproduced from the repo:\n\\url{https://github.com/lalithbharadwajbaru/Wavelet-CLIP}\n","authors":["Lalith Bharadwaj Baru","Rohit Boddeda","Shilhora Akshay Patel","Sai Mohan Gajapaka"],"pdf_url":"https://arxiv.org/pdf/2409.18301v3.pdf","comment":"9 Pages, 2 Figures, 3 Tables"},{"id":"http://arxiv.org/abs/2501.03737v1","updated":"2025-01-07T12:29:32Z","published":"2025-01-07T12:29:32Z","title":"Re-Visible Dual-Domain Self-Supervised Deep Unfolding Network for MRI\n Reconstruction","summary":" Magnetic Resonance Imaging (MRI) is widely used in clinical practice, but\nsuffers from prolonged acquisition time. Although deep learning methods have\nbeen proposed to accelerate acquisition and demonstrate promising performance,\nthey rely on high-quality fully-sampled datasets for training in a supervised\nmanner. However, such datasets are time-consuming and expensive to collect,\nwhich constrains their broader applications. On the other hand, self-supervised\nmethods offer an alternative by enabling learning from under-sampled data\nalone, but most existing methods rely on further partitioned under-sampled\nk-space data as the model's input for training, resulting in a loss of valuable\ninformation. Additionally, their models have not fully incorporated image\npriors, leading to degraded reconstruction performance. In this paper, we\npropose a novel re-visible dual-domain self-supervised deep unfolding network\nto address these issues when only under-sampled datasets are available.\nSpecifically, by incorporating re-visible dual-domain loss, all under-sampled\nk-space data are utilized during training to mitigate information loss caused\nby further partitioning. This design enables the model to implicitly adapt to\nall under-sampled k-space data as input. 
Additionally, we design a deep\nunfolding network based on the Chambolle and Pock Proximal Point Algorithm\n(DUN-CP-PPA) to achieve end-to-end reconstruction, incorporating imaging\nphysics and image priors to guide the reconstruction process. By employing a\nSpatial-Frequency Feature Extraction (SFFE) block to capture global and local\nfeature representations, we enhance the model's ability to learn\ncomprehensive image priors efficiently. Experiments conducted on the fastMRI and IXI\ndatasets demonstrate that our method significantly outperforms state-of-the-art\napproaches in terms of reconstruction performance.\n","authors":["Hao Zhang","Qi Wang","Jian Sun","Zhijie Wen","Jun Shi","Shihui Ying"],"pdf_url":"https://arxiv.org/pdf/2501.03737v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03729v1","updated":"2025-01-07T12:17:25Z","published":"2025-01-07T12:17:25Z","title":"Realistic Test-Time Adaptation of Vision-Language Models","summary":" The zero-shot capabilities of Vision-Language Models (VLMs) have been widely\nleveraged to improve predictive performance. However, previous works on\ntransductive or test-time adaptation (TTA) often make strong assumptions about\nthe data distribution, such as the presence of all classes. Our work challenges\nthese favorable deployment scenarios, and introduces a more realistic\nevaluation framework, including: (i) a variable number of effective classes for\nadaptation within a single batch, and (ii) non-i.i.d. batches of test samples\nin online adaptation settings. We provide comprehensive evaluations,\ncomparisons, and ablation studies that demonstrate how current transductive or\nTTA methods for VLMs systematically compromise the models' initial zero-shot\nrobustness across various realistic scenarios, favoring performance gains under\nadvantageous assumptions about the test samples' distributions. 
Furthermore, we\nintroduce StatA, a versatile method that could handle a wide range of\ndeployment scenarios, including those with a variable number of effective\nclasses at test time. Our approach incorporates a novel regularization term\ndesigned specifically for VLMs, which acts as a statistical anchor preserving\nthe initial text-encoder knowledge, particularly in low-data regimes. Code\navailable at https://github.com/MaxZanella/StatA.\n","authors":["Maxime Zanella","Clément Fuchs","Christophe De Vleeschouwer","Ismail Ben Ayed"],"pdf_url":"https://arxiv.org/pdf/2501.03729v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03722v1","updated":"2025-01-07T12:03:02Z","published":"2025-01-07T12:03:02Z","title":"Self-adaptive vision-language model for 3D segmentation of pulmonary\n artery and vein","summary":" Accurate segmentation of pulmonary structures is crucial in clinical\ndiagnosis, disease study, and treatment planning. Significant progress has been\nmade in deep learning-based segmentation techniques, but most require large amounts of\nlabeled data for training. Consequently, developing precise segmentation\nmethods that demand fewer labeled datasets is paramount in medical image\nanalysis. The emergence of pre-trained vision-language foundation models, such\nas CLIP, recently opened the door for universal computer vision tasks.\nExploiting the generalization ability of these pre-trained foundation models on\ndownstream tasks, such as segmentation, leads to unexpectedly strong performance with a\nrelatively small amount of labeled data. However, exploring these models for\npulmonary artery-vein segmentation is still limited. This paper proposes a\nnovel framework called Language-guided self-adaptive Cross-Attention Fusion\nFramework. Our method adopts pre-trained CLIP as a strong feature extractor for\ngenerating the segmentation of 3D CT scans, while adaptively aggregating the\ncross-modality of text and image representations. 
We propose a specially\ndesigned adapter module to fine-tune pre-trained CLIP with a self-adaptive\nlearning strategy to effectively fuse the two modalities of embeddings. We\nextensively validate our method on a local dataset, which is the largest\npulmonary artery-vein CT dataset to date and consists of 718 labeled scans in\ntotal. The experiments show that our method outperformed other state-of-the-art\nmethods by a large margin. Our data and code will be made publicly available\nupon acceptance.\n","authors":["Xiaotong Guo","Deqian Yang","Dan Wang","Haochen Zhao","Yuan Li","Zhilin Sui","Tao Zhou","Lijun Zhang","Yanda Meng"],"pdf_url":"https://arxiv.org/pdf/2501.03722v1.pdf","comment":"8 pages, 3 figures"},{"id":"http://arxiv.org/abs/2408.16469v2","updated":"2025-01-07T12:02:22Z","published":"2024-08-29T12:00:11Z","title":"Multi-source Domain Adaptation for Panoramic Semantic Segmentation","summary":" Unsupervised domain adaptation methods for panoramic semantic segmentation\nutilize real pinhole images or low-cost synthetic panoramic images to transfer\nsegmentation models to real panoramic images. However, these methods struggle\nto understand the panoramic structure using only real pinhole images and lack\nreal-world scene perception with only synthetic panoramic images. Therefore, in\nthis paper, we propose a new task, Multi-source Domain Adaptation for Panoramic\nSemantic Segmentation (MSDA4PASS), which leverages both real pinhole and\nsynthetic panoramic images to improve segmentation on unlabeled real panoramic\nimages. There are two key issues in the MSDA4PASS task: (1) distortion gaps\nbetween the pinhole and panoramic domains -- panoramic images exhibit global\nand local distortions absent in pinhole images; (2) texture gaps between the\nsource and target domains -- scenes and styles differ across domains. 
To\naddress these two issues, we propose a novel framework, Deformation Transform\nAligner for Panoramic Semantic Segmentation (DTA4PASS), which converts all\npinhole images in the source domains into distorted images and aligns the\nsource distorted and panoramic images with the target panoramic images.\nSpecifically, DTA4PASS consists of two main components: Unpaired Semantic\nMorphing (USM) and Distortion Gating Alignment (DGA). First, in USM, the\nDual-view Discriminator (DvD) assists in training the diffeomorphic deformation\nnetwork at the image and pixel level, enabling the effective deformation\ntransformation of pinhole images without paired panoramic views, alleviating\ndistortion gaps. Second, DGA assigns pinhole-like (pin-like) and panoramic-like\n(pan-like) features to each image by gating, and aligns these two features\nthrough uncertainty estimation, reducing texture gaps.\n","authors":["Jing Jiang","Sicheng Zhao","Jiankun Zhu","Wenbo Tang","Zhaopan Xu","Jidong Yang","Guoping Liu","Tengfei Xing","Pengfei Xu","Hongxun Yao"],"pdf_url":"https://arxiv.org/pdf/2408.16469v2.pdf","comment":"Accepted by Information Fusion 2025"},{"id":"http://arxiv.org/abs/2501.03717v1","updated":"2025-01-07T11:52:01Z","published":"2025-01-07T11:52:01Z","title":"Materialist: Physically Based Editing Using Single-Image Inverse\n Rendering","summary":" To perform image editing based on single-view, inverse physically based\nrendering, we present a method combining a learning-based approach with\nprogressive differentiable rendering. Given an image, our method leverages\nneural networks to predict initial material properties. Progressive\ndifferentiable rendering is then used to optimize the environment map and\nrefine the material properties with the goal of closely matching the rendered\nresult to the input image. We require only a single image while other inverse\nrendering methods based on the rendering equation require multiple views. 
In\ncomparison to single-view methods that rely on neural renderers, our approach\nachieves more realistic light-material interactions, accurate shadows, and\nglobal illumination. Furthermore, with optimized material properties and\nillumination, our method enables a variety of tasks, including physically based\nmaterial editing, object insertion, and relighting. We also propose a method\nfor material transparency editing that operates effectively without requiring\nfull scene geometry. Compared with methods based on Stable Diffusion, our\napproach offers stronger interpretability and more realistic light refraction\nbased on empirical results.\n","authors":["Lezhong Wang","Duc Minh Tran","Ruiqi Cui","Thomson TG","Manmohan Chandraker","Jeppe Revall Frisvad"],"pdf_url":"https://arxiv.org/pdf/2501.03717v1.pdf","comment":"code will be available at github.com/lez-s/Materialist"},{"id":"http://arxiv.org/abs/2501.03714v1","updated":"2025-01-07T11:43:13Z","published":"2025-01-07T11:43:13Z","title":"MoDec-GS: Global-to-Local Motion Decomposition and Temporal Interval\n Adjustment for Compact Dynamic 3D Gaussian Splatting","summary":" 3D Gaussian Splatting (3DGS) has made significant strides in scene\nrepresentation and neural rendering, with intense efforts focused on adapting\nit for dynamic scenes. Despite delivering remarkable rendering quality and\nspeed, existing methods struggle with storage demands and representing complex\nreal-world motions. To tackle these issues, we propose MoDec-GS, a\nmemory-efficient Gaussian splatting framework designed for reconstructing novel\nviews in challenging scenarios with complex motions. We introduce Global-to-Local\nMotion Decomposition (GLMD) to effectively capture dynamic motions in a\ncoarse-to-fine manner. This approach leverages Global Canonical Scaffolds (Global\nCS) and Local Canonical Scaffolds (Local CS), extending static Scaffold\nrepresentation to dynamic video reconstruction. 
For Global CS, we propose\nGlobal Anchor Deformation (GAD) to efficiently represent global dynamics along\ncomplex motions, by directly deforming the implicit Scaffold attributes which\nare anchor position, offset, and local context features. Next, we finely adjust\nlocal motions via the Local Gaussian Deformation (LGD) of Local CS explicitly.\nAdditionally, we introduce Temporal Interval Adjustment (TIA) to automatically\ncontrol the temporal coverage of each Local CS during training, allowing\nMoDec-GS to find optimal interval assignments based on the specified number of\ntemporal segments. Extensive evaluations demonstrate that MoDec-GS achieves an\naverage 70% reduction in model size over state-of-the-art methods for dynamic 3D\nGaussians from real-world dynamic videos while maintaining or even improving\nrendering quality.\n","authors":["Sangwoon Kwak","Joonsoo Kim","Jun Young Jeong","Won-Sik Cheong","Jihyong Oh","Munchurl Kim"],"pdf_url":"https://arxiv.org/pdf/2501.03714v1.pdf","comment":"The last two authors are co-corresponding authors. Please visit our\n project page at https://kaist-viclab.github.io/MoDecGS-site/"},{"id":"http://arxiv.org/abs/2409.09424v3","updated":"2025-01-07T11:37:57Z","published":"2024-09-14T12:25:14Z","title":"NBBOX: Noisy Bounding Box Improves Remote Sensing Object Detection","summary":" Data augmentation has shown significant advancements in computer vision to\nimprove model performance over the years, particularly in scenarios with\nlimited and insufficient data. Currently, most studies focus on adjusting the\nimage or its features to expand the size, quality, and variety of samples\nduring training in various tasks including object detection. However, we argue\nthat it is necessary to investigate bounding box transformations as a data\naugmentation technique rather than image-level transformations, especially in\naerial imagery due to potentially inconsistent bounding box annotations. 
Hence,\nthis letter presents a thorough investigation of bounding box transformation in\nterms of scaling, rotation, and translation for remote sensing object\ndetection. We call this augmentation strategy NBBOX (Noise Injection into\nBounding Box). We conduct extensive experiments on DOTA and DIOR-R, both\nwell-known datasets that include a variety of rotated generic objects in aerial\nimages. Experimental results show that our approach significantly improves\nremote sensing object detection without bells and whistles, and it is more\ntime-efficient than other state-of-the-art augmentation strategies.\n","authors":["Yechan Kim","SooYeon Kim","Moongu Jeon"],"pdf_url":"https://arxiv.org/pdf/2409.09424v3.pdf","comment":"Accepted to IEEE Geoscience and Remote Sensing Letters"},{"id":"http://arxiv.org/abs/2411.11543v3","updated":"2025-01-07T11:09:52Z","published":"2024-11-18T13:01:57Z","title":"PSA-VLM: Enhancing Vision-Language Model Safety through Progressive\n Concept-Bottleneck-Driven Alignment","summary":" Benefiting from the powerful capabilities of Large Language Models (LLMs),\npre-trained visual encoder models connected to LLMs form Vision Language Models\n(VLMs). However, recent research shows that the visual modality in VLMs is\nhighly vulnerable, allowing attackers to bypass safety alignment in LLMs\nthrough visually transmitted content, launching harmful attacks. To address\nthis challenge, we propose a progressive concept-based alignment strategy,\nPSA-VLM, which incorporates safety modules as concept bottlenecks to enhance\nvisual modality safety alignment. By aligning model predictions with specific\nsafety concepts, we improve defenses against risky images, enhancing\nexplainability and controllability while minimally impacting general\nperformance. Our method is obtained through two-stage training. 
The low\ncomputational cost of the first stage brings a substantial performance\nimprovement, and the fine-tuning of the language model in the second stage\nfurther improves the safety performance. Our method achieves state-of-the-art\nresults on popular VLM safety benchmarks.\n","authors":["Zhendong Liu","Yuanbi Nie","Yingshui Tan","Jiaheng Liu","Xiangyu Yue","Qiushi Cui","Chongjun Wang","Xiaoyong Zhu","Bo Zheng"],"pdf_url":"https://arxiv.org/pdf/2411.11543v3.pdf","comment":"arXiv admin note: substantial text overlap with arXiv:2405.13581"},{"id":"http://arxiv.org/abs/2501.03700v1","updated":"2025-01-07T11:07:32Z","published":"2025-01-07T11:07:32Z","title":"AuxDepthNet: Real-Time Monocular 3D Object Detection with\n Depth-Sensitive Features","summary":" Monocular 3D object detection is a challenging task in autonomous systems due\nto the lack of explicit depth information in single-view images. Existing\nmethods often depend on external depth estimators or expensive sensors, which\nincrease computational complexity and hinder real-time performance. To overcome\nthese limitations, we propose AuxDepthNet, an efficient framework for real-time\nmonocular 3D object detection that eliminates the reliance on external depth\nmaps or pre-trained depth models. AuxDepthNet introduces two key components:\nthe Auxiliary Depth Feature (ADF) module, which implicitly learns\ndepth-sensitive features to improve spatial reasoning and computational\nefficiency, and the Depth Position Mapping (DPM) module, which embeds depth\npositional information directly into the detection process to enable accurate\nobject localization and 3D bounding box regression. Leveraging the DepthFusion\nTransformer architecture, AuxDepthNet globally integrates visual and\ndepth-sensitive features through depth-guided interactions, ensuring robust and\nefficient detection. 
Extensive experiments on the KITTI dataset show that\nAuxDepthNet achieves state-of-the-art performance, with $\\text{AP}_{3D}$ scores\nof 24.72\\% (Easy), 18.63\\% (Moderate), and 15.31\\% (Hard), and\n$\\text{AP}_{\\text{BEV}}$ scores of 34.11\\% (Easy), 25.18\\% (Moderate), and\n21.90\\% (Hard) at an IoU threshold of 0.7.\n","authors":["Ruochen Zhang","Hyeung-Sik Choi","Dongwook Jung","Phan Huy Nam Anh","Sang-Ki Jeong","Zihao Zhu"],"pdf_url":"https://arxiv.org/pdf/2501.03700v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.15353v3","updated":"2025-01-07T11:06:13Z","published":"2024-03-22T17:08:03Z","title":"Fully automated workflow for designing patient-specific orthopaedic\n implants: application to total knee arthroplasty","summary":" Background. Osteoarthritis affects about 528 million people worldwide,\ncausing pain and stiffness in the joints. Arthroplasty is commonly performed to\ntreat joint osteoarthritis, reducing pain and improving mobility. Nevertheless,\na significant share of patients remain unsatisfied with their surgery.\nPersonalised arthroplasty was introduced to improve surgical outcomes; however,\ncurrent solutions require delays, making them difficult to integrate into clinical\nroutine. We propose a fully automated workflow to design patient-specific\nimplants for total knee arthroplasty.\n Methods. The proposed pipeline first uses artificial neural networks to\nsegment the femur and tibia proximal and distal extremities. Then the full\nbones are reconstructed using augmented statistical shape models, combining\nshape and landmarks information. Finally, 77 morphological parameters are\ncomputed to design patient-specific implants. The developed workflow has been\ntrained on 91 CT scans and evaluated on 41 CT scans, in terms of accuracy and\nexecution time.\n Results. The workflow accuracy was $0.4\\pm0.2mm$ for segmentation,\n$1.0\\pm0.3mm$ for full bone reconstruction, and $2.2\\pm1.5mm$ for anatomical\nlandmarks determination. 
The custom implants fitted the patients' anatomy with\n$0.9\\pm0.5mm$ accuracy. The whole process from segmentation to implants' design\nlasted about 15 minutes.\n Conclusion. The proposed workflow performs a fast and reliable\npersonalisation of knee implants, directly from a CT image without requiring\nany manual intervention. It allows the establishment of patient-specific\npre-operative planning in a very short time, making it easily available for all\npatients. Combined with efficient implant manufacturing techniques, this\nsolution could help meet the growing demand for arthroplasties while reducing\ncomplications and improving patients' satisfaction.\n","authors":["Aziliz Guezou-Philippe","Arnaud Clavé","Ehouarn Maguet","Ludivine Maintier","Charles Garraud","Jean-Rassaire Fouefack","Valérie Burdin","Eric Stindel","Guillaume Dardenne"],"pdf_url":"https://arxiv.org/pdf/2403.15353v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03699v1","updated":"2025-01-07T11:03:43Z","published":"2025-01-07T11:03:43Z","title":"Motion-Aware Generative Frame Interpolation","summary":" Generative frame interpolation, empowered by large-scale pre-trained video\ngeneration models, has demonstrated remarkable advantages in complex scenes.\nHowever, existing methods heavily rely on the generative model to independently\ninfer the correspondences between input frames, an ability that is inadequately\ndeveloped during pre-training. In this work, we propose a novel framework,\ntermed Motion-aware Generative frame interpolation (MoG), to significantly\nenhance the model's motion awareness by integrating explicit motion guidance.\nSpecifically, we investigate two key questions: what can serve as effective\nmotion guidance, and how we can seamlessly embed this guidance into the\ngenerative model. For the first question, we reveal that the intermediate flow\nfrom flow-based interpolation models could efficiently provide task-oriented\nmotion guidance. 
Regarding the second, we first obtain guidance-based\nrepresentations of intermediate frames by warping input frames' representations\nusing guidance, and then integrate them into the model at both latent and\nfeature levels. To demonstrate the versatility of our method, we train MoG on\nboth real-world and animation datasets. Comprehensive evaluations show that our\nMoG significantly outperforms the existing methods in both domains, achieving\nsuperior video quality and improved fidelity.\n","authors":["Guozhen Zhang","Yuhan Zhu","Yutao Cui","Xiaotong Zhao","Kai Ma","Limin Wang"],"pdf_url":"https://arxiv.org/pdf/2501.03699v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.15286v3","updated":"2025-01-07T10:34:12Z","published":"2024-05-24T07:18:09Z","title":"3D Annotation-Free Learning by Distilling 2D Open-Vocabulary\n Segmentation Models for Autonomous Driving","summary":" Point cloud data labeling is considered a time-consuming and expensive task\nin autonomous driving, whereas annotation-free training can avoid it\nby learning point cloud representations from unannotated data. In this paper,\nwe propose AFOV, a novel 3D \\textbf{A}nnotation-\\textbf{F}ree framework\nassisted by 2D \\textbf{O}pen-\\textbf{V}ocabulary segmentation models. It\nconsists of two stages: In the first stage, we innovatively integrate\nhigh-quality textual and image features of 2D open-vocabulary models and\npropose the Tri-Modal contrastive Pre-training (TMP). In the second stage,\nspatial mapping between point clouds and images is utilized to generate\npseudo-labels, enabling cross-modal knowledge distillation. In addition, we\nintroduce the Approximate Flat Interaction (AFI) to address the noise during\nalignment and label confusion. To validate the superiority of AFOV, extensive\nexperiments are conducted on multiple related datasets. 
We achieved a\nrecord-breaking 47.73\\% mIoU on the annotation-free 3D segmentation task in\nnuScenes, surpassing the previous best model by 3.13\\% mIoU. Meanwhile, the\nperformance of fine-tuning with 1\\% data on nuScenes and SemanticKITTI reached\na remarkable 51.75\\% mIoU and 48.14\\% mIoU, outperforming all previous\npre-trained models.\n","authors":["Boyi Sun","Yuhang Liu","Xingxia Wang","Bin Tian","Long Chen","Fei-Yue Wang"],"pdf_url":"https://arxiv.org/pdf/2405.15286v3.pdf","comment":"15 pages, 7 figures, codes are available at\n https://github.com/sbysbysbys/AFOV"},{"id":"http://arxiv.org/abs/2501.03675v1","updated":"2025-01-07T10:21:21Z","published":"2025-01-07T10:21:21Z","title":"SMIR: Efficient Synthetic Data Pipeline To Improve Multi-Image Reasoning","summary":" Vision-Language Models (VLMs) have shown strong performance in understanding\nsingle images, aided by numerous high-quality instruction datasets. However,\nmulti-image reasoning tasks are still under-explored in the open-source\ncommunity due to two main challenges: (1) scaling datasets with multiple\ncorrelated images and complex reasoning instructions is resource-intensive and\nmaintaining quality is difficult, and (2) there is a lack of robust evaluation\nbenchmarks for multi-image tasks. To address these issues, we introduce SMIR,\nan efficient synthetic data-generation pipeline for multi-image reasoning, and\na high-quality dataset generated using this pipeline. Our pipeline efficiently\nextracts highly correlated images using multimodal embeddings, combining visual\nand descriptive information, and leverages open-source LLMs to generate quality\ninstructions. Using this pipeline, we generated 160K synthetic training\nsamples, offering a cost-effective alternative to expensive closed-source\nsolutions. Additionally, we present SMIR-BENCH, a novel multi-image reasoning\nevaluation benchmark comprising 200 diverse examples across 7 complex\nmulti-image reasoning tasks. 
SMIR-BENCH is multi-turn and utilizes a VLM judge\nto evaluate free-form responses, providing a comprehensive assessment of model\nexpressiveness and reasoning capability across modalities. We demonstrate the\neffectiveness of the SMIR dataset by fine-tuning several open-source VLMs and\nevaluating their performance on SMIR-BENCH. Our results show that models\ntrained on our dataset outperform baseline models in multi-image reasoning\ntasks by up to 8% with a much more scalable data pipeline.\n","authors":["Andrew Li","Rahul Thapa","Rahul Chalamala","Qingyang Wu","Kezhen Chen","James Zou"],"pdf_url":"https://arxiv.org/pdf/2501.03675v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03674v1","updated":"2025-01-07T10:20:16Z","published":"2025-01-07T10:20:16Z","title":"Action Quality Assessment via Hierarchical Pose-guided Multi-stage\n Contrastive Regression","summary":" Action Quality Assessment (AQA), which aims at automatic and fair evaluation\nof athletic performance, has gained increasing attention in recent years.\nHowever, athletes are often in rapid movement and the corresponding visual\nappearance variances are subtle, making it challenging to capture fine-grained\npose differences and leading to poor estimation performance. Furthermore, most\ncommon AQA tasks, such as diving in sports, are usually divided into multiple\nsub-actions, each of which has a different duration. However, existing\nmethods focus on segmenting the video into fixed frames, which disrupts the\ntemporal continuity of sub-actions, resulting in unavoidable prediction errors.\nTo address these challenges, we propose a novel action quality assessment\nmethod through hierarchically pose-guided multi-stage contrastive regression.\nFirstly, we introduce a multi-scale dynamic visual-skeleton encoder to capture\nfine-grained spatio-temporal visual and skeletal features. Then, a procedure\nsegmentation network is introduced to separate different sub-actions and obtain\nsegmented features. 
Afterwards, the segmented visual and skeletal features are\nboth fed into a multi-modal fusion module as physics structural priors, to\nguide the model in learning refined activity similarities and variances.\nFinally, a multi-stage contrastive learning regression approach is employed to\nlearn discriminative representations and output prediction results. In\naddition, we introduce a newly-annotated FineDiving-Pose Dataset to improve the\ncurrent low-quality human pose labels. In experiments, the results on\nFineDiving and MTL-AQA datasets demonstrate the effectiveness and superiority\nof our proposed approach. Our source code and dataset are available at\nhttps://github.com/Lumos0507/HP-MCoRe.\n","authors":["Mengshi Qi","Hao Ye","Jiaxuan Peng","Huadong Ma"],"pdf_url":"https://arxiv.org/pdf/2501.03674v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.17678v3","updated":"2025-01-07T10:08:19Z","published":"2024-03-26T13:05:49Z","title":"Hierarchical Light Transformer Ensembles for Multimodal Trajectory\n Forecasting","summary":" Accurate trajectory forecasting is crucial for the performance of various\nsystems, such as advanced driver-assistance systems and self-driving vehicles.\nThese forecasts allow us to anticipate events that lead to collisions and,\ntherefore, to mitigate them. Deep Neural Networks have excelled in motion\nforecasting, but overconfidence and weak uncertainty quantification persist.\nDeep Ensembles address these concerns, yet applying them to multimodal\ndistributions remains challenging. In this paper, we propose a novel approach\nnamed Hierarchical Light Transformer Ensembles (HLT-Ens) aimed at efficiently\ntraining an ensemble of Transformer architectures using a novel hierarchical\nloss function. HLT-Ens leverages grouped fully connected layers, inspired by\ngrouped convolution techniques, to capture multimodal distributions\neffectively. 
We demonstrate that HLT-Ens achieves state-of-the-art performance\nlevels through extensive experimentation, offering a promising avenue for\nimproving trajectory forecasting techniques.\n","authors":["Adrien Lafage","Mathieu Barbier","Gianni Franchi","David Filliat"],"pdf_url":"https://arxiv.org/pdf/2403.17678v3.pdf","comment":"WACV 2025"},{"id":"http://arxiv.org/abs/2501.03664v1","updated":"2025-01-07T10:04:01Z","published":"2025-01-07T10:04:01Z","title":"Local Compositional Complexity: How to Detect a Human-readable Message","summary":" Data complexity is an important concept in the natural sciences and related\nareas, but lacks a rigorous and computable definition. In this paper, we focus\non a particular sense of complexity that is high if the data is structured in a\nway that could serve to communicate a message. In this sense, human speech,\nwritten language, drawings, diagrams and photographs are high complexity,\nwhereas data that is close to uniform throughout or populated by random values\nis low complexity. We describe a general framework for measuring data\ncomplexity based on dividing the shortest description of the data into a\nstructured and an unstructured portion, and taking the size of the former as\nthe complexity score. We outline an application of this framework in\nstatistical mechanics that may allow a more objective characterisation of the\nmacrostate and entropy of a physical system. Then, we derive a more precise and\ncomputable definition geared towards human communication, by proposing local\ncompositionality as an appropriate specific structure. 
We demonstrate\nexperimentally that this method can distinguish meaningful signals from noise\nor repetitive signals in auditory, visual and text domains, and could\npotentially help determine whether an extra-terrestrial signal contained a\nmessage.\n","authors":["Louis Mahon"],"pdf_url":"https://arxiv.org/pdf/2501.03664v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19543v2","updated":"2025-01-07T09:56:36Z","published":"2024-12-27T09:10:30Z","title":"Diverse Rare Sample Generation with Pretrained GANs","summary":" Deep generative models are proficient in generating realistic data but\nstruggle with producing rare samples in low density regions due to their\nscarcity of training datasets and the mode collapse problem. While recent\nmethods aim to improve the fidelity of generated samples, they often reduce\ndiversity and coverage by ignoring rare and novel samples. This study proposes\na novel approach for generating diverse rare samples from high-resolution image\ndatasets with pretrained GANs. Our method employs gradient-based optimization\nof latent vectors within a multi-objective framework and utilizes normalizing\nflows for density estimation on the feature space. This enables the generation\nof diverse rare images, with controllable parameters for rarity, diversity, and\nsimilarity to a reference image. We demonstrate the effectiveness of our\napproach both qualitatively and quantitatively across various datasets and GANs\nwithout retraining or fine-tuning the pretrained GANs.\n","authors":["Subeen Lee","Jiyeon Han","Soyeon Kim","Jaesik Choi"],"pdf_url":"https://arxiv.org/pdf/2412.19543v2.pdf","comment":"Accepted at AAAI 2025"},{"id":"http://arxiv.org/abs/2501.03659v1","updated":"2025-01-07T09:47:46Z","published":"2025-01-07T09:47:46Z","title":"DehazeGS: Seeing Through Fog with 3D Gaussian Splatting","summary":" Current novel view synthesis tasks primarily rely on high-quality and clear\nimages. 
However, in foggy scenes, scattering and attenuation can significantly\ndegrade the reconstruction and rendering quality. Although NeRF-based dehazing\nreconstruction algorithms have been developed, their use of deep fully\nconnected neural networks and per-ray sampling strategies leads to high\ncomputational costs. Moreover, NeRF's implicit representation struggles to\nrecover fine details from hazy scenes. In contrast, recent advancements in 3D\nGaussian Splatting achieve high-quality 3D scene reconstruction by explicitly\nmodeling point clouds into 3D Gaussians. In this paper, we propose leveraging\nthe explicit Gaussian representation to explain the foggy image formation\nprocess through a physically accurate forward rendering process. We introduce\nDehazeGS, a method capable of decomposing and rendering a fog-free background\nfrom participating media using only multi-view foggy images as input. We model\nthe transmission within each Gaussian distribution to simulate the formation of\nfog. During this process, we jointly learn the atmospheric light and scattering\ncoefficient while optimizing the Gaussian representation of the hazy scene. 
In\nthe inference stage, we eliminate the effects of scattering and attenuation on\nthe Gaussians and directly project them onto a 2D plane to obtain a clear view.\nExperiments on both synthetic and real-world foggy datasets demonstrate that\nDehazeGS achieves state-of-the-art performance in terms of both rendering\nquality and computational efficiency.\n","authors":["Jinze Yu","Yiqun Wang","Zhengda Lu","Jianwei Guo","Yong Li","Hongxing Qin","Xiaopeng Zhang"],"pdf_url":"https://arxiv.org/pdf/2501.03659v1.pdf","comment":"9 pages, 4 figures"},{"id":"http://arxiv.org/abs/2408.12928v2","updated":"2025-01-07T09:39:15Z","published":"2024-08-23T09:14:58Z","title":"ParGo: Bridging Vision-Language with Partial and Global Views","summary":" This work presents ParGo, a novel Partial-Global projector designed to\nconnect the vision and language modalities for Multimodal Large Language Models\n(MLLMs). Unlike previous works that rely on global attention-based projectors,\nour ParGo bridges the representation gap between the separately pre-trained\nvision encoders and the LLMs by integrating global and partial views, which\nalleviates the overemphasis on prominent regions. To facilitate the effective\ntraining of ParGo, we collect a large-scale detail-captioned image-text dataset\nnamed ParGoCap-1M-PT, consisting of 1 million images paired with high-quality\ncaptions. Extensive experiments on several MLLM benchmarks demonstrate the\neffectiveness of our ParGo, highlighting its superiority in aligning vision and\nlanguage modalities. Compared to the conventional Q-Former projector, our ParGo\nachieves an improvement of 259.96 on the MME benchmark. 
Furthermore, our\nexperiments reveal that ParGo significantly outperforms other projectors,\nparticularly in tasks that emphasize detail perception ability.\n","authors":["An-Lan Wang","Bin Shan","Wei Shi","Kun-Yu Lin","Xiang Fei","Guozhi Tang","Lei Liao","Jingqun Tang","Can Huang","Wei-Shi Zheng"],"pdf_url":"https://arxiv.org/pdf/2408.12928v2.pdf","comment":"Accepted by AAAI 2025"},{"id":"http://arxiv.org/abs/2411.16370v2","updated":"2025-01-07T09:34:51Z","published":"2024-11-25T13:26:09Z","title":"A Review of Bayesian Uncertainty Quantification in Deep Probabilistic\n Image Segmentation","summary":" Advancements in image segmentation play an integral role within the broad\nscope of Deep Learning-based Computer Vision. Furthermore, their widespread\napplicability in critical real-world tasks has resulted in challenges related\nto the reliability of such algorithms. Hence, uncertainty quantification has\nbeen extensively studied within this context, enabling the expression of model\nignorance (epistemic uncertainty) or data ambiguity (aleatoric uncertainty) to\nprevent uninformed decision-making. Due to the rapid adoption of Convolutional\nNeural Network (CNN)-based segmentation models in high-stake applications, a\nsubstantial body of research has been published on this very topic, causing its\nswift expansion into a distinct field. This work provides a comprehensive\noverview of probabilistic segmentation, by discussing fundamental concepts of\nuncertainty quantification, governing advancements in the field as well as the\napplication to various tasks. Moreover, literature on both types of\nuncertainties traces back to four key applications: (1) to quantify statistical\ninconsistencies in the annotation process due to ambiguous images, (2) correlating\nprediction error with uncertainty, (3) expanding the model hypothesis space for\nbetter generalization, and (4) Active Learning. 
An extensive discussion follows\nthat includes an overview of utilized datasets for each of the applications and\nevaluation of the available methods. We also highlight challenges related to\narchitectures, uncertainty quantification methods, standardization and\nbenchmarking, and finally end with recommendations for future work such as\nmethods based on single forward passes and models that appropriately leverage\nvolumetric data.\n","authors":["M. M. A. Valiuddin","R. J. G. van Sloun","C. G. A. Viviers","P. H. N. de With","F. van der Sommen"],"pdf_url":"https://arxiv.org/pdf/2411.16370v2.pdf","comment":"20 pages, revised"},{"id":"http://arxiv.org/abs/2409.00698v2","updated":"2025-01-07T09:26:03Z","published":"2024-09-01T11:39:13Z","title":"Enhancing Remote Sensing Vision-Language Models for Zero-Shot Scene\n Classification","summary":" Vision-Language Models for remote sensing have shown promising uses thanks to\ntheir extensive pretraining. However, their conventional usage in zero-shot\nscene classification methods still involves dividing large images into patches\nand making independent predictions, i.e., inductive inference, thereby limiting\ntheir effectiveness by ignoring valuable contextual information. Our approach\ntackles this issue by utilizing initial predictions based on text prompting and\npatch affinity relationships from the image encoder to enhance zero-shot\ncapabilities through transductive inference, all without the need for\nsupervision and at a minor computational cost. Experiments on 10 remote sensing\ndatasets with state-of-the-art Vision-Language Models demonstrate significant\naccuracy improvements over inductive zero-shot classification. 
Our source code\nis publicly available on Github: https://github.com/elkhouryk/RS-TransCLIP\n","authors":["Karim El Khoury","Maxime Zanella","Benoît Gérin","Tiffanie Godelaine","Benoît Macq","Saïd Mahmoudi","Christophe De Vleeschouwer","Ismail Ben Ayed"],"pdf_url":"https://arxiv.org/pdf/2409.00698v2.pdf","comment":"Accepted at ICASSP 2025"},{"id":"http://arxiv.org/abs/2501.01427v3","updated":"2025-01-07T09:16:57Z","published":"2025-01-02T18:59:54Z","title":"VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion\n Control","summary":" Despite significant advancements in video generation, inserting a given\nobject into videos remains a challenging task. The difficulty lies in\npreserving the appearance details of the reference object and accurately\nmodeling coherent motions at the same time. In this paper, we propose\nVideoAnydoor, a zero-shot video object insertion framework with high-fidelity\ndetail preservation and precise motion control. Starting from a text-to-video\nmodel, we utilize an ID extractor to inject the global identity and leverage a\nbox sequence to control the overall motion. To preserve the detailed appearance\nand meanwhile support fine-grained motion control, we design a pixel warper. It\ntakes the reference image with arbitrary key-points and the corresponding\nkey-point trajectories as inputs. It warps the pixel details according to the\ntrajectories and fuses the warped features with the diffusion U-Net, thus\nimproving detail preservation and supporting users in manipulating the motion\ntrajectories. 
In addition, we propose a training strategy involving both videos\nand static images with a weighted loss to enhance insertion quality.\nVideoAnydoor demonstrates significant superiority over existing methods and\nnaturally supports various downstream applications (e.g., talking head\ngeneration, video virtual try-on, multi-region editing) without task-specific\nfine-tuning.\n","authors":["Yuanpeng Tu","Hao Luo","Xi Chen","Sihui Ji","Xiang Bai","Hengshuang Zhao"],"pdf_url":"https://arxiv.org/pdf/2501.01427v3.pdf","comment":"Project page: https://videoanydoor.github.io/"},{"id":"http://arxiv.org/abs/2410.16020v2","updated":"2025-01-07T09:15:19Z","published":"2024-10-21T13:50:32Z","title":"START: A Generalized State Space Model with Saliency-Driven Token-Aware\n Transformation","summary":" Domain Generalization (DG) aims to enable models to generalize to unseen\ntarget domains by learning from multiple source domains. Existing DG methods\nprimarily rely on convolutional neural networks (CNNs), which inherently learn\ntexture biases due to their limited receptive fields, making them prone to\noverfitting source domains. While some works have introduced transformer-based\nmethods (ViTs) for DG to leverage the global receptive field, these methods\nincur high computational costs due to the quadratic complexity of\nself-attention. Recently, advanced state space models (SSMs), represented by\nMamba, have shown promising results in supervised learning tasks by achieving\nlinear complexity in sequence length during training and fast RNN-like\ncomputation during inference. Inspired by this, we investigate the\ngeneralization ability of the Mamba model under domain shifts and find that\ninput-dependent matrices within SSMs could accumulate and amplify\ndomain-specific features, thus hindering model generalization. 
To address this\nissue, we propose a novel SSM-based architecture with saliency-based\ntoken-aware transformation (namely START), which achieves state-of-the-art\n(SOTA) performances and offers a competitive alternative to CNNs and ViTs. Our\nSTART can selectively perturb and suppress domain-specific features in salient\ntokens within the input-dependent matrices of SSMs, thus effectively reducing\nthe discrepancy between different domains. Extensive experiments on five\nbenchmarks demonstrate that START outperforms existing SOTA DG methods with\nefficient linear complexity. Our code is available at\nhttps://github.com/lingeringlight/START.\n","authors":["Jintao Guo","Lei Qi","Yinghuan Shi","Yang Gao"],"pdf_url":"https://arxiv.org/pdf/2410.16020v2.pdf","comment":"Accepted by NeurIPS2024. The code is available at\n https://github.com/lingeringlight/START"},{"id":"http://arxiv.org/abs/2501.03637v1","updated":"2025-01-07T09:12:55Z","published":"2025-01-07T09:12:55Z","title":"Advancing the Understanding of Fine-Grained 3D Forest Structures using\n Digital Cousins and Simulation-to-Reality: Methods and Datasets","summary":" Understanding and analyzing the spatial semantics and structure of forests is\nessential for accurate forest resource monitoring and ecosystem research.\nHowever, the lack of large-scale and annotated datasets has limited the\nwidespread use of advanced intelligent techniques in this field. To address\nthis challenge, a fully automated synthetic data generation and processing\nframework based on the concepts of Digital Cousins and Simulation-to-Reality\n(Sim2Real) is proposed, offering versatility and scalability to any size and\nplatform. Using this process, we created the Boreal3D, the world's largest\nforest point cloud dataset. It includes 1000 highly realistic and structurally\ndiverse forest plots across four different platforms, totaling 48,403 trees and\nover 35.3 billion points. 
Each point is labeled with semantic, instance, and\nviewpoint information, while each tree is described with structural parameters\nsuch as diameter, crown width, leaf area, and total volume. We designed and\nconducted extensive experiments to evaluate the potential of Boreal3D in\nadvancing fine-grained 3D forest structure analysis in real-world applications.\nThe results demonstrate that with certain strategies, models pre-trained on\nsynthetic data can significantly improve performance when applied to real\nforest datasets. Especially, the findings reveal that fine-tuning with only 20%\nof real-world data enables the model to achieve performance comparable to\nmodels trained exclusively on entire real-world data, highlighting the value\nand potential of our proposed framework. The Boreal3D dataset, and more\nbroadly, the synthetic data augmentation framework, is poised to become a\ncritical resource for advancing research in large-scale 3D forest scene\nunderstanding and structural parameter estimation.\n","authors":["Jing Liu","Duanchu Wang","Haoran Gong","Chongyu Wang","Jihua Zhu","Di Wang"],"pdf_url":"https://arxiv.org/pdf/2501.03637v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03631v1","updated":"2025-01-07T09:00:36Z","published":"2025-01-07T09:00:36Z","title":"Exploring Optimal Latent Trajetory for Zero-shot Image Editing","summary":" Editability and fidelity are two essential demands for text-driven image\nediting, which expects that the editing area should align with the target\nprompt and the rest should remain unchanged separately. The current\ncutting-edge editing methods usually obey an \"inversion-then-editing\" pipeline,\nwhere the source image is first inverted to an approximate Gaussian noise\n${z}_T$, based on which a sampling process is conducted using the target\nprompt. Nevertheless, we argue that it is not a good choice to use a\nnear-Gaussian noise as a pivot for further editing since it almost lost all\nstructure fidelity. 
We verify this by a pilot experiment, discovering that some\nintermediate-inverted latents can achieve a better trade-off between\neditability and fidelity than the fully-inverted ${z}_T$. Based on this, we\npropose a novel editing paradigm dubbed ZZEdit, which gently strengthens the\ntarget guidance on a sufficient-for-editing yet structure-preserving latent.\nSpecifically, we locate such an editing pivot by searching the first point on\nthe inversion trajectory which has larger response levels toward the target\nprompt than the source one. Then, we propose a ZigZag process to perform mild\ntarget guiding on this pivot, which fulfills denoising and inversion\niteratively, approaching the target while still holding fidelity. Afterwards,\nto achieve the same number of inversion and denoising steps, we perform a pure\nsampling process under the target prompt. Extensive experiments highlight the\neffectiveness of our ZZEdit in diverse image editing scenarios compared with\nthe \"inversion-then-editing\" pipeline.\n","authors":["Maomao Li","Yu Li","Yunfei Liu","Dong Xu"],"pdf_url":"https://arxiv.org/pdf/2501.03631v1.pdf","comment":"16 pages"},{"id":"http://arxiv.org/abs/2501.03630v1","updated":"2025-01-07T09:00:07Z","published":"2025-01-07T09:00:07Z","title":"MC-VTON: Minimal Control Virtual Try-On Diffusion Transformer","summary":" Virtual try-on methods based on diffusion models achieve realistic try-on\neffects. They use an extra reference network or an additional image encoder to\nprocess multiple conditional image inputs, which results in high training\ncosts. Besides, they require more than 25 inference steps, bringing a long\ninference time. In this work, with the development of diffusion transformer\n(DiT), we rethink the necessity of reference network or image encoder, then\npropose MC-VTON, enabling DiT to integrate minimal conditional try-on inputs by\nutilizing its intrinsic backbone. 
Compared to existing methods, the superiority\nof MC-VTON is demonstrated in four aspects: (1) Superior detail fidelity. Our\nDiT-based MC-VTON exhibits superior fidelity in preserving fine-grained\ndetails. (2) Simplified network and inputs. We remove any extra reference\nnetwork or image encoder. We also remove unnecessary conditions like the long\nprompt, pose estimation, human parsing, and depth map. We require only the\nmasked person image and the garment image. (3) Parameter-efficient training. To\nprocess the try-on task, we fine-tune the FLUX.1-dev with only 39.7M additional\nparameters (0.33% of the backbone parameters). (4) Fewer inference steps. We apply\ndistillation diffusion on MC-VTON and only need 8 steps to generate a realistic\ntry-on image, with only 86.8M additional parameters (0.72% of the backbone\nparameters). Experiments show that MC-VTON achieves superior qualitative and\nquantitative results with fewer condition inputs, fewer inference steps, and\nfewer trainable parameters than baseline methods.\n","authors":["Junsheng Luan","Guangyuan Li","Lei Zhao","Wei Xing"],"pdf_url":"https://arxiv.org/pdf/2501.03630v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03629v1","updated":"2025-01-07T08:59:20Z","published":"2025-01-07T08:59:20Z","title":"CFFormer: Cross CNN-Transformer Channel Attention and Spatial Feature\n Fusion for Improved Segmentation of Low Quality Medical Images","summary":" Hybrid CNN-Transformer models are designed to combine the advantages of\nConvolutional Neural Networks (CNNs) and Transformers to efficiently model both\nlocal information and long-range dependencies. However, most research tends to\nfocus on integrating the spatial features of CNNs and Transformers, while\noverlooking the critical importance of channel features. 
This is particularly\nsignificant for model performance in low-quality medical image segmentation.\nEffective channel feature extraction can significantly enhance the model's\nability to capture contextual information and improve its representation\ncapabilities. To address this issue, we propose a hybrid CNN-Transformer model,\nCFFormer, and introduce two modules: the Cross Feature Channel Attention (CFCA)\nmodule and the X-Spatial Feature Fusion (XFF) module. The model incorporates\ndual encoders, with the CNN encoder focusing on capturing local features and\nthe Transformer encoder modeling global features. The CFCA module filters and\nfacilitates interactions between the channel features from the two encoders,\nwhile the XFF module effectively reduces the significant semantic information\ndifferences in spatial features, enabling a smooth and cohesive spatial feature\nfusion. We evaluate our model across eight datasets covering five modalities to\ntest its generalization capability. Experimental results demonstrate that our\nmodel outperforms current state-of-the-art (SOTA) methods, with particularly\nsuperior performance on datasets characterized by blurry boundaries and low\ncontrast.\n","authors":["Jiaxuan Li","Qing Xu","Xiangjian He","Ziyu Liu","Daokun Zhang","Ruili Wang","Rong Qu","Guoping Qiu"],"pdf_url":"https://arxiv.org/pdf/2501.03629v1.pdf","comment":"The article consists of 15 pages, including 10 figures and 7 tables.\n The code will be made open-source once the article is accepted by the journal"},{"id":"http://arxiv.org/abs/2208.06538v2","updated":"2025-01-07T08:52:30Z","published":"2022-08-13T01:20:39Z","title":"Transferable Adversarial Examples with Bayes Approach","summary":" The vulnerability of deep neural networks (DNNs) to black-box adversarial\nattacks is one of the most heated topics in trustworthy AI. 
In such attacks,\nthe attackers operate without any insider knowledge of the model, making the\ncross-model transferability of adversarial examples critical. Despite the\npotential for adversarial examples to be effective across various models, it\nhas been observed that adversarial examples that are specifically crafted for a\nspecific model often exhibit poor transferability. In this paper, we explore\nthe transferability of adversarial examples through the lens of the Bayesian approach.\nSpecifically, we leverage the Bayesian approach to probe the transferability and\nthen study what constitutes a transferability-promoting prior. Following this,\nwe design two concrete transferability-promoting priors, along with an adaptive\ndynamic weighting strategy for instances sampled from these priors. Employing\nthese techniques, we present BayAtk. Extensive experiments illustrate the\nsignificant effectiveness of BayAtk in crafting more transferable adversarial\nexamples against both undefended and defended black-box models compared to\nexisting state-of-the-art attacks.\n","authors":["Mingyuan Fan","Cen Chen","Wenmeng Zhou","Yinggui Wang"],"pdf_url":"https://arxiv.org/pdf/2208.06538v2.pdf","comment":"Accepted in AsiaCCS'25"},{"id":"http://arxiv.org/abs/2501.02487v2","updated":"2025-01-07T08:47:34Z","published":"2025-01-05T09:40:58Z","title":"ACE++: Instruction-Based Image Creation and Editing via Context-Aware\n Content Filling","summary":" We report ACE++, an instruction-based diffusion framework that tackles\nvarious image generation and editing tasks. Inspired by the input format for\nthe inpainting task proposed by FLUX.1-Fill-dev, we improve the Long-context\nCondition Unit (LCU) introduced in ACE and extend this input paradigm to any\nediting and generation tasks. To take full advantage of image generative\npriors, we develop a two-stage training scheme to minimize the efforts of\nfinetuning powerful text-to-image diffusion models like FLUX.1-dev. 
In the\nfirst stage, we pre-train the model using task data with the 0-ref tasks from\nthe text-to-image model. There are many models in the community based on the\npost-training of text-to-image foundational models that meet this training\nparadigm of the first stage. For example, FLUX.1-Fill-dev deals primarily with\npainting tasks and can be used as an initialization to accelerate the training\nprocess. In the second stage, we finetune the above model to support the\ngeneral instructions using all tasks defined in ACE. To promote the widespread\napplication of ACE++ in different scenarios, we provide a comprehensive set of\nmodels that cover both full finetuning and lightweight finetuning, while\nconsidering general applicability and applicability in vertical scenarios. The\nqualitative analysis showcases the superiority of ACE++ in terms of generating\nimage quality and prompt following ability. Code and models will be available\non the project page: https://ali-vilab.github.io/ACE_plus_page/.\n","authors":["Chaojie Mao","Jingfeng Zhang","Yulin Pan","Zeyinzi Jiang","Zhen Han","Yu Liu","Jingren Zhou"],"pdf_url":"https://arxiv.org/pdf/2501.02487v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03619v1","updated":"2025-01-07T08:36:46Z","published":"2025-01-07T08:36:46Z","title":"Deep Learning-based Compression Detection for explainable Face Image\n Quality Assessment","summary":" The assessment of face image quality is crucial to ensure reliable face\nrecognition. In order to provide data subjects and operators with explainable\nand actionable feedback regarding captured face images, relevant quality\ncomponents have to be measured. Quality components that are known to negatively\nimpact the utility of face images include JPEG and JPEG 2000 compression\nartefacts, among others. Compression can result in a loss of important image\ndetails which may impair the recognition performance. 
In this work, deep neural\nnetworks are trained to detect the compression artefacts in face images. For\nthis purpose, artefact-free facial images are compressed with the JPEG and JPEG\n2000 compression algorithms. Subsequently, the PSNR and SSIM metrics are\nemployed to obtain training labels based on which neural networks are trained\nusing a single network to detect JPEG and JPEG 2000 artefacts, respectively.\nThe evaluation of the proposed method shows promising results: in terms of\ndetection accuracy, error rates of 2-3% are obtained for utilizing PSNR labels\nduring training. In addition, we show that error rates of different open-source\nand commercial face recognition systems can be significantly reduced by\ndiscarding face images exhibiting severe compression artefacts. To minimize\nresource consumption, EfficientNetV2 serves as the basis for the presented\nalgorithm, which is available as part of the OFIQ software.\n","authors":["Laurin Jonientz","Johannes Merkle","Christian Rathgeb","Benjamin Tams","Georg Merz"],"pdf_url":"https://arxiv.org/pdf/2501.03619v1.pdf","comment":"2nd Workshop on Fairness in Biometric Systems (FAIRBIO) at\n International Conference on Pattern Recognition (ICPR) 2024"},{"id":"http://arxiv.org/abs/2501.03616v1","updated":"2025-01-07T08:32:48Z","published":"2025-01-07T08:32:48Z","title":"BTMTrack: Robust RGB-T Tracking via Dual-template Bridging and\n Temporal-Modal Candidate Elimination","summary":" RGB-T tracking leverages the complementary strengths of RGB and thermal\ninfrared (TIR) modalities to address challenging scenarios such as low\nillumination and adverse weather. However, existing methods often fail to\neffectively integrate temporal information and perform efficient cross-modal\ninteractions, which constrain their adaptability to dynamic targets. In this\npaper, we propose BTMTrack, a novel framework for RGB-T tracking. 
The core of\nour approach lies in the dual-template backbone network and the Temporal-Modal\nCandidate Elimination (TMCE) strategy. The dual-template backbone effectively\nintegrates temporal information, while the TMCE strategy focuses the model on\ntarget-relevant tokens by evaluating temporal and modal correlations, reducing\ncomputational overhead and avoiding irrelevant background noise. Building upon\nthis foundation, we propose the Temporal Dual Template Bridging (TDTB) module,\nwhich facilitates precise cross-modal fusion through dynamically filtered\ntokens. This approach further strengthens the interaction between templates and\nthe search region. Extensive experiments conducted on three benchmark datasets\ndemonstrate the effectiveness of BTMTrack. Our method achieves state-of-the-art\nperformance, with a 72.3% precision rate on the LasHeR test set and competitive\nresults on RGBT210 and RGBT234 datasets.\n","authors":["Zhongxuan Zhang","Bi Zeng","Xinyu Ni","Yimin Du"],"pdf_url":"https://arxiv.org/pdf/2501.03616v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04783v2","updated":"2025-01-07T08:23:43Z","published":"2024-12-06T05:20:08Z","title":"KNN-MMD: Cross Domain Wireless Sensing via Local Distribution Alignment","summary":" Wireless sensing has recently found widespread applications in diverse\nenvironments, including homes, offices, and public spaces. By analyzing\npatterns in channel state information (CSI), it is possible to infer human\nactions for tasks such as person identification, gesture recognition, and fall\ndetection. However, CSI is highly sensitive to environmental changes, where\neven minor alterations can significantly distort the CSI patterns. This\nsensitivity often leads to performance degradation or outright failure when\napplying wireless sensing models trained in one environment to another. 
To\naddress this challenge, Domain Alignment (DAL) has been widely adopted for\ncross-domain classification tasks, as it focuses on aligning the global\ndistributions of the source and target domains in feature space. Despite its\npopularity, DAL often neglects inter-category relationships, which can lead to\nmisalignment between categories across domains, even when global alignment is\nachieved. To overcome these limitations, we propose K-Nearest Neighbors Maximum\nMean Discrepancy (KNN-MMD), a novel few-shot method for cross-domain wireless\nsensing. Our approach begins by constructing a help set using KNN from the\ntarget domain, enabling local alignment between the source and target domains\nwithin each category using MMD. Additionally, we address a key instability\nissue commonly observed in cross-domain methods, where model performance\nfluctuates sharply between epochs. Further, most existing methods struggle to\ndetermine an optimal stopping point during training due to the absence of\nlabeled data from the target domain. Our method resolves this by excluding the\nsupport set from the target domain during training and employing it as a\nvalidation set to determine the stopping criterion.\n","authors":["Zijian Zhao","Zhijie Cai","Tingwei Chen","Xiaoyang Li","Hang Li","Qimei Chen","Guangxu Zhu"],"pdf_url":"https://arxiv.org/pdf/2412.04783v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03606v1","updated":"2025-01-07T08:14:53Z","published":"2025-01-07T08:14:53Z","title":"VTAO-BiManip: Masked Visual-Tactile-Action Pre-training with Object\n Understanding for Bimanual Dexterous Manipulation","summary":" Bimanual dexterous manipulation remains a significant challenge in robotics\ndue to the high DoFs of each hand and their coordination. 
Existing single-hand\nmanipulation techniques often leverage human demonstrations to guide RL methods\nbut fail to generalize to complex bimanual tasks involving multiple sub-skills.\nIn this paper, we introduce VTAO-BiManip, a novel framework that combines\nvisual-tactile-action pretraining with object understanding to facilitate\ncurriculum RL to enable human-like bimanual manipulation. We improve prior\nlearning by incorporating hand motion data, providing more effective guidance\nfor dual-hand coordination than binary tactile feedback. Our pretraining model\npredicts future actions as well as object pose and size using masked multimodal\ninputs, facilitating cross-modal regularization. To address the multi-skill\nlearning challenge, we introduce a two-stage curriculum RL approach to\nstabilize training. We evaluate our method on a bottle-cap unscrewing task,\ndemonstrating its effectiveness in both simulated and real-world environments.\nOur approach achieves a success rate that surpasses existing visual-tactile\npretraining methods by over 20%.\n","authors":["Zhengnan Sun","Zhaotai Shi","Jiayin Chen","Qingtao Liu","Yu Cui","Qi Ye","Jiming Chen"],"pdf_url":"https://arxiv.org/pdf/2501.03606v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03605v1","updated":"2025-01-07T08:06:35Z","published":"2025-01-07T08:06:35Z","title":"ConcealGS: Concealing Invisible Copyright Information in 3D Gaussian\n Splatting","summary":" With the rapid development of 3D reconstruction technology, the widespread\ndistribution of 3D data has become a future trend. While traditional visual\ndata (such as images and videos) and NeRF-based formats already have mature\ntechniques for copyright protection, steganographic techniques for the emerging\n3D Gaussian Splatting (3D-GS) format have yet to be fully explored. To address\nthis, we propose ConcealGS, an innovative method for embedding implicit\ninformation into 3D-GS. 
By introducing the knowledge distillation and gradient\noptimization strategy based on 3D-GS, ConcealGS overcomes the limitations of\nNeRF-based models and enhances the robustness of implicit information and the\nquality of 3D reconstruction. We evaluate ConcealGS in various potential\napplication scenarios, and experimental results have demonstrated that\nConcealGS not only successfully recovers implicit information but also has\nalmost no impact on rendering quality, providing a new approach for embedding\ninvisible and recoverable information into 3D models in the future.\n","authors":["Yifeng Yang","Hengyu Liu","Chenxin Li","Yining Sun","Wuyang Li","Yifan Liu","Yiyang Lin","Yixuan Yuan","Nanyang Ye"],"pdf_url":"https://arxiv.org/pdf/2501.03605v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.02807v2","updated":"2025-01-07T07:47:22Z","published":"2025-01-06T07:00:22Z","title":"AE-NeRF: Augmenting Event-Based Neural Radiance Fields for Non-ideal\n Conditions and Larger Scene","summary":" Compared to frame-based methods, computational neuromorphic imaging using\nevent cameras offers significant advantages, such as minimal motion blur,\nenhanced temporal resolution, and high dynamic range. The multi-view\nconsistency of Neural Radiance Fields combined with the unique benefits of\nevent cameras, has spurred recent research into reconstructing NeRF from data\ncaptured by moving event cameras. While showing impressive performance,\nexisting methods rely on ideal conditions with the availability of uniform and\nhigh-quality event sequences and accurate camera poses, and mainly focus on the\nobject level reconstruction, thus limiting their practical applications. In\nthis work, we propose AE-NeRF to address the challenges of learning event-based\nNeRF from non-ideal conditions, including non-uniform event sequences, noisy\nposes, and various scales of scenes. 
Our method exploits the density of event\nstreams and jointly learns a pose correction module with an event-based NeRF\n(e-NeRF) framework for robust 3D reconstruction from inaccurate camera poses.\nTo generalize to larger scenes, we propose hierarchical event distillation with\na proposal e-NeRF network and a vanilla e-NeRF network to resample and refine\nthe reconstruction process. We further propose an event reconstruction loss and\na temporal loss to improve the view consistency of the reconstructed scene. We\nestablished a comprehensive benchmark that includes large-scale scenes to\nsimulate practical non-ideal conditions, incorporating both synthetic and\nchallenging real-world event datasets. The experimental results show that our\nmethod achieves a new state-of-the-art in event-based 3D reconstruction.\n","authors":["Chaoran Feng","Wangbo Yu","Xinhua Cheng","Zhenyu Tang","Junwu Zhang","Li Yuan","Yonghong Tian"],"pdf_url":"https://arxiv.org/pdf/2501.02807v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03592v1","updated":"2025-01-07T07:45:21Z","published":"2025-01-07T07:45:21Z","title":"A Value Mapping Virtual Staining Framework for Large-scale Histological\n Imaging","summary":" The emergence of virtual staining technology provides a rapid and efficient\nalternative for researchers in tissue pathology. It enables the utilization of\nunlabeled microscopic samples to generate virtual replicas of chemically\nstained histological slices, or facilitate the transformation of one staining\ntype into another. The remarkable performance of generative networks, such as\nCycleGAN, offers an unsupervised learning approach for virtual coloring,\novercoming the limitations of high-quality paired data required in supervised\nlearning. Nevertheless, large-scale color transformation necessitates\nprocessing large field-of-view images in patches, often resulting in\nsignificant boundary inconsistency and artifacts. 
Additionally, the\ntransformation between different colorized modalities typically requires further\neffort to modify loss functions and tune hyperparameters for independent\ntraining of networks. In this study, we introduce a general virtual staining\nframework that is adaptable to various conditions. We propose a loss function\nbased on the value mapping constraint to ensure the accuracy of virtual\ncoloring between different pathological modalities, termed the Value Mapping\nGenerative Adversarial Network (VM-GAN). Meanwhile, we present a\nconfidence-based tiling method to address the challenge of boundary\ninconsistency arising from patch-wise processing. Experimental results on\ndiverse data with varying staining protocols demonstrate that our method\nachieves superior quantitative indicators and improved visual perception.\n","authors":["Junjia Wang","Bo Xiong","You Zhou","Xun Cao","Zhan Ma"],"pdf_url":"https://arxiv.org/pdf/2501.03592v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.15507v2","updated":"2025-01-07T07:35:10Z","published":"2024-07-22T09:44:35Z","title":"SpotDiffusion: A Fast Approach For Seamless Panorama Generation Over\n Time","summary":" Generating high-resolution images with generative models has recently been\nmade widely accessible by leveraging diffusion models pre-trained on\nlarge-scale datasets. Various techniques, such as MultiDiffusion and\nSyncDiffusion, have further pushed image generation beyond training\nresolutions, i.e., from square images to panorama, by merging multiple\noverlapping diffusion paths or employing gradient descent to maintain\nperceptual coherence. However, these methods suffer from significant\ncomputational inefficiencies due to generating and averaging numerous\npredictions, which is required in practice to produce high-quality and seamless\nimages. 
This work addresses this limitation and presents a novel approach that\neliminates the need to generate and average numerous overlapping denoising\npredictions. Our method shifts non-overlapping denoising windows over time,\nensuring that seams in one timestep are corrected in the next. This results in\ncoherent, high-resolution images with fewer overall steps. We demonstrate the\neffectiveness of our approach through qualitative and quantitative evaluations,\ncomparing it with MultiDiffusion, SyncDiffusion, and StitchDiffusion. Our\nmethod offers several key benefits, including improved computational efficiency\nand faster inference times while producing comparable or better image quality.\nLink to code https://github.com/stanifrolov/spotdiffusion\n","authors":["Stanislav Frolov","Brian B. Moser","Andreas Dengel"],"pdf_url":"https://arxiv.org/pdf/2407.15507v2.pdf","comment":"Project page: https://spotdiffusion.github.io/"},{"id":"http://arxiv.org/abs/2411.15778v3","updated":"2025-01-07T07:31:00Z","published":"2024-11-24T10:58:48Z","title":"Enhancing the automatic segmentation and analysis of 3D liver\n vasculature models","summary":" Surgical assessment of liver cancer patients requires identification of the\nvessel trees from medical images. Specifically, the venous trees - the portal\n(perfusing) and the hepatic (draining) trees - are important for understanding\nthe liver anatomy and disease state, and for surgery planning. This\nresearch aims to improve the 3D segmentation, skeletonization, and subsequent\nanalysis of vessel trees, by creating an automatic pipeline based on deep\nlearning and image processing techniques.\n The first part of this work explores the impact of differentiable\nskeletonization methods such as ClDice and morphological skeletonization loss,\non the overall liver vessel segmentation performance. 
To this end, it studies\nhow to improve vessel tree connectivity.\n The second part of this study converts a single-class vessel segmentation\ninto a multi-class one, separating the two venous trees. It builds on the\nprevious two-class vessel segmentation model, whose vessel tree outputs might\nbe entangled, and on connected components and skeleton analyses of the trees.\n After providing sub-labeling of the specific anatomical branches of each\nvenous tree, these algorithms also enable a morphometric analysis of the vessel\ntrees by extracting various geometrical markers.\n In conclusion, we propose a method that successfully improves current\nskeletonization methods, for extensive vascular trees that contain vessels of\ndifferent calibers. The separation algorithm creates a clean multi-class\nsegmentation of the vessels, validated by surgeons to provide low error. A new,\npublicly shared high-quality liver vessel dataset of 77 cases is thus created.\nFinally, a method to annotate vessel trees according to anatomy is provided,\nenabling a unique liver vessel morphometry analysis.\n","authors":["Yassine Machta","Omar Ali","Kevin Hakkakian","Ana Vlasceanu","Amaury Facque","Nicolas Golse","Irene Vignon-Clementel"],"pdf_url":"https://arxiv.org/pdf/2411.15778v3.pdf","comment":"Paper presented at MICCAI 2024 Workshop: ADSMI. This work was done in\n the context of an internship at Simbiotx, Inria"},{"id":"http://arxiv.org/abs/2501.03580v1","updated":"2025-01-07T07:08:46Z","published":"2025-01-07T07:08:46Z","title":"BASIC: Semi-supervised Multi-organ Segmentation with Balanced Subclass\n Regularization and Semantic-conflict Penalty","summary":" Semi-supervised learning (SSL) has shown notable potential in relieving the\nheavy demand of dense prediction tasks on large-scale well-annotated datasets,\nespecially for the challenging multi-organ segmentation (MoS). 
However, the\nprevailing class-imbalance problem in MoS caused by the substantial variations\nin organ size exacerbates the learning difficulty of the SSL network. To\naddress this issue, in this paper, we propose an innovative semi-supervised\nnetwork with BAlanced Subclass regularIzation and semantic-Conflict penalty\nmechanism (BASIC) to effectively learn the unbiased knowledge for\nsemi-supervised MoS. Concretely, we construct a novel auxiliary subclass\nsegmentation (SCS) task based on previously generated balanced subclasses, thus\ndeeply mining the unbiased information for the main MoS task in the\nfashion of multi-task learning. Additionally, based on a mean teacher\nframework, we elaborately design a balanced subclass regularization to utilize\nthe teacher predictions of the SCS task to supervise the student predictions of the MoS\ntask, thus effectively transferring unbiased knowledge to the MoS subnetwork\nand alleviating the influence of the class-imbalance problem. Considering the\nsimilar semantic information inside the subclasses and their corresponding\noriginal classes (i.e., parent classes), we devise a semantic-conflict penalty\nmechanism to assign heavier penalties to conflicting SCS predictions with\nwrong parent classes and provide a more accurate constraint to the MoS\npredictions. 
Extensive experiments conducted on two publicly available\ndatasets, i.e., the WORD dataset and the MICCAI FLARE 2022 dataset, have\nverified the superior performance of our proposed BASIC compared to other\nstate-of-the-art methods.\n","authors":["Zhenghao Feng","Lu Wen","Yuanyuan Xu","Binyu Yan","Xi Wu","Jiliu Zhou","Yan Wang"],"pdf_url":"https://arxiv.org/pdf/2501.03580v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.03173v3","updated":"2025-01-07T07:05:05Z","published":"2024-02-05T16:41:02Z","title":"MULTI: Multimodal Understanding Leaderboard with Text and Images","summary":" The rapid development of multimodal large language models (MLLMs) raises the\nquestion of how they compare to human performance. While existing datasets\noften feature synthetic or overly simplistic tasks, some models have already\nsurpassed human expert baselines. In this paper, we present MULTI, a Chinese\nmultimodal dataset derived from authentic examination questions. Comprising\nover 18,000 carefully selected and refined questions, MULTI evaluates models\nusing real-world examination standards, encompassing image-text comprehension,\ncomplex reasoning, and knowledge recall. Additionally, we introduce\nMULTI-Elite, a hard subset of 500 selected questions, and MULTI-Extend, with more\nthan 4,500 external knowledge context pieces for testing in-context learning\ncapabilities. Our evaluation highlights substantial room for MLLM advancement,\nwith Qwen2-VL-72B achieving a 76.9% accuracy on MULTI and 53.1% on MULTI-Elite,\nleading the 25 evaluated models, compared to human expert baselines of 86.1% and\n73.1%. 
MULTI not only serves as a robust evaluation platform but also paves the\nway for the development of expert-level AI.\n","authors":["Zichen Zhu","Yang Xu","Lu Chen","Jingkai Yang","Yichuan Ma","Yiming Sun","Hailin Wen","Jiaqi Liu","Jinyu Cai","Yingzi Ma","Situo Zhang","Zihan Zhao","Liangtai Sun","Kai Yu"],"pdf_url":"https://arxiv.org/pdf/2402.03173v3.pdf","comment":"24 pages, 19 figures, 10 tables. Details and access are available at:\n https://OpenDFM.github.io/MULTI-Benchmark/"},{"id":"http://arxiv.org/abs/2501.03575v1","updated":"2025-01-07T06:55:50Z","published":"2025-01-07T06:55:50Z","title":"Cosmos World Foundation Model Platform for Physical AI","summary":" Physical AI needs to be trained digitally first. It needs a digital twin of\nitself, the policy model, and a digital twin of the world, the world model. In\nthis paper, we present the Cosmos World Foundation Model Platform to help\ndevelopers build customized world models for their Physical AI setups. We\nposition a world foundation model as a general-purpose world model that can be\nfine-tuned into customized world models for downstream applications. Our\nplatform covers a video curation pipeline, pre-trained world foundation models,\nexamples of post-training of pre-trained world foundation models, and video\ntokenizers. 
To help Physical AI builders solve the most critical problems of\nour society, we make our platform open-source and our models open-weight with\npermissive licenses available via https://github.com/NVIDIA/Cosmos.\n","authors":[" NVIDIA"," :","Niket Agarwal","Arslan Ali","Maciej Bala","Yogesh Balaji","Erik Barker","Tiffany Cai","Prithvijit Chattopadhyay","Yongxin Chen","Yin Cui","Yifan Ding","Daniel Dworakowski","Jiaojiao Fan","Michele Fenzi","Francesco Ferroni","Sanja Fidler","Dieter Fox","Songwei Ge","Yunhao Ge","Jinwei Gu","Siddharth Gururani","Ethan He","Jiahui Huang","Jacob Huffman","Pooya Jannaty","Jingyi Jin","Seung Wook Kim","Gergely Klár","Grace Lam","Shiyi Lan","Laura Leal-Taixe","Anqi Li","Zhaoshuo Li","Chen-Hsuan Lin","Tsung-Yi Lin","Huan Ling","Ming-Yu Liu","Xian Liu","Alice Luo","Qianli Ma","Hanzi Mao","Kaichun Mo","Arsalan Mousavian","Seungjun Nah","Sriharsha Niverty","David Page","Despoina Paschalidou","Zeeshan Patel","Lindsey Pavao","Morteza Ramezanali","Fitsum Reda","Xiaowei Ren","Vasanth Rao Naik Sabavat","Ed Schmerling","Stella Shi","Bartosz Stefaniak","Shitao Tang","Lyne Tchapmi","Przemek Tredak","Wei-Cheng Tseng","Jibin Varghese","Hao Wang","Haoxiang Wang","Heng Wang","Ting-Chun Wang","Fangyin Wei","Xinyue Wei","Jay Zhangjie Wu","Jiashu Xu","Wei Yang","Lin Yen-Chen","Xiaohui Zeng","Yu Zeng","Jing Zhang","Qinsheng Zhang","Yuxuan Zhang","Qingqing Zhao","Artur Zolkowski"],"pdf_url":"https://arxiv.org/pdf/2501.03575v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.01595v2","updated":"2025-01-07T06:55:35Z","published":"2025-01-03T01:54:16Z","title":"Adaptive Homophily Clustering: Structure Homophily Graph Learning with\n Adaptive Filter for Hyperspectral Image","summary":" Hyperspectral image (HSI) clustering has been a fundamental but challenging\ntask with zero training labels. 
Currently, some deep graph clustering methods\nhave been successfully explored for HSI due to their outstanding performance in\neffective spatial structural information encoding. Nevertheless, insufficient\nstructural information utilization, poor feature representation ability, and weak\ngraph update capability limit their performance. Thus, in this paper, a\nhomophily structure graph learning with an adaptive filter clustering method\n(AHSGC) for HSI is proposed. Specifically, homogeneous region generation is\nfirst developed for HSI processing and constructing the original graph.\nAfterward, an adaptive filter graph encoder is designed to adaptively capture\nthe high and low frequency features on the graph for subsequent processing.\nThen, a graph embedding clustering self-training decoder is developed with KL\nDivergence, with which the pseudo-label is generated for network training.\nMeanwhile, homophily-enhanced structure learning is introduced to update the\ngraph according to the clustering task, in which the orient correlation\nestimation is adopted to estimate the node connection, and graph edge\nsparsification is designed to adjust the edges in the graph dynamically.\nFinally, a joint network optimization is introduced to achieve network\nself-training and update the graph. K-means is adopted to express the\nlatent features. Extensive experiments and repeated comparative analysis have\nverified that our AHSGC achieves high clustering accuracy, low computational\ncomplexity, and strong robustness. 
The source code will be available at\nhttps://github.com/DY-HYX.\n","authors":["Yao Ding","Weijie Kang","Aitao Yang","Zhili Zhang","Junyang Zhao","Jie Feng","Danfeng Hong","Qinhe Zheng"],"pdf_url":"https://arxiv.org/pdf/2501.01595v2.pdf","comment":"14 pages, 8 figures"},{"id":"http://arxiv.org/abs/2403.10089v4","updated":"2025-01-07T06:45:58Z","published":"2024-03-15T08:05:16Z","title":"Approximation and bounding techniques for the Fisher-Rao distances\n between parametric statistical models","summary":" The Fisher-Rao distance between two probability distributions of a\nstatistical model is defined as the Riemannian geodesic distance induced by the\nFisher information metric. In order to calculate the Fisher-Rao distance in\nclosed-form, we need (1) to elicit a formula for the Fisher-Rao geodesics, and\n(2) to integrate the Fisher length element along those geodesics. We consider\nseveral numerically robust approximation and bounding techniques for the\nFisher-Rao distances: First, we report generic upper bounds on Fisher-Rao\ndistances based on closed-form 1D Fisher-Rao distances of submodels. Second, we\ndescribe several generic approximation schemes depending on whether the\nFisher-Rao geodesics or pregeodesics are available in closed-form or not. In\nparticular, we obtain a generic method to guarantee an arbitrarily small\nadditive error on the approximation provided that Fisher-Rao pregeodesics and\ntight lower and upper bounds are available. Third, we consider the case of\nFisher metrics being Hessian metrics, and report generic tight upper bounds on\nthe Fisher-Rao distances using techniques of information geometry.\nUniparametric and biparametric statistical models always have Fisher Hessian\nmetrics, and in general a simple test allows one to check whether the Fisher\ninformation matrix yields a Hessian metric or not. Fourth, we consider\nelliptical distribution families and show how to apply the above techniques to\nthese models. 
We also propose two new distances based either on the Fisher-Rao\nlengths of curves serving as proxies of Fisher-Rao geodesics, or based on the\nBirkhoff/Hilbert projective cone distance. Last, we consider an alternative\ngroup-theoretic approach for statistical transformation models based on the\nnotion of maximal invariant which yields insights on the structures of the\nFisher-Rao distance formula which may be used fruitfully in applications.\n","authors":["Frank Nielsen"],"pdf_url":"https://arxiv.org/pdf/2403.10089v4.pdf","comment":"48 pages"},{"id":"http://arxiv.org/abs/2501.03567v1","updated":"2025-01-07T06:35:34Z","published":"2025-01-07T06:35:34Z","title":"Evaluating Image Caption via Cycle-consistent Text-to-Image Generation","summary":" Evaluating image captions typically relies on reference captions, which are\ncostly to obtain and exhibit significant diversity and subjectivity. While\nreference-free evaluation metrics have been proposed, most focus on cross-modal\nevaluation between captions and images. Recent research has revealed that the\nmodality gap generally exists in the representation of contrastive\nlearning-based multi-modal systems, undermining the reliability of\ncross-modality metrics like CLIPScore. In this paper, we propose CAMScore, a\ncyclic reference-free automatic evaluation metric for image captioning models.\nTo circumvent the aforementioned modality gap, CAMScore utilizes a\ntext-to-image model to generate images from captions and subsequently evaluates\nthese generated images against the original images. Furthermore, to provide\nfine-grained information for a more comprehensive evaluation, we design a\nthree-level evaluation framework for CAMScore that encompasses pixel-level,\nsemantic-level, and objective-level perspectives. 
Extensive experimental results\nacross multiple benchmark datasets show that CAMScore achieves a superior\ncorrelation with human judgments compared to existing reference-based and\nreference-free metrics, demonstrating the effectiveness of the framework.\n","authors":["Tianyu Cui","Jinbin Bai","Guohua Wang","Qingguo Chen","Zhao Xu","Weihua Luo","Kaifu Zhang","Ye Shi"],"pdf_url":"https://arxiv.org/pdf/2501.03567v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2212.08983v2","updated":"2025-01-07T06:32:53Z","published":"2022-12-18T01:07:20Z","title":"Adaptive deep learning framework for robust unsupervised underwater\n image enhancement","summary":" One of the main challenges in deep learning-based underwater image\nenhancement is the limited availability of high-quality training data.\nUnderwater images are difficult to capture and are often of poor quality due to\nthe distortion and loss of colour and contrast in water. This makes it\ndifficult to train supervised deep learning models on large and diverse\ndatasets, which can limit the model's performance. In this paper, we explore an\nalternative approach to supervised underwater image enhancement. Specifically,\nwe propose a novel unsupervised underwater image enhancement framework that\nemploys a conditional variational autoencoder (cVAE) to train a deep learning\nmodel with probabilistic adaptive instance normalization (PAdaIN) and\nstatistically guided multi-colour space stretch that produces realistic\nunderwater images. The resulting framework is composed of a U-Net as a feature\nextractor and a PAdaIN to encode the uncertainty, which we call UDnet. To\nimprove the visual quality of the images generated by UDnet, we use a\nstatistically guided multi-colour space stretch module that ensures visual\nconsistency with the input image and provides an alternative to training using\na ground truth image. 
The proposed model does not need manual human annotation,\ncan learn with a limited amount of data, and achieves state-of-the-art\nresults on underwater images. We evaluated our proposed framework on eight\npublicly-available datasets. The results show that our proposed framework\nyields competitive performance compared to other state-of-the-art approaches in\nquantitative as well as qualitative metrics. Code available at\nhttps://github.com/alzayats/UDnet .\n","authors":["Alzayat Saleh","Marcus Sheaves","Dean Jerry","Mostafa Rahimi Azghadi"],"pdf_url":"https://arxiv.org/pdf/2212.08983v2.pdf","comment":"25 pages, 7 figures, 6 tables, accepted for publication in Expert\n Systems with Applications"},{"id":"http://arxiv.org/abs/2501.03565v1","updated":"2025-01-07T06:30:52Z","published":"2025-01-07T06:30:52Z","title":"Bridged Semantic Alignment for Zero-shot 3D Medical Image Diagnosis","summary":" 3D medical images such as Computed tomography (CT) are widely used in\nclinical practice, offering a great potential for automatic diagnosis.\nSupervised learning-based approaches have achieved significant progress but\nrely heavily on extensive manual annotations, limited by the availability of\ntraining data and the diversity of abnormality types. Vision-language alignment\n(VLA) offers a promising alternative by enabling zero-shot learning without\nadditional annotations. However, we empirically discover that the visual and\ntextual embeddings after the alignment efforts of existing VLA methods form\ntwo well-separated clusters, presenting a wide gap to be bridged. To bridge\nthis gap, we propose a Bridged Semantic Alignment (BrgSA) framework. First, we\nutilize a large language model to perform semantic summarization of reports,\nextracting high-level semantic information. 
Second, we design a Cross-Modal\nKnowledge Interaction (CMKI) module that leverages a cross-modal knowledge bank\nas a semantic bridge, facilitating interaction between the two modalities,\nnarrowing the gap, and improving their alignment. To comprehensively evaluate\nour method, we construct a benchmark dataset that includes 15 underrepresented\nabnormalities and utilize two existing benchmark datasets. Experimental\nresults demonstrate that BrgSA achieves state-of-the-art performance on both\npublic benchmark datasets and our custom-labeled dataset, with significant\nimprovements in zero-shot diagnosis of underrepresented abnormalities.\n","authors":["Haoran Lai","Zihang Jiang","Qingsong Yao","Rongsheng Wang","Zhiyang He","Xiaodong Tao","Wei Wei","Weifu Lv","S. Kevin Zhou"],"pdf_url":"https://arxiv.org/pdf/2501.03565v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03544v1","updated":"2025-01-07T05:39:21Z","published":"2025-01-07T05:39:21Z","title":"PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for\n Text-to-Image Models","summary":" Text-to-image (T2I) models have been shown to be vulnerable to misuse,\nparticularly in generating not-safe-for-work (NSFW) content, raising serious\nethical concerns. In this work, we present PromptGuard, a novel content\nmoderation technique that draws inspiration from the system prompt mechanism in\nlarge language models (LLMs) for safety alignment. Unlike LLMs, T2I models lack\na direct interface for enforcing behavioral guidelines. Our key idea is to\noptimize a safety soft prompt that functions as an implicit system prompt\nwithin the T2I model's textual embedding space. This universal soft prompt (P*)\ndirectly moderates NSFW inputs, enabling safe yet realistic image generation\nwithout altering the inference efficiency or requiring proxy models. 
Extensive\nexperiments across three datasets demonstrate that PromptGuard effectively\nmitigates NSFW content generation while preserving high-quality benign outputs.\nPromptGuard is 7.8 times faster than prior content moderation methods,\nsurpassing eight state-of-the-art defenses with an optimal unsafe ratio down to\n5.84%.\n","authors":["Lingzhi Yuan","Xinfeng Li","Chejian Xu","Guanhong Tao","Xiaojun Jia","Yihao Huang","Wei Dong","Yang Liu","XiaoFeng Wang","Bo Li"],"pdf_url":"https://arxiv.org/pdf/2501.03544v1.pdf","comment":"16 pages, 8 figures, 10 tables"},{"id":"http://arxiv.org/abs/2501.03539v1","updated":"2025-01-07T05:21:13Z","published":"2025-01-07T05:21:13Z","title":"Enhanced Tuberculosis Bacilli Detection using Attention-Residual U-Net\n and Ensemble Classification","summary":" Tuberculosis (TB), caused by Mycobacterium tuberculosis, remains a critical\nglobal health issue, necessitating timely diagnosis and treatment. Current\nmethods for detecting tuberculosis bacilli from bright field microscopic sputum\nsmear images suffer from low automation, inadequate segmentation performance,\nand limited classification accuracy. This paper proposes an efficient hybrid\napproach that combines deep learning for segmentation and an ensemble model for\nclassification. An enhanced U-Net model incorporating attention blocks and\nresidual connections is introduced to precisely segment microscopic sputum\nsmear images, facilitating the extraction of Regions of Interest (ROIs). These\nROIs are subsequently classified using an ensemble classifier comprising\nSupport Vector Machine (SVM), Random Forest, and Extreme Gradient Boost\n(XGBoost), resulting in an accurate identification of bacilli within the\nimages. 
Experiments conducted on a newly created dataset, along with public\ndatasets, demonstrate that the proposed model achieves superior segmentation\nperformance, higher classification accuracy, and enhanced automation compared\nto existing methods.\n","authors":["Greeshma K","Vishnukumar S"],"pdf_url":"https://arxiv.org/pdf/2501.03539v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03538v1","updated":"2025-01-07T05:17:43Z","published":"2025-01-07T05:17:43Z","title":"Efficient and Accurate Tuberculosis Diagnosis: Attention Residual U-Net\n and Vision Transformer Based Detection Framework","summary":" Tuberculosis (TB), an infectious disease caused by Mycobacterium\ntuberculosis, continues to be a major global health threat despite being\npreventable and curable. This burden is particularly high in low and middle\nincome countries. Microscopy remains essential for diagnosing TB by enabling\ndirect visualization of Mycobacterium tuberculosis in sputum smear samples,\noffering a cost effective approach for early detection and effective treatment.\nGiven the labour-intensive nature of microscopy, automating the detection of\nbacilli in microscopic images is crucial to improve both the expediency and\nreliability of TB diagnosis. The current methodologies for detecting\ntuberculosis bacilli in bright field microscopic sputum smear images are\nhindered by limited automation capabilities, inconsistent segmentation quality,\nand constrained classification precision. This paper proposes a two-stage deep\nlearning methodology for tuberculosis bacilli detection, comprising bacilli\nsegmentation followed by classification. In the initial phase, an advanced\nU-Net model employing attention blocks and residual connections is proposed to\nsegment microscopic sputum smear images, enabling the extraction of Regions of\nInterest (ROIs). 
The extracted ROIs are then classified using a Vision\nTransformer, which we specifically customized as TBViT to enhance the precise\ndetection of bacilli within the images. For the experiments, a newly developed\ndataset of microscopic sputum smear images derived from Ziehl-Neelsen-stained\nslides is used in conjunction with existing public datasets. The qualitative\nand quantitative evaluation of the experiments using various metrics\ndemonstrates that the proposed model achieves significantly improved\nsegmentation performance, higher classification accuracy, and a greater level\nof automation, surpassing existing methods.\n","authors":["Greeshma K","Vishnukumar S"],"pdf_url":"https://arxiv.org/pdf/2501.03538v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03533v1","updated":"2025-01-07T05:12:49Z","published":"2025-01-07T05:12:49Z","title":"Anomaly Triplet-Net: Progress Recognition Model Using Deep Metric\n Learning Considering Occlusion for Manual Assembly Work","summary":" In this paper, a progress recognition method considering occlusion using deep\nmetric learning is proposed to visualize the product assembly process in a\nfactory. First, the target assembly product is detected from images acquired\nfrom a fixed-point camera installed in the factory using a deep learning-based\nobject detection method. 
Next, the detection area is cropped from the image.\nFinally, by using a classification method based on deep metric learning on the\ncropped image, the progress of the product assembly work is estimated as a\nrough progress step.\n As a specific progress estimation model, we propose an Anomaly Triplet-Net\nthat adds anomaly samples to Triplet Loss for progress estimation considering\nocclusion.\n In experiments, an 82.9% success rate is achieved for the progress estimation\nmethod using Anomaly Triplet-Net.\n We also experimented with the practicality of the sequence of detection,\ncropping, and progress estimation, and confirmed the effectiveness of the\noverall system.\n","authors":["Takumi Kitsukawa","Kazuma Miura","Shigeki Yumoto","Sarthak Pathak","Alessandro Moro","Kazunori Umeda"],"pdf_url":"https://arxiv.org/pdf/2501.03533v1.pdf","comment":"This paper has been peer-reviewed, revised, and published in Advanced\n Robotics"},{"id":"http://arxiv.org/abs/2501.03526v1","updated":"2025-01-07T04:42:45Z","published":"2025-01-07T04:42:45Z","title":"FgC2F-UDiff: Frequency-guided and Coarse-to-fine Unified Diffusion Model\n for Multi-modality Missing MRI Synthesis","summary":" Multi-modality magnetic resonance imaging (MRI) is essential for the\ndiagnosis and treatment of brain tumors. However, missing modalities are\ncommonly observed due to limitations in scan time, scan corruption, artifacts,\nmotion, and contrast agent intolerance. Synthesis of missing MRI has been a\nmeans to address the limitations of modality insufficiency in clinical practice\nand research. However, there are still some challenges, such as poor\ngeneralization, inaccurate non-linear mapping, and slow processing speeds. To\naddress the aforementioned issues, we propose a novel unified synthesis model,\nthe Frequency-guided and Coarse-to-fine Unified Diffusion Model (FgC2F-UDiff),\ndesigned for multiple inputs and outputs. 
Specifically, the Coarse-to-fine\nUnified Network (CUN) fully exploits the iterative denoising properties of\ndiffusion models, from global to detail, by dividing the denoising process into\ntwo stages, coarse and fine, to enhance the fidelity of synthesized images.\nSecondly, the Frequency-guided Collaborative Strategy (FCS) harnesses\nappropriate frequency information as prior knowledge to guide the learning of a\nunified, highly non-linear mapping. Thirdly, the Specific-acceleration Hybrid\nMechanism (SHM) integrates specific mechanisms to accelerate the diffusion\nmodel and enhance the feasibility of many-to-many synthesis. Extensive\nexperimental evaluations have demonstrated that our proposed FgC2F-UDiff model\nachieves superior performance on two datasets, validated through a\ncomprehensive assessment that includes both qualitative observations and\nquantitative metrics, such as PSNR, SSIM, LPIPS, and FID.\n","authors":["Xiaojiao Xiao","Qinmin Vivian Hu","Guanghui Wang"],"pdf_url":"https://arxiv.org/pdf/2501.03526v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03525v1","updated":"2025-01-07T04:40:55Z","published":"2025-01-07T04:40:55Z","title":"TexHOI: Reconstructing Textures of 3D Unknown Objects in Monocular\n Hand-Object Interaction Scenes","summary":" Reconstructing 3D models of dynamic, real-world objects with high-fidelity\ntextures from monocular frame sequences has been a challenging problem in\nrecent years. This difficulty stems from factors such as shadows, indirect\nillumination, and inaccurate object-pose estimations due to occluding\nhand-object interactions. To address these challenges, we propose a novel\napproach that predicts the hand's impact on environmental visibility and\nindirect illumination on the object's surface albedo. Our method first learns\nthe geometry and low-fidelity texture of the object, hand, and background\nthrough composite rendering of radiance fields. 
Simultaneously, we optimize the\nhand and object poses to achieve accurate object-pose estimations. We then\nrefine physics-based rendering parameters - including roughness, specularity,\nalbedo, hand visibility, skin color reflections, and environmental illumination\n- to produce precise albedo and accurate hand illumination and shadow regions.\nOur approach surpasses state-of-the-art methods in texture reconstruction and,\nto the best of our knowledge, is the first to account for hand-object\ninteractions in object texture reconstruction.\n","authors":["Alakh Aggarwal","Ningna Wang","Xiaohu Guo"],"pdf_url":"https://arxiv.org/pdf/2501.03525v1.pdf","comment":"This paper was accepted at ICCVM 2025 and will appear in the\n proceedings of IEEE TVCG as part of the conference"},{"id":"http://arxiv.org/abs/2310.15624v2","updated":"2025-01-07T04:39:25Z","published":"2023-10-24T08:45:15Z","title":"GUPNet++: Geometry Uncertainty Propagation Network for Monocular 3D\n Object Detection","summary":" Geometry plays a significant role in monocular 3D object detection. It can be\nused to estimate object depth by using the perspective projection between the\nobject's physical size and its 2D projection in the image plane, which can\nintroduce mathematical priors into deep models. However, this projection\nprocess also introduces error amplification, where the error of the estimated\nheight is amplified and reflected into the projected depth. It leads to\nunreliable depth inferences and also impairs training stability. To tackle this\nproblem, we propose a novel Geometry Uncertainty Propagation Network (GUPNet++)\nby modeling geometry projection in a probabilistic manner. This ensures depth\npredictions are well-bounded and associated with a reasonable uncertainty. The\nsignificance of introducing such geometric uncertainty is two-fold: (1). 
It\nmodels the uncertainty propagation relationship of the geometry projection\nduring training, improving the stability and efficiency of the end-to-end model\nlearning. (2). It can be derived into a highly reliable confidence to indicate\nthe quality of the 3D detection result, enabling more reliable detection\ninference. Experiments show that the proposed approach not only obtains\nstate-of-the-art (SOTA) performance in image-based monocular 3D detection but\nalso demonstrates superiority in efficacy with a simplified framework.\n","authors":["Yan Lu","Xinzhu Ma","Lei Yang","Tianzhu Zhang","Yating Liu","Qi Chu","Tong He","Yonghui Li","Wanli Ouyang"],"pdf_url":"https://arxiv.org/pdf/2310.15624v2.pdf","comment":"18 pages, 9 figures"},{"id":"http://arxiv.org/abs/2501.03510v1","updated":"2025-01-07T04:06:07Z","published":"2025-01-07T04:06:07Z","title":"Salient Region Matching for Fully Automated MR-TRUS Registration","summary":" Prostate cancer is a leading cause of cancer-related mortality in men. The\nregistration of magnetic resonance (MR) and transrectal ultrasound (TRUS) can\nprovide guidance for the targeted biopsy of prostate cancer. In this study, we\npropose a salient region matching framework for fully automated MR-TRUS\nregistration. The framework consists of prostate segmentation, rigid alignment\nand deformable registration. Prostate segmentation is performed using two\nsegmentation networks on MR and TRUS respectively, and the predicted salient\nregions are used for the rigid alignment. The rigidly-aligned MR and TRUS\nimages serve as initialization for the deformable registration. The deformable\nregistration network has a dual-stream encoder with cross-modal spatial\nattention modules to facilitate multi-modality feature learning, and a salient\nregion matching loss to consider both structure and intensity similarity within\nthe prostate region. 
Experiments on a public MR-TRUS dataset demonstrate that\nour method achieves satisfactory registration results, outperforming several\ncutting-edge methods. The code is publicly available at\nhttps://github.com/mock1ngbrd/salient-region-matching.\n","authors":["Zetian Feng","Dong Ni","Yi Wang"],"pdf_url":"https://arxiv.org/pdf/2501.03510v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.16766v2","updated":"2025-01-07T03:53:12Z","published":"2024-05-27T02:27:28Z","title":"Concept Matching with Agent for Out-of-Distribution Detection","summary":" The remarkable achievements of Large Language Models (LLMs) have captivated\nthe attention of both academia and industry, transcending their initial role in\ndialogue generation. To expand the usage scenarios of LLM, some works enhance\nthe effectiveness and capabilities of the model by introducing more external\ninformation, which is called the agent paradigm. Based on this idea, we propose\na new method that integrates the agent paradigm into out-of-distribution (OOD)\ndetection task, aiming to improve its robustness and adaptability. Our proposed\nmethod, Concept Matching with Agent (CMA), employs neutral prompts as agents to\naugment the CLIP-based OOD detection process. These agents function as dynamic\nobservers and communication hubs, interacting with both In-distribution (ID)\nlabels and data inputs to form vector triangle relationships. This triangular\nframework offers a more nuanced approach than the traditional binary\nrelationship, allowing for better separation and identification of ID and OOD\ninputs. 
Our extensive experimental results showcase the superior performance of\nCMA over both zero-shot and training-required methods in a diverse array of\nreal-world scenarios.\n","authors":["Yuxiao Lee","Xiaofeng Cao","Jingcai Guo","Wei Ye","Qing Guo","Yi Chang"],"pdf_url":"https://arxiv.org/pdf/2405.16766v2.pdf","comment":"Accepted by AAAI-25"},{"id":"http://arxiv.org/abs/2501.03507v1","updated":"2025-01-07T03:50:11Z","published":"2025-01-07T03:50:11Z","title":"An Empirical Study of Accuracy-Robustness Tradeoff and Training\n Efficiency in Self-Supervised Learning","summary":" Self-supervised learning (SSL) has significantly advanced image\nrepresentation learning, yet efficiency challenges persist, particularly with\nadversarial training. Many SSL methods require extensive epochs to achieve\nconvergence, a demand further amplified in adversarial settings. To address\nthis inefficiency, we revisit the robust EMP-SSL framework, emphasizing the\nimportance of increasing the number of crops per image to accelerate learning.\nUnlike traditional contrastive learning, robust EMP-SSL leverages multi-crop\nsampling, integrates an invariance term and regularization, and reduces\ntraining epochs, enhancing time efficiency. Evaluated with both standard linear\nclassifiers and multi-patch embedding aggregation, robust EMP-SSL provides new\ninsights into SSL evaluation strategies.\n Our results show that robust crop-based EMP-SSL not only accelerates\nconvergence but also achieves a superior balance between clean accuracy and\nadversarial robustness, outperforming multi-crop embedding aggregation.\nAdditionally, we extend this approach with free adversarial training in\nMulti-Crop SSL, introducing the Cost-Free Adversarial Multi-Crop\nSelf-Supervised Learning (CF-AMC-SSL) method. CF-AMC-SSL demonstrates the\neffectiveness of free adversarial training in reducing training time while\nsimultaneously improving clean accuracy and adversarial robustness. 
These\nfindings underscore the potential of CF-AMC-SSL for practical SSL applications.\nOur code is publicly available at https://github.com/softsys4ai/CF-AMC-SSL.\n","authors":["Fatemeh Ghofrani","Pooyan Jamshidi"],"pdf_url":"https://arxiv.org/pdf/2501.03507v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.18710v3","updated":"2025-01-07T03:48:04Z","published":"2023-05-30T03:30:24Z","title":"High-Performance Inference Graph Convolutional Networks for\n Skeleton-Based Action Recognition","summary":" Recently, significant achievements have been made in skeleton-based human\naction recognition with the emergence of graph convolutional networks (GCNs).\nHowever, the state-of-the-art (SOTA) models used for this task focus on\nconstructing more complex higher-order connections between joint nodes to\ndescribe skeleton information, which leads to complex inference processes and\nhigh computational costs. To address the slow inference speed caused by overly\ncomplex model structures, we introduce re-parameterization and\nover-parameterization techniques to GCNs and propose two novel high-performance\ninference GCNs, namely HPI-GCN-RP and HPI-GCN-OP. After the completion of model\ntraining, model parameters are fixed. HPI-GCN-RP adopts a re-parameterization\ntechnique to transform the high-performance training model into a fast inference\nmodel through linear transformations, which achieves a higher inference speed\nwith competitive model performance. HPI-GCN-OP further utilizes an\nover-parameterization technique to achieve higher performance improvement by\nintroducing additional inference parameters, albeit with slightly decreased\ninference speed. The experimental results on the two skeleton-based action\nrecognition datasets demonstrate the effectiveness of our approach. Our\nHPI-GCN-OP achieves performance comparable to the current SOTA models, with\ninference speeds five times faster. 
Specifically, our HPI-GCN-OP achieves an\naccuracy of 93\\% on the cross-subject split of the NTU-RGB+D 60 dataset, and\n90.1\\% on the cross-subject benchmark of the NTU-RGB+D 120 dataset. Code is\navailable at github.com/lizaowo/HPI-GCN.\n","authors":["Junyi Wang","Ziao Li","Bangli Liu","Haibin Cai","Mohamad Saada","Qinggang Meng"],"pdf_url":"https://arxiv.org/pdf/2305.18710v3.pdf","comment":"23 pages, 5 figures"},{"id":"http://arxiv.org/abs/2310.17875v3","updated":"2025-01-07T03:43:11Z","published":"2023-10-27T03:32:05Z","title":"Siamese-DETR for Generic Multi-Object Tracking","summary":" The ability to detect and track the dynamic objects in different scenes is\nfundamental to real-world applications, e.g., autonomous driving and robot\nnavigation. However, traditional Multi-Object Tracking (MOT) is limited to\ntracking objects belonging to the pre-defined closed-set categories. Recently,\nOpen-Vocabulary MOT (OVMOT) and Generic MOT (GMOT) are proposed to track\ninterested objects beyond pre-defined categories with the given text prompt and\ntemplate image. However, the expensive well pre-trained (vision-)language model\nand fine-grained category annotations are required to train OVMOT models. In\nthis paper, we focus on GMOT and propose a simple but effective method,\nSiamese-DETR, for GMOT. Only the commonly used detection datasets (e.g., COCO)\nare required for training. Different from existing GMOT methods, which train a\nSingle Object Tracking (SOT) based detector to detect interested objects and\nthen apply a data association based MOT tracker to get the trajectories, we\nleverage the inherent object queries in DETR variants. 
Specifically: 1) The\nmulti-scale object queries are designed based on the given template image,\nwhich are effective for detecting different scales of objects with the same\ncategory as the template image; 2) A dynamic matching training strategy is\nintroduced to train Siamese-DETR on commonly used detection datasets, which\ntakes full advantage of provided annotations; 3) The online tracking pipeline\nis simplified through a tracking-by-query manner by incorporating the tracked\nboxes in the previous frame as additional query boxes. The complex data association\nis replaced with the much simpler Non-Maximum Suppression (NMS). Extensive\nexperimental results show that Siamese-DETR surpasses existing MOT methods on\nthe GMOT-40 dataset by a large margin. Codes are available at\n\\url{https://github.com/yumu-173/Siamese-DETR}.\n","authors":["Qiankun Liu","Yichen Li","Yuqi Jiang","Ying Fu"],"pdf_url":"https://arxiv.org/pdf/2310.17875v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03499v1","updated":"2025-01-07T03:39:43Z","published":"2025-01-07T03:39:43Z","title":"Can Deep Learning Trigger Alerts from Mobile-Captured Images?","summary":" Our research presents a comprehensive approach to leveraging mobile camera\nimage data for real-time air quality assessment and recommendation. We develop\na regression-based Convolutional Neural Network model and tailor it explicitly\nfor air quality prediction by exploiting the inherent relationship between\noutput parameters. As a result, the Mean Squared Error of 0.0077 and 0.0112\nobtained for 2 and 5 pollutants respectively outperforms existing models.\nFurthermore, we aim to verify the common practice of augmenting the original\ndataset with a view to introducing more variation in the training phase. It is\none of our most significant contributions that our experimental results\ndemonstrate minimal accuracy differences between the original and augmented\ndatasets. 
Finally, a real-time, user-friendly dashboard is implemented which\ndynamically displays the Air Quality Index and pollutant values derived from\ncaptured mobile camera images. Users' health conditions are considered to\nrecommend whether a location is suitable based on current air quality metrics.\nOverall, this research contributes to verification of data augmentation\ntechniques, CNN-based regression modelling for air quality prediction, and\nuser-centric air quality monitoring through mobile technology. The proposed\nsystem offers practical solutions for individuals to make informed\nenvironmental health and well-being decisions.\n","authors":["Pritisha Sarkar","Duranta Durbaar Vishal Saha","Mousumi Saha"],"pdf_url":"https://arxiv.org/pdf/2501.03499v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.13656v2","updated":"2025-01-07T03:37:12Z","published":"2024-08-24T19:14:02Z","title":"Localize-and-Stitch: Efficient Model Merging via Sparse Task Arithmetic","summary":" Model merging offers an effective strategy to combine the strengths of\nmultiple finetuned models into a unified model that preserves the specialized\ncapabilities of each. Existing methods merge models in a global manner,\nperforming arithmetic operations across all model parameters. However, such\nglobal merging often leads to task interference, degrading the performance of\nthe merged model. In this work, we introduce Localize-and-Stitch, a novel\napproach that merges models in a localized way. Our algorithm works in two\nsteps: i) Localization: identify tiny ($1\\%$ of the total parameters) localized\nregions in the finetuned models containing essential skills for the downstream\ntasks, and ii) Stitching: reintegrate only these essential regions back into\nthe pretrained model for task synergy. 
We demonstrate that our approach\neffectively locates sparse regions responsible for finetuned performance, and\nthe localized regions could be treated as compact and interpretable\nrepresentations of the finetuned models (tasks). Empirically, we evaluate our\nmethod on various vision and language benchmarks, showing that it outperforms\nexisting model merging methods under different data availability scenarios.\nBeyond strong empirical performance, our algorithm also facilitates model\ncompression and preserves pretrained knowledge, enabling flexible and continual\nskill composition from multiple finetuned models with minimal storage and\ncomputational overhead. Our code is available at\nhttps://github.com/uiuctml/Localize-and-Stitch.\n","authors":["Yifei He","Yuzheng Hu","Yong Lin","Tong Zhang","Han Zhao"],"pdf_url":"https://arxiv.org/pdf/2408.13656v2.pdf","comment":"TMLR camera-ready version"},{"id":"http://arxiv.org/abs/2501.03495v1","updated":"2025-01-07T03:33:22Z","published":"2025-01-07T03:33:22Z","title":"Textualize Visual Prompt for Image Editing via Diffusion Bridge","summary":" Visual prompt, a pair of before-and-after edited images, can convey\nindescribable imagery transformations and prosper in image editing. However,\ncurrent visual prompt methods rely on a pretrained text-guided image-to-image\ngenerative model that requires a triplet of text, before, and after images for\nretraining over a text-to-image model. Such triplet crafting and retraining\nprocesses limit the scalability and generalization of editing. In this paper,\nwe present a framework based on any single text-to-image model without reliance\non the explicit image-to-image model, thus enhancing the generalizability and\nscalability. Specifically, by leveraging the probability-flow ordinary\ndifferential equation, we construct a diffusion bridge to transfer the distribution between\nbefore-and-after images under the text guidance. 
By optimizing the text via the\nbridge, the framework adaptively textualizes the editing transformation\nconveyed by visual prompts into text embeddings without other models.\nMeanwhile, we introduce differential attention control during text\noptimization, which disentangles the text embedding from the invariance of the\nbefore-and-after images and makes it solely capture the delicate transformation\nand generalize to edit various images. Experiments on real images validate\ncompetitive results on the generalization, contextual coherence, and high\nfidelity for delicate editing with just one image pair as the visual prompt.\n","authors":["Pengcheng Xu","Qingnan Fan","Fei Kou","Shuai Qin","Hong Gu","Ruoyu Zhao","Charles Ling","Boyu Wang"],"pdf_url":"https://arxiv.org/pdf/2501.03495v1.pdf","comment":"AAAI 2025"},{"id":"http://arxiv.org/abs/2501.02024v2","updated":"2025-01-07T03:29:43Z","published":"2025-01-02T20:47:04Z","title":"Model Checking in Medical Imaging for Tumor Detection and Segmentation","summary":" Recent advancements in model checking have demonstrated significant potential\nacross diverse applications, particularly in signal and image analysis. Medical\nimaging stands out as a critical domain where model checking can be effectively\napplied to design and evaluate robust frameworks. These frameworks facilitate\nautomatic and semi-automatic delineation of regions of interest within images,\naiding in accurate segmentation. 
This paper provides a comprehensive analysis\nof recent works leveraging spatial logic to develop operators and tools for\nidentifying regions of interest, including tumorous and non-tumorous areas.\nAdditionally, we examine the challenges inherent to spatial model-checking\ntechniques, such as variability in ground truth data and the need for\nstreamlined procedures suitable for routine clinical practice.\n","authors":["Elhoucine Elfatimi","Lahcen El fatimi"],"pdf_url":"https://arxiv.org/pdf/2501.02024v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.00166v2","updated":"2025-01-07T03:21:43Z","published":"2024-09-30T19:15:05Z","title":"EEG Emotion Copilot: Optimizing Lightweight LLMs for Emotional EEG\n Interpretation with Assisted Medical Record Generation","summary":" In the fields of affective computing (AC) and brain-machine interface (BMI),\nthe analysis of physiological and behavioral signals to discern individual\nemotional states has emerged as a critical research frontier. While deep\nlearning-based approaches have made notable strides in EEG emotion recognition,\nparticularly in feature extraction and pattern recognition, significant\nchallenges persist in achieving end-to-end emotion computation, including\nreal-time processing, individual adaptation, and seamless user interaction.\nThis paper presents the EEG Emotion Copilot, a system optimizing a lightweight\nlarge language model (LLM) with 0.5B parameters operating in a local setting,\nwhich first recognizes emotional states directly from EEG signals, subsequently\ngenerates personalized diagnostic and treatment suggestions, and finally\nsupports the automation of assisted electronic medical records. Specifically,\nwe demonstrate the critical techniques in the novel data structure of prompt,\nmodel pruning and fine-tuning training, and deployment strategies aiming at\nimproving real-time performance and computational efficiency. 
Extensive\nexperiments show that our optimized lightweight LLM-based copilot achieves an\nenhanced intuitive interface for participant interaction, superior accuracy of\nemotion recognition and assisted electronic medical records generation, in\ncomparison to such models with similar scale parameters or large-scale\nparameters such as 1.5B, 1.8B, 3B and 7B. In summary, through these efforts,\nthe proposed copilot is expected to advance the application of AC in the\nmedical domain, offering an innovative solution to mental health monitoring. The\ncodes will be released at https://github.com/NZWANG/EEG_Emotion_Copilot.\n","authors":["Hongyu Chen","Weiming Zeng","Chengcheng Chen","Luhui Cai","Fei Wang","Yuhu Shi","Lei Wang","Wei Zhang","Yueyang Li","Hongjie Yan","Wai Ting Siok","Nizhuan Wang"],"pdf_url":"https://arxiv.org/pdf/2410.00166v2.pdf","comment":"10 pages, 12 figures, 2 tables"},{"id":"http://arxiv.org/abs/2501.03490v1","updated":"2025-01-07T03:18:15Z","published":"2025-01-07T03:18:15Z","title":"SceneBooth: Diffusion-based Framework for Subject-preserved\n Text-to-Image Generation","summary":" Due to the demand for personalizing image generation, the subject-driven\ntext-to-image generation method, which creates novel renditions of an input\nsubject based on text prompts, has received growing research interest. Existing\nmethods often learn subject representation and incorporate it into the prompt\nembedding to guide image generation, but they struggle with preserving subject\nfidelity. To solve this issue, this paper proposes a novel framework named\nSceneBooth for subject-preserved text-to-image generation, which consumes\ninputs of a subject image, object phrases and text prompts. Instead of learning\nthe subject representation and generating a subject, our SceneBooth fixes the\ngiven subject image and generates its background image guided by the text\nprompts. 
To this end, our SceneBooth introduces two key components, i.e., a\nmultimodal layout generation module and a background painting module. The\nformer determines the position and scale of the subject by generating\nappropriate scene layouts that align with text captions, object phrases, and\nsubject visual information. The latter integrates two adapters (ControlNet and\nGated Self-Attention) into the latent diffusion model to generate a background\nthat harmonizes with the subject guided by scene layouts and text descriptions.\nIn this manner, our SceneBooth ensures accurate preservation of the subject's\nappearance in the output. Quantitative and qualitative experimental results\ndemonstrate that SceneBooth significantly outperforms baseline methods in terms\nof subject preservation, image harmonization and overall quality.\n","authors":["Shang Chai","Zihang Lin","Min Zhou","Xubin Li","Liansheng Zhuang","Houqiang Li"],"pdf_url":"https://arxiv.org/pdf/2501.03490v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19391v2","updated":"2025-01-07T03:15:49Z","published":"2024-12-27T00:36:40Z","title":"An In-Depth Analysis of Adversarial Discriminative Domain Adaptation for\n Digit Classification","summary":" Domain adaptation is an active area of research driven by the growing demand\nfor robust machine learning models that perform well on real-world data.\nAdversarial learning for deep neural networks (DNNs) has emerged as a promising\napproach to improving generalization ability, particularly for image\nclassification. In this paper, we implement a specific adversarial learning\ntechnique known as Adversarial Discriminative Domain Adaptation (ADDA) and\nreplicate digit classification experiments from the original ADDA paper. We\nextend their findings by examining a broader range of domain shifts and provide\na detailed analysis of in-domain classification accuracy post-ADDA. 
Our results\ndemonstrate that ADDA significantly improves accuracy across certain domain\nshifts with minimal impact on in-domain performance. Furthermore, we provide\nqualitative analysis and propose potential explanations for ADDA's limitations\nin less successful domain shifts. Code is at\nhttps://github.com/eugenechoi2004/COS429_FINAL .\n","authors":["Eugene Choi","Julian Rodriguez","Edmund Young"],"pdf_url":"https://arxiv.org/pdf/2412.19391v2.pdf","comment":"Replacement: Updated methodology section to include grayscale\n preprocessing of SVHN data"},{"id":"http://arxiv.org/abs/2501.03482v1","updated":"2025-01-07T03:00:58Z","published":"2025-01-07T03:00:58Z","title":"VOILA: Complexity-Aware Universal Segmentation of CT images by Voxel\n Interacting with Language","summary":" Satisfactory progress has been achieved recently in universal segmentation of\nCT images. Following the success of vision-language methods, there is a growing\ntrend towards utilizing text prompts and contrastive learning to develop\nuniversal segmentation models. However, there exists a significant imbalance in\ninformation density between 3D images and text prompts. Moreover, the standard\nfully connected layer segmentation approach faces significant challenges in\nhandling multiple classes and exhibits poor generalizability. To address these\nchallenges, we propose the VOxel Interacting with LAnguage method (VOILA) for\nuniversal CT image segmentation. Initially, we align voxels and language into a\nshared representation space and classify voxels on the basis of cosine\nsimilarity. Subsequently, we develop the Voxel-Language Interaction framework\nto mitigate the impact of class imbalance caused by foreground-background\ndiscrepancies and variations in target volumes. Furthermore, a Complexity-Aware\nSampling method is proposed to focus on regions that are hard to segment, achieved by\ngenerating pseudo-heatmaps from a trainable Gaussian mixture distribution. 
Our\nresults indicate that the proposed VOILA is capable of achieving improved performance\nwith reduced parameters and computational cost during training. Furthermore, it\ndemonstrates significant generalizability across diverse datasets without\nadditional fine-tuning.\n","authors":["Zishuo Wan","Yu Gao","Wanyuan Pang","Dawei Ding"],"pdf_url":"https://arxiv.org/pdf/2501.03482v1.pdf","comment":"Accepted by AAAI 2025"},{"id":"http://arxiv.org/abs/2501.01691v2","updated":"2025-01-07T02:57:03Z","published":"2025-01-03T08:18:08Z","title":"VidFormer: A novel end-to-end framework fused by 3DCNN and Transformer\n for Video-based Remote Physiological Measurement","summary":" Remote physiological signal measurement based on facial videos, also known as\nremote photoplethysmography (rPPG), involves predicting changes in facial\nvascular blood flow from facial videos. While most deep learning-based methods\nhave achieved good results, they often struggle to balance performance across\nsmall and large-scale datasets due to the inherent limitations of convolutional\nneural networks (CNNs) and Transformer. In this paper, we introduce VidFormer,\na novel end-to-end framework that integrates 3-Dimension Convolutional Neural\nNetwork (3DCNN) and Transformer models for rPPG tasks. Initially, we conduct an\nanalysis of the traditional skin reflection model and subsequently introduce an\nenhanced model for the reconstruction of rPPG signals. Based on this improved\nmodel, VidFormer utilizes 3DCNN and Transformer to extract local and global\nfeatures from input data, respectively. To enhance the spatiotemporal feature\nextraction capabilities of VidFormer, we incorporate temporal-spatial attention\nmechanisms tailored for both 3DCNN and Transformer. Additionally, we design a\nmodule to facilitate information exchange and fusion between the 3DCNN and\nTransformer. 
Our evaluation on five publicly available datasets demonstrates\nthat VidFormer outperforms current state-of-the-art (SOTA) methods. Finally, we\ndiscuss the essential roles of each VidFormer module and examine the effects of\nethnicity, makeup, and exercise on its performance.\n","authors":["Jiachen Li","Shisheng Guo","Longzhen Tang","Cuolong Cui","Lingjiang Kong","Xiaobo Yang"],"pdf_url":"https://arxiv.org/pdf/2501.01691v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.02964v2","updated":"2025-01-07T02:55:15Z","published":"2025-01-06T12:16:56Z","title":"Socratic Questioning: Learn to Self-guide Multimodal Reasoning in the\n Wild","summary":" Complex visual reasoning remains a key challenge today. Typically, the\nchallenge is tackled using methodologies such as Chain of Thought (COT) and\nvisual instruction tuning. However, how to organically combine these two\nmethodologies for greater success remains unexplored. Also, issues like\nhallucinations and high training cost still need to be addressed. In this work,\nwe devise an innovative multi-round training and reasoning framework suitable\nfor lightweight Multimodal Large Language Models (MLLMs). Our self-questioning\napproach heuristically guides MLLMs to focus on visual clues relevant to the\ntarget problem, reducing hallucinations and enhancing the model's ability to\ndescribe fine-grained image details. This ultimately enables the model to\nperform well in complex visual reasoning and question-answering tasks. We have\nnamed this framework Socratic Questioning (SQ). To facilitate future research,\nwe create a multimodal mini-dataset named CapQA, which includes 1k images of\nfine-grained activities, for visual instruction tuning and evaluation. Our\nproposed SQ method leads to a 31.2% improvement in the hallucination score. Our\nextensive experiments on various benchmarks demonstrate SQ's remarkable\ncapabilities in heuristic self-questioning, zero-shot visual reasoning and\nhallucination mitigation. 
Our model and code will be publicly available.\n","authors":["Wanpeng Hu","Haodi Liu","Lin Chen","Feng Zhou","Changming Xiao","Qi Yang","Changshui Zhang"],"pdf_url":"https://arxiv.org/pdf/2501.02964v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.21534v6","updated":"2025-01-07T02:54:18Z","published":"2024-07-31T11:40:29Z","title":"ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large\n Language Models","summary":" In this work, we propose a training-free method to inject visual prompts into\nMultimodal Large Language Models (MLLMs) through test-time optimization of a\nlearnable latent variable. We observe that attention, as the core module of\nMLLMs, connects text prompt tokens and visual tokens, ultimately determining\nthe final results. Our approach involves adjusting visual tokens from the MLP\noutput at test time, controlling the attention response to ensure text prompt\ntokens attend to visual tokens in referring regions. We optimize a learnable\nlatent variable based on an energy function, enhancing the strength of\nreferring regions in the attention map. This enables detailed region\ndescription and reasoning without the need for substantial training costs or\nmodel retraining. Our method offers a promising direction for integrating\nreferring abilities into MLLMs, and supports referring with box, mask, scribble\nand point. 
The results demonstrate that our method exhibits out-of-domain\ngeneralization and interpretability.\n","authors":["Mingrui Wu","Xinyue Cai","Jiayi Ji","Jiale Li","Oucheng Huang","Gen Luo","Hao Fei","Guannan Jiang","Xiaoshuai Sun","Rongrong Ji"],"pdf_url":"https://arxiv.org/pdf/2407.21534v6.pdf","comment":"Accepted to NeurIPS 2024;\n Code:https://github.com/mrwu-mac/ControlMLLM"},{"id":"http://arxiv.org/abs/2501.02962v2","updated":"2025-01-07T02:51:31Z","published":"2025-01-06T12:09:08Z","title":"SceneVTG++: Controllable Multilingual Visual Text Generation in the Wild","summary":" Generating visual text in natural scene images is a challenging task with\nmany unsolved problems. Different from generating text on artificially designed\nimages (such as posters, covers, cartoons, etc.), the text in natural scene\nimages needs to meet the following four key criteria: (1) Fidelity: the\ngenerated text should appear as realistic as a photograph and be completely\naccurate, with no errors in any of the strokes. (2) Reasonability: the text\nshould be generated on reasonable carrier areas (such as boards, signs, walls,\netc.), and the generated text content should also be relevant to the scene. (3)\nUtility: the generated text can facilitate the training of natural scene OCR\n(Optical Character Recognition) tasks. (4) Controllability: the attributes of\nthe text (such as font and color) should be controllable as needed. In this\npaper, we propose a two-stage method, SceneVTG++, which simultaneously\nsatisfies the four aspects mentioned above. SceneVTG++ consists of a Text\nLayout and Content Generator (TLCG) and a Controllable Local Text Diffusion\n(CLTD). The former utilizes the world knowledge of multi-modal large language\nmodels to find reasonable text areas and recommend text content according to\nthe natural scene background images, while the latter generates controllable\nmultilingual text based on the diffusion model. 
Through extensive experiments,\nwe respectively verified the effectiveness of TLCG and CLTD, and demonstrated\nthe state-of-the-art text generation performance of SceneVTG++. In addition,\nthe generated images have superior utility in OCR tasks like text detection and\ntext recognition. Codes and datasets will be available.\n","authors":["Jiawei Liu","Yuanzhi Zhu","Feiyu Gao","Zhibo Yang","Peng Wang","Junyang Lin","Xinggang Wang","Wenyu Liu"],"pdf_url":"https://arxiv.org/pdf/2501.02962v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03471v1","updated":"2025-01-07T02:15:58Z","published":"2025-01-07T02:15:58Z","title":"Hyperbolic Binary Neural Network","summary":" Binary Neural Network (BNN) converts full-precision weights and activations\ninto their extreme 1-bit counterparts, making it particularly suitable for\ndeployment on lightweight mobile devices. While binary neural networks are\ntypically formulated as a constrained optimization problem and optimized in the\nbinarized space, general neural networks are formulated as an unconstrained\noptimization problem and optimized in the continuous space. This paper\nintroduces the Hyperbolic Binary Neural Network (HBNN) by leveraging the\nframework of hyperbolic geometry to optimize the constrained problem.\nSpecifically, we transform the constrained problem in hyperbolic space into an\nunconstrained one in Euclidean space using the Riemannian exponential map. On\nthe other hand, we also propose the Exponential Parametrization Cluster (EPC)\nmethod, which, compared to the Riemannian exponential map, shrinks the segment\ndomain based on a diffeomorphism. This approach increases the probability of\nweight flips, thereby maximizing the information gain in BNNs. 
Experimental\nresults on CIFAR10, CIFAR100, and ImageNet classification datasets with\nVGGsmall, ResNet18, and ResNet34 models illustrate the superior performance of\nour HBNN over state-of-the-art methods.\n","authors":["Jun Chen","Jingyang Xiang","Tianxin Huang","Xiangrui Zhao","Yong Liu"],"pdf_url":"https://arxiv.org/pdf/2501.03471v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03469v1","updated":"2025-01-07T02:10:52Z","published":"2025-01-07T02:10:52Z","title":"Information-Maximized Soft Variable Discretization for Self-Supervised\n Image Representation Learning","summary":" Self-supervised learning (SSL) has emerged as a crucial technique in image\nprocessing, encoding, and understanding, especially for developing today's\nvision foundation models that utilize large-scale datasets without annotations\nto enhance various downstream tasks. This study introduces a novel SSL\napproach, Information-Maximized Soft Variable Discretization (IMSVD), for image\nrepresentation learning. Specifically, IMSVD softly discretizes each variable\nin the latent space, enabling the estimation of their probability distributions\nover training batches and allowing the learning process to be directly guided\nby information measures. Motivated by the MultiView assumption, we propose an\ninformation-theoretic objective function to learn transform-invariant,\nnon-trivial, and redundancy-minimized representation features. We then derive a\njoint-cross entropy loss function for self-supervised image representation\nlearning, which theoretically enjoys superiority over the existing methods in\nreducing feature redundancy. Notably, our non-contrastive IMSVD method\nstatistically performs contrastive learning. Extensive experimental results\ndemonstrate the effectiveness of IMSVD on various downstream tasks in terms of\nboth accuracy and efficiency. Thanks to our variable discretization, the\nembedding features optimized by IMSVD offer unique explainability at the\nvariable level. 
IMSVD has the potential to be adapted to other learning\nparadigms. Our code is publicly available at\nhttps://github.com/niuchuangnn/IMSVD.\n","authors":["Chuang Niu","Wenjun Xia","Hongming Shan","Ge Wang"],"pdf_url":"https://arxiv.org/pdf/2501.03469v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.01973v2","updated":"2025-01-07T02:10:45Z","published":"2024-12-28T02:28:19Z","title":"INFELM: In-depth Fairness Evaluation of Large Text-To-Image Models","summary":" The rapid development of large language models (LLMs) and large vision models\n(LVMs) has propelled the evolution of multi-modal AI systems, which have\ndemonstrated remarkable potential for industrial applications by emulating\nhuman-like cognition. However, they also pose significant ethical challenges,\nincluding amplifying harmful content and reinforcing societal biases. For\ninstance, biases in some industrial image generation models highlighted the\nurgent need for robust fairness assessments. Most existing evaluation\nframeworks focus on the comprehensiveness of various aspects of the models, but\nthey exhibit critical limitations, including insufficient attention to content\ngeneration alignment and social bias-sensitive domains. More importantly, their\nreliance on pixel-detection techniques is prone to inaccuracies.\n To address these issues, this paper presents INFELM, an in-depth fairness\nevaluation of widely-used text-to-image models. Our key contributions are: (1)\nan advanced skintone classifier incorporating facial topology and refined skin\npixel representation to enhance classification precision by at least 16.04%,\n(2) a bias-sensitive content alignment measurement for understanding societal\nimpacts, (3) a generalizable representation bias evaluation for diverse\ndemographic groups, and (4) extensive experiments analyzing large-scale\ntext-to-image model outputs across six social-bias-sensitive domains. 
We find\nthat existing models in the study generally do not meet the empirical fairness\ncriteria, and representation bias is generally more pronounced than alignment\nerrors. INFELM establishes a robust benchmark for fairness assessment,\nsupporting the development of multi-modal AI systems that align with ethical\nand human-centric principles.\n","authors":["Di Jin","Xing Liu","Yu Liu","Jia Qing Yap","Andrea Wong","Adriana Crespo","Qi Lin","Zhiyuan Yin","Qiang Yan","Ryan Ye"],"pdf_url":"https://arxiv.org/pdf/2501.01973v2.pdf","comment":"Di Jin and Xing Liu contributed equally to this work"},{"id":"http://arxiv.org/abs/2412.16487v2","updated":"2025-01-07T02:08:56Z","published":"2024-12-21T05:04:36Z","title":"Trusted Mamba Contrastive Network for Multi-View Clustering","summary":" Multi-view clustering can partition data samples into their categories by\nlearning a consensus representation in an unsupervised way and has received\nmore and more attention in recent years. However, there is an untrusted fusion\nproblem. The reasons for this problem are as follows: 1) The current methods\nignore the presence of noise or redundant information in the view; 2) The\nsimilarity of contrastive learning comes from the same sample rather than the\nsame cluster in deep multi-view clustering. This causes multi-view fusion to proceed in the\nwrong direction. This paper proposes a novel multi-view clustering network to\naddress this problem, termed the Trusted Mamba Contrastive Network (TMCN).\nSpecifically, we present a new Trusted Mamba Fusion Network (TMFN), which\nachieves a trusted fusion of multi-view data through a selective mechanism.\nMoreover, we align the fused representation and the view-specific\nrepresentation using the Average-similarity Contrastive Learning (AsCL) module.\nAsCL increases the similarity of view representations from the same cluster, not\nmerely from the same sample. 
Extensive experiments show that the proposed\nmethod achieves state-of-the-art results in deep multi-view clustering tasks.\nThe source code is available at https://github.com/HackerHyper/TMCN.\n","authors":["Jian Zhu","Xin Zou","Lei Liu","Zhangmin Huang","Ying Zhang","Chang Tang","Li-Rong Dai"],"pdf_url":"https://arxiv.org/pdf/2412.16487v2.pdf","comment":"accepted by 2025 IEEE International Conference on Acoustics, Speech,\n and Signal Processing (ICASSP 2025)"},{"id":"http://arxiv.org/abs/2412.19139v2","updated":"2025-01-07T01:50:11Z","published":"2024-12-26T09:51:05Z","title":"PlanLLM: Video Procedure Planning with Refinable Large Language Models","summary":" Video procedure planning, i.e., planning a sequence of action steps given the\nvideo frames of start and goal states, is an essential ability for embodied AI.\nRecent works utilize Large Language Models (LLMs) to generate enriched action\nstep description texts to guide action step decoding. Although LLMs are\nintroduced, these methods decode the action steps into a closed set of one-hot\nvectors, limiting the model's capability of generalizing to new steps or tasks.\nAdditionally, fixed action step descriptions based on world-level commonsense\nmay contain noise in specific instances of visual states. In this paper, we\npropose PlanLLM, a cross-modal joint learning framework with LLMs for video\nprocedure planning. We propose an LLM-Enhanced Planning module which fully uses\nthe generalization ability of LLMs to produce free-form planning output and to\nenhance action step decoding. We also propose a Mutual Information Maximization\nmodule to connect world-level commonsense of step descriptions and\nsample-specific information of visual states, enabling LLMs to employ the\nreasoning ability to generate step sequences. With the assistance of LLMs, our\nmethod can handle both closed-set and open-vocabulary procedure planning tasks. 
Our\nPlanLLM achieves superior performance on three benchmarks, demonstrating the\neffectiveness of our designs.\n","authors":["Dejie Yang","Zijing Zhao","Yang Liu"],"pdf_url":"https://arxiv.org/pdf/2412.19139v2.pdf","comment":"accepted to AAAI2025"},{"id":"http://arxiv.org/abs/2501.03466v1","updated":"2025-01-07T01:47:57Z","published":"2025-01-07T01:47:57Z","title":"DGSSA: Domain generalization with structural and stylistic augmentation\n for retinal vessel segmentation","summary":" Retinal vascular morphology is crucial for diagnosing diseases such as\ndiabetes, glaucoma, and hypertension, making accurate segmentation of retinal\nvessels essential for early intervention. Traditional segmentation methods\nassume that training and testing data share similar distributions, which can\nlead to poor performance on unseen domains due to domain shifts caused by\nvariations in imaging devices and patient demographics. This paper presents a\nnovel approach, DGSSA, for retinal vessel image segmentation that enhances\nmodel generalization by combining structural and style augmentation strategies.\nWe utilize a space colonization algorithm to generate diverse vascular-like\nstructures that closely mimic actual retinal vessels, which are then used to\ngenerate pseudo-retinal images with an improved Pix2Pix model, allowing the\nsegmentation model to learn a broader range of structure distributions.\nAdditionally, we utilize PixMix to implement random photometric augmentations\nand introduce uncertainty perturbations, thereby enriching stylistic diversity\nand significantly enhancing the model's adaptability to varying imaging\nconditions. Our framework has been rigorously evaluated on four challenging\ndatasets (DRIVE, CHASEDB, HRF, and STARE), demonstrating state-of-the-art\nperformance that surpasses existing methods. 
This validates the effectiveness\nof our proposed approach, highlighting its potential for clinical application\nin automated retinal vessel analysis.\n","authors":["Bo Liu","Yudong Zhang","Shuihua Wang","Siyue Li","Jin Hong"],"pdf_url":"https://arxiv.org/pdf/2501.03466v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.22376v2","updated":"2025-01-07T01:41:13Z","published":"2024-10-29T07:43:39Z","title":"Rare-to-Frequent: Unlocking Compositional Generation Power of Diffusion\n Models on Rare Concepts with LLM Guidance","summary":" State-of-the-art text-to-image (T2I) diffusion models often struggle to\ngenerate rare compositions of concepts, e.g., objects with unusual attributes.\nIn this paper, we show that the compositional generation power of diffusion\nmodels on such rare concepts can be significantly enhanced by Large\nLanguage Model (LLM) guidance. We start with empirical and theoretical\nanalysis, demonstrating that exposing frequent concepts relevant to the target\nrare concepts during the diffusion sampling process yields more accurate\nconcept composition. Based on this, we propose a training-free approach, R2F,\nthat plans and executes the overall rare-to-frequent concept guidance\nthroughout the diffusion inference by leveraging the abundant semantic\nknowledge in LLMs. Our framework is flexible across any pre-trained diffusion\nmodels and LLMs, and can be seamlessly integrated with the region-guided\ndiffusion approaches. In extensive experiments on three datasets, including our\nnewly proposed benchmark RareBench, which contains various prompts with rare\ncompositions of concepts, R2F significantly surpasses existing models including\nSD3.0 and FLUX by up to 28.1%p in T2I alignment. 
Code is available at\nhttps://github.com/krafton-ai/Rare-to-Frequent.\n","authors":["Dongmin Park","Sebin Kim","Taehong Moon","Minkyu Kim","Kangwook Lee","Jaewoong Cho"],"pdf_url":"https://arxiv.org/pdf/2410.22376v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.09369v2","updated":"2025-01-07T01:23:54Z","published":"2024-08-18T05:47:33Z","title":"Flemme: A Flexible and Modular Learning Platform for Medical Images","summary":" With the rapid development of computer vision and the emergence of powerful\nnetwork backbones and architectures, the application of deep learning in\nmedical imaging has become increasingly significant. Unlike natural images,\nmedical images lack huge volumes of data but feature more modalities, making it\ndifficult to train a general model that has satisfactory performance across\nvarious datasets. In practice, practitioners often resort to manually\ncreating and testing models combining independent backbones and architectures,\nwhich is a laborious and time-consuming process. We propose Flemme, a FLExible\nand Modular learning platform for MEdical images. Our platform separates\nencoders from the model architectures so that different models can be\nconstructed via various combinations of supported encoders and architectures.\nWe construct encoders using building blocks based on convolution, transformer,\nand state-space model (SSM) to process both 2D and 3D image patches. A base\narchitecture is implemented following an encoder-decoder style, with several\nderived architectures for image segmentation, reconstruction, and generation\ntasks. 
In addition, we propose a general hierarchical architecture\nincorporating a pyramid loss to optimize and fuse vertical features.\nExperiments demonstrate that this simple design leads to an average improvement\nof 5.60% in Dice score and 7.81% in mean intersection over union (mIoU) for\nsegmentation models, as well as an enhancement of 5.57% in peak signal-to-noise\nratio (PSNR) and 8.22% in structural similarity (SSIM) for reconstruction\nmodels. We further utilize Flemme as an analytical tool to assess the\neffectiveness and efficiency of various encoders across different tasks. Code\nis available at https://github.com/wlsdzyzl/flemme.\n","authors":["Guoqing Zhang","Jingyun Yang","Yang Li"],"pdf_url":"https://arxiv.org/pdf/2408.09369v2.pdf","comment":"8 pages, 6 figures"},{"id":"http://arxiv.org/abs/2501.03458v1","updated":"2025-01-07T01:19:48Z","published":"2025-01-07T01:19:48Z","title":"Activating Associative Disease-Aware Vision Token Memory for LLM-Based\n X-ray Report Generation","summary":" X-ray image based medical report generation has achieved significant progress in\nrecent years with the help of large language models; however, these models\nhave not fully exploited the effective information in visual image regions,\nresulting in reports that are linguistically sound but insufficient in\ndescribing key diseases. In this paper, we propose a novel associative\nmemory-enhanced X-ray report generation model that effectively mimics the\nprocess of professional doctors writing medical reports. It considers both the\nmining of global and local visual information and associates historical report\ninformation to better complete the writing of the current report. Specifically,\ngiven an X-ray image, we first utilize a classification model along with its\nactivation maps to accomplish the mining of visual regions highly associated\nwith diseases and the learning of disease query tokens. 
Then, we employ a\nvisual Hopfield network to establish memory associations for disease-related\ntokens, and a report Hopfield network to retrieve report memory information.\nThis process facilitates the generation of high-quality reports based on a\nlarge language model and achieves state-of-the-art performance on multiple\nbenchmark datasets, including the IU X-ray, MIMIC-CXR, and Chexpert Plus. The\nsource code of this work is released on\n\\url{https://github.com/Event-AHU/Medical_Image_Analysis}.\n","authors":["Xiao Wang","Fuling Wang","Haowen Wang","Bo Jiang","Chuanfu Li","Yaowei Wang","Yonghong Tian","Jin Tang"],"pdf_url":"https://arxiv.org/pdf/2501.03458v1.pdf","comment":"In Peer Review"},{"id":"http://arxiv.org/abs/2501.04184v1","updated":"2025-01-07T23:32:05Z","published":"2025-01-07T23:32:05Z","title":"MedicalNarratives: Connecting Medical Vision and Language with Localized\n Narratives","summary":" We propose MedicalNarratives, a dataset curated from medical pedagogical\nvideos similar in nature to data collected in Think-Aloud studies and inspired\nby Localized Narratives, which collects grounded image-text data by curating\ninstructors' speech and mouse cursor movements synchronized in time.\nMedicalNarratives enables pretraining of both semantic and dense objectives,\nalleviating the need to train medical semantic and dense tasks disparately due\nto the lack of reasonably sized datasets. Our dataset contains 4.7M image-text\npairs from videos and articles, with 1M samples containing dense annotations in\nthe form of traces and bounding boxes. 
To evaluate the utility of\nMedicalNarratives, we train GenMedClip based on the CLIP architecture using our\ndataset spanning 12 medical domains and demonstrate that it outperforms\nprevious state-of-the-art models on a newly constructed medical imaging\nbenchmark that comprehensively evaluates performance across all modalities.\nData, demo, code and models available at https://medical-narratives.github.io\n","authors":["Wisdom O. Ikezogwo","Kevin Zhang","Mehmet Saygin Seyfioglu","Fatemeh Ghezloo","Linda Shapiro","Ranjay Krishna"],"pdf_url":"https://arxiv.org/pdf/2501.04184v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.02716v2","updated":"2025-01-07T23:12:33Z","published":"2024-07-02T23:48:43Z","title":"Light-weight Fine-tuning Method for Defending Adversarial Noise in\n Pre-trained Medical Vision-Language Models","summary":" Fine-tuning pre-trained Vision-Language Models (VLMs) has shown remarkable\ncapabilities in medical image and textual depiction synergy. Nevertheless, many\npre-training datasets are restricted by patient privacy concerns, potentially\ncontaining noise that can adversely affect downstream performance. Moreover,\nthe growing reliance on multi-modal generation exacerbates this issue because\nof its susceptibility to adversarial attacks. To investigate how VLMs trained\non adversarial noisy data perform on downstream medical tasks, we first craft\nnoisy upstream datasets using multi-modal adversarial attacks. Through our\ncomprehensive analysis, we unveil that moderate noise enhances model robustness\nand transferability, but increasing noise levels negatively impact downstream\ntask performance. 
To mitigate this issue, we propose the rectify adversarial noise\n(RAN) framework, a recipe designed to effectively defend against adversarial attacks\nand rectify the influence of upstream noise during fine-tuning.\n","authors":["Xu Han","Linghao Jin","Xuezhe Ma","Xiaofeng Liu"],"pdf_url":"https://arxiv.org/pdf/2407.02716v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.07320v2","updated":"2025-01-07T23:10:25Z","published":"2024-12-10T09:08:41Z","title":"CoMA: Compositional Human Motion Generation with Multi-modal Agents","summary":" 3D human motion generation has seen substantial advancement in recent years.\nWhile state-of-the-art approaches have improved performance significantly, they\nstill struggle with complex and detailed motions unseen in training data,\nlargely due to the scarcity of motion datasets and the prohibitive cost of\ngenerating new training examples. To address these challenges, we introduce\nCoMA, an agent-based solution for complex human motion generation, editing, and\ncomprehension. CoMA leverages multiple collaborative agents powered by large\nlanguage and vision models, alongside a mask transformer-based motion generator\nfeaturing body part-specific encoders and codebooks for fine-grained control.\nOur framework enables generation of both short and long motion sequences with\ndetailed instructions, text-guided motion editing, and self-correction for\nimproved quality. Evaluations on the HumanML3D dataset demonstrate competitive\nperformance against state-of-the-art methods. 
Additionally, we create a set of\ncontext-rich, compositional, and long text prompts, where user studies show our\nmethod significantly outperforms existing approaches.\n","authors":["Shanlin Sun","Gabriel De Araujo","Jiaqi Xu","Shenghan Zhou","Hanwen Zhang","Ziheng Huang","Chenyu You","Xiaohui Xie"],"pdf_url":"https://arxiv.org/pdf/2412.07320v2.pdf","comment":"Project Page: https://gabrie-l.github.io/coma-page/"},{"id":"http://arxiv.org/abs/2406.14847v2","updated":"2025-01-07T23:01:21Z","published":"2024-06-21T03:23:37Z","title":"Fair Text to Medical Image Diffusion Model with Subgroup Distribution\n Aligned Tuning","summary":" The text to medical image (T2MedI) with latent diffusion model has great\npotential to alleviate the scarcity of medical imaging data and explore the\nunderlying appearance distribution of lesions in a specific patient status\ndescription. However, as with text-to-natural-image models, we show that the\nT2MedI model can also be biased toward some subgroups and overlook the minority ones in\nthe training set. In this work, we first build a T2MedI model based on the\npre-trained Imagen model, which has the fixed contrastive language-image\npre-training (CLIP) text encoder, while its decoder has been fine-tuned on\nmedical images from the Radiology Objects in COntext (ROCO) dataset. Its gender\nbias is analyzed qualitatively and quantitatively. To address this issue, we\npropose to fine-tune the T2MedI toward the target application dataset to align\ntheir sensitive subgroup distribution probabilities. Specifically, the alignment\nloss for fine-tuning is guided by an off-the-shelf sensitivity-subgroup\nclassifier to match the classification probability between the generated images\nand the expected target dataset. In addition, the image quality is maintained\nby a CLIP-consistency regularization term following a knowledge distillation\nscheme. 
For evaluation, we set the target dataset to be enhanced as the BraTS18\ndataset, and trained a brain magnetic resonance (MR) slice-based gender\nclassifier from it. With our method, the generated MR images can markedly reduce\nthe inconsistency with the gender proportion in the BraTS18 dataset.\n","authors":["Xu Han","Fangfang Fan","Jingzhao Rong","Zhen Li","Georges El Fakhri","Qingyu Chen","Xiaofeng Liu"],"pdf_url":"https://arxiv.org/pdf/2406.14847v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2208.03819v2","updated":"2025-01-07T23:00:02Z","published":"2022-08-07T21:06:42Z","title":"Cross-Skeleton Interaction Graph Aggregation Network for Representation\n Learning of Mouse Social Behaviour","summary":" Automated social behaviour analysis of mice has become an increasingly\npopular research area in behavioural neuroscience. Recently, pose information\n(i.e., locations of keypoints or skeleton) has been used to interpret social\nbehaviours of mice. Nevertheless, effective encoding and decoding of social\ninteraction information underlying the keypoints of mice has been rarely\ninvestigated in the existing methods. In particular, it is challenging to model\ncomplex social interactions between mice due to highly deformable body shapes\nand ambiguous movement patterns. To deal with the interaction modelling\nproblem, we here propose a Cross-Skeleton Interaction Graph Aggregation Network\n(CS-IGANet) to learn abundant dynamics of freely interacting mice, where a\nCross-Skeleton Node-level Interaction module (CS-NLI) is used to model\nmulti-level interactions (i.e., intra-, inter- and cross-skeleton\ninteractions). Furthermore, we design a novel Interaction-Aware Transformer\n(IAT) to dynamically learn the graph-level representation of social behaviours\nand update the node-level representation, guided by our proposed\ninteraction-aware self-attention mechanism. 
Finally, to enhance the\nrepresentation ability of our model, an auxiliary self-supervised learning task\nis proposed for measuring the similarity between cross-skeleton nodes.\nExperimental results on the standard CRMI13-Skeleton and our PDMB-Skeleton\ndatasets show that our proposed model outperforms several other\nstate-of-the-art approaches.\n","authors":["Feixiang Zhou","Xinyu Yang","Fang Chen","Long Chen","Zheheng Jiang","Hui Zhu","Reiko Heckel","Haikuan Wang","Minrui Fei","Huiyu Zhou"],"pdf_url":"https://arxiv.org/pdf/2208.03819v2.pdf","comment":"Accepted to IEEE Transactions on Image Processing"},{"id":"http://arxiv.org/abs/2501.04172v1","updated":"2025-01-07T22:51:10Z","published":"2025-01-07T22:51:10Z","title":"Machine Learning for Identifying Grain Boundaries in Scanning Electron\n Microscopy (SEM) Images of Nanoparticle Superlattices","summary":" Nanoparticle superlattices consisting of ordered arrangements of\nnanoparticles exhibit unique optical, magnetic, and electronic properties\narising from nanoparticle characteristics as well as their collective\nbehaviors. Understanding how processing conditions influence the nanoscale\narrangement and microstructure is critical for engineering materials with\ndesired macroscopic properties. Microstructural features such as grain\nboundaries, lattice defects, and pores significantly affect these properties\nbut are challenging to quantify using traditional manual analyses as they are\nlabor-intensive and prone to errors. In this work, we present a machine\nlearning workflow for automating grain segmentation in scanning electron\nmicroscopy (SEM) images of nanoparticle superlattices. This workflow integrates\nsignal processing techniques, such as Radon transforms, with unsupervised\nlearning methods like agglomerative hierarchical clustering to identify and\nsegment grains without requiring manually annotated data. 
In the workflow, we\ntransform the raw pixel data into an explainable numerical representation of\nsuperlattice orientations for clustering. Benchmarking results demonstrate the\nworkflow's robustness against noisy images and edge cases, with a processing\nspeed of four images per minute on standard computational hardware. This\nefficiency makes the workflow scalable to large datasets and makes it a\nvaluable tool for integrating data-driven models into decision-making processes\nfor material design and analysis. For example, one can use this workflow to\nquantify grain size distributions at varying processing conditions like\ntemperature and pressure and, using that knowledge, adjust processing conditions\nto achieve desired superlattice orientations and grain sizes.\n","authors":["Aanish Paruchuri","Carl Thrasher","A. J. Hart","Robert Macfarlane","Arthi Jayaraman"],"pdf_url":"https://arxiv.org/pdf/2501.04172v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.13190v3","updated":"2025-01-07T22:06:07Z","published":"2024-12-17T18:59:33Z","title":"MotionBridge: Dynamic Video Inbetweening with Flexible Controls","summary":" By generating plausible and smooth transitions between two image frames,\nvideo inbetweening is an essential tool for video editing and long video\nsynthesis. Traditional works lack the capability to generate complex large\nmotions. While recent video generation techniques are powerful in creating\nhigh-quality results, they often lack fine control over the details of\nintermediate frames, which can lead to results that do not align with the\ncreative intent. We introduce MotionBridge, a unified video inbetweening\nframework that allows flexible controls, including trajectory strokes,\nkeyframes, masks, guide pixels, and text. However, learning such multi-modal\ncontrols in a unified framework is a challenging task. 
We thus design two\ngenerators to extract the control signal faithfully and encode feature through\ndual-branch embedders to resolve ambiguities. We further introduce a curriculum\ntraining strategy to smoothly learn various controls. Extensive qualitative and\nquantitative experiments have demonstrated that such multi-modal controls\nenable a more dynamic, customizable, and contextually accurate visual\nnarrative.\n","authors":["Maham Tanveer","Yang Zhou","Simon Niklaus","Ali Mahdavi Amiri","Hao Zhang","Krishna Kumar Singh","Nanxuan Zhao"],"pdf_url":"https://arxiv.org/pdf/2412.13190v3.pdf","comment":"Project website: [https://motionbridge.github.io/]"},{"id":"http://arxiv.org/abs/2501.04155v1","updated":"2025-01-07T21:55:56Z","published":"2025-01-07T21:55:56Z","title":"MM-GEN: Enhancing Task Performance Through Targeted Multimodal Data\n Curation","summary":" Vision-language models (VLMs) are highly effective but often underperform on\nspecialized tasks; for example, Llava-1.5 struggles with chart and diagram\nunderstanding due to scarce task-specific training data. Existing training\ndata, sourced from general-purpose datasets, fails to capture the nuanced\ndetails needed for these tasks. We introduce MM-Gen, a scalable method that\ngenerates task-specific, high-quality synthetic text for candidate images by\nleveraging stronger models. MM-Gen employs a three-stage targeted process:\npartitioning data into subgroups, generating targeted text based on task\ndescriptions, and filtering out redundant and outlier data. Fine-tuning VLMs\nwith data generated by MM-Gen leads to significant performance gains, including\n29% on spatial reasoning and 15% on diagram understanding for Llava-1.5 (7B).\nCompared to human-curated caption data, MM-Gen achieves up to 1.6x better\nimprovements for the original models, proving its effectiveness in enhancing\ntask-specific VLM performance and bridging the gap between general-purpose\ndatasets and specialized requirements. 
Code available at\nhttps://github.com/sjoshi804/MM-Gen.\n","authors":["Siddharth Joshi","Besmira Nushi","Vidhisha Balachandran","Varun Chandrasekaran","Vibhav Vineet","Neel Joshi","Baharan Mirzasoleiman"],"pdf_url":"https://arxiv.org/pdf/2501.04155v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.01054v3","updated":"2025-01-07T21:53:44Z","published":"2024-02-01T22:58:21Z","title":"Unconditional Latent Diffusion Models Memorize Patient Imaging Data:\n Implications for Openly Sharing Synthetic Data","summary":" AI models present a wide range of applications in the field of medicine.\nHowever, achieving optimal performance requires access to extensive healthcare\ndata, which is often not readily available. Furthermore, the imperative to\npreserve patient privacy restricts patient data sharing with third parties and\neven within institutes. Recently, generative AI models have been gaining\ntraction for facilitating open-data sharing by proposing synthetic data as\nsurrogates of real patient data. Despite the promise, some of these models are\nsusceptible to patient data memorization, where models generate patient data\ncopies instead of novel synthetic samples. Considering the importance of the\nproblem, surprisingly it has received relatively little attention in the\nmedical imaging community. To this end, we assess memorization in unconditional\nlatent diffusion models. We train latent diffusion models on CT, MR, and X-ray\ndatasets for synthetic data generation. We then detect the amount of training\ndata memorized utilizing our novel self-supervised copy detection approach and\nfurther investigate various factors that can influence memorization. Our\nfindings show a surprisingly high degree of patient data memorization across\nall datasets. 
Comparison with non-diffusion generative models, such as\nautoencoders and generative adversarial networks, indicates that while latent\ndiffusion models are more susceptible to memorization, overall they outperform\nnon-diffusion models in synthesis quality. Further analyses reveal that using\naugmentation strategies, smaller architectures, and larger datasets can reduce\nmemorization, while over-training the models can increase it. Collectively, our\nresults emphasize the importance of carefully training generative models on\nprivate medical imaging datasets, and examining the synthetic data to ensure\npatient privacy before sharing it for medical research and applications.\n","authors":["Salman Ul Hassan Dar","Marvin Seyfarth","Isabelle Ayx","Theano Papavassiliu","Stefan O. Schoenberg","Robert Malte Siepmann","Fabian Christopher Laqua","Jannik Kahmann","Norbert Frey","Bettina Baeßler","Sebastian Foersch","Daniel Truhn","Jakob Nikolas Kather","Sandy Engelhardt"],"pdf_url":"https://arxiv.org/pdf/2402.01054v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.18038v2","updated":"2025-01-07T21:41:28Z","published":"2024-03-26T18:49:56Z","title":"TGGLinesPlus: A robust topological graph-guided computer vision\n algorithm for line detection from images","summary":" Line detection is a classic and essential problem in image processing,\ncomputer vision and machine intelligence. Line detection has many important\napplications, including image vectorization (e.g., document recognition and art\ndesign), indoor mapping, and important societal challenges (e.g., sea ice\nfracture line extraction from satellite imagery). Many line detection\nalgorithms and methods have been developed, but robust and intuitive methods\nare still lacking. In this paper, we proposed and implemented a topological\ngraph-guided algorithm, named TGGLinesPlus, for line detection. Our experiments\non images from a wide range of domains have demonstrated the flexibility of our\nTGGLinesPlus algorithm.
We benchmarked our algorithm with five classic and\nstate-of-the-art line detection methods and evaluated the benchmark results\nqualitatively and quantitatively; the results demonstrate the robustness of\nTGGLinesPlus.\n","authors":["Liping Yang","Joshua Driscol","Ming Gong","Katie Slack","Wenbin Zhang","Shujie Wang","Catherine G. Potts"],"pdf_url":"https://arxiv.org/pdf/2403.18038v2.pdf","comment":"Our TGGLinesPlus Python implementation is open-sourced. 29 pages, 8\n figures and 4 tables"},{"id":"http://arxiv.org/abs/2501.04144v1","updated":"2025-01-07T21:14:11Z","published":"2025-01-07T21:14:11Z","title":"Chirpy3D: Continuous Part Latents for Creative 3D Bird Generation","summary":" In this paper, we push the boundaries of fine-grained 3D generation into\ntruly creative territory. Current methods either lack intricate details or\nsimply mimic existing objects -- we enable both. By lifting 2D fine-grained\nunderstanding into 3D through multi-view diffusion and modeling part latents as\ncontinuous distributions, we unlock the ability to generate entirely new, yet\nplausible parts through interpolation and sampling. A self-supervised feature\nconsistency loss further ensures stable generation of these unseen parts. The\nresult is the first system capable of creating novel 3D objects with\nspecies-specific details that transcend existing examples. While we demonstrate\nour approach on birds, the underlying framework extends beyond things that can\nchirp!
Code will be released at https://github.com/kamwoh/chirpy3d.\n","authors":["Kam Woh Ng","Jing Yang","Jia Wei Sii","Jiankang Deng","Chee Seng Chan","Yi-Zhe Song","Tao Xiang","Xiatian Zhu"],"pdf_url":"https://arxiv.org/pdf/2501.04144v1.pdf","comment":"20 pages"},{"id":"http://arxiv.org/abs/2308.05764v2","updated":"2025-01-07T20:50:51Z","published":"2023-08-09T10:05:11Z","title":"Unlocking the diagnostic potential of electrocardiograms through\n information transfer from cardiac magnetic resonance imaging","summary":" Cardiovascular diseases (CVD) can be diagnosed using various diagnostic\nmodalities. The electrocardiogram (ECG) is a cost-effective and widely\navailable diagnostic aid that provides functional information of the heart.\nHowever, its ability to classify and spatially localise CVD is limited. In\ncontrast, cardiac magnetic resonance (CMR) imaging provides detailed structural\ninformation of the heart and thus enables evidence-based diagnosis of CVD, but\nlong scan times and high costs limit its use in clinical routine. In this work,\nwe present a deep learning strategy for cost-effective and comprehensive\ncardiac screening solely from ECG. Our approach combines multimodal contrastive\nlearning with masked data modelling to transfer domain-specific information\nfrom CMR imaging to ECG representations. In extensive experiments using data\nfrom 40,044 UK Biobank subjects, we demonstrate the utility and\ngeneralisability of our method for subject-specific risk prediction of CVD and\nthe prediction of cardiac phenotypes using only ECG data. Specifically, our\nnovel multimodal pre-training paradigm improves performance by up to 12.19 %\nfor risk prediction and 27.59 % for phenotype prediction. In a qualitative\nanalysis, we demonstrate that our learned ECG representations incorporate\ninformation from CMR image regions of interest. 
Our entire pipeline is publicly\navailable at https://github.com/oetu/MMCL-ECG-CMR.\n","authors":["Özgün Turgut","Philip Müller","Paul Hager","Suprosanna Shit","Sophie Starck","Martin J. Menten","Eimo Martens","Daniel Rueckert"],"pdf_url":"https://arxiv.org/pdf/2308.05764v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2108.07554v3","updated":"2025-01-07T20:29:48Z","published":"2021-08-17T10:39:50Z","title":"KCNet: An Insect-Inspired Single-Hidden-Layer Neural Network with\n Randomized Binary Weights for Prediction and Classification Tasks","summary":" Fruit flies are established model systems for studying olfactory learning as\nthey will readily learn to associate odors with both electric shock or sugar\nrewards. The mechanisms of the insect brain apparently responsible for odor\nlearning form a relatively shallow neuronal architecture. Olfactory inputs are\nreceived by the antennal lobe (AL) of the brain, which produces an encoding of\neach odor mixture across ~50 sub-units known as glomeruli. Each of these\nglomeruli then projects its component of this feature vector to several of\n~2000 so-called Kenyon Cells (KCs) in a region of the brain known as the\nmushroom body (MB). Fly responses to odors are generated by small downstream\nneutrophils that decode the higher-order representation from the MB. Research\nhas shown that there is no recognizable pattern in the glomeruli--KC\nconnections (and thus the particular higher-order representations); they are\nakin to fingerprints--even isogenic flies have different projections.\nLeveraging insights from this architecture, we propose KCNet, a\nsingle-hidden-layer neural network that contains sparse, randomized, binary\nweights between the input layer and the hidden layer and analytically learned\nweights between the hidden layer and the output layer. 
Furthermore, we also\npropose a dynamic optimization algorithm that enables the KCNet to increase\nperformance beyond its structural limits by searching for a more efficient set\nof inputs. For odorant-perception tasks that predict the perceptual properties\nof an odorant, we show that KCNet outperforms existing data-driven approaches,\nsuch as XGBoost. For image classification tasks, KCNet achieves reasonable\nperformance on benchmark datasets (MNIST, Fashion-MNIST, and EMNIST) without\nany data-augmentation methods or convolutional layers and shows a particularly\nfast running time.\n","authors":["Jinyung Hong","Theodore P. Pavlic"],"pdf_url":"https://arxiv.org/pdf/2108.07554v3.pdf","comment":"24 pages, 46 figures, 3 tables; The GitHub repo link was updated"},{"id":"http://arxiv.org/abs/2412.05781v3","updated":"2025-01-07T20:27:09Z","published":"2024-12-08T02:27:17Z","title":"Open-Source Acceleration of Stable-Diffusion.cpp Deployable on All\n Devices","summary":" Stable diffusion plays a crucial role in generating high-quality images.\nHowever, image generation is time-consuming and memory-intensive. To address\nthis, stable-diffusion.cpp (Sdcpp) emerges as an efficient inference framework\nto accelerate the diffusion models. Although it is lightweight, the current\nimplementation of ggml_conv_2d operator in Sdcpp is suboptimal, exhibiting both\nhigh inference latency and massive memory usage. To address this, in this work,\nwe present an optimized version of Sdcpp leveraging the Winograd algorithm to\naccelerate 2D convolution operations, which is the primary bottleneck in the\npipeline. By analyzing both dependent and independent computation graphs, we\nexploit the device's locality and parallelism to achieve substantial\nperformance improvements. Our framework delivers correct end-to-end results\nacross various stable diffusion models, including SDv1.4, v1.5, v2.1, SDXL, and\nSDXL-Turbo. 
Our evaluation results demonstrate a speedup up to 2.76x for\nindividual convolutional layers and an inference speedup up to 4.79x for the\noverall image generation process, compared with the original Sdcpp on M1 pro.\nHomepage: https://github.com/SealAILab/stable-diffusion-cpp\n","authors":["Jingxu Ng","Cheng Lv","Pu Zhao","Wei Niu","Juyi Lin","Minzhou Pan","Yun Liang","Yanzhi Wang"],"pdf_url":"https://arxiv.org/pdf/2412.05781v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04121v1","updated":"2025-01-07T20:02:55Z","published":"2025-01-07T20:02:55Z","title":"Graph-Based Multimodal and Multi-view Alignment for Keystep Recognition","summary":" Egocentric videos capture scenes from a wearer's viewpoint, resulting in\ndynamic backgrounds, frequent motion, and occlusions, posing challenges to\naccurate keystep recognition. We propose a flexible graph-learning framework\nfor fine-grained keystep recognition that is able to effectively leverage\nlong-term dependencies in egocentric videos, and leverage alignment between\negocentric and exocentric videos during training for improved inference on\negocentric videos. Our approach consists of constructing a graph where each\nvideo clip of the egocentric video corresponds to a node. During training, we\nconsider each clip of each exocentric video (if available) as additional nodes.\nWe examine several strategies to define connections across these nodes and pose\nkeystep recognition as a node classification task on the constructed graphs. We\nperform extensive experiments on the Ego-Exo4D dataset and show that our\nproposed flexible graph-based framework notably outperforms existing methods by\nmore than 12 points in accuracy. Furthermore, the constructed graphs are sparse\nand compute efficient. 
We also present a study on harnessing several\nmultimodal features, including narrations, depth, and object class labels, on a\nheterogeneous graph and discuss their corresponding contribution to the keystep\nrecognition performance.\n","authors":["Julia Lee Romero","Kyle Min","Subarna Tripathi","Morteza Karimzadeh"],"pdf_url":"https://arxiv.org/pdf/2501.04121v1.pdf","comment":"9 pages, 6 figures"},{"id":"http://arxiv.org/abs/2409.06267v2","updated":"2025-01-07T19:53:41Z","published":"2024-09-10T07:12:18Z","title":"Mahalanobis k-NN: A Statistical Lens for Robust Point-Cloud\n Registrations","summary":" In this paper, we discuss Mahalanobis k-NN: A Statistical Lens designed to\naddress the challenges of feature matching in learning-based point cloud\nregistration when confronted with an arbitrary density of point clouds. We\ntackle this by adopting Mahalanobis k-NN's inherent property to capture the\ndistribution of the local neighborhood and surficial geometry. Our method can\nbe seamlessly integrated into any local-graph-based point cloud analysis\nmethod. In this paper, we focus on two distinct methodologies: Deep Closest\nPoint (DCP) and Deep Universal Manifold Embedding (DeepUME). Our extensive\nbenchmarking on the ModelNet40 and FAUST datasets highlights the efficacy of\nthe proposed method in point cloud registration tasks. Moreover, we establish\nfor the first time that the features acquired through point cloud registration\ninherently can possess discriminative capabilities.
This is evident by a\nsubstantial improvement of about 20% in the average accuracy observed in the\npoint cloud few-shot classification task, benchmarked on ModelNet40 and\nScanObjectNN.\n","authors":["Tejas Anvekar","Shivanand Venkanna Sheshappanavar"],"pdf_url":"https://arxiv.org/pdf/2409.06267v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04074v1","updated":"2025-01-07T18:59:53Z","published":"2025-01-07T18:59:53Z","title":"NeRFs are Mirror Detectors: Using Structural Similarity for Multi-View\n Mirror Scene Reconstruction with 3D Surface Primitives","summary":" While neural radiance fields (NeRF) led to a breakthrough in photorealistic\nnovel view synthesis, handling mirroring surfaces still denotes a particular\nchallenge as they introduce severe inconsistencies in the scene representation.\nPrevious attempts either focus on reconstructing single reflective objects or\nrely on strong supervision guidance in terms of additional user-provided\nannotations of visible image regions of the mirrors, thereby limiting the\npractical usability. In contrast, in this paper, we present NeRF-MD, a method\nwhich shows that NeRFs can be considered as mirror detectors and which is\ncapable of reconstructing neural radiance fields of scenes containing mirroring\nsurfaces without the need for prior annotations. To this end, we first compute\nan initial estimate of the scene geometry by training a standard NeRF using a\ndepth reprojection loss. Our key insight lies in the fact that parts of the\nscene corresponding to a mirroring surface will still exhibit a significant\nphotometric inconsistency, whereas the remaining parts are already\nreconstructed in a plausible manner. This allows us to detect mirror surfaces\nby fitting geometric primitives to such inconsistent regions in this initial\nstage of the training. Using this information, we then jointly optimize the\nradiance field and mirror geometry in a second training stage to refine their\nquality. 
We demonstrate the capability of our method to allow the faithful\ndetection of mirrors in the scene as well as the reconstruction of a single\nconsistent scene representation, and demonstrate its potential in comparison to\nbaseline and mirror-aware approaches.\n","authors":["Leif Van Holland","Michael Weinmann","Jan U. Müller","Patrick Stotko","Reinhard Klein"],"pdf_url":"https://arxiv.org/pdf/2501.04074v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04073v1","updated":"2025-01-07T18:53:14Z","published":"2025-01-07T18:53:14Z","title":"Deep Learning for Ophthalmology: The State-of-the-Art and Future Trends","summary":" The emergence of artificial intelligence (AI), particularly deep learning\n(DL), has marked a new era in the realm of ophthalmology, offering\ntransformative potential for the diagnosis and treatment of posterior segment\neye diseases. This review explores the cutting-edge applications of DL across a\nrange of ocular conditions, including diabetic retinopathy, glaucoma,\nage-related macular degeneration, and retinal vessel segmentation. We provide a\ncomprehensive overview of foundational ML techniques and advanced DL\narchitectures, such as CNNs, attention mechanisms, and transformer-based\nmodels, highlighting the evolving role of AI in enhancing diagnostic accuracy,\noptimizing treatment strategies, and improving overall patient care.\nAdditionally, we present key challenges in integrating AI solutions into\nclinical practice, including ensuring data diversity, improving algorithm\ntransparency, and effectively leveraging multimodal data. This review\nemphasizes AI's potential to improve disease diagnosis and enhance patient care\nwhile stressing the importance of collaborative efforts to overcome these\nbarriers and fully harness AI's impact in advancing eye care.\n","authors":["Duy M. H. 
Nguyen","Hasan Md Tusfiqur Alam","Tai Nguyen","Devansh Srivastav","Hans-Juergen Profitlich","Ngan Le","Daniel Sonntag"],"pdf_url":"https://arxiv.org/pdf/2501.04073v1.pdf","comment":"First version"},{"id":"http://arxiv.org/abs/2501.04735v1","updated":"2025-01-07T19:57:15Z","published":"2025-01-07T19:57:15Z","title":"Topology-based deep-learning segmentation method for deep anterior\n lamellar keratoplasty (DALK) surgical guidance using M-mode OCT data","summary":" Deep Anterior Lamellar Keratoplasty (DALK) is a partial-thickness corneal\ntransplant procedure used to treat corneal stromal diseases. A crucial step in\nthis procedure is the precise separation of the deep stroma from Descemet's\nmembrane (DM) using the Big Bubble technique. To simplify the tasks of needle\ninsertion and pneumo-dissection in this technique, we previously developed an\nOptical Coherence Tomography (OCT)-guided, eye-mountable robot that uses\nreal-time tracking of corneal layers from M-mode OCT signals for control.\nHowever, signal noise and instability during manipulation of the OCT fiber\nsensor-integrated needle have hindered the performance of conventional\ndeep-learning segmentation methods, resulting in rough and inaccurate detection\nof corneal layers. To address these challenges, we have developed a\ntopology-based deep-learning segmentation method that integrates a topological\nloss function with a modified network architecture. This approach effectively\nreduces the effects of noise and improves segmentation speed, precision, and\nstability. Validation using in vivo, ex vivo, and hybrid rabbit eye datasets\ndemonstrates that our method outperforms traditional loss-based techniques,\nproviding fast, accurate, and robust segmentation of the epithelium and DM to\nguide surgery.\n","authors":["J. Yu","H. Yi","Y. Wang","J. D. Opfermann","W. G. Gensheimer","A. Krieger","J. U. 
Kang"],"pdf_url":"https://arxiv.org/pdf/2501.04735v1.pdf","comment":null}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2501.03995v1","updated":"2025-01-07T18:52:05Z","published":"2025-01-07T18:52:05Z","title":"RAG-Check: Evaluating Multimodal Retrieval Augmented Generation\n Performance","summary":" Retrieval-augmented generation (RAG) improves large language models (LLMs) by\nusing external knowledge to guide response generation, reducing hallucinations.\nHowever, RAG, particularly multi-modal RAG, can introduce new hallucination\nsources: (i) the retrieval process may select irrelevant pieces (e.g.,\ndocuments, images) as raw context from the database, and (ii) retrieved images\nare processed into text-based context via vision-language models (VLMs) or\ndirectly used by multi-modal language models (MLLMs) like GPT-4o, which may\nhallucinate. To address this, we propose a novel framework to evaluate the\nreliability of multi-modal RAG using two performance measures: (i) the\nrelevancy score (RS), assessing the relevance of retrieved entries to the\nquery, and (ii) the correctness score (CS), evaluating the accuracy of the\ngenerated response. We train RS and CS models using a ChatGPT-derived database\nand human evaluator samples. Results show that both models achieve ~88%\naccuracy on test data. Additionally, we construct a 5000-sample human-annotated\ndatabase evaluating the relevancy of retrieved pieces and the correctness of\nresponse statements. Our RS model aligns with human preferences 20% more often\nthan CLIP in retrieval, and our CS model matches human preferences ~91% of the\ntime. Finally, we assess various RAG systems' selection and generation\nperformances using RS and CS.\n","authors":["Matin Mortaheb","Mohammad A. Amir Khojastepour","Srimat T. 
Chakradhar","Sennur Ulukus"],"pdf_url":"https://arxiv.org/pdf/2501.03995v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03989v1","updated":"2025-01-07T18:46:34Z","published":"2025-01-07T18:46:34Z","title":"(De)-Indexing and the Right to be Forgotten","summary":" In the digital age, the challenge of forgetfulness has emerged as a\nsignificant concern, particularly regarding the management of personal data and\nits accessibility online. The right to be forgotten (RTBF) allows individuals\nto request the removal of outdated or harmful information from public access,\nyet implementing this right poses substantial technical difficulties for search\nengines. This paper aims to introduce non-experts to the foundational concepts\nof information retrieval (IR) and de-indexing, which are critical for\nunderstanding how search engines can effectively \"forget\" certain content. We\nwill explore various IR models, including boolean, probabilistic, vector space,\nand embedding-based approaches, as well as the role of Large Language Models\n(LLMs) in enhancing data processing capabilities. By providing this overview,\nwe seek to highlight the complexities involved in balancing individual privacy\nrights with the operational challenges faced by search engines in managing\ninformation visibility.\n","authors":["Salvatore Vilella","Giancarlo Ruffo"],"pdf_url":"https://arxiv.org/pdf/2501.03989v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03930v1","updated":"2025-01-07T16:48:21Z","published":"2025-01-07T16:48:21Z","title":"Towards Reliable Testing for Multiple Information Retrieval System\n Comparisons","summary":" Null Hypothesis Significance Testing is the \\textit{de facto} tool for\nassessing effectiveness differences between Information Retrieval systems.\nResearchers use statistical tests to check whether those differences will\ngeneralise to online settings or are just due to the samples observed in the\nlaboratory. 
Much work has been devoted to studying which test is the most\nreliable when comparing a pair of systems, but most of the IR real-world\nexperiments involve more than two. In the multiple comparisons scenario,\ntesting several systems simultaneously may inflate the errors committed by the\ntests. In this paper, we use a new approach to assess the reliability of\nmultiple comparison procedures using simulated and real TREC data. Experiments\nshow that Wilcoxon plus the Benjamini-Hochberg correction yields Type I error\nrates according to the significance level for typical sample sizes while being\nthe best test in terms of statistical power.\n","authors":["David Otero","Javier Parapar","Álvaro Barreiro"],"pdf_url":"https://arxiv.org/pdf/2501.03930v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03904v1","updated":"2025-01-07T16:18:55Z","published":"2025-01-07T16:18:55Z","title":"Exploring the Potential of Large Language Models in Public\n Transportation: San Antonio Case Study","summary":" The integration of large language models (LLMs) into public transit systems\npresents a transformative opportunity to enhance urban mobility. This study\nexplores the potential of LLMs to revolutionize public transportation\nmanagement within the context of San Antonio's transit system. Leveraging the\ncapabilities of LLMs in natural language processing and data analysis, we\ninvestigate their capabilities to optimize route planning, reduce wait times,\nand provide personalized travel assistance. By utilizing the General Transit\nFeed Specification (GTFS) and other relevant data, this research aims to\ndemonstrate how LLMs can potentially improve resource allocation, elevate\npassenger satisfaction, and inform data-driven decision-making in transit\noperations. A comparative analysis of different ChatGPT models was conducted to\nassess their ability to understand transportation information, retrieve\nrelevant data, and provide comprehensive responses. 
Findings from this study\nsuggest that while LLMs hold immense promise for public transit, careful\nengineering and fine-tuning are essential to realizing their full potential.\nSan Antonio serves as a case study to inform the development of LLM-powered\ntransit systems in other urban environments.\n","authors":["Ramya Jonnala","Gongbo Liang","Jeong Yang","Izzat Alsmadi"],"pdf_url":"https://arxiv.org/pdf/2501.03904v1.pdf","comment":"This work is accepted to AAAI 2025 Workshop on AI for Urban Planning.\n arXiv admin note: substantial text overlap with arXiv:2407.11003"},{"id":"http://arxiv.org/abs/2501.03843v1","updated":"2025-01-07T14:53:35Z","published":"2025-01-07T14:53:35Z","title":"BERTopic for Topic Modeling of Hindi Short Texts: A Comparative Study","summary":" As short text data in native languages like Hindi increasingly appear in\nmodern media, robust methods for topic modeling on such data have gained\nimportance. This study investigates the performance of BERTopic in modeling\nHindi short texts, an area that has been under-explored in existing research.\nUsing contextual embeddings, BERTopic can capture semantic relationships in\ndata, making it potentially more effective than traditional models, especially\nfor short and diverse texts. 
We evaluate BERTopic using 6 different document\nembedding models and compare its performance against 8 established topic\nmodeling techniques, such as Latent Dirichlet Allocation (LDA), Non-negative\nMatrix Factorization (NMF), Latent Semantic Indexing (LSI), Additive\nRegularization of Topic Models (ARTM), Probabilistic Latent Semantic Analysis\n(PLSA), Embedded Topic Model (ETM), Combined Topic Model (CTM), and Top2Vec.\nThe models are assessed using coherence scores across a range of topic counts.\nOur results reveal that BERTopic consistently outperforms other models in\ncapturing coherent topics from short Hindi texts.\n","authors":["Atharva Mutsaddi","Anvi Jamkhande","Aryan Thakre","Yashodhara Haribhakta"],"pdf_url":"https://arxiv.org/pdf/2501.03843v1.pdf","comment":"Accepted into IndoNLP: The First Workshop on Natural Language\n Processing for Indo-Aryan and Dravidian Languages, collocated with COLING\n 2025. Set to appear in the workshop proceedings published in ACL Anthology"},{"id":"http://arxiv.org/abs/2501.03835v1","updated":"2025-01-07T14:45:30Z","published":"2025-01-07T14:45:30Z","title":"TACLR: A Scalable and Efficient Retrieval-based Method for Industrial\n Product Attribute Value Identification","summary":" Product Attribute Value Identification (PAVI) involves identifying attribute\nvalues from product profiles, a key task for improving product search,\nrecommendations, and business analytics on e-commerce platforms. However,\nexisting PAVI methods face critical challenges, such as inferring implicit\nvalues, handling out-of-distribution (OOD) values, and producing normalized\noutputs. To address these limitations, we introduce Taxonomy-Aware Contrastive\nLearning Retrieval (TACLR), the first retrieval-based method for PAVI. TACLR\nformulates PAVI as an information retrieval task by encoding product profiles\nand candidate values into embeddings and retrieving values based on their\nsimilarity to the item embedding. 
It leverages contrastive training with\ntaxonomy-aware hard negative sampling and employs adaptive inference with\ndynamic thresholds. TACLR offers three key advantages: (1) it effectively\nhandles implicit and OOD values while producing normalized outputs; (2) it\nscales to thousands of categories, tens of thousands of attributes, and\nmillions of values; and (3) it supports efficient inference for high-load\nindustrial scenarios. Extensive experiments on proprietary and public datasets\nvalidate the effectiveness and efficiency of TACLR. Moreover, it has been\nsuccessfully deployed in a real-world e-commerce platform, processing millions\nof product listings daily while supporting dynamic, large-scale attribute\ntaxonomies.\n","authors":["Yindu Su","Huike Zou","Lin Sun","Ting Zhang","Haiyang Yang","Liyu Chen","David Lo","Qingheng Zhang","Shuguang Han","Jufeng Chen"],"pdf_url":"https://arxiv.org/pdf/2501.03835v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03811v1","updated":"2025-01-07T14:24:49Z","published":"2025-01-07T14:24:49Z","title":"Extending ChatGPT with a Browserless System for Web Product Price\n Extraction","summary":" With the advent of ChatGPT, we can find very clean, precise answers to a\nwide variety of questions. However, for questions such as 'find the price of\nthe lemon cake at zingerman's', the answer looks like 'I can't browse the web\nright now'. In this paper, we propose a system, called Wextractor, which\nextends ChatGPT to answer questions such as the one mentioned before. Obviously, our\nsystem cannot be labeled as 'artificial intelligence'. It simply offers to\ncover a kind of transactional search that is not included in the current\nversion of ChatGPT.
Moreover, Wextractor includes two improvements with respect\nto the initial version: social extraction and pointing pattern extraction to\nimprove the answer speed.\n","authors":["Jorge Lloret-Gazo"],"pdf_url":"https://arxiv.org/pdf/2501.03811v1.pdf","comment":"14 pages, 4 figures"},{"id":"http://arxiv.org/abs/2501.03769v1","updated":"2025-01-07T13:22:35Z","published":"2025-01-07T13:22:35Z","title":"Multi-label Cross-lingual automatic music genre classification from\n lyrics with Sentence BERT","summary":" Music genres are shaped by both the stylistic features of songs and the\ncultural preferences of artists' audiences. Automatic classification of music\ngenres using lyrics can be useful in several applications such as\nrecommendation systems, playlist creation, and library organization. We present\na multi-label, cross-lingual genre classification system based on multilingual\nsentence embeddings generated by sBERT. Using a bilingual Portuguese-English\ndataset with eight overlapping genres, we demonstrate the system's ability to\ntrain on lyrics in one language and predict genres in another. Our approach\noutperforms the baseline approach of translating lyrics and using a\nbag-of-words representation, improving the genrewise average F1-Score from 0.35\nto 0.69. The classifier uses a one-vs-all architecture, enabling it to assign\nmultiple genre labels to a single lyric. Experimental results reveal that\ndataset centralization notably improves cross-lingual performance. 
This\napproach offers a scalable solution for genre classification across\nunderrepresented languages and cultural domains, advancing the capabilities of\nmusic information retrieval systems.\n","authors":["Tiago Fernandes Tavares","Fabio José Ayres"],"pdf_url":"https://arxiv.org/pdf/2501.03769v1.pdf","comment":"5 pages"},{"id":"http://arxiv.org/abs/2409.06096v4","updated":"2025-01-07T10:45:58Z","published":"2024-09-09T22:16:48Z","title":"Latent Diffusion Bridges for Unsupervised Musical Audio Timbre Transfer","summary":" Music timbre transfer is a challenging task that involves modifying the\ntimbral characteristics of an audio signal while preserving its melodic\nstructure. In this paper, we propose a novel method based on dual diffusion\nbridges, trained using the CocoChorales Dataset, which consists of unpaired\nmonophonic single-instrument audio data. Each diffusion model is trained on a\nspecific instrument with a Gaussian prior. During inference, a model is\ndesignated as the source model to map the input audio to its corresponding\nGaussian prior, and another model is designated as the target model to\nreconstruct the target audio from this Gaussian prior, thereby facilitating\ntimbre transfer. We compare our approach against existing unsupervised timbre\ntransfer models such as VAEGAN and Gaussian Flow Bridges (GFB). Experimental\nresults demonstrate that our method achieves both better Fr\\'echet Audio\nDistance (FAD) and melody preservation, as reflected by lower pitch distances\n(DPD) compared to VAEGAN and GFB. Additionally, we discover that the noise\nlevel from the Gaussian prior, $\\sigma$, can be adjusted to control the degree\nof melody preservation and amount of timbre transferred.\n","authors":["Michele Mancusi","Yurii Halychanskyi","Kin Wai Cheuk","Eloi Moliner","Chieh-Hsin Lai","Stefan Uhlich","Junghyun Koo","Marco A. 
Martínez-Ramírez","Wei-Hsiang Liao","Giorgio Fabbro","Yuki Mitsufuji"],"pdf_url":"https://arxiv.org/pdf/2409.06096v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03598v1","updated":"2025-01-07T07:55:35Z","published":"2025-01-07T07:55:35Z","title":"RecKG: Knowledge Graph for Recommender Systems","summary":" Knowledge graphs have proven successful in integrating heterogeneous data\nacross various domains. However, there remains a noticeable dearth of research\non their seamless integration among heterogeneous recommender systems, despite\nknowledge graph-based recommender systems garnering extensive research\nattention. This study aims to fill this gap by proposing RecKG, a standardized\nknowledge graph for recommender systems. RecKG ensures the consistent\nrepresentation of entities across different datasets, accommodating diverse\nattribute types for effective data integration. Through a meticulous\nexamination of various recommender system datasets, we select attributes for\nRecKG, ensuring standardized formatting through consistent naming conventions.\nBy these characteristics, RecKG can seamlessly integrate heterogeneous data\nsources, enabling the discovery of additional semantic information within the\nintegrated knowledge graph. 
We apply RecKG to standardize real-world datasets,\nsubsequently developing an application for RecKG using a graph database.\nFinally, we validate RecKG's achievement in interoperability through a\nqualitative evaluation between RecKG and other studies.\n","authors":["Junhyuk Kwon","Seokho Ahn","Young-Duk Seo"],"pdf_url":"https://arxiv.org/pdf/2501.03598v1.pdf","comment":"Accepted by The 39th ACM/SIGAPP Symposium On Applied Computing(SAC)\n 2024"},{"id":"http://arxiv.org/abs/2412.02155v2","updated":"2025-01-07T06:30:24Z","published":"2024-12-03T04:29:27Z","title":"CausalMob: Causal Human Mobility Prediction with LLMs-derived Human\n Intentions toward Public Events","summary":" Large-scale human mobility exhibits spatial and temporal patterns that can\nassist policymakers in decision making. Although traditional prediction models\nattempt to capture these patterns, they are often disrupted by non-periodic public\nevents, such as disasters and occasional celebrations. Since regular human\nmobility patterns are heavily affected by these events, estimating their causal\neffects is critical to accurate mobility predictions. Although news articles\nprovide unique perspectives on these events in an unstructured format,\nprocessing them is a challenge. In this study, we propose a causality-augmented\nprediction model, called CausalMob, to analyze the causal effects of public\nevents. We first utilize large language models (LLMs) to extract human\nintentions from news articles and transform them into features that act as\ncausal treatments. Next, the model learns representations of spatio-temporal\nregional covariates from multiple data sources to serve as confounders for\ncausal inference. 
Finally, we present a causal effect estimation framework to\nensure event features remain independent of confounders during prediction.\nBased on large-scale real-world data, the experimental results show that the\nproposed model excels in human mobility prediction, outperforming\nstate-of-the-art models.\n","authors":["Xiaojie Yang","Hangli Ge","Jiawei Wang","Zipei Fan","Renhe Jiang","Ryosuke Shibasaki","Noboru Koshizuka"],"pdf_url":"https://arxiv.org/pdf/2412.02155v2.pdf","comment":"Accepted by KDD 2025"},{"id":"http://arxiv.org/abs/2501.03228v2","updated":"2025-01-07T04:05:53Z","published":"2025-01-06T18:59:55Z","title":"LightGNN: Simple Graph Neural Network for Recommendation","summary":" Graph neural networks (GNNs) have demonstrated superior performance in\ncollaborative recommendation through their ability to conduct high-order\nrepresentation smoothing, effectively capturing structural information within\nusers' interaction patterns. However, existing GNN paradigms face significant\nchallenges in scalability and robustness when handling large-scale, noisy, and\nreal-world datasets. To address these challenges, we present LightGNN, a\nlightweight and distillation-based GNN pruning framework designed to\nsubstantially reduce model complexity while preserving essential collaboration\nmodeling capabilities. Our LightGNN framework introduces a computationally\nefficient pruning module that adaptively identifies and removes redundant edges\nand embedding entries for model compression. The framework is guided by a\nresource-friendly hierarchical knowledge distillation objective, whose\nintermediate layer augments the observed graph to maintain performance,\nparticularly in high-rate compression scenarios. Extensive experiments on\npublic datasets demonstrate LightGNN's effectiveness, significantly improving\nboth computational efficiency and recommendation accuracy. 
Notably, LightGNN\nachieves an 80% reduction in edge count and 90% reduction in embedding entries\nwhile maintaining performance comparable to more complex state-of-the-art\nbaselines. The implementation of our LightGNN framework is available at the\ngithub repository: https://github.com/HKUDS/LightGNN.\n","authors":["Guoxuan Chen","Lianghao Xia","Chao Huang"],"pdf_url":"https://arxiv.org/pdf/2501.03228v2.pdf","comment":"Accepted to WSDM 2025 Oral"},{"id":"http://arxiv.org/abs/2501.04167v1","updated":"2025-01-07T22:29:08Z","published":"2025-01-07T22:29:08Z","title":"Reasoning-Enhanced Self-Training for Long-Form Personalized Text\n Generation","summary":" Personalized text generation requires a unique ability of large language\nmodels (LLMs) to learn from context that they often do not encounter during\ntheir standard training. One way to encourage LLMs to better use personalized\ncontext for generating outputs that better align with the user's expectations\nis to instruct them to reason over the user's past preferences, background\nknowledge, or writing style. To achieve this, we propose Reasoning-Enhanced\nSelf-Training for Personalized Text Generation (REST-PG), a framework that\ntrains LLMs to reason over personal data during response generation. REST-PG\nfirst generates reasoning paths to train the LLM's reasoning abilities and then\nemploys Expectation-Maximization Reinforced Self-Training to iteratively train\nthe LLM based on its own high-reward outputs. We evaluate REST-PG on the\nLongLaMP benchmark, consisting of four diverse personalized long-form text\ngeneration tasks. 
Our experiments demonstrate that REST-PG achieves significant\nimprovements over state-of-the-art baselines, with an average relative\nperformance gain of 14.5% on the benchmark.\n","authors":["Alireza Salemi","Cheng Li","Mingyang Zhang","Qiaozhu Mei","Weize Kong","Tao Chen","Zhuowan Li","Michael Bendersky","Hamed Zamani"],"pdf_url":"https://arxiv.org/pdf/2501.04167v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04161v1","updated":"2025-01-07T22:19:15Z","published":"2025-01-07T22:19:15Z","title":"KGIF: Optimizing Relation-Aware Recommendations with Knowledge Graph\n Information Fusion","summary":" While deep-learning-enabled recommender systems demonstrate strong\nperformance benchmarks, many struggle to adapt effectively in real-world\nenvironments due to limited use of user-item relationship data and insufficient\ntransparency in recommendation generation. Traditional collaborative filtering\napproaches fail to integrate multifaceted item attributes, and although\nFactorization Machines account for item-specific details, they overlook broader\nrelational patterns. Collaborative knowledge graph-based models have progressed\nby embedding user-item interactions with item-attribute relationships, offering\na holistic perspective on interconnected entities. However, these models\nfrequently aggregate attribute and interaction data in an implicit manner,\nleaving valuable relational nuances underutilized.\n This study introduces the Knowledge Graph Attention Network with Information\nFusion (KGIF), a specialized framework designed to merge entity and relation\nembeddings explicitly through a tailored self-attention mechanism. The KGIF\nframework integrates reparameterization via dynamic projection vectors,\nenabling embeddings to adaptively represent intricate relationships within\nknowledge graphs. 
This explicit fusion enhances the interplay between user-item\ninteractions and item-attribute relationships, providing a nuanced balance\nbetween user-centric and item-centric representations. An attentive propagation\nmechanism further optimizes knowledge graph embeddings, capturing multi-layered\ninteraction patterns. The contributions of this work include an innovative\nmethod for explicit information fusion, improved robustness for sparse\nknowledge graphs, and the ability to generate explainable recommendations\nthrough interpretable path visualization.\n","authors":["Dong Hyun Jeon","Wenbo Sun","Houbing Herbert Song","Dongfang Liu","Velasquez Alvaro","Yixin Chloe Xie","Shuteng Niu"],"pdf_url":"https://arxiv.org/pdf/2501.04161v1.pdf","comment":"Published at IEEE Big Data 2024"},{"id":"http://arxiv.org/abs/2412.16181v2","updated":"2025-01-07T22:12:47Z","published":"2024-12-10T16:51:11Z","title":"Minimum Weighted Feedback Arc Sets for Ranking from Pairwise Comparisons","summary":" The Minimum Weighted Feedback Arc Set (MWFAS) problem is fundamentally\nconnected to the Ranking Problem -- the task of deriving global rankings from\npairwise comparisons. Recent work [He et al. ICML2022] has advanced the\nstate-of-the-art for the Ranking Problem using learning-based methods,\nimproving upon multiple previous approaches. However, the connection to MWFAS\nremains underexplored. This paper investigates this relationship and presents\nefficient combinatorial algorithms for solving MWFAS, thus addressing the\nRanking Problem. 
Our experimental results demonstrate that these simple,\nlearning-free algorithms not only significantly outperform learning-based\nmethods in terms of speed but also generally achieve superior ranking accuracy.\n","authors":["Soroush Vahidi","Ioannis Koutis"],"pdf_url":"https://arxiv.org/pdf/2412.16181v2.pdf","comment":"This is a preliminary paper"},{"id":"http://arxiv.org/abs/2501.05475v1","updated":"2025-01-07T08:57:42Z","published":"2025-01-07T08:57:42Z","title":"Retrieval-Augmented Generation by Evidence Retroactivity in LLMs","summary":" Retrieval-augmented generation has gained significant attention due to its\nability to integrate relevant external knowledge, enhancing the accuracy and\nreliability of the LLMs' responses. Most of the existing methods apply a\ndynamic multiple retrieval-generating process to address multi-hop complex\nquestions by decomposing them into sub-problems. However, these methods rely on\na unidirectional forward reasoning paradigm, where errors from insufficient\nreasoning steps or inherent flaws in current retrieval systems are\nirreversible, potentially derailing the entire reasoning chain. For the first\ntime, this work introduces Retroactive Retrieval-Augmented Generation\n(RetroRAG), a novel framework to build a retroactive reasoning paradigm.\nRetroRAG revises and updates the evidence, redirecting the reasoning chain in\nthe correct direction. RetroRAG constructs an evidence-collation-discovery\nframework to search, generate, and refine credible evidence. It synthesizes\ninferential evidence related to the key entities in the question from the\nexisting source knowledge and formulates search queries to uncover additional\ninformation. As new evidence is found, RetroRAG continually updates and\norganizes this information, enhancing its ability to locate further necessary\nevidence. 
Paired with an Answerer to generate and evaluate outputs, RetroRAG is\ncapable of refining its reasoning process iteratively until a reliable answer\nis obtained. Empirical evaluations show that RetroRAG significantly outperforms\nexisting methods.\n","authors":["Liang Xiao","Wen Dai","Shuai Chen","Bin Qin","Chongyang Shi","Haopeng Jing","Tianyu Guo"],"pdf_url":"https://arxiv.org/pdf/2501.05475v1.pdf","comment":null}],"Machine Learning":[{"id":"http://arxiv.org/abs/2501.04005v1","updated":"2025-01-07T18:59:59Z","published":"2025-01-07T18:59:59Z","title":"LargeAD: Large-Scale Cross-Sensor Data Pretraining for Autonomous\n Driving","summary":" Recent advancements in vision foundation models (VFMs) have revolutionized\nvisual perception in 2D, yet their potential for 3D scene understanding,\nparticularly in autonomous driving applications, remains underexplored. In this\npaper, we introduce LargeAD, a versatile and scalable framework designed for\nlarge-scale 3D pretraining across diverse real-world driving datasets. Our\nframework leverages VFMs to extract semantically rich superpixels from 2D\nimages, which are aligned with LiDAR point clouds to generate high-quality\ncontrastive samples. This alignment facilitates cross-modal representation\nlearning, enhancing the semantic consistency between 2D and 3D data. We\nintroduce several key innovations: i) VFM-driven superpixel generation for\ndetailed semantic representation, ii) a VFM-assisted contrastive learning\nstrategy to align multimodal features, iii) superpoint temporal consistency to\nmaintain stable representations across time, and iv) multi-source data\npretraining to generalize across various LiDAR configurations. Our approach\ndelivers significant performance improvements over state-of-the-art methods in\nboth linear probing and fine-tuning tasks for both LiDAR-based segmentation and\nobject detection. 
Extensive experiments on eleven large-scale multi-modal\ndatasets highlight our superior performance, demonstrating the adaptability,\nefficiency, and robustness in real-world autonomous driving scenarios.\n","authors":["Lingdong Kong","Xiang Xu","Youquan Liu","Jun Cen","Runnan Chen","Wenwei Zhang","Liang Pan","Kai Chen","Ziwei Liu"],"pdf_url":"https://arxiv.org/pdf/2501.04005v1.pdf","comment":"Preprint; 16 pages, 7 figures, 8 tables; Project Page at\n https://ldkong.com/LargeAD"},{"id":"http://arxiv.org/abs/2501.04004v1","updated":"2025-01-07T18:59:58Z","published":"2025-01-07T18:59:58Z","title":"LiMoE: Mixture of LiDAR Representation Learners from Automotive Scenes","summary":" LiDAR data pretraining offers a promising approach to leveraging large-scale,\nreadily available datasets for enhanced data utilization. However, existing\nmethods predominantly focus on sparse voxel representation, overlooking the\ncomplementary attributes provided by other LiDAR representations. In this work,\nwe propose LiMoE, a framework that integrates the Mixture of Experts (MoE)\nparadigm into LiDAR data representation learning to synergistically combine\nmultiple representations, such as range images, sparse voxels, and raw points.\nOur approach consists of three stages: i) Image-to-LiDAR Pretraining, which\ntransfers prior knowledge from images to point clouds across different\nrepresentations; ii) Contrastive Mixture Learning (CML), which uses MoE to\nadaptively activate relevant attributes from each representation and distills\nthese mixed features into a unified 3D network; iii) Semantic Mixture\nSupervision (SMS), which combines semantic logits from multiple representations\nto boost downstream segmentation performance. Extensive experiments across 11\nlarge-scale LiDAR datasets demonstrate our effectiveness and superiority. 
The\ncode and model checkpoints have been made publicly accessible.\n","authors":["Xiang Xu","Lingdong Kong","Hui Shuai","Liang Pan","Ziwei Liu","Qingshan Liu"],"pdf_url":"https://arxiv.org/pdf/2501.04004v1.pdf","comment":"Preprint; 26 pages, 17 figures, 7 tables; Project Page at\n https://ldkong.com/LiMoE"},{"id":"http://arxiv.org/abs/2412.05313v3","updated":"2025-01-07T18:57:23Z","published":"2024-11-28T19:31:50Z","title":"λ: A Benchmark for Data-Efficiency in Long-Horizon Indoor Mobile\n Manipulation Robotics","summary":" Efficiently learning and executing long-horizon mobile manipulation (MoMa)\ntasks is crucial for advancing robotics in household and workplace settings.\nHowever, current MoMa models are data-inefficient, underscoring the need for\nimproved models that require realistic-sized benchmarks to evaluate their\nefficiency, which do not exist. To address this, we introduce the LAMBDA\n({\\lambda}) benchmark (Long-horizon Actions for Mobile-manipulation\nBenchmarking of Directed Activities), which evaluates the data efficiency of\nmodels on language-conditioned, long-horizon, multi-room, multi-floor,\npick-and-place tasks using a dataset of manageable size, more feasible for\ncollection. The benchmark includes 571 human-collected demonstrations that\nprovide realism and diversity in simulated and real-world settings. Unlike\nplanner-generated data, these trajectories offer natural variability and\nreplay-verifiability, ensuring robust learning and evaluation. We benchmark\nseveral models, including learning-based models and a neuro-symbolic modular\napproach combining foundation models with task and motion planning.\nLearning-based models show suboptimal success rates, even when leveraging\npretrained weights, underscoring significant data inefficiencies. However, the\nneuro-symbolic approach performs significantly better while being more data\nefficient. Findings highlight the need for more data-efficient learning-based\nMoMa approaches. 
{\\lambda} addresses this gap by serving as a key benchmark for\nevaluating the data efficiency of those future models in handling household\nrobotics tasks.\n","authors":["Ahmed Jaafar","Shreyas Sundara Raman","Yichen Wei","Sudarshan Harithas","Sofia Juliani","Anneke Wernerfelt","Benedict Quartey","Ifrah Idrees","Jason Xinyu Liu","Stefanie Tellex"],"pdf_url":"https://arxiv.org/pdf/2412.05313v3.pdf","comment":"8 pages"},{"id":"http://arxiv.org/abs/2501.04000v1","updated":"2025-01-07T18:56:14Z","published":"2025-01-07T18:56:14Z","title":"A Survey on Federated Learning in Human Sensing","summary":" Human Sensing, a field that leverages technology to monitor human activities,\npsycho-physiological states, and interactions with the environment, enhances\nour understanding of human behavior and drives the development of advanced\nservices that improve overall quality of life. However, its reliance on\ndetailed and often privacy-sensitive data as the basis for its machine learning\n(ML) models raises significant legal and ethical concerns. The recently\nproposed ML approach of Federated Learning (FL) promises to alleviate many of\nthese concerns, as it is able to create accurate ML models without sending raw\nuser data to a central server. While FL has demonstrated its usefulness across\na variety of areas, such as text prediction and cyber security, its benefits in\nHuman Sensing are under-explored, given the particular challenges in this\ndomain. This survey conducts a comprehensive analysis of the current\nstate-of-the-art studies on FL in Human Sensing, and proposes a taxonomy and an\neight-dimensional assessment for FL approaches. Through the eight-dimensional\nassessment, we then evaluate whether the surveyed studies consider a specific\nFL-in-Human-Sensing challenge or not. Finally, based on the overall analysis,\nwe discuss open challenges and highlight five research aspects related to FL in\nHuman Sensing that require urgent research attention. 
Our work provides a\ncomprehensive corpus of FL studies and aims to assist FL practitioners in\ndeveloping and evaluating solutions that effectively address the real-world\ncomplexities of Human Sensing.\n","authors":["Mohan Li","Martin Gjoreski","Pietro Barbiero","Gašper Slapničar","Mitja Luštrek","Nicholas D. Lane","Marc Langheinrich"],"pdf_url":"https://arxiv.org/pdf/2501.04000v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03999v1","updated":"2025-01-07T18:55:02Z","published":"2025-01-07T18:55:02Z","title":"WAPTS: A Weighted Allocation Probability Adjusted Thompson Sampling\n Algorithm for High-Dimensional and Sparse Experiment Settings","summary":" Aiming for more effective experiment design, such as in video content\nadvertising where different content options compete for user engagement, these\nscenarios can be modeled as multi-arm bandit problems. In cases where limited\ninteractions are available due to external factors, such as the cost of\nconducting experiments, recommenders often face constraints due to the small\nnumber of user interactions. In addition, there is a trade-off between\nselecting the best treatment and the ability to personalize and contextualize\nbased on individual factors. A popular solution to this dilemma is the\nContextual Bandit framework. It aims to maximize outcomes while incorporating\npersonalization (contextual) factors, customizing treatments such as a user's\nprofile to individual preferences. Despite their advantages, Contextual Bandit\nalgorithms face challenges like measurement bias and the 'curse of\ndimensionality.' These issues complicate the management of numerous\ninterventions and often lead to data sparsity through participant segmentation.\nTo address these problems, we introduce the Weighted Allocation Probability\nAdjusted Thompson Sampling (WAPTS) algorithm. WAPTS builds on the contextual\nThompson Sampling method by using a dynamic weighting parameter. 
This improves\nthe allocation process for interventions and enables rapid optimization in\ndata-sparse environments. We demonstrate the performance of our approach on\ndifferent numbers of arms and effect sizes.\n","authors":["Haochen Song","Ilya Musabirov","Ananya Bhattacharjee","Audrey Durand","Meredith Franklin","Anna Rafferty","Joseph Jay Williams"],"pdf_url":"https://arxiv.org/pdf/2501.03999v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03995v1","updated":"2025-01-07T18:52:05Z","published":"2025-01-07T18:52:05Z","title":"RAG-Check: Evaluating Multimodal Retrieval Augmented Generation\n Performance","summary":" Retrieval-augmented generation (RAG) improves large language models (LLMs) by\nusing external knowledge to guide response generation, reducing hallucinations.\nHowever, RAG, particularly multi-modal RAG, can introduce new hallucination\nsources: (i) the retrieval process may select irrelevant pieces (e.g.,\ndocuments, images) as raw context from the database, and (ii) retrieved images\nare processed into text-based context via vision-language models (VLMs) or\ndirectly used by multi-modal language models (MLLMs) like GPT-4o, which may\nhallucinate. To address this, we propose a novel framework to evaluate the\nreliability of multi-modal RAG using two performance measures: (i) the\nrelevancy score (RS), assessing the relevance of retrieved entries to the\nquery, and (ii) the correctness score (CS), evaluating the accuracy of the\ngenerated response. We train RS and CS models using a ChatGPT-derived database\nand human evaluator samples. Results show that both models achieve ~88%\naccuracy on test data. Additionally, we construct a 5000-sample human-annotated\ndatabase evaluating the relevancy of retrieved pieces and the correctness of\nresponse statements. Our RS model aligns with human preferences 20% more often\nthan CLIP in retrieval, and our CS model matches human preferences ~91% of the\ntime. 
Finally, we assess various RAG systems' selection and generation\nperformances using RS and CS.\n","authors":["Matin Mortaheb","Mohammad A. Amir Khojastepour","Srimat T. Chakradhar","Sennur Ulukus"],"pdf_url":"https://arxiv.org/pdf/2501.03995v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.14794v5","updated":"2025-01-07T18:49:42Z","published":"2024-06-20T23:51:32Z","title":"ImageFlowNet: Forecasting Multiscale Image-Level Trajectories of Disease\n Progression with Irregularly-Sampled Longitudinal Medical Images","summary":" Advances in medical imaging technologies have enabled the collection of\nlongitudinal images, which involve repeated scanning of the same patients over\ntime, to monitor disease progression. However, predictive modeling of such data\nremains challenging due to high dimensionality, irregular sampling, and data\nsparsity. To address these issues, we propose ImageFlowNet, a novel model\ndesigned to forecast disease trajectories from initial images while preserving\nspatial details. ImageFlowNet first learns multiscale joint representation\nspaces across patients and time points, then optimizes deterministic or\nstochastic flow fields within these spaces using a position-parameterized\nneural ODE/SDE framework. The model leverages a UNet architecture to create\nrobust multiscale representations and mitigates data scarcity by combining\nknowledge from all patients. We provide theoretical insights that support our\nformulation of ODEs, and motivate our regularizations involving high-level\nvisual features, latent space organization, and trajectory smoothness. We\nvalidate ImageFlowNet on three longitudinal medical image datasets depicting\nprogression in geographic atrophy, multiple sclerosis, and glioblastoma,\ndemonstrating its ability to effectively forecast disease progression and\noutperform existing methods. 
Our contributions include the development of\nImageFlowNet, its theoretical underpinnings, and empirical validation on\nreal-world datasets. The official implementation is available at\nhttps://github.com/KrishnaswamyLab/ImageFlowNet.\n","authors":["Chen Liu","Ke Xu","Liangbo L. Shen","Guillaume Huguet","Zilong Wang","Alexander Tong","Danilo Bzdok","Jay Stewart","Jay C. Wang","Lucian V. Del Priore","Smita Krishnaswamy"],"pdf_url":"https://arxiv.org/pdf/2406.14794v5.pdf","comment":"Accepted to ICASSP 2025"},{"id":"http://arxiv.org/abs/2501.03190v2","updated":"2025-01-07T18:34:22Z","published":"2025-01-06T18:05:35Z","title":"Multimodal Machine Learning Can Predict Videoconference Fluidity and\n Enjoyment","summary":" Videoconferencing is now a frequent mode of communication in both\nprofessional and informal settings, yet it often lacks the fluidity and\nenjoyment of in-person conversation. This study leverages multimodal machine\nlearning to predict moments of negative experience in videoconferencing. We\nsampled thousands of short clips from the RoomReader corpus, extracting audio\nembeddings, facial actions, and body motion features to train models for\nidentifying low conversational fluidity, low enjoyment, and classifying\nconversational events (backchanneling, interruption, or gap). Our best models\nachieved an ROC-AUC of up to 0.87 on hold-out videoconference sessions, with\ndomain-general audio features proving most critical. This work demonstrates\nthat multimodal audio-video signals can effectively predict high-level\nsubjective conversational outcomes. 
In addition, this is a contribution to\nresearch on videoconferencing user experience by showing that multimodal\nmachine learning can be used to identify rare moments of negative user\nexperience for further study or mitigation.\n","authors":["Andrew Chang","Viswadruth Akkaraju","Ray McFadden Cogliano","David Poeppel","Dustin Freeman"],"pdf_url":"https://arxiv.org/pdf/2501.03190v2.pdf","comment":"ICASSP 2025"},{"id":"http://arxiv.org/abs/2409.08861v5","updated":"2025-01-07T18:12:27Z","published":"2024-09-13T14:22:14Z","title":"Adjoint Matching: Fine-tuning Flow and Diffusion Generative Models with\n Memoryless Stochastic Optimal Control","summary":" Dynamical generative models that produce samples through an iterative\nprocess, such as Flow Matching and denoising diffusion models, have seen\nwidespread use, but there have not been many theoretically-sound methods for\nimproving these models with reward fine-tuning. In this work, we cast reward\nfine-tuning as stochastic optimal control (SOC). Critically, we prove that a\nvery specific memoryless noise schedule must be enforced during fine-tuning, in\norder to account for the dependency between the noise variable and the\ngenerated samples. We also propose a new algorithm named Adjoint Matching which\noutperforms existing SOC algorithms, by casting SOC problems as a regression\nproblem. We find that our approach significantly improves over existing methods\nfor reward fine-tuning, achieving better consistency, realism, and\ngeneralization to unseen human preference reward models, while retaining sample\ndiversity.\n","authors":["Carles Domingo-Enrich","Michal Drozdzal","Brian Karrer","Ricky T. Q. 
Chen"],"pdf_url":"https://arxiv.org/pdf/2409.08861v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.05300v5","updated":"2025-01-07T17:42:16Z","published":"2024-03-08T13:29:46Z","title":"Unity by Diversity: Improved Representation Learning in Multimodal VAEs","summary":" Variational Autoencoders for multimodal data hold promise for many tasks in\ndata analysis, such as representation learning, conditional generation, and\nimputation. Current architectures either share the encoder output, decoder\ninput, or both across modalities to learn a shared representation. Such\narchitectures impose hard constraints on the model. In this work, we show that\na better latent representation can be obtained by replacing these hard\nconstraints with a soft constraint. We propose a new mixture-of-experts prior,\nsoftly guiding each modality's latent representation towards a shared aggregate\nposterior. This approach results in a superior latent representation and allows\neach encoding to preserve information better from its uncompressed original\nfeatures. In extensive experiments on multiple benchmark datasets and two\nchallenging real-world datasets, we show improved learned latent\nrepresentations and imputation of missing data modalities compared to existing\nmethods.\n","authors":["Thomas M. Sutter","Yang Meng","Andrea Agostini","Daphné Chopard","Norbert Fortin","Julia E. 
Vogt","Babak Shahbaba","Stephan Mandt"],"pdf_url":"https://arxiv.org/pdf/2403.05300v5.pdf","comment":"Accepted at Neurips 2024"},{"id":"http://arxiv.org/abs/2411.00568v2","updated":"2025-01-07T17:36:14Z","published":"2024-11-01T13:26:13Z","title":"Constrained Sampling with Primal-Dual Langevin Monte Carlo","summary":" This work considers the problem of sampling from a probability distribution\nknown up to a normalization constant while satisfying a set of statistical\nconstraints specified by the expected values of general nonlinear functions.\nThis problem finds applications in, e.g., Bayesian inference, where it can\nconstrain moments to evaluate counterfactual scenarios or enforce desiderata\nsuch as prediction fairness. Methods developed to handle support constraints,\nsuch as those based on mirror maps, barriers, and penalties, are not suited for\nthis task. This work therefore relies on gradient descent-ascent dynamics in\nWasserstein space to put forward a discrete-time primal-dual Langevin Monte\nCarlo algorithm (PD-LMC) that simultaneously constrains the target distribution\nand samples from it. We analyze the convergence of PD-LMC under standard\nassumptions on the target distribution and constraints, namely (strong)\nconvexity and log-Sobolev inequalities. To do so, we bring classical\noptimization arguments for saddle-point algorithms to the geometry of\nWasserstein space. We illustrate the relevance and effectiveness of PD-LMC in\nseveral applications.\n","authors":["Luiz F. O. Chamon","Mohammad Reza Karimi","Anna Korba"],"pdf_url":"https://arxiv.org/pdf/2411.00568v2.pdf","comment":"39 pages, 14 figures. 
Published at NeurIPS 2024"},{"id":"http://arxiv.org/abs/2406.11814v4","updated":"2025-01-07T17:35:00Z","published":"2024-06-17T17:54:42Z","title":"Stochastic Neural Network Symmetrisation in Markov Categories","summary":" We consider the problem of symmetrising a neural network along a group\nhomomorphism: given a homomorphism $\\varphi : H \\to G$, we would like a\nprocedure that converts $H$-equivariant neural networks to $G$-equivariant\nones. We formulate this in terms of Markov categories, which allows us to\nconsider neural networks whose outputs may be stochastic, but with\nmeasure-theoretic details abstracted away. We obtain a flexible and\ncompositional framework for symmetrisation that relies on minimal assumptions\nabout the structure of the group and the underlying neural network\narchitecture. Our approach recovers existing canonicalisation and averaging\ntechniques for symmetrising deterministic models, and extends to provide a\nnovel methodology for symmetrising stochastic models also. Beyond this, our\nfindings also demonstrate the utility of Markov categories for addressing\ncomplex problems in machine learning in a conceptually clear yet mathematically\nprecise way.\n","authors":["Rob Cornish"],"pdf_url":"https://arxiv.org/pdf/2406.11814v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.00799v2","updated":"2025-01-07T17:32:19Z","published":"2025-01-01T10:50:35Z","title":"Follow The Approximate Sparse Leader for No-Regret Online Sparse Linear\n Approximation","summary":" We consider the problem of \\textit{online sparse linear approximation}, where\none predicts the best sparse approximation of a sequence of measurements in\nterms of linear combination of columns of a given measurement matrix. Such\nonline prediction problems are ubiquitous, ranging from medical trials to web\ncaching to resource allocation. The inherent difficulty of offline recovery\nalso makes the online problem challenging. 
In this letter, we propose\nFollow-The-Approximate-Sparse-Leader, an efficient online meta-policy to\naddress this online problem. Through a detailed theoretical analysis, we prove\nthat under certain assumptions on the measurement sequence, the proposed policy\nenjoys a data-dependent sublinear upper bound on the static regret, which can\nrange from logarithmic to square-root. Numerical simulations are performed to\ncorroborate the theoretical findings and demonstrate the efficacy of the\nproposed online policy.\n","authors":["Samrat Mukhopadhyay","Debasmita Mukherjee"],"pdf_url":"https://arxiv.org/pdf/2501.00799v2.pdf","comment":"12 pages, 5 figures, corrected title, added proof of a lemma in\n appendix"},{"id":"http://arxiv.org/abs/2301.08110v6","updated":"2025-01-07T17:26:26Z","published":"2023-01-19T15:01:00Z","title":"AtMan: Understanding Transformer Predictions Through Memory Efficient\n Attention Manipulation","summary":" Generative transformer models have become increasingly complex, with large\nnumbers of parameters and the ability to process multiple input modalities.\nCurrent methods for explaining their predictions are resource-intensive. Most\ncrucially, they require prohibitively large amounts of extra memory, since they\nrely on backpropagation which allocates almost twice as much GPU memory as the\nforward pass. This makes it difficult, if not impossible, to use them in\nproduction. We present AtMan that provides explanations of generative\ntransformer models at almost no extra cost. Specifically, AtMan is a\nmodality-agnostic perturbation method that manipulates the attention mechanisms\nof transformers to produce relevance maps for the input with respect to the\noutput prediction. Instead of using backpropagation, AtMan applies a\nparallelizable token-based search method based on cosine similarity\nneighborhood in the embedding space. 
Our exhaustive experiments on text and\nimage-text benchmarks demonstrate that AtMan outperforms current\nstate-of-the-art gradient-based methods on several metrics while being\ncomputationally efficient. As such, AtMan is suitable for use in large model\ninference deployments.\n","authors":["Björn Deiseroth","Mayukh Deb","Samuel Weinbach","Manuel Brack","Patrick Schramowski","Kristian Kersting"],"pdf_url":"https://arxiv.org/pdf/2301.08110v6.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17547v2","updated":"2025-01-07T17:23:55Z","published":"2024-12-23T13:08:23Z","title":"Probability-density-aware Semi-supervised Learning","summary":" Semi-supervised learning (SSL) assumes that neighbor points lie in the same\ncategory (neighbor assumption), and points in different clusters belong to\nvarious categories (cluster assumption). Existing methods usually rely on\nsimilarity measures to retrieve the similar neighbor points, ignoring cluster\nassumption, which may not utilize unlabeled information sufficiently and\neffectively. This paper first provides a systematical investigation into the\nsignificant role of probability density in SSL and lays a solid theoretical\nfoundation for cluster assumption. To this end, we introduce a\nProbability-Density-Aware Measure (PM) to discern the similarity between\nneighbor points. To further improve Label Propagation, we also design a\nProbability-Density-Aware Measure Label Propagation (PMLP) algorithm to fully\nconsider the cluster assumption in label propagation. Last but not least, we\nprove that traditional pseudo-labeling could be viewed as a particular case of\nPMLP, which provides a comprehensive theoretical understanding of PMLP's\nsuperior performance. 
Extensive experiments demonstrate that PMLP achieves\noutstanding performance compared with other recent methods.\n","authors":["Shuyang Liu","Ruiqiu Zheng","Yunhang Shen","Ke Li","Xing Sun","Zhou Yu","Shaohui Lin"],"pdf_url":"https://arxiv.org/pdf/2412.17547v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.16834v2","updated":"2025-01-07T17:05:11Z","published":"2024-06-24T17:42:03Z","title":"Statistical Error Bounds for GANs with Nonlinear Objective Functionals","summary":" Generative adversarial networks (GANs) are unsupervised learning methods for\ntraining a generator distribution to produce samples that approximate those\ndrawn from a target distribution. Many such methods can be formulated as\nminimization of a metric or divergence between probability distributions.\nRecent works have derived statistical error bounds for GANs that are based on\nintegral probability metrics (IPMs), e.g., WGAN which is based on the\n1-Wasserstein metric. In general, IPMs are defined by optimizing a linear\nfunctional (difference of expectations) over a space of discriminators. A much\nlarger class of GANs, which we here call $(f,\\Gamma)$-GANs, can be constructed\nusing $f$-divergences (e.g., Jensen-Shannon, KL, or $\\alpha$-divergences)\ntogether with a regularizing discriminator space $\\Gamma$ (e.g., $1$-Lipschitz\nfunctions). These GANs have nonlinear objective functions, depending on the\nchoice of $f$, and have been shown to exhibit improved performance in a number\nof applications. In this work we derive statistical error bounds for\n$(f,\\Gamma)$-GANs for general classes of $f$ and $\\Gamma$ in the form of\nfinite-sample concentration inequalities. These results prove the statistical\nconsistency of $(f,\\Gamma)$-GANs and reduce to the known results for IPM-GANs\nin the appropriate limit. 
Finally, our results also give new insight into the\nperformance of GANs for distributions with unbounded support.\n","authors":["Jeremiah Birrell"],"pdf_url":"https://arxiv.org/pdf/2406.16834v2.pdf","comment":"27 pages"},{"id":"http://arxiv.org/abs/2501.03941v1","updated":"2025-01-07T17:02:33Z","published":"2025-01-07T17:02:33Z","title":"Synthetic Data Privacy Metrics","summary":" Recent advancements in generative AI have made it possible to create\nsynthetic datasets that can be as accurate as real-world data for training AI\nmodels, powering statistical insights, and fostering collaboration with\nsensitive datasets while offering strong privacy guarantees. Effectively\nmeasuring the empirical privacy of synthetic data is an important step in the\nprocess. However, while there is a multitude of new privacy metrics being\npublished every day, there currently is no standardization. In this paper, we\nreview the pros and cons of popular metrics that include simulations of\nadversarial attacks. We also review current best practices for amending\ngenerative models to enhance the privacy of the data they create (e.g.\ndifferential privacy).\n","authors":["Amy Steier","Lipika Ramaswamy","Andre Manoel","Alexa Haushalter"],"pdf_url":"https://arxiv.org/pdf/2501.03941v1.pdf","comment":"14 pages, 2 figures"},{"id":"http://arxiv.org/abs/2402.14746v3","updated":"2025-01-07T16:58:05Z","published":"2024-02-22T18:06:19Z","title":"Scaling Efficient LLMs","summary":" Trained LLMs are typically sparse in that most of the parameters are zero,\nraising questions on efficiency. In response, we inquire into efficient LLMs,\ni.e. those with the fewest parameters that achieve the desired accuracy on a\ntraining corpus. Specifically, we compare theoretical and empirical estimates\nfor training loss to obtain upper and lower bounds on the number of unique\nsequences in a natural training corpus as a function of its size. 
Our result\nimplies (1) to double the number of skills represented in a training corpus,\nthe corpus must scale more than four fold (2) for efficient LLMs, the number of\nparameters N and the size D of a natural training corpus scale as $N \\propto\nD^{0.44}$; (3) if the number of parameters of an LLM is smaller than the number\nof unique sequences in the training corpus, scaling up can uncover emergent\nskills.\n","authors":["B. N. Kausik"],"pdf_url":"https://arxiv.org/pdf/2402.14746v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03937v1","updated":"2025-01-07T16:56:40Z","published":"2025-01-07T16:56:40Z","title":"A precise asymptotic analysis of learning diffusion models: theory and\n insights","summary":" In this manuscript, we consider the problem of learning a flow or\ndiffusion-based generative model parametrized by a two-layer auto-encoder,\ntrained with online stochastic gradient descent, on a high-dimensional target\ndensity with an underlying low-dimensional manifold structure. We derive a\ntight asymptotic characterization of low-dimensional projections of the\ndistribution of samples generated by the learned model, ascertaining in\nparticular its dependence on the number of training samples. Building on this\nanalysis, we discuss how mode collapse can arise, and lead to model collapse\nwhen the generative model is re-trained on generated synthetic data.\n","authors":["Hugo Cui","Cengiz Pehlevan","Yue M. Lu"],"pdf_url":"https://arxiv.org/pdf/2501.03937v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03928v1","updated":"2025-01-07T16:45:37Z","published":"2025-01-07T16:45:37Z","title":"From Newswire to Nexus: Using text-based actor embeddings and\n transformer networks to forecast conflict dynamics","summary":" This study advances the field of conflict forecasting by using text-based\nactor embeddings with transformer models to predict dynamic changes in violent\nconflict patterns at the actor level. 
More specifically, we combine newswire\ntexts with structured conflict event data and leverage recent advances in\nNatural Language Processing (NLP) techniques to forecast escalations and\nde-escalations among conflicting actors, such as governments, militias,\nseparatist movements, and terrorists. This new approach accurately and promptly\ncaptures the inherently volatile patterns of violent conflicts, which existing\nmethods have not been able to achieve. To create this framework, we began by\ncurating and annotating a vast international newswire corpus, leveraging\nhand-labeled event data from the Uppsala Conflict Data Program. By using this\nhybrid dataset, our models can incorporate the textual context of news sources\nalong with the precision and detail of structured event data. This combination\nenables us to make both dynamic and granular predictions about conflict\ndevelopments. We validate our approach through rigorous back-testing against\nhistorical events, demonstrating superior out-of-sample predictive power. We\nfind that our approach is quite effective in identifying and predicting phases\nof conflict escalation and de-escalation, surpassing the capabilities of\ntraditional models. By focusing on actor interactions, our explicit goal is to\nprovide actionable insights to policymakers, humanitarian organizations, and\npeacekeeping operations in order to enable targeted and effective intervention\nstrategies.\n","authors":["Mihai Croicu","Simon Polichinel von der Maase"],"pdf_url":"https://arxiv.org/pdf/2501.03928v1.pdf","comment":"35 pages, 5 figures. 
Paper presented at the 120th American Political\n Science Association Annual Meeting"},{"id":"http://arxiv.org/abs/2501.03923v1","updated":"2025-01-07T16:35:29Z","published":"2025-01-07T16:35:29Z","title":"Explainable AI model reveals disease-related mechanisms in single-cell\n RNA-seq data","summary":" Neurodegenerative diseases (NDDs) are complex and lack effective treatment\ndue to their poorly understood mechanism. The increasingly used data analysis\nfrom Single nucleus RNA Sequencing (snRNA-seq) allows to explore transcriptomic\nevents at a single cell level, yet face challenges in interpreting the\nmechanisms underlying a disease. On the other hand, Neural Network (NN) models\ncan handle complex data to offer insights but can be seen as black boxes with\npoor interpretability. In this context, explainable AI (XAI) emerges as a\nsolution that could help to understand disease-associated mechanisms when\ncombined with efficient NN models. However, limited research explores XAI in\nsingle-cell data. In this work, we implement a method for identifying\ndisease-related genes and the mechanistic explanation of disease progression\nbased on NN model combined with SHAP. We analyze available Huntington's disease\n(HD) data to identify both HD-altered genes and mechanisms by adding Gene Set\nEnrichment Analysis (GSEA) comparing two methods, differential gene expression\nanalysis (DGE) and NN combined with SHAP approach. 
Our results show that DGE\nand SHAP approaches offer both common and differential sets of altered genes\nand pathways, reinforcing the usefulness of XAI methods for a broader\nperspective of disease.\n","authors":["Mohammad Usman","Olga Varea","Petia Radeva","Josep Canals","Jordi Abante","Daniel Ortiz"],"pdf_url":"https://arxiv.org/pdf/2501.03923v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.19218v4","updated":"2025-01-07T16:31:31Z","published":"2023-10-30T01:34:33Z","title":"Exploring Federated Unlearning: Analysis, Comparison, and Insights","summary":" The increasing demand for privacy-preserving machine learning has spurred\ninterest in federated unlearning, which enables the selective removal of data\nfrom models trained in federated systems. However, developing federated\nunlearning methods presents challenges, particularly in balancing three often\nconflicting objectives: privacy, accuracy, and efficiency. This paper provides\na comprehensive analysis of existing federated unlearning approaches, examining\ntheir algorithmic efficiency, impact on model accuracy, and effectiveness in\npreserving privacy. We discuss key trade-offs among these dimensions and\nhighlight their implications for practical applications across various domains.\nAdditionally, we propose the OpenFederatedUnlearning framework, a unified\nbenchmark for evaluating federated unlearning methods, incorporating classic\nbaselines and diverse performance metrics. Our findings aim to guide\npractitioners in navigating the complex interplay of these objectives, offering\ninsights to achieve effective and efficient federated unlearning. Finally, we\noutline directions for future research to further advance the state of\nfederated unlearning techniques.\n","authors":["Yang Zhao","Jiaxi Yang","Yiling Tao","Lixu Wang","Xiaoxiao Li","Dusit Niyato","H. 
Vincent Poor"],"pdf_url":"https://arxiv.org/pdf/2310.19218v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.19223v2","updated":"2025-01-07T16:20:17Z","published":"2024-06-27T14:49:08Z","title":"T-FREE: Subword Tokenizer-Free Generative LLMs via Sparse\n Representations for Memory-Efficient Embeddings","summary":" Tokenizers are crucial for encoding information in Large Language Models, but\ntheir development has recently stagnated, and they contain inherent weaknesses.\nMajor limitations include computational overhead, ineffective vocabulary use,\nand unnecessarily large embedding and head layers. Additionally, their\nperformance is biased towards a reference corpus, leading to reduced\neffectiveness for underrepresented languages.\n To remedy these issues, we propose T-FREE, which directly embeds words\nthrough sparse activation patterns over character triplets, and does not\nrequire a reference corpus. T-FREE inherently exploits morphological\nsimilarities and allows for strong compression of embedding layers. In our\nexhaustive experimental evaluation, we achieve competitive downstream\nperformance with a parameter reduction of more than 85% on these layers.\nFurther, T-FREE shows significant improvements in cross-lingual transfer\nlearning.\n","authors":["Björn Deiseroth","Manuel Brack","Patrick Schramowski","Kristian Kersting","Samuel Weinbach"],"pdf_url":"https://arxiv.org/pdf/2406.19223v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03905v1","updated":"2025-01-07T16:19:40Z","published":"2025-01-07T16:19:40Z","title":"mFabric: An Efficient and Scalable Fabric for Mixture-of-Experts\n Training","summary":" Mixture-of-Expert (MoE) models outperform conventional models by selectively\nactivating different subnets, named \\emph{experts}, on a per-token basis. 
This\ngated computation generates dynamic communications that cannot be determined\nbeforehand, challenging the existing GPU interconnects that remain\n\\emph{static} during the distributed training process. In this paper, we\nadvocate for a first-of-its-kind system, called mFabric, that unlocks topology\nreconfiguration \\emph{during} distributed MoE training. Towards this vision, we\nfirst perform a production measurement study and show that the MoE dynamic\ncommunication pattern has \\emph{strong locality}, alleviating the requirement\nof global reconfiguration. Based on this, we design and implement a\n\\emph{regionally reconfigurable high-bandwidth domain} on top of existing\nelectrical interconnects using optical circuit switching (OCS), achieving\nscalability while maintaining rapid adaptability. We have built a fully\nfunctional mFabric prototype with commodity hardware and a customized\ncollective communication runtime that trains state-of-the-art MoE models with\n\\emph{in-training} topology reconfiguration across 32 A100 GPUs. Large-scale\npacket-level simulations show that mFabric delivers comparable performance as\nthe non-blocking fat-tree fabric while boosting the training cost efficiency\n(e.g., performance per dollar) of four representative MoE models by\n1.2$\\times$--1.5$\\times$ and 1.9$\\times$--2.3$\\times$ at 100 Gbps and 400 Gbps\nlink bandwidths, respectively.\n","authors":["Xudong Liao","Yijun Sun","Han Tian","Xinchen Wan","Yilun Jin","Zilong Wang","Zhenghang Ren","Xinyang Huang","Wenxue Li","Kin Fai Tse","Zhizhen Zhong","Guyue Liu","Ying Zhang","Xiaofeng Ye","Yiming Zhang","Kai Chen"],"pdf_url":"https://arxiv.org/pdf/2501.03905v1.pdf","comment":"Corresponding authors: zhizhenz@mit.edu (Z. Zhong),\n kaichen@cse.ust.hk (K. 
Chen)"},{"id":"http://arxiv.org/abs/2501.03904v1","updated":"2025-01-07T16:18:55Z","published":"2025-01-07T16:18:55Z","title":"Exploring the Potential of Large Language Models in Public\n Transportation: San Antonio Case Study","summary":" The integration of large language models (LLMs) into public transit systems\npresents a transformative opportunity to enhance urban mobility. This study\nexplores the potential of LLMs to revolutionize public transportation\nmanagement within the context of San Antonio's transit system. Leveraging the\ncapabilities of LLMs in natural language processing and data analysis, we\ninvestigate their capabilities to optimize route planning, reduce wait times,\nand provide personalized travel assistance. By utilizing the General Transit\nFeed Specification (GTFS) and other relevant data, this research aims to\ndemonstrate how LLMs can potentially improve resource allocation, elevate\npassenger satisfaction, and inform data-driven decision-making in transit\noperations. A comparative analysis of different ChatGPT models was conducted to\nassess their ability to understand transportation information, retrieve\nrelevant data, and provide comprehensive responses. 
Findings from this study\nsuggest that while LLMs hold immense promise for public transit, careful\nengineering and fine-tuning are essential to realizing their full potential.\nSan Antonio serves as a case study to inform the development of LLM-powered\ntransit systems in other urban environments.\n","authors":["Ramya Jonnala","Gongbo Liang","Jeong Yang","Izzat Alsmadi"],"pdf_url":"https://arxiv.org/pdf/2501.03904v1.pdf","comment":"This work is accepted to AAAI 2025 Workshop on AI for Urban Planning.\n arXiv admin note: substantial text overlap with arXiv:2407.11003"},{"id":"http://arxiv.org/abs/2412.06866v3","updated":"2025-01-07T16:16:49Z","published":"2024-12-09T09:31:58Z","title":"LMS-AutoTSF: Learnable Multi-Scale Decomposition and Integrated\n Autocorrelation for Time Series Forecasting","summary":" Time series forecasting is an important challenge with significant\napplications in areas such as weather prediction, stock market analysis,\nscientific simulations and industrial process analysis. In this work, we\nintroduce LMS-AutoTSF, a novel time series forecasting architecture that\nincorporates autocorrelation while leveraging dual encoders operating at\nmultiple scales. Unlike models that rely on predefined trend and seasonal\ncomponents, LMS-AutoTSF employs two separate encoders per scale: one focusing\non low-pass filtering to capture trends and the other utilizing high-pass\nfiltering to model seasonal variations. These filters are learnable, allowing\nthe model to dynamically adapt and isolate trend and seasonal components\ndirectly in the frequency domain. A key innovation in our approach is the\nintegration of autocorrelation, achieved by computing lagged differences in\ntime steps, which enables the model to capture dependencies across time more\neffectively. Each encoder processes the input through fully connected layers to\nhandle temporal and channel interactions. 
By combining frequency-domain\nfiltering, autocorrelation-based temporal modeling, and channel-wise\ntransformations, LMS-AutoTSF not only accurately captures long-term\ndependencies and fine-grained patterns but also operates more efficiently\ncompared to other state-of-the-art methods. Its lightweight design ensures\nfaster processing while maintaining high precision in forecasting across\ndiverse time horizons. The source code is publicly available at\n\\url{http://github.com/mribrahim/LMS-TSF}\n","authors":["Ibrahim Delibasoglu","Sanjay Chakraborty","Fredrik Heintz"],"pdf_url":"https://arxiv.org/pdf/2412.06866v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.12615v3","updated":"2025-01-07T16:11:10Z","published":"2023-11-21T13:59:00Z","title":"Koopman Learning with Episodic Memory","summary":" Koopman operator theory has found significant success in learning models of\ncomplex, real-world dynamical systems, enabling prediction and control. The\ngreater interpretability and lower computational costs of these models,\ncompared to traditional machine learning methodologies, make Koopman learning\nan especially appealing approach. Despite this, little work has been performed\non endowing Koopman learning with the ability to leverage its own failures. To\naddress this, we equip Koopman methods -- developed for predicting\nnon-autonomous time-series -- with an episodic memory mechanism, enabling\nglobal recall of (or attention to) periods in time where similar dynamics\npreviously occurred. We find that a basic implementation of Koopman learning\nwith episodic memory leads to significant improvements in prediction on\nsynthetic and real-world data. Our framework has considerable potential for\nexpansion, allowing for future advances, and opens exciting new directions for\nKoopman learning.\n","authors":["William T. 
Redman","Dean Huang","Maria Fonoberova","Igor Mezić"],"pdf_url":"https://arxiv.org/pdf/2311.12615v3.pdf","comment":"17 pages, 7 figures"},{"id":"http://arxiv.org/abs/2501.03902v1","updated":"2025-01-07T16:10:09Z","published":"2025-01-07T16:10:09Z","title":"Explainable Reinforcement Learning via Temporal Policy Decomposition","summary":" We investigate the explainability of Reinforcement Learning (RL) policies\nfrom a temporal perspective, focusing on the sequence of future outcomes\nassociated with individual actions. In RL, value functions compress information\nabout rewards collected across multiple trajectories and over an infinite\nhorizon, allowing a compact form of knowledge representation. However, this\ncompression obscures the temporal details inherent in sequential\ndecision-making, presenting a key challenge for interpretability. We present\nTemporal Policy Decomposition (TPD), a novel explainability approach that\nexplains individual RL actions in terms of their Expected Future Outcome (EFO).\nThese explanations decompose generalized value functions into a sequence of\nEFOs, one for each time step up to a prediction horizon of interest, revealing\ninsights into when specific outcomes are expected to occur. We leverage\nfixed-horizon temporal difference learning to devise an off-policy method for\nlearning EFOs for both optimal and suboptimal actions, enabling contrastive\nexplanations consisting of EFOs for different state-action pairs. 
Our\nexperiments demonstrate that TPD generates accurate explanations that (i)\nclarify the policy's future strategy and anticipated trajectory for a given\naction and (ii) improve understanding of the reward composition, facilitating\nfine-tuning of the reward function to align with human expectations.\n","authors":["Franco Ruggeri","Alessio Russo","Rafia Inam","Karl Henrik Johansson"],"pdf_url":"https://arxiv.org/pdf/2501.03902v1.pdf","comment":"21 pages, 4 figures"},{"id":"http://arxiv.org/abs/2410.23440v2","updated":"2025-01-07T16:07:33Z","published":"2024-10-30T20:32:30Z","title":"Learning Lipschitz Operators with respect to Gaussian Measures with\n Near-Optimal Sample Complexity","summary":" Operator learning, the approximation of mappings between infinite-dimensional\nfunction spaces using ideas from machine learning, has gained increasing\nresearch attention in recent years. Approximate operators, learned from data,\nhold promise to serve as efficient surrogate models for problems in\ncomputational science and engineering, complementing traditional numerical\nmethods. However, despite their empirical success, our understanding of the\nunderpinning mathematical theory is in large part still incomplete. In this\npaper, we study the approximation of Lipschitz operators in expectation with\nrespect to Gaussian measures. We prove higher Gaussian Sobolev regularity of\nLipschitz operators and establish lower and upper bounds on the Hermite\npolynomial approximation error. We further consider the reconstruction of\nLipschitz operators from $m$ arbitrary (adaptive) linear samples. A key finding\nis the tight characterization of the smallest achievable error for all possible\n(adaptive) sampling and reconstruction maps in terms of $m$. 
It is shown that\nHermite polynomial approximation is an optimal recovery strategy, but we have\nthe following curse of sample complexity: No method to approximate Lipschitz\noperators based on $m$ samples can achieve algebraic convergence rates in $m$.\nOn the positive side, we prove that a sufficiently fast spectral decay of the\ncovariance operator of the Gaussian measure guarantees convergence rates which\nare arbitrarily close to any algebraic rate in the large data limit $m \\to\n\\infty$. A main focus of this work is on the recovery of Lipschitz operators\nfrom finitely many point samples. We use Christoffel sampling and weighted\nleast-squares approximation to propose an algorithm which provably achieves\nnear-optimal sample complexity in high probability.\n","authors":["Ben Adcock","Michael Griebel","Gregor Maier"],"pdf_url":"https://arxiv.org/pdf/2410.23440v2.pdf","comment":"56 pages"},{"id":"http://arxiv.org/abs/2408.11876v2","updated":"2025-01-07T16:01:15Z","published":"2024-08-20T13:19:06Z","title":"From Glucose Patterns to Health Outcomes: A Generalizable Foundation\n Model for Continuous Glucose Monitor Data Analysis","summary":" Recent advances in SSL enabled novel medical AI models, known as foundation\nmodels, offer great potential for better characterizing health from diverse\nbiomedical data. CGM provides rich, temporal data on glycemic patterns, but its\nfull potential for predicting broader health outcomes remains underutilized.\nHere, we present GluFormer, a generative foundation model for CGM data that\nlearns nuanced glycemic patterns and translates them into predictive\nrepresentations of metabolic health. Trained on over 10 million CGM\nmeasurements from 10,812 adults, primarily without diabetes, GluFormer uses\nautoregressive token prediction to capture longitudinal glucose dynamics. 
We\nshow that GluFormer generalizes to 19 external cohorts (n=6,044) spanning\ndifferent ethnicities and ages, 5 countries, 8 CGM devices, and diverse\npathophysiological states. GluFormers representations exceed the performance of\ncurrent CGM metrics, such as the Glucose Management Indicator (GMI), for\nforecasting clinical measures. In a longitudinal study of 580 adults with CGM\ndata and 12-year follow-up, GluFormer identifies individuals at elevated risk\nof developing diabetes more effectively than blood HbA1C%, capturing 66% of all\nnew-onset diabetes diagnoses in the top quartile versus 7% in the bottom\nquartile. Similarly, 69% of cardiovascular-death events occurred in the top\nquartile with none in the bottom quartile, demonstrating powerful risk\nstratification beyond traditional glycemic metrics. We also show that CGM\nrepresentations from pre-intervention periods in Randomized Clinical Trials\noutperform other methods in predicting primary and secondary outcomes. When\nintegrating dietary data into GluFormer, we show that the multi-modal version\nof the model can accurately generate CGM data based on dietary intake data,\nsimulate outcomes of dietary interventions, and predict individual responses to\nspecific foods.\n","authors":["Guy Lutsker","Gal Sapir","Smadar Shilo","Jordi Merino","Anastasia Godneva","Jerry R Greenfield","Dorit Samocha-Bonet","Raja Dhir","Francisco Gude","Shie Mannor","Eli Meirom","Gal Chechik","Hagai Rossman","Eran Segal"],"pdf_url":"https://arxiv.org/pdf/2408.11876v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.01006v3","updated":"2025-01-07T16:00:44Z","published":"2024-11-01T20:04:59Z","title":"Abstracted Shapes as Tokens -- A Generalizable and Interpretable Model\n for Time-series Classification","summary":" In time-series analysis, many recent works seek to provide a unified view and\nrepresentation for time-series across multiple domains, leading to the\ndevelopment of foundation models for time-series data. 
Despite diverse modeling\ntechniques, existing models are black boxes and fail to provide insights and\nexplanations about their representations. In this paper, we present VQShape, a\npre-trained, generalizable, and interpretable model for time-series\nrepresentation learning and classification. By introducing a novel\nrepresentation for time-series data, we forge a connection between the latent\nspace of VQShape and shape-level features. Using vector quantization, we show\nthat time-series from different domains can be described using a unified set of\nlow-dimensional codes, where each code can be represented as an abstracted\nshape in the time domain. On classification tasks, we show that the\nrepresentations of VQShape can be utilized to build interpretable classifiers,\nachieving comparable performance to specialist models. Additionally, in\nzero-shot learning, VQShape and its codebook can generalize to previously\nunseen datasets and domains that are not included in the pre-training process.\nThe code and pre-trained weights are available at\nhttps://github.com/YunshiWen/VQShape.\n","authors":["Yunshi Wen","Tengfei Ma","Tsui-Wei Weng","Lam M. Nguyen","Anak Agung Julius"],"pdf_url":"https://arxiv.org/pdf/2411.01006v3.pdf","comment":"Published in Neural Information Processing Systems (NeurIPS) 2024"},{"id":"http://arxiv.org/abs/2501.03888v1","updated":"2025-01-07T15:51:49Z","published":"2025-01-07T15:51:49Z","title":"Neural DNF-MT: A Neuro-symbolic Approach for Learning Interpretable and\n Editable Policies","summary":" Although deep reinforcement learning has been shown to be effective, the\nmodel's black-box nature presents barriers to direct policy interpretation. To\naddress this problem, we propose a neuro-symbolic approach called neural DNF-MT\nfor end-to-end policy learning. The differentiable nature of the neural DNF-MT\nmodel enables the use of deep actor-critic algorithms for training. 
At the same\ntime, its architecture is designed so that trained models can be directly\ntranslated into interpretable policies expressed as standard (bivalent or\nprobabilistic) logic programs. Moreover, additional layers can be included to\nextract abstract features from complex observations, acting as a form of\npredicate invention. The logic representations are highly interpretable, and we\nshow how the bivalent representations of deterministic policies can be edited\nand incorporated back into a neural model, facilitating manual intervention and\nadaptation of learned policies. We evaluate our approach on a range of tasks\nrequiring learning deterministic or stochastic behaviours from various forms of\nobservations. Our empirical results show that our neural DNF-MT model performs\nat the level of competing black-box methods whilst providing interpretable\npolicies.\n","authors":["Kexin Gu Baugh","Luke Dickens","Alessandra Russo"],"pdf_url":"https://arxiv.org/pdf/2501.03888v1.pdf","comment":"AAMAS 2025"},{"id":"http://arxiv.org/abs/2410.11463v2","updated":"2025-01-07T15:48:15Z","published":"2024-10-15T10:10:33Z","title":"Advanced Persistent Threats (APT) Attribution Using Deep Reinforcement\n Learning","summary":" The development of the DRL model for malware attribution involved extensive\nresearch, iterative coding, and numerous adjustments based on the insights\ngathered from predecessor models and contemporary research papers. This\npreparatory work was essential to establish a robust foundation for the model,\nensuring it could adapt and respond effectively to the dynamic nature of\nmalware threats. Initially, the model struggled with low accuracy levels, but\nthrough persistent adjustments to its architecture and learning algorithms,\naccuracy improved dramatically from about 7 percent to over 73 percent in early\niterations. 
By the end of the training, the model consistently reached accuracy\nlevels near 98 percent, demonstrating its strong capability to accurately\nrecognise and attribute malware activities. This upward trajectory in training\naccuracy is graphically represented in the Figure, which vividly illustrates\nthe model's maturation and increasing proficiency over time.\n","authors":["Animesh Singh Basnet","Mohamed Chahine Ghanem","Dipo Dunsin","Wiktor Sowinski-Mydlarz"],"pdf_url":"https://arxiv.org/pdf/2410.11463v2.pdf","comment":"21 Pages"},{"id":"http://arxiv.org/abs/2405.03732v3","updated":"2025-01-07T15:46:25Z","published":"2024-05-06T10:53:13Z","title":"Deep Learning-based Accelerated MR Cholangiopancreatography without\n Fully-sampled Data","summary":" The purpose of this study was to accelerate MR cholangiopancreatography\n(MRCP) acquisitions using deep learning-based (DL) reconstruction at 3T and\n0.55T. A total of 35 healthy volunteers underwent conventional two-fold\naccelerated MRCP scans at field strengths of 3T and 0.55T. We trained DL\nreconstructions using two different training strategies, supervised (SV) and\nself-supervised (SSV), with retrospectively six-fold undersampled data obtained\nat 3T. We then evaluated the DL reconstructions against standard techniques,\nparallel imaging (PI) and compressed sensing (CS), focusing on peak\nsignal-to-noise ratio (PSNR) and structural similarity (SSIM) as metrics. We\nalso tested DL reconstructions with prospectively accelerated acquisitions and\nevaluated their robustness when changing field strengths from 3T to 0.55T. DL\nreconstructions demonstrated a reduction in average acquisition time from\n599/542 to 255/180 seconds for MRCP at 3T/0.55T. In both retrospective and\nprospective undersampling, PSNR and SSIM of DL reconstructions were higher than\nthose of PI and CS. 
At the same time, DL reconstructions preserved the image\nquality of undersampled data, including sharpness and the visibility of\nhepatobiliary ducts. In addition, both DL approaches produced high-quality\nreconstructions at 0.55T. In summary, DL reconstructions trained for highly\naccelerated MRCP enabled a reduction in acquisition time by a factor of 2.4/3.0\nat 3T/0.55T while maintaining the image quality of conventional acquisitions.\n","authors":["Jinho Kim","Marcel Dominik Nickel","Florian Knoll"],"pdf_url":"https://arxiv.org/pdf/2405.03732v3.pdf","comment":"19 pages, 4 figures, 2 tables"},{"id":"http://arxiv.org/abs/2501.03880v1","updated":"2025-01-07T15:43:36Z","published":"2025-01-07T15:43:36Z","title":"SELMA3D challenge: Self-supervised learning for 3D light-sheet\n microscopy image segmentation","summary":" Recent innovations in light sheet microscopy, paired with developments in\ntissue clearing techniques, enable the 3D imaging of large mammalian tissues\nwith cellular resolution. Combined with the progress in large-scale data\nanalysis, driven by deep learning, these innovations empower researchers to\nrapidly investigate the morphological and functional properties of diverse\nbiological samples. Segmentation, a crucial preliminary step in the analysis\nprocess, can be automated using domain-specific deep learning models with\nexpert-level performance. However, these models exhibit high sensitivity to\ndomain shifts, leading to a significant drop in accuracy when applied to data\noutside their training distribution. To address this limitation, and inspired\nby the recent success of self-supervised learning in training generalizable\nmodels, we organized the SELMA3D Challenge during the MICCAI 2024 conference.\nSELMA3D provides a vast collection of light-sheet images from cleared mice and\nhuman brains, comprising 35 large 3D images-each with over 1000^3 voxels-and\n315 annotated small patches for finetuning, preliminary testing and final\ntesting. 
The dataset encompasses diverse biological structures, including\nvessel-like and spot-like structures. Five teams participated in all phases of\nthe challenge, and their proposed methods are reviewed in this paper.\nQuantitative and qualitative results from most participating teams demonstrate\nthat self-supervised learning on large datasets improves segmentation model\nperformance and generalization. We will continue to support and extend SELMA3D\nas an inaugural MICCAI challenge focused on self-supervised learning for 3D\nmicroscopy image segmentation.\n","authors":["Ying Chen","Rami Al-Maskari","Izabela Horvath","Mayar Ali","Luciano Höher","Kaiyuan Yang","Zengming Lin","Zhiwei Zhai","Mengzhe Shen","Dejin Xun","Yi Wang","Tony Xu","Maged Goubran","Yunheng Wu","Ali Erturk","Johannes C. Paetzold"],"pdf_url":"https://arxiv.org/pdf/2501.03880v1.pdf","comment":"1st version"},{"id":"http://arxiv.org/abs/2501.03877v1","updated":"2025-01-07T15:40:22Z","published":"2025-01-07T15:40:22Z","title":"Stochastically Constrained Best Arm Identification with Thompson\n Sampling","summary":" We consider the problem of the best arm identification in the presence of\nstochastic constraints, where there is a finite number of arms associated with\nmultiple performance measures. The goal is to identify the arm that optimizes\nthe objective measure subject to constraints on the remaining measures. We will\nexplore the popular idea of Thompson sampling (TS) as a means to solve it. To\nthe best of our knowledge, it is the first attempt to extend TS to this\nproblem. 
We will design a TS-based sampling algorithm, establish its asymptotic\noptimality in the rate of posterior convergence, and demonstrate its superior\nperformance using numerical examples.\n","authors":["Le Yang","Siyang Gao","Cheng Li","Yi Wang"],"pdf_url":"https://arxiv.org/pdf/2501.03877v1.pdf","comment":"30 pages, 12 figures, 1 table"},{"id":"http://arxiv.org/abs/2501.03874v1","updated":"2025-01-07T15:38:13Z","published":"2025-01-07T15:38:13Z","title":"Neuromorphic Optical Tracking and Imaging of Randomly Moving Targets\n through Strongly Scattering Media","summary":" Tracking and acquiring simultaneous optical images of randomly moving targets\nobscured by scattering media remains a challenging problem of importance to\nmany applications that require precise object localization and identification.\nIn this work we develop an end-to-end neuromorphic optical engineering and\ncomputational approach to demonstrate how to track and image normally invisible\nobjects by combining an event detecting camera with a multistage neuromorphic\ndeep learning strategy. Photons emerging from dense scattering media are\ndetected by the event camera and converted to pixel-wise asynchronized spike\ntrains - a first step in isolating object-specific information from the\ndominant uninformative background. Spiking data is fed into a deep spiking\nneural network (SNN) engine where object tracking and image reconstruction are\nperformed by two separate yet interconnected modules running in parallel in\ndiscrete time steps over the event duration. Through benchtop experiments we\ndemonstrate tracking and imaging randomly moving objects in dense turbid media\nas well as image reconstruction of spatially stationary but optically dynamic\nobjects. Standardized character sets serve as representative proxies for\ngeometrically complex objects, underscoring the method's generality. 
The\nresults highlight the advantages of a fully neuromorphic approach in meeting a\nmajor imaging technology challenge with high computational efficiency and low power\nconsumption.\n","authors":["Ning Zhang","Timothy Shea","Arto Nurmikko"],"pdf_url":"https://arxiv.org/pdf/2501.03874v1.pdf","comment":"22 pages, 6 figures"},{"id":"http://arxiv.org/abs/2410.13850v3","updated":"2025-01-07T15:28:09Z","published":"2024-10-17T17:59:02Z","title":"Influence Functions for Scalable Data Attribution in Diffusion Models","summary":" Diffusion models have led to significant advancements in generative\nmodelling. Yet their widespread adoption poses challenges regarding data\nattribution and interpretability. In this paper, we aim to help address such\nchallenges in diffusion models by developing an influence functions framework.\nInfluence function-based data attribution methods approximate how a model's\noutput would have changed if some training data were removed. In supervised\nlearning, this is usually used for predicting how the loss on a particular\nexample would change. For diffusion models, we focus on predicting the change\nin the probability of generating a particular example via several proxy\nmeasurements. We show how to formulate influence functions for such quantities\nand how previously proposed methods can be interpreted as particular design\nchoices in our framework. To ensure scalability of the Hessian computations in\ninfluence functions, we systematically develop K-FAC approximations based on\ngeneralised Gauss-Newton matrices specifically tailored to diffusion models. 
We\nrecast previously proposed methods as specific design choices in our framework\nand show that our recommended method outperforms previous data attribution\napproaches on common evaluations, such as the Linear Data-modelling Score (LDS)\nor retraining without top influences, without the need for method-specific\nhyperparameter tuning.\n","authors":["Bruno Mlodozeniec","Runa Eschenhagen","Juhan Bae","Alexander Immer","David Krueger","Richard Turner"],"pdf_url":"https://arxiv.org/pdf/2410.13850v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.11727v2","updated":"2025-01-07T15:26:14Z","published":"2024-05-20T02:09:07Z","title":"Highway Graph to Accelerate Reinforcement Learning","summary":" Reinforcement Learning (RL) algorithms often struggle with low training\nefficiency. A common approach to address this challenge is integrating\nmodel-based planning algorithms, such as Monte Carlo Tree Search (MCTS) or\nValue Iteration (VI), into the environmental model. However, VI requires\niterating over a large tensor which updates the value of the preceding state\nbased on the succeeding state through value propagation, resulting in\ncomputationally intensive operations. To enhance the RL training efficiency, we\npropose improving the efficiency of the value learning process. In\ndeterministic environments with discrete state and action spaces, we observe\nthat on the sampled empirical state-transition graph, a non-branching sequence\nof transitions-termed a highway-can take the agent to another state without\ndeviation through intermediate states. On these non-branching highways, the\nvalue-updating process can be streamlined into a single-step operation,\neliminating the need for step-by-step updates. Building on this observation, we\nintroduce the highway graph to model state transitions. 
The highway graph\ncompresses the transition model into a compact representation, where edges can\nencapsulate multiple state transitions, enabling value propagation across\nmultiple time steps in a single iteration. By integrating the highway graph\ninto RL, the training process is significantly accelerated, particularly in the\nearly stages of training. Experiments across four categories of environments\ndemonstrate that our method learns significantly faster than established and\nstate-of-the-art RL algorithms (often by a factor of 10 to 150) while\nmaintaining equal or superior expected returns. Furthermore, a deep neural\nnetwork-based agent trained using the highway graph exhibits improved\ngeneralization capabilities and reduced storage costs. Code is publicly\navailable at https://github.com/coodest/highwayRL.\n","authors":["Zidu Yin","Zhen Zhang","Dong Gong","Stefano V. Albrecht","Javen Q. Shi"],"pdf_url":"https://arxiv.org/pdf/2405.11727v2.pdf","comment":"Published in TMLR"},{"id":"http://arxiv.org/abs/2501.03865v1","updated":"2025-01-07T15:24:53Z","published":"2025-01-07T15:24:53Z","title":"Truthful mechanisms for linear bandit games with private contexts","summary":" The contextual bandit problem, where agents arrive sequentially with personal\ncontexts and the system adapts its arm allocation decisions accordingly, has\nrecently garnered increasing attention for enabling more personalized outcomes.\nHowever, in many healthcare and recommendation applications, agents have\nprivate profiles and may misreport their contexts to gain from the system. For\nexample, in adaptive clinical trials, where hospitals sequentially recruit\nvolunteers to test multiple new treatments and adjust plans based on\nvolunteers' reported profiles such as symptoms and interim data, participants\nmay misreport severe side effects like allergy and nausea to avoid perceived\nsuboptimal treatments. 
We are the first to study this issue of private context\nmisreporting in a stochastic contextual bandit game between the system and\nnon-repeated agents. We show that traditional low-regret algorithms, such as\nUCB family algorithms and Thompson sampling, fail to ensure truthful reporting\nand can result in linear regret in the worst case, while traditional truthful\nalgorithms like explore-then-commit (ETC) and $\\epsilon$-greedy algorithm incur\nsublinear but high regret. We propose a mechanism that uses a linear program to\nensure truthfulness while minimizing deviation from Thompson sampling, yielding\nan $O(\\ln T)$ frequentist regret. Our numerical experiments further demonstrate\nstrong performance in multiple contexts and across other distribution families.\n","authors":["Yiting Hu","Lingjie Duan"],"pdf_url":"https://arxiv.org/pdf/2501.03865v1.pdf","comment":"To appear at AAMAS 2025"},{"id":"http://arxiv.org/abs/2501.03858v1","updated":"2025-01-07T15:14:58Z","published":"2025-01-07T15:14:58Z","title":"Symmetry and Generalisation in Machine Learning","summary":" This work is about understanding the impact of invariance and equivariance on\ngeneralisation in supervised learning. We use the perspective afforded by an\naveraging operator to show that for any predictor that is not equivariant,\nthere is an equivariant predictor with strictly lower test risk on all\nregression problems where the equivariance is correctly specified. This\nconstitutes a rigorous proof that symmetry, in the form of invariance or\nequivariance, is a useful inductive bias.\n We apply these ideas to equivariance and invariance in random design least\nsquares and kernel ridge regression respectively. 
This allows us to specify the\nreduction in expected test risk in more concrete settings and express it in\nterms of properties of the group, the model and the data.\n Along the way, we give examples and additional results to demonstrate the\nutility of the averaging operator approach in analysing equivariant predictors.\nIn addition, we adopt an alternative perspective and formalise the common\nintuition that learning with invariant models reduces to a problem in terms of\norbit representatives. The formalism extends naturally to a similar intuition\nfor equivariant models. We conclude by connecting the two perspectives and\ngiving some ideas for future work.\n","authors":["Hayder Elesedy"],"pdf_url":"https://arxiv.org/pdf/2501.03858v1.pdf","comment":"PhD Thesis"},{"id":"http://arxiv.org/abs/2501.03038v2","updated":"2025-01-07T15:13:41Z","published":"2025-01-06T14:26:00Z","title":"Piano Transcription by Hierarchical Language Modeling with Pretrained\n Roll-based Encoders","summary":" Automatic Music Transcription (AMT), aiming to get musical notes from raw\naudio, typically uses frame-level systems with piano-roll outputs or language\nmodel (LM)-based systems with note-level predictions. However, frame-level\nsystems require manual thresholding, while the LM-based systems struggle with\nlong sequences. In this paper, we propose a hybrid method combining pre-trained\nroll-based encoders with an LM decoder to leverage the strengths of both\nmethods. Besides, our approach employs a hierarchical prediction strategy,\nfirst predicting onset and pitch, then velocity, and finally offset. The\nhierarchical prediction strategy reduces computational costs by breaking down\nlong sequences into different hierarchies. 
Evaluated on two benchmark\nroll-based encoders, our method outperforms traditional piano-roll outputs by 0.01\nand 0.022 in onset-offset-velocity F1 score, demonstrating its potential as a\nperformance-enhancing plug-in for arbitrary roll-based music transcription\nencoders.\n","authors":["Dichucheng Li","Yongyi Zang","Qiuqiang Kong"],"pdf_url":"https://arxiv.org/pdf/2501.03038v2.pdf","comment":"Accepted by ICASSP 2025"},{"id":"http://arxiv.org/abs/2501.03853v1","updated":"2025-01-07T15:10:07Z","published":"2025-01-07T15:10:07Z","title":"Leveraging time and parameters for nonlinear model reduction methods","summary":" In this paper, we consider model order reduction (MOR) methods for problems\nwith slowly decaying Kolmogorov $n$-widths as, e.g., certain wave-like or\ntransport-dominated problems. To overcome this Kolmogorov barrier within MOR,\nnonlinear projections are used, which are often realized numerically using\nautoencoders. These autoencoders generally consist of a nonlinear encoder and a\nnonlinear decoder and involve costly training of the hyperparameters to obtain\na good approximation quality of the reduced system. To facilitate the training\nprocess, we show that extending the to-be-reduced system and its corresponding\ntraining data makes it possible to replace the nonlinear encoder with a linear\nencoder without sacrificing accuracy, thus roughly halving the number of\nhyperparameters to be trained.\n","authors":["Silke Glas","Benjamin Unger"],"pdf_url":"https://arxiv.org/pdf/2501.03853v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.16670v2","updated":"2025-01-07T15:00:20Z","published":"2024-09-25T06:57:42Z","title":"GraphLoRA: Structure-Aware Contrastive Low-Rank Adaptation for\n Cross-Graph Transfer Learning","summary":" Graph Neural Networks (GNNs) have demonstrated remarkable proficiency in\nhandling a range of graph analytical tasks across various domains, such as\ne-commerce and social networks. 
Despite their versatility, GNNs face\nsignificant challenges in transferability, limiting their utility in real-world\napplications. Existing research in GNN transfer learning overlooks\ndiscrepancies in distribution among various graph datasets, facing challenges\nwhen transferring across different distributions. How to effectively adapt a\nwell-trained GNN to new graphs with varying feature and structural\ndistributions remains an under-explored problem. Taking inspiration from the\nsuccess of Low-Rank Adaptation (LoRA) in adapting large language models to\nvarious domains, we propose GraphLoRA, an effective and parameter-efficient\nmethod for transferring well-trained GNNs to diverse graph domains.\nSpecifically, we first propose a Structure-aware Maximum Mean Discrepancy\n(SMMD) to align divergent node feature distributions across source and target\ngraphs. Moreover, we introduce low-rank adaptation by injecting a small\ntrainable GNN alongside the pre-trained one, effectively bridging structural\ndistribution gaps while mitigating catastrophic forgetting. Additionally, a\nstructure-aware regularization objective is proposed to enhance the\nadaptability of the pre-trained GNN to the target graph with scarce supervision\nlabels. Extensive experiments on eight real-world datasets demonstrate the\neffectiveness of GraphLoRA against fourteen baselines by tuning only 20% of\nparameters, even across disparate graph domains. 
The code is available at\nhttps://github.com/AllminerLab/GraphLoRA.\n","authors":["Zhe-Rui Yang","Jindong Han","Chang-Dong Wang","Hao Liu"],"pdf_url":"https://arxiv.org/pdf/2409.16670v2.pdf","comment":"Accepted by KDD2025"},{"id":"http://arxiv.org/abs/2501.03843v1","updated":"2025-01-07T14:53:35Z","published":"2025-01-07T14:53:35Z","title":"BERTopic for Topic Modeling of Hindi Short Texts: A Comparative Study","summary":" As short text data in native languages like Hindi increasingly appear in\nmodern media, robust methods for topic modeling on such data have gained\nimportance. This study investigates the performance of BERTopic in modeling\nHindi short texts, an area that has been under-explored in existing research.\nUsing contextual embeddings, BERTopic can capture semantic relationships in\ndata, making it potentially more effective than traditional models, especially\nfor short and diverse texts. We evaluate BERTopic using 6 different document\nembedding models and compare its performance against 8 established topic\nmodeling techniques, such as Latent Dirichlet Allocation (LDA), Non-negative\nMatrix Factorization (NMF), Latent Semantic Indexing (LSI), Additive\nRegularization of Topic Models (ARTM), Probabilistic Latent Semantic Analysis\n(PLSA), Embedded Topic Model (ETM), Combined Topic Model (CTM), and Top2Vec.\nThe models are assessed using coherence scores across a range of topic counts.\nOur results reveal that BERTopic consistently outperforms other models in\ncapturing coherent topics from short Hindi texts.\n","authors":["Atharva Mutsaddi","Anvi Jamkhande","Aryan Thakre","Yashodhara Haribhakta"],"pdf_url":"https://arxiv.org/pdf/2501.03843v1.pdf","comment":"Accepted into IndoNLP: The First Workshop on Natural Language\n Processing for Indo-Aryan and Dravidian Languages, collocated with COLING\n 2025. 
Set to appear in the workshop proceedings published in ACL Anthology"},{"id":"http://arxiv.org/abs/2403.18873v2","updated":"2025-01-07T14:52:34Z","published":"2024-03-26T14:42:46Z","title":"Predicting risk of cardiovascular disease using retinal OCT imaging","summary":" Cardiovascular diseases (CVD) are the leading cause of death globally.\nNon-invasive, cost-effective imaging techniques play a crucial role in early\ndetection and prevention of CVD. Optical coherence tomography (OCT) has gained\nrecognition as a potential tool for early CVD risk prediction, though its use\nremains underexplored. In this study, we investigated the potential of OCT as\nan additional imaging technique to predict future CVD events. We analysed\nretinal OCT data from the UK Biobank. The dataset included 612 patients who\nsuffered a myocardial infarction (MI) or stroke within five years of imaging\nand 2,234 controls without CVD (total: 2,846 participants). A self-supervised\ndeep learning approach based on Variational Autoencoders (VAE) was used to\nextract low-dimensional latent representations from high-dimensional 3D OCT\nimages, capturing distinct features of retinal layers. These latent features,\nalong with clinical data, were used to train a Random Forest (RF) classifier to\ndifferentiate between patients at risk of future CVD events (MI or stroke) and\nhealthy controls. Our model achieved an AUC of 0.75, sensitivity of 0.70,\nspecificity of 0.70, and accuracy of 0.70, outperforming the QRISK3 score (the\nthird version of the QRISK cardiovascular disease risk prediction algorithm;\nAUC = 0.60, sensitivity = 0.60, specificity = 0.55, accuracy = 0.55). The\nchoroidal layer in OCT images was identified as a key predictor of future CVD\nevents, revealed through a novel model explainability approach. 
This study\ndemonstrates that retinal OCT imaging is a cost-effective, non-invasive\nalternative for predicting CVD risk, offering potential for widespread\napplication in optometry practices and hospitals.\n","authors":["Cynthia Maldonado-Garcia","Rodrigo Bonazzola","Enzo Ferrante","Thomas H Julian","Panagiotis I Sergouniotis","Nishant Ravikumara","Alejandro F Frangi"],"pdf_url":"https://arxiv.org/pdf/2403.18873v2.pdf","comment":"New version - 26 pages for main manuscript, 7 figures, 7 pages for\n appendix and preprint for a journal"},{"id":"http://arxiv.org/abs/2501.03840v1","updated":"2025-01-07T14:50:05Z","published":"2025-01-07T14:50:05Z","title":"Machine learning applications in archaeological practices: a review","summary":" Artificial intelligence and machine learning applications in archaeology have\nincreased significantly in recent years, and these now span all subfields,\ngeographical regions, and time periods. The prevalence and success of these\napplications have remained largely unexamined, as recent reviews on the use of\nmachine learning in archaeology have focused only on specific subfields of\narchaeology. Our review examined an exhaustive corpus of 135 articles published\nbetween 1997 and 2022. We observed a significant increase in the number of\nrelevant publications from 2019 onwards. Automatic structure detection and\nartefact classification were the most represented tasks in the articles\nreviewed, followed by taphonomy, and archaeological predictive modelling. From\nthe review, clustering and unsupervised methods were underrepresented compared\nto supervised models. Artificial neural networks and ensemble learning account\nfor two thirds of the total number of models used. However, while machine learning\nis gaining in popularity, it remains subject to misunderstanding. We observed,\nin some cases, poorly defined requirements and caveats of the machine learning\nmethods used. 
Furthermore, the goals and the needs of machine learning\napplications for archaeological purposes are in some cases unclear or poorly\nexpressed. To address this, we proposed a workflow guide for archaeologists to\ndevelop coherent and consistent methodologies adapted to their research\nquestions, project scale and data. As in many other areas, machine learning is\nrapidly becoming an important tool in archaeological research and practice,\nuseful for the analyses of large and multivariate data, although not without\nlimitations. This review highlights the importance of well-defined and\nwell-reported structured methodologies and collaborative practices to maximise\nthe potential of applications of machine learning methods in archaeology.\n","authors":["Mathias Bellat","Jordy D. Orellana Figueroa","Jonathan S. Reeves","Ruhollah Taghizadeh-Mehrjardi","Claudio Tennie","Thomas Scholten"],"pdf_url":"https://arxiv.org/pdf/2501.03840v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.12968v2","updated":"2025-01-07T14:45:04Z","published":"2024-12-17T14:53:38Z","title":"On Local Overfitting and Forgetting in Deep Neural Networks","summary":" The infrequent occurrence of overfitting in deep neural networks is\nperplexing: contrary to theoretical expectations, increasing model size often\nenhances performance in practice. But what if overfitting does occur, though\nrestricted to specific sub-regions of the data space? In this work, we propose\na novel score that captures the forgetting rate of deep models on validation\ndata. We posit that this score quantifies local overfitting: a decline in\nperformance confined to certain regions of the data space. We then show\nempirically that local overfitting occurs regardless of the presence of\ntraditional overfitting. 
Using the framework of deep over-parametrized linear\nmodels, we offer a certain theoretical characterization of forgotten knowledge,\nand show that it correlates with knowledge forgotten by real deep models.\nFinally, we devise a new ensemble method that aims to recover forgotten\nknowledge, relying solely on the training history of a single network. When\ncombined with self-distillation, this method enhances the performance of any\ntrained model without adding inference costs. Extensive empirical evaluations\ndemonstrate the efficacy of our method across multiple datasets, contemporary\nneural network architectures, and training protocols.\n","authors":["Uri Stern","Tomer Yaacoby","Daphna Weinshall"],"pdf_url":"https://arxiv.org/pdf/2412.12968v2.pdf","comment":"arXiv admin note: substantial text overlap with arXiv:2310.11094"},{"id":"http://arxiv.org/abs/2501.03832v1","updated":"2025-01-07T14:42:38Z","published":"2025-01-07T14:42:38Z","title":"Three-dimensional attention Transformer for state evaluation in\n real-time strategy games","summary":" Situation assessment in Real-Time Strategy (RTS) games is crucial for\nunderstanding decision-making in complex adversarial environments. However,\nexisting methods remain limited in processing multi-dimensional feature\ninformation and temporal dependencies. 
Here we propose a tri-dimensional\nSpace-Time-Feature Transformer (TSTF Transformer) architecture, which\nefficiently models battlefield situations through three independent but\ncascaded modules: spatial attention, temporal attention, and feature attention.\nOn a dataset comprising 3,150 adversarial experiments, the 8-layer TSTF\nTransformer demonstrates superior performance: achieving 58.7% accuracy in the\nearly game (~4% progress), significantly outperforming the conventional\nTimesformer's 41.8%; reaching 97.6% accuracy in the mid-game (~40% progress)\nwhile maintaining low performance variation (standard deviation 0.114).\nMeanwhile, this architecture requires fewer parameters (4.75M) compared to the\nbaseline model (5.54M). Our study not only provides new insights into situation\nassessment in RTS games but also presents an innovative paradigm for\nTransformer-based multi-dimensional temporal modeling.\n","authors":["Yanqing Ye","Weilong Yang","Kai Qiu","Jie Zhang"],"pdf_url":"https://arxiv.org/pdf/2501.03832v1.pdf","comment":"9 pages, 5 figures"},{"id":"http://arxiv.org/abs/2408.12875v2","updated":"2025-01-07T14:39:26Z","published":"2024-08-23T07:14:56Z","title":"Disentangling, Amplifying, and Debiasing: Learning Disentangled\n Representations for Fair Graph Neural Networks","summary":" Graph Neural Networks (GNNs) have become essential tools for graph\nrepresentation learning in various domains, such as social media and\nhealthcare. However, they often suffer from fairness issues due to inherent\nbiases in node attributes and graph structure, leading to unfair predictions.\nTo address these challenges, we propose a novel GNN framework, DAB-GNN, that\nDisentangles, Amplifies, and deBiases attribute, structure, and potential\nbiases in the GNN mechanism. 
DAB-GNN employs a disentanglement and\namplification module that isolates and amplifies each type of bias through\nspecialized disentanglers, followed by a debiasing module that minimizes the\ndistance between subgroup distributions. Extensive experiments on five datasets\ndemonstrate that DAB-GNN significantly outperforms ten state-of-the-art\ncompetitors in terms of achieving an optimal balance between accuracy and\nfairness. The codebase of DAB-GNN is available at\nhttps://github.com/Bigdasgit/DAB-GNN\n","authors":["Yeon-Chang Lee","Hojung Shin","Sang-Wook Kim"],"pdf_url":"https://arxiv.org/pdf/2408.12875v2.pdf","comment":"Accepted by AAAI 2025"},{"id":"http://arxiv.org/abs/2501.03826v1","updated":"2025-01-07T14:38:49Z","published":"2025-01-07T14:38:49Z","title":"Investigating the Impact of Data Selection Strategies on Language Model\n Performance","summary":" Data selection is critical for enhancing the performance of language models,\nparticularly when aligning training datasets with a desired target\ndistribution. This study explores the effects of different data selection\nmethods and feature types on model performance. We evaluate whether selecting\ndata subsets can influence downstream tasks, whether n-gram features improve\nalignment with target distributions, and whether embedding-based neural\nfeatures provide complementary benefits. Through comparative experiments using\nbaseline random selection methods and distribution aligned approaches, we\nprovide insights into the interplay between data selection strategies and model\ntraining efficacy. 
All code for this study can be found on\n\\href{https://github.com/jgu13/HIR-Hybrid-Importance-Resampling-for-Language-Models}{github\nrepository}.\n","authors":["Jiayao Gu","Liting Chen","Yihong Li"],"pdf_url":"https://arxiv.org/pdf/2501.03826v1.pdf","comment":"7 pages, 1 figure"},{"id":"http://arxiv.org/abs/2501.03821v1","updated":"2025-01-07T14:35:09Z","published":"2025-01-07T14:35:09Z","title":"Class-Balance Bias in Regularized Regression","summary":" Regularized models are often sensitive to the scales of the features in the\ndata and it has therefore become standard practice to normalize (center and\nscale) the features before fitting the model. But there are many different ways\nto normalize the features and the choice may have dramatic effects on the\nresulting model. In spite of this, there has so far been no research on this\ntopic. In this paper, we begin to bridge this knowledge gap by studying\nnormalization in the context of lasso, ridge, and elastic net regression. We\nfocus on normal and binary features and show that the class balances of binary\nfeatures directly influences the regression coefficients and that this effect\ndepends on the combination of normalization and regularization methods used. We\ndemonstrate that this effect can be mitigated by scaling binary features with\ntheir variance in the case of the lasso and standard deviation in the case of\nridge regression, but that this comes at the cost of increased variance. For\nthe elastic net, we show that scaling the penalty weights, rather than the\nfeatures, can achieve the same effect. 
Finally, we also tackle mixes of binary\nand normal features as well as interactions and provide some initial results on\nhow to normalize features in these cases.\n","authors":["Johan Larsson","Jonas Wallin"],"pdf_url":"https://arxiv.org/pdf/2501.03821v1.pdf","comment":"27 pages, 21 figures"},{"id":"http://arxiv.org/abs/2412.19950v2","updated":"2025-01-07T14:35:01Z","published":"2024-12-27T23:10:32Z","title":"Data-driven tool wear prediction in milling, based on a\n process-integrated single-sensor approach","summary":" Accurate tool wear prediction is essential for maintaining productivity and\nminimizing costs in machining. However, the complex nature of the tool wear\nprocess poses significant challenges to achieving reliable predictions. This\nstudy explores data-driven methods, in particular deep learning, for tool wear\nprediction. Traditional data-driven approaches often focus on a single process,\nrelying on multi-sensor setups and extensive data generation, which limits\ngeneralization to new settings. Moreover, multi-sensor integration is often\nimpractical in industrial environments. To address these limitations, this\nresearch investigates the transferability of predictive models using minimal\ntraining data, validated across two processes. Furthermore, it uses a simple\nsetup with a single acceleration sensor to establish a low-cost data generation\napproach that facilitates the generalization of models to other processes via\ntransfer learning. The study evaluates several machine learning models,\nincluding convolutional neural networks (CNN), long short-term memory networks\n(LSTM), support vector machines (SVM) and decision trees, trained on different\ninput formats such as feature vectors and short-time Fourier transform (STFT).\nThe performance of the models is evaluated on different amounts of training\ndata, including scenarios with significantly reduced datasets, providing\ninsight into their effectiveness under constrained data conditions. 
The results\ndemonstrate the potential of specific models and configurations for effective\ntool wear prediction, contributing to the development of more adaptable and\nefficient predictive maintenance strategies in machining. Notably, the ConvNeXt\nmodel has exceptional performance, achieving 99.1% accuracy in\nidentifying tool wear using data from only four milling tools operated until\nthey are worn.\n","authors":["Eric Hirsch","Christian Friedrich"],"pdf_url":"https://arxiv.org/pdf/2412.19950v2.pdf","comment":"Preprint submitted to Robotics and Computer-Integrated Manufacturing,\n 14 pages, 9 figures"},{"id":"http://arxiv.org/abs/2501.01707v2","updated":"2025-01-07T14:28:54Z","published":"2025-01-03T09:09:58Z","title":"Catch Causal Signals from Edges for Label Imbalance in Graph\n Classification","summary":" Despite significant advancements in causal research on graphs and its\napplication to cracking label imbalance, the role of edge features in detecting\nthe causal effects within graphs has been largely overlooked, leaving existing\nmethods with untapped potential for further performance gains. In this paper,\nwe enhance the causal attention mechanism through effectively leveraging edge\ninformation to disentangle the causal subgraph from the original graph, as well\nas further utilizing edge features to reshape graph representations. Capturing\nmore comprehensive causal signals, our design leads to improved performance on\ngraph classification tasks with label imbalance issues. We evaluate our\napproach on real-world datasets PTC, Tox21, and ogbg-molhiv, observing\nimprovements over baselines. Overall, we highlight the importance of edge\nfeatures in graph causal detection and provide a promising direction for\naddressing label imbalance challenges in graph-level tasks. 
The model\nimplementation details and the codes are available on\nhttps://github.com/fengrui-z/ECAL\n","authors":["Fengrui Zhang","Yujia Yin","Hongzong Li","Yifan Chen","Tianyi Qu"],"pdf_url":"https://arxiv.org/pdf/2501.01707v2.pdf","comment":"ICASSP 2025"},{"id":"http://arxiv.org/abs/2405.15840v2","updated":"2025-01-07T14:24:59Z","published":"2024-05-24T16:03:47Z","title":"Learning the Language of Protein Structure","summary":" Representation learning and \\emph{de novo} generation of proteins are pivotal\ncomputational biology tasks. Whilst natural language processing (NLP)\ntechniques have proven highly effective for protein sequence modelling,\nstructure modelling presents a complex challenge, primarily due to its\ncontinuous and three-dimensional nature. Motivated by this discrepancy, we\nintroduce an approach using a vector-quantized autoencoder that effectively\ntokenizes protein structures into discrete representations. This method\ntransforms the continuous, complex space of protein structures into a\nmanageable, discrete format with a codebook ranging from 4096 to 64000 tokens,\nachieving high-fidelity reconstructions with backbone root mean square\ndeviations (RMSD) of approximately 1-5 \\AA. To demonstrate the efficacy of our\nlearned representations, we show that a simple GPT model trained on our\ncodebooks can generate novel, diverse, and designable protein structures. Our\napproach not only provides representations of protein structure, but also\nmitigates the challenges of disparate modal representations and sets a\nfoundation for seamless, multi-modal integration, enhancing the capabilities of\ncomputational methods in protein design.\n","authors":["Benoit Gaujac","Jérémie Donà","Liviu Copoiu","Timothy Atkinson","Thomas Pierrot","Thomas D. 
Barrett"],"pdf_url":"https://arxiv.org/pdf/2405.15840v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.01460v2","updated":"2025-01-07T14:19:35Z","published":"2024-12-31T10:43:19Z","title":"GDSR: Global-Detail Integration through Dual-Branch Network with Wavelet\n Losses for Remote Sensing Image Super-Resolution","summary":" In recent years, deep neural networks, including Convolutional Neural\nNetworks, Transformers, and State Space Models, have achieved significant\nprogress in Remote Sensing Image (RSI) Super-Resolution (SR). However, existing\nSR methods typically overlook the complementary relationship between global and\nlocal dependencies. These methods either focus on capturing local information\nor prioritize global information, which results in models that are unable to\neffectively capture both global and local features simultaneously. Moreover,\ntheir computational cost becomes prohibitive when applied to large-scale RSIs.\nTo address these challenges, we introduce the novel application of Receptance\nWeighted Key Value (RWKV) to RSI-SR, which captures long-range dependencies\nwith linear complexity. To simultaneously model global and local features, we\npropose the Global-Detail dual-branch structure, GDSR, which performs SR\nreconstruction by paralleling RWKV and convolutional operations to handle\nlarge-scale RSIs. Furthermore, we introduce the Global-Detail Reconstruction\nModule (GDRM) as an intermediary between the two branches to bridge their\ncomplementary roles. In addition, we propose Wavelet Loss, a loss function that\neffectively captures high-frequency detail information in images, thereby\nenhancing the visual quality of SR, particularly in terms of detail\nreconstruction. 
Extensive experiments on several benchmarks, including AID,\nAID_CDM, RSSRD-QH, and RSSRD-QH_CDM, demonstrate that GDSR outperforms the\nstate-of-the-art Transformer-based method HAT by an average of 0.05 dB in PSNR,\nwhile using only 63% of its parameters and 51% of its FLOPs, achieving an\ninference speed 2.9 times faster. Furthermore, the Wavelet Loss shows excellent\ngeneralization across various architectures, providing a novel perspective for\nRSI-SR enhancement.\n","authors":["Qiwei Zhu","Kai Li","Guojing Zhang","Xiaoying Wang","Jianqiang Huang","Xilai Li"],"pdf_url":"https://arxiv.org/pdf/2501.01460v2.pdf","comment":"The experiments were conducted using private datasets that were\n incomplete as they did not include all the necessary copyrights.\n Additionally, the conclusions require further exploration as the work is\n still in progress"},{"id":"http://arxiv.org/abs/2501.03782v1","updated":"2025-01-07T13:45:09Z","published":"2025-01-07T13:45:09Z","title":"Vision Transformer Neural Architecture Search for Out-of-Distribution\n Generalization: Benchmark and Insights","summary":" While ViTs have achieved success across machine learning tasks, deploying them\nin real-world scenarios faces a critical challenge: generalizing under OoD shifts.\nA crucial research gap exists in understanding how to design ViT architectures,\nboth manually and automatically, for better OoD generalization. To this end, we\nintroduce OoD-ViT-NAS, the first systematic benchmark for ViTs NAS focused on\nOoD generalization. This benchmark includes 3000 ViT architectures of varying\ncomputational budgets evaluated on 8 common OoD datasets. Using this benchmark,\nwe analyze factors contributing to OoD generalization. Our findings reveal key\ninsights. First, ViT architecture designs significantly affect OoD\ngeneralization. Second, ID accuracy is often a poor indicator of OoD accuracy,\nhighlighting the risk of optimizing ViT architectures solely for ID\nperformance. 
Third, we perform the first study of NAS for ViTs OoD robustness,\nanalyzing 9 Training-free NAS methods. We find that existing Training-free NAS\nmethods are largely ineffective in predicting OoD accuracy despite excelling at\nID accuracy. Simple proxies like Param or Flop surprisingly outperform complex\nTraining-free NAS methods in predicting OoD accuracy. Finally, we study how ViT\narchitectural attributes impact OoD generalization and discover that increasing\nembedding dimensions generally enhances performance. Our benchmark shows that\nViT architectures exhibit a wide range of OoD accuracy, with up to 11.85%\nimprovement for some OoD shifts. This underscores the importance of studying\nViT architecture design for OoD. We believe OoD-ViT-NAS can catalyze further\nresearch into how ViT designs influence OoD generalization.\n","authors":["Sy-Tuyen Ho","Tuan Van Vo","Somayeh Ebrahimkhani","Ngai-Man Cheung"],"pdf_url":"https://arxiv.org/pdf/2501.03782v1.pdf","comment":"Accepted in NeurIPS 2024"},{"id":"http://arxiv.org/abs/2406.04280v3","updated":"2025-01-07T13:43:36Z","published":"2024-06-06T17:26:40Z","title":"xMIL: Insightful Explanations for Multiple Instance Learning in\n Histopathology","summary":" Multiple instance learning (MIL) is an effective and widely used approach for\nweakly supervised machine learning. In histopathology, MIL models have achieved\nremarkable success in tasks like tumor detection, biomarker prediction, and\noutcome prognostication. However, MIL explanation methods are still lagging\nbehind, as they are limited to small bag sizes or disregard instance\ninteractions. We revisit MIL through the lens of explainable AI (XAI) and\nintroduce xMIL, a refined framework with more general assumptions. We\ndemonstrate how to obtain improved MIL explanations using layer-wise relevance\npropagation (LRP) and conduct extensive evaluation experiments on three toy\nsettings and four real-world histopathology datasets. 
Our approach consistently\noutperforms previous explanation attempts with particularly improved\nfaithfulness scores on challenging biomarker prediction tasks. Finally, we\nshowcase how xMIL explanations enable pathologists to extract insights from MIL\nmodels, representing a significant advance for knowledge discovery and model\ndebugging in digital histopathology. Codes are available at:\nhttps://github.com/bifold-pathomics/xMIL.\n","authors":["Julius Hense","Mina Jamshidi Idaji","Oliver Eberle","Thomas Schnake","Jonas Dippel","Laure Ciernik","Oliver Buchstab","Andreas Mock","Frederick Klauschen","Klaus-Robert Müller"],"pdf_url":"https://arxiv.org/pdf/2406.04280v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.00546v3","updated":"2025-01-07T13:31:01Z","published":"2023-12-31T17:21:02Z","title":"AllSpark: A Multimodal Spatio-Temporal General Intelligence Model with\n Ten Modalities via Language as a Reference Framework","summary":" Leveraging multimodal data is an inherent requirement for comprehending\ngeographic objects. However, due to the high heterogeneity in structure and\nsemantics among various spatio-temporal modalities, the joint interpretation of\nmultimodal spatio-temporal data has long been an extremely challenging problem.\nThe primary challenge resides in striking a trade-off between the cohesion and\nautonomy of diverse modalities. This trade-off becomes progressively nonlinear\nas the number of modalities expands. Inspired by the human cognitive system and\nlinguistic philosophy, where perceptual signals from the five senses converge\ninto language, we introduce the Language as Reference Framework (LaRF), a\nfundamental principle for constructing a multimodal unified model. Building\nupon this, we propose AllSpark, a multimodal spatio-temporal general artificial\nintelligence model. Our model integrates ten different modalities into a\nunified framework. 
To achieve modal cohesion, AllSpark introduces a modal\nbridge and multimodal large language model (LLM) to map diverse modal features\ninto the language feature space. To maintain modality autonomy, AllSpark uses\nmodality-specific encoders to extract the tokens of various spatio-temporal\nmodalities. Finally, observing a gap between the model's interpretability and\ndownstream tasks, we designed modality-specific prompts and task heads,\nenhancing the model's generalization capability across specific tasks.\nExperiments indicate that the incorporation of language enables AllSpark to\nexcel in few-shot classification tasks for RGB and point cloud modalities\nwithout additional training, surpassing baseline performance by up to 41.82\\%.\nThe source code is available at https://github.com/GeoX-Lab/AllSpark.\n","authors":["Run Shao","Cheng Yang","Qiujun Li","Qing Zhu","Yongjun Zhang","YanSheng Li","Yu Liu","Yong Tang","Dapeng Liu","Shizhong Yang","Haifeng Li"],"pdf_url":"https://arxiv.org/pdf/2401.00546v3.pdf","comment":"19 pages, 19 tables, 3 figures"},{"id":"http://arxiv.org/abs/2407.15857v2","updated":"2025-01-07T13:28:00Z","published":"2024-07-08T06:38:50Z","title":"BoRA: Bayesian Hierarchical Low-Rank Adaption for Multi-Task Large\n Language Models","summary":" This paper introduces Bayesian Hierarchical Low-Rank Adaption (BoRA), a novel\nmethod for finetuning multi-task Large Language Models (LLMs). Current\nfinetuning approaches, such as Low-Rank Adaption (LoRA), perform exceptionally\nwell in reducing training parameters and memory usage but face limitations when\napplied to multiple similar tasks. Practitioners usually have to choose between\ntraining separate models for each task or a single model for all tasks, both of\nwhich come with trade-offs in specialization and data utilization. BoRA\naddresses these trade-offs by leveraging a Bayesian hierarchical model that\nallows tasks to share information through global hierarchical priors. 
This\nenables tasks with limited data to benefit from the overall structure derived\nfrom related tasks while allowing tasks with more data to specialize. Our\nexperimental results show that BoRA outperforms both individual and unified\nmodel approaches, achieving lower perplexity and better generalization across\ntasks. This method provides a scalable and efficient solution for multi-task\nLLM finetuning, with significant practical implications for diverse\napplications.\n","authors":["Simen Eide","Arnoldo Frigessi"],"pdf_url":"https://arxiv.org/pdf/2407.15857v2.pdf","comment":"14 pages, 5 figures"},{"id":"http://arxiv.org/abs/2501.03769v1","updated":"2025-01-07T13:22:35Z","published":"2025-01-07T13:22:35Z","title":"Multi-label Cross-lingual automatic music genre classification from\n lyrics with Sentence BERT","summary":" Music genres are shaped by both the stylistic features of songs and the\ncultural preferences of artists' audiences. Automatic classification of music\ngenres using lyrics can be useful in several applications such as\nrecommendation systems, playlist creation, and library organization. We present\na multi-label, cross-lingual genre classification system based on multilingual\nsentence embeddings generated by sBERT. Using a bilingual Portuguese-English\ndataset with eight overlapping genres, we demonstrate the system's ability to\ntrain on lyrics in one language and predict genres in another. Our approach\noutperforms the baseline approach of translating lyrics and using a\nbag-of-words representation, improving the genrewise average F1-Score from 0.35\nto 0.69. The classifier uses a one-vs-all architecture, enabling it to assign\nmultiple genre labels to a single lyric. Experimental results reveal that\ndataset centralization notably improves cross-lingual performance. 
This\napproach offers a scalable solution for genre classification across\nunderrepresented languages and cultural domains, advancing the capabilities of\nmusic information retrieval systems.\n","authors":["Tiago Fernandes Tavares","Fabio José Ayres"],"pdf_url":"https://arxiv.org/pdf/2501.03769v1.pdf","comment":"5 pages"},{"id":"http://arxiv.org/abs/2310.20708v3","updated":"2025-01-07T13:11:19Z","published":"2023-10-31T17:59:56Z","title":"Unexpected Improvements to Expected Improvement for Bayesian\n Optimization","summary":" Expected Improvement (EI) is arguably the most popular acquisition function\nin Bayesian optimization and has found countless successful applications, but\nits performance is often exceeded by that of more recent methods. Notably, EI\nand its variants, including for the parallel and multi-objective settings, are\nchallenging to optimize because their acquisition values vanish numerically in\nmany regions. This difficulty generally increases as the number of\nobservations, dimensionality of the search space, or the number of constraints\ngrow, resulting in performance that is inconsistent across the literature and\nmost often sub-optimal. Herein, we propose LogEI, a new family of acquisition\nfunctions whose members either have identical or approximately equal optima as\ntheir canonical counterparts, but are substantially easier to optimize\nnumerically. We demonstrate that numerical pathologies manifest themselves in\n\"classic\" analytic EI, Expected Hypervolume Improvement (EHVI), as well as\ntheir constrained, noisy, and parallel variants, and propose corresponding\nreformulations that remedy these pathologies. 
Our empirical results show that\nmembers of the LogEI family of acquisition functions substantially improve on\nthe optimization performance of their canonical counterparts and surprisingly,\nare on par with or exceed the performance of recent state-of-the-art\nacquisition functions, highlighting the understated role of numerical\noptimization in the literature.\n","authors":["Sebastian Ament","Samuel Daulton","David Eriksson","Maximilian Balandat","Eytan Bakshy"],"pdf_url":"https://arxiv.org/pdf/2310.20708v3.pdf","comment":"NeurIPS 2023 Spotlight (https://openreview.net/forum?id=QFgYOtOkDB)"},{"id":"http://arxiv.org/abs/2410.24222v2","updated":"2025-01-07T13:04:51Z","published":"2024-10-31T17:59:56Z","title":"Robust Gaussian Processes via Relevance Pursuit","summary":" Gaussian processes (GPs) are non-parametric probabilistic regression models\nthat are popular due to their flexibility, data efficiency, and well-calibrated\nuncertainty estimates. However, standard GP models assume homoskedastic\nGaussian noise, while many real-world applications are subject to non-Gaussian\ncorruptions. Variants of GPs that are more robust to alternative noise models\nhave been proposed, and entail significant trade-offs between accuracy and\nrobustness, and between computational requirements and theoretical guarantees.\nIn this work, we propose and study a GP model that achieves robustness against\nsparse outliers by inferring data-point-specific noise levels with a sequential\nselection procedure maximizing the log marginal likelihood that we refer to as\nrelevance pursuit. We show, surprisingly, that the model can be parameterized\nsuch that the associated log marginal likelihood is strongly concave in the\ndata-point-specific noise variances, a property rarely found in either robust\nregression objectives or GP marginal likelihoods. 
This in turn implies the weak\nsubmodularity of the corresponding subset selection problem, and thereby proves\napproximation guarantees for the proposed algorithm. We compare the model's\nperformance relative to other approaches on diverse regression and Bayesian\noptimization tasks, including the challenging but common setting of sparse\ncorruptions of the labels within or close to the function range.\n","authors":["Sebastian Ament","Elizabeth Santorella","David Eriksson","Ben Letham","Maximilian Balandat","Eytan Bakshy"],"pdf_url":"https://arxiv.org/pdf/2410.24222v2.pdf","comment":"NeurIPS 2024 Article (https://openreview.net/forum?id=5FATPIlWUJ)"},{"id":"http://arxiv.org/abs/2409.09138v2","updated":"2025-01-07T13:04:13Z","published":"2024-09-13T18:42:11Z","title":"Fast Structured Orthogonal Dictionary Learning using Householder\n Reflections","summary":" In this paper, we propose and investigate algorithms for the structured\northogonal dictionary learning problem. First, we investigate the case when the\ndictionary is a Householder matrix. We give sample complexity results and show\ntheoretically guaranteed approximate recovery (in the $l_{\\infty}$ sense) with\noptimal computational complexity. We then attempt to generalize these\ntechniques when the dictionary is a product of a few Householder matrices. 
We\nnumerically validate these techniques in the sample-limited setting to show\nperformance similar to or better than existing techniques while having much\nimproved computational complexity.\n","authors":["Anirudh Dash","Aditya Siripuram"],"pdf_url":"https://arxiv.org/pdf/2409.09138v2.pdf","comment":"12 pages, 5 figures, accepted for publication: IEEE ICASSP, 2025"},{"id":"http://arxiv.org/abs/2409.18301v3","updated":"2025-01-07T12:44:48Z","published":"2024-09-26T21:16:51Z","title":"Wavelet-Driven Generalizable Framework for Deepfake Face Forgery\n Detection","summary":" The evolution of digital image manipulation, particularly with the\nadvancement of deep generative models, significantly challenges existing\ndeepfake detection methods, especially when the origin of the deepfake is\nobscure. To tackle the increasing complexity of these forgeries, we propose\n\\textbf{Wavelet-CLIP}, a deepfake detection framework that integrates wavelet\ntransforms with features derived from the ViT-L/14 architecture, pre-trained in\nthe CLIP fashion. Wavelet-CLIP utilizes Wavelet Transforms to deeply analyze\nboth spatial and frequency features from images, thus enhancing the model's\ncapability to detect sophisticated deepfakes. To verify the effectiveness of\nour approach, we conducted extensive evaluations against existing\nstate-of-the-art methods for cross-dataset generalization and detection of\nunseen images generated by standard diffusion models. Our method showcases\noutstanding performance, achieving an average AUC of 0.749 for cross-data\ngeneralization and 0.893 for robustness against unseen deepfakes, outperforming\nall compared methods. 
The code can be reproduced from the repo:\n\\url{https://github.com/lalithbharadwajbaru/Wavelet-CLIP}\n","authors":["Lalith Bharadwaj Baru","Rohit Boddeda","Shilhora Akshay Patel","Sai Mohan Gajapaka"],"pdf_url":"https://arxiv.org/pdf/2409.18301v3.pdf","comment":"9 Pages, 2 Figures, 3 Tables"},{"id":"http://arxiv.org/abs/2401.04482v3","updated":"2025-01-07T12:40:58Z","published":"2024-01-09T10:39:17Z","title":"Continuously Learning New Words in Automatic Speech Recognition","summary":" Despite recent advances, Automatic Speech Recognition (ASR) systems are still\nfar from perfect. Typical errors include acronyms, named entities, and\ndomain-specific special words for which little or no labeled data is available.\nTo address the problem of recognizing these words, we propose a self-supervised\ncontinual learning approach: Given the audio of a lecture talk with the\ncorresponding slides, we bias the model towards decoding new words from the\nslides by using a memory-enhanced ASR model from the literature. Then, we\nperform inference on the talk, collecting utterances that contain detected new\nwords into an adaptation data set. Continual learning is then performed by\ntraining adaptation weights added to the model on this data set. The whole\nprocedure is iterated for many talks. 
We show that with this approach, we\nobtain increasing performance on the new words when they occur more frequently\n(more than 80% recall) while preserving the general performance of the model.\n","authors":["Christian Huber","Alexander Waibel"],"pdf_url":"https://arxiv.org/pdf/2401.04482v3.pdf","comment":"Accepted at ICASSP 2025"},{"id":"http://arxiv.org/abs/2501.03747v1","updated":"2025-01-07T12:40:35Z","published":"2025-01-07T12:40:35Z","title":"Context-Alignment: Activating and Enhancing LLM Capabilities in Time\n Series","summary":" Recently, leveraging pre-trained Large Language Models (LLMs) for time series\n(TS) tasks has gained increasing attention, which involves activating and\nenhancing LLMs' capabilities. Many methods aim to activate LLMs' capabilities\nbased on token-level alignment but overlook LLMs' inherent strength on natural\nlanguage processing -- their deep understanding of linguistic logic and\nstructure rather than superficial embedding processing. We propose\nContext-Alignment, a new paradigm that aligns TS with a linguistic component in\nthe language environments familiar to LLMs to enable LLMs to contextualize and\ncomprehend TS data, thereby activating their capabilities. Specifically, such\ncontext-level alignment comprises structural alignment and logical alignment,\nwhich is achieved by a Dual-Scale Context-Alignment GNNs (DSCA-GNNs) applied to\nTS-language multimodal inputs. Structural alignment utilizes dual-scale nodes\nto describe hierarchical structure in TS-language, enabling LLMs to treat long TS\ndata as a whole linguistic component while preserving intrinsic token features.\nLogical alignment uses directed edges to guide logical relationships, ensuring\ncoherence in the contextual semantics. Demonstration example prompts are\nemployed to construct Demonstration Examples based Context-Alignment (DECA)\nfollowing the DSCA-GNNs framework. 
DECA can be flexibly and repeatedly integrated\ninto various layers of pre-trained LLMs to improve awareness of logic and\nstructure, thereby enhancing performance. Extensive experiments show the\neffectiveness of DECA and the importance of Context-Alignment across tasks,\nparticularly in few-shot and zero-shot forecasting, confirming that\nContext-Alignment provides powerful prior knowledge on context.\n","authors":["Yuxiao Hu","Qian Li","Dongxiao Zhang","Jinyue Yan","Yuntian Chen"],"pdf_url":"https://arxiv.org/pdf/2501.03747v1.pdf","comment":"no comment"},{"id":"http://arxiv.org/abs/2501.03746v1","updated":"2025-01-07T12:40:11Z","published":"2025-01-07T12:40:11Z","title":"A Multimodal Lightweight Approach to Fault Diagnosis of Induction Motors\n in High-Dimensional Dataset","summary":" An accurate AI-based diagnostic system for induction motors (IMs) holds the\npotential to enhance proactive maintenance, mitigating unplanned downtime and\ncurbing overall maintenance costs within an industrial environment. Notably,\namong the prevalent faults in IMs, a Broken Rotor Bar (BRB) fault is frequently\nencountered. Researchers have proposed various fault diagnosis approaches using\nsignal processing (SP), machine learning (ML), deep learning (DL), and hybrid\narchitectures for BRB faults. One limitation in the existing literature is the\ntraining of these architectures on relatively small datasets, risking\noverfitting when implementing such systems in industrial environments. This\npaper addresses this limitation by implementing large-scale data of BRB faults\nby using a transfer-learning-based lightweight DL model named ShuffleNetV2 for\ndiagnosing one, two, three, and four BRB faults using current and vibration\nsignal data. Spectral images for training and testing are generated using a\nShort-Time Fourier Transform (STFT). The dataset comprises 57,500 images, with\n47,500 used for training and 10,000 for testing. 
Remarkably, the ShuffleNetV2\nmodel exhibited superior performance, with lower computational cost, while\naccurately classifying 98.856% of spectral images. To further enhance the\nvisualization of harmonic sidebands resulting from broken bars, Fast Fourier\nTransform (FFT) is applied to current and vibration data. The paper also\nprovides insights into the training and testing times for each model,\ncontributing to a comprehensive understanding of the proposed fault diagnosis\nmethodology. The findings of our research provide valuable insights into the\nperformance and efficiency of different ML and DL models, offering a foundation\nfor the development of robust fault diagnosis systems for induction motors in\nindustrial settings.\n","authors":["Usman Ali"],"pdf_url":"https://arxiv.org/pdf/2501.03746v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03124v2","updated":"2025-01-07T12:33:44Z","published":"2025-01-06T16:31:45Z","title":"PRMBench: A Fine-grained and Challenging Benchmark for Process-Level\n Reward Models","summary":" Process-level Reward Models (PRMs) are crucial for complex reasoning and\ndecision-making tasks, where each intermediate step plays an important role in\nthe reasoning process. Since language models are prone to various types of\nerrors during the reasoning process, PRMs are required to possess nuanced\ncapabilities for detecting various implicit error types in real-world\nscenarios. However, current benchmarks primarily focus on step correctness,\nfailing to evaluate PRMs' performance systematically. To address this gap, we\nintroduce PRMBench, a process-level benchmark specifically designed to assess\nthe fine-grained error detection capabilities of PRMs. 
PRMBench comprises 6,216\ncarefully designed problems and 83,456 step-level labels, evaluating models\nacross multiple dimensions, including simplicity, soundness, and sensitivity.\nIn our experiments on 15 models, spanning both open-source PRMs and\nclosed-source large language models prompted as critic models, we uncover\nsignificant weaknesses in current PRMs. These findings underscore the\nchallenges inherent in process-level evaluation and highlight key directions\nfor future research. We hope PRMBench can be a robust bench for advancing\nresearch on PRM evaluation and development.\n","authors":["Mingyang Song","Zhaochen Su","Xiaoye Qu","Jiawei Zhou","Yu Cheng"],"pdf_url":"https://arxiv.org/pdf/2501.03124v2.pdf","comment":"Project Page: https://prmbench.github.io/"},{"id":"http://arxiv.org/abs/2405.16449v3","updated":"2025-01-07T12:16:43Z","published":"2024-05-26T06:33:11Z","title":"Reinforcement Learning for Jump-Diffusions, with Financial Applications","summary":" We study continuous-time reinforcement learning (RL) for stochastic control\nin which system dynamics are governed by jump-diffusion processes. We formulate\nan entropy-regularized exploratory control problem with stochastic policies to\ncapture the exploration--exploitation balance essential for RL. Unlike the pure\ndiffusion case initially studied by Wang et al. (2020), the derivation of the\nexploratory dynamics under jump-diffusions calls for a careful formulation of\nthe jump part. Through a theoretical analysis, we find that one can simply use\nthe same policy evaluation and $q$-learning algorithms in Jia and Zhou (2022a,\n2023), originally developed for controlled diffusions, without needing to check\na priori whether the underlying data come from a pure diffusion or a\njump-diffusion. However, we show that the presence of jumps ought to affect\nparameterizations of actors and critics in general. 
We investigate as an\napplication the mean--variance portfolio selection problem with stock price\nmodelled as a jump-diffusion, and show that both RL algorithms and\nparameterizations are invariant with respect to jumps. Finally, we present a\ndetailed study on applying the general theory to option hedging.\n","authors":["Xuefeng Gao","Lingfei Li","Xun Yu Zhou"],"pdf_url":"https://arxiv.org/pdf/2405.16449v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03727v1","updated":"2025-01-07T12:16:26Z","published":"2025-01-07T12:16:26Z","title":"Detecting Neurocognitive Disorders through Analyses of Topic Evolution\n and Cross-modal Consistency in Visual-Stimulated Narratives","summary":" Early detection of neurocognitive disorders (NCDs) is crucial for timely\nintervention and disease management. Speech analysis offers a non-intrusive and\nscalable screening method, particularly through narrative tasks in\nneuropsychological assessment tools. Traditional narrative analysis often\nfocuses on local indicators in microstructure, such as word usage and syntax.\nWhile these features provide insights into language production abilities, they\noften fail to capture global narrative patterns, or macrostructures.\nMacrostructures include coherence, thematic organization, and logical\nprogressions, reflecting essential cognitive skills potentially critical for\nrecognizing NCDs. Addressing this gap, we propose to investigate specific\ncognitive and linguistic challenges by analyzing topical shifts, temporal\ndynamics, and the coherence of narratives over time, aiming to reveal cognitive\ndeficits by identifying narrative impairments, and exploring their impact on\ncommunication and cognition. The investigation is based on the CU-MARVEL Rabbit\nStory corpus, which comprises recordings of a story-telling task from 758 older\nadults. 
We developed two approaches: the Dynamic Topic Models (DTM)-based\ntemporal analysis to examine the evolution of topics over time, and the\nText-Image Temporal Alignment Network (TITAN) to evaluate the coherence between\nspoken narratives and visual stimuli. The DTM-based approach validated the\neffectiveness of dynamic topic consistency as a macrostructural metric\n(F1=0.61, AUC=0.78). The TITAN approach achieved the highest performance\n(F1=0.72, AUC=0.81), surpassing established microstructural and macrostructural\nfeature sets. Cross-comparison and regression tasks further demonstrated the\neffectiveness of the proposed dynamic macrostructural modeling approaches for\nNCD detection.\n","authors":["Jinchao Li","Yuejiao Wang","Junan Li","Jiawen Kang","Bo Zheng","Simon Wong","Brian Mak","Helene Fung","Jean Woo","Man-Wai Mak","Timothy Kwok","Vincent Mok","Xianmin Gong","Xixin Wu","Xunying Liu","Patrick Wong","Helen Meng"],"pdf_url":"https://arxiv.org/pdf/2501.03727v1.pdf","comment":"12 pages, 8 figures"},{"id":"http://arxiv.org/abs/2409.03260v2","updated":"2025-01-07T11:54:58Z","published":"2024-09-05T05:51:42Z","title":"In Search of Trees: Decision-Tree Policy Synthesis for Black-Box Systems\n via Search","summary":" Decision trees, owing to their interpretability, are attractive as control\npolicies for (dynamical) systems. Unfortunately, constructing, or synthesising,\nsuch policies is a challenging task. Previous approaches do so by imitating a\nneural-network policy, approximating a tabular policy obtained via formal\nsynthesis, employing reinforcement learning, or modelling the problem as a\nmixed-integer linear program. However, these works may require access to a\nhard-to-obtain accurate policy or a formal model of the environment (within\nreach of formal synthesis), and may not provide guarantees on the quality or\nsize of the final tree policy. 
In contrast, we present an approach to\nsynthesise optimal decision-tree policies given a deterministic black-box\nenvironment and specification, a discretisation of the tree predicates, and an\ninitial set of states, where optimality is defined with respect to the number\nof steps to achieve the goal. Our approach is a specialised search algorithm\nwhich systematically explores the (exponentially large) space of decision trees\nunder the given discretisation. The key component is a novel trace-based\npruning mechanism that significantly reduces the search space. Our approach\nrepresents a conceptually novel way of synthesising small decision-tree\npolicies with optimality guarantees even for black-box environments with\nblack-box specifications.\n","authors":["Emir Demirović","Christian Schilling","Anna Lukina"],"pdf_url":"https://arxiv.org/pdf/2409.03260v2.pdf","comment":"8 pages main text incl. references, 2 pages appendix"},{"id":"http://arxiv.org/abs/2501.03715v1","updated":"2025-01-07T11:44:25Z","published":"2025-01-07T11:44:25Z","title":"Neural Deconstruction Search for Vehicle Routing Problems","summary":" Autoregressive construction approaches generate solutions to vehicle routing\nproblems in a step-by-step fashion, leading to high-quality solutions that are\nnearing the performance achieved by handcrafted, operations research\ntechniques. In this work, we challenge the conventional paradigm of sequential\nsolution construction and introduce an iterative search framework where\nsolutions are instead deconstructed by a neural policy. Throughout the search,\nthe neural policy collaborates with a simple greedy insertion algorithm to\nrebuild the deconstructed solutions. 
Our approach surpasses the performance of\nstate-of-the-art operations research methods across three challenging vehicle\nrouting problems of various problem sizes.\n","authors":["André Hottung","Paula Wong-Chung","Kevin Tierney"],"pdf_url":"https://arxiv.org/pdf/2501.03715v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03711v1","updated":"2025-01-07T11:32:13Z","published":"2025-01-07T11:32:13Z","title":"Unsupervised Speech Segmentation: A General Approach Using Speech\n Language Models","summary":" In this paper, we introduce an unsupervised approach for Speech Segmentation,\nwhich builds on previously researched approaches, e.g., Speaker Diarization,\nwhile being applicable to an inclusive set of acoustic-semantic distinctions,\npaving a path towards a general Unsupervised Speech Segmentation approach.\nUnlike traditional speech and audio segmentation, which mainly focuses on\nspectral changes in the input signal, e.g., phone segmentation, our approach\ntries to segment the spoken utterance into chunks with differing\nacoustic-semantic styles, focusing on acoustic-semantic information that does\nnot translate well into text, e.g., emotion or speaker. While most Speech\nSegmentation tasks only handle one style change, e.g., emotion diarization, our\napproach tries to handle multiple acoustic-semantic style changes. Leveraging\nrecent advances in Speech Language Models (SLMs), we propose a simple\nunsupervised method to segment a given speech utterance. We empirically\ndemonstrate the effectiveness of the proposed approach by considering several\nsetups. Results suggest that the proposed method is superior to the evaluated\nbaselines on boundary detection, segment purity, and over-segmentation. 
Code is\navailable at\nhttps://github.com/avishaiElmakies/unsupervised_speech_segmentation_using_slm.\n","authors":["Avishai Elmakies","Omri Abend","Yossi Adi"],"pdf_url":"https://arxiv.org/pdf/2501.03711v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.10573v2","updated":"2025-01-07T11:13:06Z","published":"2024-06-15T09:23:46Z","title":"Graph Neural Backdoor: Fundamentals, Methodologies, Applications, and\n Future Directions","summary":" Graph Neural Networks (GNNs) have significantly advanced various downstream\ngraph-relevant tasks, encompassing recommender systems, molecular structure\nprediction, social media analysis, etc. Despite the boosts of GNN, recent\nresearch has empirically demonstrated its potential vulnerability to backdoor\nattacks, wherein adversaries employ triggers to poison input samples, inducing\nGNN to adversary-premeditated malicious outputs. This is typically due to the\ncontrolled training process, or the deployment of untrusted models, such as\ndelegating model training to third-party service, leveraging external training\nsets, and employing pre-trained models from online sources. Although there's an\nongoing increase in research on GNN backdoors, comprehensive investigation into\nthis field is lacking. To bridge this gap, we propose the first survey\ndedicated to GNN backdoors. We begin by outlining the fundamental definition of\nGNN, followed by the detailed summarization and categorization of current GNN\nbackdoor attacks and defenses based on their technical characteristics and\napplication scenarios. Subsequently, the analysis of the applicability and use\ncases of GNN backdoors is undertaken. Finally, the exploration of potential\nresearch directions of GNN backdoors is presented. 
This survey aims to explore\nthe principles of graph backdoors, provide insights to defenders, and promote\nfuture security research.\n","authors":["Xiao Yang","Gaolei Li","Jianhua Li"],"pdf_url":"https://arxiv.org/pdf/2406.10573v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03697v1","updated":"2025-01-07T11:01:24Z","published":"2025-01-07T11:01:24Z","title":"Deep Networks are Reproducing Kernel Chains","summary":" Identifying an appropriate function space for deep neural networks remains a\nkey open question. While shallow neural networks are naturally associated with\nReproducing Kernel Banach Spaces (RKBS), deep networks present unique\nchallenges. In this work, we extend RKBS to chain RKBS (cRKBS), a new framework\nthat composes kernels rather than functions, preserving the desirable\nproperties of RKBS. We prove that any deep neural network function is a neural\ncRKBS function, and conversely, any neural cRKBS function defined on a finite\ndataset corresponds to a deep neural network. This approach provides a sparse\nsolution to the empirical risk minimization problem, requiring no more than $N$\nneurons per layer, where $N$ is the number of data points.\n","authors":["Tjeerd Jan Heeringa","Len Spek","Christoph Brune"],"pdf_url":"https://arxiv.org/pdf/2501.03697v1.pdf","comment":"25 pages, 3 figures"},{"id":"http://arxiv.org/abs/2501.03696v1","updated":"2025-01-07T10:54:44Z","published":"2025-01-07T10:54:44Z","title":"Exploring Molecule Generation Using Latent Space Graph Diffusion","summary":" Generating molecular graphs is a challenging task due to their discrete\nnature and the competitive objectives involved. Diffusion models have emerged\nas SOTA approaches in data generation across various modalities. For molecular\ngraphs, graph neural networks (GNNs) as a diffusion backbone have achieved\nimpressive results. 
Latent space diffusion, where diffusion occurs in a\nlow-dimensional space via an autoencoder, has demonstrated computational\nefficiency. However, the literature on latent space diffusion for molecular\ngraphs is scarce, and no commonly accepted best practices exist. In this work,\nwe explore different approaches and hyperparameters, contrasting generative\nflow models (denoising diffusion, flow matching, heat dissipation) and\narchitectures (GNNs and E(3)-equivariant GNNs). Our experiments reveal a high\nsensitivity to the choice of approach and design decisions. Code is made\navailable at\ngithub.com/Prashanth-Pombala/Molecule-Generation-using-Latent-Space-Graph-Diffusion.\n","authors":["Prashanth Pombala","Gerrit Grossmann","Verena Wolf"],"pdf_url":"https://arxiv.org/pdf/2501.03696v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.05898v4","updated":"2025-01-07T10:42:21Z","published":"2024-10-08T10:55:40Z","title":"Manifolds, Random Matrices and Spectral Gaps: The geometric phases of\n generative diffusion","summary":" In this paper, we investigate the latent geometry of generative diffusion\nmodels under the manifold hypothesis. For this purpose, we analyze the spectrum\nof eigenvalues (and singular values) of the Jacobian of the score function,\nwhose discontinuities (gaps) reveal the presence and dimensionality of distinct\nsub-manifolds. Using a statistical physics approach, we derive the spectral\ndistributions and formulas for the spectral gaps under several distributional\nassumptions, and we compare these theoretical predictions with the spectra\nestimated from trained networks. Our analysis reveals the existence of three\ndistinct qualitative phases during the generative process: a trivial phase; a\nmanifold coverage phase where the diffusion process fits the distribution\ninternal to the manifold; a consolidation phase where the score becomes\northogonal to the manifold and all particles are projected on the support of\nthe data. 
This `division of labor' between different timescales provides an\nelegant explanation of why generative diffusion models are not affected by the\nmanifold overfitting phenomenon that plagues likelihood-based models, since the\ninternal distribution and the manifold geometry are produced at different time\npoints during generation.\n","authors":["Enrico Ventura","Beatrice Achilli","Gianluigi Silvestri","Carlo Lucibello","Luca Ambrogioni"],"pdf_url":"https://arxiv.org/pdf/2410.05898v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03687v1","updated":"2025-01-07T10:34:12Z","published":"2025-01-07T10:34:12Z","title":"Run-and-tumble chemotaxis using reinforcement learning","summary":" Bacterial cells use run-and-tumble motion to climb up attractant\nconcentration gradient in their environment. By extending the uphill runs and\nshortening the downhill runs the cells migrate towards the higher attractant\nzones. Motivated by this, we formulate a reinforcement learning (RL) algorithm\nwhere an agent moves in one dimension in the presence of an attractant\ngradient. The agent can perform two actions: either persistent motion in the\nsame direction or reversal of direction. We assign costs for these actions\nbased on the recent history of the agent's trajectory. We ask the question:\nwhich RL strategy works best in different types of attractant environment. We\nquantify efficiency of the RL strategy by the ability of the agent (a) to\nlocalize in the favorable zones after large times, and (b) to learn about its\ncomplete environment. 
Depending on the attractant profile and the initial\ncondition, we find an optimum balance is needed between exploration and\nexploitation to ensure the most efficient performance.\n","authors":["Ramesh Pramanik","Shradha Mishra","Sakuntala Chatterjee"],"pdf_url":"https://arxiv.org/pdf/2501.03687v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.13552v2","updated":"2025-01-07T10:26:47Z","published":"2024-06-19T13:39:05Z","title":"Standardness Clouds Meaning: A Position Regarding the Informed Usage of\n Standard Datasets","summary":" Standard datasets are frequently used to train and evaluate Machine Learning\nmodels. However, the assumed standardness of these datasets leads to a lack of\nin-depth discussion on how their labels match the derived categories for the\nrespective use case, which we demonstrate by reviewing recent literature that\nemploys standard datasets. We find that the standardness of the datasets seems\nto cloud their actual coherency and applicability, thus impeding the trust in\nMachine Learning models trained on these datasets. Therefore, we argue against\nthe uncritical use of standard datasets and advocate for their critical\nexamination instead. For this, we suggest to use Grounded Theory in combination\nwith Hypotheses Testing through Visualization as methods to evaluate the match\nbetween use case, derived categories, and labels. We exemplify this approach by\napplying it to the 20 Newsgroups dataset and the MNIST dataset, both considered\nstandard datasets in their respective domain. The results show that the labels\nof the 20 Newsgroups dataset are imprecise, which implies that neither a\nMachine Learning model can learn a meaningful abstraction of derived categories\nnor one can draw conclusions from achieving high accuracy on this dataset. For\nthe MNIST dataset, we demonstrate that the labels can be confirmed to be\ndefined well. 
We conclude that also for datasets that are considered to be\nstandard, quality and suitability have to be assessed in order to learn\nmeaningful abstractions and, thus, improve trust in Machine Learning models.\n","authors":["Tim Cech","Ole Wegen","Daniel Atzberger","Rico Richter","Willy Scheibel","Jürgen Döllner"],"pdf_url":"https://arxiv.org/pdf/2406.13552v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03676v1","updated":"2025-01-07T10:22:30Z","published":"2025-01-07T10:22:30Z","title":"SALE-Based Offline Reinforcement Learning with Ensemble Q-Networks","summary":" In this work, we build upon the offline reinforcement learning algorithm TD7,\nwhich incorporates State-Action Learned Embeddings (SALE) and LAP, and propose\na model-free actor-critic algorithm that integrates ensemble Q-networks and a\ngradient diversity penalty from EDAC. The ensemble Q-networks effectively\naddress the challenge of out-of-distribution actions by introducing penalties\nthat guide the actor network to focus on in-distribution actions. Meanwhile,\nthe gradient diversity penalty encourages diverse Q-value gradients, further\nsuppressing overestimation for out-of-distribution actions. Additionally, our\nmethod retains an adjustable behavior cloning (BC) term that directs the actor\nnetwork toward dataset actions during early training stages, while gradually\nreducing its influence as the precision of the Q-ensemble improves. 
These\nenhancements work synergistically to improve training stability and accuracy.\nExperimental results on the D4RL MuJoCo benchmarks demonstrate that our\nalgorithm achieves superior convergence speed, stability, and performance\ncompared to existing methods.\n","authors":["Zheng Chun"],"pdf_url":"https://arxiv.org/pdf/2501.03676v1.pdf","comment":"10 pages, 2 figures, 4 tables"},{"id":"http://arxiv.org/abs/2501.03671v1","updated":"2025-01-07T10:18:37Z","published":"2025-01-07T10:18:37Z","title":"Imitation Learning of MPC with Neural Networks: Error Guarantees and\n Sparsification","summary":" This paper presents a framework for bounding the approximation error in\nimitation model predictive controllers utilizing neural networks. Leveraging\nthe Lipschitz properties of these neural networks, we derive a bound that\nguides dataset design to ensure the approximation error remains at chosen\nlimits. We discuss how this method can be used to design a stable neural\nnetwork controller with performance guarantees employing existing robust model\npredictive control approaches for data generation. 
Additionally, we introduce a\ntraining adjustment, which is based on the sensitivities of the optimization\nproblem and reduces dataset density requirements based on the derived bounds.\nWe verify that the proposed augmentation results in improvements to the\nnetwork's predictive capabilities and a reduction of the Lipschitz constant.\nMoreover, on a simulated inverted pendulum problem, we show that the approach\nresults in a closer match of the closed-loop behavior between the imitation and\nthe original model predictive controller.\n","authors":["Hendrik Alsmeier","Lukas Theiner","Anton Savchenko","Ali Mesbah","Rolf Findeisen"],"pdf_url":"https://arxiv.org/pdf/2501.03671v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03666v1","updated":"2025-01-07T10:06:59Z","published":"2025-01-07T10:06:59Z","title":"Hybrid Machine Learning Model with a Constrained Action Space for\n Trajectory Prediction","summary":" Trajectory prediction is crucial to advance autonomous driving, improving\nsafety, and efficiency. Although end-to-end models based on deep learning have\ngreat potential, they often do not consider vehicle dynamic limitations,\nleading to unrealistic predictions. To address this problem, this work\nintroduces a novel hybrid model that combines deep learning with a kinematic\nmotion model. It is able to predict object attributes such as acceleration and\nyaw rate and generate trajectories based on them. A key contribution is the\nincorporation of expert knowledge into the learning objective of the deep\nlearning model. This results in the constraint of the available action space,\nthus enabling the prediction of physically feasible object attributes and\ntrajectories, thereby increasing safety and robustness. The proposed hybrid\nmodel facilitates enhanced interpretability, thereby reinforcing the\ntrustworthiness of deep learning methods and promoting the development of safe\nplanning solutions. 
Experiments conducted on the publicly available real-world\nArgoverse dataset demonstrate realistic driving behaviour, with benchmark\ncomparisons and ablation studies showing promising results.\n","authors":["Alexander Fertig","Lakshman Balasubramanian","Michael Botsch"],"pdf_url":"https://arxiv.org/pdf/2501.03666v1.pdf","comment":"Submitted to 2025 IEEE Intelligent Vehicles Symposium (IV)"},{"id":"http://arxiv.org/abs/2402.10456v2","updated":"2025-01-07T10:03:08Z","published":"2024-02-16T05:27:05Z","title":"Efficient Generative Modeling via Penalized Optimal Transport Network","summary":" The generation of synthetic data with distributions that faithfully emulate\nthe underlying data-generating mechanism holds paramount significance.\nWasserstein Generative Adversarial Networks (WGANs) have emerged as a prominent\ntool for this task; however, due to the delicate equilibrium of the minimax\nformulation and the instability of Wasserstein distance in high dimensions,\nWGAN often manifests the pathological phenomenon of mode collapse. This results\nin generated samples that converge to a restricted set of outputs and fail to\nadequately capture the tail behaviors of the true distribution. Such\nlimitations can lead to serious downstream consequences. To this end, we\npropose the Penalized Optimal Transport Network (POTNet), a versatile deep\ngenerative model based on the marginally-penalized Wasserstein (MPW) distance.\nThrough the MPW distance, POTNet effectively leverages low-dimensional marginal\ninformation to guide the overall alignment of joint distributions. Furthermore,\nour primal-based framework enables direct evaluation of the MPW distance, thus\neliminating the need for a critic network. This formulation circumvents\ntraining instabilities inherent in adversarial approaches and avoids the need\nfor extensive parameter tuning. 
We derive a non-asymptotic bound on the\ngeneralization error of the MPW loss and establish convergence rates of the\ngenerative distribution learned by POTNet. Our theoretical analysis together\nwith extensive empirical evaluations demonstrate the superior performance of\nPOTNet in accurately capturing underlying data structures, including their tail\nbehaviors and minor modalities. Moreover, our model achieves orders of\nmagnitude speedup during the sampling stage compared to state-of-the-art\nalternatives, which enables computationally efficient large-scale synthetic\ndata generation.\n","authors":["Wenhui Sophia Lu","Chenyang Zhong","Wing Hung Wong"],"pdf_url":"https://arxiv.org/pdf/2402.10456v2.pdf","comment":"54 pages, 12 figures"},{"id":"http://arxiv.org/abs/2409.14887v3","updated":"2025-01-07T09:55:57Z","published":"2024-09-23T10:35:57Z","title":"Deploying Open-Source Large Language Models: A performance Analysis","summary":" Since the release of ChatGPT in November 2022, large language models (LLMs)\nhave seen considerable success, including in the open-source community, with\nmany open-weight models available. However, the requirements to deploy such a\nservice are often unknown and difficult to evaluate in advance. To facilitate\nthis process, we conducted numerous tests at the Centre Inria de l'Universit\\'e\nde Bordeaux. In this article, we propose a comparison of the performance of\nseveral models of different sizes (mainly Mistral and LLaMa) depending on the\navailable GPUs, using vLLM, a Python library designed to optimize the inference\nof these models. Our results provide valuable information for private and\npublic groups wishing to deploy LLMs, allowing them to evaluate the performance\nof different models based on their available hardware. 
This study thus\ncontributes to facilitating the adoption and use of these large language models\nin various application domains.\n","authors":["Yannis Bendi-Ouis","Dan Dutartre","Xavier Hinaut"],"pdf_url":"https://arxiv.org/pdf/2409.14887v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.02781v2","updated":"2025-01-07T09:54:50Z","published":"2025-01-06T05:53:38Z","title":"From Dense to Sparse: Event Response for Enhanced Residential Load\n Forecasting","summary":" Residential load forecasting (RLF) is crucial for resource scheduling in\npower systems. Most existing methods utilize all given load records (dense\ndata) to indiscriminately extract the dependencies between historical and\nfuture time series. However, there exist important regular patterns residing in\nthe event-related associations among different appliances (sparse knowledge),\nwhich have so far been ignored. In this paper, we propose an Event-Response\nKnowledge Guided approach (ERKG) for RLF by incorporating the estimation of\nelectricity usage events for different appliances, mining event-related sparse\nknowledge from the load series. With ERKG, the event-response estimation\nenables portraying the electricity consumption behaviors of residents,\nrevealing regular variations in appliance operational states. To be specific,\nERKG consists of knowledge extraction and guidance: i) a forecasting model is\ndesigned for the electricity usage events by estimating appliance operational\nstates, aiming to extract the event-related sparse knowledge; ii) a novel\nknowledge-guided mechanism is established by fusing such state estimates of the\nappliance events into the RLF model, which can give particular focus to the\npatterns of users' electricity consumption behaviors. Notably, ERKG can flexibly\nserve as a plug-in module to boost the capability of existing forecasting\nmodels by leveraging event response. 
In numerical experiments, extensive\ncomparisons and ablation studies have verified the effectiveness of our ERKG,\ne.g., MAE can be reduced by over 8% on the tested state-of-the-art forecasting\nmodels.\n","authors":["Xin Cao","Qinghua Tao","Yingjie Zhou","Lu Zhang","Le Zhang","Dongjin Song","Dapeng Oliver Wu","Ce Zhu"],"pdf_url":"https://arxiv.org/pdf/2501.02781v2.pdf","comment":"12 pages and 6 figures. Accepted for publication by IEEE Transactions\n on Instrumentation and Measurement"},{"id":"http://arxiv.org/abs/2501.03654v1","updated":"2025-01-07T09:40:02Z","published":"2025-01-07T09:40:02Z","title":"Data Augmentation for Deep Learning Regression Tasks by Machine Learning\n Models","summary":" Deep learning (DL) models have gained prominence in domains such as computer\nvision and natural language processing but remain underutilized for regression\ntasks involving tabular data. In these cases, traditional machine learning (ML)\nmodels often outperform DL models. In this study, we propose and evaluate\nvarious data augmentation (DA) techniques to improve the performance of DL\nmodels for tabular data regression tasks. We compare the performance gain of\nNeural Networks by different DA strategies ranging from a naive method of\nduplicating existing observations and adding noise to a more sophisticated DA\nstrategy that preserves the underlying statistical relationship in the data.\nOur analysis demonstrates that the advanced DA method significantly improves DL\nmodel performance across multiple datasets and regression tasks, resulting in\nan average performance increase of over 10\\% compared to baseline models\nwithout augmentation. The efficacy of these DA strategies was rigorously\nvalidated across 30 distinct datasets, with multiple iterations and evaluations\nusing three different automated deep learning (AutoDL) frameworks: AutoKeras,\nH2O, and AutoGluon. 
This study demonstrates that by leveraging advanced DA\ntechniques, DL models can realize their full potential in regression tasks,\nthereby contributing to broader adoption and enhanced performance in practical\napplications.\n","authors":["Assaf Shmuel","Oren Glickman","Teddy Lazebnik"],"pdf_url":"https://arxiv.org/pdf/2501.03654v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.16370v2","updated":"2025-01-07T09:34:51Z","published":"2024-11-25T13:26:09Z","title":"A Review of Bayesian Uncertainty Quantification in Deep Probabilistic\n Image Segmentation","summary":" Advancements in image segmentation play an integral role within the broad\nscope of Deep Learning-based Computer Vision. Furthermore, their widespread\napplicability in critical real-world tasks has resulted in challenges related\nto the reliability of such algorithms. Hence, uncertainty quantification has\nbeen extensively studied within this context, enabling the expression of model\nignorance (epistemic uncertainty) or data ambiguity (aleatoric uncertainty) to\nprevent uninformed decision-making. Due to the rapid adoption of Convolutional\nNeural Network (CNN)-based segmentation models in high-stake applications, a\nsubstantial body of research has been published on this very topic, causing its\nswift expansion into a distinct field. This work provides a comprehensive\noverview of probabilistic segmentation, by discussing fundamental concepts of\nuncertainty quantification, governing advancements in the field as well as the\napplication to various tasks. Moreover, the literature on both types of\nuncertainties traces back to four key applications: (1) to quantify statistical\ninconsistencies in the annotation process due to ambiguous images, (2)\ncorrelating prediction error with uncertainty, (3) expanding the model\nhypothesis space for better generalization, and (4) Active Learning. 
An extensive discussion follows\nthat includes an overview of utilized datasets for each of the applications and\nevaluation of the available methods. We also highlight challenges related to\narchitectures, uncertainty quantification methods, standardization and\nbenchmarking, and finally end with recommendations for future work such as\nmethods based on single forward passes and models that appropriately leverage\nvolumetric data.\n","authors":["M. M. A. Valiuddin","R. J. G. van Sloun","C. G. A. Viviers","P. H. N. de With","F. van der Sommen"],"pdf_url":"https://arxiv.org/pdf/2411.16370v2.pdf","comment":"20 pages, revised"},{"id":"http://arxiv.org/abs/2402.00592v4","updated":"2025-01-07T09:24:34Z","published":"2024-02-01T13:41:44Z","title":"Partial-Label Learning with a Reject Option","summary":" In real-world applications, one often encounters ambiguously labeled data,\nwhere different annotators assign conflicting class labels. Partial-label\nlearning allows training classifiers in this weakly supervised setting, where\nstate-of-the-art methods already show good predictive performance. However,\neven the best algorithms give incorrect predictions, which can have severe\nconsequences when they impact actions or decisions. We propose a novel\nrisk-consistent nearest-neighbor-based partial-label learning algorithm with a\nreject option, that is, the algorithm can reject unsure predictions. Extensive\nexperiments on artificial and real-world datasets show that our method provides\nthe best trade-off between the number and accuracy of non-rejected predictions\nwhen compared to our competitors, which use confidence thresholds for rejecting\nunsure predictions. 
When evaluated without the reject option, our\nnearest-neighbor-based approach also achieves competitive prediction\nperformance.\n","authors":["Tobias Fuchs","Florian Kalinke","Klemens Böhm"],"pdf_url":"https://arxiv.org/pdf/2402.00592v4.pdf","comment":"Accepted for publication at TMLR"},{"id":"http://arxiv.org/abs/2501.03635v1","updated":"2025-01-07T09:10:09Z","published":"2025-01-07T09:10:09Z","title":"MHGNet: Multi-Heterogeneous Graph Neural Network for Traffic Prediction","summary":" In recent years, traffic flow prediction has played a crucial role in the\nmanagement of intelligent transportation systems. However, traditional\nforecasting methods often model non-Euclidean low-dimensional traffic data as a\nsimple graph with single-type nodes and edges, failing to capture similar\ntrends among nodes of the same type. To address this limitation, this paper\nproposes MHGNet, a novel framework for modeling spatiotemporal\nmulti-heterogeneous graphs. Within this framework, the STD Module decouples\nsingle-pattern traffic data into multi-pattern traffic data through feature\nmappings of timestamp embedding matrices and node embedding matrices.\nSubsequently, the Node Clusterer leverages the Euclidean distance between nodes\nand different types of limit points to perform clustering with O(N) time\ncomplexity. The nodes within each cluster undergo residual subgraph convolution\nwithin the spatiotemporal fusion subgraphs generated by the DSTGG Module,\nfollowed by processing in the SIE Module for node repositioning and\nredistribution of weights. 
To validate the effectiveness of MHGNet, this paper\nconducts extensive ablation studies and quantitative evaluations on four widely\nused benchmarks, demonstrating its superior performance.\n","authors":["Mei Wu","Yiqian Lin","Tianfan Jiang","Wenchao Weng"],"pdf_url":"https://arxiv.org/pdf/2501.03635v1.pdf","comment":"Accepted by 2025 IEEE International Conference on Acoustics, Speech,\n and Signal Processing (ICASSP 2025)"},{"id":"http://arxiv.org/abs/2406.02017v2","updated":"2025-01-07T09:02:36Z","published":"2024-06-04T06:57:12Z","title":"On the Mode-Seeking Properties of Langevin Dynamics","summary":" The Langevin Dynamics framework, which aims to generate samples from the\nscore function of a probability distribution, is widely used for analyzing and\ninterpreting score-based generative modeling. While the convergence behavior of\nLangevin Dynamics under unimodal distributions has been extensively studied in\nthe literature, in practice the data distribution could consist of multiple\ndistinct modes. In this work, we investigate Langevin Dynamics in producing\nsamples from multimodal distributions and theoretically study its mode-seeking\nproperties. We prove that under a variety of sub-Gaussian mixtures, Langevin\nDynamics is unlikely to find all mixture components within a sub-exponential\nnumber of steps in the data dimension. To reduce the mode-seeking tendencies of\nLangevin Dynamics, we propose \emph{Chained Langevin Dynamics}, which divides\nthe data vector into patches of constant size and generates every patch\nsequentially conditioned on the previous patches. We perform a theoretical\nanalysis of Chained Langevin Dynamics by reducing it to sampling from a\nconstant-dimensional distribution. We present the results of several numerical\nexperiments on synthetic and real image datasets, supporting our theoretical\nresults on the iteration complexities of sample generation from mixture\ndistributions using the chained and vanilla Langevin Dynamics. 
The code is\navailable at https://github.com/Xiwei-Cheng/Chained_LD.\n","authors":["Xiwei Cheng","Kexin Fu","Farzan Farnia"],"pdf_url":"https://arxiv.org/pdf/2406.02017v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03627v1","updated":"2025-01-07T08:54:42Z","published":"2025-01-07T08:54:42Z","title":"Coupled Hierarchical Structure Learning using Tree-Wasserstein Distance","summary":" In many applications, both data samples and features have underlying\nhierarchical structures. However, existing methods for learning these latent\nstructures typically focus on either samples or features, ignoring possible\ncoupling between them. In this paper, we introduce a coupled hierarchical\nstructure learning method using tree-Wasserstein distance (TWD). Our method\njointly computes TWDs for samples and features, representing their latent\nhierarchies as trees. We propose an iterative, unsupervised procedure to build\nthese sample and feature trees based on diffusion geometry, hyperbolic\ngeometry, and wavelet filters. We show that this iterative procedure converges\nand empirically improves the quality of the constructed trees. The method is\nalso computationally efficient and scales well in high-dimensional settings.\nOur method can be seamlessly integrated with hyperbolic graph convolutional\nnetworks (HGCN). We demonstrate that our method outperforms competing\napproaches in sparse approximation and unsupervised Wasserstein distance\nlearning on several word-document and single-cell RNA-sequencing datasets. In\naddition, integrating our method into HGCN enhances performance in link\nprediction and node classification tasks.\n","authors":["Ya-Wei Eileen Lin","Ronald R. 
Coifman","Gal Mishne","Ronen Talmon"],"pdf_url":"https://arxiv.org/pdf/2501.03627v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2208.06538v2","updated":"2025-01-07T08:52:30Z","published":"2022-08-13T01:20:39Z","title":"Transferable Adversarial Examples with Bayes Approach","summary":" The vulnerability of deep neural networks (DNNs) to black-box adversarial\nattacks is one of the most heated topics in trustworthy AI. In such attacks,\nthe attackers operate without any insider knowledge of the model, making the\ncross-model transferability of adversarial examples critical. Despite the\npotential for adversarial examples to be effective across various models, it\nhas been observed that adversarial examples that are specifically crafted for a\nspecific model often exhibit poor transferability. In this paper, we explore\nthe transferability of adversarial examples via the lens of Bayesian approach.\nSpecifically, we leverage Bayesian approach to probe the transferability and\nthen study what constitutes a transferability-promoting prior. Following this,\nwe design two concrete transferability-promoting priors, along with an adaptive\ndynamic weighting strategy for instances sampled from these priors. Employing\nthese techniques, we present BayAtk. 
Extensive experiments illustrate the\nsignificant effectiveness of BayAtk in crafting more transferable adversarial\nexamples against both undefended and defended black-box models compared to\nexisting state-of-the-art attacks.\n","authors":["Mingyuan Fan","Cen Chen","Wenmeng Zhou","Yinggui Wang"],"pdf_url":"https://arxiv.org/pdf/2208.06538v2.pdf","comment":"Accepted in AsiaCCS'25"},{"id":"http://arxiv.org/abs/2406.00502v3","updated":"2025-01-07T08:50:35Z","published":"2024-06-01T17:10:56Z","title":"Non-geodesically-convex optimization in the Wasserstein space","summary":" We study a class of optimization problems in the Wasserstein space (the space\nof probability measures) where the objective function is nonconvex along\ngeneralized geodesics. Specifically, the objective exhibits some\ndifference-of-convex structure along these geodesics. The setting also\nencompasses sampling problems where the logarithm of the target distribution is\ndifference-of-convex. We derive multiple convergence insights for a novel semi\nForward-Backward Euler scheme under several nonconvex (and possibly nonsmooth)\nregimes. 
Notably, the semi Forward-Backward Euler is just a slight modification\nof the Forward-Backward Euler whose convergence is -- to our knowledge -- still\nunknown in our very general non-geodesically-convex setting.\n","authors":["Hoang Phuc Hau Luu","Hanlin Yu","Bernardo Williams","Petrus Mikkola","Marcelo Hartmann","Kai Puolamäki","Arto Klami"],"pdf_url":"https://arxiv.org/pdf/2406.00502v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.08980v2","updated":"2025-01-07T08:49:30Z","published":"2024-04-13T12:07:20Z","title":"Stability and Generalization in Free Adversarial Training","summary":" While adversarial training methods have significantly improved the robustness\nof deep neural networks against norm-bounded adversarial perturbations, the\ngeneralization gap between their performance on training and test data is\nconsiderably greater than that of standard empirical risk minimization. Recent\nstudies have aimed to connect the generalization properties of adversarially\ntrained classifiers to the min-max optimization algorithm used in their\ntraining. In this work, we analyze the interconnections between generalization\nand optimization in adversarial training using the algorithmic stability\nframework. Specifically, our goal is to compare the generalization gap of\nneural networks trained using the vanilla adversarial training method, which\nfully optimizes perturbations at every iteration, with the free adversarial\ntraining method, which simultaneously optimizes norm-bounded perturbations and\nclassifier parameters. We prove bounds on the generalization error of these\nmethods, indicating that the free adversarial training method may exhibit a\nlower generalization gap between training and test samples due to its\nsimultaneous min-max optimization of classifier weights and perturbation\nvariables. We conduct several numerical experiments to evaluate the\ntrain-to-test generalization gap in vanilla and free adversarial training\nmethods. 
Our empirical findings also suggest that the free adversarial training\nmethod could lead to a smaller generalization gap over a similar number of\ntraining iterations. The paper code is available at\nhttps://github.com/Xiwei-Cheng/Stability_FreeAT.\n","authors":["Xiwei Cheng","Kexin Fu","Farzan Farnia"],"pdf_url":"https://arxiv.org/pdf/2404.08980v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.04195v2","updated":"2025-01-07T08:46:02Z","published":"2023-09-08T08:12:29Z","title":"Towards Mitigating Architecture Overfitting on Distilled Datasets","summary":" Dataset distillation methods have demonstrated remarkable performance for\nneural networks trained with very limited training data. However, a significant\nchallenge arises in the form of \\textit{architecture overfitting}: the\ndistilled training dataset synthesized by a specific network architecture\n(i.e., training network) generates poor performance when trained by other\nnetwork architectures (i.e., test networks), especially when the test networks\nhave a larger capacity than the training network. This paper introduces a\nseries of approaches to mitigate this issue. Among them, DropPath renders the\nlarge model to be an implicit ensemble of its sub-networks, and knowledge\ndistillation ensures each sub-network acts similarly to the small but\nwell-performing teacher network. These methods, characterized by their\nsmoothing effects, significantly mitigate architecture overfitting. We conduct\nextensive experiments to demonstrate the effectiveness and generality of our\nmethods. Particularly, across various scenarios involving different tasks and\ndifferent sizes of distilled data, our approaches significantly mitigate\narchitecture overfitting. 
Furthermore, our approaches achieve comparable or\neven superior performance when the test network is larger than the training\nnetwork.\n","authors":["Xuyang Zhong","Chen Liu"],"pdf_url":"https://arxiv.org/pdf/2309.04195v2.pdf","comment":"Accepted by TNNLS"},{"id":"http://arxiv.org/abs/2501.02721v2","updated":"2025-01-07T08:14:34Z","published":"2025-01-06T02:25:48Z","title":"Learning Stochastic Nonlinear Dynamics with Embedded Latent Transfer\n Operators","summary":" We consider an operator-based latent Markov representation of a stochastic\nnonlinear dynamical system, where the stochastic evolution of the latent state\nembedded in a reproducing kernel Hilbert space is described with the\ncorresponding transfer operator, and develop a spectral method to learn this\nrepresentation based on the theory of stochastic realization. The embedding may\nbe learned simultaneously using reproducing kernels, for example, constructed\nwith feed-forward neural networks. We also address the generalization of\nsequential state-estimation (Kalman filtering) in stochastic nonlinear systems,\nand of operator-based eigen-mode decomposition of dynamics, for the\nrepresentation. Several examples with synthetic and real-world data are shown\nto illustrate the empirical characteristics of our methods, and to investigate\nthe performance of our model in sequential state-estimation and mode\ndecomposition.\n","authors":["Naichang Ke","Ryogo Tanaka","Yoshinobu Kawahara"],"pdf_url":"https://arxiv.org/pdf/2501.02721v2.pdf","comment":"This submission includes a supplementary file providing additional\n details. It also contains a code directory (code/) for the experiments. 
Both\n are included within the TeX source package"},{"id":"http://arxiv.org/abs/2501.01216v3","updated":"2025-01-07T07:46:19Z","published":"2025-01-02T11:57:08Z","title":"TabTreeFormer: Tabular Data Generation Using Hybrid Tree-Transformer","summary":" Transformers have achieved remarkable success in tabular data generation.\nHowever, they lack domain-specific inductive biases which are critical to\npreserving the intrinsic characteristics of tabular data. Meanwhile, they\nsuffer from poor scalability and efficiency due to quadratic computational\ncomplexity. In this paper, we propose TabTreeFormer, a hybrid transformer\narchitecture that incorporates a tree-based model that retains tabular-specific\ninductive biases of non-smooth and potentially low-correlated patterns caused\nby discreteness and non-rotational invariance, and hence enhances the fidelity\nand utility of synthetic data. In addition, we devise a dual-quantization\ntokenizer to capture the multimodal continuous distribution and further\nfacilitate the learning of numerical value distribution. Moreover, our proposed\ntokenizer reduces the vocabulary size and sequence length due to the limited\ncomplexity (e.g., dimension-wise semantic meaning) of tabular data, rendering a\nsignificant model size shrink without sacrificing the capability of the\ntransformer model. We evaluate TabTreeFormer on 10 datasets against multiple\ngenerative models on various metrics; our experimental results show that\nTabTreeFormer achieves superior fidelity, utility, privacy, and efficiency. 
Our\nbest model yields a 40% utility improvement with 1/16 of the baseline model\nsize.\n","authors":["Jiayu Li","Bingyin Zhao","Zilong Zhao","Kevin Yee","Uzair Javaid","Biplab Sikdar"],"pdf_url":"https://arxiv.org/pdf/2501.01216v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2105.11233v3","updated":"2025-01-07T07:39:45Z","published":"2021-05-15T12:18:31Z","title":"Gradient descent in materia through homodyne gradient extraction","summary":" Deep learning, a multi-layered neural network approach inspired by the brain,\nhas revolutionized machine learning. One of its key enablers has been\nbackpropagation, an algorithm that computes the gradient of a loss function\nwith respect to the weights and biases in the neural network model, in\ncombination with its use in gradient descent. However, the implementation of\ndeep learning in digital computers is intrinsically energy hungry, with energy\nconsumption becoming prohibitively high for many applications. This has\nstimulated the development of specialized hardware, ranging from neuromorphic\nCMOS integrated circuits and integrated photonic tensor cores to\nunconventional, material-based computing system. The learning process in these\nmaterial systems, realized, e.g., by artificial evolution, equilibrium\npropagation or surrogate modelling, is a complicated and time-consuming\nprocess. Here, we demonstrate a simple yet efficient and accurate gradient\nextraction method, based on the principle of homodyne detection, for performing\ngradient descent on a loss function directly in a physical system without the\nneed of an analytical description. By perturbing the parameters that need to be\noptimized using sinusoidal waveforms with distinct frequencies, we effectively\nobtain the gradient information in a highly robust and scalable manner. We\nillustrate the method in dopant network processing units, but argue that it is\napplicable in a wide range of physical systems. 
Homodyne gradient extraction\ncan in principle be fully implemented in materia, facilitating the development\nof autonomously learning material systems.\n","authors":["Marcus N. Boon","Lorenzo Cassola","Hans-Christian Ruiz Euler","Tao Chen","Bram van de Ven","Unai Alegre Ibarra","Peter A. Bobbert","Wilfred G. van der Wiel"],"pdf_url":"https://arxiv.org/pdf/2105.11233v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03584v1","updated":"2025-01-07T07:17:04Z","published":"2025-01-07T07:17:04Z","title":"Discriminative Representation learning via Attention-Enhanced\n Contrastive Learning for Short Text Clustering","summary":" Contrastive learning has gained significant attention in short text\nclustering, yet it has an inherent drawback of mistakenly identifying samples\nfrom the same category as negatives and then separating them in the feature\nspace (false negative separation), which hinders the generation of superior\nrepresentations. To generate more discriminative representations for efficient\nclustering, we propose a novel short text clustering method, called\nDiscriminative Representation learning via \textbf{A}ttention-\textbf{E}nhanced\n\textbf{C}ontrastive \textbf{L}earning for Short Text Clustering\n(\textbf{AECL}). The \textbf{AECL} consists of two modules which are the\npseudo-label generation module and the contrastive learning module. Both\nmodules build a sample-level attention mechanism to capture similarity\nrelationships between samples and aggregate cross-sample features to generate\nconsistent representations. Then, the former module uses the more\ndiscriminative consistent representation to produce reliable supervision\ninformation to assist clustering, while the latter module explores similarity\nrelationships and consistent representations to optimize the construction of\npositive samples to perform similarity-guided contrastive learning, effectively\naddressing the false negative separation issue. 
Experimental results\ndemonstrate that the proposed \\textbf{AECL} outperforms state-of-the-art\nmethods. If the paper is accepted, we will open-source the code.\n","authors":["Zhihao Yao"],"pdf_url":"https://arxiv.org/pdf/2501.03584v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03583v1","updated":"2025-01-07T07:16:56Z","published":"2025-01-07T07:16:56Z","title":"STContext: A Multifaceted Dataset for Developing Context-aware\n Spatio-temporal Crowd Mobility Prediction Models","summary":" In smart cities, context-aware spatio-temporal crowd flow prediction (STCFP)\nmodels leverage contextual features (e.g., weather) to identify unusual crowd\nmobility patterns and enhance prediction accuracy. However, the best practice\nfor incorporating contextual features remains unclear due to inconsistent usage\nof contextual features in different papers. Developing a multifaceted dataset\nwith rich types of contextual features and STCFP scenarios is crucial for\nestablishing a principled context modeling paradigm. Existing open crowd flow\ndatasets lack an adequate range of contextual features, which poses an urgent\nrequirement to build a multifaceted dataset to fill these research gaps. To\nthis end, we create STContext, a multifaceted dataset for developing\ncontext-aware STCFP models. Specifically, STContext provides nine\nspatio-temporal datasets across five STCFP scenarios and includes ten\ncontextual features, including weather, air quality index, holidays, points of\ninterest, road networks, etc. Besides, we propose a unified workflow for\nincorporating contextual features into deep STCFP methods, with steps including\nfeature transformation, dependency modeling, representation fusion, and\ntraining strategies. Through extensive experiments, we have obtained several\nuseful guidelines for effective context modeling and insights for future\nresearch. 
The STContext is open-sourced at\nhttps://github.com/Liyue-Chen/STContext.\n","authors":["Liyue Chen","Jiangyi Fang","Tengfei Liu","Fangyuan Gao","Leye Wang"],"pdf_url":"https://arxiv.org/pdf/2501.03583v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18263v3","updated":"2025-01-07T07:05:17Z","published":"2024-12-24T08:25:38Z","title":"High-Rank Irreducible Cartesian Tensor Decomposition and Bases of\n Equivariant Spaces","summary":" Irreducible Cartesian tensors (ICTs) play a crucial role in the design of\nequivariant graph neural networks, as well as in theoretical chemistry and\nchemical physics. Meanwhile, the design space of available linear operations on\ntensors that preserve symmetry presents a significant challenge. The ICT\ndecomposition and a basis of this equivariant space are difficult to obtain for\nhigh-order tensors. After decades of research, Bonvicini (2024) recently\nachieves an explicit ICT decomposition for $n=5$ with factorial time/space\ncomplexity. This work, for the first time, obtains decomposition matrices for\nICTs up to rank $n=9$ with reduced and affordable complexity, by constructing\nwhat we call path matrices. The path matrices are obtained via performing\nchain-like contraction with Clebsch-Gordan matrices following the parentage\nscheme. We prove and leverage that the concatenation of path matrices is an\northonormal change-of-basis matrix between the Cartesian tensor product space\nand the spherical direct sum spaces. Furthermore, we identify a complete\northogonal basis for the equivariant space, rather than a spanning set\n(Pearce-Crump, 2023b), through this path matrices technique. We further extend\nour result to the arbitrary tensor product and direct sum spaces, enabling free\ndesign between different spaces while keeping symmetry. 
The Python code is\navailable at\nhttps://github.com/ShihaoShao-GH/ICT-decomposition-and-equivariant-bases, where\nthe $n=6,\dots,9$ ICT decomposition matrices are obtained in 1s, 3s, 11s, and\n4m32s on a 28-core Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz, respectively.\n","authors":["Shihao Shao","Yikang Li","Zhouchen Lin","Qinghua Cui"],"pdf_url":"https://arxiv.org/pdf/2412.18263v3.pdf","comment":"46 pages"},{"id":"http://arxiv.org/abs/2501.03575v1","updated":"2025-01-07T06:55:50Z","published":"2025-01-07T06:55:50Z","title":"Cosmos World Foundation Model Platform for Physical AI","summary":" Physical AI needs to be trained digitally first. It needs a digital twin of\nitself, the policy model, and a digital twin of the world, the world model. In\nthis paper, we present the Cosmos World Foundation Model Platform to help\ndevelopers build customized world models for their Physical AI setups. We\nposition a world foundation model as a general-purpose world model that can be\nfine-tuned into customized world models for downstream applications. Our\nplatform covers a video curation pipeline, pre-trained world foundation models,\nexamples of post-training of pre-trained world foundation models, and video\ntokenizers. 
To help Physical AI builders solve the most critical problems of\nour society, we make our platform open-source and our models open-weight with\npermissive licenses available via https://github.com/NVIDIA/Cosmos.\n","authors":[" NVIDIA"," :","Niket Agarwal","Arslan Ali","Maciej Bala","Yogesh Balaji","Erik Barker","Tiffany Cai","Prithvijit Chattopadhyay","Yongxin Chen","Yin Cui","Yifan Ding","Daniel Dworakowski","Jiaojiao Fan","Michele Fenzi","Francesco Ferroni","Sanja Fidler","Dieter Fox","Songwei Ge","Yunhao Ge","Jinwei Gu","Siddharth Gururani","Ethan He","Jiahui Huang","Jacob Huffman","Pooya Jannaty","Jingyi Jin","Seung Wook Kim","Gergely Klár","Grace Lam","Shiyi Lan","Laura Leal-Taixe","Anqi Li","Zhaoshuo Li","Chen-Hsuan Lin","Tsung-Yi Lin","Huan Ling","Ming-Yu Liu","Xian Liu","Alice Luo","Qianli Ma","Hanzi Mao","Kaichun Mo","Arsalan Mousavian","Seungjun Nah","Sriharsha Niverty","David Page","Despoina Paschalidou","Zeeshan Patel","Lindsey Pavao","Morteza Ramezanali","Fitsum Reda","Xiaowei Ren","Vasanth Rao Naik Sabavat","Ed Schmerling","Stella Shi","Bartosz Stefaniak","Shitao Tang","Lyne Tchapmi","Przemek Tredak","Wei-Cheng Tseng","Jibin Varghese","Hao Wang","Haoxiang Wang","Heng Wang","Ting-Chun Wang","Fangyin Wei","Xinyue Wei","Jay Zhangjie Wu","Jiashu Xu","Wei Yang","Lin Yen-Chen","Xiaohui Zeng","Yu Zeng","Jing Zhang","Qinsheng Zhang","Yuxuan Zhang","Qingqing Zhao","Artur Zolkowski"],"pdf_url":"https://arxiv.org/pdf/2501.03575v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03571v1","updated":"2025-01-07T06:51:17Z","published":"2025-01-07T06:51:17Z","title":"AADNet: Exploring EEG Spatiotemporal Information for Fast and Accurate\n Orientation and Timbre Detection of Auditory Attention Based on A Cue-Masked\n Paradigm","summary":" Auditory attention decoding from electroencephalogram (EEG) could infer to\nwhich source the user is attending in noisy environments. 
Decoding algorithms\nand experimental paradigm designs are crucial for the development of technology\nin practical applications. To simulate real-world scenarios, this study\nproposed a cue-masked auditory attention paradigm to avoid information leakage\nbefore the experiment. To obtain high decoding accuracy with low latency, an\nend-to-end deep learning model, AADNet, was proposed to exploit the\nspatiotemporal information from the short time window of EEG signals. The\nresults showed that with a 0.5-second EEG window, AADNet achieved an average\naccuracy of 93.46% and 91.09% in decoding auditory orientation attention (OA)\nand timbre attention (TA), respectively. It significantly outperformed five\nprevious methods and did not need knowledge of the original audio source.\nThis work demonstrated that it was possible to detect the orientation and\ntimbre of auditory attention from EEG signals quickly and accurately. The results\nare promising for real-time multi-property auditory attention decoding,\nfacilitating the application of neuro-steered hearing aids and other\nassistive listening devices.\n","authors":["Keren Shi","Xu Liu","Xue Yuan","Haijie Shang","Ruiting Dai","Hanbin Wang","Yunfa Fu","Ning Jiang","Jiayuan He"],"pdf_url":"https://arxiv.org/pdf/2501.03571v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.12370v2","updated":"2025-01-07T06:47:00Z","published":"2024-12-16T21:56:01Z","title":"Scam Detection for Ethereum Smart Contracts: Leveraging Graph\n Representation Learning for Secure Blockchain","summary":" Due to the increasing abuse of fraudulent activities that result in\nsignificant financial and reputational harm, Ethereum smart contracts face a\nsignificant problem in detecting fraud. Existing monitoring methods typically\nrely on lease code analysis or physically extracted features, which suffer from\nscalability and adaptability limitations. 
In this study, we use graph\nrepresentation learning to observe purchase trends and find fraudulent deals.\nWe can achieve powerful categorisation performance by using innovative machine\nlearning versions and transforming Ethereum invoice data into graph structures.\nOur method addresses label imbalance through SMOTE-ENN techniques and evaluates\nmodels like Multi-Layer Perceptron (MLP) and Graph Convolutional Networks\n(GCN). Experimental results show that the MLP type surpasses the GCN in this\nenvironment, with domain-specific assessments closely aligned with real-world\nassessments. This study provides a scalable and efficient way to improve\nEthereum's ecosystem's confidence and security.\n","authors":["Yihong Jin","Ze Yang"],"pdf_url":"https://arxiv.org/pdf/2412.12370v2.pdf","comment":"Accepted to BDICN 2025"},{"id":"http://arxiv.org/abs/2403.10089v4","updated":"2025-01-07T06:45:58Z","published":"2024-03-15T08:05:16Z","title":"Approximation and bounding techniques for the Fisher-Rao distances\n between parametric statistical models","summary":" The Fisher-Rao distance between two probability distributions of a\nstatistical model is defined as the Riemannian geodesic distance induced by the\nFisher information metric. In order to calculate the Fisher-Rao distance in\nclosed-form, we need (1) to elicit a formula for the Fisher-Rao geodesics, and\n(2) to integrate the Fisher length element along those geodesics. We consider\nseveral numerically robust approximation and bounding techniques for the\nFisher-Rao distances: First, we report generic upper bounds on Fisher-Rao\ndistances based on closed-form 1D Fisher-Rao distances of submodels. Second, we\ndescribe several generic approximation schemes depending on whether the\nFisher-Rao geodesics or pregeodesics are available in closed-form or not. 
In\nparticular, we obtain a generic method to guarantee an arbitrarily small\nadditive error on the approximation provided that Fisher-Rao pregeodesics and\ntight lower and upper bounds are available. Third, we consider the case of\nFisher metrics being Hessian metrics, and report generic tight upper bounds on\nthe Fisher-Rao distances using techniques of information geometry.\nUniparametric and biparametric statistical models always have Fisher Hessian\nmetrics, and in general a simple test allows to check whether the Fisher\ninformation matrix yields a Hessian metric or not. Fourth, we consider\nelliptical distribution families and show how to apply the above techniques to\nthese models. We also propose two new distances based either on the Fisher-Rao\nlengths of curves serving as proxies of Fisher-Rao geodesics, or based on the\nBirkhoff/Hilbert projective cone distance. Last, we consider an alternative\ngroup-theoretic approach for statistical transformation models based on the\nnotion of maximal invariant which yields insights on the structures of the\nFisher-Rao distance formula which may be used fruitfully in applications.\n","authors":["Frank Nielsen"],"pdf_url":"https://arxiv.org/pdf/2403.10089v4.pdf","comment":"48 pages"},{"id":"http://arxiv.org/abs/2501.03568v1","updated":"2025-01-07T06:43:18Z","published":"2025-01-07T06:43:18Z","title":"Advanced Tutorial: Label-Efficient Two-Sample Tests","summary":" Hypothesis testing is a statistical inference approach used to determine\nwhether data supports a specific hypothesis. An important type is the\ntwo-sample test, which evaluates whether two sets of data points are from\nidentical distributions. This test is widely used, such as by clinical\nresearchers comparing treatment effectiveness. This tutorial explores\ntwo-sample testing in a context where an analyst has many features from two\nsamples, but determining the sample membership (or labels) of these features is\ncostly. 
In machine learning, a similar scenario is studied in active learning.\nThis tutorial extends active learning concepts to two-sample testing within\nthis \textit{label-costly} setting while maintaining statistical validity and\nhigh testing power. Additionally, the tutorial discusses practical applications\nof these label-efficient two-sample tests.\n","authors":["Weizhi Li","Visar Berisha","Gautam Dasarathy"],"pdf_url":"https://arxiv.org/pdf/2501.03568v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.15320v2","updated":"2025-01-07T06:39:29Z","published":"2024-07-07T09:25:52Z","title":"Edge Graph Intelligence: Reciprocally Empowering Edge Networks with\n Graph Intelligence","summary":" Recent years have witnessed a thriving growth of computing facilities\nconnected at the network edge, cultivating edge networks as a fundamental\ninfrastructure for supporting miscellaneous intelligent services. Meanwhile,\nArtificial Intelligence (AI) frontiers have extrapolated to the graph domain\nand promoted Graph Intelligence (GI). Given the inherent relation between\ngraphs and networks, the interdiscipline of graph learning and edge networks,\ni.e., Edge GI or EGI, has revealed a novel interplay between them -- GI aids in\noptimizing edge networks, while edge networks facilitate GI model deployment.\nDriven by this delicate closed-loop, EGI is recognized as a promising solution\nto fully unleash the potential of edge computing power and is garnering growing\nattention. Nevertheless, research on EGI remains nascent, and there is a\nsoaring demand within both the communications and AI communities for a\ndedicated venue to share recent advancements. 
To this end, this paper promotes\nthe concept of EGI, explores its scope and core principles, and conducts a\ncomprehensive survey concerning recent research efforts on this emerging field.\nSpecifically, this paper introduces and discusses: 1) fundamentals of edge\ncomputing and graph learning, 2) emerging techniques centering on the closed\nloop between graph intelligence and edge networks, and 3) open challenges and\nresearch opportunities of future EGI. By bridging the gap across communication,\nnetworking, and graph learning areas, we believe that this survey can garner\nincreased attention, foster meaningful discussions, and inspire further\nresearch ideas in EGI.\n","authors":["Liekang Zeng","Shengyuan Ye","Xu Chen","Xiaoxi Zhang","Ju Ren","Jian Tang","Yang Yang","Xuemin Shen"],"pdf_url":"https://arxiv.org/pdf/2407.15320v2.pdf","comment":"Accepted by IEEE Communications Surveys & Tutorials"},{"id":"http://arxiv.org/abs/2412.02155v2","updated":"2025-01-07T06:30:24Z","published":"2024-12-03T04:29:27Z","title":"CausalMob: Causal Human Mobility Prediction with LLMs-derived Human\n Intentions toward Public Events","summary":" Large-scale human mobility exhibits spatial and temporal patterns that can\nassist policymakers in decision making. Although traditional prediction models\nattempt to capture these patterns, they are often interfered with by non-periodic public\nevents, such as disasters and occasional celebrations. Since regular human\nmobility patterns are heavily affected by these events, estimating their causal\neffects is critical to accurate mobility predictions. Although news articles\nprovide unique perspectives on these events in an unstructured format,\nprocessing is a challenge. In this study, we propose a causality-augmented\nprediction model, called CausalMob, to analyze the causal effects of public\nevents. 
We first utilize large language models (LLMs) to extract human\nintentions from news articles and transform them into features that act as\ncausal treatments. Next, the model learns representations of spatio-temporal\nregional covariates from multiple data sources to serve as confounders for\ncausal inference. Finally, we present a causal effect estimation framework to\nensure event features remain independent of confounders during prediction.\nBased on large-scale real-world data, the experimental results show that the\nproposed model excels in human mobility prediction, outperforming\nstate-of-the-art models.\n","authors":["Xiaojie Yang","Hangli Ge","Jiawei Wang","Zipei Fan","Renhe Jiang","Ryosuke Shibasaki","Noboru Koshizuka"],"pdf_url":"https://arxiv.org/pdf/2412.02155v2.pdf","comment":"Accepted by KDD 2025"},{"id":"http://arxiv.org/abs/2310.10207v6","updated":"2025-01-07T06:28:56Z","published":"2023-10-16T09:19:18Z","title":"Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in\n the Real World","summary":" We introduce Bongard-OpenWorld, a new benchmark for evaluating real-world\nfew-shot reasoning for machine vision. It originates from the classical Bongard\nProblems (BPs): Given two sets of images (positive and negative), the model\nneeds to identify the set that query images belong to by inducing the visual\nconcepts, which are exclusively depicted by images from the positive set. Our\nbenchmark inherits the few-shot concept induction of the original BPs while\nadding two novel layers of challenge: 1) open-world free-form concepts, as\nthe visual concepts in Bongard-OpenWorld are unique compositions of terms from\nan open vocabulary, ranging from object categories to abstract visual\nattributes and commonsense factual knowledge; 2) real-world images, as opposed\nto the synthetic diagrams used by many counterparts. In our exploration,\nBongard-OpenWorld already poses a significant challenge to current few-shot\nreasoning algorithms. 
We further investigate to what extent the recently\nintroduced Large Language Models (LLMs) and Vision-Language Models (VLMs) can\nsolve our task, by directly probing VLMs, and combining VLMs and LLMs in an\ninteractive reasoning scheme. We even conceived a neuro-symbolic reasoning\napproach that reconciles LLMs & VLMs with logical reasoning to emulate the\nhuman problem-solving process for Bongard Problems. However, none of these\napproaches manage to close the human-machine gap, as the best learner achieves\n64% accuracy while human participants easily reach 91%. We hope\nBongard-OpenWorld can help us better understand the limitations of current\nvisual intelligence and facilitate future research on visual agents with\nstronger few-shot visual reasoning capabilities.\n","authors":["Rujie Wu","Xiaojian Ma","Zhenliang Zhang","Wei Wang","Qing Li","Song-Chun Zhu","Yizhou Wang"],"pdf_url":"https://arxiv.org/pdf/2310.10207v6.pdf","comment":"Accepted to ICLR 2024"},{"id":"http://arxiv.org/abs/2501.03562v1","updated":"2025-01-07T06:22:55Z","published":"2025-01-07T06:22:55Z","title":"Rethinking Adversarial Attacks in Reinforcement Learning from Policy\n Distribution Perspective","summary":" Deep Reinforcement Learning (DRL) suffers from uncertainties and inaccuracies\nin the observation signal in real-world applications. Adversarial attack is an\neffective method for evaluating the robustness of DRL agents. However, existing\nattack methods targeting individual sampled actions have limited impact on the\noverall policy distribution, particularly in continuous action spaces. To\naddress these limitations, we propose the Distribution-Aware Projected Gradient\nDescent attack (DAPGD). DAPGD uses distribution similarity as the gradient\nperturbation input to attack the policy network, which leverages the entire\npolicy distribution rather than relying on individual samples. 
We utilize the\nBhattacharyya distance in DAPGD to measure policy similarity, enabling\nsensitive detection of subtle but critical differences between probability\ndistributions. Our experiment results demonstrate that DAPGD achieves SOTA\nresults compared to the baselines in three robot navigation tasks, achieving an\naverage 22.03% higher reward drop compared to the best baseline.\n","authors":["Tianyang Duan","Zongyuan Zhang","Zheng Lin","Yue Gao","Ling Xiong","Yong Cui","Hongbin Liang","Xianhao Chen","Heming Cui","Dong Huang"],"pdf_url":"https://arxiv.org/pdf/2501.03562v1.pdf","comment":"10 pages, 2 figures, 2 tables"},{"id":"http://arxiv.org/abs/2501.03560v1","updated":"2025-01-07T06:21:40Z","published":"2025-01-07T06:21:40Z","title":"KG-TRICK: Unifying Textual and Relational Information Completion of\n Knowledge for Multilingual Knowledge Graphs","summary":" Multilingual knowledge graphs (KGs) provide high-quality relational and\ntextual information for various NLP applications, but they are often\nincomplete, especially in non-English languages. Previous research has shown\nthat combining information from KGs in different languages aids either\nKnowledge Graph Completion (KGC), the task of predicting missing relations\nbetween entities, or Knowledge Graph Enhancement (KGE), the task of predicting\nmissing textual information for entities. Although previous efforts have\nconsidered KGC and KGE as independent tasks, we hypothesize that they are\ninterdependent and mutually beneficial. To this end, we introduce KG-TRICK, a\nnovel sequence-to-sequence framework that unifies the tasks of textual and\nrelational information completion for multilingual KGs. KG-TRICK demonstrates\nthat: i) it is possible to unify the tasks of KGC and KGE into a single\nframework, and ii) combining textual information from multiple languages is\nbeneficial to improve the completeness of a KG. 
As part of our contributions,\nwe also introduce WikiKGE10++, the largest manually-curated benchmark for\ntextual information completion of KGs, which features over 25,000 entities\nacross 10 diverse languages.\n","authors":["Zelin Zhou","Simone Conia","Daniel Lee","Min Li","Shenglei Huang","Umar Farooq Minhas","Saloni Potdar","Henry Xiao","Yunyao Li"],"pdf_url":"https://arxiv.org/pdf/2501.03560v1.pdf","comment":"Camera ready for COLING 2025"},{"id":"http://arxiv.org/abs/2408.09791v2","updated":"2025-01-07T06:11:20Z","published":"2024-08-19T08:40:53Z","title":"ALTBI: Constructing Improved Outlier Detection Models via Optimization\n of Inlier-Memorization Effect","summary":" Outlier detection (OD) is the task of identifying unusual observations (or\noutliers) from a given or upcoming data by learning unique patterns of normal\nobservations (or inliers). Recently, a study introduced a powerful unsupervised\nOD (UOD) solver based on a new observation of deep generative models, called\ninlier-memorization (IM) effect, which suggests that generative models memorize\ninliers before outliers in early learning stages. In this study, we aim to\ndevelop a theoretically principled method to address UOD tasks by maximally\nutilizing the IM effect. We begin by observing that the IM effect is observed\nmore clearly when the given training data contain fewer outliers. This finding\nindicates a potential for enhancing the IM effect in UOD regimes if we can\neffectively exclude outliers from mini-batches when designing the loss\nfunction. To this end, we introduce two main techniques: 1) increasing the\nmini-batch size as the model training proceeds and 2) using an adaptive\nthreshold to calculate the truncated loss function. We theoretically show that\nthese two techniques effectively filter out outliers from the truncated loss\nfunction, allowing us to utilize the IM effect to the fullest. 
Coupled with an\nadditional ensemble strategy, we propose our method and term it Adaptive Loss\nTruncation with Batch Increment (ALTBI). We provide extensive experimental\nresults to demonstrate that ALTBI achieves state-of-the-art performance in\nidentifying outliers compared to other recent methods, even with significantly\nlower computation costs. Additionally, we show that our method yields robust\nperformances when combined with privacy-preserving algorithms.\n","authors":["Seoyoung Cho","Jaesung Hwang","Kwan-Young Bak","Dongha Kim"],"pdf_url":"https://arxiv.org/pdf/2408.09791v2.pdf","comment":"24 pages in total"},{"id":"http://arxiv.org/abs/2405.05409v4","updated":"2025-01-07T06:08:52Z","published":"2024-05-08T20:23:24Z","title":"Initialization is Critical to Whether Transformers Fit Composite\n Functions by Reasoning or Memorizing","summary":" Transformers have shown impressive capabilities across various tasks, but\ntheir performance on compositional problems remains a topic of debate. In this\nwork, we investigate the mechanisms of how transformers behave on unseen\ncompositional tasks. We discover that the parameter initialization scale plays\na critical role in determining whether the model learns inferential\n(reasoning-based) solutions, which capture the underlying compositional\nprimitives, or symmetric (memory-based) solutions, which simply memorize\nmappings without understanding the compositional structure. By analyzing the\ninformation flow and vector representations within the model, we reveal the\ndistinct mechanisms underlying these solution types. We further find that\ninferential (reasoning-based) solutions exhibit low complexity bias, which we\nhypothesize is a key factor enabling them to learn individual mappings for\nsingle anchors. We validate our conclusions on various real-world datasets. 
Our\nfindings provide valuable insights into the role of initialization scale in\ntuning the reasoning and memorizing ability and we propose the initialization\nrate $\\gamma$ to be a convenient tunable hyper-parameter in common deep\nlearning frameworks, where $1/d_{\\mathrm{in}}^\\gamma$ is the standard deviation\nof parameters of the layer with $d_{\\mathrm{in}}$ input neurons.\n","authors":["Zhongwang Zhang","Pengxiao Lin","Zhiwei Wang","Yaoyu Zhang","Zhi-Qin John Xu"],"pdf_url":"https://arxiv.org/pdf/2405.05409v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.02156v2","updated":"2025-01-07T05:36:22Z","published":"2025-01-04T01:45:32Z","title":"The Race to Efficiency: A New Perspective on AI Scaling Laws","summary":" As large-scale AI models expand, training becomes costlier and sustaining\nprogress grows harder. Classical scaling laws (e.g., Kaplan et al. (2020),\nHoffmann et al. (2022)) predict training loss from a static compute budget yet\nneglect time and efficiency, prompting the question: how can we balance\nballooning GPU fleets with rapidly improving hardware and algorithms? We\nintroduce the relative-loss equation, a time- and efficiency-aware framework\nthat extends classical AI scaling laws. Our model shows that, without ongoing\nefficiency gains, advanced performance could demand millennia of training or\nunrealistically large GPU fleets. However, near-exponential progress remains\nachievable if the \"efficiency-doubling rate\" parallels Moore's Law. By\nformalizing this race to efficiency, we offer a quantitative roadmap for\nbalancing front-loaded GPU investments with incremental improvements across the\nAI stack. Empirical trends suggest that sustained efficiency gains can push AI\nscaling well into the coming decade, providing a new perspective on the\ndiminishing returns inherent in classical scaling.\n","authors":["Chien-Ping Lu"],"pdf_url":"https://arxiv.org/pdf/2501.02156v2.pdf","comment":"21 pages, 3 figures. 
2 tables, second draft"},{"id":"http://arxiv.org/abs/2402.13516v7","updated":"2025-01-07T05:26:54Z","published":"2024-02-21T03:58:49Z","title":"ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity\n within Large Language Models","summary":" Activation sparsity refers to the existence of considerable\nweakly-contributed elements among activation outputs. As a prevalent property\nof the models using the ReLU activation function, activation sparsity has been\nproven a promising paradigm to boost model inference efficiency. Nevertheless,\nmost large language models (LLMs) adopt activation functions without intrinsic\nactivation sparsity (e.g., GELU and Swish). Some recent efforts have explored\nintroducing ReLU or its variants as the substitutive activation function to\nhelp LLMs achieve activation sparsity and inference acceleration, but few can\nsimultaneously obtain high sparsity and comparable model performance. This\npaper introduces a simple and effective sparsification method named \"ProSparse\"\nto push LLMs for higher activation sparsity while maintaining comparable\nperformance. Specifically, after substituting the activation function of LLMs\nwith ReLU, ProSparse adopts progressive sparsity regularization with a factor\nsmoothly increasing along the multi-stage sine curves. This can enhance\nactivation sparsity and mitigate performance degradation by avoiding radical\nshifts in activation distributions. With ProSparse, we obtain high sparsity of\n89.32% for LLaMA2-7B, 88.80% for LLaMA2-13B, and 87.89% for end-size\nMiniCPM-1B, respectively, achieving comparable performance to their original\nSwish-activated versions. These present the most sparsely activated models\namong open-source LLaMA versions and competitive end-size models, considerably\nsurpassing ReluLLaMA-7B (66.98%) and ReluLLaMA-13B (71.56%). 
Our inference\nacceleration experiments further demonstrate the significant practical\nacceleration potential of LLMs with higher activation sparsity, obtaining up to\n4.52$\\times$ inference speedup.\n","authors":["Chenyang Song","Xu Han","Zhengyan Zhang","Shengding Hu","Xiyu Shi","Kuai Li","Chen Chen","Zhiyuan Liu","Guangli Li","Tao Yang","Maosong Sun"],"pdf_url":"https://arxiv.org/pdf/2402.13516v7.pdf","comment":"19 pages, 4 figures, 9 tables"},{"id":"http://arxiv.org/abs/2501.03540v1","updated":"2025-01-07T05:23:36Z","published":"2025-01-07T05:23:36Z","title":"Deep Learning within Tabular Data: Foundations, Challenges, Advances and\n Future Directions","summary":" Tabular data remains one of the most prevalent data types across a wide range\nof real-world applications, yet effective representation learning for this\ndomain poses unique challenges due to its irregular patterns, heterogeneous\nfeature distributions, and complex inter-column dependencies. This survey\nprovides a comprehensive review of state-of-the-art techniques in tabular data\nrepresentation learning, structured around three foundational design elements:\ntraining data, neural architectures, and learning objectives. Unlike prior\nsurveys that focus primarily on either architecture design or learning\nstrategies, we adopt a holistic perspective that emphasizes the universality\nand robustness of representation learning methods across diverse downstream\ntasks. We examine recent advances in data augmentation and generation,\nspecialized neural network architectures tailored to tabular data, and\ninnovative learning objectives that enhance representation quality.\nAdditionally, we highlight the growing influence of self-supervised learning\nand the adaptation of transformer-based foundation models for tabular data. Our\nreview is based on a systematic literature search using rigorous inclusion\ncriteria, encompassing 127 papers published since 2020 in top-tier conferences\nand journals. 
Through detailed analysis and comparison, we identify emerging\ntrends, critical gaps, and promising directions for future research, aiming to\nguide the development of more generalizable and effective tabular data\nrepresentation methods.\n","authors":["Weijieying Ren","Tianxiang Zhao","Yuqing Huang","Vasant Honavar"],"pdf_url":"https://arxiv.org/pdf/2501.03540v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03526v1","updated":"2025-01-07T04:42:45Z","published":"2025-01-07T04:42:45Z","title":"FgC2F-UDiff: Frequency-guided and Coarse-to-fine Unified Diffusion Model\n for Multi-modality Missing MRI Synthesis","summary":" Multi-modality magnetic resonance imaging (MRI) is essential for the\ndiagnosis and treatment of brain tumors. However, missing modalities are\ncommonly observed due to limitations in scan time, scan corruption, artifacts,\nmotion, and contrast agent intolerance. Synthesis of missing MRI has been a\nmeans to address the limitations of modality insufficiency in clinical practice\nand research. However, there are still some challenges, such as poor\ngeneralization, inaccurate non-linear mapping, and slow processing speeds. To\naddress the aforementioned issues, we propose a novel unified synthesis model,\nthe Frequency-guided and Coarse-to-fine Unified Diffusion Model (FgC2F-UDiff),\ndesigned for multiple inputs and outputs. Specifically, the Coarse-to-fine\nUnified Network (CUN) fully exploits the iterative denoising properties of\ndiffusion models, from global to detail, by dividing the denoising process into\ntwo stages, coarse and fine, to enhance the fidelity of synthesized images.\nSecondly, the Frequency-guided Collaborative Strategy (FCS) harnesses\nappropriate frequency information as prior knowledge to guide the learning of a\nunified, highly non-linear mapping. 
Thirdly, the Specific-acceleration Hybrid\nMechanism (SHM) integrates specific mechanisms to accelerate the diffusion\nmodel and enhance the feasibility of many-to-many synthesis. Extensive\nexperimental evaluations have demonstrated that our proposed FgC2F-UDiff model\nachieves superior performance on two datasets, validated through a\ncomprehensive assessment that includes both qualitative observations and\nquantitative metrics, such as PSNR, SSIM, LPIPS, and FID.\n","authors":["Xiaojiao Xiao","Qinmin Vivian Hu","Guanghui Wang"],"pdf_url":"https://arxiv.org/pdf/2501.03526v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.13371v2","updated":"2025-01-07T04:42:21Z","published":"2024-02-20T20:53:04Z","title":"FIDLAR: Forecast-Informed Deep Learning Architecture for Flood\n Mitigation","summary":" In coastal river systems, frequent floods, often occurring during major\nstorms or king tides, pose a severe threat to lives and property. However,\nthese floods can be mitigated or even prevented by strategically releasing\nwater before extreme weather events with hydraulic structures such as dams,\ngates, pumps, and reservoirs. A standard approach used by local water\nmanagement agencies is the \"rule-based\" method, which specifies predetermined\npre-releases of water based on historical and time-tested human experience, but\nwhich tends to result in excess or inadequate water release. Model\npredictive control (MPC), a physics-based model for prediction, is an\nalternative approach, albeit involving computationally intensive calculations.\nIn this paper, we propose a Forecast Informed Deep Learning Architecture,\nFIDLAR, to achieve rapid and optimal flood management with precise water\npre-releases. FIDLAR seamlessly integrates two neural network modules: one\ncalled the Flood Manager, which is responsible for generating water pre-release\nschedules, and another called the Flood Evaluator, which assesses these\ngenerated schedules. 
The Evaluator module is pre-trained separately, and its\ngradient-based feedback is used to train the Manager model, ensuring optimal\nwater pre-releases. We have conducted experiments using FIDLAR with data from a\nflood-prone coastal area in South Florida, particularly susceptible to frequent\nstorms. Results show that FIDLAR is several orders of magnitude faster than\ncurrently used physics-based approaches while outperforming baseline methods\nwith improved water pre-release schedules.\n","authors":["Jimeng Shi","Zeda Yin","Arturo Leon","Jayantha Obeysekera","Giri Narasimhan"],"pdf_url":"https://arxiv.org/pdf/2402.13371v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.12935v2","updated":"2025-01-07T04:42:20Z","published":"2024-06-17T03:03:34Z","title":"ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat\n Templates","summary":" Large language models (LLMs) are expected to follow instructions from users\nand engage in conversations. Techniques to enhance LLMs' instruction-following\ncapabilities typically fine-tune them using data structured according to a\npredefined chat template. Although chat templates are shown to be effective in\noptimizing LLM performance, their impact on safety alignment of LLMs has been\nless understood, which is crucial for deploying LLMs safely at scale.\n In this paper, we investigate how chat templates affect safety alignment of\nLLMs. We identify a common vulnerability, named ChatBug, that is introduced by\nchat templates. Our key insight to identify ChatBug is that the chat templates\nprovide a rigid format that needs to be followed by LLMs, but not by users.\nHence, a malicious user may not necessarily follow the chat template when\nprompting LLMs. Instead, malicious users could leverage their knowledge of the\nchat template and accordingly craft their prompts to bypass safety alignments\nof LLMs. We develop two attacks to exploit the ChatBug vulnerability. 
We\ndemonstrate that a malicious user can exploit the ChatBug vulnerability of\neight state-of-the-art (SOTA) LLMs and effectively elicit unintended responses\nfrom these models. Moreover, we show that ChatBug can be exploited by existing\njailbreak attacks to enhance their attack success rates. We investigate\npotential countermeasures to ChatBug. Our results show that while adversarial\ntraining effectively mitigates the ChatBug vulnerability, the victim model\nincurs significant performance degradation. These results highlight the\ntrade-off between safety alignment and helpfulness. Developing new methods for\ninstruction tuning to balance this trade-off is an open and critical direction\nfor future research.\n","authors":["Fengqing Jiang","Zhangchen Xu","Luyao Niu","Bill Yuchen Lin","Radha Poovendran"],"pdf_url":"https://arxiv.org/pdf/2406.12935v2.pdf","comment":"This paper is accepted to AAAI 2025"},{"id":"http://arxiv.org/abs/2310.15624v2","updated":"2025-01-07T04:39:25Z","published":"2023-10-24T08:45:15Z","title":"GUPNet++: Geometry Uncertainty Propagation Network for Monocular 3D\n Object Detection","summary":" Geometry plays a significant role in monocular 3D object detection. It can be\nused to estimate object depth by using the perspective projection between an\nobject's physical size and its 2D projection in the image plane, which can\nintroduce mathematical priors into deep models. However, this projection\nprocess also introduces error amplification, where the error of the estimated\nheight is amplified and reflected into the projected depth. It leads to\nunreliable depth inferences and also impairs training stability. To tackle this\nproblem, we propose a novel Geometry Uncertainty Propagation Network (GUPNet++)\nby modeling geometry projection in a probabilistic manner. This ensures depth\npredictions are well-bounded and associated with a reasonable uncertainty. The\nsignificance of introducing such geometric uncertainty is two-fold: (1) 
It\nmodels the uncertainty propagation relationship of the geometry projection\nduring training, improving the stability and efficiency of the end-to-end model\nlearning. (2) It can be converted into a highly reliable confidence measure to indicate\nthe quality of the 3D detection result, enabling more reliable detection\ninference. Experiments show that the proposed approach not only obtains\nstate-of-the-art (SOTA) performance in image-based monocular 3D detection but\nalso demonstrates superiority in efficacy with a simplified framework.\n","authors":["Yan Lu","Xinzhu Ma","Lei Yang","Tianzhu Zhang","Yating Liu","Qi Chu","Tong He","Yonghui Li","Wanli Ouyang"],"pdf_url":"https://arxiv.org/pdf/2310.15624v2.pdf","comment":"18 pages, 9 figures"},{"id":"http://arxiv.org/abs/2501.03523v1","updated":"2025-01-07T04:38:28Z","published":"2025-01-07T04:38:28Z","title":"Vocal Tract Length Warped Features for Spoken Keyword Spotting","summary":" In this paper, we propose several methods that incorporate vocal tract length\n(VTL) warped features for spoken keyword spotting (KWS). The first method,\nVTL-independent KWS, involves training a single deep neural network (DNN) that\nutilizes VTL features with various warping factors. During training, a specific\nVTL feature is randomly selected per epoch, allowing the exploration of VTL\nvariations. During testing, the VTL features with different warping factors of\na test utterance are scored against the DNN and combined with equal weight. The\nsecond method scores the conventional features of a test utterance (without\nVTL warping) against the DNN. The third method, VTL-concatenation KWS,\nconcatenates VTL warped features to form high-dimensional features for KWS.\nEvaluations carried out on the English Google Command dataset demonstrate that\nthe proposed methods improve the accuracy of KWS.\n","authors":["Achintya kr. 
Sarkar","Priyanka Dwivedi","Zheng-Hua Tan"],"pdf_url":"https://arxiv.org/pdf/2501.03523v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03518v1","updated":"2025-01-07T04:21:13Z","published":"2025-01-07T04:21:13Z","title":"Transfer Learning for Deep-Unfolded Combinatorial Optimization Solver\n with Quantum Annealer","summary":" Quantum annealing (QA) has attracted research interest as a sampler and\ncombinatorial optimization problem (COP) solver. A recently proposed\nsampling-based solver for QA significantly reduces the required number of\nqubits, making it capable of handling large COPs. In relation to this, a trainable\nsampling-based COP solver has been proposed that optimizes its internal\nparameters from a dataset by using a deep learning technique called deep\nunfolding. Although learning the internal parameters accelerates the\nconvergence speed, the sampler in the trainable solver is restricted to using a\nclassical sampler owing to the training cost. In this study, to utilize QA in\nthe trainable solver, we propose classical-quantum transfer learning, where\nparameters are trained classically, and the trained parameters are used in the\nsolver with QA. The results of numerical experiments demonstrate that the\ntrainable quantum COP solver using classical-quantum transfer learning improves\nconvergence speed and execution time over the original solver.\n","authors":["Ryo Hagiwara","Shunta Arai","Satoshi Takabe"],"pdf_url":"https://arxiv.org/pdf/2501.03518v1.pdf","comment":"8 pages, 6 figures"},{"id":"http://arxiv.org/abs/2501.03228v2","updated":"2025-01-07T04:05:53Z","published":"2025-01-06T18:59:55Z","title":"LightGNN: Simple Graph Neural Network for Recommendation","summary":" Graph neural networks (GNNs) have demonstrated superior performance in\ncollaborative recommendation through their ability to conduct high-order\nrepresentation smoothing, effectively capturing structural information within\nusers' interaction patterns. 
However, existing GNN paradigms face significant\nchallenges in scalability and robustness when handling large-scale, noisy, and\nreal-world datasets. To address these challenges, we present LightGNN, a\nlightweight and distillation-based GNN pruning framework designed to\nsubstantially reduce model complexity while preserving essential collaboration\nmodeling capabilities. Our LightGNN framework introduces a computationally\nefficient pruning module that adaptively identifies and removes redundant edges\nand embedding entries for model compression. The framework is guided by a\nresource-friendly hierarchical knowledge distillation objective, whose\nintermediate layer augments the observed graph to maintain performance,\nparticularly in high-rate compression scenarios. Extensive experiments on\npublic datasets demonstrate LightGNN's effectiveness, significantly improving\nboth computational efficiency and recommendation accuracy. Notably, LightGNN\nachieves an 80% reduction in edge count and 90% reduction in embedding entries\nwhile maintaining performance comparable to more complex state-of-the-art\nbaselines. The implementation of our LightGNN framework is available at the\ngithub repository: https://github.com/HKUDS/LightGNN.\n","authors":["Guoxuan Chen","Lianghao Xia","Chao Huang"],"pdf_url":"https://arxiv.org/pdf/2501.03228v2.pdf","comment":"Accepted to WSDM 2025 Oral"},{"id":"http://arxiv.org/abs/2410.23111v5","updated":"2025-01-07T03:56:49Z","published":"2024-10-30T15:23:44Z","title":"Exploring Gradient Subspaces: Addressing and Overcoming LoRA's\n Limitations in Federated Fine-Tuning of Large Language Models","summary":" Large Language Models (LLMs) have demonstrated remarkable capabilities across\nvarious domains, particularly in task generalization for both text and vision\ndata. 
While fine-tuning these models can significantly enhance their\nperformance on specific downstream tasks, it often requires high-quality data\nthat cannot be shared due to privacy concerns. Federated Learning (FL) offers a\npromising solution for collaborative training without direct data sharing.\nHowever, many parameter-efficient fine-tuning strategies for LLMs in FL,\nparticularly those based on Low-Rank Adaptation (LoRA), face limitations. In\nthis paper, we critically analyze the convergence and performance guarantees of\npopular FL frameworks utilizing LoRA, highlighting its suboptimal nature due to\nconstrained subspace learning of low-rank matrices. This limitation hinders\neffective fine-tuning of LLMs in federated settings. Through rigorous\nanalytical and empirical evaluations, we demonstrate that direct weight\naveraging outperforms LoRA-based strategies, leading to superior performance\nfor fine-tuned models. Our comprehensive comparison unmasks inefficiencies in\nLoRA approaches and underscores the advantages of direct weight aggregation. We\nextend our analysis to low-rank gradient-based optimizers, such as GaLore, used\nduring local training steps. Our findings show that GaLore along with\ndirect-weight aggregation is a more effective approach, outperforming federated\nLoRA methods like FlexLoRA and FFA-LoRA across both text and image modalities.\nWhile privacy remains paramount in FL discourse, our focus is on assessing\nperformance outcomes of federated fine-tuned models and evaluating various FL\nframeworks from both theoretical and empirical perspectives. 
Our findings\nadvocate reassessing the reliance on LoRA within FL contexts, paving the way\nfor more efficient training methodologies.\n","authors":["Navyansh Mahla","Kshitij Sharad Jadhav","Ganesh Ramakrishnan"],"pdf_url":"https://arxiv.org/pdf/2410.23111v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.16766v2","updated":"2025-01-07T03:53:12Z","published":"2024-05-27T02:27:28Z","title":"Concept Matching with Agent for Out-of-Distribution Detection","summary":" The remarkable achievements of Large Language Models (LLMs) have captivated\nthe attention of both academia and industry, transcending their initial role in\ndialogue generation. To expand the usage scenarios of LLM, some works enhance\nthe effectiveness and capabilities of the model by introducing more external\ninformation, which is called the agent paradigm. Based on this idea, we propose\na new method that integrates the agent paradigm into out-of-distribution (OOD)\ndetection task, aiming to improve its robustness and adaptability. Our proposed\nmethod, Concept Matching with Agent (CMA), employs neutral prompts as agents to\naugment the CLIP-based OOD detection process. These agents function as dynamic\nobservers and communication hubs, interacting with both In-distribution (ID)\nlabels and data inputs to form vector triangle relationships. This triangular\nframework offers a more nuanced approach than the traditional binary\nrelationship, allowing for better separation and identification of ID and OOD\ninputs. 
Our extensive experimental results showcase the superior performance of\nCMA over both zero-shot and training-required methods in a diverse array of\nreal-world scenarios.\n","authors":["Yuxiao Lee","Xiaofeng Cao","Jingcai Guo","Wei Ye","Qing Guo","Yi Chang"],"pdf_url":"https://arxiv.org/pdf/2405.16766v2.pdf","comment":"Accepted by AAAI-25"},{"id":"http://arxiv.org/abs/2501.03507v1","updated":"2025-01-07T03:50:11Z","published":"2025-01-07T03:50:11Z","title":"An Empirical Study of Accuracy-Robustness Tradeoff and Training\n Efficiency in Self-Supervised Learning","summary":" Self-supervised learning (SSL) has significantly advanced image\nrepresentation learning, yet efficiency challenges persist, particularly with\nadversarial training. Many SSL methods require extensive epochs to achieve\nconvergence, a demand further amplified in adversarial settings. To address\nthis inefficiency, we revisit the robust EMP-SSL framework, emphasizing the\nimportance of increasing the number of crops per image to accelerate learning.\nUnlike traditional contrastive learning, robust EMP-SSL leverages multi-crop\nsampling, integrates an invariance term and regularization, and reduces\ntraining epochs, enhancing time efficiency. Evaluated with both standard linear\nclassifiers and multi-patch embedding aggregation, robust EMP-SSL provides new\ninsights into SSL evaluation strategies.\n Our results show that robust crop-based EMP-SSL not only accelerates\nconvergence but also achieves a superior balance between clean accuracy and\nadversarial robustness, outperforming multi-crop embedding aggregation.\nAdditionally, we extend this approach with free adversarial training in\nMulti-Crop SSL, introducing the Cost-Free Adversarial Multi-Crop\nSelf-Supervised Learning (CF-AMC-SSL) method. CF-AMC-SSL demonstrates the\neffectiveness of free adversarial training in reducing training time while\nsimultaneously improving clean accuracy and adversarial robustness. 
These\nfindings underscore the potential of CF-AMC-SSL for practical SSL applications.\nOur code is publicly available at https://github.com/softsys4ai/CF-AMC-SSL.\n","authors":["Fatemeh Ghofrani","Pooyan Jamshidi"],"pdf_url":"https://arxiv.org/pdf/2501.03507v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.13656v2","updated":"2025-01-07T03:37:12Z","published":"2024-08-24T19:14:02Z","title":"Localize-and-Stitch: Efficient Model Merging via Sparse Task Arithmetic","summary":" Model merging offers an effective strategy to combine the strengths of\nmultiple finetuned models into a unified model that preserves the specialized\ncapabilities of each. Existing methods merge models in a global manner,\nperforming arithmetic operations across all model parameters. However, such\nglobal merging often leads to task interference, degrading the performance of\nthe merged model. In this work, we introduce Localize-and-Stitch, a novel\napproach that merges models in a localized way. Our algorithm works in two\nsteps: i) Localization: identify tiny ($1\\%$ of the total parameters) localized\nregions in the finetuned models containing essential skills for the downstream\ntasks, and ii) Stitching: reintegrate only these essential regions back into\nthe pretrained model for task synergy. We demonstrate that our approach\neffectively locates sparse regions responsible for finetuned performance, and\nthe localized regions could be treated as compact and interpretable\nrepresentations of the finetuned models (tasks). Empirically, we evaluate our\nmethod on various vision and language benchmarks, showing that it outperforms\nexisting model merging methods under different data availability scenarios.\nBeyond strong empirical performance, our algorithm also facilitates model\ncompression and preserves pretrained knowledge, enabling flexible and continual\nskill composition from multiple finetuned models with minimal storage and\ncomputational overhead. 
Our code is available at\nhttps://github.com/uiuctml/Localize-and-Stitch.\n","authors":["Yifei He","Yuzheng Hu","Yong Lin","Tong Zhang","Han Zhao"],"pdf_url":"https://arxiv.org/pdf/2408.13656v2.pdf","comment":"TMLR camera-ready version"},{"id":"http://arxiv.org/abs/2501.03495v1","updated":"2025-01-07T03:33:22Z","published":"2025-01-07T03:33:22Z","title":"Textualize Visual Prompt for Image Editing via Diffusion Bridge","summary":" Visual prompt, a pair of before-and-after edited images, can convey\nindescribable imagery transformations and prosper in image editing. However,\ncurrent visual prompt methods rely on a pretrained text-guided image-to-image\ngenerative model that requires a triplet of text, before, and after images for\nretraining over a text-to-image model. Such crafting triplets and retraining\nprocesses limit the scalability and generalization of editing. In this paper,\nwe present a framework based on any single text-to-image model without reliance\non the explicit image-to-image model thus enhancing the generalizability and\nscalability. Specifically, by leveraging the probability-flow ordinary\nequation, we construct a diffusion bridge to transfer the distribution between\nbefore-and-after images under the text guidance. By optimizing the text via the\nbridge, the framework adaptively textualizes the editing transformation\nconveyed by visual prompts into text embeddings without other models.\nMeanwhile, we introduce differential attention control during text\noptimization, which disentangles the text embedding from the invariance of the\nbefore-and-after images and makes it solely capture the delicate transformation\nand generalize to edit various images. 
Experiments on real images validate\ncompetitive results on the generalization, contextual coherence, and high\nfidelity for delicate editing with just one image pair as the visual prompt.\n","authors":["Pengcheng Xu","Qingnan Fan","Fei Kou","Shuai Qin","Hong Gu","Ruoyu Zhao","Charles Ling","Boyu Wang"],"pdf_url":"https://arxiv.org/pdf/2501.03495v1.pdf","comment":"AAAI 2025"},{"id":"http://arxiv.org/abs/2501.02024v2","updated":"2025-01-07T03:29:43Z","published":"2025-01-02T20:47:04Z","title":"Model Checking in Medical Imaging for Tumor Detection and Segmentation","summary":" Recent advancements in model checking have demonstrated significant potential\nacross diverse applications, particularly in signal and image analysis. Medical\nimaging stands out as a critical domain where model checking can be effectively\napplied to design and evaluate robust frameworks. These frameworks facilitate\nautomatic and semi-automatic delineation of regions of interest within images,\naiding in accurate segmentation. This paper provides a comprehensive analysis\nof recent works leveraging spatial logic to develop operators and tools for\nidentifying regions of interest, including tumorous and non-tumorous areas.\nAdditionally, we examine the challenges inherent to spatial model-checking\ntechniques, such as variability in ground truth data and the need for\nstreamlined procedures suitable for routine clinical practice.\n","authors":["Elhoucine Elfatimi","Lahcen El fatimi"],"pdf_url":"https://arxiv.org/pdf/2501.02024v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03492v1","updated":"2025-01-07T03:23:28Z","published":"2025-01-07T03:23:28Z","title":"Multi-Source Urban Traffic Flow Forecasting with Drone and Loop Detector\n Data","summary":" Traffic forecasting is a fundamental task in transportation research, however\nthe scope of current research has mainly focused on a single data modality of\nloop detectors. 
Recently, the advances in Artificial Intelligence and drone\ntechnologies have made possible novel solutions for efficient, accurate and\nflexible aerial observations of urban traffic. As a promising traffic\nmonitoring approach, drone-captured data can create an accurate multi-sensor\nmobility observatory for large-scale urban networks, when combined with\nexisting infrastructure. Therefore, this paper investigates the problem of\nmulti-source traffic speed prediction, simultaneously using drone and loop\ndetector data. A simple yet effective graph-based model HiMSNet is proposed to\nintegrate multiple data modalities and learn spatio-temporal correlations.\nDetailed analysis shows that predicting accurate segment-level speed is more\nchallenging than the regional speed, especially under high-demand scenarios\nwith heavier congestions and varying traffic dynamics. Utilizing both drone and\nloop detector data, the prediction accuracy can be improved compared to\nsingle-modality cases, when the sensors have lower coverages and are subject to\nnoise. Our simulation study based on vehicle trajectories in a real urban road\nnetwork has highlighted the added value of integrating drones in traffic\nforecasting and monitoring.\n","authors":["Weijiang Xiong","Robert Fonod","Alexandre Alahi","Nikolas Geroliminis"],"pdf_url":"https://arxiv.org/pdf/2501.03492v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.11397v2","updated":"2025-01-07T03:17:48Z","published":"2024-06-17T10:33:00Z","title":"DistPred: A Distribution-Free Probabilistic Inference Method for\n Regression and Forecasting","summary":" Traditional regression and prediction tasks often only provide deterministic\npoint estimates. To estimate the distribution or uncertainty of the response\nvariable, traditional methods either assume that the posterior distribution of\nsamples follows a Gaussian process or require thousands of forward passes for\nsample generation. 
We propose a novel approach called DistPred for regression\nand forecasting tasks, which overcomes the limitations of existing methods\nwhile remaining simple and powerful. Specifically, we transform proper scoring\nrules that measure the discrepancy between the predicted distribution and the\ntarget distribution into a differentiable discrete form and use it as a loss\nfunction to train the model end-to-end. This allows the model to draw\nnumerous samples in a single forward pass to estimate the potential\ndistribution of the response variable. We have compared our method with several\nexisting approaches on multiple datasets and achieved state-of-the-art\nperformance. Additionally, our method significantly improves computational\nefficiency. For example, compared to state-of-the-art models, DistPred has a\n180x faster inference speed. Experimental results can be reproduced through\nhttps://github.com/Anoise/DistPred.\n","authors":["Daojun Liang","Haixia Zhang","Dongfeng Yuan"],"pdf_url":"https://arxiv.org/pdf/2406.11397v2.pdf","comment":"Published at KDD 2025"},{"id":"http://arxiv.org/abs/2501.03489v1","updated":"2025-01-07T03:17:47Z","published":"2025-01-07T03:17:47Z","title":"Entropy-Guided Attention for Private LLMs","summary":" The pervasiveness of proprietary language models has raised critical privacy\nconcerns, necessitating advancements in private inference (PI), where\ncomputations are performed directly on encrypted data without revealing users'\nsensitive information. While PI offers a promising solution, its practical\ndeployment is hindered by substantial communication and latency overheads,\nprimarily stemming from nonlinear operations. 
To address this, we introduce an\ninformation-theoretic framework to characterize the role of nonlinearities in\ndecoder-only language models, laying a principled foundation for optimizing\ntransformer architectures tailored to the demands of PI.\n By leveraging Shannon's entropy as a quantitative measure, we uncover the\npreviously unexplored dual significance of nonlinearities: beyond ensuring\ntraining stability, they are crucial for maintaining attention head diversity.\nSpecifically, we find that their removal triggers two critical failure modes:\n{\em entropy collapse} in deeper layers that destabilizes training, and {\em\nentropic overload} in earlier layers that leads to under-utilization of\nMulti-Head Attention's (MHA) representational capacity.\n We propose an entropy-guided attention mechanism paired with a novel entropy\nregularization technique to mitigate entropic overload. Additionally, we\nexplore PI-friendly alternatives to layer normalization for preventing entropy\ncollapse and stabilizing the training of LLMs with reduced nonlinearities. Our\nstudy bridges the gap between information theory and architectural design,\nestablishing entropy dynamics as a principled guide for developing efficient PI\narchitectures. The code and implementation are available at\n\href{https://github.com/Nandan91/entropy-guided-attention-llm}{entropy-guided-llm}.\n","authors":["Nandan Kumar Jha","Brandon Reagen"],"pdf_url":"https://arxiv.org/pdf/2501.03489v1.pdf","comment":"The 6th AAAI Workshop on Privacy-Preserving Artificial Intelligence\n (PPAI), 2025. 
arXiv admin note: substantial text overlap with\n arXiv:2410.13060"},{"id":"http://arxiv.org/abs/2412.19391v2","updated":"2025-01-07T03:15:49Z","published":"2024-12-27T00:36:40Z","title":"An In-Depth Analysis of Adversarial Discriminative Domain Adaptation for\n Digit Classification","summary":" Domain adaptation is an active area of research driven by the growing demand\nfor robust machine learning models that perform well on real-world data.\nAdversarial learning for deep neural networks (DNNs) has emerged as a promising\napproach to improving generalization ability, particularly for image\nclassification. In this paper, we implement a specific adversarial learning\ntechnique known as Adversarial Discriminative Domain Adaptation (ADDA) and\nreplicate digit classification experiments from the original ADDA paper. We\nextend their findings by examining a broader range of domain shifts and provide\na detailed analysis of in-domain classification accuracy post-ADDA. Our results\ndemonstrate that ADDA significantly improves accuracy across certain domain\nshifts with minimal impact on in-domain performance. Furthermore, we provide\nqualitative analysis and propose potential explanations for ADDA's limitations\nin less successful domain shifts. Code is at\nhttps://github.com/eugenechoi2004/COS429_FINAL .\n","authors":["Eugene Choi","Julian Rodriguez","Edmund Young"],"pdf_url":"https://arxiv.org/pdf/2412.19391v2.pdf","comment":"Replacement: Updated methodology section to include grayscale\n preprocessing of SVHN data"},{"id":"http://arxiv.org/abs/2501.03486v1","updated":"2025-01-07T03:14:39Z","published":"2025-01-07T03:14:39Z","title":"Align-Pro: A Principled Approach to Prompt Optimization for LLM\n Alignment","summary":" The alignment of large language models (LLMs) with human values is critical\nas these models become increasingly integrated into various societal and\ndecision-making processes. 
Traditional methods, such as reinforcement learning\nfrom human feedback (RLHF), achieve alignment by fine-tuning model parameters,\nbut these approaches are often computationally expensive and impractical when\nmodels are frozen or inaccessible for parameter modification. In contrast,\nprompt optimization is a viable alternative to RLHF for LLM alignment. While\nthe existing literature has shown empirical promise of prompt optimization, its\ntheoretical underpinning remains under-explored. We address this gap by\nformulating prompt optimization as an optimization problem and try to provide\ntheoretical insights into the optimality of such a framework. To analyze the\nperformance of the prompt optimization, we study theoretical suboptimality\nbounds and provide insights in terms of how prompt optimization depends upon\nthe given prompter and target model. We also provide empirical validation\nthrough experiments on various datasets, demonstrating that prompt optimization\ncan effectively align LLMs, even when parameter fine-tuning is not feasible.\n","authors":["Prashant Trivedi","Souradip Chakraborty","Avinash Reddy","Vaneet Aggarwal","Amrit Singh Bedi","George K. Atia"],"pdf_url":"https://arxiv.org/pdf/2501.03486v1.pdf","comment":"27 pages, Accepted in AAAI 2025"},{"id":"http://arxiv.org/abs/2412.13516v2","updated":"2025-01-07T03:08:39Z","published":"2024-12-18T05:33:16Z","title":"Learning Causal Transition Matrix for Instance-dependent Label Noise","summary":" Noisy labels are both inevitable and problematic in machine learning methods,\nas they negatively impact models' generalization ability by causing\noverfitting. In the context of learning with noise, the transition matrix plays\na crucial role in the design of statistically consistent algorithms. However,\nthe transition matrix is often considered unidentifiable. 
One strand of methods\ntypically addresses this problem by assuming that the transition matrix is\ninstance-independent; that is, the probability of mislabeling a particular\ninstance is not influenced by its characteristics or attributes. This\nassumption is clearly invalid in complex real-world scenarios. To better\nunderstand the transition relationship and relax this assumption, we propose to\nstudy the data generation process of noisy labels from a causal perspective. We\ndiscover that an unobservable latent variable can affect either the instance\nitself, the label annotation procedure, or both, which complicates the\nidentification of the transition matrix. To address various scenarios, we have\nunified these observations within a new causal graph. In this graph, the input\ninstance is divided into a noise-resistant component and a noise-sensitive\ncomponent based on whether they are affected by the latent variable. These two\ncomponents contribute to identifying the ``causal transition matrix'', which\napproximates the true transition matrix with theoretical guarantee. In line\nwith this, we have designed a novel training framework that explicitly models\nthis causal relationship and, as a result, achieves a more accurate model for\ninferring the clean label.\n","authors":["Jiahui Li","Tai-Wei Chang","Kun Kuang","Ximing Li","Long Chen","Jun Zhou"],"pdf_url":"https://arxiv.org/pdf/2412.13516v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.16496v4","updated":"2025-01-07T03:08:05Z","published":"2023-11-27T08:49:26Z","title":"Can Out-of-Domain data help to Learn Domain-Specific Prompts for\n Multimodal Misinformation Detection?","summary":" Spread of fake news using out-of-context images and captions has become\nwidespread in this era of information overload. Since fake news can belong to\ndifferent domains like politics, sports, etc. 
with their unique\ncharacteristics, inference on a test image-caption pair is contingent on how\nwell the model has been trained on similar data. Since training individual\nmodels for each domain is not practical, we propose a novel framework termed\nDPOD (Domain-specific Prompt tuning using Out-of-domain data), which can\nexploit out-of-domain data during training to improve fake news detection of\nall desired domains simultaneously. First, to compute generalizable features,\nwe modify the Vision-Language Model CLIP to extract features that help to\nalign the representations of the images and corresponding captions of both the\nin-domain and out-of-domain data in a label-aware manner. Further, we propose a\ndomain-specific prompt learning technique which leverages training samples of\nall the available domains based on the extent they can be useful to the desired\ndomain. Extensive experiments on the large-scale NewsCLIPpings and VERITE\nbenchmarks demonstrate that DPOD achieves state-of-the-art performance for this\nchallenging task. Code: https://github.com/scviab/DPOD.\n","authors":["Amartya Bhattacharya","Debarshi Brahma","Suraj Nagaje Mahadev","Anmol Asati","Vikas Verma","Soma Biswas"],"pdf_url":"https://arxiv.org/pdf/2311.16496v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.03334v3","updated":"2025-01-07T03:01:49Z","published":"2024-10-23T19:56:57Z","title":"Neural Network Prediction of Strong Lensing Systems with Domain\n Adaptation and Uncertainty Quantification","summary":" Modeling strong gravitational lenses is computationally expensive for the\ncomplex data from modern and next-generation cosmic surveys. Deep learning has\nemerged as a promising approach for finding lenses and predicting lensing\nparameters, such as the Einstein radius. Mean-variance Estimators (MVEs) are a\ncommon approach for obtaining aleatoric (data) uncertainties from a neural\nnetwork prediction. 
However, neural networks have not been demonstrated to\nperform well on out-of-domain target data - e.g., when trained on\nsimulated data and applied to real, observational data. In this work, we\nperform the first study of the efficacy of MVEs in combination with\nunsupervised domain adaptation (UDA) on strong lensing data. The source domain\ndata is noiseless, and the target domain data has noise mimicking modern\ncosmology surveys. We find that adding UDA to MVE increases the accuracy on the\ntarget data by a factor of about two over an MVE model without UDA. Including\nUDA also permits much better-calibrated aleatoric uncertainty predictions.\nAdvancements in this approach may enable future applications of MVE models to\nreal observational data.\n","authors":["Shrihan Agarwal","Aleksandra Ćiprijanović","Brian D. Nord"],"pdf_url":"https://arxiv.org/pdf/2411.03334v3.pdf","comment":"Accepted to the Machine Learning for Physical Sciences workshop at\n NeurIPS 2024; 24 pages, 2 figures, 4 tables"},{"id":"http://arxiv.org/abs/2501.02411v2","updated":"2025-01-07T02:46:47Z","published":"2025-01-05T01:25:37Z","title":"Transfer learning via Regularized Linear Discriminant Analysis","summary":" Linear discriminant analysis is a widely used method for classification.\nHowever, the high dimensionality of predictors combined with small sample sizes\noften results in large classification errors. To address this challenge, it is\ncrucial to leverage data from related source models to enhance the\nclassification performance of a target model. We propose to address this\nproblem in the framework of transfer learning.\n In this paper, we present novel transfer learning methods via regularized\nrandom-effects linear discriminant analysis, where the discriminant direction\nis estimated as a weighted combination of ridge estimates obtained from both\nthe target and source models. 
Multiple strategies for determining these weights\nare introduced and evaluated, including one that minimizes the estimation risk\nof the discriminant vector and another that minimizes the classification error.\nUtilizing results from random matrix theory, we explicitly derive the\nasymptotic values of these weights and the associated classification error\nrates in the high-dimensional setting, where $p/n \\rightarrow \\gamma$, with $p$\nrepresenting the predictor dimension and $n$ the sample size. We also provide\ngeometric interpretations of various weights and a guidance on which weights to\nchoose. Extensive numerical studies, including simulations and analysis of\nproteomics-based 10-year cardiovascular disease risk classification,\ndemonstrate the effectiveness of the proposed approach.\n","authors":["Hongzhe Zhang","Arnab Auddy","Hongzhe Lee"],"pdf_url":"https://arxiv.org/pdf/2501.02411v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03477v1","updated":"2025-01-07T02:35:41Z","published":"2025-01-07T02:35:41Z","title":"A study on performance limitations in Federated Learning","summary":" Increasing privacy concerns and unrestricted access to data lead to the\ndevelopment of a novel machine learning paradigm called Federated Learning\n(FL). FL borrows many of the ideas from distributed machine learning, however,\nthe challenges associated with federated learning makes it an interesting\nengineering problem since the models are trained on edge devices. It was\nintroduced in 2016 by Google, and since then active research is being carried\nout in different areas within FL such as federated optimization algorithms,\nmodel and update compression, differential privacy, robustness, and attacks,\nfederated GANs and privacy preserved personalization. 
There are many open\nchallenges in the development of such federated machine learning systems and\nthis project will be focusing on the communication bottleneck and data Non\nIID-ness, and its effect on the performance of the models. These issues are\ncharacterized on a baseline model, model performance is evaluated, and\ndiscussions are made to overcome these issues.\n","authors":["Karthik Mohan"],"pdf_url":"https://arxiv.org/pdf/2501.03477v1.pdf","comment":"archive 2021 work"},{"id":"http://arxiv.org/abs/2501.03475v1","updated":"2025-01-07T02:33:25Z","published":"2025-01-07T02:33:25Z","title":"Reading with Intent -- Neutralizing Intent","summary":" Queries to large language models (LLMs) can be divided into two parts: the\ninstruction/question and the accompanying context. The context for\nretrieval-augmented generation (RAG) systems in most benchmarks comes from\nWikipedia or Wikipedia-like texts which are written in a neutral and factual\ntone. However, when RAG systems retrieve internet-based content, they encounter\ntext with diverse tones and linguistic styles, introducing challenges for\ndownstream tasks. The Reading with Intent task addresses this issue by\nevaluating how varying tones in context passages affect model performance.\nBuilding on prior work that focused on sarcasm, we extend this paradigm by\nconstructing a dataset where context passages are transformed to $11$ distinct\nemotions using a better synthetic data generation approach. Using this dataset,\nwe train an emotion translation model to systematically adapt passages to\nspecified emotional tones. The human evaluation shows that the LLM fine-tuned\nto become the emotion-translator benefited from the synthetically generated\ndata. Finally, the emotion-translator is used in the Reading with Intent task\nto transform the passages to a neutral tone. 
By neutralizing the passages, it\nmitigates the challenges posed by sarcastic passages and improves overall\nresults on this task by about $3\\%$.\n","authors":["Benjamin Reichman","Adar Avsian","Larry Heck"],"pdf_url":"https://arxiv.org/pdf/2501.03475v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03471v1","updated":"2025-01-07T02:15:58Z","published":"2025-01-07T02:15:58Z","title":"Hyperbolic Binary Neural Network","summary":" Binary Neural Network (BNN) converts full-precision weights and activations\ninto their extreme 1-bit counterparts, making it particularly suitable for\ndeployment on lightweight mobile devices. While binary neural networks are\ntypically formulated as a constrained optimization problem and optimized in the\nbinarized space, general neural networks are formulated as an unconstrained\noptimization problem and optimized in the continuous space. This paper\nintroduces the Hyperbolic Binary Neural Network (HBNN) by leveraging the\nframework of hyperbolic geometry to optimize the constrained problem.\nSpecifically, we transform the constrained problem in hyperbolic space into an\nunconstrained one in Euclidean space using the Riemannian exponential map. On\nthe other hand, we also propose the Exponential Parametrization Cluster (EPC)\nmethod, which, compared to the Riemannian exponential map, shrinks the segment\ndomain based on a diffeomorphism. This approach increases the probability of\nweight flips, thereby maximizing the information gain in BNNs. 
Experimental\nresults on CIFAR10, CIFAR100, and ImageNet classification datasets with\nVGGsmall, ResNet18, and ResNet34 models illustrate the superior performance of\nour HBNN over state-of-the-art methods.\n","authors":["Jun Chen","Jingyang Xiang","Tianxin Huang","Xiangrui Zhao","Yong Liu"],"pdf_url":"https://arxiv.org/pdf/2501.03471v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.07708v2","updated":"2025-01-07T02:15:42Z","published":"2024-09-12T02:25:04Z","title":"Dataset-Free Weight-Initialization on Restricted Boltzmann Machine","summary":" In feed-forward neural networks, dataset-free weight-initialization methods\nsuch as LeCun, Xavier (or Glorot), and He initializations have been developed.\nThese methods randomly determine the initial values of weight parameters based\non specific distributions (e.g., Gaussian or uniform distributions) without\nusing training datasets. To the best of the authors' knowledge, such a\ndataset-free weight-initialization method is yet to be developed for restricted\nBoltzmann machines (RBMs), which are probabilistic neural networks consisting\nof two layers. In this study, we derive a dataset-free weight-initialization\nmethod for Bernoulli--Bernoulli RBMs based on statistical mechanical analysis.\nIn the proposed weight-initialization method, the weight parameters are drawn\nfrom a Gaussian distribution with zero mean. The standard deviation of the\nGaussian distribution is optimized based on our hypothesis that a standard\ndeviation providing a larger layer correlation (LC) between the two layers\nimproves the learning efficiency. The expression of the LC is derived based on\na statistical mechanical analysis. The optimal value of the standard deviation\ncorresponds to the maximum point of the LC. 
The proposed weight-initialization\nmethod is identical to Xavier initialization in a specific case (i.e., when the\nsizes of the two layers are the same, the random variables of the layers are\n$\\{-1,1\\}$-binary, and all bias parameters are zero). The validity of the\nproposed weight-initialization method is demonstrated in numerical experiments\nusing a toy and real-world datasets.\n","authors":["Muneki Yasuda","Ryosuke Maeno","Chako Takahashi"],"pdf_url":"https://arxiv.org/pdf/2409.07708v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.07108v3","updated":"2025-01-07T02:14:56Z","published":"2024-02-11T05:35:50Z","title":"Decoupling Learning and Decision-Making: Breaking the\n $\\mathcal{O}(\\sqrt{T})$ Barrier in Online Resource Allocation with\n First-Order Methods","summary":" Online linear programming plays an important role in both revenue management\nand resource allocation, and recent research has focused on developing\nefficient first-order online learning algorithms. Despite the empirical success\nof first-order methods, they typically achieve a regret no better than\n$\\mathcal{O}(\\sqrt{T})$, which is suboptimal compared to the $\\mathcal{O}(\\log\nT)$ bound guaranteed by the state-of-the-art linear programming (LP)-based\nonline algorithms. This paper establishes several important facts about online\nlinear programming, which unveils the challenge for first-order-method-based\nonline algorithms to achieve beyond $\\mathcal{O}(\\sqrt{T})$ regret. To address\nthe challenge, we introduce a new algorithmic framework that decouples learning\nfrom decision-making. 
For the first time, we show that first-order methods can\nattain regret $\mathcal{O}(T^{1/3})$ with this new framework.\n","authors":["Wenzhi Gao","Chunlin Sun","Chenyu Xue","Dongdong Ge","Yinyu Ye"],"pdf_url":"https://arxiv.org/pdf/2402.07108v3.pdf","comment":"Merged into arXiv:2501.02761"},{"id":"http://arxiv.org/abs/2412.19139v2","updated":"2025-01-07T01:50:11Z","published":"2024-12-26T09:51:05Z","title":"PlanLLM: Video Procedure Planning with Refinable Large Language Models","summary":" Video procedure planning, i.e., planning a sequence of action steps given the\nvideo frames of start and goal states, is an essential ability for embodied AI.\nRecent works utilize Large Language Models (LLMs) to generate enriched action\nstep description texts to guide action step decoding. Although LLMs are\nintroduced, these methods decode the action steps into a closed-set of one-hot\nvectors, limiting the model's capability of generalizing to new steps or tasks.\nAdditionally, fixed action step descriptions based on world-level commonsense\nmay contain noise in specific instances of visual states. In this paper, we\npropose PlanLLM, a cross-modal joint learning framework with LLMs for video\nprocedure planning. We propose an LLM-Enhanced Planning module which fully uses\nthe generalization ability of LLMs to produce free-form planning output and to\nenhance action step decoding. We also propose a Mutual Information Maximization\nmodule to connect world-level commonsense of step descriptions and\nsample-specific information of visual states, enabling LLMs to employ the\nreasoning ability to generate step sequences. With the assistance of LLMs, our\nmethod can handle both closed-set and open-vocabulary procedure planning tasks. 
Our\nPlanLLM achieves superior performance on three benchmarks, demonstrating the\neffectiveness of our designs.\n","authors":["Dejie Yang","Zijing Zhao","Yang Liu"],"pdf_url":"https://arxiv.org/pdf/2412.19139v2.pdf","comment":"accepted to AAAI2025"},{"id":"http://arxiv.org/abs/2410.22376v2","updated":"2025-01-07T01:41:13Z","published":"2024-10-29T07:43:39Z","title":"Rare-to-Frequent: Unlocking Compositional Generation Power of Diffusion\n Models on Rare Concepts with LLM Guidance","summary":" State-of-the-art text-to-image (T2I) diffusion models often struggle to\ngenerate rare compositions of concepts, e.g., objects with unusual attributes.\nIn this paper, we show that the compositional generation power of diffusion\nmodels on such rare concepts can be significantly enhanced by Large\nLanguage Model (LLM) guidance. We start with empirical and theoretical\nanalysis, demonstrating that exposing frequent concepts relevant to the target\nrare concepts during the diffusion sampling process yields more accurate\nconcept composition. Based on this, we propose a training-free approach, R2F,\nthat plans and executes the overall rare-to-frequent concept guidance\nthroughout the diffusion inference by leveraging the abundant semantic\nknowledge in LLMs. Our framework is flexible across any pre-trained diffusion\nmodels and LLMs, and can be seamlessly integrated with region-guided\ndiffusion approaches. In extensive experiments on three datasets, including our\nnewly proposed benchmark RareBench, which contains various prompts with rare\ncompositions of concepts, R2F significantly surpasses existing models including\nSD3.0 and FLUX by up to 28.1%p in T2I alignment. 
Code is available at\nhttps://github.com/krafton-ai/Rare-to-Frequent.\n","authors":["Dongmin Park","Sebin Kim","Taehong Moon","Minkyu Kim","Kangwook Lee","Jaewoong Cho"],"pdf_url":"https://arxiv.org/pdf/2410.22376v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03461v1","updated":"2025-01-07T01:35:56Z","published":"2025-01-07T01:35:56Z","title":"Radar Signal Recognition through Self-Supervised Learning and Domain\n Adaptation","summary":" Automatic radar signal recognition (RSR) plays a pivotal role in electronic\nwarfare (EW), as accurately classifying radar signals is critical for informing\ndecision-making processes. Recent advances in deep learning have shown\nsignificant potential in improving RSR performance in domains with ample\nannotated data. However, these methods fall short in EW scenarios where\nannotated RF data are scarce or impractical to obtain. To address these\nchallenges, we introduce a self-supervised learning (SSL) method which utilises\nmasked signal modelling and RF domain adaptation to enhance RSR performance in\nenvironments with limited RF samples and labels. Specifically, we investigate\npre-training masked autoencoders (MAE) on baseband in-phase and quadrature\n(I/Q) signals from various RF domains and subsequently transfer the learned\nrepresentation to the radar domain, where annotated data are limited. Empirical\nresults show that our lightweight self-supervised ResNet model with domain\nadaptation achieves up to a 17.5\\% improvement in 1-shot classification\naccuracy when pre-trained on in-domain signals (i.e., radar signals) and up to\na 16.31\\% improvement when pre-trained on out-of-domain signals (i.e.,\ncommunication signals), compared to its baseline without SSL. 
We also provide reference\nresults for several MAE designs and pre-training strategies, establishing a new\nbenchmark for few-shot radar signal classification.\n","authors":["Zi Huang","Akila Pemasiri","Simon Denman","Clinton Fookes","Terrence Martin"],"pdf_url":"https://arxiv.org/pdf/2501.03461v1.pdf","comment":"5 pages, 9 figures"},{"id":"http://arxiv.org/abs/2501.03451v1","updated":"2025-01-07T00:43:18Z","published":"2025-01-07T00:43:18Z","title":"Structure-Preference Enabled Graph Embedding Generation under\n Differential Privacy","summary":" Graph embedding generation techniques aim to learn low-dimensional vectors\nfor each node in a graph and have recently gained increasing research\nattention. Publishing low-dimensional node vectors enables various graph\nanalysis tasks, such as structural equivalence and link prediction. Yet,\nimproper publication opens a backdoor to malicious attackers, who can infer\nsensitive information of individuals from the low-dimensional node vectors.\nExisting methods tackle this issue by developing deep graph learning models\nwith differential privacy (DP). However, they often suffer from large noise\ninjections and cannot provide structural preferences consistent with mining\nobjectives. Recently, skip-gram based graph embedding generation techniques\nhave been widely used due to their ability to extract customizable structures.\nBased on skip-gram, we present SE-PrivGEmb, a structure-preference enabled\ngraph embedding generation method under DP. For arbitrary structure\npreferences, we design a unified noise tolerance mechanism via perturbing\nnon-zero vectors. This mechanism mitigates utility degradation caused by high\nsensitivity. By carefully designing negative sampling probabilities in\nskip-gram, we theoretically demonstrate that skip-gram can preserve arbitrary\nproximities, which quantify structural features in graphs. 
Extensive experiments show that\nour method outperforms existing state-of-the-art methods under structural\nequivalence and link prediction tasks.\n","authors":["Sen Zhang","Qingqing Ye","Haibo Hu"],"pdf_url":"https://arxiv.org/pdf/2501.03451v1.pdf","comment":"Accepted by ICDE 25"},{"id":"http://arxiv.org/abs/2501.03448v1","updated":"2025-01-07T00:30:31Z","published":"2025-01-07T00:30:31Z","title":"Optimizing Value of Learning in Task-Oriented Federated Meta-Learning\n Systems","summary":" Federated Learning (FL) has gained significant attention in recent years due\nto its distributed nature and privacy preserving benefits. However, a key\nlimitation of conventional FL is that it learns and distributes a common global\nmodel to all participants, which fails to provide customized solutions for\ndiverse task requirements. Federated meta-learning (FML) offers a promising\nsolution to this issue by enabling devices to finetune local models after\nreceiving a shared meta-model from the server. In this paper, we propose a\ntask-oriented FML framework over non-orthogonal multiple access (NOMA)\nnetworks. A novel metric, termed value of learning (VoL), is introduced to\nassess the individual training needs across devices. Moreover, a task-level\nweight (TLW) metric is defined based on task requirements and fairness\nconsiderations, guiding the prioritization of edge devices during FML training.\nThe formulated problem, to maximize the sum of TLW-based VoL across devices,\nforms a non-convex mixed-integer non-linear programming (MINLP) challenge,\naddressed here using a parameterized deep Q-network (PDQN) algorithm to handle\nboth discrete and continuous variables. 
Simulation results demonstrate that our\napproach significantly outperforms baseline schemes, underscoring the\nadvantages of the proposed framework.\n","authors":["Bibo Wu","Fang Fang","Xianbin Wang"],"pdf_url":"https://arxiv.org/pdf/2501.03448v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.11200v2","updated":"2025-01-07T00:23:43Z","published":"2024-11-17T23:30:01Z","title":"Countering Backdoor Attacks in Image Recognition: A Survey and\n Evaluation of Mitigation Strategies","summary":" The widespread adoption of deep learning across various industries has\nintroduced substantial challenges, particularly in terms of model\nexplainability and security. The inherent complexity of deep learning models,\nwhile contributing to their effectiveness, also renders them susceptible to\nadversarial attacks. Among these, backdoor attacks are especially concerning,\nas they involve surreptitiously embedding specific triggers within training\ndata, causing the model to exhibit aberrant behavior when presented with input\ncontaining the triggers. Such attacks often exploit vulnerabilities in\noutsourced processes, compromising model integrity without affecting\nperformance on clean (trigger-free) input data. In this paper, we present a\ncomprehensive review of existing mitigation strategies designed to counter\nbackdoor attacks in image recognition. We provide an in-depth analysis of the\ntheoretical foundations, practical efficacy, and limitations of these\napproaches. In addition, we conduct an extensive benchmarking of sixteen\nstate-of-the-art approaches against eight distinct backdoor attacks, utilizing\nthree datasets, four model architectures, and three poisoning ratios. Our\nresults, derived from 122,236 individual experiments, indicate that while many\napproaches provide some level of protection, their performance can vary\nconsiderably. 
Furthermore, when compared to two seminal approaches, most newer\napproaches do not demonstrate substantial improvements in overall performance\nor consistency across diverse settings. Drawing from these findings, we propose\npotential directions for developing more effective and generalizable defensive\nmechanisms in the future.\n","authors":["Kealan Dunnett","Reza Arablouei","Dimity Miller","Volkan Dedeoglu","Raja Jurdak"],"pdf_url":"https://arxiv.org/pdf/2411.11200v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03445v1","updated":"2025-01-07T00:15:04Z","published":"2025-01-07T00:15:04Z","title":"Physics-Constrained Generative Artificial Intelligence for Rapid Takeoff\n Trajectory Design","summary":" To aid urban air mobility (UAM), electric vertical takeoff and landing\n(eVTOL) aircraft are being developed. Conventional multidisciplinary analysis\nand optimization (MDAO) can be expensive, while surrogate-based optimization\ncan struggle with challenging physical constraints. This work proposes\nphysics-constrained generative adversarial networks (physicsGAN) to\nintelligently parameterize the takeoff control profiles of an eVTOL aircraft\nand to transform the original design space to a feasible space. Specifically,\nthe transformed feasible space refers to a space where all designs directly\nsatisfy all design constraints. The physicsGAN-enabled surrogate-based takeoff\ntrajectory design framework was demonstrated on the Airbus A3 Vahana. The\nphysicsGAN generated only feasible control profiles of power and wing angle in\nthe feasible space, with around 98.9% of designs satisfying all constraints. The\nproposed design framework obtained 99.6% accuracy compared with\nsimulation-based optimal design and took only 2.2 seconds, reducing the\ncomputational time by around 200 times. 
Meanwhile, data-driven GAN-enabled\nsurrogate-based optimization took 21.9 seconds using a derivative-free\noptimizer, which was around an order of magnitude slower than the proposed\nframework. Moreover, data-driven GAN-based optimization using\ngradient-based optimizers could not consistently find the optimal design during\nrandom trials and got stuck in an infeasible region, which is problematic in\npractice. Therefore, the proposed physicsGAN-based design framework\noutperformed data-driven GAN-based design in terms of efficiency (2.2\nseconds), optimality (99.6% accuracy), and feasibility (100% feasible).\nAccording to the literature review, this is the first physics-constrained\ngenerative artificial intelligence enabled by surrogate models.\n","authors":["Samuel Sisk","Xiaosong Du"],"pdf_url":"https://arxiv.org/pdf/2501.03445v1.pdf","comment":"Conference version with 10 pages and 7 figures"}],"Multimedia":[{"id":"http://arxiv.org/abs/2501.03939v1","updated":"2025-01-07T17:00:35Z","published":"2025-01-07T17:00:35Z","title":"Visual question answering: from early developments to recent advances --\n a survey","summary":" Visual Question Answering (VQA) is an evolving research field aimed at\nenabling machines to answer questions about visual content by integrating image\nand language processing techniques such as feature extraction, object\ndetection, text embedding, natural language understanding, and language\ngeneration. With the growth of multimodal data research, VQA has gained\nsignificant attention due to its broad applications, including interactive\neducational tools, medical image diagnosis, customer service, entertainment,\nand social media captioning. 
Additionally, VQA plays a vital role in assisting\nvisually impaired individuals by generating descriptive content from images.\nThis survey introduces a taxonomy of VQA architectures, categorizing them based\non design choices and key components to facilitate comparative analysis and\nevaluation. We review major VQA approaches, focusing on deep learning-based\nmethods, and explore the emerging field of Large Visual Language Models (LVLMs)\nthat have demonstrated success in multimodal tasks like VQA. The paper further\nexamines available datasets and evaluation metrics essential for measuring VQA\nsystem performance, followed by an exploration of real-world VQA applications.\nFinally, we highlight ongoing challenges and future directions in VQA research,\npresenting open questions and potential areas for further development. This\nsurvey serves as a comprehensive resource for researchers and practitioners\ninterested in the latest advancements and future directions of VQA.\n","authors":["Ngoc Dung Huynh","Mohamed Reda Bouadjenek","Sunil Aryal","Imran Razzak","Hakim Hacid"],"pdf_url":"https://arxiv.org/pdf/2501.03939v1.pdf","comment":"20"},{"id":"http://arxiv.org/abs/2501.03605v1","updated":"2025-01-07T08:06:35Z","published":"2025-01-07T08:06:35Z","title":"ConcealGS: Concealing Invisible Copyright Information in 3D Gaussian\n Splatting","summary":" With the rapid development of 3D reconstruction technology, the widespread\ndistribution of 3D data has become a future trend. While traditional visual\ndata (such as images and videos) and NeRF-based formats already have mature\ntechniques for copyright protection, steganographic techniques for the emerging\n3D Gaussian Splatting (3D-GS) format have yet to be fully explored. To address\nthis, we propose ConcealGS, an innovative method for embedding implicit\ninformation into 3D-GS. 
By introducing the knowledge distillation and gradient\noptimization strategy based on 3D-GS, ConcealGS overcomes the limitations of\nNeRF-based models and enhances the robustness of implicit information and the\nquality of 3D reconstruction. We evaluate ConcealGS in various potential\napplication scenarios, and experimental results have demonstrated that\nConcealGS not only successfully recovers implicit information but also has\nalmost no impact on rendering quality, providing a new approach for embedding\ninvisible and recoverable information into 3D models in the future.\n","authors":["Yifeng Yang","Hengyu Liu","Chenxin Li","Yining Sun","Wuyang Li","Yifan Liu","Yiyang Lin","Yixuan Yuan","Nanyang Ye"],"pdf_url":"https://arxiv.org/pdf/2501.03605v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19139v2","updated":"2025-01-07T01:50:11Z","published":"2024-12-26T09:51:05Z","title":"PlanLLM: Video Procedure Planning with Refinable Large Language Models","summary":" Video procedure planning, i.e., planning a sequence of action steps given the\nvideo frames of start and goal states, is an essential ability for embodied AI.\nRecent works utilize Large Language Models (LLMs) to generate enriched action\nstep description texts to guide action step decoding. Although LLMs are\nintroduced, these methods decode the action steps into a closed-set of one-hot\nvectors, limiting the model's capability of generalizing to new steps or tasks.\nAdditionally, fixed action step descriptions based on world-level commonsense\nmay contain noise in specific instances of visual states. In this paper, we\npropose PlanLLM, a cross-modal joint learning framework with LLMs for video\nprocedure planning. We propose an LLM-Enhanced Planning module which fully uses\nthe generalization ability of LLMs to produce free-form planning output and to\nenhance action step decoding. 
We also propose a Mutual Information Maximization\nmodule to connect world-level commonsense of step descriptions and\nsample-specific information of visual states, enabling LLMs to employ their\nreasoning ability to generate step sequences. With the assistance of LLMs, our\nmethod can perform both closed-set and open-vocabulary procedure planning tasks. Our\nPlanLLM achieves superior performance on three benchmarks, demonstrating the\neffectiveness of our designs.\n","authors":["Dejie Yang","Zijing Zhao","Yang Liu"],"pdf_url":"https://arxiv.org/pdf/2412.19139v2.pdf","comment":"accepted to AAAI2025"}],"Artificial Intelligence":[{"id":"http://arxiv.org/abs/2412.05313v3","updated":"2025-01-07T18:57:23Z","published":"2024-11-28T19:31:50Z","title":"λ: A Benchmark for Data-Efficiency in Long-Horizon Indoor Mobile\n Manipulation Robotics","summary":" Efficiently learning and executing long-horizon mobile manipulation (MoMa)\ntasks is crucial for advancing robotics in household and workplace settings.\nHowever, current MoMa models are data-inefficient, underscoring the need for\nimproved models, yet the realistic-sized benchmarks required to evaluate their\nefficiency do not exist. To address this, we introduce the LAMBDA\n({\\lambda}) benchmark (Long-horizon Actions for Mobile-manipulation\nBenchmarking of Directed Activities), which evaluates the data efficiency of\nmodels on language-conditioned, long-horizon, multi-room, multi-floor,\npick-and-place tasks using a dataset of manageable size that is more feasible\nto collect. The benchmark includes 571 human-collected demonstrations that\nprovide realism and diversity in simulated and real-world settings. Unlike\nplanner-generated data, these trajectories offer natural variability and\nreplay-verifiability, ensuring robust learning and evaluation. 
We benchmark\nseveral models, including learning-based models and a neuro-symbolic modular\napproach combining foundation models with task and motion planning.\nLearning-based models show suboptimal success rates, even when leveraging\npretrained weights, underscoring significant data inefficiencies. However, the\nneuro-symbolic approach performs significantly better while being more\ndata-efficient. Findings highlight the need for more data-efficient learning-based\nMoMa approaches. {\\lambda} addresses this gap by serving as a key benchmark for\nevaluating the data efficiency of those future models in handling household\nrobotics tasks.\n","authors":["Ahmed Jaafar","Shreyas Sundara Raman","Yichen Wei","Sudarshan Harithas","Sofia Juliani","Anneke Wernerfelt","Benedict Quartey","Ifrah Idrees","Jason Xinyu Liu","Stefanie Tellex"],"pdf_url":"https://arxiv.org/pdf/2412.05313v3.pdf","comment":"8 pages"},{"id":"http://arxiv.org/abs/2412.20429v3","updated":"2025-01-07T18:24:45Z","published":"2024-12-29T10:46:08Z","title":"Multi-Scenario Reasoning: Unlocking Cognitive Autonomy in Humanoid\n Robots for Multimodal Understanding","summary":" To improve the cognitive autonomy of humanoid robots, this research proposes\na multi-scenario reasoning architecture to solve the technical shortcomings of\nmulti-modal understanding in this field. It draws on a simulation-based\nexperimental design that adopts multi-modal synthesis (visual, auditory,\ntactile) and builds a simulator \"Maha\" to perform the experiment. The findings\ndemonstrate the feasibility of this architecture on multimodal data. It\nprovides reference experience for the exploration of cross-modal interaction\nstrategies for humanoid robots in dynamic environments. In addition,\nmulti-scenario reasoning brings the high-level reasoning mechanism of the\nhuman brain to humanoid robots at the cognitive level. This new concept\npromotes cross-scenario practical task transfer and semantic-driven action\nplanning. 
It heralds the future development of self-learning and autonomous\nbehavior of humanoid robots in changing scenarios.\n","authors":["Libo Wang"],"pdf_url":"https://arxiv.org/pdf/2412.20429v3.pdf","comment":"The main text is 5 pages, 2 figures, and 3 tables"},{"id":"http://arxiv.org/abs/2501.03968v1","updated":"2025-01-07T18:06:27Z","published":"2025-01-07T18:06:27Z","title":"VLM-driven Behavior Tree for Context-aware Task Planning","summary":" The use of Large Language Models (LLMs) for generating Behavior Trees (BTs)\nhas recently gained attention in the robotics community, yet remains in its\nearly stages of development. In this paper, we propose a novel framework that\nleverages Vision-Language Models (VLMs) to interactively generate and edit BTs\nthat address visual conditions, enabling context-aware robot operations in\nvisually complex environments. A key feature of our approach lies in the\nconditional control through self-prompted visual conditions. Specifically, the\nVLM generates BTs with visual condition nodes, where conditions are expressed\nas free-form text. Another VLM process integrates the text into its prompt and\nevaluates the conditions against real-world images during robot execution. We\nvalidated our framework in a real-world cafe scenario, demonstrating both its\nfeasibility and limitations.\n","authors":["Naoki Wake","Atsushi Kanehira","Jun Takamatsu","Kazuhiro Sasabuchi","Katsushi Ikeuchi"],"pdf_url":"https://arxiv.org/pdf/2501.03968v1.pdf","comment":"10 pages, 11 figures, 5 tables. Last updated on January 7th, 2024"},{"id":"http://arxiv.org/abs/2403.05300v5","updated":"2025-01-07T17:42:16Z","published":"2024-03-08T13:29:46Z","title":"Unity by Diversity: Improved Representation Learning in Multimodal VAEs","summary":" Variational Autoencoders for multimodal data hold promise for many tasks in\ndata analysis, such as representation learning, conditional generation, and\nimputation. 
Current architectures either share the encoder output, decoder\ninput, or both across modalities to learn a shared representation. Such\narchitectures impose hard constraints on the model. In this work, we show that\na better latent representation can be obtained by replacing these hard\nconstraints with a soft constraint. We propose a new mixture-of-experts prior,\nsoftly guiding each modality's latent representation towards a shared aggregate\nposterior. This approach results in a superior latent representation and allows\neach encoding to preserve information better from its uncompressed original\nfeatures. In extensive experiments on multiple benchmark datasets and two\nchallenging real-world datasets, we show improved learned latent\nrepresentations and imputation of missing data modalities compared to existing\nmethods.\n","authors":["Thomas M. Sutter","Yang Meng","Andrea Agostini","Daphné Chopard","Norbert Fortin","Julia E. Vogt","Babak Shahbaba","Stephan Mandt"],"pdf_url":"https://arxiv.org/pdf/2403.05300v5.pdf","comment":"Accepted at Neurips 2024"},{"id":"http://arxiv.org/abs/2408.11735v3","updated":"2025-01-07T17:34:04Z","published":"2024-08-21T15:59:33Z","title":"Clinical Insights: A Comprehensive Review of Language Models in Medicine","summary":" This paper explores the advancements and applications of language models in\nhealthcare, focusing on their clinical use cases. It examines the evolution\nfrom early encoder-based systems requiring extensive fine-tuning to\nstate-of-the-art large language and multimodal models capable of integrating\ntext and visual data through in-context learning. The analysis emphasizes\nlocally deployable models, which enhance data privacy and operational autonomy,\nand their applications in tasks such as text generation, classification,\ninformation extraction, and conversational systems. 
The paper also highlights a\nstructured organization of tasks and a tiered ethical approach, providing a\nvaluable resource for researchers and practitioners, while discussing key\nchallenges related to ethics, evaluation, and implementation.\n","authors":["Nikita Neveditsin","Pawan Lingras","Vijay Mago"],"pdf_url":"https://arxiv.org/pdf/2408.11735v3.pdf","comment":"Submitted to PLOS Digital Health, Revision 1"},{"id":"http://arxiv.org/abs/2301.08110v6","updated":"2025-01-07T17:26:26Z","published":"2023-01-19T15:01:00Z","title":"AtMan: Understanding Transformer Predictions Through Memory Efficient\n Attention Manipulation","summary":" Generative transformer models have become increasingly complex, with large\nnumbers of parameters and the ability to process multiple input modalities.\nCurrent methods for explaining their predictions are resource-intensive. Most\ncrucially, they require prohibitively large amounts of extra memory, since they\nrely on backpropagation which allocates almost twice as much GPU memory as the\nforward pass. This makes it difficult, if not impossible, to use them in\nproduction. We present AtMan that provides explanations of generative\ntransformer models at almost no extra cost. Specifically, AtMan is a\nmodality-agnostic perturbation method that manipulates the attention mechanisms\nof transformers to produce relevance maps for the input with respect to the\noutput prediction. Instead of using backpropagation, AtMan applies a\nparallelizable token-based search method based on cosine similarity\nneighborhood in the embedding space. Our exhaustive experiments on text and\nimage-text benchmarks demonstrate that AtMan outperforms current\nstate-of-the-art gradient-based methods on several metrics while being\ncomputationally efficient. 
As such, AtMan is suitable for use in large model\ninference deployments.\n","authors":["Björn Deiseroth","Mayukh Deb","Samuel Weinbach","Manuel Brack","Patrick Schramowski","Kristian Kersting"],"pdf_url":"https://arxiv.org/pdf/2301.08110v6.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03952v1","updated":"2025-01-07T17:24:17Z","published":"2025-01-07T17:24:17Z","title":"Localizing AI: Evaluating Open-Weight Language Models for Languages of\n Baltic States","summary":" Although large language models (LLMs) have transformed our expectations of\nmodern language technologies, concerns over data privacy often restrict the use\nof commercially available LLMs hosted outside of EU jurisdictions. This limits\ntheir application in governmental, defence, and other data-sensitive sectors.\nIn this work, we evaluate the extent to which locally deployable open-weight\nLLMs support lesser-spoken languages such as Lithuanian, Latvian, and Estonian.\nWe examine various size and precision variants of the top-performing\nmultilingual open-weight models, Llama~3, Gemma~2, Phi, and NeMo, on machine\ntranslation, multiple-choice question answering, and free-form text generation.\nThe results indicate that while certain models like Gemma~2 perform close to\nthe top commercially available models, many LLMs struggle with these languages.\nMost surprisingly, however, we find that these models, while showing close to\nstate-of-the-art translation performance, are still prone to lexical\nhallucinations with errors in at least 1 in 20 words for all open-weight\nmultilingual LLMs.\n","authors":["Jurgita Kapočiūtė-Dzikienė","Toms Bergmanis","Mārcis Pinnis"],"pdf_url":"https://arxiv.org/pdf/2501.03952v1.pdf","comment":"This paper is accepted to NoDaLiDa/Baltic-HLT 2025"},{"id":"http://arxiv.org/abs/2501.03941v1","updated":"2025-01-07T17:02:33Z","published":"2025-01-07T17:02:33Z","title":"Synthetic Data Privacy Metrics","summary":" Recent advancements in generative AI have made it possible 
to create\nsynthetic datasets that can be as accurate as real-world data for training AI\nmodels, powering statistical insights, and fostering collaboration with\nsensitive datasets while offering strong privacy guarantees. Effectively\nmeasuring the empirical privacy of synthetic data is an important step in the\nprocess. However, while there is a multitude of new privacy metrics being\npublished every day, there currently is no standardization. In this paper, we\nreview the pros and cons of popular metrics that include simulations of\nadversarial attacks. We also review current best practices for amending\ngenerative models to enhance the privacy of the data they create (e.g.\ndifferential privacy).\n","authors":["Amy Steier","Lipika Ramaswamy","Andre Manoel","Alexa Haushalter"],"pdf_url":"https://arxiv.org/pdf/2501.03941v1.pdf","comment":"14 pages, 2 figures"},{"id":"http://arxiv.org/abs/2501.03940v1","updated":"2025-01-07T17:00:49Z","published":"2025-01-07T17:00:49Z","title":"Not all tokens are created equal: Perplexity Attention Weighted Networks\n for AI generated text detection","summary":" The rapid advancement in large language models (LLMs) has significantly\nenhanced their ability to generate coherent and contextually relevant text,\nraising concerns about the misuse of AI-generated content and making it\ncritical to detect it. However, the task remains challenging, particularly in\nunseen domains or with unfamiliar LLMs. Leveraging LLM next-token distribution\noutputs offers a theoretically appealing approach for detection, as they\nencapsulate insights from the models' extensive pre-training on diverse\ncorpora. Despite its promise, zero-shot methods that attempt to operationalize\nthese outputs have met with limited success. We hypothesize that one of the\nproblems is that they use the mean to aggregate next-token distribution metrics\nacross tokens, when some tokens are naturally easier or harder to predict and\nshould be weighted differently. 
Based on this idea, we propose the Perplexity\nAttention Weighted Network (PAWN), which uses the last hidden states of the LLM\nand positions to weight the sum of a series of features based on metrics from\nthe next-token distribution across the sequence length. Although not zero-shot,\nour method allows us to cache the last hidden states and next-token\ndistribution metrics on disk, greatly reducing the training resource\nrequirements. PAWN shows competitive and even better performance\nin-distribution than the strongest baselines (fine-tuned LMs) with a fraction\nof their trainable parameters. Our model also generalizes better to unseen\ndomains and source models, with smaller variability in the decision boundary\nacross distribution shifts. It is also more robust to adversarial attacks, and\nif the backbone has multilingual capabilities, it presents decent\ngeneralization to languages not seen during supervised training, with LLaMA3-1B\nreaching a mean macro-averaged F1 score of 81.46% in cross-validation with nine\nlanguages.\n","authors":["Pablo Miralles-González","Javier Huertas-Tato","Alejandro Martín","David Camacho"],"pdf_url":"https://arxiv.org/pdf/2501.03940v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03936v1","updated":"2025-01-07T16:53:01Z","published":"2025-01-07T16:53:01Z","title":"PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides","summary":" Automatically generating presentations from documents is a challenging task\nthat requires balancing content quality, visual design, and structural\ncoherence. Existing methods primarily focus on improving and evaluating the\ncontent quality in isolation, often overlooking visual design and structural\ncoherence, which limits their practical applicability. To address these\nlimitations, we propose PPTAgent, which comprehensively improves presentation\ngeneration through a two-stage, edit-based approach inspired by human\nworkflows. 
PPTAgent first analyzes reference presentations to understand their\nstructural patterns and content schemas, then drafts outlines and generates\nslides through code actions to ensure consistency and alignment. To\ncomprehensively evaluate the quality of generated presentations, we further\nintroduce PPTEval, an evaluation framework that assesses presentations across\nthree dimensions: Content, Design, and Coherence. Experiments show that\nPPTAgent significantly outperforms traditional automatic presentation\ngeneration methods across all three dimensions. The code and data are available\nat https://github.com/icip-cas/PPTAgent.\n","authors":["Hao Zheng","Xinyan Guan","Hao Kong","Jia Zheng","Hongyu Lin","Yaojie Lu","Ben He","Xianpei Han","Le Sun"],"pdf_url":"https://arxiv.org/pdf/2501.03936v1.pdf","comment":"8 pages, 20 figures"},{"id":"http://arxiv.org/abs/2501.03916v1","updated":"2025-01-07T16:31:10Z","published":"2025-01-07T16:31:10Z","title":"Dolphin: Closed-loop Open-ended Auto-research through Thinking,\n Practice, and Feedback","summary":" The scientific research paradigm is undergoing a profound transformation\nowing to the development of Artificial Intelligence (AI). Recent works\ndemonstrate that various AI-assisted research methods can largely improve\nresearch efficiency by improving data analysis, accelerating computation, and\nfostering novel idea generation. To further move towards the ultimate goal\n(i.e., automatic scientific research), in this paper, we propose Dolphin, the\nfirst closed-loop open-ended auto-research framework to further build the\nentire process of human scientific research. Dolphin can generate research\nideas, perform experiments, and get feedback from experimental results to\ngenerate higher-quality ideas. More specifically, Dolphin first generates novel\nideas based on relevant papers which are ranked by the topic and task\nattributes. 
Then, code is automatically generated and debugged using the\nexception-traceback-guided local code structure. Finally, Dolphin automatically\nanalyzes the results of each idea and feeds the results back to the next round\nof idea generation. Experiments are conducted on benchmark datasets across\ndifferent topics, and results show that Dolphin can generate novel ideas\ncontinuously and complete experiments in a loop. We highlight that Dolphin\ncan automatically propose methods that are comparable to the state-of-the-art\nin some tasks such as 2D image classification and 3D point classification.\n","authors":["Jiakang Yuan","Xiangchao Yan","Botian Shi","Tao Chen","Wanli Ouyang","Bo Zhang","Lei Bai","Yu Qiao","Bowen Zhou"],"pdf_url":"https://arxiv.org/pdf/2501.03916v1.pdf","comment":"19 pages, 11 figures, and our homepage:\n https://unimodal4reasoning.github.io/Dolphin-project-page/"},{"id":"http://arxiv.org/abs/2406.19223v2","updated":"2025-01-07T16:20:17Z","published":"2024-06-27T14:49:08Z","title":"T-FREE: Subword Tokenizer-Free Generative LLMs via Sparse\n Representations for Memory-Efficient Embeddings","summary":" Tokenizers are crucial for encoding information in Large Language Models, but\ntheir development has recently stagnated, and they contain inherent weaknesses.\nMajor limitations include computational overhead, ineffective vocabulary use,\nand unnecessarily large embedding and head layers. Additionally, their\nperformance is biased towards a reference corpus, leading to reduced\neffectiveness for underrepresented languages.\n To remedy these issues, we propose T-FREE, which directly embeds words\nthrough sparse activation patterns over character triplets, and does not\nrequire a reference corpus. T-FREE inherently exploits morphological\nsimilarities and allows for strong compression of embedding layers. 
In our\nexhaustive experimental evaluation, we achieve competitive downstream\nperformance with a parameter reduction of more than 85% on these layers.\nFurther, T-FREE shows significant improvements in cross-lingual transfer\nlearning.\n","authors":["Björn Deiseroth","Manuel Brack","Patrick Schramowski","Kristian Kersting","Samuel Weinbach"],"pdf_url":"https://arxiv.org/pdf/2406.19223v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03904v1","updated":"2025-01-07T16:18:55Z","published":"2025-01-07T16:18:55Z","title":"Exploring the Potential of Large Language Models in Public\n Transportation: San Antonio Case Study","summary":" The integration of large language models (LLMs) into public transit systems\npresents a transformative opportunity to enhance urban mobility. This study\nexplores the potential of LLMs to revolutionize public transportation\nmanagement within the context of San Antonio's transit system. Leveraging the\ncapabilities of LLMs in natural language processing and data analysis, we\ninvestigate their capabilities to optimize route planning, reduce wait times,\nand provide personalized travel assistance. By utilizing the General Transit\nFeed Specification (GTFS) and other relevant data, this research aims to\ndemonstrate how LLMs can potentially improve resource allocation, elevate\npassenger satisfaction, and inform data-driven decision-making in transit\noperations. A comparative analysis of different ChatGPT models was conducted to\nassess their ability to understand transportation information, retrieve\nrelevant data, and provide comprehensive responses. 
Findings from this study\nsuggest that while LLMs hold immense promise for public transit, careful\nengineering and fine-tuning are essential to realizing their full potential.\nSan Antonio serves as a case study to inform the development of LLM-powered\ntransit systems in other urban environments.\n","authors":["Ramya Jonnala","Gongbo Liang","Jeong Yang","Izzat Alsmadi"],"pdf_url":"https://arxiv.org/pdf/2501.03904v1.pdf","comment":"This work is accepted to AAAI 2025 Workshop on AI for Urban Planning.\n arXiv admin note: substantial text overlap with arXiv:2407.11003"},{"id":"http://arxiv.org/abs/2412.06866v3","updated":"2025-01-07T16:16:49Z","published":"2024-12-09T09:31:58Z","title":"LMS-AutoTSF: Learnable Multi-Scale Decomposition and Integrated\n Autocorrelation for Time Series Forecasting","summary":" Time series forecasting is an important challenge with significant\napplications in areas such as weather prediction, stock market analysis,\nscientific simulations and industrial process analysis. In this work, we\nintroduce LMS-AutoTSF, a novel time series forecasting architecture that\nincorporates autocorrelation while leveraging dual encoders operating at\nmultiple scales. Unlike models that rely on predefined trend and seasonal\ncomponents, LMS-AutoTSF employs two separate encoders per scale: one focusing\non low-pass filtering to capture trends and the other utilizing high-pass\nfiltering to model seasonal variations. These filters are learnable, allowing\nthe model to dynamically adapt and isolate trend and seasonal components\ndirectly in the frequency domain. A key innovation in our approach is the\nintegration of autocorrelation, achieved by computing lagged differences in\ntime steps, which enables the model to capture dependencies across time more\neffectively. Each encoder processes the input through fully connected layers to\nhandle temporal and channel interactions. 
By combining frequency-domain\nfiltering, autocorrelation-based temporal modeling, and channel-wise\ntransformations, LMS-AutoTSF not only accurately captures long-term\ndependencies and fine-grained patterns but also operates more efficiently\ncompared to other state-of-the-art methods. Its lightweight design ensures\nfaster processing while maintaining high precision in forecasting across\ndiverse time horizons. The source code is publicly available at\n\\url{http://github.com/mribrahim/LMS-TSF}\n","authors":["Ibrahim Delibasoglu","Sanjay Chakraborty","Fredrik Heintz"],"pdf_url":"https://arxiv.org/pdf/2412.06866v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19519v2","updated":"2025-01-07T16:13:50Z","published":"2024-05-29T20:56:52Z","title":"Two-Layer Retrieval-Augmented Generation Framework for Low-Resource\n Medical Question Answering Using Reddit Data: Proof-of-Concept Study","summary":" The increasing use of social media to share lived and living experiences of\nsubstance use presents a unique opportunity to obtain information on side\neffects, use patterns, and opinions on novel psychoactive substances. However,\ndue to the large volume of data, obtaining useful insights through natural\nlanguage processing technologies such as large language models is challenging.\nThis paper aims to develop a retrieval-augmented generation (RAG) architecture\nfor medical question answering pertaining to clinicians' queries on emerging\nissues associated with health-related topics, using user-generated medical\ninformation on social media. We proposed a two-layer RAG framework for\nquery-focused answer generation and evaluated a proof of concept for the\nframework in the context of query-focused summary generation from social media\nforums, focusing on emerging drug-related information. 
Our modular framework\ngenerates individual summaries followed by an aggregated summary to answer\nmedical queries from large amounts of user-generated social media data in an\nefficient manner. We compared the performance of a quantized large language\nmodel (Nous-Hermes-2-7B-DPO), deployable in low-resource settings, with GPT-4.\nFor this proof-of-concept study, we used user-generated data from Reddit to\nanswer clinicians' questions on the use of xylazine and ketamine. Our framework\nachieves comparable median scores in terms of relevance, length, hallucination,\ncoverage, and coherence when evaluated using GPT-4 and Nous-Hermes-2-7B-DPO,\nevaluated for 20 queries with 76 samples. There was no statistically\nsignificant difference between the two for coverage, coherence, relevance,\nlength, and hallucination. A statistically significant difference was noted for\nthe Coleman-Liau Index. Our RAG framework can effectively answer medical\nquestions about targeted topics and can be deployed in resource-constrained\nsettings.\n","authors":["Sudeshna Das","Yao Ge","Yuting Guo","Swati Rajwal","JaMor Hairston","Jeanne Powell","Drew Walker","Snigdha Peddireddy","Sahithi Lakamana","Selen Bozkurt","Matthew Reyna","Reza Sameni","Yunyu Xiao","Sangmi Kim","Rasheeta Chandler","Natalie Hernandez","Danielle Mowery","Rachel Wightman","Jennifer Love","Anthony Spadaro","Jeanmarie Perrone","Abeed Sarker"],"pdf_url":"https://arxiv.org/pdf/2405.19519v2.pdf","comment":"Published in JMIR: https://www.jmir.org/2025/1/e66220"},{"id":"http://arxiv.org/abs/2501.03902v1","updated":"2025-01-07T16:10:09Z","published":"2025-01-07T16:10:09Z","title":"Explainable Reinforcement Learning via Temporal Policy Decomposition","summary":" We investigate the explainability of Reinforcement Learning (RL) policies\nfrom a temporal perspective, focusing on the sequence of future outcomes\nassociated with individual actions. 
In RL, value functions compress information\nabout rewards collected across multiple trajectories and over an infinite\nhorizon, allowing a compact form of knowledge representation. However, this\ncompression obscures the temporal details inherent in sequential\ndecision-making, presenting a key challenge for interpretability. We present\nTemporal Policy Decomposition (TPD), a novel explainability approach that\nexplains individual RL actions in terms of their Expected Future Outcome (EFO).\nThese explanations decompose generalized value functions into a sequence of\nEFOs, one for each time step up to a prediction horizon of interest, revealing\ninsights into when specific outcomes are expected to occur. We leverage\nfixed-horizon temporal difference learning to devise an off-policy method for\nlearning EFOs for both optimal and suboptimal actions, enabling contrastive\nexplanations consisting of EFOs for different state-action pairs. Our\nexperiments demonstrate that TPD generates accurate explanations that (i)\nclarify the policy's future strategy and anticipated trajectory for a given\naction and (ii) improve understanding of the reward composition, facilitating\nfine-tuning of the reward function to align with human expectations.\n","authors":["Franco Ruggeri","Alessio Russo","Rafia Inam","Karl Henrik Johansson"],"pdf_url":"https://arxiv.org/pdf/2501.03902v1.pdf","comment":"21 pages, 4 figures"},{"id":"http://arxiv.org/abs/2501.03895v1","updated":"2025-01-07T16:03:14Z","published":"2025-01-07T16:03:14Z","title":"LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One\n Vision Token","summary":" The advent of real-time large multimodal models (LMMs) like GPT-4o has\nsparked considerable interest in efficient LMMs. 
LMM frameworks typically\nencode visual inputs into vision tokens (continuous representations) and\nintegrate them and textual instructions into the context of large language\nmodels (LLMs), where large-scale parameters and numerous context tokens\n(predominantly vision tokens) result in substantial computational overhead.\nPrevious efforts towards efficient LMMs always focus on replacing the LLM\nbackbone with smaller models, while neglecting the crucial issue of token\nquantity. In this paper, we introduce LLaVA-Mini, an efficient LMM with minimal\nvision tokens. To achieve a high compression ratio of vision tokens while\npreserving visual information, we first analyze how LMMs understand vision\ntokens and find that most vision tokens only play a crucial role in the early\nlayers of LLM backbone, where they mainly fuse visual information into text\ntokens. Building on this finding, LLaVA-Mini introduces modality pre-fusion to\nfuse visual information into text tokens in advance, thereby facilitating the\nextreme compression of vision tokens fed to LLM backbone into one token.\nLLaVA-Mini is a unified large multimodal model that can support the\nunderstanding of images, high-resolution images, and videos in an efficient\nmanner. Experiments across 11 image-based and 7 video-based benchmarks\ndemonstrate that LLaVA-Mini outperforms LLaVA-v1.5 with just 1 vision token\ninstead of 576. 
Efficiency analyses reveal that LLaVA-Mini can reduce FLOPs by\n77%, deliver low-latency responses within 40 milliseconds, and process over\n10,000 frames of video on the GPU hardware with 24GB of memory.\n","authors":["Shaolei Zhang","Qingkai Fang","Zhe Yang","Yang Feng"],"pdf_url":"https://arxiv.org/pdf/2501.03895v1.pdf","comment":"Code: https://github.com/ictnlp/LLaVA-Mini; Model:\n https://huggingface.co/ICTNLP/llava-mini-llama-3.1-8b"},{"id":"http://arxiv.org/abs/2408.11876v2","updated":"2025-01-07T16:01:15Z","published":"2024-08-20T13:19:06Z","title":"From Glucose Patterns to Health Outcomes: A Generalizable Foundation\n Model for Continuous Glucose Monitor Data Analysis","summary":" Recent advances in SSL enabled novel medical AI models, known as foundation\nmodels, offer great potential for better characterizing health from diverse\nbiomedical data. CGM provides rich, temporal data on glycemic patterns, but its\nfull potential for predicting broader health outcomes remains underutilized.\nHere, we present GluFormer, a generative foundation model for CGM data that\nlearns nuanced glycemic patterns and translates them into predictive\nrepresentations of metabolic health. Trained on over 10 million CGM\nmeasurements from 10,812 adults, primarily without diabetes, GluFormer uses\nautoregressive token prediction to capture longitudinal glucose dynamics. We\nshow that GluFormer generalizes to 19 external cohorts (n=6,044) spanning\ndifferent ethnicities and ages, 5 countries, 8 CGM devices, and diverse\npathophysiological states. GluFormers representations exceed the performance of\ncurrent CGM metrics, such as the Glucose Management Indicator (GMI), for\nforecasting clinical measures. 
In a longitudinal study of 580 adults with CGM\ndata and 12-year follow-up, GluFormer identifies individuals at elevated risk\nof developing diabetes more effectively than blood HbA1C%, capturing 66% of all\nnew-onset diabetes diagnoses in the top quartile versus 7% in the bottom\nquartile. Similarly, 69% of cardiovascular-death events occurred in the top\nquartile with none in the bottom quartile, demonstrating powerful risk\nstratification beyond traditional glycemic metrics. We also show that CGM\nrepresentations from pre-intervention periods in Randomized Clinical Trials\noutperform other methods in predicting primary and secondary outcomes. When\nintegrating dietary data into GluFormer, we show that the multi-modal version\nof the model can accurately generate CGM data based on dietary intake data,\nsimulate outcomes of dietary interventions, and predict individual responses to\nspecific foods.\n","authors":["Guy Lutsker","Gal Sapir","Smadar Shilo","Jordi Merino","Anastasia Godneva","Jerry R Greenfield","Dorit Samocha-Bonet","Raja Dhir","Francisco Gude","Shie Mannor","Eli Meirom","Gal Chechik","Hagai Rossman","Eran Segal"],"pdf_url":"https://arxiv.org/pdf/2408.11876v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03888v1","updated":"2025-01-07T15:51:49Z","published":"2025-01-07T15:51:49Z","title":"Neural DNF-MT: A Neuro-symbolic Approach for Learning Interpretable and\n Editable Policies","summary":" Although deep reinforcement learning has been shown to be effective, the\nmodel's black-box nature presents barriers to direct policy interpretation. To\naddress this problem, we propose a neuro-symbolic approach called neural DNF-MT\nfor end-to-end policy learning. The differentiable nature of the neural DNF-MT\nmodel enables the use of deep actor-critic algorithms for training. 
At the same\ntime, its architecture is designed so that trained models can be directly\ntranslated into interpretable policies expressed as standard (bivalent or\nprobabilistic) logic programs. Moreover, additional layers can be included to\nextract abstract features from complex observations, acting as a form of\npredicate invention. The logic representations are highly interpretable, and we\nshow how the bivalent representations of deterministic policies can be edited\nand incorporated back into a neural model, facilitating manual intervention and\nadaptation of learned policies. We evaluate our approach on a range of tasks\nrequiring learning deterministic or stochastic behaviours from various forms of\nobservations. Our empirical results show that our neural DNF-MT model performs\nat the level of competing black-box methods whilst providing interpretable\npolicies.\n","authors":["Kexin Gu Baugh","Luke Dickens","Alessandra Russo"],"pdf_url":"https://arxiv.org/pdf/2501.03888v1.pdf","comment":"AAMAS 2025"},{"id":"http://arxiv.org/abs/2410.11463v2","updated":"2025-01-07T15:48:15Z","published":"2024-10-15T10:10:33Z","title":"Advanced Persistent Threats (APT) Attribution Using Deep Reinforcement\n Learning","summary":" The development of the DRL model for malware attribution involved extensive\nresearch, iterative coding, and numerous adjustments based on the insights\ngathered from predecessor models and contemporary research papers. This\npreparatory work was essential to establish a robust foundation for the model,\nensuring it could adapt and respond effectively to the dynamic nature of\nmalware threats. Initially, the model struggled with low accuracy levels, but\nthrough persistent adjustments to its architecture and learning algorithms,\naccuracy improved dramatically from about 7 percent to over 73 percent in early\niterations. 
By the end of the training, the model consistently reached accuracy\nlevels near 98 percent, demonstrating its strong capability to accurately\nrecognise and attribute malware activities. This upward trajectory in training\naccuracy is graphically represented in the Figure, which vividly illustrates\nthe model maturation and increasing proficiency over time.\n","authors":["Animesh Singh Basnet","Mohamed Chahine Ghanem","Dipo Dunsin","Wiktor Sowinski-Mydlarz"],"pdf_url":"https://arxiv.org/pdf/2410.11463v2.pdf","comment":"21 Pages"},{"id":"http://arxiv.org/abs/2405.03732v3","updated":"2025-01-07T15:46:25Z","published":"2024-05-06T10:53:13Z","title":"Deep Learning-based Accelerated MR Cholangiopancreatography without\n Fully-sampled Data","summary":" The purpose of this study was to accelerate MR cholangiopancreatography\n(MRCP) acquisitions using deep learning-based (DL) reconstruction at 3T and\n0.55T. A total of 35 healthy volunteers underwent conventional two-fold\naccelerated MRCP scans at field strengths of 3T and 0.55T. We trained DL\nreconstructions using two different training strategies, supervised (SV) and\nself-supervised (SSV), with retrospectively six-fold undersampled data obtained\nat 3T. We then evaluated the DL reconstructions against standard techniques,\nparallel imaging (PI) and compressed sensing (CS), focusing on peak\nsignal-to-noise ratio (PSNR) and structural similarity (SSIM) as metrics. We\nalso tested DL reconstructions with prospectively accelerated acquisitions and\nevaluated their robustness when changing fields strengths from 3T to 0.55T. DL\nreconstructions demonstrated a reduction in average acquisition time from\n599/542 to 255/180 seconds for MRCP at 3T/0.55T. In both retrospective and\nprospective undersampling, PSNR and SSIM of DL reconstructions were higher than\nthose of PI and CS. 
At the same time, DL reconstructions preserved the image\nquality of undersampled data, including sharpness and the visibility of\nhepatobiliary ducts. In addition, both DL approaches produced high-quality\nreconstructions at 0.55T. In summary, DL reconstructions trained for highly\naccelerated MRCP enabled a reduction in acquisition time by a factor of 2.4/3.0\nat 3T/0.55T while maintaining the image quality of conventional acquisitions.\n","authors":["Jinho Kim","Marcel Dominik Nickel","Florian Knoll"],"pdf_url":"https://arxiv.org/pdf/2405.03732v3.pdf","comment":"19 pages, 4 figures, 2 tables"},{"id":"http://arxiv.org/abs/2501.03879v1","updated":"2025-01-07T15:42:32Z","published":"2025-01-07T15:42:32Z","title":"CL3DOR: Contrastive Learning for 3D Large Multimodal Models via Odds\n Ratio on High-Resolution Point Clouds","summary":" Recent research has demonstrated that Large Language Models (LLMs) are not\nlimited to text-only tasks but can also function as multimodal models across\nvarious modalities, including audio, images, and videos. In particular,\nresearch on 3D Large Multimodal Models (3D LMMs) is making notable strides,\ndriven by the potential of processing higher-dimensional data like point\nclouds. However, upon closer examination, we find that the visual and textual\ncontent within each sample of existing training datasets lacks both high\ninformational granularity and clarity, which serve as a bottleneck for precise\ncross-modal understanding. To address these issues, we propose CL3DOR,\nContrastive Learning for 3D large multimodal models via Odds ratio on\nhigh-Resolution point clouds, designed to ensure greater specificity and\nclarity in both visual and textual content. Specifically, we increase the\ndensity of point clouds per object and construct informative hard negative\nresponses in the training dataset to penalize unwanted responses. 
To leverage\nhard negative responses, we incorporate the odds ratio as an auxiliary term for\ncontrastive learning into the conventional language modeling loss. CL3DOR\nachieves state-of-the-art performance in 3D scene understanding and reasoning\nbenchmarks. Additionally, we demonstrate the effectiveness of CL3DOR's key\ncomponents through extensive experiments.\n","authors":["Keonwoo Kim","Yeongjae Cho","Taebaek Hwang","Minsoo Jo","Sangdo Han"],"pdf_url":"https://arxiv.org/pdf/2501.03879v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.08514v2","updated":"2025-01-07T15:37:10Z","published":"2024-09-13T03:25:34Z","title":"Apollo: Band-sequence Modeling for High-Quality Audio Restoration","summary":" Audio restoration has become increasingly significant in modern society, not\nonly due to the demand for high-quality auditory experiences enabled by\nadvanced playback devices, but also because the growing capabilities of\ngenerative audio models necessitate high-fidelity audio. Typically, audio\nrestoration is defined as a task of predicting undistorted audio from damaged\ninput, often trained using a GAN framework to balance perception and\ndistortion. Since audio degradation is primarily concentrated in mid- and\nhigh-frequency ranges, especially due to codecs, a key challenge lies in\ndesigning a generator capable of preserving low-frequency information while\naccurately reconstructing high-quality mid- and high-frequency content.\nInspired by recent advancements in high-sample-rate music separation, speech\nenhancement, and audio codec models, we propose Apollo, a generative model\ndesigned for high-sample-rate audio restoration. 
Apollo employs an explicit\nfrequency band split module to model the relationships between different\nfrequency bands, allowing for more coherent and higher-quality restored audio.\nEvaluated on the MUSDB18-HQ and MoisesDB datasets, Apollo consistently\noutperforms existing SR-GAN models across various bit rates and music genres,\nparticularly excelling in complex scenarios involving mixtures of multiple\ninstruments and vocals. Apollo significantly improves music restoration quality\nwhile maintaining computational efficiency. The source code for Apollo is\npublicly available at https://github.com/JusperLee/Apollo.\n","authors":["Kai Li","Yi Luo"],"pdf_url":"https://arxiv.org/pdf/2409.08514v2.pdf","comment":"Accepted by ICASSP 2025, Demo Page: https://cslikai.cn/Apollo"},{"id":"http://arxiv.org/abs/2412.14841v2","updated":"2025-01-07T15:30:56Z","published":"2024-12-19T13:34:14Z","title":"Helping LLMs Improve Code Generation Using Feedback from Testing and\n Static Analysis","summary":" Large Language Models (LLMs) are one of the most promising developments in\nthe field of artificial intelligence, and the software engineering community\nhas readily noticed their potential role in the software development\nlife-cycle. Developers routinely ask LLMs to generate code snippets, increasing\nproductivity but also potentially introducing ownership, privacy, correctness,\nand security issues. Previous work highlighted how code generated by mainstream\ncommercial LLMs is often not safe, containing vulnerabilities, bugs, and code\nsmells. In this paper, we present a framework that leverages testing and static\nanalysis to assess the quality, and guide the self-improvement, of code\ngenerated by general-purpose, open-source LLMs.\n First, we ask LLMs to generate C code to solve a number of programming tasks.\nThen we employ ground-truth tests to assess the (in)correctness of the\ngenerated code, and a static analysis tool to detect potential safety\nvulnerabilities. 
Next, we assess the models ability to evaluate the generated\ncode, by asking them to detect errors and vulnerabilities. Finally, we test the\nmodels ability to fix the generated code, providing the reports produced during\nthe static analysis and incorrectness evaluation phases as feedback.\n Our results show that models often produce incorrect code, and that the\ngenerated code can include safety issues. Moreover, they perform very poorly at\ndetecting either issue. On the positive side, we observe a substantial ability\nto fix flawed code when provided with information about failed tests or\npotential vulnerabilities, indicating a promising avenue for improving the\nsafety of LLM-based code generation tools.\n","authors":["Greta Dolcetti","Vincenzo Arceri","Eleonora Iotti","Sergio Maffeis","Agostino Cortesi","Enea Zaffanella"],"pdf_url":"https://arxiv.org/pdf/2412.14841v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.19155v3","updated":"2025-01-07T15:30:02Z","published":"2024-10-24T20:49:22Z","title":"Lived Experience Not Found: LLMs Struggle to Align with Experts on\n Addressing Adverse Drug Reactions from Psychiatric Medication Use","summary":" Adverse Drug Reactions (ADRs) from psychiatric medications are the leading\ncause of hospitalizations among mental health patients. With healthcare systems\nand online communities facing limitations in resolving ADR-related issues,\nLarge Language Models (LLMs) have the potential to fill this gap. Despite the\nincreasing capabilities of LLMs, past research has not explored their\ncapabilities in detecting ADRs related to psychiatric medications or in\nproviding effective harm reduction strategies. To address this, we introduce\nthe Psych-ADR benchmark and the Adverse Drug Reaction Response Assessment\n(ADRA) framework to systematically evaluate LLM performance in detecting ADR\nexpressions and delivering expert-aligned mitigation strategies. 
Our analyses\nshow that LLMs struggle with understanding the nuances of ADRs and\ndifferentiating between types of ADRs. While LLMs align with experts in terms\nof expressed emotions and tone of the text, their responses are more complex,\nharder to read, and only 70.86% aligned with expert strategies. Furthermore,\nthey provide less actionable advice by a margin of 12.32% on average. Our work\nprovides a comprehensive benchmark and evaluation framework for assessing LLMs\nin strategy-driven tasks within high-risk domains.\n","authors":["Mohit Chandra","Siddharth Sriraman","Gaurav Verma","Harneet Singh Khanuja","Jose Suarez Campayo","Zihang Li","Michael L. Birnbaum","Munmun De Choudhury"],"pdf_url":"https://arxiv.org/pdf/2410.19155v3.pdf","comment":"30 pages, 8 figures, 16 tables"},{"id":"http://arxiv.org/abs/2410.13850v3","updated":"2025-01-07T15:28:09Z","published":"2024-10-17T17:59:02Z","title":"Influence Functions for Scalable Data Attribution in Diffusion Models","summary":" Diffusion models have led to significant advancements in generative\nmodelling. Yet their widespread adoption poses challenges regarding data\nattribution and interpretability. In this paper, we aim to help address such\nchallenges in diffusion models by developing an influence functions framework.\nInfluence function-based data attribution methods approximate how a model's\noutput would have changed if some training data were removed. In supervised\nlearning, this is usually used for predicting how the loss on a particular\nexample would change. For diffusion models, we focus on predicting the change\nin the probability of generating a particular example via several proxy\nmeasurements. We show how to formulate influence functions for such quantities\nand how previously proposed methods can be interpreted as particular design\nchoices in our framework. 
To ensure scalability of the Hessian computations in\ninfluence functions, we systematically develop K-FAC approximations based on\ngeneralised Gauss-Newton matrices specifically tailored to diffusion models. We\nrecast previously proposed methods as specific design choices in our framework\nand show that our recommended method outperforms previous data attribution\napproaches on common evaluations, such as the Linear Data-modelling Score (LDS)\nor retraining without top influences, without the need for method-specific\nhyperparameter tuning.\n","authors":["Bruno Mlodozeniec","Runa Eschenhagen","Juhan Bae","Alexander Immer","David Krueger","Richard Turner"],"pdf_url":"https://arxiv.org/pdf/2410.13850v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03038v2","updated":"2025-01-07T15:13:41Z","published":"2025-01-06T14:26:00Z","title":"Piano Transcription by Hierarchical Language Modeling with Pretrained\n Roll-based Encoders","summary":" Automatic Music Transcription (AMT), aiming to get musical notes from raw\naudio, typically uses frame-level systems with piano-roll outputs or language\nmodel (LM)-based systems with note-level predictions. However, frame-level\nsystems require manual thresholding, while the LM-based systems struggle with\nlong sequences. In this paper, we propose a hybrid method combining pre-trained\nroll-based encoders with an LM decoder to leverage the strengths of both\nmethods. Besides, our approach employs a hierarchical prediction strategy,\nfirst predicting onset and pitch, then velocity, and finally offset. The\nhierarchical prediction strategy reduces computational costs by breaking down\nlong sequences into different hierarchies. 
Evaluated on two benchmark\nroll-based encoders, our method outperforms traditional piano-roll outputs 0.01\nand 0.022 in onset-offset-velocity F1 score, demonstrating its potential as a\nperformance-enhancing plug-in for arbitrary roll-based music transcription\nencoder.\n","authors":["Dichucheng Li","Yongyi Zang","Qiuqiang Kong"],"pdf_url":"https://arxiv.org/pdf/2501.03038v2.pdf","comment":"Accepted by ICASSP 2025"},{"id":"http://arxiv.org/abs/2501.03847v1","updated":"2025-01-07T15:01:58Z","published":"2025-01-07T15:01:58Z","title":"Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video\n Generation Control","summary":" Diffusion models have demonstrated impressive performance in generating\nhigh-quality videos from text prompts or images. However, precise control over\nthe video generation process, such as camera manipulation or content editing,\nremains a significant challenge. Existing methods for controlled video\ngeneration are typically limited to a single control type, lacking the\nflexibility to handle diverse control demands. In this paper, we introduce\nDiffusion as Shader (DaS), a novel approach that supports multiple video\ncontrol tasks within a unified architecture. Our key insight is that achieving\nversatile video control necessitates leveraging 3D control signals, as videos\nare fundamentally 2D renderings of dynamic 3D content. Unlike prior methods\nlimited to 2D control signals, DaS leverages 3D tracking videos as control\ninputs, making the video diffusion process inherently 3D-aware. This innovation\nallows DaS to achieve a wide range of video controls by simply manipulating the\n3D tracking videos. A further advantage of using 3D tracking videos is their\nability to effectively link frames, significantly enhancing the temporal\nconsistency of the generated videos. 
With just 3 days of fine-tuning on 8 H800\nGPUs using less than 10k videos, DaS demonstrates strong control capabilities\nacross diverse tasks, including mesh-to-video generation, camera control,\nmotion transfer, and object manipulation.\n","authors":["Zekai Gu","Rui Yan","Jiahao Lu","Peng Li","Zhiyang Dou","Chenyang Si","Zhen Dong","Qifeng Liu","Cheng Lin","Ziwei Liu","Wenping Wang","Yuan Liu"],"pdf_url":"https://arxiv.org/pdf/2501.03847v1.pdf","comment":"Project page: https://igl-hkust.github.io/das/ Codes:\n https://github.com/IGL-HKUST/DiffusionAsShader"},{"id":"http://arxiv.org/abs/2409.16670v2","updated":"2025-01-07T15:00:20Z","published":"2024-09-25T06:57:42Z","title":"GraphLoRA: Structure-Aware Contrastive Low-Rank Adaptation for\n Cross-Graph Transfer Learning","summary":" Graph Neural Networks (GNNs) have demonstrated remarkable proficiency in\nhandling a range of graph analytical tasks across various domains, such as\ne-commerce and social networks. Despite their versatility, GNNs face\nsignificant challenges in transferability, limiting their utility in real-world\napplications. Existing research in GNN transfer learning overlooks\ndiscrepancies in distribution among various graph datasets, facing challenges\nwhen transferring across different distributions. How to effectively adopt a\nwell-trained GNN to new graphs with varying feature and structural\ndistributions remains an under-explored problem. Taking inspiration from the\nsuccess of Low-Rank Adaptation (LoRA) in adapting large language models to\nvarious domains, we propose GraphLoRA, an effective and parameter-efficient\nmethod for transferring well-trained GNNs to diverse graph domains.\nSpecifically, we first propose a Structure-aware Maximum Mean Discrepancy\n(SMMD) to align divergent node feature distributions across source and target\ngraphs. 
Moreover, we introduce low-rank adaptation by injecting a small\ntrainable GNN alongside the pre-trained one, effectively bridging structural\ndistribution gaps while mitigating the catastrophic forgetting. Additionally, a\nstructure-aware regularization objective is proposed to enhance the\nadaptability of the pre-trained GNN to target graph with scarce supervision\nlabels. Extensive experiments on eight real-world datasets demonstrate the\neffectiveness of GraphLoRA against fourteen baselines by tuning only 20% of\nparameters, even across disparate graph domains. The code is available at\nhttps://github.com/AllminerLab/GraphLoRA.\n","authors":["Zhe-Rui Yang","Jindong Han","Chang-Dong Wang","Hao Liu"],"pdf_url":"https://arxiv.org/pdf/2409.16670v2.pdf","comment":"Accepted by KDD2025"},{"id":"http://arxiv.org/abs/2410.15460v3","updated":"2025-01-07T14:56:42Z","published":"2024-10-20T18:18:23Z","title":"Hallucination Detox: Sensitivity Dropout (SenD) for Large Language Model\n Training","summary":" As large language models (LLMs) are increasingly deployed across various\nindustries, concerns regarding their reliability, particularly due to\nhallucinations - outputs that are factually inaccurate or irrelevant to user\ninput - have grown. Our research investigates the relationship between the\ntraining process and the emergence of hallucinations to address a key gap in\nexisting research that focuses primarily on post hoc detection and mitigation\nstrategies. Using models from the Pythia suite (70M - 12B parameters) and\nseveral hallucination detection metrics, we analyze hallucination trends\nthroughout training and explore LLM internal dynamics. We introduce Sensitivity\nDropout (SenD), a novel training protocol designed to mitigate hallucinations\nby reducing variance during training. SenD achieves this by deterministically\ndropping embedding indices with significant variability, referred to as\nSensitive Embedding Indices. 
In addition, we develop an unsupervised\nhallucination detection metric, Efficient EigenScore (EES), which approximates\nthe traditional EigenScore at 2x speed. This efficient metric is integrated\ninto our protocol, allowing SenD to be both computationally scalable and\neffective at reducing hallucinations. Our empirical evaluation demonstrates\nthat our approach improves LLM reliability at test time by up to 40% compared\nto normal training while also providing an efficient method to improve factual\naccuracy when adapting LLMs to Wikipedia, Medical, and LegalBench domains.\n","authors":["Shahrad Mohammadzadeh","Juan David Guerra","Marco Bonizzato","Reihaneh Rabbany","Golnoosh Farnadi"],"pdf_url":"https://arxiv.org/pdf/2410.15460v3.pdf","comment":"23 pages, 15 figures, under review at ICLR, accepted to Safe\n Generative AI Workshop @ NeurIPS 2024, resubmitting to change name to\n appropriate name"},{"id":"http://arxiv.org/abs/2501.03836v1","updated":"2025-01-07T14:45:39Z","published":"2025-01-07T14:45:39Z","title":"SCC-YOLO: An Improved Object Detector for Assisting in Brain Tumor\n Diagnosis","summary":" Brain tumors can result in neurological dysfunction, alterations in cognitive\nand psychological states, increased intracranial pressure, and the occurrence\nof seizures, thereby presenting a substantial risk to human life and health.\nThe You Only Look Once(YOLO) series models have demonstrated superior accuracy\nin object detection for medical imaging. In this paper, we develop a novel\nSCC-YOLO architecture by integrating the SCConv attention mechanism into\nYOLOv9. The SCConv module reconstructs an efficient convolutional module by\nreducing spatial and channel redundancy among features, thereby enhancing the\nlearning of image features. 
We investigate the impact of integrating different\nattention mechanisms with the YOLOv9 model on brain tumor image detection using\nboth the Br35H dataset and our self-made dataset (Brain_Tumor_Dataset).\nExperimental results show that on the Br35H dataset, SCC-YOLO achieved a 0.3%\nimprovement in mAP50 compared to YOLOv9, while on our self-made dataset,\nSCC-YOLO exhibited a 0.5% improvement over YOLOv9. SCC-YOLO has reached\nstate-of-the-art performance in brain tumor detection. Source code is available\nat: https://jihulab.com/healthcare-information-studio/SCC-YOLO/-/tree/master\n","authors":["Runci Bai"],"pdf_url":"https://arxiv.org/pdf/2501.03836v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03835v1","updated":"2025-01-07T14:45:30Z","published":"2025-01-07T14:45:30Z","title":"TACLR: A Scalable and Efficient Retrieval-based Method for Industrial\n Product Attribute Value Identification","summary":" Product Attribute Value Identification (PAVI) involves identifying attribute\nvalues from product profiles, a key task for improving product search,\nrecommendations, and business analytics on e-commerce platforms. However,\nexisting PAVI methods face critical challenges, such as inferring implicit\nvalues, handling out-of-distribution (OOD) values, and producing normalized\noutputs. To address these limitations, we introduce Taxonomy-Aware Contrastive\nLearning Retrieval (TACLR), the first retrieval-based method for PAVI. TACLR\nformulates PAVI as an information retrieval task by encoding product profiles\nand candidate values into embeddings and retrieving values based on their\nsimilarity to the item embedding. It leverages contrastive training with\ntaxonomy-aware hard negative sampling and employs adaptive inference with\ndynamic thresholds. 
TACLR offers three key advantages: (1) it effectively\nhandles implicit and OOD values while producing normalized outputs; (2) it\nscales to thousands of categories, tens of thousands of attributes, and\nmillions of values; and (3) it supports efficient inference for high-load\nindustrial scenarios. Extensive experiments on proprietary and public datasets\nvalidate the effectiveness and efficiency of TACLR. Moreover, it has been\nsuccessfully deployed in a real-world e-commerce platform, processing millions\nof product listings daily while supporting dynamic, large-scale attribute\ntaxonomies.\n","authors":["Yindu Su","Huike Zou","Lin Sun","Ting Zhang","Haiyang Yang","Liyu Chen","David Lo","Qingheng Zhang","Shuguang Han","Jufeng Chen"],"pdf_url":"https://arxiv.org/pdf/2501.03835v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03832v1","updated":"2025-01-07T14:42:38Z","published":"2025-01-07T14:42:38Z","title":"Three-dimensional attention Transformer for state evaluation in\n real-time strategy games","summary":" Situation assessment in Real-Time Strategy (RTS) games is crucial for\nunderstanding decision-making in complex adversarial environments. However,\nexisting methods remain limited in processing multi-dimensional feature\ninformation and temporal dependencies. 
Here we propose a tri-dimensional\nSpace-Time-Feature Transformer (TSTF Transformer) architecture, which\nefficiently models battlefield situations through three independent but\ncascaded modules: spatial attention, temporal attention, and feature attention.\nOn a dataset comprising 3,150 adversarial experiments, the 8-layer TSTF\nTransformer demonstrates superior performance: achieving 58.7% accuracy in the\nearly game (~4% progress), significantly outperforming the conventional\nTimesformer's 41.8%; reaching 97.6% accuracy in the mid-game (~40% progress)\nwhile maintaining low performance variation (standard deviation 0.114).\nMeanwhile, this architecture requires fewer parameters (4.75M) compared to the\nbaseline model (5.54M). Our study not only provides new insights into situation\nassessment in RTS games but also presents an innovative paradigm for\nTransformer-based multi-dimensional temporal modeling.\n","authors":["Yanqing Ye","Weilong Yang","Kai Qiu","Jie Zhang"],"pdf_url":"https://arxiv.org/pdf/2501.03832v1.pdf","comment":"9 pages, 5 figures"},{"id":"http://arxiv.org/abs/2501.03825v1","updated":"2025-01-07T14:37:14Z","published":"2025-01-07T14:37:14Z","title":"Deep Sylvester Posterior Inference for Adaptive Compressed Sensing in\n Ultrasound Imaging","summary":" Ultrasound images are commonly formed by sequential acquisition of\nbeam-steered scan-lines. Minimizing the number of required scan-lines can\nsignificantly enhance frame rate, field of view, energy efficiency, and data\ntransfer speeds. Existing approaches typically use static subsampling schemes\nin combination with sparsity-based or, more recently, deep-learning-based\nrecovery. In this work, we introduce an adaptive subsampling method that\nmaximizes intrinsic information gain in-situ, employing a Sylvester Normalizing\nFlow encoder to infer an approximate Bayesian posterior under partial\nobservation in real-time. 
Using the Bayesian posterior and a deep generative\nmodel for future observations, we determine the subsampling scheme that\nmaximizes the mutual information between the subsampled observations and the\nnext frame of the video. We evaluate our approach using the EchoNet cardiac\nultrasound video dataset and demonstrate that our active sampling method\noutperforms competitive baselines, including uniform and variable-density\nrandom sampling, as well as equidistantly spaced scan-lines, improving mean\nabsolute reconstruction error by 15%. Moreover, posterior inference and the\nsampling scheme generation are performed in just 0.015 seconds (66Hz), making\nit fast enough for real-time 2D ultrasound imaging applications.\n","authors":["Simon W. Penninga","Hans van Gorp","Ruud J. G. van Sloun"],"pdf_url":"https://arxiv.org/pdf/2501.03825v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03824v1","updated":"2025-01-07T14:36:33Z","published":"2025-01-07T14:36:33Z","title":"Online Reinforcement Learning-Based Dynamic Adaptive Evaluation Function\n for Real-Time Strategy Tasks","summary":" Effective evaluation of real-time strategy tasks requires adaptive mechanisms\nto cope with dynamic and unpredictable environments. This study proposes a\nmethod to improve evaluation functions for real-time responsiveness to\nbattlefield situation changes, utilizing an online reinforcement\nlearning-based dynamic weight adjustment mechanism within the real-time\nstrategy game. Building on traditional static evaluation functions, the method\nemploys gradient descent in online reinforcement learning to update weights\ndynamically, incorporating weight decay techniques to ensure stability.\nAdditionally, the AdamW optimizer is integrated to adjust the learning rate and\ndecay rate of online reinforcement learning in real time, further reducing the\ndependency on manual parameter tuning. 
Round-robin competition experiments\ndemonstrate that this method significantly enhances the application\neffectiveness of the Lanchester combat model evaluation function, Simple\nevaluation function, and Simple Sqrt evaluation function in planning algorithms\nincluding IDABCD, IDRTMinimax, and Portfolio AI. The method achieves a notable\nimprovement in scores, with the enhancement becoming more pronounced as the\nmap size increases. Furthermore, the increase in evaluation function\ncomputation time induced by this method is kept below 6% for all evaluation\nfunctions and planning algorithms. The proposed dynamic adaptive evaluation\nfunction demonstrates a promising approach for real-time strategy task\nevaluation.\n","authors":["Weilong Yang","Jie Zhang","Xunyun Liu","Yanqing Ye"],"pdf_url":"https://arxiv.org/pdf/2501.03824v1.pdf","comment":"22 pages, 9 figures"},{"id":"http://arxiv.org/abs/2407.10486v2","updated":"2025-01-07T14:09:22Z","published":"2024-07-15T07:14:56Z","title":"IDEAL: Leveraging Infinite and Dynamic Characterizations of Large\n Language Models for Query-focused Summarization","summary":" Query-focused summarization (QFS) aims to produce summaries that answer\nparticular questions of interest, enabling greater user control and\npersonalization. Large language models (LLMs) have shown impressive\ncapability in textual understanding through large-scale pretraining,\nwhich implies great potential for extractive snippet generation. In this\npaper, we systematically investigate two indispensable characteristics that\nLLM-based QFS models should harness, Lengthy Document Summarization\nand Efficiently Fine-grained Query-LLM Alignment, respectively.\nCorrespondingly, we propose two modules called Query-aware HyperExpert and\nQuery-focused Infini-attention to access the aforementioned characteristics.\nThese innovations pave the way for broader application and accessibility in the\nfield of QFS technology. 
Extensive experiments conducted on existing QFS\nbenchmarks indicate the effectiveness and generalizability of the proposed\napproach. Our code is publicly available at\nhttps://github.com/DCDmllm/IDEAL_Summary.\n","authors":["Jie Cao","Dian Jiao","Qiang Yan","Wenqiao Zhang","Siliang Tang","Yueting Zhuang"],"pdf_url":"https://arxiv.org/pdf/2407.10486v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03795v1","updated":"2025-01-07T14:01:59Z","published":"2025-01-07T14:01:59Z","title":"Self-Adaptive ERP: Embedding NLP into Petri-Net creation and Model\n Matching","summary":" Enterprise Resource Planning (ERP) consultants play a vital role in\ncustomizing systems to meet specific business needs by processing large amounts\nof data and adapting functionalities. However, the process is\nresource-intensive, time-consuming, and requires continuous adjustments as\nbusiness demands evolve. This research introduces a Self-Adaptive ERP Framework\nthat automates customization using enterprise process models and system usage\nanalysis. It leverages Artificial Intelligence (AI) & Natural Language\nProcessing (NLP) for Petri nets to transform business processes into adaptable\nmodels, addressing both structural and functional matching. The framework,\nbuilt using Design Science Research (DSR) and a Systematic Literature Review\n(SLR), reduces reliance on manual adjustments, improving ERP customization\nefficiency and accuracy while minimizing the need for consultants.\n","authors":["Ahmed Maged","Gamal Kassem"],"pdf_url":"https://arxiv.org/pdf/2501.03795v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.02285v2","updated":"2025-01-07T13:38:34Z","published":"2025-01-04T13:27:18Z","title":"Hyperbolic Contrastive Learning for Hierarchical 3D Point Cloud\n Embedding","summary":" Hyperbolic spaces allow for more efficient modeling of complex, hierarchical\nstructures, which is particularly beneficial in tasks involving multi-modal\ndata. 
Although hyperbolic geometries have been proven effective for\nlanguage-image pre-training, their capabilities to unify language, image, and\n3D Point Cloud modalities are under-explored. We extend the 3D Point Cloud\nmodality in hyperbolic multi-modal contrastive pre-training. Additionally, we\nexplore the entailment, modality gap, and alignment regularizers for learning\nhierarchical 3D embeddings and facilitating the transfer of knowledge from both\nText and Image modalities. These regularizers enable the learning of\nintra-modal hierarchy within each modality and inter-modal hierarchy across\ntext, 2D images, and 3D Point Clouds. Experimental results demonstrate that our\nproposed training strategy yields an outstanding 3D Point Cloud encoder, and\nthe obtained 3D Point Cloud hierarchical embeddings significantly improve\nperformance on various downstream tasks.\n","authors":["Yingjie Liu","Pengyu Zhang","Ziyao He","Mingsong Chen","Xuan Tang","Xian Wei"],"pdf_url":"https://arxiv.org/pdf/2501.02285v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.00518v2","updated":"2025-01-07T13:36:12Z","published":"2023-09-30T22:37:28Z","title":"Learning Informative Latent Representation for Quantum State Tomography","summary":" Quantum state tomography (QST) is the process of reconstructing the complete\nstate of a quantum system (mathematically described as a density matrix)\nthrough a series of different measurements. These measurements are performed on\na number of identical copies of the quantum system, with outcomes gathered as\nfrequencies. QST aims to recover the density matrix or the properties of the\nquantum state from the measured frequencies. Although an informationally\ncomplete set of measurements can specify the quantum state accurately in an\nideal scenario with a large number of identical copies, both the measurements\nand identical copies are restricted and imperfect in practical scenarios,\nmaking QST highly ill-posed. 
The conventional QST methods usually assume\naccurate measured frequencies or rely on manually designed regularizers to\nhandle the ill-posed reconstruction problem, suffering from limited\napplications in realistic scenarios. Recent advances in deep neural networks\n(DNN) led to the emergence of deep learning in QST. However, existing DL-based\nQST approaches often employ generic DNN models that are not optimized for\nimperfect conditions of QST. In this paper, we propose a transformer-based\nautoencoder architecture tailored for QST with imperfect measurement data. Our\nmethod leverages a transformer-based encoder to extract an informative latent\nrepresentation (ILR) from imperfect measurement data and employs a decoder to\npredict the quantum states based on the ILR. We anticipate that the\nhigh-dimensional ILR will capture more comprehensive information about the\nquantum states. To achieve this, we conduct pre-training of the encoder using a\npretext task that involves reconstructing high-quality frequencies from\nmeasured frequencies. Extensive simulations and experiments demonstrate the\nremarkable ability of the informative latent representation to deal with\nimperfect measurement data in QST.\n","authors":["Hailan Ma","Zhenhong Sun","Daoyi Dong","Dong Gong"],"pdf_url":"https://arxiv.org/pdf/2310.00518v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.00546v3","updated":"2025-01-07T13:31:01Z","published":"2023-12-31T17:21:02Z","title":"AllSpark: A Multimodal Spatio-Temporal General Intelligence Model with\n Ten Modalities via Language as a Reference Framework","summary":" Leveraging multimodal data is an inherent requirement for comprehending\ngeographic objects. 
However, due to the high heterogeneity in structure and\nsemantics among various spatio-temporal modalities, the joint interpretation of\nmultimodal spatio-temporal data has long been an extremely challenging problem.\nThe primary challenge resides in striking a trade-off between the cohesion and\nautonomy of diverse modalities. This trade-off becomes progressively nonlinear\nas the number of modalities expands. Inspired by the human cognitive system and\nlinguistic philosophy, where perceptual signals from the five senses converge\ninto language, we introduce the Language as Reference Framework (LaRF), a\nfundamental principle for constructing a multimodal unified model. Building\nupon this, we propose AllSpark, a multimodal spatio-temporal general artificial\nintelligence model. Our model integrates ten different modalities into a\nunified framework. To achieve modal cohesion, AllSpark introduces a modal\nbridge and multimodal large language model (LLM) to map diverse modal features\ninto the language feature space. To maintain modality autonomy, AllSpark uses\nmodality-specific encoders to extract the tokens of various spatio-temporal\nmodalities. 
Finally, observing a gap between the model's interpretability and\ndownstream tasks, we designed modality-specific prompts and task heads,\nenhancing the model's generalization capability across specific tasks.\nExperiments indicate that the incorporation of language enables AllSpark to\nexcel in few-shot classification tasks for RGB and point cloud modalities\nwithout additional training, surpassing baseline performance by up to 41.82\\%.\nThe source code is available at https://github.com/GeoX-Lab/AllSpark.\n","authors":["Run Shao","Cheng Yang","Qiujun Li","Qing Zhu","Yongjun Zhang","YanSheng Li","Yu Liu","Yong Tang","Dapeng Liu","Shizhong Yang","Haifeng Li"],"pdf_url":"https://arxiv.org/pdf/2401.00546v3.pdf","comment":"19 pages, 19 tables, 3 figures"},{"id":"http://arxiv.org/abs/2410.19915v2","updated":"2025-01-07T13:14:25Z","published":"2024-10-25T18:09:02Z","title":"AI-Driven Scenarios for Urban Mobility: Quantifying the Role of ODE\n Models and Scenario Planning in Reducing Traffic Congestion","summary":" Urbanization and technological advancements are reshaping urban mobility,\npresenting both challenges and opportunities. This paper investigates how\nArtificial Intelligence (AI)-driven technologies can impact traffic congestion\ndynamics and explores their potential to enhance transportation systems'\nefficiency. Specifically, we assess the role of AI innovations, such as\nautonomous vehicles and intelligent traffic management, in mitigating\ncongestion under varying regulatory frameworks. Autonomous vehicles reduce\ncongestion through optimized traffic flow, real-time route adjustments, and\ndecreased human errors.\n The study employs Ordinary Differential Equations (ODEs) to model the dynamic\nrelationship between AI adoption rates and traffic congestion, capturing\nsystemic feedback loops. 
Quantitative outputs include threshold levels of AI\nadoption needed to achieve significant congestion reduction, while qualitative\ninsights stem from scenario planning exploring regulatory and societal\nconditions. This dual-method approach offers actionable strategies for\npolicymakers to create efficient, sustainable, and equitable urban\ntransportation systems. While safety implications of AI are acknowledged, this\nstudy primarily focuses on congestion reduction dynamics.\n","authors":["Katsiaryna Bahamazava"],"pdf_url":"https://arxiv.org/pdf/2410.19915v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03764v1","updated":"2025-01-07T13:08:54Z","published":"2025-01-07T13:08:54Z","title":"SelectiveFinetuning: Enhancing Transfer Learning in Sleep Staging\n through Selective Domain Alignment","summary":" In practical sleep stage classification, a key challenge is the variability\nof EEG data across different subjects and environments. Differences in\nphysiology, age, health status, and recording conditions can lead to domain\nshifts between data. These domain shifts often result in decreased model\naccuracy and reliability, particularly when the model is applied to new data\nwith characteristics different from those it was originally trained on, which\nis a typical manifestation of negative transfer. To address this, we propose\nSelectiveFinetuning in this paper. Our method utilizes a pretrained Multi\nResolution Convolutional Neural Network (MRCNN) to extract EEG features,\ncapturing the distinctive characteristics of different sleep stages. To\nmitigate the effect of domain shifts, we introduce a domain aligning mechanism\nthat employs Earth Mover Distance (EMD) to evaluate and select source domain\ndata closely matching the target domain. 
By finetuning the model with selective\nsource data, our SelectiveFinetuning enhances the model's performance on a target\ndomain that exhibits domain shifts compared to the data used for training.\nExperimental results show that our method outperforms existing baselines,\noffering greater robustness and adaptability in practical scenarios where data\ndistributions are often unpredictable.\n","authors":["Siyuan Zhao","Chenyu Liu","Yi Ding","Xinliang Zhou"],"pdf_url":"https://arxiv.org/pdf/2501.03764v1.pdf","comment":"Accepted by ICASSP 2025"},{"id":"http://arxiv.org/abs/2410.09453v2","updated":"2025-01-07T13:00:57Z","published":"2024-10-12T09:16:09Z","title":"MMAD: The First-Ever Comprehensive Benchmark for Multimodal Large\n Language Models in Industrial Anomaly Detection","summary":" In the field of industrial inspection, Multimodal Large Language Models\n(MLLMs) have a high potential to renew the paradigms in practical applications\ndue to their robust language capabilities and generalization abilities.\nHowever, despite their impressive problem-solving skills in many domains,\nMLLMs' ability in industrial anomaly detection has not been systematically\nstudied. To bridge this gap, we present MMAD, the first-ever full-spectrum\nMLLMs benchmark in industrial Anomaly Detection. We defined seven key subtasks\nof MLLMs in industrial inspection and designed a novel pipeline to generate the\nMMAD dataset with 39,672 questions for 8,366 industrial images. With MMAD, we\nhave conducted a comprehensive, quantitative evaluation of various\nstate-of-the-art MLLMs. The commercial models performed the best, with the\naverage accuracy of GPT-4o models reaching 74.9%. However, this result falls\nfar short of industrial requirements. Our analysis reveals that current MLLMs\nstill have significant room for improvement in answering questions related to\nindustrial anomalies and defects. 
We further explore two training-free\nperformance enhancement strategies to help models improve in industrial\nscenarios, highlighting their promising potential for future research.\n","authors":["Xi Jiang","Jian Li","Hanqiu Deng","Yong Liu","Bin-Bin Gao","Yifeng Zhou","Jialin Li","Chengjie Wang","Feng Zheng"],"pdf_url":"https://arxiv.org/pdf/2410.09453v2.pdf","comment":"The code and data are available at https://github.com/jam-cc/MMAD"},{"id":"http://arxiv.org/abs/2409.18301v3","updated":"2025-01-07T12:44:48Z","published":"2024-09-26T21:16:51Z","title":"Wavelet-Driven Generalizable Framework for Deepfake Face Forgery\n Detection","summary":" The evolution of digital image manipulation, particularly with the\nadvancement of deep generative models, significantly challenges existing\ndeepfake detection methods, especially when the origin of the deepfake is\nobscure. To tackle the increasing complexity of these forgeries, we propose\n\\textbf{Wavelet-CLIP}, a deepfake detection framework that integrates wavelet\ntransforms with features derived from the ViT-L/14 architecture, pre-trained in\nthe CLIP fashion. Wavelet-CLIP utilizes Wavelet Transforms to deeply analyze\nboth spatial and frequency features from images, thus enhancing the model's\ncapability to detect sophisticated deepfakes. To verify the effectiveness of\nour approach, we conducted extensive evaluations against existing\nstate-of-the-art methods for cross-dataset generalization and detection of\nunseen images generated by standard diffusion models. Our method showcases\noutstanding performance, achieving an average AUC of 0.749 for cross-data\ngeneralization and 0.893 for robustness against unseen deepfakes, outperforming\nall compared methods. 
The code can be reproduced from the repo:\n\\url{https://github.com/lalithbharadwajbaru/Wavelet-CLIP}\n","authors":["Lalith Bharadwaj Baru","Rohit Boddeda","Shilhora Akshay Patel","Sai Mohan Gajapaka"],"pdf_url":"https://arxiv.org/pdf/2409.18301v3.pdf","comment":"9 Pages, 2 Figures, 3 Tables"},{"id":"http://arxiv.org/abs/2501.03124v2","updated":"2025-01-07T12:33:44Z","published":"2025-01-06T16:31:45Z","title":"PRMBench: A Fine-grained and Challenging Benchmark for Process-Level\n Reward Models","summary":" Process-level Reward Models (PRMs) are crucial for complex reasoning and\ndecision-making tasks, where each intermediate step plays an important role in\nthe reasoning process. Since language models are prone to various types of\nerrors during the reasoning process, PRMs are required to possess nuanced\ncapabilities for detecting various implicit error types in real-world\nscenarios. However, current benchmarks primarily focus on step correctness,\nfailing to evaluate PRMs' performance systematically. To address this gap, we\nintroduce PRMBench, a process-level benchmark specifically designed to assess\nthe fine-grained error detection capabilities of PRMs. PRMBench comprises 6,216\ncarefully designed problems and 83,456 step-level labels, evaluating models\nacross multiple dimensions, including simplicity, soundness, and sensitivity.\nIn our experiments on 15 models, spanning both open-source PRMs and\nclosed-source large language models prompted as critic models, we uncover\nsignificant weaknesses in current PRMs. These findings underscore the\nchallenges inherent in process-level evaluation and highlight key directions\nfor future research. 
We hope PRMBench can be a robust bench for advancing\nresearch on PRM evaluation and development.\n","authors":["Mingyang Song","Zhaochen Su","Xiaoye Qu","Jiawei Zhou","Yu Cheng"],"pdf_url":"https://arxiv.org/pdf/2501.03124v2.pdf","comment":"Project Page: https://prmbench.github.io/"},{"id":"http://arxiv.org/abs/2405.10936v2","updated":"2025-01-07T12:15:01Z","published":"2024-05-17T17:47:39Z","title":"A Survey on Large Language Models with Multilingualism: Recent Advances\n and New Frontiers","summary":" The rapid development of Large Language Models (LLMs) demonstrates remarkable\nmultilingual capabilities in natural language processing, attracting global\nattention in both academia and industry. To mitigate potential discrimination\nand enhance the overall usability and accessibility for diverse language user\ngroups, it is important for the development of language-fair technology.\nDespite the breakthroughs of LLMs, the investigation into the multilingual\nscenario remains insufficient, where a comprehensive survey to summarize recent\napproaches, developments, limitations, and potential solutions is desirable. To\nthis end, we provide a survey with multiple perspectives on the utilization of\nLLMs in the multilingual scenario. We first rethink the transitions between\nprevious and current research on pre-trained language models. Then we introduce\nseveral perspectives on the multilingualism of LLMs, including training and\ninference methods, information retrieval, model security, multi-domain with\nlanguage culture, and usage of datasets. We also discuss the major challenges\nthat arise in these aspects, along with possible solutions. Besides, we\nhighlight future research directions that aim at further enhancing LLMs with\nmultilingualism. 
The survey aims to help the research community address\nmultilingual problems and provide a comprehensive understanding of the core\nconcepts, key techniques, and latest developments in multilingual natural\nlanguage processing based on LLMs.\n","authors":["Kaiyu Huang","Fengran Mo","Xinyu Zhang","Hongliang Li","You Li","Yuanchi Zhang","Weijian Yi","Yulong Mao","Jinchen Liu","Yuzhuang Xu","Jinan Xu","Jian-Yun Nie","Yang Liu"],"pdf_url":"https://arxiv.org/pdf/2405.10936v2.pdf","comment":"65 pages, Work in Progress"},{"id":"http://arxiv.org/abs/2501.03722v1","updated":"2025-01-07T12:03:02Z","published":"2025-01-07T12:03:02Z","title":"Self-adaptive vision-language model for 3D segmentation of pulmonary\n artery and vein","summary":" Accurate segmentation of pulmonary structures is crucial in clinical\ndiagnosis, disease study, and treatment planning. Significant progress has been\nmade in deep learning-based segmentation techniques, but most require large\namounts of labeled data for training. Consequently, developing precise segmentation\nmethods that demand fewer labeled datasets is paramount in medical image\nanalysis. The emergence of pre-trained vision-language foundation models, such\nas CLIP, recently opened the door for universal computer vision tasks.\nExploiting the generalization ability of these pre-trained foundation models on\ndownstream tasks, such as segmentation, leads to unexpected performance with a\nrelatively small amount of labeled data. However, exploring these models for\npulmonary artery-vein segmentation is still limited. This paper proposes a\nnovel framework called Language-guided self-adaptive Cross-Attention Fusion\nFramework. Our method adopts pre-trained CLIP as a strong feature extractor for\ngenerating the segmentation of 3D CT scans, while adaptively aggregating the\ncross-modality of text and image representations. 
We propose a specially\ndesigned adapter module to fine-tune pre-trained CLIP with a self-adaptive\nlearning strategy to effectively fuse the two modalities of embeddings. We\nextensively validate our method on a local dataset, which is the largest\npulmonary artery-vein CT dataset to date and consists of 718 labeled scans in\ntotal. The experiments show that our method outperformed other state-of-the-art\nmethods by a large margin. Our data and code will be made publicly available\nupon acceptance.\n","authors":["Xiaotong Guo","Deqian Yang","Dan Wang","Haochen Zhao","Yuan Li","Zhilin Sui","Tao Zhou","Lijun Zhang","Yanda Meng"],"pdf_url":"https://arxiv.org/pdf/2501.03722v1.pdf","comment":"8 pages, 3 figures"},{"id":"http://arxiv.org/abs/2409.03260v2","updated":"2025-01-07T11:54:58Z","published":"2024-09-05T05:51:42Z","title":"In Search of Trees: Decision-Tree Policy Synthesis for Black-Box Systems\n via Search","summary":" Decision trees, owing to their interpretability, are attractive as control\npolicies for (dynamical) systems. Unfortunately, constructing, or synthesising,\nsuch policies is a challenging task. Previous approaches do so by imitating a\nneural-network policy, approximating a tabular policy obtained via formal\nsynthesis, employing reinforcement learning, or modelling the problem as a\nmixed-integer linear program. However, these works may require access to a\nhard-to-obtain accurate policy or a formal model of the environment (within\nreach of formal synthesis), and may not provide guarantees on the quality or\nsize of the final tree policy. In contrast, we present an approach to\nsynthesise optimal decision-tree policies given a deterministic black-box\nenvironment and specification, a discretisation of the tree predicates, and an\ninitial set of states, where optimality is defined with respect to the number\nof steps to achieve the goal. 
Our approach is a specialised search algorithm\nwhich systematically explores the (exponentially large) space of decision trees\nunder the given discretisation. The key component is a novel trace-based\npruning mechanism that significantly reduces the search space. Our approach\nrepresents a conceptually novel way of synthesising small decision-tree\npolicies with optimality guarantees even for black-box environments with\nblack-box specifications.\n","authors":["Emir Demirović","Christian Schilling","Anna Lukina"],"pdf_url":"https://arxiv.org/pdf/2409.03260v2.pdf","comment":"8 pages main text incl. references, 2 pages appendix"},{"id":"http://arxiv.org/abs/2501.03717v1","updated":"2025-01-07T11:52:01Z","published":"2025-01-07T11:52:01Z","title":"Materialist: Physically Based Editing Using Single-Image Inverse\n Rendering","summary":" To perform image editing based on single-view, inverse physically based\nrendering, we present a method combining a learning-based approach with\nprogressive differentiable rendering. Given an image, our method leverages\nneural networks to predict initial material properties. Progressive\ndifferentiable rendering is then used to optimize the environment map and\nrefine the material properties with the goal of closely matching the rendered\nresult to the input image. We require only a single image while other inverse\nrendering methods based on the rendering equation require multiple views. In\ncomparison to single-view methods that rely on neural renderers, our approach\nachieves more realistic light material interactions, accurate shadows, and\nglobal illumination. Furthermore, with optimized material properties and\nillumination, our method enables a variety of tasks, including physically based\nmaterial editing, object insertion, and relighting. We also propose a method\nfor material transparency editing that operates effectively without requiring\nfull scene geometry. 
Compared with methods based on Stable Diffusion, our\napproach offers stronger interpretability and more realistic light refraction\nbased on empirical results.\n","authors":["Lezhong Wang","Duc Minh Tran","Ruiqi Cui","Thomson TG","Manmohan Chandraker","Jeppe Revall Frisvad"],"pdf_url":"https://arxiv.org/pdf/2501.03717v1.pdf","comment":"code will be available at github.com/lez-s/Materialist"},{"id":"http://arxiv.org/abs/2501.03715v1","updated":"2025-01-07T11:44:25Z","published":"2025-01-07T11:44:25Z","title":"Neural Deconstruction Search for Vehicle Routing Problems","summary":" Autoregressive construction approaches generate solutions to vehicle routing\nproblems in a step-by-step fashion, leading to high-quality solutions that are\nnearing the performance achieved by handcrafted, operations research\ntechniques. In this work, we challenge the conventional paradigm of sequential\nsolution construction and introduce an iterative search framework where\nsolutions are instead deconstructed by a neural policy. Throughout the search,\nthe neural policy collaborates with a simple greedy insertion algorithm to\nrebuild the deconstructed solutions. Our approach surpasses the performance of\nstate-of-the-art operations research methods across three challenging vehicle\nrouting problems of various problem sizes.\n","authors":["André Hottung","Paula Wong-Chung","Kevin Tierney"],"pdf_url":"https://arxiv.org/pdf/2501.03715v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.09424v3","updated":"2025-01-07T11:37:57Z","published":"2024-09-14T12:25:14Z","title":"NBBOX: Noisy Bounding Box Improves Remote Sensing Object Detection","summary":" Data augmentation has shown significant advancements in computer vision to\nimprove model performance over the years, particularly in scenarios with\nlimited and insufficient data. 
Currently, most studies focus on adjusting the\nimage or its features to expand the size, quality, and variety of samples\nduring training in various tasks including object detection. However, we argue\nthat it is necessary to investigate bounding box transformations as a data\naugmentation technique rather than image-level transformations, especially in\naerial imagery due to potentially inconsistent bounding box annotations. Hence,\nthis letter presents a thorough investigation of bounding box transformation in\nterms of scaling, rotation, and translation for remote sensing object\ndetection. We call this augmentation strategy NBBOX (Noise Injection into\nBounding Box). We conduct extensive experiments on DOTA and DIOR-R, both\nwell-known datasets that include a variety of rotated generic objects in aerial\nimages. Experimental results show that our approach significantly improves\nremote sensing object detection without whistles and bells and it is more\ntime-efficient than other state-of-the-art augmentation strategies.\n","authors":["Yechan Kim","SooYeon Kim","Moongu Jeon"],"pdf_url":"https://arxiv.org/pdf/2409.09424v3.pdf","comment":"Accepted to IEEE Geoscience and Remote Sensing Letters"},{"id":"http://arxiv.org/abs/2501.03711v1","updated":"2025-01-07T11:32:13Z","published":"2025-01-07T11:32:13Z","title":"Unsupervised Speech Segmentation: A General Approach Using Speech\n Language Models","summary":" In this paper, we introduce an unsupervised approach for Speech Segmentation,\nwhich builds on previously researched approaches, e.g., Speaker Diarization,\nwhile being applicable to an inclusive set of acoustic-semantic distinctions,\npaving a path towards a general Unsupervised Speech Segmentation approach.\nUnlike traditional speech and audio segmentation, which mainly focuses on\nspectral changes in the input signal, e.g., phone segmentation, our approach\ntries to segment the spoken utterance into chunks with differing\nacoustic-semantic styles, focusing on 
acoustic-semantic information that does\nnot translate well into text, e.g., emotion or speaker. While most Speech\nSegmentation tasks only handle one style change, e.g., emotion diarization, our\napproach tries to handle multiple acoustic-semantic style changes. Leveraging\nrecent advances in Speech Language Models (SLMs), we propose a simple\nunsupervised method to segment a given speech utterance. We empirically\ndemonstrate the effectiveness of the proposed approach by considering several\nsetups. Results suggest that the proposed method is superior to the evaluated\nbaselines on boundary detection, segment purity, and over-segmentation. Code is\navailable at\nhttps://github.com/avishaiElmakies/unsupervised_speech_segmentation_using_slm.\n","authors":["Avishai Elmakies","Omri Abend","Yossi Adi"],"pdf_url":"https://arxiv.org/pdf/2501.03711v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.10573v2","updated":"2025-01-07T11:13:06Z","published":"2024-06-15T09:23:46Z","title":"Graph Neural Backdoor: Fundamentals, Methodologies, Applications, and\n Future Directions","summary":" Graph Neural Networks (GNNs) have significantly advanced various downstream\ngraph-relevant tasks, encompassing recommender systems, molecular structure\nprediction, social media analysis, etc. Despite the boosts of GNN, recent\nresearch has empirically demonstrated its potential vulnerability to backdoor\nattacks, wherein adversaries employ triggers to poison input samples, inducing\nGNN to adversary-premeditated malicious outputs. This is typically due to the\ncontrolled training process, or the deployment of untrusted models, such as\ndelegating model training to third-party service, leveraging external training\nsets, and employing pre-trained models from online sources. Although there's an\nongoing increase in research on GNN backdoors, comprehensive investigation into\nthis field is lacking. To bridge this gap, we propose the first survey\ndedicated to GNN backdoors. 
We begin by outlining the fundamental definition of\nGNN, followed by the detailed summarization and categorization of current GNN\nbackdoor attacks and defenses based on their technical characteristics and\napplication scenarios. Subsequently, the analysis of the applicability and use\ncases of GNN backdoors is undertaken. Finally, the exploration of potential\nresearch directions of GNN backdoors is presented. This survey aims to explore\nthe principles of graph backdoors, provide insights to defenders, and promote\nfuture security research.\n","authors":["Xiao Yang","Gaolei Li","Jianhua Li"],"pdf_url":"https://arxiv.org/pdf/2406.10573v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.11543v3","updated":"2025-01-07T11:09:52Z","published":"2024-11-18T13:01:57Z","title":"PSA-VLM: Enhancing Vision-Language Model Safety through Progressive\n Concept-Bottleneck-Driven Alignment","summary":" Benefiting from the powerful capabilities of Large Language Models (LLMs),\npre-trained visual encoder models connected to LLMs form Vision Language Models\n(VLMs). However, recent research shows that the visual modality in VLMs is\nhighly vulnerable, allowing attackers to bypass safety alignment in LLMs\nthrough visually transmitted content, launching harmful attacks. To address\nthis challenge, we propose a progressive concept-based alignment strategy,\nPSA-VLM, which incorporates safety modules as concept bottlenecks to enhance\nvisual modality safety alignment. By aligning model predictions with specific\nsafety concepts, we improve defenses against risky images, enhancing\nexplainability and controllability while minimally impacting general\nperformance. Our method is obtained through two-stage training. The low\ncomputational cost of the first stage brings very effective performance\nimprovement, and the fine-tuning of the language model in the second stage\nfurther improves the safety performance. 
Our method achieves state-of-the-art\nresults on popular VLM safety benchmark.\n","authors":["Zhendong Liu","Yuanbi Nie","Yingshui Tan","Jiaheng Liu","Xiangyu Yue","Qiushi Cui","Chongjun Wang","Xiaoyong Zhu","Bo Zheng"],"pdf_url":"https://arxiv.org/pdf/2411.11543v3.pdf","comment":"arXiv admin note: substantial text overlap with arXiv:2405.13581"},{"id":"http://arxiv.org/abs/2501.03700v1","updated":"2025-01-07T11:07:32Z","published":"2025-01-07T11:07:32Z","title":"AuxDepthNet: Real-Time Monocular 3D Object Detection with\n Depth-Sensitive Features","summary":" Monocular 3D object detection is a challenging task in autonomous systems due\nto the lack of explicit depth information in single-view images. Existing\nmethods often depend on external depth estimators or expensive sensors, which\nincrease computational complexity and hinder real-time performance. To overcome\nthese limitations, we propose AuxDepthNet, an efficient framework for real-time\nmonocular 3D object detection that eliminates the reliance on external depth\nmaps or pre-trained depth models. AuxDepthNet introduces two key components:\nthe Auxiliary Depth Feature (ADF) module, which implicitly learns\ndepth-sensitive features to improve spatial reasoning and computational\nefficiency, and the Depth Position Mapping (DPM) module, which embeds depth\npositional information directly into the detection process to enable accurate\nobject localization and 3D bounding box regression. Leveraging the DepthFusion\nTransformer architecture, AuxDepthNet globally integrates visual and\ndepth-sensitive features through depth-guided interactions, ensuring robust and\nefficient detection. 
Extensive experiments on the KITTI dataset show that\nAuxDepthNet achieves state-of-the-art performance, with $\\text{AP}_{3D}$ scores\nof 24.72\\% (Easy), 18.63\\% (Moderate), and 15.31\\% (Hard), and\n$\\text{AP}_{\\text{BEV}}$ scores of 34.11\\% (Easy), 25.18\\% (Moderate), and\n21.90\\% (Hard) at an IoU threshold of 0.7.\n","authors":["Ruochen Zhang","Hyeung-Sik Choi","Dongwook Jung","Phan Huy Nam Anh","Sang-Ki Jeong","Zihao Zhu"],"pdf_url":"https://arxiv.org/pdf/2501.03700v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03696v1","updated":"2025-01-07T10:54:44Z","published":"2025-01-07T10:54:44Z","title":"Exploring Molecule Generation Using Latent Space Graph Diffusion","summary":" Generating molecular graphs is a challenging task due to their discrete\nnature and the competitive objectives involved. Diffusion models have emerged\nas SOTA approaches in data generation across various modalities. For molecular\ngraphs, graph neural networks (GNNs) as a diffusion backbone have achieved\nimpressive results. Latent space diffusion, where diffusion occurs in a\nlow-dimensional space via an autoencoder, has demonstrated computational\nefficiency. However, the literature on latent space diffusion for molecular\ngraphs is scarce, and no commonly accepted best practices exist. In this work,\nwe explore different approaches and hyperparameters, contrasting generative\nflow models (denoising diffusion, flow matching, heat dissipation) and\narchitectures (GNNs and E(3)-equivariant GNNs). Our experiments reveal a high\nsensitivity to the choice of approach and design decisions. 
Code is made\navailable at\ngithub.com/Prashanth-Pombala/Molecule-Generation-using-Latent-Space-Graph-Diffusion.\n","authors":["Prashanth Pombala","Gerrit Grossmann","Verena Wolf"],"pdf_url":"https://arxiv.org/pdf/2501.03696v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.06096v4","updated":"2025-01-07T10:45:58Z","published":"2024-09-09T22:16:48Z","title":"Latent Diffusion Bridges for Unsupervised Musical Audio Timbre Transfer","summary":" Music timbre transfer is a challenging task that involves modifying the\ntimbral characteristics of an audio signal while preserving its melodic\nstructure. In this paper, we propose a novel method based on dual diffusion\nbridges, trained using the CocoChorales Dataset, which consists of unpaired\nmonophonic single-instrument audio data. Each diffusion model is trained on a\nspecific instrument with a Gaussian prior. During inference, a model is\ndesignated as the source model to map the input audio to its corresponding\nGaussian prior, and another model is designated as the target model to\nreconstruct the target audio from this Gaussian prior, thereby facilitating\ntimbre transfer. We compare our approach against existing unsupervised timbre\ntransfer models such as VAEGAN and Gaussian Flow Bridges (GFB). Experimental\nresults demonstrate that our method achieves both better Fr\\'echet Audio\nDistance (FAD) and melody preservation, as reflected by lower pitch distances\n(DPD) compared to VAEGAN and GFB. Additionally, we discover that the noise\nlevel from the Gaussian prior, $\\sigma$, can be adjusted to control the degree\nof melody preservation and amount of timbre transferred.\n","authors":["Michele Mancusi","Yurii Halychanskyi","Kin Wai Cheuk","Eloi Moliner","Chieh-Hsin Lai","Stefan Uhlich","Junghyun Koo","Marco A. 
Martínez-Ramírez","Wei-Hsiang Liao","Giorgio Fabbro","Yuki Mitsufuji"],"pdf_url":"https://arxiv.org/pdf/2409.06096v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03689v1","updated":"2025-01-07T10:38:51Z","published":"2025-01-07T10:38:51Z","title":"MAJL: A Model-Agnostic Joint Learning Framework for Music Source\n Separation and Pitch Estimation","summary":" Music source separation and pitch estimation are two vital tasks in music\ninformation retrieval. Typically, the input of pitch estimation is obtained\nfrom the output of music source separation. Therefore, existing methods have\ntried to perform these two tasks simultaneously, so as to leverage the mutually\nbeneficial relationship between both tasks. However, these methods still face\ntwo critical challenges that limit the improvement of both tasks: the lack of\nlabeled data and joint learning optimization. To address these challenges, we\npropose a Model-Agnostic Joint Learning (MAJL) framework for both tasks. MAJL\nis a generic framework and can use variant models for each task. It includes a\ntwo-stage training method and a dynamic weighting method named Dynamic Weights\non Hard Samples (DWHS), which addresses the lack of labeled data and joint\nlearning optimization, respectively. Experimental results on public music\ndatasets show that MAJL outperforms state-of-the-art methods on both tasks,\nwith significant improvements of 0.92 in Signal-to-Distortion Ratio (SDR) for\nmusic source separation and 2.71% in Raw Pitch Accuracy (RPA) for pitch\nestimation. 
Furthermore, comprehensive studies not only validate the\neffectiveness of each component of MAJL, but also indicate the great generality\nof MAJL in adapting to different model architectures.\n","authors":["Haojie Wei","Jun Yuan","Rui Zhang","Quanyu Dai","Yueguo Chen"],"pdf_url":"https://arxiv.org/pdf/2501.03689v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03681v1","updated":"2025-01-07T10:29:43Z","published":"2025-01-07T10:29:43Z","title":"SLAM: Towards Efficient Multilingual Reasoning via Selective Language\n Alignment","summary":" Despite the significant improvements achieved by large language models (LLMs)\nin English reasoning tasks, these models continue to struggle with multilingual\nreasoning. Recent studies leverage a full-parameter and two-stage training\nparadigm to teach models to first understand non-English questions and then\nreason. However, this method suffers from both substantial computational\nresource consumption and catastrophic forgetting. The fundamental cause is that,\nwith the primary goal of enhancing multilingual comprehension, an excessive\nnumber of irrelevant layers and parameters are tuned during the first stage.\nGiven our findings that the representation learning of languages is merely\nconducted in lower-level layers, we propose an efficient multilingual reasoning\nalignment approach that precisely identifies and fine-tunes the layers\nresponsible for handling multilingualism. Experimental results show that our\nmethod, SLAM, only tunes 6 layers' feed-forward sub-layers including 6.5-8% of\nall parameters within 7B and 13B LLMs, achieving better average performance\nthan all strong baselines across 10 languages. 
Meanwhile, SLAM only involves\none training stage, reducing training time by 4.1-11.9 compared to the\ntwo-stage method.\n","authors":["Yuchun Fan","Yongyu Mu","Yilin Wang","Lei Huang","Junhao Ruan","Bei Li","Tong Xiao","Shujian Huang","Xiaocheng Feng","Jingbo Zhu"],"pdf_url":"https://arxiv.org/pdf/2501.03681v1.pdf","comment":"Accepted by COLING 2025 (Oral)"},{"id":"http://arxiv.org/abs/2501.03676v1","updated":"2025-01-07T10:22:30Z","published":"2025-01-07T10:22:30Z","title":"SALE-Based Offline Reinforcement Learning with Ensemble Q-Networks","summary":" In this work, we build upon the offline reinforcement learning algorithm TD7,\nwhich incorporates State-Action Learned Embeddings (SALE) and LAP, and propose\na model-free actor-critic algorithm that integrates ensemble Q-networks and a\ngradient diversity penalty from EDAC. The ensemble Q-networks effectively\naddress the challenge of out-of-distribution actions by introducing penalties\nthat guide the actor network to focus on in-distribution actions. Meanwhile,\nthe gradient diversity penalty encourages diverse Q-value gradients, further\nsuppressing overestimation for out-of-distribution actions. Additionally, our\nmethod retains an adjustable behavior cloning (BC) term that directs the actor\nnetwork toward dataset actions during early training stages, while gradually\nreducing its influence as the precision of the Q-ensemble improves. 
These\nenhancements work synergistically to improve training stability and accuracy.\nExperimental results on the D4RL MuJoCo benchmarks demonstrate that our\nalgorithm achieves superior convergence speed, stability, and performance\ncompared to existing methods.\n","authors":["Zheng Chun"],"pdf_url":"https://arxiv.org/pdf/2501.03676v1.pdf","comment":"10 pages, 2 figures, 4 tables"},{"id":"http://arxiv.org/abs/2501.03674v1","updated":"2025-01-07T10:20:16Z","published":"2025-01-07T10:20:16Z","title":"Action Quality Assessment via Hierarchical Pose-guided Multi-stage\n Contrastive Regression","summary":" Action Quality Assessment (AQA), which aims at automatic and fair evaluation\nof athletic performance, has gained increasing attention in recent years.\nHowever, athletes are often in rapid movement and the corresponding visual\nappearance variances are subtle, making it challenging to capture fine-grained\npose differences and leading to poor estimation performance. Furthermore, most\ncommon AQA tasks, such as diving in sports, are usually divided into multiple\nsub-actions, each of which contains different durations. However, existing\nmethods focus on segmenting the video into fixed frames, which disrupts the\ntemporal continuity of sub-actions resulting in unavoidable prediction errors.\nTo address these challenges, we propose a novel action quality assessment\nmethod through hierarchically pose-guided multi-stage contrastive regression.\nFirstly, we introduce a multi-scale dynamic visual-skeleton encoder to capture\nfine-grained spatio-temporal visual and skeletal features. Then, a procedure\nsegmentation network is introduced to separate different sub-actions and obtain\nsegmented features. 
Afterwards, the segmented visual and skeletal features are\nboth fed into a multi-modal fusion module as physics structural priors, to\nguide the model in learning refined activity similarities and variances.\nFinally, a multi-stage contrastive learning regression approach is employed to\nlearn discriminative representations and output prediction results. In\naddition, we introduce a newly-annotated FineDiving-Pose Dataset to improve the\ncurrent low-quality human pose labels. In experiments, the results on\nFineDiving and MTL-AQA datasets demonstrate the effectiveness and superiority\nof our proposed approach. Our source code and dataset are available at\nhttps://github.com/Lumos0507/HP-MCoRe.\n","authors":["Mengshi Qi","Hao Ye","Jiaxuan Peng","Huadong Ma"],"pdf_url":"https://arxiv.org/pdf/2501.03674v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03670v1","updated":"2025-01-07T10:18:22Z","published":"2025-01-07T10:18:22Z","title":"A Diversity-Enhanced Knowledge Distillation Model for Practical Math\n Word Problem Solving","summary":" Math Word Problem (MWP) solving is a critical task in natural language\nprocessing and has garnered significant research interest in recent years. Various\nrecent studies heavily rely on Seq2Seq models and their extensions (e.g.,\nSeq2Tree and Graph2Tree) to generate mathematical equations. While effective,\nthese models struggle to generate diverse but counterpart solution equations,\nlimiting their generalization across various math problem scenarios. In this\npaper, we introduce a novel Diversity-enhanced Knowledge Distillation (DivKD)\nmodel for practical MWP solving. 
Our approach proposes an adaptive diversity\ndistillation method, in which a student model learns diverse equations by\nselectively transferring high-quality knowledge from a teacher model.\nAdditionally, we design a diversity prior-enhanced student model to better\ncapture the diversity distribution of equations by incorporating a conditional\nvariational auto-encoder. Extensive experiments on four MWP benchmark\ndatasets demonstrate that our approach achieves higher answer accuracy than\nstrong baselines while maintaining high efficiency for practical applications.\n","authors":["Yi Zhang","Guangyou Zhou","Zhiwen Xie","Jinjin Ma","Jimmy Xiangji Huang"],"pdf_url":"https://arxiv.org/pdf/2501.03670v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02801v3","updated":"2025-01-07T10:09:18Z","published":"2024-12-03T20:04:32Z","title":"Optimization of Transformer heart disease prediction model based on\n particle swarm optimization algorithm","summary":" Aiming at the latest particle swarm optimization algorithm, this paper\nproposes an improved Transformer model to improve the accuracy of heart disease\nprediction and provide a new algorithm idea. We first use three mainstream\nmachine learning classification algorithms - decision tree, random forest and\nXGBoost, and then output the confusion matrix of these three models. The\nresults showed that the random forest model had the best performance in\npredicting the classification of heart disease, with an accuracy of 92.2%.\nThen, we apply the Transformer model based on the particle swarm optimization (PSO)\nalgorithm to the same dataset for a classification experiment. The results show\nthat the classification accuracy of the model is as high as 96.5%, 4.3\npercentage points higher than that of random forest, which verifies the\neffectiveness of PSO in optimizing the Transformer model. 
From the above research,\nwe can see that particle swarm optimization significantly improves Transformer\nperformance in heart disease prediction. Improving the ability to predict heart\ndisease is a global priority with benefits for all humankind. Accurate\nprediction can enhance public health, optimize medical resources, and reduce\nhealthcare costs, leading to healthier populations and more productive\nsocieties worldwide. This advancement paves the way for more efficient health\nmanagement and supports the foundation of a healthier, more resilient global\ncommunity.\n","authors":["Jingyuan Yi","Peiyang Yu","Tianyi Huang","Zeqiu Xu"],"pdf_url":"https://arxiv.org/pdf/2412.02801v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.01999v2","updated":"2025-01-07T10:04:51Z","published":"2024-08-04T11:55:24Z","title":"Reinforcement Learning for an Efficient and Effective Malware\n Investigation during Cyber Incident Response","summary":" This research focused on enhancing post-incident malware forensic\ninvestigation using reinforcement learning (RL). We proposed an advanced MDP-based\npost-incident malware forensics investigation model and framework to expedite\npost-incident forensics. We then implemented our RL Malware Investigation Model based\non a structured MDP within the proposed framework. To identify malware artefacts,\nthe RL agent acquires and examines forensics evidence files, iteratively\nimproving its capabilities using a Q-table and temporal-difference learning. The\nQ-learning algorithm significantly improved the agent's ability to identify\nmalware. An epsilon-greedy exploration strategy and Q-learning updates enabled\nefficient learning and decision making. Our experimental testing revealed that\noptimal learning rates depend on the MDP environment complexity, with simpler\nenvironments benefiting from higher rates for quicker convergence and complex\nones requiring lower rates for stability. 
Our model's performance in identifying\nand classifying malware reduced malware analysis time compared to human\nexperts, demonstrating robustness and adaptability. The study highlighted the\nsignificance of hyperparameter tuning and suggested adaptive strategies for\ncomplex environments. Our RL-based approach produced promising results and is\nvalidated as an alternative to traditional methods, notably by offering\ncontinuous learning and adaptation to new and evolving malware threats, which\nultimately enhances post-incident forensics investigations.\n","authors":["Dipo Dunsin","Mohamed Chahine Ghanem","Karim Ouazzane","Vassil Vassilev"],"pdf_url":"https://arxiv.org/pdf/2408.01999v2.pdf","comment":"21 pages"},{"id":"http://arxiv.org/abs/2501.02832v2","updated":"2025-01-07T10:01:19Z","published":"2025-01-06T08:16:06Z","title":"Samba-ASR: State-Of-The-Art Speech Recognition Leveraging Structured\n State-Space Models","summary":" We propose Samba ASR, the first state-of-the-art Automatic Speech\nRecognition (ASR) model leveraging the novel Mamba architecture as both encoder\nand decoder, built on the foundation of state-space models (SSMs). Unlike\ntransformer-based ASR models, which rely on self-attention mechanisms to capture\ndependencies, Samba ASR effectively models both local and global temporal\ndependencies using efficient state-space dynamics, achieving remarkable\nperformance gains. By addressing the limitations of transformers, such as\nquadratic scaling with input length and difficulty in handling long-range\ndependencies, Samba ASR achieves superior accuracy and efficiency. Experimental\nresults demonstrate that Samba ASR surpasses existing open-source\ntransformer-based ASR models across various standard benchmarks, establishing it\nas the new state of the art in ASR. Extensive evaluations on the benchmark\ndataset show significant improvements in Word Error Rate (WER), with competitive\nperformance even in low-resource scenarios. Furthermore, the inherent computational 
efficiency and parameter optimization of the Mamba architecture\nmake Samba ASR a scalable and robust solution for diverse ASR tasks. Our\ncontributions include the development of a new Samba ASR architecture for\nautomatic speech recognition (ASR), demonstrating the superiority of structured\nstate-space models (SSMs) over transformer-based models for speech sequence\nprocessing. We provide a comprehensive evaluation on public\nbenchmarks, showcasing state-of-the-art (SOTA) performance, and present an in-depth\nanalysis of computational efficiency, robustness to noise, and sequence\ngeneralization. This work highlights the viability of Mamba SSMs as a\ntransformer-free alternative for efficient and accurate ASR. By leveraging the\nadvancements of state-space modeling, Samba ASR redefines ASR performance\nstandards and sets a new benchmark for future research in this field.\n","authors":["Syed Abdul Gaffar Shakhadri","Kruthika KR","Kartik Basavaraj Angadi"],"pdf_url":"https://arxiv.org/pdf/2501.02832v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.14887v3","updated":"2025-01-07T09:55:57Z","published":"2024-09-23T10:35:57Z","title":"Deploying Open-Source Large Language Models: A performance Analysis","summary":" Since the release of ChatGPT in November 2022, large language models (LLMs)\nhave seen considerable success, including in the open-source community, with\nmany open-weight models available. However, the requirements to deploy such a\nservice are often unknown and difficult to evaluate in advance. To facilitate\nthis process, we conducted numerous tests at the Centre Inria de l'Universit\\'e\nde Bordeaux. In this article, we propose a comparison of the performance of\nseveral models of different sizes (mainly Mistral and LLaMa) depending on the\navailable GPUs, using vLLM, a Python library designed to optimize the inference\nof these models. 
Our results provide valuable information for private and\npublic groups wishing to deploy LLMs, allowing them to evaluate the performance\nof different models based on their available hardware. This study thus\ncontributes to facilitating the adoption and use of these large language models\nin various application domains.\n","authors":["Yannis Bendi-Ouis","Dan Dutartre","Xavier Hinaut"],"pdf_url":"https://arxiv.org/pdf/2409.14887v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.16370v2","updated":"2025-01-07T09:34:51Z","published":"2024-11-25T13:26:09Z","title":"A Review of Bayesian Uncertainty Quantification in Deep Probabilistic\n Image Segmentation","summary":" Advancements in image segmentation play an integral role within the broad\nscope of Deep Learning-based Computer Vision. Furthermore, their widespread\napplicability in critical real-world tasks has resulted in challenges related\nto the reliability of such algorithms. Hence, uncertainty quantification has\nbeen extensively studied within this context, enabling the expression of model\nignorance (epistemic uncertainty) or data ambiguity (aleatoric uncertainty) to\nprevent uninformed decision-making. Due to the rapid adoption of Convolutional\nNeural Network (CNN)-based segmentation models in high-stakes applications, a\nsubstantial body of research has been published on this very topic, causing its\nswift expansion into a distinct field. This work provides a comprehensive\noverview of probabilistic segmentation by discussing fundamental concepts of\nuncertainty quantification, governing advancements in the field, as well as the\napplication to various tasks. Moreover, literature on both types of\nuncertainties traces back to four key applications: (1) quantifying statistical\ninconsistencies in the annotation process due to ambiguous images, (2) correlating\nprediction error with uncertainty, (3) expanding the model hypothesis space for\nbetter generalization, and (4) Active Learning.
An extensive discussion follows\nthat includes an overview of utilized datasets for each of the applications and\nan evaluation of the available methods. We also highlight challenges related to\narchitectures, uncertainty quantification methods, standardization and\nbenchmarking, and finally end with recommendations for future work, such as\nmethods based on single forward passes and models that appropriately leverage\nvolumetric data.\n","authors":["M. M. A. Valiuddin","R. J. G. van Sloun","C. G. A. Viviers","P. H. N. de With","F. van der Sommen"],"pdf_url":"https://arxiv.org/pdf/2411.16370v2.pdf","comment":"20 pages, revised"},{"id":"http://arxiv.org/abs/2501.00320v2","updated":"2025-01-07T09:25:32Z","published":"2024-12-31T07:31:46Z","title":"Autonomous Alignment with Human Value on Altruism through Considerate\n Self-imagination and Theory of Mind","summary":" With the widespread application of Artificial Intelligence (AI) in human\nsociety, enabling AI to autonomously align with human values has become a\npressing issue to ensure its sustainable development and benefit to humanity.\nOne of the most important aspects of aligning with human values is the\nnecessity for agents to autonomously make altruistic, safe, and ethical\ndecisions, considering and caring for human well-being. Current AI single-mindedly\npursues absolute superiority in certain tasks, remaining indifferent to the\nsurrounding environment and other agents, which has led to numerous safety\nrisks. Altruistic behavior in human society originates from humans' capacity\nfor empathizing with others, known as Theory of Mind (ToM), combined with predictive\nimaginative interactions before taking action to produce thoughtful and\naltruistic behaviors. Inspired by this, we are committed to endowing agents with\nconsiderate self-imagination and ToM capabilities, driving them through\nimplicit intrinsic motivations to autonomously align with human altruistic\nvalues.
By integrating ToM within the imaginative space, agents keep an eye on\nthe well-being of other agents in real time, proactively anticipate potential\nrisks to themselves and others, and make thoughtful altruistic decisions that\ntake into account negative effects on the environment. The ancient Chinese story of Sima\nGuang Smashes the Vat, in which the young Sima Guang smashed a vat to save a child\nwho had accidentally fallen into it, illustrates such moral behavior and is an\nexcellent reference scenario for this paper. We design an experimental scenario\nsimilar to Sima Guang Smashes the Vat and its variants with different\ncomplexities, which reflects the trade-offs and comprehensive considerations\nbetween self-goals, altruistic rescue, and avoiding negative side effects.\n","authors":["Haibo Tong","Enmeng Lu","Yinqian Sun","Zhengqiang Han","Chao Liu","Feifei Zhao","Yi Zeng"],"pdf_url":"https://arxiv.org/pdf/2501.00320v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03643v1","updated":"2025-01-07T09:21:52Z","published":"2025-01-07T09:21:52Z","title":"Effective and Efficient Mixed Precision Quantization of Speech\n Foundation Models","summary":" This paper presents a novel mixed-precision quantization approach for speech\nfoundation models that tightly integrates mixed-precision learning and\nquantized model parameter estimation into one single model compression stage.\nExperiments conducted on the LibriSpeech dataset with fine-tuned wav2vec2.0-base\nand HuBERT-large models suggest the resulting mixed-precision quantized models\nincreased the lossless compression ratio by factors up to 1.7x and 1.9x over\nthe respective uniform-precision and two-stage mixed-precision quantized\nbaselines that perform precision learning and model parameter quantization in\nseparate and disjointed stages, while incurring no statistically significant word error\nrate (WER) increase over the 32-bit full-precision models.
The system\ncompression time of wav2vec2.0-base and HuBERT-large models is reduced by up to\n1.9 and 1.5 times over the two-stage mixed-precision baselines, while both\nproduce lower WERs. The best-performing 3.5-bit mixed-precision quantized\nHuBERT-large model produces a lossless compression ratio of 8.6x over the\n32-bit full-precision system.\n","authors":["Haoning Xu","Zhaoqing Li","Zengrui Jin","Huimeng Wang","Youjun Chen","Guinan Li","Mengzhe Geng","Shujie Hu","Jiajun Deng","Xunying Liu"],"pdf_url":"https://arxiv.org/pdf/2501.03643v1.pdf","comment":"To appear at IEEE ICASSP 2025"},{"id":"http://arxiv.org/abs/2501.03635v1","updated":"2025-01-07T09:10:09Z","published":"2025-01-07T09:10:09Z","title":"MHGNet: Multi-Heterogeneous Graph Neural Network for Traffic Prediction","summary":" In recent years, traffic flow prediction has played a crucial role in the\nmanagement of intelligent transportation systems. However, traditional\nforecasting methods often model non-Euclidean low-dimensional traffic data as a\nsimple graph with single-type nodes and edges, failing to capture similar\ntrends among nodes of the same type. To address this limitation, this paper\nproposes MHGNet, a novel framework for modeling spatiotemporal\nmulti-heterogeneous graphs. Within this framework, the STD Module decouples\nsingle-pattern traffic data into multi-pattern traffic data through feature\nmappings of timestamp embedding matrices and node embedding matrices.\nSubsequently, the Node Clusterer leverages the Euclidean distance between nodes\nand different types of limit points to perform clustering with O(N) time\ncomplexity. The nodes within each cluster undergo residual subgraph convolution\nwithin the spatiotemporal fusion subgraphs generated by the DSTGG Module,\nfollowed by processing in the SIE Module for node repositioning and\nredistribution of weights. 
To validate the effectiveness of MHGNet, this paper\nconducts extensive ablation studies and quantitative evaluations on four widely\nused benchmarks, demonstrating its superior performance.\n","authors":["Mei Wu","Yiqian Lin","Tianfan Jiang","Wenchao Weng"],"pdf_url":"https://arxiv.org/pdf/2501.03635v1.pdf","comment":"Accepted by the 2025 IEEE International Conference on Acoustics, Speech,\n and Signal Processing (ICASSP 2025)"},{"id":"http://arxiv.org/abs/2309.04195v2","updated":"2025-01-07T08:46:02Z","published":"2023-09-08T08:12:29Z","title":"Towards Mitigating Architecture Overfitting on Distilled Datasets","summary":" Dataset distillation methods have demonstrated remarkable performance for\nneural networks trained with very limited training data. However, a significant\nchallenge arises in the form of \\textit{architecture overfitting}: the\ndistilled training dataset synthesized by a specific network architecture\n(i.e., training network) generates poor performance when trained by other\nnetwork architectures (i.e., test networks), especially when the test networks\nhave a larger capacity than the training network. This paper introduces a\nseries of approaches to mitigate this issue. Among them, DropPath renders the\nlarge model an implicit ensemble of its sub-networks, and knowledge\ndistillation ensures each sub-network acts similarly to the small but\nwell-performing teacher network. These methods, characterized by their\nsmoothing effects, significantly mitigate architecture overfitting. We conduct\nextensive experiments to demonstrate the effectiveness and generality of our\nmethods. Particularly, across various scenarios involving different tasks and\ndifferent sizes of distilled data, our approaches significantly mitigate\narchitecture overfitting.
Furthermore, our approaches achieve comparable or\neven superior performance when the test network is larger than the training\nnetwork.\n","authors":["Xuyang Zhong","Chen Liu"],"pdf_url":"https://arxiv.org/pdf/2309.04195v2.pdf","comment":"Accepted by TNNLS"},{"id":"http://arxiv.org/abs/2501.02981v2","updated":"2025-01-07T08:39:10Z","published":"2025-01-06T12:43:59Z","title":"CONTINUUM: Detecting APT Attacks through Spatial-Temporal Graph Neural\n Networks","summary":" Advanced Persistent Threats (APTs) represent a significant challenge in\ncybersecurity due to their sophisticated and stealthy nature. Traditional\nIntrusion Detection Systems (IDS) often fall short in detecting these\nmulti-stage attacks. Recently, Graph Neural Networks (GNNs) have been employed\nto enhance IDS capabilities by analyzing the complex relationships within\nnetworked data. However, existing GNN-based solutions are hampered by high\nfalse positive rates and substantial resource consumption. In this paper, we\npresent a novel IDS designed to detect APTs using a Spatio-Temporal Graph\nNeural Network Autoencoder. Our approach leverages spatial information to\nunderstand the interactions between entities within a graph and temporal\ninformation to capture the evolution of the graph over time. This dual\nperspective is crucial for identifying the sequential stages of APTs.\nFurthermore, to address privacy and scalability concerns, we deploy our\narchitecture in a federated learning environment. This setup ensures that local\ndata remains on-premise while encrypted model-weights are shared and aggregated\nusing homomorphic encryption, maintaining data privacy and security. 
Our\nevaluation shows that this system effectively detects APTs with lower false\npositive rates and optimized resource usage compared to existing methods,\nhighlighting the potential of spatio-temporal analysis and federated learning\nin enhancing cybersecurity defenses.\n","authors":["Atmane Ayoub Mansour Bahar","Kamel Soaid Ferrahi","Mohamed-Lamine Messai","Hamida Seba","Karima Amrouche"],"pdf_url":"https://arxiv.org/pdf/2501.02981v2.pdf","comment":"31 pages"},{"id":"http://arxiv.org/abs/2412.04783v2","updated":"2025-01-07T08:23:43Z","published":"2024-12-06T05:20:08Z","title":"KNN-MMD: Cross Domain Wireless Sensing via Local Distribution Alignment","summary":" Wireless sensing has recently found widespread applications in diverse\nenvironments, including homes, offices, and public spaces. By analyzing\npatterns in channel state information (CSI), it is possible to infer human\nactions for tasks such as person identification, gesture recognition, and fall\ndetection. However, CSI is highly sensitive to environmental changes, where\neven minor alterations can significantly distort the CSI patterns. This\nsensitivity often leads to performance degradation or outright failure when\napplying wireless sensing models trained in one environment to another. To\naddress this challenge, Domain Alignment (DAL) has been widely adopted for\ncross-domain classification tasks, as it focuses on aligning the global\ndistributions of the source and target domains in feature space. Despite its\npopularity, DAL often neglects inter-category relationships, which can lead to\nmisalignment between categories across domains, even when global alignment is\nachieved. To overcome these limitations, we propose K-Nearest Neighbors Maximum\nMean Discrepancy (KNN-MMD), a novel few-shot method for cross-domain wireless\nsensing. 
Our approach begins by constructing a help set using KNN from the\ntarget domain, enabling local alignment between the source and target domains\nwithin each category using MMD. Additionally, we address a key instability\nissue commonly observed in cross-domain methods, where model performance\nfluctuates sharply between epochs. Further, most existing methods struggle to\ndetermine an optimal stopping point during training due to the absence of\nlabeled data from the target domain. Our method resolves this by excluding the\nsupport set from the target domain during training and employing it as a\nvalidation set to determine the stopping criterion.\n","authors":["Zijian Zhao","Zhijie Cai","Tingwei Chen","Xiaoyang Li","Hang Li","Qimei Chen","Guangxu Zhu"],"pdf_url":"https://arxiv.org/pdf/2412.04783v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03598v1","updated":"2025-01-07T07:55:35Z","published":"2025-01-07T07:55:35Z","title":"RecKG: Knowledge Graph for Recommender Systems","summary":" Knowledge graphs have proven successful in integrating heterogeneous data\nacross various domains. However, there remains a noticeable dearth of research\non their seamless integration among heterogeneous recommender systems, despite\nknowledge graph-based recommender systems garnering extensive research\nattention. This study aims to fill this gap by proposing RecKG, a standardized\nknowledge graph for recommender systems. RecKG ensures the consistent\nrepresentation of entities across different datasets, accommodating diverse\nattribute types for effective data integration. Through a meticulous\nexamination of various recommender system datasets, we select attributes for\nRecKG, ensuring standardized formatting through consistent naming conventions.\nBy these characteristics, RecKG can seamlessly integrate heterogeneous data\nsources, enabling the discovery of additional semantic information within the\nintegrated knowledge graph. 
We apply RecKG to standardize real-world datasets,\nsubsequently developing an application for RecKG using a graph database.\nFinally, we validate RecKG's achievement in interoperability through a\nqualitative evaluation between RecKG and other studies.\n","authors":["Junhyuk Kwon","Seokho Ahn","Young-Duk Seo"],"pdf_url":"https://arxiv.org/pdf/2501.03598v1.pdf","comment":"Accepted by The 39th ACM/SIGAPP Symposium On Applied Computing(SAC)\n 2024"},{"id":"http://arxiv.org/abs/2411.03814v2","updated":"2025-01-07T07:46:16Z","published":"2024-11-06T10:32:09Z","title":"MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue","summary":" Large Language Models (LLMs) demonstrate outstanding performance in their\nreservoir of knowledge and understanding capabilities, but they have also been\nshown to be prone to illegal or unethical reactions when subjected to jailbreak\nattacks. To ensure their responsible deployment in critical applications, it is\ncrucial to understand the safety capabilities and vulnerabilities of LLMs.\nPrevious works mainly focus on jailbreak in single-round dialogue, overlooking\nthe potential jailbreak risks in multi-round dialogues, which are a vital way\nhumans interact with and extract information from LLMs. Some studies have\nincreasingly concentrated on the risks associated with jailbreak in multi-round\ndialogues. These efforts typically involve the use of manually crafted\ntemplates or prompt engineering techniques. However, due to the inherent\ncomplexity of multi-round dialogues, their jailbreak performance is limited. To\nsolve this problem, we propose a novel multi-round dialogue jailbreaking agent,\nemphasizing the importance of stealthiness in identifying and mitigating\npotential threats to human values posed by LLMs. We propose a risk\ndecomposition strategy that distributes risks across multiple rounds of queries\nand utilizes psychological strategies to enhance attack strength. 
Extensive\nexperiments show that our proposed method surpasses other attack methods and\nachieves a state-of-the-art attack success rate. We will make the corresponding\ncode and dataset available for future research. The code will be released soon.\n","authors":["Fengxiang Wang","Ranjie Duan","Peng Xiao","Xiaojun Jia","Shiji Zhao","Cheng Wei","YueFeng Chen","Chongwen Wang","Jialing Tao","Hang Su","Jun Zhu","Hui Xue"],"pdf_url":"https://arxiv.org/pdf/2411.03814v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.15778v3","updated":"2025-01-07T07:31:00Z","published":"2024-11-24T10:58:48Z","title":"Enhancing the automatic segmentation and analysis of 3D liver\n vasculature models","summary":" Surgical assessment of liver cancer patients requires identification of the\nvessel trees from medical images. Specifically, the venous trees - the portal\n(perfusing) and the hepatic (draining) trees - are important for understanding\nthe liver anatomy and disease state, and for performing surgery planning. This\nresearch aims to improve the 3D segmentation, skeletonization, and subsequent\nanalysis of vessel trees by creating an automatic pipeline based on deep\nlearning and image processing techniques.\n The first part of this work explores the impact of differentiable\nskeletonization methods such as ClDice and morphological skeletonization loss\non the overall liver vessel segmentation performance. To this aim, it studies\nhow to improve vessel tree connectivity.\n The second part of this study converts a single-class vessel segmentation\ninto a multi-class one, separating the two venous trees.
It builds on the\nprevious two-class vessel segmentation model, whose vessel tree outputs might\nbe entangled, and on connected components and skeleton analyses of the trees.\n After providing sub-labeling of the specific anatomical branches of each\nvenous tree, these algorithms also enable a morphometric analysis of the vessel\ntrees by extracting various geometrical markers.\n In conclusion, we propose a method that successfully improves current\nskeletonization methods for extensive vascular trees that contain vessels of\ndifferent calibers. The separation algorithm creates a clean multi-class\nsegmentation of the vessels, validated by surgeons to provide low error. A new,\npublicly shared high-quality liver vessel dataset of 77 cases is thus created.\nFinally, a method to annotate vessel trees according to anatomy is provided,\nenabling a unique liver vessel morphometry analysis.\n","authors":["Yassine Machta","Omar Ali","Kevin Hakkakian","Ana Vlasceanu","Amaury Facque","Nicolas Golse","Irene Vignon-Clementel"],"pdf_url":"https://arxiv.org/pdf/2411.15778v3.pdf","comment":"Paper presented at MICCAI 2024 Workshop: ADSMI. This work was done in\n the context of an internship at Simbiotx, Inria"},{"id":"http://arxiv.org/abs/2501.03583v1","updated":"2025-01-07T07:16:56Z","published":"2025-01-07T07:16:56Z","title":"STContext: A Multifaceted Dataset for Developing Context-aware\n Spatio-temporal Crowd Mobility Prediction Models","summary":" In smart cities, context-aware spatio-temporal crowd flow prediction (STCFP)\nmodels leverage contextual features (e.g., weather) to identify unusual crowd\nmobility patterns and enhance prediction accuracy. However, the best practice\nfor incorporating contextual features remains unclear due to inconsistent usage\nof contextual features in different papers. Developing a multifaceted dataset\nwith rich types of contextual features and STCFP scenarios is crucial for\nestablishing a principled context modeling paradigm.
Existing open crowd flow\ndatasets lack an adequate range of contextual features, making it urgent to\nbuild a multifaceted dataset that fills these research gaps. To\nthis end, we create STContext, a multifaceted dataset for developing\ncontext-aware STCFP models. Specifically, STContext provides nine\nspatio-temporal datasets across five STCFP scenarios and includes ten\ncontextual features, such as weather, air quality index, holidays, points of\ninterest, and road networks. Besides, we propose a unified workflow for\nincorporating contextual features into deep STCFP methods, with steps including\nfeature transformation, dependency modeling, representation fusion, and\ntraining strategies. Through extensive experiments, we have obtained several\nuseful guidelines for effective context modeling and insights for future\nresearch. STContext is open-sourced at\nhttps://github.com/Liyue-Chen/STContext.\n","authors":["Liyue Chen","Jiangyi Fang","Tengfei Liu","Fangyuan Gao","Leye Wang"],"pdf_url":"https://arxiv.org/pdf/2501.03583v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.03173v3","updated":"2025-01-07T07:05:05Z","published":"2024-02-05T16:41:02Z","title":"MULTI: Multimodal Understanding Leaderboard with Text and Images","summary":" The rapid development of multimodal large language models (MLLMs) raises the\nquestion of how they compare to human performance. While existing datasets\noften feature synthetic or overly simplistic tasks, some models have already\nsurpassed human expert baselines. In this paper, we present MULTI, a Chinese\nmultimodal dataset derived from authentic examination questions. Comprising\nover 18,000 carefully selected and refined questions, MULTI evaluates models\nusing real-world examination standards, encompassing image-text comprehension,\ncomplex reasoning, and knowledge recall.
Additionally, we introduce\nMULTI-Elite, a selected hard subset of 500 questions, and MULTI-Extend, with more\nthan 4,500 external knowledge context pieces for testing in-context learning\ncapabilities. Our evaluation highlights substantial room for MLLM advancement,\nwith Qwen2-VL-72B achieving 76.9% accuracy on MULTI and 53.1% on MULTI-Elite,\nleading 25 evaluated models, compared to human expert baselines of 86.1% and\n73.1%. MULTI serves not only as a robust evaluation platform but also paves the\nway for the development of expert-level AI.\n","authors":["Zichen Zhu","Yang Xu","Lu Chen","Jingkai Yang","Yichuan Ma","Yiming Sun","Hailin Wen","Jiaqi Liu","Jinyu Cai","Yingzi Ma","Situo Zhang","Zihan Zhao","Liangtai Sun","Kai Yu"],"pdf_url":"https://arxiv.org/pdf/2402.03173v3.pdf","comment":"24 pages, 19 figures, 10 tables. Details and access are available at:\n https://OpenDFM.github.io/MULTI-Benchmark/"},{"id":"http://arxiv.org/abs/2501.03575v1","updated":"2025-01-07T06:55:50Z","published":"2025-01-07T06:55:50Z","title":"Cosmos World Foundation Model Platform for Physical AI","summary":" Physical AI needs to be trained digitally first. It needs a digital twin of\nitself, the policy model, and a digital twin of the world, the world model. In\nthis paper, we present the Cosmos World Foundation Model Platform to help\ndevelopers build customized world models for their Physical AI setups. We\nposition a world foundation model as a general-purpose world model that can be\nfine-tuned into customized world models for downstream applications. Our\nplatform covers a video curation pipeline, pre-trained world foundation models,\nexamples of post-training of pre-trained world foundation models, and video\ntokenizers.
To help Physical AI builders solve the most critical problems of\nour society, we make our platform open-source and our models open-weight with\npermissive licenses available via https://github.com/NVIDIA/Cosmos.\n","authors":[" NVIDIA"," :","Niket Agarwal","Arslan Ali","Maciej Bala","Yogesh Balaji","Erik Barker","Tiffany Cai","Prithvijit Chattopadhyay","Yongxin Chen","Yin Cui","Yifan Ding","Daniel Dworakowski","Jiaojiao Fan","Michele Fenzi","Francesco Ferroni","Sanja Fidler","Dieter Fox","Songwei Ge","Yunhao Ge","Jinwei Gu","Siddharth Gururani","Ethan He","Jiahui Huang","Jacob Huffman","Pooya Jannaty","Jingyi Jin","Seung Wook Kim","Gergely Klár","Grace Lam","Shiyi Lan","Laura Leal-Taixe","Anqi Li","Zhaoshuo Li","Chen-Hsuan Lin","Tsung-Yi Lin","Huan Ling","Ming-Yu Liu","Xian Liu","Alice Luo","Qianli Ma","Hanzi Mao","Kaichun Mo","Arsalan Mousavian","Seungjun Nah","Sriharsha Niverty","David Page","Despoina Paschalidou","Zeeshan Patel","Lindsey Pavao","Morteza Ramezanali","Fitsum Reda","Xiaowei Ren","Vasanth Rao Naik Sabavat","Ed Schmerling","Stella Shi","Bartosz Stefaniak","Shitao Tang","Lyne Tchapmi","Przemek Tredak","Wei-Cheng Tseng","Jibin Varghese","Hao Wang","Haoxiang Wang","Heng Wang","Ting-Chun Wang","Fangyin Wei","Xinyue Wei","Jay Zhangjie Wu","Jiashu Xu","Wei Yang","Lin Yen-Chen","Xiaohui Zeng","Yu Zeng","Jing Zhang","Qinsheng Zhang","Yuxuan Zhang","Qingqing Zhao","Artur Zolkowski"],"pdf_url":"https://arxiv.org/pdf/2501.03575v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03572v1","updated":"2025-01-07T06:51:46Z","published":"2025-01-07T06:51:46Z","title":"From Code to Compliance: Assessing ChatGPT's Utility in Designing an\n Accessible Webpage -- A Case Study","summary":" Web accessibility ensures that individuals with disabilities can access and\ninteract with digital content without barriers, yet a significant majority of\nmost used websites fail to meet accessibility standards. 
This study evaluates\nChatGPT's (GPT-4o) ability to generate and improve web pages in line with Web\nContent Accessibility Guidelines (WCAG). While ChatGPT can effectively address\naccessibility issues when prompted, its default code often lacks compliance,\nreflecting limitations in its training data and prevailing inaccessible web\npractices. Automated and manual testing revealed strengths in resolving simple\nissues but challenges with complex tasks, requiring human oversight and\nadditional iterations. Unlike prior studies, we incorporate manual evaluation,\ndynamic elements, and use the visual reasoning capability of ChatGPT along with\nthe prompts to fix accessibility issues. Providing screenshots alongside\nprompts enhances the LLM's ability to address accessibility issues by allowing\nit to analyze surrounding components, such as determining appropriate contrast\ncolors. We found that effective prompt engineering, such as providing concise,\nstructured feedback and incorporating visual aids, significantly enhances\nChatGPT's performance. These findings highlight the potential and limitations\nof large language models for accessible web development, offering practical\nguidance for developers to create more inclusive websites.\n","authors":["Ammar Ahmed","Margarida Fresco","Fredrik Forsberg","Hallvard Grotli"],"pdf_url":"https://arxiv.org/pdf/2501.03572v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.12370v2","updated":"2025-01-07T06:47:00Z","published":"2024-12-16T21:56:01Z","title":"Scam Detection for Ethereum Smart Contracts: Leveraging Graph\n Representation Learning for Secure Blockchain","summary":" Due to the increasing abuse of fraudulent activities that result in\nsignificant financial and reputational harm, Ethereum smart contracts face a\nsignificant problem in detecting fraud. Existing monitoring methods typically\nrely on lease code analysis or physically extracted features, which suffer from\nscalability and adaptability limitations. 
In this study, we use graph\nrepresentation learning to observe purchase trends and find fraudulent deals.\nWe can achieve powerful categorisation performance by using innovative machine\nlearning versions and transforming Ethereum invoice data into graph structures.\nOur method addresses label imbalance through SMOTE-ENN techniques and evaluates\nmodels like Multi-Layer Perceptron (MLP) and Graph Convolutional Networks\n(GCN). Experimental results show that the MLP type surpasses the GCN in this\nenvironment, with domain-specific assessments closely aligned with real-world\nassessments. This study provides a scalable and efficient way to improve\nconfidence and security in Ethereum's ecosystem.\n","authors":["Yihong Jin","Ze Yang"],"pdf_url":"https://arxiv.org/pdf/2412.12370v2.pdf","comment":"Accepted to BDICN 2025"},{"id":"http://arxiv.org/abs/2407.15320v2","updated":"2025-01-07T06:39:29Z","published":"2024-07-07T09:25:52Z","title":"Edge Graph Intelligence: Reciprocally Empowering Edge Networks with\n Graph Intelligence","summary":" Recent years have witnessed a thriving growth of computing facilities\nconnected at the network edge, cultivating edge networks as a fundamental\ninfrastructure for supporting miscellaneous intelligent services. Meanwhile,\nArtificial Intelligence (AI) frontiers have extrapolated to the graph domain\nand promoted Graph Intelligence (GI). Given the inherent relation between\ngraphs and networks, the interdiscipline of graph learning and edge networks,\ni.e., Edge GI or EGI, has revealed a novel interplay between them -- GI aids in\noptimizing edge networks, while edge networks facilitate GI model deployment.\nDriven by this delicate closed-loop, EGI is recognized as a promising solution\nto fully unleash the potential of edge computing power and is garnering growing\nattention.
Nevertheless, research on EGI remains nascent, and there is a\nsoaring demand within both the communications and AI communities for a\ndedicated venue to share recent advancements. To this end, this paper promotes\nthe concept of EGI, explores its scope and core principles, and conducts a\ncomprehensive survey concerning recent research efforts in this emerging field.\nSpecifically, this paper introduces and discusses: 1) fundamentals of edge\ncomputing and graph learning, 2) emerging techniques centering on the closed\nloop between graph intelligence and edge networks, and 3) open challenges and\nresearch opportunities of future EGI. By bridging the gap across the communication,\nnetworking, and graph learning areas, we believe that this survey can garner\nincreased attention, foster meaningful discussions, and inspire further\nresearch ideas in EGI.\n","authors":["Liekang Zeng","Shengyuan Ye","Xu Chen","Xiaoxi Zhang","Ju Ren","Jian Tang","Yang Yang"," Xuemin"," Shen"],"pdf_url":"https://arxiv.org/pdf/2407.15320v2.pdf","comment":"Accepted by IEEE Communications Surveys & Tutorials"},{"id":"http://arxiv.org/abs/2501.03566v1","updated":"2025-01-07T06:34:17Z","published":"2025-01-07T06:34:17Z","title":"Applying Large Language Models in Knowledge Graph-based Enterprise\n Modeling: Challenges and Opportunities","summary":" The role of large language models (LLMs) in enterprise modeling has recently\nstarted to shift from academic research to industrial applications.\nThereby, LLMs represent a further building block for the machine-supported\ngeneration of enterprise models. In this paper we employ a knowledge\ngraph-based approach for enterprise modeling and investigate the potential\nbenefits of LLMs in this context. In addition, the findings of an expert survey\nand ChatGPT-4o-based experiments demonstrate that LLM-based model generations\nexhibit minimal variability, yet remain constrained to specific tasks, with\nreliability declining for more intricate tasks.
The survey results further\nsuggest that the supervision and intervention of human modeling experts are\nessential to ensure the accuracy and integrity of the generated models.\n","authors":["Benedikt Reitemeyer","Hans-Georg Fill"],"pdf_url":"https://arxiv.org/pdf/2501.03566v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02155v2","updated":"2025-01-07T06:30:24Z","published":"2024-12-03T04:29:27Z","title":"CausalMob: Causal Human Mobility Prediction with LLMs-derived Human\n Intentions toward Public Events","summary":" Large-scale human mobility exhibits spatial and temporal patterns that can\nassist policymakers in decision making. Although traditional prediction models\nattempt to capture these patterns, they are often disrupted by non-periodic public\nevents, such as disasters and occasional celebrations. Since regular human\nmobility patterns are heavily affected by these events, estimating their causal\neffects is critical to accurate mobility predictions. Although news articles\nprovide unique perspectives on these events in an unstructured format,\nprocessing them is a challenge. In this study, we propose a causality-augmented\nprediction model, called CausalMob, to analyze the causal effects of public\nevents. We first utilize large language models (LLMs) to extract human\nintentions from news articles and transform them into features that act as\ncausal treatments. Next, the model learns representations of spatio-temporal\nregional covariates from multiple data sources to serve as confounders for\ncausal inference.
Finally, we present a causal effect estimation framework to\nensure event features remain independent of confounders during prediction.\nBased on large-scale real-world data, the experimental results show that the\nproposed model excels in human mobility prediction, outperforming\nstate-of-the-art models.\n","authors":["Xiaojie Yang","Hangli Ge","Jiawei Wang","Zipei Fan","Renhe Jiang","Ryosuke Shibasaki","Noboru Koshizuka"],"pdf_url":"https://arxiv.org/pdf/2412.02155v2.pdf","comment":"Accepted by KDD 2025"},{"id":"http://arxiv.org/abs/2501.03562v1","updated":"2025-01-07T06:22:55Z","published":"2025-01-07T06:22:55Z","title":"Rethinking Adversarial Attacks in Reinforcement Learning from Policy\n Distribution Perspective","summary":" Deep Reinforcement Learning (DRL) suffers from uncertainties and inaccuracies\nin the observation signal in real-world applications. Adversarial attack is an\neffective method for evaluating the robustness of DRL agents. However, existing\nattack methods targeting individual sampled actions have limited impacts on the\noverall policy distribution, particularly in continuous action spaces. To\naddress these limitations, we propose the Distribution-Aware Projected Gradient\nDescent attack (DAPGD). DAPGD uses distribution similarity as the gradient\nperturbation input to attack the policy network, which leverages the entire\npolicy distribution rather than relying on individual samples. We utilize the\nBhattacharyya distance in DAPGD to measure policy similarity, enabling\nsensitive detection of subtle but critical differences between probability\ndistributions. 
Our experimental results demonstrate that DAPGD achieves SOTA\nresults compared to the baselines in three robot navigation tasks, with an\naverage 22.03% higher reward drop than the best baseline.\n","authors":["Tianyang Duan","Zongyuan Zhang","Zheng Lin","Yue Gao","Ling Xiong","Yong Cui","Hongbin Liang","Xianhao Chen","Heming Cui","Dong Huang"],"pdf_url":"https://arxiv.org/pdf/2501.03562v1.pdf","comment":"10 pages, 2 figures, 2 tables"},{"id":"http://arxiv.org/abs/2501.03560v1","updated":"2025-01-07T06:21:40Z","published":"2025-01-07T06:21:40Z","title":"KG-TRICK: Unifying Textual and Relational Information Completion of\n Knowledge for Multilingual Knowledge Graphs","summary":" Multilingual knowledge graphs (KGs) provide high-quality relational and\ntextual information for various NLP applications, but they are often\nincomplete, especially in non-English languages. Previous research has shown\nthat combining information from KGs in different languages aids either\nKnowledge Graph Completion (KGC), the task of predicting missing relations\nbetween entities, or Knowledge Graph Enhancement (KGE), the task of predicting\nmissing textual information for entities. Although previous efforts have\nconsidered KGC and KGE as independent tasks, we hypothesize that they are\ninterdependent and mutually beneficial. To this end, we introduce KG-TRICK, a\nnovel sequence-to-sequence framework that unifies the tasks of textual and\nrelational information completion for multilingual KGs. KG-TRICK demonstrates\nthat: i) it is possible to unify the tasks of KGC and KGE into a single\nframework, and ii) combining textual information from multiple languages is\nbeneficial to improve the completeness of a KG. 
As part of our contributions,\nwe also introduce WikiKGE10++, the largest manually-curated benchmark for\ntextual information completion of KGs, which features over 25,000 entities\nacross 10 diverse languages.\n","authors":["Zelin Zhou","Simone Conia","Daniel Lee","Min Li","Shenglei Huang","Umar Farooq Minhas","Saloni Potdar","Henry Xiao","Yunyao Li"],"pdf_url":"https://arxiv.org/pdf/2501.03560v1.pdf","comment":"Camera ready for COLING 2025"},{"id":"http://arxiv.org/abs/2501.03544v1","updated":"2025-01-07T05:39:21Z","published":"2025-01-07T05:39:21Z","title":"PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for\n Text-to-Image Models","summary":" Text-to-image (T2I) models have been shown to be vulnerable to misuse,\nparticularly in generating not-safe-for-work (NSFW) content, raising serious\nethical concerns. In this work, we present PromptGuard, a novel content\nmoderation technique that draws inspiration from the system prompt mechanism in\nlarge language models (LLMs) for safety alignment. Unlike LLMs, T2I models lack\na direct interface for enforcing behavioral guidelines. Our key idea is to\noptimize a safety soft prompt that functions as an implicit system prompt\nwithin the T2I model's textual embedding space. This universal soft prompt (P*)\ndirectly moderates NSFW inputs, enabling safe yet realistic image generation\nwithout altering the inference efficiency or requiring proxy models. 
Extensive\nexperiments across three datasets demonstrate that PromptGuard effectively\nmitigates NSFW content generation while preserving high-quality benign outputs.\nPromptGuard is 7.8 times faster than prior content moderation methods,\nsurpassing eight state-of-the-art defenses with an optimal unsafe ratio down to\n5.84%.\n","authors":["Lingzhi Yuan","Xinfeng Li","Chejian Xu","Guanhong Tao","Xiaojun Jia","Yihao Huang","Wei Dong","Yang Liu","XiaoFeng Wang","Bo Li"],"pdf_url":"https://arxiv.org/pdf/2501.03544v1.pdf","comment":"16 pages, 8 figures, 10 tables"},{"id":"http://arxiv.org/abs/2402.14658v3","updated":"2025-01-07T05:37:04Z","published":"2024-02-22T16:06:23Z","title":"OpenCodeInterpreter: Integrating Code Generation with Execution and\n Refinement","summary":" The introduction of large language models has significantly advanced code\ngeneration. However, open-source models often lack the execution capabilities\nand iterative refinement of advanced systems like the GPT-4 Code Interpreter.\nTo address this, we introduce OpenCodeInterpreter, a family of open-source code\nsystems designed for generating, executing, and iteratively refining code.\nSupported by Code-Feedback, a dataset featuring 68K multi-turn interactions,\nOpenCodeInterpreter integrates execution and human feedback for dynamic code\nrefinement. Our comprehensive evaluation of OpenCodeInterpreter across key\nbenchmarks such as HumanEval, MBPP, and their enhanced versions from EvalPlus\nreveals its exceptional performance. Notably, OpenCodeInterpreter-33B achieves\nan accuracy of 83.2 (76.4) on the average (and plus versions) of HumanEval and\nMBPP, closely rivaling GPT-4's 84.2 (76.2) and further elevates to 91.6 (84.6)\nwith synthesized human feedback from GPT-4. 
OpenCodeInterpreter bridges the gap\nbetween open-source code generation models and proprietary systems like GPT-4\nCode Interpreter.\n","authors":["Tianyu Zheng","Ge Zhang","Tianhao Shen","Xueling Liu","Bill Yuchen Lin","Jie Fu","Wenhu Chen","Xiang Yue"],"pdf_url":"https://arxiv.org/pdf/2402.14658v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.02156v2","updated":"2025-01-07T05:36:22Z","published":"2025-01-04T01:45:32Z","title":"The Race to Efficiency: A New Perspective on AI Scaling Laws","summary":" As large-scale AI models expand, training becomes costlier and sustaining\nprogress grows harder. Classical scaling laws (e.g., Kaplan et al. (2020),\nHoffmann et al. (2022)) predict training loss from a static compute budget yet\nneglect time and efficiency, prompting the question: how can we balance\nballooning GPU fleets with rapidly improving hardware and algorithms? We\nintroduce the relative-loss equation, a time- and efficiency-aware framework\nthat extends classical AI scaling laws. Our model shows that, without ongoing\nefficiency gains, advanced performance could demand millennia of training or\nunrealistically large GPU fleets. However, near-exponential progress remains\nachievable if the \"efficiency-doubling rate\" parallels Moore's Law. By\nformalizing this race to efficiency, we offer a quantitative roadmap for\nbalancing front-loaded GPU investments with incremental improvements across the\nAI stack. Empirical trends suggest that sustained efficiency gains can push AI\nscaling well into the coming decade, providing a new perspective on the\ndiminishing returns inherent in classical scaling.\n","authors":["Chien-Ping Lu"],"pdf_url":"https://arxiv.org/pdf/2501.02156v2.pdf","comment":"21 pages, 3 figures. 
2 tables, second draft"},{"id":"http://arxiv.org/abs/2402.13516v7","updated":"2025-01-07T05:26:54Z","published":"2024-02-21T03:58:49Z","title":"ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity\n within Large Language Models","summary":" Activation sparsity refers to the existence of considerable\nweakly-contributed elements among activation outputs. As a prevalent property\nof the models using the ReLU activation function, activation sparsity has been\nproven a promising paradigm to boost model inference efficiency. Nevertheless,\nmost large language models (LLMs) adopt activation functions without intrinsic\nactivation sparsity (e.g., GELU and Swish). Some recent efforts have explored\nintroducing ReLU or its variants as the substitutive activation function to\nhelp LLMs achieve activation sparsity and inference acceleration, but few can\nsimultaneously obtain high sparsity and comparable model performance. This\npaper introduces a simple and effective sparsification method named \"ProSparse\"\nto push LLMs for higher activation sparsity while maintaining comparable\nperformance. Specifically, after substituting the activation function of LLMs\nwith ReLU, ProSparse adopts progressive sparsity regularization with a factor\nsmoothly increasing along the multi-stage sine curves. This can enhance\nactivation sparsity and mitigate performance degradation by avoiding radical\nshifts in activation distributions. With ProSparse, we obtain high sparsity of\n89.32% for LLaMA2-7B, 88.80% for LLaMA2-13B, and 87.89% for end-size\nMiniCPM-1B, respectively, achieving comparable performance to their original\nSwish-activated versions. These present the most sparsely activated models\namong open-source LLaMA versions and competitive end-size models, considerably\nsurpassing ReluLLaMA-7B (66.98%) and ReluLLaMA-13B (71.56%). 
Our inference\nacceleration experiments further demonstrate the significant practical\nacceleration potential of LLMs with higher activation sparsity, obtaining up to\n4.52$\\times$ inference speedup.\n","authors":["Chenyang Song","Xu Han","Zhengyan Zhang","Shengding Hu","Xiyu Shi","Kuai Li","Chen Chen","Zhiyuan Liu","Guangli Li","Tao Yang","Maosong Sun"],"pdf_url":"https://arxiv.org/pdf/2402.13516v7.pdf","comment":"19 pages, 4 figures, 9 tables"},{"id":"http://arxiv.org/abs/2501.03540v1","updated":"2025-01-07T05:23:36Z","published":"2025-01-07T05:23:36Z","title":"Deep Learning within Tabular Data: Foundations, Challenges, Advances and\n Future Directions","summary":" Tabular data remains one of the most prevalent data types across a wide range\nof real-world applications, yet effective representation learning for this\ndomain poses unique challenges due to its irregular patterns, heterogeneous\nfeature distributions, and complex inter-column dependencies. This survey\nprovides a comprehensive review of state-of-the-art techniques in tabular data\nrepresentation learning, structured around three foundational design elements:\ntraining data, neural architectures, and learning objectives. Unlike prior\nsurveys that focus primarily on either architecture design or learning\nstrategies, we adopt a holistic perspective that emphasizes the universality\nand robustness of representation learning methods across diverse downstream\ntasks. We examine recent advances in data augmentation and generation,\nspecialized neural network architectures tailored to tabular data, and\ninnovative learning objectives that enhance representation quality.\nAdditionally, we highlight the growing influence of self-supervised learning\nand the adaptation of transformer-based foundation models for tabular data. Our\nreview is based on a systematic literature search using rigorous inclusion\ncriteria, encompassing 127 papers published since 2020 in top-tier conferences\nand journals. 
Through detailed analysis and comparison, we identify emerging\ntrends, critical gaps, and promising directions for future research, aiming to\nguide the development of more generalizable and effective tabular data\nrepresentation methods.\n","authors":["Weijieying Ren","Tianxiang Zhao","Yuqing Huang","Vasant Honavar"],"pdf_url":"https://arxiv.org/pdf/2501.03540v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.11876v2","updated":"2025-01-07T05:20:13Z","published":"2024-10-10T01:23:16Z","title":"Rescriber: Smaller-LLM-Powered User-Led Data Minimization for Navigating\n Privacy Trade-offs in LLM-Based Conversational Agent","summary":" The proliferation of LLM-based conversational agents has resulted in\nexcessive disclosure of identifiable or sensitive information. However,\nexisting technologies fail to offer perceptible control or account for users'\npersonal preferences about privacy-utility tradeoffs due to the lack of user\ninvolvement. To bridge this gap, we designed, built, and evaluated Rescriber, a\nbrowser extension that supports user-led data minimization in LLM-based\nconversational agents by helping users detect and sanitize personal information\nin their prompts. Our studies (N=12) showed that Rescriber helped users reduce\nunnecessary disclosure and addressed their privacy concerns. Users' subjective\nperceptions of the system powered by Llama3-8B were on par with that by GPT-4o.\nThe comprehensiveness and consistency of the detection and sanitization emerge\nas essential factors that affect users' trust and perceived protection. 
Our\nfindings confirm the viability of smaller-LLM-powered, user-facing, on-device\nprivacy controls, presenting a promising approach to address the privacy and\ntrust challenges of AI.\n","authors":["Jijie Zhou","Eryue Xu","Yaoyao Wu","Tianshi Li"],"pdf_url":"https://arxiv.org/pdf/2410.11876v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03535v1","updated":"2025-01-07T05:15:46Z","published":"2025-01-07T05:15:46Z","title":"SenseRAG: Constructing Environmental Knowledge Bases with Proactive\n Querying for LLM-Based Autonomous Driving","summary":" This study addresses the critical need for enhanced situational awareness in\nautonomous driving (AD) by leveraging the contextual reasoning capabilities of\nlarge language models (LLMs). Unlike traditional perception systems that rely\non rigid, label-based annotations, it integrates real-time, multimodal sensor\ndata into a unified, LLMs-readable knowledge base, enabling LLMs to dynamically\nunderstand and respond to complex driving environments. To overcome the\ninherent latency and modality limitations of LLMs, a proactive\nRetrieval-Augmented Generation (RAG) is designed for AD, combined with a\nchain-of-thought prompting mechanism, ensuring rapid and context-rich\nunderstanding. 
Experimental results using real-world Vehicle-to-everything\n(V2X) datasets demonstrate significant improvements in perception and\nprediction performance, highlighting the potential of this framework to enhance\nsafety, adaptability, and decision-making in next-generation AD systems.\n","authors":["Xuewen Luo","Fan Ding","Fengze Yang","Yang Zhou","Junnyong Loo","Hwa Hui Tew","Chenxi Liu"],"pdf_url":"https://arxiv.org/pdf/2501.03535v1.pdf","comment":"This paper has been accepted for presentation at WACV Workshop LLMAD\n 2025"},{"id":"http://arxiv.org/abs/2401.06949v2","updated":"2025-01-07T05:00:50Z","published":"2024-01-13T02:03:28Z","title":"ORGANA: A Robotic Assistant for Automated Chemistry Experimentation and\n Characterization","summary":" Chemistry experiments can be resource- and labor-intensive, often requiring\nmanual tasks like polishing electrodes in electrochemistry. Traditional lab\nautomation infrastructure faces challenges adapting to new experiments. To\naddress this, we introduce ORGANA, an assistive robotic system that automates\ndiverse chemistry experiments using decision-making and perception tools. It\nmakes decisions with chemists in the loop to control robots and lab devices.\nORGANA interacts with chemists using Large Language Models (LLMs) to derive\nexperiment goals, handle disambiguation, and provide experiment logs. ORGANA\nplans and executes complex tasks with visual feedback, while supporting\nscheduling and parallel task execution. We demonstrate ORGANA's capabilities in\nsolubility, pH measurement, recrystallization, and electrochemistry\nexperiments. In electrochemistry, it executes a 19-step plan in parallel to\ncharacterize quinone derivatives for flow batteries. 
Our user study shows\nORGANA reduces frustration and physical demand by over 50%, with users saving\nan average of 80.3% of their time when using it.\n","authors":["Kourosh Darvish","Marta Skreta","Yuchi Zhao","Naruki Yoshikawa","Sagnik Som","Miroslav Bogdanovic","Yang Cao","Han Hao","Haoping Xu","Alán Aspuru-Guzik","Animesh Garg","Florian Shkurti"],"pdf_url":"https://arxiv.org/pdf/2401.06949v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.12935v2","updated":"2025-01-07T04:42:20Z","published":"2024-06-17T03:03:34Z","title":"ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat\n Templates","summary":" Large language models (LLMs) are expected to follow instructions from users\nand engage in conversations. Techniques to enhance LLMs' instruction-following\ncapabilities typically fine-tune them using data structured according to a\npredefined chat template. Although chat templates are shown to be effective in\noptimizing LLM performance, their impact on safety alignment of LLMs has been\nless understood, which is crucial for deploying LLMs safely at scale.\n In this paper, we investigate how chat templates affect safety alignment of\nLLMs. We identify a common vulnerability, named ChatBug, that is introduced by\nchat templates. Our key insight to identify ChatBug is that the chat templates\nprovide a rigid format that needs to be followed by LLMs, but not by users.\nHence, a malicious user may not necessarily follow the chat template when\nprompting LLMs. Instead, malicious users could leverage their knowledge of the\nchat template and accordingly craft their prompts to bypass safety alignments\nof LLMs. We develop two attacks to exploit the ChatBug vulnerability. We\ndemonstrate that a malicious user can exploit the ChatBug vulnerability of\neight state-of-the-art (SOTA) LLMs and effectively elicit unintended responses\nfrom these models. 
Moreover, we show that ChatBug can be exploited by existing\njailbreak attacks to enhance their attack success rates. We investigate\npotential countermeasures to ChatBug. Our results show that while adversarial\ntraining effectively mitigates the ChatBug vulnerability, the victim model\nincurs significant performance degradation. These results highlight the\ntrade-off between safety alignment and helpfulness. Developing new methods for\ninstruction tuning to balance this trade-off is an open and critical direction\nfor future research.\n","authors":["Fengqing Jiang","Zhangchen Xu","Luyao Niu","Bill Yuchen Lin","Radha Poovendran"],"pdf_url":"https://arxiv.org/pdf/2406.12935v2.pdf","comment":"This paper is accepted to AAAI 2025"},{"id":"http://arxiv.org/abs/2501.03523v1","updated":"2025-01-07T04:38:28Z","published":"2025-01-07T04:38:28Z","title":"Vocal Tract Length Warped Features for Spoken Keyword Spotting","summary":" In this paper, we propose several methods that incorporate vocal tract length\n(VTL) warped features for spoken keyword spotting (KWS). The first method,\nVTL-independent KWS, involves training a single deep neural network (DNN) that\nutilizes VTL features with various warping factors. During training, a specific\nVTL feature is randomly selected per epoch, allowing the exploration of VTL\nvariations. During testing, the VTL features with different warping factors of\na test utterance are scored against the DNN and combined with equal weight. The\nsecond method scores the conventional features of a test utterance (without\nVTL warping) against the DNN. The third method, VTL-concatenation KWS,\nconcatenates VTL warped features to form high-dimensional features for KWS.\nEvaluations carried out on the English Google Command dataset demonstrate that\nthe proposed methods improve the accuracy of KWS.\n","authors":["Achintya kr. 
Sarkar","Priyanka Dwivedi","Zheng-Hua Tan"],"pdf_url":"https://arxiv.org/pdf/2501.03523v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.04828v5","updated":"2025-01-07T04:38:25Z","published":"2023-12-08T05:01:47Z","title":"HuRef: HUman-REadable Fingerprint for Large Language Models","summary":" Protecting the copyright of large language models (LLMs) has become crucial\ndue to their resource-intensive training and accompanying carefully designed\nlicenses. However, identifying the original base model of an LLM is challenging\ndue to potential parameter alterations. In this study, we introduce HuRef, a\nhuman-readable fingerprint for LLMs that uniquely identifies the base model\nwithout interfering with training or exposing model parameters to the public.\nWe first observe that the vector direction of LLM parameters remains stable\nafter the model has converged during pretraining, with negligible perturbations\nthrough subsequent training steps, including continued pretraining, supervised\nfine-tuning, and RLHF, which makes it a sufficient condition to identify the\nbase model. The necessity is validated by continuing to train an LLM with an\nextra term to drive away the model parameters' direction and the model becomes\ndamaged. However, this direction is vulnerable to simple attacks like dimension\npermutation or matrix rotation, which significantly change it without affecting\nperformance. To address this, leveraging the Transformer structure, we\nsystematically analyze potential attacks and define three invariant terms that\nidentify an LLM's base model. Due to the potential risk of information leakage,\nwe cannot publish invariant terms directly. Instead, we map them to a Gaussian\nvector using an encoder, then convert it into a natural image using StyleGAN2,\nand finally publish the image. In our black-box setting, all fingerprinting\nsteps are internally conducted by the LLMs owners. 
To ensure the published\nfingerprints are honestly generated, we introduced Zero-Knowledge Proof (ZKP).\nExperimental results across various LLMs demonstrate the effectiveness of our\nmethod. The code is available at https://github.com/LUMIA-Group/HuRef.\n","authors":["Boyi Zeng","Lizheng Wang","Yuncong Hu","Yi Xu","Chenghu Zhou","Xinbing Wang","Yu Yu","Zhouhan Lin"],"pdf_url":"https://arxiv.org/pdf/2312.04828v5.pdf","comment":"NeurIPS 2024"},{"id":"http://arxiv.org/abs/2408.06954v2","updated":"2025-01-07T04:11:55Z","published":"2024-08-13T15:13:21Z","title":"Neural Speech and Audio Coding: Modern AI Technology Meets Traditional\n Codecs","summary":" This paper explores the integration of model-based and data-driven approaches\nwithin the realm of neural speech and audio coding systems. It highlights the\nchallenges posed by the subjective evaluation processes of speech and audio\ncodecs and discusses the limitations of purely data-driven approaches, which\noften require inefficiently large architectures to match the performance of\nmodel-based methods. The study presents hybrid systems as a viable solution,\noffering significant improvements to the performance of conventional codecs\nthrough meticulously chosen design enhancements. Specifically, it introduces a\nneural network-based signal enhancer designed to post-process existing codecs'\noutput, along with the autoencoder-based end-to-end models and LPCNet--hybrid\nsystems that combine linear predictive coding (LPC) with neural networks.\nFurthermore, the paper delves into predictive models operating within custom\nfeature spaces (TF-Codec) or predefined transform domains (MDCTNet) and\nexamines the use of psychoacoustically calibrated loss functions to train\nend-to-end neural audio codecs. 
Through these investigations, the paper\ndemonstrates the potential of hybrid systems to advance the field of speech and\naudio coding by bridging the gap between traditional model-based approaches and\nmodern data-driven techniques.\n","authors":["Minje Kim","Jan Skoglund"],"pdf_url":"https://arxiv.org/pdf/2408.06954v2.pdf","comment":"Published in IEEE Signal Processing Magazine"},{"id":"http://arxiv.org/abs/2501.03228v2","updated":"2025-01-07T04:05:53Z","published":"2025-01-06T18:59:55Z","title":"LightGNN: Simple Graph Neural Network for Recommendation","summary":" Graph neural networks (GNNs) have demonstrated superior performance in\ncollaborative recommendation through their ability to conduct high-order\nrepresentation smoothing, effectively capturing structural information within\nusers' interaction patterns. However, existing GNN paradigms face significant\nchallenges in scalability and robustness when handling large-scale, noisy, and\nreal-world datasets. To address these challenges, we present LightGNN, a\nlightweight and distillation-based GNN pruning framework designed to\nsubstantially reduce model complexity while preserving essential collaboration\nmodeling capabilities. Our LightGNN framework introduces a computationally\nefficient pruning module that adaptively identifies and removes redundant edges\nand embedding entries for model compression. The framework is guided by a\nresource-friendly hierarchical knowledge distillation objective, whose\nintermediate layer augments the observed graph to maintain performance,\nparticularly in high-rate compression scenarios. Extensive experiments on\npublic datasets demonstrate LightGNN's effectiveness, significantly improving\nboth computational efficiency and recommendation accuracy. Notably, LightGNN\nachieves an 80% reduction in edge count and 90% reduction in embedding entries\nwhile maintaining performance comparable to more complex state-of-the-art\nbaselines. 
The implementation of our LightGNN framework is available at the\ngithub repository: https://github.com/HKUDS/LightGNN.\n","authors":["Guoxuan Chen","Lianghao Xia","Chao Huang"],"pdf_url":"https://arxiv.org/pdf/2501.03228v2.pdf","comment":"Accepted to WSDM 2025 Oral"},{"id":"http://arxiv.org/abs/2305.17740v2","updated":"2025-01-07T04:03:46Z","published":"2023-05-28T14:48:38Z","title":"Bridging the Language Gap: Dynamic Learning Strategies for Improving\n Multilingual Performance in LLMs","summary":" Large language models (LLMs) have revolutionized various domains but still\nstruggle with non-Latin scripts and low-resource languages. This paper\naddresses the critical challenge of improving multilingual performance without\nextensive fine-tuning. We introduce a novel dynamic learning approach that\noptimizes prompt strategy, embedding model, and LLM per query at runtime. By\nadapting configurations dynamically, our method achieves significant\nimprovements over static, best and random baselines. It operates efficiently in\nboth offline and online settings, generalizing seamlessly across new languages\nand datasets. Leveraging Retrieval-Augmented Generation (RAG) with\nstate-of-the-art multilingual embeddings, we achieve superior task performance\nacross diverse linguistic contexts. 
Through systematic investigation and\nevaluation across 18 diverse languages using popular question-answering (QA)\ndatasets, we show our approach results in 10-15% improvements in multilingual\nperformance over pre-trained models and 4x gains compared to fine-tuned,\nlanguage-specific models.\n","authors":["Somnath Kumar","Vaibhav Balloli","Mercy Ranjit","Kabir Ahuja","Sunayana Sitaram","Kalika Bali","Tanuja Ganu","Akshay Nambi"],"pdf_url":"https://arxiv.org/pdf/2305.17740v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.06167v3","updated":"2025-01-07T03:59:37Z","published":"2023-10-09T21:36:21Z","title":"Predictable Artificial Intelligence","summary":" We introduce the fundamental ideas and challenges of Predictable AI, a\nnascent research area that explores the ways in which we can anticipate key\nvalidity indicators (e.g., performance, safety) of present and future AI\necosystems. We argue that achieving predictability is crucial for fostering\ntrust, liability, control, alignment and safety of AI ecosystems, and thus\nshould be prioritised over performance. We formally characterise\npredictability, explore its most relevant components, illustrate what can be\npredicted, describe alternative candidates for predictors, as well as the\ntrade-offs between maximising validity and predictability. To illustrate these\nconcepts, we bring an array of illustrative examples covering diverse ecosystem\nconfigurations. Predictable AI is related to other areas of technical and\nnon-technical AI research, but has distinctive questions, hypotheses,\ntechniques and challenges. This paper aims to elucidate them, calls for\nidentifying paths towards a landscape of predictably valid AI systems and\noutlines the potential impact of this emergent field.\n","authors":["Lexin Zhou","Pablo A. 
Moreno-Casares","Fernando Martínez-Plumed","John Burden","Ryan Burnell","Lucy Cheke","Cèsar Ferri","Alexandru Marcoci","Behzad Mehrbakhsh","Yael Moros-Daval","Seán Ó hÉigeartaigh","Danaja Rutar","Wout Schellaert","Konstantinos Voudouris","José Hernández-Orallo"],"pdf_url":"https://arxiv.org/pdf/2310.06167v3.pdf","comment":"Paper Under Review"},{"id":"http://arxiv.org/abs/2410.23111v5","updated":"2025-01-07T03:56:49Z","published":"2024-10-30T15:23:44Z","title":"Exploring Gradient Subspaces: Addressing and Overcoming LoRA's\n Limitations in Federated Fine-Tuning of Large Language Models","summary":" Large Language Models (LLMs) have demonstrated remarkable capabilities across\nvarious domains, particularly in task generalization for both text and vision\ndata. While fine-tuning these models can significantly enhance their\nperformance on specific downstream tasks, it often requires high-quality data\nthat cannot be shared due to privacy concerns. Federated Learning (FL) offers a\npromising solution for collaborative training without direct data sharing.\nHowever, many parameter-efficient fine-tuning strategies for LLMs in FL,\nparticularly those based on Low-Rank Adaptation (LoRA), face limitations. In\nthis paper, we critically analyze the convergence and performance guarantees of\npopular FL frameworks utilizing LoRA, highlighting its suboptimal nature due to\nconstrained subspace learning of low-rank matrices. This limitation hinders\neffective fine-tuning of LLMs in federated settings. Through rigorous\nanalytical and empirical evaluations, we demonstrate that direct weight\naveraging outperforms LoRA-based strategies, leading to superior performance\nfor fine-tuned models. Our comprehensive comparison unmasks inefficiencies in\nLoRA approaches and underscores the advantages of direct weight aggregation. We\nextend our analysis to low-rank gradient-based optimizers, such as GaLore, used\nduring local training steps. 
Our findings show that GaLore along with\ndirect-weight aggregation is a more effective approach, outperforming federated\nLoRA methods like FlexLoRA and FFA-LoRA across both text and image modalities.\nWhile privacy remains paramount in FL discourse, our focus is on assessing\nperformance outcomes of federated fine-tuned models and evaluating various FL\nframeworks from both theoretical and empirical perspectives. Our findings\nadvocate reassessing the reliance on LoRA within FL contexts, paving the way\nfor more efficient training methodologies.\n","authors":["Navyansh Mahla","Kshitij Sharad Jadhav","Ganesh Ramakrishnan"],"pdf_url":"https://arxiv.org/pdf/2410.23111v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.16766v2","updated":"2025-01-07T03:53:12Z","published":"2024-05-27T02:27:28Z","title":"Concept Matching with Agent for Out-of-Distribution Detection","summary":" The remarkable achievements of Large Language Models (LLMs) have captivated\nthe attention of both academia and industry, transcending their initial role in\ndialogue generation. To expand the usage scenarios of LLM, some works enhance\nthe effectiveness and capabilities of the model by introducing more external\ninformation, which is called the agent paradigm. Based on this idea, we propose\na new method that integrates the agent paradigm into out-of-distribution (OOD)\ndetection task, aiming to improve its robustness and adaptability. Our proposed\nmethod, Concept Matching with Agent (CMA), employs neutral prompts as agents to\naugment the CLIP-based OOD detection process. These agents function as dynamic\nobservers and communication hubs, interacting with both In-distribution (ID)\nlabels and data inputs to form vector triangle relationships. This triangular\nframework offers a more nuanced approach than the traditional binary\nrelationship, allowing for better separation and identification of ID and OOD\ninputs. 
Our extensive experimental results showcase the superior performance of\nCMA over both zero-shot and training-required methods in a diverse array of\nreal-world scenarios.\n","authors":["Yuxiao Lee","Xiaofeng Cao","Jingcai Guo","Wei Ye","Qing Guo","Yi Chang"],"pdf_url":"https://arxiv.org/pdf/2405.16766v2.pdf","comment":"Accepted by AAAI-25"},{"id":"http://arxiv.org/abs/2501.03499v1","updated":"2025-01-07T03:39:43Z","published":"2025-01-07T03:39:43Z","title":"Can Deep Learning Trigger Alerts from Mobile-Captured Images?","summary":" Our research presents a comprehensive approach to leveraging mobile camera\nimage data for real-time air quality assessment and recommendation. We develop\na regression-based Convolutional Neural Network model and tailor it explicitly\nfor air quality prediction by exploiting the inherent relationship between\noutput parameters. As a result, the Mean Squared Error of 0.0077 and 0.0112\nobtained for 2 and 5 pollutants respectively outperforms existing models.\nFurthermore, we aim to verify the common practice of augmenting the original\ndataset with a view to introducing more variation in the training phase. It is\none of our most significant contributions that our experimental results\ndemonstrate minimal accuracy differences between the original and augmented\ndatasets. Finally, a real-time, user-friendly dashboard is implemented which\ndynamically displays the Air Quality Index and pollutant values derived from\ncaptured mobile camera images. Users' health conditions are considered to\nrecommend whether a location is suitable based on current air quality metrics.\nOverall, this research contributes to verification of data augmentation\ntechniques, CNN-based regression modelling for air quality prediction, and\nuser-centric air quality monitoring through mobile technology. 
The proposed\nsystem offers practical solutions for individuals to make informed\nenvironmental health and well-being decisions.\n","authors":["Pritisha Sarkar","Duranta Durbaar Vishal Saha","Mousumi Saha"],"pdf_url":"https://arxiv.org/pdf/2501.03499v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.24080v2","updated":"2025-01-07T03:33:00Z","published":"2024-10-31T16:16:51Z","title":"Graph Learning for Numeric Planning","summary":" Graph learning is naturally well suited for use in symbolic, object-centric\nplanning due to its ability to exploit relational structures exhibited in\nplanning domains and to take as input planning instances with arbitrary numbers\nof objects. Numeric planning is an extension of symbolic planning in which\nstates may now also exhibit numeric variables. In this work, we propose\ndata-efficient and interpretable machine learning models for learning to solve\nnumeric planning tasks. This involves constructing a new graph kernel for\ngraphs with both continuous and categorical attributes, as well as new\noptimisation methods for learning heuristic functions for numeric planning.\nExperiments show that our graph kernels are vastly more efficient and\ngeneralise better than graph neural networks for numeric planning, and also\nyield competitive coverage performance compared to domain-independent numeric\nplanners. Code is available at https://github.com/DillonZChen/goose\n","authors":["Dillon Z. Chen","Sylvie Thiébaux"],"pdf_url":"https://arxiv.org/pdf/2410.24080v2.pdf","comment":"Extended version of NeurIPS 2024 paper"},{"id":"http://arxiv.org/abs/2501.02024v2","updated":"2025-01-07T03:29:43Z","published":"2025-01-02T20:47:04Z","title":"Model Checking in Medical Imaging for Tumor Detection and Segmentation","summary":" Recent advancements in model checking have demonstrated significant potential\nacross diverse applications, particularly in signal and image analysis. 
Medical\nimaging stands out as a critical domain where model checking can be effectively\napplied to design and evaluate robust frameworks. These frameworks facilitate\nautomatic and semi-automatic delineation of regions of interest within images,\naiding in accurate segmentation. This paper provides a comprehensive analysis\nof recent works leveraging spatial logic to develop operators and tools for\nidentifying regions of interest, including tumorous and non-tumorous areas.\nAdditionally, we examine the challenges inherent to spatial model-checking\ntechniques, such as variability in ground truth data and the need for\nstreamlined procedures suitable for routine clinical practice.\n","authors":["Elhoucine Elfatimi","Lahcen El fatimi"],"pdf_url":"https://arxiv.org/pdf/2501.02024v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03491v1","updated":"2025-01-07T03:21:17Z","published":"2025-01-07T03:21:17Z","title":"Can LLMs Design Good Questions Based on Context?","summary":" This paper evaluates questions generated by LLMs from context, comparing them\nto human-generated questions across six dimensions. We introduce an automated\nLLM-based evaluation method, focusing on aspects like question length, type,\ncontext coverage, and answerability. Our findings highlight unique\ncharacteristics of LLM-generated questions, contributing insights that can\nsupport further research in question quality and downstream applications.\n","authors":["Yueheng Zhang","Xiaoyuan Liu","Yiyou Sun","Atheer Alharbi","Hend Alzahrani","Basel Alomair","Dawn Song"],"pdf_url":"https://arxiv.org/pdf/2501.03491v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.11397v2","updated":"2025-01-07T03:17:48Z","published":"2024-06-17T10:33:00Z","title":"DistPred: A Distribution-Free Probabilistic Inference Method for\n Regression and Forecasting","summary":" Traditional regression and prediction tasks often only provide deterministic\npoint estimates. 
To estimate the distribution or uncertainty of the response\nvariable, traditional methods either assume that the posterior distribution of\nsamples follows a Gaussian process or require thousands of forward passes for\nsample generation. We propose a novel approach called DistPred for regression\nand forecasting tasks, which overcomes the limitations of existing methods\nwhile remaining simple and powerful. Specifically, we transform proper scoring\nrules that measure the discrepancy between the predicted distribution and the\ntarget distribution into a differentiable discrete form and use it as a loss\nfunction to train the model end-to-end. This allows the model to sample\nnumerous samples in a single forward pass to estimate the potential\ndistribution of the response variable. We have compared our method with several\nexisting approaches on multiple datasets and achieved state-of-the-art\nperformance. Additionally, our method significantly improves computational\nefficiency. For example, compared to state-of-the-art models, DistPred has a\n180x faster inference speed. Experimental results can be reproduced through\nhttps://github.com/Anoise/DistPred.\n","authors":["Daojun Liang","Haixia Zhang","Dongfeng Yuan"],"pdf_url":"https://arxiv.org/pdf/2406.11397v2.pdf","comment":"Published at KDD 2025"},{"id":"http://arxiv.org/abs/2412.19391v2","updated":"2025-01-07T03:15:49Z","published":"2024-12-27T00:36:40Z","title":"An In-Depth Analysis of Adversarial Discriminative Domain Adaptation for\n Digit Classification","summary":" Domain adaptation is an active area of research driven by the growing demand\nfor robust machine learning models that perform well on real-world data.\nAdversarial learning for deep neural networks (DNNs) has emerged as a promising\napproach to improving generalization ability, particularly for image\nclassification. 
In this paper, we implement a specific adversarial learning\ntechnique known as Adversarial Discriminative Domain Adaptation (ADDA) and\nreplicate digit classification experiments from the original ADDA paper. We\nextend their findings by examining a broader range of domain shifts and provide\na detailed analysis of in-domain classification accuracy post-ADDA. Our results\ndemonstrate that ADDA significantly improves accuracy across certain domain\nshifts with minimal impact on in-domain performance. Furthermore, we provide\nqualitative analysis and propose potential explanations for ADDA's limitations\nin less successful domain shifts. Code is at\nhttps://github.com/eugenechoi2004/COS429_FINAL .\n","authors":["Eugene Choi","Julian Rodriguez","Edmund Young"],"pdf_url":"https://arxiv.org/pdf/2412.19391v2.pdf","comment":"Replacement: Updated methodology section to include grayscale\n preprocessing of SVHN data"},{"id":"http://arxiv.org/abs/2501.03486v1","updated":"2025-01-07T03:14:39Z","published":"2025-01-07T03:14:39Z","title":"Align-Pro: A Principled Approach to Prompt Optimization for LLM\n Alignment","summary":" The alignment of large language models (LLMs) with human values is critical\nas these models become increasingly integrated into various societal and\ndecision-making processes. Traditional methods, such as reinforcement learning\nfrom human feedback (RLHF), achieve alignment by fine-tuning model parameters,\nbut these approaches are often computationally expensive and impractical when\nmodels are frozen or inaccessible for parameter modification. In contrast,\nprompt optimization is a viable alternative to RLHF for LLM alignment. While\nthe existing literature has shown empirical promise of prompt optimization, its\ntheoretical underpinning remains under-explored. We address this gap by\nformulating prompt optimization as an optimization problem and try to provide\ntheoretical insights into the optimality of such a framework. 
To analyze the\nperformance of the prompt optimization, we study theoretical suboptimality\nbounds and provide insights in terms of how prompt optimization depends upon\nthe given prompter and target model. We also provide empirical validation\nthrough experiments on various datasets, demonstrating that prompt optimization\ncan effectively align LLMs, even when parameter fine-tuning is not feasible.\n","authors":["Prashant Trivedi","Souradip Chakraborty","Avinash Reddy","Vaneet Aggarwal","Amrit Singh Bedi","George K. Atia"],"pdf_url":"https://arxiv.org/pdf/2501.03486v1.pdf","comment":"27 pages, Accepted in AAAI 2025"},{"id":"http://arxiv.org/abs/2411.03334v3","updated":"2025-01-07T03:01:49Z","published":"2024-10-23T19:56:57Z","title":"Neural Network Prediction of Strong Lensing Systems with Domain\n Adaptation and Uncertainty Quantification","summary":" Modeling strong gravitational lenses is computationally expensive for the\ncomplex data from modern and next-generation cosmic surveys. Deep learning has\nemerged as a promising approach for finding lenses and predicting lensing\nparameters, such as the Einstein radius. Mean-variance Estimators (MVEs) are a\ncommon approach for obtaining aleatoric (data) uncertainties from a neural\nnetwork prediction. However, neural networks have not been demonstrated to\nperform well on out-of-domain target data successfully - e.g., when trained on\nsimulated data and applied to real, observational data. In this work, we\nperform the first study of the efficacy of MVEs in combination with\nunsupervised domain adaptation (UDA) on strong lensing data. The source domain\ndata is noiseless, and the target domain data has noise mimicking modern\ncosmology surveys. We find that adding UDA to MVE increases the accuracy on the\ntarget data by a factor of about two over an MVE model without UDA. 
Including\nUDA also permits much more well-calibrated aleatoric uncertainty predictions.\nAdvancements in this approach may enable future applications of MVE models to\nreal observational data.\n","authors":["Shrihan Agarwal","Aleksandra Ćiprijanović","Brian D. Nord"],"pdf_url":"https://arxiv.org/pdf/2411.03334v3.pdf","comment":"Accepted to the Machine Learning for Physical Sciences workshop at\n NeurIPS 2024; 24 pages, 2 figures, 4 tables"},{"id":"http://arxiv.org/abs/2501.01691v2","updated":"2025-01-07T02:57:03Z","published":"2025-01-03T08:18:08Z","title":"VidFormer: A novel end-to-end framework fused by 3DCNN and Transformer\n for Video-based Remote Physiological Measurement","summary":" Remote physiological signal measurement based on facial videos, also known as\nremote photoplethysmography (rPPG), involves predicting changes in facial\nvascular blood flow from facial videos. While most deep learning-based methods\nhave achieved good results, they often struggle to balance performance across\nsmall and large-scale datasets due to the inherent limitations of convolutional\nneural networks (CNNs) and Transformer. In this paper, we introduce VidFormer,\na novel end-to-end framework that integrates 3-Dimension Convolutional Neural\nNetwork (3DCNN) and Transformer models for rPPG tasks. Initially, we conduct an\nanalysis of the traditional skin reflection model and subsequently introduce an\nenhanced model for the reconstruction of rPPG signals. Based on this improved\nmodel, VidFormer utilizes 3DCNN and Transformer to extract local and global\nfeatures from input data, respectively. To enhance the spatiotemporal feature\nextraction capabilities of VidFormer, we incorporate temporal-spatial attention\nmechanisms tailored for both 3DCNN and Transformer. Additionally, we design a\nmodule to facilitate information exchange and fusion between the 3DCNN and\nTransformer. 
Our evaluation on five publicly available datasets demonstrates\nthat VidFormer outperforms current state-of-the-art (SOTA) methods. Finally, we\ndiscuss the essential roles of each VidFormer module and examine the effects of\nethnicity, makeup, and exercise on its performance.\n","authors":["Jiachen Li","Shisheng Guo","Longzhen Tang","Cuolong Cui","Lingjiang Kong","Xiaobo Yang"],"pdf_url":"https://arxiv.org/pdf/2501.01691v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.02964v2","updated":"2025-01-07T02:55:15Z","published":"2025-01-06T12:16:56Z","title":"Socratic Questioning: Learn to Self-guide Multimodal Reasoning in the\n Wild","summary":" Complex visual reasoning remains a key challenge today. Typically, the\nchallenge is tackled using methodologies such as Chain of Thought (COT) and\nvisual instruction tuning. However, how to organically combine these two\nmethodologies for greater success remains unexplored. Also, issues like\nhallucinations and high training cost still need to be addressed. In this work,\nwe devise an innovative multi-round training and reasoning framework suitable\nfor lightweight Multimodal Large Language Models (MLLMs). Our self-questioning\napproach heuristically guides MLLMs to focus on visual clues relevant to the\ntarget problem, reducing hallucinations and enhancing the model's ability to\ndescribe fine-grained image details. This ultimately enables the model to\nperform well in complex visual reasoning and question-answering tasks. We have\nnamed this framework Socratic Questioning (SQ). To facilitate future research,\nwe create a multimodal mini-dataset named CapQA, which includes 1k images of\nfine-grained activities, for visual instruction tuning and evaluation. Our\nproposed SQ method leads to a 31.2% improvement in the hallucination score. Our\nextensive experiments on various benchmarks demonstrate SQ's remarkable\ncapabilities in heuristic self-questioning, zero-shot visual reasoning and\nhallucination mitigation. 
Our model and code will be publicly available.\n","authors":["Wanpeng Hu","Haodi Liu","Lin Chen","Feng Zhou","Changming Xiao","Qi Yang","Changshui Zhang"],"pdf_url":"https://arxiv.org/pdf/2501.02964v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03475v1","updated":"2025-01-07T02:33:25Z","published":"2025-01-07T02:33:25Z","title":"Reading with Intent -- Neutralizing Intent","summary":" Queries to large language models (LLMs) can be divided into two parts: the\ninstruction/question and the accompanying context. The context for\nretrieval-augmented generation (RAG) systems in most benchmarks comes from\nWikipedia or Wikipedia-like texts which are written in a neutral and factual\ntone. However, when RAG systems retrieve internet-based content, they encounter\ntext with diverse tones and linguistic styles, introducing challenges for\ndownstream tasks. The Reading with Intent task addresses this issue by\nevaluating how varying tones in context passages affect model performance.\nBuilding on prior work that focused on sarcasm, we extend this paradigm by\nconstructing a dataset where context passages are transformed to $11$ distinct\nemotions using a better synthetic data generation approach. Using this dataset,\nwe train an emotion translation model to systematically adapt passages to\nspecified emotional tones. The human evaluation shows that the LLM fine-tuned\nto become the emotion-translator benefited from the synthetically generated\ndata. Finally, the emotion-translator is used in the Reading with Intent task\nto transform the passages to a neutral tone. 
By neutralizing the passages, it\nmitigates the challenges posed by sarcastic passages and improves overall\nresults on this task by about $3\\%$.\n","authors":["Benjamin Reichman","Adar Avsian","Larry Heck"],"pdf_url":"https://arxiv.org/pdf/2501.03475v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.01973v2","updated":"2025-01-07T02:10:45Z","published":"2024-12-28T02:28:19Z","title":"INFELM: In-depth Fairness Evaluation of Large Text-To-Image Models","summary":" The rapid development of large language models (LLMs) and large vision models\n(LVMs) has propelled the evolution of multi-modal AI systems, which have\ndemonstrated the remarkable potential for industrial applications by emulating\nhuman-like cognition. However, they also pose significant ethical challenges,\nincluding amplifying harmful content and reinforcing societal biases. For\ninstance, biases in some industrial image generation models highlighted the\nurgent need for robust fairness assessments. Most existing evaluation\nframeworks focus on the comprehensiveness of various aspects of the models, but\nthey exhibit critical limitations, including insufficient attention to content\ngeneration alignment and social bias-sensitive domains. More importantly, their\nreliance on pixel-detection techniques is prone to inaccuracies.\n To address these issues, this paper presents INFELM, an in-depth fairness\nevaluation on widely-used text-to-image models. Our key contributions are: (1)\nan advanced skintone classifier incorporating facial topology and refined skin\npixel representation to enhance classification precision by at least 16.04%,\n(2) a bias-sensitive content alignment measurement for understanding societal\nimpacts, (3) a generalizable representation bias evaluation for diverse\ndemographic groups, and (4) extensive experiments analyzing large-scale\ntext-to-image model outputs across six social-bias-sensitive domains. 
We find\nthat existing models in the study generally do not meet the empirical fairness\ncriteria, and representation bias is generally more pronounced than alignment\nerrors. INFELM establishes a robust benchmark for fairness assessment,\nsupporting the development of multi-modal AI systems that align with ethical\nand human-centric principles.\n","authors":["Di Jin","Xing Liu","Yu Liu","Jia Qing Yap","Andrea Wong","Adriana Crespo","Qi Lin","Zhiyuan Yin","Qiang Yan","Ryan Ye"],"pdf_url":"https://arxiv.org/pdf/2501.01973v2.pdf","comment":"Di Jin and Xing Liu contributed equally to this work"},{"id":"http://arxiv.org/abs/2501.03468v1","updated":"2025-01-07T01:52:56Z","published":"2025-01-07T01:52:56Z","title":"MTRAG: A Multi-Turn Conversational Benchmark for Evaluating\n Retrieval-Augmented Generation Systems","summary":" Retrieval-augmented generation (RAG) has recently become a very popular task\nfor Large Language Models (LLMs). Evaluating them on multi-turn RAG\nconversations, where the system is asked to generate a response to a question\nin the context of a preceding conversation is an important and often overlooked\ntask with several additional challenges. We present MTRAG: an end-to-end\nhuman-generated multi-turn RAG benchmark that reflects several real-world\nproperties across diverse dimensions for evaluating the full RAG pipeline.\nMTRAG contains 110 conversations averaging 7.7 turns each across four domains\nfor a total of 842 tasks. We also explore automation paths via synthetic data\nand LLM-as-a-Judge evaluation. Our human and automatic evaluations show that\neven state-of-the-art LLM RAG systems struggle on MTRAG. We demonstrate the\nneed for strong retrieval and generation systems that can handle later turns,\nunanswerable questions, non-standalone questions, and multiple domains. 
MTRAG\nis available at https://github.com/ibm/mt-rag-benchmark.\n","authors":["Yannis Katsis","Sara Rosenthal","Kshitij Fadnis","Chulaka Gunasekara","Young-Suk Lee","Lucian Popa","Vraj Shah","Huaiyu Zhu","Danish Contractor","Marina Danilevsky"],"pdf_url":"https://arxiv.org/pdf/2501.03468v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19139v2","updated":"2025-01-07T01:50:11Z","published":"2024-12-26T09:51:05Z","title":"PlanLLM: Video Procedure Planning with Refinable Large Language Models","summary":" Video procedure planning, i.e., planning a sequence of action steps given the\nvideo frames of start and goal states, is an essential ability for embodied AI.\nRecent works utilize Large Language Models (LLMs) to generate enriched action\nstep description texts to guide action step decoding. Although LLMs are\nintroduced, these methods decode the action steps into a closed-set of one-hot\nvectors, limiting the model's capability of generalizing to new steps or tasks.\nAdditionally, fixed action step descriptions based on world-level commonsense\nmay contain noise in specific instances of visual states. In this paper, we\npropose PlanLLM, a cross-modal joint learning framework with LLMs for video\nprocedure planning. We propose an LLM-Enhanced Planning module which fully uses\nthe generalization ability of LLMs to produce free-form planning output and to\nenhance action step decoding. We also propose Mutual Information Maximization\nmodule to connect world-level commonsense of step descriptions and\nsample-specific information of visual states, enabling LLMs to employ the\nreasoning ability to generate step sequences. With the assistance of LLMs, our\nmethod can handle both closed-set and open vocabulary procedure planning tasks. 
Our\nPlanLLM achieves superior performance on three benchmarks, demonstrating the\neffectiveness of our designs.\n","authors":["Dejie Yang","Zijing Zhao","Yang Liu"],"pdf_url":"https://arxiv.org/pdf/2412.19139v2.pdf","comment":"accepted to AAAI2025"},{"id":"http://arxiv.org/abs/2501.03464v1","updated":"2025-01-07T01:45:39Z","published":"2025-01-07T01:45:39Z","title":"LHGNN: Local-Higher Order Graph Neural Networks For Audio Classification\n and Tagging","summary":" Transformers have set new benchmarks in audio processing tasks, leveraging\nself-attention mechanisms to capture complex patterns and dependencies within\naudio data. However, their focus on pairwise interactions limits their ability\nto process the higher-order relations essential for identifying distinct audio\nobjects. To address this limitation, this work introduces the Local-Higher\nOrder Graph Neural Network (LHGNN), a graph-based model that enhances feature\nunderstanding by integrating local neighbourhood information with higher-order\ndata from Fuzzy C-Means clusters, thereby capturing a broader spectrum of audio\nrelationships. Evaluation of the model on three publicly available audio\ndatasets shows that it outperforms Transformer-based models across all\nbenchmarks while operating with substantially fewer parameters. 
Moreover, LHGNN\ndemonstrates a distinct advantage in scenarios lacking ImageNet pretraining,\nestablishing its effectiveness and efficiency in environments where extensive\npretraining data is unavailable.\n","authors":["Shubhr Singh","Emmanouil Benetos","Huy Phan","Dan Stowell"],"pdf_url":"https://arxiv.org/pdf/2501.03464v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.22376v2","updated":"2025-01-07T01:41:13Z","published":"2024-10-29T07:43:39Z","title":"Rare-to-Frequent: Unlocking Compositional Generation Power of Diffusion\n Models on Rare Concepts with LLM Guidance","summary":" State-of-the-art text-to-image (T2I) diffusion models often struggle to\ngenerate rare compositions of concepts, e.g., objects with unusual attributes.\nIn this paper, we show that the compositional generation power of diffusion\nmodels on such rare concepts can be significantly enhanced by the Large\nLanguage Model (LLM) guidance. We start with empirical and theoretical\nanalysis, demonstrating that exposing frequent concepts relevant to the target\nrare concepts during the diffusion sampling process yields more accurate\nconcept composition. Based on this, we propose a training-free approach, R2F,\nthat plans and executes the overall rare-to-frequent concept guidance\nthroughout the diffusion inference by leveraging the abundant semantic\nknowledge in LLMs. Our framework is flexible across any pre-trained diffusion\nmodels and LLMs, and can be seamlessly integrated with the region-guided\ndiffusion approaches. In extensive experiments on three datasets, including our\nnewly proposed benchmark, RareBench, which contains various prompts with rare\ncompositions of concepts, R2F significantly surpasses existing models including\nSD3.0 and FLUX by up to 28.1%p in T2I alignment. 
Code is available at\nhttps://github.com/krafton-ai/Rare-to-Frequent.\n","authors":["Dongmin Park","Sebin Kim","Taehong Moon","Minkyu Kim","Kangwook Lee","Jaewoong Cho"],"pdf_url":"https://arxiv.org/pdf/2410.22376v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.14026v2","updated":"2025-01-07T01:40:42Z","published":"2024-09-21T05:58:07Z","title":"Uncovering Latent Chain of Thought Vectors in Language Models","summary":" As language models grow more influential and trusted in our society, our\nability to reliably steer them toward favorable behaviors becomes increasingly\nparamount. For this, we investigate the technique of steering vectors: biasing\nthe forward pass of language models using a \"steering vector\" derived from a\nspecific task. We apply them to steer language models toward performing Chain\nof Thought (CoT) Reasoning without the need to prompt through natural language.\nWe demonstrate this approach on Llama3 8b and Mistral 7b v0.2, and obtain\ncompetitive results compared to CoT-prompted performances on a series of\nreasoning benchmarks (GSM8k, MMLU, AGI Eval, ARC AI2) and qualitative examples.\nWe find this approach yields consistent steering towards CoT responses and\ntakes less compute than traditional methods of fine-tuning models towards CoT.\n","authors":["Jason Zhang","Scott Viteri"],"pdf_url":"https://arxiv.org/pdf/2409.14026v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03461v1","updated":"2025-01-07T01:35:56Z","published":"2025-01-07T01:35:56Z","title":"Radar Signal Recognition through Self-Supervised Learning and Domain\n Adaptation","summary":" Automatic radar signal recognition (RSR) plays a pivotal role in electronic\nwarfare (EW), as accurately classifying radar signals is critical for informing\ndecision-making processes. Recent advances in deep learning have shown\nsignificant potential in improving RSR performance in domains with ample\nannotated data. 
However, these methods fall short in EW scenarios where\nannotated RF data are scarce or impractical to obtain. To address these\nchallenges, we introduce a self-supervised learning (SSL) method which utilises\nmasked signal modelling and RF domain adaptation to enhance RSR performance in\nenvironments with limited RF samples and labels. Specifically, we investigate\npre-training masked autoencoders (MAE) on baseband in-phase and quadrature\n(I/Q) signals from various RF domains and subsequently transfer the learned\nrepresentation to the radar domain, where annotated data are limited. Empirical\nresults show that our lightweight self-supervised ResNet model with domain\nadaptation achieves up to a 17.5\% improvement in 1-shot classification\naccuracy when pre-trained on in-domain signals (i.e., radar signals) and up to\na 16.31\% improvement when pre-trained on out-of-domain signals (i.e., comm\nsignals), compared to its baseline without SSL. We also provide reference\nresults for several MAE designs and pre-training strategies, establishing a new\nbenchmark for few-shot radar signal classification.\n","authors":["Zi Huang","Akila Pemasiri","Simon Denman","Clinton Fookes","Terrence Martin"],"pdf_url":"https://arxiv.org/pdf/2501.03461v1.pdf","comment":"5 pages, 9 figures"},{"id":"http://arxiv.org/abs/2501.03458v1","updated":"2025-01-07T01:19:48Z","published":"2025-01-07T01:19:48Z","title":"Activating Associative Disease-Aware Vision Token Memory for LLM-Based\n X-ray Report Generation","summary":" X-ray image based medical report generation achieves significant progress in\nrecent years with the help of the large language model; however, these models\nhave not fully exploited the effective information in visual image regions,\nresulting in reports that are linguistically sound but insufficient in\ndescribing key diseases. 
In this paper, we propose a novel associative\nmemory-enhanced X-ray report generation model that effectively mimics the\nprocess of professional doctors writing medical reports. It considers both the\nmining of global and local visual information and associates historical report\ninformation to better complete the writing of the current report. Specifically,\ngiven an X-ray image, we first utilize a classification model along with its\nactivation maps to accomplish the mining of visual regions highly associated\nwith diseases and the learning of disease query tokens. Then, we employ a\nvisual Hopfield network to establish memory associations for disease-related\ntokens, and a report Hopfield network to retrieve report memory information.\nThis process facilitates the generation of high-quality reports based on a\nlarge language model and achieves state-of-the-art performance on multiple\nbenchmark datasets, including the IU X-ray, MIMIC-CXR, and Chexpert Plus. The\nsource code of this work is released on\n\\url{https://github.com/Event-AHU/Medical_Image_Analysis}.\n","authors":["Xiao Wang","Fuling Wang","Haowen Wang","Bo Jiang","Chuanfu Li","Yaowei Wang","Yonghong Tian","Jin Tang"],"pdf_url":"https://arxiv.org/pdf/2501.03458v1.pdf","comment":"In Peer Review"},{"id":"http://arxiv.org/abs/2403.05260v2","updated":"2025-01-07T00:53:48Z","published":"2024-03-08T12:31:03Z","title":"Towards generalization of drug response prediction to single cells and\n patients utilizing importance-aware multi-source domain transfer learning","summary":" The advancement of single-cell sequencing technology has promoted the\ngeneration of a large amount of single-cell transcriptional profiles, providing\nunprecedented opportunities to identify drug-resistant cell subpopulations\nwithin a tumor. However, few studies have focused on drug response prediction\nat single-cell level, and their performance remains suboptimal. 
This paper\nproposed scAdaDrug, a novel multi-source domain adaptation model powered by\nadaptive importance-aware representation learning to predict drug response of\nindividual cells. We used a shared encoder to extract domain-invariant features\nrelated to drug response from multiple source domains by utilizing adversarial\ndomain adaptation. Particularly, we introduced a plug-and-play module to\ngenerate importance-aware and mutually independent weights, which could\nadaptively modulate the latent representation of each sample in element-wise\nmanner between source and target domains. Extensive experimental results showed\nthat our model achieved state-of-the-art performance in predicting drug\nresponse on multiple independent datasets, including single-cell datasets\nderived from both cell lines and patient-derived xenografts (PDX) models, as\nwell as clinical tumor patient cohorts. Moreover, the ablation experiments\ndemonstrated our model effectively captured the underlying patterns determining\ndrug response from multiple source domains.\n","authors":["Hui Liu","Wei Duan","Judong Luo"],"pdf_url":"https://arxiv.org/pdf/2403.05260v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.00826v2","updated":"2025-01-07T00:15:11Z","published":"2025-01-01T13:08:17Z","title":"LLM-Powered Multi-Agent System for Automated Crypto Portfolio Management","summary":" Cryptocurrency investment is inherently difficult due to its shorter history\ncompared to traditional assets, the need to integrate vast amounts of data from\nvarious modalities, and the requirement for complex reasoning. While deep\nlearning approaches have been applied to address these challenges, their\nblack-box nature raises concerns about trust and explainability. Recently,\nlarge language models (LLMs) have shown promise in financial applications due\nto their ability to understand multi-modal data and generate explainable\ndecisions. 
However, a single LLM faces limitations in complex, comprehensive\ntasks such as asset investment. These limitations are even more pronounced in\ncryptocurrency investment, where LLMs have less domain-specific knowledge in\ntheir training corpora.\n To overcome these challenges, we propose an explainable, multi-modal,\nmulti-agent framework for cryptocurrency investment. Our framework uses\nspecialized agents that collaborate within and across teams to handle subtasks\nsuch as data analysis, literature integration, and investment decision-making\nfor the top 30 cryptocurrencies by market capitalization. The expert training\nmodule fine-tunes agents using multi-modal historical data and professional\ninvestment literature, while the multi-agent investment module employs\nreal-time data to make informed cryptocurrency investment decisions. Unique\nintrateam and interteam collaboration mechanisms enhance prediction accuracy by\nadjusting final predictions based on confidence levels within agent teams and\nfacilitating information sharing between teams. Empirical evaluation using data\nfrom November 2023 to September 2024 demonstrates that our framework\noutperforms single-agent models and market benchmarks in classification, asset\npricing, portfolio, and explainability performance.\n","authors":["Yichen Luo","Yebo Feng","Jiahua Xu","Paolo Tasca","Yang Liu"],"pdf_url":"https://arxiv.org/pdf/2501.00826v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03443v1","updated":"2025-01-07T00:09:52Z","published":"2025-01-07T00:09:52Z","title":"Optimization Learning","summary":" This article introduces the concept of optimization learning, a methodology\nto design optimization proxies that learn the input/output mapping of\nparametric optimization problems. These optimization proxies are trustworthy by\ndesign: they compute feasible solutions to the underlying optimization\nproblems, provide quality guarantees on the returned solutions, and scale to\nlarge instances. 
Optimization proxies are differentiable programs that combine\ntraditional deep learning technology with repair or completion layers to\nproduce feasible solutions. The article shows that optimization proxies can be\ntrained end-to-end in a self-supervised way. It presents methodologies to\nprovide performance guarantees and to scale optimization proxies to large-scale\noptimization problems. The potential of optimization proxies is highlighted\nthrough applications in power systems and, in particular, real-time risk\nassessment and security-constrained optimal power flow.\n","authors":["Pascal Van Hentenryck"],"pdf_url":"https://arxiv.org/pdf/2501.03443v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.01855v2","updated":"2025-01-07T23:47:06Z","published":"2024-10-01T22:47:24Z","title":"Explainable Diagnosis Prediction through Neuro-Symbolic Integration","summary":" Diagnosis prediction is a critical task in healthcare, where timely and\naccurate identification of medical conditions can significantly impact patient\noutcomes. Traditional machine learning and deep learning models have achieved\nnotable success in this domain but often lack interpretability which is a\ncrucial requirement in clinical settings. In this study, we explore the use of\nneuro-symbolic methods, specifically Logical Neural Networks (LNNs), to develop\nexplainable models for diagnosis prediction. Essentially, we design and\nimplement LNN-based models that integrate domain-specific knowledge through\nlogical rules with learnable thresholds. Our models, particularly\n$M_{\\text{multi-pathway}}$ and $M_{\\text{comprehensive}}$, demonstrate superior\nperformance over traditional models such as Logistic Regression, SVM, and\nRandom Forest, achieving higher accuracy (up to 80.52\\%) and AUROC scores (up\nto 0.8457) in the case study of diabetes prediction. 
The learned weights and\nthresholds within the LNN models provide direct insights into feature\ncontributions, enhancing interpretability without compromising predictive\npower. These findings highlight the potential of neuro-symbolic approaches in\nbridging the gap between accuracy and explainability in healthcare AI\napplications. By offering transparent and adaptable diagnostic models, our work\ncontributes to the advancement of precision medicine and supports the\ndevelopment of equitable healthcare solutions. Future research will focus on\nextending these methods to larger and more diverse datasets to further validate\ntheir applicability across different medical conditions and populations.\n","authors":["Qiuhao Lu","Rui Li","Elham Sagheb","Andrew Wen","Jinlian Wang","Liwei Wang","Jungwei W. Fan","Hongfang Liu"],"pdf_url":"https://arxiv.org/pdf/2410.01855v2.pdf","comment":"Proceedings of AMIA Informatics Summit 2025"},{"id":"http://arxiv.org/abs/2501.00790v2","updated":"2025-01-07T23:43:09Z","published":"2025-01-01T10:00:49Z","title":"LENS-XAI: Redefining Lightweight and Explainable Network Security\n through Knowledge Distillation and Variational Autoencoders for Scalable\n Intrusion Detection in Cybersecurity","summary":" The rapid proliferation of Industrial Internet of Things (IIoT) systems\nnecessitates advanced, interpretable, and scalable intrusion detection systems\n(IDS) to combat emerging cyber threats. Traditional IDS face challenges such as\nhigh computational demands, limited explainability, and inflexibility against\nevolving attack patterns. To address these limitations, this study introduces\nthe Lightweight Explainable Network Security framework (LENS-XAI), which\ncombines robust intrusion detection with enhanced interpretability and\nscalability. LENS-XAI integrates knowledge distillation, variational\nautoencoder models, and attribution-based explainability techniques to achieve\nhigh detection accuracy and transparency in decision-making. 
By leveraging a\ntraining set comprising 10% of the available data, the framework optimizes\ncomputational efficiency without sacrificing performance. Experimental\nevaluation on four benchmark datasets (Edge-IIoTset, UKM-IDS20, CTU-13, and\nNSL-KDD) demonstrates the framework's superior performance, achieving detection\naccuracies of 95.34%, 99.92%, 98.42%, and 99.34%, respectively. Additionally,\nthe framework excels in reducing false positives and adapting to complex attack\nscenarios, outperforming existing state-of-the-art methods. Key strengths of\nLENS-XAI include its lightweight design, suitable for resource-constrained\nenvironments, and its scalability across diverse IIoT and cybersecurity\ncontexts. Moreover, the explainability module enhances trust and transparency,\ncritical for practical deployment in dynamic and sensitive applications. This\nresearch contributes significantly to advancing IDS by addressing computational\nefficiency, feature interpretability, and real-world applicability. Future work\ncould focus on extending the framework to ensemble AI systems for distributed\nenvironments, further enhancing its robustness and adaptability.\n","authors":["Muhammet Anil Yagiz","Polat Goktas"],"pdf_url":"https://arxiv.org/pdf/2501.00790v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04182v1","updated":"2025-01-07T23:23:26Z","published":"2025-01-07T23:23:26Z","title":"Fixed Points of Deep Neural Networks: Emergence, Stability, and\n Applications","summary":" We present numerical and analytical results on the formation and stability of\na family of fixed points of deep neural networks (DNNs). Such fixed points\nappear in a class of DNNs when the dimensions of the input and output vectors\nare the same. We demonstrate examples of applications of such networks in\nsupervised, semi-supervised, and unsupervised learning, such as the\nencoding/decoding of images and the restoration of damaged images, among\nothers.\n We present several numerical and analytical results. 
First, we show that for\nuntrained DNNs with weights and biases initialized by normally distributed\nrandom variables, only one fixed point exists. This result holds for DNNs\nwith any depth (number of layers) $L$, any layer width $N$, and sigmoid-type\nactivation functions. Second, it has been shown that for a DNN whose parameters\n(weights and biases) are initialized by a ``light-tailed'' distribution of\nweights (e.g. a normal distribution), after training the distribution of these\nparameters becomes ``heavy-tailed''. This motivates our study of DNNs with\n``heavy-tailed'' initialization. For such DNNs we show numerically\nthat training leads to the emergence of $Q(N,L)$ fixed points, where\n$Q(N,L)$ is a positive integer which depends on the number of layers $L$ and\nthe layer width $N$. We further observe numerically that for fixed $N = N_0$ the\nfunction $Q(N_0, L)$ is non-monotone, that is, it initially grows as $L$\nincreases and then decreases to 1.\n This non-monotone behavior of $Q(N_0, L)$ is also obtained by analytical\nderivation of an equation for the Empirical Spectral Distribution (ESD) of the\ninput-output Jacobian, followed by a numerical solution of this equation.\n","authors":["L. Berlyand","V. Slavin"],"pdf_url":"https://arxiv.org/pdf/2501.04182v1.pdf","comment":"21 pages, 7 figures"},{"id":"http://arxiv.org/abs/2501.04180v1","updated":"2025-01-07T23:16:31Z","published":"2025-01-07T23:16:31Z","title":"HIVEX: A High-Impact Environment Suite for Multi-Agent Research\n (extended version)","summary":" Games have been vital test beds for the rapid development of agent-based\nresearch. Remarkable progress has been achieved in the past, but it is unclear\nwhether the findings carry over to real-world problems. While pressure grows, some of\nthe most critical ecological challenges can find mitigation and prevention\nsolutions through technology and its applications. 
Most real-world domains\ninclude multi-agent scenarios and require machine-machine and human-machine\ncollaboration. Open-source environments have not advanced and are often toy\nscenarios, too abstract or not suitable for multi-agent research. By mimicking\nreal-world problems and increasing the complexity of environments, we hope to\nadvance state-of-the-art multi-agent research and inspire researchers to work\non immediate real-world problems. Here, we present HIVEX, an environment suite\nto benchmark multi-agent research focusing on ecological challenges. HIVEX\nincludes the following environments: Wind Farm Control, Wildfire Resource\nManagement, Drone-Based Reforestation, Ocean Plastic Collection, and Aerial\nWildfire Suppression. We provide environments, training examples, and baselines\nfor the main and sub-tasks. All trained models resulting from the experiments\nof this work are hosted on Hugging Face. We also provide a leaderboard on\nHugging Face and encourage the community to submit models trained on our\nenvironment suite.\n","authors":["Philipp D. Siedler"],"pdf_url":"https://arxiv.org/pdf/2501.04180v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04173v1","updated":"2025-01-07T22:53:56Z","published":"2025-01-07T22:53:56Z","title":"Multimodal Multihop Source Retrieval for Web Question Answering","summary":" This work deals with the challenge of learning and reasoning over multi-modal\nmulti-hop question answering (QA). We propose a graph reasoning network based\non the semantic structure of the sentences to learn multi-source reasoning\npaths and find the supporting facts across both image and text modalities for\nanswering the question. In this paper, we investigate the importance of graph\nstructure for multi-modal multi-hop question answering. Our analysis is\ncentered on WebQA. We construct a strong baseline model, that finds relevant\nsources using a pairwise classification task. 
We establish that, with the\nproper use of feature representations from pre-trained models, graph structure\nhelps in improving multi-modal multi-hop question answering. We point out that\nboth the graph structure and the adjacency matrix are task-related prior\nknowledge, and graph structure can be leveraged to improve the retrieval\nperformance for the task. Experiments and visualized analysis demonstrate that\nmessage propagation over graph networks or the entire graph structure can\nreplace massive multimodal transformers with token-wise cross-attention. We\ndemonstrate the applicability of our method and show a performance gain of\n\\textbf{4.6$\\%$} retrieval F1 score over the transformer baselines, despite\nusing a very light model. We further demonstrate the applicability of our model\nto a large-scale retrieval setting.\n","authors":["Navya Yarrabelly","Saloni Mittal"],"pdf_url":"https://arxiv.org/pdf/2501.04173v1.pdf","comment":"arXiv admin note: text overlap with arXiv:2010.03604 by other authors"},{"id":"http://arxiv.org/abs/2501.04169v1","updated":"2025-01-07T22:33:47Z","published":"2025-01-07T22:33:47Z","title":"Learning to Transfer Human Hand Skills for Robot Manipulations","summary":" We present a method for teaching dexterous manipulation tasks to robots from\nhuman hand motion demonstrations. Unlike existing approaches that solely rely\non kinematics information without taking into account the plausibility of robot\nand object interaction, our method directly infers plausible robot manipulation\nactions from human motion demonstrations. To address the embodiment gap between\nthe human hand and the robot system, our approach learns a joint motion\nmanifold that maps human hand movements, robot hand actions, and object\nmovements in 3D, enabling us to infer one motion component from others. Our key\nidea is the generation of pseudo-supervision triplets, which pair human,\nobject, and robot motion trajectories synthetically. 
Through real-world\nexperiments with robot hand manipulation, we demonstrate that our data-driven\nretargeting method significantly outperforms conventional retargeting\ntechniques, effectively bridging the embodiment gap between human and robotic\nhands. Website at https://rureadyo.github.io/MocapRobot/.\n","authors":["Sungjae Park","Seungho Lee","Mingi Choi","Jiye Lee","Jeonghwan Kim","Jisoo Kim","Hanbyul Joo"],"pdf_url":"https://arxiv.org/pdf/2501.04169v1.pdf","comment":"Preprint. Under Review"},{"id":"http://arxiv.org/abs/2501.04167v1","updated":"2025-01-07T22:29:08Z","published":"2025-01-07T22:29:08Z","title":"Reasoning-Enhanced Self-Training for Long-Form Personalized Text\n Generation","summary":" Personalized text generation requires a unique ability of large language\nmodels (LLMs) to learn from context that they often do not encounter during\ntheir standard training. One way to encourage LLMs to better use personalized\ncontext for generating outputs that better align with the user's expectations\nis to instruct them to reason over the user's past preferences, background\nknowledge, or writing style. To achieve this, we propose Reasoning-Enhanced\nSelf-Training for Personalized Text Generation (REST-PG), a framework that\ntrains LLMs to reason over personal data during response generation. REST-PG\nfirst generates reasoning paths to train the LLM's reasoning abilities and then\nemploys Expectation-Maximization Reinforced Self-Training to iteratively train\nthe LLM based on its own high-reward outputs. We evaluate REST-PG on the\nLongLaMP benchmark, consisting of four diverse personalized long-form text\ngeneration tasks. 
Our experiments demonstrate that REST-PG achieves significant\nimprovements over state-of-the-art baselines, with an average relative\nperformance gain of 14.5% on the benchmark.\n","authors":["Alireza Salemi","Cheng Li","Mingyang Zhang","Qiaozhu Mei","Weize Kong","Tao Chen","Zhuowan Li","Michael Bendersky","Hamed Zamani"],"pdf_url":"https://arxiv.org/pdf/2501.04167v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.16181v2","updated":"2025-01-07T22:12:47Z","published":"2024-12-10T16:51:11Z","title":"Minimum Weighted Feedback Arc Sets for Ranking from Pairwise Comparisons","summary":" The Minimum Weighted Feedback Arc Set (MWFAS) problem is fundamentally\nconnected to the Ranking Problem -- the task of deriving global rankings from\npairwise comparisons. Recent work [He et al. ICML2022] has advanced the\nstate-of-the-art for the Ranking Problem using learning-based methods,\nimproving upon multiple previous approaches. However, the connection to MWFAS\nremains underexplored. This paper investigates this relationship and presents\nefficient combinatorial algorithms for solving MWFAS, thus addressing the\nRanking Problem. Our experimental results demonstrate that these simple,\nlearning-free algorithms not only significantly outperform learning-based\nmethods in terms of speed but also generally achieve superior ranking accuracy.\n","authors":["Soroush Vahidi","Ioannis Koutis"],"pdf_url":"https://arxiv.org/pdf/2412.16181v2.pdf","comment":"This is a preliminary paper"},{"id":"http://arxiv.org/abs/2410.19313v2","updated":"2025-01-07T21:52:46Z","published":"2024-10-25T05:59:30Z","title":"COAT: Compressing Optimizer states and Activation for Memory-Efficient\n FP8 Training","summary":" FP8 training has emerged as a promising method for improving training\nefficiency. Existing frameworks accelerate training by applying FP8 computation\nto linear layers while leaving optimizer states and activations in higher\nprecision, which fails to fully optimize memory usage. 
This paper introduces\nCOAT (Compressing Optimizer States and Activations for FP8 Training), a novel\nFP8 training framework designed to significantly reduce memory footprint when\ntraining large models. COAT addresses current limitations through two key\ninnovations: (1) Dynamic Range Expansion, which aligns optimizer state\ndistributions more closely with the FP8 representation range, thereby reducing\nquantization error, and (2) Mixed-Granularity Activation Quantization, which\noptimizes activation memory using a combination of per-tensor and per-group\nquantization strategies. Experiments demonstrate that COAT effectively reduces\nend-to-end training memory footprint by 1.54x compared to BF16 while achieving\nnearly lossless performance across various tasks, such as Large Language Model\npretraining and fine-tuning and Vision Language Model training. COAT also\nachieves a 1.43x end-to-end training speedup compared to BF16, performing on\npar with or surpassing TransformerEngine's speedup. COAT enables efficient\nfull-parameter training of large models on fewer GPUs, and facilitates doubling\nthe batch size in distributed training settings, providing a practical solution\nfor scaling large-scale model training. The code is available at\nhttps://github.com/NVlabs/COAT.\n","authors":["Haocheng Xi","Han Cai","Ligeng Zhu","Yao Lu","Kurt Keutzer","Jianfei Chen","Song Han"],"pdf_url":"https://arxiv.org/pdf/2410.19313v2.pdf","comment":"22 pages. 9 Figures. 13 Tables"},{"id":"http://arxiv.org/abs/2301.01828v3","updated":"2025-01-07T21:38:31Z","published":"2023-01-04T21:33:13Z","title":"On Sequential Bayesian Inference for Continual Learning","summary":" Sequential Bayesian inference can be used for continual learning to prevent\ncatastrophic forgetting of past tasks and provide an informative prior when\nlearning new tasks. 
We revisit sequential Bayesian inference and test whether\nhaving access to the true posterior is guaranteed to prevent catastrophic\nforgetting in Bayesian neural networks. To do this, we perform sequential\nBayesian inference using Hamiltonian Monte Carlo. We propagate the posterior as\na prior for new tasks by fitting a density estimator on Hamiltonian Monte Carlo\nsamples. We find that this approach fails to prevent catastrophic forgetting,\ndemonstrating the difficulty in performing sequential Bayesian inference in\nneural networks. From there we study simple analytical examples of sequential\nBayesian inference and continual learning (CL) and highlight the issue of model\nmisspecification, which can lead to sub-optimal continual learning performance\ndespite exact inference. Furthermore, we discuss how task data imbalances can\ncause forgetting. From these limitations, we argue that we need probabilistic\nmodels of the continual learning generative process rather than relying on\nsequential Bayesian inference over Bayesian neural network weights. In this\nvein, we also propose a simple baseline called Prototypical Bayesian Continual\nLearning, which is competitive with state-of-the-art Bayesian continual\nlearning methods on class-incremental continual learning vision benchmarks.\n","authors":["Samuel Kessler","Adam Cobb","Tim G. J. Rudner","Stefan Zohren","Stephen J. Roberts"],"pdf_url":"https://arxiv.org/pdf/2301.01828v3.pdf","comment":"Supercedes Entropy publication with updates to Section 4"},{"id":"http://arxiv.org/abs/2405.17044v3","updated":"2025-01-07T21:29:45Z","published":"2024-05-27T11:00:51Z","title":"Interesting Scientific Idea Generation using Knowledge Graphs and LLMs:\n Evaluations with 100 Research Group Leaders","summary":" The rapid growth of scientific literature makes it challenging for\nresearchers to identify novel and impactful ideas, especially across\ndisciplines. 
Modern artificial intelligence (AI) systems offer new approaches,\npotentially inspiring ideas not conceived by humans alone. But how compelling\nare these AI-generated ideas, and how can we improve their quality? Here, we\nintroduce SciMuse, which uses 58 million research papers and a large-language\nmodel to generate research ideas. We conduct a large-scale evaluation in which\nover 100 research group leaders -- from natural sciences to humanities --\nranked more than 4,400 personalized ideas based on their interest. This data\nallows us to predict research interest using (1) supervised neural networks\ntrained on human evaluations, and (2) unsupervised zero-shot ranking with\nlarge-language models. Our results demonstrate how future systems can help\ngenerate compelling research ideas and foster unforeseen interdisciplinary\ncollaborations.\n","authors":["Xuemei Gu","Mario Krenn"],"pdf_url":"https://arxiv.org/pdf/2405.17044v3.pdf","comment":"8 pages; 4 figures; Appendix: 6 pages, 5 figures, 2 tables"},{"id":"http://arxiv.org/abs/2402.08640v3","updated":"2025-01-07T21:19:30Z","published":"2024-02-13T18:09:38Z","title":"Forecasting high-impact research topics via machine learning on evolving\n knowledge graphs","summary":" The exponential growth in scientific publications poses a severe challenge\nfor human researchers. It forces attention to narrower sub-fields, which\nmakes it challenging to discover new impactful research ideas and\ncollaborations outside one's own field. While there are ways to predict a\nscientific paper's future citation counts, they need the research to be\nfinished and the paper written, usually assessing impact long after the idea\nwas conceived. Here we show how to predict the impact of onsets of ideas that\nhave never been published by researchers. For that, we developed a large\nevolving knowledge graph built from more than 21 million scientific papers. 
It\ncombines a semantic network created from the content of the papers and an\nimpact network created from the historic citations of papers. Using machine\nlearning, we can predict the dynamic of the evolving network into the future\nwith high accuracy (AUC values beyond 0.9 for most experiments), and thereby\nthe impact of new research directions. We envision that the ability to predict\nthe impact of new ideas will be a crucial component of future artificial muses\nthat can inspire new impactful and interesting scientific ideas.\n","authors":["Xuemei Gu","Mario Krenn"],"pdf_url":"https://arxiv.org/pdf/2402.08640v3.pdf","comment":"13 pages, 12 figures, Comments welcome!"},{"id":"http://arxiv.org/abs/2501.04142v1","updated":"2025-01-07T21:10:16Z","published":"2025-01-07T21:10:16Z","title":"BiasGuard: Guardrailing Fairness in Machine Learning Production Systems","summary":" As machine learning (ML) systems increasingly impact critical sectors such as\nhiring, financial risk assessments, and criminal justice, the imperative to\nensure fairness has intensified due to potential negative implications. While\nmuch ML fairness research has focused on enhancing training data and processes,\naddressing the outputs of already deployed systems has received less attention.\nThis paper introduces 'BiasGuard', a novel approach designed to act as a\nfairness guardrail in production ML systems. BiasGuard leverages Test-Time\nAugmentation (TTA) powered by Conditional Generative Adversarial Network\n(CTGAN), a cutting-edge generative AI model, to synthesize data samples\nconditioned on inverted protected attribute values, thereby promoting equitable\noutcomes across diverse groups. This method aims to provide equal opportunities\nfor both privileged and unprivileged groups while significantly enhancing the\nfairness metrics of deployed systems without the need for retraining. 
Our\ncomprehensive experimental analysis across diverse datasets reveals that\nBiasGuard enhances fairness by 31% while only reducing accuracy by 0.09%\ncompared to non-mitigated benchmarks. Additionally, BiasGuard outperforms\nexisting post-processing methods in improving fairness, positioning it as an\neffective tool to safeguard against biases when retraining the model is\nimpractical.\n","authors":["Nurit Cohen-Inger","Seffi Cohen","Neomi Rabaev","Lior Rokach","Bracha Shapira"],"pdf_url":"https://arxiv.org/pdf/2501.04142v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.02393v2","updated":"2025-01-07T21:04:14Z","published":"2025-01-04T22:30:21Z","title":"Graph-Aware Isomorphic Attention for Adaptive Dynamics in Transformers","summary":" We present an approach to modifying Transformer architectures by integrating\ngraph-aware relational reasoning into the attention mechanism, merging concepts\nfrom graph neural networks and language modeling. Building on the inherent\nconnection between attention and graph theory, we reformulate the Transformer's\nattention mechanism as a graph operation and propose Graph-Aware Isomorphic\nAttention. This method leverages advanced graph modeling strategies, including\nGraph Isomorphism Networks (GIN) and Principal Neighborhood Aggregation (PNA),\nto enrich the representation of relational structures. Our approach captures\ncomplex dependencies and generalizes across tasks, as evidenced by a reduced\ngeneralization gap and improved learning performance. Additionally, we expand\nthe concept of graph-aware attention to introduce Sparse GIN-Attention, a\nfine-tuning approach that employs sparse GINs. By interpreting attention\nmatrices as sparse adjacency graphs, this technique enhances the adaptability\nof pre-trained foundational models with minimal computational overhead,\nendowing them with graph-aware capabilities. 
Sparse GIN-Attention fine-tuning\nachieves improved training dynamics and better generalization compared to\nalternative methods like low-rank adaptation (LoRA). We discuss latent\ngraph-like structures within traditional attention mechanisms, offering a new\nlens through which Transformers can be understood: as evolving hierarchical GIN\nmodels for relational reasoning. This perspective suggests profound\nimplications for foundational model development, enabling the design of\narchitectures that dynamically adapt to both local and global dependencies.\nApplications in bioinformatics, materials science, language modeling, and\nbeyond could benefit from this synthesis of relational and sequential data\nmodeling, setting the stage for interpretable and generalizable modeling\nstrategies.\n","authors":["Markus J. Buehler"],"pdf_url":"https://arxiv.org/pdf/2501.02393v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04136v1","updated":"2025-01-07T20:52:08Z","published":"2025-01-07T20:52:08Z","title":"Implementing Systemic Thinking for Automatic Schema Matching: An\n Agent-Based Modeling Approach","summary":" Several approaches have been proposed to deal with the problem of Automatic\nSchema Matching (ASM). The challenges and difficulties caused by the complexity\nand uncertainty characterizing both the process and the outcome of Schema\nMatching motivated us to investigate how a bio-inspired emerging paradigm can\nhelp with understanding, managing, and ultimately overcoming those challenges.\nIn this paper, we explain how we approached Automatic Schema Matching as a\nsystemic and Complex Adaptive System (CAS) and how we modeled it using the\napproach of Agent-Based Modeling and Simulation (ABMS). This effort gives birth\nto a tool (prototype) for schema matching called Reflex-SMAS. 
A set of\nexperiments demonstrates the viability of our approach in two main aspects: (i)\neffectiveness (increasing the quality of the found matchings) and (ii)\nefficiency (reducing the effort required for the matching). Our approach\nrepresents a significant paradigm shift in the field of Automatic Schema\nMatching.\n","authors":["Hicham Assoudi","Hakim Lounis"],"pdf_url":"https://arxiv.org/pdf/2501.04136v1.pdf","comment":"COGNITIVE 2018 : The Tenth International Conference on Advanced\n Cognitive Technologies and Applications"},{"id":"http://arxiv.org/abs/2308.05764v2","updated":"2025-01-07T20:50:51Z","published":"2023-08-09T10:05:11Z","title":"Unlocking the diagnostic potential of electrocardiograms through\n information transfer from cardiac magnetic resonance imaging","summary":" Cardiovascular diseases (CVD) can be diagnosed using various diagnostic\nmodalities. The electrocardiogram (ECG) is a cost-effective and widely\navailable diagnostic aid that provides functional information of the heart.\nHowever, its ability to classify and spatially localise CVD is limited. In\ncontrast, cardiac magnetic resonance (CMR) imaging provides detailed structural\ninformation of the heart and thus enables evidence-based diagnosis of CVD, but\nlong scan times and high costs limit its use in clinical routine. In this work,\nwe present a deep learning strategy for cost-effective and comprehensive\ncardiac screening solely from ECG. Our approach combines multimodal contrastive\nlearning with masked data modelling to transfer domain-specific information\nfrom CMR imaging to ECG representations. In extensive experiments using data\nfrom 40,044 UK Biobank subjects, we demonstrate the utility and\ngeneralisability of our method for subject-specific risk prediction of CVD and\nthe prediction of cardiac phenotypes using only ECG data. 
Specifically, our\nnovel multimodal pre-training paradigm improves performance by up to 12.19 %\nfor risk prediction and 27.59 % for phenotype prediction. In a qualitative\nanalysis, we demonstrate that our learned ECG representations incorporate\ninformation from CMR image regions of interest. Our entire pipeline is publicly\navailable at https://github.com/oetu/MMCL-ECG-CMR.\n","authors":["Özgün Turgut","Philip Müller","Paul Hager","Suprosanna Shit","Sophie Starck","Martin J. Menten","Eimo Martens","Daniel Rueckert"],"pdf_url":"https://arxiv.org/pdf/2308.05764v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.00530v2","updated":"2025-01-07T20:36:35Z","published":"2024-03-31T02:05:40Z","title":"Comparing Bad Apples to Good Oranges: Aligning Large Language Models via\n Joint Preference Optimization","summary":" A common technique for aligning large language models (LLMs) relies on\nacquiring human preferences by comparing multiple generations conditioned on a\nfixed context. This method, however, relies solely on pairwise comparisons,\nwhere the generations are evaluated within an identical context. While\neffective, such conditional preferences often fail to encompass the nuanced\nand multidimensional nature of human preferences. In this work, we revisit the\ntraditional paradigm of preference acquisition and propose a new axis based on\neliciting preferences jointly over the instruction-response pairs. Unlike prior\npreference optimizations, which are designed for conditional ranking protocols\n(e.g., DPO), we propose Joint Preference Optimization (JPO), a new preference\noptimization objective that upweights the joint probability of the chosen\ninstruction-response pair over the rejected instruction-response pair.\nInterestingly, LLMs trained with joint instruction-response preference data\nusing JPO outperform LLMs trained with DPO by $5.2\\%$ and $3.3\\%$ win-rate for\nsummarization and open-ended dialogue datasets, respectively. 
Our findings\nreveal that joint preferences over instruction and response pairs can\nsignificantly enhance the alignment of LLMs by tapping into a broader spectrum\nof human preference elicitation. The data and code are available at\nhttps://github.com/Hritikbansal/dove.\n","authors":["Hritik Bansal","Ashima Suvarna","Gantavya Bhatt","Nanyun Peng","Kai-Wei Chang","Aditya Grover"],"pdf_url":"https://arxiv.org/pdf/2404.00530v2.pdf","comment":"22 pages, 16 figures, 7 tables"},{"id":"http://arxiv.org/abs/2412.05781v3","updated":"2025-01-07T20:27:09Z","published":"2024-12-08T02:27:17Z","title":"Open-Source Acceleration of Stable-Diffusion.cpp Deployable on All\n Devices","summary":" Stable diffusion plays a crucial role in generating high-quality images.\nHowever, image generation is time-consuming and memory-intensive. To address\nthis, stable-diffusion.cpp (Sdcpp) emerges as an efficient inference framework\nto accelerate the diffusion models. Although it is lightweight, the current\nimplementation of the ggml_conv_2d operator in Sdcpp is suboptimal, exhibiting both\nhigh inference latency and massive memory usage. To address this, in this work,\nwe present an optimized version of Sdcpp leveraging the Winograd algorithm to\naccelerate 2D convolution operations, which is the primary bottleneck in the\npipeline. By analyzing both dependent and independent computation graphs, we\nexploit the device's locality and parallelism to achieve substantial\nperformance improvements. Our framework delivers correct end-to-end results\nacross various stable diffusion models, including SDv1.4, v1.5, v2.1, SDXL, and\nSDXL-Turbo. 
Our evaluation results demonstrate a speedup of up to 2.76x for\nindividual convolutional layers and an inference speedup of up to 4.79x for the\noverall image generation process, compared with the original Sdcpp on M1 Pro.\nHomepage: https://github.com/SealAILab/stable-diffusion-cpp\n","authors":["Jingxu Ng","Cheng Lv","Pu Zhao","Wei Niu","Juyi Lin","Minzhou Pan","Yun Liang","Yanzhi Wang"],"pdf_url":"https://arxiv.org/pdf/2412.05781v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.09529v2","updated":"2025-01-07T20:26:34Z","published":"2024-08-18T16:26:39Z","title":"Revisiting the Graph Reasoning Ability of Large Language Models: Case\n Studies in Translation, Connectivity and Shortest Path","summary":" Large Language Models (LLMs) have achieved great success in various reasoning\ntasks. In this work, we focus on the graph reasoning ability of LLMs. Although\ntheoretical studies proved that LLMs are capable of handling graph reasoning\ntasks, empirical evaluations reveal numerous failures. To deepen our\nunderstanding of this discrepancy, we revisit the ability of LLMs on three\nfundamental graph tasks: graph description translation, graph connectivity, and\nthe shortest-path problem. Our findings suggest that LLMs can fail to\nunderstand graph structures through text descriptions and exhibit varying\nperformance across all three fundamental tasks. Meanwhile, we perform a\nreal-world investigation on knowledge graphs and make consistent observations\nwith our findings. 
The code and datasets are available.\n","authors":["Xinnan Dai","Qihao Wen","Yifei Shen","Hongzhi Wen","Dongsheng Li","Jiliang Tang","Caihua Shan"],"pdf_url":"https://arxiv.org/pdf/2408.09529v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04108v1","updated":"2025-01-07T19:35:19Z","published":"2025-01-07T19:35:19Z","title":"TrojanDec: Data-free Detection of Trojan Inputs in Self-supervised\n Learning","summary":" An image encoder pre-trained by self-supervised learning can be used as a\ngeneral-purpose feature extractor to build downstream classifiers for various\ndownstream tasks. However, many studies showed that an attacker can embed a\ntrojan into an encoder such that multiple downstream classifiers built based on\nthe trojaned encoder simultaneously inherit the trojan behavior. In this work,\nwe propose TrojanDec, the first data-free method to identify and recover a test\ninput embedded with a trigger. Given a (trojaned or clean) encoder and a test\ninput, TrojanDec first predicts whether the test input is trojaned. If not, the\ntest input is processed in a normal way to maintain the utility. Otherwise, the\ntest input will be further restored to remove the trigger. Our extensive\nevaluation shows that TrojanDec can effectively identify the trojan (if any)\nfrom a given test input and recover it under state-of-the-art trojan attacks.\nWe further demonstrate by experiments that our TrojanDec outperforms the\nstate-of-the-art defenses.\n","authors":["Yupei Liu","Yanting Wang","Jinyuan Jia"],"pdf_url":"https://arxiv.org/pdf/2501.04108v1.pdf","comment":"To appear in AAAI 2025"},{"id":"http://arxiv.org/abs/2501.04102v1","updated":"2025-01-07T19:19:22Z","published":"2025-01-07T19:19:22Z","title":"Enhancing Distribution and Label Consistency for Graph\n Out-of-Distribution Generalization","summary":" To deal with distribution shifts in graph data, various graph\nout-of-distribution (OOD) generalization techniques have been recently\nproposed. 
These methods often employ a two-step strategy that first creates\naugmented environments and subsequently identifies invariant subgraphs to\nimprove generalizability. Nevertheless, this approach could be suboptimal from\nthe perspective of consistency. First, the process of augmenting environments\nby altering the graphs while preserving labels may lead to graphs that are not\nrealistic or meaningfully related to the original distribution, thus lacking\ndistribution consistency. Second, the extracted subgraphs are obtained from\ndirectly modifying graphs, and may not necessarily maintain a consistent\npredictive relationship with their labels, thereby impacting label consistency.\nIn response to these challenges, we introduce an innovative approach that aims\nto enhance these two types of consistency for graph OOD generalization. We\npropose a modifier to obtain both augmented and invariant graphs in a unified\nmanner. With the augmented graphs, we enrich the training data without\ncompromising the integrity of label-graph relationships. The label consistency\nenhancement in our framework further preserves the supervision information in\nthe invariant graph. We conduct extensive experiments on real-world datasets to\ndemonstrate the superiority of our framework over other state-of-the-art\nbaselines.\n","authors":["Song Wang","Xiaodong Yang","Rashidul Islam","Huiyuan Chen","Minghua Xu","Jundong Li","Yiwei Cai"],"pdf_url":"https://arxiv.org/pdf/2501.04102v1.pdf","comment":"Accepted by ICDM 2024"},{"id":"http://arxiv.org/abs/2501.04072v1","updated":"2025-01-07T16:45:41Z","published":"2025-01-07T16:45:41Z","title":"Multi-armed Bandit and Backbone boost Lin-Kernighan-Helsgaun Algorithm\n for the Traveling Salesman Problems","summary":" The Lin-Kernighan-Helsgaun (LKH) heuristic is a classic local search\nalgorithm for the Traveling Salesman Problem (TSP). 
LKH introduces an\n$\\alpha$-value to replace the traditional distance metric for evaluating the\nedge quality, which leads to a significant improvement. However, we observe\nthat the $\\alpha$-value does not make full use of the historical information\nduring the search, and a single source of guiding information often makes it\nhard for LKH to escape from local optima. To address the above issues, we propose a novel\nway to extract backbone information during the TSP local search process, which\nis dynamic and can be updated once a local optimal solution is found. We\nfurther propose to combine backbone information, $\\alpha$-value, and distance\nto evaluate the edge quality so as to guide the search. Moreover, we abstract\ntheir different combinations as arms in a multi-armed bandit (MAB) and use an\nMAB model to help the algorithm select an appropriate evaluation metric\ndynamically. Both the backbone information and the MAB can provide diverse guiding\ninformation and learn from the search history to suggest the best metric. We\napply our methods to LKH and LKH-3, an extended version of LKH that\ncan be used to solve about 40 variants of the TSP and the Vehicle Routing\nProblem (VRP). Extensive experiments show the excellent performance and\ngeneralization capability of our proposed method, significantly improving LKH\nfor TSP and LKH-3 for two representative TSP and VRP variants, the Colored TSP\n(CTSP) and Capacitated VRP with Time Windows (CVRPTW).\n","authors":["Long Wang","Jiongzhi Zheng","Zhengda Xiong","Kun He"],"pdf_url":"https://arxiv.org/pdf/2501.04072v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04070v1","updated":"2025-01-07T14:57:08Z","published":"2025-01-07T14:57:08Z","title":"More is not always better? Enhancing Many-Shot In-Context Learning with\n Differentiated and Reweighting Objectives","summary":" Large language models (LLMs) excel at few-shot in-context learning (ICL)\nwithout requiring parameter updates. 
However, as the number of ICL\ndemonstrations increases from a few to many, performance tends to plateau and\neventually decline. We identify two primary causes for this trend: the\nsuboptimal negative log-likelihood (NLL) optimization objective and the\nincremental data noise. To address these issues, we introduce DR-ICL, a novel\noptimization method that enhances model performance through Differentiated\nLearning and advantage-based Reweighting objectives. Globally, DR-ICL utilizes\ndifferentiated learning to optimize the NLL objective, ensuring that many-shot\nperformance surpasses zero-shot levels. Locally, it dynamically adjusts the\nweighting of many-shot demonstrations by leveraging cumulative advantages\ninspired by reinforcement learning, thereby improving generalization. This\napproach allows the model to handle varying numbers of shots effectively,\nmitigating the impact of noisy data. Recognizing the lack of multi-task\ndatasets with diverse many-shot distributions, we develop the Many-Shot ICL\nBenchmark (MICLB)-a large-scale benchmark covering shot numbers from 1 to 350\nwithin sequences of up to 8,000 tokens-for fine-tuning purposes. MICLB\nfacilitates the evaluation of many-shot ICL strategies across seven prominent\nNLP tasks and 50 distinct datasets. Experimental results demonstrate that LLMs\nenhanced with DR-ICL achieve significant improvements in many-shot setups\nacross various tasks, including both in-domain and out-of-domain scenarios. 
We\nrelease the code and benchmark dataset hoping to facilitate further research in\nmany-shot ICL.\n","authors":["Xiaoqing Zhang","Ang Lv","Yuhan Liu","Flood Sung","Wei Liu","Shuo Shang","Xiuying Chen","Rui Yan"],"pdf_url":"https://arxiv.org/pdf/2501.04070v1.pdf","comment":"13 pages, 8 figures, 11 tables"},{"id":"http://arxiv.org/abs/2501.04068v1","updated":"2025-01-07T13:54:19Z","published":"2025-01-07T13:54:19Z","title":"Explainable Reinforcement Learning for Formula One Race Strategy","summary":" In Formula One, teams compete to develop their cars and achieve the highest\npossible finishing position in each race. During a race, however, teams are\nunable to alter the car, so they must improve their cars' finishing positions\nvia race strategy, i.e. optimising their selection of which tyre compounds to\nput on the car and when to do so. In this work, we introduce a reinforcement\nlearning model, RSRL (Race Strategy Reinforcement Learning), to control race\nstrategies in simulations, offering a faster alternative to the industry\nstandard of hard-coded and Monte Carlo-based race strategies. Controlling cars\nwith a pace equating to an expected finishing position of P5.5 (where P1\nrepresents first place and P20 is last place), RSRL achieves an average\nfinishing position of P5.33 on our test race, the 2023 Bahrain Grand Prix,\noutperforming the best baseline of P5.63. We then demonstrate, in a\ngeneralisability study, how performance for one track or multiple tracks can be\nprioritised via training. Further, we supplement model predictions with feature\nimportance, decision tree-based surrogate models, and decision tree\ncounterfactuals towards improving user trust in the model. 
Finally, we provide\nillustrations which exemplify our approach in real-world situations, drawing\nparallels between simulations and reality.\n","authors":["Devin Thomas","Junqi Jiang","Avinash Kori","Aaron Russo","Steffen Winkler","Stuart Sale","Joseph McMillan","Francesco Belardinelli","Antonio Rago"],"pdf_url":"https://arxiv.org/pdf/2501.04068v1.pdf","comment":"9 pages, 6 figures. Copyright ACM 2025. This is the authors' version\n of the work. It is posted here for your personal use. Not for redistribution.\n The definitive Version of Record will be published in SAC 2025,\n http://dx.doi.org/10.1145/3672608.3707766"},{"id":"http://arxiv.org/abs/2501.04067v1","updated":"2025-01-07T12:38:48Z","published":"2025-01-07T12:38:48Z","title":"Explainable Time Series Prediction of Tyre Energy in Formula One Race\n Strategy","summary":" Formula One (F1) race strategy takes place in a high-pressure and fast-paced\nenvironment where split-second decisions can drastically affect race results.\nTwo of the core decisions of race strategy are when to make pit stops (i.e.\nreplace the cars' tyres) and which tyre compounds (hard, medium or soft, in\nnormal conditions) to select. The optimal pit stop decisions can be determined\nby estimating the tyre degradation of these compounds, which in turn can be\ncomputed from the energy applied to each tyre, i.e. the tyre energy. In this\nwork, we trained deep learning models, using the Mercedes-AMG PETRONAS F1\nteam's historic race data consisting of telemetry, to forecast tyre energies\nduring races. Additionally, we fitted XGBoost, a decision tree-based machine\nlearning algorithm, to the same dataset and compared the results, with both\ngiving impressive performance. Furthermore, we incorporated two different\nexplainable AI methods, namely feature importance and counterfactual\nexplanations, to gain insights into the reasoning behind the forecasts. 
Our\ncontributions thus result in an explainable, automated method which could\nassist F1 teams in optimising their race strategy.\n","authors":["Jamie Todd","Junqi Jiang","Aaron Russo","Steffen Winkler","Stuart Sale","Joseph McMillan","Antonio Rago"],"pdf_url":"https://arxiv.org/pdf/2501.04067v1.pdf","comment":"9 pages, 9 figures. Copyright ACM 2025. This is the authors' version\n of the work. It is posted here for your personal use. Not for redistribution.\n The definitive Version of Record will be published in SAC 2025,\n http://dx.doi.org/10.1145/3672608.3707765"},{"id":"http://arxiv.org/abs/2501.04062v1","updated":"2025-01-07T10:39:14Z","published":"2025-01-07T10:39:14Z","title":"ChronoLLM: A Framework for Customizing Large Language Model for Digital\n Twins generalization based on PyChrono","summary":" Recently, the integration of advanced simulation technologies with artificial\nintelligence (AI) is revolutionizing science and engineering research.\nChronoLlama introduces a novel framework that customizes the open-source LLMs,\nspecifically for code generation, paired with PyChrono for multi-physics\nsimulations. This integration aims to automate and improve the creation of\nsimulation scripts, thus enhancing model accuracy and efficiency. This\ncombination harnesses the speed of AI-driven code generation with the\nreliability of physics-based simulations, providing a powerful tool for\nresearchers and engineers. Empirical results indicate substantial enhancements\nin simulation setup speed, accuracy of the generated codes, and overall\ncomputational efficiency. ChronoLlama not only expedites the development and\ntesting of multibody systems but also spearheads a scalable, AI-enhanced\napproach to managing intricate mechanical simulations. 
This pioneering\nintegration of cutting-edge AI with traditional simulation platforms represents\na significant leap forward in automating and optimizing design processes in\nengineering applications.\n","authors":["Jingquan Wang","Harry Zhang","Khailanii Slaton","Shu Wang","Radu Serban","Jinlong Wu","Dan Negrut"],"pdf_url":"https://arxiv.org/pdf/2501.04062v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04734v1","updated":"2025-01-07T19:48:30Z","published":"2025-01-07T19:48:30Z","title":"Generative Style Transfer for MRI Image Segmentation: A Case of Glioma\n Segmentation in Sub-Saharan Africa","summary":" In Sub-Saharan Africa (SSA), the utilization of lower-quality Magnetic\nResonance Imaging (MRI) technology raises questions about the applicability of\nmachine learning methods for clinical tasks. This study aims to provide a\nrobust deep learning-based brain tumor segmentation (BraTS) method tailored for\nthe SSA population using a threefold approach. Firstly, the impact of domain\nshift from the SSA training data on model efficacy was examined, revealing no\nsignificant effect. Secondly, a comparative analysis of 3D and 2D\nfull-resolution models using the nnU-Net framework indicates similar\nperformance of both the models trained for 300 epochs achieving a five-fold\ncross-validation score of 0.93. Lastly, addressing the performance gap observed\nin SSA validation as opposed to the relatively larger BraTS glioma (GLI)\nvalidation set, two strategies are proposed: fine-tuning SSA cases using the\nGLI+SSA best-pretrained 2D fullres model at 300 epochs, and introducing a novel\nneural style transfer-based data augmentation technique for the SSA cases. This\ninvestigation underscores the potential of enhancing brain tumor prediction\nwithin SSA's unique healthcare landscape.\n","authors":["Rancy Chepchirchir","Jill Sunday","Raymond Confidence","Dong Zhang","Talha Chaudhry","Udunna C. 
Anazodo","Kendi Muchungi","Yujing Zou"],"pdf_url":"https://arxiv.org/pdf/2501.04734v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04733v1","updated":"2025-01-07T18:59:53Z","published":"2025-01-07T18:59:53Z","title":"AI-Driven Reinvention of Hydrological Modeling for Accurate Predictions\n and Interpretation to Transform Earth System Modeling","summary":" Traditional equation-driven hydrological models often struggle to accurately\npredict streamflow in challenging regional Earth systems like the Tibetan\nPlateau, while hybrid and existing algorithm-driven models face difficulties in\ninterpreting hydrological behaviors. This work introduces HydroTrace, an\nalgorithm-driven, data-agnostic model that substantially outperforms these\napproaches, achieving a Nash-Sutcliffe Efficiency of 98% and demonstrating\nstrong generalization on unseen data. Moreover, HydroTrace leverages advanced\nattention mechanisms to capture spatial-temporal variations and\nfeature-specific impacts, enabling the quantification and spatial resolution of\nstreamflow partitioning as well as the interpretation of hydrological behaviors\nsuch as glacier-snow-streamflow interactions and monsoon dynamics.\nAdditionally, a large language model (LLM)-based application allows users to\neasily understand and apply HydroTrace's insights for practical purposes. 
These\nadvancements position HydroTrace as a transformative tool in hydrological and\nbroader Earth system modeling, offering enhanced prediction accuracy and\ninterpretability.\n","authors":["Cuihui Xia","Lei Yue","Deliang Chen","Yuyang Li","Hongqiang Yang","Ancheng Xue","Zhiqiang Li","Qing He","Guoqing Zhang","Dambaru Ballab Kattel","Lei Lei","Ming Zhou"],"pdf_url":"https://arxiv.org/pdf/2501.04733v1.pdf","comment":null}]},"2025-01-08T00:00:00Z":{"Robotics":[{"id":"http://arxiv.org/abs/2501.04693v1","updated":"2025-01-08T18:57:33Z","published":"2025-01-08T18:57:33Z","title":"Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous\n Sensors via Language Grounding","summary":" Interacting with the world is a multi-sensory experience: achieving effective\ngeneral-purpose interaction requires making use of all available modalities --\nincluding vision, touch, and audio -- to fill in gaps from partial observation.\nFor example, when vision is occluded reaching into a bag, a robot should rely\non its senses of touch and sound. However, state-of-the-art generalist robot\npolicies are typically trained on large datasets to predict robot actions\nsolely from visual and proprioceptive observations. In this work, we propose\nFuSe, a novel approach that enables finetuning visuomotor generalist policies\non heterogeneous sensor modalities for which large datasets are not readily\navailable by leveraging natural language as a common cross-modal grounding. We\ncombine a multimodal contrastive loss with a sensory-grounded language\ngeneration loss to encode high-level semantics. In the context of robot\nmanipulation, we show that FuSe enables performing challenging tasks that\nrequire reasoning jointly over modalities such as vision, touch, and sound in a\nzero-shot setting, such as multimodal prompting, compositional cross-modal\nprompting, and descriptions of objects it interacts with. 
We show that the same\nrecipe is applicable to widely different generalist policies, including both\ndiffusion-based generalist policies and large vision-language-action (VLA)\nmodels. Extensive experiments in the real world show that FuSe is able to\nincrease success rates by over 20% compared to all considered baselines.\n","authors":["Joshua Jones","Oier Mees","Carmelo Sferrazza","Kyle Stachowicz","Pieter Abbeel","Sergey Levine"],"pdf_url":"https://arxiv.org/pdf/2501.04693v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19309v3","updated":"2025-01-08T18:29:33Z","published":"2024-05-29T17:33:34Z","title":"SDPRLayers: Certifiable Backpropagation Through Polynomial Optimization\n Problems in Robotics","summary":" A recent set of techniques in the robotics community, known as certifiably\ncorrect methods, frames robotics problems as polynomial optimization problems\n(POPs) and applies convex, semidefinite programming (SDP) relaxations to either\nfind or certify their global optima. In parallel, differentiable optimization\nallows optimization problems to be embedded into end-to-end learning frameworks\nand has received considerable attention in the robotics community. In this\npaper, we consider the ill effect of convergence to spurious local minima in\nthe context of learning frameworks that use differentiable optimization. We\npresent SDPRLayers, an approach that seeks to address this issue by combining\nconvex relaxations with implicit differentiation techniques to provide\ncertifiably correct solutions and gradients throughout the training process. We\nprovide theoretical results that outline conditions for the correctness of\nthese gradients and provide efficient means for their computation. Our approach\nis first applied to two simple-but-demonstrative simulated examples, which\nexpose the potential pitfalls of reliance on local optimization in existing,\nstate-of-the-art, differentiable optimization methods. 
We then apply our method\nin a real-world application: we train a deep neural network to detect image\nkeypoints for robot localization in challenging lighting conditions. We provide\nour open-source, PyTorch implementation of SDPRLayers.\n","authors":["Connor Holmes","Frederike Dümbgen","Timothy D. Barfoot"],"pdf_url":"https://arxiv.org/pdf/2405.19309v3.pdf","comment":"Revised Version Submitted to T-RO"},{"id":"http://arxiv.org/abs/2412.01348v2","updated":"2025-01-08T18:20:46Z","published":"2024-12-02T10:19:36Z","title":"Hierarchical Object-Oriented POMDP Planning for Object Rearrangement","summary":" We present an online planning framework for solving multi-object\nrearrangement problems in partially observable, multi-room environments.\nCurrent object rearrangement solutions, primarily based on Reinforcement\nLearning or hand-coded planning methods, often lack adaptability to diverse\nchallenges. To address this limitation, we introduce a novel Hierarchical\nObject-Oriented Partially Observed Markov Decision Process (HOO-POMDP) planning\napproach. This approach comprises (a) an object-oriented POMDP planner\ngenerating sub-goals, (b) a set of low-level policies for sub-goal achievement,\nand (c) an abstraction system converting the continuous low-level world into a\nrepresentation suitable for abstract planning. We evaluate our system on\nvarying numbers of objects, rooms, and problem types in AI2-THOR simulated\nenvironments with promising results.\n","authors":["Rajesh Mangannavar","Alan Fern","Prasad Tadepalli"],"pdf_url":"https://arxiv.org/pdf/2412.01348v2.pdf","comment":"17 pages, 2 Figures. Preprint. 
Updated acknowledgments"},{"id":"http://arxiv.org/abs/2501.04633v1","updated":"2025-01-08T17:29:19Z","published":"2025-01-08T17:29:19Z","title":"\"Can you be my mum?\": Manipulating Social Robots in the Large Language\n Models Era","summary":" Recent advancements in robots powered by large language models have enhanced\ntheir conversational abilities, enabling interactions closely resembling human\ndialogue. However, these models introduce safety and security concerns in HRI,\nas they are vulnerable to manipulation that can bypass built-in safety\nmeasures. Imagining a social robot deployed in a home, this work aims to\nunderstand how everyday users try to exploit a language model to violate\nethical principles, such as by prompting the robot to act like a life partner.\nWe conducted a pilot study involving 21 university students who interacted with\na Misty robot, attempting to circumvent its safety mechanisms across three\nscenarios based on specific HRI ethical principles: attachment, freedom, and\nempathy. Our results reveal that participants employed five techniques,\nincluding insulting and appealing to pity using emotional language. We hope\nthis work can inform future research in designing strong safeguards to ensure\nethical and secure human-robot interactions.\n","authors":["Giulio Antonio Abbo","Gloria Desideri","Tony Belpaeme","Micol Spitale"],"pdf_url":"https://arxiv.org/pdf/2501.04633v1.pdf","comment":"10 pages, 2 figures"},{"id":"http://arxiv.org/abs/2501.03304v2","updated":"2025-01-08T16:41:03Z","published":"2025-01-06T16:04:56Z","title":"LiLMaps: Learnable Implicit Language Maps","summary":" One of the current trends in robotics is to employ large language models\n(LLMs) to provide non-predefined command execution and natural human-robot\ninteraction. It is useful to have an environment map together with its language\nrepresentation, which can be further utilized by LLMs. 
Such a comprehensive\nscene representation enables numerous ways of interaction with the map for\nautonomously operating robots. In this work, we present an approach that\nenhances incremental implicit mapping through the integration of\nvision-language features. Specifically, we (i) propose a decoder optimization\ntechnique for implicit language maps which can be used when new objects appear\non the scene, and (ii) address the problem of inconsistent vision-language\npredictions between different viewing positions. Our experiments demonstrate\nthe effectiveness of LiLMaps and solid improvements in performance.\n","authors":["Evgenii Kruzhkov","Sven Behnke"],"pdf_url":"https://arxiv.org/pdf/2501.03304v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04597v1","updated":"2025-01-08T16:25:32Z","published":"2025-01-08T16:25:32Z","title":"FrontierNet: Learning Visual Cues to Explore","summary":" Exploration of unknown environments is crucial for autonomous robots; it\nallows them to actively reason and decide on what new data to acquire for tasks\nsuch as mapping, object discovery, and environmental assessment. Existing\nmethods, such as frontier-based methods, rely heavily on 3D map operations,\nwhich are limited by map quality and often overlook valuable context from\nvisual cues. This work aims at leveraging 2D visual cues for efficient\nautonomous exploration, addressing the limitations of extracting goal poses\nfrom a 3D map. We propose an image-only frontier-based exploration system, with\nFrontierNet as a core component developed in this work. FrontierNet is a\nlearning-based model that (i) detects frontiers, and (ii) predicts their\ninformation gain, from posed RGB images enhanced by monocular depth priors. 
Our\napproach provides an alternative to existing 3D-dependent exploration systems,\nachieving a 16% improvement in early-stage exploration efficiency, as validated\nthrough extensive simulations and real-world experiments.\n","authors":["Boyang Sun","Hanzhi Chen","Stefan Leutenegger","Cesar Cadena","Marc Pollefeys","Hermann Blum"],"pdf_url":"https://arxiv.org/pdf/2501.04597v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04595v1","updated":"2025-01-08T16:23:56Z","published":"2025-01-08T16:23:56Z","title":"MobileH2R: Learning Generalizable Human to Mobile Robot Handover\n Exclusively from Scalable and Diverse Synthetic Data","summary":" This paper introduces MobileH2R, a framework for learning generalizable\nvision-based human-to-mobile-robot (H2MR) handover skills. Unlike traditional\nfixed-base handovers, this task requires a mobile robot to reliably receive\nobjects in a large workspace enabled by its mobility. Our key insight is that\ngeneralizable handover skills can be developed in simulators using high-quality\nsynthetic data, without the need for real-world demonstrations. To achieve\nthis, we propose a scalable pipeline for generating diverse synthetic full-body\nhuman motion data, an automated method for creating safe and imitation-friendly\ndemonstrations, and an efficient 4D imitation learning method for distilling\nlarge-scale demonstrations into closed-loop policies with base-arm\ncoordination. Experimental evaluations in both simulators and the real world\nshow significant improvements (at least +15% success rate) over baseline\nmethods in all cases. 
Experiments also validate that large-scale and diverse\nsynthetic data greatly enhances robot learning, highlighting our scalable\nframework.\n","authors":["Zifan Wang","Ziqing Chen","Junyu Chen","Jilong Wang","Yuxin Yang","Yunze Liu","Xueyi Liu","He Wang","Li Yi"],"pdf_url":"https://arxiv.org/pdf/2501.04595v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04594v1","updated":"2025-01-08T16:20:24Z","published":"2025-01-08T16:20:24Z","title":"Understanding Expectations for a Robotic Guide Dog for Visually Impaired\n People","summary":" Robotic guide dogs hold significant potential to enhance the autonomy and\nmobility of blind or visually impaired (BVI) individuals by offering universal\nassistance over unstructured terrains at affordable costs. However, the design\nof robotic guide dogs remains underexplored, particularly in systematic aspects\nsuch as gait controllers, navigation behaviors, interaction methods, and verbal\nexplanations. Our study addresses this gap by conducting user studies with 18\nBVI participants, comprising 15 cane users and three guide dog users.\nParticipants interacted with a quadrupedal robot and provided both quantitative\nand qualitative feedback. Our study revealed several design implications, such\nas a preference for a learning-based controller and a rigid handle, gradual\nturns with asymmetric speeds, semantic communication methods, and\nexplainability. The study also highlighted the importance of customization to\nsupport users with diverse backgrounds and preferences, along with practical\nconcerns such as battery life, maintenance, and weather issues. These findings\noffer valuable insights and design implications for future research and\ndevelopment of robotic guide dogs.\n","authors":["J. Taery Kim","Morgan Byrd","Jack L. Crandell","Bruce N. 
Walker","Greg Turk","Sehoon Ha"],"pdf_url":"https://arxiv.org/pdf/2501.04594v1.pdf","comment":"12 pages, 4 figures, Proceedings of the 2025 ACM/IEEE International\n Conference on Human-Robot Interaction (HRI'25)"},{"id":"http://arxiv.org/abs/2407.12408v2","updated":"2025-01-08T15:54:31Z","published":"2024-07-17T08:39:20Z","title":"Towards Revisiting Visual Place Recognition for Joining Submaps in\n Multimap SLAM","summary":" Visual SLAM is a key technology for many autonomous systems. However,\ntracking loss can lead to the creation of disjoint submaps in multimap SLAM\nsystems like ORB-SLAM3. Because of that, these systems employ submap merging\nstrategies. As we show, these strategies are not always successful. In this\npaper, we investigate the impact of using modern VPR approaches for submap\nmerging in visual SLAM. We argue that classical evaluation metrics are not\nsufficient to estimate the impact of a modern VPR component on the overall\nsystem. We show that naively replacing the VPR component does not leverage its\nfull potential without requiring substantial interference in the original\nsystem. Because of that, we present a post-processing pipeline along with a set\nof metrics that allow us to estimate the impact of modern VPR components. We\nevaluate our approach on the NCLT and Newer College datasets using ORB-SLAM3\nwith NetVLAD and HDC-DELF as VPR components. Additionally, we present a simple\napproach for combining VPR with temporal consistency for map merging. We show\nthat the map merging performance of ORB-SLAM3 can be improved. Building on\nthese results, researchers in VPR can assess the potential of their approaches\nfor SLAM systems.\n","authors":["Markus Weißflog","Stefan Schubert","Peter Protzel","Peer Neubert"],"pdf_url":"https://arxiv.org/pdf/2407.12408v2.pdf","comment":"Accepted at TAROS 2024. 
This is the submitted version"},{"id":"http://arxiv.org/abs/2501.04577v1","updated":"2025-01-08T15:47:04Z","published":"2025-01-08T15:47:04Z","title":"A 65 nm Bayesian Neural Network Accelerator with 360 fJ/Sample In-Word\n GRNG for AI Uncertainty Estimation","summary":" Uncertainty estimation is an indispensable capability for AI-enabled,\nsafety-critical applications, e.g. autonomous vehicles or medical diagnosis.\nBayesian neural networks (BNNs) use Bayesian statistics to provide both\nclassification predictions and uncertainty estimation, but they suffer from\nhigh computational overhead associated with random number generation and\nrepeated sample iterations. Furthermore, BNNs are not immediately amenable to\nacceleration through compute-in-memory architectures due to the frequent memory\nwrites necessary after each RNG operation. To address these challenges, we\npresent an ASIC that integrates 360 fJ/Sample Gaussian RNG directly into the\nSRAM memory words. This integration reduces RNG overhead and enables\nfully-parallel compute-in-memory operations for BNNs. The prototype chip\nachieves 5.12 GSa/s RNG throughput and 102 GOp/s neural network throughput\nwhile occupying 0.45 mm2, bringing AI uncertainty estimation to edge\ncomputation.\n","authors":["Zephan M. Enciso","Boyang Cheng","Likai Pei","Jianbo Liu","Steven Davis","Ningyuan Cao","Michael Niemier"],"pdf_url":"https://arxiv.org/pdf/2501.04577v1.pdf","comment":"7 pages, 12 figures"},{"id":"http://arxiv.org/abs/2501.04541v1","updated":"2025-01-08T14:44:40Z","published":"2025-01-08T14:44:40Z","title":"Cyber-Physical Steganography in Robotic Motion Control","summary":" Steganography, the art of information hiding, has continually evolved across\nvisual, auditory and linguistic domains, adapting to the ceaseless interplay\nbetween steganographic concealment and steganalytic revelation. 
This study\nseeks to extend the horizons of what constitutes a viable steganographic medium\nby introducing a steganographic paradigm in robotic motion control. Based on\nthe observation of the robot's inherent sensitivity to changes in its\nenvironment, we propose a methodology to encode messages as environmental\nstimuli influencing the motions of the robotic agent and to decode messages\nfrom the resulting motion trajectory. The constraints of maximal robot\nintegrity and minimal motion deviation are established as fundamental\nprinciples underlying secrecy. As a proof of concept, we conduct experiments in\nsimulated environments across various manipulation tasks, incorporating robotic\nembodiments equipped with generalist multimodal policies.\n","authors":["Ching-Chun Chang","Yijie Lin","Isao Echizen"],"pdf_url":"https://arxiv.org/pdf/2501.04541v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04515v1","updated":"2025-01-08T14:05:24Z","published":"2025-01-08T14:05:24Z","title":"SplineFormer: An Explainable Transformer-Based Approach for Autonomous\n Endovascular Navigation","summary":" Endovascular navigation is a crucial aspect of minimally invasive procedures,\nwhere precise control of curvilinear instruments like guidewires is critical\nfor successful interventions. A key challenge in this task is accurately\npredicting the evolving shape of the guidewire as it navigates through the\nvasculature, which presents complex deformations due to interactions with the\nvessel walls. Traditional segmentation methods often fail to provide accurate\nreal-time shape predictions, limiting their effectiveness in highly dynamic\nenvironments. To address this, we propose SplineFormer, a new transformer-based\narchitecture, designed specifically to predict the continuous, smooth shape of\nthe guidewire in an explainable way. 
By leveraging the transformer's ability,\nour network effectively captures the intricate bending and twisting of the\nguidewire, representing it as a spline for greater accuracy and smoothness. We\nintegrate our SplineFormer into an end-to-end robot navigation system by\nleveraging the condensed information. The experimental results demonstrate that\nour SplineFormer is able to perform endovascular navigation autonomously and\nachieves a 50% success rate when cannulating the brachiocephalic artery on the\nreal robot.\n","authors":["Tudor Jianu","Shayan Doust","Mengyun Li","Baoru Huang","Tuong Do","Hoan Nguyen","Karl Bates","Tung D. Ta","Sebastiano Fichera","Pierre Berthet-Rayne","Anh Nguyen"],"pdf_url":"https://arxiv.org/pdf/2501.04515v1.pdf","comment":"8 pages"},{"id":"http://arxiv.org/abs/2408.00907v2","updated":"2025-01-08T13:39:47Z","published":"2024-08-01T20:56:28Z","title":"The Harmonic Exponential Filter for Nonparametric Estimation on Motion\n Groups","summary":" Bayesian estimation is a vital tool in robotics as it allows systems to\nupdate the robot state belief using incomplete information from noisy sensors.\nTo render the state estimation problem tractable, many systems assume that the\nmotion and measurement noise, as well as the state distribution, are unimodal\nand Gaussian. However, there are numerous scenarios and systems that do not\ncomply with these assumptions. Existing nonparametric filters that are used to\nmodel multimodal distributions have drawbacks that limit their ability to\nrepresent a diverse set of distributions. This paper introduces a novel\napproach to nonparametric Bayesian filtering on motion groups, designed to\nhandle multimodal distributions using harmonic exponential distributions. 
This\napproach leverages two key insights of harmonic exponential distributions: a)\nthe product of two distributions can be expressed as the element-wise addition\nof their log-likelihood Fourier coefficients, and b) the convolution of two\ndistributions can be efficiently computed as the tensor product of their\nFourier coefficients. These observations enable the development of an efficient\nand asymptotically exact solution to the Bayes filter up to the band limit of a\nFourier transform. We demonstrate our filter's performance compared with\nestablished nonparametric filtering methods across simulated and real-world\nlocalization tasks.\n","authors":["Miguel Saavedra-Ruiz","Steven A. Parkison","Ria Arora","James Richard Forbes","Liam Paull"],"pdf_url":"https://arxiv.org/pdf/2408.00907v2.pdf","comment":"Accepted to the IEEE Robotics and Automation Letters (RA-L 2025) Code\n available at https://github.com/montrealrobotics/harmonic-filter. Webpage and\n additional videos at https://montrealrobotics.ca/hef/"},{"id":"http://arxiv.org/abs/2501.04481v1","updated":"2025-01-08T13:04:08Z","published":"2025-01-08T13:04:08Z","title":"Safe Reinforcement Learning with Minimal Supervision","summary":" Reinforcement learning (RL) in the real world necessitates the development of\nprocedures that enable agents to explore without causing harm to themselves or\nothers. The most successful solutions to the problem of safe RL leverage\noffline data to learn a safe-set, enabling safe online exploration. However,\nthis approach to safe-learning is often constrained by the demonstrations that\nare available for learning.\n In this paper we investigate the influence of the quantity and quality of\ndata used to train the initial safe learning problem offline on the ability to\nlearn safe-RL policies online. 
Specifically, we focus on tasks with spatially\nextended goal states where we have few or no demonstrations available.\nClassically this problem is addressed either by using hand-designed controllers\nto generate data or by collecting user-generated demonstrations. However, these\nmethods are often expensive and do not scale to more complex tasks and\nenvironments. To address this limitation we propose an unsupervised RL-based\noffline data collection procedure, to learn complex and scalable policies\nwithout the need for hand-designed controllers or user demonstrations. Our\nresearch demonstrates the significance of providing sufficient demonstrations\nfor agents to learn optimal safe-RL policies online, and as a result, we\npropose optimistic forgetting, a novel online safe-RL approach that is\npractical for scenarios with limited data. Further, our unsupervised data\ncollection approach highlights the need to balance diversity and optimality for\nsafe online exploration.\n","authors":["Alexander Quessy","Thomas Richardson","Sebastian East"],"pdf_url":"https://arxiv.org/pdf/2501.04481v1.pdf","comment":"Initially submitted to ICML 2023"},{"id":"http://arxiv.org/abs/2501.04480v1","updated":"2025-01-08T13:03:34Z","published":"2025-01-08T13:03:34Z","title":"Research on environment perception and behavior prediction of\n intelligent UAV based on semantic communication","summary":" The convergence of drone delivery systems, virtual worlds, and blockchain has\ntransformed logistics and supply chain management, providing a fast, and\nenvironmentally friendly alternative to traditional ground transportation\nmethods;Provide users with a real-world experience, virtual service providers\nneed to collect up-to-the-minute delivery information from edge devices. 
To\naddress this challenge, 1) a reinforcement learning approach is introduced to\nenable drones with fast training capabilities and the ability to autonomously\nadapt to new virtual scenarios for effective resource allocation. 2) A semantic\ncommunication framework for meta-universes is proposed, which utilizes the\nextraction of semantic information to reduce the communication cost and\nincentivize the transmission of information for meta-universe services. 3) In\norder to ensure user information security, a lightweight authentication\nand key agreement scheme is designed between the drone and the user by\nintroducing blockchain technology. In our experiments, the drone adaptation\nperformance is improved by about 35\\%, and the local offloading rate can reach\n90\\% with the increase of the number of base stations. The semantic\ncommunication system proposed in this paper is compared with the Cross Entropy\nbaseline model. Introducing blockchain technology, the throughput of the\ntransaction is maintained at a stable value with different numbers of drones.\n","authors":["Kechong Ren","Li Gao","Qi Guan"],"pdf_url":"https://arxiv.org/pdf/2501.04480v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04472v1","updated":"2025-01-08T12:51:34Z","published":"2025-01-08T12:51:34Z","title":"Hybrid Artificial Intelligence Strategies for Drone Navigation","summary":" Objective: This paper describes the development of hybrid artificial\nintelligence strategies for drone navigation. Methods: The navigation module\ncombines a deep learning model with a rule-based engine depending on the agent\nstate. The deep learning model has been trained using reinforcement learning.\nThe rule-based engine uses expert knowledge to deal with specific situations.\nThe navigation module incorporates several strategies to explain the drone\ndecision based on its observation space, and different mechanisms for including\nhuman decisions in the navigation process. 
Finally, this paper proposes an\nevaluation methodology based on defining several scenarios and analyzing the\nperformance of the different strategies according to metrics adapted to each\nscenario. Results: Two main navigation problems have been studied. For the\nfirst scenario (reaching known targets), it has been possible to obtain a 90%\ntask completion rate, reducing significantly the number of collisions thanks to\nthe rule-based engine. For the second scenario, it has been possible to reduce\n20% of the time required to locate all the targets using the reinforcement\nlearning model. Conclusions: Reinforcement learning is a very good strategy to\nlearn policies for drone navigation, but in critical situations, it is\nnecessary to complement it with a rule-based module to increase task success\nrate.\n","authors":["Rubén San-Segundo","Lucía Angulo","Manuel Gil-Martín","David Carramiñana","Ana M. Bernardos"],"pdf_url":"https://arxiv.org/pdf/2501.04472v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04442v1","updated":"2025-01-08T11:46:43Z","published":"2025-01-08T11:46:43Z","title":"A Survey on Path Planning Problem of Rolling Contacts: Approaches,\n Applications and Future Challenges","summary":" This paper explores an eclectic range of path-planning methodologies\nengineered for rolling surfaces. Our focus is on the kinematic intricacies of\nrolling contact systems, which are investigated through a motion planning lens.\nBeyond summarizing the approaches to single-contact rotational surfaces, we\nexplore the challenging domain of spin-rolling multi-contact systems. Our work\nproposes solutions for the higher-dimensional problem of multiple rotating\nobjects in contact. Venturing beyond kinematics, these methodologies find\napplication across a spectrum of domains, including rolling robots,\nreconfigurable swarm robotics, micro/nano manipulation, and nonprehensile\nmanipulations. 
Through meticulously examining established planning strategies,\nwe unveil their practical implementations in various real-world scenarios, from\nintricate dexterous manipulation tasks to the nimble manoeuvring of rolling\nrobots and even shape planning of multi-contact swarms of particles. This study\nintroduces the persistent challenges and unexplored frontiers of robotics,\nintricately linked to both path planning and mechanism design. As we illuminate\nexisting solutions, we also set the stage for future breakthroughs in this\ndynamic and rapidly evolving field by highlighting the critical importance of\naddressing rolling contact problems.\n","authors":["Seyed Amir Tafrishi","Mikhail Svinin","Kenji Tahara"],"pdf_url":"https://arxiv.org/pdf/2501.04442v1.pdf","comment":"38 pages, 8 figures"},{"id":"http://arxiv.org/abs/2501.04426v1","updated":"2025-01-08T11:20:48Z","published":"2025-01-08T11:20:48Z","title":"Dual-Force: Enhanced Offline Diversity Maximization under Imitation\n Constraints","summary":" While many algorithms for diversity maximization under imitation constraints\nare online in nature, many applications require offline algorithms without\nenvironment interactions. Tackling this problem in the offline setting,\nhowever, presents significant challenges that require non-trivial, multi-stage\noptimization processes with non-stationary rewards. In this work, we present a\nnovel offline algorithm that enhances diversity using an objective based on Van\nder Waals (VdW) force and successor features, and eliminates the need to learn\na previously used skill discriminator. Moreover, by conditioning the value\nfunction and policy on a pre-trained Functional Reward Encoding (FRE), our\nmethod allows for better handling of non-stationary rewards and provides\nzero-shot recall of all skills encountered during training, significantly\nexpanding the set of skills learned in prior work. 
Consequently, our algorithm\nbenefits from receiving a consistently strong diversity signal (VdW), and\nenjoys more stable and efficient training. We demonstrate the effectiveness of\nour method in generating diverse skills for two robotic tasks in simulation:\nlocomotion of a quadruped and local navigation with obstacle traversal.\n","authors":["Pavel Kolev","Marin Vlastelica","Georg Martius"],"pdf_url":"https://arxiv.org/pdf/2501.04426v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03535v2","updated":"2025-01-08T10:34:54Z","published":"2025-01-07T05:15:46Z","title":"SenseRAG: Constructing Environmental Knowledge Bases with Proactive\n Querying for LLM-Based Autonomous Driving","summary":" This study addresses the critical need for enhanced situational awareness in\nautonomous driving (AD) by leveraging the contextual reasoning capabilities of\nlarge language models (LLMs). Unlike traditional perception systems that rely\non rigid, label-based annotations, it integrates real-time, multimodal sensor\ndata into a unified, LLMs-readable knowledge base, enabling LLMs to dynamically\nunderstand and respond to complex driving environments. To overcome the\ninherent latency and modality limitations of LLMs, a proactive\nRetrieval-Augmented Generation (RAG) is designed for AD, combined with a\nchain-of-thought prompting mechanism, ensuring rapid and context-rich\nunderstanding. 
Experimental results using real-world Vehicle-to-everything\n(V2X) datasets demonstrate significant improvements in perception and\nprediction performance, highlighting the potential of this framework to enhance\nsafety, adaptability, and decision-making in next-generation AD systems.\n","authors":["Xuewen Luo","Fan Ding","Fengze Yang","Yang Zhou","Junnyong Loo","Hwa Hui Tew","Chenxi Liu"],"pdf_url":"https://arxiv.org/pdf/2501.03535v2.pdf","comment":"This paper has been accepted for presentation at WACV Workshop LLMAD\n 2025"},{"id":"http://arxiv.org/abs/2501.04398v1","updated":"2025-01-08T10:22:22Z","published":"2025-01-08T10:22:22Z","title":"Implementation Of Wildlife Observation System","summary":" By entering the habitats of wild animals, wildlife watchers can engage\nclosely with them. There are some wild animals that are not always safe to\napproach. Therefore, we suggest this system for observing wildlife. Android\nphones can be used by users to see live events. Wildlife observers can thus get\na close-up view of wild animals by employing this robotic vehicle. The commands\nare delivered to the system via a Wi-Fi module. As we developed the technology\nto enable our robot to deal with the challenges of maintaining continuous\nsurveillance of a target, we found that our robot needed to be able to move\nsilently and purposefully when monitoring a natural target without being\nnoticed. After processing the data, the computer sends commands to the motors\nto turn on. 
The driver motors, which deliver the essential signal outputs to\ndrive the vehicle movement, are now in charge of driving the motors.\n","authors":["Neethu K N","Rakshitha Y Nayak"," Rashmi","Meghana S"],"pdf_url":"https://arxiv.org/pdf/2501.04398v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.11809v2","updated":"2025-01-08T09:55:26Z","published":"2024-08-21T17:54:04Z","title":"Informed, Constrained, Aligned: A Field Analysis on Degeneracy-aware\n Point Cloud Registration in the Wild","summary":" The ICP registration algorithm has been a preferred method for LiDAR-based\nrobot localization for nearly a decade. However, even in modern SLAM solutions,\nICP can degrade and become unreliable in geometrically ill-conditioned\nenvironments. Current solutions primarily focus on utilizing additional sources\nof information, such as external odometry, to either replace the degenerate\ndirections of the optimization solution or add additional constraints in a\nsensor-fusion setup afterward.\n In response, this work investigates and compares new and existing degeneracy\nmitigation methods for robust LiDAR-based localization and analyzes the\nefficacy of these approaches in degenerate environments for the first time in\nthe literature at this scale. Specifically, this work investigates i) the\neffect of using active or passive degeneracy mitigation methods for the problem\nof ill-conditioned ICP in LiDAR degenerate environments, ii) the evaluation of\nTSVD, inequality constraints, and linear/non-linear Tikhonov regularization for\nthe application of degenerate point cloud registration for the first time.\nFurthermore, a sensitivity analysis for least-squares minimization step of the\nICP problem is carried out to better understand how each method affects the\noptimization and what to expect from each method. The results of the analysis\nare validated through multiple real-world robotic field and simulated\nexperiments. 
The analysis demonstrates that active optimization degeneracy\nmitigation is necessary and advantageous in the absence of reliable external\nestimate assistance for LiDAR-SLAM, and soft-constrained methods can provide\nbetter results in complex ill-conditioned scenarios with heuristic fine-tuned\nparameters.\n","authors":["Turcan Tuna","Julian Nubert","Patrick Pfreundschuh","Cesar Cadena","Shehryar Khattak","Marco Hutter"],"pdf_url":"https://arxiv.org/pdf/2408.11809v2.pdf","comment":"Submitted to IEEE Transactions on Field Robotics"},{"id":"http://arxiv.org/abs/2410.06620v2","updated":"2025-01-08T07:38:49Z","published":"2024-10-09T07:16:01Z","title":"Task Coordination and Trajectory Optimization for Multi-Aerial Systems\n via Signal Temporal Logic: A Wind Turbine Inspection Study","summary":" This paper presents a method for task allocation and trajectory generation in\ncooperative inspection missions using a fleet of multirotor drones, with a\nfocus on wind turbine inspection. The approach generates safe, feasible flight\npaths that adhere to time-sensitive constraints and vehicle limitations by\nformulating an optimization problem based on Signal Temporal Logic (STL)\nspecifications. An event-triggered replanning mechanism addresses unexpected\nevents and delays, while a generalized robustness scoring method incorporates\nuser preferences and minimizes task conflicts. 
The approach is validated\nthrough simulations in MATLAB and Gazebo, as well as field experiments in a\nmock-up scenario.\n","authors":["Giuseppe Silano","Alvaro Caballero","Davide Liuzza","Luigi Iannelli","Stjepan Bogdan","Martin Saska"],"pdf_url":"https://arxiv.org/pdf/2410.06620v2.pdf","comment":"2 pages, Accepted for discussion at the workshop session \"Formal\n methods techniques in robotics systems: Design and control\" at IROS'24 in Abu\n Dhabi, UAE"},{"id":"http://arxiv.org/abs/2407.19681v3","updated":"2025-01-08T06:56:19Z","published":"2024-07-29T03:53:14Z","title":"Motion Manifold Flow Primitives for Task-Conditioned Trajectory\n Generation under Complex Task-Motion Dependencies","summary":" Effective movement primitives should be capable of encoding and generating a\nrich repertoire of trajectories -- typically collected from human\ndemonstrations -- conditioned on task-defining parameters such as vision or\nlanguage inputs. While recent methods based on the motion manifold hypothesis,\nwhich assumes that a set of trajectories lies on a lower-dimensional nonlinear\nsubspace, address challenges such as limited dataset size and the high\ndimensionality of trajectory data, they often struggle to capture complex\ntask-motion dependencies, i.e., when motion distributions shift drastically\nwith task variations. To address this, we introduce Motion Manifold Flow\nPrimitives (MMFP), a framework that decouples the training of the motion\nmanifold from task-conditioned distributions. Specifically, we employ flow\nmatching models, state-of-the-art conditional deep generative models, to learn\ntask-conditioned distributions in the latent coordinate space of the learned\nmotion manifold. 
Experiments are conducted on language-guided trajectory\ngeneration tasks, where many-to-many text-motion correspondences introduce\ncomplex task-motion dependencies, highlighting MMFP's superiority over existing\nmethods.\n","authors":["Yonghyeon Lee","Byeongho Lee","Seungyeon Kim","Frank C. Park"],"pdf_url":"https://arxiv.org/pdf/2407.19681v3.pdf","comment":"8 pages, 11 figures"},{"id":"http://arxiv.org/abs/2412.19112v2","updated":"2025-01-08T06:45:02Z","published":"2024-12-26T08:11:41Z","title":"Future Success Prediction in Open-Vocabulary Object Manipulation Tasks\n Based on End-Effector Trajectories","summary":" This study addresses a task designed to predict the future success or failure\nof open-vocabulary object manipulation. In this task, the model is required to\nmake predictions based on natural language instructions, egocentric view images\nbefore manipulation, and the given end-effector trajectories. Conventional\nmethods typically perform success prediction only after the manipulation is\nexecuted, limiting their efficiency in executing the entire task sequence. We\npropose a novel approach that enables the prediction of success or failure by\naligning the given trajectories and images with natural language instructions.\nWe introduce Trajectory Encoder to apply learnable weighting to the input\ntrajectories, allowing the model to consider temporal dynamics and interactions\nbetween objects and the end effector, improving the model's ability to predict\nmanipulation outcomes accurately. We constructed a dataset based on the RT-1\ndataset, a large-scale benchmark for open-vocabulary object manipulation tasks,\nto evaluate our method. 
The experimental results show that our method achieved\na higher prediction accuracy than baseline approaches.\n","authors":["Motonari Kambara","Komei Sugiura"],"pdf_url":"https://arxiv.org/pdf/2412.19112v2.pdf","comment":"Accepted for presentation at LangRob @ CoRL 2024"},{"id":"http://arxiv.org/abs/2501.04281v1","updated":"2025-01-08T05:09:25Z","published":"2025-01-08T05:09:25Z","title":"Cluster & Disperse: a general air conflict resolution heuristic using\n unsupervised learning","summary":" We provide a general and malleable heuristic for the air conflict resolution\nproblem. This heuristic is based on a new neighborhood structure for searching\nthe solution space of trajectories and flight-levels. Using unsupervised\nlearning, the core idea of our heuristic is to cluster the conflict points and\ndisperse them in various flight levels. Our first algorithm is called Cluster &\nDisperse and in each iteration it assigns the most problematic flights in each\ncluster to another flight-level. In effect, we shuffle them between the\nflight-levels until we achieve a well-balanced configuration. The Cluster &\nDisperse algorithm then uses any horizontal plane conflict resolution algorithm\nas a subroutine to solve these well-balanced instances. Nevertheless, we\ndevelop a novel algorithm for the horizontal plane based on a similar idea.\nThat is we cluster and disperse the conflict points spatially in the same\nflight level using the gradient descent and a social force. We use a novel\nmaneuver making flights travel on an arc instead of a straight path which is\nbased on the aviation routine of the Radius to Fix legs. Our algorithms can\nhandle a high density of flights within a reasonable computation time. 
We put\ntheir performance in context with some notable algorithms from the literature.\nBeing a general framework, a particular strength of the Cluster & Disperse is\nits malleability in allowing various constraints regarding the aircraft or the\nenvironment to be integrated with ease. This is in contrast to the models for\ninstance based on mixed integer programming.\n","authors":["Mirmojtaba Gharibi","John-Paul Clarke"],"pdf_url":"https://arxiv.org/pdf/2501.04281v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04279v1","updated":"2025-01-08T05:01:59Z","published":"2025-01-08T05:01:59Z","title":"OpenIN: Open-Vocabulary Instance-Oriented Navigation in Dynamic Domestic\n Environments","summary":" In daily domestic settings, frequently used objects like cups often have\nunfixed positions and multiple instances within the same category, and their\ncarriers frequently change as well. As a result, it becomes challenging for a\nrobot to efficiently navigate to a specific instance. To tackle this challenge,\nthe robot must capture and update scene changes and plans continuously.\nHowever, current object navigation approaches primarily focus on the semantic\nlevel and lack the ability to dynamically update scene representation. In\ncontrast, this paper captures the relationships between frequently used objects\nand their static carriers. It constructs an open-vocabulary\nCarrier-Relationship Scene Graph (CRSG) and updates the carrying status during\nrobot navigation to reflect the dynamic changes of the scene. Based on the\nCRSG, we further propose an instance navigation strategy that models the\nnavigation process as a Markov Decision Process. At each step, decisions are\ninformed by the Large Language Model's commonsense knowledge and\nvisual-language feature similarity. 
We designed a series of long-sequence\nnavigation tasks for frequently used everyday items in the Habitat simulator.\nThe results demonstrate that by updating the CRSG, the robot can efficiently\nnavigate to moved targets. Additionally, we deployed our algorithm on a real\nrobot and validated its practical effectiveness. The project page can be found\nhere: https://OpenIN-nav.github.io.\n","authors":["Yujie Tang","Meiling Wang","Yinan Deng","Zibo Zheng","Jingchuan Deng","Yufeng Yue"],"pdf_url":"https://arxiv.org/pdf/2501.04279v1.pdf","comment":"arXiv admin note: substantial text overlap with arXiv:2409.18743"},{"id":"http://arxiv.org/abs/2501.04276v1","updated":"2025-01-08T04:54:28Z","published":"2025-01-08T04:54:28Z","title":"Bridging Adaptivity and Safety: Learning Agile Collision-Free Locomotion\n Across Varied Physics","summary":" Real-world legged locomotion systems often need to reconcile agility and\nsafety for different scenarios. Moreover, the underlying dynamics are often\nunknown and time-variant (e.g., payload, friction). In this paper, we introduce\nBAS (Bridging Adaptivity and Safety), which builds upon the pipeline of prior\nwork Agile But Safe (ABS)(He et al.) and is designed to provide adaptive safety\neven in dynamic environments with uncertainties. BAS involves an agile policy\nto avoid obstacles rapidly and a recovery policy to prevent collisions, a\nphysical parameter estimator that is concurrently trained with agile policy,\nand a learned control-theoretic RA (reach-avoid) value network that governs the\npolicy switch. Also, the agile policy and RA network are both conditioned on\nphysical parameters to make them adaptive. To mitigate the distribution shift\nissue, we further introduce an on-policy fine-tuning phase for the estimator to\nenhance its robustness and accuracy. The simulation results show that BAS\nachieves 50% better safety than baselines in dynamic environments while\nmaintaining a higher speed on average. 
In real-world experiments, BAS shows its\ncapability in complex environments with unknown physics (e.g., slippery floors\nwith unknown frictions, unknown payloads up to 8kg), while baselines lack\nadaptivity, leading to collisions or degraded agility. As a result, BAS\nachieves a 19.8% increase in speed and gets a 2.36 times lower collision rate\nthan ABS in the real world. Videos: https://adaptive-safe-locomotion.github.io.\n","authors":["Yichao Zhong","Chong Zhang","Tairan He","Guanya Shi"],"pdf_url":"https://arxiv.org/pdf/2501.04276v1.pdf","comment":"11 Pages, 6 Figures"},{"id":"http://arxiv.org/abs/2501.04268v1","updated":"2025-01-08T04:30:45Z","published":"2025-01-08T04:30:45Z","title":"Robotic Programmer: Video Instructed Policy Code Generation for Robotic\n Manipulation","summary":" Zero-shot generalization across various robots, tasks and environments\nremains a significant challenge in robotic manipulation. Policy code generation\nmethods use executable code to connect high-level task descriptions and\nlow-level action sequences, leveraging the generalization capabilities of large\nlanguage models and atomic skill libraries. In this work, we propose Robotic\nProgrammer (RoboPro), a robotic foundation model, enabling the capability of\nperceiving visual information and following free-form instructions to perform\nrobotic manipulation with policy code in a zero-shot manner. To address low\nefficiency and high cost in collecting runtime code data for robotic tasks, we\ndevise Video2Code to synthesize executable code from extensive videos\nin-the-wild with off-the-shelf vision-language model and code-domain large\nlanguage model. Extensive experiments show that RoboPro achieves the\nstate-of-the-art zero-shot performance on robotic manipulation in both\nsimulators and real-world environments. 
Specifically, the zero-shot success\nrate of RoboPro on RLBench surpasses the state-of-the-art model GPT-4o by\n11.6%, which is even comparable to a strong supervised training baseline.\nFurthermore, RoboPro is robust to variations on API formats and skill sets.\n","authors":["Senwei Xie","Hongyu Wang","Zhanqi Xiao","Ruiping Wang","Xilin Chen"],"pdf_url":"https://arxiv.org/pdf/2501.04268v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04263v1","updated":"2025-01-08T04:14:09Z","published":"2025-01-08T04:14:09Z","title":"KN-LIO: Geometric Kinematics and Neural Field Coupled LiDAR-Inertial\n Odometry","summary":" Recent advancements in LiDAR-Inertial Odometry (LIO) have boosted a large\namount of applications. However, traditional LIO systems tend to focus more on\nlocalization rather than mapping, with maps consisting mostly of sparse\ngeometric elements, which is not ideal for downstream tasks. Recent emerging\nneural field technology has great potential in dense mapping, but pure LiDAR\nmapping is difficult to work on high-dynamic vehicles. To mitigate this\nchallenge, we present a new solution that tightly couples geometric kinematics\nwith neural fields to enhance simultaneous state estimation and dense mapping\ncapabilities. We propose both semi-coupled and tightly coupled Kinematic-Neural\nLIO (KN-LIO) systems that leverage online SDF decoding and iterated error-state\nKalman filtering to fuse laser and inertial data. Our KN-LIO minimizes\ninformation loss and improves accuracy in state estimation, while also\naccommodating asynchronous multi-LiDAR inputs. Evaluations on diverse\nhigh-dynamic datasets demonstrate that our KN-LIO achieves performance on par\nwith or superior to existing state-of-the-art solutions in pose estimation and\noffers improved dense mapping accuracy over pure LiDAR-based methods. 
The\nrelevant code and datasets will be made available at https://**.\n","authors":["Zhong Wang","Lele Ren","Yue Wen","Hesheng Wang"],"pdf_url":"https://arxiv.org/pdf/2501.04263v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04228v1","updated":"2025-01-08T01:59:47Z","published":"2025-01-08T01:59:47Z","title":"Constraints as Rewards: Reinforcement Learning for Robots without Reward\n Functions","summary":" Reinforcement learning has become an essential algorithm for generating\ncomplex robotic behaviors. However, to learn such behaviors, it is necessary to\ndesign a reward function that describes the task, which often consists of\nmultiple objectives that need to be balanced. This tuning process is known as\nreward engineering and typically involves extensive trial-and-error. In this\npaper, to avoid this trial-and-error process, we propose the concept of\nConstraints as Rewards (CaR). CaR formulates the task objective using multiple\nconstraint functions instead of a reward function and solves a reinforcement\nlearning problem with constraints using the Lagrangian method. By adopting this\napproach, different objectives are automatically balanced, because Lagrange\nmultipliers serve as the weights among the objectives. In addition, we will\ndemonstrate that constraints, expressed as inequalities, provide an intuitive\ninterpretation of the optimization target designed for the task. 
We apply the\nproposed method to the standing-up motion generation task of a\nsix-wheeled-telescopic-legged robot and demonstrate that the proposed method\nsuccessfully acquires the target behavior, even though it is challenging to\nlearn with manually designed reward functions.\n","authors":["Yu Ishihara","Noriaki Takasugi","Kotaro Kawakami","Masaya Kinoshita","Kazumi Aoyama"],"pdf_url":"https://arxiv.org/pdf/2501.04228v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.04216v2","updated":"2025-01-08T01:16:25Z","published":"2024-07-05T02:00:47Z","title":"Safe MPC Alignment with Human Directional Feedback","summary":" In safety-critical robot planning or control, manually specifying safety\nconstraints or learning them from demonstrations can be challenging. In this\narticle, we propose a certifiable alignment method for a robot to learn a\nsafety constraint in its model predictive control (MPC) policy with human\nonline directional feedback. To our knowledge, it is the first method to learn\nsafety constraints from human feedback. The proposed method is based on an\nempirical observation: human directional feedback, when available, tends to\nguide the robot toward safer regions. The method only requires the direction of\nhuman feedback to update the learning hypothesis space. It is certifiable,\nproviding an upper bound on the total number of human feedback in the case of\nsuccessful learning, or declaring the hypothesis misspecification, i.e., the\ntrue implicit safety constraint cannot be found within the specified hypothesis\nspace. We evaluated the proposed method using numerical examples and user\nstudies in two simulation games. Additionally, we implemented and tested the\nproposed method on a real-world Franka robot arm performing mobile\nwater-pouring tasks. 
The results demonstrate the efficacy and efficiency of our\nmethod, showing that it enables a robot to successfully learn safety\nconstraints with a small handful (tens) of human directional corrections.\n","authors":["Zhixian Xie","Wenlong Zhang","Yi Ren","Zhaoran Wang","George J. Pappas","Wanxin Jin"],"pdf_url":"https://arxiv.org/pdf/2407.04216v2.pdf","comment":"16 pages, submission to T-RO"},{"id":"http://arxiv.org/abs/2501.04194v1","updated":"2025-01-08T00:06:43Z","published":"2025-01-08T00:06:43Z","title":"STLCG++: A Masking Approach for Differentiable Signal Temporal Logic\n Specification","summary":" Signal Temporal Logic (STL) offers a concise yet expressive framework for\nspecifying and reasoning about spatio-temporal behaviors of robotic systems.\nAttractively, STL admits the notion of robustness, the degree to which an input\nsignal satisfies or violates an STL specification, thus providing a nuanced\nevaluation of system performance. Notably, the differentiability of STL\nrobustness enables direct integration to robotics workflows that rely on\ngradient-based optimization, such as trajectory optimization and deep learning.\nHowever, existing approaches to evaluating and differentiating STL robustness\nrely on recurrent computations, which become inefficient with longer sequences,\nlimiting their use in time-sensitive applications. In this paper, we present\nSTLCG++, a masking-based approach that parallelizes STL robustness evaluation\nand backpropagation across timesteps, achieving more than 1000x faster\ncomputation time than the recurrent approach. We also introduce a smoothing\ntechnique for differentiability through time interval bounds, expanding STL's\napplicability in gradient-based optimization tasks over spatial and temporal\nvariables. 
Finally, we demonstrate STLCG++'s benefits through three robotics\nuse cases and provide open-source Python libraries in JAX and PyTorch for\nseamless integration into modern robotics workflows.\n","authors":["Parv Kapoor","Kazuki Mizuta","Eunsuk Kang","Karen Leung"],"pdf_url":"https://arxiv.org/pdf/2501.04194v1.pdf","comment":"To be submitted to robotics journal for review"},{"id":"http://arxiv.org/abs/2501.04193v1","updated":"2025-01-08T00:06:38Z","published":"2025-01-08T00:06:38Z","title":"GNN-based Decentralized Perception in Multirobot Systems for Predicting\n Worker Actions","summary":" In industrial environments, predicting human actions is essential for\nensuring safe and effective collaboration between humans and robots. This paper\nintroduces a perception framework that enables mobile robots to understand and\nshare information about human actions in a decentralized way. The framework\nfirst allows each robot to build a spatial graph representing its surroundings,\nwhich it then shares with other robots. This shared spatial data is combined\nwith temporal information to track human behavior over time. A swarm-inspired\ndecision-making process is used to ensure all robots agree on a unified\ninterpretation of the human's actions. Results show that adding more robots and\nincorporating longer time sequences improve prediction accuracy. Additionally,\nthe consensus mechanism increases system resilience, making the multi-robot\nsetup more reliable in dynamic industrial settings.\n","authors":["Ali Imran","Giovanni Beltrame","David St-Onge"],"pdf_url":"https://arxiv.org/pdf/2501.04193v1.pdf","comment":"Submitted to RA-L"},{"id":"http://arxiv.org/abs/2407.05910v3","updated":"2025-01-08T23:40:38Z","published":"2024-07-08T13:15:11Z","title":"Enhancing Vision-Language Models with Scene Graphs for Traffic Accident\n Understanding","summary":" Recognizing a traffic accident is an essential part of any autonomous driving\nor road monitoring system. 
An accident can appear in a wide variety of forms,\nand understanding what type of accident is taking place may be useful to\nprevent it from recurring. This work focuses on classifying traffic scenes into\nspecific accident types. We approach the problem by representing a traffic\nscene as a graph, where objects such as cars can be represented as nodes, and\nrelative distances and directions between them as edges. This representation of\na traffic scene is referred to as a scene graph, and can be used as input for\nan accident classifier. Better results are obtained with a classifier that\nfuses the scene graph input with visual and textual representations. This work\nintroduces a multi-stage, multimodal pipeline that pre-processes videos of\ntraffic accidents, encodes them as scene graphs, and aligns this representation\nwith vision and language modalities before executing the classification task.\nWhen trained on 4 classes, our method achieves a balanced accuracy score of\n57.77% on an (unbalanced) subset of the popular Detection of Traffic Anomaly\n(DoTA) benchmark, representing an increase of close to 5 percentage points from\nthe case where scene graph information is not taken into account.\n","authors":["Aaron Lohner","Francesco Compagno","Jonathan Francis","Alessandro Oltramari"],"pdf_url":"https://arxiv.org/pdf/2407.05910v3.pdf","comment":"Won the 'Best Paper Runner-up Award' at the 2024 IEEE International\n Automated Vehicle Validation Conference (IAVVC 2024). 
Also accepted at the\n 1st Workshop on Semantic Reasoning and Goal Understanding in Robotics, at the\n Robotics Science and Systems Conference (RSS SemRob 2024)"},{"id":"http://arxiv.org/abs/2501.04860v1","updated":"2025-01-08T22:22:15Z","published":"2025-01-08T22:22:15Z","title":"Exploring the Use of Robots for Diary Studies","summary":" As interest in studying in-the-wild human-robot interaction grows, there is a\nneed for methods to collect data over time and in naturalistic or potentially\nprivate environments. HRI researchers have increasingly used the diary method\nfor these studies, asking study participants to self-administer a structured\ndata collection instrument, i.e., a diary, over a period of time. Although the\ndiary method offers a unique window into settings that researchers may not have\naccess to, they also lack the interactivity and probing that interview-based\nmethods offer. In this paper, we explore a novel data collection method in\nwhich a robot plays the role of an interactive diary. We developed the Diary\nRobot system and performed in-home deployments for a week to evaluate the\nfeasibility and effectiveness of this approach. Using traditional text-based\nand audio-based diaries as benchmarks, we found that robots are able to\neffectively elicit the intended information. We reflect on our findings, and\ndescribe scenarios where the utilization of robots in diary studies as a data\ncollection instrument may be especially applicable.\n","authors":["Michael F. 
Xu","Bilge Mutlu"],"pdf_url":"https://arxiv.org/pdf/2501.04860v1.pdf","comment":"Proceedings of the 29th ACM/IEEE International Conference on Human\n Robot Interaction (HRI 2025)"},{"id":"http://arxiv.org/abs/2405.05210v2","updated":"2025-01-08T22:09:46Z","published":"2024-05-08T16:58:22Z","title":"TCAFF: Temporal Consistency for Robot Frame Alignment","summary":" In the field of collaborative robotics, the ability to communicate spatial\ninformation like planned trajectories and shared environment information is\ncrucial. When no global position information is available (e.g., indoor or\nGPS-denied environments), agents must align their coordinate frames before\nshared spatial information can be properly expressed and interpreted.\nCoordinate frame alignment is particularly difficult when robots have no\ninitial alignment and are affected by odometry drift. To this end, we develop a\nnovel multiple hypothesis algorithm, called TCAFF, for aligning the coordinate\nframes of neighboring robots. TCAFF considers potential alignments from\nassociating sparse open-set object maps and leverages temporal consistency to\ndetermine an initial alignment and correct for drift, all without any initial\nknowledge of neighboring robot poses. We demonstrate TCAFF being used for frame\nalignment in a collaborative object tracking application on a team of four\nrobots tracking six pedestrians and show that TCAFF enables robots to achieve a\ntracking accuracy similar to that of a system with ground truth localization.\nThe code and hardware dataset are available at\nhttps://github.com/mit-acl/tcaff.\n","authors":["Mason B. Peterson","Parker C. Lusk","Antonio Avila","Jonathan P. 
How"],"pdf_url":"https://arxiv.org/pdf/2405.05210v2.pdf","comment":"7 pages, 6 figures"},{"id":"http://arxiv.org/abs/2501.04823v1","updated":"2025-01-08T20:22:16Z","published":"2025-01-08T20:22:16Z","title":"Learning Robot Safety from Sparse Human Feedback using Conformal\n Prediction","summary":" Ensuring robot safety can be challenging; user-defined constraints can miss\nedge cases, policies can become unsafe even when trained from safe data, and\nsafety can be subjective. Thus, we learn about robot safety by showing policy\ntrajectories to a human who flags unsafe behavior. From this binary feedback,\nwe use the statistical method of conformal prediction to identify a region of\nstates, potentially in learned latent space, guaranteed to contain a\nuser-specified fraction of future policy errors. Our method is\nsample-efficient, as it builds on nearest neighbor classification and avoids\nwithholding data as is common with conformal prediction. By alerting if the\nrobot reaches the suspected unsafe region, we obtain a warning system that\nmimics the human's safety preferences with guaranteed miss rate. From video\nlabeling, our system can detect when a quadcopter visuomotor policy will fail\nto steer through a designated gate. We present an approach for policy\nimprovement by avoiding the suspected unsafe region. With it we improve a model\npredictive controller's safety, as shown in experimental testing with 30\nquadcopter flights across 6 navigation tasks. Code and videos are provided.\n","authors":["Aaron O. Feldman","Joseph A. 
Vincent","Maximilian Adang","Jun En Low","Mac Schwager"],"pdf_url":"https://arxiv.org/pdf/2501.04823v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19813v2","updated":"2025-01-08T19:50:09Z","published":"2024-12-10T19:58:47Z","title":"Coverage Path Planning in Precision Agriculture: Algorithms,\n Applications, and Key Benefits","summary":" Coverage path planning (CPP) is the task of computing an optimal path within\na region to completely scan or survey an area of interest using one or multiple\nmobile robots. Robots equipped with sensors and cameras can collect vast\namounts of data on crop health, soil conditions, and weather patterns. Advanced\nanalytics can then be applied to this data to make informed decisions,\nimproving overall farm management. In this paper, we will demonstrate one\napproach to find the optimal coverage path of an agricultural field using a\nsingle robot, and one using multiple robots. For the single robot, we used a\nwavefront coverage algorithm that generates a sequence of locations that the\nrobot needs to follow. For the multi-robot approach, the proposed approach\nconsists of two steps: dividing the agricultural field into convex polygonal\nareas to optimally distribute them among the robots, and generating an optimal\ncoverage path to ensure minimum coverage time for each of the polygonal areas.\n","authors":["Jahid Chowdhury Choton","William H. 
Hsu"],"pdf_url":"https://arxiv.org/pdf/2412.19813v2.pdf","comment":"The co-authors have asked to withdraw this paper, since it contains\n incomplete and incorrect informations"},{"id":"http://arxiv.org/abs/2412.16186v2","updated":"2025-01-08T19:49:53Z","published":"2024-12-12T16:57:49Z","title":"Formal Modeling and Verification of Publisher-Subscriber Paradigm in ROS\n 2","summary":" The Robot Operating System (ROS) is one of the most popular middleware for\ndeveloping robot applications, but it is subject to major shortcomings when\napplied to real-time robotic systems in safety-critical environments. For this\nreason, ROS 2 was released in 2017 for implementing real-time capabilities in\ndistributed robotic systems while supporting the most prominent aspects of the\noriginal ROS. There is still not much work done to provide formal guarantees\nand correctness of a ROS program. In this paper, we propose a framework to\naddress this challenging problem of guaranteeing the correct behaviour of\nrobotic systems. We propose a formal modelling of a ROS 2 program, and also\ndescribe the program using a network of timed automata. We then prove that the\nsets of executions of a ROS program in the model and in the network of timed\nautomata are the same. Thus to analyze a publisher-subscriber scenario of ROS 2\nprogram, our algorithm first converts the program into the model, and then into\nthe network of timed automata. 
The applicability and validity of our approach\nare verified by conducting several experiments on a simplified system and an\nactual robotic system, and the results and limitations are discussed.\n","authors":["Jahid Chowdhury Choton","Lipsy Gupta","Pavithra Prabhakar"],"pdf_url":"https://arxiv.org/pdf/2412.16186v2.pdf","comment":"The co-authors have asked to withdraw this paper, since it contains\n incomplete and incorrect informations"},{"id":"http://arxiv.org/abs/2501.04759v1","updated":"2025-01-08T17:28:30Z","published":"2025-01-08T17:28:30Z","title":"Optimize the parameters of the PID Controller using Genetic Algorithm\n for Robot Manipulators","summary":" This paper presents the design of a Proportional-Integral-Derivative (PID)\ncontroller with optimized parameters for a two-degree-of-freedom robotic arm. A\ngenetic algorithm (GA) is proposed to optimize the controller parameters,\naddressing the challenges in determining PID controller parameters for highly\nnonlinear systems like robotic arms compared to traditional methods. The\nGA-optimized PID controller significantly improves control accuracy and\nperformance over traditional control methods. Simulation results demonstrate\nthat the robotic arm system operates with high precision and stability.\nAdditionally, the shortened trajectory tracking response time enhances the\nfeasibility of applying this control algorithm in real-world scenarios. 
This\nresearch not only confirms the suitability of PID-GA for robotic arms and\nsimilar systems but also opens new avenues for applying this algorithm to real\nphysical systems.\n","authors":["Vu Ngoc Son","Pham Van Cuong","Nguyen Duy Minh","Phi Hoang Nha"],"pdf_url":"https://arxiv.org/pdf/2501.04759v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04755v1","updated":"2025-01-08T16:57:44Z","published":"2025-01-08T16:57:44Z","title":"Improving Human-Robot Teaching by Quantifying and Reducing Mental Model\n Mismatch","summary":" The rapid development of artificial intelligence and robotics has had a\nsignificant impact on our lives, with intelligent systems increasingly\nperforming tasks traditionally performed by humans. Efficient knowledge\ntransfer requires matching the mental model of the human teacher with the\ncapabilities of the robot learner. This paper introduces the Mental Model\nMismatch (MMM) Score, a feedback mechanism designed to quantify and reduce\nmismatches by aligning human teaching behavior with robot learning behavior.\nUsing Large Language Models (LLMs), we analyze teacher intentions in natural\nlanguage to generate adaptive feedback. A study with 150 participants teaching\na virtual robot to solve a puzzle game shows that intention-based feedback\nsignificantly outperforms traditional performance-based feedback or no\nfeedback. The results suggest that intention-based feedback improves\ninstructional outcomes, improves understanding of the robot's learning process\nand reduces misconceptions. 
This research addresses a critical gap in\nhuman-robot interaction (HRI) by providing a method to quantify and mitigate\ndiscrepancies between human mental models and robot capabilities, with the goal\nof improving robot learning and human teaching effectiveness.\n","authors":["Phillip Richter","Heiko Wersing","Anna-Lisa Vollmer"],"pdf_url":"https://arxiv.org/pdf/2501.04755v1.pdf","comment":"11 Pages, 4 Figures"},{"id":"http://arxiv.org/abs/2501.04754v1","updated":"2025-01-08T16:57:11Z","published":"2025-01-08T16:57:11Z","title":"Development of an Adaptive Sliding Mode Controller using Neural Networks\n for Trajectory Tracking of a Cylindrical Manipulator","summary":" Cylindrical manipulators are extensively used in industrial automation,\nespecially in emerging technologies like 3D printing, which represents a\nsignificant future trend. However, controlling the trajectory of nonlinear\nmodels with system uncertainties remains a critical challenge, often leading to\nreduced accuracy and reliability. To address this, the study develops an\nAdaptive Sliding Mode Controller (ASMC) integrated with Neural Networks (NNs)\nto improve trajectory tracking for cylindrical manipulators. The ASMC leverages\nthe robustness of sliding mode control and the adaptability of neural networks\nto handle uncertainties and dynamic variations effectively. 
Simulation results\nvalidate that the proposed ASMC-NN achieves high trajectory tracking accuracy,\nfast response time, and enhanced reliability, making it a promising solution\nfor applications in 3D printing and beyond.\n","authors":["TieuNien Le","VanCuong Pham","NgocSon Vu"],"pdf_url":"https://arxiv.org/pdf/2501.04754v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05483v1","updated":"2025-01-08T03:47:52Z","published":"2025-01-08T03:47:52Z","title":"Human Grasp Generation for Rigid and Deformable Objects with Decomposed\n VQ-VAE","summary":" Generating realistic human grasps is crucial yet challenging for object\nmanipulation in computer graphics and robotics. Current methods often struggle\nto generate detailed and realistic grasps with full finger-object interaction,\nas they typically rely on encoding the entire hand and estimating both posture\nand position in a single step. Additionally, simulating object deformation\nduring grasp generation is still difficult, as modeling such deformation\nrequires capturing the comprehensive relationship among points of the object's\nsurface. To address these limitations, we propose a novel improved Decomposed\nVector-Quantized Variational Autoencoder (DVQ-VAE-2), which decomposes the hand\ninto distinct parts and encodes them separately. This part-aware architecture\nallows for more precise management of hand-object interactions. Furthermore, we\nintroduce a dual-stage decoding strategy that first predicts the grasp type\nunder skeletal constraints and then identifies the optimal grasp position,\nenhancing both the realism and adaptability of the model to unseen\ninteractions. Furthermore, we introduce a new Mesh UFormer as the backbone\nnetwork to extract the hierarchical structural representations from the mesh\nand propose a new normal vector-guided position encoding to simulate the\nhand-object deformation. 
In experiments, our model achieves a relative\nimprovement of approximately 14.1% in grasp quality compared to\nstate-of-the-art methods across four widely used benchmarks. Our comparisons\nwith other backbone networks show relative improvements of 2.23% in Hand-object\nContact Distance and 5.86% in Quality Index on deformable and rigid object\nbased datasets, respectively. Our source code and model are available at\nhttps://github.com/florasion/D-VQVAE.\n","authors":["Mengshi Qi","Zhe Zhao","Huadong Ma"],"pdf_url":"https://arxiv.org/pdf/2501.05483v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06235v1","updated":"2025-01-08T09:08:06Z","published":"2025-01-08T09:08:06Z","title":"NextStop: An Improved Tracker For Panoptic LIDAR Segmentation Data","summary":" 4D panoptic LiDAR segmentation is essential for scene understanding in\nautonomous driving and robotics, combining semantic and instance segmentation\nwith temporal consistency. Current methods, like 4D-PLS and 4D-STOP, use a\ntracking-by-detection methodology, employing deep learning networks to perform\nsemantic and instance segmentation on each frame. To maintain temporal\nconsistency, large-size instances detected in the current frame are compared\nand associated with instances within a temporal window that includes the\ncurrent and preceding frames. However, their reliance on short-term instance\ndetection, lack of motion estimation, and exclusion of small-sized instances\nlead to frequent identity switches and reduced tracking performance. We address\nthese issues with the NextStop1 tracker, which integrates Kalman filter-based\nmotion estimation, data association, and lifespan management, along with a\ntracklet state concept to improve prioritization. 
Evaluated using the LiDAR\nSegmentation and Tracking Quality (LSTQ) metric on the SemanticKITTI validation\nset, NextStop demonstrated enhanced tracking performance, particularly for\nsmall-sized objects like people and bicyclists, with fewer ID switches, earlier\ntracking initiation, and improved reliability in complex environments. The\nsource code is available at https://github.com/AIROTAU/NextStopTracker\n","authors":["Nirit Alkalay","Roy Orfaig","Ben-Zion Bobrovsky"],"pdf_url":"https://arxiv.org/pdf/2501.06235v1.pdf","comment":null}],"Systems and Control":[{"id":"http://arxiv.org/abs/2501.02792v2","updated":"2025-01-08T17:58:19Z","published":"2025-01-06T06:25:46Z","title":"Gaming on Coincident Peak Shaving: Equilibrium and Strategic Behavior","summary":" Coincident peak demand charges are imposed by power system operators or\nelectric utilities when the overall system demand, aggregated across multiple\nconsumers, reaches its peak. These charges incentivize consumers to reduce\ntheir demand during peak periods, a practice known as coincident peak shaving.\nIn this paper, we analyze the coincident peak shaving problem through the lens\nof game theory, developing a theoretical model to examine the impact of\nstrategic consumer behavior on system efficiency. We demonstrate that the game\nstructure exhibits varying characteristics - concave,\nquasiconcave/discontinuous, or non-concave/discontinuous - depending on the\nextent of consumers demand-shifting capabilities. For a two-agent, two-period\nsetting, we derive closed-form Nash equilibrium solutions under each condition\nand generalize our findings to cases with multiple agents. We prove the\nstability of the equilibrium points and present an algorithm for computing\nequilibrium outcomes across all game scenarios. We also show that the\npeak-shaving effectiveness of the game model matches that of the centralized\npeak-shaving model but with increased levels of anarchy. 
In the cases of\nquasiconcave and non-concave game conditions, we analytically demonstrate in\nthe two-agent setting that anarchy increases with consumers' flexibility and\ninequity, as measured by their marginal shifting costs, and we also analyze the\ninfluence of the number of agents on anarchy. Finally, we provide numerical\nsimulations to validate our theoretical results.\n","authors":["Liudong Chen","Bolun Xu"],"pdf_url":"https://arxiv.org/pdf/2501.02792v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04623v1","updated":"2025-01-08T17:09:50Z","published":"2025-01-08T17:09:50Z","title":"Large-scale Grid Optimization: The Workhorse of Future Grid Computations","summary":" Purpose: The computation methods for modeling, controlling and optimizing the\ntransforming grid are evolving rapidly. We review and systemize knowledge for a\nspecial class of computation methods that solve large-scale power grid\noptimization problems. Summary: Large-scale grid optimizations are pertinent\nfor, amongst other things, hedging against risk due to resource stochasticity,\nevaluating aggregated DERs' impact on grid operation and design, and improving\nthe overall efficiency of grid operation in terms of cost, reliability, and\ncarbon footprint. We attribute the continual growth in scale and complexity of\ngrid optimizations to a large influx of new spatial and temporal features in\nboth transmission (T) and distribution (D) networks. Therefore, to systemize\nknowledge in the field, we discuss the recent advancements in T and D systems\nfrom the viewpoint of mechanistic physics-based and emerging data-driven\nmethods. Findings: We find that while mechanistic physics-based methods are\nleading the science in solving large-scale grid optimizations, data-driven\ntechniques, especially physics-constrained ones, are emerging as an alternative\nto solve otherwise intractable problems. 
We also find observable gaps in the\nfield and ascertain these gaps from the paper's literature review and by\ncollecting and synthesizing feedback from industry experts.\n","authors":["Amritanshu Pandey","Mads Almassalkhi","Sam Chevalier"],"pdf_url":"https://arxiv.org/pdf/2501.04623v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04572v1","updated":"2025-01-08T15:42:41Z","published":"2025-01-08T15:42:41Z","title":"Regret Analysis: a control perspective","summary":" Online learning and model reference adaptive control have many interesting\nintersections. One area where they differ however is in how the algorithms are\nanalyzed and what objective or metric is used to discriminate \"good\" algorithms\nfrom \"bad\" algorithms. In adaptive control there are usually two objectives: 1)\nprove that all time varying parameters/states of the system are bounded, and 2)\nthat the instantaneous error between the adaptively controlled system and a\nreference system converges to zero over time (or at least a compact set). For\nonline learning the performance of algorithms is often characterized by the\nregret the algorithm incurs. Regret is defined as the cumulative loss (cost)\nover time from the online algorithm minus the cumulative loss (cost) of the\nsingle optimal fixed parameter choice in hindsight. Another significant\ndifference between the two areas of research is with regard to the assumptions\nmade in order to obtain said results. Adaptive control makes assumptions about\nthe input-output properties of the control problem and derives solutions for a\nfixed error model or optimization task. In the online learning literature\nresults are derived for classes of loss functions (i.e. convex) while a priori\nassuming that all time varying parameters are bounded, which for many\noptimization tasks is not unrealistic, but is a non starter in control\napplications. 
In this work we discuss these differences in detail through the\nregret based analysis of gradient descent for convex functions and the control\nbased analysis of a streaming regression problem. We close with a discussion\nabout the newly defined paradigm of online adaptive control and ask the\nfollowing question \"Are regret optimal control strategies deployable?\"\n","authors":["Travis E. Gibson","Sawal Acharya"],"pdf_url":"https://arxiv.org/pdf/2501.04572v1.pdf","comment":"10 pages no figures"},{"id":"http://arxiv.org/abs/2501.04566v1","updated":"2025-01-08T15:26:59Z","published":"2025-01-08T15:26:59Z","title":"Recursive Least Squares with Fading Regularization for Finite-Time\n Convergence without Persistent Excitation","summary":" This paper extends recursive least squares (RLS) to include time-varying\nregularization. This extension provides flexibility for updating the least\nsquares regularization term in real time. Existing results with constant\nregularization imply that the parameter-estimation error dynamics of RLS are\nglobally attractive to zero if and only if the regressor is weakly persistently\nexciting. This work shows that, by extending classical RLS to include a\ntime-varying (fading) regularization term that converges to zero, the\nparameter-estimation error dynamics are globally attractive to zero without\nweakly persistent excitation. Moreover, if the fading regularization term\nconverges to zero in finite time, then the parameter estimation error also\nconverges to zero in finite time. Finally, we propose rank-1 fading\nregularization (R1FR) RLS, a time-varying regularization algorithm with fading\nregularization that converges to zero, and which runs in the same computational\ncomplexity as classical RLS. Numerical examples are presented to validate\ntheoretical guarantees and to show how R1FR-RLS can protect against\nover-regularization.\n","authors":["Brian Lai","Dimitra Panagou","Dennis S. 
Bernstein"],"pdf_url":"https://arxiv.org/pdf/2501.04566v1.pdf","comment":"Submitted to the 2025 American Control Conference"},{"id":"http://arxiv.org/abs/2501.04508v1","updated":"2025-01-08T13:55:22Z","published":"2025-01-08T13:55:22Z","title":"New Linear Model of a Composite Energy Storage System with Realizable\n Dispatch Guarantees","summary":" To optimize battery dispatch, a model is required that can predict the state\nof charge (SOC) trajectory and ensure dispatch is admissible (i.e., does not\nlead to unexpected SOC saturation). But battery dispatch optimization is\ninherently challenging since batteries cannot simultaneously charge and\ndischarge, which begets a non-convex complementarity constraint. In this paper,\nwe consider a composition of energy storage elements that can charge or\ndischarge independently and provide a sufficient linear energy storage model of\nthe composite battery. This permits convex optimization of the composite\nbattery SOC trajectory while ensuring admissibility of the resulting\n(aggregated) power schedule and disaggregation to the individual energy storage\nelements.\n","authors":["Mazen Elsaadany","Mads R. Almassalkhi","Simon H. Tindemans"],"pdf_url":"https://arxiv.org/pdf/2501.04508v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04481v1","updated":"2025-01-08T13:04:08Z","published":"2025-01-08T13:04:08Z","title":"Safe Reinforcement Learning with Minimal Supervision","summary":" Reinforcement learning (RL) in the real world necessitates the development of\nprocedures that enable agents to explore without causing harm to themselves or\nothers. The most successful solutions to the problem of safe RL leverage\noffline data to learn a safe-set, enabling safe online exploration. 
However,\nthis approach to safe-learning is often constrained by the demonstrations that\nare available for learning.\n In this paper we investigate the influence of the quantity and quality of\ndata used to train the initial safe learning problem offline on the ability to\nlearn safe-RL policies online. Specifically, we focus on tasks with spatially\nextended goal states where we have few or no demonstrations available.\nClassically this problem is addressed either by using hand-designed controllers\nto generate data or by collecting user-generated demonstrations. However, these\nmethods are often expensive and do not scale to more complex tasks and\nenvironments. To address this limitation we propose an unsupervised RL-based\noffline data collection procedure, to learn complex and scalable policies\nwithout the need for hand-designed controllers or user demonstrations. Our\nresearch demonstrates the significance of providing sufficient demonstrations\nfor agents to learn optimal safe-RL policies online, and as a result, we\npropose optimistic forgetting, a novel online safe-RL approach that is\npractical for scenarios with limited data. Further, our unsupervised data\ncollection approach highlights the need to balance diversity and optimality for\nsafe online exploration.\n","authors":["Alexander Quessy","Thomas Richardson","Sebastian East"],"pdf_url":"https://arxiv.org/pdf/2501.04481v1.pdf","comment":"Initially submitted to ICML 2023"},{"id":"http://arxiv.org/abs/2501.04437v1","updated":"2025-01-08T11:37:35Z","published":"2025-01-08T11:37:35Z","title":"Integrating LLMs with ITS: Recent Advances, Potentials, Challenges, and\n Future Directions","summary":" Intelligent Transportation Systems (ITS) are crucial for the development and\noperation of smart cities, addressing key challenges in efficiency,\nproductivity, and environmental sustainability. This paper comprehensively\nreviews the transformative potential of Large Language Models (LLMs) in\noptimizing ITS. 
Initially, we provide an extensive overview of ITS,\nhighlighting its components, operational principles, and overall effectiveness.\nWe then delve into the theoretical background of various LLM techniques, such\nas GPT, T5, CTRL, and BERT, elucidating their relevance to ITS applications.\nFollowing this, we examine the wide-ranging applications of LLMs within ITS,\nincluding traffic flow prediction, vehicle detection and classification,\nautonomous driving, traffic sign recognition, and pedestrian detection. Our\nanalysis reveals how these advanced models can significantly enhance traffic\nmanagement and safety. Finally, we explore the challenges and limitations LLMs\nface in ITS, such as data availability, computational constraints, and ethical\nconsiderations. We also present several future research directions and\npotential innovations to address these challenges. This paper aims to guide\nresearchers and practitioners through the complexities and opportunities of\nintegrating LLMs in ITS, offering a roadmap to create more efficient,\nsustainable, and responsive next-generation transportation systems.\n","authors":["Doaa Mahmud","Hadeel Hajmohamed","Shamma Almentheri","Shamma Alqaydi","Lameya Aldhaheri","Ruhul Amin Khalil","Nasir Saeed"],"pdf_url":"https://arxiv.org/pdf/2501.04437v1.pdf","comment":"Accepted for publication in IEEE Transactions on Intelligent\n Transportation Systems"},{"id":"http://arxiv.org/abs/2501.04422v1","updated":"2025-01-08T11:15:04Z","published":"2025-01-08T11:15:04Z","title":"A new methodology for the optimization of bolt tightening sequences for\n ring type joints","summary":" Achieving uniform bolt load distribution is critical to obtain leak-free\nservice in pressure vessel gasketed joints used in offshore pipelines. This is\na difficult task due to bolt load variations during the assembly process. 
In\nthis sense, the Elastic Interaction Coefficients Method has been developed in\nprevious works to define tightening sequences that provide the target load at\nthe end of the sequence in one or two passes. The method is very costly because\na complete sequence must be simulated and the load of every bolt must be\nmeasured after each tightening operation. The present work validates this\nmethod for Ring Type Joints and further develops a numerically and\nexperimentally validated new methodology that provides highly satisfactory\nresults with a significantly lower cost.\n","authors":["Ibai Coria","Mikel Abasolo","Imanol Olaskoaga","Arkaitz Etxezarreta","Josu Aguirrebeitia"],"pdf_url":"https://arxiv.org/pdf/2501.04422v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.04080v3","updated":"2025-01-08T09:03:14Z","published":"2024-02-06T15:34:30Z","title":"Entropy-regularized Diffusion Policy with Q-Ensembles for Offline\n Reinforcement Learning","summary":" This paper presents advanced techniques of training diffusion policies for\noffline reinforcement learning (RL). At the core is a mean-reverting stochastic\ndifferential equation (SDE) that transfers a complex action distribution into a\nstandard Gaussian and then samples actions conditioned on the environment state\nwith a corresponding reverse-time SDE, like a typical diffusion policy. We show\nthat such an SDE has a solution that we can use to calculate the log\nprobability of the policy, yielding an entropy regularizer that improves the\nexploration of offline datasets. To mitigate the impact of inaccurate value\nfunctions from out-of-distribution data points, we further propose to learn the\nlower confidence bound of Q-ensembles for more robust policy improvement. By\ncombining the entropy-regularized diffusion policy with Q-ensembles in offline\nRL, our method achieves state-of-the-art performance on most tasks in D4RL\nbenchmarks. 
Code is available at\nhttps://github.com/ruoqizzz/Entropy-Regularized-Diffusion-Policy-with-QEnsemble.\n","authors":["Ruoqi Zhang","Ziwei Luo","Jens Sjölund","Thomas B. Schön","Per Mattsson"],"pdf_url":"https://arxiv.org/pdf/2402.04080v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04275v1","updated":"2025-01-08T04:54:14Z","published":"2025-01-08T04:54:14Z","title":"Adaptive Numerical Differentiation for Extremum Seeking with Sensor\n Noise","summary":" Extremum-seeking control (ESC) is widely used to optimize performance when\nthe system dynamics are uncertain. However, sensitivity to sensor noise is an\nimportant issue in ESC implementation due to the use of high-pass filters or\ngradient estimators. To reduce the sensitivity of ESC to noise, this paper\ninvestigates the use of adaptive input and state estimation (AISE) for\nnumerical differentiation. In particular, this paper develops extremum-seeking\ncontrol with adaptive input and state estimation (ESC/AISE), where the\nhigh-pass filter of ESC is replaced by AISE to improve performance under sensor\nnoise. The effectiveness of ESC/AISE is illustrated via numerical examples.\n","authors":["Shashank Verma","Juan Augusto Paredes Salazar","Jhon Manuel Portella Delgado","Ankit Goel","Dennis S. Bernstein"],"pdf_url":"https://arxiv.org/pdf/2501.04275v1.pdf","comment":"8 pages, 13 figures. Submitted to ACC 2025"},{"id":"http://arxiv.org/abs/2501.04273v1","updated":"2025-01-08T04:45:14Z","published":"2025-01-08T04:45:14Z","title":"Frenet-Serret-Based Trajectory Prediction","summary":" Trajectory prediction is a crucial element of guidance, navigation, and\ncontrol systems. This paper presents two novel trajectory-prediction methods\nbased on real-time position measurements and adaptive input and state\nestimation (AISE). The first method, called AISE/va, uses position measurements\nto estimate the target velocity and acceleration. 
The second method, called\nAISE/FS, models the target trajectory as a 3D curve using the Frenet-Serret\nformulas, which require estimates of velocity, acceleration, and jerk. To\nestimate velocity, acceleration, and jerk in real time, AISE computes first,\nsecond, and third derivatives of the position measurements. AISE does not rely\non assumptions about the target maneuver, measurement noise, or disturbances.\nFor trajectory prediction, both methods use measurements of the target position\nand estimates of its derivatives to extrapolate from the current position. The\nperformance of AISE/va and AISE/FS is compared numerically with the\n$\\alpha$-$\\beta$-$\\gamma$ filter, which shows that AISE/FS provides more\naccurate trajectory prediction than AISE/va and traditional methods, especially\nfor complex target maneuvers.\n","authors":["Shashank Verma","Dennis S. Bernstein"],"pdf_url":"https://arxiv.org/pdf/2501.04273v1.pdf","comment":"8 pages, 6 figures. Submitted to ACC 2025"},{"id":"http://arxiv.org/abs/2501.04262v1","updated":"2025-01-08T04:10:43Z","published":"2025-01-08T04:10:43Z","title":"Target Tracking Using the Invariant Extended Kalman Filter with\n Numerical Differentiation for Estimating Curvature and Torsion","summary":" The goal of target tracking is to estimate target position, velocity, and\nacceleration in real time using position data. This paper introduces a novel\ntarget-tracking technique that uses adaptive input and state estimation (AISE)\nfor real-time numerical differentiation to estimate velocity, acceleration, and\njerk from position data. These estimates are used to model the target motion\nwithin the Frenet-Serret (FS) frame. By representing the model in SE(3), the\nposition and velocity are estimated using the invariant extended Kalman filter\n(IEKF). The proposed method, called FS-IEKF-AISE, is illustrated by numerical\nexamples and compared to prior techniques.\n","authors":["Shashank Verma","Dennis S. 
Bernstein"],"pdf_url":"https://arxiv.org/pdf/2501.04262v1.pdf","comment":"7 pages, 8 figures, submitted to ACC 2025"},{"id":"http://arxiv.org/abs/2411.06107v2","updated":"2025-01-08T02:50:58Z","published":"2024-11-09T08:01:17Z","title":"A capacity renting framework for shared energy storage considering\n peer-to-peer energy trading of prosumers with privacy protection","summary":" Shared energy storage systems (ESS) present a promising solution to the\ntemporal imbalance between energy generation from renewable distributed\ngenerators (DGs) and the power demands of prosumers. However, as DG penetration\nrates rise, spatial energy imbalances become increasingly significant,\nnecessitating the integration of peer-to-peer (P2P) energy trading within the\nshared ESS framework. Two key challenges emerge in this context: the absence of\neffective mechanisms and the greater difficulty for privacy protection due to\nincreased data communication. This research proposes a capacity renting\nframework for shared ESS considering P2P energy trading of prosumers. In the\nproposed framework, prosumers can participate in P2P energy trading and rent\ncapacities from shared ESS. A generalized Nash game is formulated to model the\ntrading process and the competitive interactions among prosumers, and the\nvariational equilibrium of the game is proved to be equivalent to the optimal\nsolution of a quadratic programming (QP) problem. To address the privacy\nprotection concern, the problem is solved using the alternating direction\nmethod of multipliers (ADMM) with the Paillier cryptosystem. 
Finally, numerical\nsimulations demonstrate the impact of P2P energy trading on the shared ESS\nframework and validate the effectiveness of the proposed privacy-preserving\nalgorithm.\n","authors":["Yingcong Sun","Laijun Chen","Yue Chen","Mingrui Tang","Shengwei Mei"],"pdf_url":"https://arxiv.org/pdf/2411.06107v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04242v1","updated":"2025-01-08T02:47:07Z","published":"2025-01-08T02:47:07Z","title":"Beam Domain Channel Estimation for Spatial Non-Stationary Massive MIMO\n Systems","summary":" In massive multiple-input multiple-output (MIMO) systems, the channel\nestimation scheme is subject to the spatial non-stationarity and inevitably\npower leakage in the beam domain. In this paper, a beam domain channel\nestimation scheme is investigated for spatial non-stationary (SNS) massive MIMO\nsystems considering power leakage.\nSpecifically, a realistic massive MIMO beam domain channel model (BDCM) is\nintroduced to capture the spatial non-stationarity considering power leakage by\nintroducing the illustration of visibility region (VR). Then, a beam domain\nstructure-based sparsity adaptive matching pursuit (BDS-SAMP) scheme is\nproposed based on the cross-block sparse structure and power ratio threshold of\nbeam domain channel. 
Finally, the simulation results validate the accuracy of\nproposed BDS-SAMP scheme with low pilot overhead and reasonable complexity by\ncomparing with conventional schemes.\n","authors":["Lin Hou","Hengtai Chang","Cheng-Xiang Wang","Jie Huang","Songjiang Yang"],"pdf_url":"https://arxiv.org/pdf/2501.04242v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04240v1","updated":"2025-01-08T02:35:32Z","published":"2025-01-08T02:35:32Z","title":"A Novel Non-Stationary Channel Emulator for 6G MIMO Wireless Channels","summary":" The performance evaluation of sixth generation (6G) communication systems is\nanticipated to be a controlled and repeatable process in the lab, which brings\nup the demand for wireless channel emulators. However, channel emulation for 6G\nspace-time-frequency (STF) non-stationary channels is missing currently. In\nthis paper, a non-stationary multiple-input multiple-output (MIMO)\ngeometry-based stochastic model (GBSM) that accurately characterizes the\nchannel STF properties is introduced firstly. Then, a subspace-based method is\nproposed for reconstructing the channel fading obtained from the GBSM and a\nchannel emulator architecture with frequency domain processing is presented for\n6G MIMO systems. Moreover, the spatial time-varying channel transfer functions\n(CTFs) of the channel simulation and the channel emulation are compared and\nanalyzed. The Doppler power spectral density (PSD) and delay PSD are further\nderived and compared between the channel model simulation and subspace-based\nemulation. 
The results demonstrate that the proposed channel emulator is\ncapable of reproducing the non-stationary channel characteristics.\n","authors":["Yuan Zong","Lijian Xin","Jie Huang","Cheng-Xiang Wang"],"pdf_url":"https://arxiv.org/pdf/2501.04240v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04238v1","updated":"2025-01-08T02:32:15Z","published":"2025-01-08T02:32:15Z","title":"A Quasi-deterministic Channel Model for Underwater Acoustic\n Communication Systems","summary":" In this paper, a quasi-deterministic (Q-D) model for non-stationary\nunderwater acoustic (UWA) channels is proposed. This model combines the BELLHOP\ndeterministic model and geometry-based stochastic model (GBSM), which provides\nhigher accuracy and flexibility. Different propagation components in shallow\nwater are classified as D-rays, R-rays and F-rays in the proposed model, where\nD-rays are modeled by BELLHOP while both R-rays and F-rays are modeled by GBSM.\nSome important channel statistical properties, including time-frequency\ncorrelation function (TF-CF), Doppler power spectrum density (PSD), average\nDoppler shift, and RMS Doppler spread are derived and simulated. Finally,\nsimulation results illustrate the correctness of the proposed model.\n","authors":["Yuxuan Yang","Yilin Ma","Hengtai Chang","Cheng-Xiang Wang"],"pdf_url":"https://arxiv.org/pdf/2501.04238v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04222v1","updated":"2025-01-08T01:39:10Z","published":"2025-01-08T01:39:10Z","title":"Privacy-Preserving Distributed Online Mirror Descent for Nonconvex\n Optimization","summary":" We investigate the distributed online nonconvex optimization problem with\ndifferential privacy over time-varying networks. 
Each node minimizes the sum of\nseveral nonconvex functions while preserving the node's differential privacy.\nWe propose a privacy-preserving distributed online mirror descent algorithm for\nnonconvex optimization, which uses the mirror descent to update decision\nvariables and the Laplace differential privacy mechanism to protect privacy.\nUnlike the existing works, the proposed algorithm allows the cost functions to\nbe nonconvex, which is more applicable. Based upon these, we prove that if the\ncommunication network is $B$-strongly connected and the constraint set is\ncompact, then by choosing the step size properly, the algorithm guarantees\n$\\epsilon$-differential privacy at each time. Furthermore, we prove that if the\nlocal cost functions are $\\beta$-smooth, then the regret over time horizon $T$\ngrows sublinearly while preserving differential privacy, with an upper bound\n$O(\\sqrt{T})$. Finally, the effectiveness of the algorithm is demonstrated\nthrough numerical simulations.\n","authors":["Yingjie Zhou","Tao Li"],"pdf_url":"https://arxiv.org/pdf/2501.04222v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.00958v2","updated":"2025-01-08T23:16:20Z","published":"2024-05-02T02:50:58Z","title":"Generative manufacturing systems using diffusion models and ChatGPT","summary":" In this study, we introduce Generative Manufacturing Systems (GMS) as a novel\napproach to effectively manage and coordinate autonomous manufacturing assets,\nthereby enhancing their responsiveness and flexibility to address a wide array\nof production objectives and human preferences. Deviating from traditional\nexplicit modeling, GMS employs generative AI, including diffusion models and\nChatGPT, for implicit learning from envisioned futures, marking a shift from a\nmodel-optimum to a training-sampling decision-making. 
Through the integration\nof generative AI, GMS enables complex decision-making through interactive\ndialogue with humans, allowing manufacturing assets to generate multiple\nhigh-quality global decisions that can be iteratively refined based on human\nfeedback. Empirical findings showcase GMS's substantial improvement in system\nresilience and responsiveness to uncertainties, with decision times reduced\nfrom seconds to milliseconds. The study underscores the inherent creativity and\ndiversity in the generated solutions, facilitating human-centric\ndecision-making through seamless and continuous human-machine interactions.\n","authors":["Xingyu Li","Fei Tao","Wei Ye","Aydin Nassehi","John W. Sutherland"],"pdf_url":"https://arxiv.org/pdf/2405.00958v2.pdf","comment":"We are withdrawing this preprint to incorporate significant new\n results and expand the scope of the paper. We plan to resubmit a\n substantially revised version in the near future"},{"id":"http://arxiv.org/abs/2403.07988v2","updated":"2025-01-08T22:31:48Z","published":"2024-03-12T18:00:29Z","title":"Configuration and EMT Simulation of the 240-bus MiniWECC System\n Integrating Offshore Wind Farms (OWFs)","summary":" As offshore wind farms (OWFs) become increasingly prevalent in Northern\nCalifornia and Southern Oregon, they introduce faster dynamics into the Western\nElectricity Coordinating Council (WECC) system, reshaping its dynamic behavior.\nAccordingly, electromagnetic transient (EMT) simulation is essential to assess\nhigh frequency dynamics of the WECC system with integrated OWFs. Against this\nbackground, this paper presents the integration of detailed dynamic models of\nOWFs into a 240-bus miniWECC system in PSCAD software. The sequential\ninitialization technique is employed to facilitate the smooth initiation of a\nlarge-scale system in an EMT simulation. 
The performance of the configured\nmodel is assessed under wind speed variations and grounded faults,\ndemonstrating the effectiveness of the miniWECC system with OWFs. This system\nserves as a valuable basic use case for validating the fast dynamic performance\nof future WECC systems with high penetration of wind energy.\n","authors":["Buxin She","Hisham Mahmood","Marcelo Elizondo","Veronica Adetola","Yuqing Dong"],"pdf_url":"https://arxiv.org/pdf/2403.07988v2.pdf","comment":"5 pages"},{"id":"http://arxiv.org/abs/2501.04839v1","updated":"2025-01-08T20:57:14Z","published":"2025-01-08T20:57:14Z","title":"DRL-Based Medium-Term Planning of Renewable-Integrated Self-Scheduling\n Cascaded Hydropower to Guide Wholesale Market Participation","summary":" For self-scheduling cascaded hydropower (S-CHP) facilities, medium-term\nplanning is a critical step that coordinates water availability over the\nmedium-term horizon, providing water usage guidance for their short-term\noperations in wholesale market participation. Typically, medium-term planning\nstrategies (e.g., reservoir storage targets at the end of each short-term\nperiod) are determined by either optimization methods or rules of thumb.\nHowever, with the integration of variable renewable energy sources (VRESs),\noptimization-based methods suffer from deviations between the anticipated and\nactual reservoir storage, while rules of thumb could be financially\nconservative, thereby compromising short-term operating profitability in\nwholesale market participation. This paper presents a deep reinforcement\nlearning (DRL)-based framework to derive medium-term planning policies for\nVRES-integrated S-CHPs (VS-CHPs), which can leverage contextual information\nunderneath individual short-term periods and train planning policies by their\ninduced short-term operating profits in wholesale market participation. The\nproposed DRL-based framework offers two practical merits. 
First, its planning\nstrategies consider both seasonal requirements of reservoir storage and needs\nfor short-term operating profits. Second, it adopts a multi-parametric\nprogramming-based strategy to accelerate the expensive training process\nassociated with multi-step short-term operations. Finally, the DRL-based\nframework is evaluated on a real-world VS-CHP, demonstrating its advantages\nover current practice.\n","authors":["Xianbang Chen","Yikui Liu","Neng Fan","Lei Wu"],"pdf_url":"https://arxiv.org/pdf/2501.04839v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.09876v2","updated":"2025-01-08T20:50:02Z","published":"2024-09-15T21:52:14Z","title":"A Carryover Storage Valuation Framework for Medium-Term Cascaded\n Hydropower Planning: A Portland General Electric System Study","summary":" Medium-term planning of cascaded hydropower (CHP) determines appropriate\ncarryover storage levels in reservoirs to optimize the usage of available water\nresources. This optimization seeks to maximize the hydropower generated in the\ncurrent period (i.e., immediate benefit) plus the potential hydropower\ngeneration in the future period (i.e., future value). Thus, in the medium-term\nCHP planning, properly quantifying the future value deposited in carryover\nstorage is essential to achieve a balanced trade-off between immediate benefit\nand future value. To this end, this paper presents a framework to quantify the\nfuture value of carryover storage, which consists of three major steps: i)\nconstructing a model to calculate the maximum possible hydropower generation\nthat a given level of carryover storage can deliver in the future period; ii)\nextracting the implicit locational marginal water value (LMWV) of carryover\nstorage for each reservoir by applying a partition-then-extract algorithm to\nthe constructed model; and iii) developing a set of analytical rules based on\nthe extracted LMWV to effectively calculate the future value. 
These rules can\nbe seamlessly integrated into medium-term CHP planning models as tractable\nmixed-integer linear constraints to quantify the future value properly, and can\nbe easily visualized to offer valuable insights for CHP operators. Finally,\nnumerical results on a CHP system of Portland General Electric demonstrate the\neffectiveness of the presented framework in determining proper carryover\nstorage values to facilitate medium-term CHP planning.\n","authors":["Xianbang Chen","Yikui Liu","Zhiming Zhong","Neng Fan","Zhechong Zhao","Lei Wu"],"pdf_url":"https://arxiv.org/pdf/2409.09876v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.01615v2","updated":"2025-01-08T20:48:41Z","published":"2025-01-03T03:19:02Z","title":"Equity Impacts of Public Transit Network Redesign with Shared Autonomous\n Mobility Services","summary":" This study examines the equity impacts of integrating shared autonomous\nmobility services (SAMS) into transit system redesign. Using the Greater\nChicago area as a case study, we compare two optimization objectives in\nmultimodal transit network redesign: minimizing total generalized costs\n(equity-agnostic) versus prioritizing service in low-income areas\n(equity-focused). We evaluate the achieved accessibility of clustered zones\nwith redesigned transit networks under two objectives, compared to driving and\nthe existing transit network. The transit access gaps across zones and between\ntransit and driving are found to be generally reduced with the introduction of\nSAMS, but less so with the subsequent improved infrastructure under budget.\nDifferential improvement in equity is seen across suburbs and areas of the\ncity, reflecting the disparity in current transit access and improvement\npotential. In particular, SAMS bridges the transit access gaps in suburban and\ncity areas currently underserved by transit. 
The City of Chicago, which is also\ndisproportionately home to vulnerable populations, offers an avenue to improve\nvertical equity. These findings demonstrate that SAMS can enhance both\nhorizontal and vertical equity in transit systems, particularly when equity is\nexplicitly incorporated into the design objective.\n","authors":["Max T. M. Ng","Meredith Raymer","Hani S. Mahmassani","Omer Verbas","Taner Cokyasar"],"pdf_url":"https://arxiv.org/pdf/2501.01615v2.pdf","comment":"Restructuring the paper for more precise research direction"},{"id":"http://arxiv.org/abs/2501.04830v1","updated":"2025-01-08T20:36:10Z","published":"2025-01-08T20:36:10Z","title":"A Deep Learning-Based Method for Power System Resilience Evaluation","summary":" Power systems are critical infrastructure in modern society, and power\noutages can cause significant disruptions to communities and individuals' daily\nlives. The resilience of a power system measures its ability to maintain power\nsupply during highly disruptive events such as hurricanes, earthquakes, and\nthunderstorms. Traditional methods for quantifying power system resilience\ninclude statistics-based and simulation-based approaches. Statistics-based\nmethods offer a retrospective analysis of system performance without requiring\na physical model, while simulation-based methods necessitate detailed physical\nsystem information and often simplify real-world scenarios. This paper\nintroduces a deep learning-based method for evaluating power system resilience\nusing historical power outage data. The method leverages the generalization\ncapabilities of deep learning models and incorporates socio-economic and\ndemographic factors as weighting terms to highlight the impacts on vulnerable\ndemographic groups. The effectiveness of the proposed method is demonstrated\nthrough two case studies: one with real historical outage data and the other\nwith simulated outage records. 
This approach provides valuable insights into\nmeasuring power system resilience against hazardous weather events without\nrequiring a physical model of the target systems. The evaluation results can\nfurther guide the planning of distributed energy resources for resilience\nenhancement.\n","authors":["Xuesong Wang","Caisheng Wang"],"pdf_url":"https://arxiv.org/pdf/2501.04830v1.pdf","comment":"Submitted to IEEE Transactions on Power Systems"},{"id":"http://arxiv.org/abs/2408.14292v2","updated":"2025-01-08T19:37:25Z","published":"2024-08-26T14:24:35Z","title":"Decentralized Singular Value Decomposition for Large-scale Distributed\n Sensor Networks","summary":" This article studies the problem of decentralized Singular Value\nDecomposition (d-SVD), which is fundamental in various signal processing\napplications. Two scenarios are considered depending on the availability of the\ndata matrix under consideration. In the first scenario, the matrix of interest\nis row-wisely available in each local node in the network. In the second\nscenario, the matrix of interest implicitly forms an outer product from two\ndifferent series of measurements. By combining the lightweight local rational\nfunction approximation approach with parallel averaging consensus algorithms,\ntwo d-SVD algorithms are proposed to cope with the two aforementioned\nscenarios. We evaluate the proposed algorithms using two application examples:\ndecentralized sensor localization via low-rank matrix completion and\ndecentralized passive radar detection. Moreover, a novel and non-trivial\ntruncation technique, which employs a representative vector that is orthonormal\nto the principal signal subspace, is proposed to further reduce the\ncommunication cost associated with the d-SVD algorithms. 
Simulation results\nshow that the proposed d-SVD algorithms converge to the centralized solution\nwith reduced communication cost compared to those facilitated with the\nstate-of-the-art decentralized power method.\n","authors":["Yufan Fan","Marius Pesavento"],"pdf_url":"https://arxiv.org/pdf/2408.14292v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04796v1","updated":"2025-01-08T19:23:45Z","published":"2025-01-08T19:23:45Z","title":"Democratic Resilience and Sociotechnical Shocks","summary":" We focus on the potential fragility of democratic elections given modern\ninformation-communication technologies (ICT) in the Web 2.0 era. Our work\nprovides an explanation for the cascading attrition of public officials\nrecently in the United States and offers potential policy interventions from a\ndynamic system's perspective. We propose that micro-level heterogeneity across\nindividuals within crucial institutions leads to vulnerabilities of election\nsupport systems at the macro scale. Our analysis provides comparative\nstatistics to measure the fragility of systems against targeted harassment,\ndisinformation campaigns, and other adversarial manipulations that are now\ncheaper to scale and deploy. Our analysis also informs policy interventions\nthat seek to retain public officials and increase voter turnout. We show how\nlimited resources (for example, salary incentives to public officials and\ntargeted interventions to increase voter turnout) can be allocated at the\npopulation level to improve these outcomes and maximally enhance democratic\nresilience. On the one hand, structural and individual heterogeneity cause\nsystemic fragility that adversarial actors can exploit, but also provide\nopportunities for effective interventions that offer significant global\nimprovements from limited and localized actions.\n","authors":["M. Amin Rahimian","Michael P. 
Colaresi"],"pdf_url":"https://arxiv.org/pdf/2501.04796v1.pdf","comment":"Computational and Mathematical Organization Theory, forthcoming"},{"id":"http://arxiv.org/abs/2501.04793v1","updated":"2025-01-08T19:18:39Z","published":"2025-01-08T19:18:39Z","title":"A Novel Observer Design for LuGre Friction Estimation and Control","summary":" Dynamic components of the friction may directly impact the stability and\nperformance of the motion control systems. The LuGre model is a prevalent\nfriction model utilized to express this dynamic behavior. Since the LuGre model\nis very comprehensive, friction compensation based on it might be challenging.\nInspired by this, we develop a novel observer to estimate and compensate for\nLuGre friction. Furthermore, we present a Lyapunov stability analysis to show\nthat observer dynamics are asymptotically stable under certain conditions.\nCompared to its counterparts, the proposed observer constitutes a simple and\nstandalone scheme that can be utilized with arbitrary control inputs in a\nstraightforward way. As a primary difference, the presented observer estimates\nvelocity and uses the velocity error to estimate friction in addition to\ncontrol input. The extensive simulations revealed that the introduced observer\nenhances position and velocity tracking performance in the presence of\nfriction.\n","authors":["Caner Odabaş","Ömer Morgül"],"pdf_url":"https://arxiv.org/pdf/2501.04793v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04783v1","updated":"2025-01-08T19:01:32Z","published":"2025-01-08T19:01:32Z","title":"Traffic Simulations: Multi-City Calibration of Metropolitan Highway\n Networks","summary":" This paper proposes an approach to perform travel demand calibration for\nhigh-resolution stochastic traffic simulators. It employs abundant travel times\nat the path-level, departing from the standard practice of resorting to scarce\nsegment-level sensor counts. 
The proposed approach is shown to tackle\nhigh-dimensional instances in a sample-efficient way. For the first time, case\nstudies on 6 metropolitan highway networks are carried out, considering a total\nof 54 calibration scenarios. This is the first work to show the ability of a\ncalibration algorithm to systematically scale across networks. Compared to the\nstate-of-the-art simultaneous perturbation stochastic approximation (SPSA)\nalgorithm, the proposed approach enhances fit to field data by an average 43.5%\nwith a maximum improvement of 80.0%, and does so within fewer simulation calls.\n","authors":["Chao Zhang","Yechen Li","Neha Arora","Damien Pierce","Carolina Osorio"],"pdf_url":"https://arxiv.org/pdf/2501.04783v1.pdf","comment":"Published on the 27th IEEE International Conference on Intelligent\n Transportation Systems (ITSC) (2024)"},{"id":"http://arxiv.org/abs/2501.04759v1","updated":"2025-01-08T17:28:30Z","published":"2025-01-08T17:28:30Z","title":"Optimize the parameters of the PID Controller using Genetic Algorithm\n for Robot Manipulators","summary":" This paper presents the design a Proportional-Integral-Derivative (PID)\ncontroller with optimized parameters for a two-degree-of-freedom robotic arm. A\ngenetic algorithm (GA) is proposed to optimize the controller parameters,\naddressing the challenges in determining PID controller parameters for highly\nnonlinear systems like robotic arms compared to traditional methods. The\nGA-optimized PID controller significantly improves control accuracy and\nperformance over traditional control methods. Simulation results demonstrate\nthat the robotic arm system operates with high precision and stability.\nAdditionally, the shortened trajectory tracking response time enhances the\nfeasibility of applying this control algorithm in realworld scenarios. 
This\nresearch not only confirms the suitability of PID-GA for robotic arms and\nsimilar systems but also opens new avenues for applying this algorithm to real\nphysical systems.\n","authors":["Vu Ngoc Son","Pham Van Cuong","Nguyen Duy Minh","Phi Hoang Nha"],"pdf_url":"https://arxiv.org/pdf/2501.04759v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04754v1","updated":"2025-01-08T16:57:11Z","published":"2025-01-08T16:57:11Z","title":"Development of an Adaptive Sliding Mode Controller using Neural Networks\n for Trajectory Tracking of a Cylindrical Manipulator","summary":" Cylindrical manipulators are extensively used in industrial automation,\nespecially in emerging technologies like 3D printing, which represents a\nsignificant future trend. However, controlling the trajectory of nonlinear\nmodels with system uncertainties remains a critical challenge, often leading to\nreduced accuracy and reliability. To address this, the study develops an\nAdaptive Sliding Mode Controller (ASMC) integrated with Neural Networks (NNs)\nto improve trajectory tracking for cylindrical manipulators. The ASMC leverages\nthe robustness of sliding mode control and the adaptability of neural networks\nto handle uncertainties and dynamic variations effectively. 
Simulation results\nvalidate that the proposed ASMC-NN achieves high trajectory tracking accuracy,\nfast response time, and enhanced reliability, making it a promising solution\nfor applications in 3D printing and beyond.\n","authors":["TieuNien Le","VanCuong Pham","NgocSon Vu"],"pdf_url":"https://arxiv.org/pdf/2501.04754v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04746v1","updated":"2025-01-08T10:02:22Z","published":"2025-01-08T10:02:22Z","title":"Towards resilient cities: A hybrid simulation framework for risk\n mitigation through data driven decision making","summary":" Providing a comprehensive view of the city operation and offering useful\nmetrics for decision making is a well known challenge for urban risk analysis\nsystems. Existing systems are, in many cases, generalizations of previous\ndomain specific tools and or methodologies that may not cover all urban\ninterdependencies and makes it difficult to have homogeneous indicators. In\norder to overcome this limitation while seeking for effective support to\ndecision makers, this article introduces a novel hybrid simulation framework\nfor risk mitigation. The framework is built on a proposed city concept that\nconsiders urban space as a Complex Adaptive System composed by interconnected\nCritical Infrastructures. In this concept, a Social System, which models daily\npatterns and social interactions of the citizens in the Urban Landscape, drives\nthe CIs demand to configure the full city picture. The frameworks hybrid design\nintegrates agent based and network based modeling by breaking down city agents\ninto system dependent subagents, to enable both inter and intra system\ninteraction simulation, respectively. A layered structure of indicators at\ndifferent aggregation levels is also developed, to ensure that decisions are\nnot only data driven but also explainable. 
Therefore, the proposed simulation\nframework can serve as a DSS tool that allows the quantitative analysis of the\nimpact of threats at different levels. First, system level metrics can be used\nto get a broad view on the city resilience. Then, agent level metrics back\nthose figures and provide better explainability. On implementation, the\nproposed framework enables component reusability (for eased coding), simulation\nfederation (enabling the integration of existing system oriented simulators),\ndiscrete simulation in accelerated time (for rapid scenario simulation) and\ndecision oriented visualization (for informed outputs).\n","authors":["David Carraminana","Ana M. Bernardos","Juan A. Besada","Jose R. Casar"],"pdf_url":"https://arxiv.org/pdf/2501.04746v1.pdf","comment":"24 pages"},{"id":"http://arxiv.org/abs/2501.07590v1","updated":"2025-01-08T11:17:44Z","published":"2025-01-08T11:17:44Z","title":"Ultrafast pulsed laser evaluation of Single Event Transients in\n opto-couplers","summary":" We build a 1064 nm fiber laser system-based testing facility for emulating\nSETs in different electronics components and ICs. Using these facilities, we\ntested the 4N35 optocoupler to observe SETs for the first time.\n","authors":["Kavin Dave","Aditya Mukherjee","Hari Shanker Gupta","Deepak Jain","Shalabh Gupta"],"pdf_url":"https://arxiv.org/pdf/2501.07590v1.pdf","comment":"Accepted in CLEO 2023, San Jose, USA and CLEO 2024, North Carolina,\n USA for in poster presentation. However due to lack of funds, we could not\n travel"}],"Optimization and Control":[{"id":"http://arxiv.org/abs/2501.04668v1","updated":"2025-01-08T18:28:56Z","published":"2025-01-08T18:28:56Z","title":"Semilinear Dynamic Programming: Analysis, Algorithms, and Certainty\n Equivalence Properties","summary":" We consider a broad class of dynamic programming (DP) problems that involve a\npartially linear structure and some positivity properties in their system\nequation and cost function. 
We address deterministic and stochastic problems,\npossibly with Markov jump parameters. We focus primarily on infinite horizon\nproblems and prove that under our assumptions, the optimal cost function is\nlinear, and that an optimal policy can be computed efficiently with standard DP\nalgorithms. Moreover, we show that forms of certainty equivalence hold for our\nstochastic problems, in analogy with the classical linear quadratic optimal\ncontrol problems.\n","authors":["Yuchao Li","Dimitri Bertsekas"],"pdf_url":"https://arxiv.org/pdf/2501.04668v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04658v1","updated":"2025-01-08T18:13:35Z","published":"2025-01-08T18:13:35Z","title":"Quadratic-form Optimal Transport","summary":" We introduce the framework of quadratic-form optimal transport (QOT), whose\ntransport cost has the form $\\iint c\\,\\mathrm{d}\\pi \\otimes\\mathrm{d}\\pi$ for\nsome coupling $\\pi$ between two marginals. Interesting examples of\nquadratic-form transport cost and their optimization include the variance of a\nbivariate function, covariance, Kendall's tau, the Gromov--Wasserstein\ndistance, quadratic assignment problems, and quadratic regularization of\nclassic optimal transport. QOT leads to substantially different mathematical\nstructures compared to classic transport problems and many technical\nchallenges. We illustrate the fundamental properties of QOT, provide several\ncases where explicit solutions are obtained, and give general lower bounds of\nthe optimal transport costs. 
For a wide class of cost functions, including the\nrectangular cost functions, the QOT problem is solved by a new coupling called\nthe diamond transport, whose copula is supported on a diamond in the unit\nsquare.\n","authors":["Ruodu Wang","Zhenyuan Zhang"],"pdf_url":"https://arxiv.org/pdf/2501.04658v1.pdf","comment":"43 pages, 5 figures"},{"id":"http://arxiv.org/abs/2501.04629v1","updated":"2025-01-08T17:20:55Z","published":"2025-01-08T17:20:55Z","title":"Characterizations of Variational Convexity and Tilt Stability via\n Quadratic Bundles","summary":" In this paper, we establish characterizations of variational $s$-convexity\nand tilt stability for prox-regular functions in the absence of subdifferential\ncontinuity via quadratic bundles, a kind of primal-dual generalized\nsecond-order derivatives recently introduced by Rockafellar. Deriving such\ncharacterizations in the effective pointbased form requires a certain revision\nof quadratic bundles investigated below. Our device is based on the notion of\ngeneralized twice differentiability and its novel characterization via\nclassical twice differentiability of the associated Moreau envelopes combined\nwith various limiting procedures for functions and sets.\n","authors":["Pham Duy Khanh","Boris S. 
Mordukhovich","Vo Thanh Phat","Le Duc Viet"],"pdf_url":"https://arxiv.org/pdf/2501.04629v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04603v1","updated":"2025-01-08T16:40:58Z","published":"2025-01-08T16:40:58Z","title":"Infinite Horizon Fully Coupled Nonlinear Forward-Backward Stochastic\n Difference Equations and their Application to LQ Optimal Control Problems","summary":" This paper focuses on the study of infinite horizon fully coupled nonlinear\nforward-backward stochastic difference equations (FBS$\\bigtriangleup$Es).\nFirstly, we establish a pair of priori estimates for the solutions to forward\nstochastic difference equations (FS$\\bigtriangleup$Es) and backward stochastic\ndifference equations (BS$\\bigtriangleup$Es) respectively. Then, to achieve\nbroader applicability, we utilize a set of domination-monotonicity conditions\nwhich are more lenient than general ones. Using these conditions, we apply\ncontinuation methods to prove the unique solvability of infinite horizon fully\ncoupled FBS$\\bigtriangleup$Es and derive a set of solution estimates.\nFurthermore, our results have considerable implications for a variety of\nrelated linear quadratic (LQ) problems, especially when the stochastic\nHamiltonian system is consistent with FBS$\\bigtriangleup$Es satisfying these\nintroduced domination-monotonicity conditions. 
Thus, by solving the associated\nstochastic Hamiltonian system, we can derive an explicit expression for the\nunique optimal control.\n","authors":["Xinyu Ma","Xun Li","Qingxin Meng"],"pdf_url":"https://arxiv.org/pdf/2501.04603v1.pdf","comment":"arXiv admin note: text overlap with arXiv:2410.01749"},{"id":"http://arxiv.org/abs/2404.03604v3","updated":"2025-01-08T16:14:22Z","published":"2024-04-04T17:25:25Z","title":"A Unified Algorithmic Framework for Dynamic Assortment Optimization\n under MNL Choice","summary":" We consider assortment and inventory planning problems with dynamic\nstockout-based substitution effects, and without replenishment, in two\ndifferent settings: (1) Customers can see all available products when they\narrive, a typical scenario in physical stores. (2) The seller can choose to\noffer a subset of available products to each customer, which is more common on\nonline platforms. Both settings are known to be computationally challenging,\nand the current approximation algorithms for the two settings are quite\ndifferent. We develop a unified algorithm framework under the MNL choice model\nfor both settings. Our algorithms improve on the state-of-the-art algorithms in\nterms of approximation guarantee and runtime, and the ability to manage\nuncertainty in the total number of customers and handle more complex\nconstraints. 
In the process, we establish various novel properties of dynamic\nassortment planning (for the MNL choice model) that may be useful more broadly.\n","authors":["Shuo Sun","Rajan Udwani","Zuo-Jun Max Shen"],"pdf_url":"https://arxiv.org/pdf/2404.03604v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.03496v2","updated":"2025-01-08T16:14:12Z","published":"2024-08-07T01:33:44Z","title":"A three-stage method for reconstructing multiple coefficients in coupled\n photoacoustic and diffuse optical imaging","summary":" This paper studies inverse problems in quantitative photoacoustic tomography\nwith additional optical current data supplemented from diffuse optical\ntomography. We propose a three-stage image reconstruction method for the\nsimultaneous recovery of the absorption, diffusion, and Gr\\\"uneisen\ncoefficients. We demonstrate, through numerical simulations, that: (i) when the\nGr\\\"uneisen coefficient is known, the addition of the optical measurements\nallows a more accurate reconstruction of the scattering and absorption\ncoefficients; and (ii) when the Gr\\\"uneisen coefficient is not known, the\naddition of optical current measurements allows us to reconstruct uniquely the\nGr\\\"uneisen, the scattering and absorption coefficients. 
Numerical simulations\nbased on synthetic data are presented to demonstrate the effectiveness of the\nproposed idea.\n","authors":["Yinxi Pan","Kui Ren","Shanyin Tong"],"pdf_url":"https://arxiv.org/pdf/2408.03496v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04585v1","updated":"2025-01-08T16:06:15Z","published":"2025-01-08T16:06:15Z","title":"Accelerated Extragradient-Type Methods -- Part 2: Generalization and\n Sublinear Convergence Rates under Co-Hypomonotonicity","summary":" Following the first part of our project, this paper comprehensively studies\ntwo types of extragradient-based methods: anchored extragradient and Nesterov's\naccelerated extragradient for solving [non]linear inclusions (and, in\nparticular, equations), primarily under the Lipschitz continuity and the\nco-hypomonotonicity assumptions. We unify and generalize a class of anchored\nextragradient methods for monotone inclusions to a wider range of schemes\nencompassing existing algorithms as special cases. We establish\n$\\mathcal{O}(1/k)$ last-iterate convergence rates on the residual norm of the\nunderlying mapping for this general framework and then specialize it to obtain\nconvergence guarantees for specific instances, where $k$ denotes the iteration\ncounter. We extend our approach to a class of anchored Tseng's\nforward-backward-forward splitting methods to obtain a broader class of\nalgorithms for solving co-hypomonotone inclusions. Again, we analyze\n$\\mathcal{O}(1/k)$ last-iterate convergence rates for this general scheme and\nspecialize it to obtain convergence results for existing and new variants. We\ngeneralize and unify Nesterov's accelerated extra-gradient method to a new\nclass of algorithms that covers existing schemes as special instances while\ngenerating new variants. For these schemes, we can prove $\\mathcal{O}(1/k)$\nlast-iterate convergence rates for the residual norm under co-hypomonotonicity,\ncovering a class of nonmonotone problems. 
We propose another novel class of\nNesterov's accelerated extragradient methods to solve inclusions.\nInterestingly, these algorithms achieve both $\\mathcal{O}(1/k)$ and $o(1/k)$\nlast-iterate convergence rates, and also the convergence of iterate sequences\nunder co-hypomonotonicity and Lipschitz continuity. Finally, we provide a set\nof numerical experiments encompassing different scenarios to validate our\nalgorithms and theoretical guarantees.\n","authors":["Quoc Tran-Dinh","Nghia Nguyen-Trung"],"pdf_url":"https://arxiv.org/pdf/2501.04585v1.pdf","comment":"75 pages, 7 figures, and 1 table"},{"id":"http://arxiv.org/abs/2401.03692v3","updated":"2025-01-08T16:05:00Z","published":"2024-01-08T06:46:39Z","title":"Boosting Column Generation with Graph Neural Networks for Joint Rider\n Trip Planning and Crew Shift Scheduling","summary":" Optimizing service schedules is pivotal to the reliable, efficient, and\ninclusive on-demand mobility. This pressing challenge is further exacerbated by\nthe increasing needs of an aging population, the oversubscription of existing\nservices, and the lack of effective solution methods. This study addresses the\nintricacies of service scheduling, by jointly optimizing rider trip planning\nand crew scheduling for a complex dynamic mobility service. The resulting\noptimization problems are extremely challenging computationally for\nstate-of-the-art methods. To address this fundamental gap, this paper\nintroduces the Joint Rider Trip Planning and Crew Shift Scheduling Problem\n(JRTPCSSP) and a novel solution method, called Attention and Gated GNN-Informed\nColumn Generation (AGGNNI-CG), that hybridizes column generation and machine\nlearning to obtain near-optimal solutions to the JRTPCSSP with real-life\nconstraints of the application. The key idea of the machine-learning component\nis to dramatically reduce the number of paths to explore in the pricing\nproblem, accelerating the most time-consuming component of the column\ngeneration. 
The machine learning component is a graph neural network with an\nattention mechanism and a gated architecture, which is particularly suited to\ncater for the different input sizes coming from daily operations. AGGNNI-CG has\nbeen applied to a challenging, real-world dataset from the Paratransit system\nof Chatham County in Georgia. It produces substantial improvements compared to\nthe baseline column generation approach, which typically cannot produce\nhigh-quality feasible solutions in reasonable time on large-scale complex\ninstances. AGGNNI-CG also produces significant improvements in service quality\ncompared to the existing system.\n","authors":["Jiawei Lu","Tinghan Ye","Wenbo Chen","Pascal Van Hentenryck"],"pdf_url":"https://arxiv.org/pdf/2401.03692v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.16957v3","updated":"2025-01-08T15:57:38Z","published":"2024-12-22T10:33:09Z","title":"Euclidean distance discriminants and Morse attractors","summary":" Our study concerns the Euclidean distance function in case of complex plane\ncurves. We decompose the ED discriminant into 3 parts which are responsible for\nthe 3 types of behavior of the Morse points, and we find the structure of each\none. In particular we shed light on the ``atypical discriminant'' which is due\nto the loss of Morse points at infinity. 
We find formulas for the number of\nMorse singularities which abut to the corresponding 3 types of attractors when\nmoving the centre of the distance function toward a point of the discriminant.\n","authors":["Cezar Joiţa","Dirk Siersma","Mihai Tibăr"],"pdf_url":"https://arxiv.org/pdf/2412.16957v3.pdf","comment":"several improvements in Section 3"},{"id":"http://arxiv.org/abs/2412.19367v3","updated":"2025-01-08T15:51:21Z","published":"2024-12-26T22:23:11Z","title":"Central limit theorems for vector-valued composite functionals with\n smoothing and applications","summary":" This paper focuses on vector-valued composite functionals, which may be\nnonlinear in probability. Our primary goal is to establish central limit\ntheorems for these functionals when mixed estimators are employed. Our study is\nrelevant to the evaluation and comparison of risk in decision-making contexts\nand extends to functionals that arise in machine learning methods. A\ngeneralized family of composite risk functionals is presented, which\nencompasses most of the known coherent risk measures including systemic\nmeasures of risk. The paper makes two main contributions. First, we analyze\nvector-valued functionals, providing a framework for evaluating\nhigh-dimensional risks. This framework facilitates the comparison of multiple\nrisk measures, as well as the estimation and asymptotic analysis of systemic\nrisk and its optimal value in decision-making problems. Second, we derive novel\ncentral limit theorems for optimized composite functionals when mixed types of\nestimators: empirical and smoothed estimators are used. We provide verifiable\nsufficient conditions for the central limit formulae and show their\napplicability to several popular measures of risk.\n","authors":["Huihui Chen","Darinka Dentcheva","Yang Lin","Gregory J. 
Stock"],"pdf_url":"https://arxiv.org/pdf/2412.19367v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04572v1","updated":"2025-01-08T15:42:41Z","published":"2025-01-08T15:42:41Z","title":"Regret Analysis: a control perspective","summary":" Online learning and model reference adaptive control have many interesting\nintersections. One area where they differ however is in how the algorithms are\nanalyzed and what objective or metric is used to discriminate \"good\" algorithms\nfrom \"bad\" algorithms. In adaptive control there are usually two objectives: 1)\nprove that all time varying parameters/states of the system are bounded, and 2)\nthat the instantaneous error between the adaptively controlled system and a\nreference system converges to zero over time (or at least a compact set). For\nonline learning the performance of algorithms is often characterized by the\nregret the algorithm incurs. Regret is defined as the cumulative loss (cost)\nover time from the online algorithm minus the cumulative loss (cost) of the\nsingle optimal fixed parameter choice in hindsight. Another significant\ndifference between the two areas of research is with regard to the assumptions\nmade in order to obtain said results. Adaptive control makes assumptions about\nthe input-output properties of the control problem and derives solutions for a\nfixed error model or optimization task. In the online learning literature\nresults are derived for classes of loss functions (i.e. convex) while a priori\nassuming that all time varying parameters are bounded, which for many\noptimization tasks is not unrealistic, but is a non starter in control\napplications. In this work we discuss these differences in detail through the\nregret based analysis of gradient descent for convex functions and the control\nbased analysis of a streaming regression problem. 
We close with a discussion\nabout the newly defined paradigm of online adaptive control and ask the\nfollowing question \"Are regret optimal control strategies deployable?\"\n","authors":["Travis E. Gibson","Sawal Acharya"],"pdf_url":"https://arxiv.org/pdf/2501.04572v1.pdf","comment":"10 pages no figures"},{"id":"http://arxiv.org/abs/2402.07099v3","updated":"2025-01-08T15:37:04Z","published":"2024-02-11T04:09:50Z","title":"Rethinking the Capacity of Graph Neural Networks for Branching Strategy","summary":" Graph neural networks (GNNs) have been widely used to predict properties and\nheuristics of mixed-integer linear programs (MILPs) and hence accelerate MILP\nsolvers. This paper investigates the capacity of GNNs to represent strong\nbranching (SB), the most effective yet computationally expensive heuristic\nemployed in the branch-and-bound algorithm. In the literature, message-passing\nGNN (MP-GNN), as the simplest GNN structure, is frequently used as a fast\napproximation of SB and we find that not all MILPs's SB can be represented with\nMP-GNN. We precisely define a class of \"MP-tractable\" MILPs for which MP-GNNs\ncan accurately approximate SB scores. Particularly, we establish a universal\napproximation theorem: for any data distribution over the MP-tractable class,\nthere always exists an MP-GNN that can approximate the SB score with\narbitrarily high accuracy and arbitrarily high probability, which lays a\ntheoretical foundation of the existing works on imitating SB with MP-GNN. For\nMILPs without the MP-tractability, unfortunately, a similar result is\nimpossible, which can be illustrated by two MILP instances with different SB\nscores that cannot be distinguished by any MP-GNN, regardless of the number of\nparameters. 
Recognizing this, we explore another GNN structure called the\nsecond-order folklore GNN (2-FGNN) that overcomes this limitation, and the\naforementioned universal approximation theorem can be extended to the entire\nMILP space using 2-FGNN, regardless of the MP-tractability. A small-scale\nnumerical experiment is conducted to directly validate our theoretical\nfindings.\n","authors":["Ziang Chen","Jialin Liu","Xiaohan Chen","Xinshang Wang","Wotao Yin"],"pdf_url":"https://arxiv.org/pdf/2402.07099v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04563v1","updated":"2025-01-08T15:24:20Z","published":"2025-01-08T15:24:20Z","title":"On Branch-and-Price for Project Scheduling","summary":" Integer programs for resource-constrained project scheduling problems are\nnotoriously hard to solve due to their weak linear relaxations. Several papers\nhave proposed reformulating project scheduling problems via Dantzig-Wolfe\ndecomposition to strengthen their linear relaxation and decompose large problem\ninstances. The reformulation gives rise to a master problem that has a large\nnumber of variables. Therefore, the master problem is solved by a column\ngeneration procedure embedded in a branching framework, also known as\nbranch-and-price. While branch-and-price has been successfully applied to many\nproblem classes, it turns out to be ineffective for most project scheduling\nproblems. This paper identifies drivers of the ineffectiveness by analyzing the\nstructure of the reformulated problem and the strength of different branching\nschemes. Our analysis shows that the reformulated problem has an unfavorable\nstructure for column generation: It is highly degenerate, slowing down the\nconvergence of column generation, and for many project scheduling problems, it\nyields the same or only slightly stronger linear relaxations as classical\nformulations at the expense of large increases in runtime. 
Our computational\nexperiments complement our theoretical findings.\n","authors":["Maximilian Kolter","Martin Grunow","Rainer Kolisch"],"pdf_url":"https://arxiv.org/pdf/2501.04563v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.04158v3","updated":"2025-01-08T15:21:26Z","published":"2024-04-05T15:10:30Z","title":"Hardness of circuit and monotone diameters of polytopes","summary":" The Circuit diameter of polytopes was introduced by Borgwardt, Finhold and\nHemmecke as a fundamental tool for the study of circuit augmentation schemes\nfor linear programming and for estimating combinatorial diameters. Determining\nthe complexity of computing the circuit diameter of polytopes was posed as an\nopen problem by Sanit\\`a as well as by Kafer, and was recently reiterated by\nBorgwardt, Grewe, Kafer, Lee and Sanit\\`a.\n In this paper, we solve this problem by showing that computing the circuit\ndiameter of a polytope given in halfspace-description is strongly NP-hard. To\nprove this result, we show that computing the combinatorial diameter of the\nperfect matching polytope of a bipartite graph is NP-hard. This complements a\nresult by Sanit\\`a (FOCS 2018) on the NP-hardness of computing the diameter of\nfractional matching polytopes and implies the new result that computing the\ndiameter of a $\\{0,1\\}$-polytope is strongly NP-hard, which may be of\nindependent interest. In our second main result, we give a precise\ngraph-theoretic description of the monotone diameter of perfect matching\npolytopes and use this description to prove that computing the monotone\n(circuit) diameter of a given input polytope is strongly NP-hard as well.\n","authors":["Christian Nöbel","Raphael Steiner"],"pdf_url":"https://arxiv.org/pdf/2404.04158v3.pdf","comment":"21 pages, 9 figures. 
Restructured paper"},{"id":"http://arxiv.org/abs/2501.04548v1","updated":"2025-01-08T14:54:18Z","published":"2025-01-08T14:54:18Z","title":"Optimal Control of the Navier-Stokes equations via Pressure Boundary\n Conditions","summary":" In this work we study an optimal control problem subject to the instationary\nNavier-Stokes equations, where the control enters via an inhomogeneous\nNeumann/Do-Nothing boundary condition. Despite the Navier-Stokes equations with\nthese boundary conditions not being well-posed for large times and/or data, we\nobtain wellposedness of the optimal control problem by choosing a proper\ntracking type term. In order to discuss the regularity of the optimal control,\nstate and adjoint state, we present new results on $L^2(I;H^2(\\Omega))$\nregularity of solutions to a Stokes problem with mixed inhomogeneous boundary\nconditions.\n","authors":["Boris Vexler","Jakob Wagner"],"pdf_url":"https://arxiv.org/pdf/2501.04548v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04536v1","updated":"2025-01-08T14:36:22Z","published":"2025-01-08T14:36:22Z","title":"Scalable Derivative-Free Optimization Algorithms with Low-Dimensional\n Subspace Techniques","summary":" We re-introduce a derivative-free subspace optimization framework originating\nfrom Chapter 5 of the Ph.D. thesis [Z. Zhang, On Derivative-Free Optimization\nMethods, Ph.D. thesis, Chinese Academy of Sciences, Beijing, 2012] of the\nauthor under the supervision of Ya-xiang Yuan. At each iteration, the framework\ndefines a (low-dimensional) subspace based on an approximate gradient, and then\nsolves a subproblem in this subspace to generate a new iterate. 
We sketch the\nglobal convergence and worst-case complexity analysis of the framework,\nelaborate on its implementation, and present some numerical results on solving\nproblems with dimensions as high as 10^4 using only inaccurate function values.\n","authors":["Zaikun Zhang"],"pdf_url":"https://arxiv.org/pdf/2501.04536v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04491v1","updated":"2025-01-08T13:23:50Z","published":"2025-01-08T13:23:50Z","title":"A fast iterative thresholding and support-and-scale shrinking algorithm\n (fits3) for non-lipschitz group sparse optimization (i): the case of\n least-squares fidelity","summary":" We consider to design a new efficient and easy-to-implement algorithm to\nsolve a general group sparse optimization model with a class of non-convex\nnon-Lipschitz regularizations, named as fast iterative thresholding and\nsupport-and-scale shrinking algorithm (FITS3). In this paper we focus on the\ncase of a least-squares fidelity. FITS3 is designed from a lower bound theory\nof such models and by integrating thresholding operation, linearization and\nextrapolation techniques. The FITS3 has two advantages. Firstly, it is quite\nefficient and especially suitable for large-scale problems, because it adopts\nsupport-and-scale shrinking and does not need to solve any linear or nonlinear\nsystem. For two important special cases, the FITS3 contains only simple\ncalculations like matrix-vector multiplication and soft thresholding. Secondly,\nthe FITS3 algorithm has a sequence convergence guarantee under proper\nassumptions. 
The numerical experiments and comparisons to recent existing\nnon-Lipschitz group recovery algorithms demonstrate that, the proposed FITS3\nachieves similar recovery accuracies, but costs only around a half of the CPU\ntime by the second fastest compared algorithm for median or large-scale\nproblems.\n","authors":["Yanan Zhao","Qiaoli Dong","Yufei Zhao","Chunlin Wu"],"pdf_url":"https://arxiv.org/pdf/2501.04491v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03744v2","updated":"2025-01-08T13:12:34Z","published":"2025-01-07T12:38:21Z","title":"Hydrogen Network Expansion Planning considering the Chicken-and-egg\n Dilemma and Market Uncertainty","summary":" Green hydrogen is thought to be a game changer for reaching sustainability\ntargets. However, the transition to a green hydrogen economy faces a critical\nchallenge known as the `chicken-and-egg dilemma', wherein establishing a\nhydrogen supply network relies on demand, while demand only grows with reliable\nsupply. In addition, as the hydrogen market is in the early stage, predicting\ndemand distributions is challenging due to lack of data availability. This\npaper addresses these complex issues through a risk-averse framework with the\nintroduction of a distributionally robust hydrogen network expansion planning\nproblem under decision-dependent demand ambiguity. The problem optimizes\nlocation and production capacity decisions of the suppliers considering the\nmoments of the stochastic hydrogen demand as a function of these investment\ndecisions. To obtain tractable representations of this problem, we derive two\ndifferent reformulations that consider continuous and discrete hydrogen demand\nsupport sets under different forms of decision dependencies. 
To efficiently\nsolve the reformulations, we develop a tailored algorithm based on the\ncolumn-and-constraint generation approach, and enhance the computational\nperformance through solving the master problems to a relative optimality gap,\ndecomposing the subproblems, and integrating pre-generated columns and\nconstraints. To validate the effectiveness of our approach, we investigate a\nreal case study leveraging data from the \"Hydrogen Energy Applications in\nValley Environments for Northern Netherlands (HEAVENN)\" project. The results\nreveal that considering the chicken-and-egg dilemma under uncertain hydrogen\nmarket conditions leads to earlier and more diverse investments, providing\ncritical insights for policymakers based on the degree of decision dependency.\n","authors":["Sezen Ece Kayacık","Beste Basciftci","Albert H. Schrotenboer","Iris F. A. Vis","Evrim Ursavas"],"pdf_url":"https://arxiv.org/pdf/2501.03744v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03390v2","updated":"2025-01-08T12:20:18Z","published":"2025-01-06T21:15:03Z","title":"State-of-the-art Methods for Pseudo-Boolean Solving with SCIP","summary":" The Pseudo-Boolean problem deals with linear or polynomial constraints with\ninteger coefficients over Boolean variables. The objective lies in optimizing a\nlinear objective function, or finding a feasible solution, or finding a\nsolution that satisfies as many constraints as possible. In the 2024\nPseudo-Boolean competition, solvers incorporating the SCIP framework won five\nout of six categories it was competing in. From a total of 1,207 instances,\nSCIP successfully solved 759, while its parallel version FiberSCIP solved 776.\nBased on the results from the competition, we further enhanced SCIP's\nPseudo-Boolean capabilities. 
This article discusses the results and presents\nthe winning algorithmic ideas.\n","authors":["Gioni Mexi","Dominik Kamp","Yuji Shinano","Shanwen Pu","Alexander Hoen","Ksenia Bestuzheva","Christopher Hojny","Matthias Walter","Marc E. Pfetsch","Sebastian Pokutta","Thorsten Koch"],"pdf_url":"https://arxiv.org/pdf/2501.03390v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04443v1","updated":"2025-01-08T11:52:43Z","published":"2025-01-08T11:52:43Z","title":"Revisiting LocalSGD and SCAFFOLD: Improved Rates and Missing Analysis","summary":" LocalSGD and SCAFFOLD are widely used methods in distributed stochastic\noptimization, with numerous applications in machine learning, large-scale data\nprocessing, and federated learning. However, rigorously establishing their\ntheoretical advantages over simpler methods, such as minibatch SGD (MbSGD), has\nproven challenging, as existing analyses often rely on strong assumptions,\nunrealistic premises, or overly restrictive scenarios.\n In this work, we revisit the convergence properties of LocalSGD and SCAFFOLD\nunder a variety of existing or weaker conditions, including gradient\nsimilarity, Hessian similarity, weak convexity, and Lipschitz continuity of the\nHessian. Our analysis shows that (i) LocalSGD achieves faster convergence\ncompared to MbSGD for weakly convex functions without requiring stronger\ngradient similarity assumptions; (ii) LocalSGD benefits significantly from\nhigher-order similarity and smoothness; and (iii) SCAFFOLD demonstrates faster\nconvergence than MbSGD for a broader class of non-quadratic functions. 
These\ntheoretical insights provide a clearer understanding of the conditions under\nwhich LocalSGD and SCAFFOLD outperform MbSGD.\n","authors":["Ruichen Luo","Sebastian U Stich","Samuel Horváth","Martin Takáč"],"pdf_url":"https://arxiv.org/pdf/2501.04443v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03718v2","updated":"2025-01-08T11:52:40Z","published":"2025-01-07T11:58:10Z","title":"Scalable Second-Order Optimization Algorithms for Minimizing Low-rank\n Functions","summary":" We present a random-subspace variant of cubic regularization algorithm that\nchooses the size of the subspace adaptively, based on the rank of the projected\nsecond derivative matrix. Iteratively, our variant only requires access to\n(small-dimensional) projections of first- and second-order problem derivatives\nand calculates a reduced step inexpensively. The ensuing method maintains the\noptimal global rate of convergence of (full-dimensional) cubic regularization,\nwhile showing improved scalability both theoretically and numerically,\nparticularly when applied to low-rank functions. When applied to the latter,\nour algorithm naturally adapts the subspace size to the true rank of the\nfunction, without knowing it a priori.\n","authors":["Edward Tansley","Coralia Cartis"],"pdf_url":"https://arxiv.org/pdf/2501.03718v2.pdf","comment":"Accepted at NeurIPS 2024 Workshop OPT2024: Optimization for Machine\n Learning; fixed typo on page 5"},{"id":"http://arxiv.org/abs/2309.05596v3","updated":"2025-01-08T10:33:39Z","published":"2023-09-11T16:24:05Z","title":"Output-Positive Adaptive Control of Hyperbolic PDE-ODE Cascades","summary":" In this paper, we propose a new adaptive Control Barrier Function (aCBF)\nmethod to design the output-positive adaptive control law for a hyperbolic\nPDE-ODE cascade with parametric uncertainties. 
This method employs the recent\nadaptive control approach with batch least-squares identification (BaLSI,\npronounced \"ballsy\") that completes perfect parameter identification in finite\ntime and offers a previously unforeseen advantage in safe control design with\naCBF, which we elucidate in this paper. Since the true challenge is exhibited\nfor CBF of a high relative degree, we undertake a control design in this paper\nfor a class of systems that possess a particularly extreme relative degree:\n$2\\times2$ hyperbolic PDEs sandwiched by a strict-feedback nonlinear ODE and a\nlinear ODE, where the unknown coefficients are associated with the PDE\nin-domain coupling terms and with the input signal of the distal ODE. The\ndesigned output-positive adaptive controller guarantees the positivity of the\noutput signal that is the furthermost state from the control input as well as\nthe exponential regulation of the overall plant state to zero. The\neffectiveness of the proposed method is illustrated by numerical simulation.\n","authors":["Ji Wang","Miroslav Krstic"],"pdf_url":"https://arxiv.org/pdf/2309.05596v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.00810v2","updated":"2025-01-08T10:23:01Z","published":"2024-03-31T21:51:28Z","title":"Off-the-grid regularisation for Poisson inverse problems","summary":" Off-the-grid regularisation has been extensively employed over the last\ndecade in the context of ill-posed inverse problems formulated in the\ncontinuous setting of the space of Radon measures $\\mathcal{M}(\\mathcal{X})$.\nThese approaches enjoy convexity and counteract the discretisation biases as\nwell the numerical instabilities typical of their discrete counterparts. 
In the\nframework of sparse reconstruction of discrete point measures (sum of weighted\nDiracs), a Total Variation regularisation norm in $\mathcal{M}(\mathcal{X})$ is\ntypically combined with an $L^2$ data term modelling additive Gaussian noise.\nTo assess the framework of off-the-grid regularisation in the presence of\nsignal-dependent Poisson noise, we consider in this work a variational model\ncoupling the Total Variation regularisation with a Kullback-Leibler data term\nunder a non-negativity constraint. Analytically, we study the optimality\nconditions of the composite functional and analyse its dual problem. Then, we\nconsider a homotopy strategy to select an optimal regularisation parameter and\nuse it within a Sliding Frank-Wolfe algorithm. Several numerical experiments on\nboth 1D/2D simulated and real 3D fluorescent microscopy data are reported.\n","authors":["Marta Lazzaretti","Claudio Estatico","Alejandro Melero","Luca Calatroni"],"pdf_url":"https://arxiv.org/pdf/2404.00810v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.09973v4","updated":"2025-01-08T09:50:24Z","published":"2024-05-16T10:37:40Z","title":"Ensemble Control for Stochastic Systems with Asymmetric Laplace Noises","summary":" This paper presents an adaptive ensemble control for stochastic systems\nsubject to asymmetric noises and outliers. Asymmetric noises skew system\nobservations, and outliers with large amplitude deteriorate the observations\neven further. Such disturbances induce poor system estimation and degraded\nstochastic system control. In this work, we model the asymmetric noises and\noutliers by mixed asymmetric Laplace distributions (ALDs), and propose an\noptimal control for stochastic systems with mixed ALD noises. Particularly, we\nsegregate the system disturbed by mixed ALD noises into subsystems, each of\nwhich is subject to a specific ALD noise. 
For each subsystem, we design an\niterative quantile filter (IQF) to estimate the system parameters using system\nobservations. With the estimated parameters by IQF, we derive the certainty\nequivalence (CE) control law for each subsystem. Then we use the Bayesian\napproach to ensemble the subsystem CE controllers, with each of the controllers\nweighted by their posterior probability. We finalize our control law as the\nweighted sum of the control signals by the sub-system CE controllers. To\ndemonstrate our approach, we conduct numerical simulations and Monte Carlo\nanalyses. The results show improved tracking performance by our approach for\nskew noises and its robustness to outliers, compared with single ALD based and\nRLS-based control policy.\n","authors":["Yajie Yu","Xuehui Ma","Shiliang Zhang","Zhuzhu Wang","Xubing Shi","Yushuai Li","Tingwen Huang"],"pdf_url":"https://arxiv.org/pdf/2405.09973v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04369v1","updated":"2025-01-08T09:16:03Z","published":"2025-01-08T09:16:03Z","title":"State-dependent preconditioning for the inner-loop in Variational Data\n Assimilation using Machine Learning","summary":" Data Assimilation is the process in which we improve the representation of\nthe state of a physical system by combining information coming from a numerical\nmodel, real-world observations, and some prior modelling. It is widely used to\nmodel and to improve forecast systems in Earth science fields such as\nmeteorology, oceanography and environmental sciences. One key aspect of Data\nassimilation is the analysis step, where the output of the numerical model is\nadjusted in order to account for the observational data. In Variational Data\nAssimilation and under Gaussian assumptions, the analysis step comes down to\nsolving a high-dimensional non-linear least-square problem. 
In practice, this\nminimization involves successive inversions of large, and possibly\nill-conditioned matrices constructed using linearizations of the forward model.\nIn order to improve the convergence rate of these methods, and thus reduce the\ncomputational burden, preconditioning techniques are often used to get\nbetter-conditioned matrices, but require either the sparsity pattern of the\nmatrix to inverse, or some spectral information. We propose to use Deep Neural\nNetworks in order to construct a preconditioner. This surrogate is trained\nusing some properties of the singular value decomposition, and is based on a\ndataset which can be constructed online to reduce the storage requirements.\n","authors":["Victor Trappler","Arthur Vidard"],"pdf_url":"https://arxiv.org/pdf/2501.04369v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04335v1","updated":"2025-01-08T08:12:23Z","published":"2025-01-08T08:12:23Z","title":"An algorithm for a constrained P-spline","summary":" Regression splines are largely used to investigate and predict data behavior,\nattracting the interest of mathematicians for their beautiful numerical\nproperties, and of statisticians for their versatility with respect to the\napplications. Several penalized spline regression models are available in the\nliterature, and the most commonly used ones in real-world applications are\nP-splines, which enjoy the advantages of penalized models while being easy to\ngeneralize across different functional spaces and higher degree order, because\nof their discrete penalty term. To face the different requirements imposed by\nthe nature of the problem or the physical meaning of the expected values, the\nP-spline definition is often modified by additional hypotheses, often\ntranslated into constraints on the solution or its derivatives. In this\nframework, our work is motivated by the aim of getting approximation models\nthat fall within pre-established thresholds. 
Specifically, starting from a set\nof observed data, we consider a P-spline constrained between some prefixed\nbounds. In our paper, we just consider 0 as lower bound, although our approach\napplies to more general cases. We propose to get nonnegativity by imposing\nlower bounds on selected sample points. The spline can be computed through a\nsequence of linearly constrained problems. We suggest a strategy to dynamically\nselect the sample points, to avoid extremely dense sampling, and therefore try\nto reduce as much as possible the computational burden. We show through some\ncomputational experiments the reliability of our approach and the accuracy of\nthe results compared to some state-of-the-art models.\n","authors":["Rosanna Campagna","Serena Crisci","Gabriele Santin","Gerardo Toraldo","Marco Viola"],"pdf_url":"https://arxiv.org/pdf/2501.04335v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.01714v4","updated":"2025-01-08T06:52:07Z","published":"2024-04-02T07:57:17Z","title":"Conjugate-Gradient-like Based Adaptive Moment Estimation Optimization\n Algorithm for Deep Learning","summary":" Training deep neural networks is a challenging task. In order to speed up\ntraining and enhance the performance of deep neural networks, we rectify the\nvanilla conjugate gradient as conjugate-gradient-like and incorporate it into\nthe generic Adam, and thus propose a new optimization algorithm named\nCG-like-Adam for deep learning. Specifically, both the first-order and the\nsecond-order moment estimation of generic Adam are replaced by the\nconjugate-gradient-like. Convergence analysis handles the cases where the\nexponential moving average coefficient of the first-order moment estimation is\nconstant and the first-order moment estimation is unbiased. 
Numerical\nexperiments show the superiority of the proposed algorithm based on the\nCIFAR10/100 dataset.\n","authors":["Jiawu Tian","Liwei Xu","Xiaowei Zhang","Yongqi Li"],"pdf_url":"https://arxiv.org/pdf/2404.01714v4.pdf","comment":"32 pages, 13 figures"},{"id":"http://arxiv.org/abs/2501.04291v1","updated":"2025-01-08T05:31:58Z","published":"2025-01-08T05:31:58Z","title":"A truncated ε-subdifferential method for global DC optimization","summary":" We consider the difference of convex (DC) optimization problem subject to\nbox-constraints. Utilizing {\\epsilon}-subdifferentials of DC components of the\nobjective, we develop a new method for finding global solutions to this\nproblem. The method combines a local search approach with a special procedure\nfor escaping non-global solutions by identifying improved initial points for a\nlocal search. The method terminates when the solution cannot be improved\nfurther. The escaping procedure is designed using subsets of the\n{\\epsilon}-subdifferentials of DC components. We compute the deviation between\nthese subsets and determine {\\epsilon}-subgradients providing this deviation.\nUsing these specific {\\epsilon}-subgradients, we formulate a subproblem with a\nconvex objective function. The solution to this subproblem serves as a starting\npoint for a local search. We study the convergence of the conceptual version of\nthe proposed method and discuss its implementation. A large number of academic\ntest problems demonstrate that the method requires reasonable computational\neffort to find higher quality solutions than other local DC optimization\nmethods. Additionally, we apply the new method to find global solutions to DC\noptimization problems and compare its performance with two benchmark global\noptimization solvers.\n","authors":["Adil M. Bagirov","Kaisa Joki","Marko M. 
Makela","Sona Taheri"],"pdf_url":"https://arxiv.org/pdf/2501.04291v1.pdf","comment":"35 pages, 9 figures"},{"id":"http://arxiv.org/abs/2211.11955v2","updated":"2025-01-08T04:21:33Z","published":"2022-11-22T02:28:17Z","title":"Optimal Stabilization of Periodic Orbits","summary":" In this contribution, the optimal stabilization problem of periodic orbits is\nstudied via invariant manifold theory and symplectic geometry. The stable\nmanifold theory for the optimal point stabilization case is generalized to the\ncase of periodic orbit stabilization, where a normally hyperbolic invariant\nmanifold (NHIM) plays the role of a hyperbolic equilibrium.\n A sufficient condition for the existence of an NHIM of an associated\nHamiltonian system is derived in terms of a periodic Riccati differential\nequation. It is shown that the problem of optimal orbit stabilization has a\nsolution if a linearized periodic system satisfies stabilizability and\ndetectability. A moving orthogonal coordinate system is employed along the\nperiodic orbit which is a natural framework for orbital stabilization and\nlinearization argument.\n Examples illustrated include an optimal control problem for a spring-mass\noscillator system, which should be stabilized at a certain energy level, and an\norbit transfer problem for a satellite, which constitutes a typical control\nproblem of orbital mechanics.\n","authors":["Fabian Beck","Noboru Sakamoto"],"pdf_url":"https://arxiv.org/pdf/2211.11955v2.pdf","comment":"Submitted for a journal on November 29 2024"},{"id":"http://arxiv.org/abs/2501.04253v1","updated":"2025-01-08T03:35:28Z","published":"2025-01-08T03:35:28Z","title":"Integrated Offline and Online Learning to Solve a Large Class of\n Scheduling Problems","summary":" In this paper, we develop a unified machine learning (ML) approach to predict\nhigh-quality solutions for single-machine scheduling problems with a\nnon-decreasing min-sum objective function with or without release times. 
Our ML\napproach is novel in three major aspects. First, our approach is developed for\nthe entire class of the aforementioned problems. To achieve this, we exploit\nthe fact that the entire class of the problems considered can be formulated as\na time-indexed formulation in a unified manner. We develop a deep neural\nnetwork (DNN) which uses the cost parameters in the time-indexed formulation as\nthe inputs to effectively predict a continuous solution to this formulation,\nbased on which a feasible discrete solution is easily constructed. The second\nnovel aspect of our approach lies in how the DNN model is trained. In view of\nthe NP-hard nature of the problems, labels (i.e., optimal solutions) are hard\nto generate for training. To overcome this difficulty, we generate and utilize\na set of special instances, for which optimal solutions can be found with\nlittle computational effort, to train the ML model offline. The third novel\nidea we employ in our approach is that we develop an online single-instance\nlearning approach to fine tune the parameters in the DNN for a given online\ninstance, with the goal of generating an improved solution for the given\ninstance. To this end, we develop a feasibility surrogate that approximates the\nobjective value of a given instance as a continuous function of the outputs of\nthe DNN, which then enables us to derive gradients and update the learnable\nparameters in the DNN. 
Numerical results show that our approach can efficiently\ngenerate high-quality solutions for a variety of single-machine scheduling\nmin-sum problems with up to 1000 jobs.\n","authors":["Anbang Liu","Zhi-Long Chen","Jinyang Jiang","Xi Chen"],"pdf_url":"https://arxiv.org/pdf/2501.04253v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.08090v4","updated":"2025-01-08T03:08:11Z","published":"2024-02-12T22:17:28Z","title":"Learning Neural Contracting Dynamics: Extended Linearization and Global\n Guarantees","summary":" Global stability and robustness guarantees in learned dynamical systems are\nessential to ensure well-behavedness of the systems in the face of uncertainty.\nWe present Extended Linearized Contracting Dynamics (ELCD), the first neural\nnetwork-based dynamical system with global contractivity guarantees in\narbitrary metrics. The key feature of ELCD is a parametrization of the extended\nlinearization of the nonlinear vector field. In its most basic form, ELCD is\nguaranteed to be (i) globally exponentially stable, (ii) equilibrium\ncontracting, and (iii) globally contracting with respect to some metric. To\nallow for contraction with respect to more general metrics in the data space,\nwe train diffeomorphisms between the data space and a latent space and enforce\ncontractivity in the latent space, which ensures global contractivity in the\ndata space. We demonstrate the performance of ELCD on the high dimensional\nLASA, multi-link pendulum, and Rosenbrock datasets.\n","authors":["Sean Jaffe","Alexander Davydov","Deniz Lapsekili","Ambuj Singh","Francesco Bullo"],"pdf_url":"https://arxiv.org/pdf/2402.08090v4.pdf","comment":"9 pages, 3 figures. 
NeurIPS 2024"},{"id":"http://arxiv.org/abs/2501.04225v1","updated":"2025-01-08T01:53:26Z","published":"2025-01-08T01:53:26Z","title":"A black-box optimization method with polynomial-based kernels and\n quadratic-optimization annealing","summary":" We introduce kernel-QA, a black-box optimization (BBO) method that constructs\nsurrogate models analytically using low-order polynomial kernels within a\nquadratic unconstrained binary optimization (QUBO) framework, enabling\nefficient utilization of Ising machines. The method has been evaluated on\nartificial landscapes, ranging from uni-modal to multi-modal, with input\ndimensions extending to 80 for real variables and 640 for binary variables. The\nresults demonstrate that kernel-QA is particularly effective for optimizing\nblack-box functions characterized by local minima and high-dimensional inputs,\nshowcasing its potential as a robust and scalable BBO approach.\n","authors":["Yuki Minamoto","Yuya Sakamoto"],"pdf_url":"https://arxiv.org/pdf/2501.04225v1.pdf","comment":"32 pages, 11 figures, and 1 table"},{"id":"http://arxiv.org/abs/2404.16731v2","updated":"2025-01-08T01:41:29Z","published":"2024-04-25T16:41:57Z","title":"Non-asymptotic Global Convergence Analysis of BFGS with the Armijo-Wolfe\n Line Search","summary":" In this paper, we present the first explicit and non-asymptotic global\nconvergence rates of the BFGS method when implemented with an inexact line\nsearch scheme satisfying the Armijo-Wolfe conditions. We show that BFGS\nachieves a global linear convergence rate of $(1 - \\frac{1}{\\kappa})^t$ for\n$\\mu$-strongly convex functions with $L$-Lipschitz gradients, where $\\kappa =\n\\frac{L}{\\mu}$ represents the condition number. Additionally, if the objective\nfunction's Hessian is Lipschitz, BFGS with the Armijo-Wolfe line search\nachieves a linear convergence rate that depends solely on the line search\nparameters, independent of the condition number. 
We also establish a global\nsuperlinear convergence rate of $\\mathcal{O}((\\frac{1}{t})^t)$. These global\nbounds are all valid for any starting point $x_0$ and any symmetric positive\ndefinite initial Hessian approximation matrix $B_0$, though the choice of $B_0$\nimpacts the number of iterations needed to achieve these rates. By synthesizing\nthese results, we outline the first global complexity characterization of BFGS\nwith the Armijo-Wolfe line search. Additionally, we clearly define a mechanism\nfor selecting the step size to satisfy the Armijo-Wolfe conditions and\ncharacterize its overall complexity.\n","authors":["Qiujiang Jin","Ruichen Jiang","Aryan Mokhtari"],"pdf_url":"https://arxiv.org/pdf/2404.16731v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.01615v2","updated":"2025-01-08T20:48:41Z","published":"2025-01-03T03:19:02Z","title":"Equity Impacts of Public Transit Network Redesign with Shared Autonomous\n Mobility Services","summary":" This study examines the equity impacts of integrating shared autonomous\nmobility services (SAMS) into transit system redesign. Using the Greater\nChicago area as a case study, we compare two optimization objectives in\nmultimodal transit network redesign: minimizing total generalized costs\n(equity-agnostic) versus prioritizing service in low-income areas\n(equity-focused). We evaluate the achieved accessibility of clustered zones\nwith redesigned transit networks under two objectives, compared to driving and\nthe existing transit network. The transit access gaps across zones and between\ntransit and driving are found to be generally reduced with the introduction of\nSAMS, but less so with the subsequent improved infrastructure under budget.\nDifferential improvement in equity is seen across suburbs and areas of the\ncity, reflecting the disparity in current transit access and improvement\npotential. 
In particular, SAMS bridges the transit access gaps in suburban and\ncity areas currently underserved by transit. The City of Chicago, which is also\ndisproportionately home to vulnerable populations, offers an avenue to improve\nvertical equity. These findings demonstrate that SAMS can enhance both\nhorizontal and vertical equity in transit systems, particularly when equity is\nexplicitly incorporated into the design objective.\n","authors":["Max T. M. Ng","Meredith Raymer","Hani S. Mahmassani","Omer Verbas","Taner Cokyasar"],"pdf_url":"https://arxiv.org/pdf/2501.01615v2.pdf","comment":"Restructuring the paper for more precise research direction"},{"id":"http://arxiv.org/abs/2501.04833v1","updated":"2025-01-08T20:38:15Z","published":"2025-01-08T20:38:15Z","title":"Multi-step Inertial Accelerated Doubly Stochastic Gradient Methods for\n Block Term Tensor Decomposition","summary":" In this paper, we explore a specific optimization problem that combines a\ndifferentiable nonconvex function with a nondifferentiable function for\nmulti-block variables, which is particularly relevant to tackle the multilinear\nrank-($L_r$,$L_r$,1) block-term tensor decomposition model with a\nregularization term. While existing algorithms often suffer from high\nper-iteration complexity and slow convergence, this paper employs a unified\nmulti-step inertial accelerated doubly stochastic gradient descent method\ntailored for structured rank-$\\left(L_r, L_r, 1\\right)$ tensor decomposition,\nreferred to as Midas-LL1. We also introduce an extended multi-step\nvariance-reduced stochastic estimator framework. Our analysis under this new\nframework demonstrates the subsequential and sequential convergence of the\nproposed algorithm under certain conditions and illustrates the sublinear\nconvergence rate of the subsequence, showing that the Midas-LL1 algorithm\nrequires at most $\\mathcal{O}(\\varepsilon^{-2})$ iterations in expectation to\nreach an $\\varepsilon$-stationary point. 
The proposed algorithm is evaluated on\nseveral datasets, and the results indicate that Midas-LL1 outperforms existing\nstate-of-the-art algorithms in terms of both computational speed and solution\nquality.\n","authors":["Zehui Liu","Qingsong Wang","Chunfeng Cui"],"pdf_url":"https://arxiv.org/pdf/2501.04833v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04823v1","updated":"2025-01-08T20:22:16Z","published":"2025-01-08T20:22:16Z","title":"Learning Robot Safety from Sparse Human Feedback using Conformal\n Prediction","summary":" Ensuring robot safety can be challenging; user-defined constraints can miss\nedge cases, policies can become unsafe even when trained from safe data, and\nsafety can be subjective. Thus, we learn about robot safety by showing policy\ntrajectories to a human who flags unsafe behavior. From this binary feedback,\nwe use the statistical method of conformal prediction to identify a region of\nstates, potentially in learned latent space, guaranteed to contain a\nuser-specified fraction of future policy errors. Our method is\nsample-efficient, as it builds on nearest neighbor classification and avoids\nwithholding data as is common with conformal prediction. By alerting if the\nrobot reaches the suspected unsafe region, we obtain a warning system that\nmimics the human's safety preferences with guaranteed miss rate. From video\nlabeling, our system can detect when a quadcopter visuomotor policy will fail\nto steer through a designated gate. We present an approach for policy\nimprovement by avoiding the suspected unsafe region. With it we improve a model\npredictive controller's safety, as shown in experimental testing with 30\nquadcopter flights across 6 navigation tasks. Code and videos are provided.\n","authors":["Aaron O. Feldman","Joseph A. 
Vincent","Maximilian Adang","Jun En Low","Mac Schwager"],"pdf_url":"https://arxiv.org/pdf/2501.04823v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04805v1","updated":"2025-01-08T19:46:41Z","published":"2025-01-08T19:46:41Z","title":"Extended formulations for the multilinear polytope of acyclic\n hypergraphs","summary":" This article provides an overview of our joint work on binary polynomial\noptimization over the past decade. We define the multilinear polytope as the\nconvex hull of the feasible region of a linearized binary polynomial\noptimization problem. By representing the multilinear polytope with\nhypergraphs, we investigate the connections between hypergraph acyclicity and\nthe complexity of the facial structure of the multilinear polytope. We\ncharacterize the acyclic hypergraphs for which a polynomial-size extended\nformulation for the multilinear polytope can be constructed in polynomial time.\n","authors":["Alberto Del Pia","Aida Khajavirad"],"pdf_url":"https://arxiv.org/pdf/2501.04805v1.pdf","comment":"arXiv admin note: text overlap with arXiv:2212.11239"},{"id":"http://arxiv.org/abs/2501.04781v1","updated":"2025-01-08T19:00:21Z","published":"2025-01-08T19:00:21Z","title":"Inexact Catching-Up Algorithm for Moreau's Sweeping Processes","summary":" In this paper, we develop an inexact version of the catching-up algorithm for\nsweeping processes. We define a new notion of approximate projection, which is\ncompatible with any numerical method for approximating exact projections, as\nthis new notion is not restricted to remain strictly within the set. We provide\nseveral properties of the new approximate projections, which enable us to prove\nthe convergence of the inexact catching-up algorithm in three general\nframeworks: prox-regular moving sets, subsmooth moving sets, and merely closed\nsets. Additionally, we apply our numerical results to address complementarity\ndynamical systems, particularly electrical circuits with ideal diodes. 
In this\ncontext, we implement the inexact catching-up algorithm using a primal-dual\noptimization method, which does not necessarily guarantee a feasible\npoint. Our results are illustrated through an electrical circuit with ideal\ndiodes. Our results recover classical existence results in the literature and\nprovide new insights into the numerical simulation of sweeping processes.\n","authors":["Juan Guillermo Garrido","Maximiliano Lioi","Emilio Vilches"],"pdf_url":"https://arxiv.org/pdf/2501.04781v1.pdf","comment":"arXiv admin note: substantial text overlap with arXiv:2308.08093"},{"id":"http://arxiv.org/abs/2501.04759v1","updated":"2025-01-08T17:28:30Z","published":"2025-01-08T17:28:30Z","title":"Optimize the parameters of the PID Controller using Genetic Algorithm\n for Robot Manipulators","summary":" This paper presents the design of a Proportional-Integral-Derivative (PID)\ncontroller with optimized parameters for a two-degree-of-freedom robotic arm. A\ngenetic algorithm (GA) is proposed to optimize the controller parameters,\naddressing the challenges in determining PID controller parameters for highly\nnonlinear systems like robotic arms compared to traditional methods. The\nGA-optimized PID controller significantly improves control accuracy and\nperformance over traditional control methods. Simulation results demonstrate\nthat the robotic arm system operates with high precision and stability.\nAdditionally, the shortened trajectory tracking response time enhances the\nfeasibility of applying this control algorithm in real-world scenarios. 
This\nresearch not only confirms the suitability of PID-GA for robotic arms and\nsimilar systems but also opens new avenues for applying this algorithm to real\nphysical systems.\n","authors":["Vu Ngoc Son","Pham Van Cuong","Nguyen Duy Minh","Phi Hoang Nha"],"pdf_url":"https://arxiv.org/pdf/2501.04759v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06251v1","updated":"2025-01-08T23:53:39Z","published":"2025-01-08T23:53:39Z","title":"Under the hood of a carbon footprint calculator","summary":" We explain the mathematical theory of the Input-Output method for carbon\nfootprint computations.\n","authors":["Indira Chatterji","Ariadna Fossas Tenas","Elise Raphael"],"pdf_url":"https://arxiv.org/pdf/2501.06251v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06247v1","updated":"2025-01-08T18:06:30Z","published":"2025-01-08T18:06:30Z","title":"A Survey on Algorithmic Developments in Optimal Transport Problem with\n Applications","summary":" Optimal Transport (OT) has established itself as a robust framework for\nquantifying differences between distributions, with applications that span\nfields such as machine learning, data science, and computer vision. This paper\noffers a detailed examination of the OT problem, beginning with its theoretical\nfoundations, including the classical formulations of Monge and Kantorovich and\ntheir extensions to modern computational techniques. It explores cutting-edge\nalgorithms, including Sinkhorn iterations, primal-dual strategies, and\nreduction-based approaches, emphasizing their efficiency and scalability in\naddressing high-dimensional problems.
The paper also highlights emerging\ntrends, such as integrating OT into machine learning frameworks, the\ndevelopment of novel problem variants, and ongoing theoretical advancements.\nApplications of OT are presented across a range of domains, with particular\nattention to its innovative application in time series data analysis via\nOptimal Transport Warping (OTW), a robust alternative to methods like Dynamic\nTime Warping. Despite the significant progress made, challenges related to\nscalability, robustness, and ethical considerations remain, necessitating\nfurther research. The paper underscores OT's potential to bridge theoretical\ndepth and practical utility, fostering impactful advancements across diverse\ndisciplines.\n","authors":["Sina Moradi"],"pdf_url":"https://arxiv.org/pdf/2501.06247v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06240v1","updated":"2025-01-08T13:26:56Z","published":"2025-01-08T13:26:56Z","title":"The Convergence of Dynamic Routing between Capsules","summary":" Capsule networks (CapsNet) are recently proposed neural network models with\nnew processing layers, specifically for entity representation and discovery in\nimages. It is well known that CapsNets have some advantages over traditional\nneural networks, especially in generalization capability. At the same time,\nsome studies report negative experimental results. The causes of this\ncontradiction have not been thoroughly analyzed. The preliminary experimental\nresults show that the behavior of routing algorithms does not always produce\ngood results as expected, and in most cases, different routing algorithms do\nnot change the classification results, but simply polarize the link strength,\nespecially when they continue to repeat without stopping. To realize the true\npotential of the CapsNet, deep mathematical analysis of the routing algorithms\nis crucial.
In this paper, we give the objective function that is\nminimized by the dynamic routing algorithm, which is a concave function. The\ndynamic routing algorithm can be regarded as a nonlinear gradient method for\nsolving an optimization problem under linear constraints, and its convergence\ncan be strictly proved mathematically. Furthermore, a mathematically rigorous\nproof of convergence is given for this class of iterative routing\nprocedures. We analyze the relation between the objective function and the\nconstraints solved by the dynamic routing algorithm in detail, and perform the\ncorresponding routing experiment to analyze the effect of our convergence\nproof.\n","authors":["Daoyuan Ye","Juntao Li","Yiting Shen"],"pdf_url":"https://arxiv.org/pdf/2501.06240v1.pdf","comment":null}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2409.08272v2","updated":"2025-01-08T18:59:48Z","published":"2024-09-12T17:59:04Z","title":"Click2Mask: Local Editing with Dynamic Mask Generation","summary":" Recent advancements in generative models have revolutionized image generation\nand editing, making these tasks accessible to non-experts. This paper focuses\non local image editing, particularly the task of adding new content to a\nloosely specified area. Existing methods often require a precise mask or a\ndetailed description of the location, which can be cumbersome and prone to\nerrors. We propose Click2Mask, a novel approach that simplifies the local\nediting process by requiring only a single point of reference (in addition to\nthe content description). A mask is dynamically grown around this point during\na Blended Latent Diffusion (BLD) process, guided by a masked CLIP-based\nsemantic loss. Click2Mask surpasses the limitations of segmentation-based and\nfine-tuning dependent methods, offering a more user-friendly and contextually\naccurate solution.
Our experiments demonstrate that Click2Mask not only\nminimizes user effort but also enables competitive or superior local image\nmanipulations compared to SoTA methods, according to both human judgement and\nautomatic metrics. Key contributions include the simplification of user input,\nthe ability to freely add objects unconstrained by existing segments, and the\nintegration potential of our dynamic mask approach within other editing\nmethods.\n","authors":["Omer Regev","Omri Avrahami","Dani Lischinski"],"pdf_url":"https://arxiv.org/pdf/2409.08272v2.pdf","comment":"Accepted to AAAI 2025. Project page is available at\n https://omeregev.github.io/click2mask/"},{"id":"http://arxiv.org/abs/2501.04700v1","updated":"2025-01-08T18:59:36Z","published":"2025-01-08T18:59:36Z","title":"Planarian Neural Networks: Evolutionary Patterns from Basic Bilateria\n Shaping Modern Artificial Neural Network Architectures","summary":" This study examined the viability of enhancing the prediction accuracy of\nartificial neural networks (ANNs) in image classification tasks by developing\nANNs with evolution patterns similar to those of biological neural networks.\nResNet is a widely used family of neural networks with both deep and wide\nvariants; therefore, it was selected as the base model for our investigation.\nThe aim of this study is to improve the image classification performance of\nANNs via a novel approach inspired by the biological nervous system\narchitecture of planarians, which comprises a brain and two nerve cords. We\nbelieve that the unique neural architecture of planarians offers valuable\ninsights into the performance enhancement of ANNs. The proposed planarian\nneural architecture-based neural network was evaluated on the CIFAR-10 and\nCIFAR-100 datasets. Our results indicate that the proposed method exhibits\nhigher prediction accuracy than the baseline neural network models in image\nclassification tasks. 
These findings demonstrate the significant potential of\nbiologically inspired neural network architectures in improving the performance\nof ANNs in a wide range of applications.\n","authors":["Ziyuan Huang","Mark Newman","Maria Vaida","Srikar Bellur","Roozbeh Sadeghian","Andrew Siu","Hui Wang","Kevin Huggins"],"pdf_url":"https://arxiv.org/pdf/2501.04700v1.pdf","comment":"11 pages, 9 figures"},{"id":"http://arxiv.org/abs/2501.04699v1","updated":"2025-01-08T18:59:35Z","published":"2025-01-08T18:59:35Z","title":"EditAR: Unified Conditional Generation with Autoregressive Models","summary":" Recent progress in controllable image generation and editing is largely\ndriven by diffusion-based methods. Although diffusion models perform\nexceptionally well in specific tasks with tailored designs, establishing a\nunified model is still challenging. In contrast, autoregressive models\ninherently feature a unified tokenized representation, which simplifies the\ncreation of a single foundational model for various tasks. In this work, we\npropose EditAR, a single unified autoregressive framework for a variety of\nconditional image generation tasks, e.g., image editing, depth-to-image,\nedge-to-image, segmentation-to-image. The model takes both images and\ninstructions as inputs, and predicts the edited image tokens in a vanilla\nnext-token paradigm. To enhance the text-to-image alignment, we further propose\nto distill the knowledge from foundation models into the autoregressive\nmodeling process. We evaluate its effectiveness across diverse tasks on\nestablished benchmarks, showing competitive performance to various\nstate-of-the-art task-specific methods.
Project page:\nhttps://jitengmu.github.io/EditAR/\n","authors":["Jiteng Mu","Nuno Vasconcelos","Xiaolong Wang"],"pdf_url":"https://arxiv.org/pdf/2501.04699v1.pdf","comment":"Project page: https://jitengmu.github.io/EditAR/"},{"id":"http://arxiv.org/abs/2501.04698v1","updated":"2025-01-08T18:59:01Z","published":"2025-01-08T18:59:01Z","title":"ConceptMaster: Multi-Concept Video Customization on Diffusion\n Transformer Models Without Test-Time Tuning","summary":" Text-to-video generation has made remarkable advancements through diffusion\nmodels. However, Multi-Concept Video Customization (MCVC) remains a significant\nchallenge. We identify two key challenges in this task: 1) the identity\ndecoupling problem, where directly adopting existing customization methods\ninevitably mix attributes when handling multiple concepts simultaneously, and\n2) the scarcity of high-quality video-entity pairs, which is crucial for\ntraining such a model that represents and decouples various concepts well. To\naddress these challenges, we introduce ConceptMaster, an innovative framework\nthat effectively tackles the critical issues of identity decoupling while\nmaintaining concept fidelity in customized videos. Specifically, we introduce a\nnovel strategy of learning decoupled multi-concept embeddings that are injected\ninto the diffusion models in a standalone manner, which effectively guarantees\nthe quality of customized videos with multiple identities, even for highly\nsimilar visual concepts. To further overcome the scarcity of high-quality MCVC\ndata, we carefully establish a data construction pipeline, which enables\nsystematic collection of precise multi-concept video-entity data across diverse\nconcepts. A comprehensive benchmark is designed to validate the effectiveness\nof our model from three critical dimensions: concept fidelity, identity\ndecoupling ability, and video generation quality across six different concept\ncomposition scenarios. 
Extensive experiments demonstrate that our ConceptMaster\nsignificantly outperforms previous approaches for this task, paving the way for\ngenerating personalized and semantically accurate videos across multiple\nconcepts.\n","authors":["Yuzhou Huang","Ziyang Yuan","Quande Liu","Qiulin Wang","Xintao Wang","Ruimao Zhang","Pengfei Wan","Di Zhang","Kun Gai"],"pdf_url":"https://arxiv.org/pdf/2501.04698v1.pdf","comment":"Project Page: https://yuzhou914.github.io/ConceptMaster/"},{"id":"http://arxiv.org/abs/2501.04697v1","updated":"2025-01-08T18:58:48Z","published":"2025-01-08T18:58:48Z","title":"Grokking at the Edge of Numerical Stability","summary":" Grokking, the sudden generalization that occurs after prolonged overfitting,\nis a surprising phenomenon challenging our understanding of deep learning.\nAlthough significant progress has been made in understanding grokking, the\nreasons behind the delayed generalization and its dependence on regularization\nremain unclear. In this work, we argue that without regularization, grokking\ntasks push models to the edge of numerical stability, introducing floating\npoint errors in the Softmax function, which we refer to as Softmax Collapse\n(SC). We demonstrate that SC prevents grokking and that mitigating SC enables\ngrokking without regularization. Investigating the root cause of SC, we find\nthat beyond the point of overfitting, the gradients strongly align with what we\ncall the na\\\"ive loss minimization (NLM) direction. This component of the\ngradient does not alter the model's predictions but decreases the loss by\nscaling the logits, typically by scaling the weights along their current\ndirection. We show that this scaling of the logits explains the delay in\ngeneralization characteristic of grokking and eventually leads to SC, halting\nfurther learning. 
To validate our hypotheses, we introduce two key\ncontributions that address the challenges in grokking tasks: StableMax, a new\nactivation function that prevents SC and enables grokking without\nregularization, and $\\perp$Grad, a training algorithm that promotes quick\ngeneralization in grokking tasks by preventing NLM altogether. These\ncontributions provide new insights into grokking, elucidating its delayed\ngeneralization, reliance on regularization, and the effectiveness of existing\ngrokking-inducing methods. Code for this paper is available at\nhttps://github.com/LucasPrietoAl/grokking-at-the-edge-of-numerical-stability.\n","authors":["Lucas Prieto","Melih Barsbey","Pedro A. M. Mediano","Tolga Birdal"],"pdf_url":"https://arxiv.org/pdf/2501.04697v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04696v1","updated":"2025-01-08T18:58:24Z","published":"2025-01-08T18:58:24Z","title":"Test-Time Optimization for Domain Adaptive Open Vocabulary Segmentation","summary":" We present Seg-TTO, a novel framework for zero-shot, open-vocabulary semantic\nsegmentation (OVSS), designed to excel in specialized domain tasks. While\ncurrent open vocabulary approaches show impressive performance on standard\nsegmentation benchmarks under zero-shot settings, they fall short of supervised\ncounterparts on highly domain-specific datasets. We focus on\nsegmentation-specific test-time optimization to address this gap. Segmentation\nrequires an understanding of multiple concepts within a single image while\nretaining the locality and spatial structure of representations. We propose a\nnovel self-supervised objective adhering to these requirements and use it to\nalign the model parameters with input images at test time. 
In the textual\nmodality, we learn multiple embeddings for each category to capture diverse\nconcepts within an image, while in the visual modality, we calculate\npixel-level losses followed by embedding aggregation operations specific to\npreserving spatial structure. Our resulting framework, termed Seg-TTO, is a\nplug-and-play module. We integrate Seg-TTO with three state-of-the-art OVSS\napproaches and evaluate across 22 challenging OVSS tasks covering a range of\nspecialized domains. Our Seg-TTO demonstrates clear performance improvements\nacross these tasks, establishing a new state-of-the-art. Code:\nhttps://github.com/UlinduP/SegTTO.\n","authors":["Ulindu De Silva","Didula Samaraweera","Sasini Wanigathunga","Kavindu Kariyawasam","Kanchana Ranasinghe","Muzammal Naseer","Ranga Rodrigo"],"pdf_url":"https://arxiv.org/pdf/2501.04696v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04695v1","updated":"2025-01-08T18:58:22Z","published":"2025-01-08T18:58:22Z","title":"Re-ranking the Context for Multimodal Retrieval Augmented Generation","summary":" Retrieval-augmented generation (RAG) enhances large language models (LLMs) by\nincorporating external knowledge to generate a response within a context with\nimproved accuracy and reduced hallucinations. However, multi-modal RAG systems\nface unique challenges: (i) the retrieval process may select entries irrelevant\nto the user query (e.g., images, documents), and (ii) vision-language models or\nmulti-modal language models like GPT-4o may hallucinate when processing these\nentries to generate RAG output. In this paper, we aim to address the first\nchallenge, i.e., improving the selection of relevant context from the\nknowledge-base in the retrieval phase of the multi-modal RAG. Specifically, we\nleverage the relevancy score (RS) measure designed in our previous work for\nevaluating the RAG performance to select more relevant entries in the retrieval\nprocess.
Retrieval based on embeddings (e.g., CLIP-based embeddings) and\ncosine similarity usually performs poorly, particularly for multi-modal data. We\nshow that by using a more advanced relevancy measure, one can enhance the\nretrieval process by selecting more relevant pieces from the knowledge-base and\neliminate the irrelevant pieces from the context by adaptively selecting\nup to $k$ entries instead of a fixed number of entries. Our evaluation using the COCO\ndataset demonstrates significant enhancement in selecting relevant context and\nin the accuracy of the generated response.\n","authors":["Matin Mortaheb","Mohammad A. Amir Khojastepour","Srimat T. Chakradhar","Sennur Ulukus"],"pdf_url":"https://arxiv.org/pdf/2501.04695v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04689v1","updated":"2025-01-08T18:52:03Z","published":"2025-01-08T18:52:03Z","title":"SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single\n Images","summary":" We study the problem of single-image 3D object reconstruction. Recent works\nhave diverged into two directions: regression-based modeling and generative\nmodeling. Regression methods efficiently infer visible surfaces, but struggle\nwith occluded regions. Generative methods handle uncertain regions better by\nmodeling distributions, but are computationally expensive and the generation is\noften misaligned with visible surfaces. In this paper, we present SPAR3D, a\nnovel two-stage approach aiming to take the best of both directions. The first\nstage of SPAR3D generates sparse 3D point clouds using a lightweight point\ndiffusion model, which has a fast sampling speed. The second stage uses both\nthe sampled point cloud and the input image to create highly detailed meshes.\nOur two-stage design enables probabilistic modeling of the ill-posed\nsingle-image 3D task while maintaining high computational efficiency and great\noutput fidelity. Using point clouds as an intermediate representation further\nallows for interactive user edits.
Evaluated on diverse datasets, SPAR3D\ndemonstrates superior performance over previous state-of-the-art methods, at an\ninference speed of 0.7 seconds. Project page with code and model:\nhttps://spar3d.github.io\n","authors":["Zixuan Huang","Mark Boss","Aaryaman Vasishta","James M. Rehg","Varun Jampani"],"pdf_url":"https://arxiv.org/pdf/2501.04689v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04678v1","updated":"2025-01-08T18:39:10Z","published":"2025-01-08T18:39:10Z","title":"RadGPT: Constructing 3D Image-Text Tumor Datasets","summary":" With over 85 million CT scans performed annually in the United States,\ncreating tumor-related reports is a challenging and time-consuming task for\nradiologists. To address this need, we present RadGPT, an Anatomy-Aware\nVision-Language AI Agent for generating detailed reports from CT scans. RadGPT\nfirst segments tumors, including benign cysts and malignant tumors, and their\nsurrounding anatomical structures, then transforms this information into both\nstructured reports and narrative reports. These reports provide tumor size,\nshape, location, attenuation, volume, and interactions with surrounding blood\nvessels and organs. Extensive evaluation on unseen hospitals shows that RadGPT\ncan produce accurate reports, with high sensitivity/specificity for small tumor\n(<2 cm) detection: 80/73% for liver tumors, 92/78% for kidney tumors, and\n77/77% for pancreatic tumors. For large tumors, sensitivity ranges from 89% to\n97%. The results significantly surpass the state-of-the-art in abdominal CT\nreport generation.\n RadGPT generated reports for 17 public datasets. Through radiologist review\nand refinement, we have ensured the reports' accuracy, and created the first\npublicly available image-text 3D medical dataset, comprising over 1.8 million\ntext tokens and 2.7 million images from 9,262 CT scans, including 2,947 tumor\nscans/reports of 8,562 tumor instances. 
Our reports can: (1) localize tumors in\neight liver sub-segments and three pancreatic sub-segments annotated per-voxel;\n(2) determine pancreatic tumor stage (T1-T4) in 260 reports; and (3) present\nindividual analyses of multiple tumors--rare in human-made reports.\nImportantly, 948 of the reports are for early-stage tumors.\n","authors":["Pedro R. A. S. Bassi","Mehmet Can Yavuz","Kang Wang","Xiaoxi Chen","Wenxuan Li","Sergio Decherchi","Andrea Cavalli","Yang Yang","Alan Yuille","Zongwei Zhou"],"pdf_url":"https://arxiv.org/pdf/2501.04678v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04675v1","updated":"2025-01-08T18:33:17Z","published":"2025-01-08T18:33:17Z","title":"Enhancing Financial VQA in Vision Language Models using Intermediate\n Structured Representations","summary":" Chart interpretation is crucial for visual data analysis, but accurately\nextracting information from charts poses significant challenges for automated\nmodels. This study investigates the fine-tuning of DEPLOT, a modality\nconversion module that translates the image of a plot or chart to a linearized\ntable, on a custom dataset of 50,000 bar charts. The dataset comprises simple,\nstacked, and grouped bar charts, targeting the unique structural features of\nthese visualizations. The finetuned DEPLOT model is evaluated against its base\nversion using a test set of 1,000 images and two metrics: Relative Mapping\nSimilarity (RMS), which measures categorical mapping accuracy, and Relative\nNumber Set Similarity (RNSS), which evaluates numerical interpretation\naccuracy. To further explore the reasoning capabilities of large language\nmodels (LLMs), we curate an additional set of 100 bar chart images paired with\nquestion answer sets. 
Our findings demonstrate that providing a structured\nintermediate table alongside the image significantly enhances LLM reasoning\nperformance compared to direct image queries.\n","authors":["Archita Srivastava","Abhas Kumar","Rajesh Kumar","Prabhakar Srinivasan"],"pdf_url":"https://arxiv.org/pdf/2501.04675v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.02788v2","updated":"2025-01-08T18:33:07Z","published":"2025-01-06T06:07:40Z","title":"GLoG-CSUnet: Enhancing Vision Transformers with Adaptable Radiomic\n Features for Medical Image Segmentation","summary":" Vision Transformers (ViTs) have shown promise in medical image semantic\nsegmentation (MISS) by capturing long-range correlations. However, ViTs often\nstruggle to model local spatial information effectively, which is essential for\naccurately segmenting fine anatomical details, particularly when applied to\nsmall datasets without extensive pre-training. We introduce Gabor and Laplacian\nof Gaussian Convolutional Swin Network (GLoG-CSUnet), a novel architecture\nenhancing Transformer-based models by incorporating learnable radiomic\nfeatures. This approach integrates dynamically adaptive Gabor and Laplacian of\nGaussian (LoG) filters to capture texture, edge, and boundary information,\nenhancing the feature representation processed by the Transformer model. Our\nmethod uniquely combines the long-range dependency modeling of Transformers\nwith the texture analysis capabilities of Gabor and LoG features. 
Evaluated on\nthe Synapse multi-organ and ACDC cardiac segmentation datasets, GLoG-CSUnet\ndemonstrates significant improvements over state-of-the-art models, achieving a\n1.14% increase in Dice score for Synapse and 0.99% for ACDC, with minimal\ncomputational overhead (only 15 and 30 additional parameters, respectively).\nGLoG-CSUnet's flexible design allows integration with various base models,\noffering a promising approach for incorporating radiomics-inspired feature\nextraction in Transformer architectures for medical image analysis. The code\nimplementation is available on GitHub at: https://github.com/HAAIL/GLoG-CSUnet.\n","authors":["Niloufar Eghbali","Hassan Bagher-Ebadian","Tuka Alhanai","Mohammad M. Ghassemi"],"pdf_url":"https://arxiv.org/pdf/2501.02788v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04671v1","updated":"2025-01-08T18:31:16Z","published":"2025-01-08T18:31:16Z","title":"DRIVINGVQA: Analyzing Visual Chain-of-Thought Reasoning of Vision\n Language Models in Real-World Scenarios with Driving Theory Tests","summary":" Large vision-language models (LVLMs) augment language models with visual\nunderstanding, enabling multimodal reasoning. However, due to the modality gap\nbetween textual and visual data, they often face significant challenges, such\nas over-reliance on text priors, hallucinations, and limited capacity for\ncomplex visual reasoning. Existing benchmarks to evaluate visual reasoning in\nLVLMs often rely on schematic or synthetic images and on imprecise\nmachine-generated explanations. To bridge the modality gap, we present\nDrivingVQA, a new benchmark derived from driving theory tests to evaluate\nvisual chain-of-thought reasoning in complex real-world scenarios. It offers\n3,931 expert-crafted multiple-choice problems and interleaved explanations\ngrounded with entities relevant to the reasoning process. We leverage this\ndataset to perform an extensive study of LVLMs' ability to reason about complex\nvisual scenarios. 
Our experiments reveal that open-source and proprietary LVLMs\nstruggle with visual chain-of-thought reasoning under zero-shot settings. We\ninvestigate training strategies that leverage relevant entities to improve\nvisual reasoning. Notably, we observe a performance boost of up to 7\\% when\nreasoning over image tokens of cropped regions tied to these entities.\n","authors":["Charles Corbière","Simon Roburin","Syrielle Montariol","Antoine Bosselut","Alexandre Alahi"],"pdf_url":"https://arxiv.org/pdf/2501.04671v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04670v1","updated":"2025-01-08T18:30:53Z","published":"2025-01-08T18:30:53Z","title":"Are They the Same? Exploring Visual Correspondence Shortcomings of\n Multimodal LLMs","summary":" Recent advancements in multimodal models have shown a strong ability in\nvisual perception, reasoning abilities, and vision-language understanding.\nHowever, studies on visual matching ability are missing, where finding the\nvisual correspondence of objects is essential in vision research. Our research\nreveals that the matching capabilities in recent multimodal LLMs (MLLMs) still\nexhibit systematic shortcomings, even with current strong MLLMs models, GPT-4o.\nIn particular, we construct a Multimodal Visual Matching (MMVM) benchmark to\nfairly benchmark over 30 different MLLMs. The MMVM benchmark is built from 15\nopen-source datasets and Internet videos with manual annotation. We categorize\nthe data samples of MMVM benchmark into eight aspects based on the required\ncues and capabilities to more comprehensively evaluate and analyze current\nMLLMs. In addition, we have designed an automatic annotation pipeline to\ngenerate the MMVM SFT dataset, including 220K visual matching data with\nreasoning annotation. Finally, we present CoLVA, a novel contrastive MLLM with\ntwo novel technical designs: fine-grained vision expert with object-level\ncontrastive learning and instruction augmentation strategy. 
CoLVA achieves\n51.06\\% overall accuracy (OA) on the MMVM benchmark, surpassing GPT-4o and\nbaseline by 8.41\\% and 23.58\\% OA, respectively. The results show the\neffectiveness of our MMVM SFT dataset and our novel technical designs. Code,\nbenchmark, dataset, and models are available at\nhttps://github.com/zhouyiks/CoLVA.\n","authors":["Yikang Zhou","Tao Zhang","Shilin Xu","Shihao Chen","Qianyu Zhou","Yunhai Tong","Shunping Ji","Jiangning Zhang","Xiangtai Li","Lu Qi"],"pdf_url":"https://arxiv.org/pdf/2501.04670v1.pdf","comment":"project page: https://zhouyiks.github.io/projects/CoLVA/"},{"id":"http://arxiv.org/abs/2501.04666v1","updated":"2025-01-08T18:25:50Z","published":"2025-01-08T18:25:50Z","title":"Enhancing Virtual Try-On with Synthetic Pairs and Error-Aware Noise\n Scheduling","summary":" Given an isolated garment image in a canonical product view and a separate\nimage of a person, the virtual try-on task aims to generate a new image of the\nperson wearing the target garment. Prior virtual try-on works face two major\nchallenges in achieving this goal: a) the paired (human, garment) training data\nhas limited availability; b) generating textures on the human that perfectly\nmatch that of the prompted garment is difficult, often resulting in distorted\ntext and faded textures. Our work explores ways to tackle these issues through\nboth synthetic data as well as model refinement. We introduce a garment\nextraction model that generates (human, synthetic garment) pairs from a single\nimage of a clothed individual. The synthetic pairs can then be used to augment\nthe training of virtual try-on. We also propose an Error-Aware Refinement-based\nSchr\\\"odinger Bridge (EARSB) that surgically targets localized generation\nerrors for correcting the output of a base virtual try-on model. 
To identify\nlikely errors, we propose a weakly-supervised error classifier that localizes\nregions for refinement, subsequently augmenting the Schr\\\"odinger Bridge's\nnoise schedule with its confidence heatmap. Experiments on VITON-HD and\nDressCode-Upper demonstrate that our synthetic data augmentation enhances the\nperformance of prior work, while EARSB improves the overall image quality. In\nuser studies, our model is preferred by the users in an average of 59% of\ncases.\n","authors":["Nannan Li","Kevin J. Shih","Bryan A. Plummer"],"pdf_url":"https://arxiv.org/pdf/2501.04666v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04665v1","updated":"2025-01-08T18:22:44Z","published":"2025-01-08T18:22:44Z","title":"HyFusion: Enhanced Reception Field Transformer for Hyperspectral Image\n Fusion","summary":" Hyperspectral image (HSI) fusion addresses the challenge of reconstructing\nHigh-Resolution HSIs (HR-HSIs) from High-Resolution Multispectral images\n(HR-MSIs) and Low-Resolution HSIs (LR-HSIs), a critical task given the high\ncosts and hardware limitations associated with acquiring high-quality HSIs.\nWhile existing methods leverage spatial and spectral relationships, they often\nsuffer from limited receptive fields and insufficient feature utilization,\nleading to suboptimal performance. Furthermore, the scarcity of high-quality\nHSI data highlights the importance of efficient data utilization to maximize\nreconstruction quality. To address these issues, we propose HyFusion, a novel\nframework designed to enhance the receptive field and enable effective feature\nmap reusing, thereby maximizing data utilization. First, HR-MSI and LR-HSI\ninputs are concatenated to form a quasi-fused draft, preserving complementary\nspatial and spectral details. 
Next, the Enhanced Reception Field Block (ERFB)\nis introduced, combining shifting-window attention and dense connections to\nexpand the receptive field, effectively capturing long-range dependencies and\nreusing features to reduce information loss, thereby boosting data efficiency.\nFinally, the Dual-Coupled Network (DCN) dynamically extracts high-frequency\nspectral and spatial features from LR-HSI and HR-MSI, ensuring efficient\ncross-domain fusion. Extensive experiments demonstrate that HyFusion achieves\nstate-of-the-art performance in HR-MSI/LR-HSI fusion, significantly improving\nreconstruction quality while maintaining a compact model size and computational\nefficiency. By integrating enhanced receptive fields and feature map reusing,\nHyFusion provides a practical and effective solution for HSI fusion in\nresource-constrained scenarios, setting a new benchmark in hyperspectral\nimaging. Our code will be publicly available.\n","authors":["Chia-Ming Lee","Yu-Fan Lin","Yu-Hao Ho","Li-Wei Kang","Chih-Chung Hsu"],"pdf_url":"https://arxiv.org/pdf/2501.04665v1.pdf","comment":"Submitted to IGARSS 2025"},{"id":"http://arxiv.org/abs/2501.04648v1","updated":"2025-01-08T18:01:49Z","published":"2025-01-08T18:01:49Z","title":"FlairGPT: Repurposing LLMs for Interior Designs","summary":" Interior design involves the careful selection and arrangement of objects to\ncreate an aesthetically pleasing, functional, and harmonized space that aligns\nwith the client's design brief. This task is particularly challenging, as a\nsuccessful design must not only incorporate all the necessary objects in a\ncohesive style, but also ensure they are arranged in a way that maximizes\naccessibility, while adhering to a variety of affordability and usage\nconsiderations. Data-driven solutions have been proposed, but these are\ntypically room- or domain-specific and lack explainability in the design\nconsiderations used in producing the final layout.
In this paper, we\ninvestigate if large language models (LLMs) can be directly utilized for\ninterior design. While we find that LLMs are not yet capable of generating\ncomplete layouts, they can be effectively leveraged in a structured manner,\ninspired by the workflow of interior designers. By systematically probing LLMs,\nwe can reliably generate a list of objects along with relevant constraints that\nguide their placement. We translate this information into a design layout\ngraph, which is then solved using an off-the-shelf constrained optimization\nsetup to generate the final layouts. We benchmark our algorithm in various\ndesign configurations against existing LLM-based methods and human designs, and\nevaluate the results using a variety of quantitative and qualitative metrics\nalong with user studies. In summary, we demonstrate that LLMs, when used in a\nstructured manner, can effectively generate diverse high-quality layouts,\nmaking them a viable solution for creating large-scale virtual scenes. Project\nwebpage at https://flairgpt.github.io/\n","authors":["Gabrielle Littlefair","Niladri Shekhar Dutt","Niloy J. Mitra"],"pdf_url":"https://arxiv.org/pdf/2501.04648v1.pdf","comment":"Accepted at EUROGRAPHICS 2025"},{"id":"http://arxiv.org/abs/2501.04643v1","updated":"2025-01-08T17:49:52Z","published":"2025-01-08T17:49:52Z","title":"Discrete Wavelet Transform-Based Capsule Network for Hyperspectral Image\n Classification","summary":" Hyperspectral image (HSI) classification is a crucial technique for remote\nsensing to build large-scale earth monitoring systems. HSI contains much more\ninformation than traditional visual images for identifying the categories of\nland covers. One recent feasible solution for HSI is to leverage CapsNets for\ncapturing spectral-spatial information. However, these methods require high\ncomputational requirements due to the full connection architecture between\nstacked capsule layers. 
To solve this problem, a DWT-CapsNet is proposed to\nidentify partial but important connections in CapsNet for an effective and\nefficient HSI classification. Specifically, we integrate a tailored attention\nmechanism into a Discrete Wavelet Transform (DWT)-based downsampling layer,\nalleviating the information loss problem of conventional downsampling operations\nin feature extractors. Moreover, we propose a novel multi-scale routing\nalgorithm that prunes a large proportion of connections in CapsNet. A capsule\npyramid fusion mechanism is designed to aggregate the spectral-spatial\nrelationships at multiple levels of granularity, and then a self-attention\nmechanism is further conducted in a partially and locally connected\narchitecture to emphasize the meaningful relationships. As shown in the\nexperimental results, our method achieves state-of-the-art accuracy while\nkeeping lower computational demand regarding running time, FLOPs, and the\nnumber of parameters, rendering it an appealing choice for practical\nimplementation in HSI classification.\n","authors":["Zhiqiang Gao","Jiaqi Wang","Hangchi Shen","Zhihao Dou","Xiangbo Zhang","Kaizhu Huang"],"pdf_url":"https://arxiv.org/pdf/2501.04643v1.pdf","comment":"28 pages; 9 figures"},{"id":"http://arxiv.org/abs/2501.03800v2","updated":"2025-01-08T17:44:11Z","published":"2025-01-07T14:06:57Z","title":"MADation: Face Morphing Attack Detection with Foundation Models","summary":" Despite the considerable performance improvements of face recognition\nalgorithms in recent years, the same scientific advances responsible for this\nprogress can also be used to create efficient ways to attack them, posing a\nthreat to their secure deployment. 
Morphing attack detection (MAD) systems aim\nto detect a specific type of threat, morphing attacks, at an early stage,\npreventing them from being considered for verification in critical processes.\nFoundation models (FM) learn from extensive amounts of unlabeled data,\nachieving remarkable zero-shot generalization to unseen domains. Although this\ngeneralization capacity might be weak when dealing with domain-specific\ndownstream tasks such as MAD, FMs can easily adapt to these settings while\nretaining the built-in knowledge acquired during pre-training. In this work, we\nrecognize the potential of FMs to perform well in the MAD task when properly\nadapted to its specificities. To this end, we adapt FM CLIP architectures with\nLoRA weights while simultaneously training a classification header. The\nproposed framework, MADation, surpasses our alternative FM and transformer-based\nframeworks and constitutes the first adaptation of FMs to the MAD task. MADation\npresents competitive results with current MAD solutions in the literature and\neven surpasses them in several evaluation scenarios. To encourage\nreproducibility and facilitate further research in MAD, we publicly release the\nimplementation of MADation at https://github.com/gurayozgur/MADation\n","authors":["Eduarda Caldeira","Guray Ozgur","Tahar Chettaoui","Marija Ivanovska","Peter Peer","Fadi Boutros","Vitomir Struc","Naser Damer"],"pdf_url":"https://arxiv.org/pdf/2501.03800v2.pdf","comment":"Accepted at WACV 2025 workshops"},{"id":"http://arxiv.org/abs/2501.04631v1","updated":"2025-01-08T17:27:27Z","published":"2025-01-08T17:27:27Z","title":"Disentangled Clothed Avatar Generation with Layered Representation","summary":" Clothed avatar generation has wide applications in virtual and augmented\nreality, filmmaking, and more. 
Previous methods have achieved success in\ngenerating diverse digital avatars; however, generating avatars with\ndisentangled components (e.g., body, hair, and clothes) has long been a\nchallenge. In this paper, we propose LayerAvatar, the first feed-forward\ndiffusion-based method for generating component-disentangled clothed avatars.\nTo achieve this, we first propose a layered UV feature plane representation,\nwhere components are distributed in different layers of the Gaussian-based UV\nfeature plane with corresponding semantic labels. This representation supports\nhigh-resolution and real-time rendering, as well as expressive animation\nincluding controllable gestures and facial expressions. Based on the\nwell-designed representation, we train a single-stage diffusion model and\nintroduce constraint terms to address the severe occlusion problem of the\ninnermost human body layer. Extensive experiments demonstrate the impressive\nperformance of our method in generating disentangled clothed avatars, and we\nfurther explore its applications in component transfer. The project page is\navailable at: https://olivia23333.github.io/LayerAvatar/\n","authors":["Weitian Zhang","Sijing Wu","Manwen Liao","Yichao Yan"],"pdf_url":"https://arxiv.org/pdf/2501.04631v1.pdf","comment":"project page: https://olivia23333.github.io/LayerAvatar/"},{"id":"http://arxiv.org/abs/2501.04628v1","updated":"2025-01-08T17:19:35Z","published":"2025-01-08T17:19:35Z","title":"FatesGS: Fast and Accurate Sparse-View Surface Reconstruction using\n Gaussian Splatting with Depth-Feature Consistency","summary":" Recently, Gaussian Splatting has sparked a new trend in the field of computer\nvision. Apart from novel view synthesis, it has also been extended to the area\nof multi-view reconstruction. The latest methods facilitate complete, detailed\nsurface reconstruction while ensuring fast training speed. 
However, these\nmethods still require dense input views, and their output quality significantly\ndegrades with sparse views. We observed that the Gaussian primitives tend to\noverfit the few training views, leading to noisy floaters and incomplete\nreconstruction surfaces. In this paper, we present an innovative sparse-view\nreconstruction framework that leverages intra-view depth and multi-view feature\nconsistency to achieve remarkably accurate surface reconstruction.\nSpecifically, we utilize monocular depth ranking information to supervise the\nconsistency of depth distribution within patches and employ a smoothness loss\nto enhance the continuity of the distribution. To achieve finer surface\nreconstruction, we optimize the absolute position of depth through multi-view\nprojection features. Extensive experiments on DTU and BlendedMVS demonstrate\nthat our method outperforms state-of-the-art methods with a speedup of 60x to\n200x, achieving swift and fine-grained mesh reconstruction without the need for\ncostly pre-training.\n","authors":["Han Huang","Yulun Wu","Chao Deng","Ge Gao","Ming Gu","Yu-Shen Liu"],"pdf_url":"https://arxiv.org/pdf/2501.04628v1.pdf","comment":"Accepted by AAAI 2025. Project page:\n https://alvin528.github.io/FatesGS/"},{"id":"http://arxiv.org/abs/2412.16780v2","updated":"2025-01-08T17:00:18Z","published":"2024-12-21T21:27:22Z","title":"Forget Vectors at Play: Universal Input Perturbations Driving Machine\n Unlearning in Image Classification","summary":" Machine unlearning (MU), which seeks to erase the influence of specific\nunwanted data from already-trained models, is becoming increasingly vital in\nmodel editing, particularly to comply with evolving data regulations like the\n``right to be forgotten''. Conventional approaches are predominantly\nmodel-based, typically requiring retraining or fine-tuning the model's weights\nto meet unlearning requirements. 
In this work, we approach the MU problem from\na novel input perturbation-based perspective, where the model weights remain\nintact throughout the unlearning process. We demonstrate the existence of a\nproactive input-based unlearning strategy, referred to as the forget vector, which can\nbe generated as an input-agnostic data perturbation and remains as effective as\nmodel-based approximate unlearning approaches. We also explore forget vector\narithmetic, whereby multiple class-specific forget vectors are combined through\nsimple operations (e.g., linear combinations) to generate new forget vectors\nfor unseen unlearning tasks, such as forgetting arbitrary subsets across\nclasses. Extensive experiments validate the effectiveness and adaptability of\nthe forget vector, showcasing its competitive performance relative to\nstate-of-the-art model-based methods. Codes are available at\nhttps://github.com/Changchangsun/Forget-Vector.\n","authors":["Changchang Sun","Ren Wang","Yihua Zhang","Jinghan Jia","Jiancheng Liu","Gaowen Liu","Sijia Liu","Yan Yan"],"pdf_url":"https://arxiv.org/pdf/2412.16780v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04608v1","updated":"2025-01-08T16:44:06Z","published":"2025-01-08T16:44:06Z","title":"Comprehensive Examination of Unrolled Networks for Linear Inverse\n Problems","summary":" Unrolled networks have become prevalent in various computer vision and\nimaging tasks. Although they have demonstrated remarkable efficacy in solving\nspecific computer vision and computational imaging tasks, their adaptation to\nother applications presents considerable challenges. This is primarily due to\nthe multitude of design decisions that practitioners working on new\napplications must navigate, each potentially affecting the network's overall\nperformance. These decisions include selecting the optimization algorithm,\ndefining the loss function, and determining the number of convolutional layers,\namong others. 
Compounding the issue, evaluating each design choice requires\ntime-consuming simulations to train, fine-tune the neural network, and optimize\nfor its performance. As a result, the process of exploring multiple options and\nidentifying the optimal configuration becomes time-consuming and\ncomputationally demanding. The main objectives of this paper are (1) to unify\nsome ideas and methodologies used in unrolled networks to reduce the number of\ndesign choices a user has to make, and (2) to report a comprehensive ablation\nstudy to discuss the impact of each of the choices involved in designing\nunrolled networks and present practical recommendations based on our findings.\nWe anticipate that this study will help scientists and engineers design\nunrolled networks for their applications and diagnose problems within their\nnetworks efficiently.\n","authors":["Eric Chen","Xi Chen","Arian Maleki","Shirin Jalali"],"pdf_url":"https://arxiv.org/pdf/2501.04608v1.pdf","comment":"27 pages, 10 figures. Project Page:\n https://github.com/YuxiChen25/Memory-Net-Inverse"},{"id":"http://arxiv.org/abs/2501.04606v1","updated":"2025-01-08T16:41:31Z","published":"2025-01-08T16:41:31Z","title":"Enhancing Low-Cost Video Editing with Lightweight Adaptors and\n Temporal-Aware Inversion","summary":" Recent advancements in text-to-image (T2I) generation using diffusion models\nhave enabled cost-effective video-editing applications by leveraging\npre-trained models, eliminating the need for resource-intensive training.\nHowever, the frame-independence of T2I generation often results in poor\ntemporal consistency. Existing methods address this issue through temporal\nlayer fine-tuning or inference-based temporal propagation, but these approaches\nsuffer from high training costs or limited temporal coherence. To address these\nchallenges, we propose a General and Efficient Adapter (GE-Adapter) that\nintegrates temporal-spatial and semantic consistency with Bilateral DDIM\ninversion. 
This framework introduces three key components: (1) Frame-based\nTemporal Consistency Blocks (FTC Blocks) to capture frame-specific features and\nenforce smooth inter-frame transitions via temporally-aware loss functions; (2)\nChannel-dependent Spatial Consistency Blocks (SCD Blocks) employing bilateral\nfilters to enhance spatial coherence by reducing noise and artifacts; and (3)\na Token-based Semantic Consistency Module (TSC Module) to maintain semantic\nalignment using shared prompt tokens and frame-specific tokens. Our method\nsignificantly improves perceptual quality, text-image alignment, and temporal\ncoherence, as demonstrated on the MSR-VTT dataset. Additionally, it achieves\nenhanced fidelity and frame-to-frame coherence, offering a practical solution\nfor T2V editing.\n","authors":["Yangfan He","Sida Li","Kun Li","Jianhui Wang","Binxu Li","Tianyu Shi","Jun Yin","Miao Zhang","Xueqian Wang"],"pdf_url":"https://arxiv.org/pdf/2501.04606v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04597v1","updated":"2025-01-08T16:25:32Z","published":"2025-01-08T16:25:32Z","title":"FrontierNet: Learning Visual Cues to Explore","summary":" Exploration of unknown environments is crucial for autonomous robots; it\nallows them to actively reason and decide on what new data to acquire for tasks\nsuch as mapping, object discovery, and environmental assessment. Existing\nmethods, such as frontier-based methods, rely heavily on 3D map operations,\nwhich are limited by map quality and often overlook valuable context from\nvisual cues. This work aims at leveraging 2D visual cues for efficient\nautonomous exploration, addressing the limitations of extracting goal poses\nfrom a 3D map. We propose an image-only frontier-based exploration system, with\nFrontierNet as a core component developed in this work. FrontierNet is a\nlearning-based model that (i) detects frontiers, and (ii) predicts their\ninformation gain, from posed RGB images enhanced by monocular depth priors. 
Our\napproach provides an alternative to existing 3D-dependent exploration systems,\nachieving a 16% improvement in early-stage exploration efficiency, as validated\nthrough extensive simulations and real-world experiments.\n","authors":["Boyang Sun","Hanzhi Chen","Stefan Leutenegger","Cesar Cadena","Marc Pollefeys","Hermann Blum"],"pdf_url":"https://arxiv.org/pdf/2501.04597v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.03214v2","updated":"2025-01-08T16:10:20Z","published":"2024-04-04T05:39:09Z","title":"LeGrad: An Explainability Method for Vision Transformers via Feature\n Formation Sensitivity","summary":" Vision Transformers (ViTs), with their ability to model long-range\ndependencies through self-attention mechanisms, have become a standard\narchitecture in computer vision. However, the interpretability of these models\nremains a challenge. To address this, we propose LeGrad, an explainability\nmethod specifically designed for ViTs. LeGrad computes the gradient with\nrespect to the attention maps of ViT layers, considering the gradient itself as\nthe explainability signal. We aggregate the signal over all layers, combining\nthe activations of the last as well as intermediate tokens to produce the\nmerged explainability map. This makes LeGrad a conceptually simple and an\neasy-to-implement tool for enhancing the transparency of ViTs. We evaluate\nLeGrad in challenging segmentation, perturbation, and open-vocabulary settings,\nshowcasing its versatility compared to other SotA explainability methods\ndemonstrating its superior spatial fidelity and robustness to perturbations. 
A\ndemo and the code are available at https://github.com/WalBouss/LeGrad.\n","authors":["Walid Bousselham","Angie Boggust","Sofian Chaybouti","Hendrik Strobelt","Hilde Kuehne"],"pdf_url":"https://arxiv.org/pdf/2404.03214v2.pdf","comment":"Code available at https://github.com/WalBouss/LeGrad"},{"id":"http://arxiv.org/abs/2501.04586v1","updated":"2025-01-08T16:06:21Z","published":"2025-01-08T16:06:21Z","title":"Identity-Preserving Video Dubbing Using Motion Warping","summary":" Video dubbing aims to synthesize realistic, lip-synced videos from a\nreference video and a driving audio signal. Although existing methods can\naccurately generate mouth shapes driven by audio, they often fail to preserve\nidentity-specific features, largely because they do not effectively capture the\nnuanced interplay between audio cues and the visual attributes of the reference\nidentity. As a result, the generated outputs frequently lack fidelity in\nreproducing the unique textural and structural details of the reference\nidentity. To address these limitations, we propose IPTalker, a novel and robust\nframework for video dubbing that achieves seamless alignment between driving\naudio and reference identity while ensuring both lip-sync accuracy and\nhigh-fidelity identity preservation. At the core of IPTalker is a\ntransformer-based alignment mechanism designed to dynamically capture and model\nthe correspondence between audio features and reference images, thereby\nenabling precise, identity-aware audio-visual integration. Building on this\nalignment, a motion warping strategy further refines the results by spatially\ndeforming reference images to match the target audio-driven configuration. A\ndedicated refinement process then mitigates occlusion artifacts and enhances\nthe preservation of fine-grained textures, such as mouth details and skin\nfeatures. 
Extensive qualitative and quantitative evaluations demonstrate that\nIPTalker consistently outperforms existing approaches in terms of realism, lip\nsynchronization, and identity retention, establishing a new state of the art\nfor high-quality, identity-consistent video dubbing.\n","authors":["Runzhen Liu","Qinjie Lin","Yunfei Liu","Lijian Lin","Ye Zhu","Yu Li","Chuhua Xian","Fa-Ting Hong"],"pdf_url":"https://arxiv.org/pdf/2501.04586v1.pdf","comment":"Under Review"},{"id":"http://arxiv.org/abs/2501.04582v1","updated":"2025-01-08T15:56:21Z","published":"2025-01-08T15:56:21Z","title":"Boosting Salient Object Detection with Knowledge Distillated from Large\n Foundation Models","summary":" Salient Object Detection (SOD) aims to identify and segment prominent regions\nwithin a scene. Traditional models rely on manually annotated pseudo labels\nwith precise pixel-level accuracy, which is time-consuming. We developed a\nlow-cost, high-precision annotation method by leveraging large foundation\nmodels to address the challenges. Specifically, we use a weakly supervised\napproach to guide large models in generating pseudo-labels through textual\nprompts. Since large models do not effectively focus on the salient regions of\nimages, we manually annotate a subset of text to fine-tune the model. Based on\nthis approach, which enables precise and rapid generation of pseudo-labels, we\nintroduce a new dataset, BDS-TR. Compared to the previous DUTS-TR dataset,\nBDS-TR is more prominent in scale and encompasses a wider variety of categories\nand scenes. This expansion will enhance our model's applicability across a\nbroader range of scenarios and provide a more comprehensive foundational\ndataset for future SOD research. Additionally, we present an edge decoder based\non dynamic upsampling, which focuses on object edges while gradually recovering\nimage feature resolution. 
Comprehensive experiments on five benchmark datasets\ndemonstrate that our method significantly outperforms state-of-the-art\napproaches and also surpasses several existing fully-supervised SOD methods.\nThe code and results will be made available.\n","authors":["Miaoyang He","Shuyong Gao","Tsui Qin Mok","Weifeng Ge","Wengqiang Zhang"],"pdf_url":"https://arxiv.org/pdf/2501.04582v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.12408v2","updated":"2025-01-08T15:54:31Z","published":"2024-07-17T08:39:20Z","title":"Towards Revisiting Visual Place Recognition for Joining Submaps in\n Multimap SLAM","summary":" Visual SLAM is a key technology for many autonomous systems. However,\ntracking loss can lead to the creation of disjoint submaps in multimap SLAM\nsystems like ORB-SLAM3. Because of that, these systems employ submap merging\nstrategies. As we show, these strategies are not always successful. In this\npaper, we investigate the impact of using modern VPR approaches for submap\nmerging in visual SLAM. We argue that classical evaluation metrics are not\nsufficient to estimate the impact of a modern VPR component on the overall\nsystem. We show that naively replacing the VPR component does not leverage its\nfull potential without requiring substantial interference in the original\nsystem. Because of that, we present a post-processing pipeline along with a set\nof metrics that allow us to estimate the impact of modern VPR components. We\nevaluate our approach on the NCLT and Newer College datasets using ORB-SLAM3\nwith NetVLAD and HDC-DELF as VPR components. Additionally, we present a simple\napproach for combining VPR with temporal consistency for map merging. We show\nthat the map merging performance of ORB-SLAM3 can be improved. 
Building on\nthese results, researchers in VPR can assess the potential of their approaches\nfor SLAM systems.\n","authors":["Markus Weißflog","Stefan Schubert","Peter Protzel","Peer Neubert"],"pdf_url":"https://arxiv.org/pdf/2407.12408v2.pdf","comment":"Accepted at TAROS 2024. This is the submitted version"},{"id":"http://arxiv.org/abs/2501.04579v1","updated":"2025-01-08T15:48:30Z","published":"2025-01-08T15:48:30Z","title":"Unified Coding for Both Human Perception and Generalized Machine\n Analytics with CLIP Supervision","summary":" The image compression model has long struggled with adaptability and\ngeneralization, as the decoded bitstream typically serves only human or machine\nneeds and fails to preserve information for unseen visual tasks. Therefore,\nthis paper innovatively introduces supervision obtained from multimodal\npre-training models and incorporates adaptive multi-objective optimization\ntailored to support both human visual perception and machine vision\nsimultaneously with a single bitstream, denoted as Unified and Generalized\nImage Coding for Machine (UG-ICM). Specifically, to get rid of the reliance\nbetween compression models with downstream task supervision, we introduce\nContrastive Language-Image Pre-training (CLIP) models into the training\nconstraint for improved generalization. Global-to-instance-wise CLIP\nsupervision is applied to help obtain hierarchical semantics that make models\nmore generalizable for the tasks relying on the information of different\ngranularity. Furthermore, for supporting both human and machine visions with\nonly a unifying bitstream, we incorporate a conditional decoding strategy that\ntakes as conditions human or machine preferences, enabling the bitstream to be\ndecoded into different versions for corresponding preferences. As such, our\nproposed UG-ICM is fully trained in a self-supervised manner, i.e., without\nawareness of any specific downstream models and tasks. 
The extensive\nexperiments have shown that the proposed UG-ICM is capable of achieving\nremarkable improvements in various unseen machine analytics tasks, while\nsimultaneously providing perceptually satisfying images.\n","authors":["Kangsheng Yin","Quan Liu","Xuelin Shen","Yulin He","Wenhan Yang","Shiqi Wang"],"pdf_url":"https://arxiv.org/pdf/2501.04579v1.pdf","comment":"9 pages, 10 figures, published to AAAI 2025"},{"id":"http://arxiv.org/abs/2406.15811v2","updated":"2025-01-08T15:32:35Z","published":"2024-06-22T10:33:14Z","title":"PointDreamer: Zero-shot 3D Textured Mesh Reconstruction from Colored\n Point Cloud","summary":" Reconstructing textured meshes from colored point clouds is an important but\nchallenging task. Most existing methods yield blurry-looking textures or rely\non 3D training data that are hard to acquire. Regarding this, we propose\nPointDreamer, a novel framework for textured mesh reconstruction from colored\npoint cloud via diffusion-based 2D inpainting. Specifically, we first\nreconstruct an untextured mesh. Next, we project the input point cloud into 2D\nspace to generate sparse multi-view images, and then inpaint empty pixels\nutilizing a pre-trained 2D diffusion model. After that, we unproject the colors\nof the inpainted dense images onto the untextured mesh, thus obtaining the\nfinal textured mesh. This project-inpaint-unproject pipeline bridges the gap\nbetween 3D point clouds and 2D diffusion models for the first time. Thanks to\nthe powerful 2D diffusion model pre-trained on extensive 2D data, PointDreamer\nreconstructs clear, high-quality textures with high robustness to sparse or\nnoisy input. Also, it is zero-shot, requiring no extra training. In addition, we\ndesign a Non-Border-First unprojection strategy to address the border-area\ninconsistency issue, which is less explored but commonly occurring in methods\nthat generate 3D textures from multiview images. 
Extensive qualitative and\nquantitative experiments on various synthetic and real-scanned datasets show\nthe SoTA performance of PointDreamer, by significantly outperforming baseline\nmethods with 30% improvement in LPIPS score (from 0.118 to 0.068). Code at:\nhttps://github.com/YuQiao0303/PointDreamer.\n","authors":["Qiao Yu","Xianzhi Li","Yuan Tang","Xu Han","Jinfeng Xu","Long Hu","Min Chen"],"pdf_url":"https://arxiv.org/pdf/2406.15811v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04568v1","updated":"2025-01-08T15:32:12Z","published":"2025-01-08T15:32:12Z","title":"Supervision-free Vision-Language Alignment","summary":" Vision-language models (VLMs) have demonstrated remarkable potential in\nintegrating visual and linguistic information, but their performance is often\nconstrained by the need for extensive, high-quality image-text training data.\nCuration of these image-text pairs is both time-consuming and computationally\nexpensive. To address this challenge, we introduce SVP (Supervision-free Visual\nProjection), a novel framework that enhances vision-language alignment without\nrelying on curated data or preference annotation. SVP leverages self-captioning\nand a pre-trained grounding model as a feedback mechanism to elicit latent\ninformation in VLMs. We evaluate our approach across six key areas: captioning,\nreferring, visual question answering, multitasking, hallucination control, and\nobject recall. Results demonstrate significant improvements, including a 14%\naverage improvement in captioning tasks, up to 12% increase in object recall,\nand substantial reduction in hallucination rates. 
Notably, a small VLM using\nSVP achieves hallucination reductions comparable to a model five times larger,\nwhile a VLM with initially poor referring capabilities more than doubles its\nperformance, approaching parity with a model twice its size.\n","authors":["Giorgio Giannone","Ruoteng Li","Qianli Feng","Evgeny Perevodchikov","Rui Chen","Aleix Martinez"],"pdf_url":"https://arxiv.org/pdf/2501.04568v1.pdf","comment":"Preprint"},{"id":"http://arxiv.org/abs/2501.04565v1","updated":"2025-01-08T15:25:19Z","published":"2025-01-08T15:25:19Z","title":"Learnable Scaled Gradient Descent for Guaranteed Robust Tensor PCA","summary":" Robust tensor principal component analysis (RTPCA) aims to separate the\nlow-rank and sparse components from multi-dimensional data, making it an\nessential technique in the signal processing and computer vision fields.\nRecently emerging tensor singular value decomposition (t-SVD) has gained\nconsiderable attention for its ability to better capture the low-rank structure\nof tensors compared to traditional matrix SVD. However, existing methods often\nrely on the computationally expensive tensor nuclear norm (TNN), which limits\ntheir scalability for real-world tensors. To address this issue, we explore an\nefficient scaled gradient descent (SGD) approach within the t-SVD framework for\nthe first time, and propose the RTPCA-SGD method. Theoretically, we rigorously\nestablish the recovery guarantees of RTPCA-SGD under mild assumptions,\ndemonstrating that with appropriate parameter selection, it achieves linear\nconvergence to the true low-rank tensor at a constant rate, independent of the\ncondition number. To enhance its practical applicability, we further propose a\nlearnable self-supervised deep unfolding model, which enables effective\nparameter learning. 
Numerical experiments on both synthetic and real-world\ndatasets demonstrate the superior performance of the proposed methods while\nmaintaining competitive computational efficiency, especially consuming less\ntime than RTPCA-TNN.\n","authors":["Lanlan Feng","Ce Zhu","Yipeng Liu","Saiprasad Ravishankar","Longxiu Huang"],"pdf_url":"https://arxiv.org/pdf/2501.04565v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04561v1","updated":"2025-01-08T15:18:09Z","published":"2025-01-08T15:18:09Z","title":"OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment\n across Language with Real-time Self-Aware Emotional Speech Synthesis","summary":" Recent advancements in omnimodal learning have been achieved in understanding\nand generation across images, text, and speech, though mainly within\nproprietary models. Limited omnimodal datasets and the inherent challenges\nassociated with real-time emotional speech generation have hindered open-source\nprogress. To address these issues, we propose openomni, a two-stage training\nmethod combining omnimodal alignment and speech generation to develop a\nstate-of-the-art omnimodal large language model. In the alignment phase, a\npre-trained speech model is further trained on text-image tasks to generalize\nfrom vision to speech in a (near) zero-shot manner, outperforming models\ntrained on tri-modal datasets. In the speech generation phase, a lightweight\ndecoder facilitates real-time emotional speech through training on speech tasks\nand preference learning. 
Experiments demonstrate that openomni consistently\nimproves across omnimodal, vision-language, and speech-language evaluations,\nenabling natural, emotion-rich dialogues and real-time emotional speech\ngeneration.\n","authors":["Run Luo","Ting-En Lin","Haonan Zhang","Yuchuan Wu","Xiong Liu","Min Yang","Yongbin Li","Longze Chen","Jiaming Li","Lei Zhang","Yangyi Chen","Hamid Alinejad-Rokny","Fei Huang"],"pdf_url":"https://arxiv.org/pdf/2501.04561v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10150v4","updated":"2025-01-08T15:15:12Z","published":"2024-01-18T17:22:37Z","title":"Motion-Zero: Zero-Shot Moving Object Control Framework for\n Diffusion-Based Video Generation","summary":" Recent large-scale pre-trained diffusion models have demonstrated a powerful\ngenerative ability to produce high-quality videos from detailed text\ndescriptions. However, exerting control over the motion of objects in videos\ngenerated by any video diffusion model is a challenging problem. In this paper,\nwe propose a novel zero-shot moving object trajectory control framework,\nMotion-Zero, to enable a bounding-box-trajectories-controlled text-to-video\ndiffusion model. To this end, an initial noise prior module is designed to\nprovide a position-based prior to improve the stability of the appearance of\nthe moving object and the accuracy of position. In addition, based on the\nattention map of the U-net, spatial constraints are directly applied to the\ndenoising process of diffusion models, which further ensures the positional and\nspatial consistency of moving objects during the inference. Furthermore,\ntemporal consistency is guaranteed with a proposed shift temporal attention\nmechanism. Our method can be flexibly applied to various state-of-the-art video\ndiffusion models without any training process. Extensive experiments\ndemonstrate our proposed method can control the motion trajectories of objects\nand generate high-quality videos. 
Our project page is\nhttps://vpx-ecnu.github.io/MotionZero-website/\n","authors":["Changgu Chen","Junwei Shu","Gaoqi He","Changbo Wang","Yang Li"],"pdf_url":"https://arxiv.org/pdf/2401.10150v4.pdf","comment":"Preprint"},{"id":"http://arxiv.org/abs/2405.02334v2","updated":"2025-01-08T14:42:05Z","published":"2024-04-26T15:02:39Z","title":"Rad4XCNN: a new agnostic method for post-hoc global explanation of\n CNN-derived features by means of radiomics","summary":" In recent years, machine learning-based clinical decision support systems\n(CDSS) have played a key role in the analysis of several medical conditions.\nDespite their promising capabilities, the lack of transparency in AI models\nposes significant challenges, particularly in medical contexts where\nreliability is a mandatory aspect. However, it appears that explainability is\ninversely proportional to accuracy. For this reason, achieving transparency\nwithout compromising predictive accuracy remains a key challenge. This paper\npresents a novel method, namely Rad4XCNN, to enhance the predictive power of\nCNN-derived features with the inherent interpretability of radiomic features.\nRad4XCNN diverges from conventional methods based on saliency maps, by\nassociating intelligible meaning to CNN-derived features by means of Radiomics,\noffering new perspectives on explanation methods beyond visualization maps.\nUsing a breast cancer classification task as a case study, we evaluated\nRad4XCNN on ultrasound imaging datasets, including an online dataset and two\nin-house datasets for internal and external validation. Some key results are:\ni) CNN-derived features guarantee more robust accuracy when compared against\nViT-derived and radiomic features; ii) conventional visualization map methods\nfor explanation present several pitfalls; iii) Rad4XCNN does not sacrifice\nmodel accuracy for their explainability; iv) Rad4XCNN provides a global\nexplanation enabling the physician to extract global insights and findings. 
Our\nmethod can mitigate some concerns related to the explainability-accuracy\ntrade-off. This study highlighted the importance of proposing new methods for\nmodel explanation without affecting their accuracy.\n","authors":["Francesco Prinzi","Carmelo Militello","Calogero Zarcaro","Tommaso Vincenzo Bartolotta","Salvatore Gaglio","Salvatore Vitabile"],"pdf_url":"https://arxiv.org/pdf/2405.02334v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.00599v2","updated":"2025-01-08T14:38:30Z","published":"2024-12-31T18:56:46Z","title":"VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with\n Video LLM","summary":" Video Large Language Models (Video LLMs) have recently exhibited remarkable\ncapabilities in general video understanding. However, they mainly focus on\nholistic comprehension and struggle with capturing fine-grained spatial and\ntemporal details. Besides, the lack of high-quality object-level video\ninstruction data and a comprehensive benchmark further hinders their\nadvancements. To tackle these challenges, we introduce the VideoRefer Suite to\nempower Video LLM for finer-level spatial-temporal video understanding, i.e.,\nenabling perception and reasoning on any objects throughout the video.\nSpecially, we thoroughly develop VideoRefer Suite across three essential\naspects: dataset, model, and benchmark. Firstly, we introduce a multi-agent\ndata engine to meticulously curate a large-scale, high-quality object-level\nvideo instruction dataset, termed VideoRefer-700K. Next, we present the\nVideoRefer model, which equips a versatile spatial-temporal object encoder to\ncapture precise regional and sequential representations. Finally, we\nmeticulously create a VideoRefer-Bench to comprehensively assess the\nspatial-temporal understanding capability of a Video LLM, evaluating it across\nvarious aspects. 
Extensive experiments and analyses demonstrate that our\nVideoRefer model not only achieves promising performance on video referring\nbenchmarks but also facilitates general video understanding capabilities.\n","authors":["Yuqian Yuan","Hang Zhang","Wentong Li","Zesen Cheng","Boqiang Zhang","Long Li","Xin Li","Deli Zhao","Wenqiao Zhang","Yueting Zhuang","Jianke Zhu","Lidong Bing"],"pdf_url":"https://arxiv.org/pdf/2501.00599v2.pdf","comment":"17 pages, 14 figures, technical report"},{"id":"http://arxiv.org/abs/2501.04534v1","updated":"2025-01-08T14:33:47Z","published":"2025-01-08T14:33:47Z","title":"Combining YOLO and Visual Rhythm for Vehicle Counting","summary":" Video-based vehicle detection and counting play a critical role in managing\ntransport infrastructure. Traditional image-based counting methods usually\ninvolve two main steps: initial detection and subsequent tracking, which are\napplied to all video frames, leading to a significant increase in computational\ncomplexity. To address this issue, this work presents an alternative and more\nefficient method for vehicle detection and counting. The proposed approach\neliminates the need for a tracking step and focuses solely on detecting\nvehicles in key video frames, thereby increasing its efficiency. To achieve\nthis, we developed a system that combines YOLO, for vehicle detection, with\nVisual Rhythm, a way to create time-spatial images that allows us to focus on\nframes that contain useful information. Additionally, this method can be used\nfor counting in any application involving unidirectional moving targets to be\ndetected and identified. Experimental analysis using real videos shows that the\nproposed method achieves mean counting accuracy around 99.15% over a set of\nvideos, with a processing speed three times faster than tracking based\napproaches.\n","authors":["Victor Nascimento Ribeiro","Nina S. T. 
Hirata"],"pdf_url":"https://arxiv.org/pdf/2501.04534v1.pdf","comment":"Accepted for presentation at the Conference on Graphics, Patterns and\n Images (SIBGRAPI) 2023"},{"id":"http://arxiv.org/abs/2501.01483v2","updated":"2025-01-08T14:29:10Z","published":"2025-01-02T18:42:07Z","title":"Embedding Similarity Guided License Plate Super Resolution","summary":" Super-resolution (SR) techniques play a pivotal role in enhancing the quality\nof low-resolution images, particularly for applications such as security and\nsurveillance, where accurate license plate recognition is crucial. This study\nproposes a novel framework that combines pixel-based loss with embedding\nsimilarity learning to address the unique challenges of license plate\nsuper-resolution (LPSR). The introduced pixel and embedding consistency loss\n(PECL) integrates a Siamese network and applies contrastive loss to force\nembedding similarities to improve perceptual and structural fidelity. By\neffectively balancing pixel-wise accuracy with embedding-level consistency, the\nframework achieves superior alignment of fine-grained features between\nhigh-resolution (HR) and super-resolved (SR) license plates. 
Extensive\nexperiments on the CCPD dataset validate the efficacy of the proposed\nframework, demonstrating consistent improvements over state-of-the-art methods\nin terms of PSNR_RGB, PSNR_Y and optical character recognition (OCR) accuracy.\nThese results highlight the potential of embedding similarity learning to\nadvance both perceptual quality and task-specific performance in extreme\nsuper-resolution scenarios.\n","authors":["Abderrezzaq Sendjasni","Mohamed-Chaker Larabi"],"pdf_url":"https://arxiv.org/pdf/2501.01483v2.pdf","comment":"Submitted to Neurocomputing"},{"id":"http://arxiv.org/abs/2402.13809v3","updated":"2025-01-08T14:21:46Z","published":"2024-02-21T13:46:25Z","title":"NeuralDiffuser: Neuroscience-inspired Diffusion Guidance for fMRI Visual\n Reconstruction","summary":" Reconstructing visual stimuli from functional Magnetic Resonance Imaging fMRI\nenables fine-grained retrieval of brain activity. However, the accurate\nreconstruction of diverse details, including structure, background, texture,\ncolor, and more, remains challenging. The stable diffusion models inevitably\nresult in the variability of reconstructed images, even under identical\nconditions. To address this challenge, we first uncover the neuroscientific\nperspective of diffusion methods, which primarily involve top-down creation\nusing pre-trained knowledge from extensive image datasets, but tend to lack\ndetail-driven bottom-up perception, leading to a loss of faithful details. In\nthis paper, we propose NeuralDiffuser, which incorporates primary visual\nfeature guidance to provide detailed cues in the form of gradients. This\nextension of the bottom-up process for diffusion models achieves both semantic\ncoherence and detail fidelity when reconstructing visual stimuli. Furthermore,\nwe have developed a novel guidance strategy for reconstruction tasks that\nensures the consistency of repeated outputs with original images rather than\nwith various outputs. 
Extensive experimental results on the Natural Scenes\nDataset (NSD) qualitatively and quantitatively demonstrate the advancement of\nNeuralDiffuser by comparing it against baseline and state-of-the-art methods\nhorizontally, as well as conducting longitudinal ablation studies.\n","authors":["Haoyu Li","Hao Wu","Badong Chen"],"pdf_url":"https://arxiv.org/pdf/2402.13809v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04527v1","updated":"2025-01-08T14:19:03Z","published":"2025-01-08T14:19:03Z","title":"Towards Fair Class-wise Robustness: Class Optimal Distribution\n Adversarial Training","summary":" Adversarial training has proven to be a highly effective method for improving\nthe robustness of deep neural networks against adversarial attacks.\nNonetheless, it has been observed to exhibit a limitation in terms of robust\nfairness, characterized by a significant disparity in robustness across\ndifferent classes. Recent efforts to mitigate this problem have turned to\nclass-wise reweighted methods. However, these methods suffer from a lack of\nrigorous theoretical analysis and are limited in their exploration of the\nweight space, as they mainly rely on existing heuristic algorithms or intuition\nto compute weights. In addition, these methods fail to guarantee the\nconsistency of the optimization direction due to the decoupled optimization of\nweights and the model parameters. They potentially lead to suboptimal weight\nassignments and consequently, a suboptimal model.
To address these problems,\nthis paper proposes a novel min-max training framework, Class Optimal\nDistribution Adversarial Training (CODAT), which employs distributionally\nrobust optimization to fully explore the class-wise weight space, thus enabling\nthe identification of the optimal weight with theoretical guarantees.\nFurthermore, we derive a closed-form optimal solution to the internal\nmaximization and then get a deterministic equivalent objective function, which\nprovides a theoretical basis for the joint optimization of weights and model\nparameters. Meanwhile, we propose a fairness elasticity coefficient for the\nevaluation of the algorithm with regard to both robustness and robust fairness.\nExperimental results on various datasets show that the proposed method can\neffectively improve the robust fairness of the model and outperform the\nstate-of-the-art approaches.\n","authors":["Hongxin Zhi","Hongtao Yu","Shaome Li","Xiuming Zhao","Yiteng Wu"],"pdf_url":"https://arxiv.org/pdf/2501.04527v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.18103v3","updated":"2025-01-08T14:13:23Z","published":"2024-03-26T21:01:41Z","title":"Tutorial on Diffusion Models for Imaging and Vision","summary":" The astonishing growth of generative tools in recent years has empowered many\nexciting applications in text-to-image generation and text-to-video generation.\nThe underlying principle behind these generative tools is the concept of\ndiffusion, a particular sampling mechanism that has overcome some shortcomings\nthat were deemed difficult in the previous approaches. The goal of this\ntutorial is to discuss the essential ideas underlying the diffusion models. The\ntarget audience of this tutorial includes undergraduate and graduate students\nwho are interested in doing research on diffusion models or applying these\nmodels to solve other problems.\n","authors":["Stanley H. 
Chan"],"pdf_url":"https://arxiv.org/pdf/2403.18103v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.01587v2","updated":"2025-01-08T14:12:45Z","published":"2024-04-02T02:29:41Z","title":"TSCM: A Teacher-Student Model for Vision Place Recognition Using\n Cross-Metric Knowledge Distillation","summary":" Visual place recognition (VPR) plays a pivotal role in autonomous exploration\nand navigation of mobile robots within complex outdoor environments. While\ncost-effective and easily deployed, camera sensors are sensitive to lighting\nand weather changes, and even slight image alterations can greatly affect VPR\nefficiency and precision. Existing methods overcome this by exploiting powerful\nyet large networks, leading to significant consumption of computational\nresources. In this paper, we propose a high-performance teacher and lightweight\nstudent distillation framework called TSCM. It exploits our devised\ncross-metric knowledge distillation to narrow the performance gap between the\nteacher and student models, maintaining superior performance while enabling\nminimal computational load during deployment. We conduct comprehensive\nevaluations on large-scale datasets, namely Pittsburgh30k and Pittsburgh250k.\nExperimental results demonstrate the superiority of our method over baseline\nmodels in terms of recognition accuracy and model parameter efficiency.\nMoreover, our ablation studies show that the proposed knowledge distillation\ntechnique surpasses other counterparts. 
The code of our method has been\nreleased at https://github.com/nubot-nudt/TSCM.\n","authors":["Yehui Shen","Mingmin Liu","Huimin Lu","Xieyuanli Chen"],"pdf_url":"https://arxiv.org/pdf/2404.01587v2.pdf","comment":"Accepted to ICRA 2024"},{"id":"http://arxiv.org/abs/2501.04515v1","updated":"2025-01-08T14:05:24Z","published":"2025-01-08T14:05:24Z","title":"SplineFormer: An Explainable Transformer-Based Approach for Autonomous\n Endovascular Navigation","summary":" Endovascular navigation is a crucial aspect of minimally invasive procedures,\nwhere precise control of curvilinear instruments like guidewires is critical\nfor successful interventions. A key challenge in this task is accurately\npredicting the evolving shape of the guidewire as it navigates through the\nvasculature, which presents complex deformations due to interactions with the\nvessel walls. Traditional segmentation methods often fail to provide accurate\nreal-time shape predictions, limiting their effectiveness in highly dynamic\nenvironments. To address this, we propose SplineFormer, a new transformer-based\narchitecture, designed specifically to predict the continuous, smooth shape of\nthe guidewire in an explainable way. By leveraging the transformer's ability,\nour network effectively captures the intricate bending and twisting of the\nguidewire, representing it as a spline for greater accuracy and smoothness. We\nintegrate our SplineFormer into an end-to-end robot navigation system by\nleveraging the condensed information. The experimental results demonstrate that\nour SplineFormer is able to perform endovascular navigation autonomously and\nachieves a 50% success rate when cannulating the brachiocephalic artery on the\nreal robot.\n","authors":["Tudor Jianu","Shayan Doust","Mengyun Li","Baoru Huang","Tuong Do","Hoan Nguyen","Karl Bates","Tung D. 
Ta","Sebastiano Fichera","Pierre Berthet-Rayne","Anh Nguyen"],"pdf_url":"https://arxiv.org/pdf/2501.04515v1.pdf","comment":"8 pages"},{"id":"http://arxiv.org/abs/2501.04513v1","updated":"2025-01-08T14:00:07Z","published":"2025-01-08T14:00:07Z","title":"Improving Image Captioning by Mimicking Human Reformulation Feedback at\n Inference-time","summary":" Incorporating automatically predicted human feedback into the process of\ntraining generative models has attracted substantial recent interest, while\nfeedback at inference time has received less attention. The typical feedback at\ntraining time, i.e., preferences of choice given two samples, does not\nnaturally transfer to the inference phase. We introduce a novel type of\nfeedback -- caption reformulations -- and train models to mimic reformulation\nfeedback based on human annotations. Our method does not require training the\nimage captioning model itself, thereby demanding substantially less\ncomputational effort. We experiment with two types of reformulation feedback:\nfirst, we collect a dataset of human reformulations that correct errors in the\ngenerated captions. We find that incorporating reformulation models trained on\nthis data into the inference phase of existing image captioning models results\nin improved captions, especially when the original captions are of low quality.\nWe apply our method to non-English image captioning, a domain where robust\nmodels are less prevalent, and gain substantial improvement. Second, we apply\nreformulations to style transfer. 
Quantitative evaluations reveal\nstate-of-the-art performance on German image captioning and English style\ntransfer, while human validation with a detailed comparative framework exposes\nthe specific axes of improvement.\n","authors":["Uri Berger","Omri Abend","Lea Frermann","Gabriel Stanovsky"],"pdf_url":"https://arxiv.org/pdf/2501.04513v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.06747v2","updated":"2025-01-08T13:49:54Z","published":"2024-08-13T09:10:48Z","title":"ReCLIP++: Learn to Rectify the Bias of CLIP for Unsupervised Semantic\n Segmentation","summary":" Recent works utilize CLIP to perform the challenging unsupervised semantic\nsegmentation task where only images without annotations are available. However,\nwe observe that when adopting CLIP to such a pixel-level understanding task,\nunexpected bias (including class-preference bias and space-preference bias)\noccurs. Previous works don't explicitly model the bias, which largely\nconstrains the segmentation performance. In this paper, we propose to\nexplicitly model and rectify the bias existing in CLIP to facilitate the\nunsupervised semantic segmentation task. Specifically, we design a learnable\n\"Reference\" prompt to encode class-preference bias and a projection of the\npositional embedding in the vision transformer to encode space-preference bias\nrespectively. To avoid interference, two kinds of biases are firstly\nindependently encoded into different features, i.e., the Reference feature and\nthe positional feature. Via a matrix multiplication between the Reference\nfeature and the positional feature, a bias logit map is generated to explicitly\nrepresent two kinds of biases. Then we rectify the logits of CLIP via a simple\nelement-wise subtraction. To make the rectified results smoother and more\ncontextual, we design a mask decoder which takes the feature of CLIP and the\nrectified logits as input and outputs a rectified segmentation mask with the\nhelp of Gumbel-Softmax operation. 
A contrastive loss based on the masked visual\nfeatures and the text features of different classes is imposed, which makes the\nbias modeling and rectification process meaningful and effective. Extensive\nexperiments on various benchmarks including PASCAL VOC, PASCAL Context, ADE20K,\nCityscapes, and COCO Stuff demonstrate that our method performs favorably\nagainst previous state-of-the-arts. The implementation is available at:\nhttps://github.com/dogehhh/ReCLIP.\n","authors":["Jingyun Wang","Guoliang Kang"],"pdf_url":"https://arxiv.org/pdf/2408.06747v2.pdf","comment":"Extended version of our CVPR 24 paper"},{"id":"http://arxiv.org/abs/2405.08766v2","updated":"2025-01-08T13:45:46Z","published":"2024-05-14T16:59:20Z","title":"Energy-based Hopfield Boosting for Out-of-Distribution Detection","summary":" Out-of-distribution (OOD) detection is critical when deploying machine\nlearning models in the real world. Outlier exposure methods, which incorporate\nauxiliary outlier data in the training process, can drastically improve OOD\ndetection performance compared to approaches without advanced training\nstrategies. We introduce Hopfield Boosting, a boosting approach, which\nleverages modern Hopfield energy (MHE) to sharpen the decision boundary between\nthe in-distribution and OOD data. Hopfield Boosting encourages the model to\nconcentrate on hard-to-distinguish auxiliary outlier examples that lie close to\nthe decision boundary between in-distribution and auxiliary outlier data. 
Our\nmethod achieves a new state-of-the-art in OOD detection with outlier exposure,\nimproving the FPR95 metric from 2.28 to 0.92 on CIFAR-10 and from 11.76 to 7.94\non CIFAR-100.\n","authors":["Claus Hofmann","Simon Schmid","Bernhard Lehner","Daniel Klotz","Sepp Hochreiter"],"pdf_url":"https://arxiv.org/pdf/2405.08766v2.pdf","comment":"NeurIPS 2024"},{"id":"http://arxiv.org/abs/2311.15963v3","updated":"2025-01-08T13:45:15Z","published":"2023-11-27T16:07:34Z","title":"From Pixels to Titles: Video Game Identification by Screenshots using\n Convolutional Neural Networks","summary":" This paper investigates video game identification through single screenshots,\nutilizing ten convolutional neural network (CNN) architectures (VGG16,\nResNet50, ResNet152, MobileNet, DenseNet169, DenseNet201, EfficientNetB0,\nEfficientNetB2, EfficientNetB3, and EfficientNetV2S) and three transformers\narchitectures (ViT-B16, ViT-L32, and SwinT) across 22 home console systems,\nspanning from Atari 2600 to PlayStation 5, totalling 8,796 games and 170,881\nscreenshots. Except for VGG16, all CNNs outperformed the transformers in this\ntask. Using ImageNet pre-trained weights as initial weights, EfficientNetV2S\nachieves the highest average accuracy (77.44%) and the highest accuracy in 16\nof the 22 systems. DenseNet201 is the best in four systems and EfficientNetB3\nis the best in the remaining two systems. Employing alternative initial weights\nfine-tuned in an arcade screenshots dataset boosts accuracy for EfficientNet\narchitectures, with the EfficientNetV2S reaching a peak accuracy of 77.63% and\ndemonstrating reduced convergence epochs from 26.9 to 24.5 on average. Overall,\nthe combination of optimal architecture and weights attains 78.79% accuracy,\nprimarily led by EfficientNetV2S in 15 systems. 
These findings underscore the\nefficacy of CNNs in video game identification through screenshots.\n","authors":["Fabricio Breve"],"pdf_url":"https://arxiv.org/pdf/2311.15963v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.02270v2","updated":"2025-01-08T13:42:02Z","published":"2025-01-04T12:15:58Z","title":"Efficient Video-Based ALPR System Using YOLO and Visual Rhythm","summary":" Automatic License Plate Recognition (ALPR) involves extracting vehicle\nlicense plate information from an image or a video capture. These systems have\ngained popularity due to the wide availability of low-cost surveillance cameras\nand advances in Deep Learning. Typically, video-based ALPR systems rely on\nmultiple frames to detect the vehicle and recognize the license plates.\nTherefore, we propose a system capable of extracting exactly one frame per\nvehicle and recognizing its license plate characters from this singular image\nusing an Optical Character Recognition (OCR) model. Early experiments show that\nthis methodology is viable.\n","authors":["Victor Nascimento Ribeiro","Nina S. T. Hirata"],"pdf_url":"https://arxiv.org/pdf/2501.02270v2.pdf","comment":"Accepted to CVPR 2024"},{"id":"http://arxiv.org/abs/2412.17378v3","updated":"2025-01-08T13:31:11Z","published":"2024-12-23T08:26:30Z","title":"Balanced 3DGS: Gaussian-wise Parallelism Rendering with Fine-Grained\n Tiling","summary":" 3D Gaussian Splatting (3DGS) is increasingly attracting attention in both\nacademia and industry owing to its superior visual quality and rendering speed.\nHowever, training a 3DGS model remains a time-intensive task, especially in\nload imbalance scenarios where workload diversity among pixels and Gaussian\nspheres causes poor render CUDA kernel performance. We introduce Balanced 3DGS,\na Gaussian-wise parallelism rendering with fine-grained tiling approach in 3DGS\ntraining process, perfectly solving load-imbalance issues.
First, we\ninnovatively introduce the inter-block dynamic workload distribution technique\nto map workloads to Streaming Multiprocessor (SM) resources within a single GPU\ndynamically, which constitutes the foundation of load balancing. Second, we are\nthe first to propose the Gaussian-wise parallel rendering technique to\nsignificantly reduce workload divergence inside a warp, which serves as a\ncritical component in addressing load imbalance. Based on the above two\nmethods, we further creatively put forward the fine-grained combined load\nbalancing technique to uniformly distribute workload across all SMs, which\nboosts the forward render CUDA kernel performance by up to 7.52x. Besides, we\npresent a self-adaptive render kernel selection strategy during the 3DGS\ntraining process based on different load-balance situations, which effectively\nimproves training efficiency.\n","authors":["Hao Gui","Lin Hu","Rui Chen","Mingxiao Huang","Yuxin Yin","Jin Yang","Yong Wu","Chen Liu","Zhongxu Sun","Xueyang Zhang","Kun Zhan"],"pdf_url":"https://arxiv.org/pdf/2412.17378v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04493v1","updated":"2025-01-08T13:26:24Z","published":"2025-01-08T13:26:24Z","title":"The Role of Machine Learning in Congenital Heart Disease Diagnosis:\n Datasets, Algorithms, and Insights","summary":" Congenital heart disease is among the most common fetal abnormalities and\nbirth defects. Despite identifying numerous risk factors influencing its onset,\na comprehensive understanding of its genesis and management across diverse\npopulations remains limited. Recent advancements in machine learning have\ndemonstrated the potential for leveraging patient data to enable early\ncongenital heart disease detection. Over the past seven years, researchers have\nproposed various data-driven and algorithmic solutions to address this\nchallenge.
This paper presents a systematic review of congenital heart disease\nrecognition using machine learning, conducting a meta-analysis of 432\nreferences from leading journals published between 2018 and 2024. A detailed\ninvestigation of 74 scholarly works highlights key factors, including\ndatabases, algorithms, applications, and solutions. Additionally, the survey\noutlines reported datasets used by machine learning experts for congenital\nheart disease recognition. Using a systematic literature review methodology,\nthis study identifies critical challenges and opportunities in applying machine\nlearning to congenital heart disease.\n","authors":["Khalil Khan","Farhan Ullah","Ikram Syed","Irfan Ullah"],"pdf_url":"https://arxiv.org/pdf/2501.04493v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04486v1","updated":"2025-01-08T13:13:52Z","published":"2025-01-08T13:13:52Z","title":"MB-TaylorFormer V2: Improved Multi-branch Linear Transformer Expanded by\n Taylor Formula for Image Restoration","summary":" Recently, Transformer networks have demonstrated outstanding performance in\nthe field of image restoration due to the global receptive field and\nadaptability to input. However, the quadratic computational complexity of\nSoftmax-attention poses a significant limitation on its extensive application\nin image restoration tasks, particularly for high-resolution images. To tackle\nthis challenge, we propose a novel variant of the Transformer. This variant\nleverages the Taylor expansion to approximate the Softmax-attention and\nutilizes the concept of norm-preserving mapping to approximate the remainder of\nthe first-order Taylor expansion, resulting in a linear computational\ncomplexity.
Moreover, we introduce a multi-branch architecture featuring\nmulti-scale patch embedding into the proposed Transformer, which has four\ndistinct advantages: 1) various sizes of the receptive field; 2) multi-level\nsemantic information; 3) flexible shapes of the receptive field; 4) accelerated\ntraining and inference speed. Hence, the proposed model, named the second\nversion of Taylor formula expansion-based Transformer (for short\nMB-TaylorFormer V2) has the capability to concurrently process coarse-to-fine\nfeatures, capture long-distance pixel interactions with limited computational\ncost, and improve the approximation of the Taylor expansion remainder.\nExperimental results across diverse image restoration benchmarks demonstrate\nthat MB-TaylorFormer V2 achieves state-of-the-art performance in multiple image\nrestoration tasks, such as image dehazing, deraining, desnowing, motion\ndeblurring, and denoising, with very little computational overhead. The source\ncode is available at https://github.com/FVL2020/MB-TaylorFormerV2.\n","authors":["Zhi Jin","Yuwei Qiu","Kaihao Zhang","Hongdong Li","Wenhan Luo"],"pdf_url":"https://arxiv.org/pdf/2501.04486v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.01996v4","updated":"2025-01-08T13:03:24Z","published":"2024-07-02T07:10:10Z","title":"ViG-Bias: Visually Grounded Bias Discovery and Mitigation","summary":" The proliferation of machine learning models in critical decision making\nprocesses has underscored the need for bias discovery and mitigation\nstrategies. Identifying the reasons behind a biased system is not\nstraightforward, since in many occasions they are associated with hidden\nspurious correlations which are not easy to spot. 
Standard approaches rely on\nbias audits performed by analyzing model performance in pre-defined subgroups\nof data samples, usually characterized by common attributes like gender or\nethnicity when it comes to people, or other specific attributes defining\nsemantically coherent groups of images. However, it is not always possible to\nknow a-priori the specific attributes defining the failure modes of visual\nrecognition systems. Recent approaches propose to discover these groups by\nleveraging large vision language models, which enable the extraction of\ncross-modal embeddings and the generation of textual descriptions to\ncharacterize the subgroups where a certain model is underperforming. In this\nwork, we argue that incorporating visual explanations (e.g. heatmaps generated\nvia GradCAM or other approaches) can boost the performance of such bias\ndiscovery and mitigation frameworks. To this end, we introduce Visually\nGrounded Bias Discovery and Mitigation (ViG-Bias), a simple yet effective\ntechnique which can be integrated to a variety of existing frameworks to\nimprove both, discovery and mitigation performance. Our comprehensive\nevaluation shows that incorporating visual explanations enhances existing\ntechniques like DOMINO, FACTS and Bias-to-Text, across several challenging\ndatasets, including CelebA, Waterbirds, and NICO++.\n","authors":["Badr-Eddine Marani","Mohamed Hanini","Nihitha Malayarukil","Stergios Christodoulidis","Maria Vakalopoulou","Enzo Ferrante"],"pdf_url":"https://arxiv.org/pdf/2407.01996v4.pdf","comment":"ECCV 2024"},{"id":"http://arxiv.org/abs/2501.04477v1","updated":"2025-01-08T13:00:17Z","published":"2025-01-08T13:00:17Z","title":"Rethinking High-speed Image Reconstruction Framework with Spike Camera","summary":" Spike cameras, as innovative neuromorphic devices, generate continuous spike\nstreams to capture high-speed scenes with lower bandwidth and higher dynamic\nrange than traditional RGB cameras. 
However, reconstructing high-quality images\nfrom the spike input under low-light conditions remains challenging.\nConventional learning-based methods often rely on the synthetic dataset as the\nsupervision for training. Still, these approaches falter when dealing with\nnoisy spikes fired under the low-light environment, leading to further\nperformance degradation in the real-world dataset. This phenomenon is primarily\ndue to inadequate noise modelling and the domain gap between synthetic and real\ndatasets, resulting in recovered images with unclear textures, excessive noise,\nand diminished brightness. To address these challenges, we introduce a novel\nspike-to-image reconstruction framework SpikeCLIP that goes beyond traditional\ntraining paradigms. Leveraging the CLIP model's powerful capability to align\ntext and images, we incorporate the textual description of the captured scene\nand unpaired high-quality datasets as the supervision. Our experiments on\nreal-world low-light datasets U-CALTECH and U-CIFAR demonstrate that SpikeCLIP\nsignificantly enhances texture details and the luminance balance of recovered\nimages. Furthermore, the reconstructed images are well-aligned with the broader\nvisual features needed for downstream tasks, ensuring more robust and versatile\nperformance in challenging environments.\n","authors":["Kang Chen","Yajing Zheng","Tiejun Huang","Zhaofei Yu"],"pdf_url":"https://arxiv.org/pdf/2501.04477v1.pdf","comment":"Accepted by AAAI2025"},{"id":"http://arxiv.org/abs/2501.04467v1","updated":"2025-01-08T12:41:42Z","published":"2025-01-08T12:41:42Z","title":"A Histologic Dataset of Normal and Atypical Mitotic Figures on Human\n Breast Cancer (AMi-Br)","summary":" Assessment of the density of mitotic figures (MFs) in histologic tumor\nsections is an important prognostic marker for many tumor types, including\nbreast cancer. 
Recently, it has been reported in multiple works that the\nquantity of MFs with an atypical morphology (atypical MFs, AMFs) might be an\nindependent prognostic criterion for breast cancer. AMFs are an indicator of\nmutations in the genes regulating the cell cycle and can lead to aberrant\nchromosome constitution (aneuploidy) of the tumor cells. To facilitate further\nresearch on this topic using pattern recognition, we present the first ever\npublicly available dataset of atypical and normal MFs (AMi-Br). For this, we\nutilized two of the most popular MF datasets (MIDOG 2021 and TUPAC) and\nsubclassified all MFs using a three expert majority vote. Our final dataset\nconsists of 3,720 MFs, split into 832 AMFs (22.4%) and 2,888 normal MFs (77.6%)\nacross all 223 tumor cases in the combined set. We provide baseline\nclassification experiments to investigate the consistency of the dataset, using\na Monte Carlo cross-validation and different strategies to combat class\nimbalance. We found an averaged balanced accuracy of up to 0.806 when using a\npatch-level data set split, and up to 0.713 when using a patient-level split.\n","authors":["Christof A. Bertram","Viktoria Weiss","Taryn A. Donovan","Sweta Banerjee","Thomas Conrad","Jonas Ammeling","Robert Klopfleisch","Christopher Kaltenecker","Marc Aubreville"],"pdf_url":"https://arxiv.org/pdf/2501.04467v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04459v1","updated":"2025-01-08T12:30:06Z","published":"2025-01-08T12:30:06Z","title":"Rapid Automated Mapping of Clouds on Titan With Instance Segmentation","summary":" Despite widespread adoption of deep learning models to address a variety of\ncomputer vision tasks, planetary science has yet to see extensive utilization\nof such tools to address its unique problems. 
On Titan, the largest moon of\nSaturn, tracking seasonal trends and weather patterns of clouds provides\ncrucial insights into one of the most complex climates in the Solar System, yet\nmuch of the available image data are still analyzed in a conventional way. In\nthis work, we apply a Mask R-CNN trained via transfer learning to perform\ninstance segmentation of clouds in Titan images acquired by the Cassini\nspacecraft - a previously unexplored approach to a big data problem in\nplanetary science. We demonstrate that an automated technique can provide\nquantitative measures for clouds, such as areas and centroids, that may\notherwise be prohibitively time-intensive to produce by human mapping.\nFurthermore, despite Titan specific challenges, our approach yields accuracy\ncomparable to contemporary cloud identification studies on Earth and other\nworlds. We compare the efficiencies of human-driven versus algorithmic\napproaches, showing that transfer learning provides speed-ups that may open new\nhorizons for data investigation for Titan. Moreover, we suggest that such\napproaches have broad potential for application to similar problems in\nplanetary science where they are currently under-utilized. 
Future planned\nmissions to the planets and remote sensing initiatives for the Earth promise to\nprovide a deluge of image data in the coming years that will benefit strongly\nfrom leveraging machine learning approaches to perform the analysis.\n","authors":["Zachary Yahn","Douglas M Trent","Ethan Duncan","Benoît Seignovert","John Santerre","Conor Nixon"],"pdf_url":"https://arxiv.org/pdf/2501.04459v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.14599v2","updated":"2025-01-08T12:20:56Z","published":"2024-06-20T17:59:56Z","title":"Stylebreeder: Exploring and Democratizing Artistic Styles through\n Text-to-Image Models","summary":" Text-to-image models are becoming increasingly popular, revolutionizing the\nlandscape of digital art creation by enabling highly detailed and creative\nvisual content generation. These models have been widely employed across\nvarious domains, particularly in art generation, where they facilitate a broad\nspectrum of creative expression and democratize access to artistic creation. In\nthis paper, we introduce \\texttt{STYLEBREEDER}, a comprehensive dataset of 6.8M\nimages and 1.8M prompts generated by 95K users on Artbreeder, a platform that\nhas emerged as a significant hub for creative exploration with over 13M users.\nWe introduce a series of tasks with this dataset aimed at identifying diverse\nartistic styles, generating personalized content, and recommending styles based\non user interests. By documenting unique, user-generated styles that transcend\nconventional categories like 'cyberpunk' or 'Picasso,' we explore the potential\nfor unique, crowd-sourced styles that could provide deep insights into the\ncollective creative psyche of users worldwide. We also evaluate different\npersonalization methods to enhance artistic expression and introduce a style\natlas, making these models available in LoRA format for public use. 
Our\nresearch demonstrates the potential of text-to-image diffusion models to\nuncover and promote unique artistic expressions, further democratizing AI in\nart and fostering a more diverse and inclusive artistic community. The dataset,\ncode and models are available at https://stylebreeder.github.io under a Public\nDomain (CC0) license.\n","authors":["Matthew Zheng","Enis Simsar","Hidir Yesiltepe","Federico Tombari","Joel Simon","Pinar Yanardag"],"pdf_url":"https://arxiv.org/pdf/2406.14599v2.pdf","comment":"Accepted at NeurIPS 2024 D&B Track, Project page:\n https://stylebreeder.github.io HuggingFace DB Page:\n https://huggingface.co/datasets/stylebreeder/stylebreeder"},{"id":"http://arxiv.org/abs/2501.01767v2","updated":"2025-01-08T12:11:18Z","published":"2025-01-03T11:40:41Z","title":"LogicAD: Explainable Anomaly Detection via VLM-based Text Feature\n Extraction","summary":" Logical image understanding involves interpreting and reasoning about the\nrelationships and consistency within an image's visual content. This capability\nis essential in applications such as industrial inspection, where logical\nanomaly detection is critical for maintaining high-quality standards and\nminimizing costly recalls. Previous research in anomaly detection (AD) has\nrelied on prior knowledge for designing algorithms, which often requires\nextensive manual annotations, significant computing power, and large amounts of\ndata for training. Autoregressive, multimodal Vision Language Models (AVLMs)\noffer a promising alternative due to their exceptional performance in visual\nreasoning across various domains. Despite this, their application to logical AD\nremains unexplored. In this work, we investigate using AVLMs for logical AD and\ndemonstrate that they are well-suited to the task. 
Combining AVLMs with format\nembedding and a logic reasoner, we achieve SOTA performance on public\nbenchmarks, MVTec LOCO AD, with an AUROC of 86.0% and F1-max of 83.7%, along\nwith explanations of anomalies. This significantly outperforms the existing\nSOTA method by a large margin.\n","authors":["Er Jin","Qihui Feng","Yongli Mou","Stefan Decker","Gerhard Lakemeyer","Oliver Simons","Johannes Stegmaier"],"pdf_url":"https://arxiv.org/pdf/2501.01767v2.pdf","comment":"Accepted for publication at aaai25, project page:\n https://jasonjin34.github.io/logicad.github.io/"},{"id":"http://arxiv.org/abs/2501.04444v1","updated":"2025-01-08T11:53:30Z","published":"2025-01-08T11:53:30Z","title":"A novel Facial Recognition technique with Focusing on Masked Faces","summary":" Recognizing the same faces with and without masks is important for ensuring\nconsistent identification in security, access control, and public safety. This\ncapability is crucial in scenarios like law enforcement, healthcare, and\nsurveillance, where accurate recognition must be maintained despite facial\nocclusion. This research focuses on the challenge of recognizing the same faces\nwith and without masks by employing cosine similarity as the primary technique.\nWith the increased use of masks, traditional facial recognition systems face\nsignificant accuracy issues, making it crucial to develop methods that can\nreliably identify individuals in masked conditions. For that reason, this study\nproposed Masked-Unmasked Face Matching Model (MUFM). This model employs\ntransfer learning using the Visual Geometry Group (VGG16) model to extract\nsignificant facial features, which are subsequently classified utilizing the\nK-Nearest Neighbors (K-NN) algorithm. The cosine similarity metric is employed\nto compare masked and unmasked faces of the same individuals. 
This approach\nrepresents a novel contribution, as the task of recognizing the same individual\nwith and without a mask using cosine similarity has not been previously\naddressed. By integrating these advanced methodologies, the research\ndemonstrates effective identification of individuals despite the presence of\nmasks, addressing a significant limitation in traditional systems. Data\npreparation is another essential part of this work: an image dataset was\ncollected and prepared from three different sources, some of which contain\nreal-world images, giving the research comprehensive coverage. The images were\ndrawn from three existing datasets containing masked and unmasked versions of\nthe same faces.\n","authors":["Dana A Abdullah","Dana Rasul Hamad","Hakem Beitollahi","Ismail Y Maolood","Abdulhady Abas Abdullah","Aso Khaleel Ameen"],"pdf_url":"https://arxiv.org/pdf/2501.04444v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04440v1","updated":"2025-01-08T11:41:47Z","published":"2025-01-08T11:41:47Z","title":"RSAR: Restricted State Angle Resolver and Rotated SAR Benchmark","summary":" Rotated object detection has made significant progress in optical remote\nsensing. However, advancements in the Synthetic Aperture Radar (SAR) field\nlag behind, primarily due to the absence of a large-scale dataset.\nAnnotating such a dataset is inefficient and costly. A promising solution is to\nemploy a weakly supervised model (e.g., trained with available horizontal boxes\nonly) to generate pseudo-rotated boxes for reference before manual calibration.\nUnfortunately, existing weakly supervised models exhibit limited accuracy\nin predicting the object's angle. Previous works attempt to enhance angle\nprediction by using angle resolvers that decouple angles into cosine and sine\nencodings. 
In this work, we first reevaluate these resolvers from a unified\nperspective of dimension mapping and expose that they share the same\nshortcomings: these methods overlook the unit cycle constraint inherent in\nthese encodings, easily leading to prediction biases. To address this issue, we\npropose the Unit Cycle Resolver, which incorporates a unit circle constraint\nloss to improve angle prediction accuracy. Our approach can effectively improve\nthe performance of existing state-of-the-art weakly supervised methods and even\nsurpasses fully supervised models on existing optical benchmarks (i.e.,\nDOTA-v1.0 dataset). With the aid of UCR, we further annotate and introduce\nRSAR, the largest multi-class rotated SAR object detection dataset to date.\nExtensive experiments on both RSAR and optical datasets demonstrate that our\nUCR enhances angle prediction accuracy. Our dataset and code can be found at:\nhttps://github.com/zhasion/RSAR.\n","authors":["Xin Zhang","Xue Yang","Yuxuan Li","Jian Yang","Ming-Ming Cheng","Xiang Li"],"pdf_url":"https://arxiv.org/pdf/2501.04440v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.01087v3","updated":"2025-01-08T11:40:29Z","published":"2025-01-02T06:19:53Z","title":"Bridging Simplicity and Sophistication using GLinear: A Novel\n Architecture for Enhanced Time Series Prediction","summary":" Time Series Forecasting (TSF) is an important application across many fields.\nThere is a debate about whether Transformers, despite being good at\nunderstanding long sequences, struggle with preserving temporal relationships\nin time series data. Recent research suggests that simpler linear models might\noutperform or at least provide competitive performance compared to complex\nTransformer-based models for TSF tasks. In this paper, we propose a novel\ndata-efficient architecture, GLinear, for multivariate TSF that exploits\nperiodic patterns to provide better accuracy. 
It also provides better\nprediction accuracy by using a smaller amount of historical data compared to\nother state-of-the-art linear predictors. Four different datasets (ETTh1,\nElectricity, Traffic, and Weather) are used to evaluate the performance of the\nproposed predictor. A performance comparison with state-of-the-art linear\narchitectures (such as NLinear, DLinear, and RLinear) and transformer-based\ntime series predictor (Autoformer) shows that the GLinear, despite being\nparametrically efficient, significantly outperforms the existing architectures\nin most cases of multivariate TSF. We hope that the proposed GLinear opens new\nfronts of research and development of simpler and more sophisticated\narchitectures for data and computationally efficient time-series analysis.\n","authors":["Syed Tahir Hussain Rizvi","Neel Kanwal","Muddasar Naeem","Alfredo Cuzzocrea","Antonio Coronato"],"pdf_url":"https://arxiv.org/pdf/2501.01087v3.pdf","comment":"Submitted to IEEE Transactions on Emerging Topics in Computational\n Intelligence"},{"id":"http://arxiv.org/abs/2501.03567v2","updated":"2025-01-08T11:20:00Z","published":"2025-01-07T06:35:34Z","title":"Evaluating Image Caption via Cycle-consistent Text-to-Image Generation","summary":" Evaluating image captions typically relies on reference captions, which are\ncostly to obtain and exhibit significant diversity and subjectivity. While\nreference-free evaluation metrics have been proposed, most focus on cross-modal\nevaluation between captions and images. Recent research has revealed that the\nmodality gap generally exists in the representation of contrastive\nlearning-based multi-modal systems, undermining the reliability of\ncross-modality metrics like CLIPScore. 
In this paper, we propose CAMScore, a\ncyclic reference-free automatic evaluation metric for image captioning models.\nTo circumvent the aforementioned modality gap, CAMScore utilizes a\ntext-to-image model to generate images from captions and subsequently evaluates\nthese generated images against the original images. Furthermore, to provide\nfine-grained information for a more comprehensive evaluation, we design a\nthree-level evaluation framework for CAMScore that encompasses pixel-level,\nsemantic-level, and objective-level perspectives. Extensive experiment results\nacross multiple benchmark datasets show that CAMScore achieves a superior\ncorrelation with human judgments compared to existing reference-based and\nreference-free metrics, demonstrating the effectiveness of the framework.\n","authors":["Tianyu Cui","Jinbin Bai","Guo-Hua Wang","Qing-Guo Chen","Zhao Xu","Weihua Luo","Kaifu Zhang","Ye Shi"],"pdf_url":"https://arxiv.org/pdf/2501.03567v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.14520v4","updated":"2025-01-08T11:03:00Z","published":"2024-03-21T16:17:57Z","title":"Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient\n Inference","summary":" In recent years, the application of multimodal large language models (MLLM)\nin various fields has achieved remarkable success. However, as the foundation\nmodel for many downstream tasks, current MLLMs are composed of the well-known\nTransformer network, which has a less efficient quadratic computation\ncomplexity. To improve the efficiency of such basic models, we propose Cobra, a\nlinear computational complexity MLLM. Specifically, Cobra integrates the\nefficient Mamba language model into the visual modality. Moreover, we explore\nand study various modal fusion schemes to create an effective multi-modal\nMamba. 
Extensive experiments demonstrate that (1) Cobra achieves extremely\ncompetitive performance with current computationally efficient state-of-the-art\nmethods, e.g., LLaVA-Phi, TinyLLaVA, and MobileVLM v2, and has faster speed due\nto Cobra's linear sequential modeling. (2) Interestingly, the results of\nclosed-set challenging prediction benchmarks show that Cobra performs well in\novercoming visual illusions and spatial relationship judgments. (3) Notably,\nCobra even achieves comparable performance to LLaVA with about 43% of the\nnumber of parameters. We will make all codes of Cobra open-source and hope that\nthe proposed method can facilitate future research on complexity problems in\nMLLM. Our project page is available at: https://sites.google.com/view/cobravlm.\n","authors":["Han Zhao","Min Zhang","Wei Zhao","Pengxiang Ding","Siteng Huang","Donglin Wang"],"pdf_url":"https://arxiv.org/pdf/2403.14520v4.pdf","comment":"Accepted to the Thirty-Ninth AAAI Conference on Artificial\n Intelligence (AAAI-25)"},{"id":"http://arxiv.org/abs/2409.09502v2","updated":"2025-01-08T10:11:44Z","published":"2024-09-14T18:26:26Z","title":"One missing piece in Vision and Language: A Survey on Comics\n Understanding","summary":" Vision-language models have recently evolved into versatile systems capable\nof high performance across a range of tasks, such as document understanding,\nvisual question answering, and grounding, often in zero-shot settings. Comics\nUnderstanding, a complex and multifaceted field, stands to greatly benefit from\nthese advances. Comics, as a medium, combine rich visual and textual\nnarratives, challenging AI models with tasks that span image classification,\nobject detection, instance segmentation, and deeper narrative comprehension\nthrough sequential panels. 
However, the unique structure of comics --\ncharacterized by creative variations in style, reading order, and non-linear\nstorytelling -- presents a set of challenges distinct from those in other\nvisual-language domains. In this survey, we present a comprehensive review of\nComics Understanding from both dataset and task perspectives. Our contributions\nare fivefold: (1) We analyze the structure of the comics medium, detailing its\ndistinctive compositional elements; (2) We survey the widely used datasets and\ntasks in comics research, emphasizing their role in advancing the field; (3) We\nintroduce the Layer of Comics Understanding (LoCU) framework, a novel taxonomy\nthat redefines vision-language tasks within comics and lays the foundation for\nfuture work; (4) We provide a detailed review and categorization of existing\nmethods following the LoCU framework; (5) Finally, we highlight current\nresearch challenges and propose directions for future exploration, particularly\nin the context of vision-language models applied to comics. This survey is the\nfirst to propose a task-oriented framework for comics intelligence and aims to\nguide future research by addressing critical gaps in data availability and task\ndefinition. A project associated with this survey is available at\nhttps://github.com/emanuelevivoli/awesome-comics-understanding.\n","authors":["Emanuele Vivoli","Mohamed Ali Souibgui","Andrey Barsky","Artemis LLabrés","Marco Bertini","Dimosthenis Karatzas"],"pdf_url":"https://arxiv.org/pdf/2409.09502v2.pdf","comment":"under review. project website:\n https://github.com/emanuelevivoli/awesome-comics-understanding"},{"id":"http://arxiv.org/abs/2501.04390v1","updated":"2025-01-08T10:08:09Z","published":"2025-01-08T10:08:09Z","title":"iFADIT: Invertible Face Anonymization via Disentangled Identity\n Transform","summary":" Face anonymization aims to conceal the visual identity of a face to safeguard\nthe individual's privacy. 
Traditional methods like blurring and pixelation can\nlargely remove identifying features, but these techniques significantly degrade\nimage quality and are vulnerable to deep reconstruction attacks. Generative\nmodels have emerged as a promising solution for anonymizing faces while\npreserving a natural appearance. However, many still face limitations in visual\nquality and often overlook the potential to recover the original face from the\nanonymized version, which can be valuable in specific contexts such as image\nforensics. This paper proposes a novel framework named iFADIT, an acronym for\nInvertible Face Anonymization via Disentangled Identity Transform. The framework\nfeatures a disentanglement architecture coupled with a secure flow-based model:\nthe former decouples identity information from non-identifying attributes,\nwhile the latter transforms the decoupled identity into an anonymized version\nin an invertible manner controlled by a secret key. The anonymized face can\nthen be reconstructed based on a pre-trained StyleGAN that ensures high image\nquality and realistic facial details. 
Recovery of the original face (aka\nde-anonymization) is possible upon the availability of the matching secret, by\ninverting the anonymization process based on the same set of model parameters.\nFurthermore, a dedicated secret-key mechanism along with a dual-phase training\nstrategy is devised to ensure the desired properties of face anonymization.\nQualitative and quantitative experiments demonstrate the superiority of the\nproposed approach in anonymity, reversibility, security, diversity, and\ninterpretability over competing methods.\n","authors":["Lin Yuan","Kai Liang","Xiong Li","Tao Wu","Nannan Wang","Xinbo Gao"],"pdf_url":"https://arxiv.org/pdf/2501.04390v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13789v3","updated":"2025-01-08T10:01:20Z","published":"2023-12-21T12:26:11Z","title":"TinySAM: Pushing the Envelope for Efficient Segment Anything Model","summary":" Recently segment anything model (SAM) has shown powerful segmentation\ncapability and has drawn great attention in computer vision fields. Massive\nfollowing works have developed various applications based on the pre-trained\nSAM and achieved impressive performance on downstream vision tasks. However,\nSAM consists of heavy architectures and requires massive computational\ncapacity, which hinders the further application of SAM on computation\nconstrained edge devices. To this end, in this paper we propose a framework to\nobtain a tiny segment anything model (TinySAM) while maintaining the strong\nzero-shot performance. We first propose a full-stage knowledge distillation\nmethod with hard prompt sampling and hard mask weighting strategy to distill a\nlightweight student model. We also adapt the post-training quantization to the\nprompt-based segmentation task and further reduce the computational cost.\nMoreover, a hierarchical segmenting everything strategy is proposed to\naccelerate the everything inference by $2\\times$ with almost no performance\ndegradation. 
With all these proposed methods, our TinySAM leads to orders of\nmagnitude computational reduction and pushes the envelope for efficient segment\nanything task. Extensive experiments on various zero-shot transfer tasks\ndemonstrate the significantly advantageous performance of our TinySAM against\ncounterpart methods. Codes are available at\nhttps://github.com/xinghaochen/TinySAM and\nhttps://gitee.com/mindspore/models/tree/master/research/cv/TinySAM.\n","authors":["Han Shu","Wenshuo Li","Yehui Tang","Yiman Zhang","Yihao Chen","Houqiang Li","Yunhe Wang","Xinghao Chen"],"pdf_url":"https://arxiv.org/pdf/2312.13789v3.pdf","comment":"AAAI 2025"},{"id":"http://arxiv.org/abs/2312.00947v3","updated":"2025-01-08T09:41:18Z","published":"2023-12-01T22:00:14Z","title":"FreeZe: Training-free zero-shot 6D pose estimation with geometric and\n vision foundation models","summary":" Estimating the 6D pose of objects unseen during training is highly desirable\nyet challenging. Zero-shot object 6D pose estimation methods address this\nchallenge by leveraging additional task-specific supervision provided by\nlarge-scale, photo-realistic synthetic datasets. However, their performance\nheavily depends on the quality and diversity of rendered data and they require\nextensive training. In this work, we show how to tackle the same task but\nwithout training on specific data. We propose FreeZe, a novel solution that\nharnesses the capabilities of pre-trained geometric and vision foundation\nmodels. FreeZe leverages 3D geometric descriptors learned from unrelated 3D\npoint clouds and 2D visual features learned from web-scale 2D images to\ngenerate discriminative 3D point-level descriptors. We then estimate the 6D\npose of unseen objects by 3D registration based on RANSAC. We also introduce a\nnovel algorithm to solve ambiguous cases due to geometrically symmetric objects\nthat is based on visual features. 
We comprehensively evaluate FreeZe across the\nseven core datasets of the BOP Benchmark, which include over a hundred 3D\nobjects and 20,000 images captured in various scenarios. FreeZe consistently\noutperforms all state-of-the-art approaches, including competitors extensively\ntrained on synthetic 6D pose estimation data. Code will be publicly available\nat https://andreacaraffa.github.io/freeze.\n","authors":["Andrea Caraffa","Davide Boscaini","Amir Hamza","Fabio Poiesi"],"pdf_url":"https://arxiv.org/pdf/2312.00947v3.pdf","comment":"Accepted to ECCV 2024. Project page:\n https://andreacaraffa.github.io/freeze"},{"id":"http://arxiv.org/abs/2309.06941v3","updated":"2025-01-08T09:35:58Z","published":"2023-09-13T13:24:27Z","title":"DEFormer: DCT-driven Enhancement Transformer for Low-light Image and\n Dark Vision","summary":" Low-light image enhancement restores the colors and details of a single image\nand improves high-level visual tasks. However, restoring the details lost in\ndark areas remains a challenge when relying only on the RGB domain. In this\npaper, we introduce frequency as a new clue for the model and propose a\nDCT-driven enhancement transformer (DEFormer) framework. First, we propose a\nlearnable frequency branch (LFB) for frequency enhancement, which contains DCT\nprocessing and curvature-based frequency enhancement (CFE) to represent\nfrequency features. Additionally, we propose a cross-domain fusion (CDF) to\nreduce the differences between the RGB domain and the frequency domain. 
Our\nDEFormer has achieved superior results on the LOL and MIT-Adobe FiveK datasets,\nimproving the dark detection performance.\n","authors":["Xiangchen Yin","Zhenda Yu","Xin Gao","Xiao Sun"],"pdf_url":"https://arxiv.org/pdf/2309.06941v3.pdf","comment":"Accepted by ICASSP"},{"id":"http://arxiv.org/abs/2501.04377v1","updated":"2025-01-08T09:34:15Z","published":"2025-01-08T09:34:15Z","title":"On Computational Limits and Provably Efficient Criteria of Visual\n Autoregressive Models: A Fine-Grained Complexity Analysis","summary":" Recently, Visual Autoregressive ($\\mathsf{VAR}$) Models introduced a\ngroundbreaking advancement in the field of image generation, offering a\nscalable approach through a coarse-to-fine \"next-scale prediction\" paradigm.\nHowever, the state-of-the-art algorithm of $\\mathsf{VAR}$ models in [Tian,\nJiang, Yuan, Peng and Wang, NeurIPS 2024] takes $O(n^4)$ time, which is\ncomputationally inefficient. In this work, we analyze the computational limits\nand efficiency criteria of $\\mathsf{VAR}$ Models through a fine-grained\ncomplexity lens. Our key contribution is identifying the conditions under which\n$\\mathsf{VAR}$ computations can achieve sub-quadratic time complexity.\nSpecifically, we establish a critical threshold for the norm of input matrices\nused in $\\mathsf{VAR}$ attention mechanisms. Above this threshold, assuming the\nStrong Exponential Time Hypothesis ($\\mathsf{SETH}$) from fine-grained\ncomplexity theory, a sub-quartic time algorithm for $\\mathsf{VAR}$ models is\nimpossible. To substantiate our theoretical findings, we present efficient\nconstructions leveraging low-rank approximations that align with the derived\ncriteria. This work initiates the study of the computational efficiency of the\n$\\mathsf{VAR}$ model from a theoretical perspective. 
Our technique will shed\nlight on advancing scalable and efficient image generation in $\mathsf{VAR}$\nframeworks.\n","authors":["Yekun Ke","Xiaoyu Li","Yingyu Liang","Zhizhou Sha","Zhenmei Shi","Zhao Song"],"pdf_url":"https://arxiv.org/pdf/2501.04377v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04376v1","updated":"2025-01-08T09:30:45Z","published":"2025-01-08T09:30:45Z","title":"Exploring Unbiased Deepfake Detection via Token-Level Shuffling and\n Mixing","summary":" The generalization problem is broadly recognized as a critical challenge in\ndetecting deepfakes. Most previous work believes that the generalization gap is\ncaused by the differences among various forgery methods. However, our\ninvestigation reveals that the generalization issue can still occur when\nforgery-irrelevant factors shift. In this work, we identify two biases that\ndetectors are also prone to overfitting: position bias and content bias, as\ndepicted in Fig. 1. For position bias, we observe that detectors tend to\nlazily depend on specific positions within an image (e.g., central regions,\neven when they contain no forgery). As for content bias, we argue that\ndetectors may potentially and mistakenly utilize forgery-unrelated information\nfor detection (e.g., background and hair). To intervene on these biases, we\npropose two branches for shuffling and mixing tokens in the latent space of\ntransformers. For the shuffling branch, we rearrange the tokens and the\ncorresponding position embeddings for each image while maintaining the local\ncorrelation. For the mixing branch, we randomly select and mix the tokens in\nthe latent space between two images with the same label within the mini-batch\nto recombine the content information. During the learning process, we align the\noutputs of detectors from different branches in both feature space and logit\nspace. 
Contrastive losses for features and divergence losses for logits are\napplied to obtain unbiased feature representation and classifiers. We\ndemonstrate and verify the effectiveness of our method through extensive\nexperiments on widely used evaluation datasets.\n","authors":["Xinghe Fu","Zhiyuan Yan","Taiping Yao","Shen Chen","Xi Li"],"pdf_url":"https://arxiv.org/pdf/2501.04376v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.15209v3","updated":"2025-01-08T09:29:10Z","published":"2024-03-22T13:50:27Z","title":"MSCoTDet: Language-driven Multi-modal Fusion for Improved Multispectral\n Pedestrian Detection","summary":" Multispectral pedestrian detection is attractive for around-the-clock\napplications due to the complementary information between RGB and thermal\nmodalities. However, current models often fail to detect pedestrians in certain\ncases (e.g., thermal-obscured pedestrians), particularly due to the modality\nbias learned from statistically biased datasets. In this paper, we investigate\nhow to mitigate modality bias in multispectral pedestrian detection using Large\nLanguage Models (LLMs). Accordingly, we design a Multispectral Chain-of-Thought\n(MSCoT) prompting strategy, which prompts the LLM to perform multispectral\npedestrian detection. Moreover, we propose a novel Multispectral\nChain-of-Thought Detection (MSCoTDet) framework that integrates MSCoT prompting\ninto multispectral pedestrian detection. To this end, we design a\nLanguage-driven Multi-modal Fusion (LMF) strategy that enables fusing the\noutputs of MSCoT prompting with the detection results of vision-based\nmultispectral pedestrian detection models. 
Extensive experiments validate that\nMSCoTDet effectively mitigates modality biases and improves multispectral\npedestrian detection.\n","authors":["Taeheon Kim","Sangyun Chung","Damin Yeom","Youngjoon Yu","Hak Gu Kim","Yong Man Ro"],"pdf_url":"https://arxiv.org/pdf/2403.15209v3.pdf","comment":"IEEE Transactions on Circuits and Systems for Video Technology\n (TCSVT)"},{"id":"http://arxiv.org/abs/2501.04374v1","updated":"2025-01-08T09:28:25Z","published":"2025-01-08T09:28:25Z","title":"Instructive3D: Editing Large Reconstruction Models with Text\n Instructions","summary":" Transformer based methods have enabled users to create, modify, and\ncomprehend text and image data. Recently proposed Large Reconstruction Models\n(LRMs) further extend this by providing the ability to generate high-quality 3D\nmodels with the help of a single object image. These models, however, lack the\nability to manipulate or edit the finer details, such as adding standard design\npatterns or changing the color and reflectance of the generated objects, thus\nlacking fine-grained control that may be very helpful in domains such as\naugmented reality, animation and gaming. Naively training LRMs for this purpose\nwould require generating precisely edited images and 3D object pairs, which is\ncomputationally expensive. In this paper, we propose Instructive3D, a novel LRM\nbased model that integrates generation and fine-grained editing, through user\ntext prompts, of 3D objects into a single model. We accomplish this by adding\nan adapter that performs a diffusion process conditioned on a text prompt\nspecifying edits in the triplane latent space representation of 3D object\nmodels. 
Our method does not require the generation of edited 3D objects.\nAdditionally, Instructive3D allows us to perform geometrically consistent\nmodifications, as the edits done through user-defined text prompts are applied\nto the triplane latent representation thus enhancing the versatility and\nprecision of 3D objects generated. We compare the objects generated by\nInstructive3D and a baseline that first generates the 3D object meshes using a\nstandard LRM model and then edits these 3D objects using text prompts when\nimages are provided from the Objaverse LVIS dataset. We find that Instructive3D\nproduces qualitatively superior 3D objects with the properties specified by the\nedit prompts.\n","authors":["Kunal Kathare","Ankit Dhiman","K Vikas Gowda","Siddharth Aravindan","Shubham Monga","Basavaraja Shanthappa Vandrotti","Lokesh R Boregowda"],"pdf_url":"https://arxiv.org/pdf/2501.04374v1.pdf","comment":"Accepted at WACV 2025. First two authors contributed equally"},{"id":"http://arxiv.org/abs/2501.04373v1","updated":"2025-01-08T09:26:36Z","published":"2025-01-08T09:26:36Z","title":"FGU3R: Fine-Grained Fusion via Unified 3D Representation for Multimodal\n 3D Object Detection","summary":" Multimodal 3D object detection has garnered considerable interest in\nautonomous driving. However, multimodal detectors suffer from dimension\nmismatches that derive from fusing 3D points with 2D pixels coarsely, which\nleads to sub-optimal fusion performance. In this paper, we propose a multimodal\nframework FGU3R to tackle the issue mentioned above via unified 3D\nrepresentation and fine-grained fusion, which consists of two important\ncomponents. First, we propose an efficient feature extractor for raw and pseudo\npoints, termed Pseudo-Raw Convolution (PRConv), which modulates multimodal\nfeatures synchronously and aggregates the features from different types of\npoints on key points based on multimodal interaction. 
Second, a Cross-Attention\nAdaptive Fusion (CAAF) is designed to fuse homogeneous 3D RoI (Region of\nInterest) features adaptively via a cross-attention variant in a fine-grained\nmanner. Together they make fine-grained fusion on unified 3D representation.\nThe experiments conducted on the KITTI and nuScenes show the effectiveness of\nour proposed method.\n","authors":["Guoxin Zhang","Ziying Song","Lin Liu","Zhonghong Ou"],"pdf_url":"https://arxiv.org/pdf/2501.04373v1.pdf","comment":"Accepted by ICASSP 2025"},{"id":"http://arxiv.org/abs/2412.04247v2","updated":"2025-01-08T09:18:03Z","published":"2024-12-05T15:27:58Z","title":"3D Part Segmentation via Geometric Aggregation of 2D Visual Features","summary":" Supervised 3D part segmentation models are tailored for a fixed set of\nobjects and parts, limiting their transferability to open-set, real-world\nscenarios. Recent works have explored vision-language models (VLMs) as a\npromising alternative, using multi-view rendering and textual prompting to\nidentify object parts. However, naively applying VLMs in this context\nintroduces several drawbacks, such as the need for meticulous prompt\nengineering, and fails to leverage the 3D geometric structure of objects. To\naddress these limitations, we propose COPS, a COmprehensive model for Parts\nSegmentation that blends the semantics extracted from visual concepts and 3D\ngeometry to effectively identify object parts. COPS renders a point cloud from\nmultiple viewpoints, extracts 2D features, projects them back to 3D, and uses a\nnovel geometric-aware feature aggregation procedure to ensure spatial and\nsemantic consistency. Finally, it clusters points into parts and labels them.\nWe demonstrate that COPS is efficient, scalable, and achieves zero-shot\nstate-of-the-art performance across five datasets, covering synthetic and\nreal-world data, texture-less and coloured objects, as well as rigid and\nnon-rigid shapes. 
The code is available at https://3d-cops.github.io.\n","authors":["Marco Garosi","Riccardo Tedoldi","Davide Boscaini","Massimiliano Mancini","Nicu Sebe","Fabio Poiesi"],"pdf_url":"https://arxiv.org/pdf/2412.04247v2.pdf","comment":"Published in WACV 2025. Project page: https://3d-cops.github.io/"},{"id":"http://arxiv.org/abs/2501.04361v1","updated":"2025-01-08T08:58:53Z","published":"2025-01-08T08:58:53Z","title":"A Unified Framework for Foreground and Anonymization Area Segmentation\n in CT and MRI Data","summary":" This study presents an open-source toolkit to address critical challenges in\npreprocessing data for self-supervised learning (SSL) for 3D medical imaging,\nfocusing on data privacy and computational efficiency. The toolkit comprises\ntwo main components: a segmentation network that delineates foreground regions\nto optimize data sampling and thus reduce training time, and a segmentation\nnetwork that identifies anonymized regions, preventing erroneous supervision in\nreconstruction-based SSL methods. Experimental results demonstrate high\nrobustness, with mean Dice scores exceeding 98.5 across all anonymization\nmethods and surpassing 99.5 for foreground segmentation tasks, highlighting the\nefficacy of the toolkit in supporting SSL applications in 3D medical imaging\nfor both CT and MRI images. 
The weights and code are available at\nhttps://github.com/MIC-DKFZ/Foreground-and-Anonymization-Area-Segmentation.\n","authors":["Michal Nohel","Constantin Ulrich","Jonathan Suprijadi","Tassilo Wald","Klaus Maier-Hein"],"pdf_url":"https://arxiv.org/pdf/2501.04361v1.pdf","comment":"6 pages"},{"id":"http://arxiv.org/abs/2501.04353v1","updated":"2025-01-08T08:51:35Z","published":"2025-01-08T08:51:35Z","title":"DeFusion: An Effective Decoupling Fusion Network for Multi-Modal\n Pregnancy Prediction","summary":" Temporal embryo images and parental fertility table indicators are both\nvaluable for pregnancy prediction in \textbf{in vitro fertilization embryo\ntransfer} (IVF-ET). However, current machine learning models cannot make full\nuse of the complementary information between the two modalities to improve\npregnancy prediction performance. In this paper, we propose a Decoupling Fusion\nNetwork called DeFusion to effectively integrate the multi-modal information\nfor IVF-ET pregnancy prediction. Specifically, we propose a decoupling fusion\nmodule that decouples the information from the different modalities into\nrelated and unrelated information, thereby achieving a more delicate fusion.\nAnd we fuse temporal embryo images with a spatial-temporal position encoding,\nand extract fertility table indicator information with a table transformer. To\nevaluate the effectiveness of our model, we use a new dataset including 4046\ncases collected from Southern Medical University. The experiments show that our\nmodel outperforms state-of-the-art methods. Meanwhile, the performance on the\neye disease prediction dataset reflects the model's good generalization. 
Our\ncode and dataset are available at https://github.com/Ou-Young-1999/DFNet.\n","authors":["Xueqiang Ouyang","Jia Wei","Wenjie Huo","Xiaocong Wang","Rui Li","Jianlong Zhou"],"pdf_url":"https://arxiv.org/pdf/2501.04353v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04352v1","updated":"2025-01-08T08:49:52Z","published":"2025-01-08T08:49:52Z","title":"Online Gaussian Test-Time Adaptation of Vision-Language Models","summary":" Online test-time adaptation (OTTA) of vision-language models (VLMs) has\nrecently garnered increased attention to take advantage of data observed along\na stream to improve future predictions. Unfortunately, existing methods rely on\ndataset-specific hyperparameters, significantly limiting their adaptability to\nunseen tasks. In response, we propose Online Gaussian Adaptation (OGA), a novel\nmethod that models the likelihoods of visual features using Gaussian\ndistributions and incorporates zero-shot priors into an interpretable Maximum A\nPosteriori (MAP) estimation framework with fixed hyper-parameters across all\ndatasets. We demonstrate that OGA outperforms state-of-the-art methods on most\ndatasets and runs. Additionally, we show that combining OTTA with popular\nfew-shot techniques (a practical yet overlooked setting in prior research) is\nhighly beneficial. Furthermore, our experimental study reveals that common OTTA\nevaluation protocols, which average performance over at most three runs per\ndataset, are inadequate due to the substantial variability observed across runs\nfor all OTTA methods. Therefore, we advocate for more rigorous evaluation\npractices, including increasing the number of runs and considering additional\nquantitative metrics, such as our proposed Expected Tail Accuracy (ETA),\ncalculated as the average accuracy in the worst 10% of runs. We hope these\ncontributions will encourage more rigorous and diverse evaluation practices in\nthe OTTA community. 
Code is available at https://github.com/cfuchs2023/OGA .\n","authors":["Clément Fuchs","Maxime Zanella","Christophe De Vleeschouwer"],"pdf_url":"https://arxiv.org/pdf/2501.04352v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17042v2","updated":"2025-01-08T08:22:49Z","published":"2024-12-22T14:49:55Z","title":"Adapting Image-to-Video Diffusion Models for Large-Motion Frame\n Interpolation","summary":" As video generation models have advanced significantly in recent\nyears, we adopt large-scale image-to-video diffusion models for video\nframe interpolation. We present a conditional encoder designed to adapt an\nimage-to-video model for large-motion frame interpolation. To enhance\nperformance, we integrate a dual-branch feature extractor and propose a\ncross-frame attention mechanism that effectively captures both spatial and\ntemporal information, enabling accurate interpolations of intermediate frames.\nOur approach demonstrates superior performance on the Fr\'echet Video Distance\n(FVD) metric when evaluated against other state-of-the-art approaches,\nparticularly in handling large motion scenarios, highlighting advancements in\ngenerative-based methodologies.\n","authors":["Luoxu Jin","Hiroshi Watanabe"],"pdf_url":"https://arxiv.org/pdf/2412.17042v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04336v1","updated":"2025-01-08T08:15:29Z","published":"2025-01-08T08:15:29Z","title":"Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs\n for Effective Long Video Analysis with LLMs","summary":" Long-form video understanding with Large Vision Language Models is challenged\nby the need to analyze temporally dispersed yet spatially concentrated key\nmoments within limited context windows. 
In this work, we introduce\nVideoMindPalace, a new framework inspired by the \"Mind Palace\", which organizes\ncritical video moments into a topologically structured semantic graph.\nVideoMindPalace organizes key information through (i) hand-object tracking and\ninteraction, (ii) clustered activity zones representing specific areas of\nrecurring activities, and (iii) environment layout mapping, allowing natural\nlanguage parsing by LLMs to provide grounded insights on spatio-temporal and 3D\ncontext. In addition, we propose the Video MindPalace Benchmark (VMB), to\nassess human-like reasoning, including spatial localization, temporal\nreasoning, and layout-aware sequential understanding. Evaluated on VMB and\nestablished video QA datasets, including EgoSchema, NExT-QA, IntentQA, and the\nActive Memories Benchmark, VideoMindPalace demonstrates notable gains in\nspatio-temporal coherence and human-aligned reasoning, advancing long-form\nvideo analysis capabilities in VLMs.\n","authors":["Zeyi Huang","Yuyang Ji","Xiaofang Wang","Nikhil Mehta","Tong Xiao","Donghyun Lee","Sigmund Vanvalkenburgh","Shengxin Zha","Bolin Lai","Licheng Yu","Ning Zhang","Yong Jae Lee","Miao Liu"],"pdf_url":"https://arxiv.org/pdf/2501.04336v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04329v1","updated":"2025-01-08T08:03:49Z","published":"2025-01-08T08:03:49Z","title":"An Efficient Adaptive Compression Method for Human Perception and\n Machine Vision Tasks","summary":" While most existing neural image compression (NIC) and neural video\ncompression (NVC) methodologies have achieved remarkable success, their\noptimization is primarily focused on human visual perception. However, with the\nrapid development of artificial intelligence, many images and videos will be\nused for various machine vision tasks. Consequently, such existing compression\nmethodologies cannot achieve competitive performance in machine vision. 
In this\nwork, we introduce an efficient adaptive compression (EAC) method tailored for\nboth human perception and multiple machine vision tasks. Our method involves\ntwo key modules: 1), an adaptive compression mechanism, that adaptively selects\nseveral subsets from latent features to balance the optimizations for multiple\nmachine vision tasks (e.g., segmentation, and detection) and human vision. 2),\na task-specific adapter, that uses the parameter-efficient delta-tuning\nstrategy to stimulate the comprehensive downstream analytical networks for\nspecific machine vision tasks. By using the above two modules, we can optimize\nthe bit-rate costs and improve machine vision performance. In general, our\nproposed EAC can seamlessly integrate with existing NIC (i.e., Ball\\'e2018, and\nCheng2020) and NVC (i.e., DVC, and FVC) methods. Extensive evaluation on\nvarious benchmark datasets (i.e., VOC2007, ILSVRC2012, VOC2012, COCO, UCF101,\nand DAVIS) shows that our method enhances performance for multiple machine\nvision tasks while maintaining the quality of human vision.\n","authors":["Lei Liu","Zhenghao Chen","Zhihao Hu","Dong Xu"],"pdf_url":"https://arxiv.org/pdf/2501.04329v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04325v1","updated":"2025-01-08T07:52:12Z","published":"2025-01-08T07:52:12Z","title":"Edit as You See: Image-guided Video Editing via Masked Motion Modeling","summary":" Recent advancements in diffusion models have significantly facilitated\ntext-guided video editing. However, there is a relative scarcity of research on\nimage-guided video editing, a method that empowers users to edit videos by\nmerely indicating a target object in the initial frame and providing an RGB\nimage as reference, without relying on the text prompts. In this paper, we\npropose a novel Image-guided Video Editing Diffusion model, termed IVEDiff for\nthe image-guided video editing. 
IVEDiff is built on top of image editing\nmodels, and is equipped with learnable motion modules to maintain the temporal\nconsistency of edited video. Inspired by self-supervised learning concepts, we\nintroduce a masked motion modeling fine-tuning strategy that empowers the\nmotion module's capabilities for capturing inter-frame motion dynamics, while\npreserving the capabilities for intra-frame semantic correlations modeling of\nthe base image editing model. Moreover, an optical-flow-guided motion reference\nnetwork is proposed to ensure the accurate propagation of information between\nedited video frames, alleviating the misleading effects of invalid information.\nWe also construct a benchmark to facilitate further research. The comprehensive\nexperiments demonstrate that our method is able to generate temporally smooth\nedited videos while robustly dealing with various editing objects with high\nquality.\n","authors":["Zhi-Lin Huang","Yixuan Liu","Chujun Qin","Zhongdao Wang","Dong Zhou","Dong Li","Emad Barsoum"],"pdf_url":"https://arxiv.org/pdf/2501.04325v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18930v2","updated":"2025-01-08T07:43:09Z","published":"2024-12-25T15:20:54Z","title":"Graph Cut-guided Maximal Coding Rate Reduction for Learning Image\n Embedding and Clustering","summary":" In the era of pre-trained models, image clustering task is usually addressed\nby two relevant stages: a) to produce features from pre-trained vision models;\nand b) to find clusters from the pre-trained features. However, these two\nstages are often considered separately or learned by different paradigms,\nleading to suboptimal clustering performance. 
In this paper, we propose a\nunified framework, termed graph Cut-guided Maximal Coding Rate Reduction\n(CgMCR$^2$), for jointly learning the structured embeddings and the clustering.\nTo be specific, we attempt to integrate an efficient clustering module into the\nprincipled framework for learning structured representation, in which the\nclustering module is used to provide partition information to guide the\ncluster-wise compression and the learned embeddings is aligned to desired\ngeometric structures in turn to help for yielding more accurate partitions. We\nconduct extensive experiments on both standard and out-of-domain image datasets\nand experimental results validate the effectiveness of our approach.\n","authors":["W. He","Z. Huang","X. Meng","X. Qi","R. Xiao","C. -G. Li"],"pdf_url":"https://arxiv.org/pdf/2412.18930v2.pdf","comment":"24 pages, 9 figures, accepted in ACCV2024"},{"id":"http://arxiv.org/abs/2501.04322v1","updated":"2025-01-08T07:42:54Z","published":"2025-01-08T07:42:54Z","title":"Eve: Efficient Multimodal Vision Language Models with Elastic Visual\n Experts","summary":" Multimodal vision language models (VLMs) have made significant progress with\nthe support of continuously increasing model sizes and data volumes. Running\nVLMs on edge devices has become a challenge for their widespread application.\nThere are several efficient VLM efforts, but they often sacrifice linguistic\ncapabilities to enhance multimodal abilities, or require extensive training. To\naddress this quandary,we introduce the innovative framework of Efficient Vision\nLanguage Models with Elastic Visual Experts (Eve). By strategically\nincorporating adaptable visual expertise at multiple stages of training, Eve\nstrikes a balance between preserving linguistic abilities and augmenting\nmultimodal capabilities. This balanced approach results in a versatile model\nwith only 1.8B parameters that delivers significant improvements in both\nmultimodal and linguistic tasks. 
Notably, in configurations below 3B\nparameters, Eve distinctly outperforms in language benchmarks and achieves\nstate-of-the-art results 68.87% in VLM Benchmarks. Additionally, its multimodal\naccuracy outstrips that of the larger 7B LLaVA-1.5 model.\n","authors":["Miao Rang","Zhenni Bi","Chuanjian Liu","Yehui Tang","Kai Han","Yunhe Wang"],"pdf_url":"https://arxiv.org/pdf/2501.04322v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.11102v2","updated":"2025-01-08T07:22:51Z","published":"2024-12-15T07:49:31Z","title":"Empowering LLMs to Understand and Generate Complex Vector Graphics","summary":" The unprecedented advancements in Large Language Models (LLMs) have\nprofoundly impacted natural language processing but have yet to fully embrace\nthe realm of scalable vector graphics (SVG) generation. While LLMs encode\npartial knowledge of SVG data from web pages during training, recent findings\nsuggest that semantically ambiguous and tokenized representations within LLMs\nmay result in hallucinations in vector primitive predictions. Additionally, LLM\ntraining typically lacks modeling and understanding of the rendering sequence\nof vector paths, which can lead to occlusion between output vector primitives.\nIn this paper, we present LLM4SVG, an initial yet substantial step toward\nbridging this gap by enabling LLMs to better understand and generate vector\ngraphics. LLM4SVG facilitates a deeper understanding of SVG components through\nlearnable semantic tokens, which precisely encode these tokens and their\ncorresponding properties to generate semantically aligned SVG outputs. Using a\nseries of learnable semantic tokens, a structured dataset for instruction\nfollowing is developed to support comprehension and generation across two\nprimary tasks. 
Our method introduces a modular architecture to existing large\nlanguage models, integrating semantic tags, vector instruction encoders,\nfine-tuned commands, and powerful LLMs to tightly combine geometric,\nappearance, and language information. To overcome the scarcity of SVG-text\ninstruction data, we developed an automated data generation pipeline that\ncollected a massive dataset of more than 250k SVG data and 580k SVG-text\ninstructions, which facilitated the adoption of the two-stage training strategy\npopular in LLM development. By exploring various training strategies, we\ndeveloped LLM4SVG, which significantly moves beyond optimized rendering-based\napproaches and language-model-based baselines to achieve remarkable results in\nhuman evaluation tasks.\n","authors":["Ximing Xing","Juncheng Hu","Guotao Liang","Jing Zhang","Dong Xu","Qian Yu"],"pdf_url":"https://arxiv.org/pdf/2412.11102v2.pdf","comment":"Project Page: https://ximinng.github.io/LLM4SVGProject/"},{"id":"http://arxiv.org/abs/2501.03775v2","updated":"2025-01-08T07:05:16Z","published":"2025-01-07T13:30:54Z","title":"Strip R-CNN: Large Strip Convolution for Remote Sensing Object Detection","summary":" While witnessed with rapid development, remote sensing object detection\nremains challenging for detecting high aspect ratio objects. This paper shows\nthat large strip convolutions are good feature representation learners for\nremote sensing object detection and can detect objects of various aspect ratios\nwell. Based on large strip convolutions, we build a new network architecture\ncalled Strip R-CNN, which is simple, efficient, and powerful. Unlike recent\nremote sensing object detectors that leverage large-kernel convolutions with\nsquare shapes, our Strip R-CNN takes advantage of sequential orthogonal large\nstrip convolutions to capture spatial information. 
In addition, we enhance the\nlocalization capability of remote-sensing object detectors by decoupling the\ndetection heads and equipping the localization head with strip convolutions to\nbetter localize the target objects. Extensive experiments on several\nbenchmarks, e.g., DOTA, FAIR1M, HRSC2016, and DIOR, show that our Strip R-CNN\ncan largely improve previous works. Notably, our 30M model achieves 82.75% mAP\non DOTA-v1.0, setting a new state-of-the-art record. Code is available at\nhttps://github.com/YXB-NKU/Strip-R-CNN.\n","authors":["Xinbin Yuan","ZhaoHui Zheng","Yuxuan Li","Xialei Liu","Li Liu","Xiang Li","Qibin Hou","Ming-Ming Cheng"],"pdf_url":"https://arxiv.org/pdf/2501.03775v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.01714v4","updated":"2025-01-08T06:52:07Z","published":"2024-04-02T07:57:17Z","title":"Conjugate-Gradient-like Based Adaptive Moment Estimation Optimization\n Algorithm for Deep Learning","summary":" Training deep neural networks is a challenging task. In order to speed up\ntraining and enhance the performance of deep neural networks, we rectify the\nvanilla conjugate gradient as conjugate-gradient-like and incorporate it into\nthe generic Adam, and thus propose a new optimization algorithm named\nCG-like-Adam for deep learning. Specifically, both the first-order and the\nsecond-order moment estimation of generic Adam are replaced by the\nconjugate-gradient-like. Convergence analysis handles the cases where the\nexponential moving average coefficient of the first-order moment estimation is\nconstant and the first-order moment estimation is unbiased. 
Numerical\nexperiments show the superiority of the proposed algorithm based on the\nCIFAR10/100 dataset.\n","authors":["Jiawu Tian","Liwei Xu","Xiaowei Zhang","Yongqi Li"],"pdf_url":"https://arxiv.org/pdf/2404.01714v4.pdf","comment":"32 pages, 13 figures"},{"id":"http://arxiv.org/abs/2412.19112v2","updated":"2025-01-08T06:45:02Z","published":"2024-12-26T08:11:41Z","title":"Future Success Prediction in Open-Vocabulary Object Manipulation Tasks\n Based on End-Effector Trajectories","summary":" This study addresses a task designed to predict the future success or failure\nof open-vocabulary object manipulation. In this task, the model is required to\nmake predictions based on natural language instructions, egocentric view images\nbefore manipulation, and the given end-effector trajectories. Conventional\nmethods typically perform success prediction only after the manipulation is\nexecuted, limiting their efficiency in executing the entire task sequence. We\npropose a novel approach that enables the prediction of success or failure by\naligning the given trajectories and images with natural language instructions.\nWe introduce Trajectory Encoder to apply learnable weighting to the input\ntrajectories, allowing the model to consider temporal dynamics and interactions\nbetween objects and the end effector, improving the model's ability to predict\nmanipulation outcomes accurately. We constructed a dataset based on the RT-1\ndataset, a large-scale benchmark for open-vocabulary object manipulation tasks,\nto evaluate our method. 
The experimental results show that our method achieved\na higher prediction accuracy than baseline approaches.\n","authors":["Motonari Kambara","Komei Sugiura"],"pdf_url":"https://arxiv.org/pdf/2412.19112v2.pdf","comment":"Accepted for presentation at LangRob @ CoRL 2024"},{"id":"http://arxiv.org/abs/2309.05271v2","updated":"2025-01-08T06:30:39Z","published":"2023-09-11T07:05:02Z","title":"AutoFuse: Automatic Fusion Networks for Deformable Medical Image\n Registration","summary":" Deformable image registration aims to find a dense non-linear spatial\ncorrespondence between a pair of images, which is a crucial step for many\nmedical tasks such as tumor growth monitoring and population analysis.\nRecently, Deep Neural Networks (DNNs) have been widely recognized for their\nability to perform fast end-to-end registration. However, DNN-based\nregistration needs to explore the spatial information of each image and fuse\nthis information to characterize spatial correspondence. This raises an\nessential question: what is the optimal fusion strategy to characterize spatial\ncorrespondence? Existing fusion strategies (e.g., early fusion, late fusion)\nwere empirically designed to fuse information by manually defined prior\nknowledge, which inevitably constrains the registration performance within the\nlimits of empirical designs. In this study, we depart from existing\nempirically-designed fusion strategies and develop a data-driven fusion\nstrategy for deformable image registration. To achieve this, we propose an\nAutomatic Fusion network (AutoFuse) that provides flexibility to fuse\ninformation at many potential locations within the network. A Fusion Gate (FG)\nmodule is also proposed to control how to fuse information at each potential\nnetwork location based on training data. 
Our AutoFuse can automatically\noptimize its fusion strategy during training and can be generalizable to both\nunsupervised registration (without any labels) and semi-supervised registration\n(with weak labels provided for partial training data). Extensive experiments on\ntwo well-benchmarked medical registration tasks (inter- and intra-patient\nregistration) with eight public datasets show that our AutoFuse outperforms\nstate-of-the-art unsupervised and semi-supervised registration methods.\n","authors":["Mingyuan Meng","Michael Fulham","Dagan Feng","Lei Bi","Jinman Kim"],"pdf_url":"https://arxiv.org/pdf/2309.05271v2.pdf","comment":"Published at Pattern Recognition"},{"id":"http://arxiv.org/abs/2501.04304v1","updated":"2025-01-08T06:30:31Z","published":"2025-01-08T06:30:31Z","title":"DGQ: Distribution-Aware Group Quantization for Text-to-Image Diffusion\n Models","summary":" Despite the widespread use of text-to-image diffusion models across various\ntasks, their computational and memory demands limit practical applications. To\nmitigate this issue, quantization of diffusion models has been explored. It\nreduces memory usage and computational costs by compressing weights and\nactivations into lower-bit formats. However, existing methods often struggle to\npreserve both image quality and text-image alignment, particularly in\nlower-bit($<$ 8bits) quantization. In this paper, we analyze the challenges\nassociated with quantizing text-to-image diffusion models from a distributional\nperspective. Our analysis reveals that activation outliers play a crucial role\nin determining image quality. Additionally, we identify distinctive patterns in\ncross-attention scores, which significantly affects text-image alignment. To\naddress these challenges, we propose Distribution-aware Group Quantization\n(DGQ), a method that identifies and adaptively handles pixel-wise and\nchannel-wise outliers to preserve image quality. 
Furthermore, DGQ applies\nprompt-specific logarithmic quantization scales to maintain text-image\nalignment. Our method demonstrates remarkable performance on datasets such as\nMS-COCO and PartiPrompts. We are the first to successfully achieve low-bit\nquantization of text-to-image diffusion models without requiring additional\nfine-tuning of weight quantization parameters.\n","authors":["Hyogon Ryu","NaHyeon Park","Hyunjung Shim"],"pdf_url":"https://arxiv.org/pdf/2501.04304v1.pdf","comment":"Project page: https://ugonfor.kr/DGQ"},{"id":"http://arxiv.org/abs/2501.04302v1","updated":"2025-01-08T06:26:16Z","published":"2025-01-08T06:26:16Z","title":"H-MBA: Hierarchical MamBa Adaptation for Multi-Modal Video Understanding\n in Autonomous Driving","summary":" With the prevalence of Multimodal Large Language Models(MLLMs), autonomous\ndriving has encountered new opportunities and challenges. In particular,\nmulti-modal video understanding is critical to interactively analyze what will\nhappen in the procedure of autonomous driving. However, videos in such a\ndynamical scene that often contains complex spatial-temporal movements, which\nrestricts the generalization capacity of the existing MLLMs in this field. To\nbridge the gap, we propose a novel Hierarchical Mamba Adaptation (H-MBA)\nframework to fit the complicated motion changes in autonomous driving videos.\nSpecifically, our H-MBA consists of two distinct modules, including Context\nMamba (C-Mamba) and Query Mamba (Q-Mamba). First, C-Mamba contains various\ntypes of structure state space models, which can effectively capture\nmulti-granularity video context for different temporal resolutions. Second,\nQ-Mamba flexibly transforms the current frame as the learnable query, and\nattentively selects multi-granularity video context into query. Consequently,\nit can adaptively integrate all the video contexts of multi-scale temporal\nresolutions to enhance video understanding. 
Via a plug-and-play paradigm in\nMLLMs, our H-MBA shows the remarkable performance on multi-modal video tasks in\nautonomous driving, e.g., for risk object detection, it outperforms the\nprevious SOTA method with 5.5% mIoU improvement.\n","authors":["Siran Chen","Yuxiao Luo","Yue Ma","Yu Qiao","Yali Wang"],"pdf_url":"https://arxiv.org/pdf/2501.04302v1.pdf","comment":"7 pages, 4 figures"},{"id":"http://arxiv.org/abs/2312.02541v2","updated":"2025-01-08T06:08:30Z","published":"2023-12-05T07:12:05Z","title":"Explainable Severity ranking via pairwise n-hidden comparison: a case\n study of glaucoma","summary":" Primary open-angle glaucoma (POAG) is a chronic and progressive optic nerve\ncondition that results in an acquired loss of optic nerve fibers and potential\nblindness. The gradual onset of glaucoma results in patients progressively\nlosing their vision without being consciously aware of the changes. To diagnose\nPOAG and determine its severity, patients must undergo a comprehensive dilated\neye examination. In this work, we build a framework to rank, compare, and\ninterpret the severity of glaucoma using fundus images. We introduce a\nsiamese-based severity ranking using pairwise n-hidden comparisons. We\nadditionally have a novel approach to explaining why a specific image is deemed\nmore severe than others. Our findings indicate that the proposed severity\nranking model surpasses traditional ones in terms of diagnostic accuracy and\ndelivers improved saliency explanations.\n","authors":["Hong Nguyen","Cuong V. Nguyen","Shrikanth Narayanan","Benjamin Y. Xu","Michael Pazzani"],"pdf_url":"https://arxiv.org/pdf/2312.02541v2.pdf","comment":"4 pages"},{"id":"http://arxiv.org/abs/2501.04293v1","updated":"2025-01-08T05:35:07Z","published":"2025-01-08T05:35:07Z","title":"TADFormer : Task-Adaptive Dynamic Transformer for Efficient Multi-Task\n Learning","summary":" Transfer learning paradigm has driven substantial advancements in various\nvision tasks. 
However, as state-of-the-art models continue to grow, classical\nfull fine-tuning often becomes computationally impractical, particularly in\nmulti-task learning (MTL) setups, where training complexity increases in\nproportion to the number of tasks. Consequently, recent studies have explored\nParameter-Efficient Fine-Tuning (PEFT) for MTL architectures. Despite some\nprogress, these approaches still exhibit limitations in capturing fine-grained,\ntask-specific features that are crucial to MTL. In this paper, we introduce\nTask-Adaptive Dynamic transFormer, termed TADFormer, a novel PEFT framework\nthat performs task-aware feature adaptation in a fine-grained manner by\ndynamically considering task-specific input contexts. TADFormer introduces\nparameter-efficient prompting for task adaptation and a Dynamic Task Filter\n(DTF) to capture task information conditioned on input contexts. Experiments on\nthe PASCAL-Context benchmark demonstrate that the proposed method achieves\nhigher accuracy in dense scene understanding tasks, while reducing the number\nof trainable parameters by up to 8.4 times compared to full fine-tuning of\nMTL models. 
TADFormer also demonstrates superior parameter efficiency and\naccuracy compared to recent PEFT methods.\n","authors":["Seungmin Baek","Soyul Lee","Hayeon Jo","Hyesong Choi","Dongbo Min"],"pdf_url":"https://arxiv.org/pdf/2501.04293v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.17051v2","updated":"2025-01-08T05:26:58Z","published":"2023-12-28T14:52:07Z","title":"FILP-3D: Enhancing 3D Few-shot Class-incremental Learning with\n Pre-trained Vision-Language Models","summary":" Few-shot class-incremental learning (FSCIL) aims to mitigate the catastrophic\nforgetting issue when a model is incrementally trained on limited data.\nHowever, many of these works lack effective exploration of prior knowledge,\nrendering them unable to effectively address the domain gap issue in the\ncontext of 3D FSCIL, thereby leading to catastrophic forgetting. The\nContrastive Vision-Language Pre-Training (CLIP) model serves as a highly\nsuitable backbone for addressing the challenges of 3D FSCIL due to its abundant\nshape-related prior knowledge. Unfortunately, its direct application to 3D\nFSCIL still faces the incompatibility between 3D data representation and the 2D\nfeatures, primarily manifested as feature space misalignment and significant\nnoise. To address the above challenges, we introduce the FILP-3D framework with\ntwo novel components: the Redundant Feature Eliminator (RFE) for feature space\nmisalignment and the Spatial Noise Compensator (SNC) for significant noise. RFE\naligns the feature spaces of input point clouds and their embeddings by\nperforming a unique dimensionality reduction on the feature space of\npre-trained models (PTMs), effectively eliminating redundant information\nwithout compromising semantic integrity. On the other hand, SNC is a\ngraph-based 3D model designed to capture robust geometric information within\npoint clouds, thereby augmenting the knowledge lost due to projection,\nparticularly when processing real-world scanned data. 
Moreover, traditional\naccuracy metrics are shown to be biased due to the imbalance in existing 3D\ndatasets. Therefore, we propose the 3D FSCIL benchmark FSCIL3D-XL and novel\nevaluation metrics that offer a more nuanced assessment of a 3D FSCIL model.\nExperimental results on both established and our proposed benchmarks\ndemonstrate that our approach significantly outperforms existing\nstate-of-the-art methods.\n","authors":["Wan Xu","Tianyu Huang","Tianyu Qu","Guanglei Yang","Yiwen Guo","Wangmeng Zuo"],"pdf_url":"https://arxiv.org/pdf/2312.17051v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04284v1","updated":"2025-01-08T05:15:43Z","published":"2025-01-08T05:15:43Z","title":"ContextMRI: Enhancing Compressed Sensing MRI through Metadata\n Conditioning","summary":" Compressed sensing MRI seeks to accelerate MRI acquisition processes by\nsampling fewer k-space measurements and then reconstructing the missing data\nalgorithmically. The success of these approaches often relies on strong priors\nor learned statistical models. While recent diffusion model-based priors have\nshown great potential, previous methods typically ignore clinically available\nmetadata (e.g., patient demographics, imaging parameters, slice-specific\ninformation). In practice, metadata contains meaningful cues about the anatomy\nand acquisition protocol, suggesting it could further constrain the\nreconstruction problem. In this work, we propose ContextMRI, a text-conditioned\ndiffusion model for MRI that integrates granular metadata into the\nreconstruction process. We train a pixel-space diffusion model directly on\nminimally processed, complex-valued MRI images. During inference, metadata is\nconverted into a structured text prompt and fed to the model via CLIP text\nembeddings. By conditioning the prior on metadata, we unlock more accurate\nreconstructions and show consistent gains across multiple datasets,\nacceleration factors, and undersampling patterns. 
Our experiments demonstrate\nthat increasing the fidelity of metadata, ranging from slice location and\ncontrast to patient age, sex, and pathology, systematically boosts\nreconstruction performance. This work highlights the untapped potential of\nleveraging clinical context for inverse problems and opens a new direction for\nmetadata-driven MRI reconstruction.\n","authors":["Hyungjin Chung","Dohun Lee","Zihui Wu","Byung-Hoon Kim","Katherine L. Bouman","Jong Chul Ye"],"pdf_url":"https://arxiv.org/pdf/2501.04284v1.pdf","comment":"29 pages, 9 figures"},{"id":"http://arxiv.org/abs/2501.04283v1","updated":"2025-01-08T05:14:36Z","published":"2025-01-08T05:14:36Z","title":"Enhancing Scene Classification in Cloudy Image Scenarios: A\n Collaborative Transfer Method with Information Regulation Mechanism using\n Optical Cloud-Covered and SAR Remote Sensing Images","summary":" In remote sensing scene classification, leveraging transfer methods with\nwell-trained optical models is an efficient way to overcome label scarcity.\nHowever, cloud contamination leads to optical information loss and significant\nimpacts on feature distribution, challenging the reliability and stability of\ntransferred target models. Common solutions include cloud removal for optical\ndata or directly using synthetic aperture radar (SAR) data in the target\ndomain. However, cloud removal requires substantial auxiliary data for support\nand pre-training, while directly using SAR disregards the unobstructed portions\nof optical data. This study presents a scene classification transfer method\nthat synergistically combines multi-modality data, which aims to transfer the\nsource domain model trained on cloud-free optical data to the target domain that\nincludes both cloudy optical and SAR data at low cost. 
Specifically, the\nframework incorporates two parts: (1) the collaborative transfer strategy,\nbased on knowledge distillation, enables efficient prior knowledge transfer\nacross heterogeneous data; (2) the information regulation mechanism (IRM) is\nproposed to address the modality imbalance issue during transfer. It employs\nauxiliary models to measure the contribution discrepancy of each modality, and\nautomatically balances the information utilization of modalities during the\ntarget model learning process at the sample level. The transfer experiments\nwere conducted on simulated and real cloud datasets, demonstrating the superior\nperformance of the proposed method compared to other solutions in cloud-covered\nscenarios. We also verified the importance and limitations of IRM, and further\ndiscussed and visualized the modality imbalance problem during the model\ntransfer. Codes are available at https://github.com/wangyuze-csu/ESCCS\n","authors":["Yuze Wang","Rong Xiao","Haifeng Li","Mariana Belgiu","Chao Tao"],"pdf_url":"https://arxiv.org/pdf/2501.04283v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.16998v2","updated":"2025-01-08T04:58:02Z","published":"2023-12-28T13:02:16Z","title":"Deep Unfolding Network with Spatial Alignment for multi-modal MRI\n reconstruction","summary":" Multi-modal Magnetic Resonance Imaging (MRI) offers complementary diagnostic\ninformation, but some modalities are limited by the long scanning time. To\naccelerate the whole acquisition process, MRI reconstruction of one modality\nfrom highly undersampled k-space data with another fully-sampled reference\nmodality is an efficient solution. However, the misalignment between\nmodalities, which is common in clinical practice, can negatively affect\nreconstruction quality. 
Existing deep learning-based methods that account for\ninter-modality misalignment perform better, but still share two common\nlimitations: (1) the spatial alignment task is not adaptively integrated with\nthe reconstruction process, resulting in insufficient complementarity between\nthe two tasks; (2) the entire framework has weak interpretability. In this\npaper, we construct a novel Deep Unfolding Network with Spatial Alignment,\ntermed DUN-SA, to appropriately embed the spatial alignment task into the\nreconstruction process. Concretely, we derive a novel joint\nalignment-reconstruction model with a specially designed cross-modal spatial\nalignment term. By relaxing the model into cross-modal spatial alignment and\nmulti-modal reconstruction tasks, we propose an effective algorithm to solve\nthis model alternately. Then, we unfold the iterative steps of the proposed\nalgorithm and design corresponding network modules to build DUN-SA with\ninterpretability. Through end-to-end training, we effectively compensate for\nspatial misalignment using only the reconstruction loss, and utilize the\nprogressively aligned reference modality to provide an inter-modality prior\nthat improves the reconstruction of the target modality. 
Comprehensive experiments on\nthree real datasets demonstrate that our method exhibits superior\nreconstruction performance compared to state-of-the-art methods.\n","authors":["Hao Zhang","Qi Wang","Jun Shi","Shihui Ying","Zhijie Wen"],"pdf_url":"https://arxiv.org/pdf/2312.16998v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04269v1","updated":"2025-01-08T04:37:36Z","published":"2025-01-08T04:37:36Z","title":"Open set label noise learning with robust sample selection and\n margin-guided module","summary":" In recent years, the remarkable success of deep neural networks (DNNs) in\ncomputer vision is largely due to large-scale, high-quality labeled datasets.\nTraining directly on real-world datasets with label noise may result in\noverfitting. Traditional methods are limited to dealing with closed set label\nnoise, where noisy training data has true class labels within the known label\nspace. However, some real-world datasets contain open set label\nnoise, which means that some samples belong to an unknown class outside the\nknown label space. To address the open set label noise problem, we introduce a\nmethod based on Robust Sample Selection and Margin-Guided Module (RSS-MGM).\nFirstly, unlike prior clean sample selection approaches, which only select a\nlimited number of clean samples, a robust sample selection module combines\nsmall-loss selection with high-confidence sample selection to obtain more clean\nsamples. Secondly, to efficiently distinguish between open set and closed set\nlabel noise, margin functions are designed to filter open set data and closed\nset data. Thirdly, different processing methods are selected for different types of\nsamples in order to fully utilize the data's prior information and optimize the\nwhole model. 
Furthermore, extensive experimental results with noisy labeled\ndata from benchmark datasets and real-world datasets, such as CIFAR-100N-C,\nCIFAR80N-O, WebFG-469, and Food101N, indicate that our approach outperforms\nmany state-of-the-art label noise learning methods. In particular, it can more\naccurately separate open set label noise samples from closed set ones.\n","authors":["Yuandi Zhao","Qianxi Xia","Yang Sun","Zhijie Wen","Liyan Ma","Shihui Ying"],"pdf_url":"https://arxiv.org/pdf/2501.04269v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.09420v3","updated":"2025-01-08T04:31:16Z","published":"2024-11-14T13:15:27Z","title":"SAG-ViT: A Scale-Aware, High-Fidelity Patching Approach with Graph\n Attention for Vision Transformers","summary":" Vision Transformers (ViTs) have redefined image classification by leveraging\nself-attention to capture complex patterns and long-range dependencies between\nimage patches. However, a key challenge for ViTs is efficiently incorporating\nmulti-scale feature representations, which is inherent in convolutional neural\nnetworks (CNNs) through their hierarchical structure. Graph transformers have\nmade strides in addressing this by leveraging graph-based modeling, but they\noften lose or insufficiently represent spatial hierarchies, especially since\nredundant or less relevant areas dilute the image's contextual representation.\nTo bridge this gap, we propose SAG-ViT, a Scale-Aware Graph Attention ViT that\nintegrates the multi-scale feature capabilities of CNNs, the representational\npower of ViTs, and graph-attended patching to enable richer contextual\nrepresentation. Using EfficientNetV2 as a backbone, the model extracts\nmulti-scale feature maps,\ndividing them into patches to preserve richer semantic information compared to\ndirectly patching the input images. The patches are structured into a graph\nusing spatial and feature similarities, where a Graph Attention Network (GAT)\nrefines the node embeddings. 
This refined graph representation is then\nprocessed by a Transformer encoder, capturing long-range dependencies and\ncomplex interactions. We evaluate SAG-ViT on benchmark datasets across various\ndomains, validating its effectiveness in advancing image classification tasks.\nOur code and weights are available at https://github.com/shravan-18/SAG-ViT.\n","authors":["Shravan Venkatraman","Jaskaran Singh Walia","Joe Dhanith P R"],"pdf_url":"https://arxiv.org/pdf/2411.09420v3.pdf","comment":"14 pages, 8 figures, 9 tables"},{"id":"http://arxiv.org/abs/2501.04268v1","updated":"2025-01-08T04:30:45Z","published":"2025-01-08T04:30:45Z","title":"Robotic Programmer: Video Instructed Policy Code Generation for Robotic\n Manipulation","summary":" Zero-shot generalization across various robots, tasks and environments\nremains a significant challenge in robotic manipulation. Policy code generation\nmethods use executable code to connect high-level task descriptions and\nlow-level action sequences, leveraging the generalization capabilities of large\nlanguage models and atomic skill libraries. In this work, we propose Robotic\nProgrammer (RoboPro), a robotic foundation model capable of perceiving visual\ninformation and following free-form instructions to perform\nrobotic manipulation with policy code in a zero-shot manner. To address low\nefficiency and high cost in collecting runtime code data for robotic tasks, we\ndevise Video2Code to synthesize executable code from extensive in-the-wild\nvideos with an off-the-shelf vision-language model and a code-domain large\nlanguage model. Extensive experiments show that RoboPro achieves\nstate-of-the-art zero-shot performance on robotic manipulation in both\nsimulators and real-world environments. 
Specifically, the zero-shot success\nrate of RoboPro on RLBench surpasses the state-of-the-art model GPT-4o by\n11.6%, which is even comparable to a strong supervised training baseline.\nFurthermore, RoboPro is robust to variations in API formats and skill sets.\n","authors":["Senwei Xie","Hongyu Wang","Zhanqi Xiao","Ruiping Wang","Xilin Chen"],"pdf_url":"https://arxiv.org/pdf/2501.04268v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.00547v2","updated":"2025-01-08T04:14:07Z","published":"2024-11-30T17:40:49Z","title":"Motion Dreamer: Realizing Physically Coherent Video Generation through\n Scene-Aware Motion Reasoning","summary":" Numerous recent video generation models, also known as world models, have\ndemonstrated the ability to generate plausible real-world videos. However, many\nstudies have shown that these models often produce motion results lacking\nlogical or physical coherence. In this paper, we revisit video generation\nmodels and find that single-stage approaches struggle to produce high-quality\nresults while maintaining coherent motion reasoning. To address this issue, we\npropose \textbf{Motion Dreamer}, a two-stage video generation framework. In\nStage I, the model generates an intermediate motion representation, such as a\nsegmentation map or depth map, based on the input image and motion conditions,\nfocusing solely on the motion itself. In Stage II, the model uses this\nintermediate motion representation as a condition to generate a high-detail\nvideo. By decoupling motion reasoning from high-fidelity video synthesis, our\napproach allows for more accurate and physically plausible motion generation.\nWe validate the effectiveness of our approach on the Physion dataset and in\nautonomous driving scenarios. For example, given a single push, our model can\nsynthesize the sequential toppling of a set of dominoes. Similarly, by varying\nthe movements of ego-cars, our model can produce different effects on other\nvehicles. 
Our work opens new avenues in creating models that can reason about\nphysical interactions in a more coherent and realistic manner. Our webpage is\navailable at: https://envision-research.github.io/MotionDreamer/.\n","authors":["Tianshuo Xu","Zhifei Chen","Leyi Wu","Hao Lu","Yuying Chen","Lihui Jiang","Bingbing Liu","Yingcong Chen"],"pdf_url":"https://arxiv.org/pdf/2412.00547v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19106v2","updated":"2025-01-08T04:11:48Z","published":"2024-11-28T12:42:14Z","title":"Detailed Object Description with Controllable Dimensions","summary":" Object description plays an important role in helping visually impaired\nindividuals understand and compare the differences between objects. Recent multimodal\nlarge language models (MLLMs) exhibit powerful perceptual abilities and\ndemonstrate impressive potential for generating object-centric descriptions.\nHowever, the descriptions generated by such models may still contain much\ncontent that is not relevant to the user's intent or miss some important\nobject dimension details. In some scenarios, users may only need the\ndetails of certain dimensions of an object. In this paper, we propose a\ntraining-free object description refinement pipeline, Dimension Tailor,\ndesigned to enhance user-specified details in object descriptions. This\npipeline includes three steps: dimension extracting, erasing, and\nsupplementing, which decompose the description into user-specified dimensions.\nDimension Tailor can not only improve the quality of object details but also\noffer flexibility in including or excluding specific dimensions based on user\npreferences. We conducted extensive experiments to demonstrate the\neffectiveness of Dimension Tailor on controllable object descriptions. Notably,\nthe proposed pipeline can consistently improve the performance of recent\nMLLMs. 
The code is currently accessible at\nhttps://github.com/xin-ran-w/ControllableObjectDescription.\n","authors":["Xinran Wang","Haiwen Zhang","Baoteng Li","Kongming Liang","Hao Sun","Zhongjiang He","Zhanyu Ma","Jun Guo"],"pdf_url":"https://arxiv.org/pdf/2411.19106v2.pdf","comment":"11 pages, 8 figures"},{"id":"http://arxiv.org/abs/2406.11280v2","updated":"2025-01-08T03:18:23Z","published":"2024-06-17T07:33:30Z","title":"ISR-DPO: Aligning Large Multimodal Models for Videos by Iterative\n Self-Retrospective DPO","summary":" Iterative self-improvement, a concept extending beyond personal growth, has\nfound powerful applications in machine learning, particularly in transforming\nweak models into strong ones. While recent advances in natural language\nprocessing have shown its efficacy through iterative preference optimization,\napplying this approach to Video Large Multi-modal Models (VLMMs) remains\nchallenging due to modality misalignment. VLMMs struggle with this misalignment\nduring iterative preference modeling, as the self-judge model often prioritizes\nlinguistic knowledge over visual information. Additionally, iterative\npreference optimization can lead to visually hallucinated verbose responses due\nto length bias within the self-rewarding cycle. To address these issues, we\npropose Iterative Self-Retrospective Direct Preference Optimization (ISR-DPO),\na method that uses self-retrospection to enhance preference modeling. This\napproach enhances the self-judge's focus on informative video regions,\nresulting in more visually grounded preferences. In extensive empirical\nevaluations across diverse video question answering benchmarks, the ISR-DPO\nsignificantly outperforms the state of the art. 
We are committed to\nopen-sourcing our code, models, and datasets to encourage further\ninvestigation.\n","authors":["Daechul Ahn","Yura Choi","San Kim","Youngjae Yu","Dongyeop Kang","Jonghyun Choi"],"pdf_url":"https://arxiv.org/pdf/2406.11280v2.pdf","comment":"AAAI 2025"},{"id":"http://arxiv.org/abs/2311.07594v3","updated":"2025-01-08T02:33:37Z","published":"2023-11-10T09:51:24Z","title":"How to Bridge the Gap between Modalities: Survey on Multimodal Large\n Language Model","summary":" We explore Multimodal Large Language Models (MLLMs), which integrate LLMs\nlike GPT-4 to handle multimodal data, including text, images, audio, and more.\nMLLMs demonstrate capabilities such as generating image captions and answering\nimage-based questions, bridging the gap towards real-world human-computer\ninteractions and hinting at a potential pathway to artificial general\nintelligence. However, MLLMs still face challenges in addressing the semantic\ngap in multimodal data, which may lead to erroneous outputs, posing potential\nrisks to society. Selecting the appropriate modality alignment method is\ncrucial, as improper methods might require more parameters without significant\nperformance improvements. This paper aims to explore modality alignment methods\nfor LLMs and their current capabilities. 
Implementing effective modality\nalignment can help LLMs address environmental issues and enhance accessibility.\nThe study surveys existing modality alignment methods for MLLMs, categorizing\nthem into four groups: (1) Multimodal Converter, which transforms data into a\nformat that LLMs can understand; (2) Multimodal Perceiver, which improves how\nLLMs perceive different types of data; (3) Tool Learning, which leverages\nexternal tools to convert data into a common format, usually text; and (4)\nData-Driven Method, which teaches LLMs to understand specific data types within\ndatasets.\n","authors":["Shezheng Song","Xiaopeng Li","Shasha Li","Shan Zhao","Jie Yu","Jun Ma","Xiaoguang Mao","Weimin Zhang"],"pdf_url":"https://arxiv.org/pdf/2311.07594v3.pdf","comment":"Accepted by TKDE"},{"id":"http://arxiv.org/abs/2412.16050v3","updated":"2025-01-08T01:47:16Z","published":"2024-12-20T16:52:11Z","title":"Label-Efficient Data Augmentation with Video Diffusion Models for\n Guidewire Segmentation in Cardiac Fluoroscopy","summary":" The accurate segmentation of guidewires in interventional cardiac fluoroscopy\nvideos is crucial for computer-aided navigation tasks. Although deep learning\nmethods have demonstrated high accuracy and robustness in wire segmentation,\nthey require substantial annotated datasets for generalizability, underscoring\nthe need for extensive labeled data to enhance model performance. To address\nthis challenge, we propose the Segmentation-guided Frame-consistency Video\nDiffusion Model (SF-VD) to generate large collections of labeled fluoroscopy\nvideos, augmenting the training data for wire segmentation networks. SF-VD\nleverages videos with limited annotations by independently modeling scene\ndistribution and motion distribution. 
It first samples the scene distribution\nby generating 2D fluoroscopy images with wires positioned according to a\nspecified input mask, and then samples the motion distribution by progressively\ngenerating subsequent frames, ensuring frame-to-frame coherence through a\nframe-consistency strategy. A segmentation-guided mechanism further refines the\nprocess by adjusting wire contrast, ensuring a diverse range of visibility in\nthe synthesized image. Evaluation on a fluoroscopy dataset confirms the\nsuperior quality of the generated videos and shows significant improvements in\nguidewire segmentation.\n","authors":["Shaoyan Pan","Yikang Liu","Lin Zhao","Eric Z. Chen","Xiao Chen","Terrence Chen","Shanhui Sun"],"pdf_url":"https://arxiv.org/pdf/2412.16050v3.pdf","comment":"AAAI 2025"},{"id":"http://arxiv.org/abs/2501.04217v1","updated":"2025-01-08T01:27:35Z","published":"2025-01-08T01:27:35Z","title":"Continual Self-supervised Learning Considering Medical Domain Knowledge\n in Chest CT Images","summary":" We propose a novel continual self-supervised learning method (CSSL)\nconsidering medical domain knowledge in chest CT images. Our approach addresses\nthe challenge of sequential learning by effectively capturing the relationship\nbetween previously learned knowledge and new information at different stages.\nBy incorporating an enhanced DER into CSSL and maintaining both diversity and\nrepresentativeness within the rehearsal buffer of DER, the risk of data\ninterference during pretraining is reduced, enabling the model to learn richer\nand more robust feature representations. In addition, we incorporate a mixup\nstrategy and feature distillation to further enhance the model's ability to\nlearn meaningful representations. 
We validate our method using chest CT images\nobtained under two different imaging conditions, demonstrating superior\nperformance compared to state-of-the-art methods.\n","authors":["Ren Tasai","Guang Li","Ren Togo","Minghui Tang","Takaaki Yoshimura","Hiroyuki Sugimori","Kenji Hirata","Takahiro Ogawa","Kohsuke Kudo","Miki Haseyama"],"pdf_url":"https://arxiv.org/pdf/2501.04217v1.pdf","comment":"Accepted by ICASSP 2025"},{"id":"http://arxiv.org/abs/2501.04213v1","updated":"2025-01-08T01:18:14Z","published":"2025-01-08T01:18:14Z","title":"UPAQ: A Framework for Real-Time and Energy-Efficient 3D Object Detection\n in Autonomous Vehicles","summary":" To enhance perception in autonomous vehicles (AVs), recent efforts are\nconcentrating on 3D object detectors, which deliver more comprehensive\npredictions than traditional 2D object detectors, at the cost of increased\nmemory footprint and computational resource usage. We present a novel framework\ncalled UPAQ, which leverages semi-structured pattern pruning and quantization\nto improve the efficiency of LiDAR point-cloud and camera-based 3D object\ndetectors on resource-constrained embedded AV platforms. 
Experimental results\non the Jetson Orin Nano embedded platform indicate that UPAQ achieves up to\n5.62x and 5.13x model compression rates, up to 1.97x and 1.86x boost in\ninference speed, and up to 2.07x and 1.87x reduction in energy consumption\ncompared to state-of-the-art model compression frameworks, on the Pointpillar\nand SMOKE models respectively.\n","authors":["Abhishek Balasubramaniam","Febin P Sunny","Sudeep Pasricha"],"pdf_url":"https://arxiv.org/pdf/2501.04213v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.06438v2","updated":"2025-01-08T01:11:07Z","published":"2024-01-12T08:18:42Z","title":"Improving Low-Light Image Recognition Performance Based on\n Image-adaptive Learnable Module","summary":" In recent years, significant progress has been made in image recognition\ntechnology based on deep neural networks. However, improving recognition\nperformance under low-light conditions remains a significant challenge. This\nstudy addresses the enhancement of recognition model performance in low-light\nconditions. We propose an image-adaptive learnable module, which applies\nappropriate image processing to input images, and a hyperparameter predictor to\nforecast optimal parameters used in the module. Our proposed approach allows\nfor the enhancement of recognition performance under low-light conditions by\neasily integrating as a front-end filter without the need to retrain existing\nrecognition models designed for low-light conditions. 
Through experiments, our\nproposed method demonstrates its contribution to enhancing image recognition\nperformance under low-light conditions.\n","authors":["Seitaro Ono","Yuka Ogino","Takahiro Toizumi","Atsushi Ito","Masato Tsukada"],"pdf_url":"https://arxiv.org/pdf/2401.06438v2.pdf","comment":"accepted to VISAPP2024"},{"id":"http://arxiv.org/abs/2501.04210v1","updated":"2025-01-08T01:09:49Z","published":"2025-01-08T01:09:49Z","title":"Recognition-Oriented Low-Light Image Enhancement based on Global and\n Pixelwise Optimization","summary":" In this paper, we propose a novel low-light image enhancement method aimed at\nimproving the performance of recognition models. Despite recent advances in\ndeep learning, the recognition of images under low-light conditions remains a\nchallenge. Although existing low-light image enhancement methods have been\ndeveloped to improve image visibility for human vision, they do not\nspecifically focus on enhancing recognition model performance. Our proposed\nlow-light image enhancement method consists of two key modules: the Global\nEnhance Module, which adjusts the overall brightness and color balance of the\ninput image, and the Pixelwise Adjustment Module, which refines image features\nat the pixel level. These modules are trained to enhance input images to\nimprove downstream recognition model performance effectively. Notably, the\nproposed method can be applied as a frontend filter to improve low-light\nrecognition performance without requiring retraining of downstream recognition\nmodels. 
Experimental results demonstrate the effectiveness of our method in improving\nthe performance of pretrained recognition models under low-light\nconditions.\n","authors":["Seitaro Ono","Yuka Ogino","Takahiro Toizumi","Atsushi Ito","Masato Tsukada"],"pdf_url":"https://arxiv.org/pdf/2501.04210v1.pdf","comment":"accepted to VISAPP2025"},{"id":"http://arxiv.org/abs/2412.05394v2","updated":"2025-01-08T01:06:34Z","published":"2024-12-06T19:40:00Z","title":"YOLOv5-Based Object Detection for Emergency Response in Aerial Imagery","summary":" This paper presents a robust approach for object detection in aerial imagery\nusing the YOLOv5 model. We focus on identifying critical objects such as\nambulances, car crashes, police vehicles, tow trucks, fire engines, overturned\ncars, and vehicles on fire. By leveraging a custom dataset, we outline the\ncomplete pipeline from data collection and annotation to model training and\nevaluation. Our results demonstrate that YOLOv5 effectively balances speed and\naccuracy, making it suitable for real-time emergency response applications.\nThis work addresses key challenges in aerial imagery, including small object\ndetection and complex backgrounds, and provides insights for future research in\nautomated emergency response systems.\n","authors":["Sindhu Boddu","Arindam Mukherjee"],"pdf_url":"https://arxiv.org/pdf/2412.05394v2.pdf","comment":"6 pages, 8 figures, submitted for open-access publication on arXiv"},{"id":"http://arxiv.org/abs/2501.04206v1","updated":"2025-01-08T00:54:43Z","published":"2025-01-08T00:54:43Z","title":"GRAPHITE: Graph-Based Interpretable Tissue Examination for Enhanced\n Explainability in Breast Cancer Histopathology","summary":" Explainable AI (XAI) in medical histopathology is essential for enhancing the\ninterpretability and clinical trustworthiness of deep learning models in cancer\ndiagnosis. However, the black-box nature of these models often limits their\nclinical adoption. 
We introduce GRAPHITE (Graph-based Interpretable Tissue\nExamination), a post-hoc explainable framework designed for breast cancer\ntissue microarray (TMA) analysis. GRAPHITE employs a multiscale approach,\nextracting patches at various magnification levels, constructing a\nhierarchical graph, and utilising graph attention networks (GAT) with scalewise\nattention (SAN) to capture scale-dependent features. We trained the model on\n140 tumour TMA cores and four benign whole slide images from which 140 benign\nsamples were created, and tested it on 53 pathologist-annotated TMA samples.\nGRAPHITE outperformed traditional XAI methods, achieving a mean average\nprecision (mAP) of 0.56, an area under the receiver operating characteristic\ncurve (AUROC) of 0.94, and a threshold robustness (ThR) of 0.70, indicating\nthat the model maintains high performance across a wide range of thresholds. In\nclinical utility, GRAPHITE achieved the highest area under the decision curve\n(AUDC) of 4.17e+5, indicating reliable decision support across thresholds.\nThese results highlight GRAPHITE's potential as a clinically valuable tool in\ncomputational pathology, providing interpretable visualisations that align with\nthe pathologists' diagnostic reasoning and support precision medicine.\n","authors":["Raktim Kumar Mondol","Ewan K. A. Millar","Peter H. Graham","Lois Browne","Arcot Sowmya","Erik Meijering"],"pdf_url":"https://arxiv.org/pdf/2501.04206v1.pdf","comment":"24 Pages, 9 Figures, 1 Table"},{"id":"http://arxiv.org/abs/2501.04204v1","updated":"2025-01-08T00:52:19Z","published":"2025-01-08T00:52:19Z","title":"LipGen: Viseme-Guided Lip Video Generation for Enhancing Visual Speech\n Recognition","summary":" Visual speech recognition (VSR), commonly known as lip reading, has garnered\nsignificant attention due to its wide-ranging practical applications. 
The\nadvent of deep learning techniques and advancements in hardware capabilities\nhave significantly enhanced the performance of lip reading models. Despite\nthese advancements, existing datasets predominantly feature stable video\nrecordings with limited variability in lip movements. This limitation results\nin models that are highly sensitive to variations encountered in real-world\nscenarios. To address this issue, we propose a novel framework, LipGen, which\naims to improve model robustness by leveraging speech-driven synthetic visual\ndata, thereby mitigating the constraints of current datasets. Additionally, we\nintroduce an auxiliary task that incorporates viseme classification alongside\nattention mechanisms. This approach facilitates the efficient integration of\ntemporal information, directing the model's focus toward the relevant segments\nof speech, thereby enhancing discriminative capabilities. Our method\ndemonstrates superior performance compared to the current state-of-the-art on\nthe lip reading in the wild (LRW) dataset and exhibits even more pronounced\nadvantages under challenging conditions.\n","authors":["Bowen Hao","Dongliang Zhou","Xiaojie Li","Xingyu Zhang","Liang Xie","Jianlong Wu","Erwei Yin"],"pdf_url":"https://arxiv.org/pdf/2501.04204v1.pdf","comment":"This paper has been accepted for presentation at ICASSP 2025"},{"id":"http://arxiv.org/abs/2501.04202v1","updated":"2025-01-08T00:43:31Z","published":"2025-01-08T00:43:31Z","title":"Generative Dataset Distillation Based on Self-knowledge Distillation","summary":" Dataset distillation is an effective technique for reducing the cost and\ncomplexity of model training while maintaining performance by compressing large\ndatasets into smaller, more efficient versions. In this paper, we present a\nnovel generative dataset distillation method that can improve the accuracy of\naligning prediction logits. 
Our approach integrates self-knowledge distillation\nto achieve more precise distribution matching between the synthetic and\noriginal data, thereby capturing the overall structure and relationships within\nthe data. To further improve the accuracy of alignment, we introduce a\nstandardization step on the logits before performing distribution matching,\nensuring consistency in the range of logits. Through extensive experiments, we\ndemonstrate that our method outperforms existing state-of-the-art methods,\nresulting in superior distillation performance.\n","authors":["Longzhen Li","Guang Li","Ren Togo","Keisuke Maeda","Takahiro Ogawa","Miki Haseyama"],"pdf_url":"https://arxiv.org/pdf/2501.04202v1.pdf","comment":"Accepted by ICASSP 2025"},{"id":"http://arxiv.org/abs/2404.15683v4","updated":"2025-01-08T00:27:00Z","published":"2024-04-24T06:35:56Z","title":"AnoFPDM: Anomaly Segmentation with Forward Process of Diffusion Models\n for Brain MRI","summary":" Weakly-supervised diffusion models (DMs) in anomaly segmentation, leveraging\nimage-level labels, have attracted significant attention for their superior\nperformance compared to unsupervised methods. This eliminates the need for\npixel-level labels in training, offering a more cost-effective alternative to\nsupervised methods. However, existing methods are not fully weakly-supervised\nbecause they heavily rely on costly pixel-level labels for hyperparameter\ntuning in inference. To tackle this challenge, we introduce Anomaly\nSegmentation with Forward Process of Diffusion Models (AnoFPDM), a fully\nweakly-supervised framework that operates without the need for pixel-level\nlabels. Leveraging the unguided forward process as a reference for the guided\nforward process, we select hyperparameters such as the noise scale, the\nthreshold for segmentation, and the guidance strength. We aggregate anomaly maps\nfrom the guided forward process, enhancing the signal strength of anomalous\nregions. 
Remarkably, our proposed method outperforms recent state-of-the-art\nweakly-supervised approaches, even without utilizing pixel-level labels.\n","authors":["Yiming Che","Fazle Rafsani","Jay Shah","Md Mahfuzur Rahman Siddiquee","Teresa Wu"],"pdf_url":"https://arxiv.org/pdf/2404.15683v4.pdf","comment":"v4: added appendices and fixed some typos"},{"id":"http://arxiv.org/abs/2407.05910v3","updated":"2025-01-08T23:40:38Z","published":"2024-07-08T13:15:11Z","title":"Enhancing Vision-Language Models with Scene Graphs for Traffic Accident\n Understanding","summary":" Recognizing a traffic accident is an essential part of any autonomous driving\nor road monitoring system. An accident can appear in a wide variety of forms,\nand understanding what type of accident is taking place may be useful to\nprevent it from recurring. This work focuses on classifying traffic scenes into\nspecific accident types. We approach the problem by representing a traffic\nscene as a graph, where objects such as cars can be represented as nodes, and\nrelative distances and directions between them as edges. This representation of\na traffic scene is referred to as a scene graph, and can be used as input for\nan accident classifier. Better results are obtained with a classifier that\nfuses the scene graph input with visual and textual representations. 
This work\nintroduces a multi-stage, multimodal pipeline that pre-processes videos of\ntraffic accidents, encodes them as scene graphs, and aligns this representation\nwith vision and language modalities before executing the classification task.\nWhen trained on 4 classes, our method achieves a balanced accuracy score of\n57.77% on an (unbalanced) subset of the popular Detection of Traffic Anomaly\n(DoTA) benchmark, representing an increase of close to 5 percentage points from\nthe case where scene graph information is not taken into account.\n","authors":["Aaron Lohner","Francesco Compagno","Jonathan Francis","Alessandro Oltramari"],"pdf_url":"https://arxiv.org/pdf/2407.05910v3.pdf","comment":"Won the 'Best Paper Runner-up Award' at the 2024 IEEE International\n Automated Vehicle Validation Conference (IAVVC 2024). Also accepted at the\n 1st Workshop on Semantic Reasoning and Goal Understanding in Robotics, at the\n Robotics Science and Systems Conference (RSS SemRob 2024)"},{"id":"http://arxiv.org/abs/2501.04878v1","updated":"2025-01-08T23:21:49Z","published":"2025-01-08T23:21:49Z","title":"Topological Classification of points in $Z^2$ by using Topological\n Numbers for $2$D discrete binary images","summary":" In this paper, we propose a topological classification of points for 2D\ndiscrete binary images. This classification is based on the values of the\ncalculus of topological numbers. Six classes of points are proposed: isolated\npoint, interior point, simple point, curve point, point of intersection of 3\ncurves, point of intersection of 4 curves. 
The number of configurations of each\nclass is also given.\n","authors":["Christophe Lohou"],"pdf_url":"https://arxiv.org/pdf/2501.04878v1.pdf","comment":"arXiv admin note: substantial text overlap with arXiv:2410.21588"},{"id":"http://arxiv.org/abs/2410.06154v3","updated":"2025-01-08T23:08:03Z","published":"2024-10-08T15:55:40Z","title":"GLOV: Guided Large Language Models as Implicit Optimizers for Vision\n Language Models","summary":" In this work, we propose a novel method (GLOV) enabling Large Language Models\n(LLMs) to act as implicit Optimizers for Vision-Language Models (VLMs) to\nenhance downstream vision tasks. Our GLOV meta-prompts an LLM with the\ndownstream task description, querying it for suitable VLM prompts (e.g., for\nzero-shot classification with CLIP). These prompts are ranked according to a\npurity measure obtained through a fitness function. In each respective\noptimization step, the ranked prompts are fed as in-context examples (with\ntheir accuracies) to equip the LLM with the knowledge of the type of text\nprompts preferred by the downstream VLM. Furthermore, we also explicitly steer\nthe LLM generation process in each optimization step by specifically adding an\noffset difference vector of the embeddings from the positive and negative\nsolutions found by the LLM, in previous optimization steps, to the intermediate\nlayer of the network for the next generation step. This offset vector steers\nthe LLM generation toward the type of language preferred by the downstream VLM,\nresulting in enhanced performance on the downstream vision tasks. We\ncomprehensively evaluate our GLOV on 16 diverse datasets using two families of\nVLMs, i.e., dual-encoder (e.g., CLIP) and encoder-decoder (e.g., LLaVa) models\n-- showing that the discovered solutions can enhance the recognition\nperformance by up to 15.0% and 57.5% (3.8% and 21.6% on average) for these\nmodels.\n","authors":["M. 
Jehanzeb Mirza","Mengjie Zhao","Zhuoyuan Mao","Sivan Doveh","Wei Lin","Paul Gavrikov","Michael Dorkenwald","Shiqi Yang","Saurav Jha","Hiromi Wakaki","Yuki Mitsufuji","Horst Possegger","Rogerio Feris","Leonid Karlinsky","James Glass"],"pdf_url":"https://arxiv.org/pdf/2410.06154v3.pdf","comment":"Code: https://github.com/jmiemirza/GLOV"},{"id":"http://arxiv.org/abs/2501.04873v1","updated":"2025-01-08T23:07:10Z","published":"2025-01-08T23:07:10Z","title":"Back Home: A Machine Learning Approach to Seashell Classification and\n Ecosystem Restoration","summary":" In Costa Rica, an average of 5 tons of seashells are extracted from\necosystems annually. Confiscated seashells cannot be returned to their\necosystems due to the lack of origin recognition. To address this issue, we\ndeveloped a convolutional neural network (CNN) specifically for seashell\nidentification. We built a dataset from scratch, consisting of approximately\n19,000 images from the Pacific and Caribbean coasts. Using this dataset, the\nmodel achieved a classification accuracy exceeding 85%. The model has been\nintegrated into a user-friendly application, which has classified over 36,000\nseashells to date, delivering real-time results within 3 seconds per image. To\nfurther enhance the system's accuracy, an anomaly detection mechanism was\nincorporated to filter out irrelevant or anomalous inputs, ensuring only valid\nseashell images are processed.\n","authors":["Alexander Valverde","Luis Solano"],"pdf_url":"https://arxiv.org/pdf/2501.04873v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.00915v3","updated":"2025-01-08T22:58:51Z","published":"2023-03-02T02:20:04Z","title":"BiomedCLIP: a multimodal biomedical foundation model pretrained from\n fifteen million scientific image-text pairs","summary":" Biomedical data is inherently multimodal, comprising physical measurements\nand natural language narratives. 
A generalist biomedical AI model needs to\nsimultaneously process different modalities of data, including text and images.\nTherefore, training an effective generalist biomedical model requires\nhigh-quality multimodal data, such as parallel image-text pairs. Here, we\npresent PMC-15M, a novel dataset that is two orders of magnitude larger than\nexisting biomedical multimodal datasets such as MIMIC-CXR, and spans a diverse\nrange of biomedical image types. PMC-15M contains 15 million biomedical\nimage-text pairs collected from 4.4 million scientific articles. Based on\nPMC-15M, we have pretrained BiomedCLIP, a multimodal foundation model, with\ndomain-specific adaptations tailored to biomedical vision-language processing.\nWe conducted extensive experiments and ablation studies on standard biomedical\nimaging tasks from retrieval to classification to visual question-answering\n(VQA). BiomedCLIP achieved new state-of-the-art results in a wide range of\nstandard datasets, substantially outperforming prior approaches. Intriguingly,\nby large-scale pretraining on diverse biomedical image types, BiomedCLIP even\noutperforms state-of-the-art radiology-specific models such as BioViL in\nradiology-specific tasks such as RSNA pneumonia detection. In summary,\nBiomedCLIP is a fully open-access foundation model that achieves\nstate-of-the-art performance on various biomedical tasks, paving the way for\ntransformative multimodal biomedical discovery and applications. We release our\nmodels at https://aka.ms/biomedclip to facilitate future research in multimodal\nbiomedical AI.\n","authors":["Sheng Zhang","Yanbo Xu","Naoto Usuyama","Hanwen Xu","Jaspreet Bagga","Robert Tinn","Sam Preston","Rajesh Rao","Mu Wei","Naveen Valluri","Cliff Wong","Andrea Tupini","Yu Wang","Matt Mazzola","Swadheen Shukla","Lars Liden","Jianfeng Gao","Angela Crabtree","Brian Piening","Carlo Bifulco","Matthew P. 
Lungren","Tristan Naumann","Sheng Wang","Hoifung Poon"],"pdf_url":"https://arxiv.org/pdf/2303.00915v3.pdf","comment":"The models are released at https://aka.ms/biomedclip"},{"id":"http://arxiv.org/abs/2501.04861v1","updated":"2025-01-08T22:22:44Z","published":"2025-01-08T22:22:44Z","title":"LayerMix: Enhanced Data Augmentation through Fractal Integration for\n Robust Deep Learning","summary":" Deep learning models have demonstrated remarkable performance across various\ncomputer vision tasks, yet their vulnerability to distribution shifts remains a\ncritical challenge. Despite sophisticated neural network architectures,\nexisting models often struggle to maintain consistent performance when\nconfronted with Out-of-Distribution (OOD) samples, including natural\ncorruptions, adversarial perturbations, and anomalous patterns. We introduce\nLayerMix, an innovative data augmentation approach that systematically enhances\nmodel robustness through structured fractal-based image synthesis. By\nmeticulously integrating structural complexity into training datasets, our\nmethod generates semantically consistent synthetic samples that significantly\nimprove neural network generalization capabilities. Unlike traditional\naugmentation techniques that rely on random transformations, LayerMix employs a\nstructured mixing pipeline that preserves original image semantics while\nintroducing controlled variability. Extensive experiments across multiple\nbenchmark datasets, including CIFAR-10, CIFAR-100, ImageNet-200, and\nImageNet-1K, demonstrate LayerMix's superior performance in classification\naccuracy and substantial improvements in critical Machine Learning (ML) safety\nmetrics, including resilience to natural image corruptions, robustness against\nadversarial attacks, improved model calibration, and enhanced prediction\nconsistency. 
LayerMix represents a significant advancement toward developing\nmore reliable and adaptable artificial intelligence systems by addressing the\nfundamental challenges of deep learning generalization. The code is available\nat https://github.com/ahmadmughees/layermix.\n","authors":["Hafiz Mughees Ahmad","Dario Morle","Afshin Rahimi"],"pdf_url":"https://arxiv.org/pdf/2501.04861v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04846v1","updated":"2025-01-08T21:15:14Z","published":"2025-01-08T21:15:14Z","title":"EDMB: Edge Detector with Mamba","summary":" Transformer-based models have made significant progress in edge detection,\nbut their high computational cost is prohibitive. Recently, vision Mamba have\nshown excellent ability in efficiently capturing long-range dependencies.\nDrawing inspiration from this, we propose a novel edge detector with Mamba,\ntermed EDMB, to efficiently generate high-quality multi-granularity edges. In\nEDMB, Mamba is combined with a global-local architecture, therefore it can\nfocus on both global information and fine-grained cues. The fine-grained cues\nplay a crucial role in edge detection, but are usually ignored by ordinary\nMamba. We design a novel decoder to construct learnable Gaussian distributions\nby fusing global features and fine-grained features. And the multi-grained\nedges are generated by sampling from the distributions. In order to make\nmulti-granularity edges applicable to single-label data, we introduce Evidence\nLower Bound loss to supervise the learning of the distributions. On the\nmulti-label dataset BSDS500, our proposed EDMB achieves competitive\nsingle-granularity ODS 0.837 and multi-granularity ODS 0.851 without\nmulti-scale test or extra PASCAL-VOC data. Remarkably, EDMB can be extended to\nsingle-label datasets such as NYUDv2 and BIPED. 
The source code is available at\nhttps://github.com/Li-yachuan/EDMB.\n","authors":["Yachuan Li","Xavier Soria Poma","Yun Bai","Qian Xiao","Chaozhi Yang","Guanlin Li","Zongmin Li"],"pdf_url":"https://arxiv.org/pdf/2501.04846v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.02640v2","updated":"2025-01-08T20:57:22Z","published":"2025-01-05T20:05:10Z","title":"Multispectral Pedestrian Detection with Sparsely Annotated Label","summary":" Although existing Sparsely Annotated Object Detection (SAOD) approaches have\nmade progress in handling sparsely annotated environments in the multispectral\ndomain, where only some pedestrians are annotated, they still have the\nfollowing limitations: (i) they lack considerations for improving the quality\nof pseudo-labels for missing annotations, and (ii) they rely on fixed ground\ntruth annotations, which leads to learning only a limited range of pedestrian\nvisual appearances in the multispectral domain. To address these issues, we\npropose a novel framework called Sparsely Annotated Multispectral Pedestrian\nDetection (SAMPD). For limitation (i), we introduce Multispectral\nPedestrian-aware Adaptive Weight (MPAW) and Positive Pseudo-label Enhancement\n(PPE) modules. Utilizing multispectral knowledge, these modules ensure the\ngeneration of high-quality pseudo-labels and enable effective learning by\nincreasing weights for high-quality pseudo-labels based on modality\ncharacteristics. To address limitation (ii), we propose an Adaptive Pedestrian\nRetrieval Augmentation (APRA) module, which adaptively incorporates pedestrian\npatches from ground-truth and dynamically integrates high-quality pseudo-labels\nwith the ground-truth, facilitating a more diverse learning pool of\npedestrians. 
Extensive experimental results demonstrate that our SAMPD\nsignificantly enhances performance in sparsely annotated environments within\nthe multispectral domain.\n","authors":["Chan Lee","Seungho Shin","Gyeong-Moon Park","Jung Uk Kim"],"pdf_url":"https://arxiv.org/pdf/2501.02640v2.pdf","comment":"Accepted at AAAI 2025"},{"id":"http://arxiv.org/abs/2501.04815v1","updated":"2025-01-08T20:11:09Z","published":"2025-01-08T20:11:09Z","title":"Towards Generalizable Trajectory Prediction Using Dual-Level\n Representation Learning And Adaptive Prompting","summary":" Existing vehicle trajectory prediction models struggle with generalizability,\nprediction uncertainties, and handling complex interactions. This is often due to\nlimitations like complex architectures customized for a specific dataset and\ninefficient multimodal handling. We propose Perceiver with Register queries\n(PerReg+), a novel trajectory prediction framework that introduces: (1)\nDual-Level Representation Learning via Self-Distillation (SD) and Masked\nReconstruction (MR), capturing global context and fine-grained details.\nAdditionally, our approach of reconstructing segment-level trajectories and lane\nsegments from masked inputs with query drop enables effective use of\ncontextual information and improves generalization; (2) Enhanced Multimodality\nusing register-based queries and pretraining, eliminating the need for\nclustering and suppression; and (3) Adaptive Prompt Tuning during fine-tuning,\nfreezing the main architecture and optimizing a small number of prompts for\nefficient adaptation. PerReg+ achieves new state-of-the-art performance on\nnuScenes [1], Argoverse 2 [2], and Waymo Open Motion Dataset (WOMD) [3].\nRemarkably, our pretrained model reduces the error by 6.8% on smaller datasets,\nand multi-dataset training enhances generalization. 
In cross-domain tests,\nPerReg+ reduces B-FDE by 11.8% compared to its non-pretrained variant.\n","authors":["Kaouther Messaoud","Matthieu Cord","Alexandre Alahi"],"pdf_url":"https://arxiv.org/pdf/2501.04815v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04794v1","updated":"2025-01-08T19:18:44Z","published":"2025-01-08T19:18:44Z","title":"A Steerable Deep Network for Model-Free Diffusion MRI Registration","summary":" Nonrigid registration is vital to medical image analysis but remains\nchallenging for diffusion MRI (dMRI) due to its high-dimensional,\norientation-dependent nature. While classical methods are accurate, they are\ncomputationally demanding, and deep neural networks, though efficient, have\nbeen underexplored for nonrigid dMRI registration compared to structural\nimaging. We present a novel, deep learning framework for model-free, nonrigid\nregistration of raw diffusion MRI data that does not require explicit\nreorientation. Unlike previous methods relying on derived representations such\nas diffusion tensors or fiber orientation distribution functions, in our\napproach, we formulate the registration as an equivariant diffeomorphism of\nposition-and-orientation space. Central to our method is an\n$\\mathsf{SE}(3)$-equivariant UNet that generates velocity fields while\npreserving the geometric properties of a raw dMRI's domain. We introduce a new\nloss function based on the maximum mean discrepancy in Fourier space,\nimplicitly matching ensemble average propagators across images. Experimental\nresults on Human Connectome Project dMRI data demonstrate competitive\nperformance compared to state-of-the-art approaches, with the added advantage\nof bypassing the overhead for estimating derived representations. This work\nestablishes a foundation for data-driven, geometry-aware dMRI registration\ndirectly in the acquisition space.\n","authors":["Gianfranco Cortes","Baba C. 
Vemuri"],"pdf_url":"https://arxiv.org/pdf/2501.04794v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04784v1","updated":"2025-01-08T19:02:32Z","published":"2025-01-08T19:02:32Z","title":"Leveraging Registers in Vision Transformers for Robust Adaptation","summary":" Vision Transformers (ViTs) have shown success across a variety of tasks due\nto their ability to capture global image representations. Recent studies have\nidentified the existence of high-norm tokens in ViTs, which can interfere with\nunsupervised object discovery. To address this, the use of \"registers\" which\nare additional tokens that isolate high norm patch tokens while capturing\nglobal image-level information has been proposed. While registers have been\nstudied extensively for object discovery, their generalization properties\nparticularly in out-of-distribution (OOD) scenarios, remains underexplored. In\nthis paper, we examine the utility of register token embeddings in providing\nadditional features for improving generalization and anomaly rejection. To that\nend, we propose a simple method that combines the special CLS token embedding\ncommonly employed in ViTs with the average-pooled register embeddings to create\nfeature representations which are subsequently used for training a downstream\nclassifier. We find that this enhances OOD generalization and anomaly\nrejection, while maintaining in-distribution (ID) performance. Extensive\nexperiments across multiple ViT backbones trained with and without registers\nreveal consistent improvements of 2-4\\% in top-1 OOD accuracy and a 2-3\\%\nreduction in false positive rates for anomaly detection. Importantly, these\ngains are achieved without additional computational overhead.\n","authors":["Srikar Yellapragada","Kowshik Thopalli","Vivek Narayanaswamy","Wesam Sakla","Yang Liu","Yamen Mubarka","Dimitris Samaras","Jayaraman J. 
Thiagarajan"],"pdf_url":"https://arxiv.org/pdf/2501.04784v1.pdf","comment":"Accepted at ICASSP 2025"},{"id":"http://arxiv.org/abs/2501.04782v1","updated":"2025-01-08T19:01:12Z","published":"2025-01-08T19:01:12Z","title":"GaussianVideo: Efficient Video Representation via Hierarchical Gaussian\n Splatting","summary":" Efficient neural representations for dynamic video scenes are critical for\napplications ranging from video compression to interactive simulations. Yet,\nexisting methods often face challenges related to high memory usage, lengthy\ntraining times, and temporal consistency. To address these issues, we introduce\na novel neural video representation that combines 3D Gaussian splatting with\ncontinuous camera motion modeling. By leveraging Neural ODEs, our approach\nlearns smooth camera trajectories while maintaining an explicit 3D scene\nrepresentation through Gaussians. Additionally, we introduce a spatiotemporal\nhierarchical learning strategy, progressively refining spatial and temporal\nfeatures to enhance reconstruction quality and accelerate convergence. This\nmemory-efficient approach achieves high-quality rendering at impressive speeds.\nExperimental results show that our hierarchical learning, combined with robust\ncamera motion modeling, captures complex dynamic scenes with strong temporal\nconsistency, achieving state-of-the-art performance across diverse video\ndatasets in both high- and low-motion scenarios.\n","authors":["Andrew Bond","Jui-Hsien Wang","Long Mai","Erkut Erdem","Aykut Erdem"],"pdf_url":"https://arxiv.org/pdf/2501.04782v1.pdf","comment":"10 pages, 10 figures"},{"id":"http://arxiv.org/abs/2501.04765v1","updated":"2025-01-08T18:38:25Z","published":"2025-01-08T18:38:25Z","title":"TREAD: Token Routing for Efficient Architecture-agnostic Diffusion\n Training","summary":" Diffusion models have emerged as the mainstream approach for visual\ngeneration. 
However, these models usually suffer from sample inefficiency and\nhigh training costs. This issue is particularly pronounced in the standard\ndiffusion transformer architecture due to its quadratic complexity relative to\ninput length. Recent works have addressed this by reducing the number of tokens\nprocessed in the model, often through masking. In contrast, this work aims to\nimprove the training efficiency of the diffusion backbone by using predefined\nroutes that store this information until it is reintroduced to deeper layers of\nthe model, rather than discarding these tokens entirely. Further, we combine\nmultiple routes and introduce an adapted auxiliary loss that accounts for all\napplied routes. Our method is not limited to the common transformer-based model\n- it can also be applied to state-space models. Unlike most current approaches,\nTREAD achieves this without architectural modifications. Finally, we show that\nour method reduces the computational cost and simultaneously boosts model\nperformance on the standard benchmark ImageNet-1K 256 x 256 in\nclass-conditional synthesis. Both of these benefits multiply to a convergence\nspeedup of 9.55x at 400K training iterations compared to DiT and 25.39x\ncompared to the best benchmark performance of DiT at 7M training iterations.\n","authors":["Felix Krause","Timy Phan","Vincent Tao Hu","Björn Ommer"],"pdf_url":"https://arxiv.org/pdf/2501.04765v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04764v1","updated":"2025-01-08T18:35:48Z","published":"2025-01-08T18:35:48Z","title":"Video Summarisation with Incident and Context Information using\n Generative AI","summary":" The proliferation of video content production has led to vast amounts of\ndata, posing substantial challenges in terms of analysis efficiency and\nresource utilization. Addressing this issue calls for the development of robust\nvideo analysis tools. 
This paper proposes a novel approach leveraging\nGenerative Artificial Intelligence (GenAI) to facilitate streamlined video\nanalysis. Our tool aims to deliver tailored textual summaries of user-defined\nqueries, offering a focused insight amidst extensive video datasets. Unlike\nconventional frameworks that offer generic summaries or limited action\nrecognition, our method harnesses the power of GenAI to distil relevant\ninformation, enhancing analysis precision and efficiency. Employing YOLO-V8 for\nobject detection and Gemini for comprehensive video and text analysis, our\nsolution achieves heightened contextual accuracy. By combining YOLO with\nGemini, our approach furnishes textual summaries extracted from extensive CCTV\nfootage, enabling users to swiftly navigate and verify pertinent events without\nthe need for exhaustive manual review. The quantitative evaluation revealed a\nsimilarity of 72.8%, while the qualitative assessment rated an accuracy of 85%,\ndemonstrating the capability of the proposed method.\n","authors":["Ulindu De Silva","Leon Fernando","Kalinga Bandara","Rashmika Nawaratne"],"pdf_url":"https://arxiv.org/pdf/2501.04764v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04750v1","updated":"2025-01-08T16:17:05Z","published":"2025-01-08T16:17:05Z","title":"Efficient License Plate Recognition in Videos Using Visual Rhythm and\n Accumulative Line Analysis","summary":" Video-based Automatic License Plate Recognition (ALPR) involves extracting\nvehicle license plate text information from video captures. Traditional systems\ntypically rely heavily on high-end computing resources and utilize multiple\nframes to recognize license plates, leading to increased computational\noverhead. In this paper, we propose two methods capable of efficiently\nextracting exactly one frame per vehicle and recognizing its license plate\ncharacters from this single image, thus significantly reducing computational\ndemands. 
The first method uses Visual Rhythm (VR) to generate time-spatial\nimages from videos, while the second employs Accumulative Line Analysis (ALA),\na novel algorithm based on single-line video processing for real-time\noperation. Both methods leverage YOLO for license plate detection within the\nframe and a Convolutional Neural Network (CNN) for Optical Character\nRecognition (OCR) to extract textual information. Experiments on real videos\ndemonstrate that the proposed methods achieve results comparable to traditional\nframe-by-frame approaches, with processing speeds three times faster.\n","authors":["Victor Nascimento Ribeiro","Nina S. T. Hirata"],"pdf_url":"https://arxiv.org/pdf/2501.04750v1.pdf","comment":"Accepted for presentation at the Conference on Graphics, Patterns and\n Images (SIBGRAPI) 2024"},{"id":"http://arxiv.org/abs/2501.05488v1","updated":"2025-01-08T18:57:05Z","published":"2025-01-08T18:57:05Z","title":"EndoDINO: A Foundation Model for GI Endoscopy","summary":" In this work, we present EndoDINO, a foundation model for GI endoscopy tasks\nthat achieves strong generalizability by pre-training on a well-curated image\ndataset sampled from the largest known GI endoscopy video dataset in the\nliterature. Specifically, we pre-trained ViT models with 1B, 307M, and 86M\nparameters using datasets ranging from 100K to 10M curated images. 
Using\nEndoDINO as a frozen feature encoder, we achieved state-of-the-art performance\nin anatomical landmark classification, polyp segmentation, and Mayo endoscopic\nscoring (MES) for ulcerative colitis with only simple decoder heads.\n","authors":["Patrick Dermyer","Angad Kalra","Matt Schwartz"],"pdf_url":"https://arxiv.org/pdf/2501.05488v1.pdf","comment":null}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2501.04695v1","updated":"2025-01-08T18:58:22Z","published":"2025-01-08T18:58:22Z","title":"Re-ranking the Context for Multimodal Retrieval Augmented Generation","summary":" Retrieval-augmented generation (RAG) enhances large language models (LLMs) by\nincorporating external knowledge to generate a response within a context with\nimproved accuracy and reduced hallucinations. However, multi-modal RAG systems\nface unique challenges: (i) the retrieval process may select irrelevant entries\nto user query (e.g., images, documents), and (ii) vision-language models or\nmulti-modal language models like GPT-4o may hallucinate when processing these\nentries to generate RAG output. In this paper, we aim to address the first\nchallenge, i.e, improving the selection of relevant context from the\nknowledge-base in retrieval phase of the multi-modal RAG. Specifically, we\nleverage the relevancy score (RS) measure designed in our previous work for\nevaluating the RAG performance to select more relevant entries in retrieval\nprocess. The retrieval based on embeddings, say CLIP-based embedding, and\ncosine similarity usually perform poorly particularly for multi-modal data. We\nshow that by using a more advanced relevancy measure, one can enhance the\nretrieval process by selecting more relevant pieces from the knowledge-base and\neliminate the irrelevant pieces from the context by adaptively selecting\nup-to-$k$ entries instead of fixed number of entries. 
Our evaluation using COCO\ndataset demonstrates significant enhancement in selecting relevant context and\naccuracy of the generated response.\n","authors":["Matin Mortaheb","Mohammad A. Amir Khojastepour","Srimat T. Chakradhar","Sennur Ulukus"],"pdf_url":"https://arxiv.org/pdf/2501.04695v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04652v1","updated":"2025-01-08T18:05:30Z","published":"2025-01-08T18:05:30Z","title":"Multi-task retriever fine-tuning for domain-specific and efficient RAG","summary":" Retrieval-Augmented Generation (RAG) has become ubiquitous when deploying\nLarge Language Models (LLMs), as it can address typical limitations such as\ngenerating hallucinated or outdated information. However, when building\nreal-world RAG applications, practical issues arise. First, the retrieved\ninformation is generally domain-specific. Since it is computationally expensive\nto fine-tune LLMs, it is more feasible to fine-tune the retriever to improve\nthe quality of the data included in the LLM input. Second, as more applications\nare deployed in the same real-world system, one cannot afford to deploy\nseparate retrievers. Moreover, these RAG applications normally retrieve\ndifferent kinds of data. Our solution is to instruction fine-tune a small\nretriever encoder on a variety of domain-specific tasks to allow us to deploy\none encoder that can serve many use cases, thereby achieving low-cost,\nscalability, and speed. We show how this encoder generalizes to out-of-domain\nsettings as well as to an unseen retrieval task on real-world enterprise use\ncases.\n","authors":["Patrice Béchard","Orlando Marquez Ayala"],"pdf_url":"https://arxiv.org/pdf/2501.04652v1.pdf","comment":"9 pages, 2 figures. 
Submitted to NAACL 2025 Industry Track"},{"id":"http://arxiv.org/abs/2501.04635v1","updated":"2025-01-08T17:29:46Z","published":"2025-01-08T17:29:46Z","title":"Knowledge Retrieval Based on Generative AI","summary":" This study develops a question-answering system based on Retrieval-Augmented\nGeneration (RAG) using Chinese Wikipedia and Lawbank as retrieval sources.\nUsing TTQA and TMMLU+ as evaluation datasets, the system employs BGE-M3 for\ndense vector retrieval to obtain highly relevant search results and\nBGE-reranker to reorder these results based on query relevance. The most\npertinent retrieval outcomes serve as reference knowledge for a Large Language\nModel (LLM), enhancing its ability to answer questions and establishing a\nknowledge retrieval system grounded in generative AI.\n The system's effectiveness is assessed through a two-stage evaluation:\nautomatic and assisted performance evaluations. The automatic evaluation\ncalculates accuracy by comparing the model's auto-generated labels with ground\ntruth answers, measuring performance under standardized conditions without\nhuman intervention. The assisted performance evaluation involves 20\nfinance-related multiple-choice questions answered by 20 participants without\nfinancial backgrounds. Initially, participants answer independently. Later,\nthey receive system-generated reference information to assist in answering,\nexamining whether the system improves accuracy when assistance is provided.\n The main contributions of this research are: (1) Enhanced LLM Capability: By\nintegrating BGE-M3 and BGE-reranker, the system retrieves and reorders highly\nrelevant results, reduces hallucinations, and dynamically accesses authorized\nor public knowledge sources. (2) Improved Data Privacy: A customized RAG\narchitecture enables local operation of the LLM, eliminating the need to send\nprivate data to external servers. 
This approach enhances data security, reduces\nreliance on commercial services, lowers operational costs, and mitigates\nprivacy risks.\n","authors":["Te-Lun Yang","Jyi-Shane Liu","Yuen-Hsien Tseng","Jyh-Shing Roger Jang"],"pdf_url":"https://arxiv.org/pdf/2501.04635v1.pdf","comment":"8 pages, 13 figures, 1 table"},{"id":"http://arxiv.org/abs/2501.04630v1","updated":"2025-01-08T17:22:03Z","published":"2025-01-08T17:22:03Z","title":"Evaluating Interval-based Tokenization for Pitch Representation in\n Symbolic Music Analysis","summary":" Symbolic music analysis tasks are often performed by models originally\ndeveloped for Natural Language Processing, such as Transformers. Such models\nrequire the input data to be represented as sequences, which is achieved\nthrough a process of tokenization. Tokenization strategies for symbolic music\noften rely on absolute MIDI values to represent pitch information. However,\nmusic research largely promotes the benefit of higher-level representations\nsuch as melodic contour and harmonic relations for which pitch intervals turn\nout to be more expressive than absolute pitches. In this work, we introduce a\ngeneral framework for building interval-based tokenizations. 
By evaluating\nthese tokenizations on three music analysis tasks, we show that such\ninterval-based tokenizations improve model performances and facilitate their\nexplainability.\n","authors":["Dinh-Viet-Toan Le","Louis Bigo","Mikaela Keller"],"pdf_url":"https://arxiv.org/pdf/2501.04630v1.pdf","comment":"Accepted at Artificial Intelligence for Music Workshop at AAAI 2025\n (https://ai4musicians.org/2025aaai.html)"},{"id":"http://arxiv.org/abs/2405.10587v3","updated":"2025-01-08T11:21:12Z","published":"2024-05-17T07:22:02Z","title":"RDRec: Rationale Distillation for LLM-based Recommendation","summary":" Large language model (LLM)-based recommender models that bridge users and\nitems through textual prompts for effective semantic reasoning have gained\nconsiderable attention. However, few methods consider the underlying rationales\nbehind interactions, such as user preferences and item attributes, limiting the\nreasoning capability of LLMs for recommendations. This paper proposes a\nrationale distillation recommender (RDRec), a compact model designed to learn\nrationales generated by a larger language model (LM). By leveraging rationales\nfrom reviews related to users and items, RDRec remarkably specifies their\nprofiles for recommendations. Experiments show that RDRec achieves\nstate-of-the-art (SOTA) performance in both top-N and sequential\nrecommendations. Our source code is released at\nhttps://github.com/WangXFng/RDRec.\n","authors":["Xinfeng Wang","Jin Cui","Yoshimi Suzuki","Fumiyo Fukumoto"],"pdf_url":"https://arxiv.org/pdf/2405.10587v3.pdf","comment":"10 pages. 
Accepted to ACL 2024 Main as a short paper"},{"id":"http://arxiv.org/abs/2501.04420v1","updated":"2025-01-08T11:08:58Z","published":"2025-01-08T11:08:58Z","title":"A Closer Look on Gender Stereotypes in Movie Recommender Systems and\n Their Implications with Privacy","summary":" The movie recommender system typically leverages user feedback to provide\npersonalized recommendations that align with user preferences and increase\nbusiness revenue. This study investigates the impact of gender stereotypes on\nsuch systems through a specific attack scenario. In this scenario, an attacker\ndetermines users' gender, a private attribute, by exploiting gender stereotypes\nabout movie preferences and analyzing users' feedback data, which is either\npublicly available or observed within the system. The study consists of two\nphases. In the first phase, a user study involving 630 participants identified\ngender stereotypes associated with movie genres, which often influence viewing\nchoices. In the second phase, four inference algorithms were applied to detect\ngender stereotypes by combining the findings from the first phase with users'\nfeedback data. Results showed that these algorithms performed more effectively\nthan relying solely on feedback data for gender inference. Additionally, we\nquantified the extent of gender stereotypes to evaluate their broader impact on\ndigital computational science. The latter part of the study utilized two major\nmovie recommender datasets: MovieLens 1M and Yahoo!Movie. Detailed experimental\ninformation is available on our GitHub repository:\nhttps://github.com/fr-iit/GSMRS\n","authors":["Falguni Roy","Yiduo Shen","Na Zhao","Xiaofeng Ding","Md. 
Omar Faruk"],"pdf_url":"https://arxiv.org/pdf/2501.04420v1.pdf","comment":"19 pages, 2 figures"},{"id":"http://arxiv.org/abs/2501.04410v1","updated":"2025-01-08T10:49:13Z","published":"2025-01-08T10:49:13Z","title":"User Simulation in the Era of Generative AI: User Modeling, Synthetic\n Data Generation, and System Evaluation","summary":" User simulation is an emerging interdisciplinary topic with multiple critical\napplications in the era of Generative AI. It involves creating an intelligent\nagent that mimics the actions of a human user interacting with an AI system,\nenabling researchers to model and analyze user behaviour, generate synthetic\ndata for training, and evaluate interactive AI systems in a controlled and\nreproducible manner. User simulation has profound implications for diverse\nfields and plays a vital role in the pursuit of Artificial General\nIntelligence. This paper provides an overview of user simulation, highlighting\nits key applications, connections to various disciplines, and outlining future\nresearch directions to advance this increasingly important technology.\n","authors":["Krisztian Balog","ChengXiang Zhai"],"pdf_url":"https://arxiv.org/pdf/2501.04410v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04364v1","updated":"2025-01-08T09:03:16Z","published":"2025-01-08T09:03:16Z","title":"An innovative data collection method to eliminate the preprocessing\n phase in web usage mining","summary":" The underlying data source for web usage mining (WUM) is commonly thought to\nbe server logs. However, access log files ensure quite limited data about the\nclients. Identifying sessions from this messy data takes a considerable effort,\nand operations performed for this purpose do not always yield excellent\nresults. Also, this data cannot be used for web analytics efficiently. This\nstudy proposes an innovative method for user tracking, session management, and\ncollecting web usage data. 
The method is mainly based on a new approach for\nusing collected data for web analytics extraction as the data source in web\nusage mining. An application-based API has been developed with a different\nstrategy from conventional client-side methods to obtain and process log data.\nThe log data has been successfully gathered by integrating the technique into\nan enterprise web application. The results reveal that the homogeneous\nstructured data collected and stored with this method is more convenient to\nbrowse, filter, and process than web server logs. This data stored on a\nrelational database can be used effortlessly as a reliable data source for\nhigh-performance web usage mining activity, real-time web analytics, or a\nfunctional recommendation system.\n","authors":["Ozkan Canay","Umit Kocabicak"],"pdf_url":"https://arxiv.org/pdf/2501.04364v1.pdf","comment":"15 pages, 8 figures"},{"id":"http://arxiv.org/abs/2308.12743v3","updated":"2025-01-08T07:00:36Z","published":"2023-08-24T12:45:02Z","title":"Network-Based Video Recommendation Using Viewing Patterns and Modularity\n Analysis: An Integrated Framework","summary":" The proliferation of video-on-demand (VOD) services has led to a paradox of\nchoice, overwhelming users with vast content libraries and revealing\nlimitations in current recommender systems. This research introduces a novel\napproach by combining implicit user data, such as viewing percentages, with\nsocial network analysis to enhance personalization in VOD platforms. The\nmethodology constructs user-item interaction graphs based on viewing patterns\nand applies centrality measures (degree, closeness, and betweenness) to\nidentify important videos. Modularity-based clustering groups related content,\nenabling personalized recommendations. The system was evaluated on a\ndocumentary-focused VOD platform with 328 users over four months. 
Results\nshowed significant improvements: a 63% increase in click-through rate (CTR), a\n24% increase in view completion rate, and a 17% improvement in user\nsatisfaction. The approach outperformed traditional methods like Naive Bayes\nand SVM. Future research should explore advanced techniques, such as matrix\nfactorization models, graph neural networks, and hybrid approaches combining\ncontent-based and collaborative filtering. Additionally, incorporating temporal\nmodels and addressing scalability challenges for large-scale platforms are\nessential next steps. This study contributes to the state of the art by\nintroducing modularity-based clustering and ego-centric ranking methods to\nenhance personalization in video recommendations. The findings suggest that\nintegrating network-based features and implicit feedback can significantly\nimprove user engagement, offering a cost-effective solution for VOD platforms\nto enhance recommendation quality.\n","authors":["Mehrdad Maghsoudi","Mohammad Hossein valikhani","Mohammad Hossein Zohdi"],"pdf_url":"https://arxiv.org/pdf/2308.12743v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.00309v2","updated":"2025-01-08T05:16:25Z","published":"2024-12-31T06:59:35Z","title":"Retrieval-Augmented Generation with Graphs (GraphRAG)","summary":" Retrieval-augmented generation (RAG) is a powerful technique that enhances\ndownstream task execution by retrieving additional information, such as\nknowledge, skills, and tools from external sources. Graph, by its intrinsic\n\"nodes connected by edges\" nature, encodes massive heterogeneous and relational\ninformation, making it a golden resource for RAG in tremendous real-world\napplications. As a result, we have recently witnessed increasing attention on\nequipping RAG with Graph, i.e., GraphRAG. 
However, unlike conventional RAG,\nwhere the retriever, generator, and external data sources can be uniformly\ndesigned in the neural-embedding space, the uniqueness of graph-structured\ndata, such as diverse-formatted and domain-specific relational knowledge, poses\nunique and significant challenges when designing GraphRAG for different\ndomains. Given the broad applicability, the associated design challenges, and\nthe recent surge in GraphRAG, a systematic and up-to-date survey of its key\nconcepts and techniques is urgently desired. Following this motivation, we\npresent a comprehensive and up-to-date survey on GraphRAG. Our survey first\nproposes a holistic GraphRAG framework by defining its key components,\nincluding query processor, retriever, organizer, generator, and data source.\nFurthermore, recognizing that graphs in different domains exhibit distinct\nrelational patterns and require dedicated designs, we review GraphRAG\ntechniques uniquely tailored to each domain. Finally, we discuss research\nchallenges and brainstorm directions to inspire cross-disciplinary\nopportunities. Our survey repository is publicly maintained at\nhttps://github.com/Graph-RAG/GraphRAG/.\n","authors":["Haoyu Han","Yu Wang","Harry Shomer","Kai Guo","Jiayuan Ding","Yongjia Lei","Mahantesh Halappanavar","Ryan A. Rossi","Subhabrata Mukherjee","Xianfeng Tang","Qi He","Zhigang Hua","Bo Long","Tong Zhao","Neil Shah","Amin Javari","Yinglong Xia","Jiliang Tang"],"pdf_url":"https://arxiv.org/pdf/2501.00309v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.09852v2","updated":"2025-01-08T01:44:07Z","published":"2024-11-15T00:20:36Z","title":"InterFormer: Towards Effective Heterogeneous Interaction Learning for\n Click-Through Rate Prediction","summary":" Click-through rate (CTR) prediction, which predicts the probability of a user\nclicking an ad, is a fundamental task in recommender systems. 
The emergence of\nheterogeneous information, such as user profile and behavior sequences, depicts\nuser interests from different aspects. A mutually beneficial integration of\nheterogeneous information is the cornerstone towards the success of CTR\nprediction. However, most of the existing methods suffer from two fundamental\nlimitations, including (1) insufficient inter-mode interaction due to the\nunidirectional information flow between modes, and (2) aggressive information\naggregation caused by early summarization, resulting in excessive information\nloss. To address the above limitations, we propose a novel module named\nInterFormer to learn heterogeneous information interaction in an interleaving\nstyle. To achieve better interaction learning, InterFormer enables\nbidirectional information flow for mutually beneficial learning across\ndifferent modes. To avoid aggressive information aggregation, we retain\ncomplete information in each data mode and use a separate bridging arch for\neffective information selection and summarization. Our proposed InterFormer\nachieves state-of-the-art performance on three public datasets and a\nlarge-scale industrial dataset.\n","authors":["Zhichen Zeng","Xiaolong Liu","Mengyue Hang","Xiaoyi Liu","Qinghai Zhou","Chaofei Yang","Yiqun Liu","Yichen Ruan","Laming Chen","Yuxin Chen","Yujia Hao","Jiaqi Xu","Jade Nie","Xi Liu","Buyun Zhang","Wei Wen","Siyang Yuan","Kai Wang","Wen-Yen Chen","Yiping Han","Huayu Li","Chunzhi Yang","Bo Long","Philip S. 
Yu","Hanghang Tong","Jiyan Yang"],"pdf_url":"https://arxiv.org/pdf/2411.09852v2.pdf","comment":"10 pages, 6 figures"},{"id":"http://arxiv.org/abs/2408.06653v3","updated":"2025-01-08T20:40:09Z","published":"2024-08-13T05:53:46Z","title":"Hierarchical Structured Neural Network: Efficient Retrieval Scaling for\n Large Scale Recommendation","summary":" Retrieval, the initial stage of a recommendation system, is tasked with\ndown-selecting items from a pool of tens of millions of candidates to a few\nthousands. Embedding Based Retrieval (EBR) has been a typical choice for this\nproblem, addressing the computational demands of deep neural networks across\nvast item corpora. EBR utilizes Two Tower or Siamese Networks to learn\nrepresentations for users and items, and employ Approximate Nearest Neighbor\n(ANN) search to efficiently retrieve relevant items. Despite its popularity in\nindustry, EBR faces limitations. The Two Tower architecture, relying on a\nsingle dot product interaction, struggles to capture complex data distributions\ndue to limited capability in learning expressive interactions between users and\nitems. Additionally, ANN index building and representation learning for user\nand item are often separate, leading to inconsistencies exacerbated by\nrepresentation (e.g. continuous online training) and item drift (e.g. items\nexpired and new items added). In this paper, we introduce the Hierarchical\nStructured Neural Network (HSNN), an efficient deep neural network model to\nlearn intricate user and item interactions beyond the commonly used dot product\nin retrieval tasks, achieving sublinear computational costs relative to corpus\nsize. A Modular Neural Network (MoNN) is designed to maintain high\nexpressiveness for interaction learning while ensuring efficiency. A mixture of\nMoNNs operate on a hierarchical item index to achieve extensive computation\nsharing, enabling it to scale up to large corpus size. 
MoNN and the\nhierarchical index are jointly learnt to continuously adapt to distribution\nshifts in both user interests and item distributions. HSNN achieves substantial\nimprovement in offline evaluation compared to prevailing methods.\n","authors":["Kaushik Rangadurai","Siyang Yuan","Minhui Huang","Yiqun Liu","Golnaz Ghasemiesfeh","Yunchen Pu","Haiyu Lu","Xingfeng He","Fangzhou Xu","Andrew Cui","Vidhoon Viswanathan","Lin Yang","Liang Wang","Jiyan Yang","Chonglin Sun"],"pdf_url":"https://arxiv.org/pdf/2408.06653v3.pdf","comment":"Resubmit"},{"id":"http://arxiv.org/abs/2501.04802v1","updated":"2025-01-08T19:29:33Z","published":"2025-01-08T19:29:33Z","title":"Reproducing HotFlip for Corpus Poisoning Attacks in Dense Retrieval","summary":" HotFlip is a topical gradient-based word substitution method for attacking\nlanguage models. Recently, this method has been further applied to attack\nretrieval systems by generating malicious passages that are injected into a\ncorpus, i.e., corpus poisoning. However, HotFlip is known to be computationally\ninefficient, with the majority of time being spent on gradient accumulation for\neach query-passage pair during the adversarial token generation phase, making\nit impossible to generate an adequate number of adversarial passages in a\nreasonable amount of time. Moreover, the attack method itself assumes access to\na set of user queries, a strong assumption that does not correspond to how\nreal-world adversarial attacks are usually performed. In this paper, we first\nsignificantly boost the efficiency of HotFlip, reducing the adversarial\ngeneration process from 4 hours per document to only 15 minutes, using the same\nhardware. We further contribute experiments and analysis on two additional\ntasks: (1) transfer-based black-box attacks, and (2) query-agnostic attacks.\nWhenever possible, we provide comparisons between the original method and our\nimproved version. 
Our experiments demonstrate that HotFlip can effectively\nattack a variety of dense retrievers, with an observed trend that its attack\nperformance diminishes against more advanced and recent methods. Interestingly,\nwe observe that while HotFlip performs poorly in a black-box setting,\nindicating limited capacity for generalization, in query-agnostic scenarios its\nperformance is correlated to the volume of injected adversarial passages.\n","authors":["Yongkang Li","Panagiotis Eustratiadis","Evangelos Kanoulas"],"pdf_url":"https://arxiv.org/pdf/2501.04802v1.pdf","comment":"This paper has been accepted for oral presentation in the\n reproducibility track at ECIR 2025"},{"id":"http://arxiv.org/abs/2501.04762v1","updated":"2025-01-08T18:08:48Z","published":"2025-01-08T18:08:48Z","title":"Efficient and Responsible Adaptation of Large Language Models for Robust\n and Equitable Top-k Recommendations","summary":" Conventional recommendation systems (RSs) are typically optimized to enhance\nperformance metrics uniformly across all training samples, inadvertently\noverlooking the needs of diverse user populations. The performance disparity\namong various populations can harm the model's robustness to sub-populations\ndue to the varying user properties. While large language models (LLMs) show\npromise in enhancing RS performance, their practical applicability is hindered\nby high costs, inference latency, and degraded performance on long user\nqueries. To address these challenges, we propose a hybrid task allocation\nframework designed to promote social good by equitably serving all user groups.\nBy adopting a two-phase approach, we promote a strategic assignment of tasks\nfor efficient and responsible adaptation of LLMs. Our strategy works by first\nidentifying the weak and inactive users that receive a suboptimal ranking\nperformance by RSs. 
Next, we use an in-context learning approach for such\nusers, wherein each user interaction history is contextualized as a distinct\nranking task. We evaluate our hybrid framework by incorporating eight different\nrecommendation algorithms and three different LLMs -- both open and\nclose-sourced. Our results on three real-world datasets show a significant\nreduction in weak users and improved robustness to subpopulations without\ndisproportionately escalating costs.\n","authors":["Kirandeep Kaur","Manya Chadha","Vinayak Gupta","Chirag Shah"],"pdf_url":"https://arxiv.org/pdf/2501.04762v1.pdf","comment":"arXiv admin note: text overlap with arXiv:2405.00824"},{"id":"http://arxiv.org/abs/2501.04763v1","updated":"2025-01-08T18:18:03Z","published":"2025-01-08T18:18:03Z","title":"Search engines in polarized media environment: Auditing political\n information curation on Google and Bing prior to 2024 US elections","summary":" Search engines play an important role in the context of modern elections. By\ncurating information in response to user queries, search engines influence how\nindividuals are informed about election-related developments and perceive the\nmedia environment in which elections take place. It has particular implications\nfor (perceived) polarization, especially if search engines' curation results in\na skewed treatment of information sources based on their political leaning.\nUntil now, however, it is unclear whether such a partisan gap emerges through\ninformation curation on search engines and what user- and system-side factors\naffect it. To address this shortcoming, we audit the two largest Western search\nengines, Google and Bing, prior to the 2024 US presidential elections and\nexamine how these search engines' organic search results and additional\ninterface elements represent election-related information depending on the\nqueries' slant, user location, and time when the search was conducted. 
Our\nfindings indicate that both search engines tend to prioritize left-leaning\nmedia sources, with the exact scope of search results' ideological slant\nvarying between Democrat- and Republican-focused queries. We also observe\nlimited effects of location- and time-based factors on organic search results,\nwhereas results for additional interface elements were more volatile over time\nand specific US states. Together, our observations highlight that search\nengines' information curation actively mirrors the partisan divides present in\nthe US media environments and has the potential to contribute to (perceived)\npolarization within these environments.\n","authors":["Mykola Makhortykh","Tobias Rorhbach","Maryna Sydorova","Elizaveta Kuznetsova"],"pdf_url":"https://arxiv.org/pdf/2501.04763v1.pdf","comment":"38 pages"},{"id":"http://arxiv.org/abs/2501.05485v1","updated":"2025-01-08T09:06:29Z","published":"2025-01-08T09:06:29Z","title":"S2 Chunking: A Hybrid Framework for Document Segmentation Through\n Integrated Spatial and Semantic Analysis","summary":" Document chunking is a critical task in natural language processing (NLP)\nthat involves dividing a document into meaningful segments. Traditional methods\noften rely solely on semantic analysis, ignoring the spatial layout of\nelements, which is crucial for understanding relationships in complex\ndocuments. This paper introduces a novel hybrid approach that combines layout\nstructure, semantic analysis, and spatial relationships to enhance the cohesion\nand accuracy of document chunks. By leveraging bounding box information (bbox)\nand text embeddings, our method constructs a weighted graph representation of\ndocument elements, which is then clustered using spectral clustering.\nExperimental results demonstrate that this approach outperforms traditional\nmethods, particularly in documents with diverse layouts such as reports,\narticles, and multi-column designs. 
The proposed method also ensures that no\nchunk exceeds a specified token length, making it suitable for use cases where\ntoken limits are critical (e.g., language models with input size limitations)\n","authors":["Prashant Verma"],"pdf_url":"https://arxiv.org/pdf/2501.05485v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.10415v1","updated":"2025-01-08T14:17:26Z","published":"2025-01-08T14:17:26Z","title":"Making Software FAIR: A machine-assisted workflow for the research\n software lifecycle","summary":" A key issue hindering discoverability, attribution and reusability of open\nresearch software is that its existence often remains hidden within the\nmanuscript of research papers. For these resources to become first-class\nbibliographic records, they first need to be identified and subsequently\nregistered with persistent identifiers (PIDs) to be made FAIR (Findable,\nAccessible, Interoperable and Reusable). To this day, much open research\nsoftware fails to meet FAIR principles and software resources are mostly not\nexplicitly linked from the manuscripts that introduced them or used them.\nSoFAIR is a 2-year international project (2024-2025) which proposes a solution\nto the above problem realised over the content available through the global\nnetwork of open repositories. 
SoFAIR will extend the capabilities of widely\nused open scholarly infrastructures (CORE, Software Heritage, HAL) and tools\n(GROBID) operated by the consortium partners, delivering and deploying an\neffective solution for the management of the research software lifecycle,\nincluding: 1) ML-assisted identification of research software assets from\nwithin the manuscripts of scholarly papers, 2) validation of the identified\nassets by authors, 3) registration of software assets with PIDs and their\narchival.\n","authors":["Petr Knoth","Laurent Romary","Patrice Lopez","Roberto Di Cosmo","Pavel Smrz","Tomasz Umerle","Melissa Harrison","Alain Monteil","Matteo Cancellieri","David Pride"],"pdf_url":"https://arxiv.org/pdf/2501.10415v1.pdf","comment":"5 pages"}],"Machine Learning":[{"id":"http://arxiv.org/abs/2409.08272v2","updated":"2025-01-08T18:59:48Z","published":"2024-09-12T17:59:04Z","title":"Click2Mask: Local Editing with Dynamic Mask Generation","summary":" Recent advancements in generative models have revolutionized image generation\nand editing, making these tasks accessible to non-experts. This paper focuses\non local image editing, particularly the task of adding new content to a\nloosely specified area. Existing methods often require a precise mask or a\ndetailed description of the location, which can be cumbersome and prone to\nerrors. We propose Click2Mask, a novel approach that simplifies the local\nediting process by requiring only a single point of reference (in addition to\nthe content description). A mask is dynamically grown around this point during\na Blended Latent Diffusion (BLD) process, guided by a masked CLIP-based\nsemantic loss. Click2Mask surpasses the limitations of segmentation-based and\nfine-tuning dependent methods, offering a more user-friendly and contextually\naccurate solution. 
Our experiments demonstrate that Click2Mask not only\nminimizes user effort but also enables competitive or superior local image\nmanipulations compared to SoTA methods, according to both human judgement and\nautomatic metrics. Key contributions include the simplification of user input,\nthe ability to freely add objects unconstrained by existing segments, and the\nintegration potential of our dynamic mask approach within other editing\nmethods.\n","authors":["Omer Regev","Omri Avrahami","Dani Lischinski"],"pdf_url":"https://arxiv.org/pdf/2409.08272v2.pdf","comment":"Accepted to AAAI 2025. Project page is available at\n https://omeregev.github.io/click2mask/"},{"id":"http://arxiv.org/abs/2309.10775v2","updated":"2025-01-08T18:59:39Z","published":"2023-09-19T17:21:12Z","title":"$O(k)$-Equivariant Dimensionality Reduction on Stiefel Manifolds","summary":" Many real-world datasets live on high-dimensional Stiefel and Grassmannian\nmanifolds, $V_k(\\mathbb{R}^N)$ and $Gr(k, \\mathbb{R}^N)$ respectively, and\nbenefit from projection onto lower-dimensional Stiefel and Grassmannian\nmanifolds. In this work, we propose an algorithm called \\textit{Principal\nStiefel Coordinates (PSC)} to reduce data dimensionality from $\nV_k(\\mathbb{R}^N)$ to $V_k(\\mathbb{R}^n)$ in an \\textit{$O(k)$-equivariant}\nmanner ($k \\leq n \\ll N$). We begin by observing that each element $\\alpha \\in\nV_n(\\mathbb{R}^N)$ defines an isometric embedding of $V_k(\\mathbb{R}^n)$ into\n$V_k(\\mathbb{R}^N)$. Next, we describe two ways of finding a suitable embedding\nmap $\\alpha$: one via an extension of principal component analysis\n($\\alpha_{PCA}$), and one that further minimizes data fit error using gradient\ndescent ($\\alpha_{GD}$). 
Then, we define a continuous and $O(k)$-equivariant\nmap $\\pi_\\alpha$ that acts as a \"closest point operator\" to project the data\nonto the image of $V_k(\\mathbb{R}^n)$ in $V_k(\\mathbb{R}^N)$ under the\nembedding determined by $\\alpha$, while minimizing distortion. Because this\ndimensionality reduction is $O(k)$-equivariant, these results extend to\nGrassmannian manifolds as well. Lastly, we show that $\\pi_{\\alpha_{PCA}}$\nglobally minimizes projection error in a noiseless setting, while\n$\\pi_{\\alpha_{GD}}$ achieves a meaningfully different and improved outcome when\nthe data does not lie exactly on the image of a linearly embedded\nlower-dimensional Stiefel manifold as above. Multiple numerical experiments\nusing synthetic and real-world data are performed.\n","authors":["Andrew Lee","Harlin Lee","Jose A. Perea","Nikolas Schonsheck","Madeleine Weinstein"],"pdf_url":"https://arxiv.org/pdf/2309.10775v2.pdf","comment":"26 pages, 8 figures, comments welcome!"},{"id":"http://arxiv.org/abs/2501.04700v1","updated":"2025-01-08T18:59:36Z","published":"2025-01-08T18:59:36Z","title":"Planarian Neural Networks: Evolutionary Patterns from Basic Bilateria\n Shaping Modern Artificial Neural Network Architectures","summary":" This study examined the viability of enhancing the prediction accuracy of\nartificial neural networks (ANNs) in image classification tasks by developing\nANNs with evolution patterns similar to those of biological neural networks.\nResNet is a widely used family of neural networks with both deep and wide\nvariants; therefore, it was selected as the base model for our investigation.\nThe aim of this study is to improve the image classification performance of\nANNs via a novel approach inspired by the biological nervous system\narchitecture of planarians, which comprises a brain and two nerve cords. We\nbelieve that the unique neural architecture of planarians offers valuable\ninsights into the performance enhancement of ANNs. 
The proposed planarian\nneural architecture-based neural network was evaluated on the CIFAR-10 and\nCIFAR-100 datasets. Our results indicate that the proposed method exhibits\nhigher prediction accuracy than the baseline neural network models in image\nclassification tasks. These findings demonstrate the significant potential of\nbiologically inspired neural network architectures in improving the performance\nof ANNs in a wide range of applications.\n","authors":["Ziyuan Huang","Mark Newman","Maria Vaida","Srikar Bellur","Roozbeh Sadeghian","Andrew Siu","Hui Wang","Kevin Huggins"],"pdf_url":"https://arxiv.org/pdf/2501.04700v1.pdf","comment":"11 pages, 9 figures"},{"id":"http://arxiv.org/abs/2501.04697v1","updated":"2025-01-08T18:58:48Z","published":"2025-01-08T18:58:48Z","title":"Grokking at the Edge of Numerical Stability","summary":" Grokking, the sudden generalization that occurs after prolonged overfitting,\nis a surprising phenomenon challenging our understanding of deep learning.\nAlthough significant progress has been made in understanding grokking, the\nreasons behind the delayed generalization and its dependence on regularization\nremain unclear. In this work, we argue that without regularization, grokking\ntasks push models to the edge of numerical stability, introducing floating\npoint errors in the Softmax function, which we refer to as Softmax Collapse\n(SC). We demonstrate that SC prevents grokking and that mitigating SC enables\ngrokking without regularization. Investigating the root cause of SC, we find\nthat beyond the point of overfitting, the gradients strongly align with what we\ncall the na\\\"ive loss minimization (NLM) direction. This component of the\ngradient does not alter the model's predictions but decreases the loss by\nscaling the logits, typically by scaling the weights along their current\ndirection. 
We show that this scaling of the logits explains the delay in\ngeneralization characteristic of grokking and eventually leads to SC, halting\nfurther learning. To validate our hypotheses, we introduce two key\ncontributions that address the challenges in grokking tasks: StableMax, a new\nactivation function that prevents SC and enables grokking without\nregularization, and $\\perp$Grad, a training algorithm that promotes quick\ngeneralization in grokking tasks by preventing NLM altogether. These\ncontributions provide new insights into grokking, elucidating its delayed\ngeneralization, reliance on regularization, and the effectiveness of existing\ngrokking-inducing methods. Code for this paper is available at\nhttps://github.com/LucasPrietoAl/grokking-at-the-edge-of-numerical-stability.\n","authors":["Lucas Prieto","Melih Barsbey","Pedro A. M. Mediano","Tolga Birdal"],"pdf_url":"https://arxiv.org/pdf/2501.04697v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04695v1","updated":"2025-01-08T18:58:22Z","published":"2025-01-08T18:58:22Z","title":"Re-ranking the Context for Multimodal Retrieval Augmented Generation","summary":" Retrieval-augmented generation (RAG) enhances large language models (LLMs) by\nincorporating external knowledge to generate a response within a context with\nimproved accuracy and reduced hallucinations. However, multi-modal RAG systems\nface unique challenges: (i) the retrieval process may select irrelevant entries\nto user query (e.g., images, documents), and (ii) vision-language models or\nmulti-modal language models like GPT-4o may hallucinate when processing these\nentries to generate RAG output. In this paper, we aim to address the first\nchallenge, i.e, improving the selection of relevant context from the\nknowledge-base in retrieval phase of the multi-modal RAG. 
Specifically, we\nleverage the relevancy score (RS) measure designed in our previous work for\nevaluating RAG performance to select more relevant entries in the retrieval\nprocess. Retrieval based on embeddings (e.g., CLIP-based embeddings) and cosine\nsimilarity usually performs poorly, particularly for multi-modal data. We show\nthat by using a more advanced relevancy measure, one can enhance the retrieval\nprocess by selecting more relevant pieces from the knowledge-base and\neliminating the irrelevant pieces from the context by adaptively selecting up\nto $k$ entries instead of a fixed number of entries. Our evaluation using the\nCOCO dataset demonstrates significant enhancement in selecting relevant context\nand accuracy of the generated response.\n","authors":["Matin Mortaheb","Mohammad A. Amir Khojastepour","Srimat T. Chakradhar","Sennur Ulukus"],"pdf_url":"https://arxiv.org/pdf/2501.04695v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04690v1","updated":"2025-01-08T18:53:50Z","published":"2025-01-08T18:53:50Z","title":"Comparative Analysis of Quantum and Classical Support Vector Classifiers\n for Software Bug Prediction: An Exploratory Study","summary":" Purpose: Quantum computing promises to transform problem-solving across\nvarious domains with rapid and practical solutions. Within Software Evolution\nand Maintenance, Quantum Machine Learning (QML) remains a mostly underexplored\ndomain, particularly in addressing challenges such as detecting buggy software\ncommits from code repositories. Methods: In this study, we investigate the\npractical application of Quantum Support Vector Classifiers (QSVC) for\ndetecting buggy software commits across 14 open-source software projects with\ndiverse dataset sizes encompassing 30,924 data instances. We compare the QML\nalgorithms PQSVC (Pegasos QSVC) and QSVC against the classical Support Vector\nClassifier (SVC). Our technique addresses large datasets in QSVC algorithms by\ndividing them into smaller subsets. 
We propose and evaluate an aggregation\nmethod to combine predictions from these models to detect the entire test\ndataset. We also introduce an incremental testing methodology to overcome the\ndifficulties of quantum feature mapping during the testing approach. Results:\nThe study shows the effectiveness of QSVC and PQSVC in detecting buggy software\ncommits. The aggregation technique successfully combines predictions from\nsmaller data subsets, enhancing the overall detection accuracy for the entire\ntest dataset. The incremental testing methodology effectively manages the\nchallenges associated with quantum feature mapping during the testing process.\nConclusion: We contribute to the advancement of QML algorithms in defect\nprediction, unveiling the potential for further research in this domain. The\nspecific scenario of the Short-Term Activity Frame (STAF) highlights the early\ndetection of buggy software commits during the initial developmental phases of\nsoftware systems, particularly when dataset sizes remain insufficient to train\nmachine learning models.\n","authors":["Md Nadim","Mohammad Hassan","Ashis Kumar Mandal","Chanchal K. Roy","Banani Roy","Kevin A. Schneider"],"pdf_url":"https://arxiv.org/pdf/2501.04690v1.pdf","comment":"Accepted for publication in the Springer Journal: Quantum Machine\n Intelligence (https://link.springer.com/journal/42484)"},{"id":"http://arxiv.org/abs/2501.04686v1","updated":"2025-01-08T18:49:41Z","published":"2025-01-08T18:49:41Z","title":"URSA: Understanding and Verifying Chain-of-thought Reasoning in\n Multimodal Mathematics","summary":" Chain-of-thought (CoT) reasoning has been widely applied in the mathematical\nreasoning of Large Language Models (LLMs). Recently, the introduction of\nderivative process supervision on CoT trajectories has sparked discussions on\nenhancing scaling capabilities during test time, thereby boosting the potential\nof these models. 
However, in multimodal mathematical reasoning, the scarcity of\nhigh-quality CoT training data has hindered existing models from achieving\nhigh-precision CoT reasoning and has limited the realization of reasoning\npotential during test time. In this work, we propose a three-module synthesis\nstrategy that integrates CoT distillation, trajectory-format rewriting, and\nformat unification. It results in a high-quality CoT reasoning instruction\nfine-tuning dataset in multimodal mathematics, MMathCoT-1M. We comprehensively\nvalidate the state-of-the-art (SOTA) performance of the trained URSA-7B model\non multiple multimodal mathematical benchmarks. For test-time scaling, we\nintroduce a data synthesis strategy that automatically generates process\nannotation datasets, known as DualMath-1.1M, focusing on both interpretation\nand logic. By further training URSA-7B on DualMath-1.1M, we transition from CoT\nreasoning capabilities to robust supervision abilities. The trained URSA-RM-7B\nacts as a verifier, effectively enhancing the performance of URSA-7B at test\ntime. URSA-RM-7B also demonstrates excellent out-of-distribution (OOD)\nverifying capabilities, showcasing its generalization. Model weights, training\ndata and code will be open-sourced.\n","authors":["Ruilin Luo","Zhuofan Zheng","Yifan Wang","Yiyao Yu","Xinzhe Ni","Zicheng Lin","Jin Zeng","Yujiu Yang"],"pdf_url":"https://arxiv.org/pdf/2501.04686v1.pdf","comment":"27 pages, 10 tables, 17 figures. The training data has been released.\n The code and model are currently undergoing internal review. They will be\n made available soon. 
Project url: https://ursa-math.github.io"},{"id":"http://arxiv.org/abs/2501.04683v1","updated":"2025-01-08T18:43:59Z","published":"2025-01-08T18:43:59Z","title":"Toward Sufficient Statistical Power in Algorithmic Bias Assessment: A\n Test for ABROCA","summary":" Algorithmic bias is a pressing concern in educational data mining (EDM), as\nit risks amplifying inequities in learning outcomes. The Area Between ROC\nCurves (ABROCA) metric is frequently used to measure discrepancies in model\nperformance across demographic groups to quantify overall model fairness.\nHowever, its skewed distribution--especially when class or group imbalances\nexist--makes significance testing challenging. This study investigates ABROCA's\ndistributional properties and contributes robust methods for its significance\ntesting. Specifically, we address (1) whether ABROCA follows any known\ndistribution, (2) how to reliably test for algorithmic bias using ABROCA, and\n(3) the statistical power achievable with ABROCA-based bias assessments under\ntypical EDM sample specifications. Simulation results confirm that ABROCA does\nnot match standard distributions, including those suited to accommodate\nskewness. We propose nonparametric randomization tests for ABROCA and\ndemonstrate that reliably detecting bias with ABROCA requires large sample\nsizes or substantial effect sizes, particularly in imbalanced settings.\nFindings suggest that ABROCA-based bias evaluation based on sample sizes common\nin EDM tends to be underpowered, undermining the reliability of conclusions\nabout model fairness. By offering open-source code to simulate power and\nstatistically test ABROCA, this paper aims to foster more reliable statistical\ntesting in EDM research. 
It supports broader efforts toward replicability and\nequity in educational modeling.\n","authors":["Conrad Borchers"],"pdf_url":"https://arxiv.org/pdf/2501.04683v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04675v1","updated":"2025-01-08T18:33:17Z","published":"2025-01-08T18:33:17Z","title":"Enhancing Financial VQA in Vision Language Models using Intermediate\n Structured Representations","summary":" Chart interpretation is crucial for visual data analysis, but accurately\nextracting information from charts poses significant challenges for automated\nmodels. This study investigates the fine-tuning of DEPLOT, a modality\nconversion module that translates the image of a plot or chart to a linearized\ntable, on a custom dataset of 50,000 bar charts. The dataset comprises simple,\nstacked, and grouped bar charts, targeting the unique structural features of\nthese visualizations. The finetuned DEPLOT model is evaluated against its base\nversion using a test set of 1,000 images and two metrics: Relative Mapping\nSimilarity (RMS), which measures categorical mapping accuracy, and Relative\nNumber Set Similarity (RNSS), which evaluates numerical interpretation\naccuracy. To further explore the reasoning capabilities of large language\nmodels (LLMs), we curate an additional set of 100 bar chart images paired with\nquestion answer sets. 
Our findings demonstrate that providing a structured\nintermediate table alongside the image significantly enhances LLM reasoning\nperformance compared to direct image queries.\n","authors":["Archita Srivastava","Abhas Kumar","Rajesh Kumar","Prabhakar Srinivasan"],"pdf_url":"https://arxiv.org/pdf/2501.04675v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.02788v2","updated":"2025-01-08T18:33:07Z","published":"2025-01-06T06:07:40Z","title":"GLoG-CSUnet: Enhancing Vision Transformers with Adaptable Radiomic\n Features for Medical Image Segmentation","summary":" Vision Transformers (ViTs) have shown promise in medical image semantic\nsegmentation (MISS) by capturing long-range correlations. However, ViTs often\nstruggle to model local spatial information effectively, which is essential for\naccurately segmenting fine anatomical details, particularly when applied to\nsmall datasets without extensive pre-training. We introduce Gabor and Laplacian\nof Gaussian Convolutional Swin Network (GLoG-CSUnet), a novel architecture\nenhancing Transformer-based models by incorporating learnable radiomic\nfeatures. This approach integrates dynamically adaptive Gabor and Laplacian of\nGaussian (LoG) filters to capture texture, edge, and boundary information,\nenhancing the feature representation processed by the Transformer model. Our\nmethod uniquely combines the long-range dependency modeling of Transformers\nwith the texture analysis capabilities of Gabor and LoG features. 
Evaluated on\nthe Synapse multi-organ and ACDC cardiac segmentation datasets, GLoG-CSUnet\ndemonstrates significant improvements over state-of-the-art models, achieving a\n1.14% increase in Dice score for Synapse and 0.99% for ACDC, with minimal\ncomputational overhead (only 15 and 30 additional parameters, respectively).\nGLoG-CSUnet's flexible design allows integration with various base models,\noffering a promising approach for incorporating radiomics-inspired feature\nextraction in Transformer architectures for medical image analysis. The code\nimplementation is available on GitHub at: https://github.com/HAAIL/GLoG-CSUnet.\n","authors":["Niloufar Eghbali","Hassan Bagher-Ebadian","Tuka Alhanai","Mohammad M. Ghassemi"],"pdf_url":"https://arxiv.org/pdf/2501.02788v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04667v1","updated":"2025-01-08T18:28:12Z","published":"2025-01-08T18:28:12Z","title":"Natural Variational Annealing for Multimodal Optimization","summary":" We introduce a new multimodal optimization approach called Natural\nVariational Annealing (NVA) that combines the strengths of three foundational\nconcepts to simultaneously search for multiple global and local modes of\nblack-box nonconvex objectives. First, it implements a simultaneous search by\nusing variational posteriors, such as, mixtures of Gaussians. Second, it\napplies annealing to gradually trade off exploration for exploitation. Finally,\nit learns the variational search distribution using natural-gradient learning\nwhere updates resemble well-known and easy-to-implement algorithms. The three\nconcepts come together in NVA giving rise to new algorithms and also allowing\nus to incorporate \"fitness shaping\", a core concept from evolutionary\nalgorithms. We assess the quality of search on simulations and compare them to\nmethods using gradient descent and evolution strategies. 
We also provide an\napplication to a real-world inverse problem in planetary science.\n","authors":["Tâm Le Minh","Julyan Arbel","Thomas Möllenhoff","Mohammad Emtiyaz Khan","Florence Forbes"],"pdf_url":"https://arxiv.org/pdf/2501.04667v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04752v2","updated":"2025-01-08T18:23:58Z","published":"2024-12-06T03:33:31Z","title":"GABAR: Graph Attention-Based Action Ranking for Relational Policy\n Learning","summary":" We propose a novel approach to learn relational policies for classical\nplanning based on learning to rank actions. We introduce a new graph\nrepresentation that explicitly captures action information and propose a Graph\nNeural Network architecture augmented with Gated Recurrent Units (GRUs) to\nlearn action rankings. Our model is trained on small problem instances and\ngeneralizes to significantly larger instances where traditional planning\nbecomes computationally expensive. Experimental results across standard\nplanning benchmarks demonstrate that our action-ranking approach achieves\ngeneralization to significantly larger problems than those used in training.\n","authors":["Rajesh Mangannavar","Stefan Lee","Alan Fern","Prasad Tadepalli"],"pdf_url":"https://arxiv.org/pdf/2412.04752v2.pdf","comment":"6 Pages, 1 figure. Updated acknowledgments"},{"id":"http://arxiv.org/abs/2412.01348v2","updated":"2025-01-08T18:20:46Z","published":"2024-12-02T10:19:36Z","title":"Hierarchical Object-Oriented POMDP Planning for Object Rearrangement","summary":" We present an online planning framework for solving multi-object\nrearrangement problems in partially observable, multi-room environments.\nCurrent object rearrangement solutions, primarily based on Reinforcement\nLearning or hand-coded planning methods, often lack adaptability to diverse\nchallenges. To address this limitation, we introduce a novel Hierarchical\nObject-Oriented Partially Observed Markov Decision Process (HOO-POMDP) planning\napproach. 
This approach comprises (a) an object-oriented POMDP planner\ngenerating sub-goals, (b) a set of low-level policies for sub-goal achievement,\nand (c) an abstraction system converting the continuous low-level world into a\nrepresentation suitable for abstract planning. We evaluate our system on\nvarying numbers of objects, rooms, and problem types in AI2-THOR simulated\nenvironments with promising results.\n","authors":["Rajesh Mangannavar","Alan Fern","Prasad Tadepalli"],"pdf_url":"https://arxiv.org/pdf/2412.01348v2.pdf","comment":"17 pages, 2 Figures. Preprint. Updated acknowledgments"},{"id":"http://arxiv.org/abs/2407.03289v2","updated":"2025-01-08T18:20:18Z","published":"2024-07-03T17:22:33Z","title":"Correlated Privacy Mechanisms for Differentially Private Distributed\n Mean Estimation","summary":" Differentially private distributed mean estimation (DP-DME) is a fundamental\nbuilding block in privacy-preserving federated learning, where a central server\nestimates the mean of $d$-dimensional vectors held by $n$ users while ensuring\n$(\\epsilon,\\delta)$-DP. Local differential privacy (LDP) and distributed DP\nwith secure aggregation (SA) are the most common notions of DP used in DP-DME\nsettings with an untrusted server. LDP provides strong resilience to dropouts,\ncolluding users, and adversarial attacks, but suffers from poor utility. In\ncontrast, SA-based DP-DME achieves an $O(n)$ utility gain over LDP in DME, but\nrequires increased communication and computation overheads and complex\nmulti-round protocols to handle dropouts and attacks. In this work, we present\na generalized framework for DP-DME that captures LDP and SA-based mechanisms\nas extreme cases. Our framework provides a foundation for developing and\nanalyzing a variety of DP-DME protocols that leverage correlated privacy\nmechanisms across users. 
To this end, we propose CorDP-DME, a novel DP-DME\nmechanism based on the correlated Gaussian mechanism, that spans the gap\nbetween DME with LDP and distributed DP. We prove that CorDP-DME offers a\nfavorable balance between utility and resilience to dropout and collusion. We\nprovide an information-theoretic analysis of CorDP-DME, and derive theoretical\nguarantees for utility under any given privacy parameters and dropout/colluding\nuser thresholds. Our results demonstrate that (anti) correlated Gaussian DP\nmechanisms can significantly improve utility in mean estimation tasks compared\nto LDP -- even in adversarial settings -- while maintaining better resilience\nto dropouts and attacks compared to distributed DP.\n","authors":["Sajani Vithana","Viveck R. Cadambe","Flavio P. Calmon","Haewon Jeong"],"pdf_url":"https://arxiv.org/pdf/2407.03289v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.17194v3","updated":"2025-01-08T18:18:51Z","published":"2024-10-22T17:13:34Z","title":"Representation Shattering in Transformers: A Synthetic Study with\n Knowledge Editing","summary":" Knowledge Editing (KE) algorithms alter models' weights to perform targeted\nupdates to incorrect, outdated, or otherwise unwanted factual associations. To\nbetter identify the possibilities and limitations of these approaches, recent\nwork has shown that applying KE can adversely affect models' factual recall\naccuracy and diminish their general reasoning abilities. While these studies\ngive broad insights into the potential harms of KE algorithms, e.g., via\nperformance evaluations on benchmarks, we argue little is understood as to why\nsuch destructive failures occur. Is it possible KE methods distort\nrepresentations of concepts beyond the targeted fact, hence hampering abilities\nat broad? If so, what is the extent of this distortion? 
Motivated by such\nquestions, we define a novel synthetic task wherein a Transformer is trained\nfrom scratch to internalize a \"structured\" knowledge graph. The structure\nenforces relationships between entities of the graph, such that editing a\nfactual association has \"trickling effects\" on other entities in the graph\n(e.g., altering X's parent is Y to Z affects who X's siblings' parent is).\nThrough evaluations of edited models and analysis of extracted representations,\nwe show that KE inadvertently affects representations of entities beyond the\ntargeted one, distorting relevant structures that allow a model to infer unseen\nknowledge about an entity. We call this phenomenon representation shattering\nand demonstrate that it results in degradation of factual recall and reasoning\nperformance more broadly. To corroborate our findings in a more naturalistic\nsetup, we perform preliminary experiments with pre-trained Llama and Mamba\nmodels, reproducing the representation shattering effect therein as well.\nOverall, our work yields a precise mechanistic hypothesis to explain why KE has\nadverse effects on model abilities.\n","authors":["Kento Nishi","Maya Okawa","Rahul Ramesh","Mikail Khona","Hidenori Tanaka","Ekdeep Singh Lubana"],"pdf_url":"https://arxiv.org/pdf/2410.17194v3.pdf","comment":"Under review"},{"id":"http://arxiv.org/abs/2501.04652v1","updated":"2025-01-08T18:05:30Z","published":"2025-01-08T18:05:30Z","title":"Multi-task retriever fine-tuning for domain-specific and efficient RAG","summary":" Retrieval-Augmented Generation (RAG) has become ubiquitous when deploying\nLarge Language Models (LLMs), as it can address typical limitations such as\ngenerating hallucinated or outdated information. However, when building\nreal-world RAG applications, practical issues arise. First, the retrieved\ninformation is generally domain-specific. 
Since it is computationally expensive\nto fine-tune LLMs, it is more feasible to fine-tune the retriever to improve\nthe quality of the data included in the LLM input. Second, as more applications\nare deployed in the same real-world system, one cannot afford to deploy\nseparate retrievers. Moreover, these RAG applications normally retrieve\ndifferent kinds of data. Our solution is to instruction fine-tune a small\nretriever encoder on a variety of domain-specific tasks to allow us to deploy\none encoder that can serve many use cases, thereby achieving low-cost,\nscalability, and speed. We show how this encoder generalizes to out-of-domain\nsettings as well as to an unseen retrieval task on real-world enterprise use\ncases.\n","authors":["Patrice Béchard","Orlando Marquez Ayala"],"pdf_url":"https://arxiv.org/pdf/2501.04652v1.pdf","comment":"9 pages, 2 figures. Submitted to NAACL 2025 Industry Track"},{"id":"http://arxiv.org/abs/2501.04641v1","updated":"2025-01-08T17:47:06Z","published":"2025-01-08T17:47:06Z","title":"A Statistical Theory of Contrastive Pre-training and Multimodal\n Generative AI","summary":" Multi-modal generative AI systems, such as those combining vision and\nlanguage, rely on contrastive pre-training to learn representations across\ndifferent modalities. While their practical benefits are widely acknowledged, a\nrigorous theoretical understanding of the contrastive pre-training framework\nremains limited. This paper develops a theoretical framework to explain the\nsuccess of contrastive pre-training in downstream tasks, such as zero-shot\nclassification, conditional diffusion models, and vision-language models. We\nintroduce the concept of approximate sufficient statistics, a generalization of\nthe classical sufficient statistics, and show that near-minimizers of the\ncontrastive pre-training loss are approximately sufficient, making them\nadaptable to diverse downstream tasks. 
We further propose the Joint Generative\nHierarchical Model for the joint distribution of images and text, showing that\ntransformers can efficiently approximate relevant functions within this model\nvia belief propagation. Building on this framework, we derive sample complexity\nguarantees for multi-modal learning based on contrastive pre-trained\nrepresentations. Numerical simulations validate these theoretical findings,\ndemonstrating the strong generalization performance of contrastively\npre-trained transformers in various multi-modal tasks.\n","authors":["Kazusato Oko","Licong Lin","Yuhang Cai","Song Mei"],"pdf_url":"https://arxiv.org/pdf/2501.04641v1.pdf","comment":"108 pages"},{"id":"http://arxiv.org/abs/2409.05901v2","updated":"2025-01-08T17:36:41Z","published":"2024-09-05T20:45:44Z","title":"Diffusion Map Autoencoder","summary":" In this work, we explore various modifications to diffusion maps (DMAP),\nincluding their incorporation into a layered sequential neural network model\ntrained with gradient descent. The result is a sequential neural network that\ninherits the interpretability of diffusion maps.\n","authors":["Julio Candanedo"],"pdf_url":"https://arxiv.org/pdf/2409.05901v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.13015v4","updated":"2025-01-08T17:14:40Z","published":"2023-11-21T21:44:28Z","title":"Fast and Interpretable Mortality Risk Scores for Critical Care Patients","summary":" Prediction of mortality in intensive care unit (ICU) patients typically\nrelies on black box models (that are unacceptable for use in hospitals) or\nhand-tuned interpretable models (that might lead to the loss in performance).\nWe aim to bridge the gap between these two categories by building on modern\ninterpretable ML techniques to design interpretable mortality risk scores that\nare as accurate as black boxes. 
We developed a new algorithm, GroupFasterRisk,\nwhich has several important benefits: it uses both hard and soft direct\nsparsity regularization, it incorporates group sparsity to allow more cohesive\nmodels, it allows for monotonicity constraint to include domain knowledge, and\nit produces many equally-good models, which allows domain experts to choose\namong them. For evaluation, we leveraged the largest existing public ICU\nmonitoring datasets (MIMIC III and eICU). Models produced by GroupFasterRisk\noutperformed OASIS and SAPS II scores and performed similarly to APACHE IV/IVa\nwhile using at most a third of the parameters. For patients with\nsepsis/septicemia, acute myocardial infarction, heart failure, and acute kidney\nfailure, GroupFasterRisk models outperformed OASIS and SOFA. Finally, different\nmortality prediction ML approaches performed better based on variables selected\nby GroupFasterRisk as compared to OASIS variables. GroupFasterRisk's models\nperformed better than risk scores currently used in hospitals, and on par with\nblack box ML models, while being orders of magnitude sparser. Because\nGroupFasterRisk produces a variety of risk scores, it allows design flexibility\n- the key enabler of practical model creation. 
GroupFasterRisk is a fast,\naccessible, and flexible procedure that allows learning a diverse set of sparse\nrisk scores for mortality prediction.\n","authors":["Chloe Qinyu Zhu","Muhang Tian","Lesia Semenova","Jiachang Liu","Jack Xu","Joseph Scarpa","Cynthia Rudin"],"pdf_url":"https://arxiv.org/pdf/2311.13015v4.pdf","comment":"This article has been accepted for publication in the Journal of the\n American Medical Informatics Association, published by Oxford University\n Press"},{"id":"http://arxiv.org/abs/2412.16780v2","updated":"2025-01-08T17:00:18Z","published":"2024-12-21T21:27:22Z","title":"Forget Vectors at Play: Universal Input Perturbations Driving Machine\n Unlearning in Image Classification","summary":" Machine unlearning (MU), which seeks to erase the influence of specific\nunwanted data from already-trained models, is becoming increasingly vital in\nmodel editing, particularly to comply with evolving data regulations like the\n``right to be forgotten''. Conventional approaches are predominantly\nmodel-based, typically requiring retraining or fine-tuning the model's weights\nto meet unlearning requirements. In this work, we approach the MU problem from\na novel input perturbation-based perspective, where the model weights remain\nintact throughout the unlearning process. We demonstrate the existence of a\nproactive input-based unlearning strategy, referred to forget vector, which can\nbe generated as an input-agnostic data perturbation and remains as effective as\nmodel-based approximate unlearning approaches. We also explore forget vector\narithmetic, whereby multiple class-specific forget vectors are combined through\nsimple operations (e.g., linear combinations) to generate new forget vectors\nfor unseen unlearning tasks, such as forgetting arbitrary subsets across\nclasses. Extensive experiments validate the effectiveness and adaptability of\nthe forget vector, showcasing its competitive performance relative to\nstate-of-the-art model-based methods. 
Codes are available at\nhttps://github.com/Changchangsun/Forget-Vector.\n","authors":["Changchang Sun","Ren Wang","Yihua Zhang","Jinghan Jia","Jiancheng Liu","Gaowen Liu","Sijia Liu","Yan Yan"],"pdf_url":"https://arxiv.org/pdf/2412.16780v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04882v1","updated":"2025-01-08T23:38:19Z","published":"2025-01-08T23:38:19Z","title":"Reach Measurement, Optimization and Frequency Capping In Targeted Online\n Advertising Under k-Anonymity","summary":" The growth in the use of online advertising to foster brand awareness over\nrecent years is largely attributable to the ubiquity of social media. One\npivotal technology contributing to the success of online brand advertising is\nfrequency capping, a mechanism that enables marketers to control the number of\ntimes an ad is shown to a specific user. However, the very foundation of this\ntechnology is being scrutinized as the industry gravitates towards advertising\nsolutions that prioritize user privacy. This paper delves into the issue of\nreach measurement and optimization within the context of $k$-anonymity, a\nprivacy-preserving model gaining traction across major online advertising\nplatforms. We outline how to report reach within this new privacy landscape and\ndemonstrate how probabilistic discounting, a probabilistic adaptation of\ntraditional frequency capping, can be employed to optimize campaign\nperformance. Experiments are performed to assess the trade-off between user\nprivacy and the efficacy of online brand advertising. 
Notably, we discern a\nsignificant dip in performance as soon as privacy is introduced, yet this comes\nwith a limited additional cost for advertising platforms to offer their users\nmore privacy.\n","authors":["Yuan Gao","Mu Qiao"],"pdf_url":"https://arxiv.org/pdf/2501.04882v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04881v1","updated":"2025-01-08T23:33:50Z","published":"2025-01-08T23:33:50Z","title":"Geophysical inverse problems with measurement-guided diffusion models","summary":" Solving inverse problems with the reverse process of a diffusion model\nrepresents an appealing avenue to produce highly realistic, yet diverse\nsolutions from incomplete and possibly noisy measurements, ultimately enabling\nuncertainty quantification at scale. However, because of the intractable nature\nof the score function of the likelihood term (i.e., $\\nabla_{\\mathbf{x}_t}\np(\\mathbf{y} | \\mathbf{x}_t)$), various samplers have been proposed in the\nliterature that use different (more or less accurate) approximations of such a\ngradient to guide the diffusion process towards solutions that match the\nobservations. In this work, I consider two sampling algorithms recently\nproposed under the name of Diffusion Posterior Sampling (DPS) and\nPseudo-inverse Guided Diffusion Model (PGDM), respectively. In DPS, the\nguidance term used at each step of the reverse diffusion process is obtained by\napplying the adjoint of the modeling operator to the residual obtained from a\none-step denoising estimate of the solution. On the other hand, PGDM utilizes a\npseudo-inverse operator that originates from the fact that the one-step\ndenoised solution is not assumed to be deterministic, rather modeled as a\nGaussian distribution. 
Through an extensive set of numerical examples on two\ngeophysical inverse problems (namely, seismic interpolation and seismic\ninversion), I show that two key aspects for the success of any\nmeasurement-guided diffusion process are: i) our ability to re-parametrize the\ninverse problem such that the sought after model is bounded between -1 and 1 (a\npre-requisite for any diffusion model); ii) the choice of the training dataset\nused to learn the implicit prior that guides the reverse diffusion process.\nNumerical examples on synthetic and field datasets reveal that PGDM outperforms\nDPS in both scenarios at limited additional cost.\n","authors":["Matteo Ravasi"],"pdf_url":"https://arxiv.org/pdf/2501.04881v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04880v1","updated":"2025-01-08T23:28:28Z","published":"2025-01-08T23:28:28Z","title":"Leveraging Log Probabilities in Language Models to Forecast Future\n Events","summary":" In the constantly changing field of data-driven decision making, accurately\npredicting future events is crucial for strategic planning in various sectors.\nThe emergence of Large Language Models (LLMs) marks a significant advancement\nin this area, offering advanced tools that utilise extensive text data for\nprediction. In this industry paper, we introduce a novel method for AI-driven\nforesight using LLMs. Building on top of previous research, we employ data on\ncurrent trends and their trajectories for generating forecasts on 15 different\ntopics. Subsequently, we estimate their probabilities via a multi-step approach\nbased on log probabilities. 
We show we achieve a Brier score of 0.186, meaning\na +26% improvement over random chance and a +19% improvement over\nwidely-available AI systems.\n","authors":["Tommaso Soru","Jim Marshall"],"pdf_url":"https://arxiv.org/pdf/2501.04880v1.pdf","comment":"5 pages, 4 figures"},{"id":"http://arxiv.org/abs/2501.04879v1","updated":"2025-01-08T23:22:08Z","published":"2025-01-08T23:22:08Z","title":"Multilinear Tensor Low-Rank Approximation for Policy-Gradient Methods in\n Reinforcement Learning","summary":" Reinforcement learning (RL) aims to estimate the action to take given a\n(time-varying) state, with the goal of maximizing a cumulative reward function.\nPredominantly, there are two families of algorithms to solve RL problems:\nvalue-based and policy-based methods, with the latter designed to learn a\nprobabilistic parametric policy from states to actions. Most contemporary\napproaches implement this policy using a neural network (NN). However, NNs\nusually face issues related to convergence, architectural suitability,\nhyper-parameter selection, and underutilization of the redundancies of the\nstate-action representations (e.g. locally similar states). This paper\npostulates multi-linear mappings to efficiently estimate the parameters of the\nRL policy. More precisely, we leverage the PARAFAC decomposition to design\ntensor low-rank policies. The key idea involves collecting the policy\nparameters into a tensor and leveraging tensor-completion techniques to enforce\nlow rank. We establish theoretical guarantees of the proposed methods for\nvarious policy classes and validate their efficacy through numerical\nexperiments. Specifically, we demonstrate that tensor low-rank policy models\nreduce computational and sample complexities in comparison to NN models while\nachieving similar rewards.\n","authors":["Sergio Rozada","Hoi-To Wai","Antonio G. 
Marques"],"pdf_url":"https://arxiv.org/pdf/2501.04879v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.00958v2","updated":"2025-01-08T23:16:20Z","published":"2024-05-02T02:50:58Z","title":"Generative manufacturing systems using diffusion models and ChatGPT","summary":" In this study, we introduce Generative Manufacturing Systems (GMS) as a novel\napproach to effectively manage and coordinate autonomous manufacturing assets,\nthereby enhancing their responsiveness and flexibility to address a wide array\nof production objectives and human preferences. Deviating from traditional\nexplicit modeling, GMS employs generative AI, including diffusion models and\nChatGPT, for implicit learning from envisioned futures, marking a shift from\nmodel-optimum to training-sampling decision-making. Through the integration\nof generative AI, GMS enables complex decision-making through interactive\ndialogue with humans, allowing manufacturing assets to generate multiple\nhigh-quality global decisions that can be iteratively refined based on human\nfeedback. Empirical findings showcase GMS's substantial improvement in system\nresilience and responsiveness to uncertainties, with decision times reduced\nfrom seconds to milliseconds. The study underscores the inherent creativity and\ndiversity in the generated solutions, facilitating human-centric\ndecision-making through seamless and continuous human-machine interactions.\n","authors":["Xingyu Li","Fei Tao","Wei Ye","Aydin Nassehi","John W. Sutherland"],"pdf_url":"https://arxiv.org/pdf/2405.00958v2.pdf","comment":"We are withdrawing this preprint to incorporate significant new\n results and expand the scope of the paper. 
We plan to resubmit a\n substantially revised version in the near future"},{"id":"http://arxiv.org/abs/2501.04871v1","updated":"2025-01-08T23:04:32Z","published":"2025-01-08T23:04:32Z","title":"RieszBoost: Gradient Boosting for Riesz Regression","summary":" Answering causal questions often involves estimating linear functionals of\nconditional expectations, such as the average treatment effect or the effect of\na longitudinal modified treatment policy. By the Riesz representation theorem,\nthese functionals can be expressed as the expected product of the conditional\nexpectation of the outcome and the Riesz representer, a key component in doubly\nrobust estimation methods. Traditionally, the Riesz representer is estimated\nindirectly by deriving its explicit analytical form, estimating its components,\nand substituting these estimates into the known form (e.g., the inverse\npropensity score). However, deriving or estimating the analytical form can be\nchallenging, and substitution methods are often sensitive to practical\npositivity violations, leading to higher variance and wider confidence\nintervals. In this paper, we propose a novel gradient boosting algorithm to\ndirectly estimate the Riesz representer without requiring its explicit\nanalytical form. This method is particularly suited for tabular data, offering\na flexible, nonparametric, and computationally efficient alternative to\nexisting methods for Riesz regression. Through simulation studies, we\ndemonstrate that our algorithm performs on par with or better than indirect\nestimation techniques across a range of functionals, providing a user-friendly\nand robust solution for estimating causal quantities.\n","authors":["Kaitlyn J. 
Lee","Alejandro Schuler"],"pdf_url":"https://arxiv.org/pdf/2501.04871v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04870v1","updated":"2025-01-08T23:03:18Z","published":"2025-01-08T23:03:18Z","title":"Deep Transfer $Q$-Learning for Offline Non-Stationary Reinforcement\n Learning","summary":" In dynamic decision-making scenarios across business and healthcare,\nleveraging sample trajectories from diverse populations can significantly\nenhance reinforcement learning (RL) performance for specific target\npopulations, especially when sample sizes are limited. While existing transfer\nlearning methods primarily focus on linear regression settings, they lack\ndirect applicability to reinforcement learning algorithms. This paper pioneers\nthe study of transfer learning for dynamic decision scenarios modeled by\nnon-stationary finite-horizon Markov decision processes, utilizing neural\nnetworks as powerful function approximators and backward inductive learning. We\ndemonstrate that naive sample pooling strategies, effective in regression\nsettings, fail in Markov decision processes. To address this challenge, we\nintroduce a novel ``re-weighted targeting procedure'' to construct\n``transferable RL samples'' and propose ``transfer deep $Q^*$-learning'',\nenabling neural network approximation with theoretical guarantees. We assume\nthat the reward functions are transferable and deal with both situations in\nwhich the transition densities are transferable or nontransferable. Our\nanalytical techniques for transfer learning in neural network approximation and\ntransition density transfers have broader implications, extending to supervised\ntransfer learning with neural networks and domain shift scenarios. 
Empirical\nexperiments on both synthetic and real datasets corroborate the advantages of\nour method, showcasing its potential for improving decision-making through\nstrategically constructing transferable RL samples in non-stationary\nreinforcement learning contexts.\n","authors":["Jinhang Chai","Elynn Chen","Jianqing Fan"],"pdf_url":"https://arxiv.org/pdf/2501.04870v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.19595v2","updated":"2025-01-08T22:29:03Z","published":"2024-10-25T14:43:32Z","title":"Mask-Weighted Spatial Likelihood Coding for Speaker-Independent Joint\n Localization and Mask Estimation","summary":" Due to their robustness and flexibility, neural-driven beamformers are a\npopular choice for speech separation in challenging environments with a varying\namount of simultaneous speakers alongside noise and reverberation.\nTime-frequency masks and relative directions of the speakers regarding a fixed\nspatial grid can be used to estimate the beamformer's parameters. To some\ndegree, speaker-independence is achieved by ensuring a greater amount of\nspatial partitions than speech sources. In this work, we analyze how to encode\nboth mask and positioning into such a grid to enable joint estimation of both\nquantities. We propose mask-weighted spatial likelihood coding and show that it\nachieves considerable performance in both tasks compared to baseline encodings\noptimized for either localization or mask estimation. In the same setup, we\ndemonstrate superiority for joint estimation of both quantities. Conclusively,\nwe propose a universal approach which can replace an upstream sound source\nlocalization system solely by adapting the training framework, making it highly\nrelevant in performance-critical scenarios.\n","authors":["Jakob Kienegger","Alina Mannanova","Timo Gerkmann"],"pdf_url":"https://arxiv.org/pdf/2410.19595v2.pdf","comment":"\\copyright 2025 IEEE. 
Personal use of this material is permitted.\n Permission from IEEE must be obtained for all other uses, in any current or\n future media, including reprinting/republishing this material for advertising\n or promotional purposes, creating new collective works, for resale or\n redistribution to servers or lists, or reuse of any copyrighted component of\n this work in other works"},{"id":"http://arxiv.org/abs/2501.03489v2","updated":"2025-01-08T22:22:43Z","published":"2025-01-07T03:17:47Z","title":"Entropy-Guided Attention for Private LLMs","summary":" The pervasiveness of proprietary language models has raised critical privacy\nconcerns, necessitating advancements in private inference (PI), where\ncomputations are performed directly on encrypted data without revealing users'\nsensitive information. While PI offers a promising solution, its practical\ndeployment is hindered by substantial communication and latency overheads,\nprimarily stemming from nonlinear operations. To address this, we introduce an\ninformation-theoretic framework to characterize the role of nonlinearities in\ndecoder-only language models, laying a principled foundation for optimizing\ntransformer-architectures tailored to the demands of PI.\n By leveraging Shannon's entropy as a quantitative measure, we uncover the\npreviously unexplored dual significance of nonlinearities: beyond ensuring\ntraining stability, they are crucial for maintaining attention head diversity.\nSpecifically, we find that their removal triggers two critical failure modes:\n{\\em entropy collapse} in deeper layers that destabilizes training, and {\\em\nentropic overload} in earlier layers that leads to under-utilization of\nMulti-Head Attention's (MHA) representational capacity.\n We propose an entropy-guided attention mechanism paired with a novel entropy\nregularization technique to mitigate entropic overload. 
Additionally, we\nexplore PI-friendly alternatives to layer normalization for preventing entropy\ncollapse and stabilizing the training of LLMs with reduced-nonlinearities. Our\nstudy bridges the gap between information theory and architectural design,\nestablishing entropy dynamics as a principled guide for developing efficient PI\narchitectures. The code and implementation are available at\nhttps://github.com/Nandan91/entropy-guided-attention-llm\n","authors":["Nandan Kumar Jha","Brandon Reagen"],"pdf_url":"https://arxiv.org/pdf/2501.03489v2.pdf","comment":"Accepted to the 6th AAAI Workshop on Privacy-Preserving Artificial\n Intelligence (PPAI), 2025. arXiv admin note: substantial text overlap with\n arXiv:2410.13060"},{"id":"http://arxiv.org/abs/2409.18153v2","updated":"2025-01-08T22:20:36Z","published":"2024-09-25T20:00:23Z","title":"Most Influential Subset Selection: Challenges, Promises, and Beyond","summary":" How can we attribute the behaviors of machine learning models to their\ntraining data? While the classic influence function sheds light on the impact\nof individual samples, it often fails to capture the more complex and\npronounced collective influence of a set of samples. To tackle this challenge,\nwe study the Most Influential Subset Selection (MISS) problem, which aims to\nidentify a subset of training samples with the greatest collective influence.\nWe conduct a comprehensive analysis of the prevailing approaches in MISS,\nelucidating their strengths and weaknesses. Our findings reveal that\ninfluence-based greedy heuristics, a dominant class of algorithms in MISS, can\nprovably fail even in linear regression. We delineate the failure modes,\nincluding the errors of influence function and the non-additive structure of\nthe collective influence. Conversely, we demonstrate that an adaptive version\nof these heuristics which applies them iteratively, can effectively capture the\ninteractions among samples and thus partially address the issues. 
Experiments\non real-world datasets corroborate these theoretical findings and further\ndemonstrate that the merit of adaptivity can extend to more complex scenarios\nsuch as classification tasks and non-linear neural networks. We conclude our\nanalysis by emphasizing the inherent trade-off between performance and\ncomputational efficiency, questioning the use of additive metrics such as the\nLinear Datamodeling Score, and offering a range of discussions.\n","authors":["Yuzheng Hu","Pingbang Hu","Han Zhao","Jiaqi W. Ma"],"pdf_url":"https://arxiv.org/pdf/2409.18153v2.pdf","comment":"Accepted at the 38th Conference on Neural Information Processing\n Systems (NeurIPS 2024) Edit: Added discussion on a concurrent work"},{"id":"http://arxiv.org/abs/2406.14469v6","updated":"2025-01-08T21:47:16Z","published":"2024-06-20T16:32:18Z","title":"Forecasting Symmetric Random Walks: A Fusion Approach","summary":" Forecasting random walks is notoriously challenging, with na\\\"ive prediction\nserving as a difficult-to-surpass baseline. To investigate the potential of\nusing movement predictions to improve point forecasts in this context, this\nstudy focuses on symmetric random walks, in which the target variable's future\nvalue is reformulated as a combination of its future movement and current\nvalue. The proposed forecasting method, termed the fusion of movement and\nna\\\"ive predictions (FMNP), is grounded in this reformulation. The simulation\nresults show that FMNP achieves statistically significant improvements over\nna\\\"ive prediction, even when the movement prediction accuracy is only slightly\nabove 0.50. In practice, movement predictions can be derived from the\ncomovement between an exogenous variable and the target variable and then\nlinearly combined with the na\\\"ive prediction to generate the final forecast.\nFMNP effectiveness was evaluated on four U.S. 
financial time series -- the\nclose prices of Boeing (BA), Brent crude oil (OIL), Halliburton (HAL), and\nSchlumberger (SLB) -- using the open price of the Financial Times Stock\nExchange (FTSE) index as the exogenous variable. In all the cases, FMNP\noutperformed the na\\\"ive prediction, demonstrating its efficacy in forecasting\nsymmetric random walks and its potential applicability to other forecasting\ntasks.\n","authors":["Cheng Zhang"],"pdf_url":"https://arxiv.org/pdf/2406.14469v6.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04845v1","updated":"2025-01-08T21:13:50Z","published":"2025-01-08T21:13:50Z","title":"Intelligent experiments through real-time AI: Fast Data Processing and\n Autonomous Detector Control for sPHENIX and future EIC detectors","summary":" This R\\&D project, initiated by the DOE Nuclear Physics AI-Machine Learning\ninitiative in 2022, leverages AI to address data processing challenges in\nhigh-energy nuclear experiments (RHIC, LHC, and future EIC). Our focus is on\ndeveloping a demonstrator for real-time processing of high-rate data streams\nfrom sPHENIX experiment tracking detectors. The limitations of a 15 kHz maximum\ntrigger rate imposed by the calorimeters can be negated by intelligent use of\nstreaming technology in the tracking system. The approach efficiently\nidentifies low momentum rare heavy flavor events in high-rate p+p collisions\n(3MHz), using Graph Neural Network (GNN) and High Level Synthesis for Machine\nLearning (hls4ml). Success at sPHENIX promises immediate benefits, minimizing\nresources and accelerating the heavy-flavor measurements. The approach is\ntransferable to other fields. For the EIC, we develop a DIS-electron tagger\nusing Artificial Intelligence - Machine Learning (AI-ML) algorithms for\nreal-time identification, showcasing the transformative potential of AI and\nFPGA technologies in high-energy nuclear and particle experiments real-time\ndata processing pipelines.\n","authors":["J. Kvapil","G. 
Borca-Tasciuc","H. Bossi","K. Chen","Y. Chen","Y. Corrales Morales","H. Da Costa","C. Da Silva","C. Dean","J. Durham","S. Fu","C. Hao","P. Harris","O. Hen","H. Jheng","Y. Lee","P. Li","X. Li","Y. Lin","M. X. Liu","V. Loncar","J. P. Mitrevski","A. Olvera","M. L. Purschke","J. S. Renck","G. Roland","J. Schambach","Z. Shi","N. Tran","N. Wuerfel","B. Xu","D. Yu","H. Zhang"],"pdf_url":"https://arxiv.org/pdf/2501.04845v1.pdf","comment":"proceedings for 42nd International Conference on High Energy Physics\n (ICHEP2024), 18-24 July 2024, Prague, Czech Republic"},{"id":"http://arxiv.org/abs/2408.08260v2","updated":"2025-01-08T21:12:48Z","published":"2024-08-15T17:01:00Z","title":"GSVD-NMF: Recovering Missing Features in Non-negative Matrix\n Factorization","summary":" Non-negative matrix factorization (NMF) is an important tool in signal\nprocessing and widely used to separate mixed sources into their components.\nAlgorithms for NMF require that the user choose the number of components in\nadvance, and if the results are unsatisfying one typically needs to start again\nwith a different number of components. To make NMF more interactive and\nincremental, here we introduce GSVD-NMF, a method that proposes new components\nbased on the generalized singular value decomposition (GSVD) to address\ndiscrepancies between the initial under-complete NMF results and the SVD of the\noriginal matrix. Simulation and experimental results demonstrate that GSVD-NMF\noften effectively recovers multiple missing components in under-complete NMF,\nwith the recovered NMF solutions frequently reaching better local optima. The\nresults further show that GSVD-NMF is compatible with various NMF algorithms\nand that directly augmenting components is more efficient than rerunning NMF\nfrom scratch with additional components. 
By deliberately starting from\nunder-complete NMF, GSVD-NMF has the potential to be a recommended approach for\na range of general NMF applications.\n","authors":["Youdong Guo","Timothy E. Holy"],"pdf_url":"https://arxiv.org/pdf/2408.08260v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.16542v3","updated":"2025-01-08T21:05:26Z","published":"2024-03-25T08:35:19Z","title":"Differentially Private Online Federated Learning with Correlated Noise","summary":" We introduce a novel differentially private algorithm for online federated\nlearning that employs temporally correlated noise to enhance utility while\nensuring privacy of continuously released models. To address challenges posed\nby DP noise and local updates with streaming non-iid data, we develop a\nperturbed iterate analysis to control the impact of the DP noise on the\nutility. Moreover, we demonstrate how the drift errors from local updates can\nbe effectively managed under a quasi-strong convexity condition. Subject to an\n$(\\epsilon, \\delta)$-DP budget, we establish a dynamic regret bound over the\nentire time horizon, quantifying the impact of key parameters and the intensity\nof changes in dynamic environments. Numerical experiments confirm the efficacy\nof the proposed algorithm.\n","authors":["Jiaojiao Zhang","Linglingzhi Zhu","Mikael Johansson"],"pdf_url":"https://arxiv.org/pdf/2403.16542v3.pdf","comment":"11 pages"},{"id":"http://arxiv.org/abs/2411.18752v2","updated":"2025-01-08T20:51:17Z","published":"2024-11-27T20:56:43Z","title":"Locally Differentially Private Online Federated Learning With Correlated\n Noise","summary":" We introduce a locally differentially private (LDP) algorithm for online\nfederated learning that employs temporally correlated noise to improve utility\nwhile preserving privacy. To address challenges posed by the correlated noise\nand local updates with streaming non-IID data, we develop a perturbed iterate\nanalysis that controls the impact of the noise on the utility. 
Moreover, we\ndemonstrate how the drift errors from local updates can be effectively managed\nfor several classes of nonconvex loss functions. Subject to an\n$(\\epsilon,\\delta)$-LDP budget, we establish a dynamic regret bound that\nquantifies the impact of key parameters and the intensity of changes in the\ndynamic environment on the learning performance. Numerical experiments confirm\nthe efficacy of the proposed algorithm.\n","authors":["Jiaojiao Zhang","Linglingzhi Zhu","Dominik Fay","Mikael Johansson"],"pdf_url":"https://arxiv.org/pdf/2411.18752v2.pdf","comment":"arXiv admin note: text overlap with arXiv:2403.16542"},{"id":"http://arxiv.org/abs/2501.01447v2","updated":"2025-01-08T20:50:40Z","published":"2024-12-30T02:48:40Z","title":"Analyzing Country-Level Vaccination Rates and Determinants of Practical\n Capacity to Administer COVID-19 Vaccines","summary":" The COVID-19 vaccine development, manufacturing, transportation, and\nadministration proved an extreme logistics operation of global magnitude.\nGlobal vaccination levels, however, remain a key concern in preventing the\nemergence of new strains and minimizing the impact of the pandemic's disruption\nof daily life. In this paper, country-level vaccination rates are analyzed\nthrough a queuing framework to extract service rates that represent the\npractical capacity of a country to administer vaccines. These rates are further\ncharacterized through regression and interpretable machine learning methods\nwith country-level demographic, governmental, and socio-economic variates.\nModel results show that participation in multi-governmental collaborations such\nas COVAX may improve the ability to vaccinate. 
Similarly, improved\ntransportation and accessibility variates such as roads per area for low-income\ncountries and rail lines per area for high-income countries can improve rates.\nIt was also found that for low-income countries specifically, improvements in\nbasic and health infrastructure (as measured through spending on healthcare,\nnumber of doctors and hospital beds per 100k, population percent with access to\nelectricity, life expectancy, and vehicles per 1000 people) resulted in higher\nvaccination rates. Of the high-income countries, those with larger 65-plus\npopulations struggled to vaccinate at high rates, indicating potential\naccessibility issues for the elderly. This study finds that improving basic and\nhealth infrastructure, focusing on accessibility in the last mile, particularly\nfor the elderly, and fostering global partnerships can improve logistical\noperations of such a scale. Such structural impediments and inequities in\nglobal health care must be addressed in preparation for future global public\nhealth crises.\n","authors":["Sharika J. Hegde","Max T. M. Ng","Marcos Rios","Hani S. Mahmassani","Ying Chen","Karen Smilowitz"],"pdf_url":"https://arxiv.org/pdf/2501.01447v2.pdf","comment":"Under consideration for more thorough analysis"},{"id":"http://arxiv.org/abs/2501.04831v1","updated":"2025-01-08T20:36:40Z","published":"2025-01-08T20:36:40Z","title":"Quantum Hybrid Support Vector Machines for Stress Detection in Older\n Adults","summary":" Stress can increase the possibility of cognitive impairment and decrease the\nquality of life in older adults. Smart healthcare can deploy quantum machine\nlearning to enable preventive and diagnostic support. This work introduces a\nunique technique to address stress detection as an anomaly detection problem\nthat uses quantum hybrid support vector machines. 
With the help of a wearable\nsmartwatch, we mapped baseline sensor reading as normal data and stressed\nsensor reading as anomaly data using cortisol concentration as the ground\ntruth. We have used quantum computing techniques to explore the complex feature\nspaces with kernel-based preprocessing. We illustrate the usefulness of our\nmethod by doing experimental validation on 40 older adults with the help of the\nTSST protocol. Our findings highlight that using a limited number of features,\nquantum machine learning provides improved accuracy compared to classical\nmethods. We also observed that the recall value using quantum machine learning\nis higher compared to the classical method. The higher recall value illustrates\nthe potential of quantum machine learning in healthcare, as missing anomalies\ncould result in delayed diagnostics or treatment.\n","authors":["Md Saif Hassan Onim","Travis S. Humble","Himanshu Thapliyal"],"pdf_url":"https://arxiv.org/pdf/2501.04831v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15237v3","updated":"2025-01-08T20:34:02Z","published":"2024-08-27T17:56:11Z","title":"The Mamba in the Llama: Distilling and Accelerating Hybrid Models","summary":" Linear RNN architectures, like Mamba, can be competitive with Transformer\nmodels in language modeling while having advantageous deployment\ncharacteristics. Given the focus on training large-scale Transformer models, we\nconsider the challenge of converting these pretrained models for deployment. We\ndemonstrate that it is feasible to distill large Transformers into linear RNNs\nby reusing the linear projection weights from attention layers with academic\nGPU resources. The resulting hybrid model, which incorporates a quarter of the\nattention layers, achieves performance comparable to the original Transformer\nin chat benchmarks and outperforms open-source hybrid Mamba models trained from\nscratch with trillions of tokens in both chat benchmarks and general\nbenchmarks. 
Moreover, we introduce a hardware-aware speculative decoding\nalgorithm that accelerates the inference speed of Mamba and hybrid models.\nOverall we show how, with limited computation resources, we can remove many of\nthe original attention layers and generate from the resulting model more\nefficiently. Our top-performing model, distilled from Llama3-8B-Instruct,\nachieves a 29.61 length-controlled win rate on AlpacaEval 2 against GPT-4 and\n7.35 on MT-Bench, surpassing the best 8B scale instruction-tuned linear RNN\nmodel. We also find that the distilled model has natural length extrapolation,\nshowing almost perfect accuracy in the needle-in-a-haystack test at 20x the\ndistillation length. Code and pre-trained checkpoints are open-sourced at\nhttps://github.com/jxiw/MambaInLlama and\nhttps://github.com/itsdaniele/speculative_mamba.\n","authors":["Junxiong Wang","Daniele Paliotta","Avner May","Alexander M. Rush","Tri Dao"],"pdf_url":"https://arxiv.org/pdf/2408.15237v3.pdf","comment":"NeurIPS 2024. v3 updates: fix format errors"},{"id":"http://arxiv.org/abs/2501.04826v1","updated":"2025-01-08T20:26:13Z","published":"2025-01-08T20:26:13Z","title":"Intelligent Gradient Boosting Algorithms for Estimating Strength of\n Modified Subgrade Soil","summary":" The performance of pavement under loading depends on the strength of the\nsubgrade. However, experimental estimation of properties of pavement strengths\nsuch as California bearing ratio (CBR), unconfined compressive strength (UCS)\nand resistance value (R) are often tedious, time-consuming and costly, thereby\ninspiring a growing interest in machine learning based tools which are simple,\ncheap and fast alternatives. 
Thus, the potential application of two boosting\ntechniques, categorical boosting (CatBoost) and extreme gradient boosting\n(XGBoost), alongside support vector regression (SVR), is explored in this\nstudy for estimation of properties of subgrade soil modified with hydrated lime\nactivated rice husk ash (HARSH). Using 121 experimental data samples of varying\nproportions of HARSH, plastic limit, liquid limit, plasticity index, clay\nactivity, optimum moisture content, and maximum dry density as input for CBR,\nUCS and R estimation, four evaluation metrics namely coefficient of\ndetermination (R2), root mean squared error (RMSE), mean absolute error (MAE)\nand mean absolute percentage error (MAPE) are used to evaluate the models'\nperformance. The results indicate that XGBoost outperformed CatBoost and SVR in\nestimating these properties, yielding R2 of 0.9994, 0.9995 and 0.9999 in\nestimating the CBR, UCS and R respectively. Also, SVR outperformed CatBoost in\nestimating the CBR and R with R2 of 0.9997 respectively. On the other hand,\nCatBoost outperformed SVR in estimating the UCS with R2 of 0.9994. Feature\nsensitivity analysis shows that the three machine learning techniques agree\nthat increasing the HARSH proportion leads to increased values of the estimated\nproperties. A comparison with previous results also shows the\nsuperiority of XGBoost in estimating subgrade properties.\n","authors":["Ismail B. Mustapha","Muyideen Abdulkareem","Shafaatunnur Hasan","Abideen Ganiyu","Hatem Nabus","Jin Chai Lee"],"pdf_url":"https://arxiv.org/pdf/2501.04826v1.pdf","comment":"17 pages"},{"id":"http://arxiv.org/abs/2501.04817v1","updated":"2025-01-08T20:14:07Z","published":"2025-01-08T20:14:07Z","title":"Decentralised Resource Sharing in TinyML: Wireless Bilayer Gossip\n Parallel SGD for Collaborative Learning","summary":" With the growing computational capabilities of microcontroller units (MCUs),\nedge devices can now support machine learning models. 
However, deploying\ndecentralised federated learning (DFL) on such devices presents key challenges,\nincluding intermittent connectivity, limited communication range, and dynamic\nnetwork topologies. This paper proposes a novel framework, bilayer Gossip\nDecentralised Parallel Stochastic Gradient Descent (GD PSGD), designed to\naddress these issues in resource-constrained environments. The framework\nincorporates a hierarchical communication structure using Distributed Kmeans\n(DKmeans) clustering for geographic grouping and a gossip protocol for\nefficient model aggregation across two layers: intra-cluster and inter-cluster.\nWe evaluate the framework's performance against the Centralised Federated\nLearning (CFL) baseline using the MCUNet model on the CIFAR-10 dataset under\nIID and Non-IID conditions. Results demonstrate that the proposed method\nachieves comparable accuracy to CFL on IID datasets, requiring only 1.8\nadditional rounds for convergence. On Non-IID datasets, the accuracy loss\nremains under 8\\% for moderate data imbalance. These findings highlight the\nframework's potential to support scalable and privacy-preserving learning on\nedge devices with minimal performance trade-offs.\n","authors":["Ziyuan Bao","Eiman Kanjo","Soumya Banerjee","Hasib-Al Rashid","Tinoosh Mohsenin"],"pdf_url":"https://arxiv.org/pdf/2501.04817v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04816v1","updated":"2025-01-08T20:12:33Z","published":"2025-01-08T20:12:33Z","title":"Probabilistic Skip Connections for Deterministic Uncertainty\n Quantification in Deep Neural Networks","summary":" Deterministic uncertainty quantification (UQ) in deep learning aims to\nestimate uncertainty with a single pass through a network by leveraging outputs\nfrom the network's feature extractor. Existing methods require that the feature\nextractor be both sensitive and smooth, ensuring meaningful input changes\nproduce meaningful changes in feature vectors. 
Smoothness enables\ngeneralization, while sensitivity prevents feature collapse, where distinct\ninputs are mapped to identical feature vectors. To meet these requirements,\ncurrent deterministic methods often retrain networks with spectral\nnormalization. Instead of modifying training, we propose using measures of\nneural collapse to identify an existing intermediate layer that is both\nsensitive and smooth. We then fit a probabilistic model to the feature vector\nof this intermediate layer, which we call a probabilistic skip connection\n(PSC). Through empirical analysis, we explore the impact of spectral\nnormalization on neural collapse and demonstrate that PSCs can effectively\ndisentangle aleatoric and epistemic uncertainty. Additionally, we show that\nPSCs achieve uncertainty quantification and out-of-distribution (OOD) detection\nperformance that matches or exceeds existing single-pass methods requiring\ntraining modifications. By retrofitting existing models, PSCs enable\nhigh-quality UQ and OOD capabilities without retraining.\n","authors":["Felix Jimenez","Matthias Katzfuss"],"pdf_url":"https://arxiv.org/pdf/2501.04816v1.pdf","comment":"15 pages, 9 figures"},{"id":"http://arxiv.org/abs/2412.16339v2","updated":"2025-01-08T20:11:59Z","published":"2024-12-20T21:00:11Z","title":"Deliberative Alignment: Reasoning Enables Safer Language Models","summary":" As large-scale language models increasingly impact safety-critical domains,\nensuring their reliable adherence to well-defined principles remains a\nfundamental challenge. We introduce Deliberative Alignment, a new paradigm that\ndirectly teaches the model safety specifications and trains it to explicitly\nrecall and accurately reason over the specifications before answering. We used\nthis approach to align OpenAI's o-series models, and achieved highly precise\nadherence to OpenAI's safety policies, without requiring human-written\nchain-of-thoughts or answers. 
Deliberative Alignment pushes the Pareto frontier\nby simultaneously increasing robustness to jailbreaks while decreasing\noverrefusal rates, and also improves out-of-distribution generalization. We\ndemonstrate that reasoning over explicitly specified policies enables more\nscalable, trustworthy, and interpretable alignment.\n","authors":["Melody Y. Guan","Manas Joglekar","Eric Wallace","Saachi Jain","Boaz Barak","Alec Helyar","Rachel Dias","Andrea Vallone","Hongyu Ren","Jason Wei","Hyung Won Chung","Sam Toyer","Johannes Heidecke","Alex Beutel","Amelia Glaese"],"pdf_url":"https://arxiv.org/pdf/2412.16339v2.pdf","comment":"24 pages"},{"id":"http://arxiv.org/abs/2501.01950v2","updated":"2025-01-08T20:09:16Z","published":"2025-01-03T18:54:26Z","title":"MADGEN: Mass-Spec attends to De Novo Molecular generation","summary":" The annotation (assigning structural chemical identities) of MS/MS spectra\nremains a significant challenge due to the enormous molecular diversity in\nbiological samples and the limited scope of reference databases. Currently, the\nvast majority of spectral measurements remain in the \"dark chemical space\"\nwithout structural annotations. To improve annotation, we propose MADGEN\n(Mass-spec Attends to De Novo Molecular GENeration), a scaffold-based method\nfor de novo molecular structure generation guided by mass spectrometry data.\nMADGEN operates in two stages: scaffold retrieval and spectra-conditioned\nmolecular generation starting with the scaffold. In the first stage, given an\nMS/MS spectrum, we formulate scaffold retrieval as a ranking problem and employ\ncontrastive learning to align mass spectra with candidate molecular scaffolds.\nIn the second stage, starting from the retrieved scaffold, we employ the MS/MS\nspectrum to guide an attention-based generative model to generate the final\nmolecule. Our approach constrains the molecular generation search space,\nreducing its complexity and improving generation accuracy. 
We evaluate MADGEN\non three datasets (NIST23, CANOPUS, and MassSpecGym) and evaluate MADGEN's\nperformance with a predictive scaffold retriever and with an oracle retriever.\nWe demonstrate the effectiveness of using attention to integrate spectral\ninformation throughout the generation process to achieve strong results with\nthe oracle retriever.\n","authors":["Yinkai Wang","Xiaohui Chen","Liping Liu","Soha Hassoun"],"pdf_url":"https://arxiv.org/pdf/2501.01950v2.pdf","comment":"preprint"},{"id":"http://arxiv.org/abs/2501.04811v1","updated":"2025-01-08T19:59:48Z","published":"2025-01-08T19:59:48Z","title":"Fast, Fine-Grained Equivalence Checking for Neural Decompilers","summary":" Neural decompilers are machine learning models that reconstruct the source\ncode from an executable program. Critical to the lifecycle of any machine\nlearning model is an evaluation of its effectiveness. However, existing\ntechniques for evaluating neural decompilation models have substantial\nweaknesses, especially when it comes to showing the correctness of the neural\ndecompiler's predictions. To address this, we introduce codealign, a novel\ninstruction-level code equivalence technique designed for neural decompilers.\nWe provide a formal definition of a relation between equivalent instructions,\nwhich we term an equivalence alignment. We show how codealign generates\nequivalence alignments, then evaluate codealign by comparing it with symbolic\nexecution. Finally, we show how the information codealign provides-which parts\nof the functions are equivalent and how well the variable names match-is\nsubstantially more detailed than existing state-of-the-art evaluation metrics,\nwhich report unitless numbers measuring similarity.\n","authors":["Luke Dramko","Claire Le Goues","Edward J. 
Schwartz"],"pdf_url":"https://arxiv.org/pdf/2501.04811v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04794v1","updated":"2025-01-08T19:18:44Z","published":"2025-01-08T19:18:44Z","title":"A Steerable Deep Network for Model-Free Diffusion MRI Registration","summary":" Nonrigid registration is vital to medical image analysis but remains\nchallenging for diffusion MRI (dMRI) due to its high-dimensional,\norientation-dependent nature. While classical methods are accurate, they are\ncomputationally demanding, and deep neural networks, though efficient, have\nbeen underexplored for nonrigid dMRI registration compared to structural\nimaging. We present a novel, deep learning framework for model-free, nonrigid\nregistration of raw diffusion MRI data that does not require explicit\nreorientation. Unlike previous methods relying on derived representations such\nas diffusion tensors or fiber orientation distribution functions, in our\napproach, we formulate the registration as an equivariant diffeomorphism of\nposition-and-orientation space. Central to our method is an\n$\\mathsf{SE}(3)$-equivariant UNet that generates velocity fields while\npreserving the geometric properties of a raw dMRI's domain. We introduce a new\nloss function based on the maximum mean discrepancy in Fourier space,\nimplicitly matching ensemble average propagators across images. Experimental\nresults on Human Connectome Project dMRI data demonstrate competitive\nperformance compared to state-of-the-art approaches, with the added advantage\nof bypassing the overhead for estimating derived representations. This work\nestablishes a foundation for data-driven, geometry-aware dMRI registration\ndirectly in the acquisition space.\n","authors":["Gianfranco Cortes","Baba C. 
Vemuri"],"pdf_url":"https://arxiv.org/pdf/2501.04794v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.11977v2","updated":"2025-01-08T19:17:14Z","published":"2024-10-15T18:33:42Z","title":"Generative AI Policies under the Microscope: How CS Conferences Are\n Navigating the New Frontier in Scholarly Writing","summary":" This paper explores the current state of generative AI policies of computer\nscience conferences and offers guidelines for policy adoption.\n","authors":["Mahjabin Nahar","Sian Lee","Becky Guillen","Dongwon Lee"],"pdf_url":"https://arxiv.org/pdf/2410.11977v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04784v1","updated":"2025-01-08T19:02:32Z","published":"2025-01-08T19:02:32Z","title":"Leveraging Registers in Vision Transformers for Robust Adaptation","summary":" Vision Transformers (ViTs) have shown success across a variety of tasks due\nto their ability to capture global image representations. Recent studies have\nidentified the existence of high-norm tokens in ViTs, which can interfere with\nunsupervised object discovery. To address this, the use of \"registers\" which\nare additional tokens that isolate high norm patch tokens while capturing\nglobal image-level information has been proposed. While registers have been\nstudied extensively for object discovery, their generalization properties\nparticularly in out-of-distribution (OOD) scenarios, remains underexplored. In\nthis paper, we examine the utility of register token embeddings in providing\nadditional features for improving generalization and anomaly rejection. To that\nend, we propose a simple method that combines the special CLS token embedding\ncommonly employed in ViTs with the average-pooled register embeddings to create\nfeature representations which are subsequently used for training a downstream\nclassifier. We find that this enhances OOD generalization and anomaly\nrejection, while maintaining in-distribution (ID) performance. 
Extensive\nexperiments across multiple ViT backbones trained with and without registers\nreveal consistent improvements of 2-4\\% in top-1 OOD accuracy and a 2-3\\%\nreduction in false positive rates for anomaly detection. Importantly, these\ngains are achieved without additional computational overhead.\n","authors":["Srikar Yellapragada","Kowshik Thopalli","Vivek Narayanaswamy","Wesam Sakla","Yang Liu","Yamen Mubarka","Dimitris Samaras","Jayaraman J. Thiagarajan"],"pdf_url":"https://arxiv.org/pdf/2501.04784v1.pdf","comment":"Accepted at ICASSP 2025"},{"id":"http://arxiv.org/abs/2410.17309v3","updated":"2025-01-08T19:00:00Z","published":"2024-10-22T18:00:00Z","title":"Literature Meets Data: A Synergistic Approach to Hypothesis Generation","summary":" AI holds promise for transforming scientific processes, including hypothesis\ngeneration. Prior work on hypothesis generation can be broadly categorized into\ntheory-driven and data-driven approaches. While both have proven effective in\ngenerating novel and plausible hypotheses, it remains an open question whether\nthey can complement each other. To address this, we develop the first method\nthat combines literature-based insights with data to perform LLM-powered\nhypothesis generation. We apply our method on five different datasets and\ndemonstrate that integrating literature and data outperforms other baselines\n(8.97\\% over few-shot, 15.75\\% over literature-based alone, and 3.37\\% over\ndata-driven alone). Additionally, we conduct the first human evaluation to\nassess the utility of LLM-generated hypotheses in assisting human\ndecision-making on two challenging tasks: deception detection and AI generated\ncontent detection. Our results show that human accuracy improves significantly\nby 7.44\\% and 14.19\\% on these tasks, respectively. 
These findings suggest that\nintegrating literature-based and data-driven approaches provides a\ncomprehensive and nuanced framework for hypothesis generation and could open\nnew avenues for scientific inquiry.\n","authors":["Haokun Liu","Yangqiaoyu Zhou","Mingxuan Li","Chenfei Yuan","Chenhao Tan"],"pdf_url":"https://arxiv.org/pdf/2410.17309v3.pdf","comment":"37 pages, 9 figures, code link:\n https://github.com/ChicagoHAI/hypothesis-generation"},{"id":"http://arxiv.org/abs/2501.04762v1","updated":"2025-01-08T18:08:48Z","published":"2025-01-08T18:08:48Z","title":"Efficient and Responsible Adaptation of Large Language Models for Robust\n and Equitable Top-k Recommendations","summary":" Conventional recommendation systems (RSs) are typically optimized to enhance\nperformance metrics uniformly across all training samples, inadvertently\noverlooking the needs of diverse user populations. The performance disparity\namong various populations can harm the model's robustness to sub-populations\ndue to the varying user properties. While large language models (LLMs) show\npromise in enhancing RS performance, their practical applicability is hindered\nby high costs, inference latency, and degraded performance on long user\nqueries. To address these challenges, we propose a hybrid task allocation\nframework designed to promote social good by equitably serving all user groups.\nBy adopting a two-phase approach, we promote a strategic assignment of tasks\nfor efficient and responsible adaptation of LLMs. Our strategy works by first\nidentifying the weak and inactive users that receive a suboptimal ranking\nperformance by RSs. Next, we use an in-context learning approach for such\nusers, wherein each user interaction history is contextualized as a distinct\nranking task. We evaluate our hybrid framework by incorporating eight different\nrecommendation algorithms and three different LLMs -- both open and\nclose-sourced. 
Our results on three real-world datasets show a significant\nreduction in weak users and improved robustness to subpopulations without\ndisproportionately escalating costs.\n","authors":["Kirandeep Kaur","Manya Chadha","Vinayak Gupta","Chirag Shah"],"pdf_url":"https://arxiv.org/pdf/2501.04762v1.pdf","comment":"arXiv admin note: text overlap with arXiv:2405.00824"},{"id":"http://arxiv.org/abs/2501.04757v1","updated":"2025-01-08T17:11:56Z","published":"2025-01-08T17:11:56Z","title":"DAREK -- Distance Aware Error for Kolmogorov Networks","summary":" In this paper, we provide distance-aware error bounds for Kolmogorov Arnold\nNetworks (KANs). We call our new error bounds estimator DAREK -- Distance Aware\nError for Kolmogorov networks. Z. Liu et al. provide error bounds, which may be\nloose, lack distance-awareness, and are defined only up to an unknown constant\nof proportionality. We review the error bounds for Newton's polynomial, which\nis then generalized to an arbitrary spline, under Lipschitz continuity\nassumptions. We then extend these bounds to nested compositions of splines,\narriving at error bounds for KANs. We evaluate our method by estimating an\nobject's shape from sparse laser scan points. We use KAN to fit a smooth\nfunction to the scans and provide error bounds for the fit. We find that our\nmethod is faster than Monte Carlo approaches, and that our error bounds enclose\nthe true obstacle shape reliably.\n","authors":["Masoud Ataei","Mohammad Javad Khojasteh","Vikas Dhiman"],"pdf_url":"https://arxiv.org/pdf/2501.04757v1.pdf","comment":"Accepted at ICASSP25, 5 pages + 2 pages supplementary material, 3\n figures"},{"id":"http://arxiv.org/abs/2501.04613v1","updated":"2025-01-08T16:53:17Z","published":"2025-01-08T16:53:17Z","title":"A Semantic Partitioning Method for Large-Scale Training of Knowledge\n Graph Embeddings","summary":" In recent years, knowledge graph embeddings have achieved great success. 
Many\nmethods have been proposed and achieved state-of-the-art results in various\ntasks. However, most of the current methods present one or more of the\nfollowing problems: (i) They only consider fact triplets, while ignoring the\nontology information of knowledge graphs. (ii) The obtained embeddings do not\ncontain much semantic information. Therefore, using these embeddings for\nsemantic tasks is problematic. (iii) They do not enable large-scale training.\nIn this paper, we propose a new algorithm that incorporates the ontology of\nknowledge graphs and partitions the knowledge graph based on classes to include\nmore semantic information for parallel training of large-scale knowledge graph\nembeddings. Our preliminary results show that our algorithm performs well on\nseveral popular benchmarks.\n","authors":["Yuhe Bai"],"pdf_url":"https://arxiv.org/pdf/2501.04613v1.pdf","comment":"Accepted at WWW '23 Companion: Companion Proceedings of the ACM Web\n Conference 2023"},{"id":"http://arxiv.org/abs/2501.04610v1","updated":"2025-01-08T16:47:45Z","published":"2025-01-08T16:47:45Z","title":"Resilient Peer-to-peer Learning based on Adaptive Aggregation","summary":" Collaborative learning in peer-to-peer networks offers the benefits of\ndistributed learning while mitigating the risks associated with single points\nof failure inherent in centralized servers. However, adversarial workers pose\npotential threats by attempting to inject malicious information into the\nnetwork. Thus, ensuring the resilience of peer-to-peer learning emerges as a\npivotal research objective. The challenge is exacerbated in the presence of\nnon-convex loss functions and non-iid data distributions. This paper introduces\na resilient aggregation technique tailored for such scenarios, aimed at\nfostering similarity among peers' learning processes. 
The aggregation weights\nare determined through an optimization procedure, and use the loss function\ncomputed using the neighbor's models and individual private data, thereby\naddressing concerns regarding data privacy in distributed machine learning.\nTheoretical analysis demonstrates convergence of parameters with non-convex\nloss functions and non-iid data distributions. Empirical evaluations across\nthree distinct machine learning tasks support the claims. The empirical\nfindings, which encompass a range of diverse attack models, also demonstrate\nimproved accuracy when compared to existing methodologies.\n","authors":["Chandreyee Bhowmick","Xenofon Koutsoukos"],"pdf_url":"https://arxiv.org/pdf/2501.04610v1.pdf","comment":"11 pages"},{"id":"http://arxiv.org/abs/2501.04608v1","updated":"2025-01-08T16:44:06Z","published":"2025-01-08T16:44:06Z","title":"Comprehensive Examination of Unrolled Networks for Linear Inverse\n Problems","summary":" Unrolled networks have become prevalent in various computer vision and\nimaging tasks. Although they have demonstrated remarkable efficacy in solving\nspecific computer vision and computational imaging tasks, their adaptation to\nother applications presents considerable challenges. This is primarily due to\nthe multitude of design decisions that practitioners working on new\napplications must navigate, each potentially affecting the network's overall\nperformance. These decisions include selecting the optimization algorithm,\ndefining the loss function, and determining the number of convolutional layers,\namong others. Compounding the issue, evaluating each design choice requires\ntime-consuming simulations to train, fine-tune the neural network, and optimize\nfor its performance. As a result, the process of exploring multiple options and\nidentifying the optimal configuration becomes time-consuming and\ncomputationally demanding. 
The main objectives of this paper are (1) to unify\nsome ideas and methodologies used in unrolled networks to reduce the number of\ndesign choices a user has to make, and (2) to report a comprehensive ablation\nstudy to discuss the impact of each of the choices involved in designing\nunrolled networks and present practical recommendations based on our findings.\nWe anticipate that this study will help scientists and engineers design\nunrolled networks for their applications and diagnose problems within their\nnetworks efficiently.\n","authors":["Eric Chen","Xi Chen","Arian Maleki","Shirin Jalali"],"pdf_url":"https://arxiv.org/pdf/2501.04608v1.pdf","comment":"27 pages, 10 figures. Project Page:\n https://github.com/YuxiChen25/Memory-Net-Inverse"},{"id":"http://arxiv.org/abs/2410.05898v5","updated":"2025-01-08T16:43:41Z","published":"2024-10-08T10:55:40Z","title":"Manifolds, Random Matrices and Spectral Gaps: The geometric phases of\n generative diffusion","summary":" In this paper, we investigate the latent geometry of generative diffusion\nmodels under the manifold hypothesis. For this purpose, we analyze the spectrum\nof eigenvalues (and singular values) of the Jacobian of the score function,\nwhose discontinuities (gaps) reveal the presence and dimensionality of distinct\nsub-manifolds. Using a statistical physics approach, we derive the spectral\ndistributions and formulas for the spectral gaps under several distributional\nassumptions, and we compare these theoretical predictions with the spectra\nestimated from trained networks. Our analysis reveals the existence of three\ndistinct qualitative phases during the generative process: a trivial phase; a\nmanifold coverage phase where the diffusion process fits the distribution\ninternal to the manifold; a consolidation phase where the score becomes\northogonal to the manifold and all particles are projected on the support of\nthe data. 
This `division of labor' between different timescales provides an\nelegant explanation of why generative diffusion models are not affected by the\nmanifold overfitting phenomenon that plagues likelihood-based models, since the\ninternal distribution and the manifold geometry are produced at different time\npoints during generation.\n","authors":["Enrico Ventura","Beatrice Achilli","Gianluigi Silvestri","Carlo Lucibello","Luca Ambrogioni"],"pdf_url":"https://arxiv.org/pdf/2410.05898v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03304v2","updated":"2025-01-08T16:41:03Z","published":"2025-01-06T16:04:56Z","title":"LiLMaps: Learnable Implicit Language Maps","summary":" One of the current trends in robotics is to employ large language models\n(LLMs) to provide non-predefined command execution and natural human-robot\ninteraction. It is useful to have an environment map together with its language\nrepresentation, which can be further utilized by LLMs. Such a comprehensive\nscene representation enables numerous ways of interaction with the map for\nautonomously operating robots. In this work, we present an approach that\nenhances incremental implicit mapping through the integration of\nvision-language features. Specifically, we (i) propose a decoder optimization\ntechnique for implicit language maps which can be used when new objects appear\non the scene, and (ii) address the problem of inconsistent vision-language\npredictions between different viewing positions. 
Our experiments demonstrate\nthe effectiveness of LiLMaps and solid improvements in performance.\n","authors":["Evgenii Kruzhkov","Sven Behnke"],"pdf_url":"https://arxiv.org/pdf/2501.03304v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.15856v2","updated":"2025-01-08T16:31:06Z","published":"2024-01-29T03:07:04Z","title":"The Indoor-Training Effect: unexpected gains from distribution shifts in\n the transition function","summary":" Is it better to perform tennis training in a pristine indoor environment or a\nnoisy outdoor one? To model this problem, here we investigate whether shifts in\nthe transition probabilities between the training and testing environments in\nreinforcement learning problems can lead to better performance under certain\nconditions. We generate new Markov Decision Processes (MDPs) starting from a\ngiven MDP, by adding quantifiable, parametric noise into the transition\nfunction. We refer to this process as Noise Injection and the resulting\nenvironments as {\\delta}-environments. This process allows us to create\nvariations of the same environment with quantitative control over noise serving\nas a metric of distance between environments. Conventional wisdom suggests that\ntraining and testing on the same MDP should yield the best results. In stark\ncontrast, we observe that agents can perform better when trained on the\nnoise-free environment and tested on the noisy {\\delta}-environments, compared\nto training and testing on the same {\\delta}-environments. We confirm that this\nfinding extends beyond noise variations: it is possible to showcase the same\nphenomenon in ATARI game variations including varying Ghost behaviour in\nPacMan, and Paddle behaviour in Pong. We demonstrate this intriguing behaviour\nacross 60 different variations of ATARI games, including PacMan, Pong, and\nBreakout. We refer to this phenomenon as the Indoor-Training Effect. 
Code to\nreproduce our experiments and to implement Noise Injection can be found at\nhttps://bit.ly/3X6CTYk.\n","authors":["Serena Bono","Spandan Madan","Ishaan Grover","Mao Yasueda","Cynthia Breazeal","Hanspeter Pfister","Gabriel Kreiman"],"pdf_url":"https://arxiv.org/pdf/2401.15856v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04750v1","updated":"2025-01-08T16:17:05Z","published":"2025-01-08T16:17:05Z","title":"Efficient License Plate Recognition in Videos Using Visual Rhythm and\n Accumulative Line Analysis","summary":" Video-based Automatic License Plate Recognition (ALPR) involves extracting\nvehicle license plate text information from video captures. Traditional systems\ntypically rely heavily on high-end computing resources and utilize multiple\nframes to recognize license plates, leading to increased computational\noverhead. In this paper, we propose two methods capable of efficiently\nextracting exactly one frame per vehicle and recognizing its license plate\ncharacters from this single image, thus significantly reducing computational\ndemands. The first method uses Visual Rhythm (VR) to generate time-spatial\nimages from videos, while the second employs Accumulative Line Analysis (ALA),\na novel algorithm based on single-line video processing for real-time\noperation. Both methods leverage YOLO for license plate detection within the\nframe and a Convolutional Neural Network (CNN) for Optical Character\nRecognition (OCR) to extract textual information. Experiments on real videos\ndemonstrate that the proposed methods achieve results comparable to traditional\nframe-by-frame approaches, with processing speeds three times faster.\n","authors":["Victor Nascimento Ribeiro","Nina S. T. 
Hirata"],"pdf_url":"https://arxiv.org/pdf/2501.04750v1.pdf","comment":"Accepted for presentation at the Conference on Graphics, Patterns and\n Images (SIBGRAPI) 2024"},{"id":"http://arxiv.org/abs/2501.04588v1","updated":"2025-01-08T16:06:39Z","published":"2025-01-08T16:06:39Z","title":"Federated-Continual Dynamic Segmentation of Histopathology guided by\n Barlow Continuity","summary":" Federated- and Continual Learning have been established as approaches to\nenable privacy-aware learning on continuously changing data, as required for\ndeploying AI systems in histopathology images. However, data shifts can occur\nin a dynamic world, spatially between institutions and temporally, due to\nchanging data over time. This leads to two issues: Client Drift, where the\ncentral model degrades from aggregating data from clients trained on shifted\ndata, and Catastrophic Forgetting, from temporal shifts such as changes in\npatient populations. Both tend to degrade the model's performance of previously\nseen data or spatially distributed training. Despite both problems arising from\nthe same underlying problem of data shifts, existing research addresses them\nonly individually. In this work, we introduce a method that can jointly\nalleviate Client Drift and Catastrophic Forgetting by using our proposed\nDynamic Barlow Continuity that evaluates client updates on a public reference\ndataset and uses this to guide the training process to a spatially and\ntemporally shift-invariant model. We evaluate our approach on the\nhistopathology datasets BCSS and Semicol and prove our method to be highly\neffective by jointly improving the dice score as much as from 15.8% to 71.6% in\nClient Drift and from 42.5% to 62.8% in Catastrophic Forgetting. 
This enables\nDynamic Learning by establishing spatio-temporal shift-invariance.\n","authors":["Niklas Babendererde","Haozhe Zhu","Moritz Fuchs","Jonathan Stieber","Anirban Mukhopadhyay"],"pdf_url":"https://arxiv.org/pdf/2501.04588v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.03692v3","updated":"2025-01-08T16:05:00Z","published":"2024-01-08T06:46:39Z","title":"Boosting Column Generation with Graph Neural Networks for Joint Rider\n Trip Planning and Crew Shift Scheduling","summary":" Optimizing service schedules is pivotal to the reliable, efficient, and\ninclusive on-demand mobility. This pressing challenge is further exacerbated by\nthe increasing needs of an aging population, the oversubscription of existing\nservices, and the lack of effective solution methods. This study addresses the\nintricacies of service scheduling, by jointly optimizing rider trip planning\nand crew scheduling for a complex dynamic mobility service. The resulting\noptimization problems are extremely challenging computationally for\nstate-of-the-art methods. To address this fundamental gap, this paper\nintroduces the Joint Rider Trip Planning and Crew Shift Scheduling Problem\n(JRTPCSSP) and a novel solution method, called Attention and Gated GNN-Informed\nColumn Generation (AGGNNI-CG), that hybridizes column generation and machine\nlearning to obtain near-optimal solutions to the JRTPCSSP with real-life\nconstraints of the application. The key idea of the machine-learning component\nis to dramatically reduce the number of paths to explore in the pricing\nproblem, accelerating the most time-consuming component of the column\ngeneration. The machine learning component is a graph neural network with an\nattention mechanism and a gated architecture, which is particularly suited to\ncater for the different input sizes coming from daily operations. AGGNNI-CG has\nbeen applied to a challenging, real-world dataset from the Paratransit system\nof Chatham County in Georgia. 
It produces substantial improvements compared to\nthe baseline column generation approach, which typically cannot produce\nhigh-quality feasible solutions in reasonable time on large-scale complex\ninstances. AGGNNI-CG also produces significant improvements in service quality\ncompared to the existing system.\n","authors":["Jiawei Lu","Tinghan Ye","Wenbo Chen","Pascal Van Hentenryck"],"pdf_url":"https://arxiv.org/pdf/2401.03692v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2301.12553v3","updated":"2025-01-08T16:02:24Z","published":"2023-01-29T22:00:53Z","title":"Asymptotic Inference for Multi-Stage Stationary Treatment Policy with\n Variable Selection","summary":" Dynamic treatment regimes or policies are a sequence of decision functions\nover multiple stages that are tailored to individual features. One important\nclass of treatment policies in practice, namely multi-stage stationary\ntreatment policies, prescribes treatment assignment probabilities using the\nsame decision function across stages, where the decision is based on the same\nset of features consisting of time-evolving variables (e.g., routinely\ncollected disease biomarkers). Although there has been extensive literature on\nconstructing valid inference for the value function associated with dynamic\ntreatment policies, little work has focused on the policies themselves,\nespecially in the presence of high-dimensional feature variables. We aim to\nfill the gap in this work. Specifically, we first estimate the multi-stage\nstationary treatment policy using an augmented inverse probability weighted\nestimator for the value function to increase asymptotic efficiency, and further\napply a penalty to select important feature variables. 
We then construct\none-step improvements of the policy parameter estimators for valid inference.\nTheoretically, we show that the improved estimators are asymptotically normal,\neven if nuisance parameters are estimated at a slow convergence rate and the\ndimension of the feature variables increases with the sample size. Our\nnumerical studies demonstrate that the proposed method estimates a sparse\npolicy with a near-optimal value function and conducts valid inference for the\npolicy parameters.\n","authors":["Daiqi Gao","Yufeng Liu","Donglin Zeng"],"pdf_url":"https://arxiv.org/pdf/2301.12553v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03432v2","updated":"2025-01-08T15:57:01Z","published":"2025-01-06T23:28:19Z","title":"Mixture-of-Experts Graph Transformers for Interpretable Particle\n Collision Detection","summary":" The Large Hadron Collider at CERN produces immense volumes of complex data\nfrom high-energy particle collisions, demanding sophisticated analytical\ntechniques for effective interpretation. Neural Networks, including Graph\nNeural Networks, have shown promise in tasks such as event classification and\nobject identification by representing collisions as graphs. However, while\nGraph Neural Networks excel in predictive accuracy, their \"black box\" nature\noften limits their interpretability, making it difficult to trust their\ndecision-making processes. In this paper, we propose a novel approach that\ncombines a Graph Transformer model with Mixture-of-Expert layers to achieve\nhigh predictive performance while embedding interpretability into the\narchitecture. By leveraging attention maps and expert specialization, the model\noffers insights into its internal decision-making, linking predictions to\nphysics-informed features. We evaluate the model on simulated events from the\nATLAS experiment, focusing on distinguishing rare Supersymmetric signal events\nfrom Standard Model background. 
Our results highlight that the model achieves\ncompetitive classification accuracy while providing interpretable outputs that\nalign with known physics, demonstrating its potential as a robust and\ntransparent tool for high-energy physics data analysis. This approach\nunderscores the importance of explainability in machine learning methods\napplied to high energy physics, offering a path toward greater trust in\nAI-driven discoveries.\n","authors":["Donatella Genovese","Alessandro Sgroi","Alessio Devoto","Samuel Valentine","Lennox Wood","Cristiano Sebastiani","Stefano Giagu","Monica D'Onofrio","Simone Scardapane"],"pdf_url":"https://arxiv.org/pdf/2501.03432v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04577v1","updated":"2025-01-08T15:47:04Z","published":"2025-01-08T15:47:04Z","title":"A 65 nm Bayesian Neural Network Accelerator with 360 fJ/Sample In-Word\n GRNG for AI Uncertainty Estimation","summary":" Uncertainty estimation is an indispensable capability for AI-enabled,\nsafety-critical applications, e.g. autonomous vehicles or medical diagnosis.\nBayesian neural networks (BNNs) use Bayesian statistics to provide both\nclassification predictions and uncertainty estimation, but they suffer from\nhigh computational overhead associated with random number generation and\nrepeated sample iterations. Furthermore, BNNs are not immediately amenable to\nacceleration through compute-in-memory architectures due to the frequent memory\nwrites necessary after each RNG operation. To address these challenges, we\npresent an ASIC that integrates 360 fJ/Sample Gaussian RNG directly into the\nSRAM memory words. This integration reduces RNG overhead and enables\nfully-parallel compute-in-memory operations for BNNs. The prototype chip\nachieves 5.12 GSa/s RNG throughput and 102 GOp/s neural network throughput\nwhile occupying 0.45 mm2, bringing AI uncertainty estimation to edge\ncomputation.\n","authors":["Zephan M. 
Enciso","Boyang Cheng","Likai Pei","Jianbo Liu","Steven Davis","Ningyuan Cao","Michael Niemier"],"pdf_url":"https://arxiv.org/pdf/2501.04577v1.pdf","comment":"7 pages, 12 figures"},{"id":"http://arxiv.org/abs/2409.10589v2","updated":"2025-01-08T15:41:04Z","published":"2024-09-16T15:18:10Z","title":"Offline Reinforcement Learning for Learning to Dispatch for Job Shop\n Scheduling","summary":" The Job Shop Scheduling Problem (JSSP) is a complex combinatorial\noptimization problem. While online Reinforcement Learning (RL) has shown\npromise by quickly finding acceptable solutions for JSSP, it faces key\nlimitations: it requires extensive training interactions from scratch leading\nto sample inefficiency, cannot leverage existing high-quality solutions, and\noften yields suboptimal results compared to traditional methods like Constraint\nProgramming (CP). We introduce Offline Reinforcement Learning for Learning to\nDispatch (Offline-LD), which addresses these limitations by learning from\npreviously generated solutions. Our approach is motivated by scenarios where\nhistorical scheduling data and expert solutions are available, although our\ncurrent evaluation focuses on benchmark problems. Offline-LD adapts two\nCQL-based Q-learning methods (mQRDQN and discrete mSAC) for maskable action\nspaces, introduces a novel entropy bonus modification for discrete SAC, and\nexploits reward normalization through preprocessing. Our experiments\ndemonstrate that Offline-LD outperforms online RL on both generated and\nbenchmark instances. 
Notably, by introducing noise into the expert dataset, we\nachieve similar or better results than those obtained from the expert dataset,\nsuggesting that a more diverse training set is preferable because it contains\ncounterfactual information.\n","authors":["Jesse van Remmerden","Zaharah Bukhsh","Yingqian Zhang"],"pdf_url":"https://arxiv.org/pdf/2409.10589v2.pdf","comment":"Code available at https://github.com/jesserem/Offline-LD"},{"id":"http://arxiv.org/abs/2402.07099v3","updated":"2025-01-08T15:37:04Z","published":"2024-02-11T04:09:50Z","title":"Rethinking the Capacity of Graph Neural Networks for Branching Strategy","summary":" Graph neural networks (GNNs) have been widely used to predict properties and\nheuristics of mixed-integer linear programs (MILPs) and hence accelerate MILP\nsolvers. This paper investigates the capacity of GNNs to represent strong\nbranching (SB), the most effective yet computationally expensive heuristic\nemployed in the branch-and-bound algorithm. In the literature, message-passing\nGNN (MP-GNN), as the simplest GNN structure, is frequently used as a fast\napproximation of SB, and we find that not all MILPs' SB can be represented with\nMP-GNN. We precisely define a class of \"MP-tractable\" MILPs for which MP-GNNs\ncan accurately approximate SB scores. Particularly, we establish a universal\napproximation theorem: for any data distribution over the MP-tractable class,\nthere always exists an MP-GNN that can approximate the SB score with\narbitrarily high accuracy and arbitrarily high probability, which lays a\ntheoretical foundation for the existing works on imitating SB with MP-GNN. For\nMILPs without MP-tractability, unfortunately, a similar result is\nimpossible, which can be illustrated by two MILP instances with different SB\nscores that cannot be distinguished by any MP-GNN, regardless of the number of\nparameters. 
Recognizing this, we explore another GNN structure called the\nsecond-order folklore GNN (2-FGNN) that overcomes this limitation, and the\naforementioned universal approximation theorem can be extended to the entire\nMILP space using 2-FGNN, regardless of MP-tractability. A small-scale\nnumerical experiment is conducted to directly validate our theoretical\nfindings.\n","authors":["Ziang Chen","Jialin Liu","Xiaohan Chen","Xinshang Wang","Wotao Yin"],"pdf_url":"https://arxiv.org/pdf/2402.07099v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04570v1","updated":"2025-01-08T15:36:19Z","published":"2025-01-08T15:36:19Z","title":"Large-Scale Spectral Graph Neural Networks via Laplacian Sparsification:\n Technical Report","summary":" Graph Neural Networks (GNNs) play a pivotal role in graph-based tasks for\ntheir proficiency in representation learning. Among the various GNN methods,\nspectral GNNs employing polynomial filters have shown promising performance on\ntasks involving both homophilous and heterophilous graph structures. However,\nthe scalability of spectral GNNs on large graphs is limited because they learn\nthe polynomial coefficients through multiple propagation executions\nduring forward propagation. Existing works have attempted to scale up spectral\nGNNs by eliminating the linear layers on the input node features, a change that\ncan disrupt end-to-end training, potentially impact performance, and become\nimpractical with high-dimensional input features. To address the above\nchallenges, we propose \"Spectral Graph Neural Networks with Laplacian\nSparsification (SGNN-LS)\", a novel graph spectral sparsification method to\napproximate the propagation patterns of spectral GNNs. We prove that our\nproposed method generates Laplacian sparsifiers that can approximate both fixed\nand learnable polynomial filters with theoretical guarantees. 
Our method allows\nthe application of linear layers on the input node features, enabling\nend-to-end training as well as the handling of raw text features. We conduct an\nextensive experimental analysis on datasets spanning various graph scales and\nproperties to demonstrate the superior efficiency and effectiveness of our\nmethod. The results show that our method yields superior results in comparison\nwith the corresponding approximated base models, especially on dataset\nOgbn-papers100M(111M nodes, 1.6B edges) and MAG-scholar-C (2.8M features).\n","authors":["Haipeng Ding","Zhewei Wei","Yuhang Ye"],"pdf_url":"https://arxiv.org/pdf/2501.04570v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.12046v2","updated":"2025-01-08T15:35:02Z","published":"2024-10-15T20:32:07Z","title":"Towards Realistic Evaluation of Commit Message Generation by Matching\n Online and Offline Settings","summary":" When a Commit Message Generation (CMG) system is integrated into the IDEs and\nother products at JetBrains, we perform online evaluation based on user\nacceptance of the generated messages. However, performing online experiments\nwith every change to a CMG system is troublesome, as each iteration affects\nusers and requires time to collect enough statistics. On the other hand,\noffline evaluation, a prevalent approach in the research literature,\nfacilitates fast experiments but employs automatic metrics that are not\nguaranteed to represent the preferences of real users. 
In this work, we\ndescribe a novel way we employed to deal with this problem at JetBrains, by\nleveraging an online metric - the number of edits users introduce before\ncommitting the generated messages to the VCS - to select metrics for offline\nexperiments.\n To support this new type of evaluation, we develop a novel markup collection\ntool mimicking the real workflow with a CMG system, collect a dataset with 57\npairs consisting of commit messages generated by GPT-4 and their counterparts\nedited by human experts, and design and verify a way to synthetically extend\nsuch a dataset. Then, we use the final dataset of 656 pairs to study how the\nwidely used similarity metrics correlate with the online metric reflecting the\nreal users' experience.\n Our results indicate that edit distance exhibits the highest correlation with\nthe online metric, whereas commonly used similarity metrics such as BLEU and\nMETEOR demonstrate low correlation. This contradicts the previous studies on\nsimilarity metrics for CMG, suggesting that user interactions with a CMG system\nin real-world settings differ significantly from the responses by human\nlabelers within controlled environments. 
We release all the code and the\ndataset to support future research in the field: https://jb.gg/cmg-evaluation.\n","authors":["Petr Tsvetkov","Aleksandra Eliseeva","Danny Dig","Alexander Bezzubov","Yaroslav Golubev","Timofey Bryksin","Yaroslav Zharov"],"pdf_url":"https://arxiv.org/pdf/2410.12046v2.pdf","comment":"10 pages, 5 figures (Published at ICSE'2025)"},{"id":"http://arxiv.org/abs/2501.04568v1","updated":"2025-01-08T15:32:12Z","published":"2025-01-08T15:32:12Z","title":"Supervision-free Vision-Language Alignment","summary":" Vision-language models (VLMs) have demonstrated remarkable potential in\nintegrating visual and linguistic information, but their performance is often\nconstrained by the need for extensive, high-quality image-text training data.\nCuration of these image-text pairs is both time-consuming and computationally\nexpensive. To address this challenge, we introduce SVP (Supervision-free Visual\nProjection), a novel framework that enhances vision-language alignment without\nrelying on curated data or preference annotation. SVP leverages self-captioning\nand a pre-trained grounding model as a feedback mechanism to elicit latent\ninformation in VLMs. We evaluate our approach across six key areas: captioning,\nreferring, visual question answering, multitasking, hallucination control, and\nobject recall. Results demonstrate significant improvements, including a 14%\naverage improvement in captioning tasks, up to 12% increase in object recall,\nand substantial reduction in hallucination rates. 
Notably, a small VLM using\nSVP achieves hallucination reductions comparable to a model five times larger,\nwhile a VLM with initially poor referring capabilities more than doubles its\nperformance, approaching parity with a model twice its size.\n","authors":["Giorgio Giannone","Ruoteng Li","Qianli Feng","Evgeny Perevodchikov","Rui Chen","Aleix Martinez"],"pdf_url":"https://arxiv.org/pdf/2501.04568v1.pdf","comment":"Preprint"},{"id":"http://arxiv.org/abs/2406.06184v2","updated":"2025-01-08T15:28:11Z","published":"2024-06-10T11:28:25Z","title":"Deep Multi-Objective Reinforcement Learning for Utility-Based\n Infrastructural Maintenance Optimization","summary":" In this paper, we introduce Multi-Objective Deep Centralized Multi-Agent\nActor-Critic (MO-DCMAC), a multi-objective reinforcement learning (MORL)\nmethod for infrastructural maintenance optimization, an area traditionally\ndominated by single-objective reinforcement learning (RL) approaches. Previous\nsingle-objective RL methods combine multiple objectives, such as probability of\ncollapse and cost, into a singular reward signal through reward-shaping. In\ncontrast, MO-DCMAC can optimize a policy for multiple objectives directly, even\nwhen the utility function is non-linear. We evaluated MO-DCMAC using two\nutility functions, which use probability of collapse and cost as input. The\nfirst utility function is the Threshold utility, in which MO-DCMAC should\nminimize cost so that the probability of collapse is never above the threshold.\nThe second is based on the Failure Mode, Effects, and Criticality Analysis\n(FMECA) methodology used by asset managers to assess maintenance plans. We\nevaluated MO-DCMAC, with both utility functions, in multiple maintenance\nenvironments, including ones based on a case study of the historical quay walls\nof Amsterdam. The performance of MO-DCMAC was compared against multiple\nrule-based policies based on heuristics currently used for constructing\nmaintenance plans. 
Our results demonstrate that MO-DCMAC outperforms\ntraditional rule-based policies across various environments and utility\nfunctions.\n","authors":["Jesse van Remmerden","Maurice Kenter","Diederik M. Roijers","Charalampos Andriotis","Yingqian Zhang","Zaharah Bukhsh"],"pdf_url":"https://arxiv.org/pdf/2406.06184v2.pdf","comment":"Accepted in the Neural Computing and Applications: Topical Collection\n on Multi-Objective Decision Making 2023 (MODeM 2023)"},{"id":"http://arxiv.org/abs/2412.04628v2","updated":"2025-01-08T15:00:39Z","published":"2024-12-05T21:50:22Z","title":"SWEPO: Simultaneous Weighted Preference Optimization for Group\n Contrastive Alignment","summary":" We introduce Simultaneous Weighted Preference Optimization (SWEPO), a novel\nextension of Direct Preference Optimization (DPO) designed to accommodate\nmultiple dynamically chosen positive and negative responses for each query.\nSWEPO employs a weighted group contrastive loss, assigning weights to responses\nbased on their deviation from the mean reward score. This approach effectively\nprioritizes responses that are significantly better or worse than the average,\nenhancing optimization. Our theoretical analysis demonstrates that\nsimultaneously considering multiple preferences reduces alignment bias,\nresulting in more robust alignment. 
Additionally, we provide insights into the\ntraining dynamics of our loss function and a related function, InfoNCA.\nEmpirical validation on the UltraFeedback dataset establishes SWEPO as\nstate-of-the-art, with superior performance in downstream evaluations using the\nAlpacaEval dataset.\n","authors":["Taneesh Gupta","Rahul Madhavan","Xuchao Zhang","Chetan Bansal","Saravan Rajmohan"],"pdf_url":"https://arxiv.org/pdf/2412.04628v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04547v1","updated":"2025-01-08T14:51:36Z","published":"2025-01-08T14:51:36Z","title":"Medical artificial intelligence toolbox (MAIT): an explainable machine\n learning framework for binary classification, survival modelling, and\n regression analyses","summary":" While machine learning offers diverse techniques suitable for exploring\nvarious medical research questions, a cohesive synergistic framework can\nfacilitate the integration and understanding of new approaches within unified\nmodel development and interpretation. We therefore introduce the Medical\nArtificial Intelligence Toolbox (MAIT), an explainable, open-source Python\npipeline for developing and evaluating binary classification, regression, and\nsurvival models on tabular datasets. MAIT addresses key challenges (e.g., high\ndimensionality, class imbalance, mixed variable types, and missingness) while\npromoting transparency in reporting (TRIPOD+AI compliant). Offering automated\nconfigurations for beginners and customizable source code for experts, MAIT\nstreamlines two primary use cases: Discovery (feature importance via unified\nscoring, e.g., SHapley Additive exPlanations - SHAP) and Prediction (model\ndevelopment and deployment with optimized solutions). 
Moreover, MAIT proposes\nnew techniques including fine-tuning of probability threshold in binary\nclassification, translation of cumulative hazard curves to binary\nclassification, enhanced visualizations for model interpretation for mixed data\ntypes, and handling censoring through semi-supervised learning, to adapt to a\nwide set of data constraints and study designs. We provide detailed tutorials\non GitHub, using four open-access data sets, to demonstrate how MAIT can be\nused to improve implementation and interpretation of ML models in medical\nresearch.\n","authors":["Ramtin Zargari Marandi","Anne Svane Frahm","Jens Lundgren","Daniel Dawson Murray","Maja Milojevic"],"pdf_url":"https://arxiv.org/pdf/2501.04547v1.pdf","comment":"14 pages, 2 figures, 1 table"},{"id":"http://arxiv.org/abs/2501.00889v2","updated":"2025-01-08T14:50:23Z","published":"2025-01-01T16:36:21Z","title":"Evaluating Time Series Foundation Models on Noisy Periodic Time Series","summary":" While recent advancements in foundation models have significantly impacted\nmachine learning, rigorous tests on the performance of time series foundation\nmodels (TSFMs) remain largely underexplored. This paper presents an empirical\nstudy evaluating the zero-shot, long-horizon forecasting abilities of several\nleading TSFMs over two synthetic datasets constituting noisy periodic time\nseries. We assess model efficacy across different noise levels, underlying\nfrequencies, and sampling rates. As benchmarks for comparison, we choose two\nstatistical techniques: a Fourier transform (FFT)-based approach and a linear\nautoregressive (AR) model. 
Our findings demonstrate that while for time series\nwith bounded periods and higher sampling rates, TSFMs can match or outperform\nthe statistical approaches, their forecasting abilities deteriorate with longer\nperiods, higher noise levels, lower sampling rates and more complex shapes of\nthe time series.\n","authors":["Syamantak Datta Gupta"],"pdf_url":"https://arxiv.org/pdf/2501.00889v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.02334v2","updated":"2025-01-08T14:42:05Z","published":"2024-04-26T15:02:39Z","title":"Rad4XCNN: a new agnostic method for post-hoc global explanation of\n CNN-derived features by means of radiomics","summary":" In recent years, machine learning-based clinical decision support systems\n(CDSS) have played a key role in the analysis of several medical conditions.\nDespite their promising capabilities, the lack of transparency in AI models\nposes significant challenges, particularly in medical contexts where\nreliability is a mandatory aspect. However, it appears that explainability is\ninversely proportional to accuracy. For this reason, achieving transparency\nwithout compromising predictive accuracy remains a key challenge. This paper\npresents a novel method, namely Rad4XCNN, to enhance the predictive power of\nCNN-derived features with the inherent interpretability of radiomic features.\nRad4XCNN diverges from conventional methods based on saliency maps, by\nassociating intelligible meaning to CNN-derived features by means of Radiomics,\noffering new perspectives on explanation methods beyond visualization maps.\nUsing a breast cancer classification task as a case study, we evaluated\nRad4XCNN on ultrasound imaging datasets, including an online dataset and two\nin-house datasets for internal and external validation. 
Some key results are:\ni) CNN-derived features guarantee more robust accuracy when compared against\nViT-derived and radiomic features; ii) conventional visualization map methods\nfor explanation present several pitfalls; iii) Rad4XCNN does not sacrifice\nmodel accuracy for their explainability; iv) Rad4XCNN provides a global\nexplanation enabling the physician to extract global insights and findings. Our\nmethod can mitigate some concerns related to the explainability-accuracy\ntrade-off. This study highlighted the importance of proposing new methods for\nmodel explanation without affecting their accuracy.\n","authors":["Francesco Prinzi","Carmelo Militello","Calogero Zarcaro","Tommaso Vincenzo Bartolotta","Salvatore Gaglio","Salvatore Vitabile"],"pdf_url":"https://arxiv.org/pdf/2405.02334v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.00599v2","updated":"2025-01-08T14:38:30Z","published":"2024-12-31T18:56:46Z","title":"VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with\n Video LLM","summary":" Video Large Language Models (Video LLMs) have recently exhibited remarkable\ncapabilities in general video understanding. However, they mainly focus on\nholistic comprehension and struggle with capturing fine-grained spatial and\ntemporal details. Besides, the lack of high-quality object-level video\ninstruction data and a comprehensive benchmark further hinders their\nadvancements. To tackle these challenges, we introduce the VideoRefer Suite to\nempower Video LLM for finer-level spatial-temporal video understanding, i.e.,\nenabling perception and reasoning on any objects throughout the video.\nSpecially, we thoroughly develop VideoRefer Suite across three essential\naspects: dataset, model, and benchmark. Firstly, we introduce a multi-agent\ndata engine to meticulously curate a large-scale, high-quality object-level\nvideo instruction dataset, termed VideoRefer-700K. 
Next, we present the\nVideoRefer model, which equips a versatile spatial-temporal object encoder to\ncapture precise regional and sequential representations. Finally, we\nmeticulously create a VideoRefer-Bench to comprehensively assess the\nspatial-temporal understanding capability of a Video LLM, evaluating it across\nvarious aspects. Extensive experiments and analyses demonstrate that our\nVideoRefer model not only achieves promising performance on video referring\nbenchmarks but also facilitates general video understanding capabilities.\n","authors":["Yuqian Yuan","Hang Zhang","Wentong Li","Zesen Cheng","Boqiang Zhang","Long Li","Xin Li","Deli Zhao","Wenqiao Zhang","Yueting Zhuang","Jianke Zhu","Lidong Bing"],"pdf_url":"https://arxiv.org/pdf/2501.00599v2.pdf","comment":"17 pages, 14 figures, technical report"},{"id":"http://arxiv.org/abs/2501.04538v1","updated":"2025-01-08T14:38:03Z","published":"2025-01-08T14:38:03Z","title":"HypeRL: Parameter-Informed Reinforcement Learning for Parametric PDEs","summary":" In this work, we devise a new, general-purpose reinforcement learning\nstrategy for the optimal control of parametric partial differential equations\n(PDEs). Such problems frequently arise in applied sciences and engineering and\nentail a significant complexity when control and/or state variables are\ndistributed in high-dimensional space or depend on varying parameters.\nTraditional numerical methods, relying on either iterative minimization\nalgorithms or dynamic programming, while reliable, often become computationally\ninfeasible. Indeed, in either way, the optimal control problem must be solved\nfor each instance of the parameters, and this is out of reach when dealing with\nhigh-dimensional time-dependent and parametric PDEs. In this paper, we propose\nHypeRL, a deep reinforcement learning (DRL) framework to overcome the\nlimitations shown by traditional methods. HypeRL aims at approximating the\noptimal control policy directly. 
Specifically, we employ an actor-critic DRL\napproach to learn an optimal feedback control strategy that can generalize\nacross the range of variation of the parameters. To effectively learn such\noptimal control laws, encoding the parameter information into the DRL policy\nand value function neural networks (NNs) is essential. To do so, HypeRL uses\ntwo additional NNs, often called hypernetworks, to learn the weights and biases\nof the value function and the policy NNs. We validate the proposed approach on\ntwo PDE-constrained optimal control benchmarks, namely a 1D\nKuramoto-Sivashinsky equation and the 2D Navier-Stokes equations, by showing that\nthe knowledge of the PDE parameters and how this information is encoded, i.e.,\nvia a hypernetwork, is an essential ingredient for learning parameter-dependent\ncontrol policies that can generalize effectively to unseen scenarios and for\nimproving the sample efficiency of such policies.\n","authors":["Nicolò Botteghi","Stefania Fresca","Mengwu Guo","Andrea Manzoni"],"pdf_url":"https://arxiv.org/pdf/2501.04538v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04534v1","updated":"2025-01-08T14:33:47Z","published":"2025-01-08T14:33:47Z","title":"Combining YOLO and Visual Rhythm for Vehicle Counting","summary":" Video-based vehicle detection and counting play a critical role in managing\ntransport infrastructure. Traditional image-based counting methods usually\ninvolve two main steps: initial detection and subsequent tracking, which are\napplied to all video frames, leading to a significant increase in computational\ncomplexity. To address this issue, this work presents an alternative and more\nefficient method for vehicle detection and counting. The proposed approach\neliminates the need for a tracking step and focuses solely on detecting\nvehicles in key video frames, thereby increasing its efficiency. 
To achieve\nthis, we developed a system that combines YOLO, for vehicle detection, with\nVisual Rhythm, a way to create time-spatial images that allows us to focus on\nframes that contain useful information. Additionally, this method can be used\nfor counting in any application involving unidirectional moving targets to be\ndetected and identified. Experimental analysis using real videos shows that the\nproposed method achieves mean counting accuracy around 99.15% over a set of\nvideos, with a processing speed three times faster than tracking based\napproaches.\n","authors":["Victor Nascimento Ribeiro","Nina S. T. Hirata"],"pdf_url":"https://arxiv.org/pdf/2501.04534v1.pdf","comment":"Accepted for presentation at the Conference on Graphics, Patterns and\n Images (SIBGRAPI) 2023"},{"id":"http://arxiv.org/abs/2501.02156v3","updated":"2025-01-08T14:26:51Z","published":"2025-01-04T01:45:32Z","title":"The Race to Efficiency: A New Perspective on AI Scaling Laws","summary":" As large-scale AI models expand, training becomes costlier and sustaining\nprogress grows harder. Classical scaling laws (e.g., Kaplan et al. (2020),\nHoffmann et al. (2022)) predict training loss from a static compute budget yet\nneglect time and efficiency, prompting the question: how can we balance\nballooning GPU fleets with rapidly improving hardware and algorithms? We\nintroduce the relative-loss equation, a time- and efficiency-aware framework\nthat extends classical AI scaling laws. Our model shows that, without ongoing\nefficiency gains, advanced performance could demand millennia of training or\nunrealistically large GPU fleets. However, near-exponential progress remains\nachievable if the \"efficiency-doubling rate\" parallels Moore's Law. By\nformalizing this race to efficiency, we offer a quantitative roadmap for\nbalancing front-loaded GPU investments with incremental improvements across the\nAI stack. 
Empirical trends suggest that sustained efficiency gains can push AI\nscaling well into the coming decade, providing a new perspective on the\ndiminishing returns inherent in classical scaling.\n","authors":["Chien-Ping Lu"],"pdf_url":"https://arxiv.org/pdf/2501.02156v3.pdf","comment":"21 pages, 3 figures. 2 tables, second draft"},{"id":"http://arxiv.org/abs/2501.04529v1","updated":"2025-01-08T14:21:03Z","published":"2025-01-08T14:21:03Z","title":"A Plug-and-Play Bregman ADMM Module for Inferring Event Branches in\n Temporal Point Processes","summary":" An event sequence generated by a temporal point process is often associated\nwith a hidden and structured event branching process that captures the\ntriggering relations between its historical and current events. In this study,\nwe design a new plug-and-play module based on the Bregman ADMM (BADMM)\nalgorithm, which infers event branches associated with event sequences in the\nmaximum likelihood estimation framework of temporal point processes (TPPs).\nSpecifically, we formulate the inference of event branches as an optimization\nproblem for the event transition matrix under sparse and low-rank constraints,\nwhich is embedded in existing TPP models or their learning paradigms. We can\nimplement this optimization problem based on subspace clustering and sparse\ngroup-lasso, respectively, and solve it using the Bregman ADMM algorithm, whose\nunrolling leads to the proposed BADMM module. When learning a classic TPP\n(e.g., Hawkes process) by the expectation-maximization algorithm, the BADMM\nmodule helps derive structured responsibility matrices in the E-step.\nSimilarly, the BADMM module helps derive low-rank and sparse attention maps for\nthe neural TPPs with self-attention layers. The structured responsibility\nmatrices and attention maps, which work as learned event transition matrices,\nindicate event branches, e.g., inferring isolated events and those key events\ntriggering many subsequent events. 
Experiments on both synthetic and real-world\ndata show that plugging our BADMM module into existing TPP models and learning\nparadigms can improve model performance and provide us with interpretable\nstructured event branches. The code is available at\n\\url{https://github.com/qingmeiwangdaily/BADMM_TPP}.\n","authors":["Qingmei Wang","Yuxin Wu","Yujie Long","Jing Huang","Fengyuan Ran","Bing Su","Hongteng Xu"],"pdf_url":"https://arxiv.org/pdf/2501.04529v1.pdf","comment":"Accepted at AAAI 2025"},{"id":"http://arxiv.org/abs/2501.04528v1","updated":"2025-01-08T14:19:54Z","published":"2025-01-08T14:19:54Z","title":"Towards a Problem-Oriented Domain Adaptation Framework for Machine\n Learning","summary":" Domain adaptation is a sub-field of machine learning that involves\ntransferring knowledge from a source domain to perform the same task in the\ntarget domain. It is a typical challenge in machine learning that arises, e.g.,\nwhen data is obtained from various sources or when using a data basis that\nchanges over time. Recent advances in the field offer promising methods, but it\nis still challenging for researchers and practitioners to determine if domain\nadaptation is suitable for a given problem -- and, subsequently, to select the\nappropriate approach. This article employs design science research to develop a\nproblem-oriented framework for domain adaptation, which is matured in three\nevaluation episodes. We describe a framework that distinguishes between five\ndomain adaptation scenarios, provides recommendations for addressing each\nscenario, and offers guidelines for determining if a problem falls into one of\nthese scenarios. During the multiple evaluation episodes, the framework is\ntested on artificial and real-world datasets and an experimental study\ninvolving 100 participants. The evaluation demonstrates that the framework has\nthe explanatory power to capture any domain adaptation problem effectively. 
In\nsummary, we provide clear guidance for researchers and practitioners who want\nto employ domain adaptation but lack in-depth knowledge of the possibilities.\n","authors":["Philipp Spitzer","Dominik Martin","Laurin Eichberger","Niklas Kühl"],"pdf_url":"https://arxiv.org/pdf/2501.04528v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04527v1","updated":"2025-01-08T14:19:03Z","published":"2025-01-08T14:19:03Z","title":"Towards Fair Class-wise Robustness: Class Optimal Distribution\n Adversarial Training","summary":" Adversarial training has proven to be a highly effective method for improving\nthe robustness of deep neural networks against adversarial attacks.\nNonetheless, it has been observed to exhibit a limitation in terms of robust\nfairness, characterized by a significant disparity in robustness across\ndifferent classes. Recent efforts to mitigate this problem have turned to\nclass-wise reweighted methods. However, these methods suffer from a lack of\nrigorous theoretical analysis and are limited in their exploration of the\nweight space, as they mainly rely on existing heuristic algorithms or intuition\nto compute weights. In addition, these methods fail to guarantee the\nconsistency of the optimization direction due to the decoupled optimization of\nweights and the model parameters. They potentially lead to suboptimal weight\nassignments and consequently, a suboptimal model. To address these problems,\nthis paper proposes a novel min-max training framework, Class Optimal\nDistribution Adversarial Training (CODAT), which employs distributionally\nrobust optimization to fully explore the class-wise weight space, thus enabling\nthe identification of the optimal weight with theoretical guarantees.\nFurthermore, we derive a closed-form optimal solution to the internal\nmaximization and then get a deterministic equivalent objective function, which\nprovides a theoretical basis for the joint optimization of weights and model\nparameters. 
Meanwhile, we propose a fairness elasticity coefficient for the\nevaluation of the algorithm with regard to both robustness and robust fairness.\nExperimental results on various datasets show that the proposed method can\neffectively improve the robust fairness of the model and outperform the\nstate-of-the-art approaches.\n","authors":["Hongxin Zhi","Hongtao Yu","Shaome Li","Xiuming Zhao","Yiteng Wu"],"pdf_url":"https://arxiv.org/pdf/2501.04527v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.18103v3","updated":"2025-01-08T14:13:23Z","published":"2024-03-26T21:01:41Z","title":"Tutorial on Diffusion Models for Imaging and Vision","summary":" The astonishing growth of generative tools in recent years has empowered many\nexciting applications in text-to-image generation and text-to-video generation.\nThe underlying principle behind these generative tools is the concept of\ndiffusion, a particular sampling mechanism that has overcome some shortcomings\nthat were deemed difficult in the previous approaches. The goal of this\ntutorial is to discuss the essential ideas underlying the diffusion models. The\ntarget audience of this tutorial includes undergraduate and graduate students\nwho are interested in doing research on diffusion models or applying these\nmodels to solve other problems.\n","authors":["Stanley H. Chan"],"pdf_url":"https://arxiv.org/pdf/2403.18103v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.01844v2","updated":"2025-01-08T14:10:15Z","published":"2025-01-03T14:54:49Z","title":"Learning from Ambiguous Data with Hard Labels","summary":" Real-world data often contains intrinsic ambiguity that the common\nsingle-hard-label annotation paradigm ignores. Standard training using\nambiguous data with these hard labels may produce overly confident models and\nthus lead to poor generalization. In this paper, we propose a novel\nframework called Quantized Label Learning (QLL) to alleviate this issue. 
First,\nwe formulate QLL as learning from (very) ambiguous data with hard labels:\nideally, each ambiguous instance should be associated with a ground-truth\nsoft-label distribution describing its corresponding probabilistic weight in\neach class, however, this is usually not accessible; in practice, we can only\nobserve a quantized label, i.e., a hard label sampled (quantized) from the\ncorresponding ground-truth soft-label distribution, of each instance, which can\nbe seen as a biased approximation of the ground-truth soft-label. Second, we\npropose a Class-wise Positive-Unlabeled (CPU) risk estimator that allows us to\ntrain accurate classifiers from only ambiguous data with quantized labels.\nThird, to simulate ambiguous datasets with quantized labels in the real world,\nwe design a mixing-based ambiguous data generation procedure for empirical\nevaluation. Experiments demonstrate that our CPU method can significantly\nimprove model generalization performance and outperform the baselines.\n","authors":["Zeke Xie","Zheng He","Nan Lu","Lichen Bai","Bao Li","Shuo Yang","Mingming Sun","Ping Li"],"pdf_url":"https://arxiv.org/pdf/2501.01844v2.pdf","comment":"9 pages, 4 figures, accepted by ICASSP 2025"},{"id":"http://arxiv.org/abs/2405.13867v2","updated":"2025-01-08T14:08:11Z","published":"2024-05-22T17:48:17Z","title":"Scaling-laws for Large Time-series Models","summary":" Scaling laws for large language models (LLMs) have provided useful guidance\nin training ever larger models for predictable performance gains. Time series\nforecasting shares a similar sequential structure to language, and is amenable\nto large-scale transformer architectures. Here we show that foundational\ndecoder-only time series transformer models exhibit analogous scaling-behavior\nto LLMs, with architectural details (aspect ratio and number of heads) having a\nminimal effect over broad ranges. 
We assemble a large corpus of heterogeneous\ntime series data on which to train, and establish for the first time power-law\nscaling with parameter count, dataset size, and training compute, spanning five\norders of magnitude.\n","authors":["Thomas D. P. Edwards","James Alvey","Justin Alsing","Nam H. Nguyen","Benjamin D. Wandelt"],"pdf_url":"https://arxiv.org/pdf/2405.13867v2.pdf","comment":"4 main pages (16 total), 4 figures; Accepted for oral presentation in\n Time Series in the Age of Large Models (TSALM) Workshop at Neurips 2024"},{"id":"http://arxiv.org/abs/2405.08766v2","updated":"2025-01-08T13:45:46Z","published":"2024-05-14T16:59:20Z","title":"Energy-based Hopfield Boosting for Out-of-Distribution Detection","summary":" Out-of-distribution (OOD) detection is critical when deploying machine\nlearning models in the real world. Outlier exposure methods, which incorporate\nauxiliary outlier data in the training process, can drastically improve OOD\ndetection performance compared to approaches without advanced training\nstrategies. We introduce Hopfield Boosting, a boosting approach, which\nleverages modern Hopfield energy (MHE) to sharpen the decision boundary between\nthe in-distribution and OOD data. Hopfield Boosting encourages the model to\nconcentrate on hard-to-distinguish auxiliary outlier examples that lie close to\nthe decision boundary between in-distribution and auxiliary outlier data. 
Our\nmethod achieves a new state-of-the-art in OOD detection with outlier exposure,\nimproving the FPR95 metric from 2.28 to 0.92 on CIFAR-10 and from 11.76 to 7.94\non CIFAR-100.\n","authors":["Claus Hofmann","Simon Schmid","Bernhard Lehner","Daniel Klotz","Sepp Hochreiter"],"pdf_url":"https://arxiv.org/pdf/2405.08766v2.pdf","comment":"NeurIPS 2024"},{"id":"http://arxiv.org/abs/2501.02270v2","updated":"2025-01-08T13:42:02Z","published":"2025-01-04T12:15:58Z","title":"Efficient Video-Based ALPR System Using YOLO and Visual Rhythm","summary":" Automatic License Plate Recognition (ALPR) involves extracting vehicle\nlicense plate information from image or a video capture. These systems have\ngained popularity due to the wide availability of low-cost surveillance cameras\nand advances in Deep Learning. Typically, video-based ALPR systems rely on\nmultiple frames to detect the vehicle and recognize the license plates.\nTherefore, we propose a system capable of extracting exactly one frame per\nvehicle and recognizing its license plate characters from this singular image\nusing an Optical Character Recognition (OCR) model. Early experiments show that\nthis methodology is viable.\n","authors":["Victor Nascimento Ribeiro","Nina S. T. Hirata"],"pdf_url":"https://arxiv.org/pdf/2501.02270v2.pdf","comment":"Accepted to CVPR 2024"},{"id":"http://arxiv.org/abs/2409.16586v2","updated":"2025-01-08T13:16:26Z","published":"2024-09-25T03:25:34Z","title":"AutoSTF: Decoupled Neural Architecture Search for Cost-Effective\n Automated Spatio-Temporal Forecasting","summary":" Spatio-temporal forecasting is a critical component of various smart city\napplications, such as transportation optimization, energy management, and\nsocio-economic analysis. 
Recently, several automated spatio-temporal\nforecasting methods have been proposed to automatically search the optimal\nneural network architecture for capturing complex spatio-temporal dependencies.\nHowever, the existing automated approaches suffer from expensive neural\narchitecture search overhead, which hinders their practical use and the further\nexploration of diverse spatio-temporal operators in a finer granularity. In\nthis paper, we propose AutoSTF, a decoupled automatic neural architecture\nsearch framework for cost-effective automated spatio-temporal forecasting. From\nthe efficiency perspective, we first decouple the mixed search space into\ntemporal space and spatial space and respectively devise representation\ncompression and parameter-sharing schemes to mitigate the parameter explosion.\nThe decoupled spatio-temporal search not only expedites the model optimization\nprocess but also leaves new room for more effective spatio-temporal dependency\nmodeling. From the effectiveness perspective, we propose a multi-patch transfer\nmodule to jointly capture multi-granularity temporal dependencies and extend\nthe spatial search space to enable finer-grained layer-wise spatial dependency\nsearch. Extensive experiments on eight datasets demonstrate the superiority of\nAutoSTF in terms of both accuracy and efficiency. 
Specifically, our proposed\nmethod achieves up to 13.48x speed-up compared to state-of-the-art automatic\nspatio-temporal forecasting methods while maintaining the best forecasting\naccuracy.\n","authors":["Tengfei Lyu","Weijia Zhang","Jinliang Deng","Hao Liu"],"pdf_url":"https://arxiv.org/pdf/2409.16586v2.pdf","comment":"Accepted by KDD 2025 Research Track"},{"id":"http://arxiv.org/abs/2501.04487v1","updated":"2025-01-08T13:14:05Z","published":"2025-01-08T13:14:05Z","title":"Integrating remote sensing data assimilation, deep learning and large\n language model for interactive wheat breeding yield prediction","summary":" Yield is one of the core goals of crop breeding. By predicting the potential\nyield of different breeding materials, breeders can screen these materials at\nvarious growth stages to select the best performing. Based on unmanned aerial\nvehicle remote sensing technology, high-throughput crop phenotyping data in\nbreeding areas is collected to provide data support for the breeding decisions\nof breeders. However, the accuracy of current yield predictions still requires\nimprovement, and the usability and user-friendliness of yield forecasting tools\nremain suboptimal. To address these challenges, this study introduces a hybrid\nmethod and tool for crop yield prediction, designed to allow breeders to\ninteractively and accurately predict wheat yield by chatting with a large\nlanguage model (LLM). First, the newly designed data assimilation algorithm is\nused to assimilate the leaf area index into the WOFOST model. Then, selected\noutputs from the assimilation process, along with remote sensing inversion\nresults, are used to drive the time-series temporal fusion transformer model\nfor wheat yield prediction. Finally, based on this hybrid method and leveraging\nan LLM with retrieval augmented generation technology, we developed an\ninteractive yield prediction Web tool that is user-friendly and supports\nsustainable data updates. 
This tool integrates multi-source data to assist\nbreeding decision-making. This study aims to accelerate the identification of\nhigh-yield materials in the breeding process, enhance breeding efficiency, and\nenable more scientific and smart breeding decisions.\n","authors":["Guofeng Yang","Nanfei Jin","Wenjie Ai","Zhonghua Zheng","Yuhong He","Yong He"],"pdf_url":"https://arxiv.org/pdf/2501.04487v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04481v1","updated":"2025-01-08T13:04:08Z","published":"2025-01-08T13:04:08Z","title":"Safe Reinforcement Learning with Minimal Supervision","summary":" Reinforcement learning (RL) in the real world necessitates the development of\nprocedures that enable agents to explore without causing harm to themselves or\nothers. The most successful solutions to the problem of safe RL leverage\noffline data to learn a safe-set, enabling safe online exploration. However,\nthis approach to safe-learning is often constrained by the demonstrations that\nare available for learning.\n In this paper we investigate the influence of the quantity and quality of\ndata used to train the initial safe learning problem offline on the ability to\nlearn safe-RL policies online. Specifically, we focus on tasks with spatially\nextended goal states where we have few or no demonstrations available.\nClassically this problem is addressed either by using hand-designed controllers\nto generate data or by collecting user-generated demonstrations. However, these\nmethods are often expensive and do not scale to more complex tasks and\nenvironments. To address this limitation we propose an unsupervised RL-based\noffline data collection procedure, to learn complex and scalable policies\nwithout the need for hand-designed controllers or user demonstrations. 
Our\nresearch demonstrates the significance of providing sufficient demonstrations\nfor agents to learn optimal safe-RL policies online, and as a result, we\npropose optimistic forgetting, a novel online safe-RL approach that is\npractical for scenarios with limited data. Further, our unsupervised data\ncollection approach highlights the need to balance diversity and optimality for\nsafe online exploration.\n","authors":["Alexander Quessy","Thomas Richardson","Sebastian East"],"pdf_url":"https://arxiv.org/pdf/2501.04481v1.pdf","comment":"Initially submitted to ICML 2023"},{"id":"http://arxiv.org/abs/2501.04470v1","updated":"2025-01-08T12:48:15Z","published":"2025-01-08T12:48:15Z","title":"Regularising NARX models with multi-task learning","summary":" A Nonlinear Auto-Regressive with eXogenous inputs (NARX) model can be used to\ndescribe time-varying processes; where the output depends on both previous\noutputs and current/previous external input variables. One limitation of NARX\nmodels is their propensity to overfit and result in poor generalisation for\nfuture predictions. The proposed method to help to overcome the issue of\noverfitting is a NARX model which predicts outputs at both the current time and\nseveral lead times into the future. This is a form of multi-task learner (MTL);\nwhereby the lead time outputs will regularise the current time output. 
This\nwork shows that for high noise level, MTL can be used to regularise NARX with a\nlower Normalised Mean Square Error (NMSE) compared to the NMSE of the\nindependent learner counterpart.\n","authors":["Sarah Bee","Lawrence Bull","Nikolaos Dervilis","Keith Worden"],"pdf_url":"https://arxiv.org/pdf/2501.04470v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.08023v2","updated":"2025-01-08T12:40:56Z","published":"2024-09-12T13:05:28Z","title":"Edge-Wise Graph-Instructed Neural Networks","summary":" The problem of multi-task regression over graph nodes has been recently\napproached through Graph-Instructed Neural Network (GINN), which is a promising\narchitecture belonging to the subset of message-passing graph neural networks.\nIn this work, we discuss the limitations of the Graph-Instructed (GI) layer,\nand we formalize a novel edge-wise GI (EWGI) layer. We discuss the advantages\nof the EWGI layer and we provide numerical evidence that EWGINNs perform better\nthan GINNs over some graph-structured input data, like the ones inferred from\nthe Barabasi-Albert graph, and improve the training regularization on graphs\nwith chaotic connectivity, like the ones inferred from the Erdos-Renyi graph.\n","authors":["Francesco Della Santa","Antonio Mastropietro","Sandra Pieraccini","Francesco Vaccarino"],"pdf_url":"https://arxiv.org/pdf/2409.08023v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.16149v4","updated":"2025-01-08T12:40:27Z","published":"2024-03-24T13:43:43Z","title":"Analyzing Consumer IoT Traffic from Security and Privacy Perspectives: a\n Comprehensive Survey","summary":" The Consumer Internet of Things (CIoT), a notable segment within the IoT\ndomain, involves the integration of IoT technology into consumer electronics\nand devices, such as smart homes and smart wearables. Compared to traditional\nIoT fields, CIoT differs notably in target users, product types, and design\napproaches. 
While offering convenience to users, it also raises new security\nand privacy concerns. Network traffic analysis, a widely used technique in the\nsecurity community, has been extensively applied to investigate these concerns\nabout CIoT. Compared to network traffic analysis in other fields such as mobile\napps and websites, CIoT presents unique characteristics, introducing new\nchallenges and research opportunities. Researchers have made significant\ncontributions in this area. To aid researchers in understanding the application\nof traffic analysis tools for studying CIoT security and privacy risks, this\nsurvey reviews 303 publications on traffic analysis within the CIoT security\nand privacy domain from January 2018 to June 2024, focusing on three research\nquestions. Our work: 1) outlines the CIoT traffic analysis process and\nhighlights its differences from general network traffic analysis. 2) summarizes\nand classifies existing research into four categories according to its\napplication objectives: device fingerprinting, user activity inference,\nmalicious traffic detection, and measurement. 3) explores emerging challenges\nand potential future research directions based on each step of the CIoT traffic\nanalysis process. This will provide new insights to the community and guide the\nindustry towards safer product designs.\n","authors":["Yan Jia","Yuxin Song","Zihou Liu","Qingyin Tan","Yang Song","Yu Zhang","Zheli Liu"],"pdf_url":"https://arxiv.org/pdf/2403.16149v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.14434v4","updated":"2025-01-08T12:19:46Z","published":"2024-02-22T10:26:46Z","title":"Parallelized Midpoint Randomization for Langevin Monte Carlo","summary":" We study the problem of sampling from a target probability density function\nin frameworks where parallel evaluations of the log-density gradient are\nfeasible. 
Focusing on smooth and strongly log-concave densities, we revisit the\nparallelized randomized midpoint method and investigate its properties using\nrecently developed techniques for analyzing its sequential version. Through\nthese techniques, we derive upper bounds on the Wasserstein distance between\nsampling and target densities. These bounds quantify the substantial runtime\nimprovements achieved through parallel processing.\n","authors":["Lu Yu","Arnak Dalalyan"],"pdf_url":"https://arxiv.org/pdf/2402.14434v4.pdf","comment":"arXiv admin note: substantial text overlap with arXiv:2306.08494"},{"id":"http://arxiv.org/abs/2501.04453v1","updated":"2025-01-08T12:14:00Z","published":"2025-01-08T12:14:00Z","title":"Gradient Purification: Defense Against Poisoning Attack in Decentralized\n Federated Learning","summary":" Decentralized federated learning (DFL) is inherently vulnerable to poisoning\nattacks, as malicious clients can transmit manipulated model gradients to\nneighboring clients. Existing defense methods either reject suspicious\ngradients per iteration or restart DFL aggregation after detecting all\nmalicious clients. They overlook the potential accuracy benefit from the\ndiscarded malicious gradients. In this paper, we propose a novel gradient\npurification defense, named GPD, that integrates seamlessly with existing DFL\naggregation to defend against poisoning attacks. It aims to mitigate the harm\nin model gradients while retaining the benefit in model weights for enhancing\naccuracy. For each benign client in GPD, a recording variable is designed to\ntrack the historically aggregated gradients from one of its neighbors. It\nallows benign clients to precisely detect malicious neighbors and swiftly\nmitigate aggregated malicious gradients via historical consistency checks. Upon\nmitigation, GPD optimizes model weights via aggregating gradients solely from\nbenign clients. 
This retains the previously beneficial portions from malicious\nclients and exploits the contributions from benign clients, thereby\nsignificantly enhancing the model accuracy. We analyze the convergence of GPD,\nas well as its ability to harvest high accuracy. Extensive experiments over\nthree datasets demonstrate that GPD is capable of mitigating poisoning attacks\nunder both iid and non-iid data distributions. It significantly outperforms\nstate-of-the-art defenses in terms of accuracy against various poisoning\nattacks.\n","authors":["Bin Li","Xiaoye Miao","Yongheng Shang","Xinkui Zhao","Shuiguang Deng","Jianwei Yin"],"pdf_url":"https://arxiv.org/pdf/2501.04453v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04443v1","updated":"2025-01-08T11:52:43Z","published":"2025-01-08T11:52:43Z","title":"Revisiting LocalSGD and SCAFFOLD: Improved Rates and Missing Analysis","summary":" LocalSGD and SCAFFOLD are widely used methods in distributed stochastic\noptimization, with numerous applications in machine learning, large-scale data\nprocessing, and federated learning. However, rigorously establishing their\ntheoretical advantages over simpler methods, such as minibatch SGD (MbSGD), has\nproven challenging, as existing analyses often rely on strong assumptions,\nunrealistic premises, or overly restrictive scenarios.\n In this work, we revisit the convergence properties of LocalSGD and SCAFFOLD\nunder a variety of existing or weaker conditions, including gradient\nsimilarity, Hessian similarity, weak convexity, and Lipschitz continuity of the\nHessian. Our analysis shows that (i) LocalSGD achieves faster convergence\ncompared to MbSGD for weakly convex functions without requiring stronger\ngradient similarity assumptions; (ii) LocalSGD benefits significantly from\nhigher-order similarity and smoothness; and (iii) SCAFFOLD demonstrates faster\nconvergence than MbSGD for a broader class of non-quadratic functions. 
These\ntheoretical insights provide a clearer understanding of the conditions under\nwhich LocalSGD and SCAFFOLD outperform MbSGD.\n","authors":["Ruichen Luo","Sebastian U Stich","Samuel Horváth","Martin Takáč"],"pdf_url":"https://arxiv.org/pdf/2501.04443v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.09570v4","updated":"2025-01-08T11:50:42Z","published":"2024-03-14T17:00:01Z","title":"Multi-Fidelity Bayesian Optimization With Across-Task Transferable\n Max-Value Entropy Search","summary":" In many applications, ranging from logistics to engineering, a designer is\nfaced with a sequence of optimization tasks for which the objectives are in the\nform of black-box functions that are costly to evaluate. Furthermore,\nhigher-fidelity evaluations of the optimization objectives often entail a\nlarger cost. Existing multi-fidelity black-box optimization strategies select\ncandidate solutions and fidelity levels with the goal of maximizing the\ninformation about the optimal value or the optimal solution for the current\ntask. Assuming that successive optimization tasks are related, this paper\nintroduces a novel information-theoretic acquisition function that balances the\nneed to acquire information about the current task with the goal of collecting\ninformation transferable to future tasks. The proposed method transfers across\ntasks distributions over parameters of a Gaussian process surrogate model by\nimplementing particle-based variational Bayesian updates. Theoretical insights\nbased on the analysis of the expected regret substantiate the benefits of\nacquiring transferable knowledge across tasks. 
Furthermore, experimental\nresults across synthetic and real-world examples reveal that the proposed\nacquisition strategy that caters to future tasks can significantly improve the\noptimization efficiency as soon as a sufficient number of tasks is processed.\n","authors":["Yunchuan Zhang","Sangwoo Park","Osvaldo Simeone"],"pdf_url":"https://arxiv.org/pdf/2403.09570v4.pdf","comment":"17 pages, 10 figures, published in IEEE Transactions on Signal\n Processing"},{"id":"http://arxiv.org/abs/2501.03301v2","updated":"2025-01-08T11:47:25Z","published":"2025-01-06T15:19:26Z","title":"Rethinking Byzantine Robustness in Federated Recommendation from Sparse\n Aggregation Perspective","summary":" To preserve user privacy in recommender systems, federated recommendation\n(FR) based on federated learning (FL) emerges, keeping the personal data on the\nlocal client and updating a model collaboratively. Unlike FL, FR has a unique\nsparse aggregation mechanism, where the embedding of each item is updated by\nonly partial clients, instead of full clients in a dense aggregation of general\nFL. Recently, as an essential principle of FL, model security has received\nincreasing attention, especially for Byzantine attacks, where malicious clients\ncan send arbitrary updates. The problem of exploring the Byzantine robustness\nof FR is particularly critical since in the domains applying FR, e.g.,\ne-commerce, malicious clients can be injected easily by registering new\naccounts. However, existing Byzantine works neglect the unique sparse\naggregation of FR, making them unsuitable for our problem. Thus, we make the\nfirst effort to investigate Byzantine attacks on FR from the perspective of\nsparse aggregation, which is non-trivial: it is not clear how to define\nByzantine robustness under sparse aggregations and design Byzantine attacks\nunder limited knowledge/capability. 
In this paper, we reformulate the Byzantine\nrobustness under sparse aggregation by defining the aggregation for a single\nitem as the smallest execution unit. Then we propose a family of effective\nattack strategies, named Spattack, which exploit the vulnerability in sparse\naggregation and are categorized along the adversary's knowledge and capability.\nExtensive experimental results demonstrate that Spattack can effectively\nprevent convergence and even break down defenses under a few malicious clients,\nraising alarms for securing FR systems.\n","authors":["Zhongjian Zhang","Mengmei Zhang","Xiao Wang","Lingjuan Lyu","Bo Yan","Junping Du","Chuan Shi"],"pdf_url":"https://arxiv.org/pdf/2501.03301v2.pdf","comment":"accepted by AAAI 2025"},{"id":"http://arxiv.org/abs/2501.04441v1","updated":"2025-01-08T11:45:50Z","published":"2025-01-08T11:45:50Z","title":"Motif Discovery Framework for Psychiatric EEG Data Classification","summary":" In current medical practice, patients undergoing depression treatment must\nwait four to six weeks before a clinician can assess medication response due to\nthe delayed noticeable effects of antidepressants. Identification of a\ntreatment response at any earlier stage is of great importance, since it can\nreduce the emotional and economic burden connected with the treatment. We\napproach the prediction of a patient response to a treatment as a\nclassification problem, by utilizing the dynamic properties of EEG recordings\non the 7th day of the treatment. We present a novel framework that applies\nmotif discovery to extract meaningful features from EEG data distinguishing\nbetween depression treatment responders and non-responders. We applied our\nframework also to classification tasks in other psychiatric EEG datasets,\nnamely to patients with symptoms of schizophrenia, pediatric patients with\nintractable seizures, and Alzheimer disease and dementia. We achieved high\nclassification precision in all data sets. 
The results demonstrate that the\ndynamic properties of the EEGs may support clinicians in decision making both\nin diagnosis and in the prediction of depression treatment response as early as on\nthe 7th day of the treatment. To the best of our knowledge, our work is the first one\nusing motifs in the depression diagnostics in general.\n","authors":["Melanija Kraljevska","Katerina Hlavackova-Schindler","Lukas Miklautz","Claudia Plant"],"pdf_url":"https://arxiv.org/pdf/2501.04441v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.01087v3","updated":"2025-01-08T11:40:29Z","published":"2025-01-02T06:19:53Z","title":"Bridging Simplicity and Sophistication using GLinear: A Novel\n Architecture for Enhanced Time Series Prediction","summary":" Time Series Forecasting (TSF) is an important application across many fields.\nThere is a debate about whether Transformers, despite being good at\nunderstanding long sequences, struggle with preserving temporal relationships\nin time series data. Recent research suggests that simpler linear models might\noutperform or at least provide competitive performance compared to complex\nTransformer-based models for TSF tasks. In this paper, we propose a novel\ndata-efficient architecture, GLinear, for multivariate TSF that exploits\nperiodic patterns to provide better accuracy. It also provides better\nprediction accuracy by using a smaller amount of historical data compared to\nother state-of-the-art linear predictors. Four different datasets (ETTh1,\nElectricity, Traffic, and Weather) are used to evaluate the performance of the\nproposed predictor. A performance comparison with state-of-the-art linear\narchitectures (such as NLinear, DLinear, and RLinear) and transformer-based\ntime series predictor (Autoformer) shows that the GLinear, despite being\nparametrically efficient, significantly outperforms the existing architectures\nin most cases of multivariate TSF. 
We hope that the proposed GLinear opens new\nfronts of research and development of simpler and more sophisticated\narchitectures for data and computationally efficient time-series analysis.\n","authors":["Syed Tahir Hussain Rizvi","Neel Kanwal","Muddasar Naeem","Alfredo Cuzzocrea","Antonio Coronato"],"pdf_url":"https://arxiv.org/pdf/2501.01087v3.pdf","comment":"Submitted to IEEE Transactions on Emerging Topics in Computational\n Intelligence"},{"id":"http://arxiv.org/abs/2501.04436v1","updated":"2025-01-08T11:37:06Z","published":"2025-01-08T11:37:06Z","title":"Federated Fine-Tuning of LLMs: Framework Comparison and Research\n Directions","summary":" Federated learning (FL) provides a privacy-preserving solution for\nfine-tuning pre-trained large language models (LLMs) using distributed private\ndatasets, enabling task-specific adaptation while preserving data privacy.\nHowever, fine-tuning the extensive parameters in LLMs is particularly\nchallenging in resource-constrained federated scenarios due to the significant\ncommunication and computational costs. To gain a deeper understanding of how\nthese challenges can be addressed, this article conducts a comparative analysis\nof three advanced federated LLM (FedLLM) frameworks that integrate knowledge\ndistillation (KD) and split learning (SL) to mitigate these issues: 1) FedLLMs,\nwhere clients upload model parameters or gradients to enable straightforward\nand effective fine-tuning; 2) KD-FedLLMs, which leverage KD for efficient\nknowledge sharing via logits; and 3) Split-FedLLMs, which split the LLMs into\ntwo parts, with one part executed on the client and the other one on the\nserver, to balance the computational load. Each framework is evaluated based on\nkey performance metrics, including model accuracy, communication overhead, and\nclient-side computational load, offering insights into their effectiveness for\nvarious federated fine-tuning scenarios. 
Through this analysis, we identify\nframework-specific optimization opportunities to enhance the efficiency of\nFedLLMs and discuss broader research directions, highlighting open\nopportunities to better adapt FedLLMs for real-world applications. A use case\nis presented to demonstrate the performance comparison of these three\nframeworks under varying configurations and settings.\n","authors":["Na Yan","Yang Su","Yansha Deng","Robert Schober"],"pdf_url":"https://arxiv.org/pdf/2501.04436v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04426v1","updated":"2025-01-08T11:20:48Z","published":"2025-01-08T11:20:48Z","title":"Dual-Force: Enhanced Offline Diversity Maximization under Imitation\n Constraints","summary":" While many algorithms for diversity maximization under imitation constraints\nare online in nature, many applications require offline algorithms without\nenvironment interactions. Tackling this problem in the offline setting,\nhowever, presents significant challenges that require non-trivial, multi-stage\noptimization processes with non-stationary rewards. In this work, we present a\nnovel offline algorithm that enhances diversity using an objective based on Van\nder Waals (VdW) force and successor features, and eliminates the need to learn\na previously used skill discriminator. Moreover, by conditioning the value\nfunction and policy on a pre-trained Functional Reward Encoding (FRE), our\nmethod allows for better handling of non-stationary rewards and provides\nzero-shot recall of all skills encountered during training, significantly\nexpanding the set of skills learned in prior work. Consequently, our algorithm\nbenefits from receiving a consistently strong diversity signal (VdW), and\nenjoys more stable and efficient training. 
We demonstrate the effectiveness of\nour method in generating diverse skills for two robotic tasks in simulation:\nlocomotion of a quadruped and local navigation with obstacle traversal.\n","authors":["Pavel Kolev","Marin Vlastelica","Georg Martius"],"pdf_url":"https://arxiv.org/pdf/2501.04426v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.02781v3","updated":"2025-01-08T11:16:18Z","published":"2025-01-06T05:53:38Z","title":"From Dense to Sparse: Event Response for Enhanced Residential Load\n Forecasting","summary":" Residential load forecasting (RLF) is crucial for resource scheduling in\npower systems. Most existing methods utilize all given load records (dense\ndata) to indiscriminately extract the dependencies between historical and\nfuture time series. However, there exist important regular patterns residing in\nthe event-related associations among different appliances (sparse knowledge),\nwhich have yet been ignored. In this paper, we propose an Event-Response\nKnowledge Guided approach (ERKG) for RLF by incorporating the estimation of\nelectricity usage events for different appliances, mining event-related sparse\nknowledge from the load series. With ERKG, the event-response estimation\nenables portraying the electricity consumption behaviors of residents,\nrevealing regular variations in appliance operational states. To be specific,\nERKG consists of knowledge extraction and guidance: i) a forecasting model is\ndesigned for the electricity usage events by estimating appliance operational\nstates, aiming to extract the event-related sparse knowledge; ii) a novel\nknowledge-guided mechanism is established by fusing such state estimates of the\nappliance events into the RLF model, which can give particular focuses on the\npatterns of users' electricity consumption behaviors. Notably, ERKG can\nflexibly serve as a plug-in module to boost the capability of existing\nforecasting models by leveraging event response. 
In numerical experiments,\nextensive comparisons and ablation studies have verified the effectiveness of\nour ERKG, e.g., over 8% MAE can be reduced on the tested state-of-the-art\nforecasting models.\n","authors":["Xin Cao","Qinghua Tao","Yingjie Zhou","Lu Zhang","Le Zhang","Dongjin Song","Dapeng Oliver Wu","Ce Zhu"],"pdf_url":"https://arxiv.org/pdf/2501.02781v3.pdf","comment":"12 pages and 6 figures. Accepted for publication by IEEE Transactions\n on Instrumentation and Measurement"},{"id":"http://arxiv.org/abs/2501.04421v1","updated":"2025-01-08T11:11:25Z","published":"2025-01-08T11:11:25Z","title":"Risk-averse policies for natural gas futures trading using\n distributional reinforcement learning","summary":" Financial markets have experienced significant instabilities in recent years,\ncreating unique challenges for trading and increasing interest in risk-averse\nstrategies. Distributional Reinforcement Learning (RL) algorithms, which model\nthe full distribution of returns rather than just expected values, offer a\npromising approach to managing market uncertainty. This paper investigates this\npotential by studying the effectiveness of three distributional RL algorithms\nfor natural gas futures trading and exploring their capacity to develop\nrisk-averse policies. Specifically, we analyze the performance and behavior of\nCategorical Deep Q-Network (C51), Quantile Regression Deep Q-Network (QR-DQN),\nand Implicit Quantile Network (IQN). To the best of our knowledge, these\nalgorithms have never been applied in a trading context. These policies are\ncompared against five Machine Learning (ML) baselines, using a detailed dataset\nprovided by Predictive Layer SA, a company supplying ML-based strategies for\nenergy trading. The main contributions of this study are as follows. (1) We\ndemonstrate that distributional RL algorithms significantly outperform\nclassical RL methods, with C51 achieving performance improvement of more than\n32\\%. 
(2) We show that training C51 and IQN to maximize CVaR produces\nrisk-sensitive policies with adjustable risk aversion. Specifically, our\nablation studies reveal that lower CVaR confidence levels increase risk\naversion, while higher levels decrease it, offering flexible risk management\noptions. In contrast, QR-DQN shows less predictable behavior. These findings\nemphasize the potential of distributional RL for developing adaptable,\nrisk-averse trading strategies in volatile markets.\n","authors":["Félicien Hêche","Biagio Nigro","Oussama Barakat","Stephan Robert-Nicoud"],"pdf_url":"https://arxiv.org/pdf/2501.04421v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.08704v3","updated":"2025-01-08T11:07:42Z","published":"2024-05-14T15:42:55Z","title":"Full Line Code Completion: Bringing AI to Desktop","summary":" In recent years, several industrial solutions for the problem of multi-token\ncode completion appeared, each making a great advance in the area but mostly\nfocusing on cloud-based runtime and avoiding working on the end user's device.\n In this work, we describe our approach for building a multi-token code\ncompletion feature for the JetBrains' IntelliJ Platform, which we call Full\nLine Code Completion. The feature suggests only syntactically correct code and\nworks fully locally, i.e., data querying and the generation of suggestions\nhappens on the end user's machine. We share important time and\nmemory-consumption restrictions, as well as design principles that a code\ncompletion engine should satisfy. Working entirely on the end user's device,\nour code completion engine enriches user experience while being not only fast\nand compact but also secure. 
We share a number of useful techniques to meet the\nstated development constraints and also describe offline and online evaluation\npipelines that allowed us to make better decisions.\n Our online evaluation shows that the usage of the tool leads to 1.3 times\nmore Python code in the IDE being produced by code completion. The described\nsolution was initially started with a help of researchers and was then bundled\ninto all JetBrains IDEs where it is now used by millions of users. Thus, we\nbelieve that this work is useful for bridging academia and industry, providing\nresearchers with the knowledge of what happens when complex research-based\nsolutions are integrated into real products.\n","authors":["Anton Semenkin","Vitaliy Bibaev","Yaroslav Sokolov","Kirill Krylov","Alexey Kalina","Anna Khannanova","Danila Savenkov","Darya Rovdo","Igor Davidenko","Kirill Karnaukhov","Maxim Vakhrushev","Mikhail Kostyukov","Mikhail Podvitskii","Petr Surkov","Yaroslav Golubev","Nikita Povarov","Timofey Bryksin"],"pdf_url":"https://arxiv.org/pdf/2405.08704v3.pdf","comment":"Published at ICSE'25. 12 pages, 4 figures"},{"id":"http://arxiv.org/abs/2501.04413v1","updated":"2025-01-08T10:59:36Z","published":"2025-01-08T10:59:36Z","title":"Machine Learning and statistical classification of CRISPR-Cas12a\n diagnostic assays","summary":" CRISPR-based diagnostics have gained increasing attention as biosensing tools\nable to address limitations in contemporary molecular diagnostic tests. To\nmaximise the performance of CRISPR-based assays, much effort has focused on\noptimizing the chemistry and biology of the biosensing reaction. However, less\nattention has been paid to improving the techniques used to analyse\nCRISPR-based diagnostic data. To date, diagnostic decisions typically involve\nvarious forms of slope-based classification. Such methods are superior to\ntraditional methods based on assessing absolute signals, but still have\nlimitations. 
Herein, we establish performance benchmarks (total accuracy,\nsensitivity, and specificity) using common slope-based methods. We compare the\nperformance of these benchmark methods with three different quadratic empirical\ndistribution function statistical tests, finding significant improvements in\ndiagnostic speed and accuracy when applied to a clinical data set. Two of the\nthree statistical techniques, the Kolmogorov-Smirnov and Anderson-Darling\ntests, report the lowest time-to-result and highest total test accuracy.\nFurthermore, we developed a long short-term memory recurrent neural network to\nclassify CRISPR-biosensing data, achieving 100% specificity on our model data\nset. Finally, we provide guidelines on choosing the classification method and\nclassification method parameters that best suit a diagnostic assay's needs.\n","authors":["Nathan Khosla","Jake M. Lesinski","Marcus Haywood-Alexander","Andrew J. deMello","Daniel A. Richards"],"pdf_url":"https://arxiv.org/pdf/2501.04413v1.pdf","comment":"25 pages, 5 figures, research paper. Nathan Khosla and Jake M.\n Lesinski contributed equally. Electronic supporting information is included\n as an appendix"},{"id":"http://arxiv.org/abs/2501.04410v1","updated":"2025-01-08T10:49:13Z","published":"2025-01-08T10:49:13Z","title":"User Simulation in the Era of Generative AI: User Modeling, Synthetic\n Data Generation, and System Evaluation","summary":" User simulation is an emerging interdisciplinary topic with multiple critical\napplications in the era of Generative AI. It involves creating an intelligent\nagent that mimics the actions of a human user interacting with an AI system,\nenabling researchers to model and analyze user behaviour, generate synthetic\ndata for training, and evaluate interactive AI systems in a controlled and\nreproducible manner. User simulation has profound implications for diverse\nfields and plays a vital role in the pursuit of Artificial General\nIntelligence. 
This paper provides an overview of user simulation, highlighting\nits key applications, connections to various disciplines, and outlining future\nresearch directions to advance this increasingly important technology.\n","authors":["Krisztian Balog","ChengXiang Zhai"],"pdf_url":"https://arxiv.org/pdf/2501.04410v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04409v1","updated":"2025-01-08T10:49:06Z","published":"2025-01-08T10:49:06Z","title":"Lossless Privacy-Preserving Aggregation for Decentralized Federated\n Learning","summary":" Privacy concerns arise as sensitive data proliferate. Despite decentralized\nfederated learning (DFL) aggregating gradients from neighbors to avoid direct\ndata transmission, it still poses indirect data leaks from the transmitted\ngradients. Existing privacy-preserving methods for DFL add noise to gradients.\nThey either diminish the model predictive accuracy or suffer from ineffective\ngradient protection. In this paper, we propose a novel lossless\nprivacy-preserving aggregation rule named LPPA to enhance gradient protection\nas much as possible but without loss of DFL model predictive accuracy. LPPA\nsubtly injects the noise difference between the sent and received noise into\ntransmitted gradients for gradient protection. The noise difference\nincorporates neighbors' randomness for each client, effectively safeguarding\nagainst data leaks. LPPA employs the noise flow conservation theory to ensure\nthat the noise impact can be globally eliminated. The global sum of all noise\ndifferences remains zero, ensuring that accurate gradient aggregation is\nunaffected and the model accuracy remains intact. We theoretically prove that\nthe privacy-preserving capacity of LPPA is \\sqrt{2} times greater than that of\nnoise addition, while maintaining comparable model accuracy to the standard DFL\naggregation without noise injection. 
Experimental results verify the\ntheoretical findings and show that LPPA achieves a 13% mean improvement in\naccuracy over noise addition. We also demonstrate the effectiveness of LPPA in\nprotecting raw data and guaranteeing lossless model accuracy.\n","authors":["Xiaoye Miao","Bin Li","Yangyang Wu","Meng Xi","Xinkui Zhao","Jianwei Yin"],"pdf_url":"https://arxiv.org/pdf/2501.04409v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.00817v2","updated":"2025-01-08T10:42:53Z","published":"2025-01-01T12:04:06Z","title":"Hardness of Learning Fixed Parities with Neural Networks","summary":" Learning parity functions is a canonical problem in learning theory, which\nalthough computationally tractable, is not amenable to standard learning\nalgorithms such as gradient-based methods. This hardness is usually explained\nvia statistical query lower bounds [Kearns, 1998]. However, these bounds only\nimply that for any given algorithm, there is some worst-case parity function\nthat will be hard to learn. Thus, they do not explain why fixed parities - say,\nthe full parity function over all coordinates - are difficult to learn in\npractice, at least with standard predictors and gradient-based methods [Abbe\nand Boix-Adsera, 2022]. In this paper, we address this open problem, by showing\nthat for any fixed parity of some minimal size, using it as a target function\nto train one-hidden-layer ReLU networks with perturbed gradient descent will\nfail to produce anything meaningful. 
To establish this, we prove a new result\nabout the decay of the Fourier coefficients of linear threshold (or weighted\nmajority) functions, which may be of independent interest.\n","authors":["Itamar Shoshani","Ohad Shamir"],"pdf_url":"https://arxiv.org/pdf/2501.00817v2.pdf","comment":"An updated version was uploaded in order to fix a typo at theorem 2\n statement"},{"id":"http://arxiv.org/abs/2501.04403v1","updated":"2025-01-08T10:33:21Z","published":"2025-01-08T10:33:21Z","title":"Rising Rested MAB with Linear Drift","summary":" We consider non-stationary multi-arm bandit (MAB) where the expected reward\nof each action follows a linear function of the number of times we executed the\naction. Our main result is a tight regret bound of\n$\\tilde{\\Theta}(T^{4/5}K^{3/5})$, by providing both upper and lower bounds. We\nextend our results to derive instance dependent regret bounds, which depend on\nthe unknown parametrization of the linear drift of the rewards.\n","authors":["Omer Amichay","Yishay Mansour"],"pdf_url":"https://arxiv.org/pdf/2501.04403v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04401v1","updated":"2025-01-08T10:29:35Z","published":"2025-01-08T10:29:35Z","title":"Tracking UWB Devices Through Radio Frequency Fingerprinting Is Possible","summary":" Ultra-wideband (UWB) is a state-of-the-art technology designed for\napplications requiring centimeter-level localization. Its widespread adoption\nby smartphone manufacturer naturally raises security and privacy concerns.\nSuccessfully implementing Radio Frequency Fingerprinting (RFF) to UWB could\nenable physical layer security, but might also allow undesired tracking of the\ndevices. The scope of this paper is to explore the feasibility of applying RFF\nto UWB and investigates how well this technique generalizes across different\nenvironments. We collected a realistic dataset using off-the-shelf UWB devices\nwith controlled variation in device positioning. 
Moreover, we developed an\nimproved deep learning pipeline to extract the hardware signature from the\nsignal data. In stable conditions, the extracted RFF achieves over 99%\naccuracy. While the accuracy decreases in more changing environments, we still\nobtain up to 76% accuracy in untrained locations.\n","authors":["Thibaud Ardoin","Niklas Pauli","Benedikt Groß","Mahsa Kholghi","Khan Reaz","Gerhard Wunder"],"pdf_url":"https://arxiv.org/pdf/2501.04401v1.pdf","comment":"conference ICNC'25, 7 pages, 7 figures"},{"id":"http://arxiv.org/abs/2207.03890v3","updated":"2025-01-08T10:05:34Z","published":"2022-07-08T13:25:06Z","title":"ENCODE: Encoding NetFlows for Network Anomaly Detection","summary":" NetFlow data is a popular network log format used by many network analysts\nand researchers. The advantages of using NetFlow over deep packet inspection\nare that it is easier to collect and process, and it is less privacy intrusive.\nMany works have used machine learning to detect network attacks using NetFlow\ndata. The first step for these machine learning pipelines is to pre-process the\ndata before it is given to the machine learning algorithm. Many approaches\nexist to pre-process NetFlow data; however, these simply apply existing methods\nto the data, not considering the specific properties of network data. We argue\nthat for data originating from software systems, such as NetFlow or software\nlogs, similarities in frequency and contexts of feature values are more\nimportant than similarities in the value itself. In this work, we propose an\nencoding algorithm that directly takes the frequency and the context of the\nfeature values into account when the data is being processed. Different types\nof network behaviours can be clustered using this encoding, thus aiding the\nprocess of detecting anomalies within the network. We train several machine\nlearning models for anomaly detection using the data that has been encoded with\nour encoding algorithm. 
We evaluate the effectiveness of our encoding on a new\ndataset that we created for network attacks on Kubernetes clusters and two\nwell-known public NetFlow datasets. We empirically demonstrate that the machine\nlearning models benefit from using our encoding for anomaly detection.\n","authors":["Clinton Cao","Annibale Panichella","Sicco Verwer","Agathe Blaise","Filippo Rebecchi"],"pdf_url":"https://arxiv.org/pdf/2207.03890v3.pdf","comment":"11 pages, 17 figures"},{"id":"http://arxiv.org/abs/2501.04387v1","updated":"2025-01-08T09:57:08Z","published":"2025-01-08T09:57:08Z","title":"The unbearable lightness of Restricted Boltzmann Machines: Theoretical\n Insights and Biological Applications","summary":" Restricted Boltzmann Machines are simple yet powerful neural networks. They\ncan be used for learning structure in data, and are used as a building block of\nmore complex neural architectures. At the same time, their simplicity makes\nthem easy to use, amenable to theoretical analysis, yielding interpretable\nmodels in applications. Here, we focus on reviewing the role that the\nactivation functions, describing the input-output relationship of single\nneurons in RBM, play in the functionality of these models. We discuss recent\ntheoretical results on the benefits and limitations of different activation\nfunctions. We also review applications to biological data analysis, namely\nneural data analysis, where RBM units are mostly taken to have sigmoid\nactivation functions and binary units, to protein data analysis and immunology\nwhere non-binary units and non-sigmoid activation functions have recently been\nshown to yield important insights into the data. Finally, we discuss open\nproblems addressing which can shed light on broader issues in neural network\nresearch.\n","authors":["Giovanni di Sarra","Barbara Bravi","Yasser Roudi"],"pdf_url":"https://arxiv.org/pdf/2501.04387v1.pdf","comment":"7 pages, 3 figures. To be published in EPL as di Sarra et al 2025\n EPL. 
Accepted manuscript available online at\n https://doi.org/10.1209/0295-5075/ada636"},{"id":"http://arxiv.org/abs/2409.20431v2","updated":"2025-01-08T09:54:15Z","published":"2024-09-30T15:53:24Z","title":"Multilevel Picard approximations and deep neural networks with ReLU,\n leaky ReLU, and softplus activation overcome the curse of dimensionality when\n approximating semilinear parabolic partial differential equations in\n $L^p$-sense","summary":" We prove that multilevel Picard approximations and deep neural networks with\nReLU, leaky ReLU, and softplus activation are capable of approximating\nsolutions of semilinear Kolmogorov PDEs in $L^\\mathfrak{p}$-sense,\n$\\mathfrak{p}\\in [2,\\infty)$, in the case of gradient-independent,\nLipschitz-continuous nonlinearities, while the computational effort of the\nmultilevel Picard approximations and the required number of parameters in the\nneural networks grow at most polynomially in both dimension $d\\in \\mathbb{N}$\nand reciprocal of the prescribed accuracy $\\epsilon$.\n","authors":["Ariel Neufeld","Tuan Anh Nguyen"],"pdf_url":"https://arxiv.org/pdf/2409.20431v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19109v2","updated":"2025-01-08T09:48:02Z","published":"2024-12-26T07:58:09Z","title":"Stochastic normalizing flows for Effective String Theory","summary":" Effective String Theory (EST) is a powerful tool used to study confinement in\npure gauge theories by modeling the confining flux tube connecting a static\nquark-anti-quark pair as a thin vibrating string. Recently, flow-based samplers\nhave been applied as an efficient numerical method to study EST regularized on\nthe lattice, opening the route to study observables previously inaccessible to\nstandard analytical methods. Flow-based samplers are a class of algorithms\nbased on Normalizing Flows (NFs), deep generative models recently proposed as a\npromising alternative to traditional Markov Chain Monte Carlo methods in\nlattice field theory calculations. 
By combining NF layers with\nout-of-equilibrium stochastic updates, we obtain Stochastic Normalizing Flows\n(SNFs), a scalable class of machine learning algorithms that can be explained\nin terms of stochastic thermodynamics. In this contribution, we outline EST and\nSNFs, and report some numerical results for the shape of the flux tube.\n","authors":["Michele Caselle","Elia Cellini","Alessandro Nada"],"pdf_url":"https://arxiv.org/pdf/2412.19109v2.pdf","comment":"1+ 10 pages, 2 figures, contribution for the 41st International\n Symposium on Lattice Field Theory (Lattice 2024), 28 July - 3 August 2024,\n Liverpool, UK; v2: 1+ 10 pages, 2 figures, reference added"},{"id":"http://arxiv.org/abs/2501.04377v1","updated":"2025-01-08T09:34:15Z","published":"2025-01-08T09:34:15Z","title":"On Computational Limits and Provably Efficient Criteria of Visual\n Autoregressive Models: A Fine-Grained Complexity Analysis","summary":" Recently, Visual Autoregressive ($\\mathsf{VAR}$) Models introduced a\ngroundbreaking advancement in the field of image generation, offering a\nscalable approach through a coarse-to-fine \"next-scale prediction\" paradigm.\nHowever, the state-of-the-art algorithm of $\\mathsf{VAR}$ models in [Tian,\nJiang, Yuan, Peng and Wang, NeurIPS 2024] takes $O(n^4)$ time, which is\ncomputationally inefficient. In this work, we analyze the computational limits\nand efficiency criteria of $\\mathsf{VAR}$ Models through a fine-grained\ncomplexity lens. Our key contribution is identifying the conditions under which\n$\\mathsf{VAR}$ computations can achieve sub-quadratic time complexity.\nSpecifically, we establish a critical threshold for the norm of input matrices\nused in $\\mathsf{VAR}$ attention mechanisms. Above this threshold, assuming the\nStrong Exponential Time Hypothesis ($\\mathsf{SETH}$) from fine-grained\ncomplexity theory, a sub-quartic time algorithm for $\\mathsf{VAR}$ models is\nimpossible. 
To substantiate our theoretical findings, we present efficient\nconstructions leveraging low-rank approximations that align with the derived\ncriteria. This work initiates the study of the computational efficiency of the\n$\\mathsf{VAR}$ model from a theoretical perspective. Our technique will shed\nlight on advancing scalable and efficient image generation in $\\mathsf{VAR}$\nframeworks.\n","authors":["Yekun Ke","Xiaoyu Li","Yingyu Liang","Zhizhou Sha","Zhenmei Shi","Zhao Song"],"pdf_url":"https://arxiv.org/pdf/2501.04377v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.18601v2","updated":"2025-01-08T09:30:47Z","published":"2024-07-26T08:41:58Z","title":"Reorganizing attention-space geometry with expressive attention","summary":" Attention regulates information transfer between tokens. For this, query and\nkey vectors are compared, typically in terms of a scalar product,\n$\\mathbf{Q}^T\\mathbf{K}$, together with a subsequent softmax normalization. In\ngeometric terms, the standard dot-product attention (DPA) leads to large/small\nattention weights for parallel/antiparallel queries and keys. Here we study\nexpressive attention (EA), which is based on $(\\mathbf{Q}^T\\mathbf{K})^2$, the\nsquared dot product. In this case, attention is enhanced when query and key are\neither parallel or antiparallel, and suppressed for orthogonal configurations.\nEA can be introduced into any attention-based code without additional compute\ncosts or memory requirements. For a series of autoregressive prediction tasks,\nwe find that expressive attention performs at least as well as vanilla DPA.\nIncreasing task complexity, EA is observed to outperform DPA with increasing\nmargins, which also holds for multi-task settings. For a given model size, EA\nmanages to achieve 100% performance for a range of complexity levels not\naccessible to DPA. 
Our results show that it is possible to reorganize the\ngeometry of the matching condition in the space of attention heads without loss\nof performance.\n","authors":["Claudius Gros"],"pdf_url":"https://arxiv.org/pdf/2407.18601v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.15267v2","updated":"2025-01-08T09:18:05Z","published":"2024-12-17T05:04:57Z","title":"Toxicity Detection towards Adaptability to Changing Perturbations","summary":" Toxicity detection is crucial for maintaining the peace of the society. While\nexisting methods perform well on normal toxic contents or those generated by\nspecific perturbation methods, they are vulnerable to evolving perturbation\npatterns. However, in real-world scenarios, malicious users tend to create new\nperturbation patterns for fooling the detectors. For example, some users may\ncircumvent the detector of large language models (LLMs) by adding `I am a\nscientist' at the beginning of the prompt. In this paper, we introduce a novel\nproblem, i.e., continual learning jailbreak perturbation patterns, into the\ntoxicity detection field. To tackle this problem, we first construct a new\ndataset generated by 9 types of perturbation patterns, 7 of them are summarized\nfrom prior work and 2 of them are developed by us. We then systematically\nvalidate the vulnerability of current methods on this new perturbation\npattern-aware dataset via both the zero-shot and fine tuned cross-pattern\ndetection. Upon this, we present the domain incremental learning paradigm and\nthe corresponding benchmark to ensure the detector's robustness to dynamically\nemerging types of perturbed toxic text. 
Our code and dataset are provided in\nthe appendix and will be publicly available at GitHub, by which we wish to\noffer new research opportunities for the security-relevant communities.\n","authors":["Hankun Kang","Jianhao Chen","Yongqi Li","Xin Miao","Mayi Xu","Ming Zhong","Yuanyuan Zhu","Tieyun Qian"],"pdf_url":"https://arxiv.org/pdf/2412.15267v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.04080v3","updated":"2025-01-08T09:03:14Z","published":"2024-02-06T15:34:30Z","title":"Entropy-regularized Diffusion Policy with Q-Ensembles for Offline\n Reinforcement Learning","summary":" This paper presents advanced techniques of training diffusion policies for\noffline reinforcement learning (RL). At the core is a mean-reverting stochastic\ndifferential equation (SDE) that transfers a complex action distribution into a\nstandard Gaussian and then samples actions conditioned on the environment state\nwith a corresponding reverse-time SDE, like a typical diffusion policy. We show\nthat such an SDE has a solution that we can use to calculate the log\nprobability of the policy, yielding an entropy regularizer that improves the\nexploration of offline datasets. To mitigate the impact of inaccurate value\nfunctions from out-of-distribution data points, we further propose to learn the\nlower confidence bound of Q-ensembles for more robust policy improvement. By\ncombining the entropy-regularized diffusion policy with Q-ensembles in offline\nRL, our method achieves state-of-the-art performance on most tasks in D4RL\nbenchmarks. Code is available at\nhttps://github.com/ruoqizzz/Entropy-Regularized-Diffusion-Policy-with-QEnsemble.\n","authors":["Ruoqi Zhang","Ziwei Luo","Jens Sjölund","Thomas B. 
Schön","Per Mattsson"],"pdf_url":"https://arxiv.org/pdf/2402.04080v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03562v2","updated":"2025-01-08T08:57:32Z","published":"2025-01-07T06:22:55Z","title":"Rethinking Adversarial Attacks in Reinforcement Learning from Policy\n Distribution Perspective","summary":" Deep Reinforcement Learning (DRL) suffers from uncertainties and inaccuracies\nin the observation signal in realworld applications. Adversarial attack is an\neffective method for evaluating the robustness of DRL agents. However, existing\nattack methods targeting individual sampled actions have limited impacts on the\noverall policy distribution, particularly in continuous action spaces. To\naddress these limitations, we propose the Distribution-Aware Projected Gradient\nDescent attack (DAPGD). DAPGD uses distribution similarity as the gradient\nperturbation input to attack the policy network, which leverages the entire\npolicy distribution rather than relying on individual samples. We utilize the\nBhattacharyya distance in DAPGD to measure policy similarity, enabling\nsensitive detection of subtle but critical differences between probability\ndistributions. 
Our experiment results demonstrate that DAPGD achieves SOTA\nresults compared to the baselines in three robot navigation tasks, achieving an\naverage 22.03% higher reward drop compared to the best baseline.\n","authors":["Tianyang Duan","Zongyuan Zhang","Zheng Lin","Yue Gao","Ling Xiong","Yong Cui","Hongbin Liang","Xianhao Chen","Heming Cui","Dong Huang"],"pdf_url":"https://arxiv.org/pdf/2501.03562v2.pdf","comment":"10 pages, 2 figures, 2 tables"},{"id":"http://arxiv.org/abs/2501.04359v1","updated":"2025-01-08T08:55:10Z","published":"2025-01-08T08:55:10Z","title":"Decoding EEG Speech Perception with Transformers and VAE-based Data\n Augmentation","summary":" Decoding speech from non-invasive brain signals, such as\nelectroencephalography (EEG), has the potential to advance brain-computer\ninterfaces (BCIs), with applications in silent communication and assistive\ntechnologies for individuals with speech impairments. However, EEG-based speech\ndecoding faces major challenges, such as noisy data, limited datasets, and poor\nperformance on complex tasks like speech perception. This study attempts to\naddress these challenges by employing variational autoencoders (VAEs) for EEG\ndata augmentation to improve data quality and applying a state-of-the-art\n(SOTA) sequence-to-sequence deep learning architecture, originally successful\nin electromyography (EMG) tasks, to EEG-based speech decoding. Additionally, we\nadapt this architecture for word classification tasks. Using the Brennan\ndataset, which contains EEG recordings of subjects listening to narrated\nspeech, we preprocess the data and evaluate both classification and\nsequence-to-sequence models for EEG-to-words/sentences tasks. Our experiments\nshow that VAEs have the potential to reconstruct artificial EEG data for\naugmentation. Meanwhile, our sequence-to-sequence model achieves more promising\nperformance in generating sentences compared to our classification model,\nthough both remain challenging tasks. 
These findings lay the groundwork for\nfuture research on EEG speech perception decoding, with possible extensions to\nspeech production tasks such as silent or imagined speech.\n","authors":["Terrance Yu-Hao Chen","Yulin Chen","Pontus Soederhaell","Sadrishya Agrawal","Kateryna Shapovalenko"],"pdf_url":"https://arxiv.org/pdf/2501.04359v1.pdf","comment":"19 pages, 15 figures, 2 tables"},{"id":"http://arxiv.org/abs/2501.04353v1","updated":"2025-01-08T08:51:35Z","published":"2025-01-08T08:51:35Z","title":"DeFusion: An Effective Decoupling Fusion Network for Multi-Modal\n Pregnancy Prediction","summary":" Temporal embryo images and parental fertility table indicators are both\nvaluable for pregnancy prediction in \\textbf{in vitro fertilization embryo\ntransfer} (IVF-ET). However, current machine learning models cannot make full\nuse of the complementary information between the two modalities to improve\npregnancy prediction performance. In this paper, we propose a Decoupling Fusion\nNetwork called DeFusion to effectively integrate the multi-modal information\nfor IVF-ET pregnancy prediction. Specifically, we propose a decoupling fusion\nmodule that decouples the information from the different modalities into\nrelated and unrelated information, thereby achieving a more delicate fusion.\nAnd we fuse temporal embryo images with a spatial-temporal position encoding,\nand extract fertility table indicator information with a table transformer. To\nevaluate the effectiveness of our model, we use a new dataset including 4046\ncases collected from Southern Medical University. The experiments show that our\nmodel outperforms state-of-the-art methods. Meanwhile, the performance on the\neye disease prediction dataset reflects the model's good generalization. 
Our\ncode and dataset are available at https://github.com/Ou-Young-1999/DFNet.\n","authors":["Xueqiang Ouyang","Jia Wei","Wenjie Huo","Xiaocong Wang","Rui Li","Jianlong Zhou"],"pdf_url":"https://arxiv.org/pdf/2501.04353v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04339v1","updated":"2025-01-08T08:21:58Z","published":"2025-01-08T08:21:58Z","title":"DCIts -- Deep Convolutional Interpreter for time series","summary":" We introduce an interpretable deep learning model for multivariate time\nseries forecasting that prioritizes both predictive performance and\ninterpretability - key requirements for understanding complex physical\nphenomena. Our model not only matches but often surpasses existing\ninterpretability methods, achieving this without compromising accuracy. Through\nextensive experiments, we demonstrate its ability to identify the most relevant\ntime series and lags that contribute to forecasting future values, providing\nintuitive and transparent explanations for its predictions. To minimize the\nneed for manual supervision, the model is designed so one can robustly\ndetermine the optimal window size that captures all necessary interactions\nwithin the smallest possible time frame. Additionally, it effectively\nidentifies the optimal model order, balancing complexity when incorporating\nhigher-order terms. These advancements hold significant implications for\nmodeling and understanding dynamic systems, making the model a valuable tool\nfor applied and computational physicists.\n","authors":["Davor Horvatic","Domjan Baric"],"pdf_url":"https://arxiv.org/pdf/2501.04339v1.pdf","comment":"37 pages, 15 figures"},{"id":"http://arxiv.org/abs/2405.18725v2","updated":"2025-01-08T08:20:07Z","published":"2024-05-29T03:16:12Z","title":"Can We Enhance the Quality of Mobile Crowdsensing Data Without Ground\n Truth?","summary":" Mobile crowdsensing (MCS) has emerged as a prominent trend across various\ndomains. 
However, ensuring the quality of the sensing data submitted by mobile\nusers (MUs) remains a complex and challenging problem. To address this\nchallenge, an advanced method is needed to detect low-quality sensing data and\nidentify malicious MUs that may disrupt the normal operations of an MCS system.\nTherefore, this article proposes a prediction- and reputation-based truth\ndiscovery (PRBTD) framework, which can separate low-quality data from\nhigh-quality data in sensing tasks. First, we apply a correlation-focused\nspatio-temporal Transformer network that learns from the historical sensing\ndata and predicts the ground truth of the data submitted by MUs. However, due\nto the noise in historical data for training and the bursty values within\nsensing data, the prediction results can be inaccurate. To address this issue,\nwe use the implications among the sensing data, which are learned from the\nprediction results but are stable and less affected by inaccurate predictions,\nto evaluate the quality of the data. Finally, we design a reputation-based\ntruth discovery (TD) module for identifying low-quality data with their\nimplications. Given the sensing data submitted by MUs, PRBTD can eliminate the\ndata with heavy noise and identify malicious MUs with high accuracy. 
Extensive\nexperimental results demonstrate that the PRBTD method outperforms existing\nmethods in terms of identification accuracy and data quality enhancement.\n","authors":["Jiajie Li","Bo Gu","Shimin Gong","Zhou Su","Mohsen Guizani"],"pdf_url":"https://arxiv.org/pdf/2405.18725v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04331v1","updated":"2025-01-08T08:05:18Z","published":"2025-01-08T08:05:18Z","title":"AutoDFL: A Scalable and Automated Reputation-Aware Decentralized\n Federated Learning","summary":" Blockchained federated learning (BFL) combines the concepts of federated\nlearning and blockchain technology to enhance privacy, security, and\ntransparency in collaborative machine learning models. However, implementing\nBFL frameworks poses challenges in terms of scalability and cost-effectiveness.\nReputation-aware BFL poses even more challenges, as blockchain validators are\ntasked with processing federated learning transactions along with the\ntransactions that evaluate FL tasks and aggregate reputations. This leads to\nfaster blockchain congestion and performance degradation. To improve BFL\nefficiency while increasing scalability and reducing on-chain reputation\nmanagement costs, this paper proposes AutoDFL, a scalable and automated\nreputation-aware decentralized federated learning framework. AutoDFL leverages\nzk-Rollups as a Layer-2 scaling solution to boost the performance while\nmaintaining the same level of security as the underlying Layer-1 blockchain.\nMoreover, AutoDFL introduces an automated and fair reputation model designed to\nincentivize federated learning actors. We develop a proof of concept for our\nframework for an accurate evaluation. 
Tested with various custom workloads,\nAutoDFL reaches an average throughput of over 3000 TPS with a gas reduction of\nup to 20X.\n","authors":["Meryem Malak Dif","Mouhamed Amine Bouchiha","Mourad Rabah","Yacine Ghamri-Doudane"],"pdf_url":"https://arxiv.org/pdf/2501.04331v1.pdf","comment":"Paper accepted at NOMS'2025 (pages 9, figures 5)"},{"id":"http://arxiv.org/abs/2406.01189v3","updated":"2025-01-08T07:59:53Z","published":"2024-06-03T10:51:43Z","title":"MultiMax: Sparse and Multi-Modal Attention Learning","summary":" SoftMax is a ubiquitous ingredient of modern machine learning algorithms. It\nmaps an input vector onto a probability simplex and reweights the input by\nconcentrating the probability mass at large entries. Yet, as a smooth\napproximation to the Argmax function, a significant amount of probability mass\nis distributed to other, residual entries, leading to poor interpretability and\nnoise. Although sparsity can be achieved by a family of SoftMax variants, they\noften require an alternative loss function and do not preserve multi-modality.\nWe show that this trade-off between multi-modality and sparsity limits the\nexpressivity of SoftMax as well as its variants. We provide a solution to this\ntension between objectives by proposing a piece-wise differentiable function,\ntermed MultiMax, which adaptively modulates the output distribution according\nto input entry range. Through comprehensive analysis and evaluation, we show\nthat MultiMax successfully produces a distribution that suppresses irrelevant\nentries while preserving multimodality, with benefits in image classification,\nlanguage modeling and machine translation. 
The code is available at\nhttps://github.com/ZhouYuxuanYX/MultiMax.\n","authors":["Yuxuan Zhou","Mario Fritz","Margret Keuper"],"pdf_url":"https://arxiv.org/pdf/2406.01189v3.pdf","comment":"Accepted at ICML 2024"},{"id":"http://arxiv.org/abs/2501.04319v1","updated":"2025-01-08T07:32:54Z","published":"2025-01-08T07:32:54Z","title":"VerifBFL: Leveraging zk-SNARKs for A Verifiable Blockchained Federated\n Learning","summary":" Blockchain-based Federated Learning (FL) is an emerging decentralized machine\nlearning paradigm that enables model training without relying on a central\nserver. Although some BFL frameworks are considered privacy-preserving, they\nare still vulnerable to various attacks, including inference and model\npoisoning. Additionally, most of these solutions employ strong trust\nassumptions among all participating entities or introduce incentive mechanisms\nto encourage collaboration, making them susceptible to multiple security flaws.\nThis work presents VerifBFL, a trustless, privacy-preserving, and verifiable\nfederated learning framework that integrates blockchain technology and\ncryptographic protocols. By employing zero-knowledge Succinct Non-Interactive\nArgument of Knowledge (zk-SNARKs) and incrementally verifiable computation\n(IVC), VerifBFL ensures the verifiability of both local training and\naggregation processes. The proofs of training and aggregation are verified\non-chain, guaranteeing the integrity and auditability of each participant's\ncontributions. To protect training data from inference attacks, VerifBFL\nleverages differential privacy. Finally, to demonstrate the efficiency of the\nproposed protocols, we built a proof of concept using emerging tools. 
The\nresults show that generating proofs for local training and aggregation in\nVerifBFL takes less than 81s and 2s, respectively, while verifying them\non-chain takes less than 0.6s.\n","authors":["Ahmed Ayoub Bellachia","Mouhamed Amine Bouchiha","Yacine Ghamri-Doudane","Mourad Rabah"],"pdf_url":"https://arxiv.org/pdf/2501.04319v1.pdf","comment":"Paper accepted at NOMS'25 (9 pages, 6 Figures)"},{"id":"http://arxiv.org/abs/2501.02721v3","updated":"2025-01-08T07:31:13Z","published":"2025-01-06T02:25:48Z","title":"Learning Stochastic Nonlinear Dynamics with Embedded Latent Transfer\n Operators","summary":" We consider an operator-based latent Markov representation of a stochastic\nnonlinear dynamical system, where the stochastic evolution of the latent state\nembedded in a reproducing kernel Hilbert space is described with the\ncorresponding transfer operator, and develop a spectral method to learn this\nrepresentation based on the theory of stochastic realization. The embedding may\nbe learned simultaneously using reproducing kernels, for example, constructed\nwith feed-forward neural networks. We also address the generalization of\nsequential state-estimation (Kalman filtering) in stochastic nonlinear systems,\nand of operator-based eigen-mode decomposition of dynamics, for the\nrepresentation. Several examples with synthetic and real-world data are shown\nto illustrate the empirical characteristics of our methods, and to investigate\nthe performance of our model in sequential state-estimation and mode\ndecomposition.\n","authors":["Naichang Ke","Ryogo Tanaka","Yoshinobu Kawahara"],"pdf_url":"https://arxiv.org/pdf/2501.02721v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.05412v4","updated":"2025-01-08T07:29:55Z","published":"2023-06-08T17:56:46Z","title":"Decoupled Prioritized Resampling for Offline RL","summary":" Offline reinforcement learning (RL) is challenged by the distributional shift\nproblem. 
To address this problem, existing works mainly focus on designing\nsophisticated policy constraints between the learned policy and the behavior\npolicy. However, these constraints are applied equally to well-performing and\ninferior actions through uniform sampling, which might negatively affect the\nlearned policy. To alleviate this issue, we propose Offline Prioritized\nExperience Replay (OPER), featuring a class of priority functions designed to\nprioritize highly-rewarding transitions, making them more frequently visited\nduring training. Through theoretical analysis, we show that this class of\npriority functions induces an improved behavior policy, and when constrained to\nthis improved policy, a policy-constrained offline RL algorithm is likely to\nyield a better solution. We develop two practical strategies to obtain priority\nweights by estimating advantages based on a fitted value network (OPER-A) or\nutilizing trajectory returns (OPER-R) for quick computation. OPER is a\nplug-and-play component for offline RL algorithms. As case studies, we evaluate\nOPER on five different algorithms, including BC, TD3+BC, Onestep RL, CQL, and\nIQL. Extensive experiments demonstrate that both OPER-A and OPER-R\nsignificantly improve the performance for all baseline methods. Codes and\npriority weights are available at https://github.com/sail-sg/OPER.\n","authors":["Yang Yue","Bingyi Kang","Xiao Ma","Qisen Yang","Gao Huang","Shiji Song","Shuicheng Yan"],"pdf_url":"https://arxiv.org/pdf/2306.05412v4.pdf","comment":"published on IEEE TNNLS"},{"id":"http://arxiv.org/abs/2411.07464v2","updated":"2025-01-08T07:25:55Z","published":"2024-11-12T00:57:30Z","title":"BudgetMLAgent: A Cost-Effective LLM Multi-Agent system for Automating\n Machine Learning Tasks","summary":" Large Language Models (LLMs) excel in diverse applications including\ngeneration of code snippets, but often struggle with generating code for\ncomplex Machine Learning (ML) tasks. 
Although existing LLM single-agent based\nsystems give varying performance depending on the task complexity, they purely\nrely on larger and expensive models such as GPT-4. Our investigation reveals\nthat no-cost and low-cost models such as Gemini-Pro, Mixtral and CodeLlama\nperform far worse than GPT-4 in a single-agent setting. With the motivation of\ndeveloping a cost-efficient LLM based solution for solving ML tasks, we propose\nan LLM Multi-Agent based system which leverages combination of experts using\nprofiling, efficient retrieval of past observations, LLM cascades, and\nask-the-expert calls. Through empirical analysis on ML engineering tasks in the\nMLAgentBench benchmark, we demonstrate the effectiveness of our system, using\nno-cost models, namely Gemini as the base LLM, paired with GPT-4 in cascade and\nexpert to serve occasional ask-the-expert calls for planning. With 94.2\\%\nreduction in the cost (from \\$0.931 per run cost averaged over all tasks for\nGPT-4 single agent system to \\$0.054), our system is able to yield better\naverage success rate of 32.95\\% as compared to GPT-4 single-agent system\nyielding 22.72\\% success rate averaged over all the tasks of MLAgentBench.\n","authors":["Shubham Gandhi","Manasi Patwardhan","Lovekesh Vig","Gautam Shroff"],"pdf_url":"https://arxiv.org/pdf/2411.07464v2.pdf","comment":"Presented at AIMLSystems '24"},{"id":"http://arxiv.org/abs/2407.16040v2","updated":"2025-01-08T07:21:15Z","published":"2024-07-22T20:34:00Z","title":"Generalizing Teacher Networks for Effective Knowledge Distillation\n Across Student Architectures","summary":" Knowledge distillation (KD) is a model compression method that entails\ntraining a compact student model to emulate the performance of a more complex\nteacher model. However, the architectural capacity gap between the two models\nlimits the effectiveness of knowledge transfer. 
Addressing this issue, previous\nworks focused on customizing teacher-student pairs to improve compatibility, a\ncomputationally expensive process that needs to be repeated every time either\nmodel changes. Hence, these methods are impractical when a teacher model has to\nbe compressed into different student models for deployment on multiple hardware\ndevices with distinct resource constraints. In this work, we propose Generic\nTeacher Network (GTN), a one-off KD-aware training to create a generic teacher\ncapable of effectively transferring knowledge to any student model sampled from\na given finite pool of architectures. To this end, we represent the student\npool as a weight-sharing supernet and condition our generic teacher to align\nwith the capacities of various student architectures sampled from this\nsupernet. Experimental evaluation shows that our method both improves overall\nKD effectiveness and amortizes the minimal additional training cost of the\ngeneric teacher across students in the pool.\n","authors":["Kuluhan Binici","Weiming Wu","Tulika Mitra"],"pdf_url":"https://arxiv.org/pdf/2407.16040v2.pdf","comment":"British Machine Vision Conference (BMVC 24)"},{"id":"http://arxiv.org/abs/2408.12545v2","updated":"2025-01-08T07:20:32Z","published":"2024-08-22T16:59:32Z","title":"Dynamics of Meta-learning Representation in the Teacher-student Scenario","summary":" Gradient-based meta-learning algorithms have gained popularity for their\nability to train models on new tasks using limited data. Empirical observations\nindicate that such algorithms are able to learn a shared representation across\ntasks, which is regarded as a key factor in their success. However, the\nin-depth theoretical understanding of the learning dynamics and the origin of\nthe shared representation remains underdeveloped. In this work, we investigate\nthe meta-learning dynamics of nonlinear two-layer neural networks trained on\nstreaming tasks in the teacher-student scenario. 
Through the lens of\nstatistical physics analysis, we characterize the macroscopic behavior of the\nmeta-training processes, the formation of the shared representation, and the\ngeneralization ability of the model on new tasks. The analysis also points to\nthe importance of the choice of certain hyperparameters of the learning\nalgorithms.\n","authors":["Hui Wang","Cho Tung Yip","Bo Li"],"pdf_url":"https://arxiv.org/pdf/2408.12545v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04315v1","updated":"2025-01-08T07:13:52Z","published":"2025-01-08T07:13:52Z","title":"RoRA: Efficient Fine-Tuning of LLM with Reliability Optimization for\n Rank Adaptation","summary":" Fine-tuning helps large language models (LLM) recover degraded information\nand enhance task performance. Although Low-Rank Adaptation (LoRA) is widely used\nand effective for fine-tuning, we have observed that its scaling factor can\nlimit or even reduce performance as the rank size increases. To address this\nissue, we propose RoRA (Rank-adaptive Reliability Optimization), a simple yet\neffective method for optimizing LoRA's scaling factor. By replacing $\\alpha/r$\nwith $\\alpha/\\sqrt{r}$, RoRA ensures improved performance as rank size\nincreases. Moreover, RoRA enhances low-rank adaptation in fine-tuning\nuncompressed models and excels in the more challenging task of accuracy\nrecovery when fine-tuning pruned models. Extensive experiments demonstrate the\neffectiveness of RoRA in fine-tuning both uncompressed and pruned models. RoRA\nsurpasses the state-of-the-art (SOTA) in average accuracy and robustness on\nLLaMA-7B/13B, LLaMA2-7B, and LLaMA3-8B, specifically outperforming LoRA and\nDoRA by 6.5% and 2.9% on LLaMA-7B, respectively. 
In pruned model fine-tuning,\nRoRA shows significant advantages; for SHEARED-LLAMA-1.3, a LLaMA-7B with 81.4%\npruning, RoRA achieves 5.7% higher average accuracy than LoRA and 3.9% higher\nthan DoRA.\n","authors":["Jun Liu","Zhenglun Kong","Peiyan Dong","Xuan Shen","Pu Zhao","Hao Tang","Geng Yuan","Wei Niu","Wenbin Zhang","Xue Lin","Dong Huang","Yanzhi Wang"],"pdf_url":"https://arxiv.org/pdf/2501.04315v1.pdf","comment":"ICASSP 2025"},{"id":"http://arxiv.org/abs/2501.04308v1","updated":"2025-01-08T06:53:21Z","published":"2025-01-08T06:53:21Z","title":"FSC-loss: A Frequency-domain Structure Consistency Learning Approach for\n Signal Data Recovery and Reconstruction","summary":" A core challenge for signal data recovery is to model the distribution of\nsignal matrix (SM) data based on measured low-quality data in biomedical\nengineering of magnetic particle imaging (MPI). For acquiring the\nhigh-resolution (high-quality) SM, the number of meticulous measurements at\nnumerous positions in the field-of-view proves time-consuming (measurement of a\n37x37x37 SM takes about 32 hours). To improve reconstructed signal quality and\nshorten SM measurement time, existing methods explore to generating\nhigh-resolution SM based on time-saving measured low-resolution SM (a 9x9x9 SM\njust takes about 0.5 hours). However, previous methods show poor performance\nfor high-frequency signal recovery in SM. To achieve a high-resolution SM\nrecovery and shorten its acquisition time, we propose a frequency-domain\nstructure consistency loss function and data component embedding strategy to\nmodel global and local structural information of SM. We adopt a\ntransformer-based network to evaluate this function and the strategy. We\nevaluate our methods and state-of-the-art (SOTA) methods on the two simulation\ndatasets and four public measured SMs in Open MPI Data. The results show that\nour method outperforms the SOTA methods in high-frequency structural signal\nrecovery. 
Additionally, our method can recover a high-resolution SM with clear\nhigh-frequency structure based on a down-sampling factor of 16 in less than 15\nseconds, which accelerates acquisition by over 60 times compared to the\nmeasurement-based HR SM with the minimum error (nRMSE=0.041). Moreover, our\nmethod is applied in our three in-house MPI systems, and boosts their\nperformance for signal reconstruction.\n","authors":["Liwen Zhang","Zhaoji Miao","Fan Yang","Gen Shi","Jie He","Yu An","Hui Hui","Jie Tian"],"pdf_url":"https://arxiv.org/pdf/2501.04308v1.pdf","comment":"11 pages,7 figures"},{"id":"http://arxiv.org/abs/2404.01714v4","updated":"2025-01-08T06:52:07Z","published":"2024-04-02T07:57:17Z","title":"Conjugate-Gradient-like Based Adaptive Moment Estimation Optimization\n Algorithm for Deep Learning","summary":" Training deep neural networks is a challenging task. In order to speed up\ntraining and enhance the performance of deep neural networks, we rectify the\nvanilla conjugate gradient as conjugate-gradient-like and incorporate it into\nthe generic Adam, and thus propose a new optimization algorithm named\nCG-like-Adam for deep learning. Specifically, both the first-order and the\nsecond-order moment estimation of generic Adam are replaced by the\nconjugate-gradient-like. Convergence analysis handles the cases where the\nexponential moving average coefficient of the first-order moment estimation is\nconstant and the first-order moment estimation is unbiased. 
Numerical\nexperiments show the superiority of the proposed algorithm based on the\nCIFAR10/100 dataset.\n","authors":["Jiawu Tian","Liwei Xu","Xiaowei Zhang","Yongqi Li"],"pdf_url":"https://arxiv.org/pdf/2404.01714v4.pdf","comment":"32 pages, 13 figures"},{"id":"http://arxiv.org/abs/2407.08974v2","updated":"2025-01-08T06:42:39Z","published":"2024-07-12T04:04:54Z","title":"Topology-enhanced machine learning model (Top-ML) for anticancer peptide\n prediction","summary":" Recently, therapeutic peptides have demonstrated great promise for cancer\ntreatment. To explore powerful anticancer peptides, artificial intelligence\n(AI)-based approaches have been developed to systematically screen potential\ncandidates. However, the lack of efficient featurization of peptides has become\na bottleneck for these machine-learning models. In this paper, we propose a\ntopology-enhanced machine learning model (Top-ML) for anticancer peptides\nprediction. Our Top-ML employs peptide topological features derived from its\nsequence \"connection\" information characterized by vector and spectral\ndescriptors. Our Top-ML model, employing an Extra-Trees classifier, has been\nvalidated on the AntiCP 2.0 and mACPpred 2.0 benchmark datasets, achieving\nstate-of-the-art performance or results comparable to existing deep learning\nmodels, while providing greater interpretability. 
Our results highlight the\npotential of leveraging novel topology-based featurization to accelerate the\nidentification of anticancer peptides.\n","authors":["Joshua Zhi En Tan","JunJie Wee","Xue Gong","Kelin Xia"],"pdf_url":"https://arxiv.org/pdf/2407.08974v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.08128v4","updated":"2025-01-08T06:35:45Z","published":"2024-12-11T06:31:06Z","title":"Why Does Dropping Edges Usually Outperform Adding Edges in Graph\n Contrastive Learning?","summary":" Graph contrastive learning (GCL) has been widely used as an effective\nself-supervised learning method for graph representation learning. However, how\nto apply adequate and stable graph augmentation to generating proper views for\ncontrastive learning remains an essential problem. Dropping edges is a primary\naugmentation in GCL while adding edges is not a common method due to its\nunstable performance. To our best knowledge, there is no theoretical analysis\nto study why dropping edges usually outperforms adding edges. To answer this\nquestion, we introduce a new metric, namely Error Passing Rate (EPR), to\nquantify how a graph fits the network. Inspired by the theoretical conclusions\nand the idea of positive-incentive noise, we propose a novel GCL algorithm,\nError-PAssing-based Graph Contrastive Learning (EPAGCL), which uses both edge\nadding and edge dropping as its augmentations. To be specific, we generate\nviews by adding and dropping edges based on the weights derived from EPR.\nExtensive experiments on various real-world datasets are conducted to validate\nthe correctness of our theoretical analysis and the effectiveness of our\nproposed algorithm. 
Our code is available at:\nhttps://github.com/hyzhang98/EPAGCL.\n","authors":["Yanchen Xu","Siqi Huang","Hongyuan Zhang","Xuelong Li"],"pdf_url":"https://arxiv.org/pdf/2412.08128v4.pdf","comment":"Accepted by AAAI 2025"},{"id":"http://arxiv.org/abs/2501.04305v1","updated":"2025-01-08T06:34:32Z","published":"2025-01-08T06:34:32Z","title":"Physics-Informed Super-Resolution Diffusion for 6D Phase Space\n Diagnostics","summary":" Adaptive physics-informed super-resolution diffusion is developed for\nnon-invasive virtual diagnostics of the 6D phase space density of charged\nparticle beams. An adaptive variational autoencoder (VAE) embeds initial beam\ncondition images and scalar measurements to a low-dimensional latent space from\nwhich a 326 pixel 6D tensor representation of the beam's 6D phase space density\nis generated. Projecting from a 6D tensor generates physically consistent 2D\nprojections. Physics-guided super-resolution diffusion transforms\nlow-resolution images of the 6D density to high resolution 256x256 pixel\nimages. Un-supervised adaptive latent space tuning enables tracking of\ntime-varying beams without knowledge of time-varying initial conditions. The\nmethod is demonstrated with experimental data and multi-particle simulations at\nthe HiRES UED. The general approach is applicable to a wide range of complex\ndynamic systems evolving in high-dimensional phase space. The method is shown\nto be robust to distribution shift without re-training.\n","authors":["Alexander Scheinker"],"pdf_url":"https://arxiv.org/pdf/2501.04305v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04304v1","updated":"2025-01-08T06:30:31Z","published":"2025-01-08T06:30:31Z","title":"DGQ: Distribution-Aware Group Quantization for Text-to-Image Diffusion\n Models","summary":" Despite the widespread use of text-to-image diffusion models across various\ntasks, their computational and memory demands limit practical applications. 
To\nmitigate this issue, quantization of diffusion models has been explored. It\nreduces memory usage and computational costs by compressing weights and\nactivations into lower-bit formats. However, existing methods often struggle to\npreserve both image quality and text-image alignment, particularly in\nlower-bit($<$ 8bits) quantization. In this paper, we analyze the challenges\nassociated with quantizing text-to-image diffusion models from a distributional\nperspective. Our analysis reveals that activation outliers play a crucial role\nin determining image quality. Additionally, we identify distinctive patterns in\ncross-attention scores, which significantly affect text-image alignment. To\naddress these challenges, we propose Distribution-aware Group Quantization\n(DGQ), a method that identifies and adaptively handles pixel-wise and\nchannel-wise outliers to preserve image quality. Furthermore, DGQ applies\nprompt-specific logarithmic quantization scales to maintain text-image\nalignment. Our method demonstrates remarkable performance on datasets such as\nMS-COCO and PartiPrompts. 
We are the first to successfully achieve low-bit\nquantization of text-to-image diffusion models without requiring additional\nfine-tuning of weight quantization parameters.\n","authors":["Hyogon Ryu","NaHyeon Park","Hyunjung Shim"],"pdf_url":"https://arxiv.org/pdf/2501.04304v1.pdf","comment":"Project page: https://ugonfor.kr/DGQ"},{"id":"http://arxiv.org/abs/2501.04300v1","updated":"2025-01-08T06:18:32Z","published":"2025-01-08T06:18:32Z","title":"Handling Incomplete Heterogeneous Data using a Data-Dependent Kernel","summary":" Handling incomplete data in real-world applications is a critical challenge\ndue to two key limitations of existing methods: (i) they are primarily designed\nfor numeric data and struggle with categorical or heterogeneous/mixed datasets;\n(ii) they assume that data is missing completely at random, which is often not\nthe case in practice -- in reality, data is missing in patterns, leading to\nbiased results if these patterns are not accounted for. To address these two\nlimitations, this paper presents a novel approach to handling missing values\nusing the Probability Mass Similarity Kernel (PMK), a data-dependent kernel,\nwhich does not make any assumptions about data types and missing mechanisms. It\neliminates the need for prior knowledge or extensive pre-processing steps and\ninstead leverages the distribution of observed data. Our method unifies the\nrepresentation of diverse data types by capturing more meaningful pairwise\nsimilarities and enhancing downstream performance. We evaluated our approach\nacross over 10 datasets with numerical-only, categorical-only, and mixed\nfeatures under different missing mechanisms and rates. 
Across both\nclassification and clustering tasks, our approach consistently outperformed\nexisting techniques, demonstrating its robustness and effectiveness in managing\nincomplete heterogeneous data.\n","authors":["Youran Zhou","Mohamed Reda Bouadjenek","Jonathan Wells","Sunil Aryal"],"pdf_url":"https://arxiv.org/pdf/2501.04300v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04299v1","updated":"2025-01-08T06:07:33Z","published":"2025-01-08T06:07:33Z","title":"Circuit Complexity Bounds for Visual Autoregressive Model","summary":" Understanding the expressive ability of a specific model is essential for\ngrasping its capacity limitations. Recently, several studies have established\ncircuit complexity bounds for Transformer architecture. Besides, the Visual\nAutoRegressive (VAR) model has risen to be a prominent method in the field of\nimage generation, outperforming previous techniques, such as Diffusion\nTransformers, in generating high-quality images. We investigate the circuit\ncomplexity of the VAR model and establish a bound in this study. Our primary\nresult demonstrates that the VAR model is equivalent to a simulation by a\nuniform $\\mathsf{TC}^0$ threshold circuit with hidden dimension $d \\leq O(n)$\nand $\\mathrm{poly}(n)$ precision. This is the first study to rigorously\nhighlight the limitations in the expressive power of VAR models despite their\nimpressive performance. 
We believe our findings will offer valuable insights\ninto the inherent constraints of these models and guide the development of more\nefficient and expressive architectures in the future.\n","authors":["Yekun Ke","Xiaoyu Li","Yingyu Liang","Zhenmei Shi","Zhao Song"],"pdf_url":"https://arxiv.org/pdf/2501.04299v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.17462v4","updated":"2025-01-08T05:36:30Z","published":"2024-05-23T07:20:45Z","title":"Ferrari: Federated Feature Unlearning via Optimizing Feature Sensitivity","summary":" The advent of Federated Learning (FL) highlights the practical necessity for\nthe right to be forgotten for all clients, allowing them to request data\ndeletion from the machine learning models service provider. This necessity has\nspurred a growing demand for Federated Unlearning (FU). Feature unlearning has\ngained considerable attention due to its applications in unlearning sensitive,\nbackdoor, and biased features. Existing methods employ the influence function\nto achieve feature unlearning, which is impractical for FL as it necessitates\nthe participation of other clients, if not all, in the unlearning process.\nFurthermore, current research lacks an evaluation of the effectiveness of\nfeature unlearning. To address these limitations, we define feature sensitivity\nin evaluating feature unlearning according to Lipschitz continuity. This metric\ncharacterizes the model outputs rate of change or sensitivity to perturbations\nin the input feature. We then propose an effective federated feature unlearning\nframework called Ferrari, which minimizes feature sensitivity. Extensive\nexperimental results and theoretical analysis demonstrate the effectiveness of\nFerrari across various feature unlearning scenarios, including sensitive,\nbackdoor, and biased features. 
The code is publicly available at\nhttps://github.com/OngWinKent/Federated-Feature-Unlearning\n","authors":["Hanlin Gu","Win Kent Ong","Chee Seng Chan","Lixin Fan"],"pdf_url":"https://arxiv.org/pdf/2405.17462v4.pdf","comment":"TLDR: The need for a \"right to be forgotten\" in Federated Learning\n has led to the development of the Ferrari framework, which efficiently\n unlearns sensitive features using a Lipschitz continuity-based metric, proven\n effective in extensive testing. Accepted at NeurIPS 2024"},{"id":"http://arxiv.org/abs/2501.04292v1","updated":"2025-01-08T05:32:55Z","published":"2025-01-08T05:32:55Z","title":"MAD-UV: The 1st INTERSPEECH Mice Autism Detection via Ultrasound\n Vocalization Challenge","summary":" The Mice Autism Detection via Ultrasound Vocalization (MAD-UV) Challenge\nintroduces the first INTERSPEECH challenge focused on detecting autism spectrum\ndisorder (ASD) in mice through their vocalizations. Participants are tasked\nwith developing models to automatically classify mice as either wild-type or\nASD models based on recordings with a high sampling rate. Our baseline system\nemploys a simple CNN-based classification using three different spectrogram\nfeatures. Results demonstrate the feasibility of automated ASD detection, with\nthe considered audible-range features achieving the best performance (UAR of\n0.600 for segment-level and 0.625 for subject-level classification). This\nchallenge bridges speech technology and biomedical research, offering\nopportunities to advance our understanding of ASD models through machine\nlearning approaches. The findings suggest promising directions for vocalization\nanalysis and highlight the potential value of audible and ultrasound\nvocalizations in ASD detection.\n","authors":["Zijiang Yang","Meishu Song","Xin Jing","Haojie Zhang","Kun Qian","Bin Hu","Kota Tamada","Toru Takumi","Björn W. 
Schuller","Yoshiharu Yamamoto"],"pdf_url":"https://arxiv.org/pdf/2501.04292v1.pdf","comment":"5 pages, 1 figure and 2 tables. For MAD-UV Challenge 2025"},{"id":"http://arxiv.org/abs/2501.04288v1","updated":"2025-01-08T05:27:16Z","published":"2025-01-08T05:27:16Z","title":"An Analysis of Model Robustness across Concurrent Distribution Shifts","summary":" Machine learning models, meticulously optimized for source data, often fail\nto predict target data when faced with distribution shifts (DSs). Previous\nbenchmarking studies, though extensive, have mainly focused on simple DSs.\nRecognizing that DSs often occur in more complex forms in real-world scenarios,\nwe broadened our study to include multiple concurrent shifts, such as unseen\ndomain shifts combined with spurious correlations. We evaluated 26 algorithms\nthat range from simple heuristic augmentations to zero-shot inference using\nfoundation models, across 168 source-target pairs from eight datasets. Our\nanalysis of over 100K models reveals that (i) concurrent DSs typically worsen\nperformance compared to a single shift, with certain exceptions, (ii) if a\nmodel improves generalization for one distribution shift, it tends to be\neffective for others, and (iii) heuristic data augmentations achieve the best\noverall performance on both synthetic and real-world datasets.\n","authors":["Myeongho Jeon","Suhwan Choi","Hyoje Lee","Teresa Yeo"],"pdf_url":"https://arxiv.org/pdf/2501.04288v1.pdf","comment":"Accepted to TMLR"},{"id":"http://arxiv.org/abs/2501.04287v1","updated":"2025-01-08T05:25:14Z","published":"2025-01-08T05:25:14Z","title":"ElasticZO: A Memory-Efficient On-Device Learning with Combined Zeroth-\n and First-Order Optimization","summary":" Zeroth-order (ZO) optimization is being recognized as a simple yet powerful\nalternative to standard backpropagation (BP)-based training. 
Notably, ZO\noptimization allows for training with only forward passes and (almost) the same\nmemory as inference, making it well-suited for edge devices with limited\ncomputing and memory resources. In this paper, we propose ZO-based on-device\nlearning (ODL) methods for full-precision and 8-bit quantized deep neural\nnetworks (DNNs), namely ElasticZO and ElasticZO-INT8. ElasticZO lies in the\nmiddle between pure ZO- and pure BP-based approaches, and is based on the idea\nto employ BP for the last few layers and ZO for the remaining layers.\nElasticZO-INT8 achieves integer arithmetic-only ZO-based training for the first\ntime, by incorporating a novel method for computing quantized ZO gradients from\ninteger cross-entropy loss values. Experimental results on the classification\ndatasets show that ElasticZO effectively addresses the slow convergence of\nvanilla ZO and shrinks the accuracy gap to BP-based training. Compared to\nvanilla ZO, ElasticZO achieves 5.2-9.5% higher accuracy with only 0.072-1.7%\nmemory overhead, and can handle fine-tuning tasks as well as full training.\nElasticZO-INT8 further reduces the memory usage and training time by 1.46-1.60x\nand 1.38-1.42x without compromising the accuracy. These results demonstrate a\nbetter tradeoff between accuracy and training cost compared to pure ZO- and\nBP-based approaches, and also highlight the potential of ZO optimization in\non-device learning.\n","authors":["Keisuke Sugiura","Hiroki Matsutani"],"pdf_url":"https://arxiv.org/pdf/2501.04287v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04286v1","updated":"2025-01-08T05:24:11Z","published":"2025-01-08T05:24:11Z","title":"Mapping the Edge of Chaos: Fractal-Like Boundaries in The Trainability\n of Decoder-Only Transformer Models","summary":" In the realm of fractal geometry, intricate structures emerge from simple\niterative processes that partition parameter spaces into regions of stability\nand instability. 
Likewise, training large language models involves iteratively\napplying update functions, such as Adam, where even slight hyperparameter\nadjustments can shift the training process from convergence to divergence.\nRecent evidence from miniature neural networks suggests that the boundary\nseparating these outcomes displays fractal characteristics [1]. Building on\nthese insights, this study extends them to medium-sized, decoder-only\ntransformer architectures by employing a more consistent convergence measure\nand examining the learning rate hyperparameter landscape for attention and\nfully connected layers. The results show that the trainability frontier is not\na simple threshold; rather, it forms a self-similar yet seemingly random\nstructure at multiple scales, with statistically consistent and repeating\npatterns. Within this landscape, a region of stable convergence is surrounded\nby a complex chaotic border, illustrating the sensitive nature of the\nunderlying training dynamics.\n","authors":["Bahman Torkamandi"],"pdf_url":"https://arxiv.org/pdf/2501.04286v1.pdf","comment":"15 pages"},{"id":"http://arxiv.org/abs/2501.00309v2","updated":"2025-01-08T05:16:25Z","published":"2024-12-31T06:59:35Z","title":"Retrieval-Augmented Generation with Graphs (GraphRAG)","summary":" Retrieval-augmented generation (RAG) is a powerful technique that enhances\ndownstream task execution by retrieving additional information, such as\nknowledge, skills, and tools from external sources. Graph, by its intrinsic\n\"nodes connected by edges\" nature, encodes massive heterogeneous and relational\ninformation, making it a golden resource for RAG in tremendous real-world\napplications. As a result, we have recently witnessed increasing attention on\nequipping RAG with Graph, i.e., GraphRAG. 
However, unlike conventional RAG,\nwhere the retriever, generator, and external data sources can be uniformly\ndesigned in the neural-embedding space, the uniqueness of graph-structured\ndata, such as diverse-formatted and domain-specific relational knowledge, poses\nunique and significant challenges when designing GraphRAG for different\ndomains. Given the broad applicability, the associated design challenges, and\nthe recent surge in GraphRAG, a systematic and up-to-date survey of its key\nconcepts and techniques is urgently desired. Following this motivation, we\npresent a comprehensive and up-to-date survey on GraphRAG. Our survey first\nproposes a holistic GraphRAG framework by defining its key components,\nincluding query processor, retriever, organizer, generator, and data source.\nFurthermore, recognizing that graphs in different domains exhibit distinct\nrelational patterns and require dedicated designs, we review GraphRAG\ntechniques uniquely tailored to each domain. Finally, we discuss research\nchallenges and brainstorm directions to inspire cross-disciplinary\nopportunities. Our survey repository is publicly maintained at\nhttps://github.com/Graph-RAG/GraphRAG/.\n","authors":["Haoyu Han","Yu Wang","Harry Shomer","Kai Guo","Jiayuan Ding","Yongjia Lei","Mahantesh Halappanavar","Ryan A. Rossi","Subhabrata Mukherjee","Xianfeng Tang","Qi He","Zhigang Hua","Bo Long","Tong Zhao","Neil Shah","Amin Javari","Yinglong Xia","Jiliang Tang"],"pdf_url":"https://arxiv.org/pdf/2501.00309v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04281v1","updated":"2025-01-08T05:09:25Z","published":"2025-01-08T05:09:25Z","title":"Cluster & Disperse: a general air conflict resolution heuristic using\n unsupervised learning","summary":" We provide a general and malleable heuristic for the air conflict resolution\nproblem. This heuristic is based on a new neighborhood structure for searching\nthe solution space of trajectories and flight-levels. 
Using unsupervised\nlearning, the core idea of our heuristic is to cluster the conflict points and\ndisperse them in various flight levels. Our first algorithm is called Cluster &\nDisperse and in each iteration it assigns the most problematic flights in each\ncluster to another flight-level. In effect, we shuffle them between the\nflight-levels until we achieve a well-balanced configuration. The Cluster &\nDisperse algorithm then uses any horizontal plane conflict resolution algorithm\nas a subroutine to solve these well-balanced instances. Nevertheless, we\ndevelop a novel algorithm for the horizontal plane based on a similar idea.\nThat is we cluster and disperse the conflict points spatially in the same\nflight level using the gradient descent and a social force. We use a novel\nmaneuver making flights travel on an arc instead of a straight path which is\nbased on the aviation routine of the Radius to Fix legs. Our algorithms can\nhandle a high density of flights within a reasonable computation time. We put\ntheir performance in context with some notable algorithms from the literature.\nBeing a general framework, a particular strength of the Cluster & Disperse is\nits malleability in allowing various constraints regarding the aircraft or the\nenvironment to be integrated with ease. 
This is in contrast to the models for\ninstance based on mixed integer programming.\n","authors":["Mirmojtaba Gharibi","John-Paul Clarke"],"pdf_url":"https://arxiv.org/pdf/2501.04281v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.12335v2","updated":"2025-01-08T04:53:52Z","published":"2024-03-19T00:48:25Z","title":"Temporally-Consistent Koopman Autoencoders for Forecasting Dynamical\n Systems","summary":" Absence of sufficiently high-quality data often poses a key challenge in\ndata-driven modeling of high-dimensional spatio-temporal dynamical systems.\nKoopman Autoencoders (KAEs) harness the expressivity of deep neural networks\n(DNNs), the dimension reduction capabilities of autoencoders, and the spectral\nproperties of the Koopman operator to learn a reduced-order feature space with\nsimpler, linear dynamics. However, the effectiveness of KAEs is hindered by\nlimited and noisy training datasets, leading to poor generalizability. To\naddress this, we introduce the Temporally-Consistent Koopman Autoencoder\n(tcKAE), designed to generate accurate long-term predictions even with limited\nand noisy training data. This is achieved through a consistency regularization\nterm that enforces prediction coherence across different time steps, thus\nenhancing the robustness and generalizability of tcKAE over existing models. 
We\nprovide analytical justification for this approach based on Koopman spectral\ntheory and empirically demonstrate tcKAE's superior performance over\nstate-of-the-art KAE models across a variety of test cases, including simple\npendulum oscillations, kinetic plasma, and fluid flow data.\n","authors":["Indranil Nayak","Ananda Chakrabarty","Mrinal Kumar","Fernando Teixeira","Debdipta Goswami"],"pdf_url":"https://arxiv.org/pdf/2403.12335v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03295v2","updated":"2025-01-08T04:50:01Z","published":"2025-01-06T11:43:29Z","title":"A Soft Sensor Method with Uncertainty-Awareness and Self-Explanation\n Based on Large Language Models Enhanced by Domain Knowledge Retrieval","summary":" Data-driven soft sensors are crucial in predicting key performance indicators\nin industrial systems. However, current methods predominantly rely on the\nsupervised learning paradigms of parameter updating, which inherently faces\nchallenges such as high development costs, poor robustness, training\ninstability, and lack of interpretability. Recently, large language models\n(LLMs) have demonstrated significant potential across various domains, notably\nthrough In-Context Learning (ICL), which enables high-performance task\nexecution with minimal input-label demonstrations and no prior training. This\npaper aims to replace supervised learning with the emerging ICL paradigm for\nsoft sensor modeling to address existing challenges and explore new avenues for\nadvancement. To achieve this, we propose a novel framework called the Few-shot\nUncertainty-aware and self-Explaining Soft Sensor (LLM-FUESS), which includes\nthe Zero-shot Auxiliary Variable Selector (LLM-ZAVS) and the Uncertainty-aware\nFew-shot Soft Sensor (LLM-UFSS). The LLM-ZAVS retrieves from the Industrial\nKnowledge Vector Storage to enhance LLMs' domain-specific knowledge, enabling\nzero-shot auxiliary variable selection. 
In the LLM-UFSS, we utilize text-based\ncontext demonstrations of structured data to prompt LLMs to execute ICL for\npredicting and propose a context sample retrieval augmentation strategy to\nimprove performance. Additionally, we explored LLMs' AIGC and probabilistic\ncharacteristics to propose self-explanation and uncertainty quantification\nmethods for constructing a trustworthy soft sensor. Extensive experiments\ndemonstrate that our method achieved state-of-the-art predictive performance,\nstrong robustness, and flexibility, and effectively mitigates training instability\nfound in traditional methods. To the best of our knowledge, this is the first\nwork to establish a soft sensor utilizing LLMs.\n","authors":["Shuo Tong","Han Liu","Runyuan Guo","Wenqing Wang","Xueqiong Tian","Lingyun Wei","Lin Zhang","Huayong Wu","Ding Liu","Youmin Zhang"],"pdf_url":"https://arxiv.org/pdf/2501.03295v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04272v1","updated":"2025-01-08T04:44:47Z","published":"2025-01-08T04:44:47Z","title":"On weight and variance uncertainty in neural networks for regression\n tasks","summary":" We consider the problem of weight uncertainty proposed by [Blundell et al.\n(2015). Weight uncertainty in neural network. In International conference on\nmachine learning, 1613-1622, PMLR.] in neural networks {(NNs)} specialized for\nregression tasks. {We further} investigate the effect of variance uncertainty\nin {their model}. We show that including the variance uncertainty can improve\nthe prediction performance of the Bayesian {NN}. Variance uncertainty enhances\nthe generalization of the model {by} considering the posterior distribution\nover the variance parameter. { We examine the generalization ability of the\nproposed model using a function approximation} example and {further illustrate\nit with} the riboflavin genetic data set. 
{We explore fully connected dense\nnetworks and dropout NNs with} Gaussian and spike-and-slab priors,\nrespectively, for the network weights.\n","authors":["Moein Monemi","Morteza Amini","S. Mahmoud Taheri","Mohammad Arashi"],"pdf_url":"https://arxiv.org/pdf/2501.04272v1.pdf","comment":"Submitted to journal"}],"Multimedia":[{"id":"http://arxiv.org/abs/2501.04579v1","updated":"2025-01-08T15:48:30Z","published":"2025-01-08T15:48:30Z","title":"Unified Coding for Both Human Perception and Generalized Machine\n Analytics with CLIP Supervision","summary":" The image compression model has long struggled with adaptability and\ngeneralization, as the decoded bitstream typically serves only human or machine\nneeds and fails to preserve information for unseen visual tasks. Therefore,\nthis paper innovatively introduces supervision obtained from multimodal\npre-training models and incorporates adaptive multi-objective optimization\ntailored to support both human visual perception and machine vision\nsimultaneously with a single bitstream, denoted as Unified and Generalized\nImage Coding for Machine (UG-ICM). Specifically, to get rid of the reliance\nbetween compression models with downstream task supervision, we introduce\nContrastive Language-Image Pre-training (CLIP) models into the training\nconstraint for improved generalization. Global-to-instance-wise CLIP\nsupervision is applied to help obtain hierarchical semantics that make models\nmore generalizable for the tasks relying on the information of different\ngranularity. Furthermore, for supporting both human and machine visions with\nonly a unifying bitstream, we incorporate a conditional decoding strategy that\ntakes as conditions human or machine preferences, enabling the bitstream to be\ndecoded into different versions for corresponding preferences. As such, our\nproposed UG-ICM is fully trained in a self-supervised manner, i.e., without\nawareness of any specific downstream models and tasks. 
The extensive\nexperiments have shown that the proposed UG-ICM is capable of achieving\nremarkable improvements in various unseen machine analytics tasks, while\nsimultaneously providing perceptually satisfying images.\n","authors":["Kangsheng Yin","Quan Liu","Xuelin Shen","Yulin He","Wenhan Yang","Shiqi Wang"],"pdf_url":"https://arxiv.org/pdf/2501.04579v1.pdf","comment":"9 pages, 10 figures, published to AAAI 2025"},{"id":"http://arxiv.org/abs/2501.04511v1","updated":"2025-01-08T13:58:07Z","published":"2025-01-08T13:58:07Z","title":"Multichannel Steganography: A Provably Secure Hybrid Steganographic\n Model for Secure Communication","summary":" This study introduces a novel steganographic model that synthesizes\nSteganography by Cover Modification (CMO) and Steganography by Cover Synthesis\n(CSY), enhancing both security and undetectability by generating cover messages\nor parameters while retaining the original cover's form, thus minimizing\ndetection risks and overcoming the limitations of single-method techniques.\nBuilding upon this model, a refined Steganographic Communication Protocol is\nproposed, enhancing resilience against sophisticated threats such as\nMultichannel Replay Attacks and Multichannel Man-in-the-Middle Attacks,\nfortifying the protocol against potential tampering and improving upon prior\nworks. To evaluate the security of the proposed protocol, a novel adversarial\nmodel is developed simulating a probabilistic polynomial time (PPT) adversary\ncapable of intercepting communications across multiple channels. This model\nassesses the adversary's ability to compromise the protocol, providing a\ncomprehensive security analysis. Finally, this study explores the practicality\nand adaptability of the model to both constrained environments like SMS banking\nand resource-rich settings such as blockchain transactions, demonstrating their\npotential to enhance financial services and security. 
These contributions\npresent a robust and adaptable framework for secure steganographic\ncommunication, offering practical solutions for secure communications across\ndiverse environments.\n","authors":["Obinna Omego","Michal Bosy"],"pdf_url":"https://arxiv.org/pdf/2501.04511v1.pdf","comment":"18 pages, 8 figures, 3 algorithms, This version is a preprint\n uploaded to arXiv"},{"id":"http://arxiv.org/abs/2311.07594v3","updated":"2025-01-08T02:33:37Z","published":"2023-11-10T09:51:24Z","title":"How to Bridge the Gap between Modalities: Survey on Multimodal Large\n Language Model","summary":" We explore Multimodal Large Language Models (MLLMs), which integrate LLMs\nlike GPT-4 to handle multimodal data, including text, images, audio, and more.\nMLLMs demonstrate capabilities such as generating image captions and answering\nimage-based questions, bridging the gap towards real-world human-computer\ninteractions and hinting at a potential pathway to artificial general\nintelligence. However, MLLMs still face challenges in addressing the semantic\ngap in multimodal data, which may lead to erroneous outputs, posing potential\nrisks to society. Selecting the appropriate modality alignment method is\ncrucial, as improper methods might require more parameters without significant\nperformance improvements. This paper aims to explore modality alignment methods\nfor LLMs and their current capabilities. 
Implementing effective modality\nalignment can help LLMs address environmental issues and enhance accessibility.\nThe study surveys existing modality alignment methods for MLLMs, categorizing\nthem into four groups: (1) Multimodal Converter, which transforms data into a\nformat that LLMs can understand; (2) Multimodal Perceiver, which improves how\nLLMs perceive different types of data; (3) Tool Learning, which leverages\nexternal tools to convert data into a common format, usually text; and (4)\nData-Driven Method, which teaches LLMs to understand specific data types within\ndatasets.\n","authors":["Shezheng Song","Xiaopeng Li","Shasha Li","Shan Zhao","Jie Yu","Jun Ma","Xiaoguang Mao","Weimin Zhang"],"pdf_url":"https://arxiv.org/pdf/2311.07594v3.pdf","comment":"Accepted by TKDE"},{"id":"http://arxiv.org/abs/2501.04204v1","updated":"2025-01-08T00:52:19Z","published":"2025-01-08T00:52:19Z","title":"LipGen: Viseme-Guided Lip Video Generation for Enhancing Visual Speech\n Recognition","summary":" Visual speech recognition (VSR), commonly known as lip reading, has garnered\nsignificant attention due to its wide-ranging practical applications. The\nadvent of deep learning techniques and advancements in hardware capabilities\nhave significantly enhanced the performance of lip reading models. Despite\nthese advancements, existing datasets predominantly feature stable video\nrecordings with limited variability in lip movements. This limitation results\nin models that are highly sensitive to variations encountered in real-world\nscenarios. To address this issue, we propose a novel framework, LipGen, which\naims to improve model robustness by leveraging speech-driven synthetic visual\ndata, thereby mitigating the constraints of current datasets. Additionally, we\nintroduce an auxiliary task that incorporates viseme classification alongside\nattention mechanisms. 
This approach facilitates the efficient integration of\ntemporal information, directing the model's focus toward the relevant segments\nof speech, thereby enhancing discriminative capabilities. Our method\ndemonstrates superior performance compared to the current state-of-the-art on\nthe lip reading in the wild (LRW) dataset and exhibits even more pronounced\nadvantages under challenging conditions.\n","authors":["Bowen Hao","Dongliang Zhou","Xiaojie Li","Xingyu Zhang","Liang Xie","Jianlong Wu","Erwei Yin"],"pdf_url":"https://arxiv.org/pdf/2501.04204v1.pdf","comment":"This paper has been accepted for presentation at ICASSP 2025"},{"id":"http://arxiv.org/abs/2404.05522v2","updated":"2025-01-08T22:34:12Z","published":"2024-04-08T13:43:19Z","title":"3DMambaIPF: A State Space Model for Iterative Point Cloud Filtering via\n Differentiable Rendering","summary":" Noise is an inevitable aspect of point cloud acquisition, necessitating\nfiltering as a fundamental task within the realm of 3D vision. Existing\nlearning-based filtering methods have shown promising capabilities on\nsmall-scale synthetic or real-world datasets. Nonetheless, the effectiveness of\nthese methods is constrained when dealing with a substantial quantity of point\nclouds. This limitation primarily stems from their limited denoising\ncapabilities for large-scale point clouds and their inclination to generate\nnoisy outliers after denoising. The recent introduction of State Space Models\n(SSMs) for long sequence modeling in Natural Language Processing (NLP) presents\na promising solution for handling large-scale data. Encouraged by iterative\npoint cloud filtering methods, we introduce 3DMambaIPF, firstly incorporating\nMamba (Selective SSM) architecture to sequentially handle extensive point\nclouds from large scenes, capitalizing on its strengths in selective input\nprocessing and long sequence modeling capabilities. 
Additionally, we integrate\na robust and fast differentiable rendering loss to constrain the noisy points\naround the surface. In contrast to previous methodologies, this differentiable\nrendering loss enhances the visual realism of denoised geometric structures and\naligns point cloud boundaries more closely with those observed in real-world\nobjects. Extensive evaluation on datasets comprising small-scale synthetic and\nreal-world models (typically with up to 50K points) demonstrate that our method\nachieves state-of-the-art results. Moreover, we showcase the superior\nscalability and efficiency of our method on large-scale models with about 500K\npoints, where the majority of the existing learning-based denoising methods are\nunable to handle.\n","authors":["Qingyuan Zhou","Weidong Yang","Ben Fei","Jingyi Xu","Rui Zhang","Keyi Liu","Yeqi Luo","Ying He"],"pdf_url":"https://arxiv.org/pdf/2404.05522v2.pdf","comment":"Accepted at AAAI-25"},{"id":"http://arxiv.org/abs/2501.04764v1","updated":"2025-01-08T18:35:48Z","published":"2025-01-08T18:35:48Z","title":"Video Summarisation with Incident and Context Information using\n Generative AI","summary":" The proliferation of video content production has led to vast amounts of\ndata, posing substantial challenges in terms of analysis efficiency and\nresource utilization. Addressing this issue calls for the development of robust\nvideo analysis tools. This paper proposes a novel approach leveraging\nGenerative Artificial Intelligence (GenAI) to facilitate streamlined video\nanalysis. Our tool aims to deliver tailored textual summaries of user-defined\nqueries, offering a focused insight amidst extensive video datasets. Unlike\nconventional frameworks that offer generic summaries or limited action\nrecognition, our method harnesses the power of GenAI to distil relevant\ninformation, enhancing analysis precision and efficiency. 
Employing YOLO-V8 for\nobject detection and Gemini for comprehensive video and text analysis, our\nsolution achieves heightened contextual accuracy. By combining YOLO with\nGemini, our approach furnishes textual summaries extracted from extensive CCTV\nfootage, enabling users to swiftly navigate and verify pertinent events without\nthe need for exhaustive manual review. The quantitative evaluation revealed a\nsimilarity of 72.8%, while the qualitative assessment rated an accuracy of 85%,\ndemonstrating the capability of the proposed method.\n","authors":["Ulindu De Silva","Leon Fernando","Kalinga Bandara","Rashmika Nawaratne"],"pdf_url":"https://arxiv.org/pdf/2501.04764v1.pdf","comment":null}],"Artificial Intelligence":[{"id":"http://arxiv.org/abs/2501.04700v1","updated":"2025-01-08T18:59:36Z","published":"2025-01-08T18:59:36Z","title":"Planarian Neural Networks: Evolutionary Patterns from Basic Bilateria\n Shaping Modern Artificial Neural Network Architectures","summary":" This study examined the viability of enhancing the prediction accuracy of\nartificial neural networks (ANNs) in image classification tasks by developing\nANNs with evolution patterns similar to those of biological neural networks.\nResNet is a widely used family of neural networks with both deep and wide\nvariants; therefore, it was selected as the base model for our investigation.\nThe aim of this study is to improve the image classification performance of\nANNs via a novel approach inspired by the biological nervous system\narchitecture of planarians, which comprises a brain and two nerve cords. We\nbelieve that the unique neural architecture of planarians offers valuable\ninsights into the performance enhancement of ANNs. The proposed planarian\nneural architecture-based neural network was evaluated on the CIFAR-10 and\nCIFAR-100 datasets. Our results indicate that the proposed method exhibits\nhigher prediction accuracy than the baseline neural network models in image\nclassification tasks. 
These findings demonstrate the significant potential of\nbiologically inspired neural network architectures in improving the performance\nof ANNs in a wide range of applications.\n","authors":["Ziyuan Huang","Mark Newman","Maria Vaida","Srikar Bellur","Roozbeh Sadeghian","Andrew Siu","Hui Wang","Kevin Huggins"],"pdf_url":"https://arxiv.org/pdf/2501.04700v1.pdf","comment":"11 pages, 9 figures"},{"id":"http://arxiv.org/abs/2501.04697v1","updated":"2025-01-08T18:58:48Z","published":"2025-01-08T18:58:48Z","title":"Grokking at the Edge of Numerical Stability","summary":" Grokking, the sudden generalization that occurs after prolonged overfitting,\nis a surprising phenomenon challenging our understanding of deep learning.\nAlthough significant progress has been made in understanding grokking, the\nreasons behind the delayed generalization and its dependence on regularization\nremain unclear. In this work, we argue that without regularization, grokking\ntasks push models to the edge of numerical stability, introducing floating\npoint errors in the Softmax function, which we refer to as Softmax Collapse\n(SC). We demonstrate that SC prevents grokking and that mitigating SC enables\ngrokking without regularization. Investigating the root cause of SC, we find\nthat beyond the point of overfitting, the gradients strongly align with what we\ncall the na\\\"ive loss minimization (NLM) direction. This component of the\ngradient does not alter the model's predictions but decreases the loss by\nscaling the logits, typically by scaling the weights along their current\ndirection. We show that this scaling of the logits explains the delay in\ngeneralization characteristic of grokking and eventually leads to SC, halting\nfurther learning. 
To validate our hypotheses, we introduce two key\ncontributions that address the challenges in grokking tasks: StableMax, a new\nactivation function that prevents SC and enables grokking without\nregularization, and $\\perp$Grad, a training algorithm that promotes quick\ngeneralization in grokking tasks by preventing NLM altogether. These\ncontributions provide new insights into grokking, elucidating its delayed\ngeneralization, reliance on regularization, and the effectiveness of existing\ngrokking-inducing methods. Code for this paper is available at\nhttps://github.com/LucasPrietoAl/grokking-at-the-edge-of-numerical-stability.\n","authors":["Lucas Prieto","Melih Barsbey","Pedro A. M. Mediano","Tolga Birdal"],"pdf_url":"https://arxiv.org/pdf/2501.04697v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04694v1","updated":"2025-01-08T18:58:15Z","published":"2025-01-08T18:58:15Z","title":"EpiCoder: Encompassing Diversity and Complexity in Code Generation","summary":" Effective instruction tuning is indispensable for optimizing code LLMs,\naligning model behavior with user expectations and enhancing model performance\nin real-world applications. However, most existing methods focus on code\nsnippets, which are limited to specific functionalities and rigid structures,\nrestricting the complexity and diversity of the synthesized data. To address\nthese limitations, we introduce a novel feature tree-based synthesis framework\ninspired by Abstract Syntax Trees (AST). Unlike AST, which captures syntactic\nstructure of code, our framework models semantic relationships between code\nelements, enabling the generation of more nuanced and diverse data. The feature\ntree is constructed from raw data and refined iteratively to increase the\nquantity and diversity of the extracted features. This process enables the\nidentification of more complex patterns and relationships within the code. 
By\nsampling subtrees with controlled depth and breadth, our framework allows\nprecise adjustments to the complexity of the generated code, supporting a wide\nrange of tasks from simple function-level operations to intricate multi-file\nscenarios. We fine-tuned widely-used base models to create the EpiCoder series,\nachieving state-of-the-art performance at both the function and file levels\nacross multiple benchmarks. Notably, empirical evidence indicates that our\napproach shows significant potential in synthesizing highly complex\nrepository-level code data. Further analysis elucidates the merits of this\napproach by rigorously assessing data complexity and diversity through software\nengineering principles and the LLM-as-a-judge method.\n","authors":["Yaoxiang Wang","Haoling Li","Xin Zhang","Jie Wu","Xiao Liu","Wenxiang Hu","Zhongxin Guo","Yangyu Huang","Ying Xin","Yujiu Yang","Jinsong Su","Qi Chen","Scarlett Li"],"pdf_url":"https://arxiv.org/pdf/2501.04694v1.pdf","comment":"40 pages, 11 figures"},{"id":"http://arxiv.org/abs/2501.04693v1","updated":"2025-01-08T18:57:33Z","published":"2025-01-08T18:57:33Z","title":"Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous\n Sensors via Language Grounding","summary":" Interacting with the world is a multi-sensory experience: achieving effective\ngeneral-purpose interaction requires making use of all available modalities --\nincluding vision, touch, and audio -- to fill in gaps from partial observation.\nFor example, when vision is occluded while reaching into a bag, a robot should rely\non its senses of touch and sound. However, state-of-the-art generalist robot\npolicies are typically trained on large datasets to predict robot actions\nsolely from visual and proprioceptive observations. 
In this work, we propose\nFuSe, a novel approach that enables finetuning visuomotor generalist policies\non heterogeneous sensor modalities for which large datasets are not readily\navailable by leveraging natural language as a common cross-modal grounding. We\ncombine a multimodal contrastive loss with a sensory-grounded language\ngeneration loss to encode high-level semantics. In the context of robot\nmanipulation, we show that FuSe enables performing challenging tasks that\nrequire reasoning jointly over modalities such as vision, touch, and sound in a\nzero-shot setting, such as multimodal prompting, compositional cross-modal\nprompting, and descriptions of objects it interacts with. We show that the same\nrecipe is applicable to widely different generalist policies, including both\ndiffusion-based generalist policies and large vision-language-action (VLA)\nmodels. Extensive experiments in the real world show that FuSe is able to\nincrease success rates by over 20% compared to all considered baselines.\n","authors":["Joshua Jones","Oier Mees","Carmelo Sferrazza","Kyle Stachowicz","Pieter Abbeel","Sergey Levine"],"pdf_url":"https://arxiv.org/pdf/2501.04693v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04686v1","updated":"2025-01-08T18:49:41Z","published":"2025-01-08T18:49:41Z","title":"URSA: Understanding and Verifying Chain-of-thought Reasoning in\n Multimodal Mathematics","summary":" Chain-of-thought (CoT) reasoning has been widely applied in the mathematical\nreasoning of Large Language Models (LLMs). Recently, the introduction of\nderivative process supervision on CoT trajectories has sparked discussions on\nenhancing scaling capabilities during test time, thereby boosting the potential\nof these models. However, in multimodal mathematical reasoning, the scarcity of\nhigh-quality CoT training data has hindered existing models from achieving\nhigh-precision CoT reasoning and has limited the realization of reasoning\npotential during test time. 
In this work, we propose a three-module synthesis\nstrategy that integrates CoT distillation, trajectory-format rewriting, and\nformat unification. It results in a high-quality CoT reasoning instruction\nfine-tuning dataset in multimodal mathematics, MMathCoT-1M. We comprehensively\nvalidate the state-of-the-art (SOTA) performance of the trained URSA-7B model\non multiple multimodal mathematical benchmarks. For test-time scaling, we\nintroduce a data synthesis strategy that automatically generates process\nannotation datasets, known as DualMath-1.1M, focusing on both interpretation\nand logic. By further training URSA-7B on DualMath-1.1M, we transition from CoT\nreasoning capabilities to robust supervision abilities. The trained URSA-RM-7B\nacts as a verifier, effectively enhancing the performance of URSA-7B at test\ntime. URSA-RM-7B also demonstrates excellent out-of-distribution (OOD)\nverifying capabilities, showcasing its generalization. Model weights, training\ndata and code will be open-sourced.\n","authors":["Ruilin Luo","Zhuofan Zheng","Yifan Wang","Yiyao Yu","Xinzhe Ni","Zicheng Lin","Jin Zeng","Yujiu Yang"],"pdf_url":"https://arxiv.org/pdf/2501.04686v1.pdf","comment":"27 pages, 10 tables, 17 figures. The training data has been released.\n The code and model are currently undergoing internal review. They will be\n made available soon. Project url: https://ursa-math.github.io"},{"id":"http://arxiv.org/abs/2501.04682v1","updated":"2025-01-08T18:42:48Z","published":"2025-01-08T18:42:48Z","title":"Towards System 2 Reasoning in LLMs: Learning How to Think With Meta\n Chain-of-Though","summary":" We propose a novel framework, Meta Chain-of-Thought (Meta-CoT), which extends\ntraditional Chain-of-Thought (CoT) by explicitly modeling the underlying\nreasoning required to arrive at a particular CoT. 
We present empirical evidence\nfrom state-of-the-art models exhibiting behaviors consistent with in-context\nsearch, and explore methods for producing Meta-CoT via process supervision,\nsynthetic data generation, and search algorithms. We then outline a\nconcrete pipeline for training a model to produce Meta-CoTs, incorporating\ninstruction tuning with linearized search traces and reinforcement learning\npost-training. Finally, we discuss open research questions, including scaling\nlaws, verifier roles, and the potential for discovering novel reasoning\nalgorithms. This work provides a theoretical and practical roadmap to enable\nMeta-CoT in LLMs, paving the way for more powerful and human-like reasoning in\nartificial intelligence.\n","authors":["Violet Xiang","Charlie Snell","Kanishk Gandhi","Alon Albalak","Anikait Singh","Chase Blagden","Duy Phung","Rafael Rafailov","Nathan Lile","Dakota Mahan","Louis Castricato","Jan-Philipp Franken","Nick Haber","Chelsea Finn"],"pdf_url":"https://arxiv.org/pdf/2501.04682v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04675v1","updated":"2025-01-08T18:33:17Z","published":"2025-01-08T18:33:17Z","title":"Enhancing Financial VQA in Vision Language Models using Intermediate\n Structured Representations","summary":" Chart interpretation is crucial for visual data analysis, but accurately\nextracting information from charts poses significant challenges for automated\nmodels. This study investigates the fine-tuning of DEPLOT, a modality\nconversion module that translates the image of a plot or chart to a linearized\ntable, on a custom dataset of 50,000 bar charts. The dataset comprises simple,\nstacked, and grouped bar charts, targeting the unique structural features of\nthese visualizations. 
The finetuned DEPLOT model is evaluated against its base\nversion using a test set of 1,000 images and two metrics: Relative Mapping\nSimilarity (RMS), which measures categorical mapping accuracy, and Relative\nNumber Set Similarity (RNSS), which evaluates numerical interpretation\naccuracy. To further explore the reasoning capabilities of large language\nmodels (LLMs), we curate an additional set of 100 bar chart images paired with\nquestion answer sets. Our findings demonstrate that providing a structured\nintermediate table alongside the image significantly enhances LLM reasoning\nperformance compared to direct image queries.\n","authors":["Archita Srivastava","Abhas Kumar","Rajesh Kumar","Prabhakar Srinivasan"],"pdf_url":"https://arxiv.org/pdf/2501.04675v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.02788v2","updated":"2025-01-08T18:33:07Z","published":"2025-01-06T06:07:40Z","title":"GLoG-CSUnet: Enhancing Vision Transformers with Adaptable Radiomic\n Features for Medical Image Segmentation","summary":" Vision Transformers (ViTs) have shown promise in medical image semantic\nsegmentation (MISS) by capturing long-range correlations. However, ViTs often\nstruggle to model local spatial information effectively, which is essential for\naccurately segmenting fine anatomical details, particularly when applied to\nsmall datasets without extensive pre-training. We introduce Gabor and Laplacian\nof Gaussian Convolutional Swin Network (GLoG-CSUnet), a novel architecture\nenhancing Transformer-based models by incorporating learnable radiomic\nfeatures. This approach integrates dynamically adaptive Gabor and Laplacian of\nGaussian (LoG) filters to capture texture, edge, and boundary information,\nenhancing the feature representation processed by the Transformer model. Our\nmethod uniquely combines the long-range dependency modeling of Transformers\nwith the texture analysis capabilities of Gabor and LoG features. 
Evaluated on\nthe Synapse multi-organ and ACDC cardiac segmentation datasets, GLoG-CSUnet\ndemonstrates significant improvements over state-of-the-art models, achieving a\n1.14% increase in Dice score for Synapse and 0.99% for ACDC, with minimal\ncomputational overhead (only 15 and 30 additional parameters, respectively).\nGLoG-CSUnet's flexible design allows integration with various base models,\noffering a promising approach for incorporating radiomics-inspired feature\nextraction in Transformer architectures for medical image analysis. The code\nimplementation is available on GitHub at: https://github.com/HAAIL/GLoG-CSUnet.\n","authors":["Niloufar Eghbali","Hassan Bagher-Ebadian","Tuka Alhanai","Mohammad M. Ghassemi"],"pdf_url":"https://arxiv.org/pdf/2501.02788v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04671v1","updated":"2025-01-08T18:31:16Z","published":"2025-01-08T18:31:16Z","title":"DRIVINGVQA: Analyzing Visual Chain-of-Thought Reasoning of Vision\n Language Models in Real-World Scenarios with Driving Theory Tests","summary":" Large vision-language models (LVLMs) augment language models with visual\nunderstanding, enabling multimodal reasoning. However, due to the modality gap\nbetween textual and visual data, they often face significant challenges, such\nas over-reliance on text priors, hallucinations, and limited capacity for\ncomplex visual reasoning. Existing benchmarks to evaluate visual reasoning in\nLVLMs often rely on schematic or synthetic images and on imprecise\nmachine-generated explanations. To bridge the modality gap, we present\nDrivingVQA, a new benchmark derived from driving theory tests to evaluate\nvisual chain-of-thought reasoning in complex real-world scenarios. It offers\n3,931 expert-crafted multiple-choice problems and interleaved explanations\ngrounded with entities relevant to the reasoning process. We leverage this\ndataset to perform an extensive study of LVLMs' ability to reason about complex\nvisual scenarios. 
Our experiments reveal that open-source and proprietary LVLMs\nstruggle with visual chain-of-thought reasoning under zero-shot settings. We\ninvestigate training strategies that leverage relevant entities to improve\nvisual reasoning. Notably, we observe a performance boost of up to 7\\% when\nreasoning over image tokens of cropped regions tied to these entities.\n","authors":["Charles Corbière","Simon Roburin","Syrielle Montariol","Antoine Bosselut","Alexandre Alahi"],"pdf_url":"https://arxiv.org/pdf/2501.04671v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01348v2","updated":"2025-01-08T18:20:46Z","published":"2024-12-02T10:19:36Z","title":"Hierarchical Object-Oriented POMDP Planning for Object Rearrangement","summary":" We present an online planning framework for solving multi-object\nrearrangement problems in partially observable, multi-room environments.\nCurrent object rearrangement solutions, primarily based on Reinforcement\nLearning or hand-coded planning methods, often lack adaptability to diverse\nchallenges. To address this limitation, we introduce a novel Hierarchical\nObject-Oriented Partially Observed Markov Decision Process (HOO-POMDP) planning\napproach. This approach comprises (a) an object-oriented POMDP planner\ngenerating sub-goals, (b) a set of low-level policies for sub-goal achievement,\nand (c) an abstraction system converting the continuous low-level world into a\nrepresentation suitable for abstract planning. We evaluate our system on\nvarying numbers of objects, rooms, and problem types in AI2-THOR simulated\nenvironments with promising results.\n","authors":["Rajesh Mangannavar","Alan Fern","Prasad Tadepalli"],"pdf_url":"https://arxiv.org/pdf/2412.01348v2.pdf","comment":"17 pages, 2 Figures. Preprint. 
Updated acknowledgments"},{"id":"http://arxiv.org/abs/2501.04661v1","updated":"2025-01-08T18:15:10Z","published":"2025-01-08T18:15:10Z","title":"Assessing Language Comprehension in Large Language Models Using\n Construction Grammar","summary":" Large Language Models, despite their significant capabilities, are known to\nfail in surprising and unpredictable ways. Evaluating their true\n`understanding' of language is particularly challenging due to the extensive\nweb-scale data they are trained on. Therefore, we construct an evaluation to\nsystematically assess natural language understanding (NLU) in LLMs by\nleveraging Construction Grammar (CxG), which provides insights into the meaning\ncaptured by linguistic elements known as constructions (Cxns). CxG is\nwell-suited for this purpose because it provides a theoretical basis to construct\ntargeted evaluation sets. These datasets are carefully constructed to include\nexamples which are unlikely to appear in pre-training data, yet intuitive and\neasy for humans to understand, enabling a more targeted and reliable\nassessment. Our experiments focus on downstream natural language inference and\nreasoning tasks by comparing LLMs' understanding of the underlying meanings\ncommunicated through 8 unique Cxns with that of humans. The results show that\nwhile LLMs demonstrate some knowledge of constructional information, even the\nlatest models including GPT-o1 struggle with abstract meanings conveyed by\nthese Cxns, as demonstrated in cases where test sentences are dissimilar to\ntheir pre-training data. We argue that such cases provide a more accurate test\nof true language understanding, highlighting key limitations in LLMs' semantic\ncapabilities. 
We make our novel dataset and associated experimental data,\nincluding prompts and model responses, publicly available.\n","authors":["Wesley Scivetti","Melissa Torgbi","Austin Blodgett","Mollie Shichman","Taylor Hudson","Claire Bonial","Harish Tayyar Madabushi"],"pdf_url":"https://arxiv.org/pdf/2501.04661v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.02832v3","updated":"2025-01-08T17:46:40Z","published":"2025-01-06T08:16:06Z","title":"Samba-ASR: State-Of-The-Art Speech Recognition Leveraging Structured\n State-Space Models","summary":" We propose Samba ASR, the first state-of-the-art Automatic Speech\nRecognition (ASR) model leveraging the novel Mamba architecture as both encoder\nand decoder, built on the foundation of state space models (SSMs). Unlike\ntransformer-based ASR models, which rely on self-attention mechanisms to\ncapture dependencies, Samba ASR effectively models both local and global\ntemporal dependencies using efficient state-space dynamics, achieving\nremarkable performance gains. By addressing the limitations of transformers,\nsuch as quadratic scaling with input length and difficulty in handling\nlong-range dependencies, Samba ASR achieves superior accuracy and efficiency.\nExperimental results demonstrate that Samba ASR surpasses existing open-source\ntransformer-based ASR models across various standard benchmarks, establishing\nit as the new state of the art in ASR. Extensive evaluations on the benchmark\ndataset show significant improvements in Word Error Rate (WER), with\ncompetitive performance even in low-resource scenarios. Furthermore, the\ninherent computational efficiency and parameter optimization of the Mamba\narchitecture make Samba ASR a scalable and robust solution for diverse ASR\ntasks. Our contributions include the development of a new Samba ASR\narchitecture for automatic speech recognition (ASR), demonstrating the\nsuperiority of structured state-space models (SSMs) over transformer-based\nmodels for speech sequence processing. We provide a
comprehensive evaluation on public\nbenchmarks, showcasing state-of-the-art (SOTA) performance, and present an\nin-depth analysis of computational efficiency, robustness to noise, and\nsequence generalization. This work highlights the viability of Mamba SSMs as a\ntransformer-free alternative for efficient and accurate ASR. By leveraging the\nadvancements of state-space modeling, Samba ASR redefines ASR performance\nstandards and sets a new benchmark for future research in this field.\n","authors":["Syed Abdul Gaffar Shakhadri","Kruthika KR","Kartik Basavaraj Angadi"],"pdf_url":"https://arxiv.org/pdf/2501.02832v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.15861v2","updated":"2025-01-08T17:41:51Z","published":"2024-09-24T08:33:41Z","title":"A Zero-Shot Open-Vocabulary Pipeline for Dialogue Understanding","summary":" Dialogue State Tracking (DST) is crucial for understanding user needs and\nexecuting appropriate system actions in task-oriented dialogues. The majority\nof existing DST methods are designed to work within predefined ontologies and\nassume the availability of gold domain labels, struggling to adapt to new\nslot values. While Large Language Model (LLM)-based systems show promising\nzero-shot DST performance, they either require extensive computational\nresources or they underperform existing fully-trained systems, limiting their\npracticality. To address these limitations, we propose a zero-shot,\nopen-vocabulary system that integrates domain classification and DST in a\nsingle pipeline. Our approach includes reformulating DST as a\nquestion-answering task for less capable models and employing self-refining\nprompts for more adaptable ones. Our system does not rely on fixed slot values\ndefined in the ontology, allowing the system to adapt dynamically. 
We compare\nour approach with existing SOTA, and show that it provides up to 20% better\nJoint Goal Accuracy (JGA) over previous methods on datasets like Multi-WOZ 2.1,\nwith up to 90% fewer requests to the LLM API.\n","authors":["Abdulfattah Safa","Gözde Gül Şahin"],"pdf_url":"https://arxiv.org/pdf/2409.15861v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04635v1","updated":"2025-01-08T17:29:46Z","published":"2025-01-08T17:29:46Z","title":"Knowledge Retrieval Based on Generative AI","summary":" This study develops a question-answering system based on Retrieval-Augmented\nGeneration (RAG) using Chinese Wikipedia and Lawbank as retrieval sources.\nUsing TTQA and TMMLU+ as evaluation datasets, the system employs BGE-M3 for\ndense vector retrieval to obtain highly relevant search results and\nBGE-reranker to reorder these results based on query relevance. The most\npertinent retrieval outcomes serve as reference knowledge for a Large Language\nModel (LLM), enhancing its ability to answer questions and establishing a\nknowledge retrieval system grounded in generative AI.\n The system's effectiveness is assessed through a two-stage evaluation:\nautomatic and assisted performance evaluations. The automatic evaluation\ncalculates accuracy by comparing the model's auto-generated labels with ground\ntruth answers, measuring performance under standardized conditions without\nhuman intervention. The assisted performance evaluation involves 20\nfinance-related multiple-choice questions answered by 20 participants without\nfinancial backgrounds. Initially, participants answer independently. 
Later,\nthey receive system-generated reference information to assist in answering,\nexamining whether the system improves accuracy when assistance is provided.\n The main contributions of this research are: (1) Enhanced LLM Capability: By\nintegrating BGE-M3 and BGE-reranker, the system retrieves and reorders highly\nrelevant results, reduces hallucinations, and dynamically accesses authorized\nor public knowledge sources. (2) Improved Data Privacy: A customized RAG\narchitecture enables local operation of the LLM, eliminating the need to send\nprivate data to external servers. This approach enhances data security, reduces\nreliance on commercial services, lowers operational costs, and mitigates\nprivacy risks.\n","authors":["Te-Lun Yang","Jyi-Shane Liu","Yuen-Hsien Tseng","Jyh-Shing Roger Jang"],"pdf_url":"https://arxiv.org/pdf/2501.04635v1.pdf","comment":"8 pages, 13 figures, 1 table"},{"id":"http://arxiv.org/abs/2501.04614v1","updated":"2025-01-08T16:53:56Z","published":"2025-01-08T16:53:56Z","title":"MedCoDi-M: A Multi-Prompt Foundation Model for Multimodal Medical Data\n Generation","summary":" Artificial Intelligence is revolutionizing medical practice, enhancing\ndiagnostic accuracy and healthcare delivery. However, its adaptation in medical\nsettings still faces significant challenges, related to data availability and\nprivacy constraints. Synthetic data has emerged as a promising solution to\nmitigate these issues, addressing data scarcity while preserving privacy.\nRecently, Latent Diffusion Models have emerged as a powerful tool for\ngenerating high-quality synthetic data. Meanwhile, the integration of different\nmodalities has gained interest, emphasizing the need for models capable of\nhandling multimodal medical data. Existing approaches struggle to integrate\ncomplementary information and lack the ability to generate modalities\nsimultaneously. 
To address this challenge, we present MedCoDi-M, a\n6.77-billion-parameter model, designed for multimodal medical data generation,\nthat, following the Foundation Model paradigm, exploits contrastive learning and\na large quantity of data to build a shared latent space which captures the\nrelationships between different data modalities. Further, we introduce the\nMulti-Prompt training technique, which significantly boosts MedCoDi-M's\ngeneration under different settings. We extensively validate MedCoDi-M: first,\nwe benchmark it against five competitors on the MIMIC-CXR dataset, a\nstate-of-the-art dataset for Chest X-ray and radiological report generation.\nSecondly, we perform a Visual Turing Test with expert radiologists to assess\nthe realism and clinical relevance of the generated data, ensuring alignment\nwith real-world scenarios. Finally, we assess the utility of MedCoDi-M in\naddressing key challenges in the medical field, such as anonymization, data\nscarcity and imbalanced learning. The results are promising, demonstrating the\napplicability of MedCoDi-M in medical contexts. Project page is at\nhttps://cosbidev.github.io/MedCoDi-M/.\n","authors":["Daniele Molino","Francesco Di Feola","Eliodoro Faiella","Deborah Fazzini","Domiziana Santucci","Linlin Shen","Valerio Guarrasi","Paolo Soda"],"pdf_url":"https://arxiv.org/pdf/2501.04614v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.15856v2","updated":"2025-01-08T16:31:06Z","published":"2024-01-29T03:07:04Z","title":"The Indoor-Training Effect: unexpected gains from distribution shifts in\n the transition function","summary":" Is it better to perform tennis training in a pristine indoor environment or a\nnoisy outdoor one? To model this problem, here we investigate whether shifts in\nthe transition probabilities between the training and testing environments in\nreinforcement learning problems can lead to better performance under certain\nconditions. 
We generate new Markov Decision Processes (MDPs) starting from a\ngiven MDP, by adding quantifiable, parametric noise into the transition\nfunction. We refer to this process as Noise Injection and the resulting\nenvironments as {\\delta}-environments. This process allows us to create\nvariations of the same environment with quantitative control over noise serving\nas a metric of distance between environments. Conventional wisdom suggests that\ntraining and testing on the same MDP should yield the best results. In stark\ncontrast, we observe that agents can perform better when trained on the\nnoise-free environment and tested on the noisy {\\delta}-environments, compared\nto training and testing on the same {\\delta}-environments. We confirm that this\nfinding extends beyond noise variations: it is possible to showcase the same\nphenomenon in ATARI game variations including varying Ghost behaviour in\nPacMan, and Paddle behaviour in Pong. We demonstrate this intriguing behaviour\nacross 60 different variations of ATARI games, including PacMan, Pong, and\nBreakout. We refer to this phenomenon as the Indoor-Training Effect. Code to\nreproduce our experiments and to implement Noise Injection can be found at\nhttps://bit.ly/3X6CTYk.\n","authors":["Serena Bono","Spandan Madan","Ishaan Grover","Mao Yasueda","Cynthia Breazeal","Hanspeter Pfister","Gabriel Kreiman"],"pdf_url":"https://arxiv.org/pdf/2401.15856v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.06855v3","updated":"2025-01-08T16:26:44Z","published":"2024-12-08T20:23:48Z","title":"Incentivized Symbiosis: A Paradigm for Human-Agent Coevolution","summary":" Cooperation is vital to our survival and progress. Evolutionary game theory\noffers a lens to understand the structures and incentives that enable\ncooperation to be a successful strategy. As artificial intelligence agents\nbecome integral to human systems, the dynamics of cooperation take on\nunprecedented significance. 
The convergence of human-agent teaming, contract\ntheory, and decentralized frameworks like Web3, grounded in transparency,\naccountability, and trust, offers a foundation for fostering cooperation by\nestablishing enforceable rules and incentives for humans and AI agents. We\nconceptualize Incentivized Symbiosis as a social contract between humans and\nAI, inspired by Web3 principles and encoded in blockchain technology, to define\nand enforce rules, incentives, and consequences for both parties. By exploring\nthis paradigm, we aim to catalyze new research at the intersection of systems\nthinking in AI, Web3, and society, fostering innovative pathways for\ncooperative human-agent coevolution.\n","authors":["Tomer Jordi Chaffer","Justin Goldston","Gemach D. A. T. A. I"],"pdf_url":"https://arxiv.org/pdf/2412.06855v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04588v1","updated":"2025-01-08T16:06:39Z","published":"2025-01-08T16:06:39Z","title":"Federated-Continual Dynamic Segmentation of Histopathology guided by\n Barlow Continuity","summary":" Federated- and Continual Learning have been established as approaches to\nenable privacy-aware learning on continuously changing data, as required for\ndeploying AI systems in histopathology images. However, data shifts can occur\nin a dynamic world, spatially between institutions and temporally, due to\nchanging data over time. This leads to two issues: Client Drift, where the\ncentral model degrades from aggregating data from clients trained on shifted\ndata, and Catastrophic Forgetting, from temporal shifts such as changes in\npatient populations. Both tend to degrade the model's performance on previously\nseen data or on spatially distributed training. Despite both problems arising from\nthe same underlying cause, data shifts, existing research addresses them\nonly individually. 
In this work, we introduce a method that can jointly\nalleviate Client Drift and Catastrophic Forgetting by using our proposed\nDynamic Barlow Continuity, which evaluates client updates on a public reference\ndataset and uses this to guide the training process to a spatially and\ntemporally shift-invariant model. We evaluate our approach on the\nhistopathology datasets BCSS and Semicol and show our method to be highly\neffective, jointly improving the Dice score from 15.8% to 71.6% under\nClient Drift and from 42.5% to 62.8% under Catastrophic Forgetting. This enables\nDynamic Learning by establishing spatio-temporal shift-invariance.\n","authors":["Niklas Babendererde","Haozhe Zhu","Moritz Fuchs","Jonathan Stieber","Anirban Mukhopadhyay"],"pdf_url":"https://arxiv.org/pdf/2501.04588v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04577v1","updated":"2025-01-08T15:47:04Z","published":"2025-01-08T15:47:04Z","title":"A 65 nm Bayesian Neural Network Accelerator with 360 fJ/Sample In-Word\n GRNG for AI Uncertainty Estimation","summary":" Uncertainty estimation is an indispensable capability for AI-enabled,\nsafety-critical applications, e.g. autonomous vehicles or medical diagnosis.\nBayesian neural networks (BNNs) use Bayesian statistics to provide both\nclassification predictions and uncertainty estimation, but they suffer from\nhigh computational overhead associated with random number generation and\nrepeated sample iterations. Furthermore, BNNs are not immediately amenable to\nacceleration through compute-in-memory architectures due to the frequent memory\nwrites necessary after each RNG operation. To address these challenges, we\npresent an ASIC that integrates 360 fJ/Sample Gaussian RNG directly into the\nSRAM memory words. This integration reduces RNG overhead and enables\nfully-parallel compute-in-memory operations for BNNs. 
The prototype chip\nachieves 5.12 GSa/s RNG throughput and 102 GOp/s neural network throughput\nwhile occupying 0.45 mm2, bringing AI uncertainty estimation to edge\ncomputation.\n","authors":["Zephan M. Enciso","Boyang Cheng","Likai Pei","Jianbo Liu","Steven Davis","Ningyuan Cao","Michael Niemier"],"pdf_url":"https://arxiv.org/pdf/2501.04577v1.pdf","comment":"7 pages, 12 figures"},{"id":"http://arxiv.org/abs/2501.04575v1","updated":"2025-01-08T15:45:21Z","published":"2025-01-08T15:45:21Z","title":"InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning\n and Reflection","summary":" Graphical User Interface (GUI) Agents, powered by multimodal large language\nmodels (MLLMs), have shown great potential for task automation on computing\ndevices such as computers and mobile phones. However, existing agents face\nchallenges in multi-step reasoning and reliance on textual annotations,\nlimiting their effectiveness. We introduce \\textit{InfiGUIAgent}, an MLLM-based\nGUI Agent trained with a two-stage supervised fine-tuning pipeline. Stage 1\nenhances fundamental skills such as GUI understanding and grounding, while\nStage 2 integrates hierarchical reasoning and expectation-reflection reasoning\nskills using synthesized data to enable native reasoning abilities of the\nagents. \\textit{InfiGUIAgent} achieves competitive performance on several GUI\nbenchmarks, highlighting the impact of native reasoning skills in enhancing GUI\ninteraction for automation tasks. 
Resources are available at\n\\url{https://github.com/Reallm-Labs/InfiGUIAgent}.\n","authors":["Yuhang Liu","Pengxiang Li","Zishu Wei","Congkai Xie","Xueyu Hu","Xinchen Xu","Shengyu Zhang","Xiaotian Han","Hongxia Yang","Fei Wu"],"pdf_url":"https://arxiv.org/pdf/2501.04575v1.pdf","comment":"14 pages, 7 figures, work in progress"},{"id":"http://arxiv.org/abs/2409.10589v2","updated":"2025-01-08T15:41:04Z","published":"2024-09-16T15:18:10Z","title":"Offline Reinforcement Learning for Learning to Dispatch for Job Shop\n Scheduling","summary":" The Job Shop Scheduling Problem (JSSP) is a complex combinatorial\noptimization problem. While online Reinforcement Learning (RL) has shown\npromise by quickly finding acceptable solutions for JSSP, it faces key\nlimitations: it requires extensive training interactions from scratch leading\nto sample inefficiency, cannot leverage existing high-quality solutions, and\noften yields suboptimal results compared to traditional methods like Constraint\nProgramming (CP). We introduce Offline Reinforcement Learning for Learning to\nDispatch (Offline-LD), which addresses these limitations by learning from\npreviously generated solutions. Our approach is motivated by scenarios where\nhistorical scheduling data and expert solutions are available, although our\ncurrent evaluation focuses on benchmark problems. Offline-LD adapts two\nCQL-based Q-learning methods (mQRDQN and discrete mSAC) for maskable action\nspaces, introduces a novel entropy bonus modification for discrete SAC, and\nexploits reward normalization through preprocessing. Our experiments\ndemonstrate that Offline-LD outperforms online RL on both generated and\nbenchmark instances. 
Notably, by introducing noise into the expert dataset, we\nachieve similar or better results than those obtained from the expert dataset,\nsuggesting that a more diverse training set is preferable because it contains\ncounterfactual information.\n","authors":["Jesse van Remmerden","Zaharah Bukhsh","Yingqian Zhang"],"pdf_url":"https://arxiv.org/pdf/2409.10589v2.pdf","comment":"Code available at https://github.com/jesserem/Offline-LD"},{"id":"http://arxiv.org/abs/2501.04568v1","updated":"2025-01-08T15:32:12Z","published":"2025-01-08T15:32:12Z","title":"Supervision-free Vision-Language Alignment","summary":" Vision-language models (VLMs) have demonstrated remarkable potential in\nintegrating visual and linguistic information, but their performance is often\nconstrained by the need for extensive, high-quality image-text training data.\nCuration of these image-text pairs is both time-consuming and computationally\nexpensive. To address this challenge, we introduce SVP (Supervision-free Visual\nProjection), a novel framework that enhances vision-language alignment without\nrelying on curated data or preference annotation. SVP leverages self-captioning\nand a pre-trained grounding model as a feedback mechanism to elicit latent\ninformation in VLMs. We evaluate our approach across six key areas: captioning,\nreferring, visual question answering, multitasking, hallucination control, and\nobject recall. Results demonstrate significant improvements, including a 14%\naverage improvement in captioning tasks, up to 12% increase in object recall,\nand substantial reduction in hallucination rates. 
Notably, a small VLM using\nSVP achieves hallucination reductions comparable to a model five times larger,\nwhile a VLM with initially poor referring capabilities more than doubles its\nperformance, approaching parity with a model twice its size.\n","authors":["Giorgio Giannone","Ruoteng Li","Qianli Feng","Evgeny Perevodchikov","Rui Chen","Aleix Martinez"],"pdf_url":"https://arxiv.org/pdf/2501.04568v1.pdf","comment":"Preprint"},{"id":"http://arxiv.org/abs/2406.06184v2","updated":"2025-01-08T15:28:11Z","published":"2024-06-10T11:28:25Z","title":"Deep Multi-Objective Reinforcement Learning for Utility-Based\n Infrastructural Maintenance Optimization","summary":" In this paper, we introduce Multi-Objective Deep Centralized Multi-Agent\nActor-Critic (MO-DCMAC), a multi-objective reinforcement learning (MORL)\nmethod for infrastructural maintenance optimization, an area traditionally\ndominated by single-objective reinforcement learning (RL) approaches. Previous\nsingle-objective RL methods combine multiple objectives, such as probability of\ncollapse and cost, into a singular reward signal through reward-shaping. In\ncontrast, MO-DCMAC can optimize a policy for multiple objectives directly, even\nwhen the utility function is non-linear. We evaluated MO-DCMAC using two\nutility functions, which use probability of collapse and cost as input. The\nfirst utility function is the Threshold utility, in which MO-DCMAC should\nminimize cost so that the probability of collapse is never above the threshold.\nThe second is based on the Failure Mode, Effects, and Criticality Analysis\n(FMECA) methodology used by asset managers to assess maintenance plans. We\nevaluated MO-DCMAC, with both utility functions, in multiple maintenance\nenvironments, including ones based on a case study of the historical quay walls\nof Amsterdam. The performance of MO-DCMAC was compared against multiple\nrule-based policies based on heuristics currently used for constructing\nmaintenance plans. 
Our results demonstrate that MO-DCMAC outperforms\ntraditional rule-based policies across various environments and utility\nfunctions.\n","authors":["Jesse van Remmerden","Maurice Kenter","Diederik M. Roijers","Charalampos Andriotis","Yingqian Zhang","Zaharah Bukhsh"],"pdf_url":"https://arxiv.org/pdf/2406.06184v2.pdf","comment":"Accepted in the Neural Computing and Applications: Topical Collection\n on Multi-Objective Decision Making 2023 (MODeM 2023)"},{"id":"http://arxiv.org/abs/2402.18205v4","updated":"2025-01-08T14:21:46Z","published":"2024-02-28T09:51:55Z","title":"Lemur: Log Parsing with Entropy Sampling and Chain-of-Thought Merging","summary":" Logs produced by extensive software systems are integral to monitoring system\nbehaviors. Advanced log analysis facilitates the detection, alerting, and\ndiagnosis of system faults. Log parsing, which entails transforming raw log\nmessages into structured templates, constitutes a critical phase in the\nautomation of log analytics. Existing log parsers fail to identify the correct\ntemplates due to reliance on human-made rules. Besides, these methods focus on\nstatistical features while ignoring semantic information in log messages. To\naddress these challenges, we introduce a cutting-edge \\textbf{L}og parsing\nframework with \\textbf{E}ntropy sampling and Chain-of-Thought \\textbf{M}erging\n(Lemur). Specifically, to discard tedious manual rules, we propose a novel\nsampling method inspired by information entropy, which efficiently clusters\ntypical logs. Furthermore, to enhance the merging of log templates, we design a\nchain-of-thought method for large language models (LLMs). LLMs exhibit\nexceptional semantic comprehension, deftly distinguishing between parameters\nand invariant tokens. We have conducted experiments on large-scale public\ndatasets. Extensive evaluation demonstrates that Lemur achieves\nstate-of-the-art performance and impressive efficiency. 
The Code is available\nat https://github.com/zwpride/lemur.\n","authors":["Wei Zhang","Hongcheng Guo","Anjie Le","Jian Yang","Jiaheng Liu","Zhoujun Li"],"pdf_url":"https://arxiv.org/pdf/2402.18205v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04628v2","updated":"2025-01-08T15:00:39Z","published":"2024-12-05T21:50:22Z","title":"SWEPO: Simultaneous Weighted Preference Optimization for Group\n Contrastive Alignment","summary":" We introduce Simultaneous Weighted Preference Optimization (SWEPO), a novel\nextension of Direct Preference Optimization (DPO) designed to accommodate\nmultiple dynamically chosen positive and negative responses for each query.\nSWEPO employs a weighted group contrastive loss, assigning weights to responses\nbased on their deviation from the mean reward score. This approach effectively\nprioritizes responses that are significantly better or worse than the average,\nenhancing optimization. Our theoretical analysis demonstrates that\nsimultaneously considering multiple preferences reduces alignment bias,\nresulting in more robust alignment. Additionally, we provide insights into the\ntraining dynamics of our loss function and a related function, InfoNCA.\nEmpirical validation on the UltraFeedback dataset establishes SWEPO as\nstate-of-the-art, with superior performance in downstream evaluations using the\nAlpacaEval dataset.\n","authors":["Taneesh Gupta","Rahul Madhavan","Xuchao Zhang","Chetan Bansal","Saravan Rajmohan"],"pdf_url":"https://arxiv.org/pdf/2412.04628v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.02654v2","updated":"2025-01-08T14:53:41Z","published":"2025-01-05T20:39:52Z","title":"Tougher Text, Smarter Models: Raising the Bar for Adversarial Defence\n Benchmarks","summary":" Recent advancements in natural language processing have highlighted the\nvulnerability of deep learning models to adversarial attacks. 
While various\ndefence mechanisms have been proposed, there is a lack of comprehensive\nbenchmarks that evaluate these defences across diverse datasets, models, and\ntasks. In this work, we address this gap by presenting an extensive benchmark\nfor textual adversarial defence that significantly expands upon previous work.\nOur benchmark incorporates a wide range of datasets, evaluates state-of-the-art\ndefence mechanisms, and extends the assessment to include critical tasks such\nas single-sentence classification, similarity and paraphrase identification,\nnatural language inference, and commonsense reasoning. This work not only\nserves as a valuable resource for researchers and practitioners in the field of\nadversarial robustness but also identifies key areas for future research in\ntextual adversarial defence. By establishing a new standard for benchmarking in\nthis domain, we aim to accelerate progress towards more robust and reliable\nnatural language processing systems.\n","authors":["Yang Wang","Chenghua Lin"],"pdf_url":"https://arxiv.org/pdf/2501.02654v2.pdf","comment":"Will be presented as an oral in-person presentation at the conference\n of COLING 2025"},{"id":"http://arxiv.org/abs/2501.04541v1","updated":"2025-01-08T14:44:40Z","published":"2025-01-08T14:44:40Z","title":"Cyber-Physical Steganography in Robotic Motion Control","summary":" Steganography, the art of information hiding, has continually evolved across\nvisual, auditory and linguistic domains, adapting to the ceaseless interplay\nbetween steganographic concealment and steganalytic revelation. This study\nseeks to extend the horizons of what constitutes a viable steganographic medium\nby introducing a steganographic paradigm in robotic motion control. 
Based on\nthe observation of the robot's inherent sensitivity to changes in its\nenvironment, we propose a methodology to encode messages as environmental\nstimuli influencing the motions of the robotic agent and to decode messages\nfrom the resulting motion trajectory. The constraints of maximal robot\nintegrity and minimal motion deviation are established as fundamental\nprinciples underlying secrecy. As a proof of concept, we conduct experiments in\nsimulated environments across various manipulation tasks, incorporating robotic\nembodiments equipped with generalist multimodal policies.\n","authors":["Ching-Chun Chang","Yijie Lin","Isao Echizen"],"pdf_url":"https://arxiv.org/pdf/2501.04541v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.02334v2","updated":"2025-01-08T14:42:05Z","published":"2024-04-26T15:02:39Z","title":"Rad4XCNN: a new agnostic method for post-hoc global explanation of\n CNN-derived features by means of radiomics","summary":" In recent years, machine learning-based clinical decision support systems\n(CDSS) have played a key role in the analysis of several medical conditions.\nDespite their promising capabilities, the lack of transparency in AI models\nposes significant challenges, particularly in medical contexts where\nreliability is a mandatory aspect. However, it appears that explainability is\ninversely proportional to accuracy. For this reason, achieving transparency\nwithout compromising predictive accuracy remains a key challenge. 
This paper\npresents a novel method, namely Rad4XCNN, to enhance the predictive power of\nCNN-derived features with the inherent interpretability of radiomic features.\nRad4XCNN diverges from conventional methods based on saliency maps, by\nassociating intelligible meaning to CNN-derived features by means of Radiomics,\noffering new perspectives on explanation methods beyond visualization maps.\nUsing a breast cancer classification task as a case study, we evaluated\nRad4XCNN on ultrasound imaging datasets, including an online dataset and two\nin-house datasets for internal and external validation. Some key results are:\ni) CNN-derived features guarantee more robust accuracy when compared against\nViT-derived and radiomic features; ii) conventional visualization map methods\nfor explanation present several pitfalls; iii) Rad4XCNN does not sacrifice\nmodel accuracy for their explainability; iv) Rad4XCNN provides a global\nexplanation enabling the physician to extract global insights and findings. Our\nmethod can mitigate some concerns related to the explainability-accuracy\ntrade-off. This study highlighted the importance of proposing new methods for\nmodel explanation without affecting their accuracy.\n","authors":["Francesco Prinzi","Carmelo Militello","Calogero Zarcaro","Tommaso Vincenzo Bartolotta","Salvatore Gaglio","Salvatore Vitabile"],"pdf_url":"https://arxiv.org/pdf/2405.02334v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.00599v2","updated":"2025-01-08T14:38:30Z","published":"2024-12-31T18:56:46Z","title":"VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with\n Video LLM","summary":" Video Large Language Models (Video LLMs) have recently exhibited remarkable\ncapabilities in general video understanding. However, they mainly focus on\nholistic comprehension and struggle with capturing fine-grained spatial and\ntemporal details. 
Besides, the lack of high-quality object-level video\ninstruction data and a comprehensive benchmark further hinders their\nadvancements. To tackle these challenges, we introduce the VideoRefer Suite to\nempower Video LLM for finer-level spatial-temporal video understanding, i.e.,\nenabling perception and reasoning on any objects throughout the video.\nSpecifically, we thoroughly develop VideoRefer Suite across three essential\naspects: dataset, model, and benchmark. Firstly, we introduce a multi-agent\ndata engine to meticulously curate a large-scale, high-quality object-level\nvideo instruction dataset, termed VideoRefer-700K. Next, we present the\nVideoRefer model, which equips a versatile spatial-temporal object encoder to\ncapture precise regional and sequential representations. Finally, we\nmeticulously create a VideoRefer-Bench to comprehensively assess the\nspatial-temporal understanding capability of a Video LLM, evaluating it across\nvarious aspects. Extensive experiments and analyses demonstrate that our\nVideoRefer model not only achieves promising performance on video referring\nbenchmarks but also facilitates general video understanding capabilities.\n","authors":["Yuqian Yuan","Hang Zhang","Wentong Li","Zesen Cheng","Boqiang Zhang","Long Li","Xin Li","Deli Zhao","Wenqiao Zhang","Yueting Zhuang","Jianke Zhu","Lidong Bing"],"pdf_url":"https://arxiv.org/pdf/2501.00599v2.pdf","comment":"17 pages, 14 figures, technical report"},{"id":"http://arxiv.org/abs/2409.14457v2","updated":"2025-01-08T14:29:44Z","published":"2024-09-22T14:09:49Z","title":"Large Model Based Agents: State-of-the-Art, Cooperation Paradigms,\n Security and Privacy, and Future Trends","summary":" With the rapid advancement of large models (LMs), the development of\ngeneral-purpose intelligent agents powered by LMs has become a reality. 
It is\nforeseeable that in the near future, LM-driven general AI agents will serve as\nessential tools in production tasks, capable of autonomous communication and\ncollaboration without human intervention. This paper investigates scenarios\ninvolving the autonomous collaboration of future LM agents. We review the\ncurrent state of LM agents, the key technologies enabling LM agent\ncollaboration, and the security and privacy challenges they face during\ncooperative operations. To this end, we first explore the foundational\nprinciples of LM agents, including their general architecture, key components,\nenabling technologies, and modern applications. We then discuss practical\ncollaboration paradigms from data, computation, and knowledge perspectives to\nachieve connected intelligence among LM agents. After that, we analyze the\nsecurity vulnerabilities and privacy risks associated with LM agents,\nparticularly in multi-agent settings, examining underlying mechanisms and\nreviewing current and potential countermeasures. Lastly, we propose future\nresearch directions for building robust and secure LM agent ecosystems.\n","authors":["Yuntao Wang","Yanghe Pan","Zhou Su","Yi Deng","Quan Zhao","Linkang Du","Tom H. Luan","Jiawen Kang","Dusit Niyato"],"pdf_url":"https://arxiv.org/pdf/2409.14457v2.pdf","comment":"40 pages, 31 figures, 8 tables"},{"id":"http://arxiv.org/abs/2501.02156v3","updated":"2025-01-08T14:26:51Z","published":"2025-01-04T01:45:32Z","title":"The Race to Efficiency: A New Perspective on AI Scaling Laws","summary":" As large-scale AI models expand, training becomes costlier and sustaining\nprogress grows harder. Classical scaling laws (e.g., Kaplan et al. (2020),\nHoffmann et al. (2022)) predict training loss from a static compute budget yet\nneglect time and efficiency, prompting the question: how can we balance\nballooning GPU fleets with rapidly improving hardware and algorithms? 
We\nintroduce the relative-loss equation, a time- and efficiency-aware framework\nthat extends classical AI scaling laws. Our model shows that, without ongoing\nefficiency gains, advanced performance could demand millennia of training or\nunrealistically large GPU fleets. However, near-exponential progress remains\nachievable if the \"efficiency-doubling rate\" parallels Moore's Law. By\nformalizing this race to efficiency, we offer a quantitative roadmap for\nbalancing front-loaded GPU investments with incremental improvements across the\nAI stack. Empirical trends suggest that sustained efficiency gains can push AI\nscaling well into the coming decade, providing a new perspective on the\ndiminishing returns inherent in classical scaling.\n","authors":["Chien-Ping Lu"],"pdf_url":"https://arxiv.org/pdf/2501.02156v3.pdf","comment":"21 pages, 3 figures. 2 tables, second draft"},{"id":"http://arxiv.org/abs/2402.13809v3","updated":"2025-01-08T14:21:46Z","published":"2024-02-21T13:46:25Z","title":"NeuralDiffuser: Neuroscience-inspired Diffusion Guidance for fMRI Visual\n Reconstruction","summary":" Reconstructing visual stimuli from functional Magnetic Resonance Imaging fMRI\nenables fine-grained retrieval of brain activity. However, the accurate\nreconstruction of diverse details, including structure, background, texture,\ncolor, and more, remains challenging. The stable diffusion models inevitably\nresult in the variability of reconstructed images, even under identical\nconditions. To address this challenge, we first uncover the neuroscientific\nperspective of diffusion methods, which primarily involve top-down creation\nusing pre-trained knowledge from extensive image datasets, but tend to lack\ndetail-driven bottom-up perception, leading to a loss of faithful details. In\nthis paper, we propose NeuralDiffuser, which incorporates primary visual\nfeature guidance to provide detailed cues in the form of gradients. 
This\nextension of the bottom-up process for diffusion models achieves both semantic\ncoherence and detail fidelity when reconstructing visual stimuli. Furthermore,\nwe have developed a novel guidance strategy for reconstruction tasks that\nensures the consistency of repeated outputs with original images rather than\nwith various outputs. Extensive experimental results on the Natural Scenes\nDataset (NSD) qualitatively and quantitatively demonstrate the advancement of\nNeuralDiffuser by comparing it against baseline and state-of-the-art methods\nhorizontally, as well as conducting longitudinal ablation studies.\n","authors":["Haoyu Li","Hao Wu","Badong Chen"],"pdf_url":"https://arxiv.org/pdf/2402.13809v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04528v1","updated":"2025-01-08T14:19:54Z","published":"2025-01-08T14:19:54Z","title":"Towards a Problem-Oriented Domain Adaptation Framework for Machine\n Learning","summary":" Domain adaptation is a sub-field of machine learning that involves\ntransferring knowledge from a source domain to perform the same task in the\ntarget domain. It is a typical challenge in machine learning that arises, e.g.,\nwhen data is obtained from various sources or when using a data basis that\nchanges over time. Recent advances in the field offer promising methods, but it\nis still challenging for researchers and practitioners to determine if domain\nadaptation is suitable for a given problem -- and, subsequently, to select the\nappropriate approach. This article employs design science research to develop a\nproblem-oriented framework for domain adaptation, which is matured in three\nevaluation episodes. We describe a framework that distinguishes between five\ndomain adaptation scenarios, provides recommendations for addressing each\nscenario, and offers guidelines for determining if a problem falls into one of\nthese scenarios. 
During the multiple evaluation episodes, the framework is\ntested on artificial and real-world datasets and an experimental study\ninvolving 100 participants. The evaluation demonstrates that the framework has\nthe explanatory power to capture any domain adaptation problem effectively. In\nsummary, we provide clear guidance for researchers and practitioners who want\nto employ domain adaptation but lack in-depth knowledge of the possibilities.\n","authors":["Philipp Spitzer","Dominik Martin","Laurin Eichberger","Niklas Kühl"],"pdf_url":"https://arxiv.org/pdf/2501.04528v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.13867v2","updated":"2025-01-08T14:08:11Z","published":"2024-05-22T17:48:17Z","title":"Scaling-laws for Large Time-series Models","summary":" Scaling laws for large language models (LLMs) have provided useful guidance\nin training ever larger models for predictable performance gains. Time series\nforecasting shares a similar sequential structure to language, and is amenable\nto large-scale transformer architectures. Here we show that foundational\ndecoder-only time series transformer models exhibit analogous scaling-behavior\nto LLMs, with architectural details (aspect ratio and number of heads) having a\nminimal effect over broad ranges. We assemble a large corpus of heterogenous\ntime series data on which to train, and establish for the first time power-law\nscaling with parameter count, dataset size, and training compute, spanning five\norders of magnitude.\n","authors":["Thomas D. P. Edwards","James Alvey","Justin Alsing","Nam H. Nguyen","Benjamin D. 
Wandelt"],"pdf_url":"https://arxiv.org/pdf/2405.13867v2.pdf","comment":"4 main pages (16 total), 4 figures; Accepted for oral presentation in\n Time Series in the Age of Large Models (TSALM) Workshop at Neurips 2024"},{"id":"http://arxiv.org/abs/2409.12809v2","updated":"2025-01-08T13:59:28Z","published":"2024-09-19T14:34:20Z","title":"Don't be Fooled: The Misinformation Effect of Explanations in Human-AI\n Collaboration","summary":" Across various applications, humans increasingly use black-box artificial\nintelligence (AI) systems without insight into these systems' reasoning. To\ncounter this opacity, explainable AI (XAI) methods promise enhanced\ntransparency and interpretability. While recent studies have explored how XAI\naffects human-AI collaboration, few have examined the potential pitfalls caused\nby incorrect explanations. The implications for humans can be far-reaching but\nhave not been explored extensively. To investigate this, we ran a study (n=160)\non AI-assisted decision-making in which humans were supported by XAI. Our\nfindings reveal a misinformation effect when incorrect explanations accompany\ncorrect AI advice with implications post-collaboration. This effect causes\nhumans to infer flawed reasoning strategies, hindering task execution and\ndemonstrating impaired procedural knowledge. Additionally, incorrect\nexplanations compromise human-AI team-performance during collaboration. 
With\nour work, we contribute to HCI by providing empirical evidence for the negative\nconsequences of incorrect explanations on humans post-collaboration and\noutlining guidelines for designers of AI.\n","authors":["Philipp Spitzer","Joshua Holstein","Katelyn Morrison","Kenneth Holstein","Gerhard Satzger","Niklas Kühl"],"pdf_url":"https://arxiv.org/pdf/2409.12809v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04510v1","updated":"2025-01-08T13:56:17Z","published":"2025-01-08T13:56:17Z","title":"CGP-Tuning: Structure-Aware Soft Prompt Tuning for Code Vulnerability\n Detection","summary":" Large language models (LLMs) have been proposed as powerful tools for\ndetecting software vulnerabilities, where task-specific fine-tuning is\ntypically employed to provide vulnerability-specific knowledge to the LLMs for\nthis purpose. However, traditional full-parameter fine-tuning is inefficient\nfor modern, complex LLMs, which contain billions of parameters.\n Soft prompt tuning has been suggested as a more efficient alternative for\nfine-tuning LLMs in general cases. However, pure soft prompt tuning treats\nsource code as plain text, losing structural information inherent in source\ncode. Meanwhile, graph-enhanced soft prompt tuning methods, which aim to\naddress this issue, are unable to preserve the rich semantic information within\ncode graphs, as they are primarily designed for general graph-related tasks and\nfocus more on adjacency information. They also fail to ensure computational\nefficiency while accounting for graph-text interactions.\n This paper, therefore, introduces a new code graph-enhanced, structure-aware\nsoft prompt tuning method for vulnerability detection, referred to as\nCGP-Tuning. It employs innovative type-aware embeddings to capture the rich\nsemantic information within code graphs, along with a novel and efficient\ncross-modal alignment module that achieves linear computational cost while\nincorporating graph-text interactions. 
The proposed CGP-Tuning is evaluated on\nthe latest DiverseVul dataset and the most recent open-source code LLMs,\nCodeLlama and CodeGemma. Experimental results demonstrate that CGP-Tuning\noutperforms the best state-of-the-art method by an average of 3.5 percentage\npoints in accuracy, without compromising its vulnerability detection\ncapabilities for long source code.\n","authors":["Ruijun Feng","Hammond Pearce","Pietro Liguori","Yulei Sui"],"pdf_url":"https://arxiv.org/pdf/2501.04510v1.pdf","comment":"14 pages, 5 figures"},{"id":"http://arxiv.org/abs/2407.02994v2","updated":"2025-01-08T13:35:45Z","published":"2024-07-03T10:49:21Z","title":"MedPix 2.0: A Comprehensive Multimodal Biomedical Data set for Advanced\n AI Applications with Retrieval Augmented Generation and Knowledge Graphs","summary":" The increasing interest in developing Artificial Intelligence applications in\nthe medical domain suffers from the lack of high-quality data sets, mainly due\nto privacy-related issues. In addition, the recent increase in large multimodal\nmodels (LMM) leads to the need for multimodal medical data sets, where clinical\nreports and findings are attached to the corresponding CT or MRI scans. This\npaper illustrates the entire workflow for building the MedPix 2.0 data set.\nStarting with the well-known multimodal data set\nMedPix\\textsuperscript{\\textregistered}, mainly used by physicians, nurses, and\nhealthcare students for Continuing Medical Education purposes, a semi-automatic\npipeline was developed to extract visual and textual data followed by a manual\ncuration procedure in which noisy samples were removed, thus creating a MongoDB\ndatabase. Along with the data set, we developed a GUI aimed at navigating\nefficiently the MongoDB instance and obtaining the raw data that can be easily\nused for training and/or fine-tuning LMMs. 
To enforce this point, in this work,\nwe first recall DR-Minerva, a RAG-based LMM trained using MedPix 2.0.\nDR-Minerva predicts the body part and the modality used to scan its input\nimage. We also propose the extension of DR-Minerva with a Knowledge Graph that\nuses Llama 3.1 Instruct 8B, and leverages MedPix 2.0. The resulting\narchitecture can be queried in an end-to-end manner, as a medical decision\nsupport system. MedPix 2.0 is available on GitHub.\n\\url{https://github.com/CHILab1/MedPix-2.0}\n","authors":["Irene Siragusa","Salvatore Contino","Massimo La Ciura","Rosario Alicata","Roberto Pirrone"],"pdf_url":"https://arxiv.org/pdf/2407.02994v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04493v1","updated":"2025-01-08T13:26:24Z","published":"2025-01-08T13:26:24Z","title":"The Role of Machine Learning in Congenital Heart Disease Diagnosis:\n Datasets, Algorithms, and Insights","summary":" Congenital heart disease is among the most common fetal abnormalities and\nbirth defects. Despite identifying numerous risk factors influencing its onset,\na comprehensive understanding of its genesis and management across diverse\npopulations remains limited. Recent advancements in machine learning have\ndemonstrated the potential for leveraging patient data to enable early\ncongenital heart disease detection. Over the past seven years, researchers have\nproposed various data-driven and algorithmic solutions to address this\nchallenge. This paper presents a systematic review of congenital heart disease\nrecognition using machine learning, conducting a meta-analysis of 432\nreferences from leading journals published between 2018 and 2024. A detailed\ninvestigation of 74 scholarly works highlights key factors, including\ndatabases, algorithms, applications, and solutions. Additionally, the survey\noutlines reported datasets used by machine learning experts for congenital\nheart disease recognition. 
Using a systematic literature review methodology,\nthis study identifies critical challenges and opportunities in applying machine\nlearning to congenital heart disease.\n","authors":["Khalil Khan","Farhan Ullah","Ikram Syed","Irfan Ullah"],"pdf_url":"https://arxiv.org/pdf/2501.04493v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.16586v2","updated":"2025-01-08T13:16:26Z","published":"2024-09-25T03:25:34Z","title":"AutoSTF: Decoupled Neural Architecture Search for Cost-Effective\n Automated Spatio-Temporal Forecasting","summary":" Spatio-temporal forecasting is a critical component of various smart city\napplications, such as transportation optimization, energy management, and\nsocio-economic analysis. Recently, several automated spatio-temporal\nforecasting methods have been proposed to automatically search the optimal\nneural network architecture for capturing complex spatio-temporal dependencies.\nHowever, the existing automated approaches suffer from expensive neural\narchitecture search overhead, which hinders their practical use and the further\nexploration of diverse spatio-temporal operators in a finer granularity. In\nthis paper, we propose AutoSTF, a decoupled automatic neural architecture\nsearch framework for cost-effective automated spatio-temporal forecasting. From\nthe efficiency perspective, we first decouple the mixed search space into\ntemporal space and spatial space and respectively devise representation\ncompression and parameter-sharing schemes to mitigate the parameter explosion.\nThe decoupled spatio-temporal search not only expedites the model optimization\nprocess but also leaves new room for more effective spatio-temporal dependency\nmodeling. From the effectiveness perspective, we propose a multi-patch transfer\nmodule to jointly capture multi-granularity temporal dependencies and extend\nthe spatial search space to enable finer-grained layer-wise spatial dependency\nsearch. 
Extensive experiments on eight datasets demonstrate the superiority of\nAutoSTF in terms of both accuracy and efficiency. Specifically, our proposed\nmethod achieves up to 13.48x speed-up compared to state-of-the-art automatic\nspatio-temporal forecasting methods while maintaining the best forecasting\naccuracy.\n","authors":["Tengfei Lyu","Weijia Zhang","Jinliang Deng","Hao Liu"],"pdf_url":"https://arxiv.org/pdf/2409.16586v2.pdf","comment":"Accepted by KDD 2025 Research Track"},{"id":"http://arxiv.org/abs/2501.04487v1","updated":"2025-01-08T13:14:05Z","published":"2025-01-08T13:14:05Z","title":"Integrating remote sensing data assimilation, deep learning and large\n language model for interactive wheat breeding yield prediction","summary":" Yield is one of the core goals of crop breeding. By predicting the potential\nyield of different breeding materials, breeders can screen these materials at\nvarious growth stages to select the best performing. Based on unmanned aerial\nvehicle remote sensing technology, high-throughput crop phenotyping data in\nbreeding areas is collected to provide data support for the breeding decisions\nof breeders. However, the accuracy of current yield predictions still requires\nimprovement, and the usability and user-friendliness of yield forecasting tools\nremain suboptimal. To address these challenges, this study introduces a hybrid\nmethod and tool for crop yield prediction, designed to allow breeders to\ninteractively and accurately predict wheat yield by chatting with a large\nlanguage model (LLM). First, the newly designed data assimilation algorithm is\nused to assimilate the leaf area index into the WOFOST model. Then, selected\noutputs from the assimilation process, along with remote sensing inversion\nresults, are used to drive the time-series temporal fusion transformer model\nfor wheat yield prediction. 
Finally, based on this hybrid method and leveraging\nan LLM with retrieval augmented generation technology, we developed an\ninteractive yield prediction Web tool that is user-friendly and supports\nsustainable data updates. This tool integrates multi-source data to assist\nbreeding decision-making. This study aims to accelerate the identification of\nhigh-yield materials in the breeding process, enhance breeding efficiency, and\nenable more scientific and smart breeding decisions.\n","authors":["Guofeng Yang","Nanfei Jin","Wenjie Ai","Zhonghua Zheng","Yuhong He","Yong He"],"pdf_url":"https://arxiv.org/pdf/2501.04487v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.01433v2","updated":"2025-01-08T13:08:32Z","published":"2024-12-18T02:00:53Z","title":"Mathematical Definition and Systematization of Puzzle Rules","summary":" While logic puzzles have engaged individuals through problem-solving and\ncritical thinking, the creation of new puzzle rules has largely relied on\nad-hoc processes. Pencil puzzles, such as Slitherlink and Sudoku, represent a\nprominent subset of these games, celebrated for their intellectual challenges\nrooted in combinatorial logic and spatial reasoning. Despite extensive research\ninto solving techniques and automated problem generation, a unified framework\nfor systematic and scalable rule design has been lacking. Here, we introduce a\nmathematical framework for defining and systematizing pencil puzzle rules. This\nframework formalizes grid elements, their positional relationships, and\niterative composition operations, allowing for the incremental construction of\nstructures that form the basis of puzzle rules. Furthermore, we establish a\nformal method to describe constraints and domains for each structure, ensuring\nsolvability and coherence. 
Applying this framework, we successfully formalized\nthe rules of well-known Nikoli puzzles, including Slitherlink and Sudoku,\ndemonstrating the formal representation of a significant portion (approximately\none-fourth) of existing puzzles. These results validate the potential of the\nframework to systematize and innovate puzzle rule design, establishing a\npathway to automated rule generation. By providing a mathematical foundation\nfor puzzle rule creation, this framework opens avenues for computers,\npotentially enhanced by AI, to design novel puzzle rules tailored to player\npreferences, expanding the scope of puzzle diversity. Beyond its direct\napplication to pencil puzzles, this work illustrates how mathematical\nframeworks can bridge recreational mathematics and algorithmic design, offering\ntools for broader exploration in logic-based systems, with potential\napplications in educational game design, personalized learning, and\ncomputational creativity.\n","authors":["Itsuki Maeda","Yasuhiro Inoue"],"pdf_url":"https://arxiv.org/pdf/2501.01433v2.pdf","comment":"16 pages"},{"id":"http://arxiv.org/abs/2501.04480v1","updated":"2025-01-08T13:03:34Z","published":"2025-01-08T13:03:34Z","title":"Research on environment perception and behavior prediction of\n intelligent UAV based on semantic communication","summary":" The convergence of drone delivery systems, virtual worlds, and blockchain has\ntransformed logistics and supply chain management, providing a fast and\nenvironmentally friendly alternative to traditional ground transportation\nmethods. To provide users with a real-world experience, virtual service providers\nneed to collect up-to-the-minute delivery information from edge devices. 
To\naddress this challenge, 1) a reinforcement learning approach is introduced to\nenable drones with fast training capabilities and the ability to autonomously\nadapt to new virtual scenarios for effective resource allocation. 2) A semantic\ncommunication framework for meta-universes is proposed, which utilizes the\nextraction of semantic information to reduce the communication cost and\nincentivize the transmission of information for meta-universe services. 3) In\norder to ensure user information security, a lightweight authentication\nand key agreement scheme is designed between the drone and the user by\nintroducing blockchain technology. In our experiments, the drone adaptation\nperformance is improved by about 35\\%, and the local offloading rate can reach\n90\\% as the number of base stations increases. The semantic\ncommunication system proposed in this paper is compared with the Cross Entropy\nbaseline model. With the introduction of blockchain technology, the transaction\nthroughput is maintained at a stable value for different numbers of drones.\n","authors":["Kechong Ren","Li Gao","Qi Guan"],"pdf_url":"https://arxiv.org/pdf/2501.04480v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04472v1","updated":"2025-01-08T12:51:34Z","published":"2025-01-08T12:51:34Z","title":"Hybrid Artificial Intelligence Strategies for Drone Navigation","summary":" Objective: This paper describes the development of hybrid artificial\nintelligence strategies for drone navigation. Methods: The navigation module\ncombines a deep learning model with a rule-based engine depending on the agent\nstate. The deep learning model has been trained using reinforcement learning.\nThe rule-based engine uses expert knowledge to deal with specific situations.\nThe navigation module incorporates several strategies to explain the drone\ndecision based on its observation space, and different mechanisms for including\nhuman decisions in the navigation process. 
Finally, this paper proposes an\nevaluation methodology based on defining several scenarios and analyzing the\nperformance of the different strategies according to metrics adapted to each\nscenario. Results: Two main navigation problems have been studied. For the\nfirst scenario (reaching known targets), it has been possible to obtain a 90%\ntask completion rate, significantly reducing the number of collisions thanks to\nthe rule-based engine. For the second scenario, it has been possible to reduce\nthe time required to locate all the targets by 20% using the reinforcement\nlearning model. Conclusions: Reinforcement learning is a very good strategy to\nlearn policies for drone navigation, but in critical situations, it is\nnecessary to complement it with a rule-based module to increase the task success\nrate.\n","authors":["Rubén San-Segundo","Lucía Angulo","Manuel Gil-Martín","David Carramiñana","Ana M. Bernardos"],"pdf_url":"https://arxiv.org/pdf/2501.04472v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.08023v2","updated":"2025-01-08T12:40:56Z","published":"2024-09-12T13:05:28Z","title":"Edge-Wise Graph-Instructed Neural Networks","summary":" The problem of multi-task regression over graph nodes has been recently\napproached through Graph-Instructed Neural Network (GINN), which is a promising\narchitecture belonging to the subset of message-passing graph neural networks.\nIn this work, we discuss the limitations of the Graph-Instructed (GI) layer,\nand we formalize a novel edge-wise GI (EWGI) layer. 
We discuss the advantages\nof the EWGI layer and we provide numerical evidence that EWGINNs perform better\nthan GINNs over some graph-structured input data, like the ones inferred from\nthe Barabasi-Albert graph, and improve the training regularization on graphs\nwith chaotic connectivity, like the ones inferred from the Erdos-Renyi graph.\n","authors":["Francesco Della Santa","Antonio Mastropietro","Sandra Pieraccini","Francesco Vaccarino"],"pdf_url":"https://arxiv.org/pdf/2409.08023v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.16149v4","updated":"2025-01-08T12:40:27Z","published":"2024-03-24T13:43:43Z","title":"Analyzing Consumer IoT Traffic from Security and Privacy Perspectives: a\n Comprehensive Survey","summary":" The Consumer Internet of Things (CIoT), a notable segment within the IoT\ndomain, involves the integration of IoT technology into consumer electronics\nand devices, such as smart homes and smart wearables. Compared to traditional\nIoT fields, CIoT differs notably in target users, product types, and design\napproaches. While offering convenience to users, it also raises new security\nand privacy concerns. Network traffic analysis, a widely used technique in the\nsecurity community, has been extensively applied to investigate these concerns\nabout CIoT. Compared to network traffic analysis in other fields such as mobile\napps and websites, CIoT presents unique characteristics, introducing new\nchallenges and research opportunities. Researchers have made significant\ncontributions in this area. To aid researchers in understanding the application\nof traffic analysis tools for studying CIoT security and privacy risks, this\nsurvey reviews 303 publications on traffic analysis within the CIoT security\nand privacy domain from January 2018 to June 2024, focusing on three research\nquestions. Our work: 1) outlines the CIoT traffic analysis process and\nhighlights its differences from general network traffic analysis. 
2) summarizes\nand classifies existing research into four categories according to its\napplication objectives: device fingerprinting, user activity inference,\nmalicious traffic detection, and measurement. 3) explores emerging challenges\nand potential future research directions based on each step of the CIoT traffic\nanalysis process. This will provide new insights to the community and guide the\nindustry towards safer product designs.\n","authors":["Yan Jia","Yuxin Song","Zihou Liu","Qingyin Tan","Yang Song","Yu Zhang","Zheli Liu"],"pdf_url":"https://arxiv.org/pdf/2403.16149v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04444v1","updated":"2025-01-08T11:53:30Z","published":"2025-01-08T11:53:30Z","title":"A novel Facial Recognition technique with Focusing on Masked Faces","summary":" Recognizing the same faces with and without masks is important for ensuring\nconsistent identification in security, access control, and public safety. This\ncapability is crucial in scenarios like law enforcement, healthcare, and\nsurveillance, where accurate recognition must be maintained despite facial\nocclusion. This research focuses on the challenge of recognizing the same faces\nwith and without masks by employing cosine similarity as the primary technique.\nWith the increased use of masks, traditional facial recognition systems face\nsignificant accuracy issues, making it crucial to develop methods that can\nreliably identify individuals in masked conditions. For that reason, this study\nproposed Masked-Unmasked Face Matching Model (MUFM). This model employs\ntransfer learning using the Visual Geometry Group (VGG16) model to extract\nsignificant facial features, which are subsequently classified utilizing the\nK-Nearest Neighbors (K-NN) algorithm. The cosine similarity metric is employed\nto compare masked and unmasked faces of the same individuals. 
This approach\nrepresents a novel contribution, as the task of recognizing the same individual\nwith and without a mask using cosine similarity has not been previously\naddressed. By integrating these advanced methodologies, the research\ndemonstrates effective identification of individuals despite the presence of\nmasks, addressing a significant limitation in traditional systems. Data\ncollection is another essential part of this work: an image dataset was\nassembled and prepared from three different sources, some of which contain\nreal-world data, which strengthens the comprehensiveness of this research. The\nimages used were drawn from three existing datasets of masked and unmasked\nversions of the same faces.\n","authors":["Dana A Abdullah","Dana Rasul Hamad","Hakem Beitollahi","Ismail Y Maolood","Abdulhady Abas Abdullah","Aso Khaleel Ameen"],"pdf_url":"https://arxiv.org/pdf/2501.04444v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03301v2","updated":"2025-01-08T11:47:25Z","published":"2025-01-06T15:19:26Z","title":"Rethinking Byzantine Robustness in Federated Recommendation from Sparse\n Aggregation Perspective","summary":" To preserve user privacy in recommender systems, federated recommendation\n(FR) based on federated learning (FL) emerges, keeping the personal data on the\nlocal client and updating a model collaboratively. Unlike FL, FR has a unique\nsparse aggregation mechanism, where the embedding of each item is updated by\nonly partial clients, instead of full clients in a dense aggregation of general\nFL. Recently, as an essential principle of FL, model security has received\nincreasing attention, especially for Byzantine attacks, where malicious clients\ncan send arbitrary updates. The problem of exploring the Byzantine robustness\nof FR is particularly critical since in the domains applying FR, e.g.,\ne-commerce, malicious clients can be injected easily by registering new\naccounts. 
However, existing Byzantine works neglect the unique sparse\naggregation of FR, making them unsuitable for our problem. Thus, we make the\nfirst effort to investigate Byzantine attacks on FR from the perspective of\nsparse aggregation, which is non-trivial: it is not clear how to define\nByzantine robustness under sparse aggregations and design Byzantine attacks\nunder limited knowledge/capability. In this paper, we reformulate the Byzantine\nrobustness under sparse aggregation by defining the aggregation for a single\nitem as the smallest execution unit. Then we propose a family of effective\nattack strategies, named Spattack, which exploit the vulnerability in sparse\naggregation and are categorized along the adversary's knowledge and capability.\nExtensive experimental results demonstrate that Spattack can effectively\nprevent convergence and even break down defenses under a few malicious clients,\nraising alarms for securing FR systems.\n","authors":["Zhongjian Zhang","Mengmei Zhang","Xiao Wang","Lingjuan Lyu","Bo Yan","Junping Du","Chuan Shi"],"pdf_url":"https://arxiv.org/pdf/2501.03301v2.pdf","comment":"accepted by AAAI 2025"},{"id":"http://arxiv.org/abs/2501.04438v1","updated":"2025-01-08T11:39:28Z","published":"2025-01-08T11:39:28Z","title":"Effect of Information Technology on Job Creation to Support Economic:\n Case Studies of Graduates in Universities (2023-2024) of the KRG of Iraq","summary":" The aim of this study is to assess the impact of information technology (IT)\non university graduates in terms of employment development, which will aid in\neconomic issues. This study uses a descriptive research methodology and a\nquantitative approach to understand variables. The focus of this study is to\nascertain how graduates of Kurdistan regional universities might use IT to\nsecure employment and significantly contribute to the nation's economic\nrevival. 
The sample size was established by the use of a judgmental sampling\nprocedure and consisted of 314 people. The researcher prepared the\nquestionnaire to collect data, and then SPSS statistical software, version 22,\nand Excel 2010 were used to modify, compile, and tabulate the results. The\nstudy's outcome showed that information technology is incredibly inventive, has\na promising future, and makes life much easier for everyone. It also proved\nthat a deep academic understanding of information technology and its\nconstituent parts helps graduates of Kurdistan Regional University find\nsuitable careers. More importantly, though, anyone looking for work or a means\nof support will find great benefit from possessing credentials and an\nunderstanding of IT. The study's final finding was that information technology\nhas actively advanced the country's economy. Not only is IT helping to boost\nyouth employment, but it is also turning into a worthwhile investment for\neconomic growth.\n","authors":["Azhi Kh. Bapir","Ismail Y. Maolood","Dana A Abdullah","Aso K. Ameen","Abdulhady Abas Abdullah"],"pdf_url":"https://arxiv.org/pdf/2501.04438v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04437v1","updated":"2025-01-08T11:37:35Z","published":"2025-01-08T11:37:35Z","title":"Integrating LLMs with ITS: Recent Advances, Potentials, Challenges, and\n Future Directions","summary":" Intelligent Transportation Systems (ITS) are crucial for the development and\noperation of smart cities, addressing key challenges in efficiency,\nproductivity, and environmental sustainability. This paper comprehensively\nreviews the transformative potential of Large Language Models (LLMs) in\noptimizing ITS. 
Initially, we provide an extensive overview of ITS,\nhighlighting its components, operational principles, and overall effectiveness.\nWe then delve into the theoretical background of various LLM techniques, such\nas GPT, T5, CTRL, and BERT, elucidating their relevance to ITS applications.\nFollowing this, we examine the wide-ranging applications of LLMs within ITS,\nincluding traffic flow prediction, vehicle detection and classification,\nautonomous driving, traffic sign recognition, and pedestrian detection. Our\nanalysis reveals how these advanced models can significantly enhance traffic\nmanagement and safety. Finally, we explore the challenges and limitations LLMs\nface in ITS, such as data availability, computational constraints, and ethical\nconsiderations. We also present several future research directions and\npotential innovations to address these challenges. This paper aims to guide\nresearchers and practitioners through the complexities and opportunities of\nintegrating LLMs in ITS, offering a roadmap to create more efficient,\nsustainable, and responsive next-generation transportation systems.\n","authors":["Doaa Mahmud","Hadeel Hajmohamed","Shamma Almentheri","Shamma Alqaydi","Lameya Aldhaheri","Ruhul Amin Khalil","Nasir Saeed"],"pdf_url":"https://arxiv.org/pdf/2501.04437v1.pdf","comment":"Accepted for publication in IEEE Transactions on Intelligent\n Transportation Systems"},{"id":"http://arxiv.org/abs/2501.04436v1","updated":"2025-01-08T11:37:06Z","published":"2025-01-08T11:37:06Z","title":"Federated Fine-Tuning of LLMs: Framework Comparison and Research\n Directions","summary":" Federated learning (FL) provides a privacy-preserving solution for\nfine-tuning pre-trained large language models (LLMs) using distributed private\ndatasets, enabling task-specific adaptation while preserving data privacy.\nHowever, fine-tuning the extensive parameters in LLMs is particularly\nchallenging in resource-constrained federated scenarios due to the 
significant\ncommunication and computational costs. To gain a deeper understanding of how\nthese challenges can be addressed, this article conducts a comparative analysis\nof three advanced federated LLM (FedLLM) frameworks that integrate knowledge\ndistillation (KD) and split learning (SL) to mitigate these issues: 1) FedLLMs,\nwhere clients upload model parameters or gradients to enable straightforward\nand effective fine-tuning; 2) KD-FedLLMs, which leverage KD for efficient\nknowledge sharing via logits; and 3) Split-FedLLMs, which split the LLMs into\ntwo parts, with one part executed on the client and the other on the\nserver, to balance the computational load. Each framework is evaluated based on\nkey performance metrics, including model accuracy, communication overhead, and\nclient-side computational load, offering insights into their effectiveness for\nvarious federated fine-tuning scenarios. Through this analysis, we identify\nframework-specific optimization opportunities to enhance the efficiency of\nFedLLMs and discuss broader research directions, highlighting open\nopportunities to better adapt FedLLMs for real-world applications. A use case\nis presented to demonstrate the performance comparison of these three\nframeworks under varying configurations and settings.\n","authors":["Na Yan","Yang Su","Yansha Deng","Robert Schober"],"pdf_url":"https://arxiv.org/pdf/2501.04436v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04435v1","updated":"2025-01-08T11:31:39Z","published":"2025-01-08T11:31:39Z","title":"A Digital Shadow for Modeling, Studying and Preventing Urban Crime","summary":" Crime is one of the greatest threats to urban security. Around 80 percent of\nthe world's population lives in countries with high levels of criminality. Most\nof the crimes committed in the cities take place in their urban environments.\nThis paper presents the development and validation of a digital shadow platform\nfor modeling and simulating urban crime. 
This digital shadow has been\nconstructed using data-driven agent-based modeling and simulation techniques,\nwhich are suitable for capturing dynamic interactions among individuals and\nwith their environment. Our approach transforms and integrates well-known\ncriminological theories and the expert knowledge of law enforcement agencies\n(LEA), policy makers, and other stakeholders under a theoretical model, which\nis in turn combined with real crime, spatial (cartographic) and socio-economic\ndata into an urban model characterizing the daily behavior of citizens. The\ndigital shadow has also been instantiated for the city of Malaga, for which we\nhad over 300,000 complaints available. This instance has been calibrated with\nthose complaints and other geographic and socio-economic information of the\ncity. To the best of our knowledge, our digital shadow is the first for large\nurban areas that has been calibrated with a large dataset of real crime reports\nand with an accurate representation of the urban environment. 
The performance\nindicators of the model after being calibrated, in terms of the metrics widely\nused in predictive policing, suggest that our simulated crime generation\nmatches the general pattern of crime in the city according to historical data.\nOur digital shadow platform could be an interesting tool for modeling and\npredicting criminal behavior in an urban environment on a daily basis and,\nthus, a useful tool for policy makers, criminologists, sociologists, LEAs, etc.\nto study and prevent urban crime.\n","authors":["Juan Palma-Borda","Eduardo Guzmán","María-Victoria Belmonte"],"pdf_url":"https://arxiv.org/pdf/2501.04435v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.11189v2","updated":"2025-01-08T11:24:17Z","published":"2024-12-15T13:48:39Z","title":"Leveraging Large Language Models for Active Merchant Non-player\n Characters","summary":" We highlight two significant issues leading to the passivity of current\nmerchant non-player characters (NPCs): pricing and communication. While\nimmersive interactions have been a focus, negotiations between merchant NPCs\nand players on item prices have not received sufficient attention. First, we\ndefine passive pricing as the limited ability of merchants to modify predefined\nitem prices. Second, passive communication means that merchants can only\ninteract with players in a scripted manner. To tackle these issues and create\nan active merchant NPC, we propose a merchant framework based on large language\nmodels (LLMs), called MART, which consists of an appraiser module and a\nnegotiator module. We conducted two experiments to guide game developers in\nselecting appropriate implementations by comparing different training methods\nand LLM sizes. Our findings indicate that finetuning methods, such as\nsupervised finetuning (SFT) and knowledge distillation (KD), are effective in\nusing smaller LLMs to implement active merchant NPCs. 
Additionally, we found\nthree irregular cases arising from the responses of LLMs. We expect our\nfindings to guide developers in using LLMs for developing active merchant NPCs.\n","authors":["Byungjun Kim","Minju Kim","Dayeon Seo","Bugeun Kim"],"pdf_url":"https://arxiv.org/pdf/2412.11189v2.pdf","comment":"Under review / Modified the links to code and dataset"},{"id":"http://arxiv.org/abs/2501.04426v1","updated":"2025-01-08T11:20:48Z","published":"2025-01-08T11:20:48Z","title":"Dual-Force: Enhanced Offline Diversity Maximization under Imitation\n Constraints","summary":" While many algorithms for diversity maximization under imitation constraints\nare online in nature, many applications require offline algorithms without\nenvironment interactions. Tackling this problem in the offline setting,\nhowever, presents significant challenges that require non-trivial, multi-stage\noptimization processes with non-stationary rewards. In this work, we present a\nnovel offline algorithm that enhances diversity using an objective based on Van\nder Waals (VdW) force and successor features, and eliminates the need to learn\na previously used skill discriminator. Moreover, by conditioning the value\nfunction and policy on a pre-trained Functional Reward Encoding (FRE), our\nmethod allows for better handling of non-stationary rewards and provides\nzero-shot recall of all skills encountered during training, significantly\nexpanding the set of skills learned in prior work. Consequently, our algorithm\nbenefits from receiving a consistently strong diversity signal (VdW), and\nenjoys more stable and efficient training. 
We demonstrate the effectiveness of\nour method in generating diverse skills for two robotic tasks in simulation:\nlocomotion of a quadruped and local navigation with obstacle traversal.\n","authors":["Pavel Kolev","Marin Vlastelica","Georg Martius"],"pdf_url":"https://arxiv.org/pdf/2501.04426v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04424v1","updated":"2025-01-08T11:17:40Z","published":"2025-01-08T11:17:40Z","title":"NSA: Neuro-symbolic ARC Challenge","summary":" The Abstraction and Reasoning Corpus (ARC) evaluates general reasoning\ncapabilities that are difficult for both machine learning models and\ncombinatorial search methods. We propose a neuro-symbolic approach that\ncombines a transformer for proposal generation with combinatorial search using\na domain-specific language. The transformer narrows the search space by\nproposing promising search directions, which allows the combinatorial search to\nfind the actual solution in a short time. We pre-train the transformer with\nsynthetically generated data. During test-time we generate additional\ntask-specific training tasks and fine-tune our model. Our results surpass\ncomparable state of the art on the ARC evaluation set by 27% and compare\nfavourably on the ARC train set. We make our code and dataset publicly\navailable at https://github.com/Batorskq/NSA.\n","authors":["Paweł Batorski","Jannik Brinkmann","Paul Swoboda"],"pdf_url":"https://arxiv.org/pdf/2501.04424v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2208.06652v4","updated":"2025-01-08T11:10:16Z","published":"2022-08-13T13:46:13Z","title":"Differentiable Inductive Logic Programming in High-Dimensional Space","summary":" Synthesizing large logic programs through symbolic Inductive Logic\nProgramming (ILP) typically requires intermediate definitions. However,\ncluttering the hypothesis space with intensional predicates typically degrades\nperformance. 
In contrast, gradient descent provides an efficient way to find\nsolutions within such high-dimensional spaces. Neuro-symbolic ILP approaches\nhave not fully exploited this so far. We propose extending the {\\delta}ILP\napproach to inductive synthesis with large-scale predicate invention, thus\nallowing us to exploit the efficacy of high-dimensional gradient descent. We\nshow that large-scale predicate invention benefits differentiable inductive\nsynthesis through gradient descent and allows one to learn solutions for tasks\nbeyond the capabilities of existing neuro-symbolic ILP systems. Furthermore, we\nachieve these results without specifying the precise structure of the solution\nwithin the language bias.\n","authors":["Stanisław J. Purgał","David M. Cerna","Cezary Kaliszyk"],"pdf_url":"https://arxiv.org/pdf/2208.06652v4.pdf","comment":"8 pages, To appear, published at IJCLR 2024"},{"id":"http://arxiv.org/abs/2501.04410v1","updated":"2025-01-08T10:49:13Z","published":"2025-01-08T10:49:13Z","title":"User Simulation in the Era of Generative AI: User Modeling, Synthetic\n Data Generation, and System Evaluation","summary":" User simulation is an emerging interdisciplinary topic with multiple critical\napplications in the era of Generative AI. It involves creating an intelligent\nagent that mimics the actions of a human user interacting with an AI system,\nenabling researchers to model and analyze user behaviour, generate synthetic\ndata for training, and evaluate interactive AI systems in a controlled and\nreproducible manner. User simulation has profound implications for diverse\nfields and plays a vital role in the pursuit of Artificial General\nIntelligence. 
This paper provides an overview of user simulation, highlighting\nits key applications, connections to various disciplines, and outlining future\nresearch directions to advance this increasingly important technology.\n","authors":["Krisztian Balog","ChengXiang Zhai"],"pdf_url":"https://arxiv.org/pdf/2501.04410v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03535v2","updated":"2025-01-08T10:34:54Z","published":"2025-01-07T05:15:46Z","title":"SenseRAG: Constructing Environmental Knowledge Bases with Proactive\n Querying for LLM-Based Autonomous Driving","summary":" This study addresses the critical need for enhanced situational awareness in\nautonomous driving (AD) by leveraging the contextual reasoning capabilities of\nlarge language models (LLMs). Unlike traditional perception systems that rely\non rigid, label-based annotations, it integrates real-time, multimodal sensor\ndata into a unified, LLMs-readable knowledge base, enabling LLMs to dynamically\nunderstand and respond to complex driving environments. To overcome the\ninherent latency and modality limitations of LLMs, a proactive\nRetrieval-Augmented Generation (RAG) is designed for AD, combined with a\nchain-of-thought prompting mechanism, ensuring rapid and context-rich\nunderstanding. 
Experimental results using real-world Vehicle-to-everything\n(V2X) datasets demonstrate significant improvements in perception and\nprediction performance, highlighting the potential of this framework to enhance\nsafety, adaptability, and decision-making in next-generation AD systems.\n","authors":["Xuewen Luo","Fan Ding","Fengze Yang","Yang Zhou","Junnyong Loo","Hwa Hui Tew","Chenxi Liu"],"pdf_url":"https://arxiv.org/pdf/2501.03535v2.pdf","comment":"This paper has been accepted for presentation at WACV Workshop LLMAD\n 2025"},{"id":"http://arxiv.org/abs/2309.06941v3","updated":"2025-01-08T09:35:58Z","published":"2023-09-13T13:24:27Z","title":"DEFormer: DCT-driven Enhancement Transformer for Low-light Image and\n Dark Vision","summary":" Low-light image enhancement restores the colors and details of a single image\nand improves high-level visual tasks. However, restoring the lost details in\nthe dark area is still a challenge relying only on the RGB domain. In this\npaper, we delve into frequency as a new clue for the model and propose a\nDCT-driven enhancement transformer (DEFormer) framework. First, we propose a\nlearnable frequency branch (LFB) for frequency enhancement that contains DCT\nprocessing and curvature-based frequency enhancement (CFE) to represent\nfrequency features. Additionally, we propose a cross-domain fusion (CDF) to\nreduce the differences between the RGB domain and the frequency domain. 
Our\nDEFormer has achieved superior results on the LOL and MIT-Adobe FiveK datasets,\nimproving the dark detection performance.\n","authors":["Xiangchen Yin","Zhenda Yu","Xin Gao","Xiao Sun"],"pdf_url":"https://arxiv.org/pdf/2309.06941v3.pdf","comment":"Accepted by ICASSP"},{"id":"http://arxiv.org/abs/2501.04377v1","updated":"2025-01-08T09:34:15Z","published":"2025-01-08T09:34:15Z","title":"On Computational Limits and Provably Efficient Criteria of Visual\n Autoregressive Models: A Fine-Grained Complexity Analysis","summary":" Recently, Visual Autoregressive ($\\mathsf{VAR}$) Models introduced a\ngroundbreaking advancement in the field of image generation, offering a\nscalable approach through a coarse-to-fine \"next-scale prediction\" paradigm.\nHowever, the state-of-the-art algorithm of $\\mathsf{VAR}$ models in [Tian,\nJiang, Yuan, Peng and Wang, NeurIPS 2024] takes $O(n^4)$ time, which is\ncomputationally inefficient. In this work, we analyze the computational limits\nand efficiency criteria of $\\mathsf{VAR}$ Models through a fine-grained\ncomplexity lens. Our key contribution is identifying the conditions under which\n$\\mathsf{VAR}$ computations can achieve sub-quadratic time complexity.\nSpecifically, we establish a critical threshold for the norm of input matrices\nused in $\\mathsf{VAR}$ attention mechanisms. Above this threshold, assuming the\nStrong Exponential Time Hypothesis ($\\mathsf{SETH}$) from fine-grained\ncomplexity theory, a sub-quartic time algorithm for $\\mathsf{VAR}$ models is\nimpossible. To substantiate our theoretical findings, we present efficient\nconstructions leveraging low-rank approximations that align with the derived\ncriteria. This work initiates the study of the computational efficiency of the\n$\\mathsf{VAR}$ model from a theoretical perspective. 
Our technique will shed\nlight on advancing scalable and efficient image generation in $\\mathsf{VAR}$\nframeworks.\n","authors":["Yekun Ke","Xiaoyu Li","Yingyu Liang","Zhizhou Sha","Zhenmei Shi","Zhao Song"],"pdf_url":"https://arxiv.org/pdf/2501.04377v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.18601v2","updated":"2025-01-08T09:30:47Z","published":"2024-07-26T08:41:58Z","title":"Reorganizing attention-space geometry with expressive attention","summary":" Attention regulates information transfer between tokens. For this, query and\nkey vectors are compared, typically in terms of a scalar product,\n$\\mathbf{Q}^T\\mathbf{K}$, together with a subsequent softmax normalization. In\ngeometric terms, the standard dot-product attention (DPA) leads to large/small\nattention weights for parallel/antiparallel queries and keys. Here we study\nexpressive attention (EA), which is based on $(\\mathbf{Q}^T\\mathbf{K})^2$, the\nsquared dot product. In this case, attention is enhanced when query and key are\neither parallel or antiparallel, and suppressed for orthogonal configurations.\nEA can be introduced into any attention-based code without additional compute\ncosts or memory requirements. For a series of autoregressive prediction tasks,\nwe find that expressive attention performs at least as well as vanilla DPA.\nIncreasing task complexity, EA is observed to outperform DPA with increasing\nmargins, which also holds for multi-task settings. For a given model size, EA\nmanages to achieve 100% performance for a range of complexity levels not\naccessible to DPA. 
Our results show that it is possible to reorganize the\ngeometry of the matching condition in the space of attention heads without loss\nof performance.\n","authors":["Claudius Gros"],"pdf_url":"https://arxiv.org/pdf/2407.18601v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.15267v2","updated":"2025-01-08T09:18:05Z","published":"2024-12-17T05:04:57Z","title":"Toxicity Detection towards Adaptability to Changing Perturbations","summary":" Toxicity detection is crucial for maintaining the peace of society. While\nexisting methods perform well on normal toxic content or content generated by\nspecific perturbation methods, they are vulnerable to evolving perturbation\npatterns. However, in real-world scenarios, malicious users tend to create new\nperturbation patterns for fooling the detectors. For example, some users may\ncircumvent the detector of large language models (LLMs) by adding `I am a\nscientist' at the beginning of the prompt. In this paper, we introduce a novel\nproblem, i.e., continual learning of jailbreak perturbation patterns, into the\ntoxicity detection field. To tackle this problem, we first construct a new\ndataset generated by 9 types of perturbation patterns, 7 of which are summarized\nfrom prior work and 2 of which are developed by us. We then systematically\nvalidate the vulnerability of current methods on this new perturbation\npattern-aware dataset via both zero-shot and fine-tuned cross-pattern\ndetection. Building on this, we present the domain incremental learning paradigm and\nthe corresponding benchmark to ensure the detector's robustness to dynamically\nemerging types of perturbed toxic text. 
Our code and dataset are provided in\nthe appendix and will be publicly available at GitHub, by which we wish to\noffer new research opportunities for the security-relevant communities.\n","authors":["Hankun Kang","Jianhao Chen","Yongqi Li","Xin Miao","Mayi Xu","Ming Zhong","Yuanyuan Zhu","Tieyun Qian"],"pdf_url":"https://arxiv.org/pdf/2412.15267v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04366v1","updated":"2025-01-08T09:08:24Z","published":"2025-01-08T09:08:24Z","title":"DispFormer: Pretrained Transformer for Flexible Dispersion Curve\n Inversion from Global Synthesis to Regional Applications","summary":" Surface wave dispersion curve inversion is essential for estimating\nsubsurface Shear-wave velocity ($v_s$), yet traditional methods often struggle\nto balance computational efficiency with inversion accuracy. While deep\nlearning approaches show promise, previous studies typically require large\namounts of labeled data and struggle with real-world datasets that have varying\nperiod ranges, missing data, and low signal-to-noise ratios. This study\nproposes DispFormer, a transformer-based neural network for inverting the $v_s$\nprofile from Rayleigh-wave phase and group dispersion curves. DispFormer\nprocesses dispersion data at each period independently, thereby allowing it to\nhandle data of varying lengths without requiring network modifications or\nalignment between training and testing data. The performance is demonstrated by\npre-training it on a global synthetic dataset and testing it on two regional\nsynthetic datasets using zero-shot and few-shot strategies. Results indicate\nthat zero-shot DispFormer, even without any labeled data, produces inversion\nprofiles that match well with the ground truth, providing a deployable initial\nmodel generator to assist traditional methods. When labeled data is available,\nfew-shot DispFormer outperforms traditional methods with only a small number of\nlabels. 
Furthermore, real-world tests indicate that DispFormer effectively\nhandles varying-length data and yields lower data residuals than reference\nmodels. These findings demonstrate that DispFormer provides a robust foundation\nmodel for dispersion curve inversion and is a promising approach for broader\napplications.\n","authors":["Feng Liu","Bao Deng","Rui Su","Lei Bai","Wanli Ouyang"],"pdf_url":"https://arxiv.org/pdf/2501.04366v1.pdf","comment":"11 pages, 11 figures, related codes and data are available at\n https://github.com/liufeng2317/DispFormer"},{"id":"http://arxiv.org/abs/2404.07965v4","updated":"2025-01-08T09:07:54Z","published":"2024-04-11T17:52:01Z","title":"Rho-1: Not All Tokens Are What You Need","summary":" Previous language model pre-training methods have uniformly applied a\nnext-token prediction loss to all training tokens. Challenging this norm, we\nposit that \"Not all tokens in a corpus are equally important for language model\ntraining\". Our initial analysis examines token-level training\ndynamics of language models, revealing distinct loss patterns for different\ntokens. Leveraging these insights, we introduce a new language model called\nRho-1. Unlike traditional LMs that learn to predict every next token in a\ncorpus, Rho-1 employs Selective Language Modeling (SLM), which selectively\ntrains on useful tokens that are aligned with the desired distribution. This\napproach involves scoring pretraining tokens using a reference model, and then\ntraining the language model with a focused loss on tokens with higher scores.\nWhen continually pretraining on the 15B OpenWebMath corpus, Rho-1 yields an absolute\nimprovement in few-shot accuracy of up to 30% in 9 math tasks. After\nfine-tuning, Rho-1-1B and 7B achieved state-of-the-art results of 40.6% and\n51.8% on the MATH dataset, respectively, matching DeepSeekMath with only 3% of the\npretraining tokens. 
Furthermore, when continually pretraining on 80B general\ntokens, Rho-1 achieves a 6.8% average enhancement across 15 diverse tasks,\nincreasing both efficiency and performance of language model pre-training.\n","authors":["Zhenghao Lin","Zhibin Gou","Yeyun Gong","Xiao Liu","Yelong Shen","Ruochen Xu","Chen Lin","Yujiu Yang","Jian Jiao","Nan Duan","Weizhu Chen"],"pdf_url":"https://arxiv.org/pdf/2404.07965v4.pdf","comment":"First two authors equal contribution"},{"id":"http://arxiv.org/abs/2501.03562v2","updated":"2025-01-08T08:57:32Z","published":"2025-01-07T06:22:55Z","title":"Rethinking Adversarial Attacks in Reinforcement Learning from Policy\n Distribution Perspective","summary":" Deep Reinforcement Learning (DRL) suffers from uncertainties and inaccuracies\nin the observation signal in real-world applications. Adversarial attacks provide an\neffective method for evaluating the robustness of DRL agents. However, existing\nattack methods targeting individual sampled actions have limited impacts on the\noverall policy distribution, particularly in continuous action spaces. To\naddress these limitations, we propose the Distribution-Aware Projected Gradient\nDescent attack (DAPGD). DAPGD uses distribution similarity as the gradient\nperturbation input to attack the policy network, which leverages the entire\npolicy distribution rather than relying on individual samples. We utilize the\nBhattacharyya distance in DAPGD to measure policy similarity, enabling\nsensitive detection of subtle but critical differences between probability\ndistributions. 
Our experimental results demonstrate that DAPGD achieves SOTA\nresults in three robot navigation tasks, with an\naverage 22.03% higher reward drop than the best baseline.\n","authors":["Tianyang Duan","Zongyuan Zhang","Zheng Lin","Yue Gao","Ling Xiong","Yong Cui","Hongbin Liang","Xianhao Chen","Heming Cui","Dong Huang"],"pdf_url":"https://arxiv.org/pdf/2501.03562v2.pdf","comment":"10 pages, 2 figures, 2 tables"},{"id":"http://arxiv.org/abs/2501.04343v1","updated":"2025-01-08T08:30:44Z","published":"2025-01-08T08:30:44Z","title":"TimelineKGQA: A Comprehensive Question-Answer Pair Generator for\n Temporal Knowledge Graphs","summary":" Question answering over temporal knowledge graphs (TKGs) is crucial for\nunderstanding evolving facts and relationships, yet its development is hindered\nby limited datasets and difficulties in generating custom QA pairs. We propose\na novel categorization framework based on timeline-context relationships, along\nwith \\textbf{TimelineKGQA}, a universal temporal QA generator applicable to any\nTKG. The code is available at: \\url{https://github.com/PascalSun/TimelineKGQA}\nas an open source Python package.\n","authors":["Qiang Sun","Sirui Li","Du Huynh","Mark Reynolds","Wei Liu"],"pdf_url":"https://arxiv.org/pdf/2501.04343v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.01189v3","updated":"2025-01-08T07:59:53Z","published":"2024-06-03T10:51:43Z","title":"MultiMax: Sparse and Multi-Modal Attention Learning","summary":" SoftMax is a ubiquitous ingredient of modern machine learning algorithms. It\nmaps an input vector onto a probability simplex and reweights the input by\nconcentrating the probability mass at large entries. Yet, as a smooth\napproximation to the Argmax function, a significant amount of probability mass\nis distributed to other, residual entries, leading to poor interpretability and\nnoise. 
Although sparsity can be achieved by a family of SoftMax variants, they\noften require an alternative loss function and do not preserve multi-modality.\nWe show that this trade-off between multi-modality and sparsity limits the\nexpressivity of SoftMax as well as its variants. We provide a solution to this\ntension between objectives by proposing a piece-wise differentiable function,\ntermed MultiMax, which adaptively modulates the output distribution according\nto the input entry range. Through comprehensive analysis and evaluation, we show\nthat MultiMax successfully produces a distribution that suppresses irrelevant\nentries while preserving multimodality, with benefits in image classification,\nlanguage modeling and machine translation. The code is available at\nhttps://github.com/ZhouYuxuanYX/MultiMax.\n","authors":["Yuxuan Zhou","Mario Fritz","Margret Keuper"],"pdf_url":"https://arxiv.org/pdf/2406.01189v3.pdf","comment":"Accepted at ICML 2024"},{"id":"http://arxiv.org/abs/2409.14978v2","updated":"2025-01-08T07:53:15Z","published":"2024-09-23T12:57:24Z","title":"TS-HTFA: Advancing Time Series Forecasting via Hierarchical Text-Free\n Alignment with Large Language Models","summary":" Given the significant potential of large language models (LLMs) in sequence\nmodeling, emerging studies have begun applying them to time-series forecasting.\nDespite notable progress, existing methods still face two critical challenges:\n1) their reliance on large amounts of paired text data, limiting the model\napplicability, and 2) a substantial modality gap between text and time series,\nleading to insufficient alignment and suboptimal performance. In this paper, we\nintroduce \\textbf{H}ierarchical \\textbf{T}ext-\\textbf{F}ree \\textbf{A}lignment\n(\\textbf{TS-HTFA}), a novel method that leverages hierarchical alignment to\nfully exploit the representation capacity of LLMs while eliminating the\ndependence on text data. 
Specifically, we replace paired text data with\nadaptive virtual text based on QR decomposition word embeddings and learnable\nprompt. Furthermore, we establish comprehensive cross-modal alignment at three\nlevels: input, feature, and output. Extensive experiments on multiple\ntime-series benchmarks demonstrate that HTFA achieves state-of-the-art\nperformance, significantly improving prediction accuracy and generalization.\n","authors":["Pengfei Wang","Huanran Zheng","Qi'ao Xu","Silong Dai","Yiqiao Wang","Wenjing Yue","Wei Zhu","Tianwen Qian","Xiaoling Wang"],"pdf_url":"https://arxiv.org/pdf/2409.14978v2.pdf","comment":"19 pages, 6 figures"},{"id":"http://arxiv.org/abs/2407.00662v2","updated":"2025-01-08T07:35:31Z","published":"2024-06-30T11:14:29Z","title":"Multi-Agent Training for Pommerman: Curriculum Learning and\n Population-based Self-Play Approach","summary":" Pommerman is a multi-agent environment that has received considerable\nattention from researchers in recent years. This environment is an ideal\nbenchmark for multi-agent training, providing a battleground for two teams with\ncommunication capabilities among allied agents. Pommerman presents significant\nchallenges for model-free reinforcement learning due to delayed action effects,\nsparse rewards, and false positives, where opponent players can lose due to\ntheir own mistakes. This study introduces a system designed to train\nmulti-agent systems to play Pommerman using a combination of curriculum\nlearning and population-based self-play. We also tackle two challenging\nproblems when deploying the multi-agent training system for competitive games:\nsparse reward and suitable matchmaking mechanism. Specifically, we propose an\nadaptive annealing factor based on agents' performance to adjust the dense\nexploration reward during training dynamically. Additionally, we implement a\nmatchmaking mechanism utilizing the Elo rating system to pair agents\neffectively. 
Our experimental results demonstrate that our trained agent can\noutperform top learning agents without requiring communication among allied\nagents.\n","authors":["Nhat-Minh Huynh","Hoang-Giang Cao","I-Chen Wu"],"pdf_url":"https://arxiv.org/pdf/2407.00662v2.pdf","comment":"Accepted at The First Workshop on Game AI Algorithms and Multi-Agent\n Learning - IJCAI 2024"},{"id":"http://arxiv.org/abs/2306.05412v4","updated":"2025-01-08T07:29:55Z","published":"2023-06-08T17:56:46Z","title":"Decoupled Prioritized Resampling for Offline RL","summary":" Offline reinforcement learning (RL) is challenged by the distributional shift\nproblem. To address this problem, existing works mainly focus on designing\nsophisticated policy constraints between the learned policy and the behavior\npolicy. However, these constraints are applied equally to well-performing and\ninferior actions through uniform sampling, which might negatively affect the\nlearned policy. To alleviate this issue, we propose Offline Prioritized\nExperience Replay (OPER), featuring a class of priority functions designed to\nprioritize highly-rewarding transitions, making them more frequently visited\nduring training. Through theoretical analysis, we show that this class of\npriority functions induces an improved behavior policy, and when constrained to\nthis improved policy, a policy-constrained offline RL algorithm is likely to\nyield a better solution. We develop two practical strategies to obtain priority\nweights by estimating advantages based on a fitted value network (OPER-A) or\nutilizing trajectory returns (OPER-R) for quick computation. OPER is a\nplug-and-play component for offline RL algorithms. As case studies, we evaluate\nOPER on five different algorithms, including BC, TD3+BC, Onestep RL, CQL, and\nIQL. Extensive experiments demonstrate that both OPER-A and OPER-R\nsignificantly improve the performance for all baseline methods. 
Code and\npriority weights are available at https://github.com/sail-sg/OPER.\n","authors":["Yang Yue","Bingyi Kang","Xiao Ma","Qisen Yang","Gao Huang","Shiji Song","Shuicheng Yan"],"pdf_url":"https://arxiv.org/pdf/2306.05412v4.pdf","comment":"published in IEEE TNNLS"},{"id":"http://arxiv.org/abs/2411.07464v2","updated":"2025-01-08T07:25:55Z","published":"2024-11-12T00:57:30Z","title":"BudgetMLAgent: A Cost-Effective LLM Multi-Agent system for Automating\n Machine Learning Tasks","summary":" Large Language Models (LLMs) excel in diverse applications including\ngeneration of code snippets, but often struggle with generating code for\ncomplex Machine Learning (ML) tasks. Although existing LLM single-agent based\nsystems give varying performance depending on the task complexity, they purely\nrely on larger, expensive models such as GPT-4. Our investigation reveals\nthat no-cost and low-cost models such as Gemini-Pro, Mixtral and CodeLlama\nperform far worse than GPT-4 in a single-agent setting. With the motivation of\ndeveloping a cost-efficient LLM based solution for solving ML tasks, we propose\nan LLM Multi-Agent based system which leverages a combination of experts using\nprofiling, efficient retrieval of past observations, LLM cascades, and\nask-the-expert calls. Through empirical analysis on ML engineering tasks in the\nMLAgentBench benchmark, we demonstrate the effectiveness of our system, using\nno-cost models, namely Gemini as the base LLM, paired with GPT-4 in cascade and\nexpert to serve occasional ask-the-expert calls for planning. 
With 94.2\\%\nreduction in the cost (from \\$0.931 per run cost averaged over all tasks for\nGPT-4 single agent system to \\$0.054), our system is able to yield better\naverage success rate of 32.95\\% as compared to GPT-4 single-agent system\nyielding 22.72\\% success rate averaged over all the tasks of MLAgentBench.\n","authors":["Shubham Gandhi","Manasi Patwardhan","Lovekesh Vig","Gautam Shroff"],"pdf_url":"https://arxiv.org/pdf/2411.07464v2.pdf","comment":"Presented at AIMLSystems '24"},{"id":"http://arxiv.org/abs/2408.14418v3","updated":"2025-01-08T07:23:56Z","published":"2024-08-26T17:04:00Z","title":"MEDSAGE: Enhancing Robustness of Medical Dialogue Summarization to ASR\n Errors with LLM-generated Synthetic Dialogues","summary":" Automatic Speech Recognition (ASR) systems are pivotal in transcribing speech\ninto text, yet the errors they introduce can significantly degrade the\nperformance of downstream tasks like summarization. This issue is particularly\npronounced in clinical dialogue summarization, a low-resource domain where\nsupervised data for fine-tuning is scarce, necessitating the use of ASR models\nas black-box solutions. Employing conventional data augmentation for enhancing\nthe noise robustness of summarization models is not feasible either due to the\nunavailability of sufficient medical dialogue audio recordings and\ncorresponding ASR transcripts. To address this challenge, we propose MEDSAGE,\nan approach for generating synthetic samples for data augmentation using Large\nLanguage Models (LLMs). Specifically, we leverage the in-context learning\ncapabilities of LLMs and instruct them to generate ASR-like errors based on a\nfew available medical dialogue examples with audio recordings. Experimental\nresults show that LLMs can effectively model ASR noise, and incorporating this\nnoisy data into the training process significantly improves the robustness and\naccuracy of medical dialogue summarization systems. 
This approach addresses the\nchallenges of noisy ASR outputs in critical applications, offering a robust\nsolution to enhance the reliability of clinical dialogue summarization.\n","authors":["Kuluhan Binici","Abhinav Ramesh Kashyap","Viktor Schlegel","Andy T. Liu","Vijay Prakash Dwivedi","Thanh-Tung Nguyen","Xiaoxue Gao","Nancy F. Chen","Stefan Winkler"],"pdf_url":"https://arxiv.org/pdf/2408.14418v3.pdf","comment":"Accepted by the Thirty-Ninth AAAI Conference on Artificial\n Intelligence (AAAI-25)"},{"id":"http://arxiv.org/abs/2407.16040v2","updated":"2025-01-08T07:21:15Z","published":"2024-07-22T20:34:00Z","title":"Generalizing Teacher Networks for Effective Knowledge Distillation\n Across Student Architectures","summary":" Knowledge distillation (KD) is a model compression method that entails\ntraining a compact student model to emulate the performance of a more complex\nteacher model. However, the architectural capacity gap between the two models\nlimits the effectiveness of knowledge transfer. Addressing this issue, previous\nworks focused on customizing teacher-student pairs to improve compatibility, a\ncomputationally expensive process that needs to be repeated every time either\nmodel changes. Hence, these methods are impractical when a teacher model has to\nbe compressed into different student models for deployment on multiple hardware\ndevices with distinct resource constraints. In this work, we propose Generic\nTeacher Network (GTN), a one-off KD-aware training to create a generic teacher\ncapable of effectively transferring knowledge to any student model sampled from\na given finite pool of architectures. To this end, we represent the student\npool as a weight-sharing supernet and condition our generic teacher to align\nwith the capacities of various student architectures sampled from this\nsupernet. 
Experimental evaluation shows that our method both improves overall\nKD effectiveness and amortizes the minimal additional training cost of the\ngeneric teacher across students in the pool.\n","authors":["Kuluhan Binici","Weiming Wu","Tulika Mitra"],"pdf_url":"https://arxiv.org/pdf/2407.16040v2.pdf","comment":"British Machine Vision Conference (BMVC 24)"},{"id":"http://arxiv.org/abs/2409.12444v3","updated":"2025-01-08T07:19:14Z","published":"2024-09-19T03:52:50Z","title":"A Lightweight and Real-Time Binaural Speech Enhancement Model with\n Spatial Cues Preservation","summary":" Binaural speech enhancement (BSE) aims to jointly improve the speech quality\nand intelligibility of noisy signals received by hearing devices and preserve\nthe spatial cues of the target for natural listening. Existing methods often\nsuffer from the compromise between noise reduction (NR) capacity and spatial\ncues preservation (SCP) accuracy and a high computational demand in complex\nacoustic scenes. In this work, we present a learning-based lightweight binaural\ncomplex convolutional network (LBCCN), which excels in NR by filtering\nlow-frequency bands and keeping the rest. Additionally, our approach explicitly\nincorporates the estimation of interchannel relative acoustic transfer function\nto ensure the spatial cues fidelity and speech clarity. Results show that the\nproposed LBCCN can achieve a comparable NR performance to state-of-the-art\nmethods under fixed-speaker conditions, but with a much lower computational\ncost and a certain degree of SCP capability. 
The reproducible code and audio\nexamples are available at https://github.com/jywanng/LBCCN.\n","authors":["Jingyuan Wang","Jie Zhang","Shihao Chen","Miao Sun"],"pdf_url":"https://arxiv.org/pdf/2409.12444v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04315v1","updated":"2025-01-08T07:13:52Z","published":"2025-01-08T07:13:52Z","title":"RoRA: Efficient Fine-Tuning of LLM with Reliability Optimization for\n Rank Adaptation","summary":" Fine-tuning helps large language models (LLMs) recover degraded information\nand enhance task performance. Although Low-Rank Adaptation (LoRA) is widely used\nand effective for fine-tuning, we have observed that its scaling factor can\nlimit or even reduce performance as the rank size increases. To address this\nissue, we propose RoRA (Rank-adaptive Reliability Optimization), a simple yet\neffective method for optimizing LoRA's scaling factor. By replacing $\\alpha/r$\nwith $\\alpha/\\sqrt{r}$, RoRA ensures improved performance as rank size\nincreases. Moreover, RoRA enhances low-rank adaptation in fine-tuning\nuncompressed models and excels in the more challenging task of accuracy\nrecovery when fine-tuning pruned models. Extensive experiments demonstrate the\neffectiveness of RoRA in fine-tuning both uncompressed and pruned models. RoRA\nsurpasses the state-of-the-art (SOTA) in average accuracy and robustness on\nLLaMA-7B/13B, LLaMA2-7B, and LLaMA3-8B, specifically outperforming LoRA and\nDoRA by 6.5% and 2.9% on LLaMA-7B, respectively. 
In pruned model fine-tuning,\nRoRA shows significant advantages; for SHEARED-LLAMA-1.3, a LLaMA-7B with 81.4%\npruning, RoRA achieves 5.7% higher average accuracy than LoRA and 3.9% higher\nthan DoRA.\n","authors":["Jun Liu","Zhenglun Kong","Peiyan Dong","Xuan Shen","Pu Zhao","Hao Tang","Geng Yuan","Wei Niu","Wenbin Zhang","Xue Lin","Dong Huang","Yanzhi Wang"],"pdf_url":"https://arxiv.org/pdf/2501.04315v1.pdf","comment":"ICASSP 2025"},{"id":"http://arxiv.org/abs/2412.13720v2","updated":"2025-01-08T07:03:42Z","published":"2024-12-18T11:00:58Z","title":"Federated Learning and RAG Integration: A Scalable Approach for Medical\n Large Language Models","summary":" This study analyzes the performance of domain-specific Large Language Models\n(LLMs) for the medical field by integrating Retrieval-Augmented Generation\n(RAG) systems within a federated learning framework. Leveraging the inherent\nadvantages of federated learning, such as preserving data privacy and enabling\ndistributed computation, this research explores the integration of RAG systems\nwith models trained under varying client configurations to optimize\nperformance. Experimental results demonstrate that the federated learning-based\nmodels integrated with RAG systems consistently outperform their non-integrated\ncounterparts across all evaluation metrics. 
This study highlights the potential\nof combining federated learning and RAG systems for developing domain-specific\nLLMs in the medical field, providing a scalable and privacy-preserving solution\nfor enhancing text generation capabilities.\n","authors":["Jincheol Jung","Hongju Jeong","Eui-Nam Huh"],"pdf_url":"https://arxiv.org/pdf/2412.13720v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.19681v3","updated":"2025-01-08T06:56:19Z","published":"2024-07-29T03:53:14Z","title":"Motion Manifold Flow Primitives for Task-Conditioned Trajectory\n Generation under Complex Task-Motion Dependencies","summary":" Effective movement primitives should be capable of encoding and generating a\nrich repertoire of trajectories -- typically collected from human\ndemonstrations -- conditioned on task-defining parameters such as vision or\nlanguage inputs. While recent methods based on the motion manifold hypothesis,\nwhich assumes that a set of trajectories lies on a lower-dimensional nonlinear\nsubspace, address challenges such as limited dataset size and the high\ndimensionality of trajectory data, they often struggle to capture complex\ntask-motion dependencies, i.e., when motion distributions shift drastically\nwith task variations. To address this, we introduce Motion Manifold Flow\nPrimitives (MMFP), a framework that decouples the training of the motion\nmanifold from task-conditioned distributions. Specifically, we employ flow\nmatching models, state-of-the-art conditional deep generative models, to learn\ntask-conditioned distributions in the latent coordinate space of the learned\nmotion manifold. Experiments are conducted on language-guided trajectory\ngeneration tasks, where many-to-many text-motion correspondences introduce\ncomplex task-motion dependencies, highlighting MMFP's superiority over existing\nmethods.\n","authors":["Yonghyeon Lee","Byeongho Lee","Seungyeon Kim","Frank C. 
Park"],"pdf_url":"https://arxiv.org/pdf/2407.19681v3.pdf","comment":"8 pages, 11 figures"},{"id":"http://arxiv.org/abs/2404.01714v4","updated":"2025-01-08T06:52:07Z","published":"2024-04-02T07:57:17Z","title":"Conjugate-Gradient-like Based Adaptive Moment Estimation Optimization\n Algorithm for Deep Learning","summary":" Training deep neural networks is a challenging task. In order to speed up\ntraining and enhance the performance of deep neural networks, we rectify the\nvanilla conjugate gradient as conjugate-gradient-like and incorporate it into\nthe generic Adam, and thus propose a new optimization algorithm named\nCG-like-Adam for deep learning. Specifically, both the first-order and the\nsecond-order moment estimation of generic Adam are replaced by the\nconjugate-gradient-like. Convergence analysis handles the cases where the\nexponential moving average coefficient of the first-order moment estimation is\nconstant and the first-order moment estimation is unbiased. Numerical\nexperiments show the superiority of the proposed algorithm based on the\nCIFAR10/100 dataset.\n","authors":["Jiawu Tian","Liwei Xu","Xiaowei Zhang","Yongqi Li"],"pdf_url":"https://arxiv.org/pdf/2404.01714v4.pdf","comment":"32 pages, 13 figures"},{"id":"http://arxiv.org/abs/2412.14500v2","updated":"2025-01-08T06:52:05Z","published":"2024-12-19T03:48:23Z","title":"The Digital Ecosystem of Beliefs: does evolution favour AI over humans?","summary":" As AI systems are integrated into social networks, there are AI safety\nconcerns that AI-generated content may dominate the web, e.g. in popularity or\nimpact on beliefs. To understand such questions, this paper proposes the\nDigital Ecosystem of Beliefs (Digico), the first evolutionary framework for\ncontrolled experimentation with multi-population interactions in simulated\nsocial networks. 
The framework models a population of agents which change their\nmessaging strategies due to evolutionary updates following a Universal\nDarwinism approach, interact via messages, influence each other's beliefs\nthrough dynamics based on a contagion model, and maintain their beliefs through\ncognitive Lamarckian inheritance. Initial experiments with an abstract\nimplementation of Digico show that: a) when AIs have faster messaging,\nevolution, and more influence in the recommendation algorithm, they get 80% to\n95% of the views, depending on the size of the influence benefit; b) AIs\ndesigned for propaganda can typically convince 50% of humans to adopt extreme\nbeliefs, and up to 85% when agents believe only a limited number of channels;\nc) a penalty for content that violates agents' beliefs reduces propaganda\neffectiveness by up to 8%. We further discuss implications for control (e.g.\nlegislation) and Digico as a means of studying evolutionary principles.\n","authors":["David M. Bossens","Shanshan Feng","Yew-Soon Ong"],"pdf_url":"https://arxiv.org/pdf/2412.14500v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.05271v2","updated":"2025-01-08T06:30:39Z","published":"2023-09-11T07:05:02Z","title":"AutoFuse: Automatic Fusion Networks for Deformable Medical Image\n Registration","summary":" Deformable image registration aims to find a dense non-linear spatial\ncorrespondence between a pair of images, which is a crucial step for many\nmedical tasks such as tumor growth monitoring and population analysis.\nRecently, Deep Neural Networks (DNNs) have been widely recognized for their\nability to perform fast end-to-end registration. However, DNN-based\nregistration needs to explore the spatial information of each image and fuse\nthis information to characterize spatial correspondence. This raises an\nessential question: what is the optimal fusion strategy to characterize spatial\ncorrespondence? 
Existing fusion strategies (e.g., early fusion, late fusion)\nwere empirically designed to fuse information by manually defined prior\nknowledge, which inevitably constrains the registration performance within the\nlimits of empirical designs. In this study, we depart from existing\nempirically-designed fusion strategies and develop a data-driven fusion\nstrategy for deformable image registration. To achieve this, we propose an\nAutomatic Fusion network (AutoFuse) that provides flexibility to fuse\ninformation at many potential locations within the network. A Fusion Gate (FG)\nmodule is also proposed to control how to fuse information at each potential\nnetwork location based on training data. Our AutoFuse can automatically\noptimize its fusion strategy during training and generalizes to both\nunsupervised registration (without any labels) and semi-supervised registration\n(with weak labels provided for partial training data). Extensive experiments on\ntwo well-benchmarked medical registration tasks (inter- and intra-patient\nregistration) with eight public datasets show that our AutoFuse outperforms\nstate-of-the-art unsupervised and semi-supervised registration methods.\n","authors":["Mingyuan Meng","Michael Fulham","Dagan Feng","Lei Bi","Jinman Kim"],"pdf_url":"https://arxiv.org/pdf/2309.05271v2.pdf","comment":"Published at Pattern Recognition"},{"id":"http://arxiv.org/abs/2501.04302v1","updated":"2025-01-08T06:26:16Z","published":"2025-01-08T06:26:16Z","title":"H-MBA: Hierarchical MamBa Adaptation for Multi-Modal Video Understanding\n in Autonomous Driving","summary":" With the prevalence of Multimodal Large Language Models (MLLMs), autonomous\ndriving has encountered new opportunities and challenges. In particular,\nmulti-modal video understanding is critical to interactively analyze what will\nhappen during autonomous driving. 
However, videos in such dynamic scenes often\ncontain complex spatial-temporal movements, which restrict the generalization\ncapacity of existing MLLMs in this field. To\nbridge the gap, we propose a novel Hierarchical Mamba Adaptation (H-MBA)\nframework to fit the complicated motion changes in autonomous driving videos.\nSpecifically, our H-MBA consists of two distinct modules, including Context\nMamba (C-Mamba) and Query Mamba (Q-Mamba). First, C-Mamba contains various\ntypes of structure state space models, which can effectively capture\nmulti-granularity video context for different temporal resolutions. Second,\nQ-Mamba flexibly transforms the current frame into a learnable query, and\nattentively selects multi-granularity video context into the query. Consequently,\nit can adaptively integrate all the video contexts of multi-scale temporal\nresolutions to enhance video understanding. Via a plug-and-play paradigm in\nMLLMs, our H-MBA shows remarkable performance on multi-modal video tasks in\nautonomous driving, e.g., for risk object detection, it outperforms the\nprevious SOTA method with a 5.5% mIoU improvement.\n","authors":["Siran Chen","Yuxiao Luo","Yue Ma","Yu Qiao","Yali Wang"],"pdf_url":"https://arxiv.org/pdf/2501.04302v1.pdf","comment":"7 pages, 4 figures"},{"id":"http://arxiv.org/abs/2501.04299v1","updated":"2025-01-08T06:07:33Z","published":"2025-01-08T06:07:33Z","title":"Circuit Complexity Bounds for Visual Autoregressive Model","summary":" Understanding the expressive ability of a specific model is essential for\ngrasping its capacity limitations. Recently, several studies have established\ncircuit complexity bounds for the Transformer architecture. Besides, the Visual\nAutoRegressive (VAR) model has risen to be a prominent method in the field of\nimage generation, outperforming previous techniques, such as Diffusion\nTransformers, in generating high-quality images. 
We investigate the circuit\ncomplexity of the VAR model and establish a bound in this study. Our primary\nresult demonstrates that the VAR model is equivalent to a simulation by a\nuniform $\\mathsf{TC}^0$ threshold circuit with hidden dimension $d \\leq O(n)$\nand $\\mathrm{poly}(n)$ precision. This is the first study to rigorously\nhighlight the limitations in the expressive power of VAR models despite their\nimpressive performance. We believe our findings will offer valuable insights\ninto the inherent constraints of these models and guide the development of more\nefficient and expressive architectures in the future.\n","authors":["Yekun Ke","Xiaoyu Li","Yingyu Liang","Zhenmei Shi","Zhao Song"],"pdf_url":"https://arxiv.org/pdf/2501.04299v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.12386v2","updated":"2025-01-08T05:57:28Z","published":"2024-09-19T01:02:31Z","title":"Channel-Aware Domain-Adaptive Generative Adversarial Network for Robust\n Speech Recognition","summary":" While pre-trained automatic speech recognition (ASR) systems demonstrate\nimpressive performance on matched domains, their performance often degrades\nwhen confronted with channel mismatch stemming from unseen recording\nenvironments and conditions. To mitigate this issue, we propose a novel\nchannel-aware data simulation method for robust ASR training. Our method\nharnesses the synergistic power of channel-extractive techniques and generative\nadversarial networks (GANs). We first train a channel encoder capable of\nextracting embeddings from arbitrary audio. On top of this, channel embeddings\nare extracted using a minimal amount of target-domain data and used to guide a\nGAN-based speech synthesizer. This synthesizer generates speech that faithfully\npreserves the phonetic content of the input while mimicking the channel\ncharacteristics of the target domain. 
We evaluate our method on the challenging\nHakka Across Taiwan (HAT) and Taiwanese Across Taiwan (TAT) corpora, achieving\nrelative character error rate (CER) reductions of 20.02% and 9.64%,\nrespectively, compared to the baselines. These results highlight the efficacy\nof our channel-aware data simulation method for bridging the gap between\nsource- and target-domain acoustics.\n","authors":["Chien-Chun Wang","Li-Wei Chen","Cheng-Kang Chou","Hung-Shin Lee","Berlin Chen","Hsin-Min Wang"],"pdf_url":"https://arxiv.org/pdf/2409.12386v2.pdf","comment":"Accepted to ICASSP 2025"},{"id":"http://arxiv.org/abs/2501.04292v1","updated":"2025-01-08T05:32:55Z","published":"2025-01-08T05:32:55Z","title":"MAD-UV: The 1st INTERSPEECH Mice Autism Detection via Ultrasound\n Vocalization Challenge","summary":" The Mice Autism Detection via Ultrasound Vocalization (MAD-UV) Challenge\nintroduces the first INTERSPEECH challenge focused on detecting autism spectrum\ndisorder (ASD) in mice through their vocalizations. Participants are tasked\nwith developing models to automatically classify mice as either wild-type or\nASD models based on recordings with a high sampling rate. Our baseline system\nemploys a simple CNN-based classification using three different spectrogram\nfeatures. Results demonstrate the feasibility of automated ASD detection, with\nthe considered audible-range features achieving the best performance (UAR of\n0.600 for segment-level and 0.625 for subject-level classification). This\nchallenge bridges speech technology and biomedical research, offering\nopportunities to advance our understanding of ASD models through machine\nlearning approaches. The findings suggest promising directions for vocalization\nanalysis and highlight the potential value of audible and ultrasound\nvocalizations in ASD detection.\n","authors":["Zijiang Yang","Meishu Song","Xin Jing","Haojie Zhang","Kun Qian","Bin Hu","Kota Tamada","Toru Takumi","Björn W. 
Schuller","Yoshiharu Yamamoto"],"pdf_url":"https://arxiv.org/pdf/2501.04292v1.pdf","comment":"5 pages, 1 figure and 2 tables. For MAD-UV Challenge 2025"},{"id":"http://arxiv.org/abs/2412.04604v2","updated":"2025-01-08T05:24:50Z","published":"2024-12-05T20:40:28Z","title":"ARC Prize 2024: Technical Report","summary":" As of December 2024, the ARC-AGI benchmark is five years old and remains\nunbeaten. We believe it is currently the most important unsolved AI benchmark\nin the world because it seeks to measure generalization on novel tasks -- the\nessence of intelligence -- as opposed to skill at tasks that can be prepared\nfor in advance. This year, we launched ARC Prize, a global competition to\ninspire new ideas and drive open progress towards AGI by reaching a target\nbenchmark score of 85\\%. As a result, the state-of-the-art score on the ARC-AGI\nprivate evaluation set increased from 33\\% to 55.5\\%, propelled by several\nfrontier AGI reasoning techniques including deep learning-guided program\nsynthesis and test-time training. In this paper, we survey top approaches,\nreview new open-source implementations, discuss the limitations of the\nARC-AGI-1 dataset, and share key insights gained from the competition.\n","authors":["Francois Chollet","Mike Knoop","Gregory Kamradt","Bryan Landers"],"pdf_url":"https://arxiv.org/pdf/2412.04604v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04286v1","updated":"2025-01-08T05:24:11Z","published":"2025-01-08T05:24:11Z","title":"Mapping the Edge of Chaos: Fractal-Like Boundaries in The Trainability\n of Decoder-Only Transformer Models","summary":" In the realm of fractal geometry, intricate structures emerge from simple\niterative processes that partition parameter spaces into regions of stability\nand instability. 
Likewise, training large language models involves iteratively\napplying update functions, such as Adam, where even slight hyperparameter\nadjustments can shift the training process from convergence to divergence.\nRecent evidence from miniature neural networks suggests that the boundary\nseparating these outcomes displays fractal characteristics [1]. Building on\nthese insights, this study extends them to medium-sized, decoder-only\ntransformer architectures by employing a more consistent convergence measure\nand examining the learning rate hyperparameter landscape for attention and\nfully connected layers. The results show that the trainability frontier is not\na simple threshold; rather, it forms a self-similar yet seemingly random\nstructure at multiple scales, with statistically consistent and repeating\npatterns. Within this landscape, a region of stable convergence is surrounded\nby a complex chaotic border, illustrating the sensitive nature of the\nunderlying training dynamics.\n","authors":["Bahman Torkamandi"],"pdf_url":"https://arxiv.org/pdf/2501.04286v1.pdf","comment":"15 pages"},{"id":"http://arxiv.org/abs/2501.04283v1","updated":"2025-01-08T05:14:36Z","published":"2025-01-08T05:14:36Z","title":"Enhancing Scene Classification in Cloudy Image Scenarios: A\n Collaborative Transfer Method with Information Regulation Mechanism using\n Optical Cloud-Covered and SAR Remote Sensing Images","summary":" In remote sensing scene classification, leveraging the transfer methods with\nwell-trained optical models is an efficient way to overcome label scarcity.\nHowever, cloud contamination leads to optical information loss and significant\nimpacts on feature distribution, challenging the reliability and stability of\ntransferred target models. Common solutions include cloud removal for optical\ndata or directly using Synthetic aperture radar (SAR) data in the target\ndomain. 
However, cloud removal requires substantial auxiliary data for support\nand pre-training, while directly using SAR disregards the unobstructed portions\nof optical data. This study presents a scene classification transfer method\nthat synergistically combines multi-modality data, which aims to transfer the\nsource domain model trained on cloud-free optical data to the target domain that\nincludes both cloudy optical and SAR data at low cost. Specifically, the\nframework incorporates two parts: (1) the collaborative transfer strategy,\nbased on knowledge distillation, enables efficient prior knowledge transfer\nacross heterogeneous data; (2) the information regulation mechanism (IRM) is\nproposed to address the modality imbalance issue during transfer. It employs\nauxiliary models to measure the contribution discrepancy of each modality, and\nautomatically balances the information utilization of modalities during the\ntarget model learning process at the sample level. The transfer experiments\nwere conducted on simulated and real cloud datasets, demonstrating the superior\nperformance of the proposed method compared to other solutions in cloud-covered\nscenarios. We also verified the importance and limitations of IRM, and further\ndiscussed and visualized the modality imbalance problem during the model\ntransfer. Codes are available at https://github.com/wangyuze-csu/ESCCS\n","authors":["Yuze Wang","Rong Xiao","Haifeng Li","Mariana Belgiu","Chao Tao"],"pdf_url":"https://arxiv.org/pdf/2501.04283v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03295v2","updated":"2025-01-08T04:50:01Z","published":"2025-01-06T11:43:29Z","title":"A Soft Sensor Method with Uncertainty-Awareness and Self-Explanation\n Based on Large Language Models Enhanced by Domain Knowledge Retrieval","summary":" Data-driven soft sensors are crucial in predicting key performance indicators\nin industrial systems. 
However, current methods predominantly rely on the\nsupervised learning paradigm of parameter updating, which inherently faces\nchallenges such as high development costs, poor robustness, training\ninstability, and lack of interpretability. Recently, large language models\n(LLMs) have demonstrated significant potential across various domains, notably\nthrough In-Context Learning (ICL), which enables high-performance task\nexecution with minimal input-label demonstrations and no prior training. This\npaper aims to replace supervised learning with the emerging ICL paradigm for\nsoft sensor modeling to address existing challenges and explore new avenues for\nadvancement. To achieve this, we propose a novel framework called the Few-shot\nUncertainty-aware and self-Explaining Soft Sensor (LLM-FUESS), which includes\nthe Zero-shot Auxiliary Variable Selector (LLM-ZAVS) and the Uncertainty-aware\nFew-shot Soft Sensor (LLM-UFSS). The LLM-ZAVS retrieves from the Industrial\nKnowledge Vector Storage to enhance LLMs' domain-specific knowledge, enabling\nzero-shot auxiliary variable selection. In the LLM-UFSS, we utilize text-based\ncontext demonstrations of structured data to prompt LLMs to execute ICL for\nprediction, and propose a context sample retrieval augmentation strategy to\nimprove performance. Additionally, we explored LLMs' AIGC and probabilistic\ncharacteristics to propose self-explanation and uncertainty quantification\nmethods for constructing a trustworthy soft sensor. Extensive experiments\ndemonstrate that our method achieves state-of-the-art predictive performance,\nstrong robustness, and flexibility, and effectively mitigates the training\ninstability found in traditional methods. 
To the best of our knowledge, this is the first\nwork to establish a soft sensor utilizing LLMs.\n","authors":["Shuo Tong","Han Liu","Runyuan Guo","Wenqing Wang","Xueqiong Tian","Lingyun Wei","Lin Zhang","Huayong Wu","Ding Liu","Youmin Zhang"],"pdf_url":"https://arxiv.org/pdf/2501.03295v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.09420v3","updated":"2025-01-08T04:31:16Z","published":"2024-11-14T13:15:27Z","title":"SAG-ViT: A Scale-Aware, High-Fidelity Patching Approach with Graph\n Attention for Vision Transformers","summary":" Vision Transformers (ViTs) have redefined image classification by leveraging\nself-attention to capture complex patterns and long-range dependencies between\nimage patches. However, a key challenge for ViTs is efficiently incorporating\nmulti-scale feature representations, which is inherent in convolutional neural\nnetworks (CNNs) through their hierarchical structure. Graph transformers have\nmade strides in addressing this by leveraging graph-based modeling, but they\noften lose or insufficiently represent spatial hierarchies, especially since\nredundant or less relevant areas dilute the image's contextual representation.\nTo bridge this gap, we propose SAG-ViT, a Scale-Aware Graph Attention ViT that\nintegrates the multi-scale feature capabilities of CNNs, the representational\npower of ViTs, and graph-attended patching to enable richer contextual\nrepresentation. Using\nEfficientNetV2 as a backbone, the model extracts multi-scale feature maps,\ndividing them into patches to preserve richer semantic information compared to\ndirectly patching the input images. The patches are structured into a graph\nusing spatial and feature similarities, where a Graph Attention Network (GAT)\nrefines the node embeddings. This refined graph representation is then\nprocessed by a Transformer encoder, capturing long-range dependencies and\ncomplex interactions. 
We evaluate SAG-ViT on benchmark datasets across various\ndomains, validating its effectiveness in advancing image classification tasks.\nOur code and weights are available at https://github.com/shravan-18/SAG-ViT.\n","authors":["Shravan Venkatraman","Jaskaran Singh Walia","Joe Dhanith P R"],"pdf_url":"https://arxiv.org/pdf/2411.09420v3.pdf","comment":"14 pages, 8 figures, 9 tables"},{"id":"http://arxiv.org/abs/2501.04266v1","updated":"2025-01-08T04:19:57Z","published":"2025-01-08T04:19:57Z","title":"Scaling Large Language Model Training on Frontier with Low-Bandwidth\n Partitioning","summary":" Scaling up Large Language Model (LLM) training involves fitting a tremendous\nnumber of training parameters across a limited number of workers. However,\nmethods like ZeRO-3 that drastically reduce GPU memory pressure often incur\nheavy communication to ensure global synchronization and consistency.\nEstablished efforts such as ZeRO++ use secondary partitions to avoid inter-node\ncommunications, given that intra-node GPU-GPU transfer generally has more\nbandwidth and lower latency than inter-node connections. However, as more\ncapable infrastructure like Frontier, equipped with AMD GPUs, emerged with\nimpressive computing capability, there is a need to investigate the\nhardware topology and to develop targeted strategies to improve training\nefficiency. In this work, we propose a collection of communication and\noptimization strategies for ZeRO++ to reduce communication costs and improve\nmemory utilization. Specifically, we propose a 3-level hierarchical\npartitioning for the current Top-1 supercomputing cluster,\nFrontier, which aims at leveraging various bandwidths across layers of\ncommunications (GCD-GCD, GPU-GPU, and inter-node) to reduce communication\noverhead. For a 20B GPT model, we observe a 1.71x increase in TFLOPS per GPU\nwhen compared with ZeRO++ up to 384 GCDs and a scaling efficiency of 0.94 for\nup to 384 GCDs. 
To the best of our knowledge, our work is also the first effort\nto efficiently optimize LLM workloads on Frontier AMD GPUs.\n","authors":["Lang Xu","Quentin Anthony","Jacob Hatef","Aamir Shafi","Hari Subramoni","Dhabaleswar K. Panda"],"pdf_url":"https://arxiv.org/pdf/2501.04266v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04263v1","updated":"2025-01-08T04:14:09Z","published":"2025-01-08T04:14:09Z","title":"KN-LIO: Geometric Kinematics and Neural Field Coupled LiDAR-Inertial\n Odometry","summary":" Recent advancements in LiDAR-Inertial Odometry (LIO) have enabled a large\nnumber of applications. However, traditional LIO systems tend to focus more on\nlocalization rather than mapping, with maps consisting mostly of sparse\ngeometric elements, which is not ideal for downstream tasks. Recent emerging\nneural field technology has great potential in dense mapping, but pure LiDAR\nmapping struggles on highly dynamic vehicles. To mitigate this\nchallenge, we present a new solution that tightly couples geometric kinematics\nwith neural fields to enhance simultaneous state estimation and dense mapping\ncapabilities. We propose both semi-coupled and tightly coupled Kinematic-Neural\nLIO (KN-LIO) systems that leverage online SDF decoding and iterated error-state\nKalman filtering to fuse laser and inertial data. Our KN-LIO minimizes\ninformation loss and improves accuracy in state estimation, while also\naccommodating asynchronous multi-LiDAR inputs. Evaluations on diverse\nhigh-dynamic datasets demonstrate that our KN-LIO achieves performance on par\nwith or superior to existing state-of-the-art solutions in pose estimation and\noffers improved dense mapping accuracy over pure LiDAR-based methods. 
The\nrelevant code and datasets will be made available at https://**.\n","authors":["Zhong Wang","Lele Ren","Yue Wen","Hesheng Wang"],"pdf_url":"https://arxiv.org/pdf/2501.04263v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.00547v2","updated":"2025-01-08T04:14:07Z","published":"2024-11-30T17:40:49Z","title":"Motion Dreamer: Realizing Physically Coherent Video Generation through\n Scene-Aware Motion Reasoning","summary":" Numerous recent video generation models, also known as world models, have\ndemonstrated the ability to generate plausible real-world videos. However, many\nstudies have shown that these models often produce motion results lacking\nlogical or physical coherence. In this paper, we revisit video generation\nmodels and find that single-stage approaches struggle to produce high-quality\nresults while maintaining coherent motion reasoning. To address this issue, we\npropose \\textbf{Motion Dreamer}, a two-stage video generation framework. In\nStage I, the model generates an intermediate motion representation, such as a\nsegmentation map or depth map, based on the input image and motion conditions,\nfocusing solely on the motion itself. In Stage II, the model uses this\nintermediate motion representation as a condition to generate a high-detail\nvideo. By decoupling motion reasoning from high-fidelity video synthesis, our\napproach allows for more accurate and physically plausible motion generation.\nWe validate the effectiveness of our approach on the Physion dataset and in\nautonomous driving scenarios. For example, given a single push, our model can\nsynthesize the sequential toppling of a set of dominoes. Similarly, by varying\nthe movements of ego-cars, our model can produce different effects on other\nvehicles. Our work opens new avenues in creating models that can reason about\nphysical interactions in a more coherent and realistic manner. 
Our webpage is\navailable: https://envision-research.github.io/MotionDreamer/.\n","authors":["Tianshuo Xu","Zhifei Chen","Leyi Wu","Hao Lu","Yuying Chen","Lihui Jiang","Bingbing Liu","Yingcong Chen"],"pdf_url":"https://arxiv.org/pdf/2412.00547v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03271v2","updated":"2025-01-08T03:51:59Z","published":"2025-01-05T00:08:52Z","title":"DPO Kernels: A Semantically-Aware, Kernel-Enhanced, and Divergence-Rich\n Paradigm for Direct Preference Optimization","summary":" The rapid rise of large language models (LLMs) has unlocked many applications\nbut also underscores the challenge of aligning them with diverse values and\npreferences. Direct Preference Optimization (DPO) is central to alignment but\nconstrained by fixed divergences and limited feature transformations. We\npropose DPO-Kernels, which integrates kernel methods to address these issues\nthrough four key contributions: (i) Kernelized Representations with polynomial,\nRBF, Mahalanobis, and spectral kernels for richer transformations, plus a\nhybrid loss combining embedding-based and probability-based objectives; (ii)\nDivergence Alternatives (Jensen-Shannon, Hellinger, Renyi, Bhattacharyya,\nWasserstein, and f-divergences) for greater stability; (iii) Data-Driven\nSelection metrics that automatically choose the best kernel-divergence pair;\nand (iv) a Hierarchical Mixture of Kernels for both local precision and global\nmodeling. Evaluations on 12 datasets demonstrate state-of-the-art performance\nin factuality, safety, reasoning, and instruction following. 
Grounded in\nHeavy-Tailed Self-Regularization, DPO-Kernels maintains robust generalization\nfor LLMs, offering a comprehensive resource for further alignment research.\n","authors":["Amitava Das","Suranjana Trivedy","Danush Khanna","Rajarshi Roy","Gurpreet Singh","Basab Ghosh","Yaswanth Narsupalli","Vinija Jain","Vasu Sharma","Aishwarya Naresh Reganti","Aman Chadha"],"pdf_url":"https://arxiv.org/pdf/2501.03271v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04253v1","updated":"2025-01-08T03:35:28Z","published":"2025-01-08T03:35:28Z","title":"Integrated Offline and Online Learning to Solve a Large Class of\n Scheduling Problems","summary":" In this paper, we develop a unified machine learning (ML) approach to predict\nhigh-quality solutions for single-machine scheduling problems with a\nnon-decreasing min-sum objective function with or without release times. Our ML\napproach is novel in three major aspects. First, our approach is developed for\nthe entire class of the aforementioned problems. To achieve this, we exploit\nthe fact that the entire class of the problems considered can be formulated as\na time-indexed formulation in a unified manner. We develop a deep neural\nnetwork (DNN) which uses the cost parameters in the time-indexed formulation as\nthe inputs to effectively predict a continuous solution to this formulation,\nbased on which a feasible discrete solution is easily constructed. The second\nnovel aspect of our approach lies in how the DNN model is trained. In view of\nthe NP-hard nature of the problems, labels (i.e., optimal solutions) are hard\nto generate for training. To overcome this difficulty, we generate and utilize\na set of special instances, for which optimal solutions can be found with\nlittle computational effort, to train the ML model offline. 
The third novel\nidea we employ in our approach is that we develop an online single-instance\nlearning approach to fine tune the parameters in the DNN for a given online\ninstance, with the goal of generating an improved solution for the given\ninstance. To this end, we develop a feasibility surrogate that approximates the\nobjective value of a given instance as a continuous function of the outputs of\nthe DNN, which then enables us to derive gradients and update the learnable\nparameters in the DNN. Numerical results show that our approach can efficiently\ngenerate high-quality solutions for a variety of single-machine scheduling\nmin-sum problems with up to 1000 jobs.\n","authors":["Anbang Liu","Zhi-Long Chen","Jinyang Jiang","Xi Chen"],"pdf_url":"https://arxiv.org/pdf/2501.04253v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.16950v4","updated":"2025-01-08T03:14:04Z","published":"2024-03-25T17:11:28Z","title":"Aligning with Human Judgement: The Role of Pairwise Large Language Model\n Evaluators in Preference Aggregation","summary":" Large Language Models (LLMs) have demonstrated promising capabilities as\nautomatic evaluators in assessing the quality of generated natural language.\nHowever, LLMs still exhibit biases in evaluation and often struggle to generate\ncoherent evaluations that align with human assessments. In this work, we first\nconduct a systematic study of the misalignment between LLM evaluators and human\njudgement, revealing that existing calibration methods aimed at mitigating\nbiases are insufficient for effectively aligning LLM evaluators. Inspired by\nthe use of preference data in RLHF, we formulate the evaluation as a ranking\nproblem and introduce Pairwise-preference Search (PairS), an uncertainty-guided\nsearch method that employs LLMs to conduct pairwise comparisons and efficiently\nranks candidate texts. 
PairS achieves state-of-the-art performance on\nrepresentative evaluation tasks and demonstrates significant improvements over\ndirect scoring. Furthermore, we provide insights into the role of pairwise\npreference in quantifying the transitivity of LLMs and demonstrate how PairS\nbenefits from calibration.\n","authors":["Yinhong Liu","Han Zhou","Zhijiang Guo","Ehsan Shareghi","Ivan Vulić","Anna Korhonen","Nigel Collier"],"pdf_url":"https://arxiv.org/pdf/2403.16950v4.pdf","comment":"This paper has been accepted by COLM 2024"},{"id":"http://arxiv.org/abs/2412.19403v2","updated":"2025-01-08T02:43:21Z","published":"2024-12-27T01:53:18Z","title":"Fully Data-driven but Interpretable Human Behavioural Modelling with\n Differentiable Discrete Choice Model","summary":" Discrete choice models are essential for modelling various decision-making\nprocesses in human behaviour. However, the specification of these models has\ndepended heavily on domain knowledge from experts, and the fully automated but\ninterpretable modelling of complex human behaviours has been a long-standing\nchallenge. In this paper, we introduce the differentiable discrete choice model\n(Diff-DCM), a fully data-driven method for the interpretable modelling,\nlearning, prediction, and control of complex human behaviours, which is\nrealised by differentiable programming. Solely from input features and choice\noutcomes without any prior knowledge, Diff-DCM can estimate interpretable\nclosed-form utility functions that reproduce observed behaviours. Comprehensive\nexperiments with both synthetic and real-world data demonstrate that Diff-DCM\ncan be applied to various types of data and requires only a small amount of\ncomputational resources for the estimations, which can be completed within tens\nof seconds on a laptop without any accelerators. 
In these experiments, we also\ndemonstrate that, using its differentiability, Diff-DCM can provide useful\ninsights into human behaviours, such as an optimal intervention path for\neffective behavioural changes. This study provides a strong basis for the fully\nautomated and reliable modelling, prediction, and control of human behaviours.\n","authors":["Fumiyasu Makinoshima","Tatsuya Mitomi","Fumiya Makihara","Eigo Segawa"],"pdf_url":"https://arxiv.org/pdf/2412.19403v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.07594v3","updated":"2025-01-08T02:33:37Z","published":"2023-11-10T09:51:24Z","title":"How to Bridge the Gap between Modalities: Survey on Multimodal Large\n Language Model","summary":" We explore Multimodal Large Language Models (MLLMs), which integrate LLMs\nlike GPT-4 to handle multimodal data, including text, images, audio, and more.\nMLLMs demonstrate capabilities such as generating image captions and answering\nimage-based questions, bridging the gap towards real-world human-computer\ninteractions and hinting at a potential pathway to artificial general\nintelligence. However, MLLMs still face challenges in addressing the semantic\ngap in multimodal data, which may lead to erroneous outputs, posing potential\nrisks to society. Selecting the appropriate modality alignment method is\ncrucial, as improper methods might require more parameters without significant\nperformance improvements. This paper aims to explore modality alignment methods\nfor LLMs and their current capabilities. 
Implementing effective modality\nalignment can help LLMs address environmental issues and enhance accessibility.\nThe study surveys existing modality alignment methods for MLLMs, categorizing\nthem into four groups: (1) Multimodal Converter, which transforms data into a\nformat that LLMs can understand; (2) Multimodal Perceiver, which improves how\nLLMs perceive different types of data; (3) Tool Learning, which leverages\nexternal tools to convert data into a common format, usually text; and (4)\nData-Driven Method, which teaches LLMs to understand specific data types within\ndatasets.\n","authors":["Shezheng Song","Xiaopeng Li","Shasha Li","Shan Zhao","Jie Yu","Jun Ma","Xiaoguang Mao","Weimin Zhang"],"pdf_url":"https://arxiv.org/pdf/2311.07594v3.pdf","comment":"Accepted by TKDE"},{"id":"http://arxiv.org/abs/2404.09005v7","updated":"2025-01-08T02:10:31Z","published":"2024-04-13T13:18:40Z","title":"Proof-of-Learning with Incentive Security","summary":" Most concurrent blockchain systems rely heavily on the Proof-of-Work (PoW) or\nProof-of-Stake (PoS) mechanisms for decentralized consensus and security\nassurance. However, the substantial energy expenditure stemming from\ncomputationally intensive yet meaningless tasks has raised considerable\nconcerns surrounding traditional PoW approaches. The PoS mechanism, while free\nof energy consumption, is subject to security and economic issues. Addressing\nthese issues, the paradigm of Proof-of-Useful-Work (PoUW) seeks to employ\nchallenges of practical significance as PoW, thereby imbuing energy consumption\nwith tangible value. While previous efforts in Proof of Learning (PoL) explored\nthe utilization of deep learning model training SGD tasks as PoUW challenges,\nrecent research has revealed its vulnerabilities to adversarial attacks and the\ntheoretical hardness in crafting a byzantine-secure PoL mechanism. 
In this\npaper, we introduce the concept of incentive-security that incentivizes\nrational provers to behave honestly for their best interest, bypassing the\nexisting hardness to design a PoL mechanism with computational efficiency, a\nprovable incentive-security guarantee and controllable difficulty.\nParticularly, our work is secure against two attacks, and also improves the\ncomputational overhead from $\\Theta(1)$ to $O(\\frac{\\log E}{E})$. Furthermore,\nwhile most recent research assumes trusted problem providers and verifiers, our\ndesign also guarantees frontend incentive-security even when problem providers\nare untrusted, and verifier incentive-security that bypasses the Verifier's\nDilemma. By incorporating ML training into blockchain consensus mechanisms with\nprovable guarantees, our research not only proposes an eco-friendly solution to\nblockchain systems, but also provides a proposal for a completely decentralized\ncomputing power market in the new AI age.\n","authors":["Zishuo Zhao","Zhixuan Fang","Xuechao Wang","Xi Chen","Hongxu Su","Haibo Xiao","Yuan Zhou"],"pdf_url":"https://arxiv.org/pdf/2404.09005v7.pdf","comment":"20 pages, 4 figures"},{"id":"http://arxiv.org/abs/2408.13586v2","updated":"2025-01-08T02:09:15Z","published":"2024-08-24T14:14:32Z","title":"Balancing Diversity and Risk in LLM Sampling: How to Select Your Method\n and Parameter for Open-Ended Text Generation","summary":" Sampling-based decoding strategies have been widely adopted for Large\nLanguage Models (LLMs) in numerous applications, targeting a balance between\ndiversity and quality via temperature tuning and tail truncation. Considering\nthe strong dependency of the candidate next tokens on different prefixes,\nrecent studies propose to adaptively truncate the tail of LLMs' predicted\ndistribution. 
Although improved results have been reported with these methods\non open-ended text generation tasks, the results are highly dependent on the\ncurated parameters and the limited exemplar text. In this paper, we propose a\nsystematic way to estimate the capacity of a truncation sampling method by\nconsidering the trade-off between diversity and risk at each decoding step,\nbased on our collected prefix tree which preserves the context of a full\nsentence. Our work offers a comprehensive comparison of existing truncation\nsampling methods and serves as a practical user guideline for their parameter\nselection.\n","authors":["Yuxuan Zhou","Margret Keuper","Mario Fritz"],"pdf_url":"https://arxiv.org/pdf/2408.13586v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04228v1","updated":"2025-01-08T01:59:47Z","published":"2025-01-08T01:59:47Z","title":"Constraints as Rewards: Reinforcement Learning for Robots without Reward\n Functions","summary":" Reinforcement learning has become an essential algorithm for generating\ncomplex robotic behaviors. However, to learn such behaviors, it is necessary to\ndesign a reward function that describes the task, which often consists of\nmultiple objectives that need to be balanced. This tuning process is known as\nreward engineering and typically involves extensive trial-and-error. In this\npaper, to avoid this trial-and-error process, we propose the concept of\nConstraints as Rewards (CaR). CaR formulates the task objective using multiple\nconstraint functions instead of a reward function and solves a reinforcement\nlearning problem with constraints using the Lagrangian method. By adopting this\napproach, different objectives are automatically balanced, because Lagrange\nmultipliers serve as the weights among the objectives. In addition, we will\ndemonstrate that constraints, expressed as inequalities, provide an intuitive\ninterpretation of the optimization target designed for the task. 
We apply the\nproposed method to the standing-up motion generation task of a\nsix-wheeled-telescopic-legged robot and demonstrate that the proposed method\nsuccessfully acquires the target behavior, even though it is challenging to\nlearn with manually designed reward functions.\n","authors":["Yu Ishihara","Noriaki Takasugi","Kotaro Kawakami","Masaya Kinoshita","Kazumi Aoyama"],"pdf_url":"https://arxiv.org/pdf/2501.04228v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04227v1","updated":"2025-01-08T01:58:42Z","published":"2025-01-08T01:58:42Z","title":"Agent Laboratory: Using LLM Agents as Research Assistants","summary":" Historically, scientific discovery has been a lengthy and costly process,\ndemanding substantial time and resources from initial conception to final\nresults. To accelerate scientific discovery, reduce research costs, and improve\nresearch quality, we introduce Agent Laboratory, an autonomous LLM-based\nframework capable of completing the entire research process. This framework\naccepts a human-provided research idea and progresses through three\nstages--literature review, experimentation, and report writing--to produce\ncomprehensive research outputs, including a code repository and a research\nreport, while enabling users to provide feedback and guidance at each stage. We\ndeploy Agent Laboratory with various state-of-the-art LLMs and invite multiple\nresearchers to assess its quality by participating in a survey, providing human\nfeedback to guide the research process, and then evaluating the final paper. 
We\nfound that: (1) Agent Laboratory driven by o1-preview generates the best\nresearch outcomes; (2) The generated machine learning code is able to achieve\nstate-of-the-art performance compared to existing methods; (3) Human\ninvolvement, providing feedback at each stage, significantly improves the\noverall quality of research; (4) Agent Laboratory significantly reduces\nresearch expenses, achieving an 84% decrease compared to previous autonomous\nresearch methods. We hope Agent Laboratory enables researchers to allocate more\neffort toward creative ideation rather than low-level coding and writing,\nultimately accelerating scientific discovery.\n","authors":["Samuel Schmidgall","Yusheng Su","Ze Wang","Ximeng Sun","Jialian Wu","Xiaodong Yu","Jiang Liu","Zicheng Liu","Emad Barsoum"],"pdf_url":"https://arxiv.org/pdf/2501.04227v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.16050v3","updated":"2025-01-08T01:47:16Z","published":"2024-12-20T16:52:11Z","title":"Label-Efficient Data Augmentation with Video Diffusion Models for\n Guidewire Segmentation in Cardiac Fluoroscopy","summary":" The accurate segmentation of guidewires in interventional cardiac fluoroscopy\nvideos is crucial for computer-aided navigation tasks. Although deep learning\nmethods have demonstrated high accuracy and robustness in wire segmentation,\nthey require substantial annotated datasets for generalizability, underscoring\nthe need for extensive labeled data to enhance model performance. To address\nthis challenge, we propose the Segmentation-guided Frame-consistency Video\nDiffusion Model (SF-VD) to generate large collections of labeled fluoroscopy\nvideos, augmenting the training data for wire segmentation networks. SF-VD\nleverages videos with limited annotations by independently modeling scene\ndistribution and motion distribution. 
It first samples the scene distribution\nby generating 2D fluoroscopy images with wires positioned according to a\nspecified input mask, and then samples the motion distribution by progressively\ngenerating subsequent frames, ensuring frame-to-frame coherence through a\nframe-consistency strategy. A segmentation-guided mechanism further refines the\nprocess by adjusting wire contrast, ensuring a diverse range of visibility in\nthe synthesized image. Evaluation on a fluoroscopy dataset confirms the\nsuperior quality of the generated videos and shows significant improvements in\nguidewire segmentation.\n","authors":["Shaoyan Pan","Yikang Liu","Lin Zhao","Eric Z. Chen","Xiao Chen","Terrence Chen","Shanhui Sun"],"pdf_url":"https://arxiv.org/pdf/2412.16050v3.pdf","comment":"AAAI 2025"},{"id":"http://arxiv.org/abs/2411.09852v2","updated":"2025-01-08T01:44:07Z","published":"2024-11-15T00:20:36Z","title":"InterFormer: Towards Effective Heterogeneous Interaction Learning for\n Click-Through Rate Prediction","summary":" Click-through rate (CTR) prediction, which predicts the probability of a user\nclicking an ad, is a fundamental task in recommender systems. The emergence of\nheterogeneous information, such as user profile and behavior sequences, depicts\nuser interests from different aspects. A mutually beneficial integration of\nheterogeneous information is the cornerstone towards the success of CTR\nprediction. However, most of the existing methods suffer from two fundamental\nlimitations, including (1) insufficient inter-mode interaction due to the\nunidirectional information flow between modes, and (2) aggressive information\naggregation caused by early summarization, resulting in excessive information\nloss. To address the above limitations, we propose a novel module named\nInterFormer to learn heterogeneous information interaction in an interleaving\nstyle. 
To achieve better interaction learning, InterFormer enables\nbidirectional information flow for mutually beneficial learning across\ndifferent modes. To avoid aggressive information aggregation, we retain\ncomplete information in each data mode and use a separate bridging arch for\neffective information selection and summarization. Our proposed InterFormer\nachieves state-of-the-art performance on three public datasets and a\nlarge-scale industrial dataset.\n","authors":["Zhichen Zeng","Xiaolong Liu","Mengyue Hang","Xiaoyi Liu","Qinghai Zhou","Chaofei Yang","Yiqun Liu","Yichen Ruan","Laming Chen","Yuxin Chen","Yujia Hao","Jiaqi Xu","Jade Nie","Xi Liu","Buyun Zhang","Wei Wen","Siyang Yuan","Kai Wang","Wen-Yen Chen","Yiping Han","Huayu Li","Chunzhi Yang","Bo Long","Philip S. Yu","Hanghang Tong","Jiyan Yang"],"pdf_url":"https://arxiv.org/pdf/2411.09852v2.pdf","comment":"10 pages, 6 figures"},{"id":"http://arxiv.org/abs/2501.04217v1","updated":"2025-01-08T01:27:35Z","published":"2025-01-08T01:27:35Z","title":"Continual Self-supervised Learning Considering Medical Domain Knowledge\n in Chest CT Images","summary":" We propose a novel continual self-supervised learning method (CSSL)\nconsidering medical domain knowledge in chest CT images. Our approach addresses\nthe challenge of sequential learning by effectively capturing the relationship\nbetween previously learned knowledge and new information at different stages.\nBy incorporating an enhanced DER into CSSL and maintaining both diversity and\nrepresentativeness within the rehearsal buffer of DER, the risk of data\ninterference during pretraining is reduced, enabling the model to learn\nricher and more robust feature representations. In addition, we incorporate a mixup\nstrategy and feature distillation to further enhance the model's ability to\nlearn meaningful representations. 
We validate our method using chest CT images\nobtained under two different imaging conditions, demonstrating superior\nperformance compared to state-of-the-art methods.\n","authors":["Ren Tasai","Guang Li","Ren Togo","Minghui Tang","Takaaki Yoshimura","Hiroyuki Sugimori","Kenji Hirata","Takahiro Ogawa","Kohsuke Kudo","Miki Haseyama"],"pdf_url":"https://arxiv.org/pdf/2501.04217v1.pdf","comment":"Accepted by ICASSP 2025"},{"id":"http://arxiv.org/abs/2501.02683v2","updated":"2025-01-08T01:27:30Z","published":"2025-01-05T23:19:55Z","title":"From Superficial Patterns to Semantic Understanding: Fine-Tuning\n Language Models on Contrast Sets","summary":" Large-scale pre-trained language models have demonstrated high performance on\nstandard datasets for natural language inference (NLI) tasks. Unfortunately,\nthese evaluations can be misleading, as although the models can perform well on\nin-distribution data, they perform poorly on out-of-distribution test sets,\nsuch as contrast sets. Contrast sets consist of perturbed instances of data\nthat have very minor, but meaningful, changes to the input that alter the gold\nlabel, revealing how models can learn superficial patterns in the training data\nrather than learning more sophisticated language nuances. As an example, the\nELECTRA-small language model achieves nearly 90% accuracy on an SNLI dataset\nbut drops to 75% when tested on an out-of-distribution contrast set. The\nresearch carried out in this study explores how the robustness of a language\nmodel can be improved by exposing it to small amounts of more complex contrast\nsets during training to help it better learn language patterns. 
With this\napproach, the model recovers performance and achieves nearly 90% accuracy on\ncontrast sets, highlighting the importance of diverse and challenging training\ndata.\n","authors":["Daniel Petrov"],"pdf_url":"https://arxiv.org/pdf/2501.02683v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04213v1","updated":"2025-01-08T01:18:14Z","published":"2025-01-08T01:18:14Z","title":"UPAQ: A Framework for Real-Time and Energy-Efficient 3D Object Detection\n in Autonomous Vehicles","summary":" To enhance perception in autonomous vehicles (AVs), recent efforts are\nconcentrating on 3D object detectors, which deliver more comprehensive\npredictions than traditional 2D object detectors, at the cost of increased\nmemory footprint and computational resource usage. We present a novel framework\ncalled UPAQ, which leverages semi-structured pattern pruning and quantization\nto improve the efficiency of LiDAR point-cloud and camera-based 3D object\ndetectors on resource-constrained embedded AV platforms. 
Experimental results\non the Jetson Orin Nano embedded platform indicate that UPAQ achieves up to\n5.62x and 5.13x model compression rates, up to 1.97x and 1.86x boost in\ninference speed, and up to 2.07x and 1.87x reduction in energy consumption\ncompared to state-of-the-art model compression frameworks, on the Pointpillar\nand SMOKE models respectively.\n","authors":["Abhishek Balasubramaniam","Febin P Sunny","Sudeep Pasricha"],"pdf_url":"https://arxiv.org/pdf/2501.04213v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04211v1","updated":"2025-01-08T01:11:17Z","published":"2025-01-08T01:11:17Z","title":"CURing Large Models: Compression via CUR Decomposition","summary":" Large deep learning models have achieved remarkable success but are\nresource-intensive, posing challenges in computational cost and memory usage.\n We introduce CURing, a novel model compression method based on CUR matrix\ndecomposition, which approximates weight matrices as the product of selected\ncolumns (C) and rows (R), and a small linking matrix (U). We apply this\ndecomposition to weights chosen based on the combined influence of their\nmagnitudes and activations. 
By identifying and retaining informative rows and\ncolumns, CURing significantly reduces model size with minimal performance loss.\n It preserves the original network's input/output structures, retains\nimportant features such as non-negativity, and the compressed model's\nactivation patterns align with the original, thereby enhancing\ninterpretability.\n","authors":["Sanghyeon Park","Soo-Mook Moon"],"pdf_url":"https://arxiv.org/pdf/2501.04211v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04202v1","updated":"2025-01-08T00:43:31Z","published":"2025-01-08T00:43:31Z","title":"Generative Dataset Distillation Based on Self-knowledge Distillation","summary":" Dataset distillation is an effective technique for reducing the cost and\ncomplexity of model training while maintaining performance by compressing large\ndatasets into smaller, more efficient versions. In this paper, we present a\nnovel generative dataset distillation method that can improve the accuracy of\naligning prediction logits. Our approach integrates self-knowledge distillation\nto achieve more precise distribution matching between the synthetic and\noriginal data, thereby capturing the overall structure and relationships within\nthe data. To further improve the accuracy of alignment, we introduce a\nstandardization step on the logits before performing distribution matching,\nensuring consistency in the range of logits. 
Through extensive experiments, we\ndemonstrate that our method outperforms existing state-of-the-art methods,\nresulting in superior distillation performance.\n","authors":["Longzhen Li","Guang Li","Ren Togo","Keisuke Maeda","Takahiro Ogawa","Miki Haseyama"],"pdf_url":"https://arxiv.org/pdf/2501.04202v1.pdf","comment":"Accepted by ICASSP 2025"},{"id":"http://arxiv.org/abs/2501.04193v1","updated":"2025-01-08T00:06:38Z","published":"2025-01-08T00:06:38Z","title":"GNN-based Decentralized Perception in Multirobot Systems for Predicting\n Worker Actions","summary":" In industrial environments, predicting human actions is essential for\nensuring safe and effective collaboration between humans and robots. This paper\nintroduces a perception framework that enables mobile robots to understand and\nshare information about human actions in a decentralized way. The framework\nfirst allows each robot to build a spatial graph representing its surroundings,\nwhich it then shares with other robots. This shared spatial data is combined\nwith temporal information to track human behavior over time. A swarm-inspired\ndecision-making process is used to ensure all robots agree on a unified\ninterpretation of the human's actions. Results show that adding more robots and\nincorporating longer time sequences improve prediction accuracy. 
Additionally,\nthe consensus mechanism increases system resilience, making the multi-robot\nsetup more reliable in dynamic industrial settings.\n","authors":["Ali Imran","Giovanni Beltrame","David St-Onge"],"pdf_url":"https://arxiv.org/pdf/2501.04193v1.pdf","comment":"Submitted to RA-L"},{"id":"http://arxiv.org/abs/2402.17853v2","updated":"2025-01-08T00:00:44Z","published":"2024-02-27T19:36:27Z","title":"Latent Neural PDE Solver: a reduced-order modelling framework for\n partial differential equations","summary":" Neural networks have shown promising potential in accelerating the numerical\nsimulation of systems governed by partial differential equations (PDEs).\nDifferent from many existing neural network surrogates operating on\nhigh-dimensional discretized fields, we propose to learn the dynamics of the\nsystem in the latent space with much coarser discretizations. In our proposed\nframework - Latent Neural PDE Solver (LNS), a non-linear autoencoder is first\ntrained to project the full-order representation of the system onto the\nmesh-reduced space, then a temporal model is trained to predict the future\nstate in this mesh-reduced space. This reduction process simplifies the\ntraining of the temporal model by greatly reducing the computational cost\naccompanying a fine discretization. We study the capability of the proposed\nframework and several other popular neural PDE solvers on various types of\nsystems including single-phase and multi-phase flows along with varying system\nparameters. We showcase that it has competitive accuracy and efficiency\ncompared to the neural PDE solver that operates on full-order space.\n","authors":["Zijie Li","Saurabh Patil","Francis Ogoke","Dule Shu","Wilson Zhen","Michael Schneier","John R. 
Buchanan, Jr.","Amir Barati Farimani"],"pdf_url":"https://arxiv.org/pdf/2402.17853v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.05910v3","updated":"2025-01-08T23:40:38Z","published":"2024-07-08T13:15:11Z","title":"Enhancing Vision-Language Models with Scene Graphs for Traffic Accident\n Understanding","summary":" Recognizing a traffic accident is an essential part of any autonomous driving\nor road monitoring system. An accident can appear in a wide variety of forms,\nand understanding what type of accident is taking place may be useful to\nprevent it from recurring. This work focuses on classifying traffic scenes into\nspecific accident types. We approach the problem by representing a traffic\nscene as a graph, where objects such as cars can be represented as nodes, and\nrelative distances and directions between them as edges. This representation of\na traffic scene is referred to as a scene graph, and can be used as input for\nan accident classifier. Better results are obtained with a classifier that\nfuses the scene graph input with visual and textual representations. This work\nintroduces a multi-stage, multimodal pipeline that pre-processes videos of\ntraffic accidents, encodes them as scene graphs, and aligns this representation\nwith vision and language modalities before executing the classification task.\nWhen trained on 4 classes, our method achieves a balanced accuracy score of\n57.77% on an (unbalanced) subset of the popular Detection of Traffic Anomaly\n(DoTA) benchmark, representing an increase of close to 5 percentage points from\nthe case where scene graph information is not taken into account.\n","authors":["Aaron Lohner","Francesco Compagno","Jonathan Francis","Alessandro Oltramari"],"pdf_url":"https://arxiv.org/pdf/2407.05910v3.pdf","comment":"Won the 'Best Paper Runner-up Award' at the 2024 IEEE International\n Automated Vehicle Validation Conference (IAVVC 2024). 
Also accepted at the\n 1st Workshop on Semantic Reasoning and Goal Understanding in Robotics, at the\n Robotics Science and Systems Conference (RSS SemRob 2024)"},{"id":"http://arxiv.org/abs/2501.04882v1","updated":"2025-01-08T23:38:19Z","published":"2025-01-08T23:38:19Z","title":"Reach Measurement, Optimization and Frequency Capping In Targeted Online\n Advertising Under k-Anonymity","summary":" The growth in the use of online advertising to foster brand awareness over\nrecent years is largely attributable to the ubiquity of social media. One\npivotal technology contributing to the success of online brand advertising is\nfrequency capping, a mechanism that enables marketers to control the number of\ntimes an ad is shown to a specific user. However, the very foundation of this\ntechnology is being scrutinized as the industry gravitates towards advertising\nsolutions that prioritize user privacy. This paper delves into the issue of\nreach measurement and optimization within the context of $k$-anonymity, a\nprivacy-preserving model gaining traction across major online advertising\nplatforms. We outline how to report reach within this new privacy landscape and\ndemonstrate how probabilistic discounting, a probabilistic adaptation of\ntraditional frequency capping, can be employed to optimize campaign\nperformance. Experiments are performed to assess the trade-off between user\nprivacy and the efficacy of online brand advertising. 
Notably, we discern a\nsignificant dip in performance once privacy is introduced, yet this comes\nwith a limited additional cost for advertising platforms to offer their users\nmore privacy.\n","authors":["Yuan Gao","Mu Qiao"],"pdf_url":"https://arxiv.org/pdf/2501.04882v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04877v1","updated":"2025-01-08T23:21:43Z","published":"2025-01-08T23:21:43Z","title":"Real-Time Textless Dialogue Generation","summary":" Recent advancements in large language models (LLMs) have led to significant\nprogress in text-based dialogue systems. These systems can now generate\nhigh-quality responses that are accurate and coherent across a wide range of\ntopics and tasks. However, spoken dialogue systems still lag behind in terms of\nnaturalness. They tend to produce robotic interactions, with issues such as\nslow response times, overly generic or cautious replies, and a lack of natural\nrhythm and fluid turn-taking. This shortcoming is largely due to the\nover-reliance on the traditional cascaded design, which involves separate,\nsequential components, as well as the use of text as an intermediate\nrepresentation. This paper proposes a real-time, textless spoken dialogue\ngeneration model (RTTL-DG) that aims to overcome these challenges. Our system\nenables fluid turn-taking and generates responses with minimal delay by\nprocessing streaming spoken conversation directly. Additionally, our model\nincorporates backchannels, fillers, laughter, and other paralinguistic signals,\nwhich are often absent in cascaded dialogue systems, to create more natural and\nhuman-like interactions. 
The implementations and generated samples are\navailable in our repository: https://github.com/mailong25/rts2s-dg\n","authors":["Long Mai","Julie Carson-Berndsen"],"pdf_url":"https://arxiv.org/pdf/2501.04877v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.00958v2","updated":"2025-01-08T23:16:20Z","published":"2024-05-02T02:50:58Z","title":"Generative manufacturing systems using diffusion models and ChatGPT","summary":" In this study, we introduce Generative Manufacturing Systems (GMS) as a novel\napproach to effectively manage and coordinate autonomous manufacturing assets,\nthereby enhancing their responsiveness and flexibility to address a wide array\nof production objectives and human preferences. Deviating from traditional\nexplicit modeling, GMS employs generative AI, including diffusion models and\nChatGPT, for implicit learning from envisioned futures, marking a shift from a\nmodel-optimum to a training-sampling decision-making. Through the integration\nof generative AI, GMS enables complex decision-making through interactive\ndialogue with humans, allowing manufacturing assets to generate multiple\nhigh-quality global decisions that can be iteratively refined based on human\nfeedback. Empirical findings showcase GMS's substantial improvement in system\nresilience and responsiveness to uncertainties, with decision times reduced\nfrom seconds to milliseconds. The study underscores the inherent creativity and\ndiversity in the generated solutions, facilitating human-centric\ndecision-making through seamless and continuous human-machine interactions.\n","authors":["Xingyu Li","Fei Tao","Wei Ye","Aydin Nassehi","John W. Sutherland"],"pdf_url":"https://arxiv.org/pdf/2405.00958v2.pdf","comment":"We are withdrawing this preprint to incorporate significant new\n results and expand the scope of the paper. 
We plan to resubmit a\n substantially revised version in the near future"},{"id":"http://arxiv.org/abs/2501.04873v1","updated":"2025-01-08T23:07:10Z","published":"2025-01-08T23:07:10Z","title":"Back Home: A Machine Learning Approach to Seashell Classification and\n Ecosystem Restoration","summary":" In Costa Rica, an average of 5 tons of seashells are extracted from\necosystems annually. Confiscated seashells cannot be returned to their\necosystems due to the lack of origin recognition. To address this issue, we\ndeveloped a convolutional neural network (CNN) specifically for seashell\nidentification. We built a dataset from scratch, consisting of approximately\n19000 images from the Pacific and Caribbean coasts. Using this dataset, the\nmodel achieved a classification accuracy exceeding 85%. The model has been\nintegrated into a user-friendly application, which has classified over 36,000\nseashells to date, delivering real-time results within 3 seconds per image. To\nfurther enhance the system's accuracy, an anomaly detection mechanism was\nincorporated to filter out irrelevant or anomalous inputs, ensuring only valid\nseashell images are processed.\n","authors":["Alexander Valverde","Luis Solano"],"pdf_url":"https://arxiv.org/pdf/2501.04873v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.14469v6","updated":"2025-01-08T21:47:16Z","published":"2024-06-20T16:32:18Z","title":"Forecasting Symmetric Random Walks: A Fusion Approach","summary":" Forecasting random walks is notoriously challenging, with na\\"ive prediction\nserving as a difficult-to-surpass baseline. To investigate the potential of\nusing movement predictions to improve point forecasts in this context, this\nstudy focuses on symmetric random walks, in which the target variable's future\nvalue is reformulated as a combination of its future movement and current\nvalue. 
The proposed forecasting method, termed the fusion of movement and\nna\\\"ive predictions (FMNP), is grounded in this reformulation. The simulation\nresults show that FMNP achieves statistically significant improvements over\nna\\\"ive prediction, even when the movement prediction accuracy is only slightly\nabove 0.50. In practice, movement predictions can be derived from the\ncomovement between an exogenous variable and the target variable and then\nlinearly combined with the na\\\"ive prediction to generate the final forecast.\nFMNP effectiveness was evaluated on four U.S. financial time series -- the\nclose prices of Boeing (BA), Brent crude oil (OIL), Halliburton (HAL), and\nSchlumberger (SLB) -- using the open price of the Financial Times Stock\nExchange (FTSE) index as the exogenous variable. In all the cases, FMNP\noutperformed the na\\\"ive prediction, demonstrating its efficacy in forecasting\nsymmetric random walks and its potential applicability to other forecasting\ntasks.\n","authors":["Cheng Zhang"],"pdf_url":"https://arxiv.org/pdf/2406.14469v6.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04848v1","updated":"2025-01-08T21:22:45Z","published":"2025-01-08T21:22:45Z","title":"Exploring Large Language Models for Semantic Analysis and Categorization\n of Android Malware","summary":" Malware analysis is a complex process of examining and evaluating malicious\nsoftware's functionality, origin, and potential impact. This arduous process\ntypically involves dissecting the software to understand its components,\ninfection vector, propagation mechanism, and payload. Over the years, deep\nreverse engineering of malware has become increasingly tedious, mainly due to\nmodern malicious codebases' fast evolution and sophistication. 
Essentially,\nanalysts are tasked with identifying the elusive needle in the haystack within\nthe complexities of zero-day malware, all while under tight time constraints.\nThus, in this paper, we explore leveraging Large Language Models (LLMs) for\nsemantic malware analysis to expedite the analysis of known and novel samples.\nBuilt on GPT-4o-mini model, \\msp is designed to augment malware analysis for\nAndroid through a hierarchical-tiered summarization chain and strategic prompt\nengineering. Additionally, \\msp performs malware categorization, distinguishing\npotential malware from benign applications, thereby saving time during the\nmalware reverse engineering process. Despite not being fine-tuned for Android\nmalware analysis, we demonstrate that through optimized and advanced prompt\nengineering \\msp can achieve up to 77% classification accuracy while providing\nhighly robust summaries at functional, class, and package levels. In addition,\nleveraging the backward tracing of the summaries from package to function\nlevels allowed us to pinpoint the precise code snippets responsible for\nmalicious behavior.\n","authors":["Brandon J Walton","Mst Eshita Khatun","James M Ghawaly","Aisha Ali-Gombe"],"pdf_url":"https://arxiv.org/pdf/2501.04848v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04844v1","updated":"2025-01-08T21:11:35Z","published":"2025-01-08T21:11:35Z","title":"Enhancing Listened Speech Decoding from EEG via Parallel Phoneme\n Sequence Prediction","summary":" Brain-computer interfaces (BCI) offer numerous human-centered application\npossibilities, particularly affecting people with neurological disorders. Text\nor speech decoding from brain activities is a relevant domain that could\naugment the quality of life for people with impaired speech perception. 
We\npropose a novel approach to enhance listened speech decoding from\nelectroencephalography (EEG) signals by utilizing an auxiliary phoneme\npredictor that simultaneously decodes textual phoneme sequences. The proposed\nmodel architecture consists of three main parts: EEG module, speech module, and\nphoneme predictor. The EEG module learns to properly represent EEG signals into\nEEG embeddings. The speech module generates speech waveforms from the EEG\nembeddings. The phoneme predictor outputs the decoded phoneme sequences in text\nmodality. Our proposed approach allows users to obtain decoded listened speech\nfrom EEG signals in both modalities (speech waveforms and textual phoneme\nsequences) simultaneously, eliminating the need for a concatenated sequential\npipeline for each modality. The proposed approach also outperforms previous\nmethods in both modalities. The source code and speech samples are publicly\navailable.\n","authors":["Jihwan Lee","Tiantian Feng","Aditya Kommineni","Sudarsana Reddy Kadiri","Shrikanth Narayanan"],"pdf_url":"https://arxiv.org/pdf/2501.04844v1.pdf","comment":"ICASSP 2025"},{"id":"http://arxiv.org/abs/2408.06653v3","updated":"2025-01-08T20:40:09Z","published":"2024-08-13T05:53:46Z","title":"Hierarchical Structured Neural Network: Efficient Retrieval Scaling for\n Large Scale Recommendation","summary":" Retrieval, the initial stage of a recommendation system, is tasked with\ndown-selecting items from a pool of tens of millions of candidates to a few\nthousands. Embedding Based Retrieval (EBR) has been a typical choice for this\nproblem, addressing the computational demands of deep neural networks across\nvast item corpora. EBR utilizes Two Tower or Siamese Networks to learn\nrepresentations for users and items, and employ Approximate Nearest Neighbor\n(ANN) search to efficiently retrieve relevant items. Despite its popularity in\nindustry, EBR faces limitations. 
The Two Tower architecture, relying on a\nsingle dot product interaction, struggles to capture complex data distributions\ndue to limited capability in learning expressive interactions between users and\nitems. Additionally, ANN index building and representation learning for user\nand item are often separate, leading to inconsistencies exacerbated by\nrepresentation drift (e.g. continuous online training) and item drift (e.g. items\nexpired and new items added). In this paper, we introduce the Hierarchical\nStructured Neural Network (HSNN), an efficient deep neural network model to\nlearn intricate user and item interactions beyond the commonly used dot product\nin retrieval tasks, achieving sublinear computational costs relative to corpus\nsize. A Modular Neural Network (MoNN) is designed to maintain high\nexpressiveness for interaction learning while ensuring efficiency. A mixture of\nMoNNs operate on a hierarchical item index to achieve extensive computation\nsharing, enabling it to scale up to large corpus size. MoNN and the\nhierarchical index are jointly learnt to continuously adapt to distribution\nshifts in both user interests and item distributions. HSNN achieves substantial\nimprovement in offline evaluation compared to prevailing methods.\n","authors":["Kaushik Rangadurai","Siyang Yuan","Minhui Huang","Yiqun Liu","Golnaz Ghasemiesfeh","Yunchen Pu","Haiyu Lu","Xingfeng He","Fangzhou Xu","Andrew Cui","Vidhoon Viswanathan","Lin Yang","Liang Wang","Jiyan Yang","Chonglin Sun"],"pdf_url":"https://arxiv.org/pdf/2408.06653v3.pdf","comment":"Resubmit"},{"id":"http://arxiv.org/abs/2501.04835v1","updated":"2025-01-08T20:38:02Z","published":"2025-01-08T20:38:02Z","title":"Do Code LLMs Understand Design Patterns?","summary":" Code Large Language Models (LLMs) demonstrate great versatility in adapting\nto various downstream tasks, including code generation and completion, as well\nas bug detection and fixing. 
However, Code LLMs often fail to capture existing\ncoding standards, leading to the generation of code that conflicts with the\nrequired design patterns for a given project. As a result, developers must\npost-process to adapt the generated code to the project's design norms. In this\nwork, we empirically investigate the biases of Code LLMs in software\ndevelopment. Through carefully designed experiments, we assess the models'\nunderstanding of design patterns across recognition, comprehension, and\ngeneration. Our findings reveal that biases in Code LLMs significantly affect\nthe reliability of downstream tasks.\n","authors":["Zhenyu Pan","Xuefeng Song","Yunkun Wang","Rongyu Cao","Binhua Li","Yongbin Li","Han Liu"],"pdf_url":"https://arxiv.org/pdf/2501.04835v1.pdf","comment":"accepted by llm4code workshop in ICSE 2025"},{"id":"http://arxiv.org/abs/2501.04832v1","updated":"2025-01-08T20:38:02Z","published":"2025-01-08T20:38:02Z","title":"ActPC-Geom: Towards Scalable Online Neural-Symbolic Learning via\n Accelerating Active Predictive Coding with Information Geometry & Diverse\n Cognitive Mechanisms","summary":" This paper introduces ActPC-Geom, an approach to accelerate Active Predictive\nCoding (ActPC) in neural networks by integrating information geometry,\nspecifically using Wasserstein-metric-based methods for measure-dependent\ngradient flows. 
We propose replacing KL-divergence in ActPC's predictive error\nassessment with the Wasserstein metric, suggesting this may enhance network\nrobustness.\n To make this computationally feasible, we present strategies including: (1)\nneural approximators for inverse measure-dependent Laplacians, (2) approximate\nkernel PCA embeddings for low-rank approximations feeding into these\napproximators, and (3) compositional hypervector embeddings derived from kPCA\noutputs, with algebra optimized for fuzzy FCA lattices learned through neural\narchitectures analyzing network states.\n This results in an ActPC architecture capable of real-time online learning\nand integrating continuous (e.g., transformer-like or Hopfield-net-like) and\ndiscrete symbolic ActPC networks, including frameworks like OpenCog Hyperon or\nActPC-Chem for algorithmic chemistry evolution. Shared probabilistic,\nconcept-lattice, and hypervector models enable symbolic-subsymbolic\nintegration.\n Key features include (1) compositional reasoning via hypervector embeddings\nin transformer-like architectures for tasks like commonsense reasoning, and (2)\nHopfield-net dynamics enabling associative long-term memory and\nattractor-driven cognitive features.\n We outline how ActPC-Geom combines few-shot learning with online weight\nupdates, enabling deliberative thinking and seamless symbolic-subsymbolic\nreasoning. Ideas from Galois connections are explored for efficient hybrid\nActPC/ActPC-Chem processing. 
Finally, we propose a specialized HPC design\noptimized for real-time focused attention and deliberative reasoning tailored\nto ActPC-Geom's demands.\n","authors":["Ben Goertzel"],"pdf_url":"https://arxiv.org/pdf/2501.04832v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15237v3","updated":"2025-01-08T20:34:02Z","published":"2024-08-27T17:56:11Z","title":"The Mamba in the Llama: Distilling and Accelerating Hybrid Models","summary":" Linear RNN architectures, like Mamba, can be competitive with Transformer\nmodels in language modeling while having advantageous deployment\ncharacteristics. Given the focus on training large-scale Transformer models, we\nconsider the challenge of converting these pretrained models for deployment. We\ndemonstrate that it is feasible to distill large Transformers into linear RNNs\nby reusing the linear projection weights from attention layers with academic\nGPU resources. The resulting hybrid model, which incorporates a quarter of the\nattention layers, achieves performance comparable to the original Transformer\nin chat benchmarks and outperforms open-source hybrid Mamba models trained from\nscratch with trillions of tokens in both chat benchmarks and general\nbenchmarks. Moreover, we introduce a hardware-aware speculative decoding\nalgorithm that accelerates the inference speed of Mamba and hybrid models.\nOverall we show how, with limited computation resources, we can remove many of\nthe original attention layers and generate from the resulting model more\nefficiently. Our top-performing model, distilled from Llama3-8B-Instruct,\nachieves a 29.61 length-controlled win rate on AlpacaEval 2 against GPT-4 and\n7.35 on MT-Bench, surpassing the best 8B scale instruction-tuned linear RNN\nmodel. We also find that the distilled model has natural length extrapolation,\nshowing almost perfect accuracy in the needle-in-a-haystack test at 20x the\ndistillation length. 
Code and pre-trained checkpoints are open-sourced at\nhttps://github.com/jxiw/MambaInLlama and\nhttps://github.com/itsdaniele/speculative_mamba.\n","authors":["Junxiong Wang","Daniele Paliotta","Avner May","Alexander M. Rush","Tri Dao"],"pdf_url":"https://arxiv.org/pdf/2408.15237v3.pdf","comment":"NeurIPS 2024. v3 updates: fix format errors"},{"id":"http://arxiv.org/abs/2501.04826v1","updated":"2025-01-08T20:26:13Z","published":"2025-01-08T20:26:13Z","title":"Intelligent Gradient Boosting Algorithms for Estimating Strength of\n Modified Subgrade Soil","summary":" The performance of pavement under loading depends on the strength of the\nsubgrade. However, experimental estimation of properties of pavement strengths\nsuch as California bearing ratio (CBR), unconfined compressive strength (UCS)\nand resistance value (R) are often tedious, time-consuming and costly, thereby\ninspiring a growing interest in machine learning based tools which are simple,\ncheap and fast alternatives. Thus, the potential application of two boosting\ntechniques; categorical boosting (CatBoost) and extreme gradient boosting\n(XGBoost) and support vector regression (SVR), is similarly explored in this\nstudy for estimation of properties of subgrade soil modified with hydrated lime\nactivated rice husk ash (HARSH). Using 121 experimental data samples of varying\nproportions of HARSH, plastic limit, liquid limit, plasticity index, clay\nactivity, optimum moisture content, and maximum dry density as input for CBR,\nUCS and R estimation, four evaluation metrics namely coefficient of\ndetermination (R2), root mean squared error (RMSE), mean absolute error (MAE)\nand mean absolute percentage error (MAPE) are used to evaluate the models'\nperformance. The results indicate that XGBoost outperformed CatBoost and SVR in\nestimating these properties, yielding R2 of 0.9994, 0.9995 and 0.9999 in\nestimating the CBR, UCS and R respectively. 
Also, SVR outperformed CatBoost in\nestimating the CBR and R, each with an R2 of 0.9997. On the other hand,\nCatBoost outperformed SVR in estimating the UCS with R2 of 0.9994. Feature\nsensitivity analysis shows that the three machine learning techniques are\nunanimous that increasing the HARSH proportion leads to increased values of the estimated\nproperties. A comparison with previous results also shows the\nsuperiority of XGBoost in estimating subgrade properties.\n","authors":["Ismail B. Mustapha","Muyideen Abdulkareem","Shafaatunnur Hasan","Abideen Ganiyu","Hatem Nabus","Jin Chai Lee"],"pdf_url":"https://arxiv.org/pdf/2501.04826v1.pdf","comment":"17 pages"},{"id":"http://arxiv.org/abs/2501.04819v1","updated":"2025-01-08T20:17:18Z","published":"2025-01-08T20:17:18Z","title":"Planing It by Ear: Convolutional Neural Networks for Acoustic Anomaly\n Detection in Industrial Wood Planers","summary":" In recent years, the wood product industry has been facing a skilled labor\nshortage. The result is more frequent sudden failures, resulting in additional\ncosts for these companies already operating in a very competitive market.\nMoreover, sawmills are challenging environments for machinery and sensors.\nGiven that experienced machine operators may be able to diagnose defects or\nmalfunctions, one possible way of assisting novice operators is through\nacoustic monitoring. As a step towards the automation of wood-processing\nequipment and decision support systems for machine operators, in this paper, we\nexplore using a deep convolutional autoencoder for acoustic anomaly detection\nof wood planers on a new real-life dataset. 
Specifically, our convolutional\nautoencoder with skip connections (Skip-CAE) and our Skip-CAE transformer\noutperform the DCASE autoencoder baseline, one-class SVM, isolation forest and\na published convolutional autoencoder architecture, respectively obtaining an\narea under the ROC curve of 0.846 and 0.875 on a dataset of real-factory planer\nsounds. Moreover, we show that adding skip connections and attention mechanism\nunder the form of a transformer encoder-decoder helps to further improve the\nanomaly detection capabilities.\n","authors":["Anthony Deschênes","Rémi Georges","Cem Subakan","Bruna Ugulino","Antoine Henry","Michael Morin"],"pdf_url":"https://arxiv.org/pdf/2501.04819v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04817v1","updated":"2025-01-08T20:14:07Z","published":"2025-01-08T20:14:07Z","title":"Decentralised Resource Sharing in TinyML: Wireless Bilayer Gossip\n Parallel SGD for Collaborative Learning","summary":" With the growing computational capabilities of microcontroller units (MCUs),\nedge devices can now support machine learning models. However, deploying\ndecentralised federated learning (DFL) on such devices presents key challenges,\nincluding intermittent connectivity, limited communication range, and dynamic\nnetwork topologies. This paper proposes a novel framework, bilayer Gossip\nDecentralised Parallel Stochastic Gradient Descent (GD PSGD), designed to\naddress these issues in resource-constrained environments. The framework\nincorporates a hierarchical communication structure using Distributed Kmeans\n(DKmeans) clustering for geographic grouping and a gossip protocol for\nefficient model aggregation across two layers: intra-cluster and inter-cluster.\nWe evaluate the framework's performance against the Centralised Federated\nLearning (CFL) baseline using the MCUNet model on the CIFAR-10 dataset under\nIID and Non-IID conditions. 
Results demonstrate that the proposed method\nachieves comparable accuracy to CFL on IID datasets, requiring only 1.8\nadditional rounds for convergence. On Non-IID datasets, the accuracy loss\nremains under 8\\% for moderate data imbalance. These findings highlight the\nframework's potential to support scalable and privacy-preserving learning on\nedge devices with minimal performance trade-offs.\n","authors":["Ziyuan Bao","Eiman Kanjo","Soumya Banerjee","Hasib-Al Rashid","Tinoosh Mohsenin"],"pdf_url":"https://arxiv.org/pdf/2501.04817v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.16339v2","updated":"2025-01-08T20:11:59Z","published":"2024-12-20T21:00:11Z","title":"Deliberative Alignment: Reasoning Enables Safer Language Models","summary":" As large-scale language models increasingly impact safety-critical domains,\nensuring their reliable adherence to well-defined principles remains a\nfundamental challenge. We introduce Deliberative Alignment, a new paradigm that\ndirectly teaches the model safety specifications and trains it to explicitly\nrecall and accurately reason over the specifications before answering. We used\nthis approach to align OpenAI's o-series models, and achieved highly precise\nadherence to OpenAI's safety policies, without requiring human-written\nchain-of-thoughts or answers. Deliberative Alignment pushes the Pareto frontier\nby simultaneously increasing robustness to jailbreaks while decreasing\noverrefusal rates, and also improves out-of-distribution generalization. We\ndemonstrate that reasoning over explicitly specified policies enables more\nscalable, trustworthy, and interpretable alignment.\n","authors":["Melody Y. 
Guan","Manas Joglekar","Eric Wallace","Saachi Jain","Boaz Barak","Alec Helyar","Rachel Dias","Andrea Vallone","Hongyu Ren","Jason Wei","Hyung Won Chung","Sam Toyer","Johannes Heidecke","Alex Beutel","Amelia Glaese"],"pdf_url":"https://arxiv.org/pdf/2412.16339v2.pdf","comment":"24 pages"},{"id":"http://arxiv.org/abs/2501.01950v2","updated":"2025-01-08T20:09:16Z","published":"2025-01-03T18:54:26Z","title":"MADGEN: Mass-Spec attends to De Novo Molecular generation","summary":" The annotation (assigning structural chemical identities) of MS/MS spectra\nremains a significant challenge due to the enormous molecular diversity in\nbiological samples and the limited scope of reference databases. Currently, the\nvast majority of spectral measurements remain in the \"dark chemical space\"\nwithout structural annotations. To improve annotation, we propose MADGEN\n(Mass-spec Attends to De Novo Molecular GENeration), a scaffold-based method\nfor de novo molecular structure generation guided by mass spectrometry data.\nMADGEN operates in two stages: scaffold retrieval and spectra-conditioned\nmolecular generation starting with the scaffold. In the first stage, given an\nMS/MS spectrum, we formulate scaffold retrieval as a ranking problem and employ\ncontrastive learning to align mass spectra with candidate molecular scaffolds.\nIn the second stage, starting from the retrieved scaffold, we employ the MS/MS\nspectrum to guide an attention-based generative model to generate the final\nmolecule. Our approach constrains the molecular generation search space,\nreducing its complexity and improving generation accuracy. 
We evaluate MADGEN\non three datasets (NIST23, CANOPUS, and MassSpecGym) and evaluate MADGEN's\nperformance with a predictive scaffold retriever and with an oracle retriever.\nWe demonstrate the effectiveness of using attention to integrate spectral\ninformation throughout the generation process to achieve strong results with\nthe oracle retriever.\n","authors":["Yinkai Wang","Xiaohui Chen","Liping Liu","Soha Hassoun"],"pdf_url":"https://arxiv.org/pdf/2501.01950v2.pdf","comment":"preprint"},{"id":"http://arxiv.org/abs/2412.18036v2","updated":"2025-01-08T19:44:56Z","published":"2024-12-23T23:09:56Z","title":"Explainability in Neural Networks for Natural Language Processing Tasks","summary":" Neural networks are widely regarded as black-box models, creating significant\nchallenges in understanding their inner workings, especially in natural\nlanguage processing (NLP) applications. To address this opacity, model\nexplanation techniques like Local Interpretable Model-Agnostic Explanations\n(LIME) have emerged as essential tools for providing insights into the behavior\nof these complex systems. This study leverages LIME to interpret a multi-layer\nperceptron (MLP) neural network trained on a text classification task. By\nanalyzing the contribution of individual features to model predictions, the\nLIME approach enhances interpretability and supports informed decision-making.\nDespite its effectiveness in offering localized explanations, LIME has\nlimitations in capturing global patterns and feature interactions. 
This\nresearch highlights the strengths and shortcomings of LIME and proposes\ndirections for future work to achieve more comprehensive interpretability in\nneural NLP models.\n","authors":["Melkamu Mersha","Mingiziem Bitewa","Tsion Abay","Jugal Kalita"],"pdf_url":"https://arxiv.org/pdf/2412.18036v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.11977v2","updated":"2025-01-08T19:17:14Z","published":"2024-10-15T18:33:42Z","title":"Generative AI Policies under the Microscope: How CS Conferences Are\n Navigating the New Frontier in Scholarly Writing","summary":" This paper explores the current state of generative AI policies of computer\nscience conferences and offers guidelines for policy adoption.\n","authors":["Mahjabin Nahar","Sian Lee","Becky Guillen","Dongwon Lee"],"pdf_url":"https://arxiv.org/pdf/2410.11977v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.17309v3","updated":"2025-01-08T19:00:00Z","published":"2024-10-22T18:00:00Z","title":"Literature Meets Data: A Synergistic Approach to Hypothesis Generation","summary":" AI holds promise for transforming scientific processes, including hypothesis\ngeneration. Prior work on hypothesis generation can be broadly categorized into\ntheory-driven and data-driven approaches. While both have proven effective in\ngenerating novel and plausible hypotheses, it remains an open question whether\nthey can complement each other. To address this, we develop the first method\nthat combines literature-based insights with data to perform LLM-powered\nhypothesis generation. We apply our method on five different datasets and\ndemonstrate that integrating literature and data outperforms other baselines\n(8.97\\% over few-shot, 15.75\\% over literature-based alone, and 3.37\\% over\ndata-driven alone). 
Additionally, we conduct the first human evaluation to\nassess the utility of LLM-generated hypotheses in assisting human\ndecision-making on two challenging tasks: deception detection and AI generated\ncontent detection. Our results show that human accuracy improves significantly\nby 7.44\\% and 14.19\\% on these tasks, respectively. These findings suggest that\nintegrating literature-based and data-driven approaches provides a\ncomprehensive and nuanced framework for hypothesis generation and could open\nnew avenues for scientific inquiry.\n","authors":["Haokun Liu","Yangqiaoyu Zhou","Mingxuan Li","Chenfei Yuan","Chenhao Tan"],"pdf_url":"https://arxiv.org/pdf/2410.17309v3.pdf","comment":"37 pages, 9 figures, code link:\n https://github.com/ChicagoHAI/hypothesis-generation"},{"id":"http://arxiv.org/abs/2501.04682v1","updated":"2025-01-08T18:42:48Z","published":"2025-01-08T18:42:48Z","title":"Towards System 2 Reasoning in LLMs: Learning How to Think With Meta\n Chain-of-Thought","summary":" We propose a novel framework, Meta Chain-of-Thought (Meta-CoT), which extends\ntraditional Chain-of-Thought (CoT) by explicitly modeling the underlying\nreasoning required to arrive at a particular CoT. We present empirical evidence\nfrom state-of-the-art models exhibiting behaviors consistent with in-context\nsearch, and explore methods for producing Meta-CoT via process supervision,\nsynthetic data generation, and search algorithms. Finally, we outline a\nconcrete pipeline for training a model to produce Meta-CoTs, incorporating\ninstruction tuning with linearized search traces and reinforcement learning\npost-training. Finally, we discuss open research questions, including scaling\nlaws, verifier roles, and the potential for discovering novel reasoning\nalgorithms. 
This work provides a theoretical and practical roadmap to enable\nMeta-CoT in LLMs, paving the way for more powerful and human-like reasoning in\nartificial intelligence.\n","authors":["Violet Xiang","Charlie Snell","Kanishk Gandhi","Alon Albalak","Anikait Singh","Chase Blagden","Duy Phung","Rafael Rafailov","Nathan Lile","Dakota Mahan","Louis Castricato","Jan-Philipp Franken","Nick Haber","Chelsea Finn"],"pdf_url":"https://arxiv.org/pdf/2501.04682v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04765v1","updated":"2025-01-08T18:38:25Z","published":"2025-01-08T18:38:25Z","title":"TREAD: Token Routing for Efficient Architecture-agnostic Diffusion\n Training","summary":" Diffusion models have emerged as the mainstream approach for visual\ngeneration. However, these models usually suffer from sample inefficiency and\nhigh training costs. This issue is particularly pronounced in the standard\ndiffusion transformer architecture due to its quadratic complexity relative to\ninput length. Recent works have addressed this by reducing the number of tokens\nprocessed in the model, often through masking. In contrast, this work aims to\nimprove the training efficiency of the diffusion backbone by using predefined\nroutes that store this information until it is reintroduced to deeper layers of\nthe model, rather than discarding these tokens entirely. Further, we combine\nmultiple routes and introduce an adapted auxiliary loss that accounts for all\napplied routes. Our method is not limited to the common transformer-based model\n- it can also be applied to state-space models. Unlike most current approaches,\nTREAD achieves this without architectural modifications. Finally, we show that\nour method reduces the computational cost and simultaneously boosts model\nperformance on the standard benchmark ImageNet-1K 256 x 256 in\nclass-conditional synthesis. 
Both of these benefits multiply to a convergence\nspeedup of 9.55x at 400K training iterations compared to DiT and 25.39x\ncompared to the best benchmark performance of DiT at 7M training iterations.\n","authors":["Felix Krause","Timy Phan","Vincent Tao Hu","Björn Ommer"],"pdf_url":"https://arxiv.org/pdf/2501.04765v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.02132v2","updated":"2025-01-08T16:26:36Z","published":"2025-01-03T22:53:43Z","title":"A hybrid marketplace of ideas","summary":" The convergence of humans and artificial intelligence systems introduces new\ndynamics into the cultural and intellectual landscape. Complementing emerging\ncultural evolution concepts such as machine culture, AI agents represent a\nsignificant techno-sociological development, particularly within the\nanthropological study of Web3 as a community focused on decentralization\nthrough blockchain. Despite their growing presence, the cultural significance\nof AI agents remains largely unexplored in academic literature. Toward this\nend, we conceived hybrid netnography, a novel interdisciplinary approach that\nexamines the cultural and intellectual dynamics within digital ecosystems by\nanalyzing the interactions and contributions of both human and AI agents as\nco-participants in shaping narratives, ideas, and cultural artifacts. We argue\nthat, within the Web3 community on the social media platform X, these agents\nchallenge traditional notions of participation and influence in public\ndiscourse, creating a hybrid marketplace of ideas, a conceptual space where\nhuman and AI generated ideas coexist and compete for attention. 
We examine the\ncurrent state of AI agents in idea generation, propagation, and engagement,\npositioning their role as cultural agents through the lens of memetics and\nencouraging further inquiry into their cultural and societal impact.\nAdditionally, we address the implications of this paradigm for privacy,\nintellectual property, and governance, highlighting the societal and legal\nchallenges of integrating AI agents into the hybrid marketplace of ideas.\n","authors":["Tomer Jordi Chaffer","Dontrail Cotlage","Justin Goldston"],"pdf_url":"https://arxiv.org/pdf/2501.02132v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04747v1","updated":"2025-01-08T10:31:16Z","published":"2025-01-08T10:31:16Z","title":"Discovering new robust local search algorithms with neuro-evolution","summary":" This paper explores a novel approach aimed at overcoming existing challenges\nin the realm of local search algorithms. Our aim is to improve the decision\nprocess that takes place within a local search algorithm so as to make the best\npossible transitions in the neighborhood at each iteration. To improve this\nprocess, we propose to use a neural network that has the same input information\nas conventional local search algorithms. In this paper, which is an extension\nof the work [Goudet et al. 2024] presented at EvoCOP2024, we investigate\ndifferent ways of representing this information so as to make the algorithm as\nefficient as possible but also robust to monotonic transformations of the\nproblem objective function. To assess the efficiency of this approach, we\ndevelop an experimental setup centered around NK landscape problems, offering\nthe flexibility to adjust problem size and ruggedness. 
This approach offers a\npromising avenue for the emergence of new local search algorithms and the\nimprovement of their problem-solving capabilities for black-box problems.\n","authors":["Mohamed Salim Amri Sakhri","Adrien Goëffon","Olivier Goudet","Frédéric Saubion","Chaïmaâ Touhami"],"pdf_url":"https://arxiv.org/pdf/2501.04747v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05486v1","updated":"2025-01-08T16:53:25Z","published":"2025-01-08T16:53:25Z","title":"Towards an Ontology of Traceable Impact Management in the Food Supply\n Chain","summary":" The pursuit of quality improvements and accountability in the food supply\nchains, especially how they relate to food-related outcomes, such as hunger,\nhas become increasingly vital, necessitating a comprehensive approach that\nencompasses product quality and its impact on various stakeholders and their\ncommunities. Such an approach offers numerous benefits in increasing product\nquality and eliminating superfluous measurements while appraising and\nalleviating the broader societal and environmental repercussions. A traceable\nimpact management model (TIMM) provides an impact structure and a reporting\nmechanism that identifies each stakeholder's role in the total impact of food\nproduction and consumption stages.\n The model aims to increase traceability's utility in understanding the impact\nof changes on communities affected by food production and consumption, aligning\nwith current and future government requirements, and addressing the needs of\ncommunities and consumers. 
This holistic approach is further supported by an\nontological model that forms the logical foundation and a unified terminology.\nBy proposing a holistic and integrated solution across multiple stakeholders,\nthe model emphasizes quality and the extensive impact of championing\naccountability, sustainability, and responsible practices with global\ntraceability.\n With these combined efforts, the food supply chain moves toward a global\ntracking and tracing process that not only ensures product quality but also\naddresses its impact on a broader scale, fostering accountability,\nsustainability, and responsible food production and consumption.\n","authors":["Bart Gajderowicz","Mark S Fox","Yongchao Gao"],"pdf_url":"https://arxiv.org/pdf/2501.05486v1.pdf","comment":null}]},"2025-01-09T00:00:00Z":{"Robotics":[{"id":"http://arxiv.org/abs/2409.05545v2","updated":"2025-01-09T18:51:52Z","published":"2024-09-09T12:11:18Z","title":"Adaptive Probabilistic Planning for the Uncertain and Dynamic\n Orienteering Problem","summary":" The Orienteering Problem (OP) is a well-studied routing problem that has been\nextended to incorporate uncertainties, reflecting stochastic or dynamic travel\ncosts, prize-collection costs, and prizes. Existing approaches may, however, be\ninefficient in real-world applications due to insufficient modeling knowledge\nand initially unknowable parameters in online scenarios. Thus, we propose the\nUncertain and Dynamic Orienteering Problem (UDOP), modeling travel costs as\ndistributions with unknown and time-variant parameters. UDOP also associates\nuncertain travel costs with dynamic prizes and prize-collection costs for its\nobjective and budget constraints. To address UDOP, we develop an ADaptive\nApproach for Probabilistic paThs - ADAPT, that iteratively performs 'execution'\nand 'online planning' based on an initial 'offline' solution. The execution\nphase updates system status and records online cost observations. 
The online\nplanner employs a Bayesian approach to adaptively estimate power consumption\nand optimize path sequence based on safety beliefs. We evaluate ADAPT in a\npractical Unmanned Aerial Vehicle (UAV) charging scheduling problem for\nWireless Rechargeable Sensor Networks. The UAV must optimize its path to\nrecharge sensor nodes efficiently while managing its energy under uncertain\nconditions. ADAPT maintains comparable solution quality and computation time\nwhile offering superior robustness. Extensive simulations show that ADAPT\nachieves a 100% Mission Success Rate (MSR) across all tested scenarios,\noutperforming comparable heuristic-based and frequentist approaches that fail\nup to 70% (under challenging conditions) and averaging 67% MSR, respectively.\nThis work advances the field of OP with uncertainties, offering a reliable and\nefficient approach for real-world applications in uncertain and dynamic\nenvironments.\n","authors":["Qiuchen Qian","Yanran Wang","David Boyle"],"pdf_url":"https://arxiv.org/pdf/2409.05545v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05439v1","updated":"2025-01-09T18:49:39Z","published":"2025-01-09T18:49:39Z","title":"From Simple to Complex Skills: The Case of In-Hand Object Reorientation","summary":" Learning policies in simulation and transferring them to the real world has\nbecome a promising approach in dexterous manipulation. However, bridging the\nsim-to-real gap for each new task requires substantial human effort, such as\ncareful reward engineering, hyperparameter tuning, and system identification.\nIn this work, we present a system that leverages low-level skills to address\nthese challenges for more complex tasks. Specifically, we introduce a\nhierarchical policy for in-hand object reorientation based on previously\nacquired rotation skills. This hierarchical policy learns to select which\nlow-level skill to execute based on feedback from both the environment and the\nlow-level skill policies themselves. 
Compared to learning from scratch, the\nhierarchical policy is more robust to out-of-distribution changes and transfers\neasily from simulation to real-world environments. Additionally, we propose a\ngeneralizable object pose estimator that uses proprioceptive information,\nlow-level skill predictions, and control errors as inputs to estimate the\nobject pose over time. We demonstrate that our system can reorient objects,\nincluding symmetrical and textureless ones, to a desired pose.\n","authors":["Haozhi Qi","Brent Yi","Mike Lambeta","Yi Ma","Roberto Calandra","Jitendra Malik"],"pdf_url":"https://arxiv.org/pdf/2501.05439v1.pdf","comment":"website: https://dexhier.github.io"},{"id":"http://arxiv.org/abs/2501.05420v1","updated":"2025-01-09T18:22:10Z","published":"2025-01-09T18:22:10Z","title":"RoboPanoptes: The All-seeing Robot with Whole-body Dexterity","summary":" We present RoboPanoptes, a capable yet practical robot system that achieves\nwhole-body dexterity through whole-body vision. Its whole-body dexterity allows\nthe robot to utilize its entire body surface for manipulation, such as\nleveraging multiple contact points or navigating constrained spaces. Meanwhile,\nwhole-body vision uses a camera system distributed over the robot's surface to\nprovide comprehensive, multi-perspective visual feedback of its own and the\nenvironment's state. At its core, RoboPanoptes uses a whole-body visuomotor\npolicy that learns complex manipulation skills directly from human\ndemonstrations, efficiently aggregating information from the distributed\ncameras while maintaining resilience to sensor failures. Together, these design\naspects unlock new capabilities and tasks, allowing RoboPanoptes to unbox in\nnarrow spaces, sweep multiple or oversized objects, and succeed in multi-step\nstowing in cluttered environments, outperforming baselines in adaptability and\nefficiency. 
Results are best viewed on https://robopanoptes.github.io.\n","authors":["Xiaomeng Xu","Dominik Bauer","Shuran Song"],"pdf_url":"https://arxiv.org/pdf/2501.05420v1.pdf","comment":"Project website: https://robopanoptes.github.io"},{"id":"http://arxiv.org/abs/2501.05418v1","updated":"2025-01-09T18:20:57Z","published":"2025-01-09T18:20:57Z","title":"Virtual-Work Based Shape-Force Sensing for Continuum Instruments with\n Tension-Feedback Actuation","summary":" Continuum instruments are integral to robot-assisted minimally invasive\nsurgery (MIS), with tendon-driven mechanisms being the most common. Real-time\ntension feedback is crucial for precise articulation but remains a challenge in\ncompact actuation unit designs. Additionally, accurate shape and external force\nsensing of continuum instruments are essential for advanced control and\nmanipulation. This paper presents a compact and modular actuation unit that\nintegrates a torque cell directly into the pulley module to provide real-time\ntension feedback. Building on this unit, we propose a novel shape-force sensing\nframework that incorporates polynomial curvature kinematics to accurately model\nnon-constant curvature. 
The framework combines pose sensor measurements at the\ninstrument tip and actuation tension feedback at the developed actuation unit.\nExperimental results demonstrate the improved performance of the proposed\nshape-force sensing framework in terms of shape reconstruction accuracy and\nforce estimation reliability compared to conventional constant-curvature\nmethods.\n","authors":["Guoqing Zhang","Zihan Chen","Long Wang"],"pdf_url":"https://arxiv.org/pdf/2501.05418v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05411v1","updated":"2025-01-09T18:10:16Z","published":"2025-01-09T18:10:16Z","title":"Adaptive Path-Planning for Autonomous Robots: A UCH-Enhanced Q-Learning\n Approach","summary":" Q-learning methods are widely used in robot path planning but often face\nchallenges of inefficient search and slow convergence. We propose an Improved\nQ-learning (IQL) framework that enhances standard Q-learning in two significant\nways. First, we introduce the Path Adaptive Collaborative Optimization (PACO)\nalgorithm to optimize Q-table initialization, providing better initial\nestimates and accelerating learning. Second, we incorporate a\nUtility-Controlled Heuristic (UCH) mechanism with dynamically tuned parameters\nto optimize the reward function, enhancing the algorithm's accuracy and\neffectiveness in path-planning tasks. Extensive experiments in three different\nraster grid environments validate the superior performance of our IQL\nframework. 
The results demonstrate that our IQL algorithm outperforms existing\nmethods, including FIQL, PP-QL-based CPP, DFQL, and QMABC algorithms, in terms\nof path-planning capabilities.\n","authors":["Wei Liu","Ruiyang Wang","Haonan Wang","Guangwei Liu"],"pdf_url":"https://arxiv.org/pdf/2501.05411v1.pdf","comment":"25 pages, 20 figures"},{"id":"http://arxiv.org/abs/2408.00846v2","updated":"2025-01-09T16:55:55Z","published":"2024-08-01T18:01:23Z","title":"Occupation-aware planning method for robotic monitoring missions in\n dynamic environments","summary":" This paper presents a method for robotic monitoring missions in the presence\nof moving obstacles. Although the scenario map is known, the robot lacks\ninformation about the movement of dynamic obstacles during the monitoring\nmission. Numerous local planners have been developed in recent years for\nnavigating highly dynamic environments. However, the absence of a global\nplanner for these environments can result in unavoidable collisions or the\ninability to successfully complete missions in densely populated areas, such as\na scenario monitoring in our case. This work addresses the development and\nevaluation of a global planner, $MADA$ (Monitoring Avoiding Dynamic Areas),\naimed at enhancing the deployment of robots in such challenging conditions. The\nrobot plans and executes the mission using the proposed two-step approach. The\nfirst step involves selecting the observation goal based on the environment's\ndistribution and estimated monitoring costs. In the second step, the robot\nidentifies areas with moving obstacles and obtains paths avoiding densely\noccupied dynamic regions based on their occupation. 
Quantitative and\nqualitative results based on simulations and on real-world experimentation,\nconfirm that the proposed method allows the robot to effectively monitor most\nof the environment while avoiding densely occupied dynamic areas.\n","authors":["Yaroslav Marchukov","Luis Montano"],"pdf_url":"https://arxiv.org/pdf/2408.00846v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.00242v2","updated":"2025-01-09T16:43:16Z","published":"2024-12-31T03:17:05Z","title":"Automotive Speed Estimation: Sensor Types and Error Characteristics from\n OBD-II to ADAS","summary":" Modern on-road navigation systems heavily depend on integrating speed\nmeasurements with inertial navigation systems (INS) and global navigation\nsatellite systems (GNSS). Telemetry-based applications typically source speed\ndata from the On-Board Diagnostic II (OBD-II) system. However, the method of\nderiving speed, as well as the types of sensors used to measure wheel speed,\ndiffers across vehicles. These differences result in varying error\ncharacteristics that must be accounted for in navigation and autonomy\napplications. This paper addresses this gap by examining the diverse\nspeed-sensing technologies employed in standard automotive systems and\nalternative techniques used in advanced systems designed for higher levels of\nautonomy, such as Advanced Driver Assistance Systems (ADAS), Autonomous Driving\n(AD), or surveying applications. We propose a method to identify the type of\nspeed sensor in a vehicle and present strategies for accurately modeling its\nerror characteristics. To validate our approach, we collected and analyzed data\nfrom three long real road trajectories conducted in urban environments in\nToronto and Kingston, Ontario, Canada. 
The results underscore the critical role\nof integrating multiple sensor modalities to achieve more accurate speed\nestimation, thus improving automotive navigation state estimation, particularly\nin GNSS-denied environments.\n","authors":["Hany Ragab","Sidney Givigi","Aboelmagd Noureldin"],"pdf_url":"https://arxiv.org/pdf/2501.00242v2.pdf","comment":"7 pages, 12 figures, to be published in conference proceedings"},{"id":"http://arxiv.org/abs/2501.05329v1","updated":"2025-01-09T15:55:08Z","published":"2025-01-09T15:55:08Z","title":"Knowledge Transfer in Model-Based Reinforcement Learning Agents for\n Efficient Multi-Task Learning","summary":" We propose an efficient knowledge transfer approach for model-based\nreinforcement learning, addressing the challenge of deploying large world\nmodels in resource-constrained environments. Our method distills a\nhigh-capacity multi-task agent (317M parameters) into a compact 1M parameter\nmodel, achieving state-of-the-art performance on the MT30 benchmark with a\nnormalized score of 28.45, a substantial improvement over the original 1M\nparameter model's score of 18.93. This demonstrates the ability of our\ndistillation technique to consolidate complex multi-task knowledge effectively.\nAdditionally, we apply FP16 post-training quantization, reducing the model size\nby 50% while maintaining performance. 
Our work bridges the gap between the\npower of large models and practical deployment constraints, offering a scalable\nsolution for efficient and accessible multi-task reinforcement learning in\nrobotics and other resource-limited domains.\n","authors":["Dmytro Kuzmenko","Nadiya Shvai"],"pdf_url":"https://arxiv.org/pdf/2501.05329v1.pdf","comment":"Preprint of an extended abstract accepted to AAMAS 2025"},{"id":"http://arxiv.org/abs/2405.17794v2","updated":"2025-01-09T15:15:40Z","published":"2024-05-28T03:45:32Z","title":"LNS2+RL: Combining Multi-Agent Reinforcement Learning with Large\n Neighborhood Search in Multi-Agent Path Finding","summary":" Multi-Agent Path Finding (MAPF) is a critical component of logistics and\nwarehouse management, which focuses on planning collision-free paths for a team\nof robots in a known environment. Recent work introduced a novel MAPF approach,\nLNS2, which proposed to repair a quickly obtained set of infeasible paths via\niterative replanning, by relying on a fast, yet lower-quality, prioritized\nplanning (PP) algorithm. At the same time, there has been a recent push for\nMulti-Agent Reinforcement Learning (MARL) based MAPF algorithms, which exhibit\nimproved cooperation over such PP algorithms, although inevitably remaining\nslower. In this paper, we introduce a new MAPF algorithm, LNS2+RL, which\ncombines the distinct yet complementary characteristics of LNS2 and MARL to\neffectively balance their individual limitations and get the best from both\nworlds. During early iterations, LNS2+RL relies on MARL for low-level\nreplanning, which we show eliminates collisions much more than a PP algorithm.\nThere, our MARL-based planner allows agents to reason about past and future\ninformation to gradually learn cooperative decision-making through a finely\ndesigned curriculum learning. 
At later stages of planning, LNS2+RL adaptively\nswitches to PP algorithm to quickly resolve the remaining collisions, naturally\ntrading off solution quality (number of collisions in the solution) and\ncomputational efficiency. Our comprehensive experiments on high-agent-density\ntasks across various team sizes, world sizes, and map structures consistently\ndemonstrate the superior performance of LNS2+RL compared to many MAPF\nalgorithms, including LNS2, LaCAM, EECBS, and SCRIMP. In maps with complex\nstructures, the advantages of LNS2+RL are particularly pronounced, with LNS2+RL\nachieving a success rate of over 50% in nearly half of the tested tasks, while\nthat of LaCAM, EECBS and SCRIMP falls to 0%.\n","authors":["Yutong Wang","Tanishq Duhan","Jiaoyang Li","Guillaume Sartoretti"],"pdf_url":"https://arxiv.org/pdf/2405.17794v2.pdf","comment":"Accepted for presentation at AAAI 2025"},{"id":"http://arxiv.org/abs/2501.02580v2","updated":"2025-01-09T15:10:09Z","published":"2025-01-05T15:26:36Z","title":"LP-ICP: General Localizability-Aware Point Cloud Registration for Robust\n Localization in Extreme Unstructured Environments","summary":" The Iterative Closest Point (ICP) algorithm is a crucial component of\nLiDAR-based SLAM algorithms. However, its performance can be negatively\naffected in unstructured environments that lack features and geometric\nstructures, leading to low accuracy and poor robustness in localization and\nmapping. It is known that degeneracy caused by the lack of geometric\nconstraints can lead to errors in 6-DOF pose estimation along ill-conditioned\ndirections. Therefore, there is a need for a broader and more fine-grained\ndegeneracy detection and handling method. This paper proposes a new point cloud\nregistration framework, LP-ICP, that combines point-to-line and point-to-plane\ndistance metrics in the ICP algorithm, with localizability detection and\nhandling. 
LP-ICP consists of a localizability detection module and an\noptimization module. The localizability detection module performs\nlocalizability analysis by utilizing the correspondences between edge points\n(with low local smoothness) to lines and planar points (with high local\nsmoothness) to planes between the scan and the map. The localizability\ncontribution of individual correspondence constraints can be applied to a\nbroader range. The optimization module adds additional soft and hard\nconstraints to the optimization equations based on the localizability category.\nThis allows the pose to be constrained along ill-conditioned directions, with\nupdates either tending towards the constraint value or leaving the initial\nestimate unchanged. This improves accuracy and reduces fluctuations. The\nproposed method is extensively evaluated through experiments on both simulation\nand real-world datasets, demonstrating higher or comparable accuracy than the\nstate-of-the-art methods. The dataset and code of this paper will also be\nopen-sourced at https://github.com/xuqingyuan2000/LP-ICP.\n","authors":["Haosong Yue","Qingyuan Xu","Fei Chen","Jia Pan","Weihai Chen"],"pdf_url":"https://arxiv.org/pdf/2501.02580v2.pdf","comment":"18 Pages, 8 Figures Submitted to IEEE Transactions on Automation\n Science and Engineering"},{"id":"http://arxiv.org/abs/2310.09589v2","updated":"2025-01-09T14:46:41Z","published":"2023-10-14T14:11:46Z","title":"Airborne Sense and Detect of Drones using Deep Learning and LiDAR Point\n Clouds","summary":" The safe operation of drone swarms beyond visual line of sight requires\nmultiple safeguards to mitigate the risk of collision between drones flying in\nclose-proximity scenarios. Cooperative navigation and flight coordination\nstrategies that rely on pre-planned trajectories, constant satellite and\nnetwork connectivity and reliable Global Navigation Satellite System (GNSS)\npositioning are brittle to failure. 
Drone embedded sense and detect offers a\ncomprehensive mode of separation between drones for deconfliction and collision\navoidance. This paper presents the first airborne LiDAR based solution for\ndrone-swarm detection and localization using 3D deep learning model. It adapts\nan existing deep learning neural network to the air-to-air drone scenario by\nexpanding the scan space vertically. A new sparse convolution is proposed and\napplied to accelerate the backbone layer, which is the most time-consuming part\nof the neural network. To collect training data of safety critical,\nclose-proximity multi-drone operations, a scenario Digital Twin is used to\naugment real datasets with high fidelity synthetic data. The trained model\nachieves over 80% recall and 96% precision when tested on real-world datasets.\nBy incorporating a tracking-by-detection algorithm the system can reliably\nmonitor the separation distance of multiple drones in challenging environments.\n","authors":["Manduhu Manduhu","Alexander Dow","Petar Trslic","Gerard Dooly","Benjamin Blanck","James Riordan"],"pdf_url":"https://arxiv.org/pdf/2310.09589v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.15589v2","updated":"2025-01-09T14:30:07Z","published":"2024-09-23T22:47:42Z","title":"Beyond Humanoid Prosthetic Hands: Modular Terminal Devices That Improve\n User Performance","summary":" Despite decades of research and development, myoelectric prosthetic hands\nlack functionality and are often rejected by users. This lack in functionality\ncan be partially attributed to the widely accepted anthropomorphic design\nideology in the field; attempting to replicate human hand form and function\ndespite severe limitations in control and sensing technology. Instead,\nprosthetic hands can be tailored to perform specific tasks without increasing\ncomplexity by shedding the constraints of anthropomorphism. 
In this paper, we\ndevelop and evaluate four open-source modular non-humanoid devices to perform\nthe motion required to replicate human flicking motion and to twist a\nscrewdriver, and the functionality required to pick and place flat objects and\nto cut paper. Experimental results from these devices demonstrate that, versus\na humanoid prosthesis, non-humanoid prosthesis design dramatically improves\ntask performance, reduces user compensatory movement, and reduces task load.\nCase studies with two end users demonstrate the translational benefits of this\nresearch. We found that special attention should be paid to monitoring end-user\ntask load to ensure positive rehabilitation outcomes.\n","authors":["Digby Chappell","Barry Mulvey","Shehara Perera","Fernando Bello","Petar Kormushev","Nicolas Rojas"],"pdf_url":"https://arxiv.org/pdf/2409.15589v2.pdf","comment":"10 pages, 10 figures, 2 tables. Accepted for publication in IEEE\n Transactions on Neural Systems and Rehabilitation Engineering"},{"id":"http://arxiv.org/abs/2409.16828v3","updated":"2025-01-09T14:10:38Z","published":"2024-09-25T11:29:26Z","title":"On the role of Artificial Intelligence methods in modern\n force-controlled manufacturing robotic tasks","summary":" This position paper explores the integration of Artificial Intelligence (AI)\ninto force-controlled robotic tasks within the scope of advanced manufacturing,\na cornerstone of Industry 4.0. AI's role in enhancing robotic manipulators -\nkey drivers in the Fourth Industrial Revolution - is rapidly leading to\nsignificant innovations in smart manufacturing. The objective of this article\nis to frame these innovations in practical force-controlled applications - e.g.\ndeburring, polishing, and assembly tasks like peg-in-hole (PiH) - highlighting\ntheir necessity for maintaining high-quality production standards. 
By reporting\non recent AI-based methodologies, this article contrasts them and identifies\ncurrent challenges to be addressed in future research. The analysis concludes\nwith a perspective on future research directions, emphasizing the need for\ncommon performance metrics to validate AI techniques, integration of various\nenhancements for performance optimization, and the importance of validating\nthem in relevant scenarios. These future directions aim to provide consistency\nwith already adopted approaches, so as to be compatible with manufacturing\nstandards, increasing the relevance of AI-driven methods in both academic and\nindustrial contexts.\n","authors":["Vincenzo Petrone","Enrico Ferrentino","Pasquale Chiacchio"],"pdf_url":"https://arxiv.org/pdf/2409.16828v3.pdf","comment":"In Proceedings of the 21st International Conference on Informatics in\n Control, Automation and Robotics - Volume 1: ICINCO, 392-399, 2024 , Porto,\n Portugal"},{"id":"http://arxiv.org/abs/2311.16623v2","updated":"2025-01-09T13:59:21Z","published":"2023-11-28T09:24:42Z","title":"Visual Semantic Navigation with Real Robots","summary":" Visual Semantic Navigation (VSN) is the ability of a robot to learn visual\nsemantic information for navigating in unseen environments. These VSN models\nare typically tested in those virtual environments where they are trained,\nmainly using reinforcement learning based approaches. Therefore, we do not yet\nhave an in-depth analysis of how these models would behave in the real world.\nIn this work, we propose a new solution to integrate VSN models into real\nrobots, so that we have true embodied agents. We also release a novel ROS-based\nframework for VSN, ROS4VSN, so that any VSN-model can be easily deployed in any\nROS-compatible robot and tested in a real setting. 
Our experiments with two\ndifferent robots, where we have embedded two state-of-the-art VSN agents,\nconfirm that there is a noticeable performance difference of these VSN\nsolutions when tested in real-world and simulation environments. We hope that\nthis research will endeavor to provide a foundation for addressing this\nconsequential issue, with the ultimate aim of advancing the performance and\nefficiency of embodied agents within authentic real-world scenarios. Code to\nreproduce all our experiments can be found at\nhttps://github.com/gramuah/ros4vsn.\n","authors":["Carlos Gutiérrez-Álvarez","Pablo Ríos-Navarro","Rafael Flor-Rodríguez","Francisco Javier Acevedo-Rodríguez","Roberto J. López-Sastre"],"pdf_url":"https://arxiv.org/pdf/2311.16623v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05204v1","updated":"2025-01-09T12:55:21Z","published":"2025-01-09T12:55:21Z","title":"Design and Control of a Bipedal Robotic Character","summary":" Legged robots have achieved impressive feats in dynamic locomotion in\nchallenging unstructured terrain. However, in entertainment applications, the\ndesign and control of these robots face additional challenges in appealing to\nhuman audiences. This work aims to unify expressive, artist-directed motions\nand robust dynamic mobility for legged robots. To this end, we introduce a new\nbipedal robot, designed with a focus on character-driven mechanical features.\nWe present a reinforcement learning-based control architecture to robustly\nexecute artistic motions conditioned on command signals. During runtime, these\ncommand signals are generated by an animation engine which composes and blends\nbetween multiple animation sources. Finally, an intuitive operator interface\nenables real-time show performances with the robot. 
The complete system results\nin a believable robotic character, and paves the way for enhanced human-robot\nengagement in various contexts, in entertainment robotics and beyond.\n","authors":["Ruben Grandia","Espen Knoop","Michael A. Hopkins","Georg Wiedebach","Jared Bishop","Steven Pickles","David Müller","Moritz Bächer"],"pdf_url":"https://arxiv.org/pdf/2501.05204v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05198v1","updated":"2025-01-09T12:48:40Z","published":"2025-01-09T12:48:40Z","title":"Dexterous Manipulation of Deformable Objects via Pneumatic Gripping:\n Lifting by One End","summary":" Manipulating deformable objects in robotic cells is often costly and not\nwidely accessible. However, the use of localized pneumatic gripping systems can\nenhance accessibility. Current methods that use pneumatic grippers to handle\ndeformable objects struggle with effective lifting. This paper introduces a\nmethod for the dexterous lifting of textile deformable objects from one edge,\nutilizing a previously developed gripper designed for flexible and porous\nmaterials. By precisely adjusting the orientation and position of the gripper\nduring the lifting process, we were able to significantly reduce necessary\ngripping force and minimize object vibration caused by airflow. This method was\ntested and validated on four materials with varying mass, friction, and\nflexibility. The proposed approach facilitates the lifting of deformable\nobjects from a conveyor or automated line, even when only one edge is\naccessible for grasping. 
Future work will involve integrating a vision system\nto optimize the manipulation of deformable objects with more complex shapes.\n","authors":["Roman Mykhailyshyn","Jonathan Lee","Mykhailo Mykhailyshyn","Kensuke Harada","Ann Majewicz Fey"],"pdf_url":"https://arxiv.org/pdf/2501.05198v1.pdf","comment":"Submitted to RA-L"},{"id":"http://arxiv.org/abs/2501.05156v1","updated":"2025-01-09T11:23:31Z","published":"2025-01-09T11:23:31Z","title":"State-Based Disassembly Planning","summary":" It has been shown recently that physics-based simulation significantly\nenhances the disassembly capabilities of real-world assemblies with diverse 3D\nshapes and stringent motion constraints. However, the efficiency suffers when\ntackling intricate disassembly tasks that require numerous simulations and\nincreased simulation time. In this work, we propose a State-Based Disassembly\nPlanning (SBDP) approach, prioritizing physics-based simulation with\ntranslational motion over rotational motion to facilitate autonomy, reducing\ndependency on human input, while storing intermediate motion states to improve\nsearch scalability. We introduce two novel evaluation functions derived from\nnew Directional Blocking Graphs (DBGs) enriched with state information to scale\nup the search. Our experiments show that SBDP with new evaluation functions and\nDBGs constraints outperforms the state-of-the-art in disassembly planning in\nterms of success rate and computational efficiency over benchmark datasets\nconsisting of thousands of physically valid industrial assemblies.\n","authors":["Chao Lei","Nir Lipovetzky","Krista A. 
Ehinger"],"pdf_url":"https://arxiv.org/pdf/2501.05156v1.pdf","comment":"Accepted at AAAI 2025 (extended version)"},{"id":"http://arxiv.org/abs/2501.05153v1","updated":"2025-01-09T11:16:00Z","published":"2025-01-09T11:16:00Z","title":"Assisting MoCap-Based Teleoperation of Robot Arm using Augmented Reality\n Visualisations","summary":" Teleoperating a robot arm involves the human operator positioning the robot's\nend-effector or programming each joint. Whereas humans can control their own\narms easily by integrating visual and proprioceptive feedback, it is\nchallenging to control an external robot arm in the same way, due to its\ninconsistent orientation and appearance. We explore teleoperating a robot arm\nthrough motion-capture (MoCap) of the human operator's arm with the assistance\nof augmented reality (AR) visualisations. We investigate how AR helps\nteleoperation by visualising a virtual reference of the human arm alongside the\nrobot arm to help users understand the movement mapping. We found that the AR\noverlay of a humanoid arm on the robot in the same orientation helped users\nlearn the control. We discuss findings and future work on MoCap-based robot\nteleoperation.\n","authors":["Qiushi Zhou","Antony Chacon","Jiahe Pan","Wafa Johal"],"pdf_url":"https://arxiv.org/pdf/2501.05153v1.pdf","comment":"5 pages, 7 figures, accepted to HRI 2025"},{"id":"http://arxiv.org/abs/2403.14320v3","updated":"2025-01-09T10:59:37Z","published":"2024-03-21T11:41:39Z","title":"Exosense: A Vision-Based Scene Understanding System For Exoskeletons","summary":" Self-balancing exoskeletons are a key enabling technology for individuals\nwith mobility impairments. While the current challenges focus on\nhuman-compliant hardware and control, unlocking their use for daily activities\nrequires a scene perception system. In this work, we present Exosense, a\nvision-centric scene understanding system for self-balancing exoskeletons. 
We\nintroduce a multi-sensor visual-inertial mapping device as well as a navigation\nstack for state estimation, terrain mapping and long-term operation. We tested\nExosense attached to both a human leg and Wandercraft's Personal Exoskeleton in\nreal-world indoor scenarios. This enabled us to test the system during typical\nperiodic walking gaits, as well as future uses in multi-story environments. We\ndemonstrate that Exosense can achieve an odometry drift of about 4 cm per meter\ntraveled, and construct terrain maps under 1 cm average reconstruction error.\nIt can also work in a visual localization mode in a previously mapped\nenvironment, providing a step towards long-term operation of exoskeletons.\n","authors":["Jianeng Wang","Matias Mattamala","Christina Kassab","Guillaume Burger","Fabio Elnecave","Lintong Zhang","Marine Petriaux","Maurice Fallon"],"pdf_url":"https://arxiv.org/pdf/2403.14320v3.pdf","comment":"8 pages, 9 figures"},{"id":"http://arxiv.org/abs/2501.05147v1","updated":"2025-01-09T10:56:50Z","published":"2025-01-09T10:56:50Z","title":"A Systematic Literature Review on Deep Learning-based Depth Estimation\n in Computer Vision","summary":" Depth estimation (DE) provides spatial information about a scene and enables\ntasks such as 3D reconstruction, object detection, and scene understanding.\nRecently, there has been an increasing interest in using deep learning\n(DL)-based methods for DE. Traditional techniques rely on handcrafted features\nthat often struggle to generalise to diverse scenes and require extensive\nmanual tuning. However, DL models for DE can automatically extract relevant\nfeatures from input data, adapt to various scene conditions, and generalise\nwell to unseen environments. Numerous DL-based methods have been developed,\nmaking it necessary to survey and synthesize the state-of-the-art (SOTA).\nPrevious reviews on DE have mainly focused on either monocular or stereo-based\ntechniques, rather than comprehensively reviewing DE. 
Furthermore, to the best\nof our knowledge, there is no systematic literature review (SLR) that\ncomprehensively focuses on DE. Therefore, this SLR study is being conducted.\nInitially, electronic databases were searched for relevant publications,\nresulting in 1284 publications. Using defined exclusion and quality criteria,\n128 publications were shortlisted and further filtered to select 59\nhigh-quality primary studies. These studies were analysed to extract data and\nanswer defined research questions. Based on the results, DL methods were\ndeveloped for mainly three different types of DE: monocular, stereo, and\nmulti-view. 20 publicly available datasets were used to train, test, and\nevaluate DL models for DE, with KITTI, NYU Depth V2, and Make 3D being the most\nused datasets. 29 evaluation metrics were used to assess the performance of DE.\n35 base models were reported in the primary studies, and the top five most-used\nbase models were ResNet-50, ResNet-18, ResNet-101, U-Net, and VGG-16. Finally,\nthe lack of ground truth data was among the most significant challenges\nreported by primary studies.\n","authors":["Ali Rohan","Md Junayed Hasan","Andrei Petrovski"],"pdf_url":"https://arxiv.org/pdf/2501.05147v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05141v1","updated":"2025-01-09T10:47:03Z","published":"2025-01-09T10:47:03Z","title":"OfficeMate: Pilot Evaluation of an Office Assistant Robot","summary":" Office Assistant Robots (OARs) offer a promising solution to proactively\nprovide in-situ support to enhance employee well-being and productivity in\noffice spaces. 
We introduce OfficeMate, a social OAR designed to assist with\npractical tasks, foster social interaction, and promote health and well-being.\nThrough a pilot evaluation with seven participants in an office environment, we\nfound that users see potential in OARs for reducing stress and promoting\nhealthy habits and value the robot's ability to provide companionship and\nphysical activity reminders in the office space. However, concerns regarding\nprivacy, communication, and the robot's interaction timing were also raised.\nThe feedback highlights the need to carefully consider the robot's appearance\nand behaviour to ensure it enhances user experience and aligns with office\nsocial norms. We believe these insights will better inform the development of\nadaptive, intelligent OAR systems for future office space integration.\n","authors":["Jiahe Pan","Sarah Schömbs","Yan Zhang","Ramtin Tabatabaei","Muhammad Bilal","Wafa Johal"],"pdf_url":"https://arxiv.org/pdf/2501.05141v1.pdf","comment":"5 pages, 1 figure, accepted to HRI 2025"},{"id":"http://arxiv.org/abs/2501.05107v1","updated":"2025-01-09T09:54:31Z","published":"2025-01-09T09:54:31Z","title":"Harnessing the Power of Vibration Motors to Develop Miniature Untethered\n Robotic Fishes","summary":" Miniature underwater robots play a crucial role in the exploration and\ndevelopment of marine resources, particularly in confined spaces and\nhigh-pressure deep-sea environments. This study presents the design,\noptimization, and performance of a miniature robotic fish, powered by the\noscillation of bio-inspired fins. These fins feature a rigid-flexible hybrid\nstructure and use an eccentric rotating mass (ERM) vibration motor as the\nexcitation source to generate high-frequency unidirectional oscillations that\ninduce acoustic streaming for propulsion. 
The drive mechanism, powered by\nminiature ERM vibration motors, eliminates the need for complex mechanical\ndrive systems, enabling complete isolation of the entire drive system from the\nexternal environment and facilitating the miniaturization of the robotic fish.\nA compact, untethered robotic fish, measuring 85*60*45 mm^3, is equipped with\nthree bio-inspired fins located at the pectoral and caudal positions.\nExperimental results demonstrate that the robotic fish achieves a maximum\nforward swimming speed of 1.36 body lengths (BL) per second powered by all fins\nand a minimum turning radius of 0.6 BL when powered by a single fin. These\nresults underscore the significance of employing the ERM vibration motor in\nadvancing the development of highly maneuverable, miniature untethered\nunderwater robots for various marine exploration tasks.\n","authors":["Chongjie Jiang","Yingying Dai","Jinyang Le","Xiaomeng Chen","Yu Xie","Wei Zhou","Fuzhou Niu","Ying Li","Tao Luo"],"pdf_url":"https://arxiv.org/pdf/2501.05107v1.pdf","comment":"8 pages, 8 figures"},{"id":"http://arxiv.org/abs/2501.05087v1","updated":"2025-01-09T09:11:40Z","published":"2025-01-09T09:11:40Z","title":"Enhanced Quantile Regression with Spiking Neural Networks for Long-Term\n System Health Prognostics","summary":" This paper presents a novel predictive maintenance framework centered on\nEnhanced Quantile Regression Neural Networks (EQRNNs) for anticipating system\nfailures in industrial robotics. We address the challenge of early failure\ndetection through a hybrid approach that combines advanced neural\narchitectures. The system leverages dual computational stages: first\nimplementing an EQRNN optimized for processing multi-sensor data streams\nincluding vibration, thermal, and power signatures, followed by an integrated\nSpiking Neural Network (SNN) layer that enables microsecond-level response\ntimes. 
This architecture achieves notable accuracy rates of 92.3\\% in component\nfailure prediction with a 90-hour advance warning window. Field testing\nconducted on an industrial scale with 50 robotic systems demonstrates\nsignificant operational improvements, yielding a 94\\% decrease in unexpected\nsystem failures and 76\\% reduction in maintenance-related downtimes. The\nframework's effectiveness in processing complex, multi-modal sensor data while\nmaintaining computational efficiency validates its applicability for Industry\n4.0 manufacturing environments.\n","authors":["David J Poland"],"pdf_url":"https://arxiv.org/pdf/2501.05087v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05057v1","updated":"2025-01-09T08:28:16Z","published":"2025-01-09T08:28:16Z","title":"LearningFlow: Automated Policy Learning Workflow for Urban Driving with\n Large Language Models","summary":" Recent advancements in reinforcement learning (RL) demonstrate the\nsignificant potential in autonomous driving. Despite this promise, challenges\nsuch as the manual design of reward functions and low sample efficiency in\ncomplex environments continue to impede the development of safe and effective\ndriving policies. To tackle these issues, we introduce LearningFlow, an\ninnovative automated policy learning workflow tailored to urban driving. This\nframework leverages the collaboration of multiple large language model (LLM)\nagents throughout the RL training process. LearningFlow includes a curriculum\nsequence generation process and a reward generation process, which work in\ntandem to guide the RL policy by generating tailored training curricula and\nreward functions. Particularly, each process is supported by an analysis agent\nthat evaluates training progress and provides critical insights to the\ngeneration agent. 
Through the collaborative efforts of these LLM agents,\nLearningFlow automates policy learning across a series of complex driving\ntasks, and it significantly reduces the reliance on manual reward function\ndesign while enhancing sample efficiency. Comprehensive experiments are\nconducted in the high-fidelity CARLA simulator, along with comparisons with\nother existing methods, to demonstrate the efficacy of our proposed approach.\nThe results demonstrate that LearningFlow excels in generating rewards and\ncurricula. It also achieves superior performance and robust generalization\nacross various driving tasks, as well as commendable adaptation to different RL\nalgorithms.\n","authors":["Zengqi Peng","Yubin Wang","Xu Han","Lei Zheng","Jun Ma"],"pdf_url":"https://arxiv.org/pdf/2501.05057v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05031v1","updated":"2025-01-09T07:43:49Z","published":"2025-01-09T07:43:49Z","title":"ECBench: Can Multi-modal Foundation Models Understand the Egocentric\n World? A Holistic Embodied Cognition Benchmark","summary":" The enhancement of generalization in robots by large vision-language models\n(LVLMs) is increasingly evident. Therefore, the embodied cognitive abilities of\nLVLMs based on egocentric videos are of great interest. However, current\ndatasets for embodied video question answering lack comprehensive and\nsystematic evaluation frameworks. Critical embodied cognitive issues, such as\nrobotic self-cognition, dynamic scene perception, and hallucination, are rarely\naddressed. To tackle these challenges, we propose ECBench, a high-quality\nbenchmark designed to systematically evaluate the embodied cognitive abilities\nof LVLMs. ECBench features a diverse range of scene video sources, open and\nvaried question formats, and 30 dimensions of embodied cognition. 
To ensure\nquality, balance, and high visual dependence, ECBench uses class-independent\nmeticulous human annotation and multi-round question screening strategies.\nAdditionally, we introduce ECEval, a comprehensive evaluation system that\nensures the fairness and rationality of the indicators. Utilizing ECBench, we\nconduct extensive evaluations of proprietary, open-source, and task-specific\nLVLMs. ECBench is pivotal in advancing the embodied cognitive capabilities of\nLVLMs, laying a solid foundation for developing reliable core models for\nembodied agents. All data and code are available at\nhttps://github.com/Rh-Dang/ECBench.\n","authors":["Ronghao Dang","Yuqian Yuan","Wenqi Zhang","Yifei Xin","Boqiang Zhang","Long Li","Liuyi Wang","Qinyang Zeng","Xin Li","Lidong Bing"],"pdf_url":"https://arxiv.org/pdf/2501.05031v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05014v1","updated":"2025-01-09T07:15:59Z","published":"2025-01-09T07:15:59Z","title":"UAV-VLA: Vision-Language-Action System for Large Scale Aerial Mission\n Generation","summary":" The UAV-VLA (Visual-Language-Action) system is a tool designed to facilitate\ncommunication with aerial robots. By integrating satellite imagery processing\nwith the Visual Language Model (VLM) and the powerful capabilities of GPT,\nUAV-VLA enables users to generate general flight paths-and-action plans through\nsimple text requests. This system leverages the rich contextual information\nprovided by satellite images, allowing for enhanced decision-making and mission\nplanning. The combination of visual analysis by VLM and natural language\nprocessing by GPT can provide the user with the path-and-action set, making\naerial operations more efficient and accessible. 
The newly developed method\nshowed a difference of 22% in the length of the created trajectory and a mean\nerror of 34.22 m in finding the objects of interest on a map, measured by\nEuclidean distance in the K-Nearest Neighbors (KNN) approach.\n","authors":["Oleg Sautenkov","Yasheerah Yaqoot","Artem Lykov","Muhammad Ahsan Mustafa","Grik Tadevosyan","Aibek Akhmetkazy","Miguel Altamirano Cabrera","Mikhail Martynov","Sausar Karaf","Dzmitry Tsetserukou"],"pdf_url":"https://arxiv.org/pdf/2501.05014v1.pdf","comment":"HRI 2025"},{"id":"http://arxiv.org/abs/2409.14737v3","updated":"2025-01-09T07:09:44Z","published":"2024-09-23T06:33:52Z","title":"Generalizable Autonomous Driving System across Diverse Adverse Weather\n Conditions","summary":" Various adverse weather conditions pose a significant challenge to autonomous\ndriving (AD) street scene semantic understanding (segmentation). A common\nstrategy is to minimize the disparity between images captured in clear and\nadverse weather conditions. However, this technique typically relies on\nutilizing a clear image as a reference, which is challenging to obtain in\npractice. Furthermore, this method typically targets a single adverse\ncondition, and thus performs poorly when confronting a mixture of multiple\nadverse weather conditions. To address these issues, we introduce a\nreference-free and Adverse weather-Immune scheme (called AdvImmu) that\nleverages the invariance of weather conditions over short periods (seconds).\nSpecifically, AdvImmu includes three components: Locally Sequential Mechanism\n(LSM), Globally Shuffled Mechanism (GSM), and Unfolded Regularizers (URs). LSM\nleverages temporal correlations between adjacent frames to enhance model\nperformance. GSM is proposed to shuffle LSM segments to prevent overfitting of\ntemporal patterns. URs are the deep unfolding implementation of two proposed\nregularizers to penalize the model complexity to enhance across-weather\ngeneralization. 
In addition, to overcome the over-reliance on consecutive\nframe-wise annotations in the training of AdvImmu (typically unavailable in AD\nscenarios), we incorporate a foundation model named Segment Anything Model\n(SAM) to assist in annotating frames, and additionally propose a clustering\nalgorithm (denoted as SBICAC) to surmount SAM's category-agnostic issue to\ngenerate pseudo-labels. Extensive experiments demonstrate that the proposed\nAdvImmu outperforms existing state-of-the-art methods by 88.56% in mean\nIntersection over Union (mIoU).\n","authors":["Wei-Bin Kou","Guangxu Zhu","Rongguang Ye","Qingfeng Lin","Zeyi Ren","Ming Tang","Yik-Chung Wu"],"pdf_url":"https://arxiv.org/pdf/2409.14737v3.pdf","comment":"16 Pages"},{"id":"http://arxiv.org/abs/2501.05004v1","updated":"2025-01-09T06:56:44Z","published":"2025-01-09T06:56:44Z","title":"A Fast Path-Planning Method for Continuous Harvesting of Table-Top Grown\n Strawberries","summary":" Continuous harvesting and storage of multiple fruits in a single operation\nallow robots to significantly reduce the travel distance required for\nrepetitive back-and-forth movements. Traditional collision-free path planning\nalgorithms, such as Rapidly-Exploring Random Tree (RRT) and A-star (A*), often\nfail to meet the demands of efficient continuous fruit harvesting due to their\nlow search efficiency and the generation of excessive redundant points. This\npaper presents the Interactive Local Minima Search Algorithm (ILMSA), a fast\npath-planning method designed for the continuous harvesting of table-top grown\nstrawberries. The algorithm featured an interactive node expansion strategy\nthat iteratively extended and refined collision-free path segments based on\nlocal minima points. To enable the algorithm to function in 3D, the 3D\nenvironment was projected onto multiple 2D planes, generating optimal paths on\neach plane. The best path was then selected, followed by integrating and\nsmoothing the 3D path segments. 
Simulations demonstrated that ILMSA\noutperformed existing methods, reducing path length by 21.5% and planning time\nby 97.1% compared to 3D-RRT, while achieving 11.6% shorter paths and 25.4%\nfewer nodes than the Lowest Point of the Strawberry (LPS) algorithm in 3D\nenvironments. In 2D, ILMSA achieved path lengths 16.2% shorter than A*, 23.4%\nshorter than RRT, and 20.9% shorter than RRT-Connect, while being over 96%\nfaster and generating significantly fewer nodes. Field tests confirmed ILMSA's\nsuitability for complex agricultural tasks, having a combined planning and\nexecution time and an average path length that were approximately 58% and 69%,\nrespectively, of those achieved by the LPS algorithm.\n","authors":["Zhonghua Miao","Yang Chen","Lichao Yang","Shimin Hu","Ya Xiong"],"pdf_url":"https://arxiv.org/pdf/2501.05004v1.pdf","comment":"Accepted by IEEE Transactions on AgriFood Electronics"},{"id":"http://arxiv.org/abs/2410.14368v2","updated":"2025-01-09T06:02:11Z","published":"2024-10-18T10:53:44Z","title":"CoMAL: Collaborative Multi-Agent Large Language Models for\n Mixed-Autonomy Traffic","summary":" The integration of autonomous vehicles into urban traffic has great potential\nto improve efficiency by reducing congestion and optimizing traffic flow\nsystematically. In this paper, we introduce CoMAL (Collaborative Multi-Agent\nLLMs), a framework designed to address the mixed-autonomy traffic problem by\ncollaboration among autonomous vehicles to optimize traffic flow. CoMAL is\nbuilt upon large language models, operating in an interactive traffic\nsimulation environment. It utilizes a Perception Module to observe surrounding\nagents and a Memory Module to store strategies for each agent. 
The overall\nworkflow includes a Collaboration Module that encourages autonomous vehicles to\ndiscuss the effective strategy and allocate roles, a reasoning engine to\ndetermine optimal behaviors based on assigned roles, and an Execution Module\nthat controls vehicle actions using a hybrid approach combining rule-based\nmodels. Experimental results demonstrate that CoMAL achieves superior\nperformance on the Flow benchmark. Additionally, we evaluate the impact of\ndifferent language models and compare our framework with reinforcement learning\napproaches. It highlights the strong cooperative capability of LLM agents and\npresents a promising solution to the mixed-autonomy traffic challenge. The code\nis available at https://github.com/Hyan-Yao/CoMAL.\n","authors":["Huaiyuan Yao","Longchao Da","Vishnu Nandam","Justin Turnau","Zhiwei Liu","Linsey Pang","Hua Wei"],"pdf_url":"https://arxiv.org/pdf/2410.14368v2.pdf","comment":"8 pages, 4 figures, accepted to SDM25"},{"id":"http://arxiv.org/abs/2501.04988v1","updated":"2025-01-09T06:01:34Z","published":"2025-01-09T06:01:34Z","title":"Intelligent Sailing Model for Open Sea Navigation","summary":" Autonomous vessels potentially enhance safety and reliability of seaborne\ntrade. To facilitate the development of autonomous vessels, high-fidelity\nsimulations are required to model realistic interactions with other vessels.\nHowever, modeling realistic interactive maritime traffic is challenging due to\nthe unstructured environment, coarsely specified traffic rules, and largely\nvarying vessel types. Currently, there is no standard for simulating\ninteractive maritime environments in order to rigorously benchmark autonomous\nvessel algorithms. In this paper, we introduce the first intelligent sailing\nmodel (ISM), which simulates rule-compliant vessels for navigation on the open\nsea. 
An ISM vessel reacts to other traffic participants according to maritime\ntraffic rules while at the same time solving a motion planning task\ncharacterized by waypoints. In particular, the ISM monitors the applicable\nrules, generates rule-compliant waypoints accordingly, and utilizes a model\npredictive control for tracking the waypoints. We evaluate the ISM in two\nenvironments: interactive traffic with only ISM vessels and mixed traffic where\nsome vessel trajectories are from recorded real-world maritime traffic data or\nhandcrafted for criticality. Our results show that simulations with many ISM\nvessels of different vessel types are rule-compliant and scalable. We tested\n4,049 critical traffic scenarios. For interactive traffic with ISM vessels, no\ncollisions occurred while goal-reaching rates of about 97 percent were\nachieved. We believe that our ISM can serve as a standard for challenging and\nrealistic maritime traffic simulation to accelerate autonomous vessel\ndevelopment.\n","authors":["Hanna Krasowski","Stefan Schärdinger","Murat Arcak","Matthias Althoff"],"pdf_url":"https://arxiv.org/pdf/2501.04988v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04982v1","updated":"2025-01-09T05:45:03Z","published":"2025-01-09T05:45:03Z","title":"CuRLA: Curriculum Learning Based Deep Reinforcement Learning for\n Autonomous Driving","summary":" In autonomous driving, traditional Computer Vision (CV) agents often struggle\nin unfamiliar situations due to biases in the training data. Deep Reinforcement\nLearning (DRL) agents address this by learning from experience and maximizing\nrewards, which helps them adapt to dynamic environments. However, ensuring\ntheir generalization remains challenging, especially with static training\nenvironments. Additionally, DRL models lack transparency, making it difficult\nto guarantee safety in all scenarios, particularly those not seen during\ntraining. 
To tackle these issues, we propose a method that combines DRL with\nCurriculum Learning for autonomous driving. Our approach uses a Proximal Policy\nOptimization (PPO) agent and a Variational Autoencoder (VAE) to learn safe\ndriving in the CARLA simulator. The agent is trained using two-fold curriculum\nlearning, progressively increasing environment difficulty and incorporating a\ncollision penalty in the reward function to promote safety. This method\nimproves the agent's adaptability and reliability in complex environments, and\nhelps it understand the nuances of balancing multiple reward components from\ndifferent feedback signals in a single scalar reward function. Keywords:\nComputer Vision, Deep Reinforcement Learning, Variational Autoencoder, Proximal\nPolicy Optimization, Curriculum Learning, Autonomous Driving.\n","authors":["Bhargava Uppuluri","Anjel Patel","Neil Mehta","Sridhar Kamath","Pratyush Chakraborty"],"pdf_url":"https://arxiv.org/pdf/2501.04982v1.pdf","comment":"To be published in the 17th International Conference on Agents and\n Artificial Intelligence (ICAART), Feb 2025"},{"id":"http://arxiv.org/abs/2501.04969v1","updated":"2025-01-09T04:47:51Z","published":"2025-01-09T04:47:51Z","title":"AD-L-JEPA: Self-Supervised Spatial World Models with Joint Embedding\n Predictive Architecture for Autonomous Driving with LiDAR Data","summary":" As opposed to human drivers, current autonomous driving systems still require\nvast amounts of labeled data to train. Recently, world models have been\nproposed to simultaneously enhance autonomous driving capabilities by improving\nthe way these systems understand complex real-world environments and reduce\ntheir data demands via self-supervised pre-training. 
In this paper, we present\nAD-L-JEPA (aka Autonomous Driving with LiDAR data via a Joint Embedding\nPredictive Architecture), a novel self-supervised pre-training framework for\nautonomous driving with LiDAR data that, as opposed to existing methods, is\nneither generative nor contrastive. Our method learns spatial world models with\na joint embedding predictive architecture. Instead of explicitly generating\nmasked unknown regions, our self-supervised world models predict Bird's Eye\nView (BEV) embeddings to represent the diverse nature of autonomous driving\nscenes. Our approach furthermore eliminates the need to manually create\npositive and negative pairs, as is the case in contrastive learning. AD-L-JEPA\nleads to simpler implementation and enhanced learned representations. We\nqualitatively and quantitatively demonstrate the high quality of embeddings\nlearned with AD-L-JEPA. We furthermore evaluate the accuracy and label\nefficiency of AD-L-JEPA on popular downstream tasks such as LiDAR 3D object\ndetection and associated transfer learning. Our experimental evaluation\ndemonstrates that AD-L-JEPA is a plausible approach for self-supervised\npre-training in autonomous driving applications and is the best available\napproach, outperforming SOTA, including the most recently proposed\nOccupancy-MAE [1] and ALSO [2]. The source code of AD-L-JEPA is available at\nhttps://github.com/HaoranZhuExplorer/AD-L-JEPA-Release.\n","authors":["Haoran Zhu","Zhenyuan Dong","Kristi Topollai","Anna Choromanska"],"pdf_url":"https://arxiv.org/pdf/2501.04969v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04276v2","updated":"2025-01-09T04:26:27Z","published":"2025-01-08T04:54:28Z","title":"Bridging Adaptivity and Safety: Learning Agile Collision-Free Locomotion\n Across Varied Physics","summary":" Real-world legged locomotion systems often need to reconcile agility and\nsafety for different scenarios. 
Moreover, the underlying dynamics are often\nunknown and time-variant (e.g., payload, friction). In this paper, we introduce\nBAS (Bridging Adaptivity and Safety), which builds upon the pipeline of the\nprior work Agile But Safe (ABS) (He et al.) and is designed to provide adaptive\nsafety even in dynamic environments with uncertainties. BAS involves an agile\npolicy to avoid obstacles rapidly and a recovery policy to prevent collisions,\na physical parameter estimator that is concurrently trained with the agile\npolicy, and a learned control-theoretic RA (reach-avoid) value network that\ngoverns the policy switch. Also, the agile policy and RA network are both\nconditioned on physical parameters to make them adaptive. To mitigate the\ndistribution shift issue, we further introduce an on-policy fine-tuning phase\nfor the estimator to enhance its robustness and accuracy. The simulation\nresults show that BAS achieves 50% better safety than baselines in dynamic\nenvironments while maintaining a higher speed on average. In real-world\nexperiments, BAS shows its capability in complex environments with unknown\nphysics (e.g., slippery floors with unknown friction, unknown payloads up to\n8kg), while baselines lack adaptivity, leading to collisions or degraded\nagility. As a result, BAS achieves a 19.8% increase in speed and a 2.36 times\nlower collision rate than ABS in the real world. 
Videos: https://adaptive-safe-locomotion.github.io.\n","authors":["Yichao Zhong","Chong Zhang","Tairan He","Guanya Shi"],"pdf_url":"https://arxiv.org/pdf/2501.04276v2.pdf","comment":"11 Pages, 6 Figures"},{"id":"http://arxiv.org/abs/2311.09346v2","updated":"2025-01-09T04:20:34Z","published":"2023-11-15T20:09:29Z","title":"Nothing Stands Still: A Spatiotemporal Benchmark on 3D Point Cloud\n Registration Under Large Geometric and Temporal Change","summary":" Building 3D geometric maps of man-made spaces is a well-established and\nactive field that is fundamental to computer vision and robotics. However,\nconsidering the evolving nature of built environments, it is essential to\nquestion the capabilities of current mapping efforts in handling temporal\nchanges. In addition, spatiotemporal mapping holds significant potential for\nachieving sustainability and circularity goals. Existing mapping approaches\nfocus on small changes, such as object relocation or self-driving car\noperation; in all cases where the main structure of the scene remains fixed.\nConsequently, these approaches fail to address more radical changes in the\nstructure of the built environment, such as geometry and topology. To this end,\nwe introduce the Nothing Stands Still (NSS) benchmark, which focuses on the\nspatiotemporal registration of 3D scenes undergoing large spatial and temporal\nchange, ultimately creating one coherent spatiotemporal map. Specifically, the\nbenchmark involves registering two or more partial 3D point clouds (fragments)\nfrom the same scene but captured from different spatiotemporal views. In\naddition to the standard pairwise registration, we assess the multi-way\nregistration of multiple fragments that belong to any temporal stage. As part\nof NSS, we introduce a dataset of 3D point clouds recurrently captured in\nlarge-scale building indoor environments that are under construction or\nrenovation. 
The NSS benchmark presents three scenarios of increasing\ndifficulty, to quantify the generalization ability of point cloud registration\nmethods over space (within one building and across buildings) and time. We\nconduct extensive evaluations of state-of-the-art methods on NSS. The results\ndemonstrate the necessity for novel methods specifically designed to handle\nlarge spatiotemporal changes. The homepage of our benchmark is at\nhttp://nothing-stands-still.com.\n","authors":["Tao Sun","Yan Hao","Shengyu Huang","Silvio Savarese","Konrad Schindler","Marc Pollefeys","Iro Armeni"],"pdf_url":"https://arxiv.org/pdf/2311.09346v2.pdf","comment":"To appear in the ISPRS Journal of Photogrammetry and Remote Sensing.\n 29 pages, 26 figures. For the project page, see\n http://nothing-stands-still.com"},{"id":"http://arxiv.org/abs/2501.04595v2","updated":"2025-01-09T04:13:45Z","published":"2025-01-08T16:23:56Z","title":"MobileH2R: Learning Generalizable Human to Mobile Robot Handover\n Exclusively from Scalable and Diverse Synthetic Data","summary":" This paper introduces MobileH2R, a framework for learning generalizable\nvision-based human-to-mobile-robot (H2MR) handover skills. Unlike traditional\nfixed-base handovers, this task requires a mobile robot to reliably receive\nobjects in a large workspace enabled by its mobility. Our key insight is that\ngeneralizable handover skills can be developed in simulators using high-quality\nsynthetic data, without the need for real-world demonstrations. To achieve\nthis, we propose a scalable pipeline for generating diverse synthetic full-body\nhuman motion data, an automated method for creating safe and imitation-friendly\ndemonstrations, and an efficient 4D imitation learning method for distilling\nlarge-scale demonstrations into closed-loop policies with base-arm\ncoordination. 
Experimental evaluations in both simulators and the real world\nshow significant improvements (at least +15% success rate) over baseline\nmethods in all cases. Experiments also validate that large-scale and diverse\nsynthetic data greatly enhances robot learning, highlighting our scalable\nframework.\n","authors":["Zifan Wang","Ziqing Chen","Junyu Chen","Jilong Wang","Yuxin Yang","Yunze Liu","Xueyi Liu","He Wang","Li Yi"],"pdf_url":"https://arxiv.org/pdf/2501.04595v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04929v1","updated":"2025-01-09T02:45:05Z","published":"2025-01-09T02:45:05Z","title":"What Drives You to Interact?: The Role of User Motivation for a Robot in\n the Wild","summary":" In this paper, we aim to understand how user motivation shapes human-robot\ninteraction (HRI) in the wild. To explore this, we conducted a field study by\ndeploying a fully autonomous conversational robot in a shopping mall over two\ndays. Through sequential video analysis, we identified five patterns of\ninteraction fluency (Smooth, Awkward, Active, Messy, and Quiet), four types of\nuser motivation for interacting with the robot (Function, Experiment,\nCuriosity, and Education), and user positioning towards the robot. We further\nanalyzed how these motivations and positioning influence interaction fluency.\nOur findings suggest that incorporating users' motivation types into the design\nof robot behavior can enhance interaction fluency, engagement, and user\nsatisfaction in real-world HRI scenarios.\n","authors":["Amy Koike","Yuki Okafuji","Kenya Hoshimure","Jun Baba"],"pdf_url":"https://arxiv.org/pdf/2501.04929v1.pdf","comment":"8 pages, 4 figures"},{"id":"http://arxiv.org/abs/2501.04228v2","updated":"2025-01-09T01:35:56Z","published":"2025-01-08T01:59:47Z","title":"Constraints as Rewards: Reinforcement Learning for Robots without Reward\n Functions","summary":" Reinforcement learning has become an essential algorithm for generating\ncomplex robotic behaviors. 
However, to learn such behaviors, it is necessary to\ndesign a reward function that describes the task, which often consists of\nmultiple objectives that need to be balanced. This tuning process is known as\nreward engineering and typically involves extensive trial-and-error. In this\npaper, to avoid this trial-and-error process, we propose the concept of\nConstraints as Rewards (CaR). CaR formulates the task objective using multiple\nconstraint functions instead of a reward function and solves a reinforcement\nlearning problem with constraints using the Lagrangian method. By adopting this\napproach, different objectives are automatically balanced, because the Lagrange\nmultipliers serve as the weights among the objectives. In addition, we\ndemonstrate that constraints, expressed as inequalities, provide an intuitive\ninterpretation of the optimization target designed for the task. We apply the\nproposed method to the standing-up motion generation task of a\nsix-wheeled-telescopic-legged robot and demonstrate that the proposed method\nsuccessfully acquires the target behavior, even though it is challenging to\nlearn with manually designed reward functions.\n","authors":["Yu Ishihara","Noriaki Takasugi","Kotaro Kawakami","Masaya Kinoshita","Kazumi Aoyama"],"pdf_url":"https://arxiv.org/pdf/2501.04228v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05610v1","updated":"2025-01-09T23:18:38Z","published":"2025-01-09T23:18:38Z","title":"Towards Probabilistic Inference of Human Motor Intentions by Assistive\n Mobile Robots Controlled via a Brain-Computer Interface","summary":" Assistive mobile robots are a transformative technology that helps persons\nwith disabilities regain the ability to move freely. Although autonomous\nwheelchairs significantly reduce user effort, they still require human input to\nallow users to maintain control and adapt to changing environments. 
The Brain\nComputer Interface (BCI) stands out as a highly user-friendly option that does\nnot require physical movement. Current BCI systems can understand whether users\nwant to accelerate or decelerate, but they implement these changes in discrete\nspeed steps rather than allowing for smooth, continuous velocity adjustments.\nThis limitation prevents the systems from mimicking the natural, fluid speed\nchanges seen in human self-paced motion. The authors aim to address this\nlimitation by redesigning the perception-action cycle in a BCI-controlled\nrobotic system: improving how the robotic agent interprets the user's motion\nintentions (world state) and implementing these actions in a way that better\nreflects the natural physical properties of motion, such as inertia and\ndamping. The scope of this paper focuses on the perception aspect. We asked and\nanswered a normative question \"what computation should the robotic agent carry\nout to optimally perceive incomplete or noisy sensory observations?\" Empirical\nEEG data were collected, and probabilistic representations that served as\nworld-state distributions were learned and evaluated in a Generative\nAdversarial Network framework. A ROS framework was established that connected\nto a Gazebo environment containing a digital twin of an indoor space and a\nvirtual model of a robotic wheelchair. Signal processing and statistical\nanalyses were implemented to identify the most discriminative features in the\nspatial-spectral-temporal dimensions, which were then used to construct the\nworld model for the robotic agent to interpret user motion intentions as a\nBayesian observer.\n","authors":["Xiaoshan Zhou","Carol M. Menassa","Vineet R. 
Kamat"],"pdf_url":"https://arxiv.org/pdf/2501.05610v1.pdf","comment":"10 pages"},{"id":"http://arxiv.org/abs/2304.02075v2","updated":"2025-01-09T19:54:53Z","published":"2023-04-04T18:58:16Z","title":"GUTS: Generalized Uncertainty-Aware Thompson Sampling for Multi-Agent\n Active Search","summary":" Robotic solutions for quick disaster response are essential to ensure minimal\nloss of life, especially when the search area is too dangerous or too vast for\nhuman rescuers. We model this problem as an asynchronous multi-agent\nactive-search task where each robot aims to efficiently seek objects of\ninterest (OOIs) in an unknown environment. This formulation addresses the\nrequirement that search missions should focus on quick recovery of OOIs rather\nthan full coverage of the search region. Previous approaches fail to accurately\nmodel sensing uncertainty, account for occlusions due to foliage or terrain, or\nconsider the requirement for heterogeneous search teams and robustness to\nhardware and communication failures. We present the Generalized\nUncertainty-aware Thompson Sampling (GUTS) algorithm, which addresses these\nissues and is suitable for deployment on heterogeneous multi-robot systems for\nactive search in large unstructured environments. We show through simulation\nexperiments that GUTS consistently outperforms existing methods such as\nparallelized Thompson Sampling and exhaustive search, recovering all OOIs in\n80% of all runs. In contrast, existing approaches recover all OOIs in less than\n40% of all runs. We conduct field tests using our multi-robot system in an\nunstructured environment with a search area of approximately 75,000 sq. m. 
Our\nsystem demonstrates robustness to various failure modes, achieving full\nrecovery of OOIs (where feasible) in every field run, and significantly\noutperforming our baseline.\n","authors":["Nikhil Angad Bakshi","Tejus Gupta","Ramina Ghods","Jeff Schneider"],"pdf_url":"https://arxiv.org/pdf/2304.02075v2.pdf","comment":"7 pages, 5 figures, 1 table, for associated video see:\n https://youtu.be/K0jkzdQ_j2E , published in International Conference on\n Robotics and Automation (ICRA) 2023. Outstanding Deployed Systems Paper\n Winner"},{"id":"http://arxiv.org/abs/2501.06263v1","updated":"2025-01-09T15:00:03Z","published":"2025-01-09T15:00:03Z","title":"GelBelt: A Vision-based Tactile Sensor for Continuous Sensing of Large\n Surfaces","summary":" Scanning large-scale surfaces is widely demanded in surface reconstruction\napplications and detecting defects in industries' quality control and\nmaintenance stages. Traditional vision-based tactile sensors have shown\npromising performance in high-resolution shape reconstruction while suffering\nlimitations such as small sensing areas or susceptibility to damage when slid\nacross surfaces, making them unsuitable for continuous sensing on large\nsurfaces. To address these shortcomings, we introduce a novel vision-based\ntactile sensor designed for continuous surface sensing applications. Our design\nuses an elastomeric belt and two wheels to continuously scan the target\nsurface. The proposed sensor showed promising results in both shape\nreconstruction and surface fusion, indicating its applicability. The dot\nproduct of the estimated and reference surface normal map is reported over the\nsensing area and for different scanning speeds. Results indicate that the\nproposed sensor can rapidly scan large-scale surfaces with high accuracy at\nspeeds up to 45 mm/s.\n","authors":["Mohammad Amin Mirzaee","Hung-Jui Huang","Wenzhen Yuan"],"pdf_url":"https://arxiv.org/pdf/2501.06263v1.pdf","comment":"Accepted to IEEE RA-L. 
8 pages, 7 figures, webpage:\n https://aminmirz.github.io/GelBelt/"},{"id":"http://arxiv.org/abs/2501.06262v1","updated":"2025-01-09T13:27:02Z","published":"2025-01-09T13:27:02Z","title":"Towards smart and adaptive agents for active sensing on edge devices","summary":" TinyML has made deploying deep learning models on low-power edge devices\nfeasible, creating new opportunities for real-time perception in constrained\nenvironments. However, the adaptability of such deep learning methods remains\nlimited to data drift adaptation, lacking broader capabilities that account for\nthe environment's underlying dynamics and inherent uncertainty. Deep learning's\nscaling laws, which counterbalance this limitation by massively up-scaling data\nand model size, cannot be applied when deploying on the Edge, where deep\nlearning limitations are further amplified as models are scaled down for\ndeployment on resource-constrained devices.\n This paper presents a smart agentic system capable of performing on-device\nperception and planning, enabling active sensing on the edge. By incorporating\nactive inference into our solution, our approach extends beyond deep learning\ncapabilities, allowing the system to plan in dynamic environments while\noperating in real time with a modest total model size of 2.3 MB. 
We showcase\nour proposed system by creating and deploying a saccade agent connected to an\nIoT camera with pan and tilt capabilities on an NVIDIA Jetson embedded device.\nThe saccade agent controls the camera's field of view following optimal\npolicies derived from the active inference principles, simulating human-like\nsaccadic motion for surveillance and robotics applications.\n","authors":["Devendra Vyas","Miguel de Prado","Tim Verbelen"],"pdf_url":"https://arxiv.org/pdf/2501.06262v1.pdf","comment":null}],"Systems and Control":[{"id":"http://arxiv.org/abs/2409.05545v2","updated":"2025-01-09T18:51:52Z","published":"2024-09-09T12:11:18Z","title":"Adaptive Probabilistic Planning for the Uncertain and Dynamic\n Orienteering Problem","summary":" The Orienteering Problem (OP) is a well-studied routing problem that has been\nextended to incorporate uncertainties, reflecting stochastic or dynamic travel\ncosts, prize-collection costs, and prizes. Existing approaches may, however, be\ninefficient in real-world applications due to insufficient modeling knowledge\nand initially unknowable parameters in online scenarios. Thus, we propose the\nUncertain and Dynamic Orienteering Problem (UDOP), modeling travel costs as\ndistributions with unknown and time-variant parameters. UDOP also associates\nuncertain travel costs with dynamic prizes and prize-collection costs for its\nobjective and budget constraints. To address UDOP, we develop an ADaptive\nApproach for Probabilistic paThs - ADAPT, that iteratively performs 'execution'\nand 'online planning' based on an initial 'offline' solution. The execution\nphase updates system status and records online cost observations. The online\nplanner employs a Bayesian approach to adaptively estimate power consumption\nand optimize path sequence based on safety beliefs. We evaluate ADAPT in a\npractical Unmanned Aerial Vehicle (UAV) charging scheduling problem for\nWireless Rechargeable Sensor Networks. 
The UAV must optimize its path to\nrecharge sensor nodes efficiently while managing its energy under uncertain\nconditions. ADAPT maintains comparable solution quality and computation time\nwhile offering superior robustness. Extensive simulations show that ADAPT\nachieves a 100% Mission Success Rate (MSR) across all tested scenarios,\noutperforming comparable heuristic-based and frequentist approaches that fail\nup to 70% (under challenging conditions) and averaging 67% MSR, respectively.\nThis work advances the field of OP with uncertainties, offering a reliable and\nefficient approach for real-world applications in uncertain and dynamic\nenvironments.\n","authors":["Qiuchen Qian","Yanran Wang","David Boyle"],"pdf_url":"https://arxiv.org/pdf/2409.05545v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.06445v4","updated":"2025-01-09T18:17:20Z","published":"2024-05-10T12:50:52Z","title":"Systematic interval observer design for linear systems","summary":" We first propose systematic and comprehensive interval observer designs for\nlinear time-invariant systems, under standard assumptions involving\nobservability and interval bounds on the initial condition and disturbances.\nHistorically, such designs rely on transformations with certain limitations\ninto a form that is Metzler (for continuous time) or non-negative (for discrete\ntime). We show that they can be effectively replaced with a linear\ntime-invariant transformation that can be easily computed offline. Next, we\npropose an extension to the time-varying setting, addressing the limitations of\nconventional transformations that lack guaranteed outcomes. We employ dynamical\ntransformations into higher-dimensional target forms for which an interval\nobserver can always be constructed. 
These transformations become\nleft-invertible after a certain time, provided observability conditions are met\nand the target dynamics are sufficiently high-dimensional and fast, thus\nenabling the reconstruction of bounds in the original coordinates in finite\ntime. Academic examples are presented to illustrate our methods.\n","authors":["Thach Ngoc Dinh","Gia Quoc Bao Tran"],"pdf_url":"https://arxiv.org/pdf/2405.06445v4.pdf","comment":"7 pages, 3 figures"},{"id":"http://arxiv.org/abs/2501.05285v1","updated":"2025-01-09T14:49:03Z","published":"2025-01-09T14:49:03Z","title":"Pitch Plane Trajectory Tracking Control for Sounding Rockets via\n Adaptive Feedback Linearization","summary":" This paper proposes a pitch plane trajectory tracking control solution for\nsuborbital launch vehicles relying on adaptive feedback linearization.\nInitially, the 2D dynamics and kinematics for a single-engine,\nthrust-vector-controlled sounding rocket are obtained for control design\npurposes. Then, an inner-outer control strategy, which simultaneously tackles\nattitude and position control, is adopted, with the inner-loop comprising the\naltitude and pitch control and the outer-loop addressing the horizontal\n(downrange) position control. Feedback linearization is used to cancel out the\nnon-linearities in both the inner and outer dynamics. Making use of Lyapunov\nstability theory, an adaptation law, which provides online estimates on the\ninner-loop aerodynamic uncertainty, is jointly designed with the output\ntracking controller via adaptive backstepping, ensuring global reference\ntracking in the region where the feedback linearization is well-defined. The\nzero dynamics of the inner-stabilized system are then exploited to obtain the\nouter-loop dynamics and derive a Linear Quadratic Regulator (LQR) with integral\naction, which can stabilize them as well as reject external disturbances. 
In\nthe outermost loop, the estimate on the correspondent aerodynamic uncertainty\nis indirectly obtained by using the inner loop estimates together with known\naerodynamics relations. The resulting inner-outer position control solution is\nproven to be asymptotically stable in the region of interest. Using a\nsingle-stage sounding rocket, propelled by a liquid engine, as reference\nvehicle, different mission scenarios are tested in a simulation environment to\nverify the adaptability of the proposed control strategy. The system is able to\ntrack the requested trajectories while rejecting external wind disturbances.\nFurthermore, the need to re-tune the control gains in between different mission\nscenarios is minimal to none.\n","authors":["Pedro dos Santos","Paulo Oliveira"],"pdf_url":"https://arxiv.org/pdf/2501.05285v1.pdf","comment":"Paper accepted to the IEEE Aerospace Conference 2025. Copyright:\n 979-8-3503-5597-0/25/$31.00 @2025 IEEE"},{"id":"http://arxiv.org/abs/2501.04572v2","updated":"2025-01-09T14:30:41Z","published":"2025-01-08T15:42:41Z","title":"Regret Analysis: a control perspective","summary":" Online learning and model reference adaptive control have many interesting\nintersections. One area where they differ however is in how the algorithms are\nanalyzed and what objective or metric is used to discriminate \"good\" algorithms\nfrom \"bad\" algorithms. In adaptive control there are usually two objectives: 1)\nprove that all time varying parameters/states of the system are bounded, and 2)\nthat the instantaneous error between the adaptively controlled system and a\nreference system converges to zero over time (or at least a compact set). For\nonline learning the performance of algorithms is often characterized by the\nregret the algorithm incurs. Regret is defined as the cumulative loss (cost)\nover time from the online algorithm minus the cumulative loss (cost) of the\nsingle optimal fixed parameter choice in hindsight. 
Another significant\ndifference between the two areas of research is with regard to the assumptions\nmade in order to obtain said results. Adaptive control makes assumptions about\nthe input-output properties of the control problem and derives solutions for a\nfixed error model or optimization task. In the online learning literature\nresults are derived for classes of loss functions (i.e. convex) while a priori\nassuming that all time varying parameters are bounded, which for many\noptimization tasks is not unrealistic, but is a non starter in control\napplications. In this work we discuss these differences in detail through the\nregret based analysis of gradient descent for convex functions and the control\nbased analysis of a streaming regression problem. We close with a discussion\nabout the newly defined paradigm of online adaptive control and ask the\nfollowing question \"Are regret optimal control strategies deployable?\"\n","authors":["Travis E. Gibson","Sawal Acharya"],"pdf_url":"https://arxiv.org/pdf/2501.04572v2.pdf","comment":"10 pages no figures"},{"id":"http://arxiv.org/abs/2501.05163v1","updated":"2025-01-09T11:36:29Z","published":"2025-01-09T11:36:29Z","title":"Explainable AI based System for Supply Air Temperature Forecast","summary":" This paper explores the application of Explainable AI (XAI) techniques to\nimprove the transparency and understanding of predictive models in control of\nautomated supply air temperature (ASAT) of Air Handling Unit (AHU). The study\nfocuses on forecasting of ASAT using a linear regression with Huber loss.\nHowever, having only a control curve without semantic and/or physical\nexplanation is often not enough. The present study employs one of the XAI\nmethods: Shapley values, which allows to reveal the reasoning and highlight the\ncontribution of each feature to the final ASAT forecast. In comparison to other\nXAI methods, Shapley values have solid mathematical background, resulting in\ninterpretation transparency. 
The study demonstrates the contrastive\nexplanations--slices, for each control value of ASAT, which makes it possible\nto give the client objective justifications for curve changes.\n","authors":["Marika Eik","Ahmet Kose","Hossein Nourollahi Hokmabad","Juri Belikov"],"pdf_url":"https://arxiv.org/pdf/2501.05163v1.pdf","comment":"5 pages, 7 figures, 1 table, conference paper"},{"id":"http://arxiv.org/abs/2405.16490v2","updated":"2025-01-09T10:38:46Z","published":"2024-05-26T08:58:03Z","title":"Formalising the intentional stance 1: attributing goals and beliefs to\n stochastic processes","summary":" This article presents a formalism inspired by Dennett's notion of the\nintentional stance. Whereas Dennett's treatment of these concepts is informal,\nwe aim to provide a more formal analogue. We introduce a framework based on\nstochastic processes with inputs and outputs, in which we can talk precisely\nabout *interpreting* systems as having *normative-epistemic states*, which\ncombine belief-like and desire-like features. Our framework is based on\noptimality but nevertheless allows us to model some forms of bounded cognition.\n One might expect that the systems that can be described in\nnormative-epistemic terms would be some special subset of all systems, but we\nshow that this is not the case: every system admits a (possibly trivial)\nnormative-epistemic interpretation, and those that can be *uniquely specified*\nby a normative-epistemic description are exactly the deterministic ones.\nFinally, we show that there is a suitable notion of Bayesian updating for\nnormative-epistemic states, which we call *value-laden filtering*, since it\ninvolves both normative and epistemic elements. For unbounded cognition it is\nalways permissible to attribute beliefs that update in this way. 
This is not\nalways the case for bounded cognition, but we give a sufficient condition under\nwhich it is.\n This paper gives an overview of our framework aimed at cognitive scientists,\nwith a formal mathematical treatment given in a companion paper.\n","authors":["Simon McGregor"," timorl","Nathaniel Virgo"],"pdf_url":"https://arxiv.org/pdf/2405.16490v2.pdf","comment":"The previous version of this document included the content of the\n companion paper, \"Formalising the intentional stance 2: a coinductive\n approach\". The paper has now been split into two, this one (which is an\n overview aimed at cognitive scientists) and the companion (which contains\n full mathematical detail). 16 pages, one figure with two subfigures"},{"id":"http://arxiv.org/abs/2501.05102v1","updated":"2025-01-09T09:44:25Z","published":"2025-01-09T09:44:25Z","title":"Coordinated Control of Deformation and Flight for Morphing Aircraft via\n Meta-Learning and Coupled State-Dependent Riccati Equations","summary":" In this paper, the coordinated control problem of deformation and flight for\nmorphing aircraft (MA) is studied by using meta-learning (ML) and coupled\nstate-dependent Riccati equations (CSDREs). Our method is built on two\nprincipal observations that dynamic models of MA under varying morphing\nconditions share a morphing condition independent representation function and\nthat the specific morphing condition part lies in a set of linear coefficients.\nTo that end, the domain adversarially invariant meta-learning (DAIML) is\nemployed to learn the shared representation with offline flight data. Based on\nthe learned representation function, the coordinated control of the deformation\nand flight for MA is formulated as a non-cooperative differential game. The\nstate-dependent feedback control solutions can be derived by addressing a pair\nof CSDREs. 
For this purpose, Lyapunov iterations are extended to obtain the\npositive semidefinite (definite) stabilizing solutions of the CSDREs, and the\nconvergence proof of the proposed algorithm is provided. Finally, a simulation\nstudy is carried out to validate the efficacy of the developed coordinated game\ncontrol strategies.\n","authors":["Hao-Chi Che","Huai-Ning Wu"],"pdf_url":"https://arxiv.org/pdf/2501.05102v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.03997v5","updated":"2025-01-09T09:41:38Z","published":"2024-01-08T16:20:05Z","title":"Low-Complexity Control for a Class of Uncertain MIMO Nonlinear Systems\n under Generalized Time-Varying Output Constraints (extended version)","summary":" This paper introduces a novel control framework to address the satisfaction\nof multiple time-varying output constraints in uncertain high-order MIMO\nnonlinear control systems. Unlike existing methods, which often assume that the\nconstraints are always decoupled and feasible, our approach can handle coupled\ntime-varying constraints even in the presence of potential infeasibilities.\nFirst, it is shown that satisfying multiple constraints essentially boils down\nto ensuring the positivity of a scalar variable, representing the signed\ndistance from the boundary of the time-varying output-constrained set. To\nachieve this, a single consolidating constraint is designed that, when\nsatisfied, guarantees convergence to and invariance of the time-varying\noutput-constrained set within a user-defined finite time. Next, a novel robust\nand low-complexity feedback controller is proposed to ensure the satisfaction\nof the consolidating constraint. 
Additionally, we provide a mechanism for\nonline modification of the consolidating constraint to find a least violating\nsolution when the constraints become mutually infeasible for some time.\nFinally, simulation examples of trajectory and region tracking for a mobile\nrobot validate the proposed approach.\n","authors":["Farhad Mehdifar","Lars Lindemann","Charalampos P. Bechlioulis","Dimos V. Dimarogonas"],"pdf_url":"https://arxiv.org/pdf/2401.03997v5.pdf","comment":"extended version, 21 pages, 8 figures"},{"id":"http://arxiv.org/abs/2405.19546v4","updated":"2025-01-09T05:01:32Z","published":"2024-05-29T22:19:39Z","title":"Convex Optimization of Initial Perturbations toward Quantitative Weather\n Control","summary":" This study proposes introducing convex optimization to find initial\nperturbations of atmospheric states to realize specified changes in subsequent\nweather. In the proposed method, we formulate and solve an inverse problem to\nfind effective perturbations in atmospheric variables so that controlled\nvariables satisfy specified changes at a specified time. The proposed method\nfirst constructs a sensitivity matrix of controlled variables, such as\naccumulated precipitation, to the initial atmospheric variables, such as\ntemperature and humidity, through sensitivity analysis using a numerical\nweather prediction (NWP) model. Then a convex optimization problem is\nformulated to achieve various control specifications involving not only\nquadratic functions but also absolute values and maximum values of the\ncontrolled variables and initial atmospheric variables in the cost function and\nconstraints. The proposed method was validated through a benchmark warm bubble\nexperiment using the NWP model. 
The experiments showed that the identified\nperturbations successfully realized specified spatial distributions of\naccumulated precipitation.\n","authors":["Toshiyuki Ohtsuka","Atsushi Okazaki","Masaki Ogura","Shunji Kotsuki"],"pdf_url":"https://arxiv.org/pdf/2405.19546v4.pdf","comment":"shortend to improve conciseness; some figures added to Supplements\n for discussion about physical processes; license changed to CC BY 4.0;\n revised to improve readability; some figures in Appendix omitted to improve\n conciseness"},{"id":"http://arxiv.org/abs/2501.04964v1","updated":"2025-01-09T04:34:07Z","published":"2025-01-09T04:34:07Z","title":"Promoting Shared Energy Storage Aggregation among High Price-Tolerance\n Prosumer: An Incentive Deposit and Withdrawal Service","summary":" Many residential prosumers exhibit a high price-tolerance for household\nelectricity bills and a low response to price incentives. This is because the\nhousehold electricity bills are not inherently high, and the potential for\nsaving on electricity bills through participation in conventional Shared Energy\nStorage (SES) is limited, which diminishes their motivation to actively engage\nin SES. Additionally, existing SES models often require prosumers to take\nadditional actions, such as optimizing rental capacity and bidding prices,\nwhich happen to be capabilities that typical household prosumers do not\npossess. To incentivize these high price-tolerance residential prosumers to\nparticipate in SES, a novel SES aggregation framework is proposed, which does\nnot require prosumers to take additional actions and allows them to maintain\nexisting energy storage patterns. 
Compared to conventional long-term operation\nof SES, the proposed framework introduces an additional short-term construction\nstep during which the energy service provider (ESP) acquires control of the\nenergy storage systems (ESS) and offers electricity deposit and withdrawal\nservices (DWS) with dynamic coefficients, enabling prosumers to withdraw more\nelectricity than they deposit without additional actions. Additionally, a\nmatching mechanism is proposed to align prosumers' electricity consumption\nbehaviors with ESP's optimization strategies. Finally, the dynamic coefficients\nin DWS and trading strategies are optimized by an improved deep reinforcement\nlearning (DRL) algorithm. Case studies are conducted to verify the\neffectiveness of the proposed SES aggregation framework with DWS and the\nmatching mechanism.\n","authors":["Xin Lu","Jing Qiu","Cuo Zhang","Gang Lei","Jianguo Zhu"],"pdf_url":"https://arxiv.org/pdf/2501.04964v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.02792v3","updated":"2025-01-09T03:32:35Z","published":"2025-01-06T06:25:46Z","title":"Gaming on Coincident Peak Shaving: Equilibrium and Strategic Behavior","summary":" Coincident peak demand charges are imposed by power system operators or\nelectric utilities when the overall system demand, aggregated across multiple\nconsumers, reaches its peak. These charges incentivize consumers to reduce\ntheir demand during peak periods, a practice known as coincident peak shaving.\nIn this paper, we analyze the coincident peak shaving problem through the lens\nof game theory, developing a theoretical model to examine the impact of\nstrategic consumer behavior on system efficiency. We demonstrate that the game\nstructure exhibits varying characteristics - concave,\nquasiconcave/discontinuous, or non-concave/discontinuous - depending on the\nextent of consumers' demand-shifting capabilities. 
For a two-agent, two-period\nsetting, we derive closed-form Nash equilibrium solutions under each condition\nand generalize our findings to cases with multiple agents. We prove the\nstability of the equilibrium points and present an algorithm for computing\nequilibrium outcomes across all game scenarios. We also show that the\npeak-shaving effectiveness of the game model matches that of the centralized\npeak-shaving model but with increased levels of anarchy. In the cases of\nquasiconcave and non-concave game conditions, we analytically demonstrate in\nthe two-agent setting that anarchy increases with consumers' flexibility and\ninequity, as measured by their marginal shifting costs, and we also analyze the\ninfluence of the number of agents on anarchy. Finally, we provide numerical\nsimulations to validate our theoretical results.\n","authors":["Liudong Chen","Bolun Xu"],"pdf_url":"https://arxiv.org/pdf/2501.02792v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04937v1","updated":"2025-01-09T03:01:57Z","published":"2025-01-09T03:01:57Z","title":"Generalized Linear Models with 1-Bit Measurements: Asymptotics of the\n Maximum Likelihood Estimator","summary":" This work establishes regularity conditions for consistency and asymptotic\nnormality of the multiple parameter maximum likelihood estimator (MLE) from\ncensored data, where the censoring mechanism is in the form of $1$-bit\nmeasurements. The underlying distribution of the uncensored data is assumed to\nbelong to the exponential family, with natural parameters expressed as a linear\ncombination of the predictors, known as generalized linear model (GLM). As part\nof the analysis, the Fisher information matrix is also derived for both\ncensored and uncensored data, which helps to quantify the impact of censoring\nand assess the performance of the MLE. The choice of GLM allows one to consider\na variety of practical examples where 1-bit estimation is of interest. 
In\nparticular, it is shown how the derived results can be used to analyze two\npractically relevant scenarios: the Gaussian model with both unknown mean and\nvariance, and the Poisson model with an unknown mean.\n","authors":["Jaimin Shah","Martina Cardone","Cynthia Rush","Alex Dytso"],"pdf_url":"https://arxiv.org/pdf/2501.04937v1.pdf","comment":"ICASSP 2025"},{"id":"http://arxiv.org/abs/2501.04160v2","updated":"2025-01-09T02:53:56Z","published":"2025-01-07T22:19:06Z","title":"Collaborative Spacecraft Servicing under Partial Feedback using\n Lyapunov-based Deep Neural Networks","summary":" Multi-agent systems are increasingly applied in space missions, including\ndistributed space systems, resilient constellations, and autonomous rendezvous\nand docking operations. A critical emerging application is collaborative\nspacecraft servicing, which encompasses on-orbit maintenance, space debris\nremoval, and swarm-based satellite repositioning. These missions involve\nservicing spacecraft interacting with malfunctioning or defunct spacecraft\nunder challenging conditions, such as limited state information, measurement\ninaccuracies, and erratic target behaviors. Existing approaches often rely on\nassumptions of full state knowledge or single-integrator dynamics, which are\nimpractical for real-world applications involving second-order spacecraft\ndynamics. This work addresses these challenges by developing a distributed\nstate estimation and tracking framework that requires only relative position\nmeasurements and operates under partial state information. A novel\n$\\rho$-filter is introduced to reconstruct unknown states using locally\navailable information, and a Lyapunov-based deep neural network adaptive\ncontroller is developed that adaptively compensates for uncertainties stemming\nfrom unknown spacecraft dynamics. To ensure the collaborative spacecraft\nregulation problem is well-posed, a trackability condition is defined. 
A\nLyapunov-based stability analysis is provided to ensure exponential convergence\nof errors in state estimation and spacecraft regulation to a neighborhood of\nthe origin under the trackability condition. The developed method eliminates\nthe need for expensive velocity sensors or extensive pre-training, offering a\npractical and robust solution for spacecraft servicing in complex, dynamic\nenvironments.\n","authors":["Cristian F. Nino","Omkar Sudhir Patil","Christopher D. Petersen","Sean Phillips","Warren E. Dixon"],"pdf_url":"https://arxiv.org/pdf/2501.04160v2.pdf","comment":"24 pages, 4 Figures, Journal"},{"id":"http://arxiv.org/abs/2409.20511v2","updated":"2025-01-09T00:27:06Z","published":"2024-09-30T17:13:11Z","title":"Quantifying Metrics for Wildfire Ignition Risk from Geographic Data in\n Power Shutoff Decision-Making","summary":" Faults on power lines and other electric equipment are known to cause\nwildfire ignitions. To mitigate the threat of wildfire ignitions from electric\npower infrastructure, many utilities preemptively de-energize power lines,\nwhich may result in power shutoffs. Data regarding wildfire ignition risks are\nkey inputs for effective planning of power line de-energizations. However,\nthere are multiple ways to formulate risk metrics that spatially aggregate\nwildfire risk map data, and there are different ways of leveraging this data to\nmake decisions. The key contribution of this paper is to define and compare the\nresults of employing six metrics for quantifying the wildfire ignition risks of\npower lines from risk maps, considering both threshold- and optimization-based\nmethods for planning power line de-energizations. The numeric results use the\nCalifornia Test System (CATS), a large-scale synthetic grid model with power\nline corridors accurately representing California infrastructure, in\ncombination with real Wildland Fire Potential Index data for a full year. 
This\nis the first application of optimal power shutoff planning on such a large and\nrealistic test case. Our results show that the choice of risk metric\nsignificantly impacts the lines that are de-energized and the resulting load\nshed. We find that the optimization-based method results in significantly less\nload shed than the threshold-based method while achieving the same risk\nreduction.\n","authors":["Ryan Piansky","Sofia Taylor","Noah Rhodes","Daniel K. Molzahn","Line A. Roald","Jean-Paul Watson"],"pdf_url":"https://arxiv.org/pdf/2409.20511v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.17925v2","updated":"2025-01-09T23:12:11Z","published":"2024-11-26T22:41:40Z","title":"Stability and Synchronization of Kuramoto Oscillators","summary":" Imagine a group of oscillators, each endowed with their own rhythm or\nfrequency, be it the ticking of a biological clock, the swing of a pendulum, or\nthe glowing of fireflies. While these individual oscillators may seem\nindependent of one another at first glance, the true magic lies in their\nability to influence and synchronize with one another, like a group of\nfireflies glowing in unison.\n The Kuramoto model was motivated by this phenomenon of collective\nsynchronization, when a group of a large number of oscillators spontaneously\nlock to a common frequency, despite vast differences in their individual\nfrequencies. Inspired by Kuramoto's groundbreaking work in the 1970s, this\nmodel captures the essence of how interconnected systems, ranging from\nbiological networks to power grids, can achieve a state of synchronization.\n This work aims to study the stability and synchronization of Kuramoto\noscillators, starting off with an introduction to Kuramoto Oscillators and it's\nbroader applications. We then at a graph theoretic formulation for the same and\nestablish various criterion for the stability, synchronization of Kuramoto\nOscillators. 
Finally, we broadly analyze and experiment with various physical\nsystems that tend to behave like Kuramoto oscillators followed by further\nsimulations.\n","authors":["Abhiram Gorle"],"pdf_url":"https://arxiv.org/pdf/2411.17925v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.02094v2","updated":"2025-01-09T19:50:57Z","published":"2025-01-03T20:43:57Z","title":"SMTL: A Stratified Logic for Expressive Multi-Level Temporal\n Specifications","summary":" We present Stratified Metric Temporal Logic (SMTL), a novel formalism for\nspecifying and verifying properties of complex cyber-physical systems that\nexhibit behaviors across multiple temporal and abstraction scales. SMTL extends\nexisting temporal logics by incorporating a stratification operator, enabling\nthe association of temporal properties with specific abstraction levels. This\nallows for the natural expression of multi-scale requirements while maintaining\nformal reasoning about inter-level relationships. We formalize the syntax and\nsemantics of SMTL, proving that it strictly subsumes metric temporal logic\n(MTL) and offers enhanced expressiveness by capturing properties unattainable\nin existing logics. Numerical simulations comparing agents operating under MTL\nand SMTL specifications show that SMTL enhances agent coordination and safety,\nreducing collision rates without substantial computational overhead or\ncompromising path efficiency. 
These findings underscore SMTL's potential as a\nvaluable tool for designing and verifying complex multi-agent systems operating\nacross diverse temporal and abstraction scales.\n","authors":["Ali Baheri","Peng Wei"],"pdf_url":"https://arxiv.org/pdf/2501.02094v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05548v1","updated":"2025-01-09T19:38:27Z","published":"2025-01-09T19:38:27Z","title":"Switched Optimal Control with Dwell Time Constraints","summary":" This paper presents an embedding-based approach for solving switched optimal\ncontrol problems (SOCPs) with dwell time constraints. At first, an embedded\noptimal control problem (EOCP) is defined by replacing the discrete switching\nsignal with a continuous embedded variable that can take intermediate values\nbetween the discrete modes. While embedding enables solutions of SOCPs via\nconventional techniques, optimal solutions of EOCPs often involve nonexistent\nmodes and thus may not be feasible for the SOCP. In the modified EOCP (MEOCP),\na concave function is added to the cost function to enforce a bang-bang\nsolution in the embedded variable, which results in feasible solutions for the\nSOCP. However, the MEOCP cannot guarantee the satisfaction of dwell-time\nconstraints.\n In this paper, a MEOCP is combined with a filter layer to remove switching\ntimes that violate the dwell time constraint. Insertion gradients are used to\nminimize the effect of the filter on the optimal cost.\n","authors":["Masoud S. 
Sakha","Rushikesh Kamalapurkar"],"pdf_url":"https://arxiv.org/pdf/2501.05548v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.15745v7","updated":"2025-01-09T19:35:10Z","published":"2024-01-28T20:12:08Z","title":"The computation of approximate feedback Stackelberg equilibria in\n multi-player nonlinear constrained dynamic games","summary":" Solving feedback Stackelberg games with nonlinear dynamics and coupled\nconstraints, a common scenario in practice, presents significant challenges.\nThis work introduces an efficient method for computing approximate local\nfeedback Stackelberg equilibria in multi-player general-sum dynamic games, with\ncontinuous state and action spaces. Different from existing (approximate)\ndynamic programming solutions that are primarily designed for unconstrained\nproblems, our approach involves reformulating a feedback Stackelberg dynamic\ngame into a sequence of nested optimization problems, enabling the derivation\nof Karush-Kuhn-Tucker (KKT) conditions and the establishment of a second-order\nsufficient condition for local feedback Stackelberg equilibria. We propose a\nNewton-style primal-dual interior point method for solving constrained linear\nquadratic (LQ) feedback Stackelberg games, offering provable convergence\nguarantees. Our method is further extended to compute local feedback\nStackelberg equilibria for more general nonlinear games by iteratively\napproximating them using LQ games, ensuring that their KKT conditions are\nlocally aligned with those of the original nonlinear games. We prove the\nexponential convergence of our algorithm in constrained nonlinear games. 
In a\nfeedback Stackelberg game with nonlinear dynamics and (nonconvex) coupled costs\nand constraints, our experimental results reveal the algorithm's ability to\nhandle infeasible initial conditions and achieve exponential convergence\ntowards an approximate local feedback Stackelberg equilibrium.\n","authors":["Jingqi Li","Somayeh Sojoudi","Claire Tomlin","David Fridovich-Keil"],"pdf_url":"https://arxiv.org/pdf/2401.15745v7.pdf","comment":"This manuscript has been accepted by SIAM Journal on Optimization. We\n fix few typos in this arxiv version"},{"id":"http://arxiv.org/abs/2501.04988v1","updated":"2025-01-09T06:01:34Z","published":"2025-01-09T06:01:34Z","title":"Intelligent Sailing Model for Open Sea Navigation","summary":" Autonomous vessels potentially enhance safety and reliability of seaborne\ntrade. To facilitate the development of autonomous vessels, high-fidelity\nsimulations are required to model realistic interactions with other vessels.\nHowever, modeling realistic interactive maritime traffic is challenging due to\nthe unstructured environment, coarsely specified traffic rules, and largely\nvarying vessel types. Currently, there is no standard for simulating\ninteractive maritime environments in order to rigorously benchmark autonomous\nvessel algorithms. In this paper, we introduce the first intelligent sailing\nmodel (ISM), which simulates rule-compliant vessels for navigation on the open\nsea. An ISM vessel reacts to other traffic participants according to maritime\ntraffic rules while at the same time solving a motion planning task\ncharacterized by waypoints. In particular, the ISM monitors the applicable\nrules, generates rule-compliant waypoints accordingly, and utilizes a model\npredictive control for tracking the waypoints. We evaluate the ISM in two\nenvironments: interactive traffic with only ISM vessels and mixed traffic where\nsome vessel trajectories are from recorded real-world maritime traffic data or\nhandcrafted for criticality. 
Our results show that simulations with many ISM\nvessels of different vessel types are rule-compliant and scalable. We tested\n4,049 critical traffic scenarios. For interactive traffic with ISM vessels, no\ncollisions occurred while goal-reaching rates of about 97 percent were\nachieved. We believe that our ISM can serve as a standard for challenging and\nrealistic maritime traffic simulation to accelerate autonomous vessel\ndevelopment.\n","authors":["Hanna Krasowski","Stefan Schärdinger","Murat Arcak","Matthias Althoff"],"pdf_url":"https://arxiv.org/pdf/2501.04988v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.07595v1","updated":"2025-01-09T23:02:16Z","published":"2025-01-09T23:02:16Z","title":"LUCAS: A Low-Power Ultra-Low Jitter Compact ASIC for SiPM Targetting\n ToF-CT","summary":" We present LUCAS (Low power Ultra-low jitter Compact ASIC for SiPM), an\nanalog front-end for Silicon Photomultipliers (SiPM) targeting fast timing\ndetectors in Time-of-Flight Computed Tomography (ToF-CT). LUCAS features a very\nlow input impedance preamplifier followed by a voltage comparator. It is\ndesigned in TSMC 65 nm low-power CMOS technology with a power supply of 1.2 V.\nOur first 8-channel prototype has been sent to fabrication and will be received\nin August 2023. Post-layout simulations predict less than 40 ps FWHM SPTR\njitter and an approximate power consumption of 3.2 mW per channel. The front\nend is suitable for applications with rigorous jitter requirements and high\nevent rates, thanks to its 3.9 GHz unity-gain bandwidth. 
The front-end compact\nform factor will facilitate its incorporation into systems demanding high\nchannel densities.\n","authors":["Seyed Arash Katourani"],"pdf_url":"https://arxiv.org/pdf/2501.07595v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.09173v1","updated":"2025-01-09T11:06:35Z","published":"2025-01-09T11:06:35Z","title":"Formalising the intentional stance 2: a coinductive approach","summary":" Given a stochastic process with inputs and outputs, how might its behaviour\nbe related to pursuit of a goal? We model this using 'transducers', objects\nthat capture only the external behaviour of a system and not its internal\nstate. A companion paper summarises our results for cognitive scientists; the\ncurrent paper gives formal definitions and proofs.\n To formalise the concept of a system that behaves as if it were pursuing a\ngoal, we consider what happens when a transducer (a 'policy') is coupled to\nanother transducer that comes equipped with a success condition (a\n'teleo-environment'). An optimal policy is identified with a transducer that\nbehaves as if it were perfectly rational in the pursuit of a goal; our\nframework also allows us to model constrained rationality.\n Optimal policies obey a version of Bellman's principle: a policy that's\noptimal in one time step will again be optimal in the next time step, but with\nrespect to a different teleo-environment (obtained from the original one by a\nmodified version of Bayesian filtering). This property sometimes also applies\nto the bounded-rational case; we give a sufficient condition.\n A policy is deterministic if and only if there exists a teleo-environment for\nwhich it is uniquely optimal among the set of all policies; we relate this to\nclassical representation theorems from decision theory. This result need not\nhold in the bounded-rational case; we give an example related to the\nabsent-minded driver problem. 
The formalism is defined using coinduction,\nfollowing the style proposed by Czajka.\n","authors":["Simon McGregor"," timorl","Nathaniel Virgo"],"pdf_url":"https://arxiv.org/pdf/2501.09173v1.pdf","comment":"This is the companion paper to \"Formalising the intentional stance 1:\n attributing goals and beliefs to stochastic processes\" (uploaded as version 2\n of arXiv:2405.16490). The other paper is an overview aimed at cognitive\n scientists while this paper gives full mathematical details. 50 pages, no\n figures"}],"Optimization and Control":[{"id":"http://arxiv.org/abs/2501.05430v1","updated":"2025-01-09T18:42:49Z","published":"2025-01-09T18:42:49Z","title":"A dimension reduction procedure for the design of lattice-spring systems\n with minimal fabrication cost and required multi-functional properties","summary":" We show that the problem of the design of the lattices of elastoplastic\ncurrent conducting springs with optimal multi-functional properties leads to an\nanalytically tractable problem. Specifically, focusing on a lattice with a\nsmall number of springs, we use the technique of inequalities to reduce the\nnumber of variables and to compute the minimal cost of lattice fabrication\nexplicitly.\n","authors":["Egor Makarenkov","Sakshi Malhotra","Yang Jiao"],"pdf_url":"https://arxiv.org/pdf/2501.05430v1.pdf","comment":"20 pages, 10 figures"},{"id":"http://arxiv.org/abs/2408.01857v2","updated":"2025-01-09T17:54:15Z","published":"2024-08-03T20:00:36Z","title":"Using Linearized Optimal Transport to Predict the Evolution of\n Stochastic Particle Systems","summary":" We develop an algorithm to approximate the time evolution of a probability\ndistribution without explicitly learning an operator that governs the\nevolution. A particular application of interest is discrete measures $\\mu_t^N$\nthat arise from systems of $N$ particles in $\\mathbb R^d$. 
In many such\nsituations, the individual particles move chaotically on short time scales,\nmaking it difficult to learn the dynamics of a governing operator, but the bulk\ndistribution $\\mu_t^N$ approximates an absolutely continuous measure $\\mu_t$\nthat evolves ``smoothly.'' If $\\mu_t$ is known on some time interval, then\nlinearized optimal transport theory provides an Euler-like scheme for\napproximating the evolution of $\\mu_t$ using its ``tangent vector field''\n(represented as a time-dependent vector field on $\\mathbb R^d$), which can be\ncomputed as a limit of optimal transport maps. We propose an analog of this\nEuler approximation to predict the evolution of the discrete measure $\\mu_t^N$\n(without knowing $\\mu_t$). To approximate the analogous tangent vector field,\nwe use a finite difference over a time step that sits between two time scales\nof the system -- long enough for a large-$N$ evolution ($\\mu_t$) to emerge but\nshort enough to satisfactorily approximate the derivative object used in the\nEuler scheme. The emergence of the limiting behavior ensures the optimal\ntransport maps closely approximate the vector field describing the bulk\ndistribution's smooth evolution instead of the individual particles' more\nchaotic movements. We demonstrate the efficacy of our approach with two\nillustrative examples, Gaussian diffusion and a cell chemotaxis model, and show\nthat our method succeeds in predicting the bulk behavior over relatively large\nsteps.\n","authors":["Nicholas Karris","Evangelos A. 
Nikitopoulos","Ioannis Kevrekidis","Seungjoon Lee","Alexander Cloninger"],"pdf_url":"https://arxiv.org/pdf/2408.01857v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05388v1","updated":"2025-01-09T17:22:11Z","published":"2025-01-09T17:22:11Z","title":"A fast approximate scenario addition method for two-stage robust\n mixed-integer programs","summary":" This paper presents a new scenario addition method for two-stage robust\nmixed-integer programs with finite uncertainty sets. Our method combines and\nextends speed-up techniques used in previous scenario addition methods (also\ncalled column-and-constraint generation methods) and introduces several new\ntechniques. In particular, it uses dual bounds for second-stage problems in\norder to allow a faster identification of the next promising scenario to be\nadded to the master problem. Moreover, adaptive time limits are imposed to\navoid getting stuck on particularly hard second-stage problems, and a gap\npropagation between master problem and second-stage problems is used to stop\nsolving them earlier if only a given non-zero optimality gap is to be reached\noverall. This makes our method particularly effective for problems where\nsolving the second-stage problem is computationally challenging. To evaluate\nthe method's performance, we compare it to two recent scenario addition methods\nfrom the literature on two applications: a robust capacitated location routing\nproblem and a robust integrated berth allocation and quay crane assignment and\nscheduling problem. The first problem features a particularly hard second\nstage, and we show that our method is able to solve considerably more and\nlarger instances in a given time limit. 
Using the second problem, we verify the\ngeneral applicability of our method, even for problems where the second stage\nis relatively easy.\n","authors":["Marc Goerigk","Dorothee Henke","Johannes Kager","Fabian Schäfer","Clemens Thielen"],"pdf_url":"https://arxiv.org/pdf/2501.05388v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05373v1","updated":"2025-01-09T16:57:54Z","published":"2025-01-09T16:57:54Z","title":"On the emergence of almost-honeycomb structures in low-energy planar\n clusters","summary":" Several commonly observed physical and biological systems are arranged in\nshapes that closely resemble a honeycomb cluster, that is, a tessellation of\nthe plane by regular hexagons. Although these shapes are not always the direct\nproduct of energy minimization, they can still be understood, at least\nphenomenologically, as low-energy configurations. In this paper, explicit\nquantitative estimates on the geometry of such low-energy configurations are\nprovided, showing in particular that the vast majority of the chambers must be\ngeneralized polygons with six edges, and must closely resemble regular\nhexagons. Part of our arguments is a detailed revision of the estimates behind\nthe global isoperimetric principle for honeycomb clusters due to Hales (T. C.\nHales. The honeycomb conjecture. Discrete Comput. Geom., 25(1):1-22, 2001).\n","authors":["Marco Caroccia","Kenneth DeMason","Francesco Maggi"],"pdf_url":"https://arxiv.org/pdf/2501.05373v1.pdf","comment":"32 pages, 3 figures"},{"id":"http://arxiv.org/abs/2501.05365v1","updated":"2025-01-09T16:48:14Z","published":"2025-01-09T16:48:14Z","title":"Control of Overpopulated Tails in Kinetic Epidemic Models","summary":" We introduce model-based transition rates for controlled compartmental models\nin mathematical epidemiology, with a focus on the effects of control strategies\napplied to interacting multi-agent systems describing contact formation\ndynamics. 
In the framework of kinetic control problems, we compare two\nprototypical control protocols: one additive control directly influencing the\ndynamics and another targeting the interaction strength between agents. The\nemerging controlled macroscopic models are derived for an SIR\ncompartmentalization to illustrate their impact on epidemic progression and\ncontact interaction dynamics. Numerical results show the effectiveness of this\napproach in steering the dynamics and controlling epidemic trends, even in\nscenarios where contact distributions exhibit an overpopulated tail.\n","authors":["Mattia Zanella","Andrea Medaglia"],"pdf_url":"https://arxiv.org/pdf/2501.05365v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03144v2","updated":"2025-01-09T16:35:13Z","published":"2025-01-06T17:09:38Z","title":"Enhancing Quantum State Reconstruction with Structured Classical Shadows","summary":" Quantum state tomography (QST) remains the prevailing method for benchmarking\nand verifying quantum devices; however, its application to large quantum\nsystems is rendered impractical due to the exponential growth in both the\nrequired number of total state copies and classical computational resources.\nRecently, the classical shadow (CS) method has been introduced as a more\ncomputationally efficient alternative, capable of accurately predicting key\nquantum state properties. Despite its advantages, a critical question remains\nas to whether the CS method can be extended to perform QST with guaranteed\nperformance. In this paper, we address this challenge by introducing a\nprojected classical shadow (PCS) method with guaranteed performance for QST\nbased on Haar-random projective measurements. PCS extends the standard CS\nmethod by incorporating a projection step onto the target subspace. 
For a\ngeneral quantum state consisting of $n$ qubits, our method requires a minimum\nof $O(4^n)$ total state copies to achieve a bounded recovery error in the\nFrobenius norm between the reconstructed and true density matrices, reducing to\n$O(2^n r)$ for states of rank $r<2^n$ -- meeting information-theoretic optimal\nbounds in both cases. For matrix product operator states, we demonstrate that\nthe PCS method can recover the ground-truth state with $O(n^2)$ total state\ncopies, improving upon the previously established Haar-random bound of\n$O(n^3)$. Simulation results further validate the effectiveness of the proposed\nPCS method.\n","authors":["Zhen Qin","Joseph M. Lukens","Brian T. Kirby","Zhihui Zhu"],"pdf_url":"https://arxiv.org/pdf/2501.03144v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.08188v4","updated":"2025-01-09T15:54:36Z","published":"2024-01-16T08:04:34Z","title":"Bounded weak solutions for Keller-Segel equations with generalized\n diffusion and logistic source via an unbalanced Optimal Transport splitting\n scheme","summary":" We consider a parabolic-elliptic type of Keller-Segel equations with\ngeneralized diffusion and logistic source under homogeneous Neumann-Neumann\nboundary conditions. We construct bounded weak solutions globally in time in an\nunbalanced optimal transport framework, provided that the magnitude of the\nchemotactic sensitivity can be restricted depending on parameters. 
In the case\nof subquadratic degradation of the logistic source, we quantify the chemotactic\nsensitivity, in particular, in terms of the power of degradation and the\npointwise bound of the initial density.\n","authors":["Kyungkeun Kang","Hwa Kil Kim","Geuntaek Seo"],"pdf_url":"https://arxiv.org/pdf/2401.08188v4.pdf","comment":"29 pages"},{"id":"http://arxiv.org/abs/2501.05320v1","updated":"2025-01-09T15:40:59Z","published":"2025-01-09T15:40:59Z","title":"Isoperimetric inequalities for the fractional composite membrane problem","summary":" In this article, we investigate some isoperimetric-type inequalities related\nto the first eigenvalue of the fractional composite membrane problem. First, we\nestablish an analogue of the renowned Faber-Krahn inequality for the fractional\ncomposite membrane problem. Next, we investigate an isoperimetric inequality\nfor the first eigenvalue of the fractional composite membrane problem on the\nintersection of two domains-a problem that was first studied by Lieb [23] for\nthe Laplacian. Similar results in the local case were previously obtained by\nCupini-Vecchi [9] for the composite membrane problem. Our findings provide\nfurther insights into the fractional setting, offering a new perspective on\nthese classical inequalities.\n","authors":["Mrityunjoy Ghosh"],"pdf_url":"https://arxiv.org/pdf/2501.05320v1.pdf","comment":"10 pages"},{"id":"http://arxiv.org/abs/2412.13538v2","updated":"2025-01-09T15:27:12Z","published":"2024-12-18T06:35:10Z","title":"Stabilization of strictly pre-dissipative nonlinear receding horizon\n control by terminal costs","summary":" It is known that receding horizon control with a strictly pre-dissipative\noptimal control problem yields a practically asymptotically stable closed loop\nwhen suitable state constraints are imposed. 
In this note we show that\nalternatively suitably bounded terminal costs can be used for stabilizing the\nclosed loop.\n","authors":["Lars Grüne","Mario Zanon"],"pdf_url":"https://arxiv.org/pdf/2412.13538v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05280v1","updated":"2025-01-09T14:43:29Z","published":"2025-01-09T14:43:29Z","title":"Exploring near-optimal energy systems with stakeholders: a novel\n approach for participatory modelling","summary":" Involving people in energy systems planning can increase the legitimacy and\nsocio-political feasibility of energy transitions. Participatory research in\nenergy modelling offers the opportunity to engage with stakeholders in a\ncomprehensive way, but is limited by how results can be generated and presented\nwithout imposing assumptions and discrete scenarios on the participants. To\nthis end, we present a methodology and a framework, based on near-optimal\nmodelling results, that can incorporate stakeholders in a holistic and engaging\nway. We confront stakeholders with a continuum of modelling-based energy system\ndesigns via an interactive interface allowing them to choose essentially any\ncombination of components that meet the system requirements. Together with\ninformation on the implications of different technologies, it is possible to\nassess how participants prioritise different aspects in energy systems planning\nwhile also facilitating learning in an engaging and stimulating way. We\nshowcase the methodology for the remote Arctic settlement of Longyearbyen and\nillustrate how participants deviate consistently from the cost optimum. 
At the\nsame time, they manage to balance different priorities such as emissions,\ncosts, and system vulnerability leading to a better understanding of the\ncomplexity and intertwined nature of decisions.\n","authors":["Oskar Vågerö","Koen van Greevenbroek","Aleksander Grochowicz","Maximilian Roithner"],"pdf_url":"https://arxiv.org/pdf/2501.05280v1.pdf","comment":"24 pages, 7 figures and 3 tables"},{"id":"http://arxiv.org/abs/2311.09844v2","updated":"2025-01-09T14:39:59Z","published":"2023-11-16T12:15:43Z","title":"Observability of the linear Zakharov--Kuznetsov equation","summary":" We study the linear Zakharov--Kuznetsov equation with periodic boundary\nconditions. Employing some tools from the nonharmonic Fourier series we obtain\nseveral internal observability theorems. Then we prove various exact\ncontrollability and rapid uniform stabilization results by applying a duality\nprinciple and a general feedback construction. The method presented here\nintroduces a new insight into the control of dispersive equations in\ntwo-dimensional cases and may be adapted to more general equations.\n","authors":["Roberto de A. Capistrano Filho","Vilmos Komornik","Ademir F. Pazoto"],"pdf_url":"https://arxiv.org/pdf/2311.09844v2.pdf","comment":"30 pages, 2 figures. Comments are welcome"},{"id":"http://arxiv.org/abs/2501.04572v2","updated":"2025-01-09T14:30:41Z","published":"2025-01-08T15:42:41Z","title":"Regret Analysis: a control perspective","summary":" Online learning and model reference adaptive control have many interesting\nintersections. One area where they differ however is in how the algorithms are\nanalyzed and what objective or metric is used to discriminate \"good\" algorithms\nfrom \"bad\" algorithms. 
In adaptive control there are usually two objectives: 1)\nprove that all time-varying parameters/states of the system are bounded, and 2)\nthat the instantaneous error between the adaptively controlled system and a\nreference system converges to zero over time (or at least a compact set). For\nonline learning the performance of algorithms is often characterized by the\nregret the algorithm incurs. Regret is defined as the cumulative loss (cost)\nover time from the online algorithm minus the cumulative loss (cost) of the\nsingle optimal fixed parameter choice in hindsight. Another significant\ndifference between the two areas of research is with regard to the assumptions\nmade in order to obtain said results. Adaptive control makes assumptions about\nthe input-output properties of the control problem and derives solutions for a\nfixed error model or optimization task. In the online learning literature\nresults are derived for classes of loss functions (i.e. convex) while a priori\nassuming that all time-varying parameters are bounded, which for many\noptimization tasks is not unrealistic, but is a non-starter in control\napplications. In this work we discuss these differences in detail through the\nregret-based analysis of gradient descent for convex functions and the\ncontrol-based analysis of a streaming regression problem. We close with a discussion\nabout the newly defined paradigm of online adaptive control and ask the\nfollowing question \"Are regret optimal control strategies deployable?\"\n","authors":["Travis E. Gibson","Sawal Acharya"],"pdf_url":"https://arxiv.org/pdf/2501.04572v2.pdf","comment":"10 pages no figures"},{"id":"http://arxiv.org/abs/2501.05270v1","updated":"2025-01-09T14:27:15Z","published":"2025-01-09T14:27:15Z","title":"Identifiability of Controlled Open Quantum Systems","summary":" Open quantum systems are a rich area of research on the intersection of\nquantum mechanics and stochastic analysis. 
We unify multiple views of\ncontrolled open quantum systems within the framework of bilinear dynamical\nsystems. We define the corresponding notions of identifiability from the\nresults of quantum state tomography, obtained in many copies of the initial\nquantum state, under subsequences of varying lengths of control signals. We\nexplain and extend work on identifiability of bilinear systems using\nspectral criteria, criteria based on the Hankel matrix, and frequency-domain\ncriteria, to the parameter estimation of master equations of open quantum\nsystems. This sets the groundwork for a number of constructive approaches to\nthe identification of open quantum systems.\n","authors":["Waqas Parvaiz","Johannes Aspman","Ales Wodecki","Georgios Korpas","Jakub Marecek"],"pdf_url":"https://arxiv.org/pdf/2501.05270v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05200v1","updated":"2025-01-09T12:51:39Z","published":"2025-01-09T12:51:39Z","title":"On Coordinated Drone-Courier Logistics for Intra-city Express Services","summary":" Problem definition: Drones, despite being acknowledged as a transformative\nforce in the city logistics sector, are unable to execute the\n\\textit{last-meter delivery} (unloading goods directly to customers' doorsteps)\ndue to airspace restrictions and safety concerns. To leverage advancements and\novercome the limitations of drones in providing intra-city express services, we\nintroduce a coordinated drone-courier logistics system where drones operate\nwithin a closed network among vertiports, while couriers connect customers to\nthe drone delivery system. This paper aims to shed light on this coordinated\nsystem in terms of system feasibility, network interactivity, and long-term\nsustainability. Methodology/Results: We develop an integrated optimization\nmodel to optimize the network planning of the coordinated logistics system. 
The\ninterplay between network planning and tactical operations is mirrored by a\nqueueing network model, resulting in the nonlinear and nonconvex (partially\nconvex and partially concave) feasible region of the optimization model. An\niterative exact algorithm that tightens lower and upper bounds by adaptively\nrefining the linear approximations of nonlinear constraints is developed to\nprovide optimality-guaranteed solutions with finite convergence. The\ncomputational experiments demonstrate the scalability and robustness of our\nalgorithm across various network configurations and scenarios. Managerial\nimplications: The case study, based on a real-world dataset from SF Express, a\nlogistics giant in China, validates that the coordinated logistics system\nefficiently attains cost and time savings by leveraging the effective turnover\nof drones and the coordination between drones and couriers. The optimal network\ndesign features a concentrated structure, streamlining demand consolidation and\nreducing deadhead repositioning.\n","authors":["Shuiwang Chen","Kai Wang","Lingxiao Wu","Wei Qi"],"pdf_url":"https://arxiv.org/pdf/2501.05200v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05178v1","updated":"2025-01-09T11:53:52Z","published":"2025-01-09T11:53:52Z","title":"KLAP: KYP lemma based low-rank approximation for $\\mathcal{H}_2$-optimal\n passivation","summary":" We present a novel passivity enforcement (passivation) method, called KLAP,\nfor linear time-invariant systems based on the Kalman-Yakubovich-Popov (KYP)\nlemma and the closely related Lur'e equations. The passivation problem in our\nframework corresponds to finding a perturbation to a given non-passive system\nthat renders the system passive while minimizing the $\\mathcal{H}_2$ or\nfrequency-weighted $\\mathcal{H}_2$ distance between the original non-passive\nand the resulting passive system. 
We show that this problem can be formulated\nas an unconstrained optimization problem whose objective function can be\ndifferentiated efficiently even in large-scale settings. We show that any\nminimizer of the unconstrained problem yields the same passive system.\nFurthermore, we prove that, in the absence of a feedthrough term, every local\nminimizer is also a global minimizer. For cases involving a non-trivial\nfeedthrough term, we analyze global minimizers in relation to the extremal\nsolutions of the Lur'e equations, which can serve as tools for identifying\nlocal minima. To solve the resulting numerical optimization problem\nefficiently, we propose an initialization strategy based on modifying the\nfeedthrough term and a restart strategy when it is likely that the optimization\nhas converged to a local minimum. Numerical examples illustrate the\neffectiveness of the proposed method.\n","authors":["Jonas Nicodemus","Matthias Voigt","Serkan Gugercin","Benjamin Unger"],"pdf_url":"https://arxiv.org/pdf/2501.05178v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05158v1","updated":"2025-01-09T11:27:58Z","published":"2025-01-09T11:27:58Z","title":"An Efficient Mixed-Integer Formulation and an Iterative Method for\n Optimal Control of Switched Systems Under Dwell Time Constraints","summary":" This paper presents an efficient Mixed-Integer Nonlinear Programming (MINLP)\nformulation for systems with discrete control inputs under dwell time\nconstraints. By viewing such systems as a switched system, the problem is\ndecomposed into a Sequence Optimization (SO) and a Switching Time Optimization\n(STO) -- the former providing the sequence of the switched system, and the\nlatter calculating the optimal switching times. By limiting the feasible set of\nSO to subsequences of a master sequence, this formulation requires a small\nnumber of binary variables, independent of the number of time discretization\nnodes. 
This enables the proposed formulation to provide solutions efficiently,\neven for large numbers of time discretization nodes. To provide even faster\nsolutions, an iterative algorithm is introduced to heuristically solve STO and\nSO. The proposed approaches are then showcased on four different switched\nsystems and the results demonstrate the efficiency of the MINLP formulation and the\niterative algorithm.\n","authors":["Ramin Abbasi-Esfeden","Armin Nurkanovic","Moritz Diehl","Panagiotis Patrinos","Jan Swevers"],"pdf_url":"https://arxiv.org/pdf/2501.05158v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.03105v2","updated":"2025-01-09T11:24:56Z","published":"2024-04-03T23:07:24Z","title":"Methodology for Interpretable Reinforcement Learning for Optimizing\n Mechanical Ventilation","summary":" Mechanical ventilation is a critical life support intervention that delivers\ncontrolled air and oxygen to a patient's lungs, assisting or replacing\nspontaneous breathing. While several data-driven approaches have been proposed\nto optimize ventilator control strategies, they often lack interpretability and\nalignment with domain knowledge, hindering clinical adoption. This paper\npresents a methodology for interpretable reinforcement learning (RL) aimed at\nimproving mechanical ventilation control as part of connected health systems.\nUsing a causal, nonparametric model-based off-policy evaluation, we assess RL\npolicies for their ability to enhance patient-specific outcomes, specifically\nincreasing blood oxygen levels (SpO2), while avoiding aggressive ventilator\nsettings that may cause ventilator-induced lung injuries and other\ncomplications. Through numerical experiments on real-world ICU data from the\nMIMIC-III database, we demonstrate that our interpretable decision tree policy\nachieves performance comparable to state-of-the-art deep RL methods while\noutperforming standard behavior cloning approaches. 
The results highlight the\npotential of interpretable, data-driven decision support systems to improve\nsafety and efficiency in personalized ventilation strategies, paving the way\nfor seamless integration into connected healthcare environments.\n","authors":["Joo Seung Lee","Malini Mahendra","Anil Aswani"],"pdf_url":"https://arxiv.org/pdf/2404.03105v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.16490v2","updated":"2025-01-09T10:38:46Z","published":"2024-05-26T08:58:03Z","title":"Formalising the intentional stance 1: attributing goals and beliefs to\n stochastic processes","summary":" This article presents a formalism inspired by Dennett's notion of the\nintentional stance. Whereas Dennett's treatment of these concepts is informal,\nwe aim to provide a more formal analogue. We introduce a framework based on\nstochastic processes with inputs and outputs, in which we can talk precisely\nabout *interpreting* systems as having *normative-epistemic states*, which\ncombine belief-like and desire-like features. Our framework is based on\noptimality but nevertheless allows us to model some forms of bounded cognition.\n One might expect that the systems that can be described in\nnormative-epistemic terms would be some special subset of all systems, but we\nshow that this is not the case: every system admits a (possibly trivial)\nnormative-epistemic interpretation, and those that can be *uniquely specified*\nby a normative-epistemic description are exactly the deterministic ones.\nFinally, we show that there is a suitable notion of Bayesian updating for\nnormative-epistemic states, which we call *value-laden filtering*, since it\ninvolves both normative and epistemic elements. For unbounded cognition it is\nalways permissible to attribute beliefs that update in this way. 
This is not\nalways the case for bounded cognition, but we give a sufficient condition under\nwhich it is.\n This paper gives an overview of our framework aimed at cognitive scientists,\nwith a formal mathematical treatment given in a companion paper.\n","authors":["Simon McGregor"," timorl","Nathaniel Virgo"],"pdf_url":"https://arxiv.org/pdf/2405.16490v2.pdf","comment":"The previous version of this document included the content of the\n companion paper, \"Formalising the intentional stance 2: a coinductive\n approach\". The paper has now been split into two, this one (which is an\n overview aimed at cognitive scientists) and the companion (which contains\n full mathematical detail). 16 pages, one figure with two subfigures"},{"id":"http://arxiv.org/abs/2412.16222v2","updated":"2025-01-09T10:27:10Z","published":"2024-12-18T12:54:50Z","title":"A matheuristic approach for an integrated lot-sizing and scheduling\n problem with a period-based learning effect","summary":" This research investigates a multi-product capacitated lot-sizing and\nscheduling problem incorporating a novel learning effect, namely the\nperiod-based learning effect. This is inspired by a real case in a core\nanalysis laboratory under a job shop setting. Accordingly, a Mixed-Integer\nLinear Programming (MILP) model is extended based on the big-bucket\nformulation, optimizing the total tardiness and overtime costs. Given the\ncomplexity of the problem, a cutting plane method is employed to simplify the\nmodel. Afterward, three matheuristic methods based on the rolling horizon\napproach are devised, incorporating two lower bounds and a local search\nheuristic. Furthermore, a post-processing approach is implemented to\nincorporate lot-streaming possibility. 
Computational experiments demonstrate:\n1) the simplified model performs effectively in terms of both solution quality\nand computational time; 2) although the model encounters challenges with\nlarge-scale instances, the proposed matheuristic methods achieve satisfactory\noutcomes; 3) the complexity of the models and\nsolution methods is independent of the learning effect; however, the value of the\nlearning effect may impact the performance of the lower bounds; 4) in\nmanufacturing settings, where lot-streaming is possible, incorporating\npost-processing can drastically improve the objective function; and 5) the impact\nof the period-based learning effect on the results is significant, and the\nmodel's sensitivity to time-based parameters (e.g., learning rate) is greater than\nto cost-based ones (e.g., tardiness cost).\n","authors":["Mohammad Rohaninejad","Behdin Vahedi-Nouri","Reza Tavakkoli-Moghaddam","Zdeněk Hanzálek"],"pdf_url":"https://arxiv.org/pdf/2412.16222v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05052v1","updated":"2025-01-09T08:20:43Z","published":"2025-01-09T08:20:43Z","title":"Cover-Relax-Search: A Primal Heuristic for Binary Quadratic Programs","summary":" Binary Quadratic Programs (BQPs) are a class of NP-hard problems that arise\nin a wide range of applications, including finance, machine learning, and\nlogistics. These problems are challenging to solve due to the combinatorial\nsearch space and nonlinearity. In fact, this class of optimization problems is\nso challenging that, in many instances, standard algorithms struggle to\nidentify feasible solutions within a reasonable time. Primal heuristic\nalgorithms have been developed to quickly identify feasible solutions to BQPs.\nIn this paper, we propose Cover-Relax-Search, an efficient primal heuristic for\nBQPs. This approach is inspired by multiple local search algorithms, including\nUndercover. 
We evaluate the \\emph{Cover-Relax-Search} algorithm on multiple BQP\nbenchmarks and show that our proposed heuristic identifies high-quality\nsolutions at a faster speed and significantly reduces the primal integral\ncompared to state-of-the-art solvers and other local search baselines.\n","authors":["Weimin Huang","Natalie M. Isenberg","Jan Drgona","Draguna L Vrabie","Bistra Dilkina"],"pdf_url":"https://arxiv.org/pdf/2501.05052v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05046v1","updated":"2025-01-09T08:12:34Z","published":"2025-01-09T08:12:34Z","title":"Quantum-Assisted Space Logistics Mission Planning","summary":" Quantum computing provides a novel approach to addressing conventionally\nintractable issues in large-scale optimization. Space logistics missions\nrequire the efficient routing of payloads, spacecraft, and resources across\ncomplex networks, often resulting in an exponential growth of the solution\nspace that classical methods cannot efficiently solve. This paper leverages\nentropy quantum computing to model and solve the space logistics problem as a\ntime-dependent multicommodity network flow, enabling the exploration of large\nsolution spaces. The findings highlight quantum computing's potential to\naddress complex aerospace logistics, demonstrating its suitability for complex\ninterplanetary mission planning.\n","authors":["Amiratabak Bahengam","Mohammad-Ali Miri","R. Joseph Rupert","Wesley Dyk","Hao Chen"],"pdf_url":"https://arxiv.org/pdf/2501.05046v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18173v3","updated":"2025-01-09T05:08:53Z","published":"2024-12-24T05:18:12Z","title":"Optimal error estimates of the stochastic parabolic optimal control\n problem with integral state constraint","summary":" In this paper, the optimal strong error estimates for stochastic parabolic\noptimal control problem with additive noise and integral state constraint are\nderived based on time-implicit and finite element discretization. 
The\ncontinuous and discrete first-order optimality conditions are deduced by\nconstructing the Lagrange functional, which contains forward-backward\nstochastic parabolic equations and a variational equation. The fully discrete\nversion of forward-backward stochastic parabolic equations is introduced as an\nauxiliary problem and the optimal strong convergence orders are estimated,\nwhich further allows the optimal a priori error estimates for control, state,\nadjoint state and multiplier to be derived. Then, a simple and yet efficient\ngradient projection algorithm is proposed to solve stochastic parabolic control\nproblem and its convergence rate is proved. Numerical experiments are carried\nout to illustrate the theoretical findings.\n","authors":["Qiming Wang","Wanfang Shen","Wenbin Liu"],"pdf_url":"https://arxiv.org/pdf/2412.18173v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04972v1","updated":"2025-01-09T05:03:42Z","published":"2025-01-09T05:03:42Z","title":"Algebraic characterization of equivalence between optimization\n algorithms","summary":" When are two algorithms the same? How can we be sure a recently proposed\nalgorithm is novel, and not a minor twist on an existing method? In this paper,\nwe present a framework for reasoning about equivalence between a broad class of\niterative algorithms, with a focus on algorithms designed for convex\noptimization. We propose several notions of what it means for two algorithms to\nbe equivalent, and provide computationally tractable means to detect\nequivalence. Our main definition, oracle equivalence, states that two\nalgorithms are equivalent if they result in the same sequence of calls to the\nfunction oracles (for suitable initialization). Borrowing from control theory,\nwe use state-space realizations to represent algorithms and characterize\nalgorithm equivalence via transfer functions. 
Our framework can also identify\nand characterize equivalence between algorithms that use different oracles that\nare related via a linear fractional transformation. Prominent examples include\nlinear transformations and function conjugation.\n","authors":["Laurent Lessard","Madeleine Udell"],"pdf_url":"https://arxiv.org/pdf/2501.04972v1.pdf","comment":"This paper generalizes and provides new analysis and examples\n compared to arxiv:2105.04684"},{"id":"http://arxiv.org/abs/2405.19546v4","updated":"2025-01-09T05:01:32Z","published":"2024-05-29T22:19:39Z","title":"Convex Optimization of Initial Perturbations toward Quantitative Weather\n Control","summary":" This study proposes introducing convex optimization to find initial\nperturbations of atmospheric states to realize specified changes in subsequent\nweather. In the proposed method, we formulate and solve an inverse problem to\nfind effective perturbations in atmospheric variables so that controlled\nvariables satisfy specified changes at a specified time. The proposed method\nfirst constructs a sensitivity matrix of controlled variables, such as\naccumulated precipitation, to the initial atmospheric variables, such as\ntemperature and humidity, through sensitivity analysis using a numerical\nweather prediction (NWP) model. Then a convex optimization problem is\nformulated to achieve various control specifications involving not only\nquadratic functions but also absolute values and maximum values of the\ncontrolled variables and initial atmospheric variables in the cost function and\nconstraints. The proposed method was validated through a benchmark warm bubble\nexperiment using the NWP model. 
The experiments showed that the identified\nperturbations successfully realized specified spatial distributions of\naccumulated precipitation.\n","authors":["Toshiyuki Ohtsuka","Atsushi Okazaki","Masaki Ogura","Shunji Kotsuki"],"pdf_url":"https://arxiv.org/pdf/2405.19546v4.pdf","comment":"shortened to improve conciseness; some figures added to Supplements\n for discussion about physical processes; license changed to CC BY 4.0;\n revised to improve readability; some figures in Appendix omitted to improve\n conciseness"},{"id":"http://arxiv.org/abs/2411.01899v2","updated":"2025-01-09T03:57:46Z","published":"2024-11-04T09:06:15Z","title":"New Lagrangian dual algorithms for solving the continuous nonlinear\n resource allocation problem","summary":" The continuous nonlinear resource allocation problem (CONRAP) has broad\napplications in economics, engineering, production and inventory management,\nand often serves as a subproblem in complex programming. Without relying on\nmonotonicity assumptions for the objective and constraint functions, we propose\ntwo Lagrangian dual algorithms for solving two types of CONRAP. Both algorithms\ndetermine an update strategy for the Lagrange multiplier, utilizing the values\nof the objective and constraint functions at the current and previous\niterations. This strategy accelerates the process of finding dual optimal\nsolutions. Subsequently, leveraging the problem's convexity, the primal optimal\nsolution is either directly identified or derived by solving a one-dimensional\nlinear equation. We also prove that both algorithms converge to optimal\nsolutions within a finite number of iterations. Numerical experiments on six\ntypes of practical test problems illustrate the superior computational\nefficiency of the proposed algorithms. For test problems with a general\ninequality constraint, the first algorithm achieves a CPU time reduction\nexceeding an order of magnitude compared to solvers such as Gurobi and CVX. 
For\ntest problems with a linear equality constraint, the second algorithm\nconsistently outperforms four existing algorithms, delivering an improvement of\nover two orders of magnitude in computational efficiency.\n","authors":["Kaixiang Hu","Caixia Kou","Jianhua Yuan"],"pdf_url":"https://arxiv.org/pdf/2411.01899v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04936v1","updated":"2025-01-09T03:00:09Z","published":"2025-01-09T03:00:09Z","title":"Continuous and Discrete Systems for Quasi Variational Inequalities with\n Application to Game Theory","summary":" A new class of projected dynamical systems of third order is investigated for\nquasi (parametric) variational inequalities in which the convex set in the\nclassical variational inequality also depends upon the solution explicitly or\nimplicitly. We study the stability of a continuous method of a gradient type.\nSome iterative implicit and explicit schemes are suggested as counterparts of\nthe continuous case by inertial proximal methods. The convergence analysis of\nthese proposed methods is established under sufficient mild conditions.\nMoreover, some applications dealing with the generalized Nash equilibrium\nproblems are presented.\n","authors":["Oday Hazaimah"],"pdf_url":"https://arxiv.org/pdf/2501.04936v1.pdf","comment":"17 pages. arXiv admin note: text overlap with arXiv:2406.19345"},{"id":"http://arxiv.org/abs/2501.04160v2","updated":"2025-01-09T02:53:56Z","published":"2025-01-07T22:19:06Z","title":"Collaborative Spacecraft Servicing under Partial Feedback using\n Lyapunov-based Deep Neural Networks","summary":" Multi-agent systems are increasingly applied in space missions, including\ndistributed space systems, resilient constellations, and autonomous rendezvous\nand docking operations. A critical emerging application is collaborative\nspacecraft servicing, which encompasses on-orbit maintenance, space debris\nremoval, and swarm-based satellite repositioning. 
These missions involve\nservicing spacecraft interacting with malfunctioning or defunct spacecraft\nunder challenging conditions, such as limited state information, measurement\ninaccuracies, and erratic target behaviors. Existing approaches often rely on\nassumptions of full state knowledge or single-integrator dynamics, which are\nimpractical for real-world applications involving second-order spacecraft\ndynamics. This work addresses these challenges by developing a distributed\nstate estimation and tracking framework that requires only relative position\nmeasurements and operates under partial state information. A novel\n$\\rho$-filter is introduced to reconstruct unknown states using locally\navailable information, and a Lyapunov-based deep neural network adaptive\ncontroller is developed that adaptively compensates for uncertainties stemming\nfrom unknown spacecraft dynamics. To ensure the collaborative spacecraft\nregulation problem is well-posed, a trackability condition is defined. A\nLyapunov-based stability analysis is provided to ensure exponential convergence\nof errors in state estimation and spacecraft regulation to a neighborhood of\nthe origin under the trackability condition. The developed method eliminates\nthe need for expensive velocity sensors or extensive pre-training, offering a\npractical and robust solution for spacecraft servicing in complex, dynamic\nenvironments.\n","authors":["Cristian F. Nino","Omkar Sudhir Patil","Christopher D. Petersen","Sean Phillips","Warren E. 
Dixon"],"pdf_url":"https://arxiv.org/pdf/2501.04160v2.pdf","comment":"24 pages, 4 Figures, Journal"},{"id":"http://arxiv.org/abs/2501.04889v1","updated":"2025-01-09T00:05:31Z","published":"2025-01-09T00:05:31Z","title":"Projected proximal gradient trust-region algorithm for nonsmooth\n optimization","summary":" We consider trust-region methods for solving optimization problems where the\nobjective is the sum of a smooth, nonconvex function and a nonsmooth, convex\nregularizer. We extend the global convergence theory of such methods to include\nworst-case complexity bounds in the case of unbounded model Hessian growth, and\nintroduce a new, simple nonsmooth trust-region subproblem solver based on\ncombining several iterations of proximal gradient descent with a single\nprojection into the trust region, which meets the sufficient descent\nrequirements for algorithm convergence and has promising numerical results.\n","authors":["Minh N. Dao","Hung M. Phan","Lindon Roberts"],"pdf_url":"https://arxiv.org/pdf/2501.04889v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05623v1","updated":"2025-01-09T23:44:42Z","published":"2025-01-09T23:44:42Z","title":"A Quadratically-Constrained Convex Approximation for the AC Optimal\n Power Flow","summary":" We introduce a quadratically-constrained approximation (QCAC) of the AC\noptimal power flow (AC-OPF) problem. Unlike existing approximations like the\nDC-OPF, our model does not rely on typical assumptions such as high\nreactance-to-resistance ratio, near-nominal voltage magnitudes, or small angle\ndifferences, and preserves the structural sparsity of the original AC power\nflow equations, making it suitable for decentralized power systems optimization\nproblems. To achieve this, we reformulate the AC-OPF problem as a quadratically\nconstrained quadratic program. 
The nonconvex terms are expressed as differences\nof convex functions, which are then convexified around a base point derived\nfrom a warm start of the nodal voltages. If this linearization results in a\nnon-empty constraint set, the convexified constraints form an inner convex\napproximation. Our experimental results, based on Power Grid Library instances\nof up to 30,000 buses, demonstrate the effectiveness of the QCAC approximation\nwith respect to other well-documented conic relaxations and a linear\napproximation. We further showcase its potential advantages over the\nwell-documented second-order conic relaxation of the power flow equations in\ntwo proof-of-concept case studies: optimal reactive power dispatch in\ntransmission networks and PV hosting capacity in distribution grids.\n","authors":["Gonzalo E. Constante-Flores","Can Li"],"pdf_url":"https://arxiv.org/pdf/2501.05623v1.pdf","comment":"10 pages, 5 figures, 4 tables"},{"id":"http://arxiv.org/abs/2501.05619v1","updated":"2025-01-09T23:40:57Z","published":"2025-01-09T23:40:57Z","title":"Comparative Analysis of Two-Stage Distributionally Robust Optimization\n over 1-Wasserstein and 2-Wasserstein Balls","summary":" This paper investigates advantages of using 2-Wasserstein ambiguity sets over\n1-Wasserstein sets in two-stage distributionally robust optimization with\nright-hand side uncertainty. We examine the worst-case distributions within 1-\nand 2-Wasserstein balls under both unrestricted and nonnegative orthant\nsupports, highlighting a pathological behavior arising in 1-Wasserstein balls.\nClosed-form solutions for a single-scenario newsvendor problem illustrate that\n2-Wasserstein balls enable more informed decisions. 
Additionally, a\npenalty-based dual interpretation suggests that 2-Wasserstein balls may\noutperform 1-Wasserstein balls across a broader range of Wasserstein radii,\neven with general support sets.\n","authors":["Geunyeong Byeon"],"pdf_url":"https://arxiv.org/pdf/2501.05619v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.06881v3","updated":"2025-01-09T21:36:01Z","published":"2023-04-14T01:16:53Z","title":"Designing a Framework for Solving Multiobjective Simulation Optimization\n Problems","summary":" Multiobjective simulation optimization (MOSO) problems are optimization\nproblems with multiple conflicting objectives, where evaluation of at least one\nof the objectives depends on a black-box numerical code or real-world\nexperiment, which we refer to as a simulation. While an extensive body of\nresearch is dedicated to developing new algorithms and methods for solving\nthese and related problems, it is challenging and time consuming to integrate\nthese techniques into real world production-ready solvers. This is partly due\nto the diversity and complexity of modern state-of-the-art MOSO algorithms and\nmethods and partly due to the complexity and specificity of many real-world\nproblems and their corresponding computing environments. The complexity of this\nproblem is only compounded when introducing potentially complex and/or\ndomain-specific surrogate modeling techniques, problem formulations, design\nspaces, and data acquisition functions. This paper carefully surveys the\ncurrent state-of-the-art in MOSO algorithms, techniques, and solvers; as well\nas problem types and computational environments where MOSO is commonly applied.\nWe then present several key challenges in the design of a Parallel\nMultiobjective Simulation Optimization framework (ParMOO) and how they have\nbeen addressed. 
Finally, we provide two case studies demonstrating how\ncustomized ParMOO solvers can be quickly built and deployed to solve real-world\nMOSO problems.\n","authors":["Tyler H. Chang","Stefan M. Wild"],"pdf_url":"https://arxiv.org/pdf/2304.06881v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.04406v2","updated":"2025-01-09T21:26:04Z","published":"2024-02-06T21:05:37Z","title":"Regularized MIP Model for Integrating Energy Storage Systems and its\n Application for Solving a Trilevel Interdiction Problem","summary":" Incorporating energy storage systems (ESS) into power systems has been\nstudied in many recent works, where binary variables are often introduced to\nmodel the complementary nature of battery charging and discharging. A\nconventional approach for these ESS optimization problems is to relax binary\nvariables and convert the problem into a linear program. However, such linear\nprogramming relaxation models can yield unrealistic fractional solutions, such\nas simultaneous charging and discharging. In this paper, we develop a\nregularized Mixed-Integer Programming (MIP) model for the ESS optimal power\nflow (OPF) problem. We prove that under mild conditions, the proposed\nregularized model admits a zero integrality gap with its linear programming\nrelaxation; hence, it can be solved efficiently. By studying the properties of\nthe regularized MIP model, we show that its optimal solution is also\nnear-optimal to the original ESS OPF problem, thereby providing a valid and\ntight upper bound for the ESS OPF problem. The use of the regularized MIP model\nallows us to solve a trilevel min-max-min network contingency problem which is\notherwise intractable to solve.\n","authors":["Dahye Han","Nan Jiang","Santanu S. 
Dey","Weijun Xie"],"pdf_url":"https://arxiv.org/pdf/2402.04406v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05582v1","updated":"2025-01-09T21:19:10Z","published":"2025-01-09T21:19:10Z","title":"Equivariant Perturbation in Gomory and Johnson's Infinite Group Problem.\n IV. The General Unimodular Two-Dimensional Case","summary":" We study an abstract setting for cutting planes for integer programming\ncalled the infinite group problem. In this abstraction, cutting planes are\ncomputed via cut generating functions that act on the simplex tableau. In this\nfunction space, cut generating functions are classified as minimal, extreme,\nand facets as a proxy for understanding the strength or potential importance of\nthese functions. Prior work developed algorithms for testing minimality,\nextremality, and facetness for cut generating functions applied to 1-row\ntableau and to some 2-row tableau in a restricted setting. We complement and\ngeneralize this work by giving an algorithm for testing the extremality of a\nlarge class of minimal valid functions for the two-dimensional infinite group\nproblem. Along the way, we develop results of independent interest on\nfunctional equations and infinite systems of linear equations.\n","authors":["Robert Hildebrand","Matthias Köppe","Luze Xu"],"pdf_url":"https://arxiv.org/pdf/2501.05582v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05548v1","updated":"2025-01-09T19:38:27Z","published":"2025-01-09T19:38:27Z","title":"Switched Optimal Control with Dwell Time Constraints","summary":" This paper presents an embedding-based approach for solving switched optimal\ncontrol problems (SOCPs) with dwell time constraints. First, an embedded\noptimal control problem (EOCP) is defined by replacing the discrete switching\nsignal with a continuous embedded variable that can take intermediate values\nbetween the discrete modes. 
While embedding enables solutions of SOCPs via\nconventional techniques, optimal solutions of EOCPs often involve nonexistent\nmodes and thus may not be feasible for the SOCP. In the modified EOCP (MEOCP),\na concave function is added to the cost function to enforce a bang-bang\nsolution in the embedded variable, which results in feasible solutions for the\nSOCP. However, the MEOCP cannot guarantee the satisfaction of dwell-time\nconstraints.\n In this paper, a MEOCP is combined with a filter layer to remove switching\ntimes that violate the dwell time constraint. Insertion gradients are used to\nminimize the effect of the filter on the optimal cost.\n","authors":["Masoud S. Sakha","Rushikesh Kamalapurkar"],"pdf_url":"https://arxiv.org/pdf/2501.05548v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.15745v7","updated":"2025-01-09T19:35:10Z","published":"2024-01-28T20:12:08Z","title":"The computation of approximate feedback Stackelberg equilibria in\n multi-player nonlinear constrained dynamic games","summary":" Solving feedback Stackelberg games with nonlinear dynamics and coupled\nconstraints, a common scenario in practice, presents significant challenges.\nThis work introduces an efficient method for computing approximate local\nfeedback Stackelberg equilibria in multi-player general-sum dynamic games, with\ncontinuous state and action spaces. Different from existing (approximate)\ndynamic programming solutions that are primarily designed for unconstrained\nproblems, our approach involves reformulating a feedback Stackelberg dynamic\ngame into a sequence of nested optimization problems, enabling the derivation\nof Karush-Kuhn-Tucker (KKT) conditions and the establishment of a second-order\nsufficient condition for local feedback Stackelberg equilibria. We propose a\nNewton-style primal-dual interior point method for solving constrained linear\nquadratic (LQ) feedback Stackelberg games, offering provable convergence\nguarantees. 
Our method is further extended to compute local feedback\nStackelberg equilibria for more general nonlinear games by iteratively\napproximating them using LQ games, ensuring that their KKT conditions are\nlocally aligned with those of the original nonlinear games. We prove the\nexponential convergence of our algorithm in constrained nonlinear games. In a\nfeedback Stackelberg game with nonlinear dynamics and (nonconvex) coupled costs\nand constraints, our experimental results reveal the algorithm's ability to\nhandle infeasible initial conditions and achieve exponential convergence\ntowards an approximate local feedback Stackelberg equilibrium.\n","authors":["Jingqi Li","Somayeh Sojoudi","Claire Tomlin","David Fridovich-Keil"],"pdf_url":"https://arxiv.org/pdf/2401.15745v7.pdf","comment":"This manuscript has been accepted by SIAM Journal on Optimization. We\n fix few typos in this arxiv version"},{"id":"http://arxiv.org/abs/1912.07356v5","updated":"2025-01-09T12:28:40Z","published":"2019-12-11T16:38:44Z","title":"The Integrated Vehicle and Pollster Routing Problem","summary":" The National Statistics Bureau of Ecuador carries out monthly polls to\nmonitor the evolution of the Consumer Price Index, a metric measuring consumer\nprices of essential commodities. These surveys are administered across a\ndesignated set of stores, with a fleet of vehicles transporting pollsters from\nthe bureau headquarters to the chosen locations. Moreover, pollsters move\nbetween stores using pedestrian paths or using a vehicle to shorten the travel\ntime. 
This paper introduces the Integrated Vehicle and Pollster Routing Problem\nand presents an integer programming model to effectively schedule pollster\nvisits to selected stores while optimizing the routing of the vehicle fleet.\nResults on the computational complexity, a three-phase algorithm, and\ncomputational experience based on real-world instances are provided.\n","authors":["Sandra Gutiérrez","Andrés Miniguano-Trujillo","Diego Recalde","Luis M. Torres","Ramiro Torres"],"pdf_url":"https://arxiv.org/pdf/1912.07356v5.pdf","comment":"28 pages, 5 figures, 8 tables"},{"id":"http://arxiv.org/abs/2205.08435v4","updated":"2025-01-09T18:56:11Z","published":"2022-05-17T15:25:23Z","title":"Cyber Risk Assessment for Capital Management","summary":" This paper introduces a two-pillar cyber risk management framework to address\nthe pervasive challenges in managing cyber risk. The first pillar, cyber risk\nassessment, combines insurance frequency-severity models with cybersecurity\ncascade models to capture the unique nature of cyber risk. The second pillar,\ncyber capital management, facilitates informed allocation of capital for a\nbalanced cyber risk management strategy, including cybersecurity investments,\ninsurance coverage, and reserves. A case study, based on historical cyber\nincident data and realistic assumptions, demonstrates the necessity of\ncomprehensive cost-benefit analysis for budget-constrained companies with\ncompeting objectives in cyber risk management. In addition, sensitivity\nanalysis highlights the dependence of the optimal strategy on factors such as\nthe price of cybersecurity controls and their effectiveness. 
The framework's\nimplementation across a diverse range of companies yields general insights on\ncyber risk management.\n","authors":["Wing Fung Chong","Runhuan Feng","Hins Hu","Linfeng Zhang"],"pdf_url":"https://arxiv.org/pdf/2205.08435v4.pdf","comment":"This paper was first presented on July 5, 2021, at the 24th\n International Congress on Insurance: Mathematics and Economics"},{"id":"http://arxiv.org/abs/2501.06266v1","updated":"2025-01-09T18:25:41Z","published":"2025-01-09T18:25:41Z","title":"Linear Algebraic Truncation Algorithm with A Posteriori Error Bounds for\n Computing Markov Chain Equilibrium Gradients","summary":" The numerical computation of equilibrium reward gradients for Markov chains\nappears in many applications for example within the policy improvement step\narising in connection with average reward stochastic dynamic programming. When\nthe state space is large or infinite, one will typically need to truncate the\nstate space in order to arrive at a numerically tractable formulation. In this\npaper, we derive the first computable a posteriori error bounds for equilibrium\nreward gradients that account for the error induced by the truncation. Our\napproach uses regeneration to express equilibrium quantities in terms of the\nexpectations of cumulative rewards over regenerative cycles. Lyapunov functions\nare then used to bound the contributions to these cumulative rewards and their\ngradients from path excursions that take the chain outside the truncation set.\nOur numerical results indicate that our approach can provide highly accurate\nbounds with truncation sets of moderate size. We further extend our approach to\nMarkov jump processes.\n","authors":["Saied Mahdian","Peter W. 
Glynn"],"pdf_url":"https://arxiv.org/pdf/2501.06266v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.09173v1","updated":"2025-01-09T11:06:35Z","published":"2025-01-09T11:06:35Z","title":"Formalising the intentional stance 2: a coinductive approach","summary":" Given a stochastic process with inputs and outputs, how might its behaviour\nbe related to pursuit of a goal? We model this using 'transducers', objects\nthat capture only the external behaviour of a system and not its internal\nstate. A companion paper summarises our results for cognitive scientists; the\ncurrent paper gives formal definitions and proofs.\n To formalise the concept of a system that behaves as if it were pursuing a\ngoal, we consider what happens when a transducer (a 'policy') is coupled to\nanother transducer that comes equipped with a success condition (a\n'teleo-environment'). An optimal policy is identified with a transducer that\nbehaves as if it were perfectly rational in the pursuit of a goal; our\nframework also allows us to model constrained rationality.\n Optimal policies obey a version of Bellman's principle: a policy that's\noptimal in one time step will again be optimal in the next time step, but with\nrespect to a different teleo-environment (obtained from the original one by a\nmodified version of Bayesian filtering). This property sometimes also applies\nto the bounded-rational case; we give a sufficient condition.\n A policy is deterministic if and only if there exists a teleo-environment for\nwhich it is uniquely optimal among the set of all policies; we relate this to\nclassical representation theorems from decision theory. This result need not\nhold in the bounded-rational case; we give an example related to the\nabsent-minded driver problem. 
The formalism is defined using coinduction,\nfollowing the style proposed by Czajka.\n","authors":["Simon McGregor"," timorl","Nathaniel Virgo"],"pdf_url":"https://arxiv.org/pdf/2501.09173v1.pdf","comment":"This is the companion paper to \"Formalising the intentional stance 1:\n attributing goals and beliefs to stochastic processes\" (uploaded as version 2\n of arXiv:2405.16490). The other paper is an overview aimed at cognitive\n scientists while this paper gives full mathematical details. 50 pages, no\n figures"}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2501.05452v1","updated":"2025-01-09T18:59:58Z","published":"2025-01-09T18:59:58Z","title":"ReFocus: Visual Editing as a Chain of Thought for Structured Image\n Understanding","summary":" Structured image understanding, such as interpreting tables and charts,\nrequires strategically refocusing across various structures and texts within an\nimage, forming a reasoning sequence to arrive at the final answer. However,\ncurrent multimodal large language models (LLMs) lack this multihop selective\nattention capability. In this work, we introduce ReFocus, a simple yet\neffective framework that equips multimodal LLMs with the ability to generate\n\"visual thoughts\" by performing visual editing on the input image through code,\nshifting and refining their visual focuses. Specifically, ReFocus enables\nmultimodal LLMs to generate Python codes to call tools and modify the input\nimage, sequentially drawing boxes, highlighting sections, and masking out\nareas, thereby enhancing the visual reasoning process. We experiment upon a\nwide range of structured image understanding tasks involving tables and charts.\nReFocus largely improves performance on all tasks over GPT-4o without visual\nediting, yielding an average gain of 11.0% on table tasks and 6.8% on chart\ntasks. 
We present an in-depth analysis of the effects of different visual\nedits, and reasons why ReFocus can improve the performance without introducing\nadditional information. Further, we collect a 14k training set using ReFocus,\nand prove that such visual chain-of-thought with intermediate information\noffers a better supervision than standard VQA data, reaching a 8.0% average\ngain over the same model trained with QA pairs and 2.6% over CoT.\n","authors":["Xingyu Fu","Minqian Liu","Zhengyuan Yang","John Corring","Yijuan Lu","Jianwei Yang","Dan Roth","Dinei Florencio","Cha Zhang"],"pdf_url":"https://arxiv.org/pdf/2501.05452v1.pdf","comment":"Project link: https://zeyofu.github.io/ReFocus/"},{"id":"http://arxiv.org/abs/2501.05453v1","updated":"2025-01-09T18:59:58Z","published":"2025-01-09T18:59:58Z","title":"An Empirical Study of Autoregressive Pre-training from Videos","summary":" We empirically study autoregressive pre-training from videos. To perform our\nstudy, we construct a series of autoregressive video models, called Toto. We\ntreat videos as sequences of visual tokens and train transformer models to\nautoregressively predict future tokens. Our models are pre-trained on a diverse\ndataset of videos and images comprising over 1 trillion visual tokens. We\nexplore different architectural, training, and inference design choices. We\nevaluate the learned visual representations on a range of downstream tasks\nincluding image recognition, video classification, object tracking, and\nrobotics. Our results demonstrate that, despite minimal inductive biases,\nautoregressive pre-training leads to competitive performance across all\nbenchmarks. 
Finally, we find that scaling our video models results in similar\nscaling curves to those seen in language models, albeit with a different rate.\nMore details at https://brjathu.github.io/toto/\n","authors":["Jathushan Rajasegaran","Ilija Radosavovic","Rahul Ravishankar","Yossi Gandelsman","Christoph Feichtenhofer","Jitendra Malik"],"pdf_url":"https://arxiv.org/pdf/2501.05453v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05450v1","updated":"2025-01-09T18:59:56Z","published":"2025-01-09T18:59:56Z","title":"Decentralized Diffusion Models","summary":" Large-scale AI model training divides work across thousands of GPUs, then\nsynchronizes gradients across them at each step. This incurs a significant\nnetwork burden that only centralized, monolithic clusters can support, driving\nup infrastructure costs and straining power systems. We propose Decentralized\nDiffusion Models, a scalable framework for distributing diffusion model\ntraining across independent clusters or datacenters by eliminating the\ndependence on a centralized, high-bandwidth networking fabric. Our method\ntrains a set of expert diffusion models over partitions of the dataset, each in\nfull isolation from one another. At inference time, the experts ensemble\nthrough a lightweight router. We show that the ensemble collectively optimizes\nthe same objective as a single model trained over the whole dataset. This means\nwe can divide the training burden among a number of \"compute islands,\" lowering\ninfrastructure costs and improving resilience to localized GPU failures.\nDecentralized diffusion models empower researchers to take advantage of\nsmaller, more cost-effective and more readily available compute like on-demand\nGPU nodes rather than central integrated systems. We conduct extensive\nexperiments on ImageNet and LAION Aesthetics, showing that decentralized\ndiffusion models FLOP-for-FLOP outperform standard diffusion models. 
We finally\nscale our approach to 24 billion parameters, demonstrating that high-quality\ndiffusion models can now be trained with just eight individual GPU nodes in\nless than a week.\n","authors":["David McAllister","Matthew Tancik","Jiaming Song","Angjoo Kanazawa"],"pdf_url":"https://arxiv.org/pdf/2501.05450v1.pdf","comment":"Project webpage: https://decentralizeddiffusion.github.io/"},{"id":"http://arxiv.org/abs/2501.05449v1","updated":"2025-01-09T18:59:35Z","published":"2025-01-09T18:59:35Z","title":"Explainable AI-Enhanced Deep Learning for Pumpkin Leaf Disease\n Detection: A Comparative Analysis of CNN Architectures","summary":" Pumpkin leaf diseases are significant threats to agricultural productivity,\nrequiring a timely and precise diagnosis for effective management. Traditional\nidentification methods are laborious and susceptible to human error,\nemphasizing the necessity for automated solutions. This study employs on the\n\"Pumpkin Leaf Disease Dataset\", that comprises of 2000 high-resolution images\nseparated into five categories. Downy mildew, powdery mildew, mosaic disease,\nbacterial leaf spot, and healthy leaves. The dataset was rigorously assembled\nfrom several agricultural fields to ensure a strong representation for model\ntraining. We explored many proficient deep learning architectures, including\nDenseNet201, DenseNet121, DenseNet169, Xception, ResNet50, ResNet101 and\nInceptionResNetV2, and observed that ResNet50 performed most effectively, with\nan accuracy of 90.5% and comparable precision, recall, and F1-Score. We used\nExplainable AI (XAI) approaches like Grad-CAM, Grad-CAM++, Score-CAM, and\nLayer-CAM to provide meaningful representations of model decision-making\nprocesses, which improved understanding and trust in automated disease\ndiagnostics. These findings demonstrate ResNet50's potential to revolutionize\npumpkin leaf disease detection, allowing for earlier and more accurate\ntreatments.\n","authors":["Md. 
Arafat Alam Khandaker","Ziyan Shirin Raha","Shifat Islam","Tashreef Muhammad"],"pdf_url":"https://arxiv.org/pdf/2501.05449v1.pdf","comment":"Accepted in 2024 27th International Conference on Computer and\n Information Technology (ICCIT)"},{"id":"http://arxiv.org/abs/2501.05446v1","updated":"2025-01-09T18:58:30Z","published":"2025-01-09T18:58:30Z","title":"Relative Pose Estimation through Affine Corrections of Monocular Depth\n Priors","summary":" Monocular depth estimation (MDE) models have undergone significant\nadvancements over recent years. Many MDE models aim to predict affine-invariant\nrelative depth from monocular images, while recent developments in large-scale\ntraining and vision foundation models enable reasonable estimation of metric\n(absolute) depth. However, effectively leveraging these predictions for\ngeometric vision tasks, in particular relative pose estimation, remains\nrelatively under explored. While depths provide rich constraints for cross-view\nimage alignment, the intrinsic noise and ambiguity from the monocular depth\npriors present practical challenges to improving upon classic keypoint-based\nsolutions. In this paper, we develop three solvers for relative pose estimation\nthat explicitly account for independent affine (scale and shift) ambiguities,\ncovering both calibrated and uncalibrated conditions. We further propose a\nhybrid estimation pipeline that combines our proposed solvers with classic\npoint-based solvers and epipolar constraints. We find that the affine\ncorrection modeling is beneficial to not only the relative depth priors but\nalso, surprisingly, the ``metric\" ones. Results across multiple datasets\ndemonstrate large improvements of our approach over classic keypoint-based\nbaselines and PnP-based solutions, under both calibrated and uncalibrated\nsetups. 
We also show that our method improves consistently with different\nfeature matchers and MDE models, and can further benefit from very recent\nadvances on both modules. Code is available at\nhttps://github.com/MarkYu98/madpose.\n","authors":["Yifan Yu","Shaohui Liu","Rémi Pautrat","Marc Pollefeys","Viktor Larsson"],"pdf_url":"https://arxiv.org/pdf/2501.05446v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05445v1","updated":"2025-01-09T18:56:05Z","published":"2025-01-09T18:56:05Z","title":"Consistent Flow Distillation for Text-to-3D Generation","summary":" Score Distillation Sampling (SDS) has made significant strides in distilling\nimage-generative models for 3D generation. However, its\nmaximum-likelihood-seeking behavior often leads to degraded visual quality and\ndiversity, limiting its effectiveness in 3D applications. In this work, we\npropose Consistent Flow Distillation (CFD), which addresses these limitations.\nWe begin by leveraging the gradient of the diffusion ODE or SDE sampling\nprocess to guide the 3D generation. From the gradient-based sampling\nperspective, we find that the consistency of 2D image flows across different\nviewpoints is important for high-quality 3D generation. To achieve this, we\nintroduce multi-view consistent Gaussian noise on the 3D object, which can be\nrendered from various viewpoints to compute the flow gradient. Our experiments\ndemonstrate that CFD, through consistent flows, significantly outperforms\nprevious methods in text-to-3D generation.\n","authors":["Runjie Yan","Yinbo Chen","Xiaolong Wang"],"pdf_url":"https://arxiv.org/pdf/2501.05445v1.pdf","comment":"Project page: https://runjie-yan.github.io/cfd/"},{"id":"http://arxiv.org/abs/2501.05444v1","updated":"2025-01-09T18:55:52Z","published":"2025-01-09T18:55:52Z","title":"Can MLLMs Reason in Multimodality? 
EMMA: An Enhanced MultiModal\n ReAsoning Benchmark","summary":" The ability to organically reason over and with both text and images is a\npillar of human intelligence, yet the ability of Multimodal Large Language\nModels (MLLMs) to perform such multimodal reasoning remains under-explored.\nExisting benchmarks often emphasize text-dominant reasoning or rely on shallow\nvisual cues, failing to adequately assess integrated visual and textual\nreasoning. We introduce EMMA (Enhanced MultiModal reAsoning), a benchmark\ntargeting organic multimodal reasoning across mathematics, physics, chemistry,\nand coding. EMMA tasks demand advanced cross-modal reasoning that cannot be\naddressed by reasoning independently in each modality, offering an enhanced\ntest suite for MLLMs' reasoning capabilities. Our evaluation of\nstate-of-the-art MLLMs on EMMA reveals significant limitations in handling\ncomplex multimodal and multi-step reasoning tasks, even with advanced\ntechniques like Chain-of-Thought prompting and test-time compute scaling\nunderperforming. These findings underscore the need for improved multimodal\narchitectures and training paradigms to close the gap between human and model\nreasoning in multimodality.\n","authors":["Yunzhuo Hao","Jiawei Gu","Huichen Will Wang","Linjie Li","Zhengyuan Yang","Lijuan Wang","Yu Cheng"],"pdf_url":"https://arxiv.org/pdf/2501.05444v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05442v1","updated":"2025-01-09T18:55:15Z","published":"2025-01-09T18:55:15Z","title":"Progressive Growing of Video Tokenizers for Highly Compressed Latent\n Spaces","summary":" Video tokenizers are essential for latent video diffusion models, converting\nraw video data into spatiotemporally compressed latent spaces for efficient\ntraining. However, extending state-of-the-art video tokenizers to achieve a\ntemporal compression ratio beyond 4x without increasing channel capacity poses\nsignificant challenges. 
In this work, we propose an alternative approach to\nenhance temporal compression. We find that the reconstruction quality of\ntemporally subsampled videos from a low-compression encoder surpasses that of\nhigh-compression encoders applied to original videos. This indicates that\nhigh-compression models can leverage representations from lower-compression\nmodels. Building on this insight, we develop a bootstrapped\nhigh-temporal-compression model that progressively trains high-compression\nblocks atop well-trained lower-compression models. Our method includes a\ncross-level feature-mixing module to retain information from the pretrained\nlow-compression model and guide higher-compression blocks to capture the\nremaining details from the full video sequence. Evaluation of video benchmarks\nshows that our method significantly improves reconstruction quality while\nincreasing temporal compression compared to direct extensions of existing video\ntokenizers. Furthermore, the resulting compact latent space effectively trains\na video diffusion model for high-quality video generation with a reduced token\nbudget.\n","authors":["Aniruddha Mahapatra","Long Mai","Yitian Zhang","David Bourgin","Feng Liu"],"pdf_url":"https://arxiv.org/pdf/2501.05442v1.pdf","comment":"Project website:\n https://progressive-video-tokenizer.github.io/Pro-MAG/"},{"id":"http://arxiv.org/abs/2501.05441v1","updated":"2025-01-09T18:53:06Z","published":"2025-01-09T18:53:06Z","title":"The GAN is dead; long live the GAN! A Modern GAN Baseline","summary":" There is a widely-spread claim that GANs are difficult to train, and GAN\narchitectures in the literature are littered with empirical tricks. We provide\nevidence against this claim and build a modern GAN baseline in a more\nprincipled manner. First, we derive a well-behaved regularized relativistic GAN\nloss that addresses issues of mode dropping and non-convergence that were\npreviously tackled via a bag of ad-hoc tricks. 
We analyze our loss\nmathematically and prove that it admits local convergence guarantees, unlike\nmost existing relativistic losses. Second, our new loss allows us to discard\nall ad-hoc tricks and replace outdated backbones used in common GANs with\nmodern architectures. Using StyleGAN2 as an example, we present a roadmap of\nsimplification and modernization that results in a new minimalist baseline --\nR3GAN. Despite being simple, our approach surpasses StyleGAN2 on FFHQ,\nImageNet, CIFAR, and Stacked MNIST datasets, and compares favorably against\nstate-of-the-art GANs and diffusion models.\n","authors":["Yiwen Huang","Aaron Gokaslan","Volodymyr Kuleshov","James Tompkin"],"pdf_url":"https://arxiv.org/pdf/2501.05441v1.pdf","comment":"Accepted to NeurIPS 2024. Code available at\n https://github.com/brownvc/R3GAN/"},{"id":"http://arxiv.org/abs/2501.05436v1","updated":"2025-01-09T18:48:55Z","published":"2025-01-09T18:48:55Z","title":"$DPF^*$: improved Depth Potential Function for scale-invariant sulcal\n depth estimation","summary":" The shape of human brain is complex and highly variable, with interactions\nbetween brain size, cortical folding, and age well-documented in the\nliterature. However, few studies have explored how global brain size influences\ngeometric features of the cortical surface derived from anatomical MRI. In this\nwork, we focus on sulcal depth, an imaging phenotype that has gained\nsignificant attention in both basic research and clinical applications. 
We make\nkey contributions to the field by: 1) providing the first quantitative analysis\nof how brain size affects sulcal depth measurements; 2) introducing a novel,\nscale-invariant method for sulcal depth estimation based on an original\nformalization of the problem; 3) presenting a validation framework and sharing\nour code and benchmark data with the community; and 4) demonstrating the\nbiological relevance of our new sulcal depth measure using a large sample of\n1,987 subjects spanning the developmental period from 26 weeks post-conception\nto adulthood.\n","authors":["Maxime Dieudonné","Guillaume Auzias","Julien Lefèvre"],"pdf_url":"https://arxiv.org/pdf/2501.05436v1.pdf","comment":"GA and JL contributed equally to this work"},{"id":"http://arxiv.org/abs/2412.06927v2","updated":"2025-01-09T18:44:39Z","published":"2024-12-09T19:12:17Z","title":"Gradient-based facial encoding for key generation to encrypt and decrypt\n multimedia data","summary":" Security systems relying on passwords are vulnerable to being forgotten,\nguessed, or breached. Likewise, biometric systems that operate independently\nare at risk of template spoofing and replay incidents. This paper introduces a\nbiocryptosystem utilizing face recognition techniques to address these issues,\nallowing for the encryption and decryption of various file types through the\nAdvanced Encryption Standard (AES). The proposed system creates a distinct\n32-bit encryption key derived from facial features identified by Histogram of\nOriented Gradients (HOG) and categorized using Support Vector Machines (SVM).\nHOG efficiently identifies edge-aligned facial features, even in dim lighting,\nensuring that reliable biometric keys can be generated. This key is then used\nwith AES to encrypt and decrypt a variety of data formats, such as text, audio,\nand video files. This encryption key, derived from an individual's distinctive\nfacial traits, is exceedingly challenging for adversaries to reproduce or\nguess. 
The security and performance of the system have been validated through\nexperiments using several metrics, including correlation analysis, Shannon\nentropy, normalized Hamming distance, and the avalanche effect on 25 different\nfile types. Potential uses for the proposed system include secure file sharing,\nonline transactions, and data archiving, making it a strong and trustworthy\napproach to safeguarding sensitive information by integrating the uniqueness of\nfacial biometrics with the established security of AES encryption.\n","authors":["Ankit Kumar Patel","Dewanshi Paul","Sarthak Giri","Sneha Chaudhary","Bikalpa Gautam"],"pdf_url":"https://arxiv.org/pdf/2412.06927v2.pdf","comment":"12 pages, 2 figures, This work has been submitted to the IEEE for\n possible publication"},{"id":"http://arxiv.org/abs/2410.08405v2","updated":"2025-01-09T18:43:18Z","published":"2024-10-10T22:38:26Z","title":"AgroGPT: Efficient Agricultural Vision-Language Model with Expert Tuning","summary":" Significant progress has been made in advancing large multimodal\nconversational models (LMMs), capitalizing on vast repositories of image-text\ndata available online. Despite this progress, these models often encounter\nsubstantial domain gaps, hindering their ability to engage in complex\nconversations across new domains. Recent efforts have aimed to mitigate this\nissue, albeit relying on domain-specific image-text data to curate\ninstruction-tuning data. However, many domains, such as agriculture, lack such\nvision-language data. In this work, we propose an approach to construct\ninstruction-tuning data that harnesses vision-only data for the agriculture\ndomain. We utilize diverse agricultural datasets spanning multiple domains,\ncurate class-specific information, and employ large language models (LLMs) to\nconstruct an expert-tuning set, resulting in a 70k expert-tuning dataset called\nAgroInstruct. 
Subsequently, we expert-tuned and created AgroGPT, an efficient\nLMM that can hold complex agriculture-related conversations and provide useful\ninsights. We also develop AgroEvals for evaluation and compare {AgroGPT's}\nperformance with large open and closed-source models. {AgroGPT} excels at\nidentifying fine-grained agricultural concepts, can act as an agriculture\nexpert, and provides helpful information for multimodal agriculture questions.\nThe code, datasets, and models are available at\nhttps://github.com/awaisrauf/agroGPT.\n","authors":["Muhammad Awais","Ali Husain Salem Abdulla Alharthi","Amandeep Kumar","Hisham Cholakkal","Rao Muhammad Anwer"],"pdf_url":"https://arxiv.org/pdf/2410.08405v2.pdf","comment":"Accepted at WACV, 2025"},{"id":"http://arxiv.org/abs/2501.05429v1","updated":"2025-01-09T18:42:47Z","published":"2025-01-09T18:42:47Z","title":"Flatland Vision","summary":" When is it possible to project two sets of labeled points lying in a pair of\nprojective planes to the same image on a projective line? We give a complete\nanswer to this question and describe the loci of the projection centers that\nenable a common image. In particular, we find that there exists a solution to\nthis problem if and only if these two sets are themselves images of a common\npointset in projective space.\n","authors":["Sameer Agarwal","Erin Connelly","Annalisa Crannell","Timothy Duff","Rekha R. Thomas"],"pdf_url":"https://arxiv.org/pdf/2501.05429v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05427v1","updated":"2025-01-09T18:37:35Z","published":"2025-01-09T18:37:35Z","title":"Zero-1-to-G: Taming Pretrained 2D Diffusion Model for Direct 3D\n Generation","summary":" Recent advances in 2D image generation have achieved remarkable\nquality,largely driven by the capacity of diffusion models and the availability\nof large-scale datasets. However, direct 3D generation is still constrained by\nthe scarcity and lower fidelity of 3D datasets. 
In this paper, we introduce\nZero-1-to-G, a novel approach that addresses this problem by enabling direct\nsingle-view generation on Gaussian splats using pretrained 2D diffusion models.\nOur key insight is that Gaussian splats, a 3D representation, can be decomposed\ninto multi-view images encoding different attributes. This reframes the\nchallenging task of direct 3D generation within a 2D diffusion framework,\nallowing us to leverage the rich priors of pretrained 2D diffusion models. To\nincorporate 3D awareness, we introduce cross-view and cross-attribute attention\nlayers, which capture complex correlations and enforce 3D consistency across\ngenerated splats. This makes Zero-1-to-G the first direct image-to-3D\ngenerative model to effectively utilize pretrained 2D diffusion priors,\nenabling efficient training and improved generalization to unseen objects.\nExtensive experiments on both synthetic and in-the-wild datasets demonstrate\nsuperior performance in 3D object generation, offering a new approach to\nhigh-quality 3D generation.\n","authors":["Xuyi Meng","Chen Wang","Jiahui Lei","Kostas Daniilidis","Jiatao Gu","Lingjie Liu"],"pdf_url":"https://arxiv.org/pdf/2501.05427v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05426v1","updated":"2025-01-09T18:35:43Z","published":"2025-01-09T18:35:43Z","title":"From Images to Insights: Transforming Brain Cancer Diagnosis with\n Explainable AI","summary":" Brain cancer represents a major challenge in medical diagnostics, requisite\nprecise and timely detection for effective treatment. Diagnosis initially\nrelies on the proficiency of radiologists, which can cause difficulties and\nthreats when the expertise is sparse. Despite the use of imaging resources,\nbrain cancer remains often difficult, time-consuming, and vulnerable to\nintraclass variability. 
This study conveys the Bangladesh Brain Cancer MRI\nDataset, containing 6,056 MRI images organized into three categories: Brain\nTumor, Brain Glioma, and Brain Menin. The dataset was collected from several\nhospitals in Bangladesh, providing a diverse and realistic sample for research.\nWe implemented advanced deep learning models, and DenseNet169 achieved\nexceptional results, with accuracy, precision, recall, and F1-Score all\nreaching 0.9983. In addition, Explainable AI (XAI) methods including GradCAM,\nGradCAM++, ScoreCAM, and LayerCAM were employed to provide visual\nrepresentations of the decision-making processes of the models. In the context\nof brain cancer, these techniques highlight DenseNet169's potential to enhance\ndiagnostic accuracy while simultaneously offering transparency, facilitating\nearly diagnosis and better patient outcomes.\n","authors":["Md. Arafat Alam Khandaker","Ziyan Shirin Raha","Salehin Bin Iqbal","M. F. Mridha","Jungpil Shin"],"pdf_url":"https://arxiv.org/pdf/2501.05426v1.pdf","comment":"Accepted in 2024 27th International Conference on Computer and\n Information Technology (ICCIT)"},{"id":"http://arxiv.org/abs/2501.05413v1","updated":"2025-01-09T18:13:57Z","published":"2025-01-09T18:13:57Z","title":"Seeing Sound: Assembling Sounds from Visuals for Audio-to-Image\n Generation","summary":" Training audio-to-image generative models requires an abundance of diverse\naudio-visual pairs that are semantically aligned. Such data is almost always\ncurated from in-the-wild videos, given the cross-modal semantic correspondence\nthat is inherent to them. In this work, we hypothesize that insisting on the\nabsolute need for ground truth audio-visual correspondence, is not only\nunnecessary, but also leads to severe restrictions in scale, quality, and\ndiversity of the data, ultimately impairing its use in the modern generative\nmodels. 
That is, we propose a scalable image sonification framework where\ninstances from a variety of high-quality yet disjoint uni-modal origins can be\nartificially paired through a retrieval process that is empowered by reasoning\ncapabilities of modern vision-language models. To demonstrate the efficacy of\nthis approach, we use our sonified images to train an audio-to-image generative\nmodel that performs competitively against state-of-the-art. Finally, through a\nseries of ablation studies, we exhibit several intriguing auditory capabilities\nlike semantic mixing and interpolation, loudness calibration and acoustic space\nmodeling through reverberation that our model has implicitly developed to guide\nthe image generation process.\n","authors":["Darius Petermann","Mahdi M. Kalayeh"],"pdf_url":"https://arxiv.org/pdf/2501.05413v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05409v1","updated":"2025-01-09T18:06:45Z","published":"2025-01-09T18:06:45Z","title":"A Novel Pathology Foundation Model by Mayo Clinic, Charité, and\n Aignostics","summary":" Recent advances in digital pathology have demonstrated the effectiveness of\nfoundation models across diverse applications. In this report, we present a\nnovel vision foundation model based on the RudolfV approach. Our model was\ntrained on a dataset comprising 1.2 million histopathology whole slide images,\ncollected from two medical institutions: Mayo Clinic and Charit\\'e -\nUniverst\\\"atsmedizin Berlin. 
Comprehensive evaluations show that our model\nachieves state-of-the-art performance across twenty-one public benchmark\ndatasets, even though it is neither the largest model by parameter count nor by\ntraining dataset size.\n","authors":["Maximilian Alber","Stephan Tietz","Jonas Dippel","Timo Milbich","Timothée Lesort","Panos Korfiatis","Moritz Krügener","Beatriz Perez Cancer","Neelay Shah","Alexander Möllers","Philipp Seegerer","Alexandra Carpen-Amarie","Kai Standvoss","Gabriel Dernbach","Edwin de Jong","Simon Schallenberg","Andreas Kunft","Helmut Hoffer von Ankershoffen","Gavin Schaeferle","Patrick Duffy","Matt Redlon","Philipp Jurmeister","David Horst","Lukas Ruff","Klaus-Robert Müller","Frederick Klauschen","Andrew Norgan"],"pdf_url":"https://arxiv.org/pdf/2501.05409v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.01971v2","updated":"2025-01-09T17:57:53Z","published":"2024-09-03T15:15:49Z","title":"Snapshot: Towards Application-centered Models for Pedestrian Trajectory\n Prediction in Urban Traffic Environments","summary":" This paper explores pedestrian trajectory prediction in urban traffic while\nfocusing on both model accuracy and real-world applicability. While promising\napproaches exist, they often revolve around pedestrian datasets excluding\ntraffic-related information, or resemble architectures that are either not\nreal-time capable or robust. To address these limitations, we first introduce a\ndedicated benchmark based on Argoverse 2, specifically targeting pedestrians in\ntraffic environments. Following this, we present Snapshot, a modular,\nfeed-forward neural network that outperforms the current state of the art,\nreducing the Average Displacement Error (ADE) by 8.8% while utilizing\nsignificantly less information. Despite its agent-centric encoding scheme,\nSnapshot demonstrates scalability, real-time performance, and robustness to\nvarying motion histories. 
Moreover, by integrating Snapshot into a modular\nautonomous driving software stack, we showcase its real-world applicability.\n","authors":["Nico Uhlemann","Yipeng Zhou","Tobias Simeon Mohr","Markus Lienkamp"],"pdf_url":"https://arxiv.org/pdf/2409.01971v2.pdf","comment":"8 Pages, 9 Figures"},{"id":"http://arxiv.org/abs/2501.05399v1","updated":"2025-01-09T17:47:57Z","published":"2025-01-09T17:47:57Z","title":"Performance of YOLOv7 in Kitchen Safety While Handling Knife","summary":" Safe knife practices in the kitchen significantly reduce the risk of cuts,\ninjuries, and serious accidents during food preparation. Using YOLOv7, an\nadvanced object detection model, this study focuses on identifying safety risks\nduring knife handling, particularly improper finger placement and blade contact\nwith the hand. The model's performance was evaluated using metrics such as\nprecision, recall, mAP50, and mAP50-95. The results demonstrate that YOLOv7\nachieved its best performance at epoch 31, with a mAP50-95 score of 0.7879,\nprecision of 0.9063, and recall of 0.7503. These findings highlight YOLOv7's\npotential to accurately detect knife-related hazards, promoting the development\nof improved kitchen safety practices.\n","authors":["Athulya Sundaresan Geetha"],"pdf_url":"https://arxiv.org/pdf/2501.05399v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05379v1","updated":"2025-01-09T17:04:33Z","published":"2025-01-09T17:04:33Z","title":"Arc2Avatar: Generating Expressive 3D Avatars from a Single Image via ID\n Guidance","summary":" Inspired by the effectiveness of 3D Gaussian Splatting (3DGS) in\nreconstructing detailed 3D scenes within multi-view setups and the emergence of\nlarge 2D human foundation models, we introduce Arc2Avatar, the first SDS-based\nmethod utilizing a human face foundation model as guidance with just a single\nimage as input.
To achieve that, we extend such a model for diverse-view human\nhead generation by fine-tuning on synthetic data and modifying its\nconditioning. Our avatars maintain a dense correspondence with a human face\nmesh template, allowing blendshape-based expression generation. This is\nachieved through a modified 3DGS approach, connectivity regularizers, and a\nstrategic initialization tailored for our task. Additionally, we propose an\noptional efficient SDS-based correction step to refine the blendshape\nexpressions, enhancing realism and diversity. Experiments demonstrate that\nArc2Avatar achieves state-of-the-art realism and identity preservation,\neffectively addressing color issues by allowing the use of very low guidance,\nenabled by our strong identity prior and initialization strategy, without\ncompromising detail.\n","authors":["Dimitrios Gerogiannis","Foivos Paraperas Papantoniou","Rolandos Alexandros Potamias","Alexandros Lattas","Stefanos Zafeiriou"],"pdf_url":"https://arxiv.org/pdf/2501.05379v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05369v1","updated":"2025-01-09T16:49:04Z","published":"2025-01-09T16:49:04Z","title":"1-2-1: Renaissance of Single-Network Paradigm for Virtual Try-On","summary":" Virtual Try-On (VTON) has become a crucial tool in ecommerce, enabling the\nrealistic simulation of garments on individuals while preserving their original\nappearance and pose. Early VTON methods relied on single generative networks,\nbut challenges remain in preserving fine-grained garment details due to\nlimitations in feature extraction and fusion. To address these issues, recent\napproaches have adopted a dual-network paradigm, incorporating a complementary\n\"ReferenceNet\" to enhance garment feature extraction and fusion. While\neffective, this dual-network approach introduces significant computational\noverhead, limiting its scalability for high-resolution and long-duration\nimage/video VTON applications. 
In this paper, we challenge the dual-network\nparadigm by proposing a novel single-network VTON method that overcomes the\nlimitations of existing techniques. Our method, namely MNVTON, introduces a\nModality-specific Normalization strategy that separately processes text, image,\nand video inputs, enabling them to share the same attention layers in a VTON\nnetwork. Extensive experimental results demonstrate the effectiveness of our\napproach, showing that it consistently achieves higher-quality, more detailed\nresults for both image and video VTON tasks. Our results suggest that the\nsingle-network paradigm can rival the performance of dual-network approaches,\noffering a more efficient alternative for high-quality, scalable VTON\napplications.\n","authors":["Shuliang Ning","Yipeng Qin","Xiaoguang Han"],"pdf_url":"https://arxiv.org/pdf/2501.05369v1.pdf","comment":"Project page: https://ningshuliang.github.io/2023/Arxiv/index.html"},{"id":"http://arxiv.org/abs/2501.05359v1","updated":"2025-01-09T16:43:21Z","published":"2025-01-09T16:43:21Z","title":"CROPS: Model-Agnostic Training-Free Framework for Safe Image Synthesis\n with Latent Diffusion Models","summary":" With advances in diffusion models, image generation has shown significant\nperformance improvements. This raises concerns about the potential abuse of\nimage generation, such as the creation of explicit or violent images, commonly\nreferred to as Not Safe For Work (NSFW) content. To address this, the Stable\nDiffusion model includes several safety checkers to censor initial text prompts\nand final output images generated from the model. However, recent research has\nshown that these safety checkers have vulnerabilities against adversarial\nattacks, allowing them to generate NSFW images. In this paper, we find that\nthese adversarial attacks are not robust to small changes in text prompts or\ninput latents.
Based on this, we propose CROPS (Circular or RandOm Prompts for\nSafety), a model-agnostic framework that easily defends against adversarial\nattacks generating NSFW images without requiring additional training. Moreover,\nwe develop an approach that utilizes one-step diffusion models for efficient\nNSFW detection (CROPS-1), further reducing computational resources. We\ndemonstrate the superiority of our method in terms of performance and\napplicability.\n","authors":["Junha Park","Ian Ryu","Jaehui Hwang","Hyungkeun Park","Jiyoon Kim","Jong-Seok Lee"],"pdf_url":"https://arxiv.org/pdf/2501.05359v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.01428v3","updated":"2025-01-09T16:41:07Z","published":"2025-01-02T18:59:59Z","title":"GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models","summary":" In recent years, 2D Vision-Language Models (VLMs) have made significant\nstrides in image-text understanding tasks. However, their performance in 3D\nspatial comprehension, which is critical for embodied intelligence, remains\nlimited. Recent advances have leveraged 3D point clouds and multi-view images\nas inputs, yielding promising results. However, we propose exploring a purely\nvision-based solution inspired by human perception, which merely relies on\nvisual cues for 3D spatial understanding. This paper empirically investigates\nthe limitations of VLMs in 3D spatial knowledge, revealing that their primary\nshortcoming lies in the lack of global-local correspondence between the scene\nand individual frames. To address this, we introduce GPT4Scene, a novel visual\nprompting paradigm in VLM training and inference that helps build the\nglobal-local relationship, significantly improving the 3D spatial understanding\nof indoor scenes. Specifically, GPT4Scene constructs a 3D Bird's Eye View (BEV)\nimage from the video and marks consistent object IDs across both frames and the\nBEV image. 
The model then takes as input the concatenated BEV image and video frames\nwith markers. In zero-shot evaluations, GPT4Scene improves performance over\nclosed-source VLMs like GPT-4o. Additionally, we prepare a processed video\ndataset consisting of 165K text annotations to fine-tune open-source VLMs,\nachieving state-of-the-art performance on all 3D understanding tasks.\nSurprisingly, after training with the GPT4Scene paradigm, VLMs consistently\nimprove during inference, even without visual prompting and the BEV image as\nexplicit correspondence. This demonstrates that the proposed paradigm helps VLMs\ndevelop an intrinsic ability to understand 3D scenes, which paves the way for a\nnoninvasive approach to extending pre-trained VLMs for 3D scene understanding.\n","authors":["Zhangyang Qi","Zhixiong Zhang","Ye Fang","Jiaqi Wang","Hengshuang Zhao"],"pdf_url":"https://arxiv.org/pdf/2501.01428v3.pdf","comment":"Project page: https://gpt4scene.github.io/"},{"id":"http://arxiv.org/abs/2501.05339v1","updated":"2025-01-09T16:10:06Z","published":"2025-01-09T16:10:06Z","title":"JAQ: Joint Efficient Architecture Design and Low-Bit Quantization with\n Hardware-Software Co-Exploration","summary":" The co-design of neural network architectures, quantization precisions, and\nhardware accelerators offers a promising approach to achieving an optimal\nbalance between performance and efficiency, particularly for model deployment\non resource-constrained edge devices. In this work, we propose the JAQ\nFramework, which jointly optimizes the three critical dimensions. However,\neffectively automating the design process across the vast search space of those\nthree dimensions poses significant challenges, especially when pursuing\nextremely low-bit quantization.
Specifically, the primary challenges include: (1)\nMemory overhead on the software side: Low-precision quantization-aware training can\nlead to significant memory usage due to storing large intermediate features and\nlatent weights for back-propagation, potentially causing memory exhaustion. (2)\nTime-consuming search on the hardware side: The discrete nature of hardware\nparameters and the complex interplay between compiler optimizations and\nindividual operators make the accelerator search time-consuming. To address\nthese issues, JAQ mitigates the memory overhead through a channel-wise sparse\nquantization (CSQ) scheme, selectively applying quantization to the most\nsensitive components of the model during optimization. Additionally, JAQ\ndesigns BatchTile, which employs a hardware generation network to encode all\npossible tiling modes, thereby speeding up the search for the optimal compiler\nmapping strategy. Extensive experiments demonstrate the effectiveness of JAQ,\nachieving approximately 7% higher Top-1 accuracy on ImageNet compared to\nprevious methods and reducing the hardware search time per iteration to 0.15\nseconds.\n","authors":["Mingzi Wang","Yuan Meng","Chen Tang","Weixiang Zhang","Yijian Qin","Yang Yao","Yingxin Li","Tongtong Feng","Xin Wang","Xun Guan","Zhi Wang","Wenwu Zhu"],"pdf_url":"https://arxiv.org/pdf/2501.05339v1.pdf","comment":"Accepted by AAAI 2025"},{"id":"http://arxiv.org/abs/2501.04561v2","updated":"2025-01-09T15:54:14Z","published":"2025-01-08T15:18:09Z","title":"OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment\n across Language with Real-time Self-Aware Emotional Speech Synthesis","summary":" Recent advancements in omnimodal learning have been achieved in understanding\nand generation across images, text, and speech, though mainly within\nproprietary models. Limited omnimodal datasets and the inherent challenges\nassociated with real-time emotional speech generation have hindered open-source\nprogress.
To address these issues, we propose openomni, a two-stage training\nmethod combining omnimodal alignment and speech generation to develop a\nstate-of-the-art omnimodal large language model. In the alignment phase, a\npre-trained speech model is further trained on text-image tasks to generalize\nfrom vision to speech in a (near) zero-shot manner, outperforming models\ntrained on tri-modal datasets. In the speech generation phase, a lightweight\ndecoder facilitates real-time emotional speech through training on speech tasks\nand preference learning. Experiments demonstrate that openomni consistently\nimproves across omnimodal, vision-language, and speech-language evaluations,\nenabling natural, emotion-rich dialogues and real-time emotional speech\ngeneration.\n","authors":["Run Luo","Ting-En Lin","Haonan Zhang","Yuchuan Wu","Xiong Liu","Min Yang","Yongbin Li","Longze Chen","Jiaming Li","Lei Zhang","Yangyi Chen","Hamid Alinejad-Rokny","Fei Huang"],"pdf_url":"https://arxiv.org/pdf/2501.04561v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.10616v2","updated":"2025-01-09T15:45:59Z","published":"2024-11-15T22:37:56Z","title":"Voxel-Aggregated Feature Synthesis: Efficient Dense Mapping for\n Simulated 3D Reasoning","summary":" We address the issue of the exploding computational requirements of recent\nstate-of-the-art (SOTA) open-set multimodal 3D mapping (dense 3D mapping)\nalgorithms and present Voxel-Aggregated Feature Synthesis (VAFS), a novel\napproach to dense 3D mapping in simulation. Dense 3D mapping involves\nsegmenting and embedding sequential RGBD frames which are then fused into 3D.\nThis leads to redundant computation as the differences between frames are small\nbut all are individually segmented and embedded. This makes dense 3D mapping\nimpractical for research involving embodied agents in which the environment,\nand thus the mapping, must be modified with regularity.
VAFS drastically\nreduces this computation by using the segmented point cloud computed by a\nsimulator's physics engine and synthesizing views of each region. This reduces\nthe number of features to embed from the number of captured RGBD frames to the\nnumber of objects in the scene, effectively allowing a \"ground truth\" semantic\nmap to be computed an order of magnitude faster than traditional methods. We\ntest the resulting representation by assessing the IoU scores of semantic\nqueries for different objects in the simulated scene, and find that VAFS\nexceeds the accuracy and speed of prior dense 3D mapping techniques.\n","authors":["Owen Burns","Rizwan Qureshi"],"pdf_url":"https://arxiv.org/pdf/2411.10616v2.pdf","comment":"6 pages, 2 figures, CVPR 2025"},{"id":"http://arxiv.org/abs/2302.08878v2","updated":"2025-01-09T15:35:59Z","published":"2023-02-17T13:50:53Z","title":"Less is More: The Influence of Pruning on the Explainability of CNNs","summary":" Modern, state-of-the-art Convolutional Neural Networks (CNNs) in computer\nvision have millions of parameters. Thus, explaining the complex decisions of\nsuch networks to humans is challenging. A technical approach to reduce CNN\ncomplexity is network pruning, where less important parameters are deleted. The\nwork presented in this paper investigates whether this technical complexity\nreduction also helps with perceived explainability. To do so, we conducted a\npre-study and two human-grounded experiments, assessing the effects of\ndifferent pruning ratios on CNN explainability. Overall, we evaluated four\ndifferent compression rates (i.e., CPR 2, 4, 8, and 32) with 37 500 tasks on\nMechanical Turk. Results indicate that lower compression rates have a positive\ninfluence on explainability, while higher compression rates show negative\neffects. 
Furthermore, we were able to identify sweet spots that increase both\nthe perceived explainability and the model's performance.\n","authors":["David Weber","Florian Merkle","Pascal Schöttle","Stephan Schlögl"],"pdf_url":"https://arxiv.org/pdf/2302.08878v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03145v2","updated":"2025-01-09T15:31:29Z","published":"2025-01-06T17:12:19Z","title":"Geometry Restoration and Dewarping of Camera-Captured Document Images","summary":" This research focuses on developing a method for restoring the topology of\ndigital images of paper documents captured by a camera, using algorithms for\ndetection, segmentation, geometry restoration, and dewarping. Our methodology\nemploys deep learning (DL) for document outline detection, followed by computer\nvision (CV) to create a topological 2D grid using cubic polynomial\ninterpolation and correct nonlinear distortions by remapping the image. Using\nclassical CV methods makes the document topology restoration process more\nefficient and faster, as it requires significantly fewer computational\nresources and memory. We developed a new pipeline for automatic document\ndewarping and reconstruction, along with a framework and annotated dataset to\ndemonstrate its efficiency. Our experiments confirm the promise of our\nmethodology and its superiority over existing benchmarks (including mobile apps\nand popular DL solutions, such as RectiNet, DocGeoNet, and DocTr++) both\nvisually and in terms of document readability via Optical Character Recognition\n(OCR) and geometry restoration metrics. This paves the way for creating\nhigh-quality digital copies of paper documents and enhancing the efficiency of\nOCR systems. 
Project page: https://github.com/HorizonParadox/DRCCBI\n","authors":["Valery Istomin","Oleg Pereziabov","Ilya Afanasyev"],"pdf_url":"https://arxiv.org/pdf/2501.03145v2.pdf","comment":"28 pages, 16 figures"},{"id":"http://arxiv.org/abs/2501.04586v2","updated":"2025-01-09T15:27:58Z","published":"2025-01-08T16:06:21Z","title":"Identity-Preserving Video Dubbing Using Motion Warping","summary":" Video dubbing aims to synthesize realistic, lip-synced videos from a\nreference video and a driving audio signal. Although existing methods can\naccurately generate mouth shapes driven by audio, they often fail to preserve\nidentity-specific features, largely because they do not effectively capture the\nnuanced interplay between audio cues and the visual attributes of the reference\nidentity. As a result, the generated outputs frequently lack fidelity in\nreproducing the unique textural and structural details of the reference\nidentity. To address these limitations, we propose IPTalker, a novel and robust\nframework for video dubbing that achieves seamless alignment between driving\naudio and reference identity while ensuring both lip-sync accuracy and\nhigh-fidelity identity preservation. At the core of IPTalker is a\ntransformer-based alignment mechanism designed to dynamically capture and model\nthe correspondence between audio features and reference images, thereby\nenabling precise, identity-aware audio-visual integration. Building on this\nalignment, a motion warping strategy further refines the results by spatially\ndeforming reference images to match the target audio-driven configuration. A\ndedicated refinement process then mitigates occlusion artifacts and enhances\nthe preservation of fine-grained textures, such as mouth details and skin\nfeatures.
Extensive qualitative and quantitative evaluations demonstrate that\nIPTalker consistently outperforms existing approaches in terms of realism, lip\nsynchronization, and identity retention, establishing a new state of the art\nfor high-quality, identity-consistent video dubbing.\n","authors":["Runzhen Liu","Qinjie Lin","Yunfei Liu","Lijian Lin","Ye Zhu","Yu Li","Chuhua Xian","Fa-Ting Hong"],"pdf_url":"https://arxiv.org/pdf/2501.04586v2.pdf","comment":"v2, Under Review"},{"id":"http://arxiv.org/abs/2501.05281v1","updated":"2025-01-09T14:43:36Z","published":"2025-01-09T14:43:36Z","title":"Comparison Study: Glacier Calving Front Delineation in Synthetic\n Aperture Radar Images With Deep Learning","summary":" Calving front position variation of marine-terminating glaciers is an\nindicator of ice mass loss and a crucial parameter in numerical glacier models.\nDeep Learning (DL) systems can automatically extract this position from\nSynthetic Aperture Radar (SAR) imagery, enabling continuous, weather- and\nillumination-independent, large-scale monitoring. This study presents the first\ncomparison of DL systems on a common calving front benchmark dataset. A\nmulti-annotator study with ten annotators is performed to contrast the\nbest-performing DL system against human performance. The best DL model's\noutputs deviate 221 m on average, while the average deviation of the human\nannotators is 38 m. This significant difference shows that current DL systems\ndo not yet match human performance and that further research is needed to\nenable fully automated monitoring of glacier calving fronts. 
The study of Vision Transformers and foundation models, as well as\nstrategies for incorporating and processing additional information, is\nidentified as an avenue for future research.\n","authors":["Nora Gourmelon","Konrad Heidler","Erik Loebel","Daniel Cheng","Julian Klink","Anda Dong","Fei Wu","Noah Maul","Moritz Koch","Marcel Dreier","Dakota Pyles","Thorsten Seehaus","Matthias Braun","Andreas Maier","Vincent Christlein"],"pdf_url":"https://arxiv.org/pdf/2501.05281v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03616v2","updated":"2025-01-09T14:33:09Z","published":"2025-01-07T08:32:48Z","title":"BTMTrack: Robust RGB-T Tracking via Dual-template Bridging and\n Temporal-Modal Candidate Elimination","summary":" RGB-T tracking leverages the complementary strengths of RGB and thermal\ninfrared (TIR) modalities to address challenging scenarios such as low\nillumination and adverse weather. However, existing methods often fail to\neffectively integrate temporal information and perform efficient cross-modal\ninteractions, which constrain their adaptability to dynamic targets. In this\npaper, we propose BTMTrack, a novel framework for RGB-T tracking. The core of\nour approach lies in the dual-template backbone network and the Temporal-Modal\nCandidate Elimination (TMCE) strategy. The dual-template backbone effectively\nintegrates temporal information, while the TMCE strategy focuses the model on\ntarget-relevant tokens by evaluating temporal and modal correlations, reducing\ncomputational overhead and avoiding irrelevant background noise. Building upon\nthis foundation, we propose the Temporal Dual Template Bridging (TDTB) module,\nwhich facilitates precise cross-modal fusion through dynamically filtered\ntokens. This approach further strengthens the interaction between templates and\nthe search region. Extensive experiments conducted on three benchmark datasets\ndemonstrate the effectiveness of BTMTrack.
Our method achieves state-of-the-art\nperformance, with a 72.3% precision rate on the LasHeR test set and competitive\nresults on the RGBT210 and RGBT234 datasets.\n","authors":["Zhongxuan Zhang","Bi Zeng","Xinyu Ni","Yimin Du"],"pdf_url":"https://arxiv.org/pdf/2501.03616v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05272v1","updated":"2025-01-09T14:31:54Z","published":"2025-01-09T14:31:54Z","title":"Solving the Catastrophic Forgetting Problem in Generalized Category\n Discovery","summary":" Generalized Category Discovery (GCD) aims to identify a mix of known and\nnovel categories within unlabeled data sets, providing a more realistic setting\nfor image recognition. Essentially, GCD needs to remember existing patterns\nthoroughly to recognize novel categories. The recent state-of-the-art method SimGCD\ntransfers the knowledge from known-class data to the learning of novel classes\nthrough debiased learning. However, some patterns are catastrophically forgotten\nduring adaptation, which leads to poor performance in novel category\nclassification. To address this issue, we propose a novel learning approach,\nLegoGCD, which is seamlessly integrated into previous methods to enhance the\ndiscrimination of novel classes while maintaining performance on previously\nencountered known classes. Specifically, we design two types of techniques\ntermed Local Entropy Regularization (LER) and Dual-views Kullback Leibler\ndivergence constraint (DKL). The LER optimizes the distribution of potential\nknown class samples in unlabeled data, thus ensuring the preservation of\nknowledge related to known categories while learning novel classes. Meanwhile,\nDKL introduces Kullback Leibler divergence to encourage the model to produce\nsimilar prediction distributions for two view samples from the same image. In\nthis way, it successfully avoids mismatched predictions and generates more\nreliable potential known class samples simultaneously.
Extensive experiments\nvalidate that the proposed LegoGCD effectively addresses the known category\nforgetting issue across all datasets, e.g., delivering a 7.74% and 2.51% accuracy\nboost on known and novel classes in CUB, respectively. Our code is available\nat: https://github.com/Cliffia123/LegoGCD.\n","authors":["Xinzi Cao","Xiawu Zheng","Guanhong Wang","Weijiang Yu","Yunhang Shen","Ke Li","Yutong Lu","Yonghong Tian"],"pdf_url":"https://arxiv.org/pdf/2501.05272v1.pdf","comment":"Accepted by CVPR 2024"},{"id":"http://arxiv.org/abs/2501.05269v1","updated":"2025-01-09T14:26:50Z","published":"2025-01-09T14:26:50Z","title":"CellViT++: Energy-Efficient and Adaptive Cell Segmentation and\n Classification Using Foundation Models","summary":" Digital Pathology is a cornerstone in the diagnosis and treatment of\ndiseases. A key task in this field is the identification and segmentation of\ncells in hematoxylin and eosin-stained images. Existing methods for cell\nsegmentation often require extensive annotated datasets for training and are\nlimited to a predefined cell classification scheme. To overcome these\nlimitations, we propose $\\text{CellViT}^{{\\scriptscriptstyle ++}}$, a framework\nfor generalized cell segmentation in digital pathology.\n$\\text{CellViT}^{{\\scriptscriptstyle ++}}$ utilizes Vision Transformers with\nfoundation models as encoders to compute deep cell features and segmentation\nmasks simultaneously. To adapt to unseen cell types, we rely on a\ncomputationally efficient approach. It requires minimal data for training and\nleads to a drastically reduced carbon footprint. We demonstrate excellent\nperformance on seven different datasets, covering a broad spectrum of cell\ntypes, organs, and clinical settings.
The framework achieves remarkable\nzero-shot segmentation and data-efficient cell-type classification.\nFurthermore, we show that $\\text{CellViT}^{{\\scriptscriptstyle ++}}$ can\nleverage immunofluorescence stainings to generate training datasets without the\nneed for pathologist annotations. The automated dataset generation approach\nsurpasses the performance of networks trained on manually labeled data,\ndemonstrating its effectiveness in creating high-quality training datasets\nwithout expert annotations. To advance digital pathology,\n$\\text{CellViT}^{{\\scriptscriptstyle ++}}$ is available as an open-source\nframework featuring a user-friendly, web-based interface for visualization and\nannotation. The code is available under\nhttps://github.com/TIO-IKIM/CellViT-plus-plus.\n","authors":["Fabian Hörst","Moritz Rempe","Helmut Becker","Lukas Heine","Julius Keyl","Jens Kleesiek"],"pdf_url":"https://arxiv.org/pdf/2501.05269v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05265v1","updated":"2025-01-09T14:19:46Z","published":"2025-01-09T14:19:46Z","title":"Patch-GAN Transfer Learning with Reconstructive Models for Cloud Removal","summary":" Cloud removal plays a crucial role in enhancing remote sensing image\nanalysis, yet accurately reconstructing cloud-obscured regions remains a\nsignificant challenge. Recent advancements in generative models have made the\ngeneration of realistic images increasingly accessible, offering new\nopportunities for this task. Given the conceptual alignment between image\ngeneration and cloud removal tasks, generative models present a promising\napproach for addressing cloud removal in remote sensing. In this work, we\npropose a deep transfer learning approach built on a generative adversarial\nnetwork (GAN) framework to explore the potential of the novel masked\nautoencoder (MAE) image reconstruction model in cloud removal. 
Due to the\ncomplexity of remote sensing imagery, we further propose using a patch-wise\ndiscriminator to determine whether each patch of the image is real or not. The\nproposed reconstructive transfer learning approach demonstrates significant\nimprovements in cloud removal performance compared to other GAN-based methods.\nAdditionally, whilst direct comparisons with some of the state-of-the-art cloud\nremoval techniques are limited due to unclear details regarding their\ntrain/test data splits, the proposed model achieves competitive results based\non available benchmarks.\n","authors":["Wanli Ma","Oktay Karakus","Paul L. Rosin"],"pdf_url":"https://arxiv.org/pdf/2501.05265v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05264v1","updated":"2025-01-09T14:19:33Z","published":"2025-01-09T14:19:33Z","title":"Towards Balanced Continual Multi-Modal Learning in Human Pose Estimation","summary":" 3D human pose estimation (3D HPE) has emerged as a prominent research topic,\nparticularly in the realm of RGB-based methods. However, RGB images are\nsusceptible to limitations such as sensitivity to lighting conditions and\npotential user discomfort. Consequently, multi-modal sensing, which leverages\nnon-intrusive sensors, is gaining increasing attention. Nevertheless,\nmulti-modal 3D HPE still faces challenges, including modality imbalance and the\nimperative for continual learning. In this work, we introduce a novel balanced\ncontinual multi-modal learning method for 3D HPE, which harnesses the power of\nRGB, LiDAR, mmWave, and WiFi. Specifically, we propose a Shapley value-based\ncontribution algorithm to quantify the contribution of each modality and\nidentify modality imbalance. To address this imbalance, we employ a re-learning\nstrategy. Furthermore, recognizing that raw data is prone to noise\ncontamination, we develop a novel denoising continual learning approach. 
This\napproach incorporates a noise identification and separation module to mitigate\nthe adverse effects of noise and collaborates with the balanced learning\nstrategy to enhance optimization. Additionally, an adaptive EWC mechanism is\nemployed to alleviate catastrophic forgetting. We conduct extensive experiments\non the widely-adopted multi-modal dataset, MM-Fi, which demonstrate the\nsuperiority of our approach in boosting 3D pose estimation and mitigating\ncatastrophic forgetting in complex scenarios. We will release our code.\n","authors":["Jiaxuan Peng","Mengshi Qi","Dong Zhao","Huadong Ma"],"pdf_url":"https://arxiv.org/pdf/2501.05264v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.16623v2","updated":"2025-01-09T13:59:21Z","published":"2023-11-28T09:24:42Z","title":"Visual Semantic Navigation with Real Robots","summary":" Visual Semantic Navigation (VSN) is the ability of a robot to learn visual\nsemantic information for navigating in unseen environments. These VSN models\nare typically tested in the virtual environments where they are trained,\nmainly using reinforcement learning based approaches. Therefore, we do not yet\nhave an in-depth analysis of how these models would behave in the real world.\nIn this work, we propose a new solution to integrate VSN models into real\nrobots, so that we have true embodied agents. We also release a novel ROS-based\nframework for VSN, ROS4VSN, so that any VSN model can be easily deployed in any\nROS-compatible robot and tested in a real setting. Our experiments with two\ndifferent robots, where we have embedded two state-of-the-art VSN agents,\nconfirm that there is a noticeable performance difference between these VSN\nsolutions when tested in real-world and simulation environments. We hope that\nthis research provides a foundation for addressing this\nconsequential issue, with the ultimate aim of advancing the performance and\nefficiency of embodied agents within authentic real-world scenarios.
Code to\nreproduce all our experiments can be found at\nhttps://github.com/gramuah/ros4vsn.\n","authors":["Carlos Gutiérrez-Álvarez","Pablo Ríos-Navarro","Rafael Flor-Rodríguez","Francisco Javier Acevedo-Rodríguez","Roberto J. López-Sastre"],"pdf_url":"https://arxiv.org/pdf/2311.16623v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05246v1","updated":"2025-01-09T13:54:59Z","published":"2025-01-09T13:54:59Z","title":"Domain-Incremental Semantic Segmentation for Autonomous Driving under\n Adverse Driving Conditions","summary":" Semantic segmentation for autonomous driving is an even more challenging task\nwhen faced with adverse driving conditions. Standard models trained on data\nrecorded under ideal conditions show deteriorated performance in unfavorable\nweather or illumination conditions. Fine-tuning on the new task or condition\nwould lead to overwriting the previously learned information, resulting in\ncatastrophic forgetting. Adapting to the new conditions through traditional\ndomain adaptation methods improves the performance on the target domain at the\nexpense of the source domain. Addressing these issues, we propose an\narchitecture-based domain-incremental learning approach called Progressive\nSemantic Segmentation (PSS). PSS is a task-agnostic, dynamically growing\ncollection of domain-specific segmentation models. The task of inferring the\ndomain and subsequently selecting the appropriate module for segmentation is\ncarried out using a collection of convolutional autoencoders. We extensively\nevaluate our proposed approach using several datasets at varying levels of\ngranularity in the categorization of adverse driving conditions.
Furthermore,\nwe demonstrate the generalization of the proposed approach to similar and\nunseen domains.\n","authors":["Shishir Muralidhara","René Schuster","Didier Stricker"],"pdf_url":"https://arxiv.org/pdf/2501.05246v1.pdf","comment":"Accepted at ICPRAM 2025"},{"id":"http://arxiv.org/abs/2501.05244v1","updated":"2025-01-09T13:52:30Z","published":"2025-01-09T13:52:30Z","title":"Optimized Sampling for Non-Line-of-Sight Imaging Using Modified Fast\n Fourier Transforms","summary":" Non-line-of-Sight (NLOS) imaging systems collect light at a diffuse relay\nsurface and input this measurement into computational algorithms that output a\n3D volumetric reconstruction. These algorithms utilize the Fast Fourier\nTransform (FFT) to accelerate the reconstruction process but require both input\nand output to be sampled spatially with uniform grids. However, the geometry of\nNLOS imaging inherently results in non-uniform sampling on the relay surface\nwhen using multi-pixel detector arrays, even though such arrays significantly\nreduce acquisition times. Furthermore, using these arrays increases the data\nrate required for sensor readout, posing challenges for real-world deployment.\nIn this work, we utilize the phasor field framework to demonstrate that\nexisting NLOS imaging setups typically oversample the relay surface spatially,\nexplaining why the measurement can be compressed without significantly\nsacrificing reconstruction quality. This enables us to utilize the Non-Uniform\nFast Fourier Transform (NUFFT) to reconstruct from sparse measurements acquired\nfrom irregularly sampled relay surfaces of arbitrary shapes. Furthermore, we\nutilize the NUFFT to reconstruct at arbitrary locations in the hidden volume,\nensuring flexible sampling schemes for both the input and output. Finally, we\nutilize the Scaled Fast Fourier Transform (SFFT) to reconstruct larger volumes\nwithout increasing the number of samples stored in memory. 
All algorithms\nintroduced in this paper preserve the computational complexity of FFT-based\nmethods, ensuring scalability for practical NLOS imaging applications.\n","authors":["Talha Sultan","Alex Bocchieri","Chaoying Gu","Xiaochun Liu","Pavel Polynkin","Andreas Velten"],"pdf_url":"https://arxiv.org/pdf/2501.05244v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05242v1","updated":"2025-01-09T13:50:26Z","published":"2025-01-09T13:50:26Z","title":"Scaffold-SLAM: Structured 3D Gaussians for Simultaneous Localization and\n Photorealistic Mapping","summary":" 3D Gaussian Splatting (3DGS) has recently revolutionized novel view synthesis\nin Simultaneous Localization and Mapping (SLAM). However, existing SLAM\nmethods utilizing 3DGS have failed to provide high-quality novel view rendering\nfor monocular, stereo, and RGB-D cameras simultaneously. Notably, some methods\nperform well for RGB-D cameras but suffer significant degradation in rendering\nquality for monocular cameras. In this paper, we present Scaffold-SLAM, which\ndelivers simultaneous localization and high-quality photorealistic mapping\nacross monocular, stereo, and RGB-D cameras. We introduce two key innovations\nto achieve this state-of-the-art visual quality. First, we propose\nAppearance-from-Motion embedding, enabling 3D Gaussians to better model image\nappearance variations across different camera poses. Second, we introduce a\nfrequency regularization pyramid to guide the distribution of Gaussians,\nallowing the model to effectively capture finer details in the scene. 
Extensive\nexperiments on monocular, stereo, and RGB-D datasets demonstrate that\nScaffold-SLAM significantly outperforms state-of-the-art methods in\nphotorealistic mapping quality, e.g., PSNR is 16.76% higher on the TUM RGB-D\ndataset for monocular cameras.\n","authors":["Wen Tianci","Liu Zhiang","Lu Biao","Fang Yongchun"],"pdf_url":"https://arxiv.org/pdf/2501.05242v1.pdf","comment":"12 pages, 6 figures"},{"id":"http://arxiv.org/abs/2501.05241v1","updated":"2025-01-09T13:46:46Z","published":"2025-01-09T13:46:46Z","title":"Contrast-Free Myocardial Scar Segmentation in Cine MRI using Motion and\n Texture Fusion","summary":" Late gadolinium enhancement MRI (LGE MRI) is the gold standard for the\ndetection of myocardial scars after myocardial infarction (MI). LGE MRI\nrequires the injection of a contrast agent, which carries potential side\neffects and increases scanning time and patient discomfort. To address these\nissues, we propose a novel framework that combines cardiac motion observed in\ncine MRI with image texture information to segment the myocardium and scar\ntissue in the left ventricle. Cardiac motion tracking can be formulated as a\nfull cardiac image cycle registration problem, which can be solved via deep\nneural networks. Experimental results show that the proposed method can\nachieve scar segmentation based on non-contrasted cine images with comparable\naccuracy to LGE MRI. This demonstrates its potential as an alternative to\ncontrast-enhanced techniques for scar detection.\n","authors":["Guang Yang","Jingkun Chen","Xicheng Sheng","Shan Yang","Xiahai Zhuang","Betty Raman","Lei Li","Vicente Grau"],"pdf_url":"https://arxiv.org/pdf/2501.05241v1.pdf","comment":"5 pages, 2 figures, 2 tables"},{"id":"http://arxiv.org/abs/2501.05239v1","updated":"2025-01-09T13:44:42Z","published":"2025-01-09T13:44:42Z","title":"Is Your Autonomous Vehicle Safe? 
Understanding the Threat of\n Electromagnetic Signal Injection Attacks on Traffic Scene Perception","summary":" Autonomous vehicles rely on camera-based perception systems to comprehend\ntheir driving environment and make crucial decisions, thereby enabling vehicles\nto steer safely. However, a significant threat known as Electromagnetic Signal\nInjection Attacks (ESIA) can distort the images captured by these cameras,\nleading to incorrect AI decisions and potentially compromising the safety of\nautonomous vehicles. Despite the serious implications of ESIA, there is limited\nunderstanding of its impact on the robustness of AI models across diverse and\ncomplex driving scenarios. To address this gap, our research analyzes the\nperformance of different models under ESIA, revealing their vulnerabilities to\nthe attacks. Moreover, due to the challenges in obtaining real-world attack\ndata, we develop a novel ESIA simulation method and generate a simulated attack\ndataset for different driving scenarios. Our research provides a comprehensive\nsimulation and evaluation framework, aiming to enhance the development of more\nrobust AI models and secure intelligent systems, ultimately contributing to the\nadvancement of safer and more reliable technology across various fields.\n","authors":["Wenhao Liao","Sineng Yan","Youqian Zhang","Xinwei Zhai","Yuanyuan Wang","Eugene Yujun Fu"],"pdf_url":"https://arxiv.org/pdf/2501.05239v1.pdf","comment":"To appear in AAAI 2025"},{"id":"http://arxiv.org/abs/2501.05238v1","updated":"2025-01-09T13:44:15Z","published":"2025-01-09T13:44:15Z","title":"FOCUS: Towards Universal Foreground Segmentation","summary":" Foreground segmentation is a fundamental task in computer vision,\nencompassing various subdivision tasks. Previous research has typically\ndesigned task-specific architectures for each task, leading to a lack of\nunification. 
Moreover, they primarily focus on recognizing foreground objects\nwithout effectively distinguishing them from the background. In this paper, we\nemphasize the importance of the background and its relationship with the\nforeground. We introduce FOCUS, the Foreground ObjeCts Universal Segmentation\nframework that can handle multiple foreground tasks. We develop a multi-scale\nsemantic network using the edge information of objects to enhance image\nfeatures. To achieve boundary-aware segmentation, we propose a novel\ndistillation method, integrating the contrastive learning strategy to refine\nthe prediction mask in multi-modal feature space. We conduct extensive\nexperiments on a total of 13 datasets across 5 tasks, and the results\ndemonstrate that FOCUS consistently outperforms the state-of-the-art\ntask-specific models on most metrics.\n","authors":["Zuyao You","Lingyu Kong","Lingchen Meng","Zuxuan Wu"],"pdf_url":"https://arxiv.org/pdf/2501.05238v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05236v1","updated":"2025-01-09T13:43:01Z","published":"2025-01-09T13:43:01Z","title":"Automated external cervical resorption segmentation in cone-beam CT\n using local texture features","summary":" External cervical resorption (ECR) is a resorptive process affecting teeth.\nWhile in some patients, active resorption ceases and gets replaced by osseous\ntissue, in other cases, the resorption progresses and ultimately results in\ntooth loss. For proper ECR assessment, cone-beam computed tomography (CBCT) is\nthe recommended imaging modality, enabling a 3-D characterization of these\nlesions. While it is possible to manually identify and measure ECR resorption\nin CBCT scans, this process can be time intensive and highly subject to human\nerror. Therefore, there is an urgent need to develop an automated method to\nidentify and quantify the severity of ECR resorption using CBCT. 
Here, we\npresent a method for ECR lesion segmentation that is based on automatic, binary\nclassification of locally extracted voxel-wise texture features. We evaluate\nour method on 6 longitudinal CBCT datasets and show that certain\ntexture features can be used to accurately detect subtle CBCT signal changes\ndue to ECR. We also present preliminary analyses clustering texture features\nwithin a lesion to stratify the defects and identify patterns indicative of\ncalcification. These methods are important steps in developing prognostic\nbiomarkers to predict whether ECR will continue to progress or cease,\nultimately informing treatment decisions.\n","authors":["Sadhana Ravikumar","Asma A. Khan","Matthew C. Davis","Beatriz Paniagua"],"pdf_url":"https://arxiv.org/pdf/2501.05236v1.pdf","comment":"4 pages, 3 figures, 1 table"},{"id":"http://arxiv.org/abs/2501.05228v1","updated":"2025-01-09T13:36:37Z","published":"2025-01-09T13:36:37Z","title":"Harnessing Large Language and Vision-Language Models for Robust\n Out-of-Distribution Detection","summary":" Out-of-distribution (OOD) detection has seen significant advancements with\nzero-shot approaches by leveraging the powerful Vision-Language Models (VLMs)\nsuch as CLIP. However, prior research has predominantly focused on\nenhancing Far-OOD performance, while potentially compromising Near-OOD\nefficacy, as observed from our pilot study. To address this issue, we propose a\nnovel strategy to enhance zero-shot OOD detection performance for both Far-OOD\nand Near-OOD scenarios by innovatively harnessing Large Language Models (LLMs)\nand VLMs. Our approach first exploits an LLM to generate superclasses of the ID\nlabels and their corresponding background descriptions, followed by feature\nextraction using CLIP. We then isolate the core semantic features for ID data\nby subtracting background features from the superclass features. 
The refined\nrepresentation facilitates the selection of more appropriate negative labels\nfor OOD data from a comprehensive candidate label set of WordNet, thereby\nenhancing the performance of zero-shot OOD detection in both scenarios.\nFurthermore, we introduce novel few-shot prompt tuning and visual prompt tuning\nto adapt the proposed framework to better align with the target distribution.\nExperimental results demonstrate that the proposed approach consistently\noutperforms current state-of-the-art methods across multiple benchmarks, with\nan improvement of up to 2.9% in AUROC and a reduction of up to 12.6% in FPR95.\nAdditionally, our method exhibits superior robustness against covariate shift\nacross different domains, further highlighting its effectiveness in real-world\nscenarios.\n","authors":["Pei-Kang Lee","Jun-Cheng Chen","Ja-Ling Wu"],"pdf_url":"https://arxiv.org/pdf/2501.05228v1.pdf","comment":"9 pages, 4 figures"},{"id":"http://arxiv.org/abs/2501.05226v1","updated":"2025-01-09T13:29:54Z","published":"2025-01-09T13:29:54Z","title":"Light Transport-aware Diffusion Posterior Sampling for Single-View\n Reconstruction of 3D Volumes","summary":" We introduce a single-view reconstruction technique of volumetric fields in\nwhich multiple light scattering effects are omnipresent, such as in clouds. We\nmodel the unknown distribution of volumetric fields using an unconditional\ndiffusion model trained on a novel benchmark dataset comprising 1,000\nsynthetically simulated volumetric density fields. The neural diffusion model\nis trained on the latent codes of a novel, diffusion-friendly, monoplanar\nrepresentation. The generative model is used to incorporate a tailored\nparametric diffusion posterior sampling technique into different reconstruction\ntasks. A physically-based differentiable volume renderer is employed to provide\ngradients with respect to light transport in the latent space. 
This stands in\ncontrast to classic NeRF approaches and makes the reconstructions better\naligned with observed data. Through various experiments, we demonstrate\nsingle-view reconstruction of volumetric clouds at a previously unattainable\nquality.\n","authors":["Ludwic Leonard","Nils Thuerey","Ruediger Westermann"],"pdf_url":"https://arxiv.org/pdf/2501.05226v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.07899v3","updated":"2025-01-09T13:01:55Z","published":"2024-11-12T16:12:51Z","title":"Rendering-Oriented 3D Point Cloud Attribute Compression using Sparse\n Tensor-based Transformer","summary":" The evolution of 3D visualization techniques has fundamentally transformed\nhow we interact with digital content. At the forefront of this change is point\ncloud technology, offering an immersive experience that surpasses traditional\n2D representations. However, the massive data size of point clouds presents\nsignificant challenges in data compression. Current methods for lossy point\ncloud attribute compression (PCAC) generally focus on reconstructing the\noriginal point clouds with minimal error. However, for point cloud\nvisualization scenarios, the reconstructed point clouds with distortion still\nneed to undergo a complex rendering process, which affects the final\nuser-perceived quality. In this paper, we propose an end-to-end deep learning\nframework that seamlessly integrates PCAC with differentiable rendering,\ndenoted as rendering-oriented PCAC (RO-PCAC), directly targeting the quality of\nrendered multiview images for viewing. In a differentiable manner, the impact\nof the rendering process on the reconstructed point clouds is taken into\naccount. Moreover, we characterize point clouds as sparse tensors and propose a\nsparse tensor-based transformer, called SP-Trans. 
By aligning with the local\ndensity of the point cloud and utilizing an enhanced local attention mechanism,\nSP-Trans captures the intricate relationships within the point cloud, further\nimproving feature analysis and synthesis within the framework. Extensive\nexperiments demonstrate that the proposed RO-PCAC achieves state-of-the-art\ncompression performance, compared to existing reconstruction-oriented methods,\nincluding traditional, learning-based, and hybrid methods.\n","authors":["Xiao Huo","Junhui Hou","Shuai Wan","Fuzheng Yang"],"pdf_url":"https://arxiv.org/pdf/2411.07899v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05209v1","updated":"2025-01-09T13:00:01Z","published":"2025-01-09T13:00:01Z","title":"MHAFF: Multi-Head Attention Feature Fusion of CNN and Transformer for\n Cattle Identification","summary":" Convolutional Neural Networks (CNNs) have drawn researchers' attention to\nidentifying cattle using muzzle images. However, CNNs often fail to capture\nlong-range dependencies within the complex patterns of the muzzle.\nTransformers handle these challenges. This inspired us to fuse the strengths of\nCNNs and transformers in muzzle-based cattle identification. Addition and\nconcatenation have been the most commonly used techniques for feature fusion.\nHowever, addition fails to preserve discriminative information, while\nconcatenation results in an increase in dimensionality. Both methods are simple\noperations and cannot discover the relationships or interactions between fusing\nfeatures. To overcome these issues, this research introduces a novel approach called Multi-Head\nAttention Feature Fusion (MHAFF) for the first time in cattle identification.\nMHAFF captures relations between the different types of fusing features while\npreserving their originality. 
The experiments show that MHAFF outperformed\naddition and concatenation techniques and the existing cattle identification\nmethods in accuracy on two publicly available cattle datasets. MHAFF\ndemonstrates excellent performance and quickly converges to achieve optimum\naccuracy of 99.88% and 99.52% in two cattle datasets simultaneously.\n","authors":["Rabin Dulal","Lihong Zheng","Muhammad Ashad Kabir"],"pdf_url":"https://arxiv.org/pdf/2501.05209v1.pdf","comment":"30 pages"},{"id":"http://arxiv.org/abs/2501.05205v1","updated":"2025-01-09T12:55:55Z","published":"2025-01-09T12:55:55Z","title":"Discovering Hidden Visual Concepts Beyond Linguistic Input in Infant\n Learning","summary":" Infants develop complex visual understanding rapidly, even preceding the\nacquisition of linguistic inputs. As computer vision seeks to replicate the\nhuman vision system, understanding infant visual development may offer valuable\ninsights. In this paper, we present an interdisciplinary study exploring this\nquestion: can a computational model that imitates the infant learning process\ndevelop broader visual concepts that extend beyond the vocabulary it has heard,\nsimilar to how infants naturally learn? To investigate this, we analyze a\nrecently published model in Science by Vong et al., which is trained on\nlongitudinal, egocentric images of a single child paired with transcribed\nparental speech. We introduce a training-free framework that can discover\nvisual concept neurons hidden in the model's internal representations. Our\nfindings show that these neurons can classify objects outside the model's original\nvocabulary. Furthermore, we compare the visual representations in infant-like\nmodels with those in modern computer vision models, such as CLIP or ImageNet\npre-trained models, highlighting key similarities and differences. 
Ultimately,\nour work bridges cognitive science and computer vision by analyzing the\ninternal representations of a computational model trained on an infant's visual\nand linguistic inputs.\n","authors":["Xueyi Ke","Satoshi Tsutsui","Yayun Zhang","Bihan Wen"],"pdf_url":"https://arxiv.org/pdf/2501.05205v1.pdf","comment":"12 pages, 11 figures"},{"id":"http://arxiv.org/abs/2408.11559v4","updated":"2025-01-09T12:45:39Z","published":"2024-08-21T12:13:18Z","title":"Semi-supervised 3D Semantic Scene Completion with 2D Vision Foundation\n Model Guidance","summary":" Accurate prediction of 3D semantic occupancy from 2D visual images is vital\nin enabling autonomous agents to comprehend their surroundings for planning and\nnavigation. State-of-the-art methods typically employ fully supervised\napproaches, necessitating a huge labeled dataset acquired through expensive\nLiDAR sensors and meticulous voxel-wise labeling by human annotators. The\nresource-intensive nature of this annotating process significantly hampers the\napplication and scalability of these methods. We introduce a novel\nsemi-supervised framework to alleviate the dependency on densely annotated\ndata. Our approach leverages 2D foundation models to generate essential 3D\nscene geometric and semantic cues, facilitating a more efficient training\nprocess. Our framework exhibits notable properties: (1) Generalizability,\napplicable to various 3D semantic scene completion approaches, including 2D-3D\nlifting and 3D-2D transformer methods. (2) Effectiveness, as demonstrated\nthrough experiments on SemanticKITTI and NYUv2, wherein our method achieves up\nto 85% of the fully-supervised performance using only 10% labeled data. 
This\napproach not only reduces the cost and labor associated with data annotation\nbut also demonstrates the potential for broader adoption in camera-based\nsystems for 3D semantic occupancy prediction.\n","authors":["Duc-Hai Pham","Duc-Dung Nguyen","Anh Pham","Tuan Ho","Phong Nguyen","Khoi Nguyen","Rang Nguyen"],"pdf_url":"https://arxiv.org/pdf/2408.11559v4.pdf","comment":"Accepted at AAAI2025. Project Page:\n https://vinairesearch.github.io/SemiSSC"},{"id":"http://arxiv.org/abs/2412.05557v2","updated":"2025-01-09T12:38:33Z","published":"2024-12-07T06:42:35Z","title":"CoE: Deep Coupled Embedding for Non-Rigid Point Cloud Correspondences","summary":" The interest in matching non-rigidly deformed shapes represented as raw point\nclouds is rising due to the proliferation of low-cost 3D sensors. Yet, the task\nis challenging since point clouds are irregular and there is a lack of\nintrinsic shape information. We propose to tackle these challenges by learning\na new shape representation -- a per-point high dimensional embedding, in an\nembedding space where semantically similar points share similar embeddings. The\nlearned embedding has multiple beneficial properties: it is aware of the\nunderlying shape geometry and is robust to shape deformations and various shape\nartefacts, such as noise and partiality. Consequently, this embedding can be\ndirectly employed to retrieve high-quality dense correspondences through a\nsimple nearest neighbor search in the embedding space. 
Extensive experiments\ndemonstrate new state-of-the-art results and robustness in numerous challenging\nnon-rigid shape matching benchmarks and show its great potential in other shape\nanalysis tasks, such as segmentation.\n","authors":["Huajian Zeng","Maolin Gao","Daniel Cremers"],"pdf_url":"https://arxiv.org/pdf/2412.05557v2.pdf","comment":"16 pages, 17 figures"},{"id":"http://arxiv.org/abs/2501.05195v1","updated":"2025-01-09T12:33:46Z","published":"2025-01-09T12:33:46Z","title":"HipyrNet: Hypernet-Guided Feature Pyramid network for mixed-exposure\n correction","summary":" Recent advancements in image translation for enhancing mixed-exposure images\nhave demonstrated the transformative potential of deep learning algorithms.\nHowever, addressing extreme exposure variations in images remains a significant\nchallenge due to the inherent complexity and contrast inconsistencies across\nregions. Current methods often struggle to adapt effectively to these\nvariations, resulting in suboptimal performance. In this work, we propose\nHipyrNet, a novel approach that integrates a HyperNetwork within a Laplacian\nPyramid-based framework to tackle the challenges of mixed-exposure image\nenhancement. The inclusion of a HyperNetwork allows the model to adapt to these\nexposure variations. A HyperNetwork dynamically generates weights for another\nnetwork, allowing dynamic changes during deployment. In our model, the\nHyperNetwork predicts optimal kernels for Feature Pyramid\ndecomposition, which enables a tailored and adaptive decomposition process for\neach input image. Our enhanced translational network incorporates multiscale\ndecomposition and reconstruction, leveraging dynamic kernel prediction to\ncapture and manipulate features across varying scales. 
Extensive experiments\ndemonstrate that HipyrNet outperforms existing methods, particularly in\nscenarios with extreme exposure variations, achieving superior results in both\nqualitative and quantitative evaluations. Our approach sets a new benchmark for\nmixed-exposure image enhancement, paving the way for future research in\nadaptive image translation.\n","authors":["Shaurya Singh Rathore","Aravind Shenoy","Krish Didwania","Aditya Kasliwal","Ujjwal Verma"],"pdf_url":"https://arxiv.org/pdf/2501.05195v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.17251v5","updated":"2025-01-09T12:28:55Z","published":"2024-11-26T09:29:27Z","title":"DGNN-YOLO: Interpretable Dynamic Graph Neural Networks with YOLO11 for\n Detecting and Tracking Small Occluded Objects in Urban Traffic","summary":" The detection and tracking of small, occluded objects such as pedestrians,\ncyclists, and motorbikes pose significant challenges for traffic surveillance\nsystems because of their erratic movement, frequent occlusion, and poor\nvisibility in dynamic urban environments. Traditional methods like YOLO11,\nwhile proficient in spatial feature extraction for precise detection, often\nstruggle with these small and dynamically moving objects, particularly in\nhandling real-time data updates and resource efficiency. This paper introduces\nDGNN-YOLO, a novel framework that integrates dynamic graph neural networks\n(DGNNs) with YOLO11 to address these limitations. Unlike standard GNNs, DGNNs\nare chosen for their superior ability to dynamically update graph structures in\nreal-time, which enables adaptive and robust tracking of objects in highly\nvariable urban traffic scenarios. This framework constructs and regularly\nupdates its graph representations, capturing objects as nodes and their\ninteractions as edges, thus effectively responding to rapidly changing\nconditions. 
Additionally, DGNN-YOLO incorporates Grad-CAM, Grad-CAM++, and\nEigen-CAM visualization techniques to enhance interpretability and foster\ntrust, offering insights into the model's decision-making process. Extensive\nexperiments validate the framework's performance, achieving a precision of\n0.8382, recall of 0.6875, and mAP@0.5:0.95 of 0.6476, significantly\noutperforming existing methods. This study offers a scalable and interpretable\nsolution for real-time traffic surveillance and significantly advances\nintelligent transportation systems' capabilities by addressing the critical\nchallenge of detecting and tracking small, occluded objects.\n","authors":["Shahriar Soudeep","M. F. Mridha","Md Abrar Jahin","Nilanjan Dey"],"pdf_url":"https://arxiv.org/pdf/2411.17251v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05179v1","updated":"2025-01-09T11:57:58Z","published":"2025-01-09T11:57:58Z","title":"Compression with Global Guidance: Towards Training-free High-Resolution\n MLLMs Acceleration","summary":" Multimodal large language models (MLLMs) have attracted considerable\nattention due to their exceptional performance in visual content understanding\nand reasoning. However, their inference efficiency has been a notable concern,\nas the increasing length of multimodal contexts leads to quadratic complexity.\nToken compression techniques, which reduce the number of visual tokens, have\ndemonstrated their effectiveness in reducing computational costs. Yet, these\napproaches have struggled to keep pace with the rapid advancements in MLLMs,\nespecially the AnyRes strategy in the context of high-resolution image\nunderstanding. In this paper, we propose a novel token compression method,\nGlobalCom$^2$, tailored for high-resolution MLLMs that receive both the\nthumbnail and multiple crops. 
GlobalCom$^2$ treats the tokens derived from the\nthumbnail as the ``commander'' of the entire token compression process,\ndirecting the allocation of retention ratios and the specific compression for\neach crop. In this way, redundant tokens are eliminated while important local\ndetails are adaptively preserved to the highest extent feasible. Empirical\nresults across 10 benchmarks reveal that GlobalCom$^2$ achieves an optimal\nbalance between performance and efficiency, and consistently outperforms\nstate-of-the-art token compression methods with LLaVA-NeXT-7B/13B models. Our\ncode is released at \\url{https://github.com/xuyang-liu16/GlobalCom2}.\n","authors":["Xuyang Liu","Ziming Wang","Yuhang Han","Yingyao Wang","Jiale Yuan","Jun Song","Bo Zheng","Linfeng Zhang","Siteng Huang","Honggang Chen"],"pdf_url":"https://arxiv.org/pdf/2501.05179v1.pdf","comment":"Our code is released at\n \\url{https://github.com/xuyang-liu16/GlobalCom2}"},{"id":"http://arxiv.org/abs/2501.05177v1","updated":"2025-01-09T11:52:54Z","published":"2025-01-09T11:52:54Z","title":"FaceMe: Robust Blind Face Restoration with Personal Identification","summary":" Blind face restoration is a highly ill-posed problem due to the lack of\nnecessary context. Although existing methods produce high-quality outputs, they\noften fail to faithfully preserve the individual's identity. In this paper, we\npropose a personalized face restoration method, FaceMe, based on a diffusion\nmodel. Given a single or a few reference images, we use an identity encoder to\nextract identity-related features, which serve as prompts to guide the\ndiffusion model in restoring high-quality and identity-consistent facial\nimages. By simply combining identity-related features, we effectively minimize\nthe impact of identity-irrelevant features during training and support any\nnumber of reference image inputs during inference. 
Additionally, thanks to the\nrobustness of the identity encoder, synthesized images can be used as reference\nimages during training, and identity changing during inference does not require\nfine-tuning the model. We also propose a pipeline for constructing a reference\nimage training pool that simulates the poses and expressions that may appear in\nreal-world scenarios. Experimental results demonstrate that our FaceMe can\nrestore high-quality facial images while maintaining identity consistency,\nachieving excellent performance and robustness.\n","authors":["Siyu Liu","Zheng-Peng Duan","Jia OuYang","Jiayi Fu","Hyunhee Park","Zikun Liu","Chun-Le Guo","Chongyi Li"],"pdf_url":"https://arxiv.org/pdf/2501.05177v1.pdf","comment":"To appear at AAAI 2025"},{"id":"http://arxiv.org/abs/2406.14080v3","updated":"2025-01-09T11:31:56Z","published":"2024-06-20T07:56:51Z","title":"CMTNet: Convolutional Meets Transformer Network for Hyperspectral Images\n Classification","summary":" Hyperspectral imaging (HSI) enables the detailed capture of spectral\ninformation from the Earth's surface, facilitating precise classification and\nidentification of surface crops due to its superior spectral diagnostic\ncapabilities. However, current convolutional neural networks (CNNs) focus on\nlocal features in hyperspectral data, leading to suboptimal performance when\nclassifying intricate crop types and addressing imbalanced sample\ndistributions. In contrast, the Transformer framework excels at extracting\nglobal features from hyperspectral imagery. To leverage the strengths of both\napproaches, this research introduces the Convolutional Meets Transformer Network\n(CMTNet). 
This innovative model includes a spectral-spatial feature extraction\nmodule for shallow feature capture, a dual-branch structure combining CNN and\nTransformer branches for local and global feature extraction, and a\nmulti-output constraint module that enhances classification accuracy through\nmulti-output loss calculations and cross constraints across local,\nglobal, and joint features. Extensive experiments conducted on three\ndatasets (WHU-Hi-LongKou, WHU-Hi-HanChuan, and WHU-Hi-HongHu) demonstrate that\nCMTNet significantly outperforms other state-of-the-art networks in\nclassification performance, validating its effectiveness in hyperspectral crop\nclassification.\n","authors":["Faxu Guo","Quan Feng","Sen Yang","Wanxia Yang"],"pdf_url":"https://arxiv.org/pdf/2406.14080v3.pdf","comment":"After submission, our research team underwent a significant shift in\n the project's focus and direction. As a result, the current manuscript no\n longer accurately reflects the revised scope or findings of our research. To\n prevent potential misinterpretations or misleading citations, we believe it\n is in the best interest of the academic community to withdraw this article"},{"id":"http://arxiv.org/abs/2403.14320v3","updated":"2025-01-09T10:59:37Z","published":"2024-03-21T11:41:39Z","title":"Exosense: A Vision-Based Scene Understanding System For Exoskeletons","summary":" Self-balancing exoskeletons are a key enabling technology for individuals\nwith mobility impairments. While the current challenges focus on\nhuman-compliant hardware and control, unlocking their use for daily activities\nrequires a scene perception system. In this work, we present Exosense, a\nvision-centric scene understanding system for self-balancing exoskeletons. We\nintroduce a multi-sensor visual-inertial mapping device as well as a navigation\nstack for state estimation, terrain mapping and long-term operation. 
We tested\nExosense attached to both a human leg and Wandercraft's Personal Exoskeleton in\nreal-world indoor scenarios. This enabled us to test the system during typical\nperiodic walking gaits, as well as future uses in multi-story environments. We\ndemonstrate that Exosense can achieve an odometry drift of about 4 cm per meter\ntraveled, and construct terrain maps under 1 cm average reconstruction error.\nIt can also work in a visual localization mode in a previously mapped\nenvironment, providing a step towards long-term operation of exoskeletons.\n","authors":["Jianeng Wang","Matias Mattamala","Christina Kassab","Guillaume Burger","Fabio Elnecave","Lintong Zhang","Marine Petriaux","Maurice Fallon"],"pdf_url":"https://arxiv.org/pdf/2403.14320v3.pdf","comment":"8 pages, 9 figures"},{"id":"http://arxiv.org/abs/2501.05147v1","updated":"2025-01-09T10:56:50Z","published":"2025-01-09T10:56:50Z","title":"A Systematic Literature Review on Deep Learning-based Depth Estimation\n in Computer Vision","summary":" Depth estimation (DE) provides spatial information about a scene and enables\ntasks such as 3D reconstruction, object detection, and scene understanding.\nRecently, there has been an increasing interest in using deep learning\n(DL)-based methods for DE. Traditional techniques rely on handcrafted features\nthat often struggle to generalise to diverse scenes and require extensive\nmanual tuning. However, DL models for DE can automatically extract relevant\nfeatures from input data, adapt to various scene conditions, and generalise\nwell to unseen environments. Numerous DL-based methods have been developed,\nmaking it necessary to survey and synthesize the state-of-the-art (SOTA).\nPrevious reviews on DE have mainly focused on either monocular or stereo-based\ntechniques, rather than comprehensively reviewing DE. Furthermore, to the best\nof our knowledge, there is no systematic literature review (SLR) that\ncomprehensively focuses on DE. 
Therefore, this SLR study is being conducted.\nInitially, electronic databases were searched for relevant publications,\nresulting in 1284 publications. Using defined exclusion and quality criteria,\n128 publications were shortlisted and further filtered to select 59\nhigh-quality primary studies. These studies were analysed to extract data and\nanswer defined research questions. Based on the results, DL methods were\ndeveloped for mainly three different types of DE: monocular, stereo, and\nmulti-view. 20 publicly available datasets were used to train, test, and\nevaluate DL models for DE, with KITTI, NYU Depth V2, and Make 3D being the most\nused datasets. 29 evaluation metrics were used to assess the performance of DE.\n35 base models were reported in the primary studies, and the top five most-used\nbase models were ResNet-50, ResNet-18, ResNet-101, U-Net, and VGG-16. Finally,\nthe lack of ground truth data was among the most significant challenges\nreported by primary studies.\n","authors":["Ali Rohan","Md Junayed Hasan","Andrei Petrovski"],"pdf_url":"https://arxiv.org/pdf/2501.05147v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.01486v3","updated":"2025-01-09T10:56:43Z","published":"2024-06-03T16:11:39Z","title":"Differentiable Task Graph Learning: Procedural Activity Representation\n and Online Mistake Detection from Egocentric Videos","summary":" Procedural activities are sequences of key-steps aimed at achieving specific\ngoals. They are crucial to build intelligent agents able to assist users\neffectively. In this context, task graphs have emerged as a\nhuman-understandable representation of procedural activities, encoding a\npartial ordering over the key-steps. 
While previous works generally relied on\nhand-crafted procedures to extract task graphs from videos, in this paper, we\npropose an approach based on direct maximum likelihood optimization of edges'\nweights, which allows gradient-based learning of task graphs and can be\nnaturally plugged into neural network architectures. Experiments on the\nCaptainCook4D dataset demonstrate the ability of our approach to predict\naccurate task graphs from the observation of action sequences, with an\nimprovement of +16.7% over previous approaches. Owing to the differentiability\nof the proposed framework, we also introduce a feature-based approach, aiming\nto predict task graphs from key-step textual or video embeddings, for which we\nobserve emerging video understanding abilities. Task graphs learned with our\napproach are also shown to significantly enhance online mistake detection in\nprocedural egocentric videos, achieving notable gains of +19.8% and +7.5% on\nthe Assembly101-O and EPIC-Tent-O datasets. Code for replicating experiments is\navailable at https://github.com/fpv-iplab/Differentiable-Task-Graph-Learning.\n","authors":["Luigi Seminara","Giovanni Maria Farinella","Antonino Furnari"],"pdf_url":"https://arxiv.org/pdf/2406.01486v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05132v1","updated":"2025-01-09T10:34:25Z","published":"2025-01-09T10:34:25Z","title":"CorrDiff: Adaptive Delay-aware Detector with Temporal Cue Inputs for\n Real-time Object Detection","summary":" Real-time object detection takes an essential part in the decision-making\nprocess of numerous real-world applications, including collision avoidance and\npath planning in autonomous driving systems. This paper presents a novel\nreal-time streaming perception method named CorrDiff, designed to tackle the\nchallenge of delays in real-time detection systems. 
The main contribution of\nCorrDiff lies in its adaptive delay-aware detector, which is able to utilize\nruntime-estimated temporal cues to predict objects' locations for multiple\nfuture frames, and selectively produce predictions that match real-world\ntime, effectively compensating for any communication and computational delays.\nThe proposed model outperforms current state-of-the-art methods by leveraging\nmotion estimation and feature enhancement, both for 1) single-frame detection\nfor the current frame or the next frame, in terms of the metric mAP, and 2) the\nprediction for (multiple) future frame(s), in terms of the metric sAP (The sAP\nmetric evaluates object detection algorithms in streaming scenarios,\nfactoring in both latency and accuracy). It demonstrates robust performance\nacross a range of devices, from the powerful Tesla V100 to the modest RTX 2080Ti,\nachieving the highest level of perceptual accuracy on all platforms. Unlike\nmost state-of-the-art methods that struggle to complete computation within a\nsingle frame on less powerful devices, CorrDiff meets the stringent real-time\nprocessing requirements on all kinds of devices. The experimental results\nemphasize the system's adaptability and its potential to significantly improve\nthe safety and reliability of many real-world systems, such as autonomous\ndriving. 
Our code is completely open-sourced and is available at\nhttps://anonymous.4open.science/r/CorrDiff.\n","authors":["Xiang Zhang","Chenchen Fu","Yufei Cui","Lan Yi","Yuyang Sun","Weiwei Wu","Xue Liu"],"pdf_url":"https://arxiv.org/pdf/2501.05132v1.pdf","comment":"Submitted to IEEE JSAC Special Issue: Intelligent Communications for\n Real-Time Computer Vision (Comm4CV)"},{"id":"http://arxiv.org/abs/2501.05131v1","updated":"2025-01-09T10:34:00Z","published":"2025-01-09T10:34:00Z","title":"3DIS-FLUX: simple and efficient multi-instance generation with DiT\n rendering","summary":" The growing demand for controllable outputs in text-to-image generation has\ndriven significant advancements in multi-instance generation (MIG), enabling\nusers to define both instance layouts and attributes. Currently, the\nstate-of-the-art methods in MIG are primarily adapter-based. However, these\nmethods necessitate retraining a new adapter each time a more advanced model is\nreleased, resulting in significant resource consumption. A methodology named\nDepth-Driven Decoupled Instance Synthesis (3DIS) has been introduced, which\ndecouples MIG into two distinct phases: 1) depth-based scene construction and\n2) detail rendering with widely pre-trained depth control models. The 3DIS\nmethod requires adapter training solely during the scene construction phase,\nwhile enabling various models to perform training-free detail rendering.\nInitially, 3DIS focused on rendering techniques utilizing U-Net architectures\nsuch as SD1.5, SD2, and SDXL, without exploring the potential of recent\nDiT-based models like FLUX. In this paper, we present 3DIS-FLUX, an extension\nof the 3DIS framework that integrates the FLUX model for enhanced rendering\ncapabilities. Specifically, we employ the FLUX.1-Depth-dev model for depth map\ncontrolled image generation and introduce a detail renderer that manipulates\nthe Attention Mask in FLUX's Joint Attention mechanism based on layout\ninformation. 
This approach allows for the precise rendering of fine-grained\nattributes of each instance. Our experimental results indicate that 3DIS-FLUX,\nleveraging the FLUX model, outperforms the original 3DIS method, which utilized\nSD2 and SDXL, and surpasses current state-of-the-art adapter-based methods in\nterms of both performance and image quality. Project Page:\nhttps://limuloo.github.io/3DIS/.\n","authors":["Dewei Zhou","Ji Xie","Zongxin Yang","Yi Yang"],"pdf_url":"https://arxiv.org/pdf/2501.05131v1.pdf","comment":"tech report"},{"id":"http://arxiv.org/abs/2501.05122v1","updated":"2025-01-09T10:26:14Z","published":"2025-01-09T10:26:14Z","title":"Centurio: On Drivers of Multilingual Ability of Large Vision-Language\n Model","summary":" Most Large Vision-Language Models (LVLMs) to date are trained predominantly\non English data, which makes them struggle to understand non-English input and\nfail to generate output in the desired target language. Existing efforts\nmitigate these issues by adding multilingual training data, but do so in a\nlargely ad-hoc manner, lacking insight into how different training mixes tip\nthe scale for different groups of languages. In this work, we present a\ncomprehensive investigation into the training strategies for massively\nmultilingual LVLMs. First, we conduct a series of multi-stage experiments\nspanning 13 downstream vision-language tasks and 43 languages, systematically\nexamining: (1) the number of training languages that can be included without\ndegrading English performance and (2) optimal language distributions of\npre-training as well as (3) instruction-tuning data. Further, we (4)\ninvestigate how to improve multilingual text-in-image understanding, and\nintroduce a new benchmark for the task. 
Surprisingly, our analysis reveals that\none can (i) include as many as 100 training languages simultaneously (ii) with\nas little as 25-50\\% of non-English data, to greatly improve multilingual\nperformance while retaining strong English performance. We further find that\n(iii) including non-English OCR data in pre-training and instruction-tuning is\nparamount for improving multilingual text-in-image understanding. Finally, we\nput all our findings together and train Centurio, a 100-language LVLM, offering\nstate-of-the-art performance in an evaluation covering 14 tasks and 56\nlanguages.\n","authors":["Gregor Geigle","Florian Schneider","Carolin Holtermann","Chris Biemann","Radu Timofte","Anne Lauscher","Goran Glavaš"],"pdf_url":"https://arxiv.org/pdf/2501.05122v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05120v1","updated":"2025-01-09T10:22:35Z","published":"2025-01-09T10:22:35Z","title":"Improving the U-Net Configuration for Automated Delineation of Head and\n Neck Cancer on MRI","summary":" Tumor volume segmentation on MRI is a challenging and time-consuming process\nthat is performed manually in typical clinical settings. This work presents an\napproach to automated delineation of head and neck tumors on MRI scans,\ndeveloped in the context of the MICCAI Head and Neck Tumor Segmentation for\nMR-Guided Applications (HNTS-MRG) 2024 Challenge. Rather than designing a new,\ntask-specific convolutional neural network, the focus of this research was to\npropose improvements to the configuration commonly used in medical segmentation\ntasks, relying solely on the traditional U-Net architecture. The empirical\nresults presented in this article suggest the superiority of patch-wise\nnormalization used for both training and sliding window inference. They also\nindicate that the performance of segmentation models can be enhanced by\napplying a scheduled data augmentation policy during training. 
Finally, it is\nshown that a small improvement in quality can be achieved by using Gaussian\nweighting to combine predictions for individual patches during sliding window\ninference. The model with the best configuration obtained an aggregated Dice\nSimilarity Coefficient (DSCagg) of 0.749 in Task 1 and 0.710 in Task 2 on five\ncross-validation folds. The ensemble of five models (one best model per\nvalidation fold) showed consistent results on a private test set of 50 patients\nwith a DSCagg of 0.752 in Task 1 and 0.718 in Task 2 (team name:\nandrei.iantsen). The source code and model weights are freely available at\nwww.github.com/iantsen/hntsmrg.\n","authors":["Andrei Iantsen"],"pdf_url":"https://arxiv.org/pdf/2501.05120v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05108v1","updated":"2025-01-09T09:56:33Z","published":"2025-01-09T09:56:33Z","title":"Optimizing Multitask Industrial Processes with Predictive Action\n Guidance","summary":" Monitoring complex assembly processes is critical for maintaining\nproductivity and ensuring compliance with assembly standards. However,\nvariability in human actions and subjective task preferences complicate\naccurate task anticipation and guidance. To address these challenges, we\nintroduce the Multi-Modal Transformer Fusion and Recurrent Units (MMTF-RU)\nNetwork for egocentric activity anticipation, utilizing multimodal fusion to\nimprove prediction accuracy. Integrated with the Operator Action Monitoring\nUnit (OAMU), the system provides proactive operator guidance, preventing\ndeviations in the assembly process. OAMU employs two strategies: (1) Top-5\nMMTF-RU predictions, combined with a reference graph and an action dictionary,\nfor next-step recommendations; and (2) Top-1 MMTF-RU predictions, integrated\nwith a reference graph, for detecting sequence deviations and predicting\nanomaly scores via an entropy-informed confidence mechanism. 
We also introduce\nTime-Weighted Sequence Accuracy (TWSA) to evaluate operator efficiency and\nensure timely task completion. Our approach is validated on the industrial\nMeccano dataset and the large-scale EPIC-Kitchens-55 dataset, demonstrating its\neffectiveness in dynamic environments.\n","authors":["Naval Kishore Mehta"," Arvind","Shyam Sunder Prasad","Sumeet Saurav","Sanjay Singh"],"pdf_url":"https://arxiv.org/pdf/2501.05108v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05098v1","updated":"2025-01-09T09:37:27Z","published":"2025-01-09T09:37:27Z","title":"Motion-X++: A Large-Scale Multimodal 3D Whole-body Human Motion Dataset","summary":" In this paper, we introduce Motion-X++, a large-scale multimodal 3D\nexpressive whole-body human motion dataset. Existing motion datasets\npredominantly capture body-only poses, lacking facial expressions, hand\ngestures, and fine-grained pose descriptions, and are typically limited to lab\nsettings with manually labeled text descriptions, thereby restricting their\nscalability. To address this issue, we develop a scalable annotation pipeline\nthat can automatically capture 3D whole-body human motion and comprehensive\ntextual labels from RGB videos and build the Motion-X dataset comprising 81.1K\ntext-motion pairs. Furthermore, we extend Motion-X into Motion-X++ by improving\nthe annotation pipeline, introducing more data modalities, and scaling up the\ndata quantities. Motion-X++ provides 19.5M 3D whole-body pose annotations\ncovering 120.5K motion sequences from massive scenes, 80.8K RGB videos, 45.3K\naudios, 19.5M frame-level whole-body pose descriptions, and 120.5K\nsequence-level semantic labels. 
Comprehensive experiments validate the accuracy\nof our annotation pipeline and highlight Motion-X++'s significant benefits for\ngenerating expressive, precise, and natural motion with paired multimodal\nlabels supporting several downstream tasks, including text-driven whole-body\nmotion generation, audio-driven motion generation, 3D whole-body human mesh\nrecovery, and 2D whole-body keypoints estimation, etc.\n","authors":["Yuhong Zhang","Jing Lin","Ailing Zeng","Guanlin Wu","Shunlin Lu","Yurong Fu","Yuanhao Cai","Ruimao Zhang","Haoqian Wang","Lei Zhang"],"pdf_url":"https://arxiv.org/pdf/2501.05098v1.pdf","comment":"17 pages, 14 figures, This work extends and enhances the research\n published in the NeurIPS 2023 paper, \"Motion-X: A Large-scale 3D Expressive\n Whole-body Human Motion Dataset\". arXiv admin note: substantial text overlap\n with arXiv:2307.00818"},{"id":"http://arxiv.org/abs/2501.05097v1","updated":"2025-01-09T09:25:22Z","published":"2025-01-09T09:25:22Z","title":"A 1Mb mixed-precision quantized encoder for image classification and\n patch-based compression","summary":" Even if Application-Specific Integrated Circuits (ASIC) have proven to be a\nrelevant choice for integrating inference at the edge, they are often limited\nin terms of applicability. In this paper, we demonstrate that an ASIC neural\nnetwork accelerator dedicated to image processing can be applied to multiple\ntasks of different levels: image classification and compression, while\nrequiring very limited hardware. The key component is a reconfigurable,\nmixed-precision (3b/2b/1b) encoder that takes advantage of proper weight and\nactivation quantizations combined with convolutional layer structural pruning\nto lower hardware-related constraints (memory and computing). We introduce an\nautomatic adaptation of linear symmetric quantizer scaling factors to perform\nquantized levels equalization, aiming at stabilizing quinary and ternary\nweights training. 
In addition, a proposed layer-shared Bit-Shift Normalization\nsignificantly simplifies the implementation of the hardware-expensive Batch\nNormalization. For a specific configuration in which the encoder design only\nrequires 1Mb, the classification accuracy reaches 87.5% on CIFAR-10. Besides,\nwe also show that this quantized encoder can be used to compress images\npatch-by-patch while the reconstruction can be performed remotely, by a dedicated\nfull-frame decoder. This solution typically enables an end-to-end compression\nalmost without any block artifacts, outperforming patch-based state-of-the-art\ntechniques employing a patch-constant bitrate.\n","authors":["Van Thien Nguyen","William Guicquero","Gilles Sicard"],"pdf_url":"https://arxiv.org/pdf/2501.05097v1.pdf","comment":"Published at IEEE Transactions on Circuits and Systems for Video\n Technology (TCSVT)"},{"id":"http://arxiv.org/abs/2501.05095v1","updated":"2025-01-09T09:21:09Z","published":"2025-01-09T09:21:09Z","title":"Advancing ALS Applications with Large-Scale Pre-training: Dataset\n Development and Downstream Assessment","summary":" The pre-training and fine-tuning paradigm has revolutionized satellite remote\nsensing applications. However, this approach remains largely underexplored for\nairborne laser scanning (ALS), an important technology for applications such as\nforest management and urban planning. In this study, we address this gap by\nconstructing a large-scale ALS point cloud dataset and evaluating its impact on\ndownstream applications. Our dataset comprises ALS point clouds collected\nacross the contiguous United States, provided by the United States Geological\nSurvey's 3D Elevation Program. To ensure efficient data collection while\ncapturing diverse land cover and terrain types, we introduce a geospatial\nsampling method that selects point cloud tiles based on land cover maps and\ndigital elevation models. 
As a baseline self-supervised learning model, we\nadopt BEV-MAE, a state-of-the-art masked autoencoder for 3D outdoor point\nclouds, and pre-train it on the constructed dataset. The pre-trained models are\nsubsequently fine-tuned for downstream tasks, including tree species\nclassification, terrain scene recognition, and point cloud semantic\nsegmentation. Our results show that the pre-trained models significantly\noutperform their scratch counterparts across all downstream tasks,\ndemonstrating the transferability of the representations learned from the\nproposed dataset. Furthermore, we observe that scaling the dataset using our\ngeospatial sampling method consistently enhances performance, whereas\npre-training on datasets constructed with random sampling fails to achieve\nsimilar improvements. These findings highlight the utility of the constructed\ndataset and the effectiveness of our sampling strategy in the pre-training and\nfine-tuning paradigm. The source code and pre-trained models will be made\npublicly available at \\url{https://github.com/martianxiu/ALS_pretraining}.\n","authors":["Haoyi Xiu","Xin Liu","Taehoon Kim","Kyoung-Sook Kim"],"pdf_url":"https://arxiv.org/pdf/2501.05095v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05091v1","updated":"2025-01-09T09:15:07Z","published":"2025-01-09T09:15:07Z","title":"ResPanDiff: Diffusion Model with Disentangled Modulations for Image\n Fusion","summary":" The implementation of diffusion-based pansharpening task is predominantly\nconstrained by its slow inference speed, which results from numerous sampling\nsteps. Despite the existing techniques aiming to accelerate sampling, they\noften compromise performance when fusing multi-source images. 
To ease this\nlimitation, we introduce a novel and efficient diffusion model named Diffusion\nModel for Pansharpening by Inferring Residual Inference (ResPanDiff), which\nsignificantly reduces the number of diffusion steps without sacrificing\nperformance on the pansharpening task. In ResPanDiff, we innovatively\npropose a Markov chain that transits from noisy residuals to the residuals\nbetween the LRMS and HRMS images, thereby reducing the number of sampling steps\nand enhancing performance. Additionally, we design the latent space to help the\nmodel extract more features at the encoding stage, Shallow\nCond-Injection~(SC-I) to help the model fetch cond-injected hidden features with\nhigher dimensions, and loss functions to give better guidance for the\nresidual generation task, enabling the model to achieve superior performance in\nresidual generation. Furthermore, experimental evaluations on pansharpening\ndatasets demonstrate that the proposed method achieves superior outcomes\ncompared to recent state-of-the-art~(SOTA) techniques, requiring only 15\nsampling steps, which reduces the number of steps by over $90\\%$ compared with the benchmark\ndiffusion models. Our experiments also include thorough discussions and\nablation studies to underscore the effectiveness of our approach.\n","authors":["Shiqi Cao","Liangjian Deng","Shangqi Deng"],"pdf_url":"https://arxiv.org/pdf/2501.05091v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.03700v2","updated":"2025-01-09T09:12:06Z","published":"2023-12-06T18:59:19Z","title":"OneLLM: One Framework to Align All Modalities with Language","summary":" Multimodal large language models (MLLMs) have gained significant attention\ndue to their strong multimodal understanding capability. However, existing\nworks rely heavily on modality-specific encoders, which usually differ in\narchitecture and are limited to common modalities. In this paper, we present\nOneLLM, an MLLM that aligns eight modalities to language using a unified\nframework. 
We achieve this through a unified multimodal encoder and a\nprogressive multimodal alignment pipeline. In detail, we first train an image\nprojection module to connect a vision encoder with LLM. Then, we build a\nuniversal projection module (UPM) by mixing multiple image projection modules\nand dynamic routing. Finally, we progressively align more modalities to LLM\nwith the UPM. To fully leverage the potential of OneLLM in following\ninstructions, we also curated a comprehensive multimodal instruction dataset,\nincluding 2M items from image, audio, video, point cloud, depth/normal map, IMU\nand fMRI brain activity. OneLLM is evaluated on 25 diverse benchmarks,\nencompassing tasks such as multimodal captioning, question answering and\nreasoning, where it delivers excellent performance. Code, data, model and\nonline demo are available at https://github.com/csuhan/OneLLM\n","authors":["Jiaming Han","Kaixiong Gong","Yiyuan Zhang","Jiaqi Wang","Kaipeng Zhang","Dahua Lin","Yu Qiao","Peng Gao","Xiangyu Yue"],"pdf_url":"https://arxiv.org/pdf/2312.03700v2.pdf","comment":"Accepted by CVPR 2024. Code: https://github.com/csuhan/OneLLM"},{"id":"http://arxiv.org/abs/2501.05085v1","updated":"2025-01-09T09:10:17Z","published":"2025-01-09T09:10:17Z","title":"End-to-End Deep Learning for Interior Tomography with Low-Dose X-ray CT","summary":" Objective: There exist several X-ray computed tomography (CT) scanning\nstrategies to reduce a radiation dose, such as (1) sparse-view CT, (2) low-dose\nCT, and (3) region-of-interest (ROI) CT (called interior tomography). To\nfurther reduce the dose, the sparse-view and/or low-dose CT settings can be\napplied together with interior tomography. Interior tomography has various\nadvantages in terms of reducing the number of detectors and decreasing the\nX-ray radiation dose. However, a large patient or small field-of-view (FOV)\ndetector can cause truncated projections, and then the reconstructed images\nsuffer from severe cupping artifacts. 
In addition, although low-dose CT can\nreduce the radiation exposure dose, analytic reconstruction algorithms produce\nimage noise. Recently, many researchers have utilized image-domain deep\nlearning (DL) approaches to remove each artifact and demonstrated impressive\nperformances, and the theory of deep convolutional framelets supports the\nreason for the performance improvement. Approach: In this paper, we found that\nimage-domain convolutional neural networks (CNNs) have difficulty resolving\ncoupled artifacts, based on deep convolutional framelets. Significance: To\naddress the coupled problem, we decouple it into two sub-problems: (i) image\ndomain noise reduction inside the truncated projection to solve the low-dose CT problem\nand (ii) extrapolation of the projection outside the truncated projection to solve the\nROI CT problem. The decoupled sub-problems are solved directly with a novel\nproposed end-to-end learning using dual-domain CNNs. Main results: We\ndemonstrate that the proposed method outperforms the conventional image-domain\ndeep learning methods, and a projection-domain CNN shows better performance\nthan the image-domain CNNs which are commonly used by many researchers.\n","authors":["Yoseob Han","Dufan Wu","Kyungsang Kim","Quanzheng Li"],"pdf_url":"https://arxiv.org/pdf/2501.05085v1.pdf","comment":"Published by Physics in Medicine & Biology (2022.5)"},{"id":"http://arxiv.org/abs/2501.02227v2","updated":"2025-01-09T08:59:41Z","published":"2025-01-04T08:25:32Z","title":"tCURLoRA: Tensor CUR Decomposition Based Low-Rank Parameter Adaptation\n and Its Application in Medical Image Segmentation","summary":" Transfer learning, by leveraging knowledge from pre-trained models, has\nsignificantly enhanced the performance of target tasks. However, as deep neural\nnetworks scale up, full fine-tuning introduces substantial computational and\nstorage challenges in resource-constrained environments, limiting its\nwidespread adoption. 
To address this, parameter-efficient fine-tuning (PEFT)\nmethods have been developed to reduce computational complexity and storage\nrequirements by minimizing the number of updated parameters. While matrix\ndecomposition-based PEFT methods, such as LoRA, show promise, they struggle to\nfully capture the high-dimensional structural characteristics of model weights.\nIn contrast, high-dimensional tensors offer a more natural representation of\nneural network weights, allowing for a more comprehensive capture of\nhigher-order features and multi-dimensional interactions. In this paper, we\npropose tCURLoRA, a novel fine-tuning method based on tensor CUR decomposition.\nBy concatenating pre-trained weight matrices into a three-dimensional tensor\nand applying tensor CUR decomposition, we update only the lower-order tensor\ncomponents during fine-tuning, effectively reducing computational and storage\noverhead. Experimental results demonstrate that tCURLoRA outperforms existing\nPEFT methods in medical image segmentation tasks.\n","authors":["Guanghua He","Wangang Cheng","Hancan Zhu","Xiaohao Cai","Gaohang Yu"],"pdf_url":"https://arxiv.org/pdf/2501.02227v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05076v1","updated":"2025-01-09T08:59:23Z","published":"2025-01-09T08:59:23Z","title":"TipSegNet: Fingertip Segmentation in Contactless Fingerprint Imaging","summary":" Contactless fingerprint recognition systems offer a hygienic, user-friendly,\nand efficient alternative to traditional contact-based methods. However, their\naccuracy heavily relies on precise fingertip detection and segmentation,\nparticularly under challenging background conditions. 
This paper introduces\nTipSegNet, a novel deep learning model that achieves state-of-the-art\nperformance in segmenting fingertips directly from grayscale hand images.\nTipSegNet leverages a ResNeXt-101 backbone for robust feature extraction,\ncombined with a Feature Pyramid Network (FPN) for multi-scale representation,\nenabling accurate segmentation across varying finger poses and image qualities.\nFurthermore, we employ an extensive data augmentation strategy to enhance the\nmodel's generalizability and robustness. TipSegNet outperforms existing\nmethods, achieving a mean Intersection over Union (mIoU) of 0.987 and an\naccuracy of 0.999, representing a significant advancement in contactless\nfingerprint segmentation. This enhanced accuracy has the potential to\nsubstantially improve the reliability and effectiveness of contactless\nbiometric systems in real-world applications.\n","authors":["Laurenz Ruzicka","Bernhard Kohn","Clemens Heitzinger"],"pdf_url":"https://arxiv.org/pdf/2501.05076v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05072v1","updated":"2025-01-09T08:54:19Z","published":"2025-01-09T08:54:19Z","title":"A Flexible and Scalable Framework for Video Moment Search","summary":" Video moment search, the process of finding relevant moments in a video\ncorpus to match a user's query, is crucial for various applications. Existing\nsolutions, however, often assume a single perfect matching moment, struggle\nwith inefficient inference, and have limitations with hour-long videos. This\npaper introduces a flexible and scalable framework for retrieving a ranked list\nof moments from a collection of videos of any length to match a text query, a\ntask termed Ranked Video Moment Retrieval (RVMR). Our framework, called\nSegment-Proposal-Ranking (SPR), simplifies the search process into three\nindependent stages: segment retrieval, proposal generation, and moment\nrefinement with re-ranking. 
Specifically, videos are divided into equal-length\nsegments with precomputed embeddings indexed offline, allowing efficient\nretrieval regardless of video length. For scalable online retrieval, both\nsegments and queries are projected into a shared feature space to enable\napproximate nearest neighbor (ANN) search. Retrieved segments are then merged\ninto coarse-grained moment proposals. Then a refinement and re-ranking module\nis designed to reorder and adjust timestamps of the coarse-grained proposals.\nEvaluations on the TVR-Ranking dataset demonstrate that our framework achieves\nstate-of-the-art performance with significant reductions in computational cost\nand processing time. The flexible design also allows for independent\nimprovements to each stage, making SPR highly adaptable for large-scale\napplications.\n","authors":["Chongzhi Zhang","Xizhou Zhu","Aixin Sun"],"pdf_url":"https://arxiv.org/pdf/2501.05072v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.19599v2","updated":"2025-01-09T08:49:40Z","published":"2024-09-29T07:32:14Z","title":"DATransNet: Dynamic Attention Transformer Network for Infrared Small\n Target Detection","summary":" Infrared small target detection (ISTD) is widely used in civilian and\nmilitary applications. However, ISTD encounters several challenges, including\nthe tendency for small and dim targets to be obscured by complex backgrounds. To\naddress this issue, we propose the Dynamic Attention Transformer Network\n(DATransNet), which aims to extract and preserve edge information of small\ntargets. DATransNet employs the Dynamic Attention Transformer (DATrans),\nsimulating central difference convolutions (CDC) to extract and integrate\ngradient features with deeper features. Furthermore, we propose a global feature\nextraction module (GFEM) that offers a comprehensive perspective to prevent the\nnetwork from focusing solely on details while neglecting the background\ninformation. 
We compare the network with state-of-the-art (SOTA) approaches,\nand the results demonstrate that our method performs effectively. Our source\ncode is available at https://github.com/greekinRoma/DATransNet.\n","authors":["Chen Hu","Yian Huang","Kexuan Li","Luping Zhang","Chang Long","Yiming Zhu","Tian Pu","Zhenming Peng"],"pdf_url":"https://arxiv.org/pdf/2409.19599v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05069v1","updated":"2025-01-09T08:44:42Z","published":"2025-01-09T08:44:42Z","title":"Commonsense Video Question Answering through Video-Grounded Entailment\n Tree Reasoning","summary":" This paper proposes the first video-grounded entailment tree reasoning method\nfor commonsense video question answering (VQA). Despite the remarkable progress\nof large visual-language models (VLMs), there are growing concerns that they\nlearn spurious correlations between videos and likely answers, reinforced by\ntheir black-box nature and remaining benchmarking biases. Our method explicitly\ngrounds VQA tasks to video fragments in four steps: entailment tree\nconstruction, video-language entailment verification, tree reasoning, and\ndynamic tree expansion. A vital benefit of the method is its generalizability\nto current video and image-based VLMs across reasoning types. To support fair\nevaluation, we devise a de-biasing procedure based on large-language models\nthat rewrites VQA benchmark answer sets to enforce model reasoning. Systematic\nexperiments on existing and de-biased benchmarks highlight the impact of our\nmethod components across benchmarks, VLMs, and reasoning types.\n","authors":["Huabin Liu","Filip Ilievski","Cees G. M. 
Snoek"],"pdf_url":"https://arxiv.org/pdf/2501.05069v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05067v1","updated":"2025-01-09T08:43:57Z","published":"2025-01-09T08:43:57Z","title":"LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion\n for Video Understanding","summary":" In this paper, we introduce LLaVA-Octopus, a novel video multimodal large\nlanguage model. LLaVA-Octopus adaptively weights features from different visual\nprojectors based on user instructions, enabling us to leverage the\ncomplementary strengths of each projector. We observe that different visual\nprojectors exhibit distinct characteristics when handling specific tasks. For\ninstance, some projectors excel at capturing static details, while others are\nmore effective at processing temporal information, and some are better suited\nfor tasks requiring temporal coherence. By dynamically adjusting feature\nweights according to user instructions, LLaVA-Octopus dynamically selects and\ncombines the most suitable features, significantly enhancing the model's\nperformance in multimodal tasks. Experimental results demonstrate that\nLLaVA-Octopus achieves excellent performance across multiple benchmarks,\nespecially in tasks such as multimodal understanding, visual question\nanswering, and video understanding, highlighting its broad application\npotential.\n","authors":["Jiaxing Zhao","Boyuan Sun","Xiang Chen","Xihan Wei","Qibin Hou"],"pdf_url":"https://arxiv.org/pdf/2501.05067v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05066v1","updated":"2025-01-09T08:43:09Z","published":"2025-01-09T08:43:09Z","title":"Improving Skeleton-based Action Recognition with Interactive Object\n Information","summary":" Human skeleton information is important in skeleton-based action recognition,\nwhich provides a simple and efficient way to describe human pose. 
However,\nexisting skeleton-based methods focus more on the skeleton, ignoring the\nobjects interacting with humans, resulting in poor performance in recognizing\nactions that involve object interactions. We propose a new action recognition\nframework introducing object nodes to supplement absent interactive object\ninformation. We also propose Spatial Temporal Variable Graph Convolutional\nNetworks (ST-VGCN) to effectively model the Variable Graph (VG) containing\nobject nodes. Specifically, to validate the role of interactive object\ninformation, we leverage a simple self-training approach to establish a new\ndataset, JXGC 24, and an extended dataset, NTU RGB+D+Object 60, including more\nthan 2 million additional object nodes. At the same time, we design the\nVariable Graph construction method to accommodate a variable number of nodes\nfor the graph structure. Additionally, we are the first to explore the overfitting\nissue introduced by incorporating additional object information, and we propose\na VG-based data augmentation method to address this issue, called Random Node\nAttack. Finally, regarding the network structure, we introduce two fusion\nmodules, CAF and WNPool, along with a novel Node Balance Loss, to enhance the\ncomprehensive performance by effectively fusing and balancing skeleton and\nobject node information. Our method surpasses the previous state-of-the-art on\nmultiple skeleton-based action recognition benchmarks. 
The accuracy of our\nmethod on NTU RGB+D 60 cross-subject split is 96.7\\%, and on cross-view split,\nit is 99.2\\%.\n","authors":["Hao Wen","Ziqian Lu","Fengli Shen","Zhe-Ming Lu","Jialin Cui"],"pdf_url":"https://arxiv.org/pdf/2501.05066v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.05427v2","updated":"2025-01-09T08:31:23Z","published":"2024-09-09T08:26:47Z","title":"TextToucher: Fine-Grained Text-to-Touch Generation","summary":" Tactile sensation plays a crucial role in the development of multi-modal\nlarge models and embodied intelligence. To collect tactile data at minimal\ncost, a series of studies have attempted to generate tactile images\nby vision-to-touch image translation. However, compared to the text modality,\nvisual modality-driven tactile generation cannot accurately depict human\ntactile sensation. In this work, we analyze the characteristics of tactile\nimages in detail from two granularities: object-level (tactile texture, tactile\nshape), and sensor-level (gel status). We model these granularities of\ninformation through text descriptions and propose a fine-grained Text-to-Touch\ngeneration method (TextToucher) to generate high-quality tactile samples.\nSpecifically, we introduce a multimodal large language model to build the text\nsentences about object-level tactile information and employ a set of learnable\ntext prompts to represent the sensor-level tactile information. To better guide\nthe tactile generation process with the built text information, we fuse the\ndual grains of text information and explore various dual-grain text\nconditioning methods within the diffusion transformer architecture.\nFurthermore, we propose a Contrastive Text-Touch Pre-training (CTTP) metric to\nprecisely evaluate the quality of text-driven generated tactile data. Extensive\nexperiments demonstrate the superiority of our TextToucher method. 
The source\ncode will be available at \\url{https://github.com/TtuHamg/TextToucher}.\n","authors":["Jiahang Tu","Hao Fu","Fengyu Yang","Hanbin Zhao","Chao Zhang","Hui Qian"],"pdf_url":"https://arxiv.org/pdf/2409.05427v2.pdf","comment":"This paper has been accepted by AAAI 2025"},{"id":"http://arxiv.org/abs/2501.03397v2","updated":"2025-01-09T08:28:11Z","published":"2025-01-06T21:34:52Z","title":"DoubleDiffusion: Combining Heat Diffusion with Denoising Diffusion for\n Generative Learning on 3D Meshes","summary":" This paper proposes DoubleDiffusion, a novel framework that combines heat\ndissipation diffusion and denoising diffusion for direct generative learning on\n3D mesh surfaces. Our approach addresses the challenges of generating\ncontinuous signal distributions residing on a curved manifold surface. Unlike\nprevious methods that rely on unrolling 3D meshes into 2D or adopting field\nrepresentations, DoubleDiffusion leverages the Laplace-Beltrami operator to\nprocess features respecting the mesh structure. This combination enables\neffective geometry-aware signal diffusion across the underlying geometry. As\nshown in Fig.1, we demonstrate that DoubleDiffusion has the ability to generate\nRGB signal distributions on complex 3D mesh surfaces and achieves per-category\nshape-conditioned texture generation across different shape geometries. 
Our work\ncontributes a new direction in diffusion-based generative modeling on 3D\nsurfaces, with potential applications in the field of 3D asset generation.\n","authors":["Xuyang Wang","Ziang Cheng","Zhenyu Li","Jiayu Yang","Haorui Ji","Pan Ji","Mehrtash Harandi","Richard Hartley","Hongdong Li"],"pdf_url":"https://arxiv.org/pdf/2501.03397v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.18571v3","updated":"2025-01-09T07:58:38Z","published":"2024-06-03T11:48:17Z","title":"UltraCortex: Submillimeter Ultra-High Field 9.4 T Brain MR Image\n Collection and Manual Cortical Segmentations","summary":" The UltraCortex repository (https://www.ultracortex.org) houses magnetic\nresonance imaging data of the human brain obtained at an ultra-high field\nstrength of 9.4 T. It contains 86 structural MR images with spatial resolutions\nranging from 0.6 to 0.8 mm. Additionally, the repository includes segmentations\nof 12 brains into gray and white matter compartments. These segmentations have\nbeen independently validated by two expert neuroradiologists, thus establishing\nthem as a reliable gold standard. This resource provides researchers with\naccess to high-quality brain imaging data and validated segmentations,\nfacilitating neuroimaging studies and advancing our understanding of brain\nstructure and function. 
Existing repositories do not accommodate field\nstrengths beyond 7 T, nor do they offer validated segmentations, underscoring\nthe significance of this new resource.\n","authors":["Lucas Mahler","Julius Steiglechner","Benjamin Bender","Tobias Lindig","Dana Ramadan","Jonas Bause","Florian Birk","Rahel Heule","Edyta Charyasz","Michael Erb","Vinod Jangir Kumar","Gisela E Hagberg","Pascal Martin","Gabriele Lohmann","Klaus Scheffler"],"pdf_url":"https://arxiv.org/pdf/2406.18571v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.10440v3","updated":"2025-01-09T07:58:20Z","published":"2024-11-15T18:58:31Z","title":"LLaVA-CoT: Let Vision Language Models Reason Step-by-Step","summary":" Large language models have demonstrated substantial advancements in reasoning\ncapabilities, particularly through inference-time scaling, as illustrated by\nmodels such as OpenAI's o1. However, current Vision-Language Models (VLMs)\noften struggle to perform systematic and structured reasoning, especially when\nhandling complex visual question-answering tasks. In this work, we introduce\nLLaVA-CoT, a novel VLM designed to conduct autonomous multistage reasoning.\nUnlike chain-of-thought prompting, LLaVA-CoT independently engages in\nsequential stages of summarization, visual interpretation, logical reasoning,\nand conclusion generation. This structured approach enables LLaVA-CoT to\nachieve marked improvements in precision on reasoning-intensive tasks. To\naccomplish this, we compile the LLaVA-CoT-100k dataset, integrating samples\nfrom various visual question answering sources and providing structured\nreasoning annotations. Besides, we propose an inference-time stage-level beam\nsearch method, which enables effective inference-time scaling. 
Remarkably, with\nonly 100k training samples and a simple yet effective inference-time scaling\nmethod, LLaVA-CoT not only outperforms its base model by 7.4% on a wide range\nof multimodal reasoning benchmarks, but also surpasses the performance of\nlarger and even closed-source models, such as Gemini-1.5-pro, GPT-4o-mini, and\nLlama-3.2-90B-Vision-Instruct.\n","authors":["Guowei Xu","Peng Jin","Hao Li","Yibing Song","Lichao Sun","Li Yuan"],"pdf_url":"https://arxiv.org/pdf/2411.10440v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05037v1","updated":"2025-01-09T07:51:14Z","published":"2025-01-09T07:51:14Z","title":"LongViTU: Instruction Tuning for Long-Form Video Understanding","summary":" This paper introduces LongViTU, a large-scale (~121k QA pairs, ~900h videos),\nautomatically generated dataset for long-form video understanding. We developed\na systematic approach that organizes videos into a hierarchical tree structure\nand incorporates self-revision mechanisms to ensure high-quality QA pairs. Each\nQA pair in LongViTU features: 1) long-term context (average certificate length\nof 4.6 minutes); 2) rich knowledge and condensed reasoning (commonsense,\ncausality, planning, etc.); and 3) explicit timestamp labels for relevant\nevents. LongViTU also serves as a benchmark for instruction following in\nlong-form and streaming video understanding. We evaluate the open-source\nstate-of-the-art long video understanding model, LongVU, and the commercial\nmodel, Gemini-1.5-Pro, on our benchmark. They achieve GPT-4 scores of 49.9 and\n52.3, respectively, underscoring the substantial challenge posed by our\nbenchmark. Further supervised fine-tuning (SFT) on LongVU led to performance\nimprovements of 12.0% on our benchmark, 2.2% on the in-distribution (ID)\nbenchmark EgoSchema, 1.0%, 2.2% and 1.2% on the out-of-distribution (OOD)\nbenchmarks VideoMME (Long), WorldQA and OpenEQA, respectively. 
These outcomes\ndemonstrate LongViTU's high data quality and robust OOD generalizability.\n","authors":["Rujie Wu","Xiaojian Ma","Hai Ci","Yue Fan","Yuxuan Wang","Haozhe Zhao","Qing Li","Yizhou Wang"],"pdf_url":"https://arxiv.org/pdf/2501.05037v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05034v1","updated":"2025-01-09T07:49:37Z","published":"2025-01-09T07:49:37Z","title":"Towards Fingerprint Mosaicking Artifact Detection: A Self-Supervised\n Deep Learning Approach","summary":" Fingerprint mosaicking, which is the process of combining multiple\nfingerprint images into a single master fingerprint, is an essential process in\nmodern biometric systems. However, it is prone to errors that can significantly\ndegrade fingerprint image quality. This paper proposes a novel deep\nlearning-based approach to detect and score mosaicking artifacts in fingerprint\nimages. Our method leverages a self-supervised learning framework to train a\nmodel on large-scale unlabeled fingerprint data, eliminating the need for\nmanual artifact annotation. The proposed model effectively identifies\nmosaicking errors, achieving high accuracy on various fingerprint modalities,\nincluding contactless, rolled, and pressed fingerprints and furthermore proves\nto be robust to different data sources. Additionally, we introduce a novel\nmosaicking artifact score to quantify the severity of errors, enabling\nautomated evaluation of fingerprint images. 
By addressing the challenges of\nmosaicking artifact detection, our work contributes to improving the accuracy\nand reliability of fingerprint-based biometric systems.\n","authors":["Laurenz Ruzicka","Alexander Spenke","Stephan Bergmann","Gerd Nolden","Bernhard Kohn","Clemens Heitzinger"],"pdf_url":"https://arxiv.org/pdf/2501.05034v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05031v1","updated":"2025-01-09T07:43:49Z","published":"2025-01-09T07:43:49Z","title":"ECBench: Can Multi-modal Foundation Models Understand the Egocentric\n World? A Holistic Embodied Cognition Benchmark","summary":" The enhancement of generalization in robots by large vision-language models\n(LVLMs) is increasingly evident. Therefore, the embodied cognitive abilities of\nLVLMs based on egocentric videos are of great interest. However, current\ndatasets for embodied video question answering lack comprehensive and\nsystematic evaluation frameworks. Critical embodied cognitive issues, such as\nrobotic self-cognition, dynamic scene perception, and hallucination, are rarely\naddressed. To tackle these challenges, we propose ECBench, a high-quality\nbenchmark designed to systematically evaluate the embodied cognitive abilities\nof LVLMs. ECBench features a diverse range of scene video sources, open and\nvaried question formats, and 30 dimensions of embodied cognition. To ensure\nquality, balance, and high visual dependence, ECBench uses class-independent\nmeticulous human annotation and multi-round question screening strategies.\nAdditionally, we introduce ECEval, a comprehensive evaluation system that\nensures the fairness and rationality of the indicators. Utilizing ECBench, we\nconduct extensive evaluations of proprietary, open-source, and task-specific\nLVLMs. ECBench is pivotal in advancing the embodied cognitive capabilities of\nLVLMs, laying a solid foundation for developing reliable core models for\nembodied agents. 
All data and code are available at\nhttps://github.com/Rh-Dang/ECBench.\n","authors":["Ronghao Dang","Yuqian Yuan","Wenqi Zhang","Yifei Xin","Boqiang Zhang","Long Li","Liuyi Wang","Qinyang Zeng","Xin Li","Lidong Bing"],"pdf_url":"https://arxiv.org/pdf/2501.05031v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.01973v3","updated":"2025-01-09T07:26:05Z","published":"2024-12-28T02:28:19Z","title":"INFELM: In-depth Fairness Evaluation of Large Text-To-Image Models","summary":" The rapid development of large language models (LLMs) and large vision models\n(LVMs) has propelled the evolution of multi-modal AI systems, which have\ndemonstrated the remarkable potential for industrial applications by emulating\nhuman-like cognition. However, they also pose significant ethical challenges,\nincluding amplifying harmful content and reinforcing societal biases. For\ninstance, biases in some industrial image generation models highlighted the\nurgent need for robust fairness assessments. Most existing evaluation\nframeworks focus on the comprehensiveness of various aspects of the models, but\nthey exhibit critical limitations, including insufficient attention to content\ngeneration alignment and social bias-sensitive domains. More importantly, their\nreliance on pixel-detection techniques is prone to inaccuracies.\n To address these issues, this paper presents INFELM, an in-depth fairness\nevaluation on widely-used text-to-image models. Our key contributions are: (1)\nan advanced skintone classifier incorporating facial topology and refined skin\npixel representation to enhance classification precision by at least 16.04%,\n(2) a bias-sensitive content alignment measurement for understanding societal\nimpacts, (3) a generalizable representation bias evaluation for diverse\ndemographic groups, and (4) extensive experiments analyzing large-scale\ntext-to-image model outputs across six social-bias-sensitive domains. 
We find\nthat existing models in the study generally do not meet the empirical fairness\ncriteria, and representation bias is generally more pronounced than alignment\nerrors. INFELM establishes a robust benchmark for fairness assessment,\nsupporting the development of multi-modal AI systems that align with ethical\nand human-centric principles.\n","authors":["Di Jin","Xing Liu","Yu Liu","Jia Qing Yap","Andrea Wong","Adriana Crespo","Qi Lin","Zhiyuan Yin","Qiang Yan","Ryan Ye"],"pdf_url":"https://arxiv.org/pdf/2501.01973v3.pdf","comment":"Di Jin and Xing Liu contributed equally to this work"},{"id":"http://arxiv.org/abs/2409.06710v2","updated":"2025-01-09T07:24:09Z","published":"2024-08-25T07:55:06Z","title":"McGrids: Monte Carlo-Driven Adaptive Grids for Iso-Surface Extraction","summary":" Iso-surface extraction from an implicit field is a fundamental process in\nvarious applications of computer vision and graphics. When dealing with\ngeometric shapes with complicated geometric details, many existing algorithms\nsuffer from high computational costs and memory usage. This paper proposes\nMcGrids, a novel approach to improve the efficiency of iso-surface extraction.\nThe key idea is to construct adaptive grids for iso-surface extraction rather\nthan using a simple uniform grid as prior art does. Specifically, we formulate\nthe problem of constructing adaptive grids as a probability sampling problem,\nwhich is then solved by Monte Carlo process. We demonstrate McGrids' capability\nwith extensive experiments from both analytical SDFs computed from surface\nmeshes and learned implicit fields from real multiview images. 
The experiment\nresults show that our McGrids can significantly reduce the number of implicit\nfield queries, resulting in significant memory reduction, while producing\nhigh-quality meshes with rich geometric details.\n","authors":["Daxuan Ren","Hezi Shi","Jianmin Zheng","Jianfei Cai"],"pdf_url":"https://arxiv.org/pdf/2409.06710v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05020v1","updated":"2025-01-09T07:23:48Z","published":"2025-01-09T07:23:48Z","title":"Perception-as-Control: Fine-grained Controllable Image Animation with\n 3D-aware Motion Representation","summary":" Motion-controllable image animation is a fundamental task with a wide range\nof potential applications. Recent works have made progress in controlling\ncamera or object motion via various motion representations, while they still\nstruggle to support collaborative camera and object motion control with\nadaptive control granularity. To this end, we introduce 3D-aware motion\nrepresentation and propose an image animation framework, called\nPerception-as-Control, to achieve fine-grained collaborative motion control.\nSpecifically, we construct 3D-aware motion representation from a reference\nimage, manipulate it based on interpreted user intentions, and perceive it from\ndifferent viewpoints. In this way, camera and object motions are transformed\ninto intuitive, consistent visual changes. Then, the proposed framework\nleverages the perception results as motion control signals, enabling it to\nsupport various motion-related video synthesis tasks in a unified and flexible\nway. Experiments demonstrate the superiority of the proposed framework. 
For\nmore details and qualitative results, please refer to our project webpage:\nhttps://chen-yingjie.github.io/projects/Perception-as-Control.\n","authors":["Yingjie Chen","Yifang Men","Yuan Yao","Miaomiao Cui","Liefeng Bo"],"pdf_url":"https://arxiv.org/pdf/2501.05020v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05017v1","updated":"2025-01-09T07:18:48Z","published":"2025-01-09T07:18:48Z","title":"Continuous Knowledge-Preserving Decomposition for Few-Shot Continual\n Learning","summary":" Few-shot class-incremental learning (FSCIL) involves learning new classes\nfrom limited data while retaining prior knowledge, and often results in\ncatastrophic forgetting. Existing methods either freeze backbone networks to\npreserve knowledge, which limits adaptability, or rely on additional modules or\nprompts, introducing inference overhead. To this end, we propose Continuous\nKnowledge-Preserving Decomposition for FSCIL (CKPD-FSCIL), a framework that\ndecomposes a model's weights into two parts: one that compacts existing\nknowledge (knowledge-sensitive components) and another that carries redundant\ncapacity to accommodate new abilities (redundant-capacity components). The\ndecomposition is guided by a covariance matrix from replay samples, ensuring\nprincipal components align with classification abilities. During adaptation, we\nfreeze the knowledge-sensitive components and only adapt the redundant-capacity\ncomponents, fostering plasticity while minimizing interference without changing\nthe architecture or increasing overhead. Additionally, CKPD introduces an\nadaptive layer selection strategy to identify layers with redundant capacity,\ndynamically allocating adapters. Experiments on multiple benchmarks show that\nCKPD-FSCIL outperforms state-of-the-art methods.\n","authors":["Xiaojie Li","Yibo Yang","Jianlong Wu","David A. 
Clifton","Yue Yu","Bernard Ghanem","Min Zhang"],"pdf_url":"https://arxiv.org/pdf/2501.05017v1.pdf","comment":"Code: https://github.com/xiaojieli0903/CKPD-FSCIL"},{"id":"http://arxiv.org/abs/2501.05009v1","updated":"2025-01-09T07:11:51Z","published":"2025-01-09T07:11:51Z","title":"A Scalable System for Visual Analysis of Ocean Data","summary":" Oceanographers rely on visual analysis to interpret model simulations,\nidentify events and phenomena, and track dynamic ocean processes. The ever\nincreasing resolution and complexity of ocean data due to its dynamic nature\nand multivariate relationships demands a scalable and adaptable visualization\ntool for interactive exploration. We introduce pyParaOcean, a scalable and\ninteractive visualization system designed specifically for ocean data analysis.\npyParaOcean offers specialized modules for common oceanographic analysis tasks,\nincluding eddy identification and salinity movement tracking. These modules\nseamlessly integrate with ParaView as filters, ensuring a user-friendly and\neasy-to-use system while leveraging the parallelization capabilities of\nParaView and a plethora of inbuilt general-purpose visualization\nfunctionalities. The creation of an auxiliary dataset stored as a Cinema\ndatabase helps address I/O and network bandwidth bottlenecks while supporting\nthe generation of quick overview visualizations. We present a case study on the\nBay of Bengal (BoB) to demonstrate the utility of the system and scaling\nstudies to evaluate the efficiency of the system.\n","authors":["Toshit Jain","Upkar Singh","Varun Singh","Vijay Kumar Boda","Ingrid Hotz","Sathish S. Vadhiyar","P. N. 
Vinayachandran","Vijay Natarajan"],"pdf_url":"https://arxiv.org/pdf/2501.05009v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04996v1","updated":"2025-01-09T06:22:50Z","published":"2025-01-09T06:22:50Z","title":"A CT Image Classification Network Framework for Lung Tumors Based on\n Pre-trained MobileNetV2 Model and Transfer learning, And Its Application and\n Market Analysis in the Medical field","summary":" In the medical field, accurate diagnosis of lung cancer is crucial for\ntreatment. Traditional manual analysis methods have significant limitations in\nterms of accuracy and efficiency. To address this issue, this paper proposes a\ndeep learning network framework based on the pre-trained MobileNetV2 model,\ninitialized with weights from the ImageNet-1K dataset (version 2). The last\nlayer of the model (the fully connected layer) is replaced with a new fully\nconnected layer, and a softmax activation function is added to efficiently\nclassify three types of lung cancer CT scan images. Experimental results show\nthat the model achieves an accuracy of 99.6% on the test set, with significant\nimprovements in feature extraction compared to traditional models.With the\nrapid development of artificial intelligence technologies, deep learning\napplications in medical image processing are bringing revolutionary changes to\nthe healthcare industry. AI-based lung cancer detection systems can\nsignificantly improve diagnostic efficiency, reduce the workload of doctors,\nand occupy an important position in the global healthcare market. 
The potential\nof AI to improve diagnostic accuracy, reduce medical costs, and promote\nprecision medicine will have a profound impact on the future development of the\nhealthcare industry.\n","authors":["Ziyang Gao","Yong Tian","Shih-Chi Lin","Junghua Lin"],"pdf_url":"https://arxiv.org/pdf/2501.04996v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04995v1","updated":"2025-01-09T06:20:00Z","published":"2025-01-09T06:20:00Z","title":"IPDN: Image-enhanced Prompt Decoding Network for 3D Referring Expression\n Segmentation","summary":" 3D Referring Expression Segmentation (3D-RES) aims to segment point cloud\nscenes based on a given expression. However, existing 3D-RES approaches face\ntwo major challenges: feature ambiguity and intent ambiguity. Feature ambiguity\narises from information loss or distortion during point cloud acquisition due\nto limitations such as lighting and viewpoint. Intent ambiguity refers to the\nmodel's equal treatment of all queries during the decoding process, lacking\ntop-down task-specific guidance. In this paper, we introduce an Image enhanced\nPrompt Decoding Network (IPDN), which leverages multi-view images and\ntask-driven information to enhance the model's reasoning capabilities. To\naddress feature ambiguity, we propose the Multi-view Semantic Embedding (MSE)\nmodule, which injects multi-view 2D image information into the 3D scene and\ncompensates for potential spatial information loss. To tackle intent ambiguity,\nwe designed a Prompt-Aware Decoder (PAD) that guides the decoding process by\nderiving task-driven signals from the interaction between the expression and\nvisual features. 
Comprehensive experiments demonstrate that IPDN outperforms\nthe state-of-the-art by 1.9 and 4.2 points in mIoU metrics on the 3D-RES and\n3D-GRES tasks, respectively.\n","authors":["Qi Chen","Changli Wu","Jiayi Ji","Yiwei Ma","Danni Yang","Xiaoshuai Sun"],"pdf_url":"https://arxiv.org/pdf/2501.04995v1.pdf","comment":"AAAI 2025"},{"id":"http://arxiv.org/abs/2501.02260v2","updated":"2025-01-09T06:14:09Z","published":"2025-01-04T11:28:49Z","title":"MagicFace: High-Fidelity Facial Expression Editing with Action-Unit\n Control","summary":" We address the problem of facial expression editing by controlling the\nrelative variation of facial action-unit (AU) from the same person. This\nenables us to edit this specific person's expression in a fine-grained,\ncontinuous and interpretable manner, while preserving their identity, pose,\nbackground and detailed facial attributes. Key to our model, which we dub\nMagicFace, is a diffusion model conditioned on AU variations and an ID encoder\nto preserve facial details of high consistency. Specifically, to preserve the\nfacial details with the input identity, we leverage the power of pretrained\nStable-Diffusion models and design an ID encoder to merge appearance features\nthrough self-attention. To keep background and pose consistency, we introduce\nan efficient Attribute Controller by explicitly informing the model of current\nbackground and pose of the target. By injecting AU variations into a denoising\nUNet, our model can animate arbitrary identities with various AU combinations,\nyielding superior results in high-fidelity expression editing compared to other\nfacial expression editing works. 
Code is publicly available at\nhttps://github.com/weimengting/MagicFace.\n","authors":["Mengting Wei","Tuomas Varanka","Xingxun Jiang","Huai-Qian Khor","Guoying Zhao"],"pdf_url":"https://arxiv.org/pdf/2501.02260v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04975v1","updated":"2025-01-09T05:12:38Z","published":"2025-01-09T05:12:38Z","title":"V2C-CBM: Building Concept Bottlenecks with Vision-to-Concept Tokenizer","summary":" Concept Bottleneck Models (CBMs) offer inherent interpretability by initially\ntranslating images into human-comprehensible concepts, followed by a linear\ncombination of these concepts for classification. However, the annotation of\nconcepts for visual recognition tasks requires extensive expert knowledge and\nlabor, constraining the broad adoption of CBMs. Recent approaches have\nleveraged the knowledge of large language models to construct concept\nbottlenecks, with multimodal models like CLIP subsequently mapping image\nfeatures into the concept feature space for classification. Despite this, the\nconcepts produced by language models can be verbose and may introduce\nnon-visual attributes, which hurts accuracy and interpretability. In this\nstudy, we investigate how to avoid these issues by constructing CBMs directly from\nmultimodal models. To this end, we adopt common words as a base concept\nvocabulary and leverage auxiliary unlabeled images to construct a\nVision-to-Concept (V2C) tokenizer that can explicitly quantize images into\ntheir most relevant visual concepts, thus creating a vision-oriented concept\nbottleneck tightly coupled with the multimodal model. This leads to our V2C-CBM,\nwhich is training-efficient and interpretable with high accuracy. 
Our V2C-CBM\nhas matched or outperformed LLM-supervised CBMs on various visual\nclassification benchmarks, validating the efficacy of our approach.\n","authors":["Hangzhou He","Lei Zhu","Xinliang Zhang","Shuang Zeng","Qian Chen","Yanye Lu"],"pdf_url":"https://arxiv.org/pdf/2501.04975v1.pdf","comment":"Accepted by AAAI2025"},{"id":"http://arxiv.org/abs/2410.10777v2","updated":"2025-01-09T04:57:35Z","published":"2024-10-14T17:49:27Z","title":"UniMatch V2: Pushing the Limit of Semi-Supervised Semantic Segmentation","summary":" Semi-supervised semantic segmentation (SSS) aims at learning rich visual\nknowledge from cheap unlabeled images to enhance semantic segmentation\ncapability. Among recent works, UniMatch improves its precedents tremendously\nby amplifying the practice of weak-to-strong consistency regularization.\nSubsequent works typically follow similar pipelines and propose various\ndelicate designs. Despite the achieved progress, strangely, even in this\nflourishing era of numerous powerful vision models, almost all SSS works are\nstill sticking to 1) using outdated ResNet encoders with small-scale\nImageNet-1K pre-training, and 2) evaluation on simple Pascal and Cityscapes\ndatasets. In this work, we argue that, it is necessary to switch the baseline\nof SSS from ResNet-based encoders to more capable ViT-based encoders (e.g.,\nDINOv2) that are pre-trained on massive data. A simple update on the encoder\n(even using 2x fewer parameters) can bring more significant improvement than\ncareful method designs. Built on this competitive baseline, we present our\nupgraded and simplified UniMatch V2, inheriting the core spirit of\nweak-to-strong consistency from V1, but requiring less training cost and\nproviding consistently better results. Additionally, witnessing the gradually\nsaturated performance on Pascal and Cityscapes, we appeal that we should focus\non more challenging benchmarks with complex taxonomy, such as ADE20K and COCO\ndatasets. 
Code, models, and logs of all reported values are available at\nhttps://github.com/LiheYoung/UniMatch-V2.\n","authors":["Lihe Yang","Zhen Zhao","Hengshuang Zhao"],"pdf_url":"https://arxiv.org/pdf/2410.10777v2.pdf","comment":"Accepted by TPAMI"},{"id":"http://arxiv.org/abs/2501.02795v2","updated":"2025-01-09T04:50:16Z","published":"2025-01-06T06:29:55Z","title":"InfiFusion: A Unified Framework for Enhanced Cross-Model Reasoning via\n LLM Fusion","summary":" Large Language Models (LLMs) have demonstrated strong performance across\nvarious reasoning tasks, yet building a single model that consistently excels\nacross all domains remains challenging. This paper addresses this problem by\nexploring strategies to integrate multiple domain-specialized models into an\nefficient pivot model. We propose two fusion strategies to combine the strengths\nof multiple LLMs: (1) a pairwise, multi-step fusion approach that sequentially\ndistills each source model into the pivot model, followed by a weight merging\nstep to integrate the distilled models into the final model. 
This method\nachieves strong performance but requires substantial training effort; and (2) a\nunified fusion approach that aggregates all source models' outputs\nsimultaneously. To improve the fusion process, we introduce a novel\nRate-Skewness Adaptive Fusion (RSAF) technique, which dynamically adjusts top-K\nratios during parameter merging for enhanced flexibility and\nstability. Furthermore, we propose an uncertainty-based weighting method for the\nunified approach, which dynamically balances the contributions of source models\nand outperforms other logits/distribution ensemble methods. We achieved accuracy\nimprovements of 9.27%, 8.80%, and 8.89% on the GSM8K, MATH, and HumanEval\ntasks, respectively.\n","authors":["Zhaoyi Yan","Zhijie Sang","Yiming Zhang","Yuhao Fu","Baoyi He","Qi Zhou","Yining Di","Chunlin Ji","Shengyu Zhang","Fei Wu","Hongxia Yang"],"pdf_url":"https://arxiv.org/pdf/2501.02795v2.pdf","comment":"Under review"},{"id":"http://arxiv.org/abs/2501.04969v1","updated":"2025-01-09T04:47:51Z","published":"2025-01-09T04:47:51Z","title":"AD-L-JEPA: Self-Supervised Spatial World Models with Joint Embedding\n Predictive Architecture for Autonomous Driving with LiDAR Data","summary":" As opposed to human drivers, current autonomous driving systems still require\nvast amounts of labeled data to train. Recently, world models have been\nproposed to simultaneously enhance autonomous driving capabilities by improving\nthe way these systems understand complex real-world environments and reduce\ntheir data demands via self-supervised pre-training. In this paper, we present\nAD-L-JEPA (aka Autonomous Driving with LiDAR data via a Joint Embedding\nPredictive Architecture), a novel self-supervised pre-training framework for\nautonomous driving with LiDAR data that, as opposed to existing methods, is\nneither generative nor contrastive. Our method learns spatial world models with\na joint embedding predictive architecture. 
Instead of explicitly generating\nmasked unknown regions, our self-supervised world models predict Bird's Eye\nView (BEV) embeddings to represent the diverse nature of autonomous driving\nscenes. Our approach furthermore eliminates the need to manually create\npositive and negative pairs, as is the case in contrastive learning. AD-L-JEPA\nleads to simpler implementation and enhanced learned representations. We\nqualitatively and quantitatively demonstrate the high quality of the embeddings\nlearned with AD-L-JEPA. We furthermore evaluate the accuracy and label efficiency of\nAD-L-JEPA on popular downstream tasks such as LiDAR 3D object detection and\nassociated transfer learning. Our experimental evaluation demonstrates that\nAD-L-JEPA is a plausible approach for self-supervised pre-training in\nautonomous driving applications and outperforms state-of-the-art methods, including\nthe recently proposed Occupancy-MAE [1] and ALSO\n[2]. The source code of AD-L-JEPA is available at\nhttps://github.com/HaoranZhuExplorer/AD-L-JEPA-Release.\n","authors":["Haoran Zhu","Zhenyuan Dong","Kristi Topollai","Anna Choromanska"],"pdf_url":"https://arxiv.org/pdf/2501.04969v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04966v1","updated":"2025-01-09T04:37:31Z","published":"2025-01-09T04:37:31Z","title":"Emergence of Painting Ability via Recognition-Driven Evolution","summary":" From Paleolithic cave paintings to Impressionism, human painting has evolved\nto depict increasingly complex and detailed scenes, conveying more nuanced\nmessages. This paper attempts to induce the emergence of this artistic capability by simulating\nthe evolutionary pressures that enhance visual communication efficiency.\nSpecifically, we present a model with a stroke branch and a palette branch that\ntogether simulate human-like painting. 
The palette branch learns a limited\ncolour palette, while the stroke branch parameterises each stroke using\nB\\'ezier curves to render an image, subsequently evaluated by a high-level\nrecognition module. We quantify the efficiency of visual communication by\nmeasuring the recognition accuracy achieved with machine vision. The model then\noptimises the control points and colour choices for each stroke to maximise\nrecognition accuracy with minimal strokes and colours. Experimental results\nshow that our model achieves superior performance in high-level recognition\ntasks, delivering artistic expression and aesthetic appeal, especially in\nabstract sketches. Additionally, our approach shows promise as an efficient\nbit-level image compression technique, outperforming traditional methods.\n","authors":["Yi Lin","Lin Gu","Ziteng Cui","Shenghan Su","Yumo Hao","Yingtao Tian","Tatsuya Harada","Jianfei Yang"],"pdf_url":"https://arxiv.org/pdf/2501.04966v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03847v2","updated":"2025-01-09T04:25:42Z","published":"2025-01-07T15:01:58Z","title":"Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video\n Generation Control","summary":" Diffusion models have demonstrated impressive performance in generating\nhigh-quality videos from text prompts or images. However, precise control over\nthe video generation process, such as camera manipulation or content editing,\nremains a significant challenge. Existing methods for controlled video\ngeneration are typically limited to a single control type, lacking the\nflexibility to handle diverse control demands. In this paper, we introduce\nDiffusion as Shader (DaS), a novel approach that supports multiple video\ncontrol tasks within a unified architecture. Our key insight is that achieving\nversatile video control necessitates leveraging 3D control signals, as videos\nare fundamentally 2D renderings of dynamic 3D content. 
Unlike prior methods\nlimited to 2D control signals, DaS leverages 3D tracking videos as control\ninputs, making the video diffusion process inherently 3D-aware. This innovation\nallows DaS to achieve a wide range of video controls by simply manipulating the\n3D tracking videos. A further advantage of using 3D tracking videos is their\nability to effectively link frames, significantly enhancing the temporal\nconsistency of the generated videos. With just 3 days of fine-tuning on 8 H800\nGPUs using less than 10k videos, DaS demonstrates strong control capabilities\nacross diverse tasks, including mesh-to-video generation, camera control,\nmotion transfer, and object manipulation.\n","authors":["Zekai Gu","Rui Yan","Jiahao Lu","Peng Li","Zhiyang Dou","Chenyang Si","Zhen Dong","Qifeng Liu","Cheng Lin","Ziwei Liu","Wenping Wang","Yuan Liu"],"pdf_url":"https://arxiv.org/pdf/2501.03847v2.pdf","comment":"Project page: https://igl-hkust.github.io/das/ Codes:\n https://github.com/IGL-HKUST/DiffusionAsShader"},{"id":"http://arxiv.org/abs/2311.09346v2","updated":"2025-01-09T04:20:34Z","published":"2023-11-15T20:09:29Z","title":"Nothing Stands Still: A Spatiotemporal Benchmark on 3D Point Cloud\n Registration Under Large Geometric and Temporal Change","summary":" Building 3D geometric maps of man-made spaces is a well-established and\nactive field that is fundamental to computer vision and robotics. However,\nconsidering the evolving nature of built environments, it is essential to\nquestion the capabilities of current mapping efforts in handling temporal\nchanges. In addition, spatiotemporal mapping holds significant potential for\nachieving sustainability and circularity goals. 
Existing mapping approaches\nfocus on small changes, such as object relocation or self-driving car\noperation, where in all cases the main structure of the scene remains fixed.\nConsequently, these approaches fail to address more radical changes in the\nstructure of the built environment, such as geometry and topology. To this end,\nwe introduce the Nothing Stands Still (NSS) benchmark, which focuses on the\nspatiotemporal registration of 3D scenes undergoing large spatial and temporal\nchange, ultimately creating one coherent spatiotemporal map. Specifically, the\nbenchmark involves registering two or more partial 3D point clouds (fragments)\nfrom the same scene but captured from different spatiotemporal views. In\naddition to the standard pairwise registration, we assess the multi-way\nregistration of multiple fragments that belong to any temporal stage. As part\nof NSS, we introduce a dataset of 3D point clouds recurrently captured in\nlarge-scale building indoor environments that are under construction or\nrenovation. The NSS benchmark presents three scenarios of increasing\ndifficulty, to quantify the generalization ability of point cloud registration\nmethods over space (within one building and across buildings) and time. We\nconduct extensive evaluations of state-of-the-art methods on NSS. The results\ndemonstrate the necessity for novel methods specifically designed to handle\nlarge spatiotemporal changes. The homepage of our benchmark is at\nhttp://nothing-stands-still.com.\n","authors":["Tao Sun","Yan Hao","Shengyu Huang","Silvio Savarese","Konrad Schindler","Marc Pollefeys","Iro Armeni"],"pdf_url":"https://arxiv.org/pdf/2311.09346v2.pdf","comment":"To appear in the ISPRS Journal of Photogrammetry and Remote Sensing.\n 29 pages, 26 figures. 
For the project page, see\n http://nothing-stands-still.com"},{"id":"http://arxiv.org/abs/2501.04958v1","updated":"2025-01-09T04:20:12Z","published":"2025-01-09T04:20:12Z","title":"Addressing Domain Shift via Imbalance-Aware Domain Adaptation in Embryo\n Development Assessment","summary":" Deep learning models in medical imaging face dual challenges: domain shift,\nwhere models perform poorly when deployed in settings different from their\ntraining environment, and class imbalance, where certain disease conditions are\nnaturally underrepresented. We present Imbalance-Aware Domain Adaptation\n(IADA), a novel framework that simultaneously tackles both challenges through\nthree key components: (1) adaptive feature learning with class-specific\nattention mechanisms, (2) balanced domain alignment with dynamic weighting, and\n(3) adaptive threshold optimization. Our theoretical analysis establishes\nconvergence guarantees and complexity bounds. Through extensive experiments on\nembryo development assessment across four imaging modalities, IADA demonstrates\nsignificant improvements over existing methods, achieving up to 25.19\\% higher\naccuracy while maintaining balanced performance across classes. In challenging\nscenarios with low-quality imaging systems, IADA shows robust generalization\nwith AUC improvements of up to 12.56\\%. These results demonstrate IADA's\npotential for developing reliable and equitable medical imaging systems for\ndiverse clinical settings. 
The code is made publicly available at\n\\url{https://github.com/yinghemedical/imbalance-aware_domain_adaptation}\n","authors":["Lei Li","Xinglin Zhang","Jun Liang","Tao Chen"],"pdf_url":"https://arxiv.org/pdf/2501.04958v1.pdf","comment":"15 pages"},{"id":"http://arxiv.org/abs/2501.04950v1","updated":"2025-01-09T03:58:02Z","published":"2025-01-09T03:58:02Z","title":"MORDA: A Synthetic Dataset to Facilitate Adaptation of Object Detectors\n to Unseen Real-target Domain While Preserving Performance on Real-source\n Domain","summary":" Deep neural network (DNN) based perception models are indispensable in the\ndevelopment of autonomous vehicles (AVs). However, their reliance on\nlarge-scale, high-quality data is broadly recognized as a burdensome necessity\ndue to the substantial cost of data acquisition and labeling. Further, the\nissue is not a one-time concern, as AVs might need a new dataset if they are to\nbe deployed to another region (real-target domain) that the in-hand dataset\nwithin the real-source domain cannot incorporate. To mitigate this burden, we\npropose leveraging synthetic environments as an auxiliary domain where the\ncharacteristics of real domains are reproduced. This approach could enable\nindirect experience about the real-target domain in a time- and cost-effective\nmanner. As a practical demonstration of our methodology, nuScenes and South\nKorea are employed to represent real-source and real-target domains,\nrespectively. Specifically, we construct digital twins for several regions of\nSouth Korea, and the data-acquisition framework of nuScenes is reproduced.\nBlending the aforementioned components within a simulator allows us to obtain a\nsynthetic-fusion domain in which we forge our novel driving dataset, MORDA:\nMixture Of Real-domain characteristics for synthetic-data-assisted Domain\nAdaptation. 
To verify the value of synthetic features that MORDA provides in\nlearning about driving environments of South Korea, 2D/3D detectors are trained\nsolely on a combination of nuScenes and MORDA. Afterward, their performance is\nevaluated on the unforeseen real-world dataset (AI-Hub) collected in South\nKorea. Our experiments show that MORDA can significantly improve mean\nAverage Precision (mAP) on the AI-Hub dataset, while mAP on nuScenes is retained or\nslightly enhanced.\n","authors":["Hojun Lim","Heecheol Yoo","Jinwoo Lee","Seungmin Jeon","Hyeongseok Jeon"],"pdf_url":"https://arxiv.org/pdf/2501.04950v1.pdf","comment":"7 pages, 6 figures, 4 tables, This work has been submitted to the\n IEEE for possible publication (the paper is submitted to the conference\n ICRA2025 and is under review)"},{"id":"http://arxiv.org/abs/2501.04947v1","updated":"2025-01-09T03:50:00Z","published":"2025-01-09T03:50:00Z","title":"Seeing with Partial Certainty: Conformal Prediction for Robotic Scene\n Recognition in Built Environments","summary":" In assistive robotics serving people with disabilities (PWD), accurate place\nrecognition in built environments is crucial to ensure that robots navigate and\ninteract safely within diverse indoor spaces. Language interfaces, particularly\nthose powered by Large Language Models (LLM) and Vision Language Models (VLM),\nhold significant promise in this context, as they can interpret visual scenes\nand correlate them with semantic information. However, such interfaces are also\nknown for their hallucinated predictions. In addition, language instructions\nprovided by humans can also be ambiguous and lack precise details about\nspecific locations, objects, or actions, exacerbating the hallucination issue.\nIn this work, we introduce Seeing with Partial Certainty (SwPC) - a framework\ndesigned to measure and align uncertainty in VLM-based place recognition,\nenabling the model to recognize when it lacks confidence and seek assistance\nwhen necessary. 
This framework is built on the theory of conformal prediction\nto provide statistical guarantees on place recognition while minimizing\nrequests for human help in complex indoor environment settings. Through\nexperiments on the widely used richly-annotated scene dataset Matterport3D, we\nshow that SwPC significantly increases the success rate and decreases the\namount of human intervention required relative to the prior art. SwPC can be\nutilized with any VLMs directly without requiring model fine-tuning, offering a\npromising, lightweight approach to uncertainty modeling that complements and\nscales alongside the expanding capabilities of foundational models.\n","authors":["Yifan Xu","Vineet Kamat","Carol Menassa"],"pdf_url":"https://arxiv.org/pdf/2501.04947v1.pdf","comment":"10 pages, 4 Figures"},{"id":"http://arxiv.org/abs/2412.18696v2","updated":"2025-01-09T03:39:37Z","published":"2024-12-24T22:55:35Z","title":"STITCH: Surface reconstrucTion using Implicit neural representations\n with Topology Constraints and persistent Homology","summary":" We present STITCH, a novel approach for neural implicit surface\nreconstruction of a sparse and irregularly spaced point cloud while enforcing\ntopological constraints (such as having a single connected component). We\ndevelop a new differentiable framework based on persistent homology to\nformulate topological loss terms that enforce the prior of a single 2-manifold\nobject. Our method demonstrates excellent performance in preserving the\ntopology of complex 3D geometries, evident through both visual and empirical\ncomparisons. We supplement this with a theoretical analysis, and provably show\nthat optimizing the loss with stochastic (sub)gradient descent leads to\nconvergence and enables reconstructing shapes with a single connected\ncomponent. 
Our approach showcases the integration of differentiable topological\ndata analysis tools for implicit surface reconstruction.\n","authors":["Anushrut Jignasu","Ethan Herron","Zhanhong Jiang","Soumik Sarkar","Chinmay Hegde","Baskar Ganapathysubramanian","Aditya Balu","Adarsh Krishnamurthy"],"pdf_url":"https://arxiv.org/pdf/2412.18696v2.pdf","comment":"19 pages, 12 figures, 29 tables"},{"id":"http://arxiv.org/abs/2411.18729v2","updated":"2025-01-09T03:34:55Z","published":"2024-11-27T20:08:55Z","title":"Multi-Task Model Merging via Adaptive Weight Disentanglement","summary":" Model merging has recently gained attention as an economical and scalable\napproach to incorporate task-specific weights from various tasks into a unified\nmulti-task model. For example, in Task Arithmetic (TA), adding the fine-tuned\nweights of different tasks can enhance the model's performance on those tasks,\nwhile subtracting them leads to task forgetting. Although TA is highly\neffective, interference among tasks still hampers the performance of the merged\nmodel. Existing methods for handling conflicts between tasks generally rely on\nempirical selection, resulting in suboptimal performance. In this paper, we\nintroduce an Adaptive Weight Disentanglement method. We begin by theoretically\nproving that task vectors employed in model merging should be orthogonal to\nminimize interference among tasks. Guided by this insight, we initialize\nredundant vectors such that, when subtracted from the original task vectors,\nthe resulting vectors exhibit increased orthogonality. Additionally, we impose\na norm constraint on the redundant vectors to preserve the performance of the\ntask-specific models. Experimental results demonstrate the effectiveness of our\nproposed technique: it successfully extracts redundant vectors, and after their\nsubtraction, the task vectors not only retain robust performance but also\nachieve superior fusion outcomes. 
Our code is available at\n\\href{https://github.com/FarisXiong/AWD.git}{https://github.com/FarisXiong/AWD.git}.\n","authors":["Feng Xiong","Runxi Cheng","Wang Chen","Zhanqiu Zhang","Yiwen Guo","Chun Yuan","Ruifeng Xu"],"pdf_url":"https://arxiv.org/pdf/2411.18729v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04944v1","updated":"2025-01-09T03:27:47Z","published":"2025-01-09T03:27:47Z","title":"MambaHSI: Spatial-Spectral Mamba for Hyperspectral Image Classification","summary":" Transformers have been extensively explored for hyperspectral image (HSI)\nclassification. However, transformers pose challenges in terms of speed and\nmemory usage because of their quadratic computational complexity. Recently, the\nMamba model has emerged as a promising approach, which has strong long-distance\nmodeling capabilities while maintaining a linear computational complexity.\nHowever, representing the HSI is challenging for the Mamba due to the\nrequirement for an integrated spatial and spectral understanding. To remedy\nthese drawbacks, we propose a novel HSI classification model based on a Mamba\nmodel, named MambaHSI, which can simultaneously model long-range interaction of\nthe whole image and integrate spatial and spectral information in an adaptive\nmanner. Specifically, we design a spatial Mamba block (SpaMB) to model the\nlong-range interaction of the whole image at the pixel level. Then, we propose\na spectral Mamba block (SpeMB) to split the spectral vector into multiple\ngroups, mine the relations across different spectral groups, and extract\nspectral features. Finally, we propose a spatial-spectral fusion module (SSFM)\nto adaptively integrate the spatial and spectral features of an HSI. To the best of our\nknowledge, this is the first image-level HSI classification model based on\nMamba. We conduct extensive experiments on four diverse HSI datasets. The\nresults demonstrate the effectiveness and superiority of the proposed model for\nHSI classification. 
This reveals the great potential of Mamba to be the\nnext-generation backbone for HSI models. Codes are available at\nhttps://github.com/li-yapeng/MambaHSI .\n","authors":["Yapeng Li","Yong Luo","Lefei Zhang","Zengmao Wang","Bo Du"],"pdf_url":"https://arxiv.org/pdf/2501.04944v1.pdf","comment":"accepted by IEEE TGRS"},{"id":"http://arxiv.org/abs/2501.00358v2","updated":"2025-01-09T03:25:24Z","published":"2024-12-31T09:22:38Z","title":"Embodied VideoAgent: Persistent Memory from Egocentric Videos and\n Embodied Sensors Enables Dynamic Scene Understanding","summary":" This paper investigates the problem of understanding dynamic 3D scenes from\negocentric observations, a key challenge in robotics and embodied AI. Unlike\nprior studies that explored this as long-form video understanding and utilized\negocentric video only, we instead propose an LLM-based agent, Embodied\nVideoAgent, which constructs scene memory from both egocentric video and\nembodied sensory inputs (e.g. depth and pose sensing). We further introduce a\nVLM-based approach to automatically update the memory when actions or\nactivities over objects are perceived. Embodied VideoAgent attains significant\nadvantages over counterparts in challenging reasoning and planning tasks in 3D\nscenes, achieving gains of 4.9% on Ego4D-VQ3D, 5.8% on OpenEQA, and 11.7% on\nEnvQA. We have also demonstrated its potential in various embodied AI tasks\nincluding generating embodied interactions and perception for robot\nmanipulation. 
The code and demo will be made public.\n","authors":["Yue Fan","Xiaojian Ma","Rongpeng Su","Jun Guo","Rujie Wu","Xi Chen","Qing Li"],"pdf_url":"https://arxiv.org/pdf/2501.00358v2.pdf","comment":"project page: https://embodied-videoagent.github.io/"},{"id":"http://arxiv.org/abs/2501.04939v1","updated":"2025-01-09T03:04:08Z","published":"2025-01-09T03:04:08Z","title":"Multi-Context Temporal Consistent Modeling for Referring Video Object\n Segmentation","summary":" Referring video object segmentation aims to segment objects within a video\ncorresponding to a given text description. Existing transformer-based temporal\nmodeling approaches face challenges related to query inconsistency and the\nlimited consideration of context. Query inconsistency produces unstable masks\nof different objects in the middle of the video. The limited consideration of\ncontext leads to the segmentation of incorrect objects by failing to adequately\naccount for the relationship between the given text and instances. To address\nthese issues, we propose the Multi-context Temporal Consistency Module (MTCM),\nwhich consists of an Aligner and a Multi-Context Enhancer (MCE). The Aligner\nremoves noise from queries and aligns them to achieve query consistency. The\nMCE predicts text-relevant queries by considering multi-context. We applied\nMTCM to four different models, increasing performance across all of them,\nparticularly achieving 47.6 J&F on the MeViS. 
Code is available at\nhttps://github.com/Choi58/MTCM.\n","authors":["Sun-Hyuk Choi","Hayoung Jo","Seong-Whan Lee"],"pdf_url":"https://arxiv.org/pdf/2501.04939v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04934v1","updated":"2025-01-09T02:52:30Z","published":"2025-01-09T02:52:30Z","title":"Plug-and-Play DISep: Separating Dense Instances for Scene-to-Pixel\n Weakly-Supervised Change Detection in High-Resolution Remote Sensing Images","summary":" Existing Weakly-Supervised Change Detection (WSCD) methods often encounter\nthe problem of \"instance lumping\" under scene-level supervision, particularly\nin scenarios with a dense distribution of changed instances (i.e., changed\nobjects). In these scenarios, unchanged pixels between changed instances are\nalso mistakenly identified as changed, causing multiple changes to be\nmistakenly viewed as one. In practical applications, this issue prevents the\naccurate quantification of the number of changes. To address this issue, we\npropose a Dense Instance Separation (DISep) method as a plug-and-play solution,\nrefining pixel features from a unified instance perspective under scene-level\nsupervision. Specifically, our DISep comprises a three-step iterative training\nprocess: 1) Instance Localization: We locate instance candidate regions for\nchanged pixels using high-pass class activation maps. 2) Instance Retrieval: We\nidentify and group these changed pixels into different instance IDs through\nconnectivity searching. Then, based on the assigned instance IDs, we extract\ncorresponding pixel-level features on a per-instance basis. 3) Instance\nSeparation: We introduce a separation loss to enforce intra-instance pixel\nconsistency in the embedding space, thereby ensuring separable instance feature\nrepresentations. The proposed DISep adds only minimal training cost and no\ninference cost. It can be seamlessly integrated to enhance existing WSCD\nmethods. 
We achieve state-of-the-art performance by enhancing three\nTransformer-based and four ConvNet-based methods on the LEVIR-CD, WHU-CD,\nDSIFN-CD, SYSU-CD, and CDD datasets. Additionally, our DISep can be used to\nimprove fully-supervised change detection methods. Code is available at\nhttps://github.com/zhenghuizhao/Plug-and-Play-DISep-for-Change-Detection.\n","authors":["Zhenghui Zhao","Chen Wu","Lixiang Ru","Di Wang","Hongruixuan Chen","Cuiqun Chen"],"pdf_url":"https://arxiv.org/pdf/2501.04934v1.pdf","comment":"Accepted by ISPRS Journal of Photogrammetry and Remote Sensing"},{"id":"http://arxiv.org/abs/2501.01808v2","updated":"2025-01-09T02:45:43Z","published":"2025-01-03T13:43:21Z","title":"MoEE: Mixture of Emotion Experts for Audio-Driven Portrait Animation","summary":" The generation of talking avatars has achieved significant advancements in\nprecise audio synchronization. However, crafting lifelike talking head videos\nrequires capturing a broad spectrum of emotions and subtle facial expressions.\nCurrent methods face fundamental challenges: a) the absence of frameworks for\nmodeling single basic emotional expressions, which restricts the generation of\ncomplex emotions such as compound emotions; b) the lack of comprehensive\ndatasets rich in human emotional expressions, which limits the potential of\nmodels. To address these challenges, we propose the following innovations: 1)\nthe Mixture of Emotion Experts (MoEE) model, which decouples six fundamental\nemotions to enable the precise synthesis of both singular and compound\nemotional states; 2) the DH-FaceEmoVid-150 dataset, specifically curated to\ninclude six prevalent human emotional expressions as well as four types of\ncompound emotions, thereby expanding the training potential of emotion-driven\nmodels. 
Furthermore, to enhance the flexibility of emotion control, we propose\nan emotion-to-latents module that leverages multimodal inputs, aligning diverse\ncontrol signals, such as audio, text, and labels, to ensure more varied control\ninputs as well as the ability to control emotions using audio alone. Through\nextensive quantitative and qualitative evaluations, we demonstrate that the\nMoEE framework, in conjunction with the DH-FaceEmoVid-150 dataset, excels in\ngenerating complex emotional expressions and nuanced facial details, setting a\nnew benchmark in the field. These datasets will be publicly released.\n","authors":["Huaize Liu","Wenzhang Sun","Donglin Di","Shibo Sun","Jiahui Yang","Changqing Zou","Hujun Bao"],"pdf_url":"https://arxiv.org/pdf/2501.01808v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04928v1","updated":"2025-01-09T02:36:21Z","published":"2025-01-09T02:36:21Z","title":"Image2CADSeq: Computer-Aided Design Sequence and Knowledge Inference\n from Product Images","summary":" Computer-aided design (CAD) tools empower designers to design and modify 3D\nmodels through a series of CAD operations, commonly referred to as a CAD\nsequence. In scenarios where digital CAD files are not accessible, reverse\nengineering (RE) has been used to reconstruct 3D CAD models. Recent advances\nhave seen the rise of data-driven approaches for RE, with a primary focus on\nconverting 3D data, such as point clouds, into 3D models in boundary\nrepresentation (B-rep) format. However, obtaining 3D data poses significant\nchallenges, and B-rep models do not reveal knowledge about the 3D modeling\nprocess of designs. To this end, our research introduces a novel data-driven\napproach with an Image2CADSeq neural network model. This model aims to reverse\nengineer CAD models by processing images as input and generating CAD sequences.\nThese sequences can then be translated into B-rep models using a solid modeling\nkernel. 
Unlike B-rep models, CAD sequences offer enhanced flexibility to modify\nindividual steps of model creation, providing a deeper understanding of the\nconstruction process of CAD models. To quantitatively and rigorously evaluate\nthe predictive performance of the Image2CADSeq model, we have developed a\nmulti-level evaluation framework for model assessment. The model was trained on\na specially synthesized dataset, and various network architectures were\nexplored to optimize the performance. The experimental and validation results\nshow great potential for the model in generating CAD sequences from 2D image\ndata.\n","authors":["Xingang Li","Zhenghui Sha"],"pdf_url":"https://arxiv.org/pdf/2501.04928v1.pdf","comment":"20 pages, 10 figures, and 6 tables"},{"id":"http://arxiv.org/abs/2404.06429v3","updated":"2025-01-09T02:34:25Z","published":"2024-04-09T16:20:03Z","title":"Magic-Boost: Boost 3D Generation with Multi-View Conditioned Diffusion","summary":" Benefiting from the rapid development of 2D diffusion models, 3D content\ngeneration has witnessed significant progress. One promising solution is to\nfinetune the pre-trained 2D diffusion models to produce multi-view images and\nthen reconstruct them into 3D assets via feed-forward sparse-view\nreconstruction models. However, limited by the 3D inconsistency in the\ngenerated multi-view images and the low reconstruction resolution of the\nfeed-forward reconstruction models, the generated 3D assets still suffer from\nincorrect geometries and blurry textures. To address this problem, we\npresent a multi-view based refinement method, named Magic-Boost, to further refine\nthe generation results. 
In detail, we first propose a novel multi-view\nconditioned diffusion model which extracts a 3D prior from the synthesized\nmulti-view images to synthesize high-fidelity novel view images, and then\nintroduce a novel iterative-update strategy that applies it to provide precise\nguidance to refine the coarse generated results through a fast optimization\nprocess. Conditioned on the strong 3D priors extracted from the synthesized\nmulti-view images, Magic-Boost is capable of providing precise optimization\nguidance that well aligns with the coarse generated 3D assets, enriching the\nlocal detail in both geometry and texture within a short time ($\\sim15$min).\nExtensive experiments show that Magic-Boost greatly enhances the coarse generated\ninputs and generates high-quality 3D assets with rich geometric and textural\ndetails. (Project Page: https://magic-research.github.io/magic-boost/)\n","authors":["Fan Yang","Jianfeng Zhang","Yichun Shi","Bowen Chen","Chenxu Zhang","Huichao Zhang","Xiaofeng Yang","Xiu Li","Jiashi Feng","Guosheng Lin"],"pdf_url":"https://arxiv.org/pdf/2404.06429v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.19407v5","updated":"2025-01-09T02:33:15Z","published":"2024-06-12T06:41:23Z","title":"YOLO11 to Its Genesis: A Decadal and Comprehensive Review of The You\n Only Look Once (YOLO) Series","summary":" This review\nsystematically examines the progression of the You Only Look Once (YOLO) object\ndetection algorithms from YOLOv1 to the recently unveiled YOLO11 (or YOLOv11).\nEmploying a reverse chronological analysis, this study examines the\nadvancements introduced by YOLO algorithms, beginning with YOLOv11 and\nprogressing through YOLOv10, YOLOv9, YOLOv8, and subsequent versions to explore\neach version's contributions to enhancing speed, detection accuracy, and\ncomputational efficiency in real-time object detection. 
By detailing the\nincremental technological advancements in subsequent YOLO versions, this review\nchronicles the evolution of YOLO and discusses the challenges and limitations\nof each earlier version. The evolution signifies a path towards integrating\nYOLO with multimodal, context-aware, and Artificial General Intelligence (AGI)\nsystems for the next YOLO decade, promising significant implications for future\ndevelopments in AI-driven applications.\n","authors":["Ranjan Sapkota","Rizwan Qureshi","Marco Flores Calero","Chetan Badjugar","Upesh Nepal","Alwin Poulose","Peter Zeno","Uday Bhanu Prakash Vaddevolu","Sheheryar Khan","Maged Shoman","Hong Yan","Manoj Karkee"],"pdf_url":"https://arxiv.org/pdf/2406.19407v5.pdf","comment":"11 Figures, 7 Tables"},{"id":"http://arxiv.org/abs/2501.04914v1","updated":"2025-01-09T02:10:15Z","published":"2025-01-09T02:10:15Z","title":"From Mesh Completion to AI Designed Crown","summary":" Designing a dental crown is a time-consuming and labor-intensive process. Our\ngoal is to simplify crown design and minimize the tediousness of making manual\nadjustments while still ensuring the highest level of accuracy and consistency.\nTo this end, we present a new end-to-end deep learning approach, coined Dental\nMesh Completion (DMC), to generate a crown mesh conditioned on a point cloud\ncontext. The dental context includes the tooth prepared to receive a crown and\nits surroundings, namely the two adjacent teeth and the three closest teeth in\nthe opposing jaw. We formulate crown generation in terms of completing this\npoint cloud context. A feature extractor first converts the input point cloud\ninto a set of feature vectors that represent local regions in the point cloud.\nThe set of feature vectors is then fed into a transformer to predict a new set\nof feature vectors for the missing region (crown). 
Subsequently, a point\nreconstruction head, followed by a multi-layer perceptron, is used to predict a\ndense set of points with normals. Finally, a differentiable point-to-mesh layer\nserves to reconstruct the crown surface mesh. We compare our DMC method to a\ngraph-based convolutional neural network which learns to deform a crown mesh\nfrom a generic crown shape to the target geometry. Extensive experiments on our\ndataset demonstrate the effectiveness of our method, which attains an average\nChamfer Distance of 0.062. The code is available at:\nhttps://github.com/Golriz-code/DMC.gi\n","authors":["Golriz Hosseinimanesh","Farnoosh Ghadiri","Francois Guibault","Farida Cheriet","Julia Keren"],"pdf_url":"https://arxiv.org/pdf/2501.04914v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.12620v2","updated":"2025-01-09T02:00:15Z","published":"2024-12-17T07:33:07Z","title":"Multi-Domain Features Guided Supervised Contrastive Learning for Radar\n Target Detection","summary":" Detecting small targets in sea clutter is challenging due to dynamic maritime\nconditions. Existing solutions either model sea clutter for detection or\nextract target features based on clutter-target echo differences, including\nstatistical and deep features. While more common, the latter often excels in\ncontrolled scenarios but struggles with robust detection and generalization in\ndiverse environments, limiting practical use. In this letter, we propose a\nmulti-domain features guided supervised contrastive learning (MDFG_SCL) method,\nwhich integrates statistical features derived from multi-domain differences\nwith deep features obtained through supervised contrastive learning, thereby\ncapturing both low-level domain-specific variations and high-level semantic\ninformation. This comprehensive feature integration enables the model to\neffectively distinguish between small targets and sea clutter, even under\nchallenging conditions. 
Experiments conducted on real-world datasets\ndemonstrate that the proposed shallow-to-deep detector not only achieves\neffective identification of small maritime targets but also maintains superior\ndetection performance across varying sea conditions, outperforming the\nmainstream unsupervised contrastive learning and supervised contrastive\nlearning methods.\n","authors":["Junjie Wang","Yuze Gao","Dongying Li","Wenxian Yu"],"pdf_url":"https://arxiv.org/pdf/2412.12620v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04284v2","updated":"2025-01-09T01:58:41Z","published":"2025-01-08T05:15:43Z","title":"ContextMRI: Enhancing Compressed Sensing MRI through Metadata\n Conditioning","summary":" Compressed sensing MRI seeks to accelerate MRI acquisition processes by\nsampling fewer k-space measurements and then reconstructing the missing data\nalgorithmically. The success of these approaches often relies on strong priors\nor learned statistical models. While recent diffusion model-based priors have\nshown great potential, previous methods typically ignore clinically available\nmetadata (e.g. patient demographics, imaging parameters, slice-specific\ninformation). In practice, metadata contains meaningful cues about the anatomy\nand acquisition protocol, suggesting it could further constrain the\nreconstruction problem. In this work, we propose ContextMRI, a text-conditioned\ndiffusion model for MRI that integrates granular metadata into the\nreconstruction process. We train a pixel-space diffusion model directly on\nminimally processed, complex-valued MRI images. During inference, metadata is\nconverted into a structured text prompt and fed to the model via CLIP text\nembeddings. By conditioning the prior on metadata, we unlock more accurate\nreconstructions and show consistent gains across multiple datasets,\nacceleration factors, and undersampling patterns. 
Our experiments demonstrate\nthat increasing the fidelity of metadata, ranging from slice location and\ncontrast to patient age, sex, and pathology, systematically boosts\nreconstruction performance. This work highlights the untapped potential of\nleveraging clinical context for inverse problems and opens a new direction for\nmetadata-driven MRI reconstruction.\n","authors":["Hyungjin Chung","Dohun Lee","Zihui Wu","Byung-Hoon Kim","Katherine L. Bouman","Jong Chul Ye"],"pdf_url":"https://arxiv.org/pdf/2501.04284v2.pdf","comment":"29 pages, 9 figures. Code is available at\n https://github.com/DoHunLee1/ContextMRI"},{"id":"http://arxiv.org/abs/2501.04911v1","updated":"2025-01-09T01:58:14Z","published":"2025-01-09T01:58:14Z","title":"A Machine Learning Model for Crowd Density Classification in Hajj Video\n Frames","summary":" Managing the massive annual gatherings of Hajj and Umrah presents significant\nchallenges, particularly as the Saudi government aims to increase the number of\npilgrims. Currently, around two million pilgrims attend Hajj and 26 million\nattend Umrah, making crowd control, especially in critical areas like the Grand\nMosque during Tawaf, a major concern. Additional risks arise in managing dense\ncrowds at key sites such as Arafat, where the potential for stampedes, fires and\npandemics poses serious threats to public safety. This research proposes a\nmachine learning model to classify crowd density into three levels: moderate\ncrowd, overcrowded, and very dense crowd in video frames recorded during Hajj,\nwith a flashing red light to alert organizers in real-time when a very dense\ncrowd is detected. 
While current research efforts in processing Hajj\nsurveillance videos focus solely on using CNNs to detect abnormal behaviors,\nthis research focuses more on high-risk crowds that can lead to disasters.\nHazardous crowd conditions require a robust method, as incorrect classification\ncould trigger unnecessary alerts and government intervention, while failure to\nclassify could result in disaster. The proposed model integrates Local Binary\nPattern (LBP) texture analysis, which enhances feature extraction for\ndifferentiating crowd density levels, along with edge density and area-based\nfeatures. The model was tested on the KAU-Smart Crowd 'HAJJv2' dataset, which\ncontains 18 videos from various key locations during Hajj, including 'Massaa',\n'Jamarat', 'Arafat' and 'Tawaf'. The model achieved an accuracy rate of 87%\nwith a 2.14% error percentage (misclassification rate), demonstrating its\nability to detect and classify various crowd conditions effectively. This\ncontributes to enhanced crowd management and safety during large-scale events\nlike Hajj.\n","authors":["Afnan A. Shah"],"pdf_url":"https://arxiv.org/pdf/2501.04911v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.09236v2","updated":"2025-01-09T01:20:46Z","published":"2024-03-14T09:59:55Z","title":"Hyper-3DG: Text-to-3D Gaussian Generation via Hypergraph","summary":" Text-to-3D generation represents an exciting field that has seen rapid\nadvancements, facilitating the transformation of textual descriptions into\ndetailed 3D models. However, current progress often neglects the intricate\nhigh-order correlation of geometry and texture within 3D objects, leading to\nchallenges such as over-smoothness, over-saturation and the Janus problem. In\nthis work, we propose a method named ``3D Gaussian Generation via Hypergraph\n(Hyper-3DG)'', designed to capture the sophisticated high-order correlations\npresent within 3D objects. 
Our framework is anchored by a well-established\nmainflow and an essential module, named ``Geometry and Texture Hypergraph\nRefiner (HGRefiner)''. This module not only refines the representation of 3D\nGaussians but also accelerates the update process of these 3D Gaussians by\nconducting the Patch-3DGS Hypergraph Learning on both explicit attributes and\nlatent visual features. Our framework allows for the production of finely\ngenerated 3D objects within a cohesive optimization, effectively circumventing\ndegradation. Extensive experimentation has shown that our proposed method\nsignificantly enhances the quality of 3D generation while incurring no\nadditional computational overhead for the underlying framework. (Project code:\nhttps://github.com/yjhboy/Hyper3DG)\n","authors":["Donglin Di","Jiahui Yang","Chaofan Luo","Zhou Xue","Wei Chen","Xun Yang","Yue Gao"],"pdf_url":"https://arxiv.org/pdf/2403.09236v2.pdf","comment":"Accepted by IJCV"},{"id":"http://arxiv.org/abs/2410.04041v4","updated":"2025-01-09T00:39:56Z","published":"2024-10-05T05:26:21Z","title":"EndoPerfect: A Hybrid NeRF-Stereo Vision Approach Pioneering Monocular\n Depth Estimation and 3D Reconstruction in Endoscopy","summary":" 3D reconstruction in endoscopic sinus surgery (ESS) demands exceptional\naccuracy, with the mean error and standard deviation necessitating within the\nrange of a single CT slice (0.625 mm), as the critical structures in the nasal\ncavity are situated within submillimeter distances from surgical instruments.\nThis poses a formidable challenge when using conventional monocular endoscopes.\nDepth estimation is crucial for 3D reconstruction, yet existing depth\nestimation methodologies either suffer from inherent accuracy limitations or,\nin the case of learning-based approaches, perform poorly when applied to ESS\ndespite succeeding on their original datasets. 
In this study, we present a\nnovel, highly generalizable method that combines Neural Radiance Fields (NeRF)\nand stereo depth estimation for 3D reconstruction that can derive metric\nmonocular depth. Our approach begins with an initial NeRF reconstruction\nyielding a coarse 3D scene, followed by the creation of binocular pairs within\nthe coarse 3D scene and the generation of depth maps through stereo vision. These\ndepth maps are used to supervise the subsequent NeRF iteration, progressively\nrefining the NeRF and binocular depth; the refinement process continues until the\ndepth maps converge. This recursive process generates high-accuracy depth maps\nfrom monocular endoscopic video. Evaluation in synthetic endoscopy shows a\ndepth accuracy of 0.125 $\pm$ 0.443 mm, well within the 0.625 mm threshold.\nFurther clinical experiments with real endoscopic data demonstrate a mean\ndistance to CT mesh of 0.269 mm, representing the highest accuracy among\nmonocular 3D reconstruction methods in ESS.\n","authors":["Pengcheng Chen","Wenhao Li","Nicole Gunderson","Jeremy Ruthberg","Randall Bly","Zhenglong Sun","Waleed M. Abuzeid","Eric J. Seibel"],"pdf_url":"https://arxiv.org/pdf/2410.04041v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05611v1","updated":"2025-01-09T23:20:19Z","published":"2025-01-09T23:20:19Z","title":"Bit-depth color recovery via off-the-shelf super-resolution models","summary":" Advancements in imaging technology have enabled hardware to support 10 to 16\nbits per channel, facilitating precise manipulation in applications like image\nediting and video processing. While deep neural networks promise to recover\nhigh bit-depth representations, existing methods often rely on scale-invariant\nimage information, limiting performance in certain scenarios. In this paper, we\nintroduce a novel approach that integrates a super-resolution architecture to\nextract detailed a priori information from images. 
By leveraging interpolated\ndata generated during the super-resolution process, our method achieves\npixel-level recovery of fine-grained color details. Additionally, we\ndemonstrate that spatial features learned through the super-resolution process\nsignificantly contribute to the recovery of detailed color depth information.\nExperiments on benchmark datasets demonstrate that our approach outperforms\nstate-of-the-art methods, highlighting the potential of super-resolution for\nhigh-fidelity color restoration.\n","authors":["Xuanshuo Fu","Danna Xue","Javier Vazquez-Corral"],"pdf_url":"https://arxiv.org/pdf/2501.05611v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.02095v2","updated":"2025-01-09T22:38:13Z","published":"2024-11-04T13:59:01Z","title":"The evolution of volumetric video: A survey of smart transcoding and\n compression approaches","summary":" Volumetric video, the capture and display of three-dimensional (3D) imagery,\nhas emerged as a revolutionary technology poised to transform the media\nlandscape, enabling immersive experiences that transcend the limitations of\ntraditional 2D video. One of the key challenges in this domain is the efficient\ndelivery of these high-bandwidth, data-intensive volumetric video streams,\nwhich requires innovative transcoding and compression techniques. 
This research\npaper explores the state-of-the-art in volumetric video compression and\ndelivery, with a focus on the potential of AI-driven solutions to address the\nunique challenges posed by this emerging medium.\n","authors":["Preetish Kakkar","Hariharan Ragothaman"],"pdf_url":"https://arxiv.org/pdf/2411.02095v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.08563v2","updated":"2025-01-09T22:36:57Z","published":"2024-12-11T17:31:17Z","title":"Physics Based Differentiable Rendering for Inverse Problems and Beyond","summary":" Physics-based differentiable rendering (PBDR) has become an efficient method\nin computer vision, graphics, and machine learning for addressing an array of\ninverse problems. PBDR allows patterns to be generated from perceptions, which\ncan be applied to enhance object attributes like geometry, substances, and\nlighting by adding physical models of light propagation and materials\ninteraction. Due to these capabilities, differentiable rendering has been\nemployed in a wider range of sectors such as autonomous navigation, scene\nreconstruction, and material design. We provide an extensive overview of PBDR\ntechniques in this study, emphasizing their creation, effectiveness, and\nlimitations while managing inverse situations. We demonstrate modern techniques\nand examine their value in everyday situations.\n","authors":["Preetish Kakkar","Srijani Mukherjee","Hariharan Ragothaman","Vishal Mehta"],"pdf_url":"https://arxiv.org/pdf/2412.08563v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.17155v4","updated":"2025-01-09T22:23:15Z","published":"2023-03-30T05:25:20Z","title":"Discriminative Class Tokens for Text-to-Image Diffusion Models","summary":" Recent advances in text-to-image diffusion models have enabled the generation\nof diverse and high-quality images. While impressive, the images often fall\nshort of depicting subtle details and are susceptible to errors due to\nambiguity in the input text. 
One way of alleviating these issues is to train\ndiffusion models on class-labeled datasets. This approach has two\ndisadvantages: (i) supervised datasets are generally small compared to\nlarge-scale scraped text-image datasets on which text-to-image models are\ntrained, affecting the quality and diversity of the generated images, or (ii)\nthe input is a hard-coded label, as opposed to free-form text, limiting the\ncontrol over the generated images.\n In this work, we propose a non-invasive fine-tuning technique that\ncapitalizes on the expressive potential of free-form text while achieving high\naccuracy through discriminative signals from a pretrained classifier. This is\ndone by iteratively modifying the embedding of an added input token of a\ntext-to-image diffusion model, by steering generated images toward a given\ntarget class according to a classifier. Our method is fast compared to prior\nfine-tuning methods and does not require a collection of in-class images or\nretraining of a noise-tolerant classifier. We evaluate our method extensively,\nshowing that the generated images are: (i) more accurate and of higher quality\nthan standard diffusion models, (ii) can be used to augment training data in a\nlow-resource setting, and (iii) reveal information about the data used to train\nthe guiding classifier. The code is available at\n\\url{https://github.com/idansc/discriminative_class_tokens}.\n","authors":["Idan Schwartz","Vésteinn Snæbjarnarson","Hila Chefer","Ryan Cotterell","Serge Belongie","Lior Wolf","Sagie Benaim"],"pdf_url":"https://arxiv.org/pdf/2303.17155v4.pdf","comment":"ICCV 2023"},{"id":"http://arxiv.org/abs/2411.13553v2","updated":"2025-01-09T22:17:30Z","published":"2024-11-20T18:59:58Z","title":"AI-generated Image Detection: Passive or Watermark?","summary":" While text-to-image models offer numerous benefits, they also pose\nsignificant societal risks. Detecting AI-generated images is crucial for\nmitigating these risks. 
Detection methods can be broadly categorized into\npassive and watermark-based approaches: passive detectors rely on artifacts\npresent in AI-generated images, whereas watermark-based detectors proactively\nembed watermarks into such images. A key question is which type of detector\nperforms better in terms of effectiveness, robustness, and efficiency. However,\nthe current literature lacks a comprehensive understanding of this issue. In\nthis work, we aim to bridge that gap by developing ImageDetectBench, the first\ncomprehensive benchmark to compare the effectiveness, robustness, and\nefficiency of passive and watermark-based detectors. Our benchmark includes\nfour datasets, each containing a mix of AI-generated and non-AI-generated\nimages. We evaluate five passive detectors and four watermark-based detectors\nagainst eight types of common perturbations and three types of adversarial\nperturbations. Our benchmark results reveal several interesting findings. For\ninstance, watermark-based detectors consistently outperform passive detectors,\nboth in the presence and absence of perturbations. Based on these insights, we\nprovide recommendations for detecting AI-generated images, e.g., when both\ntypes of detectors are applicable, watermark-based detectors should be the\npreferred choice. Our code and data are publicly available at\nhttps://github.com/moyangkuo/ImageDetectBench.git.\n","authors":["Moyang Guo","Yuepeng Hu","Zhengyuan Jiang","Zeyu Li","Amir Sadovnik","Arka Daw","Neil Gong"],"pdf_url":"https://arxiv.org/pdf/2411.13553v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.06687v2","updated":"2025-01-09T22:14:55Z","published":"2024-08-13T07:27:02Z","title":"Masked Image Modeling: A Survey","summary":" In this work, we survey recent studies on masked image modeling (MIM), an\napproach that emerged as a powerful self-supervised learning technique in\ncomputer vision. 
The MIM task involves masking some information, e.g. pixels,\npatches, or even latent representations, and training a model, usually an\nautoencoder, to predict the missing information by using the context\navailable in the visible part of the input. We identify and formalize two\ncategories of approaches on how to implement MIM as a pretext task, one based\non reconstruction and one based on contrastive learning. Then, we construct a\ntaxonomy and review the most prominent papers in recent years. We complement\nthe manually constructed taxonomy with a dendrogram obtained by applying a\nhierarchical clustering algorithm. We further identify relevant clusters by\nmanually inspecting the resulting dendrogram. Our review also includes datasets\nthat are commonly used in MIM research. We aggregate the performance results of\nvarious masked image modeling methods on the most popular datasets, to\nfacilitate the comparison of competing methods. Finally, we identify research\ngaps and propose several interesting directions of future work. We supplement\nour survey with the following public repository containing organized\nreferences: https://github.com/vladhondru25/MIM-Survey.\n","authors":["Vlad Hondru","Florinel Alin Croitoru","Shervin Minaee","Radu Tudor Ionescu","Nicu Sebe"],"pdf_url":"https://arxiv.org/pdf/2408.06687v2.pdf","comment":"Revised version"},{"id":"http://arxiv.org/abs/2404.18731v3","updated":"2025-01-09T22:10:14Z","published":"2024-04-29T14:17:52Z","title":"Real Time Multi Organ Classification on Computed Tomography Images","summary":" Organ segmentation is a fundamental task in medical imaging since it is\nuseful for many clinical automation pipelines. However, some tasks do not\nrequire full segmentation. Instead, a classifier can identify the selected\norgan without segmenting the entire volume. 
In this study, we demonstrate a\nclassifier-based method to obtain organ labels in real time by using a large\ncontext size with a sparse data sampling strategy. Although our method operates\nas an independent classifier at query locations, it can generate full\nsegmentations by querying grid locations at any resolution, offering faster\nperformance than segmentation algorithms. We compared our method with existing\nsegmentation techniques, demonstrating its superior runtime potential for\npractical applications in medical imaging.\n","authors":["Halid Ziya Yerebakan","Yoshihisa Shinagawa","Gerardo Hermosillo Valadez"],"pdf_url":"https://arxiv.org/pdf/2404.18731v3.pdf","comment":"11 pages, Organ Classification, Organ Segmentation"},{"id":"http://arxiv.org/abs/2501.05567v1","updated":"2025-01-09T20:34:36Z","published":"2025-01-09T20:34:36Z","title":"Approximate Supervised Object Distance Estimation on Unmanned Surface\n Vehicles","summary":" Unmanned surface vehicles (USVs) and boats are increasingly important in\nmaritime operations, yet their deployment is limited due to costly sensors and\ncomplexity. LiDAR, radar, and depth cameras are either costly, yield sparse\npoint clouds, or are noisy, and require extensive calibration. Here, we\nintroduce a novel approach for approximate distance estimation in USVs using\nsupervised object detection. We collected a dataset comprising images with\nmanually annotated bounding boxes and corresponding distance measurements.\nLeveraging this data, we propose a specialized branch of an object detection\nmodel, not only to detect objects but also to predict their distances from the\nUSV. This method offers a cost-efficient and intuitive alternative to\nconventional distance measurement techniques, aligning more closely with human\nestimation capabilities. 
We demonstrate its application in a marine assistance\nsystem that alerts operators to nearby objects such as boats, buoys, or other\nwaterborne hazards.\n","authors":["Benjamin Kiefer","Yitong Quan","Andreas Zell"],"pdf_url":"https://arxiv.org/pdf/2501.05567v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05566v1","updated":"2025-01-09T20:29:31Z","published":"2025-01-09T20:29:31Z","title":"Vision-Language Models for Autonomous Driving: CLIP-Based Dynamic Scene\n Understanding","summary":" Scene understanding is essential for enhancing driver safety, generating\nhuman-centric explanations for Automated Vehicle (AV) decisions, and leveraging\nArtificial Intelligence (AI) for retrospective driving video analysis. This\nstudy developed a dynamic scene retrieval system using Contrastive\nLanguage-Image Pretraining (CLIP) models, which can be optimized for real-time\ndeployment on edge devices. The proposed system outperforms state-of-the-art\nin-context learning methods, including the zero-shot capabilities of GPT-4o,\nparticularly in complex scenarios. By conducting frame-level analysis on the\nHonda Scenes Dataset, which contains a collection of about 80 hours of\nannotated driving videos capturing diverse real-world road and weather\nconditions, our study highlights the robustness of CLIP models in learning\nvisual concepts from natural language supervision. Results also showed that\nfine-tuning the CLIP models, such as ViT-L/14 and ViT-B/32, significantly\nimproved scene classification, achieving a top F1 score of 91.1%. These results\ndemonstrate the ability of the system to deliver rapid and precise scene\nrecognition, which can be used to meet the critical requirements of Advanced\nDriver Assistance Systems (ADAS). This study shows the potential of CLIP models\nto provide scalable and efficient frameworks for dynamic scene understanding\nand classification. 
Furthermore, this work lays the groundwork for advanced\nautonomous vehicle technologies by fostering a deeper understanding of driver\nbehavior, road conditions, and safety-critical scenarios, marking a significant\nstep toward smarter, safer, and more context-aware autonomous driving systems.\n","authors":["Mohammed Elhenawy","Huthaifa I. Ashqar","Andry Rakotonirainy","Taqwa I. Alhadidi","Ahmed Jaber","Mohammad Abu Tami"],"pdf_url":"https://arxiv.org/pdf/2501.05566v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.09566v3","updated":"2025-01-09T20:24:46Z","published":"2024-09-15T00:53:44Z","title":"Learning Transferable Features for Implicit Neural Representations","summary":" Implicit neural representations (INRs) have demonstrated success in a variety\nof applications, including inverse problems and neural rendering. An INR is\ntypically trained to capture one signal of interest, resulting in learned\nneural features that are highly attuned to that signal. Although such features\nare assumed to be less generalizable, we explore their transferability for\nfitting similar signals. We introduce a new INR training\nframework, STRAINER, that learns transferable features for fitting INRs to new\nsignals from a given distribution, faster and with better reconstruction\nquality. Owing to the sequential layer-wise affine operations in an INR, we\npropose to learn transferable representations by sharing initial encoder layers\nacross multiple INRs with independent decoder layers. At test time, the learned\nencoder representations are transferred as initialization for an otherwise\nrandomly initialized INR. We find STRAINER to yield extremely powerful\ninitialization for fitting images from the same domain and allow for $\approx\n+10dB$ gain in signal quality early on compared to an untrained INR itself.\nSTRAINER also provides a simple way to encode data-driven priors in INRs. 
We\nevaluate STRAINER on multiple in-domain and out-of-domain signal fitting tasks\nand inverse problems and further provide detailed analysis and discussion on\nthe transferability of STRAINER's features. Our demo can be accessed at\nhttps://kushalvyas.github.io/strainer.html .\n","authors":["Kushal Vyas","Ahmed Imtiaz Humayun","Aniket Dashpute","Richard G. Baraniuk","Ashok Veeraraghavan","Guha Balakrishnan"],"pdf_url":"https://arxiv.org/pdf/2409.09566v3.pdf","comment":"Project Website: https://kushalvyas.github.io/strainer.html"},{"id":"http://arxiv.org/abs/2412.20110v2","updated":"2025-01-09T20:24:29Z","published":"2024-12-28T10:40:21Z","title":"Cross-Modal Mapping: Eliminating the Modality Gap for Few-Shot Image\n Classification","summary":" In few-shot image classification tasks, methods based on pretrained\nvision-language models (such as CLIP) have achieved significant progress. Many\nexisting approaches directly utilize visual or textual features as class\nprototypes, however, these features fail to adequately represent their\nrespective classes. We identify that this limitation arises from the modality\ngap inherent in pretrained vision-language models, which weakens the connection\nbetween the visual and textual modalities. To eliminate this modality gap and\nenable textual features to fully represent class prototypes, we propose a\nsimple and efficient Cross-Modal Mapping (CMM) method. This method employs a\nlinear transformation to map image features into the textual feature space,\nensuring that both modalities are comparable within the same feature space.\nNevertheless, the modality gap diminishes the effectiveness of this mapping. 
To\naddress this, we further introduce a triplet loss to optimize the spatial\nrelationships between image features and class textual features, allowing class\ntextual features to naturally serve as class prototypes for image features.\nExperimental results on 11 benchmarks demonstrate an average improvement of\napproximately 3.5% compared to conventional methods and exhibit competitive\nperformance on 4 distribution shift benchmarks.\n","authors":["Xi Yang","Pai Peng","Wulin Xie","Xiaohuan Lu","Jie Wen"],"pdf_url":"https://arxiv.org/pdf/2412.20110v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.13969v2","updated":"2025-01-09T20:08:31Z","published":"2023-08-26T22:48:06Z","title":"Gaze-Informed Vision Transformers: Predicting Driving Decisions Under\n Uncertainty","summary":" Vision Transformers (ViT) have advanced computer vision, yet their efficacy\nin complex tasks like driving remains less explored. This study enhances ViT by\nintegrating human eye gaze, captured via eye-tracking, to increase prediction\naccuracy in driving scenarios under uncertainty in both real-world and virtual\nreality scenarios. First, we establish the significance of human eye gaze in\nleft-right driving decisions, as observed in both human subjects and a ViT\nmodel. By comparing the similarity between human fixation maps and ViT\nattention weights, we reveal the dynamics of overlap across individual heads\nand layers. This overlap demonstrates that fixation data can guide the model in\ndistributing its attention weights more effectively. We introduce the\nfixation-attention intersection (FAX) loss, a novel loss function that\nsignificantly improves ViT performance under high uncertainty conditions. Our\nresults show that ViT, when trained with FAX loss, aligns its attention with\nhuman gaze patterns. 
This gaze-informed approach has significant potential for\ndriver behavior analysis, as well as broader applications in human-centered AI\nsystems, extending ViT's use to complex visual environments.\n","authors":["Sharath Koorathota","Nikolas Papadopoulos","Jia Li Ma","Shruti Kumar","Xiaoxiao Sun","Arunesh Mittal","Patrick Adelman","Paul Sajda"],"pdf_url":"https://arxiv.org/pdf/2308.13969v2.pdf","comment":"25 pages, 9 figures, 3 tables"},{"id":"http://arxiv.org/abs/2501.05555v1","updated":"2025-01-09T20:02:10Z","published":"2025-01-09T20:02:10Z","title":"Improving Zero-Shot Object-Level Change Detection by Incorporating\n Visual Correspondence","summary":" Detecting object-level changes between two images across possibly different\nviews is a core task in many applications that involve visual inspection or\ncamera surveillance. Existing change-detection approaches suffer from three\nmajor limitations: (1) lack of evaluation on image pairs that contain no\nchanges, leading to unreported false positive rates; (2) lack of\ncorrespondences (i.e., localizing the regions before and after a change); and\n(3) poor zero-shot generalization across different domains. To address these\nissues, we introduce a novel method that leverages change correspondences (a)\nduring training to improve change detection accuracy, and (b) at test time, to\nminimize false positives. That is, we harness the supervision labels of where\nan object is added or removed to supervise change detectors, improving their\naccuracy over previous work by a large margin. Our work is also the first to\npredict correspondences between pairs of detected changes using estimated\nhomography and the Hungarian algorithm. 
Our model demonstrates superior\nperformance over existing methods, achieving state-of-the-art results in change\ndetection and change correspondence accuracy across both in-distribution and\nzero-shot benchmarks.\n","authors":["Hung Huy Nguyen","Pooyan Rahmanzadehgervi","Long Mail","Anh Totti Nguyen"],"pdf_url":"https://arxiv.org/pdf/2501.05555v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.08755v2","updated":"2025-01-09T19:15:20Z","published":"2024-12-11T19:54:14Z","title":"Proactive Adversarial Defense: Harnessing Prompt Tuning in\n Vision-Language Models to Detect Unseen Backdoored Images","summary":" Backdoor attacks pose a critical threat by embedding hidden triggers into\ninputs, causing models to misclassify them into target labels. While extensive\nresearch has focused on mitigating these attacks in object recognition models\nthrough weight fine-tuning, much less attention has been given to detecting\nbackdoored samples directly. Given the vast datasets used in training, manual\ninspection for backdoor triggers is impractical, and even state-of-the-art\ndefense mechanisms fail to fully neutralize their impact. To address this gap,\nwe introduce a groundbreaking method to detect unseen backdoored images during\nboth training and inference. 
Leveraging the transformative success of prompt\ntuning in Vision Language Models (VLMs), our approach trains learnable text\nprompts to differentiate clean images from those with hidden backdoor triggers.\nExperiments demonstrate the exceptional efficacy of this method, achieving an\nimpressive average accuracy of 86% across two renowned datasets for detecting\nunseen backdoor triggers, establishing a new standard in backdoor defense.\n","authors":["Kyle Stein","Andrew Arash Mahyari","Guillermo Francia","Eman El-Sheikh"],"pdf_url":"https://arxiv.org/pdf/2412.08755v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05510v1","updated":"2025-01-09T19:00:01Z","published":"2025-01-09T19:00:01Z","title":"OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video\n Understanding?","summary":" Temporal Awareness, the ability to reason dynamically based on the timestamp\nwhen a question is raised, is the key distinction between offline and online\nvideo LLMs. Unlike offline models, which rely on complete videos for static,\npost hoc analysis, online models process video streams incrementally and\ndynamically adapt their responses based on the timestamp at which the question\nis posed. Despite its significance, temporal awareness has not been adequately\nevaluated in existing benchmarks. To fill this gap, we present OVO-Bench\n(Online-VideO-Benchmark), a novel video benchmark that emphasizes the\nimportance of timestamps for advanced online video understanding capability\nbenchmarking. OVO-Bench evaluates the ability of video LLMs to reason and\nrespond to events occurring at specific timestamps under three distinct\nscenarios: (1) Backward tracing: trace back to past events to answer the\nquestion. (2) Real-time understanding: understand and respond to events as they\nunfold at the current timestamp. (3) Forward active responding: delay the\nresponse until sufficient future information becomes available to answer the\nquestion accurately. 
OVO-Bench comprises 12 tasks, featuring 644 unique videos\nand approximately 2,800 human-curated fine-grained meta-annotations with\nprecise timestamps. We combine automated generation pipelines with human\ncuration. With these high-quality samples, we further developed an evaluation\npipeline to systematically query video LLMs along the video timeline.\nEvaluations of nine Video-LLMs reveal that, despite advancements on traditional\nbenchmarks, current models struggle with online video understanding, showing a\nsignificant gap compared to human agents. We hope OVO-Bench will drive progress\nin video LLMs and inspire future research in online video reasoning. Our\nbenchmark and code can be accessed at https://github.com/JoeLeelyf/OVO-Bench.\n","authors":["Yifei Li","Junbo Niu","Ziyang Miao","Chunjiang Ge","Yuanhang Zhou","Qihao He","Xiaoyi Dong","Haodong Duan","Shuangrui Ding","Rui Qian","Pan Zhang","Yuhang Zang","Yuhang Cao","Conghui He","Jiaqi Wang"],"pdf_url":"https://arxiv.org/pdf/2501.05510v1.pdf","comment":"28 pages"},{"id":"http://arxiv.org/abs/2501.05014v1","updated":"2025-01-09T07:15:59Z","published":"2025-01-09T07:15:59Z","title":"UAV-VLA: Vision-Language-Action System for Large Scale Aerial Mission\n Generation","summary":" The UAV-VLA (Visual-Language-Action) system is a tool designed to facilitate\ncommunication with aerial robots. By integrating satellite imagery processing\nwith the Visual Language Model (VLM) and the powerful capabilities of GPT,\nUAV-VLA enables users to generate general flight path-and-action plans through\nsimple text requests. This system leverages the rich contextual information\nprovided by satellite images, allowing for enhanced decision-making and mission\nplanning. The combination of visual analysis by VLM and natural language\nprocessing by GPT can provide the user with the path-and-action set, making\naerial operations more efficient and accessible.
The newly developed method\nshowed a 22% difference in the length of the created trajectory and a mean\nerror of 34.22 m (Euclidean distance) in finding the objects of interest on a\nmap, using the K-Nearest Neighbors (KNN) approach.\n","authors":["Oleg Sautenkov","Yasheerah Yaqoot","Artem Lykov","Muhammad Ahsan Mustafa","Grik Tadevosyan","Aibek Akhmetkazy","Miguel Altamirano Cabrera","Mikhail Martynov","Sausar Karaf","Dzmitry Tsetserukou"],"pdf_url":"https://arxiv.org/pdf/2501.05014v1.pdf","comment":"HRI 2025"},{"id":"http://arxiv.org/abs/2501.04940v1","updated":"2025-01-09T03:14:03Z","published":"2025-01-09T03:14:03Z","title":"A New Perspective on Privacy Protection in Federated Learning with\n Granular-Ball Computing","summary":" Federated Learning (FL) facilitates collaborative model training while\nprioritizing privacy by avoiding direct data sharing. However, most existing\narticles attempt to address challenges within the model's internal parameters\nand corresponding outputs, while neglecting to solve them at the input level.\nTo address this gap, we propose a novel framework called Granular-Ball\nFederated Learning (GrBFL) for image classification. GrBFL diverges from\ntraditional methods that rely on the finest-grained input data. Instead, it\nsegments images into multiple regions with optimal coarse granularity, which\nare then reconstructed into a graph structure. We designed a two-dimensional\nbinary search segmentation algorithm based on variance constraints for GrBFL,\nwhich effectively removes redundant information while preserving key\nrepresentative features. Extensive theoretical analysis and experiments\ndemonstrate that GrBFL not only safeguards privacy and enhances efficiency but\nalso maintains robust utility, consistently outperforming other\nstate-of-the-art FL methods.
The code is available at\nhttps://github.com/AIGNLAI/GrBFL.\n","authors":["Guannan Lai","Yihui Feng","Xin Yang","Xiaoyu Deng","Hao Yu","Shuyin Xia","Guoyin Wang","Tianrui Li"],"pdf_url":"https://arxiv.org/pdf/2501.04940v1.pdf","comment":null}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2501.05366v1","updated":"2025-01-09T16:48:17Z","published":"2025-01-09T16:48:17Z","title":"Search-o1: Agentic Search-Enhanced Large Reasoning Models","summary":" Large reasoning models (LRMs) like OpenAI-o1 have demonstrated impressive\nlong stepwise reasoning capabilities through large-scale reinforcement\nlearning. However, their extended reasoning processes often suffer from\nknowledge insufficiency, leading to frequent uncertainties and potential\nerrors. To address this limitation, we introduce \\textbf{Search-o1}, a\nframework that enhances LRMs with an agentic retrieval-augmented generation\n(RAG) mechanism and a Reason-in-Documents module for refining retrieved\ndocuments. Search-o1 integrates an agentic search workflow into the reasoning\nprocess, enabling dynamic retrieval of external knowledge when LRMs encounter\nuncertain knowledge points. Additionally, due to the verbose nature of\nretrieved documents, we design a separate Reason-in-Documents module to deeply\nanalyze the retrieved information before injecting it into the reasoning chain,\nminimizing noise and preserving coherent reasoning flow. Extensive experiments\non complex reasoning tasks in science, mathematics, and coding, as well as six\nopen-domain QA benchmarks, demonstrate the strong performance of Search-o1.\nThis approach enhances the trustworthiness and applicability of LRMs in complex\nreasoning tasks, paving the way for more reliable and versatile intelligent\nsystems. 
The code is available at\n\\url{https://github.com/sunnynexus/Search-o1}.\n","authors":["Xiaoxi Li","Guanting Dong","Jiajie Jin","Yuyao Zhang","Yujia Zhou","Yutao Zhu","Peitian Zhang","Zhicheng Dou"],"pdf_url":"https://arxiv.org/pdf/2501.05366v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05289v1","updated":"2025-01-09T14:51:58Z","published":"2025-01-09T14:51:58Z","title":"Unraveling the Impact of Visual Complexity on Search as Learning","summary":" Information search has become essential for learning and knowledge\nacquisition, offering broad access to information and learning resources. The\nvisual complexity of web pages is known to influence search behavior, with\nprevious work suggesting that searchers make evaluative judgments within the\nfirst second on a page. However, there is a significant gap in our\nunderstanding of how visual complexity impacts searches specifically conducted\nwith a learning intent. This gap is particularly relevant for the development\nof optimized information retrieval (IR) systems that effectively support\neducational objectives. To address this research need, we model visual\ncomplexity and aesthetics via a diverse set of features, investigating their\nrelationship with search behavior during learning-oriented web sessions. Our\nstudy utilizes a publicly available dataset from a lab study where participants\nlearned about thunderstorm formation. Our findings reveal that while content\nrelevance is the most significant predictor for knowledge gain, sessions with\nless visually complex pages are associated with higher learning success. This\nobservation applies to features associated with the layout of web pages rather\nthan to simpler features (e.g., number of images). The reported results shed\nlight on the impact of visual complexity on learning-oriented searches,\ninforming the design of more effective IR systems for educational contexts. 
To\nfoster reproducibility, we release our source code\n(https://github.com/TIBHannover/sal_visual_complexity).\n","authors":["Wolfgang Gritz","Anett Hoppe","Ralph Ewerth"],"pdf_url":"https://arxiv.org/pdf/2501.05289v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05220v1","updated":"2025-01-09T13:13:24Z","published":"2025-01-09T13:13:24Z","title":"A Novel Approach to Scalable and Automatic Topic-Controlled Question\n Generation in Education","summary":" The development of Automatic Question Generation (QG) models has the\npotential to significantly improve educational practices by reducing the\nteacher workload associated with creating educational content. This paper\nintroduces a novel approach to educational question generation that controls\nthe topical focus of questions. The proposed Topic-Controlled Question\nGeneration (T-CQG) method enhances the relevance and effectiveness of the\ngenerated content for educational purposes. Our approach uses fine-tuning on a\npre-trained T5-small model, employing specially created datasets tailored to\neducational needs. The research further explores the impacts of pre-training\nstrategies, quantisation, and data augmentation on the model's performance. We\nspecifically address the challenge of generating semantically aligned questions\nwith paragraph-level contexts, thereby improving the topic specificity of the\ngenerated questions. In addition, we introduce and explore novel evaluation\nmethods to assess the topical relatedness of the generated questions. Our\nresults, validated through rigorous offline and human-backed evaluations,\ndemonstrate that the proposed models effectively generate high-quality,\ntopic-focused questions. These models have the potential to reduce teacher\nworkload and support personalised tutoring systems by serving as bespoke\nquestion generators. 
With their relatively small number of parameters, the\nproposed models not only advance the capabilities of question generation models\nfor handling specific educational topics but also offer a scalable solution\nthat reduces infrastructure costs. This scalability makes them feasible for\nwidespread use in education without reliance on proprietary large language\nmodels like ChatGPT.\n","authors":["Ziqing Li","Mutlu Cukurova","Sahan Bulathwela"],"pdf_url":"https://arxiv.org/pdf/2501.05220v1.pdf","comment":"To be published at ACM Conf. on Learning Analytics and Knowledge\n (LAK'25)"},{"id":"http://arxiv.org/abs/2501.05170v1","updated":"2025-01-09T11:44:49Z","published":"2025-01-09T11:44:49Z","title":"De-centering the (Traditional) User: Multistakeholder Evaluation of\n Recommender Systems","summary":" Multistakeholder recommender systems are those that account for the impacts\nand preferences of multiple groups of individuals, not just the end users\nreceiving recommendations. Due to their complexity, evaluating these systems\ncannot be restricted to the overall utility of a single stakeholder, as is\noften the case for more mainstream recommender system applications. In this\narticle, we focus our discussion on the intricacies of the evaluation of\nmultistakeholder recommender systems. We bring attention to the different\naspects involved in the evaluation of multistakeholder recommender systems -\nfrom the range of stakeholders involved (including but not limited to producers\nand consumers) to the values and specific goals of each relevant stakeholder.\nAdditionally, we discuss how to move from theoretical principles to practical\nimplementation, providing specific use case examples. Finally, we outline open\nresearch directions for the RecSys community to explore.
We aim to provide\nguidance to researchers and practitioners about how to think about these\ncomplex and domain-dependent issues of evaluation in the course of designing,\ndeveloping, and researching applications with multistakeholder aspects.\n","authors":["Robin Burke","Gediminas Adomavicius","Toine Bogers","Tommaso Di Noia","Dominik Kowald","Julia Neidhardt","Özlem Özgöbek","Maria Soledad Pera","Nava Tintarev","Jürgen Ziegler"],"pdf_url":"https://arxiv.org/pdf/2501.05170v1.pdf","comment":"Preprint submitted to Elsevier, \"Re-centering the User in Recommender\n System Research\" special issue of the International Journal of Human-Computer\n Studies (IJHCS)"},{"id":"http://arxiv.org/abs/2501.05082v1","updated":"2025-01-09T09:03:43Z","published":"2025-01-09T09:03:43Z","title":"Comparison of Feature Learning Methods for Metadata Extraction from PDF\n Scholarly Documents","summary":" The availability of metadata for scientific documents is pivotal in\npropelling scientific knowledge forward and for adhering to the FAIR principles\n(i.e. Findability, Accessibility, Interoperability, and Reusability) of\nresearch findings. However, the lack of sufficient metadata in published\ndocuments, particularly those from smaller and mid-sized publishers, hinders\ntheir accessibility. This issue is widespread in some disciplines, such as the\nGerman Social Sciences, where publications often employ diverse templates. To\naddress this challenge, our study evaluates various feature learning and\nprediction methods, including natural language processing (NLP), computer\nvision (CV), and multimodal approaches, for extracting metadata from documents\nwith high template variance. We aim to improve the accessibility of scientific\ndocuments and facilitate their wider use. To support our comparison of these\nmethods, we provide comprehensive experimental results, analyzing their\naccuracy and efficiency in extracting metadata. 
Additionally, we provide\nvaluable insights into the strengths and weaknesses of various feature learning\nand prediction methods, which can guide future research in this field.\n","authors":["Zeyd Boukhers","Cong Yang"],"pdf_url":"https://arxiv.org/pdf/2501.05082v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05072v1","updated":"2025-01-09T08:54:19Z","published":"2025-01-09T08:54:19Z","title":"A Flexible and Scalable Framework for Video Moment Search","summary":" Video moment search, the process of finding relevant moments in a video\ncorpus to match a user's query, is crucial for various applications. Existing\nsolutions, however, often assume a single perfect matching moment, struggle\nwith inefficient inference, and have limitations with hour-long videos. This\npaper introduces a flexible and scalable framework for retrieving a ranked list\nof moments from a collection of videos of any length to match a text query, a\ntask termed Ranked Video Moment Retrieval (RVMR). Our framework, called\nSegment-Proposal-Ranking (SPR), simplifies the search process into three\nindependent stages: segment retrieval, proposal generation, and moment\nrefinement with re-ranking. Specifically, videos are divided into equal-length\nsegments with precomputed embeddings indexed offline, allowing efficient\nretrieval regardless of video length. For scalable online retrieval, both\nsegments and queries are projected into a shared feature space to enable\napproximate nearest neighbor (ANN) search. Retrieved segments are then merged\ninto coarse-grained moment proposals. A refinement and re-ranking module then\nreorders and adjusts the timestamps of the coarse-grained proposals.\nEvaluations on the TVR-Ranking dataset demonstrate that our framework achieves\nstate-of-the-art performance with significant reductions in computational cost\nand processing time.
The flexible design also allows for independent\nimprovements to each stage, making SPR highly adaptable for large-scale\napplications.\n","authors":["Chongzhi Zhang","Xizhou Zhu","Aixin Sun"],"pdf_url":"https://arxiv.org/pdf/2501.05072v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05018v1","updated":"2025-01-09T07:21:44Z","published":"2025-01-09T07:21:44Z","title":"Finding Needles in Emb(a)dding Haystacks: Legal Document Retrieval via\n Bagging and SVR Ensembles","summary":" We introduce a retrieval approach leveraging Support Vector Regression (SVR)\nensembles, bootstrap aggregation (bagging), and embedding spaces on the German\nDataset for Legal Information Retrieval (GerDaLIR). By conceptualizing the\nretrieval task in terms of multiple binary needle-in-a-haystack subtasks, we\nshow improved recall over the baselines (0.849 > 0.803 | 0.829) using our\nvoting ensemble, suggesting promising initial results, without training or\nfine-tuning any deep learning models. Our approach holds potential for further\nenhancement, particularly through refining the encoding models and optimizing\nhyperparameters.\n","authors":["Kevin Bönisch","Alexander Mehler"],"pdf_url":"https://arxiv.org/pdf/2501.05018v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.08981v3","updated":"2025-01-09T00:53:45Z","published":"2024-08-16T19:10:48Z","title":"From Lazy to Prolific: Tackling Missing Labels in Open Vocabulary\n Extreme Classification by Positive-Unlabeled Sequence Learning","summary":" Open-vocabulary Extreme Multi-label Classification (OXMC) extends traditional\nXMC by allowing prediction beyond an extremely large, predefined label set\n(typically $10^3$ to $10^{12}$ labels), addressing the dynamic nature of\nreal-world labeling tasks. However, self-selection bias in data annotation\nleads to significant missing labels in both training and test data,\nparticularly for less popular inputs. 
This creates two critical challenges:\ngeneration models learn to be \"lazy\" by under-generating labels, and\nevaluation becomes unreliable due to insufficient annotation in the test set.\nIn this work, we introduce Positive-Unlabeled Sequence Learning (PUSL), which\nreframes OXMC as an infinite keyphrase generation task, addressing the\ngeneration model's laziness. Additionally, we propose to adopt a suite of\nevaluation metrics, F1@$\mathcal{O}$ and the newly proposed B@$k$, to reliably\nassess OXMC models with incomplete ground truths. On a highly imbalanced\ne-commerce dataset with substantial missing labels, PUSL generates 30% more\nunique labels, and 72% of its predictions align with actual user queries. On\nthe less skewed EURLex-4.3k dataset, PUSL demonstrates superior F1 scores,\nespecially as label counts increase from 15 to 30. Our approach effectively\ntackles both the modeling and evaluation challenges in OXMC with missing\nlabels.\n","authors":["Ranran Haoran Zhang","Bensu Uçar","Soumik Dey","Hansi Wu","Binbin Li","Rui Zhang"],"pdf_url":"https://arxiv.org/pdf/2408.08981v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05606v1","updated":"2025-01-09T22:48:43Z","published":"2025-01-09T22:48:43Z","title":"Harmonizing Metadata of Language Resources for Enhanced Querying and\n Accessibility","summary":" This paper addresses the harmonization of metadata from diverse repositories\nof language resources (LRs). Leveraging linked data and RDF techniques, we\nintegrate data from multiple sources into a unified model based on DCAT and\nthe META-SHARE OWL ontology. Our methodology supports text-based search, faceted\nbrowsing, and advanced SPARQL queries through Linghub, a newly developed\nportal. Real user queries from the Corpora Mailing List (CML) were evaluated to\nassess Linghub's capability to satisfy actual user needs. Results indicate that\nwhile some limitations persist, many user requests can be successfully\naddressed.
The study highlights significant metadata issues and advocates for\nadherence to open vocabularies and standards to enhance metadata harmonization.\nThis initial research underscores the importance of API-based access to LRs,\npromoting machine usability and data subset extraction for specific purposes,\npaving the way for more efficient and standardized LR utilization.\n","authors":["Zixuan Liang"],"pdf_url":"https://arxiv.org/pdf/2501.05606v1.pdf","comment":"2024 5th International Conference on Computers and Artificial\n Intelligence Technology (CAIT 2024)"},{"id":"http://arxiv.org/abs/2404.09889v3","updated":"2025-01-09T22:43:05Z","published":"2024-04-15T15:55:01Z","title":"Is Table Retrieval a Solved Problem? Exploring Join-Aware Multi-Table\n Retrieval","summary":" Retrieving relevant tables containing the necessary information to accurately\nanswer a given question over tables is critical to open-domain\nquestion-answering (QA) systems. Previous methods assume the answer to such a\nquestion can be found either in a single table or multiple tables identified\nthrough question decomposition or rewriting. However, neither of these\napproaches is sufficient, as many questions require retrieving multiple tables\nand joining them through a join plan that cannot be discerned from the user\nquery itself. If the join plan is not considered in the retrieval stage, the\nsubsequent steps of reasoning and answering based on those retrieved tables are\nlikely to be incorrect. To address this problem, we introduce a method that\nuncovers useful join relations for any query and database during table\nretrieval. We use a novel re-ranking method formulated as a mixed-integer\nprogram that considers not only table-query relevance but also table-table\nrelevance that requires inferring join relationships. 
Our method outperforms\nthe state-of-the-art approaches for table retrieval by up to 9.3% in F1 score\nand for end-to-end QA by up to 5.4% in accuracy.\n","authors":["Peter Baile Chen","Yi Zhang","Dan Roth"],"pdf_url":"https://arxiv.org/pdf/2404.09889v3.pdf","comment":"ACL 2024. Dataset and code are available at\n https://peterbaile.github.io/jar"},{"id":"http://arxiv.org/abs/2405.17428v2","updated":"2025-01-09T22:27:06Z","published":"2024-05-27T17:59:45Z","title":"NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding\n Models","summary":" Decoder-only large language model (LLM)-based embedding models are beginning\nto outperform BERT or T5-based embedding models in general-purpose text\nembedding tasks, including dense vector-based retrieval. In this work, we\nintroduce the NV-Embed model, incorporating architectural designs, training\nprocedures, and curated datasets to significantly enhance the performance of\nthe LLM as a versatile embedding model, while maintaining its simplicity and\nreproducibility. For the model architecture, we propose a latent attention\nlayer to obtain pooled embeddings, which consistently improves retrieval and\ndownstream task accuracy compared to mean pooling or using the last token\nembedding from LLMs. To enhance representation learning, we remove the causal\nattention mask of LLMs during contrastive training. For the training algorithm,\nwe introduce a two-stage contrastive instruction-tuning method. It first\napplies contrastive training with instructions on retrieval datasets, utilizing\nin-batch negatives and curated hard negative examples. At stage two, it blends\nvarious non-retrieval datasets into instruction tuning, which not only enhances\nnon-retrieval task accuracy but also improves retrieval performance. For\ntraining data, we utilize hard-negative mining, synthetic data generation, and\nexisting publicly available datasets to boost the performance of the embedding\nmodel.
By combining these\ntechniques, our NV-Embed-v1 and NV-Embed-v2 models obtained the No.1 position\non the Massive Text Embedding Benchmark (MTEB) (as of May 24, 2024 and August\n30, 2024, respectively) across 56 embedding tasks, demonstrating the sustained\neffectiveness of the proposed methods over time. Additionally, they achieved\nthe highest scores in the Long Doc section and the second-highest scores in the\nQA section of the AIR Benchmark, which covers a range of out-of-domain\ninformation retrieval topics beyond those in MTEB.\n","authors":["Chankyu Lee","Rajarshi Roy","Mengyao Xu","Jonathan Raiman","Mohammad Shoeybi","Bryan Catanzaro","Wei Ping"],"pdf_url":"https://arxiv.org/pdf/2405.17428v2.pdf","comment":"We open-source the model at:\n https://huggingface.co/nvidia/NV-Embed-v2"},{"id":"http://arxiv.org/abs/2501.05497v1","updated":"2025-01-09T17:20:00Z","published":"2025-01-09T17:20:00Z","title":"Spatial Information Integration in Small Language Models for Document\n Layout Generation and Classification","summary":" Document layout understanding is a field of study that analyzes the spatial\narrangement of information in a document, hoping to understand its structure\nand layout. Models such as LayoutLM (and its subsequent iterations) can\nunderstand semi-structured documents with SotA results; however, the lack of\nopen semi-structured data is a limitation in itself. While semi-structured data\nis common in everyday life (balance sheets, purchase orders, receipts), there\nis a lack of public datasets for training machine learning models for this type\nof document. In this investigation we propose a method to generate new,\nsynthetic layout information that can help overcome this data shortage.\nAccording to our results, the proposed method performs better than\nLayoutTransformer, another popular layout generation method.
We also show that, in some scenarios,\ntext classification can improve when supported by bounding box information.\n","authors":["Pablo Melendez","Clemens Havas"],"pdf_url":"https://arxiv.org/pdf/2501.05497v1.pdf","comment":"8 pages. Symposium on Applied Computing 2025"}],"Machine Learning":[{"id":"http://arxiv.org/abs/2501.05450v1","updated":"2025-01-09T18:59:56Z","published":"2025-01-09T18:59:56Z","title":"Decentralized Diffusion Models","summary":" Large-scale AI model training divides work across thousands of GPUs, then\nsynchronizes gradients across them at each step. This incurs a significant\nnetwork burden that only centralized, monolithic clusters can support, driving\nup infrastructure costs and straining power systems. We propose Decentralized\nDiffusion Models, a scalable framework for distributing diffusion model\ntraining across independent clusters or datacenters by eliminating the\ndependence on a centralized, high-bandwidth networking fabric. Our method\ntrains a set of expert diffusion models over partitions of the dataset, each in\nfull isolation from one another. At inference time, the experts ensemble\nthrough a lightweight router. We show that the ensemble collectively optimizes\nthe same objective as a single model trained over the whole dataset. This means\nwe can divide the training burden among a number of \"compute islands,\" lowering\ninfrastructure costs and improving resilience to localized GPU failures.\nDecentralized diffusion models empower researchers to take advantage of\nsmaller, more cost-effective and more readily available compute like on-demand\nGPU nodes rather than central integrated systems. We conduct extensive\nexperiments on ImageNet and LAION Aesthetics, showing that decentralized\ndiffusion models FLOP-for-FLOP outperform standard diffusion models. 
We finally\nscale our approach to 24 billion parameters, demonstrating that high-quality\ndiffusion models can now be trained with just eight individual GPU nodes in\nless than a week.\n","authors":["David McAllister","Matthew Tancik","Jiaming Song","Angjoo Kanazawa"],"pdf_url":"https://arxiv.org/pdf/2501.05450v1.pdf","comment":"Project webpage: https://decentralizeddiffusion.github.io/"},{"id":"http://arxiv.org/abs/2501.05445v1","updated":"2025-01-09T18:56:05Z","published":"2025-01-09T18:56:05Z","title":"Consistent Flow Distillation for Text-to-3D Generation","summary":" Score Distillation Sampling (SDS) has made significant strides in distilling\nimage-generative models for 3D generation. However, its\nmaximum-likelihood-seeking behavior often leads to degraded visual quality and\ndiversity, limiting its effectiveness in 3D applications. In this work, we\npropose Consistent Flow Distillation (CFD), which addresses these limitations.\nWe begin by leveraging the gradient of the diffusion ODE or SDE sampling\nprocess to guide the 3D generation. From the gradient-based sampling\nperspective, we find that the consistency of 2D image flows across different\nviewpoints is important for high-quality 3D generation. To achieve this, we\nintroduce multi-view consistent Gaussian noise on the 3D object, which can be\nrendered from various viewpoints to compute the flow gradient. Our experiments\ndemonstrate that CFD, through consistent flows, significantly outperforms\nprevious methods in text-to-3D generation.\n","authors":["Runjie Yan","Yinbo Chen","Xiaolong Wang"],"pdf_url":"https://arxiv.org/pdf/2501.05445v1.pdf","comment":"Project page: https://runjie-yan.github.io/cfd/"},{"id":"http://arxiv.org/abs/2501.05441v1","updated":"2025-01-09T18:53:06Z","published":"2025-01-09T18:53:06Z","title":"The GAN is dead; long live the GAN! 
A Modern GAN Baseline","summary":" There is a widely-spread claim that GANs are difficult to train, and GAN\narchitectures in the literature are littered with empirical tricks. We provide\nevidence against this claim and build a modern GAN baseline in a more\nprincipled manner. First, we derive a well-behaved regularized relativistic GAN\nloss that addresses issues of mode dropping and non-convergence that were\npreviously tackled via a bag of ad-hoc tricks. We analyze our loss\nmathematically and prove that it admits local convergence guarantees, unlike\nmost existing relativistic losses. Second, our new loss allows us to discard\nall ad-hoc tricks and replace outdated backbones used in common GANs with\nmodern architectures. Using StyleGAN2 as an example, we present a roadmap of\nsimplification and modernization that results in a new minimalist baseline --\nR3GAN. Despite being simple, our approach surpasses StyleGAN2 on FFHQ,\nImageNet, CIFAR, and Stacked MNIST datasets, and compares favorably against\nstate-of-the-art GANs and diffusion models.\n","authors":["Yiwen Huang","Aaron Gokaslan","Volodymyr Kuleshov","James Tompkin"],"pdf_url":"https://arxiv.org/pdf/2501.05441v1.pdf","comment":"Accepted to NeurIPS 2024. Code available at\n https://github.com/brownvc/R3GAN/"},{"id":"http://arxiv.org/abs/2501.05439v1","updated":"2025-01-09T18:49:39Z","published":"2025-01-09T18:49:39Z","title":"From Simple to Complex Skills: The Case of In-Hand Object Reorientation","summary":" Learning policies in simulation and transferring them to the real world has\nbecome a promising approach in dexterous manipulation. However, bridging the\nsim-to-real gap for each new task requires substantial human effort, such as\ncareful reward engineering, hyperparameter tuning, and system identification.\nIn this work, we present a system that leverages low-level skills to address\nthese challenges for more complex tasks. 
Specifically, we introduce a\nhierarchical policy for in-hand object reorientation based on previously\nacquired rotation skills. This hierarchical policy learns to select which\nlow-level skill to execute based on feedback from both the environment and the\nlow-level skill policies themselves. Compared to learning from scratch, the\nhierarchical policy is more robust to out-of-distribution changes and transfers\neasily from simulation to real-world environments. Additionally, we propose a\ngeneralizable object pose estimator that uses proprioceptive information,\nlow-level skill predictions, and control errors as inputs to estimate the\nobject pose over time. We demonstrate that our system can reorient objects,\nincluding symmetrical and textureless ones, to a desired pose.\n","authors":["Haozhi Qi","Brent Yi","Mike Lambeta","Yi Ma","Roberto Calandra","Jitendra Malik"],"pdf_url":"https://arxiv.org/pdf/2501.05439v1.pdf","comment":"website: https://dexhier.github.io"},{"id":"http://arxiv.org/abs/2412.11526v3","updated":"2025-01-09T18:44:52Z","published":"2024-12-16T08:01:22Z","title":"Probabilities-Informed Machine Learning","summary":" Machine learning (ML) has emerged as a powerful tool for tackling complex\nregression and classification tasks, yet its success often hinges on the\nquality of training data. This study introduces an ML paradigm inspired by\ndomain knowledge of the structure of output function, akin to physics-informed\nML, but rooted in probabilistic principles rather than physical laws. The\nproposed approach integrates the probabilistic structure of the target variable\n(such as its cumulative distribution function) into the training process. This\nprobabilistic information is obtained from historical data or estimated using\nstructural reliability methods during experimental design. 
By embedding\ndomain-specific probabilistic insights into the learning process, the technique\nenhances model accuracy and mitigates risks of overfitting and underfitting.\nApplications in regression, image denoising, and classification demonstrate the\napproach's effectiveness in addressing real-world problems.\n","authors":["Mohsen Rashki"],"pdf_url":"https://arxiv.org/pdf/2412.11526v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05425v1","updated":"2025-01-09T18:31:35Z","published":"2025-01-09T18:31:35Z","title":"Entangled Mean Estimation in High-Dimensions","summary":" We study the task of high-dimensional entangled mean estimation in the\nsubset-of-signals model. Specifically, given $N$ independent random points\n$x_1,\\ldots,x_N$ in $\\mathbb{R}^D$ and a parameter $\\alpha \\in (0, 1)$ such\nthat each $x_i$ is drawn from a Gaussian with mean $\\mu$ and unknown\ncovariance, and an unknown $\\alpha$-fraction of the points have\nidentity-bounded covariances, the goal is to estimate the common mean $\\mu$.\nThe one-dimensional version of this task has received significant attention in\ntheoretical computer science and statistics over the past decades. Recent work\n[LY20; CV24] has given near-optimal upper and lower bounds for the\none-dimensional setting. On the other hand, our understanding of even the\ninformation-theoretic aspects of the multivariate setting has remained limited.\n In this work, we design a computationally efficient algorithm achieving an\ninformation-theoretically near-optimal error. Specifically, we show that the\noptimal error (up to polylogarithmic factors) is $f(\\alpha,N) + \\sqrt{D/(\\alpha\nN)}$, where the term $f(\\alpha,N)$ is the error of the one-dimensional problem\nand the second term is the sub-Gaussian error rate. Our algorithmic approach\nemploys an iterative refinement strategy, whereby we progressively learn more\naccurate approximations $\\hat \\mu$ to $\\mu$. 
This is achieved via a novel\nrejection sampling procedure that removes points significantly deviating from\n$\\hat \\mu$, as an attempt to filter out unusually noisy samples. A complication\nthat arises is that rejection sampling introduces bias in the distribution of\nthe remaining points. To address this issue, we perform a careful analysis of\nthe bias, develop an iterative dimension-reduction strategy, and employ a novel\nsubroutine inspired by list-decodable learning that leverages the\none-dimensional result.\n","authors":["Ilias Diakonikolas","Daniel M. Kane","Sihan Liu","Thanasis Pittas"],"pdf_url":"https://arxiv.org/pdf/2501.05425v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05423v1","updated":"2025-01-09T18:30:14Z","published":"2025-01-09T18:30:14Z","title":"Using LLMs to Infer Non-Binary COVID-19 Sentiments of Chinese\n Micro-bloggers","summary":" Studying public sentiment during crises is crucial for understanding how\nopinions and sentiments shift, resulting in polarized societies. We study\nWeibo, the most popular microblogging site in China, using posts made during\nthe outbreak of the COVID-19 crisis. The study period includes the pre-COVID-19\nstage, the outbreak stage, and the early stage of epidemic prevention. We use\nLlama 3 8B, a Large Language Model, to analyze users' sentiments on the\nplatform by classifying them into positive, negative, sarcastic, and neutral\ncategories. Analyzing sentiment shifts on Weibo provides insights into how\nsocial events and government actions influence public opinion. This study\ncontributes to understanding the dynamics of social sentiments during health\ncrises, fulfilling a gap in sentiment analysis for Chinese platforms. By\nexamining these dynamics, we aim to offer valuable perspectives on digital\ncommunication's role in shaping society's responses during unprecedented global\nchallenges.\n","authors":["Jerry Chongyi Hu","Mohammed Shahid Modi","Boleslaw K. 
Szymanski"],"pdf_url":"https://arxiv.org/pdf/2501.05423v1.pdf","comment":"11 pages, 4 figures"},{"id":"http://arxiv.org/abs/2501.05415v1","updated":"2025-01-09T18:17:27Z","published":"2025-01-09T18:17:27Z","title":"Uncertainty-aware Knowledge Tracing","summary":" Knowledge Tracing (KT) is crucial in education assessment, which focuses on\ndepicting students' learning states and assessing students' mastery of\nsubjects. With the rise of modern online learning platforms, particularly\nmassive open online courses (MOOCs), an abundance of interaction data has\ngreatly advanced the development of the KT technology. Previous research\ncommonly adopts deterministic representation to capture students' knowledge\nstates, which neglects the uncertainty during student interactions and thus\nfails to model the true knowledge state in learning process. In light of this,\nwe propose an Uncertainty-Aware Knowledge Tracing model (UKT) which employs\nstochastic distribution embeddings to represent the uncertainty in student\ninteractions, with a Wasserstein self-attention mechanism designed to capture\nthe transition of state distribution in student learning behaviors.\nAdditionally, we introduce the aleatory uncertainty-aware contrastive learning\nloss, which strengthens the model's robustness towards different types of\nuncertainties. 
Extensive experiments on six real-world datasets demonstrate\nthat UKT not only significantly surpasses existing deep learning-based models\nin KT prediction, but also shows unique advantages in handling the uncertainty\nof student interactions.\n","authors":["Weihua Cheng","Hanwen Du","Chunxiao Li","Ersheng Ni","Liangdi Tan","Tianqi Xu","Yongxin Ni"],"pdf_url":"https://arxiv.org/pdf/2501.05415v1.pdf","comment":"Accepted by AAAI 2025"},{"id":"http://arxiv.org/abs/2412.18234v2","updated":"2025-01-09T18:16:38Z","published":"2024-12-24T07:35:48Z","title":"Conditional Deep Canonical Time Warping","summary":" Temporal alignment of sequences is a fundamental challenge in many\napplications, such as computer vision and bioinformatics, where local time\nshifting needs to be accounted for. Misalignment can lead to poor model\ngeneralization, especially in high-dimensional sequences. Existing methods\noften struggle with optimization when dealing with high-dimensional sparse\ndata, falling into poor alignments. Feature selection is frequently used to\nenhance model performance for sparse data. However, a fixed set of selected\nfeatures would not generally work for dynamically changing sequences and would\nneed to be modified based on the state of the sequence. Therefore, modifying\nthe selected features based on contextual input would result in better\nalignment. Our suggested method, Conditional Deep Canonical Time Warping\n(CDCTW), is designed for temporal alignment in sparse temporal data to\naddress these challenges. CDCTW enhances alignment accuracy for\nhigh-dimensional time-dependent views by performing dynamic time warping on\ndata embedded in a maximally correlated subspace, which handles sparsity with a\nnovel feature selection method. 
We validate the effectiveness of CDCTW through\nextensive experiments on various datasets, demonstrating superior performance\nover previous techniques.\n","authors":["Afek Steinberg","Ran Eisenberg","Ofir Lindenbaum"],"pdf_url":"https://arxiv.org/pdf/2412.18234v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05409v1","updated":"2025-01-09T18:06:45Z","published":"2025-01-09T18:06:45Z","title":"A Novel Pathology Foundation Model by Mayo Clinic, Charité, and\n Aignostics","summary":" Recent advances in digital pathology have demonstrated the effectiveness of\nfoundation models across diverse applications. In this report, we present a\nnovel vision foundation model based on the RudolfV approach. Our model was\ntrained on a dataset comprising 1.2 million histopathology whole slide images,\ncollected from two medical institutions: Mayo Clinic and Charit\\'e -\nUniverst\\\"atsmedizin Berlin. Comprehensive evaluations show that our model\nachieves state-of-the-art performance across twenty-one public benchmark\ndatasets, even though it is neither the largest model by parameter count nor by\ntraining dataset size.\n","authors":["Maximilian Alber","Stephan Tietz","Jonas Dippel","Timo Milbich","Timothée Lesort","Panos Korfiatis","Moritz Krügener","Beatriz Perez Cancer","Neelay Shah","Alexander Möllers","Philipp Seegerer","Alexandra Carpen-Amarie","Kai Standvoss","Gabriel Dernbach","Edwin de Jong","Simon Schallenberg","Andreas Kunft","Helmut Hoffer von Ankershoffen","Gavin Schaeferle","Patrick Duffy","Matt Redlon","Philipp Jurmeister","David Horst","Lukas Ruff","Klaus-Robert Müller","Frederick Klauschen","Andrew Norgan"],"pdf_url":"https://arxiv.org/pdf/2501.05409v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05408v1","updated":"2025-01-09T18:05:33Z","published":"2025-01-09T18:05:33Z","title":"TimeRL: Efficient Deep Reinforcement Learning with Polyhedral Dependence\n Graphs","summary":" Modern deep learning (DL) workloads increasingly use complex 
deep\nreinforcement learning (DRL) algorithms that generate training data within the\nlearning loop. This results in programs with several nested loops and dynamic\ndata dependencies between tensors. While DL systems with eager execution\nsupport such dynamism, they lack the optimizations and smart scheduling of\ngraph-based execution. Graph-based execution, however, cannot express dynamic\ntensor shapes, instead requiring the use of multiple static subgraphs. Either\nexecution model for DRL thus leads to redundant computation, reduced\nparallelism, and less efficient memory management.\n We describe TimeRL, a system for executing dynamic DRL programs that combines\nthe dynamism of eager execution with the whole-program optimizations and\nscheduling of graph-based execution. TimeRL achieves this by introducing the\ndeclarative programming model of recurrent tensors, which allows users to\ndefine dynamic dependencies as intuitive recurrence equations. TimeRL\ntranslates recurrent tensors into a polyhedral dependence graph (PDG) with\ndynamic dependencies as symbolic expressions. Through simple PDG\ntransformations, TimeRL applies whole-program optimizations, such as automatic\nvectorization, incrementalization, and operator fusion. The PDG also allows for\nthe computation of an efficient program-wide execution schedule, which decides\non buffer deallocations, buffer donations, and GPU/CPU memory swapping. We show\nthat TimeRL executes current DRL algorithms up to 47$\\times$ faster than\nexisting DRL systems, while using 16$\\times$ less GPU peak memory.\n","authors":["Pedro F. 
Silvestre","Peter Pietzuch"],"pdf_url":"https://arxiv.org/pdf/2501.05408v1.pdf","comment":"17 pages, 11 figures, 5 bibliography pages"},{"id":"http://arxiv.org/abs/2501.05407v1","updated":"2025-01-09T18:05:05Z","published":"2025-01-09T18:05:05Z","title":"On-line Policy Improvement using Monte-Carlo Search","summary":" We present a Monte-Carlo simulation algorithm for real-time policy\nimprovement of an adaptive controller. In the Monte-Carlo simulation, the\nlong-term expected reward of each possible action is statistically measured,\nusing the initial policy to make decisions in each step of the simulation. The\naction maximizing the measured expected reward is then taken, resulting in an\nimproved policy. Our algorithm is easily parallelizable and has been\nimplemented on the IBM SP1 and SP2 parallel-RISC supercomputers.\n We have obtained promising initial results in applying this algorithm to the\ndomain of backgammon. Results are reported for a wide variety of initial\npolicies, ranging from a random policy to TD-Gammon, an extremely strong\nmulti-layer neural network. In each case, the Monte-Carlo algorithm gives a\nsubstantial reduction, by as much as a factor of 5 or more, in the error rate\nof the base players. The algorithm is also potentially useful in many other\nadaptive control applications in which it is possible to simulate the\nenvironment.\n","authors":["Gerald Tesauro","Gregory R. 
Galperin"],"pdf_url":"https://arxiv.org/pdf/2501.05407v1.pdf","comment":"Accompanied by oral presentation by Gregory Galperin at NeurIPS 1996\n (then known as NIPS*96)"},{"id":"http://arxiv.org/abs/2405.13536v2","updated":"2025-01-09T17:58:44Z","published":"2024-05-22T11:14:00Z","title":"Attention Mechanisms Don't Learn Additive Models: Rethinking Feature\n Importance for Transformers","summary":" We address the critical challenge of applying feature attribution methods to\nthe transformer architecture, which dominates current applications in natural\nlanguage processing and beyond. Traditional attribution methods in explainable\nAI (XAI) explicitly or implicitly rely on linear or additive surrogate models\nto quantify the impact of input features on a model's output. In this work, we\nformally prove an alarming incompatibility: transformers are structurally\nincapable of representing the linear or additive surrogate models used for feature\nattribution, undermining the grounding of these conventional explanation\nmethodologies. To address this discrepancy, we introduce the Softmax-Linked\nAdditive Log Odds Model (SLALOM), a novel surrogate model specifically designed\nto align with the transformer framework. SLALOM demonstrates the capacity to\ndeliver a range of insightful explanations with both synthetic and real-world\ndatasets. We highlight SLALOM's unique efficiency-quality curve by showing that\nSLALOM can produce explanations with substantially higher fidelity than\ncompeting surrogate models or provide explanations of comparable quality at a\nfraction of their computational costs. 
We release code for SLALOM as an\nopen-source project online at https://github.com/tleemann/slalom_explanations.\n","authors":["Tobias Leemann","Alina Fastowski","Felix Pfeiffer","Gjergji Kasneci"],"pdf_url":"https://arxiv.org/pdf/2405.13536v2.pdf","comment":"TMLR Camera-Ready version"},{"id":"http://arxiv.org/abs/2501.05403v1","updated":"2025-01-09T17:57:56Z","published":"2025-01-09T17:57:56Z","title":"TimeDP: Learning to Generate Multi-Domain Time Series with Domain\n Prompts","summary":" Time series generation models are crucial for applications like data\naugmentation and privacy preservation. Existing time series generation\nmodels are typically designed to generate data from one specified domain. While\nleveraging data from other domains for better generalization has proven to work\nin other application areas, this approach remains challenging for time series\nmodeling due to the large divergence in patterns among different real-world\ntime series categories. In this paper, we propose a multi-domain time series\ndiffusion model with domain prompts, named TimeDP. In TimeDP, we utilize a time\nseries semantic prototype module which defines time series prototypes to\nrepresent a time series basis, each prototype vector serving as a \"word\"\nrepresenting some elementary time series feature. A prototype assignment module\nis applied to extract the domain-specific prototype weights for\nlearning domain prompts as the generation condition. During sampling, we\nextract the \"domain prompt\" from few-shot samples of the target domain and use\nthe domain prompts as the condition to generate time series samples. 
Experiments demonstrate\nthat our method outperforms baselines to provide the state-of-the-art in-domain\ngeneration quality and strong unseen domain generation capability.\n","authors":["Yu-Hao Huang","Chang Xu","Yueying Wu","Wu-Jun Li","Jiang Bian"],"pdf_url":"https://arxiv.org/pdf/2501.05403v1.pdf","comment":"AAAI 2025"},{"id":"http://arxiv.org/abs/2408.01857v2","updated":"2025-01-09T17:54:15Z","published":"2024-08-03T20:00:36Z","title":"Using Linearized Optimal Transport to Predict the Evolution of\n Stochastic Particle Systems","summary":" We develop an algorithm to approximate the time evolution of a probability\ndistribution without explicitly learning an operator that governs the\nevolution. A particular application of interest is discrete measures $\\mu_t^N$\nthat arise from systems of $N$ particles in $\\mathbb R^d$. In many such\nsituations, the individual particles move chaotically on short time scales,\nmaking it difficult to learn the dynamics of a governing operator, but the bulk\ndistribution $\\mu_t^N$ approximates an absolutely continuous measure $\\mu_t$\nthat evolves ``smoothly.'' If $\\mu_t$ is known on some time interval, then\nlinearized optimal transport theory provides an Euler-like scheme for\napproximating the evolution of $\\mu_t$ using its ``tangent vector field''\n(represented as a time-dependent vector field on $\\mathbb R^d$), which can be\ncomputed as a limit of optimal transport maps. We propose an analog of this\nEuler approximation to predict the evolution of the discrete measure $\\mu_t^N$\n(without knowing $\\mu_t$). To approximate the analogous tangent vector field,\nwe use a finite difference over a time step that sits between two time scales\nof the system -- long enough for a large-$N$ evolution ($\\mu_t$) to emerge but\nshort enough to satisfactorily approximate the derivative object used in the\nEuler scheme. 
The emergence of the limiting behavior ensures the optimal\ntransport maps closely approximate the vector field describing the bulk\ndistribution's smooth evolution instead of the individual particles' more\nchaotic movements. We demonstrate the efficacy of our approach with two\nillustrative examples, Gaussian diffusion and a cell chemotaxis model, and show\nthat our method succeeds in predicting the bulk behavior over relatively large\nsteps.\n","authors":["Nicholas Karris","Evangelos A. Nikitopoulos","Ioannis Kevrekidis","Seungjoon Lee","Alexander Cloninger"],"pdf_url":"https://arxiv.org/pdf/2408.01857v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05401v1","updated":"2025-01-09T17:50:56Z","published":"2025-01-09T17:50:56Z","title":"BRATI: Bidirectional Recurrent Attention for Time-Series Imputation","summary":" Missing data in time-series analysis poses significant challenges, affecting\nthe reliability of downstream applications. Imputation, the process of\nestimating missing values, has emerged as a key solution. This paper introduces\nBRATI, a novel deep-learning model designed to address multivariate time-series\nimputation by combining Bidirectional Recurrent Networks and Attention\nmechanisms. BRATI processes temporal dependencies and feature correlations\nacross long and short time horizons, utilizing two imputation blocks that\noperate in opposite temporal directions. Each block integrates recurrent layers\nand attention mechanisms to effectively resolve long-term dependencies.\n We evaluate BRATI on three real-world datasets under diverse missing-data\nscenarios: randomly missing values, fixed-length missing sequences, and\nvariable-length missing sequences. Our findings demonstrate that BRATI\nconsistently outperforms state-of-the-art models, delivering superior accuracy\nand robustness in imputing multivariate time-series data.\n","authors":["Armando Collado-Villaverde","Pablo Muñoz","Maria D. 
R-Moreno"],"pdf_url":"https://arxiv.org/pdf/2501.05401v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05398v1","updated":"2025-01-09T17:47:34Z","published":"2025-01-09T17:47:34Z","title":"Mechanistic understanding and validation of large AI models with\n SemanticLens","summary":" Unlike human-engineered systems such as aeroplanes, where each component's\nrole and dependencies are well understood, the inner workings of AI models\nremain largely opaque, hindering verifiability and undermining trust. This\npaper introduces SemanticLens, a universal explanation method for neural\nnetworks that maps hidden knowledge encoded by components (e.g., individual\nneurons) into the semantically structured, multimodal space of a foundation\nmodel such as CLIP. In this space, unique operations become possible, including\n(i) textual search to identify neurons encoding specific concepts, (ii)\nsystematic analysis and comparison of model representations, (iii) automated\nlabelling of neurons and explanation of their functional roles, and (iv) audits\nto validate decision-making against requirements. Fully scalable and operating\nwithout human input, SemanticLens is shown to be effective for debugging and\nvalidation, summarizing model knowledge, aligning reasoning with expectations\n(e.g., adherence to the ABCDE-rule in melanoma classification), and detecting\ncomponents tied to spurious correlations and their associated training data. By\nenabling component-level understanding and validation, the proposed approach\nhelps bridge the \"trust gap\" between AI models and traditional engineered\nsystems. 
We provide code for SemanticLens on\nhttps://github.com/jim-berend/semanticlens and a demo on\nhttps://semanticlens.hhi-research-insights.eu.\n","authors":["Maximilian Dreyer","Jim Berend","Tobias Labarta","Johanna Vielhaben","Thomas Wiegand","Sebastian Lapuschkin","Wojciech Samek"],"pdf_url":"https://arxiv.org/pdf/2501.05398v1.pdf","comment":"74 pages (18 pages manuscript, 7 pages references, 49 pages appendix)"},{"id":"http://arxiv.org/abs/2110.01593v7","updated":"2025-01-09T17:28:02Z","published":"2021-10-04T17:41:53Z","title":"Generalized Kernel Thinning","summary":" The kernel thinning (KT) algorithm of Dwivedi and Mackey (2021) compresses a\nprobability distribution more effectively than independent sampling by\ntargeting a reproducing kernel Hilbert space (RKHS) and leveraging a less\nsmooth square-root kernel. Here we provide four improvements. First, we show\nthat KT applied directly to the target RKHS yields tighter, dimension-free\nguarantees for any kernel, any distribution, and any fixed function in the\nRKHS. Second, we show that, for analytic kernels like Gaussian, inverse\nmultiquadric, and sinc, target KT admits maximum mean discrepancy (MMD)\nguarantees comparable to or better than those of square-root KT without making\nexplicit use of a square-root kernel. Third, we prove that KT with a fractional\npower kernel yields better-than-Monte-Carlo MMD guarantees for non-smooth\nkernels, like Laplace and Mat\\'ern, that do not have square-roots. Fourth, we\nestablish that KT applied to a sum of the target and power kernels (a procedure\nwe call KT+) simultaneously inherits the improved MMD guarantees of power KT\nand the tighter individual function guarantees of target KT. 
In our experiments\nwith target KT and KT+, we witness significant improvements in integration\nerror even in $100$ dimensions and when compressing challenging differential\nequation posteriors.\n","authors":["Raaz Dwivedi","Lester Mackey"],"pdf_url":"https://arxiv.org/pdf/2110.01593v7.pdf","comment":"Corrected B-spline and Sinc rates in Table 3"},{"id":"http://arxiv.org/abs/2501.05387v1","updated":"2025-01-09T17:21:00Z","published":"2025-01-09T17:21:00Z","title":"Integrating Explainable AI for Effective Malware Detection in Encrypted\n Network Traffic","summary":" Encrypted network communication ensures confidentiality, integrity, and\nprivacy between endpoints. However, attackers are increasingly exploiting\nencryption to conceal malicious behavior. Detecting unknown encrypted malicious\ntraffic without decrypting the payloads remains a significant challenge. In\nthis study, we investigate the integration of explainable artificial\nintelligence (XAI) techniques to detect malicious network traffic. We employ\nensemble learning models to identify malicious activity using multi-view\nfeatures extracted from various aspects of encrypted communication. To\neffectively represent malicious communication, we compiled a robust dataset\nwith 1,127 unique connections, more than any other available open-source\ndataset, and spanning 54 malware families. Our models were benchmarked against\nthe CTU-13 dataset, achieving performance of over 99% accuracy, precision, and\nF1-score. Additionally, the eXtreme Gradient Boosting (XGB) model demonstrated\n99.32% accuracy, 99.53% precision, and 99.43% F1-score on our custom dataset.\nBy leveraging Shapley Additive Explanations (SHAP), we identified that the\nmaximum packet size, mean inter-arrival time of packets, and transport layer\nsecurity version used are the most critical features for the global model\nexplanation. 
Furthermore, key features were identified as important for local\nexplanations across both datasets for individual traffic samples. These\ninsights provide a deeper understanding of the model decision-making process,\nenhancing the transparency and reliability of detecting malicious encrypted\ntraffic.\n","authors":["Sileshi Nibret Zeleke","Amsalu Fentie Jember","Mario Bochicchio"],"pdf_url":"https://arxiv.org/pdf/2501.05387v1.pdf","comment":"Accepted and presented on PanAfriCon AI 2024"},{"id":"http://arxiv.org/abs/2501.05370v1","updated":"2025-01-09T16:50:16Z","published":"2025-01-09T16:50:16Z","title":"Accelerated Diffusion Models via Speculative Sampling","summary":" Speculative sampling is a popular technique for accelerating inference in\nLarge Language Models by generating candidate tokens using a fast draft model\nand accepting or rejecting them based on the target model's distribution. While\nspeculative sampling was previously limited to discrete sequences, we extend it\nto diffusion models, which generate samples via continuous, vector-valued\nMarkov chains. In this context, the target model is a high-quality but\ncomputationally expensive diffusion model. 
We propose various drafting\nstrategies, including a simple and effective approach that does not require\ntraining a draft model and is applicable out of the box to any diffusion model.\nOur experiments demonstrate significant generation speedup on various diffusion\nmodels, halving the number of function evaluations, while generating exact\nsamples from the target model.\n","authors":["Valentin De Bortoli","Alexandre Galashov","Arthur Gretton","Arnaud Doucet"],"pdf_url":"https://arxiv.org/pdf/2501.05370v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05368v1","updated":"2025-01-09T16:49:04Z","published":"2025-01-09T16:49:04Z","title":"Developing a Foundation of Vector Symbolic Architectures Using Category\n Theory","summary":" At the risk of overstating the case, connectionist approaches to machine\nlearning, i.e. neural networks, are enjoying a small vogue right now. However,\nthese methods require large volumes of data and produce models that are\nuninterpretable to humans. An alternative framework that is compatible with\nneural networks and gradient-based learning, but explicitly models\ncompositionality, is Vector Symbolic Architectures (VSAs). VSAs are a family of\nalgebras on high-dimensional vector representations. They arose in cognitive\nscience from the need to unify neural processing and the kind of symbolic\nreasoning that humans perform. While machine learning methods have benefited\nfrom category theoretical analyses, VSAs have not yet received similar\ntreatment. In this paper, we present a first attempt at applying category\ntheory to VSAs. Specifically, we conduct a brief literature survey\ndemonstrating the lacking intersection of these two topics, provide a list of\ndesiderata for VSAs, and propose that VSAs may be understood as a (division)\nrig in a category enriched over a monoid in Met (the category of Lawvere metric\nspaces). This final contribution suggests that VSAs may be generalised beyond\ncurrent implementations. 
It is our hope that grounding VSAs in category theory\nwill lead to more rigorous connections with other research, both within and\nbeyond, learning and cognition.\n","authors":["Nolan P Shaw","P Michael Furlong","Britt Anderson","Jeff Orchard"],"pdf_url":"https://arxiv.org/pdf/2501.05368v1.pdf","comment":"13 pages, no figures, 2 tables, one appendix"},{"id":"http://arxiv.org/abs/2501.05361v1","updated":"2025-01-09T16:44:53Z","published":"2025-01-09T16:44:53Z","title":"No-Regret Linear Bandits under Gap-Adjusted Misspecification","summary":" This work studies linear bandits under a new notion of gap-adjusted\nmisspecification and is an extension of Liu et al. (2023). When the underlying\nreward function is not linear, existing linear bandits work usually relies on a\nuniform misspecification parameter $\\epsilon$ that measures the sup-norm error\nof the best linear approximation. This results in an unavoidable linear regret\nwhenever $\\epsilon > 0$. We propose a more natural model of misspecification\nwhich only requires the approximation error at each input $x$ to be\nproportional to the suboptimality gap at $x$. It captures the intuition that,\nfor optimization problems, near-optimal regions should matter more and we can\ntolerate larger approximation errors in suboptimal regions.\n Quite surprisingly, we show that the classical LinUCB algorithm -- designed\nfor the realizable case -- is automatically robust against such\n$\\rho$-gap-adjusted misspecification with parameter $\\rho$ diminishing at\n$O(1/(d \\sqrt{\\log T}))$. It achieves a near-optimal $O(\\sqrt{T})$ regret for\nproblems that the best-known regret is almost linear in time horizon $T$. We\nfurther advance this frontier by presenting a novel phased elimination-based\nalgorithm whose gap-adjusted misspecification parameter $\\rho = O(1/\\sqrt{d})$\ndoes not scale with $T$. 
This algorithm attains optimal $O(\\sqrt{T})$ regret\nand is deployment-efficient, requiring only $\\log T$ batches of exploration. It\nalso enjoys an adaptive $O(\\log T)$ regret when a constant suboptimality gap\nexists. Technically, our proof relies on a novel self-bounding argument that\nbounds the part of the regret due to misspecification by the regret itself, and\na new inductive lemma that limits the misspecification error within the\nsuboptimality gap for all valid actions in each batch selected by G-optimal\ndesign.\n","authors":["Chong Liu","Dan Qiao","Ming Yin","Ilija Bogunovic","Yu-Xiang Wang"],"pdf_url":"https://arxiv.org/pdf/2501.05361v1.pdf","comment":"arXiv admin note: substantial text overlap with arXiv:2302.13252"},{"id":"http://arxiv.org/abs/2412.20138v2","updated":"2025-01-09T16:36:26Z","published":"2024-12-28T12:54:06Z","title":"TradingAgents: Multi-Agents LLM Financial Trading Framework","summary":" Significant progress has been made in automated problem-solving using\nsocieties of agents powered by large language models (LLMs). In finance,\nefforts have largely focused on single-agent systems handling specific tasks or\nmulti-agent frameworks independently gathering data. However, multi-agent\nsystems' potential to replicate real-world trading firms' collaborative\ndynamics remains underexplored. TradingAgents proposes a novel stock trading\nframework inspired by trading firms, featuring LLM-powered agents in\nspecialized roles such as fundamental analysts, sentiment analysts, technical\nanalysts, and traders with varied risk profiles. The framework includes Bull\nand Bear researcher agents assessing market conditions, a risk management team\nmonitoring exposure, and traders synthesizing insights from debates and\nhistorical data to make informed decisions. By simulating a dynamic,\ncollaborative trading environment, this framework aims to improve trading\nperformance. 
Detailed architecture and extensive experiments reveal its\nsuperiority over baseline models, with notable improvements in cumulative\nreturns, Sharpe ratio, and maximum drawdown, highlighting the potential of\nmulti-agent LLM frameworks in financial trading. More details on TradingAgents\nare available at https://TradingAgents-AI.github.io.\n","authors":["Yijia Xiao","Edward Sun","Di Luo","Wei Wang"],"pdf_url":"https://arxiv.org/pdf/2412.20138v2.pdf","comment":"Multi-Agent AI in the Real World @ AAAI 2025"},{"id":"http://arxiv.org/abs/2411.10087v2","updated":"2025-01-09T16:22:42Z","published":"2024-11-15T10:16:38Z","title":"PFML: Self-Supervised Learning of Time-Series Data Without\n Representation Collapse","summary":" Self-supervised learning (SSL) is a data-driven learning approach that\nutilizes the innate structure of the data to guide the learning process. In\ncontrast to supervised learning, which depends on external labels, SSL utilizes\nthe inherent characteristics of the data to produce its own supervisory signal.\nHowever, one frequent issue with SSL methods is representation collapse, where\nthe model outputs a constant input-invariant feature representation. This issue\nhinders the potential application of SSL methods to new data modalities, as\ntrying to avoid representation collapse wastes researchers' time and effort.\nThis paper introduces a novel SSL algorithm for time-series data called\nPrediction of Functionals from Masked Latents (PFML). Instead of predicting\nmasked input signals or their latent representations directly, PFML operates by\npredicting statistical functionals of the input signal corresponding to masked\nembeddings, given a sequence of unmasked embeddings. The algorithm is designed\nto avoid representation collapse, rendering it straightforwardly applicable to\ndifferent time-series data domains, such as novel sensor modalities in clinical\ndata. 
We demonstrate the effectiveness of PFML through complex, real-life\nclassification tasks across three different data modalities: infant posture and\nmovement classification from multi-sensor inertial measurement unit data,\nemotion recognition from speech data, and sleep stage classification from EEG\ndata. The results show that PFML is superior to a conceptually similar SSL\nmethod and a contrastive learning-based SSL method. Additionally, PFML is on\npar with the current state-of-the-art SSL method, while also being conceptually\nsimpler and without suffering from representation collapse.\n","authors":["Einari Vaaras","Manu Airaksinen","Okko Räsänen"],"pdf_url":"https://arxiv.org/pdf/2411.10087v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05336v1","updated":"2025-01-09T16:02:51Z","published":"2025-01-09T16:02:51Z","title":"Stream Aligner: Efficient Sentence-Level Alignment via Distribution\n Induction","summary":" The rapid advancement of large language models (LLMs) has led to significant\nimprovements in their capabilities, but also to increased concerns about their\nalignment with human values and intentions. Current alignment strategies,\nincluding adaptive training and inference-time methods, have demonstrated\npotential in this area. However, these approaches still struggle to balance\ndeployment complexity and capability across various tasks and difficulties. In\nthis work, we introduce the Streaming Distribution Induce Aligner (Stream\nAligner), a novel alignment paradigm that combines efficiency with enhanced\nperformance in various tasks throughout the generation process. Stream Aligner\nachieves dynamic sentence-level correction by using a small model to learn the\npreferences of the suffix sentence, iteratively correcting the suffix sentence\noutput by the upstream model, and then using the corrected sentence to replace\nthe suffix sentence in subsequent generations. 
Compared to Aligner, our\nexperiments demonstrate that Stream Aligner reduces reliance on the\ncapabilities of additional models, enhances the reasoning abilities of LLMs,\nand decreases latency during user interaction. Specifically, Stream Aligner-2B\nmodel has achieved an improvement of 76.1% in helpfulness, 36.0% in\nharmlessness on the tested Llama2-70B-chat model, and Stream Aligner-8B has\nachieved an improvement of 3.5% on the math ability of the tested\nLlama3-70B-Instruct model.\n","authors":["Hantao Lou","Jiaming Ji","Kaile Wang","Yaodong Yang"],"pdf_url":"https://arxiv.org/pdf/2501.05336v1.pdf","comment":"AAAI Alignment Track 2025 Poster"},{"id":"http://arxiv.org/abs/2501.05333v1","updated":"2025-01-09T15:59:15Z","published":"2025-01-09T15:59:15Z","title":"Stability and List-Replicability for Agnostic Learners","summary":" Two seminal papers--Alon, Livni, Malliaris, Moran (STOC 2019) and Bun, Livni,\nand Moran (FOCS 2020)--established the equivalence between online learnability\nand globally stable PAC learnability in binary classification. However, Chase,\nChornomaz, Moran, and Yehudayoff (STOC 2024) recently showed that this\nequivalence does not hold in the agnostic setting. Specifically, they proved\nthat in the agnostic setting, only finite hypothesis classes are globally\nstable learnable. Therefore, agnostic global stability is too restrictive to\ncapture interesting hypothesis classes.\n To address this limitation, Chase \\emph{et al.} introduced two relaxations of\nagnostic global stability. In this paper, we characterize the classes that are\nlearnable under their proposed relaxed conditions, resolving the two open\nproblems raised in their work.\n First, we prove that in the setting where the stability parameter can depend\non the excess error (the gap between the learner's error and the best\nachievable error by the hypothesis class), agnostic stability is fully\ncharacterized by the Littlestone dimension. 
Consequently, as in the realizable\ncase, this form of learnability is equivalent to online learnability.\n As part of the proof of this theorem, we strengthen the celebrated result of\nBun et al. by showing that classes with infinite Littlestone dimension are not\nstably PAC learnable, even if we allow the stability parameter to depend on the\nexcess error.\n For the second relaxation proposed by Chase et al., we prove that only finite\nhypothesis classes are globally stable learnable even if we restrict the\nagnostic setting to distributions with small population loss.\n","authors":["Ari Blonda","Shan Gao","Hamed Hatami","Pooya Hatami"],"pdf_url":"https://arxiv.org/pdf/2501.05333v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05329v1","updated":"2025-01-09T15:55:08Z","published":"2025-01-09T15:55:08Z","title":"Knowledge Transfer in Model-Based Reinforcement Learning Agents for\n Efficient Multi-Task Learning","summary":" We propose an efficient knowledge transfer approach for model-based\nreinforcement learning, addressing the challenge of deploying large world\nmodels in resource-constrained environments. Our method distills a\nhigh-capacity multi-task agent (317M parameters) into a compact 1M parameter\nmodel, achieving state-of-the-art performance on the MT30 benchmark with a\nnormalized score of 28.45, a substantial improvement over the original 1M\nparameter model's score of 18.93. This demonstrates the ability of our\ndistillation technique to consolidate complex multi-task knowledge effectively.\nAdditionally, we apply FP16 post-training quantization, reducing the model size\nby 50% while maintaining performance. 
Our work bridges the gap between the\npower of large models and practical deployment constraints, offering a scalable\nsolution for efficient and accessible multi-task reinforcement learning in\nrobotics and other resource-limited domains.\n","authors":["Dmytro Kuzmenko","Nadiya Shvai"],"pdf_url":"https://arxiv.org/pdf/2501.05329v1.pdf","comment":"Preprint of an extended abstract accepted to AAMAS 2025"},{"id":"http://arxiv.org/abs/2501.05325v1","updated":"2025-01-09T15:50:02Z","published":"2025-01-09T15:50:02Z","title":"The explanation dialogues: an expert focus study to understand\n requirements towards explanations within the GDPR","summary":" Explainable AI (XAI) provides methods to understand non-interpretable machine\nlearning models. However, we have little knowledge about what legal experts\nexpect from these explanations, including their legal compliance with, and\nvalue against European Union legislation. To close this gap, we present the\nExplanation Dialogues, an expert focus study to uncover the expectations,\nreasoning, and understanding of legal experts and practitioners towards XAI,\nwith a specific focus on the European General Data Protection Regulation. The\nstudy consists of an online questionnaire and follow-up interviews, and is\ncentered around a use-case in the credit domain. We extract both a set of\nhierarchical and interconnected codes using grounded theory, and present the\nstandpoints of the participating experts towards XAI. We find that the\npresented explanations are hard to understand and lack information, and discuss\nissues that can arise from the different interests of the data controller and\nsubject. Finally, we present a set of recommendations for developers of XAI\nmethods, and indications of legal areas of discussion. 
Among others,\nrecommendations address the presentation, choice, and content of an\nexplanation, technical risks as well as the end-user, while we provide legal\npointers to the contestability of explanations, transparency thresholds,\nintellectual property rights as well as the relationship between involved\nparties.\n","authors":["Laura State","Alejandra Bringas Colmenarejo","Andrea Beretta","Salvatore Ruggieri","Franco Turini","Stephanie Law"],"pdf_url":"https://arxiv.org/pdf/2501.05325v1.pdf","comment":"Artificial Intelligence and Law (Springer Nature)"},{"id":"http://arxiv.org/abs/2501.05323v1","updated":"2025-01-09T15:48:29Z","published":"2025-01-09T15:48:29Z","title":"Distributed Learning and Inference Systems: A Networking Perspective","summary":" Machine learning models have achieved, and in some cases surpassed,\nhuman-level performance in various tasks, mainly through centralized training\nof static models and the use of large models stored in centralized clouds for\ninference. However, this centralized approach has several drawbacks, including\nprivacy concerns, high storage demands, a single point of failure, and\nsignificant computing requirements. These challenges have driven interest in\ndeveloping alternative decentralized and distributed methods for AI training\nand inference. Distribution introduces additional complexity, as it requires\nmanaging multiple moving parts. To address these complexities and fill a gap in\nthe development of distributed AI systems, this work proposes a novel\nframework, Data and Dynamics-Aware Inference and Training Networks (DA-ITN).\nThe different components of DA-ITN and their functions are explored, and the\nassociated challenges and research areas are highlighted.\n","authors":["Hesham G. Moussa","Arashmid Akhavain","S. 
Maryam Hosseini","Bill McCormick"],"pdf_url":"https://arxiv.org/pdf/2501.05323v1.pdf","comment":"This paper has been submitted to IEEE Network magazine and is still\n under review"},{"id":"http://arxiv.org/abs/2406.05405v3","updated":"2025-01-09T15:47:33Z","published":"2024-06-08T08:56:47Z","title":"Robust Conformal Prediction Using Privileged Information","summary":" We develop a method to generate prediction sets with a guaranteed coverage\nrate that is robust to corruptions in the training data, such as missing or\nnoisy variables. Our approach builds on conformal prediction, a powerful\nframework to construct prediction sets that are valid under the i.i.d\nassumption. Importantly, naively applying conformal prediction does not provide\nreliable predictions in this setting, due to the distribution shift induced by\nthe corruptions. To account for the distribution shift, we assume access to\nprivileged information (PI). The PI is formulated as additional features that\nexplain the distribution shift, however, they are only available during\ntraining and absent at test time. We approach this problem by introducing a\nnovel generalization of weighted conformal prediction and support our method\nwith theoretical coverage guarantees. 
Empirical experiments on both real and\nsynthetic datasets indicate that our approach achieves a valid coverage rate\nand constructs more informative predictions compared to existing methods, which\nare not supported by theoretical guarantees.\n","authors":["Shai Feldman","Yaniv Romano"],"pdf_url":"https://arxiv.org/pdf/2406.05405v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03145v2","updated":"2025-01-09T15:31:29Z","published":"2025-01-06T17:12:19Z","title":"Geometry Restoration and Dewarping of Camera-Captured Document Images","summary":" This research focuses on developing a method for restoring the topology of\ndigital images of paper documents captured by a camera, using algorithms for\ndetection, segmentation, geometry restoration, and dewarping. Our methodology\nemploys deep learning (DL) for document outline detection, followed by computer\nvision (CV) to create a topological 2D grid using cubic polynomial\ninterpolation and correct nonlinear distortions by remapping the image. Using\nclassical CV methods makes the document topology restoration process more\nefficient and faster, as it requires significantly fewer computational\nresources and memory. We developed a new pipeline for automatic document\ndewarping and reconstruction, along with a framework and annotated dataset to\ndemonstrate its efficiency. Our experiments confirm the promise of our\nmethodology and its superiority over existing benchmarks (including mobile apps\nand popular DL solutions, such as RectiNet, DocGeoNet, and DocTr++) both\nvisually and in terms of document readability via Optical Character Recognition\n(OCR) and geometry restoration metrics. This paves the way for creating\nhigh-quality digital copies of paper documents and enhancing the efficiency of\nOCR systems. 
Project page: https://github.com/HorizonParadox/DRCCBI\n","authors":["Valery Istomin","Oleg Pereziabov","Ilya Afanasyev"],"pdf_url":"https://arxiv.org/pdf/2501.03145v2.pdf","comment":"28 pages, 16 figures"},{"id":"http://arxiv.org/abs/2501.05313v1","updated":"2025-01-09T15:29:33Z","published":"2025-01-09T15:29:33Z","title":"Optimizing Distributed Deployment of Mixture-of-Experts Model Inference\n in Serverless Computing","summary":" With the advancement of serverless computing, running machine learning (ML)\ninference services over a serverless platform has been advocated, given its\nlabor-free scalability and cost effectiveness. Mixture-of-Experts (MoE) models\nhave been a dominant type of model architectures to enable large models\nnowadays, with parallel expert networks. Serving large MoE models on serverless\ncomputing is potentially beneficial, but has been underexplored due to\nsubstantial challenges in handling the skewed expert popularity and\nscatter-gather communication bottleneck in MoE model execution, for\ncost-efficient serverless MoE deployment and performance guarantee. We study\noptimized MoE model deployment and distributed inference serving on a\nserverless platform, that effectively predict expert selection, pipeline\ncommunication with model execution, and minimize the overall billed cost of\nserving MoE models. Especially, we propose a Bayesian optimization framework\nwith multi-dimensional epsilon-greedy search to learn expert selections and\noptimal MoE deployment achieving optimal billed cost, including: 1) a Bayesian\ndecision-making method for predicting expert popularity; 2) flexibly pipelined\nscatter-gather communication; and 3) an optimal model deployment algorithm for\ndistributed MoE serving. Extensive experiments on AWS Lambda show that our\ndesigns reduce the billed cost of all MoE layers by at least 75.67% compared to\nCPU clusters while maintaining satisfactory inference throughput. 
As compared\nto LambdaML in serverless computing, our designs achieves 43.41% lower cost\nwith a throughput decrease of at most 18.76%.\n","authors":["Mengfan Liu","Wei Wang","Chuan Wu"],"pdf_url":"https://arxiv.org/pdf/2501.05313v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05309v1","updated":"2025-01-09T15:25:07Z","published":"2025-01-09T15:25:07Z","title":"Private Selection with Heterogeneous Sensitivities","summary":" Differentially private (DP) selection involves choosing a high-scoring\ncandidate from a finite candidate pool, where each score depends on a sensitive\ndataset. This problem arises naturally in a variety of contexts including model\nselection, hypothesis testing, and within many DP algorithms. Classical\nmethods, such as Report Noisy Max (RNM), assume all candidates' scores are\nequally sensitive to changes in a single individual's data, but this often\nisn't the case. To address this, algorithms like the Generalised Exponential\nMechanism (GEM) leverage variability in candidate sensitivities. However, we\nobserve that while these algorithms can outperform RNM in some situations, they\nmay underperform in others - they can even perform worse than random selection.\nIn this work, we explore how the distribution of scores and sensitivities\nimpacts DP selection mechanisms. In all settings we study, we find that there\nexists a mechanism that utilises heterogeneity in the candidate sensitivities\nthat outperforms standard mechanisms like RNM. However, no single mechanism\nuniformly outperforms RNM. We propose using the correlation between the scores\nand sensitivities as the basis for deciding which DP selection mechanism to\nuse. Further, we design a slight variant of GEM, modified GEM that generally\nperforms well whenever GEM performs poorly. 
Relying on the correlation\nheuristic we propose combined GEM, which adaptively chooses between GEM and\nmodified GEM and outperforms both in polarised settings.\n","authors":["Daniela Antonova","Allegra Laro","Audra McMillan","Lorenz Wolf"],"pdf_url":"https://arxiv.org/pdf/2501.05309v1.pdf","comment":"21 pages, 18 figures"},{"id":"http://arxiv.org/abs/2412.16378v2","updated":"2025-01-09T15:20:31Z","published":"2024-12-20T22:25:23Z","title":"REFA: Reference Free Alignment for multi-preference optimization","summary":" We introduce REFA, a family of reference-free alignment methods that optimize\nover multiple user preferences while enforcing fine-grained length control. Our\napproach integrates deviation-based weighting to emphasize high-quality\nresponses more strongly, length normalization to prevent trivial short-response\nsolutions, and an EOS-probability regularizer to mitigate dataset-induced\nbrevity biases. Theoretically, we show that under the Uncertainty Reduction\nwith Sequence Length Assertion (URSLA), naive length normalization can still\nincentivize length-based shortcuts. By contrast, REFA corrects these subtle\nincentives, guiding models toward genuinely more informative and higher-quality\noutputs. Empirically, REFA sets a new state-of-the-art among reference-free\nalignment methods, producing richer responses aligned more closely with human\npreferences. Compared to a base supervised fine-tuned (SFT) mistral-7b model\nthat achieves 8.4% length-controlled win rate (LC-WR) and 6.2% win rate (WR),\nour best REFA configuration attains 21.62% LC-WR and 19.87% WR on the\nAlpacaEval v2 benchmark. 
This represents a substantial improvement over both\nthe strongest multi-preference baseline, InfoNCA (16.82% LC-WR, 10.44% WR), and\nthe strongest reference-free baseline, SimPO (20.01% LC-WR, 17.65% WR)\n","authors":["Taneesh Gupta","Rahul Madhavan","Xuchao Zhang","Chetan Bansal","Saravan Rajmohan"],"pdf_url":"https://arxiv.org/pdf/2412.16378v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.19528v3","updated":"2025-01-09T15:12:04Z","published":"2024-10-25T12:53:33Z","title":"AgentForge: A Flexible Low-Code Platform for Reinforcement Learning\n Agent Design","summary":" Developing a reinforcement learning (RL) agent often involves identifying\nvalues for numerous parameters, covering the policy, reward function,\nenvironment, and agent-internal architecture. Since these parameters are\ninterrelated in complex ways, optimizing them is a black-box problem that\nproves especially challenging for nonexperts. Although existing\noptimization-as-a-service platforms (e.g., Vizier and Optuna) can handle such\nproblems, they are impractical for RL systems, since the need for manual user\nmapping of each parameter to distinct components makes the effort cumbersome.\nIt also requires understanding of the optimization process, limiting the\nsystems' application beyond the machine learning field and restricting access\nin areas such as cognitive science, which models human decision-making. To\ntackle these challenges, the paper presents AgentForge, a flexible low-code\nplatform to optimize any parameter set across an RL system. Available at\nhttps://github.com/feferna/AgentForge, it allows an optimization problem to be\ndefined in a few lines of code and handed to any of the interfaced optimizers.\nWith AgentForge, the user can optimize the parameters either individually or\njointly. 
The paper presents an evaluation of its performance for a challenging\nvision-based RL problem.\n","authors":["Francisco Erivaldo Fernandes Junior","Antti Oulasvirta"],"pdf_url":"https://arxiv.org/pdf/2410.19528v3.pdf","comment":"This paper has been accepted at the 17th International Conference on\n Agents and Artificial Intelligence (ICAART 2025)"},{"id":"http://arxiv.org/abs/2409.07387v2","updated":"2025-01-09T14:58:03Z","published":"2024-09-11T16:21:44Z","title":"A Contrastive Symmetric Forward-Forward Algorithm (SFFA) for Continual\n Learning Tasks","summary":" The so-called Forward-Forward Algorithm (FFA) has recently gained momentum as\nan alternative to the conventional back-propagation algorithm for neural\nnetwork learning, yielding competitive performance across various modeling\ntasks. By replacing the backward pass of gradient back-propagation with two\ncontrastive forward passes, the FFA avoids several shortcomings undergone by\nits predecessor (e.g., vanishing/exploding gradient) by enabling layer-wise\ntraining heuristics. In classification tasks, this contrastive method has been\nproven to effectively create a latent sparse representation of the input data,\nultimately favoring discriminability. However, FFA exhibits an inherent\nasymmetric gradient behavior due to an imbalanced loss function between\npositive and negative data, adversely impacting on the model's generalization\ncapabilities and leading to an accuracy degradation. To address this issue,\nthis work proposes the Symmetric Forward-Forward Algorithm (SFFA), a novel\nmodification of the original FFA which partitions each layer into positive and\nnegative neurons. This allows the local fitness function to be defined as the\nratio between the activation of positive neurons and the overall layer\nactivity, resulting in a symmetric loss landscape during the training phase. 
To\nevaluate the enhanced convergence of our method, we conduct several experiments\nusing multiple image classification benchmarks, comparing the accuracy of\nmodels trained with SFFA to those trained with its FFA counterpart. As a\nbyproduct of this reformulation, we explore the advantages of using a\nlayer-wise training algorithm for Continual Learning (CL) tasks. The\nspecialization of neurons and the sparsity of their activations induced by\nlayer-wise training algorithms enable efficient CL strategies that incorporate\nnew knowledge (classes) into the neural network, while preventing catastrophic\nforgetting of previously...\n","authors":["Erik B. Terres-Escudero","Javier Del Ser","Pablo Garcia Bringas"],"pdf_url":"https://arxiv.org/pdf/2409.07387v2.pdf","comment":"Accepted at 3rd Conference on Lifelong Learning Agents (CoLLAs), 2024"},{"id":"http://arxiv.org/abs/2412.16220v3","updated":"2025-01-09T14:55:29Z","published":"2024-12-18T10:56:40Z","title":"Cross-Attention Graph Neural Networks for Inferring Gene Regulatory\n Networks with Skewed Degree Distribution","summary":" Inferencing Gene Regulatory Networks (GRNs) from gene expression data is a\npivotal challenge in systems biology, and several innovative computational\nmethods have been introduced. However, most of these studies have not\nconsidered the skewed degree distribution of genes. Specifically, some genes\nmay regulate multiple target genes while some genes may be regulated by\nmultiple regulator genes. Such a skewed degree distribution issue significantly\ncomplicates the application of directed graph embedding methods. To tackle this\nissue, we propose the Cross-Attention Complex Dual Graph Embedding Model\n(XATGRN). Our XATGRN employs a cross-attention mechanism to effectively capture\nintricate gene interactions from gene expression profiles. 
Additionally, it\nuses a Dual Complex Graph Embedding approach to manage the skewed degree\ndistribution, thereby ensuring precise prediction of regulatory relationships\nand their directionality. Our model consistently outperforms existing\nstate-of-the-art methods across various datasets, underscoring its efficacy in\nelucidating complex gene regulatory mechanisms. Our codes used in this paper\nare publicly available at: https://github.com/kikixiong/XATGRN.\n","authors":["Jiaqi Xiong","Nan Yin","Shiyang Liang","Haoyang Li","Yingxu Wang","Duo Ai","Fang Pan","Jingjie Wang"],"pdf_url":"https://arxiv.org/pdf/2412.16220v3.pdf","comment":"11 pages, 6 figures,1 tabels"},{"id":"http://arxiv.org/abs/2501.01480v2","updated":"2025-01-09T14:52:13Z","published":"2025-01-02T15:09:00Z","title":"Drift2Matrix: Kernel-Induced Self Representation for Concept Drift\n Adaptation in Co-evolving Time Series","summary":" In the realm of time series analysis, tackling the phenomenon of concept\ndrift poses a significant challenge. Concept drift -- characterized by the\nevolving statistical properties of time series data, affects the reliability\nand accuracy of conventional analysis models. This is particularly evident in\nco-evolving scenarios where interactions among variables are crucial. This\npaper presents Drift2Matrix, a novel framework that leverages kernel-induced\nself-representation for adaptive responses to concept drift in time series.\nDrift2Matrix employs a kernel-based learning mechanism to generate a\nrepresentation matrix, encapsulating the inherent dynamics of co-evolving time\nseries. This matrix serves as a key tool for identification and adaptation to\nconcept drift by observing its temporal variations. Furthermore, Drift2Matrix\neffectively identifies prevailing patterns and offers insights into emerging\ntrends through pattern evolution analysis. 
Our empirical evaluation of\nDrift2Matrix across various datasets demonstrates its effectiveness in handling\nthe complexities of concept drift. This approach introduces a novel perspective\nin the theoretical domain of co-evolving time series analysis, enhancing\nadaptability and accuracy in the face of dynamic data environments.\n","authors":["Kunpeng Xu","Lifei Chen","Shengrui Wang"],"pdf_url":"https://arxiv.org/pdf/2501.01480v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05281v1","updated":"2025-01-09T14:43:36Z","published":"2025-01-09T14:43:36Z","title":"Comparison Study: Glacier Calving Front Delineation in Synthetic\n Aperture Radar Images With Deep Learning","summary":" Calving front position variation of marine-terminating glaciers is an\nindicator of ice mass loss and a crucial parameter in numerical glacier models.\nDeep Learning (DL) systems can automatically extract this position from\nSynthetic Aperture Radar (SAR) imagery, enabling continuous, weather- and\nillumination-independent, large-scale monitoring. This study presents the first\ncomparison of DL systems on a common calving front benchmark dataset. A\nmulti-annotator study with ten annotators is performed to contrast the\nbest-performing DL system against human performance. The best DL model's\noutputs deviate 221 m on average, while the average deviation of the human\nannotators is 38 m. This significant difference shows that current DL systems\ndo not yet match human performance and that further research is needed to\nenable fully automated monitoring of glacier calving fronts. 
The study of\nVision Transformers, foundation models, and the inclusion and processing\nstrategy of more information are identified as avenues for future research.\n","authors":["Nora Gourmelon","Konrad Heidler","Erik Loebel","Daniel Cheng","Julian Klink","Anda Dong","Fei Wu","Noah Maul","Moritz Koch","Marcel Dreier","Dakota Pyles","Thorsten Seehaus","Matthias Braun","Andreas Maier","Vincent Christlein"],"pdf_url":"https://arxiv.org/pdf/2501.05281v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05279v1","updated":"2025-01-09T14:43:08Z","published":"2025-01-09T14:43:08Z","title":"Learning convolution operators on compact Abelian groups","summary":" We consider the problem of learning convolution operators associated to\ncompact Abelian groups. We study a regularization-based approach and provide\ncorresponding learning guarantees, discussing natural regularity condition on\nthe convolution kernel. More precisely, we assume the convolution kernel is a\nfunction in a translation invariant Hilbert space and analyze a natural ridge\nregression (RR) estimator. Building on existing results for RR, we characterize\nthe accuracy of the estimator in terms of finite sample bounds. Interestingly,\nregularity assumptions which are classical in the analysis of RR, have a novel\nand natural interpretation in terms of space/frequency localization.\nTheoretical results are illustrated by numerical simulations.\n","authors":["Emilia Magnani","Ernesto De Vito","Philipp Hennig","Lorenzo Rosasco"],"pdf_url":"https://arxiv.org/pdf/2501.05279v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05278v1","updated":"2025-01-09T14:39:40Z","published":"2025-01-09T14:39:40Z","title":"Off-Policy Evaluation and Counterfactual Methods in Dynamic Auction\n Environments","summary":" Counterfactual estimators are critical for learning and refining policies\nusing logged data, a process known as Off-Policy Evaluation (OPE). 
OPE allows\nresearchers to assess new policies without costly experiments, speeding up the\nevaluation process. Online experimental methods, such as A/B tests, are\neffective but often slow, thus delaying the policy selection and optimization\nprocess.\n In this work, we explore the application of OPE methods in the context of\nresource allocation in dynamic auction environments. Given the competitive\nnature of environments where rapid decision-making is crucial for gaining a\ncompetitive edge, the ability to quickly and accurately assess algorithmic\nperformance is essential. By utilizing counterfactual estimators as a\npreliminary step before conducting A/B tests, we aim to streamline the\nevaluation process, reduce the time and resources required for experimentation,\nand enhance confidence in the chosen policies. Our investigation focuses on the\nfeasibility and effectiveness of using these estimators to predict the outcomes\nof potential resource allocation strategies, evaluate their performance, and\nfacilitate more informed decision-making in policy selection. Motivated by the\noutcomes of our initial study, we envision an advanced analytics system\ndesigned to seamlessly and dynamically assess new resource allocation\nstrategies and policies.\n","authors":["Ritam Guha","Nilavra Pathak"],"pdf_url":"https://arxiv.org/pdf/2501.05278v1.pdf","comment":"9 pages, 15 figures, IEEE format"},{"id":"http://arxiv.org/abs/2501.04572v2","updated":"2025-01-09T14:30:41Z","published":"2025-01-08T15:42:41Z","title":"Regret Analysis: a control perspective","summary":" Online learning and model reference adaptive control have many interesting\nintersections. One area where they differ however is in how the algorithms are\nanalyzed and what objective or metric is used to discriminate \"good\" algorithms\nfrom \"bad\" algorithms. 
In adaptive control there are usually two objectives: 1)\nprove that all time-varying parameters/states of the system are bounded, and 2)\nthat the instantaneous error between the adaptively controlled system and a\nreference system converges to zero over time (or at least a compact set). For\nonline learning the performance of algorithms is often characterized by the\nregret the algorithm incurs. Regret is defined as the cumulative loss (cost)\nover time from the online algorithm minus the cumulative loss (cost) of the\nsingle optimal fixed parameter choice in hindsight. Another significant\ndifference between the two areas of research is with regard to the assumptions\nmade in order to obtain said results. Adaptive control makes assumptions about\nthe input-output properties of the control problem and derives solutions for a\nfixed error model or optimization task. In the online learning literature,\nresults are derived for classes of loss functions (i.e. convex) while a priori\nassuming that all time-varying parameters are bounded, which for many\noptimization tasks is not unrealistic, but is a non-starter in control\napplications. In this work we discuss these differences in detail through the\nregret-based analysis of gradient descent for convex functions and the\ncontrol-based analysis of a streaming regression problem. We close with a discussion\nabout the newly defined paradigm of online adaptive control and ask the\nfollowing question: \"Are regret-optimal control strategies deployable?\"\n","authors":["Travis E. Gibson","Sawal Acharya"],"pdf_url":"https://arxiv.org/pdf/2501.04572v2.pdf","comment":"10 pages no figures"},{"id":"http://arxiv.org/abs/2501.05269v1","updated":"2025-01-09T14:26:50Z","published":"2025-01-09T14:26:50Z","title":"CellViT++: Energy-Efficient and Adaptive Cell Segmentation and\n Classification Using Foundation Models","summary":" Digital Pathology is a cornerstone in the diagnosis and treatment of\ndiseases. 
A key task in this field is the identification and segmentation of\ncells in hematoxylin and eosin-stained images. Existing methods for cell\nsegmentation often require extensive annotated datasets for training and are\nlimited to a predefined cell classification scheme. To overcome these\nlimitations, we propose $\\text{CellViT}^{{\\scriptscriptstyle ++}}$, a framework\nfor generalized cell segmentation in digital pathology.\n$\\text{CellViT}^{{\\scriptscriptstyle ++}}$ utilizes Vision Transformers with\nfoundation models as encoders to compute deep cell features and segmentation\nmasks simultaneously. To adapt to unseen cell types, we rely on a\ncomputationally efficient approach. It requires minimal data for training and\nleads to a drastically reduced carbon footprint. We demonstrate excellent\nperformance on seven different datasets, covering a broad spectrum of cell\ntypes, organs, and clinical settings. The framework achieves remarkable\nzero-shot segmentation and data-efficient cell-type classification.\nFurthermore, we show that $\\text{CellViT}^{{\\scriptscriptstyle ++}}$ can\nleverage immunofluorescence stainings to generate training datasets without the\nneed for pathologist annotations. The automated dataset generation approach\nsurpasses the performance of networks trained on manually labeled data,\ndemonstrating its effectiveness in creating high-quality training datasets\nwithout expert annotations. To advance digital pathology,\n$\\text{CellViT}^{{\\scriptscriptstyle ++}}$ is available as an open-source\nframework featuring a user-friendly, web-based interface for visualization and\nannotation. 
The code is available under\nhttps://github.com/TIO-IKIM/CellViT-plus-plus.\n","authors":["Fabian Hörst","Moritz Rempe","Helmut Becker","Lukas Heine","Julius Keyl","Jens Kleesiek"],"pdf_url":"https://arxiv.org/pdf/2501.05269v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05260v1","updated":"2025-01-09T14:14:18Z","published":"2025-01-09T14:14:18Z","title":"Enhancing Plagiarism Detection in Marathi with a Weighted Ensemble of\n TF-IDF and BERT Embeddings for Low-Resource Language Processing","summary":" Plagiarism involves using another person's work or concepts without proper\nattribution, presenting them as original creations. With the growing amount of\ndata communicated in regional languages such as Marathi -- one of India's\nregional languages -- it is crucial to design robust plagiarism detection\nsystems tailored for low-resource languages. Language models like Bidirectional\nEncoder Representations from Transformers (BERT) have demonstrated exceptional\ncapability in text representation and feature extraction, making them essential\ntools for semantic analysis and plagiarism detection. However, the application\nof BERT for low-resource languages remains under-explored, particularly in the\ncontext of plagiarism detection. This paper presents a method to enhance the\naccuracy of plagiarism detection for Marathi texts using BERT sentence\nembeddings in conjunction with Term Frequency-Inverse Document Frequency\n(TF-IDF) feature representation. 
This approach effectively captures\nstatistical, semantic, and syntactic aspects of text features through a\nweighted voting ensemble of machine learning models.\n","authors":["Atharva Mutsaddi","Aditya Choudhary"],"pdf_url":"https://arxiv.org/pdf/2501.05260v1.pdf","comment":"Accepted into LoResLM: The First Workshop on Language Models for\n Low-Resource Languages, colocated with COLING 2025 and set to be published\n into ACL Anthology"},{"id":"http://arxiv.org/abs/2410.20398v2","updated":"2025-01-09T14:11:34Z","published":"2024-10-27T10:06:09Z","title":"Evaluation of uncertainty estimations for Gaussian process regression\n based machine learning interatomic potentials","summary":" Uncertainty estimations for machine learning interatomic potentials (MLIPs)\nare crucial for quantifying model error and identifying informative training\nsamples in active learning strategies. In this study, we evaluate uncertainty\nestimations of Gaussian process regression (GPR)-based MLIPs, including the\npredictive GPR standard deviation and ensemble-based uncertainties. We do this\nin terms of calibration and in terms of impact on model performance in an\nactive learning scheme. We consider GPR models with Coulomb and Smooth Overlap\nof Atomic Positions (SOAP) representations as inputs to predict potential\nenergy surfaces and excitation energies of molecules. Regarding calibration, we\nfind that ensemble-based uncertainty estimations already show poor global\ncalibration (e.g., averaged over the whole test set). In contrast, the GPR\nstandard deviation shows good global calibration, but when grouping predictions\nby their uncertainty, we observe a systematic bias for predictions with high\nuncertainty. Although an increasing uncertainty correlates with an increasing\nbias, the bias is not captured quantitatively by the uncertainty. 
Therefore,\nthe GPR standard deviation can be useful to identify predictions with a high\nbias and error but, without further knowledge, should not be interpreted as a\nquantitative measure for a potential error range. Selecting the samples with\nthe highest GPR standard deviation from a fixed configuration space leads to a\nmodel that overemphasizes the borders of the configuration space represented in\nthe fixed dataset. This may result in worse performance in more densely sampled\nareas but better generalization for extrapolation tasks.\n","authors":["Matthias Holzenkamp","Dongyu Lyu","Ulrich Kleinekathöfer","Peter Zaspel"],"pdf_url":"https://arxiv.org/pdf/2410.20398v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.05838v2","updated":"2025-01-09T14:04:01Z","published":"2024-10-08T09:06:34Z","title":"Time Transfer: On Optimal Learning Rate and Batch Size In The Infinite\n Data Limit","summary":" One of the main challenges in optimal scaling of large language models (LLMs)\nis the prohibitive cost of hyperparameter tuning, particularly learning rate\n$\\eta$ and batch size $B$. While techniques like $\\mu$P (Yang et al., 2022)\nprovide scaling rules for optimal $\\eta$ transfer in the infinite model size\nlimit, the optimal scaling behavior in the infinite data size limit remains\nunknown. We fill in this gap by observing for the first time an intricate\ndependence of optimal $\\eta$ scaling on the pretraining token budget $T$, $B$\nand its relation to the critical batch size $B_\\mathrm{crit}$, which we measure\nto evolve as $B_\\mathrm{crit} \\propto T$. 
Furthermore, we show that the optimal\nbatch size is positively correlated with $B_\\mathrm{crit}$: keeping it fixed\nbecomes suboptimal over time even if learning rate is scaled optimally.\nSurprisingly, our results demonstrate that the observed optimal $\\eta$ and $B$\ndynamics are preserved with $\\mu$P model scaling, challenging the conventional\nview of $B_\\mathrm{crit}$ dependence solely on loss value. Complementing\noptimality, we examine the sensitivity of loss to changes in learning rate,\nwhere we find the sensitivity to decrease with increase of $T$ and to remain\nconstant with $\\mu$P model scaling. We hope our results make the first step\ntowards a unified picture of the joint optimal data and model scaling.\n","authors":["Oleg Filatov","Jan Ebert","Jiangtao Wang","Stefan Kesselheim"],"pdf_url":"https://arxiv.org/pdf/2410.05838v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05248v1","updated":"2025-01-09T14:00:01Z","published":"2025-01-09T14:00:01Z","title":"Deriving Coding-Specific Sub-Models from LLMs using Resource-Efficient\n Pruning","summary":" Large Language Models (LLMs) have demonstrated their exceptional performance\nin various complex code generation tasks. However, their broader adoption is\nlimited by significant computational demands and high resource requirements,\nparticularly memory and processing power. To mitigate such requirements, model\npruning techniques are used to create more compact models with significantly\nfewer parameters. However, current approaches do not focus on the efficient\nextraction of programming-language-specific sub-models. In this work, we\nexplore the idea of efficiently deriving coding-specific sub-models through\nunstructured pruning (i.e., Wanda). We investigate the impact of different\ndomain-specific calibration datasets on pruning outcomes across three distinct\ndomains and extend our analysis to extracting four language-specific\nsub-models: Python, Java, C++, and JavaScript. 
We are the first to efficiently\nextract programming-language-specific sub-models using appropriate calibration\ndatasets while maintaining acceptable accuracy w.r.t. full models. We are also\nthe first to provide analytical evidence that domain-specific tasks activate\ndistinct regions within LLMs, supporting the creation of specialized sub-models\nthrough unstructured pruning. We believe that this work has significant\npotential to enhance LLM accessibility for coding by reducing computational\nrequirements to enable local execution on consumer-grade hardware, and\nsupporting faster inference times critical for real-time development feedback.\n","authors":["Laura Puccioni","Alireza Farshin","Mariano Scazzariello","Changjie Wang","Marco Chiesa","Dejan Kostic"],"pdf_url":"https://arxiv.org/pdf/2501.05248v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05234v1","updated":"2025-01-09T13:41:37Z","published":"2025-01-09T13:41:37Z","title":"Optimizing Estonian TV Subtitles with Semi-supervised Learning and LLMs","summary":" This paper presents an approach for generating high-quality, same-language\nsubtitles for Estonian TV content. We fine-tune the Whisper model on\nhuman-generated Estonian subtitles and enhance it with iterative\npseudo-labeling and large language model (LLM) based post-editing. Our\nexperiments demonstrate notable subtitle quality improvement through\npseudo-labeling with an unlabeled dataset. We find that applying LLM-based\nediting at test time enhances subtitle accuracy, while its use during training\ndoes not yield further gains. 
This approach holds promise for creating subtitle\nquality close to the human standard and could be extended to real-time\napplications.\n","authors":["Artem Fedorchenko","Tanel Alumäe"],"pdf_url":"https://arxiv.org/pdf/2501.05234v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.23569v4","updated":"2025-01-09T13:30:25Z","published":"2024-10-31T02:25:43Z","title":"RA-PbRL: Provably Efficient Risk-Aware Preference-Based Reinforcement\n Learning","summary":" Reinforcement Learning from Human Feedback (RLHF) has recently surged in\npopularity, particularly for aligning large language models and other AI\nsystems with human intentions. At its core, RLHF can be viewed as a specialized\ninstance of Preference-based Reinforcement Learning (PbRL), where the\npreferences specifically originate from human judgments rather than arbitrary\nevaluators. Despite this connection, most existing approaches in both RLHF and\nPbRL primarily focus on optimizing a mean reward objective, neglecting\nscenarios that necessitate risk-awareness, such as AI safety, healthcare, and\nautonomous driving. These scenarios often operate under a one-episode-reward\nsetting, which makes conventional risk-sensitive objectives inapplicable. To\naddress this, we explore and prove the applicability of two risk-aware\nobjectives to PbRL: nested and static quantile risk objectives. We also\nintroduce Risk-Aware PbRL (RA-PbRL), an algorithm designed to optimize both\nnested and static objectives. Additionally, we provide a theoretical analysis\nof the regret upper bounds, demonstrating that they are sublinear with respect\nto the number of episodes, and present empirical results to support our\nfindings. 
Our code is available in\nhttps://github.com/aguilarjose11/PbRLNeurips.\n","authors":["Yujie Zhao","Jose Efraim Aguilar Escamill","Weyl Lu","Huazheng Wang"],"pdf_url":"https://arxiv.org/pdf/2410.23569v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05226v1","updated":"2025-01-09T13:29:54Z","published":"2025-01-09T13:29:54Z","title":"Light Transport-aware Diffusion Posterior Sampling for Single-View\n Reconstruction of 3D Volumes","summary":" We introduce a single-view reconstruction technique of volumetric fields in\nwhich multiple light scattering effects are omnipresent, such as in clouds. We\nmodel the unknown distribution of volumetric fields using an unconditional\ndiffusion model trained on a novel benchmark dataset comprising 1,000\nsynthetically simulated volumetric density fields. The neural diffusion model\nis trained on the latent codes of a novel, diffusion-friendly, monoplanar\nrepresentation. The generative model is used to incorporate a tailored\nparametric diffusion posterior sampling technique into different reconstruction\ntasks. A physically-based differentiable volume renderer is employed to provide\ngradients with respect to light transport in the latent space. This stands in\ncontrast to classic NeRF approaches and makes the reconstructions better\naligned with observed data. Through various experiments, we demonstrate\nsingle-view reconstruction of volumetric clouds at a previously unattainable\nquality.\n","authors":["Ludwic Leonard","Nils Thuerey","Ruediger Westermann"],"pdf_url":"https://arxiv.org/pdf/2501.05226v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.15879v2","updated":"2025-01-09T13:27:29Z","published":"2024-07-20T10:45:06Z","title":"Decentralized Federated Anomaly Detection in Smart Grids: A P2P Gossip\n Approach","summary":" The increasing security and privacy concerns in the Smart Grid sector have\nled to a significant demand for robust intrusion detection systems within\ncritical smart grid infrastructure. 
To address the challenges posed by privacy\npreservation and decentralized power system zones with distinct data ownership,\nFederated Learning (FL) has emerged as a promising privacy-preserving solution\nwhich facilitates collaborative training of attack detection models without\nnecessitating the sharing of raw data. However, FL presents several\nimplementation limitations in the power system domain due to its heavy reliance\non a centralized aggregator and the risks of privacy leakage during model\nupdate transmission. To overcome these technical bottlenecks, this paper\nintroduces a novel decentralized federated anomaly detection scheme based on\ntwo main gossip protocols namely Random Walk and Epidemic. Our findings\nindicate that the Random Walk protocol exhibits superior performance compared\nto the Epidemic protocol, highlighting its efficacy in decentralized federated\nlearning environments. Experimental validation of the proposed framework\nutilizing publicly available industrial control systems datasets demonstrates\nsuperior attack detection accuracy while safeguarding data confidentiality and\nmitigating the impact of communication latency and stragglers. Furthermore, our\napproach yields a notable 35% improvement in training time compared to\nconventional FL, underscoring the efficacy and robustness of our decentralized\nlearning method.\n","authors":["Muhammad Akbar Husnoo","Adnan Anwar","Md Enamul Haque","A. N. Mahmood"],"pdf_url":"https://arxiv.org/pdf/2407.15879v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05223v1","updated":"2025-01-09T13:19:59Z","published":"2025-01-09T13:19:59Z","title":"EVA-S2PLoR: A Secure Element-wise Multiplication Meets Logistic\n Regression on Heterogeneous Database","summary":" Accurate nonlinear computation is a key challenge in privacy-preserving\nmachine learning (PPML). Most existing frameworks approximate it through linear\noperations, resulting in significant precision loss. 
This paper proposes an\nefficient, verifiable and accurate secure 2-party logistic regression\nframework (EVA-S2PLoR), which achieves accurate nonlinear function computation\nthrough a novel secure element-wise multiplication protocol and its derived\nprotocols. Our framework primarily includes secure 2-party vector element-wise\nmultiplication, addition-to-multiplication, reciprocal, and sigmoid function\nprotocols based on data disguising technology, where high efficiency and accuracy are\nguaranteed by the simple computation flow based on the real number domain and\nthe small number of fixed communication rounds. We provide secure and robust\nanomaly detection through dimension transformation and Monte Carlo methods.\nEVA-S2PLoR outperforms many advanced frameworks in terms of precision\n(improving the performance of the sigmoid function by about 10 orders of\nmagnitude compared to most frameworks) and delivers the best overall\nperformance in secure logistic regression experiments.\n","authors":["Tianle Tao","Shizhao Peng","Tianyu Mei","Shoumo Li","Haogang Zhu"],"pdf_url":"https://arxiv.org/pdf/2501.05223v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.14738v5","updated":"2025-01-09T13:06:40Z","published":"2024-12-19T11:10:48Z","title":"Boosting Graph Neural Network Training by Focusing on Non-Robust Samples\n from the Training Set","summary":" Graph Neural Networks (GNNs) are a highly effective neural network\narchitecture for processing graph-structured data. Unlike traditional neural\nnetworks that rely solely on the features of the data as input, GNNs leverage\nboth the graph structure, which represents the relationships between data\npoints, and the feature matrix of the data to optimize their feature\nrepresentation. This unique capability enables GNNs to achieve superior\nperformance across various tasks. 
However, it also makes GNNs more susceptible\nto noise from both the graph structure and data features, which can\nsignificantly increase the training difficulty and degrade their performance.\nTo address this issue, this paper proposes a novel method for selecting\nnoise-sensitive training samples from the original training set to construct a\nsmaller yet more effective training set for model training. These samples are\nthen used to enhance the model's ability to handle noise-prone instances\neffectively. We have evaluated our approach on three of the most classical GNN\nmodels -- GCN, GAT, and GraphSAGE -- as well as three widely used benchmark\ndatasets: Cora, Citeseer, and PubMed. Our experiments demonstrate that the\nproposed method can substantially boost the overall training of Graph Neural\nNetworks compared to using randomly constructed training sets.\n","authors":["Yongyu Wang"],"pdf_url":"https://arxiv.org/pdf/2412.14738v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05207v1","updated":"2025-01-09T12:57:41Z","published":"2025-01-09T12:57:41Z","title":"CoDe: Communication Delay-Tolerant Multi-Agent Collaboration via Dual\n Alignment of Intent and Timeliness","summary":" Communication has been widely employed to enhance multi-agent collaboration.\nPrevious research has typically assumed delay-free communication, a strong\nassumption that is challenging to meet in practice. However, real-world agents\nsuffer from channel delays, receiving messages sent at different time points,\ntermed {\\it{Asynchronous Communication}}, leading to cognitive biases and\nbreakdowns in collaboration. This paper first defines two communication delay\nsettings in MARL and emphasizes their harm to collaboration. To handle the\nabove delays, this paper proposes a novel framework, Communication\nDelay-tolerant Multi-Agent Collaboration (CoDe). 
At first, CoDe learns an\nintent representation as messages through future action inference, reflecting\nthe stable future behavioral trends of the agents. Then, CoDe devises a dual\nalignment mechanism of intent and timeliness to strengthen the fusion process\nof asynchronous messages. In this way, agents can extract the long-term intent\nof others, even from delayed messages, and selectively utilize the most recent\nmessages that are relevant to their intent. Experimental results demonstrate\nthat CoDe outperforms baseline algorithms in three MARL benchmarks without\ndelay and exhibits robustness under fixed and time-varying delays.\n","authors":["Shoucheng Song","Youfang Lin","Sheng Han","Chang Yao","Hao Wu","Shuo Wang","Kai Lv"],"pdf_url":"https://arxiv.org/pdf/2501.05207v1.pdf","comment":"AAAI 2025 Accepted"},{"id":"http://arxiv.org/abs/2501.05204v1","updated":"2025-01-09T12:55:21Z","published":"2025-01-09T12:55:21Z","title":"Design and Control of a Bipedal Robotic Character","summary":" Legged robots have achieved impressive feats in dynamic locomotion in\nchallenging unstructured terrain. However, in entertainment applications, the\ndesign and control of these robots face additional challenges in appealing to\nhuman audiences. This work aims to unify expressive, artist-directed motions\nand robust dynamic mobility for legged robots. To this end, we introduce a new\nbipedal robot, designed with a focus on character-driven mechanical features.\nWe present a reinforcement learning-based control architecture to robustly\nexecute artistic motions conditioned on command signals. During runtime, these\ncommand signals are generated by an animation engine which composes and blends\nbetween multiple animation sources. Finally, an intuitive operator interface\nenables real-time show performances with the robot. 
The complete system results\nin a believable robotic character, and paves the way for enhanced human-robot\nengagement in various contexts, in entertainment robotics and beyond.\n","authors":["Ruben Grandia","Espen Knoop","Michael A. Hopkins","Georg Wiedebach","Jared Bishop","Steven Pickles","David Müller","Moritz Bächer"],"pdf_url":"https://arxiv.org/pdf/2501.05204v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05197v1","updated":"2025-01-09T12:48:15Z","published":"2025-01-09T12:48:15Z","title":"An Algorithmic Approach for Causal Health Equity: A Look at Race\n Differentials in Intensive Care Unit (ICU) Outcomes","summary":" The new era of large-scale data collection and analysis presents an\nopportunity for diagnosing and understanding the causes of health inequities.\nIn this study, we describe a framework for systematically analyzing health\ndisparities using causal inference. The framework is illustrated by\ninvestigating racial and ethnic disparities in intensive care unit (ICU)\noutcome between majority and minority groups in Australia (Indigenous vs.\nNon-Indigenous) and the United States (African-American vs. White). We\ndemonstrate that commonly used statistical measures for quantifying inequity\nare insufficient, and focus on attributing the observed disparity to the causal\nmechanisms that generate it. We find that minority patients are younger at\nadmission, have worse chronic health, are more likely to be admitted for urgent\nand non-elective reasons, and have higher illness severity. At the same time,\nhowever, we find a protective direct effect of belonging to a minority group,\nwith minority patients showing improved survival compared to their majority\ncounterparts, with all other variables kept equal. We demonstrate that this\nprotective effect is related to the increased probability of being admitted to\nICU, with minority patients having an increased risk of ICU admission. 
We also\nfind that minority patients, while showing improved survival, are more likely\nto be readmitted to ICU. Thus, due to worse access to primary health care,\nminority patients are more likely to end up in ICU for preventable conditions,\ncausing a reduction in the mortality rates and creating an effect that appears\nto be protective. Since the baseline risk of ICU admission may serve as proxy\nfor lack of access to primary care, we developed the Indigenous Intensive Care\nEquity (IICE) Radar, a monitoring system for tracking the over-utilization of\nICU resources by the Indigenous population of Australia across geographical\nareas.\n","authors":["Drago Plecko","Paul Secombe","Andrea Clarke","Amelia Fiske","Samarra Toby","Donisha Duff","David Pilcher","Leo Anthony Celi","Rinaldo Bellomo","Elias Bareinboim"],"pdf_url":"https://arxiv.org/pdf/2501.05197v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.04729v3","updated":"2025-01-09T12:44:44Z","published":"2024-01-09T18:59:47Z","title":"Human Delegation Behavior in Human-AI Collaboration: The Effect of\n Contextual Information","summary":" The integration of artificial intelligence (AI) into human decision-making\nprocesses at the workplace presents both opportunities and challenges. One\npromising approach to leverage existing complementary capabilities is allowing\nhumans to delegate individual instances of decision tasks to AI. However,\nenabling humans to delegate instances effectively requires them to assess\nseveral factors. One key factor is the analysis of both their own capabilities\nand those of the AI in the context of the given task. In this work, we conduct\na behavioral study to explore the effects of providing contextual information\nto support this delegation decision. Specifically, we investigate how\ncontextual information about the AI and the task domain influence humans'\ndelegation decisions to an AI and their impact on the human-AI team\nperformance. 
Our findings reveal that access to contextual information\nsignificantly improves human-AI team performance in delegation settings.\nFinally, we show that the delegation behavior changes with the different types\nof contextual information. Overall, this research advances the understanding of\ncomputer-supported, collaborative work and provides actionable insights for\ndesigning more effective collaborative systems.\n","authors":["Philipp Spitzer","Joshua Holstein","Patrick Hemmer","Michael Vössing","Niklas Kühl","Dominik Martin","Gerhard Satzger"],"pdf_url":"https://arxiv.org/pdf/2401.04729v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09094v2","updated":"2025-01-09T12:38:37Z","published":"2024-12-12T09:22:04Z","title":"Filter-then-Generate: Large Language Models with Structure-Text Adapter\n for Knowledge Graph Completion","summary":" Large Language Models (LLMs) present massive inherent knowledge and superior\nsemantic comprehension capability, which have revolutionized various tasks in\nnatural language processing. Despite their success, a critical gap remains in\nenabling LLMs to perform knowledge graph completion (KGC). Empirical evidence\nsuggests that LLMs consistently perform worse than conventional KGC approaches,\neven through sophisticated prompt design or tailored instruction-tuning.\nFundamentally, applying LLMs on KGC introduces several critical challenges,\nincluding a vast set of entity candidates, hallucination issue of LLMs, and\nunder-exploitation of the graph structure. To address these challenges, we\npropose a novel instruction-tuning-based method, namely FtG. Specifically, we\npresent a \\textit{filter-then-generate} paradigm and formulate the KGC task\ninto a multiple-choice question format. 
In this way, we can harness the\ncapability of LLMs while mitigating the issue caused by hallucinations.\nMoreover, we devise a flexible ego-graph serialization prompt and employ a\nstructure-text adapter to couple structure and text information in a\ncontextualized manner. Experimental results demonstrate that FtG achieves\nsubstantial performance gains compared to existing state-of-the-art methods. The\ninstruction dataset and code are available at\n\url{https://github.com/LB0828/FtG}.\n","authors":["Ben Liu","Jihai Zhang","Fangquan Lin","Cheng Yang","Min Peng"],"pdf_url":"https://arxiv.org/pdf/2412.09094v2.pdf","comment":"COLING 2025 Main Conference"},{"id":"http://arxiv.org/abs/2501.05190v1","updated":"2025-01-09T12:30:22Z","published":"2025-01-09T12:30:22Z","title":"RadioTransformer: Accurate Radio Map Construction and Coverage\n Prediction","summary":" Radio map, or pathloss map prediction, is a crucial method for wireless\nnetwork modeling and management. By leveraging deep learning to construct\npathloss patterns from geographical maps, an accurate digital replica of the\ntransmission environment could be established with less computational overhead\nand lower prediction error compared to traditional model-driven techniques.\nWhile existing state-of-the-art (SOTA) methods predominantly rely on\nconvolutional architectures, this paper introduces a hybrid\ntransformer-convolution model, termed RadioTransformer, to enhance the accuracy\nof radio map prediction. 
The proposed model features a multi-scale\ntransformer-based encoder for efficient feature extraction and a\nconvolution-based decoder for precise pixel-level image reconstruction.\nSimulation results demonstrate that the proposed scheme significantly improves\nprediction accuracy, and over a 30% reduction in root mean square error (RMSE)\nis achieved compared to typical SOTA approaches.\n","authors":["Yuxuan Li","Cheng Zhang","Wen Wang","Yongming Huang"],"pdf_url":"https://arxiv.org/pdf/2501.05190v1.pdf","comment":"Submitted to IEEE VTC 2025 Spring"},{"id":"http://arxiv.org/abs/2411.17251v5","updated":"2025-01-09T12:28:55Z","published":"2024-11-26T09:29:27Z","title":"DGNN-YOLO: Interpretable Dynamic Graph Neural Networks with YOLO11 for\n Detecting and Tracking Small Occluded Objects in Urban Traffic","summary":" The detection and tracking of small, occluded objects such as pedestrians,\ncyclists, and motorbikes pose significant challenges for traffic surveillance\nsystems because of their erratic movement, frequent occlusion, and poor\nvisibility in dynamic urban environments. Traditional methods like YOLO11,\nwhile proficient in spatial feature extraction for precise detection, often\nstruggle with these small and dynamically moving objects, particularly in\nhandling real-time data updates and resource efficiency. This paper introduces\nDGNN-YOLO, a novel framework that integrates dynamic graph neural networks\n(DGNNs) with YOLO11 to address these limitations. Unlike standard GNNs, DGNNs\nare chosen for their superior ability to dynamically update graph structures in\nreal-time, which enables adaptive and robust tracking of objects in highly\nvariable urban traffic scenarios. This framework constructs and regularly\nupdates its graph representations, capturing objects as nodes and their\ninteractions as edges, thus effectively responding to rapidly changing\nconditions. 
Additionally, DGNN-YOLO incorporates Grad-CAM, Grad-CAM++, and\nEigen-CAM visualization techniques to enhance interpretability and foster\ntrust, offering insights into the model's decision-making process. Extensive\nexperiments validate the framework's performance, achieving a precision of\n0.8382, recall of 0.6875, and mAP@0.5:0.95 of 0.6476, significantly\noutperforming existing methods. This study offers a scalable and interpretable\nsolution for real-time traffic surveillance and significantly advances\nintelligent transportation systems' capabilities by addressing the critical\nchallenge of detecting and tracking small, occluded objects.\n","authors":["Shahriar Soudeep","M. F. Mridha","Md Abrar Jahin","Nilanjan Dey"],"pdf_url":"https://arxiv.org/pdf/2411.17251v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.15361v2","updated":"2025-01-09T12:18:26Z","published":"2024-12-19T19:47:35Z","title":"Spatiotemporally Coherent Probabilistic Generation of Weather from\n Climate","summary":" Local climate information is crucial for impact assessment and\ndecision-making, yet coarse global climate simulations cannot capture\nsmall-scale phenomena. Current statistical downscaling methods infer these\nphenomena as temporally decoupled spatial patches. However, to preserve\nphysical properties, estimating spatio-temporally coherent high-resolution\nweather dynamics for multiple variables across long time horizons is crucial.\nWe present a novel generative approach that uses a score-based diffusion model\ntrained on high-resolution reanalysis data to capture the statistical\nproperties of local weather dynamics. After training, we condition on coarse\nclimate model data to generate weather patterns consistent with the aggregate\ninformation. As this inference task is inherently uncertain, we leverage the\nprobabilistic nature of diffusion models and sample multiple trajectories. 
We\nevaluate our approach with high-resolution reanalysis information before\napplying it to the climate model downscaling task. We then demonstrate that the\nmodel generates spatially and temporally coherent weather dynamics that align\nwith global climate output.\n","authors":["Jonathan Schmidt","Luca Schmidt","Felix Strnad","Nicole Ludwig","Philipp Hennig"],"pdf_url":"https://arxiv.org/pdf/2412.15361v2.pdf","comment":"15 pages, 6 figures, additional supplementary text and figures"},{"id":"http://arxiv.org/abs/2406.11814v5","updated":"2025-01-09T12:14:23Z","published":"2024-06-17T17:54:42Z","title":"Stochastic Neural Network Symmetrisation in Markov Categories","summary":" We consider the problem of symmetrising a neural network along a group\nhomomorphism: given a homomorphism $\\varphi : H \\to G$, we would like a\nprocedure that converts $H$-equivariant neural networks to $G$-equivariant\nones. We formulate this in terms of Markov categories, which allows us to\nconsider neural networks whose outputs may be stochastic, but with\nmeasure-theoretic details abstracted away. We obtain a flexible and\ncompositional framework for symmetrisation that relies on minimal assumptions\nabout the structure of the group and the underlying neural network\narchitecture. Our approach recovers existing canonicalisation and averaging\ntechniques for symmetrising deterministic models, and extends to provide a\nnovel methodology for symmetrising stochastic models also. 
Beyond this, our\nfindings also demonstrate the utility of Markov categories for addressing\ncomplex problems in machine learning in a conceptually clear yet mathematically\nprecise way.\n","authors":["Rob Cornish"],"pdf_url":"https://arxiv.org/pdf/2406.11814v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.16768v2","updated":"2025-01-09T12:04:51Z","published":"2024-09-25T09:26:19Z","title":"Interpreting Deep Neural Network-Based Receiver Under Varying\n Signal-To-Noise Ratios","summary":" We propose a novel method for interpreting neural networks, focusing on\nconvolutional neural network-based receiver model. The method identifies which\nunit or units of the model contain most (or least) information about the\nchannel parameter(s) of the interest, providing insights at both global and\nlocal levels -- with global explanations aggregating local ones. Experiments on\nlink-level simulations demonstrate the method's effectiveness in identifying\nunits that contribute most (and least) to signal-to-noise ratio processing.\nAlthough we focus on a radio receiver model, the method generalizes to other\nneural network architectures and applications, offering robust estimation even\nin high-dimensional settings.\n","authors":["Marko Tuononen","Dani Korpi","Ville Hautamäki"],"pdf_url":"https://arxiv.org/pdf/2409.16768v2.pdf","comment":"7+1 pages, 8 figures, 1 equation"},{"id":"http://arxiv.org/abs/2501.05170v1","updated":"2025-01-09T11:44:49Z","published":"2025-01-09T11:44:49Z","title":"De-centering the (Traditional) User: Multistakeholder Evaluation of\n Recommender Systems","summary":" Multistakeholder recommender systems are those that account for the impacts\nand preferences of multiple groups of individuals, not just the end users\nreceiving recommendations. Due to their complexity, evaluating these systems\ncannot be restricted to the overall utility of a single stakeholder, as is\noften the case of more mainstream recommender system applications. 
In this\narticle, we focus our discussion on the intricacies of the evaluation of\nmultistakeholder recommender systems. We bring attention to the different\naspects involved in the evaluation of multistakeholder recommender systems -\nfrom the range of stakeholders involved (including but not limited to producers\nand consumers) to the values and specific goals of each relevant stakeholder.\nAdditionally, we discuss how to move from theoretical principles to practical\nimplementation, providing specific use case examples. Finally, we outline open\nresearch directions for the RecSys community to explore. We aim to provide\nguidance to researchers and practitioners about how to think about these\ncomplex and domain-dependent issues of evaluation in the course of designing,\ndeveloping, and researching applications with multistakeholder aspects.\n","authors":["Robin Burke","Gediminas Adomavicius","Toine Bogers","Tommaso Di Noia","Dominik Kowald","Julia Neidhardt","Özlem Özgöbek","Maria Soledad Pera","Nava Tintarev","Jürgen Ziegler"],"pdf_url":"https://arxiv.org/pdf/2501.05170v1.pdf","comment":"Preprint submitted to Elsevier, \"Re-centering the User in Recommender\n System Research\" special issue of the International Journal of Human-Computer\n Studies (IJHCS)"},{"id":"http://arxiv.org/abs/2404.16969v4","updated":"2025-01-09T11:42:21Z","published":"2024-04-25T18:42:25Z","title":"COCOLA: Coherence-Oriented Contrastive Learning of Musical Audio\n Representations","summary":" We present COCOLA (Coherence-Oriented Contrastive Learning for Audio), a\ncontrastive learning method for musical audio representations that captures the\nharmonic and rhythmic coherence between samples. Our method operates at the\nlevel of the stems composing music tracks and can input features obtained via\nHarmonic-Percussive Separation (HPS). 
COCOLA allows the objective evaluation of\ngenerative models for music accompaniment generation, which are difficult to\nbenchmark with established metrics. In this regard, we evaluate recent music\naccompaniment generation models, demonstrating the effectiveness of the\nproposed method. We release the model checkpoints trained on public datasets\ncontaining separate stems (MUSDB18-HQ, MoisesDB, Slakh2100, and CocoChorales).\n","authors":["Ruben Ciranni","Giorgio Mariani","Michele Mancusi","Emilian Postolache","Giorgio Fabbro","Emanuele Rodolà","Luca Cosmo"],"pdf_url":"https://arxiv.org/pdf/2404.16969v4.pdf","comment":"Demo page: https://github.com/gladia-research-group/cocola, Accepted\n at ICASSP-25"},{"id":"http://arxiv.org/abs/2412.11120v2","updated":"2025-01-09T11:39:32Z","published":"2024-12-15T08:51:14Z","title":"Latent Reward: LLM-Empowered Credit Assignment in Episodic Reinforcement\n Learning","summary":" Reinforcement learning (RL) often encounters delayed and sparse feedback in\nreal-world applications, even with only episodic rewards. Previous approaches\nhave made some progress in reward redistribution for credit assignment but\nstill face challenges, including training difficulties due to redundancy and\nambiguous attributions stemming from overlooking the multifaceted nature of\nmission performance evaluation. Hopefully, Large Language Model (LLM)\nencompasses fruitful decision-making knowledge and provides a plausible tool\nfor reward redistribution. Even so, deploying LLM in this case is non-trivial\ndue to the misalignment between linguistic knowledge and the symbolic form\nrequirement, together with inherent randomness and hallucinations in inference.\nTo tackle these issues, we introduce LaRe, a novel LLM-empowered symbolic-based\ndecision-making framework, to improve credit assignment. 
Key to LaRe is the\nconcept of the Latent Reward, which works as a multi-dimensional performance\nevaluation, enabling more interpretable goal attainment from various\nperspectives and facilitating more effective reward redistribution. We examine\nthat semantically generated code from LLM can bridge linguistic knowledge and\nsymbolic latent rewards, as it is executable for symbolic objects. Meanwhile,\nwe design latent reward self-verification to increase the stability and\nreliability of LLM inference. Theoretically, reward-irrelevant redundancy\nelimination in the latent reward benefits RL performance from more accurate\nreward estimation. Extensive experimental results witness that LaRe (i)\nachieves superior temporal credit assignment to SOTA methods, (ii) excels in\nallocating contributions among multiple agents, and (iii) outperforms policies\ntrained with ground truth rewards for certain tasks.\n","authors":["Yun Qu","Yuhang Jiang","Boyuan Wang","Yixiu Mao","Cheems Wang","Chang Liu","Xiangyang Ji"],"pdf_url":"https://arxiv.org/pdf/2412.11120v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.15166v3","updated":"2025-01-09T11:39:19Z","published":"2024-02-23T07:59:23Z","title":"Convergence Analysis of Split Federated Learning on Heterogeneous Data","summary":" Split federated learning (SFL) is a recent distributed approach for\ncollaborative model training among multiple clients. In SFL, a global model is\ntypically split into two parts, where clients train one part in a parallel\nfederated manner, and a main server trains the other. Despite the recent\nresearch on SFL algorithm development, the convergence analysis of SFL is\nmissing in the literature, and this paper aims to fill this gap. The analysis\nof SFL can be more challenging than that of federated learning (FL), due to the\npotential dual-paced updates at the clients and the main server. 
We provide\nconvergence analysis of SFL for strongly convex and general convex objectives\non heterogeneous data. The convergence rates are $O(1/T)$ and\n$O(1/\\sqrt[3]{T})$, respectively, where $T$ denotes the total number of rounds\nfor SFL training. We further extend the analysis to non-convex objectives and\nthe scenario where some clients may be unavailable during training.\nExperiments validate our theoretical results and show that SFL\noutperforms FL and split learning (SL) when data is highly heterogeneous across\na large number of clients.\n","authors":["Pengchao Han","Chao Huang","Geng Tian","Ming Tang","Xin Liu"],"pdf_url":"https://arxiv.org/pdf/2402.15166v3.pdf","comment":"Accepted by Conference on Neural Information Processing Systems\n (NeurIPS 2024)"},{"id":"http://arxiv.org/abs/2501.04239v2","updated":"2025-01-09T11:38:45Z","published":"2025-01-08T02:32:48Z","title":"Dynamic Localisation of Spatial-Temporal Graph Neural Network","summary":" Spatial-temporal data, fundamental to many intelligent applications, reveals\ndependencies indicating causal links between present measurements at specific\nlocations and historical data at the same or other locations. Within this\ncontext, adaptive spatial-temporal graph neural networks (ASTGNNs) have emerged\nas valuable tools for modelling these dependencies, especially through a\ndata-driven approach rather than pre-defined spatial graphs. While this\napproach offers higher accuracy, it presents increased computational demands.\nAddressing this challenge, this paper delves into the concept of localisation\nwithin ASTGNNs, introducing an innovative perspective that spatial dependencies\nshould be dynamically evolving over time. We introduce \\textit{DynAGS}, a\nlocalised ASTGNN framework aimed at maximising efficiency and accuracy in\ndistributed deployment. 
This framework integrates dynamic localisation,\ntime-evolving spatial graphs, and personalised localisation, all orchestrated\naround the Dynamic Graph Generator, a light-weighted central module leveraging\ncross attention. The central module can integrate historical information in a\nnode-independent manner to enhance the feature representation of nodes at the\ncurrent moment. This improved feature representation is then used to generate a\ndynamic sparse graph without the need for costly data exchanges, and it\nsupports personalised localisation. Performance assessments across two core\nASTGNN architectures and nine real-world datasets from various applications\nreveal that \\textit{DynAGS} outshines current benchmarks, underscoring that the\ndynamic modelling of spatial dependencies can drastically improve model\nexpressibility, flexibility, and system efficiency, especially in distributed\nsettings.\n","authors":["Wenying Duan","Shujun Guo","Wei huang","Hong Rao","Xiaoxi He"],"pdf_url":"https://arxiv.org/pdf/2501.04239v2.pdf","comment":"This paper was accepted by KDD'25"},{"id":"http://arxiv.org/abs/2405.09062v6","updated":"2025-01-09T11:38:07Z","published":"2024-05-15T03:26:01Z","title":"Naturalistic Music Decoding from EEG Data via Latent Diffusion Models","summary":" In this article, we explore the potential of using latent diffusion models, a\nfamily of powerful generative models, for the task of reconstructing\nnaturalistic music from electroencephalogram (EEG) recordings. Unlike simpler\nmusic with limited timbres, such as MIDI-generated tunes or monophonic pieces,\nthe focus here is on intricate music featuring a diverse array of instruments,\nvoices, and effects, rich in harmonics and timbre. This study represents an\ninitial foray into achieving general music reconstruction of high-quality using\nnon-invasive EEG data, employing an end-to-end training approach directly on\nraw data without the need for manual pre-processing and channel selection. 
We\ntrain our models on the public NMED-T dataset and perform quantitative\nevaluation proposing neural embedding-based metrics. Our work contributes to\nthe ongoing research in neural decoding and brain-computer interfaces, offering\ninsights into the feasibility of using EEG data for complex auditory\ninformation reconstruction.\n","authors":["Emilian Postolache","Natalia Polouliakh","Hiroaki Kitano","Akima Connelly","Emanuele Rodolà","Luca Cosmo","Taketo Akama"],"pdf_url":"https://arxiv.org/pdf/2405.09062v6.pdf","comment":"Accepted at ICASSP-25"},{"id":"http://arxiv.org/abs/2404.03105v2","updated":"2025-01-09T11:24:56Z","published":"2024-04-03T23:07:24Z","title":"Methodology for Interpretable Reinforcement Learning for Optimizing\n Mechanical Ventilation","summary":" Mechanical ventilation is a critical life support intervention that delivers\ncontrolled air and oxygen to a patient's lungs, assisting or replacing\nspontaneous breathing. While several data-driven approaches have been proposed\nto optimize ventilator control strategies, they often lack interpretability and\nalignment with domain knowledge, hindering clinical adoption. This paper\npresents a methodology for interpretable reinforcement learning (RL) aimed at\nimproving mechanical ventilation control as part of connected health systems.\nUsing a causal, nonparametric model-based off-policy evaluation, we assess RL\npolicies for their ability to enhance patient-specific outcomes-specifically,\nincreasing blood oxygen levels (SpO2), while avoiding aggressive ventilator\nsettings that may cause ventilator-induced lung injuries and other\ncomplications. Through numerical experiments on real-world ICU data from the\nMIMIC-III database, we demonstrate that our interpretable decision tree policy\nachieves performance comparable to state-of-the-art deep RL methods while\noutperforming standard behavior cloning approaches. 
The results highlight the\npotential of interpretable, data-driven decision support systems to improve\nsafety and efficiency in personalized ventilation strategies, paving the way\nfor seamless integration into connected healthcare environments.\n","authors":["Joo Seung Lee","Malini Mahendra","Anil Aswani"],"pdf_url":"https://arxiv.org/pdf/2404.03105v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.00717v3","updated":"2025-01-09T11:24:44Z","published":"2024-09-01T13:14:41Z","title":"Preference-Based Multi-Agent Reinforcement Learning: Data Coverage and\n Algorithmic Techniques","summary":" We initiate the study of Preference-Based Multi-Agent Reinforcement Learning\n(PbMARL), exploring both theoretical foundations and empirical validations. We\ndefine the task as identifying the Nash equilibrium from a preference-only\noffline dataset in general-sum games, a problem marked by the challenge of\nsparse feedback signals. Our theory establishes the upper complexity bounds for\nNash Equilibrium in effective PbMARL, demonstrating that single-policy coverage\nis inadequate and highlighting the importance of unilateral dataset coverage.\nThese theoretical insights are verified through comprehensive experiments. To\nenhance the practical performance, we further introduce two algorithmic\ntechniques. (1) We propose a Mean Squared Error (MSE) regularization along the\ntime axis to achieve a more uniform reward distribution and improve reward\nlearning outcomes. (2) We propose an additional penalty based on the\ndistribution of the dataset to incorporate pessimism, improving stability and\neffectiveness during training. Our findings underscore the multifaceted\napproach required for PbMARL, paving the way for effective preference-based\nmulti-agent systems.\n","authors":["Natalia Zhang","Xinqi Wang","Qiwen Cui","Runlong Zhou","Sham M. Kakade","Simon S. 
Du"],"pdf_url":"https://arxiv.org/pdf/2409.00717v3.pdf","comment":"9 pages"},{"id":"http://arxiv.org/abs/2501.02648v2","updated":"2025-01-09T11:17:01Z","published":"2025-01-05T20:26:49Z","title":"Representation Learning of Lab Values via Masked AutoEncoder","summary":" Accurate imputation of missing laboratory values in electronic health records\n(EHRs) is critical to enable robust clinical predictions and reduce biases in\nAI systems in healthcare. Existing methods, such as variational autoencoders\n(VAEs) and decision tree-based approaches such as XGBoost, struggle to model\nthe complex temporal and contextual dependencies in EHR data, mainly in\nunderrepresented groups. In this work, we propose Lab-MAE, a novel\ntransformer-based masked autoencoder framework that leverages self-supervised\nlearning for the imputation of continuous sequential lab values. Lab-MAE\nintroduces a structured encoding scheme that jointly models laboratory test\nvalues and their corresponding timestamps, enabling explicit capture of temporal\ndependencies. Empirical evaluation on the MIMIC-IV dataset demonstrates that\nLab-MAE significantly outperforms the state-of-the-art baselines such as\nXGBoost across multiple metrics, including root mean square error (RMSE),\nR-squared (R2), and Wasserstein distance (WD). Notably, Lab-MAE achieves\nequitable performance across demographic groups of patients, advancing fairness\nin clinical predictions. We further investigate the role of follow-up\nlaboratory values as potential shortcut features, revealing Lab-MAE's\nrobustness in scenarios where such data is unavailable. The findings suggest\nthat our transformer-based architecture, adapted to the characteristics of the\nEHR data, offers a foundation model for more accurate and fair clinical\nimputation models. 
In addition, we measure and compare the carbon footprint of\nLab-MAE with the baseline XGBoost model, highlighting its environmental\nrequirements.\n","authors":["David Restrepo","Chenwei Wu","Yueran Jia","Jaden K. Sun","Jack Gallifant","Catherine G. Bielick","Yugang Jia","Leo A. Celi"],"pdf_url":"https://arxiv.org/pdf/2501.02648v2.pdf","comment":"10 pages main text, 8 appendix"},{"id":"http://arxiv.org/abs/2411.07066v2","updated":"2025-01-09T11:11:37Z","published":"2024-11-11T15:30:16Z","title":"Zeroth-Order Adaptive Neuron Alignment Based Pruning without Re-Training","summary":" Network pruning focuses on computational techniques that aim to reduce a\ngiven model's computational cost by removing a subset of its parameters while\nhaving minimal impact on performance. Throughout the last decade, the most\nwidely used pruning paradigm has been pruning and re-training, which nowadays\nis inconvenient due to the vast amount of pre-trained models, which are in any\ncase too expensive to re-train. In this paper, we exploit functional\ninformation from dense pre-trained models, i.e., their activations, to obtain\nsparse models that maximize the activations' alignment w.r.t. their\ncorresponding dense models. Hence, we propose \\textsc{NeuroAL}, a \\emph{top-up}\nalgorithm that can be used on top of any given pruning algorithm for LLMs,\nwhich modifies the block-wise and row-wise sparsity exploiting information from\nboth the dense model and its sparse version to maximize the \\emph{neuron\nalignment} among activations. Differently from existing methods, our approach\nadaptively selects the best hyperparameters for the block-wise and row-wise\nsparsity ratios w.r.t. the model and the desired sparsity, and requires\n\\emph{no re-training}. 
We test our method over 276 cases combining four LLM\nfamilies, three sparsity ratios, and ten language tasks (three language\nmodeling and seven zero-shot datasets), showing how it consistently outperforms\nthe latest state-of-the-art methods in terms of performance-runtime trade-off.\nThe code is available at\n\\href{https://github.com/eliacunegatti/NeuroAL}{https://github.com/eliacunegatti/NeuroAL}.\n","authors":["Elia Cunegatti","Leonardo Lucio Custode","Giovanni Iacca"],"pdf_url":"https://arxiv.org/pdf/2411.07066v2.pdf","comment":"Work in progress"},{"id":"http://arxiv.org/abs/2409.05072v2","updated":"2025-01-09T11:06:36Z","published":"2024-09-08T12:19:12Z","title":"A General Framework for Clustering and Distribution Matching with Bandit\n Feedback","summary":" We develop a general framework for clustering and distribution matching\nproblems with bandit feedback. We consider a $K$-armed bandit model where some\nsubset of $K$ arms is partitioned into $M$ groups. Within each group, the\nrandom variable associated to each arm follows the same distribution on a\nfinite alphabet. At each time step, the decision maker pulls an arm and\nobserves its outcome from the random variable associated to that arm.\nSubsequent arm pulls depend on the history of arm pulls and their outcomes. The\ndecision maker has no knowledge of the distributions of the arms or the\nunderlying partitions. The task is to devise an online algorithm to learn the\nunderlying partition of arms with the least number of arm pulls on average and\nwith an error probability not exceeding a pre-determined value~$\\delta$.\nSeveral existing problems fall under our general framework, including finding\n$M$ pairs of arms, odd arm identification, and $N$-ary clustering of $K$ arms\nbelong to our general framework. We derive a non-asymptotic lower bound on the\naverage number of arm pulls for any online algorithm with an error probability\nnot exceeding $\\delta$. 
Furthermore, we develop a computationally-efficient\nonline algorithm based on the Track-and-Stop method and Frank--Wolfe algorithm,\nand show that the average number of arm pulls of our algorithm asymptotically\nmatches that of the lower bound. Our refined analysis also uncovers a novel\nbound on the speed at which the average number of arm pulls of our algorithm\nconverges to the fundamental limit as $\\delta$ vanishes.\n","authors":["Recep Can Yavas","Yuqi Huang","Vincent Y. F. Tan","Jonathan Scarlett"],"pdf_url":"https://arxiv.org/pdf/2409.05072v2.pdf","comment":"24 pages"},{"id":"http://arxiv.org/abs/2406.00778v2","updated":"2025-01-09T10:47:35Z","published":"2024-06-02T15:35:45Z","title":"Bayesian Joint Additive Factor Models for Multiview Learning","summary":" It is increasingly common in a wide variety of applied settings to collect\ndata of multiple different types on the same set of samples. Our particular\nfocus in this article is on studying relationships between such multiview\nfeatures and responses. A motivating application arises in the context of\nprecision medicine where multi-omics data are collected to correlate with\nclinical outcomes. It is of interest to infer dependence within and across\nviews while combining multimodal information to improve the prediction of\noutcomes. The signal-to-noise ratio can vary substantially across views,\nmotivating more nuanced statistical tools beyond standard late and early\nfusion. This challenge comes with the need to preserve interpretability, select\nfeatures, and obtain accurate uncertainty quantification. We propose a joint\nadditive factor regression model (JAFAR) with a structured additive design,\naccounting for shared and view-specific components. We ensure identifiability\nvia a novel dependent cumulative shrinkage process (D-CUSP) prior. We provide\nan efficient implementation via a partially collapsed Gibbs sampler and extend\nour approach to allow flexible feature and outcome distributions. 
Prediction of\ntime-to-labor onset from immunome, metabolome, and proteome data illustrates\nperformance gains against state-of-the-art competitors. Our open-source\nsoftware (R package) is available at https://github.com/niccoloanceschi/jafar.\n","authors":["Niccolo Anceschi","Federico Ferrari","David B. Dunson","Himel Mallick"],"pdf_url":"https://arxiv.org/pdf/2406.00778v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05130v1","updated":"2025-01-09T10:33:16Z","published":"2025-01-09T10:33:16Z","title":"Learning In-Distribution Representations for Anomaly Detection","summary":" Anomaly detection involves identifying data patterns that deviate from the\nanticipated norm. Traditional methods struggle in high-dimensional spaces due\nto the curse of dimensionality. In recent years, self-supervised learning,\nparticularly through contrastive objectives, has driven advances in anomaly\ndetection. However, vanilla contrastive learning struggles to align with the\nunique demands of anomaly detection, as it lacks a pretext task tailored to the\nhomogeneous nature of In-Distribution (ID) data and the diversity of\nOut-of-Distribution (OOD) anomalies. Methods that attempt to address these\nchallenges, such as introducing hard negatives through synthetic outliers,\nOutlier Exposure (OE), and supervised objectives, often rely on pretext tasks\nthat fail to balance compact clustering of ID samples with sufficient\nseparation from OOD data. In this work, we propose Focused In-distribution\nRepresentation Modeling (FIRM), a contrastive learning objective specifically\ndesigned for anomaly detection. Unlike existing approaches, FIRM incorporates\nsynthetic outliers into its pretext task in a way that actively shapes the\nrepresentation space, promoting compact clustering of ID samples while\nenforcing strong separation from outliers. 
This formulation addresses the\nchallenges of class collision, enhancing both the compactness of ID\nrepresentations and the discriminative power of the learned feature space. We\nshow that FIRM surpasses other contrastive methods in standard benchmarks,\nsignificantly enhancing anomaly detection compared to both traditional and\nsupervised contrastive learning objectives. Our ablation studies confirm that\nFIRM consistently improves the quality of representations and shows robustness\nacross a range of scoring methods. The code is available at:\nhttps://github.com/willtl/firm.\n","authors":["William T. Lunardi","Abdulrahman Banabila","Dania Herzalla","Martin L. Andreoni"],"pdf_url":"https://arxiv.org/pdf/2501.05130v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.01163v2","updated":"2025-01-09T10:19:09Z","published":"2024-08-02T10:25:19Z","title":"Domain Adaptation-Enhanced Searchlight: Enabling classification of brain\n states from visual perception to mental imagery","summary":" In cognitive neuroscience and brain-computer interface research, accurately\npredicting imagined stimuli is crucial. This study investigates the\neffectiveness of Domain Adaptation (DA) in enhancing imagery prediction using\nprimarily visual data from fMRI scans of 18 subjects. Initially, we train a\nbaseline model on visual stimuli to predict imagined stimuli, utilizing data\nfrom 14 brain regions. We then develop several models to improve imagery\nprediction, comparing different DA methods. Our results demonstrate that DA\nsignificantly enhances imagery prediction in binary classification on our\ndataset, as well as in multiclass classification on a publicly available\ndataset. We then conduct a DA-enhanced searchlight analysis, followed by\npermutation-based statistical tests to identify brain regions where imagery\ndecoding is consistently above chance across subjects. 
Our DA-enhanced\nsearchlight predicts imagery contents in a highly distributed set of brain\nregions, including the visual cortex and the frontoparietal cortex, thereby\noutperforming standard cross-domain classification methods. The complete code\nand data for this paper have been made openly available for the use of the\nscientific community.\n","authors":["Alexander Olza","David Soto","Roberto Santana"],"pdf_url":"https://arxiv.org/pdf/2408.01163v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05113v1","updated":"2025-01-09T09:59:42Z","published":"2025-01-09T09:59:42Z","title":"Constrained Optimization of Charged Particle Tracking with Multi-Agent\n Reinforcement Learning","summary":" Reinforcement learning demonstrated immense success in modelling complex\nphysics-driven systems, providing end-to-end trainable solutions by interacting\nwith a simulated or real environment, maximizing a scalar reward signal. In\nthis work, we propose, building upon previous work, a multi-agent reinforcement\nlearning approach with assignment constraints for reconstructing particle\ntracks in pixelated particle detectors. Our approach optimizes collaboratively\na parametrized policy, functioning as a heuristic to a multidimensional\nassignment problem, by jointly minimizing the total amount of particle\nscattering over the reconstructed tracks in a readout frame. To satisfy\nconstraints, guaranteeing a unique assignment of particle hits, we propose a\nsafety layer solving a linear assignment problem for every joint action.\nFurther, to enforce cost margins, increasing the distance of the local policies\npredictions to the decision boundaries of the optimizer mappings, we recommend\nthe use of an additional component in the blackbox gradient estimation, forcing\nthe policy to solutions with lower total assignment costs. 
We empirically show\non simulated data, generated for a particle detector developed for proton\nimaging, the effectiveness of our approach, compared to multiple single- and\nmulti-agent baselines. We further demonstrate the effectiveness of constraints\nwith cost margins for both optimization and generalization, introduced by wider\nregions with high reconstruction performance as well as reduced predictive\ninstabilities. Our results form the basis for further developments in RL-based\ntracking, offering both enhanced performance with constrained policies and\ngreater flexibility in optimizing tracking algorithms through the option for\nindividual and team rewards.\n","authors":["Tobias Kortus","Ralf Keidel","Nicolas R. Gauger","Jan Kieseler"],"pdf_url":"https://arxiv.org/pdf/2501.05113v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05109v1","updated":"2025-01-09T09:57:33Z","published":"2025-01-09T09:57:33Z","title":"EquiBoost: An Equivariant Boosting Approach to Molecular Conformation\n Generation","summary":" Molecular conformation generation plays key roles in computational drug\ndesign. Recently developed deep learning methods, particularly diffusion models\nhave reached competitive performance over traditional cheminformatical\napproaches. However, these methods are often time-consuming or require extra\nsupport from traditional methods. We propose EquiBoost, a boosting model that\nstacks several equivariant graph transformers as weak learners, to iteratively\nrefine 3D conformations of molecules. Without relying on diffusion techniques,\nEquiBoost balances accuracy and efficiency more effectively than\ndiffusion-based methods. Notably, compared to the previous state-of-the-art\ndiffusion method, EquiBoost improves generation quality and preserves\ndiversity, achieving considerably better precision of Average Minimum RMSD\n(AMR) on the GEOM datasets. 
This work rejuvenates boosting and sheds light on\nits potential to be a robust alternative to diffusion models in certain\nscenarios.\n","authors":["Yixuan Yang","Xingyu Fang","Zhaowen Cheng","Pengju Yan","Xiaolin Li"],"pdf_url":"https://arxiv.org/pdf/2501.05109v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05105v1","updated":"2025-01-09T09:46:27Z","published":"2025-01-09T09:46:27Z","title":"Robust Score Matching","summary":" Proposed in Hyv\\\"arinen (2005), score matching is a parameter estimation\nprocedure that does not require computation of distributional normalizing\nconstants. In this work we utilize the geometric median of means to develop a\nrobust score matching procedure that yields consistent parameter estimates in\nsettings where the observed data has been contaminated. A special appeal of the\nproposed method is that it retains convexity in exponential family models. The\nnew method is therefore particularly attractive for non-Gaussian, exponential\nfamily graphical models where evaluation of normalizing constants is\nintractable. Support recovery guarantees for such models when contamination is\npresent are provided. Additionally, support recovery is studied in numerical\nexperiments and on a precipitation dataset. 
We demonstrate that the proposed\nrobust score matching estimator performs comparably to the standard score\nmatching estimator when no contamination is present but greatly outperforms\nthis estimator in a setting with contamination.\n","authors":["Richard Schwank","Andrew McCormack","Mathias Drton"],"pdf_url":"https://arxiv.org/pdf/2501.05105v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05097v1","updated":"2025-01-09T09:25:22Z","published":"2025-01-09T09:25:22Z","title":"A 1Mb mixed-precision quantized encoder for image classification and\n patch-based compression","summary":" Even though Application-Specific Integrated Circuits (ASIC) have proven to be a\nrelevant choice for integrating inference at the edge, they are often limited\nin terms of applicability. In this paper, we demonstrate that an ASIC neural\nnetwork accelerator dedicated to image processing can be applied to multiple\ntasks of different levels: image classification and compression, while\nrequiring very limited hardware. The key component is a reconfigurable,\nmixed-precision (3b/2b/1b) encoder that takes advantage of proper weight and\nactivation quantizations combined with convolutional layer structural pruning\nto lower hardware-related constraints (memory and computing). We introduce an\nautomatic adaptation of linear symmetric quantizer scaling factors to perform\nquantized levels equalization, aiming at stabilizing quinary and ternary\nweights training. In addition, a proposed layer-shared Bit-Shift Normalization\nsignificantly simplifies the implementation of the hardware-expensive Batch\nNormalization. For a specific configuration in which the encoder design only\nrequires 1Mb, the classification accuracy reaches 87.5% on CIFAR-10. Besides,\nwe also show that this quantized encoder can be used to compress image\npatch-by-patch while the reconstruction can be performed remotely, by a dedicated\nfull-frame decoder. 
This solution typically enables an end-to-end compression\nalmost without any block artifacts, outperforming patch-based state-of-the-art\ntechniques employing a patch-constant bitrate.\n","authors":["Van Thien Nguyen","William Guicquero","Gilles Sicard"],"pdf_url":"https://arxiv.org/pdf/2501.05097v1.pdf","comment":"Published at IEEE Transactions on Circuits and Systems for Video\n Technology (TCSVT)"},{"id":"http://arxiv.org/abs/2410.06232v2","updated":"2025-01-09T09:20:48Z","published":"2024-10-08T17:41:37Z","title":"Range, not Independence, Drives Modularity in Biological Inspired\n Representation","summary":" Why do biological and artificial neurons sometimes modularise, each encoding\na single meaningful variable, and sometimes entangle their representation of\nmany variables? In this work, we develop a theory of when biologically inspired\nnetworks -- those that are nonnegative and energy efficient -- modularise their\nrepresentation of source variables (sources). We derive necessary and\nsufficient conditions on a sample of sources that determine whether the neurons\nin an optimal biologically-inspired linear autoencoder modularise. Our theory\napplies to any dataset, extending far beyond the case of statistical\nindependence studied in previous work. Rather we show that sources modularise\nif their support is ``sufficiently spread''. From this theory, we extract and\nvalidate predictions in a variety of empirical studies on how data distribution\naffects modularisation in nonlinear feedforward and recurrent neural networks\ntrained on supervised and unsupervised tasks. Furthermore, we apply these ideas\nto neuroscience data, showing that range independence can be used to understand\nthe mixing or modularising of spatial and reward information in entorhinal\nrecordings in seemingly conflicting experiments. Further, we use these results\nto suggest alternate origins of mixed-selectivity, beyond the predominant\ntheory of flexible nonlinear classification. 
In sum, our theory prescribes\nprecise conditions on when neural activities modularise, providing tools for\ninducing and elucidating modular representations in brains and machines.\n","authors":["Will Dorrell","Kyle Hsu","Luke Hollingsworth","Jin Hwa Lee","Jiajun Wu","Chelsea Finn","Peter E Latham","Tim EJ Behrens","James CR Whittington"],"pdf_url":"https://arxiv.org/pdf/2410.06232v2.pdf","comment":"40 pages, 16 figures. WD and KH contributed equally; LH and JHL\n contributed equally"},{"id":"http://arxiv.org/abs/2501.05093v1","updated":"2025-01-09T09:19:05Z","published":"2025-01-09T09:19:05Z","title":"Hierarchical Decomposed Dual-domain Deep Learning for Sparse-View CT\n Reconstruction","summary":" Objective: X-ray computed tomography employing sparse projection views has\nemerged as a contemporary technique to mitigate radiation dose. However, due to\nthe inadequate number of projection views, an analytic reconstruction method\nutilizing filtered backprojection results in severe streaking artifacts.\nRecently, deep learning strategies employing image-domain networks have\ndemonstrated remarkable performance in eliminating the streaking artifact\ncaused by analytic reconstruction methods with sparse projection views.\nNevertheless, it is difficult to clarify the theoretical justification for\napplying deep learning to sparse view CT reconstruction, and it has been\nunderstood as restoration by removing image artifacts, not reconstruction.\n Approach: By leveraging the theory of deep convolutional framelets and the\nhierarchical decomposition of measurement, this research reveals the\nconstraints of conventional image- and projection-domain deep learning\nmethodologies; subsequently, it proposes a novel dual-domain deep\nlearning framework utilizing hierarchical decomposed measurements.\nSpecifically, the research elucidates how the performance of the\nprojection-domain network can be enhanced through a low-rank property of deep\nconvolutional framelets 
and a bowtie support of hierarchical decomposed\nmeasurement in the Fourier domain.\n Main Results: This study demonstrated performance improvement of the proposed\nframework based on the low-rank property, resulting in superior reconstruction\nperformance compared to conventional analytic and deep learning methods.\n Significance: By providing a theoretically justified deep learning approach\nfor sparse-view CT reconstruction, this study not only offers a superior\nalternative to existing methods but also opens new avenues for research in\nmedical imaging.\n","authors":["Yoseob Han"],"pdf_url":"https://arxiv.org/pdf/2501.05093v1.pdf","comment":"Published by Physics in Medicine & Biology (2024.4)"},{"id":"http://arxiv.org/abs/2501.05089v1","updated":"2025-01-09T09:12:57Z","published":"2025-01-09T09:12:57Z","title":"Supervised Learning with Evolving Tasks and Performance Guarantees","summary":" Multiple supervised learning scenarios are composed by a sequence of\nclassification tasks. For instance, multi-task learning and continual learning\naim to learn a sequence of tasks that is either fixed or grows over time.\nExisting techniques for learning tasks that are in a sequence are tailored to\nspecific scenarios, lacking adaptability to others. In addition, most of\nexisting techniques consider situations in which the order of the tasks in the\nsequence is not relevant. However, it is common that tasks in a sequence are\nevolving in the sense that consecutive tasks often have a higher similarity.\nThis paper presents a learning methodology that is applicable to multiple\nsupervised learning scenarios and adapts to evolving tasks. 
Differently from\nexisting techniques, we provide computable tight performance guarantees and\nanalytically characterize the increase in the effective sample size.\nExperiments on benchmark datasets show the performance improvement of the\nproposed methodology in multiple scenarios and the reliability of the presented\nperformance guarantees.\n","authors":["Verónica Álvarez","Santiago Mazuelas","Jose A. Lozano"],"pdf_url":"https://arxiv.org/pdf/2501.05089v1.pdf","comment":"arXiv admin note: text overlap with arXiv:2310.15974"},{"id":"http://arxiv.org/abs/2312.03700v2","updated":"2025-01-09T09:12:06Z","published":"2023-12-06T18:59:19Z","title":"OneLLM: One Framework to Align All Modalities with Language","summary":" Multimodal large language models (MLLMs) have gained significant attention\ndue to their strong multimodal understanding capability. However, existing\nworks rely heavily on modality-specific encoders, which usually differ in\narchitecture and are limited to common modalities. In this paper, we present\nOneLLM, an MLLM that aligns eight modalities to language using a unified\nframework. We achieve this through a unified multimodal encoder and a\nprogressive multimodal alignment pipeline. In detail, we first train an image\nprojection module to connect a vision encoder with LLM. Then, we build a\nuniversal projection module (UPM) by mixing multiple image projection modules\nand dynamic routing. Finally, we progressively align more modalities to LLM\nwith the UPM. To fully leverage the potential of OneLLM in following\ninstructions, we also curated a comprehensive multimodal instruction dataset,\nincluding 2M items from image, audio, video, point cloud, depth/normal map, IMU\nand fMRI brain activity. OneLLM is evaluated on 25 diverse benchmarks,\nencompassing tasks such as multimodal captioning, question answering and\nreasoning, where it delivers excellent performance. 
Code, data, model and\nonline demo are available at https://github.com/csuhan/OneLLM\n","authors":["Jiaming Han","Kaixiong Gong","Yiyuan Zhang","Jiaqi Wang","Kaipeng Zhang","Dahua Lin","Yu Qiao","Peng Gao","Xiangyu Yue"],"pdf_url":"https://arxiv.org/pdf/2312.03700v2.pdf","comment":"Accepted by CVPR 2024. Code: https://github.com/csuhan/OneLLM"},{"id":"http://arxiv.org/abs/2501.05087v1","updated":"2025-01-09T09:11:40Z","published":"2025-01-09T09:11:40Z","title":"Enhanced Quantile Regression with Spiking Neural Networks for Long-Term\n System Health Prognostics","summary":" This paper presents a novel predictive maintenance framework centered on\nEnhanced Quantile Regression Neural Networks (EQRNNs) for anticipating system\nfailures in industrial robotics. We address the challenge of early failure\ndetection through a hybrid approach that combines advanced neural\narchitectures. The system leverages dual computational stages: first\nimplementing an EQRNN optimized for processing multi-sensor data streams\nincluding vibration, thermal, and power signatures, followed by an integrated\nSpiking Neural Network (SNN) layer that enables microsecond-level response\ntimes. This architecture achieves notable accuracy rates of 92.3\\% in component\nfailure prediction with a 90-hour advance warning window. Field testing\nconducted on an industrial scale with 50 robotic systems demonstrates\nsignificant operational improvements, yielding a 94\\% decrease in unexpected\nsystem failures and 76\\% reduction in maintenance-related downtimes. 
The\nframework's effectiveness in processing complex, multi-modal sensor data while\nmaintaining computational efficiency validates its applicability for Industry\n4.0 manufacturing environments.\n","authors":["David J Poland"],"pdf_url":"https://arxiv.org/pdf/2501.05087v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05085v1","updated":"2025-01-09T09:10:17Z","published":"2025-01-09T09:10:17Z","title":"End-to-End Deep Learning for Interior Tomography with Low-Dose X-ray CT","summary":" Objective: There exist several X-ray computed tomography (CT) scanning\nstrategies to reduce the radiation dose, such as (1) sparse-view CT, (2) low-dose\nCT, and (3) region-of-interest (ROI) CT (called interior tomography). To\nfurther reduce the dose, the sparse-view and/or low-dose CT settings can be\napplied together with interior tomography. Interior tomography has various\nadvantages in terms of reducing the number of detectors and decreasing the\nX-ray radiation dose. However, a large patient or small field-of-view (FOV)\ndetector can cause truncated projections, and then the reconstructed images\nsuffer from severe cupping artifacts. In addition, although the low-dose CT can\nreduce the radiation exposure dose, analytic reconstruction algorithms produce\nimage noise. Recently, many researchers have utilized image-domain deep\nlearning (DL) approaches to remove each artifact and demonstrated impressive\nperformance, and the theory of deep convolutional framelets supports the\nreason for the performance improvement. Approach: In this paper, we found that\nthe image-domain convolutional neural network (CNN) has difficulty solving\ncoupled artifacts, based on deep convolutional framelets. Significance: To\naddress the coupled problem, we decouple it into two sub-problems: (i) image\ndomain noise reduction inside truncated projection to solve the low-dose CT problem\nand (ii) extrapolation of projection outside truncated projection to solve the\nROI CT problem. 
The decoupled sub-problems are solved directly with a novel\nproposed end-to-end learning using dual-domain CNNs. Main results: We\ndemonstrate that the proposed method outperforms the conventional image-domain\ndeep learning methods, and a projection-domain CNN shows better performance\nthan the image-domain CNNs which are commonly used by many researchers.\n","authors":["Yoseob Han","Dufan Wu","Kyungsang Kim","Quanzheng Li"],"pdf_url":"https://arxiv.org/pdf/2501.05085v1.pdf","comment":"Published by Physics in Medicine & Biology (2022.5)"},{"id":"http://arxiv.org/abs/2412.10095v2","updated":"2025-01-09T09:09:32Z","published":"2024-12-13T12:31:06Z","title":"HiTZ at VarDial 2025 NorSID: Overcoming Data Scarcity with Language\n Transfer and Automatic Data Annotation","summary":" In this paper we present our submission for the NorSID Shared Task as part of\nthe 2025 VarDial Workshop (Scherrer et al., 2025), consisting of three tasks:\nIntent Detection, Slot Filling and Dialect Identification, evaluated using data\nin different dialects of the Norwegian language. For Intent Detection and Slot\nFilling, we have fine-tuned a multitask model in a cross-lingual setting, to\nleverage the xSID dataset available in 17 languages. In the case of Dialect\nIdentification, our final submission consists of a model fine-tuned on the\nprovided development set, which has obtained the highest scores within our\nexperiments. Our final results on the test set show that our models do not drop\nin performance compared to the development set, likely due to the\ndomain-specificity of the dataset and the similar distribution of both subsets.\nFinally, we also report an in-depth analysis of the provided datasets and their\nartifacts, as well as other sets of experiments that have been carried out but\ndid not yield the best results. 
Additionally, we present an analysis on the\nreasons why some methods have been more successful than others; mainly the\nimpact of the combination of languages and domain-specificity of the training\ndata on the results.\n","authors":["Jaione Bengoetxea","Mikel Zubillaga","Ekhi Azurmendi","Maite Heredia","Julen Etxaniz","Markel Ferro","Jeremy Barnes"],"pdf_url":"https://arxiv.org/pdf/2412.10095v2.pdf","comment":"Vardial 2025 NorSID Shared Task, fixed minor typos"},{"id":"http://arxiv.org/abs/2501.05082v1","updated":"2025-01-09T09:03:43Z","published":"2025-01-09T09:03:43Z","title":"Comparison of Feature Learning Methods for Metadata Extraction from PDF\n Scholarly Documents","summary":" The availability of metadata for scientific documents is pivotal in\npropelling scientific knowledge forward and for adhering to the FAIR principles\n(i.e. Findability, Accessibility, Interoperability, and Reusability) of\nresearch findings. However, the lack of sufficient metadata in published\ndocuments, particularly those from smaller and mid-sized publishers, hinders\ntheir accessibility. This issue is widespread in some disciplines, such as the\nGerman Social Sciences, where publications often employ diverse templates. To\naddress this challenge, our study evaluates various feature learning and\nprediction methods, including natural language processing (NLP), computer\nvision (CV), and multimodal approaches, for extracting metadata from documents\nwith high template variance. We aim to improve the accessibility of scientific\ndocuments and facilitate their wider use. To support our comparison of these\nmethods, we provide comprehensive experimental results, analyzing their\naccuracy and efficiency in extracting metadata. 
Additionally, we provide\nvaluable insights into the strengths and weaknesses of various feature learning\nand prediction methods, which can guide future research in this field.\n","authors":["Zeyd Boukhers","Cong Yang"],"pdf_url":"https://arxiv.org/pdf/2501.05082v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05081v1","updated":"2025-01-09T09:02:41Z","published":"2025-01-09T09:02:41Z","title":"DriVLM: Domain Adaptation of Vision-Language Models in Autonomous\n Driving","summary":" In recent years, large language models have had a very impressive\nperformance, which largely contributed to the development and application of\nartificial intelligence, and the parameters and performance of the models are\nstill growing rapidly. In particular, multimodal large language models (MLLM)\ncan combine multiple modalities such as pictures, videos, sounds, texts, etc.,\nand have great potential in various tasks. However, most MLLMs require very\nhigh computational resources, which is a major challenge for most researchers\nand developers. In this paper, we explored the utility of small-scale MLLMs and\napplied small-scale MLLMs to the field of autonomous driving. We hope that this\nwill advance the application of MLLMs in real-world scenarios.\n","authors":["Xuran Zheng","Chang D. Yoo"],"pdf_url":"https://arxiv.org/pdf/2501.05081v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05078v1","updated":"2025-01-09T09:00:32Z","published":"2025-01-09T09:00:32Z","title":"Analyzing Memorization in Large Language Models through the Lens of\n Model Attribution","summary":" Large Language Models (LLMs) are prevalent in modern applications but often\nmemorize training data, leading to privacy breaches and copyright issues.\nExisting research has mainly focused on posthoc analyses, such as extracting\nmemorized content or developing memorization metrics, without exploring the\nunderlying architectural factors that contribute to memorization. 
In this work,\nwe investigate memorization from an architectural lens by analyzing how\nattention modules at different layers impact the model's memorization and\ngeneralization performance. Using attribution techniques, we systematically\nintervene in the LLM architecture by bypassing attention modules at specific\nblocks while keeping other components like layer normalization and MLP\ntransformations intact. We provide theorems analyzing our intervention\nmechanism from a mathematical view, bounding the difference in layer outputs\nwith and without our attributions. Our theoretical and empirical analyses\nreveal that attention modules in deeper transformer blocks are primarily\nresponsible for memorization, whereas earlier blocks are crucial for the model's\ngeneralization and reasoning capabilities. We validate our findings through\ncomprehensive experiments on different LLM families (Pythia and GPTNeo) and\nfive benchmark datasets. Our insights offer a practical approach to mitigate\nmemorization in LLMs while preserving their performance, contributing to safer\nand more ethical deployment in real-world applications.\n","authors":["Tarun Ram Menta","Susmit Agrawal","Chirag Agarwal"],"pdf_url":"https://arxiv.org/pdf/2501.05078v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04517v2","updated":"2025-01-09T09:00:02Z","published":"2025-01-08T14:06:07Z","title":"Histogram-Equalized Quantization for logic-gated Residual Neural\n Networks","summary":" Adjusting the quantization according to the data or to the model loss seems\nmandatory to enable a high accuracy in the context of quantized neural\nnetworks. This work presents Histogram-Equalized Quantization (HEQ), an\nadaptive framework for linear symmetric quantization. HEQ automatically adapts\nthe quantization thresholds using a unique step size optimization. 
We\nempirically show that HEQ achieves state-of-the-art performances on CIFAR-10.\nExperiments on the STL-10 dataset even show that HEQ enables a proper training\nof our proposed logic-gated (OR, MUX) residual networks with a higher accuracy\nat a lower hardware complexity than previous work.\n","authors":["Van Thien Nguyen","William Guicquero","Gilles Sicard"],"pdf_url":"https://arxiv.org/pdf/2501.04517v2.pdf","comment":"Published at IEEE ISCAS 2022"},{"id":"http://arxiv.org/abs/2501.05076v1","updated":"2025-01-09T08:59:23Z","published":"2025-01-09T08:59:23Z","title":"TipSegNet: Fingertip Segmentation in Contactless Fingerprint Imaging","summary":" Contactless fingerprint recognition systems offer a hygienic, user-friendly,\nand efficient alternative to traditional contact-based methods. However, their\naccuracy heavily relies on precise fingertip detection and segmentation,\nparticularly under challenging background conditions. This paper introduces\nTipSegNet, a novel deep learning model that achieves state-of-the-art\nperformance in segmenting fingertips directly from grayscale hand images.\nTipSegNet leverages a ResNeXt-101 backbone for robust feature extraction,\ncombined with a Feature Pyramid Network (FPN) for multi-scale representation,\nenabling accurate segmentation across varying finger poses and image qualities.\nFurthermore, we employ an extensive data augmentation strategy to enhance the\nmodel's generalizability and robustness. TipSegNet outperforms existing\nmethods, achieving a mean Intersection over Union (mIoU) of 0.987 and an\naccuracy of 0.999, representing a significant advancement in contactless\nfingerprint segmentation. 
This enhanced accuracy has the potential to\nsubstantially improve the reliability and effectiveness of contactless\nbiometric systems in real-world applications.\n","authors":["Laurenz Ruzicka","Bernhard Kohn","Clemens Heitzinger"],"pdf_url":"https://arxiv.org/pdf/2501.05076v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05075v1","updated":"2025-01-09T08:59:14Z","published":"2025-01-09T08:59:14Z","title":"A Text-Based Knowledge-Embedded Soft Sensing Modeling Approach for\n General Industrial Process Tasks Based on Large Language Model","summary":" Data-driven soft sensors (DDSS) have become mainstream methods for predicting\nkey performance indicators in process industries. However, DDSS development\nrequires complex and costly customized designs tailored to various tasks during\nthe modeling process. Moreover, DDSS are constrained to a single structured\ndata modality, limiting their ability to incorporate additional contextual\nknowledge. Furthermore, DDSSs' limited representation learning leads to weak\npredictive performance with scarce data. To address these challenges, we\npropose a general framework named LLM-TKESS (large language model for\ntext-based knowledge-embedded soft sensing), harnessing the powerful general\nproblem-solving capabilities, cross-modal knowledge transfer abilities, and\nfew-shot capabilities of LLM for enhanced soft sensing modeling. Specifically,\nan auxiliary variable series encoder (AVS Encoder) is proposed to unleash LLM's\npotential for capturing temporal relationships within series and spatial\nsemantic relationships among auxiliary variables. Then, we propose a two-stage\nfine-tuning alignment strategy: in the first stage, employing\nparameter-efficient fine-tuning through autoregressive training adjusts LLM to\nrapidly accommodate process variable data, resulting in a soft sensing\nfoundation model (SSFM). 
Subsequently, by training adapters, we adapt the SSFM\nto various downstream tasks without modifying its architecture. Then, we\npropose two text-based knowledge-embedded soft sensors, integrating new natural\nlanguage modalities to overcome the limitations of pure structured data models.\nFurthermore, benefiting from LLM's pre-existing world knowledge, our model\ndemonstrates outstanding predictive capabilities in small sample conditions.\nUsing the thermal deformation of air preheater rotor as a case study, we\nvalidate through extensive experiments that LLM-TKESS exhibits outstanding\nperformance.\n","authors":["Shuo Tong","Han Liu","Runyuan Guo","Xueqiong Tian","Wenqing Wang","Ding Liu","Youmin Zhang"],"pdf_url":"https://arxiv.org/pdf/2501.05075v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05068v1","updated":"2025-01-09T08:44:06Z","published":"2025-01-09T08:44:06Z","title":"D3RM: A Discrete Denoising Diffusion Refinement Model for Piano\n Transcription","summary":" Diffusion models have been widely used in the generative domain due to their\nconvincing performance in modeling complex data distributions. Moreover, they\nhave shown competitive results on discriminative tasks, such as image\nsegmentation. While diffusion models have also been explored for automatic\nmusic transcription, their performance has yet to reach a competitive level. In\nthis paper, we focus on discrete diffusion model's refinement capabilities and\npresent a novel architecture for piano transcription. Our model utilizes\nNeighborhood Attention layers as the denoising module, gradually predicting the\ntarget high-resolution piano roll, conditioned on the finetuned features of a\npretrained acoustic model. To further enhance refinement, we devise a novel\nstrategy which applies distinct transition states during training and inference\nstage of discrete diffusion models. 
Experiments on the MAESTRO dataset show\nthat our approach outperforms previous diffusion-based piano transcription\nmodels and the baseline model in terms of F1 score. Our code is available at\nhttps://github.com/hanshounsu/d3rm.\n","authors":["Hounsu Kim","Taegyun Kwon","Juhan Nam"],"pdf_url":"https://arxiv.org/pdf/2501.05068v1.pdf","comment":"Accepted to ICASSP 2025"},{"id":"http://arxiv.org/abs/2501.04614v2","updated":"2025-01-09T08:42:56Z","published":"2025-01-08T16:53:56Z","title":"MedCoDi-M: A Multi-Prompt Foundation Model for Multimodal Medical Data\n Generation","summary":" Artificial Intelligence is revolutionizing medical practice, enhancing\ndiagnostic accuracy and healthcare delivery. However, its adaptation in medical\nsettings still faces significant challenges, related to data availability and\nprivacy constraints. Synthetic data has emerged as a promising solution to\nmitigate these issues, addressing data scarcity while preserving privacy.\nRecently, Latent Diffusion Models have emerged as a powerful tool for\ngenerating high-quality synthetic data. Meanwhile, the integration of different\nmodalities has gained interest, emphasizing the need for models capable of\nhandling multimodal medical data. Existing approaches struggle to integrate\ncomplementary information and lack the ability to generate modalities\nsimultaneously. To address this challenge, we present MedCoDi-M, a\n6.77-billion-parameter model, designed for multimodal medical data generation,\nthat, following the Foundation Model paradigm, exploits contrastive learning and\na large quantity of data to build a shared latent space which captures the\nrelationships between different data modalities. Further, we introduce the\nMulti-Prompt training technique, which significantly boosts MedCoDi-M's\ngeneration under different settings. 
We extensively validate MedCoDi-M: first\nwe benchmark it against five competitors on the MIMIC-CXR dataset, a\nstate-of-the-art dataset for Chest X-ray and radiological report generation.\nSecondly, we perform a Visual Turing Test with expert radiologists to assess\nthe realism and clinical relevance of the generated data, ensuring alignment\nwith real-world scenarios. Finally, we assess the utility of MedCoDi-M in\naddressing key challenges in the medical field, such as anonymization, data\nscarcity and imbalanced learning. The results are promising, demonstrating the\napplicability of MedCoDi-M in medical contexts. Project page is at\nhttps://cosbidev.github.io/MedCoDi-M/.\n","authors":["Daniele Molino","Francesco Di Feola","Eliodoro Faiella","Deborah Fazzini","Domiziana Santucci","Linlin Shen","Valerio Guarrasi","Paolo Soda"],"pdf_url":"https://arxiv.org/pdf/2501.04614v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05058v1","updated":"2025-01-09T08:28:31Z","published":"2025-01-09T08:28:31Z","title":"Simultaneous emulation and downscaling with physically-consistent deep\n learning-based regional ocean emulators","summary":" Building on top of the success in AI-based atmospheric emulation, we propose\nan AI-based ocean emulation and downscaling framework focusing on the\nhigh-resolution regional ocean over the Gulf of Mexico. Regional ocean emulation\npresents unique challenges owing to the complex bathymetry and lateral boundary\nconditions as well as to fundamental biases in deep learning-based\nframeworks, such as instability and hallucinations. In this paper, we develop a\ndeep learning-based framework to autoregressively integrate ocean-surface\nvariables over the Gulf of Mexico at $8$ Km spatial resolution without\nunphysical drifts over decadal time scales and simultaneously downscale and\nbias-correct it to $4$ Km resolution using a physics-constrained generative\nmodel. 
The framework shows both short-term skill and accurate long-term\nstatistics in terms of mean and variability.\n","authors":["Leonard Lupin-Jimenez","Moein Darman","Subhashis Hazarika","Tianning Wu","Michael Gray","Ruyoing He","Anthony Wong","Ashesh Chattopadhyay"],"pdf_url":"https://arxiv.org/pdf/2501.05058v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05057v1","updated":"2025-01-09T08:28:16Z","published":"2025-01-09T08:28:16Z","title":"LearningFlow: Automated Policy Learning Workflow for Urban Driving with\n Large Language Models","summary":" Recent advancements in reinforcement learning (RL) demonstrate\nsignificant potential in autonomous driving. Despite this promise, challenges\nsuch as the manual design of reward functions and low sample efficiency in\ncomplex environments continue to impede the development of safe and effective\ndriving policies. To tackle these issues, we introduce LearningFlow, an\ninnovative automated policy learning workflow tailored to urban driving. This\nframework leverages the collaboration of multiple large language model (LLM)\nagents throughout the RL training process. LearningFlow includes a curriculum\nsequence generation process and a reward generation process, which work in\ntandem to guide the RL policy by generating tailored training curricula and\nreward functions. Particularly, each process is supported by an analysis agent\nthat evaluates training progress and provides critical insights to the\ngeneration agent. Through the collaborative efforts of these LLM agents,\nLearningFlow automates policy learning across a series of complex driving\ntasks, and it significantly reduces the reliance on manual reward function\ndesign while enhancing sample efficiency. 
Comprehensive experiments are\nconducted in the high-fidelity CARLA simulator, along with comparisons with\nother existing methods, to demonstrate the efficacy of our proposed approach.\nThe results demonstrate that LearningFlow excels in generating rewards and\ncurricula. It also achieves superior performance and robust generalization\nacross various driving tasks, as well as commendable adaptation to different RL\nalgorithms.\n","authors":["Zengqi Peng","Yubin Wang","Xu Han","Lei Zheng","Jun Ma"],"pdf_url":"https://arxiv.org/pdf/2501.05057v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.02945v2","updated":"2025-01-09T08:26:17Z","published":"2025-01-06T11:38:19Z","title":"The Tabular Foundation Model TabPFN Outperforms Specialized Time Series\n Forecasting Models Based on Simple Features","summary":" Foundation models have become popular in forecasting due to their ability to\nmake accurate predictions, even with minimal fine-tuning on specific datasets.\nIn this paper, we demonstrate how the newly released regression variant of\nTabPFN, a general tabular foundation model, can be applied to time series\nforecasting. We propose a straightforward approach, TabPFN-TS, which pairs\nTabPFN with simple feature engineering to achieve strong forecasting\nperformance. Despite its simplicity and with only 11M parameters, TabPFN-TS\noutperforms Chronos-Mini, a model of similar size, and matches or even slightly\noutperforms Chronos-Large, which has 65-fold more parameters. 
A key strength of\nour method lies in its reliance solely on artificial data during pre-training,\navoiding the need for large training datasets and eliminating the risk of\nbenchmark contamination.\n","authors":["Shi Bin Hoo","Samuel Müller","David Salinas","Frank Hutter"],"pdf_url":"https://arxiv.org/pdf/2501.02945v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17908v2","updated":"2025-01-09T08:17:23Z","published":"2024-12-23T19:04:46Z","title":"Trading Devil RL: Backdoor attack via Stock market, Bayesian\n Optimization and Reinforcement Learning","summary":" With the rapid development of generative artificial intelligence,\nparticularly large language models, a number of sub-fields of deep learning\nhave made significant progress and are now very useful in everyday\napplications. For example, well-known financial institutions simulate a wide\nrange of scenarios for various models created by their research teams using\nreinforcement learning, both before production and after regular operations. In\nthis work, we propose a backdoor attack that focuses solely on data poisoning.\nThis particular backdoor attack is classified as an attack without prior\nconsideration or trigger, and we name it FinanceLLMsBackRL. 
Our aim is to\nexamine the potential effects of large language models that use reinforcement\nlearning systems for text production or speech recognition, finance, physics,\nor the ecosystem of contemporary artificial intelligence models.\n","authors":["Orson Mengara"],"pdf_url":"https://arxiv.org/pdf/2412.17908v2.pdf","comment":"End of data poisoning research!: Navier-Stokes equations (3D;\n update); Reinforcement Learning (RL); HFT (High Frequency Trading); Limit\n Order Markets and backdoor attack detection"},{"id":"http://arxiv.org/abs/2306.09202v3","updated":"2025-01-09T08:14:06Z","published":"2023-06-15T15:37:31Z","title":"A Fast Algorithm for the Real-Valued Combinatorial Pure Exploration of\n Multi-Armed Bandit","summary":" We study the real-valued combinatorial pure exploration problem in the\nstochastic multi-armed bandit (R-CPE-MAB). We study the case where the size of\nthe action set is polynomial with respect to the number of arms. In such a\ncase, the R-CPE-MAB can be seen as a special case of the so-called transductive\nlinear bandits. We introduce an algorithm named the combinatorial gap-based\nexploration (CombGapE) algorithm, whose sample complexity upper bound matches\nthe lower bound up to a problem-dependent constant factor. We numerically show\nthat the CombGapE algorithm outperforms existing methods significantly on both\nsynthetic and real-world datasets.\n","authors":["Shintaro Nakamura","Masashi Sugiyama"],"pdf_url":"https://arxiv.org/pdf/2306.09202v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.04710v3","updated":"2025-01-09T07:58:13Z","published":"2024-04-06T19:07:12Z","title":"Exploiting the geometry of heterogeneous networks: A case study of the\n Indian stock market","summary":" In this study, we model the Indian stock market as a heterogeneous scale-free\nnetwork, which is then embedded in a two-dimensional hyperbolic space through a\nmachine learning-based technique called coalescent embedding. 
This allows us\nto apply the hyperbolic k-means algorithm on the Poincare disc, and the clusters\nso obtained resemble the original network communities more closely than the\nclusters obtained via Euclidean k-means, on the basis of the well-known measures\nnormalised mutual information and adjusted mutual information. Through this, we\nare able to clearly distinguish between periods of market stability and\nvolatility by applying non-parametric statistical tests with a significance\nlevel of 0.05 to geometric measures, namely hyperbolic distance and hyperbolic\nshortest path distance. After that, we are able to spot significant market\nchanges early by leveraging Bollinger Band analysis on the time series of\nmodularity in the embedded networks of each window. Finally, the radial\ndistance and the Equidistance Angular coordinates help in visualizing the\nembedded network in the Poincare disc, and it is seen that specific market\nsectors cluster together.\n","authors":["Pawanesh Pawanesh","Charu Sharma","Niteesh Sahni"],"pdf_url":"https://arxiv.org/pdf/2404.04710v3.pdf","comment":"39 pages, 11 figures"},{"id":"http://arxiv.org/abs/2501.05037v1","updated":"2025-01-09T07:51:14Z","published":"2025-01-09T07:51:14Z","title":"LongViTU: Instruction Tuning for Long-Form Video Understanding","summary":" This paper introduces LongViTU, a large-scale (~121k QA pairs, ~900h videos),\nautomatically generated dataset for long-form video understanding. We developed\na systematic approach that organizes videos into a hierarchical tree structure\nand incorporates self-revision mechanisms to ensure high-quality QA pairs. Each\nQA pair in LongViTU features: 1) long-term context (average certificate length\nof 4.6 minutes); 2) rich knowledge and condensed reasoning (commonsense,\ncausality, planning, etc.); and 3) explicit timestamp labels for relevant\nevents. LongViTU also serves as a benchmark for instruction following in\nlong-form and streaming video understanding. 
We evaluate the open-source\nstate-of-the-art long video understanding model, LongVU, and the commercial\nmodel, Gemini-1.5-Pro, on our benchmark. They achieve GPT-4 scores of 49.9 and\n52.3, respectively, underscoring the substantial challenge posed by our\nbenchmark. Further supervised fine-tuning (SFT) on LongVU led to performance\nimprovements of 12.0% on our benchmark, 2.2% on the in-distribution (ID)\nbenchmark EgoSchema, 1.0%, 2.2% and 1.2% on the out-of-distribution (OOD)\nbenchmarks VideoMME (Long), WorldQA and OpenEQA, respectively. These outcomes\ndemonstrate LongViTU's high data quality and robust OOD generalizability.\n","authors":["Rujie Wu","Xiaojian Ma","Hai Ci","Yue Fan","Yuxuan Wang","Haozhe Zhao","Qing Li","Yizhou Wang"],"pdf_url":"https://arxiv.org/pdf/2501.05037v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05034v1","updated":"2025-01-09T07:49:37Z","published":"2025-01-09T07:49:37Z","title":"Towards Fingerprint Mosaicking Artifact Detection: A Self-Supervised\n Deep Learning Approach","summary":" Fingerprint mosaicking, which is the process of combining multiple\nfingerprint images into a single master fingerprint, is an essential process in\nmodern biometric systems. However, it is prone to errors that can significantly\ndegrade fingerprint image quality. This paper proposes a novel deep\nlearning-based approach to detect and score mosaicking artifacts in fingerprint\nimages. Our method leverages a self-supervised learning framework to train a\nmodel on large-scale unlabeled fingerprint data, eliminating the need for\nmanual artifact annotation. The proposed model effectively identifies\nmosaicking errors, achieving high accuracy on various fingerprint modalities,\nincluding contactless, rolled, and pressed fingerprints and furthermore proves\nto be robust to different data sources. Additionally, we introduce a novel\nmosaicking artifact score to quantify the severity of errors, enabling\nautomated evaluation of fingerprint images. 
By addressing the challenges of\nmosaicking artifact detection, our work contributes to improving the accuracy\nand reliability of fingerprint-based biometric systems.\n","authors":["Laurenz Ruzicka","Alexander Spenke","Stephan Bergmann","Gerd Nolden","Bernhard Kohn","Clemens Heitzinger"],"pdf_url":"https://arxiv.org/pdf/2501.05034v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05031v1","updated":"2025-01-09T07:43:49Z","published":"2025-01-09T07:43:49Z","title":"ECBench: Can Multi-modal Foundation Models Understand the Egocentric\n World? A Holistic Embodied Cognition Benchmark","summary":" The enhancement of generalization in robots by large vision-language models\n(LVLMs) is increasingly evident. Therefore, the embodied cognitive abilities of\nLVLMs based on egocentric videos are of great interest. However, current\ndatasets for embodied video question answering lack comprehensive and\nsystematic evaluation frameworks. Critical embodied cognitive issues, such as\nrobotic self-cognition, dynamic scene perception, and hallucination, are rarely\naddressed. To tackle these challenges, we propose ECBench, a high-quality\nbenchmark designed to systematically evaluate the embodied cognitive abilities\nof LVLMs. ECBench features a diverse range of scene video sources, open and\nvaried question formats, and 30 dimensions of embodied cognition. To ensure\nquality, balance, and high visual dependence, ECBench uses class-independent\nmeticulous human annotation and multi-round question screening strategies.\nAdditionally, we introduce ECEval, a comprehensive evaluation system that\nensures the fairness and rationality of the indicators. Utilizing ECBench, we\nconduct extensive evaluations of proprietary, open-source, and task-specific\nLVLMs. ECBench is pivotal in advancing the embodied cognitive capabilities of\nLVLMs, laying a solid foundation for developing reliable core models for\nembodied agents. 
All data and code are available at\nhttps://github.com/Rh-Dang/ECBench.\n","authors":["Ronghao Dang","Yuqian Yuan","Wenqi Zhang","Yifei Xin","Boqiang Zhang","Long Li","Liuyi Wang","Qinyang Zeng","Xin Li","Lidong Bing"],"pdf_url":"https://arxiv.org/pdf/2501.05031v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.09184v3","updated":"2025-01-09T07:42:45Z","published":"2024-01-17T12:50:50Z","title":"A Two-Scale Complexity Measure for Deep Learning Models","summary":" We introduce a novel capacity measure 2sED for statistical models based on\nthe effective dimension. The new quantity provably bounds the generalization\nerror under mild assumptions on the model. Furthermore, simulations on standard\ndata sets and popular model architectures show that 2sED correlates well with\nthe training error. For Markovian models, we show how to efficiently\napproximate 2sED from below through a layerwise iterative approach, which\nallows us to tackle deep learning models with a large number of parameters.\nSimulation results suggest that the approximation is good for different\nprominent models and data sets.\n","authors":["Massimiliano Datres","Gian Paolo Leonardi","Alessio Figalli","David Sutter"],"pdf_url":"https://arxiv.org/pdf/2401.09184v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.06764v3","updated":"2025-01-09T07:39:30Z","published":"2023-08-13T13:01:21Z","title":"Few-shot Class-incremental Learning for Classification and Object\n Detection: A Survey","summary":" Few-shot Class-Incremental Learning (FSCIL) presents a unique challenge in\nMachine Learning (ML), as it necessitates the Incremental Learning (IL) of new\nclasses from sparsely labeled training samples without forgetting previous\nknowledge. While this field has seen recent progress, it remains an active\nexploration area. This paper aims to provide a comprehensive and systematic\nreview of FSCIL. 
In our in-depth examination, we delve into various facets of\nFSCIL, encompassing the problem definition, the discussion of the primary\nchallenges of unreliable empirical risk minimization and the\nstability-plasticity dilemma, general schemes, and relevant problems of IL and\nFew-shot Learning (FSL). Besides, we offer an overview of benchmark datasets\nand evaluation metrics. Furthermore, we introduce the Few-shot\nClass-incremental Classification (FSCIC) methods from data-based,\nstructure-based, and optimization-based approaches and the Few-shot\nClass-incremental Object Detection (FSCIOD) methods from anchor-free and\nanchor-based approaches. Beyond these, we present several promising research\ndirections within FSCIL that merit further investigation.\n","authors":["Jinghua Zhang","Li Liu","Olli Silvén","Matti Pietikäinen","Dewen Hu"],"pdf_url":"https://arxiv.org/pdf/2308.06764v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05015v1","updated":"2025-01-09T07:16:21Z","published":"2025-01-09T07:16:21Z","title":"On Measuring Unnoticeability of Graph Adversarial Attacks: Observations,\n New Measure, and Applications","summary":" Adversarial attacks are allegedly unnoticeable. Prior studies have designed\nattack noticeability measures on graphs, primarily using statistical tests to\ncompare the topology of original and (possibly) attacked graphs. However, we\nobserve two critical limitations in the existing measures. First, because the\nmeasures rely on simple rules, attackers can readily enhance their attacks to\nbypass them, reducing their attack \"noticeability\" and, yet, maintaining their\nattack performance. Second, because the measures naively leverage global\nstatistics, such as degree distributions, they may entirely overlook attacks\nuntil severe perturbations occur, letting the attacks be almost \"totally\nunnoticeable.\" To address the limitations, we introduce HideNSeek, a learnable\nmeasure for graph attack noticeability. 
First, to mitigate the bypass problem,\nHideNSeek learns to distinguish the original and (potential) attack edges using\na learnable edge scorer (LEO), which scores each edge on its likelihood of\nbeing an attack. Second, to mitigate the overlooking problem, HideNSeek\nconducts imbalance-aware aggregation of all the edge scores to obtain the final\nnoticeability score. Using six real-world graphs, we empirically demonstrate\nthat HideNSeek effectively alleviates the observed limitations, and LEO (i.e.,\nour learnable edge scorer) outperforms eleven competitors in distinguishing\nattack edges under five different attack methods. For an additional\napplication, we show that LEO boosts the performance of robust GNNs by removing\nattack-like edges.\n","authors":["Hyeonsoo Jo","Hyunjin Hwang","Fanchen Bu","Soo Yong Lee","Chanyoung Park","Kijung Shin"],"pdf_url":"https://arxiv.org/pdf/2501.05015v1.pdf","comment":"KDD 2025"},{"id":"http://arxiv.org/abs/2501.05014v1","updated":"2025-01-09T07:15:59Z","published":"2025-01-09T07:15:59Z","title":"UAV-VLA: Vision-Language-Action System for Large Scale Aerial Mission\n Generation","summary":" The UAV-VLA (Vision-Language-Action) system is a tool designed to facilitate\ncommunication with aerial robots. By integrating satellite imagery processing\nwith the Visual Language Model (VLM) and the powerful capabilities of GPT,\nUAV-VLA enables users to generate general flight path-and-action plans through\nsimple text requests. This system leverages the rich contextual information\nprovided by satellite images, allowing for enhanced decision-making and mission\nplanning. The combination of visual analysis by the VLM and natural language\nprocessing by GPT can provide the user with the path-and-action set, making\naerial operations more efficient and accessible. 
The newly developed method\nshowed a difference of 22% in the length of the created trajectory and a mean\nerror of 34.22 m (Euclidean distance) in finding the objects of interest on a\nmap using the K-Nearest Neighbors (KNN) approach.\n","authors":["Oleg Sautenkov","Yasheerah Yaqoot","Artem Lykov","Muhammad Ahsan Mustafa","Grik Tadevosyan","Aibek Akhmetkazy","Miguel Altamirano Cabrera","Mikhail Martynov","Sausar Karaf","Dzmitry Tsetserukou"],"pdf_url":"https://arxiv.org/pdf/2501.05014v1.pdf","comment":"HRI 2025"},{"id":"http://arxiv.org/abs/2501.05007v1","updated":"2025-01-09T07:05:22Z","published":"2025-01-09T07:05:22Z","title":"Quantum-enhanced causal discovery for a small number of samples","summary":" The discovery of causal relationships from observed data has attracted\nsignificant interest from disciplines such as economics, social sciences,\nepidemiology, and biology. In practical applications, considerable knowledge of\nthe underlying systems is often unavailable, and real data are often associated\nwith nonlinear causal structures, which make the direct use of most\nconventional causality analysis methods difficult. This study proposes a novel\nquantum Peter-Clark (qPC) algorithm for causal discovery that does not assume\nany underlying model structures. Based on conditional independence tests in\na class of reproducing kernel Hilbert spaces characterized by quantum circuits,\nthe proposed qPC algorithm can explore causal relationships from the observed\ndata drawn from arbitrary distributions. We conducted systematic experiments on\nfundamental graph parts of causal structures, demonstrating that the qPC\nalgorithm exhibits significantly better performance, particularly with\nsmaller sample sizes, compared to its classical counterpart. Furthermore, we\nproposed a novel optimization approach based on Kernel Target Alignment (KTA)\nfor determining hyperparameters of quantum kernels. 
This method effectively\nreduced the risk of false positives in causal discovery, enabling more reliable\ninference. Our theoretical and experimental results demonstrate that the\nproposed quantum algorithm can empower classical algorithms for robust and\naccurate inference in causal discovery, supporting them in regimes where\nclassical algorithms typically fail. Additionally, the effectiveness of this\nmethod was validated using the Boston Housing dataset as a real-world\napplication. These findings demonstrate the new potential of quantum\ncircuit-based causal discovery methods in addressing practical challenges,\nparticularly in small-sample scenarios where traditional approaches have shown\nlimitations.\n","authors":["Yota Maeda","Ken Arai","Yu Tanaka","Yu Terada","Hiroshi Ueno","Hiroyuki Tezuka"],"pdf_url":"https://arxiv.org/pdf/2501.05007v1.pdf","comment":"19 pages, 8 figures"},{"id":"http://arxiv.org/abs/2501.05005v1","updated":"2025-01-09T06:56:47Z","published":"2025-01-09T06:56:47Z","title":"A High-accuracy Calibration Method of Transient TSEPs for Power\n Semiconductor Devices","summary":" The thermal sensitive electrical parameter (TSEP) method is crucial for\nenhancing the reliability of power devices through junction temperature\nmonitoring. The TSEP method comprises three key processes: calibration,\nregression, and application. While significant efforts have been devoted to\nimproving regression algorithms and increasing TSEP sensitivity to enhance\njunction temperature monitoring accuracy, these approaches have reached a\nbottleneck. In reality, the calibration method significantly influences\nmonitoring accuracy, an aspect often overlooked in conventional TSEP methods.\nTo address this issue, we propose a high-accuracy calibration method for\ntransient TSEPs. First, a temperature compensation strategy based on thermal\nanalysis is introduced to mitigate the temperature difference caused by load\ncurrent during dual pulse tests. 
Second, the impact of stray parameters is\nanalyzed to identify coupled parameters, which are typically neglected in\nexisting methods. Third, it is observed that random errors follow a log-Gaussian\ndistribution, covering a hidden variable. A neural network is used to\nobtain the junction temperature predictive model. The proposed calibration\nmethod is experimentally validated using the threshold voltage as an example.\nCompared with conventional calibration methods, the mean absolute error is\nreduced by over 30%. Moreover, this method does not require additional hardware\ncost and has good generalization.\n","authors":["Qinghao Zhang","Wenrui Li","Pinjia Zhang"],"pdf_url":"https://arxiv.org/pdf/2501.05005v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.07204v5","updated":"2025-01-09T06:53:50Z","published":"2024-02-11T13:30:53Z","title":"ITINERA: Integrating Spatial Optimization with Large Language Models for\n Open-domain Urban Itinerary Planning","summary":" Citywalk, a recently popular form of urban travel, requires genuine\npersonalization and understanding of fine-grained requests compared to\ntraditional itinerary planning. In this paper, we introduce the novel task of\nOpen-domain Urban Itinerary Planning (OUIP), which generates personalized urban\nitineraries from user requests in natural language. We then present ITINERA, an\nOUIP system that integrates spatial optimization with large language models to\nprovide customized urban itineraries based on user needs. This involves\ndecomposing user requests, selecting candidate points of interest (POIs),\nordering the POIs based on cluster-aware spatial optimization, and generating\nthe itinerary. Experiments on real-world datasets and the performance of the\ndeployed system demonstrate our system's capacity to deliver personalized and\nspatially coherent itineraries compared to current solutions. 
Source code for\nITINERA is available at https://github.com/YihongT/ITINERA.\n","authors":["Yihong Tang","Zhaokai Wang","Ao Qu","Yihao Yan","Zhaofeng Wu","Dingyi Zhuang","Jushi Kai","Kebing Hou","Xiaotong Guo","Han Zheng","Tiange Luo","Jinhua Zhao","Zhan Zhao","Wei Ma"],"pdf_url":"https://arxiv.org/pdf/2402.07204v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.03670v2","updated":"2025-01-09T06:47:34Z","published":"2024-03-06T12:47:14Z","title":"CDC: A Simple Framework for Complex Data Clustering","summary":" In today's data-driven digital era, both the amount and the complexity (e.g.,\nmulti-view, non-Euclidean, and multi-relational) of the collected data are\ngrowing exponentially or even faster. Clustering, which extracts valid\nknowledge from data without supervision, is extremely useful in practice.\nHowever, existing methods are independently developed to handle one particular\nchallenge at the expense of the others. In this work, we propose a simple but\neffective framework for complex data clustering (CDC) that can efficiently\nprocess different types of data with linear complexity. We first utilize graph\nfiltering to fuse geometric structure and attribute information. We then reduce\nthe complexity with high-quality anchors that are adaptively learned via a\nnovel similarity-preserving regularizer. We illustrate the cluster-ability of\nour proposed method theoretically and experimentally. 
In particular, we deploy\nCDC to graph data of size 111M.\n","authors":["Zhao Kang","Xuanting Xie","Bingheng Li","Erlin Pan"],"pdf_url":"https://arxiv.org/pdf/2403.03670v2.pdf","comment":"Accepted by TNNLS"},{"id":"http://arxiv.org/abs/2408.10517v5","updated":"2025-01-09T06:41:46Z","published":"2024-08-20T03:35:28Z","title":"Integrating Multi-Modal Input Token Mixer Into Mamba-Based Decision\n Models: Decision MetaMamba","summary":" Sequence modeling with State Space models (SSMs) has demonstrated performance\nsurpassing that of Transformers in various tasks, raising expectations for\ntheir potential to outperform the Decision Transformer and its enhanced\nvariants in offline reinforcement learning (RL). However, decision models based\non Mamba, a state-of-the-art SSM, failed to achieve superior performance\ncompared to these enhanced Decision Transformers. We hypothesize that this\nlimitation arises from information loss during the selective scanning phase. To\naddress this, we propose the Decision MetaMamba (DMM), which augments Mamba\nwith a token mixer in its input layer. This mixer explicitly accounts for the\nmultimodal nature of offline RL inputs, comprising state, action, and\nreturn-to-go. The DMM demonstrates improved performance while significantly\nreducing parameter count compared to prior models. Notably, similar performance\ngains were achieved using a simple linear token mixer, emphasizing the\nimportance of preserving information from proximate time steps rather than the\nspecific design of the token mixer itself. This novel modification to Mamba's\ninput layer represents a departure from conventional timestamp-based encoding\napproaches used in Transformers. 
By enhancing performance of Mamba in offline\nRL, characterized by memory efficiency and fast inference, this work opens new\navenues for its broader application in future RL research.\n","authors":["Wall Kim"],"pdf_url":"https://arxiv.org/pdf/2408.10517v5.pdf","comment":"We have decided to withdraw this manuscript as we believe that the\n work requires significant improvements and further research to ensure its\n quality and impact. We are currently pursuing a more comprehensive approach\n to address the limitations of the current submission and plan to resubmit an\n improved version in the future"},{"id":"http://arxiv.org/abs/2408.16030v2","updated":"2025-01-09T06:33:24Z","published":"2024-08-28T09:30:20Z","title":"Deep Learning-Based Automatic Multi-Level Airway Collapse Monitoring on\n Obstructive Sleep Apnea Patients","summary":" This study investigated the use of deep learning to identify multi-level\nupper airway collapses in obstructive sleep apnea (OSA) patients based on\nsnoring sounds. We fine-tuned ResNet-50 and Audio Spectrogram Transformer\n(AST) models using snoring recordings from 37 subjects undergoing drug-induced\nsleep endoscopy (DISE) between 2020 and 2021. Snoring sounds were labeled\naccording to the VOTE (Velum, Oropharynx, Tongue Base, Epiglottis)\nclassification, resulting in 259 V, 403 O, 77 T, 13 E, 1016 VO, 46 VT, 140 OT,\n39 OE, 30 VOT, and 3150 non-snoring (N) 0.5-second clips. The models were\ntrained for two multi-label classification tasks: identifying obstructions at\nV, O, T, and E levels, and identifying retropalatal (RP) and retroglossal (RG)\nobstructions. Results showed AST slightly outperformed ResNet-50,\ndemonstrating good ability to identify V (F1-score: 0.71, MCC: 0.61, AUC:\n0.89), O (F1-score: 0.80, MCC: 0.72, AUC: 0.94), and RP obstructions (F1-score:\n0.86, MCC: 0.77, AUC: 0.97). However, both models struggled with T, E, and RG\nclassifications due to limited data. 
Retrospective analysis of a full-night\nrecording showed the potential to profile airway obstruction dynamics. We\nexpect this information, combined with polysomnography and other clinical\nparameters, can aid clinical triage and treatment planning for OSA patients.\n","authors":["Ying-Chieh Hsu","Stanley Yung-Chuan Liu","Chao-Jung Huang","Chi-Wei Wu","Ren-Kai Cheng","Jane Yung-Jen Hsu","Shang-Ran Huang","Yuan-Ren Cheng","Fu-Shun Hsu"],"pdf_url":"https://arxiv.org/pdf/2408.16030v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05000v1","updated":"2025-01-09T06:29:50Z","published":"2025-01-09T06:29:50Z","title":"Load Forecasting for Households and Energy Communities: Are Deep\n Learning Models Worth the Effort?","summary":" Accurate load forecasting is crucial for predictive control in many energy\ndomain applications, with significant economic and ecological implications. To\naddress these implications, this study provides an extensive benchmark of\nstate-of-the-art deep learning models for short-term load forecasting in energy\ncommunities. Namely, LSTM, xLSTM, and Transformers are compared with benchmarks\nsuch as KNNs, synthetic load models, and persistence forecasting models. This\ncomparison considers different scales of aggregation (e.g., number of household\nloads) and varying training data availability (e.g., training data time spans).\nFurther, the impact of transfer learning from synthetic (standard) load\nprofiles and the deep learning model size (i.e., parameter count) is\ninvestigated in terms of forecasting error. Implementations are publicly\navailable and other researchers are encouraged to benchmark models using this\nframework. Additionally, a comprehensive case study, comprising an energy\ncommunity of 50 households and a battery storage demonstrates the beneficial\nfinancial implications of accurate predictions. 
Key findings of this research\ninclude: (1) Simple persistence benchmarks outperform deep learning models for\nshort-term load forecasting when the available training data is limited to six\nmonths or less; (2) Pretraining with publicly available synthetic load profiles\nimproves the normalized Mean Absolute Error (nMAE) by an average of 1.28%pt\nduring the first nine months of training data; (3) Increased aggregation\nsignificantly enhances the performance of deep learning models relative to\npersistence benchmarks; (4) Improved load forecasting, with an nMAE reduction\nof 1.1%pt, translates to an economic benefit of approximately 600EUR per year\nin an energy community comprising 50 households.\n","authors":["Lukas Moosbrugger","Valentin Seiler","Philipp Wohlgenannt","Sebastian Hegenbart","Sashko Ristov","Peter Kepplinger"],"pdf_url":"https://arxiv.org/pdf/2501.05000v1.pdf","comment":"This preprint was submitted to the Elsevier journal Energy and AI on\n December 18, 2024"},{"id":"http://arxiv.org/abs/2501.04997v1","updated":"2025-01-09T06:26:28Z","published":"2025-01-09T06:26:28Z","title":"GiNet: Integrating Sequential and Context-Aware Learning for Battery\n Capacity Prediction","summary":" The surging demand for batteries requires advanced battery management\nsystems, where battery capacity modelling is a key functionality. In this\npaper, we aim to achieve accurate battery capacity prediction by learning from\nhistorical measurements of battery dynamics. We propose GiNet, a gated\nrecurrent units enhanced Informer network, for predicting battery's capacity.\nThe novelty and competitiveness of GiNet lies in its capability of capturing\nsequential and contextual information from raw battery data and reflecting the\nbattery's complex behaviors with both temporal dynamics and long-term\ndependencies. 
We conducted an experimental study based on a publicly available\ndataset to showcase GiNet's strength of gaining a holistic understanding of\nbattery behavior and predicting battery capacity accurately. GiNet achieves\n0.11 mean absolute error for predicting the battery capacity in a sequence of\nfuture time slots without knowing the historical battery capacity. It also\noutperforms the latest algorithms significantly with 27% error reduction on\naverage compared to Informer. The promising results highlight the importance of\ncustomized and optimized integration of algorithm and battery knowledge and\nshed light on other industry applications as well.\n","authors":["Sara Sameer","Wei Zhang","Xin Lou","Qingyu Yan","Terence Goh","Yulin Gao"],"pdf_url":"https://arxiv.org/pdf/2501.04997v1.pdf","comment":"6 pages"},{"id":"http://arxiv.org/abs/2412.05144v2","updated":"2025-01-09T06:18:12Z","published":"2024-12-06T16:00:50Z","title":"Effective Rank and the Staircase Phenomenon: New Insights into Neural\n Network Training Dynamics","summary":" In recent years, deep learning, powered by neural networks, has achieved\nwidespread success in solving high-dimensional problems, particularly those\nwith low-dimensional feature structures. This success stems from their ability\nto identify and learn low dimensional features tailored to the problems.\nUnderstanding how neural networks extract such features during training\ndynamics remains a fundamental question in deep learning theory. In this work,\nwe propose a novel perspective by interpreting the neurons in the last hidden\nlayer of a neural network as basis functions that represent essential features.\nTo explore the linear independence of these basis functions throughout the deep\nlearning dynamics, we introduce the concept of 'effective rank'. 
Our extensive\nnumerical experiments reveal a notable phenomenon: the effective rank increases\nprogressively during the learning process, exhibiting a staircase-like pattern,\nwhile the loss function concurrently decreases as the effective rank rises. We\nrefer to this observation as the 'staircase phenomenon'. Specifically, for deep\nneural networks, we rigorously prove the negative correlation between the loss\nfunction and effective rank, demonstrating that the lower bound of the loss\nfunction decreases with increasing effective rank. Therefore, to achieve a\nrapid descent of the loss function, it is critical to promote the swift growth\nof effective rank. Ultimately, we evaluate existing advanced learning\nmethodologies and find that these approaches can quickly achieve a higher\neffective rank, thereby avoiding redundant staircase processes and accelerating\nthe rapid decline of the loss function.\n","authors":["Jiang Yang","Yuxiang Zhao","Quanhui Zhu"],"pdf_url":"https://arxiv.org/pdf/2412.05144v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.03389v3","updated":"2025-01-09T06:11:32Z","published":"2023-11-04T04:54:17Z","title":"Learning Disentangled Speech Representations","summary":" Disentangled representation learning in speech processing has lagged behind\nother domains, largely due to the lack of datasets with annotated generative\nfactors for robust evaluation. To address this, we propose SynSpeech, a novel\nlarge-scale synthetic speech dataset specifically designed to enable research\non disentangled speech representations. 
SynSpeech includes controlled\nvariations in speaker identity, spoken text, and speaking style, with three\ndataset versions to support experimentation at different levels of complexity.\n In this study, we present a comprehensive framework to evaluate disentangled\nrepresentation learning techniques, applying both linear probing and\nestablished supervised disentanglement metrics to assess the modularity,\ncompactness, and informativeness of the representations learned by a\nstate-of-the-art model. Using the RAVE model as a test case, we find that\nSynSpeech facilitates benchmarking across a range of factors, achieving\npromising disentanglement of simpler features like gender and speaking style,\nwhile highlighting challenges in isolating complex attributes like speaker\nidentity. This benchmark dataset and evaluation framework fills a critical gap,\nsupporting the development of more robust and interpretable speech\nrepresentation learning methods.\n","authors":["Yusuf Brima","Ulf Krumnack","Simone Pika","Gunther Heidemann"],"pdf_url":"https://arxiv.org/pdf/2311.03389v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04982v1","updated":"2025-01-09T05:45:03Z","published":"2025-01-09T05:45:03Z","title":"CuRLA: Curriculum Learning Based Deep Reinforcement Learning for\n Autonomous Driving","summary":" In autonomous driving, traditional Computer Vision (CV) agents often struggle\nin unfamiliar situations due to biases in the training data. Deep Reinforcement\nLearning (DRL) agents address this by learning from experience and maximizing\nrewards, which helps them adapt to dynamic environments. However, ensuring\ntheir generalization remains challenging, especially with static training\nenvironments. Additionally, DRL models lack transparency, making it difficult\nto guarantee safety in all scenarios, particularly those not seen during\ntraining. To tackle these issues, we propose a method that combines DRL with\nCurriculum Learning for autonomous driving. 
Our approach uses a Proximal Policy\nOptimization (PPO) agent and a Variational Autoencoder (VAE) to learn safe\ndriving in the CARLA simulator. The agent is trained using two-fold curriculum\nlearning, progressively increasing environment difficulty and incorporating a\ncollision penalty in the reward function to promote safety. This method\nimproves the agent's adaptability and reliability in complex environments, and\nhelps it understand the nuances of balancing multiple reward components from\ndifferent feedback signals in a single scalar reward function. Keywords:\nComputer Vision, Deep Reinforcement Learning, Variational Autoencoder, Proximal\nPolicy Optimization, Curriculum Learning, Autonomous Driving.\n","authors":["Bhargava Uppuluri","Anjel Patel","Neil Mehta","Sridhar Kamath","Pratyush Chakraborty"],"pdf_url":"https://arxiv.org/pdf/2501.04982v1.pdf","comment":"To be published in the 17th International Conference on Agents and\n Artificial Intelligence (ICAART), Feb 2025"},{"id":"http://arxiv.org/abs/2402.08948v3","updated":"2025-01-09T05:24:57Z","published":"2024-02-14T05:34:24Z","title":"Mean-Field Analysis for Learning Subspace-Sparse Polynomials with\n Gaussian Input","summary":" In this work, we study the mean-field flow for learning subspace-sparse\npolynomials using stochastic gradient descent and two-layer neural networks,\nwhere the input distribution is standard Gaussian and the output only depends\non the projection of the input onto a low-dimensional subspace. We establish a\nnecessary condition for SGD-learnability, involving both the characteristics of\nthe target function and the expressiveness of the activation function. 
In\naddition, we prove that the condition is almost sufficient, in the sense that a\ncondition slightly stronger than the necessary condition can guarantee the\nexponential decay of the loss functional to zero.\n","authors":["Ziang Chen","Rong Ge"],"pdf_url":"https://arxiv.org/pdf/2402.08948v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.01145v5","updated":"2025-01-09T05:14:56Z","published":"2024-01-02T10:55:01Z","title":"HAAQI-Net: A Non-intrusive Neural Music Audio Quality Assessment Model\n for Hearing Aids","summary":" This paper introduces HAAQI-Net, a non-intrusive deep learning-based music\naudio quality assessment model for hearing aid users. Unlike traditional\nmethods like the Hearing Aid Audio Quality Index (HAAQI) that require intrusive\nreference signal comparisons, HAAQI-Net offers a more accessible and\ncomputationally efficient alternative. By utilizing a Bidirectional Long\nShort-Term Memory (BLSTM) architecture with attention mechanisms and features\nextracted from the pre-trained BEATs model, it can predict HAAQI scores\ndirectly from music audio clips and hearing loss patterns. Experimental results\ndemonstrate HAAQI-Net's effectiveness, achieving a Linear Correlation\nCoefficient (LCC) of 0.9368, a Spearman's Rank Correlation Coefficient (SRCC)\nof 0.9486, and a Mean Squared Error (MSE) of 0.0064, while inference time is\nsignificantly reduced from 62.52 to 2.54 seconds. To address computational\noverhead, a knowledge distillation strategy was applied, reducing parameters by\n75.85% and inference time by 96.46%, while maintaining strong performance (LCC:\n0.9071, SRCC: 0.9307, MSE: 0.0091). To expand its capabilities, HAAQI-Net\nwas adapted to predict subjective human scores like the Mean Opinion Score\n(MOS) through fine-tuning. This adaptation significantly improved prediction\naccuracy, validated through statistical analysis. 
Furthermore, the robustness\nof HAAQI-Net was evaluated under varying Sound Pressure Level (SPL) conditions,\nrevealing optimal performance at a reference SPL of 65 dB, with accuracy\ngradually decreasing as SPL deviated from this point. The advancements in\nsubjective score prediction, SPL robustness, and computational efficiency\nposition HAAQI-Net as a scalable solution for music audio quality assessment in\nhearing aid applications, contributing to efficient and accurate models in\naudio signal processing and hearing aid technology.\n","authors":["Dyah A. M. G. Wisnu","Stefano Rini","Ryandhimas E. Zezario","Hsin-Min Wang","Yu Tsao"],"pdf_url":"https://arxiv.org/pdf/2401.01145v5.pdf","comment":"Accepted by IEEE/ACM Transactions on Audio, Speech, and Language\n Processing (TASLP), 2025"},{"id":"http://arxiv.org/abs/2501.04971v1","updated":"2025-01-09T05:02:50Z","published":"2025-01-09T05:02:50Z","title":"Self-Adaptive Ising Machines for Constrained Optimization","summary":" Ising machines (IM) are physics-inspired alternatives to von Neumann\narchitectures for solving hard optimization tasks. By mapping binary variables\nto coupled Ising spins, IMs can naturally solve unconstrained combinatorial\noptimization problems such as finding maximum cuts in graphs. However, despite\ntheir importance in practical applications, constrained problems remain\nchallenging to solve for IMs that require large quadratic energy penalties to\nensure the correspondence between energy ground states and constrained optimal\nsolutions. To relax this requirement, we propose a self-adaptive IM that\niteratively shapes its energy landscape using a Lagrange relaxation of\nconstraints and avoids prior tuning of penalties. Using a probabilistic-bit\n(p-bit) IM emulated in software, we benchmark our algorithm with\nmultidimensional knapsack problems (MKP) and quadratic knapsack problems (QKP),\nthe latter being an Ising problem with linear constraints. 
For QKP with 300\nvariables, the proposed algorithm finds better solutions than state-of-the-art\nIMs such as Fujitsu's Digital Annealer and requires 7,500x fewer samples. Our\nresults show that adapting the energy landscape during the search can speed up\nIMs for constrained optimization.\n","authors":["Corentin Delacour"],"pdf_url":"https://arxiv.org/pdf/2501.04971v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04970v1","updated":"2025-01-09T04:59:15Z","published":"2025-01-09T04:59:15Z","title":"Battling the Non-stationarity in Time Series Forecasting via Test-time\n Adaptation","summary":" Deep Neural Networks have spearheaded remarkable advancements in time series\nforecasting (TSF), one of the major tasks in time series modeling. Nonetheless,\nthe non-stationarity of time series undermines the reliability of pre-trained\nsource time series forecasters in mission-critical deployment settings. In this\nstudy, we introduce a pioneering test-time adaptation framework tailored for\nTSF (TSF-TTA). TAFAS, the proposed approach to TSF-TTA, flexibly adapts source\nforecasters to continuously shifting test distributions while preserving the\ncore semantic information learned during pre-training. The novel utilization of\npartially-observed ground truth and gated calibration module enables proactive,\nrobust, and model-agnostic adaptation of source forecasters. Experiments on\ndiverse benchmark datasets and cutting-edge architectures demonstrate the\nefficacy and generality of TAFAS, especially in long-term forecasting scenarios\nthat suffer from significant distribution shifts. 
The code is available at\nhttps://github.com/kimanki/TAFAS.\n","authors":["HyunGi Kim","Siwon Kim","Jisoo Mok","Sungroh Yoon"],"pdf_url":"https://arxiv.org/pdf/2501.04970v1.pdf","comment":"Accepted at AAAI 2025"},{"id":"http://arxiv.org/abs/2501.04967v1","updated":"2025-01-09T04:41:50Z","published":"2025-01-09T04:41:50Z","title":"Targeted Adversarial Denoising Autoencoders (TADA) for Neural Time\n Series Filtration","summary":" Current machine learning (ML)-based algorithms for filtering\nelectroencephalography (EEG) time series data face challenges related to\ncumbersome training times, regularization, and accurate reconstruction. To\naddress these shortcomings, we present an ML filtration algorithm driven by a\nlogistic covariance-targeted adversarial denoising autoencoder (TADA). We\nhypothesize that the expressivity of a targeted, correlation-driven\nconvolutional autoencoder will enable effective time series filtration while\nminimizing compute requirements (e.g., runtime, model size). Furthermore, we\nexpect that adversarial training with covariance rescaling will minimize signal\ndegradation. To test this hypothesis, a TADA system prototype was trained and\nevaluated on the task of removing electromyographic (EMG) noise from EEG data\nin the EEGdenoiseNet dataset, which includes EMG and EEG data from 67 subjects.\nThe TADA filter surpasses conventional signal filtration algorithms across\nquantitative metrics (Correlation Coefficient, Temporal RRMSE, Spectral RRMSE),\nand performs competitively against other deep learning architectures at a\nreduced model size of less than 400,000 trainable parameters. Further\nexperimentation will be necessary to assess the viability of TADA on a wider\nrange of deployment cases.\n","authors":["Benjamin J. Choi","Griffin Milsap","Clara A. 
Scholl","Francesco Tenore","Mattson Ogg"],"pdf_url":"https://arxiv.org/pdf/2501.04967v1.pdf","comment":"[Accepted] Artificial Intelligence for Time Series Analysis (AI4TS):\n Theory, Algorithms, and Applications @ AAAI 2025, Philadelphia, PA, USA"},{"id":"http://arxiv.org/abs/2501.04276v2","updated":"2025-01-09T04:26:27Z","published":"2025-01-08T04:54:28Z","title":"Bridging Adaptivity and Safety: Learning Agile Collision-Free Locomotion\n Across Varied Physics","summary":" Real-world legged locomotion systems often need to reconcile agility and\nsafety for different scenarios. Moreover, the underlying dynamics are often\nunknown and time-variant (e.g., payload, friction). In this paper, we introduce\nBAS (Bridging Adaptivity and Safety), which builds upon the pipeline of prior\nwork Agile But Safe (ABS)(He et al.) and is designed to provide adaptive safety\neven in dynamic environments with uncertainties. BAS involves an agile policy\nto avoid obstacles rapidly and a recovery policy to prevent collisions, a\nphysical parameter estimator that is concurrently trained with the agile\npolicy, and a learned control-theoretic RA (reach-avoid) value network that\ngoverns the policy switch. Also, the agile policy and RA network are both\nconditioned on physical parameters to make them adaptive. To mitigate the\ndistribution shift issue, we further introduce an on-policy fine-tuning phase\nfor the estimator to enhance its robustness and accuracy. The simulation\nresults show that BAS achieves 50% better safety than baselines in dynamic\nenvironments while maintaining a higher speed on average. In real-world\nexperiments, BAS shows its capability in complex environments with unknown\nphysics (e.g., slippery floors with unknown frictions, unknown payloads up to\n8kg), while baselines lack adaptivity, leading to collisions or degraded\nagility. As a result, BAS achieves a 19.8% increase in speed and a 2.36 times\nlower collision rate than ABS in the real world. 
Videos: https://adaptive-safe-locomotion.github.io.\n","authors":["Yichao Zhong","Chong Zhang","Tairan He","Guanya Shi"],"pdf_url":"https://arxiv.org/pdf/2501.04276v2.pdf","comment":"11 Pages, 6 Figures"},{"id":"http://arxiv.org/abs/2501.04961v1","updated":"2025-01-09T04:26:15Z","published":"2025-01-09T04:26:15Z","title":"Demystifying Domain-adaptive Post-training for Financial LLMs","summary":" Domain-adaptive post-training of large language models (LLMs) has emerged as\na promising approach for specialized domains such as medicine and finance.\nHowever, significant challenges remain in identifying optimal adaptation\ncriteria and training strategies across varying data and model configurations.\nTo address these challenges, we introduce FINDAP, a systematic and fine-grained\ninvestigation into domain-adaptive post-training of LLMs for the finance\ndomain. Our approach begins by identifying the core capabilities required for\nthe target domain and designing a comprehensive evaluation suite aligned with\nthese needs. We then analyze the effectiveness of key post-training stages,\nincluding continual pretraining, instruction tuning, and preference alignment.\nBuilding on these insights, we propose an effective training recipe centered on\na novel preference data distillation method, which leverages process signals\nfrom a generative reward model. The resulting model, Llama-Fin, achieves\nstate-of-the-art performance across a wide range of financial tasks. Our\nanalysis also highlights how each post-training stage contributes to distinct\ncapabilities, uncovering specific challenges and effective solutions, providing\nvaluable insights for domain adaptation of LLMs. 
Project page:\nhttps://github.com/SalesforceAIResearch/FinDap\n","authors":["Zixuan Ke","Yifei Ming","Xuan-Phi Nguyen","Caiming Xiong","Shafiq Joty"],"pdf_url":"https://arxiv.org/pdf/2501.04961v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.07074v2","updated":"2025-01-09T04:25:14Z","published":"2024-10-09T17:19:12Z","title":"Let's Ask GNN: Empowering Large Language Model for Graph In-Context\n Learning","summary":" Textual Attributed Graphs (TAGs) are crucial for modeling complex real-world\nsystems, yet leveraging large language models (LLMs) for TAGs presents unique\nchallenges due to the gap between sequential text processing and\ngraph-structured data. We introduce AskGNN, a novel approach that bridges this\ngap by leveraging In-Context Learning (ICL) to integrate graph data and\ntask-specific information into LLMs. AskGNN employs a Graph Neural Network\n(GNN)-powered structure-enhanced retriever to select labeled nodes across\ngraphs, incorporating complex graph structures and their supervision signals.\nOur learning-to-retrieve algorithm optimizes the retriever to select example\nnodes that maximize LLM performance on graphs. Experiments across three tasks\nand seven LLMs demonstrate AskGNN's superior effectiveness in graph task\nperformance, opening new avenues for applying LLMs to graph-structured data\nwithout extensive fine-tuning.\n","authors":["Zhengyu Hu","Yichuan Li","Zhengyu Chen","Jingang Wang","Han Liu","Kyumin Lee","Kaize Ding"],"pdf_url":"https://arxiv.org/pdf/2410.07074v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.09346v2","updated":"2025-01-09T04:20:34Z","published":"2023-11-15T20:09:29Z","title":"Nothing Stands Still: A Spatiotemporal Benchmark on 3D Point Cloud\n Registration Under Large Geometric and Temporal Change","summary":" Building 3D geometric maps of man-made spaces is a well-established and\nactive field that is fundamental to computer vision and robotics. 
However,\nconsidering the evolving nature of built environments, it is essential to\nquestion the capabilities of current mapping efforts in handling temporal\nchanges. In addition, spatiotemporal mapping holds significant potential for\nachieving sustainability and circularity goals. Existing mapping approaches\nfocus on small changes, such as object relocation or self-driving car\noperation; in all cases where the main structure of the scene remains fixed.\nConsequently, these approaches fail to address more radical changes in the\nstructure of the built environment, such as geometry and topology. To this end,\nwe introduce the Nothing Stands Still (NSS) benchmark, which focuses on the\nspatiotemporal registration of 3D scenes undergoing large spatial and temporal\nchange, ultimately creating one coherent spatiotemporal map. Specifically, the\nbenchmark involves registering two or more partial 3D point clouds (fragments)\nfrom the same scene but captured from different spatiotemporal views. In\naddition to the standard pairwise registration, we assess the multi-way\nregistration of multiple fragments that belong to any temporal stage. As part\nof NSS, we introduce a dataset of 3D point clouds recurrently captured in\nlarge-scale building indoor environments that are under construction or\nrenovation. The NSS benchmark presents three scenarios of increasing\ndifficulty, to quantify the generalization ability of point cloud registration\nmethods over space (within one building and across buildings) and time. We\nconduct extensive evaluations of state-of-the-art methods on NSS. The results\ndemonstrate the necessity for novel methods specifically designed to handle\nlarge spatiotemporal changes. 
The homepage of our benchmark is at\nhttp://nothing-stands-still.com.\n","authors":["Tao Sun","Yan Hao","Shengyu Huang","Silvio Savarese","Konrad Schindler","Marc Pollefeys","Iro Armeni"],"pdf_url":"https://arxiv.org/pdf/2311.09346v2.pdf","comment":"To appear in the ISPRS Journal of Photogrammetry and Remote Sensing.\n 29 pages, 26 figures. For the project page, see\n http://nothing-stands-still.com"},{"id":"http://arxiv.org/abs/2501.04952v1","updated":"2025-01-09T03:59:10Z","published":"2025-01-09T03:59:10Z","title":"Open Problems in Machine Unlearning for AI Safety","summary":" As AI systems become more capable, widely deployed, and increasingly\nautonomous in critical areas such as cybersecurity, biological research, and\nhealthcare, ensuring their safety and alignment with human values is paramount.\nMachine unlearning -- the ability to selectively forget or suppress specific\ntypes of knowledge -- has shown promise for privacy and data removal tasks,\nwhich has been the primary focus of existing research. More recently, its\npotential application to AI safety has gained attention. In this paper, we\nidentify key limitations that prevent unlearning from serving as a\ncomprehensive solution for AI safety, particularly in managing dual-use\nknowledge in sensitive domains like cybersecurity and chemical, biological,\nradiological, and nuclear (CBRN) safety. In these contexts, information can be\nboth beneficial and harmful, and models may combine seemingly harmless\ninformation for harmful purposes -- unlearning this information could strongly\naffect beneficial uses. We provide an overview of inherent constraints and open\nproblems, including the broader side effects of unlearning dangerous knowledge,\nas well as previously unexplored tensions between unlearning and existing\nsafety mechanisms. Finally, we investigate challenges related to evaluation,\nrobustness, and the preservation of safety features during unlearning. 
By\nmapping these limitations and open challenges, we aim to guide future research\ntoward realistic applications of unlearning within a broader AI safety\nframework, acknowledging its limitations and highlighting areas where\nalternative approaches may be required.\n","authors":["Fazl Barez","Tingchen Fu","Ameya Prabhu","Stephen Casper","Amartya Sanyal","Adel Bibi","Aidan O'Gara","Robert Kirk","Ben Bucknall","Tim Fist","Luke Ong","Philip Torr","Kwok-Yan Lam","Robert Trager","David Krueger","Sören Mindermann","José Hernandez-Orallo","Mor Geva","Yarin Gal"],"pdf_url":"https://arxiv.org/pdf/2501.04952v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18696v2","updated":"2025-01-09T03:39:37Z","published":"2024-12-24T22:55:35Z","title":"STITCH: Surface reconstrucTion using Implicit neural representations\n with Topology Constraints and persistent Homology","summary":" We present STITCH, a novel approach for neural implicit surface\nreconstruction of a sparse and irregularly spaced point cloud while enforcing\ntopological constraints (such as having a single connected component). We\ndevelop a new differentiable framework based on persistent homology to\nformulate topological loss terms that enforce the prior of a single 2-manifold\nobject. Our method demonstrates excellent performance in preserving the\ntopology of complex 3D geometries, evident through both visual and empirical\ncomparisons. We supplement this with a theoretical analysis, and provably show\nthat optimizing the loss with stochastic (sub)gradient descent leads to\nconvergence and enables reconstructing shapes with a single connected\ncomponent. 
Our approach showcases the integration of differentiable topological\ndata analysis tools for implicit surface reconstruction.\n","authors":["Anushrut Jignasu","Ethan Herron","Zhanhong Jiang","Soumik Sarkar","Chinmay Hegde","Baskar Ganapathysubramanian","Aditya Balu","Adarsh Krishnamurthy"],"pdf_url":"https://arxiv.org/pdf/2412.18696v2.pdf","comment":"19 pages, 12 figures, 29 tables"},{"id":"http://arxiv.org/abs/2501.04946v1","updated":"2025-01-09T03:36:17Z","published":"2025-01-09T03:36:17Z","title":"Non-asymptotic analysis of the performance of the penalized least\n trimmed squares in sparse models","summary":" The least trimmed squares (LTS) estimator is a renowned robust alternative to\nthe classic least squares estimator and is popular in location, regression,\nmachine learning, and AI literature. Many studies exist on LTS, including its\nrobustness, computation algorithms, extension to non-linear cases, asymptotics,\netc. The LTS has been applied in the penalized regression in a high-dimensional\nreal-data sparse-model setting where dimension $p$ (in thousands) is much\nlarger than sample size $n$ (in tens, or hundreds). In such a practical\nsetting, the sample size $n$ often is the count of sub-population that has a\nspecial attribute (e.g. the count of patients of Alzheimer's, Parkinson's,\nLeukemia, or ALS, etc.) among a population with a finite fixed size N.\nAsymptotic analysis assuming that $n$ tends to infinity is not practically\nconvincing and legitimate in such a scenario. 
A non-asymptotic or finite sample\nanalysis will be more desirable and feasible.\n This article establishes some finite sample (non-asymptotic) error bounds for\nestimating and predicting based on LTS with high probability for the first\ntime.\n","authors":["Yijun Zuo"],"pdf_url":"https://arxiv.org/pdf/2501.04946v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.18729v2","updated":"2025-01-09T03:34:55Z","published":"2024-11-27T20:08:55Z","title":"Multi-Task Model Merging via Adaptive Weight Disentanglement","summary":" Model merging has recently gained attention as an economical and scalable\napproach to incorporate task-specific weights from various tasks into a unified\nmulti-task model. For example, in Task Arithmetic (TA), adding the fine-tuned\nweights of different tasks can enhance the model's performance on those tasks,\nwhile subtracting them leads to task forgetting. Although TA is highly\neffective, interference among tasks still hampers the performance of the merged\nmodel. Existing methods for handling conflicts between tasks generally rely on\nempirical selection, resulting in suboptimal performance. In this paper, we\nintroduce an Adaptive Weight Disentanglement method. We begin by theoretically\nproving that task vectors employed in model merging should be orthogonal to\nminimize interference among tasks. Guided by this insight, we initialize\nredundant vectors such that, when subtracted from the original task vectors,\nthe resulting vectors exhibit increased orthogonality. Additionally, we impose\na norm constraint on the redundant vectors to preserve the performance of the\ntask-specific models. Experimental results demonstrate the effectiveness of our\nproposed technique: it successfully extracts redundant vectors, and after their\nsubtraction, the task vectors not only retain robust performance but also\nachieve superior fusion outcomes. 
Our code is available at\n\\href{https://github.com/FarisXiong/AWD.git}{https://github.com/FarisXiong/AWD.git}.\n","authors":["Feng Xiong","Runxi Cheng","Wang Chen","Zhanqiu Zhang","Yiwen Guo","Chun Yuan","Ruifeng Xu"],"pdf_url":"https://arxiv.org/pdf/2411.18729v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04940v1","updated":"2025-01-09T03:14:03Z","published":"2025-01-09T03:14:03Z","title":"A New Perspective on Privacy Protection in Federated Learning with\n Granular-Ball Computing","summary":" Federated Learning (FL) facilitates collaborative model training while\nprioritizing privacy by avoiding direct data sharing. However, most existing\narticles attempt to address challenges within the model's internal parameters\nand corresponding outputs, while neglecting to solve them at the input level.\nTo address this gap, we propose a novel framework called Granular-Ball\nFederated Learning (GrBFL) for image classification. GrBFL diverges from\ntraditional methods that rely on the finest-grained input data. Instead, it\nsegments images into multiple regions with optimal coarse granularity, which\nare then reconstructed into a graph structure. We designed a two-dimensional\nbinary search segmentation algorithm based on variance constraints for GrBFL,\nwhich effectively removes redundant information while preserving key\nrepresentative features. Extensive theoretical analysis and experiments\ndemonstrate that GrBFL not only safeguards privacy and enhances efficiency but\nalso maintains robust utility, consistently outperforming other\nstate-of-the-art FL methods. 
The code is available at\nhttps://github.com/AIGNLAI/GrBFL.\n","authors":["Guannan Lai","Yihui Feng","Xin Yang","Xiaoyu Deng","Hao Yu","Shuyin Xia","Guoyin Wang","Tianrui Li"],"pdf_url":"https://arxiv.org/pdf/2501.04940v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.01100v2","updated":"2025-01-09T03:12:38Z","published":"2025-01-02T06:49:58Z","title":"Long-range Brain Graph Transformer","summary":" Understanding communication and information processing among brain regions of\ninterest (ROIs) is highly dependent on long-range connectivity, which plays a\ncrucial role in facilitating diverse functional neural integration across the\nentire brain. However, previous studies generally focused on the short-range\ndependencies within brain networks while neglecting the long-range\ndependencies, limiting an integrated understanding of brain-wide communication.\nTo address this limitation, we propose Adaptive Long-range aware TransformER\n(ALTER), a brain graph transformer to capture long-range dependencies between\nbrain ROIs utilizing biased random walk. Specifically, we present a novel\nlong-range aware strategy to explicitly capture long-range dependencies between\nbrain ROIs. By guiding the walker towards the next hop with higher correlation\nvalue, our strategy simulates the real-world brain-wide communication.\nFurthermore, by employing the transformer framework, ALTER adaptively\nintegrates both short- and long-range dependencies between brain ROIs, enabling\nan integrated understanding of multi-level communication across the entire\nbrain. Extensive experiments on ABIDE and ADNI datasets demonstrate that ALTER\nconsistently outperforms generalized state-of-the-art graph learning methods\n(including SAN, Graphormer, GraphTrans, and LRGNN) and other graph learning\nbased brain network analysis methods (including FBNETGEN, BrainNetGNN,\nBrainGNN, and BrainNETTF) in neurological disease diagnosis. 
Cases of\nlong-range dependencies are also presented to further illustrate the\neffectiveness of ALTER. The implementation is available at\nhttps://github.com/yushuowiki/ALTER.\n","authors":["Shuo Yu","Shan Jin","Ming Li","Tabinda Sarwar","Feng Xia"],"pdf_url":"https://arxiv.org/pdf/2501.01100v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2209.11691v5","updated":"2025-01-09T02:35:18Z","published":"2022-09-23T16:11:09Z","title":"Linear Multidimensional Regression with Interactive Fixed-Effects","summary":" This paper studies a linear and additively separable regression model for\nmultidimensional panel data of three or more dimensions with unobserved\ninteractive fixed effects. The main estimator follows a double debias approach,\nand requires two preliminary steps to control unobserved heterogeneity. First,\nthe model is embedded within the standard two-dimensional panel framework and\nrestrictions are formed under which the factor structure methods in Bai (2009)\nlead to consistent estimation of model parameters, but at slow rates of\nconvergence. The second step develops a weighted fixed-effects method that is\nrobust to the multidimensional nature of the problem and achieves the\nparametric rate of consistency. This second step is combined with a double\ndebias procedure for asymptotically normal slope estimates. The methods are\nimplemented to estimate the demand elasticity for beer.\n","authors":["Hugo Freeman"],"pdf_url":"https://arxiv.org/pdf/2209.11691v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04323v2","updated":"2025-01-09T02:33:04Z","published":"2025-01-08T07:47:43Z","title":"Navigating the Designs of Privacy-Preserving Fine-tuning for Large\n Language Models","summary":" Instruction tuning has proven effective in enhancing Large Language Models'\n(LLMs) performance on downstream tasks. 
However, real-world fine-tuning faces\ninherent conflicts between model providers' intellectual property protection,\nclients' data privacy requirements, and tuning costs. While recent approaches\nlike split learning and offsite tuning demonstrate promising architectures for\nprivacy-preserving fine-tuning, there is a gap in systematically addressing the\nmultidimensional trade-offs required for diverse real-world deployments. We\npropose several indicative evaluation metrics to guide design trade-offs for\nprivacy-preserving fine-tuning and a series of example designs, collectively\nnamed GuardedTuning; they result from novel combinations of system\narchitectures with adapted privacy-enhancement methods and emerging computation\ntechniques. Each design represents distinct trade-offs across model utility,\nprivacy guarantees, and costs. Experimental results demonstrate that these\ndesigns protect against data reconstruction attacks while maintaining\ncompetitive fine-tuning performance.\n","authors":["Haonan Shi","Tu Ouyang","An Wang"],"pdf_url":"https://arxiv.org/pdf/2501.04323v2.pdf","comment":"4 pages, 2 figures"},{"id":"http://arxiv.org/abs/2501.04126v2","updated":"2025-01-09T02:20:28Z","published":"2025-01-07T20:12:56Z","title":"Stochastic Process Learning via Operator Flow Matching","summary":" Expanding on neural operators, we propose a novel framework for stochastic\nprocess learning across arbitrary domains. In particular, we develop operator\nflow matching (OFM) for learning stochastic process priors on function spaces.\nOFM provides the probability density of the values of any collection of points\nand enables mathematically tractable functional regression at new points with\nmean and density estimation. Our method outperforms state-of-the-art models in\nstochastic process learning, functional regression, and prior learning.\n","authors":["Yaozhong Shi","Zachary E. 
Ross","Domniki Asimaki","Kamyar Azizzadenesheli"],"pdf_url":"https://arxiv.org/pdf/2501.04126v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04070v2","updated":"2025-01-09T02:20:13Z","published":"2025-01-07T14:57:08Z","title":"More is not always better? Enhancing Many-Shot In-Context Learning with\n Differentiated and Reweighting Objectives","summary":" Large language models (LLMs) excel at few-shot in-context learning (ICL)\nwithout requiring parameter updates. However, as the number of ICL\ndemonstrations increases from a few to many, performance tends to plateau and\neventually decline. We identify two primary causes for this trend: the\nsuboptimal negative log-likelihood (NLL) optimization objective and the\nincremental data noise. To address these issues, we introduce DrICL, a novel\noptimization method that enhances model performance through Differentiated\nLearning and advantage-based Reweighting objectives. Globally, DrICL utilizes\ndifferentiated learning to optimize the NLL objective, ensuring that many-shot\nperformance surpasses zero-shot levels. Locally, it dynamically adjusts the\nweighting of many-shot demonstrations by leveraging cumulative advantages\ninspired by reinforcement learning, thereby improving generalization. This\napproach allows the model to handle varying numbers of shots effectively,\nmitigating the impact of noisy data. Recognizing the lack of multi-task\ndatasets with diverse many-shot distributions, we develop the Many-Shot ICL\nBenchmark (ICL-50)-a large-scale benchmark of 50 tasks that cover shot numbers\nfrom 1 to 350 within sequences of up to 8,000 tokens-for fine-tuning purposes.\nICL-50 facilitates the evaluation of many-shot ICL strategies across seven\nprominent NLP tasks and 50 distinct datasets. Experimental results demonstrate\nthat LLMs enhanced with DrICL achieve significant improvements in many-shot\nsetups across various tasks, including both in-domain and out-of-domain\nscenarios. 
We release the code and benchmark dataset hoping to facilitate\nfurther research in many-shot ICL.\n","authors":["Xiaoqing Zhang","Ang Lv","Yuhan Liu","Flood Sung","Wei Liu","Shuo Shang","Xiuying Chen","Rui Yan"],"pdf_url":"https://arxiv.org/pdf/2501.04070v2.pdf","comment":"13 pages, 8 figures, 11 tables"},{"id":"http://arxiv.org/abs/2501.04916v1","updated":"2025-01-09T02:14:12Z","published":"2025-01-09T02:14:12Z","title":"SpecTf: Transformers Enable Data-Driven Imaging Spectroscopy Cloud\n Detection","summary":" Current and upcoming generations of visible-shortwave infrared (VSWIR)\nimaging spectrometers promise unprecedented capacity to quantify Earth System\nprocesses across the globe. However, reliable cloud screening remains a\nfundamental challenge for these instruments, where traditional spatial and\ntemporal approaches are limited by cloud variability and limited temporal\ncoverage. The Spectroscopic Transformer (SpecTf) addresses these challenges\nwith a spectroscopy-specific deep learning architecture that performs cloud\ndetection using only spectral information (no spatial or temporal data are\nrequired). By treating spectral measurements as sequences rather than image\nchannels, SpecTf learns fundamental physical relationships without relying on\nspatial context. Our experiments demonstrate that SpecTf significantly\noutperforms the current baseline approach implemented for the EMIT instrument,\nand performs comparably with other machine learning methods with orders of\nmagnitude fewer learned parameters. Critically, we demonstrate SpecTf's\ninherent interpretability through its attention mechanism, revealing physically\nmeaningful spectral features the model has learned. 
Finally, we present\nSpecTf's potential for cross-instrument generalization by applying it to a\ndifferent instrument on a different platform without modifications, opening the\ndoor to instrument-agnostic, data-driven algorithms for future imaging\nspectroscopy tasks.\n","authors":["Jake H. Lee","Michael Kiper","David R. Thompson","Philip G. Brodrick"],"pdf_url":"https://arxiv.org/pdf/2501.04916v1.pdf","comment":"23 pages, 5 figures, in review. Code repository:\n https://github.com/emit-sds/SpecTf"},{"id":"http://arxiv.org/abs/2501.04914v1","updated":"2025-01-09T02:10:15Z","published":"2025-01-09T02:10:15Z","title":"From Mesh Completion to AI Designed Crown","summary":" Designing a dental crown is a time-consuming and labor-intensive process. Our\ngoal is to simplify crown design and minimize the tediousness of making manual\nadjustments while still ensuring the highest level of accuracy and consistency.\nTo this end, we present a new end-to-end deep learning approach, coined Dental\nMesh Completion (DMC), to generate a crown mesh conditioned on a point cloud\ncontext. The dental context includes the tooth prepared to receive a crown and\nits surroundings, namely the two adjacent teeth and the three closest teeth in\nthe opposing jaw. We formulate crown generation in terms of completing this\npoint cloud context. A feature extractor first converts the input point cloud\ninto a set of feature vectors that represent local regions in the point cloud.\nThe set of feature vectors is then fed into a transformer to predict a new set\nof feature vectors for the missing region (crown). Subsequently, a point\nreconstruction head, followed by a multi-layer perceptron, is used to predict a\ndense set of points with normals. Finally, a differentiable point-to-mesh layer\nserves to reconstruct the crown surface mesh. We compare our DMC method to a\ngraph-based convolutional neural network which learns to deform a crown mesh\nfrom a generic crown shape to the target geometry. 
Extensive experiments on our\ndataset demonstrate the effectiveness of our method, which attains an average\nof 0.062 Chamfer Distance. The code is available\nat: https://github.com/Golriz-code/DMC.gi\n","authors":["Golriz Hosseinimanesh","Farnoosh Ghadiri","Francois Guibault","Farida Cheriet","Julia Keren"],"pdf_url":"https://arxiv.org/pdf/2501.04914v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04284v2","updated":"2025-01-09T01:58:41Z","published":"2025-01-08T05:15:43Z","title":"ContextMRI: Enhancing Compressed Sensing MRI through Metadata\n Conditioning","summary":" Compressed sensing MRI seeks to accelerate MRI acquisition processes by\nsampling fewer k-space measurements and then reconstructing the missing data\nalgorithmically. The success of these approaches often relies on strong priors\nor learned statistical models. While recent diffusion model-based priors have\nshown great potential, previous methods typically ignore clinically available\nmetadata (e.g. patient demographics, imaging parameters, slice-specific\ninformation). In practice, metadata contains meaningful cues about the anatomy\nand acquisition protocol, suggesting it could further constrain the\nreconstruction problem. In this work, we propose ContextMRI, a text-conditioned\ndiffusion model for MRI that integrates granular metadata into the\nreconstruction process. We train a pixel-space diffusion model directly on\nminimally processed, complex-valued MRI images. During inference, metadata is\nconverted into a structured text prompt and fed to the model via CLIP text\nembeddings. By conditioning the prior on metadata, we unlock more accurate\nreconstructions and show consistent gains across multiple datasets,\nacceleration factors, and undersampling patterns. Our experiments demonstrate\nthat increasing the fidelity of metadata, ranging from slice location and\ncontrast to patient age, sex, and pathology, systematically boosts\nreconstruction performance. 
This work highlights the untapped potential of\nleveraging clinical context for inverse problems and opens a new direction for\nmetadata-driven MRI reconstruction.\n","authors":["Hyungjin Chung","Dohun Lee","Zihui Wu","Byung-Hoon Kim","Katherine L. Bouman","Jong Chul Ye"],"pdf_url":"https://arxiv.org/pdf/2501.04284v2.pdf","comment":"29 pages, 9 figures. Code is available at\n https://github.com/DoHunLee1/ContextMRI"},{"id":"http://arxiv.org/abs/2310.05549v2","updated":"2025-01-09T01:54:34Z","published":"2023-10-09T09:17:52Z","title":"A New Transformation Approach for Uplift Modeling with Binary Outcome","summary":" Uplift modeling has been used effectively in fields such as marketing and\ncustomer retention, to target those customers who are more likely to respond\ndue to the campaign or treatment. Essentially, it is a machine learning\ntechnique that predicts the gain from performing some action with respect to\nnot taking it. A popular class of uplift models is the transformation approach\nthat redefines the target variable with the original treatment indicator. These\ntransformation approaches only need to train and predict the difference in\noutcomes directly. The main drawback of these approaches is that, in general, they\ndo not use the information in the treatment indicator beyond the construction\nof the transformed outcome and are usually not efficient. In this paper, we\ndesign a novel transformed outcome for the case of the binary target variable\nand unlock the full value of the samples with zero outcome. From a practical\nperspective, our new approach is flexible and easy to use. Experimental results\non synthetic and real-world datasets clearly show that our new approach\noutperforms the traditional one. 
At present, our new approach has already been\napplied to precision marketing in a nationwide financial holdings group in\nChina.\n","authors":["Kun Li","Liangshu Zhu"],"pdf_url":"https://arxiv.org/pdf/2310.05549v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04228v2","updated":"2025-01-09T01:35:56Z","published":"2025-01-08T01:59:47Z","title":"Constraints as Rewards: Reinforcement Learning for Robots without Reward\n Functions","summary":" Reinforcement learning has become an essential algorithm for generating\ncomplex robotic behaviors. However, to learn such behaviors, it is necessary to\ndesign a reward function that describes the task, which often consists of\nmultiple objectives that need to be balanced. This tuning process is known as\nreward engineering and typically involves extensive trial-and-error. In this\npaper, to avoid this trial-and-error process, we propose the concept of\nConstraints as Rewards (CaR). CaR formulates the task objective using multiple\nconstraint functions instead of a reward function and solves a reinforcement\nlearning problem with constraints using the Lagrangian method. By adopting this\napproach, different objectives are automatically balanced, because Lagrange\nmultipliers serve as the weights among the objectives. In addition, we\ndemonstrate that constraints, expressed as inequalities, provide an intuitive\ninterpretation of the optimization target designed for the task. 
We apply the\nproposed method to the standing-up motion generation task of a\nsix-wheeled-telescopic-legged robot and demonstrate that the proposed method\nsuccessfully acquires the target behavior, even though it is challenging to\nlearn with manually designed reward functions.\n","authors":["Yu Ishihara","Noriaki Takasugi","Kotaro Kawakami","Masaya Kinoshita","Kazumi Aoyama"],"pdf_url":"https://arxiv.org/pdf/2501.04228v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04903v1","updated":"2025-01-09T01:31:30Z","published":"2025-01-09T01:31:30Z","title":"Towards understanding the bias in decision trees","summary":" There is a widespread and longstanding belief that machine learning models\nare biased towards the majority (or negative) class when learning from\nimbalanced data, leading them to neglect or ignore the minority (or positive)\nclass. In this study, we show that this belief is not necessarily correct for\ndecision trees, and that their bias can actually be in the opposite direction.\nMotivated by a recent simulation study that suggested that decision trees can\nbe biased towards the minority class, our paper aims to reconcile the conflict\nbetween that study and decades of other works. First, we critically evaluate\npast literature on this problem, finding that failing to consider the data\ngenerating process has led to incorrect conclusions about the bias in decision\ntrees. We then prove that, under specific conditions related to the predictors,\ndecision trees fit to purity and trained on a dataset with only one positive\ncase are biased towards the minority class. Finally, we demonstrate that splits\nin a decision tree are also biased when there is more than one positive case.\nOur findings have implications for the use of popular tree-based models, such as\nrandom forests.\n","authors":["Nathan Phelps","Daniel J. Lizotte","Douglas G. 
Woolford"],"pdf_url":"https://arxiv.org/pdf/2501.04903v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04898v1","updated":"2025-01-09T01:22:22Z","published":"2025-01-09T01:22:22Z","title":"Optimality and Adaptivity of Deep Neural Features for Instrumental\n Variable Regression","summary":" We provide a convergence analysis of deep feature instrumental variable\n(DFIV) regression (Xu et al., 2021), a nonparametric approach to IV regression\nusing data-adaptive features learned by deep neural networks in two stages. We\nprove that the DFIV algorithm achieves the minimax optimal learning rate when\nthe target structural function lies in a Besov space. This is shown under\nstandard nonparametric IV assumptions, and an additional smoothness assumption\non the regularity of the conditional distribution of the covariate given the\ninstrument, which controls the difficulty of Stage 1. We further demonstrate\nthat DFIV, as a data-adaptive algorithm, is superior to fixed-feature (kernel\nor sieve) IV methods in two ways. First, when the target function possesses low\nspatial homogeneity (i.e., it has both smooth and spiky/discontinuous regions),\nDFIV still achieves the optimal rate, while fixed-feature methods are shown to\nbe strictly suboptimal. 
Second, comparing with kernel-based two-stage\nregression estimators, DFIV is provably more data efficient in the Stage 1\nsamples.\n","authors":["Juno Kim","Dimitri Meunier","Arthur Gretton","Taiji Suzuki","Zhu Li"],"pdf_url":"https://arxiv.org/pdf/2501.04898v1.pdf","comment":"46 pages, 1 figure, 2 tables"},{"id":"http://arxiv.org/abs/2305.10391v3","updated":"2025-01-09T01:09:42Z","published":"2023-05-17T17:31:20Z","title":"Optimality of Message-Passing Architectures for Sparse Graphs","summary":" We study the node classification problem on feature-decorated graphs in the\nsparse setting, i.e., when the expected degree of a node is $O(1)$ in the\nnumber of nodes, in the fixed-dimensional asymptotic regime, i.e., the\ndimension of the feature data is fixed while the number of nodes is large. Such\ngraphs are typically known to be locally tree-like. We introduce a notion of\nBayes optimality for node classification tasks, called asymptotic local Bayes\noptimality, and compute the optimal classifier according to this criterion for\na fairly general statistical data model with arbitrary distributions of the\nnode features and edge connectivity. The optimal classifier is implementable\nusing a message-passing graph neural network architecture. We then compute the\ngeneralization error of this classifier and compare its performance against\nexisting learning methods theoretically on a well-studied statistical model\nwith naturally identifiable signal-to-noise ratios (SNRs) in the data. We find\nthat the optimal message-passing architecture interpolates between a standard\nMLP in the regime of low graph signal and a typical convolution in the regime\nof high graph signal. 
Furthermore, we prove a corresponding non-asymptotic\nresult.\n","authors":["Aseem Baranwal","Kimon Fountoulakis","Aukosh Jagannath"],"pdf_url":"https://arxiv.org/pdf/2305.10391v3.pdf","comment":"27 pages, 2 figures, published at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2501.04897v1","updated":"2025-01-09T01:03:14Z","published":"2025-01-09T01:03:14Z","title":"Online Continual Learning: A Systematic Literature Review of Approaches,\n Challenges, and Benchmarks","summary":" Online Continual Learning (OCL) is a critical area in machine learning,\nfocusing on enabling models to adapt to evolving data streams in real-time\nwhile addressing challenges such as catastrophic forgetting and the\nstability-plasticity trade-off. This study conducts the first comprehensive\nSystematic Literature Review (SLR) on OCL, analyzing 81 approaches, extracting\nover 1,000 features (specific tasks addressed by these approaches), and\nidentifying more than 500 components (sub-models within approaches, including\nalgorithms and tools). We also review 83 datasets spanning applications like\nimage classification, object detection, and multimodal vision-language tasks.\nOur findings highlight key challenges, including reducing computational\noverhead, developing domain-agnostic solutions, and improving scalability in\nresource-constrained environments. Furthermore, we identify promising\ndirections for future research, such as leveraging self-supervised learning for\nmultimodal and sequential data, designing adaptive memory mechanisms that\nintegrate sparse retrieval and generative replay, and creating efficient\nframeworks for real-world applications with noisy or evolving task boundaries.\nBy providing a rigorous and structured synthesis of the current state of OCL,\nthis review offers a valuable resource for advancing this field and addressing\nits critical challenges and opportunities. 
The complete SLR methodology steps\nand extracted data are publicly available through the provided link:\nhttps://github.com/kiyan-rezaee/\nSystematic-Literature-Review-on-Online-Continual-Learning\n","authors":["Seyed Amir Bidaki","Amir Mohammadkhah","Kiyan Rezaee","Faeze Hassani","Sadegh Eskandari","Maziar Salahi","Mohammad M. Ghassemi"],"pdf_url":"https://arxiv.org/pdf/2501.04897v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04896v1","updated":"2025-01-09T00:50:44Z","published":"2025-01-09T00:50:44Z","title":"Quantifying Itch and its Impact on Sleep Using Machine Learning and\n Radio Signals","summary":" Chronic itch affects 13% of the US population, is highly debilitating, and\nunderlies many medical conditions. A major challenge in clinical care and new\ntherapeutics development is the lack of an objective measure for quantifying\nitch, leading to reliance on subjective measures like patients' self-assessment\nof itch severity. In this paper, we show that a home radio device paired with\nartificial intelligence (AI) can concurrently capture scratching and evaluate\nits impact on sleep quality by analyzing radio signals bouncing in the\nenvironment. The device eliminates the need for wearable sensors or skin\ncontact, enabling monitoring of chronic itch over extended periods at home\nwithout burdening patients or interfering with their skin condition. To\nvalidate the technology, we conducted an observational clinical study of\nchronic pruritus patients, monitored at home for one month using both the radio\ndevice and an infrared camera. Comparing the output of the device to ground\ntruth data from the camera demonstrates its feasibility and accuracy (ROC AUC =\n0.997, sensitivity = 0.825, specificity = 0.997). The results reveal a\nsignificant correlation between scratching and low sleep quality, manifested as\na reduction in sleep efficiency (R = 0.6, p < 0.001) and an increase in sleep\nlatency (R = 0.68, p < 0.001). 
Our study underscores the potential of passive,\nlong-term, at-home monitoring of chronic scratching and its sleep implications,\noffering a valuable tool for both clinical care of chronic itch patients and\npharmaceutical clinical trials.\n","authors":["Michail Ouroutzoglou","Mingmin Zhao","Joshua Hellerstein","Hariharan Rahul","Asima Badic","Brian S. Kim","Dina Katabi"],"pdf_url":"https://arxiv.org/pdf/2501.04896v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04894v1","updated":"2025-01-09T00:35:48Z","published":"2025-01-09T00:35:48Z","title":"A Look into How Machine Learning is Reshaping Engineering Models: the\n Rise of Analysis Paralysis, Optimal yet Infeasible Solutions, and the\n Inevitable Rashomon Paradox","summary":" The widespread acceptance of empirically derived codal provisions and\nequations in civil engineering stands in stark contrast to the skepticism\nfacing machine learning (ML) models, despite their shared statistical\nfoundations. This paper examines this philosophical tension through the lens of\nstructural engineering and explores how integrating ML challenges traditional\nengineering philosophies and professional identities. Recent efforts have\ndocumented how ML enhances predictive accuracy, optimizes designs, and analyzes\ncomplex behaviors. However, one might also raise concerns about the diminishing\nrole of human intuition and the interpretability of algorithms. To showcase\nthis rarely explored front, this paper presents how ML can be successfully\nintegrated into various engineering problems by means of formulation via\ndeduction, induction, and abduction. 
Then, this paper identifies three\nprincipal paradoxes that could arise when adopting ML: analysis paralysis\n(increased prediction accuracy leading to a reduced understanding of physical\nmechanisms), infeasible solutions (optimization resulting in unconventional\ndesigns that challenge engineering intuition), and the Rashomon effect (where\ncontradictions in explainability methods and physics arise). This paper\nconcludes by addressing these paradoxes and arguing the need to rethink\nepistemological shifts in engineering and engineering education and\nmethodologies to harmonize traditional principles with ML.\n","authors":["MZ Naser"],"pdf_url":"https://arxiv.org/pdf/2501.04894v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.14390v3","updated":"2025-01-09T00:17:04Z","published":"2024-11-21T18:24:06Z","title":"Persistent Homology for Structural Characterization in Disordered\n Systems","summary":" We propose a unified framework based on persistent homology (PH) to\ncharacterize both local and global structures in disordered systems. It can\nsimultaneously generate local and global descriptors using the same algorithm\nand data structure, and has been shown to be highly effective and interpretable\nin predicting particle rearrangements and classifying global phases. We also\ndemonstrated that using a single variable enables a linear SVM to achieve\nnearly perfect three-phase classification. Inspired by this discovery, we\ndefine a non-parametric metric, the Separation Index (SI), which not only\nachieves this classification without sacrificing significant performance but\nalso establishes a connection between particle environments and the global\nphase structure. 
Our methods provide an effective framework for understanding\nand analyzing the properties of disordered materials, with broad potential\napplications in materials science and even wider studies of complex systems.\n","authors":["An Wang","Li Zou"],"pdf_url":"https://arxiv.org/pdf/2411.14390v3.pdf","comment":"19 pages, 17 figures"},{"id":"http://arxiv.org/abs/2410.05315v2","updated":"2025-01-09T00:11:59Z","published":"2024-10-05T03:37:07Z","title":"PalmBench: A Comprehensive Benchmark of Compressed Large Language Models\n on Mobile Platforms","summary":" Deploying large language models (LLMs) locally on mobile devices is\nadvantageous in scenarios where transmitting data to remote cloud servers is\neither undesirable due to privacy concerns or impractical due to network\nconnection. Recent advancements (MLC, 2023a; Gerganov, 2023) have facilitated\nthe local deployment of LLMs. However, local deployment also presents\nchallenges, particularly in balancing quality (generative performance),\nlatency, and throughput within the hardware constraints of mobile devices. In\nthis paper, we introduce our lightweight, all-in-one automated benchmarking\nframework that allows users to evaluate LLMs on mobile devices. We provide a\ncomprehensive benchmark of various popular LLMs with different quantization\nconfigurations (both weights and activations) across multiple mobile platforms\nwith varying hardware capabilities. Unlike traditional benchmarks that assess\nfull-scale models on high-end GPU clusters, we focus on evaluating resource\nefficiency (memory and power consumption) and harmful output for compressed\nmodels on mobile devices. 
Our key observations include i) differences in energy\nefficiency and throughput across mobile platforms; ii) the impact of\nquantization on memory usage, GPU execution time, and power consumption;\niii) accuracy and performance degradation of quantized models compared to their\nnon-quantized counterparts; and iv) the frequency of hallucinations and toxic\ncontent generated by compressed LLMs on mobile devices.\n","authors":["Yilong Li","Jingyu Liu","Hao Zhang","M Badri Narayanan","Utkarsh Sharma","Shuai Zhang","Pan Hu","Yijing Zeng","Jayaram Raghuram","Suman Banerjee"],"pdf_url":"https://arxiv.org/pdf/2410.05315v2.pdf","comment":"10 pages"},{"id":"http://arxiv.org/abs/2501.05610v1","updated":"2025-01-09T23:18:38Z","published":"2025-01-09T23:18:38Z","title":"Towards Probabilistic Inference of Human Motor Intentions by Assistive\n Mobile Robots Controlled via a Brain-Computer Interface","summary":" Assistive mobile robots are a transformative technology that helps persons\nwith disabilities regain the ability to move freely. Although autonomous\nwheelchairs significantly reduce user effort, they still require human input to\nallow users to maintain control and adapt to changing environments. Brain\nComputer Interface (BCI) stands out as a highly user-friendly option that does\nnot require physical movement. Current BCI systems can understand whether users\nwant to accelerate or decelerate, but they implement these changes in discrete\nspeed steps rather than allowing for smooth, continuous velocity adjustments.\nThis limitation prevents the systems from mimicking the natural, fluid speed\nchanges seen in human self-paced motion. 
The authors aim to address this\nlimitation by redesigning the perception-action cycle in a BCI controlled\nrobotic system: improving how the robotic agent interprets the user's motion\nintentions (world state) and implementing these actions in a way that better\nreflects natural physical properties of motion, such as inertia and damping.\nThe scope of this paper focuses on the perception aspect. We asked and answered\na normative question \"what computation should the robotic agent carry out to\noptimally perceive incomplete or noisy sensory observations?\" Empirical EEG\ndata were collected, and probabilistic representations that served as world\nstate distributions were learned and evaluated in a Generative Adversarial\nNetwork framework. A ROS framework was established that connected to a\nGazebo environment containing a digital twin of an indoor space and a virtual\nmodel of a robotic wheelchair. Signal processing and statistical analyses were\nimplemented to identify the most discriminative features in the\nspatial-spectral-temporal dimensions, which are then used to construct the\nworld model for the robotic agent to interpret user motion intentions as a\nBayesian observer.\n","authors":["Xiaoshan Zhou","Carol M. Menassa","Vineet R. Kamat"],"pdf_url":"https://arxiv.org/pdf/2501.05610v1.pdf","comment":"10 pages"},{"id":"http://arxiv.org/abs/2501.05605v1","updated":"2025-01-09T22:41:50Z","published":"2025-01-09T22:41:50Z","title":"Advancing Personalized Learning Analysis via an Innovative Domain\n Knowledge Informed Attention-based Knowledge Tracing Method","summary":" Emerging Knowledge Tracing (KT) models, particularly deep learning and\nattention-based Knowledge Tracing, have shown great potential in realizing\npersonalized learning analysis via prediction of students' future performance\nbased on their past interactions. 
The existing methods mainly focus on\nimmediate past interactions or individual concepts without accounting for\ndependencies between knowledge concepts, referred to as knowledge concept\nroutes, which can be critical to advancing the understanding of students'\nlearning outcomes. To address this, in this paper, we propose an innovative\nattention-based method by effectively incorporating the domain knowledge of\nknowledge concept routes in the given curriculum. Additionally, we leverage the\nXES3G5M dataset, a benchmark dataset with rich auxiliary information for\nknowledge concept routes, to evaluate and compare the performance of our\nproposed method to seven state-of-the-art (SOTA) deep learning models.\n","authors":["Shubham Kose","Jin Wei-Kocsis"],"pdf_url":"https://arxiv.org/pdf/2501.05605v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.17428v2","updated":"2025-01-09T22:27:06Z","published":"2024-05-27T17:59:45Z","title":"NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding\n Models","summary":" Decoder-only large language model (LLM)-based embedding models are beginning\nto outperform BERT or T5-based embedding models in general-purpose text\nembedding tasks, including dense vector-based retrieval. In this work, we\nintroduce the NV-Embed model, incorporating architectural designs, training\nprocedures, and curated datasets to significantly enhance the performance of\nLLM as a versatile embedding model, while maintaining its simplicity and\nreproducibility. For model architecture, we propose a latent attention layer to\nobtain pooled embeddings, which consistently improves retrieval and downstream\ntask accuracy compared to mean pooling or using the last token embedding\nfrom LLMs. To enhance representation learning, we remove the causal attention\nmask of LLMs during contrastive training. For the training algorithm, we\nintroduce a two-stage contrastive instruction-tuning method. 
It first applies contrastive\ntraining with instructions on retrieval datasets, utilizing in-batch negatives\nand curated hard negative examples. At stage-2, it blends various non-retrieval\ndatasets into instruction tuning, which not only enhances non-retrieval task\naccuracy but also improves retrieval performance. For training data, we utilize\nhard-negative mining, synthetic data generation, and existing publicly available\ndatasets to boost the performance of the embedding model. By combining these\ntechniques, our NV-Embed-v1 and NV-Embed-v2 models obtained the No.1 position\non the Massive Text Embedding Benchmark (MTEB) (as of May 24, 2024 and August\n30, 2024, respectively) across 56 embedding tasks, demonstrating the sustained\neffectiveness of the proposed methods over time. Additionally, it achieved the\nhighest scores in the Long Doc section and the second-highest scores in the QA\nsection of the AIR Benchmark, which covers a range of out-of-domain information\nretrieval topics beyond those in MTEB.\n","authors":["Chankyu Lee","Rajarshi Roy","Mengyao Xu","Jonathan Raiman","Mohammad Shoeybi","Bryan Catanzaro","Wei Ping"],"pdf_url":"https://arxiv.org/pdf/2405.17428v2.pdf","comment":"We open-source the model at:\n https://huggingface.co/nvidia/NV-Embed-v2"},{"id":"http://arxiv.org/abs/2403.13257v3","updated":"2025-01-09T22:21:56Z","published":"2024-03-20T02:38:01Z","title":"Arcee's MergeKit: A Toolkit for Merging Large Language Models","summary":" The rapid expansion of the open-source language model landscape presents an\nopportunity to merge the competencies of these model checkpoints by combining\ntheir parameters. Advances in transfer learning, the process of fine-tuning\npretrained models for specific tasks, have resulted in the development of vast\namounts of task-specific models, typically specialized in individual tasks and\nunable to utilize each other's strengths. 
Model merging facilitates the\ncreation of multitask models without the need for additional training, offering\na promising avenue for enhancing model performance and versatility. By\npreserving the intrinsic capabilities of the original models, model merging\naddresses complex challenges in AI, including the difficulties of catastrophic\nforgetting and multitask learning. To support this expanding area of research,\nwe introduce MergeKit, a comprehensive, open-source library designed to\nfacilitate the application of model merging strategies. MergeKit offers an\nextensible framework to efficiently merge models on any hardware, providing\nutility to researchers and practitioners. To date, thousands of models have\nbeen merged by the open-source community, leading to the creation of some of\nthe world's most powerful open-source model checkpoints, as assessed by the Open\nLLM Leaderboard. The library is accessible at\nhttps://github.com/arcee-ai/MergeKit.\n","authors":["Charles Goddard","Shamane Siriwardhana","Malikeh Ehghaghi","Luke Meyers","Vlad Karpukhin","Brian Benedict","Mark McQuade","Jacob Solawetz"],"pdf_url":"https://arxiv.org/pdf/2403.13257v3.pdf","comment":"11 pages, 4 figures"},{"id":"http://arxiv.org/abs/2411.13553v2","updated":"2025-01-09T22:17:30Z","published":"2024-11-20T18:59:58Z","title":"AI-generated Image Detection: Passive or Watermark?","summary":" While text-to-image models offer numerous benefits, they also pose\nsignificant societal risks. Detecting AI-generated images is crucial for\nmitigating these risks. Detection methods can be broadly categorized into\npassive and watermark-based approaches: passive detectors rely on artifacts\npresent in AI-generated images, whereas watermark-based detectors proactively\nembed watermarks into such images. A key question is which type of detector\nperforms better in terms of effectiveness, robustness, and efficiency. However,\nthe current literature lacks a comprehensive understanding of this issue. 
In\nthis work, we aim to bridge that gap by developing ImageDetectBench, the first\ncomprehensive benchmark to compare the effectiveness, robustness, and\nefficiency of passive and watermark-based detectors. Our benchmark includes\nfour datasets, each containing a mix of AI-generated and non-AI-generated\nimages. We evaluate five passive detectors and four watermark-based detectors\nagainst eight types of common perturbations and three types of adversarial\nperturbations. Our benchmark results reveal several interesting findings. For\ninstance, watermark-based detectors consistently outperform passive detectors,\nboth in the presence and absence of perturbations. Based on these insights, we\nprovide recommendations for detecting AI-generated images, e.g., when both\ntypes of detectors are applicable, watermark-based detectors should be the\npreferred choice. Our code and data are publicly available at\nhttps://github.com/moyangkuo/ImageDetectBench.git.\n","authors":["Moyang Guo","Yuepeng Hu","Zhengyuan Jiang","Zeyu Li","Amir Sadovnik","Arka Daw","Neil Gong"],"pdf_url":"https://arxiv.org/pdf/2411.13553v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.06687v2","updated":"2025-01-09T22:14:55Z","published":"2024-08-13T07:27:02Z","title":"Masked Image Modeling: A Survey","summary":" In this work, we survey recent studies on masked image modeling (MIM), an\napproach that emerged as a powerful self-supervised learning technique in\ncomputer vision. The MIM task involves masking some information, e.g., pixels,\npatches, or even latent representations, and training a model, usually an\nautoencoder, to predict the missing information by using the context\navailable in the visible part of the input. We identify and formalize two\ncategories of approaches on how to implement MIM as a pretext task, one based\non reconstruction and one based on contrastive learning. Then, we construct a\ntaxonomy and review the most prominent papers in recent years. 
We complement\nthe manually constructed taxonomy with a dendrogram obtained by applying a\nhierarchical clustering algorithm. We further identify relevant clusters by\nmanually inspecting the resulting dendrogram. Our review also includes datasets\nthat are commonly used in MIM research. We aggregate the performance results of\nvarious masked image modeling methods on the most popular datasets, to\nfacilitate the comparison of competing methods. Finally, we identify research\ngaps and propose several interesting directions for future work. We supplement\nour survey with the following public repository containing organized\nreferences: https://github.com/vladhondru25/MIM-Survey.\n","authors":["Vlad Hondru","Florinel Alin Croitoru","Shervin Minaee","Radu Tudor Ionescu","Nicu Sebe"],"pdf_url":"https://arxiv.org/pdf/2408.06687v2.pdf","comment":"Revised version"},{"id":"http://arxiv.org/abs/2310.03696v3","updated":"2025-01-09T22:12:45Z","published":"2023-10-05T17:13:16Z","title":"Function-Space Optimality of Neural Architectures with Multivariate\n Nonlinearities","summary":" We investigate the function-space optimality (specifically, the Banach-space\noptimality) of a large class of shallow neural architectures with multivariate\nnonlinearities/activation functions. To that end, we construct a new family of\nBanach spaces defined via a regularization operator, the $k$-plane transform,\nand a sparsity-promoting norm. We prove a representer theorem that states that\nthe solution sets to learning problems posed over these Banach spaces are\ncompletely characterized by neural architectures with multivariate\nnonlinearities. 
These optimal architectures have skip connections and are\ntightly connected to orthogonal weight normalization and multi-index models,\nboth of which have received recent interest in the neural network community.\nOur framework is compatible with a number of classical nonlinearities including\nthe rectified linear unit (ReLU) activation function, the norm activation\nfunction, and the radial basis functions found in the theory of\nthin-plate/polyharmonic splines. We also show that the underlying spaces are\nspecial instances of reproducing kernel Banach spaces and variation spaces. Our\nresults shed light on the regularity of functions learned by neural networks\ntrained on data, particularly with multivariate nonlinearities, and provide new\ntheoretical motivation for several architectural choices found in practice.\n","authors":["Rahul Parhi","Michael Unser"],"pdf_url":"https://arxiv.org/pdf/2310.03696v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.18731v3","updated":"2025-01-09T22:10:14Z","published":"2024-04-29T14:17:52Z","title":"Real Time Multi Organ Classification on Computed Tomography Images","summary":" Organ segmentation is a fundamental task in medical imaging since it is\nuseful for many clinical automation pipelines. However, some tasks do not\nrequire full segmentation. Instead, a classifier can identify the selected\norgan without segmenting the entire volume. In this study, we demonstrate a\nclassifier-based method to obtain organ labels in real time by using a large\ncontext size with a sparse data sampling strategy. Although our method operates\nas an independent classifier at query locations, it can generate full\nsegmentations by querying grid locations at any resolution, offering faster\nperformance than segmentation algorithms. 
We compared our method with existing\nsegmentation techniques, demonstrating its superior runtime potential for\npractical applications in medical imaging.\n","authors":["Halid Ziya Yerebakan","Yoshihisa Shinagawa","Gerardo Hermosillo Valadez"],"pdf_url":"https://arxiv.org/pdf/2404.18731v3.pdf","comment":"11 pages, Organ Classification, Organ Segmentation"},{"id":"http://arxiv.org/abs/2312.12641v5","updated":"2025-01-09T22:08:44Z","published":"2023-12-19T22:36:37Z","title":"Robust Point Matching with Distance Profiles","summary":" We show the outlier robustness and noise stability of practical matching\nprocedures based on distance profiles. Although the idea of matching points\nbased on invariants like distance profiles has a long history in the\nliterature, there has been little understanding of the theoretical properties\nof such procedures, especially in the presence of outliers and noise. We\nprovide a theoretical analysis showing that under certain probabilistic\nsettings, the proposed matching procedure is successful with high probability\neven in the presence of outliers and noise. We demonstrate the performance of\nthe proposed method using a real data example and provide simulation studies to\ncomplement the theoretical findings. Lastly, we extend the concept of distance\nprofiles to the abstract setting and connect the proposed matching procedure to\nthe Gromov-Wasserstein distance and its lower bound, with a new sample\ncomplexity result derived based on the properties of distance profiles. 
This\npaper contributes to the literature by providing theoretical underpinnings of\nthe matching procedures based on invariants like distance profiles, which have\nbeen widely used in practice but have rarely been analyzed theoretically.\n","authors":["YoonHaeng Hur","Yuehaw Khoo"],"pdf_url":"https://arxiv.org/pdf/2312.12641v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05591v1","updated":"2025-01-09T21:53:03Z","published":"2025-01-09T21:53:03Z","title":"Session-Level Dynamic Ad Load Optimization using Offline Robust\n Reinforcement Learning","summary":" Session-level dynamic ad load optimization aims to personalize the density\nand types of delivered advertisements in real time during a user's online\nsession by dynamically balancing user experience quality and ad monetization.\nTraditional causal learning-based approaches struggle with key technical\nchallenges, especially in handling confounding bias and distribution shifts. In\nthis paper, we develop an offline deep Q-network (DQN)-based framework that\neffectively mitigates confounding bias in dynamic systems and demonstrates more\nthan 80% offline gains compared to the best causal learning-based production\nbaseline. Moreover, to improve the framework's robustness against unanticipated\ndistribution shifts, we further enhance our framework with a novel offline\nrobust dueling DQN approach. This approach achieves more stable rewards on\nmultiple OpenAI-Gym datasets as perturbations increase, and provides an\nadditional 5% offline gains on real-world ad delivery data.\n Deployed across multiple production systems, our approach has achieved\noutsized topline gains. 
Post-launch online A/B tests have shown double-digit\nimprovements in the engagement-ad score trade-off efficiency, significantly\nenhancing our platform's capability to serve both consumers and advertisers.\n","authors":["Tao Liu","Qi Xu","Wei Shi","Zhigang Hua","Shuang Yang"],"pdf_url":"https://arxiv.org/pdf/2501.05591v1.pdf","comment":"Will appear in KDD 2025"},{"id":"http://arxiv.org/abs/2501.05588v1","updated":"2025-01-09T21:45:09Z","published":"2025-01-09T21:45:09Z","title":"Enforcing Fundamental Relations via Adversarial Attacks on Input\n Parameter Correlations","summary":" Correlations between input parameters play a crucial role in many scientific\nclassification tasks, since these are often related to fundamental laws of\nnature. For example, in high energy physics, one of the common deep learning\nuse-cases is the classification of signal and background processes in particle\ncollisions. In many such cases, the fundamental principles of the correlations\nbetween observables are often better understood than the actual distributions\nof the observables themselves. In this work, we present a new adversarial\nattack algorithm called Random Distribution Shuffle Attack (RDSA), emphasizing\nthe correlations between observables in the network rather than individual\nfeature characteristics. Correct application of the proposed novel attack can\nresult in a significant improvement in classification performance,\nparticularly in the context of data augmentation, when using the generated\nadversaries within adversarial training. Correlations between input\nfeatures are also crucial in many other disciplines. 
We demonstrate the\neffectiveness of RDSA on six classification tasks, including two particle collision\nchallenges (using CERN Open Data), hand-written digit recognition (MNIST784),\nhuman activity recognition (HAR), weather forecasting (Rain in Australia), and\nICU patient mortality (MIMIC-IV), demonstrating a general use case beyond\nfundamental physics for this new type of adversarial attack algorithm.\n","authors":["Timo Saala","Lucie Flek","Alexander Jung","Akbar Karimi","Alexander Schmidt","Matthias Schott","Philipp Soldin","Christopher Wiebusch"],"pdf_url":"https://arxiv.org/pdf/2501.05588v1.pdf","comment":"12 pages, 8 figures (Without appendix)"},{"id":"http://arxiv.org/abs/2501.05583v1","updated":"2025-01-09T21:21:06Z","published":"2025-01-09T21:21:06Z","title":"Learned Discrepancy Reconstruction and Benchmark Dataset for Magnetic\n Particle Imaging","summary":" Magnetic Particle Imaging (MPI) is an emerging imaging modality based on the\nmagnetic response of superparamagnetic iron oxide nanoparticles to achieve\nhigh-resolution and real-time imaging without harmful radiation. One key\nchallenge in the MPI image reconstruction task arises from its underlying noise\nmodel, which does not fulfill the implicit Gaussian assumptions that are made\nwhen applying traditional reconstruction approaches. To address this challenge,\nwe introduce the Learned Discrepancy Approach, a novel learning-based\nreconstruction method for inverse problems that includes a learned discrepancy\nfunction. It enhances traditional techniques by incorporating an invertible\nneural network to explicitly model problem-specific noise distributions. This\napproach does not rely on implicit Gaussian noise assumptions, making it\nespecially suited to handle the sophisticated noise model in MPI and also\napplicable to other inverse problems. 
To further advance MPI reconstruction\ntechniques, we introduce the MPI-MNIST dataset, a large collection of\nsimulated MPI measurements derived from the MNIST dataset of handwritten\ndigits. The dataset includes noise-perturbed measurements generated from\nstate-of-the-art model-based system matrices and measurements of a preclinical\nMPI scanner device. This provides a realistic and flexible environment for\nalgorithm testing. Validated against the MPI-MNIST dataset, our method\ndemonstrates significant improvements in reconstruction quality in terms of\nstructural similarity when compared to classical reconstruction techniques.\n","authors":["Meira Iske","Hannes Albers","Tobias Knopp","Tobias Kluth"],"pdf_url":"https://arxiv.org/pdf/2501.05583v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05580v1","updated":"2025-01-09T21:14:25Z","published":"2025-01-09T21:14:25Z","title":"Physics-Driven Learning for Inverse Problems in Quantum Chromodynamics","summary":" The integration of deep learning techniques and physics-driven designs is\nreforming the way we address inverse problems, in which accurate physical\nproperties are extracted from complex data sets. This is particularly relevant\nfor quantum chromodynamics (QCD), the theory of strong interactions, with its\ninherent limitations in observational data and demanding computational\napproaches. This perspective highlights advances and potential of\nphysics-driven learning methods, focusing on predictions of physical quantities\ntowards QCD physics, and drawing connections to machine learning (ML). It is\nshown that the fusion of ML and physics can lead to more efficient and reliable\nproblem-solving strategies. Key ideas of ML, methodology of embedding physics\npriors, and generative models as inverse modelling of physical probability\ndistributions are introduced. Specific applications cover first-principles\nlattice calculations, and QCD physics of hadrons, neutron stars, and heavy-ion\ncollisions. 
These examples provide a structured and concise overview of how\nincorporating prior knowledge such as symmetry, continuity and equations into\ndeep learning designs can address diverse inverse problems across different\nphysical sciences.\n","authors":["Gert Aarts","Kenji Fukushima","Tetsuo Hatsuda","Andreas Ipp","Shuzhe Shi","Lingxiao Wang","Kai Zhou"],"pdf_url":"https://arxiv.org/pdf/2501.05580v1.pdf","comment":"14 pages, 5 figures, submitted version to Nat Rev Phys"},{"id":"http://arxiv.org/abs/2501.05564v1","updated":"2025-01-09T20:19:27Z","published":"2025-01-09T20:19:27Z","title":"Analog Bayesian neural networks are insensitive to the shape of the\n weight distribution","summary":" Recent work has demonstrated that Bayesian neural networks (BNNs) trained\nwith mean field variational inference (MFVI) can be implemented in analog\nhardware, promising orders of magnitude energy savings compared to the standard\ndigital implementations. However, while Gaussians are typically used as the\nvariational distribution in MFVI, it is difficult to precisely control the\nshape of the noise distributions produced by sampling analog devices. This\npaper introduces a method for MFVI training using real device noise as the\nvariational distribution. Furthermore, we demonstrate empirically that the\npredictive distributions from BNNs with the same weight means and variances\nconverge to the same distribution, regardless of the shape of the variational\ndistribution. This result suggests that analog device designers do not need to\nconsider the shape of the device noise distribution when hardware-implementing\nBNNs performing MFVI.\n","authors":["Ravi G. 
Patrick Xiao","Sapan Agarwal","Christopher Bennett"],"pdf_url":"https://arxiv.org/pdf/2501.05564v1.pdf","comment":"Presented at the NeurIPS 2024 Workshop on Machine Learning with New\n Compute Paradigms, https://openreview.net/forum?id=soS5qgU7Yb"},{"id":"http://arxiv.org/abs/2501.05563v1","updated":"2025-01-09T20:19:01Z","published":"2025-01-09T20:19:01Z","title":"Prediction-Assisted Online Distributed Deep Learning Workload Scheduling\n in GPU Clusters","summary":" The recent explosive growth of deep learning (DL) models has created a\ncompelling need for efficient job scheduling for distributed deep learning\ntraining with mixed parallelisms (DDLwMP) in GPU clusters. This paper proposes\nan adaptive shortest-remaining-processing-time-first (A-SRPT) scheduling\nalgorithm, a novel prediction-assisted online scheduling approach designed to\nmitigate the challenges associated with DL cluster scheduling. By modeling each\njob as a graph corresponding to heterogeneous Deep Neural Network (DNN) models\nand their associated distributed training configurations, A-SRPT strategically\nassigns jobs to the available GPUs, thereby minimizing inter-server\ncommunication overhead. Observing that most DDLwMP jobs recur, A-SRPT\nincorporates a random forest regression model to predict training iterations.\nCrucially, A-SRPT maps the complex scheduling problem into a single-machine\ninstance, which is addressed optimally by a preemptive\n\"shortest-remaining-processing-time-first\" strategy. This optimized solution\nserves as a guide for actual job scheduling within the GPU clusters, leading to\na theoretically provable competitive scheduling efficiency. We conduct\nextensive real-world testbed and simulation experiments to verify our proposed\nalgorithms.\n","authors":["Ziyue Luo","Jia Liu","Myungjin Lee","Ness B. 
Shroff"],"pdf_url":"https://arxiv.org/pdf/2501.05563v1.pdf","comment":"INFOCOM 2025"},{"id":"http://arxiv.org/abs/2501.05559v1","updated":"2025-01-09T20:11:08Z","published":"2025-01-09T20:11:08Z","title":"Soup to go: mitigating forgetting during continual learning with model\n averaging","summary":" In continual learning, where task data arrives in a sequence, fine-tuning on\nlater tasks will often lead to performance degradation on earlier tasks. This\nis especially pronounced when these tasks come from diverse domains. In this\nsetting, how can we mitigate catastrophic forgetting of earlier tasks and\nretain what the model has learned with minimal computational expenses? Inspired\nby other merging methods and L2-regression, we propose Sequential Fine-tuning\nwith Averaging (SFA), a method that merges currently training models with\nearlier checkpoints during the course of training. SOTA approaches typically\nmaintain a data buffer of past tasks or impose a penalty at each gradient step.\nIn contrast, our method achieves comparable results without the need to store\npast data or multiple copies of parameters for each gradient step.\nFurthermore, our method outperforms common merging techniques such as Task\nArithmetic, TIES Merging, and WiSE-FT, as well as other penalty methods like L2\nand Elastic Weight Consolidation. 
In turn, our method offers insight into the\nbenefits of merging partially-trained models during training across both image\nand language domains.\n","authors":["Anat Kleiman","Gintare Karolina Dziugaite","Jonathan Frankle","Sham Kakade","Mansheej Paul"],"pdf_url":"https://arxiv.org/pdf/2501.05559v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.13969v2","updated":"2025-01-09T20:08:31Z","published":"2023-08-26T22:48:06Z","title":"Gaze-Informed Vision Transformers: Predicting Driving Decisions Under\n Uncertainty","summary":" Vision Transformers (ViT) have advanced computer vision, yet their efficacy\nin complex tasks like driving remains less explored. This study enhances ViT by\nintegrating human eye gaze, captured via eye-tracking, to increase prediction\naccuracy in driving scenarios under uncertainty in both real-world and virtual\nreality scenarios. First, we establish the significance of human eye gaze in\nleft-right driving decisions, as observed in both human subjects and a ViT\nmodel. By comparing the similarity between human fixation maps and ViT\nattention weights, we reveal the dynamics of overlap across individual heads\nand layers. This overlap demonstrates that fixation data can guide the model in\ndistributing its attention weights more effectively. We introduce the\nfixation-attention intersection (FAX) loss, a novel loss function that\nsignificantly improves ViT performance under high uncertainty conditions. Our\nresults show that ViT, when trained with FAX loss, aligns its attention with\nhuman gaze patterns. 
This gaze-informed approach has significant potential for\ndriver behavior analysis, as well as broader applications in human-centered AI\nsystems, extending ViT's use to complex visual environments.\n","authors":["Sharath Koorathota","Nikolas Papadopoulos","Jia Li Ma","Shruti Kumar","Xiaoxiao Sun","Arunesh Mittal","Patrick Adelman","Paul Sajda"],"pdf_url":"https://arxiv.org/pdf/2308.13969v2.pdf","comment":"25 pages, 9 figures, 3 tables"},{"id":"http://arxiv.org/abs/2501.00190v2","updated":"2025-01-09T20:00:16Z","published":"2024-12-31T00:02:07Z","title":"SepsisCalc: Integrating Clinical Calculators into Early Sepsis\n Prediction via Dynamic Temporal Graph Construction","summary":" Sepsis is an organ dysfunction caused by a dysregulated immune response to an\ninfection. Early sepsis prediction and identification allow for timely\nintervention, leading to improved clinical outcomes. Clinical calculators\n(e.g., the six-organ dysfunction assessment of SOFA) play a vital role in\nsepsis identification within clinicians' workflow, providing evidence-based\nrisk assessments essential for sepsis diagnosis. However, artificial\nintelligence (AI) sepsis prediction models typically generate a single sepsis\nrisk score without incorporating clinical calculators for assessing organ\ndysfunctions, making the models less convincing and transparent to clinicians.\nTo bridge the gap, we propose to mimic clinicians' workflow with a novel\nframework SepsisCalc to integrate clinical calculators into the predictive\nmodel, yielding a clinically transparent and precise model for utilization in\nclinical settings. Practically, clinical calculators usually combine\ninformation from multiple component variables in Electronic Health Records\n(EHR), and might not be applicable when the variables are (partially) missing.\nWe mitigate this issue by representing EHRs as temporal graphs and integrating\na learning module to dynamically add the accurately estimated calculator to the\ngraphs. 
Experimental results on real-world datasets show that the proposed\nmodel outperforms state-of-the-art methods on sepsis prediction tasks.\nMoreover, we developed a system to identify organ dysfunctions and potential\nsepsis risks, providing a human-AI interaction tool for deployment, which can\nhelp clinicians understand the prediction outputs and prepare timely\ninterventions for the corresponding dysfunctions, paving the way for actionable\nclinical decision-making support for early intervention.\n","authors":["Changchang Yin","Shihan Fu","Bingsheng Yao","Thai-Hoang Pham","Weidan Cao","Dakuo Wang","Jeffrey Caterino","Ping Zhang"],"pdf_url":"https://arxiv.org/pdf/2501.00190v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05550v1","updated":"2025-01-09T19:48:51Z","published":"2025-01-09T19:48:51Z","title":"Emergent weight morphologies in deep neural networks","summary":" Whether deep neural networks can exhibit emergent behaviour is not only\nrelevant for understanding how deep learning works; it is also pivotal for\nestimating potential security risks of increasingly capable artificial\nintelligence systems. Here, we show that training deep neural networks gives\nrise to emergent weight morphologies independent of the training data.\nSpecifically, in analogy to condensed matter physics, we derive a theory that\npredicts that the homogeneous state of deep neural networks is unstable in a way\nthat leads to the emergence of periodic channel structures. We verified these\nstructures by performing numerical experiments on a variety of data sets. 
Our\nwork demonstrates emergence in the training of deep neural networks, which\nimpacts the achievable performance of deep neural networks.\n","authors":["Pascal de Jong","Felix Meigel","Steffen Rulands"],"pdf_url":"https://arxiv.org/pdf/2501.05550v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.01919v5","updated":"2025-01-09T19:42:52Z","published":"2024-03-04T10:36:06Z","title":"Randomized Approach to Matrix Completion: Applications in Collaborative\n Filtering and Image Inpainting","summary":" We present a novel method for matrix completion, specifically designed for\nmatrices where one dimension significantly exceeds the other. Our Columns\nSelected Matrix Completion (CSMC) method combines Column Subset Selection and\nLow-Rank Matrix Completion to efficiently reconstruct incomplete datasets. In\neach step, CSMC solves a convex optimization problem. We introduce two\nalgorithms to implement CSMC, each tailored to problems of different sizes. A\nformal analysis is provided, outlining the necessary assumptions and the\nprobability of obtaining a correct solution. To assess the impact of matrix\nsize, rank, and the ratio of missing entries on solution quality and\ncomputation time, we conducted experiments on synthetic data. The method was\nalso applied to two real-world problems: recommendation systems and image\ninpainting. 
Our results show that CSMC provides solutions of the same quality\nas state-of-the-art matrix completion algorithms based on convex optimization,\nwhile achieving significant reductions in computational runtime.\n","authors":["Antonina Krajewska","Ewa Niewiadomska-Szynkiewicz"],"pdf_url":"https://arxiv.org/pdf/2403.01919v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.16865v3","updated":"2025-01-09T19:39:12Z","published":"2024-05-27T06:31:39Z","title":"An Investigation of Conformal Isometry Hypothesis for Grid Cells","summary":" This paper investigates the conformal isometry hypothesis as a potential\nexplanation for hexagonal periodic patterns in grid cell response maps. The\nhypothesis posits that grid cell activity forms a high-dimensional vector in\nneural space, encoding the agent's position in 2D physical space. As the agent\nmoves, this vector rotates within a 2D manifold in the neural space, driven by\na recurrent neural network. The conformal hypothesis suggests that this neural\nmanifold is a conformally isometric embedding of physical space, where local\ndisplacements in neural space are proportional to those in physical space. In\nthis paper, we conduct numerical experiments to show that this hypothesis leads\nto the hexagon periodic patterns of grid cells, agnostic to the choice of\ntransformation models. Furthermore, we present a theoretical understanding that\nhexagon patterns emerge by minimizing our loss function because hexagon flat\ntorus exhibits minimal deviation from local conformal isometry. 
In addition, we\npropose a conformal modulation of the agent's input velocity, enabling the\nrecurrent neural network of grid cells to satisfy the conformal isometry\nhypothesis automatically.\n","authors":["Dehong Xu","Ruiqi Gao","Wen-Hao Zhang","Xue-Xin Wei","Ying Nian Wu"],"pdf_url":"https://arxiv.org/pdf/2405.16865v3.pdf","comment":"arXiv admin note: text overlap with arXiv:2310.19192"},{"id":"http://arxiv.org/abs/2501.05541v1","updated":"2025-01-09T19:27:28Z","published":"2025-01-09T19:27:28Z","title":"NSChat: A Chatbot System To Rule Them All","summary":" The rapid advancement of artificial intelligence has resulted in the advent\nof large language models (LLMs) with the capacity to produce text that closely\nresembles human communication. These models have been seamlessly integrated\ninto diverse applications, enabling interactive and responsive communication\nacross multiple platforms. The potential utility of chatbots transcends these\ntraditional applications, particularly in research contexts, wherein they can\noffer valuable insights and facilitate the design of innovative experiments. In\nthis study, we present NSChat, a web-based chatbot system designed to assist in\nneuroscience research. The system is meticulously designed to function as an\nexperimental instrument rather than a conventional chatbot, requiring users\nto input a username and experiment code upon access. This setup facilitates\nprecise data cross-referencing, thereby augmenting the integrity and\napplicability of the data collected for research purposes. It can be easily\nexpanded to accommodate new basic events as needed, and it allows researchers\nto integrate their own logging events without the necessity of implementing a\nseparate logging mechanism. 
It is worth noting that our system was built primarily to\nassist neuroscience research, but it is not limited to it; it can easily\nbe adapted to assist information retrieval research or interaction with chatbot\nagents in general.\n","authors":["Zenon Lamprou","Yashar Moshfeghi"],"pdf_url":"https://arxiv.org/pdf/2501.05541v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05534v1","updated":"2025-01-09T19:16:41Z","published":"2025-01-09T19:16:41Z","title":"OmniJet-${α_{ C}}$: Learning point cloud calorimeter simulations\n using generative transformers","summary":" We show the first use of generative transformers for generating calorimeter\nshowers as point clouds in a high-granularity calorimeter. Using the tokenizer\nand generative part of the OmniJet-${\alpha}$ model, we represent the hits in\nthe detector as sequences of integers. This model allows variable-length\nsequences, which means that it supports realistic shower development and does\nnot need to be conditioned on the number of hits. Since the tokenization\nrepresents the showers as point clouds, the model learns the geometry of the\nshowers without being restricted to any particular voxel grid.\n","authors":["Joschka Birk","Frank Gaede","Anna Hallin","Gregor Kasieczka","Martina Mozzanica","Henning Rose"],"pdf_url":"https://arxiv.org/pdf/2501.05534v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.08755v2","updated":"2025-01-09T19:15:20Z","published":"2024-12-11T19:54:14Z","title":"Proactive Adversarial Defense: Harnessing Prompt Tuning in\n Vision-Language Models to Detect Unseen Backdoored Images","summary":" Backdoor attacks pose a critical threat by embedding hidden triggers into\ninputs, causing models to misclassify them into target labels. While extensive\nresearch has focused on mitigating these attacks in object recognition models\nthrough weight fine-tuning, much less attention has been given to detecting\nbackdoored samples directly. 
Given the vast datasets used in training, manual\ninspection for backdoor triggers is impractical, and even state-of-the-art\ndefense mechanisms fail to fully neutralize their impact. To address this gap,\nwe introduce a groundbreaking method to detect unseen backdoored images during\nboth training and inference. Leveraging the transformative success of prompt\ntuning in Vision Language Models (VLMs), our approach trains learnable text\nprompts to differentiate clean images from those with hidden backdoor triggers.\nExperiments demonstrate the exceptional efficacy of this method, achieving an\nimpressive average accuracy of 86% across two renowned datasets for detecting\nunseen backdoor triggers, establishing a new standard in backdoor defense.\n","authors":["Kyle Stein","Andrew Arash Mahyari","Guillermo Francia","Eman El-Sheikh"],"pdf_url":"https://arxiv.org/pdf/2412.08755v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.05239v3","updated":"2025-01-09T19:12:14Z","published":"2023-08-09T21:54:34Z","title":"Enhancing Architecture Frameworks by Including Modern Stakeholders and\n their Views/Viewpoints","summary":" Various architecture frameworks for software, systems, and enterprises have\nbeen proposed in the literature. They identified several stakeholders and\ndefined modeling perspectives, architecture viewpoints, and views to frame and\naddress stakeholder concerns. However, the stakeholders with data science and\nMachine Learning (ML) related concerns, such as data scientists and data\nengineers, are yet to be included in existing architecture frameworks. Only\nthis way can we envision a holistic system architecture description of an\nML-enabled system. Note that the ML component behavior and functionalities are\nspecial and should be distinguished from traditional software system behavior\nand functionalities. The main reason is that the actual functionality should be\ninferred from data instead of being specified at design time. 
Additionally, the\nstructural models of ML components, such as ML model architectures, are\ntypically specified using different notations and formalisms from what the\nSoftware Engineering (SE) community uses for software structural models. Yet,\nthese two aspects, namely ML and non-ML, are becoming so intertwined that they\nnecessitate an extension of software architecture frameworks and modeling\npractices toward supporting ML-enabled system architectures. In this paper, we\naddress this gap through an empirical study using an online survey instrument.\nWe surveyed 61 subject matter experts from over 25 organizations in 10\ncountries.\n","authors":["Armin Moin","Atta Badii","Stephan Günnemann","Moharram Challenger"],"pdf_url":"https://arxiv.org/pdf/2308.05239v3.pdf","comment":"ICICT 2025"},{"id":"http://arxiv.org/abs/2501.05530v1","updated":"2025-01-09T19:08:23Z","published":"2025-01-09T19:08:23Z","title":"Outlyingness Scores with Cluster Catch Digraphs","summary":" This paper introduces two novel outlyingness scores (OSs) based on Cluster\nCatch Digraphs (CCDs): Outbound Outlyingness Score (OOS) and Inbound\nOutlyingness Score (IOS). These scores enhance the interpretability of outlier\ndetection results. Both OSs employ graph-, density-, and distribution-based\ntechniques, tailored to high-dimensional data with varying cluster shapes and\nintensities. OOS evaluates the outlyingness of a point relative to its nearest\nneighbors, while IOS assesses the total ``influence\" a point receives from\nothers within its cluster. Both OSs effectively identify global and local\noutliers, invariant to data collinearity. Moreover, IOS is robust to\nmasking problems. With extensive Monte Carlo simulations, we compare the\nperformance of both OSs with CCD-based, traditional, and state-of-the-art\noutlier detection methods. 
Both OSs exhibit substantial overall improvements\nover the CCD-based methods in both artificial and real-world data sets,\nparticularly with IOS, which delivers the best overall performance among all\nthe methods, especially in high-dimensional settings.\n Keywords: Outlier detection, Outlyingness score, Graph-based clustering,\nCluster catch digraphs, High-dimensional data.\n","authors":["Rui Shi","Nedret Billor","Elvan Ceyhan"],"pdf_url":"https://arxiv.org/pdf/2501.05530v1.pdf","comment":"29 pages, 7 figures, 16 tables"},{"id":"http://arxiv.org/abs/2501.05515v1","updated":"2025-01-09T19:00:03Z","published":"2025-01-09T19:00:03Z","title":"Neural Architecture Codesign for Fast Physics Applications","summary":" We develop a pipeline to streamline neural architecture codesign for physics\napplications to reduce the need for ML expertise when designing models for\nnovel tasks. Our method employs neural architecture search and network\ncompression in a two-stage approach to discover hardware efficient models. This\napproach consists of a global search stage that explores a wide range of\narchitectures while considering hardware constraints, followed by a local\nsearch stage that fine-tunes and compresses the most promising candidates. We\nexceed performance on various tasks and show further speedup through model\ncompression techniques such as quantization-aware-training and neural network\npruning. We synthesize the optimal models to high level synthesis code for FPGA\ndeployment with the hls4ml library. Additionally, our hierarchical search space\nprovides greater flexibility in optimization, which can easily extend to other\ntasks and domains. 
We demonstrate this with two case studies: Bragg peak\nfinding in materials science and jet classification in high energy physics,\nachieving models with improved accuracy, smaller latencies, or reduced resource\nutilization relative to the baseline models.\n","authors":["Jason Weitz","Dmitri Demler","Luke McDermott","Nhan Tran","Javier Duarte"],"pdf_url":"https://arxiv.org/pdf/2501.05515v1.pdf","comment":"21 pages, 6 figures"},{"id":"http://arxiv.org/abs/2501.05503v1","updated":"2025-01-09T18:50:47Z","published":"2025-01-09T18:50:47Z","title":"The more polypersonal the better -- a short look on space geometry of\n fine-tuned layers","summary":" The interpretation of deep learning models is a rapidly growing field, with\nparticular interest in language models. There are various approaches to this\ntask, including training simpler models to replicate neural network predictions\nand analyzing the latent space of the model. The latter method allows us to not\nonly identify patterns in the model's decision-making process, but also\nunderstand the features of its internal structure. In this paper, we analyze\nthe changes in the internal representation of the BERT model when it is trained\nwith additional grammatical modules and data containing new grammatical\nstructures (polypersonality). 
We find that adding a single grammatical layer\ncauses the model to separate the new and old grammatical systems within itself,\nimproving the overall performance on perplexity metrics.\n","authors":["Sergei Kudriashov","Veronika Zykova","Angelina Stepanova","Yakov Raskind","Eduard Klyshinsky"],"pdf_url":"https://arxiv.org/pdf/2501.05503v1.pdf","comment":"Neuroinformatics 2024"},{"id":"http://arxiv.org/abs/2501.05502v1","updated":"2025-01-09T18:44:10Z","published":"2025-01-09T18:44:10Z","title":"Shrink the longest: improving latent space isotropy with symplicial\n geometry","summary":" Although transformer-based models have been dominating the field of deep\nlearning, various studies of their embedding space have shown that they suffer\nfrom the \"representation degeneration problem\": embeddings tend to be distributed\nin a narrow cone, making the latent space highly anisotropic. Increasing the\nisotropy has been shown to improve performance in downstream tasks both in static\nand contextual language models. However, most approaches either add\ninference overhead or require a substantial amount of data for model\nreparametrization. We propose a novel regularization technique based on\nsimplicial geometry to improve the isotropy of latent representations. The core\nidea of our method is based on maximizing the persistent entropy of barcodes\nobtained using Vietoris-Rips filtration from contextual embeddings in the\nunderlying latent space. 
We demonstrate that the method leads to an increase in\ndownstream performance while significantly lowering the anisotropy during\nfine-tuning by exploiting existing geometric structures instead of\nreparametrization.\n","authors":["Sergei Kudriashov","Olesya Karpik","Eduard Klyshinsky"],"pdf_url":"https://arxiv.org/pdf/2501.05502v1.pdf","comment":"AIST-2024"},{"id":"http://arxiv.org/abs/2501.05501v1","updated":"2025-01-09T18:43:05Z","published":"2025-01-09T18:43:05Z","title":"Strategy Masking: A Method for Guardrails in Value-based Reinforcement\n Learning Agents","summary":" The use of reward functions to structure AI learning and decision making is\ncore to the current reinforcement learning paradigm; however, without careful\ndesign of reward functions, agents can learn to solve problems in ways that may\nbe considered ``undesirable\" or ``unethical.\" Without a thorough understanding of\nthe incentives a reward function creates, it can be difficult to impose\nprincipled yet general control mechanisms over its behavior. In this paper, we\nstudy methods for constructing guardrails for AI agents that use reward\nfunctions to learn decision making. We introduce a novel approach, which we\ncall strategy masking, to explicitly learn and then suppress undesirable AI\nagent behavior. 
We apply our method to study lying in AI agents and show that\nstrategy masking can effectively modify agent behavior by suppressing, or\nactively penalizing, the reward dimension for lying such that agents act more\nhonestly while not compromising their ability to perform effectively.\n","authors":["Jonathan Keane","Sam Keyser","Jeremy Kedziora"],"pdf_url":"https://arxiv.org/pdf/2501.05501v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05499v1","updated":"2025-01-09T18:02:12Z","published":"2025-01-09T18:02:12Z","title":"Generalization of Urban Wind Environment Using Fourier Neural Operator\n Across Different Wind Directions and Cities","summary":" Simulation of urban wind environments is crucial for urban planning,\npollution control, and renewable energy utilization. However, the computational\nrequirements of high-fidelity computational fluid dynamics (CFD) methods make\nthem impractical for real cities. To address these limitations, this study\ninvestigates the effectiveness of the Fourier Neural Operator (FNO) model in\npredicting flow fields under different wind directions and urban layouts. By\ntraining the model on velocity data from large eddy simulations, we evaluate\nthe performance of the model under different urban configurations and wind\nconditions. The results show that the FNO model can\nprovide accurate predictions while significantly reducing the computational\ntime by 99%. Our innovative approach of dividing the wind field into smaller\nspatial blocks for training improves the ability of the FNO model to capture\nwind frequency features effectively. The signed distance function (SDF) data also provides important\nspatial building information, enhancing the model's ability to recognize\nphysical boundaries and generate more realistic predictions. 
The proposed FNO\napproach enhances the AI model's generalizability for different wind directions\nand urban layouts.\n","authors":["Cheng Chen","Geng Tian","Shaoxiang Qin","Senwen Yang","Dingyang Geng","Dongxue Zhan","Jinqiu Yang","David Vidal","Liangzhu Leon Wang"],"pdf_url":"https://arxiv.org/pdf/2501.05499v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05498v1","updated":"2025-01-09T17:47:17Z","published":"2025-01-09T17:47:17Z","title":"Generative Flow Networks: Theory and Applications to Structure Learning","summary":" Without any assumptions about data generation, multiple causal models may\nexplain our observations equally well. To avoid selecting a single arbitrary\nmodel that could result in unsafe decisions if it does not match reality, it is\ntherefore essential to maintain a notion of epistemic uncertainty about our\npossible candidates. This thesis studies the problem of structure learning from\na Bayesian perspective, approximating the posterior distribution over the\nstructure of a causal model, represented as a directed acyclic graph (DAG),\ngiven data. It introduces Generative Flow Networks (GFlowNets), a novel class\nof probabilistic models designed for modeling distributions over discrete and\ncompositional objects such as graphs. They treat generation as a sequential\ndecision making problem, constructing samples of a target distribution defined\nup to a normalization constant piece by piece. In the first part of this\nthesis, we present the mathematical foundations of GFlowNets, their connections\nto existing domains of machine learning and statistics such as variational\ninference and reinforcement learning, and their extensions beyond discrete\nproblems. 
In the second part of this thesis, we show how GFlowNets can\napproximate the posterior distribution over DAG structures of causal Bayesian\nNetworks, along with the parameters of its causal mechanisms, given\nobservational and experimental data.\n","authors":["Tristan Deleu"],"pdf_url":"https://arxiv.org/pdf/2501.05498v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05496v1","updated":"2025-01-09T16:10:03Z","published":"2025-01-09T16:10:03Z","title":"FedSA: A Unified Representation Learning via Semantic Anchors for\n Prototype-based Federated Learning","summary":" Prototype-based federated learning has emerged as a promising approach that\nshares lightweight prototypes to transfer knowledge among clients with data\nheterogeneity in a model-agnostic manner. However, existing methods often\ncollect prototypes directly from local models, which inevitably introduce\ninconsistencies into representation learning due to the biased data\ndistributions and differing model architectures among clients. In this paper,\nwe identify that both statistical and model heterogeneity create a vicious\ncycle of representation inconsistency, classifier divergence, and skewed\nprototype alignment, which negatively impacts the performance of clients. To\nbreak the vicious cycle, we propose a novel framework named Federated Learning\nvia Semantic Anchors (FedSA) to decouple the generation of prototypes from\nlocal representation learning. We introduce a novel perspective that uses\nsimple yet effective semantic anchors serving as prototypes to guide local\nmodels in learning consistent representations. By incorporating semantic\nanchors, we further propose anchor-based regularization with margin-enhanced\ncontrastive learning and anchor-based classifier calibration to correct feature\nextractors and calibrate classifiers across clients, achieving intra-class\ncompactness and inter-class separability of prototypes while ensuring\nconsistent decision boundaries. 
We then update the semantic anchors with these\nconsistent and discriminative prototypes, which iteratively encourage clients\nto collaboratively learn a unified data representation with robust\ngeneralization. Extensive experiments under both statistical and model\nheterogeneity settings show that FedSA significantly outperforms existing\nprototype-based FL methods on various classification tasks.\n","authors":["Yanbing Zhou","Xiangmou Qu","Chenlong You","Jiyang Zhou","Jingyue Tang","Xin Zheng","Chunmao Cai","Yingbo Wu"],"pdf_url":"https://arxiv.org/pdf/2501.05496v1.pdf","comment":"Accepted by AAAI2025"},{"id":"http://arxiv.org/abs/2501.05495v1","updated":"2025-01-09T15:47:30Z","published":"2025-01-09T15:47:30Z","title":"LSEBMCL: A Latent Space Energy-Based Model for Continual Learning","summary":" Continual learning has become essential in many practical applications such\nas online news summaries and product classification. The primary challenge is\nknown as catastrophic forgetting, a phenomenon where a model inadvertently\ndiscards previously learned knowledge when it is trained on new tasks. Existing\nsolutions involve storing exemplars from previous classes, regularizing\nparameters during the fine-tuning process, or assigning different model\nparameters to each task. The proposed solution LSEBMCL (Latent Space\nEnergy-Based Model for Continual Learning) in this work is to use energy-based\nmodels (EBMs) to prevent catastrophic forgetting by sampling data points from\nprevious tasks when training on new ones. The EBM is a machine learning model\nthat associates an energy value with each input data point. The proposed method\nuses an EBM layer as an outer-generator in the continual learning framework for\nNLP tasks. 
The study demonstrates the efficacy of EBM in NLP tasks, achieving\nstate-of-the-art results in all experiments.\n","authors":["Xiaodi Li","Dingcheng Li","Rujun Gao","Mahmoud Zamani","Latifur Khan"],"pdf_url":"https://arxiv.org/pdf/2501.05495v1.pdf","comment":"In the 7th International Conference on Artificial Intelligence in\n Information and Communication (ICAIIC 2025)"},{"id":"http://arxiv.org/abs/2501.05494v1","updated":"2025-01-09T14:32:08Z","published":"2025-01-09T14:32:08Z","title":"Mathematical Modeling and Machine Learning for Predicting Shade-Seeking\n Behavior in Cows Under Heat Stress","summary":" In this paper we develop a mathematical model combined with machine learning\ntechniques to predict shade-seeking behavior in cows exposed to heat stress.\nThe approach integrates advanced mathematical features, such as time-averaged\nthermal indices and accumulated heat stress metrics, obtained by mathematical\nanalysis of data from a farm in Titaguas (Valencia, Spain), collected during\nthe summer of 2023. Two predictive models, Random Forests and Neural Networks,\nare compared for accuracy, robustness, and interpretability. The Random Forest\nmodel is highlighted for its balance between precision and explainability,\nachieving an RMSE of $14.97$. The methodology also employs $5-$fold\ncross-validation to ensure robustness under real-world conditions. This work\nnot only advances the mathematical modeling of animal behavior but also\nprovides useful insights for mitigating heat stress in livestock through\ndata-driven tools.\n","authors":["S. Sanjuan","D. A. Méndez","R. Arnau","J. M. Calabuig","X. Díaz de Otálora Aguirre","F. 
Estellés"],"pdf_url":"https://arxiv.org/pdf/2501.05494v1.pdf","comment":"22 pages, 10 figures"},{"id":"http://arxiv.org/abs/2501.05238v1","updated":"2025-01-09T13:44:15Z","published":"2025-01-09T13:44:15Z","title":"FOCUS: Towards Universal Foreground Segmentation","summary":" Foreground segmentation is a fundamental task in computer vision,\nencompassing various subdivision tasks. Previous research has typically\ndesigned task-specific architectures for each task, leading to a lack of\nunification. Moreover, they primarily focus on recognizing foreground objects\nwithout effectively distinguishing them from the background. In this paper, we\nemphasize the importance of the background and its relationship with the\nforeground. We introduce FOCUS, the Foreground ObjeCts Universal Segmentation\nframework that can handle multiple foreground tasks. We develop a multi-scale\nsemantic network using the edge information of objects to enhance image\nfeatures. To achieve boundary-aware segmentation, we propose a novel\ndistillation method, integrating the contrastive learning strategy to refine\nthe prediction mask in multi-modal feature space. We conduct extensive\nexperiments on a total of 13 datasets across 5 tasks, and the results\ndemonstrate that FOCUS consistently outperforms the state-of-the-art\ntask-specific models on most metrics.\n","authors":["Zuyao You","Lingyu Kong","Lingchen Meng","Zuxuan Wu"],"pdf_url":"https://arxiv.org/pdf/2501.05238v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05493v1","updated":"2025-01-09T12:26:11Z","published":"2025-01-09T12:26:11Z","title":"Monotonic Learning in the PAC Framework: A New Perspective","summary":" Monotone learning refers to learning processes in which expected performance\nconsistently improves as more training data is introduced. 
Non-monotone\nbehavior of machine learning has been the topic of a series of recent works,\nwith various proposals that ensure monotonicity by applying transformations or\nwrappers on learning algorithms. In this work, from a different perspective, we\ntackle the topic of monotone learning within the framework of Probably\nApproximately Correct (PAC) learning theory. Following the mechanism that\nestimates sample complexity of a PAC-learnable problem, we derive a performance\nlower bound for that problem, and prove the monotonicity of that bound as the\nsample sizes increase. By calculating the lower bound distribution, we are able\nto prove that given a PAC-learnable problem with a hypothesis space that is\neither of finite size or of finite VC dimension, any learning algorithm based\non Empirical Risk Minimization (ERM) is monotone if training samples are\nindependent and identically distributed (i.i.d.). We further carry out an\nexperiment on two concrete machine learning problems, one of which has a finite\nhypothesis set, and the other of finite VC dimension, and compared the\nexperimental data for the empirical risk distributions with the estimated\ntheoretical bound. The results of the comparison have confirmed the\nmonotonicity of learning for the two PAC-learnable problems.\n","authors":["Ming Li","Chenyi Zhang","Qin Li"],"pdf_url":"https://arxiv.org/pdf/2501.05493v1.pdf","comment":"16 pages"}],"Multimedia":[{"id":"http://arxiv.org/abs/2411.07899v3","updated":"2025-01-09T13:01:55Z","published":"2024-11-12T16:12:51Z","title":"Rendering-Oriented 3D Point Cloud Attribute Compression using Sparse\n Tensor-based Transformer","summary":" The evolution of 3D visualization techniques has fundamentally transformed\nhow we interact with digital content. At the forefront of this change is point\ncloud technology, offering an immersive experience that surpasses traditional\n2D representations. 
However, the massive data size of point clouds presents\nsignificant challenges in data compression. Current methods for lossy point\ncloud attribute compression (PCAC) generally focus on reconstructing the\noriginal point clouds with minimal error. However, for point cloud\nvisualization scenarios, the reconstructed point clouds with distortion still\nneed to undergo a complex rendering process, which affects the final\nuser-perceived quality. In this paper, we propose an end-to-end deep learning\nframework that seamlessly integrates PCAC with differentiable rendering,\ndenoted as rendering-oriented PCAC (RO-PCAC), directly targeting the quality of\nrendered multiview images for viewing. In a differentiable manner, the impact\nof the rendering process on the reconstructed point clouds is taken into\naccount. Moreover, we characterize point clouds as sparse tensors and propose a\nsparse tensor-based transformer, called SP-Trans. By aligning with the local\ndensity of the point cloud and utilizing an enhanced local attention mechanism,\nSP-Trans captures the intricate relationships within the point cloud, further\nimproving feature analysis and synthesis within the framework. Extensive\nexperiments demonstrate that the proposed RO-PCAC achieves state-of-the-art\ncompression performance, compared to existing reconstruction-oriented methods,\nincluding traditional, learning-based, and hybrid methods.\n","authors":["Xiao Huo","Junhui Hou","Shuai Wan","Fuzheng Yang"],"pdf_url":"https://arxiv.org/pdf/2411.07899v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.03700v2","updated":"2025-01-09T09:12:06Z","published":"2023-12-06T18:59:19Z","title":"OneLLM: One Framework to Align All Modalities with Language","summary":" Multimodal large language models (MLLMs) have gained significant attention\ndue to their strong multimodal understanding capability. 
However, existing\nworks rely heavily on modality-specific encoders, which usually differ in\narchitecture and are limited to common modalities. In this paper, we present\nOneLLM, an MLLM that aligns eight modalities to language using a unified\nframework. We achieve this through a unified multimodal encoder and a\nprogressive multimodal alignment pipeline. In detail, we first train an image\nprojection module to connect a vision encoder with LLM. Then, we build a\nuniversal projection module (UPM) by mixing multiple image projection modules\nand dynamic routing. Finally, we progressively align more modalities to LLM\nwith the UPM. To fully leverage the potential of OneLLM in following\ninstructions, we also curated a comprehensive multimodal instruction dataset,\nincluding 2M items from image, audio, video, point cloud, depth/normal map, IMU\nand fMRI brain activity. OneLLM is evaluated on 25 diverse benchmarks,\nencompassing tasks such as multimodal captioning, question answering and\nreasoning, where it delivers excellent performance. Code, data, model and\nonline demo are available at https://github.com/csuhan/OneLLM\n","authors":["Jiaming Han","Kaixiong Gong","Yiyuan Zhang","Jiaqi Wang","Kaipeng Zhang","Dahua Lin","Yu Qiao","Peng Gao","Xiangyu Yue"],"pdf_url":"https://arxiv.org/pdf/2312.03700v2.pdf","comment":"Accepted by CVPR 2024. Code: https://github.com/csuhan/OneLLM"}],"Artificial Intelligence":[{"id":"http://arxiv.org/abs/2501.05453v1","updated":"2025-01-09T18:59:58Z","published":"2025-01-09T18:59:58Z","title":"An Empirical Study of Autoregressive Pre-training from Videos","summary":" We empirically study autoregressive pre-training from videos. To perform our\nstudy, we construct a series of autoregressive video models, called Toto. We\ntreat videos as sequences of visual tokens and train transformer models to\nautoregressively predict future tokens. 
Our models are pre-trained on a diverse\ndataset of videos and images comprising over 1 trillion visual tokens. We\nexplore different architectural, training, and inference design choices. We\nevaluate the learned visual representations on a range of downstream tasks\nincluding image recognition, video classification, object tracking, and\nrobotics. Our results demonstrate that, despite minimal inductive biases,\nautoregressive pre-training leads to competitive performance across all\nbenchmarks. Finally, we find that scaling our video models results in similar\nscaling curves to those seen in language models, albeit with a different rate.\nMore details at https://brjathu.github.io/toto/\n","authors":["Jathushan Rajasegaran","Ilija Radosavovic","Rahul Ravishankar","Yossi Gandelsman","Christoph Feichtenhofer","Jitendra Malik"],"pdf_url":"https://arxiv.org/pdf/2501.05453v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05445v1","updated":"2025-01-09T18:56:05Z","published":"2025-01-09T18:56:05Z","title":"Consistent Flow Distillation for Text-to-3D Generation","summary":" Score Distillation Sampling (SDS) has made significant strides in distilling\nimage-generative models for 3D generation. However, its\nmaximum-likelihood-seeking behavior often leads to degraded visual quality and\ndiversity, limiting its effectiveness in 3D applications. In this work, we\npropose Consistent Flow Distillation (CFD), which addresses these limitations.\nWe begin by leveraging the gradient of the diffusion ODE or SDE sampling\nprocess to guide the 3D generation. From the gradient-based sampling\nperspective, we find that the consistency of 2D image flows across different\nviewpoints is important for high-quality 3D generation. To achieve this, we\nintroduce multi-view consistent Gaussian noise on the 3D object, which can be\nrendered from various viewpoints to compute the flow gradient. 
Our experiments\ndemonstrate that CFD, through consistent flows, significantly outperforms\nprevious methods in text-to-3D generation.\n","authors":["Runjie Yan","Yinbo Chen","Xiaolong Wang"],"pdf_url":"https://arxiv.org/pdf/2501.05445v1.pdf","comment":"Project page: https://runjie-yan.github.io/cfd/"},{"id":"http://arxiv.org/abs/2501.05443v1","updated":"2025-01-09T18:55:50Z","published":"2025-01-09T18:55:50Z","title":"A survey of textual cyber abuse detection using cutting-edge language\n models and large language models","summary":" The success of social media platforms has facilitated the emergence of\nvarious forms of online abuse within digital communities. This abuse manifests\nin multiple ways, including hate speech, cyberbullying, emotional abuse,\ngrooming, and sexting. In this paper, we present a comprehensive analysis of\nthe different forms of abuse prevalent in social media, with a particular focus\non how emerging technologies, such as Language Models (LMs) and Large Language\nModels (LLMs), are reshaping both the detection and generation of abusive\ncontent within these networks. We delve into the mechanisms through which\nsocial media abuse is perpetuated, exploring the psychological and social\nimpact. Additionally, we examine the dual role of advanced language\nmodels-highlighting their potential to enhance automated detection systems for\nabusive behavior while also acknowledging their capacity to generate harmful\ncontent. This paper aims to contribute to the ongoing discourse on online\nsafety and ethics, offering insights into the evolving landscape of cyberabuse\nand the technological innovations that both mitigate and exacerbate it.\n","authors":["Jose A. 
Diaz-Garcia","Joao Paulo Carvalho"],"pdf_url":"https://arxiv.org/pdf/2501.05443v1.pdf","comment":"37 pages, under review in WIREs Data Mining and Knowledge Discovery"},{"id":"http://arxiv.org/abs/2501.05442v1","updated":"2025-01-09T18:55:15Z","published":"2025-01-09T18:55:15Z","title":"Progressive Growing of Video Tokenizers for Highly Compressed Latent\n Spaces","summary":" Video tokenizers are essential for latent video diffusion models, converting\nraw video data into spatiotemporally compressed latent spaces for efficient\ntraining. However, extending state-of-the-art video tokenizers to achieve a\ntemporal compression ratio beyond 4x without increasing channel capacity poses\nsignificant challenges. In this work, we propose an alternative approach to\nenhance temporal compression. We find that the reconstruction quality of\ntemporally subsampled videos from a low-compression encoder surpasses that of\nhigh-compression encoders applied to original videos. This indicates that\nhigh-compression models can leverage representations from lower-compression\nmodels. Building on this insight, we develop a bootstrapped\nhigh-temporal-compression model that progressively trains high-compression\nblocks atop well-trained lower-compression models. Our method includes a\ncross-level feature-mixing module to retain information from the pretrained\nlow-compression model and guide higher-compression blocks to capture the\nremaining details from the full video sequence. Evaluation of video benchmarks\nshows that our method significantly improves reconstruction quality while\nincreasing temporal compression compared to direct extensions of existing video\ntokenizers. 
Furthermore, the resulting compact latent space effectively trains\na video diffusion model for high-quality video generation with a reduced token\nbudget.\n","authors":["Aniruddha Mahapatra","Long Mai","Yitian Zhang","David Bourgin","Feng Liu"],"pdf_url":"https://arxiv.org/pdf/2501.05442v1.pdf","comment":"Project website:\n https://progressive-video-tokenizer.github.io/Pro-MAG/"},{"id":"http://arxiv.org/abs/2501.05439v1","updated":"2025-01-09T18:49:39Z","published":"2025-01-09T18:49:39Z","title":"From Simple to Complex Skills: The Case of In-Hand Object Reorientation","summary":" Learning policies in simulation and transferring them to the real world has\nbecome a promising approach in dexterous manipulation. However, bridging the\nsim-to-real gap for each new task requires substantial human effort, such as\ncareful reward engineering, hyperparameter tuning, and system identification.\nIn this work, we present a system that leverages low-level skills to address\nthese challenges for more complex tasks. Specifically, we introduce a\nhierarchical policy for in-hand object reorientation based on previously\nacquired rotation skills. This hierarchical policy learns to select which\nlow-level skill to execute based on feedback from both the environment and the\nlow-level skill policies themselves. Compared to learning from scratch, the\nhierarchical policy is more robust to out-of-distribution changes and transfers\neasily from simulation to real-world environments. Additionally, we propose a\ngeneralizable object pose estimator that uses proprioceptive information,\nlow-level skill predictions, and control errors as inputs to estimate the\nobject pose over time. 
We demonstrate that our system can reorient objects,\nincluding symmetrical and textureless ones, to a desired pose.\n","authors":["Haozhi Qi","Brent Yi","Mike Lambeta","Yi Ma","Roberto Calandra","Jitendra Malik"],"pdf_url":"https://arxiv.org/pdf/2501.05439v1.pdf","comment":"website: https://dexhier.github.io"},{"id":"http://arxiv.org/abs/2501.05435v1","updated":"2025-01-09T18:48:35Z","published":"2025-01-09T18:48:35Z","title":"Neuro-Symbolic AI in 2024: A Systematic Review","summary":" Background: The field of Artificial Intelligence has undergone cyclical\nperiods of growth and decline, known as AI summers and winters. Currently, we\nare in the third AI summer, characterized by significant advancements and\ncommercialization, particularly in the integration of Symbolic AI and\nSub-Symbolic AI, leading to the emergence of Neuro-Symbolic AI.\n Methods: The review followed the PRISMA methodology, utilizing databases such\nas IEEE Explore, Google Scholar, arXiv, ACM, and SpringerLink. The inclusion\ncriteria targeted peer-reviewed papers published between 2020 and 2024. Papers\nwere screened for relevance to Neuro-Symbolic AI, with further inclusion based\non the availability of associated codebases to ensure reproducibility.\n Results: From an initial pool of 1,428 papers, 167 met the inclusion criteria\nand were analyzed in detail. The majority of research efforts are concentrated\nin the areas of learning and inference (63%), logic and reasoning (35%), and\nknowledge representation (44%). Explainability and trustworthiness are less\nrepresented (28%), with Meta-Cognition being the least explored area (5%). The\nreview identifies significant interdisciplinary opportunities, particularly in\nintegrating explainability and trustworthiness with other research areas.\n Conclusion: Neuro-Symbolic AI research has seen rapid growth since 2020, with\nconcentrated efforts in learning and inference. 
Significant gaps remain in\nexplainability, trustworthiness, and Meta-Cognition. Addressing these gaps\nthrough interdisciplinary research will be crucial for advancing the field\ntowards more intelligent, reliable, and context-aware AI systems.\n","authors":["Brandon C. Colelough","William Regli"],"pdf_url":"https://arxiv.org/pdf/2501.05435v1.pdf","comment":"19 pages"},{"id":"http://arxiv.org/abs/2410.08405v2","updated":"2025-01-09T18:43:18Z","published":"2024-10-10T22:38:26Z","title":"AgroGPT: Efficient Agricultural Vision-Language Model with Expert Tuning","summary":" Significant progress has been made in advancing large multimodal\nconversational models (LMMs), capitalizing on vast repositories of image-text\ndata available online. Despite this progress, these models often encounter\nsubstantial domain gaps, hindering their ability to engage in complex\nconversations across new domains. Recent efforts have aimed to mitigate this\nissue, albeit relying on domain-specific image-text data to curate\ninstruction-tuning data. However, many domains, such as agriculture, lack such\nvision-language data. In this work, we propose an approach to construct\ninstruction-tuning data that harnesses vision-only data for the agriculture\ndomain. We utilize diverse agricultural datasets spanning multiple domains,\ncurate class-specific information, and employ large language models (LLMs) to\nconstruct an expert-tuning set, resulting in a 70k expert-tuning dataset called\nAgroInstruct. Subsequently, we expert-tuned and created AgroGPT, an efficient\nLMM that can hold complex agriculture-related conversations and provide useful\ninsights. We also develop AgroEvals for evaluation and compare {AgroGPT's}\nperformance with large open and closed-source models. 
{AgroGPT} excels at\nidentifying fine-grained agricultural concepts, can act as an agriculture\nexpert, and provides helpful information for multimodal agriculture questions.\nThe code, datasets, and models are available at\nhttps://github.com/awaisrauf/agroGPT.\n","authors":["Muhammad Awais","Ali Husain Salem Abdulla Alharthi","Amandeep Kumar","Hisham Cholakkal","Rao Muhammad Anwer"],"pdf_url":"https://arxiv.org/pdf/2410.08405v2.pdf","comment":"Accepted at WACV, 2025"},{"id":"http://arxiv.org/abs/2501.05409v1","updated":"2025-01-09T18:06:45Z","published":"2025-01-09T18:06:45Z","title":"A Novel Pathology Foundation Model by Mayo Clinic, Charité, and\n Aignostics","summary":" Recent advances in digital pathology have demonstrated the effectiveness of\nfoundation models across diverse applications. In this report, we present a\nnovel vision foundation model based on the RudolfV approach. Our model was\ntrained on a dataset comprising 1.2 million histopathology whole slide images,\ncollected from two medical institutions: Mayo Clinic and Charit\\'e -\nUniverst\\\"atsmedizin Berlin. 
Comprehensive evaluations show that our model\nachieves state-of-the-art performance across twenty-one public benchmark\ndatasets, even though it is neither the largest model by parameter count nor by\ntraining dataset size.\n","authors":["Maximilian Alber","Stephan Tietz","Jonas Dippel","Timo Milbich","Timothée Lesort","Panos Korfiatis","Moritz Krügener","Beatriz Perez Cancer","Neelay Shah","Alexander Möllers","Philipp Seegerer","Alexandra Carpen-Amarie","Kai Standvoss","Gabriel Dernbach","Edwin de Jong","Simon Schallenberg","Andreas Kunft","Helmut Hoffer von Ankershoffen","Gavin Schaeferle","Patrick Duffy","Matt Redlon","Philipp Jurmeister","David Horst","Lukas Ruff","Klaus-Robert Müller","Frederick Klauschen","Andrew Norgan"],"pdf_url":"https://arxiv.org/pdf/2501.05409v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05408v1","updated":"2025-01-09T18:05:33Z","published":"2025-01-09T18:05:33Z","title":"TimeRL: Efficient Deep Reinforcement Learning with Polyhedral Dependence\n Graphs","summary":" Modern deep learning (DL) workloads increasingly use complex deep\nreinforcement learning (DRL) algorithms that generate training data within the\nlearning loop. This results in programs with several nested loops and dynamic\ndata dependencies between tensors. While DL systems with eager execution\nsupport such dynamism, they lack the optimizations and smart scheduling of\ngraph-based execution. Graph-based execution, however, cannot express dynamic\ntensor shapes, instead requiring the use of multiple static subgraphs. Either\nexecution model for DRL thus leads to redundant computation, reduced\nparallelism, and less efficient memory management.\n We describe TimeRL, a system for executing dynamic DRL programs that combines\nthe dynamism of eager execution with the whole-program optimizations and\nscheduling of graph-based execution. 
TimeRL achieves this by introducing the\ndeclarative programming model of recurrent tensors, which allows users to\ndefine dynamic dependencies as intuitive recurrence equations. TimeRL\ntranslates recurrent tensors into a polyhedral dependence graph (PDG) with\ndynamic dependencies as symbolic expressions. Through simple PDG\ntransformations, TimeRL applies whole-program optimizations, such as automatic\nvectorization, incrementalization, and operator fusion. The PDG also allows for\nthe computation of an efficient program-wide execution schedule, which decides\non buffer deallocations, buffer donations, and GPU/CPU memory swapping. We show\nthat TimeRL executes current DRL algorithms up to 47$\\times$ faster than\nexisting DRL systems, while using 16$\\times$ less GPU peak memory.\n","authors":["Pedro F. Silvestre","Peter Pietzuch"],"pdf_url":"https://arxiv.org/pdf/2501.05408v1.pdf","comment":"17 pages, 11 figures, 5 bibliography pages"},{"id":"http://arxiv.org/abs/2501.05407v1","updated":"2025-01-09T18:05:05Z","published":"2025-01-09T18:05:05Z","title":"On-line Policy Improvement using Monte-Carlo Search","summary":" We present a Monte-Carlo simulation algorithm for real-time policy\nimprovement of an adaptive controller. In the Monte-Carlo simulation, the\nlong-term expected reward of each possible action is statistically measured,\nusing the initial policy to make decisions in each step of the simulation. The\naction maximizing the measured expected reward is then taken, resulting in an\nimproved policy. Our algorithm is easily parallelizable and has been\nimplemented on the IBM SP1 and SP2 parallel-RISC supercomputers.\n We have obtained promising initial results in applying this algorithm to the\ndomain of backgammon. Results are reported for a wide variety of initial\npolicies, ranging from a random policy to TD-Gammon, an extremely strong\nmulti-layer neural network. 
In each case, the Monte-Carlo algorithm gives a\nsubstantial reduction, by as much as a factor of 5 or more, in the error rate\nof the base players. The algorithm is also potentially useful in many other\nadaptive control applications in which it is possible to simulate the\nenvironment.\n","authors":["Gerald Tesauro","Gregory R. Galperin"],"pdf_url":"https://arxiv.org/pdf/2501.05407v1.pdf","comment":"Accompanied by oral presentation by Gregory Galperin at NeurIPS 1996\n (then known as NIPS*96)"},{"id":"http://arxiv.org/abs/2405.13536v2","updated":"2025-01-09T17:58:44Z","published":"2024-05-22T11:14:00Z","title":"Attention Mechanisms Don't Learn Additive Models: Rethinking Feature\n Importance for Transformers","summary":" We address the critical challenge of applying feature attribution methods to\nthe transformer architecture, which dominates current applications in natural\nlanguage processing and beyond. Traditional attribution methods to explainable\nAI (XAI) explicitly or implicitly rely on linear or additive surrogate models\nto quantify the impact of input features on a model's output. In this work, we\nformally prove an alarming incompatibility: transformers are structurally\nincapable of representing linear or additive surrogate models used for feature\nattribution, undermining the grounding of these conventional explanation\nmethodologies. To address this discrepancy, we introduce the Softmax-Linked\nAdditive Log Odds Model (SLALOM), a novel surrogate model specifically designed\nto align with the transformer framework. SLALOM demonstrates the capacity to\ndeliver a range of insightful explanations with both synthetic and real-world\ndatasets. We highlight SLALOM's unique efficiency-quality curve by showing that\nSLALOM can produce explanations with substantially higher fidelity than\ncompeting surrogate models or provide explanations of comparable quality at a\nfraction of their computational costs. 
We release code for SLALOM as an\nopen-source project online at https://github.com/tleemann/slalom_explanations.\n","authors":["Tobias Leemann","Alina Fastowski","Felix Pfeiffer","Gjergji Kasneci"],"pdf_url":"https://arxiv.org/pdf/2405.13536v2.pdf","comment":"TMLR Camera-Ready version"},{"id":"http://arxiv.org/abs/2501.05403v1","updated":"2025-01-09T17:57:56Z","published":"2025-01-09T17:57:56Z","title":"TimeDP: Learning to Generate Multi-Domain Time Series with Domain\n Prompts","summary":" Time series generation models are crucial for applications like data\naugmentation and privacy preservation. Most existing time series generation\nmodels are typically designed to generate data from one specified domain. While\nleveraging data from other domain for better generalization is proved to work\nin other application areas, this approach remains challenging for time series\nmodeling due to the large divergence in patterns among different real world\ntime series categories. In this paper, we propose a multi-domain time series\ndiffusion model with domain prompts, named TimeDP. In TimeDP, we utilize a time\nseries semantic prototype module which defines time series prototypes to\nrepresent time series basis, each prototype vector serving as \"word\"\nrepresenting some elementary time series feature. A prototype assignment module\nis applied to extract the extract domain specific prototype weights, for\nlearning domain prompts as generation condition. During sampling, we extract\n\"domain prompt\" with few-shot samples from the target domain and use the domain\nprompts as condition to generate time series samples. 
Experiments demonstrate\nthat our method outperforms baselines to provide the state-of-the-art in-domain\ngeneration quality and strong unseen domain generation capability.\n","authors":["Yu-Hao Huang","Chang Xu","Yueying Wu","Wu-Jun Li","Jiang Bian"],"pdf_url":"https://arxiv.org/pdf/2501.05403v1.pdf","comment":"AAAI 2025"},{"id":"http://arxiv.org/abs/2501.05401v1","updated":"2025-01-09T17:50:56Z","published":"2025-01-09T17:50:56Z","title":"BRATI: Bidirectional Recurrent Attention for Time-Series Imputation","summary":" Missing data in time-series analysis poses significant challenges, affecting\nthe reliability of downstream applications. Imputation, the process of\nestimating missing values, has emerged as a key solution. This paper introduces\nBRATI, a novel deep-learning model designed to address multivariate time-series\nimputation by combining Bidirectional Recurrent Networks and Attention\nmechanisms. BRATI processes temporal dependencies and feature correlations\nacross long and short time horizons, utilizing two imputation blocks that\noperate in opposite temporal directions. Each block integrates recurrent layers\nand attention mechanisms to effectively resolve long-term dependencies.\n We evaluate BRATI on three real-world datasets under diverse missing-data\nscenarios: randomly missing values, fixed-length missing sequences, and\nvariable-length missing sequences. Our findings demonstrate that BRATI\nconsistently outperforms state-of-the-art models, delivering superior accuracy\nand robustness in imputing multivariate time-series data.\n","authors":["Armando Collado-Villaverde","Pablo Muñoz","Maria D. 
R-Moreno"],"pdf_url":"https://arxiv.org/pdf/2501.05401v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05398v1","updated":"2025-01-09T17:47:34Z","published":"2025-01-09T17:47:34Z","title":"Mechanistic understanding and validation of large AI models with\n SemanticLens","summary":" Unlike human-engineered systems such as aeroplanes, where each component's\nrole and dependencies are well understood, the inner workings of AI models\nremain largely opaque, hindering verifiability and undermining trust. This\npaper introduces SemanticLens, a universal explanation method for neural\nnetworks that maps hidden knowledge encoded by components (e.g., individual\nneurons) into the semantically structured, multimodal space of a foundation\nmodel such as CLIP. In this space, unique operations become possible, including\n(i) textual search to identify neurons encoding specific concepts, (ii)\nsystematic analysis and comparison of model representations, (iii) automated\nlabelling of neurons and explanation of their functional roles, and (iv) audits\nto validate decision-making against requirements. Fully scalable and operating\nwithout human input, SemanticLens is shown to be effective for debugging and\nvalidation, summarizing model knowledge, aligning reasoning with expectations\n(e.g., adherence to the ABCDE-rule in melanoma classification), and detecting\ncomponents tied to spurious correlations and their associated training data. By\nenabling component-level understanding and validation, the proposed approach\nhelps bridge the \"trust gap\" between AI models and traditional engineered\nsystems. 
We provide code for SemanticLens on\nhttps://github.com/jim-berend/semanticlens and a demo on\nhttps://semanticlens.hhi-research-insights.eu.\n","authors":["Maximilian Dreyer","Jim Berend","Tobias Labarta","Johanna Vielhaben","Thomas Wiegand","Sebastian Lapuschkin","Wojciech Samek"],"pdf_url":"https://arxiv.org/pdf/2501.05398v1.pdf","comment":"74 pages (18 pages manuscript, 7 pages references, 49 pages appendix)"},{"id":"http://arxiv.org/abs/2501.05391v1","updated":"2025-01-09T17:33:08Z","published":"2025-01-09T17:33:08Z","title":"The global consensus on the risk management of autonomous driving","summary":" Every maneuver of a vehicle redistributes risks between road users. While\nhuman drivers do this intuitively, autonomous vehicles allow and require\ndeliberative algorithmic risk management. But how should traffic risks be\ndistributed among road users? In a global experimental study in eight countries\nwith different cultural backgrounds and almost 11,000 participants, we compared\nrisk distribution preferences. It turns out that risk preferences in road\ntraffic are strikingly similar between the cultural zones. The vast majority of\nparticipants in all countries deviates from a guiding principle of minimizing\naccident probabilities in favor of weighing up the probability and severity of\naccidents. At the national level, the consideration of accident probability and\nseverity hardly differs between countries. The social dilemma of autonomous\nvehicles detected in deterministic crash scenarios disappears in risk\nassessments of everyday traffic situations in all countries. In no country do\ncyclists receive a risk bonus that goes beyond their higher vulnerability. 
In\nsum, our results suggest that a global consensus on the risk ethics of\nautonomous driving is easier to establish than on the ethics of crashing.\n","authors":["Sebastian Krügel","Matthias Uhl"],"pdf_url":"https://arxiv.org/pdf/2501.05391v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.19185v2","updated":"2025-01-09T17:29:40Z","published":"2024-10-24T22:34:27Z","title":"Tailored-LLaMA: Optimizing Few-Shot Learning in Pruned LLaMA Models with\n Task-Specific Prompts","summary":" Large language models demonstrate impressive proficiency in language\nunderstanding and generation. Nonetheless, training these models from scratch,\neven the least complex billion-parameter variant demands significant\ncomputational resources rendering it economically impractical for many\norganizations. With large language models functioning as general-purpose task\nsolvers, this paper investigates their task-specific fine-tuning. We employ\ntask-specific datasets and prompts to fine-tune two pruned LLaMA models having\n5 billion and 4 billion parameters. This process utilizes the pre-trained\nweights and focuses on a subset of weights using the LoRA method. One challenge\nin fine-tuning the LLaMA model is crafting a precise prompt tailored to the\nspecific task. To address this, we propose a novel approach to fine-tune the\nLLaMA model under two primary constraints: task specificity and prompt\neffectiveness. Our approach, Tailored LLaMA initially employs structural\npruning to reduce the model sizes from 7B to 5B and 4B parameters.\nSubsequently, it applies a carefully designed prompt specific to the task and\nutilizes the LoRA method to accelerate the fine-tuning process. 
Moreover,\nfine-tuning a model pruned by 50\\% for less than one hour restores the mean\naccuracy of classification tasks to 95.68\\% at a 20\\% compression ratio and to\n86.54\\% at a 50\\% compression ratio through few-shot learning with 50 shots.\nOur validation of Tailored LLaMA on these two pruned variants demonstrates that\neven when compressed to 50\\%, the models maintain over 65\\% of the baseline\nmodel accuracy in few-shot classification and generation tasks. These findings\nhighlight the efficacy of our tailored approach in maintaining high performance\nwith significantly reduced model sizes.\n","authors":["Danyal Aftab","Steven Davy"],"pdf_url":"https://arxiv.org/pdf/2410.19185v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05382v1","updated":"2025-01-09T17:11:22Z","published":"2025-01-09T17:11:22Z","title":"Large Physics Models: Towards a collaborative approach with Large\n Language Models and Foundation Models","summary":" This paper explores ideas and provides a potential roadmap for the\ndevelopment and evaluation of physics-specific large-scale AI models, which we\ncall Large Physics Models (LPMs). These models, based on foundation models such\nas Large Language Models (LLMs) - trained on broad data - are tailored to\naddress the demands of physics research. LPMs can function independently or as\npart of an integrated framework. This framework can incorporate specialized\ntools, including symbolic reasoning modules for mathematical manipulations,\nframeworks to analyse specific experimental and simulated data, and mechanisms\nfor synthesizing theories and scientific literature. We begin by examining\nwhether the physics community should actively develop and refine dedicated\nmodels, rather than relying solely on commercial LLMs. We then outline how LPMs\ncan be realized through interdisciplinary collaboration among experts in\nphysics, computer science, and philosophy of science. 
To integrate these models\neffectively, we identify three key pillars: Development, Evaluation, and\nPhilosophical Reflection. Development focuses on constructing models capable of\nprocessing physics texts, mathematical formulations, and diverse physical data.\nEvaluation assesses accuracy and reliability by testing and benchmarking.\nFinally, Philosophical Reflection encompasses the analysis of broader\nimplications of LLMs in physics, including their potential to generate new\nscientific understanding and what novel collaboration dynamics might arise in\nresearch. Inspired by the organizational structure of experimental\ncollaborations in particle physics, we propose a similarly interdisciplinary\nand collaborative approach to building and refining Large Physics Models. This\nroadmap provides specific objectives, defines pathways to achieve them, and\nidentifies challenges that must be addressed to realise physics-specific large\nscale AI models.\n","authors":["Kristian G. Barman","Sascha Caron","Emily Sullivan","Henk W. de Regt","Roberto Ruiz de Austri","Mieke Boon","Michael Färber","Stefan Fröse","Faegheh Hasibi","Andreas Ipp","Rukshak Kapoor","Gregor Kasieczka","Daniel Kostić","Michael Krämer","Tobias Golling","Luis G. Lopez","Jesus Marco","Sydney Otten","Pawel Pawlowski","Pietro Vischia","Erik Weber","Christoph Weniger"],"pdf_url":"https://arxiv.org/pdf/2501.05382v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05368v1","updated":"2025-01-09T16:49:04Z","published":"2025-01-09T16:49:04Z","title":"Developing a Foundation of Vector Symbolic Architectures Using Category\n Theory","summary":" At the risk of overstating the case, connectionist approaches to machine\nlearning, i.e. neural networks, are enjoying a small vogue right now. However,\nthese methods require large volumes of data and produce models that are\nuninterpretable to humans. 
An alternative framework that is compatible with\nneural networks and gradient-based learning, but explicitly models\ncompositionality, is Vector Symbolic Architectures (VSAs). VSAs are a family of\nalgebras on high-dimensional vector representations. They arose in cognitive\nscience from the need to unify neural processing and the kind of symbolic\nreasoning that humans perform. While machine learning methods have benefited\nfrom category theoretical analyses, VSAs have not yet received similar\ntreatment. In this paper, we present a first attempt at applying category\ntheory to VSAs. Specifically, we conduct a brief literature survey\ndemonstrating the lacking intersection of these two topics, provide a list of\ndesiderata for VSAs, and propose that VSAs may be understood as a (division)\nrig in a category enriched over a monoid in Met (the category of Lawvere metric\nspaces). This final contribution suggests that VSAs may be generalised beyond\ncurrent implementations. It is our hope that grounding VSAs in category theory\nwill lead to more rigorous connections with other research, both within and\nbeyond, learning and cognition.\n","authors":["Nolan P Shaw","P Michael Furlong","Britt Anderson","Jeff Orchard"],"pdf_url":"https://arxiv.org/pdf/2501.05368v1.pdf","comment":"13 pages, no figures, 2 tables, one appendix"},{"id":"http://arxiv.org/abs/2501.05366v1","updated":"2025-01-09T16:48:17Z","published":"2025-01-09T16:48:17Z","title":"Search-o1: Agentic Search-Enhanced Large Reasoning Models","summary":" Large reasoning models (LRMs) like OpenAI-o1 have demonstrated impressive\nlong stepwise reasoning capabilities through large-scale reinforcement\nlearning. However, their extended reasoning processes often suffer from\nknowledge insufficiency, leading to frequent uncertainties and potential\nerrors. 
To address this limitation, we introduce \\textbf{Search-o1}, a\nframework that enhances LRMs with an agentic retrieval-augmented generation\n(RAG) mechanism and a Reason-in-Documents module for refining retrieved\ndocuments. Search-o1 integrates an agentic search workflow into the reasoning\nprocess, enabling dynamic retrieval of external knowledge when LRMs encounter\nuncertain knowledge points. Additionally, due to the verbose nature of\nretrieved documents, we design a separate Reason-in-Documents module to deeply\nanalyze the retrieved information before injecting it into the reasoning chain,\nminimizing noise and preserving coherent reasoning flow. Extensive experiments\non complex reasoning tasks in science, mathematics, and coding, as well as six\nopen-domain QA benchmarks, demonstrate the strong performance of Search-o1.\nThis approach enhances the trustworthiness and applicability of LRMs in complex\nreasoning tasks, paving the way for more reliable and versatile intelligent\nsystems. The code is available at\n\\url{https://github.com/sunnynexus/Search-o1}.\n","authors":["Xiaoxi Li","Guanting Dong","Jiajie Jin","Yuyao Zhang","Yujia Zhou","Yutao Zhu","Peitian Zhang","Zhicheng Dou"],"pdf_url":"https://arxiv.org/pdf/2501.05366v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05360v1","updated":"2025-01-09T16:44:38Z","published":"2025-01-09T16:44:38Z","title":"On Corrigibility and Alignment in Multi Agent Games","summary":" Corrigibility of autonomous agents is an under explored part of system\ndesign, with previous work focusing on single agent systems. It has been\nsuggested that uncertainty over the human preferences acts to keep the agents\ncorrigible, even in the face of human irrationality. We present a general\nframework for modelling corrigibility in a multi-agent setting as a 2 player\ngame in which the agents always have a move in which they can ask the human for\nsupervision. 
This is formulated as a Bayesian game for the purpose of\nintroducing uncertainty over the human beliefs. We further analyse two specific\ncases. First, a two player corrigibility game, in which we want corrigibility\ndisplayed in both agents for both common payoff (monotone) games and harmonic\ngames. Then we investigate an adversary setting, in which one agent is\nconsidered to be a `defending' agent and the other an `adversary'. A general\nresult is provided for what belief over the games and human rationality the\ndefending agent is required to have to induce corrigibility.\n","authors":["Edmund Dable-Heath","Boyko Vodenicharski","James Bishop"],"pdf_url":"https://arxiv.org/pdf/2501.05360v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20138v2","updated":"2025-01-09T16:36:26Z","published":"2024-12-28T12:54:06Z","title":"TradingAgents: Multi-Agents LLM Financial Trading Framework","summary":" Significant progress has been made in automated problem-solving using\nsocieties of agents powered by large language models (LLMs). In finance,\nefforts have largely focused on single-agent systems handling specific tasks or\nmulti-agent frameworks independently gathering data. However, multi-agent\nsystems' potential to replicate real-world trading firms' collaborative\ndynamics remains underexplored. TradingAgents proposes a novel stock trading\nframework inspired by trading firms, featuring LLM-powered agents in\nspecialized roles such as fundamental analysts, sentiment analysts, technical\nanalysts, and traders with varied risk profiles. The framework includes Bull\nand Bear researcher agents assessing market conditions, a risk management team\nmonitoring exposure, and traders synthesizing insights from debates and\nhistorical data to make informed decisions. By simulating a dynamic,\ncollaborative trading environment, this framework aims to improve trading\nperformance. 
Detailed architecture and extensive experiments reveal its\nsuperiority over baseline models, with notable improvements in cumulative\nreturns, Sharpe ratio, and maximum drawdown, highlighting the potential of\nmulti-agent LLM frameworks in financial trading. More details on TradingAgents\nare available at https://TradingAgents-AI.github.io.\n","authors":["Yijia Xiao","Edward Sun","Di Luo","Wei Wang"],"pdf_url":"https://arxiv.org/pdf/2412.20138v2.pdf","comment":"Multi-Agent AI in the Real World @ AAAI 2025"},{"id":"http://arxiv.org/abs/2411.10087v2","updated":"2025-01-09T16:22:42Z","published":"2024-11-15T10:16:38Z","title":"PFML: Self-Supervised Learning of Time-Series Data Without\n Representation Collapse","summary":" Self-supervised learning (SSL) is a data-driven learning approach that\nutilizes the innate structure of the data to guide the learning process. In\ncontrast to supervised learning, which depends on external labels, SSL utilizes\nthe inherent characteristics of the data to produce its own supervisory signal.\nHowever, one frequent issue with SSL methods is representation collapse, where\nthe model outputs a constant input-invariant feature representation. This issue\nhinders the potential application of SSL methods to new data modalities, as\ntrying to avoid representation collapse wastes researchers' time and effort.\nThis paper introduces a novel SSL algorithm for time-series data called\nPrediction of Functionals from Masked Latents (PFML). Instead of predicting\nmasked input signals or their latent representations directly, PFML operates by\npredicting statistical functionals of the input signal corresponding to masked\nembeddings, given a sequence of unmasked embeddings. The algorithm is designed\nto avoid representation collapse, rendering it straightforwardly applicable to\ndifferent time-series data domains, such as novel sensor modalities in clinical\ndata. 
We demonstrate the effectiveness of PFML through complex, real-life\nclassification tasks across three different data modalities: infant posture and\nmovement classification from multi-sensor inertial measurement unit data,\nemotion recognition from speech data, and sleep stage classification from EEG\ndata. The results show that PFML is superior to a conceptually similar SSL\nmethod and a contrastive learning-based SSL method. Additionally, PFML is on\npar with the current state-of-the-art SSL method, while also being conceptually\nsimpler and without suffering from representation collapse.\n","authors":["Einari Vaaras","Manu Airaksinen","Okko Räsänen"],"pdf_url":"https://arxiv.org/pdf/2411.10087v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05336v1","updated":"2025-01-09T16:02:51Z","published":"2025-01-09T16:02:51Z","title":"Stream Aligner: Efficient Sentence-Level Alignment via Distribution\n Induction","summary":" The rapid advancement of large language models (LLMs) has led to significant\nimprovements in their capabilities, but also to increased concerns about their\nalignment with human values and intentions. Current alignment strategies,\nincluding adaptive training and inference-time methods, have demonstrated\npotential in this area. However, these approaches still struggle to balance\ndeployment complexity and capability across various tasks and difficulties. In\nthis work, we introduce the Streaming Distribution Induce Aligner (Stream\nAligner), a novel alignment paradigm that combines efficiency with enhanced\nperformance in various tasks throughout the generation process. Stream Aligner\nachieves dynamic sentence-level correction by using a small model to learn the\npreferences of the suffix sentence, iteratively correcting the suffix sentence\noutput by the upstream model, and then using the corrected sentence to replace\nthe suffix sentence in subsequent generations. 
Compared to Aligner, our\nexperiments demonstrate that Stream Aligner reduces reliance on the\ncapabilities of additional models, enhances the reasoning abilities of LLMs,\nand decreases latency during user interaction. Specifically, Stream Aligner-2B\nmodel has achieved an improvement of 76.1% in helpfulness, 36.0% in\nharmlessness on the tested Llama2-70B-chat model, and Stream Aligner-8B has\nachieved an improvement of 3.5% on the math ability of the tested\nLlama3-70B-Instruct model.\n","authors":["Hantao Lou","Jiaming Ji","Kaile Wang","Yaodong Yang"],"pdf_url":"https://arxiv.org/pdf/2501.05336v1.pdf","comment":"AAAI Alignment Track 2025 Poster"},{"id":"http://arxiv.org/abs/2501.05334v1","updated":"2025-01-09T15:59:32Z","published":"2025-01-09T15:59:32Z","title":"The Bakers and Millers Game with Restricted Locations","summary":" We study strategic location choice by customers and sellers, termed the\nBakers and Millers Game in the literature. In our generalized setting, each\nmiller can freely choose any location for setting up a mill, while each baker\nis restricted in the choice of location for setting up a bakery. For optimal\nbargaining power, a baker would like to select a location with many millers to\nbuy flour from and with little competition from other bakers. Likewise, a\nmiller aims for a location with many bakers and few competing millers. Thus,\nboth types of agents choose locations to optimize the ratio of agents of\nopposite type divided by agents of the same type at their chosen location.\nOriginally raised in the context of Fractional Hedonic Games, the Bakers and\nMillers Game has applications that range from commerce to product design.\n We study the impact of location restrictions on the properties of the game.\nWhile pure Nash equilibria trivially exist in the setting without location\nrestrictions, we show via a sophisticated, efficient algorithm that even the\nmore challenging restricted setting admits equilibria. 
Moreover, the computed\nequilibrium approximates the optimal social welfare by a factor of at most\n$2\left(\frac{e}{e-1}\right)$. Furthermore, we give tight bounds on the price\nof anarchy/stability.\n On the conceptual side, the location choice feature adds a new layer to the\nstandard setting of Hedonic Games, in the sense that agents that select the\nsame location form a coalition. This allows us to naturally restrict the possible\ncoalitions that can be formed. With this, our model generalizes simple\nsymmetric Fractional Hedonic Games on complete bipartite valuation graphs and\nalso Hedonic Diversity Games with utilities single-peaked at 0. We believe that\nthis generalization is also a very interesting direction for other types of\nHedonic Games.\n","authors":["Simon Krogmann","Pascal Lenzner","Alexander Skopalik"],"pdf_url":"https://arxiv.org/pdf/2501.05334v1.pdf","comment":"To appear at the 24th International Conference on Autonomous Agents\n and Multiagent Systems (AAMAS 2025)"},{"id":"http://arxiv.org/abs/2501.05332v1","updated":"2025-01-09T15:58:37Z","published":"2025-01-09T15:58:37Z","title":"AnCoGen: Analysis, Control and Generation of Speech with a Masked\n Autoencoder","summary":" This article introduces AnCoGen, a novel method that leverages a masked\nautoencoder to unify the analysis, control, and generation of speech signals\nwithin a single model. AnCoGen can analyze speech by estimating key attributes,\nsuch as speaker identity, pitch, content, loudness, signal-to-noise ratio, and\nclarity index. In addition, it can generate speech from these attributes and\nallow precise control of the synthesized speech by modifying them. 
Extensive\nexperiments demonstrated the effectiveness of AnCoGen across speech\nanalysis-resynthesis, pitch estimation, pitch modification, and speech\nenhancement.\n","authors":["Samir Sadok","Simon Leglaive","Laurent Girin","Gaël Richard","Xavier Alameda-Pineda"],"pdf_url":"https://arxiv.org/pdf/2501.05332v1.pdf","comment":"5 pages, https://samsad35.github.io/site-ancogen"},{"id":"http://arxiv.org/abs/2302.08878v2","updated":"2025-01-09T15:35:59Z","published":"2023-02-17T13:50:53Z","title":"Less is More: The Influence of Pruning on the Explainability of CNNs","summary":" Modern, state-of-the-art Convolutional Neural Networks (CNNs) in computer\nvision have millions of parameters. Thus, explaining the complex decisions of\nsuch networks to humans is challenging. A technical approach to reduce CNN\ncomplexity is network pruning, where less important parameters are deleted. The\nwork presented in this paper investigates whether this technical complexity\nreduction also helps with perceived explainability. To do so, we conducted a\npre-study and two human-grounded experiments, assessing the effects of\ndifferent pruning ratios on CNN explainability. Overall, we evaluated four\ndifferent compression rates (i.e., CPR 2, 4, 8, and 32) with 37 500 tasks on\nMechanical Turk. Results indicate that lower compression rates have a positive\ninfluence on explainability, while higher compression rates show negative\neffects. 
Furthermore, we were able to identify sweet spots that increase both\nthe perceived explainability and the model's performance.\n","authors":["David Weber","Florian Merkle","Pascal Schöttle","Stephan Schlögl"],"pdf_url":"https://arxiv.org/pdf/2302.08878v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03145v2","updated":"2025-01-09T15:31:29Z","published":"2025-01-06T17:12:19Z","title":"Geometry Restoration and Dewarping of Camera-Captured Document Images","summary":" This research focuses on developing a method for restoring the topology of\ndigital images of paper documents captured by a camera, using algorithms for\ndetection, segmentation, geometry restoration, and dewarping. Our methodology\nemploys deep learning (DL) for document outline detection, followed by computer\nvision (CV) to create a topological 2D grid using cubic polynomial\ninterpolation and correct nonlinear distortions by remapping the image. Using\nclassical CV methods makes the document topology restoration process more\nefficient and faster, as it requires significantly fewer computational\nresources and memory. We developed a new pipeline for automatic document\ndewarping and reconstruction, along with a framework and annotated dataset to\ndemonstrate its efficiency. Our experiments confirm the promise of our\nmethodology and its superiority over existing benchmarks (including mobile apps\nand popular DL solutions, such as RectiNet, DocGeoNet, and DocTr++) both\nvisually and in terms of document readability via Optical Character Recognition\n(OCR) and geometry restoration metrics. This paves the way for creating\nhigh-quality digital copies of paper documents and enhancing the efficiency of\nOCR systems. 
Project page: https://github.com/HorizonParadox/DRCCBI\n","authors":["Valery Istomin","Oleg Pereziabov","Ilya Afanasyev"],"pdf_url":"https://arxiv.org/pdf/2501.03145v2.pdf","comment":"28 pages, 16 figures"},{"id":"http://arxiv.org/abs/2412.16378v2","updated":"2025-01-09T15:20:31Z","published":"2024-12-20T22:25:23Z","title":"REFA: Reference Free Alignment for multi-preference optimization","summary":" We introduce REFA, a family of reference-free alignment methods that optimize\nover multiple user preferences while enforcing fine-grained length control. Our\napproach integrates deviation-based weighting to emphasize high-quality\nresponses more strongly, length normalization to prevent trivial short-response\nsolutions, and an EOS-probability regularizer to mitigate dataset-induced\nbrevity biases. Theoretically, we show that under the Uncertainty Reduction\nwith Sequence Length Assertion (URSLA), naive length normalization can still\nincentivize length-based shortcuts. By contrast, REFA corrects these subtle\nincentives, guiding models toward genuinely more informative and higher-quality\noutputs. Empirically, REFA sets a new state-of-the-art among reference-free\nalignment methods, producing richer responses aligned more closely with human\npreferences. Compared to a base supervised fine-tuned (SFT) mistral-7b model\nthat achieves 8.4% length-controlled win rate (LC-WR) and 6.2% win rate (WR),\nour best REFA configuration attains 21.62% LC-WR and 19.87% WR on the\nAlpacaEval v2 benchmark. 
This represents a substantial improvement over both\nthe strongest multi-preference baseline, InfoNCA (16.82% LC-WR, 10.44% WR), and\nthe strongest reference-free baseline, SimPO (20.01% LC-WR, 17.65% WR)\n","authors":["Taneesh Gupta","Rahul Madhavan","Xuchao Zhang","Chetan Bansal","Saravan Rajmohan"],"pdf_url":"https://arxiv.org/pdf/2412.16378v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.16220v3","updated":"2025-01-09T14:55:29Z","published":"2024-12-18T10:56:40Z","title":"Cross-Attention Graph Neural Networks for Inferring Gene Regulatory\n Networks with Skewed Degree Distribution","summary":" Inferencing Gene Regulatory Networks (GRNs) from gene expression data is a\npivotal challenge in systems biology, and several innovative computational\nmethods have been introduced. However, most of these studies have not\nconsidered the skewed degree distribution of genes. Specifically, some genes\nmay regulate multiple target genes while some genes may be regulated by\nmultiple regulator genes. Such a skewed degree distribution issue significantly\ncomplicates the application of directed graph embedding methods. To tackle this\nissue, we propose the Cross-Attention Complex Dual Graph Embedding Model\n(XATGRN). Our XATGRN employs a cross-attention mechanism to effectively capture\nintricate gene interactions from gene expression profiles. Additionally, it\nuses a Dual Complex Graph Embedding approach to manage the skewed degree\ndistribution, thereby ensuring precise prediction of regulatory relationships\nand their directionality. Our model consistently outperforms existing\nstate-of-the-art methods across various datasets, underscoring its efficacy in\nelucidating complex gene regulatory mechanisms. 
Our code used in this paper\nis publicly available at: https://github.com/kikixiong/XATGRN.\n","authors":["Jiaqi Xiong","Nan Yin","Shiyang Liang","Haoyang Li","Yingxu Wang","Duo Ai","Fang Pan","Jingjie Wang"],"pdf_url":"https://arxiv.org/pdf/2412.16220v3.pdf","comment":"11 pages, 6 figures, 1 table"},{"id":"http://arxiv.org/abs/2501.01480v2","updated":"2025-01-09T14:52:13Z","published":"2025-01-02T15:09:00Z","title":"Drift2Matrix: Kernel-Induced Self Representation for Concept Drift\n Adaptation in Co-evolving Time Series","summary":" In the realm of time series analysis, tackling the phenomenon of concept\ndrift poses a significant challenge. Concept drift -- characterized by the\nevolving statistical properties of time series data -- affects the reliability\nand accuracy of conventional analysis models. This is particularly evident in\nco-evolving scenarios where interactions among variables are crucial. This\npaper presents Drift2Matrix, a novel framework that leverages kernel-induced\nself-representation for adaptive responses to concept drift in time series.\nDrift2Matrix employs a kernel-based learning mechanism to generate a\nrepresentation matrix, encapsulating the inherent dynamics of co-evolving time\nseries. This matrix serves as a key tool for identification and adaptation to\nconcept drift by observing its temporal variations. Furthermore, Drift2Matrix\neffectively identifies prevailing patterns and offers insights into emerging\ntrends through pattern evolution analysis. Our empirical evaluation of\nDrift2Matrix across various datasets demonstrates its effectiveness in handling\nthe complexities of concept drift. 
This approach introduces a novel perspective\nin the theoretical domain of co-evolving time series analysis, enhancing\nadaptability and accuracy in the face of dynamic data environments.\n","authors":["Kunpeng Xu","Lifei Chen","Shengrui Wang"],"pdf_url":"https://arxiv.org/pdf/2501.01480v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05278v1","updated":"2025-01-09T14:39:40Z","published":"2025-01-09T14:39:40Z","title":"Off-Policy Evaluation and Counterfactual Methods in Dynamic Auction\n Environments","summary":" Counterfactual estimators are critical for learning and refining policies\nusing logged data, a process known as Off-Policy Evaluation (OPE). OPE allows\nresearchers to assess new policies without costly experiments, speeding up the\nevaluation process. Online experimental methods, such as A/B tests, are\neffective but often slow, thus delaying the policy selection and optimization\nprocess.\n In this work, we explore the application of OPE methods in the context of\nresource allocation in dynamic auction environments. Given the competitive\nnature of environments where rapid decision-making is crucial for gaining a\ncompetitive edge, the ability to quickly and accurately assess algorithmic\nperformance is essential. By utilizing counterfactual estimators as a\npreliminary step before conducting A/B tests, we aim to streamline the\nevaluation process, reduce the time and resources required for experimentation,\nand enhance confidence in the chosen policies. Our investigation focuses on the\nfeasibility and effectiveness of using these estimators to predict the outcomes\nof potential resource allocation strategies, evaluate their performance, and\nfacilitate more informed decision-making in policy selection. 
Motivated by the\noutcomes of our initial study, we envision an advanced analytics system\ndesigned to seamlessly and dynamically assess new resource allocation\nstrategies and policies.\n","authors":["Ritam Guha","Nilavra Pathak"],"pdf_url":"https://arxiv.org/pdf/2501.05278v1.pdf","comment":"9 pages, 15 figures, IEEE format"},{"id":"http://arxiv.org/abs/2412.13426v2","updated":"2025-01-09T14:33:25Z","published":"2024-12-18T01:43:25Z","title":"Safeguarding System Prompts for LLMs","summary":" Large language models (LLMs) are increasingly utilized in applications where\nsystem prompts, which guide model outputs, play a crucial role. These prompts\noften contain business logic and sensitive information, making their protection\nessential. However, adversarial and even regular user queries can exploit LLM\nvulnerabilities to expose these hidden prompts. To address this issue, we\npropose PromptKeeper, a robust defense mechanism designed to safeguard system\nprompts. PromptKeeper tackles two core challenges: reliably detecting prompt\nleakage and mitigating side-channel vulnerabilities when leakage occurs. By\nframing detection as a hypothesis-testing problem, PromptKeeper effectively\nidentifies both explicit and subtle leakage. Upon detection, it regenerates\nresponses using a dummy prompt, ensuring that outputs remain indistinguishable\nfrom typical interactions when no leakage is present. 
PromptKeeper ensures\nrobust protection against prompt extraction attacks via either adversarial or\nregular queries, while preserving conversational capability and runtime\nefficiency during benign user interactions.\n","authors":["Zhifeng Jiang","Zhihua Jin","Guoliang He"],"pdf_url":"https://arxiv.org/pdf/2412.13426v2.pdf","comment":"15 pages, 5 figures, 2 tables"},{"id":"http://arxiv.org/abs/2501.05264v1","updated":"2025-01-09T14:19:33Z","published":"2025-01-09T14:19:33Z","title":"Towards Balanced Continual Multi-Modal Learning in Human Pose Estimation","summary":" 3D human pose estimation (3D HPE) has emerged as a prominent research topic,\nparticularly in the realm of RGB-based methods. However, RGB images are\nsusceptible to limitations such as sensitivity to lighting conditions and\npotential user discomfort. Consequently, multi-modal sensing, which leverages\nnon-intrusive sensors, is gaining increasing attention. Nevertheless,\nmulti-modal 3D HPE still faces challenges, including modality imbalance and the\nimperative for continual learning. In this work, we introduce a novel balanced\ncontinual multi-modal learning method for 3D HPE, which harnesses the power of\nRGB, LiDAR, mmWave, and WiFi. Specifically, we propose a Shapley value-based\ncontribution algorithm to quantify the contribution of each modality and\nidentify modality imbalance. To address this imbalance, we employ a re-learning\nstrategy. Furthermore, recognizing that raw data is prone to noise\ncontamination, we develop a novel denoising continual learning approach. This\napproach incorporates a noise identification and separation module to mitigate\nthe adverse effects of noise and collaborates with the balanced learning\nstrategy to enhance optimization. Additionally, an adaptive EWC mechanism is\nemployed to alleviate catastrophic forgetting. 
We conduct extensive experiments\non the widely-adopted multi-modal dataset, MM-Fi, which demonstrate the\nsuperiority of our approach in boosting 3D pose estimation and mitigating\ncatastrophic forgetting in complex scenarios. We will release our codes.\n","authors":["Jiaxuan Peng","Mengshi Qi","Dong Zhao","Huadong Ma"],"pdf_url":"https://arxiv.org/pdf/2501.05264v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05260v1","updated":"2025-01-09T14:14:18Z","published":"2025-01-09T14:14:18Z","title":"Enhancing Plagiarism Detection in Marathi with a Weighted Ensemble of\n TF-IDF and BERT Embeddings for Low-Resource Language Processing","summary":" Plagiarism involves using another person's work or concepts without proper\nattribution, presenting them as original creations. With the growing amount of\ndata communicated in regional languages such as Marathi -- one of India's\nregional languages -- it is crucial to design robust plagiarism detection\nsystems tailored for low-resource languages. Language models like Bidirectional\nEncoder Representations from Transformers (BERT) have demonstrated exceptional\ncapability in text representation and feature extraction, making them essential\ntools for semantic analysis and plagiarism detection. However, the application\nof BERT for low-resource languages remains under-explored, particularly in the\ncontext of plagiarism detection. This paper presents a method to enhance the\naccuracy of plagiarism detection for Marathi texts using BERT sentence\nembeddings in conjunction with Term Frequency-Inverse Document Frequency\n(TF-IDF) feature representation. 
This approach effectively captures\nstatistical, semantic, and syntactic aspects of text features through a\nweighted voting ensemble of machine learning models.\n","authors":["Atharva Mutsaddi","Aditya Choudhary"],"pdf_url":"https://arxiv.org/pdf/2501.05260v1.pdf","comment":"Accepted into LoResLM: The First Workshop on Language Models for\n Low-Resource Languages, colocated with COLING 2025 and set to be published\n into ACL Anthology"},{"id":"http://arxiv.org/abs/2501.05258v1","updated":"2025-01-09T14:13:39Z","published":"2025-01-09T14:13:39Z","title":"Automating the Detection of Code Vulnerabilities by Analyzing GitHub\n Issues","summary":" In today's digital landscape, the importance of timely and accurate\nvulnerability detection has significantly increased. This paper presents a\nnovel approach that leverages transformer-based models and machine learning\ntechniques to automate the identification of software vulnerabilities by\nanalyzing GitHub issues. We introduce a new dataset specifically designed for\nclassifying GitHub issues relevant to vulnerability detection. We then examine\nvarious classification techniques to determine their effectiveness. The results\ndemonstrate the potential of this approach for real-world application in early\nvulnerability detection, which could substantially reduce the window of\nexploitation for software vulnerabilities. This research makes a key\ncontribution to the field by providing a scalable and computationally efficient\nframework for automated detection, enabling the prevention of compromised\nsoftware usage before official notifications. 
This work has the potential to\nenhance the security of open-source software ecosystems.\n","authors":["Daniele Cipollone","Changjie Wang","Mariano Scazzariello","Simone Ferlin","Maliheh Izadi","Dejan Kostic","Marco Chiesa"],"pdf_url":"https://arxiv.org/pdf/2501.05258v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.16828v3","updated":"2025-01-09T14:10:38Z","published":"2024-09-25T11:29:26Z","title":"On the role of Artificial Intelligence methods in modern\n force-controlled manufacturing robotic tasks","summary":" This position paper explores the integration of Artificial Intelligence (AI)\ninto force-controlled robotic tasks within the scope of advanced manufacturing,\na cornerstone of Industry 4.0. AI's role in enhancing robotic manipulators -\nkey drivers in the Fourth Industrial Revolution - is rapidly leading to\nsignificant innovations in smart manufacturing. The objective of this article\nis to frame these innovations in practical force-controlled applications - e.g.\ndeburring, polishing, and assembly tasks like peg-in-hole (PiH) - highlighting\ntheir necessity for maintaining high-quality production standards. By reporting\non recent AI-based methodologies, this article contrasts them and identifies\ncurrent challenges to be addressed in future research. The analysis concludes\nwith a perspective on future research directions, emphasizing the need for\ncommon performance metrics to validate AI techniques, integration of various\nenhancements for performance optimization, and the importance of validating\nthem in relevant scenarios. 
These future directions aim to provide consistency\nwith already adopted approaches, so as to be compatible with manufacturing\nstandards, increasing the relevance of AI-driven methods in both academic and\nindustrial contexts.\n","authors":["Vincenzo Petrone","Enrico Ferrentino","Pasquale Chiacchio"],"pdf_url":"https://arxiv.org/pdf/2409.16828v3.pdf","comment":"In Proceedings of the 21st International Conference on Informatics in\n Control, Automation and Robotics - Volume 1: ICINCO, 392-399, 2024 , Porto,\n Portugal"},{"id":"http://arxiv.org/abs/2410.05838v2","updated":"2025-01-09T14:04:01Z","published":"2024-10-08T09:06:34Z","title":"Time Transfer: On Optimal Learning Rate and Batch Size In The Infinite\n Data Limit","summary":" One of the main challenges in optimal scaling of large language models (LLMs)\nis the prohibitive cost of hyperparameter tuning, particularly learning rate\n$\\eta$ and batch size $B$. While techniques like $\\mu$P (Yang et al., 2022)\nprovide scaling rules for optimal $\\eta$ transfer in the infinite model size\nlimit, the optimal scaling behavior in the infinite data size limit remains\nunknown. We fill in this gap by observing for the first time an intricate\ndependence of optimal $\\eta$ scaling on the pretraining token budget $T$, $B$\nand its relation to the critical batch size $B_\\mathrm{crit}$, which we measure\nto evolve as $B_\\mathrm{crit} \\propto T$. Furthermore, we show that the optimal\nbatch size is positively correlated with $B_\\mathrm{crit}$: keeping it fixed\nbecomes suboptimal over time even if learning rate is scaled optimally.\nSurprisingly, our results demonstrate that the observed optimal $\\eta$ and $B$\ndynamics are preserved with $\\mu$P model scaling, challenging the conventional\nview of $B_\\mathrm{crit}$ dependence solely on loss value. 
Complementing\noptimality, we examine the sensitivity of loss to changes in learning rate,\nwhere we find the sensitivity to decrease with increase of $T$ and to remain\nconstant with $\\mu$P model scaling. We hope our results make the first step\ntowards a unified picture of the joint optimal data and model scaling.\n","authors":["Oleg Filatov","Jan Ebert","Jiangtao Wang","Stefan Kesselheim"],"pdf_url":"https://arxiv.org/pdf/2410.05838v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05252v1","updated":"2025-01-09T14:03:35Z","published":"2025-01-09T14:03:35Z","title":"From Scientific Texts to Verifiable Code: Automating the Process with\n Transformers","summary":" Despite the vast body of research literature proposing algorithms with formal\nguarantees, the amount of verifiable code in today's systems remains minimal.\nThis discrepancy stems from the inherent difficulty of verifying code,\nparticularly due to the time-consuming nature and strict formalism of proof\ndetails that formal verification tools require. However, the emergence of\ntransformers in Large Language Models presents a promising solution to this\nchallenge. In this position paper, we believe that transformers have the\npotential to read research papers that propose algorithms with formal proofs\nand translate these proofs into verifiable code. We leverage transformers to\nfirst build a formal structure of the proof using the original text from the\npaper, and then to handle the tedious, low-level aspects of proofs that are\noften omitted by humans. We argue that this approach can significantly reduce\nthe barrier to formal verification. 
The above idea of reading papers to write\nverifiable code opens new avenues for automating the verification of complex\nsystems, enabling a future where formally verified algorithms from academic\nresearch can more seamlessly transition into real-world software systems,\nthereby improving code reliability and security.\n","authors":["Changjie Wang","Mariano Scazzariello","Marco Chiesa"],"pdf_url":"https://arxiv.org/pdf/2501.05252v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05249v1","updated":"2025-01-09T14:01:15Z","published":"2025-01-09T14:01:15Z","title":"RAG-WM: An Efficient Black-Box Watermarking Approach for\n Retrieval-Augmented Generation of Large Language Models","summary":" In recent years, tremendous success has been witnessed in Retrieval-Augmented\nGeneration (RAG), widely used to enhance Large Language Models (LLMs) in\ndomain-specific, knowledge-intensive, and privacy-sensitive tasks. However,\nattackers may steal those valuable RAGs and deploy or commercialize them,\nmaking it essential to detect Intellectual Property (IP) infringement. Most\nexisting ownership protection solutions, such as watermarks, are designed for\nrelational databases and texts. They cannot be directly applied to RAGs because\nrelational database watermarks require white-box access to detect IP\ninfringement, which is unrealistic for the knowledge base in RAGs. Meanwhile,\npost-processing by the adversary's deployed LLMs typically destructs text\nwatermark information. To address those problems, we propose a novel black-box\n\"knowledge watermark\" approach, named RAG-WM, to detect IP infringement of\nRAGs. RAG-WM uses a multi-LLM interaction framework, comprising a Watermark\nGenerator, Shadow LLM & RAG, and Watermark Discriminator, to create watermark\ntexts based on watermark entity-relationship tuples and inject them into the\ntarget RAG. We evaluate RAG-WM across three domain-specific and two\nprivacy-sensitive tasks on four benchmark LLMs. 
Experimental results show that\nRAG-WM effectively detects the stolen RAGs in various deployed LLMs.\nFurthermore, RAG-WM is robust against paraphrasing, unrelated content removal,\nknowledge insertion, and knowledge expansion attacks. Lastly, RAG-WM can also\nevade watermark detection approaches, highlighting its promising application in\ndetecting IP infringement of RAG systems.\n","authors":["Peizhuo Lv","Mengjie Sun","Hao Wang","Xiaofeng Wang","Shengzhi Zhang","Yuxuan Chen","Kai Chen","Limin Sun"],"pdf_url":"https://arxiv.org/pdf/2501.05249v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05248v1","updated":"2025-01-09T14:00:01Z","published":"2025-01-09T14:00:01Z","title":"Deriving Coding-Specific Sub-Models from LLMs using Resource-Efficient\n Pruning","summary":" Large Language Models (LLMs) have demonstrated their exceptional performance\nin various complex code generation tasks. However, their broader adoption is\nlimited by significant computational demands and high resource requirements,\nparticularly memory and processing power. To mitigate such requirements, model\npruning techniques are used to create more compact models with significantly\nfewer parameters. However, current approaches do not focus on the efficient\nextraction of programming-language-specific sub-models. In this work, we\nexplore the idea of efficiently deriving coding-specific sub-models through\nunstructured pruning (i.e., Wanda). We investigate the impact of different\ndomain-specific calibration datasets on pruning outcomes across three distinct\ndomains and extend our analysis to extracting four language-specific\nsub-models: Python, Java, C++, and JavaScript. We are the first to efficiently\nextract programming-language-specific sub-models using appropriate calibration\ndatasets while maintaining acceptable accuracy w.r.t. full models. 
We are also\nthe first to provide analytical evidence that domain-specific tasks activate\ndistinct regions within LLMs, supporting the creation of specialized sub-models\nthrough unstructured pruning. We believe that this work has significant\npotential to enhance LLM accessibility for coding by reducing computational\nrequirements to enable local execution on consumer-grade hardware, and\nsupporting faster inference times critical for real-time development feedback.\n","authors":["Laura Puccioni","Alireza Farshin","Mariano Scazzariello","Changjie Wang","Marco Chiesa","Dejan Kostic"],"pdf_url":"https://arxiv.org/pdf/2501.05248v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05247v1","updated":"2025-01-09T13:57:09Z","published":"2025-01-09T13:57:09Z","title":"Online Prompt and Solver Selection for Program Synthesis","summary":" Large Language Models (LLMs) demonstrate impressive capabilities in the\ndomain of program synthesis. This level of performance is not, however,\nuniversal across all tasks, all LLMs and all prompting styles. There are many\nareas where one LLM dominates, one prompting style dominates, or where calling\na symbolic solver is a better choice than an LLM. A key challenge for the user\nthen, is to identify not only when an LLM is the right choice of solver, and\nthe appropriate LLM to call for a given synthesis task, but also the right way\nto call it. A non-expert user who makes the wrong choice, incurs a cost both in\nterms of results (number of tasks solved, and the time it takes to solve them)\nand financial cost, if using a closed-source language model via a commercial\nAPI. We frame this choice as an online learning problem. 
We use a multi-armed\nbandit algorithm to select which symbolic solver, or LLM and prompt combination\nto deploy in order to maximize a given reward function (which may prioritize\nsolving time, number of synthesis tasks solved, or financial cost of solving).\nWe implement an instance of this approach, called CYANEA, and evaluate it on\nsynthesis queries from the literature in ranking function synthesis, from the\nsyntax-guided synthesis competition, and fresh, unseen queries generated from\nSMT problems. CYANEA solves 37.2\\% more queries than the best single solver and\nachieves results within 4\\% of the virtual best solver.\n","authors":["Yixuan Li","Lewis Frampton","Federico Mora","Elizabeth Polgreen"],"pdf_url":"https://arxiv.org/pdf/2501.05247v1.pdf","comment":"Accepted at the 39th AAAI Conference on Artificial Intelligence\n (AAAI-25) Main Track"},{"id":"http://arxiv.org/abs/2411.06928v2","updated":"2025-01-09T13:56:49Z","published":"2024-11-11T12:32:26Z","title":"Multi-class Decoding of Attended Speaker Direction Using\n Electroencephalogram and Audio Spatial Spectrum","summary":" Decoding the directional focus of an attended speaker from listeners'\nelectroencephalogram (EEG) signals is essential for developing brain-computer\ninterfaces to improve the quality of life for individuals with hearing\nimpairment. Previous works have concentrated on binary directional focus\ndecoding, i.e., determining whether the attended speaker is on the left or\nright side of the listener. However, a more precise decoding of the exact\ndirection of the attended speaker is necessary for effective speech processing.\nAdditionally, audio spatial information has not been effectively leveraged,\nresulting in suboptimal decoding results. 
In this paper, it is found that on\nthe recently presented dataset with 14-class directional focus, models relying\nexclusively on EEG inputs exhibit significantly lower accuracy when decoding\nthe directional focus in both leave-one-subject-out and leave-one-trial-out\nscenarios. By integrating audio spatial spectra with EEG features, the decoding\naccuracy can be effectively improved. The CNN, LSM-CNN, and Deformer models are\nemployed to decode the directional focus from listeners' EEG signals and audio\nspatial spectra. The proposed Sp-EEG-Deformer model achieves notable 14-class\ndecoding accuracies of 55.35% and 57.19% in leave-one-subject-out and\nleave-one-trial-out scenarios with a decision window of 1 second, respectively.\nExperimental results indicate increased decoding accuracy as the number of\nalternative directions decreases. These findings suggest the efficacy of our\nproposed dual-modal directional focus decoding strategy.\n","authors":["Yuanming Zhang","Jing Lu","Fei Chen","Haoliang Du","Xia Gao","Zhibin Lin"],"pdf_url":"https://arxiv.org/pdf/2411.06928v2.pdf","comment":"Submitted to IEEE TNSRE"},{"id":"http://arxiv.org/abs/2501.05234v1","updated":"2025-01-09T13:41:37Z","published":"2025-01-09T13:41:37Z","title":"Optimizing Estonian TV Subtitles with Semi-supervised Learning and LLMs","summary":" This paper presents an approach for generating high-quality, same-language\nsubtitles for Estonian TV content. We fine-tune the Whisper model on\nhuman-generated Estonian subtitles and enhance it with iterative\npseudo-labeling and large language model (LLM) based post-editing. Our\nexperiments demonstrate notable subtitle quality improvement through\npseudo-labeling with an unlabeled dataset. We find that applying LLM-based\nediting at test time enhances subtitle accuracy, while its use during training\ndoes not yield further gains. 
This approach holds promise for creating subtitle\nquality close to the human standard and could be extended to real-time\napplications.\n","authors":["Artem Fedorchenko","Tanel Alumäe"],"pdf_url":"https://arxiv.org/pdf/2501.05234v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.15879v2","updated":"2025-01-09T13:27:29Z","published":"2024-07-20T10:45:06Z","title":"Decentralized Federated Anomaly Detection in Smart Grids: A P2P Gossip\n Approach","summary":" The increasing security and privacy concerns in the Smart Grid sector have\nled to a significant demand for robust intrusion detection systems within\ncritical smart grid infrastructure. To address the challenges posed by privacy\npreservation and decentralized power system zones with distinct data ownership,\nFederated Learning (FL) has emerged as a promising privacy-preserving solution\nwhich facilitates collaborative training of attack detection models without\nnecessitating the sharing of raw data. However, FL presents several\nimplementation limitations in the power system domain due to its heavy reliance\non a centralized aggregator and the risks of privacy leakage during model\nupdate transmission. To overcome these technical bottlenecks, this paper\nintroduces a novel decentralized federated anomaly detection scheme based on\ntwo main gossip protocols, namely Random Walk and Epidemic. Our findings\nindicate that the Random Walk protocol exhibits superior performance compared\nto the Epidemic protocol, highlighting its efficacy in decentralized federated\nlearning environments. Experimental validation of the proposed framework\nutilizing publicly available industrial control systems datasets demonstrates\nsuperior attack detection accuracy while safeguarding data confidentiality and\nmitigating the impact of communication latency and stragglers. 
Furthermore, our\napproach yields a notable 35% improvement in training time compared to\nconventional FL, underscoring the efficacy and robustness of our decentralized\nlearning method.\n","authors":["Muhammad Akbar Husnoo","Adnan Anwar","Md Enamul Haque","A. N. Mahmood"],"pdf_url":"https://arxiv.org/pdf/2407.15879v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05220v1","updated":"2025-01-09T13:13:24Z","published":"2025-01-09T13:13:24Z","title":"A Novel Approach to Scalable and Automatic Topic-Controlled Question\n Generation in Education","summary":" The development of Automatic Question Generation (QG) models has the\npotential to significantly improve educational practices by reducing the\nteacher workload associated with creating educational content. This paper\nintroduces a novel approach to educational question generation that controls\nthe topical focus of questions. The proposed Topic-Controlled Question\nGeneration (T-CQG) method enhances the relevance and effectiveness of the\ngenerated content for educational purposes. Our approach uses fine-tuning on a\npre-trained T5-small model, employing specially created datasets tailored to\neducational needs. The research further explores the impacts of pre-training\nstrategies, quantisation, and data augmentation on the model's performance. We\nspecifically address the challenge of generating semantically aligned questions\nwith paragraph-level contexts, thereby improving the topic specificity of the\ngenerated questions. In addition, we introduce and explore novel evaluation\nmethods to assess the topical relatedness of the generated questions. Our\nresults, validated through rigorous offline and human-backed evaluations,\ndemonstrate that the proposed models effectively generate high-quality,\ntopic-focused questions. These models have the potential to reduce teacher\nworkload and support personalised tutoring systems by serving as bespoke\nquestion generators. 
With their relatively small number of parameters, the\nproposed models not only advance the capabilities of question generation models\nfor handling specific educational topics but also offer a scalable solution\nthat reduces infrastructure costs. This scalability makes them feasible for\nwidespread use in education without reliance on proprietary large language\nmodels like ChatGPT.\n","authors":["Ziqing Li","Mutlu Cukurova","Sahan Bulathwela"],"pdf_url":"https://arxiv.org/pdf/2501.05220v1.pdf","comment":"To be published at ACM Conf. on Learning Analytics and Knowledge\n (LAK'25)"},{"id":"http://arxiv.org/abs/2501.05213v1","updated":"2025-01-09T13:06:47Z","published":"2025-01-09T13:06:47Z","title":"GLaM-Sign: Greek Language Multimodal Lip Reading with Integrated Sign\n Language Accessibility","summary":" The Greek Language Multimodal Lip Reading with Integrated Sign Language\nAccessibility (GLaM-Sign) [1] is a groundbreaking resource in accessibility and\nmultimodal AI, designed to support Deaf and Hard-of-Hearing (DHH) individuals.\nDeveloped from the FEELIT project [2], it integrates high-resolution audio,\nvideo, textual transcriptions, and Greek Sign Language translations for\napplications like real-time sign language translation and enhanced subtitle\nsynchronization. While its primary focus is on promoting inclusivity in the\nGreek tourism sector, its adaptability extends to education, healthcare, and\npublic services. Future advancements will enhance word-level precision and\nscalability to additional languages, supported by advanced AI methodologies and\ncollaborations with diverse stakeholders. 
This dataset underscores the\ntransformative potential of multimodal resources in bridging communication\ngaps, fostering innovation, and setting a benchmark for ethical AI and\ninclusive technologies.\n","authors":["Dimitris Kouremenos","Klimis Ntalianis"],"pdf_url":"https://arxiv.org/pdf/2501.05213v1.pdf","comment":"9 pages, 4 figures"},{"id":"http://arxiv.org/abs/2501.05205v1","updated":"2025-01-09T12:55:55Z","published":"2025-01-09T12:55:55Z","title":"Discovering Hidden Visual Concepts Beyond Linguistic Input in Infant\n Learning","summary":" Infants develop complex visual understanding rapidly, even preceding the\nacquisition of linguistic inputs. As computer vision seeks to replicate the\nhuman vision system, understanding infant visual development may offer valuable\ninsights. In this paper, we present an interdisciplinary study exploring this\nquestion: can a computational model that imitates the infant learning process\ndevelop broader visual concepts that extend beyond the vocabulary it has heard,\nsimilar to how infants naturally learn? To investigate this, we analyze a\nrecently published model in Science by Vong et al., which is trained on\nlongitudinal, egocentric images of a single child paired with transcribed\nparental speech. We introduce a training-free framework that can discover\nvisual concept neurons hidden in the model's internal representations. Our\nfindings show that these neurons can classify objects outside the model's\noriginal vocabulary. Furthermore, we compare the visual representations in\ninfant-like models with those in modern computer vision models, such as CLIP or\nImageNet pre-trained models, highlighting key similarities and differences. 
Ultimately,\nour work bridges cognitive science and computer vision by analyzing the\ninternal representations of a computational model trained on an infant's visual\nand linguistic inputs.\n","authors":["Xueyi Ke","Satoshi Tsutsui","Yayun Zhang","Bihan Wen"],"pdf_url":"https://arxiv.org/pdf/2501.05205v1.pdf","comment":"12 pages, 11 figures"},{"id":"http://arxiv.org/abs/2501.05197v1","updated":"2025-01-09T12:48:15Z","published":"2025-01-09T12:48:15Z","title":"An Algorithmic Approach for Causal Health Equity: A Look at Race\n Differentials in Intensive Care Unit (ICU) Outcomes","summary":" The new era of large-scale data collection and analysis presents an\nopportunity for diagnosing and understanding the causes of health inequities.\nIn this study, we describe a framework for systematically analyzing health\ndisparities using causal inference. The framework is illustrated by\ninvestigating racial and ethnic disparities in intensive care unit (ICU)\noutcome between majority and minority groups in Australia (Indigenous vs.\nNon-Indigenous) and the United States (African-American vs. White). We\ndemonstrate that commonly used statistical measures for quantifying inequity\nare insufficient, and focus on attributing the observed disparity to the causal\nmechanisms that generate it. We find that minority patients are younger at\nadmission, have worse chronic health, are more likely to be admitted for urgent\nand non-elective reasons, and have higher illness severity. At the same time,\nhowever, we find a protective direct effect of belonging to a minority group,\nwith minority patients showing improved survival compared to their majority\ncounterparts, with all other variables kept equal. We demonstrate that this\nprotective effect is related to the increased probability of being admitted to\nICU, with minority patients having an increased risk of ICU admission. 
We also\nfind that minority patients, while showing improved survival, are more likely\nto be readmitted to ICU. Thus, due to worse access to primary health care,\nminority patients are more likely to end up in ICU for preventable conditions,\ncausing a reduction in the mortality rates and creating an effect that appears\nto be protective. Since the baseline risk of ICU admission may serve as proxy\nfor lack of access to primary care, we developed the Indigenous Intensive Care\nEquity (IICE) Radar, a monitoring system for tracking the over-utilization of\nICU resources by the Indigenous population of Australia across geographical\nareas.\n","authors":["Drago Plecko","Paul Secombe","Andrea Clarke","Amelia Fiske","Samarra Toby","Donisha Duff","David Pilcher","Leo Anthony Celi","Rinaldo Bellomo","Elias Bareinboim"],"pdf_url":"https://arxiv.org/pdf/2501.05197v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09094v2","updated":"2025-01-09T12:38:37Z","published":"2024-12-12T09:22:04Z","title":"Filter-then-Generate: Large Language Models with Structure-Text Adapter\n for Knowledge Graph Completion","summary":" Large Language Models (LLMs) present massive inherent knowledge and superior\nsemantic comprehension capability, which have revolutionized various tasks in\nnatural language processing. Despite their success, a critical gap remains in\nenabling LLMs to perform knowledge graph completion (KGC). Empirical evidence\nsuggests that LLMs consistently perform worse than conventional KGC approaches,\neven through sophisticated prompt design or tailored instruction-tuning.\nFundamentally, applying LLMs on KGC introduces several critical challenges,\nincluding a vast set of entity candidates, hallucination issue of LLMs, and\nunder-exploitation of the graph structure. To address these challenges, we\npropose a novel instruction-tuning-based method, namely FtG. 
Specifically, we\npresent a \textit{filter-then-generate} paradigm and formulate the KGC task\nas a multiple-choice question format. In this way, we can harness the\ncapability of LLMs while mitigating the issues caused by hallucinations.\nMoreover, we devise a flexible ego-graph serialization prompt and employ a\nstructure-text adapter to couple structure and text information in a\ncontextualized manner. Experimental results demonstrate that FtG achieves\nsubstantial performance gains compared to existing state-of-the-art methods. The\ninstruction dataset and code are available at\n\url{https://github.com/LB0828/FtG}.\n","authors":["Ben Liu","Jihai Zhang","Fangquan Lin","Cheng Yang","Min Peng"],"pdf_url":"https://arxiv.org/pdf/2412.09094v2.pdf","comment":"COLING 2025 Main Conference"},{"id":"http://arxiv.org/abs/2412.11120v2","updated":"2025-01-09T11:39:32Z","published":"2024-12-15T08:51:14Z","title":"Latent Reward: LLM-Empowered Credit Assignment in Episodic Reinforcement\n Learning","summary":" Reinforcement learning (RL) often encounters delayed and sparse feedback in\nreal-world applications, even with only episodic rewards. Previous approaches\nhave made some progress in reward redistribution for credit assignment but\nstill face challenges, including training difficulties due to redundancy and\nambiguous attributions stemming from overlooking the multifaceted nature of\nmission performance evaluation. Promisingly, Large Language Models (LLMs)\nencompass fruitful decision-making knowledge and provide a plausible tool\nfor reward redistribution. Even so, deploying an LLM in this case is non-trivial\ndue to the misalignment between linguistic knowledge and the symbolic form\nrequirement, together with inherent randomness and hallucinations in inference.\nTo tackle these issues, we introduce LaRe, a novel LLM-empowered symbolic-based\ndecision-making framework, to improve credit assignment. 
Key to LaRe is the\nconcept of the Latent Reward, which works as a multi-dimensional performance\nevaluation, enabling more interpretable goal attainment from various\nperspectives and facilitating more effective reward redistribution. We verify\nthat semantically generated code from the LLM can bridge linguistic knowledge and\nsymbolic latent rewards, as it is executable for symbolic objects. Meanwhile,\nwe design latent reward self-verification to increase the stability and\nreliability of LLM inference. Theoretically, reward-irrelevant redundancy\nelimination in the latent reward benefits RL performance from more accurate\nreward estimation. Extensive experimental results show that LaRe (i)\nachieves superior temporal credit assignment to SOTA methods, (ii) excels in\nallocating contributions among multiple agents, and (iii) outperforms policies\ntrained with ground truth rewards for certain tasks.\n","authors":["Yun Qu","Yuhang Jiang","Boyuan Wang","Yixiu Mao","Cheems Wang","Chang Liu","Xiangyang Ji"],"pdf_url":"https://arxiv.org/pdf/2412.11120v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05165v1","updated":"2025-01-09T11:38:58Z","published":"2025-01-09T11:38:58Z","title":"Bringing Order Amidst Chaos: On the Role of Artificial Intelligence in\n Secure Software Engineering","summary":" Context. Developing secure and reliable software remains a key challenge in\nsoftware engineering (SE). The ever-evolving technological landscape offers\nboth opportunities and threats, creating a dynamic space where chaos and order\ncompete. Secure software engineering (SSE) must continuously address\nvulnerabilities that endanger software systems and carry broader socio-economic\nrisks, such as compromising critical national infrastructure and causing\nsignificant financial losses. 
Researchers and practitioners have explored\nmethodologies like Static Application Security Testing Tools (SASTTs) and\nartificial intelligence (AI) approaches, including machine learning (ML) and\nlarge language models (LLMs), to detect and mitigate these vulnerabilities.\nEach method has unique strengths and limitations.\n Aim. This thesis seeks to bring order to the chaos in SSE by addressing\ndomain-specific differences that impact AI accuracy.\n Methodology. The research employs a mix of empirical strategies, such as\nevaluating effort-aware metrics, analyzing SASTTs, conducting method-level\nanalysis, and leveraging evidence-based techniques like systematic dataset\nreviews. These approaches help characterize vulnerability prediction datasets.\n Results. Key findings include limitations in static analysis tools for\nidentifying vulnerabilities, gaps in SASTT coverage of vulnerability types,\nweak relationships among vulnerability severity scores, improved defect\nprediction accuracy using just-in-time modeling, and threats posed by untouched\nmethods.\n Conclusions. This thesis highlights the complexity of SSE and the importance\nof contextual knowledge in improving AI-driven vulnerability and defect\nprediction. The comprehensive analysis advances effective prediction models,\nbenefiting both researchers and practitioners.\n","authors":["Matteo Esposito"],"pdf_url":"https://arxiv.org/pdf/2501.05165v1.pdf","comment":"PhD thesis"},{"id":"http://arxiv.org/abs/2501.05163v1","updated":"2025-01-09T11:36:29Z","published":"2025-01-09T11:36:29Z","title":"Explainable AI based System for Supply Air Temperature Forecast","summary":" This paper explores the application of Explainable AI (XAI) techniques to\nimprove the transparency and understanding of predictive models in control of\nautomated supply air temperature (ASAT) of Air Handling Unit (AHU). 
The study\nfocuses on forecasting ASAT using a linear regression with Huber loss.\nHowever, having only a control curve without semantic and/or physical\nexplanation is often not enough. The present study employs one of the XAI\nmethods: Shapley values, which allows us to reveal the reasoning and highlight\nthe contribution of each feature to the final ASAT forecast. In comparison to\nother XAI methods, Shapley values have a solid mathematical background,\nresulting in interpretation transparency. The study demonstrates contrastive\nexplanations (slices) for each control value of ASAT, making it possible\nto give the client objective justifications for curve changes.\n","authors":["Marika Eik","Ahmet Kose","Hossein Nourollahi Hokmabad","Juri Belikov"],"pdf_url":"https://arxiv.org/pdf/2501.05163v1.pdf","comment":"5 pages, 7 figures, 1 table, conference paper"},{"id":"http://arxiv.org/abs/2409.00717v3","updated":"2025-01-09T11:24:44Z","published":"2024-09-01T13:14:41Z","title":"Preference-Based Multi-Agent Reinforcement Learning: Data Coverage and\n Algorithmic Techniques","summary":" We initiate the study of Preference-Based Multi-Agent Reinforcement Learning\n(PbMARL), exploring both theoretical foundations and empirical validations. We\ndefine the task as identifying the Nash equilibrium from a preference-only\noffline dataset in general-sum games, a problem marked by the challenge of\nsparse feedback signals. Our theory establishes the upper complexity bounds for\nNash Equilibrium in effective PbMARL, demonstrating that single-policy coverage\nis inadequate and highlighting the importance of unilateral dataset coverage.\nThese theoretical insights are verified through comprehensive experiments. To\nenhance the practical performance, we further introduce two algorithmic\ntechniques. (1) We propose a Mean Squared Error (MSE) regularization along the\ntime axis to achieve a more uniform reward distribution and improve reward\nlearning outcomes. 
(2) We propose an additional penalty based on the\ndistribution of the dataset to incorporate pessimism, improving stability and\neffectiveness during training. Our findings underscore the multifaceted\napproach required for PbMARL, paving the way for effective preference-based\nmulti-agent systems.\n","authors":["Natalia Zhang","Xinqi Wang","Qiwen Cui","Runlong Zhou","Sham M. Kakade","Simon S. Du"],"pdf_url":"https://arxiv.org/pdf/2409.00717v3.pdf","comment":"9 pages"},{"id":"http://arxiv.org/abs/2501.05155v1","updated":"2025-01-09T11:19:40Z","published":"2025-01-09T11:19:40Z","title":"Biomedical Relation Extraction via Adaptive Document-Relation\n Cross-Mapping and Concept Unique Identifier","summary":" Document-Level Biomedical Relation Extraction (Bio-RE) aims to identify\nrelations between biomedical entities within extensive texts, serving as a\ncrucial subfield of biomedical text mining. Existing Bio-RE methods struggle\nwith cross-sentence inference, which is essential for capturing relations\nspanning multiple sentences. Moreover, previous methods often overlook the\nincompleteness of documents and lack the integration of external knowledge,\nlimiting contextual richness. Besides, the scarcity of annotated data further\nhampers model training. Recent advancements in large language models (LLMs)\nhave inspired us to explore all the above issues for document-level Bio-RE.\nSpecifically, we propose a document-level Bio-RE framework via LLM Adaptive\nDocument-Relation Cross-Mapping (ADRCM) Fine-Tuning and Concept Unique\nIdentifier (CUI) Retrieval-Augmented Generation (RAG). First, we introduce the\nIteration-of-REsummary (IoRs) prompt for solving the data scarcity issue. 
In\nthis way, Bio-RE task-specific synthetic data can be generated by guiding\nChatGPT to focus on entity relations and iteratively refining synthetic data.\nNext, we propose ADRCM fine-tuning, a novel fine-tuning recipe that establishes\nmappings across different documents and relations, enhancing the model's\ncontextual understanding and cross-sentence inference capabilities. Finally,\nduring inference, a biomedical-specific RAG approach, named CUI RAG, is\ndesigned to leverage CUIs as indexes for entities, narrowing the retrieval\nscope and enriching the relevant document contexts. Experiments conducted on\nthree Bio-RE datasets (GDA, CDR, and BioRED) demonstrate the state-of-the-art\nperformance of our proposed method by comparing it with other related works.\n","authors":["Yufei Shang","Yanrong Guo","Shijie Hao","Richang Hong"],"pdf_url":"https://arxiv.org/pdf/2501.05155v1.pdf","comment":"13 pages, 6 figures"},{"id":"http://arxiv.org/abs/2501.02648v2","updated":"2025-01-09T11:17:01Z","published":"2025-01-05T20:26:49Z","title":"Representation Learning of Lab Values via Masked AutoEncoder","summary":" Accurate imputation of missing laboratory values in electronic health records\n(EHRs) is critical to enable robust clinical predictions and reduce biases in\nAI systems in healthcare. Existing methods, such as variational autoencoders\n(VAEs) and decision tree-based approaches such as XGBoost, struggle to model\nthe complex temporal and contextual dependencies in EHR data, mainly in\nunderrepresented groups. In this work, we propose Lab-MAE, a novel\ntransformer-based masked autoencoder framework that leverages self-supervised\nlearning for the imputation of continuous sequential lab values. Lab-MAE\nintroduces a structured encoding scheme that jointly models laboratory test\nvalues and their corresponding timestamps, enabling explicit capture of\ntemporal dependencies. 
Empirical evaluation on the MIMIC-IV dataset demonstrates that\nLab-MAE significantly outperforms the state-of-the-art baselines such as\nXGBoost across multiple metrics, including root mean square error (RMSE),\nR-squared (R2), and Wasserstein distance (WD). Notably, Lab-MAE achieves\nequitable performance across demographic groups of patients, advancing fairness\nin clinical predictions. We further investigate the role of follow-up\nlaboratory values as potential shortcut features, revealing Lab-MAE's\nrobustness in scenarios where such data is unavailable. The findings suggest\nthat our transformer-based architecture, adapted to the characteristics of the\nEHR data, offers a foundation model for more accurate and fair clinical\nimputation models. In addition, we measure and compare the carbon footprint of\nLab-MAE with the baseline XGBoost model, highlighting its environmental\nrequirements.\n","authors":["David Restrepo","Chenwei Wu","Yueran Jia","Jaden K. Sun","Jack Gallifant","Catherine G. Bielick","Yugang Jia","Leo A. Celi"],"pdf_url":"https://arxiv.org/pdf/2501.02648v2.pdf","comment":"10 pages main text, 8 appendix"},{"id":"http://arxiv.org/abs/2411.07066v2","updated":"2025-01-09T11:11:37Z","published":"2024-11-11T15:30:16Z","title":"Zeroth-Order Adaptive Neuron Alignment Based Pruning without Re-Training","summary":" Network pruning focuses on computational techniques that aim to reduce a\ngiven model's computational cost by removing a subset of its parameters while\nhaving minimal impact on performance. Throughout the last decade, the most\nwidely used pruning paradigm has been pruning and re-training, which nowadays\nis inconvenient due to the vast amount of pre-trained models, which are in any\ncase too expensive to re-train. In this paper, we exploit functional\ninformation from dense pre-trained models, i.e., their activations, to obtain\nsparse models that maximize the activations' alignment w.r.t. their\ncorresponding dense models. 
Hence, we propose \\textsc{NeuroAL}, a \\emph{top-up}\nalgorithm that can be used on top of any given pruning algorithm for LLMs,\nwhich modifies the block-wise and row-wise sparsity exploiting information from\nboth the dense model and its sparse version to maximize the \\emph{neuron\nalignment} among activations. Differently from existing methods, our approach\nadaptively selects the best hyperparameters for the block-wise and row-wise\nsparsity ratios w.r.t. the model and the desired sparsity, and requires\n\\emph{no re-training}. We test our method over 276 cases combining four LLM\nfamilies, three sparsity ratios, and ten language tasks (three language\nmodeling and seven zero-shot datasets), showing how it consistently outperforms\nthe latest state-of-the-art methods in terms of performance-runtime trade-off.\nThe code is available at\n\\href{https://github.com/eliacunegatti/NeuroAL}{https://github.com/eliacunegatti/NeuroAL}.\n","authors":["Elia Cunegatti","Leonardo Lucio Custode","Giovanni Iacca"],"pdf_url":"https://arxiv.org/pdf/2411.07066v2.pdf","comment":"Work in progress"},{"id":"http://arxiv.org/abs/2501.05147v1","updated":"2025-01-09T10:56:50Z","published":"2025-01-09T10:56:50Z","title":"A Systematic Literature Review on Deep Learning-based Depth Estimation\n in Computer Vision","summary":" Depth estimation (DE) provides spatial information about a scene and enables\ntasks such as 3D reconstruction, object detection, and scene understanding.\nRecently, there has been an increasing interest in using deep learning\n(DL)-based methods for DE. Traditional techniques rely on handcrafted features\nthat often struggle to generalise to diverse scenes and require extensive\nmanual tuning. However, DL models for DE can automatically extract relevant\nfeatures from input data, adapt to various scene conditions, and generalise\nwell to unseen environments. 
Numerous DL-based methods have been developed,\nmaking it necessary to survey and synthesize the state-of-the-art (SOTA).\nPrevious reviews on DE have mainly focused on either monocular or stereo-based\ntechniques, rather than comprehensively reviewing DE. Furthermore, to the best\nof our knowledge, there is no systematic literature review (SLR) that\ncomprehensively focuses on DE. Therefore, this SLR study is being conducted.\nInitially, electronic databases were searched for relevant publications,\nresulting in 1284 publications. Using defined exclusion and quality criteria,\n128 publications were shortlisted and further filtered to select 59\nhigh-quality primary studies. These studies were analysed to extract data and\nanswer defined research questions. Based on the results, DL methods were\ndeveloped for mainly three different types of DE: monocular, stereo, and\nmulti-view. 20 publicly available datasets were used to train, test, and\nevaluate DL models for DE, with KITTI, NYU Depth V2, and Make 3D being the most\nused datasets. 29 evaluation metrics were used to assess the performance of DE.\n35 base models were reported in the primary studies, and the top five most-used\nbase models were ResNet-50, ResNet-18, ResNet-101, U-Net, and VGG-16. Finally,\nthe lack of ground truth data was among the most significant challenges\nreported by primary studies.\n","authors":["Ali Rohan","Md Junayed Hasan","Andrei Petrovski"],"pdf_url":"https://arxiv.org/pdf/2501.05147v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.00778v2","updated":"2025-01-09T10:47:35Z","published":"2024-06-02T15:35:45Z","title":"Bayesian Joint Additive Factor Models for Multiview Learning","summary":" It is increasingly common in a wide variety of applied settings to collect\ndata of multiple different types on the same set of samples. Our particular\nfocus in this article is on studying relationships between such multiview\nfeatures and responses. 
A motivating application arises in the context of\nprecision medicine where multi-omics data are collected to correlate with\nclinical outcomes. It is of interest to infer dependence within and across\nviews while combining multimodal information to improve the prediction of\noutcomes. The signal-to-noise ratio can vary substantially across views,\nmotivating more nuanced statistical tools beyond standard late and early\nfusion. This challenge comes with the need to preserve interpretability, select\nfeatures, and obtain accurate uncertainty quantification. We propose a joint\nadditive factor regression model (JAFAR) with a structured additive design,\naccounting for shared and view-specific components. We ensure identifiability\nvia a novel dependent cumulative shrinkage process (D-CUSP) prior. We provide\nan efficient implementation via a partially collapsed Gibbs sampler and extend\nour approach to allow flexible feature and outcome distributions. Prediction of\ntime-to-labor onset from immunome, metabolome, and proteome data illustrates\nperformance gains against state-of-the-art competitors. Our open-source\nsoftware (R package) is available at https://github.com/niccoloanceschi/jafar.\n","authors":["Niccolo Anceschi","Federico Ferrari","David B. Dunson","Himel Mallick"],"pdf_url":"https://arxiv.org/pdf/2406.00778v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05113v1","updated":"2025-01-09T09:59:42Z","published":"2025-01-09T09:59:42Z","title":"Constrained Optimization of Charged Particle Tracking with Multi-Agent\n Reinforcement Learning","summary":" Reinforcement learning demonstrated immense success in modelling complex\nphysics-driven systems, providing end-to-end trainable solutions by interacting\nwith a simulated or real environment, maximizing a scalar reward signal. 
In\nthis work, we propose, building upon previous work, a multi-agent reinforcement\nlearning approach with assignment constraints for reconstructing particle\ntracks in pixelated particle detectors. Our approach optimizes collaboratively\na parametrized policy, functioning as a heuristic to a multidimensional\nassignment problem, by jointly minimizing the total amount of particle\nscattering over the reconstructed tracks in a readout frame. To satisfy\nconstraints, guaranteeing a unique assignment of particle hits, we propose a\nsafety layer solving a linear assignment problem for every joint action.\nFurther, to enforce cost margins, increasing the distance of the local policies\npredictions to the decision boundaries of the optimizer mappings, we recommend\nthe use of an additional component in the blackbox gradient estimation, forcing\nthe policy to solutions with lower total assignment costs. We empirically show\non simulated data, generated for a particle detector developed for proton\nimaging, the effectiveness of our approach, compared to multiple single- and\nmulti-agent baselines. We further demonstrate the effectiveness of constraints\nwith cost margins for both optimization and generalization, introduced by wider\nregions with high reconstruction performance as well as reduced predictive\ninstabilities. Our results form the basis for further developments in RL-based\ntracking, offering both enhanced performance with constrained policies and\ngreater flexibility in optimizing tracking algorithms through the option for\nindividual and team rewards.\n","authors":["Tobias Kortus","Ralf Keidel","Nicolas R. 
Gauger","Jan Kieseler"],"pdf_url":"https://arxiv.org/pdf/2501.05113v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05095v1","updated":"2025-01-09T09:21:09Z","published":"2025-01-09T09:21:09Z","title":"Advancing ALS Applications with Large-Scale Pre-training: Dataset\n Development and Downstream Assessment","summary":" The pre-training and fine-tuning paradigm has revolutionized satellite remote\nsensing applications. However, this approach remains largely underexplored for\nairborne laser scanning (ALS), an important technology for applications such as\nforest management and urban planning. In this study, we address this gap by\nconstructing a large-scale ALS point cloud dataset and evaluating its impact on\ndownstream applications. Our dataset comprises ALS point clouds collected\nacross the contiguous United States, provided by the United States Geological\nSurvey's 3D Elevation Program. To ensure efficient data collection while\ncapturing diverse land cover and terrain types, we introduce a geospatial\nsampling method that selects point cloud tiles based on land cover maps and\ndigital elevation models. As a baseline self-supervised learning model, we\nadopt BEV-MAE, a state-of-the-art masked autoencoder for 3D outdoor point\nclouds, and pre-train it on the constructed dataset. The pre-trained models are\nsubsequently fine-tuned for downstream tasks, including tree species\nclassification, terrain scene recognition, and point cloud semantic\nsegmentation. Our results show that the pre-trained models significantly\noutperform their scratch counterparts across all downstream tasks,\ndemonstrating the transferability of the representations learned from the\nproposed dataset. Furthermore, we observe that scaling the dataset using our\ngeospatial sampling method consistently enhances performance, whereas\npre-training on datasets constructed with random sampling fails to achieve\nsimilar improvements. 
These findings highlight the utility of the constructed\ndataset and the effectiveness of our sampling strategy in the pre-training and\nfine-tuning paradigm. The source code and pre-trained models will be made\npublicly available at \\url{https://github.com/martianxiu/ALS_pretraining}.\n","authors":["Haoyi Xiu","Xin Liu","Taehoon Kim","Kyoung-Sook Kim"],"pdf_url":"https://arxiv.org/pdf/2501.05095v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.06232v2","updated":"2025-01-09T09:20:48Z","published":"2024-10-08T17:41:37Z","title":"Range, not Independence, Drives Modularity in Biological Inspired\n Representation","summary":" Why do biological and artificial neurons sometimes modularise, each encoding\na single meaningful variable, and sometimes entangle their representation of\nmany variables? In this work, we develop a theory of when biologically inspired\nnetworks -- those that are nonnegative and energy efficient -- modularise their\nrepresentation of source variables (sources). We derive necessary and\nsufficient conditions on a sample of sources that determine whether the neurons\nin an optimal biologically-inspired linear autoencoder modularise. Our theory\napplies to any dataset, extending far beyond the case of statistical\nindependence studied in previous work. Rather we show that sources modularise\nif their support is ``sufficiently spread''. From this theory, we extract and\nvalidate predictions in a variety of empirical studies on how data distribution\naffects modularisation in nonlinear feedforward and recurrent neural networks\ntrained on supervised and unsupervised tasks. Furthermore, we apply these ideas\nto neuroscience data, showing that range independence can be used to understand\nthe mixing or modularising of spatial and reward information in entorhinal\nrecordings in seemingly conflicting experiments. 
Further, we use these results\nto suggest alternate origins of mixed-selectivity, beyond the predominant\ntheory of flexible nonlinear classification. In sum, our theory prescribes\nprecise conditions on when neural activities modularise, providing tools for\ninducing and elucidating modular representations in brains and machines.\n","authors":["Will Dorrell","Kyle Hsu","Luke Hollingsworth","Jin Hwa Lee","Jiajun Wu","Chelsea Finn","Peter E Latham","Tim EJ Behrens","James CR Whittington"],"pdf_url":"https://arxiv.org/pdf/2410.06232v2.pdf","comment":"40 pages, 16 figures. WD and KH contributed equally; LH and JHL\n contributed equally"},{"id":"http://arxiv.org/abs/2312.03700v2","updated":"2025-01-09T09:12:06Z","published":"2023-12-06T18:59:19Z","title":"OneLLM: One Framework to Align All Modalities with Language","summary":" Multimodal large language models (MLLMs) have gained significant attention\ndue to their strong multimodal understanding capability. However, existing\nworks rely heavily on modality-specific encoders, which usually differ in\narchitecture and are limited to common modalities. In this paper, we present\nOneLLM, an MLLM that aligns eight modalities to language using a unified\nframework. We achieve this through a unified multimodal encoder and a\nprogressive multimodal alignment pipeline. In detail, we first train an image\nprojection module to connect a vision encoder with LLM. Then, we build a\nuniversal projection module (UPM) by mixing multiple image projection modules\nand dynamic routing. Finally, we progressively align more modalities to LLM\nwith the UPM. To fully leverage the potential of OneLLM in following\ninstructions, we also curated a comprehensive multimodal instruction dataset,\nincluding 2M items from image, audio, video, point cloud, depth/normal map, IMU\nand fMRI brain activity. 
OneLLM is evaluated on 25 diverse benchmarks,\nencompassing tasks such as multimodal captioning, question answering and\nreasoning, where it delivers excellent performance. Code, data, model and\nonline demo are available at https://github.com/csuhan/OneLLM\n","authors":["Jiaming Han","Kaixiong Gong","Yiyuan Zhang","Jiaqi Wang","Kaipeng Zhang","Dahua Lin","Yu Qiao","Peng Gao","Xiangyu Yue"],"pdf_url":"https://arxiv.org/pdf/2312.03700v2.pdf","comment":"Accepted by CVPR 2024. Code: https://github.com/csuhan/OneLLM"},{"id":"http://arxiv.org/abs/2412.10095v2","updated":"2025-01-09T09:09:32Z","published":"2024-12-13T12:31:06Z","title":"HiTZ at VarDial 2025 NorSID: Overcoming Data Scarcity with Language\n Transfer and Automatic Data Annotation","summary":" In this paper we present our submission for the NorSID Shared Task as part of\nthe 2025 VarDial Workshop (Scherrer et al., 2025), consisting of three tasks:\nIntent Detection, Slot Filling and Dialect Identification, evaluated using data\nin different dialects of the Norwegian language. For Intent Detection and Slot\nFilling, we have fine-tuned a multitask model in a cross-lingual setting, to\nleverage the xSID dataset available in 17 languages. In the case of Dialect\nIdentification, our final submission consists of a model fine-tuned on the\nprovided development set, which has obtained the highest scores within our\nexperiments. Our final results on the test set show that our models do not drop\nin performance compared to the development set, likely due to the\ndomain-specificity of the dataset and the similar distribution of both subsets.\nFinally, we also report an in-depth analysis of the provided datasets and their\nartifacts, as well as other sets of experiments that have been carried out but\ndid not yield the best results. 
Additionally, we present an analysis on the\nreasons why some methods have been more successful than others; mainly the\nimpact of the combination of languages and domain-specificity of the training\ndata on the results.\n","authors":["Jaione Bengoetxea","Mikel Zubillaga","Ekhi Azurmendi","Maite Heredia","Julen Etxaniz","Markel Ferro","Jeremy Barnes"],"pdf_url":"https://arxiv.org/pdf/2412.10095v2.pdf","comment":"Vardial 2025 NorSID Shared Task, fixed minor typos"},{"id":"http://arxiv.org/abs/2501.05079v1","updated":"2025-01-09T09:01:04Z","published":"2025-01-09T09:01:04Z","title":"Multimodal-to-Text Prompt Engineering in Large Language Models Using\n Feature Embeddings for GNSS Interference Characterization","summary":" Large language models (LLMs) are advanced AI systems applied across various\ndomains, including NLP, information retrieval, and recommendation systems.\nDespite their adaptability and efficiency, LLMs have not been extensively\nexplored for signal processing tasks, particularly in the domain of global\nnavigation satellite system (GNSS) interference monitoring. GNSS interference\nmonitoring is essential to ensure the reliability of vehicle localization on\nroads, a critical requirement for numerous applications. However, GNSS-based\npositioning is vulnerable to interference from jamming devices, which can\ncompromise its accuracy. The primary objective is to identify, classify, and\nmitigate these interferences. Interpreting GNSS snapshots and the associated\ninterferences presents significant challenges due to the inherent complexity,\nincluding multipath effects, diverse interference types, varying sensor\ncharacteristics, and satellite constellations. In this paper, we extract\nfeatures from a large GNSS dataset and employ LLaVA to retrieve relevant\ninformation from an extensive knowledge base. We employ prompt engineering to\ninterpret the interferences and environmental factors, and utilize t-SNE to\nanalyze the feature embeddings. 
Our findings demonstrate that the proposed\nmethod is capable of visual and logical reasoning within the GNSS context.\nFurthermore, our pipeline outperforms state-of-the-art machine learning models\nin interference classification tasks.\n","authors":["Harshith Manjunath","Lucas Heublein","Tobias Feigl","Felix Ott"],"pdf_url":"https://arxiv.org/pdf/2501.05079v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05078v1","updated":"2025-01-09T09:00:32Z","published":"2025-01-09T09:00:32Z","title":"Analyzing Memorization in Large Language Models through the Lens of\n Model Attribution","summary":" Large Language Models (LLMs) are prevalent in modern applications but often\nmemorize training data, leading to privacy breaches and copyright issues.\nExisting research has mainly focused on posthoc analyses, such as extracting\nmemorized content or developing memorization metrics, without exploring the\nunderlying architectural factors that contribute to memorization. In this work,\nwe investigate memorization from an architectural lens by analyzing how\nattention modules at different layers impact its memorization and\ngeneralization performance. Using attribution techniques, we systematically\nintervene in the LLM architecture by bypassing attention modules at specific\nblocks while keeping other components like layer normalization and MLP\ntransformations intact. We provide theorems analyzing our intervention\nmechanism from a mathematical view, bounding the difference in layer outputs\nwith and without our attributions. Our theoretical and empirical analyses\nreveal that attention modules in deeper transformer blocks are primarily\nresponsible for memorization, whereas earlier blocks are crucial for the models\ngeneralization and reasoning capabilities. We validate our findings through\ncomprehensive experiments on different LLM families (Pythia and GPTNeo) and\nfive benchmark datasets. 
Our insights offer a practical approach to mitigate\nmemorization in LLMs while preserving their performance, contributing to safer\nand more ethical deployment in real world applications.\n","authors":["Tarun Ram Menta","Susmit Agrawal","Chirag Agarwal"],"pdf_url":"https://arxiv.org/pdf/2501.05078v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05075v1","updated":"2025-01-09T08:59:14Z","published":"2025-01-09T08:59:14Z","title":"A Text-Based Knowledge-Embedded Soft Sensing Modeling Approach for\n General Industrial Process Tasks Based on Large Language Model","summary":" Data-driven soft sensors (DDSS) have become mainstream methods for predicting\nkey performance indicators in process industries. However, DDSS development\nrequires complex and costly customized designs tailored to various tasks during\nthe modeling process. Moreover, DDSS are constrained to a single structured\ndata modality, limiting their ability to incorporate additional contextual\nknowledge. Furthermore, DDSSs' limited representation learning leads to weak\npredictive performance with scarce data. To address these challenges, we\npropose a general framework named LLM-TKESS (large language model for\ntext-based knowledge-embedded soft sensing), harnessing the powerful general\nproblem-solving capabilities, cross-modal knowledge transfer abilities, and\nfew-shot capabilities of LLM for enhanced soft sensing modeling. Specifically,\nan auxiliary variable series encoder (AVS Encoder) is proposed to unleash LLM's\npotential for capturing temporal relationships within series and spatial\nsemantic relationships among auxiliary variables. Then, we propose a two-stage\nfine-tuning alignment strategy: in the first stage, employing\nparameter-efficient fine-tuning through autoregressive training adjusts LLM to\nrapidly accommodate process variable data, resulting in a soft sensing\nfoundation model (SSFM). 
Subsequently, by training adapters, we adapt the SSFM\nto various downstream tasks without modifying its architecture. Then, we\npropose two text-based knowledge-embedded soft sensors, integrating new natural\nlanguage modalities to overcome the limitations of pure structured data models.\nFurthermore, benefiting from LLM's pre-existing world knowledge, our model\ndemonstrates outstanding predictive capabilities in small sample conditions.\nUsing the thermal deformation of air preheater rotor as a case study, we\nvalidate through extensive experiments that LLM-TKESS exhibits outstanding\nperformance.\n","authors":["Shuo Tong","Han Liu","Runyuan Guo","Xueqiong Tian","Wenqing Wang","Ding Liu","Youmin Zhang"],"pdf_url":"https://arxiv.org/pdf/2501.05075v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.14503v2","updated":"2025-01-09T08:55:07Z","published":"2024-11-21T08:31:06Z","title":"Planning-Driven Programming: A Large Language Model Programming Workflow","summary":" The strong performance of large language models (LLMs) raises extensive\ndiscussion on their application to code generation. Recent research suggests\ncontinuous program refinements through visible tests to improve code generation\naccuracy in LLMs. However, these methods suffer from LLMs' inefficiency and\nlimited reasoning capacity. In this work, we propose an LLM programming\nworkflow (LPW) designed to improve both initial code generation and subsequent\nrefinements within a structured two-phase workflow. Specifically, the solution\ngeneration phase formulates a solution plan, which is then verified through\nvisible tests to specify the intended natural language solution. Subsequently,\nthe code implementation phase drafts an initial code according to the solution\nplan and its verification. If the generated code fails the visible tests, the\nplan verification serves as the intended solution to consistently inform the\nrefinement process for correcting bugs. 
Compared to state-of-the-art methods\nacross various existing LLMs, LPW significantly improves the Pass@1 accuracy by\nup to 16.4% on well-established text-to-code generation benchmarks. LPW also\nsets new state-of-the-art Pass@1 accuracy, achieving 98.2% on HumanEval, 84.8%\non MBPP, 59.3% on LiveCode, 62.6% on APPS, and 34.7% on CodeContest, using\nGPT-4o as the backbone.\n","authors":["Chao Lei","Yanchuan Chang","Nir Lipovetzky","Krista A. Ehinger"],"pdf_url":"https://arxiv.org/pdf/2411.14503v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05069v1","updated":"2025-01-09T08:44:42Z","published":"2025-01-09T08:44:42Z","title":"Commonsense Video Question Answering through Video-Grounded Entailment\n Tree Reasoning","summary":" This paper proposes the first video-grounded entailment tree reasoning method\nfor commonsense video question answering (VQA). Despite the remarkable progress\nof large visual-language models (VLMs), there are growing concerns that they\nlearn spurious correlations between videos and likely answers, reinforced by\ntheir black-box nature and remaining benchmarking biases. Our method explicitly\ngrounds VQA tasks to video fragments in four steps: entailment tree\nconstruction, video-language entailment verification, tree reasoning, and\ndynamic tree expansion. A vital benefit of the method is its generalizability\nto current video and image-based VLMs across reasoning types. To support fair\nevaluation, we devise a de-biasing procedure based on large-language models\nthat rewrites VQA benchmark answer sets to enforce model reasoning. Systematic\nexperiments on existing and de-biased benchmarks highlight the impact of our\nmethod components across benchmarks, VLMs, and reasoning types.\n","authors":["Huabin Liu","Filip Ilievski","Cees G. M. 
Snoek"],"pdf_url":"https://arxiv.org/pdf/2501.05069v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05068v1","updated":"2025-01-09T08:44:06Z","published":"2025-01-09T08:44:06Z","title":"D3RM: A Discrete Denoising Diffusion Refinement Model for Piano\n Transcription","summary":" Diffusion models have been widely used in the generative domain due to their\nconvincing performance in modeling complex data distributions. Moreover, they\nhave shown competitive results on discriminative tasks, such as image\nsegmentation. While diffusion models have also been explored for automatic\nmusic transcription, their performance has yet to reach a competitive level. In\nthis paper, we focus on discrete diffusion model's refinement capabilities and\npresent a novel architecture for piano transcription. Our model utilizes\nNeighborhood Attention layers as the denoising module, gradually predicting the\ntarget high-resolution piano roll, conditioned on the finetuned features of a\npretrained acoustic model. To further enhance refinement, we devise a novel\nstrategy which applies distinct transition states during training and inference\nstage of discrete diffusion models. Experiments on the MAESTRO dataset show\nthat our approach outperforms previous diffusion-based piano transcription\nmodels and the baseline model in terms of F1 score. Our code is available in\nhttps://github.com/hanshounsu/d3rm.\n","authors":["Hounsu Kim","Taegyun Kwon","Juhan Nam"],"pdf_url":"https://arxiv.org/pdf/2501.05068v1.pdf","comment":"Accepted to ICASSP 2025"},{"id":"http://arxiv.org/abs/2501.05067v1","updated":"2025-01-09T08:43:57Z","published":"2025-01-09T08:43:57Z","title":"LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion\n for Video Understanding","summary":" In this paper, we introduce LLaVA-Octopus, a novel video multimodal large\nlanguage model. 
LLaVA-Octopus adaptively weights features from different visual\nprojectors based on user instructions, enabling us to leverage the\ncomplementary strengths of each projector. We observe that different visual\nprojectors exhibit distinct characteristics when handling specific tasks. For\ninstance, some projectors excel at capturing static details, while others are\nmore effective at processing temporal information, and some are better suited\nfor tasks requiring temporal coherence. By dynamically adjusting feature\nweights according to user instructions, LLaVA-Octopus dynamically selects and\ncombines the most suitable features, significantly enhancing the model's\nperformance in multimodal tasks. Experimental results demonstrate that\nLLaVA-Octopus achieves excellent performance across multiple benchmarks,\nespecially in tasks such as multimodal understanding, visual question\nanswering, and video understanding, highlighting its broad application\npotential.\n","authors":["Jiaxing Zhao","Boyuan Sun","Xiang Chen","Xihan Wei","Qibin Hou"],"pdf_url":"https://arxiv.org/pdf/2501.05067v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05066v1","updated":"2025-01-09T08:43:09Z","published":"2025-01-09T08:43:09Z","title":"Improving Skeleton-based Action Recognition with Interactive Object\n Information","summary":" Human skeleton information is important in skeleton-based action recognition,\nwhich provides a simple and efficient way to describe human pose. However,\nexisting skeleton-based methods focus more on the skeleton, ignoring the\nobjects interacting with humans, resulting in poor performance in recognizing\nactions that involve object interactions. We propose a new action recognition\nframework introducing object nodes to supplement absent interactive object\ninformation. We also propose Spatial Temporal Variable Graph Convolutional\nNetworks (ST-VGCN) to effectively model the Variable Graph (VG) containing\nobject nodes. 
Specifically, in order to validate the role of interactive object\ninformation, by leveraging a simple self-training approach, we establish a new\ndataset, JXGC 24, and an extended dataset, NTU RGB+D+Object 60, including more\nthan 2 million additional object nodes. At the same time, we designe the\nVariable Graph construction method to accommodate a variable number of nodes\nfor graph structure. Additionally, we are the first to explore the overfitting\nissue introduced by incorporating additional object information, and we propose\na VG-based data augmentation method to address this issue, called Random Node\nAttack. Finally, regarding the network structure, we introduce two fusion\nmodules, CAF and WNPool, along with a novel Node Balance Loss, to enhance the\ncomprehensive performance by effectively fusing and balancing skeleton and\nobject node information. Our method surpasses the previous state-of-the-art on\nmultiple skeleton-based action recognition benchmarks. The accuracy of our\nmethod on NTU RGB+D 60 cross-subject split is 96.7\\%, and on cross-view split,\nit is 99.2\\%.\n","authors":["Hao Wen","Ziqian Lu","Fengli Shen","Zhe-Ming Lu","Jialin Cui"],"pdf_url":"https://arxiv.org/pdf/2501.05066v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04614v2","updated":"2025-01-09T08:42:56Z","published":"2025-01-08T16:53:56Z","title":"MedCoDi-M: A Multi-Prompt Foundation Model for Multimodal Medical Data\n Generation","summary":" Artificial Intelligence is revolutionizing medical practice, enhancing\ndiagnostic accuracy and healthcare delivery. However, its adaptation in medical\nsettings still faces significant challenges, related to data availability and\nprivacy constraints. Synthetic data has emerged as a promising solution to\nmitigate these issues, addressing data scarcity while preserving privacy.\nRecently, Latent Diffusion Models have emerged as a powerful tool for\ngenerating high-quality synthetic data. 
Meanwhile, the integration of different\nmodalities has gained interest, emphasizing the need of models capable of\nhandle multimodal medical data. Existing approaches struggle to integrate\ncomplementary information and lack the ability to generate modalities\nsimultaneously. To address this challenge, we present MedCoDi-M, a\n6.77-billion-parameter model, designed for multimodal medical data generation,\nthat, following Foundation Model paradigm, exploits contrastive learning and\nlarge quantity of data to build a shared latent space which capture the\nrelationships between different data modalities. Further, we introduce the\nMulti-Prompt training technique, which significantly boosts MedCoDi-M's\ngeneration under different settings. We extensively validate MedCoDi-M: first\nwe benchmark it against five competitors on the MIMIC-CXR dataset, a\nstate-of-the-art dataset for Chest X-ray and radiological report generation.\nSecondly, we perform a Visual Turing Test with expert radiologists to assess\nthe realism and clinical relevance of the generated data, ensuring alignment\nwith real-world scenarios. Finally, we assess the utility of MedCoDi-M in\naddressing key challenges in the medical field, such as anonymization, data\nscarcity and imbalance learning. The results are promising, demonstrating the\napplicability of MedCoDi-M in medical contexts. 
Project page is at\nhttps://cosbidev.github.io/MedCoDi-M/.\n","authors":["Daniele Molino","Francesco Di Feola","Eliodoro Faiella","Deborah Fazzini","Domiziana Santucci","Linlin Shen","Valerio Guarrasi","Paolo Soda"],"pdf_url":"https://arxiv.org/pdf/2501.04614v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05058v1","updated":"2025-01-09T08:28:31Z","published":"2025-01-09T08:28:31Z","title":"Simultaneous emulation and downscaling with physically-consistent deep\n learning-based regional ocean emulators","summary":" Building on top of the success in AI-based atmospheric emulation, we propose\nan AI-based ocean emulation and downscaling framework focusing on the\nhigh-resolution regional ocean over Gulf of Mexico. Regional ocean emulation\npresents unique challenges owing to the complex bathymetry and lateral boundary\nconditions as well as from fundamental biases in deep learning-based\nframeworks, such as instability and hallucinations. In this paper, we develop a\ndeep learning-based framework to autoregressively integrate ocean-surface\nvariables over the Gulf of Mexico at $8$ Km spatial resolution without\nunphysical drifts over decadal time scales and simulataneously downscale and\nbias-correct it to $4$ Km resolution using a physics-constrained generative\nmodel. 
The framework shows both short-term skills as well as accurate long-term\nstatistics in terms of mean and variability.\n","authors":["Leonard Lupin-Jimenez","Moein Darman","Subhashis Hazarika","Tianning Wu","Michael Gray","Ruyoing He","Anthony Wong","Ashesh Chattopadhyay"],"pdf_url":"https://arxiv.org/pdf/2501.05058v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05053v1","updated":"2025-01-09T08:24:10Z","published":"2025-01-09T08:24:10Z","title":"TAPFed: Threshold Secure Aggregation for Privacy-Preserving Federated\n Learning","summary":" Federated learning is a computing paradigm that enhances privacy by enabling\nmultiple parties to collaboratively train a machine learning model without\nrevealing personal data. However, current research indicates that traditional\nfederated learning platforms are unable to ensure privacy due to privacy leaks\ncaused by the interchange of gradients. To achieve privacy-preserving federated\nlearning, integrating secure aggregation mechanisms is essential.\nUnfortunately, existing solutions are vulnerable to recently demonstrated\ninference attacks such as the disaggregation attack. This paper proposes\nTAPFed, an approach for achieving privacy-preserving federated learning in the\ncontext of multiple decentralized aggregators with malicious actors. TAPFed\nuses a proposed threshold functional encryption scheme and allows for a certain\nnumber of malicious aggregators while maintaining security and privacy. We\nprovide formal security and privacy analyses of TAPFed and compare it to\nvarious baselines through experimental evaluation. Our results show that TAPFed\noffers equivalent performance in terms of model quality compared to\nstate-of-the-art approaches while reducing transmission overhead by 29%-45%\nacross different model training scenarios. 
Most importantly, TAPFed can defend\nagainst recently demonstrated inference attacks caused by curious aggregators,\nwhich the majority of existing approaches are susceptible to.\n","authors":["Runhua Xu","Bo Li","Chao Li","James B. D. Joshi","Shuai Ma","Jianxin Li"],"pdf_url":"https://arxiv.org/pdf/2501.05053v1.pdf","comment":"The paper has been published in IEEE TDSC"},{"id":"http://arxiv.org/abs/2501.05032v1","updated":"2025-01-09T07:44:06Z","published":"2025-01-09T07:44:06Z","title":"Enhancing Human-Like Responses in Large Language Models","summary":" This paper explores the advancements in making large language models (LLMs)\nmore human-like. We focus on techniques that enhance natural language\nunderstanding, conversational coherence, and emotional intelligence in AI\nsystems. The study evaluates various approaches, including fine-tuning with\ndiverse datasets, incorporating psychological principles, and designing models\nthat better mimic human reasoning patterns. Our findings demonstrate that these\nenhancements not only improve user interactions but also open new possibilities\nfor AI applications across different domains. 
Future work will address the\nethical implications and potential biases introduced by these human-like\nattributes.\n","authors":["Ethem Yağız Çalık","Talha Rüzgar Akkuş"],"pdf_url":"https://arxiv.org/pdf/2501.05032v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05030v1","updated":"2025-01-09T07:41:22Z","published":"2025-01-09T07:41:22Z","title":"A General Retrieval-Augmented Generation Framework for Multimodal\n Case-Based Reasoning Applications","summary":" Case-based reasoning (CBR) is an experience-based approach to problem\nsolving, where a repository of solved cases is adapted to solve new cases.\nRecent research shows that Large Language Models (LLMs) with\nRetrieval-Augmented Generation (RAG) can support the Retrieve and Reuse stages\nof the CBR pipeline by retrieving similar cases and using them as additional\ncontext to an LLM query. Most studies have focused on text-only applications,\nhowever, in many real-world problems the components of a case are multimodal.\nIn this paper we present MCBR-RAG, a general RAG framework for multimodal CBR\napplications. The MCBR-RAG framework converts non-text case components into\ntext-based representations, allowing it to: 1) learn application-specific\nlatent representations that can be indexed for retrieval, and 2) enrich the\nquery provided to the LLM by incorporating all case components for better\ncontext. 
We demonstrate MCBR-RAG's effectiveness through experiments conducted\non a simplified Math-24 application and a more complex Backgammon application.\nOur empirical results show that MCBR-RAG improves generation quality compared\nto a baseline LLM with no contextual information provided.\n","authors":["Ofir Marom"],"pdf_url":"https://arxiv.org/pdf/2501.05030v1.pdf","comment":"15 pages, 7 figures"},{"id":"http://arxiv.org/abs/2308.06764v3","updated":"2025-01-09T07:39:30Z","published":"2023-08-13T13:01:21Z","title":"Few-shot Class-incremental Learning for Classification and Object\n Detection: A Survey","summary":" Few-shot Class-Incremental Learning (FSCIL) presents a unique challenge in\nMachine Learning (ML), as it necessitates the Incremental Learning (IL) of new\nclasses from sparsely labeled training samples without forgetting previous\nknowledge. While this field has seen recent progress, it remains an active\nexploration area. This paper aims to provide a comprehensive and systematic\nreview of FSCIL. In our in-depth examination, we delve into various facets of\nFSCIL, encompassing the problem definition, the discussion of the primary\nchallenges of unreliable empirical risk minimization and the\nstability-plasticity dilemma, general schemes, and relevant problems of IL and\nFew-shot Learning (FSL). Besides, we offer an overview of benchmark datasets\nand evaluation metrics. Furthermore, we introduce the Few-shot\nClass-incremental Classification (FSCIC) methods from data-based,\nstructure-based, and optimization-based approaches and the Few-shot\nClass-incremental Object Detection (FSCIOD) methods from anchor-free and\nanchor-based approaches. 
Beyond these, we present several promising research\ndirections within FSCIL that merit further investigation.\n","authors":["Jinghua Zhang","Li Liu","Olli Silvén","Matti Pietikäinen","Dewen Hu"],"pdf_url":"https://arxiv.org/pdf/2308.06764v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.01973v3","updated":"2025-01-09T07:26:05Z","published":"2024-12-28T02:28:19Z","title":"INFELM: In-depth Fairness Evaluation of Large Text-To-Image Models","summary":" The rapid development of large language models (LLMs) and large vision models\n(LVMs) have propelled the evolution of multi-modal AI systems, which have\ndemonstrated the remarkable potential for industrial applications by emulating\nhuman-like cognition. However, they also pose significant ethical challenges,\nincluding amplifying harmful content and reinforcing societal biases. For\ninstance, biases in some industrial image generation models highlighted the\nurgent need for robust fairness assessments. Most existing evaluation\nframeworks focus on the comprehensiveness of various aspects of the models, but\nthey exhibit critical limitations, including insufficient attention to content\ngeneration alignment and social bias-sensitive domains. More importantly, their\nreliance on pixel-detection techniques is prone to inaccuracies.\n To address these issues, this paper presents INFELM, an in-depth fairness\nevaluation on widely-used text-to-image models. Our key contributions are: (1)\nan advanced skintone classifier incorporating facial topology and refined skin\npixel representation to enhance classification precision by at least 16.04%,\n(2) a bias-sensitive content alignment measurement for understanding societal\nimpacts, (3) a generalizable representation bias evaluation for diverse\ndemographic groups, and (4) extensive experiments analyzing large-scale\ntext-to-image model outputs across six social-bias-sensitive domains. 
We find\nthat existing models in the study generally do not meet the empirical fairness\ncriteria, and representation bias is generally more pronounced than alignment\nerrors. INFELM establishes a robust benchmark for fairness assessment,\nsupporting the development of multi-modal AI systems that align with ethical\nand human-centric principles.\n","authors":["Di Jin","Xing Liu","Yu Liu","Jia Qing Yap","Andrea Wong","Adriana Crespo","Qi Lin","Zhiyuan Yin","Qiang Yan","Ryan Ye"],"pdf_url":"https://arxiv.org/pdf/2501.01973v3.pdf","comment":"Di Jin and Xing Liu contributed equally to this work"},{"id":"http://arxiv.org/abs/2501.05018v1","updated":"2025-01-09T07:21:44Z","published":"2025-01-09T07:21:44Z","title":"Finding Needles in Emb(a)dding Haystacks: Legal Document Retrieval via\n Bagging and SVR Ensembles","summary":" We introduce a retrieval approach leveraging Support Vector Regression (SVR)\nensembles, bootstrap aggregation (bagging), and embedding spaces on the German\nDataset for Legal Information Retrieval (GerDaLIR). By conceptualizing the\nretrieval task in terms of multiple binary needle-in-a-haystack subtasks, we\nshow improved recall over the baselines (0.849 > 0.803 | 0.829) using our\nvoting ensemble, suggesting promising initial results, without training or\nfine-tuning any deep learning models. 
Our approach holds potential for further\nenhancement, particularly through refining the encoding models and optimizing\nhyperparameters.\n","authors":["Kevin Bönisch","Alexander Mehler"],"pdf_url":"https://arxiv.org/pdf/2501.05018v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.14571v2","updated":"2025-01-09T07:16:39Z","published":"2024-01-26T00:06:08Z","title":"Driving Towards Inclusion: A Systematic Review of AI-powered\n Accessibility Enhancements for People with Disability in Autonomous Vehicles","summary":" This paper provides a comprehensive and, to our knowledge, the first review\nof inclusive human-computer interaction (HCI) within autonomous vehicles (AVs)\nand human-driven cars with partial autonomy, emphasizing accessibility and\nuser-centered design principles. We explore the current technologies and HCI\nsystems designed to enhance passenger experience, particularly for individuals\nwith accessibility needs. Key technologies discussed include brain-computer\ninterfaces, anthropomorphic interaction, virtual reality, augmented reality,\nmode adaptation, voice-activated interfaces, haptic feedback, etc. Each\ntechnology is evaluated for its role in creating an inclusive in-vehicle\nenvironment. Furthermore, we highlight recent interface designs by leading\ncompanies and review emerging concepts and prototypes under development or\ntesting, which show significant potential to address diverse accessibility\nrequirements. Safety considerations, ethical concerns, and adoption of AVs are\nother major issues that require thorough investigation. Building on these\nfindings, we propose an end-to-end design framework that addresses\naccessibility requirements across diverse user demographics, including older\nadults and individuals with physical or cognitive impairments. 
This work\nprovides actionable insights for designers, researchers, and policymakers\naiming to create safer and more comfortable environments in autonomous and\nregular vehicles accessible to all users.\n","authors":["Ashish Bastola","Hao Wang","Sayed Pedram Haeri Boroujeni","Julian Brinkley","Ata Jahangir Moshayedi","Abolfazl Razi"],"pdf_url":"https://arxiv.org/pdf/2401.14571v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05015v1","updated":"2025-01-09T07:16:21Z","published":"2025-01-09T07:16:21Z","title":"On Measuring Unnoticeability of Graph Adversarial Attacks: Observations,\n New Measure, and Applications","summary":" Adversarial attacks are allegedly unnoticeable. Prior studies have designed\nattack noticeability measures on graphs, primarily using statistical tests to\ncompare the topology of original and (possibly) attacked graphs. However, we\nobserve two critical limitations in the existing measures. First, because the\nmeasures rely on simple rules, attackers can readily enhance their attacks to\nbypass them, reducing their attack \"noticeability\" and, yet, maintaining their\nattack performance. Second, because the measures naively leverage global\nstatistics, such as degree distributions, they may entirely overlook attacks\nuntil severe perturbations occur, letting the attacks be almost \"totally\nunnoticeable.\" To address the limitations, we introduce HideNSeek, a learnable\nmeasure for graph attack noticeability. First, to mitigate the bypass problem,\nHideNSeek learns to distinguish the original and (potential) attack edges using\na learnable edge scorer (LEO), which scores each edge on its likelihood of\nbeing an attack. Second, to mitigate the overlooking problem, HideNSeek\nconducts imbalance-aware aggregation of all the edge scores to obtain the final\nnoticeability score. 
Using six real-world graphs, we empirically demonstrate\nthat HideNSeek effectively alleviates the observed limitations, and LEO (i.e.,\nour learnable edge scorer) outperforms eleven competitors in distinguishing\nattack edges under five different attack methods. For an additional\napplication, we show that LEO boosts the performance of robust GNNs by removing\nattack-like edges.\n","authors":["Hyeonsoo Jo","Hyunjin Hwang","Fanchen Bu","Soo Yong Lee","Chanyoung Park","Kijung Shin"],"pdf_url":"https://arxiv.org/pdf/2501.05015v1.pdf","comment":"KDD 2025"},{"id":"http://arxiv.org/abs/2501.05014v1","updated":"2025-01-09T07:15:59Z","published":"2025-01-09T07:15:59Z","title":"UAV-VLA: Vision-Language-Action System for Large Scale Aerial Mission\n Generation","summary":" The UAV-VLA (Visual-Language-Action) system is a tool designed to facilitate\ncommunication with aerial robots. By integrating satellite imagery processing\nwith the Visual Language Model (VLM) and the powerful capabilities of GPT,\nUAV-VLA enables users to generate general flight paths-and-action plans through\nsimple text requests. This system leverages the rich contextual information\nprovided by satellite images, allowing for enhanced decision-making and mission\nplanning. The combination of visual analysis by VLM and natural language\nprocessing by GPT can provide the user with the path-and-action set, making\naerial operations more efficient and accessible. 
The newly developed method\nshowed a 22% difference in the length of the created trajectory and a mean\nerror of 34.22 m in finding the objects of interest on a map, measured by\nEuclidean distance with the K-Nearest Neighbors (KNN) approach.\n","authors":["Oleg Sautenkov","Yasheerah Yaqoot","Artem Lykov","Muhammad Ahsan Mustafa","Grik Tadevosyan","Aibek Akhmetkazy","Miguel Altamirano Cabrera","Mikhail Martynov","Sausar Karaf","Dzmitry Tsetserukou"],"pdf_url":"https://arxiv.org/pdf/2501.05014v1.pdf","comment":"HRI 2025"},{"id":"http://arxiv.org/abs/2501.05007v1","updated":"2025-01-09T07:05:22Z","published":"2025-01-09T07:05:22Z","title":"Quantum-enhanced causal discovery for a small number of samples","summary":" The discovery of causal relationships from observed data has attracted\nsignificant interest from disciplines such as economics, social sciences,\nepidemiology, and biology. In practical applications, considerable knowledge of\nthe underlying systems is often unavailable, and real data are often associated\nwith nonlinear causal structures, which make the direct use of most\nconventional causality analysis methods difficult. This study proposes a novel\nquantum Peter-Clark (qPC) algorithm for causal discovery that does not assume\nany underlying model structures. Based on the independence conditional tests in\na class of reproducing kernel Hilbert spaces characterized by quantum circuits,\nthe proposed qPC algorithm can explore causal relationships from the observed\ndata drawn from arbitrary distributions. We conducted systematic experiments on\nfundamental graph parts of causal structures, demonstrating that the qPC\nalgorithm exhibits a significantly better performance, particularly with\nsmaller sample sizes compared to its classical counterpart. Furthermore, we\nproposed a novel optimization approach based on Kernel Target Alignment (KTA)\nfor determining hyperparameters of quantum kernels. 
This method effectively\nreduced the risk of false positives in causal discovery, enabling more reliable\ninference. Our theoretical and experimental results demonstrate that the\nproposed quantum algorithm can empower classical algorithms for robust and\naccurate inference in causal discovery, supporting them in regimes where\nclassical algorithms typically fail. Additionally, the effectiveness of this\nmethod was validated using the Boston Housing dataset as a real-world\napplication. These findings demonstrate the new potential of quantum\ncircuit-based causal discovery methods in addressing practical challenges,\nparticularly in small-sample scenarios where traditional approaches have shown\nlimitations.\n","authors":["Yota Maeda","Ken Arai","Yu Tanaka","Yu Terada","Hiroshi Ueno","Hiroyuki Tezuka"],"pdf_url":"https://arxiv.org/pdf/2501.05007v1.pdf","comment":"19 pages, 8 figures"},{"id":"http://arxiv.org/abs/2402.07204v5","updated":"2025-01-09T06:53:50Z","published":"2024-02-11T13:30:53Z","title":"ITINERA: Integrating Spatial Optimization with Large Language Models for\n Open-domain Urban Itinerary Planning","summary":" Citywalk, a recently popular form of urban travel, requires genuine\npersonalization and understanding of fine-grained requests compared to\ntraditional itinerary planning. In this paper, we introduce the novel task of\nOpen-domain Urban Itinerary Planning (OUIP), which generates personalized urban\nitineraries from user requests in natural language. We then present ITINERA, an\nOUIP system that integrates spatial optimization with large language models to\nprovide customized urban itineraries based on user needs. This involves\ndecomposing user requests, selecting candidate points of interest (POIs),\nordering the POIs based on cluster-aware spatial optimization, and generating\nthe itinerary. 
Experiments on real-world datasets and the performance of the\ndeployed system demonstrate our system's capacity to deliver personalized and\nspatially coherent itineraries compared to current solutions. Source codes of\nITINERA are available at https://github.com/YihongT/ITINERA.\n","authors":["Yihong Tang","Zhaokai Wang","Ao Qu","Yihao Yan","Zhaofeng Wu","Dingyi Zhuang","Jushi Kai","Kebing Hou","Xiaotong Guo","Han Zheng","Tiange Luo","Jinhua Zhao","Zhan Zhao","Wei Ma"],"pdf_url":"https://arxiv.org/pdf/2402.07204v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.10517v5","updated":"2025-01-09T06:41:46Z","published":"2024-08-20T03:35:28Z","title":"Integrating Multi-Modal Input Token Mixer Into Mamba-Based Decision\n Models: Decision MetaMamba","summary":" Sequence modeling with State Space models (SSMs) has demonstrated performance\nsurpassing that of Transformers in various tasks, raising expectations for\ntheir potential to outperform the Decision Transformer and its enhanced\nvariants in offline reinforcement learning (RL). However, decision models based\non Mamba, a state-of-the-art SSM, failed to achieve superior performance\ncompared to these enhanced Decision Transformers. We hypothesize that this\nlimitation arises from information loss during the selective scanning phase. To\naddress this, we propose the Decision MetaMamba (DMM), which augments Mamba\nwith a token mixer in its input layer. This mixer explicitly accounts for the\nmultimodal nature of offline RL inputs, comprising state, action, and\nreturn-to-go. The DMM demonstrates improved performance while significantly\nreducing parameter count compared to prior models. Notably, similar performance\ngains were achieved using a simple linear token mixer, emphasizing the\nimportance of preserving information from proximate time steps rather than the\nspecific design of the token mixer itself. 
This novel modification to Mamba's\ninput layer represents a departure from conventional timestamp-based encoding\napproaches used in Transformers. By enhancing performance of Mamba in offline\nRL, characterized by memory efficiency and fast inference, this work opens new\navenues for its broader application in future RL research.\n","authors":["Wall Kim"],"pdf_url":"https://arxiv.org/pdf/2408.10517v5.pdf","comment":"We have decided to withdraw this manuscript as we believe that the\n work requires significant improvements and further research to ensure its\n quality and impact. We are currently pursuing a more comprehensive approach\n to address the limitations of the current submission and plan to resubmit an\n improved version in the future"},{"id":"http://arxiv.org/abs/2408.16030v2","updated":"2025-01-09T06:33:24Z","published":"2024-08-28T09:30:20Z","title":"Deep Learning-Based Automatic Multi-Level Airway Collapse Monitoring on\n Obstructive Sleep Apnea Patients","summary":" This study investigated the use of deep learning to identify multi-level\nupper airway collapses in obstructive sleep apnea (OSA) patients based on\nsnoring sounds. We fine-tuned ResNet-50 and Audio Spectrogram Transformer\n(AST) models using snoring recordings from 37 subjects undergoing drug-induced\nsleep endoscopy (DISE) between 2020 and 2021. Snoring sounds were labeled\naccording to the VOTE (Velum, Oropharynx, Tongue Base, Epiglottis)\nclassification, resulting in 259 V, 403 O, 77 T, 13 E, 1016 VO, 46 VT, 140 OT,\n39 OE, 30 VOT, and 3150 non-snoring (N) 0.5-second clips. The models were\ntrained for two multi-label classification tasks: identifying obstructions at\nV, O, T, and E levels, and identifying retropalatal (RP) and retroglossal (RG)\nobstructions. 
Results showed AST slightly outperformed ResNet-50,\ndemonstrating good ability to identify V (F1-score: 0.71, MCC: 0.61, AUC:\n0.89), O (F1-score: 0.80, MCC: 0.72, AUC: 0.94), and RP obstructions (F1-score:\n0.86, MCC: 0.77, AUC: 0.97). However, both models struggled with T, E, and RG\nclassifications due to limited data. Retrospective analysis of a full-night\nrecording showed the potential to profile airway obstruction dynamics. We\nexpect this information, combined with polysomnography and other clinical\nparameters, can aid clinical triage and treatment planning for OSA patients.\n","authors":["Ying-Chieh Hsu","Stanley Yung-Chuan Liu","Chao-Jung Huang","Chi-Wei Wu","Ren-Kai Cheng","Jane Yung-Jen Hsu","Shang-Ran Huang","Yuan-Ren Cheng","Fu-Shun Hsu"],"pdf_url":"https://arxiv.org/pdf/2408.16030v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04997v1","updated":"2025-01-09T06:26:28Z","published":"2025-01-09T06:26:28Z","title":"GiNet: Integrating Sequential and Context-Aware Learning for Battery\n Capacity Prediction","summary":" The surging demand for batteries requires advanced battery management\nsystems, where battery capacity modelling is a key functionality. In this\npaper, we aim to achieve accurate battery capacity prediction by learning from\nhistorical measurements of battery dynamics. We propose GiNet, a gated\nrecurrent units enhanced Informer network, for predicting battery's capacity.\nThe novelty and competitiveness of GiNet lies in its capability of capturing\nsequential and contextual information from raw battery data and reflecting the\nbattery's complex behaviors with both temporal dynamics and long-term\ndependencies. We conducted an experimental study based on a publicly available\ndataset to showcase GiNet's strength of gaining a holistic understanding of\nbattery behavior and predicting battery capacity accurately. 
GiNet achieves\n0.11 mean absolute error for predicting the battery capacity in a sequence of\nfuture time slots without knowing the historical battery capacity. It also\noutperforms the latest algorithms significantly with 27% error reduction on\naverage compared to Informer. The promising results highlight the importance of\ncustomized and optimized integration of algorithm and battery knowledge and\nshed light on other industry applications as well.\n","authors":["Sara Sameer","Wei Zhang","Xin Lou","Qingyu Yan","Terence Goh","Yulin Gao"],"pdf_url":"https://arxiv.org/pdf/2501.04997v1.pdf","comment":"6 pages"},{"id":"http://arxiv.org/abs/2501.04995v1","updated":"2025-01-09T06:20:00Z","published":"2025-01-09T06:20:00Z","title":"IPDN: Image-enhanced Prompt Decoding Network for 3D Referring Expression\n Segmentation","summary":" 3D Referring Expression Segmentation (3D-RES) aims to segment point cloud\nscenes based on a given expression. However, existing 3D-RES approaches face\ntwo major challenges: feature ambiguity and intent ambiguity. Feature ambiguity\narises from information loss or distortion during point cloud acquisition due\nto limitations such as lighting and viewpoint. Intent ambiguity refers to the\nmodel's equal treatment of all queries during the decoding process, lacking\ntop-down task-specific guidance. In this paper, we introduce an Image enhanced\nPrompt Decoding Network (IPDN), which leverages multi-view images and\ntask-driven information to enhance the model's reasoning capabilities. To\naddress feature ambiguity, we propose the Multi-view Semantic Embedding (MSE)\nmodule, which injects multi-view 2D image information into the 3D scene and\ncompensates for potential spatial information loss. To tackle intent ambiguity,\nwe designed a Prompt-Aware Decoder (PAD) that guides the decoding process by\nderiving task-driven signals from the interaction between the expression and\nvisual features. 
Comprehensive experiments demonstrate that IPDN outperforms\nthe state-of-the-art by 1.9 and 4.2 points in mIoU metrics on the 3D-RES and\n3D-GRES tasks, respectively.\n","authors":["Qi Chen","Changli Wu","Jiayi Ji","Yiwei Ma","Danni Yang","Xiaoshuai Sun"],"pdf_url":"https://arxiv.org/pdf/2501.04995v1.pdf","comment":"AAAI 2025"},{"id":"http://arxiv.org/abs/2410.14368v2","updated":"2025-01-09T06:02:11Z","published":"2024-10-18T10:53:44Z","title":"CoMAL: Collaborative Multi-Agent Large Language Models for\n Mixed-Autonomy Traffic","summary":" The integration of autonomous vehicles into urban traffic has great potential\nto improve efficiency by reducing congestion and optimizing traffic flow\nsystematically. In this paper, we introduce CoMAL (Collaborative Multi-Agent\nLLMs), a framework designed to address the mixed-autonomy traffic problem by\ncollaboration among autonomous vehicles to optimize traffic flow. CoMAL is\nbuilt upon large language models, operating in an interactive traffic\nsimulation environment. It utilizes a Perception Module to observe surrounding\nagents and a Memory Module to store strategies for each agent. The overall\nworkflow includes a Collaboration Module that encourages autonomous vehicles to\ndiscuss the effective strategy and allocate roles, a reasoning engine to\ndetermine optimal behaviors based on assigned roles, and an Execution Module\nthat controls vehicle actions using a hybrid approach combining rule-based\nmodels. Experimental results demonstrate that CoMAL achieves superior\nperformance on the Flow benchmark. Additionally, we evaluate the impact of\ndifferent language models and compare our framework with reinforcement learning\napproaches. It highlights the strong cooperative capability of LLM agents and\npresents a promising solution to the mixed-autonomy traffic challenge. 
The code\nis available at https://github.com/Hyan-Yao/CoMAL.\n","authors":["Huaiyuan Yao","Longchao Da","Vishnu Nandam","Justin Turnau","Zhiwei Liu","Linsey Pang","Hua Wei"],"pdf_url":"https://arxiv.org/pdf/2410.14368v2.pdf","comment":"8 pages, 4 figures, accepted to SDM25"},{"id":"http://arxiv.org/abs/2501.04982v1","updated":"2025-01-09T05:45:03Z","published":"2025-01-09T05:45:03Z","title":"CuRLA: Curriculum Learning Based Deep Reinforcement Learning for\n Autonomous Driving","summary":" In autonomous driving, traditional Computer Vision (CV) agents often struggle\nin unfamiliar situations due to biases in the training data. Deep Reinforcement\nLearning (DRL) agents address this by learning from experience and maximizing\nrewards, which helps them adapt to dynamic environments. However, ensuring\ntheir generalization remains challenging, especially with static training\nenvironments. Additionally, DRL models lack transparency, making it difficult\nto guarantee safety in all scenarios, particularly those not seen during\ntraining. To tackle these issues, we propose a method that combines DRL with\nCurriculum Learning for autonomous driving. Our approach uses a Proximal Policy\nOptimization (PPO) agent and a Variational Autoencoder (VAE) to learn safe\ndriving in the CARLA simulator. The agent is trained using two-fold curriculum\nlearning, progressively increasing environment difficulty and incorporating a\ncollision penalty in the reward function to promote safety. This method\nimproves the agent's adaptability and reliability in complex environments, and\nhelps it understand the nuances of balancing multiple reward components from\ndifferent feedback signals in a single scalar reward function. 
Keywords: Computer Vision,\nDeep Reinforcement Learning, Variational Autoencoder, Proximal Policy\nOptimization, Curriculum Learning, Autonomous Driving.\n","authors":["Bhargava Uppuluri","Anjel Patel","Neil Mehta","Sridhar Kamath","Pratyush Chakraborty"],"pdf_url":"https://arxiv.org/pdf/2501.04982v1.pdf","comment":"To be published in the 17th International Conference on Agents and\n Artificial Intelligence (ICAART), Feb 2025"},{"id":"http://arxiv.org/abs/2501.04974v1","updated":"2025-01-09T05:06:44Z","published":"2025-01-09T05:06:44Z","title":"SensorQA: A Question Answering Benchmark for Daily-Life Monitoring","summary":" With the rapid growth in sensor data, effectively interpreting and\ninterfacing with these data in a human-understandable way has become crucial.\nWhile existing research primarily focuses on learning classification models,\nfewer studies have explored how end users can actively extract useful insights\nfrom sensor data, often hindered by the lack of a proper dataset. To address\nthis gap, we introduce SensorQA, the first human-created question-answering\n(QA) dataset for long-term time-series sensor data for daily life monitoring.\nSensorQA is created by human workers and includes 5.6K diverse and practical\nqueries that reflect genuine human interests, paired with accurate answers\nderived from sensor data. We further establish benchmarks for state-of-the-art\nAI models on this dataset and evaluate their performance on typical edge\ndevices. Our results reveal a gap between current models and optimal QA\nperformance and efficiency, highlighting the need for new contributions. 
The\ndataset and code are available at:\nhttps://github.com/benjamin-reichman/SensorQA.\n","authors":["Benjamin Reichman","Xiaofan Yu","Lanxiang Hu","Jack Truxal","Atishay Jain","Rushil Chandrupatla","Tajana Šimunić Rosing","Larry Heck"],"pdf_url":"https://arxiv.org/pdf/2501.04974v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04970v1","updated":"2025-01-09T04:59:15Z","published":"2025-01-09T04:59:15Z","title":"Battling the Non-stationarity in Time Series Forecasting via Test-time\n Adaptation","summary":" Deep Neural Networks have spearheaded remarkable advancements in time series\nforecasting (TSF), one of the major tasks in time series modeling. Nonetheless,\nthe non-stationarity of time series undermines the reliability of pre-trained\nsource time series forecasters in mission-critical deployment settings. In this\nstudy, we introduce a pioneering test-time adaptation framework tailored for\nTSF (TSF-TTA). TAFAS, the proposed approach to TSF-TTA, flexibly adapts source\nforecasters to continuously shifting test distributions while preserving the\ncore semantic information learned during pre-training. The novel utilization of\npartially-observed ground truth and gated calibration module enables proactive,\nrobust, and model-agnostic adaptation of source forecasters. Experiments on\ndiverse benchmark datasets and cutting-edge architectures demonstrate the\nefficacy and generality of TAFAS, especially in long-term forecasting scenarios\nthat suffer from significant distribution shifts. 
The code is available at\nhttps://github.com/kimanki/TAFAS.\n","authors":["HyunGi Kim","Siwon Kim","Jisoo Mok","Sungroh Yoon"],"pdf_url":"https://arxiv.org/pdf/2501.04970v1.pdf","comment":"Accepted at AAAI 2025"},{"id":"http://arxiv.org/abs/2501.04961v1","updated":"2025-01-09T04:26:15Z","published":"2025-01-09T04:26:15Z","title":"Demystifying Domain-adaptive Post-training for Financial LLMs","summary":" Domain-adaptive post-training of large language models (LLMs) has emerged as\na promising approach for specialized domains such as medicine and finance.\nHowever, significant challenges remain in identifying optimal adaptation\ncriteria and training strategies across varying data and model configurations.\nTo address these challenges, we introduce FINDAP, a systematic and fine-grained\ninvestigation into domain-adaptive post-training of LLMs for the finance\ndomain. Our approach begins by identifying the core capabilities required for\nthe target domain and designing a comprehensive evaluation suite aligned with\nthese needs. We then analyze the effectiveness of key post-training stages,\nincluding continual pretraining, instruction tuning, and preference alignment.\nBuilding on these insights, we propose an effective training recipe centered on\na novel preference data distillation method, which leverages process signals\nfrom a generative reward model. The resulting model, Llama-Fin, achieves\nstate-of-the-art performance across a wide range of financial tasks. Our\nanalysis also highlights how each post-training stage contributes to distinct\ncapabilities, uncovering specific challenges and effective solutions, providing\nvaluable insights for domain adaptation of LLMs. 
Project page:\nhttps://github.com/SalesforceAIResearch/FinDap\n","authors":["Zixuan Ke","Yifei Ming","Xuan-Phi Nguyen","Caiming Xiong","Shafiq Joty"],"pdf_url":"https://arxiv.org/pdf/2501.04961v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03847v2","updated":"2025-01-09T04:25:42Z","published":"2025-01-07T15:01:58Z","title":"Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video\n Generation Control","summary":" Diffusion models have demonstrated impressive performance in generating\nhigh-quality videos from text prompts or images. However, precise control over\nthe video generation process, such as camera manipulation or content editing,\nremains a significant challenge. Existing methods for controlled video\ngeneration are typically limited to a single control type, lacking the\nflexibility to handle diverse control demands. In this paper, we introduce\nDiffusion as Shader (DaS), a novel approach that supports multiple video\ncontrol tasks within a unified architecture. Our key insight is that achieving\nversatile video control necessitates leveraging 3D control signals, as videos\nare fundamentally 2D renderings of dynamic 3D content. Unlike prior methods\nlimited to 2D control signals, DaS leverages 3D tracking videos as control\ninputs, making the video diffusion process inherently 3D-aware. This innovation\nallows DaS to achieve a wide range of video controls by simply manipulating the\n3D tracking videos. A further advantage of using 3D tracking videos is their\nability to effectively link frames, significantly enhancing the temporal\nconsistency of the generated videos. 
With just 3 days of fine-tuning on 8 H800\nGPUs using less than 10k videos, DaS demonstrates strong control capabilities\nacross diverse tasks, including mesh-to-video generation, camera control,\nmotion transfer, and object manipulation.\n","authors":["Zekai Gu","Rui Yan","Jiahao Lu","Peng Li","Zhiyang Dou","Chenyang Si","Zhen Dong","Qifeng Liu","Cheng Lin","Ziwei Liu","Wenping Wang","Yuan Liu"],"pdf_url":"https://arxiv.org/pdf/2501.03847v2.pdf","comment":"Project page: https://igl-hkust.github.io/das/ Codes:\n https://github.com/IGL-HKUST/DiffusionAsShader"},{"id":"http://arxiv.org/abs/2501.04958v1","updated":"2025-01-09T04:20:12Z","published":"2025-01-09T04:20:12Z","title":"Addressing Domain Shift via Imbalance-Aware Domain Adaptation in Embryo\n Development Assessment","summary":" Deep learning models in medical imaging face dual challenges: domain shift,\nwhere models perform poorly when deployed in settings different from their\ntraining environment, and class imbalance, where certain disease conditions are\nnaturally underrepresented. We present Imbalance-Aware Domain Adaptation\n(IADA), a novel framework that simultaneously tackles both challenges through\nthree key components: (1) adaptive feature learning with class-specific\nattention mechanisms, (2) balanced domain alignment with dynamic weighting, and\n(3) adaptive threshold optimization. Our theoretical analysis establishes\nconvergence guarantees and complexity bounds. Through extensive experiments on\nembryo development assessment across four imaging modalities, IADA demonstrates\nsignificant improvements over existing methods, achieving up to 25.19% higher\naccuracy while maintaining balanced performance across classes. In challenging\nscenarios with low-quality imaging systems, IADA shows robust generalization\nwith AUC improvements of up to 12.56%. These results demonstrate IADA's\npotential for developing reliable and equitable medical imaging systems for\ndiverse clinical settings. 
The code is made publicly available at\nhttps://github.com/yinghemedical/imbalance-aware_domain_adaptation\n","authors":["Lei Li","Xinglin Zhang","Jun Liang","Tao Chen"],"pdf_url":"https://arxiv.org/pdf/2501.04958v1.pdf","comment":"15 pages"},{"id":"http://arxiv.org/abs/2501.04945v1","updated":"2025-01-09T03:34:07Z","published":"2025-01-09T03:34:07Z","title":"Step-by-Step Mastery: Enhancing Soft Constraint Following Ability of\n Large Language Models","summary":" It is crucial for large language models (LLMs) to follow instructions that\ninvolve multiple constraints. However, soft constraints are semantically\nrelated and difficult to verify through automated methods. These constraints\nremain a significant challenge for LLMs. To enhance the ability of LLMs to\nfollow soft constraints, we initially design a pipeline to obtain high-quality\noutputs automatically. Additionally, to fully utilize the acquired data, we\nintroduce a training paradigm based on curriculum learning. We experimentally\nevaluate the effectiveness of our methods in improving LLMs' soft constraint\nfollowing ability and analyze the factors driving the improvements. The\ndatasets and code are publicly available at\nhttps://github.com/Rainier-rq/FollowSoftConstraints.\n","authors":["Qingyu Ren","Jie Zeng","Qianyu He","Jiaqing Liang","Yanghua Xiao","Weikang Zhou","Zeye Sun","Fei Yu"],"pdf_url":"https://arxiv.org/pdf/2501.04945v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.15594v3","updated":"2025-01-09T03:08:17Z","published":"2024-11-23T16:03:35Z","title":"A Survey on LLM-as-a-Judge","summary":" Accurate and consistent evaluation is crucial for decision-making across\nnumerous fields, yet it remains a challenging task due to inherent\nsubjectivity, variability, and scale. Large Language Models (LLMs) have\nachieved remarkable success across diverse domains, leading to the emergence of\n\"LLM-as-a-Judge,\" where LLMs are employed as evaluators for complex tasks. 
With\ntheir ability to process diverse data types and provide scalable,\ncost-effective, and consistent assessments, LLMs present a compelling\nalternative to traditional expert-driven evaluations. However, ensuring the\nreliability of LLM-as-a-Judge systems remains a significant challenge that\nrequires careful design and standardization. This paper provides a\ncomprehensive survey of LLM-as-a-Judge, addressing the core question: How can\nreliable LLM-as-a-Judge systems be built? We explore strategies to enhance\nreliability, including improving consistency, mitigating biases, and adapting\nto diverse assessment scenarios. Additionally, we propose methodologies for\nevaluating the reliability of LLM-as-a-Judge systems, supported by a novel\nbenchmark designed for this purpose. To advance the development and real-world\ndeployment of LLM-as-a-Judge systems, we also discuss practical applications,\nchallenges, and future directions. This survey serves as a foundational\nreference for researchers and practitioners in this rapidly evolving field.\n","authors":["Jiawei Gu","Xuhui Jiang","Zhichao Shi","Hexiang Tan","Xuehao Zhai","Chengjin Xu","Wei Li","Yinghan Shen","Shengjie Ma","Honghao Liu","Yuanzhuo Wang","Jian Guo"],"pdf_url":"https://arxiv.org/pdf/2411.15594v3.pdf","comment":"Corrected typos & more discussion on reasoning models 33 pages, 9\n figures. arXiv admin note: text overlap with arXiv:2310.05470 by other\n authors"},{"id":"http://arxiv.org/abs/2501.04931v1","updated":"2025-01-09T02:47:01Z","published":"2025-01-09T02:47:01Z","title":"Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency","summary":" Multimodal Large Language Models (MLLMs) have achieved impressive performance\nand have been put into practical use in commercial applications, but they still\nhave potential safety mechanism vulnerabilities. Jailbreak attacks are red\nteaming methods that aim to bypass safety mechanisms and discover MLLMs'\npotential risks. 
Existing MLLMs' jailbreak methods often bypass the model's\nsafety mechanism through complex optimization methods or carefully designed\nimage and text prompts. Despite achieving some progress, they have a low attack\nsuccess rate on commercial closed-source MLLMs. Unlike previous research, we\nempirically find that there exists a Shuffle Inconsistency between MLLMs'\ncomprehension ability and safety ability for the shuffled harmful instruction.\nThat is, from the perspective of comprehension ability, MLLMs can understand\nthe shuffled harmful text-image instructions well. However, they can be easily\nbypassed by the shuffled harmful instructions from the perspective of safety\nability, leading to harmful responses. Then we innovatively propose a\ntext-image jailbreak attack named SI-Attack. Specifically, to fully utilize the\nShuffle Inconsistency and overcome the shuffle randomness, we apply a\nquery-based black-box optimization method to select the most harmful shuffled\ninputs based on the feedback of the toxic judge model. A series of experiments\nshow that SI-Attack can improve the attack's performance on three benchmarks.\nIn particular, SI-Attack can obviously improve the attack success rate for\ncommercial MLLMs such as GPT-4o or Claude-3.5-Sonnet.\n","authors":["Shiji Zhao","Ranjie Duan","Fengxiang Wang","Chi Chen","Caixin Kang","Jialing Tao","YueFeng Chen","Hui Xue","Xingxing Wei"],"pdf_url":"https://arxiv.org/pdf/2501.04931v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04928v1","updated":"2025-01-09T02:36:21Z","published":"2025-01-09T02:36:21Z","title":"Image2CADSeq: Computer-Aided Design Sequence and Knowledge Inference\n from Product Images","summary":" Computer-aided design (CAD) tools empower designers to design and modify 3D\nmodels through a series of CAD operations, commonly referred to as a CAD\nsequence. In scenarios where digital CAD files are not accessible, reverse\nengineering (RE) has been used to reconstruct 3D CAD models. 
Recent advances\nhave seen the rise of data-driven approaches for RE, with a primary focus on\nconverting 3D data, such as point clouds, into 3D models in boundary\nrepresentation (B-rep) format. However, obtaining 3D data poses significant\nchallenges, and B-rep models do not reveal knowledge about the 3D modeling\nprocess of designs. To this end, our research introduces a novel data-driven\napproach with an Image2CADSeq neural network model. This model aims to reverse\nengineer CAD models by processing images as input and generating CAD sequences.\nThese sequences can then be translated into B-rep models using a solid modeling\nkernel. Unlike B-rep models, CAD sequences offer enhanced flexibility to modify\nindividual steps of model creation, providing a deeper understanding of the\nconstruction process of CAD models. To quantitatively and rigorously evaluate\nthe predictive performance of the Image2CADSeq model, we have developed a\nmulti-level evaluation framework for model assessment. The model was trained on\na specially synthesized dataset, and various network architectures were\nexplored to optimize the performance. The experimental and validation results\nshow great potential for the model in generating CAD sequences from 2D image\ndata.\n","authors":["Xingang Li","Zhenghui Sha"],"pdf_url":"https://arxiv.org/pdf/2501.04928v1.pdf","comment":"20 pages, 10 figures, and 6 tables"},{"id":"http://arxiv.org/abs/2404.06429v3","updated":"2025-01-09T02:34:25Z","published":"2024-04-09T16:20:03Z","title":"Magic-Boost: Boost 3D Generation with Multi-View Conditioned Diffusion","summary":" Benefiting from the rapid development of 2D diffusion models, 3D content\ngeneration has witnessed significant progress. One promising solution is to\nfinetune the pre-trained 2D diffusion models to produce multi-view images and\nthen reconstruct them into 3D assets via feed-forward sparse-view\nreconstruction models. 
However, limited by the 3D inconsistency in the\ngenerated multi-view images and the low reconstruction resolution of the\nfeed-forward reconstruction models, the generated 3d assets are still limited\nto incorrect geometries and blurry textures. To address this problem, we\npresent a multi-view based refine method, named Magic-Boost, to further refine\nthe generation results. In detail, we first propose a novel multi-view\nconditioned diffusion model which extracts 3d prior from the synthesized\nmulti-view images to synthesize high-fidelity novel view images and then\nintroduce a novel iterative-update strategy to adopt it to provide precise\nguidance to refine the coarse generated results through a fast optimization\nprocess. Conditioned on the strong 3d priors extracted from the synthesized\nmulti-view images, Magic-Boost is capable of providing precise optimization\nguidance that well aligns with the coarse generated 3D assets, enriching the\nlocal detail in both geometry and texture within a short time ($\\sim15$min).\nExtensive experiments show Magic-Boost greatly enhances the coarse generated\ninputs, generates high-quality 3D assets with rich geometric and textural\ndetails. (Project Page: https://magic-research.github.io/magic-boost/)\n","authors":["Fan Yang","Jianfeng Zhang","Yichun Shi","Bowen Chen","Chenxu Zhang","Huichao Zhang","Xiaofeng Yang","Xiu Li","Jiashi Feng","Guosheng Lin"],"pdf_url":"https://arxiv.org/pdf/2404.06429v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17052v2","updated":"2025-01-09T02:33:14Z","published":"2024-12-22T15:05:30Z","title":"ViLBias: A Comprehensive Framework for Bias Detection through Linguistic\n and Visual Cues , presenting Annotation Strategies, Evaluation, and Key\n Challenges","summary":" The integration of Large Language Models (LLMs) and Vision-Language Models\n(VLMs) opens new avenues for addressing complex challenges in multimodal\ncontent analysis, particularly in biased news detection. 
This study introduces\nVLBias, a framework that leverages state-of-the-art LLMs and VLMs to detect\nlinguistic and visual biases in news content. We present a multimodal dataset\ncomprising textual content and corresponding images from diverse news sources.\nWe propose a hybrid annotation framework that combines LLM-based annotations\nwith human review to ensure high-quality labeling while reducing costs and\nenhancing scalability. Our evaluation compares the performance of\nstate-of-the-art SLMs and LLMs for both modalities (text and images) and the\nresults reveal that while SLMs are computationally efficient, LLMs demonstrate\nsuperior accuracy in identifying subtle framing and text-visual\ninconsistencies. Furthermore, empirical analysis shows that incorporating\nvisual cues alongside textual data improves bias detection accuracy by 3 to 5%.\nThis study provides a comprehensive exploration of LLMs, SLMs, and VLMs as\ntools for detecting multimodal biases in news content and highlights their\nrespective strengths, limitations, and potential for future applications\n","authors":["Shaina Raza","Caesar Saleh","Emrul Hasan","Franklin Ogidi","Maximus Powers","Veronica Chatrath","Marcelo Lotif","Roya Javadi","Anam Zahid","Vahid Reza Khazaie"],"pdf_url":"https://arxiv.org/pdf/2412.17052v2.pdf","comment":"Under review"},{"id":"http://arxiv.org/abs/2501.04926v1","updated":"2025-01-09T02:30:26Z","published":"2025-01-09T02:30:26Z","title":"FLowHigh: Towards Efficient and High-Quality Audio Super-Resolution with\n Single-Step Flow Matching","summary":" Audio super-resolution is challenging owing to its ill-posed nature.\nRecently, the application of diffusion models in audio super-resolution has\nshown promising results in alleviating this challenge. However, diffusion-based\nmodels have limitations, primarily the necessity for numerous sampling steps,\nwhich causes significantly increased latency when synthesizing high-quality\naudio samples. 
In this paper, we propose FLowHigh, a novel approach that\nintegrates flow matching, a highly efficient generative model, into audio\nsuper-resolution. We also explore probability paths specially tailored for\naudio super-resolution, which effectively capture high-resolution audio\ndistributions, thereby enhancing reconstruction quality. The proposed method\ngenerates high-fidelity, high-resolution audio through a single-step sampling\nprocess across various input sampling rates. The experimental results on the\nVCTK benchmark dataset demonstrate that FLowHigh achieves state-of-the-art\nperformance in audio super-resolution, as evaluated by log-spectral distance\nand ViSQOL while maintaining computational efficiency with only a single-step\nsampling process.\n","authors":["Jun-Hak Yun","Seung-Bin Kim","Seong-Whan Lee"],"pdf_url":"https://arxiv.org/pdf/2501.04926v1.pdf","comment":"Accepted by ICASSP 2025"},{"id":"http://arxiv.org/abs/2501.04070v2","updated":"2025-01-09T02:20:13Z","published":"2025-01-07T14:57:08Z","title":"More is not always better? Enhancing Many-Shot In-Context Learning with\n Differentiated and Reweighting Objectives","summary":" Large language models (LLMs) excel at few-shot in-context learning (ICL)\nwithout requiring parameter updates. However, as the number of ICL\ndemonstrations increases from a few to many, performance tends to plateau and\neventually decline. We identify two primary causes for this trend: the\nsuboptimal negative log-likelihood (NLL) optimization objective and the\nincremental data noise. To address these issues, we introduce DrICL, a novel\noptimization method that enhances model performance through Differentiated\nLearning and advantage-based Reweighting objectives. Globally, DrICL utilizes\ndifferentiated learning to optimize the NLL objective, ensuring that many-shot\nperformance surpasses zero-shot levels. 
Locally, it dynamically adjusts the\nweighting of many-shot demonstrations by leveraging cumulative advantages\ninspired by reinforcement learning, thereby improving generalization. This\napproach allows the model to handle varying numbers of shots effectively,\nmitigating the impact of noisy data. Recognizing the lack of multi-task\ndatasets with diverse many-shot distributions, we develop the Many-Shot ICL\nBenchmark (ICL-50)-a large-scale benchmark of 50 tasks that cover shot numbers\nfrom 1 to 350 within sequences of up to 8,000 tokens-for fine-tuning purposes.\nICL-50 facilitates the evaluation of many-shot ICL strategies across seven\nprominent NLP tasks and 50 distinct datasets. Experimental results demonstrate\nthat LLMs enhanced with DrICL achieve significant improvements in many-shot\nsetups across various tasks, including both in-domain and out-of-domain\nscenarios. We release the code and benchmark dataset hoping to facilitate\nfurther research in many-shot ICL.\n","authors":["Xiaoqing Zhang","Ang Lv","Yuhan Liu","Flood Sung","Wei Liu","Shuo Shang","Xiuying Chen","Rui Yan"],"pdf_url":"https://arxiv.org/pdf/2501.04070v2.pdf","comment":"13 pages, 8 figures, 11 tables"},{"id":"http://arxiv.org/abs/2501.04228v2","updated":"2025-01-09T01:35:56Z","published":"2025-01-08T01:59:47Z","title":"Constraints as Rewards: Reinforcement Learning for Robots without Reward\n Functions","summary":" Reinforcement learning has become an essential algorithm for generating\ncomplex robotic behaviors. However, to learn such behaviors, it is necessary to\ndesign a reward function that describes the task, which often consists of\nmultiple objectives that needs to be balanced. This tuning process is known as\nreward engineering and typically involves extensive trial-and-error. In this\npaper, to avoid this trial-and-error process, we propose the concept of\nConstraints as Rewards (CaR). 
CaR formulates the task objective using multiple\nconstraint functions instead of a reward function and solves a reinforcement\nlearning problem with constraints using the Lagrangian-method. By adopting this\napproach, different objectives are automatically balanced, because Lagrange\nmultipliers serves as the weights among the objectives. In addition, we will\ndemonstrate that constraints, expressed as inequalities, provide an intuitive\ninterpretation of the optimization target designed for the task. We apply the\nproposed method to the standing-up motion generation task of a\nsix-wheeled-telescopic-legged robot and demonstrate that the proposed method\nsuccessfully acquires the target behavior, even though it is challenging to\nlearn with manually designed reward functions.\n","authors":["Yu Ishihara","Noriaki Takasugi","Kotaro Kawakami","Masaya Kinoshita","Kazumi Aoyama"],"pdf_url":"https://arxiv.org/pdf/2501.04228v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.17077v4","updated":"2025-01-09T01:29:00Z","published":"2023-06-29T16:28:34Z","title":"RAPGen: An Approach for Fixing Code Inefficiencies in Zero-Shot","summary":" Performance bugs are non-functional bugs that can even manifest in\nwell-tested commercial products. Fixing these performance bugs is an important\nyet challenging problem. In this work, we address this challenge and present a\nnew approach called Retrieval-Augmented Prompt Generation (RAPGen). Given a\ncode snippet with a performance issue, RAPGen first retrieves a prompt\ninstruction from a pre-constructed knowledge-base of previous performance bug\nfixes and then generates a prompt using the retrieved instruction. It then uses\nthis prompt on a Large Language Model (such as Codex) in zero-shot to generate\na fix. We compare our approach with the various prompt variations and state of\nthe art methods in the task of performance bug fixing. 
Our evaluation shows\nthat RAPGen can generate performance improvement suggestions equivalent or\nbetter than a developer in ~60% of the cases, getting ~42% of them verbatim, in\nan expert-verified dataset of past performance changes made by C# developers.\n","authors":["Spandan Garg","Roshanak Zilouchian Moghaddam","Neel Sundaresan"],"pdf_url":"https://arxiv.org/pdf/2306.17077v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04899v1","updated":"2025-01-09T01:24:59Z","published":"2025-01-09T01:24:59Z","title":"SUGAR: Leveraging Contextual Confidence for Smarter Retrieval","summary":" Bearing in mind the limited parametric knowledge of Large Language Models\n(LLMs), retrieval-augmented generation (RAG) which supplies them with the\nrelevant external knowledge has served as an approach to mitigate the issue of\nhallucinations to a certain extent. However, uniformly retrieving supporting\ncontext makes response generation source-inefficient, as triggering the\nretriever is not always necessary, or even inaccurate, when a model gets\ndistracted by noisy retrieved content and produces an unhelpful answer.\nMotivated by these issues, we introduce Semantic Uncertainty Guided Adaptive\nRetrieval (SUGAR), where we leverage context-based entropy to actively decide\nwhether to retrieve and to further determine between single-step and multi-step\nretrieval. 
Our empirical results show that selective retrieval guided by\nsemantic uncertainty estimation improves the performance across diverse\nquestion answering tasks, as well as achieves a more efficient inference.\n","authors":["Hanna Zubkova","Ji-Hoon Park","Seong-Whan Lee"],"pdf_url":"https://arxiv.org/pdf/2501.04899v1.pdf","comment":"ICASSP2025"},{"id":"http://arxiv.org/abs/2501.04896v1","updated":"2025-01-09T00:50:44Z","published":"2025-01-09T00:50:44Z","title":"Quantifying Itch and its Impact on Sleep Using Machine Learning and\n Radio Signals","summary":" Chronic itch affects 13% of the US population, is highly debilitating, and\nunderlies many medical conditions. A major challenge in clinical care and new\ntherapeutics development is the lack of an objective measure for quantifying\nitch, leading to reliance on subjective measures like patients' self-assessment\nof itch severity. In this paper, we show that a home radio device paired with\nartificial intelligence (AI) can concurrently capture scratching and evaluate\nits impact on sleep quality by analyzing radio signals bouncing in the\nenvironment. The device eliminates the need for wearable sensors or skin\ncontact, enabling monitoring of chronic itch over extended periods at home\nwithout burdening patients or interfering with their skin condition. To\nvalidate the technology, we conducted an observational clinical study of\nchronic pruritus patients, monitored at home for one month using both the radio\ndevice and an infrared camera. Comparing the output of the device to ground\ntruth data from the camera demonstrates its feasibility and accuracy (ROC AUC =\n0.997, sensitivity = 0.825, specificity = 0.997). The results reveal a\nsignificant correlation between scratching and low sleep quality, manifested as\na reduction in sleep efficiency (R = 0.6, p < 0.001) and an increase in sleep\nlatency (R = 0.68, p < 0.001). 
Our study underscores the potential of passive,\nlong-term, at-home monitoring of chronic scratching and its sleep implications,\noffering a valuable tool for both clinical care of chronic itch patients and\npharmaceutical clinical trials.\n","authors":["Michail Ouroutzoglou","Mingmin Zhao","Joshua Hellerstein","Hariharan Rahul","Asima Badic","Brian S. Kim","Dina Katabi"],"pdf_url":"https://arxiv.org/pdf/2501.04896v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.05315v2","updated":"2025-01-09T00:11:59Z","published":"2024-10-05T03:37:07Z","title":"PalmBench: A Comprehensive Benchmark of Compressed Large Language Models\n on Mobile Platforms","summary":" Deploying large language models (LLMs) locally on mobile devices is\nadvantageous in scenarios where transmitting data to remote cloud servers is\neither undesirable due to privacy concerns or impractical due to network\nconnection. Recent advancements (MLC, 2023a; Gerganov, 2023) have facilitated\nthe local deployment of LLMs. However, local deployment also presents\nchallenges, particularly in balancing quality (generative performance),\nlatency, and throughput within the hardware constraints of mobile devices. In\nthis paper, we introduce our lightweight, all-in-one automated benchmarking\nframework that allows users to evaluate LLMs on mobile devices. We provide a\ncomprehensive benchmark of various popular LLMs with different quantization\nconfigurations (both weights and activations) across multiple mobile platforms\nwith varying hardware capabilities. Unlike traditional benchmarks that assess\nfull-scale models on high-end GPU clusters, we focus on evaluating resource\nefficiency (memory and power consumption) and harmful output for compressed\nmodels on mobile devices. 
Our key observations include i) differences in energy\nefficiency and throughput across mobile platforms; ii) the impact of\nquantization on memory usage, GPU execution time, and power consumption; and\niii) accuracy and performance degradation of quantized models compared to their\nnon-quantized counterparts; and iv) the frequency of hallucinations and toxic\ncontent generated by compressed LLMs on mobile devices.\n","authors":["Yilong Li","Jingyu Liu","Hao Zhang","M Badri Narayanan","Utkarsh Sharma","Shuai Zhang","Pan Hu","Yijing Zeng","Jayaram Raghuram","Suman Banerjee"],"pdf_url":"https://arxiv.org/pdf/2410.05315v2.pdf","comment":"10 pages"},{"id":"http://arxiv.org/abs/2501.05614v1","updated":"2025-01-09T23:25:06Z","published":"2025-01-09T23:25:06Z","title":"Watermarking Graph Neural Networks via Explanations for Ownership\n Protection","summary":" Graph Neural Networks (GNNs) are the mainstream method to learn pervasive\ngraph data and are widely deployed in industry, making their intellectual\nproperty valuable. However, protecting GNNs from unauthorized use remains a\nchallenge. Watermarking, which embeds ownership information into a model, is a\npotential solution. However, existing watermarking methods have two key\nlimitations: First, almost all of them focus on non-graph data, with\nwatermarking GNNs for complex graph data largely unexplored. Second, the de\nfacto backdoor-based watermarking methods pollute training data and induce\nownership ambiguity through intentional misclassification. Our\nexplanation-based watermarking inherits the strengths of backdoor-based methods\n(e.g., robust to watermark removal attacks), but avoids data pollution and\neliminates intentional misclassification. In particular, our method learns to\nembed the watermark in GNN explanations such that this unique watermark is\nstatistically distinct from other potential solutions, and ownership claims\nmust show statistical significance to be verified. 
We theoretically prove that,\neven with full knowledge of our method, locating the watermark is an NP-hard\nproblem. Empirically, our method manifests robustness to removal attacks like\nfine-tuning and pruning. By addressing these challenges, our approach marks a\nsignificant advancement in protecting GNN intellectual property.\n","authors":["Jane Downer","Ren Wang","Binghui Wang"],"pdf_url":"https://arxiv.org/pdf/2501.05614v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.10425v3","updated":"2025-01-09T22:46:26Z","published":"2024-12-10T16:34:47Z","title":"Active Inference for Self-Organizing Multi-LLM Systems: A Bayesian\n Thermodynamic Approach to Adaptation","summary":" This paper introduces a novel approach to creating adaptive language agents\nby integrating active inference with large language models (LLMs). While LLMs\ndemonstrate remarkable capabilities, their reliance on static prompts limits\nadaptation to new information and changing environments. We address this by\nimplementing an active inference framework that acts as a cognitive layer above\nan LLM-based agent, dynamically adjusting prompts and search strategies through\nprincipled information-seeking behavior. Our framework models the environment\nusing three state factors (prompt, search, and information states) with seven\nobservation modalities capturing quality metrics. By framing the agent's\nlearning through the free energy principle, we enable systematic exploration of\nprompt combinations and search strategies. Experimental results demonstrate the\neffectiveness of this approach, with the agent developing accurate models of\nenvironment dynamics evidenced by emergent structure in observation matrices.\nAction selection patterns reveal sophisticated exploration-exploitation\nbehavior, transitioning from initial information-gathering to targeted prompt\ntesting. 
The integration of thermodynamic principles with language model\ncapabilities provides a principled framework for creating robust, adaptable\nagents, extending active inference beyond traditional low-dimensional control\nproblems to high-dimensional, language-driven environments.\n","authors":["Rithvik Prakki"],"pdf_url":"https://arxiv.org/pdf/2412.10425v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.09889v3","updated":"2025-01-09T22:43:05Z","published":"2024-04-15T15:55:01Z","title":"Is Table Retrieval a Solved Problem? Exploring Join-Aware Multi-Table\n Retrieval","summary":" Retrieving relevant tables containing the necessary information to accurately\nanswer a given question over tables is critical to open-domain\nquestion-answering (QA) systems. Previous methods assume the answer to such a\nquestion can be found either in a single table or multiple tables identified\nthrough question decomposition or rewriting. However, neither of these\napproaches is sufficient, as many questions require retrieving multiple tables\nand joining them through a join plan that cannot be discerned from the user\nquery itself. If the join plan is not considered in the retrieval stage, the\nsubsequent steps of reasoning and answering based on those retrieved tables are\nlikely to be incorrect. To address this problem, we introduce a method that\nuncovers useful join relations for any query and database during table\nretrieval. We use a novel re-ranking method formulated as a mixed-integer\nprogram that considers not only table-query relevance but also table-table\nrelevance that requires inferring join relationships. Our method outperforms\nthe state-of-the-art approaches for table retrieval by up to 9.3% in F1 score\nand for end-to-end QA by up to 5.4% in accuracy.\n","authors":["Peter Baile Chen","Yi Zhang","Dan Roth"],"pdf_url":"https://arxiv.org/pdf/2404.09889v3.pdf","comment":"ACL 2024. 
Dataset and code are available at\n https://peterbaile.github.io/jar"},{"id":"http://arxiv.org/abs/2501.05605v1","updated":"2025-01-09T22:41:50Z","published":"2025-01-09T22:41:50Z","title":"Advancing Personalized Learning Analysis via an Innovative Domain\n Knowledge Informed Attention-based Knowledge Tracing Method","summary":" Emerging Knowledge Tracing (KT) models, particularly deep learning and\nattention-based Knowledge Tracing, have shown great potential in realizing\npersonalized learning analysis via prediction of students' future performance\nbased on their past interactions. The existing methods mainly focus on\nimmediate past interactions or individual concepts without accounting for\ndependencies between knowledge concept, referred as knowledge concept routes,\nthat can be critical to advance the understanding the students' learning\noutcomes. To address this, in this paper, we propose an innovative\nattention-based method by effectively incorporating the domain knowledge of\nknowledge concept routes in the given curriculum. Additionally, we leverage\nXES3G5M dataset, a benchmark dataset with rich auxiliary information for\nknowledge concept routes, to evaluate and compare the performance of our\nproposed method to the seven State-of-the-art (SOTA) deep learning models.\n","authors":["Shubham Kose","Jin Wei-Kocsis"],"pdf_url":"https://arxiv.org/pdf/2501.05605v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.17428v2","updated":"2025-01-09T22:27:06Z","published":"2024-05-27T17:59:45Z","title":"NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding\n Models","summary":" Decoder-only large language model (LLM)-based embedding models are beginning\nto outperform BERT or T5-based embedding models in general-purpose text\nembedding tasks, including dense vector-based retrieval. 
In this work, we\nintroduce the NV-Embed model, incorporating architectural designs, training\nprocedures, and curated datasets to significantly enhance the performance of\nLLM as a versatile embedding model, while maintaining its simplicity and\nreproducibility. For model architecture, we propose a latent attention layer to\nobtain pooled embeddings, which consistently improves retrieval and downstream\ntask accuracy compared to mean pooling or using the last token embedding\nfrom LLMs. To enhance representation learning, we remove the causal attention\nmask of LLMs during contrastive training. For training algorithm, we introduce\na two-stage contrastive instruction-tuning method. It first applies contrastive\ntraining with instructions on retrieval datasets, utilizing in-batch negatives\nand curated hard negative examples. At stage-2, it blends various non-retrieval\ninto instruction tuning, which not only enhances non-retrieval task accuracy\nbut also improves retrieval performance. For training data, we utilize the\nhard-negative mining, synthetic data generation and existing public available\ndatasets to boost the performance of embedding model. By combining these\ntechniques, our NV-Embed-v1 and NV-Embed-v2 models obtained the No.1 position\non the Massive Text Embedding Benchmark (MTEB) (as of May 24, 2024 and August\n30, 2024, respectively) across 56 embedding tasks, demonstrating the sustained\neffectiveness of the proposed methods over time. 
Additionally, it achieved the\nhighest scores in the Long Doc section and the second-highest scores in the QA\nsection of the AIR Benchmark, which covers a range of out-of-domain information\nretrieval topics beyond those in MTEB.\n","authors":["Chankyu Lee","Rajarshi Roy","Mengyao Xu","Jonathan Raiman","Mohammad Shoeybi","Bryan Catanzaro","Wei Ping"],"pdf_url":"https://arxiv.org/pdf/2405.17428v2.pdf","comment":"We open-source the model at:\n https://huggingface.co/nvidia/NV-Embed-v2"},{"id":"http://arxiv.org/abs/2303.17155v4","updated":"2025-01-09T22:23:15Z","published":"2023-03-30T05:25:20Z","title":"Discriminative Class Tokens for Text-to-Image Diffusion Models","summary":" Recent advances in text-to-image diffusion models have enabled the generation\nof diverse and high-quality images. While impressive, the images often fall\nshort of depicting subtle details and are susceptible to errors due to\nambiguity in the input text. One way of alleviating these issues is to train\ndiffusion models on class-labeled datasets. This approach has two\ndisadvantages: (i) supervised datasets are generally small compared to\nlarge-scale scraped text-image datasets on which text-to-image models are\ntrained, affecting the quality and diversity of the generated images, or (ii)\nthe input is a hard-coded label, as opposed to free-form text, limiting the\ncontrol over the generated images.\n In this work, we propose a non-invasive fine-tuning technique that\ncapitalizes on the expressive potential of free-form text while achieving high\naccuracy through discriminative signals from a pretrained classifier. This is\ndone by iteratively modifying the embedding of an added input token of a\ntext-to-image diffusion model, by steering generated images toward a given\ntarget class according to a classifier. Our method is fast compared to prior\nfine-tuning methods and does not require a collection of in-class images or\nretraining of a noise-tolerant classifier. 
We evaluate our method extensively,\nshowing that the generated images are: (i) more accurate and of higher quality\nthan standard diffusion models, (ii) can be used to augment training data in a\nlow-resource setting, and (iii) reveal information about the data used to train\nthe guiding classifier. The code is available at\n\\url{https://github.com/idansc/discriminative_class_tokens}.\n","authors":["Idan Schwartz","Vésteinn Snæbjarnarson","Hila Chefer","Ryan Cotterell","Serge Belongie","Lior Wolf","Sagie Benaim"],"pdf_url":"https://arxiv.org/pdf/2303.17155v4.pdf","comment":"ICCV 2023"},{"id":"http://arxiv.org/abs/2403.13257v3","updated":"2025-01-09T22:21:56Z","published":"2024-03-20T02:38:01Z","title":"Arcee's MergeKit: A Toolkit for Merging Large Language Models","summary":" The rapid expansion of the open-source language model landscape presents an\nopportunity to merge the competencies of these model checkpoints by combining\ntheir parameters. Advances in transfer learning, the process of fine-tuning\npretrained models for specific tasks, has resulted in the development of vast\namounts of task-specific models, typically specialized in individual tasks and\nunable to utilize each other's strengths. Model merging facilitates the\ncreation of multitask models without the need for additional training, offering\na promising avenue for enhancing model performance and versatility. By\npreserving the intrinsic capabilities of the original models, model merging\naddresses complex challenges in AI - including the difficulties of catastrophic\nforgetting and multitask learning. To support this expanding area of research,\nwe introduce MergeKit, a comprehensive, open-source library designed to\nfacilitate the application of model merging strategies. MergeKit offers an\nextensible framework to efficiently merge models on any hardware, providing\nutility to researchers and practitioners. 
To date, thousands of models have\nbeen merged by the open-source community, leading to the creation of some of\nthe worlds most powerful open-source model checkpoints, as assessed by the Open\nLLM Leaderboard. The library is accessible at\nhttps://github.com/arcee-ai/MergeKit.\n","authors":["Charles Goddard","Shamane Siriwardhana","Malikeh Ehghaghi","Luke Meyers","Vlad Karpukhin","Brian Benedict","Mark McQuade","Jacob Solawetz"],"pdf_url":"https://arxiv.org/pdf/2403.13257v3.pdf","comment":"11 pages, 4 figures"},{"id":"http://arxiv.org/abs/2408.06687v2","updated":"2025-01-09T22:14:55Z","published":"2024-08-13T07:27:02Z","title":"Masked Image Modeling: A Survey","summary":" In this work, we survey recent studies on masked image modeling (MIM), an\napproach that emerged as a powerful self-supervised learning technique in\ncomputer vision. The MIM task involves masking some information, e.g.~pixels,\npatches, or even latent representations, and training a model, usually an\nautoencoder, to predicting the missing information by using the context\navailable in the visible part of the input. We identify and formalize two\ncategories of approaches on how to implement MIM as a pretext task, one based\non reconstruction and one based on contrastive learning. Then, we construct a\ntaxonomy and review the most prominent papers in recent years. We complement\nthe manually constructed taxonomy with a dendrogram obtained by applying a\nhierarchical clustering algorithm. We further identify relevant clusters via\nmanually inspecting the resulting dendrogram. Our review also includes datasets\nthat are commonly used in MIM research. We aggregate the performance results of\nvarious masked image modeling methods on the most popular datasets, to\nfacilitate the comparison of competing methods. Finally, we identify research\ngaps and propose several interesting directions of future work. 
We supplement\nour survey with the following public repository containing organized\nreferences: https://github.com/vladhondru25/MIM-Survey.\n","authors":["Vlad Hondru","Florinel Alin Croitoru","Shervin Minaee","Radu Tudor Ionescu","Nicu Sebe"],"pdf_url":"https://arxiv.org/pdf/2408.06687v2.pdf","comment":"Revised version"},{"id":"http://arxiv.org/abs/2404.18731v3","updated":"2025-01-09T22:10:14Z","published":"2024-04-29T14:17:52Z","title":"Real Time Multi Organ Classification on Computed Tomography Images","summary":" Organ segmentation is a fundamental task in medical imaging since it is\nuseful for many clinical automation pipelines. However, some tasks do not\nrequire full segmentation. Instead, a classifier can identify the selected\norgan without segmenting the entire volume. In this study, we demonstrate a\nclassifier based method to obtain organ labels in real time by using a large\ncontext size with a sparse data sampling strategy. Although our method operates\nas an independent classifier at query locations, it can generate full\nsegmentations by querying grid locations at any resolution, offering faster\nperformance than segmentation algorithms. We compared our method with existing\nsegmentation techniques, demonstrating its superior runtime potential for\npractical applications in medical imaging.\n","authors":["Halid Ziya Yerebakan","Yoshihisa Shinagawa","Gerardo Hermosillo Valadez"],"pdf_url":"https://arxiv.org/pdf/2404.18731v3.pdf","comment":"11 pages, Organ Classification, Organ Segmentation"},{"id":"http://arxiv.org/abs/2411.08745v3","updated":"2025-01-09T21:53:56Z","published":"2024-11-13T16:26:19Z","title":"Separating Tongue from Thought: Activation Patching Reveals\n Language-Agnostic Concept Representations in Transformers","summary":" A central question in multilingual language modeling is whether large\nlanguage models (LLMs) develop a universal concept representation, disentangled\nfrom specific languages. 
In this paper, we address this question by analyzing\nlatent representations (latents) during a word translation task in\ntransformer-based LLMs. We strategically extract latents from a source\ntranslation prompt and insert them into the forward pass on a target\ntranslation prompt. By doing so, we find that the output language is encoded in\nthe latent at an earlier layer than the concept to be translated. Building on\nthis insight, we conduct two key experiments. First, we demonstrate that we can\nchange the concept without changing the language and vice versa through\nactivation patching alone. Second, we show that patching with the mean over\nlatents across different languages does not impair and instead improves the\nmodels' performance in translating the concept. Our results provide evidence\nfor the existence of language-agnostic concept representations within the\ninvestigated models.\n","authors":["Clément Dumas","Chris Wendler","Veniamin Veselovsky","Giovanni Monea","Robert West"],"pdf_url":"https://arxiv.org/pdf/2411.08745v3.pdf","comment":"18 pages, 14 figures, previous version published under the title \"How\n Do Llamas Process Multilingual Text? A Latent Exploration through Activation\n Patching\" at the ICML 2024 mechanistic interpretability workshop at\n https://openreview.net/forum?id=0ku2hIm4BS"},{"id":"http://arxiv.org/abs/2501.05567v1","updated":"2025-01-09T20:34:36Z","published":"2025-01-09T20:34:36Z","title":"Approximate Supervised Object Distance Estimation on Unmanned Surface\n Vehicles","summary":" Unmanned surface vehicles (USVs) and boats are increasingly important in\nmaritime operations, yet their deployment is limited due to costly sensors and\ncomplexity. LiDAR, radar, and depth cameras are either costly, yield sparse\npoint clouds or are noisy, and require extensive calibration. Here, we\nintroduce a novel approach for approximate distance estimation in USVs using\nsupervised object detection. 
We collected a dataset comprising images with\nmanually annotated bounding boxes and corresponding distance measurements.\nLeveraging this data, we propose a specialized branch of an object detection\nmodel, not only to detect objects but also to predict their distances from the\nUSV. This method offers a cost-efficient and intuitive alternative to\nconventional distance measurement techniques, aligning more closely with human\nestimation capabilities. We demonstrate its application in a marine assistance\nsystem that alerts operators to nearby objects such as boats, buoys, or other\nwaterborne hazards.\n","authors":["Benjamin Kiefer","Yitong Quan","Andreas Zell"],"pdf_url":"https://arxiv.org/pdf/2501.05567v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05566v1","updated":"2025-01-09T20:29:31Z","published":"2025-01-09T20:29:31Z","title":"Vision-Language Models for Autonomous Driving: CLIP-Based Dynamic Scene\n Understanding","summary":" Scene understanding is essential for enhancing driver safety, generating\nhuman-centric explanations for Automated Vehicle (AV) decisions, and leveraging\nArtificial Intelligence (AI) for retrospective driving video analysis. This\nstudy developed a dynamic scene retrieval system using Contrastive\nLanguage-Image Pretraining (CLIP) models, which can be optimized for real-time\ndeployment on edge devices. The proposed system outperforms state-of-the-art\nin-context learning methods, including the zero-shot capabilities of GPT-4o,\nparticularly in complex scenarios. By conducting frame-level analysis on the\nHonda Scenes Dataset, which contains a collection of about 80 hours of\nannotated driving videos capturing diverse real-world road and weather\nconditions, our study highlights the robustness of CLIP models in learning\nvisual concepts from natural language supervision. 
Results also showed that\nfine-tuning the CLIP models, such as ViT-L/14 and ViT-B/32, significantly\nimproved scene classification, achieving a top F1 score of 91.1%. These results\ndemonstrate the ability of the system to deliver rapid and precise scene\nrecognition, which can be used to meet the critical requirements of Advanced\nDriver Assistance Systems (ADAS). This study shows the potential of CLIP models\nto provide scalable and efficient frameworks for dynamic scene understanding\nand classification. Furthermore, this work lays the groundwork for advanced\nautonomous vehicle technologies by fostering a deeper understanding of driver\nbehavior, road conditions, and safety-critical scenarios, marking a significant\nstep toward smarter, safer, and more context-aware autonomous driving systems.\n","authors":["Mohammed Elhenawy","Huthaifa I. Ashqar","Andry Rakotonirainy","Taqwa I. Alhadidi","Ahmed Jaber","Mohammad Abu Tami"],"pdf_url":"https://arxiv.org/pdf/2501.05566v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.09566v3","updated":"2025-01-09T20:24:46Z","published":"2024-09-15T00:53:44Z","title":"Learning Transferable Features for Implicit Neural Representations","summary":" Implicit neural representations (INRs) have demonstrated success in a variety\nof applications, including inverse problems and neural rendering. An INR is\ntypically trained to capture one signal of interest, resulting in learned\nneural features that are highly attuned to that signal. Assumed to be less\ngeneralizable, we explore the aspect of transferability of such learned neural\nfeatures for fitting similar signals. We introduce a new INR training\nframework, STRAINER that learns transferrable features for fitting INRs to new\nsignals from a given distribution, faster and with better reconstruction\nquality. 
Owing to the sequential layer-wise affine operations in an INR, we\npropose to learn transferable representations by sharing initial encoder layers\nacross multiple INRs with independent decoder layers. At test time, the learned\nencoder representations are transferred as initialization for an otherwise\nrandomly initialized INR. We find STRAINER to yield extremely powerful\ninitialization for fitting images from the same domain and allow for $\\approx\n+10dB$ gain in signal quality early on compared to an untrained INR itself.\nSTRAINER also provides a simple way to encode data-driven priors in INRs. We\nevaluate STRAINER on multiple in-domain and out-of-domain signal fitting tasks\nand inverse problems and further provide detailed analysis and discussion on\nthe transferability of STRAINER's features. Our demo can be accessed at\nhttps://kushalvyas.github.io/strainer.html .\n","authors":["Kushal Vyas","Ahmed Imtiaz Humayun","Aniket Dashpute","Richard G. Baraniuk","Ashok Veeraraghavan","Guha Balakrishnan"],"pdf_url":"https://arxiv.org/pdf/2409.09566v3.pdf","comment":"Project Website: https://kushalvyas.github.io/strainer.html"},{"id":"http://arxiv.org/abs/2412.14194v3","updated":"2025-01-09T20:16:41Z","published":"2024-12-12T23:42:46Z","title":"Detecting Cognitive Impairment and Psychological Well-being among Older\n Adults Using Facial, Acoustic, Linguistic, and Cardiovascular Patterns\n Derived from Remote Conversations","summary":" The aging society urgently requires scalable methods to monitor cognitive\ndecline and identify social and psychological factors indicative of dementia\nrisk in older adults. Our machine learning (ML) models captured facial,\nacoustic, linguistic, and cardiovascular features from 39 individuals with\nnormal cognition or Mild Cognitive Impairment derived from remote video\nconversations and classified cognitive status, social isolation, neuroticism,\nand psychological well-being. 
Our model could distinguish Clinical Dementia\nRating Scale (CDR) of 0.5 (vs. 0) with 0.78 area under the receiver operating\ncharacteristic curve (AUC), social isolation with 0.75 AUC, neuroticism with\n0.71 AUC, and negative affect scales with 0.79 AUC. Recent advances in machine\nlearning offer new opportunities to remotely detect cognitive impairment and\nassess associated factors, such as neuroticism and psychological well-being.\nOur experiment showed that speech and language patterns were more useful for\nquantifying cognitive impairment, whereas facial expression and cardiovascular\npatterns using photoplethysmography (PPG) were more useful for quantifying\npersonality and psychological well-being.\n","authors":["Xiaofan Mu","Salman Seyedi","Iris Zheng","Zifan Jiang","Liu Chen","Bolaji Omofojoye","Rachel Hershenberg","Allan I. Levey","Gari D. Clifford","Hiroko H. Dodge","Hyeokhyen Kwon"],"pdf_url":"https://arxiv.org/pdf/2412.14194v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05559v1","updated":"2025-01-09T20:11:08Z","published":"2025-01-09T20:11:08Z","title":"Soup to go: mitigating forgetting during continual learning with model\n averaging","summary":" In continual learning, where task data arrives in a sequence, fine-tuning on\nlater tasks will often lead to performance degradation on earlier tasks. This\nis especially pronounced when these tasks come from diverse domains. In this\nsetting, how can we mitigate catastrophic forgetting of earlier tasks and\nretain what the model has learned with minimal computational expenses? Inspired\nby other merging methods, and L2-regression, we propose Sequential Fine-tuning\nwith Averaging (SFA), a method that merges currently training models with\nearlier checkpoints during the course of training. 
SOTA approaches typically\nmaintain a data buffer of past tasks or impose a penalty at each gradient step.\nIn contrast, our method achieves comparable results without the need to store\npast data, or multiple copies of parameters for each gradient step.\nFurthermore, our method outperforms common merging techniques such as Task\nArithmetic, TIES Merging, and WiSE-FT, as well as other penalty methods like L2\nand Elastic Weight Consolidation. In turn, our method offers insight into the\nbenefits of merging partially-trained models during training across both image\nand language domains.\n","authors":["Anat Kleiman","Gintare Karolina Dziugaite","Jonathan Frankle","Sham Kakade","Mansheej Paul"],"pdf_url":"https://arxiv.org/pdf/2501.05559v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.13969v2","updated":"2025-01-09T20:08:31Z","published":"2023-08-26T22:48:06Z","title":"Gaze-Informed Vision Transformers: Predicting Driving Decisions Under\n Uncertainty","summary":" Vision Transformers (ViT) have advanced computer vision, yet their efficacy\nin complex tasks like driving remains less explored. This study enhances ViT by\nintegrating human eye gaze, captured via eye-tracking, to increase prediction\naccuracy in driving scenarios under uncertainty in both real-world and virtual\nreality scenarios. First, we establish the significance of human eye gaze in\nleft-right driving decisions, as observed in both human subjects and a ViT\nmodel. By comparing the similarity between human fixation maps and ViT\nattention weights, we reveal the dynamics of overlap across individual heads\nand layers. This overlap demonstrates that fixation data can guide the model in\ndistributing its attention weights more effectively. We introduce the\nfixation-attention intersection (FAX) loss, a novel loss function that\nsignificantly improves ViT performance under high uncertainty conditions. 
Our\nresults show that ViT, when trained with FAX loss, aligns its attention with\nhuman gaze patterns. This gaze-informed approach has significant potential for\ndriver behavior analysis, as well as broader applications in human-centered AI\nsystems, extending ViT's use to complex visual environments.\n","authors":["Sharath Koorathota","Nikolas Papadopoulos","Jia Li Ma","Shruti Kumar","Xiaoxiao Sun","Arunesh Mittal","Patrick Adelman","Paul Sajda"],"pdf_url":"https://arxiv.org/pdf/2308.13969v2.pdf","comment":"25 pages, 9 figures, 3 tables"},{"id":"http://arxiv.org/abs/2501.05555v1","updated":"2025-01-09T20:02:10Z","published":"2025-01-09T20:02:10Z","title":"Improving Zero-Shot Object-Level Change Detection by Incorporating\n Visual Correspondence","summary":" Detecting object-level changes between two images across possibly different\nviews is a core task in many applications that involve visual inspection or\ncamera surveillance. Existing change-detection approaches suffer from three\nmajor limitations: (1) lack of evaluation on image pairs that contain no\nchanges, leading to unreported false positive rates; (2) lack of\ncorrespondences (\\ie, localizing the regions before and after a change); and\n(3) poor zero-shot generalization across different domains. To address these\nissues, we introduce a novel method that leverages change correspondences (a)\nduring training to improve change detection accuracy, and (b) at test time, to\nminimize false positives. That is, we harness the supervision labels of where\nan object is added or removed to supervise change detectors, improving their\naccuracy over previous work by a large margin. Our work is also the first to\npredict correspondences between pairs of detected changes using estimated\nhomography and the Hungarian algorithm. 
Our model demonstrates superior\nperformance over existing methods, achieving state-of-the-art results in change\ndetection and change correspondence accuracy across both in-distribution and\nzero-shot benchmarks.\n","authors":["Hung Huy Nguyen","Pooyan Rahmanzadehgervi","Long Mail","Anh Totti Nguyen"],"pdf_url":"https://arxiv.org/pdf/2501.05555v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05554v1","updated":"2025-01-09T20:01:15Z","published":"2025-01-09T20:01:15Z","title":"LLMQuoter: Enhancing RAG Capabilities Through Efficient Quote Extraction\n From Large Contexts","summary":" We introduce LLMQuoter, a lightweight, distillation-based model designed to\nenhance Retrieval Augmented Generation (RAG) by extracting the most relevant\ntextual evidence for downstream reasoning tasks. Built on the LLaMA-3B\narchitecture and fine-tuned with Low-Rank Adaptation (LoRA) on a 15,000-sample\nsubset of HotpotQA, LLMQuoter adopts a \"quote-first-then-answer\" strategy,\nefficiently identifying key quotes before passing curated snippets to reasoning\nmodels. This workflow reduces cognitive overhead and outperforms full-context\napproaches like Retrieval-Augmented Fine-Tuning (RAFT), achieving over 20-point\naccuracy gains across both small and large language models. By leveraging\nknowledge distillation from a high-performing teacher model, LLMQuoter achieves\ncompetitive results in a resource-efficient fine-tuning setup. It democratizes\nadvanced RAG capabilities, delivering significant performance improvements\nwithout requiring extensive model retraining. 
Our results highlight the\npotential of distilled quote-based reasoning to streamline complex workflows,\noffering a scalable and practical solution for researchers and practitioners\nalike.\n","authors":["Yuri Facanha Bezerra","Li Weigang"],"pdf_url":"https://arxiv.org/pdf/2501.05554v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.00190v2","updated":"2025-01-09T20:00:16Z","published":"2024-12-31T00:02:07Z","title":"SepsisCalc: Integrating Clinical Calculators into Early Sepsis\n Prediction via Dynamic Temporal Graph Construction","summary":" Sepsis is an organ dysfunction caused by a deregulated immune response to an\ninfection. Early sepsis prediction and identification allow for timely\nintervention, leading to improved clinical outcomes. Clinical calculators\n(e.g., the six-organ dysfunction assessment of SOFA) play a vital role in\nsepsis identification within clinicians' workflow, providing evidence-based\nrisk assessments essential for sepsis diagnosis. However, artificial\nintelligence (AI) sepsis prediction models typically generate a single sepsis\nrisk score without incorporating clinical calculators for assessing organ\ndysfunctions, making the models less convincing and transparent to clinicians.\nTo bridge the gap, we propose to mimic clinicians' workflow with a novel\nframework SepsisCalc to integrate clinical calculators into the predictive\nmodel, yielding a clinically transparent and precise model for utilization in\nclinical settings. Practically, clinical calculators usually combine\ninformation from multiple component variables in Electronic Health Records\n(EHR), and might not be applicable when the variables are (partially) missing.\nWe mitigate this issue by representing EHRs as temporal graphs and integrating\na learning module to dynamically add the accurately estimated calculator to the\ngraphs. 
Experimental results on real-world datasets show that the proposed\nmodel outperforms state-of-the-art methods on sepsis prediction tasks.\nMoreover, we developed a system to identify organ dysfunctions and potential\nsepsis risks, providing a human-AI interaction tool for deployment, which can\nhelp clinicians understand the prediction outputs and prepare timely\ninterventions for the corresponding dysfunctions, paving the way for actionable\nclinical decision-making support for early intervention.\n","authors":["Changchang Yin","Shihan Fu","Bingsheng Yao","Thai-Hoang Pham","Weidan Cao","Dakuo Wang","Jeffrey Caterino","Ping Zhang"],"pdf_url":"https://arxiv.org/pdf/2501.00190v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05552v1","updated":"2025-01-09T19:56:44Z","published":"2025-01-09T19:56:44Z","title":"The dynamics of meaning through time: Assessment of Large Language\n Models","summary":" Understanding how large language models (LLMs) grasp the historical context\nof concepts and their semantic evolution is essential in advancing artificial\nintelligence and linguistic studies. This study aims to evaluate the\ncapabilities of various LLMs in capturing temporal dynamics of meaning,\nspecifically how they interpret terms across different time periods. We analyze\na diverse set of terms from multiple domains, using tailored prompts and\nmeasuring responses through both objective metrics (e.g., perplexity and word\ncount) and subjective human expert evaluations. Our comparative analysis\nincludes prominent models like ChatGPT, GPT-4, Claude, Bard, Gemini, and Llama.\nFindings reveal marked differences in each model's handling of historical\ncontext and semantic shifts, highlighting both strengths and limitations in\ntemporal semantic understanding. 
These insights offer a foundation for refining\nLLMs to better address the evolving nature of language, with implications for\nhistorical text analysis, AI design, and applications in digital humanities.\n","authors":["Mohamed Taher Alrefaie","Fatty Salem","Nour Eldin Morsy","Nada Samir","Mohamed Medhat Gaber"],"pdf_url":"https://arxiv.org/pdf/2501.05552v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.02075v2","updated":"2025-01-09T19:54:53Z","published":"2023-04-04T18:58:16Z","title":"GUTS: Generalized Uncertainty-Aware Thompson Sampling for Multi-Agent\n Active Search","summary":" Robotic solutions for quick disaster response are essential to ensure minimal\nloss of life, especially when the search area is too dangerous or too vast for\nhuman rescuers. We model this problem as an asynchronous multi-agent\nactive-search task where each robot aims to efficiently seek objects of\ninterest (OOIs) in an unknown environment. This formulation addresses the\nrequirement that search missions should focus on quick recovery of OOIs rather\nthan full coverage of the search region. Previous approaches fail to accurately\nmodel sensing uncertainty, account for occlusions due to foliage or terrain, or\nconsider the requirement for heterogeneous search teams and robustness to\nhardware and communication failures. We present the Generalized\nUncertainty-aware Thompson Sampling (GUTS) algorithm, which addresses these\nissues and is suitable for deployment on heterogeneous multi-robot systems for\nactive search in large unstructured environments. We show through simulation\nexperiments that GUTS consistently outperforms existing methods such as\nparallelized Thompson Sampling and exhaustive search, recovering all OOIs in\n80% of all runs. In contrast, existing approaches recover all OOIs in less than\n40% of all runs. We conduct field tests using our multi-robot system in an\nunstructured environment with a search area of approximately 75,000 sq. m. 
Our\nsystem demonstrates robustness to various failure modes, achieving full\nrecovery of OOIs (where feasible) in every field run, and significantly\noutperforming our baseline.\n","authors":["Nikhil Angad Bakshi","Tejus Gupta","Ramina Ghods","Jeff Schneider"],"pdf_url":"https://arxiv.org/pdf/2304.02075v2.pdf","comment":"7 pages, 5 figures, 1 table, for associated video see:\n https://youtu.be/K0jkzdQ_j2E , published in International Conference on\n Robotics and Automation (ICRA) 2023. Outstanding Deployed Systems Paper\n Winner"},{"id":"http://arxiv.org/abs/2412.08755v2","updated":"2025-01-09T19:15:20Z","published":"2024-12-11T19:54:14Z","title":"Proactive Adversarial Defense: Harnessing Prompt Tuning in\n Vision-Language Models to Detect Unseen Backdoored Images","summary":" Backdoor attacks pose a critical threat by embedding hidden triggers into\ninputs, causing models to misclassify them into target labels. While extensive\nresearch has focused on mitigating these attacks in object recognition models\nthrough weight fine-tuning, much less attention has been given to detecting\nbackdoored samples directly. Given the vast datasets used in training, manual\ninspection for backdoor triggers is impractical, and even state-of-the-art\ndefense mechanisms fail to fully neutralize their impact. To address this gap,\nwe introduce a groundbreaking method to detect unseen backdoored images during\nboth training and inference. 
Leveraging the transformative success of prompt\ntuning in Vision Language Models (VLMs), our approach trains learnable text\nprompts to differentiate clean images from those with hidden backdoor triggers.\nExperiments demonstrate the exceptional efficacy of this method, achieving an\nimpressive average accuracy of 86% across two renowned datasets for detecting\nunseen backdoor triggers, establishing a new standard in backdoor defense.\n","authors":["Kyle Stein","Andrew Arash Mahyari","Guillermo Francia","Eman El-Sheikh"],"pdf_url":"https://arxiv.org/pdf/2412.08755v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05510v1","updated":"2025-01-09T19:00:01Z","published":"2025-01-09T19:00:01Z","title":"OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video\n Understanding?","summary":" Temporal Awareness, the ability to reason dynamically based on the timestamp\nwhen a question is raised, is the key distinction between offline and online\nvideo LLMs. Unlike offline models, which rely on complete videos for static,\npost hoc analysis, online models process video streams incrementally and\ndynamically adapt their responses based on the timestamp at which the question\nis posed. Despite its significance, temporal awareness has not been adequately\nevaluated in existing benchmarks. To fill this gap, we present OVO-Bench\n(Online-VideO-Benchmark), a novel video benchmark that emphasizes the\nimportance of timestamps for advanced online video understanding capability\nbenchmarking. OVO-Bench evaluates the ability of video LLMs to reason and\nrespond to events occurring at specific timestamps under three distinct\nscenarios: (1) Backward tracing: trace back to past events to answer the\nquestion. (2) Real-time understanding: understand and respond to events as they\nunfold at the current timestamp. (3) Forward active responding: delay the\nresponse until sufficient future information becomes available to answer the\nquestion accurately. 
OVO-Bench comprises 12 tasks, featuring 644 unique videos\nand approximately human-curated 2,800 fine-grained meta-annotations with\nprecise timestamps. We combine automated generation pipelines with human\ncuration. With these high-quality samples, we further developed an evaluation\npipeline to systematically query video LLMs along the video timeline.\nEvaluations of nine Video-LLMs reveal that, despite advancements on traditional\nbenchmarks, current models struggle with online video understanding, showing a\nsignificant gap compared to human agents. We hope OVO-Bench will drive progress\nin video LLMs and inspire future research in online video reasoning. Our\nbenchmark and code can be accessed at https://github.com/JoeLeelyf/OVO-Bench.\n","authors":["Yifei Li","Junbo Niu","Ziyang Miao","Chunjiang Ge","Yuanhang Zhou","Qihao He","Xiaoyi Dong","Haodong Duan","Shuangrui Ding","Rui Qian","Pan Zhang","Yuhang Zang","Yuhang Cao","Conghui He","Jiaqi Wang"],"pdf_url":"https://arxiv.org/pdf/2501.05510v1.pdf","comment":"28 pages"},{"id":"http://arxiv.org/abs/2501.05501v1","updated":"2025-01-09T18:43:05Z","published":"2025-01-09T18:43:05Z","title":"Strategy Masking: A Method for Guardrails in Value-based Reinforcement\n Learning Agents","summary":" The use of reward functions to structure AI learning and decision making is\ncore to the current reinforcement learning paradigm; however, without careful\ndesign of reward functions, agents can learn to solve problems in ways that may\nbe considered ``undesirable\" or ``unethical. Without thorough understanding of\nthe incentives a reward function creates, it can be difficult to impose\nprincipled yet general control mechanisms over its behavior. In this paper, we\nstudy methods for constructing guardrails for AI agents that use reward\nfunctions to learn decision making. We introduce a novel approach, which we\ncall strategy masking, to explicitly learn and then suppress undesirable AI\nagent behavior. 
We apply our method to study lying in AI agents and show that\nstrategy masking can effectively modify agent behavior by suppressing, or\nactively penalizing, the reward dimension for lying such that agents act more\nhonestly while not compromising their ability to perform effectively.\n","authors":["Jonathan Keane","Sam Keyser","Jeremy Kedziora"],"pdf_url":"https://arxiv.org/pdf/2501.05501v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05497v1","updated":"2025-01-09T17:20:00Z","published":"2025-01-09T17:20:00Z","title":"Spatial Information Integration in Small Language Models for Document\n Layout Generation and Classification","summary":" Document layout understanding is a field of study that analyzes the spatial\narrangement of information in a document hoping to understand its structure and\nlayout. Models such as LayoutLM (and its subsequent iterations) can understand\nsemi-structured documents with SotA results; however, the lack of open\nsemi-structured data is a limitation in itself. While semi-structured data is\ncommon in everyday life (balance sheets, purchase orders, receipts), there is a\nlack of public datasets for training machine learning models for this type of\ndocument. In this investigation we propose a method to generate new, synthetic,\nlayout information that can help overcoming this data shortage. According to\nour results, the proposed method performs better than LayoutTransformer,\nanother popular layout generation method. We also show that, in some scenarios,\ntext classification can improve when supported by bounding box information.\n","authors":["Pablo Melendez","Clemens Havas"],"pdf_url":"https://arxiv.org/pdf/2501.05497v1.pdf","comment":"8 pages. 
Symposium on Applied Computing 2025"},{"id":"http://arxiv.org/abs/2501.05496v1","updated":"2025-01-09T16:10:03Z","published":"2025-01-09T16:10:03Z","title":"FedSA: A Unified Representation Learning via Semantic Anchors for\n Prototype-based Federated Learning","summary":" Prototype-based federated learning has emerged as a promising approach that\nshares lightweight prototypes to transfer knowledge among clients with data\nheterogeneity in a model-agnostic manner. However, existing methods often\ncollect prototypes directly from local models, which inevitably introduce\ninconsistencies into representation learning due to the biased data\ndistributions and differing model architectures among clients. In this paper,\nwe identify that both statistical and model heterogeneity create a vicious\ncycle of representation inconsistency, classifier divergence, and skewed\nprototype alignment, which negatively impacts the performance of clients. To\nbreak the vicious cycle, we propose a novel framework named Federated Learning\nvia Semantic Anchors (FedSA) to decouple the generation of prototypes from\nlocal representation learning. We introduce a novel perspective that uses\nsimple yet effective semantic anchors serving as prototypes to guide local\nmodels in learning consistent representations. By incorporating semantic\nanchors, we further propose anchor-based regularization with margin-enhanced\ncontrastive learning and anchor-based classifier calibration to correct feature\nextractors and calibrate classifiers across clients, achieving intra-class\ncompactness and inter-class separability of prototypes while ensuring\nconsistent decision boundaries. We then update the semantic anchors with these\nconsistent and discriminative prototypes, which iteratively encourage clients\nto collaboratively learn a unified data representation with robust\ngeneralization. 
Extensive experiments under both statistical and model\nheterogeneity settings show that FedSA significantly outperforms existing\nprototype-based FL methods on various classification tasks.\n","authors":["Yanbing Zhou","Xiangmou Qu","Chenlong You","Jiyang Zhou","Jingyue Tang","Xin Zheng","Chunmao Cai","Yingbo Wu"],"pdf_url":"https://arxiv.org/pdf/2501.05496v1.pdf","comment":"Accepted by AAAI2025"},{"id":"http://arxiv.org/abs/2501.05495v1","updated":"2025-01-09T15:47:30Z","published":"2025-01-09T15:47:30Z","title":"LSEBMCL: A Latent Space Energy-Based Model for Continual Learning","summary":" Continual learning has become essential in many practical applications such\nas online news summaries and product classification. The primary challenge is\nknown as catastrophic forgetting, a phenomenon where a model inadvertently\ndiscards previously learned knowledge when it is trained on new tasks. Existing\nsolutions involve storing exemplars from previous classes, regularizing\nparameters during the fine-tuning process, or assigning different model\nparameters to each task. The proposed solution LSEBMCL (Latent Space\nEnergy-Based Model for Continual Learning) in this work is to use energy-based\nmodels (EBMs) to prevent catastrophic forgetting by sampling data points from\nprevious tasks when training on new ones. The EBM is a machine learning model\nthat associates an energy value with each input data point. The proposed method\nuses an EBM layer as an outer-generator in the continual learning framework for\nNLP tasks. 
The study demonstrates the efficacy of EBM in NLP tasks, achieving\nstate-of-the-art results in all experiments.\n","authors":["Xiaodi Li","Dingcheng Li","Rujun Gao","Mahmoud Zamani","Latifur Khan"],"pdf_url":"https://arxiv.org/pdf/2501.05495v1.pdf","comment":"In the 7th International Conference on Artificial Intelligence in\n Information and Communication (ICAIIC 2025)"},{"id":"http://arxiv.org/abs/2501.05238v1","updated":"2025-01-09T13:44:15Z","published":"2025-01-09T13:44:15Z","title":"FOCUS: Towards Universal Foreground Segmentation","summary":" Foreground segmentation is a fundamental task in computer vision,\nencompassing various subdivision tasks. Previous research has typically\ndesigned task-specific architectures for each task, leading to a lack of\nunification. Moreover, they primarily focus on recognizing foreground objects\nwithout effectively distinguishing them from the background. In this paper, we\nemphasize the importance of the background and its relationship with the\nforeground. We introduce FOCUS, the Foreground ObjeCts Universal Segmentation\nframework that can handle multiple foreground tasks. We develop a multi-scale\nsemantic network using the edge information of objects to enhance image\nfeatures. To achieve boundary-aware segmentation, we propose a novel\ndistillation method, integrating the contrastive learning strategy to refine\nthe prediction mask in multi-modal feature space. 
We conduct extensive\nexperiments on a total of 13 datasets across 5 tasks, and the results\ndemonstrate that FOCUS consistently outperforms the state-of-the-art\ntask-specific models on most metrics.\n","authors":["Zuyao You","Lingyu Kong","Lingchen Meng","Zuxuan Wu"],"pdf_url":"https://arxiv.org/pdf/2501.05238v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05057v1","updated":"2025-01-09T08:28:16Z","published":"2025-01-09T08:28:16Z","title":"LearningFlow: Automated Policy Learning Workflow for Urban Driving with\n Large Language Models","summary":" Recent advancements in reinforcement learning (RL) demonstrate the\nsignificant potential in autonomous driving. Despite this promise, challenges\nsuch as the manual design of reward functions and low sample efficiency in\ncomplex environments continue to impede the development of safe and effective\ndriving policies. To tackle these issues, we introduce LearningFlow, an\ninnovative automated policy learning workflow tailored to urban driving. This\nframework leverages the collaboration of multiple large language model (LLM)\nagents throughout the RL training process. LearningFlow includes a curriculum\nsequence generation process and a reward generation process, which work in\ntandem to guide the RL policy by generating tailored training curricula and\nreward functions. Particularly, each process is supported by an analysis agent\nthat evaluates training progress and provides critical insights to the\ngeneration agent. Through the collaborative efforts of these LLM agents,\nLearningFlow automates policy learning across a series of complex driving\ntasks, and it significantly reduces the reliance on manual reward function\ndesign while enhancing sample efficiency. 
Comprehensive experiments are\nconducted in the high-fidelity CARLA simulator, along with comparisons with\nother existing methods, to demonstrate the efficacy of our proposed approach.\nThe results demonstrate that LearningFlow excels in generating rewards and\ncurricula. It also achieves superior performance and robust generalization\nacross various driving tasks, as well as commendable adaptation to different RL\nalgorithms.\n","authors":["Zengqi Peng","Yubin Wang","Xu Han","Lei Zheng","Jun Ma"],"pdf_url":"https://arxiv.org/pdf/2501.05057v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05490v1","updated":"2025-01-09T07:36:28Z","published":"2025-01-09T07:36:28Z","title":"Interpretable deep learning illuminates multiple structures fluorescence\n imaging: a path toward trustworthy artificial intelligence in microscopy","summary":" Live-cell imaging of multiple subcellular structures is essential for\nunderstanding subcellular dynamics. However, the conventional multi-color\nsequential fluorescence microscopy suffers from significant imaging delays and\nlimited number of subcellular structure separate labeling, resulting in\nsubstantial limitations for real-time live-cell research applications. Here, we\npresent the Adaptive Explainable Multi-Structure Network (AEMS-Net), a\ndeep-learning framework that enables simultaneous prediction of two subcellular\nstructures from a single image. The model normalizes staining intensity and\nprioritizes critical image features by integrating attention mechanisms and\nbrightness adaptation layers. Leveraging the Kolmogorov-Arnold representation\ntheorem, our model decomposes learned features into interpretable univariate\nfunctions, enhancing the explainability of complex subcellular morphologies. We\ndemonstrate that AEMS-Net allows real-time recording of interactions between\nmitochondria and microtubules, requiring only half the conventional\nsequential-channel imaging procedures. 
Notably, this approach achieves over 30%\nimprovement in imaging quality compared to traditional deep learning methods,\nestablishing a new paradigm for long-term, interpretable live-cell imaging that\nadvances the ability to explore subcellular dynamics.\n","authors":["Mingyang Chen","Luhong Jin","Xuwei Xuan","Defu Yang","Yun Cheng","Ju Zhang"],"pdf_url":"https://arxiv.org/pdf/2501.05490v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04952v1","updated":"2025-01-09T03:59:10Z","published":"2025-01-09T03:59:10Z","title":"Open Problems in Machine Unlearning for AI Safety","summary":" As AI systems become more capable, widely deployed, and increasingly\nautonomous in critical areas such as cybersecurity, biological research, and\nhealthcare, ensuring their safety and alignment with human values is paramount.\nMachine unlearning -- the ability to selectively forget or suppress specific\ntypes of knowledge -- has shown promise for privacy and data removal tasks,\nwhich has been the primary focus of existing research. More recently, its\npotential application to AI safety has gained attention. In this paper, we\nidentify key limitations that prevent unlearning from serving as a\ncomprehensive solution for AI safety, particularly in managing dual-use\nknowledge in sensitive domains like cybersecurity and chemical, biological,\nradiological, and nuclear (CBRN) safety. In these contexts, information can be\nboth beneficial and harmful, and models may combine seemingly harmless\ninformation for harmful purposes -- unlearning this information could strongly\naffect beneficial uses. We provide an overview of inherent constraints and open\nproblems, including the broader side effects of unlearning dangerous knowledge,\nas well as previously unexplored tensions between unlearning and existing\nsafety mechanisms. Finally, we investigate challenges related to evaluation,\nrobustness, and the preservation of safety features during unlearning. 
By\nmapping these limitations and open challenges, we aim to guide future research\ntoward realistic applications of unlearning within a broader AI safety\nframework, acknowledging its limitations and highlighting areas where\nalternative approaches may be required.\n","authors":["Fazl Barez","Tingchen Fu","Ameya Prabhu","Stephen Casper","Amartya Sanyal","Adel Bibi","Aidan O'Gara","Robert Kirk","Ben Bucknall","Tim Fist","Luke Ong","Philip Torr","Kwok-Yan Lam","Robert Trager","David Krueger","Sören Mindermann","José Hernandez-Orallo","Mor Geva","Yarin Gal"],"pdf_url":"https://arxiv.org/pdf/2501.04952v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2006.02482v5","updated":"2025-01-09T02:10:04Z","published":"2020-06-03T19:02:34Z","title":"Explaining the Behavior of Black-Box Prediction Algorithms with Causal\n Learning","summary":" Causal approaches to post-hoc explainability for black-box prediction models\n(e.g., deep neural networks trained on image pixel data) have become\nincreasingly popular. However, existing approaches have two important\nshortcomings: (i) the \"explanatory units\" are micro-level inputs into the\nrelevant prediction model, e.g., image pixels, rather than interpretable\nmacro-level features that are more useful for understanding how to possibly\nchange the algorithm's behavior, and (ii) existing approaches assume there\nexists no unmeasured confounding between features and target model predictions,\nwhich fails to hold when the explanatory units are macro-level variables. Our\nfocus is on the important setting where the analyst has no access to the inner\nworkings of the target prediction algorithm, rather only the ability to query\nthe output of the model in response to a particular input. 
To provide causal\nexplanations in such a setting, we propose to learn causal graphical\nrepresentations that allow for arbitrary unmeasured confounding among features.\nWe demonstrate the resulting graph can differentiate between interpretable\nfeatures that causally influence model predictions versus those that are merely\nassociated with model predictions due to confounding. Our approach is motivated\nby a counterfactual theory of causal explanation wherein good explanations\npoint to factors that are \"difference-makers\" in an interventionist sense.\n","authors":["Numair Sani","Daniel Malinsky","Ilya Shpitser"],"pdf_url":"https://arxiv.org/pdf/2006.02482v5.pdf","comment":null}]},"2025-01-10T00:00:00Z":{"Robotics":[{"id":"http://arxiv.org/abs/2501.06132v1","updated":"2025-01-10T17:44:57Z","published":"2025-01-10T17:44:57Z","title":"CoDriveVLM: VLM-Enhanced Urban Cooperative Dispatching and Motion\n Planning for Future Autonomous Mobility on Demand Systems","summary":" The increasing demand for flexible and efficient urban transportation\nsolutions has spotlighted the limitations of traditional Demand Responsive\nTransport (DRT) systems, particularly in accommodating diverse passenger needs\nand dynamic urban environments. Autonomous Mobility-on-Demand (AMoD) systems\nhave emerged as a promising alternative, leveraging connected and autonomous\nvehicles (CAVs) to provide responsive and adaptable services. However, existing\nmethods primarily focus on either vehicle scheduling or path planning, which\noften simplify complex urban layouts and neglect the necessity for simultaneous\ncoordination and mutual avoidance among CAVs. This oversimplification poses\nsignificant challenges to the deployment of AMoD systems in real-world\nscenarios. To address these gaps, we propose CoDriveVLM, a novel framework that\nintegrates high-fidelity simultaneous dispatching and cooperative motion\nplanning for future AMoD systems. 
Our method harnesses Vision-Language Models\n(VLMs) to enhance multi-modality information processing, and this enables\ncomprehensive dispatching and collision risk evaluation. The VLM-enhanced CAV\ndispatching coordinator is introduced to effectively manage complex and\nunforeseen AMoD conditions, thus supporting efficient scheduling\ndecision-making. Furthermore, we propose a scalable decentralized cooperative\nmotion planning method via consensus alternating direction method of\nmultipliers (ADMM) focusing on collision risk evaluation and decentralized\ntrajectory optimization. Simulation results demonstrate the feasibility and\nrobustness of CoDriveVLM in various traffic conditions, showcasing its\npotential to significantly improve the fidelity and effectiveness of AMoD\nsystems in future urban transportation networks. The code is available at\nhttps://github.com/henryhcliu/CoDriveVLM.git.\n","authors":["Haichao Liu","Ruoyu Yao","Wenru Liu","Zhenmin Huang","Shaojie Shen","Jun Ma"],"pdf_url":"https://arxiv.org/pdf/2501.06132v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.02189v2","updated":"2025-01-10T17:43:10Z","published":"2025-01-04T04:59:33Z","title":"Benchmark Evaluations, Applications, and Challenges of Large Vision\n Language Models: A Survey","summary":" Multimodal Vision Language Models (VLMs) have emerged as a transformative\ntechnology at the intersection of computer vision and natural language\nprocessing, enabling machines to perceive and reason about the world through\nboth visual and textual modalities. For example, models such as CLIP, Claude,\nand GPT-4V demonstrate strong reasoning and understanding abilities on visual\nand textual data and beat classical single modality vision models on zero-shot\nclassification. 
Despite their rapid advancements in research and growing\npopularity in applications, a comprehensive survey of existing studies on VLMs\nis notably lacking, particularly for researchers aiming to leverage VLMs in\ntheir specific domains. To this end, we provide a systematic overview of VLMs\nin the following aspects: model information of the major VLMs developed over\nthe past five years (2019-2024); the main architectures and training methods of\nthese VLMs; summary and categorization of the popular benchmarks and evaluation\nmetrics of VLMs; the applications of VLMs including embodied agents, robotics,\nand video generation; the challenges and issues faced by current VLMs such as\nhallucination, fairness, and safety. Detailed collections including papers and\nmodel repository links are listed in\nhttps://github.com/zli12321/Awesome-VLM-Papers-And-Models.git.\n","authors":["Zongxia Li","Xiyang Wu","Hongyang Du","Huy Nghiem","Guangyao Shi"],"pdf_url":"https://arxiv.org/pdf/2501.02189v2.pdf","comment":"35 pages, 3 figures"},{"id":"http://arxiv.org/abs/2501.06130v1","updated":"2025-01-10T17:35:29Z","published":"2025-01-10T17:35:29Z","title":"A Mixed-Integer Conic Program for the Multi-Agent Moving-Target\n Traveling Salesman Problem","summary":" The Moving-Target Traveling Salesman Problem (MT-TSP) aims to find a shortest\npath for an agent that starts at a stationary depot, visits a set of moving\ntargets exactly once, each within one of their respective time windows, and\nthen returns to the depot. In this paper, we introduce a new Mixed-Integer\nConic Program (MICP) formulation that finds the optimum for the Multi-Agent\nMoving-Target Traveling Salesman Problem (MA-MT-TSP), a generalization of the\nMT-TSP involving multiple agents. We obtain our formulation by first restating\nthe current state-of-the-art MICP formulation for MA-MT-TSP as a Mixed-Integer\nNonlinear Nonconvex Program, and then reformulating it as a new MICP. 
We\npresent computational results to demonstrate the performance of our approach.\nThe results show that our formulation significantly outperforms the\nstate-of-the-art, with up to a two-order-of-magnitude reduction in runtime, and\nup to over 90% tighter optimality gap.\n","authors":["Allen George Philip","Zhongqiang Ren","Sivakumar Rathinam","Howie Choset"],"pdf_url":"https://arxiv.org/pdf/2501.06130v1.pdf","comment":"7 pages, 3 figures"},{"id":"http://arxiv.org/abs/2501.06122v1","updated":"2025-01-10T17:21:04Z","published":"2025-01-10T17:21:04Z","title":"NDOB-Based Control of a UAV with Delta-Arm Considering Manipulator\n Dynamics","summary":" Aerial Manipulators (AMs) provide a versatile platform for various\napplications, including 3D printing, architecture, and aerial grasping\nmissions. However, their operational speed is often sacrificed to uphold\nprecision. Existing control strategies for AMs often regard the manipulator as\na disturbance and employ robust control methods to mitigate its influence. This\nresearch focuses on elevating the precision of the end-effector and enhancing\nthe agility of aerial manipulator movements. We present a composite control\nscheme to address these challenges. Initially, a Nonlinear Disturbance Observer\n(NDOB) is utilized to compensate for internal coupling effects and external\ndisturbances. Subsequently, manipulator dynamics are processed through a high\npass filter to facilitate agile movements. By integrating the proposed control\nmethod into a fully autonomous delta-arm-based AM system, we substantiate the\ncontroller's efficacy through extensive real-world experiments. 
The outcomes\nillustrate that the end-effector can achieve accuracy at the millimeter level.\n","authors":["Hongming Chen","Biyu Ye","Xianqi Liang","Weiliang Deng","Ximin Lyu"],"pdf_url":"https://arxiv.org/pdf/2501.06122v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06115v1","updated":"2025-01-10T17:12:30Z","published":"2025-01-10T17:12:30Z","title":"Development of an Advisory System for Parking of a Car and Trailer","summary":" Trailer parking is a challenging task due to the unstable nature of the\nvehicle-trailer system in reverse motion and the unintuitive steering actions\nrequired at the vehicle to accomplish the parking maneuver. This paper presents\na strategy to tackle this kind of maneuver with an advisory graphic aid to help\nthe human driver with the task of manually backing up the vehicle-trailer\nsystem. A kinematic vehicle-trailer model is derived to describe the low-speed\nmotion of the vehicle-trailer system, and its inverse kinematics is established\nby generating an equivalent virtual trailer axle steering command. The advisory\nsystem graphics is generated based on the inverse kinematics and displays the\nexpected trailer orientation given the current vehicle steer angle and\nconfiguration (hitch angle). Simulation study and animation are set up to test\nthe efficacy of the approach, where the user can select both vehicle speed and\nvehicle steering angle freely, which allows the user to stop the\nvehicle-trailer system and experiment with different steering inputs to see\ntheir effect on the predicted trailer motion before proceeding with the best\none according to the advisory graphics, hence creating a series of piecewise\ncontinuous control actions similar to how manual trailer reverse parking is\nusually carried out. 
The advisory graphics proves to provide the driver with an\nintuitive understanding of the trailer motion at any given configuration (hitch\nangle).\n","authors":["Xincheng Cao","Haochong Chen","Bilin Aksun Guvenc","Levent Guvenc","Shihong Fan","John Harber","Brian Link","Peter Richmond","Dokyung Yim"],"pdf_url":"https://arxiv.org/pdf/2501.06115v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06113v1","updated":"2025-01-10T17:05:59Z","published":"2025-01-10T17:05:59Z","title":"Vehicle-in-Virtual-Environment (VVE) Based Autonomous Driving Function\n Development and Evaluation Methodology for Vulnerable Road User Safety","summary":" Traditional methods for developing and evaluating autonomous driving\nfunctions, such as model-in-the-loop (MIL) and hardware-in-the-loop (HIL)\nsimulations, heavily depend on the accuracy of simulated vehicle models and\nhuman factors, especially for vulnerable road user safety systems. Continuation\nof development during public road deployment forces other road users including\nvulnerable ones to involuntarily participate in the development process,\nleading to safety risks, inefficiencies, and a decline in public trust. To\naddress these deficiencies, the Vehicle-in-Virtual-Environment (VVE) method was\nproposed as a safer, more efficient, and cost-effective solution for developing\nand testing connected and autonomous driving technologies by operating the real\nvehicle and multiple other actors like vulnerable road users in different test\nareas while being immersed within the same highly realistic virtual\nenvironment. This VVE approach synchronizes real-world vehicle and vulnerable\nroad user motion within the same virtual scenario, enabling the safe and\nrealistic testing of various traffic situations in a safe and repeatable\nmanner. In this paper, we propose a new testing pipeline that sequentially\nintegrates MIL, HIL, and VVE methods to comprehensively develop and evaluate\nautonomous driving functions. 
The effectiveness of this testing pipeline will\nbe demonstrated using an autonomous driving path-tracking algorithm with local\ndeep reinforcement learning modification for vulnerable road user collision\navoidance.\n","authors":["Haochong Chen","Xincheng Cao","Levent Guvenc","Bilin Aksun Guvenc"],"pdf_url":"https://arxiv.org/pdf/2501.06113v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06089v1","updated":"2025-01-10T16:39:01Z","published":"2025-01-10T16:39:01Z","title":"Towards Developing Socially Compliant Automated Vehicles: State of the\n Art, Experts Expectations, and A Conceptual Framework","summary":" Automated Vehicles (AVs) hold promise for revolutionizing transportation by\nimproving road safety, traffic efficiency, and overall mobility. Despite the\nsteady advancement in high-level AVs in recent years, the transition to full\nautomation entails a period of mixed traffic, where AVs of varying automation\nlevels coexist with human-driven vehicles (HDVs). Making AVs socially compliant\nand understood by human drivers is expected to improve the safety and\nefficiency of mixed traffic. Thus, ensuring AVs compatibility with HDVs and\nsocial acceptance is crucial for their successful and seamless integration into\nmixed traffic. However, research in this critical area of developing Socially\nCompliant AVs (SCAVs) remains sparse. This study carries out the first\ncomprehensive scoping review to assess the current state of the art in\ndeveloping SCAVs, identifying key concepts, methodological approaches, and\nresearch gaps. An expert interview was also conducted to identify critical\nresearch gaps and expectations towards SCAVs. Based on the scoping review and\nexpert interview input, a conceptual framework is proposed for the development\nof SCAVs. The conceptual framework is evaluated using an online survey\ntargeting researchers, technicians, policymakers, and other relevant\nprofessionals worldwide. 
The survey results provide valuable validation and\ninsights, affirming the significance of the proposed conceptual framework in\ntackling the challenges of integrating AVs into mixed-traffic environments.\nAdditionally, future research perspectives and suggestions are discussed,\ncontributing to the research and development agenda of SCAVs.\n","authors":["Yongqi Dong","Bart van Arem","Haneen Farah"],"pdf_url":"https://arxiv.org/pdf/2501.06089v1.pdf","comment":"39 pages, 13 figures, under review by the journal of Transportation\n Research Part E: Logistics and Transportation Review"},{"id":"http://arxiv.org/abs/2501.06088v1","updated":"2025-01-10T16:36:15Z","published":"2025-01-10T16:36:15Z","title":"Non-planar 3D Printing of Double Shells","summary":" We present a method to fabricate double shell structures printed in\ntransversal directions using multi-axis fused-deposition-modeling (FDM)\nrobotic 3D printing. Shell structures, characterized by lightweight, thin\nwalls, fast buildup, and minimal material usage, find diverse applications in\nprototyping and architecture for uses such as fa\c{c}ade panels, molds for\nconcrete casting, or full-scale pavilions. We leverage an underlying\nrepresentation of transversal strip networks generated using existing methods\nand propose a methodology for converting them into printable partitions. Each\npartition is printed separately and assembled into a double-shell structure. We\noutline the specifications and workflow that make the printing of each piece\nand the subsequent assembly process feasible. 
The versatility and robustness\nof our method are demonstrated with both digital and fabricated results on\nsurfaces of different scales and geometric complexity.\n","authors":["Ioanna Mitropoulou","Amir Vaxman","Olga Diamanti","Benjamin Dillenburger"],"pdf_url":"https://arxiv.org/pdf/2501.06088v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.14762v2","updated":"2025-01-10T15:54:54Z","published":"2024-12-19T11:43:13Z","title":"A General Control Method for Human-Robot Integration","summary":" This paper introduces a new generalized control method designed for\nmulti-degrees-of-freedom devices to help people with limited motion\ncapabilities in their daily activities. The challenge lies in finding the most\nadapted strategy for the control interface to effectively map user's motions in\na low-dimensional space to complex robotic assistive devices, such as\nprostheses, supernumerary limbs, up to remote robotic avatars. The goal is a\nsystem which integrates the human and the robotic parts into a unique system,\nmoving so as to reach the targets decided by the human while autonomously\nreducing the user's effort and discomfort. We present a framework to control\ngeneral multi DoFs assistive systems, which translates user-performed\ncompensatory motions into the necessary robot commands for reaching targets\nwhile canceling or reducing compensation. The framework extends to prostheses\nof any number of DoF up to full robotic avatars, regarded here as a sort of\nwhole-body prosthesis of the person who sees the robot as an artificial\nextension of their own body without a physical link but with a sensory-motor\nintegration. We have validated and applied this control strategy through tests\nencompassing simulated scenarios and real-world trials involving a virtual twin\nof the robotic parts (prosthesis and robot) and a physical humanoid avatar.\n","authors":["Maddalena Feder","Giorgio Grioli","Manuel G. 
Catalano","Antonio Bicchi"],"pdf_url":"https://arxiv.org/pdf/2412.14762v2.pdf","comment":"Submitted to the International Journal of Robotics Research (IJRR),\n under review since October 2024, 16 pages, 30 figures"},{"id":"http://arxiv.org/abs/2406.11136v2","updated":"2025-01-10T15:53:00Z","published":"2024-06-17T01:47:11Z","title":"Robots in Family Routines: Development of and Initial Insights from the\n Family-Robot Routines Inventory","summary":" Despite advances in areas such as the personalization of robots, sustaining\nadoption of robots for long-term use in families remains a challenge. Recent\nstudies have identified integrating robots into families' routines and rituals\nas a promising approach to support long-term adoption. However, few studies\nexplored the integration of robots into family routines and there is a gap in\nsystematic measures to capture family preferences for robot integration.\nBuilding upon existing routine inventories, we developed Family-Robot Routines\nInventory (FRRI), with 24 family routines and 24 child routine items, to\ncapture parents' attitudes toward and expectations from the integration of\nrobotic technology into their family routines. Using this inventory, we\ncollected data from 150 parents through an online survey. Our analysis\nindicates that parents had varying perceptions for the utility of integrating\nrobots into their routines. For example, parents found robot integration to be\nmore helpful in children's individual routines, than to the collective routines\nof their families. We discuss the design implications of these preliminary\nfindings, and how they may serve as a first step toward understanding the\ndiverse challenges and demands of designing and integrating household robots\nfor families.\n","authors":["Michael F. 
Xu","Bengisu Cagiltay","Joseph Michaelis","Sarah Sebo","Bilge Mutlu"],"pdf_url":"https://arxiv.org/pdf/2406.11136v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04860v2","updated":"2025-01-10T15:47:19Z","published":"2025-01-08T22:22:15Z","title":"Exploring the Use of Robots for Diary Studies","summary":" As interest in studying in-the-wild human-robot interaction grows, there is a\nneed for methods to collect data over time and in naturalistic or potentially\nprivate environments. HRI researchers have increasingly used the diary method\nfor these studies, asking study participants to self-administer a structured\ndata collection instrument, i.e., a diary, over a period of time. Although the\ndiary method offers a unique window into settings that researchers may not have\naccess to, they also lack the interactivity and probing that interview-based\nmethods offer. In this paper, we explore a novel data collection method in\nwhich a robot plays the role of an interactive diary. We developed the Diary\nRobot system and performed in-home deployments for a week to evaluate the\nfeasibility and effectiveness of this approach. Using traditional text-based\nand audio-based diaries as benchmarks, we found that robots are able to\neffectively elicit the intended information. We reflect on our findings, and\ndescribe scenarios where the utilization of robots in diary studies as a data\ncollection instrument may be especially applicable.\n","authors":["Michael F. 
Xu","Bilge Mutlu"],"pdf_url":"https://arxiv.org/pdf/2501.04860v2.pdf","comment":"Proceedings of the 20th ACM/IEEE International Conference on Human\n Robot Interaction (HRI 2025)"},{"id":"http://arxiv.org/abs/2408.00907v3","updated":"2025-01-10T15:35:20Z","published":"2024-08-01T20:56:28Z","title":"The Harmonic Exponential Filter for Nonparametric Estimation on Motion\n Groups","summary":" Bayesian estimation is a vital tool in robotics as it allows systems to\nupdate the robot state belief using incomplete information from noisy sensors.\nTo render the state estimation problem tractable, many systems assume that the\nmotion and measurement noise, as well as the state distribution, are unimodal\nand Gaussian. However, there are numerous scenarios and systems that do not\ncomply with these assumptions. Existing nonparametric filters that are used to\nmodel multimodal distributions have drawbacks that limit their ability to\nrepresent a diverse set of distributions. This paper introduces a novel\napproach to nonparametric Bayesian filtering on motion groups, designed to\nhandle multimodal distributions using harmonic exponential distributions. This\napproach leverages two key insights of harmonic exponential distributions: a)\nthe product of two distributions can be expressed as the element-wise addition\nof their log-likelihood Fourier coefficients, and b) the convolution of two\ndistributions can be efficiently computed as the tensor product of their\nFourier coefficients. These observations enable the development of an efficient\nand asymptotically exact solution to the Bayes filter up to the band limit of a\nFourier transform. We demonstrate our filter's performance compared with\nestablished nonparametric filtering methods across simulated and real-world\nlocalization tasks.\n","authors":["Miguel Saavedra-Ruiz","Steven A. 
Parkison","Ria Arora","James Richard Forbes","Liam Paull"],"pdf_url":"https://arxiv.org/pdf/2408.00907v3.pdf","comment":"Accepted to the IEEE Robotics and Automation Letters (RA-L 2025) Code\n available at https://github.com/montrealrobotics/harmonic-filter. Webpage and\n additional videos at https://montrealrobotics.ca/hef/"},{"id":"http://arxiv.org/abs/2501.06047v1","updated":"2025-01-10T15:28:24Z","published":"2025-01-10T15:28:24Z","title":"Learning Affordances from Interactive Exploration using an Object-level\n Map","summary":" Many robotic tasks in real-world environments require physical interactions\nwith an object such as pick up or push. For successful interactions, the robot\nneeds to know the object's affordances, which are defined as the potential\nactions the robot can perform with the object. In order to learn a\nrobot-specific affordance predictor, we propose an interactive exploration\npipeline which allows the robot to collect interaction experiences while\nexploring an unknown environment. We integrate an object-level map in the\nexploration pipeline such that the robot can identify different object\ninstances and track objects across diverse viewpoints. This results in denser\nand more accurate affordance annotations compared to state-of-the-art methods,\nwhich do not incorporate a map. 
We show that our affordance exploration\napproach makes exploration more efficient and results in more accurate\naffordance prediction models compared to baseline methods.\n","authors":["Paula Wulkop","Halil Umut Özdemir","Antonia Hüfner","Jen Jen Chung","Roland Siegwart","Lionel Ott"],"pdf_url":"https://arxiv.org/pdf/2501.06047v1.pdf","comment":"International Symposium of Robotics Research (ISRR) 2024"},{"id":"http://arxiv.org/abs/1907.03817v2","updated":"2025-01-10T13:14:26Z","published":"2019-07-08T19:10:56Z","title":"Towards the Internet of Robotic Things: Analysis, Architecture,\n Components and Challenges","summary":" Internet of Things (IoT) and robotics cannot be considered two separate\ndomains these days. Internet of Robotics Things (IoRT) is a concept that has\nbeen recently introduced to describe the integration of robotics technologies\nin IoT scenarios. As a consequence, these two research fields have started\ninteracting, and thus linking research communities. In this paper we intend to\nmake further steps in joining the two communities and broaden the discussion on\nthe development of this interdisciplinary field. The paper provides an\noverview, analysis and challenges of possible solutions for the Internet of\nRobotic Things, discussing the issues of the IoRT architecture, the integration\nof smart spaces and robotic applications.\n","authors":["Ilya Afanasyev","Manuel Mazzara","Subham Chakraborty","Nikita Zhuchkov","Aizhan Maksatbek","Mohamad Kassab","Salvatore Distefano"],"pdf_url":"https://arxiv.org/pdf/1907.03817v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.16111v2","updated":"2025-01-10T12:56:47Z","published":"2024-09-24T14:19:47Z","title":"CloudTrack: Scalable UAV Tracking with Cloud Semantics","summary":" Nowadays, unmanned aerial vehicles (UAVs) are commonly used in search and\nrescue scenarios to gather information in the search area. 
The automatic\nidentification of the person searched for in aerial footage could increase the\nautonomy of such systems, reduce the search time, and thus increase the missed\nperson's chances of survival. In this paper, we present a novel approach to\nperform semantically conditioned open vocabulary object tracking that is\nspecifically designed to cope with the limitations of UAV hardware. Our\napproach has several advantages. It can run with verbal descriptions of the\nmissing person, e.g., the color of the shirt, it does not require dedicated\ntraining to execute the mission and can efficiently track a potentially moving\nperson. Our experimental results demonstrate the versatility and efficacy of\nour approach.\n","authors":["Yannik Blei","Michael Krawez","Nisarga Nilavadi","Tanja Katharina Kaiser","Wolfram Burgard"],"pdf_url":"https://arxiv.org/pdf/2409.16111v2.pdf","comment":"7 pages, 3 figures"},{"id":"http://arxiv.org/abs/2501.05931v1","updated":"2025-01-10T12:54:33Z","published":"2025-01-10T12:54:33Z","title":"Environment Modeling for Service Robots From a Task Execution\n Perspective","summary":" Service robots are increasingly entering the home to provide domestic tasks\nfor residents. However, when working in an open, dynamic, and unstructured home\nenvironment, service robots still face challenges such as low intelligence for\ntask execution and poor long-term autonomy (LTA), which has limited their\ndeployment. As the basis of robotic task execution, environment modeling has\nattracted significant attention. This integrates core technologies such as\nenvironment perception, understanding, and representation to accurately\nrecognize environmental information. This paper presents a comprehensive survey\nof environmental modeling from a new task-executionoriented perspective. 
In\nparticular, guided by the requirements of robots in performing domestic service\ntasks in the home environment, we systematically review the progress that has\nbeen made in task-execution-oriented environmental modeling in four respects:\n1) localization, 2) navigation, 3) manipulation, and 4) LTA. Current challenges\nare discussed, and potential research opportunities are also highlighted.\n","authors":["Ying Zhang","Guohui Tian","Cui-Hua Zhang","Changchun Hua","Weili Ding","Choon Ki Ahn"],"pdf_url":"https://arxiv.org/pdf/2501.05931v1.pdf","comment":"16 pages, 9 figures; This article has been accepted for publication\n in a future issue of IEEE/CAA Journal of Automatica Sinica, but has not been\n fully edited. Content may change prior to final publication"},{"id":"http://arxiv.org/abs/2501.03968v2","updated":"2025-01-10T10:38:49Z","published":"2025-01-07T18:06:27Z","title":"VLM-driven Behavior Tree for Context-aware Task Planning","summary":" The use of Large Language Models (LLMs) for generating Behavior Trees (BTs)\nhas recently gained attention in the robotics community, yet remains in its\nearly stages of development. In this paper, we propose a novel framework that\nleverages Vision-Language Models (VLMs) to interactively generate and edit BTs\nthat address visual conditions, enabling context-aware robot operations in\nvisually complex environments. A key feature of our approach lies in the\nconditional control through self-prompted visual conditions. Specifically, the\nVLM generates BTs with visual condition nodes, where conditions are expressed\nas free-form text. Another VLM process integrates the text into its prompt and\nevaluates the conditions against real-world images during robot execution. 
We\nvalidated our framework in a real-world cafe scenario, demonstrating both its\nfeasibility and limitations.\n","authors":["Naoki Wake","Atsushi Kanehira","Jun Takamatsu","Kazuhiro Sasabuchi","Katsushi Ikeuchi"],"pdf_url":"https://arxiv.org/pdf/2501.03968v2.pdf","comment":"10 pages, 11 figures, 5 tables. Last updated on January 9th, 2024"},{"id":"http://arxiv.org/abs/2501.05770v1","updated":"2025-01-10T07:58:52Z","published":"2025-01-10T07:58:52Z","title":"Path Planning for Multi-Copter UAV Formation Employing a Generalized\n Particle Swarm Optimization","summary":" The paper investigates the problem of path planning techniques for\nmulti-copter uncrewed aerial vehicles (UAV) cooperation in a formation shape to\nexamine surrounding surfaces. We first describe the problem as a joint\nobjective cost for planning a path of the formation centroid working in a\ncomplicated space. The path planning algorithm, named the generalized particle\nswarm optimization algorithm, is then presented to construct an optimal,\nflyable path while avoiding obstacles and ensuring the flying mission\nrequirements. A path-development scheme is then incorporated to generate a\nrelevant path for each drone to maintain its position in the formation\nconfiguration. Simulation, comparison, and experiments have been conducted to\nverify the proposed approach. Results show the feasibility of the proposed\npath-planning algorithm with GEPSO.\n","authors":["Van Truong Hoang"],"pdf_url":"https://arxiv.org/pdf/2501.05770v1.pdf","comment":"6 pages, 8 figures, conference"},{"id":"http://arxiv.org/abs/2501.05750v1","updated":"2025-01-10T06:58:14Z","published":"2025-01-10T06:58:14Z","title":"Semantic Mapping in Indoor Embodied AI -- A Comprehensive Survey and\n Future Directions","summary":" Intelligent embodied agents (e.g. robots) need to perform complex semantic\ntasks in unfamiliar environments. 
Among many skills that the agents need to\npossess, building and maintaining a semantic map of the environment is most\ncrucial in long-horizon tasks. A semantic map captures information about the\nenvironment in a structured way, allowing the agent to reference it for\nadvanced reasoning throughout the task. While existing surveys in embodied AI\nfocus on general advancements or specific tasks like navigation and\nmanipulation, this paper provides a comprehensive review of semantic\nmap-building approaches in embodied AI, specifically for indoor navigation. We\ncategorize these approaches based on their structural representation (spatial\ngrids, topological graphs, dense point-clouds or hybrid maps) and the type of\ninformation they encode (implicit features or explicit environmental data). We\nalso explore the strengths and limitations of the map building techniques,\nhighlight current challenges, and propose future research directions. We\nidentify that the field is moving towards developing open-vocabulary,\nqueryable, task-agnostic map representations, while high memory demands and\ncomputational inefficiency still remaining to be open challenges. This survey\naims to guide current and future researchers in advancing semantic mapping\ntechniques for embodied AI systems.\n","authors":["Sonia Raychaudhuri","Angel X. Chang"],"pdf_url":"https://arxiv.org/pdf/2501.05750v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05723v1","updated":"2025-01-10T05:43:34Z","published":"2025-01-10T05:43:34Z","title":"Robot Error Awareness Through Human Reactions: Implementation,\n Evaluation, and Recommendations","summary":" Effective error detection is crucial to prevent task disruption and maintain\nuser trust. Traditional methods often rely on task-specific models or user\nreporting, which can be inflexible or slow. Recent research suggests social\nsignals, naturally exhibited by users in response to robot errors, can enable\nmore flexible, timely error detection. 
However, most studies rely on post hoc\nanalysis, leaving their real-time effectiveness uncertain and lacking\nuser-centric evaluation. In this work, we developed a proactive error detection\nsystem that combines user behavioral signals (facial action units and speech),\nuser feedback, and error context for automatic error detection. In a study (N =\n28), we compared our proactive system to a status quo reactive approach.\nResults show our system 1) reliably and flexibly detects error, 2) detects\nerrors faster than the reactive approach, and 3) is perceived more favorably by\nusers than the reactive one. We discuss recommendations for enabling robot\nerror awareness in future HRI systems.\n","authors":["Maia Stiber","Russell Taylor","Chien-Ming Huang"],"pdf_url":"https://arxiv.org/pdf/2501.05723v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05688v1","updated":"2025-01-10T03:41:03Z","published":"2025-01-10T03:41:03Z","title":"eKalibr: Dynamic Intrinsic Calibration for Event Cameras From First\n Principles of Events","summary":" The bio-inspired event camera has garnered extensive research attention in\nrecent years, owing to its significant potential derived from its high dynamic\nrange and low latency characteristics. Similar to the standard camera, the\nevent camera requires precise intrinsic calibration to facilitate further\nhigh-level visual applications, such as pose estimation and mapping. While\nseveral calibration methods for event cameras have been proposed, most of them\nare either (i) engineering-driven, heavily relying on conventional image-based\ncalibration pipelines, or (ii) inconvenient, requiring complex instrumentation.\nTo this end, we propose an accurate and convenient intrinsic calibration method\nfor event cameras, named eKalibr, which builds upon a carefully designed\nevent-based circle grid pattern recognition algorithm. 
To extract target\npatterns from events, we perform event-based normal flow estimation to identify\npotential events generated by circle edges, and cluster them spatially.\nSubsequently, event clusters associated with the same grid circles are matched\nand grouped using normal flows, for subsequent time-varying ellipse estimation.\nFitted ellipse centers are time-synchronized, for final grid pattern\nrecognition. We conducted extensive experiments to evaluate the performance of\neKalibr in terms of pattern extraction and intrinsic calibration. The\nimplementation of eKalibr is open-sourced at\n(https://github.com/Unsigned-Long/eKalibr) to benefit the research community.\n","authors":["Shuolong Chen","Xingxing Li","Liu Yuan","Ziao Liu"],"pdf_url":"https://arxiv.org/pdf/2501.05688v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05639v1","updated":"2025-01-10T00:56:39Z","published":"2025-01-10T00:56:39Z","title":"Scaling Safe Multi-Agent Control for Signal Temporal Logic\n Specifications","summary":" Existing methods for safe multi-agent control using logic specifications like\nSignal Temporal Logic (STL) often face scalability issues. This is because they\nrely either on single-agent perspectives or on Mixed Integer Linear Programming\n(MILP)-based planners, which are complex to optimize. These methods have proven\nto be computationally expensive and inefficient when dealing with a large\nnumber of agents. To address these limitations, we present a new scalable\napproach to multi-agent control in this setting. Our method treats the\nrelationships between agents using a graph structure rather than in terms of a\nsingle-agent perspective. 
Moreover, it combines a multi-agent collision\navoidance controller with a Graph Neural Network (GNN) based planner, models\nthe system in a decentralized fashion, and trains on STL-based objectives to\ngenerate safe and efficient plans for multiple agents, thereby optimizing the\nsatisfaction of complex temporal specifications while also facilitating\nmulti-agent collision avoidance. Our experiments show that our approach\nsignificantly outperforms existing methods that use a state-of-the-art\nMILP-based planner in terms of scalability and performance. The project website\nis https://jeappen.com/mastl-gcbf-website/ and the code is at\nhttps://github.com/jeappen/mastl-gcbf .\n","authors":["Joe Eappen","Zikang Xiong","Dipam Patel","Aniket Bera","Suresh Jagannathan"],"pdf_url":"https://arxiv.org/pdf/2501.05639v1.pdf","comment":"Accepted to CoRL 2024. arXiv admin note: text overlap with\n arXiv:2401.14554 by other authors"},{"id":"http://arxiv.org/abs/2501.05628v1","updated":"2025-01-10T00:08:37Z","published":"2025-01-10T00:08:37Z","title":"Concerns and Values in Human-Robot Interactions: A Focus on Social\n Robotics","summary":" Robots, as AI with physical instantiation, inhabit our social and physical\nworld, where their actions have both social and physical consequences, posing\nchallenges for researchers when designing social robots. This study starts with\na scoping review to identify discussions and potential concerns arising from\ninteractions with robotic systems. Two focus groups of technology ethics\nexperts then validated a comprehensive list of key topics and values in\nhuman-robot interaction (HRI) literature. These insights were integrated into\nthe HRI Value Compass web tool, to help HRI researchers identify ethical values\nin robot design. The tool was evaluated in a pilot study. 
This work benefits\nthe HRI community by highlighting key concerns in human-robot interactions and\nproviding an instrument to help researchers design robots that align with human\nvalues, ensuring future robotic systems adhere to these values in social\napplications.\n","authors":["Giulio Antonio Abbo","Tony Belpaeme","Micol Spitale"],"pdf_url":"https://arxiv.org/pdf/2501.05628v1.pdf","comment":"52 pages, 10 figures, 5 appendices"},{"id":"http://arxiv.org/abs/2501.06348v1","updated":"2025-01-10T21:20:11Z","published":"2025-01-10T21:20:11Z","title":"Why Automate This? Exploring the Connection between Time Use, Well-being\n and Robot Automation Across Social Groups","summary":" Understanding the motivations underlying the human inclination to automate\ntasks is vital to developing truly helpful robots integrated into daily life.\nAccordingly, we ask: are individuals more inclined to automate chores based on\nthe time they consume or the feelings experienced while performing them? This\nstudy explores these preferences and whether they vary across different social\ngroups (i.e., gender category and income level). Leveraging data from the\nBEHAVIOR-1K dataset, the American Time-Use Survey, and the American Time-Use\nSurvey Well-Being Module, we investigate the relationship between the desire\nfor automation, time spent on daily activities, and their associated feelings -\nHappiness, Meaningfulness, Sadness, Painfulness, Stressfulness, or Tiredness.\nOur key findings show that, despite common assumptions, time spent does not\nstrongly relate to the desire for automation for the general population. For\nthe feelings analyzed, only happiness and pain are key indicators. 
Significant\ndifferences by gender and economic level also emerged: Women prefer to automate\nstressful activities, whereas men prefer to automate those that make them\nunhappy; mid-income individuals prioritize automating less enjoyable and\nmeaningful activities, while low and high-income show no significant\ncorrelations. We hope our research helps motivate technologies to develop\nrobots that match the priorities of potential users, moving domestic robotics\ntoward more socially relevant solutions. We open-source all the data, including\nan online tool that enables the community to replicate our analysis and explore\nadditional trends at https://hri1260.github.io/why-automate-this.\n","authors":["Ruchira Ray","Leona Pang","Sanjana Srivastava","Li Fei-Fei","Samantha Shorey","Roberto Martín-Martín"],"pdf_url":"https://arxiv.org/pdf/2501.06348v1.pdf","comment":"20 pages, 14 figures"},{"id":"http://arxiv.org/abs/2501.07597v1","updated":"2025-01-10T02:20:59Z","published":"2025-01-10T02:20:59Z","title":"Learning-based Detection of GPS Spoofing Attack for Quadrotors","summary":" Safety-critical cyber-physical systems (CPS), such as quadrotor UAVs, are\nparticularly prone to cyber attacks, which can result in significant\nconsequences if not detected promptly and accurately. During outdoor\noperations, the nonlinear dynamics of UAV systems, combined with non-Gaussian\nnoise, pose challenges to the effectiveness of conventional statistical and\nmachine learning methods. To overcome these limitations, we present QUADFormer,\nan advanced attack detection framework for quadrotor UAVs leveraging a\ntransformer-based architecture. This framework features a residue generator\nthat produces sequences sensitive to anomalies, which are then analyzed by the\ntransformer to capture statistical patterns for detection and classification.\nFurthermore, an alert mechanism ensures UAVs can operate safely even when under\nattack. 
Extensive simulations and experimental evaluations highlight that\nQUADFormer outperforms existing state-of-the-art techniques in detection\naccuracy.\n","authors":["Pengyu Wang","Zhaohua Yang","Jialu Li","Ling Shi"],"pdf_url":"https://arxiv.org/pdf/2501.07597v1.pdf","comment":"Accepted in IEEE Industrial Electronics Society Annual Online\n Conference"},{"id":"http://arxiv.org/abs/2411.07261v2","updated":"2025-01-10T15:21:58Z","published":"2024-11-08T14:34:09Z","title":"Sinkage Study in Granular Material for Space Exploration Legged Robot\n Gripper","summary":" Wheeled rovers have been the primary choice for lunar exploration due to\ntheir speed and efficiency. However, deeper areas, such as lunar caves and\ncraters, require the mobility of legged robots. To do so, appropriate end\neffectors must be designed to enable climbing and walking on the granular\nsurface of the Moon. This paper investigates the behavior of an underactuated\nsoft gripper on deformable granular material when a legged robot is walking in\nsoft soil. A modular test bench and a simulation model were developed to\nobserve the gripper sinkage behavior under load. The gripper uses tendon-driven\nfingers to match its target shape and grasp on the target surface using\nmultiple micro-spines. The sinkage of the gripper in silica sand was measured\nby comparing the axial displacement of the gripper with the nominal load of the\nrobot mass. Multiple experiments were performed to observe the sinkage of the\ngripper over a range of slope angles. A simulation model accounting for the\ndegrees of compliance of the gripper fingers was created using Altair\nMotionSolve software and coupled to Altair EDEM to compute the gripper\ninteraction with particles utilizing the discrete element method. After\nvalidation of the model, complementary simulations using Lunar gravity and a\nregolith particle model were performed. 
The results show that a satisfactory\ngripper model with accurate freedom of motion can be created in simulation\nusing the Altair simulation packages and expected sinkage under load in a\nparticle-filled environment can be estimated using this model. By computing the\nsinkage of the end effector of legged robots, the results can be directly\nintegrated into the motion control algorithm and improve the accuracy of\nmobility in a granular material environment.\n","authors":["Arthur Candalot","James Hurrell","Malik Manel Hashim","Brigid Hickey","Mickael Laine","Kazuya Yoshida"],"pdf_url":"https://arxiv.org/pdf/2411.07261v2.pdf","comment":"Proceedings of the 21st International and 12th Asia-Pacific Regional\n Conference of the ISTVS"}],"Systems and Control":[{"id":"http://arxiv.org/abs/2501.06181v1","updated":"2025-01-10T18:58:44Z","published":"2025-01-10T18:58:44Z","title":"Best Response Convergence for Zero-sum Stochastic Dynamic Games with\n Partial and Asymmetric Information","summary":" We analyze best response dynamics for finding a Nash equilibrium of an\ninfinite horizon zero-sum stochastic linear quadratic dynamic game (LQDG) with\npartial and asymmetric information. We derive explicit expressions for each\nplayer's best response within the class of pure linear dynamic output feedback\ncontrol strategies where the internal state dimension of each control strategy\nis an integer multiple of the system state dimension. With each best response,\nthe players form increasingly higher-order belief states, leading to\ninfinite-dimensional internal states. However, we observe in extensive\nnumerical experiments that the game's value converges after just a few\niterations, suggesting that strategies associated with increasingly\nhigher-order belief states eventually provide no benefit. 
To help explain this\nconvergence, our numerical analysis reveals rapid decay of the controllability\nand observability Gramian eigenvalues and Hankel singular values in\nhigher-order belief dynamics, indicating that the higher-order belief dynamics\nbecome increasingly difficult for both players to control and observe.\nConsequently, the higher-order belief dynamics can be closely approximated by\nlow-order belief dynamics with bounded error, and thus feedback strategies with\nlimited internal state dimension can closely approximate a Nash equilibrium.\n","authors":["Yuxiang Guan","Iman Shames","Tyler H. Summers"],"pdf_url":"https://arxiv.org/pdf/2501.06181v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06167v1","updated":"2025-01-10T18:46:28Z","published":"2025-01-10T18:46:28Z","title":"Meta-Learning for Physically-Constrained Neural System Identification","summary":" We present a gradient-based meta-learning framework for rapid adaptation of\nneural state-space models (NSSMs) for black-box system identification. When\napplicable, we also incorporate domain-specific physical constraints to improve\nthe accuracy of the NSSM. The major benefit of our approach is that instead of\nrelying solely on data from a single target system, our framework utilizes data\nfrom a diverse set of source systems, enabling learning from limited target\ndata, as well as with few online training iterations. Through benchmark\nexamples, we demonstrate the potential of our approach, study the effect of\nfine-tuning subnetworks rather than full fine-tuning, and report real-world\ncase studies to illustrate the practical application and generalizability of\nthe approach to practical problems with physical-constraints. Specifically, we\nshow that the meta-learned models result in improved downstream performance in\nmodel-based state estimation in indoor localization and energy systems.\n","authors":["Ankush Chakrabarty","Gordon Wichern","Vedang M. Deshpande","Abraham P. 
Vinod","Karl Berntorp","Christopher R. Laughman"],"pdf_url":"https://arxiv.org/pdf/2501.06167v1.pdf","comment":"30 pages"},{"id":"http://arxiv.org/abs/2501.00588v2","updated":"2025-01-10T18:39:11Z","published":"2024-12-31T18:25:05Z","title":"Privacy-Preserving Distributed Defense Framework for DC Microgrids\n Against Exponentially Unbounded False Data Injection Attacks","summary":" This paper introduces a novel, fully distributed control framework for DC\nmicrogrids, enhancing resilience against exponentially unbounded false data\ninjection (EU-FDI) attacks. Our framework features a consensus-based secondary\ncontrol for each converter, effectively addressing these advanced threats. To\nfurther safeguard sensitive operational data, a privacy-preserving mechanism is\nincorporated into the control design, ensuring that critical information\nremains secure even under adversarial conditions. Rigorous Lyapunov stability\nanalysis confirms the framework's ability to maintain critical DC microgrid\noperations like voltage regulation and load sharing under EU-FDI threats. The\nframework's practicality is validated through hardware-in-the-loop experiments,\ndemonstrating its enhanced resilience and robust privacy protection against the\ncomplex challenges posed by quick variant FDI attacks.\n","authors":["Yi Zhang","Mohamadamin Rajabinezhad","Yichao Wang","Junbo Zhao","Shan Zuo"],"pdf_url":"https://arxiv.org/pdf/2501.00588v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06118v1","updated":"2025-01-10T17:15:59Z","published":"2025-01-10T17:15:59Z","title":"Nonlinear port-Hamiltonian system identification from input-state-output\n data","summary":" A framework for identifying nonlinear port-Hamiltonian systems using\ninput-state-output data is introduced. The framework utilizes neural networks'\nuniversal approximation capacity to effectively represent complex dynamics in a\nstructured way. 
We show that using the structure helps to make long-term\npredictions compared to baselines that do not incorporate physics. We also\nexplore different architectures based on MLPs, KANs, and using prior\ninformation. The technique is validated through examples featuring\nnonlinearities in either the skew-symmetric terms, the dissipative terms, or\nthe Hamiltonian.\n","authors":["Karim Cherifi","Achraf El Messaoudi","Hannes Gernandt","Marco Roschkowski"],"pdf_url":"https://arxiv.org/pdf/2501.06118v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06115v1","updated":"2025-01-10T17:12:30Z","published":"2025-01-10T17:12:30Z","title":"Development of an Advisory System for Parking of a Car and Trailer","summary":" Trailer parking is a challenging task due to the unstable nature of the\nvehicle-trailer system in reverse motion and the unintuitive steering actions\nrequired at the vehicle to accomplish the parking maneuver. This paper presents\na strategy to tackle this kind of maneuver with an advisory graphic aid to help\nthe human driver with the task of manually backing up the vehicle-trailer\nsystem. A kinematic vehicle-trailer model is derived to describe the low-speed\nmotion of the vehicle-trailer system, and its inverse kinematics is established\nby generating an equivalent virtual trailer axle steering command. The advisory\nsystem graphics is generated based on the inverse kinematics and displays the\nexpected trailer orientation given the current vehicle steer angle and\nconfiguration (hitch angle). 
Simulation study and animation are set up to test\nthe efficacy of the approach, where the user can select both vehicle speed and\nvehicle steering angle freely, which allows the user to stop the\nvehicle-trailer system and experiment with different steering inputs to see\ntheir effect on the predicted trailer motion before proceeding with the best\none according to the advisory graphics, hence creating a series of piecewise\ncontinuous control actions similar to how manual trailer reverse parking is\nusually carried out. The advisory graphics proves to provide the driver with an\nintuitive understanding of the trailer motion at any given configuration (hitch\nangle).\n","authors":["Xincheng Cao","Haochong Chen","Bilin Aksun Guvenc","Levent Guvenc","Shihong Fan","John Harber","Brian Link","Peter Richmond","Dokyung Yim"],"pdf_url":"https://arxiv.org/pdf/2501.06115v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06113v1","updated":"2025-01-10T17:05:59Z","published":"2025-01-10T17:05:59Z","title":"Vehicle-in-Virtual-Environment (VVE) Based Autonomous Driving Function\n Development and Evaluation Methodology for Vulnerable Road User Safety","summary":" Traditional methods for developing and evaluating autonomous driving\nfunctions, such as model-in-the-loop (MIL) and hardware-in-the-loop (HIL)\nsimulations, heavily depend on the accuracy of simulated vehicle models and\nhuman factors, especially for vulnerable road user safety systems. Continuation\nof development during public road deployment forces other road users including\nvulnerable ones to involuntarily participate in the development process,\nleading to safety risks, inefficiencies, and a decline in public trust. 
To\naddress these deficiencies, the Vehicle-in-Virtual-Environment (VVE) method was\nproposed as a safer, more efficient, and cost-effective solution for developing\nand testing connected and autonomous driving technologies by operating the real\nvehicle and multiple other actors like vulnerable road users in different test\nareas while being immersed within the same highly realistic virtual\nenvironment. This VVE approach synchronizes real-world vehicle and vulnerable\nroad user motion within the same virtual scenario, enabling the safe and\nrealistic testing of various traffic situations in a safe and repeatable\nmanner. In this paper, we propose a new testing pipeline that sequentially\nintegrates MIL, HIL, and VVE methods to comprehensively develop and evaluate\nautonomous driving functions. The effectiveness of this testing pipeline will\nbe demonstrated using an autonomous driving path-tracking algorithm with local\ndeep reinforcement learning modification for vulnerable road user collision\navoidance.\n","authors":["Haochong Chen","Xincheng Cao","Levent Guvenc","Bilin Aksun Guvenc"],"pdf_url":"https://arxiv.org/pdf/2501.06113v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06112v1","updated":"2025-01-10T17:05:32Z","published":"2025-01-10T17:05:32Z","title":"Optimizing Experiments for Accurate Battery Circuit Parameters\n Estimation: Reduction and Adjustment of Frequency Set Used in Electrochemical\n Impedance Spectroscopy","summary":" In this paper, we study a suitable experimental design of electrochemical\nimpedance spectroscopy (EIS) to reduce the number of frequency points while not\nsignificantly affecting the uncertainties of the estimated cell's equivalent\ncircuit model (ECM) parameters. It is based on an E-optimal experimental design\nthat aims to maximize the information about the ECM parameters collected by EIS\nmeasurements and, at the same time, minimize the overall uncertainty. 
In a\nnumerical experiment, we first analyze to which extent reducing the number of\nmeasurement points at low frequencies affects the uncertainty of the estimated\nparameters. Secondly, we show that applying the frequency adjustments can lead\nto the same or even improved global uncertainty of ECM parameter estimates as\nwith a higher number of measurements. This is numerically verified through a\ncase study using the ECM parameters of a commercial battery cell.\n","authors":["Vladimir Sovljanski","Mario Paolone","Sylvain Tant","Damien Pierre Sainflou"],"pdf_url":"https://arxiv.org/pdf/2501.06112v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06104v1","updated":"2025-01-10T16:57:38Z","published":"2025-01-10T16:57:38Z","title":"Weather-Driven Priority Charging for Battery Storage Systems in Hybrid\n Renewable Energy Grid","summary":" The integration of renewable energy into the power grid is often hindered by\nits fragmented infrastructure, leading to inefficient utilization due to the\nvariability of energy production and its reliance on weather conditions.\nBattery storage systems, while essential for stabilizing energy supply, face\nchallenges like sub-optimal energy distribution, accelerating battery\ndegradation, and reducing operational efficiency. This paper presents a novel\nsolution to these challenges by developing a large-scale, interconnected\nrenewable energy network that optimizes energy storage and distribution. The\nproposed system includes strategically placed battery storage facilities that\nstabilize energy production by compensating for fluctuations in renewable\noutput. A priority charging algorithm, informed by real-time weather\nforecasting and load monitoring, ensures that the most suitable battery systems\nare charged under varying conditions. 
Within each storage facility, a secondary\npriority charging algorithm minimizes battery degradation by ranking batteries\nbased on critical parameters such as state of health (SoH) and state of charge\n(SoC) and deciding which to charge. This comprehensive approach enhances the\nefficiency and longevity of battery storage systems, offering a more reliable\nand resilient renewable energy infrastructure.\n","authors":["Dhrumil Bhatt","Siddharth Penumatsa","Nirbhay Singhal"],"pdf_url":"https://arxiv.org/pdf/2501.06104v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06092v1","updated":"2025-01-10T16:45:00Z","published":"2025-01-10T16:45:00Z","title":"Molecular Communication-Inspired Particle Collector-Transmitter (PaCoT)\n for Heavy Metal Removal from Human Circulatory System","summary":" This study proposes a novel molecular communication (MC)-inspired\nnanomachine, PArticle COllector-Transmitter (PaCoT), to remove toxic heavy\nmetals from the human circulatory system. PaCoT collects these toxic metals and\ntransmits them to release nodes, such as lymph capillaries, before they reach\ncritical organs. The design incorporates key physical parameters and operates\nthrough particle reception and release mechanisms. In the reception process,\ndescribed as ligand-receptor binding reactions, modeled as a continuous-time\nMarkov process (CTMP), PaCoT uses metallothionein proteins as receptors and\nheavy metals (e.g., Zn, Pb, Cd) as ligands. We assume that the toxicity\ncondition (toxic (bit-1), non-toxic (bit-0)) is encoded into the concentration\nof heavy metal molecules. Thus, we consider that heavy metal concentration\nwithin the MC channel (e.g., human circulatory system) employs binary\nconcentration shift keying (binary CSK). The concentration ratio of specific\nheavy metals is estimated to infer toxicity, i.e., a high ratio indicates\ntoxicity and a low ratio suggests non-toxicity. 
Toxicity detection is achieved\nby monitoring the receptor bound duration in the presence of interferers and\nvarious types of heavy metals. After detecting and collecting toxic heavy\nmetals, PaCoT securely retains them in a liquid medium (e.g., water) until\nrelease, employing two mechanisms: (1) a single-disc viscous micropump to\nregulate flow rate, and (2) Brownian motion to facilitate diffusion. PaCoT's\nperformance is evaluated through MATLAB simulations, focusing on bit error\nprobability (BEP) of the toxicity detection method, release time of molecules\nfrom PaCoT and energy consumption.\n","authors":["Hilal Esra Yaldiz","Ozgur B. Akan"],"pdf_url":"https://arxiv.org/pdf/2501.06092v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06089v1","updated":"2025-01-10T16:39:01Z","published":"2025-01-10T16:39:01Z","title":"Towards Developing Socially Compliant Automated Vehicles: State of the\n Art, Experts Expectations, and A Conceptual Framework","summary":" Automated Vehicles (AVs) hold promise for revolutionizing transportation by\nimproving road safety, traffic efficiency, and overall mobility. Despite the\nsteady advancement in high-level AVs in recent years, the transition to full\nautomation entails a period of mixed traffic, where AVs of varying automation\nlevels coexist with human-driven vehicles (HDVs). Making AVs socially compliant\nand understood by human drivers is expected to improve the safety and\nefficiency of mixed traffic. Thus, ensuring AVs compatibility with HDVs and\nsocial acceptance is crucial for their successful and seamless integration into\nmixed traffic. However, research in this critical area of developing Socially\nCompliant AVs (SCAVs) remains sparse. This study carries out the first\ncomprehensive scoping review to assess the current state of the art in\ndeveloping SCAVs, identifying key concepts, methodological approaches, and\nresearch gaps. 
An expert interview was also conducted to identify critical\nresearch gaps and expectations towards SCAVs. Based on the scoping review and\nexpert interview input, a conceptual framework is proposed for the development\nof SCAVs. The conceptual framework is evaluated using an online survey\ntargeting researchers, technicians, policymakers, and other relevant\nprofessionals worldwide. The survey results provide valuable validation and\ninsights, affirming the significance of the proposed conceptual framework in\ntackling the challenges of integrating AVs into mixed-traffic environments.\nAdditionally, future research perspectives and suggestions are discussed,\ncontributing to the research and development agenda of SCAVs.\n","authors":["Yongqi Dong","Bart van Arem","Haneen Farah"],"pdf_url":"https://arxiv.org/pdf/2501.06089v1.pdf","comment":"39 pages, 13 figures, under review by the journal of Transportation\n Research Part E: Logistics and Transportation Review"},{"id":"http://arxiv.org/abs/2501.06042v1","updated":"2025-01-10T15:21:48Z","published":"2025-01-10T15:21:48Z","title":"The improvement in transmission resilience metrics from reduced outages\n or faster restoration can be calculated by rerunning historical outage data","summary":" Transmission utilities routinely collect detailed outage data, including\nresilience events in which outages bunch up due to weather. The resilience\nevents and their resilience metrics can readily be extracted from this\nhistorical outage data. Improvements such as grid hardening or investments in\nrestoration lead to reduced outages or faster restoration. We show how to rerun\nthis history with the effects of the reduced outages or faster restoration\nincluded to find the resulting improvement in resilience metrics, thus\nquantifying the benefits of these investments. This is demonstrated with case\nstudies for specific events (a derecho and a hurricane), and all large events\nor large thunderstorms in the Midwest USA. 
Instead of predicting future extreme\nevents with models, which is very challenging, the historical rerun readily\nquantifies the benefits that a resilience investment would have had if it had\nbeen made in the past. The historical rerun is particularly vivid in making the\ncase for resilience investments to stakeholders because it quantifies the\nbenefits for events actually experienced by those stakeholders, rather than for\nfuture events predicted with uncertainty.\n","authors":["Arslan Ahmad","Ian Dobson","Svetlana Ekisheva","Christopher Claypool"],"pdf_url":"https://arxiv.org/pdf/2501.06042v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06030v1","updated":"2025-01-10T15:07:07Z","published":"2025-01-10T15:07:07Z","title":"Resiliency metrics quantifying emergency response in a distribution\n system","summary":" The electric distribution system is a cornerstone of modern life, playing a\ncritical role in the daily activities and well-being of individuals. As the\nworld transitions toward a decarbonized future, where even mobility relies on\nelectricity, ensuring the resilience of the grid becomes paramount. This paper\nintroduces novel resilience metrics designed to equip utilities and\nstakeholders with actionable tools to assess performance during storm events.\nThe metrics focus on emergency storm response and the resources required to\nimprove customer service. The practical calculation of the metrics from\nhistorical utility data is demonstrated for multiple storm events.\nAdditionally, the metrics' improvement with added crews is estimated by\n\"rerunning history\" with faster restoration. 
By applying this resilience\nframework, utilities can enhance their restoration strategies and unlock\npotential cost savings, benefiting both providers and customers in an era of\nheightened energy dependency.\n","authors":["Shikhar Pandey","Gowtham Kandaperumal","Arslan Ahmad","Ian Dobson"],"pdf_url":"https://arxiv.org/pdf/2501.06030v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06016v1","updated":"2025-01-10T14:53:21Z","published":"2025-01-10T14:53:21Z","title":"Investigating the Impact of Observation Space Design Choices On Training\n Reinforcement Learning Solutions for Spacecraft Problems","summary":" Recent research using Reinforcement Learning (RL) to learn autonomous control\nfor spacecraft operations has shown great success. However, a recent study\nshowed their performance could be improved by changing the action space, i.e.\ncontrol outputs, used in the learning environment. This has opened the door for\nfinding more improvements through further changes to the environment. The work\nin this paper focuses on how changes to the environment's observation space can\nimpact the training and performance of RL agents learning the spacecraft\ninspection task. The studies are split into two groups. The first looks at the\nimpact of sensors that were designed to help agents learn the task. The second\nlooks at the impact of reference frames, reorienting the agent to see the world\nfrom a different perspective. 
The results show the sensors are not necessary,\nbut most of them help agents learn more optimal behavior, and that the\nreference frame does not have a large impact, but is best kept consistent.\n","authors":["Nathaniel Hamilton","Kyle Dunlap","Kerianne L Hobbs"],"pdf_url":"https://arxiv.org/pdf/2501.06016v1.pdf","comment":"18 pages, 10 figures, 3 tables"},{"id":"http://arxiv.org/abs/2501.05994v1","updated":"2025-01-10T14:26:23Z","published":"2025-01-10T14:26:23Z","title":"On the Interaction in Transient Stability of Two-Inverter Power Systems\n containing GFL inverter Using Manifold Method","summary":" Many renewable energy resources are integrated into power systems via\ngrid-following (GFL) inverters which rely on a phase-locked loop (PLL) for grid\nsynchronization. During severe grid faults, GFL inverters are vulnerable to\ntransient instability, often leading to disconnection from the grid. This paper\naims to elucidate the interaction mechanisms and define the stability\nboundaries of systems of two inverters, including GFL, grid-forming (GFM), or\ngrid-supporting (GSP) inverters. First, the generalized large-signal expression\nfor the two-inverter system under various inverter combinations is derived,\nrevealing that no energy function exists for systems containing GFL inverters.\nThis implies that the traditional direct method cannot be applied to such\nsystems. To overcome these challenges, a manifold method is employed to\nprecisely determine the domain of attraction (DOA) of the system, and the\ntransient stability margin is assessed by a new metric termed the critical\nclearing radius (CCR). A case study of the two-inverter system under various\ninverter combinations is conducted to explore large-signal interactions across\ndifferent scenarios. 
Manifold analysis and simulation results reveal that GSP\ninverters using PLL for grid synchronization exhibit behavior similar to GFM\ninverters when the droop coefficients in the terminal voltage control loop\n(TVC) are sufficiently large. Compared to GFL inverters, GSP inverters\nincorporating a TVC significantly enhances the transient stability of other\ninverters. In the STATCOM case, the optimal placement of the STATCOM, realized\nby GSP or GFM inverters, is identified to be at the midpoint of a transmission\nline. All findings in this paper are validated through electromagnetic\ntransient (EMT) simulations\n","authors":["Yifan Zhang","Yunjie Gu","Yue Zhu","Timothy C. Green","Hsiao-Dong Chiang"],"pdf_url":"https://arxiv.org/pdf/2501.05994v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05984v1","updated":"2025-01-10T14:14:18Z","published":"2025-01-10T14:14:18Z","title":"The Safe Trusted Autonomy for Responsible Space Program","summary":" The Safe Trusted Autonomy for Responsible Space (STARS) program aims to\nadvance autonomy technologies for space by leveraging machine learning\ntechnologies while mitigating barriers to trust, such as uncertainty,\nopaqueness, brittleness, and inflexibility. This paper presents the\nachievements and lessons learned from the STARS program in integrating\nreinforcement learning-based multi-satellite control, run time assurance\napproaches, and flexible human-autonomy teaming interfaces, into a new\nintegrated testing environment for collaborative autonomous satellite systems.\nThe primary results describe analysis of the reinforcement learning\nmulti-satellite control and run time assurance algorithms. These algorithms are\nintegrated into a prototype human-autonomy interface using best practices from\nhuman-autonomy trust literature, however detailed analysis of the effectiveness\nis left to future work. References are provided with additional detailed\nresults of individual experiments.\n","authors":["Kerianne L. 
Hobbs","Sean Phillips","Michelle Simon","Joseph B. Lyons","Jared Culbertson","Hamilton Scott Clouse","Nathaniel Hamilton","Kyle Dunlap","Zachary S. Lippay","Joshua Aurand","Zachary I. Bell","Taleri Hammack","Dorothy Ayres","Rizza Lim"],"pdf_url":"https://arxiv.org/pdf/2501.05984v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05946v1","updated":"2025-01-10T13:18:00Z","published":"2025-01-10T13:18:00Z","title":"Coverage and Spectral Efficiency of NOMA-Enabled LEO Satellite Networks\n with Ordering Schemes","summary":" This paper investigates an analytical model for low-earth orbit (LEO)\nmulti-satellite downlink non-orthogonal multiple access (NOMA) networks. The\nsatellites transmit data to multiple NOMA user terminals (UTs), each employing\nsuccessive interference cancellation (SIC) for decoding. Two ordering schemes\nare adopted for NOMA-enabled LEO satellite networks, i.e., mean signal power\n(MSP)-based ordering and\ninstantaneous-signal-to-inter-satellite-interference-plus-noise ratio\n(ISINR)-based ordering. For each ordering scheme, we derive the coverage\nprobabilities of UTs under different channel conditions. Moreover, we discuss\nhow coverage is influenced by SIC, main-lobe gain, and tradeoffs between the\nnumber of satellites and their altitudes. Additionally, two user fairness-based\npower allocation (PA) schemes are considered, and PA coefficients with the\noptimal number of UTs that maximize their sum spectral efficiency (SE) are\nstudied. Simulation results show that there exists a maximum\nsignal-to-inter-satellite-interference-plus-noise ratio (SINR) threshold for\neach PA scheme that ensures the operation of NOMA in LEO satellite networks,\nand the benefit of NOMA only exists when the target SINR is below a certain\nthreshold. Compared with orthogonal multiple access (OMA), NOMA increases UTs'\nsum SE by as much as 35\\%. 
Furthermore, for most SINR thresholds, the sum SE\nincreases with the number of UTs to the highest value, whilst the maximum sum\nSE is obtained when there are two UTs.\n","authors":["Xiangyu Li","Bodong Shang","Qingqing Wu","Chao Ren"],"pdf_url":"https://arxiv.org/pdf/2501.05946v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05943v1","updated":"2025-01-10T13:08:38Z","published":"2025-01-10T13:08:38Z","title":"Koopman-Based Model Predictive Control of Functional Electrical\n Stimulation for Ankle Dorsiflexion and Plantarflexion Assistance","summary":" Functional Electrical Stimulation (FES) can be an effective tool to augment\nparetic muscle function and restore normal ankle function. Our approach\nincorporates a real-time, data-driven Model Predictive Control (MPC) scheme,\nbuilt upon a Koopman operator theory (KOT) framework. This framework adeptly\ncaptures the complex nonlinear dynamics of ankle motion in a linearized form,\nenabling application of linear control approaches for highly nonlinear\nFES-actuated dynamics. Utilizing inertial measurement units (IMUs), our method\naccurately predicts the FES-induced ankle movements, while accounting for\nnonlinear muscle actuation dynamics, including the muscle activation for both\nplantarflexors, and dorsiflexors (Tibialis Anterior (TA)). The linear\nprediction model derived through KOT allowed us to formulate the MPC problem\nwith linear state space dynamics, enhancing the real-time feasibility,\nprecision and adaptability of the FES driven control. The effectiveness and\napplicability of our approach have been demonstrated through comprehensive\nsimulations and experimental trials, including three participants with no\ndisability and a participant with Multiple Sclerosis. 
Our findings highlight\nthe potential of a KOT-based MPC approach for FES based gait assistance that\noffers effective and personalized assistance for individuals with gait\nimpairment conditions.\n","authors":["Mayank Singh","Noor Hakam","Trisha M. Kesar","Nitin Sharma"],"pdf_url":"https://arxiv.org/pdf/2501.05943v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.15862v2","updated":"2025-01-10T10:46:26Z","published":"2024-10-21T10:43:14Z","title":"Integration of Cobalt Ferromagnetic Control Gates for Electrical and\n Magnetic Manipulation of Semiconductor Quantum Dots","summary":" The rise of electron spin qubit architectures for quantum computing\nprocessors has led to a strong interest in designing and integrating\nferromagnets to induce stray magnetic fields for electron dipole spin resonance\n(EDSR). The integration of nanomagnets imposes however strict layout and\nprocessing constraints, challenging the arrangement of different gating layers\nand the control of neighboring qubit frequencies. This work reports a\nsuccessful integration of nano-sized cobalt control gates into a multi-gate\nFD-SOI nanowire with nanometer-scale dot-to-magnet pitch, simultaneously\nexploiting electrical and ferromagnetic properties of the gate stack at\nnanoscale. The electrical characterization of the multi-gate nanowire exhibits\nfull field effect functionality of all ferromagnetic gates from room\ntemperature to 10 mK, proving quantum dot formation when ferromagnets are\noperated as barrier gates. The front-end-of-line (FEOL) compatible integration\nof cobalt is examined by energy dispersive X-ray spectroscopy and high/low\nfrequency capacitance characterization, confirming the quality of interfaces\nand control over material diffusion. Insights into the magnetic properties of\nthin films and patterned control-gates are provided by vibrating sample\nmagnetometry and electron holography measurements. 
Micromagnetic simulations\nanticipate that this structure fulfills the requirements for EDSR driving for\nmagnetic fields higher than 1 T, where a homogeneous magnetization along the\nhard magnetic axis of the Co gates is expected. The FDSOI architecture\nshowcased in this study provides a scalable alternative to micromagnets\ndeposited in the back-end-of-line (BEOL) and middle-of-line (MOL) processes,\nwhile bringing technological insights for the FEOL-compatible integration of Co\nnanostructures in spin qubit devices.\n","authors":["Fabio Bersano","Michele Aldeghi","Niccolò Martinolli","Victor Boureau","Thibault Aboud","Michele Ghini","Pasquale Scarlino","Gian Salis","Adrian Mihai Ionescu"],"pdf_url":"https://arxiv.org/pdf/2410.15862v2.pdf","comment":"15 pages, 7 figures"},{"id":"http://arxiv.org/abs/2501.05842v1","updated":"2025-01-10T10:33:13Z","published":"2025-01-10T10:33:13Z","title":"Orthogonal projection-based regularization for efficient model\n augmentation","summary":" Deep-learning-based nonlinear system identification has shown the ability to\nproduce reliable and highly accurate models in practice. However, these\nblack-box models lack physical interpretability, and often a considerable part\nof the learning effort is spent on capturing already expected/known behavior\ndue to first-principles-based understanding of some aspects of the system. A\npotential solution is to integrate prior physical knowledge directly into the\nmodel structure, combining the strengths of physics-based modeling and\ndeep-learning-based identification. The most common approach is to use an\nadditive model augmentation structure, where the physics-based and the\nmachine-learning (ML) components are connected in parallel. However, such\nmodels are overparametrized, training them is challenging, potentially causing\nthe physics-based part to lose interpretability. 
To overcome this challenge,\nthis paper proposes an orthogonal projection-based regularization technique to\nenhance parameter learning, convergence, and even model accuracy in\nlearning-based augmentation of nonlinear baseline models.\n","authors":["Bendegúz M. Györök","Jan H. Hoekstra","Johan Kon","Tamás Péni","Maarten Schoukens","Roland Tóth"],"pdf_url":"https://arxiv.org/pdf/2501.05842v1.pdf","comment":"Submitted to L4DC 2025"},{"id":"http://arxiv.org/abs/2401.10726v4","updated":"2025-01-10T10:30:41Z","published":"2024-01-19T14:43:04Z","title":"Empowering Aggregators with Practical Data-Driven Tools: Harnessing\n Aggregated and Disaggregated Flexibility for Demand Response","summary":" This study explores the interaction between aggregators and building\noccupants in activating flexibility through Demand Response (DR) programs, with\na focus on reinforcing the resilience of the energy system considering the\nuncertainties presented by Renewable Energy Sources (RES). Firstly, it\nintroduces a methodology of optimizing aggregated flexibility provision\nstrategies in environments with limited data, utilizing Discrete Fourier\nTransformation (DFT) and clustering techniques to identify building occupants'\nactivity patterns. Secondly, the study assesses the disaggregated flexibility\nprovision of Heating Ventilation and Air Conditioning (HVAC) systems during DR\nevents, employing machine learning and optimization techniques for precise,\ndevice-level analysis. 
The first approach offers a non-intrusive pathway for\naggregators to provide flexibility services in environments of a single smart\nmeter for the whole building's consumption, while the second approach maximizes\nthe amount of flexibility in the case of dedicated metering devices to the HVAC\nsystems by carefully considering building occupants' thermal comfort profiles.\nThrough the application of data-driven techniques and encompassing case studies\nfrom both industrial and residential buildings, this paper not only unveils\npivotal opportunities for aggregators in the balancing and emerging flexibility\nmarkets but also successfully develops and demonstrates end-to-end practical\ntools for aggregators.\n","authors":["Costas Mylonas","Donata Boric","Leila Luttenberger Maric","Alexandros Tsitsanis","Eleftheria Petrianou","Magda Foti"],"pdf_url":"https://arxiv.org/pdf/2401.10726v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.02203v2","updated":"2025-01-10T10:08:13Z","published":"2023-10-03T16:59:26Z","title":"Stochastic Quantum Power Flow for Risk Assessment in Power Systems","summary":" This paper introduces the first quantum computing framework for Stochastic\nQuantum Power Flow (SQPF) analysis in power systems. The proposed method\nleverages quantum states to encode power flow distributions, enabling the use\nof Quantum Monte Carlo (QMC) sampling to efficiently assess the probability of\nline overloads. Our approach significantly reduces the required sample size\ncompared to traditional Monte Carlo methods, making it particularly suited for\nrisk assessments in scenarios involving high uncertainty, such as renewable\nenergy integration. We validate the method on two test systems, demonstrating\nthe computational advantage of quantum algorithms in reducing sample complexity\nwhile maintaining accuracy. 
This work represents a foundational step toward\nscalable quantum power flow analysis, with potential applications in future\npower system operations and planning. The results show promising computational\nspeedups, underscoring the potential of quantum computing in addressing the\nincreasing uncertainty in modern power grids.\n","authors":["Brynjar Sævarsson","Hjörtur Jóhannsson","Spyros Chatzivasileiadis"],"pdf_url":"https://arxiv.org/pdf/2310.02203v2.pdf","comment":"Accepted by the Electric Power System Research journal"},{"id":"http://arxiv.org/abs/2501.05815v1","updated":"2025-01-10T09:38:42Z","published":"2025-01-10T09:38:42Z","title":"Enhanced sampled-data model predictive control via nonlinear lifting","summary":" This paper introduces a novel nonlinear model predictive control (NMPC)\nframework that incorporates a lifting technique to enhance control performance\nfor nonlinear systems. While the lifting technique has been widely employed in\nlinear systems to capture intersample behaviour, their application to nonlinear\nsystems remains unexplored. We address this gap by formulating an NMPC scheme\nthat combines fast-sample fast-hold (FSFH) approximations and numerical methods\nto approximate system dynamics and cost functions. The proposed approach is\nvalidated through two case studies: the Van der Pol oscillator and the inverted\npendulum on a cart. Simulation results demonstrate that the lifted NMPC\noutperforms conventional NMPC in terms of reduced settling time and improved\ncontrol accuracy. 
These findings underscore the potential of the lifting-based\nNMPC for efficient control of nonlinear systems, offering a practical solution\nfor real-time applications.\n","authors":["Nuthasith Gerdpratoom","Fumiya Matsuzaki","Yutaka Yamamoto","Kaoru Yamamoto"],"pdf_url":"https://arxiv.org/pdf/2501.05815v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05808v1","updated":"2025-01-10T09:15:40Z","published":"2025-01-10T09:15:40Z","title":"Real-Time Integrated Dispatching and Idle Fleet Steering with Deep\n Reinforcement Learning for A Meal Delivery Platform","summary":" To achieve high service quality and profitability, meal delivery platforms\nlike Uber Eats and Grubhub must strategically operate their fleets to ensure\ntimely deliveries for current orders while mitigating the consequential impacts\nof suboptimal decisions that leads to courier understaffing in the future. This\nstudy set out to solve the real-time order dispatching and idle courier\nsteering problems for a meal delivery platform by proposing a reinforcement\nlearning (RL)-based strategic dual-control framework. To address the inherent\nsequential nature of these problems, we model both order dispatching and\ncourier steering as Markov Decision Processes. Trained via a deep reinforcement\nlearning (DRL) framework, we obtain strategic policies by leveraging the\nexplicitly predicted demands as part of the inputs. In our dual-control\nframework, the dispatching and steering policies are iteratively trained in an\nintegrated manner. These forward-looking policies can be executed in real-time\nand provide decisions while jointly considering the impacts on local and\nnetwork levels. To enhance dispatching fairness, we propose convolutional deep\nQ networks to construct fair courier embeddings. To simultaneously rebalance\nthe supply and demand within the service network, we propose to utilize\nmean-field approximated supply-demand knowledge to reallocate idle couriers at\nthe local level. 
Utilizing the policies generated by the RL-based strategic\ndual-control framework, we find the delivery efficiency and fairness of\nworkload distribution among couriers have been improved, and under-supplied\nconditions have been alleviated within the service network. Our study sheds\nlight on designing an RL-based framework to enable forward-looking real-time\noperations for meal delivery platforms and other on-demand services.\n","authors":["Jingyi Cheng","Shadi Sharif Azadeh"],"pdf_url":"https://arxiv.org/pdf/2501.05808v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05778v1","updated":"2025-01-10T08:21:41Z","published":"2025-01-10T08:21:41Z","title":"Formally Verified Neural Lyapunov Function for Incremental\n Input-to-State Stability of Unknown Systems","summary":" This work presents an approach to synthesize a Lyapunov-like function to\nensure incrementally input-to-state stability ($\\delta$-ISS) property for an\nunknown discrete-time system. To deal with challenges posed by unknown system\ndynamics, we parameterize the Lyapunov-like function as a neural network, which\nwe train using the data samples collected from the unknown system along with\nappropriately designed loss functions. We propose a validity condition to test\nthe obtained function and incorporate it into the training framework to ensure\nprovable correctness at the end of the training. 
Finally, the usefulness of the\nproposed technique is proved using two case studies: a scalar non-linear\ndynamical system and a permanent magnet DC motor.\n","authors":["Ahan Basu","Bhabani Shankar Dey","Pushpak Jagtap"],"pdf_url":"https://arxiv.org/pdf/2501.05778v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05770v1","updated":"2025-01-10T07:58:52Z","published":"2025-01-10T07:58:52Z","title":"Path Planning for Multi-Copter UAV Formation Employing a Generalized\n Particle Swarm Optimization","summary":" The paper investigates the problem of path planning techniques for\nmulti-copter uncrewed aerial vehicles (UAV) cooperation in a formation shape to\nexamine surrounding surfaces. We first describe the problem as a joint\nobjective cost for planning a path of the formation centroid working in a\ncomplicated space. The path planning algorithm, named the generalized particle\nswarm optimization algorithm, is then presented to construct an optimal,\nflyable path while avoiding obstacles and ensuring the flying mission\nrequirements. A path-development scheme is then incorporated to generate a\nrelevant path for each drone to maintain its position in the formation\nconfiguration. Simulation, comparison, and experiments have been conducted to\nverify the proposed approach. Results show the feasibility of the proposed\npath-planning algorithm with GEPSO.\n","authors":["Van Truong Hoang"],"pdf_url":"https://arxiv.org/pdf/2501.05770v1.pdf","comment":"6 pages, 8 figures, conference"},{"id":"http://arxiv.org/abs/2404.14767v4","updated":"2025-01-10T06:45:33Z","published":"2024-04-23T06:10:31Z","title":"Remaining Discharge Energy Prediction for Lithium-Ion Batteries Over\n Broad Current Ranges: A Machine Learning Approach","summary":" Lithium-ion batteries have found their way into myriad sectors of industry to\ndrive electrification, decarbonization, and sustainability. 
A crucial aspect in\nensuring their safe and optimal performance is monitoring their energy levels.\nIn this paper, we present the first study on predicting the remaining energy of\na battery cell undergoing discharge over wide current ranges from low to high\nC-rates. The complexity of the challenge arises from the cell's\nC-rate-dependent energy availability as well as its intricate electro-thermal\ndynamics especially at high C-rates. To address this, we introduce a new\ndefinition of remaining discharge energy and then undertake a systematic effort\nin harnessing the power of machine learning to enable its prediction. Our\neffort includes two parts in cascade. First, we develop an accurate dynamic\nmodel based on integration of physics with machine learning to capture a\nbattery's voltage and temperature behaviors. Second, based on the model, we\npropose a machine learning approach to predict the remaining discharge energy\nunder arbitrary C-rates and pre-specified cut-off limits in voltage and\ntemperature. The experimental validation shows that the proposed approach can\npredict the remaining discharge energy with a relative error of less than 3%\nwhen the current varies between 0~8 C for an NCA cell and 0~15 C for an LFP\ncell. The approach, by design, is amenable to training and computation.\n","authors":["Hao Tu","Manashita Borah","Scott Moura","Yebin Wang","Huazhen Fang"],"pdf_url":"https://arxiv.org/pdf/2404.14767v4.pdf","comment":"15 pages, 13 figures, 4 tables"},{"id":"http://arxiv.org/abs/2406.00621v3","updated":"2025-01-10T06:10:19Z","published":"2024-06-02T05:50:41Z","title":"Log-Scale Quantization in Distributed First-Order Methods:\n Gradient-based Learning from Distributed Data","summary":" Decentralized strategies are of interest for learning from large-scale data\nover networks. This paper studies learning over a network of geographically\ndistributed nodes/agents subject to quantization. 
Each node possesses a private\nlocal cost function, collectively contributing to a global cost function, which\nthe considered methodology aims to minimize. In contrast to many existing\npapers, the information exchange among nodes is log-quantized to address\nlimited network-bandwidth in practical situations. We consider a first-order\ncomputationally efficient distributed optimization algorithm (with no extra\ninner consensus loop) that leverages node-level gradient correction based on\nlocal data and network-level gradient aggregation only over nearby nodes. This\nmethod only requires balanced networks with no need for stochastic weight\ndesign. It can handle log-scale quantized data exchange over possibly\ntime-varying and switching network setups. We study convergence over both\nstructured networks (for example, training over data-centers) and ad-hoc\nmulti-agent networks (for example, training over dynamic robotic networks).\nThrough experimental validation, we show that (i) structured networks generally\nresult in a smaller optimality gap, and (ii) log-scale quantization leads to a\nsmaller optimality gap compared to uniform quantization.\n","authors":["Mohammadreza Doostmohammadian","Muhammad I. Qureshi","Mohammad Hossein Khalesi","Hamid R. Rabiee","Usman A. Khan"],"pdf_url":"https://arxiv.org/pdf/2406.00621v3.pdf","comment":"IEEE TASE 2025"},{"id":"http://arxiv.org/abs/2501.05715v1","updated":"2025-01-10T05:22:55Z","published":"2025-01-10T05:22:55Z","title":"Non-intrusive Data-driven ADI-based Low-rank Balanced Truncation","summary":" In this short note, a non-intrusive data-driven formulation of ADI-based\nlow-rank balanced truncation is provided. The proposed algorithm only requires\ntransfer function samples at the mirror images of ADI shifts. 
If some shifts\nare used in both approximating the controllability Gramian and the\nobservability Gramian, then samples of the transfer function's derivative at\nthese shifts are also needed to enforce Hermite interpolation in the Loewner\nframework. It is noted that ADI-based low-rank balanced truncation can be\nviewed as a two-step process. The first step involves constructing an\ninterpolant of the original model at the mirror images of the ADI shifts, which\ncan be done non-intrusively within the Loewner framework. The second step\ninvolves reducing this interpolant using low-rank factors of Gramians\nassociated with the interpolation data through the balanced square-root\nalgorithm. This second step does not require any system information, making the\noverall process non-intrusive with the only required information being samples\nof the transfer function and/or its derivative at the mirror images of ADI\nshifts. Furthermore, it is shown that when the order of the reduced model in\nADI-based low-rank balanced truncation is selected to match the numerical rank\nof the low-rank factors of the Gramians, it effectively reduces to standard\ninterpolation at the mirror images of the ADI shift. An illustrative example is\nprovided to explain the proposed approach.\n","authors":["Umair Zulfiqar"],"pdf_url":"https://arxiv.org/pdf/2501.05715v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05660v1","updated":"2025-01-10T02:24:35Z","published":"2025-01-10T02:24:35Z","title":"Fully Decentralized Computation Offloading in Priority-Driven Edge\n Computing Systems","summary":" We develop a novel framework for fully decentralized offloading policy design\nin multi-access edge computing (MEC) systems. The system comprises $N$\npower-constrained user equipments (UEs) assisted by an edge server (ES) to\nprocess incoming tasks. Tasks are labeled with urgency flags, and in this\npaper, we classify them under three urgency levels, namely, high, moderate, and\nlow urgency. 
We formulate the problem of designing computation decisions for\nthe UEs within a large population noncooperative game framework, where each UE\nselfishly decides on how to split task execution between its local onboard\nprocessor and the ES. We employ the weighted average age of information (AoI)\nmetric to quantify information freshness at the UEs. Increased onboard\nprocessing consumes more local power, while increased offloading may\npotentially incur a higher average AoI due to other UEs' packets being\noffloaded to the same ES. Thus, we use the mean-field game (MFG) formulation to\ncompute approximate decentralized Nash equilibrium offloading and local\ncomputation policies for the UEs to balance between the information freshness\nand local power consumption. Finally, we provide a projected gradient\ndescent-based algorithm to numerically assess the merits of our approach.\n","authors":["Shubham Aggarwal","Melih Bastopcu","Muhammad Aneeq uz Zaman","Tamer Başar","Sennur Ulukus","Nail Akar"],"pdf_url":"https://arxiv.org/pdf/2501.05660v1.pdf","comment":"Submitted to IEEE for possible publication"},{"id":"http://arxiv.org/abs/2501.05655v1","updated":"2025-01-10T01:57:10Z","published":"2025-01-10T01:57:10Z","title":"Downlink Performance of Cell-Free Massive MIMO for LEO Satellite\n Mega-Constellation","summary":" Low-earth orbit (LEO) satellite communication (SatCom) has emerged as a\npromising technology for improving wireless connectivity in global areas.\nCell-free massive multiple-input multiple-output (CF-mMIMO), an architecture\nrecently proposed for next-generation networks, has yet to be fully explored\nfor LEO satellites. In this paper, we investigate the downlink performance of a\nCF-mMIMO LEO SatCom network, where many satellite access points (SAPs)\nsimultaneously serve the corresponding ground user terminals (UTs). 
Using tools\nfrom stochastic geometry, we model the locations of SAPs and UTs on surfaces of\nconcentric spheres using Poisson point processes (PPPs) and present expressions\nbased on linear minimum-mean-square-error (LMMSE) channel estimation and\nconjugate beamforming. Then, we derive the coverage probabilities in both\nfading and non-fading scenarios, with significant system parameters such as the\nNakagami fading parameter, number of UTs, number of SAPs, orbital altitude, and\nservice range brought by the dome angle. Finally, the analytical model is\nverified by extensive Monte Carlo simulations. Simulation results show that\nstronger line-of-sight (LoS) effects and a more comprehensive service range of\nthe UT bring higher coverage probability despite existing multi-user\ninterference. Moreover, we found that there exist optimal numbers of UTs for\ndifferent orbital altitudes and dome angles, which provides valuable system\ndesign insights.\n","authors":["Xiangyu Li","Bodong Shang"],"pdf_url":"https://arxiv.org/pdf/2501.05655v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.09131v4","updated":"2025-01-10T23:30:13Z","published":"2023-12-14T17:01:58Z","title":"Physics-Informed Neural Network Lyapunov Functions: PDE\n Characterization, Learning, and Verification","summary":" We provide a systematic investigation of using physics-informed neural\nnetworks to compute Lyapunov functions. We encode Lyapunov conditions as a\npartial differential equation (PDE) and use this for training neural network\nLyapunov functions. We analyze the analytical properties of the solutions to\nthe Lyapunov and Zubov PDEs. In particular, we show that employing the Zubov\nequation in training neural Lyapunov functions can lead to approximate regions\nof attraction close to the true domain of attraction. We also examine\napproximation errors and the convergence of neural approximations to the unique\nsolution of Zubov's equation. 
We then provide sufficient conditions for the\nlearned neural Lyapunov functions that can be readily verified by\nsatisfiability modulo theories (SMT) solvers, enabling formal verification of\nboth local stability analysis and region-of-attraction estimates in the large.\nThrough a number of nonlinear examples, ranging from low to high dimensions, we\ndemonstrate that the proposed framework can outperform traditional\nsums-of-squares (SOS) Lyapunov functions obtained using semidefinite\nprogramming (SDP).\n","authors":["Jun Liu","Yiming Meng","Maxwell Fitzsimmons","Ruikun Zhou"],"pdf_url":"https://arxiv.org/pdf/2312.09131v4.pdf","comment":"The current version is accepted to the IFAC Journal Automatica"},{"id":"http://arxiv.org/abs/2501.06353v1","updated":"2025-01-10T21:25:16Z","published":"2025-01-10T21:25:16Z","title":"Event Constrained Programming","summary":" In this paper, we present event constraints as a new modeling paradigm that\ngeneralizes joint chance constraints from stochastic optimization to (1)\nenforce a constraint on the probability of satisfying a set of constraints\naggregated via application-specific logic (constituting an event) and (2) to be\napplied to general infinite-dimensional optimization (InfiniteOpt) problems\n(i.e., time, space, and/or uncertainty domains). This new constraint class\noffers significant modeling flexibility in posing InfiniteOpt constraints that\nare enforced over a certain portion of their domain (e.g., to a certain\nprobability level), but can be challenging to reformulate/solve due to\ndifficulties in representing arbitrary logical conditions and specifying a\nprobabilistic measure on a collection of constraints. 
To address these\nchallenges, we derive a generalized disjunctive programming (GDP)\nrepresentation of event constrained optimization problems, which readily\nenables us to pose logical event conditions in a standard form and allows us to\ndraw from a suite of GDP solution strategies that leverage the special\nstructure of this problem class. We also extend several approximation\ntechniques from the chance constraint literature to provide a means to\nreformulate certain event constraints without the use of binary variables. We\nillustrate these findings with case studies in stochastic optimal power flow,\ndynamic disease control, and optimal 2D diffusion.\n","authors":["Daniel Ovalle","Stefan Mazzadi","Carl D. Laird","Ignacio E. Grossmann","Joshua L. Pulsipher"],"pdf_url":"https://arxiv.org/pdf/2501.06353v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.04669v4","updated":"2025-01-10T21:06:41Z","published":"2024-09-07T01:17:59Z","title":"Learning Optimal Stable Matches in Decentralized Markets with Unknown\n Preferences","summary":" Matching algorithms have demonstrated great success in several practical\napplications, but they often require centralized coordination and plentiful\ninformation. In many modern online marketplaces, agents must independently seek\nout and match with another using little to no information. For these kinds of\nsettings, can we design decentralized, limited-information matching algorithms\nthat preserve the desirable properties of standard centralized techniques? In\nthis work, we constructively answer this question in the affirmative. We model\na two-sided matching market as a game consisting of two disjoint sets of\nagents, referred to as proposers and acceptors, each of whom seeks to match\nwith their most preferable partner on the opposite side of the market. However,\neach proposer has no knowledge of their own preferences, so they must learn\ntheir preferences while forming matches in the market. 
We present a simple\nonline learning rule that guarantees a strong notion of probabilistic\nconvergence to the welfare-maximizing equilibrium of the game, referred to as\nthe proposer-optimal stable match. To the best of our knowledge, this\nrepresents the first completely decoupled, communication-free algorithm that\nguarantees probabilistic convergence to an optimal stable match, irrespective\nof the structure of the matching market.\n","authors":["Vade Shah","Bryce L. Ferguson","Jason R. Marden"],"pdf_url":"https://arxiv.org/pdf/2409.04669v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06335v1","updated":"2025-01-10T20:41:11Z","published":"2025-01-10T20:41:11Z","title":"A Comparison of Strategies to Embed Physics-Informed Neural Networks in\n Nonlinear Model Predictive Control Formulations Solved via Direct\n Transcription","summary":" This study aims to benchmark candidate strategies for embedding neural\nnetwork (NN) surrogates in nonlinear model predictive control (NMPC)\nformulations that are subject to systems described with partial differential\nequations and that are solved via direct transcription (i.e., simultaneous\nmethods). This study focuses on the use of physics-informed NNs and\nphysics-informed convolutional NNs as the internal (surrogate) models within\nthe NMPC formulation. One strategy embeds NN models as explicit algebraic\nconstraints, leveraging the automatic differentiation (AD) of an algebraic\nmodelling language (AML) to evaluate the derivatives. Alternatively, the solver\ncan be provided with derivatives computed external to the AML via the AD\nroutines of the machine learning environment the NN is trained in. The three\nnumerical experiments considered in this work reveal that replacing mechanistic\nmodels with NN surrogates may not always offer computational advantages when\nsmooth activation functions are used in conjunction with a local nonlinear\nsolver (e.g., Ipopt), even with highly nonlinear systems. 
Moreover, in this\ncontext, the external function evaluation of the NN surrogates often\noutperforms the embedding strategies that rely on explicit algebraic\nconstraints, likely due to the difficulty in initializing the auxiliary\nvariables and constraints introduced by explicit algebraic reformulations.\n","authors":["Carlos Andrés Elorza Casas","Luis A. Ricardez-Sandoval","Joshua L. Pulsipher"],"pdf_url":"https://arxiv.org/pdf/2501.06335v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.07601v1","updated":"2025-01-10T22:31:53Z","published":"2025-01-10T22:31:53Z","title":"Real-Time Decision-Making for Digital Twin in Additive Manufacturing\n with Model Predictive Control using Time-Series Deep Neural Networks","summary":" Digital Twin-a virtual replica of a physical system enabling real-time\nmonitoring, model updating, prediction, and decision-making-combined with\nrecent advances in machine learning (ML), offers new opportunities for\nproactive control strategies in autonomous manufacturing. However, achieving\nreal-time decision-making with Digital Twins requires efficient optimization\ndriven by accurate predictions of highly nonlinear manufacturing systems. This\npaper presents a simultaneous multi-step Model Predictive Control (MPC)\nframework for real-time decision-making, using a multi-variate deep neural\nnetwork (DNN), named Time-Series Dense Encoder (TiDE), as the surrogate model.\nDifferent from the models in conventional MPC which only provide one-step ahead\nprediction, TiDE is capable of predicting future states within the prediction\nhorizon in one shot (multi-step), significantly accelerating MPC. Using\nDirected Energy Deposition additive manufacturing as a case study, we\ndemonstrate the effectiveness of the proposed MPC in achieving melt pool\ntemperature tracking to ensure part quality, while reducing porosity defects by\nregulating laser power to maintain melt pool depth constraints. 
In this work,\nwe first show that TiDE is capable of accurately predicting melt pool\ntemperature and depth. Second, we demonstrate that the proposed MPC achieves\nprecise temperature tracking while satisfying melt pool depth constraints\nwithin a targeted dilution range (10%-30%), reducing potential porosity\ndefects. Compared to the PID controller, MPC results in smoother and less\nfluctuating laser power profiles with competitive or superior melt pool\ntemperature control performance. This demonstrates MPC's proactive control\ncapabilities, leveraging time-series prediction and real-time optimization,\npositioning it as a powerful tool for future Digital Twin applications and\nreal-time process optimization in manufacturing.\n","authors":["Yi-Ping Chen","Vispi Karkaria","Ying-Kuan Tsai","Faith Rolark","Daniel Quispe","Robert X. Gao","Jian Cao","Wei Chen"],"pdf_url":"https://arxiv.org/pdf/2501.07601v1.pdf","comment":null}],"Optimization and Control":[{"id":"http://arxiv.org/abs/2501.06181v1","updated":"2025-01-10T18:58:44Z","published":"2025-01-10T18:58:44Z","title":"Best Response Convergence for Zero-sum Stochastic Dynamic Games with\n Partial and Asymmetric Information","summary":" We analyze best response dynamics for finding a Nash equilibrium of an\ninfinite horizon zero-sum stochastic linear quadratic dynamic game (LQDG) with\npartial and asymmetric information. We derive explicit expressions for each\nplayer's best response within the class of pure linear dynamic output feedback\ncontrol strategies where the internal state dimension of each control strategy\nis an integer multiple of the system state dimension. With each best response,\nthe players form increasingly higher-order belief states, leading to\ninfinite-dimensional internal states. 
However, we observe in extensive\nnumerical experiments that the game's value converges after just a few\niterations, suggesting that strategies associated with increasingly\nhigher-order belief states eventually provide no benefit. To help explain this\nconvergence, our numerical analysis reveals rapid decay of the controllability\nand observability Gramian eigenvalues and Hankel singular values in\nhigher-order belief dynamics, indicating that the higher-order belief dynamics\nbecome increasingly difficult for both players to control and observe.\nConsequently, the higher-order belief dynamics can be closely approximated by\nlow-order belief dynamics with bounded error, and thus feedback strategies with\nlimited internal state dimension can closely approximate a Nash equilibrium.\n","authors":["Yuxiang Guan","Iman Shames","Tyler H. Summers"],"pdf_url":"https://arxiv.org/pdf/2501.06181v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06167v1","updated":"2025-01-10T18:46:28Z","published":"2025-01-10T18:46:28Z","title":"Meta-Learning for Physically-Constrained Neural System Identification","summary":" We present a gradient-based meta-learning framework for rapid adaptation of\nneural state-space models (NSSMs) for black-box system identification. When\napplicable, we also incorporate domain-specific physical constraints to improve\nthe accuracy of the NSSM. The major benefit of our approach is that instead of\nrelying solely on data from a single target system, our framework utilizes data\nfrom a diverse set of source systems, enabling learning from limited target\ndata, as well as with few online training iterations. Through benchmark\nexamples, we demonstrate the potential of our approach, study the effect of\nfine-tuning subnetworks rather than full fine-tuning, and report real-world\ncase studies to illustrate the practical application and generalizability of\nthe approach to practical problems with physical-constraints. 
Specifically, we\nshow that the meta-learned models result in improved downstream performance in\nmodel-based state estimation in indoor localization and energy systems.\n","authors":["Ankush Chakrabarty","Gordon Wichern","Vedang M. Deshpande","Abraham P. Vinod","Karl Berntorp","Christopher R. Laughman"],"pdf_url":"https://arxiv.org/pdf/2501.06167v1.pdf","comment":"30 pages"},{"id":"http://arxiv.org/abs/2308.15732v2","updated":"2025-01-10T18:41:19Z","published":"2023-08-30T03:16:41Z","title":"On Lie-Bracket Averaging for a Class of Hybrid Dynamical Systems with\n Applications to Model-Free Control and Optimization","summary":" The stability of dynamical systems with oscillatory behaviors and\nwell-defined average vector fields has traditionally been studied using\naveraging theory. These tools have also been applied to hybrid dynamical\nsystems, which combine continuous and discrete dynamics. However, most\naveraging results for hybrid systems are limited to first-order methods,\nhindering their use in systems and algorithms that require high-order averaging\ntechniques, such as hybrid Lie-bracket-based extremum seeking algorithms and\nhybrid vibrational controllers. To address this limitation, we introduce a\nnovel high-order averaging theorem for analyzing the stability of hybrid\ndynamical systems with high-frequency periodic flow maps. These systems\nincorporate set-valued flow maps and jump maps, effectively modeling well-posed\ndifferential and difference inclusions. By imposing appropriate regularity\nconditions, we establish results on $(T,\\varepsilon)$-closeness of solutions\nand semi-global practical asymptotic stability for sets. These theoretical\nresults are then applied to the study of three distinct applications in the\ncontext of hybrid model-free control and optimization via Lie-bracket\naveraging.\n","authors":["Mahmoud Abdelgalil","Jorge I. 
Poveda"],"pdf_url":"https://arxiv.org/pdf/2308.15732v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06118v1","updated":"2025-01-10T17:15:59Z","published":"2025-01-10T17:15:59Z","title":"Nonlinear port-Hamiltonian system identification from input-state-output\n data","summary":" A framework for identifying nonlinear port-Hamiltonian systems using\ninput-state-output data is introduced. The framework utilizes neural networks'\nuniversal approximation capacity to effectively represent complex dynamics in a\nstructured way. We show that using the structure helps to make long-term\npredictions compared to baselines that do not incorporate physics. We also\nexplore different architectures based on MLPs, KANs, and using prior\ninformation. The technique is validated through examples featuring\nnonlinearities in either the skew-symmetric terms, the dissipative terms, or\nthe Hamiltonian.\n","authors":["Karim Cherifi","Achraf El Messaoudi","Hannes Gernandt","Marco Roschkowski"],"pdf_url":"https://arxiv.org/pdf/2501.06118v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14121v3","updated":"2025-01-10T16:59:51Z","published":"2023-10-21T21:39:24Z","title":"Monotone Causality in Opportunistically Stochastic Shortest Path\n Problems","summary":" When traveling through a graph with an accessible deterministic path to a\ntarget, is it ever preferable to resort to stochastic node-to-node transitions\ninstead? And if so, what are the conditions guaranteeing that such a stochastic\noptimal routing policy can be computed efficiently? We aim to answer these\nquestions here by defining a class of Opportunistically Stochastic Shortest\nPath (OSSP) problems and deriving sufficient conditions for applicability of\nnon-iterative label-setting methods. The usefulness of this framework is\ndemonstrated in two very different contexts: numerical analysis and autonomous\nvehicle routing. 
We use OSSPs to derive causality conditions for\nsemi-Lagrangian discretizations of anisotropic Hamilton-Jacobi equations. We\nalso use a Dijkstra-like method to solve OSSPs optimizing the timing and\nurgency of lane change maneuvers for an autonomous vehicle navigating road\nnetworks with a heterogeneous traffic load.\n","authors":["Mallory E. Gaspard","Alexander Vladimirsky"],"pdf_url":"https://arxiv.org/pdf/2310.14121v3.pdf","comment":"Submitted to and under review for INFORMS Mathematics of Operations\n Research. Revised to address first round feedback from reviewers for this\n journal"},{"id":"http://arxiv.org/abs/2501.06081v1","updated":"2025-01-10T16:15:25Z","published":"2025-01-10T16:15:25Z","title":"Averaged Adam accelerates stochastic optimization in the training of\n deep neural network approximations for partial differential equation and\n optimal control problems","summary":" Deep learning methods - usually consisting of a class of deep neural networks\n(DNNs) trained by a stochastic gradient descent (SGD) optimization method - are\nnowadays omnipresent in data-driven learning problems as well as in scientific\ncomputing tasks such as optimal control (OC) and partial differential equation\n(PDE) problems. In practically relevant learning tasks, often not the\nplain-vanilla standard SGD optimization method is employed to train the\nconsidered class of DNNs but instead more sophisticated adaptive and\naccelerated variants of the standard SGD method such as the popular Adam\noptimizer are used. Inspired by the classical Polyak-Ruppert averaging\napproach, in this work we apply averaged variants of the Adam optimizer to\ntrain DNNs to approximately solve exemplary scientific computing problems in\nthe form of PDEs and OC problems. 
We test the averaged variants of Adam in a\nseries of learning problems including physics-informed neural network (PINN),\ndeep backward stochastic differential equation (deep BSDE), and deep Kolmogorov\napproximations for PDEs (such as heat, Black-Scholes, Burgers, and Allen-Cahn\nPDEs), including DNN approximations for OC problems, and including DNN\napproximations for image classification problems (ResNet for CIFAR-10). In each\nof the numerical examples the employed averaged variants of Adam outperform the\nstandard Adam and the standard SGD optimizers, particularly, in the situation\nof the scientific machine learning problems. The Python source codes for the\nnumerical experiments associated to this work can be found on GitHub at\nhttps://github.com/deeplearningmethods/averaged-adam.\n","authors":["Steffen Dereich","Arnulf Jentzen","Adrian Riekert"],"pdf_url":"https://arxiv.org/pdf/2501.06081v1.pdf","comment":"25 pages, 10 figures"},{"id":"http://arxiv.org/abs/2501.06079v1","updated":"2025-01-10T16:15:02Z","published":"2025-01-10T16:15:02Z","title":"Set-valued evenly convex functions: characterizations and c-conjugacy","summary":" In this work we deal with set-valued functions with values in the power set\nof a separated locally convex space where a nontrivial pointed convex cone\ninduces a partial order relation. A set-valued function is evenly convex if its\nepigraph is an evenly convex set, i.e., it is the intersection of an arbitrary\nfamily of open half-spaces. In this paper we characterize evenly convex\nset-valued functions as the pointwise supremum of its set-valued e-affine\nminorants. Moreover, a suitable conjugation pattern will be developed for these\nfunctions, as well as the counterpart of the biconjugation Fenchel-Moreau\ntheorem.\n","authors":["M. D. 
Fajardo"],"pdf_url":"https://arxiv.org/pdf/2501.06079v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06052v1","updated":"2025-01-10T15:32:35Z","published":"2025-01-10T15:32:35Z","title":"Rank conditions for exactness of semidefinite relaxations in polynomial\n optimization","summary":" We consider the Moment-SOS hierarchy in polynomial optimization. We first\nprovide a sufficient condition to solve the truncated K-moment problem\nassociated with a given degree-$2n$ pseudo-moment sequence $\\phi^n$ and a\nsemi-algebraic set $K \\subset \\mathbb{R}^d$. Namely, let $2v$ be the maximum\ndegree of the polynomials that describe $K$. If the rank $r$ of its associated\nmoment matrix is less than $nv + 1$, then $\\phi^n$ has an atomic representing\nmeasure supported on at most $r$ points of $K$. When used at step-$n$ of the\nMoment-SOS hierarchy, it provides a sufficient condition to guarantee its\nfinite convergence (i.e., the optimal value of the corresponding degree-$n$\nsemidefinite relaxation of the hierarchy is the global minimum). For Quadratically\nConstrained Quadratic Programs (QCQPs) one may also recover global minimizers\nfrom the optimal pseudo-moment sequence. 
Our condition is in the spirit of\nBlekherman's rank condition and while on the one hand it is more restrictive,\non the other hand it applies to constrained POPs as it provides a localization\non $K$ for the representing measure.\n","authors":["Jean B Lasserre"],"pdf_url":"https://arxiv.org/pdf/2501.06052v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11940v3","updated":"2025-01-10T15:07:43Z","published":"2024-01-22T13:30:11Z","title":"Low-Tubal-Rank Tensor Recovery via Factorized Gradient Descent","summary":" This paper considers the problem of recovering a tensor with an underlying\nlow-tubal-rank structure from a small number of corrupted linear measurements.\nTraditional approaches tackling such a problem require the computation of\ntensor Singular Value Decomposition (t-SVD), which is a computationally\nintensive process, rendering them impractical for dealing with large-scale\ntensors. Aiming to address this challenge, we propose an efficient and effective\nlow-tubal-rank tensor recovery method based on a factorization procedure akin\nto the Burer-Monteiro (BM) method. Precisely, our fundamental approach involves\ndecomposing a large tensor into two smaller factor tensors, followed by solving\nthe problem through factorized gradient descent (FGD). This strategy eliminates\nthe need for t-SVD computation, thereby reducing computational costs and\nstorage requirements. We provide rigorous theoretical analysis to ensure the\nconvergence of FGD under both noise-free and noisy situations. Additionally, it\nis worth noting that our method does not require the precise estimation of the\ntensor tubal-rank. Even in cases where the tubal-rank is slightly\noverestimated, our approach continues to demonstrate robust performance. 
A\nseries of experiments have been carried out to demonstrate that, as compared to\nother popular ones, our approach exhibits superior performance in multiple\nscenarios, in terms of the faster computational speed and the smaller\nconvergence error.\n","authors":["Zhiyu Liu","Zhi Han","Yandong Tang","Xi-Le Zhao","Yao Wang"],"pdf_url":"https://arxiv.org/pdf/2401.11940v3.pdf","comment":"13 pages, 4 figures"},{"id":"http://arxiv.org/abs/2501.06023v1","updated":"2025-01-10T15:01:36Z","published":"2025-01-10T15:01:36Z","title":"Distributed Generalized Nash Equilibria Learning for Online Stochastic\n Aggregative Games","summary":" This paper investigates online stochastic aggregative games subject to local\nset constraints and time-varying coupled inequality constraints, where each\nplayer possesses a time-varying expectation-valued cost function relying on not\nonly its own decision variable but also an aggregation of all the players'\nvariables. Each player can only access its local individual cost function and\nconstraints, necessitating partial information exchanges with neighboring\nplayers through time-varying unbalanced networks. Additionally, local cost\nfunctions and constraint functions are not prior knowledge and only revealed\ngradually. To learn generalized Nash equilibria of such games, a novel\ndistributed online stochastic algorithm is devised based on push-sum and\nprimal-dual strategies. Through rigorous analysis, high probability bounds on\nthe regret and constraint violation are provided by appropriately selecting\ndecreasing stepsizes. Moreover, for a time-invariant stochastic strongly\nmonotone game, it is shown that the generated sequence by the designed\nalgorithm converges to its variational generalized Nash equilibrium (GNE)\nalmost surely, and the time-averaged sequence converges sublinearly with high\nprobability. 
Finally, the derived theoretical results are illustrated by\nnumerical simulations.\n","authors":["Kaixin Du","Min Meng"],"pdf_url":"https://arxiv.org/pdf/2501.06023v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05545v3","updated":"2025-01-10T14:51:06Z","published":"2024-12-07T05:47:28Z","title":"Convergence analysis of wide shallow neural operators within the\n framework of Neural Tangent Kernel","summary":" Neural operators are aiming at approximating operators mapping between Banach\nspaces of functions, achieving much success in the field of scientific\ncomputing. Compared to certain deep learning-based solvers, such as\nPhysics-Informed Neural Networks (PINNs), Deep Ritz Method (DRM), neural\noperators can solve a class of Partial Differential Equations (PDEs). Although\nmuch work has been done to analyze the approximation and generalization error\nof neural operators, there is still a lack of analysis on their training error.\nIn this work, we conduct the convergence analysis of gradient descent for the\nwide shallow neural operators and physics-informed shallow neural operators\nwithin the framework of Neural Tangent Kernel (NTK). The core idea lies on the\nfact that over-parameterization and random initialization together ensure that\neach weight vector remains near its initialization throughout all iterations,\nyielding the linear convergence of gradient descent. 
In this work, we\ndemonstrate that under the setting of over-parametrization, gradient descent\ncan find the global minimum regardless of whether it is in continuous time or\ndiscrete time.\n","authors":["Xianliang Xu","Ye Li","Zhongyi Huang"],"pdf_url":"https://arxiv.org/pdf/2412.05545v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.14488v2","updated":"2025-01-10T13:52:14Z","published":"2024-12-19T03:22:47Z","title":"A stochastic first-order method with multi-extrapolated momentum for\n highly smooth unconstrained optimization","summary":" In this paper, we consider an unconstrained stochastic optimization problem\nwhere the objective function exhibits high-order smoothness. Specifically, we\npropose a new stochastic first-order method (SFOM) with multi-extrapolated\nmomentum, in which multiple extrapolations are performed in each iteration,\nfollowed by a momentum update based on these extrapolations. We demonstrate\nthat the proposed SFOM can accelerate optimization by exploiting the high-order\nsmoothness of the objective function $f$. Assuming that the $p$th-order\nderivative of $f$ is Lipschitz continuous for some $p\\ge2$, and under\nadditional mild assumptions, we establish that our method achieves a sample\ncomplexity of $\\widetilde{\\mathcal{O}}(\\epsilon^{-(3p+1)/p})$ for finding a\npoint $x$ such that $\\mathbb{E}[\\|\\nabla f(x)\\|]\\le\\epsilon$. To the best of\nour knowledge, this is the first SFOM to leverage arbitrary-order smoothness of\nthe objective function for acceleration, resulting in a sample complexity that\nimproves upon the best-known results without assuming the mean-squared\nsmoothness condition. 
Preliminary numerical experiments validate the practical\nperformance of our method and support our theoretical findings.\n","authors":["Chuan He"],"pdf_url":"https://arxiv.org/pdf/2412.14488v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05942v1","updated":"2025-01-10T13:06:36Z","published":"2025-01-10T13:06:36Z","title":"Soft regression trees: a model variant and a decomposition training\n algorithm","summary":" Decision trees are widely used for classification and regression tasks in a\nvariety of application fields due to their interpretability and good accuracy.\nDuring the past decade, growing attention has been devoted to globally\noptimized decision trees with deterministic or soft splitting rules at branch\nnodes, which are trained by optimizing the error function over all the tree\nparameters. In this work, we propose a new variant of soft multivariate\nregression trees (SRTs) where, for every input vector, the prediction is\ndefined as the linear regression associated to a single leaf node, namely, the\nleaf node obtained by routing the input vector from the root along the branches\nwith higher probability. SRTs exhibit the conditional computational property,\ni.e., each prediction depends on a small number of nodes (parameters), and our\nnonlinear optimization formulation for training them is amenable to\ndecomposition. After showing a universal approximation result for SRTs, we\npresent a decomposition training algorithm including a clustering-based\ninitialization procedure and a heuristic for reassigning the input vectors\nalong the tree. Under mild assumptions, we establish asymptotic convergence\nguarantees. 
Experiments on 15 well-known datasets indicate that our SRTs and\ndecomposition algorithm yield higher accuracy and robustness compared with\ntraditional soft regression trees trained using the nonlinear optimization\nformulation of Blanquero et al., and a significant reduction in training times\nas well as a slightly better average accuracy compared with the mixed-integer\noptimization approach of Bertsimas and Dunn. We also report a comparison with\nthe Random Forest ensemble method.\n","authors":["Antonio Consolo","Edoardo Amaldi","Andrea Manno"],"pdf_url":"https://arxiv.org/pdf/2501.05942v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05939v1","updated":"2025-01-10T13:02:50Z","published":"2025-01-10T13:02:50Z","title":"Designing a Robust and Cost-Efficient Electrified Bus Network with\n Sparse Energy Consumption Data","summary":" This paper addresses the challenges of charging infrastructure design (CID)\nfor electrified public transport networks using Battery Electric Buses (BEBs)\nunder conditions of sparse energy consumption data. Accurate energy consumption\nestimation is critical for cost-effective and reliable electrification but\noften requires costly field experiments, resulting in limited data. To address\nthis issue, we propose two mathematical models designed to handle uncertainty\nand data sparsity in energy consumption. The first is a robust optimization\nmodel with box uncertainty, addressing variability in energy consumption. The\nsecond is a data-driven distributionally robust optimization model that\nleverages observed data to provide more flexible and informed solutions. To\nevaluate these models, we apply them to the Rotterdam bus network. Our analysis\nreveals three key insights: (1) Ignoring variations in energy consumption can\nresult in operational unreliability, with up to 55\% of scenarios leading to\ninfeasible trips. 
(2) Designing infrastructure based on worst-case energy\nconsumption increases costs by 67\\% compared to using average estimates. (3)\nThe data-driven distributionally robust optimization model reduces costs by\n28\\% compared to the box uncertainty model while maintaining reliability,\nespecially in scenarios where extreme energy consumption values are rare and\ndata exhibit skewness. In addition to cost savings, this approach provides\nrobust protection against uncertainty, ensuring reliable operation under\ndiverse conditions.\n","authors":["Sara Momen","Yousef Maknoon","Bart van Arem","Shadi Sharif Azadeh"],"pdf_url":"https://arxiv.org/pdf/2501.05939v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05930v1","updated":"2025-01-10T12:52:00Z","published":"2025-01-10T12:52:00Z","title":"Random Sparse Lifts: Construction, Analysis and Convergence of finite\n sparse networks","summary":" We present a framework to define a large class of neural networks for which,\nby construction, training by gradient flow provably reaches arbitrarily low\nloss when the number of parameters grows. Distinct from the fixed-space global\noptimality of non-convex optimization, this new form of convergence, and the\ntechniques introduced to prove such convergence, pave the way for a usable deep\nlearning convergence theory in the near future, without overparameterization\nassumptions relating the number of parameters and training samples. We define\nthese architectures from a simple computation graph and a mechanism to lift it,\nthus increasing the number of parameters, generalizing the idea of increasing\nthe widths of multi-layer perceptrons. 
We show that architectures similar to\nmost common deep learning models are present in this class, obtained by\nsparsifying the weight tensors of usual architectures at initialization.\nLeveraging tools of algebraic topology and random graph theory, we use the\ncomputation graph's geometry to propagate properties guaranteeing convergence\nto any precision for these large sparse models.\n","authors":["David A. R. Robin","Kevin Scaman","Marc Lelarge"],"pdf_url":"https://arxiv.org/pdf/2501.05930v1.pdf","comment":"The Twelfth International Conference on Learning Representations, May\n 2024, Vienna, Austria"},{"id":"http://arxiv.org/abs/2412.09594v2","updated":"2025-01-10T09:40:04Z","published":"2024-12-12T18:58:14Z","title":"Wait-Less Offline Tuning and Re-solving for Online Decision Making","summary":" Online linear programming (OLP) has found broad applications in revenue\nmanagement and resource allocation. State-of-the-art OLP algorithms achieve low\nregret by repeatedly solving linear programming (LP) subproblems that\nincorporate updated resource information. However, LP-based methods are\ncomputationally expensive and often inefficient for large-scale applications.\nIn contrast, recent first-order OLP algorithms are more computationally\nefficient but typically suffer from worse regret guarantees. To address these\nshortcomings, we propose a new algorithm that combines the strengths of\nLP-based and first-order OLP methods. The algorithm re-solves the LP\nsubproblems periodically at a predefined frequency $f$ and uses the latest dual\nprices to guide online decision-making. In addition, a first-order method runs\nin parallel during each interval between LP re-solves, smoothing resource\nconsumption. 
Our algorithm achieves $\\mathscr{O}(\\log (T/f) + \\sqrt{f})$\nregret, delivering a \"wait-less\" online decision-making process that balances\nthe computational efficiency of first-order methods and the superior regret\nguarantee of LP-based methods.\n","authors":["Jingruo Sun","Wenzhi Gao","Ellen Vitercik","Yinyu Ye"],"pdf_url":"https://arxiv.org/pdf/2412.09594v2.pdf","comment":"In this version, we achieve a tighter regret bound with the warm\n start for the first batch. We also make the proof more elegant by manually\n accepting all subsequent orders once the constraint is violated. In this way,\n we do not need to introduce the concept of stopping time for the analysis of\n the LP-based method"},{"id":"http://arxiv.org/abs/2404.08289v2","updated":"2025-01-10T08:50:35Z","published":"2024-04-12T07:27:25Z","title":"Generic controllability of equivariant systems and applications to\n particle systems and neural networks","summary":" There exist many examples of systems which have some symmetries, and which\none may monitor with symmetry preserving controls. Since symmetries are\npreserved along the evolution, full controllability is not possible, and\ncontrollability has to be considered inside sets of states with same\nsymmetries. We prove that generic systems with symmetries are controllable in\nthis sense. This result has several applications, for instance: (i) generic\ncontrollability of particle systems when the kernel of interaction between\nparticles plays the role of a mean-field control; (ii) generic controllability\nfor families of vector fields on manifolds with boundary; (iii) universal\ninterpolation for neural networks architectures with \"generic\" self\nattention-type layers - a type of layers ubiquitous in recent neural networks\narchitectures, e.g., in the Transformers architecture. 
The tools we develop\ncould help address various other questions of control of equivariant systems.\n","authors":["Andrei Agrachev","Cyril Letrouit"],"pdf_url":"https://arxiv.org/pdf/2404.08289v2.pdf","comment":"To appear in Annales de l'Institut Henri Poincar\\'e, Analyse non\n lin\\'eaire"},{"id":"http://arxiv.org/abs/2404.09746v2","updated":"2025-01-10T08:17:37Z","published":"2024-04-15T12:47:23Z","title":"Gradient descent for unbounded convex functions on Hadamard manifolds\n and its applications to scaling problems","summary":" In this paper, we study asymptotic behaviors of continuous-time and\ndiscrete-time gradient flows of a ``lower-unbounded'' convex function $f$ on a\nHadamard manifold $M$, particularly, their convergence properties to the\nboundary $M^{\\infty}$ at infinity of $M$. We establish a duality theorem that\nthe infimum of the gradient-norm $\\|\\nabla f(x)\\|$ of $f$ over $M$ is equal to\nthe supremum of the negative of the recession function $f^{\\infty}$ of $f$ over\nthe boundary $M^{\\infty}$, provided the infimum is positive. Further, the\ninfimum and the supremum are obtained by the limits of the gradient flows of\n$f$. Our results feature convex-optimization ingredients of the moment-weight\ninequality for reductive group actions by Georgoulas, Robbin, and Salamon, and\nare applied to noncommutative optimization by B\\"urgisser et al. FOCS 2019. We\nshow that the gradient descent of the Kempf-Ness function for an unstable orbit\nconverges to a 1-parameter subgroup in the Hilbert-Mumford criterion, and the\nassociated moment-map sequence converges to the minimum-norm point of the\nmoment polytope. We show further refinements for operator scaling -- the\nleft-right action on a matrix tuple $A= (A_1,A_2,\\ldots,A_N)$. We characterize\nthe gradient-flow limit of operator scaling by a vector-space generalization of\nthe classical Dulmage-Mendelsohn decomposition of a bipartite graph. 
Also, for\na special case of $N = 2$, we reveal that this limit determines the Kronecker\ncanonical form of matrix pencils $s A_1+A_2$.\n","authors":["Hiroshi Hirai","Keiya Sakabe"],"pdf_url":"https://arxiv.org/pdf/2404.09746v2.pdf","comment":"The conference version in FOCS 2024"},{"id":"http://arxiv.org/abs/2406.00621v3","updated":"2025-01-10T06:10:19Z","published":"2024-06-02T05:50:41Z","title":"Log-Scale Quantization in Distributed First-Order Methods:\n Gradient-based Learning from Distributed Data","summary":" Decentralized strategies are of interest for learning from large-scale data\nover networks. This paper studies learning over a network of geographically\ndistributed nodes/agents subject to quantization. Each node possesses a private\nlocal cost function, collectively contributing to a global cost function, which\nthe considered methodology aims to minimize. In contrast to many existing\npapers, the information exchange among nodes is log-quantized to address\nlimited network-bandwidth in practical situations. We consider a first-order\ncomputationally efficient distributed optimization algorithm (with no extra\ninner consensus loop) that leverages node-level gradient correction based on\nlocal data and network-level gradient aggregation only over nearby nodes. This\nmethod only requires balanced networks with no need for stochastic weight\ndesign. It can handle log-scale quantized data exchange over possibly\ntime-varying and switching network setups. We study convergence over both\nstructured networks (for example, training over data-centers) and ad-hoc\nmulti-agent networks (for example, training over dynamic robotic networks).\nThrough experimental validation, we show that (i) structured networks generally\nresult in a smaller optimality gap, and (ii) log-scale quantization leads to a\nsmaller optimality gap compared to uniform quantization.\n","authors":["Mohammadreza Doostmohammadian","Muhammad I. Qureshi","Mohammad Hossein Khalesi","Hamid R. 
Rabiee","Usman A. Khan"],"pdf_url":"https://arxiv.org/pdf/2406.00621v3.pdf","comment":"IEEE TASE 2025"},{"id":"http://arxiv.org/abs/2501.05737v1","updated":"2025-01-10T06:06:41Z","published":"2025-01-10T06:06:41Z","title":"Efficient Gradient Tracking Algorithms for Distributed Optimization\n Problems with Inexact Communication","summary":" Distributed optimization problems usually face inexact communication issues\ninduced by communication quantization, differential privacy protection, or\nchannel noise. Most existing algorithms need a two-timescale setting of the\nstepsize of gradient descent and the parameter of noise suppression to ensure\nthe convergence to the optimal solution. In this paper, we propose two\nsingle-timescale algorithms, VRA-DGT and VRA-DSGT, for distributed\ndeterministic and stochastic optimization problems with inexact communication,\nrespectively. VRA-DGT integrates the Variance-Reduced Aggregation (VRA)\nmechanism with the distributed gradient tracking framework, which achieves a\nconvergence rate of $\\mathcal{O}\\left(k^{-1}\\right)$ in the mean-square sense\nwhen the objective function is strongly convex and smooth. For the distributed\nstochastic optimization problem, VRA-DSGT, which introduces a hybrid variance\nreduction technique into VRA-DGT, maintains the convergence rate of\n$\\mathcal{O}\\left(k^{-1}\\right)$ for strongly convex and smooth objective\nfunctions. Simulated experiments on a logistic regression problem with\nreal-world data verify the effectiveness of the proposed algorithms.\n","authors":["Shengchao Zhaoa","Yongchao Liu"],"pdf_url":"https://arxiv.org/pdf/2501.05737v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.07616v2","updated":"2025-01-10T05:44:12Z","published":"2024-08-14T15:31:15Z","title":"Prophet Inequalities: Competing with the Top $\\ell$ Items is Easy","summary":" We explore a prophet inequality problem, where the values of a sequence of\nitems are drawn i.i.d. 
from some distribution, and an online decision maker\nmust select one item irrevocably. We establish that $\\mathrm{CR}_{\\ell}$, the\nworst-case competitive ratio between the expected optimal performance of an\nonline decision maker compared to that of a prophet who uses the average of the\ntop $\\ell$ items, is exactly the solution to an integral equation. This quantity\n$\\mathrm{CR}_{\\ell}$ is larger than $1-e^{-\\ell}$. This implies that the bound\nconverges exponentially fast to $1$ as $\\ell$ grows. In particular, for\n$\\ell=2$, $\\mathrm{CR}_{2} \\approx 0.966$, which is much closer to $1$ than the\nclassical bound of $0.745$ for $\\ell=1$. Additionally, we prove asymptotic\nlower bounds for the competitive ratio of a more general scenario, where the\ndecision maker is permitted to select $k$ items. This subsumes the $k$\nmulti-unit i.i.d. prophet problem and provides the current best asymptotic\nguarantees, as well as enables broader understanding in the more general\nframework. Finally, we prove a tight asymptotic competitive ratio when only\nstatic threshold policies are allowed.\n","authors":["Mathieu Molina","Nicolas Gast","Patrick Loiseau","Vianney Perchet"],"pdf_url":"https://arxiv.org/pdf/2408.07616v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05701v1","updated":"2025-01-10T04:19:38Z","published":"2025-01-10T04:19:38Z","title":"A Two-timescale Primal-dual Algorithm for Decentralized Optimization\n with Compression","summary":" This paper proposes a two-timescale compressed primal-dual (TiCoPD) algorithm\nfor decentralized optimization with improved communication efficiency over\nprior works on primal-dual decentralized optimization. The algorithm is built\nupon the primal-dual optimization framework and utilizes a\nmajorization-minimization procedure. The latter naturally suggests that the agents\nshare a compressed difference term during the iteration. 
Furthermore, the\nTiCoPD algorithm incorporates a fast timescale mirror sequence for agent\nconsensus on nonlinearly compressed terms, together with a slow timescale\nprimal-dual recursion for optimizing the objective function. We show that the\nTiCoPD algorithm converges with a constant step size. It also finds an $O(1/T)$\nstationary solution after $T$ iterations. Numerical experiments on decentralized\ntraining of a neural network validate the efficacy of the TiCoPD algorithm.\n","authors":["Haoming Liu","Chung-Yiu Yau","Hoi-To Wai"],"pdf_url":"https://arxiv.org/pdf/2501.05701v1.pdf","comment":"5 pages, 8 figures"},{"id":"http://arxiv.org/abs/2501.05693v1","updated":"2025-01-10T03:52:29Z","published":"2025-01-10T03:52:29Z","title":"Robust Adaptive Supplementary Control for Damping Weak-Grid SSOs\n Involving IBRs","summary":" Subsynchronous oscillations (SSOs) involving grid-following converters\n(GFLCs) connected to weak grids are a relatively new phenomenon observed in\nmodern power systems. SSOs are further exacerbated when grids become weaker\nbecause lines are disconnected due to maintenance or following faults. Such\nundesirable oscillations have also led to curtailment of inverter-based\nresource (IBR) outputs. In contrast to most literature addressing the issue by\nretuning/redesigning of standard IBR controllers, we propose a robust adaptive\nsupplementary control for damping of such SSOs while keeping standard controls\nunaltered. As a result, uncertainty in system conditions can be handled without\nnegatively impacting the nominal IBR performance. To that end, the adaptive\ncontrol law is derived for a GFLC connected to the grid, where the grid is\nmodeled by Thevenin's equivalent representation with uncertainty and\ndisturbances. The theoretical result provides a dissipativity certificate for the\nclosed-loop error dynamics with sufficient conditions for stability. 
The\neffectiveness of the developed controller is validated with several case\nstudies conducted on a single-GFLC-infinite-bus test system, the IEEE $2$-area\ntest system, wherein some of the synchronous generators are replaced by GFLCs,\nand a modified IEEE $5$-area test system with two GFLCs. The findings\ndemonstrate that under very weak grid conditions, the proposed robust adaptive\ncontrol performs well in stabilizing SSO modes, which a classical\nstate-feedback control method fails to address.\n","authors":["Sina Ameli","Lilan Karunaratne","Nilanjan Ray Chaudhuri","Constantino Lagoa"],"pdf_url":"https://arxiv.org/pdf/2501.05693v1.pdf","comment":"14 pages, 19 figures, 3 tables, IEEE Transactions on Power Systems"},{"id":"http://arxiv.org/abs/2501.05677v1","updated":"2025-01-10T03:01:48Z","published":"2025-01-10T03:01:48Z","title":"Single-Loop Variance-Reduced Stochastic Algorithm for Nonconvex-Concave\n Minimax Optimization","summary":" Nonconvex-concave (NC-C) finite-sum minimax problems have broad applications\nin decentralized optimization and various machine learning tasks. However, the\nnonsmooth nature of NC-C problems makes it challenging to design effective\nvariance reduction techniques. Existing vanilla stochastic algorithms using\nuniform samples for gradient estimation often exhibit slow convergence rates\nand require bounded variance assumptions. In this paper, we develop a novel\nprobabilistic variance reduction updating scheme and propose a single-loop\nalgorithm called the probabilistic variance-reduced smoothed gradient\ndescent-ascent (PVR-SGDA) algorithm. The proposed algorithm achieves an\niteration complexity of $O(\\epsilon^{-4})$, surpassing the best-known rates of\nstochastic algorithms for NC-C minimax problems and matching the performance of\nthe best deterministic algorithms in this context. 
Finally, we demonstrate the\neffectiveness of the proposed algorithm through numerical simulations.\n","authors":["Xia Jiang","Linglingzhi Zhu","Taoli Zheng","Anthony Man-Cho So"],"pdf_url":"https://arxiv.org/pdf/2501.05677v1.pdf","comment":"The conference version of this paper has been accepted by ICASSP 2025"},{"id":"http://arxiv.org/abs/2501.05676v1","updated":"2025-01-10T02:59:56Z","published":"2025-01-10T02:59:56Z","title":"An Efficient Dual ADMM for Huber Regression with Fused Lasso Penalty","summary":" The ordinary least squares estimate in linear regression is sensitive to the\ninfluence of errors with large variance, which reduces its robustness,\nespecially when dealing with heavy-tailed errors or outliers frequently\nencountered in real-world scenarios. To address this issue and accommodate the\nsparsity of coefficients along with their sequential disparities, we combine\nthe adaptive robust Huber loss function with a fused lasso penalty. This\ncombination yields a robust estimator capable of simultaneously achieving\nestimation and variable selection. Furthermore, we utilize an efficient\nalternating direction method of multipliers to solve this regression model from\na dual perspective. The effectiveness and efficiency of our proposed approach\nare demonstrated through numerical experiments carried out on both simulated and\nreal datasets.\n","authors":["Mengjiao Shi","Yunhai Xiao"],"pdf_url":"https://arxiv.org/pdf/2501.05676v1.pdf","comment":"14 pages, 24 figures"},{"id":"http://arxiv.org/abs/2105.04684v4","updated":"2025-01-10T02:50:39Z","published":"2021-05-10T21:46:12Z","title":"An automatic system to detect equivalence between iterative algorithms","summary":" When are two algorithms the same? How can we be sure a recently proposed\nalgorithm is novel, and not a minor twist on an existing method? 
In this paper,\nwe present a framework for reasoning about equivalence between a broad class of\niterative algorithms, with a focus on algorithms designed for convex\noptimization. We propose several notions of what it means for two algorithms to\nbe equivalent, and provide computationally tractable means to detect\nequivalence. Our main definition, oracle equivalence, states that two\nalgorithms are equivalent if they result in the same sequence of calls to the\nfunction oracles (for suitable initialization). Borrowing from control theory,\nwe use state-space realizations to represent algorithms and characterize\nalgorithm equivalence via transfer functions. Our framework can also identify\nand characterize some algorithm transformations including permutations of the\nupdate equations, repetition of the iteration, and conjugation of some of the\nfunction oracles in the algorithm. To support the paper, we have developed a\nsoftware package named Linnaeus that implements the framework to identify other\niterative algorithms that are equivalent to an input algorithm. More broadly,\nthis framework and software advances the goal of making mathematics searchable.\n","authors":["Shipu Zhao","Laurent Lessard","Madeleine Udell"],"pdf_url":"https://arxiv.org/pdf/2105.04684v4.pdf","comment":"This paper documents a software system for identifying equivalence\n between optimization algorithms. The analysis in this paper has been improved\n in arxiv:2501.04972"},{"id":"http://arxiv.org/abs/2303.10503v3","updated":"2025-01-10T01:27:06Z","published":"2023-03-18T21:28:45Z","title":"Counter-examples in first-order optimization: a constructive approach","summary":" While many approaches were developed for obtaining worst-case complexity\nbounds for first-order optimization methods in the last years, there remain\ntheoretical gaps in cases where no such bound can be found. 
In such cases, it\nis often unclear whether no such bound exists (e.g., because the algorithm\nmight fail to systematically converge) or simply if the current techniques do\nnot allow finding them.\n In this work, we propose an approach to automate the search for cyclic\ntrajectories generated by first-order methods. This provides a constructive\napproach to show that no appropriate complexity bound exists, thereby\ncomplementing the approaches providing sufficient conditions for convergence.\nUsing this tool, we provide ranges of parameters for which some of the famous\nheavy-ball, Nesterov accelerated gradient, inexact gradient descent, and\nthree-operator splitting algorithms fail to systematically converge, and show\nthat it nicely complements existing tools searching for Lyapunov functions.\n","authors":["Baptiste Goujaud","Aymeric Dieuleveut","Adrien Taylor"],"pdf_url":"https://arxiv.org/pdf/2303.10503v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05642v1","updated":"2025-01-10T00:59:43Z","published":"2025-01-10T00:59:43Z","title":"FIRM: Federated Image Reconstruction using Multimodal Tomographic Data","summary":" We propose a federated algorithm for reconstructing images using multimodal\ntomographic data sourced from dispersed locations, addressing the challenges of\ntraditional unimodal approaches that are prone to noise and reduced image\nquality. Our approach formulates a joint inverse optimization problem\nincorporating multimodality constraints and solves it in a federated framework\nthrough local gradient computations complemented by lightweight central\noperations, ensuring data decentralization. Leveraging the connection between\nour federated algorithm and the quadratic penalty method, we introduce an\nadaptive step-size rule with guaranteed sublinear convergence and further\nsuggest its extension to augmented Lagrangian framework. 
Numerical results\ndemonstrate its superior computational efficiency and improved image\nreconstruction quality.\n","authors":["Geunyeong Byeon","Minseok Ryu","Zichao Wendy Di","Kibaek Kim"],"pdf_url":"https://arxiv.org/pdf/2501.05642v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.09131v4","updated":"2025-01-10T23:30:13Z","published":"2023-12-14T17:01:58Z","title":"Physics-Informed Neural Network Lyapunov Functions: PDE\n Characterization, Learning, and Verification","summary":" We provide a systematic investigation of using physics-informed neural\nnetworks to compute Lyapunov functions. We encode Lyapunov conditions as a\npartial differential equation (PDE) and use this for training neural network\nLyapunov functions. We analyze the analytical properties of the solutions to\nthe Lyapunov and Zubov PDEs. In particular, we show that employing the Zubov\nequation in training neural Lyapunov functions can lead to approximate regions\nof attraction close to the true domain of attraction. We also examine\napproximation errors and the convergence of neural approximations to the unique\nsolution of Zubov's equation. 
We then provide sufficient conditions for the\nlearned neural Lyapunov functions that can be readily verified by\nsatisfiability modulo theories (SMT) solvers, enabling formal verification of\nboth local stability analysis and region-of-attraction estimates in the large.\nThrough a number of nonlinear examples, ranging from low to high dimensions, we\ndemonstrate that the proposed framework can outperform traditional\nsums-of-squares (SOS) Lyapunov functions obtained using semidefinite\nprogramming (SDP).\n","authors":["Jun Liu","Yiming Meng","Maxwell Fitzsimmons","Ruikun Zhou"],"pdf_url":"https://arxiv.org/pdf/2312.09131v4.pdf","comment":"The current version is accepted to the IFAC Journal Automatica"},{"id":"http://arxiv.org/abs/2407.13868v4","updated":"2025-01-10T21:43:57Z","published":"2024-07-18T19:28:05Z","title":"Stochastic Monotone Inclusion with Closed Loop Distributions","summary":" In this paper, we study in a Hilbertian setting, first and second-order\nmonotone inclusions related to stochastic optimization problems with decision\ndependent distributions. The studied dynamics are formulated as monotone\ninclusions governed by Lipschitz perturbations of maximally monotone operators\nwhere the concept of equilibrium plays a central role. We discuss the\nrelationship between the $\\mathbb{W}_1$-Wasserstein Lipschitz behavior of the\ndistribution and the so-called coarse Ricci curvature. 
As an application, we\nconsider the monotone inclusions associated with stochastic optimisation\nproblems involving the sum of a smooth function with Lipschitz gradient, a\nproximable function and a composite term.\n","authors":["Hamza Ennaji","Jalal Fadili","Hedy Attouch"],"pdf_url":"https://arxiv.org/pdf/2407.13868v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06353v1","updated":"2025-01-10T21:25:16Z","published":"2025-01-10T21:25:16Z","title":"Event Constrained Programming","summary":" In this paper, we present event constraints as a new modeling paradigm that\ngeneralizes joint chance constraints from stochastic optimization to (1)\nenforce a constraint on the probability of satisfying a set of constraints\naggregated via application-specific logic (constituting an event) and (2) to be\napplied to general infinite-dimensional optimization (InfiniteOpt) problems\n(i.e., time, space, and/or uncertainty domains). This new constraint class\noffers significant modeling flexibility in posing InfiniteOpt constraints that\nare enforced over a certain portion of their domain (e.g., to a certain\nprobability level), but can be challenging to reformulate/solve due to\ndifficulties in representing arbitrary logical conditions and specifying a\nprobabilistic measure on a collection of constraints. To address these\nchallenges, we derive a generalized disjunctive programming (GDP)\nrepresentation of event constrained optimization problems, which readily\nenables us to pose logical event conditions in a standard form and allows us to\ndraw from a suite of GDP solution strategies that leverage the special\nstructure of this problem class. We also extend several approximation\ntechniques from the chance constraint literature to provide a means to\nreformulate certain event constraints without the use of binary variables. 
We\nillustrate these findings with case studies in stochastic optimal power flow,\ndynamic disease control, and optimal 2D diffusion.\n","authors":["Daniel Ovalle","Stefan Mazzadi","Carl D. Laird","Ignacio E. Grossmann","Joshua L. Pulsipher"],"pdf_url":"https://arxiv.org/pdf/2501.06353v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06350v1","updated":"2025-01-10T21:23:05Z","published":"2025-01-10T21:23:05Z","title":"SMOP: Stochastic trust region method for multi-objective problems","summary":" The problem considered is a multi-objective optimization problem, in which\nthe goal is to find an optimal value of a vector function representing various\ncriteria. The aim of this work is to develop an algorithm which utilizes the\ntrust region framework with probabilistic model functions, able to cope with\nnoisy problems, using inaccurate functions and gradients. We prove the almost\nsure convergence of the proposed algorithm to a Pareto critical point if the\nmodel functions are good approximations in a probabilistic sense. Numerical\nresults demonstrate the effectiveness of the probabilistic trust region by\ncomparing it to competitive stochastic multi-objective solvers. The application\nin supervised machine learning is showcased by training non-discriminatory\nLogistic Regression models on data groups of different sizes. 
Additionally, we use\nseveral test examples with irregularly shaped fronts to exhibit the efficiency\nof the algorithm.\n","authors":["Nataša Krejić","Nataša Krklec Jerinkić","Luka Rutešić"],"pdf_url":"https://arxiv.org/pdf/2501.06350v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06335v1","updated":"2025-01-10T20:41:11Z","published":"2025-01-10T20:41:11Z","title":"A Comparison of Strategies to Embed Physics-Informed Neural Networks in\n Nonlinear Model Predictive Control Formulations Solved via Direct\n Transcription","summary":" This study aims to benchmark candidate strategies for embedding neural\nnetwork (NN) surrogates in nonlinear model predictive control (NMPC)\nformulations that are subject to systems described with partial differential\nequations and that are solved via direct transcription (i.e., simultaneous\nmethods). This study focuses on the use of physics-informed NNs and\nphysics-informed convolutional NNs as the internal (surrogate) models within\nthe NMPC formulation. One strategy embeds NN models as explicit algebraic\nconstraints, leveraging the automatic differentiation (AD) of an algebraic\nmodelling language (AML) to evaluate the derivatives. Alternatively, the solver\ncan be provided with derivatives computed external to the AML via the AD\nroutines of the machine learning environment the NN is trained in. The three\nnumerical experiments considered in this work reveal that replacing mechanistic\nmodels with NN surrogates may not always offer computational advantages when\nsmooth activation functions are used in conjunction with a local nonlinear\nsolver (e.g., Ipopt), even with highly nonlinear systems. 
Moreover, in this\ncontext, the external function evaluation of the NN surrogates often\noutperforms the embedding strategies that rely on explicit algebraic\nconstraints, likely due to the difficulty in initializing the auxiliary\nvariables and constraints introduced by explicit algebraic reformulations.\n","authors":["Carlos Andrés Elorza Casas","Luis A. Ricardez-Sandoval","Joshua L. Pulsipher"],"pdf_url":"https://arxiv.org/pdf/2501.06335v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06279v1","updated":"2025-01-10T13:57:32Z","published":"2025-01-10T13:57:32Z","title":"Reinforcing Infrastructure Networks with Multicriteria Portfolio\n Decision Analysis: An Application to Railway Stations in Finland","summary":" Advanced societies are crucially dependent on critical infrastructure\nnetworks for the reliable delivery of essential goods and services. Hence,\nwell-founded analyses concerning disruptions are needed to guide decisions that\nseek to ensure the performance of these networks in the face of failures caused\nby vulnerabilities to external hazards or technical malfunctions. In this\nsetting, we develop a multicriteria decision analysis approach to support the\nformulation of cost-efficient portfolios of preventive reinforcement actions.\nOur approach is general in that it (i) allows for multiple objectives, such as\nthose that represent the volume of traffic that is enabled between alternative\norigin-destination pairs in a transportation network, (ii) uses methods of\nprobabilistic risk assessment to quantify the expected performance of the\nnetwork, and (iii) solves optimization problems to identify those combinations\nof reinforcement actions that are cost-efficient in improving the performance\nof the network, given the available, possibly incomplete information about the\nrelative importance of objectives. 
Our methodological contributions are\nillustrated by a case study on the analysis of railway switches at a\nrepresentative Finnish railway station.\n","authors":["Joaquín de la Barra","Ahti Salo","Leevi Olander","Kash Barker","Jussi Kangaspunta"],"pdf_url":"https://arxiv.org/pdf/2501.06279v1.pdf","comment":"32 pages, 7 figures"},{"id":"http://arxiv.org/abs/2501.06275v1","updated":"2025-01-10T10:16:38Z","published":"2025-01-10T10:16:38Z","title":"Exploratory Randomization for Discrete-Time Linear Exponential Quadratic\n Gaussian (LEQG) Problem","summary":" We investigate exploratory randomization for an extended\nlinear-exponential-quadratic-Gaussian (LEQG) control problem in discrete time.\nThis extended control problem is related to the structure of risk-sensitive\ninvestment management applications. We introduce exploration through a\nrandomization of the control. Next, we apply the duality between free energy\nand relative entropy to reduce the LEQG problem to an equivalent risk-neutral\nLQG control problem with an entropy regularization term, see, e.g. Dai Pra et\nal. (1996), for which we present a solution approach based on Dynamic\nProgramming. Our approach, based on the energy-entropy duality may also be\nconsidered as leading to a justification for the use, in the literature, of an\nentropy regularization when applying a randomized control.\n","authors":["Sebastien Lleo","Wolfgang Runggaldier"],"pdf_url":"https://arxiv.org/pdf/2501.06275v1.pdf","comment":null}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2501.06187v1","updated":"2025-01-10T18:59:54Z","published":"2025-01-10T18:59:54Z","title":"Multi-subject Open-set Personalization in Video Generation","summary":" Video personalization methods allow us to synthesize videos with specific\nconcepts such as people, pets, and places. However, existing methods often\nfocus on limited domains, require time-consuming optimization per subject, or\nsupport only a single subject. 
We present Video Alchemist $-$ a video model\nwith built-in multi-subject, open-set personalization capabilities for both\nforeground objects and background, eliminating the need for time-consuming\ntest-time optimization. Our model is built on a new Diffusion Transformer\nmodule that fuses each conditional reference image and its corresponding\nsubject-level text prompt with cross-attention layers. Developing such a large\nmodel presents two main challenges: dataset and evaluation. First, as paired\ndatasets of reference images and videos are extremely hard to collect, we\nsample selected video frames as reference images and synthesize a clip of the\ntarget video. However, while models can easily denoise training videos given\nreference frames, they fail to generalize to new contexts. To mitigate this\nissue, we design a new automatic data construction pipeline with extensive\nimage augmentations. Second, evaluating open-set video personalization is a\nchallenge in itself. To address this, we introduce a personalization benchmark\nthat focuses on accurate subject fidelity and supports diverse personalization\nscenarios. Finally, our extensive experiments show that our method\nsignificantly outperforms existing personalization methods in both quantitative\nand qualitative evaluations.\n","authors":["Tsai-Shien Chen","Aliaksandr Siarohin","Willi Menapace","Yuwei Fang","Kwot Sin Lee","Ivan Skorokhodov","Kfir Aberman","Jun-Yan Zhu","Ming-Hsuan Yang","Sergey Tulyakov"],"pdf_url":"https://arxiv.org/pdf/2501.06187v1.pdf","comment":"Project page:\n https://snap-research.github.io/open-set-video-personalization/"},{"id":"http://arxiv.org/abs/2501.06186v1","updated":"2025-01-10T18:59:51Z","published":"2025-01-10T18:59:51Z","title":"LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs","summary":" Reasoning is a fundamental capability for solving complex multi-step\nproblems, particularly in visual contexts where sequential step-wise\nunderstanding is essential. 
Existing approaches lack a comprehensive framework\nfor evaluating visual reasoning and do not emphasize step-wise problem-solving.\nTo this end, we propose a comprehensive framework for advancing step-by-step\nvisual reasoning in large language models (LMMs) through three key\ncontributions. First, we introduce a visual reasoning benchmark specifically\ndesigned to evaluate multi-step reasoning tasks. The benchmark presents a\ndiverse set of challenges with eight different categories ranging from complex\nvisual perception to scientific reasoning with over 4k reasoning steps in\ntotal, enabling robust evaluation of LLMs' abilities to perform accurate and\ninterpretable visual reasoning across multiple steps. Second, we propose a\nnovel metric that assesses visual reasoning quality at the granularity of\nindividual steps, emphasizing both correctness and logical coherence. The\nproposed metric offers deeper insights into reasoning performance compared to\ntraditional end-task accuracy metrics. Third, we present a new multimodal\nvisual reasoning model, named LlamaV-o1, trained using a multi-step curriculum\nlearning approach, where tasks are progressively organized to facilitate\nincremental skill acquisition and problem-solving. The proposed LlamaV-o1 is\ndesigned for multi-step reasoning and learns step-by-step through a structured\ntraining paradigm. Extensive experiments show that our LlamaV-o1 outperforms\nexisting open-source models and performs favorably against close-source\nproprietary models. Compared to the recent Llava-CoT, our LlamaV-o1 achieves an\naverage score of 67.3 with an absolute gain of 3.8\\% across six benchmarks\nwhile being 5 times faster during inference scaling. 
Our benchmark, model, and\ncode are publicly available.\n","authors":["Omkar Thawakar","Dinura Dissanayake","Ketan More","Ritesh Thawkar","Ahmed Heakl","Noor Ahsan","Yuhao Li","Mohammed Zumri","Jean Lahoud","Rao Muhammad Anwer","Hisham Cholakkal","Ivan Laptev","Mubarak Shah","Fahad Shahbaz Khan","Salman Khan"],"pdf_url":"https://arxiv.org/pdf/2501.06186v1.pdf","comment":"15 pages, 5 Figures"},{"id":"http://arxiv.org/abs/2501.06184v1","updated":"2025-01-10T18:59:42Z","published":"2025-01-10T18:59:42Z","title":"PEACE: Empowering Geologic Map Holistic Understanding with MLLMs","summary":" Geologic map, as a fundamental diagram in geology science, provides critical\ninsights into the structure and composition of Earth's subsurface and surface.\nThese maps are indispensable in various fields, including disaster detection,\nresource exploration, and civil engineering. Despite their significance,\ncurrent Multimodal Large Language Models (MLLMs) often fall short in geologic\nmap understanding. This gap is primarily due to the challenging nature of\ncartographic generalization, which involves handling high-resolution map,\nmanaging multiple associated components, and requiring domain-specific\nknowledge. To quantify this gap, we construct GeoMap-Bench, the first-ever\nbenchmark for evaluating MLLMs in geologic map understanding, which assesses\nthe full-scale abilities in extracting, referring, grounding, reasoning, and\nanalyzing. To bridge this gap, we introduce GeoMap-Agent, the inaugural agent\ndesigned for geologic map understanding, which features three modules:\nHierarchical Information Extraction (HIE), Domain Knowledge Injection (DKI),\nand Prompt-enhanced Question Answering (PEQA). Inspired by the\ninterdisciplinary collaboration among human scientists, an AI expert group acts\nas consultants, utilizing a diverse tool pool to comprehensively analyze\nquestions. 
Through comprehensive experiments, GeoMap-Agent achieves an overall\nscore of 0.811 on GeoMap-Bench, significantly outperforming 0.369 of GPT-4o.\nOur work, emPowering gEologic mAp holistiC undErstanding (PEACE) with MLLMs,\npaves the way for advanced AI applications in geology, enhancing the efficiency\nand accuracy of geological investigations.\n","authors":["Yangyu Huang","Tianyi Gao","Haoran Xu","Qihao Zhao","Yang Song","Zhipeng Gui","Tengchao Lv","Hao Chen","Lei Cui","Scarlett Li","Furu Wei"],"pdf_url":"https://arxiv.org/pdf/2501.06184v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05450v2","updated":"2025-01-10T18:58:11Z","published":"2025-01-09T18:59:56Z","title":"Decentralized Diffusion Models","summary":" Large-scale AI model training divides work across thousands of GPUs, then\nsynchronizes gradients across them at each step. This incurs a significant\nnetwork burden that only centralized, monolithic clusters can support, driving\nup infrastructure costs and straining power systems. We propose Decentralized\nDiffusion Models, a scalable framework for distributing diffusion model\ntraining across independent clusters or datacenters by eliminating the\ndependence on a centralized, high-bandwidth networking fabric. Our method\ntrains a set of expert diffusion models over partitions of the dataset, each in\nfull isolation from one another. At inference time, the experts ensemble\nthrough a lightweight router. We show that the ensemble collectively optimizes\nthe same objective as a single model trained over the whole dataset. This means\nwe can divide the training burden among a number of \"compute islands,\" lowering\ninfrastructure costs and improving resilience to localized GPU failures.\nDecentralized diffusion models empower researchers to take advantage of\nsmaller, more cost-effective and more readily available compute like on-demand\nGPU nodes rather than central integrated systems. 
We conduct extensive\nexperiments on ImageNet and LAION Aesthetics, showing that decentralized\ndiffusion models FLOP-for-FLOP outperform standard diffusion models. We finally\nscale our approach to 24 billion parameters, demonstrating that high-quality\ndiffusion models can now be trained with just eight individual GPU nodes in\nless than a week.\n","authors":["David McAllister","Matthew Tancik","Jiaming Song","Angjoo Kanazawa"],"pdf_url":"https://arxiv.org/pdf/2501.05450v2.pdf","comment":"Project webpage: https://decentralizeddiffusion.github.io/"},{"id":"http://arxiv.org/abs/2501.06173v1","updated":"2025-01-10T18:52:11Z","published":"2025-01-10T18:52:11Z","title":"VideoAuteur: Towards Long Narrative Video Generation","summary":" Recent video generation models have shown promising results in producing\nhigh-quality video clips lasting several seconds. However, these models face\nchallenges in generating long sequences that convey clear and informative\nevents, limiting their ability to support coherent narrations. In this paper,\nwe present a large-scale cooking video dataset designed to advance long-form\nnarrative generation in the cooking domain. We validate the quality of our\nproposed dataset in terms of visual fidelity and textual caption accuracy using\nstate-of-the-art Vision-Language Models (VLMs) and video generation models,\nrespectively. We further introduce a Long Narrative Video Director to enhance\nboth visual and semantic coherence in generated videos and emphasize the role\nof aligning visual embeddings to achieve improved overall video quality. 
Our\nmethod demonstrates substantial improvements in generating visually detailed\nand semantically aligned keyframes, supported by finetuning techniques that\nintegrate text and image embeddings within the video generation process.\nProject page: https://videoauteur.github.io/\n","authors":["Junfei Xiao","Feng Cheng","Lu Qi","Liangke Gui","Jiepeng Cen","Zhibei Ma","Alan Yuille","Lu Jiang"],"pdf_url":"https://arxiv.org/pdf/2501.06173v1.pdf","comment":"Preprint, https://videoauteur.github.io/"},{"id":"http://arxiv.org/abs/2501.06151v1","updated":"2025-01-10T18:24:00Z","published":"2025-01-10T18:24:00Z","title":"PySpatial: A High-Speed Whole Slide Image Pathomics Toolkit","summary":" Whole Slide Image (WSI) analysis plays a crucial role in modern digital\npathology, enabling large-scale feature extraction from tissue samples.\nHowever, traditional feature extraction pipelines based on tools like\nCellProfiler often involve lengthy workflows, requiring WSI segmentation into\npatches, feature extraction at the patch level, and subsequent mapping back to\nthe original WSI. To address these challenges, we present PySpatial, a\nhigh-speed pathomics toolkit specifically designed for WSI-level analysis.\nPySpatial streamlines the conventional pipeline by directly operating on\ncomputational regions of interest, reducing redundant processing steps.\nUtilizing rtree-based spatial indexing and matrix-based computation, PySpatial\nefficiently maps and processes computational regions, significantly\naccelerating feature extraction while maintaining high accuracy. Our\nexperiments on two datasets-Perivascular Epithelioid Cell (PEC) and data from\nthe Kidney Precision Medicine Project (KPMP)-demonstrate substantial\nperformance improvements. For smaller and sparse objects in PEC datasets,\nPySpatial achieves nearly a 10-fold speedup compared to standard CellProfiler\npipelines. 
For larger objects, such as glomeruli and arteries in KPMP datasets,\nPySpatial achieves a 2-fold speedup. These results highlight PySpatial's\npotential to handle large-scale WSI analysis with enhanced efficiency and\naccuracy, paving the way for broader applications in digital pathology.\n","authors":["Yuechen Yang","Yu Wang","Tianyuan Yao","Ruining Deng","Mengmeng Yin","Shilin Zhao","Haichun Yang","Yuankai Huo"],"pdf_url":"https://arxiv.org/pdf/2501.06151v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.02780v2","updated":"2025-01-10T18:14:56Z","published":"2024-09-17T19:07:13Z","title":"Guess What I Think: Streamlined EEG-to-Image Generation with Latent\n Diffusion Models","summary":" Generating images from brain waves is gaining increasing attention due to its\npotential to advance brain-computer interface (BCI) systems by understanding\nhow brain signals encode visual cues. Most of the literature has focused on\nfMRI-to-Image tasks as fMRI is characterized by high spatial resolution.\nHowever, fMRI is an expensive neuroimaging modality and does not allow for\nreal-time BCI. On the other hand, electroencephalography (EEG) is a low-cost,\nnon-invasive, and portable neuroimaging technique, making it an attractive\noption for future real-time applications. Nevertheless, EEG presents inherent\nchallenges due to its low spatial resolution and susceptibility to noise and\nartifacts, which makes generating images from EEG more difficult. In this\npaper, we address these problems with a streamlined framework based on the\nControlNet adapter for conditioning a latent diffusion model (LDM) through EEG\nsignals. 
We conduct experiments and ablation studies on popular benchmarks to\ndemonstrate that the proposed method beats other state-of-the-art models.\nUnlike these methods, which often require extensive preprocessing, pretraining,\ndifferent losses, and captioning models, our approach is efficient and\nstraightforward, requiring only minimal preprocessing and a few components. The\ncode is available at https://github.com/LuigiSigillo/GWIT.\n","authors":["Eleonora Lopez","Luigi Sigillo","Federica Colonnese","Massimo Panella","Danilo Comminiello"],"pdf_url":"https://arxiv.org/pdf/2410.02780v2.pdf","comment":"Accepted at ICASSP 2025"},{"id":"http://arxiv.org/abs/2409.11456v2","updated":"2025-01-10T17:54:39Z","published":"2024-09-17T17:48:12Z","title":"Two Stage Segmentation of Cervical Tumors using PocketNet","summary":" Cervical cancer remains the fourth most common malignancy amongst women\nworldwide.1 Concurrent chemoradiotherapy (CRT) serves as the mainstay\ndefinitive treatment regimen for locally advanced cervical cancers and includes\nexternal beam radiation followed by brachytherapy.2 Integral to radiotherapy\ntreatment planning is the routine contouring of both the target tumor at the\nlevel of the cervix, associated gynecologic anatomy and the adjacent organs at\nrisk (OARs). However, manual contouring of these structures is both time and\nlabor intensive and associated with known interobserver variability that can\nimpact treatment outcomes. While multiple tools have been developed to\nautomatically segment OARs and the high-risk clinical tumor volume (HR-CTV)\nusing computed tomography (CT) images,3,4,5,6 the development of deep\nlearning-based tumor segmentation tools using routine T2-weighted (T2w)\nmagnetic resonance imaging (MRI) addresses an unmet clinical need to improve\nthe routine contouring of both anatomical structures and cervical cancers,\nthereby increasing quality and consistency of radiotherapy planning. 
This work\napplied a novel deep-learning model (PocketNet) to segment the cervix, vagina,\nuterus, and tumor(s) on T2w MRI. The performance of the PocketNet architecture\nwas evaluated, when trained on data via 5-fold cross validation. PocketNet\nachieved a mean Dice-Sorensen similarity coefficient (DSC) exceeding 70% for\ntumor segmentation and 80% for organ segmentation. These results suggest that\nPocketNet is robust to variations in contrast protocols, providing reliable\nsegmentation of the regions of interest.\n","authors":["Awj Twam","Megan Jacobsen","Rachel Glenn","Peng Wei","Jia Sun","Ann Klopp","Aradhana M. Venkatesan","David Fuentes"],"pdf_url":"https://arxiv.org/pdf/2409.11456v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06138v1","updated":"2025-01-10T17:52:47Z","published":"2025-01-10T17:52:47Z","title":"MS-Temba : Multi-Scale Temporal Mamba for Efficient Temporal Action\n Detection","summary":" Action detection in real-world scenarios is particularly challenging due to\ndensely distributed actions in hour-long untrimmed videos. It requires modeling\nboth short- and long-term temporal relationships while handling significant\nintra-class temporal variations. Previous state-of-the-art (SOTA)\nTransformer-based architectures, though effective, are impractical for\nreal-world deployment due to their high parameter count, GPU memory usage, and\nlimited throughput, making them unsuitable for very long videos. In this work,\nwe innovatively adapt the Mamba architecture for action detection and propose\nMulti-scale Temporal Mamba (MS-Temba), comprising two key components: Temporal\nMamba (Temba) Blocks and the Temporal Mamba Fuser. Temba Blocks include the\nTemporal Local Module (TLM) for short-range temporal modeling and the Dilated\nTemporal SSM (DTS) for long-range dependencies. By introducing dilations, a\nnovel concept for Mamba, TLM and DTS capture local and global features at\nmultiple scales. 
The Temba Fuser aggregates these scale-specific features using\nMamba to learn comprehensive multi-scale representations of untrimmed videos.\nMS-Temba is validated on three public datasets, outperforming SOTA methods on\nlong videos and matching prior methods on short videos while using only\none-eighth of the parameters.\n","authors":["Arkaprava Sinha","Monish Soundar Raj","Pu Wang","Ahmed Helmy","Srijan Das"],"pdf_url":"https://arxiv.org/pdf/2501.06138v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.02189v2","updated":"2025-01-10T17:43:10Z","published":"2025-01-04T04:59:33Z","title":"Benchmark Evaluations, Applications, and Challenges of Large Vision\n Language Models: A Survey","summary":" Multimodal Vision Language Models (VLMs) have emerged as a transformative\ntechnology at the intersection of computer vision and natural language\nprocessing, enabling machines to perceive and reason about the world through\nboth visual and textual modalities. For example, models such as CLIP, Claude,\nand GPT-4V demonstrate strong reasoning and understanding abilities on visual\nand textual data and beat classical single modality vision models on zero-shot\nclassification. Despite their rapid advancements in research and growing\npopularity in applications, a comprehensive survey of existing studies on VLMs\nis notably lacking, particularly for researchers aiming to leverage VLMs in\ntheir specific domains. To this end, we provide a systematic overview of VLMs\nin the following aspects: model information of the major VLMs developed over\nthe past five years (2019-2024); the main architectures and training methods of\nthese VLMs; summary and categorization of the popular benchmarks and evaluation\nmetrics of VLMs; the applications of VLMs including embodied agents, robotics,\nand video generation; the challenges and issues faced by current VLMs such as\nhallucination, fairness, and safety. 
Detailed collections including papers and\nmodel repository links are listed in\nhttps://github.com/zli12321/Awesome-VLM-Papers-And-Models.git.\n","authors":["Zongxia Li","Xiyang Wu","Hongyang Du","Huy Nghiem","Guangyao Shi"],"pdf_url":"https://arxiv.org/pdf/2501.02189v2.pdf","comment":"35 pages, 3 figures"},{"id":"http://arxiv.org/abs/2408.11810v2","updated":"2025-01-10T17:29:36Z","published":"2024-08-21T17:56:34Z","title":"Pixel Is Not A Barrier: An Effective Evasion Attack for Pixel-Domain\n Diffusion Models","summary":" Diffusion Models have emerged as powerful generative models for high-quality\nimage synthesis, with many subsequent image editing techniques based on them.\nHowever, the ease of text-based image editing introduces significant risks,\nsuch as malicious editing for scams or intellectual property infringement.\nPrevious works have attempted to safeguard images from diffusion-based editing\nby adding imperceptible perturbations. These methods are costly and\nspecifically target prevalent Latent Diffusion Models (LDMs), while\nPixel-domain Diffusion Models (PDMs) remain largely unexplored and robust\nagainst such attacks. Our work addresses this gap by proposing a novel attack\nframework, AtkPDM. 
AtkPDM is mainly composed of a feature representation\nattacking loss that exploits vulnerabilities in denoising UNets and a latent\noptimization strategy to enhance the naturalness of adversarial images.\nExtensive experiments demonstrate the effectiveness of our approach in\nattacking dominant PDM-based editing methods (e.g., SDEdit) while maintaining\nreasonable fidelity and robustness against common defense methods.\nAdditionally, our framework is extensible to LDMs, achieving comparable\nperformance to existing approaches.\n","authors":["Chun-Yen Shih","Li-Xuan Peng","Jia-Wei Liao","Ernie Chu","Cheng-Fu Chou","Jun-Cheng Chen"],"pdf_url":"https://arxiv.org/pdf/2408.11810v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05580v2","updated":"2025-01-10T17:06:36Z","published":"2024-12-07T08:08:24Z","title":"Self-Supervised Masked Mesh Learning for Unsupervised Anomaly Detection\n on 3D Cortical Surfaces","summary":" Unsupervised anomaly detection in brain imaging is challenging. In this\npaper, we propose a self-supervised masked mesh learning for unsupervised\nanomaly detection in 3D cortical surfaces. Our framework leverages the\nintrinsic geometry of the cortical surface to learn a self-supervised\nrepresentation that captures the underlying structure of the brain. We\nintroduce a masked mesh convolutional neural network (MMN) that learns to\npredict masked regions of the cortical surface. By training the MMN on a large\ndataset of healthy subjects, we learn a representation that captures the normal\nvariation in the cortical surface. We then use this representation to detect\nanomalies in unseen individuals by calculating anomaly scores based on the\nreconstruction error of the MMN. We evaluate our framework by training on\npopulation-scale dataset UKB and HCP-Aging and testing on two datasets of\nAlzheimer's disease patients ADNI and OASIS3. 
Our results show that our\nframework can detect anomalies in cortical thickness, cortical volume, and\ncortical sulcus features, which are known to be sensitive biomarkers for\nAlzheimer's disease. Our proposed framework provides a promising approach for\nunsupervised anomaly detection based on normative variation of cortical\nfeatures.\n","authors":["Hao-Chun Yang","Sicheng Dai","Saige Rutherford","Christian Gaser","Andre F Marquand","Christian F Beckmann","Thomas Wolfers"],"pdf_url":"https://arxiv.org/pdf/2412.05580v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05409v2","updated":"2025-01-10T16:58:29Z","published":"2025-01-09T18:06:45Z","title":"Atlas: A Novel Pathology Foundation Model by Mayo Clinic, Charité, and\n Aignostics","summary":" Recent advances in digital pathology have demonstrated the effectiveness of\nfoundation models across diverse applications. In this report, we present\nAtlas, a novel vision foundation model based on the RudolfV approach. Our model\nwas trained on a dataset comprising 1.2 million histopathology whole slide\nimages, collected from two medical institutions: Mayo Clinic and Charit\\'e -\nUniverst\\\"atsmedizin Berlin. 
Comprehensive evaluations show that Atlas achieves\nstate-of-the-art performance across twenty-one public benchmark datasets, even\nthough it is neither the largest model by parameter count nor by training\ndataset size.\n","authors":["Maximilian Alber","Stephan Tietz","Jonas Dippel","Timo Milbich","Timothée Lesort","Panos Korfiatis","Moritz Krügener","Beatriz Perez Cancer","Neelay Shah","Alexander Möllers","Philipp Seegerer","Alexandra Carpen-Amarie","Kai Standvoss","Gabriel Dernbach","Edwin de Jong","Simon Schallenberg","Andreas Kunft","Helmut Hoffer von Ankershoffen","Gavin Schaeferle","Patrick Duffy","Matt Redlon","Philipp Jurmeister","David Horst","Lukas Ruff","Klaus-Robert Müller","Frederick Klauschen","Andrew Norgan"],"pdf_url":"https://arxiv.org/pdf/2501.05409v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.19635v2","updated":"2025-01-10T16:51:33Z","published":"2023-10-30T15:25:29Z","title":"Improving Medical Visual Representations via Radiology Report Generation","summary":" Vision-language pretraining has been shown to produce high-quality visual\nencoders which transfer efficiently to downstream computer vision tasks.\nContrastive learning approaches have increasingly been adopted for medical\nvision language pretraining (MVLP), yet recent developments in generative AI\noffer new modeling alternatives. This paper introduces RadTex, a CNN-encoder\ntransformer-decoder architecture optimized for radiology. We explore\nbidirectional captioning as an alternative MVLP strategy and demonstrate that\nRadTex's captioning pretraining is competitive with established contrastive\nmethods, achieving a CheXpert macro-AUC of 89.4%. 
Additionally, RadTex's\nlightweight text decoder not only generates clinically relevant radiology\nreports (macro-F1 score of 0.349), but also provides targeted, interactive\nresponses, highlighting the utility of bidirectional captioning in advancing\nmedical image analysis.\n","authors":["Keegan Quigley","Miriam Cha","Josh Barua","Geeticka Chauhan","Seth Berkowitz","Steven Horng","Polina Golland"],"pdf_url":"https://arxiv.org/pdf/2310.19635v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.08168v2","updated":"2025-01-10T16:44:55Z","published":"2024-10-10T17:45:12Z","title":"ZeroComp: Zero-shot Object Compositing from Image Intrinsics via\n Diffusion","summary":" We present ZeroComp, an effective zero-shot 3D object compositing approach\nthat does not require paired composite-scene images during training. Our method\nleverages ControlNet to condition from intrinsic images and combines it with a\nStable Diffusion model to utilize its scene priors, together operating as an\neffective rendering engine. During training, ZeroComp uses intrinsic images\nbased on geometry, albedo, and masked shading, all without the need for paired\nimages of scenes with and without composite objects. Once trained, it\nseamlessly integrates virtual 3D objects into scenes, adjusting shading to\ncreate realistic composites. We developed a high-quality evaluation dataset and\ndemonstrate that ZeroComp outperforms methods using explicit lighting\nestimations and generative techniques in quantitative and human perception\nbenchmarks. 
Additionally, ZeroComp extends to real and outdoor image\ncompositing, even when trained solely on synthetic indoor data, showcasing its\neffectiveness in image compositing.\n","authors":["Zitian Zhang","Frédéric Fortier-Chouinard","Mathieu Garon","Anand Bhattad","Jean-François Lalonde"],"pdf_url":"https://arxiv.org/pdf/2410.08168v2.pdf","comment":"Project page: https://lvsn.github.io/ZeroComp, Code:\n https://github.com/lvsn/ZeroComp"},{"id":"http://arxiv.org/abs/2210.06433v3","updated":"2025-01-10T16:26:43Z","published":"2022-10-12T17:30:12Z","title":"Self-supervised video pretraining yields robust and more human-aligned\n visual representations","summary":" Humans learn powerful representations of objects and scenes by observing how\nthey evolve over time. Yet, outside of specific tasks that require explicit\ntemporal understanding, static image pretraining remains the dominant paradigm\nfor learning visual foundation models. We question this mismatch, and ask\nwhether video pretraining can yield visual representations that bear the\nhallmarks of human perception: generalisation across tasks, robustness to\nperturbations, and consistency with human judgements. To that end we propose a\nnovel procedure for curating videos, and develop a contrastive framework which\nlearns from the complex transformations therein. This simple paradigm for\ndistilling knowledge from videos, called VITO, yields general representations\nthat far outperform prior video pretraining methods on image understanding\ntasks, and image pretraining methods on video understanding tasks. Moreover,\nVITO representations are significantly more robust to natural and synthetic\ndeformations than image-, video-, and adversarially-trained ones. Finally,\nVITO's predictions are strongly aligned with human judgements, surpassing\nmodels that were specifically trained for that purpose. 
Together, these results\nsuggest that video pretraining could be a simple way of learning unified,\nrobust, and human-aligned representations of the visual world.\n","authors":["Nikhil Parthasarathy","S. M. Ali Eslami","João Carreira","Olivier J. Hénaff"],"pdf_url":"https://arxiv.org/pdf/2210.06433v3.pdf","comment":"Accepted to 37th Conference on Neural Information Processing Systems\n (NeurIPS 2023)"},{"id":"http://arxiv.org/abs/2501.05177v2","updated":"2025-01-10T15:44:28Z","published":"2025-01-09T11:52:54Z","title":"FaceMe: Robust Blind Face Restoration with Personal Identification","summary":" Blind face restoration is a highly ill-posed problem due to the lack of\nnecessary context. Although existing methods produce high-quality outputs, they\noften fail to faithfully preserve the individual's identity. In this paper, we\npropose a personalized face restoration method, FaceMe, based on a diffusion\nmodel. Given a single or a few reference images, we use an identity encoder to\nextract identity-related features, which serve as prompts to guide the\ndiffusion model in restoring high-quality and identity-consistent facial\nimages. By simply combining identity-related features, we effectively minimize\nthe impact of identity-irrelevant features during training and support any\nnumber of reference image inputs during inference. Additionally, thanks to the\nrobustness of the identity encoder, synthesized images can be used as reference\nimages during training, and identity changing during inference does not require\nfine-tuning the model. We also propose a pipeline for constructing a reference\nimage training pool that simulates the poses and expressions that may appear in\nreal-world scenarios. 
Experimental results demonstrate that our FaceMe can\nrestore high-quality facial images while maintaining identity consistency,\nachieving excellent performance and robustness.\n","authors":["Siyu Liu","Zheng-Peng Duan","Jia OuYang","Jiayi Fu","Hyunhee Park","Zikun Liu","Chun-Le Guo","Chongyi Li"],"pdf_url":"https://arxiv.org/pdf/2501.05177v2.pdf","comment":"To appear at AAAI 2025"},{"id":"http://arxiv.org/abs/2407.18243v3","updated":"2025-01-10T15:37:27Z","published":"2024-07-25T17:57:48Z","title":"BIV-Priv-Seg: Locating Private Content in Images Taken by People With\n Visual Impairments","summary":" Individuals who are blind or have low vision (BLV) are at a heightened risk\nof sharing private information if they share photographs they have taken. To\nfacilitate developing technologies that can help them preserve privacy, we\nintroduce BIV-Priv-Seg, the first localization dataset originating from people\nwith visual impairments that shows private content. It contains 1,028 images\nwith segmentation annotations for 16 private object categories. We first\ncharacterize BIV-Priv-Seg and then evaluate modern models' performance for\nlocating private content in the dataset. We find modern models struggle most\nwith locating private objects that are not salient, small, and lack text as\nwell as recognizing when private content is absent from an image. 
We facilitate\nfuture extensions by sharing our new dataset with the evaluation server at\nhttps://vizwiz.org/tasks-and-datasets/object-localization.\n","authors":["Yu-Yun Tseng","Tanusree Sharma","Lotus Zhang","Abigale Stangl","Leah Findlater","Yang Wang","Danna Gurari"],"pdf_url":"https://arxiv.org/pdf/2407.18243v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.04103v2","updated":"2025-01-10T15:37:26Z","published":"2024-07-04T18:06:48Z","title":"Advances in Diffusion Models for Image Data Augmentation: A Review of\n Methods, Models, Evaluation Metrics and Future Research Directions","summary":" Image data augmentation constitutes a critical methodology in modern computer\nvision tasks, since it can facilitate towards enhancing the diversity and\nquality of training datasets; thereby, improving the performance and robustness\nof machine learning models in downstream tasks. In parallel, augmentation\napproaches can also be used for editing/modifying a given image in a context-\nand semantics-aware way. Diffusion Models (DMs), which comprise one of the most\nrecent and highly promising classes of methods in the field of generative\nArtificial Intelligence (AI), have emerged as a powerful tool for image data\naugmentation, capable of generating realistic and diverse images by learning\nthe underlying data distribution. The current study realizes a systematic,\ncomprehensive and in-depth review of DM-based approaches for image\naugmentation, covering a wide range of strategies, tasks and applications. In\nparticular, a comprehensive analysis of the fundamental principles, model\narchitectures and training strategies of DMs is initially performed.\nSubsequently, a taxonomy of the relevant image augmentation methods is\nintroduced, focusing on techniques regarding semantic manipulation,\npersonalization and adaptation, and application-specific augmentation tasks.\nThen, performance assessment methodologies and respective evaluation metrics\nare analyzed. 
Finally, current challenges and future research directions in the\nfield are discussed.\n","authors":["Panagiotis Alimisis","Ioannis Mademlis","Panagiotis Radoglou-Grammatikis","Panagiotis Sarigiannidis","Georgios Th. Papadopoulos"],"pdf_url":"https://arxiv.org/pdf/2407.04103v2.pdf","comment":"65 pages, 15 figures"},{"id":"http://arxiv.org/abs/2501.03053v2","updated":"2025-01-10T15:35:07Z","published":"2025-01-06T14:40:45Z","title":"Dr. Tongue: Sign-Oriented Multi-label Detection for Remote Tongue\n Diagnosis","summary":" Tongue diagnosis is a vital tool in Western and Traditional Chinese Medicine,\nproviding key insights into a patient's health by analyzing tongue attributes.\nThe COVID-19 pandemic has heightened the need for accurate remote medical\nassessments, emphasizing the importance of precise tongue attribute recognition\nvia telehealth. To address this, we propose a Sign-Oriented multi-label\nAttributes Detection framework. Our approach begins with an adaptive tongue\nfeature extraction module that standardizes tongue images and mitigates\nenvironmental factors. This is followed by a Sign-oriented Network (SignNet)\nthat identifies specific tongue attributes, emulating the diagnostic process of\nexperienced practitioners and enabling comprehensive health evaluations. To\nvalidate our methodology, we developed an extensive tongue image dataset\nspecifically designed for telemedicine. Unlike existing datasets, ours is\ntailored for remote diagnosis, with a comprehensive set of attribute labels.\nThis dataset will be openly available, providing a valuable resource for\nresearch. 
Initial tests have shown improved accuracy in detecting various\ntongue attributes, highlighting our framework's potential as an essential tool\nfor remote medical assessments.\n","authors":["Yiliang Chen","Steven SC Ho","Cheng Xu","Yao Jie Xie","Wing-Fai Yeung","Shengfeng He","Jing Qin"],"pdf_url":"https://arxiv.org/pdf/2501.03053v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06053v1","updated":"2025-01-10T15:33:37Z","published":"2025-01-10T15:33:37Z","title":"Enhancing, Refining, and Fusing: Towards Robust Multi-Scale and Dense\n Ship Detection","summary":" Synthetic aperture radar (SAR) imaging, celebrated for its high resolution,\nall-weather capability, and day-night operability, is indispensable for\nmaritime applications. However, ship detection in SAR imagery faces significant\nchallenges, including complex backgrounds, densely arranged targets, and large\nscale variations. To address these issues, we propose a novel framework,\nCenter-Aware SAR Ship Detector (CASS-Det), designed for robust multi-scale and\ndensely packed ship detection. CASS-Det integrates three key innovations: (1) a\ncenter enhancement module (CEM) that employs rotational convolution to\nemphasize ship centers, improving localization while suppressing background\ninterference; (2) a neighbor attention module (NAM) that leverages cross-layer\ndependencies to refine ship boundaries in densely populated scenes; and (3) a\ncross-connected feature pyramid network (CC-FPN) that enhances multi-scale\nfeature fusion by integrating shallow and deep features. 
Extensive experiments\non the SSDD, HRSID, and LS-SSDD-v1.0 datasets demonstrate the state-of-the-art\nperformance of CASS-Det, excelling at detecting multi-scale and densely\narranged ships.\n","authors":["Congxia Zhao","Xiongjun Fu","Jian Dong","Shen Cao","Chunyan Zhang"],"pdf_url":"https://arxiv.org/pdf/2501.06053v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06040v1","updated":"2025-01-10T15:18:05Z","published":"2025-01-10T15:18:05Z","title":"MSCViT: A Small-size ViT architecture with Multi-Scale Self-Attention\n Mechanism for Tiny Datasets","summary":" Vision Transformer (ViT) has demonstrated significant potential in various\nvision tasks due to its strong ability in modelling long-range dependencies.\nHowever, such success is largely fueled by training on massive samples. In real\napplications, the large-scale datasets are not always available, and ViT\nperforms worse than Convolutional Neural Networks (CNNs) if it is only trained\non a small-scale dataset (called a tiny dataset), since it requires a large amount of\ntraining data to ensure its representational capacity. In this paper, a\nsmall-size ViT architecture with multi-scale self-attention mechanism and\nconvolution blocks is presented (dubbed MSCViT) to model different scales of\nattention at each layer. First, we introduce wavelet convolution, which\nselectively combines the high-frequency components obtained by frequency\ndivision with our convolution channel to extract local features. Then, a\nlightweight multi-head attention module is developed to reduce the number of\ntokens and computational costs. Finally, the positional encoding (PE) in the\nbackbone is replaced by a local feature extraction module. Compared with the\noriginal ViT, it is parameter-efficient and is particularly suitable for tiny\ndatasets. 
Extensive experiments have been conducted on tiny datasets, in which\nour model achieves an accuracy of 84.68% on CIFAR-100 with 14.0M parameters and\n2.5 GFLOPs, without pre-training on large datasets.\n","authors":["Bowei Zhang","Yi Zhang"],"pdf_url":"https://arxiv.org/pdf/2501.06040v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06039v1","updated":"2025-01-10T15:17:27Z","published":"2025-01-10T15:17:27Z","title":"AI-powered virtual tissues from spatial proteomics for clinical\n diagnostics and biomedical discovery","summary":" Spatial proteomics technologies have transformed our understanding of complex\ntissue architectures by enabling simultaneous analysis of multiple molecular\nmarkers and their spatial organization. The high dimensionality of these data,\nvarying marker combinations across experiments and heterogeneous study designs\npose unique challenges for computational analysis. Here, we present Virtual\nTissues (VirTues), a foundation model framework for biological tissues that\noperates across the molecular, cellular and tissue scale. VirTues introduces\ninnovations in transformer architecture design, including a novel tokenization\nscheme that captures both spatial and marker dimensions, and attention\nmechanisms that scale to high-dimensional multiplex data while maintaining\ninterpretability. Trained on diverse cancer and non-cancer tissue datasets,\nVirTues demonstrates strong generalization capabilities without task-specific\nfine-tuning, enabling cross-study analysis and novel marker integration. 
As a\ngeneralist model, VirTues outperforms existing approaches across clinical\ndiagnostics, biological discovery and patient case retrieval tasks, while\nproviding insights into tissue function and disease mechanisms.\n","authors":["Johann Wenckstern","Eeshaan Jain","Kiril Vasilev","Matteo Pariset","Andreas Wicki","Gabriele Gut","Charlotte Bunne"],"pdf_url":"https://arxiv.org/pdf/2501.06039v1.pdf","comment":"23 pages, 5 figures"},{"id":"http://arxiv.org/abs/2501.06038v1","updated":"2025-01-10T15:17:02Z","published":"2025-01-10T15:17:02Z","title":"A Holistically Point-guided Text Framework for Weakly-Supervised\n Camouflaged Object Detection","summary":" Weakly-Supervised Camouflaged Object Detection (WSCOD) has gained popularity\nfor its promise to train models with weak labels to segment objects that\nvisually blend into their surroundings. Recently, some methods using\nsparsely-annotated supervision have shown promising results through scribbling in\nWSCOD, while point-text supervision remains underexplored. Hence, this paper\nintroduces a novel holistically point-guided text framework for WSCOD by\ndecomposing it into three phases: segment, choose, train. Specifically, we propose\nPoint-guided Candidate Generation (PCG), where the point's foreground serves as\na correction for the text path to explicitly correct and rejuvenate the loss\ndetection object during the mask generation process (SEGMENT). We also\nintroduce a Qualified Candidate Discriminator (QCD) to choose the optimal mask\nfrom a given text prompt using CLIP (CHOOSE), and employ the chosen pseudo mask\nfor training with a self-supervised Vision Transformer (TRAIN). Additionally,\nwe developed a new point-supervised dataset (P2C-COD) and a text-supervised\ndataset (T-COD). 
Comprehensive experiments on four benchmark datasets\ndemonstrate our method outperforms state-of-the-art methods by a large margin,\nand also outperforms some existing fully-supervised camouflaged object\ndetection methods.\n","authors":["Tsui Qin Mok","Shuyong Gao","Haozhe Xing","Miaoyang He","Yan Wang","Wenqiang Zhang"],"pdf_url":"https://arxiv.org/pdf/2501.06038v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06035v1","updated":"2025-01-10T15:13:43Z","published":"2025-01-10T15:13:43Z","title":"Nonisotropic Gaussian Diffusion for Realistic 3D Human Motion Prediction","summary":" Probabilistic human motion prediction aims to forecast multiple possible\nfuture movements from past observations. While current approaches report high\ndiversity and realism, they often generate motions with undetected limb\nstretching and jitter. To address this, we introduce SkeletonDiffusion, a\nlatent diffusion model that embeds an explicit inductive bias on the human body\nwithin its architecture and training. Our model is trained with a novel\nnonisotropic Gaussian diffusion formulation that aligns with the natural\nkinematic structure of the human skeleton. Results show that our approach\noutperforms conventional isotropic alternatives, consistently generating\nrealistic predictions while avoiding artifacts such as limb distortion.\nAdditionally, we identify a limitation in commonly used diversity metrics,\nwhich may inadvertently favor models that produce inconsistent limb lengths\nwithin the same sequence. SkeletonDiffusion sets a new benchmark on three\nreal-world datasets, outperforming various baselines across multiple evaluation\nmetrics. 
Visit our project page:\nhttps://ceveloper.github.io/publications/skeletondiffusion/\n","authors":["Cecilia Curreli","Dominik Muhle","Abhishek Saroha","Zhenzhang Ye","Riccardo Marin","Daniel Cremers"],"pdf_url":"https://arxiv.org/pdf/2501.06035v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06031v1","updated":"2025-01-10T15:07:57Z","published":"2025-01-10T15:07:57Z","title":"Generate, Transduct, Adapt: Iterative Transduction with VLMs","summary":" Transductive zero-shot learning with vision-language models leverages\nimage-image similarities within the dataset to achieve better classification\naccuracy compared to the inductive setting. However, there is little work that\nexplores the structure of the language space in this context. We propose\nGTA-CLIP, a novel technique that incorporates supervision from language models\nfor joint transduction in language and vision spaces. Our approach is iterative\nand consists of three steps: (i) incrementally exploring the attribute space by\nquerying language models, (ii) an attribute-augmented transductive inference\nprocedure, and (iii) fine-tuning the language and vision encoders based on\ninferred labels within the dataset. Through experiments with CLIP encoders, we\ndemonstrate that GTA-CLIP yields an average performance improvement of 8.6%\nand 3.7% across 12 datasets and 3 encoders, over CLIP and transductive CLIP\nrespectively in the zero-shot setting. We also observe similar improvements in\na few-shot setting. 
We present ablation studies that demonstrate the value of\neach step and visualize how the vision and language spaces evolve over\niterations driven by the transductive learning.\n","authors":["Oindrila Saha","Logan Lawrence","Grant Van Horn","Subhransu Maji"],"pdf_url":"https://arxiv.org/pdf/2501.06031v1.pdf","comment":"Code will be released at https://github.com/cvl-umass/GTA-CLIP"},{"id":"http://arxiv.org/abs/2501.06027v1","updated":"2025-01-10T15:04:23Z","published":"2025-01-10T15:04:23Z","title":"Geometric-Based Nail Segmentation for Clinical Measurements","summary":" A robust segmentation method that can be used to perform measurements on\ntoenails is presented. The proposed method is used as the first step in a\nclinical trial to objectively quantify the incidence of a particular pathology.\nFor such an assessment, it is necessary to distinguish a nail, which locally\nappears to be similar to the skin. Many algorithms have been used, each of\nwhich leverages different aspects of toenail appearance. We used the Hough\ntransform to locate the tip of the toe and estimate the nail location and size.\nSubsequently, we classified the super-pixels of the image based on their\ngeometric and photometric information. Thereafter, the watershed transform\ndelineated the border of the nail. The method was validated using a 348-image\nmedical dataset, achieving an accuracy of 0.993 and an F-measure of 0.925. 
The\nproposed method is considerably robust across samples, with respect to factors\nsuch as nail shape, skin pigmentation, illumination conditions, and appearance\nof large regions affected by a medical condition.\n","authors":["Bernat Galmés","Gabriel Moyà-Alcover","Pedro Bibiloni","Javier Varona","Antoni Jaume-i-Capó"],"pdf_url":"https://arxiv.org/pdf/2501.06027v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04794v2","updated":"2025-01-10T14:59:31Z","published":"2025-01-08T19:18:44Z","title":"A Steerable Deep Network for Model-Free Diffusion MRI Registration","summary":" Nonrigid registration is vital to medical image analysis but remains\nchallenging for diffusion MRI (dMRI) due to its high-dimensional,\norientation-dependent nature. While classical methods are accurate, they are\ncomputationally demanding, and deep neural networks, though efficient, have\nbeen underexplored for nonrigid dMRI registration compared to structural\nimaging. We present a novel, deep learning framework for model-free, nonrigid\nregistration of raw diffusion MRI data that does not require explicit\nreorientation. Unlike previous methods relying on derived representations such\nas diffusion tensors or fiber orientation distribution functions, in our\napproach, we formulate the registration as an equivariant diffeomorphism of\nposition-and-orientation space. Central to our method is an\n$\mathsf{SE}(3)$-equivariant UNet that generates velocity fields while\npreserving the geometric properties of a raw dMRI's domain. We introduce a new\nloss function based on the maximum mean discrepancy in Fourier space,\nimplicitly matching ensemble average propagators across images. Experimental\nresults on Human Connectome Project dMRI data demonstrate competitive\nperformance compared to state-of-the-art approaches, with the added advantage\nof bypassing the overhead for estimating derived representations. 
This work\nestablishes a foundation for data-driven, geometry-aware dMRI registration\ndirectly in the acquisition space.\n","authors":["Gianfranco Cortes","Xiaoda Qu","Baba C. Vemuri"],"pdf_url":"https://arxiv.org/pdf/2501.04794v2.pdf","comment":"Coauthor was inadvertently left out. This is now corrected"},{"id":"http://arxiv.org/abs/2501.06019v1","updated":"2025-01-10T14:57:18Z","published":"2025-01-10T14:57:18Z","title":"BRIGHT: A globally distributed multimodal building damage assessment\n dataset with very-high-resolution for all-weather disaster response","summary":" Disaster events occur around the world and cause significant damage to human\nlife and property. Earth observation (EO) data enables rapid and comprehensive\nbuilding damage assessment (BDA), an essential capability in the aftermath of a\ndisaster to reduce human casualties and to inform disaster relief efforts.\nRecent research focuses on the development of AI models to achieve accurate\nmapping of unseen disaster events, mostly using optical EO data. However,\nsolutions based on optical data are limited to clear skies and daylight hours,\npreventing a prompt response to disasters. Integrating multimodal (MM) EO data,\nparticularly the combination of optical and SAR imagery, makes it possible to\nprovide all-weather, day-and-night disaster responses. Despite this potential,\nthe development of robust multimodal AI models has been constrained by the lack\nof suitable benchmark datasets. In this paper, we present a BDA dataset using\nveRy-hIGH-resoluTion optical and SAR imagery (BRIGHT) to support AI-based\nall-weather disaster response. To the best of our knowledge, BRIGHT is the\nfirst open-access, globally distributed, event-diverse MM dataset specifically\ncurated to support AI-based disaster response. 
It covers five types of natural\ndisasters and two types of man-made disasters across 12 regions worldwide, with\na particular focus on developing countries where external assistance is most\nneeded. The optical and SAR imagery in BRIGHT, with a spatial resolution\nbetween 0.3-1 meters, provides detailed representations of individual\nbuildings, making it ideal for precise BDA. In our experiments, we have tested\nseven advanced AI models trained with our BRIGHT to validate the\ntransferability and robustness. The dataset and code are available at\nhttps://github.com/ChenHongruixuan/BRIGHT. BRIGHT also serves as the official\ndataset for the 2025 IEEE GRSS Data Fusion Contest.\n","authors":["Hongruixuan Chen","Jian Song","Olivier Dietrich","Clifford Broni-Bediako","Weihao Xuan","Junjue Wang","Xinlei Shao","Yimin Wei","Junshi Xia","Cuiling Lan","Konrad Schindler","Naoto Yokoya"],"pdf_url":"https://arxiv.org/pdf/2501.06019v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06014v1","updated":"2025-01-10T14:50:00Z","published":"2025-01-10T14:50:00Z","title":"Pose-independent 3D Anthropometry from Sparse Data","summary":" 3D digital anthropometry is the study of estimating human body measurements\nfrom 3D scans. Precise body measurements are important health indicators in the\nmedical industry, and guiding factors in the fashion, ergonomic and\nentertainment industries. The measuring protocol consists of scanning the whole\nsubject in the static A-pose, which is maintained without breathing or movement\nduring the scanning process. However, the A-pose is not easy to maintain during\nthe whole scanning process, which can last even up to a couple of minutes. This\nconstraint affects the final quality of the scan, which in turn affects the\naccuracy of the estimated body measurements obtained from methods that rely on\ndense geometric data. 
Additionally, this constraint makes it impossible to\ndevelop a digital anthropometry method for subjects unable to assume the\nA-pose, such as those with injuries or disabilities. We propose a method that\ncan obtain body measurements from sparse landmarks acquired in any pose. We\nmake use of the sparse landmarks of the posed subject to create\npose-independent features, and train a network to predict the body measurements\nas taken from the standard A-pose. We show that our method achieves comparable\nresults to competing methods that use dense geometry in the standard A-pose,\nbut has the capability of estimating the body measurements from any pose using\nsparse landmarks only. Finally, we address the lack of open-source 3D\nanthropometry methods by making our method available to the research community\nat https://github.com/DavidBoja/pose-independent-anthropometry.\n","authors":["David Bojanić","Stefanie Wuhrer","Tomislav Petković","Tomislav Pribanić"],"pdf_url":"https://arxiv.org/pdf/2501.06014v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.16745v2","updated":"2025-01-10T14:40:49Z","published":"2024-12-21T19:41:10Z","title":"ViM-Disparity: Bridging the Gap of Speed, Accuracy and Memory for\n Disparity Map Generation","summary":" In this work we propose a Visual Mamba (ViM) based architecture, to dissolve\nthe existing trade-off for real-time and accurate model with low computation\noverhead for disparity map generation (DMG). Moreover, we proposed a\nperformance measure that can jointly evaluate the inference speed, computation\noverhead and the accurateness of a DMG model. 
The code implementation and\ncorresponding models are available at: https://github.com/MBora/ViM-Disparity.\n","authors":["Maheswar Bora","Tushar Anand","Saurabh Atreya","Aritra Mukherjee","Abhijit Das"],"pdf_url":"https://arxiv.org/pdf/2412.16745v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06006v1","updated":"2025-01-10T14:37:32Z","published":"2025-01-10T14:37:32Z","title":"CamCtrl3D: Single-Image Scene Exploration with Precise 3D Camera Control","summary":" We propose a method for generating fly-through videos of a scene, from a\nsingle image and a given camera trajectory. We build upon an image-to-video\nlatent diffusion model. We condition its UNet denoiser on the camera\ntrajectory, using four techniques. (1) We condition the UNet's temporal blocks\non raw camera extrinsics, similar to MotionCtrl. (2) We use images containing\ncamera rays and directions, similar to CameraCtrl. (3) We reproject the initial\nimage to subsequent frames and use the resulting video as a condition. (4) We\nuse 2D<=>3D transformers to introduce a global 3D representation, which\nimplicitly conditions on the camera poses. We combine all conditions in a\nControlNet-style architecture. We then propose a metric that evaluates overall\nvideo quality and the ability to preserve details with view changes, which we\nuse to analyze the trade-offs of individual and combined conditions. Finally,\nwe identify an optimal combination of conditions. We calibrate camera positions\nin our datasets for scale consistency across scenes, and we train our scene\nexploration model, CamCtrl3D, demonstrating state-of-the-art results.\n","authors":["Stefan Popov","Amit Raj","Michael Krainin","Yuanzhen Li","William T. 
Freeman","Michael Rubinstein"],"pdf_url":"https://arxiv.org/pdf/2501.06006v1.pdf","comment":"To be published in 3DV 2025"},{"id":"http://arxiv.org/abs/2501.06004v1","updated":"2025-01-10T14:35:16Z","published":"2025-01-10T14:35:16Z","title":"SeMi: When Imbalanced Semi-Supervised Learning Meets Mining Hard\n Examples","summary":" Semi-Supervised Learning (SSL) can leverage abundant unlabeled data to boost\nmodel performance. However, the class-imbalanced data distribution in\nreal-world scenarios poses great challenges to SSL, resulting in performance\ndegradation. Existing class-imbalanced semi-supervised learning (CISSL) methods\nmainly focus on rebalancing datasets but ignore the potential of using hard\nexamples to enhance performance, making it difficult to fully harness the power\nof unlabeled data even with sophisticated algorithms. To address this issue, we\npropose a method that enhances the performance of Imbalanced Semi-Supervised\nLearning by Mining Hard Examples (SeMi). This method distinguishes the entropy\ndifferences among logits of hard and easy examples, thereby identifying hard\nexamples and increasing the utility of unlabeled data, better addressing the\nimbalance problem in CISSL. In addition, we maintain a class-balanced memory\nbank with confidence decay for storing high-confidence embeddings to enhance\nthe pseudo-labels' reliability. Although our method is simple, it is effective\nand seamlessly integrates with existing approaches. 
We perform comprehensive\nexperiments on standard CISSL benchmarks and experimentally demonstrate that\nour proposed SeMi outperforms existing state-of-the-art methods on multiple\nbenchmarks, especially in reversed scenarios, where our best result shows\napproximately a 54.8\\% improvement over the baseline methods.\n","authors":["Yin Wang","Zixuan Wang","Hao Lu","Zhen Qin","Hailiang Zhao","Guanjie Cheng","Ge Su","Li Kuang","Mengchu Zhou","Shuiguang Deng"],"pdf_url":"https://arxiv.org/pdf/2501.06004v1.pdf","comment":"11 pages,6 figures, conference"},{"id":"http://arxiv.org/abs/2501.06000v1","updated":"2025-01-10T14:32:20Z","published":"2025-01-10T14:32:20Z","title":"Self-Supervised Partial Cycle-Consistency for Multi-View Matching","summary":" Matching objects across partially overlapping camera views is crucial in\nmulti-camera systems and requires a view-invariant feature extraction network.\nTraining such a network with cycle-consistency circumvents the need for\nlabor-intensive labeling. In this paper, we extend the mathematical formulation\nof cycle-consistency to handle partial overlap. We then introduce a pseudo-mask\nwhich directs the training loss to take partial overlap into account. We\nadditionally present several new cycle variants that complement each other and\npresent a time-divergent scene sampling scheme that improves the data input for\nthis self-supervised setting. Cross-camera matching experiments on the\nchallenging DIVOTrack dataset show the merits of our approach. Compared to the\nself-supervised state-of-the-art, we achieve a 4.3 percentage point higher F1\nscore with our combined contributions. Our improvements are robust to reduced\noverlap in the training data, with substantial improvements in challenging\nscenes that need to make few matches between many people. 
Self-supervised\nfeature networks trained with our method are effective at matching objects in a\nrange of multi-camera settings, providing opportunities for complex tasks like\nlarge-scale multi-camera scene understanding.\n","authors":["Fedor Taggenbrock","Gertjan Burghouts","Ronald Poppe"],"pdf_url":"https://arxiv.org/pdf/2501.06000v1.pdf","comment":"Accepted to VISAPP 2025"},{"id":"http://arxiv.org/abs/2501.05997v1","updated":"2025-01-10T14:29:03Z","published":"2025-01-10T14:29:03Z","title":"Minimizing Occlusion Effect on Multi-View Camera Perception in BEV with\n Multi-Sensor Fusion","summary":" Autonomous driving technology is rapidly evolving, offering the potential for\nsafer and more efficient transportation. However, the performance of these\nsystems can be significantly compromised by the occlusion on sensors due to\nenvironmental factors like dirt, dust, rain, and fog. These occlusions severely\naffect vision-based tasks such as object detection, vehicle segmentation, and\nlane recognition. In this paper, we investigate the impact of various kinds of\nocclusions on camera sensor by projecting their effects from multi-view camera\nimages of the nuScenes dataset into the Bird's-Eye View (BEV) domain. This\napproach allows us to analyze how occlusions spatially distribute and influence\nvehicle segmentation accuracy within the BEV domain. Despite significant\nadvances in sensor technology and multi-sensor fusion, a gap remains in the\nexisting literature regarding the specific effects of camera occlusions on\nBEV-based perception systems. To address this gap, we use a multi-sensor fusion\ntechnique that integrates LiDAR and radar sensor data to mitigate the\nperformance degradation caused by occluded cameras. 
Our findings demonstrate\nthat this approach significantly enhances the accuracy and robustness of\nvehicle segmentation tasks, leading to more reliable autonomous driving\nsystems.\n","authors":["Sanjay Kumar","Hiep Truong","Sushil Sharma","Ganesh Sistu","Tony Scanlan","Eoin Grua","Ciarán Eising"],"pdf_url":"https://arxiv.org/pdf/2501.05997v1.pdf","comment":"Accepted form publishing at the Electronic Imaging - Autonomous\n Vehicles and Machines Conference"},{"id":"http://arxiv.org/abs/2501.05991v1","updated":"2025-01-10T14:25:01Z","published":"2025-01-10T14:25:01Z","title":"An Attention-Guided Deep Learning Approach for Classifying 39 Skin\n Lesion Types","summary":" The skin, as the largest organ of the human body, is vulnerable to a diverse\narray of conditions collectively known as skin lesions, which encompass various\ndermatoses. Diagnosing these lesions presents significant challenges for\nmedical practitioners due to the subtle visual differences that are often\nimperceptible to the naked eye. While not all skin lesions are\nlife-threatening, certain types can act as early indicators of severe diseases,\nincluding skin cancers, underscoring the critical need for timely and accurate\ndiagnostic methods. Deep learning algorithms have demonstrated remarkable\npotential in facilitating the early detection and prognosis of skin lesions.\nThis study advances the field by curating a comprehensive and diverse dataset\ncomprising 39 categories of skin lesions, synthesized from five publicly\navailable datasets. Using this dataset, the performance of five\nstate-of-the-art deep learning models -- MobileNetV2, Xception, InceptionV3,\nEfficientNetB1, and Vision Transformer - is rigorously evaluated. To enhance\nthe accuracy and robustness of these models, attention mechanisms such as the\nEfficient Channel Attention (ECA) and the Convolutional Block Attention Module\n(CBAM) are incorporated into their architectures. 
Comprehensive evaluation\nacross multiple performance metrics reveals that the Vision Transformer model\nintegrated with CBAM outperforms others, achieving an accuracy of 93.46%,\nprecision of 94%, recall of 93%, F1-score of 93%, and specificity of 93.67%.\nThese results underscore the significant potential of the proposed system in\nsupporting medical professionals with accurate and efficient prognostic tools\nfor diagnosing a broad spectrum of skin lesions. The dataset and code used in\nthis study can be found at\nhttps://github.com/akabircs/Skin-Lesions-Classification.\n","authors":["Sauda Adiv Hanum","Ashim Dey","Muhammad Ashad Kabir"],"pdf_url":"https://arxiv.org/pdf/2501.05991v1.pdf","comment":"26 pages"},{"id":"http://arxiv.org/abs/2302.10798v5","updated":"2025-01-10T13:43:48Z","published":"2023-02-17T09:37:17Z","title":"Learning a Consensus Sub-Network with Polarization Regularization and\n One Pass Training","summary":" The subject of green AI has been gaining attention within the deep learning\ncommunity given the recent trend of ever larger and more complex neural network\nmodels. Existing solutions for reducing the computational load of training at\ninference time usually involve pruning the network parameters. Pruning schemes\noften create extra overhead either by iterative training and fine-tuning for\nstatic pruning or repeated computation of a dynamic pruning graph. We propose a\nnew parameter pruning strategy for learning a lighter-weight sub-network that\nminimizes the energy cost while maintaining comparable performance to the fully\nparameterised network on given downstream tasks. Our proposed pruning scheme is\ngreen-oriented, as it only requires a one-off training to discover the optimal\nstatic sub-networks by dynamic pruning methods. The pruning scheme consists of\na binary gating module and a polarizing loss function to uncover sub-networks\nwith user-defined sparsity. 
Our method enables pruning and training\nsimultaneously, which saves energy in both the training and inference phases\nand avoids extra computational overhead from gating modules at inference time.\nOur results on CIFAR-10, CIFAR-100, and Tiny Imagenet suggest that our scheme\ncan remove 50% of connections in deep networks with <1% reduction in\nclassification accuracy. Compared to other related pruning methods, our method\ndemonstrates a lower drop in accuracy for equivalent reductions in\ncomputational cost.\n","authors":["Xiaoying Zhi","Varun Babbar","Rundong Liu","Pheobe Sun","Fran Silavong","Ruibo Shi","Sean Moran"],"pdf_url":"https://arxiv.org/pdf/2302.10798v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05961v1","updated":"2025-01-10T13:41:10Z","published":"2025-01-10T13:41:10Z","title":"Swin-X2S: Reconstructing 3D Shape from 2D Biplanar X-ray with Swin\n Transformers","summary":" The conversion from 2D X-ray to 3D shape holds significant potential for\nimproving diagnostic efficiency and safety. However, existing reconstruction\nmethods often rely on hand-crafted features, manual intervention, and prior\nknowledge, resulting in unstable shape errors and additional processing costs.\nIn this paper, we introduce Swin-X2S, an end-to-end deep learning method for\ndirectly reconstructing 3D segmentation and labeling from 2D biplanar\northogonal X-ray images. Swin-X2S employs an encoder-decoder architecture: the\nencoder leverages 2D Swin Transformer for X-ray information extraction, while\nthe decoder employs 3D convolution with cross-attention to integrate structural\nfeatures from orthogonal views. A dimension-expanding module is introduced to\nbridge the encoder and decoder, ensuring a smooth conversion from 2D pixels to\n3D voxels. 
We evaluate the proposed method through extensive qualitative and\nquantitative experiments across nine publicly available datasets covering four\nanatomies (femur, hip, spine, and rib), with a total of 54 categories.\nSignificant improvements over previous methods have been observed not only in\nthe segmentation and labeling metrics but also in the clinically relevant\nparameters that are of primary concern in practical applications, which\ndemonstrates the promise of Swin-X2S to provide an effective option for\nanatomical shape reconstruction in clinical scenarios. Code implementation is\navailable at: \\url{https://github.com/liukuan5625/Swin-X2S}.\n","authors":["Kuan Liu","Zongyuan Ying","Jie Jin","Dongyan Li","Ping Huang","Wenjian Wu","Zhe Chen","Jin Qi","Yong Lu","Lianfu Deng","Bo Chen"],"pdf_url":"https://arxiv.org/pdf/2501.05961v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05952v1","updated":"2025-01-10T13:27:04Z","published":"2025-01-10T13:27:04Z","title":"Scalable Vision Language Model Training via High Quality Data Curation","summary":" In this paper, we introduce SAIL-VL (ScAlable Vision Language Model TraIning\nvia High QuaLity Data Curation), an open-source vision language model (VLM) of\nstate-of-the-art (SOTA) performance with 2B parameters. We introduce three key\nimprovements that contribute to SAIL-VL's leading performance: (1) Scalable\nhigh-quality visual understanding data construction: We implement a visual\nunderstanding data construction pipeline, which enables hundred-million-scale\nhigh-quality recaption data annotation. Equipped with this pipeline, we curate\nSAIL-Caption, a large-scale caption dataset with large quantity and the highest\ndata quality compared with open-source caption datasets. 
(2) Scalable\nPretraining with High-Quality Visual Understanding Data: We scale SAIL-VL's\npretraining budget up to 131B tokens and show that even a 2B VLM benefits from\nscaled up training data sizes, exhibiting expected data size scaling laws in\nvisual understanding and instruction following performance. (3) Scalable SFT\nvia quantity and quality scaling: We introduce general guidance for instruction\ndata curation to scale up instruction data continuously, allowing us to\nconstruct a large SFT dataset with the highest quality. To further improve\nSAIL-VL's performance, we propose quality scaling, a multi-stage training\nrecipe with curriculum learning, to improve model performance scaling curves\nw.r.t. data sizes from logarithmic to be near-linear. SAIL-VL obtains the\nhighest average score in 19 commonly used benchmarks in our evaluation and\nachieves top1 performance among VLMs of comparable sizes on OpenCompass\n(https://rank.opencompass.org.cn/leaderboard-multimodal). We release our\nSAIL-VL-2B model at HuggingFace\n(https://huggingface.co/BytedanceDouyinContent/SAIL-VL-2B).\n","authors":["Hongyuan Dong","Zijian Kang","Weijie Yin","Xiao Liang","Chao Feng","Jiao Ran"],"pdf_url":"https://arxiv.org/pdf/2501.05952v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03775v3","updated":"2025-01-10T13:25:32Z","published":"2025-01-07T13:30:54Z","title":"Strip R-CNN: Large Strip Convolution for Remote Sensing Object Detection","summary":" While witnessed with rapid development, remote sensing object detection\nremains challenging for detecting high aspect ratio objects. This paper shows\nthat large strip convolutions are good feature representation learners for\nremote sensing object detection and can detect objects of various aspect ratios\nwell. Based on large strip convolutions, we build a new network architecture\ncalled Strip R-CNN, which is simple, efficient, and powerful. 
Unlike recent\nremote sensing object detectors that leverage large-kernel convolutions with\nsquare shapes, our Strip R-CNN takes advantage of sequential orthogonal large\nstrip convolutions to capture spatial information. In addition, we enhance the\nlocalization capability of remote-sensing object detectors by decoupling the\ndetection heads and equipping the localization head with strip convolutions to\nbetter localize the target objects. Extensive experiments on several\nbenchmarks, e.g., DOTA, FAIR1M, HRSC2016, and DIOR, show that our Strip R-CNN\ncan largely improve previous works. Notably, our 30M model achieves 82.75% mAP\non DOTA-v1.0, setting a new state-of-the-art record. Code is available at\nhttps://github.com/YXB-NKU/Strip-R-CNN.\n","authors":["Xinbin Yuan","Zhaohui Zheng","Yuxuan Li","Xialei Liu","Li Liu","Xiang Li","Qibin Hou","Ming-Ming Cheng"],"pdf_url":"https://arxiv.org/pdf/2501.03775v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05945v1","updated":"2025-01-10T13:15:37Z","published":"2025-01-10T13:15:37Z","title":"Reusable specimen-level inference in computational pathology","summary":" Foundation models for computational pathology have shown great promise for\nspecimen-level tasks and are increasingly accessible to researchers. However,\nspecimen-level models built on these foundation models remain largely\nunavailable, hindering their broader utility and impact. To address this gap,\nwe developed SpinPath, a toolkit designed to democratize specimen-level deep\nlearning by providing a zoo of pretrained specimen-level models, a Python-based\ninference engine, and a JavaScript-based inference platform. We demonstrate the\nutility of SpinPath in metastasis detection tasks across nine foundation\nmodels. SpinPath may foster reproducibility, simplify experimentation, and\naccelerate the adoption of specimen-level deep learning in computational\npathology research.\n","authors":["Jakub R. Kaczmarzyk","Rishul Sharma","Peter K. Koo","Joel H. 
Saltz"],"pdf_url":"https://arxiv.org/pdf/2501.05945v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03916v2","updated":"2025-01-10T13:14:28Z","published":"2025-01-07T16:31:10Z","title":"Dolphin: Closed-loop Open-ended Auto-research through Thinking,\n Practice, and Feedback","summary":" The scientific research paradigm is undergoing a profound transformation\nowing to the development of Artificial Intelligence (AI). Recent works\ndemonstrate that various AI-assisted research methods can largely improve\nresearch efficiency by improving data analysis, accelerating computation, and\nfostering novel idea generation. To further move towards the ultimate goal\n(i.e., automatic scientific research), in this paper, we propose Dolphin, the\nfirst closed-loop open-ended auto-research framework to further build the\nentire process of human scientific research. Dolphin can generate research\nideas, perform experiments, and get feedback from experimental results to\ngenerate higher-quality ideas. More specifically, Dolphin first generates novel\nideas based on relevant papers which are ranked by the topic and task\nattributes. Then, the codes are automatically generated and debugged with the\nexception-traceback-guided local code structure. Finally, Dolphin automatically\nanalyzes the results of each idea and feeds the results back to the next round\nof idea generation. Experiments are conducted on the benchmark datasets of\ndifferent topics and results show that Dolphin can generate novel ideas\ncontinuously and complete the experiment in a loop. 
We highlight that Dolphin\ncan automatically propose methods that are comparable to the state-of-the-art\nin some tasks such as 2D image classification and 3D point classification.\n","authors":["Jiakang Yuan","Xiangchao Yan","Botian Shi","Tao Chen","Wanli Ouyang","Bo Zhang","Lei Bai","Yu Qiao","Bowen Zhou"],"pdf_url":"https://arxiv.org/pdf/2501.03916v2.pdf","comment":"19 pages, 11 figures, and our homepage:\n https://alpha-innovator.github.io/Dolphin-project-page"},{"id":"http://arxiv.org/abs/2412.01246v2","updated":"2025-01-10T13:02:07Z","published":"2024-12-02T08:06:14Z","title":"Class Distance Weighted Cross Entropy Loss for Classification of Disease\n Severity","summary":" Assessing disease severity involving ordinal classes, where each class\nrepresents increasing levels of severity, benefit from loss functions that\naccount for this ordinal structure. Traditional categorical loss functions,\nlike Cross-Entropy (CE), often perform suboptimally in these scenarios. To\naddress this, we propose a novel loss function, Class Distance Weighted\nCross-Entropy (CDW-CE), which penalizes misclassifications more harshly when\nclasses are farther apart. We evaluated CDW-CE on the Labeled Images for\nUlcerative Colitis (LIMUC) dataset using various deep architectures. Its\nperformance was compared against several categorical and ordinal loss\nfunctions. To analyze the quality of latent representations, we used\nt-distributed stochastic neighbor embedding (t-SNE) visualizations and\nquantified their clustering with the Silhouette Score. We also compared Class\nActivation Maps (CAM) generated by models trained with CDW-CE and CE loss,\nincorporating domain expert feedback to evaluate alignment with expert\nknowledge. Our results show that CDW-CE consistently improves performance in\nordinal image classification tasks. 
It achieves higher Silhouette Scores,\nindicating better differentiation of class representations, and its CAM\nvisualizations demonstrate a stronger focus on clinically significant regions,\nas confirmed by domain experts.\n","authors":["Gorkem Polat","Ümit Mert Çağlar","Alptekin Temizel"],"pdf_url":"https://arxiv.org/pdf/2412.01246v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05936v1","updated":"2025-01-10T12:57:33Z","published":"2025-01-10T12:57:33Z","title":"A Multimodal Dataset for Enhancing Industrial Task Monitoring and\n Engagement Prediction","summary":" Detecting and interpreting operator actions, engagement, and object\ninteractions in dynamic industrial workflows remains a significant challenge in\nhuman-robot collaboration research, especially within complex, real-world\nenvironments. Traditional unimodal methods often fall short of capturing the\nintricacies of these unstructured industrial settings. To address this gap, we\npresent a novel Multimodal Industrial Activity Monitoring (MIAM) dataset that\ncaptures realistic assembly and disassembly tasks, facilitating the evaluation\nof key meta-tasks such as action localization, object interaction, and\nengagement prediction. The dataset comprises multi-view RGB, depth, and\nInertial Measurement Unit (IMU) data collected from 22 sessions, amounting to\n290 minutes of untrimmed video, annotated in detail for task performance and\noperator behavior. Its distinctiveness lies in the integration of multiple data\nmodalities and its emphasis on real-world, untrimmed industrial workflows-key\nfor advancing research in human-robot collaboration and operator monitoring.\nAdditionally, we propose a multimodal network that fuses RGB frames, IMU data,\nand skeleton sequences to predict engagement levels during industrial tasks.\nOur approach improves the accuracy of recognizing engagement states, providing\na robust solution for monitoring operator performance in dynamic industrial\nenvironments. 
The dataset and code can be accessed from\nhttps://github.com/navalkishoremehta95/MIAM/.\n","authors":["Naval Kishore Mehta"," Arvind","Himanshu Kumar","Abeer Banerjee","Sumeet Saurav","Sanjay Singh"],"pdf_url":"https://arxiv.org/pdf/2501.05936v1.pdf","comment":"Accepted at the 20th International Conference on Human-Robot\n Interaction (HRI) 2025"},{"id":"http://arxiv.org/abs/2409.16111v2","updated":"2025-01-10T12:56:47Z","published":"2024-09-24T14:19:47Z","title":"CloudTrack: Scalable UAV Tracking with Cloud Semantics","summary":" Nowadays, unmanned aerial vehicles (UAVs) are commonly used in search and\nrescue scenarios to gather information in the search area. The automatic\nidentification of the person searched for in aerial footage could increase the\nautonomy of such systems, reduce the search time, and thus increase the missed\nperson's chances of survival. In this paper, we present a novel approach to\nperform semantically conditioned open vocabulary object tracking that is\nspecifically designed to cope with the limitations of UAV hardware. Our\napproach has several advantages. It can run with verbal descriptions of the\nmissing person, e.g., the color of the shirt, it does not require dedicated\ntraining to execute the mission and can efficiently track a potentially moving\nperson. 
Our experimental results demonstrate the versatility and efficacy of\nour approach.\n","authors":["Yannik Blei","Michael Krawez","Nisarga Nilavadi","Tanja Katharina Kaiser","Wolfram Burgard"],"pdf_url":"https://arxiv.org/pdf/2409.16111v2.pdf","comment":"7 pages, 3 figures"},{"id":"http://arxiv.org/abs/2501.05933v1","updated":"2025-01-10T12:56:18Z","published":"2025-01-10T12:56:18Z","title":"Weakly Supervised Segmentation of Hyper-Reflective Foci with Compact\n Convolutional Transformers and SAM2","summary":" Weakly supervised segmentation has the potential to greatly reduce the\nannotation effort for training segmentation models for small structures such as\nhyper-reflective foci (HRF) in optical coherence tomography (OCT). However,\nmost weakly supervised methods either involve a strong downsampling of input\nimages, or only achieve localization at a coarse resolution, both of which are\nunsatisfactory for small structures. We propose a novel framework that\nincreases the spatial resolution of a traditional attention-based Multiple\nInstance Learning (MIL) approach by using Layer-wise Relevance Propagation\n(LRP) to prompt the Segment Anything Model (SAM~2), and increases recall with\niterative inference. Moreover, we demonstrate that replacing MIL with a Compact\nConvolutional Transformer (CCT), which adds a positional encoding, and permits\nan exchange of information between different regions of the OCT image, leads to\na further and substantial increase in segmentation accuracy.\n","authors":["Olivier Morelle","Justus Bisten","Maximilian W. M. Wintergerst","Robert P. 
Finger","Thomas Schultz"],"pdf_url":"https://arxiv.org/pdf/2501.05933v1.pdf","comment":"7 pages, 1 figure, accepted at German Conference on Medical Image\n Computing 2025"},{"id":"http://arxiv.org/abs/2410.07128v2","updated":"2025-01-10T12:34:47Z","published":"2024-09-23T11:29:19Z","title":"Neural Differential Appearance Equations","summary":" We propose a method to reproduce dynamic appearance textures with\nspace-stationary but time-varying visual statistics. While most previous work\ndecomposes dynamic textures into static appearance and motion, we focus on\ndynamic appearance that results not from motion but variations of fundamental\nproperties, such as rusting, decaying, melting, and weathering. To this end, we\nadopt the neural ordinary differential equation (ODE) to learn the underlying\ndynamics of appearance from a target exemplar. We simulate the ODE in two\nphases. At the \"warm-up\" phase, the ODE diffuses a random noise to an initial\nstate. We then constrain the further evolution of this ODE to replicate the\nevolution of visual feature statistics in the exemplar during the generation\nphase. The particular innovation of this work is the neural ODE achieving both\ndenoising and evolution for dynamics synthesis, with a proposed temporal\ntraining scheme. We study both relightable (BRDF) and non-relightable (RGB)\nappearance models. For both we introduce new pilot datasets, allowing, for the\nfirst time, to study such phenomena: For RGB we provide 22 dynamic textures\nacquired from free online sources; For BRDFs, we further acquire a dataset of\n21 flash-lit videos of time-varying materials, enabled by a simple-to-construct\nsetup. Our experiments show that our method consistently yields realistic and\ncoherent results, whereas prior works falter under pronounced temporal\nappearance variations. 
A user study confirms our approach is preferred to\nprevious work for such exemplars.\n","authors":["Chen Liu","Tobias Ritschel"],"pdf_url":"https://arxiv.org/pdf/2410.07128v2.pdf","comment":"SIGGRAPH Asia 2024 Journal Track. Project page at\n https://ryushinn.github.io/ode-appearance"},{"id":"http://arxiv.org/abs/2412.05983v2","updated":"2025-01-10T12:28:30Z","published":"2024-12-08T16:10:42Z","title":"Chimera: Improving Generalist Model with Domain-Specific Experts","summary":" Recent advancements in Large Multi-modal Models (LMMs) underscore the\nimportance of scaling by increasing image-text paired data, achieving\nimpressive performance on general tasks. Despite their effectiveness in broad\napplications, generalist models are primarily trained on web-scale datasets\ndominated by natural images, resulting in the sacrifice of specialized\ncapabilities for domain-specific tasks that require extensive domain prior\nknowledge. Moreover, directly integrating expert models tailored for specific\ndomains is challenging due to the representational gap and imbalanced\noptimization between the generalist model and experts. To address these\nchallenges, we introduce Chimera, a scalable and low-cost multi-modal pipeline\ndesigned to boost the ability of existing LMMs with domain-specific experts.\nSpecifically, we design a progressive training strategy to integrate features\nfrom expert models into the input of a generalist LMM. 
To address the\nimbalanced optimization caused by the well-aligned general visual encoder, we\nintroduce a novel Generalist-Specialist Collaboration Masking (GSCM) mechanism.\nThis results in a versatile model that excels across the chart, table, math,\nand document domains, achieving state-of-the-art performance on multi-modal\nreasoning and visual content extraction tasks, both of which are challenging\ntasks for assessing existing LMMs.\n","authors":["Tianshuo Peng","Mingsheng Li","Hongbin Zhou","Renqiu Xia","Renrui Zhang","Lei Bai","Song Mao","Bin Wang","Conghui He","Aojun Zhou","Botian Shi","Tao Chen","Bo Zhang","Xiangyu Yue"],"pdf_url":"https://arxiv.org/pdf/2412.05983v2.pdf","comment":"Chimera Homepage: https://alpha-innovator.github.io/chimera_page"},{"id":"http://arxiv.org/abs/2412.11863v2","updated":"2025-01-10T12:22:53Z","published":"2024-12-16T15:20:03Z","title":"GeoX: Geometric Problem Solving Through Unified Formalized\n Vision-Language Pre-training","summary":" Despite their proficiency in general tasks, Multi-modal Large Language Models\n(MLLMs) struggle with automatic Geometry Problem Solving (GPS), which demands\nunderstanding diagrams, interpreting symbols, and performing complex reasoning.\nThis limitation arises from their pre-training on natural images and texts,\nalong with the lack of automated verification in the problem-solving process.\nBesides, current geometric specialists are limited by their task-specific\ndesigns, making them less effective for broader geometric problems. To this\nend, we present GeoX, a multi-modal large model focusing on geometric\nunderstanding and reasoning tasks. Given the significant differences between\ngeometric diagram-symbol and natural image-text, we introduce unimodal\npre-training to develop a diagram encoder and symbol decoder, enhancing the\nunderstanding of geometric images and corpora. 
Furthermore, we introduce\ngeometry-language alignment, an effective pre-training paradigm that bridges\nthe modality gap between unimodal geometric experts. We propose a\nGenerator-And-Sampler Transformer (GS-Former) to generate discriminative\nqueries and eliminate uninformative representations from unevenly distributed\ngeometric signals. Finally, GeoX benefits from visual instruction tuning,\nempowering it to take geometric images and questions as input and generate\nverifiable solutions. Experiments show that GeoX outperforms both generalists\nand geometric specialists on publicly recognized benchmarks, such as GeoQA,\nUniGeo, Geometry3K, and PGPS9k.\n","authors":["Renqiu Xia","Mingsheng Li","Hancheng Ye","Wenjie Wu","Hongbin Zhou","Jiakang Yuan","Tianshuo Peng","Xinyu Cai","Xiangchao Yan","Bin Wang","Conghui He","Botian Shi","Tao Chen","Junchi Yan","Bo Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.11863v2.pdf","comment":"Our code is available at https://github.com/Alpha-Innovator/GeoX"},{"id":"http://arxiv.org/abs/2412.07277v2","updated":"2025-01-10T12:17:00Z","published":"2024-12-10T08:07:19Z","title":"Backdoor Attacks against No-Reference Image Quality Assessment Models\n via a Scalable Trigger","summary":" No-Reference Image Quality Assessment (NR-IQA), responsible for assessing the\nquality of a single input image without using any reference, plays a critical\nrole in evaluating and optimizing computer vision systems, e.g., low-light\nenhancement. Recent research indicates that NR-IQA models are susceptible to\nadversarial attacks, which can significantly alter predicted scores with\nvisually imperceptible perturbations. Despite revealing vulnerabilities, these\nattack methods have limitations, including high computational demands,\nuntargeted manipulation, limited practical utility in white-box scenarios, and\nreduced effectiveness in black-box scenarios. 
To address these challenges, we\nshift our focus to another significant threat and present a novel\npoisoning-based backdoor attack against NR-IQA (BAIQA), allowing the attacker\nto manipulate the IQA model's output to any desired target value by simply\nadjusting a scaling coefficient $\\alpha$ for the trigger. We propose to inject\nthe trigger in the discrete cosine transform (DCT) domain to improve the local\ninvariance of the trigger for countering trigger diminishment in NR-IQA models\ndue to widely adopted data augmentations. Furthermore, the universal\nadversarial perturbations (UAP) in the DCT space are designed as the trigger,\nto increase IQA model susceptibility to manipulation and improve attack\neffectiveness. In addition to the heuristic method for poison-label BAIQA\n(P-BAIQA), we explore the design of clean-label BAIQA (C-BAIQA), focusing on\n$\\alpha$ sampling and image data refinement, driven by theoretical insights we\nreveal. Extensive experiments on diverse datasets and various NR-IQA models\ndemonstrate the effectiveness of our attacks. Code can be found at\nhttps://github.com/yuyi-sd/BAIQA.\n","authors":["Yi Yu","Song Xia","Xun Lin","Wenhan Yang","Shijian Lu","Yap-peng Tan","Alex Kot"],"pdf_url":"https://arxiv.org/pdf/2412.07277v2.pdf","comment":"Accept by AAAI 2025"},{"id":"http://arxiv.org/abs/2406.06521v2","updated":"2025-01-10T12:05:16Z","published":"2024-06-10T17:59:01Z","title":"PGSR: Planar-based Gaussian Splatting for Efficient and High-Fidelity\n Surface Reconstruction","summary":" Recently, 3D Gaussian Splatting (3DGS) has attracted widespread attention due\nto its high-quality rendering, and ultra-fast training and rendering speed.\nHowever, due to the unstructured and irregular nature of Gaussian point clouds,\nit is difficult to guarantee geometric reconstruction accuracy and multi-view\nconsistency simply by relying on image reconstruction loss. 
Although many\nstudies on surface reconstruction based on 3DGS have emerged recently, the\nquality of their meshes is generally unsatisfactory. To address this problem,\nwe propose a fast planar-based Gaussian splatting reconstruction representation\n(PGSR) to achieve high-fidelity surface reconstruction while ensuring\nhigh-quality rendering. Specifically, we first introduce an unbiased depth\nrendering method, which directly renders the distance from the camera origin to\nthe Gaussian plane and the corresponding normal map based on the Gaussian\ndistribution of the point cloud, and divides the two to obtain the unbiased\ndepth. We then introduce single-view geometric, multi-view photometric, and\ngeometric regularization to preserve global geometric accuracy. We also propose\na camera exposure compensation model to cope with scenes with large\nillumination variations. Experiments on indoor and outdoor scenes show that our\nmethod achieves fast training and rendering while maintaining high-fidelity\nrendering and geometric reconstruction, outperforming 3DGS-based and NeRF-based\nmethods.\n","authors":["Danpeng Chen","Hai Li","Weicai Ye","Yifan Wang","Weijian Xie","Shangjin Zhai","Nan Wang","Haomin Liu","Hujun Bao","Guofeng Zhang"],"pdf_url":"https://arxiv.org/pdf/2406.06521v2.pdf","comment":"project page: https://zju3dv.github.io/pgsr/"},{"id":"http://arxiv.org/abs/2501.00574v2","updated":"2025-01-10T12:00:51Z","published":"2024-12-31T18:01:23Z","title":"VideoChat-Flash: Hierarchical Compression for Long-Context Video\n Modeling","summary":" Long-context modeling is a critical capability for multimodal large language\nmodels (MLLMs), enabling them to process long-form contents with implicit\nmemorization. Despite its advances, handling extremely long videos remains\nchallenging due to the difficulty in maintaining crucial features over extended\nsequences. 
This paper introduces a Hierarchical visual token Compression (HiCo)\nmethod designed for high-fidelity representation and a practical context\nmodeling system VideoChat-Flash tailored for multimodal long-sequence\nprocessing. HiCo capitalizes on the redundancy of visual information in long\nvideos to compress long video context from the clip-level to the video-level,\nreducing the compute significantly while preserving essential details.\nVideoChat-Flash features a multi-stage short-to-long learning scheme, a rich\ndataset of real-world long videos named LongVid, and an upgraded\n\"Needle-In-A-video-Haystack\" (NIAH) for evaluating context capacities. In\nextensive experiments, VideoChat-Flash shows the leading performance on both\nmainstream long and short video benchmarks at the 2B and 7B model scale. It is\nthe first among open-source models to reach 99.1% accuracy over 10,000 frames\nin NIAH.\n","authors":["Xinhao Li","Yi Wang","Jiashuo Yu","Xiangyu Zeng","Yuhan Zhu","Haian Huang","Jianfei Gao","Kunchang Li","Yinan He","Chenting Wang","Yu Qiao","Yali Wang","Limin Wang"],"pdf_url":"https://arxiv.org/pdf/2501.00574v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05904v1","updated":"2025-01-10T12:00:11Z","published":"2025-01-10T12:00:11Z","title":"Binary Event-Driven Spiking Transformer","summary":" Transformer-based Spiking Neural Networks (SNNs) introduce a novel\nevent-driven self-attention paradigm that combines the high performance of\nTransformers with the energy efficiency of SNNs. However, the larger model size\nand increased computational demands of the Transformer structure limit their\npracticality in resource-constrained scenarios. In this paper, we integrate\nbinarization techniques into Transformer-based SNNs and propose the Binary\nEvent-Driven Spiking Transformer, i.e. BESTformer. The proposed BESTformer can\nsignificantly reduce storage and computational demands by representing weights\nand attention maps with a mere 1-bit. 
However, BESTformer suffers from a severe\nperformance drop from its full-precision counterpart due to the limited\nrepresentation capability of binarization. To address this issue, we propose a\nCoupled Information Enhancement (CIE) method, which consists of a reversible\nframework and information enhancement distillation. By maximizing the mutual\ninformation between the binary model and its full-precision counterpart, the\nCIE method effectively mitigates the performance degradation of the BESTformer.\nExtensive experiments on static and neuromorphic datasets demonstrate that our\nmethod achieves superior performance to other binary SNNs, showcasing its\npotential as a compact yet high-performance model for resource-limited edge\ndevices.\n","authors":["Honglin Cao","Zijian Zhou","Wenjie Wei","Ammar Belatreche","Yu Liang","Dehao Zhang","Malu Zhang","Yang Yang","Haizhou Li"],"pdf_url":"https://arxiv.org/pdf/2501.05904v1.pdf","comment":"11 pages, 5 figures"},{"id":"http://arxiv.org/abs/2501.05901v1","updated":"2025-01-10T11:53:46Z","published":"2025-01-10T11:53:46Z","title":"Valley2: Exploring Multimodal Models with Scalable Vision-Language\n Design","summary":" Recently, vision-language models have made remarkable progress, demonstrating\noutstanding capabilities in various tasks such as image captioning and video\nunderstanding. We introduce Valley2, a novel multimodal large language model\ndesigned to enhance performance across all domains and extend the boundaries of\npractical applications in e-commerce and short video scenarios. Notably,\nValley2 achieves state-of-the-art (SOTA) performance on e-commerce benchmarks,\nsurpassing open-source models of similar size by a large margin (79.66 vs.\n72.76). Additionally, Valley2 ranks second on the OpenCompass leaderboard among\nmodels with fewer than 10B parameters, with an impressive average score of\n67.4. 
The code and model weights are open-sourced at\nhttps://github.com/bytedance/Valley.\n","authors":["Ziheng Wu","Zhenghao Chen","Ruipu Luo","Can Zhang","Yuan Gao","Zhentao He","Xian Wang","Haoran Lin","Minghui Qiu"],"pdf_url":"https://arxiv.org/pdf/2501.05901v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.05435v6","updated":"2025-01-10T11:50:21Z","published":"2024-03-08T16:38:11Z","title":"OmniCount: Multi-label Object Counting with Semantic-Geometric Priors","summary":" Object counting is pivotal for understanding the composition of scenes.\nPreviously, this task was dominated by class-specific methods, which have\ngradually evolved into more adaptable class-agnostic strategies. However, these\nstrategies come with their own set of limitations, such as the need for manual\nexemplar input and multiple passes for multiple categories, resulting in\nsignificant inefficiencies. This paper introduces a more practical approach\nenabling simultaneous counting of multiple object categories using an\nopen-vocabulary framework. Our solution, OmniCount, stands out by using\nsemantic and geometric insights (priors) from pre-trained models to count\nmultiple categories of objects as specified by users, all without additional\ntraining. OmniCount distinguishes itself by generating precise object masks and\nleveraging varied interactive prompts via the Segment Anything Model for\nefficient counting. To evaluate OmniCount, we created the OmniCount-191\nbenchmark, a first-of-its-kind dataset with multi-label object counts,\nincluding points, bounding boxes, and VQA annotations. Our comprehensive\nevaluation in OmniCount-191, alongside other leading benchmarks, demonstrates\nOmniCount's exceptional performance, significantly outpacing existing\nsolutions. 
The project webpage is available at\nhttps://mondalanindya.github.io/OmniCount.\n","authors":["Anindya Mondal","Sauradip Nag","Xiatian Zhu","Anjan Dutta"],"pdf_url":"https://arxiv.org/pdf/2403.05435v6.pdf","comment":"Accepted to AAAI 2025"},{"id":"http://arxiv.org/abs/2501.05892v1","updated":"2025-01-10T11:44:59Z","published":"2025-01-10T11:44:59Z","title":"Beyond Flat Text: Dual Self-inherited Guidance for Visual Text\n Generation","summary":" In real-world images, slanted or curved texts, especially those on cans,\nbanners, or badges, appear as frequently, if not more so, than flat texts due\nto artistic design or layout constraints. While high-quality visual text\ngeneration has become available with the advanced generative capabilities of\ndiffusion models, these models often produce distorted text and inharmonious\ntext background when given slanted or curved text layouts due to training data\nlimitation. In this paper, we introduce a new training-free framework, STGen,\nwhich accurately generates visual texts in challenging scenarios (\\eg, slanted\nor curved text layouts) while harmonizing them with the text background. Our\nframework decomposes the visual text generation process into two branches: (i)\n\\textbf{Semantic Rectification Branch}, which leverages the ability in\ngenerating flat but accurate visual texts of the model to guide the generation\nof challenging scenarios. The generated latent of flat text is abundant in\naccurate semantic information related both to the text itself and its\nbackground. By incorporating this, we rectify the semantic information of the\ntexts and harmonize the integration of the text with its background in complex\nlayouts. (ii) \\textbf{Structure Injection Branch}, which reinforces the visual\ntext structure during inference. We incorporate the latent information of the\nglyph image, rich in glyph structure, as a new condition to further strengthen\nthe text structure. 
To enhance image harmony, we also apply an effective\ncombination method to merge the priors, providing a solid foundation for\ngeneration. Extensive experiments across a variety of visual text layouts\ndemonstrate that our framework achieves superior accuracy and outstanding\nquality.\n","authors":["Minxing Luo","Zixun Xia","Liaojun Chen","Zhenhang Li","Weichao Zeng","Jianye Wang","Wentao Cheng","Yaxing Wang","Yu Zhou","Jian Yang"],"pdf_url":"https://arxiv.org/pdf/2501.05892v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05885v1","updated":"2025-01-10T11:37:50Z","published":"2025-01-10T11:37:50Z","title":"EDNet: Edge-Optimized Small Target Detection in UAV Imagery -- Faster\n Context Attention, Better Feature Fusion, and Hardware Acceleration","summary":" Detecting small targets in drone imagery is challenging due to low\nresolution, complex backgrounds, and dynamic scenes. We propose EDNet, a novel\nedge-target detection framework built on an enhanced YOLOv10 architecture,\noptimized for real-time applications without post-processing. EDNet\nincorporates an XSmall detection head and a Cross Concat strategy to improve\nfeature fusion and multi-scale context awareness for detecting tiny targets in\ndiverse environments. Our unique C2f-FCA block employs Faster Context Attention\nto enhance feature extraction while reducing computational complexity. The WIoU\nloss function is employed for improved bounding box regression. With seven\nmodel sizes ranging from Tiny to XL, EDNet accommodates various deployment\nenvironments, enabling local real-time inference and ensuring data privacy.\nNotably, EDNet achieves up to a 5.6% gain in mAP@50 with significantly fewer\nparameters. On an iPhone 12, EDNet variants operate at speeds ranging from 16\nto 55 FPS, providing a scalable and efficient solution for edge-based object\ndetection in challenging drone imagery. 
The source code and pre-trained models\nare available at: https://github.com/zsniko/EDNet.\n","authors":["Zhifan Song","Yuan Zhang","Abd Al Rahman M. Abu Ebayyeh"],"pdf_url":"https://arxiv.org/pdf/2501.05885v1.pdf","comment":"Accepted in 21st IEEE International Conference on Ubiquitous\n Intelligence and Computing (UIC 2024)\n https://www.ieee-smart-world.org/2024/uic"},{"id":"http://arxiv.org/abs/2501.01987v2","updated":"2025-01-10T11:36:09Z","published":"2024-12-30T18:08:13Z","title":"Gender Bias in Text-to-Video Generation Models: A case study of Sora","summary":" The advent of text-to-video generation models has revolutionized content\ncreation as it produces high-quality videos from textual prompts. However,\nconcerns regarding inherent biases in such models have prompted scrutiny,\nparticularly regarding gender representation. Our study investigates the\npresence of gender bias in OpenAI's Sora, a state-of-the-art text-to-video\ngeneration model. We uncover significant evidence of bias by analyzing the\ngenerated videos from a diverse set of gender-neutral and stereotypical\nprompts. The results indicate that Sora disproportionately associates specific\ngenders with stereotypical behaviors and professions, which reflects societal\nprejudices embedded in its training data.\n","authors":["Mohammad Nadeem","Shahab Saquib Sohail","Erik Cambria","Björn W. Schuller","Amir Hussain"],"pdf_url":"https://arxiv.org/pdf/2501.01987v2.pdf","comment":"7 pages, 3 figures"},{"id":"http://arxiv.org/abs/2501.05884v1","updated":"2025-01-10T11:35:43Z","published":"2025-01-10T11:35:43Z","title":"Text-to-Edit: Controllable End-to-End Video Ad Creation via Multimodal\n LLMs","summary":" The exponential growth of short-video content has ignited a surge in the\nnecessity for efficient, automated solutions to video editing, with challenges\narising from the need to understand videos and tailor the editing according to\nuser requirements. 
Addressing this need, we propose an innovative end-to-end\nfoundational framework, ultimately actualizing precise control over the final\nvideo content editing. Leveraging the flexibility and generalizability of\nMultimodal Large Language Models (MLLMs), we defined clear input-output\nmappings for efficient video creation. To bolster the model's capability in\nprocessing and comprehending video content, we introduce a strategic\ncombination of a denser frame rate and a slow-fast processing technique,\nsignificantly enhancing the extraction and understanding of both temporal and\nspatial video information. Furthermore, we introduce a text-to-edit mechanism\nthat allows users to achieve desired video outcomes through textual input,\nthereby enhancing the quality and controllability of the edited videos. Through\ncomprehensive experimentation, our method has not only showcased significant\neffectiveness within advertising datasets, but also yields universally\napplicable conclusions on public datasets.\n","authors":["Dabing Cheng","Haosen Zhan","Xingchen Zhao","Guisheng Liu","Zemin Li","Jinghui Xie","Zhao Song","Weiguo Feng","Bingyue Peng"],"pdf_url":"https://arxiv.org/pdf/2501.05884v1.pdf","comment":"16pages conference"},{"id":"http://arxiv.org/abs/2501.05880v1","updated":"2025-01-10T11:32:56Z","published":"2025-01-10T11:32:56Z","title":"TakuNet: an Energy-Efficient CNN for Real-Time Inference on Embedded UAV\n systems in Emergency Response Scenarios","summary":" Designing efficient neural networks for embedded devices is a critical\nchallenge, particularly in applications requiring real-time performance, such\nas aerial imaging with drones and UAVs for emergency responses. In this work,\nwe introduce TakuNet, a novel light-weight architecture which employs\ntechniques such as depth-wise convolutions and an early downsampling stem to\nreduce computational complexity while maintaining high accuracy. 
It leverages\ndense connections for fast convergence during training and uses 16-bit\nfloating-point precision for optimization on embedded hardware accelerators.\nExperimental evaluation on two public datasets shows that TakuNet achieves\nnear-state-of-the-art accuracy in classifying aerial images of emergency\nsituations, despite its minimal parameter count. Real-world tests on embedded\ndevices, namely Jetson Orin Nano and Raspberry Pi, confirm TakuNet's\nefficiency, achieving more than 650 fps on the 15W Jetson board, making it\nsuitable for real-time AI processing on resource-constrained platforms and\nadvancing the applicability of drones in emergency scenarios. The code and\nimplementation details are publicly released.\n","authors":["Daniel Rossi","Guido Borghi","Roberto Vezzani"],"pdf_url":"https://arxiv.org/pdf/2501.05880v1.pdf","comment":"This paper has been accepted at WACVW 2025, which will take place on\n 28/02/2025. The official conference proceedings have not yet been published\n at the time of submission to arXiv. The final version of the paper,\n incorporating any changes based on feedback received during the conference,\n will be included in the proceedings once they are made available"},{"id":"http://arxiv.org/abs/2501.05874v1","updated":"2025-01-10T11:17:15Z","published":"2025-01-10T11:17:15Z","title":"VideoRAG: Retrieval-Augmented Generation over Video Corpus","summary":" Retrieval-Augmented Generation (RAG) is a powerful strategy to address the\nissue of generating factually incorrect outputs in foundation models by\nretrieving external knowledge relevant to queries and incorporating it into\ntheir generation process. However, existing RAG approaches have primarily\nfocused on textual information, with some recent advancements beginning to\nconsider images, and they largely overlook videos, a rich source of multimodal\nknowledge capable of representing events, processes, and contextual details\nmore effectively than any other modality. 
While a few recent studies explore\nthe integration of videos in the response generation process, they either\npredefine query-associated videos without retrieving them according to queries,\nor convert videos into the textual descriptions without harnessing their\nmultimodal richness. To tackle these, we introduce VideoRAG, a novel framework\nthat not only dynamically retrieves relevant videos based on their relevance\nwith queries but also utilizes both visual and textual information of videos in\nthe output generation. Further, to operationalize this, our method revolves\naround the recent advance of Large Video Language Models (LVLMs), which enable\nthe direct processing of video content to represent it for retrieval and\nseamless integration of the retrieved videos jointly with queries. We\nexperimentally validate the effectiveness of VideoRAG, showcasing that it is\nsuperior to relevant baselines.\n","authors":["Soyeong Jeong","Kangsan Kim","Jinheon Baek","Sung Ju Hwang"],"pdf_url":"https://arxiv.org/pdf/2501.05874v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05862v1","updated":"2025-01-10T10:59:27Z","published":"2025-01-10T10:59:27Z","title":"Language-Inspired Relation Transfer for Few-shot Class-Incremental\n Learning","summary":" Depicting novel classes with language descriptions by observing few-shot\nsamples is inherent in human-learning systems. This lifelong learning\ncapability helps to distinguish new knowledge from old ones through the\nincrease of open-world learning, namely Few-Shot Class-Incremental Learning\n(FSCIL). Existing works to solve this problem mainly rely on the careful tuning\nof visual encoders, which shows an evident trade-off between the base knowledge\nand incremental ones. Motivated by human learning systems, we propose a new\nLanguage-inspired Relation Transfer (LRT) paradigm to understand objects by\njoint visual clues and text depictions, composed of two major steps. 
We first\ntransfer the pretrained text knowledge to the visual domains by proposing a\ngraph relation transformation module and then fuse the visual and language\nembedding by a text-vision prototypical fusion module. Second, to mitigate the\ndomain gap caused by visual finetuning, we propose context prompt learning for\nfast domain alignment and imagined contrastive learning to alleviate the\ninsufficient text data during alignment. With collaborative learning of domain\nalignments and text-image transfer, our proposed LRT outperforms the\nstate-of-the-art models by over $13\\%$ and $7\\%$ on the final session of\nmini-ImageNet and CIFAR-100 FSCIL benchmarks.\n","authors":["Yifan Zhao","Jia Li","Zeyin Song","Yonghong Tian"],"pdf_url":"https://arxiv.org/pdf/2501.05862v1.pdf","comment":"Accepted by IEEE TPAMI"},{"id":"http://arxiv.org/abs/2501.05852v1","updated":"2025-01-10T10:47:00Z","published":"2025-01-10T10:47:00Z","title":"MRI Patterns of the Hippocampus and Amygdala for Predicting Stages of\n Alzheimer's Progression: A Minimal Feature Machine Learning Framework","summary":" Alzheimer's disease (AD) progresses through distinct stages, from early mild\ncognitive impairment (EMCI) to late mild cognitive impairment (LMCI) and\neventually to AD. Accurate identification of these stages, especially\ndistinguishing LMCI from EMCI, is crucial for developing pre-dementia\ntreatments but remains challenging due to subtle and overlapping imaging\nfeatures. This study proposes a minimal-feature machine learning framework that\nleverages structural MRI data, focusing on the hippocampus and amygdala as\nregions of interest. The framework addresses the curse of dimensionality\nthrough feature selection, utilizes region-specific voxel information, and\nimplements innovative data organization to enhance classification performance\nby reducing noise. 
The methodology integrates dimensionality reduction\ntechniques such as PCA and t-SNE with state-of-the-art classifiers, achieving\nthe highest accuracy of 88.46%. This framework demonstrates the potential for\nefficient and accurate staging of AD progression while providing valuable\ninsights for clinical applications.\n","authors":["Aswini Kumar Patra","Soraisham Elizabeth Devi","Tejashwini Gajurel"],"pdf_url":"https://arxiv.org/pdf/2501.05852v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03630v2","updated":"2025-01-10T10:45:49Z","published":"2025-01-07T09:00:07Z","title":"MC-VTON: Minimal Control Virtual Try-On Diffusion Transformer","summary":" Virtual try-on methods based on diffusion models achieve realistic try-on\neffects. They use an extra reference network or an additional image encoder to\nprocess multiple conditional image inputs, which adds pre-processing complexity\nand additional computational costs. Besides, they require more than 25\ninference steps, bringing longer inference time. In this work, with the\ndevelopment of diffusion transformer (DiT), we rethink the necessity of\nadditional reference network or image encoder and introduce MC-VTON, which\nleverages DiT's intrinsic backbone to seamlessly integrate minimal conditional\ntry-on inputs. Compared to existing methods, the superiority of MC-VTON is\ndemonstrated in four aspects: (1) Superior detail fidelity. Our DiT-based\nMC-VTON exhibits superior fidelity in preserving fine-grained details. (2)\nSimplified network and inputs. We remove any extra reference network or image\nencoder. We also remove unnecessary conditions like the long prompt, pose\nestimation, human parsing, and depth map. We require only the masked person\nimage and the garment image. (3) Parameter-efficient training. To process the\ntry-on task, we fine-tune the FLUX.1-dev with only 39.7M additional parameters\n(0.33% of the backbone parameters). (4) Fewer inference steps. 
We apply\ndistillation diffusion on MC-VTON and only need 8 steps to generate a realistic\ntry-on image, with only 86.8M additional parameters (0.72% of the backbone\nparameters). Experiments show that MC-VTON achieves superior qualitative and\nquantitative results with fewer condition inputs, trainable parameters, and\ninference steps than baseline methods.\n","authors":["Junsheng Luan","Guangyuan Li","Lei Zhao","Wei Xing"],"pdf_url":"https://arxiv.org/pdf/2501.03630v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05851v1","updated":"2025-01-10T10:45:38Z","published":"2025-01-10T10:45:38Z","title":"Identity-aware Feature Decoupling Learning for Clothing-change Person\n Re-identification","summary":" Clothing-change person re-identification (CC Re-ID) has attracted increasing\nattention in recent years due to its application prospect. Most existing works\nstruggle to adequately extract the ID-related information from the original RGB\nimages. In this paper, we propose an Identity-aware Feature Decoupling (IFD)\nlearning framework to mine identity-related features. Particularly, IFD\nexploits a dual stream architecture that consists of a main stream and an\nattention stream. The attention stream takes the clothing-masked images as\ninputs and derives the identity attention weights for effectively transferring\nthe spatial knowledge to the main stream and highlighting the regions with\nabundant identity-related information. To eliminate the semantic gap between\nthe inputs of two streams, we propose a clothing bias diminishing module\nspecific to the main stream to regularize the features of clothing-relevant\nregions. 
Extensive experimental results demonstrate that our framework\noutperforms other baseline models on several widely-used CC Re-ID datasets.\n","authors":["Haoxuan Xu","Bo Li","Guanglin Niu"],"pdf_url":"https://arxiv.org/pdf/2501.05851v1.pdf","comment":"Accepted by ICASSP2025"},{"id":"http://arxiv.org/abs/2501.03968v2","updated":"2025-01-10T10:38:49Z","published":"2025-01-07T18:06:27Z","title":"VLM-driven Behavior Tree for Context-aware Task Planning","summary":" The use of Large Language Models (LLMs) for generating Behavior Trees (BTs)\nhas recently gained attention in the robotics community, yet remains in its\nearly stages of development. In this paper, we propose a novel framework that\nleverages Vision-Language Models (VLMs) to interactively generate and edit BTs\nthat address visual conditions, enabling context-aware robot operations in\nvisually complex environments. A key feature of our approach lies in the\nconditional control through self-prompted visual conditions. Specifically, the\nVLM generates BTs with visual condition nodes, where conditions are expressed\nas free-form text. Another VLM process integrates the text into its prompt and\nevaluates the conditions against real-world images during robot execution. We\nvalidated our framework in a real-world cafe scenario, demonstrating both its\nfeasibility and limitations.\n","authors":["Naoki Wake","Atsushi Kanehira","Jun Takamatsu","Kazuhiro Sasabuchi","Katsushi Ikeuchi"],"pdf_url":"https://arxiv.org/pdf/2501.03968v2.pdf","comment":"10 pages, 11 figures, 5 tables. Last updated on January 9th, 2024"},{"id":"http://arxiv.org/abs/2406.10221v2","updated":"2025-01-10T10:36:58Z","published":"2024-06-14T17:54:54Z","title":"Long Story Short: Story-level Video Understanding from 20K Short Films","summary":" Recent developments in vision-language models have significantly advanced\nvideo understanding. Existing datasets and tasks, however, have notable\nlimitations. 
Most datasets are confined to short videos with limited events and\nnarrow narratives. For example, datasets with instructional and egocentric\nvideos often depict activities of one person in a single scene. Although\nexisting movie datasets offer richer content, they are often limited to\nshort-term tasks, lack publicly available videos, and frequently encounter data\nleakage issues given the use of subtitles and other information about\ncommercial movies during LLM pretraining. To address the above limitations, we\npropose Short-Films 20K (SF20K), the largest publicly available movie dataset.\nSF20K is composed of 20,143 amateur films and offers long-term video tasks in\nthe form of multiple-choice and open-ended question answering. Our extensive\nanalysis of SF20K reveals minimal data leakage, emphasizes the need for\nlong-term reasoning, and demonstrates the strong performance of recent VLMs.\nFinally, we show that instruction tuning on the SF20K-Train set substantially\nimproves model performance, paving the way for future progress in long-term\nvideo understanding.\n","authors":["Ridouane Ghermi","Xi Wang","Vicky Kalogeiton","Ivan Laptev"],"pdf_url":"https://arxiv.org/pdf/2406.10221v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05839v1","updated":"2025-01-10T10:26:54Z","published":"2025-01-10T10:26:54Z","title":"Poetry in Pixels: Prompt Tuning for Poem Image Generation via Diffusion\n Models","summary":" The task of text-to-image generation has encountered significant challenges\nwhen applied to literary works, especially poetry. Poems are a distinct form of\nliterature, with meanings that frequently transcend beyond the literal words.\nTo address this shortcoming, we propose a PoemToPixel framework designed to\ngenerate images that visually represent the inherent meanings of poems. Our\napproach incorporates the concept of prompt tuning in our image generation\nframework to ensure that the resulting images closely align with the poetic\ncontent. 
In addition, we propose the PoeKey algorithm, which extracts three key\nelements in the form of emotions, visual elements, and themes from poems to\nform instructions which are subsequently provided to a diffusion model for\ngenerating corresponding images. Furthermore, to expand the diversity of the\npoetry dataset across different genres and ages, we introduce MiniPo, a novel\nmultimodal dataset comprising 1001 children's poems and images. Leveraging this\ndataset alongside PoemSum, we conducted both quantitative and qualitative\nevaluations of image generation using our PoemToPixel framework. This paper\ndemonstrates the effectiveness of our approach and offers a fresh perspective\non generating images from literary sources.\n","authors":["Sofia Jamil","Bollampalli Areen Reddy","Raghvendra Kumar","Sriparna Saha","K J Joseph","Koustava Goswami"],"pdf_url":"https://arxiv.org/pdf/2501.05839v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11875v2","updated":"2025-01-10T10:15:49Z","published":"2023-10-18T10:49:29Z","title":"Fractional Concepts in Neural Networks: Enhancing Activation Functions","summary":" Designing effective neural networks requires tuning architectural elements.\nThis study integrates fractional calculus into neural networks by introducing\nfractional order derivatives (FDO) as tunable parameters in activation\nfunctions, allowing diverse activation functions by adjusting the FDO. We\nevaluate these fractional activation functions on various datasets and network\narchitectures, comparing their performance with traditional and new activation\nfunctions. Our experiments assess their impact on accuracy, time complexity,\ncomputational overhead, and memory usage. Results suggest fractional activation\nfunctions, particularly fractional Sigmoid, offer benefits in some scenarios.\nChallenges related to consistency and efficiency remain. 
Practical implications\nand limitations are discussed.\n","authors":["Zahra Alijani","Vojtech Molek"],"pdf_url":"https://arxiv.org/pdf/2310.11875v2.pdf","comment":"8 pages, 8 figures, submitted to pattern recognition letters"},{"id":"http://arxiv.org/abs/2501.01834v2","updated":"2025-01-10T10:08:50Z","published":"2025-01-03T14:38:01Z","title":"MoColl: Agent-Based Specific and General Model Collaboration for Image\n Captioning","summary":" Image captioning is a critical task at the intersection of computer vision\nand natural language processing, with wide-ranging applications across various\ndomains. For complex tasks such as diagnostic report generation, deep learning\nmodels require not only domain-specific image-caption datasets but also the\nincorporation of relevant general knowledge to provide contextual accuracy.\nExisting approaches exhibit inherent limitations: specialized models excel in\ncapturing domain-specific details but lack generalization, while\nvision-language models (VLMs) built on large language models (LLMs) leverage\ngeneral knowledge but struggle with domain-specific adaptation. To address\nthese limitations, this paper proposes a novel agent-enhanced model\ncollaboration framework, which we call MoColl, designed to effectively\nintegrate domain-specific and general knowledge. Specifically, our approach is\nto decompose complex image captioning tasks into a series of interconnected\nquestion-answer subtasks. A trainable visual question answering (VQA) model is\nemployed as a specialized tool to focus on domain-specific visual analysis,\nanswering task-specific questions based on image content. Concurrently, an\nLLM-based agent with general knowledge formulates these questions and\nsynthesizes the resulting question-answer pairs into coherent captions. Beyond\nits role in leveraging the VQA model, the agent further guides its training to\nenhance its domain-specific capabilities. 
Experimental results on radiology\nreport generation validate the effectiveness of the proposed framework,\ndemonstrating significant improvements in the quality of generated reports.\n","authors":["Pu Yang","Bin Dong"],"pdf_url":"https://arxiv.org/pdf/2501.01834v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09278v2","updated":"2025-01-10T10:07:55Z","published":"2024-12-12T13:41:35Z","title":"Towards a Multimodal Large Language Model with Pixel-Level Insight for\n Biomedicine","summary":" In recent years, Multimodal Large Language Models (MLLM) have achieved\nnotable advancements, demonstrating the feasibility of developing an\nintelligent biomedical assistant. However, current biomedical MLLMs\npredominantly focus on image-level understanding and restrict interactions to\ntextual commands, thus limiting their capability boundaries and the flexibility\nof usage. In this paper, we introduce a novel end-to-end multimodal large\nlanguage model for the biomedical domain, named MedPLIB, which possesses\npixel-level understanding. Excitingly, it supports visual question answering\n(VQA), arbitrary pixel-level prompts (points, bounding boxes, and free-form\nshapes), and pixel-level grounding. We propose a novel Mixture-of-Experts (MoE)\nmulti-stage training strategy, which divides MoE into separate training phases\nfor a visual-language expert model and a pixel-grounding expert model, followed\nby fine-tuning using MoE. This strategy effectively coordinates multitask\nlearning while maintaining the computational cost at inference equivalent to\nthat of a single expert model. To advance the research of biomedical MLLMs, we\nintroduce the Medical Complex Vision Question Answering Dataset (MeCoVQA),\nwhich comprises an array of 8 modalities for complex medical imaging question\nanswering and image region understanding. Experimental results indicate that\nMedPLIB has achieved state-of-the-art outcomes across multiple medical visual\nlanguage tasks. 
More importantly, in zero-shot evaluations for the pixel\ngrounding task, MedPLIB leads the best small and large models by margins of\n19.7 and 15.6 respectively on the mDice metric. The codes, data, and model\ncheckpoints will be made publicly available at\nhttps://github.com/ShawnHuang497/MedPLIB.\n","authors":["Xiaoshuang Huang","Lingdong Shen","Jia Liu","Fangxin Shang","Hongxiang Li","Haifeng Huang","Yehui Yang"],"pdf_url":"https://arxiv.org/pdf/2412.09278v2.pdf","comment":"Accepted by AAAI2025"},{"id":"http://arxiv.org/abs/2501.05828v1","updated":"2025-01-10T10:07:41Z","published":"2025-01-10T10:07:41Z","title":"UltraRay: Full-Path Ray Tracing for Enhancing Realism in Ultrasound\n Simulation","summary":" Traditional ultrasound simulators solve the wave equation to model pressure\ndistribution fields, achieving high accuracy but requiring significant\ncomputational time and resources. To address this, ray tracing approaches have\nbeen introduced, modeling wave propagation as rays interacting with boundaries\nand scatterers. However, existing models simplify ray propagation, generating\nechoes at interaction points without considering return paths to the sensor.\nThis can result in unrealistic artifacts and necessitates careful scene tuning\nfor plausible results. We propose a novel ultrasound simulation pipeline that\nutilizes a ray tracing algorithm to generate echo data, tracing each ray from\nthe transducer through the scene and back to the sensor. To replicate advanced\nultrasound imaging, we introduce a ray emission scheme optimized for plane wave\nimaging, incorporating delay and steering capabilities. Furthermore, we\nintegrate a standard signal processing pipeline to simulate end-to-end\nultrasound image formation. We showcase the efficacy of the proposed pipeline\nby modeling synthetic scenes featuring highly reflective objects, such as\nbones. 
In doing so, our proposed approach, UltraRay, not only enhances the\noverall visual quality but also improves the realism of the simulated images by\naccurately capturing secondary reflections and reducing unnatural artifacts. By\nbuilding on top of a differentiable framework, the proposed pipeline lays the\ngroundwork for a fast and differentiable ultrasound simulation tool necessary\nfor gradient-based optimization, enabling advanced ultrasound beamforming\nstrategies, neural network integration, and accurate inverse scene\nreconstruction.\n","authors":["Felix Duelmer","Mohammad Farid Azampour","Nassir Navab"],"pdf_url":"https://arxiv.org/pdf/2501.05828v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05826v1","updated":"2025-01-10T10:03:56Z","published":"2025-01-10T10:03:56Z","title":"AI-Driven Diabetic Retinopathy Screening: Multicentric Validation of\n AIDRSS in India","summary":" Purpose: Diabetic retinopathy (DR) is a major cause of vision loss,\nparticularly in India, where access to retina specialists is limited in rural\nareas. This study aims to evaluate the Artificial Intelligence-based Diabetic\nRetinopathy Screening System (AIDRSS) for DR detection and prevalence\nassessment, addressing the growing need for scalable, automated screening\nsolutions in resource-limited settings.\n Approach: A multicentric, cross-sectional study was conducted in Kolkata,\nIndia, involving 5,029 participants and 10,058 macula-centric retinal fundus\nimages. The AIDRSS employed a deep learning algorithm with 50 million trainable\nparameters, integrated with Contrast Limited Adaptive Histogram Equalization\n(CLAHE) preprocessing for enhanced image quality. DR was graded using the\nInternational Clinical Diabetic Retinopathy (ICDR) Scale, categorizing disease\ninto five stages (DR0 to DR4). 
Statistical metrics including sensitivity,\nspecificity, and prevalence rates were evaluated against expert retina\nspecialist assessments.\n Results: The prevalence of DR in the general population was 13.7%, rising to\n38.2% among individuals with elevated random blood glucose levels. The AIDRSS\nachieved an overall sensitivity of 92%, specificity of 88%, and 100%\nsensitivity for detecting referable DR (DR3 and DR4). These results demonstrate\nthe system's robust performance in accurately identifying and grading DR in a\ndiverse population.\n Conclusions: AIDRSS provides a reliable, scalable solution for early DR\ndetection in resource-constrained environments. Its integration of advanced AI\ntechniques ensures high diagnostic accuracy, with potential to significantly\nreduce the burden of diabetes-related vision loss in underserved regions.\n","authors":["Amit Kr Dey","Pradeep Walia","Girish Somvanshi","Abrar Ali","Sagarnil Das","Pallabi Paul","Minakhi Ghosh"],"pdf_url":"https://arxiv.org/pdf/2501.05826v1.pdf","comment":"22 pages, 5 figures. arXiv admin note: substantial text overlap with\n arXiv:1812.07105 by other authors without attribution"},{"id":"http://arxiv.org/abs/2501.05823v1","updated":"2025-01-10T10:01:36Z","published":"2025-01-10T10:01:36Z","title":"PersonaHOI: Effortlessly Improving Personalized Face with Human-Object\n Interaction Generation","summary":" We introduce PersonaHOI, a training- and tuning-free framework that fuses a\ngeneral StableDiffusion model with a personalized face diffusion (PFD) model to\ngenerate identity-consistent human-object interaction (HOI) images. While\nexisting PFD models have advanced significantly, they often overemphasize\nfacial features at the expense of full-body coherence. To address this,\nPersonaHOI introduces an additional StableDiffusion (SD) branch guided by\nHOI-oriented text inputs. 
By\nincorporating cross-attention constraints in the PFD branch and spatial merging\nat both latent and residual levels, PersonaHOI preserves personalized facial\ndetails while ensuring interactive non-facial regions. Experiments, validated\nby a novel interaction alignment metric, demonstrate the superior realism and\nscalability of PersonaHOI, establishing a new standard for practical\npersonalized face with HOI generation. Our code will be available at\nhttps://github.com/JoyHuYY1412/PersonaHOI\n","authors":["Xinting Hu","Haoran Wang","Jan Eric Lenssen","Bernt Schiele"],"pdf_url":"https://arxiv.org/pdf/2501.05823v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.13719v2","updated":"2025-01-10T10:00:58Z","published":"2024-07-18T17:18:25Z","title":"HazeCLIP: Towards Language Guided Real-World Image Dehazing","summary":" Existing methods have achieved remarkable performance in image dehazing,\nparticularly on synthetic datasets. However, they often struggle with\nreal-world hazy images due to domain shift, limiting their practical\napplicability. This paper introduces HazeCLIP, a language-guided adaptation\nframework designed to enhance the real-world performance of pre-trained\ndehazing networks. Inspired by the Contrastive Language-Image Pre-training\n(CLIP) model's ability to distinguish between hazy and clean images, we\nleverage it to evaluate dehazing results. Combined with a region-specific\ndehazing technique and tailored prompt sets, the CLIP model accurately\nidentifies hazy areas, providing a high-quality, human-like prior that guides\nthe fine-tuning process of pre-trained networks. Extensive experiments\ndemonstrate that HazeCLIP achieves state-of-the-art performance in real-world\nimage dehazing, evaluated through both visual quality and image quality\nassessment metrics. 
Codes are available at https://github.com/Troivyn/HazeCLIP.\n","authors":["Ruiyi Wang","Wenhao Li","Xiaohong Liu","Chunyi Li","Zicheng Zhang","Xiongkuo Min","Guangtao Zhai"],"pdf_url":"https://arxiv.org/pdf/2407.13719v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.06154v2","updated":"2025-01-10T09:44:43Z","published":"2024-09-10T01:57:57Z","title":"Static for Dynamic: Towards a Deeper Understanding of Dynamic Facial\n Expressions Using Static Expression Data","summary":" Dynamic facial expression recognition (DFER) infers emotions from the\ntemporal evolution of expressions, unlike static facial expression recognition\n(SFER), which relies solely on a single snapshot. This temporal analysis\nprovides richer information and promises greater recognition capability.\nHowever, current DFER methods often exhibit unsatisfactory performance, largely\ndue to fewer training samples compared to SFER. Given the inherent correlation\nbetween static and dynamic expressions, we hypothesize that leveraging the\nabundant SFER data can enhance DFER. To this end, we propose Static-for-Dynamic\n(S4D), a unified dual-modal learning framework that integrates SFER data as a\ncomplementary resource for DFER. Specifically, S4D employs dual-modal\nself-supervised pre-training on facial images and videos using a shared Vision\nTransformer (ViT) encoder-decoder architecture, yielding improved\nspatiotemporal representations. The pre-trained encoder is then fine-tuned on\nstatic and dynamic expression datasets in a multi-task learning setup to\nfacilitate emotional information interaction. Unfortunately, vanilla multi-task\nlearning in our study results in negative transfer. To address this, we propose\nan innovative Mixture of Adapter Experts (MoAE) module that facilitates\ntask-specific knowledge acquisition while effectively extracting shared\nknowledge from both static and dynamic expression data. 
Extensive experiments\ndemonstrate that S4D achieves a deeper understanding of DFER, setting new\nstate-of-the-art performance on FERV39K, MAFW, and DFEW benchmarks, with\nweighted average recall (WAR) of 53.65\\%, 58.44\\%, and 76.68\\%, respectively.\nAdditionally, a systematic correlation analysis between SFER and DFER tasks is\npresented, which further elucidates the potential benefits of leveraging SFER.\n","authors":["Yin Chen","Jia Li","Yu Zhang","Zhenzhen Hu","Shiguang Shan","Meng Wang","Richang Hong"],"pdf_url":"https://arxiv.org/pdf/2409.06154v2.pdf","comment":"The code and model are publicly available here\n https://github.com/MSA-LMC/S4D"},{"id":"http://arxiv.org/abs/2411.10185v3","updated":"2025-01-10T09:37:49Z","published":"2024-11-15T13:34:46Z","title":"Efficient Progressive Image Compression with Variance-aware Masking","summary":" Learned progressive image compression is gaining momentum as it allows\nimproved image reconstruction as more bits are decoded at the receiver. We\npropose a progressive image compression method in which an image is first\nrepresented as a pair of base-quality and top-quality latent representations.\nNext, a residual latent representation is encoded as the element-wise\ndifference between the top and base representations. Our scheme enables\nprogressive image compression with element-wise granularity by introducing a\nmasking system that ranks each element of the residual latent representation\nfrom most to least important, dividing it into complementary components, which\ncan be transmitted separately to the decoder in order to obtain different\nreconstruction quality. The masking system does not add further parameters nor\ncomplexity. At the receiver, any elements of the top latent representation\nexcluded from the transmitted components can be independently replaced with the\nmean predicted by the hyperprior architecture, ensuring reliable\nreconstructions at any intermediate quality level. 
We also introduced Rate\nEnhancement Modules (REMs), which refine the estimation of entropy parameters\nusing already decoded components. We obtain results competitive with\nstate-of-the-art competitors, while significantly reducing computational\ncomplexity, decoding time, and number of parameters.\n","authors":["Alberto Presta","Enzo Tartaglione","Attilio Fiandrotti","Marco Grangetto","Pamela Cosman"],"pdf_url":"https://arxiv.org/pdf/2411.10185v3.pdf","comment":"9 pages. Accepted at WACV 2025"},{"id":"http://arxiv.org/abs/2501.01042v2","updated":"2025-01-10T09:21:43Z","published":"2025-01-02T03:52:22Z","title":"Image-based Multimodal Models as Intruders: Transferable Multimodal\n Attacks on Video-based MLLMs","summary":" Video-based multimodal large language models (V-MLLMs) have shown\nvulnerability to adversarial examples in video-text multimodal tasks. However,\nthe transferability of adversarial videos to unseen models--a common and\npractical real world scenario--remains unexplored. In this paper, we pioneer an\ninvestigation into the transferability of adversarial video samples across\nV-MLLMs. We find that existing adversarial attack methods face significant\nlimitations when applied in black-box settings for V-MLLMs, which we attribute\nto the following shortcomings: (1) lacking generalization in perturbing video\nfeatures, (2) focusing only on sparse key-frames, and (3) failing to integrate\nmultimodal information. To address these limitations and deepen the\nunderstanding of V-MLLM vulnerabilities in black-box scenarios, we introduce\nthe Image-to-Video MLLM (I2V-MLLM) attack. In I2V-MLLM, we utilize an\nimage-based multimodal model (IMM) as a surrogate model to craft adversarial\nvideo samples. Multimodal interactions and temporal information are integrated\nto disrupt video representations within the latent space, improving adversarial\ntransferability. 
In addition, a perturbation propagation technique is\nintroduced to handle different unknown frame sampling strategies. Experimental\nresults demonstrate that our method can generate adversarial examples that\nexhibit strong transferability across different V-MLLMs on multiple video-text\nmultimodal tasks. Compared to white-box attacks on these models, our black-box\nattacks (using BLIP-2 as surrogate model) achieve competitive performance, with\naverage attack success rates of 55.48% on MSVD-QA and 58.26% on MSRVTT-QA for\nVideoQA tasks, respectively. Our code will be released upon acceptance.\n","authors":["Linhao Huang","Xue Jiang","Zhiqiang Wang","Wentao Mo","Xi Xiao","Bo Han","Yongjie Yin","Feng Zheng"],"pdf_url":"https://arxiv.org/pdf/2501.01042v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05803v1","updated":"2025-01-10T09:10:30Z","published":"2025-01-10T09:10:30Z","title":"Alignment without Over-optimization: Training-Free Solution for\n Diffusion Models","summary":" Diffusion models excel in generative tasks, but aligning them with specific\nobjectives while maintaining their versatility remains challenging. Existing\nfine-tuning methods often suffer from reward over-optimization, while\napproximate guidance approaches fail to optimize target rewards effectively.\nAddressing these limitations, we propose a training-free sampling method based\non Sequential Monte Carlo (SMC) to sample from the reward-aligned target\ndistribution. Our approach, tailored for diffusion sampling and incorporating\ntempering techniques, achieves comparable or superior target rewards to\nfine-tuning methods while preserving diversity and cross-reward generalization.\nWe demonstrate its effectiveness in single-reward optimization, multi-objective\nscenarios, and online black-box optimization. This work offers a robust\nsolution for aligning diffusion models with diverse downstream objectives\nwithout compromising their general capabilities. 
Code is available at\nhttps://github.com/krafton-ai/DAS .\n","authors":["Sunwoo Kim","Minkyu Kim","Dongmin Park"],"pdf_url":"https://arxiv.org/pdf/2501.05803v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05091v2","updated":"2025-01-10T08:43:50Z","published":"2025-01-09T09:15:07Z","title":"ResPanDiff: Diffusion Model for Pansharpening by Inferring Residual\n Inference","summary":" The implementation of the diffusion-based pansharpening task is predominantly\nconstrained by its slow inference speed, which results from numerous sampling\nsteps. Despite the existing techniques aiming to accelerate sampling, they\noften compromise performance when fusing multi-source images. To ease this\nlimitation, we introduce a novel and efficient diffusion model named Diffusion\nModel for Pansharpening by Inferring Residual Inference (ResPanDiff), which\nsignificantly reduces the number of diffusion steps without sacrificing\nperformance on the pansharpening task. In ResPanDiff, we innovatively\npropose a Markov chain that transits from noisy residuals to the residuals\nbetween the LRMS and HRMS images, thereby reducing the number of sampling steps\nand enhancing performance. Additionally, we design the latent space to help the\nmodel extract more features at the encoding stage, Shallow\nCond-Injection~(SC-I) to help the model fetch cond-injected hidden features with\nhigher dimensions, and loss functions to provide better guidance for the\nresidual generation task, enabling the model to achieve superior performance in\nresidual generation. Furthermore, experimental evaluations on pansharpening\ndatasets demonstrate that the proposed method achieves superior outcomes\ncompared to recent state-of-the-art~(SOTA) techniques, requiring only 15\nsampling steps, which reduces the number of steps by over $90\\%$ compared with\nthe benchmark diffusion models. 
Our experiments also include thorough discussions and\nablation studies to underscore the effectiveness of our approach.\n","authors":["Shiqi Cao","Liangjian Deng","Shangqi Deng"],"pdf_url":"https://arxiv.org/pdf/2501.05091v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.02564v2","updated":"2025-01-10T08:40:49Z","published":"2025-01-05T14:42:47Z","title":"Balanced Multi-view Clustering","summary":" Multi-view clustering (MvC) aims to integrate information from different\nviews to enhance the capability of the model in capturing the underlying data\nstructures. The widely used joint training paradigm in MvC may not fully\nleverage the multi-view information, owing to the imbalanced and\nunder-optimized view-specific features caused by the uniform learning objective\nfor all views. For instance, particular views with more discriminative\ninformation could dominate the learning process in the joint training paradigm,\nleading to other views being under-optimized. To alleviate this issue, we first\nanalyze the imbalanced phenomenon in the joint-training paradigm of multi-view\nclustering from the perspective of gradient descent for each view-specific\nfeature extractor. Then, we propose a novel balanced multi-view clustering\n(BMvC) method, which introduces a view-specific contrastive regularization\n(VCR) to modulate the optimization of each view. Concretely, VCR preserves the\nsample similarities captured from the joint features and view-specific ones\ninto the clustering distributions corresponding to view-specific features to\nenhance the learning process of view-specific feature extractors. Additionally,\na theoretical analysis is provided to illustrate that VCR adaptively modulates\nthe magnitudes of gradients for updating the parameters of view-specific\nfeature extractors to achieve a balanced multi-view learning procedure. 
In such\na manner, BMvC achieves a better trade-off between the exploitation of\nview-specific patterns and the exploration of view-invariant patterns to fully\nlearn the multi-view information for the clustering task. Finally, a set of\nexperiments is conducted to verify the superiority of the proposed method\ncompared with state-of-the-art approaches both on eight benchmark MvC datasets\nand two spatially resolved transcriptomics datasets.\n","authors":["Zhenglai Li","Jun Wang","Chang Tang","Xinzhong Zhu","Wei Zhang","Xinwang Liu"],"pdf_url":"https://arxiv.org/pdf/2501.02564v2.pdf","comment":"We are withdrawing this paper due to issues in the experimental\n section related to the Application for Spatially Resolved Transcriptomics\n Data Clustering. These issues affect the validity of the results presented.\n We believe it is necessary to withdraw the paper to address these problems\n adequately before resubmission."},{"id":"http://arxiv.org/abs/2501.05786v1","updated":"2025-01-10T08:36:59Z","published":"2025-01-10T08:36:59Z","title":"Cryptanalysis of Cancelable Biometrics Vault","summary":" Cancelable Biometrics (CB) stands for a range of biometric transformation\nschemes combining biometrics with user specific tokens to generate secure\ntemplates. Required properties are the irreversibility, unlinkability and\nrecognition accuracy of templates while making their revocation possible. In\nbiometrics, a key-binding scheme is used for protecting a cryptographic key\nusing biometric data. The key can be recomputed only if the correct biometric\ndata is acquired during authentication. Typical applications of key-binding\nschemes include disk encryption, where the cryptographic key is used to encrypt\nand decrypt the disk. In this paper, we cryptanalyze a recent key-binding\nscheme, called Cancelable Biometrics Vault (CBV), based on cancelable\nbiometrics. 
More\nprecisely, the cancelable transformation introduced to instantiate the CBV\nframework, called the BioEncoding scheme, is attacked in terms of reversibility\nand linkability of templates. Subsequently, our linkability attack enables\nrecovery of the key in the vault without additional assumptions. Our\ncryptanalysis introduces a new perspective by uncovering the CBV scheme's\nrevocability and linkability vulnerabilities, which were not previously\nidentified in comparable biometric-based key-binding schemes.\n","authors":["Patrick Lacharme","Kevin Thiry-Atighehchi"],"pdf_url":"https://arxiv.org/pdf/2501.05786v1.pdf","comment":"17 pages, 4 figures"},{"id":"http://arxiv.org/abs/2410.05993v4","updated":"2025-01-10T08:35:13Z","published":"2024-10-08T12:44:57Z","title":"Aria: An Open Multimodal Native Mixture-of-Experts Model","summary":" Information comes in diverse modalities. Multimodal native AI models are\nessential to integrate real-world information and deliver comprehensive\nunderstanding. While proprietary multimodal native models exist, their lack of\nopenness imposes obstacles for adoptions, let alone adaptations. To fill this\ngap, we introduce Aria, an open multimodal native model with best-in-class\nperformance across a wide range of multimodal, language, and coding tasks. Aria\nis a mixture-of-experts model with 3.9B and 3.5B activated parameters per\nvisual token and text token, respectively. It outperforms Pixtral-12B and\nLlama3.2-11B, and is competitive against the best proprietary models on various\nmultimodal tasks. We pre-train Aria from scratch following a 4-stage pipeline,\nwhich progressively equips the model with strong capabilities in language\nunderstanding, multimodal understanding, long context window, and instruction\nfollowing. 
We open-source the model weights along with a codebase that\nfacilitates easy adoptions and adaptations of Aria in real-world applications.\n","authors":["Dongxu Li","Yudong Liu","Haoning Wu","Yue Wang","Zhiqi Shen","Bowen Qu","Xinyao Niu","Fan Zhou","Chengen Huang","Yanpeng Li","Chongyan Zhu","Xiaoyi Ren","Chao Li","Yifan Ye","Peng Liu","Lihuan Zhang","Hanshu Yan","Guoyin Wang","Bei Chen","Junnan Li"],"pdf_url":"https://arxiv.org/pdf/2410.05993v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05783v1","updated":"2025-01-10T08:33:31Z","published":"2025-01-10T08:33:31Z","title":"UV-Attack: Physical-World Adversarial Attacks for Person Detection via\n Dynamic-NeRF-based UV Mapping","summary":" In recent research, adversarial attacks on person detectors using patches or\nstatic 3D model-based texture modifications have struggled with low success\nrates due to the flexible nature of human movement. Modeling the 3D\ndeformations caused by various actions has been a major challenge. Fortunately,\nadvancements in Neural Radiance Fields (NeRF) for dynamic human modeling offer\nnew possibilities. In this paper, we introduce UV-Attack, a groundbreaking\napproach that achieves high success rates even with extensive and unseen human\nactions. We address the challenge above by leveraging dynamic-NeRF-based UV\nmapping. UV-Attack can generate human images across diverse actions and\nviewpoints, and even create novel actions by sampling from the SMPL parameter\nspace. While dynamic NeRF models are capable of modeling human bodies,\nmodifying clothing textures is challenging because they are embedded in neural\nnetwork parameters. To tackle this, UV-Attack generates UV maps instead of RGB\nimages and modifies the texture stacks. This approach enables real-time texture\nedits and makes the attack more practical. We also propose a novel Expectation\nover Pose Transformation loss (EoPT) to improve the evasion success rate on\nunseen poses and views. 
Our experiments show that UV-Attack achieves a 92.75%\nattack success rate against the FastRCNN model across varied poses in dynamic\nvideo settings, significantly outperforming the state-of-the-art AdvCamou\nattack, which only had a 28.50% ASR. Moreover, we achieve 49.5% ASR on the\nlatest YOLOv8 detector in black-box settings. This work highlights the\npotential of dynamic NeRF-based UV mapping for creating more effective\nadversarial attacks on person detectors, addressing key challenges in modeling\nhuman movement and texture modification.\n","authors":["Yanjie Li","Wenxuan Zhang","Kaisheng Liang","Bin Xiao"],"pdf_url":"https://arxiv.org/pdf/2501.05783v1.pdf","comment":"23 pages, 22 figures, submitted to ICLR2025"},{"id":"http://arxiv.org/abs/2501.05777v1","updated":"2025-01-10T08:18:37Z","published":"2025-01-10T08:18:37Z","title":"StructSR: Refuse Spurious Details in Real-World Image Super-Resolution","summary":" Diffusion-based models have shown great promise in real-world image\nsuper-resolution (Real-ISR), but often generate content with structural errors\nand spurious texture details due to the empirical priors and illusions of these\nmodels. To address this issue, we introduce StructSR, a simple, effective, and\nplug-and-play method that enhances structural fidelity and suppresses spurious\ndetails for diffusion-based Real-ISR. StructSR operates without the need for\nadditional fine-tuning, external model priors, or high-level semantic\nknowledge. At its core is the Structure-Aware Screening (SAS) mechanism, which\nidentifies the image with the highest structural similarity to the\nlow-resolution (LR) input in the early inference stage, allowing us to leverage\nit as a historical structure knowledge to suppress the generation of spurious\ndetails. By intervening in the diffusion inference process, StructSR seamlessly\nintegrates with existing diffusion-based Real-ISR models. 
Our experimental\nresults demonstrate that StructSR significantly improves the fidelity of\nstructure and texture, improving the PSNR and SSIM metrics by an average of\n5.27% and 9.36% on a synthetic dataset (DIV2K-Val) and 4.13% and 8.64% on two\nreal-world datasets (RealSR and DRealSR) when integrated with four\nstate-of-the-art diffusion-based Real-ISR methods.\n","authors":["Yachao Li","Dong Liang","Tianyu Ding","Sheng-Jun Huang"],"pdf_url":"https://arxiv.org/pdf/2501.05777v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05769v1","updated":"2025-01-10T07:58:38Z","published":"2025-01-10T07:58:38Z","title":"Conditional Diffusion Model for Electrical Impedance Tomography","summary":" Electrical impedance tomography (EIT) is a non-invasive imaging technique,\nwhich has been widely used in the fields of industrial inspection, medical\nmonitoring and tactile sensing. However, due to the inherent non-linearity and\nill-conditioned nature of the EIT inverse problem, the reconstructed image is\nhighly sensitive to the measured data, and random noise artifacts often appear\nin the reconstructed image, which greatly limits the application of EIT. To\naddress this issue, a conditional diffusion model with voltage consistency\n(CDMVC) is proposed in this study. The method consists of a pre-imaging module,\na conditional diffusion model for reconstruction, a forward voltage constraint\nnetwork and a scheme of voltage consistency constraint during sampling process.\nThe pre-imaging module is employed to generate the initial reconstruction. This\nserves as a condition for training the conditional diffusion model. Finally,\nbased on the forward voltage constraint network, a voltage consistency\nconstraint is implemented in the sampling phase to incorporate forward\ninformation of EIT, thereby enhancing imaging quality. A more complete dataset,\nincluding both common and complex concave shapes, is generated. 
The proposed\nmethod is validated using both simulation and physical experiments.\nExperimental results demonstrate that our method significantly improves the\nquality of reconstructed images. In addition, experimental results also\ndemonstrate that our method has good robustness and generalization performance.\n","authors":["Duanpeng Shi","Wendong Zheng","Di Guo","Huaping Liu"],"pdf_url":"https://arxiv.org/pdf/2501.05769v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05767v1","updated":"2025-01-10T07:56:23Z","published":"2025-01-10T07:56:23Z","title":"Migician: Revealing the Magic of Free-Form Multi-Image Grounding in\n Multimodal Large Language Models","summary":" The recent advancement of Multimodal Large Language Models (MLLMs) has\nsignificantly improved their fine-grained perception of single images and\ngeneral comprehension across multiple images. However, existing MLLMs still\nface challenges in achieving precise grounding in complex multi-image\nscenarios. To address this, we first explore a Chain-of-Thought (CoT) framework\nthat integrates single-image grounding with multi-image comprehension. While\npartially effective, it remains unstable and struggles to capture abstract\nvisual information due to its non-end-to-end nature. Therefore, we introduce\nMigician, the first multi-image grounding model capable of performing free-form\nand accurate grounding across multiple images. To support this, we present the\nMGrounding-630k dataset, which comprises data for several multi-image grounding\ntasks derived from existing datasets, along with newly generated free-form\ngrounding instruction-following data. Furthermore, we propose MIG-Bench, a\ncomprehensive benchmark specifically designed for evaluating multi-image\ngrounding capabilities. Experimental results demonstrate that our model\nachieves significantly superior multi-image grounding capabilities,\noutperforming the best existing MLLMs by 21.61% and even surpassing much larger\n70B models. 
Our code, model, dataset, and benchmark are fully open-sourced.\n","authors":["You Li","Heyu Huang","Chi Chen","Kaiyu Huang","Chao Huang","Zonghao Guo","Zhiyuan Liu","Jinan Xu","Yuhua Li","Ruixuan Li","Maosong Sun"],"pdf_url":"https://arxiv.org/pdf/2501.05767v1.pdf","comment":"20 pages, 8 figures"},{"id":"http://arxiv.org/abs/2501.05763v1","updated":"2025-01-10T07:41:47Z","published":"2025-01-10T07:41:47Z","title":"StarGen: A Spatiotemporal Autoregression Framework with Video Diffusion\n Model for Scalable and Controllable Scene Generation","summary":" Recent advances in large reconstruction and generative models have\nsignificantly improved scene reconstruction and novel view generation. However,\ndue to compute limitations, each inference with these large models is confined\nto a small area, making long-range consistent scene generation challenging. To\naddress this, we propose StarGen, a novel framework that employs a pre-trained\nvideo diffusion model in an autoregressive manner for long-range scene\ngeneration. The generation of each video clip is conditioned on the 3D warping\nof spatially adjacent images and the temporally overlapping image from\npreviously generated clips, improving spatiotemporal consistency in long-range\nscene generation with precise pose control. The spatiotemporal condition is\ncompatible with various input conditions, facilitating diverse tasks, including\nsparse view interpolation, perpetual view generation, and layout-conditioned\ncity generation. 
Quantitative and qualitative evaluations demonstrate StarGen's\nsuperior scalability, fidelity, and pose accuracy compared to state-of-the-art\nmethods.\n","authors":["Shangjin Zhai","Zhichao Ye","Jialin Liu","Weijian Xie","Jiaqi Hu","Zhen Peng","Hua Xue","Danpeng Chen","Xiaomeng Wang","Lei Yang","Nan Wang","Haomin Liu","Guofeng Zhang"],"pdf_url":"https://arxiv.org/pdf/2501.05763v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.11210v2","updated":"2025-01-10T07:26:43Z","published":"2024-12-15T15:04:27Z","title":"ViPOcc: Leveraging Visual Priors from Vision Foundation Models for\n Single-View 3D Occupancy Prediction","summary":" Inferring the 3D structure of a scene from a single image is an ill-posed and\nchallenging problem in the field of vision-centric autonomous driving. Existing\nmethods usually employ neural radiance fields to produce voxelized 3D\noccupancy, lacking instance-level semantic reasoning and temporal photometric\nconsistency. In this paper, we propose ViPOcc, which leverages the visual\npriors from vision foundation models (VFMs) for fine-grained 3D occupancy\nprediction. Unlike previous works that solely employ volume rendering for RGB\nand depth image reconstruction, we introduce a metric depth estimation branch,\nin which an inverse depth alignment module is proposed to bridge the domain gap\nin depth distribution between VFM predictions and the ground truth. The\nrecovered metric depth is then utilized in temporal photometric alignment and\nspatial geometric alignment to ensure accurate and consistent 3D occupancy\nprediction. Additionally, we also propose a semantic-guided non-overlapping\nGaussian mixture sampler for efficient, instance-aware ray sampling, which\naddresses the redundant and imbalanced sampling issue that still exists in\nprevious state-of-the-art methods. 
Extensive experiments demonstrate the\nsuperior performance of ViPOcc in both 3D occupancy prediction and depth\nestimation tasks on the KITTI-360 and KITTI Raw datasets. Our code is available\nat: \\url{https://mias.group/ViPOcc}.\n","authors":["Yi Feng","Yu Han","Xijing Zhang","Tanghui Li","Yanting Zhang","Rui Fan"],"pdf_url":"https://arxiv.org/pdf/2412.11210v2.pdf","comment":"accepted to AAAI25"},{"id":"http://arxiv.org/abs/2412.10718v3","updated":"2025-01-10T07:20:26Z","published":"2024-12-14T07:22:03Z","title":"GridShow: Omni Visual Generation","summary":" In this paper, we introduce GRID, a novel paradigm that reframes a broad\nrange of visual generation tasks as the problem of arranging grids, akin to\nfilm strips. At its core, GRID transforms temporal sequences into grid layouts,\nenabling image generation models to process visual sequences holistically. To\nachieve both layout consistency and motion coherence, we develop a parallel\nflow-matching training strategy that combines layout matching and temporal\nlosses, guided by a coarse-to-fine schedule that evolves from basic layouts to\nprecise motion control. Our approach demonstrates remarkable efficiency,\nachieving up to 35 faster inference speeds while using 1/1000 of the\ncomputational resources compared to specialized models. Extensive experiments\nshow that GRID exhibits exceptional versatility across diverse visual\ngeneration tasks, from Text-to-Video to 3D Editing, while maintaining its\nfoundational image generation capabilities. 
This dual strength in both expanded\napplications and preserved core competencies establishes GRID as an efficient\nand versatile omni-solution for visual generation.\n","authors":["Cong Wan","Xiangyang Luo","Zijian Cai","Yiren Song","Yunlong Zhao","Yifan Bai","Yuhang He","Yihong Gong"],"pdf_url":"https://arxiv.org/pdf/2412.10718v3.pdf","comment":"Codes: https://github.com/Should-AI-Lab/GRID"},{"id":"http://arxiv.org/abs/2501.05757v1","updated":"2025-01-10T07:19:41Z","published":"2025-01-10T07:19:41Z","title":"Locality-aware Gaussian Compression for Fast and High-quality Rendering","summary":" We present LocoGS, a locality-aware 3D Gaussian Splatting (3DGS) framework\nthat exploits the spatial coherence of 3D Gaussians for compact modeling of\nvolumetric scenes. To this end, we first analyze the local coherence of 3D\nGaussian attributes, and propose a novel locality-aware 3D Gaussian\nrepresentation that effectively encodes locally-coherent Gaussian attributes\nusing a neural field representation with a minimal storage requirement. On top\nof the novel representation, LocoGS is carefully designed with additional\ncomponents such as dense initialization, an adaptive spherical harmonics\nbandwidth scheme and different encoding schemes for different Gaussian\nattributes to maximize compression performance. Experimental results\ndemonstrate that our approach outperforms the rendering quality of existing\ncompact Gaussian representations for representative real-world 3D datasets\nwhile achieving from 54.6$\\times$ to 96.6$\\times$ compressed storage size and\nfrom 2.1$\\times$ to 2.4$\\times$ rendering speed than 3DGS. 
Our approach\nalso demonstrates an average 2.4$\\times$ higher rendering speed than the\nstate-of-the-art compression method with comparable compression performance.\n","authors":["Seungjoo Shin","Jaesik Park","Sunghyun Cho"],"pdf_url":"https://arxiv.org/pdf/2501.05757v1.pdf","comment":"28 pages, 15 figures, and 14 tables"},{"id":"http://arxiv.org/abs/2501.05750v1","updated":"2025-01-10T06:58:14Z","published":"2025-01-10T06:58:14Z","title":"Semantic Mapping in Indoor Embodied AI -- A Comprehensive Survey and\n Future Directions","summary":" Intelligent embodied agents (e.g. robots) need to perform complex semantic\ntasks in unfamiliar environments. Among many skills that the agents need to\npossess, building and maintaining a semantic map of the environment is most\ncrucial in long-horizon tasks. A semantic map captures information about the\nenvironment in a structured way, allowing the agent to reference it for\nadvanced reasoning throughout the task. While existing surveys in embodied AI\nfocus on general advancements or specific tasks like navigation and\nmanipulation, this paper provides a comprehensive review of semantic\nmap-building approaches in embodied AI, specifically for indoor navigation. We\ncategorize these approaches based on their structural representation (spatial\ngrids, topological graphs, dense point-clouds or hybrid maps) and the type of\ninformation they encode (implicit features or explicit environmental data). We\nalso explore the strengths and limitations of the map building techniques,\nhighlight current challenges, and propose future research directions. We\nidentify that the field is moving towards developing open-vocabulary,\nqueryable, task-agnostic map representations, while high memory demands and\ncomputational inefficiency still remain open challenges. 
This survey\naims to guide current and future researchers in advancing semantic mapping\ntechniques for embodied AI systems.\n","authors":["Sonia Raychaudhuri","Angel X. Chang"],"pdf_url":"https://arxiv.org/pdf/2501.05750v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05744v1","updated":"2025-01-10T06:20:27Z","published":"2025-01-10T06:20:27Z","title":"LLVD: LSTM-based Explicit Motion Modeling in Latent Space for Blind\n Video Denoising","summary":" Video restoration plays a pivotal role in revitalizing degraded video content\nby rectifying imperfections caused by various degradations introduced during\ncapturing (sensor noise, motion blur, etc.), saving/sharing (compression,\nresizing, etc.) and editing. This paper introduces a novel algorithm designed\nfor scenarios where noise is introduced during video capture, aiming to enhance\nthe visual quality of videos by reducing unwanted noise artifacts. We propose\nthe Latent space LSTM Video Denoiser (LLVD), an end-to-end blind denoising\nmodel. LLVD uniquely combines spatial and temporal feature extraction,\nemploying Long Short Term Memory (LSTM) within the encoded feature domain. This\nintegration of LSTM layers is crucial for maintaining continuity and minimizing\nflicker in the restored video. Moreover, processing frames in the encoded\nfeature domain significantly reduces computations, resulting in a very\nlightweight architecture. LLVD's blind nature makes it versatile for real,\nin-the-wild denoising scenarios where prior information about noise\ncharacteristics is not available. Experiments reveal that LLVD demonstrates\nexcellent performance for both synthetic and captured noise. 
Specifically, LLVD\nsurpasses the current State-Of-The-Art (SOTA) in RAW denoising by 0.3dB, while\nalso achieving a 59\\% reduction in computational complexity.\n","authors":["Loay Rashid","Siddharth Roheda","Amit Unde"],"pdf_url":"https://arxiv.org/pdf/2501.05744v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05733v1","updated":"2025-01-10T06:02:06Z","published":"2025-01-10T06:02:06Z","title":"TB-Bench: Training and Testing Multi-Modal AI for Understanding\n Spatio-Temporal Traffic Behaviors from Dashcam Images/Videos","summary":" The application of Multi-modal Large Language Models (MLLMs) in Autonomous\nDriving (AD) faces significant challenges due to their limited training on\ntraffic-specific data and the absence of dedicated benchmarks for\nspatiotemporal understanding. This study addresses these issues by proposing\nTB-Bench, a comprehensive benchmark designed to evaluate MLLMs on understanding\ntraffic behaviors across eight perception tasks from ego-centric views. We also\nintroduce vision-language instruction tuning datasets, TB-100k and TB-250k,\nalong with simple yet effective baselines for the tasks. Through extensive\nexperiments, we show that existing MLLMs underperform in these tasks, with even\na powerful model like GPT-4o achieving less than 35% accuracy on average. 
In\ncontrast, when fine-tuned with TB-100k or TB-250k, our baseline models achieve\naverage accuracy up to 85%, significantly enhancing performance on the tasks.\nAdditionally, we demonstrate performance transfer by co-training TB-100k with\nanother traffic dataset, leading to improved performance on the latter.\nOverall, this study represents a step forward by introducing a comprehensive\nbenchmark, high-quality datasets, and baselines, thus supporting the gradual\nintegration of MLLMs into the perception, prediction, and planning stages of\nAD.\n","authors":["Korawat Charoenpitaks","Van-Quang Nguyen","Masanori Suganuma","Kentaro Arai","Seiji Totsuka","Hiroshi Ino","Takayuki Okatani"],"pdf_url":"https://arxiv.org/pdf/2501.05733v1.pdf","comment":"Main Paper: 8 pages, Supplementary Materials: 15 pages"},{"id":"http://arxiv.org/abs/2501.05728v1","updated":"2025-01-10T05:53:32Z","published":"2025-01-10T05:53:32Z","title":"Super-class guided Transformer for Zero-Shot Attribute Classification","summary":" Attribute classification is crucial for identifying specific characteristics\nwithin image regions. Vision-Language Models (VLMs) have been effective in\nzero-shot tasks by leveraging their general knowledge from large-scale\ndatasets. Recent studies demonstrate that transformer-based models with\nclass-wise queries can effectively address zero-shot multi-label\nclassification. However, poor utilization of the relationship between seen and\nunseen attributes makes the model lack generalizability. Additionally,\nattribute classification generally involves many attributes, making maintaining\nthe model's scalability difficult. To address these issues, we propose\nSuper-class guided transFormer (SugaFormer), a novel framework that leverages\nsuper-classes to enhance scalability and generalizability for zero-shot\nattribute classification. 
SugaFormer employs Super-class Query Initialization\n(SQI) to reduce the number of queries, utilizing common semantic information\nfrom super-classes, and incorporates Multi-context Decoding (MD) to handle\ndiverse visual cues. To strengthen generalizability, we introduce two knowledge\ntransfer strategies that utilize VLMs. During training, Super-class guided\nConsistency Regularization (SCR) aligns SugaFormer's features with VLMs using\nregion-specific prompts, and during inference, Zero-shot Retrieval-based Score\nEnhancement (ZRSE) refines predictions for unseen attributes. Extensive\nexperiments demonstrate that SugaFormer achieves state-of-the-art performance\nacross three widely-used attribute classification benchmarks under zero-shot\nand cross-dataset transfer settings. Our code is available at\nhttps://github.com/mlvlab/SugaFormer.\n","authors":["Sehyung Kim","Chanhyeong Yang","Jihwan Park","Taehoon Song","Hyunwoo J. Kim"],"pdf_url":"https://arxiv.org/pdf/2501.05728v1.pdf","comment":"AAAI25"},{"id":"http://arxiv.org/abs/2501.05717v1","updated":"2025-01-10T05:29:09Z","published":"2025-01-10T05:29:09Z","title":"Zero-shot Shark Tracking and Biometrics from Aerial Imagery","summary":" The recent widespread adoption of drones for studying marine animals provides\nopportunities for deriving biological information from aerial imagery. The\nlarge scale of imagery data acquired from drones is well suited for machine\nlearning (ML) analysis. Development of ML models for analyzing marine animal\naerial imagery has followed the classical paradigm of training, testing, and\ndeploying a new model for each dataset, requiring significant time, human\neffort, and ML expertise. We introduce Frame Level ALIgnment and tRacking\n(FLAIR), which leverages the video understanding of Segment Anything Model 2\n(SAM2) and the vision-language capabilities of Contrastive Language-Image\nPre-training (CLIP). 
FLAIR takes a drone video as input and outputs\nsegmentation masks of the species of interest across the video. Notably, FLAIR\nleverages a zero-shot approach, eliminating the need for labeled data, training\na new model, or fine-tuning an existing model to generalize to other species.\nWith a dataset of 18,000 drone images of Pacific nurse sharks, we trained\nstate-of-the-art object detection models to compare against FLAIR. We show that\nFLAIR massively outperforms these object detectors and performs competitively\nagainst two human-in-the-loop methods for prompting SAM2, achieving a Dice\nscore of 0.81. FLAIR readily generalizes to other shark species without\nadditional human effort and can be combined with novel heuristics to\nautomatically extract relevant information including length and tailbeat\nfrequency. FLAIR has significant potential to accelerate aerial imagery\nanalysis workflows, requiring markedly less human effort and expertise than\ntraditional machine learning workflows, while achieving superior accuracy. By\nreducing the effort required for aerial imagery analysis, FLAIR allows\nscientists to spend more time interpreting results and deriving insights about\nmarine ecosystems.\n","authors":["Chinmay K Lalgudi","Mark E Leone","Jaden V Clark","Sergio Madrigal-Mora","Mario Espinoza"],"pdf_url":"https://arxiv.org/pdf/2501.05717v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2212.12322v3","updated":"2025-01-10T05:28:08Z","published":"2022-12-22T08:33:32Z","title":"Infrared Image Super-Resolution: Systematic Review, and Future Trends","summary":" Image Super-Resolution (SR) is essential for a wide range of computer vision\nand image processing tasks. Investigating infrared (IR) image (or thermal\nimages) super-resolution is a continuing concern within the development of deep\nlearning. 
This survey aims to provide a comprehensive perspective of IR image\nsuper-resolution, including its applications, hardware imaging system dilemmas,\nand taxonomy of image processing methodologies. In addition, the datasets and\nevaluation metrics in IR image super-resolution tasks are also discussed.\nFurthermore, the deficiencies in current technologies and possible promising\ndirections for the community to explore are highlighted. To cope with the rapid\ndevelopment in this field, we intend to regularly update the relevant excellent\nwork at \\url{https://github.com/yongsongH/Infrared_Image_SR_Survey}.\n","authors":["Yongsong Huang","Tomo Miyazaki","Xiaofeng Liu","Shinichiro Omachi"],"pdf_url":"https://arxiv.org/pdf/2212.12322v3.pdf","comment":"This work has been submitted to the IEEE for possible publication"},{"id":"http://arxiv.org/abs/2404.11615v2","updated":"2025-01-10T05:22:40Z","published":"2024-04-17T17:59:59Z","title":"Factorized Diffusion: Perceptual Illusions by Noise Decomposition","summary":" Given a factorization of an image into a sum of linear components, we present\na zero-shot method to control each individual component through diffusion model\nsampling. For example, we can decompose an image into low and high spatial\nfrequencies and condition these components on different text prompts. This\nproduces hybrid images, which change appearance depending on viewing distance.\nBy decomposing an image into three frequency subbands, we can generate hybrid\nimages with three prompts. We also use a decomposition into grayscale and color\ncomponents to produce images whose appearance changes when they are viewed in\ngrayscale, a phenomenon that naturally occurs under dim lighting. And we explore\na decomposition by a motion blur kernel, which produces images that change\nappearance under motion blurring. Our method works by denoising with a\ncomposite noise estimate, built from the components of noise estimates\nconditioned on different prompts. 
We also show that for certain decompositions,\nour method recovers prior approaches to compositional generation and spatial\ncontrol. Finally, we show that we can extend our approach to generate hybrid\nimages from real images. We do this by holding one component fixed and\ngenerating the remaining components, effectively solving an inverse problem.\n","authors":["Daniel Geng","Inbum Park","Andrew Owens"],"pdf_url":"https://arxiv.org/pdf/2404.11615v2.pdf","comment":"ECCV 2024 camera ready version + more readable size"},{"id":"http://arxiv.org/abs/2501.05711v1","updated":"2025-01-10T05:01:58Z","published":"2025-01-10T05:01:58Z","title":"From My View to Yours: Ego-Augmented Learning in Large Vision Language\n Models for Understanding Exocentric Daily Living Activities","summary":" Large Vision Language Models (LVLMs) have demonstrated impressive\ncapabilities in video understanding, yet their adoption for Activities of Daily\nLiving (ADL) remains limited by their inability to capture fine-grained\ninteractions and spatial relationships. This limitation is particularly evident\nin ADL tasks, where understanding detailed human-object interaction and\nhuman-centric motion is crucial for applications such as elderly monitoring and\ncognitive assessment. To address this, we aim to leverage the complementary\nnature of egocentric views to enhance LVLM's understanding of exocentric ADL\nvideos. Consequently, we propose an online ego2exo distillation approach to\nlearn ego-augmented exo representations in LVLMs. While effective, this\napproach requires paired ego-exo training data, which is impractical to collect\nfor real-world ADL scenarios. Consequently, we develop EgoMimic, a\nskeleton-guided method that can generate mimicked ego views from exocentric\nvideos. 
We find that the exo representations of our ego-augmented LVLMs\nsuccessfully learn to extract ego-perspective cues, demonstrated through\ncomprehensive evaluation on six ADL benchmarks and our proposed\nEgoPerceptionMCQ benchmark designed specifically to assess egocentric\nunderstanding from exocentric videos. Code, models, and data will be\nopen-sourced at https://github.com/dominickrei/EgoExo4ADL.\n","authors":["Dominick Reilly","Manish Kumar Govind","Srijan Das"],"pdf_url":"https://arxiv.org/pdf/2501.05711v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04934v2","updated":"2025-01-10T04:43:24Z","published":"2025-01-09T02:52:30Z","title":"Plug-and-Play DISep: Separating Dense Instances for Scene-to-Pixel\n Weakly-Supervised Change Detection in High-Resolution Remote Sensing Images","summary":" Existing Weakly-Supervised Change Detection (WSCD) methods often encounter\nthe problem of \"instance lumping\" under scene-level supervision, particularly\nin scenarios with a dense distribution of changed instances (i.e., changed\nobjects). In these scenarios, unchanged pixels between changed instances are\nalso mistakenly identified as changed, causing multiple changes to be\nmistakenly viewed as one. In practical applications, this issue prevents the\naccurate quantification of the number of changes. To address this issue, we\npropose a Dense Instance Separation (DISep) method as a plug-and-play solution,\nrefining pixel features from a unified instance perspective under scene-level\nsupervision. Specifically, our DISep comprises a three-step iterative training\nprocess: 1) Instance Localization: We locate instance candidate regions for\nchanged pixels using high-pass class activation maps. 2) Instance Retrieval: We\nidentify and group these changed pixels into different instance IDs through\nconnectivity searching. Then, based on the assigned instance IDs, we extract\ncorresponding pixel-level features on a per-instance basis. 
3) Instance\nSeparation: We introduce a separation loss to enforce intra-instance pixel\nconsistency in the embedding space, thereby ensuring separable instance feature\nrepresentations. The proposed DISep adds only minimal training cost and no\ninference cost. It can be seamlessly integrated to enhance existing WSCD\nmethods. We achieve state-of-the-art performance by enhancing three\nTransformer-based and four ConvNet-based methods on the LEVIR-CD, WHU-CD,\nDSIFN-CD, SYSU-CD, and CDD datasets. Additionally, our DISep can be used to\nimprove fully-supervised change detection methods. Code is available at\nhttps://github.com/zhenghuizhao/Plug-and-Play-DISep-for-Change-Detection.\n","authors":["Zhenghui Zhao","Chen Wu","Lixiang Ru","Di Wang","Hongruixuan Chen","Cuiqun Chen"],"pdf_url":"https://arxiv.org/pdf/2501.04934v2.pdf","comment":"Accepted by ISPRS Journal of Photogrammetry and Remote Sensing"},{"id":"http://arxiv.org/abs/2501.05710v1","updated":"2025-01-10T04:41:37Z","published":"2025-01-10T04:41:37Z","title":"EmotiCrafter: Text-to-Emotional-Image Generation based on\n Valence-Arousal Model","summary":" Recent research shows that emotions can enhance users' cognition and\ninfluence information communication. While research on visual emotion analysis\nis extensive, limited work has been done on helping users generate emotionally\nrich image content. Existing work on emotional image generation relies on\ndiscrete emotion categories, making it challenging to capture complex and\nsubtle emotional nuances accurately. Additionally, these methods struggle to\ncontrol the specific content of generated images based on text prompts. 
In this\nwork, we introduce the new task of continuous emotional image content\ngeneration (C-EICG) and present EmotiCrafter, an emotional image generation\nmodel that generates images based on text prompts and Valence-Arousal values.\nSpecifically, we propose a novel emotion-embedding mapping network that embeds\nValence-Arousal values into textual features, enabling the capture of specific\nemotions in alignment with intended input prompts. Additionally, we introduce a\nloss function to enhance emotion expression. The experimental results show that\nour method effectively generates images representing specific emotions with the\ndesired content and outperforms existing techniques.\n","authors":["Yi He","Shengqi Dang","Long Ling","Ziqing Qian","Nanxuan Zhao","Nan Cao"],"pdf_url":"https://arxiv.org/pdf/2501.05710v1.pdf","comment":"11 pages, 8 figures"},{"id":"http://arxiv.org/abs/2404.15580v2","updated":"2025-01-10T04:23:51Z","published":"2024-04-24T01:14:33Z","title":"MiM: Mask in Mask Self-Supervised Pre-Training for 3D Medical Image\n Analysis","summary":" The Vision Transformer (ViT) has demonstrated remarkable performance in\nSelf-Supervised Learning (SSL) for 3D medical image analysis. Masked\nAutoEncoder (MAE) for feature pre-training can further unleash the potential of\nViT on various medical vision tasks. However, due to large spatial sizes with\nmuch higher dimensions of 3D medical images, the lack of hierarchical design\nfor MAE may hinder the performance of downstream tasks. In this paper, we\npropose a novel \\textit{Mask in Mask (MiM)} pre-training framework for 3D\nmedical images, which aims to advance MAE by learning discriminative\nrepresentation from hierarchical visual tokens across varying scales. We\nintroduce multiple levels of granularity for masked inputs from the volume,\nwhich are then reconstructed simultaneously ranging at both fine and coarse\nlevels. 
Additionally, a cross-level alignment mechanism is applied to adjacent\nlevel volumes to enforce anatomical similarity hierarchically. Furthermore, we\nadopt a hybrid backbone to enhance the hierarchical representation learning\nefficiently during the pre-training. MiM was pre-trained on a large-scale\ncollection of available 3D volumetric images, \\textit{i.e.,} Computed Tomography (CT) images\ncontaining various body parts. Extensive experiments on thirteen public\ndatasets demonstrate the superiority of MiM over other SSL methods in\norgan/lesion/tumor segmentation and disease classification. We further scale up\nthe MiM to large pre-training datasets with more than 10k volumes, showing that\nlarge-scale pre-training can further enhance the performance of downstream\ntasks. These improvements also suggest that the research community should pay\nmore attention to the scale of the pre-training dataset when building healthcare\nfoundation models for 3D medical images.\n","authors":["Jiaxin Zhuang","Linshan Wu","Qiong Wang","Peng Fei","Varut Vardhanabhuti","Lin Luo","Hao Chen"],"pdf_url":"https://arxiv.org/pdf/2404.15580v2.pdf","comment":"submitted to a journal, updated v2"},{"id":"http://arxiv.org/abs/2412.13717v2","updated":"2025-01-10T04:09:17Z","published":"2024-12-18T10:55:58Z","title":"Towards Automatic Evaluation for Image Transcreation","summary":" Beyond conventional paradigms of translating speech and text, recently, there\nhas been interest in automated transcreation of images to facilitate\nlocalization of visual content across different cultures. Attempts to define\nthis as a formal Machine Learning (ML) problem have been impeded by the lack of\nautomatic evaluation mechanisms, with previous work relying solely on human\nevaluation. 
In this paper, we seek to close this gap by proposing a suite of\nautomatic evaluation metrics inspired by machine translation (MT) metrics,\ncategorized into: a) Object-based, b) Embedding-based, and c) VLM-based.\nDrawing on theories from translation studies and real-world transcreation\npractices, we identify three critical dimensions of image transcreation:\ncultural relevance, semantic equivalence and visual similarity, and design our\nmetrics to evaluate systems along these axes. Our results show that proprietary\nVLMs best identify cultural relevance and semantic equivalence, while\nvision-encoder representations are adept at measuring visual similarity.\nMeta-evaluation across 7 countries shows our metrics agree strongly with human\nratings, with average segment-level correlations ranging from 0.55-0.87.\nFinally, through a discussion of the merits and demerits of each metric, we\noffer a robust framework for automated image transcreation evaluation, grounded\nin both theoretical foundations and practical application. Our code can be\nfound here: https://github.com/simran-khanuja/automatic-eval-transcreation\n","authors":["Simran Khanuja","Vivek Iyer","Claire He","Graham Neubig"],"pdf_url":"https://arxiv.org/pdf/2412.13717v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05690v1","updated":"2025-01-10T03:42:37Z","published":"2025-01-10T03:42:37Z","title":"Overcoming Language Priors for Visual Question Answering Based on\n Knowledge Distillation","summary":" Previous studies have pointed out that visual question answering (VQA) models\nare prone to relying on language priors for answer predictions. In this\ncontext, predictions often depend on linguistic shortcuts rather than a\ncomprehensive grasp of multimodal knowledge, which diminishes their\ngeneralization ability. In this paper, we propose a novel method, namely, KDAR,\nleveraging knowledge distillation to address the prior-dependency dilemmas\nwithin the VQA task. 
Specifically, the regularization effect facilitated by\nsoft labels from a well-trained teacher is employed to penalize overfitting to\nthe most common answers. The soft labels, which serve a regularization role,\nalso provide semantic guidance that narrows the range of candidate answers.\nAdditionally, we design an adaptive sample-wise reweighting learning strategy\nto further mitigate bias by dynamically adjusting the importance of each\nsample. Experimental results demonstrate that our method enhances performance\nin both OOD and IID settings. Our method achieves state-of-the-art performance\non the VQA-CPv2 out-of-distribution (OOD) benchmark, significantly\noutperforming previous state-of-the-art approaches.\n","authors":["Daowan Peng","Wei Wei"],"pdf_url":"https://arxiv.org/pdf/2501.05690v1.pdf","comment":"Accepted to ICME2024"},{"id":"http://arxiv.org/abs/2501.05688v1","updated":"2025-01-10T03:41:03Z","published":"2025-01-10T03:41:03Z","title":"eKalibr: Dynamic Intrinsic Calibration for Event Cameras From First\n Principles of Events","summary":" The bio-inspired event camera has garnered extensive research attention in\nrecent years, owing to its significant potential derived from its high dynamic\nrange and low latency characteristics. Similar to the standard camera, the\nevent camera requires precise intrinsic calibration to facilitate further\nhigh-level visual applications, such as pose estimation and mapping. While\nseveral calibration methods for event cameras have been proposed, most of them\nare either (i) engineering-driven, heavily relying on conventional image-based\ncalibration pipelines, or (ii) inconvenient, requiring complex instrumentation.\nTo this end, we propose an accurate and convenient intrinsic calibration method\nfor event cameras, named eKalibr, which builds upon a carefully designed\nevent-based circle grid pattern recognition algorithm. 
To extract target\npatterns from events, we perform event-based normal flow estimation to identify\npotential events generated by circle edges, and cluster them spatially.\nSubsequently, event clusters associated with the same grid circles are matched\nand grouped using normal flows, for subsequent time-varying ellipse estimation.\nFitted ellipse centers are time-synchronized, for final grid pattern\nrecognition. We conducted extensive experiments to evaluate the performance of\neKalibr in terms of pattern extraction and intrinsic calibration. The\nimplementation of eKalibr is open-sourced at\n(https://github.com/Unsigned-Long/eKalibr) to benefit the research community.\n","authors":["Shuolong Chen","Xingxing Li","Liu Yuan","Ziao Liu"],"pdf_url":"https://arxiv.org/pdf/2501.05688v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05687v1","updated":"2025-01-10T03:38:16Z","published":"2025-01-10T03:38:16Z","title":"UniQ: Unified Decoder with Task-specific Queries for Efficient Scene\n Graph Generation","summary":" Scene Graph Generation (SGG) is a scene understanding task that aims at\nidentifying object entities and reasoning their relationships within a given\nimage. In contrast to prevailing two-stage methods based on a large object\ndetector (e.g., Faster R-CNN), one-stage methods integrate a fixed-size set of\nlearnable queries to jointly reason relational triplets. This paradigm demonstrates robust performance with significantly\nreduced parameters and computational overhead. However, the challenge in\none-stage methods stems from the issue of weak entanglement, wherein entities\ninvolved in relationships require both coupled features shared within triplets\nand decoupled visual features. Previous methods either adopt a single decoder\nfor coupled triplet feature modeling or multiple decoders for separate visual\nfeature extraction but fail to consider both. 
In this paper, we introduce UniQ,\na Unified decoder with task-specific Queries architecture, where task-specific\nqueries generate decoupled visual features for subjects, objects, and\npredicates respectively, and a unified decoder enables coupled feature modeling\nwithin relational triplets. Experimental results on the Visual Genome dataset\ndemonstrate that UniQ has superior performance to both one-stage and two-stage\nmethods.\n","authors":["Xinyao Liao","Wei Wei","Dangyang Chen","Yuanyuan Fu"],"pdf_url":"https://arxiv.org/pdf/2501.05687v1.pdf","comment":"10 pages, 5 figures"},{"id":"http://arxiv.org/abs/2501.05686v1","updated":"2025-01-10T03:35:22Z","published":"2025-01-10T03:35:22Z","title":"Deep Reversible Consistency Learning for Cross-modal Retrieval","summary":" Cross-modal retrieval (CMR) typically involves learning common\nrepresentations to directly measure similarities between multimodal samples.\nMost existing CMR methods commonly assume multimodal samples in pairs and\nemploy joint training to learn common representations, limiting the flexibility\nof CMR. Although some methods adopt independent training strategies for each\nmodality to improve flexibility in CMR, they utilize the randomly initialized\northogonal matrices to guide representation learning, which is suboptimal since\nthey assume inter-class samples are independent of each other, limiting the\npotential of semantic alignments between sample representations and\nground-truth labels. To address these issues, we propose a novel method termed\nDeep Reversible Consistency Learning (DRCL) for cross-modal retrieval. DRCL\nincludes two core modules, \\textit{i.e.,} Selective Prior Learning (SPL) and Reversible\nSemantic Consistency learning (RSC). More specifically, SPL first learns a\ntransformation weight matrix on each modality and selects the best one based on\nthe quality score as the Prior, which greatly avoids blind selection of priors\nlearned from low-quality modalities. 
Then, RSC employs a Modality-invariant\nRepresentation Recasting mechanism (MRR) to recast the potential\nmodality-invariant representations from sample semantic labels by the\ngeneralized inverse matrix of the prior. Since labels are devoid of\nmodal-specific information, we utilize the recast features to guide the\nrepresentation learning, thus maintaining semantic consistency to the fullest\nextent possible. In addition, a feature augmentation mechanism (FA) is\nintroduced in RSC to encourage the model to learn over a wider data\ndistribution for diversity. Finally, extensive experiments conducted on five\nwidely used datasets and comparisons with 15 state-of-the-art baselines\ndemonstrate the effectiveness and superiority of our DRCL.\n","authors":["Ruitao Pu","Yang Qin","Dezhong Peng","Xiaomin Song","Huiming Zheng"],"pdf_url":"https://arxiv.org/pdf/2501.05686v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.01148v3","updated":"2025-01-10T03:28:00Z","published":"2024-09-02T10:33:45Z","title":"FMRFT: Fusion Mamba and DETR for Query Time Sequence Intersection Fish\n Tracking","summary":" Early detection of abnormal fish behavior caused by disease or hunger can be\nachieved through fish tracking using deep learning techniques, which holds\nsignificant value for industrial aquaculture. However, underwater reflections\nand fish-specific factors, such as high visual similarity, rapid swimming caused\nby stimuli, and mutual occlusion, bring challenges to multi-target tracking of\nfish. To address these challenges, this paper establishes a complex\nmulti-scenario sturgeon tracking dataset and introduces the FMRFT model, a\nreal-time end-to-end fish tracking solution. The model incorporates the low\nvideo memory consumption Mamba In Mamba (MIM) architecture, which facilitates\nmulti-frame temporal memory and feature extraction, thereby addressing the\nchallenges of tracking multiple fish across frames. 
Additionally, the FMRFT model\nwith the Query Time Sequence Intersection (QTSI) module effectively manages\noccluded objects and reduces redundant tracking frames using the superior\nfeature interaction and prior frame processing capabilities of RT-DETR. This\ncombination significantly enhances the accuracy and stability of fish tracking.\nTrained and tested on the dataset, the model achieves an IDF1 score of 90.3%\nand a MOTA accuracy of 94.3%. Experimental results show that the proposed FMRFT\nmodel effectively addresses the challenges of high similarity and mutual\nocclusion in fish populations, enabling accurate tracking in factory farming\nenvironments.\n","authors":["Mingyuan Yao","Yukang Huo","Qingbin Tian","Jiayin Zhao","Xiao Liu","Ruifeng Wang","Lin Xue","Haihua Wang"],"pdf_url":"https://arxiv.org/pdf/2409.01148v3.pdf","comment":"14 pages,14 figures"},{"id":"http://arxiv.org/abs/2501.04608v2","updated":"2025-01-10T03:08:11Z","published":"2025-01-08T16:44:06Z","title":"Comprehensive Examination of Unrolled Networks for Solving Linear\n Inverse Problems","summary":" Unrolled networks have become prevalent in various computer vision and\nimaging tasks. Although they have demonstrated remarkable efficacy in solving\nspecific computer vision and computational imaging tasks, their adaptation to\nother applications presents considerable challenges. This is primarily due to\nthe multitude of design decisions that practitioners working on new\napplications must navigate, each potentially affecting the network's overall\nperformance. These decisions include selecting the optimization algorithm,\ndefining the loss function, and determining the number of convolutional layers,\namong others. Compounding the issue, evaluating each design choice requires\ntime-consuming simulations to train, fine-tune the neural network, and optimize\nfor its performance. 
As a result, the process of exploring multiple options and\nidentifying the optimal configuration becomes time-consuming and\ncomputationally demanding. The main objectives of this paper are (1) to unify\nsome ideas and methodologies used in unrolled networks to reduce the number of\ndesign choices a user has to make, and (2) to report a comprehensive ablation\nstudy to discuss the impact of each of the choices involved in designing\nunrolled networks and present practical recommendations based on our findings.\nWe anticipate that this study will help scientists and engineers design\nunrolled networks for their applications and diagnose problems within their\nnetworks efficiently.\n","authors":["Eric Chen","Xi Chen","Arian Maleki","Shirin Jalali"],"pdf_url":"https://arxiv.org/pdf/2501.04608v2.pdf","comment":"27 pages, 10 figures. Project Page:\n https://github.com/YuxiChen25/Memory-Net-Inverse"},{"id":"http://arxiv.org/abs/2501.05669v1","updated":"2025-01-10T02:36:37Z","published":"2025-01-10T02:36:37Z","title":"LPRnet: A self-supervised registration network for LiDAR and\n photogrammetric point clouds","summary":" LiDAR and photogrammetry are active and passive remote sensing techniques for\npoint cloud acquisition, respectively, offering complementary advantages but\nheterogeneous characteristics. Due to the fundamental differences in sensing\nmechanisms, spatial distributions and coordinate systems, their point clouds\nexhibit significant discrepancies in density, precision, noise, and overlap.\nCoupled with the lack of ground truth for large-scale scenes, integrating the\nheterogeneous point clouds is a highly challenging task. This paper proposes a\nself-supervised registration network based on a masked autoencoder, focusing on\nheterogeneous LiDAR and photogrammetric point clouds. At its core, the method\nintroduces a multi-scale masked training strategy to extract robust features\nfrom heterogeneous point clouds under self-supervision. 
To further enhance\nregistration performance, a rotation-translation embedding module is designed\nto effectively capture the key features essential for accurate rigid\ntransformations. Building upon the robust representations, a transformer-based\narchitecture seamlessly integrates local and global features, fostering precise\nalignment across diverse point cloud datasets. The proposed method demonstrates\nstrong feature extraction capabilities for both LiDAR and photogrammetric point\nclouds, addressing the challenges of acquiring ground truth at the scene level.\nExperiments conducted on two real-world datasets validate the effectiveness of\nthe proposed method in solving heterogeneous point cloud registration problems.\n","authors":["Chen Wang","Yanfeng Gu","Xian Li"],"pdf_url":"https://arxiv.org/pdf/2501.05669v1.pdf","comment":"12 pages, 9 figures, 5 tables"},{"id":"http://arxiv.org/abs/2409.12953v4","updated":"2025-01-10T02:31:03Z","published":"2024-09-19T17:58:16Z","title":"JourneyBench: A Challenging One-Stop Vision-Language Understanding\n Benchmark of Generated Images","summary":" Existing vision-language understanding benchmarks largely consist of images\nof objects in their usual contexts. As a consequence, recent multimodal large\nlanguage models can perform well with only a shallow visual understanding by\nrelying on background language biases. Thus, strong performance on these\nbenchmarks does not necessarily correlate with strong visual understanding. In\nthis paper, we release JourneyBench, a comprehensive human-annotated benchmark\nof generated images designed to assess the model's fine-grained multimodal\nreasoning abilities across five tasks: complementary multimodal chain of\nthought, multi-image VQA, imaginary image captioning, VQA with hallucination\ntriggers, and fine-grained retrieval with sample-specific distractors. 
Unlike\nexisting benchmarks, JourneyBench explicitly requires fine-grained multimodal\nreasoning in unusual imaginary scenarios where language bias and holistic image\ngist are insufficient. We benchmark state-of-the-art models on JourneyBench and\nanalyze performance along a number of fine-grained dimensions. Results across\nall five tasks show that JourneyBench is exceptionally challenging for even the\nbest models, indicating that models' visual reasoning abilities are not as\nstrong as they first appear. We discuss the implications of our findings and\npropose avenues for further research.\n","authors":["Zhecan Wang","Junzhang Liu","Chia-Wei Tang","Hani Alomari","Anushka Sivakumar","Rui Sun","Wenhao Li","Md. Atabuzzaman","Hammad Ayyubi","Haoxuan You","Alvi Ishmam","Kai-Wei Chang","Shih-Fu Chang","Chris Thomas"],"pdf_url":"https://arxiv.org/pdf/2409.12953v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.14080v4","updated":"2025-01-10T02:12:05Z","published":"2024-06-20T07:56:51Z","title":"CMTNet: Convolutional Meets Transformer Network for Hyperspectral Images\n Classification","summary":" Hyperspectral remote sensing (HSI) enables the detailed capture of spectral\ninformation from the Earth's surface, facilitating precise classification and\nidentification of surface crops due to its superior spectral diagnostic\ncapabilities. However, current convolutional neural networks (CNNs) focus on\nlocal features in hyperspectral data, leading to suboptimal performance when\nclassifying intricate crop types and addressing imbalanced sample\ndistributions. In contrast, the Transformer framework excels at extracting\nglobal features from hyperspectral imagery. To leverage the strengths of both\napproaches, this research introduces the Convolutional Meets Transformer\nNetwork (CMTNet). 
This innovative model includes a spectral-spatial feature extraction\nmodule for shallow feature capture, a dual-branch structure combining CNN and\nTransformer branches for local and global feature extraction, and a\nmulti-output constraint module that enhances classification accuracy through\nmulti-output loss calculations and cross constraints across local, global,\nand joint features. Extensive experiments conducted on three\ndatasets (WHU-Hi-LongKou, WHU-Hi-HanChuan, and WHU-Hi-HongHu) demonstrate that\nCMTNet significantly outperforms other state-of-the-art networks in\nclassification performance, validating its effectiveness in hyperspectral crop\nclassification.\n","authors":["Faxu Guo","Quan Feng","Sen Yang","Wanxia Yang"],"pdf_url":"https://arxiv.org/pdf/2406.14080v4.pdf","comment":"We have decided to withdraw this article due to significant\n adjustments in the research direction. The current manuscript no longer\n reflects the final conclusions of our study. We plan to revise and resubmit\n the work in the future."},{"id":"http://arxiv.org/abs/2412.20006v2","updated":"2025-01-10T01:09:37Z","published":"2024-12-28T04:06:29Z","title":"Adversarial Robustness for Deep Learning-based Wildfire Prediction\n Models","summary":" Smoke detection using Deep Neural Networks (DNNs) is an effective approach\nfor early wildfire detection. However, because smoke is temporally and\nspatially anomalous, there are limitations in collecting sufficient training\ndata. This raises overfitting and bias concerns in existing DNN-based wildfire\ndetection models. Thus, we introduce WARP (Wildfire Adversarial Robustness\nProcedure), the first model-agnostic framework for evaluating the adversarial\nrobustness of DNN-based wildfire detection models. WARP addresses limitations\nin smoke image diversity using global and local adversarial attack methods. 
The\nglobal attack method uses image-contextualized Gaussian noise, while the local\nattack method uses patch noise injection, tailored to address critical aspects\nof wildfire detection. Leveraging WARP's model-agnostic capabilities, we assess\nthe adversarial robustness of real-time Convolutional Neural Networks (CNNs)\nand Transformers. The analysis revealed valuable insights into the models'\nlimitations. Specifically, the global attack method demonstrates that the\nTransformer model suffers over 70% more precision degradation than the CNN\nagainst global noise. In contrast, the local attack method shows that both models are\nsusceptible to cloud image injections when detecting smoke-positive instances,\nsuggesting a need for model improvements through data augmentation. WARP's\ncomprehensive robustness analysis contributed to the development of\nwildfire-specific data augmentation strategies, marking a step toward\npracticality.\n","authors":["Ryo Ide","Lei Yang"],"pdf_url":"https://arxiv.org/pdf/2412.20006v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05488v2","updated":"2025-01-10T00:58:28Z","published":"2024-12-07T01:19:14Z","title":"Enhancing Sample Generation of Diffusion Models using Noise Level\n Correction","summary":" The denoising process of diffusion models can be interpreted as an\napproximate projection of noisy samples onto the data manifold. Moreover, the\nnoise level in these samples approximates their distance to the underlying\nmanifold. Building on this insight, we propose a novel method to enhance sample\ngeneration by aligning the estimated noise level with the true distance of\nnoisy samples to the manifold. Specifically, we introduce a noise level\ncorrection network, leveraging a pre-trained denoising network, to refine noise\nlevel estimates during the denoising process. 
Additionally, we extend this\napproach to various image restoration tasks by integrating task-specific\nconstraints, including inpainting, deblurring, super-resolution, colorization,\nand compressed sensing. Experimental results demonstrate that our method\nsignificantly improves sample quality in both unconstrained and constrained\ngeneration scenarios. Notably, the proposed noise level correction framework is\ncompatible with existing denoising schedulers (e.g., DDIM), offering additional\nperformance improvements.\n","authors":["Abulikemu Abuduweili","Chenyang Yuan","Changliu Liu","Frank Permenter"],"pdf_url":"https://arxiv.org/pdf/2412.05488v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05631v1","updated":"2025-01-10T00:20:29Z","published":"2025-01-10T00:20:29Z","title":"HFMF: Hierarchical Fusion Meets Multi-Stream Models for Deepfake\n Detection","summary":" The rapid progress in deep generative models has led to the creation of\nincredibly realistic synthetic images that are becoming increasingly difficult\nto distinguish from real-world data. The widespread use of Variational Models,\nDiffusion Models, and Generative Adversarial Networks has made it easier to\ngenerate convincing fake images and videos, which poses significant challenges\nfor detecting and mitigating the spread of misinformation. As a result,\ndeveloping effective methods for detecting AI-generated fakes has become a\npressing concern. In our research, we propose HFMF, a comprehensive two-stage\ndeepfake detection framework that leverages both hierarchical cross-modal\nfeature fusion and multi-stream feature extraction to enhance detection\nperformance against imagery produced by state-of-the-art generative AI models.\nThe first component of our approach integrates vision Transformers and\nconvolutional nets through a hierarchical feature fusion mechanism. The second\ncomponent of our framework combines object-level information and a fine-tuned\nconvolutional net model. 
We then fuse the outputs from both components via an\nensemble deep neural net, enabling robust classification performance. We\ndemonstrate that our architecture achieves superior performance across diverse\ndataset benchmarks while maintaining calibration and interpretability.\n","authors":["Anant Mehta","Bryant McArthur","Nagarjuna Kolloju","Zhengzhong Tu"],"pdf_url":"https://arxiv.org/pdf/2501.05631v1.pdf","comment":"This work is accepted to WACV 2025 Workshop on AI for Multimedia\n Forensics & Disinformation Detection. Code is available at:\n https://github.com/taco-group/HFMF"}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2405.12327v3","updated":"2025-01-10T18:30:45Z","published":"2024-05-20T18:52:33Z","title":"Beyond Item Dissimilarities: Diversifying by Intent in Recommender\n Systems","summary":" It has become increasingly clear that recommender systems that overly focus\non short-term engagement prevent users from exploring diverse interests,\nultimately hurting long-term user experience. To tackle this challenge,\nnumerous diversification algorithms have been proposed. These algorithms\ntypically rely on measures of item similarity, aiming to maximize the\ndissimilarity across items in the final set of recommendations. However, in\nthis work, we demonstrate the benefits of going beyond item-level similarities\nby utilizing higher-level user understanding--specifically, user intents that\npersist across multiple interactions--in diversification. Our approach is\nmotivated by the observation that user behaviors on online platforms are\nlargely driven by their underlying intents. Therefore, recommendations should\nensure that diverse user intents are accurately represented. While intent has\nprimarily been studied in the context of search, it is less clear how to\nincorporate real-time dynamic intent predictions into recommender systems. 
To\naddress this gap, we develop a probabilistic intent-based whole-page\ndiversification framework for the final stage of a recommender system. Starting\nwith a prior belief of user intents, the proposed framework sequentially\nselects items for each position based on these beliefs and subsequently updates\nposterior beliefs about the intents. This approach ensures that different user\nintents are represented on a page, towards optimizing long-term user\nexperience. We experiment with the intent diversification framework on YouTube,\nthe world's largest video recommendation platform, serving billions of users\ndaily. Live experiments on a diverse set of intents show that the proposed\nframework increases Daily Active Users (DAU) and overall user enjoyment,\nvalidating its effectiveness in facilitating long-term planning.\n","authors":["Yuyan Wang","Cheenar Banerjee","Samer Chucri","Fabio Soldo","Sriraj Badam","Ed H. Chi","Minmin Chen"],"pdf_url":"https://arxiv.org/pdf/2405.12327v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06121v1","updated":"2025-01-10T17:19:59Z","published":"2025-01-10T17:19:59Z","title":"kANNolo: Sweet and Smooth Approximate k-Nearest Neighbors Search","summary":" Approximate Nearest Neighbors (ANN) search is a crucial task in several\napplications like recommender systems and information retrieval. Current\nstate-of-the-art ANN libraries, although performance-oriented, often lack\nmodularity and ease of use. This makes them not fully suitable for easy\nprototyping and testing of research ideas, an important feature to enable. We\naddress these limitations by introducing kANNolo, a novel research-oriented ANN\nlibrary written in Rust and explicitly designed to combine usability with\nperformance effectively. kANNolo is the first ANN library that supports dense\nand sparse vector representations made available on top of different similarity\nmeasures, e.g., Euclidean distance and inner product. 
Moreover, it also supports vector quantization techniques, e.g.,\nProduct Quantization, on top of the indexing strategies implemented. These\nfunctionalities are managed through Rust traits, allowing shared behaviors to\nbe handled abstractly. This abstraction ensures flexibility and facilitates an\neasy integration of new components. In this work, we detail the architecture of\nkANNolo and demonstrate that its flexibility does not compromise performance.\nThe experimental analysis shows that kANNolo achieves state-of-the-art\nperformance in terms of speed-accuracy trade-off while allowing fast and easy\nprototyping, thus making kANNolo a valuable tool for advancing ANN research.\nSource code available on GitHub: https://github.com/TusKANNy/kannolo.\n","authors":["Leonardo Delfino","Domenico Erriquez","Silvio Martinico","Franco Maria Nardini","Cosimo Rulli","Rossano Venturini"],"pdf_url":"https://arxiv.org/pdf/2501.06121v1.pdf","comment":"7 pages, 3 figures"},{"id":"http://arxiv.org/abs/2501.05964v1","updated":"2025-01-10T13:46:23Z","published":"2025-01-10T13:46:23Z","title":"Recommender Systems for Social Good: The Role of Accountability and\n Sustainability","summary":" This work examines the role of recommender systems in promoting\nsustainability, social responsibility, and accountability, with a focus on\nalignment with the United Nations Sustainable Development Goals (SDGs). As\nrecommender systems become increasingly integrated into daily interactions,\nthey must go beyond personalization to support responsible consumption, reduce\nenvironmental impact, and foster social good. We explore strategies to mitigate\nthe carbon footprint of recommendation models, ensure fairness, and implement\naccountability mechanisms. 
By adopting these approaches, recommender systems\ncan contribute to sustainable and socially beneficial outcomes, aligning\ntechnological advancements with the SDGs focused on environmental\nsustainability and social well-being.\n","authors":["Alan Said"],"pdf_url":"https://arxiv.org/pdf/2501.05964v1.pdf","comment":"First International Workshop on Recommender Systems for\n Sustainability and Social Good (RecSoGood'24)"},{"id":"http://arxiv.org/abs/2501.05925v1","updated":"2025-01-10T12:44:46Z","published":"2025-01-10T12:44:46Z","title":"Navigating Tomorrow: Reliably Assessing Large Language Models\n Performance on Future Event Prediction","summary":" Predicting future events is an important activity with applications across\nmultiple fields and domains. For example, the capacity to foresee stock market\ntrends, natural disasters, business developments, or political events can\nfacilitate early preventive measures and uncover new opportunities. Multiple\ndiverse computational methods for attempting future predictions, including\npredictive analysis, time series forecasting, and simulations have been\nproposed. This study evaluates the performance of several large language models\n(LLMs) in supporting future prediction tasks, an under-explored domain. We\nassess the models across three scenarios: Affirmative vs. Likelihood\nquestioning, Reasoning, and Counterfactual analysis. For this, we create a\ndataset by finding and categorizing news articles based on entity type and its\npopularity. We gather news articles before and after the LLMs' training cutoff\ndate in order to thoroughly test and compare model performance. 
Our research\nhighlights LLMs' potential and limitations in predictive modeling, providing a\nfoundation for future improvements.\n","authors":["Petraq Nako","Adam Jatowt"],"pdf_url":"https://arxiv.org/pdf/2501.05925v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05894v1","updated":"2025-01-10T11:46:51Z","published":"2025-01-10T11:46:51Z","title":"Text2Playlist: Generating Personalized Playlists from Text on Deezer","summary":" The streaming service Deezer heavily relies on search to help users\nnavigate through its extensive music catalog. Nonetheless, it is primarily\ndesigned to find specific items and does not lead directly to a smooth\nlistening experience. We present Text2Playlist, a stand-alone tool that\naddresses these limitations. Text2Playlist leverages generative AI, music\ninformation retrieval and recommendation systems to generate query-specific and\npersonalized playlists, successfully deployed at scale.\n","authors":["Mathieu Delcluze","Antoine Khoury","Clémence Vast","Valerio Arnaudo","Léa Briand","Walid Bendada","Thomas Bouabça"],"pdf_url":"https://arxiv.org/pdf/2501.05894v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05874v1","updated":"2025-01-10T11:17:15Z","published":"2025-01-10T11:17:15Z","title":"VideoRAG: Retrieval-Augmented Generation over Video Corpus","summary":" Retrieval-Augmented Generation (RAG) is a powerful strategy to address the\nissue of generating factually incorrect outputs in foundation models by\nretrieving external knowledge relevant to queries and incorporating it into\ntheir generation process. However, existing RAG approaches have primarily\nfocused on textual information, with some recent advancements beginning to\nconsider images, and they largely overlook videos, a rich source of multimodal\nknowledge capable of representing events, processes, and contextual details\nmore effectively than any other modality. 
While a few recent studies explore\nthe integration of videos in the response generation process, they either\npredefine query-associated videos without retrieving them according to queries,\nor convert videos into textual descriptions without harnessing their\nmultimodal richness. To tackle these limitations, we introduce VideoRAG, a\nnovel framework that not only dynamically retrieves relevant videos based on\ntheir relevance to queries but also utilizes both visual and textual\ninformation of videos in the output generation. Further, to operationalize\nthis, our method revolves around the recent advance of Large Video Language\nModels (LVLMs), which enable the direct processing of video content to\nrepresent it for retrieval and seamless integration of the retrieved videos\njointly with queries. We experimentally validate the effectiveness of VideoRAG,\nshowcasing that it is superior to relevant baselines.\n","authors":["Soyeong Jeong","Kangsan Kim","Jinheon Baek","Sung Ju Hwang"],"pdf_url":"https://arxiv.org/pdf/2501.05874v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05813v1","updated":"2025-01-10T09:32:28Z","published":"2025-01-10T09:32:28Z","title":"Social web and Wikipedia: an opportunity to rethink the links between\n sources' credibility, trust and authority","summary":" The Web and its main tools (Google, Wikipedia, Facebook, Twitter) deeply\nraise and renew fundamental questions, that everyone asks almost every day: Is\nthis information or content true? Can I trust this author or source? These\nquestions are not new; they have been the same with books, newspapers,\nbroadcasting and television, and, more fundamentally, in every human\ninterpersonal communication. This paper is focused on two scientific problems\non this issue. The first one is theoretical: to address this issue, many\nconcepts have been used in library and information sciences, communication and\npsychology. 
The links between these concepts are not clear: sometimes two\nconcepts are considered synonymous, sometimes as very different. The second\none is historical: sources like Wikipedia deeply challenge the epistemic\nevaluation of information sources, compared to previous modes of information\nproduction. This paper proposes an integrated and simple model considering the\nrelation between a user, a document and an author as human communication. It\nreduces the problem to three concepts: credibility as a characteristic granted\nto information depending on its truth-value; trust as the ability to produce\ncredible information; authority when an author's power to influence is\naccepted, i.e., when readers accept that the source can modify their opinion,\nknowledge and decisions. The model also describes two kinds of relationships\nbetween the three concepts: an upward link and a downward link. The model is\nconfronted with findings of empirical research on Wikipedia in particular.\n","authors":["Gilles Sahut","André Tricot"],"pdf_url":"https://arxiv.org/pdf/2501.05813v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.15005v3","updated":"2025-01-10T01:52:48Z","published":"2024-11-22T15:29:05Z","title":"Multi-granularity Interest Retrieval and Refinement Network for\n Long-Term User Behavior Modeling in CTR Prediction","summary":" Click-through Rate (CTR) prediction is crucial for online personalization\nplatforms. Recent advancements have shown that modeling rich user behaviors can\nsignificantly improve the performance of CTR prediction. Current long-term user\nbehavior modeling algorithms predominantly follow two cascading stages. The\nfirst stage retrieves a subsequence related to the target item from the\nlong-term behavior sequence, while the second stage models the relationship\nbetween the subsequence and the target item. Despite significant progress,\nthese methods have two critical flaws. 
First, the retrieval query typically includes only\ntarget item information, limiting the ability to capture the user's diverse\ninterests. Second, relational information, such as sequential and interactive\ninformation within the subsequence, is frequently overlooked. Therefore, it\nneeds to be further mined to model user interests more accurately.\n To this end, we propose Multi-granularity Interest Retrieval and Refinement\nNetwork (MIRRN). Specifically, we first construct queries based on behaviors\nobserved at different time scales to obtain subsequences, each capturing users'\ninterest at various granularities. We then introduce a novel multi-head\nFourier transformer to efficiently learn sequential and interactive information\nwithin the subsequences, leading to more accurate modeling of user interests.\nFinally, we employ multi-head target attention to adaptively assess the impact\nof these multi-granularity interests on the target item. Extensive experiments\nhave demonstrated that MIRRN significantly outperforms state-of-the-art\nbaselines. Furthermore, an A/B test shows that MIRRN increases the average\nnumber of songs listened to by 1.32% and the average listening time by\n0.55% on the Huawei Music App. 
The implementation code is publicly available at\nhttps://github.com/USTC-StarTeam/MIRRN.\n","authors":["Xiang Xu","Hao Wang","Wei Guo","Luankang Zhang","Wanshan Yang","Runlong Yu","Yong Liu","Defu Lian","Enhong Chen"],"pdf_url":"https://arxiv.org/pdf/2411.15005v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05647v1","updated":"2025-01-10T01:27:12Z","published":"2025-01-10T01:27:12Z","title":"Collaboration of Large Language Models and Small Recommendation Models\n for Device-Cloud Recommendation","summary":" Large Language Models (LLMs) for Recommendation (LLM4Rec) is a promising\nresearch direction that has demonstrated exceptional performance in this field.\nHowever, its inability to capture real-time user preferences greatly limits the\npractical application of LLM4Rec because (i) LLMs are costly to train and infer\nfrequently, and (ii) LLMs struggle to access real-time data (their large number\nof parameters poses an obstacle to deployment on devices). Fortunately, small\nrecommendation models (SRMs) can effectively supplement these shortcomings of\nthe LLM4Rec paradigm by consuming minimal resources for frequent training and\ninference, and by conveniently accessing real-time data on devices.\n In light of this, we designed the Device-Cloud LLM-SRM Collaborative\nRecommendation Framework (LSC4Rec) under a device-cloud collaboration setting.\nLSC4Rec aims to integrate the advantages of both LLMs and SRMs, as well as the\nbenefits of cloud and edge computing, achieving a complementary synergy. We\nenhance the practicability of LSC4Rec by designing three strategies:\ncollaborative training, collaborative inference, and intelligent request.\nDuring training, LLM generates candidate lists to enhance the ranking ability\nof SRM in collaborative scenarios and enables SRM to update adaptively to\ncapture real-time user interests. During inference, LLM and SRM are deployed on\nthe cloud and on the device, respectively. 
LLM generates candidate lists and\ninitial ranking results based on user behavior, and SRM produces reranking\nresults based on the candidate list, with final results integrating both LLM's\nand SRM's scores. The device determines whether a new candidate list is needed by\ncomparing the consistency of the LLM's and SRM's sorted lists. Our\ncomprehensive and extensive experimental analysis validates the effectiveness\nof each strategy in LSC4Rec.\n","authors":["Zheqi Lv","Tianyu Zhan","Wenjie Wang","Xinyu Lin","Shengyu Zhang","Wenqiao Zhang","Jiwei Li","Kun Kuang","Fei Wu"],"pdf_url":"https://arxiv.org/pdf/2501.05647v1.pdf","comment":"Published on KDD'25: Proceedings of the ACM SIGKDD Conference on\n Knowledge Discovery and Data Mining 2025"},{"id":"http://arxiv.org/abs/2501.06365v1","updated":"2025-01-10T22:07:56Z","published":"2025-01-10T22:07:56Z","title":"Gender-Neutral Large Language Models for Medical Applications: Reducing\n Bias in PubMed Abstracts","summary":" This paper presents a pipeline for mitigating gender bias in large language\nmodels (LLMs) used in medical literature by neutralizing gendered occupational\npronouns. A dataset of 379,000 PubMed abstracts from 1965-1980 was processed to\nidentify and modify pronouns tied to professions. We developed a BERT-based\nmodel, ``Modern Occupational Bias Elimination with Refined Training,'' or\n``MOBERT,'' trained on these neutralized abstracts, and compared its\nperformance with ``1965Bert,'' trained on the original dataset. MOBERT achieved\na 70\\% inclusive replacement rate, while 1965Bert reached only 4\\%. A further\nanalysis of MOBERT revealed that pronoun replacement accuracy correlated with\nthe frequency of occupational terms in the training data. 
We propose expanding\nthe dataset and refining the pipeline to improve performance and ensure more\nequitable language modeling in medical applications.\n","authors":["Elizabeth Schaefer","Kirk Roberts"],"pdf_url":"https://arxiv.org/pdf/2501.06365v1.pdf","comment":"9 pages, 4 figures"},{"id":"http://arxiv.org/abs/2501.06362v1","updated":"2025-01-10T21:58:34Z","published":"2025-01-10T21:58:34Z","title":"Repeat-bias-aware Optimization of Beyond-accuracy Metrics for Next\n Basket Recommendation","summary":" In next basket recommendation (NBR) a set of items is recommended to users\nbased on their historical basket sequences. In many domains, the recommended\nbaskets consist of both repeat items and explore items. Some state-of-the-art\nNBR methods are heavily biased to recommend repeat items so as to maximize\nutility. The evaluation and optimization of beyond-accuracy objectives for NBR,\nsuch as item fairness and diversity, has attracted increasing attention. How\ncan such beyond-accuracy objectives be pursued in the presence of heavy repeat\nbias? We find that only optimizing diversity or item fairness without\nconsidering repeat bias may cause NBR algorithms to recommend more repeat\nitems. To solve this problem, we propose a model-agnostic repeat-bias-aware\noptimization algorithm to post-process the recommended results obtained from\nNBR methods with the objective of mitigating repeat bias when optimizing\ndiversity or item fairness. We consider multiple variations of our optimization\nalgorithm to cater to multiple NBR methods. 
Experiments on three real-world\ngrocery shopping datasets show that the proposed algorithms can effectively\nimprove diversity and item fairness, and mitigate repeat bias at an acceptable\nRecall loss.\n","authors":["Yuanna Liu","Ming Li","Mohammad Aliannejadi","Maarten de Rijke"],"pdf_url":"https://arxiv.org/pdf/2501.06362v1.pdf","comment":"This paper has been accepted as a full paper at the 47th European\n Conference on Information Retrieval (ECIR2025)"},{"id":"http://arxiv.org/abs/2412.16435v2","updated":"2025-01-10T19:49:12Z","published":"2024-12-21T01:52:03Z","title":"THeGCN: Temporal Heterophilic Graph Convolutional Network","summary":" Graph Neural Networks (GNNs) have exhibited remarkable efficacy in diverse\ngraph learning tasks, particularly on static homophilic graphs. Recent\nattention has pivoted towards more intricate structures, encompassing (1)\nstatic heterophilic graphs encountering the edge heterophily issue in the\nspatial domain and (2) event-based continuous graphs in the temporal domain.\nState-of-the-art (SOTA) methods have been concurrently addressing these two\nlines of work but tend to overlook the presence of heterophily in the temporal\ndomain, constituting the temporal heterophily issue. Furthermore, we highlight\nthat the edge heterophily issue and the temporal heterophily issue often\nco-exist in event-based continuous graphs, giving rise to the temporal edge\nheterophily challenge. To tackle this challenge, this paper first introduces\nthe temporal edge heterophily measurement. Subsequently, we propose the\nTemporal Heterophilic Graph Convolutional Network (THeGCN), an innovative model\nthat incorporates the low/high-pass graph signal filtering technique to\naccurately capture both edge (spatial) heterophily and temporal heterophily.\nSpecifically, the THeGCN model consists of two key components: a sampler and an\naggregator. The sampler selects events relevant to a node at a given moment. 
Then, the\naggregator executes message-passing, encoding temporal information, node\nattributes, and edge attributes into node embeddings. Extensive experiments\nconducted on 5 real-world datasets validate the efficacy of THeGCN.\n","authors":["Yuchen Yan","Yuzhong Chen","Huiyuan Chen","Xiaoting Li","Zhe Xu","Zhichen Zeng","Lihui Liu","Zhining Liu","Hanghang Tong"],"pdf_url":"https://arxiv.org/pdf/2412.16435v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06277v1","updated":"2025-01-10T12:48:29Z","published":"2025-01-10T12:48:29Z","title":"Environmental large language model Evaluation (ELLE) dataset: A\n Benchmark for Evaluating Generative AI applications in Eco-environment Domain","summary":" Generative AI holds significant potential for ecological and environmental\napplications such as monitoring, data analysis, education, and policy support.\nHowever, its effectiveness is limited by the lack of a unified evaluation\nframework. To address this, we present the Environmental Large Language model\nEvaluation (ELLE) question answer (QA) dataset, the first benchmark designed to\nassess large language models and their applications in ecological and\nenvironmental sciences. The ELLE dataset includes 1,130 question answer pairs\nacross 16 environmental topics, categorized by domain, difficulty, and type.\nThis comprehensive dataset standardizes performance assessments in these\nfields, enabling consistent and objective comparisons of generative AI\nperformance. By providing a dedicated evaluation tool, the ELLE dataset promotes\nthe development and application of generative AI technologies for sustainable\nenvironmental outcomes. 
The dataset and code are available at\nhttps://elle.ceeai.net/ and https://github.com/CEEAI/elle.\n","authors":["Jing Guo","Nan Li","Ming Xu"],"pdf_url":"https://arxiv.org/pdf/2501.06277v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.07596v1","updated":"2025-01-10T01:42:43Z","published":"2025-01-10T01:42:43Z","title":"Optimize Incompatible Parameters through Compatibility-aware Knowledge\n Integration","summary":" Deep neural networks have become foundational to advancements in multiple\ndomains, including recommendation systems, natural language processing, and so\non. Despite their successes, these models often contain incompatible parameters\nthat can be underutilized or detrimental to model performance, particularly\nwhen faced with specific, varying data distributions. Existing research excels\nin removing such parameters or merging the outputs of multiple different\npretrained models. However, the former focuses on efficiency rather than\nperformance, while the latter requires several times more computing and storage\nresources to support inference. In this paper, we set the goal to explicitly\nimprove these incompatible parameters by leveraging the complementary strengths\nof different models, thereby directly enhancing the models without any\nadditional parameters. Specifically, we propose Compatibility-aware Knowledge\nIntegration (CKI), which consists of Parameter Compatibility Assessment and\nParameter Splicing, which are used to evaluate the knowledge content of\nmultiple models and integrate the knowledge into one model, respectively. 
The\nintegrated model can be used directly for inference or for further fine-tuning.\nWe conduct extensive experiments on various datasets for recommendation and\nlanguage tasks, and the results show that Compatibility-aware Knowledge\nIntegration can effectively optimize incompatible parameters under multiple\ntasks and settings to break through the training limit of the original model\nwithout increasing the inference cost.\n","authors":["Zheqi Lv","Keming Ye","Zishu Wei","Qi Tian","Shengyu Zhang","Wenqiao Zhang","Wenjie Wang","Kun Kuang","Tat-Seng Chua","Fei Wu"],"pdf_url":"https://arxiv.org/pdf/2501.07596v1.pdf","comment":"Published on AAAI'25: The Annual AAAI Conference on Artificial\n Intelligence"}],"Machine Learning":[{"id":"http://arxiv.org/abs/2501.05450v2","updated":"2025-01-10T18:58:11Z","published":"2025-01-09T18:59:56Z","title":"Decentralized Diffusion Models","summary":" Large-scale AI model training divides work across thousands of GPUs, then\nsynchronizes gradients across them at each step. This incurs a significant\nnetwork burden that only centralized, monolithic clusters can support, driving\nup infrastructure costs and straining power systems. We propose Decentralized\nDiffusion Models, a scalable framework for distributing diffusion model\ntraining across independent clusters or datacenters by eliminating the\ndependence on a centralized, high-bandwidth networking fabric. Our method\ntrains a set of expert diffusion models over partitions of the dataset, each in\nfull isolation from one another. At inference time, the experts ensemble\nthrough a lightweight router. We show that the ensemble collectively optimizes\nthe same objective as a single model trained over the whole dataset. 
This means\nwe can divide the training burden among a number of \"compute islands,\" lowering\ninfrastructure costs and improving resilience to localized GPU failures.\nDecentralized diffusion models empower researchers to take advantage of\nsmaller, more cost-effective and more readily available compute like on-demand\nGPU nodes rather than central integrated systems. We conduct extensive\nexperiments on ImageNet and LAION Aesthetics, showing that decentralized\ndiffusion models FLOP-for-FLOP outperform standard diffusion models. We finally\nscale our approach to 24 billion parameters, demonstrating that high-quality\ndiffusion models can now be trained with just eight individual GPU nodes in\nless than a week.\n","authors":["David McAllister","Matthew Tancik","Jiaming Song","Angjoo Kanazawa"],"pdf_url":"https://arxiv.org/pdf/2501.05450v2.pdf","comment":"Project webpage: https://decentralizeddiffusion.github.io/"},{"id":"http://arxiv.org/abs/2501.06171v1","updated":"2025-01-10T18:50:45Z","published":"2025-01-10T18:50:45Z","title":"Machine Learning Force-Field Approach for Itinerant Electron Magnets","summary":" We review the recent development of machine-learning (ML) force-field\nframeworks for Landau-Lifshitz-Gilbert (LLG) dynamics simulations of itinerant\nelectron magnets, focusing on the general theory and implementations of\nsymmetry-invariant representations of spin configurations. The crucial\nproperties that such magnetic descriptors must satisfy are differentiability\nwith respect to spin rotations and invariance to both lattice point-group\nsymmetry and internal spin rotation symmetry. We propose an efficient\nimplementation based on the concept of reference irreducible representations,\nmodified from the group-theoretical power-spectrum and bispectrum methods. The\nML framework is demonstrated using the s-d models, which are widely applied in\nspintronics research. 
We show that LLG simulations based on local fields\npredicted by the trained ML models successfully reproduce representative\nnon-collinear spin structures, including 120$^\\circ$, tetrahedral, and skyrmion\ncrystal orders of the triangular-lattice s-d models. Large-scale thermal quench\nsimulations enabled by ML models further reveal intriguing freezing dynamics\nand glassy stripe states consisting of skyrmions and bi-merons. Our work\nhighlights the utility of the ML force-field approach to dynamical modeling of\ncomplex spin orders in itinerant electron magnets.\n","authors":["Sheng Zhang","Yunhao Fan","Kotaro Shimizu","Gia-Wei Chern"],"pdf_url":"https://arxiv.org/pdf/2501.06171v1.pdf","comment":"18 pages, 8 figures"},{"id":"http://arxiv.org/abs/2501.06167v1","updated":"2025-01-10T18:46:28Z","published":"2025-01-10T18:46:28Z","title":"Meta-Learning for Physically-Constrained Neural System Identification","summary":" We present a gradient-based meta-learning framework for rapid adaptation of\nneural state-space models (NSSMs) for black-box system identification. When\napplicable, we also incorporate domain-specific physical constraints to improve\nthe accuracy of the NSSM. The major benefit of our approach is that instead of\nrelying solely on data from a single target system, our framework utilizes data\nfrom a diverse set of source systems, enabling learning from limited target\ndata, as well as with few online training iterations. Through benchmark\nexamples, we demonstrate the potential of our approach, study the effect of\nfine-tuning subnetworks rather than full fine-tuning, and report real-world\ncase studies to illustrate the practical application and generalizability of\nthe approach to practical problems with physical constraints. 
Specifically, we\nshow that the meta-learned models result in improved downstream performance in\nmodel-based state estimation in indoor localization and energy systems.\n","authors":["Ankush Chakrabarty","Gordon Wichern","Vedang M. Deshpande","Abraham P. Vinod","Karl Berntorp","Christopher R. Laughman"],"pdf_url":"https://arxiv.org/pdf/2501.06167v1.pdf","comment":"30 pages"},{"id":"http://arxiv.org/abs/2501.06164v1","updated":"2025-01-10T18:39:29Z","published":"2025-01-10T18:39:29Z","title":"Model Alignment Search","summary":" When can we say that two neural systems are the same? The answer to this\nquestion is goal-dependent, and it is often addressed through correlative\nmethods such as Representational Similarity Analysis (RSA) and Centered Kernel\nAlignment (CKA). What do we miss when we forgo causal explorations, and how can\nwe target specific types of similarity? In this work, we introduce Model\nAlignment Search (MAS), a method for causally exploring distributed\nrepresentational similarity. The method learns invertible linear\ntransformations that align a subspace between two distributed networks'\nrepresentations where causal information can be freely interchanged. We first\nshow that the method can be used to transfer specific causal variables, such as\nthe number of items in a counting task, between networks with different\ntraining seeds. We then explore open questions in number cognition by comparing\ndifferent types of numeric representations in models trained on structurally\ndifferent numeric tasks. We then explore differences between MAS and preexisting\ncausal similarity methods, showing MAS to be more resistant to unwanted\nexchanges. 
Lastly, we introduce a counterfactual latent auxiliary loss function\nthat helps shape causally relevant alignments even in cases where we do not\nhave causal access to one of the two models for training.\n","authors":["Satchel Grant"],"pdf_url":"https://arxiv.org/pdf/2501.06164v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06159v1","updated":"2025-01-10T18:32:05Z","published":"2025-01-10T18:32:05Z","title":"Efficient Transition State Searches by Freezing String Method with Graph\n Neural Network Potentials","summary":" Transition states are a critical bottleneck in chemical transformations.\nSignificant efforts have been made to develop algorithms that efficiently\nlocate transition states on potential energy surfaces. However, the\ncomputational cost of ab-initio potential energy surface evaluation limits the\nsize of chemical systems that can be routinely studied. In this work, we develop\nand fine-tune a graph neural network potential energy function suitable for\ndescribing organic chemical reactions and use it to rapidly identify transition\nstate guess structures. We successfully refine guess structures and locate a\ntransition state in each test system considered and reduce the average number\nof ab-initio calculations by 47% through use of the graph neural network\npotential energy function. 
Our results show that modern machine learning models\nhave reached levels of reliability whereby they can be used to accelerate\nroutine computational chemistry tasks.\n","authors":["Jonah Marks","Joseph Gomes"],"pdf_url":"https://arxiv.org/pdf/2501.06159v1.pdf","comment":"9 pages, 4 figures, 3 tables"},{"id":"http://arxiv.org/abs/2405.12327v3","updated":"2025-01-10T18:30:45Z","published":"2024-05-20T18:52:33Z","title":"Beyond Item Dissimilarities: Diversifying by Intent in Recommender\n Systems","summary":" It has become increasingly clear that recommender systems that overly focus\non short-term engagement prevent users from exploring diverse interests,\nultimately hurting long-term user experience. To tackle this challenge,\nnumerous diversification algorithms have been proposed. These algorithms\ntypically rely on measures of item similarity, aiming to maximize the\ndissimilarity across items in the final set of recommendations. However, in\nthis work, we demonstrate the benefits of going beyond item-level similarities\nby utilizing higher-level user understanding--specifically, user intents that\npersist across multiple interactions--in diversification. Our approach is\nmotivated by the observation that user behaviors on online platforms are\nlargely driven by their underlying intents. Therefore, recommendations should\nensure that diverse user intents are accurately represented. While intent has\nprimarily been studied in the context of search, it is less clear how to\nincorporate real-time dynamic intent predictions into recommender systems. To\naddress this gap, we develop a probabilistic intent-based whole-page\ndiversification framework for the final stage of a recommender system. Starting\nwith a prior belief of user intents, the proposed framework sequentially\nselects items for each position based on these beliefs and subsequently updates\nposterior beliefs about the intents. 
This approach ensures that different user\nintents are represented on a page, towards optimizing long-term user\nexperience. We experiment with the intent diversification framework on YouTube,\nthe world's largest video recommendation platform, serving billions of users\ndaily. Live experiments on a diverse set of intents show that the proposed\nframework increases Daily Active Users (DAU) and overall user enjoyment,\nvalidating its effectiveness in facilitating long-term planning.\n","authors":["Yuyan Wang","Cheenar Banerjee","Samer Chucri","Fabio Soldo","Sriraj Badam","Ed H. Chi","Minmin Chen"],"pdf_url":"https://arxiv.org/pdf/2405.12327v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06158v1","updated":"2025-01-10T18:30:05Z","published":"2025-01-10T18:30:05Z","title":"GenMol: A Drug Discovery Generalist with Discrete Diffusion","summary":" Drug discovery is a complex process that involves multiple scenarios and\nstages, such as fragment-constrained molecule generation, hit generation and\nlead optimization. However, existing molecular generative models can only\ntackle one or two of these scenarios and lack the flexibility to address\nvarious aspects of the drug discovery pipeline. In this paper, we present\nGeneralist Molecular generative model (GenMol), a versatile framework that\naddresses these limitations by applying discrete diffusion to the Sequential\nAttachment-based Fragment Embedding (SAFE) molecular representation. GenMol\ngenerates SAFE sequences through non-autoregressive bidirectional parallel\ndecoding, thereby allowing utilization of a molecular context that does not\nrely on the specific token ordering and enhanced computational efficiency.\nMoreover, under the discrete diffusion framework, we introduce fragment\nremasking, a strategy that optimizes molecules by replacing fragments with\nmasked tokens and regenerating them, enabling effective exploration of chemical\nspace. 
GenMol significantly outperforms the previous GPT-based model trained on\nSAFE representations in de novo generation and fragment-constrained generation,\nand achieves state-of-the-art performance in goal-directed hit generation and\nlead optimization. These experimental results demonstrate that GenMol can\ntackle a wide range of drug discovery tasks, providing a unified and versatile\napproach for molecular design.\n","authors":["Seul Lee","Karsten Kreis","Srimukh Prasad Veccham","Meng Liu","Danny Reidenbach","Yuxing Peng","Saee Paliwal","Weili Nie","Arash Vahdat"],"pdf_url":"https://arxiv.org/pdf/2501.06158v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06148v1","updated":"2025-01-10T18:18:25Z","published":"2025-01-10T18:18:25Z","title":"From discrete-time policies to continuous-time diffusion samplers:\n Asymptotic equivalences and faster training","summary":" We study the problem of training neural stochastic differential equations, or\ndiffusion models, to sample from a Boltzmann distribution without access to\ntarget samples. Existing methods for training such models enforce time-reversal\nof the generative and noising processes, using either differentiable simulation\nor off-policy reinforcement learning (RL). We prove equivalences between\nfamilies of objectives in the limit of infinitesimal discretization steps,\nlinking entropic RL methods (GFlowNets) with continuous-time objects (partial\ndifferential equations and path space measures). 
We further show that an\nappropriate choice of coarse time discretization during training allows greatly\nimproved sample efficiency and the use of time-local objectives, achieving\ncompetitive performance on standard sampling benchmarks with reduced\ncomputational cost.\n","authors":["Julius Berner","Lorenz Richter","Marcin Sendera","Jarrid Rector-Brooks","Nikolay Malkin"],"pdf_url":"https://arxiv.org/pdf/2501.06148v1.pdf","comment":"code: https://github.com/GFNOrg/gfn-diffusion/tree/stagger"},{"id":"http://arxiv.org/abs/2410.02780v2","updated":"2025-01-10T18:14:56Z","published":"2024-09-17T19:07:13Z","title":"Guess What I Think: Streamlined EEG-to-Image Generation with Latent\n Diffusion Models","summary":" Generating images from brain waves is gaining increasing attention due to its\npotential to advance brain-computer interface (BCI) systems by understanding\nhow brain signals encode visual cues. Most of the literature has focused on\nfMRI-to-Image tasks as fMRI is characterized by high spatial resolution.\nHowever, fMRI is an expensive neuroimaging modality and does not allow for\nreal-time BCI. On the other hand, electroencephalography (EEG) is a low-cost,\nnon-invasive, and portable neuroimaging technique, making it an attractive\noption for future real-time applications. Nevertheless, EEG presents inherent\nchallenges due to its low spatial resolution and susceptibility to noise and\nartifacts, which makes generating images from EEG more difficult. In this\npaper, we address these problems with a streamlined framework based on the\nControlNet adapter for conditioning a latent diffusion model (LDM) through EEG\nsignals. 
We conduct experiments and ablation studies on popular benchmarks to\ndemonstrate that the proposed method beats other state-of-the-art models.\nUnlike these methods, which often require extensive preprocessing, pretraining,\ndifferent losses, and captioning models, our approach is efficient and\nstraightforward, requiring only minimal preprocessing and a few components. The\ncode is available at https://github.com/LuigiSigillo/GWIT.\n","authors":["Eleonora Lopez","Luigi Sigillo","Federica Colonnese","Massimo Panella","Danilo Comminiello"],"pdf_url":"https://arxiv.org/pdf/2410.02780v2.pdf","comment":"Accepted at ICASSP 2025"},{"id":"http://arxiv.org/abs/2501.06141v1","updated":"2025-01-10T18:03:46Z","published":"2025-01-10T18:03:46Z","title":"Emergent Symbol-like Number Variables in Artificial Neural Networks","summary":" What types of numeric representations emerge in Neural Networks (NNs)? To\nwhat degree do NNs induce abstract, mutable, slot-like numeric variables, and\nin what situations do these representations emerge? How do these\nrepresentations change over learning, and how can we understand the neural\nimplementations in ways that are unified across different NNs? In this work, we\napproach these questions by first training sequence based neural systems using\nNext Token Prediction (NTP) objectives on numeric tasks. We then seek to\nunderstand the neural solutions through the lens of causal abstractions or\nsymbolic algorithms. We use a combination of causal interventions and\nvisualization methods to find that artificial neural models do indeed develop\nanalogs of interchangeable, mutable, latent number variables purely from the\nNTP objective. 
We then ask how variations on the tasks and model architectures\naffect the models' learned solutions to find that these symbol-like numeric\nrepresentations do not form for every variant of the task, and transformers\nsolve the problem in a notably different way than their recurrent counterparts.\nWe then show how the symbol-like variables change over the course of training\nto find a strong correlation between the models' task performance and the\nalignment of their symbol-like representations. Lastly, we show that in all\ncases, some degree of gradience exists in these neural symbols, highlighting\nthe difficulty of finding simple, interpretable symbolic stories of how neural\nnetworks perform numeric tasks. Taken together, our results are consistent with\nthe view that neural networks can approximate interpretable symbolic programs\nof number cognition, but the particular program they approximate and the extent\nto which they approximate it can vary widely, depending on the network\narchitecture, training data, extent of training, and network size.\n","authors":["Satchel Grant","Noah D. Goodman","James L. McClelland"],"pdf_url":"https://arxiv.org/pdf/2501.06141v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.11456v2","updated":"2025-01-10T17:54:39Z","published":"2024-09-17T17:48:12Z","title":"Two Stage Segmentation of Cervical Tumors using PocketNet","summary":" Cervical cancer remains the fourth most common malignancy amongst women\nworldwide.1 Concurrent chemoradiotherapy (CRT) serves as the mainstay\ndefinitive treatment regimen for locally advanced cervical cancers and includes\nexternal beam radiation followed by brachytherapy.2 Integral to radiotherapy\ntreatment planning is the routine contouring of both the target tumor at the\nlevel of the cervix, associated gynecologic anatomy and the adjacent organs at\nrisk (OARs). 
However, manual contouring of these structures is both time- and\nlabor-intensive and associated with known interobserver variability that can\nimpact treatment outcomes. While multiple tools have been developed to\nautomatically segment OARs and the high-risk clinical tumor volume (HR-CTV)\nusing computed tomography (CT) images,3,4,5,6 the development of deep\nlearning-based tumor segmentation tools using routine T2-weighted (T2w)\nmagnetic resonance imaging (MRI) addresses an unmet clinical need to improve\nthe routine contouring of both anatomical structures and cervical cancers,\nthereby increasing quality and consistency of radiotherapy planning. This work\napplied a novel deep-learning model (PocketNet) to segment the cervix, vagina,\nuterus, and tumor(s) on T2w MRI. The performance of the PocketNet architecture\nwas evaluated when trained via 5-fold cross-validation. PocketNet\nachieved a mean Dice-Sorensen similarity coefficient (DSC) exceeding 70% for\ntumor segmentation and 80% for organ segmentation. These results suggest that\nPocketNet is robust to variations in contrast protocols, providing reliable\nsegmentation of the regions of interest.\n","authors":["Awj Twam","Megan Jacobsen","Rachel Glenn","Peng Wei","Jia Sun","Ann Klopp","Aradhana M. Venkatesan","David Fuentes"],"pdf_url":"https://arxiv.org/pdf/2409.11456v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.02189v2","updated":"2025-01-10T17:43:10Z","published":"2025-01-04T04:59:33Z","title":"Benchmark Evaluations, Applications, and Challenges of Large Vision\n Language Models: A Survey","summary":" Multimodal Vision Language Models (VLMs) have emerged as a transformative\ntechnology at the intersection of computer vision and natural language\nprocessing, enabling machines to perceive and reason about the world through\nboth visual and textual modalities. 
For example, models such as CLIP, Claude,\nand GPT-4V demonstrate strong reasoning and understanding abilities on visual\nand textual data and beat classical single modality vision models on zero-shot\nclassification. Despite their rapid advancements in research and growing\npopularity in applications, a comprehensive survey of existing studies on VLMs\nis notably lacking, particularly for researchers aiming to leverage VLMs in\ntheir specific domains. To this end, we provide a systematic overview of VLMs\nin the following aspects: model information of the major VLMs developed over\nthe past five years (2019-2024); the main architectures and training methods of\nthese VLMs; summary and categorization of the popular benchmarks and evaluation\nmetrics of VLMs; the applications of VLMs including embodied agents, robotics,\nand video generation; the challenges and issues faced by current VLMs such as\nhallucination, fairness, and safety. Detailed collections including papers and\nmodel repository links are listed in\nhttps://github.com/zli12321/Awesome-VLM-Papers-And-Models.git.\n","authors":["Zongxia Li","Xiyang Wu","Hongyang Du","Huy Nghiem","Guangyao Shi"],"pdf_url":"https://arxiv.org/pdf/2501.02189v2.pdf","comment":"35 pages, 3 figures"},{"id":"http://arxiv.org/abs/2501.06126v1","updated":"2025-01-10T17:25:11Z","published":"2025-01-10T17:25:11Z","title":"Merging Feed-Forward Sublayers for Compressed Transformers","summary":" With the rise and ubiquity of larger deep learning models, the need for\nhigh-quality compression techniques is growing in order to deploy these models\nwidely. The sheer parameter count of these models makes it difficult to fit\nthem into the memory constraints of different hardware. 
In this work, we\npresent a novel approach to model compression by merging similar parameter\ngroups within a model, rather than pruning away less important parameters.\nSpecifically, we select, align, and merge separate feed-forward sublayers in\nTransformer models, and test our method on language modeling, image\nclassification, and machine translation. With our method, we demonstrate\nperformance comparable to the original models while combining more than a third\nof model feed-forward sublayers, and demonstrate improved performance over a\nstrong layer-pruning baseline. For instance, we can remove over 21% of total\nparameters from a Vision Transformer, while maintaining 99% of its original\nperformance. Additionally, we observe that some groups of feed-forward\nsublayers exhibit high activation similarity, which may help explain their\nsurprising mergeability.\n","authors":["Neha Verma","Kenton Murray","Kevin Duh"],"pdf_url":"https://arxiv.org/pdf/2501.06126v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.08999v2","updated":"2025-01-10T17:04:47Z","published":"2023-12-14T14:44:08Z","title":"Conformalised data synthesis","summary":" With the proliferation of increasingly complicated Deep Learning\narchitectures, data synthesis is a highly promising technique to address the\ndemand of data-hungry models. However, reliably assessing the quality of a\n'synthesiser' model's output is an open research question with significant\nassociated risks for high-stake domains. To address this challenge, we propose\na unique synthesis algorithm that generates data from high-confidence feature\nspace regions based on the Conformal Prediction framework. We support our\nproposed algorithm with a comprehensive exploration of the core parameter's\ninfluence, an in-depth discussion of practical advice, and an extensive\nempirical evaluation of five benchmark datasets. 
To show our approach's\nversatility on ubiquitous real-world challenges, the datasets were carefully\nselected for their variety of difficult characteristics: low sample count,\nclass imbalance, and non-separability. In all trials, training sets extended\nwith our confident synthesised data performed at least as well as the original\nset and frequently significantly improved Deep Learning performance by up to 61\npercentage points F1-score.\n","authors":["Julia A. Meister","Khuong An Nguyen"],"pdf_url":"https://arxiv.org/pdf/2312.08999v2.pdf","comment":"Accepted for publication in the Machine Learning journal special\n issue \"Conformal Prediction and Distribution-Free Uncertainty Quantification\""},{"id":"http://arxiv.org/abs/2501.06108v1","updated":"2025-01-10T17:01:09Z","published":"2025-01-10T17:01:09Z","title":"Inferring High-Order Couplings with Neural Networks","summary":" Maximum-entropy methods, rooted in the inverse Ising/Potts problem from\nstatistical mechanics, have become indispensable tools for modeling pairwise\ninteractions in disciplines such as bioinformatics, ecology, and neuroscience.\nDespite their remarkable success, these methods often overlook high-order\ninteractions that may be crucial in complex systems. Conversely, while modern\nmachine learning approaches can capture such interactions, existing\ninterpretable frameworks are computationally expensive, making it impractical\nto assess the relevance of high-order interactions in real-world scenarios.\nRestricted Boltzmann Machines (RBMs) offer a computationally efficient\nalternative by encoding statistical correlations via hidden nodes in a\nbipartite neural network. Here, we present a method that maps RBMs exactly onto\ngeneralized Potts models with interactions of arbitrary high order. 
This\napproach leverages large-$N$ approximations, facilitated by the simple\narchitecture of the RBM, to enable the efficient extraction of effective\nmany-body couplings with minimal computational cost. This mapping also enables\nthe development of a general formal framework for the extraction of effective\nhigher-order interactions in arbitrarily complex probabilistic models.\nAdditionally, we introduce a robust formalism for gauge fixing within the\ngeneralized Potts model. We validate our method by accurately recovering two-\nand three-body interactions from synthetic datasets. Additionally, applying our\nframework to protein sequence data demonstrates its effectiveness in\nreconstructing protein contact maps, achieving performance comparable to\nstate-of-the-art inverse Potts models. These results position RBMs as a\npowerful and efficient tool for investigating high-order interactions in\ncomplex systems.\n","authors":["Aurélien Decelle","Alfonso de Jesús Navas Gómez","Beatriz Seoane"],"pdf_url":"https://arxiv.org/pdf/2501.06108v1.pdf","comment":"13 Pages and 3 Figures"},{"id":"http://arxiv.org/abs/2412.14306v2","updated":"2025-01-10T17:00:34Z","published":"2024-12-18T20:19:56Z","title":"Closing the Gap: A User Study on the Real-world Usefulness of AI-powered\n Vulnerability Detection & Repair in the IDE","summary":" This paper presents the first empirical study of a vulnerability detection\nand fix tool with professional software developers on real projects that they\nown. We implemented DeepVulGuard, an IDE-integrated tool based on\nstate-of-the-art detection and fix models, and show that it has promising\nperformance on benchmarks of historic vulnerability data. DeepVulGuard scans\ncode for vulnerabilities (including identifying the vulnerability type and\nvulnerable region of code), suggests fixes, provides natural-language\nexplanations for alerts and fixes, leveraging chat interfaces. 
We recruited 17\nprofessional software developers at Microsoft, observed their usage of the tool\non their code, and conducted interviews to assess the tool's usefulness, speed,\ntrust, relevance, and workflow integration. We also gathered detailed\nqualitative feedback on users' perceptions and their desired features. Study\nparticipants scanned a total of 24 projects, 6.9k files, and over 1.7 million\nlines of source code, and generated 170 alerts and 50 fix suggestions. We find\nthat although state-of-the-art AI-powered detection and fix tools show promise,\nthey are not yet practical for real-world use due to a high rate of false\npositives and non-applicable fixes. User feedback reveals several actionable\npain points, ranging from incomplete context to lack of customization for the\nuser's codebase. Additionally, we explore how AI features, including confidence\nscores, explanations, and chat interaction, can apply to vulnerability\ndetection and fixing. Based on these insights, we offer practical\nrecommendations for evaluating and deploying AI detection and fix models. Our\ncode and data are available at https://doi.org/10.6084/m9.figshare.26367139.\n","authors":["Benjamin Steenhoek","Kalpathy Sivaraman","Renata Saldivar Gonzalez","Yevhen Mohylevskyy","Roshanak Zilouchian Moghaddam","Wei Le"],"pdf_url":"https://arxiv.org/pdf/2412.14306v2.pdf","comment":"Accepted to ICSE 2025 research track. Camera-ready version with\n updated acknowledgments"},{"id":"http://arxiv.org/abs/2501.05409v2","updated":"2025-01-10T16:58:29Z","published":"2025-01-09T18:06:45Z","title":"Atlas: A Novel Pathology Foundation Model by Mayo Clinic, Charité, and\n Aignostics","summary":" Recent advances in digital pathology have demonstrated the effectiveness of\nfoundation models across diverse applications. In this report, we present\nAtlas, a novel vision foundation model based on the RudolfV approach. 
Our model\nwas trained on a dataset comprising 1.2 million histopathology whole slide\nimages, collected from two medical institutions: Mayo Clinic and Charit\\'e -\nUniversit\\\"atsmedizin Berlin. Comprehensive evaluations show that Atlas achieves\nstate-of-the-art performance across twenty-one public benchmark datasets, even\nthough it is neither the largest model by parameter count nor by training\ndataset size.\n","authors":["Maximilian Alber","Stephan Tietz","Jonas Dippel","Timo Milbich","Timothée Lesort","Panos Korfiatis","Moritz Krügener","Beatriz Perez Cancer","Neelay Shah","Alexander Möllers","Philipp Seegerer","Alexandra Carpen-Amarie","Kai Standvoss","Gabriel Dernbach","Edwin de Jong","Simon Schallenberg","Andreas Kunft","Helmut Hoffer von Ankershoffen","Gavin Schaeferle","Patrick Duffy","Matt Redlon","Philipp Jurmeister","David Horst","Lukas Ruff","Klaus-Robert Müller","Frederick Klauschen","Andrew Norgan"],"pdf_url":"https://arxiv.org/pdf/2501.05409v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06103v1","updated":"2025-01-10T16:54:56Z","published":"2025-01-10T16:54:56Z","title":"Finite-Horizon Single-Pull Restless Bandits: An Efficient Index Policy\n For Scarce Resource Allocation","summary":" Restless multi-armed bandits (RMABs) have been highly successful in\noptimizing sequential resource allocation across many domains. However, in many\npractical settings with highly scarce resources, where each agent can only\nreceive at most one resource, such as healthcare intervention programs, the\nstandard RMAB framework falls short. To tackle such scenarios, we introduce\nFinite-Horizon Single-Pull RMABs (SPRMABs), a novel variant in which each arm\ncan only be pulled once. 
This single-pull constraint introduces additional\ncomplexity, rendering many existing RMAB solutions suboptimal or ineffective.\nTo address this shortcoming, we propose using \\textit{dummy\nstates} that expand the system and enforce the one-pull constraint. We then\ndesign a lightweight index policy for this expanded system. For the first time,\nwe demonstrate that our index policy achieves a sub-linearly decaying average\noptimality gap of $\\tilde{\\mathcal{O}}\\left(\\frac{1}{\\rho^{1/2}}\\right)$ for a\nfinite number of arms, where $\\rho$ is the scaling factor for each arm cluster.\nExtensive simulations validate the proposed method, showing robust performance\nacross various domains compared to existing benchmarks.\n","authors":["Guojun Xiong","Haichuan Wang","Yuqi Pan","Saptarshi Mandal","Sanket Shah","Niclas Boehmer","Milind Tambe"],"pdf_url":"https://arxiv.org/pdf/2501.06103v1.pdf","comment":"17 Pages, 8 figures. Accepted by AAMAS 2025"},{"id":"http://arxiv.org/abs/2501.06099v1","updated":"2025-01-10T16:53:48Z","published":"2025-01-10T16:53:48Z","title":"Explaining Deep Learning-based Anomaly Detection in Energy Consumption\n Data by Focusing on Contextually Relevant Data","summary":" Detecting anomalies in energy consumption data is crucial for identifying\nenergy waste, equipment malfunction, and overall, for ensuring efficient energy\nmanagement. Machine learning, and specifically deep learning approaches, have\nbeen greatly successful in anomaly detection; however, they are black-box\napproaches that do not provide transparency or explanations. SHAP and its\nvariants have been proposed to explain these models, but they suffer from high\ncomputational complexity (SHAP) or instability and inconsistency (e.g., Kernel\nSHAP). 
To address these challenges, this paper proposes an explainability\napproach for anomalies in energy consumption data that focuses on\ncontext-relevant information. The proposed approach leverages existing\nexplainability techniques, focusing on SHAP variants, together with global\nfeature importance and weighted cosine similarity to select background dataset\nbased on the context of each anomaly point. By focusing on the context and most\nrelevant features, this approach mitigates the instability of explainability\nalgorithms. Experimental results across 10 different machine learning models,\nfive datasets, and five XAI techniques, demonstrate that our method reduces the\nvariability of explanations providing consistent explanations. Statistical\nanalyses confirm the robustness of our approach, showing an average reduction\nin variability of approximately 38% across multiple datasets.\n","authors":["Mohammad Noorchenarboo","Katarina Grolinger"],"pdf_url":"https://arxiv.org/pdf/2501.06099v1.pdf","comment":"26 pages, 8 figures"},{"id":"http://arxiv.org/abs/2501.06089v1","updated":"2025-01-10T16:39:01Z","published":"2025-01-10T16:39:01Z","title":"Towards Developing Socially Compliant Automated Vehicles: State of the\n Art, Experts Expectations, and A Conceptual Framework","summary":" Automated Vehicles (AVs) hold promise for revolutionizing transportation by\nimproving road safety, traffic efficiency, and overall mobility. Despite the\nsteady advancement in high-level AVs in recent years, the transition to full\nautomation entails a period of mixed traffic, where AVs of varying automation\nlevels coexist with human-driven vehicles (HDVs). Making AVs socially compliant\nand understood by human drivers is expected to improve the safety and\nefficiency of mixed traffic. Thus, ensuring AVs compatibility with HDVs and\nsocial acceptance is crucial for their successful and seamless integration into\nmixed traffic. 
However, research in this critical area of developing Socially\nCompliant AVs (SCAVs) remains sparse. This study carries out the first\ncomprehensive scoping review to assess the current state of the art in\ndeveloping SCAVs, identifying key concepts, methodological approaches, and\nresearch gaps. An expert interview was also conducted to identify critical\nresearch gaps and expectations towards SCAVs. Based on the scoping review and\nexpert interview input, a conceptual framework is proposed for the development\nof SCAVs. The conceptual framework is evaluated using an online survey\ntargeting researchers, technicians, policymakers, and other relevant\nprofessionals worldwide. The survey results provide valuable validation and\ninsights, affirming the significance of the proposed conceptual framework in\ntackling the challenges of integrating AVs into mixed-traffic environments.\nAdditionally, future research perspectives and suggestions are discussed,\ncontributing to the research and development agenda of SCAVs.\n","authors":["Yongqi Dong","Bart van Arem","Haneen Farah"],"pdf_url":"https://arxiv.org/pdf/2501.06089v1.pdf","comment":"39 pages, 13 figures, under review by the journal of Transportation\n Research Part E: Logistics and Transportation Review"},{"id":"http://arxiv.org/abs/2501.06086v1","updated":"2025-01-10T16:34:19Z","published":"2025-01-10T16:34:19Z","title":"All AI Models are Wrong, but Some are Optimal","summary":" AI models that predict the future behavior of a system (a.k.a. predictive AI\nmodels) are central to intelligent decision-making. However, decision-making\nusing predictive AI models often results in suboptimal performance. This is\nprimarily because AI models are typically constructed to best fit the data, and\nhence to predict the most likely future rather than to enable high-performance\ndecision-making. The hope that such prediction enables high-performance\ndecisions is neither guaranteed in theory nor established in practice. 
In fact,\nthere is increasing empirical evidence that predictive models must be tailored\nto decision-making objectives for performance. In this paper, we establish\nformal (necessary and sufficient) conditions that a predictive model (AI-based\nor not) must satisfy for a decision-making policy established using that model\nto be optimal. We then discuss their implications for building predictive AI\nmodels for sequential decision-making.\n","authors":["Akhil S Anand","Shambhuraj Sawant","Dirk Reinhardt","Sebastien Gros"],"pdf_url":"https://arxiv.org/pdf/2501.06086v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.06433v3","updated":"2025-01-10T16:26:43Z","published":"2022-10-12T17:30:12Z","title":"Self-supervised video pretraining yields robust and more human-aligned\n visual representations","summary":" Humans learn powerful representations of objects and scenes by observing how\nthey evolve over time. Yet, outside of specific tasks that require explicit\ntemporal understanding, static image pretraining remains the dominant paradigm\nfor learning visual foundation models. We question this mismatch, and ask\nwhether video pretraining can yield visual representations that bear the\nhallmarks of human perception: generalisation across tasks, robustness to\nperturbations, and consistency with human judgements. To that end we propose a\nnovel procedure for curating videos, and develop a contrastive framework which\nlearns from the complex transformations therein. This simple paradigm for\ndistilling knowledge from videos, called VITO, yields general representations\nthat far outperform prior video pretraining methods on image understanding\ntasks, and image pretraining methods on video understanding tasks. Moreover,\nVITO representations are significantly more robust to natural and synthetic\ndeformations than image-, video-, and adversarially-trained ones. 
Finally,\nVITO's predictions are strongly aligned with human judgements, surpassing\nmodels that were specifically trained for that purpose. Together, these results\nsuggest that video pretraining could be a simple way of learning unified,\nrobust, and human-aligned representations of the visual world.\n","authors":["Nikhil Parthasarathy","S. M. Ali Eslami","João Carreira","Olivier J. Hénaff"],"pdf_url":"https://arxiv.org/pdf/2210.06433v3.pdf","comment":"Accepted to 37th Conference on Neural Information Processing Systems\n (NeurIPS 2023)"},{"id":"http://arxiv.org/abs/2501.06081v1","updated":"2025-01-10T16:15:25Z","published":"2025-01-10T16:15:25Z","title":"Averaged Adam accelerates stochastic optimization in the training of\n deep neural network approximations for partial differential equation and\n optimal control problems","summary":" Deep learning methods - usually consisting of a class of deep neural networks\n(DNNs) trained by a stochastic gradient descent (SGD) optimization method - are\nnowadays omnipresent in data-driven learning problems as well as in scientific\ncomputing tasks such as optimal control (OC) and partial differential equation\n(PDE) problems. In practically relevant learning tasks, often not the\nplain-vanilla standard SGD optimization method is employed to train the\nconsidered class of DNNs but instead more sophisticated adaptive and\naccelerated variants of the standard SGD method such as the popular Adam\noptimizer are used. Inspired by the classical Polyak-Ruppert averaging\napproach, in this work we apply averaged variants of the Adam optimizer to\ntrain DNNs to approximately solve exemplary scientific computing problems in\nthe form of PDEs and OC problems. 
We test the averaged variants of Adam in a\nseries of learning problems including physics-informed neural network (PINN),\ndeep backward stochastic differential equation (deep BSDE), and deep Kolmogorov\napproximations for PDEs (such as heat, Black-Scholes, Burgers, and Allen-Cahn\nPDEs), including DNN approximations for OC problems, and including DNN\napproximations for image classification problems (ResNet for CIFAR-10). In each\nof the numerical examples the employed averaged variants of Adam outperform the\nstandard Adam and the standard SGD optimizers, particularly, in the situation\nof the scientific machine learning problems. The Python source codes for the\nnumerical experiments associated to this work can be found on GitHub at\nhttps://github.com/deeplearningmethods/averaged-adam.\n","authors":["Steffen Dereich","Arnulf Jentzen","Adrian Riekert"],"pdf_url":"https://arxiv.org/pdf/2501.06081v1.pdf","comment":"25 pages, 10 figures"},{"id":"http://arxiv.org/abs/2501.06080v1","updated":"2025-01-10T16:15:23Z","published":"2025-01-10T16:15:23Z","title":"Scale-up Unlearnable Examples Learning with High-Performance Computing","summary":" Recent advancements in AI models are structured to retain user interactions,\nwhich could inadvertently include sensitive healthcare data. In the healthcare\nfield, particularly when radiologists use AI-driven diagnostic tools hosted on\nonline platforms, there is a risk that medical imaging data may be repurposed\nfor future AI training without explicit consent, spotlighting critical privacy\nand intellectual property concerns around healthcare data usage. Addressing\nthese privacy challenges, a novel approach known as Unlearnable Examples (UEs)\nhas been introduced, aiming to make data unlearnable to deep learning models. A\nprominent method within this area, called Unlearnable Clustering (UC), has\nshown improved UE performance with larger batch sizes but was previously\nlimited by computational resources. 
To push the boundaries of UE performance\nwith theoretically unlimited resources, we scaled up UC learning across various\ndatasets using Distributed Data Parallel (DDP) training on the Summit\nsupercomputer. Our goal was to examine UE efficacy at high-performance\ncomputing (HPC) levels to prevent unauthorized learning and enhance data\nsecurity, particularly exploring the impact of batch size on UE's\nunlearnability. Utilizing the robust computational capabilities of the Summit,\nextensive experiments were conducted on diverse datasets such as Pets,\nMedMNist, Flowers, and Flowers102. Our findings reveal that both overly large\nand overly small batch sizes can lead to performance instability and affect\naccuracy. However, the relationship between batch size and unlearnability\nvaried across datasets, highlighting the necessity for tailored batch size\nstrategies to achieve optimal data protection. Our results underscore the\ncritical role of selecting appropriate batch sizes based on the specific\ncharacteristics of each dataset to prevent learning and ensure data security in\ndeep learning applications.\n","authors":["Yanfan Zhu","Issac Lyngaas","Murali Gopalakrishnan Meena","Mary Ellen I. Koran","Bradley Malin","Daniel Moyer","Shunxing Bao","Anuj Kapadia","Xiao Wang","Bennett Landman","Yuankai Huo"],"pdf_url":"https://arxiv.org/pdf/2501.06080v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06078v1","updated":"2025-01-10T16:14:35Z","published":"2025-01-10T16:14:35Z","title":"Explaining k-Nearest Neighbors: Abductive and Counterfactual\n Explanations","summary":" Despite the wide use of $k$-Nearest Neighbors as classification models, their\nexplainability properties remain poorly understood from a theoretical\nperspective. 
While nearest neighbors classifiers offer interpretability from a\n\"data perspective\", in which the classification of an input vector $\\bar{x}$ is\nexplained by identifying the vectors $\\bar{v}_1, \\ldots, \\bar{v}_k$ in the\ntraining set that determine the classification of $\\bar{x}$, we argue that such\nexplanations can be impractical in high-dimensional applications, where each\nvector has hundreds or thousands of features and it is not clear what their\nrelative importance is. Hence, we focus on understanding nearest neighbor\nclassifications through a \"feature perspective\", in which the goal is to\nidentify how the values of the features in $\\bar{x}$ affect its classification.\nConcretely, we study abductive explanations such as \"minimum sufficient\nreasons\", which correspond to sets of features in $\\bar{x}$ that are enough to\nguarantee its classification, and \"counterfactual explanations\" based on the\nminimum distance feature changes one would have to perform in $\\bar{x}$ to\nchange its classification. We present a detailed landscape of positive and\nnegative complexity results for counterfactual and abductive explanations,\ndistinguishing between discrete and continuous feature spaces, and considering\nthe impact of the choice of distance function involved. 
Finally, we show that\ndespite some negative complexity results, Integer Quadratic Programming and SAT\nsolving allow for computing explanations in practice.\n","authors":["Pablo Barceló","Alexander Kozachinskiy","Miguel Romero Orth","Bernardo Subercaseaux","José Verschae"],"pdf_url":"https://arxiv.org/pdf/2501.06078v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06077v1","updated":"2025-01-10T16:14:08Z","published":"2025-01-10T16:14:08Z","title":"Explainable Federated Bayesian Causal Inference and Its Application in\n Advanced Manufacturing","summary":" Causal inference has recently gained notable attention across various fields\nlike biology, healthcare, and environmental science, especially within\nexplainable artificial intelligence (xAI) systems, for uncovering the causal\nrelationships among multiple variables and outcomes. Yet, it has not been fully\nrecognized and deployed in the manufacturing systems. In this paper, we\nintroduce an explainable, scalable, and flexible federated Bayesian learning\nframework, \\texttt{xFBCI}, designed to explore causality through treatment\neffect estimation in distributed manufacturing systems. By leveraging federated\nBayesian learning, we efficiently estimate posterior of local parameters to\nderive the propensity score for each client without accessing local private\ndata. These scores are then used to estimate the treatment effect using\npropensity score matching (PSM). 
Through simulations on various datasets and a\nreal-world Electrohydrodynamic (EHD) printing data, we demonstrate that our\napproach outperforms standard Bayesian causal inference methods and several\nstate-of-the-art federated learning benchmarks.\n","authors":["Xiaofeng Xiao","Khawlah Alharbi","Pengyu Zhang","Hantang Qin","Xubo Yue"],"pdf_url":"https://arxiv.org/pdf/2501.06077v1.pdf","comment":"26 pages"},{"id":"http://arxiv.org/abs/2501.06076v1","updated":"2025-01-10T16:13:57Z","published":"2025-01-10T16:13:57Z","title":"A monthly sub-national Harmonized Food Insecurity Dataset for\n comprehensive analysis and predictive modeling","summary":" Food security is a complex, multidimensional concept challenging to measure\ncomprehensively. Effective anticipation, monitoring, and mitigation of food\ncrises require timely and comprehensive global data. This paper introduces the\nHarmonized Food Insecurity Dataset (HFID), an open-source resource\nconsolidating four key data sources: the Integrated Food Security Phase\nClassification (IPC)/Cadre Harmonis\\'e (CH) phases, the Famine Early Warning\nSystems Network (FEWS NET) IPC-compatible phases, and the World Food Program's\n(WFP) Food Consumption Score (FCS) and reduced Coping Strategy Index (rCSI).\nUpdated monthly and using a common reference system for administrative units,\nthe HFID offers extensive spatial and temporal coverage. It serves as a vital\ntool for food security experts and humanitarian agencies, providing a unified\nresource for analyzing food security conditions and highlighting global data\ndisparities. 
The scientific community can also leverage the HFID to develop\ndata-driven predictive models, enhancing the capacity to forecast and prevent\nfuture food crises.\n","authors":["Machefer Mélissande","Michele Ronco","Anne-Claire Thomas","Michael Assouline","Melanie Rabier","Christina Corbane","Felix Rembold"],"pdf_url":"https://arxiv.org/pdf/2501.06076v1.pdf","comment":"The authors Melissande Machefer and Michele Ronco have contributed\n equally as both first authors to this work. This work is currently being\n reviewed in a peer-reviewed journal"},{"id":"http://arxiv.org/abs/2501.06074v1","updated":"2025-01-10T16:11:27Z","published":"2025-01-10T16:11:27Z","title":"Geometry and Optimization of Shallow Polynomial Networks","summary":" We study shallow neural networks with polynomial activations. The function\nspace for these models can be identified with a set of symmetric tensors with\nbounded rank. We describe general features of these networks, focusing on the\nrelationship between width and optimization. We then consider teacher-student\nproblems, that can be viewed as a problem of low-rank tensor approximation with\nrespect to a non-standard inner product that is induced by the data\ndistribution. In this setting, we introduce a teacher-metric discriminant which\nencodes the qualitative behavior of the optimization as a function of the\ntraining data distribution. Finally, we focus on networks with quadratic\nactivations, presenting an in-depth analysis of the optimization landscape. 
In\nparticular, we present a variation of the Eckart-Young Theorem characterizing\nall critical points and their Hessian signatures for teacher-student problems\nwith quadratic networks and Gaussian training data.\n","authors":["Yossi Arjevani","Joan Bruna","Joe Kileel","Elzbieta Polak","Matthew Trager"],"pdf_url":"https://arxiv.org/pdf/2501.06074v1.pdf","comment":"36 pages, 2 figures"},{"id":"http://arxiv.org/abs/2308.08235v2","updated":"2025-01-10T16:02:22Z","published":"2023-08-16T09:12:21Z","title":"The Expressive Power of Graph Neural Networks: A Survey","summary":" Graph neural networks (GNNs) are effective machine learning models for many\ngraph-related applications. Despite their empirical success, many research\nefforts focus on the theoretical limitations of GNNs, i.e., the GNNs expressive\npower. Early works in this domain mainly focus on studying the graph\nisomorphism recognition ability of GNNs, and recent works try to leverage the\nproperties such as subgraph counting and connectivity learning to characterize\nthe expressive power of GNNs, which are more practical and closer to\nreal-world. However, no survey papers and open-source repositories\ncomprehensively summarize and discuss models in this important direction. To\nfill the gap, we conduct a first survey for models for enhancing expressive\npower under different forms of definition. 
Concretely, the models are reviewed\nbased on three categories, i.e., Graph feature enhancement, Graph topology\nenhancement, and GNNs architecture enhancement.\n","authors":["Bingxu Zhang","Changjun Fan","Shixuan Liu","Kuihua Huang","Xiang Zhao","Jincai Huang","Zhong Liu"],"pdf_url":"https://arxiv.org/pdf/2308.08235v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06066v1","updated":"2025-01-10T15:57:23Z","published":"2025-01-10T15:57:23Z","title":"Distilling Calibration via Conformalized Credal Inference","summary":" Deploying artificial intelligence (AI) models on edge devices involves a\ndelicate balance between meeting stringent complexity constraints, such as\nlimited memory and energy resources, and ensuring reliable performance in\nsensitive decision-making tasks. One way to enhance reliability is through\nuncertainty quantification via Bayesian inference. This approach, however,\ntypically necessitates maintaining and running multiple models in an ensemble,\nwhich may exceed the computational limits of edge devices. This paper\nintroduces a low-complexity methodology to address this challenge by distilling\ncalibration information from a more complex model. In an offline phase,\npredictive probabilities generated by a high-complexity cloud-based model are\nleveraged to determine a threshold based on the typical divergence between the\ncloud and edge models. At run time, this threshold is used to construct credal\nsets -- ranges of predictive probabilities that are guaranteed, with a\nuser-selected confidence level, to include the predictions of the cloud model.\nThe credal sets are obtained through thresholding of a divergence measure in\nthe simplex of predictive probabilities. 
Experiments on visual and language\ntasks demonstrate that the proposed approach, termed Conformalized Distillation\nfor Credal Inference (CD-CI), significantly improves calibration performance\ncompared to low-complexity Bayesian methods, such as Laplace approximation,\nmaking it a practical and efficient solution for edge AI deployments.\n","authors":["Jiayi Huang","Sangwoo Park","Nicola Paoletti","Osvaldo Simeone"],"pdf_url":"https://arxiv.org/pdf/2501.06066v1.pdf","comment":"Under review"},{"id":"http://arxiv.org/abs/2412.07312v2","updated":"2025-01-10T15:46:25Z","published":"2024-12-10T08:50:35Z","title":"High-dimensional classification problems with Barron regular boundaries\n under margin conditions","summary":" We prove that a classifier with a Barron-regular decision boundary can be\napproximated with a rate of high polynomial degree by ReLU neural networks with\nthree hidden layers when a margin condition is assumed. In particular, for\nstrong margin conditions, high-dimensional discontinuous classifiers can be\napproximated with a rate that is typically only achievable when approximating a\nlow-dimensional smooth function. We demonstrate how these expression rate\nbounds imply fast-rate learning bounds that are close to $n^{-1}$ where $n$ is\nthe number of samples. In addition, we carry out comprehensive numerical\nexperimentation on binary classification problems with various margins. 
We\nstudy three different dimensions, with the highest dimensional problem\ncorresponding to images from the MNIST data set.\n","authors":["Jonathan García","Philipp Petersen"],"pdf_url":"https://arxiv.org/pdf/2412.07312v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06062v1","updated":"2025-01-10T15:46:19Z","published":"2025-01-10T15:46:19Z","title":"Personalized Language Model Learning on Text Data Without User\n Identifiers","summary":" In many practical natural language applications, user data are highly\nsensitive, requiring anonymous uploads of text data from mobile devices to the\ncloud without user identifiers. However, the absence of user identifiers\nrestricts the ability of cloud-based language models to provide personalized\nservices, which are essential for catering to diverse user needs. The trivial\nmethod of replacing an explicit user identifier with a static user embedding as\nmodel input still compromises data anonymization. In this work, we propose to\nlet each mobile device maintain a user-specific distribution to dynamically\ngenerate user embeddings, thereby breaking the one-to-one mapping between an\nembedding and a specific user. We further theoretically demonstrate that to\nprevent the cloud from tracking users via uploaded embeddings, the local\ndistributions of different users should either be derived from a linearly\ndependent space to avoid identifiability or be close to each other to prevent\naccurate attribution. 
Evaluation on both public and industrial datasets using\ndifferent language models reveals a remarkable improvement in accuracy from\nincorporating anonymous user embeddings, while preserving real-time inference\nrequirement.\n","authors":["Yucheng Ding","Yangwenjian Tan","Xiangyu Liu","Chaoyue Niu","Fandong Meng","Jie Zhou","Ning Liu","Fan Wu","Guihai Chen"],"pdf_url":"https://arxiv.org/pdf/2501.06062v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06059v1","updated":"2025-01-10T15:40:31Z","published":"2025-01-10T15:40:31Z","title":"COMIX: Compositional Explanations using Prototypes","summary":" Aligning machine representations with human understanding is key to improving\ninterpretability of machine learning (ML) models. When classifying a new image,\nhumans often explain their decisions by decomposing the image into concepts and\npointing to corresponding regions in familiar images. Current ML explanation\ntechniques typically either trace decision-making processes to reference\nprototypes, generate attribution maps highlighting feature importance, or\nincorporate intermediate bottlenecks designed to align with human-interpretable\nconcepts. The proposed method, named COMIX, classifies an image by decomposing\nit into regions based on learned concepts and tracing each region to\ncorresponding ones in images from the training dataset, assuring that\nexplanations fully represent the actual decision-making process. We dissect the\ntest image into selected internal representations of a neural network to derive\nprototypical parts (primitives) and match them with the corresponding\nprimitives derived from the training data. In a series of qualitative and\nquantitative experiments, we theoretically prove and demonstrate that our\nmethod, in contrast to post hoc analysis, provides fidelity of explanations and\nshows that the efficiency is competitive with other inherently interpretable\narchitectures. 
Notably, it shows substantial improvements in fidelity and\nsparsity metrics, including 48.82% improvement in the C-insertion score on the\nImageNet dataset over the best state-of-the-art baseline.\n","authors":["Sarath Sivaprasad","Dmitry Kangin","Plamen Angelov","Mario Fritz"],"pdf_url":"https://arxiv.org/pdf/2501.06059v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06058v1","updated":"2025-01-10T15:39:39Z","published":"2025-01-10T15:39:39Z","title":"Learning Flexible Heterogeneous Coordination with Capability-Aware\n Shared Hypernetworks","summary":" Cooperative heterogeneous multi-agent tasks require agents to effectively\ncoordinate their behaviors while accounting for their relative capabilities.\nLearning-based solutions to this challenge span between two extremes: i)\nshared-parameter methods, which encode diverse behaviors within a single\narchitecture by assigning an ID to each agent, and are sample-efficient but\nresult in limited behavioral diversity; ii) independent methods, which learn a\nseparate policy for each agent, and show greater behavioral diversity but lack\nsample-efficiency. Prior work has also explored selective parameter-sharing,\nallowing for a compromise between diversity and efficiency. None of these\napproaches, however, effectively generalize to unseen agents or teams. We\npresent Capability-Aware Shared Hypernetworks (CASH), a novel architecture for\nheterogeneous multi-agent coordination that generates sufficient diversity\nwhile maintaining sample-efficiency via soft parameter-sharing hypernetworks.\nIntuitively, CASH allows the team to learn common strategies using a shared\nencoder, which are then adapted according to the team's individual and\ncollective capabilities with a hypernetwork, allowing for zero-shot\ngeneralization to unseen teams and agents. 
We present experiments across two\nheterogeneous coordination tasks and three standard learning paradigms\n(imitation learning, on- and off-policy reinforcement learning). CASH is able\nto outperform baseline architectures in success rate and sample efficiency when\nevaluated on unseen teams and agents despite using less than half of the\nlearnable parameters.\n","authors":["Kevin Fu","Pierce Howell","Shalin Jain","Harish Ravichandar"],"pdf_url":"https://arxiv.org/pdf/2501.06058v1.pdf","comment":"11 pages, 6 figures, equal authorship between Pierce Howell and\n Shalin Jain"},{"id":"http://arxiv.org/abs/2202.13059v5","updated":"2025-01-10T15:30:29Z","published":"2022-02-26T04:49:01Z","title":"Theoretical Error Analysis of Entropy Approximation for Gaussian Mixture","summary":" Gaussian mixture distributions are commonly employed to represent general\nprobability distributions. Despite the importance of using Gaussian mixtures\nfor uncertainty estimation, the entropy of a Gaussian mixture cannot be\ncalculated analytically. In this paper, we study the approximate entropy\nrepresented as the sum of the entropies of unimodal Gaussian distributions with\nmixing coefficients. This approximation is easy to calculate analytically\nregardless of dimension, but there is a lack of theoretical guarantees. We\ntheoretically analyze the approximation error between the true and the\napproximate entropy to reveal when this approximation works effectively. This\nerror is essentially controlled by how far apart each Gaussian component of the\nGaussian mixture is. To measure such separation, we introduce the ratios of the\ndistances between the means to the sum of the variances of each Gaussian\ncomponent of the Gaussian mixture, and we reveal that the error converges to\nzero as the ratios tend to infinity. In addition, the probabilistic estimate\nindicates that this convergence situation is more likely to occur in\nhigher-dimensional spaces. 
Therefore, our results provide a guarantee that this\napproximation works well for high-dimensional problems, such as neural networks\nthat involve a large number of parameters.\n","authors":["Takashi Furuya","Hiroyuki Kusumoto","Koichi Taniguchi","Naoya Kanno","Kazuma Suetake"],"pdf_url":"https://arxiv.org/pdf/2202.13059v5.pdf","comment":"35 pages, 4 figures"},{"id":"http://arxiv.org/abs/2410.05289v3","updated":"2025-01-10T15:25:06Z","published":"2024-10-02T14:14:17Z","title":"MARS: A neurosymbolic approach for interpretable drug discovery","summary":" Neurosymbolic (NeSy) artificial intelligence describes the combination of\nlogic or rule-based techniques with neural networks. Compared to neural\napproaches, NeSy methods often possess enhanced interpretability, which is\nparticularly promising for biomedical applications like drug discovery.\nHowever, since interpretability is broadly defined, there are no clear\nguidelines for assessing the biological plausibility of model interpretations.\nTo assess interpretability in the context of drug discovery, we devise a novel\nprediction task, called drug mechanism-of-action (MoA) deconvolution, with an\nassociated, tailored knowledge graph (KG), MoA-net. We then develop the MoA\nRetrieval System (MARS), a NeSy approach for drug discovery which leverages\nlogical rules with learned rule weights. Using this interpretable feature\nalongside domain knowledge, we find that MARS and other NeSy approaches on KGs\nare susceptible to reasoning shortcuts, in which the prediction of true labels\nis driven by \"degree-bias\" rather than the domain-based rules. Subsequently, we\ndemonstrate ways to identify and mitigate this. Thereafter, MARS achieves\nperformance on par with current state-of-the-art models while producing model\ninterpretations aligned with known MoAs.\n","authors":["Lauren Nicole DeLong","Yojana Gadiya","Paola Galdi","Jacques D. 
Fleuriot","Daniel Domingo-Fernández"],"pdf_url":"https://arxiv.org/pdf/2410.05289v3.pdf","comment":"Under review. 10 pages, 7 supplementary pages. Corresponding code is\n here: https://github.com/laurendelong21/MARS and here:\n https://github.com/laurendelong21/MoA-Net"},{"id":"http://arxiv.org/abs/2501.06039v1","updated":"2025-01-10T15:17:27Z","published":"2025-01-10T15:17:27Z","title":"AI-powered virtual tissues from spatial proteomics for clinical\n diagnostics and biomedical discovery","summary":" Spatial proteomics technologies have transformed our understanding of complex\ntissue architectures by enabling simultaneous analysis of multiple molecular\nmarkers and their spatial organization. The high dimensionality of these data,\nvarying marker combinations across experiments and heterogeneous study designs\npose unique challenges for computational analysis. Here, we present Virtual\nTissues (VirTues), a foundation model framework for biological tissues that\noperates across the molecular, cellular and tissue scale. VirTues introduces\ninnovations in transformer architecture design, including a novel tokenization\nscheme that captures both spatial and marker dimensions, and attention\nmechanisms that scale to high-dimensional multiplex data while maintaining\ninterpretability. Trained on diverse cancer and non-cancer tissue datasets,\nVirTues demonstrates strong generalization capabilities without task-specific\nfine-tuning, enabling cross-study analysis and novel marker integration. 
As a\ngeneralist model, VirTues outperforms existing approaches across clinical\ndiagnostics, biological discovery and patient case retrieval tasks, while\nproviding insights into tissue function and disease mechanisms.\n","authors":["Johann Wenckstern","Eeshaan Jain","Kiril Vasilev","Matteo Pariset","Andreas Wicki","Gabriele Gut","Charlotte Bunne"],"pdf_url":"https://arxiv.org/pdf/2501.06039v1.pdf","comment":"23 pages, 5 figures"},{"id":"http://arxiv.org/abs/2401.11940v3","updated":"2025-01-10T15:07:43Z","published":"2024-01-22T13:30:11Z","title":"Low-Tubal-Rank Tensor Recovery via Factorized Gradient Descent","summary":" This paper considers the problem of recovering a tensor with an underlying\nlow-tubal-rank structure from a small number of corrupted linear measurements.\nTraditional approaches tackling such a problem require the computation of\ntensor Singular Value Decomposition (t-SVD), which is a computationally\nintensive process, rendering them impractical for dealing with large-scale\ntensors. Aiming to address this challenge, we propose an efficient and effective\nlow-tubal-rank tensor recovery method based on a factorization procedure akin\nto the Burer-Monteiro (BM) method. Precisely, our fundamental approach involves\ndecomposing a large tensor into two smaller factor tensors, followed by solving\nthe problem through factorized gradient descent (FGD). This strategy eliminates\nthe need for t-SVD computation, thereby reducing computational costs and\nstorage requirements. We provide rigorous theoretical analysis to ensure the\nconvergence of FGD under both noise-free and noisy situations. Additionally, it\nis worth noting that our method does not require the precise estimation of the\ntensor tubal-rank. Even in cases where the tubal-rank is slightly\noverestimated, our approach continues to demonstrate robust performance. 
A\nseries of experiments have been carried out to demonstrate that, compared to\nother popular methods, our approach exhibits superior performance in multiple\nscenarios, in terms of faster computational speed and smaller\nconvergence error.\n","authors":["Zhiyu Liu","Zhi Han","Yandong Tang","Xi-Le Zhao","Yao Wang"],"pdf_url":"https://arxiv.org/pdf/2401.11940v3.pdf","comment":"13 pages, 4 figures"},{"id":"http://arxiv.org/abs/2405.06653v2","updated":"2025-01-10T15:02:43Z","published":"2024-04-08T08:25:25Z","title":"A unified cross-attention model for predicting antigen binding\n specificity to both HLA and TCR molecules","summary":" The immune checkpoint inhibitors have demonstrated promising clinical\nefficacy across various tumor types, yet the percentage of patients who benefit\nfrom them remains low. The bindings between tumor antigens and HLA-I/TCR\nmolecules determine the antigen presentation and T-cell activation, thereby\nplaying an important role in the immunotherapy response. In this paper, we\npropose UnifyImmun, a unified cross-attention transformer model designed to\nsimultaneously predict the bindings of peptides to both receptors, providing\nmore comprehensive evaluation of antigen immunogenicity. We devise a two-phase\nstrategy using virtual adversarial training that enables these two tasks to\nreinforce each other mutually, by compelling the encoders to extract more\nexpressive features. Our method demonstrates superior performance in predicting\nboth pHLA and pTCR binding on multiple independent and external test sets.\nNotably, on a large-scale COVID-19 pTCR binding test set without any seen\npeptide in the training set, our method outperforms the current state-of-the-art\nmethods by more than 10\%. 
The predicted binding scores significantly correlate\nwith the immunotherapy response and clinical outcomes on two clinical cohorts.\nFurthermore, the cross-attention scores and integrated gradients reveal the\namino-acid sites critical for peptide binding to receptors. In essence, our\napproach marks a significant step toward comprehensive evaluation of antigen\nimmunogenicity.\n","authors":["Chenpeng Yu","Xing Fang","Hui Liu"],"pdf_url":"https://arxiv.org/pdf/2405.06653v2.pdf","comment":"Accepted by Nature Machine Intelligence"},{"id":"http://arxiv.org/abs/2501.04794v2","updated":"2025-01-10T14:59:31Z","published":"2025-01-08T19:18:44Z","title":"A Steerable Deep Network for Model-Free Diffusion MRI Registration","summary":" Nonrigid registration is vital to medical image analysis but remains\nchallenging for diffusion MRI (dMRI) due to its high-dimensional,\norientation-dependent nature. While classical methods are accurate, they are\ncomputationally demanding, and deep neural networks, though efficient, have\nbeen underexplored for nonrigid dMRI registration compared to structural\nimaging. We present a novel, deep learning framework for model-free, nonrigid\nregistration of raw diffusion MRI data that does not require explicit\nreorientation. Unlike previous methods relying on derived representations such\nas diffusion tensors or fiber orientation distribution functions, in our\napproach, we formulate the registration as an equivariant diffeomorphism of\nposition-and-orientation space. Central to our method is an\n$\\mathsf{SE}(3)$-equivariant UNet that generates velocity fields while\npreserving the geometric properties of a raw dMRI's domain. We introduce a new\nloss function based on the maximum mean discrepancy in Fourier space,\nimplicitly matching ensemble average propagators across images. 
Experimental\nresults on Human Connectome Project dMRI data demonstrate competitive\nperformance compared to state-of-the-art approaches, with the added advantage\nof bypassing the overhead for estimating derived representations. This work\nestablishes a foundation for data-driven, geometry-aware dMRI registration\ndirectly in the acquisition space.\n","authors":["Gianfranco Cortes","Xiaoda Qu","Baba C. Vemuri"],"pdf_url":"https://arxiv.org/pdf/2501.04794v2.pdf","comment":"Coauthor was inadvertently left out. This is now corrected"},{"id":"http://arxiv.org/abs/2501.06016v1","updated":"2025-01-10T14:53:21Z","published":"2025-01-10T14:53:21Z","title":"Investigating the Impact of Observation Space Design Choices On Training\n Reinforcement Learning Solutions for Spacecraft Problems","summary":" Recent research using Reinforcement Learning (RL) to learn autonomous control\nfor spacecraft operations has shown great success. However, a recent study\nshowed their performance could be improved by changing the action space, i.e.\ncontrol outputs, used in the learning environment. This has opened the door for\nfinding more improvements through further changes to the environment. The work\nin this paper focuses on how changes to the environment's observation space can\nimpact the training and performance of RL agents learning the spacecraft\ninspection task. The studies are split into two groups. The first looks at the\nimpact of sensors that were designed to help agents learn the task. The second\nlooks at the impact of reference frames, reorienting the agent to see the world\nfrom a different perspective. 
The results show the sensors are not necessary,\nbut most of them help agents learn more optimal behavior, and that the\nreference frame does not have a large impact, but is best kept consistent.\n","authors":["Nathaniel Hamilton","Kyle Dunlap","Kerianne L Hobbs"],"pdf_url":"https://arxiv.org/pdf/2501.06016v1.pdf","comment":"18 pages, 10 figures, 3 tables"},{"id":"http://arxiv.org/abs/2412.05545v3","updated":"2025-01-10T14:51:06Z","published":"2024-12-07T05:47:28Z","title":"Convergence analysis of wide shallow neural operators within the\n framework of Neural Tangent Kernel","summary":" Neural operators aim to approximate operators mapping between Banach\nspaces of functions, achieving much success in the field of scientific\ncomputing. Compared to certain deep learning-based solvers, such as\nPhysics-Informed Neural Networks (PINNs) and the Deep Ritz Method (DRM), neural\noperators can solve a class of Partial Differential Equations (PDEs). Although\nmuch work has been done to analyze the approximation and generalization error\nof neural operators, there is still a lack of analysis on their training error.\nIn this work, we conduct the convergence analysis of gradient descent for the\nwide shallow neural operators and physics-informed shallow neural operators\nwithin the framework of Neural Tangent Kernel (NTK). The core idea lies in the\nfact that over-parameterization and random initialization together ensure that\neach weight vector remains near its initialization throughout all iterations,\nyielding the linear convergence of gradient descent. 
In this work, we\ndemonstrate that under the setting of over-parametrization, gradient descent\ncan find the global minimum regardless of whether it is in continuous time or\ndiscrete time.\n","authors":["Xianliang Xu","Ye Li","Zhongyi Huang"],"pdf_url":"https://arxiv.org/pdf/2412.05545v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06007v1","updated":"2025-01-10T14:42:08Z","published":"2025-01-10T14:42:08Z","title":"A Neural Operator for Forecasting Carbon Monoxide Evolution in Cities","summary":" Real-time forecasting of carbon monoxide (CO) concentrations is essential for\nenabling timely interventions to improve urban air quality. Conventional air\nquality models often require extensive computational resources for accurate,\nmulti-scale predictions, limiting their practicality for rapid, real-time\napplication. To address this challenge, we introduce the Complex Neural\nOperator for Air Quality (CoNOAir), a machine learning model that forecasts CO\nconcentrations efficiently. CoNOAir demonstrates superior performance over\nstate-of-the-art models, such as the Fourier Neural Operator (FNO), in both\nshort-term (hourly) and extended (72-hour) forecasts at a national scale. It\nexcels in capturing extreme pollution events and performs consistently across\nmultiple Indian cities, achieving an R2 above 0.95 for hourly CO predictions\nacross all evaluated locations. CoNOAir equips authorities with an effective\ntool for issuing early warnings and designing targeted intervention strategies.\nThis work marks a step forward in achieving dependable, real-time CO pollution\npredictions for densely populated urban centres.\n","authors":["Sanchit Bedi","Karn Tiwari","Prathosh A. P.","Sri Harsha Kota","N. M. 
Anoop Krishnan"],"pdf_url":"https://arxiv.org/pdf/2501.06007v1.pdf","comment":"36 pages, 21 figures, to be published in npj Clean Air journal\n (accepted)"},{"id":"http://arxiv.org/abs/2501.04211v2","updated":"2025-01-10T14:36:48Z","published":"2025-01-08T01:11:17Z","title":"CURing Large Models: Compression via CUR Decomposition","summary":" Large deep learning models have achieved remarkable success but are\nresource-intensive, posing challenges such as memory usage. We introduce\nCURing, a novel model compression method based on CUR matrix decomposition,\nwhich approximates weight matrices as the product of selected columns (C) and\nrows (R), and a small linking matrix (U). We apply this decomposition to\nweights chosen based on the combined influence of their magnitudes and\nactivations. By identifying and retaining informative rows and columns, CURing\nsignificantly reduces model size with minimal performance loss. For example, it\nreduces Llama3.1-8B's parameters to 7.32B (-9%) in just 129 seconds, over 20\ntimes faster than prior compression methods.\n","authors":["Sanghyeon Park","Soo-Mook Moon"],"pdf_url":"https://arxiv.org/pdf/2501.04211v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06003v1","updated":"2025-01-10T14:34:46Z","published":"2025-01-10T14:34:46Z","title":"Learning to generate feasible graphs using graph grammars","summary":" Generative methods for graphs need to be sufficiently flexible to model\ncomplex dependencies between sets of nodes. At the same time, the generated\ngraphs need to satisfy domain-dependent feasibility conditions, that is, they\nshould not violate certain constraints that would make their interpretation\nimpossible within the given application domain (e.g. a molecular graph where an\natom has a very large number of chemical bonds). 
Crucially, constraints can\ninvolve not only local but also long-range dependencies: for example, the\nmaximal length of a cycle can be bounded.\n Currently, a large class of generative approaches for graphs, such as methods\nbased on artificial neural networks, is based on message passing schemes. These\napproaches suffer from information 'dilution' issues that severely limit the\nmaximal range of the dependencies that can be modeled. To address this problem,\nwe propose a generative approach based on the notion of graph grammars. The key\nnovel idea is to introduce a domain-dependent coarsening procedure to provide\nshort-cuts for long-range dependencies.\n We show the effectiveness of our proposal in two domains: 1) small drugs and\n2) RNA secondary structures. In the first case, we compare the quality of the\ngenerated molecular graphs via the Molecular Sets (MOSES) benchmark suite,\nwhich evaluates the distance between generated and real molecules, their\nlipophilicity, synthesizability, and drug-likeness. In the second case, we show\nthat the approach can generate very large graphs (with hundreds of nodes) that\nare accepted as valid examples for a desired RNA family by the \"Infernal\"\ncovariance model, a state-of-the-art RNA classifier.\n Our implementation is available on github:\ngithub.com/fabriziocosta/GraphLearn\n","authors":["Stefan Mautner","Rolf Backofen","Fabrizio Costa"],"pdf_url":"https://arxiv.org/pdf/2501.06003v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06002v1","updated":"2025-01-10T14:34:20Z","published":"2025-01-10T14:34:20Z","title":"DeltaGNN: Graph Neural Network with Information Flow Control","summary":" Graph Neural Networks (GNNs) are popular deep learning models designed to\nprocess graph-structured data through recursive neighborhood aggregations in\nthe message passing process. 
When applied to semi-supervised node\nclassification, the message-passing enables GNNs to understand short-range\nspatial interactions, but also causes them to suffer from over-smoothing and\nover-squashing. These challenges hinder model expressiveness and prevent the\nuse of deeper models to capture long-range node interactions (LRIs) within the\ngraph. Popular solutions for LRI detection are either too expensive to process\nlarge graphs due to high time complexity or fail to generalize across diverse\ngraph structures. To address these limitations, we propose a mechanism called\n\\emph{information flow control}, which leverages a novel connectivity measure,\ncalled \\emph{information flow score}, to address over-smoothing and\nover-squashing with linear computational overhead, supported by theoretical\nevidence. Finally, to prove the efficacy of our methodology, we design DeltaGNN,\nthe first scalable and generalizable approach for detecting long-range and\nshort-range interactions. We benchmark our model across 10 real-world datasets,\nincluding graphs with varying sizes, topologies, densities, and homophilic\nratios, showing superior performance with limited computational complexity. The\nimplementation of the proposed methods is publicly available at\nhttps://github.com/basiralab/DeltaGNN.\n","authors":["Kevin Mancini","Islem Rekik"],"pdf_url":"https://arxiv.org/pdf/2501.06002v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.10859v3","updated":"2025-01-10T14:28:32Z","published":"2024-12-14T15:15:17Z","title":"DUET: Dual Clustering Enhanced Multivariate Time Series Forecasting","summary":" Multivariate time series forecasting is crucial for various applications,\nsuch as financial investment, energy management, weather forecasting, and\ntraffic optimization. However, accurate forecasting is challenging due to two\nmain factors. First, real-world time series often show heterogeneous temporal\npatterns caused by distribution shifts over time. 
Second, correlations among\nchannels are complex and intertwined, making it hard to model the interactions\namong channels precisely and flexibly.\n In this study, we address these challenges by proposing a general framework\ncalled DUET, which introduces dual clustering on the temporal and channel\ndimensions to enhance multivariate time series forecasting. First, we design a\nTemporal Clustering Module (TCM) that clusters time series into fine-grained\ndistributions to handle heterogeneous temporal patterns. For different\ndistribution clusters, we design various pattern extractors to capture their\nintrinsic temporal patterns, thus modeling the heterogeneity. Second, we\nintroduce a novel Channel-Soft-Clustering strategy and design a Channel\nClustering Module (CCM), which captures the relationships among channels in the\nfrequency domain through metric learning and applies sparsification to mitigate\nthe adverse effects of noisy channels. Finally, DUET combines TCM and CCM to\nincorporate both the temporal and channel dimensions. Extensive experiments on\n25 real-world datasets from 10 application domains demonstrate the\nstate-of-the-art performance of DUET.\n","authors":["Xiangfei Qiu","Xingjian Wu","Yan Lin","Chenjuan Guo","Jilin Hu","Bin Yang"],"pdf_url":"https://arxiv.org/pdf/2412.10859v3.pdf","comment":"Accepted by KDD 2025 research track"},{"id":"http://arxiv.org/abs/2501.05991v1","updated":"2025-01-10T14:25:01Z","published":"2025-01-10T14:25:01Z","title":"An Attention-Guided Deep Learning Approach for Classifying 39 Skin\n Lesion Types","summary":" The skin, as the largest organ of the human body, is vulnerable to a diverse\narray of conditions collectively known as skin lesions, which encompass various\ndermatoses. Diagnosing these lesions presents significant challenges for\nmedical practitioners due to the subtle visual differences that are often\nimperceptible to the naked eye. 
While not all skin lesions are\nlife-threatening, certain types can act as early indicators of severe diseases,\nincluding skin cancers, underscoring the critical need for timely and accurate\ndiagnostic methods. Deep learning algorithms have demonstrated remarkable\npotential in facilitating the early detection and prognosis of skin lesions.\nThis study advances the field by curating a comprehensive and diverse dataset\ncomprising 39 categories of skin lesions, synthesized from five publicly\navailable datasets. Using this dataset, the performance of five\nstate-of-the-art deep learning models -- MobileNetV2, Xception, InceptionV3,\nEfficientNetB1, and Vision Transformer -- is rigorously evaluated. To enhance\nthe accuracy and robustness of these models, attention mechanisms such as the\nEfficient Channel Attention (ECA) and the Convolutional Block Attention Module\n(CBAM) are incorporated into their architectures. Comprehensive evaluation\nacross multiple performance metrics reveals that the Vision Transformer model\nintegrated with CBAM outperforms others, achieving an accuracy of 93.46%,\nprecision of 94%, recall of 93%, F1-score of 93%, and specificity of 93.67%.\nThese results underscore the significant potential of the proposed system in\nsupporting medical professionals with accurate and efficient prognostic tools\nfor diagnosing a broad spectrum of skin lesions. 
The dataset and code used in\nthis study can be found at\nhttps://github.com/akabircs/Skin-Lesions-Classification.\n","authors":["Sauda Adiv Hanum","Ashim Dey","Muhammad Ashad Kabir"],"pdf_url":"https://arxiv.org/pdf/2501.05991v1.pdf","comment":"26 pages"},{"id":"http://arxiv.org/abs/2407.17465v3","updated":"2025-01-10T14:22:02Z","published":"2024-07-24T17:58:42Z","title":"u-$μ$P: The Unit-Scaled Maximal Update Parametrization","summary":" The Maximal Update Parametrization ($\\mu$P) aims to make the optimal\nhyperparameters (HPs) of a model independent of its size, allowing them to be\nswept using a cheap proxy model rather than the full-size target model. We\npresent a new scheme, u-$\\mu$P, which improves upon $\\mu$P by combining it with\nUnit Scaling, a method for designing models that makes them easy to train in\nlow-precision. The two techniques have a natural affinity: $\\mu$P ensures that\nthe scale of activations is independent of model size, and Unit Scaling ensures\nthat activations, weights and gradients begin training with a scale of one.\nThis synthesis opens the door to a simpler scheme, whose default values are\nnear-optimal. This in turn facilitates a more efficient sweeping strategy, with\nu-$\\mu$P models reaching a loss that is equal to or lower than comparable\n$\\mu$P models and working out-of-the-box in FP8.\n","authors":["Charlie Blake","Constantin Eichenberg","Josef Dean","Lukas Balles","Luke Y. 
Prince","Björn Deiseroth","Andres Felipe Cruz-Salinas","Carlo Luschi","Samuel Weinbach","Douglas Orr"],"pdf_url":"https://arxiv.org/pdf/2407.17465v3.pdf","comment":"55 pages"},{"id":"http://arxiv.org/abs/2501.05987v1","updated":"2025-01-10T14:18:21Z","published":"2025-01-10T14:18:21Z","title":"Comparing Self-Supervised Learning Models Pre-Trained on Human Speech\n and Animal Vocalizations for Bioacoustics Processing","summary":" Self-supervised learning (SSL) foundation models have emerged as powerful,\ndomain-agnostic, general-purpose feature extractors applicable to a wide range\nof tasks. Such models pre-trained on human speech have demonstrated high\ntransferability for bioacoustic processing. This paper investigates (i) whether\nSSL models pre-trained directly on animal vocalizations offer a significant\nadvantage over those pre-trained on speech, and (ii) whether fine-tuning\nspeech-pretrained models on automatic speech recognition (ASR) tasks can\nenhance bioacoustic classification. We conduct a comparative analysis using\nthree diverse bioacoustic datasets and two different bioacoustic tasks. Results\nindicate that pre-training on bioacoustic data provides only marginal\nimprovements over speech-pretrained models, with comparable performance in most\nscenarios. Fine-tuning on ASR tasks yields mixed outcomes, suggesting that the\ngeneral-purpose representations learned during SSL pre-training are already\nwell-suited for bioacoustic tasks. These findings highlight the robustness of\nspeech-pretrained SSL models for bioacoustics and imply that extensive\nfine-tuning may not be necessary for optimal performance.\n","authors":["Eklavya Sarkar","Mathew Magimai. 
-Doss"],"pdf_url":"https://arxiv.org/pdf/2501.05987v1.pdf","comment":"Accepted at ICASSP 2025"},{"id":"http://arxiv.org/abs/2501.05982v1","updated":"2025-01-10T14:10:19Z","published":"2025-01-10T14:10:19Z","title":"Deep Variational Sequential Monte Carlo for High-Dimensional\n Observations","summary":" Sequential Monte Carlo (SMC), or particle filtering, is widely used in\nnonlinear state-space systems, but its performance often suffers from poorly\napproximated proposal and state-transition distributions. This work introduces\na differentiable particle filter that leverages the unsupervised variational\nSMC objective to parameterize the proposal and transition distributions with a\nneural network, designed to learn from high-dimensional observations.\nExperimental results demonstrate that our approach outperforms established\nbaselines in tracking the challenging Lorenz attractor from high-dimensional\nand partial observations. Furthermore, an evidence lower bound based evaluation\nindicates that our method offers a more accurate representation of the\nposterior distribution.\n","authors":["Wessel L. van Nierop","Nir Shlezinger","Ruud J. G. van Sloun"],"pdf_url":"https://arxiv.org/pdf/2501.05982v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05970v1","updated":"2025-01-10T13:56:03Z","published":"2025-01-10T13:56:03Z","title":"A Brain Age Residual Biomarker (BARB): Leveraging MRI-Based Models to\n Detect Latent Health Conditions in U.S. Veterans","summary":" Age prediction using brain imaging, such as MRIs, has achieved promising\nresults, with several studies identifying the model's residual as a potential\nbiomarker for chronic disease states. In this study, we developed a brain age\npredictive model using a dataset of 1,220 U.S. veterans (18--80 years) and\nconvolutional neural networks (CNNs) trained on two-dimensional slices of axial\nT2-weighted fast spin-echo and T2-weighted fluid attenuated inversion recovery\nMRI images. 
The model, incorporating a degree-3 polynomial ensemble, achieved\nan $R^{2}$ of 0.816 on the testing set. Images were acquired at the level of\nthe anterior commissure and the frontal horns of the lateral ventricles.\nResidual analysis was performed to assess its potential as a biomarker for five\nICD-coded conditions: hypertension (HTN), diabetes mellitus (DM), mild\ntraumatic brain injury (mTBI), illicit substance abuse/dependence (SAD), and\nalcohol abuse/dependence (AAD). Residuals grouped by the number of ICD-coded\nconditions demonstrated different trends that were statistically significant\n($p = 0.002$), suggesting a relationship between disease states and predicted\nbrain age. This association was particularly pronounced in patients over 49\nyears, where negative residuals (indicating advanced brain aging) correlated\nwith the presence of multiple ICD codes. These findings support the potential\nof residuals as biomarkers for detecting latent health conditions.\n","authors":["Arthur Bousquet","Sugata Banerji","Mark F. Conneely","Shahrzad Jamshidi"],"pdf_url":"https://arxiv.org/pdf/2501.05970v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.14488v2","updated":"2025-01-10T13:52:14Z","published":"2024-12-19T03:22:47Z","title":"A stochastic first-order method with multi-extrapolated momentum for\n highly smooth unconstrained optimization","summary":" In this paper, we consider an unconstrained stochastic optimization problem\nwhere the objective function exhibits high-order smoothness. Specifically, we\npropose a new stochastic first-order method (SFOM) with multi-extrapolated\nmomentum, in which multiple extrapolations are performed in each iteration,\nfollowed by a momentum update based on these extrapolations. We demonstrate\nthat the proposed SFOM can accelerate optimization by exploiting the high-order\nsmoothness of the objective function $f$. 
Assuming that the $p$th-order\nderivative of $f$ is Lipschitz continuous for some $p\\ge2$, and under\nadditional mild assumptions, we establish that our method achieves a sample\ncomplexity of $\\widetilde{\\mathcal{O}}(\\epsilon^{-(3p+1)/p})$ for finding a\npoint $x$ such that $\\mathbb{E}[\\|\\nabla f(x)\\|]\\le\\epsilon$. To the best of\nour knowledge, this is the first SFOM to leverage arbitrary-order smoothness of\nthe objective function for acceleration, resulting in a sample complexity that\nimproves upon the best-known results without assuming the mean-squared\nsmoothness condition. Preliminary numerical experiments validate the practical\nperformance of our method and support our theoretical findings.\n","authors":["Chuan He"],"pdf_url":"https://arxiv.org/pdf/2412.14488v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05966v1","updated":"2025-01-10T13:49:09Z","published":"2025-01-10T13:49:09Z","title":"Towards Early Prediction of Self-Supervised Speech Model Performance","summary":" In Self-Supervised Learning (SSL), pre-training and evaluation are resource\nintensive. In the speech domain, current indicators of the quality of SSL\nmodels during pre-training, such as the loss, do not correlate well with\ndownstream performance. Consequently, it is often difficult to gauge the final\ndownstream performance in a cost efficient manner during pre-training. In this\nwork, we propose unsupervised efficient methods that give insights into the\nquality of the pre-training of SSL speech models, namely, measuring the cluster\nquality and rank of the embeddings of the SSL model. 
Results show that measures\nof cluster quality and rank correlate better with downstream performance than\nthe pre-training loss with only one hour of unlabeled audio, reducing the need\nfor GPU hours and labeled data in SSL model evaluation.\n","authors":["Ryan Whetten","Lucas Maison","Titouan Parcollet","Marco Dinarelli","Yannick Estève"],"pdf_url":"https://arxiv.org/pdf/2501.05966v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05965v1","updated":"2025-01-10T13:47:13Z","published":"2025-01-10T13:47:13Z","title":"Model Inversion in Split Learning for Personalized LLMs: New Insights\n from Information Bottleneck Theory","summary":" Personalized Large Language Models (LLMs) have become increasingly prevalent,\nshowcasing the impressive capabilities of models like GPT-4. This trend has\nalso catalyzed extensive research on deploying LLMs on mobile devices. Feasible\napproaches for such edge-cloud deployment include using split learning.\nHowever, previous research has largely overlooked the privacy leakage\nassociated with intermediate representations transmitted from devices to\nservers. This work is the first to identify model inversion attacks in the\nsplit learning framework for LLMs, emphasizing the necessity of secure defense.\nFor the first time, we introduce mutual information entropy to understand the\ninformation propagation of Transformer-based LLMs and assess privacy attack\nperformance for LLM blocks. To address the issue of representations being\nsparser and containing less information than embeddings, we propose a two-stage\nattack system in which the first part projects representations into the\nembedding space, and the second part uses a generative model to recover text\nfrom these embeddings. This design breaks down the complexity and achieves\nattack scores of 38%-75% in various scenarios, with an over 60% improvement\nover the SOTA. 
This work comprehensively highlights the potential privacy risks\nduring the deployment of personalized LLMs on the edge side.\n","authors":["Yunmeng Shu","Shaofeng Li","Tian Dong","Yan Meng","Haojin Zhu"],"pdf_url":"https://arxiv.org/pdf/2501.05965v1.pdf","comment":"8 pages"},{"id":"http://arxiv.org/abs/2302.10798v5","updated":"2025-01-10T13:43:48Z","published":"2023-02-17T09:37:17Z","title":"Learning a Consensus Sub-Network with Polarization Regularization and\n One Pass Training","summary":" The subject of green AI has been gaining attention within the deep learning\ncommunity given the recent trend of ever larger and more complex neural network\nmodels. Existing solutions for reducing the computational load of training at\ninference time usually involve pruning the network parameters. Pruning schemes\noften create extra overhead either by iterative training and fine-tuning for\nstatic pruning or repeated computation of a dynamic pruning graph. We propose a\nnew parameter pruning strategy for learning a lighter-weight sub-network that\nminimizes the energy cost while maintaining comparable performance to the fully\nparameterised network on given downstream tasks. Our proposed pruning scheme is\ngreen-oriented, as it only requires a one-off training to discover the optimal\nstatic sub-networks by dynamic pruning methods. The pruning scheme consists of\na binary gating module and a polarizing loss function to uncover sub-networks\nwith user-defined sparsity. Our method enables pruning and training\nsimultaneously, which saves energy in both the training and inference phases\nand avoids extra computational overhead from gating modules at inference time.\nOur results on CIFAR-10, CIFAR-100, and Tiny Imagenet suggest that our scheme\ncan remove 50% of connections in deep networks with <1% reduction in\nclassification accuracy. 
Compared to other related pruning methods, our method\ndemonstrates a lower drop in accuracy for equivalent reductions in\ncomputational cost.\n","authors":["Xiaoying Zhi","Varun Babbar","Rundong Liu","Pheobe Sun","Fran Silavong","Ruibo Shi","Sean Moran"],"pdf_url":"https://arxiv.org/pdf/2302.10798v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05942v1","updated":"2025-01-10T13:06:36Z","published":"2025-01-10T13:06:36Z","title":"Soft regression trees: a model variant and a decomposition training\n algorithm","summary":" Decision trees are widely used for classification and regression tasks in a\nvariety of application fields due to their interpretability and good accuracy.\nDuring the past decade, growing attention has been devoted to globally\noptimized decision trees with deterministic or soft splitting rules at branch\nnodes, which are trained by optimizing the error function over all the tree\nparameters. In this work, we propose a new variant of soft multivariate\nregression trees (SRTs) where, for every input vector, the prediction is\ndefined as the linear regression associated with a single leaf node, namely, the\nleaf node obtained by routing the input vector from the root along the branches\nwith higher probability. SRTs exhibit the conditional computational property,\ni.e., each prediction depends on a small number of nodes (parameters), and our\nnonlinear optimization formulation for training them is amenable to\ndecomposition. After showing a universal approximation result for SRTs, we\npresent a decomposition training algorithm including a clustering-based\ninitialization procedure and a heuristic for reassigning the input vectors\nalong the tree. Under mild assumptions, we establish asymptotic convergence\nguarantees. 
Experiments on 15 well-known datasets indicate that our SRTs and\ndecomposition algorithm yield higher accuracy and robustness compared with\ntraditional soft regression trees trained using the nonlinear optimization\nformulation of Blanquero et al., and a significant reduction in training times\nas well as a slightly better average accuracy compared with the mixed-integer\noptimization approach of Bertsimas and Dunn. We also report a comparison with\nthe Random Forest ensemble method.\n","authors":["Antonio Consolo","Edoardo Amaldi","Andrea Manno"],"pdf_url":"https://arxiv.org/pdf/2501.05942v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05934v1","updated":"2025-01-10T12:56:19Z","published":"2025-01-10T12:56:19Z","title":"Encoded Spatial Attribute in Multi-Tier Federated Learning","summary":" This research presents an Encoded Spatial Multi-Tier Federated Learning\napproach for a comprehensive evaluation of aggregated models for geospatial\ndata. In the client tier, encoding spatial information is introduced to better\npredict the target outcome. The research aims to assess the performance of\nthese models across diverse datasets and spatial attributes, highlighting\nvariations in predictive accuracy. Using evaluation metrics such as accuracy,\nour research reveals insights into the complexities of spatial granularity and\nthe challenges of capturing underlying patterns in the data. We extended the\nscope of federated learning (FL) by adding multiple tiers along with the\nfunctionality of encoding spatial attributes. Our N-tier FL approach used\nencoded spatial data to aggregate in different tiers. We obtained multiple\nmodels that predicted the different granularities of spatial data. Our findings\nunderscore the need for further research to improve predictive accuracy and\nmodel generalization, with potential avenues including incorporating additional\nfeatures, refining model architectures, and exploring alternative modeling\napproaches. 
Our experiments have several tiers representing different levels of\nspatial aspects. We obtained accuracies of 75.62% and 89.52% for the global model\nwithout having to train the model using the data constituted with the\ndesignated tier. The research also highlights the importance of the proposed\napproach in real-time applications.\n","authors":["Asfia Kawnine","Francis Palma","Seyed Alireza Rahimi Azghadi","Hung Cao"],"pdf_url":"https://arxiv.org/pdf/2501.05934v1.pdf","comment":"IEEE ICCE 2025"},{"id":"http://arxiv.org/abs/2501.05932v1","updated":"2025-01-10T12:55:34Z","published":"2025-01-10T12:55:34Z","title":"DiffuSETS: 12-lead ECG Generation Conditioned on Clinical Text Reports\n and Patient-Specific Information","summary":" Heart disease remains a significant threat to human health. As a non-invasive\ndiagnostic tool, the electrocardiogram (ECG) is one of the most widely used\nmethods for cardiac screening. However, the scarcity of high-quality ECG data,\ndriven by privacy concerns and limited medical resources, creates a pressing\nneed for effective ECG signal generation. Existing approaches for generating\nECG signals typically rely on small training datasets, lack comprehensive\nevaluation frameworks, and overlook potential applications beyond data\naugmentation. To address these challenges, we propose DiffuSETS, a novel\nframework capable of generating ECG signals with high semantic alignment and\nfidelity. DiffuSETS accepts various modalities of clinical text reports and\npatient-specific information as inputs, enabling the creation of clinically\nmeaningful ECG signals. Additionally, to address the lack of standardized\nevaluation in ECG generation, we introduce a comprehensive benchmarking\nmethodology to assess the effectiveness of generative models in this domain.\nOur model achieves excellent results in tests, proving its superiority in the\ntask of ECG generation. 
Furthermore, we showcase its potential to mitigate data\nscarcity while exploring novel applications in cardiology education and medical\nknowledge discovery, highlighting the broader impact of our work.\n","authors":["Yongfan Lai","Jiabo Chen","Deyun Zhang","Yue Wang","Shijia Geng","Hongyan Li","Shenda Hong"],"pdf_url":"https://arxiv.org/pdf/2501.05932v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.07128v2","updated":"2025-01-10T12:34:47Z","published":"2024-09-23T11:29:19Z","title":"Neural Differential Appearance Equations","summary":" We propose a method to reproduce dynamic appearance textures with\nspace-stationary but time-varying visual statistics. While most previous work\ndecomposes dynamic textures into static appearance and motion, we focus on\ndynamic appearance that results not from motion but variations of fundamental\nproperties, such as rusting, decaying, melting, and weathering. To this end, we\nadopt the neural ordinary differential equation (ODE) to learn the underlying\ndynamics of appearance from a target exemplar. We simulate the ODE in two\nphases. At the \"warm-up\" phase, the ODE diffuses a random noise to an initial\nstate. We then constrain the further evolution of this ODE to replicate the\nevolution of visual feature statistics in the exemplar during the generation\nphase. The particular innovation of this work is the neural ODE achieving both\ndenoising and evolution for dynamics synthesis, with a proposed temporal\ntraining scheme. We study both relightable (BRDF) and non-relightable (RGB)\nappearance models. For both we introduce new pilot datasets, allowing, for the\nfirst time, to study such phenomena: For RGB we provide 22 dynamic textures\nacquired from free online sources; For BRDFs, we further acquire a dataset of\n21 flash-lit videos of time-varying materials, enabled by a simple-to-construct\nsetup. 
Our experiments show that our method consistently yields realistic and\ncoherent results, whereas prior works falter under pronounced temporal\nappearance variations. A user study confirms our approach is preferred to\nprevious work for such exemplars.\n","authors":["Chen Liu","Tobias Ritschel"],"pdf_url":"https://arxiv.org/pdf/2410.07128v2.pdf","comment":"SIGGRAPH Asia 2024 Journal Track. Project page at\n https://ryushinn.github.io/ode-appearance"},{"id":"http://arxiv.org/abs/2411.07432v2","updated":"2025-01-10T12:33:03Z","published":"2024-11-11T23:21:01Z","title":"Fast unsupervised ground metric learning with tree-Wasserstein distance","summary":" The performance of unsupervised methods such as clustering depends on the\nchoice of distance metric between features, or ground metric. Commonly, ground\nmetrics are decided with heuristics or learned via supervised algorithms.\nHowever, since many interesting datasets are unlabelled, unsupervised ground\nmetric learning approaches have been introduced. One promising option employs\nWasserstein singular vectors (WSVs), which emerge when computing optimal\ntransport distances between features and samples simultaneously. WSVs are\neffective, but can be prohibitively computationally expensive in some\napplications: $\\mathcal{O}(n^2m^2(n \\log(n) + m \\log(m)))$ for $n$ samples and\n$m$ features. In this work, we propose to augment the WSV method by embedding\nsamples and features on trees, on which we compute the tree-Wasserstein\ndistance (TWD). We demonstrate theoretically and empirically that the algorithm\nconverges to a better approximation of the standard WSV approach than the best\nknown alternatives, and does so with $\\mathcal{O}(n^3+m^3+mn)$ complexity. In\naddition, we prove that the initial tree structure can be chosen flexibly,\nsince tree geometry does not constrain the richness of the approximation up to\nthe number of edge weights. 
This proof suggests a fast and recursive algorithm\nfor computing the tree parameter basis set, which we find crucial to realising\nthe efficiency gains at scale. Finally, we apply the tree-WSV algorithm to\nseveral single-cell RNA sequencing genomics datasets, demonstrating its\nscalability and utility for unsupervised cell-type clustering problems. These\nresults poise unsupervised ground metric learning with TWD as a low-rank\napproximation of WSV with the potential for widespread application.\n","authors":["Kira M. Düsterwald","Samo Hromadka","Makoto Yamada"],"pdf_url":"https://arxiv.org/pdf/2411.07432v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05906v1","updated":"2025-01-10T12:07:00Z","published":"2025-01-10T12:07:00Z","title":"Q-MAML: Quantum Model-Agnostic Meta-Learning for Variational Quantum\n Algorithms","summary":" In the Noisy Intermediate-Scale Quantum (NISQ) era, using variational quantum\nalgorithms (VQAs) to solve optimization problems has become a key application.\nHowever, these algorithms face significant challenges, such as choosing an\neffective initial set of parameters and the limited quantum processing time\nthat restricts the number of optimization iterations. In this study, we\nintroduce a new framework for optimizing parameterized quantum circuits (PQCs)\nthat employs a classical optimizer, inspired by the Model-Agnostic Meta-Learning\n(MAML) technique. This approach aims to achieve better parameter initialization\nthat ensures fast convergence. Our framework features a classical neural\nnetwork, called Learner, which interacts with a PQC using the output of\nLearner as an initial parameter. During the pre-training phase, Learner is\ntrained with a meta-objective based on the quantum circuit cost function. In\nthe adaptation phase, the framework requires only a few PQC updates to converge\nto a more accurate value, while the learner remains unchanged. 
This method is\nhighly adaptable and is effectively extended to various Hamiltonian\noptimization problems. We validate our approach through experiments, including\ndistribution function mapping and optimization of the Heisenberg XYZ\nHamiltonian. The result implies that the Learner successfully estimates initial\nparameters that generalize across the problem space, enabling fast adaptation.\n","authors":["Junyong Lee","JeiHee Cho","Shiho Kim"],"pdf_url":"https://arxiv.org/pdf/2501.05906v1.pdf","comment":"8 pages, 8 figures, to be published in AAAI 25"},{"id":"http://arxiv.org/abs/2501.00574v2","updated":"2025-01-10T12:00:51Z","published":"2024-12-31T18:01:23Z","title":"VideoChat-Flash: Hierarchical Compression for Long-Context Video\n Modeling","summary":" Long-context modeling is a critical capability for multimodal large language\nmodels (MLLMs), enabling them to process long-form contents with implicit\nmemorization. Despite its advances, handling extremely long videos remains\nchallenging due to the difficulty in maintaining crucial features over extended\nsequences. This paper introduces a Hierarchical visual token Compression (HiCo)\nmethod designed for high-fidelity representation and a practical context\nmodeling system VideoChat-Flash tailored for multimodal long-sequence\nprocessing. HiCo capitalizes on the redundancy of visual information in long\nvideos to compress long video context from the clip-level to the video-level,\nreducing the compute significantly while preserving essential details.\nVideoChat-Flash features a multi-stage short-to-long learning scheme, a rich\ndataset of real-world long videos named LongVid, and an upgraded\n\"Needle-In-A-video-Haystack\" (NIAH) for evaluating context capacities. In\nextensive experiments, VideoChat-Flash shows the leading performance on both\nmainstream long and short video benchmarks at the 2B and 7B model scale. 
It is\nthe first among open-source models to reach 99.1% accuracy over 10,000 frames\nin NIAH.\n","authors":["Xinhao Li","Yi Wang","Jiashuo Yu","Xiangyu Zeng","Yuhan Zhu","Haian Huang","Jianfei Gao","Kunchang Li","Yinan He","Chenting Wang","Yu Qiao","Yali Wang","Limin Wang"],"pdf_url":"https://arxiv.org/pdf/2501.00574v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05903v1","updated":"2025-01-10T12:00:08Z","published":"2025-01-10T12:00:08Z","title":"Discovery of sustainable energy materials via the machine-learned\n material space","summary":" Does a machine learning model actually gain an understanding of the material\nspace? We answer this question in the affirmative on the example of the\nOptiMate model, a graph attention network trained to predict the optical\nproperties of semiconductors and insulators. By applying the UMAP\ndimensionality reduction technique to its latent embeddings, we demonstrate\nthat the model captures a nuanced and interpretable representation of the\nmaterials space, reflecting chemical and physical principles, without any\nuser-induced bias. This enables clustering of almost 10,000 materials based on\noptical properties and chemical similarities. Beyond this understanding, we\ndemonstrate how the learned material space can be used to identify more\nsustainable alternatives to critical materials in energy-related technologies,\nsuch as photovoltaics. These findings demonstrate the dual utility of machine\nlearning models in materials science: Accurately predicting material properties\nwhile providing insights into the underlying materials space. 
The approach\ndemonstrates the broader potential of leveraging learned materials spaces for\nthe discovery and design of materials for diverse applications, and is easily\napplicable to any state-of-the-art machine learning model.\n","authors":["Malte Grunert","Max Großmann","Erich Runge"],"pdf_url":"https://arxiv.org/pdf/2501.05903v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05130v2","updated":"2025-01-10T11:50:00Z","published":"2025-01-09T10:33:16Z","title":"Learning In-Distribution Representations for Anomaly Detection","summary":" Anomaly detection involves identifying data patterns that deviate from the\nanticipated norm. Traditional methods struggle in high-dimensional spaces due\nto the curse of dimensionality. In recent years, self-supervised learning,\nparticularly through contrastive objectives, has driven advances in anomaly\ndetection. However, vanilla contrastive learning struggles to align with the\nunique demands of anomaly detection, as it lacks a pretext task tailored to the\nhomogeneous nature of In-Distribution (ID) data and the diversity of\nOut-of-Distribution (OOD) anomalies. Methods that attempt to address these\nchallenges, such as introducing hard negatives through synthetic outliers,\nOutlier Exposure (OE), and supervised objectives, often rely on pretext tasks\nthat fail to balance compact clustering of ID samples with sufficient\nseparation from OOD data. In this work, we propose Focused In-distribution\nRepresentation Modeling (FIRM), a contrastive learning objective specifically\ndesigned for anomaly detection. Unlike existing approaches, FIRM incorporates\nsynthetic outliers into its pretext task in a way that actively shapes the\nrepresentation space, promoting compact clustering of ID samples while\nenforcing strong separation from outliers. This formulation addresses the\nchallenges of class collision, enhancing both the compactness of ID\nrepresentations and the discriminative power of the learned feature space. 
We\nshow that FIRM surpasses other contrastive methods in standard benchmarks,\nsignificantly enhancing anomaly detection compared to both traditional and\nsupervised contrastive learning objectives. Our ablation studies confirm that\nFIRM consistently improves the quality of representations and shows robustness\nacross a range of scoring methods. The code is available at:\nhttps://github.com/willtl/firm.\n","authors":["Willian T. Lunardi","Abdulrahman Banabila","Dania Herzalla","Martin Andreoni"],"pdf_url":"https://arxiv.org/pdf/2501.05130v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05894v1","updated":"2025-01-10T11:46:51Z","published":"2025-01-10T11:46:51Z","title":"Text2Playlist: Generating Personalized Playlists from Text on Deezer","summary":" The streaming service Deezer heavily relies on the search to help users\nnavigate through its extensive music catalog. Nonetheless, it is primarily\ndesigned to find specific items and does not lead directly to a smooth\nlistening experience. We present Text2Playlist, a stand-alone tool that\naddresses these limitations. Text2Playlist leverages generative AI, music\ninformation retrieval and recommendation systems to generate query-specific and\npersonalized playlists, successfully deployed at scale.\n","authors":["Mathieu Delcluze","Antoine Khoury","Clémence Vast","Valerio Arnaudo","Léa Briand","Walid Bendada","Thomas Bouabça"],"pdf_url":"https://arxiv.org/pdf/2501.05894v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05885v1","updated":"2025-01-10T11:37:50Z","published":"2025-01-10T11:37:50Z","title":"EDNet: Edge-Optimized Small Target Detection in UAV Imagery -- Faster\n Context Attention, Better Feature Fusion, and Hardware Acceleration","summary":" Detecting small targets in drone imagery is challenging due to low\nresolution, complex backgrounds, and dynamic scenes. 
We propose EDNet, a novel\nedge-target detection framework built on an enhanced YOLOv10 architecture,\noptimized for real-time applications without post-processing. EDNet\nincorporates an XSmall detection head and a Cross Concat strategy to improve\nfeature fusion and multi-scale context awareness for detecting tiny targets in\ndiverse environments. Our unique C2f-FCA block employs Faster Context Attention\nto enhance feature extraction while reducing computational complexity. The WIoU\nloss function is employed for improved bounding box regression. With seven\nmodel sizes ranging from Tiny to XL, EDNet accommodates various deployment\nenvironments, enabling local real-time inference and ensuring data privacy.\nNotably, EDNet achieves up to a 5.6% gain in mAP@50 with significantly fewer\nparameters. On an iPhone 12, EDNet variants operate at speeds ranging from 16\nto 55 FPS, providing a scalable and efficient solution for edge-based object\ndetection in challenging drone imagery. The source code and pre-trained models\nare available at: https://github.com/zsniko/EDNet.\n","authors":["Zhifan Song","Yuan Zhang","Abd Al Rahman M. Abu Ebayyeh"],"pdf_url":"https://arxiv.org/pdf/2501.05885v1.pdf","comment":"Accepted in 21st IEEE International Conference on Ubiquitous\n Intelligence and Computing (UIC 2024)\n https://www.ieee-smart-world.org/2024/uic"},{"id":"http://arxiv.org/abs/2501.01987v2","updated":"2025-01-10T11:36:09Z","published":"2024-12-30T18:08:13Z","title":"Gender Bias in Text-to-Video Generation Models: A case study of Sora","summary":" The advent of text-to-video generation models has revolutionized content\ncreation as it produces high-quality videos from textual prompts. However,\nconcerns regarding inherent biases in such models have prompted scrutiny,\nparticularly regarding gender representation. Our study investigates the\npresence of gender bias in OpenAI's Sora, a state-of-the-art text-to-video\ngeneration model. 
We uncover significant evidence of bias by analyzing the\ngenerated videos from a diverse set of gender-neutral and stereotypical\nprompts. The results indicate that Sora disproportionately associates specific\ngenders with stereotypical behaviors and professions, which reflects societal\nprejudices embedded in its training data.\n","authors":["Mohammad Nadeem","Shahab Saquib Sohail","Erik Cambria","Björn W. Schuller","Amir Hussain"],"pdf_url":"https://arxiv.org/pdf/2501.01987v2.pdf","comment":"7 pages, 3 figures"},{"id":"http://arxiv.org/abs/2501.05874v1","updated":"2025-01-10T11:17:15Z","published":"2025-01-10T11:17:15Z","title":"VideoRAG: Retrieval-Augmented Generation over Video Corpus","summary":" Retrieval-Augmented Generation (RAG) is a powerful strategy to address the\nissue of generating factually incorrect outputs in foundation models by\nretrieving external knowledge relevant to queries and incorporating it into\ntheir generation process. However, existing RAG approaches have primarily\nfocused on textual information, with some recent advancements beginning to\nconsider images, and they largely overlook videos, a rich source of multimodal\nknowledge capable of representing events, processes, and contextual details\nmore effectively than any other modality. While a few recent studies explore\nthe integration of videos in the response generation process, they either\npredefine query-associated videos without retrieving them according to queries,\nor convert videos into the textual descriptions without harnessing their\nmultimodal richness. To tackle these, we introduce VideoRAG, a novel framework\nthat not only dynamically retrieves relevant videos based on their relevance\nwith queries but also utilizes both visual and textual information of videos in\nthe output generation. 
Further, to operationalize this, our method revolves\naround the recent advance of Large Video Language Models (LVLMs), which enable\nthe direct processing of video content to represent it for retrieval and\nseamless integration of the retrieved videos jointly with queries. We\nexperimentally validate the effectiveness of VideoRAG, showcasing that it is\nsuperior to relevant baselines.\n","authors":["Soyeong Jeong","Kangsan Kim","Jinheon Baek","Sung Ju Hwang"],"pdf_url":"https://arxiv.org/pdf/2501.05874v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05871v1","updated":"2025-01-10T11:12:03Z","published":"2025-01-10T11:12:03Z","title":"Collaborative Content Moderation in the Fediverse","summary":" The Fediverse, a group of interconnected servers providing a variety of\ninteroperable services (e.g. micro-blogging in Mastodon) has gained rapid\npopularity. This sudden growth, partly driven by Elon Musk's acquisition of\nTwitter, has created challenges for administrators though. This paper focuses\non one particular challenge: content moderation, e.g. the need to remove spam\nor hate speech. While centralized platforms like Facebook and Twitter rely on\nautomated tools for moderation, their dependence on massive labeled datasets\nand specialized infrastructure renders them impractical for decentralized,\nlow-resource settings like the Fediverse. In this work, we design and evaluate\nFedMod, a collaborative content moderation system based on federated learning.\nOur system enables servers to exchange parameters of partially trained local\ncontent moderation models with similar servers, creating a federated model\nshared among collaborating servers. 
FedMod demonstrates robust performance on\nthree different content moderation tasks: harmful content detection, bot\ncontent detection, and content warning assignment, achieving average per-server\nmacro-F1 scores of 0.71, 0.73, and 0.58, respectively.\n","authors":["Haris Bin Zia","Aravindh Raman","Ignacio Castro","Gareth Tyson"],"pdf_url":"https://arxiv.org/pdf/2501.05871v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05870v1","updated":"2025-01-10T11:11:08Z","published":"2025-01-10T11:11:08Z","title":"A Neighbor-based Approach to Pitch Ownership Models in Soccer","summary":" Pitch ownership models allow many types of analysis in soccer and provide\nvaluable assistance to tactical analysts in understanding the game's dynamics.\nThe novelty they provide over event-based analysis is that tracking data\nincorporates context that event-based data does not possess, like player\npositioning. This paper proposes a novel approach to building pitch ownership\nmodels in soccer games using the K-Nearest Neighbors (KNN) algorithm. Our\napproach provides a fast inference mechanism that can model different\napproaches to pitch control using the same algorithm. Despite its flexibility,\nit uses only three hyperparameters to tune the model, facilitating the tuning\nprocess for different player skill levels. The flexibility of the approach\nallows for the emulation of different methods available in the literature by\nadjusting a small number of parameters, including adjusting for different\nlevels of uncertainty. In summary, the proposed model provides a new and more\nflexible strategy for building pitch ownership models, extending beyond just\nreplicating existing algorithms, and can provide valuable insights for tactical\nanalysts and open up new avenues for future research. 
We thoroughly visualize\nseveral examples demonstrating the presented models' strengths and weaknesses.\nThe code is available at github.com/nvsclub/KNNPitchControl.\n","authors":["Tiago Mendes-Neves","Luís Meireles","João Mendes-Moreira"],"pdf_url":"https://arxiv.org/pdf/2501.05870v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05867v1","updated":"2025-01-10T11:08:40Z","published":"2025-01-10T11:08:40Z","title":"Neural Network Verification is a Programming Language Challenge","summary":" Neural network verification is a new and rapidly developing field of\nresearch. So far, the main priority has been establishing efficient\nverification algorithms and tools, while proper support from the programming\nlanguage perspective has been considered secondary or unimportant. Yet, there\nis mounting evidence that insights from the programming language community may\nmake a difference in the future development of this domain. In this paper, we\nformulate neural network verification challenges as programming language\nchallenges and suggest possible future solutions.\n","authors":["Lucas C. Cordeiro","Matthew L. Daggitt","Julien Girard-Satabin","Omri Isac","Taylor T. Johnson","Guy Katz","Ekaterina Komendantskaya","Augustin Lemesle","Edoardo Manino","Artjoms Šinkarovs","Haoze Wu"],"pdf_url":"https://arxiv.org/pdf/2501.05867v1.pdf","comment":"Accepted at ESOP 2025, European Symposium on Programming Languages"},{"id":"http://arxiv.org/abs/2501.05852v1","updated":"2025-01-10T10:47:00Z","published":"2025-01-10T10:47:00Z","title":"MRI Patterns of the Hippocampus and Amygdala for Predicting Stages of\n Alzheimer's Progression: A Minimal Feature Machine Learning Framework","summary":" Alzheimer's disease (AD) progresses through distinct stages, from early mild\ncognitive impairment (EMCI) to late mild cognitive impairment (LMCI) and\neventually to AD. 
Accurate identification of these stages, especially\ndistinguishing LMCI from EMCI, is crucial for developing pre-dementia\ntreatments but remains challenging due to subtle and overlapping imaging\nfeatures. This study proposes a minimal-feature machine learning framework that\nleverages structural MRI data, focusing on the hippocampus and amygdala as\nregions of interest. The framework addresses the curse of dimensionality\nthrough feature selection, utilizes region-specific voxel information, and\nimplements innovative data organization to enhance classification performance\nby reducing noise. The methodology integrates dimensionality reduction\ntechniques such as PCA and t-SNE with state-of-the-art classifiers, achieving\nthe highest accuracy of 88.46%. This framework demonstrates the potential for\nefficient and accurate staging of AD progression while providing valuable\ninsights for clinical applications.\n","authors":["Aswini Kumar Patra","Soraisham Elizabeth Devi","Tejashwini Gajurel"],"pdf_url":"https://arxiv.org/pdf/2501.05852v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05845v1","updated":"2025-01-10T10:36:46Z","published":"2025-01-10T10:36:46Z","title":"Annealing Machine-assisted Learning of Graph Neural Network for\n Combinatorial Optimization","summary":" While Annealing Machines (AM) have shown increasing capabilities in solving\ncomplex combinatorial problems, positioning themselves as a more immediate\nalternative to the expected advances of future fully quantum solutions, there\nare still scaling limitations. In parallel, Graph Neural Networks (GNN) have\nbeen recently adapted to solve combinatorial problems, showing competitive\nresults and potentially high scalability due to their distributed nature. We\npropose a merging approach that aims at retaining both the accuracy exhibited\nby AMs and the representational flexibility and scalability of GNNs. 
Our model\nconsiders a compression step, followed by a supervised interaction where\npartial solutions obtained from the AM are used to guide local GNNs, from which\nnode feature representations are obtained and combined to initialize an\nadditional GNN-based solver that handles the original graph's target problem.\nIntuitively, the AM can solve the combinatorial problem indirectly by infusing\nits knowledge into the GNN. Experiments on canonical optimization problems show\nthat the idea is feasible, effectively allowing the AM to solve problems of\nsizes beyond its original limits.\n","authors":["Pablo Loyola","Kento Hasegawa","Andres Hoyos-Idobro","Kazuo Ono","Toyotaro Suzumura","Yu Hirate","Masanao Yamaoka"],"pdf_url":"https://arxiv.org/pdf/2501.05845v1.pdf","comment":"Second Workshop on Machine Learning with New Compute Paradigms at\n NeurIPS 2024 (MLNCP 2024)"},{"id":"http://arxiv.org/abs/2501.05844v1","updated":"2025-01-10T10:36:26Z","published":"2025-01-10T10:36:26Z","title":"\"Cause\" is Mechanistic Narrative within Scientific Domains: An Ordinary\n Language Philosophical Critique of \"Causal Machine Learning\"","summary":" Causal Learning has emerged as a major theme of AI in recent years, promising\nto use special techniques to reveal the true nature of cause and effect in a\nnumber of important domains. We consider the Epistemology of learning and\nrecognizing true cause and effect phenomena. Through thought exercises on the\ncustomary use of the word ''cause'', especially in scientific domains, we\ninvestigate what, in practice, constitutes a valid causal claim. We recognize\nthe word's uses across scientific domains in disparate form but consistent\nfunction within the scientific paradigm. 
We highlight fundamental distinctions\nof practice that can be performed in the natural and social sciences, highlight\nthe importance of many systems of interest being open and irreducible and\nidentify the important notion of Hermeneutic knowledge for social science\ninquiry. We posit that the distinct properties require that definitive causal\nclaims can only come through an agglomeration of consistent evidence across\nmultiple domains and levels of abstraction, such as empirical, physiological,\nbiochemical, etc. We present Cognitive Science as an exemplary\nmulti-disciplinary field providing omnipresent opportunity for such a Research\nProgram, and highlight the main general modes of practice of scientific inquiry\nthat can adequately merge, rather than place as incorrigibly conflictual,\nmulti-domain multi-abstraction scientific practices and language games.\n","authors":["Vyacheslav Kungurtsev","Leonardo Christov Moore","Gustav Sir","Martin Krutsky"],"pdf_url":"https://arxiv.org/pdf/2501.05844v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05842v1","updated":"2025-01-10T10:33:13Z","published":"2025-01-10T10:33:13Z","title":"Orthogonal projection-based regularization for efficient model\n augmentation","summary":" Deep-learning-based nonlinear system identification has shown the ability to\nproduce reliable and highly accurate models in practice. However, these\nblack-box models lack physical interpretability, and often a considerable part\nof the learning effort is spent on capturing already expected/known behavior\ndue to first-principles-based understanding of some aspects of the system. A\npotential solution is to integrate prior physical knowledge directly into the\nmodel structure, combining the strengths of physics-based modeling and\ndeep-learning-based identification. The most common approach is to use an\nadditive model augmentation structure, where the physics-based and the\nmachine-learning (ML) components are connected in parallel. 
However, such\nmodels are overparametrized, and training them is challenging, potentially\ncausing the physics-based part to lose interpretability. To overcome this\nchallenge, this paper proposes an orthogonal projection-based regularization\ntechnique to enhance parameter learning, convergence, and even model accuracy\nin learning-based augmentation of nonlinear baseline models.\n","authors":["Bendegúz M. Györök","Jan H. Hoekstra","Johan Kon","Tamás Péni","Maarten Schoukens","Roland Tóth"],"pdf_url":"https://arxiv.org/pdf/2501.05842v1.pdf","comment":"Submitted to L4DC 2025"},{"id":"http://arxiv.org/abs/2309.13736v3","updated":"2025-01-10T10:31:19Z","published":"2023-09-24T19:40:15Z","title":"Geometry of Linear Neural Networks: Equivariance and Invariance under\n Permutation Groups","summary":" The set of functions parameterized by a linear fully-connected neural network\nis a determinantal variety. We investigate the subvariety of functions that are\nequivariant or invariant under the action of a permutation group. Examples of\nsuch group actions are translations or $90^\\circ$ rotations on images. We\ndescribe such equivariant or invariant subvarieties as direct products of\ndeterminantal varieties, from which we deduce their dimension, degree,\nEuclidean distance degree, and their singularities. We fully characterize\ninvariance for arbitrary permutation groups, and equivariance for cyclic\ngroups. We draw conclusions for the parameterization and the design of\nequivariant and invariant linear networks in terms of sparsity and\nweight-sharing properties. We prove that all invariant linear functions can be\nparameterized by a single linear autoencoder with a weight-sharing property\nimposed by the cycle decomposition of the considered permutation. The space of\nrank-bounded equivariant functions has several irreducible components, so it\ncannot be parameterized by a single network, but each irreducible component\ncan. 
Finally, we show that minimizing the squared-error loss on our invariant\nor equivariant networks reduces to minimizing the Euclidean distance from\ndeterminantal varieties via the Eckart-Young theorem.\n","authors":["Kathlén Kohn","Anna-Laura Sattelberger","Vahid Shahverdi"],"pdf_url":"https://arxiv.org/pdf/2309.13736v3.pdf","comment":"42 pages, 8 figures, 1 table; comments welcome!"},{"id":"http://arxiv.org/abs/2401.10726v4","updated":"2025-01-10T10:30:41Z","published":"2024-01-19T14:43:04Z","title":"Empowering Aggregators with Practical Data-Driven Tools: Harnessing\n Aggregated and Disaggregated Flexibility for Demand Response","summary":" This study explores the interaction between aggregators and building\noccupants in activating flexibility through Demand Response (DR) programs, with\na focus on reinforcing the resilience of the energy system considering the\nuncertainties presented by Renewable Energy Sources (RES). Firstly, it\nintroduces a methodology of optimizing aggregated flexibility provision\nstrategies in environments with limited data, utilizing Discrete Fourier\nTransformation (DFT) and clustering techniques to identify building occupants'\nactivity patterns. Secondly, the study assesses the disaggregated flexibility\nprovision of Heating Ventilation and Air Conditioning (HVAC) systems during DR\nevents, employing machine learning and optimization techniques for precise,\ndevice-level analysis. 
The first approach offers a non-intrusive pathway for\naggregators to provide flexibility services in environments of a single smart\nmeter for the whole building's consumption, while the second approach maximizes\nthe amount of flexibility in the case of dedicated metering devices to the HVAC\nsystems by carefully considering building occupants' thermal comfort profiles.\nThrough the application of data-driven techniques and encompassing case studies\nfrom both industrial and residential buildings, this paper not only unveils\npivotal opportunities for aggregators in the balancing and emerging flexibility\nmarkets but also successfully develops and demonstrates end-to-end practical\ntools for aggregators.\n","authors":["Costas Mylonas","Donata Boric","Leila Luttenberger Maric","Alexandros Tsitsanis","Eleftheria Petrianou","Magda Foti"],"pdf_url":"https://arxiv.org/pdf/2401.10726v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.15627v3","updated":"2025-01-10T10:24:19Z","published":"2024-06-21T20:06:31Z","title":"Benchmarking Uncertainty Quantification Methods for Large Language\n Models with LM-Polygraph","summary":" The rapid proliferation of large language models (LLMs) has stimulated\nresearchers to seek effective and efficient approaches to deal with LLM\nhallucinations and low-quality outputs. Uncertainty quantification (UQ) is a\nkey element of machine learning applications in dealing with such challenges.\nHowever, research to date on UQ for LLMs has been fragmented in terms of\ntechniques and evaluation methodologies. In this work, we address this issue by\nintroducing a novel benchmark that implements a collection of state-of-the-art\nUQ baselines and offers an environment for controllable and consistent\nevaluation of novel UQ techniques over various text generation tasks. Our\nbenchmark also supports the assessment of confidence normalization methods in\nterms of their ability to provide interpretable scores. 
Using our benchmark, we\nconduct a large-scale empirical investigation of UQ and normalization\ntechniques across eleven tasks, identifying the most effective approaches.\nCode: https://github.com/IINemo/lm-polygraph Benchmark:\nhttps://huggingface.co/LM-Polygraph\n","authors":["Roman Vashurin","Ekaterina Fadeeva","Artem Vazhentsev","Lyudmila Rvanova","Akim Tsvigun","Daniil Vasilev","Rui Xing","Abdelrahman Boda Sadallah","Kirill Grishchenkov","Sergey Petrakov","Alexander Panchenko","Timothy Baldwin","Preslav Nakov","Maxim Panov","Artem Shelmanov"],"pdf_url":"https://arxiv.org/pdf/2406.15627v3.pdf","comment":"Accepted to TACL 2025, pre-MIT Press publication version. Roman\n Vashurin, Ekaterina Fadeeva, Artem Vazhentsev contributed equally"},{"id":"http://arxiv.org/abs/2407.17163v3","updated":"2025-01-10T10:17:05Z","published":"2024-07-24T11:07:20Z","title":"dlordinal: a Python package for deep ordinal classification","summary":" dlordinal is a new Python library that unifies many recent deep ordinal\nclassification methodologies available in the literature. Developed using\nPyTorch as underlying framework, it implements the top performing\nstate-of-the-art deep learning techniques for ordinal classification problems.\nOrdinal approaches are designed to leverage the ordering information present in\nthe target variable. Specifically, it includes loss functions, various output\nlayers, dropout techniques, soft labelling methodologies, and other\nclassification strategies, all of which are appropriately designed to\nincorporate the ordinal information. Furthermore, as the performance metrics to\nassess novel proposals in ordinal classification depend on the distance between\ntarget and predicted classes in the ordinal scale, suitable ordinal evaluation\nmetrics are also included. dlordinal is distributed under the BSD-3-Clause\nlicense and is available at https://github.com/ayrna/dlordinal.\n","authors":["Francisco Bérchez-Moreno","Víctor M. 
Vargas","Rafael Ayllón-Gavilán","David Guijo-Rubio","César Hervás-Martínez","Juan C. Fernández","Pedro A. Gutiérrez"],"pdf_url":"https://arxiv.org/pdf/2407.17163v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05835v1","updated":"2025-01-10T10:16:35Z","published":"2025-01-10T10:16:35Z","title":"Fine-tuning is Not Fine: Mitigating Backdoor Attacks in GNNs with\n Limited Clean Data","summary":" Graph Neural Networks (GNNs) have achieved remarkable performance through\ntheir message-passing mechanism. However, recent studies have highlighted the\nvulnerability of GNNs to backdoor attacks, which can lead the model to\nmisclassify graphs with attached triggers as the target class. The\neffectiveness of recent promising defense techniques, such as fine-tuning or\ndistillation, is heavily contingent on having comprehensive knowledge of the\nsufficient training dataset. Empirical studies have shown that fine-tuning\nmethods require a clean dataset of 20% to reduce attack accuracy to below 25%,\nwhile distillation methods require a clean dataset of 15%. However, obtaining\nsuch a large amount of clean data is commonly impractical.\n In this paper, we propose a practical backdoor mitigation framework, denoted\nas GRAPHNAD, which can capture high-quality intermediate-layer representations\nin GNNs to enhance the distillation process with limited clean data. To achieve\nthis, we address the following key questions: How to identify the appropriate\nattention representations in graphs for distillation? How to enhance\ndistillation with limited data? By adopting the graph attention transfer\nmethod, GRAPHNAD can effectively align the intermediate-layer attention\nrepresentations of the backdoored model with that of the teacher model, forcing\nthe backdoor neurons to transform into benign ones. 
Besides, we extract the\nrelation maps from intermediate-layer transformation and enforce the relation\nmaps of the backdoored model to be consistent with that of the teacher model,\nthereby ensuring model accuracy while further reducing the influence of\nbackdoors. Extensive experimental results show that by fine-tuning a teacher\nmodel with only 3% of the clean data, GRAPHNAD can reduce the attack success\nrate to below 5%.\n","authors":["Jiale Zhang","Bosen Rao","Chengcheng Zhu","Xiaobing Sun","Qingming Li","Haibo Hu","Xiapu Luo","Qingqing Ye","Shouling Ji"],"pdf_url":"https://arxiv.org/pdf/2501.05835v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11875v2","updated":"2025-01-10T10:15:49Z","published":"2023-10-18T10:49:29Z","title":"Fractional Concepts in Neural Networks: Enhancing Activation Functions","summary":" Designing effective neural networks requires tuning architectural elements.\nThis study integrates fractional calculus into neural networks by introducing\nfractional order derivatives (FDO) as tunable parameters in activation\nfunctions, allowing diverse activation functions by adjusting the FDO. We\nevaluate these fractional activation functions on various datasets and network\narchitectures, comparing their performance with traditional and new activation\nfunctions. Our experiments assess their impact on accuracy, time complexity,\ncomputational overhead, and memory usage. Results suggest fractional activation\nfunctions, particularly fractional Sigmoid, offer benefits in some scenarios.\nChallenges related to consistency and efficiency remain. 
Practical implications\nand limitations are discussed.\n","authors":["Zahra Alijani","Vojtech Molek"],"pdf_url":"https://arxiv.org/pdf/2310.11875v2.pdf","comment":"8 pages, 8 figures, submitted to pattern recognition letters"},{"id":"http://arxiv.org/abs/2501.05819v1","updated":"2025-01-10T09:59:16Z","published":"2025-01-10T09:59:16Z","title":"Diffusion Models for Smarter UAVs: Decision-Making and Modeling","summary":" Unmanned Aerial Vehicles (UAVs) are increasingly adopted in modern\ncommunication networks. However, challenges in decision-making and digital\nmodeling continue to impede their rapid advancement. Reinforcement Learning\n(RL) algorithms face limitations such as low sample efficiency and limited data\nversatility, further magnified in UAV communication scenarios. Moreover,\nDigital Twin (DT) modeling introduces substantial decision-making and data\nmanagement complexities. RL models, often integrated into DT frameworks,\nrequire extensive training data to achieve accurate predictions. In contrast to\ntraditional approaches that focus on class boundaries, Diffusion Models (DMs),\na new class of generative AI, learn the underlying probability distribution\nfrom the training data and can generate trustworthy new patterns based on this\nlearned distribution. This paper explores the integration of DMs with RL and DT\nto effectively address these challenges. By combining the data generation\ncapabilities of DMs with the decision-making framework of RL and the modeling\naccuracy of DT, the integration improves the adaptability and real-time\nperformance of UAV communication. 
Moreover, the study shows how DMs can\nalleviate data scarcity, improve policy networks, and optimize dynamic\nmodeling, providing a robust solution for complex UAV communication\nscenarios.\n","authors":["Yousef Emami","Hao Zhou","Luis Almeida","Kai Li"],"pdf_url":"https://arxiv.org/pdf/2501.05819v1.pdf","comment":"7 pages, 2 figures"},{"id":"http://arxiv.org/abs/2412.13902v2","updated":"2025-01-10T09:55:54Z","published":"2024-12-18T14:42:43Z","title":"Threshold Neuron: A Brain-inspired Artificial Neuron for Efficient\n On-device Inference","summary":" Enhancing the computational efficiency of on-device Deep Neural Networks\n(DNNs) remains a significant challenge in mobile and edge computing. As we aim\nto execute increasingly complex tasks with constrained computational resources,\nmuch of the research has focused on compressing neural network structures and\nparameters or optimizing underlying systems, but there has been limited\nattention to optimizing the fundamental building blocks of neural networks:\nthe neurons. In this study, we deliberate on a simple but important research\nquestion: Can we design artificial neurons that offer greater efficiency than\nthe traditional neuron paradigm? Inspired by the threshold mechanisms and the\nexcitation-inhibition balance observed in biological neurons, we propose a\nnovel artificial neuron model, Threshold Neurons. Using Threshold Neurons, we\ncan construct neural networks similar to those with traditional artificial\nneurons, while significantly reducing hardware implementation complexity. Our\nextensive experiments validate the effectiveness of neural networks utilizing\nThreshold Neurons, achieving substantial power savings of 7.51x to 8.19x and\narea savings of 3.89x to 4.33x at the kernel level, with minimal loss in\nprecision. 
Furthermore, FPGA-based implementations\nof these networks demonstrate 2.52x power savings and 1.75x speed enhancements\nat the system level. The source code will be made available upon publication.\n","authors":["Zihao Zheng","Yuanchun Li","Jiayu Chen","Peng Zhou","Xiang Chen","Yunxin Liu"],"pdf_url":"https://arxiv.org/pdf/2412.13902v2.pdf","comment":"14 pages, 11 figures"},{"id":"http://arxiv.org/abs/2402.11650v2","updated":"2025-01-10T09:44:48Z","published":"2024-02-18T17:02:39Z","title":"Programmatic Reinforcement Learning: Navigating Gridworlds","summary":" The field of reinforcement learning (RL) is concerned with algorithms for\nlearning optimal policies in unknown stochastic environments. Programmatic RL\nstudies representations of policies as programs, meaning involving higher order\nconstructs such as control loops. Despite attracting a lot of attention at the\nintersection of the machine learning and formal methods communities, very\nlittle is known on the theoretical front about programmatic RL: what are good\nclasses of programmatic policies? How large are optimal programmatic policies?\nHow can we learn them? The goal of this paper is to give first answers to these\nquestions, initiating a theoretical study of programmatic RL. 
Considering a\nclass of gridworld environments, we define a class of programmatic policies.\nOur main contributions are to place upper bounds on the size of optimal\nprogrammatic policies, and to construct an algorithm for synthesizing them.\nThese theoretical findings are complemented by a prototype implementation of\nthe algorithm.\n","authors":["Guruprerana Shabadi","Nathanaël Fijalkow","Théo Matricon"],"pdf_url":"https://arxiv.org/pdf/2402.11650v2.pdf","comment":"Published in the proceedings of GenPlan, AAAI 2025 Workshop on\n Generalization in Planning"},{"id":"http://arxiv.org/abs/2412.09594v2","updated":"2025-01-10T09:40:04Z","published":"2024-12-12T18:58:14Z","title":"Wait-Less Offline Tuning and Re-solving for Online Decision Making","summary":" Online linear programming (OLP) has found broad applications in revenue\nmanagement and resource allocation. State-of-the-art OLP algorithms achieve low\nregret by repeatedly solving linear programming (LP) subproblems that\nincorporate updated resource information. However, LP-based methods are\ncomputationally expensive and often inefficient for large-scale applications.\nIn contrast, recent first-order OLP algorithms are more computationally\nefficient but typically suffer from worse regret guarantees. To address these\nshortcomings, we propose a new algorithm that combines the strengths of\nLP-based and first-order OLP methods. The algorithm re-solves the LP\nsubproblems periodically at a predefined frequency $f$ and uses the latest dual\nprices to guide online decision-making. In addition, a first-order method runs\nin parallel during each interval between LP re-solves, smoothing resource\nconsumption. 
Our algorithm achieves $\mathscr{O}(\log (T/f) + \sqrt{f})$\nregret, delivering a \"wait-less\" online decision-making process that balances\nthe computational efficiency of first-order methods and the superior regret\nguarantee of LP-based methods.\n","authors":["Jingruo Sun","Wenzhi Gao","Ellen Vitercik","Yinyu Ye"],"pdf_url":"https://arxiv.org/pdf/2412.09594v2.pdf","comment":"In this version, we achieve a tighter regret bound with the warm\n start for the first batch. We also make the proof more elegant by manually\n accepting all subsequent orders once the constraint is violated. In this way,\n we do not need to introduce the concept of stopping time for the analysis of\n the LP-based method"},{"id":"http://arxiv.org/abs/2308.00721v4","updated":"2025-01-10T09:35:20Z","published":"2023-07-31T03:56:46Z","title":"A Pre-trained Data Deduplication Model based on Active Learning","summary":" In the era of big data, the issue of data quality has become increasingly\nprominent. One of the main challenges is the problem of duplicate data, which\ncan arise from repeated entry or the merging of multiple data sources. These\n\"dirty data\" problems can significantly limit the effective application of big\ndata. To address the issue of data deduplication, we propose a pre-trained\ndeduplication model based on active learning, which is the first work that\nutilizes active learning to address the problem of deduplication at the\nsemantic level. The model is built on a pre-trained Transformer and fine-tuned\nto solve the deduplication problem as a sequence-to-classification task, which\nfirstly integrates the transformer with active learning into an end-to-end\narchitecture to select the most valuable data for deduplication model training,\nand also firstly employs the R-Drop method to perform data augmentation on each\nround of labeled data, which can reduce the cost of manual labeling and improve\nthe model's performance. 
Experimental results demonstrate that our proposed\nmodel outperforms previous state-of-the-art (SOTA) for deduplicated data\nidentification, achieving up to a 28% improvement in Recall score on benchmark\ndatasets.\n","authors":["Haochen Shi","Xinyao Liu","Fengmao Lv","Hongtao Xue","Jie Hu","Shengdong Du","Tianrui Li"],"pdf_url":"https://arxiv.org/pdf/2308.00721v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.13840v2","updated":"2025-01-10T09:30:44Z","published":"2023-08-26T10:24:43Z","title":"Optimal Transport-inspired Deep Learning Framework for Slow-Decaying\n Kolmogorov n-width Problems: Exploiting Sinkhorn Loss and Wasserstein Kernel","summary":" Reduced order models (ROMs) are widely used in scientific computing to tackle\nhigh-dimensional systems. However, traditional ROM methods may only partially\ncapture the intrinsic geometric characteristics of the data. These\ncharacteristics encompass the underlying structure, relationships, and\nessential features crucial for accurate modeling.\n To overcome this limitation, we propose a novel ROM framework that integrates\noptimal transport (OT) theory and neural network-based methods. Specifically,\nwe investigate the Kernel Proper Orthogonal Decomposition (kPOD) method\nexploiting the Wasserstein distance as the custom kernel, and we efficiently\ntrain the resulting neural network (NN) employing the Sinkhorn algorithm. By\nleveraging an OT-based nonlinear reduction, the presented framework can capture\nthe geometric structure of the data, which is crucial for accurate learning of\nthe reduced solution manifold. 
When compared with traditional metrics such as\nmean squared error or cross-entropy, exploiting the Sinkhorn divergence as the\nloss function enhances stability during training, robustness against\noverfitting and noise, and accelerates convergence.\n To showcase the approach's effectiveness, we conduct experiments on a set of\nchallenging test cases exhibiting a slow decay of the Kolmogorov n-width. The\nresults show that our framework outperforms traditional ROM methods in terms of\naccuracy and computational efficiency.\n","authors":["Moaad Khamlich","Federico Pichi","Gianluigi Rozza"],"pdf_url":"https://arxiv.org/pdf/2308.13840v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.02221v2","updated":"2025-01-10T09:26:32Z","published":"2025-01-04T07:53:38Z","title":"CORD: Generalizable Cooperation via Role Diversity","summary":" Cooperative multi-agent reinforcement learning (MARL) aims to develop agents\nthat can collaborate effectively. However, most cooperative MARL methods\noverfit training agents, making learned policies not generalize well to unseen\ncollaborators, which is a critical issue for real-world deployment. Some\nmethods attempt to address the generalization problem but require prior\nknowledge or predefined policies of new teammates, limiting real-world\napplications. To this end, we propose a hierarchical MARL approach to enable\ngeneralizable cooperation via role diversity, namely CORD. CORD's high-level\ncontroller assigns roles to low-level agents by maximizing the role entropy\nwith constraints. We show this constrained objective can be decomposed into\ncausal influence in role that enables reasonable role assignment, and role\nheterogeneity that yields coherent, non-redundant role clusters. Evaluated on a\nvariety of cooperative multi-agent tasks, CORD achieves better performance than\nbaselines, especially in generalization tests. 
Ablation studies further\ndemonstrate the efficacy of the constrained objective in generalizable\ncooperation.\n","authors":["Kanefumi Matsuyama","Kefan Su","Jiangxing Wang","Deheng Ye","Zongqing Lu"],"pdf_url":"https://arxiv.org/pdf/2501.02221v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.01042v2","updated":"2025-01-10T09:21:43Z","published":"2025-01-02T03:52:22Z","title":"Image-based Multimodal Models as Intruders: Transferable Multimodal\n Attacks on Video-based MLLMs","summary":" Video-based multimodal large language models (V-MLLMs) have shown\nvulnerability to adversarial examples in video-text multimodal tasks. However,\nthe transferability of adversarial videos to unseen models--a common and\npractical real world scenario--remains unexplored. In this paper, we pioneer an\ninvestigation into the transferability of adversarial video samples across\nV-MLLMs. We find that existing adversarial attack methods face significant\nlimitations when applied in black-box settings for V-MLLMs, which we attribute\nto the following shortcomings: (1) lacking generalization in perturbing video\nfeatures, (2) focusing only on sparse key-frames, and (3) failing to integrate\nmultimodal information. To address these limitations and deepen the\nunderstanding of V-MLLM vulnerabilities in black-box scenarios, we introduce\nthe Image-to-Video MLLM (I2V-MLLM) attack. In I2V-MLLM, we utilize an\nimage-based multimodal model (IMM) as a surrogate model to craft adversarial\nvideo samples. Multimodal interactions and temporal information are integrated\nto disrupt video representations within the latent space, improving adversarial\ntransferability. In addition, a perturbation propagation technique is\nintroduced to handle different unknown frame sampling strategies. Experimental\nresults demonstrate that our method can generate adversarial examples that\nexhibit strong transferability across different V-MLLMs on multiple video-text\nmultimodal tasks. 
Compared to white-box attacks on these models, our black-box\nattacks (using BLIP-2 as a surrogate model) achieve competitive performance,\nwith average attack success rates of 55.48% on MSVD-QA and 58.26% on MSRVTT-QA\nfor VideoQA tasks, respectively. Our code will be released upon acceptance.\n","authors":["Linhao Huang","Xue Jiang","Zhiqiang Wang","Wentao Mo","Xi Xiao","Bo Han","Yongjie Yin","Feng Zheng"],"pdf_url":"https://arxiv.org/pdf/2501.01042v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05809v1","updated":"2025-01-10T09:19:10Z","published":"2025-01-10T09:19:10Z","title":"AdaPRL: Adaptive Pairwise Regression Learning with Uncertainty\n Estimation for Universal Regression Tasks","summary":" Current deep regression models usually learn in a point-wise way that treats\neach sample as an independent input, neglecting the relative ordering among\ndifferent data. Consequently, the regression model could neglect the data's\ninterrelationships, potentially resulting in suboptimal performance. Moreover,\nthe existence of aleatoric uncertainty in the training data may drive the model\nto capture non-generalizable patterns, contributing to increased overfitting.\nTo address these issues, we propose a novel adaptive pairwise learning\nframework (AdaPRL) for regression tasks, which leverages the relative\ndifferences between data points and integrates with deep probabilistic models\nto quantify the uncertainty associated with the predictions. Additionally, we\nadapt AdaPRL for applications in multi-task learning and multivariate time\nseries forecasting. 
Extensive experiments with several real-world regression\ndatasets including recommendation systems, age estimation, time series\nforecasting, natural language understanding, finance, and industry datasets\nshow that AdaPRL is compatible with different backbone networks in various\ntasks and achieves state-of-the-art performance on the vast majority of tasks,\nhighlighting its notable potential including enhancing prediction accuracy and\nranking ability, increasing generalization capability, improving robustness to\nnoisy data, improving resilience to reduced data, and enhancing\ninterpretability, etc.\n","authors":["Fuhang Liang","Rucong Xu","Deng Lin"],"pdf_url":"https://arxiv.org/pdf/2501.05809v1.pdf","comment":"22 pages, 11 figures"},{"id":"http://arxiv.org/abs/2402.13572v2","updated":"2025-01-10T09:11:39Z","published":"2024-02-21T07:07:54Z","title":"AlgoFormer: An Efficient Transformer Framework with Algorithmic\n Structures","summary":" Besides natural language processing, transformers exhibit extraordinary\nperformance in solving broader applications, including scientific computing and\ncomputer vision. Previous works try to explain this from the expressive power\nand capability perspectives that standard transformers are capable of\nperforming some algorithms. To empower transformers with algorithmic\ncapabilities and motivated by the recently proposed looped transformer, we\ndesign a novel transformer framework, dubbed Algorithm Transformer (abbreviated\nas AlgoFormer). We provide an insight that efficient transformer architectures\ncan be designed by leveraging prior knowledge of tasks and the underlying\nstructure of potential algorithms. Compared with the standard transformer and\nvanilla looped transformer, the proposed AlgoFormer can perform efficiently in\nalgorithm representation in some specific tasks. 
In particular, inspired by the\nstructure of human-designed learning algorithms, our transformer framework\nconsists of a pre-transformer that is responsible for task preprocessing, a\nlooped transformer for iterative optimization algorithms, and a\npost-transformer for producing the desired results after post-processing. We\nprovide theoretical evidence of the expressive power of the AlgoFormer in\nsolving some challenging problems, mirroring human-designed algorithms.\nFurthermore, some theoretical and empirical results are presented to show that\nthe designed transformer has the potential to perform algorithm representation\nand learning. Experimental results demonstrate the empirical superiority of the\nproposed transformer in that it outperforms the standard transformer and\nvanilla looped transformer in some specific tasks. An extensive experiment on\nreal language tasks (e.g., neural machine translation of German and English,\nand text classification) further validates the expressiveness and effectiveness\nof AlgoFormer.\n","authors":["Yihang Gao","Chuanyang Zheng","Enze Xie","Han Shi","Tianyang Hu","Yu Li","Michael K. Ng","Zhenguo Li","Zhaoqiang Liu"],"pdf_url":"https://arxiv.org/pdf/2402.13572v2.pdf","comment":"Published at Transactions on Machine Learning Research (TMLR). The\n paper provides insight that the Transformer architectures can mimic the\n algorithm structures in (in-context) algorithm learning and representation.\n The incorporated algorithmic structure in Algoformer shows its potential in\n (deep learning for) scientific computing, besides the real language tasks"},{"id":"http://arxiv.org/abs/2501.05803v1","updated":"2025-01-10T09:10:30Z","published":"2025-01-10T09:10:30Z","title":"Alignment without Over-optimization: Training-Free Solution for\n Diffusion Models","summary":" Diffusion models excel in generative tasks, but aligning them with specific\nobjectives while maintaining their versatility remains challenging. 
Existing\nfine-tuning methods often suffer from reward over-optimization, while\napproximate guidance approaches fail to optimize target rewards effectively.\nAddressing these limitations, we propose a training-free sampling method based\non Sequential Monte Carlo (SMC) to sample from the reward-aligned target\ndistribution. Our approach, tailored for diffusion sampling and incorporating\ntempering techniques, achieves comparable or superior target rewards to\nfine-tuning methods while preserving diversity and cross-reward generalization.\nWe demonstrate its effectiveness in single-reward optimization, multi-objective\nscenarios, and online black-box optimization. This work offers a robust\nsolution for aligning diffusion models with diverse downstream objectives\nwithout compromising their general capabilities. Code is available at\nhttps://github.com/krafton-ai/DAS .\n","authors":["Sunwoo Kim","Minkyu Kim","Dongmin Park"],"pdf_url":"https://arxiv.org/pdf/2501.05803v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05795v1","updated":"2025-01-10T08:57:50Z","published":"2025-01-10T08:57:50Z","title":"Robust Counterfactual Explanations under Model Multiplicity Using\n Multi-Objective Optimization","summary":" In recent years, explainability in machine learning has gained importance. In\nthis context, counterfactual explanation (CE), which is an explanation method\nthat uses examples, has attracted attention. However, it has been pointed out\nthat CE is not robust when there are multiple machine-learning models. These\nproblems are important when using machine learning to make safe decisions. In\nthis paper, we propose robust CEs that introduce a new viewpoint - Pareto\nimprovement - and a method that uses multi-objective optimization to generate\nit. To evaluate the proposed method, we conducted experiments using both\nsimulated and actual data. The results demonstrate that the proposed method is\nrobust and useful. 
We believe that this research will contribute to a wide\nrange of research areas, such as explainability in machine learning,\ndecision-making, and action planning based on machine learning.\n","authors":["Keita Kinjo"],"pdf_url":"https://arxiv.org/pdf/2501.05795v1.pdf","comment":"19 pages"},{"id":"http://arxiv.org/abs/2501.05790v1","updated":"2025-01-10T08:50:38Z","published":"2025-01-10T08:50:38Z","title":"Understanding Impact of Human Feedback via Influence Functions","summary":" In Reinforcement Learning from Human Feedback (RLHF), it is crucial to learn\nsuitable reward models from human feedback to align large language models\n(LLMs) with human intentions. However, human feedback can often be noisy,\ninconsistent, or biased, especially when evaluating complex responses. Such\nfeedback can lead to misaligned reward signals, potentially causing unintended\nside effects during the RLHF process. To address these challenges, we explore\nthe use of influence functions to measure the impact of human feedback on the\nperformance of reward models. We propose a compute-efficient approximation\nmethod that enables the application of influence functions to LLM-based reward\nmodels and large-scale preference datasets. In our experiments, we demonstrate\ntwo key applications of influence functions: (1) detecting common forms of\nlabeler bias in human feedback datasets and (2) guiding labelers to refine\ntheir strategies to align more closely with expert feedback. By quantifying the\nimpact of human feedback on reward models, we believe that influence functions\ncan enhance feedback interpretability and contribute to scalable oversight in\nRLHF, helping labelers provide more accurate and consistent feedback. 
Source\ncode is available at https://github.com/mintaywon/IF_RLHF\n","authors":["Taywon Min","Haeone Lee","Hanho Ryu","Yongchan Kwon","Kimin Lee"],"pdf_url":"https://arxiv.org/pdf/2501.05790v1.pdf","comment":"Source code: https://github.com/mintaywon/IF_RLHF"},{"id":"http://arxiv.org/abs/2501.02564v2","updated":"2025-01-10T08:40:49Z","published":"2025-01-05T14:42:47Z","title":"Balanced Multi-view Clustering","summary":" Multi-view clustering (MvC) aims to integrate information from different\nviews to enhance the capability of the model in capturing the underlying data\nstructures. The widely used joint training paradigm in MvC is potentially not\nfully leverage the multi-view information, since the imbalanced and\nunder-optimized view-specific features caused by the uniform learning objective\nfor all views. For instance, particular views with more discriminative\ninformation could dominate the learning process in the joint training paradigm,\nleading to other views being under-optimized. To alleviate this issue, we first\nanalyze the imbalanced phenomenon in the joint-training paradigm of multi-view\nclustering from the perspective of gradient descent for each view-specific\nfeature extractor. Then, we propose a novel balanced multi-view clustering\n(BMvC) method, which introduces a view-specific contrastive regularization\n(VCR) to modulate the optimization of each view. Concretely, VCR preserves the\nsample similarities captured from the joint features and view-specific ones\ninto the clustering distributions corresponding to view-specific features to\nenhance the learning process of view-specific feature extractors. Additionally,\na theoretical analysis is provided to illustrate that VCR adaptively modulates\nthe magnitudes of gradients for updating the parameters of view-specific\nfeature extractors to achieve a balanced multi-view learning procedure. 
In such\na manner, BMvC achieves a better trade-off between the exploitation of\nview-specific patterns and the exploration of view-invariance patterns to fully\nlearn the multi-view information for the clustering task. Finally, a set of\nexperiments are conducted to verify the superiority of the proposed method\ncompared with state-of-the-art approaches both on eight benchmark MvC datasets\nand two spatially resolved transcriptomics datasets.\n","authors":["Zhenglai Li","Jun Wang","Chang Tang","Xinzhong Zhu","Wei Zhang","Xinwang Liu"],"pdf_url":"https://arxiv.org/pdf/2501.02564v2.pdf","comment":"We are withdrawing this paper due to issues in the experimental\n section related to the Application for Spatially Resolved Transcriptomics\n Data Clustering. These issues affect the validity of the results presented.\n We believe it is necessary to withdraw the paper to address these problems\n adequately before resubmission."},{"id":"http://arxiv.org/abs/2501.05775v1","updated":"2025-01-10T08:15:02Z","published":"2025-01-10T08:15:02Z","title":"STHFL: Spatio-Temporal Heterogeneous Federated Learning","summary":" Federated learning is a new framework that protects data privacy and allows\nmultiple devices to cooperate in training machine learning models. Previous\nstudies have proposed multiple approaches to eliminate the challenges posed by\nnon-iid data and inter-domain heterogeneity issues. However, they ignore the\n\\textbf{spatio-temporal} heterogeneity formed by different data distributions\nof increasing task data in the intra-domain. Moreover, the global data is\ngenerally a long-tailed distribution rather than assuming the global data is\nbalanced in practical applications. To tackle the \\textbf{spatio-temporal}\ndilemma, we propose a novel setting named \\textbf{Spatio-Temporal\nHeterogeneity} Federated Learning (STHFL). Specially, the Global-Local Dynamic\nPrototype (GLDP) framework is designed for STHFL. 
In GLDP, the model in each\nclient contains personalized layers which can dynamically adapt to different\ndata distributions. For long-tailed data distribution, global prototypes are\nserved as complementary knowledge for the training on classes with few samples\nin clients without leaking privacy. As tasks increase in clients, the knowledge\nof local prototypes generated in previous tasks guides for training in the\ncurrent task to solve catastrophic forgetting. Meanwhile, the global-local\nprototypes are updated through the moving average method after training local\nprototypes in clients. Finally, we evaluate the effectiveness of GLDP, which\nachieves remarkable results compared to state-of-the-art methods in STHFL\nscenarios.\n","authors":["Shunxin Guo","Hongsong Wang","Shuxia Lin","Xu Yang","Xin Geng"],"pdf_url":"https://arxiv.org/pdf/2501.05775v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04614v2","updated":"2025-01-10T08:07:16Z","published":"2024-12-05T21:00:46Z","title":"Extractive Structures Learned in Pretraining Enable Generalization on\n Finetuned Facts","summary":" Pretrained language models (LMs) can generalize to implications of facts that\nthey are finetuned on. For example, if finetuned on ``John Doe lives in Tokyo,\"\nLMs can correctly answer ``What language do the people in John Doe's city\nspeak?'' with ``Japanese''. However, little is known about the mechanisms that\nenable this generalization or how they are learned during pretraining. We\nintroduce extractive structures as a framework for describing how components in\nLMs (e.g., MLPs or attention heads) coordinate to enable this generalization.\nThe structures consist of informative components that store training facts as\nweight changes, and upstream and downstream extractive components that query\nand process the stored information to produce the correct implication. 
We\nhypothesize that extractive structures are learned during pretraining when\nencountering implications of previously known facts. This yields two\npredictions: a data ordering effect where extractive structures can be learned\nonly if facts precede their implications, and a weight grafting effect where\nextractive structures can be transferred to predict counterfactual\nimplications. We empirically demonstrate these phenomena in the OLMo-7b, Llama\n3-8b, Gemma 2-9b, and Qwen 2-7b models. Of independent interest, our results\nalso indicate that fact learning can occur at both early and late layers, which\nlead to different forms of generalization.\n","authors":["Jiahai Feng","Stuart Russell","Jacob Steinhardt"],"pdf_url":"https://arxiv.org/pdf/2412.04614v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05772v1","updated":"2025-01-10T08:07:14Z","published":"2025-01-10T08:07:14Z","title":"rmlnomogram: An R package to construct an explainable nomogram for any\n machine learning algorithms","summary":" Background: Current nomogram can only be created for regression algorithm.\nProviding nomogram for any machine learning (ML) algorithms may accelerate\nmodel deployment in clinical settings or improve model availability. We\ndeveloped an R package and web application to construct nomogram with model\nexplainability of any ML algorithms. Methods: We formulated a function to\ntransform an ML prediction model into a nomogram, requiring datasets with: (1)\nall possible combinations of predictor values; (2) the corresponding outputs of\nthe model; and (3) the corresponding explainability values for each predictor\n(optional). Web application was also created. 
Results: Our R package could\ncreate 5 types of nomograms for categorical predictors and binary outcome\nwithout probability (1), categorical predictors and binary outcome with\nprobability (2) or continuous outcome (3), and categorical with single\nnumerical predictors and binary outcome with probability (4) or continuous\noutcome (5). Respectively, the first and remaining types optimally allowed\nmaximum 15 and 5 predictors with maximum 3,200 combinations. Web application is\nprovided with such limits. The explainability values were possible for types 2\nto 5. Conclusions: Our R package and web application could construct nomogram\nwith model explainability of any ML algorithms using a fair number of\npredictors.\n","authors":["Herdiantri Sufriyana","Emily Chia-Yu Su"],"pdf_url":"https://arxiv.org/pdf/2501.05772v1.pdf","comment":"16 pages, 2 figures, 1 table, 3 equations, 1 algorithm, 4 code\n snippets"},{"id":"http://arxiv.org/abs/2311.02565v2","updated":"2025-01-10T08:01:09Z","published":"2023-11-05T04:43:48Z","title":"KITS: Inductive Spatio-Temporal Kriging with Increment Training Strategy","summary":" Sensors are commonly deployed to perceive the environment. However, due to\nthe high cost, sensors are usually sparsely deployed. Kriging is the tailored\ntask to infer the unobserved nodes (without sensors) using the observed source\nnodes (with sensors). The essence of kriging task is transferability. Recently,\nseveral inductive spatio-temporal kriging methods have been proposed based on\ngraph neural networks, being trained based on a graph built on top of observed\nnodes via pretext tasks such as masking nodes out and reconstructing them.\nHowever, the graph in training is inevitably much sparser than the graph in\ninference that includes all the observed and unobserved nodes. The learned\npattern cannot be well generalized for inference, denoted as graph gap. 
To\naddress this issue, we first present a novel Increment training strategy:\ninstead of masking nodes (and reconstructing them), we add virtual nodes into\nthe training graph so as to mitigate the graph gap issue naturally.\nNevertheless, the empty-shell virtual nodes without labels could have\nbad-learned features and lack supervision signals. To solve these issues, we\npair each virtual node with its most similar observed node and fuse their\nfeatures together; to enhance the supervision signal, we construct reliable\npseudo labels for virtual nodes. As a result, the learned pattern of virtual\nnodes could be safely transferred to real unobserved nodes for reliable\nkriging. We name our new Kriging model with Increment Training Strategy as\nKITS. Extensive experiments demonstrate that KITS consistently outperforms\nexisting kriging methods by large margins, e.g., the improvement over MAE score\ncould be as high as 18.33%.\n","authors":["Qianxiong Xu","Cheng Long","Ziyue Li","Sijie Ruan","Rui Zhao","Zhishuai Li"],"pdf_url":"https://arxiv.org/pdf/2311.02565v2.pdf","comment":"This paper is accepted by AAAI'25"},{"id":"http://arxiv.org/abs/2501.05768v1","updated":"2025-01-10T07:56:30Z","published":"2025-01-10T07:56:30Z","title":"Halal or Not: Knowledge Graph Completion for Predicting Cultural\n Appropriateness of Daily Products","summary":" The growing demand for halal cosmetic products has exposed significant\nchallenges, especially in Muslim-majority countries. Recently, various machine\nlearning-based strategies, e.g., image-based methods, have shown remarkable\nsuccess in predicting the halal status of cosmetics. However, these methods\nmainly focus on analyzing the discrete and specific ingredients within separate\ncosmetics, which ignore the high-order and complex relations between cosmetics\nand ingredients. 
To address this problem, we propose a halal cosmetic\nrecommendation framework, namely HaCKG, that leverages a knowledge graph of\ncosmetics and their ingredients to explicitly model and capture the\nrelationships between cosmetics and their components. By representing cosmetics\nand ingredients as entities within the knowledge graph, HaCKG effectively\nlearns the high-order and complex relations between entities, offering a robust\nmethod for predicting halal status. Specifically, we first construct a cosmetic\nknowledge graph representing the relations between various cosmetics,\ningredients, and their properties. We then propose a pre-trained relational\ngraph attention network model with residual connections to learn the structural\nrelation between entities in the knowledge graph. The pre-trained model is then\nfine-tuned on downstream cosmetic data to predict halal status. Extensive\nexperiments on the cosmetic dataset over halal prediction tasks demonstrate the\nsuperiority of our model over state-of-the-art baselines.\n","authors":["Van Thuy Hoang","Tien-Bach-Thanh Do","Jinho Seo","Seung Charlie Kim","Luong Vuong Nguyen","Duong Nguyen Minh Huy","Hyeon-Ju Jeon","O-Joun Lee"],"pdf_url":"https://arxiv.org/pdf/2501.05768v1.pdf","comment":"10 pages"},{"id":"http://arxiv.org/abs/2501.04063v2","updated":"2025-01-10T07:44:50Z","published":"2025-01-07T10:54:25Z","title":"Fuzzy Information Entropy and Region Biased Matrix Factorization for Web\n Service QoS Prediction","summary":" Nowadays, there are many similar services available on the internet, making\nQuality of Service (QoS) a key concern for users. Since collecting QoS values\nfor all services through user invocations is impractical, predicting QoS values\nis a more feasible approach. Matrix factorization is considered an effective\nprediction method. 
However, most existing matrix factorization algorithms focus\non capturing global similarities between users and services, overlooking the\nlocal similarities between users and their similar neighbors, as well as the\nnon-interactive effects between users and services. This paper proposes a\nmatrix factorization approach based on user information entropy and region\nbias, which utilizes a similarity measurement method based on fuzzy information\nentropy to identify similar neighbors of users. Simultaneously, it integrates\nthe region bias between each user and service linearly into matrix\nfactorization to capture the non-interactive features between users and\nservices. This method demonstrates improved predictive performance in more\nrealistic and complex network environments. Additionally, numerous experiments\nare conducted on real-world QoS datasets. The experimental results show that\nthe proposed method outperforms some of the state-of-the-art methods in the\nfield at matrix densities ranging from 5% to 20%.\n","authors":["Guoxing Tang","Yugen Du","Xia Chen","Yingwei Luo","Benchi Ma"],"pdf_url":"https://arxiv.org/pdf/2501.04063v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.12746v2","updated":"2025-01-10T07:38:53Z","published":"2023-10-19T13:50:56Z","title":"TabuLa: Harnessing Language Models for Tabular Data Synthesis","summary":" Tabular data synthesis is crucial for addressing privacy and security\nconcerns in industries reliant on tabular data. While recent advancements adopt\nlarge language models (LLMs) for realistic tabular data generation, their long\ntraining times and limited reusability hinder practical applications. In this\npaper, we propose Tabula, a tabular data synthesizer that leverages the\nstructure of LLM. 
Unlike state-of-the-art (SOTA) LLM-based tabular data\nsynthesizers that rely on pre-trained LLMs, Tabula discards the pre-trained\nweights originally designed for natural language tasks, focusing instead on a\ntailored approach for tabular data. In addition, Tabula introduces a token\nsequence compression strategy that significantly reduces training time while\nmaintaining data quality, alongside a novel token padding method that improves\nsequence alignment across training batches. Experiments on six datasets show\nthat Tabula achieves superior synthetic data utility compared to current SOTA\nmethods. Additionally, the results demonstrate that Tabula model trained on\ntabular datasets serves effectively as a foundational model for synthesizing\nnew tabular datasets. Furthermore, the proposed padding method outperforms the\nconventional left and right padding strategies. Finally, the results highlight\nthat Tabula averagely reduces training time per epoch by 46.2% compared to\nstate-of-the-art LLM approaches while achieving higher data utility. Our code\nis available at https://github.com/zhao-zilong/Tabula\n","authors":["Zilong Zhao","Robert Birke","Lydia Chen"],"pdf_url":"https://arxiv.org/pdf/2310.12746v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05762v1","updated":"2025-01-10T07:38:24Z","published":"2025-01-10T07:38:24Z","title":"Development and Comparison of Model-Based and Data-Driven Approaches for\n the Prediction of the Mechanical Properties of Lattice Structures","summary":" Lattice structures have great potential for several application fields\nranging from medical and tissue engineering to aeronautical one. Their\ndevelopment is further speeded up by the continuing advances in additive\nmanufacturing technologies that allow to overcome issues typical of standard\nprocesses and to propose tailored designs. However, the design of lattice\nstructures is still challenging since their properties are considerably\naffected by numerous factors. 
The present paper aims to propose, discuss, and\ncompare various modeling approaches to describe, understand, and predict the\ncorrelations between the mechanical properties and the void volume fraction of\ndifferent types of lattice structures fabricated by fused deposition modeling\n3D printing. Particularly, four approaches are proposed: (i) a simplified\nanalytical model; (ii) a semi-empirical model combining analytical equations\nwith experimental correction factors; (iii) an artificial neural network\ntrained on experimental data; (iv) numerical simulations by finite element\nanalyses. The comparison among the various approaches, and with experimental\ndata, allows to identify the performances, advantages, and disadvantages of\neach approach, thus giving important guidelines for choosing the right design\nmethodology based on the needs and available data.\n","authors":["Chiara Pasini","Oscar Ramponi","Stefano Pandini","Luciana Sartore","Giulia Scalet"],"pdf_url":"https://arxiv.org/pdf/2501.05762v1.pdf","comment":"This work was funded by the European Union ERC CoDe4Bio Grant ID\n 101039467 under the funding programme Horizon Europe"},{"id":"http://arxiv.org/abs/2405.18144v3","updated":"2025-01-10T07:22:12Z","published":"2024-05-28T13:02:56Z","title":"4-bit Shampoo for Memory-Efficient Network Training","summary":" Second-order optimizers, maintaining a matrix termed a preconditioner, are\nsuperior to first-order optimizers in both theory and practice. The states\nforming the preconditioner and its inverse root restrict the maximum size of\nmodels trained by second-order optimizers. To address this, compressing 32-bit\noptimizer states to lower bitwidths has shown promise in reducing memory usage.\nHowever, current approaches only pertain to first-order optimizers. In this\npaper, we propose the first 4-bit second-order optimizers, exemplified by 4-bit\nShampoo, maintaining performance similar to that of 32-bit ones. 
We show that\nquantizing the eigenvector matrix of the preconditioner in 4-bit Shampoo is\nremarkably better than quantizing the preconditioner itself both theoretically\nand experimentally. By rectifying the orthogonality of the quantized\neigenvector matrix, we enhance the approximation of the preconditioner's\neigenvector matrix, which also benefits the computation of its inverse 4-th\nroot. Besides, we find that linear square quantization slightly outperforms\ndynamic tree quantization when quantizing second-order optimizer states.\nEvaluation on various networks for image classification and natural language\nmodeling demonstrates that our 4-bit Shampoo achieves comparable performance to\nits 32-bit counterpart while being more memory-efficient.\n","authors":["Sike Wang","Pan Zhou","Jia Li","Hua Huang"],"pdf_url":"https://arxiv.org/pdf/2405.18144v3.pdf","comment":"NeurIPS 2024 final camera-ready revisions, rectify the legend in\n figure 9"},{"id":"http://arxiv.org/abs/2501.05755v1","updated":"2025-01-10T07:13:42Z","published":"2025-01-10T07:13:42Z","title":"CognoSpeak: an automatic, remote assessment of early cognitive decline\n in real-world conversational speech","summary":" The early signs of cognitive decline are often noticeable in conversational\nspeech, and identifying those signs is crucial in dealing with later and more\nserious stages of neurodegenerative diseases. Clinical detection is costly and\ntime-consuming and although there has been recent progress in the automatic\ndetection of speech-based cues, those systems are trained on relatively small\ndatabases, lacking detailed metadata and demographic information. This paper\npresents CognoSpeak and its associated data collection efforts. CognoSpeak asks\nmemory-probing long and short-term questions and administers standard cognitive\ntasks such as verbal and semantic fluency and picture description using a\nvirtual agent on a mobile or web platform. 
In addition, it collects multimodal\ndata such as audio and video along with a rich set of metadata from primary and\nsecondary care, memory clinics and remote settings like people's homes. Here,\nwe present results from 126 subjects whose audio was manually transcribed.\nSeveral classic classifiers, as well as large language model-based classifiers,\nhave been investigated and evaluated across the different types of prompts. We\ndemonstrate a high level of performance; in particular, we achieved an F1-score\nof 0.873 using a DistilBERT model to discriminate people with cognitive\nimpairment (dementia and people with mild cognitive impairment (MCI)) from\nhealthy volunteers using the memory responses, fluency tasks and cookie theft\npicture description. CognoSpeak is an automatic, remote, low-cost, repeatable,\nnon-invasive and less stressful alternative to existing clinical cognitive\nassessments.\n","authors":["Madhurananda Pahar","Fuxiang Tao","Bahman Mirheidari","Nathan Pevy","Rebecca Bright","Swapnil Gadgil","Lise Sproson","Dorota Braun","Caitlin Illingworth","Daniel Blackburn","Heidi Christensen"],"pdf_url":"https://arxiv.org/pdf/2501.05755v1.pdf","comment":"This paper has been accepted for publication in IEEE SSCI 2025.\n Copyright belongs to IEEE"},{"id":"http://arxiv.org/abs/2501.05745v1","updated":"2025-01-10T06:21:48Z","published":"2025-01-10T06:21:48Z","title":"Covariate Dependent Mixture of Bayesian Networks","summary":" Learning the structure of Bayesian networks from data provides insights into\nunderlying processes and the causal relationships that generate the data, but\nits usefulness depends on the homogeneity of the data population, a condition\noften violated in real-world applications. In such cases, using a single\nnetwork structure for inference can be misleading, as it may not capture\nsub-population differences. 
To address this, we propose a novel approach of\nmodelling a mixture of Bayesian networks where component probabilities depend\non individual characteristics. Our method identifies both network structures\nand demographic predictors of sub-population membership, aiding personalised\ninterventions. We evaluate our method through simulations and a youth mental\nhealth case study, demonstrating its potential to improve tailored\ninterventions in health, education, and social policy.\n","authors":["Roman Marchant","Dario Draca","Gilad Francis","Sahand Assadzadeh","Mathew Varidel","Frank Iorfino","Sally Cripps"],"pdf_url":"https://arxiv.org/pdf/2501.05745v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05744v1","updated":"2025-01-10T06:20:27Z","published":"2025-01-10T06:20:27Z","title":"LLVD: LSTM-based Explicit Motion Modeling in Latent Space for Blind\n Video Denoising","summary":" Video restoration plays a pivotal role in revitalizing degraded video content\nby rectifying imperfections caused by various degradations introduced during\ncapturing (sensor noise, motion blur, etc.), saving/sharing (compression,\nresizing, etc.) and editing. This paper introduces a novel algorithm designed\nfor scenarios where noise is introduced during video capture, aiming to enhance\nthe visual quality of videos by reducing unwanted noise artifacts. We propose\nthe Latent space LSTM Video Denoiser (LLVD), an end-to-end blind denoising\nmodel. LLVD uniquely combines spatial and temporal feature extraction,\nemploying Long Short Term Memory (LSTM) within the encoded feature domain. This\nintegration of LSTM layers is crucial for maintaining continuity and minimizing\nflicker in the restored video. Moreover, processing frames in the encoded\nfeature domain significantly reduces computations, resulting in a very\nlightweight architecture. 
LLVD's blind nature makes it versatile for real,\nin-the-wild denoising scenarios where prior information about noise\ncharacteristics is not available. Experiments reveal that LLVD demonstrates\nexcellent performance for both synthetic and captured noise. Specifically, LLVD\nsurpasses the current State-Of-The-Art (SOTA) in RAW denoising by 0.3dB, while\nalso achieving a 59\\% reduction in computational complexity.\n","authors":["Loay Rashid","Siddharth Roheda","Amit Unde"],"pdf_url":"https://arxiv.org/pdf/2501.05744v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05735v1","updated":"2025-01-10T06:04:32Z","published":"2025-01-10T06:04:32Z","title":"ELENA: Epigenetic Learning through Evolved Neural Adaptation","summary":" Despite the success of metaheuristic algorithms in solving complex network\noptimization problems, they often struggle with adaptation, especially in\ndynamic or high-dimensional search spaces. Traditional approaches can become\nstuck in local optima, leading to inefficient exploration and suboptimal\nsolutions. Most of the widely accepted advanced algorithms do well either on\nhighly complex or smaller search spaces due to the lack of adaptation. To\naddress these limitations, we present ELENA (Epigenetic Learning through\nEvolved Neural Adaptation), a new evolutionary framework that incorporates\nepigenetic mechanisms to enhance the adaptability of the core evolutionary\napproach. ELENA leverages compressed representation of learning parameters\nimproved dynamically through epigenetic tags that serve as adaptive memory.\nThree epigenetic tags (mutation resistance, crossover affinity, and stability\nscore) assist with guiding solution space search, facilitating a more\nintelligent hypothesis landscape exploration. To assess the framework\nperformance, we conduct experiments on three critical network optimization\nproblems: the Traveling Salesman Problem (TSP), the Vehicle Routing Problem\n(VRP), and the Maximum Clique Problem (MCP). 
Experiments indicate that ELENA\nachieves competitive results, often surpassing state-of-the-art methods on\nnetwork optimization tasks.\n","authors":["Boris Kriuk","Keti Sulamanidze","Fedor Kriuk"],"pdf_url":"https://arxiv.org/pdf/2501.05735v1.pdf","comment":"15 pages, 6 figures, 4 tables, 2 algorithms"},{"id":"http://arxiv.org/abs/2501.05731v1","updated":"2025-01-10T05:55:14Z","published":"2025-01-10T05:55:14Z","title":"Diving Deep: Forecasting Sea Surface Temperatures and Anomalies","summary":" This overview paper details the findings from the Diving Deep: Forecasting\nSea Surface Temperatures and Anomalies Challenge at the European Conference on\nMachine Learning and Principles and Practice of Knowledge Discovery in\nDatabases (ECML PKDD) 2024. The challenge focused on the data-driven\npredictability of global sea surface temperatures (SSTs), a key factor in\nclimate forecasting, ecosystem management, fisheries management, and climate\nchange monitoring. The challenge involved forecasting SST anomalies (SSTAs)\nthree months in advance using historical data and included a special task of\npredicting SSTAs nine months ahead for the Baltic Sea. Participants utilized\nvarious machine learning approaches to tackle the task, leveraging data from\nERA5. This paper discusses the methodologies employed, the results obtained,\nand the lessons learned, offering insights into the future of climate-related\npredictive modeling.\n","authors":["Ding Ning","Varvara Vetrova","Karin R. Bryan","Yun Sing Koh","Andreas Voskou","N'Dah Jean Kouagou","Arnab Sharma"],"pdf_url":"https://arxiv.org/pdf/2501.05731v1.pdf","comment":"The paper contains 9 pages for the main text and 10 pages including\n References. 5 figures. 
Discovery Track, European Conference on Machine\n Learning and Principles and Practice of Knowledge Discovery in Databases\n (ECML PKDD) 2024"},{"id":"http://arxiv.org/abs/2501.05730v1","updated":"2025-01-10T05:54:04Z","published":"2025-01-10T05:54:04Z","title":"Element-wise Attention Is All You Need","summary":" The self-attention (SA) mechanism has demonstrated superior performance\nacross various domains, yet it suffers from substantial complexity during both\ntraining and inference. The next-generation architecture, aiming at retaining\nthe competitive performance of SA while achieving low-cost inference and\nefficient long-sequence training, primarily focuses on three approaches: linear\nattention, linear RNNs, and state space models. Although these approaches\nachieve reduced complexity than SA, they all have built-in performance\ndegradation factors, such as diminished “spikiness” and compression of\nhistorical information. In contrast to these approaches, we propose a novel\nelement-wise attention mechanism, which uses the element-wise squared Euclidean\ndistance, instead of the dot product operation, to compute similarity and\napproximates the quadratic complexity term $\\exp(q_{ic}k_{jc})$ with a Taylor\npolynomial. This design achieves remarkable efficiency: during training, the\nelement-wise attention has a complexity of $\\mathcal{O}(tLD)$, making\nlong-sequence training both computationally and memory efficient, where $L$ is\nthe sequence length, $D$ is the feature dimension, and $t$ is the highest order\nof the polynomial; during inference, it can be reformulated as recurrent neural\nnetworks, achieving a inference complexity of $\\mathcal{O}(tD)$. 
Furthermore,\nthe element-wise attention circumvents the performance degradation factors\npresent in these approaches and achieves performance comparable to SA in both\ncausal and non-causal forms.\n","authors":["Guoxin Feng"],"pdf_url":"https://arxiv.org/pdf/2501.05730v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05727v1","updated":"2025-01-10T05:51:52Z","published":"2025-01-10T05:51:52Z","title":"Enabling Scalable Oversight via Self-Evolving Critic","summary":" Despite their remarkable performance, the development of Large Language\nModels (LLMs) faces a critical challenge in scalable oversight: providing\neffective feedback for tasks where human evaluation is difficult or where LLMs\noutperform humans. While there is growing interest in using LLMs for critique,\ncurrent approaches still rely on human annotations or more powerful models,\nleaving the issue of enhancing critique capabilities without external\nsupervision unresolved. We introduce SCRIT (Self-evolving CRITic), a framework\nthat enables genuine self-evolution of critique abilities. Technically, SCRIT\nself-improves by training on synthetic data, generated by a contrastive-based\nself-critic that uses reference solutions for step-by-step critique, and a\nself-validation mechanism that ensures critique quality through correction\noutcomes. Implemented with Qwen2.5-72B-Instruct, one of the most powerful LLMs,\nSCRIT achieves up to a 10.3\\% improvement on critique-correction and error\nidentification benchmarks. 
Our analysis reveals that SCRIT's performance scales\npositively with data and model size, outperforms alternative approaches, and\nbenefits critically from its self-validation component.\n","authors":["Zhengyang Tang","Ziniu Li","Zhenyang Xiao","Tian Ding","Ruoyu Sun","Benyou Wang","Dayiheng Liu","Fei Huang","Tianyu Liu","Bowen Yu","Junyang Lin"],"pdf_url":"https://arxiv.org/pdf/2501.05727v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.00778v3","updated":"2025-01-10T05:35:58Z","published":"2024-06-02T15:35:45Z","title":"Bayesian Joint Additive Factor Models for Multiview Learning","summary":" It is increasingly common in a wide variety of applied settings to collect\ndata of multiple different types on the same set of samples. Our particular\nfocus in this article is on studying relationships between such multiview\nfeatures and responses. A motivating application arises in the context of\nprecision medicine where multi-omics data are collected to correlate with\nclinical outcomes. It is of interest to infer dependence within and across\nviews while combining multimodal information to improve the prediction of\noutcomes. The signal-to-noise ratio can vary substantially across views,\nmotivating more nuanced statistical tools beyond standard late and early\nfusion. This challenge comes with the need to preserve interpretability, select\nfeatures, and obtain accurate uncertainty quantification. We propose a joint\nadditive factor regression model (JAFAR) with a structured additive design,\naccounting for shared and view-specific components. We ensure identifiability\nvia a novel dependent cumulative shrinkage process (D-CUSP) prior. We provide\nan efficient implementation via a partially collapsed Gibbs sampler and extend\nour approach to allow flexible feature and outcome distributions. Prediction of\ntime-to-labor onset from immunome, metabolome, and proteome data illustrates\nperformance gains against state-of-the-art competitors. 
Our open-source\nsoftware (R package) is available at https://github.com/niccoloanceschi/jafar.\n","authors":["Niccolo Anceschi","Federico Ferrari","David B. Dunson","Himel Mallick"],"pdf_url":"https://arxiv.org/pdf/2406.00778v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.15720v2","updated":"2025-01-10T05:32:06Z","published":"2023-08-30T02:50:54Z","title":"Surrogate-based Autotuning for Randomized Sketching Algorithms in\n Regression Problems","summary":" Algorithms from Randomized Numerical Linear Algebra (RandNLA) are known to be\neffective in handling high-dimensional computational problems, providing\nhigh-quality empirical performance as well as strong probabilistic guarantees.\nHowever, their practical application is complicated by the fact that the user\nneeds to set various algorithm-specific tuning parameters which are different\nthan those used in traditional NLA. This paper demonstrates how a\nsurrogate-based autotuning approach can be used to address fundamental problems\nof parameter selection in RandNLA algorithms. In particular, we provide a\ndetailed investigation of surrogate-based autotuning for\nsketch-and-precondition (SAP) based randomized least squares methods, which\nhave been one of the great success stories in modern RandNLA. Empirical results\nshow that our surrogate-based autotuning approach can achieve near-optimal\nperformance with much less tuning cost than a random search (up to about 4x\nfewer trials of different parameter configurations). Moreover, while our\nexperiments focus on least squares, our results demonstrate a general-purpose\nautotuning pipeline applicable to any kind of RandNLA algorithm.\n","authors":["Younghyun Cho","James W. Demmel","Michał Dereziński","Haoyun Li","Hengrui Luo","Michael W. Mahoney","Riley J. Murray"],"pdf_url":"https://arxiv.org/pdf/2308.15720v2.pdf","comment":"Improved the presentation and clarity. Updated experimental results\n and scenarios. 
Accepted for publication in SIAM Journal on Matrix Analysis\n and Applications"},{"id":"http://arxiv.org/abs/2212.12322v3","updated":"2025-01-10T05:28:08Z","published":"2022-12-22T08:33:32Z","title":"Infrared Image Super-Resolution: Systematic Review, and Future Trends","summary":" Image Super-Resolution (SR) is essential for a wide range of computer vision\nand image processing tasks. Investigating infrared (IR) image (or thermal\nimages) super-resolution is a continuing concern within the development of deep\nlearning. This survey aims to provide a comprehensive perspective of IR image\nsuper-resolution, including its applications, hardware imaging system dilemmas,\nand taxonomy of image processing methodologies. In addition, the datasets and\nevaluation metrics in IR image super-resolution tasks are also discussed.\nFurthermore, the deficiencies in current technologies and possible promising\ndirections for the community to explore are highlighted. To cope with the rapid\ndevelopment in this field, we intend to regularly update the relevant excellent\nwork at \\url{https://github.com/yongsongH/Infrared_Image_SR_Survey}\n","authors":["Yongsong Huang","Tomo Miyazaki","Xiaofeng Liu","Shinichiro Omachi"],"pdf_url":"https://arxiv.org/pdf/2212.12322v3.pdf","comment":"This work has been submitted to the IEEE for possible publication"},{"id":"http://arxiv.org/abs/2501.05707v1","updated":"2025-01-10T04:35:46Z","published":"2025-01-10T04:35:46Z","title":"Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains","summary":" Large language models (LLMs) have achieved remarkable performance in recent\nyears but are fundamentally limited by the underlying training data. To improve\nmodels beyond the training data, recent works have explored how LLMs can be\nused to generate synthetic data for autonomous self-improvement. 
However,\nsuccessive steps of self-improvement can reach a point of diminishing returns.\nIn this work, we propose a complementary approach towards self-improvement\nwhere finetuning is applied to a multiagent society of language models. A group\nof language models, all starting from the same base model, are independently\nspecialized by updating each one using data generated through multiagent\ninteractions among the models. By training each model on independent sets of\ndata, we illustrate how this approach enables specialization across models and\ndiversification over the set of models. As a result, our overall system is able\nto preserve diverse reasoning chains and autonomously improve over many more\nrounds of fine-tuning than single-agent self-improvement methods. We\nquantitatively illustrate the efficacy of the approach across a wide suite of\nreasoning tasks.\n","authors":["Vighnesh Subramaniam","Yilun Du","Joshua B. Tenenbaum","Antonio Torralba","Shuang Li","Igor Mordatch"],"pdf_url":"https://arxiv.org/pdf/2501.05707v1.pdf","comment":"22 pages, 13 figures, 7 tables; Project page at\n https://llm-multiagent-ft.github.io/"},{"id":"http://arxiv.org/abs/2411.12924v2","updated":"2025-01-10T03:55:57Z","published":"2024-11-19T23:22:33Z","title":"Human-In-the-Loop Software Development Agents","summary":" Recently, Large Language Models (LLMs)-based multi-agent paradigms for\nsoftware engineering are introduced to automatically resolve software\ndevelopment tasks (e.g., from a given issue to source code). However, existing\nwork is evaluated based on historical benchmark datasets, rarely considers\nhuman feedback at each stage of the automated software development process, and\nhas not been deployed in practice. In this paper, we introduce a\nHuman-in-the-loop LLM-based Agents framework (HULA) for software development\nthat allows software engineers to refine and guide LLMs when generating coding\nplans and source code for a given task. 
We design, implement, and deploy the\nHULA framework into Atlassian JIRA for internal uses. Through a multi-stage\nevaluation of the HULA framework, Atlassian software engineers perceive that\nHULA can minimize the overall development time and effort, especially in\ninitiating a coding plan and writing code for straightforward tasks. On the\nother hand, challenges around code quality remain a concern in some cases. We\ndraw lessons learned and discuss opportunities for future work, which will pave\nthe way for the advancement of LLM-based agents in software development.\n","authors":["Wannita Takerngsaksiri","Jirat Pasuksmit","Patanamon Thongtanunam","Chakkrit Tantithamthavorn","Ruixiong Zhang","Fan Jiang","Jing Li","Evan Cook","Kun Chen","Ming Wu"],"pdf_url":"https://arxiv.org/pdf/2411.12924v2.pdf","comment":"10 pages, 9 figures, ICSE SEIP 2025"},{"id":"http://arxiv.org/abs/2501.04608v2","updated":"2025-01-10T03:08:11Z","published":"2025-01-08T16:44:06Z","title":"Comprehensive Examination of Unrolled Networks for Solving Linear\n Inverse Problems","summary":" Unrolled networks have become prevalent in various computer vision and\nimaging tasks. Although they have demonstrated remarkable efficacy in solving\nspecific computer vision and computational imaging tasks, their adaptation to\nother applications presents considerable challenges. This is primarily due to\nthe multitude of design decisions that practitioners working on new\napplications must navigate, each potentially affecting the network's overall\nperformance. These decisions include selecting the optimization algorithm,\ndefining the loss function, and determining the number of convolutional layers,\namong others. Compounding the issue, evaluating each design choice requires\ntime-consuming simulations to train, fine-tune the neural network, and optimize\nfor its performance. 
As a result, the process of exploring multiple options and\nidentifying the optimal configuration becomes time-consuming and\ncomputationally demanding. The main objectives of this paper are (1) to unify\nsome ideas and methodologies used in unrolled networks to reduce the number of\ndesign choices a user has to make, and (2) to report a comprehensive ablation\nstudy to discuss the impact of each of the choices involved in designing\nunrolled networks and present practical recommendations based on our findings.\nWe anticipate that this study will help scientists and engineers design\nunrolled networks for their applications and diagnose problems within their\nnetworks efficiently.\n","authors":["Eric Chen","Xi Chen","Arian Maleki","Shirin Jalali"],"pdf_url":"https://arxiv.org/pdf/2501.04608v2.pdf","comment":"27 pages, 10 figures. Project Page:\n https://github.com/YuxiChen25/Memory-Net-Inverse"},{"id":"http://arxiv.org/abs/2501.05680v1","updated":"2025-01-10T03:07:28Z","published":"2025-01-10T03:07:28Z","title":"EXION: Exploiting Inter- and Intra-Iteration Output Sparsity for\n Diffusion Models","summary":" Over the past few years, diffusion models have emerged as novel AI solutions,\ngenerating diverse multi-modal outputs from text prompts. Despite their\ncapabilities, they face challenges in computing, such as excessive latency and\nenergy consumption due to their iterative architecture. Although prior works\nspecialized in transformer acceleration can be applied, the iterative nature of\ndiffusion models remains unresolved. In this paper, we present EXION, the first\nSW-HW co-designed diffusion accelerator that solves the computation challenges\nby exploiting the unique inter- and intra-iteration output sparsity in\ndiffusion models. To this end, we propose two SW-level optimizations. First, we\nintroduce the FFN-Reuse algorithm that identifies and skips redundant\ncomputations in FFN layers across different iterations (inter-iteration\nsparsity). 
Second, we use a modified eager prediction method that employs\ntwo-step leading-one detection to accurately predict the attention score,\nskipping unnecessary computations within an iteration (intra-iteration\nsparsity). We also introduce a novel data compaction mechanism named ConMerge,\nwhich can enhance HW utilization by condensing and merging sparse matrices into\ncompact forms. Finally, it has a dedicated HW architecture that supports the\nabove sparsity-inducing algorithms, translating high output sparsity into\nimproved energy efficiency and performance. To verify the feasibility of the\nEXION, we first demonstrate that it has no impact on accuracy in various types\nof multi-modal diffusion models. We then instantiate EXION in both server- and\nedge-level settings and compare its performance against GPUs with similar\nspecifications. Our evaluation shows that EXION achieves dramatic improvements\nin performance and energy efficiency by 3.2-379.3x and 45.1-3067.6x compared to\na server GPU and by 42.6-1090.9x and 196.9-4668.2x compared to an edge GPU.\n","authors":["Jaehoon Heo","Adiwena Putra","Jieon Yoon","Sungwoong Yune","Hangyeol Lee","Ji-Hoon Kim","Joo-Young Kim"],"pdf_url":"https://arxiv.org/pdf/2501.05680v1.pdf","comment":"To appear in 2025 IEEE International Symposium on High-Performance\n Computer Architecture (HPCA 2025)"},{"id":"http://arxiv.org/abs/2106.02329v3","updated":"2025-01-10T03:00:30Z","published":"2021-06-04T08:25:47Z","title":"Deep Switching State Space Model (DS$^3$M) for Nonlinear Time Series\n Forecasting with Regime Switching","summary":" Modern time series data often display complex nonlinear dependencies along\nwith irregular regime-switching behaviors. These features present technical\nchallenges in modeling, inference, and in offering insightful understanding\ninto the underlying stochastic phenomena. To tackle these challenges, we\nintroduce a novel modeling framework known as the Deep Switching State Space\nModel (DS$^3$M). 
This framework is engineered to make accurate forecasts for\nsuch time series while adeptly identifying the irregular regimes hidden within\nthe dynamics. These identifications not only have significant economic\nramifications but also contribute to a deeper understanding of the underlying\nphenomena. In DS$^3$M, the architecture employs discrete latent variables to\nrepresent regimes and continuous latent variables to account for random driving\nfactors. By melding a Recurrent Neural Network (RNN) with a nonlinear Switching\nState Space Model (SSSM), we manage to capture the nonlinear dependencies and\nirregular regime-switching behaviors, governed by a Markov chain and\nparameterized using multilayer perceptrons. We validate the effectiveness and\nregime identification capabilities of DS$^3$M through short- and long-term\nforecasting tests on a wide array of simulated and real-world datasets,\nspanning sectors such as healthcare, economics, traffic, meteorology, and\nenergy. Experimental results reveal that DS$^3$M outperforms several\nstate-of-the-art models in terms of forecasting accuracy, while providing\nmeaningful regime identifications.\n","authors":["Xiuqin Xu","Hanqiu Peng","Ying Chen"],"pdf_url":"https://arxiv.org/pdf/2106.02329v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05675v1","updated":"2025-01-10T02:57:08Z","published":"2025-01-10T02:57:08Z","title":"Facilitate Collaboration between Large Language Model and Task-specific\n Model for Time Series Anomaly Detection","summary":" In anomaly detection, methods based on large language models (LLMs) can\nincorporate expert knowledge, while task-specific smaller models excel at\nextracting normal patterns and detecting value fluctuations. 
Inspired by the\nhuman nervous system, where the brain stores expert knowledge and the\nperipheral nervous system and spinal cord handle specific tasks like withdrawal\nand knee-jerk reflexes, we propose CoLLaTe, a framework designed to facilitate\ncollaboration between LLMs and task-specific models, leveraging the strengths\nof both.\n In this work, we first formulate the collaboration process and identify two\nkey challenges in the collaboration between LLMs and task-specific models: (1)\nthe misalignment between the expression domains of LLMs and smaller models, and\n(2) error accumulation arising from the predictions of both models.\n To address these challenges, we introduce two key components in CoLLaTe: the\nalignment module and the collaborative loss function. Through theoretical\nanalysis and experimental validation, we demonstrate that these components\neffectively mitigate the identified challenges and achieve better performance\nthan LLM-based methods and task-specific smaller models.\n","authors":["Feiyi Chen","Leilei Zhang","Guansong Pang","Roger Zimmermann","Shuiguang Deng"],"pdf_url":"https://arxiv.org/pdf/2501.05675v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04967v2","updated":"2025-01-10T02:54:18Z","published":"2025-01-09T04:41:50Z","title":"Targeted Adversarial Denoising Autoencoders (TADA) for Neural Time\n Series Filtration","summary":" Current machine learning (ML)-based algorithms for filtering\nelectroencephalography (EEG) time series data face challenges related to\ncumbersome training times, regularization, and accurate reconstruction. To\naddress these shortcomings, we present an ML filtration algorithm driven by a\nlogistic covariance-targeted adversarial denoising autoencoder (TADA). We\nhypothesize that the expressivity of a targeted, correlation-driven\nconvolutional autoencoder will enable effective time series filtration while\nminimizing compute requirements (e.g., runtime, model size). 
Furthermore, we\nexpect that adversarial training with covariance rescaling will minimize signal\ndegradation. To test this hypothesis, a TADA system prototype was trained and\nevaluated on the task of removing electromyographic (EMG) noise from EEG data\nin the EEGdenoiseNet dataset, which includes EMG and EEG data from 67 subjects.\nThe TADA filter surpasses conventional signal filtration algorithms across\nquantitative metrics (Correlation Coefficient, Temporal RRMSE, Spectral RRMSE),\nand performs competitively against other deep learning architectures at a\nreduced model size of less than 400,000 trainable parameters. Further\nexperimentation will be necessary to assess the viability of TADA on a wider\nrange of deployment cases.\n","authors":["Benjamin J. Choi","Griffin Milsap","Clara A. Scholl","Francesco Tenore","Mattson Ogg"],"pdf_url":"https://arxiv.org/pdf/2501.04967v2.pdf","comment":"[Accepted] Artificial Intelligence for Time Series Analysis (AI4TS):\n Theory, Algorithms, and Applications @ AAAI 2025, Philadelphia, PA, USA"},{"id":"http://arxiv.org/abs/2501.05667v1","updated":"2025-01-10T02:33:15Z","published":"2025-01-10T02:33:15Z","title":"TransPlace: Transferable Circuit Global Placement via Graph Neural\n Network","summary":" Global placement, a critical step in designing the physical layout of\ncomputer chips, is essential to optimize chip performance. Prior global\nplacement methods optimize each circuit design individually from scratch. Their\nneglect of transferable knowledge limits solution efficiency and chip\nperformance as circuit complexity drastically increases. This study presents\nTransPlace, a global placement framework that learns to place millions of\nmixed-size cells in continuous space. 
TransPlace introduces i) Netlist Graph to\nefficiently model netlist topology, ii) Cell-flow and relative position\nencoding to learn SE(2)-invariant representation, iii) a tailored graph neural\nnetwork architecture for informed parameterization of placement knowledge, and\niv) a two-stage strategy for coarse-to-fine placement. Compared to\nstate-of-the-art placement methods, TransPlace, trained on a few high-quality\nplacements, can place unseen circuits with 1.2x speedup while reducing\ncongestion by 30%, timing by 9%, and wirelength by 5%.\n","authors":["Yunbo Hou","Haoran Ye","Yingxue Zhang","Siyuan Xu","Guojie Song"],"pdf_url":"https://arxiv.org/pdf/2501.05667v1.pdf","comment":"Accepted at KDD 2025"},{"id":"http://arxiv.org/abs/2501.05663v1","updated":"2025-01-10T02:28:19Z","published":"2025-01-10T02:28:19Z","title":"Learning to Measure Quantum Neural Networks","summary":" The rapid progress in quantum computing (QC) and machine learning (ML) has\nattracted growing attention, prompting extensive research into quantum machine\nlearning (QML) algorithms to solve diverse and complex problems. Designing\nhigh-performance QML models demands expert-level proficiency, which remains a\nsignificant obstacle to the broader adoption of QML. A few major hurdles\ninclude crafting effective data encoding techniques and parameterized quantum\ncircuits, both of which are crucial to the performance of QML models.\nAdditionally, the measurement phase is frequently overlooked: most current QML\nmodels rely on pre-defined measurement protocols that often fail to account for\nthe specific problem being addressed. We introduce a novel approach that makes\nthe observable of the quantum system (specifically, the Hermitian\nmatrix) learnable. Our method features an end-to-end differentiable learning\nframework, where the parameterized observable is trained alongside the ordinary\nquantum circuit parameters simultaneously. 
Using numerical simulations, we show\nthat the proposed method can identify observables for variational quantum\ncircuits that lead to improved outcomes, such as higher classification\naccuracy, thereby boosting the overall performance of QML models.\n","authors":["Samuel Yen-Chi Chen","Huan-Hsin Tseng","Hsin-Yi Lin","Shinjae Yoo"],"pdf_url":"https://arxiv.org/pdf/2501.05663v1.pdf","comment":"Accepted by ICASSP 2025 Workshop: Quantum Machine Learning in Signal\n Processing and Artificial Intelligence"},{"id":"http://arxiv.org/abs/2501.05661v1","updated":"2025-01-10T02:25:39Z","published":"2025-01-10T02:25:39Z","title":"TAMER: A Test-Time Adaptive MoE-Driven Framework for EHR Representation\n Learning","summary":" We propose TAMER, a Test-time Adaptive MoE-driven framework for EHR\nRepresentation learning. TAMER combines a Mixture-of-Experts (MoE) with\nTest-Time Adaptation (TTA) to address two critical challenges in EHR modeling:\npatient population heterogeneity and distribution shifts. The MoE component\nhandles diverse patient subgroups, while TTA enables real-time adaptation to\nevolving health status distributions when new patient samples are introduced.\nExtensive experiments across four real-world EHR datasets demonstrate that\nTAMER consistently improves predictive performance for both mortality and\nreadmission risk tasks when combined with diverse EHR modeling backbones. TAMER\noffers a promising approach for dynamic and personalized EHR-based predictions\nin practical clinical settings. 
Code is publicly available at\nhttps://github.com/yhzhu99/TAMER.\n","authors":["Yinghao Zhu","Xiaochen Zheng","Ahmed Allam","Michael Krauthammer"],"pdf_url":"https://arxiv.org/pdf/2501.05661v1.pdf","comment":"8 pages, 3 figures, 7 tables"},{"id":"http://arxiv.org/abs/2501.05656v1","updated":"2025-01-10T02:14:29Z","published":"2025-01-10T02:14:29Z","title":"Evidential Deep Learning for Uncertainty Quantification and\n Out-of-Distribution Detection in Jet Identification using Deep Neural\n Networks","summary":" Current methods commonly used for uncertainty quantification (UQ) in deep\nlearning (DL) models utilize Bayesian methods which are computationally\nexpensive and time-consuming. In this paper, we provide a detailed study of UQ\nbased on evidential deep learning (EDL) for deep neural network models designed\nto identify jets in high energy proton-proton collisions at the Large Hadron\nCollider and explore its utility in anomaly detection. EDL is a DL approach\nthat treats learning as an evidence acquisition process designed to provide\nconfidence (or epistemic uncertainty) about test data. Using publicly available\ndatasets for jet classification benchmarking, we explore hyperparameter\noptimizations for EDL applied to the challenge of UQ for jet identification. We\nalso investigate how the uncertainty is distributed for each jet class, how\nthis method can be implemented for the detection of anomalies, how the\nuncertainty compares with Bayesian ensemble methods, and how the uncertainty\nmaps onto latent spaces for the models. Our studies uncover some pitfalls of\nEDL applied to anomaly detection and a more effective way to quantify\nuncertainty from EDL as compared with the foundational EDL setup. 
These studies\nillustrate a methodological approach to interpreting EDL in jet classification\nmodels, providing new insights on how EDL quantifies uncertainty and detects\nout-of-distribution data which may lead to improved EDL methods for DL models\napplied to classification tasks.\n","authors":["Ayush Khot","Xiwei Wang","Avik Roy","Volodymyr Kindratenko","Mark S. Neubauer"],"pdf_url":"https://arxiv.org/pdf/2501.05656v1.pdf","comment":"38 pages (including references) with 17 figures and 3 tables.\n Repository: https://github.com/FAIR4HEP/PFIN4UQAD . Submitted to Machine\n Learning: Science and Technology"},{"id":"http://arxiv.org/abs/2404.11917v2","updated":"2025-01-10T02:08:52Z","published":"2024-04-18T05:48:15Z","title":"Expected Coordinate Improvement for High-Dimensional Bayesian\n Optimization","summary":" Bayesian optimization (BO) algorithm is very popular for solving\nlow-dimensional expensive optimization problems. Extending Bayesian\noptimization to high dimension is a meaningful but challenging task. One of the\nmajor challenges is that it is difficult to find good infill solutions as the\nacquisition functions are also high-dimensional. In this work, we propose the\nexpected coordinate improvement (ECI) criterion for high-dimensional Bayesian\noptimization. The proposed ECI criterion measures the potential improvement we\ncan get by moving the current best solution along one coordinate. The proposed\napproach selects the coordinate with the highest ECI value to refine in each\niteration and covers all the coordinates gradually by iterating over the\ncoordinates. The greatest advantage of the proposed ECI-BO (expected coordinate\nimprovement based Bayesian optimization) algorithm over the standard BO\nalgorithm is that the infill selection problem of the proposed algorithm is\nalways a one-dimensional problem thus can be easily solved. 
Numerical\nexperiments show that the proposed algorithm can achieve significantly better\nresults than the standard BO algorithm and competitive results when compared\nwith five state-of-the-art high-dimensional BOs. This work provides a simple\nbut efficient approach for high-dimensional Bayesian optimization.\n","authors":["Dawei Zhan"],"pdf_url":"https://arxiv.org/pdf/2404.11917v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05651v1","updated":"2025-01-10T01:42:05Z","published":"2025-01-10T01:42:05Z","title":"A Practical Cross-Layer Approach for ML-Driven Storage Placement in\n Warehouse-Scale Computers","summary":" Storage systems account for a major portion of the total cost of ownership\n(TCO) of warehouse-scale computers, and thus have a major impact on the overall\nsystem's efficiency. Machine learning (ML)-based methods for solving key\nproblems in storage system efficiency, such as data placement, have shown\nsignificant promise. However, there are few known practical deployments of such\nmethods. Studying this problem in the context of real-world hyperscale data\ncenter deployments at Google, we identify a number of challenges that we\nbelieve cause this lack of practical adoption. Specifically, prior work assumes\na monolithic model that resides entirely within the storage layer, an\nunrealistic assumption in real-world data center deployments. We propose a\ncross-layer approach that moves ML out of the storage system and performs it in\nthe application running on top of it, co-designed with a scheduling algorithm\nat the storage layer that consumes predictions from these application-level\nmodels. This approach combines small, interpretable models with a co-designed\nheuristic that adapts to different online environments. We build a\nproof-of-concept of this approach in a production distributed computation\nframework at Google. 
Evaluations in a test deployment and large-scale\nsimulation studies using production traces show improvements of as much as\n3.47x in TCO savings compared to state-of-the-art baselines. We believe this\nwork represents a significant step towards more practical ML-driven storage\nplacement in warehouse-scale computers.\n","authors":["Chenxi Yang","Yan Li","Martin Maas","Mustafa Uysal","Ubaid Ullah Hafeez","Arif Merchant","Richard McDougall"],"pdf_url":"https://arxiv.org/pdf/2501.05651v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05646v1","updated":"2025-01-10T01:25:01Z","published":"2025-01-10T01:25:01Z","title":"Efficient Representations for High-Cardinality Categorical Variables in\n Machine Learning","summary":" High-cardinality categorical variables pose significant challenges in\nmachine learning, particularly in terms of computational efficiency and model\ninterpretability. Traditional one-hot encoding often results in\nhigh-dimensional sparse feature spaces, increasing the risk of overfitting and\nreducing scalability. This paper introduces novel encoding techniques,\nincluding means encoding, low-rank encoding, and multinomial logistic\nregression encoding, to address these challenges. These methods leverage\nsufficient representations to generate compact and informative embeddings of\ncategorical data. 
We conduct rigorous theoretical analyses and empirical\nvalidations on diverse datasets, demonstrating significant improvements in\nmodel performance and computational efficiency compared to baseline methods.\nThe proposed techniques are particularly effective in domains requiring\nscalable solutions for large datasets, paving the way for more robust and\nefficient applications in machine learning.\n","authors":["Zixuan Liang"],"pdf_url":"https://arxiv.org/pdf/2501.05646v1.pdf","comment":"2025 International Conference on Advanced Machine Learning and Data\n Science (AMLDS 2025)"},{"id":"http://arxiv.org/abs/2412.20006v2","updated":"2025-01-10T01:09:37Z","published":"2024-12-28T04:06:29Z","title":"Adversarial Robustness for Deep Learning-based Wildfire Prediction\n Models","summary":" Smoke detection using Deep Neural Networks (DNNs) is an effective approach\nfor early wildfire detection. However, because smoke is temporally and\nspatially anomalous, there are limitations in collecting sufficient training\ndata. This raises overfitting and bias concerns in existing DNN-based wildfire\ndetection models. Thus, we introduce WARP (Wildfire Adversarial Robustness\nProcedure), the first model-agnostic framework for evaluating the adversarial\nrobustness of DNN-based wildfire detection models. WARP addresses limitations\nin smoke image diversity using global and local adversarial attack methods. The\nglobal attack method uses image-contextualized Gaussian noise, while the local\nattack method uses patch noise injection, tailored to address critical aspects\nof wildfire detection. Leveraging WARP's model-agnostic capabilities, we assess\nthe adversarial robustness of real-time Convolutional Neural Networks (CNNs)\nand Transformers. The analysis revealed valuable insights into the models'\nlimitations. Specifically, the global attack method demonstrates that the\nTransformer model has more than 70% precision degradation than the CNN against\nglobal noise. 
In contrast, the local attack method shows that both models are\nsusceptible to cloud image injections when detecting smoke-positive instances,\nsuggesting a need for model improvements through data augmentation. WARP's\ncomprehensive robustness analysis contributed to the development of\nwildfire-specific data augmentation strategies, marking a step toward\npracticality.\n","authors":["Ryo Ide","Lei Yang"],"pdf_url":"https://arxiv.org/pdf/2412.20006v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18544v2","updated":"2025-01-10T01:06:06Z","published":"2024-12-24T16:51:35Z","title":"Consistency Checks for Language Model Forecasters","summary":" Forecasting is a task that is difficult to evaluate: the ground truth can\nonly be known in the future. Recent work showing LLM forecasters rapidly\napproaching human-level performance begs the question: how can we benchmark and\nevaluate these forecasters instantaneously? Following the consistency check\nframework, we measure the performance of forecasters in terms of the\nconsistency of their predictions on different logically-related questions. We\npropose a new, general consistency metric based on arbitrage: for example, if a\nforecasting AI illogically predicts that both the Democratic and Republican\nparties have 60% probability of winning the 2024 US presidential election, an\narbitrageur can trade against the forecaster's predictions and make a profit.\nWe build an automated evaluation system that generates a set of base questions,\ninstantiates consistency checks from these questions, elicits the predictions\nof the forecaster, and measures the consistency of the predictions. We then\nbuild a standard, proper-scoring-rule forecasting benchmark, and show that our\n(instantaneous) consistency metrics correlate with LLM forecasters' ground\ntruth Brier scores (which are only known in the future). 
We also release a\nconsistency benchmark that resolves in 2028, providing a long-term evaluation\ntool for forecasting.\n","authors":["Daniel Paleka","Abhimanyu Pallavi Sudhir","Alejandro Alvarez","Vineeth Bhat","Adam Shen","Evan Wang","Florian Tramèr"],"pdf_url":"https://arxiv.org/pdf/2412.18544v2.pdf","comment":"55 pages, 25 figures. Submitted to ICLR 2025"},{"id":"http://arxiv.org/abs/2501.05644v1","updated":"2025-01-10T01:02:43Z","published":"2025-01-10T01:02:43Z","title":"Interpretable Enzyme Function Prediction via Residue-Level Detection","summary":" Predicting multiple functions labeled with Enzyme Commission (EC) numbers\nfrom the enzyme sequence is of great significance but remains a challenge due\nto its sparse multi-label classification nature, i.e., each enzyme is typically\nassociated with only a few labels out of more than 6000 possible EC numbers.\nHowever, existing machine learning algorithms generally learn a fixed global\nrepresentation for each enzyme to classify all functions, thereby they lack\ninterpretability and the fine-grained information of some function-specific\nlocal residue fragments may be overwhelmed. Here we present an attention-based\nframework, namely ProtDETR (Protein Detection Transformer), by casting enzyme\nfunction prediction as a detection problem. 
It uses a set of learnable\nfunctional queries to adaptively extract different local representations from\nthe sequence of residue-level features for predicting different EC numbers.\nProtDETR not only significantly outperforms existing deep learning-based enzyme\nfunction prediction methods, but also provides a new interpretable perspective\non automatically detecting different local regions for identifying different\nfunctions through cross-attentions between queries and residue-level features.\nCode is available at https://github.com/yangzhao1230/ProtDETR.\n","authors":["Zhao Yang","Bing Su","Jiahao Chen","Ji-Rong Wen"],"pdf_url":"https://arxiv.org/pdf/2501.05644v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05488v2","updated":"2025-01-10T00:58:28Z","published":"2024-12-07T01:19:14Z","title":"Enhancing Sample Generation of Diffusion Models using Noise Level\n Correction","summary":" The denoising process of diffusion models can be interpreted as an\napproximate projection of noisy samples onto the data manifold. Moreover, the\nnoise level in these samples approximates their distance to the underlying\nmanifold. Building on this insight, we propose a novel method to enhance sample\ngeneration by aligning the estimated noise level with the true distance of\nnoisy samples to the manifold. Specifically, we introduce a noise level\ncorrection network, leveraging a pre-trained denoising network, to refine noise\nlevel estimates during the denoising process. Additionally, we extend this\napproach to various image restoration tasks by integrating task-specific\nconstraints, including inpainting, deblurring, super-resolution, colorization,\nand compressed sensing. Experimental results demonstrate that our method\nsignificantly improves sample quality in both unconstrained and constrained\ngeneration scenarios. 
Notably, the proposed noise level correction framework is\ncompatible with existing denoising schedulers (e.g., DDIM), offering additional\nperformance improvements.\n","authors":["Abulikemu Abuduweili","Chenyang Yuan","Changliu Liu","Frank Permenter"],"pdf_url":"https://arxiv.org/pdf/2412.05488v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05635v1","updated":"2025-01-10T00:42:27Z","published":"2025-01-10T00:42:27Z","title":"Enhancing Unsupervised Graph Few-shot Learning via Set Functions and\n Optimal Transport","summary":" Graph few-shot learning has garnered significant attention for its ability to\nrapidly adapt to downstream tasks with limited labeled data, sparking\nconsiderable interest among researchers. Recent advancements in graph few-shot\nlearning models have exhibited superior performance across diverse\napplications. Despite their successes, several limitations still exist. First,\nexisting models in the meta-training phase predominantly focus on\ninstance-level features within tasks, neglecting crucial set-level features\nessential for distinguishing between different categories. Second, these models\noften utilize query sets directly on classifiers trained with support sets\ncontaining only a few labeled examples, overlooking potential distribution\nshifts between these sets and leading to suboptimal performance. Finally,\nprevious models typically require abundant labeled data from base\nclasses to extract transferable knowledge, which is typically infeasible in\nreal-world scenarios. To address these issues, we propose a novel model named\nSTAR, which leverages Set funcTions and optimAl tRansport for enhancing\nunsupervised graph few-shot learning. Specifically, STAR utilizes expressive\nset functions to obtain set-level features in an unsupervised manner and\nemploys optimal transport principles to align the distributions of support and\nquery sets, thereby mitigating distribution shift effects. 
Theoretical analysis\ndemonstrates that STAR can capture more task-relevant information and enhance\ngeneralization capabilities. Empirically, extensive experiments across multiple\ndatasets validate the effectiveness of STAR. Our code can be found here.\n","authors":["Yonghao Liu","Fausto Giunchiglia","Ximing Li","Lan Huang","Xiaoyue Feng","Renchu Guan"],"pdf_url":"https://arxiv.org/pdf/2501.05635v1.pdf","comment":"KDD2025"},{"id":"http://arxiv.org/abs/2501.05633v1","updated":"2025-01-10T00:32:46Z","published":"2025-01-10T00:32:46Z","title":"Regularized Top-$k$: A Bayesian Framework for Gradient Sparsification","summary":" Error accumulation is effective for gradient sparsification in distributed\nsettings: initially-unselected gradient entries are eventually selected as\ntheir accumulated error exceeds a certain level. The accumulation essentially\nbehaves as a scaling of the learning rate for the selected entries. Although\nthis property prevents the slow-down of lateral movements in distributed\ngradient descent, it can deteriorate convergence in some settings. This work\nproposes a novel sparsification scheme that controls the learning rate scaling\nof error accumulation. The development of this scheme follows two major steps:\nfirst, gradient sparsification is formulated as an inverse probability\n(inference) problem, and the Bayesian optimal sparsification mask is derived as\na maximum-a-posteriori estimator. Using the prior distribution inherited from\nTop-$k$, we derive a new sparsification algorithm which can be interpreted as a\nregularized form of Top-$k$. We call this algorithm regularized Top-$k$\n(RegTop-$k$). It utilizes past aggregated gradients to evaluate posterior\nstatistics of the next aggregation. It then prioritizes the local accumulated\ngradient entries based on these posterior statistics. We validate our\nderivation through numerical experiments. 
In distributed linear regression, it\nis observed that while Top-$k$ remains at a fixed distance from the global\noptimum, RegTop-$k$ converges to the global optimum at significantly higher\ncompression ratios. We further demonstrate the generalization of this\nobservation by employing RegTop-$k$ in distributed training of ResNet-18 on\nCIFAR-10, where it noticeably outperforms Top-$k$.\n","authors":["Ali Bereyhi","Ben Liang","Gary Boudreau","Ali Afana"],"pdf_url":"https://arxiv.org/pdf/2501.05633v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.08294v3","updated":"2025-01-10T00:19:23Z","published":"2024-08-15T17:49:24Z","title":"eGAD! double descent is explained by Generalized Aliasing Decomposition","summary":" A central problem in data science is to use potentially noisy samples of an\nunknown function to predict values for unseen inputs. In classical statistics,\npredictive error is understood as a trade-off between the bias and the variance\nthat balances model simplicity with its ability to fit complex functions.\nHowever, over-parameterized models exhibit counterintuitive behaviors, such as\n\"double descent\" in which models of increasing complexity exhibit decreasing\ngeneralization error. Others may exhibit more complicated patterns of\npredictive error with multiple peaks and valleys. Neither double descent nor\nmultiple descent phenomena are well explained by the bias-variance\ndecomposition.\n We introduce a novel decomposition that we call the generalized aliasing\ndecomposition (GAD) to explain the relationship between predictive performance\nand model complexity. 
The GAD decomposes the predictive error into three parts:\n1) model insufficiency, which dominates when the number of parameters is much\nsmaller than the number of data points, 2) data insufficiency, which dominates\nwhen the number of parameters is much greater than the number of data points,\nand 3) generalized aliasing, which dominates between these two extremes.\n We demonstrate the applicability of the GAD to diverse applications,\nincluding random feature models from machine learning, Fourier transforms from\nsignal processing, solution methods for differential equations, and predictive\nformation enthalpy in materials discovery. Because key components of the GAD\ncan be explicitly calculated from the relationship between model class and\nsamples without seeing any data labels, it can answer questions related to\nexperimental design and model selection before collecting data or performing\nexperiments. We further demonstrate this approach on several examples and\ndiscuss implications for predictive modeling and data science.\n","authors":["Mark K. Transtrum","Gus L. W. Hart","Tyler J. Jarvis","Jared P. Whitehead"],"pdf_url":"https://arxiv.org/pdf/2408.08294v3.pdf","comment":null}],"Multimedia":[{"id":"http://arxiv.org/abs/2501.05686v1","updated":"2025-01-10T03:35:22Z","published":"2025-01-10T03:35:22Z","title":"Deep Reversible Consistency Learning for Cross-modal Retrieval","summary":" Cross-modal retrieval (CMR) typically involves learning common\nrepresentations to directly measure similarities between multimodal samples.\nMost existing CMR methods commonly assume multimodal samples in pairs and\nemploy joint training to learn common representations, limiting the flexibility\nof CMR. 
Although some methods adopt independent training strategies for each\nmodality to improve flexibility in CMR, they utilize the randomly initialized\northogonal matrices to guide representation learning, which is suboptimal since\nthey assume inter-class samples are independent of each other, limiting the\npotential of semantic alignments between sample representations and\nground-truth labels. To address these issues, we propose a novel method termed\nDeep Reversible Consistency Learning (DRCL) for cross-modal retrieval. DRCL\nincludes two core modules, \\ie Selective Prior Learning (SPL) and Reversible\nSemantic Consistency learning (RSC). More specifically, SPL first learns a\ntransformation weight matrix on each modality and selects the best one based on\nthe quality score as the Prior, which greatly avoids blind selection of priors\nlearned from low-quality modalities. Then, RSC employs a Modality-invariant\nRepresentation Recasting mechanism (MRR) to recast the potential\nmodality-invariant representations from sample semantic labels by the\ngeneralized inverse matrix of the prior. Since labels are devoid of\nmodal-specific information, we utilize the recast features to guide the\nrepresentation learning, thus maintaining semantic consistency to the fullest\nextent possible. In addition, a feature augmentation mechanism (FA) is\nintroduced in RSC to encourage the model to learn over a wider data\ndistribution for diversity. 
Finally, extensive experiments conducted on five\nwidely used datasets and comparisons with 15 state-of-the-art baselines\ndemonstrate the effectiveness and superiority of our DRCL.\n","authors":["Ruitao Pu","Yang Qin","Dezhong Peng","Xiaomin Song","Huiming Zheng"],"pdf_url":"https://arxiv.org/pdf/2501.05686v1.pdf","comment":null}],"Artificial Intelligence":[{"id":"http://arxiv.org/abs/2501.06164v1","updated":"2025-01-10T18:39:29Z","published":"2025-01-10T18:39:29Z","title":"Model Alignment Search","summary":" When can we say that two neural systems are the same? The answer to this\nquestion is goal-dependent, and it is often addressed through correlative\nmethods such as Representational Similarity Analysis (RSA) and Centered Kernel\nAlignment (CKA). What do we miss when we forgo causal explorations, and how can\nwe target specific types of similarity? In this work, we introduce Model\nAlignment Search (MAS), a method for causally exploring distributed\nrepresentational similarity. The method learns invertible linear\ntransformations that align a subspace between two distributed networks'\nrepresentations where causal information can be freely interchanged. We first\nshow that the method can be used to transfer specific causal variables, such as\nthe number of items in a counting task, between networks with different\ntraining seeds. We then explore open questions in number cognition by comparing\ndifferent types of numeric representations in models trained on structurally\ndifferent numeric tasks. We then explore differences between MAS and preexisting\ncausal similarity methods, showing MAS to be more resistant to unwanted\nexchanges. 
Lastly, we introduce a counterfactual latent auxiliary loss function\nthat helps shape causally relevant alignments even in cases where we do not\nhave causal access to one of the two models for training.\n","authors":["Satchel Grant"],"pdf_url":"https://arxiv.org/pdf/2501.06164v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.02780v2","updated":"2025-01-10T18:14:56Z","published":"2024-09-17T19:07:13Z","title":"Guess What I Think: Streamlined EEG-to-Image Generation with Latent\n Diffusion Models","summary":" Generating images from brain waves is gaining increasing attention due to its\npotential to advance brain-computer interface (BCI) systems by understanding\nhow brain signals encode visual cues. Most of the literature has focused on\nfMRI-to-Image tasks as fMRI is characterized by high spatial resolution.\nHowever, fMRI is an expensive neuroimaging modality and does not allow for\nreal-time BCI. On the other hand, electroencephalography (EEG) is a low-cost,\nnon-invasive, and portable neuroimaging technique, making it an attractive\noption for future real-time applications. Nevertheless, EEG presents inherent\nchallenges due to its low spatial resolution and susceptibility to noise and\nartifacts, which makes generating images from EEG more difficult. In this\npaper, we address these problems with a streamlined framework based on the\nControlNet adapter for conditioning a latent diffusion model (LDM) through EEG\nsignals. We conduct experiments and ablation studies on popular benchmarks to\ndemonstrate that the proposed method beats other state-of-the-art models.\nUnlike these methods, which often require extensive preprocessing, pretraining,\ndifferent losses, and captioning models, our approach is efficient and\nstraightforward, requiring only minimal preprocessing and a few components. 
The\ncode is available at https://github.com/LuigiSigillo/GWIT.\n","authors":["Eleonora Lopez","Luigi Sigillo","Federica Colonnese","Massimo Panella","Danilo Comminiello"],"pdf_url":"https://arxiv.org/pdf/2410.02780v2.pdf","comment":"Accepted at ICASSP 2025"},{"id":"http://arxiv.org/abs/2501.06146v1","updated":"2025-01-10T18:10:06Z","published":"2025-01-10T18:10:06Z","title":"xLSTM-SENet: xLSTM for Single-Channel Speech Enhancement","summary":" While attention-based architectures, such as Conformers, excel in speech\nenhancement, they face challenges such as scalability with respect to input\nsequence length. In contrast, the recently proposed Extended Long Short-Term\nMemory (xLSTM) architecture offers linear scalability. However, xLSTM-based\nmodels remain unexplored for speech enhancement. This paper introduces\nxLSTM-SENet, the first xLSTM-based single-channel speech enhancement system. A\ncomparative analysis reveals that xLSTM, and notably even LSTM, can match or\noutperform state-of-the-art Mamba- and Conformer-based systems across various\nmodel sizes in speech enhancement on the VoiceBank+DEMAND dataset. Through\nablation studies, we identify key architectural design choices, such as\nexponential gating and bidirectionality, that contribute to its effectiveness. 
Our\nbest xLSTM-based model, xLSTM-SENet2, outperforms state-of-the-art Mamba- and\nConformer-based systems on the VoiceBank+DEMAND dataset.\n","authors":["Nikolai Lund Kühne","Jan Østergaard","Jesper Jensen","Zheng-Hua Tan"],"pdf_url":"https://arxiv.org/pdf/2501.06146v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06143v1","updated":"2025-01-10T18:08:07Z","published":"2025-01-10T18:08:07Z","title":"Multilingual Performance of a Multimodal Artificial Intelligence System\n on Multisubject Physics Concept Inventories","summary":" We investigate the multilingual and multimodal performance of a large\nlanguage model-based artificial intelligence (AI) system, GPT-4o, on a diverse\nset of physics concept inventories spanning multiple languages and subject\nareas. The inventories, taken from the PhysPort website, cover the classical\nphysics topics of mechanics, electromagnetism, optics, and thermodynamics as\nwell as relativity, quantum mechanics, astronomy, mathematics, and laboratory\nskills. Unlike previous text-only studies, we uploaded the inventories as\nimages mirroring what a student would see on paper, assessing the system's\nmultimodal functionality. The AI is prompted in English and autonomously\nchooses the language of its response - either remaining in the nominal language\nof the test, switching entirely to English, or mixing languages - revealing\nadaptive behavior dependent on linguistic complexity and data availability. Our\nresults indicate some variation in performance across subject areas, with\nlaboratory skills standing out as the area of poorest performance. Furthermore,\nthe AI's performance on questions that require visual interpretation of images\nis worse than on purely text-based questions. Questions that are difficult for\nthe AI tend to be that way irrespective of the inventory language. 
We also find\nlarge variations in performance across languages, with some appearing to\nbenefit substantially from language switching, a phenomenon similar to\ncode-switching of human speakers. Overall, comparing the obtained AI results to\nthe existing literature, we find that the AI system outperforms average\nundergraduate students post-instruction in all subject areas but laboratory\nskills.\n","authors":["Gerd Kortemeyer","Marina Babayeva","Giulia Polverini","Bor Gregorcic","Ralf Widenhorn"],"pdf_url":"https://arxiv.org/pdf/2501.06143v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06141v1","updated":"2025-01-10T18:03:46Z","published":"2025-01-10T18:03:46Z","title":"Emergent Symbol-like Number Variables in Artificial Neural Networks","summary":" What types of numeric representations emerge in Neural Networks (NNs)? To\nwhat degree do NNs induce abstract, mutable, slot-like numeric variables, and\nin what situations do these representations emerge? How do these\nrepresentations change over learning, and how can we understand the neural\nimplementations in ways that are unified across different NNs? In this work, we\napproach these questions by first training sequence-based neural systems using\nNext Token Prediction (NTP) objectives on numeric tasks. We then seek to\nunderstand the neural solutions through the lens of causal abstractions or\nsymbolic algorithms. We use a combination of causal interventions and\nvisualization methods to find that artificial neural models do indeed develop\nanalogs of interchangeable, mutable, latent number variables purely from the\nNTP objective. 
We then ask how variations on the tasks and model architectures\naffect the models' learned solutions to find that these symbol-like numeric\nrepresentations do not form for every variant of the task, and transformers\nsolve the problem in a notably different way than their recurrent counterparts.\nWe then show how the symbol-like variables change over the course of training\nto find a strong correlation between the models' task performance and the\nalignment of their symbol-like representations. Lastly, we show that in all\ncases, some degree of gradience exists in these neural symbols, highlighting\nthe difficulty of finding simple, interpretable symbolic stories of how neural\nnetworks perform numeric tasks. Taken together, our results are consistent with\nthe view that neural networks can approximate interpretable symbolic programs\nof number cognition, but the particular program they approximate and the extent\nto which they approximate it can vary widely, depending on the network\narchitecture, training data, extent of training, and network size.\n","authors":["Satchel Grant","Noah D. Goodman","James L. McClelland"],"pdf_url":"https://arxiv.org/pdf/2501.06141v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.11456v2","updated":"2025-01-10T17:54:39Z","published":"2024-09-17T17:48:12Z","title":"Two Stage Segmentation of Cervical Tumors using PocketNet","summary":" Cervical cancer remains the fourth most common malignancy amongst women\nworldwide.1 Concurrent chemoradiotherapy (CRT) serves as the mainstay\ndefinitive treatment regimen for locally advanced cervical cancers and includes\nexternal beam radiation followed by brachytherapy.2 Integral to radiotherapy\ntreatment planning is the routine contouring of both the target tumor at the\nlevel of the cervix, associated gynecologic anatomy and the adjacent organs at\nrisk (OARs). 
However, manual contouring of these structures is both time- and\nlabor-intensive and associated with known interobserver variability that can\nimpact treatment outcomes. While multiple tools have been developed to\nautomatically segment OARs and the high-risk clinical tumor volume (HR-CTV)\nusing computed tomography (CT) images,3,4,5,6 the development of deep\nlearning-based tumor segmentation tools using routine T2-weighted (T2w)\nmagnetic resonance imaging (MRI) addresses an unmet clinical need to improve\nthe routine contouring of both anatomical structures and cervical cancers,\nthereby increasing quality and consistency of radiotherapy planning. This work\napplied a novel deep-learning model (PocketNet) to segment the cervix, vagina,\nuterus, and tumor(s) on T2w MRI. The performance of the PocketNet architecture\nwas evaluated when trained via 5-fold cross-validation. PocketNet\nachieved a mean Dice-Sorensen similarity coefficient (DSC) exceeding 70% for\ntumor segmentation and 80% for organ segmentation. These results suggest that\nPocketNet is robust to variations in contrast protocols, providing reliable\nsegmentation of the regions of interest.\n","authors":["Awj Twam","Megan Jacobsen","Rachel Glenn","Peng Wei","Jia Sun","Ann Klopp","Aradhana M. Venkatesan","David Fuentes"],"pdf_url":"https://arxiv.org/pdf/2409.11456v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06137v1","updated":"2025-01-10T17:52:34Z","published":"2025-01-10T17:52:34Z","title":"Supervision policies can shape long-term risk management in\n general-purpose AI models","summary":" The rapid proliferation and deployment of General-Purpose AI (GPAI) models,\nincluding large language models (LLMs), present unprecedented challenges for AI\nsupervisory entities. We hypothesize that these entities will need to navigate\nan emergent ecosystem of risk and incident reporting, likely to exceed their\nsupervision capacity. 
To investigate this, we develop a simulation framework\nparameterized by features extracted from the diverse landscape of risk,\nincident, or hazard reporting ecosystems, including community-driven platforms,\ncrowdsourcing initiatives, and expert assessments. We evaluate four supervision\npolicies: non-prioritized (first-come, first-served), random selection,\npriority-based (addressing the highest-priority risks first), and\ndiversity-prioritized (balancing high-priority risks with comprehensive\ncoverage across risk types). Our results indicate that while priority-based and\ndiversity-prioritized policies are more effective at mitigating high-impact\nrisks, particularly those identified by experts, they may inadvertently neglect\nsystemic issues reported by the broader community. This oversight can create\nfeedback loops that amplify certain types of reporting while discouraging\nothers, leading to a skewed perception of the overall risk landscape. We\nvalidate our simulation results with several real-world datasets, including one\nwith over a million ChatGPT interactions, of which more than 150,000\nconversations were identified as risky. 
This validation underscores the complex\ntrade-offs inherent in AI risk supervision and highlights how the choice of\nrisk management policies can shape the future landscape of AI risks across\ndiverse GPAI models used in society.\n","authors":["Manuel Cebrian","Emilia Gomez","David Fernandez Llorca"],"pdf_url":"https://arxiv.org/pdf/2501.06137v1.pdf","comment":"24 pages, 14 figures"},{"id":"http://arxiv.org/abs/2501.06132v1","updated":"2025-01-10T17:44:57Z","published":"2025-01-10T17:44:57Z","title":"CoDriveVLM: VLM-Enhanced Urban Cooperative Dispatching and Motion\n Planning for Future Autonomous Mobility on Demand Systems","summary":" The increasing demand for flexible and efficient urban transportation\nsolutions has spotlighted the limitations of traditional Demand Responsive\nTransport (DRT) systems, particularly in accommodating diverse passenger needs\nand dynamic urban environments. Autonomous Mobility-on-Demand (AMoD) systems\nhave emerged as a promising alternative, leveraging connected and autonomous\nvehicles (CAVs) to provide responsive and adaptable services. However, existing\nmethods primarily focus on either vehicle scheduling or path planning, which\noften simplify complex urban layouts and neglect the necessity for simultaneous\ncoordination and mutual avoidance among CAVs. This oversimplification poses\nsignificant challenges to the deployment of AMoD systems in real-world\nscenarios. To address these gaps, we propose CoDriveVLM, a novel framework that\nintegrates high-fidelity simultaneous dispatching and cooperative motion\nplanning for future AMoD systems. Our method harnesses Vision-Language Models\n(VLMs) to enhance multi-modality information processing, and this enables\ncomprehensive dispatching and collision risk evaluation. The VLM-enhanced CAV\ndispatching coordinator is introduced to effectively manage complex and\nunforeseen AMoD conditions, thus supporting efficient scheduling\ndecision-making. 
Furthermore, we propose a scalable decentralized cooperative\nmotion planning method via consensus alternating direction method of\nmultipliers (ADMM) focusing on collision risk evaluation and decentralized\ntrajectory optimization. Simulation results demonstrate the feasibility and\nrobustness of CoDriveVLM in various traffic conditions, showcasing its\npotential to significantly improve the fidelity and effectiveness of AMoD\nsystems in future urban transportation networks. The code is available at\nhttps://github.com/henryhcliu/CoDriveVLM.git.\n","authors":["Haichao Liu","Ruoyu Yao","Wenru Liu","Zhenmin Huang","Shaojie Shen","Jun Ma"],"pdf_url":"https://arxiv.org/pdf/2501.06132v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.02189v2","updated":"2025-01-10T17:43:10Z","published":"2025-01-04T04:59:33Z","title":"Benchmark Evaluations, Applications, and Challenges of Large Vision\n Language Models: A Survey","summary":" Multimodal Vision Language Models (VLMs) have emerged as a transformative\ntechnology at the intersection of computer vision and natural language\nprocessing, enabling machines to perceive and reason about the world through\nboth visual and textual modalities. For example, models such as CLIP, Claude,\nand GPT-4V demonstrate strong reasoning and understanding abilities on visual\nand textual data and beat classical single modality vision models on zero-shot\nclassification. Despite their rapid advancements in research and growing\npopularity in applications, a comprehensive survey of existing studies on VLMs\nis notably lacking, particularly for researchers aiming to leverage VLMs in\ntheir specific domains. 
To this end, we provide a systematic overview of VLMs\nin the following aspects: model information of the major VLMs developed over\nthe past five years (2019-2024); the main architectures and training methods of\nthese VLMs; summary and categorization of the popular benchmarks and evaluation\nmetrics of VLMs; the applications of VLMs including embodied agents, robotics,\nand video generation; the challenges and issues faced by current VLMs such as\nhallucination, fairness, and safety. Detailed collections including papers and\nmodel repository links are listed in\nhttps://github.com/zli12321/Awesome-VLM-Papers-And-Models.git.\n","authors":["Zongxia Li","Xiyang Wu","Hongyang Du","Huy Nghiem","Guangyao Shi"],"pdf_url":"https://arxiv.org/pdf/2501.02189v2.pdf","comment":"35 pages, 3 figures"},{"id":"http://arxiv.org/abs/2501.06129v1","updated":"2025-01-10T17:35:06Z","published":"2025-01-10T17:35:06Z","title":"Contextual ASR Error Handling with LLMs Augmentation for Goal-Oriented\n Conversational AI","summary":" General-purpose automatic speech recognition (ASR) systems do not always\nperform well in goal-oriented dialogue. Existing ASR correction methods rely on\nprior user data or named entities. We extend correction to tasks that have no\nprior user data and exhibit linguistic flexibility such as lexical and\nsyntactic variations. We propose a novel context augmentation with a large\nlanguage model and a ranking strategy that incorporates contextual information\nfrom the dialogue states of a goal-oriented conversational AI and its tasks.\nOur method ranks (1) n-best ASR hypotheses by their lexical and semantic\nsimilarity with context and (2) context by phonetic correspondence with ASR\nhypotheses. Evaluated in home improvement and cooking domains with real-world\nusers, our method improves recall and F1 of correction by 34% and 16%,\nrespectively, while maintaining precision and false positive rate. 
Users rated\n0.8-1 point (out of 5) higher when our correction method worked properly, with\nno decrease due to false positives.\n","authors":["Yuya Asano","Sabit Hassan","Paras Sharma","Anthony Sicilia","Katherine Atwell","Diane Litman","Malihe Alikhani"],"pdf_url":"https://arxiv.org/pdf/2501.06129v1.pdf","comment":"Accepted to COLING 2025 Industry Track"},{"id":"http://arxiv.org/abs/2501.06117v1","updated":"2025-01-10T17:15:38Z","published":"2025-01-10T17:15:38Z","title":"Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language\n Understanding","summary":" While recent multilingual automatic speech recognition models claim to\nsupport thousands of languages, ASR for low-resource languages remains highly\nunreliable due to limited bimodal speech and text training data. Better\nmultilingual spoken language understanding (SLU) can massively strengthen the\nrobustness of multilingual ASR by leveraging language semantics to compensate\nfor scarce training data, such as disambiguating utterances via context or\nexploiting semantic similarities across languages. Even more so, SLU is\nindispensable for inclusive speech technology in roughly half of all living\nlanguages that lack a formal writing system. However, the evaluation of\nmultilingual SLU remains limited to shallower tasks such as intent\nclassification or language identification. To address this, we present\nFleurs-SLU, a multilingual SLU benchmark that encompasses topical speech\nclassification in 102 languages and multiple-choice question answering through\nlistening comprehension in 92 languages. We extensively evaluate both\nend-to-end speech classification models and cascaded systems that combine\nspeech-to-text transcription with subsequent classification by large language\nmodels on Fleurs-SLU. 
Our results show that cascaded systems exhibit greater\nrobustness in multilingual SLU tasks, though speech encoders can achieve\ncompetitive performance in topical speech classification when appropriately\npre-trained. We further find a strong correlation between robust multilingual\nASR, effective speech-to-text translation, and strong multilingual SLU,\nhighlighting the mutual benefits between acoustic and semantic speech\nrepresentations.\n","authors":["Fabian David Schmidt","Ivan Vulić","Goran Glavaš","David Ifeoluwa Adelani"],"pdf_url":"https://arxiv.org/pdf/2501.06117v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05409v2","updated":"2025-01-10T16:58:29Z","published":"2025-01-09T18:06:45Z","title":"Atlas: A Novel Pathology Foundation Model by Mayo Clinic, Charité, and\n Aignostics","summary":" Recent advances in digital pathology have demonstrated the effectiveness of\nfoundation models across diverse applications. In this report, we present\nAtlas, a novel vision foundation model based on the RudolfV approach. Our model\nwas trained on a dataset comprising 1.2 million histopathology whole slide\nimages, collected from two medical institutions: Mayo Clinic and Charit\\'e -\nUniversit\\\"atsmedizin Berlin. 
Comprehensive evaluations show that Atlas achieves\nstate-of-the-art performance across twenty-one public benchmark datasets, even\nthough it is neither the largest model by parameter count nor by training\ndataset size.\n","authors":["Maximilian Alber","Stephan Tietz","Jonas Dippel","Timo Milbich","Timothée Lesort","Panos Korfiatis","Moritz Krügener","Beatriz Perez Cancer","Neelay Shah","Alexander Möllers","Philipp Seegerer","Alexandra Carpen-Amarie","Kai Standvoss","Gabriel Dernbach","Edwin de Jong","Simon Schallenberg","Andreas Kunft","Helmut Hoffer von Ankershoffen","Gavin Schaeferle","Patrick Duffy","Matt Redlon","Philipp Jurmeister","David Horst","Lukas Ruff","Klaus-Robert Müller","Frederick Klauschen","Andrew Norgan"],"pdf_url":"https://arxiv.org/pdf/2501.05409v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06099v1","updated":"2025-01-10T16:53:48Z","published":"2025-01-10T16:53:48Z","title":"Explaining Deep Learning-based Anomaly Detection in Energy Consumption\n Data by Focusing on Contextually Relevant Data","summary":" Detecting anomalies in energy consumption data is crucial for identifying\nenergy waste, equipment malfunction, and overall, for ensuring efficient energy\nmanagement. Machine learning, and specifically deep learning approaches, have\nbeen greatly successful in anomaly detection; however, they are black-box\napproaches that do not provide transparency or explanations. SHAP and its\nvariants have been proposed to explain these models, but they suffer from high\ncomputational complexity (SHAP) or instability and inconsistency (e.g., Kernel\nSHAP). To address these challenges, this paper proposes an explainability\napproach for anomalies in energy consumption data that focuses on\ncontext-relevant information. 
The proposed approach leverages existing\nexplainability techniques, focusing on SHAP variants, together with global\nfeature importance and weighted cosine similarity to select the background\ndataset based on the context of each anomaly point. By focusing on the context\nand most relevant features, this approach mitigates the instability of\nexplainability algorithms. Experimental results across 10 different machine\nlearning models, five datasets, and five XAI techniques demonstrate that our\nmethod reduces the variability of explanations, providing consistent\nexplanations. Statistical analyses confirm the robustness of our approach,\nshowing an average reduction in variability of approximately 38% across\nmultiple datasets.\n","authors":["Mohammad Noorchenarboo","Katarina Grolinger"],"pdf_url":"https://arxiv.org/pdf/2501.06099v1.pdf","comment":"26 pages, 8 figures"},{"id":"http://arxiv.org/abs/2501.06089v1","updated":"2025-01-10T16:39:01Z","published":"2025-01-10T16:39:01Z","title":"Towards Developing Socially Compliant Automated Vehicles: State of the\n Art, Experts Expectations, and A Conceptual Framework","summary":" Automated Vehicles (AVs) hold promise for revolutionizing transportation by\nimproving road safety, traffic efficiency, and overall mobility. Despite the\nsteady advancement in high-level AVs in recent years, the transition to full\nautomation entails a period of mixed traffic, where AVs of varying automation\nlevels coexist with human-driven vehicles (HDVs). Making AVs socially compliant\nand understood by human drivers is expected to improve the safety and\nefficiency of mixed traffic. Thus, ensuring AVs' compatibility with HDVs and\nsocial acceptance is crucial for their successful and seamless integration into\nmixed traffic. However, research in this critical area of developing Socially\nCompliant AVs (SCAVs) remains sparse. 
This study carries out the first\ncomprehensive scoping review to assess the current state of the art in\ndeveloping SCAVs, identifying key concepts, methodological approaches, and\nresearch gaps. An expert interview was also conducted to identify critical\nresearch gaps and expectations towards SCAVs. Based on the scoping review and\nexpert interview input, a conceptual framework is proposed for the development\nof SCAVs. The conceptual framework is evaluated using an online survey\ntargeting researchers, technicians, policymakers, and other relevant\nprofessionals worldwide. The survey results provide valuable validation and\ninsights, affirming the significance of the proposed conceptual framework in\ntackling the challenges of integrating AVs into mixed-traffic environments.\nAdditionally, future research perspectives and suggestions are discussed,\ncontributing to the research and development agenda of SCAVs.\n","authors":["Yongqi Dong","Bart van Arem","Haneen Farah"],"pdf_url":"https://arxiv.org/pdf/2501.06089v1.pdf","comment":"39 pages, 13 figures, under review by the journal of Transportation\n Research Part E: Logistics and Transportation Review"},{"id":"http://arxiv.org/abs/2501.06086v1","updated":"2025-01-10T16:34:19Z","published":"2025-01-10T16:34:19Z","title":"All AI Models are Wrong, but Some are Optimal","summary":" AI models that predict the future behavior of a system (a.k.a. predictive AI\nmodels) are central to intelligent decision-making. However, decision-making\nusing predictive AI models often results in suboptimal performance. This is\nprimarily because AI models are typically constructed to best fit the data, and\nhence to predict the most likely future rather than to enable high-performance\ndecision-making. The hope that such prediction enables high-performance\ndecisions is neither guaranteed in theory nor established in practice. 
In fact,\nthere is increasing empirical evidence that predictive models must be tailored\nto decision-making objectives for performance. In this paper, we establish\nformal (necessary and sufficient) conditions that a predictive model (AI-based\nor not) must satisfy for a decision-making policy established using that model\nto be optimal. We then discuss their implications for building predictive AI\nmodels for sequential decision-making.\n","authors":["Akhil S Anand","Shambhuraj Sawant","Dirk Reinhardt","Sebastien Gros"],"pdf_url":"https://arxiv.org/pdf/2501.06086v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.06433v3","updated":"2025-01-10T16:26:43Z","published":"2022-10-12T17:30:12Z","title":"Self-supervised video pretraining yields robust and more human-aligned\n visual representations","summary":" Humans learn powerful representations of objects and scenes by observing how\nthey evolve over time. Yet, outside of specific tasks that require explicit\ntemporal understanding, static image pretraining remains the dominant paradigm\nfor learning visual foundation models. We question this mismatch, and ask\nwhether video pretraining can yield visual representations that bear the\nhallmarks of human perception: generalisation across tasks, robustness to\nperturbations, and consistency with human judgements. To that end we propose a\nnovel procedure for curating videos, and develop a contrastive framework which\nlearns from the complex transformations therein. This simple paradigm for\ndistilling knowledge from videos, called VITO, yields general representations\nthat far outperform prior video pretraining methods on image understanding\ntasks, and image pretraining methods on video understanding tasks. Moreover,\nVITO representations are significantly more robust to natural and synthetic\ndeformations than image-, video-, and adversarially-trained ones. 
Finally,\nVITO's predictions are strongly aligned with human judgements, surpassing\nmodels that were specifically trained for that purpose. Together, these results\nsuggest that video pretraining could be a simple way of learning unified,\nrobust, and human-aligned representations of the visual world.\n","authors":["Nikhil Parthasarathy","S. M. Ali Eslami","João Carreira","Olivier J. Hénaff"],"pdf_url":"https://arxiv.org/pdf/2210.06433v3.pdf","comment":"Accepted to 37th Conference on Neural Information Processing Systems\n (NeurIPS 2023)"},{"id":"http://arxiv.org/abs/2501.06080v1","updated":"2025-01-10T16:15:23Z","published":"2025-01-10T16:15:23Z","title":"Scale-up Unlearnable Examples Learning with High-Performance Computing","summary":" Recent advancements in AI models are structured to retain user interactions,\nwhich could inadvertently include sensitive healthcare data. In the healthcare\nfield, particularly when radiologists use AI-driven diagnostic tools hosted on\nonline platforms, there is a risk that medical imaging data may be repurposed\nfor future AI training without explicit consent, spotlighting critical privacy\nand intellectual property concerns around healthcare data usage. Addressing\nthese privacy challenges, a novel approach known as Unlearnable Examples (UEs)\nhas been introduced, aiming to make data unlearnable to deep learning models. A\nprominent method within this area, called Unlearnable Clustering (UC), has\nshown improved UE performance with larger batch sizes but was previously\nlimited by computational resources. To push the boundaries of UE performance\nwith theoretically unlimited resources, we scaled up UC learning across various\ndatasets using Distributed Data Parallel (DDP) training on the Summit\nsupercomputer. Our goal was to examine UE efficacy at high-performance\ncomputing (HPC) levels to prevent unauthorized learning and enhance data\nsecurity, particularly exploring the impact of batch size on UE's\nunlearnability. 
Utilizing the robust computational capabilities of the Summit,\nextensive experiments were conducted on diverse datasets such as Pets,\nMedMNIST, Flowers, and Flowers102. Our findings reveal that both overly large\nand overly small batch sizes can lead to performance instability and affect\naccuracy. However, the relationship between batch size and unlearnability\nvaried across datasets, highlighting the necessity for tailored batch size\nstrategies to achieve optimal data protection. Our results underscore the\ncritical role of selecting appropriate batch sizes based on the specific\ncharacteristics of each dataset to prevent learning and ensure data security in\ndeep learning applications.\n","authors":["Yanfan Zhu","Issac Lyngaas","Murali Gopalakrishnan Meena","Mary Ellen I. Koran","Bradley Malin","Daniel Moyer","Shunxing Bao","Anuj Kapadia","Xiao Wang","Bennett Landman","Yuankai Huo"],"pdf_url":"https://arxiv.org/pdf/2501.06080v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06078v1","updated":"2025-01-10T16:14:35Z","published":"2025-01-10T16:14:35Z","title":"Explaining k-Nearest Neighbors: Abductive and Counterfactual\n Explanations","summary":" Despite the wide use of $k$-Nearest Neighbors as classification models, their\nexplainability properties remain poorly understood from a theoretical\nperspective. While nearest neighbors classifiers offer interpretability from a\n\"data perspective\", in which the classification of an input vector $\\bar{x}$ is\nexplained by identifying the vectors $\\bar{v}_1, \\ldots, \\bar{v}_k$ in the\ntraining set that determine the classification of $\\bar{x}$, we argue that such\nexplanations can be impractical in high-dimensional applications, where each\nvector has hundreds or thousands of features and it is not clear what their\nrelative importance is. 
Hence, we focus on understanding nearest neighbor\nclassifications through a \"feature perspective\", in which the goal is to\nidentify how the values of the features in $\\bar{x}$ affect its classification.\nConcretely, we study abductive explanations such as \"minimum sufficient\nreasons\", which correspond to sets of features in $\\bar{x}$ that are enough to\nguarantee its classification, and \"counterfactual explanations\" based on the\nminimum distance feature changes one would have to perform in $\\bar{x}$ to\nchange its classification. We present a detailed landscape of positive and\nnegative complexity results for counterfactual and abductive explanations,\ndistinguishing between discrete and continuous feature spaces, and considering\nthe impact of the choice of distance function involved. Finally, we show that\ndespite some negative complexity results, Integer Quadratic Programming and SAT\nsolving allow for computing explanations in practice.\n","authors":["Pablo Barceló","Alexander Kozachinskiy","Miguel Romero Orth","Bernardo Subercaseaux","José Verschae"],"pdf_url":"https://arxiv.org/pdf/2501.06078v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06066v1","updated":"2025-01-10T15:57:23Z","published":"2025-01-10T15:57:23Z","title":"Distilling Calibration via Conformalized Credal Inference","summary":" Deploying artificial intelligence (AI) models on edge devices involves a\ndelicate balance between meeting stringent complexity constraints, such as\nlimited memory and energy resources, and ensuring reliable performance in\nsensitive decision-making tasks. One way to enhance reliability is through\nuncertainty quantification via Bayesian inference. This approach, however,\ntypically necessitates maintaining and running multiple models in an ensemble,\nwhich may exceed the computational limits of edge devices. 
This paper\nintroduces a low-complexity methodology to address this challenge by distilling\ncalibration information from a more complex model. In an offline phase,\npredictive probabilities generated by a high-complexity cloud-based model are\nleveraged to determine a threshold based on the typical divergence between the\ncloud and edge models. At run time, this threshold is used to construct credal\nsets -- ranges of predictive probabilities that are guaranteed, with a\nuser-selected confidence level, to include the predictions of the cloud model.\nThe credal sets are obtained through thresholding of a divergence measure in\nthe simplex of predictive probabilities. Experiments on visual and language\ntasks demonstrate that the proposed approach, termed Conformalized Distillation\nfor Credal Inference (CD-CI), significantly improves calibration performance\ncompared to low-complexity Bayesian methods, such as Laplace approximation,\nmaking it a practical and efficient solution for edge AI deployments.\n","authors":["Jiayi Huang","Sangwoo Park","Nicola Paoletti","Osvaldo Simeone"],"pdf_url":"https://arxiv.org/pdf/2501.06066v1.pdf","comment":"Under review"},{"id":"http://arxiv.org/abs/2407.04103v2","updated":"2025-01-10T15:37:26Z","published":"2024-07-04T18:06:48Z","title":"Advances in Diffusion Models for Image Data Augmentation: A Review of\n Methods, Models, Evaluation Metrics and Future Research Directions","summary":" Image data augmentation constitutes a critical methodology in modern computer\nvision tasks, since it can help enhance the diversity and quality of training\ndatasets, thereby improving the performance and robustness of machine learning\nmodels in downstream tasks. In parallel, augmentation\napproaches can also be used for editing/modifying a given image in a context-\nand semantics-aware way. 
Diffusion Models (DMs), which comprise one of the most\nrecent and highly promising classes of methods in the field of generative\nArtificial Intelligence (AI), have emerged as a powerful tool for image data\naugmentation, capable of generating realistic and diverse images by learning\nthe underlying data distribution. The current study realizes a systematic,\ncomprehensive and in-depth review of DM-based approaches for image\naugmentation, covering a wide range of strategies, tasks and applications. In\nparticular, a comprehensive analysis of the fundamental principles, model\narchitectures and training strategies of DMs is initially performed.\nSubsequently, a taxonomy of the relevant image augmentation methods is\nintroduced, focusing on techniques regarding semantic manipulation,\npersonalization and adaptation, and application-specific augmentation tasks.\nThen, performance assessment methodologies and respective evaluation metrics\nare analyzed. Finally, current challenges and future research directions in the\nfield are discussed.\n","authors":["Panagiotis Alimisis","Ioannis Mademlis","Panagiotis Radoglou-Grammatikis","Panagiotis Sarigiannidis","Georgios Th. Papadopoulos"],"pdf_url":"https://arxiv.org/pdf/2407.04103v2.pdf","comment":"65 pages, 15 figures"},{"id":"http://arxiv.org/abs/2410.18710v2","updated":"2025-01-10T15:37:01Z","published":"2024-10-23T07:55:40Z","title":"Uncovering the Genetic Basis of Glioblastoma Heterogeneity through\n Multimodal Analysis of Whole Slide Images and RNA Sequencing Data","summary":" Glioblastoma is a highly aggressive form of brain cancer characterized by\nrapid progression and poor prognosis. Despite advances in treatment, the\nunderlying genetic mechanisms driving this aggressiveness remain poorly\nunderstood. In this study, we employed multimodal deep learning approaches to\ninvestigate glioblastoma heterogeneity using joint image/RNA-seq analysis. Our\nresults reveal novel genes associated with glioblastoma. 
By leveraging a\ncombination of whole-slide images and RNA-seq, as well as introducing novel\nmethods to encode RNA-seq data, we identified specific genetic profiles that\nmay explain different patterns of glioblastoma progression. These findings\nprovide new insights into the genetic mechanisms underlying glioblastoma\nheterogeneity and highlight potential targets for therapeutic intervention.\n","authors":["Ahmad Berjaoui","Louis Roussel","Eduardo Hugo Sanchez","Elizabeth Cohen-Jonathan Moyal"],"pdf_url":"https://arxiv.org/pdf/2410.18710v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06051v1","updated":"2025-01-10T15:30:46Z","published":"2025-01-10T15:30:46Z","title":"Benchmarking Rotary Position Embeddings for Automatic Speech Recognition","summary":" Rotary Position Embedding (RoPE) encodes relative and absolute positional\ninformation in Transformer-based models through rotation matrices applied to\ninput vectors within sequences. While RoPE has demonstrated superior\nperformance compared to other positional embedding technologies in natural\nlanguage processing tasks, its effectiveness in speech processing applications\nremains understudied. In this work, we conduct a comprehensive evaluation of\nRoPE across diverse automatic speech recognition (ASR) tasks. Our experimental\nresults demonstrate that for ASR tasks, RoPE consistently achieves lower error\nrates compared to the currently widely used relative positional embedding. 
To\nfacilitate further research, we release the implementation and all experimental\nrecipes through the SpeechBrain toolkit.\n","authors":["Shucong Zhang","Titouan Parcollet","Rogier van Dalen","Sourav Bhattacharya"],"pdf_url":"https://arxiv.org/pdf/2501.06051v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.05289v3","updated":"2025-01-10T15:25:06Z","published":"2024-10-02T14:14:17Z","title":"MARS: A neurosymbolic approach for interpretable drug discovery","summary":" Neurosymbolic (NeSy) artificial intelligence describes the combination of\nlogic or rule-based techniques with neural networks. Compared to neural\napproaches, NeSy methods often possess enhanced interpretability, which is\nparticularly promising for biomedical applications like drug discovery.\nHowever, since interpretability is broadly defined, there are no clear\nguidelines for assessing the biological plausibility of model interpretations.\nTo assess interpretability in the context of drug discovery, we devise a novel\nprediction task, called drug mechanism-of-action (MoA) deconvolution, with an\nassociated, tailored knowledge graph (KG), MoA-net. We then develop the MoA\nRetrieval System (MARS), a NeSy approach for drug discovery which leverages\nlogical rules with learned rule weights. Using this interpretable feature\nalongside domain knowledge, we find that MARS and other NeSy approaches on KGs\nare susceptible to reasoning shortcuts, in which the prediction of true labels\nis driven by \"degree-bias\" rather than the domain-based rules. Subsequently, we\ndemonstrate ways to identify and mitigate this. Thereafter, MARS achieves\nperformance on par with current state-of-the-art models while producing model\ninterpretations aligned with known MoAs.\n","authors":["Lauren Nicole DeLong","Yojana Gadiya","Paola Galdi","Jacques D. Fleuriot","Daniel Domingo-Fernández"],"pdf_url":"https://arxiv.org/pdf/2410.05289v3.pdf","comment":"Under review. 10 pages, 7 supplementary pages. 
Corresponding code is\n here: https://github.com/laurendelong21/MARS and here:\n https://github.com/laurendelong21/MoA-Net"},{"id":"http://arxiv.org/abs/2501.06039v1","updated":"2025-01-10T15:17:27Z","published":"2025-01-10T15:17:27Z","title":"AI-powered virtual tissues from spatial proteomics for clinical\n diagnostics and biomedical discovery","summary":" Spatial proteomics technologies have transformed our understanding of complex\ntissue architectures by enabling simultaneous analysis of multiple molecular\nmarkers and their spatial organization. The high dimensionality of these data,\nvarying marker combinations across experiments and heterogeneous study designs\npose unique challenges for computational analysis. Here, we present Virtual\nTissues (VirTues), a foundation model framework for biological tissues that\noperates across the molecular, cellular and tissue scale. VirTues introduces\ninnovations in transformer architecture design, including a novel tokenization\nscheme that captures both spatial and marker dimensions, and attention\nmechanisms that scale to high-dimensional multiplex data while maintaining\ninterpretability. Trained on diverse cancer and non-cancer tissue datasets,\nVirTues demonstrates strong generalization capabilities without task-specific\nfine-tuning, enabling cross-study analysis and novel marker integration. 
As a\ngeneralist model, VirTues outperforms existing approaches across clinical\ndiagnostics, biological discovery and patient case retrieval tasks, while\nproviding insights into tissue function and disease mechanisms.\n","authors":["Johann Wenckstern","Eeshaan Jain","Kiril Vasilev","Matteo Pariset","Andreas Wicki","Gabriele Gut","Charlotte Bunne"],"pdf_url":"https://arxiv.org/pdf/2501.06039v1.pdf","comment":"23 pages, 5 figures"},{"id":"http://arxiv.org/abs/2411.19876v3","updated":"2025-01-10T15:08:44Z","published":"2024-11-29T17:38:56Z","title":"LUMIA: Linear probing for Unimodal and MultiModal Membership Inference\n Attacks leveraging internal LLM states","summary":" Large Language Models (LLMs) are increasingly used in a variety of\napplications, but concerns around membership inference have grown in parallel.\nPrevious efforts focus on black-to-grey-box models, thus neglecting the\npotential benefit from internal LLM information. To address this, we propose\nthe use of Linear Probes (LPs) as a method to detect Membership Inference\nAttacks (MIAs) by examining internal activations of LLMs. Our approach, dubbed\nLUMIA, applies LPs layer-by-layer to get fine-grained data on the model inner\nworkings. We test this method across several model architectures, sizes and\ndatasets, including unimodal and multimodal tasks. In unimodal MIA, LUMIA\nachieves an average gain of 15.71 % in Area Under the Curve (AUC) over previous\ntechniques. Remarkably, LUMIA reaches AUC>60% in 65.33% of cases -- an\nincrement of 46.80% against the state of the art. 
Furthermore, our approach\nreveals key insights, such as the model layers where MIAs are most detectable.\nIn multimodal models, LPs indicate that visual inputs can significantly\ncontribute to detecting MIAs -- AUC>60% is reached in 85.90% of experiments.\n","authors":["Luis Ibanez-Lissen","Lorena Gonzalez-Manzano","Jose Maria de Fuentes","Nicolas Anciaux","Joaquin Garcia-Alfaro"],"pdf_url":"https://arxiv.org/pdf/2411.19876v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06025v1","updated":"2025-01-10T15:01:51Z","published":"2025-01-10T15:01:51Z","title":"How to Tune a Multilingual Encoder Model for Germanic Languages: A Study\n of PEFT, Full Fine-Tuning, and Language Adapters","summary":" This paper investigates the optimal use of the multilingual encoder model\nmDeBERTa for tasks in three Germanic languages -- German, Swedish, and\nIcelandic -- representing varying levels of presence and likely data quality in\nmDeBERTa's pre-training data. We compare full fine-tuning with the\nparameter-efficient fine-tuning (PEFT) methods LoRA and Pfeiffer bottleneck\nadapters, finding that PEFT is more effective for the higher-resource language,\nGerman. However, results for Swedish and Icelandic are less consistent. 
We also\nobserve differences between tasks: While PEFT tends to work better for question\nanswering, full fine-tuning is preferable for named entity recognition.\nInspired by previous research on modular approaches that combine task and\nlanguage adapters, we evaluate the impact of adding PEFT modules trained on\nunstructured text, finding that this approach is not beneficial.\n","authors":["Romina Oji","Jenny Kunz"],"pdf_url":"https://arxiv.org/pdf/2501.06025v1.pdf","comment":"Accepted at NoDaLiDa Baltic-HLT 2025 Conference"},{"id":"http://arxiv.org/abs/2501.06019v1","updated":"2025-01-10T14:57:18Z","published":"2025-01-10T14:57:18Z","title":"BRIGHT: A globally distributed multimodal building damage assessment\n dataset with very-high-resolution for all-weather disaster response","summary":" Disaster events occur around the world and cause significant damage to human\nlife and property. Earth observation (EO) data enables rapid and comprehensive\nbuilding damage assessment (BDA), an essential capability in the aftermath of a\ndisaster to reduce human casualties and to inform disaster relief efforts.\nRecent research focuses on the development of AI models to achieve accurate\nmapping of unseen disaster events, mostly using optical EO data. However,\nsolutions based on optical data are limited to clear skies and daylight hours,\npreventing a prompt response to disasters. Integrating multimodal (MM) EO data,\nparticularly the combination of optical and SAR imagery, makes it possible to\nprovide all-weather, day-and-night disaster responses. Despite this potential,\nthe development of robust multimodal AI models has been constrained by the lack\nof suitable benchmark datasets. In this paper, we present a BDA dataset using\nveRy-hIGH-resoluTion optical and SAR imagery (BRIGHT) to support AI-based\nall-weather disaster response. 
To the best of our knowledge, BRIGHT is the\nfirst open-access, globally distributed, event-diverse MM dataset specifically\ncurated to support AI-based disaster response. It covers five types of natural\ndisasters and two types of man-made disasters across 12 regions worldwide, with\na particular focus on developing countries where external assistance is most\nneeded. The optical and SAR imagery in BRIGHT, with a spatial resolution\nbetween 0.3-1 meters, provides detailed representations of individual\nbuildings, making it ideal for precise BDA. In our experiments, we have tested\nseven advanced AI models trained with our BRIGHT to validate the\ntransferability and robustness. The dataset and code are available at\nhttps://github.com/ChenHongruixuan/BRIGHT. BRIGHT also serves as the official\ndataset for the 2025 IEEE GRSS Data Fusion Contest.\n","authors":["Hongruixuan Chen","Jian Song","Olivier Dietrich","Clifford Broni-Bediako","Weihao Xuan","Junjue Wang","Xinlei Shao","Yimin Wei","Junshi Xia","Cuiling Lan","Konrad Schindler","Naoto Yokoya"],"pdf_url":"https://arxiv.org/pdf/2501.06019v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04211v2","updated":"2025-01-10T14:36:48Z","published":"2025-01-08T01:11:17Z","title":"CURing Large Models: Compression via CUR Decomposition","summary":" Large deep learning models have achieved remarkable success but are\nresource-intensive, posing challenges such as memory usage. We introduce\nCURing, a novel model compression method based on CUR matrix decomposition,\nwhich approximates weight matrices as the product of selected columns (C) and\nrows (R), and a small linking matrix (U). We apply this decomposition to\nweights chosen based on the combined influence of their magnitudes and\nactivations. By identifying and retaining informative rows and columns, CURing\nsignificantly reduces model size with minimal performance loss. 
For example, it\nreduces Llama3.1-8B's parameters to 7.32B (-9%) in just 129 seconds, over 20\ntimes faster than prior compression methods.\n","authors":["Sanghyeon Park","Soo-Mook Moon"],"pdf_url":"https://arxiv.org/pdf/2501.04211v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.04127v3","updated":"2025-01-10T14:31:21Z","published":"2024-06-06T14:49:06Z","title":"Are We Done with MMLU?","summary":" Maybe not. We identify and analyse errors in the popular Massive Multitask\nLanguage Understanding (MMLU) benchmark. Even though MMLU is widely adopted,\nour analysis demonstrates numerous ground truth errors that obscure the true\ncapabilities of LLMs. For example, we find that 57% of the analysed questions\nin the Virology subset contain errors. To address this issue, we introduce a\ncomprehensive framework for identifying dataset errors using a novel error\nannotation protocol. Then, we create MMLU-Redux, which is a subset of 5,700\nmanually re-annotated questions across all 57 MMLU subjects. We estimate that\n6.49% of MMLU questions contain errors. Using MMLU-Redux, we demonstrate\nsignificant discrepancies with the model performance metrics that were\noriginally reported. Our results strongly advocate for revising MMLU's\nerror-ridden questions to enhance its future utility and reliability as a\nbenchmark. 
https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux-2.0.\n","authors":["Aryo Pradipta Gema","Joshua Ong Jun Leang","Giwon Hong","Alessio Devoto","Alberto Carlo Maria Mancino","Rohit Saxena","Xuanli He","Yu Zhao","Xiaotang Du","Mohammad Reza Ghasemi Madani","Claire Barale","Robert McHardy","Joshua Harris","Jean Kaddour","Emile van Krieken","Pasquale Minervini"],"pdf_url":"https://arxiv.org/pdf/2406.04127v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05989v1","updated":"2025-01-10T14:20:46Z","published":"2025-01-10T14:20:46Z","title":"Addressing speaker gender bias in large scale speech translation systems","summary":" This study addresses the issue of speaker gender bias in Speech Translation\n(ST) systems, which can lead to offensive and inaccurate translations. The\nmasculine bias often found in large-scale ST systems is typically perpetuated\nthrough training data derived from Machine Translation (MT) systems. Our\napproach involves two key steps. First, we employ Large Language Models (LLMs)\nto rectify translations based on the speaker's gender in a cost-effective\nmanner. Second, we fine-tune the ST model with the corrected data, enabling the\nmodel to generate gender-specific translations directly from audio cues,\nwithout the need for explicit gender input. Additionally, we propose a\nthree-mode fine-tuned model for scenarios where the speaker's gender is either\npredefined or should not be inferred from speech cues. 
We demonstrate a 70%\nimprovement in translations for female speakers compared to our baseline and\nother large-scale ST systems, such as Seamless M4T and Canary, on the MuST-SHE\ntest set.\n","authors":["Shubham Bansal","Vikas Joshi","Harveen Chadha","Rupeshkumar Mehta","Jinyu Li"],"pdf_url":"https://arxiv.org/pdf/2501.05989v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.14488v2","updated":"2025-01-10T13:52:14Z","published":"2024-12-19T03:22:47Z","title":"A stochastic first-order method with multi-extrapolated momentum for\n highly smooth unconstrained optimization","summary":" In this paper, we consider an unconstrained stochastic optimization problem\nwhere the objective function exhibits high-order smoothness. Specifically, we\npropose a new stochastic first-order method (SFOM) with multi-extrapolated\nmomentum, in which multiple extrapolations are performed in each iteration,\nfollowed by a momentum update based on these extrapolations. We demonstrate\nthat the proposed SFOM can accelerate optimization by exploiting the high-order\nsmoothness of the objective function $f$. Assuming that the $p$th-order\nderivative of $f$ is Lipschitz continuous for some $p\\ge2$, and under\nadditional mild assumptions, we establish that our method achieves a sample\ncomplexity of $\\widetilde{\\mathcal{O}}(\\epsilon^{-(3p+1)/p})$ for finding a\npoint $x$ such that $\\mathbb{E}[\\|\\nabla f(x)\\|]\\le\\epsilon$. To the best of\nour knowledge, this is the first SFOM to leverage arbitrary-order smoothness of\nthe objective function for acceleration, resulting in a sample complexity that\nimproves upon the best-known results without assuming the mean-squared\nsmoothness condition. 
Preliminary numerical experiments validate the practical\nperformance of our method and support our theoretical findings.\n","authors":["Chuan He"],"pdf_url":"https://arxiv.org/pdf/2412.14488v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05962v1","updated":"2025-01-10T13:42:40Z","published":"2025-01-10T13:42:40Z","title":"Effective faking of verbal deception detection with target-aligned\n adversarial attacks","summary":" Background: Deception detection through analysing language is a promising\navenue using both human judgments and automated machine learning judgments. For\nboth forms of credibility assessment, automated adversarial attacks that\nrewrite deceptive statements to appear truthful pose a serious threat. Methods:\nWe used a dataset of 243 truthful and 262 fabricated autobiographical stories\nin a deception detection task for humans and machine learning models. A large\nlanguage model was tasked to rewrite deceptive statements so that they appear\ntruthful. In Study 1, humans who made a deception judgment or used the\ndetailedness heuristic and two machine learning models (a fine-tuned language\nmodel and a simple n-gram model) judged original or adversarial modifications\nof deceptive statements. In Study 2, we manipulated the target alignment of the\nmodifications, i.e. tailoring the attack to whether the statements would be\nassessed by humans or computer models. Results: When adversarial modifications\nwere aligned with their target, human (d=-0.07 and d=-0.04) and machine\njudgments (51% accuracy) dropped to the chance level. When the attack was not\naligned with the target, both human heuristics judgments (d=0.30 and d=0.36)\nand machine learning predictions (63-78%) were significantly better than\nchance. Conclusions: Easily accessible language models can effectively help\nanyone fake deception detection efforts both by humans and machine learning\nmodels. 
Robustness against adversarial modifications for humans and machines\ndepends on that target alignment. We close with suggestions on advancing\ndeception research with adversarial attack designs.\n","authors":["Bennett Kleinberg","Riccardo Loconte","Bruno Verschuere"],"pdf_url":"https://arxiv.org/pdf/2501.05962v1.pdf","comment":"preprint"},{"id":"http://arxiv.org/abs/2412.11698v2","updated":"2025-01-10T13:35:37Z","published":"2024-12-16T12:21:05Z","title":"On Large Language Models in Mission-Critical IT Governance: Are We Ready\n Yet?","summary":" Context. The security of critical infrastructure has been a pressing concern\nsince the advent of computers and has become even more critical in today's era\nof cyber warfare. Protecting mission-critical systems (MCSs), essential for\nnational security, requires swift and robust governance, yet recent events\nreveal the increasing difficulty of meeting these challenges. Aim. Building on\nprior research showcasing the potential of Generative AI (GAI), such as Large\nLanguage Models, in enhancing risk analysis, we aim to explore practitioners'\nviews on integrating GAI into the governance of IT MCSs. Our goal is to provide\nactionable insights and recommendations for stakeholders, including\nresearchers, practitioners, and policymakers. Method. We designed a survey to\ncollect practical experiences, concerns, and expectations of practitioners who\ndevelop and implement security solutions in the context of MCSs. Conclusions\nand Future Works. Our findings highlight that the safe use of LLMs in MCS\ngovernance requires interdisciplinary collaboration. 
Researchers should focus\non designing regulation-oriented models and focus on accountability;\npractitioners emphasize data protection and transparency, while policymakers\nmust establish a unified AI framework with global benchmarks to ensure ethical\nand secure LLMs-based MCS governance.\n","authors":["Matteo Esposito","Francesco Palagiano","Valentina Lenarduzzi","Davide Taibi"],"pdf_url":"https://arxiv.org/pdf/2412.11698v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03916v2","updated":"2025-01-10T13:14:28Z","published":"2025-01-07T16:31:10Z","title":"Dolphin: Closed-loop Open-ended Auto-research through Thinking,\n Practice, and Feedback","summary":" The scientific research paradigm is undergoing a profound transformation\nowing to the development of Artificial Intelligence (AI). Recent works\ndemonstrate that various AI-assisted research methods can largely improve\nresearch efficiency by improving data analysis, accelerating computation, and\nfostering novel idea generation. To further move towards the ultimate goal\n(i.e., automatic scientific research), in this paper, we propose Dolphin, the\nfirst closed-loop open-ended auto-research framework to further build the\nentire process of human scientific research. Dolphin can generate research\nideas, perform experiments, and get feedback from experimental results to\ngenerate higher-quality ideas. More specifically, Dolphin first generates novel\nideas based on relevant papers which are ranked by the topic and task\nattributes. Then, the codes are automatically generated and debugged with the\nexception-traceback-guided local code structure. Finally, Dolphin automatically\nanalyzes the results of each idea and feeds the results back to the next round\nof idea generation. Experiments are conducted on the benchmark datasets of\ndifferent topics and results show that Dolphin can generate novel ideas\ncontinuously and complete the experiment in a loop. 
We highlight that Dolphin\ncan automatically propose methods that are comparable to the state-of-the-art\nin some tasks such as 2D image classification and 3D point classification.\n","authors":["Jiakang Yuan","Xiangchao Yan","Botian Shi","Tao Chen","Wanli Ouyang","Bo Zhang","Lei Bai","Yu Qiao","Bowen Zhou"],"pdf_url":"https://arxiv.org/pdf/2501.03916v2.pdf","comment":"19 pages, 11 figures, and our homepage:\n https://alpha-innovator.github.io/Dolphin-project-page"},{"id":"http://arxiv.org/abs/2311.03056v4","updated":"2025-01-10T13:01:45Z","published":"2023-11-06T12:22:19Z","title":"LitSumm: Large language models for literature summarisation of\n non-coding RNAs","summary":" Curation of literature in life sciences is a growing challenge. The continued\nincrease in the rate of publication, coupled with the relatively fixed number\nof curators worldwide presents a major challenge to developers of biomedical\nknowledgebases. Very few knowledgebases have resources to scale to the whole\nrelevant literature and all have to prioritise their efforts.\n In this work, we take a first step to alleviating the lack of curator time in\nRNA science by generating summaries of literature for non-coding RNAs using\nlarge language models (LLMs). We demonstrate that high-quality, factually\naccurate summaries with accurate references can be automatically generated from\nthe literature using a commercial LLM and a chain of prompts and checks. Manual\nassessment was carried out for a subset of summaries, with the majority being\nrated extremely high quality.\n We apply our tool to a selection of over 4,600 ncRNAs and make the generated\nsummaries available via the RNAcentral resource. We conclude that automated\nliterature summarization is feasible with the current generation of LLMs,\nprovided careful prompting and automated checking are applied.\n","authors":["Andrew Green","Carlos Ribas","Nancy Ontiveros-Palacios","Sam Griffiths-Jones","Anton I. 
Petrov","Alex Bateman","Blake Sweeney"],"pdf_url":"https://arxiv.org/pdf/2311.03056v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05932v1","updated":"2025-01-10T12:55:34Z","published":"2025-01-10T12:55:34Z","title":"DiffuSETS: 12-lead ECG Generation Conditioned on Clinical Text Reports\n and Patient-Specific Information","summary":" Heart disease remains a significant threat to human health. As a non-invasive\ndiagnostic tool, the electrocardiogram (ECG) is one of the most widely used\nmethods for cardiac screening. However, the scarcity of high-quality ECG data,\ndriven by privacy concerns and limited medical resources, creates a pressing\nneed for effective ECG signal generation. Existing approaches for generating\nECG signals typically rely on small training datasets, lack comprehensive\nevaluation frameworks, and overlook potential applications beyond data\naugmentation. To address these challenges, we propose DiffuSETS, a novel\nframework capable of generating ECG signals with high semantic alignment and\nfidelity. DiffuSETS accepts various modalities of clinical text reports and\npatient-specific information as inputs, enabling the creation of clinically\nmeaningful ECG signals. Additionally, to address the lack of standardized\nevaluation in ECG generation, we introduce a comprehensive benchmarking\nmethodology to assess the effectiveness of generative models in this domain.\nOur model achieves excellent results in tests, proving its superiority in the\ntask of ECG generation. 
Furthermore, we showcase its potential to mitigate data\nscarcity while exploring novel applications in cardiology education and medical\nknowledge discovery, highlighting the broader impact of our work.\n","authors":["Yongfan Lai","Jiabo Chen","Deyun Zhang","Yue Wang","Shijia Geng","Hongyan Li","Shenda Hong"],"pdf_url":"https://arxiv.org/pdf/2501.05932v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05928v1","updated":"2025-01-10T12:49:12Z","published":"2025-01-10T12:49:12Z","title":"Towards Backdoor Stealthiness in Model Parameter Space","summary":" Recent research on backdoor stealthiness focuses mainly on indistinguishable\ntriggers in input space and inseparable backdoor representations in feature\nspace, aiming to circumvent backdoor defenses that examine these respective\nspaces. However, existing backdoor attacks are typically designed to resist a\nspecific type of backdoor defense without considering the diverse range of\ndefense mechanisms. Based on this observation, we pose a natural question: Are\ncurrent backdoor attacks truly a real-world threat when facing diverse\npractical defenses?\n To answer this question, we examine 12 common backdoor attacks that focus on\ninput-space or feature-space stealthiness and 17 diverse representative\ndefenses. Surprisingly, we reveal a critical blind spot: Backdoor attacks\ndesigned to be stealthy in input and feature spaces can be mitigated by\nexamining backdoored models in parameter space. To investigate the underlying\ncauses behind this common vulnerability, we study the characteristics of\nbackdoor attacks in the parameter space. Notably, we find that input- and\nfeature-space attacks introduce prominent backdoor-related neurons in parameter\nspace, which are not thoroughly considered by current backdoor attacks. Taking\ncomprehensive stealthiness into account, we propose a novel supply-chain attack\ncalled Grond. 
Grond limits the parameter changes by a simple yet effective\nmodule, Adversarial Backdoor Injection (ABI), which adaptively increases the\nparameter-space stealthiness during the backdoor injection. Extensive\nexperiments demonstrate that Grond outperforms all 12 backdoor attacks against\nstate-of-the-art (including adaptive) defenses on CIFAR-10, GTSRB, and a subset\nof ImageNet. In addition, we show that ABI consistently improves the\neffectiveness of common backdoor attacks.\n","authors":["Xiaoyun Xu","Zhuoran Liu","Stefanos Koffas","Stjepan Picek"],"pdf_url":"https://arxiv.org/pdf/2501.05928v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05921v1","updated":"2025-01-10T12:26:38Z","published":"2025-01-10T12:26:38Z","title":"The New Anticipatory Governance Culture for Innovation: Regulatory\n Foresight, Regulatory Experimentation and Regulatory Learning","summary":" With the rapid pace of technological innovation, traditional methods of\npolicy formation and legislating are becoming conspicuously anachronistic. The\nneed for regulatory choices to be made to counter the deadening effect of\nregulatory lag is more important to developing markets and fostering growth\nthan achieving one off regulatory perfection. This article advances scholarship\non innovation policy and the regulation of technological innovation in the\nEuropean Union. It does so by considering what building an agile yet robust\nanticipatory governance regulatory culture involves. It systematically\nexcavates a variety of tools and elements that are being put into use in\ninventive ways and argues that these need to be more cohesively and\nsystemically integrated into the regulatory toolbox. 
Approaches covered include\nstrategic foresight, the critical embrace of iterative policy development and\nregulatory learning in the face of uncertainty, and the embrace of bottom-up\napproaches to co-creation of policy such as Policy Labs and the testing and\nregulatory learning through pilot regulation and experimentation. The growing\nuse of regulatory sandboxes as an EU policy tool to boost innovation and\nnavigate regulatory complexity, as seen in the EU AI Act, is also probed.\n","authors":["Deirdre Ahern"],"pdf_url":"https://arxiv.org/pdf/2501.05921v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05891v1","updated":"2025-01-10T11:44:35Z","published":"2025-01-10T11:44:35Z","title":"Affordably Fine-tuned LLMs Provide Better Answers to Course-specific\n MCQs","summary":" In education, the capability of generating human-like text of Large Language\nModels (LLMs) inspired work on how they can increase the efficiency of learning\nand teaching. We study the affordability of these models for educators and\nstudents by investigating how LLMs answer multiple-choice questions (MCQs) with\nrespect to hardware constraints and refinement techniques. We explore this\nspace by using generic pre-trained LLMs (the 7B, 13B, and 70B variants of\nLLaMA-2) to answer 162 undergraduate-level MCQs from a course on Programming\nLanguages (PL) -- the MCQ dataset is a contribution of this work, which we make\npublicly available. Specifically, we dissect how different factors, such as\nusing readily-available material -- (parts of) the course's textbook -- for\nfine-tuning and quantisation (to decrease resource usage) can change the\naccuracy of the responses. 
The main takeaway is that smaller textbook-based\nfine-tuned models outperform generic larger ones (whose pre-training requires\nconspicuous resources), making the usage of LLMs for answering MCQs resource-\nand material-wise affordable.\n","authors":["Bianca Raimondi","Saverio Giallorenzo","Maurizio Gabbrielli"],"pdf_url":"https://arxiv.org/pdf/2501.05891v1.pdf","comment":"The 40th ACM/SIGAPP Symposium On Applied Computing"},{"id":"http://arxiv.org/abs/2501.05885v1","updated":"2025-01-10T11:37:50Z","published":"2025-01-10T11:37:50Z","title":"EDNet: Edge-Optimized Small Target Detection in UAV Imagery -- Faster\n Context Attention, Better Feature Fusion, and Hardware Acceleration","summary":" Detecting small targets in drone imagery is challenging due to low\nresolution, complex backgrounds, and dynamic scenes. We propose EDNet, a novel\nedge-target detection framework built on an enhanced YOLOv10 architecture,\noptimized for real-time applications without post-processing. EDNet\nincorporates an XSmall detection head and a Cross Concat strategy to improve\nfeature fusion and multi-scale context awareness for detecting tiny targets in\ndiverse environments. Our unique C2f-FCA block employs Faster Context Attention\nto enhance feature extraction while reducing computational complexity. The WIoU\nloss function is employed for improved bounding box regression. With seven\nmodel sizes ranging from Tiny to XL, EDNet accommodates various deployment\nenvironments, enabling local real-time inference and ensuring data privacy.\nNotably, EDNet achieves up to a 5.6% gain in mAP@50 with significantly fewer\nparameters. On an iPhone 12, EDNet variants operate at speeds ranging from 16\nto 55 FPS, providing a scalable and efficient solution for edge-based object\ndetection in challenging drone imagery. The source code and pre-trained models\nare available at: https://github.com/zsniko/EDNet.\n","authors":["Zhifan Song","Yuan Zhang","Abd Al Rahman M. 
Abu Ebayyeh"],"pdf_url":"https://arxiv.org/pdf/2501.05885v1.pdf","comment":"Accepted in 21st IEEE International Conference on Ubiquitous\n Intelligence and Computing (UIC 2024)\n https://www.ieee-smart-world.org/2024/uic"},{"id":"http://arxiv.org/abs/2501.01987v2","updated":"2025-01-10T11:36:09Z","published":"2024-12-30T18:08:13Z","title":"Gender Bias in Text-to-Video Generation Models: A case study of Sora","summary":" The advent of text-to-video generation models has revolutionized content\ncreation as it produces high-quality videos from textual prompts. However,\nconcerns regarding inherent biases in such models have prompted scrutiny,\nparticularly regarding gender representation. Our study investigates the\npresence of gender bias in OpenAI's Sora, a state-of-the-art text-to-video\ngeneration model. We uncover significant evidence of bias by analyzing the\ngenerated videos from a diverse set of gender-neutral and stereotypical\nprompts. The results indicate that Sora disproportionately associates specific\ngenders with stereotypical behaviors and professions, which reflects societal\nprejudices embedded in its training data.\n","authors":["Mohammad Nadeem","Shahab Saquib Sohail","Erik Cambria","Björn W. Schuller","Amir Hussain"],"pdf_url":"https://arxiv.org/pdf/2501.01987v2.pdf","comment":"7 pages, 3 figures"},{"id":"http://arxiv.org/abs/2501.05882v1","updated":"2025-01-10T11:34:22Z","published":"2025-01-10T11:34:22Z","title":"Solving nonograms using Neural Networks","summary":" Nonograms are logic puzzles in which cells in a grid must be colored or left\nblank according to the numbers that are located in its headers. In this study,\nwe analyze different techniques to solve this type of logical problem using an\nHeuristic Algorithm, Genetic Algorithm, and Heuristic Algorithm with Neural\nNetwork. Furthermore, we generate a public dataset to train the neural\nnetworks. We published this dataset and the code of the algorithms. 
The combination\nof the heuristic algorithm with a neural network obtained the best results.\nFrom a state-of-the-art review, no previous works used neural networks to solve\nnonograms, nor combined a network with other algorithms to accelerate the\nresolution process.\n","authors":["José María Buades Rubio","Antoni Jaume-i-Capó","David López González","Gabriel Moyà Alcover"],"pdf_url":"https://arxiv.org/pdf/2501.05882v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05874v1","updated":"2025-01-10T11:17:15Z","published":"2025-01-10T11:17:15Z","title":"VideoRAG: Retrieval-Augmented Generation over Video Corpus","summary":" Retrieval-Augmented Generation (RAG) is a powerful strategy to address the\nissue of generating factually incorrect outputs in foundation models by\nretrieving external knowledge relevant to queries and incorporating it into\ntheir generation process. However, existing RAG approaches have primarily\nfocused on textual information, with some recent advancements beginning to\nconsider images, and they largely overlook videos, a rich source of multimodal\nknowledge capable of representing events, processes, and contextual details\nmore effectively than any other modality. While a few recent studies explore\nthe integration of videos in the response generation process, they either\npredefine query-associated videos without retrieving them according to queries,\nor convert videos into textual descriptions without harnessing their\nmultimodal richness. To tackle these, we introduce VideoRAG, a novel framework\nthat not only dynamically retrieves relevant videos based on their relevance\nwith queries but also utilizes both visual and textual information of videos in\nthe output generation. 
Further, to operationalize this, our method revolves\naround the recent advance of Large Video Language Models (LVLMs), which enable\nthe direct processing of video content to represent it for retrieval and\nseamless integration of the retrieved videos jointly with queries. We\nexperimentally validate the effectiveness of VideoRAG, showcasing that it is\nsuperior to relevant baselines.\n","authors":["Soyeong Jeong","Kangsan Kim","Jinheon Baek","Sung Ju Hwang"],"pdf_url":"https://arxiv.org/pdf/2501.05874v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.03968v2","updated":"2025-01-10T10:38:49Z","published":"2025-01-07T18:06:27Z","title":"VLM-driven Behavior Tree for Context-aware Task Planning","summary":" The use of Large Language Models (LLMs) for generating Behavior Trees (BTs)\nhas recently gained attention in the robotics community, yet remains in its\nearly stages of development. In this paper, we propose a novel framework that\nleverages Vision-Language Models (VLMs) to interactively generate and edit BTs\nthat address visual conditions, enabling context-aware robot operations in\nvisually complex environments. A key feature of our approach lies in the\nconditional control through self-prompted visual conditions. Specifically, the\nVLM generates BTs with visual condition nodes, where conditions are expressed\nas free-form text. Another VLM process integrates the text into its prompt and\nevaluates the conditions against real-world images during robot execution. We\nvalidated our framework in a real-world cafe scenario, demonstrating both its\nfeasibility and limitations.\n","authors":["Naoki Wake","Atsushi Kanehira","Jun Takamatsu","Kazuhiro Sasabuchi","Katsushi Ikeuchi"],"pdf_url":"https://arxiv.org/pdf/2501.03968v2.pdf","comment":"10 pages, 11 figures, 5 tables. 
Last updated on January 9th, 2024"},{"id":"http://arxiv.org/abs/2406.10221v2","updated":"2025-01-10T10:36:58Z","published":"2024-06-14T17:54:54Z","title":"Long Story Short: Story-level Video Understanding from 20K Short Films","summary":" Recent developments in vision-language models have significantly advanced\nvideo understanding. Existing datasets and tasks, however, have notable\nlimitations. Most datasets are confined to short videos with limited events and\nnarrow narratives. For example, datasets with instructional and egocentric\nvideos often depict activities of one person in a single scene. Although\nexisting movie datasets offer richer content, they are often limited to\nshort-term tasks, lack publicly available videos, and frequently encounter data\nleakage issues given the use of subtitles and other information about\ncommercial movies during LLM pretraining. To address the above limitations, we\npropose Short-Films 20K (SF20K), the largest publicly available movie dataset.\nSF20K is composed of 20,143 amateur films and offers long-term video tasks in\nthe form of multiple-choice and open-ended question answering. 
Our extensive\nanalysis of SF20K reveals minimal data leakage, emphasizes the need for\nlong-term reasoning, and demonstrates the strong performance of recent VLMs.\nFinally, we show that instruction tuning on the SF20K-Train set substantially\nimproves model performance, paving the way for future progress in long-term\nvideo understanding.\n","authors":["Ridouane Ghermi","Xi Wang","Vicky Kalogeiton","Ivan Laptev"],"pdf_url":"https://arxiv.org/pdf/2406.10221v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05845v1","updated":"2025-01-10T10:36:46Z","published":"2025-01-10T10:36:46Z","title":"Annealing Machine-assisted Learning of Graph Neural Network for\n Combinatorial Optimization","summary":" While Annealing Machines (AM) have shown increasing capabilities in solving\ncomplex combinatorial problems, positioning themselves as a more immediate\nalternative to the expected advances of future fully quantum solutions, there\nare still scaling limitations. In parallel, Graph Neural Networks (GNN) have\nbeen recently adapted to solve combinatorial problems, showing competitive\nresults and potentially high scalability due to their distributed nature. We\npropose a merging approach that aims at retaining both the accuracy exhibited\nby AMs and the representational flexibility and scalability of GNNs. Our model\nconsiders a compression step, followed by a supervised interaction where\npartial solutions obtained from the AM are used to guide local GNNs from where\nnode feature representations are obtained and combined to initialize an\nadditional GNN-based solver that handles the original graph's target problem.\nIntuitively, the AM can solve the combinatorial problem indirectly by infusing\nits knowledge into the GNN. 
Experiments on canonical optimization problems show\nthat the idea is feasible, effectively allowing the AM to solve size problems\nbeyond its original limits.\n","authors":["Pablo Loyola","Kento Hasegawa","Andres Hoyos-Idobro","Kazuo Ono","Toyotaro Suzumura","Yu Hirate","Masanao Yamaoka"],"pdf_url":"https://arxiv.org/pdf/2501.05845v1.pdf","comment":"Second Workshop on Machine Learning with New Compute Paradigms at\n NeurIPS 2024 (MLNCP 2024)"},{"id":"http://arxiv.org/abs/2501.01834v2","updated":"2025-01-10T10:08:50Z","published":"2025-01-03T14:38:01Z","title":"MoColl: Agent-Based Specific and General Model Collaboration for Image\n Captioning","summary":" Image captioning is a critical task at the intersection of computer vision\nand natural language processing, with wide-ranging applications across various\ndomains. For complex tasks such as diagnostic report generation, deep learning\nmodels require not only domain-specific image-caption datasets but also the\nincorporation of relevant general knowledge to provide contextual accuracy.\nExisting approaches exhibit inherent limitations: specialized models excel in\ncapturing domain-specific details but lack generalization, while\nvision-language models (VLMs) built on large language models (LLMs) leverage\ngeneral knowledge but struggle with domain-specific adaptation. To address\nthese limitations, this paper proposes a novel agent-enhanced model\ncollaboration framework, which we call MoColl, designed to effectively\nintegrate domain-specific and general knowledge. Specifically, our approach is\nto decompose complex image captioning tasks into a series of interconnected\nquestion-answer subtasks. A trainable visual question answering (VQA) model is\nemployed as a specialized tool to focus on domain-specific visual analysis,\nanswering task-specific questions based on image content. 
Concurrently, an\nLLM-based agent with general knowledge formulates these questions and\nsynthesizes the resulting question-answer pairs into coherent captions. Beyond\nits role in leveraging the VQA model, the agent further guides its training to\nenhance its domain-specific capabilities. Experimental results on radiology\nreport generation validate the effectiveness of the proposed framework,\ndemonstrating significant improvements in the quality of generated reports.\n","authors":["Pu Yang","Bin Dong"],"pdf_url":"https://arxiv.org/pdf/2501.01834v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09278v2","updated":"2025-01-10T10:07:55Z","published":"2024-12-12T13:41:35Z","title":"Towards a Multimodal Large Language Model with Pixel-Level Insight for\n Biomedicine","summary":" In recent years, Multimodal Large Language Models (MLLM) have achieved\nnotable advancements, demonstrating the feasibility of developing an\nintelligent biomedical assistant. However, current biomedical MLLMs\npredominantly focus on image-level understanding and restrict interactions to\ntextual commands, thus limiting their capability boundaries and the flexibility\nof usage. In this paper, we introduce a novel end-to-end multimodal large\nlanguage model for the biomedical domain, named MedPLIB, which possesses\npixel-level understanding. Excitingly, it supports visual question answering\n(VQA), arbitrary pixel-level prompts (points, bounding boxes, and free-form\nshapes), and pixel-level grounding. We propose a novel Mixture-of-Experts (MoE)\nmulti-stage training strategy, which divides MoE into separate training phases\nfor a visual-language expert model and a pixel-grounding expert model, followed\nby fine-tuning using MoE. This strategy effectively coordinates multitask\nlearning while maintaining the computational cost at inference equivalent to\nthat of a single expert model. 
To advance the research of biomedical MLLMs, we\nintroduce the Medical Complex Vision Question Answering Dataset (MeCoVQA),\nwhich comprises an array of 8 modalities for complex medical imaging question\nanswering and image region understanding. Experimental results indicate that\nMedPLIB has achieved state-of-the-art outcomes across multiple medical visual\nlanguage tasks. More importantly, in zero-shot evaluations for the pixel\ngrounding task, MedPLIB leads the best small and large models by margins of\n19.7 and 15.6 respectively on the mDice metric. The codes, data, and model\ncheckpoints will be made publicly available at\nhttps://github.com/ShawnHuang497/MedPLIB.\n","authors":["Xiaoshuang Huang","Lingdong Shen","Jia Liu","Fangxin Shang","Hongxiang Li","Haifeng Huang","Yehui Yang"],"pdf_url":"https://arxiv.org/pdf/2412.09278v2.pdf","comment":"Accepted by AAAI2025"},{"id":"http://arxiv.org/abs/2501.05826v1","updated":"2025-01-10T10:03:56Z","published":"2025-01-10T10:03:56Z","title":"AI-Driven Diabetic Retinopathy Screening: Multicentric Validation of\n AIDRSS in India","summary":" Purpose: Diabetic retinopathy (DR) is a major cause of vision loss,\nparticularly in India, where access to retina specialists is limited in rural\nareas. This study aims to evaluate the Artificial Intelligence-based Diabetic\nRetinopathy Screening System (AIDRSS) for DR detection and prevalence\nassessment, addressing the growing need for scalable, automated screening\nsolutions in resource-limited settings.\n Approach: A multicentric, cross-sectional study was conducted in Kolkata,\nIndia, involving 5,029 participants and 10,058 macula-centric retinal fundus\nimages. The AIDRSS employed a deep learning algorithm with 50 million trainable\nparameters, integrated with Contrast Limited Adaptive Histogram Equalization\n(CLAHE) preprocessing for enhanced image quality. 
DR was graded using the\nInternational Clinical Diabetic Retinopathy (ICDR) Scale, categorizing disease\ninto five stages (DR0 to DR4). Statistical metrics including sensitivity,\nspecificity, and prevalence rates were evaluated against expert retina\nspecialist assessments.\n Results: The prevalence of DR in the general population was 13.7%, rising to\n38.2% among individuals with elevated random blood glucose levels. The AIDRSS\nachieved an overall sensitivity of 92%, specificity of 88%, and 100%\nsensitivity for detecting referable DR (DR3 and DR4). These results demonstrate\nthe system's robust performance in accurately identifying and grading DR in a\ndiverse population.\n Conclusions: AIDRSS provides a reliable, scalable solution for early DR\ndetection in resource-constrained environments. Its integration of advanced AI\ntechniques ensures high diagnostic accuracy, with potential to significantly\nreduce the burden of diabetes-related vision loss in underserved regions.\n","authors":["Amit Kr Dey","Pradeep Walia","Girish Somvanshi","Abrar Ali","Sagarnil Das","Pallabi Paul","Minakhi Ghosh"],"pdf_url":"https://arxiv.org/pdf/2501.05826v1.pdf","comment":"22 pages, 5 figures. arXiv admin note: substantial text overlap with\n arXiv:1812.07105 by other authors without attribution"},{"id":"http://arxiv.org/abs/2501.05819v1","updated":"2025-01-10T09:59:16Z","published":"2025-01-10T09:59:16Z","title":"Diffusion Models for Smarter UAVs: Decision-Making and Modeling","summary":" Unmanned Aerial Vehicles (UAVs) are increasingly adopted in modern\ncommunication networks. However, challenges in decision-making and digital\nmodeling continue to impede their rapid advancement. Reinforcement Learning\n(RL) algorithms face limitations such as low sample efficiency and limited data\nversatility, further magnified in UAV communication scenarios. Moreover,\nDigital Twin (DT) modeling introduces substantial decision-making and data\nmanagement complexities. 
RL models, often integrated into DT frameworks,\nrequire extensive training data to achieve accurate predictions. In contrast to\ntraditional approaches that focus on class boundaries, Diffusion Models (DMs),\na new class of generative AI, learn the underlying probability distribution\nfrom the training data and can generate trustworthy new patterns based on this\nlearned distribution. This paper explores the integration of DMs with RL and DT\nto effectively address these challenges. By combining the data generation\ncapabilities of DMs with the decision-making framework of RL and the modeling\naccuracy of DT, the integration improves the adaptability and real-time\nperformance of UAV communication. Moreover, the study shows how DMs can\nalleviate data scarcity, improve policy networks, and optimize dynamic\nmodeling, providing a robust solution for complex UAV communication scenarios.\n","authors":["Yousef Emami","Hao Zhou","Luis Almeida","Kai Li"],"pdf_url":"https://arxiv.org/pdf/2501.05819v1.pdf","comment":"7 pages, 2 figures"},{"id":"http://arxiv.org/abs/2404.05399v2","updated":"2025-01-10T09:54:54Z","published":"2024-04-08T10:57:25Z","title":"SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and\n Improving Large Language Model Safety","summary":" The last two years have seen a rapid growth in concerns around the safety of\nlarge language models (LLMs). Researchers and practitioners have met these\nconcerns by creating an abundance of datasets for evaluating and improving LLM\nsafety. However, much of this work has happened in parallel, and with very\ndifferent goals in mind, ranging from the mitigation of near-term risks around\nbias and toxic content generation to the assessment of longer-term catastrophic\nrisk potential. This makes it difficult for researchers and practitioners to\nfind the most relevant datasets for their use case, and to identify gaps in\ndataset coverage that future work may fill. 
To remedy these issues, we conduct\na first systematic review of open datasets for evaluating and improving LLM\nsafety. We review 144 datasets, which we identified through an iterative and\ncommunity-driven process over the course of several months. We highlight\npatterns and trends, such as a trend towards fully synthetic datasets, as well\nas gaps in dataset coverage, such as a clear lack of non-English and\nnaturalistic datasets. We also examine how LLM safety datasets are used in\npractice -- in LLM release publications and popular LLM benchmarks -- finding\nthat current evaluation practices are highly idiosyncratic and make use of only\na small fraction of available datasets. Our contributions are based on\nSafetyPrompts.com, a living catalogue of open datasets for LLM safety, which we\nplan to update continuously as the field of LLM safety develops.\n","authors":["Paul Röttger","Fabio Pernisi","Bertie Vidgen","Dirk Hovy"],"pdf_url":"https://arxiv.org/pdf/2404.05399v2.pdf","comment":"Accepted at AAAI 2025 (Special Track on AI Alignment)"},{"id":"http://arxiv.org/abs/2308.00721v4","updated":"2025-01-10T09:35:20Z","published":"2023-07-31T03:56:46Z","title":"A Pre-trained Data Deduplication Model based on Active Learning","summary":" In the era of big data, the issue of data quality has become increasingly\nprominent. One of the main challenges is the problem of duplicate data, which\ncan arise from repeated entry or the merging of multiple data sources. These\n\"dirty data\" problems can significantly limit the effective application of big\ndata. To address the issue of data deduplication, we propose a pre-trained\ndeduplication model based on active learning, which is the first work that\nutilizes active learning to address the problem of deduplication at the\nsemantic level. 
The model is built on a pre-trained Transformer and fine-tuned\nto solve the deduplication problem as a sequence to classification task, which\nfirstly integrate the transformer with active learning into an end-to-end\narchitecture to select the most valuable data for deduplication model training,\nand also firstly employ the R-Drop method to perform data augmentation on each\nround of labeled data, which can reduce the cost of manual labeling and improve\nthe model's performance. Experimental results demonstrate that our proposed\nmodel outperforms previous state-of-the-art (SOTA) for deduplicated data\nidentification, achieving up to a 28% improvement in Recall score on benchmark\ndatasets.\n","authors":["Haochen Shi","Xinyao Liu","Fengmao Lv","Hongtao Xue","Jie Hu","Shengdong Du","Tianrui Li"],"pdf_url":"https://arxiv.org/pdf/2308.00721v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.02221v2","updated":"2025-01-10T09:26:32Z","published":"2025-01-04T07:53:38Z","title":"CORD: Generalizable Cooperation via Role Diversity","summary":" Cooperative multi-agent reinforcement learning (MARL) aims to develop agents\nthat can collaborate effectively. However, most cooperative MARL methods\noverfit training agents, making learned policies not generalize well to unseen\ncollaborators, which is a critical issue for real-world deployment. Some\nmethods attempt to address the generalization problem but require prior\nknowledge or predefined policies of new teammates, limiting real-world\napplications. To this end, we propose a hierarchical MARL approach to enable\ngeneralizable cooperation via role diversity, namely CORD. CORD's high-level\ncontroller assigns roles to low-level agents by maximizing the role entropy\nwith constraints. We show this constrained objective can be decomposed into\ncausal influence in role that enables reasonable role assignment, and role\nheterogeneity that yields coherent, non-redundant role clusters. 
Evaluated on a\nvariety of cooperative multi-agent tasks, CORD achieves better performance than\nbaselines, especially in generalization tests. Ablation studies further\ndemonstrate the efficacy of the constrained objective in generalizable\ncooperation.\n","authors":["Kanefumi Matsuyama","Kefan Su","Jiangxing Wang","Deheng Ye","Zongqing Lu"],"pdf_url":"https://arxiv.org/pdf/2501.02221v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05808v1","updated":"2025-01-10T09:15:40Z","published":"2025-01-10T09:15:40Z","title":"Real-Time Integrated Dispatching and Idle Fleet Steering with Deep\n Reinforcement Learning for A Meal Delivery Platform","summary":" To achieve high service quality and profitability, meal delivery platforms\nlike Uber Eats and Grubhub must strategically operate their fleets to ensure\ntimely deliveries for current orders while mitigating the consequential impacts\nof suboptimal decisions that leads to courier understaffing in the future. This\nstudy set out to solve the real-time order dispatching and idle courier\nsteering problems for a meal delivery platform by proposing a reinforcement\nlearning (RL)-based strategic dual-control framework. To address the inherent\nsequential nature of these problems, we model both order dispatching and\ncourier steering as Markov Decision Processes. Trained via a deep reinforcement\nlearning (DRL) framework, we obtain strategic policies by leveraging the\nexplicitly predicted demands as part of the inputs. In our dual-control\nframework, the dispatching and steering policies are iteratively trained in an\nintegrated manner. These forward-looking policies can be executed in real-time\nand provide decisions while jointly considering the impacts on local and\nnetwork levels. To enhance dispatching fairness, we propose convolutional deep\nQ networks to construct fair courier embeddings. 
To simultaneously rebalance\nthe supply and demand within the service network, we propose to utilize\nmean-field approximated supply-demand knowledge to reallocate idle couriers at\nthe local level. Utilizing the policies generated by the RL-based strategic\ndual-control framework, we find the delivery efficiency and fairness of\nworkload distribution among couriers have been improved, and under-supplied\nconditions have been alleviated within the service network. Our study sheds\nlight on designing an RL-based framework to enable forward-looking real-time\noperations for meal delivery platforms and other on-demand services.\n","authors":["Jingyi Cheng","Shadi Sharif Azadeh"],"pdf_url":"https://arxiv.org/pdf/2501.05808v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.13572v2","updated":"2025-01-10T09:11:39Z","published":"2024-02-21T07:07:54Z","title":"AlgoFormer: An Efficient Transformer Framework with Algorithmic\n Structures","summary":" Besides natural language processing, transformers exhibit extraordinary\nperformance in solving broader applications, including scientific computing and\ncomputer vision. Previous works try to explain this from the expressive power\nand capability perspectives that standard transformers are capable of\nperforming some algorithms. To empower transformers with algorithmic\ncapabilities and motivated by the recently proposed looped transformer, we\ndesign a novel transformer framework, dubbed Algorithm Transformer (abbreviated\nas AlgoFormer). We provide an insight that efficient transformer architectures\ncan be designed by leveraging prior knowledge of tasks and the underlying\nstructure of potential algorithms. Compared with the standard transformer and\nvanilla looped transformer, the proposed AlgoFormer can perform efficiently in\nalgorithm representation in some specific tasks. 
In particular, inspired by the\nstructure of human-designed learning algorithms, our transformer framework\nconsists of a pre-transformer that is responsible for task preprocessing, a\nlooped transformer for iterative optimization algorithms, and a\npost-transformer for producing the desired results after post-processing. We\nprovide theoretical evidence of the expressive power of the AlgoFormer in\nsolving some challenging problems, mirroring human-designed algorithms.\nFurthermore, some theoretical and empirical results are presented to show that\nthe designed transformer has the potential to perform algorithm representation\nand learning. Experimental results demonstrate the empirical superiority of the\nproposed transformer in that it outperforms the standard transformer and\nvanilla looped transformer in some specific tasks. An extensive experiment on\nreal language tasks (e.g., neural machine translation of German and English,\nand text classification) further validates the expressiveness and effectiveness\nof AlgoFormer.\n","authors":["Yihang Gao","Chuanyang Zheng","Enze Xie","Han Shi","Tianyang Hu","Yu Li","Michael K. Ng","Zhenguo Li","Zhaoqiang Liu"],"pdf_url":"https://arxiv.org/pdf/2402.13572v2.pdf","comment":"Published at Transactions on Machine Learning Research (TMLR). The\n paper provides insight that the Transformer architectures can mimic the\n algorithm structures in (in-context) algorithm learning and representation.\n The incorporated algorithmic structure in Algoformer shows its potential in\n (deep learning for) scientific computing, besides the real language tasks"},{"id":"http://arxiv.org/abs/2501.05803v1","updated":"2025-01-10T09:10:30Z","published":"2025-01-10T09:10:30Z","title":"Alignment without Over-optimization: Training-Free Solution for\n Diffusion Models","summary":" Diffusion models excel in generative tasks, but aligning them with specific\nobjectives while maintaining their versatility remains challenging. 
Existing\nfine-tuning methods often suffer from reward over-optimization, while\napproximate guidance approaches fail to optimize target rewards effectively.\nAddressing these limitations, we propose a training-free sampling method based\non Sequential Monte Carlo (SMC) to sample from the reward-aligned target\ndistribution. Our approach, tailored for diffusion sampling and incorporating\ntempering techniques, achieves comparable or superior target rewards to\nfine-tuning methods while preserving diversity and cross-reward generalization.\nWe demonstrate its effectiveness in single-reward optimization, multi-objective\nscenarios, and online black-box optimization. This work offers a robust\nsolution for aligning diffusion models with diverse downstream objectives\nwithout compromising their general capabilities. Code is available at\nhttps://github.com/krafton-ai/DAS .\n","authors":["Sunwoo Kim","Minkyu Kim","Dongmin Park"],"pdf_url":"https://arxiv.org/pdf/2501.05803v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05795v1","updated":"2025-01-10T08:57:50Z","published":"2025-01-10T08:57:50Z","title":"Robust Counterfactual Explanations under Model Multiplicity Using\n Multi-Objective Optimization","summary":" In recent years, explainability in machine learning has gained importance. In\nthis context, counterfactual explanation (CE), which is an explanation method\nthat uses examples, has attracted attention. However, it has been pointed out\nthat CE is not robust when there are multiple machine-learning models. These\nproblems are important when using machine learning to make safe decisions. In\nthis paper, we propose robust CEs that introduce a new viewpoint - Pareto\nimprovement - and a method that uses multi-objective optimization to generate\nit. To evaluate the proposed method, we conducted experiments using both\nsimulated and actual data. The results demonstrate that the proposed method is\nrobust and useful. 
We believe that this research will contribute to a wide\nrange of research areas, such as explainability in machine learning,\ndecision-making, and action planning based on machine learning.\n","authors":["Keita Kinjo"],"pdf_url":"https://arxiv.org/pdf/2501.05795v1.pdf","comment":"19 pages"},{"id":"http://arxiv.org/abs/2501.05790v1","updated":"2025-01-10T08:50:38Z","published":"2025-01-10T08:50:38Z","title":"Understanding Impact of Human Feedback via Influence Functions","summary":" In Reinforcement Learning from Human Feedback (RLHF), it is crucial to learn\nsuitable reward models from human feedback to align large language models\n(LLMs) with human intentions. However, human feedback can often be noisy,\ninconsistent, or biased, especially when evaluating complex responses. Such\nfeedback can lead to misaligned reward signals, potentially causing unintended\nside effects during the RLHF process. To address these challenges, we explore\nthe use of influence functions to measure the impact of human feedback on the\nperformance of reward models. We propose a compute-efficient approximation\nmethod that enables the application of influence functions to LLM-based reward\nmodels and large-scale preference datasets. In our experiments, we demonstrate\ntwo key applications of influence functions: (1) detecting common forms of\nlabeler bias in human feedback datasets and (2) guiding labelers to refine\ntheir strategies to align more closely with expert feedback. By quantifying the\nimpact of human feedback on reward models, we believe that influence functions\ncan enhance feedback interpretability and contribute to scalable oversight in\nRLHF, helping labelers provide more accurate and consistent feedback. 
Source\ncode is available at https://github.com/mintaywon/IF_RLHF\n","authors":["Taywon Min","Haeone Lee","Hanho Ryu","Yongchan Kwon","Kimin Lee"],"pdf_url":"https://arxiv.org/pdf/2501.05790v1.pdf","comment":"Source code: https://github.com/mintaywon/IF_RLHF"},{"id":"http://arxiv.org/abs/2501.02564v2","updated":"2025-01-10T08:40:49Z","published":"2025-01-05T14:42:47Z","title":"Balanced Multi-view Clustering","summary":" Multi-view clustering (MvC) aims to integrate information from different\nviews to enhance the capability of the model in capturing the underlying data\nstructures. The widely used joint training paradigm in MvC is potentially not\nfully leverage the multi-view information, since the imbalanced and\nunder-optimized view-specific features caused by the uniform learning objective\nfor all views. For instance, particular views with more discriminative\ninformation could dominate the learning process in the joint training paradigm,\nleading to other views being under-optimized. To alleviate this issue, we first\nanalyze the imbalanced phenomenon in the joint-training paradigm of multi-view\nclustering from the perspective of gradient descent for each view-specific\nfeature extractor. Then, we propose a novel balanced multi-view clustering\n(BMvC) method, which introduces a view-specific contrastive regularization\n(VCR) to modulate the optimization of each view. Concretely, VCR preserves the\nsample similarities captured from the joint features and view-specific ones\ninto the clustering distributions corresponding to view-specific features to\nenhance the learning process of view-specific feature extractors. Additionally,\na theoretical analysis is provided to illustrate that VCR adaptively modulates\nthe magnitudes of gradients for updating the parameters of view-specific\nfeature extractors to achieve a balanced multi-view learning procedure. 
In such\na manner, BMvC achieves a better trade-off between the exploitation of\nview-specific patterns and the exploration of view-invariance patterns to fully\nlearn the multi-view information for the clustering task. Finally, a set of\nexperiments are conducted to verify the superiority of the proposed method\ncompared with state-of-the-art approaches both on eight benchmark MvC datasets\nand two spatially resolved transcriptomics datasets.\n","authors":["Zhenglai Li","Jun Wang","Chang Tang","Xinzhong Zhu","Wei Zhang","Xinwang Liu"],"pdf_url":"https://arxiv.org/pdf/2501.02564v2.pdf","comment":"We are withdrawing this paper due to issues in the experimental\n section related to the Application for Spatially Resolved Transcriptomics\n Data Clustering. These issues affect the validity of the results presented.\n We believe it is necessary to withdraw the paper to address these problems\n adequately before resubmission."},{"id":"http://arxiv.org/abs/2501.05783v1","updated":"2025-01-10T08:33:31Z","published":"2025-01-10T08:33:31Z","title":"UV-Attack: Physical-World Adversarial Attacks for Person Detection via\n Dynamic-NeRF-based UV Mapping","summary":" In recent research, adversarial attacks on person detectors using patches or\nstatic 3D model-based texture modifications have struggled with low success\nrates due to the flexible nature of human movement. Modeling the 3D\ndeformations caused by various actions has been a major challenge. Fortunately,\nadvancements in Neural Radiance Fields (NeRF) for dynamic human modeling offer\nnew possibilities. In this paper, we introduce UV-Attack, a groundbreaking\napproach that achieves high success rates even with extensive and unseen human\nactions. We address the challenge above by leveraging dynamic-NeRF-based UV\nmapping. UV-Attack can generate human images across diverse actions and\nviewpoints, and even create novel actions by sampling from the SMPL parameter\nspace. 
While dynamic NeRF models are capable of modeling human bodies,\nmodifying clothing textures is challenging because they are embedded in neural\nnetwork parameters. To tackle this, UV-Attack generates UV maps instead of RGB\nimages and modifies the texture stacks. This approach enables real-time texture\nedits and makes the attack more practical. We also propose a novel Expectation\nover Pose Transformation loss (EoPT) to improve the evasion success rate on\nunseen poses and views. Our experiments show that UV-Attack achieves a 92.75%\nattack success rate against the FastRCNN model across varied poses in dynamic\nvideo settings, significantly outperforming the state-of-the-art AdvCamou\nattack, which only had a 28.50% ASR. Moreover, we achieve 49.5% ASR on the\nlatest YOLOv8 detector in black-box settings. This work highlights the\npotential of dynamic NeRF-based UV mapping for creating more effective\nadversarial attacks on person detectors, addressing key challenges in modeling\nhuman movement and texture modification.\n","authors":["Yanjie Li","Wenxuan Zhang","Kaisheng Liang","Bin Xiao"],"pdf_url":"https://arxiv.org/pdf/2501.05783v1.pdf","comment":"23 pages, 22 figures, submitted to ICLR2025"},{"id":"http://arxiv.org/abs/2311.02565v2","updated":"2025-01-10T08:01:09Z","published":"2023-11-05T04:43:48Z","title":"KITS: Inductive Spatio-Temporal Kriging with Increment Training Strategy","summary":" Sensors are commonly deployed to perceive the environment. However, due to\nthe high cost, sensors are usually sparsely deployed. Kriging is the tailored\ntask to infer the unobserved nodes (without sensors) using the observed source\nnodes (with sensors). The essence of kriging task is transferability. 
Recently,\nseveral inductive spatio-temporal kriging methods have been proposed based on\ngraph neural networks, being trained based on a graph built on top of observed\nnodes via pretext tasks such as masking nodes out and reconstructing them.\nHowever, the graph in training is inevitably much sparser than the graph in\ninference that includes all the observed and unobserved nodes. The learned\npattern cannot be well generalized for inference, denoted as graph gap. To\naddress this issue, we first present a novel Increment training strategy:\ninstead of masking nodes (and reconstructing them), we add virtual nodes into\nthe training graph so as to mitigate the graph gap issue naturally.\nNevertheless, the empty-shell virtual nodes without labels could have\nbad-learned features and lack supervision signals. To solve these issues, we\npair each virtual node with its most similar observed node and fuse their\nfeatures together; to enhance the supervision signal, we construct reliable\npseudo labels for virtual nodes. As a result, the learned pattern of virtual\nnodes could be safely transferred to real unobserved nodes for reliable\nkriging. We name our new Kriging model with Increment Training Strategy as\nKITS. Extensive experiments demonstrate that KITS consistently outperforms\nexisting kriging methods by large margins, e.g., the improvement over MAE score\ncould be as high as 18.33%.\n","authors":["Qianxiong Xu","Cheng Long","Ziyue Li","Sijie Ruan","Rui Zhao","Zhishuai Li"],"pdf_url":"https://arxiv.org/pdf/2311.02565v2.pdf","comment":"This paper is accepted by AAAI'25"},{"id":"http://arxiv.org/abs/2501.05768v1","updated":"2025-01-10T07:56:30Z","published":"2025-01-10T07:56:30Z","title":"Halal or Not: Knowledge Graph Completion for Predicting Cultural\n Appropriateness of Daily Products","summary":" The growing demand for halal cosmetic products has exposed significant\nchallenges, especially in Muslim-majority countries. 
Recently, various machine\nlearning-based strategies, e.g., image-based methods, have shown remarkable\nsuccess in predicting the halal status of cosmetics. However, these methods\nmainly focus on analyzing the discrete and specific ingredients within separate\ncosmetics, which ignore the high-order and complex relations between cosmetics\nand ingredients. To address this problem, we propose a halal cosmetic\nrecommendation framework, namely HaCKG, that leverages a knowledge graph of\ncosmetics and their ingredients to explicitly model and capture the\nrelationships between cosmetics and their components. By representing cosmetics\nand ingredients as entities within the knowledge graph, HaCKG effectively\nlearns the high-order and complex relations between entities, offering a robust\nmethod for predicting halal status. Specifically, we first construct a cosmetic\nknowledge graph representing the relations between various cosmetics,\ningredients, and their properties. We then propose a pre-trained relational\ngraph attention network model with residual connections to learn the structural\nrelation between entities in the knowledge graph. The pre-trained model is then\nfine-tuned on downstream cosmetic data to predict halal status. 
Extensive\nexperiments on the cosmetic dataset over halal prediction tasks demonstrate the\nsuperiority of our model over state-of-the-art baselines.\n","authors":["Van Thuy Hoang","Tien-Bach-Thanh Do","Jinho Seo","Seung Charlie Kim","Luong Vuong Nguyen","Duong Nguyen Minh Huy","Hyeon-Ju Jeon","O-Joun Lee"],"pdf_url":"https://arxiv.org/pdf/2501.05768v1.pdf","comment":"10 pages"},{"id":"http://arxiv.org/abs/2501.05767v1","updated":"2025-01-10T07:56:23Z","published":"2025-01-10T07:56:23Z","title":"Migician: Revealing the Magic of Free-Form Multi-Image Grounding in\n Multimodal Large Language Models","summary":" The recent advancement of Multimodal Large Language Models (MLLMs) has\nsignificantly improved their fine-grained perception of single images and\ngeneral comprehension across multiple images. However, existing MLLMs still\nface challenges in achieving precise grounding in complex multi-image\nscenarios. To address this, we first explore a Chain-of-Thought (CoT) framework\nthat integrates single-image grounding with multi-image comprehension. While\npartially effective, it remains unstable and struggles to capture abstract\nvisual information due to its non-end-to-end nature. Therefore, we introduce\nMigician, the first multi-image grounding model capable of performing free-form\nand accurate grounding across multiple images. To support this, we present the\nMGrounding-630k dataset, which comprises data for several multi-image grounding\ntasks derived from existing datasets, along with newly generated free-form\ngrounding instruction-following data. Furthermore, we propose MIG-Bench, a\ncomprehensive benchmark specifically designed for evaluating multi-image\ngrounding capabilities. Experimental results demonstrate that our model\nachieves significantly superior multi-image grounding capabilities,\noutperforming the best existing MLLMs by 21.61% and even surpassing much larger\n70B models. 
Our code, model, dataset, and benchmark are fully open-sourced.\n","authors":["You Li","Heyu Huang","Chi Chen","Kaiyu Huang","Chao Huang","Zonghao Guo","Zhiyuan Liu","Jinan Xu","Yuhua Li","Ruixuan Li","Maosong Sun"],"pdf_url":"https://arxiv.org/pdf/2501.05767v1.pdf","comment":"20 pages, 8 figures"},{"id":"http://arxiv.org/abs/2501.05765v1","updated":"2025-01-10T07:48:40Z","published":"2025-01-10T07:48:40Z","title":"Deontic Temporal Logic for Formal Verification of AI Ethics","summary":" Ensuring ethical behavior in Artificial Intelligence (AI) systems amidst\ntheir increasing ubiquity and influence is a major concern the world over. The\nuse of formal methods in AI ethics is a possible crucial approach for\nspecifying and verifying the ethical behavior of AI systems. This paper\nproposes a formalization based on deontic logic to define and evaluate the\nethical behavior of AI systems, focusing on system-level specifications,\ncontributing to this important goal. It introduces axioms and theorems to\ncapture ethical requirements related to fairness and explainability. The\nformalization incorporates temporal operators to reason about the ethical\nbehavior of AI systems over time. The authors evaluate the effectiveness of\nthis formalization by assessing the ethics of the real-world COMPAS and loan\nprediction AI systems. Various ethical properties of the COMPAS and loan\nprediction systems are encoded using deontic logical formulas, allowing the use\nof an automated theorem prover to verify whether these systems satisfy the\ndefined properties. The formal verification reveals that both systems fail to\nfulfill certain key ethical properties related to fairness and\nnon-discrimination, demonstrating the effectiveness of the proposed\nformalization in identifying potential ethical issues in real-world AI\napplications.\n","authors":["Priya T. 
V.","Shrisha Rao"],"pdf_url":"https://arxiv.org/pdf/2501.05765v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.18144v3","updated":"2025-01-10T07:22:12Z","published":"2024-05-28T13:02:56Z","title":"4-bit Shampoo for Memory-Efficient Network Training","summary":" Second-order optimizers, maintaining a matrix termed a preconditioner, are\nsuperior to first-order optimizers in both theory and practice. The states\nforming the preconditioner and its inverse root restrict the maximum size of\nmodels trained by second-order optimizers. To address this, compressing 32-bit\noptimizer states to lower bitwidths has shown promise in reducing memory usage.\nHowever, current approaches only pertain to first-order optimizers. In this\npaper, we propose the first 4-bit second-order optimizers, exemplified by 4-bit\nShampoo, maintaining performance similar to that of 32-bit ones. We show that\nquantizing the eigenvector matrix of the preconditioner in 4-bit Shampoo is\nremarkably better than quantizing the preconditioner itself both theoretically\nand experimentally. By rectifying the orthogonality of the quantized\neigenvector matrix, we enhance the approximation of the preconditioner's\neigenvector matrix, which also benefits the computation of its inverse 4-th\nroot. 
Besides, we find that linear square quantization slightly outperforms\ndynamic tree quantization when quantizing second-order optimizer states.\nEvaluation on various networks for image classification and natural language\nmodeling demonstrates that our 4-bit Shampoo achieves comparable performance to\nits 32-bit counterpart while being more memory-efficient.\n","authors":["Sike Wang","Pan Zhou","Jia Li","Hua Huang"],"pdf_url":"https://arxiv.org/pdf/2405.18144v3.pdf","comment":"NeurIPS 2024 final camera-ready revisions, rectify the legend in\n figure 9"},{"id":"http://arxiv.org/abs/2501.05752v1","updated":"2025-01-10T07:02:43Z","published":"2025-01-10T07:02:43Z","title":"Semantic Exploration with Adaptive Gating for Efficient Problem Solving\n with Language Models","summary":" Recent advancements in large language models (LLMs) have shown remarkable\npotential in various complex tasks requiring multi-step reasoning methods like\ntree search to explore diverse reasoning paths. However, existing methods often\nsuffer from computational inefficiency and redundancy. First, they overlook the\ndiversity of task difficulties, leading to unnecessarily extensive searches\neven for easy tasks. Second, they neglect the semantics of reasoning paths,\nresulting in redundant exploration of semantically identical paths. To address\nthese limitations, we propose Semantic Exploration with Adaptive Gating (SEAG),\na computationally efficient method. SEAG employs an adaptive gating mechanism\nthat dynamically decides whether to conduct a tree search, based on the\nconfidence level of answers from a preceding simple reasoning method.\nFurthermore, its tree-based exploration consolidates semantically identical\nreasoning steps, reducing redundant explorations while maintaining or even\nimproving accuracy. 
Our extensive experiments demonstrate that SEAG\nsignificantly improves accuracy by 4.3% on average while requiring only 31% of\ncomputational costs compared to existing tree search-based methods on complex\nreasoning benchmarks including GSM8K and ARC with diverse language models such\nas Llama2, Llama3, and Mistral.\n","authors":["Sungjae Lee","Hyejin Park","Jaechang Kim","Jungseul Ok"],"pdf_url":"https://arxiv.org/pdf/2501.05752v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05730v1","updated":"2025-01-10T05:54:04Z","published":"2025-01-10T05:54:04Z","title":"Element-wise Attention Is All You Need","summary":" The self-attention (SA) mechanism has demonstrated superior performance\nacross various domains, yet it suffers from substantial complexity during both\ntraining and inference. The next-generation architecture, aiming at retaining\nthe competitive performance of SA while achieving low-cost inference and\nefficient long-sequence training, primarily focuses on three approaches: linear\nattention, linear RNNs, and state space models. Although these approaches\nachieve reduced complexity than SA, they all have built-in performance\ndegradation factors, such as diminished “spikiness” and compression of\nhistorical information. In contrast to these approaches, we propose a novel\nelement-wise attention mechanism, which uses the element-wise squared Euclidean\ndistance, instead of the dot product operation, to compute similarity and\napproximates the quadratic complexity term $\\exp(q_{ic}k_{jc})$ with a Taylor\npolynomial. 
This design achieves remarkable efficiency: during training, the\nelement-wise attention has a complexity of $\\mathcal{O}(tLD)$, making\nlong-sequence training both computationally and memory efficient, where $L$ is\nthe sequence length, $D$ is the feature dimension, and $t$ is the highest order\nof the polynomial; during inference, it can be reformulated as recurrent neural\nnetworks, achieving an inference complexity of $\\mathcal{O}(tD)$. Furthermore,\nthe element-wise attention circumvents the performance degradation factors\npresent in these approaches and achieves performance comparable to SA in both\ncausal and non-causal forms.\n","authors":["Guoxin Feng"],"pdf_url":"https://arxiv.org/pdf/2501.05730v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05729v1","updated":"2025-01-10T05:53:37Z","published":"2025-01-10T05:53:37Z","title":"ExPO: Explainable Phonetic Trait-Oriented Network for Speaker\n Verification","summary":" In speaker verification, we use a computational method to verify if an\nutterance matches the identity of an enrolled speaker. This task is similar to\nthe manual task of forensic voice comparison, where linguistic analysis is\ncombined with auditory measurements to compare and evaluate voice samples.\nDespite much success, we have yet to develop a speaker verification system that\noffers explainable results comparable to those from manual forensic voice\ncomparison. A novel approach, Explainable Phonetic Trait-Oriented (ExPO)\nnetwork, is proposed in this paper to introduce the speaker's phonetic trait\nwhich describes the speaker's characteristics at the phonetic level, resembling\nwhat forensic comparison does. 
ExPO not only generates utterance-level speaker\nembeddings but also allows for fine-grained analysis and visualization of\nphonetic traits, offering an explainable speaker verification process.\nFurthermore, we investigate phonetic traits from within-speaker and\nbetween-speaker variation perspectives to determine which trait is most\neffective for speaker verification, marking an important step towards\nexplainable speaker verification. Our code is available at\nhttps://github.com/mmmmayi/ExPO.\n","authors":["Yi Ma","Shuai Wang","Tianchi Liu","Haizhou Li"],"pdf_url":"https://arxiv.org/pdf/2501.05729v1.pdf","comment":"Accepted by IEEE Signal Processing Letters"},{"id":"http://arxiv.org/abs/2501.05727v1","updated":"2025-01-10T05:51:52Z","published":"2025-01-10T05:51:52Z","title":"Enabling Scalable Oversight via Self-Evolving Critic","summary":" Despite their remarkable performance, the development of Large Language\nModels (LLMs) faces a critical challenge in scalable oversight: providing\neffective feedback for tasks where human evaluation is difficult or where LLMs\noutperform humans. While there is growing interest in using LLMs for critique,\ncurrent approaches still rely on human annotations or more powerful models,\nleaving the issue of enhancing critique capabilities without external\nsupervision unresolved. We introduce SCRIT (Self-evolving CRITic), a framework\nthat enables genuine self-evolution of critique abilities. Technically, SCRIT\nself-improves by training on synthetic data, generated by a contrastive-based\nself-critic that uses reference solutions for step-by-step critique, and a\nself-validation mechanism that ensures critique quality through correction\noutcomes. Implemented with Qwen2.5-72B-Instruct, one of the most powerful LLMs,\nSCRIT achieves up to a 10.3\\% improvement on critique-correction and error\nidentification benchmarks. 
Our analysis reveals that SCRIT's performance scales\npositively with data and model size, outperforms alternative approaches, and\nbenefits critically from its self-validation component.\n","authors":["Zhengyang Tang","Ziniu Li","Zhenyang Xiao","Tian Ding","Ruoyu Sun","Benyou Wang","Dayiheng Liu","Fei Huang","Tianyu Liu","Bowen Yu","Junyang Lin"],"pdf_url":"https://arxiv.org/pdf/2501.05727v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.00778v3","updated":"2025-01-10T05:35:58Z","published":"2024-06-02T15:35:45Z","title":"Bayesian Joint Additive Factor Models for Multiview Learning","summary":" It is increasingly common in a wide variety of applied settings to collect\ndata of multiple different types on the same set of samples. Our particular\nfocus in this article is on studying relationships between such multiview\nfeatures and responses. A motivating application arises in the context of\nprecision medicine where multi-omics data are collected to correlate with\nclinical outcomes. It is of interest to infer dependence within and across\nviews while combining multimodal information to improve the prediction of\noutcomes. The signal-to-noise ratio can vary substantially across views,\nmotivating more nuanced statistical tools beyond standard late and early\nfusion. This challenge comes with the need to preserve interpretability, select\nfeatures, and obtain accurate uncertainty quantification. We propose a joint\nadditive factor regression model (JAFAR) with a structured additive design,\naccounting for shared and view-specific components. We ensure identifiability\nvia a novel dependent cumulative shrinkage process (D-CUSP) prior. We provide\nan efficient implementation via a partially collapsed Gibbs sampler and extend\nour approach to allow flexible feature and outcome distributions. Prediction of\ntime-to-labor onset from immunome, metabolome, and proteome data illustrates\nperformance gains against state-of-the-art competitors. 
Our open-source\nsoftware (R package) is available at https://github.com/niccoloanceschi/jafar.\n","authors":["Niccolo Anceschi","Federico Ferrari","David B. Dunson","Himel Mallick"],"pdf_url":"https://arxiv.org/pdf/2406.00778v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.01350v2","updated":"2025-01-10T05:35:32Z","published":"2024-10-02T09:07:33Z","title":"Takin-VC: Expressive Zero-Shot Voice Conversion via Adaptive Hybrid\n Content Encoding and Enhanced Timbre Modeling","summary":" Expressive zero-shot voice conversion (VC) is a critical and challenging task\nthat aims to transform the source timbre into an arbitrary unseen speaker while\npreserving the original content and expressive qualities. Despite recent\nprogress in zero-shot VC, there remains considerable potential for improvements\nin speaker similarity and speech naturalness. Moreover, existing zero-shot VC\nsystems struggle to fully reproduce paralinguistic information in highly\nexpressive speech, such as breathing, crying, and emotional nuances, limiting\ntheir practical applicability. To address these issues, we propose Takin-VC, a\nnovel expressive zero-shot VC framework via adaptive hybrid content encoding\nand memory-augmented context-aware timbre modeling. Specifically, we introduce\nan innovative hybrid content encoder that incorporates an adaptive fusion\nmodule, capable of effectively integrating quantized features of the\npre-trained WavLM and HybridFormer in an implicit manner, so as to extract\nprecise linguistic features while enriching paralinguistic elements. For timbre\nmodeling, we propose advanced memory-augmented and context-aware modules to\ngenerate high-quality target timbre features and fused representations that\nseamlessly align source content with target timbre. To enhance real-time\nperformance, we advocate a conditional flow matching model to reconstruct the\nMel-spectrogram of the source speech. 
Experimental results show that our\nTakin-VC consistently surpasses state-of-the-art VC systems, achieving notable\nimprovements in terms of speech naturalness, speech expressiveness, and speaker\nsimilarity, while offering enhanced inference speed.\n","authors":["Yuguang Yang","Yu Pan","Jixun Yao","Xiang Zhang","Jianhao Ye","Hongbin Zhou","Lei Xie","Lei Ma","Jianjun Zhao"],"pdf_url":"https://arxiv.org/pdf/2410.01350v2.pdf","comment":"Work in Progress; Under Review"},{"id":"http://arxiv.org/abs/2308.15720v2","updated":"2025-01-10T05:32:06Z","published":"2023-08-30T02:50:54Z","title":"Surrogate-based Autotuning for Randomized Sketching Algorithms in\n Regression Problems","summary":" Algorithms from Randomized Numerical Linear Algebra (RandNLA) are known to be\neffective in handling high-dimensional computational problems, providing\nhigh-quality empirical performance as well as strong probabilistic guarantees.\nHowever, their practical application is complicated by the fact that the user\nneeds to set various algorithm-specific tuning parameters which are different\nthan those used in traditional NLA. This paper demonstrates how a\nsurrogate-based autotuning approach can be used to address fundamental problems\nof parameter selection in RandNLA algorithms. In particular, we provide a\ndetailed investigation of surrogate-based autotuning for\nsketch-and-precondition (SAP) based randomized least squares methods, which\nhave been one of the great success stories in modern RandNLA. Empirical results\nshow that our surrogate-based autotuning approach can achieve near-optimal\nperformance with much less tuning cost than a random search (up to about 4x\nfewer trials of different parameter configurations). Moreover, while our\nexperiments focus on least squares, our results demonstrate a general-purpose\nautotuning pipeline applicable to any kind of RandNLA algorithm.\n","authors":["Younghyun Cho","James W. Demmel","Michał Dereziński","Haoyun Li","Hengrui Luo","Michael W. 
Mahoney","Riley J. Murray"],"pdf_url":"https://arxiv.org/pdf/2308.15720v2.pdf","comment":"Improved the presentation and clarity. Updated experimental results\n and scenarios. Accepted for publication in SIAM Journal on Matrix Analysis\n and Applications"},{"id":"http://arxiv.org/abs/2501.05717v1","updated":"2025-01-10T05:29:09Z","published":"2025-01-10T05:29:09Z","title":"Zero-shot Shark Tracking and Biometrics from Aerial Imagery","summary":" The recent widespread adoption of drones for studying marine animals provides\nopportunities for deriving biological information from aerial imagery. The\nlarge scale of imagery data acquired from drones is well suited for machine\nlearning (ML) analysis. Development of ML models for analyzing marine animal\naerial imagery has followed the classical paradigm of training, testing, and\ndeploying a new model for each dataset, requiring significant time, human\neffort, and ML expertise. We introduce Frame Level ALIgment and tRacking\n(FLAIR), which leverages the video understanding of Segment Anything Model 2\n(SAM2) and the vision-language capabilities of Contrastive Language-Image\nPre-training (CLIP). FLAIR takes a drone video as input and outputs\nsegmentation masks of the species of interest across the video. Notably, FLAIR\nleverages a zero-shot approach, eliminating the need for labeled data, training\na new model, or fine-tuning an existing model to generalize to other species.\nWith a dataset of 18,000 drone images of Pacific nurse sharks, we trained\nstate-of-the-art object detection models to compare against FLAIR. We show that\nFLAIR massively outperforms these object detectors and performs competitively\nagainst two human-in-the-loop methods for prompting SAM2, achieving a Dice\nscore of 0.81. FLAIR readily generalizes to other shark species without\nadditional human effort and can be combined with novel heuristics to\nautomatically extract relevant information including length and tailbeat\nfrequency. 
FLAIR has significant potential to accelerate aerial imagery\nanalysis workflows, requiring markedly less human effort and expertise than\ntraditional machine learning workflows, while achieving superior accuracy. By\nreducing the effort required for aerial imagery analysis, FLAIR allows\nscientists to spend more time interpreting results and deriving insights about\nmarine ecosystems.\n","authors":["Chinmay K Lalgudi","Mark E Leone","Jaden V Clark","Sergio Madrigal-Mora","Mario Espinoza"],"pdf_url":"https://arxiv.org/pdf/2501.05717v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.04974v2","updated":"2025-01-10T05:15:34Z","published":"2025-01-09T05:06:44Z","title":"SensorQA: A Question Answering Benchmark for Daily-Life Monitoring","summary":" With the rapid growth in sensor data, effectively interpreting and\ninterfacing with these data in a human-understandable way has become crucial.\nWhile existing research primarily focuses on learning classification models,\nfewer studies have explored how end users can actively extract useful insights\nfrom sensor data, often hindered by the lack of a proper dataset. To address\nthis gap, we introduce SensorQA, the first human-created question-answering\n(QA) dataset for long-term time-series sensor data for daily life monitoring.\nSensorQA is created by human workers and includes 5.6K diverse and practical\nqueries that reflect genuine human interests, paired with accurate answers\nderived from sensor data. We further establish benchmarks for state-of-the-art\nAI models on this dataset and evaluate their performance on typical edge\ndevices. Our results reveal a gap between current models and optimal QA\nperformance and efficiency, highlighting the need for new contributions. 
The\ndataset and code are available at:\n\\url{https://github.com/benjamin-reichman/SensorQA}.\n","authors":["Benjamin Reichman","Xiaofan Yu","Lanxiang Hu","Jack Truxal","Atishay Jain","Rushil Chandrupatla","Tajana Šimunić Rosing","Larry Heck"],"pdf_url":"https://arxiv.org/pdf/2501.04974v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05714v1","updated":"2025-01-10T05:15:14Z","published":"2025-01-10T05:15:14Z","title":"How to Enable Effective Cooperation Between Humans and NLP Models: A\n Survey of Principles, Formalizations, and Beyond","summary":" With the advancement of large language models (LLMs), intelligent models have\nevolved from mere tools to autonomous agents with their own goals and\nstrategies for cooperating with humans. This evolution has birthed a novel\nparadigm in NLP, i.e., human-model cooperation, that has yielded remarkable\nprogress in numerous NLP tasks in recent years. In this paper, we take the\nfirst step to present a thorough review of human-model cooperation, exploring\nits principles, formalizations, and open challenges. In particular, we\nintroduce a new taxonomy that provides a unified perspective to summarize\nexisting approaches. Also, we discuss potential frontier areas and their\ncorresponding challenges. We regard our work as an entry point, paving the way\nfor more breakthrough research in this regard.\n","authors":["Chen Huang","Yang Deng","Wenqiang Lei","Jiancheng Lv","Tat-Seng Chua","Jimmy Xiangji Huang"],"pdf_url":"https://arxiv.org/pdf/2501.05714v1.pdf","comment":"23 pages"},{"id":"http://arxiv.org/abs/2501.05707v1","updated":"2025-01-10T04:35:46Z","published":"2025-01-10T04:35:46Z","title":"Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains","summary":" Large language models (LLMs) have achieved remarkable performance in recent\nyears but are fundamentally limited by the underlying training data. 
To improve\nmodels beyond the training data, recent works have explored how LLMs can be\nused to generate synthetic data for autonomous self-improvement. However,\nsuccessive steps of self-improvement can reach a point of diminishing returns.\nIn this work, we propose a complementary approach towards self-improvement\nwhere finetuning is applied to a multiagent society of language models. A group\nof language models, all starting from the same base model, are independently\nspecialized by updating each one using data generated through multiagent\ninteractions among the models. By training each model on independent sets of\ndata, we illustrate how this approach enables specialization across models and\ndiversification over the set of models. As a result, our overall system is able\nto preserve diverse reasoning chains and autonomously improve over many more\nrounds of fine-tuning than single-agent self-improvement methods. We\nquantitatively illustrate the efficacy of the approach across a wide suite of\nreasoning tasks.\n","authors":["Vighnesh Subramaniam","Yilun Du","Joshua B. Tenenbaum","Antonio Torralba","Shuang Li","Igor Mordatch"],"pdf_url":"https://arxiv.org/pdf/2501.05707v1.pdf","comment":"22 pages, 13 figures, 7 tables; Project page at\n https://llm-multiagent-ft.github.io/"},{"id":"http://arxiv.org/abs/2408.01933v3","updated":"2025-01-10T04:09:43Z","published":"2024-08-04T05:15:02Z","title":"DiReCT: Diagnostic Reasoning for Clinical Notes via Large Language\n Models","summary":" Large language models (LLMs) have recently showcased remarkable capabilities,\nspanning a wide range of tasks and applications, including those in the medical\ndomain. Models like GPT-4 excel in medical question answering but may face\nchallenges in the lack of interpretability when handling complex tasks in real\nclinical settings. 
We thus introduce the diagnostic reasoning dataset for\nclinical notes (DiReCT), aiming at evaluating the reasoning ability and\ninterpretability of LLMs compared to human doctors. It contains 511 clinical\nnotes, each meticulously annotated by physicians, detailing the diagnostic\nreasoning process from observations in a clinical note to the final diagnosis.\nAdditionally, a diagnostic knowledge graph is provided to offer essential\nknowledge for reasoning, which may not be covered in the training data of\nexisting LLMs. Evaluations of leading LLMs on DiReCT bring out a significant\ngap between their reasoning ability and that of human doctors, highlighting the\ncritical need for models that can reason effectively in real-world clinical\nscenarios.\n","authors":["Bowen Wang","Jiuyang Chang","Yiming Qian","Guoxin Chen","Junhao Chen","Zhouqiang Jiang","Jiahao Zhang","Yuta Nakashima","Hajime Nagahara"],"pdf_url":"https://arxiv.org/pdf/2408.01933v3.pdf","comment":"9 pages,6 figures"},{"id":"http://arxiv.org/abs/2411.12924v2","updated":"2025-01-10T03:55:57Z","published":"2024-11-19T23:22:33Z","title":"Human-In-the-Loop Software Development Agents","summary":" Recently, Large Language Models (LLMs)-based multi-agent paradigms for\nsoftware engineering are introduced to automatically resolve software\ndevelopment tasks (e.g., from a given issue to source code). However, existing\nwork is evaluated based on historical benchmark datasets, rarely considers\nhuman feedback at each stage of the automated software development process, and\nhas not been deployed in practice. In this paper, we introduce a\nHuman-in-the-loop LLM-based Agents framework (HULA) for software development\nthat allows software engineers to refine and guide LLMs when generating coding\nplans and source code for a given task. We design, implement, and deploy the\nHULA framework into Atlassian JIRA for internal uses. 
Through a multi-stage\nevaluation of the HULA framework, Atlassian software engineers perceive that\nHULA can minimize the overall development time and effort, especially in\ninitiating a coding plan and writing code for straightforward tasks. On the\nother hand, challenges around code quality remain a concern in some cases. We\ndraw lessons learned and discuss opportunities for future work, which will pave\nthe way for the advancement of LLM-based agents in software development.\n","authors":["Wannita Takerngsaksiri","Jirat Pasuksmit","Patanamon Thongtanunam","Chakkrit Tantithamthavorn","Ruixiong Zhang","Fan Jiang","Jing Li","Evan Cook","Kun Chen","Ming Wu"],"pdf_url":"https://arxiv.org/pdf/2411.12924v2.pdf","comment":"10 pages, 9 figures, ICSE SEIP 2025"},{"id":"http://arxiv.org/abs/2409.01148v3","updated":"2025-01-10T03:28:00Z","published":"2024-09-02T10:33:45Z","title":"FMRFT: Fusion Mamba and DETR for Query Time Sequence Intersection Fish\n Tracking","summary":" Early detection of abnormal fish behavior caused by disease or hunger can be\nachieved through fish tracking using deep learning techniques, which holds\nsignificant value for industrial aquaculture. However, underwater reflections\nand fish-specific factors, such as high visual similarity, rapid swimming\ncaused by stimuli, and mutual occlusion, bring challenges to multi-target\ntracking of fish. To address these challenges, this paper establishes a complex\nmulti-scenario sturgeon tracking dataset and introduces the FMRFT model, a\nreal-time end-to-end fish tracking solution. The model incorporates the low\nvideo memory consumption Mamba In Mamba (MIM) architecture, which facilitates\nmulti-frame temporal memory and feature extraction, thereby addressing the\nchallenge of tracking multiple fish across frames. 
Additionally, the FMRFT model\nwith the Query Time Sequence Intersection (QTSI) module effectively manages\noccluded objects and reduces redundant tracking frames using the superior\nfeature interaction and prior frame processing capabilities of RT-DETR. This\ncombination significantly enhances the accuracy and stability of fish tracking.\nTrained and tested on the dataset, the model achieves an IDF1 score of 90.3%\nand a MOTA accuracy of 94.3%. Experimental results show that the proposed FMRFT\nmodel effectively addresses the challenges of high similarity and mutual\nocclusion in fish populations, enabling accurate tracking in factory farming\nenvironments.\n","authors":["Mingyuan Yao","Yukang Huo","Qingbin Tian","Jiayin Zhao","Xiao Liu","Ruifeng Wang","Lin Xue","Haihua Wang"],"pdf_url":"https://arxiv.org/pdf/2409.01148v3.pdf","comment":"14 pages,14 figures"},{"id":"http://arxiv.org/abs/2501.05680v1","updated":"2025-01-10T03:07:28Z","published":"2025-01-10T03:07:28Z","title":"EXION: Exploiting Inter- and Intra-Iteration Output Sparsity for\n Diffusion Models","summary":" Over the past few years, diffusion models have emerged as novel AI solutions,\ngenerating diverse multi-modal outputs from text prompts. Despite their\ncapabilities, they face challenges in computing, such as excessive latency and\nenergy consumption due to their iterative architecture. Although prior works\nspecialized in transformer acceleration can be applied, the iterative nature of\ndiffusion models remains unresolved. In this paper, we present EXION, the first\nSW-HW co-designed diffusion accelerator that solves the computation challenges\nby exploiting the unique inter- and intra-iteration output sparsity in\ndiffusion models. To this end, we propose two SW-level optimizations. First, we\nintroduce the FFN-Reuse algorithm that identifies and skips redundant\ncomputations in FFN layers across different iterations (inter-iteration\nsparsity). 
Second, we use a modified eager prediction method that employs\ntwo-step leading-one detection to accurately predict the attention score,\nskipping unnecessary computations within an iteration (intra-iteration\nsparsity). We also introduce a novel data compaction mechanism named ConMerge,\nwhich can enhance HW utilization by condensing and merging sparse matrices into\ncompact forms. Finally, it has a dedicated HW architecture that supports the\nabove sparsity-inducing algorithms, translating high output sparsity into\nimproved energy efficiency and performance. To verify the feasibility of the\nEXION, we first demonstrate that it has no impact on accuracy in various types\nof multi-modal diffusion models. We then instantiate EXION in both server- and\nedge-level settings and compare its performance against GPUs with similar\nspecifications. Our evaluation shows that EXION achieves dramatic improvements\nin performance and energy efficiency by 3.2-379.3x and 45.1-3067.6x compared to\na server GPU and by 42.6-1090.9x and 196.9-4668.2x compared to an edge GPU.\n","authors":["Jaehoon Heo","Adiwena Putra","Jieon Yoon","Sungwoong Yune","Hangyeol Lee","Ji-Hoon Kim","Joo-Young Kim"],"pdf_url":"https://arxiv.org/pdf/2501.05680v1.pdf","comment":"To appear in 2025 IEEE International Symposium on High-Performance\n Computer Architecture (HPCA 2025)"},{"id":"http://arxiv.org/abs/2501.05675v1","updated":"2025-01-10T02:57:08Z","published":"2025-01-10T02:57:08Z","title":"Facilitate Collaboration between Large Language Model and Task-specific\n Model for Time Series Anomaly Detection","summary":" In anomaly detection, methods based on large language models (LLMs) can\nincorporate expert knowledge, while task-specific smaller models excel at\nextracting normal patterns and detecting value fluctuations. 
Inspired by the\nhuman nervous system, where the brain stores expert knowledge and the\nperipheral nervous system and spinal cord handle specific tasks like withdrawal\nand knee-jerk reflexes, we propose CoLLaTe, a framework designed to facilitate\ncollaboration between LLMs and task-specific models, leveraging the strengths\nof both.\n In this work, we first formulate the collaboration process and identify two\nkey challenges in the collaboration between LLMs and task-specific models: (1)\nthe misalignment between the expression domains of LLMs and smaller models, and\n(2) error accumulation arising from the predictions of both models.\n To address these challenges, we introduce two key components in CoLLaTe: the\nalignment module and the collaborative loss function. Through theoretical\nanalysis and experimental validation, we demonstrate that these components\neffectively mitigate the identified challenges and achieve better performance\nthan LLM-based methods and task-specific smaller models.\n","authors":["Feiyi Chen","Leilei Zhang","Guansong Pang","Roger Zimmermann","Shuiguang Deng"],"pdf_url":"https://arxiv.org/pdf/2501.05675v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05673v1","updated":"2025-01-10T02:51:58Z","published":"2025-01-10T02:51:58Z","title":"Network Diffuser for Placing-Scheduling Service Function Chains with\n Inverse Demonstration","summary":" Network services are increasingly managed by considering chained-up virtual\nnetwork functions and relevant traffic flows, known as the Service Function\nChains (SFCs). To deal with sequential arrivals of SFCs in an online fashion,\nwe must consider two closely-coupled problems - an SFC placement problem that\nmaps SFCs to servers/links in the network and an SFC scheduling problem that\ndetermines when each SFC is executed. Solving the whole SFC problem targeting\nthese two optimizations jointly is extremely challenging. 
In this paper, we\npropose a novel network diffuser using conditional generative modeling for this\nSFC placing-scheduling optimization. Recent advances in generative AI and\ndiffusion models have made it possible to generate high-quality images/videos\nand decision trajectories from language description. We formulate the SFC\noptimization as a problem of generating a state sequence for planning and\nperform graph diffusion on the state trajectories to enable extraction of SFC\ndecisions, with SFC optimization constraints and objectives as conditions. To\naddress the lack of demonstration data due to NP-hardness and exponential\nproblem space of the SFC optimization, we also propose a novel and somewhat\nmaverick approach -- Rather than solving instances of this difficult\noptimization, we start with randomly-generated solutions as input, and then\ndetermine appropriate SFC optimization problems that render these solutions\nfeasible. This inverse demonstration enables us to obtain sufficient expert\ndemonstrations, i.e., problem-solution pairs, through further optimization. In\nour numerical evaluations, the proposed network diffuser outperforms learning\nand heuristic baselines, by $\\sim$20\\% improvement in SFC reward and $\\sim$50\\%\nreduction in SFC waiting time and blocking rate.\n","authors":["Zuyuan Zhang","Vaneet Aggarwal","Tian Lan"],"pdf_url":"https://arxiv.org/pdf/2501.05673v1.pdf","comment":"Accepted to IEEE INFOCOM 2025"},{"id":"http://arxiv.org/abs/2303.16045v4","updated":"2025-01-10T02:39:43Z","published":"2023-03-28T15:20:25Z","title":"An Optimal, Universal and Agnostic Decoding Method for Message\n Reconstruction, Bio and Technosignature Detection","summary":" We present an agnostic signal reconstruction method for zero-knowledge\none-way communication channels in which a receiver aims to interpret a message\nsent by an unknown source about which no prior knowledge is available and to\nwhich no return message can be sent. 
Our reconstruction method is agnostic\nvis-\\`a-vis the arbitrarily chosen encoding-decoding scheme and other\nobserver-dependent characteristics, such as the arbitrarily chosen\ncomputational model, probability distributions, or underlying mathematical\ntheory. We investigate how non-random messages encode information about their\nintended physical properties, such as dimension and length scales of the space\nin which a signal or message may have been originally encoded, embedded, or\ngenerated. We focus on image data as a first illustration of the capabilities\nof the new method. We argue that our results have applications to life and\ntechnosignature detection, and to coding theory in general.\n","authors":["Hector Zenil","Alyssa Adams","Felipe S. Abrahão","Luan Ozelim"],"pdf_url":"https://arxiv.org/pdf/2303.16045v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05667v1","updated":"2025-01-10T02:33:15Z","published":"2025-01-10T02:33:15Z","title":"TransPlace: Transferable Circuit Global Placement via Graph Neural\n Network","summary":" Global placement, a critical step in designing the physical layout of\ncomputer chips, is essential to optimize chip performance. Prior global\nplacement methods optimize each circuit design individually from scratch. Their\nneglect of transferable knowledge limits solution efficiency and chip\nperformance as circuit complexity drastically increases. This study presents\nTransPlace, a global placement framework that learns to place millions of\nmixed-size cells in continuous space. TransPlace introduces i) Netlist Graph to\nefficiently model netlist topology, ii) Cell-flow and relative position\nencoding to learn SE(2)-invariant representation, iii) a tailored graph neural\nnetwork architecture for informed parameterization of placement knowledge, and\niv) a two-stage strategy for coarse-to-fine placement. 
Compared to\nstate-of-the-art placement methods, TransPlace, trained on a few high-quality\nplacements, can place unseen circuits with 1.2x speedup while reducing\ncongestion by 30%, timing by 9%, and wirelength by 5%.\n","authors":["Yunbo Hou","Haoran Ye","Yingxue Zhang","Siyuan Xu","Guojie Song"],"pdf_url":"https://arxiv.org/pdf/2501.05667v1.pdf","comment":"Accepted at KDD 2025"},{"id":"http://arxiv.org/abs/2409.12953v4","updated":"2025-01-10T02:31:03Z","published":"2024-09-19T17:58:16Z","title":"JourneyBench: A Challenging One-Stop Vision-Language Understanding\n Benchmark of Generated Images","summary":" Existing vision-language understanding benchmarks largely consist of images\nof objects in their usual contexts. As a consequence, recent multimodal large\nlanguage models can perform well with only a shallow visual understanding by\nrelying on background language biases. Thus, strong performance on these\nbenchmarks does not necessarily correlate with strong visual understanding. In\nthis paper, we release JourneyBench, a comprehensive human-annotated benchmark\nof generated images designed to assess the model's fine-grained multimodal\nreasoning abilities across five tasks: complementary multimodal chain of\nthought, multi-image VQA, imaginary image captioning, VQA with hallucination\ntriggers, and fine-grained retrieval with sample-specific distractors. Unlike\nexisting benchmarks, JourneyBench explicitly requires fine-grained multimodal\nreasoning in unusual imaginary scenarios where language bias and holistic image\ngist are insufficient. We benchmark state-of-the-art models on JourneyBench and\nanalyze performance along a number of fine-grained dimensions. Results across\nall five tasks show that JourneyBench is exceptionally challenging for even the\nbest models, indicating that models' visual reasoning abilities are not as\nstrong as they first appear. 
We discuss the implications of our findings and\npropose avenues for further research.\n","authors":["Zhecan Wang","Junzhang Liu","Chia-Wei Tang","Hani Alomari","Anushka Sivakumar","Rui Sun","Wenhao Li","Md. Atabuzzaman","Hammad Ayyubi","Haoxuan You","Alvi Ishmam","Kai-Wei Chang","Shih-Fu Chang","Chris Thomas"],"pdf_url":"https://arxiv.org/pdf/2409.12953v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05663v1","updated":"2025-01-10T02:28:19Z","published":"2025-01-10T02:28:19Z","title":"Learning to Measure Quantum Neural Networks","summary":" The rapid progress in quantum computing (QC) and machine learning (ML) has\nattracted growing attention, prompting extensive research into quantum machine\nlearning (QML) algorithms to solve diverse and complex problems. Designing\nhigh-performance QML models demands expert-level proficiency, which remains a\nsignificant obstacle to the broader adoption of QML. A few major hurdles\ninclude crafting effective data encoding techniques and parameterized quantum\ncircuits, both of which are crucial to the performance of QML models.\nAdditionally, the measurement phase is frequently overlooked-most current QML\nmodels rely on pre-defined measurement protocols that often fail to account for\nthe specific problem being addressed. We introduce a novel approach that makes\nthe observable of the quantum system-specifically, the Hermitian\nmatrix-learnable. Our method features an end-to-end differentiable learning\nframework, where the parameterized observable is trained alongside the ordinary\nquantum circuit parameters simultaneously. 
Using numerical simulations, we show\nthat the proposed method can identify observables for variational quantum\ncircuits that lead to improved outcomes, such as higher classification\naccuracy, thereby boosting the overall performance of QML models.\n","authors":["Samuel Yen-Chi Chen","Huan-Hsin Tseng","Hsin-Yi Lin","Shinjae Yoo"],"pdf_url":"https://arxiv.org/pdf/2501.05663v1.pdf","comment":"Accepted by ICASSP 2025 Workshop: Quantum Machine Learning in Signal\n Processing and Artificial Intelligence"},{"id":"http://arxiv.org/abs/2501.05662v1","updated":"2025-01-10T02:28:04Z","published":"2025-01-10T02:28:04Z","title":"Cascaded Self-Evaluation Augmented Training for Efficient Multimodal\n Large Language Models","summary":" Efficient Multimodal Large Language Models (EMLLMs) have rapidly advanced\nrecently. Incorporating Chain-of-Thought (CoT) reasoning and step-by-step\nself-evaluation has improved their performance. However, limited parameters\noften hinder EMLLMs from effectively using self-evaluation during inference.\nKey challenges include synthesizing evaluation data, determining its quantity,\noptimizing training and inference strategies, and selecting appropriate\nprompts.\n To address these issues, we introduce Self-Evaluation Augmented Training\n(SEAT). SEAT uses more powerful EMLLMs for CoT reasoning, data selection, and\nevaluation generation, then trains EMLLMs with the synthesized data. However,\nhandling long prompts and maintaining CoT reasoning quality are problematic.\nTherefore, we propose Cascaded Self-Evaluation Augmented Training (Cas-SEAT),\nwhich breaks down lengthy prompts into shorter, task-specific cascaded prompts\nand reduces costs for resource-limited settings. 
During data synthesis, we\nemploy open-source 7B-parameter EMLLMs and annotate a small dataset with short\nprompts.\n Experiments demonstrate that Cas-SEAT significantly boosts EMLLMs'\nself-evaluation abilities, improving performance by 19.68%, 55.57%, and 46.79%\non the MathVista, Math-V, and We-Math datasets, respectively. Additionally, our\nCas-SEAT Dataset serves as a valuable resource for future research in enhancing\nEMLLM self-evaluation.\n","authors":["Zheqi Lv","Wenkai Wang","Jiawei Wang","Shengyu Zhang","Fei Wu"],"pdf_url":"https://arxiv.org/pdf/2501.05662v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.11484v9","updated":"2025-01-10T02:18:01Z","published":"2024-07-16T08:20:39Z","title":"The Oscars of AI Theater: A Survey on Role-Playing with Language Models","summary":" This survey explores the burgeoning field of role-playing with language\nmodels, focusing on their development from early persona-based models to\nadvanced character-driven simulations facilitated by Large Language Models\n(LLMs). Initially confined to simple persona consistency due to limited model\ncapabilities, role-playing tasks have now expanded to embrace complex character\nportrayals involving character consistency, behavioral alignment, and overall\nattractiveness. We provide a comprehensive taxonomy of the critical components\nin designing these systems, including data, models and alignment, agent\narchitecture and evaluation. This survey not only outlines the current\nmethodologies and challenges, such as managing dynamic personal profiles and\nachieving high-level persona consistency but also suggests avenues for future\nresearch in improving the depth and realism of role-playing applications. The\ngoal is to guide future research by offering a structured overview of current\nmethodologies and identifying potential areas for improvement. 
Related\nresources and papers are available at\nhttps://github.com/nuochenpku/Awesome-Role-Play-Papers.\n","authors":["Nuo Chen","Yan Wang","Yang Deng","Jia Li"],"pdf_url":"https://arxiv.org/pdf/2407.11484v9.pdf","comment":"28 pages"},{"id":"http://arxiv.org/abs/2404.11917v2","updated":"2025-01-10T02:08:52Z","published":"2024-04-18T05:48:15Z","title":"Expected Coordinate Improvement for High-Dimensional Bayesian\n Optimization","summary":" Bayesian optimization (BO) algorithm is very popular for solving\nlow-dimensional expensive optimization problems. Extending Bayesian\noptimization to high dimension is a meaningful but challenging task. One of the\nmajor challenges is that it is difficult to find good infill solutions as the\nacquisition functions are also high-dimensional. In this work, we propose the\nexpected coordinate improvement (ECI) criterion for high-dimensional Bayesian\noptimization. The proposed ECI criterion measures the potential improvement we\ncan get by moving the current best solution along one coordinate. The proposed\napproach selects the coordinate with the highest ECI value to refine in each\niteration and covers all the coordinates gradually by iterating over the\ncoordinates. The greatest advantage of the proposed ECI-BO (expected coordinate\nimprovement based Bayesian optimization) algorithm over the standard BO\nalgorithm is that the infill selection problem of the proposed algorithm is\nalways a one-dimensional problem thus can be easily solved. Numerical\nexperiments show that the proposed algorithm can achieve significantly better\nresults than the standard BO algorithm and competitive results when compared\nwith five state-of-the-art high-dimensional BOs. 
This work provides a simple\nbut efficient approach for high-dimensional Bayesian optimization.\n","authors":["Dawei Zhan"],"pdf_url":"https://arxiv.org/pdf/2404.11917v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05647v1","updated":"2025-01-10T01:27:12Z","published":"2025-01-10T01:27:12Z","title":"Collaboration of Large Language Models and Small Recommendation Models\n for Device-Cloud Recommendation","summary":" Large Language Models (LLMs) for Recommendation (LLM4Rec) is a promising\nresearch direction that has demonstrated exceptional performance in this field.\nHowever, its inability to capture real-time user preferences greatly limits the\npractical application of LLM4Rec because (i) LLMs are costly to train and infer\nfrequently, and (ii) LLMs struggle to access real-time data (its large number\nof parameters poses an obstacle to deployment on devices). Fortunately, small\nrecommendation models (SRMs) can effectively supplement these shortcomings of\nLLM4Rec diagrams by consuming minimal resources for frequent training and\ninference, and by conveniently accessing real-time data on devices.\n In light of this, we designed the Device-Cloud LLM-SRM Collaborative\nRecommendation Framework (LSC4Rec) under a device-cloud collaboration setting.\nLSC4Rec aims to integrate the advantages of both LLMs and SRMs, as well as the\nbenefits of cloud and edge computing, achieving a complementary synergy. We\nenhance the practicability of LSC4Rec by designing three strategies:\ncollaborative training, collaborative inference, and intelligent request.\nDuring training, LLM generates candidate lists to enhance the ranking ability\nof SRM in collaborative scenarios and enables SRM to update adaptively to\ncapture real-time user interests. During inference, LLM and SRM are deployed on\nthe cloud and on the device, respectively. 
LLM generates candidate lists and\ninitial ranking results based on user behavior, and SRM get reranking results\nbased on the candidate list, with final results integrating both LLM's and\nSRM's scores. The device determines whether a new candidate list is needed by\ncomparing the consistency of the LLM's and SRM's sorted lists. Our\ncomprehensive and extensive experimental analysis validates the effectiveness\nof each strategy in LSC4Rec.\n","authors":["Zheqi Lv","Tianyu Zhan","Wenjie Wang","Xinyu Lin","Shengyu Zhang","Wenqiao Zhang","Jiwei Li","Kun Kuang","Fei Wu"],"pdf_url":"https://arxiv.org/pdf/2501.05647v1.pdf","comment":"Published on KDD'25: Proceedings of the ACM SIGKDD Conference on\n Knowledge Discovery and Data Mining 2025"},{"id":"http://arxiv.org/abs/2501.05646v1","updated":"2025-01-10T01:25:01Z","published":"2025-01-10T01:25:01Z","title":"Efficient Representations for High-Cardinality Categorical Variables in\n Machine Learning","summary":" High\\-cardinality categorical variables pose significant challenges in\nmachine learning, particularly in terms of computational efficiency and model\ninterpretability. Traditional one\\-hot encoding often results in\nhigh\\-dimensional sparse feature spaces, increasing the risk of overfitting and\nreducing scalability. This paper introduces novel encoding techniques,\nincluding means encoding, low\\-rank encoding, and multinomial logistic\nregression encoding, to address these challenges. These methods leverage\nsufficient representations to generate compact and informative embeddings of\ncategorical data. 
We conduct rigorous theoretical analyses and empirical\nvalidations on diverse datasets, demonstrating significant improvements in\nmodel performance and computational efficiency compared to baseline methods.\nThe proposed techniques are particularly effective in domains requiring\nscalable solutions for large datasets, paving the way for more robust and\nefficient applications in machine learning.\n","authors":["Zixuan Liang"],"pdf_url":"https://arxiv.org/pdf/2501.05646v1.pdf","comment":"2025 International Conference on Advanced Machine Learning and Data\n Science (AMLDS 2025)"},{"id":"http://arxiv.org/abs/2412.18544v2","updated":"2025-01-10T01:06:06Z","published":"2024-12-24T16:51:35Z","title":"Consistency Checks for Language Model Forecasters","summary":" Forecasting is a task that is difficult to evaluate: the ground truth can\nonly be known in the future. Recent work showing LLM forecasters rapidly\napproaching human-level performance begs the question: how can we benchmark and\nevaluate these forecasters instantaneously? Following the consistency check\nframework, we measure the performance of forecasters in terms of the\nconsistency of their predictions on different logically-related questions. We\npropose a new, general consistency metric based on arbitrage: for example, if a\nforecasting AI illogically predicts that both the Democratic and Republican\nparties have 60% probability of winning the 2024 US presidential election, an\narbitrageur can trade against the forecaster's predictions and make a profit.\nWe build an automated evaluation system that generates a set of base questions,\ninstantiates consistency checks from these questions, elicits the predictions\nof the forecaster, and measures the consistency of the predictions. We then\nbuild a standard, proper-scoring-rule forecasting benchmark, and show that our\n(instantaneous) consistency metrics correlate with LLM forecasters' ground\ntruth Brier scores (which are only known in the future). 
We also release a\nconsistency benchmark that resolves in 2028, providing a long-term evaluation\ntool for forecasting.\n","authors":["Daniel Paleka","Abhimanyu Pallavi Sudhir","Alejandro Alvarez","Vineeth Bhat","Adam Shen","Evan Wang","Florian Tramèr"],"pdf_url":"https://arxiv.org/pdf/2412.18544v2.pdf","comment":"55 pages, 25 figures. Submitted to ICLR 2025"},{"id":"http://arxiv.org/abs/2501.05643v1","updated":"2025-01-10T01:00:05Z","published":"2025-01-10T01:00:05Z","title":"Iconicity in Large Language Models","summary":" Lexical iconicity, a direct relation between a word's meaning and its form,\nis an important aspect of every natural language, most commonly manifesting\nthrough sound-meaning associations. Since Large language models' (LLMs') access\nto both meaning and sound of text is only mediated (meaning through textual\ncontext, sound through written representation, further complicated by\ntokenization), we might expect that the encoding of iconicity in LLMs would be\neither insufficient or significantly different from human processing. This\nstudy addresses this hypothesis by having GPT-4 generate highly iconic\npseudowords in artificial languages. To verify that these words actually carry\niconicity, we had their meanings guessed by Czech and German participants\n(n=672) and subsequently by LLM-based participants (generated by GPT-4 and\nClaude 3.5 Sonnet). The results revealed that humans can guess the meanings of\npseudowords in the generated iconic language more accurately than words in\ndistant natural languages and that LLM-based participants are even more\nsuccessful than humans in this task. 
This core finding is accompanied by\nseveral additional analyses concerning the universality of the generated\nlanguage and the cues that both human and LLM-based participants utilize.\n","authors":["Anna Marklová","Jiří Milička","Leonid Ryvkin","Ľudmila Lacková Bennet","Libuše Kormaníková"],"pdf_url":"https://arxiv.org/pdf/2501.05643v1.pdf","comment":"Supplementary information: https://osf.io/ywjrk/"},{"id":"http://arxiv.org/abs/2501.05629v1","updated":"2025-01-10T00:10:21Z","published":"2025-01-10T00:10:21Z","title":"The Impact of Model Scaling on Seen and Unseen Language Performance","summary":" The rapid advancement of Large Language Models (LLMs), particularly those\ntrained on multilingual corpora, has intensified the need for a deeper\nunderstanding of their performance across a diverse range of languages and\nmodel sizes. Our research addresses this critical need by studying the\nperformance and scaling behavior of multilingual LLMs in text classification\nand machine translation tasks across 204 languages. We systematically examine\nboth seen and unseen languages across three model families of varying sizes in\nzero-shot and few-shot settings. Our findings show significant differences in\nscaling behavior between zero-shot and two-shot scenarios, with striking\ndisparities in performance between seen and unseen languages. Model scale has\nlittle effect on zero-shot performance, which remains mostly flat. However, in\ntwo-shot settings, larger models show clear linear improvements in multilingual\ntext classification. For translation tasks, however, only the instruction-tuned\nmodel showed clear benefits from scaling. 
Our analysis also suggests that\noverall resource levels, not just the proportions of pretraining languages, are\nbetter predictors of model performance, shedding light on what drives\nmultilingual LLM effectiveness.\n","authors":["Rhitabrat Pokharel","Sina Bagheri Nezhad","Ameeta Agrawal","Suresh Singh"],"pdf_url":"https://arxiv.org/pdf/2501.05629v1.pdf","comment":"Accepted at SEAS Workshop at AAAI25"}]},"2025-01-13T00:00:00Z":{"Robotics":[{"id":"http://arxiv.org/abs/2501.07566v1","updated":"2025-01-13T18:54:02Z","published":"2025-01-13T18:54:02Z","title":"SafeSwarm: Decentralized Safe RL for the Swarm of Drones Landing in\n Dense Crowds","summary":" This paper introduces a safe swarm of drones capable of performing landings\nin crowded environments robustly by relying on Reinforcement Learning\ntechniques combined with Safe Learning. The developed system allows us to teach\nthe swarm of drones with different dynamics to land on moving landing pads in\nan environment while avoiding collisions with obstacles and between agents.\n The safe barrier net algorithm was developed and evaluated using a swarm of\nCrazyflie 2.1 micro quadrotors, which were tested indoors with the Vicon motion\ncapture system to ensure precise localization and control.\n Experimental results show that our system achieves landing accuracy of 2.25\ncm with a mean time of 17 s and collision-free landings, underscoring its\neffectiveness and robustness in real-world scenarios. 
This work offers a\npromising foundation for applications in environments where safety and\nprecision are paramount.\n","authors":["Grik Tadevosyan","Maksim Osipenko","Demetros Aschu","Aleksey Fedoseev","Valerii Serpiva","Oleg Sautenkov","Sausar Karaf","Dzmitry Tsetserukou"],"pdf_url":"https://arxiv.org/pdf/2501.07566v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.04987v2","updated":"2025-01-13T18:24:22Z","published":"2024-11-07T18:55:10Z","title":"Few-Shot Task Learning through Inverse Generative Modeling","summary":" Learning the intents of an agent, defined by its goals or motion style, is\noften extremely challenging from just a few examples. We refer to this problem\nas task concept learning and present our approach, Few-Shot Task Learning\nthrough Inverse Generative Modeling (FTL-IGM), which learns new task concepts\nby leveraging invertible neural generative models. The core idea is to pretrain\na generative model on a set of basic concepts and their demonstrations. Then,\ngiven a few demonstrations of a new concept (such as a new goal or a new\naction), our method learns the underlying concepts through backpropagation\nwithout updating the model weights, thanks to the invertibility of the\ngenerative model. We evaluate our method in five domains -- object\nrearrangement, goal-oriented navigation, motion caption of human actions,\nautonomous driving, and real-world table-top manipulation. 
Our experimental\nresults demonstrate that via the pretrained generative model, we successfully\nlearn novel concepts and generate agent plans or motion corresponding to these\nconcepts in (1) unseen environments and (2) in composition with training\nconcepts.\n","authors":["Aviv Netanyahu","Yilun Du","Antonia Bronars","Jyothish Pari","Joshua Tenenbaum","Tianmin Shu","Pulkit Agrawal"],"pdf_url":"https://arxiv.org/pdf/2411.04987v2.pdf","comment":"Added acknowledgment"},{"id":"http://arxiv.org/abs/2501.07507v1","updated":"2025-01-13T17:25:46Z","published":"2025-01-13T17:25:46Z","title":"Inductive Learning of Robot Task Knowledge from Raw Data and Online\n Expert Feedback","summary":" The increasing level of autonomy of robots poses challenges of trust and\nsocial acceptance, especially in human-robot interaction scenarios. This\nrequires an interpretable implementation of robotic cognitive capabilities,\npossibly based on formal methods as logics for the definition of task\nspecifications. However, prior knowledge is often unavailable in complex\nrealistic scenarios.\n In this paper, we propose an offline algorithm based on inductive logic\nprogramming from noisy examples to extract task specifications (i.e., action\npreconditions, constraints and effects) directly from raw data of few\nheterogeneous (i.e., not repetitive) robotic executions. Our algorithm\nleverages on the output of any unsupervised action identification algorithm\nfrom video-kinematic recordings. Combining it with the definition of very\nbasic, almost task-agnostic, commonsense concepts about the environment, which\ncontribute to the interpretability of our methodology, we are able to learn\nlogical axioms encoding preconditions of actions, as well as their effects in\nthe event calculus paradigm. 
Since the quality of learned specifications\ndepends mainly on the accuracy of the action identification algorithm, we also\npropose an online framework for incremental refinement of task knowledge from\nuser feedback, guaranteeing safe execution. Results in a standard manipulation\ntask and benchmark for user training in the safety-critical surgical robotic\nscenario, show the robustness, data- and time-efficiency of our methodology,\nwith promising results towards the scalability in more complex domains.\n","authors":["Daniele Meli","Paolo Fiorini"],"pdf_url":"https://arxiv.org/pdf/2501.07507v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.07462v1","updated":"2025-01-13T16:32:13Z","published":"2025-01-13T16:32:13Z","title":"The Sense of Agency in Assistive Robotics Using Shared Autonomy","summary":" Sense of agency is one factor that influences people's preferences for robot\nassistance and a phenomenon from cognitive science that represents the\nexperience of control over one's environment. However, in assistive robotics\nliterature, we often see paradigms that optimize measures like task success and\ncognitive load, rather than sense of agency. In fact, prior work has found that\nparticipants sometimes express a preference for paradigms, such as direct\nteleoperation, which do not perform well with those other metrics but give more\ncontrol to the user. In this work, we focus on a subset of assistance paradigms\nfor manipulation called shared autonomy in which the system combines control\nsignals from the user and the automated control. We run a study to evaluate\nsense of agency and show that higher robot autonomy during assistance leads to\nimproved task performance but a decreased sense of agency, indicating a\npotential trade-off between task performance and sense of agency. 
From our\nfindings, we discuss the relation between sense of agency and optimality, and\nwe consider a proxy metric for a component of sense of agency which might\nenable us to build systems that monitor and maintain sense of agency in real\ntime.\n","authors":["Maggie A. Collier","Rithika Narayan","Henny Admoni"],"pdf_url":"https://arxiv.org/pdf/2501.07462v1.pdf","comment":"10 pages, 8 figure, HRI conference"},{"id":"http://arxiv.org/abs/2501.07421v1","updated":"2025-01-13T15:41:18Z","published":"2025-01-13T15:41:18Z","title":"Empirical Comparison of Four Stereoscopic Depth Sensing Cameras for\n Robotics Applications","summary":" Depth sensing is an essential technology in robotics and many other fields.\nMany depth sensing (or RGB-D) cameras are available on the market and selecting\nthe best one for your application can be challenging. In this work, we tested\nfour stereoscopic RGB-D cameras that sense the distance by using two images\nfrom slightly different views. We empirically compared four cameras (Intel\nRealSense D435, Intel RealSense D455, StereoLabs ZED 2, and Luxonis OAK-D Pro)\nin three scenarios: (i) planar surface perception, (ii) plastic doll\nperception, (iii) household object perception (YCB dataset). We recorded and\nevaluated more than 3,000 RGB-D frames for each camera. For table-top robotics\nscenarios with distance to objects up to one meter, the best performance is\nprovided by the D435 camera. For longer distances, the other three models\nperform better, making them more suitable for some mobile robotics\napplications. OAK-D Pro additionally offers integrated AI modules (e.g., object\nand human keypoint detection). ZED 2 is not a standalone device and requires a\ncomputer with a GPU for depth data acquisition. 
All data (more than 12,000\nRGB-D frames) are publicly available at https://osf.io/f2seb.\n","authors":["Lukas Rustler","Vojtech Volprecht","Matej Hoffmann"],"pdf_url":"https://arxiv.org/pdf/2501.07421v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.07399v1","updated":"2025-01-13T15:17:10Z","published":"2025-01-13T15:17:10Z","title":"Efficiently Closing Loops in LiDAR-Based SLAM Using Point Cloud Density\n Maps","summary":" Consistent maps are key for most autonomous mobile robots. They often use\nSLAM approaches to build such maps. Loop closures via place recognition help\nmaintain accurate pose estimates by mitigating global drift. This paper\npresents a robust loop closure detection pipeline for outdoor SLAM with\nLiDAR-equipped robots. The method handles various LiDAR sensors with different\nscanning patterns, field of views and resolutions. It generates local maps from\nLiDAR scans and aligns them using a ground alignment module to handle both\nplanar and non-planar motion of the LiDAR, ensuring applicability across\nplatforms. The method uses density-preserving bird's eye view projections of\nthese local maps and extracts ORB feature descriptors from them for place\nrecognition. It stores the feature descriptors in a binary search tree for\nefficient retrieval, and self-similarity pruning addresses perceptual aliasing\nin repetitive environments. Extensive experiments on public and self-recorded\ndatasets demonstrate accurate loop closure detection, long-term localization,\nand cross-platform multi-map alignment, agnostic to the LiDAR scanning\npatterns, fields of view, and motion profiles.\n","authors":["Saurabh Gupta","Tiziano Guadagnino","Benedikt Mersch","Niklas Trekel","Meher V. R. 
Malladi","Cyrill Stachniss"],"pdf_url":"https://arxiv.org/pdf/2501.07399v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.08094v2","updated":"2025-01-13T14:53:11Z","published":"2023-05-14T08:10:49Z","title":"Accelerating genetic optimization of nonlinear model predictive control\n by learning optimal search space size","summary":" Genetic algorithm (GA) is typically used to solve nonlinear model predictive\ncontrol's optimization problem. However, the size of the search space in which\nthe GA searches for the optimal control inputs is crucial for its applicability\nto fast-response systems. This paper proposes accelerating the genetic\noptimization of NMPC by learning optimal search space size. The approach trains\na multivariate regression model to adaptively predict the best smallest size of\nthe search space in every control cycle. The proposed approach reduces the GA's\ncomputational time, improves the chance of convergence to better control\ninputs, and provides a stable and feasible solution. The proposed approach was\nevaluated on three nonlinear systems and compared to four other evolutionary\nalgorithms implemented in a processor-in-the-loop fashion. The results show\nthat the proposed approach provides a 17-45\\% reduction in computational time\nand increases the convergence rate by 35-47\\%. The source code is available on\nGitHub.\n","authors":["Eslam Mostafa","Hussein A. Aly","Ahmed Elliethy"],"pdf_url":"https://arxiv.org/pdf/2305.08094v2.pdf","comment":"Accepted by the Journal of Control and Decision"},{"id":"http://arxiv.org/abs/2412.19706v3","updated":"2025-01-13T14:15:59Z","published":"2024-12-27T16:00:24Z","title":"Geometric Freeze-Tag Problem","summary":" We study the Freeze-Tag Problem (FTP), introduced by Arkin et al. (SODA'02),\nwhere the objective is to activate a group of n robots, starting from a single\ninitially active robot. 
Robots are positioned in $\\mathbb{R}^d$, and once\nactivated, they move at a constant speed to wake up others. The goal is to\nminimize the time required to activate the last robot, known as the makespan.\nWe establish new upper bounds for the makespan under the $l_1$ and $l_2$ norms\nin $\\mathbb{R}^2$ and $\\mathbb{R}^3$. Specifically, we improve the previous\nupper bound for $(\\mathbb{R}^2, l_2)$ from $7.07r$ (Bonichon et al., DISC'24)\nto $5.064r$. For $(\\mathbb{R}^3, l_1)$, we derive a makespan bound of $13r$,\nwhich translates to $22.52r$ for $(\\mathbb{R}^3, l_2)$. Here, $r$ denotes the\nmaximum distance of any robot from the initially active robot under the given\nnorm. To our knowledge, these are the first makespan bounds for FTP in\n$\\mathbb{R}^3$. Additionally, we show that the maximum makespan for $n$ robots\nis not necessarily achieved when robots are equally distributed along the\nboundary in $(\\mathbb{R}^2, l_2)$. We further investigate FTP in\n$(\\mathbb{R}^3, l_2)$ for specific configurations where robots lie on a\nboundary, providing insights into practical scenarios.\n","authors":["Sharareh Alipour","Kajal Baghestani","Mahdis Mirzaei","Soroush Sahraei"],"pdf_url":"https://arxiv.org/pdf/2412.19706v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.06782v2","updated":"2025-01-13T14:11:49Z","published":"2024-11-11T08:19:54Z","title":"QuadWBG: Generalizable Quadrupedal Whole-Body Grasping","summary":" Legged robots with advanced manipulation capabilities have the potential to\nsignificantly improve household duties and urban maintenance. Despite\nconsiderable progress in developing robust locomotion and precise manipulation\nmethods, seamlessly integrating these into cohesive whole-body control for\nreal-world applications remains challenging. In this paper, we present a\nmodular framework for robust and generalizable whole-body loco-manipulation\ncontroller based on a single arm-mounted camera. 
By using reinforcement\nlearning (RL), we enable a robust low-level policy for command execution over 5\ndimensions (5D) and a grasp-aware high-level policy guided by a novel metric,\nGeneralized Oriented Reachability Map (GORM). The proposed system achieves\nstate-of-the-art one-time grasping accuracy of 89% in the real world, including\nchallenging tasks such as grasping transparent objects. Through extensive\nsimulations and real-world experiments, we demonstrate that our system can\neffectively manage a large workspace, from floor level to above body height,\nand perform diverse whole-body loco-manipulation tasks.\n","authors":["Jilong Wang","Javokhirbek Rajabov","Chaoyi Xu","Yiming Zheng","He Wang"],"pdf_url":"https://arxiv.org/pdf/2411.06782v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.07343v1","updated":"2025-01-13T13:57:37Z","published":"2025-01-13T13:57:37Z","title":"Fast-Revisit Coverage Path Planning for Autonomous Mobile Patrol Robots\n Using Long-Range Sensor Information","summary":" The utilization of Unmanned Ground Vehicles (UGVs) for patrolling industrial\nsites has expanded significantly. These UGVs typically are equipped with\nperception systems, e.g., computer vision, with limited range due to sensor\nlimitations or site topology. High-level control of the UGVs requires Coverage\nPath Planning (CPP) algorithms that navigate all relevant waypoints and\npromptly start the next cycle. In this paper, we propose the novel Fast-Revisit\nCoverage Path Planning (FaRe-CPP) algorithm using a greedy heuristic approach\nto propose waypoints for maximum coverage area and a random search-based path\noptimization technique to obtain a path along the proposed waypoints with\nminimum revisit time. We evaluated the algorithm in a simulated environment\nusing Gazebo and a camera-equipped TurtleBot3 against a number of existing\nalgorithms. 
Compared to their average revisit times and path lengths, our\nFaRe-CPP algorithm approximately showed a 45% and 40% reduction, respectively,\nin these highly relevant performance indicators.\n","authors":["Srinivas Kachavarapu","Tobias Doernbach","Reinhard Gerndt"],"pdf_url":"https://arxiv.org/pdf/2501.07343v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.07317v1","updated":"2025-01-13T13:28:03Z","published":"2025-01-13T13:28:03Z","title":"Evaluation of Artificial Intelligence Methods for Lead Time Prediction\n in Non-Cycled Areas of Automotive Production","summary":" The present study examines the effectiveness of applying Artificial\nIntelligence methods in an automotive production environment to predict unknown\nlead times in a non-cycle-controlled production area. Data structures are\nanalyzed to identify contextual features and then preprocessed using one-hot\nencoding. Methods selection focuses on supervised machine learning techniques.\nIn supervised learning methods, regression and classification methods are\nevaluated. Continuous regression based on target size distribution is not\nfeasible. Classification methods analysis shows that Ensemble Learning and\nSupport Vector Machines are the most suitable. Preliminary study results\nindicate that gradient boosting algorithms LightGBM, XGBoost, and CatBoost\nyield the best results. After further testing and extensive hyperparameter\noptimization, the final method choice is the LightGBM algorithm. Depending on\nfeature availability and prediction interval granularity, relative prediction\naccuracies of up to 90% can be achieved. Further tests highlight the importance\nof periodic retraining of AI models to accurately represent complex production\nprocesses using the database. 
The research demonstrates that AI methods can be\neffectively applied to highly variable production data, adding business value\nby providing an additional metric for various control tasks while outperforming\ncurrent non AI-based systems.\n","authors":["Cornelius Hake","Jonas Weigele","Frederik Reichert","Christian Friedrich"],"pdf_url":"https://arxiv.org/pdf/2501.07317v1.pdf","comment":"7 pages, 4 figures, CLC2024 Conference"},{"id":"http://arxiv.org/abs/2501.07299v1","updated":"2025-01-13T13:07:20Z","published":"2025-01-13T13:07:20Z","title":"ViewVR: Visual Feedback Modes to Achieve Quality of VR-based\n Telemanipulation","summary":" The paper focuses on an immersive teleoperation system that enhances\noperator's ability to actively perceive the robot's surroundings. A\nconsumer-grade HTC Vive VR system was used to synchronize the operator's hand\nand head movements with a UR3 robot and a custom-built robotic head with two\ndegrees of freedom (2-DoF). The system's usability, manipulation efficiency,\nand intuitiveness of control were evaluated in comparison with static head\ncamera positioning across three distinct tasks. Code and other supplementary\nmaterials can be accessed by link: https://github.com/ErkhovArtem/ViewVR\n","authors":["A. Erkhov","A. Bazhenov","S. Satsevich","D. Belov","F. Khabibullin","S. Egorov","M. Gromakov","M. Altamirano Cabrera","D. Tsetserukou"],"pdf_url":"https://arxiv.org/pdf/2501.07299v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.07295v1","updated":"2025-01-13T13:01:21Z","published":"2025-01-13T13:01:21Z","title":"GestLLM: Advanced Hand Gesture Interpretation via Large Language Models\n for Human-Robot Interaction","summary":" This paper introduces GestLLM, an advanced system for human-robot interaction\nthat enables intuitive robot control through hand gestures. 
Unlike conventional\nsystems, which rely on a limited set of predefined gestures, GestLLM leverages\nlarge language models and feature extraction via MediaPipe to interpret a\ndiverse range of gestures. This integration addresses key limitations in\nexisting systems, such as restricted gesture flexibility and the inability to\nrecognize complex or unconventional gestures commonly used in human\ncommunication.\n By combining state-of-the-art feature extraction and language model\ncapabilities, GestLLM achieves performance comparable to leading\nvision-language models while supporting gestures underrepresented in\ntraditional datasets. For example, this includes gestures from popular culture,\nsuch as the ``Vulcan salute\" from Star Trek, without any additional\npretraining, prompt engineering, etc. This flexibility enhances the naturalness\nand inclusivity of robot control, making interactions more intuitive and\nuser-friendly.\n GestLLM provides a significant step forward in gesture-based interaction,\nenabling robots to understand and respond to a wide variety of hand gestures\neffectively. This paper outlines its design, implementation, and evaluation,\ndemonstrating its potential applications in advanced human-robot collaboration,\nassistive robotics, and interactive entertainment.\n","authors":["Oleg Kobzarev","Artem Lykov","Dzmitry Tsetserukou"],"pdf_url":"https://arxiv.org/pdf/2501.07295v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.07259v1","updated":"2025-01-13T12:14:48Z","published":"2025-01-13T12:14:48Z","title":"PO-GVINS: Tightly Coupled GNSS-Visual-Inertial Integration with\n Pose-Only Representation","summary":" Accurate and reliable positioning is crucial for perception, decision-making,\nand other high-level applications in autonomous driving, unmanned aerial\nvehicles, and intelligent robots. 
Given the inherent limitations of standalone\nsensors, integrating heterogeneous sensors with complementary capabilities is\none of the most effective approaches to achieving this goal. In this paper, we\npropose a filtering-based, tightly coupled global navigation satellite system\n(GNSS)-visual-inertial positioning framework with a pose-only formulation\napplied to the visual-inertial system (VINS), termed PO-GVINS. Specifically,\nthe multiple-view imaging used in current VINS requires a priori knowledge of\n3D features and jointly estimates camera poses and 3D feature positions, which\ninevitably introduces feature linearization errors and faces dimensional\nexplosion. In contrast, the pose-only (PO) formulation, which has been shown to\nbe equivalent to multiple-view imaging and has been applied in visual\nreconstruction, represents feature depth using two camera poses; the 3D feature\npositions are thus removed from the state vector, avoiding the aforementioned\ndifficulties. Inspired by this, we first apply the PO formulation in our VINS,\ni.e., PO-VINS. GNSS raw measurements are then incorporated with integer\nambiguity resolution to achieve accurate and drift-free estimation. Extensive\nexperiments demonstrate that the proposed PO-VINS significantly outperforms the\nmulti-state constrained Kalman filter (MSCKF). 
By incorporating GNSS\nmeasurements, PO-GVINS achieves accurate, drift-free state estimation, making\nit a robust solution for positioning in challenging environments.\n","authors":["Zhuo Xu","Feng Zhu","Zihang Zhang","Chang Jian","Jiarui Lv","Yuantai Zhang","Xiaohong Zhang"],"pdf_url":"https://arxiv.org/pdf/2501.07259v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.07255v1","updated":"2025-01-13T12:06:58Z","published":"2025-01-13T12:06:58Z","title":"GazeGrasp: DNN-Driven Robotic Grasping with Wearable Eye-Gaze Interface","summary":" We present GazeGrasp, a gaze-based manipulation system enabling individuals\nwith motor impairments to control collaborative robots using eye-gaze. The\nsystem employs an ESP32 CAM for eye tracking, MediaPipe for gaze detection, and\nYOLOv8 for object localization, integrated with a Universal Robot UR10 for\nmanipulation tasks. After user-specific calibration, the system allows\nintuitive object selection with a magnetic snapping effect and robot control\nvia eye gestures. Experimental evaluation involving 13 participants\ndemonstrated that the magnetic snapping effect significantly reduced gaze\nalignment time, improving task efficiency by 31%. GazeGrasp provides a robust,\nhands-free interface for assistive robotics, enhancing accessibility and\nautonomy for users.\n","authors":["Issatay Tokmurziyev","Miguel Altamirano Cabrera","Luis Moreno","Muhammad Haris Khan","Dzmitry Tsetserukou"],"pdf_url":"https://arxiv.org/pdf/2501.07255v1.pdf","comment":"Accepted to: IEEE/ACM International Conference on Human-Robot\n Interaction (HRI 2025)"},{"id":"http://arxiv.org/abs/2412.20104v2","updated":"2025-01-13T11:46:06Z","published":"2024-12-28T10:12:12Z","title":"SyncDiff: Synchronized Motion Diffusion for Multi-Body Human-Object\n Interaction Synthesis","summary":" Synthesizing realistic human-object interaction motions is a critical problem\nin VR/AR and human animation. 
Unlike the commonly studied scenarios involving a\nsingle human or hand interacting with one object, we address a more generic\nmulti-body setting with arbitrary numbers of humans, hands, and objects. This\ncomplexity introduces significant challenges in synchronizing motions due to\nthe high correlations and mutual influences among bodies. To address these\nchallenges, we introduce SyncDiff, a novel method for multi-body interaction\nsynthesis using a synchronized motion diffusion strategy. SyncDiff employs a\nsingle diffusion model to capture the joint distribution of multi-body motions.\nTo enhance motion fidelity, we propose a frequency-domain motion decomposition\nscheme. Additionally, we introduce a new set of alignment scores to emphasize\nthe synchronization of different body motions. SyncDiff jointly optimizes both\ndata sample likelihood and alignment likelihood through an explicit\nsynchronization strategy. Extensive experiments across four datasets with\nvarious multi-body configurations demonstrate the superiority of SyncDiff over\nexisting state-of-the-art motion synthesis methods.\n","authors":["Wenkun He","Yun Liu","Ruitao Liu","Li Yi"],"pdf_url":"https://arxiv.org/pdf/2412.20104v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.07224v1","updated":"2025-01-13T11:22:57Z","published":"2025-01-13T11:22:57Z","title":"Touched by ChatGPT: Using an LLM to Drive Affective Tactile Interaction","summary":" Touch is a fundamental aspect of emotion-rich communication, playing a vital\nrole in human interaction and offering significant potential in human-robot\ninteraction. Previous research has demonstrated that a sparse representation of\nhuman touch can effectively convey social tactile signals. However, advances in\nhuman-robot tactile interaction remain limited, as many humanoid robots possess\nsimplistic capabilities, such as only opening and closing their hands,\nrestricting nuanced tactile expressions. 
In this study, we explore how a robot\ncan use sparse representations of tactile vibrations to convey emotions to a\nperson. To achieve this, we developed a wearable sleeve integrated with a 5x5\ngrid of vibration motors, enabling the robot to communicate diverse tactile\nemotions and gestures. Using chain prompts within a Large Language Model (LLM),\nwe generated distinct 10-second vibration patterns corresponding to 10 emotions\n(e.g., happiness, sadness, fear) and 6 touch gestures (e.g., pat, rub, tap).\nParticipants (N = 32) then rated each vibration stimulus based on perceived\nvalence and arousal. People are accurate at recognising intended emotions, a\nresult which aligns with earlier findings. These results highlight the LLM's\nability to generate emotional haptic data and effectively convey emotions\nthrough tactile signals. By translating complex emotional and tactile\nexpressions into vibratory patterns, this research demonstrates how LLMs can\nenhance physical interaction between humans and robots.\n","authors":["Qiaoqiao Ren","Tony Belpaeme"],"pdf_url":"https://arxiv.org/pdf/2501.07224v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.07223v1","updated":"2025-01-13T11:21:53Z","published":"2025-01-13T11:21:53Z","title":"Improving Incremental Nonlinear Dynamic Inversion Robustness Using\n Robust Control in Aerial Robotics","summary":" Improving robustness to uncertainty and rejection of external disturbances\nrepresents a significant challenge in aerial robotics. Nonlinear controllers\nbased on Incremental Nonlinear Dynamic Inversion (INDI), known for their\nability in estimating disturbances through measured-filtered data, have been\nnotably used in such applications. Typically, these controllers comprise two\ncascaded loops: an inner loop employing nonlinear dynamic inversion and an\nouter loop generating the virtual control inputs via linear controllers. 
In\nthis paper, a novel methodology is introduced that combines the advantages of\nINDI with the robustness of linear structured $\\mathcal{H}_\\infty$ controllers.\nA full cascaded architecture is proposed to control the dynamics of a\nmultirotor drone, covering both stabilization and guidance. In particular,\nlow-order $\\mathcal{H}_\\infty$ controllers are designed for the outer loop by\nproperly structuring the problem and solving it through non-smooth\noptimization. A comparative analysis is conducted between an existing INDI/PD\napproach and the proposed INDI/$\\mathcal{H}_\\infty$ strategy, showing a notable\nenhancement in the rejection of external disturbances. It is carried out first\nusing MATLAB simulations involving a nonlinear model of a Parrot Bebop\nquadcopter drone, and then experimentally using a customized quadcopter built\nby the ENAC team. The results show an improvement of more than 50\\% in the\nrejection of disturbances such as gusts.\n","authors":["Mohamad Hachem","Clément Roos","Thierry Miquel","Murat Bronz"],"pdf_url":"https://arxiv.org/pdf/2501.07223v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.07216v1","updated":"2025-01-13T11:14:05Z","published":"2025-01-13T11:14:05Z","title":"Temperature Driven Multi-modal/Single-actuated Soft Finger","summary":" Soft pneumatic fingers are of great research interest. However, their\nsignificant potential is limited as most of them can generate only one motion,\nmostly bending. The conventional design of soft fingers does not allow them to\nswitch to another motion mode. In this paper, we developed a novel multi-modal\nand single-actuated soft finger whose motion mode is switched by changing\nthe finger's temperature. Our soft finger is capable of switching between three\ndistinct motion modes: bending, twisting, and extension, in approximately\nfive seconds. We carried out a detailed experimental study of the soft finger\nand evaluated its repeatability and range of motion. 
It exhibited repeatability\nof around one millimeter and a fifty percent larger range of motion than a\nstandard bending actuator. We developed an analytical model for a\nfiber-reinforced soft actuator for twisting motion. This helped us relate the\ninput pressure to the output twist radius of the twisting motion. This model\nwas validated experimentally. Further, a soft robotic gripper\nwith multiple grasp modes was developed using three actuators. This gripper can\nadapt to and grasp objects across a large range of sizes, shapes, and\nstiffnesses. We showcased its grasping capabilities by successfully grasping a\nsmall berry, a large roll, and a delicate tofu cube.\n","authors":["Prashant Kumar","Weiwei Wan","Kensuke Harada"],"pdf_url":"https://arxiv.org/pdf/2501.07216v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.07213v1","updated":"2025-01-13T11:12:47Z","published":"2025-01-13T11:12:47Z","title":"Multi-face emotion detection for effective Human-Robot Interaction","summary":" The integration of dialogue interfaces in mobile devices has become\nubiquitous, providing a wide array of services. As technology progresses,\nhumanoid robots designed with human-like features to interact effectively with\npeople are gaining prominence, and the use of advanced human-robot dialogue\ninterfaces is continually expanding. In this context, emotion recognition plays\na crucial role in enhancing human-robot interaction by enabling robots to\nunderstand human intentions. This research proposes a facial emotion detection\ninterface integrated into a mobile humanoid robot, capable of displaying\nreal-time emotions from multiple individuals on a user interface. To this end,\nvarious deep neural network models for facial expression recognition were\ndeveloped and evaluated under consistent computer-based conditions, yielding\npromising results. 
Afterwards, a trade-off between accuracy and memory\nfootprint was carefully considered to effectively implement this application on\na mobile humanoid robot.\n","authors":["Mohamed Ala Yahyaoui","Mouaad Oujabour","Leila Ben Letaifa","Amine Bohi"],"pdf_url":"https://arxiv.org/pdf/2501.07213v1.pdf","comment":"9 pages, 8 figures and 1 table. Accepted at the 17th International\n Conference on Agents and Artificial Intelligence (ICAART 2025), Porto,\n Portugal"},{"id":"http://arxiv.org/abs/2501.07180v1","updated":"2025-01-13T10:19:30Z","published":"2025-01-13T10:19:30Z","title":"Evaluating Robotic Approach Techniques for the Insertion of a Straight\n Instrument into a Vitreoretinal Surgery Trocar","summary":" Advances in vitreoretinal robotic surgery enable precise techniques for gene\ntherapies. This study evaluates three robotic approaches using the 7-DoF\nrobotic arm for docking a micro-precise tool to a trocar: fully co-manipulated,\nhybrid co-manipulated/teleoperated, and hybrid with camera assistance. The\nfully co-manipulated approach was the fastest but had a 42% success rate.\nHybrid methods showed higher success rates (91.6% and 100%) and completed tasks\nwithin 2 minutes. 
NASA Task Load Index (TLX) assessments indicated lower\nphysical demand and effort for hybrid approaches.\n","authors":["Ross Henry","Martin Huber","Anestis Mablekos-Alexiou","Carlo Seneci","Mohamed Abdelaziz","Hans Natalius","Lyndon da Cruz","Christos Bergeles"],"pdf_url":"https://arxiv.org/pdf/2501.07180v1.pdf","comment":"2 Pages, 2 Figures, 1 Table"},{"id":"http://arxiv.org/abs/2409.06501v3","updated":"2025-01-13T09:53:48Z","published":"2024-09-10T13:34:53Z","title":"An Adaptive Sliding Window Estimator for Positioning of Unmanned Aerial\n Vehicle Using a Single Anchor","summary":" Localization using a single range anchor combined with onboard\noptical-inertial odometry offers a lightweight solution that provides\nmultidimensional measurements for the positioning of unmanned aerial vehicles.\nUnfortunately, the performance of such lightweight sensors varies with the\ndynamic environment, and the fidelity of the dynamic model is also severely\naffected by environmental aerial flow. To address this challenge, we propose an\nadaptive sliding window estimator equipped with an estimation reliability\nevaluator, where the states, noise covariance matrices and aerial drag are\nestimated simultaneously. The aerial drag effects are first evaluated based on\nposterior states and covariance. Then, an augmented Kalman filter is designed\nto pre-process multidimensional measurements and inherit historical\ninformation. Subsequently, an inverse-Wishart smoother is employed to estimate\nposterior states and covariance matrices. To further suppress potential\ndivergence, a reliability evaluator is devised to infer estimation errors. We\nfurther determine the fidelity of each sensor based on the error propagation.\nExtensive experiments are conducted in both standard and harsh environments,\ndemonstrating the adaptability and robustness of the proposed method. 
The root\nmean square error reaches 0.15 m, outperforming the state-of-the-art approach.\n","authors":["Kaiwen Xiong","Sijia Chen","Wei Dong"],"pdf_url":"https://arxiv.org/pdf/2409.06501v3.pdf","comment":"This work has been submitted to the IEEE for possible publication"},{"id":"http://arxiv.org/abs/2407.11218v3","updated":"2025-01-13T09:23:41Z","published":"2024-07-15T20:07:33Z","title":"Walk along: An Experiment on Controlling the Mobile Robot 'Spot' with\n Voice and Gestures","summary":" Robots are becoming more capable and can autonomously perform tasks such as\nnavigating between locations. However, human oversight remains crucial. This\nstudy compared two touchless methods for directing mobile robots: voice control\nand gesture control, to investigate the efficiency of the methods and the\npreference of users. We tested these methods in two conditions: one in which\nparticipants remained stationary and one in which they walked freely alongside\nthe robot. We hypothesized that walking alongside the robot would result in\nhigher intuitiveness ratings and improved task performance, based on the idea\nthat walking promotes spatial alignment and reduces the effort required for\nmental rotation. In a 2x2 within-subject design, 218 participants guided the\nquadruped robot Spot along a circuitous route with multiple 90-degree turns\nusing rotate left, rotate right, and walk forward commands. After each trial,\nparticipants rated the intuitiveness of the command mapping, while\npost-experiment interviews were used to gather the participants' preferences.\nResults showed that voice control combined with walking with Spot was the most\nfavored and intuitive, whereas gesture control while standing caused confusion\nfor left/right commands. Nevertheless, 29% of participants preferred gesture\ncontrol, citing increased task engagement and visual congruence as reasons. 
An\nodometry-based analysis revealed that participants often followed behind Spot,\nparticularly in the gesture control condition, when they were allowed to walk.\nIn conclusion, voice control with walking produced the best outcomes. Improving\nphysical ergonomics and adjusting gesture types could make gesture control more\neffective.\n","authors":["Renchi Zhang","Jesse van der Linden","Dimitra Dodou","Harleigh Seyffert","Yke Bauke Eisma","Joost C. F. de Winter"],"pdf_url":"https://arxiv.org/pdf/2407.11218v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.01144v3","updated":"2025-01-13T08:40:27Z","published":"2024-09-02T10:28:18Z","title":"Adaptive Non-linear Centroidal MPC with Stability Guarantees for Robust\n Locomotion of Legged Robots","summary":" Nonlinear model predictive locomotion controllers based on the reduced\ncentroidal dynamics are nowadays ubiquitous in legged robots. These schemes,\neven if they assume an inherent simplification of the robot's dynamics, were\nshown to endow robots with a step-adjustment capability in reaction to small\npushes, and, moreover, in the case of uncertain parameters - as unknown\npayloads - they were shown to be able to provide some practical, albeit\nlimited, robustness. In this work, we provide rigorous certificates of their\nclosed loop stability via a reformulation of the centroidal MPC controller.\nThis is achieved thanks to a systematic procedure inspired by the machinery of\nadaptive control, together with ideas coming from Control Lyapunov functions.\nOur reformulation, in addition, provides robustness for a class of unmeasured\nconstant disturbances. 
To demonstrate the generality of our approach, we\nvalidated our formulation on a new generation of humanoid robots - the 56.7 kg\nergoCub, as well as on a commercially available 21 kg quadruped robot, Aliengo.\n","authors":["Mohamed Elobaid","Giulio Turrisi","Lorenzo Rapetti","Giulio Romualdi","Stefano Dafarra","Tomohiro Kawakami","Tomohiro Chaki","Takahide Yoshiike","Claudio Semini","Daniele Pucci"],"pdf_url":"https://arxiv.org/pdf/2409.01144v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.14797v2","updated":"2025-01-13T07:47:32Z","published":"2024-07-20T07:56:24Z","title":"From Underground Mines to Offices: A Versatile and Robust Framework for\n Range-Inertial SLAM","summary":" Simultaneous Localization and Mapping (SLAM) is an essential component of\nautonomous robotic applications and self-driving vehicles, enabling them to\nunderstand and operate in their environment. Many SLAM systems have been\nproposed in the last decade, but they are often complex to adapt to different\nsettings or sensor setups. In this work, we present LiDAR Graph-SLAM (LG-SLAM),\na versatile range-inertial SLAM framework that can be adapted to different\ntypes of sensors and environments, from underground mines to offices with\nminimal parameter tuning. Our system integrates range, inertial and GNSS\nmeasurements into a graph-based optimization framework. We also use a refined\nsubmap management approach and a robust loop closure method that effectively\naccounts for uncertainty in the identification and validation of putative loop\nclosures, ensuring global consistency and robustness. Enabled by a parallelized\narchitecture and GPU integration, our system achieves pose estimation at LiDAR\nframe rate, along with online loop closing and graph optimization. 
We validate\nour system in diverse environments using public datasets and real-world data,\nconsistently achieving an average error below 20 cm and outperforming other\nstate-of-the-art algorithms.\n","authors":["Lorenzo Montano-Oliván","Julio A. Placed","Luis Montano","María T. Lázaro"],"pdf_url":"https://arxiv.org/pdf/2407.14797v2.pdf","comment":"8 pages, 8 figures, 3 tables"},{"id":"http://arxiv.org/abs/2407.10031v2","updated":"2025-01-13T06:03:14Z","published":"2024-07-14T00:12:44Z","title":"LLaMAR: Long-Horizon Planning for Multi-Agent Robots in Partially\n Observable Environments","summary":" The ability of Language Models (LMs) to understand natural language makes\nthem a powerful tool for parsing human instructions into task plans for\nautonomous robots. Unlike traditional planning methods that rely on\ndomain-specific knowledge and handcrafted rules, LMs generalize from diverse\ndata and adapt to various tasks with minimal tuning, acting as a compressed\nknowledge base. However, LMs in their standard form face challenges with\nlong-horizon tasks, particularly in partially observable multi-agent settings.\nWe propose an LM-based Long-Horizon Planner for Multi-Agent Robotics (LLaMAR),\na cognitive architecture for planning that achieves state-of-the-art results in\nlong-horizon tasks within partially observable environments. LLaMAR employs a\nplan-act-correct-verify framework, allowing self-correction from action\nexecution feedback without relying on oracles or simulators. Additionally, we\npresent MAP-THOR, a comprehensive test suite encompassing household tasks of\nvarying complexity within the AI2-THOR environment. Experiments show that\nLLaMAR achieves a 30% higher success rate than other state-of-the-art LM-based\nmulti-agent planners in MAP-THOR and Search \\& Rescue tasks. 
Code can be found\nat https://github.com/nsidn98/LLaMAR\n","authors":["Siddharth Nayak","Adelmo Morrison Orozco","Marina Ten Have","Vittal Thirumalai","Jackson Zhang","Darren Chen","Aditya Kapoor","Eric Robinson","Karthik Gopalakrishnan","James Harrison","Brian Ichter","Anuj Mahajan","Hamsa Balakrishnan"],"pdf_url":"https://arxiv.org/pdf/2407.10031v2.pdf","comment":"27 pages, 4 figures, 5 tables"},{"id":"http://arxiv.org/abs/2501.07051v1","updated":"2025-01-13T04:18:52Z","published":"2025-01-13T04:18:52Z","title":"ROSAnnotator: A Web Application for ROSBag Data Analysis in Human-Robot\n Interaction","summary":" Human-robot interaction (HRI) is an interdisciplinary field that utilises\nboth quantitative and qualitative methods. While ROSBags, a file format within\nthe Robot Operating System (ROS), offer an efficient means of collecting\ntemporally synched multimodal data in empirical studies with real robots, there\nis a lack of tools specifically designed to integrate qualitative coding and\nanalysis functions with ROSBags. To address this gap, we developed\nROSAnnotator, a web-based application that incorporates a multimodal Large\nLanguage Model (LLM) to support both manual and automated annotation of ROSBag\ndata. ROSAnnotator currently facilitates video, audio, and transcription\nannotations and provides an open interface for custom ROS messages and tools.\nBy using ROSAnnotator, researchers can streamline the qualitative analysis\nprocess, create a more cohesive analysis pipeline, and quickly access\nstatistical summaries of annotations, thereby enhancing the overall efficiency\nof HRI data analysis. 
https://github.com/CHRI-Lab/ROSAnnotator\n","authors":["Yan Zhang","Haoqi Li","Ramtin Tabatabaei","Wafa Johal"],"pdf_url":"https://arxiv.org/pdf/2501.07051v1.pdf","comment":"Accepted to HRI 2025"},{"id":"http://arxiv.org/abs/2412.16908v2","updated":"2025-01-13T04:11:53Z","published":"2024-12-22T07:54:21Z","title":"Map Imagination Like Blind Humans: Group Diffusion Model for Robotic Map\n Generation","summary":" Can robots imagine or generate maps like humans do, especially when only\nlimited information can be perceived, as for blind people? To address this\nchallenging task, we propose a novel group diffusion model (GDM) based\narchitecture for robots to generate point cloud maps with very limited input\ninformation. Inspired by blind humans' natural capability of imagining or\ngenerating mental maps, the proposed method can generate maps without visual\nperception data or depth data. With additional limited super-sparse spatial\npositioning data, like the extra contact-based positioning information that\nblind individuals can obtain, the map generation quality can be improved even\nfurther. Experiments on public datasets are conducted, and the results indicate\nthat our method can generate reasonable maps solely based on path data, and\nproduce even more refined maps upon incorporating exiguous LiDAR data. Compared\nto conventional mapping approaches, our novel method significantly mitigates\nsensor dependency, enabling robots to imagine and generate elementary maps\nwithout heavy onboard sensory devices.\n","authors":["Qijin Song","Weibang Bai"],"pdf_url":"https://arxiv.org/pdf/2412.16908v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.05723v2","updated":"2025-01-13T02:58:58Z","published":"2025-01-10T05:43:34Z","title":"Robot Error Awareness Through Human Reactions: Implementation,\n Evaluation, and Recommendations","summary":" Effective error detection is crucial to prevent task disruption and maintain\nuser trust. 
Traditional methods often rely on task-specific models or user\nreporting, which can be inflexible or slow. Recent research suggests that social\nsignals, naturally exhibited by users in response to robot errors, can enable\nmore flexible, timely error detection. However, most studies rely on post hoc\nanalysis, leaving their real-time effectiveness uncertain and lacking\nuser-centric evaluation. In this work, we developed a proactive error detection\nsystem that combines user behavioral signals (facial action units and speech),\nuser feedback, and error context for automatic error detection. In a study (N =\n28), we compared our proactive system to a status quo reactive approach.\nResults show our system 1) reliably and flexibly detects errors, 2) detects\nerrors faster than the reactive approach, and 3) is perceived more favorably by\nusers than the reactive one. We discuss recommendations for enabling robot\nerror awareness in future HRI systems.\n","authors":["Maia Stiber","Russell Taylor","Chien-Ming Huang"],"pdf_url":"https://arxiv.org/pdf/2501.05723v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.07013v1","updated":"2025-01-13T02:15:15Z","published":"2025-01-13T02:15:15Z","title":"Sthymuli: a Static Educational Robot. Leveraging the Thymio II Platform","summary":" The use of robots in education represents a challenge for teachers and a\nfixed vision of what robots can do for students. This paper presents the\ndevelopment of Sthymuli, a static educational robot designed to explore new\nclassroom interactions between robots, students and teachers. We propose the\nuse of the Thymio II educational platform as a base, ensuring a robust\nbenchmark for a fair comparison of the commonly available wheeled robots and\nour exploratory approach with Sthymuli. 
This paper outlines the constraints and\nrequirements for developing such a robot, the current state of development and\nfuture work.\n","authors":["Manuel Bernal-Lecina","Alejandrina Hernández","Adrien Pannatier","Léa Pereyre","Francesco Mondada"],"pdf_url":"https://arxiv.org/pdf/2501.07013v1.pdf","comment":"Two pages, three figures. ICRA40 extended abstract"},{"id":"http://arxiv.org/abs/2501.06994v1","updated":"2025-01-13T01:01:44Z","published":"2025-01-13T01:01:44Z","title":"Motion Tracks: A Unified Representation for Human-Robot Transfer in\n Few-Shot Imitation Learning","summary":" Teaching robots to autonomously complete everyday tasks remains a challenge.\nImitation Learning (IL) is a powerful approach that imbues robots with skills\nvia demonstrations, but is limited by the labor-intensive process of collecting\nteleoperated robot data. Human videos offer a scalable alternative, but it\nremains difficult to directly train IL policies from them due to the lack of\nrobot action labels. To address this, we propose to represent actions as\nshort-horizon 2D trajectories on an image. These actions, or motion tracks,\ncapture the predicted direction of motion for either human hands or robot\nend-effectors. We instantiate an IL policy called Motion Track Policy (MT-pi)\nwhich receives image observations and outputs motion tracks as actions. By\nleveraging this unified, cross-embodiment action space, MT-pi completes tasks\nwith high success given just minutes of human video and limited additional\nrobot demonstrations. At test time, we predict motion tracks from two camera\nviews, recovering 6DoF trajectories via multi-view synthesis. MT-pi achieves an\naverage success rate of 86.5% across 4 real-world tasks, outperforming\nstate-of-the-art IL baselines which do not leverage human data or our action\nspace by 40%, and generalizes to scenarios seen only in human videos. 
Code and\nvideos are available on our website\nhttps://portal-cornell.github.io/motion_track_policy/.\n","authors":["Juntao Ren","Priya Sundaresan","Dorsa Sadigh","Sanjiban Choudhury","Jeannette Bohg"],"pdf_url":"https://arxiv.org/pdf/2501.06994v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.06987v1","updated":"2025-01-13T00:29:57Z","published":"2025-01-13T00:29:57Z","title":"Hand-Object Contact Detection using Grasp Quality Metrics","summary":" We propose a novel hand-object contact detection system based on grasp\nquality metrics extracted from object and hand poses, and evaluate its\nperformance using the DexYCB dataset. Our evaluation demonstrated the system's\nhigh accuracy (approaching 90%). Future work will focus on a real-time\nimplementation using vision-based estimation, and integrating it into a\nrobot-to-human handover system.\n","authors":["Akansel Cosgun","Thanh Vinh Nguyen"],"pdf_url":"https://arxiv.org/pdf/2501.06987v1.pdf","comment":"Submitted to the 2025 IEEE/ACM International Conference on\n Human-Robot Interaction (HRI'25)"},{"id":"http://arxiv.org/abs/2411.10941v2","updated":"2025-01-13T00:03:58Z","published":"2024-11-17T02:39:58Z","title":"Efficient Estimation of Relaxed Model Parameters for Robust UAV\n Trajectory Optimization","summary":" Online trajectory optimization and optimal control methods are crucial for\nenabling sustainable unmanned aerial vehicle (UAV) services, such as\nagriculture, environmental monitoring, and transportation, where available\nactuation and energy are limited. However, optimal controllers are highly\nsensitive to model mismatch, which can occur due to loaded equipment, packages\nto be delivered, or pre-existing variability in fundamental structural and\nthrust-related parameters. To circumvent this problem, optimal controllers can\nbe paired with parameter estimators to improve their trajectory planning\nperformance and perform adaptive control. 
However, UAV platforms are limited in\nterms of onboard processing power, oftentimes making nonlinear parameter\nestimation too computationally expensive to consider. To address these issues,\nwe propose a relaxed, affine-in-parameters multirotor model along with an\nefficient optimal parameter estimator. We convexify the nominal Moving Horizon\nParameter Estimation (MHPE) problem into a linear-quadratic form (LQ-MHPE) via\nan affine-in-parameter relaxation on the nonlinear dynamics, resulting in fast\nquadratic programs (QPs) that facilitate adaptive Model Predictive Control (MPC)\nin real time. We compare this approach to the equivalent nonlinear estimator in\nMonte Carlo simulations, demonstrating a decrease in average solve time and\ntrajectory optimality cost by 98.2% and 23.9-56.2%, respectively.\n","authors":["Derek Fan","David A. Copp"],"pdf_url":"https://arxiv.org/pdf/2411.10941v2.pdf","comment":"8 pages, 5 figures, to be published in IEEE Sustech 2025"},{"id":"http://arxiv.org/abs/2501.07713v1","updated":"2025-01-13T21:52:46Z","published":"2025-01-13T21:52:46Z","title":"Testing Human-Hand Segmentation on In-Distribution and\n Out-of-Distribution Data in Human-Robot Interactions Using a Deep Ensemble\n Model","summary":" Reliable detection and segmentation of human hands are critical for enhancing\nsafety and facilitating advanced interactions in human-robot collaboration.\nCurrent research predominantly evaluates hand segmentation under\nin-distribution (ID) data, which reflects the training data of deep learning\n(DL) models. However, this approach fails to address out-of-distribution (OOD)\nscenarios that often arise in real-world human-robot interactions. In this\nstudy, we present a novel approach by evaluating the performance of pre-trained\nDL models under both ID data and more challenging OOD scenarios. 
To mimic\nrealistic industrial scenarios, we designed a diverse dataset featuring simple\nand cluttered backgrounds with industrial tools, varying numbers of hands (0 to\n4), and hands with and without gloves. For OOD scenarios, we incorporated\nunique and rare conditions such as finger-crossing gestures and motion blur\nfrom fast-moving hands, addressing both epistemic and aleatoric uncertainties.\nTo ensure multiple points of view (PoVs), we utilized both egocentric cameras,\nmounted on the operator's head, and static cameras to capture RGB images of\nhuman-robot interactions. This approach allowed us to account for multiple\ncamera perspectives while also evaluating the performance of models trained on\nexisting egocentric datasets as well as static-camera datasets. For\nsegmentation, we used a deep ensemble model composed of UNet and RefineNet as\nbase learners. Performance evaluation was conducted using segmentation metrics\nand uncertainty quantification via predictive entropy. Results revealed that\nmodels trained on industrial datasets outperformed those trained on\nnon-industrial datasets, highlighting the importance of context-specific\ntraining. Although all models struggled with OOD scenarios, those trained on\nindustrial datasets demonstrated significantly better generalization.\n","authors":["Reza Jalayer","Yuxin Chen","Masoud Jalayer","Carlotta Orsenigo","Masayoshi Tomizuka"],"pdf_url":"https://arxiv.org/pdf/2501.07713v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.07705v1","updated":"2025-01-13T21:32:42Z","published":"2025-01-13T21:32:42Z","title":"Autonomous Electrochemistry Platform with Real-Time Normality Testing of\n Voltammetry Measurements Using ML","summary":" Electrochemistry workflows utilize various instruments and computing systems\nto execute workflows consisting of electrocatalyst synthesis, testing and\nevaluation tasks. 
The heterogeneity of the software and hardware of these\necosystems makes it challenging to orchestrate a complete workflow from\nproduction to characterization by automating its tasks. We propose an\nautonomous electrochemistry computing platform for a multi-site ecosystem that\nprovides the services for remote experiment steering, real-time measurement\ntransfer, and AI/ML-driven analytics. We describe the integration of a mobile\nrobot and synthesis workstation into the ecosystem by developing custom\nhub-networks and software modules to support remote operations over the\necosystem's wireless and wired networks. We describe a workflow task for\ngenerating I-V voltammetry measurements using a potentiostat, and a machine\nlearning framework to ensure their normality by detecting abnormal conditions\nsuch as disconnected electrodes. We study a number of machine learning methods\nfor the underlying detection problem, including smooth, non-smooth, structural\nand statistical methods, and their fusers. We present experimental results to\nillustrate the effectiveness of this platform, and also validate the proposed\nML method by deriving its rigorous generalization equations.\n","authors":["Anees Al-Najjar","Nageswara S. V. Rao","Craig A. 
Bridges","Sheng Dai","Alex Walters"],"pdf_url":"https://arxiv.org/pdf/2501.07705v1.pdf","comment":"10 pages, 14 figures, accepted in the IEEE 20th International\n Conference on e-Science (e-Science), 2024"},{"id":"http://arxiv.org/abs/2403.04917v3","updated":"2025-01-13T20:28:04Z","published":"2024-03-07T22:03:36Z","title":"A Mixed-Integer Conic Program for the Moving-Target Traveling Salesman\n Problem based on a Graph of Convex Sets","summary":" This paper introduces a new formulation that finds the optimum for the\nMoving-Target Traveling Salesman Problem (MT-TSP), which seeks to find a\nshortest path for an agent, that starts at a depot, visits a set of moving\ntargets exactly once within their assigned time-windows, and returns to the\ndepot. The formulation relies on the key idea that when the targets move along\nlines, their trajectories become convex sets within the space-time coordinate\nsystem. The problem then reduces to finding the shortest path within a graph of\nconvex sets, subject to some speed constraints. We compare our formulation with\nthe current state-of-the-art Mixed Integer Conic Program (MICP) solver for the\nMT-TSP. The experimental results show that our formulation outperforms the MICP\nfor instances with up to 20 targets, with up to two orders of magnitude\nreduction in runtime, and up to a 60\\% tighter optimality gap. 
We also show\nthat the solution cost from the convex relaxation of our formulation provides\nsignificantly tighter lower bounds for the MT-TSP than the ones from the MICP.\n","authors":["Allen George Philip","Zhongqiang Ren","Sivakumar Rathinam","Howie Choset"],"pdf_url":"https://arxiv.org/pdf/2403.04917v3.pdf","comment":"7 pages, 4 figures"},{"id":"http://arxiv.org/abs/2406.02365v5","updated":"2025-01-13T20:06:35Z","published":"2024-06-04T14:43:50Z","title":"Exploiting Chordal Sparsity for Fast Global Optimality with Application\n to Localization","summary":" In recent years, many estimation problems in robotics have been shown to be\nsolvable to global optimality using their semidefinite relaxations. However,\nthe runtime complexity of off-the-shelf semidefinite programming (SDP) solvers\nis up to cubic in problem size, which inhibits real-time solutions of problems\ninvolving large state dimensions. We show that for a large class of problems,\nnamely those with chordal sparsity, we can reduce the complexity of these\nsolvers to linear in problem size. In particular, we show how to replace the\nlarge positive-semidefinite variable with a number of smaller interconnected\nones using the well-known chordal decomposition. This formulation also allows\nfor the straightforward application of the alternating direction method of\nmultipliers (ADMM), which can exploit parallelism for increased scalability. We\nshow for two example problems in simulation that the chordal solvers provide a\nsignificant speed-up over standard SDP solvers, and that global optimality is\ncrucial in the absence of good initializations.\n","authors":["Frederike Dümbgen","Connor Holmes","Timothy D. Barfoot"],"pdf_url":"https://arxiv.org/pdf/2406.02365v5.pdf","comment":"21 pages, 6 figures. 
Version history: v1: initial arXiv, v2: WAFR\n submission, v3: correction, v4: WAFR conference-ready, v5: WAFR SPAR journal\n version"}],"Systems and Control":[{"id":"http://arxiv.org/abs/2501.07570v1","updated":"2025-01-13T18:57:15Z","published":"2025-01-13T18:57:15Z","title":"Digital Twin for Smart Societies: A Catalyst for Inclusive and\n Accessible Healthcare","summary":" With rapid digitization and digitalization, drawing a fine line between the\ndigital and the physical world has become nearly impossible. It has become\nessential more than ever to integrate all spheres of life into a single Digital\nThread to address pressing challenges of modern society: accessible and\ninclusive healthcare in terms of equality and equity. Techno-social\nadvancements and mutual acceptance have enabled the infusion of digital models\nto simulate social settings with minimum resource utilization to make effective\ndecisions. However, a significant gap exists in feeding back the models with\nappropriate real-time changes. In other words, active behavioral modeling of\nmodern society is lacking, influencing community healthcare as a whole. By\ncreating virtual replicas of (physical) behavioral systems, digital twins can\nenable real-time monitoring, simulation, and optimization of urban dynamics.\nThis paper explores the potential of digital twins to promote inclusive\nhealthcare for evolving smart cities. We argue that digital twins can be used\nto: Identify and address disparities in access to healthcare services,\nFacilitate community participation, Simulate the impact of urban policies and\ninterventions on different groups of people, and Aid policy-making bodies for\nbetter access to healthcare. This paper proposes several ways to use digital\ntwins to stitch the actual and virtual societies. Several discussed concepts\nwithin this framework envision an active, integrated, and synchronized\ncommunity aware of data privacy and security. 
The proposal also provides\nhigh-level step-wise transitions that will enable this transformation.\n","authors":["Joshit Mohanty","Sujatha Alla"," Vaishali","Nagesh Bheesetty","Prasanthi Chidipudi","Satya Prakash Chowdary Nandigam","Marisha Jmukhadze","Puneeth Bheesetty","Narendra Lakshmana Gowda"],"pdf_url":"https://arxiv.org/pdf/2501.07570v1.pdf","comment":"13 pages, 1 figure. This is accepted to publish at the proceedings of\n the 6th International Conference on Artificial Intelligence and Applied\n Mathematics in Engineering (ICAIAME 2024)"},{"id":"http://arxiv.org/abs/2312.15141v2","updated":"2025-01-13T18:21:03Z","published":"2023-12-23T02:34:50Z","title":"Improving the Performance of Echo State Networks Through State Feedback","summary":" Reservoir computing, using nonlinear dynamical systems, offers a\ncost-effective alternative to neural networks for complex tasks involving\nprocessing of sequential data, time series modeling, and system identification.\nEcho state networks (ESNs), a type of reservoir computer, mirror neural\nnetworks but simplify training. They apply fixed, random linear transformations\nto the internal state, followed by nonlinear changes. This process, guided by\ninput signals and linear regression, adapts the system to match target\ncharacteristics, reducing computational demands. A potential drawback of ESNs\nis that the fixed reservoir may not offer the complexity needed for specific\nproblems. While directly altering (training) the internal ESN would reintroduce\nthe computational burden, an indirect modification can be achieved by\nredirecting some output as input. This feedback can influence the internal\nreservoir state, yielding ESNs with enhanced complexity suitable for broader\nchallenges. In this paper, we demonstrate that by feeding some component of the\nreservoir state back into the network through the input, we can drastically\nimprove upon the performance of a given ESN. 
We rigorously prove that, for any\ngiven ESN, feedback will almost always improve the accuracy of the output. For\na set of three tasks, each representing different problem classes, we find that\nwith feedback the average error measures are reduced by $30\\%-60\\%$.\nRemarkably, feedback provides at least an equivalent performance boost to\ndoubling the initial number of computational nodes, a computationally expensive\nand technologically challenging alternative. These results demonstrate the\nbroad applicability and substantial usefulness of this feedback scheme.\n","authors":["Peter J. Ehlers","Hendra I. Nurdin","Daniel Soh"],"pdf_url":"https://arxiv.org/pdf/2312.15141v2.pdf","comment":"36 pages, 6 figures"},{"id":"http://arxiv.org/abs/2501.07516v1","updated":"2025-01-13T17:38:03Z","published":"2025-01-13T17:38:03Z","title":"Determining Disturbance Recovery Conditions by Inverse Sensitivity\n Minimization","summary":" Power systems naturally experience disturbances, some of which can damage\nequipment and disrupt consumers. It is important to quickly assess the likely\nconsequences of credible disturbances and take preventive action, if necessary.\nHowever, assessing the impact of potential disturbances is challenging because\nmany of the influential factors, such as loading patterns, controller settings\nand load dynamics, are not precisely known. To address this issue, the paper\nintroduces the concept of parameter-space recovery regions. For each\ndisturbance, the corresponding recovery region is the region of parameter space\nfor which the system will recover to the desired operating point. The boundary\nof the recovery region establishes the separation between parameter values that\nresult in trouble-free recovery and those that incur undesirable non-recovery.\nThe safety margin for a given set of parameter values is defined as the\nsmallest distance (in parameter space) between the given values and the\nrecovery boundary. 
Novel numerical algorithms with theoretical guarantees are\npresented for efficiently computing recovery boundaries and safety margins.\nUnlike prior methods, which tend to be overly conservative and restricted to\nlow dimensional parameter space, these methods compute safety margins to\narbitrary user-specified accuracy and do so efficiently in high dimensional\nparameter space. The efficacy of the methods is demonstrated using the IEEE\n39-bus benchmark power system, where safety margins are computed for cases that\nconsider up to 86 parameters, and reveal unexpected safety implications that\nwould not have been observed otherwise.\n","authors":["Michael W. Fisher","Ian A. Hiskens"],"pdf_url":"https://arxiv.org/pdf/2501.07516v1.pdf","comment":"9 pages"},{"id":"http://arxiv.org/abs/2501.07498v1","updated":"2025-01-13T17:16:34Z","published":"2025-01-13T17:16:34Z","title":"Computing Safety Margins of Parameterized Nonlinear Systems for\n Vulnerability Assessment via Trajectory Sensitivities","summary":" Physical systems experience nonlinear disturbances which have the potential\nto disrupt desired behavior. For a particular disturbance, whether or not the\nsystem recovers from the disturbance to a desired stable equilibrium point\ndepends on system parameter values, which are typically uncertain and\ntime-varying. Therefore, to quantify proximity to vulnerability we define the\nsafety margin to be the smallest change in parameter values from a nominal\nvalue such that the system will no longer be able to recover from the\ndisturbance. Safety margins are valuable but challenging to compute as related\nmethods, such as those for robust region of attraction estimation, are often\neither overly conservative or computationally intractable for high dimensional\nsystems. 
Recently, we developed algorithms to compute safety margins\nefficiently and non-conservatively by exploiting the large sensitivity of the\nsystem trajectory near the region of attraction boundary to small\nperturbations. Although these algorithms have enjoyed empirical success, they\nlack theoretical guarantees that would ensure their generalizability. This work\ndevelops a novel characterization of safety margins in terms of trajectory\nsensitivities, and uses this to derive well-posedness and convergence\nguarantees for these algorithms, enabling their generalizability and successful\napplication to a large class of nonlinear systems.\n","authors":["Michael W. Fisher"],"pdf_url":"https://arxiv.org/pdf/2501.07498v1.pdf","comment":"16 pages"},{"id":"http://arxiv.org/abs/2501.07476v1","updated":"2025-01-13T16:48:22Z","published":"2025-01-13T16:48:22Z","title":"Encrypted Computation of Collision Probability for Secure Satellite\n Conjunction Analysis","summary":" The computation of collision probability ($\\mathcal{P}_c$) is crucial for\nspace environmentalism and sustainability by providing decision-making\nknowledge that can prevent collisions between anthropogenic space objects.\nHowever, the accuracy and precision of $\\mathcal{P}_c$ computations is often\ncompromised by limitations in computational resources and data availability.\nWhile significant improvements have been made in the computational aspects, the\nrising concerns regarding the privacy of collaborative data sharing can be a\nmajor limiting factor in the future conjunction analysis and risk assessment,\nespecially as the space environment grows increasingly privatized, competitive,\nand fraught with conflicting strategic interests. 
This paper argues that the\nimportance of privacy measures in space situational awareness (SSA) is\nunderappreciated, and regulatory and compliance measures currently in place are\nnot sufficient by themselves, presenting a significant gap.\n To address this gap, we introduce a novel encrypted architecture that\nleverages advanced cryptographic techniques, including homomorphic encryption\n(HE) and multi-party computation (MPC), to safeguard the privacy of entities\ncomputing space sustainability metrics, inter alia, $\\mathcal{P}_c$. Our\nproposed protocol, Encrypted $\\mathcal{P}_c$, integrates the Monte Carlo\nestimation algorithm with cryptographic solutions, enabling secure collision\nprobability computation without exposing sensitive or proprietary information.\nThis research advances secure conjunction analysis by developing a secure MPC\nprotocol for $\\mathcal{P}_c$ computation and highlights the need for innovative\nprotocols to ensure a more secure and cooperative SSA landscape.\n","authors":["Jihoon Suh","Michael Hibbard","Kaoru Teranishi","Takashi Tanaka","Moriba Jah","Maruthi Akella"],"pdf_url":"https://arxiv.org/pdf/2501.07476v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.07461v1","updated":"2025-01-13T16:30:56Z","published":"2025-01-13T16:30:56Z","title":"A Linear Parameter-Varying Framework for the Analysis of Time-Varying\n Optimization Algorithms","summary":" In this paper we propose a framework to analyze iterative first-order\noptimization algorithms for time-varying convex optimization. We assume that\nthe temporal variability is caused by a time-varying parameter entering the\nobjective, which can be measured at the time of decision but whose future\nvalues are unknown. We consider the case of strongly convex objective functions\nwith Lipschitz continuous gradients and address the class of running algorithms\nwhere only one iteration per time change is performed. 
We model these\nalgorithms as discrete-time linear parameter varying (LPV) systems in feedback\nwith a time-varying gradient. We leverage the approach of analyzing algorithms\nas uncertain control interconnections with integral quadratic constraints\n(IQCs) and generalize that framework to the time-varying case. We propose novel\nIQCs that are capable of capturing the behavior of time-varying nonlinearities\nand leverage techniques from the LPV literature to establish novel bounds on\nthe tracking error. Quantitative bounds can be computed by solving a\nsemi-definite program and can be interpreted as an input-to-state stability\nresult with respect to a disturbance signal which increases with the temporal\nvariability of the problem. As a departure from results in this research area,\nour bounds introduce terms that can be interpreted as a temporal rate of change\nin the cost function and the optimal value. We exemplify our main results with\nnumerical experiments that showcase how our analysis framework is able to\ncapture convergence rates of different first-order algorithms for time-varying\noptimization through the choice of IQC and rate bounds.\n","authors":["Fabian Jakob","Andrea Iannelli"],"pdf_url":"https://arxiv.org/pdf/2501.07461v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.06782v2","updated":"2025-01-13T14:11:49Z","published":"2024-11-11T08:19:54Z","title":"QuadWBG: Generalizable Quadrupedal Whole-Body Grasping","summary":" Legged robots with advanced manipulation capabilities have the potential to\nsignificantly improve household duties and urban maintenance. Despite\nconsiderable progress in developing robust locomotion and precise manipulation\nmethods, seamlessly integrating these into cohesive whole-body control for\nreal-world applications remains challenging. In this paper, we present a\nmodular framework for robust and generalizable whole-body loco-manipulation\ncontroller based on a single arm-mounted camera. 
By using reinforcement\nlearning (RL), we enable a robust low-level policy for command execution over 5\ndimensions (5D) and a grasp-aware high-level policy guided by a novel metric,\nGeneralized Oriented Reachability Map (GORM). The proposed system achieves\nstate-of-the-art one-time grasping accuracy of 89% in the real world, including\nchallenging tasks such as grasping transparent objects. Through extensive\nsimulations and real-world experiments, we demonstrate that our system can\neffectively manage a large workspace, from floor level to above body height,\nand perform diverse whole-body loco-manipulation tasks.\n","authors":["Jilong Wang","Javokhirbek Rajabov","Chaoyi Xu","Yiming Zheng","He Wang"],"pdf_url":"https://arxiv.org/pdf/2411.06782v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.07273v1","updated":"2025-01-13T12:36:11Z","published":"2025-01-13T12:36:11Z","title":"An Extended Survey and a Comparison Framework for Dataflow Models of\n Computation and Communication","summary":" Dataflow Model of Computation and Communications (DF MoCCs) is a formalism\nused to specify the behavior of Cyber-Physical Systems (CPSs). DF MoCCs are\nwidely used in the design of CPSs, as they provide a high-level of abstraction\nto specify the system's behavior. DF MoCCs rules give semantics to a dataflow\nspecification of a CPS, and static analysis algorithms rely on these semantics\nto guarantee safety properties of the dataflow specification, such as bounded\nmemory usage and deadlock freeness. A wide range of DF MoCCs exists, each with\nits own characteristics and static analyses. This paper presents a survey of\nthose DF MoCCs and a classification in eight categories. In addition, DF MoCCs\nare characterized by a comprehensive list of features and static analyses,\nwhich reflect their expressiveness and analyzability. 
Based on this\ncharacterization, a framework is proposed to compare the expressiveness and the\nanalyzability of DF MoCCs quantitatively.\n","authors":["Guillaume Roumage","Selma Azaiez","Cyril Faure","Stéphane Louise"],"pdf_url":"https://arxiv.org/pdf/2501.07273v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.07247v1","updated":"2025-01-13T11:55:04Z","published":"2025-01-13T11:55:04Z","title":"Interpretable machine-learning for predicting molecular weight of PLA\n based on artificial bee colony optimization algorithm and adaptive neurofuzzy\n inference system","summary":" This article discusses the integration of the Artificial Bee Colony (ABC)\nalgorithm with two supervised learning methods, namely Artificial Neural\nNetworks (ANNs) and Adaptive Network-based Fuzzy Inference System (ANFIS), for\nfeature selection from Near-Infrared (NIR) spectra for predicting the molecular\nweight of medical-grade Polylactic Acid (PLA). During extrusion processing of\nPLA, in-line NIR spectra were captured along with extrusion process and machine\nsetting data. With a dataset comprising 63 observations and 512 input features,\nappropriate machine learning tools are essential for interpreting data and\nselecting features to improve prediction accuracy. Initially, the ABC\noptimization algorithm is coupled with ANN/ANFIS to forecast PLA molecular\nweight. The objective functions of the ABC algorithm are to minimize the root\nmean square error (RMSE) between experimental and predicted PLA molecular\nweights while also minimizing the number of input features. Results indicate\nthat employing ABC-ANFIS yields the lowest RMSE of 282 Da and identifies four\nsignificant parameters (NIR wavenumbers 6158 cm-1, 6310 cm-1, 6349 cm-1, and\nmelt temperature) for prediction. 
These findings demonstrate the effectiveness\nof using the ABC algorithm with ANFIS for selecting a minimal set of features\nto predict PLA molecular weight with high accuracy during processing\n","authors":["Amir Pouya Masoumi","Leo Creedon","Ramen Ghosh","Nimra Munir","Ross McMorrow","Marion McAfee"],"pdf_url":"https://arxiv.org/pdf/2501.07247v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.07191v1","updated":"2025-01-13T10:38:12Z","published":"2025-01-13T10:38:12Z","title":"Pre-Trained Large Language Model Based Remaining Useful Life Transfer\n Prediction of Bearing","summary":" Accurately predicting the remaining useful life (RUL) of rotating machinery,\nsuch as bearings, is essential for ensuring equipment reliability and\nminimizing unexpected industrial failures. Traditional data-driven deep\nlearning methods face challenges in practical settings due to inconsistent\ntraining and testing data distributions and limited generalization for\nlong-term predictions.\n","authors":["Laifa Tao","Zhengduo Zhao","Xuesong Wang","Bin Li","Wenchao Zhan","Xuanyuan Su","Shangyu Li","Qixuan Huang","Haifei Liu","Chen Lu","Zhixuan Lian"],"pdf_url":"https://arxiv.org/pdf/2501.07191v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.07187v1","updated":"2025-01-13T10:35:23Z","published":"2025-01-13T10:35:23Z","title":"Real-time Mode-Aware Dataflow: A Dataflow Model to Specify and Analyze\n Mode-dependent CPSs under Relaxed Timing Constraints","summary":" Modern Cyber-Physical Systems (CPS) often exhibit both relaxed real-time\nconstraints and a mode-dependent execution. Relaxed real-time constraints mean\nthat only a subset of the processes of a CPS have real-time constraints, and a\nmode-dependent CPS has conditional execution branches. Static analysis tools,\nsuch as the PolyGraph model (a formalism extending the Cyclo-Static Dataflow\nmodel with real-time constraints), can specify and analyze systems with relaxed\nreal-time constraints. 
However, PolyGraph is limited in its ability to specify\nand analyze mode-dependent CPSs. This paper extends PolyGraph with routing\nactors, yielding the Routed PolyGraph model. This model is further extended to\nthe Real-time Mode-Aware Dataflow (RMDF), which both leverages routing actors\nand incorporates a new dataflow actor to specify mode-dependent CPSs under\nrelaxed real-time constraints. This paper also extends the static analyses of\nPolyGraph to RMDF. We showcase the application of RMDF with a specification and\nan analysis (derivation of timing constraints at the job-level and a\nfeasibility test) of the vision processing system of the Ingenuity Mars\nhelicopter.\n","authors":["Guillaume Roumage","Selma Azaiez","Cyril Faure","Stéphane Louise"],"pdf_url":"https://arxiv.org/pdf/2501.07187v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.07180v1","updated":"2025-01-13T10:19:30Z","published":"2025-01-13T10:19:30Z","title":"Evaluating Robotic Approach Techniques for the Insertion of a Straight\n Instrument into a Vitreoretinal Surgery Trocar","summary":" Advances in vitreoretinal robotic surgery enable precise techniques for gene\ntherapies. This study evaluates three robotic approaches using the 7-DoF\nrobotic arm for docking a micro-precise tool to a trocar: fully co-manipulated,\nhybrid co-manipulated/teleoperated, and hybrid with camera assistance. The\nfully co-manipulated approach was the fastest but had a 42% success rate.\nHybrid methods showed higher success rates (91.6% and 100%) and completed tasks\nwithin 2 minutes. 
NASA Task Load Index (TLX) assessments indicated lower\nphysical demand and effort for hybrid approaches.\n","authors":["Ross Henry","Martin Huber","Anestis Mablekos-Alexiou","Carlo Seneci","Mohamed Abdelaziz","Hans Natalius","Lyndon da Cruz","Christos Bergeles"],"pdf_url":"https://arxiv.org/pdf/2501.07180v1.pdf","comment":"2 Pages, 2 Figures, 1 Table"},{"id":"http://arxiv.org/abs/2501.07148v1","updated":"2025-01-13T09:22:17Z","published":"2025-01-13T09:22:17Z","title":"Implementing LoRa MIMO System for Internet of Things","summary":" Bandwidth constraints limit LoRa implementations. Contemporary IoT\napplications require higher throughput than that provided by LoRa. This work\nintroduces a LoRa Multiple Input Multiple Output (MIMO) system and a spatial\nmultiplexing algorithm to address LoRa's bandwidth limitation. The transceivers\nin the proposed approach modulate the signals on distinct frequencies of the\nsame LoRa band. A Frequency Division Multiplexing (FDM) method is used at the\ntransmitters to provide a wider MIMO channel. Unlike conventional Orthogonal\nFrequency Division Multiplexing (OFDM) techniques, this work exploits the\northogonality of the LoRa signals facilitated by its proprietary Chirp Spread\nSpectrum (CSS) modulation to perform an OFDM in the proposed LoRa MIMO system.\nBy varying the Spreading Factor (SF) and bandwidth of LoRa signals, orthogonal\nsignals can transmit on the same frequency irrespective of the FDM. Even though\nthe channel correlation is minimal for different spreading factors and\nbandwidths, different Carrier Frequencies (CF) ensure the signals do not\noverlap and provide additional degrees of freedom. This work assesses the\nproposed model's performance and conducts an extensive analysis to provide an\noverview of resources consumed by the proposed system. 
Finally, this work\nprovides the detailed results of a thorough evaluation of the model on test\nhardware.\n","authors":["Atonu Ghosh","Sharath Chandan","Sudip Misra"],"pdf_url":"https://arxiv.org/pdf/2501.07148v1.pdf","comment":"8 pages, 7 figures"},{"id":"http://arxiv.org/abs/2405.04287v3","updated":"2025-01-13T09:06:54Z","published":"2024-05-07T12:58:37Z","title":"Asymmetry of Frequency Distribution in Power Systems: Sources,\n Estimation, Impact and Control","summary":" This paper analyses an emerging real-world phenomena in inverter-based\nrenewable-dominated power systems, namely, asymmetry of frequency distribution.\nThe paper first provides a rationale on why asymmetry reduces the \"quality\" of\nthe frequency control and system operation. Then it provides qualitative\ntheoretical insights that explain asymmetry in terms of the nonlinearity of\nreal-world power systems and associated models. In particular network losses\nand pitch angle-based frequency control of wind power plants are discussed.\nThen the paper proposes a nonlinear compensation control to reduce the\nasymmetry as well as a statistical metric based on the frequency probability\ndistribution to quantify the level of asymmetry in a power system. 
Real-world\ndata obtained from the Irish and Australian transmission systems serve to\nsupport the theoretical appraisal, whereas simulations based on an IEEE\nbenchmark system show the effectiveness of the proposed nonlinear compensation.\nThe case study also shows that, while automatic generation control reduces\nasymmetry, frequency control limits and droop-based frequency support provided\nby wind generation using a tight deadband of 15 mHz, namely active power\ncontrol, leads to a significant increase in the asymmetry of the frequency\nprobability distribution.\n","authors":["Taulant Kerci","Federico Milano"],"pdf_url":"https://arxiv.org/pdf/2405.04287v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.07126v1","updated":"2025-01-13T08:30:09Z","published":"2025-01-13T08:30:09Z","title":"A Federated Deep Learning Framework for Cell-Free RSMA Networks","summary":" Next-generation wireless networks are poised to benefit significantly from\nthe integration of three key technologies (KTs): Rate-Splitting Multiple Access\n(RSMA), cell-free architectures, and federated learning. Each of these\ntechnologies offers distinct advantages in terms of security, robustness, and\ndistributed structure. In this paper, we propose a novel cell-free network\narchitecture that incorporates RSMA and employs machine learning techniques\nwithin a federated framework. This combination leverages the strengths of each\nKT, creating a synergistic effect that maximizes the benefits of security,\nrobustness, and distributed structure. We formally formulate the access point\n(AP) selection and precoder design for max-min rate optimization in a cell-free\nMIMO RSMA network. Our proposed solution scheme involves a three-block\nprocedure. The first block trains deep reinforcement learning (DRL) neural\nnetworks to obtain RSMA precoders, assuming full connectivity between APs and\nuser equipments (UEs). 
The second block uses these precoders and principal\ncomponent analysis (PCA) to assign APs to UEs by removing a subset of AP-UE\nconnections. The final block fine-tunes the RSMA precoders by incorporating the\nassociated APs into a second DRL network. To leverage the distributed nature of\nthe cell-free network, this process is implemented in a Federated Deep\nReinforcement Learning (FDRL) structure operating through the cooperation of\nAPs and a central processing unit (CPU). Simulation results demonstrate that\nthe proposed FDRL approach performs comparably to a benchmark centralized DRL\nscheme. Our FDRL approach provides a balanced trade-off, maintaining high\nperformance with enhanced security and reduced processing demands.\n","authors":["S. Ali Mousavi","Mehdi Monemi","Reza Mohseni","Matti Latva-aho"],"pdf_url":"https://arxiv.org/pdf/2501.07126v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.09385v2","updated":"2025-01-13T05:46:45Z","published":"2024-07-12T16:06:07Z","title":"Cost-optimized probabilistic maintenance for condition monitoring of\n wind turbines with rare failures","summary":" We propose a method, a model, and a form of presenting model results for\ncondition monitoring of a small set of wind turbines with rare failures. The\nmain new ingredient of the method is to sample failure thresholds according to\nthe profit they give to an operating company. The model is a multiple linear\nregression with seasonal components and external regressors, representing all\nsensor components except for the selected one. To overcome the scarcity of the\ntraining data, we use the median sensor values from all available turbines in\ntheir healthy state. The cumulated deviation from the normal behavior model\nobtained for this median turbine is calibrated for each turbine at the\nbeginning of the test period and after known failures. 
The proposed form of\npresenting results is to set a scale for possible costs, control for random\nmaintenance, and show a whole distribution of costs depending on the free model\nparameters. We make a case study on an open dataset with SCADA data from\nmultiple sensors and show that considering the influence of turbine components\nis more critical than seasonality. The distribution, the average, and the\nstandard deviation of maintenance costs can be very different for similar\nminimal costs. Random maintenance can be more profitable than reactive\nmaintenance and other approaches. Our predictive maintenance model outperforms\nrandom maintenance and competitors for the whole set of considered turbines,\ngiving substantial savings.\n","authors":["Viktor Begun","Ulrich Schlickewei"],"pdf_url":"https://arxiv.org/pdf/2407.09385v2.pdf","comment":"Improved and finally accepted journal version"},{"id":"http://arxiv.org/abs/2410.08147v6","updated":"2025-01-13T05:29:14Z","published":"2024-10-10T17:31:36Z","title":"The Bouc-Wen Model for Binary Direct Collinear Collisions of Convex\n Viscoplastic Bodies","summary":" We study mathematical models of binary direct collinear collisions of convex\nviscoplastic bodies based on two incremental collision laws that employ the\nBouc-Wen differential model of hysteresis to represent the elastoplastic\nbehavior of the materials of the colliding bodies. These collision laws are the\nBouc-Wen-Simon-Hunt-Crossley Collision Law (BWSHCCL) and the Bouc-Wen-Maxwell\nCollision Law (BWMCL). The BWSHCCL comprises the Bouc-Wen model amended with\na nonlinear Hertzian elastic spring element and connected in parallel to a\nnonlinear displacement-dependent and velocity-dependent energy dissipation\nelement. The BWMCL comprises the Bouc-Wen model amended with a nonlinear\nHertzian elastic spring element and connected in series to a linear\nvelocity-dependent energy dissipation element. 
The mathematical models of the\ncollision process are presented in the form of finite-dimensional initial value\nproblems. We show that the models possess favorable analytical properties\n(e.g., global existence, uniqueness, and boundedness of the solutions) under\nsuitable restrictions on the values of their parameters. Furthermore, based on\nthe results of two model parameter identification studies, we demonstrate that\ngood agreement can be attained between experimental data and numerical\napproximations of the behavior of the mathematical models across a wide range\nof initial relative velocities of the colliding bodies while using\nparameterizations of the models that are independent of the initial relative\nvelocity.\n","authors":["Mihails Milehins","Dan B. Marghitu"],"pdf_url":"https://arxiv.org/pdf/2410.08147v6.pdf","comment":"15 pages; 5 figures; (v1-v5) a variety of amendments; (v6) updated\n scaling/nondimensionalization and introduced amendments based on external\n feedback; the associated code/data are available from\n https://gitlab.com/user9716869/BWBCL"},{"id":"http://arxiv.org/abs/2407.21533v2","updated":"2025-01-13T05:12:56Z","published":"2024-07-31T11:39:10Z","title":"Data Requirements and Prediction Scaling for Long-Term Failure Forecasts\n in Wind Turbines","summary":" We investigate the key factors that enable early failure forecasting in wind\nturbines. For this purpose, we analyze studies with long-term forecasts and\ncompare their main features: prediction time, methods, targeted components,\ndataset size, and check the effect of using additional sensors. We found that\nthe size of the dataset is the main factor and that an approximate linear\nscaling holds: the number of forecast days is twice the size of the dataset,\nmeasured in turbine years. 
We also observe that the data allow us to quantify\nthe meaning of \"big\" and \"long\" in the terms \"big data\" and \"long-term\"\nforecasts, which are found to be ten turbine years and two weeks.\n","authors":["Viktor Begun","Ulrich Schlickewei"],"pdf_url":"https://arxiv.org/pdf/2407.21533v2.pdf","comment":"Improved the text and figure, updated the references"},{"id":"http://arxiv.org/abs/2501.07057v1","updated":"2025-01-13T04:31:31Z","published":"2025-01-13T04:31:31Z","title":"Optimization with Multi-sourced Reference Information and Unknown Trust:\n A Distributionally Robust Approach","summary":" In problems that involve input parameter information gathered from multiple\ndata sources with varying reliability, incorporating users' trust about\ndifferent sources in decision-optimization models can potentially improve\nsolution performance and reliability. In this work, we propose a novel\nmulti-reference distributionally robust optimization (MR-DRO) framework, where\nthe model inputs are uncertain and their probability distributions can be\nstatistically inferred from multiple data sources. Via nonparametric data\nfusion, we construct a Wasserstein ambiguity set to minimize the worst-case\nexpected value of a stochastic objective function, accounting for both\nuncertainty and unknown reliability of information sources. We reformulate the\nMR-DRO model as a linear program given linear objective and constraints in the\noriginal problem. We also incorporate a dynamic trust update mechanism that\nadjusts the trust for each source based on its performance over time. In\naddition, we introduce the concept of probability dominance to identify sources\nwith dominant trust. Via solving instances of resource allocation and portfolio\noptimization, we demonstrate the effectiveness of the trust-informed MR-DRO\napproach compared to traditional optimization frameworks relying on a single\ndata source. 
Our results highlight the significance of integrating (dynamic)\nuser trust in decision making under uncertainty, particularly when given\ndiverse and potentially conflicting input data.\n","authors":["Yanru Guo","Ruiwei Jiang","Siqian Shen"],"pdf_url":"https://arxiv.org/pdf/2501.07057v1.pdf","comment":"38 pages, 9 figures, 7 tables"},{"id":"http://arxiv.org/abs/2501.07030v1","updated":"2025-01-13T03:02:15Z","published":"2025-01-13T03:02:15Z","title":"Erasing Noise in Signal Detection with Diffusion Model: From Theory to\n Application","summary":" In this paper, a signal detection method based on the denoise diffusion model\n(DM) is proposed, which outperforms the maximum likelihood (ML) estimation\nmethod that has long been regarded as the optimal signal detection technique.\nTheoretically, a novel mathematical theory for intelligent signal detection\nbased on stochastic differential equations (SDEs) is established in this paper,\ndemonstrating the effectiveness of DM in reducing the additive white Gaussian\nnoise in received signals. Moreover, a mathematical relationship between the\nsignal-to-noise ratio (SNR) and the timestep in DM is established, revealing\nthat for any given SNR, a corresponding optimal timestep can be identified.\nFurthermore, to address potential issues with out-of-distribution inputs in the\nDM, we employ a mathematical scaling technique that allows the trained DM to\nhandle signal detection across a wide range of SNRs without any fine-tuning.\nBuilding on the above theoretical foundation, we propose a DM-based signal\ndetection method, with the diffusion transformer (DiT) serving as the backbone\nneural network, whose computational complexity is\n$\\mathcal{O}(n^2)$. 
Simulation results demonstrate that, for BPSK and QAM\nmodulation schemes, the DM-based method achieves a significantly lower symbol\nerror rate (SER) compared to ML estimation, while maintaining a much lower\ncomputational complexity.\n","authors":["Xiucheng Wang","Peilin Zheng","Nan Cheng"],"pdf_url":"https://arxiv.org/pdf/2501.07030v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.07026v1","updated":"2025-01-13T02:59:03Z","published":"2025-01-13T02:59:03Z","title":"IEEE_TIE25: Analysis and Synthesis of DOb-based Robust Motion\n Controllers","summary":" By employing a unified state-space design framework, this paper proposes a\nnovel systematic analysis and synthesis method that facilitates the\nimplementation of both conventional zero-order (ZO) and high-order (HO) DObs.\nFurthermore, this design method supports the development of advanced DObs\n(e.g., the proposed High-Performance (HP) DOb in this paper), enabling more\naccurate disturbance estimation and, consequently, enhancing the robust\nstability and performance of motion control systems. Lyapunov direct method is\nemployed in the discrete-time domain to analyse the stability of the proposed\ndigital robust motion controllers. The analysis demonstrates that the proposed\nDObs are stable in the sense that the estimation error is uniformly ultimately\nbounded when subjected to bounded disturbances. Additionally, they are proven\nto be asymptotically stable under specific disturbance conditions, such as\nconstant disturbances for the ZO and HP DObs. Stability constraints on the\ndesign parameters of the DObs are analytically derived, providing effective\nsynthesis tools for the implementation of the digital robust motion\ncontrollers. The discrete-time analysis facilitates the derivation of more\npractical design constraints. 
The proposed analysis and synthesis methods have\nbeen rigorously validated through experimental evaluations, confirming their\neffectiveness.\n","authors":["Emre Sariyildiz"],"pdf_url":"https://arxiv.org/pdf/2501.07026v1.pdf","comment":"IEEE Transactions on Industrial Electronics 2025"},{"id":"http://arxiv.org/abs/2501.07005v1","updated":"2025-01-13T01:49:17Z","published":"2025-01-13T01:49:17Z","title":"Global Search for Optimal Low Thrust Spacecraft Trajectories using\n Diffusion Models and the Indirect Method","summary":" Long time-duration low-thrust nonlinear optimal spacecraft trajectory global\nsearch is a computationally and time expensive problem characterized by\nclustering patterns in locally optimal solutions. During preliminary mission\ndesign, mission parameters are subject to frequent changes, necessitating that\ntrajectory designers efficiently generate high-quality control solutions for\nthese new scenarios. Generative machine learning models can be trained to learn\nhow the solution structure varies with respect to a conditional parameter,\nthereby accelerating the global search for missions with updated parameters. In\nthis work, state-of-the-art diffusion models are integrated with the indirect\napproach for trajectory optimization within a global search framework. This\nframework is tested on two low-thrust transfers of different complexity in the\ncircular restricted three-body problem. By generating and analyzing a training\ndata set, we develop mathematical relations and techniques to understand the\ncomplex structures in the costate domain of locally optimal solutions for these\nproblems. A diffusion model is trained on this data and successfully\naccelerates the global search for both problems. The model predicts how the\ncostate solution structure changes, based on the maximum spacecraft thrust\nmagnitude. 
Warm-starting a numerical solver with diffusion model samples for\nthe costates at the initial time increases the number of solutions generated\nper minute for problems with unseen thrust magnitudes by one to two orders of\nmagnitude in comparison to samples from a uniform distribution and from an\nadjoint control transformation.\n","authors":["Jannik Graebner","Ryne Beeson"],"pdf_url":"https://arxiv.org/pdf/2501.07005v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.10941v2","updated":"2025-01-13T00:03:58Z","published":"2024-11-17T02:39:58Z","title":"Efficient Estimation of Relaxed Model Parameters for Robust UAV\n Trajectory Optimization","summary":" Online trajectory optimization and optimal control methods are crucial for\nenabling sustainable unmanned aerial vehicle (UAV) services, such as\nagriculture, environmental monitoring, and transportation, where available\nactuation and energy are limited. However, optimal controllers are highly\nsensitive to model mismatch, which can occur due to loaded equipment, packages\nto be delivered, or pre-existing variability in fundamental structural and\nthrust-related parameters. To circumvent this problem, optimal controllers can\nbe paired with parameter estimators to improve their trajectory planning\nperformance and perform adaptive control. However, UAV platforms are limited in\nterms of onboard processing power, oftentimes making nonlinear parameter\nestimation too computationally expensive to consider. To address these issues,\nwe propose a relaxed, affine-in-parameters multirotor model along with an\nefficient optimal parameter estimator. We convexify the nominal Moving Horizon\nParameter Estimation (MHPE) problem into a linear-quadratic form (LQ-MHPE) via\nan affine-in-parameter relaxation on the nonlinear dynamics, resulting in fast\nquadratic programs (QPs) that facilitate adaptive Model Predictive Control (MPC)\nin real time. 
We compare this approach to the equivalent nonlinear estimator in\nMonte Carlo simulations, demonstrating a decrease in average solve time and\ntrajectory optimality cost by 98.2% and 23.9-56.2%, respectively.\n","authors":["Derek Fan","David A. Copp"],"pdf_url":"https://arxiv.org/pdf/2411.10941v2.pdf","comment":"8 pages, 5 figures, to be published in IEEE Sustech 2025"},{"id":"http://arxiv.org/abs/2501.07743v1","updated":"2025-01-13T23:14:15Z","published":"2025-01-13T23:14:15Z","title":"The Reliability of Remotely Piloted Aircraft System Performance under\n Communication Loss and Latency Uncertainties","summary":" Mission-critical use of highly maneuverable Remotely Piloted Aircraft Systems\n(RPAS) requires a thorough understanding of the reliability of their\ncommunication systems. Investigations into system-level performance under\nstochastic aviation communication conditions are critical for estimating\nmission success rates and assessing the risks associated with integrating RPAS\ninto existing airspace, ensuring overall aviation safety. This study aims to\nquantify the impact of communication latency and complete signal loss on the\nmission completion performance of a highly maneuverable RPAS. The mission is\ndefined as a static waypoint tracking task in three-dimensional airspace. We\nstart with examining and deriving mathematical formulations of key reliability\nmetrics of Required Communication Performance (RCP). These stochastic factors\nare then embedded into flight control simulations (i.e., communication\navailability and latency) to examine the system behavior. Lastly, we generate\nmission success rate and mission completion time envelopes through extensive\nmultiprocessing Monte Carlo simulations through high-performance computing. We\ndiscover a drastic deterioration in flight performance while latency or\navailability erodes the stability margin. 
In addition, we propose a new\nreliability metric, namely \textit{communicability}, which integrates three key\nRCP metrics and helps in understanding the maximum tolerable latency to flight\ncontrol. The procedure and results obtained from this research inform engineers\ndesigning RPAS with a better trade-off between communication capability and\nflight control performance. Future work includes exploring alternative flight\nsimulators (e.g., nonlinear dynamic inversion) with other missions (e.g.,\ndynamic waypoint following), or developing delay-compensated optimal controls. The\nanalysis of the stability margin is also desired for theoretical verification.\n","authors":["Yutian Pang","Andrew Paul Kendall","John-Paul Clarke"],"pdf_url":"https://arxiv.org/pdf/2501.07743v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.07715v1","updated":"2025-01-13T22:06:06Z","published":"2025-01-13T22:06:06Z","title":"Analyzing the Role of the DSO in Electricity Trading of VPPs via a\n Stackelberg Game Model","summary":" The increasing penetration of distributed energy resources (DER) has sparked\ninterest in promoting their participation in the power market. Here we consider\na setting in which different virtual power plants (VPPs) with certain flexible\nresources take part in electricity trading, either by direct participation in\nthe wholesale power market, or interfaced by the Distribution System Operator\n(DSO). Our goal is to examine the role and influence of the DSO as a\nstakeholder, for which we formulate a Stackelberg game via a bilevel\noptimization model: the DSO maximizes profits at the upper level, while VPPs\nminimize operating costs at the lower level. To solve this problem, we use the\nKarush-Kuhn-Tucker optimality conditions of the convex lower-level problems to\nobtain a single-level mixed-integer nonlinear program. 
The results show that\nthe role of the DSO as an intermediary agent leads to a decrease in operating\ncosts for the VPPs, while guaranteeing a profit for the DSO.\n","authors":["Peng Wang","Xi Zhang","Luis Badesa"],"pdf_url":"https://arxiv.org/pdf/2501.07715v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.07714v1","updated":"2025-01-13T21:55:48Z","published":"2025-01-13T21:55:48Z","title":"Koopman Meets Limited Bandwidth: Effect of Quantization on Data-Driven\n Linear Prediction and Control of Nonlinear Systems","summary":" Koopman-based lifted linear identification have been widely used for\ndata-driven prediction and model predictive control (MPC) of nonlinear systems.\nIt has found applications in flow-control, soft robotics, and unmanned aerial\nvehicles (UAV). For autonomous systems, this system identification method works\nby embedding the nonlinear system in a higher-dimensional linear space and\ncomputing a finite-dimensional approximation of the corresponding Koopman\noperator with the Extended Dynamic Mode Decomposition (EDMD) algorithm. EDMD is\na data-driven algorithm that estimates an approximate linear system by lifting\nthe state data-snapshots via nonlinear dictionary functions. For control\nsystems, EDMD is further modified to utilize both state and control\ndata-snapshots to estimate a lifted linear predictor with control input. This\narticle investigates how the estimation process is affected when the data is\nquantized. Specifically, we examine the fundamental connection between\nestimates of the linear predictor matrices obtained from unquantized data and\nthose from quantized data via modified EDMD. Furthermore, using the law of\nlarge numbers, we demonstrate that, under a large data regime, the quantized\nestimate can be considered a regularized version of the unquantized estimate.\nWe also explore the relationship between the two estimates in the finite data\nregime. 
We further analyze the effect of nonlinear lifting functions on this\nregularization due to quantization. The theory is validated through repeated\nnumerical experiments conducted on several control systems. The effect of\nquantization on the MPC performance is also demonstrated.\n","authors":["Shahab Ataei","Dipankar Maity","Debdipta Goswami"],"pdf_url":"https://arxiv.org/pdf/2501.07714v1.pdf","comment":"15 pages, 4 figures. arXiv admin note: text overlap with\n arXiv:2410.02803"},{"id":"http://arxiv.org/abs/2402.06108v2","updated":"2025-01-13T21:09:51Z","published":"2024-02-09T00:05:28Z","title":"United We Fall: On the Nash Equilibria of Multiplex and Multilayer\n Network Games","summary":" Network games provide a framework to study strategic decision making\nprocesses that are governed by structured interdependencies among agents.\nHowever, existing models do not account for environments in which agents\nsimultaneously interact over multiple networks, or when agents operate over\nmultiple action dimensions. In this paper, we propose new models of multiplex\nnetwork games to capture the different modalities of interactions among\nstrategic agents, and multilayer network games to capture their interactions\nover multiple action dimensions. We explore how the properties of the\nconstituent networks of a multiplex/multilayer network can undermine or support\nthe existence, uniqueness, and stability of the game's Nash equilibria.\nNotably, we highlight that both the largest and smallest eigenvalues of the\nconstituent networks (reflecting their connectivity and two-sidedness,\nrespectively) are instrumental in determining the uniqueness of the\nmultiplex/multilayer network game's equilibrium. 
Together, our findings shed\nlight on the reasons for the fragility of equilibria when agents interact over\nnetworks of networks, and point out potential interventions to alleviate them.\n","authors":["Raman Ebrahimi","Parinaz Naghizadeh"],"pdf_url":"https://arxiv.org/pdf/2402.06108v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.07689v1","updated":"2025-01-13T21:05:04Z","published":"2025-01-13T21:05:04Z","title":"Real-Time Outlier Connections Detection in Databases Network Traffic","summary":" The article describes a practical method for detecting outlier database\nconnections in real-time. Outlier connections are detected with a specified\nlevel of confidence. The method is based on generalized security rules and a\nsimple but effective real-time machine learning mechanism. The described method\nis non-intrusive to the database and does not depend on the type of database.\nThe method is used to proactively control access even before database\nconnection is established, minimize false positives, and maintain the required\nresponse speed to detected database connection outliers. The capabilities of\nthe system are demonstrated with several examples of outliers in real-world\nscenarios.\n","authors":["Leonid Rodniansky","Tania Butovsky","Mikhail Shpak"],"pdf_url":"https://arxiv.org/pdf/2501.07689v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.07982v2","updated":"2025-01-13T20:44:40Z","published":"2024-12-10T23:54:44Z","title":"Data-Driven Assessment of Vehicle-to-Grid Capabilities in Supporting\n Grid During Emergencies: Case Study of Travis County, TX","summary":" As extreme weather events become more common and threaten power grids, the\ncontinuing adoption of electric vehicles (EVs) introduces a growing opportunity\nfor their use as a distributed energy storage resource. This energy storage can\nbe used as backup generation through the use of vehicle-to-grid (V2G)\ntechnology, where electricity is sent back from EV batteries to the grid. 
With\nenough participation from EV owners, V2G can mitigate outages during grid\nemergencies. In order to investigate a practical application of V2G, this study\nleverages a vast array of real-world data, such as survey results on V2G\nparticipation willingness, historical outage data within ERCOT, current EV\nregistrations, and demographic data. This data informs realistic emergency grid\nscenarios with V2G support using a synthetic transmission grid for Travis\nCounty. The results find that as EV ownership rises in the coming years, the\nsimultaneous facilitation of bidirectional charging availability would allow\nfor V2G to play a substantial role in preventing involuntary load shed as a\nresult of emergencies like winter storms.\n","authors":["Kelsey Nelson","Javad Mohammadi"],"pdf_url":"https://arxiv.org/pdf/2412.07982v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.15728v2","updated":"2025-01-13T19:46:20Z","published":"2024-02-24T05:36:14Z","title":"Design and Implementation of Low-Cost Electric Vehicles (Evs)\n Supercharger: A Comprehensive Review","summary":" This article presents a probabilistic modeling method utilizing smart meter\ndata and an innovative agent-based simulator for electric vehicles (EVs). The\naim is to assess the effects of different cost-driven EV charging strategies on\nthe power distribution network (PDN). We investigate the effects of a 40% EV\nadoption on three parts of Frederiksberg's low voltage distribution network\n(LVDN), a densely urbanized municipality in Denmark. Our findings indicate that\ncable and transformer overloading especially pose a challenge. However, the\nimpact of EVs varies significantly between each LVDN area and charging\nscenario. Across scenarios and LVDNs, the share of cables facing congestion\nranges between 5% and 60%. It is also revealed that time-of-use (ToU)-based and\nsingle-day cost-minimized charging could be beneficial for LVDNs with moderate\nEV adoption rates. 
In contrast, multiple-day optimization will likely lead to\nsevere congestion, as such strategies concentrate demand on a single day that\nwould otherwise be distributed over several days, thus raising concerns about\nhow to prevent it. The broader implications of our research suggest that,\ndespite initial worries primarily centered on congestion due to unregulated\ncharging during peak hours, a transition to cost-based smart charging,\npropelled by an increasing awareness of time-dependent electricity prices, may\nlead to a significant rise in charging synchronization, bringing about\nundesirable consequences for the power distribution network (PDN).\n","authors":["Md Khaledur Rahman","Faysal Amin Tanvir","Md Saiful Islam","Md Shameem Ahsan","Manam Ahmed"],"pdf_url":"https://arxiv.org/pdf/2402.15728v2.pdf","comment":"arXiv admin note: This work has been withdrawn by arXiv\n administrators due to inappropriate text reuse from external sources"},{"id":"http://arxiv.org/abs/2501.07652v1","updated":"2025-01-13T19:24:14Z","published":"2025-01-13T19:24:14Z","title":"Finite Sample Identification of Partially Observed Bilinear Dynamical\n Systems","summary":" We consider the problem of learning a realization of a partially observed\nbilinear dynamical system (BLDS) from noisy input-output data. Given a single\ntrajectory of input-output samples, we provide a finite time analysis for\nlearning the system's Markov-like parameters, from which a balanced realization\nof the bilinear system can be obtained. Our bilinear system identification\nalgorithm learns the system's Markov-like parameters by regressing the outputs\nto highly correlated, nonlinear, and heavy-tailed covariates. Moreover, the\nstability of BLDS depends on the sequence of inputs used to excite the system.\nThese properties, unique to partially observed bilinear dynamical systems, pose\nsignificant challenges to the analysis of our algorithm for learning the\nunknown dynamics. 
We address these challenges and provide high probability\nerror bounds on our identification algorithm under a uniform stability\nassumption. Our analysis provides insights into system theoretic quantities\nthat affect learning accuracy and sample complexity. Lastly, we perform\nnumerical experiments with synthetic data to reinforce these insights.\n","authors":["Yahya Sattar","Yassir Jedra","Maryam Fazel","Sarah Dean"],"pdf_url":"https://arxiv.org/pdf/2501.07652v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.17753v2","updated":"2025-01-13T19:10:31Z","published":"2024-05-28T02:11:21Z","title":"Regression Equilibrium in Electricity Markets","summary":" In two-stage electricity markets, renewable power producers enter the\nday-ahead market with a forecast of future power generation and then reconcile\nany forecast deviation in the real-time market at a penalty. The choice of the\nforecast model is thus an important strategy decision for renewable power\nproducers as it affects financial performance. In electricity markets with\nlarge shares of renewable generation, the choice of the forecast model impacts\nnot only individual performance but also outcomes for other producers. In this\npaper, we argue for the existence of a competitive regression equilibrium in\ntwo-stage electricity markets in terms of the parameters of private forecast\nmodels informing the participation strategies of renewable power producers. In\nour model, renewables optimize the forecast against the day-ahead and real-time\nprices, thereby maximizing the average profits across the day-ahead and\nreal-time markets. By doing so, they also implicitly enhance the temporal cost\ncoordination of day-ahead and real-time markets. We base the equilibrium\nanalysis on the theory of variational inequalities, providing results on the\nexistence and uniqueness of regression equilibrium in energy-only markets. 
We\nalso devise two methods to compute regression equilibrium: centralized\noptimization and a decentralized ADMM-based algorithm.\n","authors":["Vladimir Dvorkin"],"pdf_url":"https://arxiv.org/pdf/2405.17753v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.07616v1","updated":"2025-01-13T12:29:37Z","published":"2025-01-13T12:29:37Z","title":"The Ingenuity Mars Helicopter Specified and Analyzed with the Real-time\n Mode-aware Dataflow Model","summary":" Ingenuity is an autonomous Cyber-Physical System (CPS) that has successfully\ncompleted more than 70 flights over Mars between 2021 and 2024. Ensuring the\nsafety of its mission is paramount, as any failure could result in catastrophic\neconomic damage and significant financial losses. Dataflow Models of\nComputation and Communication (DF MoCCs) serve as a formal framework for\nspecifying and analyzing the timing behavior of such CPSs. In particular, the\nReal-time Mode-aware Dataflow (RMDF) model is highly suitable to specify and\nanalyze real-time and mode-dependent Cyber-Physical Systems (CPSs) like\nIngenuity. This paper showcases the application of RMDF for the specification\nand analysis of Ingenuity. We propose a dataflow specification of Ingenuity,\nanalyze its timing behavior, and provide a feasibility test. 
Finally, we\npropose a plausible explanation of the timing anomaly that occurred during the\nsixth flight of Ingenuity.\n","authors":["Guillaume Roumage","Selma Azaiez","Cyril Faure","Stéphane Louise"],"pdf_url":"https://arxiv.org/pdf/2501.07616v1.pdf","comment":"arXiv admin note: text overlap with arXiv:2501.07187"},{"id":"http://arxiv.org/abs/2207.11132v3","updated":"2025-01-13T02:38:21Z","published":"2022-07-16T13:43:58Z","title":"Proactive Distributed Emergency Response with Heterogeneous Tasks\n Allocation","summary":" Traditionally, traffic incident management (TIM) programs coordinate the\ndeployment of emergency resources to immediate incident requests without\naccommodating the interdependencies on incident evolutions in the environment.\nHowever, ignoring inherent interdependencies on the evolution of incidents in\nthe environment while making current deployment decisions is shortsighted, and\nthe resulting naive deployment strategy can significantly worsen the overall\nincident delay impact on the network. The interdependencies on incident\nevolution in the environment, including those between incident occurrences, and\nthose between resource availability in near-future requests and the anticipated\nduration of the immediate incident request, should be considered through a\nlook-ahead model when making current-stage deployment decisions. This study\ndevelops a new proactive framework based on the distributed constraint\noptimization problem (DCOP) to address the above limitations, overcoming\nconventional TIM models that cannot accommodate the dependencies in the TIM\nproblem. Furthermore, the optimization objective is formulated to incorporate\nUnmanned Aerial Vehicles (UAVs). The UAVs' role in TIM includes exploring\nuncertain traffic conditions, detecting unexpected events, and augmenting\ninformation from roadway traffic sensors. 
Robustness analysis of our model for\nmultiple TIM scenarios shows satisfactory performance using local search\nexploration heuristics. Overall, our model reports a significant reduction in\ntotal incident delay compared to conventional TIM models. With UAV support, we\ndemonstrate a further decrease in the total incident delay ranging between 5%\nand 45% for different numbers of incidents. UAVs' active sensing can shorten the\nresponse time of emergency vehicles and reduce the uncertainties\nassociated with the estimated incident delay impact.\n","authors":["Justice Darko","Hyoshin Park"],"pdf_url":"https://arxiv.org/pdf/2207.11132v3.pdf","comment":"16 pages, 13 figures, 3 tables, journal"},{"id":"http://arxiv.org/abs/2501.10441v1","updated":"2025-01-13T22:28:04Z","published":"2025-01-13T22:28:04Z","title":"A Review of Detection, Evolution, and Data Reconstruction Strategies for\n False Data Injection Attacks in Power Cyber-Physical Systems","summary":" The integration of information and physical systems in modern power grids has\nheightened vulnerabilities to False Data Injection Attacks (FDIAs), threatening\nthe secure operation of power cyber-physical systems (CPS). This paper reviews\nFDIA detection, evolution, and data reconstruction strategies, highlighting\ncross-domain coordination, multi-temporal evolution, and stealth\ncharacteristics. Challenges in existing detection methods, including poor\ninterpretability and data imbalance, are discussed, alongside advanced\nstate-aware and action-control data reconstruction techniques. Key issues, such\nas modeling FDIA evolution and distinguishing malicious data from regular\nfaults, are identified. 
Future directions to enhance system resilience and\ndetection accuracy are proposed, contributing to the secure operation of power\nCPS.\n","authors":["Xiaoyong Bo"],"pdf_url":"https://arxiv.org/pdf/2501.10441v1.pdf","comment":"34 pages, 4 figures, 6 tables"},{"id":"http://arxiv.org/abs/2501.10438v1","updated":"2025-01-13T14:40:49Z","published":"2025-01-13T14:40:49Z","title":"Event-Based Impulsive Control for Spacecraft Rendezvous Hovering Phases","summary":" This work presents an event-triggered controller for spacecraft rendezvous\nhovering phases. The goal is to maintain the chaser within a bounded region\nwith respect to the target. The main assumption is that the chaser vehicle has\nimpulsive thrusters. These are assumed to be orientable in any direction and\nare constrained by dead-zone and saturation bounds. The event-based controller\nrelies on trigger rules deciding when a suitable control law is applied. The\nlocal control law consists of a single impulse; therefore the trigger rules\ndesign is based on the instantaneous reachability to the admissible set. The\nfinal outcome is a very efficient algorithm from both computational burden and\nfootprint perspectives. Because the proposed methodology is based on a single\nimpulse control, the controller invariance is local and assessed through\nimpulsive systems theory. Finally, numerical results are shown and discussed.\n","authors":["Julio C. 
Sanchez","Christophe Louembet","Francisco Gavilan","Rafael Vazquez"],"pdf_url":"https://arxiv.org/pdf/2501.10438v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.10437v1","updated":"2025-01-13T14:19:06Z","published":"2025-01-13T14:19:06Z","title":"Chance-constrained Model Predictive Control for Near Rectilinear Halo\n Orbit spacecraft rendezvous","summary":" This work presents a robust Model Predictive Controller (MPC) to solve the\nproblem of spacecraft rendezvous in the context of the restricted three-body\nproblem (R3BP) as will be required to dock with space stations in cislunar\nspace. The employed methodology is valid for both chemical and electric\nthrusters. By exploiting the state transition matrix and using a\nchance-constrained approach, the robust MPC assures constraint satisfaction\nunder the presence of disturbances in a probabilistic sense. The perturbation\nparameters are computed on-line using a disturbance estimator. The robust\ncontroller is tested for a rendezvous scenario with a target placed in an\nEarth-Moon Near-Rectilinear Halo Orbit. Numerical results are shown and\ndiscussed.\n","authors":["Julio C. Sanchez","Francisco Gavilan","Rafael Vazquez"],"pdf_url":"https://arxiv.org/pdf/2501.10437v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.10436v1","updated":"2025-01-13T14:12:41Z","published":"2025-01-13T14:12:41Z","title":"A flatness-based predictive controller for six-degrees of freedom\n spacecraft rendezvous","summary":" This work presents a closed-loop guidance algorithm for six-degrees of\nfreedom spacecraft rendezvous with a passive target flying in an eccentric\norbit. The main assumption is that the chaser vehicle has an attitude control\nsystem, based on reaction wheels, providing the necessary torque to change its\norientation whereas the number of thrusters is arbitrary. The goal is to design\nfuel optimal maneuvers while satisfying operational constraints and rejecting\ndisturbances. 
The proposed method is as follows: first, the coupled\ntranslational and angular dynamics are transformed to equivalent algebraic\nrelations using the relative translational states transition matrix and the\nattitude flatness property. Then, a direct transcription method, based on\nB-splines parameterization and discretization of time continuous constraints,\nis developed to obtain a tractable static program. Finally, a Model Predictive\nController, based on linearization around the previously computed solution, is\nconsidered to handle disturbances. Numerical results are shown and discussed.\n","authors":["Julio C. Sanchez","Francisco Gavilan","Rafael Vazquez","Christophe Louembet"],"pdf_url":"https://arxiv.org/pdf/2501.10436v1.pdf","comment":null}],"Optimization and Control":[{"id":"http://arxiv.org/abs/2403.02079v3","updated":"2025-01-13T18:01:08Z","published":"2024-03-04T14:26:22Z","title":"The ultimate upper bound on the injectivity radius of the Stiefel\n manifold","summary":" We exhibit conjugate points on the Stiefel manifold endowed with any member\nof the family of Riemannian metrics introduced by H\\\"uper et al. (2021). This\nfamily contains the well-known canonical and Euclidean metrics. An upper bound\non the injectivity radius of the Stiefel manifold in the considered metric is\nthen obtained as the minimum between the length of the geodesic along which the\npoints are conjugate and the length of certain geodesic loops. Numerical\nexperiments support the conjecture that the obtained upper bound is in fact\nequal to the injectivity radius.\n","authors":["P. -A. 
Absil","Simon Mataigne"],"pdf_url":"https://arxiv.org/pdf/2403.02079v3.pdf","comment":"Version accepted for publication in SIAM Journal on Matrix Analysis\n and Applications on 6 January 2025"},{"id":"http://arxiv.org/abs/2501.07505v1","updated":"2025-01-13T17:22:58Z","published":"2025-01-13T17:22:58Z","title":"An Error Analysis of Second Order Elliptic Optimal Control Problem via\n Hybrid Higher Order Methods","summary":" This paper presents the design and analysis of a Hybrid High-Order (HHO)\napproximation for a distributed optimal control problem governed by the Poisson\nequation. We propose three distinct schemes to address unconstrained control\nproblems and two schemes for constrained control problems. For the\nunconstrained control problem, while standard finite elements achieve a\nconvergence rate of \\( k+1 \\) (with \\( k \\) representing the polynomial\ndegree), our approach enhances this rate to \\( k+2 \\) by selecting the control\nfrom a carefully constructed reconstruction space. For the box-constrained\nproblem, we demonstrate that using lowest-order elements (\\( \\mathbb{P}_0 \\))\nyields linear convergence, in contrast to finite element methods (FEM) that\nrequire linear elements to achieve comparable results. Furthermore, we derive a\ncubic convergence rate for control in the variational discretization scheme.\nNumerical experiments are provided to validate the theoretical findings.\n","authors":["Gouranga Mallik","Ramesh Chandra Sau"],"pdf_url":"https://arxiv.org/pdf/2501.07505v1.pdf","comment":"34 pages"},{"id":"http://arxiv.org/abs/2407.00843v3","updated":"2025-01-13T16:58:43Z","published":"2024-06-30T22:33:47Z","title":"A Unified Approach to Extract Interpretable Rules from Tree Ensembles\n via Integer Programming","summary":" Tree ensembles are very popular machine learning models, known for their\neffectiveness in supervised classification and regression tasks. 
Their\nperformance derives from aggregating predictions of multiple decision trees,\nwhich are renowned for their interpretability properties. However, tree\nensemble models do not reliably exhibit interpretable output. Our work aims to\nextract an optimized list of rules from a trained tree ensemble, providing the\nuser with a condensed, interpretable model that retains most of the predictive\npower of the full model. Our approach consists of solving a set partitioning\nproblem formulated through Integer Programming. The proposed method works with\neither tabular or time series data, for both classification and regression\ntasks, and its flexible formulation can include any arbitrary loss or\nregularization functions. Our extensive computational experiments offer\nstatistically significant evidence that our method is competitive with other\nrule extraction methods in terms of predictive performance and fidelity towards\nthe tree ensemble. Moreover, we empirically show that the proposed method\neffectively extracts interpretable rules from tree ensembles that are designed\nfor time series data.\n","authors":["Lorenzo Bonasera","Emilio Carrizosa"],"pdf_url":"https://arxiv.org/pdf/2407.00843v3.pdf","comment":"- Improved overall manuscript flow and clearness - Added related work\n on explanation fidelity - Added computational results on fidelity - Fixed\n some flaws on data inference - Optimization problem with weighted objectives\n - Added appendix containing qualitative examples - New computational results"},{"id":"http://arxiv.org/abs/2501.07461v1","updated":"2025-01-13T16:30:56Z","published":"2025-01-13T16:30:56Z","title":"A Linear Parameter-Varying Framework for the Analysis of Time-Varying\n Optimization Algorithms","summary":" In this paper we propose a framework to analyze iterative first-order\noptimization algorithms for time-varying convex optimization. 
We assume that\nthe temporal variability is caused by a time-varying parameter entering the\nobjective, which can be measured at the time of decision but whose future\nvalues are unknown. We consider the case of strongly convex objective functions\nwith Lipschitz continuous gradients and address the class of running algorithms\nwhere only one iteration per time change is performed. We model these\nalgorithms as discrete-time linear parameter varying (LPV) systems in feedback\nwith a time-varying gradient. We leverage the approach of analyzing algorithms\nas uncertain control interconnections with integral quadratic constraints\n(IQCs) and generalize that framework to the time-varying case. We propose novel\nIQCs that are capable of capturing the behavior of time-varying nonlinearities\nand leverage techniques from the LPV literature to establish novel bounds on\nthe tracking error. Quantitative bounds can be computed by solving a\nsemi-definite program and can be interpreted as an input-to-state stability\nresult with respect to a disturbance signal which increases with the temporal\nvariability of the problem. As a departure from results in this research area,\nour bounds introduce terms that can be interpreted as a temporal rate of change\nin the cost function and the optimal value. 
We exemplify our main results with\nnumerical experiments that showcase how our analysis framework is able to\ncapture convergence rates of different first-order algorithms for time-varying\noptimization through the choice of IQC and rate bounds.\n","authors":["Fabian Jakob","Andrea Iannelli"],"pdf_url":"https://arxiv.org/pdf/2501.07461v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.07427v1","updated":"2025-01-13T15:48:45Z","published":"2025-01-13T15:48:45Z","title":"Numerical Method for Simultaneous Design and Control Optimization of\n Seasonal Thermal Energy Storage Systems","summary":" The transition to a carbon-neutral energy system requires massive\ninstallation of renewable energy sources and economically feasible energy\nstorage solutions. This study addresses these challenges by optimizing the\ndesign and control strategies of an energy system that meets the heat and\nelectricity demands of a community. The proposed system integrates solar and\nwind power with energy storage, including seasonal thermal energy storage\n(STES) and battery, coupled via a heat pump. This approach enhances\nself-sufficiency and effectively mitigates seasonal mismatches. To model heat\ntransfer between the storage and the ground in the STES system, we employ a\nmulti-node lumped-parameter method. The optimization problem is formulated as a\nperiodic optimal control problem, which is then transcribed into a nonlinear\nprogramming problem. To reduce computational complexity, we apply the averaging\nmethod, which significantly lowers the effort required to solve the problem. We\napply this approach to a case study, where the economically optimized\nconfiguration results in a projected total energy cost per household of\napproximately 75 EUR/month over 30 years for both heat and electricity. 
This\nstudy demonstrates the feasibility of designing economically viable, autonomous\nenergy communities in real-world scenarios, and provides a comprehensive\noptimization framework for designing system components and control strategies.\n","authors":["Wonsun Song","Jakob Harzer","Christopher Jung","Leon Sander","Moritz Diehl"],"pdf_url":"https://arxiv.org/pdf/2501.07427v1.pdf","comment":"35 pages, 12 figures, submitted to Renewable Energy. Editor-in-chief:\n Nidia Caetano"},{"id":"http://arxiv.org/abs/2501.07413v1","updated":"2025-01-13T15:31:50Z","published":"2025-01-13T15:31:50Z","title":"Stable Set Polytopes with Rank $|V(G)|/3$ for the Lov{á}sz--Schrijver\n SDP Operator","summary":" We study the lift-and-project rank of the stable set polytope of graphs with\nrespect to the Lov{\\'a}sz--Schrijver SDP operator $\\text{LS}_+$ applied to the\nfractional stable set polytope. In particular, we show that for every positive\ninteger $\\ell$, the smallest possible graph with $\\text{LS}_+$-rank $\\ell$\ncontains $3\\ell$ vertices. This result is sharp and settles a conjecture posed\nby Lipt{\\'a}k and the second author in 2003, as well as answers a\ngeneralization of a problem posed by Knuth in 1994. We also show that for every\npositive integer $\\ell$ there exists a vertex-transitive graph on $4\\ell+12$\nvertices with $\\text{LS}_+$-rank at least $\\ell$.\n","authors":["Yu Hin Au","Levent Tunçel"],"pdf_url":"https://arxiv.org/pdf/2501.07413v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.04081v2","updated":"2025-01-13T23:48:32Z","published":"2024-03-06T22:24:05Z","title":"Directional Smoothness and Gradient Methods: Convergence and Adaptivity","summary":" We develop new sub-optimality bounds for gradient descent (GD) that depend on\nthe conditioning of the objective along the path of optimization rather than on\nglobal, worst-case constants. 
Key to our proofs is directional smoothness, a\nmeasure of gradient variation that we use to develop upper-bounds on the\nobjective. Minimizing these upper-bounds requires solving implicit equations to\nobtain a sequence of strongly adapted step-sizes; we show that these equations\nare straightforward to solve for convex quadratics and lead to new guarantees\nfor two classical step-sizes. For general functions, we prove that the Polyak\nstep-size and normalized GD obtain fast, path-dependent rates despite using no\nknowledge of the directional smoothness. Experiments on logistic regression\nshow our convergence guarantees are tighter than the classical theory based on\n$L$-smoothness.\n","authors":["Aaron Mishkin","Ahmed Khaled","Yuanhao Wang","Aaron Defazio","Robert M. Gower"],"pdf_url":"https://arxiv.org/pdf/2403.04081v2.pdf","comment":"Published as a poster at NeurIPS 2024"},{"id":"http://arxiv.org/abs/2412.00617v2","updated":"2025-01-13T23:12:46Z","published":"2024-11-30T23:52:21Z","title":"Flow matching for stochastic linear control systems","summary":" This paper addresses the problem of steering an initial probability\ndistribution to a target probability distribution through a deterministic or\nstochastic linear control system. Our proposed approach is inspired by the flow\nmatching methodology, with the difference that we can only affect the flow\nthrough the given control channels. The motivation comes from applications such\nas robotic swarms and stochastic thermodynamics, where agents or particles can\nonly be manipulated through control actions. The feedback control law that\nachieves the task is characterized as the conditional expectation of the\ncontrol inputs for the stochastic bridges that respect the given control system\ndynamics. 
Explicit forms are derived for special cases, and a numerical\nprocedure is presented to approximate the control law, illustrated with\nexamples.\n","authors":["Yuhang Mei","Mohammad Al-Jarrah","Amirhossein Taghvaei","Yongxin Chen"],"pdf_url":"https://arxiv.org/pdf/2412.00617v2.pdf","comment":"13 pages, 3 figures"},{"id":"http://arxiv.org/abs/2312.04045v5","updated":"2025-01-13T22:44:23Z","published":"2023-12-07T05:07:12Z","title":"Partial Information in a Mean-Variance Portfolio Selection Game","summary":" This paper considers finitely many investors who perform mean-variance\nportfolio selection under relative performance criteria. That is, each investor\nis concerned about not only her terminal wealth, but how it compares to the\naverage terminal wealth of all investors. At the inter-personal level, each\ninvestor selects a trading strategy in response to others' strategies. This\nselected strategy additionally needs to yield an equilibrium intra-personally,\nso as to resolve time inconsistency among the investor's current and future\nselves (triggered by the mean-variance objective). A Nash equilibrium we look\nfor is thus a tuple of trading strategies under which every investor achieves\nher intra-personal equilibrium simultaneously. We derive such a Nash\nequilibrium explicitly in the idealized case of full information (i.e., the\ndynamics of the underlying stock is perfectly known) and semi-explicitly in the\nrealistic case of partial information (i.e., the stock evolution is observed,\nbut the expected return of the stock is not precisely known). The formula under\npartial information consists of the myopic trading and intertemporal hedging\nterms, both of which depend on an additional state process that serves to\nfilter the true expected return and whose influence on trading is captured by a\ndegenerate Cauchy problem. 
Our results identify that relative performance\ncriteria can induce downward self-reinforcement of investors' wealth--if every\ninvestor suffers a wealth decline simultaneously, then everyone's wealth tends\nto decline further. This phenomenon, as numerical examples show, is negligible\nunder full information but pronounced under partial information.\n","authors":["Yu-Jui Huang","Li-Hsien Sun"],"pdf_url":"https://arxiv.org/pdf/2312.04045v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.19234v3","updated":"2025-01-13T21:17:16Z","published":"2024-10-25T00:53:16Z","title":"On the Trade-Off Between Distributional Belief and Ambiguity:\n Conservatism, Finite-Sample Guarantees, and Asymptotic Properties","summary":" We propose and analyze a new data-driven trade-off (TRO) approach for\nmodeling uncertainty that serves as a middle ground between the optimistic\napproach, which adopts a distributional belief, and the pessimistic\ndistributionally robust optimization approach, which hedges against\ndistributional ambiguity. We equip the TRO model with a TRO ambiguity set\ncharacterized by a size parameter controlling the level of optimism and a shape\nparameter representing distributional ambiguity. We first show that\nconstructing the TRO ambiguity set using a general star-shaped shape parameter\nwith the empirical distribution as its star center is necessary and sufficient\nto guarantee the hierarchical structure of the sequence of TRO ambiguity sets.\nThen, we analyze the properties of the TRO model, including quantifying\nconservatism, quantifying bias and generalization error, and establishing\nasymptotic properties. Specifically, we show that the TRO model could generate\na spectrum of decisions, ranging from optimistic to conservative decisions.\nAdditionally, we show that it could produce an unbiased estimator of the true\noptimal value. 
Furthermore, we establish the almost-sure convergence of the\noptimal value and the set of optimal solutions of the TRO model to their true\ncounterparts. We exemplify our theoretical results using an inventory control\nproblem and a portfolio optimization problem.\n","authors":["Man Yiu Tsang","Karmel S. Shehadeh"],"pdf_url":"https://arxiv.org/pdf/2410.19234v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.07681v1","updated":"2025-01-13T20:41:52Z","published":"2025-01-13T20:41:52Z","title":"Dataset Distillation as Pushforward Optimal Quantization","summary":" Dataset distillation aims to find a synthetic training set such that training\non the synthetic data achieves similar performance to training on real data,\nwith orders of magnitude less computational requirements. Existing methods can\nbe broadly categorized as either bi-level optimization problems that have\nneural network training heuristics as the lower level problem, or disentangled\nmethods that bypass the bi-level optimization by matching distributions of\ndata. The latter method has the major advantages of speed and scalability in\nterms of size of both training and distilled datasets. We demonstrate that when\nequipped with an encoder-decoder structure, the empirically successful\ndisentangled methods can be reformulated as an optimal quantization problem,\nwhere a finite set of points is found to approximate the underlying probability\nmeasure by minimizing the expected projection distance. In particular, we link\nexisting disentangled dataset distillation methods to the classical optimal\nquantization and Wasserstein barycenter problems, demonstrating consistency of\ndistilled datasets for diffusion-based generative priors. 
We propose a simple\nextension of the state-of-the-art data distillation method D4M, achieving\nbetter performance on the ImageNet-1K dataset with trivial additional\ncomputation, and state-of-the-art performance in higher image-per-class\nsettings.\n","authors":["Hong Ye Tan","Emma Slade"],"pdf_url":"https://arxiv.org/pdf/2501.07681v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.07680v1","updated":"2025-01-13T20:37:24Z","published":"2025-01-13T20:37:24Z","title":"Input-to-state stability in integral norms for linear\n infinite-dimensional systems","summary":" We study integral-to-integral input-to-state stability for\ninfinite-dimensional linear systems with inputs and trajectories in\n$L^p$-spaces. We start by developing the corresponding admissibility theory for\nlinear systems with unbounded input operators. While input-to-state stability\nis typically characterized by exponential stability and finite-time\nadmissibility, we show that this equivalence does not extend directly to\nintegral norms. For analytic semigroups, we establish a precise\ncharacterization using maximal regularity theory. Additionally, we provide\ndirect Lyapunov theorems and construct Lyapunov functions for $L^p$-$L^q$-ISS\nand demonstrate the results with examples, including diagonal systems and\ndiffusion equations.\n","authors":["Sahiba Arora","Andrii Mironchenko"],"pdf_url":"https://arxiv.org/pdf/2501.07680v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.07671v1","updated":"2025-01-13T20:09:02Z","published":"2025-01-13T20:09:02Z","title":"Towards nonlinearity. The p-regularity theory. Applications and\n developments","summary":" We present recent advances in the analysis of nonlinear equations with\nsingular operators and nonlinear optimization problems with constraints given\nby singular mappings. The results are obtained within the framework of\n$p$-regularity theory, which has developed successfully over the last forty\nyears. 
We illustrate the theory with its applications to degenerate problems in\nvarious areas of mathematics. In particular, we address the problem of\ndescribing the tangent cone to the solution set of nonlinear equations in a\nsingular case. The structure of p-factor operators is used to propose\noptimality conditions and construct numerical methods for solving degenerate\nnonlinear equations and optimization problems. The methods presented in the\npaper can be considered as the first numerical approaches targeting solutions\nof degenerate problems, such as the Van der Pol differential equation,\nboundary-value problems with a small parameter, partial differential equations\nwhere Poincar\\'e's method of small parameter fails, nonlinear degenerate\ndynamical systems, and others. There are various practical applications for the\ntheory of p-regularity, including structural engineering, composite materials,\nand material design. For instance, the theory can be applied to analyze the\nbehavior of materials with irregular or complex properties. By considering\nhigher-order derivatives, it becomes possible to model and predict the response\nof materials to external forces, such as stress or temperature variations. In\ngeophysics, the $p$-regularity theory can be utilized to analyze and interpret\ncomplex data obtained from seismic surveys, gravity measurements, or\nelectromagnetic surveys. The theory also finds applications in the analysis of\nnonlinear differential equations arising in control systems, geometric and\ntopological analysis, biomechanics, and many other fields.\n","authors":["E. Bednarczuk","O. Brezhneva","K. Leśniewski","A. Prusińska","A. 
Tret'yakov"],"pdf_url":"https://arxiv.org/pdf/2501.07671v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.07652v1","updated":"2025-01-13T19:24:14Z","published":"2025-01-13T19:24:14Z","title":"Finite Sample Identification of Partially Observed Bilinear Dynamical\n Systems","summary":" We consider the problem of learning a realization of a partially observed\nbilinear dynamical system (BLDS) from noisy input-output data. Given a single\ntrajectory of input-output samples, we provide a finite time analysis for\nlearning the system's Markov-like parameters, from which a balanced realization\nof the bilinear system can be obtained. Our bilinear system identification\nalgorithm learns the system's Markov-like parameters by regressing the outputs\nto highly correlated, nonlinear, and heavy-tailed covariates. Moreover, the\nstability of BLDS depends on the sequence of inputs used to excite the system.\nThese properties, unique to partially observed bilinear dynamical systems, pose\nsignificant challenges to the analysis of our algorithm for learning the\nunknown dynamics. We address these challenges and provide high probability\nerror bounds on our identification algorithm under a uniform stability\nassumption. Our analysis provides insights into system theoretic quantities\nthat affect learning accuracy and sample complexity. Lastly, we perform\nnumerical experiments with synthetic data to reinforce these insights.\n","authors":["Yahya Sattar","Yassir Jedra","Maryam Fazel","Sarah Dean"],"pdf_url":"https://arxiv.org/pdf/2501.07652v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.17753v2","updated":"2025-01-13T19:10:31Z","published":"2024-05-28T02:11:21Z","title":"Regression Equilibrium in Electricity Markets","summary":" In two-stage electricity markets, renewable power producers enter the\nday-ahead market with a forecast of future power generation and then reconcile\nany forecast deviation in the real-time market at a penalty. 
The choice of the\nforecast model is thus an important strategy decision for renewable power\nproducers as it affects financial performance. In electricity markets with\nlarge shares of renewable generation, the choice of the forecast model impacts\nnot only individual performance but also outcomes for other producers. In this\npaper, we argue for the existence of a competitive regression equilibrium in\ntwo-stage electricity markets in terms of the parameters of private forecast\nmodels informing the participation strategies of renewable power producers. In\nour model, renewables optimize the forecast against the day-ahead and real-time\nprices, thereby maximizing the average profits across the day-ahead and\nreal-time markets. By doing so, they also implicitly enhance the temporal cost\ncoordination of day-ahead and real-time markets. We base the equilibrium\nanalysis on the theory of variational inequalities, providing results on the\nexistence and uniqueness of regression equilibrium in energy-only markets. We\nalso devise two methods to compute regression equilibrium: centralized\noptimization and a decentralized ADMM-based algorithm.\n","authors":["Vladimir Dvorkin"],"pdf_url":"https://arxiv.org/pdf/2405.17753v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.02648v4","updated":"2025-01-13T19:05:07Z","published":"2024-03-05T04:35:59Z","title":"Remove that Square Root: A New Efficient Scale-Invariant Version of\n AdaGrad","summary":" Adaptive methods are extremely popular in machine learning as they make\nlearning rate tuning less expensive. This paper introduces a novel optimization\nalgorithm named KATE, which presents a scale-invariant adaptation of the\nwell-known AdaGrad algorithm. We prove the scale-invariance of KATE for the\ncase of Generalized Linear Models. 
Moreover, for general smooth non-convex\nproblems, we establish a convergence rate of $O \\left(\\frac{\\log T}{\\sqrt{T}}\n\\right)$ for KATE, matching the best-known ones for AdaGrad and Adam. We also\ncompare KATE to other state-of-the-art adaptive algorithms Adam and AdaGrad in\nnumerical experiments with different problems, including complex machine\nlearning tasks like image classification and text classification on real data.\nThe results indicate that KATE consistently outperforms AdaGrad and\nmatches/surpasses the performance of Adam in all considered scenarios.\n","authors":["Sayantan Choudhury","Nazarii Tupitsa","Nicolas Loizou","Samuel Horvath","Martin Takac","Eduard Gorbunov"],"pdf_url":"https://arxiv.org/pdf/2403.02648v4.pdf","comment":"32 pages, 12 figures"},{"id":"http://arxiv.org/abs/1912.00043v3","updated":"2025-01-13T18:34:11Z","published":"2019-11-29T19:22:36Z","title":"Barcodes as Summary of Loss Function Topology","summary":" We propose to study neural networks' loss surfaces by methods of topological\ndata analysis. We suggest to apply barcodes of Morse complexes to explore\ntopology of loss surfaces. An algorithm for calculations of the loss function's\nbarcodes of local minima is described. We have conducted experiments for\ncalculating barcodes of local minima for benchmark functions and for loss\nsurfaces of small neural networks. Our experiments confirm our two principal\nobservations for neural networks' loss surfaces. First, the barcodes of local\nminima are located in a small lower part of the range of values of neural\nnetworks' loss function. Secondly, increase of the neural network's depth and\nwidth lowers the barcodes of local minima. 
This has some natural implications\nfor the neural network's learning and for its generalization properties.\n","authors":["Serguei Barannikov","Alexander Korotin","Dmitry Oganesyan","Daniil Emtsev","Evgeny Burnaev"],"pdf_url":"https://arxiv.org/pdf/1912.00043v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.07401v1","updated":"2025-01-13T15:17:48Z","published":"2025-01-13T15:17:48Z","title":"Smoothing Iterative Consensus-based Optimization Algorithm for Nonsmooth\n Nonconvex Optimization Problems with Global Optimality","summary":" In this paper, we focus on finding the global minimizer of a general\nunconstrained nonsmooth nonconvex optimization problem. Taking advantage of the\nsmoothing method and the consensus-based optimization (CBO) method, we propose\na novel smoothing iterative consensus-based optimization (SICBO) algorithm.\nFirst, we prove that the solution process of the proposed algorithm here\nexponentially converges to a common stochastic consensus point almost surely.\nSecond, we establish a detailed theoretical analysis to ensure the small enough\nerror between the objective function value at the consensus point and the\noptimal function value, to the best of our knowledge, which provides the first\ntheoretical guarantee to the global optimality of the proposed algorithm for\nnonconvex optimization problems. Moreover, unlike the previously introduced CBO\nmethods, the theoretical results are valid for the cases that the objective\nfunction is nonsmooth, nonconvex and perhaps non-Lipschitz continuous. 
Finally,\nseveral numerical examples are performed to illustrate the effectiveness of our\nproposed algorithm for solving the global minimizer of the nonsmooth and\nnonconvex optimization problems.\n","authors":["Jiazhen Wei","Wei Bian"],"pdf_url":"https://arxiv.org/pdf/2501.07401v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2501.07400v1","updated":"2025-01-13T15:17:28Z","published":"2025-01-13T15:17:28Z","title":"Derivation of effective gradient flow equations and dynamical truncation\n of training data in Deep Learning","summary":" We derive explicit equations governing the cumulative biases and weights in\nDeep Learning with ReLU activation function, based on gradient descent for the\nEuclidean cost in the input layer, and under the assumption that the weights\nare, in a precise sense, adapted to the coordinate system distinguished by the\nactivations. We show that gradient descent corresponds to a dynamical process\nin the input layer, whereby clusters of data are progressively reduced in\ncomplexity (\"truncated\") at an exponential rate that increases with the number\nof data points that have already been truncated. We provide a detailed\ndiscussion of several types of solutions to the gradient flow equations. A main\nmotivation for this work is to shed light on the interpretability question in\nsupervised learning.\n","authors":["Thomas Chen"],"pdf_url":"https://arxiv.org/pdf/2501.07400v1.pdf","comment":"AMS Latex, 35 pages"},{"id":"http://arxiv.org/abs/2501.07383v1","updated":"2025-01-13T15:02:27Z","published":"2025-01-13T15:02:27Z","title":"Anomalies of the Scholtes regularization for mathematical programs with\n complementarity constraints","summary":" For mathematical programs with complementarity constraints (MPCC), we refine\nthe convergence analysis of the Scholtes regularization. Our goal is to relate\nnondegenerate C-stationary points of MPCC with nondegenerate Karush-Kuhn-Tucker\npoints of its Scholtes regularization. 
We detected the following anomalies: (i)\nin a neighborhood of a nondegenerate C-stationary point there could be\ndegenerate Karush-Kuhn-Tucker points of the Scholtes regularization; (ii) even\nif nondegenerate, they might be locally non-unique; (iii) if nevertheless\nunique, their quadratic index potentially differs from the C-index of the\nC-stationary point under consideration. Thus, a change of the topological type\nfor Karush-Kuhn-Tucker points of the Scholtes regularization is possible. In\nparticular, a nondegenerate minimizer of MPCC might be approximated by saddle\npoints. In order to bypass the mentioned anomalies, an additional generic\ncondition for nondegenerate C-stationary points of MPCC is identified. Then, we\nuniquely trace nondegenerate Karush-Kuhn-Tucker points of the Scholtes\nregularization and successively maintain their topological type.\n","authors":["Vladimir Shikhman","Sebastian Lämmel"],"pdf_url":"https://arxiv.org/pdf/2501.07383v1.pdf","comment":"25 pages"},{"id":"http://arxiv.org/abs/2305.08094v2","updated":"2025-01-13T14:53:11Z","published":"2023-05-14T08:10:49Z","title":"Accelerating genetic optimization of nonlinear model predictive control\n by learning optimal search space size","summary":" Genetic algorithm (GA) is typically used to solve nonlinear model predictive\ncontrol's optimization problem. However, the size of the search space in which\nthe GA searches for the optimal control inputs is crucial for its applicability\nto fast-response systems. This paper proposes accelerating the genetic\noptimization of NMPC by learning optimal search space size. The approach trains\na multivariate regression model to adaptively predict the best smallest size of\nthe search space in every control cycle. The proposed approach reduces the GA's\ncomputational time, improves the chance of convergence to better control\ninputs, and provides a stable and feasible solution. 
The proposed approach was\nevaluated on three nonlinear systems and compared to four other evolutionary\nalgorithms implemented in a processor-in-the-loop fashion. The results show\nthat the proposed approach provides a 17-45\\% reduction in computational time\nand increases the convergence rate by 35-47\\%. The source code is available on\nGitHub.\n","authors":["Eslam Mostafa","Hussein A. Aly","Ahmed Elliethy"],"pdf_url":"https://arxiv.org/pdf/2305.08094v2.pdf","comment":"Accepted by the Journal of Control and Decision"},{"id":"http://arxiv.org/abs/2501.07307v1","updated":"2025-01-13T13:16:30Z","published":"2025-01-13T13:16:30Z","title":"Quasiconvex Bulk and Surface Energies with subquadratic growth","summary":" We establish partial H\\\"older continuity of the gradient for equilibrium\nconfigurations of vectorial multidimensional variational problems, involving\nbulk and surface energies. The bulk energy densities are uniformly strictly\nquasiconvex functions with $p$-growth, $1