Multi-modal object detection in autonomous driving has achieved great breakthroughs due to the usage of fusing complementary information from different sensors. The calibration in fusion between sensors such as LiDAR and camera is always supposed to be precise in previous work. However, in reality, calibration matrices are fixed when the vehicles leave the factory, but vibration, bumps, and data lags may cause calibration bias. As the research on the calibration influence on fusion detection performance is relatively few, flexible calibration dependency multi-sensor detection method has always been attractive. In this paper, we conducted experiments on SOTA detection method EPNet++ and proved slight bias on calibration can reduce the performance seriously. We also proposed a re-calibration model based on semantic segmentation which can be combined with a detection algorithm to improve the performance and robustness of multi-modal calibration bias.
To generate data from trained diffusion models, most inference algorithms, such as DDPM, DDIM, and other variants, rely on discretizing the reverse SDEs or their equivalent ODEs. In this paper, we view such approaches as decomposing the entire denoising diffusion process into several segments, each corresponding to a reverse transition kernel (RTK) sampling subproblem. Specifically, DDPM uses a Gaussian approximation for the RTK, resulting in low per-subproblem complexity but requiring a large number of segments (i.e., subproblems), which is conjectured to be inefficient. To address this, we develop a general RTK framework that enables a more balanced subproblem decomposition, resulting in $\tilde O(1)$ subproblems, each with strongly log-concave targets. We then propose leveraging two fast sampling algorithms, the Metropolis-Adjusted Langevin Algorithm (MALA) and Underdamped Langevin Dynamics (ULD), for solving these strongly log-concave subproblems. This gives rise to the RTK-MALA and RTK-ULD algorithms for diffusion inference. In theory, we further develop the convergence guarantees for RTK-MALA and RTK-ULD in total variation (TV) distance: RTK-ULD can achieve $\epsilon$ target error within $\tilde{\mathcal O}(d^{1/2}\epsilon^{-1})$ under mild conditions, and RTK-MALA enjoys a $\mathcal{O}(d^{2}\log(d/\epsilon))$ convergence rate under slightly stricter conditions. These theoretical results surpass the state-of-the-art convergence rates for diffusion inference and are well supported by numerical experiments.
Log parsing, a vital task for interpreting the vast and complex data produced within software architectures faces significant challenges in the transition from academic benchmarks to the industrial domain. Existing log parsers, while highly effective on standardized public datasets, struggle to maintain performance and efficiency when confronted with the sheer scale and diversity of real-world industrial logs. These challenges are two-fold: 1) massive log templates: The performance and efficiency of most existing parsers will be significantly reduced when logs of growing quantities and different lengths; 2) Complex and changeable semantics: Traditional template-matching algorithms cannot accurately match the log templates of complicated industrial logs because they cannot utilize cross-language logs with similar semantics. To address these issues, we propose ECLIPSE, Enhanced Cross-Lingual Industrial log Parsing with Semantic Entropy-LCS, since cross-language logs can robustly parse industrial logs. On the one hand, it integrates two efficient data-driven template-matching algorithms and Faiss indexing. On the other hand, driven by the powerful semantic understanding ability of the Large Language Model (LLM), the semantics of log keywords were accurately extracted, and the retrieval space was effectively reduced. It is worth noting that we launched a Chinese and English cross-platform industrial log parsing benchmark ECLIPSE-Bench to evaluate the performance of mainstream parsers in industrial scenarios. Our experimental results, conducted across public benchmarks and the proprietary ECLIPSE-Bench dataset, underscore the superior performance and robustness of our proposed ECLIPSE. Notably, ECLIPSE delivers state-of-the-art performance when compared to strong baselines on diverse datasets and preserves a significant edge in processing efficiency.
Non-line-of-sight (NLOS) imaging enables the visualization of objects hidden from direct view, with applications in surveillance, remote sensing, and light detection and ranging. Here, we introduce a NLOS imaging technique termed ptychographic NLOS (pNLOS), which leverages coded ptychography for depth-resolved imaging of obscured objects. Our approach involves scanning a laser spot on a wall to illuminate the hidden objects in an obscured region. The reflected wavefields from these objects then travel back to the wall, get modulated by the wall's complex-valued profile, and the resulting diffraction patterns are captured by a camera. By modulating the object wavefields, the wall surface serves the role of the coded layer as in coded ptychography. As we scan the laser spot to different positions, the reflected object wavefields on the wall translate accordingly, with the shifts varying for objects at different depths. This translational diversity enables the acquisition of a set of modulated diffraction patterns referred to as a ptychogram. By processing the ptychogram, we recover both the objects at different depths and the modulation profile of the wall surface. Experimental results demonstrate high-resolution, high-fidelity imaging of hidden objects, showcasing the potential of pNLOS for depth-aware vision beyond the direct line of sight.
Intent modeling has attracted widespread attention in recommender systems. As the core motivation behind user selection of items, intent is crucial for elucidating recommendation results. The current mainstream modeling method is to abstract the intent into unknowable but learnable shared or non-shared parameters. Despite considerable progress, we argue that it still confronts the following challenges: firstly, these methods only capture the coarse-grained aspects of intent, ignoring the fact that user-item interactions will be affected by collective and individual factors (e.g., a user may choose a movie because of its high box office or because of his own unique preferences); secondly, modeling believable intent is severely hampered by implicit feedback, which is incredibly sparse and devoid of true semantics. To address these challenges, we propose a novel recommendation framework designated as Bilateral Intent-guided Graph Collaborative Filtering (BIGCF). Specifically, we take a closer look at user-item interactions from a causal perspective and put forth the concepts of individual intent-which signifies private preferences-and collective intent-which denotes overall awareness. To counter the sparsity of implicit feedback, the feature distributions of users and items are encoded via a Gaussian-based graph generation strategy, and we implement the recommendation process through bilateral intent-guided graph reconstruction re-sampling. Finally, we propose graph contrastive regularization for both interaction and intent spaces to uniformize users, items, intents, and interactions in a self-supervised and non-augmented paradigm. Experimental results on three real-world datasets demonstrate the effectiveness of BIGCF compared with existing solutions.
Although the Retrieval-Augmented Generation (RAG) paradigms can use external knowledge to enhance and ground the outputs of Large Language Models (LLMs) to mitigate generative hallucinations and static knowledge base problems, they still suffer from limited flexibility in adopting Information Retrieval (IR) systems with varying capabilities, constrained interpretability during the multi-round retrieval process, and a lack of end-to-end optimization. To address these challenges, we propose a novel LLM-centric approach, IM-RAG, that integrates IR systems with LLMs to support multi-round RAG through learning Inner Monologues (IM, i.e., the human inner voice that narrates one's thoughts). During the IM process, the LLM serves as the core reasoning model (i.e., Reasoner) to either propose queries to collect more information via the Retriever or to provide a final answer based on the conversational context. We also introduce a Refiner that improves the outputs from the Retriever, effectively bridging the gap between the Reasoner and IR modules with varying capabilities and fostering multi-round communications. The entire IM process is optimized via Reinforcement Learning (RL) where a Progress Tracker is incorporated to provide mid-step rewards, and the answer prediction is further separately optimized via Supervised Fine-Tuning (SFT). We conduct extensive experiments with the HotPotQA dataset, a popular benchmark for retrieval-based, multi-step question-answering. The results show that our approach achieves state-of-the-art (SOTA) performance while providing high flexibility in integrating IR modules as well as strong interpretability exhibited in the learned inner monologues.
Social recommendation leverages social network to complement user-item interaction data for recommendation task, aiming to mitigate the data sparsity issue in recommender systems. However, existing social recommendation methods encounter the following challenge: both social network and interaction data contain substaintial noise, and the propagation of such noise through Graph Neural Networks (GNNs) not only fails to enhance recommendation performance but may also interfere with the model's normal training. Despite the importance of denoising for social network and interaction data, only a limited number of studies have considered the denoising for social network and all of them overlook that for interaction data, hindering the denoising effect and recommendation performance. Based on this, we propose a novel model called Dual-domain Collaborative Denoising for Social Recommendation ($\textbf{DCDSR}$). DCDSR comprises two primary modules: the structure-level collaborative denoising module and the embedding-space collaborative denoising module. In the structure-level collaborative denoising module, information from interaction domain is first employed to guide social network denoising. Subsequently, the denoised social network is used to supervise the denoising for interaction data. The embedding-space collaborative denoising module devotes to resisting the noise cross-domain diffusion problem through contrastive learning with dual-domain embedding collaborative perturbation. Additionally, a novel contrastive learning strategy, named Anchor-InfoNCE, is introduced to better harness the denoising capability of contrastive learning. Evaluating our model on three real-world datasets verifies that DCDSR has a considerable denoising effect, thus outperforms the state-of-the-art social recommendation methods.
Effective feature interaction modeling is critical for enhancing the accuracy of click-through rate (CTR) prediction in industrial recommender systems. Most of the current deep CTR models resort to building complex network architectures to better capture intricate feature interactions or user behaviors. However, we identify two limitations in these models: (1) the samples given to the model are undifferentiated, which may lead the model to learn a larger number of easy samples in a single-minded manner while ignoring a smaller number of hard samples, thus reducing the model's generalization ability; (2) differentiated feature interaction encoders are designed to capture different interactions information but receive consistent supervision signals, thereby limiting the effectiveness of the encoder. To bridge the identified gaps, this paper introduces a novel CTR prediction framework by integrating the plug-and-play Twin Focus (TF) Loss, Sample Selection Embedding Module (SSEM), and Dynamic Fusion Module (DFM), named the Twin Focus Framework for CTR (TF4CTR). Specifically, the framework employs the SSEM at the bottom of the model to differentiate between samples, thereby assigning a more suitable encoder for each sample. Meanwhile, the TF Loss provides tailored supervision signals to both simple and complex encoders. Moreover, the DFM dynamically fuses the feature interaction information captured by the encoders, resulting in more accurate predictions. Experiments on five real-world datasets confirm the effectiveness and compatibility of the framework, demonstrating its capacity to enhance various representative baselines in a model-agnostic manner. To facilitate reproducible research, our open-sourced code and detailed running logs will be made available at: https://github.com/salmon1802/TF4CTR.
With the impressive performance in various downstream tasks, large language models (LLMs) have been widely integrated into production pipelines, like recruitment and recommendation systems. A known issue of models trained on natural language data is the presence of human biases, which can impact the fairness of the system. This paper investigates LLMs' behavior with respect to gender stereotypes, in the context of occupation decision making. Our framework is designed to investigate and quantify the presence of gender stereotypes in LLMs' behavior via multi-round question answering. Inspired by prior works, we construct a dataset by leveraging a standard occupation classification knowledge base released by authoritative agencies. We tested three LLMs (RoBERTa-large, GPT-3.5-turbo, and Llama2-70b-chat) and found that all models exhibit gender stereotypes analogous to human biases, but with different preferences. The distinct preferences of GPT-3.5-turbo and Llama2-70b-chat may imply the current alignment methods are insufficient for debiasing and could introduce new biases contradicting the traditional gender stereotypes.
Shifting the focus from principles to practical implementation, responsible artificial intelligence (AI) has garnered considerable attention across academia, industry, and society at large. Despite being in its nascent stages, this emerging field grapples with nebulous concepts and intricate knowledge frameworks. By analyzing three prevailing concepts - explainable AI, trustworthy AI, and ethical AI, this study defined responsible AI and identified its core principles. Methodologically, this study successfully demonstrated the implementation of leveraging AI's capabilities into bibliometrics for enhanced knowledge discovery and the cross-validation of experimentally examined models with domain insights. Empirically, this study investigated 17,799 research articles contributed by the AI community since 2015. This involves recognizing key technological players and their relationships, unveiling the topical landscape and hierarchy of responsible AI, charting its evolution, and elucidating the interplay between the responsibility principles and primary AI techniques. An analysis of a core cohort comprising 380 articles from multiple disciplines captures the most recent advancements in responsible AI. As one of the pioneering bibliometric studies dedicated to exploring responsible AI, this study will provide comprehensive macro-level insights, enhancing the understanding of responsible AI while furnishing valuable knowledge support for AI regulation and governance initiatives.