Scenarios Where Hallucinatory Information Generated by Large Language Models is Most Difficult to Detect

Authors

  • Hao Chen

DOI:

https://doi.org/10.54097/zky1rh15

Keywords:

Large Language Models (LLMs), Hallucination, Hallucination Detection, Multimodality, Retrieval-Augmented Generation (RAG).

Abstract

Large Language Models (LLMs) have become widely used in professional and everyday settings, yet they remain prone to "hallucinations": outputs that are factually incorrect or inconsistent with their inputs, which undermines their reliability. No robust solution to this problem currently exists, and prior research on hallucination detection remains fragmented. This review examines how difficult hallucinations are to detect across different scenarios, including question answering, open-domain dialogue, and text summarization; how single-modal versus multi-modal inputs affect detectability; and how detection varies across types of negative responses. It also surveys existing detection tools to identify effective strategies. The findings indicate that question-answering tasks are the easiest to verify (manual detection ~91%, automated tools ~87%), while summarization is the hardest (manual ~65%, tools ~60%). Single-modal content is easier to verify (average ~82%) than multi-modal content (average ~59%, and lower still when the modalities carry heavily overlapping information). Evasive responses exhibit the highest hallucination rate (61%) and are the hardest to detect (43% accuracy). To address these gaps, this review recommends approaches such as Retrieval-Augmented Generation (RAG) and joint cross-modal verification to improve the trustworthiness of LLMs.
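To make the recommended RAG mitigation concrete, the sketch below shows the basic pattern: retrieve evidence relevant to a query, then constrain the model to answer only from that evidence. This is a minimal illustration in Python, not the pipeline evaluated in the article; the toy corpus, the keyword-overlap retriever, and the generate-prompt wording are all hypothetical stand-ins.

"""Minimal Retrieval-Augmented Generation (RAG) sketch.

Illustrative only: corpus, scoring, and prompt wording are hypothetical
stand-ins, not the detection pipeline studied in this review.
"""

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the query."""
    q_terms = set(query.lower().split())
    scored = sorted(
        corpus,
        key=lambda doc: len(q_terms & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_grounded_prompt(query: str, evidence: list[str]) -> str:
    """Instruct the model to answer only from retrieved evidence,
    which steers it away from unsupported (hallucinated) claims."""
    context = "\n".join(f"- {doc}" for doc in evidence)
    return (
        "Answer using only the evidence below. "
        "If the evidence is insufficient, say so.\n"
        f"Evidence:\n{context}\n"
        f"Question: {query}\nAnswer:"
    )

if __name__ == "__main__":
    corpus = [
        "The Eiffel Tower was completed in 1889 for the Paris World's Fair.",
        "Mount Everest is 8,849 metres tall as of the 2020 survey.",
        "The Great Wall of China stretches thousands of kilometres.",
    ]
    query = "When was the Eiffel Tower completed?"
    prompt = build_grounded_prompt(query, retrieve(query, corpus))
    print(prompt)  # Feed this grounded prompt to any LLM instead of the bare question.

In a production system the keyword-overlap scorer would be replaced by a dense or hybrid retriever, but the grounding step, answering strictly from retrieved evidence, is the mechanism by which RAG reduces hallucination.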




Published

29-01-2026


How to Cite

Chen, H. (2026). Scenarios Where Hallucinatory Information Generated by Large Language Models is Most Difficult to Detect. Academic Journal of Science and Technology, 19(2), 259-262. https://doi.org/10.54097/zky1rh15