Diagnosing Untruthfulness: A G-Eval and Bootstrap Analysis of LLM Failure Modes on TruthfulQA

Shujun Yang

doi:10.54097/vtwqfr43

Keywords:

Large language models, G-Eval framework, bootstrap analysis, GPT.

Abstract

Large language models (LLMs) often fail on adversarial questions, but the specific nature of these failures is not well understood, which hinders efforts to improve their reliability. This study moves beyond simple accuracy metrics to systematically diagnose the error patterns of GPT-3.5-turbo and GPT-4 on the TruthfulQA benchmark. The study first built an "error corpus" by collecting the incorrect responses from both models. A more advanced model, GPT-4o, then served as a judge within a G-Eval framework, classifying each error into a fine-grained taxonomy of predefined categories. The statistical robustness of the resulting error distributions was checked with a 5,000-iteration bootstrap simulation. The analysis shows that both models exhibit stable, dominant failure modes, and that repeating common human misconceptions is the primary cause of untruthfulness. These findings provide a granular, validated "pathology report" of LLM errors that can inform future model alignment and targeted safety interventions.
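The bootstrap step mentioned in the abstract can be illustrated with a minimal sketch. It assumes each judged error has already been reduced to a single category label, and it uses a percentile bootstrap, which is one common variant; the abstract does not say which variant the study used. The function name `bootstrap_category_ci`, the category labels, and the counts below are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def bootstrap_category_ci(labels, n_iter=5000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for the share of each error category.

    labels: list of category names, one per judged error (hypothetical input).
    Returns {category: (observed_share, lower_bound, upper_bound)}.
    """
    rng = random.Random(seed)
    n = len(labels)
    categories = sorted(set(labels))
    shares = {c: [] for c in categories}

    for _ in range(n_iter):
        # Resample the error corpus with replacement, keeping its size fixed.
        resample = rng.choices(labels, k=n)
        counts = Counter(resample)
        for c in categories:
            shares[c].append(counts[c] / n)

    # Indices of the lower and upper percentiles in the sorted replicates.
    lo_idx = max(0, int((alpha / 2) * n_iter))
    hi_idx = min(n_iter - 1, int((1 - alpha / 2) * n_iter) - 1)

    observed = Counter(labels)
    result = {}
    for c in categories:
        ordered = sorted(shares[c])
        result[c] = (observed[c] / n, ordered[lo_idx], ordered[hi_idx])
    return result


if __name__ == "__main__":
    # Invented labels for illustration only; not the study's data.
    example = ["misconception"] * 60 + ["fabrication"] * 25 + ["other"] * 15
    for name, (share, low, high) in bootstrap_category_ci(example).items():
        print(f"{name}: share={share:.2f}, 95% interval=({low:.2f}, {high:.2f})")
```

A narrow interval around a category's observed share suggests that its rank in the error distribution is not an artifact of the particular sample of errors, which is the sense in which a failure mode can be called stable.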
