Multi-Scale Entropy for Transformers: Interpreting Training Dynamics and Guiding an Adaptive Training Pipeline

Jiahan Xie

doi:10.54097/ynsrha65

Keywords:

Transformer, deep learning, information theory.

Abstract

The internal training dynamics of Transformer models remain poorly understood, which limits the development of both interpretable diagnostics and efficient training strategies. This paper proposes a unified multi-scale entropy framework that serves both as an analytical tool for interpreting training dynamics and as a practical controller for adaptive optimization. The framework is demonstrated through a dual-track investigation. First, it is used analytically to dissect a BERT model's attention patterns on linguistic phenomena, revealing a clear layer-wise hierarchy of functional specialization. Second, it is operationalized into entropy-guided adaptive training strategies that are combined into a pipeline for Vision Transformers (ViT). The pipeline uses the stabilization of mid-layer attention entropy, a robust convergence signal identified through the analysis, to trigger early stopping and staged optimization. Experiments show that the adaptive pipeline achieves performance comparable to a full training schedule while significantly reducing computational cost. Ultimately, this work establishes multi-scale entropy as a bridge between understanding how Transformers learn and controlling how they are trained, providing a versatile framework for enhancing both model interpretability and training efficiency.
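As a rough illustration of the signal described in the abstract, the following minimal sketch computes the mean Shannon entropy of an attention matrix and checks whether a recorded entropy history has flattened. It is not the paper's implementation. The function names `attention_entropy` and `has_stabilized`, the `window` and `tol` parameters, and the range-based plateau test are all assumptions introduced here for illustration; the abstract does not specify the exact entropy definition, the layers monitored, or the stopping criterion.

```python
import math
from typing import Sequence


def attention_entropy(attn_rows: Sequence[Sequence[float]]) -> float:
    """Mean Shannon entropy (in nats) over the rows of one attention matrix.

    Assumption: each row is a probability distribution over keys
    (non-negative, summing to 1), as produced by a softmax.
    """
    total = 0.0
    for row in attn_rows:
        # Zero-probability entries contribute nothing to the entropy.
        total += -sum(p * math.log(p) for p in row if p > 0.0)
    return total / len(attn_rows)


def has_stabilized(history: Sequence[float], window: int = 3, tol: float = 1e-2) -> bool:
    """Hypothetical plateau test: True when the last `window` entropy
    values differ by at most `tol` (max minus min)."""
    if len(history) < window:
        return False
    recent = history[-window:]
    return max(recent) - min(recent) <= tol


# Illustrative use: made-up per-epoch entropies for one mid-layer head.
entropy_per_epoch = [2.0, 1.5, 1.204, 1.202, 1.2]
stop_training = has_stabilized(entropy_per_epoch)
```

Under these assumptions, uniform attention over n keys gives an entropy of log n, and fully concentrated (one-hot) attention gives 0, so a falling and then flat curve corresponds to attention that has sharpened and stopped changing. In a real pipeline the attention rows would come from the model's attention weights, averaged over heads and batches in whatever way the monitoring scheme prescribes.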
