Multi-Scale Entropy for Transformers: Interpreting Training Dynamics and Guiding an Adaptive Training Pipeline
DOI:
https://doi.org/10.54097/ynsrha65
Keywords:
Transformer, deep learning, information theory.
Abstract
The internal training dynamics of Transformer models remain poorly understood, which limits the development of both interpretable diagnostics and efficient training strategies. This paper proposes a unified multi-scale entropy framework that serves both as an analytical tool for interpreting training dynamics and as a practical controller for adaptive optimization. The framework is demonstrated through a dual-track investigation. First, it is used analytically to dissect a BERT model's attention patterns on linguistic phenomena, revealing a clear layer-wise hierarchy of functional specialization. Second, it is operationalized into entropy-guided adaptive training strategies, which are combined into a pipeline for Vision Transformers (ViT). The pipeline uses the stabilization of mid-layer attention entropy, a robust convergence signal identified through the analysis, to trigger early stopping and staged optimization. Experiments show that the adaptive pipeline achieves performance comparable to a full training schedule while significantly reducing computational cost. Overall, this work establishes multi-scale entropy as a bridge between understanding how Transformers learn and controlling how they are trained, providing a versatile framework for improving both model interpretability and training efficiency.
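The convergence signal described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's exact entropy definition or stopping rule, so everything below is an assumption made for illustration: entropy is taken as the mean Shannon entropy of each attention row, and "stabilization" is approximated by a plateau test over a sliding window. The function names `attention_entropy` and `entropy_stabilized`, and the parameters `window` and `tol`, are hypothetical.

```python
import numpy as np


def attention_entropy(attn, eps=1e-12):
    """Mean Shannon entropy (in nats) over all attention rows.

    attn: array of shape (heads, queries, keys) whose last axis sums to 1.
    eps guards against log(0); it is an assumed numerical safeguard.
    """
    attn = np.asarray(attn, dtype=float)
    row_entropy = -np.sum(attn * np.log(attn + eps), axis=-1)
    return float(row_entropy.mean())


def entropy_stabilized(history, window=5, tol=1e-2):
    """Assumed plateau test: True when the last `window` entropy values
    differ by less than `tol` between their maximum and minimum."""
    if len(history) < window:
        return False
    recent = history[-window:]
    return (max(recent) - min(recent)) < tol


# Usage: track one mid-layer's entropy per epoch and stop on a plateau.
history = []
for epoch_entropy in [2.0, 1.5, 1.2, 1.201, 1.199, 1.2, 1.202]:
    history.append(epoch_entropy)
    if entropy_stabilized(history):
        print(f"entropy plateau detected after {len(history)} epochs")
        break
```

In a real training loop, `attn` would be the softmax attention weights captured from a chosen mid-depth layer on a held-out batch. Which layer to monitor, how wide the window should be, and how tight the tolerance should be are tuning choices that this sketch does not settle.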
License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.







