Large language models are powerful architectures for the self-supervised analysis of data of various kinds, ranging from protein sequences to text to images. In these models, the data representations in the hidden layers all live in the same space, and the semantic structure of the dataset emerges through a sequence of functionally identical transformations between one representation and the next. We study global (intrinsic dimension, ID) and local (neighbourhood overlap) geometric properties of these representations, focusing on how they evolve across the layers. We show that the semantic complexity of the dataset emerges in correspondence with the plateaus of the ID profile, and we suggest an unsupervised strategy to identify the layers most suitable for downstream learning tasks.
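As a rough illustration of the two quantities named in the abstract, the sketch below estimates an intrinsic dimension and a neighbourhood overlap from arrays of hidden representations. It is a minimal sketch under stated assumptions, not the speaker's implementation: the abstract does not say which ID estimator is used, so the maximum-likelihood form of a two-nearest-neighbour estimator is assumed here, and the function names and the choice k=10 are hypothetical.

```python
import numpy as np


def pairwise_sq_dists(x):
    """Squared Euclidean distances between rows of x; self-distances set to inf."""
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    d2 = np.maximum(d2, 0.0)      # clip small negative values from round-off
    np.fill_diagonal(d2, np.inf)  # a point is never its own neighbour
    return d2


def two_nn_id(x):
    """Intrinsic dimension from the ratio of 2nd to 1st neighbour distances.

    Assumption: maximum-likelihood form, ID = N / sum(log(r2 / r1)).
    Points with an exact duplicate (r1 == 0) are skipped.
    """
    two_smallest = np.sort(pairwise_sq_dists(x), axis=1)[:, :2]
    r1 = np.sqrt(two_smallest[:, 0])
    r2 = np.sqrt(two_smallest[:, 1])
    keep = r1 > 0
    mu = r2[keep] / r1[keep]
    return mu.size / np.sum(np.log(mu))


def neighbourhood_overlap(x_a, x_b, k=10):
    """Average fraction of the k nearest neighbours shared by two representations.

    x_a and x_b must describe the same data points, row by row
    (for example, the outputs of two different layers).
    """
    nn_a = np.argsort(pairwise_sq_dists(x_a), axis=1)[:, :k]
    nn_b = np.argsort(pairwise_sq_dists(x_b), axis=1)[:, :k]
    shared = [len(set(a) & set(b)) for a, b in zip(nn_a, nn_b)]
    return float(np.mean(shared)) / k
```

Applied layer by layer, `two_nn_id` would give an ID profile across the network, and `neighbourhood_overlap` between consecutive layers would indicate how much the local neighbourhood structure is rearranged from one layer to the next. Both functions build a full N×N distance matrix, so this form is only practical for a few thousand points.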
Alberto Cazzaniga is a permanent Researcher at the Research and Innovation Technology (RIT) Institute at AREA Science Park in Trieste, where he works in and jointly coordinates the activities of the LADE research group, focusing on applications of artificial intelligence techniques in the life sciences and materials science. After completing a DPhil in Pure Mathematics at the University of Oxford and holding a Claude Leon Fellowship at AIMS-SA in Cape Town, he joined CNR-IOM in Trieste, where he also completed a Master in High-Performance Computing (MHPC) at SISSA and ICTP, transitioning to research in advanced statistical modelling. He teaches representation learning and generative modelling at MHPC, and Language Models at the DSSC Master at UniTS.