We introduce P.STAR: a new parameter-free method for inspecting the embedding space of a set of instance representations (e.g. words in context) across the layers of a contextualized language model at a glance and without loss of information. We use a matrix reordering approach that relies on signature distances; instances are represented in terms of their distances to all other instances in the set. The reordered matrix of signature representations provides a visual depiction of how structures emerge across layers of a contextualized model. We validate our approach by comparing how words in context are grouped with respect to their PoS and sense labels in two annotated datasets. In line with previous research, we observe that lower-layer representations emphasize PoS information while higher layers group instances on the basis of senses. We also show that our reordering approach can identify errors in the labels.
Within the field of model analysis and interpretability, several methods for analyzing internal model representations have been proposed [4]. The earliest and still widely used approaches rely on probing. This paradigm tests whether a specific linguistic property can be predicted from embedding representations using labeled data and a simple (often linear) classifier. Probing results have been shown to vary substantially depending on the choice of classifier; this brittleness has raised questions about the reliability of the approach [3, 13].
Clustering approaches can be seen as more direct ways of representing structure and have been used to analyze and visualize hidden layer representations (see, for example, [1, 6]). While clustering approaches do not require a classifier, they still depend on crucial parameters (e.g. the number of clusters). Zhou and Srikumar [31] propose hierarchical clustering as a parameter-free alternative; however, their approach crucially relies on labeled data to form clusters.
Previous work on visualizing embedding spaces includes methods using dimensionality reduction [5, 14, 23, 24, 25], which lose information when projecting to lower dimensions. Matrix reordering techniques [29] have shown that optimal linear orders can reveal meaningful structure in distance matrices, supporting our approach.
While cosine distances are commonly used for comparing embeddings, they can be noisy and unreliable [27, 30]. Signature distances, previously used in semantic change detection [11, 20, 21], provide a more robust alternative. Unlike previous work that uses static word embeddings, we apply signature distances to instance representations from contextualized models, creating second-order embeddings that capture contextual relationships within each layer.
P.STAR orders embedding instances so that similar points are grouped together, revealing clusters and structures. We frame this as an optimization problem: find an ordering that minimizes the distances between consecutive points. This is equivalent to solving an Open Traveling Salesperson Problem (OTSP), which we solve efficiently using the Concorde solver via the NEOS server [2, 8, 10, 12].
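The open-TSP formulation can be illustrated on a toy distance matrix. Large instances require an exact solver (the paper uses Concorde via the NEOS server), but for a handful of points exhaustive search over permutations suffices. The helper below, `best_open_path`, is an illustrative sketch and not part of the published tooling:

```python
import itertools
import numpy as np

def best_open_path(D):
    """Exhaustive open-TSP: return the ordering of all points that
    minimizes the sum of distances between consecutive points.

    Only feasible for small n (O(n!) permutations); the paper solves
    large instances exactly with the Concorde solver instead.
    """
    n = D.shape[0]
    best_order, best_cost = None, float("inf")
    for perm in itertools.permutations(range(n)):
        # Open path: no return edge from the last point to the first.
        cost = sum(D[perm[k], perm[k + 1]] for k in range(n - 1))
        if cost < best_cost:
            best_order, best_cost = list(perm), cost
    return best_order, best_cost
```

For points on a line at positions 0, 2, and 1, the optimal open path visits them in spatial order (0, 1, 2 or its reverse), with total length 2.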
Instead of using standard cosine distances directly, we employ signature distances for more robust comparisons. For each instance, we compute its signature—a vector representing its distances to all other instances in the dataset. This signature captures the contextual relationships more reliably than direct cosine distances, which can be noisy [27]. The signature distance between two instances is then computed as the normalized Euclidean distance between their signatures.
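The signature construction can be sketched in a few lines of numpy. Note that the paper specifies a "normalized" Euclidean distance without fixing the normalization; dividing by the square root of the signature length is one plausible choice and is an assumption here:

```python
import numpy as np

def cosine_distance_matrix(X):
    """Pairwise cosine distances between the row vectors of X."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return 1.0 - Xn @ Xn.T

def signature_distances(X):
    """Second-order (signature) distances between instances.

    Row i of the cosine distance matrix is instance i's signature:
    its distances to all instances in the set. The signature distance
    between two instances is the Euclidean distance between their
    signatures, normalized here by sqrt(n) (normalization choice is
    an assumption, not specified in the source).
    """
    S = cosine_distance_matrix(X)  # row i = signature of instance i
    n = S.shape[0]
    # Pairwise Euclidean distances between signature rows.
    diff = S[:, None, :] - S[None, :, :]
    return np.linalg.norm(diff, axis=-1) / np.sqrt(n)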
By computing signature distances for all pairs of instances and solving the OTSP problem, we obtain an optimal linear ordering. This ordering is then visualized as a matrix where colors represent distances, revealing clusters and structures at different scales without requiring any parameters or dimension reduction.
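Given the ordering, the displayed matrix is simply the signature-distance matrix with rows and columns permuted accordingly; clusters then appear as low-distance blocks along the diagonal (the color mapping itself would be handled by, e.g., matplotlib's `imshow`). A minimal sketch:

```python
import numpy as np

def reorder_matrix(D, order):
    """Permute rows and columns of a symmetric distance matrix by the
    given ordering, so that similar instances become adjacent and
    clusters show up as low-distance blocks on the diagonal."""
    idx = np.asarray(order)
    return D[np.ix_(idx, idx)]
```

For example, if instances 0 and 2 are close but instance 1 is distant from both, applying the ordering [0, 2, 1] moves the small distance next to the diagonal while preserving symmetry.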
We validate P.STAR using BERT-base-uncased [9] and GPT2 [18] models, analyzing embeddings from all 12 layers. We use two annotated datasets: SemCor3.0 [17] (manually annotated with WordNet senses) and OMSTI [26] (automatically annotated, containing one million sense-tagged instances). Both datasets include part-of-speech (PoS) tags.
We demonstrate P.STAR's effectiveness through three case studies. For the word "state", P.STAR clearly separates verbal and nominal uses in lower layers (Layer 7), while distinct senses emerge in higher layers. For "work", we observe a more complex pattern where syntactic boundaries (noun vs. verb) dominate in lower layers, while semantic relationships (related senses) become more prominent in higher layers.
Importantly, P.STAR also reveals annotation errors. When analyzing "cell", we identified mislabeled instances in the OMSTI dataset, including cases where "cell" referred to terrorist groups but was incorrectly labeled as biological cells. These errors were detectable because the instances were placed among semantically related prison cell references.
Our results confirm that lower layers emphasize syntactic (PoS) information while higher layers group instances by semantic similarity (senses), consistent with previous research [15, 28].
Our method relies on the visual inspection of matrices; as such, our validation consists of a comparatively small set of examples that serve as a first illustration. The clarity of the observations does, however, provide convincing evidence for the reliability of our method. This first step opens up new directions for exploration. One of the reasons why the approach provides such clear structures is that we make use of signature distances. The second-order distance approach is more robust than cosine distances between embeddings.
The major advantage of our method is that it can represent an entire set of instances at one glance without reducing dimensions, or setting parameters that might influence results. As such, it can provide a more direct depiction of structures than could be achieved through clustering approaches. As demonstrated in our validation, our method shows how clusters and nested clusters arise from the linear reordering of second-order signature representations.
From a linguistic analysis point of view, we believe that P.STAR could offer a powerful new method for the analysis of linguistic properties in contextualized language models and could yield new theoretical insights. From a practical perspective, P.STAR can be used to detect layers in contextualized models that are likely to carry useful information for a particular task or for further model fine-tuning. Our analysis shows that layers that capture word senses particularly well can be identified. Word sense information is of central importance for the task of Lexical Semantic Change Detection. Both the use of signature distances and a more precise identification of the layers where sense information is most prominent can help boost the performance of contextualized models on a task where static word embeddings often still outperform them [16, 22].
In the future, we will build on our P.STAR method to develop interactive visual analytics tools, enabling the exploration of patterns across model layers, for example, to investigate the progression of the learned semantic separability across the layers.