Many systems already exist for automatically classifying texts into different genres. Johan Eklund has found that such systems perform better if they have knowledge about words’ semantic relationships. He discovered this when he compared how well different systems classify different texts and which methods work best.
Doctoral thesis: With or without context: automatic text categorization using semantic kernels
By doctoral student Johan Eklund at the University of Borås
Supervisor: Professor Sándor Darányi
Defense: 15 April 2016, 1 PM, University of Borås, Allégatan 1, Borås, Room C203.
– We want systems to be able to automatically recognize what documents are about and to differentiate between different types of texts; for example, different types of news in newspapers, or whatever it is we are interested in classifying, he says. For this, machine learning is used, or, in other words, we train the systems by analysing different texts, classifying them, and entering this information into the system. But we can also apply methods to seek out the words’ semantic relationships in digital texts.
Semantic relationships are how different words are related in meaning. Words that often occur near each other tend to have something in common in terms of meaning. Take, for example, the terms “free kick” and “corner kick”, which often occur near each other and which both have a meaning related to football. The system can detect such things by statistically examining how the words are used together in the texts. Collocations (words that frequently appear together) can then be identified and added to the database of words’ meanings.
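The statistical idea described above can be sketched in a few lines of Python. This is only an illustration, not the method used in the thesis: it assumes that co-occurrence is counted per sentence and that pairs are scored with pointwise mutual information (PMI), one common measure of association. The function name `cooccurrence_pmi`, the `min_count` parameter, and the toy sentences are all invented for the example.

```python
from collections import Counter
from itertools import combinations
import math


def cooccurrence_pmi(sentences, min_count=2):
    """Score word pairs by PMI, counting co-occurrence once per sentence.

    Assumption: whitespace tokenisation and sentence-level windows,
    chosen only to keep the sketch short.
    """
    word_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        words = sorted(set(sentence.lower().split()))
        word_counts.update(words)
        # combinations() over a sorted list yields each pair as (a, b) with a < b
        pair_counts.update(combinations(words, 2))

    n = len(sentences)
    scores = {}
    for (a, b), joint in pair_counts.items():
        if joint < min_count:
            continue  # ignore pairs seen too rarely to be reliable
        # PMI = log( P(a, b) / (P(a) * P(b)) )
        p_joint = joint / n
        p_a = word_counts[a] / n
        p_b = word_counts[b] / n
        scores[(a, b)] = math.log(p_joint / (p_a * p_b))
    return scores


# Toy corpus (invented): two football sentences, two political ones.
sentences = [
    "the free kick led to a corner",
    "a corner followed the free kick",
    "the election result surprised voters",
    "voters discussed the election",
]
scores = cooccurrence_pmi(sentences)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

In this toy corpus, “corner” and “kick” score higher than “kick” and “the”, because “the” appears in every sentence and so carries no information about what “kick” is related to.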
Tested techniques on different types of texts
– I have used different techniques to create semantic models, or models about semantic relationships. And I have also applied the models in the systems for automatic text classification to see if that can improve the system’s performance. At the same time, I have been able to compare what models or techniques work best.
The techniques have been applied to texts of three different types that have been obtained from document collections constructed for research purposes. Johan Eklund has tested the techniques on newspaper texts, medical and scientific texts, and texts from the communications platform Usenet.
– The results were very positive. They show that language models improve text classification. This is especially evident when the system has not received much “training,” in which you make manual analyses and enter the classifications, but has only been trained on a small number of documents. Then the language models compensate for the lack of training documents. This result is one small contribution among many towards making text categorisation more efficient, something that is needed both in research and in other areas in order to find relevant information quickly.
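Why knowledge of word relationships can make up for scarce training data can be illustrated with a toy similarity function of the kind the thesis title refers to as a semantic kernel. The sketch below is not taken from the thesis: the relatedness values in `RELATED` and all function names are invented. It shows that two documents with no words in common have zero similarity under a plain word-overlap measure, but non-zero similarity once related words are allowed to match.

```python
from collections import Counter

# Hypothetical relatedness scores between distinct words (assumed values,
# standing in for what a semantic model would estimate from text).
RELATED = {
    frozenset(("goal", "penalty")): 0.8,
    frozenset(("match", "game")): 0.9,
}


def term_similarity(a, b):
    """Return 1.0 for identical words, an assumed score for related ones, else 0.0."""
    if a == b:
        return 1.0
    return RELATED.get(frozenset((a, b)), 0.0)


def linear_kernel(doc_a, doc_b):
    """Plain bag-of-words similarity: only identical words contribute."""
    counts_a, counts_b = Counter(doc_a.split()), Counter(doc_b.split())
    return sum(counts_a[w] * counts_b[w] for w in counts_a)


def semantic_kernel(doc_a, doc_b):
    """Bag-of-words similarity in which related words also contribute."""
    counts_a, counts_b = Counter(doc_a.split()), Counter(doc_b.split())
    return sum(
        counts_a[u] * counts_b[v] * term_similarity(u, v)
        for u in counts_a
        for v in counts_b
    )


doc_a = "penalty match"
doc_b = "goal game"
print(linear_kernel(doc_a, doc_b))    # no shared words
print(semantic_kernel(doc_a, doc_b))  # related words now count
```

A classifier that has seen only `doc_a` as a football example can still recognise `doc_b` as similar, which is the compensating effect described in the quote. For such a function to be usable as a kernel in practice, the underlying term-similarity matrix has to be positive semi-definite.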
The geometry of language
Another, more theoretical, investigation that Johan Eklund has carried out is a study of what is really meant by “class” and “classification”, as well as how we use language to organise books and other documents. One discovery he made in this work was that words’ formal semantic relationships can be described with concepts that we recognise from geometry.
– We use, maybe without thinking about it, geometric expressions to explain how different concepts are related to one another. For example, we say that the concept “quantum mechanics” is contained within the concept “physics.” And it turns out that this structure is also reflected in the physical library building, where books on quantum mechanics are within the section for physics. We can also say that one subject area is conceptually close to another, which we can also see in the library building, as these areas are placed near one another. This was probably the most fascinating thing for me in my research, and something that I felt intuitively. It was euphoric to be able to show it!
Johan Eklund is a university adjunct at the Faculty of Librarianship, Information, Education and IT. He is very interested in mathematics, language, computer science, and classification.
– When I was about to start my research, I looked for a topic that could combine my interest in all these areas and at the same time bring classification into modern society.