Last autumn, the Swedish Agency for Higher Education decided to give the University of Borås the right to award doctoral degrees. Previously, Borås research students received their PhDs from other seats of learning, such as the University of Gothenburg, even though their research was done in Borås. Mikael Gunnarsson’s disputation (public thesis defence) at the School of Library and Information Sciences is the first disputation at the University of Borås. The subject of his research was automatic genre classification of digital texts.
“There are already pretty good routines for classifying digital texts based on their content,” he says. “But as we search for more and more information through digital channels, it becomes increasingly important to know what kinds of documents we find; is it a scientific article, a technical report or dictionary content?”
It is not always apparent to us as readers what kind of text we have found, and it is no clearer to the computer software that is called upon to distinguish between different kinds of texts. That task requires a special kind of software, known as machine learning algorithms. The software is fed with different texts and told what kind of text each one is.
“You also enter special characteristics for different kinds of texts. A text with an unusually large number of question marks might be recognised as a so-called ‘FAQ page’ (frequently asked questions), while a text with a lot of spatial adverbs, words such as ‘above,’ ‘between,’ or ‘outside,’ is most likely a descriptive text as opposed to a scientific article or technical report,” says Mikael Gunnarsson. “It is also a matter of how long the text is or how long the sentences are.”
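As a rough illustration of the kind of characteristics described here, the sketch below counts question marks and spatial adverbs and measures text and sentence length. It is not the method used in the thesis: the function name `extract_features`, the word list `SPATIAL_ADVERBS` (limited to the three words quoted above) and the simple sentence splitting are all assumptions made for the example.

```python
import re

# Hypothetical word list: only the three spatial adverbs quoted in the article.
SPATIAL_ADVERBS = {"above", "between", "outside"}


def extract_features(text):
    """Return a few surface features of the kind mentioned in the article."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    # Rough heuristic: split on runs of sentence-ending punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words)
    return {
        "question_marks": text.count("?"),
        "spatial_adverbs": sum(1 for w in words if w in SPATIAL_ADVERBS),
        "word_count": n_words,
        "avg_sentence_length": n_words / len(sentences) if sentences else 0.0,
    }


faq_like = "What is a genre? How is it detected? Why does it matter?"
print(extract_features(faq_like))
```

A real system would probably normalise the counts by text length and use a far larger set of characteristics; the sketch only shows how such features can be turned into numbers that software can work with.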
Classification based on genre
Feeding the computer with data like that is a vital part of his research work. He has also examined which text characteristics an algorithm should focus on, and what is unique to a certain genre.
An algorithm consists of the instructions that are given to the software so that it can solve the task. In this case, the instructions cover what the software needs to know in order to classify the texts based on genre.
“The difficult thing is that our way of writing is not entirely predictable, and you have to base it on assumptions. It is, for example, very difficult to tell a scientific article from a technical report, because they can be very similar and be constructed in the same way.”
Better than randomness
After feeding the software with texts that had already been correctly classified, and then adjusting the algorithms accordingly, Mikael Gunnarsson tested how well the classification works on unclassified texts.
“The results, which are presented in the doctoral thesis, demonstrate that the software has managed to classify the genres better than a random generator, but that the results are far from satisfactory.”
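The procedure described above (train on texts with known genres, test on texts the software has not seen, compare with random guessing) can be sketched as follows. This is a minimal illustration under stated assumptions, not the approach of the thesis: the nearest-centroid classifier, the function names and the tiny data set are all invented for the example, and the numbers say nothing about the actual results.

```python
import random


def centroid(vectors):
    """Mean of a list of equal-length feature vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def train(examples):
    """examples: list of (feature_vector, genre). Returns one centroid per genre."""
    by_genre = {}
    for vec, genre in examples:
        by_genre.setdefault(genre, []).append(vec)
    return {genre: centroid(vecs) for genre, vecs in by_genre.items()}


def classify(model, vec):
    """Pick the genre whose centroid is closest (squared Euclidean distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(vec, c))
    return min(model, key=lambda genre: dist(model[genre]))


def accuracy(predict, examples):
    """Share of examples for which predict(vec) matches the known genre."""
    return sum(predict(vec) == genre for vec, genre in examples) / len(examples)


# Invented toy data: [question marks, spatial adverbs] for each text.
training = [([9, 0], "faq"), ([7, 1], "faq"),
            ([0, 8], "descriptive"), ([1, 6], "descriptive")]
held_out = [([8, 1], "faq"), ([0, 7], "descriptive"), ([4, 4], "faq")]

model = train(training)
genres = sorted(model)
rng = random.Random(0)  # fixed seed so the baseline is repeatable

print("classifier:", accuracy(lambda v: classify(model, v), held_out))
print("random baseline:", accuracy(lambda v: rng.choice(genres), held_out))
```

The third held-out text is deliberately ambiguous, so the toy classifier gets it wrong, which echoes the point that writing is not entirely predictable. With only three test texts the random baseline is very noisy; a meaningful comparison needs a much larger test set.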
Mikael Gunnarsson has worked as a teacher at the School of Library and Information Sciences at the University of Borås since 1992, and found it was high time to get his doctoral degree. Since he has been a teacher of classification and is interested in the Internet, the choice of subject was a natural one.
What are the benefits of your research?
“It is handy when it comes to evaluating sources, when the reader needs to be aware of what kind of text he or she is dealing with in order to make an assessment of the information in the text. With further development we could reach a stage where a Google search provides a symbol next to each hit, revealing which genre a text belongs to.”
Margareta Lundberg Rodin is the head of the School of Library and Information Sciences, and is excited about Mikael Gunnarsson’s disputation.
“I like that our school was the first at the university to graduate a PhD student,” she says. “The University of Borås was awarded graduation rights in many areas, and this thesis happened to be the first to face a disputation. An extra festive disputation is in order to celebrate that.”
How important are the rights to present degrees to the university?
“To the PhD students themselves the difference is not that great, since the research programme remains the same, but some routines do change as the research programme is entirely Borås-based. It is of major strategic importance that the university has the right to award degrees at all levels, including PhD degrees. It enhances the school’s status and means that the students at basic and intermediate levels can benefit from current research, which in turn enhances the quality of the teaching.”
Mikael Gunnarsson’s disputation takes place on Friday the 29th of April 2011. The title of his thesis is ‘Classification along Genre Dimensions: Exploring a Multidisciplinary Problem’.