Visualizing documents
One of the courses I’ve been following the past few weeks is Scientific Visualization. The course mainly consisted of lab sessions in which we played around with a 3D visualization tool called Amira, loading several data sets in order to produce nice pics. I must say that the course wasn’t particularly hard, but what proved to be the most fun part of it in the end was the paper we (that is, my partner in crime Laurens and I) had to write. Here’s its abstract, in case you’re interested; it’s about visualizing documents.
In today’s world of cyberspace and Google, more and more documents are accessible. Literally trillions of documents (if not more) exist today, with many billions added every day. The need for people to have quick access to relevant documents grows at an equal pace, but is severely limited by the sheer number of available documents and the amount of rubbish present among them. First, this paper will investigate a visual approach to address these problems and make a distinction between different classes of visualization goals in the document visualization domain.
As data acquisition is key to any visualization, the first problem at hand is the challenge of dealing with natural language and finding what can help to extract useful features from it. While counting words can give a good estimate of a document’s contents, it is by no means satisfactory, and different aspects of natural language that can be either helpful or harmful will be covered. Language is not the only problem: unlike physical phenomena, documents cannot easily be represented in a Cartesian coordinate system, which causes all kinds of restrictions that need to be dealt with.
Once the problems in the field have become clear, a selection of existing methods is evaluated against some criteria to determine whether they are usable or not. These methods aren’t necessarily specific to document visualization. Document features are ordinary metrics that could be visualized with any suitable general-purpose tool, so some of these will be tried as well.
Finally, a method called ThemeStar will be presented. ThemeStar is a single-document visualization which is partly borrowed from and inspired by a visualization from so-called role-playing games. It consists of a “blob” whose shape denotes affinities with defined search strings, and it relies heavily on the human ability to tell shapes apart. It is this ability that is key to achieving a visualization that allows for lightning-fast judgement of document relevance just by quickly looking at the documents’ diagrams.
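To give a rough feel for the idea of a shape that denotes affinities with search strings, here is a minimal Python sketch. It is not the code or the metric from the paper: the function names (word_counts, affinity, star_vertices), the affinity measure (the plain fraction of a document's words that occur in the search string) and the layout (one evenly spaced spoke per search string) are all assumptions made up for illustration.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count lowercase word occurrences in a document."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def affinity(counts, search_string):
    """Fraction of the document's words that occur in the search string.

    Hypothetical metric, chosen only for illustration.
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    hits = sum(counts[w] for w in set(search_string.lower().split()))
    return hits / total


def star_vertices(affinities):
    """Return one (x, y) vertex per search string on evenly spaced spokes.

    The distance from the centre is the affinity, scaled so that the
    strongest affinity gets radius 1 (an assumed normalisation).
    """
    n = len(affinities)
    peak = max(affinities, default=0.0) or 1.0
    points = []
    for i, a in enumerate(affinities):
        angle = 2 * math.pi * i / n
        r = a / peak
        points.append((r * math.cos(angle), r * math.sin(angle)))
    return points


# Example: one toy document scored against three search strings.
doc = "LaTeX makes papers look slick. BibTeX handles the references in LaTeX papers."
queries = ["latex", "bibtex references", "word"]
scores = [affinity(word_counts(doc), q) for q in queries]
shape = star_vertices(scores)
```

Connecting the points in shape in order gives a polygon that bulges towards the search strings the document matches and collapses towards the centre for the ones it doesn't. Plain word counting like this has exactly the weaknesses the abstract mentions, so treat it as a toy.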
While I found producing bulk amounts of text quite fun to do, I especially enjoyed writing this because I finally got to know some very nifty features of LaTeX. I’ve been using LaTeX for quite a while now, but mainly for lab reports only a few pages long, which didn’t make it worthwhile to use fancy headers and cool things like BibTeX and cross-references that achieve this extremely slick look. To make matters worse, the other people I’ve been writing papers with so far insisted on using Word, which doesn’t help much either. But patience pays off…
To those who already have in-depth LaTeX knowledge my joy may seem a bit… overdone, but hey, isn’t a moment of sheer excitement about a tool the reason you’re still using it..? I know I will :) . As for the paper’s contents, it has mainly got me interested in linguistics, and I think I’ll bug a friend of mine who studies German until he cries and tells me everything he knows about it.