Do I have to read or skim thousands of music reviews to get a sense of what contemporary music is? Who are the big names and where do they come from? What are the trends? How do genres intersect?
Could these exploratory questions be answered visually?
Each point in the scatterplot above represents the body text of a record review scraped from P4K (1999 – 2012). The placement of each point reflects the corresponding review’s similarity with other reviews.
I’ve generated this visualisation using Overview, a free tool for searching and exploring large collections of text documents. Overview is an open-source project of the Associated Press funded by the Knight Foundation, developed by Jonathan Stray in collaboration with Stephen Ingram and Tamara Munzner at UBC. A recent technical report explains some of its core algorithms in much greater detail than what I offer here.
Each document, a record review in this case, is converted into a list of words appearing in that document. Common English words are discarded, such as articles, conjunctions, and pronouns. It’s then possible to compute the “distance” from one document to another, a measure of similarity between their word lists. As a result, you have a massive high-dimensional space of inter-document distances.
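To make the idea concrete, here is a minimal Python sketch of the steps just described. It is not Overview's actual implementation: the tiny stop-word list, the three invented review snippets, and the choice of cosine distance over raw word counts are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "and", "or", "but", "it", "its", "he", "she",
              "they", "this", "that", "is", "of", "to", "in", "on", "with"}

def word_counts(text):
    """Lower-case the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty document as maximally distant
    return 1.0 - dot / (norm_a * norm_b)

# Invented snippets standing in for full review texts.
reviews = [
    "A shimmering electronic record with looping beats",
    "Looping beats and shimmering synths on this electronic album",
    "Crushing sludge metal riffs and a guttural howl",
]

vectors = [word_counts(r) for r in reviews]
# Full matrix of inter-document distances.
distances = [[cosine_distance(u, v) for v in vectors] for u in vectors]
```

The two electronic snippets share most of their words and end up close together, while the metal snippet shares none and sits at the maximum distance from both.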
To show all documents on a screen in 2 dimensions, we can use one of a number of techniques for transforming a high-dimensional space into a 2-dimensional space while preserving inter-document distances as faithfully as possible.
In the resulting scatterplot, the absolute position of documents has no meaning: there are no up-down or left-right axes. Only the relative distance between points is important. When reading a plot like this, your goal is to look for distinct cluster structure, rather than for correlation as you would in a typical 2-axis scatterplot.
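One such technique is classical multidimensional scaling (MDS). The sketch below uses NumPy and a made-up 4-document distance matrix. I am not claiming this is the algorithm Overview uses; the function name `classical_mds` and the toy matrix are assumptions for illustration only.

```python
import numpy as np

def classical_mds(distances, n_components=2):
    """Embed points in n_components dimensions from a distance matrix."""
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    # Double-centre the squared distances to obtain an inner-product matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    # eigh returns eigenvalues in ascending order; keep the largest ones.
    eigenvalues, eigenvectors = np.linalg.eigh(gram)
    top = np.argsort(eigenvalues)[::-1][:n_components]
    # Clip small negative eigenvalues that arise from non-Euclidean input.
    scale = np.sqrt(np.clip(eigenvalues[top], 0.0, None))
    return eigenvectors[:, top] * scale

# Toy distances: documents 0 and 1 are similar, as are documents 2 and 3.
toy = [
    [0.0, 0.3, 1.0, 1.0],
    [0.3, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 0.2],
    [1.0, 1.0, 0.2, 0.0],
]

points = classical_mds(toy)  # one (x, y) row per document
```

Each row of `points` is a position you could draw in a scatterplot. With only 2 dimensions available, some distances are inevitably distorted, so the layout is an approximation of the original space rather than an exact copy of it.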
That I got an enormous blob is hardly surprising: the language used in record reviews tends to be pretty similar. If I had started with 5,000 record reviews and 5,000 movie reviews, I likely would have seen 2 distinct clusters. But an ambiguous blob at the macro-level doesn’t necessarily mean that there’s no structure at the micro-level.
I did a quick preliminary pass of tagging and colouring by genre manually in Overview. The small red cluster, made up of ~20 releases recorded or produced by the artist Four Tet, is a case in point. While the larger genres such as electronic or hip-hop are spread all over the plot, I’ve also found similarly tiny clusters relating to niche genres, such as sludge metal or ATL-area hip hop.
So how did I find these micro-level clusters? What you can’t tell from the screenshot above is that the application has panning and zooming controls, and every time I click on a point I pull up the corresponding full-text review in a separate window. Overview also has a second visualisation, which displays a hierarchical cluster structure in a tree layout. This makes the smaller clusters hidden within the blob easier to spot: each time a cluster node is selected in the tree, the corresponding points are highlighted in the scatterplot, and vice versa.
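To give a feel for what a hierarchical cluster tree is, here is a small Python sketch that builds one by repeatedly merging the two closest clusters (average-linkage agglomerative clustering). This is a stand-in technique chosen for brevity, not necessarily how Overview builds its tree; the function name `agglomerate` and the toy distance matrix are hypothetical.

```python
def agglomerate(distances):
    """Build a binary cluster tree from a distance matrix.

    Leaves are document indices; internal nodes are (left, right) tuples.
    """
    # Each cluster is a pair: (tree so far, list of member document indices).
    clusters = [(i, [i]) for i in range(len(distances))]

    def linkage(a, b):
        # Average linkage: mean distance between members of the two clusters.
        return sum(distances[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        _, x, y = min(
            (linkage(clusters[p][1], clusters[q][1]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        merged = ((clusters[x][0], clusters[y][0]),
                  clusters[x][1] + clusters[y][1])
        # Replace the two merged clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return clusters[0][0]

# Toy distances: documents 0 and 1 are similar, as are documents 2 and 3.
toy = [
    [0.0, 0.3, 1.0, 1.0],
    [0.3, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 0.2],
    [1.0, 1.0, 0.2, 0.0],
]

tree = agglomerate(toy)
```

In the resulting tree, the two tight pairs become their own branches before being joined at the root. Selecting a branch in a tree view like this corresponds to highlighting its member documents in the scatterplot.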
Scatterplots are not always the best choice for representing your data. It really depends on what your task is. If you’re looking for correlation, a density plot may be a better idea. If you’re looking for small-scale cluster structure, a scatterplot might be useful as a preliminary overview, used in conjunction with other visual analysis techniques.
Interested in using Overview?
My current research project involves studying individuals who have a need to explore large text document collections. I’d like to hear from journalists, digital humanities and communications researchers, archivists, and librarians who (a) have this need and (b) give Overview a try. This will help us to build upon and improve Overview, as well as future tools for addressing these types of tasks.