Unsupervised clustering can be conducted in a variety of ways
July 31, 2024 5:29 AM

"What are the best methods of capturing thematic similarity between literary texts? Knowing the answer to this question would be useful for automatic clustering of book genres, or any other thematic grouping."

(This post brought to you by my recent faffing around with recent Shakespeare scholarship to decide which edition(s) to read.)

Abstract:

What are the best methods of capturing thematic similarity between literary texts? Knowing the answer to this question would be useful for automatic clustering of book genres, or any other thematic grouping. This paper compares a variety of algorithms for unsupervised learning of thematic similarities between texts, which we call “computational thematics”. These algorithms belong to three steps of analysis: text pre-processing, extraction of text features, and measuring distances between the lists of features. Each of these steps includes a variety of options. We test all the possible combinations of these options. Every combination of algorithms is given a task to cluster a corpus of books belonging to four pre-tagged genres of fiction. This clustering is then validated against the “ground truth” genre labels. Such comparison of algorithms allows us to learn the best and the worst combinations for computational thematic analysis. To illustrate the difference between the best and the worst methods, we then cluster 5000 random novels from the HathiTrust corpus of fiction.
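The three steps the abstract names (pre-processing, feature extraction, distance measurement), followed by clustering and validation against genre labels, can be sketched roughly as below. This is a minimal illustration, not the authors' code: the four toy texts, the stopword list, and the specific choices (bag-of-words counts, cosine distance, average-linkage clustering, the plain Rand index) are all assumptions picked for brevity, and the paper compares many options at each step.

```python
import math
import re
from collections import Counter
from itertools import combinations

# Toy stand-ins for novels; texts and genre labels are invented for illustration.
TEXTS = {
    "sf_1": "The starship crew scanned the alien planet from orbit.",
    "sf_2": "An alien signal reached the starship in orbit near the planet.",
    "det_1": "The detective questioned the suspect about the murder weapon.",
    "det_2": "A murder suspect lied to the detective about the weapon.",
}
GENRES = {"sf_1": "sf", "sf_2": "sf", "det_1": "detective", "det_2": "detective"}
STOPWORDS = {"the", "a", "an", "to", "in", "from", "about", "near"}  # assumed list


def preprocess(text):
    """Step 1: lowercase, tokenize on letters, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def features(tokens):
    """Step 2: bag-of-words counts (one of many possible feature types)."""
    return Counter(tokens)


def cosine_distance(a, b):
    """Step 3: 1 minus cosine similarity between two count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def cluster(vectors, k):
    """Average-linkage agglomerative clustering, merging until k clusters remain."""
    clusters = [[name] for name in vectors]

    def linkage(c1, c2):
        total = sum(cosine_distance(vectors[x], vectors[y]) for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        # combinations() yields index pairs with i < j, so deleting j is safe.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters


def rand_index(clusters, labels):
    """Share of text pairs on which the clustering and the labels agree."""
    assigned = {name: idx for idx, members in enumerate(clusters) for name in members}
    pairs = list(combinations(labels, 2))
    agree = sum(
        (assigned[x] == assigned[y]) == (labels[x] == labels[y]) for x, y in pairs
    )
    return agree / len(pairs)


if __name__ == "__main__":
    vectors = {name: features(preprocess(text)) for name, text in TEXTS.items()}
    found = cluster(vectors, k=2)
    print(found, rand_index(found, GENRES))
```

On texts this small and this cleanly separated the two invented genres are recovered exactly, which is the "toy dataset" situation; the interesting part of the paper is how much the result depends on the choice made at each step once the corpus is real.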

Citation

Sobchuk, O., Šeļa, A. Computational thematics: comparing algorithms for clustering the genres of literary fiction. Humanit Soc Sci Commun 11, 438 (2024). https://doi.org/10.1057/s41599-024-02933-6

posted by cupcakeninja (9 comments total) 17 users marked this as a favorite

If anything, this is great for showing how to apply clustering algorithms to a very messy, real-world dataset with squishy, ill-defined categories and evaluate their performance.

I always hate how many lessons on machine learning use toy datasets that have clusters which are either neatly discoverable or are already known a priori.
posted by RonButNotStupid at 6:34 AM on July 31 [2 favorites]

I'm getting flashbacks to my (very terrible) thesis where I did some clustering that wasn't a million miles away from this kind of thing. It was based on grammatical structure rather than themes or genres, but the basic idea was the same--extracting features, measuring distances between them somehow, and clustering by similarity.

(In retrospect, I shot myself in the foot a little and should have gone with words or phrases at most instead of trying to classify entire sentence structures. Hindsight is 20/20, and all that.)
posted by Mr. Bad Example at 7:30 AM on July 31 [3 favorites]

For someone who was an English major 50 years ago, this was certainly an interesting peek into this new-fangled digital humanities thing. I like to try to decipher texts that are way out of my comfort zone.
posted by kozad at 8:15 AM on July 31 [2 favorites]

When I was trying to teach myself this stuff, Cosma Shalizi's well-written lecture notes were a big help:

https://www.stat.cmu.edu/~cshalizi/350/lectures/07/lecture-07.pdf
https://www.stat.cmu.edu/~cshalizi/350/lectures/08/lecture-08.pdf

Even if you are never tempted to go through the math, the classification vs. clustering intro at the start of lecture 7, and the pictures in lecture 8, may be worth a quick look.

Traditionally, researchers who do this sort of thing can look at their clusters and say "That's weird, it clusters Murder on the Orient Express with fantasies" and will list that as a sort of limitation on the utility of clustering.

But the other thing you can do is more common and one of my pet peeves: Find an example like that, then get articles in the press that say "These researchers have proven that Hercule Poirot is more like Gandalf than you think! No one noticed before, but now we proved it with MATH and COMPUTERS." The PR people at Max Planck must be asleep at the wheel.
posted by mark k at 9:18 AM on July 31 [2 favorites]

i recall some time in the distant fuzzy past — it may have just been a dream — being in a room on several occasions with some humanities-department data miners and the cs undergrad who they had recruited to do the things they didn't understand. this group of hazily remembered researchers had by some stroke of tremendous luck and/or use of political connections been allowed access to a very large corpus of text consisting of novels recently put out by one of the major publishing houses.

it may be useful to note at this point that all of these digital humanities data miners and also their undergraduate were cis men.

anyway so they used a kind of primitive version of sentiment analysis or whatever to cluster these texts. one of the things they were testing for was whether their clustering strategy could correctly identify genres. the logic was that they were looking for a clustering strategy that could produce, i forget the exact term they used — this may have, again, been something from a dream — but it was something like "reasonable surprisal," i.e. they wanted to get something that was close enough to what one would expect to not seem like random noise, but that also contained some new surprising things which could demonstrate that what they had wasn't just an elaborate method of telling people what they all already knew.

anyway so this group, who may have existed or who may have been characters in a dream i once had, eventually put on a presentation wherein they discussed their big finding. they had discovered that among the many expected subclusters in the romance novel cluster, there's this whole surprising new subcluster that no one had identified before — scottish romance novels!

look, everyone! we have discovered that "scottish" is a thriving romance subgenre!

anyway.
posted by bombastic lowercase pronouncements at 9:33 AM on July 31 [1 favorite]

I like to think something like this could be used to find more genres. I remember back when the Internet was young, talking with a friend about the existence of Sci-Fi Noir in the movies. At the time, only two titles came up in a search: "Blade Runner" and some Czech film that I looked for for years and whose title I have now forgotten. Search now and there are many more suggestions, but with a literature cluster/classifier we may be able to find more genres composed of a handful of movies...a nerd's delight!
posted by rhizome at 11:05 AM on July 31

What do you actually do with a genre clustering algorithm? The paper has a few ideas, but two of them -- adding genre tags to large digital collections, creating book recommendations -- don't have much to do with the study of literature per se; they're basically technical problems for people who sell books or manage digital libraries. It's interesting that the usefulness of this scholarship, according to the researchers, has so much to do with solving somebody else's business problems.

The other use case that the authors mention is tracking literary evolution and mapping out influences. I'd be curious to see where they can go with that.
posted by Gerald Bostock at 11:16 AM on July 31 [2 favorites]

It can be an extremely frustrating process to go from "these data points all appear in the same cluster in this dimensional space and with this algorithm" to "these data points are all exemplars of the same class because...". Every cluster describes some form of relationship and the trick is to figure out whether it's meaningful or not. Unsupervised learning is great at uncovering interesting relationships to explore, but it's really not going to yield prescriptive categories, especially when those categories are as subjective as a literary genre.
posted by RonButNotStupid at 11:44 AM on July 31

Nice to see the paper authors give a couple cites to my former Dean, Matthew Jockers, a really smart and generous guy who gave a guest lecture on topic modeling to an English grad seminar I taught a few years ago. (Interestingly enough, I think the grad student from History got the most out of his lecture.) He does a deep dive into computational questions of genre and sub-genre in his accessible and enjoyable academic monograph Macroanalysis, and his popular readership book with Jodie Archer, The Bestseller Code, is fun. For folks who want to try some of his approaches out, his how-to book, Text Analysis with R for Students of Literature (which I'm not; I'm a composition and rhetoric specialist, but I point out to my literary colleagues that the folks in Classics have been doing this sort of computational analysis for decades) is a solid cookbook. Sharp-eyed readers will note Jockers co-authored one of the references with sometime MeFi favorite Franco Moretti. I see part of the point to take away from computational genre research as that prescriptive categories are always oversimplifications, and that genres are built of human perceptions, not hard-and-fast rules.
posted by vitia at 1:00 PM on July 31 [2 favorites]


