Keeping the words in Topic Models

By The GPDH Editors | January 22, 2013


Following up on my previous topic modeling post, I want to talk about the thing humanists most often do with topic models once they build them: chart the topics over time. Topic modeling can be very useful, but I think there’s too little skepticism about the technique, so I’m venturing to provide some (even with, I’m sure, a gross misunderstanding or two). More generally, the sort of mistakes temporal changes cause should call into question the complacency with which humanists tend to treat ‘topics’ in topic modeling as stable abstractions, and argue for much greater attention to the granular words that make up a topic model.

In the middle of this, I will briefly veer into some odd reflections about the post-lapsarian state of language. Some people will want to skip that; maybe some others will want to skip to it.

Humanists seem to want to do different things with topic models than the computer scientists who invented them. David Blei’s group at Princeton (David Mimno aside) most often seems to push LDA (I’m using topic modeling and LDA interchangeably again) as an advance in information retrieval: making large collections of text browsable by giving useful tags to the documents. When someone gives you 100,000 documents, you can ‘read’ the topic headings first, and then read only the articles under the headings that interest you.
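
To make that browsing use concrete, here is a minimal sketch in Python. It assumes a topic model has already been fit and that its output is available as per-document topic proportions. The document names, weights, and topic headings are invented for illustration, and `browse_by_topic` is a hypothetical helper, not a function from any LDA library.

```python
from collections import defaultdict

# Hypothetical per-document topic proportions, of the kind an LDA fit produces.
# Each list holds one weight per topic and sums to 1.
doc_topics = {
    "doc_a": [0.70, 0.20, 0.10],
    "doc_b": [0.15, 0.75, 0.10],
    "doc_c": [0.60, 0.10, 0.30],
    "doc_d": [0.05, 0.15, 0.80],
}

# Hypothetical topic headings: the top words of each topic, used as tags.
topic_labels = ["whale ship sea", "court law judge", "bank money trade"]


def browse_by_topic(doc_topics, topic_labels):
    """Shelve each document under the heading of its highest-weighted topic."""
    shelves = defaultdict(list)
    for doc_id, weights in doc_topics.items():
        # Index of the topic with the largest share of this document.
        best = max(range(len(weights)), key=weights.__getitem__)
        shelves[topic_labels[best]].append((doc_id, weights[best]))
    # Within each heading, list the most strongly associated documents first.
    for docs in shelves.values():
        docs.sort(key=lambda pair: pair[1], reverse=True)
    return dict(shelves)


shelves = browse_by_topic(doc_topics, topic_labels)
for heading, docs in shelves.items():
    print(heading, [doc_id for doc_id, _ in docs])
```

A reader would scan the headings and open only the shelves that look relevant. Filing each document under a single dominant topic is a simplification made here for brevity: LDA itself gives every document a mixture of topics, and the headings are only as good as the words behind them.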

Probably there are people using LDA for this sort of thing. I haven’t seen it much in practice, though: it just isn’t very interesting* to talk about. And while this power of LDA is great for some institutions, it’s not a huge selling point for the individual researcher: it’s a lot of effort for something that produces almost exactly the same outcome as iterative keyword searching. Basically, you figure out what you’re interested in and read the top documents in the field. If discovery is the goal, humanists would probably be better off trying to get more flexible search engines than more machinely learned ones.

*I spun around a post for a while trying to respond to Trevor Owens’ post about the binary of “justification” and “discovery” by saying that really only justification matters, but I couldn’t get it to cohere; obviously discovery matters in some way. That post of his is ironclad. So I’ll just say here that I think conversations which are purely about discovery methods are rare, and usually uninteresting; when scholars make public avowals of their discovery methodology, they frequently do it in part as evidence for the quality of their conclusions. Even if they say they aren’t. Anyhow.

Read full post here. (Originally posted January 9, 2013)