Text Similarity Testing

About

Text similarity measurement algorithms are widely used throughout the Internet, for purposes as varied as purchasing concert tickets and flagging papers for plagiarism. If we ran similar algorithms on a corpus of trade papers from the year 1922, what patterns might emerge? Many publications carefully crafted distinct identities and claims to individuality, but how unique was the content that appeared within their pages? How might the results confirm, complicate, or complement what we already know? The nuances of the language in each publication would have helped create in-groups and out-groups that not only segmented groups within the film industry but also defined the boundaries of the industry itself. Understanding the relative similarities and differences among publications allows us to assess these publications’ claims to individuality. Even more significantly for scholars of film, journalism, and media industry history, these measurements also help us understand the environment in which individual laborers were producing, distributing, and exhibiting films.

Without the search indexing algorithms within Lantern, we would never have found many of the relevant pages that we then read and analyzed closely ourselves. The text similarity testing algorithms described in this chapter are, in part, attempts to achieve an even wider form of search: querying advertisements and strings of publicity text that recur across multiple publications, even when the specific words, phrases, and occurrences are not yet known. As we move forward, we invite others to do the same, with the hope of locating many more stories to tell.
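
For readers who want a concrete sense of what searching for recurring text without a known query can look like, here is a minimal sketch. It assumes one simple technique, Jaccard similarity over three-word shingles, which is not necessarily the method used in the project; the function names and the sample page texts are hypothetical and invented purely for illustration.

```python
import re
from itertools import combinations


def shingles(text, n=3):
    """Return the set of lowercase word n-grams (shingles) in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: shared items / all items (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_pairs(pages, n=3):
    """Score every pair of pages; return (score, id1, id2) tuples, highest first."""
    sets = {pid: shingles(text, n) for pid, text in pages.items()}
    scored = [
        (jaccard(sets[p], sets[q]), p, q)
        for p, q in combinations(sorted(sets), 2)
    ]
    return sorted(scored, reverse=True)


# Hypothetical page texts, invented for illustration only (not from the corpus).
pages = {
    "paper_a": "The greatest picture of the year opens Monday at your theatre",
    "paper_b": "Exhibitors agree the greatest picture of the year opens Monday",
    "paper_c": "New projection equipment prices announced for the spring season",
}

for score, p, q in rank_pairs(pages):
    print(f"{p} vs {q}: {score:.2f}")
```

No search phrase is supplied in this sketch: pages that share a run of publicity text rise to the top of the ranking simply because their shingle sets overlap. Comparing every pair of pages grows quadratically with the number of pages, so a full corpus would likely need a more scalable approach.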

Read more about the project in the book chapter by Eric Hoyt, Ben Pettis, Lesley Stevenson, and Samuel Hansen. In it, they describe computational methods for analyzing large volumes of data by applying text similarity algorithms to trade papers in the Media History Digital Library.
Hoyt, Eric, Ben T. Pettis, Lesley Stevenson, and Samuel Hansen. "Searching for Similarity: Computational Analysis and the U.S. Film Industry Trade Press of the Early 1920s." In Global Movie Magazine Networks, edited by Kelley Conway and Eric Hoyt. University of California Press, forthcoming.

Download Data

Hansen, Samuel; Pettis, Ben; Hoyt, Eric; Stevenson, Lesley. "1922 Film Industry Trade Press Corpus [Dataset]". Dryad, January 22, 2024. https://doi.org/doi:10.5061/dryad.gtht76htc

Source Code

Hansen, Samuel, Ben Pettis, Eric Hoyt, and Lesley Stevenson. "1922 Film Industry Trade Press Corpus." Zenodo, January 22, 2024. https://doi.org/10.5281/zenodo.105412577.