Textual Analysis for Hypothesis Testing

By Marian Moszoro

Empirical testing has become the touchstone of sound economic theory, especially in the era of big data. In this article, we review some simple (and not-so-simple) techniques for algorithmic data reading that can be applied to create novel datasets from textual documents and test hypotheses in the social sciences.

Textual analysis in the social sciences can draw on a wide range of sources, including:

  • Laws and court decisions

  • Political speeches

  • Contracts and regulations (see, e.g., “Data Suggestion: Contracts with State Governments” by Patrick Warren, SIOE webpage)

  • Company filings & disclosures (e.g., SEC-Edgar, MD&A)

  • Audio transcripts (e.g., conference calls)

  • News and media reports (e.g., Factiva, WSJ News Archive)

  • Web sites, Google searches

  • ... and social media (e.g., Twitter, Stocktwits, FB comments)

When properly designed, the outcome is a novel dataset out of (almost) thin air.

Here are a handful of techniques to start with:

a.     Dictionaries

Just as we learn a new language using dictionaries, algorithmic data reading relies on words and phrases grouped into categories called “dictionaries.” These categories are the instrument through which the computer learns to interpret the text. For example, we can teach the computer that the words “arbitration, conciliation, settlement, whereas” indicate arbitration clauses, while the words “satisfactory, timely, good faith, diligent, proper, reasonable, reasonably” introduce contractual flexibility clauses (e.g., Moszoro, Spiller, and Stolorz 2015).

Probably the unbeaten masters of dictionaries in finance and accounting are Loughran and McDonald (2011). They identified 2,349 negative words (e.g., loss, bankruptcy, indebtedness, felony, misstated, discontinued, expire, unable); 354 positive words (e.g., beneficial, excellent, innovative); 291 uncertainty words (e.g., ambiguity, approximate, assume, risk); 871 litigious words (e.g., admission, breach, defendant, plaintiff, remand, testimony); 19 modal strong words (e.g., always, best, definitely, highest, lowest, will); and 27 modal weak words (e.g., could, depending, may, possibly, sometimes). Their detailed dictionaries are available at: http://www3.nd.edu/~mcdonald/Word_Lists.html
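As a minimal sketch of how a dictionary turns text into data, the Python snippet below counts how often each category's terms appear in a document. The two dictionaries reuse the arbitration and flexibility words from the contract example above; the names DICTIONARIES and count_dictionary_hits, as well as the sample sentence, are illustrative assumptions and not part of any published code.

```python
import re
from collections import Counter

# Illustrative dictionaries built from the example words in the text.
DICTIONARIES = {
    "arbitration": ["arbitration", "conciliation", "settlement", "whereas"],
    "flexibility": ["satisfactory", "timely", "good faith", "diligent",
                    "proper", "reasonable", "reasonably"],
}


def count_dictionary_hits(text, dictionaries=DICTIONARIES):
    """Return {category: number of whole-word or whole-phrase matches}."""
    lowered = text.lower()
    hits = Counter()
    for category, terms in dictionaries.items():
        for term in terms:
            # \b keeps "proper" from matching inside "property".
            pattern = r"\b" + re.escape(term) + r"\b"
            hits[category] += len(re.findall(pattern, lowered))
    return dict(hits)


# Hypothetical contract sentence, used only to show the call.
sample = ("The parties shall act in good faith and make reasonable efforts "
          "to reach a settlement before arbitration.")
print(count_dictionary_hits(sample))
```

Run over a whole library of documents, these counts (usually scaled by document length) become the variables of the new dataset.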

b.     Identification and Clustering

Often, the documents in the library at hand are not classified in any way. Files can be classified using the title or first lines of a document combined with set theory. For example, in a random set of contracts, if the words “employment” or “compensation” appear in the first 20 lines, the document is likely to be an employment contract; if the words “license” or “purchase” appear, it is likely to be a commercial contract; if the words “restated” or “amendment” appear, it is likely to be a renewed or amended contract. To retrieve only original employment contracts, we select the documents that contain “employment” or “compensation” and exclude those that contain “license,” “purchase,” “restated,” or “amendment.” This technique is called “Boolean search.”
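The Boolean search described above can be sketched with Python sets. The keyword sets follow the employment-contract example, with “restated” used as in “amended and restated”; the function names and the sample document titles are hypothetical, and exact word matching is a simplifying assumption (for instance, “amended” does not match “amendment”).

```python
import re

# Keyword sets taken from the employment-contract example in the text.
INCLUDE = {"employment", "compensation"}
EXCLUDE = {"license", "purchase", "restated", "amendment"}


def header_words(document, n_lines=20):
    """Return the set of lowercase words found in the first n_lines lines."""
    head = "\n".join(document.splitlines()[:n_lines]).lower()
    return set(re.findall(r"[a-z]+", head))


def is_original_employment_contract(document):
    """True if the header has an INCLUDE word and no EXCLUDE word."""
    words = header_words(document)
    return bool(words & INCLUDE) and not (words & EXCLUDE)


# Hypothetical document headers, used only to show the call.
original = "EMPLOYMENT AGREEMENT\nThis agreement is made between ..."
amended = "AMENDED AND RESTATED EMPLOYMENT AGREEMENT\nThis agreement ..."
print(is_original_employment_contract(original))
print(is_original_employment_contract(amended))
```

Set intersection (&) plays the role of the Boolean AND/NOT logic: a document is kept only when it overlaps the inclusion set and does not overlap the exclusion set.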

c.     Stemming, Parsing and Comparison

Sometimes, we need to compare documents or test the impact of one document (e.g., a law or regulation) on a set of documents (e.g., subsequent laws, internal procedures, codes of conduct). One quantitative way to do so is by “stemming” the original documents (i.e., reducing each word to its root), “parsing” the documents into, e.g., three-word strings, and counting the stemmed three-word strings that the main source shares with each analyzed document.
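The stem-parse-compare pipeline can be sketched as follows. The stemmer here is a deliberately crude suffix stripper, an assumption made to keep the example self-contained; real projects would more likely use an established stemmer such as the Porter stemmer. All function names and both sample sentences are hypothetical, and overlaps are counted as distinct shared three-word strings.

```python
import re

# Crude stand-in for a real stemmer: strip a few common English suffixes.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stemmed_trigrams(text):
    """Return the set of three-word strings (as tuples) of stemmed words."""
    stems = [crude_stem(w) for w in re.findall(r"[a-z]+", text.lower())]
    return {tuple(stems[i:i + 3]) for i in range(len(stems) - 2)}


def trigram_overlap(source, other):
    """Count distinct stemmed three-word strings shared by two texts."""
    return len(stemmed_trigrams(source) & stemmed_trigrams(other))


# Hypothetical source rule and derived document, used only to show the call.
source = "Contractors shall report all payments quarterly"
other = "The contractor shall reporting all payment data"
print(trigram_overlap(source, other))
```

Dividing the overlap count by the number of three-word strings in the analyzed document gives a share that is comparable across documents of different lengths.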

d.     “Fogginess”

Textual analysis has also developed measures of language complexity or “fogginess.” Robert Gunning developed an index that measures the readability of English writing as the years of formal education needed to understand a text on a first reading; e.g., a Gunning fog index of 12 corresponds to a U.S. high-school senior (~18 years old). The formula is simple, although computing it over large libraries takes time:

Gunning Fog Index = 0.4 x (words/sentences + 100 x complex_words/words)

i.e., 0.4 times the sum of the average number of words per sentence and the percentage of words with three or more syllables (“complex words”).
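A rough Python sketch of the index follows. It uses the standard form of the Gunning fog index, in which the share of complex words is expressed as a percentage (hence the factor of 100). Two simplifications are assumptions of this sketch: syllables are approximated by counting runs of vowels, and sentences are split on periods, exclamation marks, and question marks. The function names are hypothetical.

```python
import re


def count_syllables(word):
    """Approximate syllables as runs of vowels (a heuristic, not exact)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def gunning_fog(text):
    """Return 0.4 * (words per sentence + percentage of complex words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text needs at least one sentence and one word")
    # "Complex" words are those with three or more (approximate) syllables.
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))


# Two short sentences with no complex words: 0.4 * (6 / 2 + 0).
print(round(gunning_fog("The cat sat. The dog ran."), 2))
```

The vowel-run heuristic miscounts some words (for example, those with a silent final “e”), so results on real filings should be treated as approximate.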

There is a world of opportunities for hypothesis testing using textual analysis. Algorithmic data reading can be done in different programming languages: Visual Basic, R, C++, Python. I recommend Python: it’s powerful, easy to learn, free, and has numerous ready-to-use libraries.


Examples of the use of algorithmic data reading:

Ahern, Kenneth and Denis Sosyura. 2014. Who Writes the News? Corporate Press Releases during Merger Negotiations. Journal of Finance 69, 241-291.

Bill McDonald’s web page with multiple resources for textual analysis: http://www3.nd.edu/~mcdonald/Word_Lists.html

Moszoro, Marian, Pablo Spiller, and Sebastian Stolorz. 2015. Rigidity of Public Contracts. NBER Working Paper No. 21186.

Talley, Eric and Drew O’Kane. 2012. The Measure of a MAC: A Machine-Learning Protocol for Analyzing Force Majeure Clauses in M&A Agreements. Journal of Institutional and Theoretical Economics 168 (1), 181-201.