The Correlates of War Project's Militarized Interstate Dispute (MID) Data is the most prominent and heavily used data collection in the study of international conflict. The most recent version (MID4) was released in 2014 and extends the period covered to 1816-2010. The MID4 project used automated text classification procedures to make the process of identifying relevant news stories more efficient. Over the course of that project, the PIs determined that the primary bottleneck in the workflow was the coding of those news documents. To address this inefficiency, the PIs completed a pilot project to determine whether crowdsourcing techniques could be used to code these documents. In the pilot, non-expert workers were paid small sums to read documents and answer sets of questions; the answers were used to identify features of possible militarized incidents (the events that comprise MIDs). A systematic comparison of the crowdsourced responses with those of the MID4 Project's trained coders revealed that the crowdsourced codings were completely accurate for 68 percent of the news reports coded; more importantly, high agreement among crowd responses on specific reports was strongly associated with correct coding. This association enables the PIs to detect which documents require further expert involvement. As a result, the PIs can produce a majority of the MID data in near real time and at limited financial cost. These procedures are applied in the MID5 Project, which will update the MID data for the period 2011-2017.
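The routing idea described above, in which agreement among crowd responses decides which documents go to experts, can be sketched as follows. This is a minimal illustration and not the project's actual procedure: the function names, the 0.8 agreement threshold, and the answer labels are assumptions made for the example.

```python
from collections import Counter


def agreement(responses):
    """Return the most common answer and the share of workers who gave it."""
    counts = Counter(responses)
    answer, n = counts.most_common(1)[0]
    return answer, n / len(responses)


def route(responses, threshold=0.8):
    """Accept the modal answer when agreement is high; otherwise flag for experts.

    The threshold is a hypothetical value chosen for illustration only.
    """
    answer, share = agreement(responses)
    if share >= threshold:
        return ("accept", answer)
    return ("expert_review", None)


# Five hypothetical workers answering one question about one news report.
unanimous = ["attack", "attack", "attack", "attack", "attack"]
split = ["attack", "attack", "threat", "none", "attack"]

print(route(unanimous))
print(route(split))
```

In practice the threshold would be calibrated against expert codings, such as those the pilot compared the crowd responses with.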
The MID5 project workflow begins with document retrieval from LexisNexis and document classification using the software and methods implemented in MID4. The PIs discard the negatively classified documents and extract metadata from the positively classified documents, including the document title, the news agency that published the report, the date, and any actors mentioned in the text. Crowd workers are recruited through Amazon's Mechanical Turk and paid a wage to read one of these documents and answer a series of simple, objective questions about it. The questionnaire is predefined, but some extracted metadata is automatically inserted into it to improve the quality of responses. Several workers complete a questionnaire for each document, leaving the PIs with a problem of aggregation: how to combine multiple worker responses, possibly regarding multiple related questions, into the data needed to code the militarized incident. In the pilot study, the PIs showed that Bayesian networks are the most effective way to achieve this aggregation. Recently, the PIs have made advances in semi-supervised text classification with hybrid deep restricted Boltzmann machines, which outperform previous methods on this task.
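To make the aggregation problem concrete, the sketch below combines several workers' yes/no answers to a single question into one posterior probability. It is a deliberately simplified stand-in, not the Bayesian network used in the pilot study: it assumes workers answer independently, handles only one binary question, and takes each worker's accuracy as a known input. The function name and all numbers are hypothetical.

```python
import math


def posterior_yes(votes, accuracies, prior=0.5):
    """Posterior probability that the true answer is "yes".

    votes: list of bools, one per worker (True means the worker said "yes").
    accuracies: assumed probability that each worker answers correctly,
        strictly between 0 and 1.
    prior: assumed prior probability that the true answer is "yes".
    Workers are treated as conditionally independent given the true answer.
    """
    log_yes = math.log(prior)
    log_no = math.log(1 - prior)
    for vote, acc in zip(votes, accuracies):
        # Likelihood of this vote if the truth is "yes" and if it is "no".
        log_yes += math.log(acc if vote else 1 - acc)
        log_no += math.log(1 - acc if vote else acc)
    # Normalize in log space for numerical stability.
    m = max(log_yes, log_no)
    p_yes = math.exp(log_yes - m)
    p_no = math.exp(log_no - m)
    return p_yes / (p_yes + p_no)


# Three hypothetical workers: two say "yes", one says "no".
p = posterior_yes([True, True, False], [0.9, 0.9, 0.9])
print(round(p, 3))
```

A full Bayesian network would additionally model dependencies among related questions, which this single-question sketch ignores.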
Effective start/end date: 9/15/15 → 8/31/19
- National Science Foundation: $690,353.00