When we talk about data, we often talk about numbers. But humanities and social science disciplines are also dealing with a deluge of data, and these data are often in the form of narrative documents. In May, Mark Liberman, a professor of linguistics and of computer and information science at the University of Pennsylvania, wrote about the challenges of digitizing and archiving court data in a Language Log post titled "Big data in the humanities and social sciences." It is reposted here with Mark's permission.
For his final project, Michael Lissner, a 2010 graduate of UC Berkeley's School of Information master's program, developed CourtListener, an effort to monitor and scrape opinions from court websites. He worked under the advisement of professor Brian Carver.
Court data: (Thanks to Jerry Goldman of oyez.org for background information.) If properly collected and archived, the activities of the American judicial system represent a massive collection of formalized social interactions, with great potential for social scientists interested in the activities of American courts and for computer scientists seeking a large highly-structured language corpus. Moreover, the hierarchical structure of the American judiciary represents an opportunity for technologists interested in modeling consequential interactions among institutions in a large system.
However, the American judiciary has been reluctant, at least in practice, to provide access to its data.
Most courts provide access via their websites to recent opinions, but few courts provide access to archival opinions. There is no consistency in the number of opinions available: some courts provide all opinions from 2005 forward, while others provide only the last term's worth. Nor is there any consistency in the means of delivering the data: some courts use RSS feeds, others have only a list of links on a page of their website, and others require the user to fill in a search form to access any data.
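To make the delivery problem concrete, here is a minimal Python sketch of how a scraper might reduce two of these delivery styles, an RSS feed and a plain page of links, to one list of opinion URLs. It is an illustration under stated assumptions, not a description of how CourtListener or any court site actually works: the function names, the `court.example` URLs, and the rule that opinion links end in `.pdf` are all hypothetical. Courts that sit behind a search form are not covered, since each form would need its own handling.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
import xml.etree.ElementTree as ET


def links_from_rss(xml_text):
    """Return the <link> text of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    return [
        item.findtext("link").strip()
        for item in root.iter("item")
        if item.findtext("link")
    ]


class _PdfLinkParser(HTMLParser):
    """Collect absolute URLs of <a href> targets that end in .pdf.

    Assumption: opinions are published as PDF files, so the file
    extension is used as a crude filter. Real sites may differ.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".pdf"):
            # Resolve relative links against the page's own URL.
            self.links.append(urljoin(self.base_url, href))


def links_from_html(html_text, base_url):
    """Return absolute PDF links found in an HTML page of links."""
    parser = _PdfLinkParser(base_url)
    parser.feed(html_text)
    parser.close()
    return parser.links


def collect_opinion_links(payload, kind, base_url=""):
    """Dispatch on the (hypothetical) delivery style of a court site."""
    if kind == "rss":
        return links_from_rss(payload)
    if kind == "html":
        return links_from_html(payload, base_url)
    raise ValueError(f"unsupported source kind: {kind!r}")
```

Even in this toy form, every court needs its own entry saying which delivery style it uses, and every new style needs its own parser. That per-court effort is the practical cost of the inconsistency described above.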
Only half of the federal circuit courts make recordings of oral arguments available; a smaller fraction of other courts make them available electronically. No court, other than the Supreme Court, appears to make an official transcript of oral arguments available electronically. Almost all written data is contained in PDFs. What audio data is available is of inconsistent quality and in various formats.
Details aside, a sociologist or political scientist today would find it very difficult at best to assemble a complete collection of briefs, oral arguments, and opinions in cases at various levels dealing with some given topic: it would be a lot of work, and there would be many gaps. And in 20 or 30 years, the situation (with respect to cases now and in the past) might well be worse, because much of what is available now may have vanished.
The scale of the problem is fairly large. State courts of last resort decide about 90,000 cases a year. Intermediate federal courts of appeals decide about 60,000 cases a year. It's not clear how (or even whether) digital versions of oral-argument recordings, collections of briefs, etc., are being preserved by the various courts. It seems possible that much of this material, although now almost invariably prepared in digital form, is not being digitally archived in any effective way.