At the end of August, we finished our first official year as the SCALES OKN in the NSF Convergence Accelerator. The past 12 months have been a whirlwind of activity, even though much of it has been happening under the hood as we’ve worked towards improving the SCALES OKN data ecosystem and user experience.
After 12 months, we are happy to say that the SCALES OKN is moving from an alpha to a beta release, with numerous improvements! As a part of recognizing what it’s taken to get to this stage, we wanted to look back and acknowledge everyone who has helped us along the way.
Community Support and Engagement
We always want to start by acknowledging the broader SCALES community; without their input we would not be able to build the SCALES OKN effectively. Over 20 people participated in our two rounds of user testing, giving us feedback on the user experience of querying and analyzing the SCALES OKN within the Satyrn notebook interface. These sessions have led to some incremental improvements to the Satyrn notebook (autocomplete on all fields, more pronounced visual elements) and surfaced bigger challenges, like documenting and communicating where data comes from, how it’s filtered, and ultimately how it’s analyzed. We’re testing our first approaches to documenting and explaining these data transformations in-app and look forward to more of your feedback! We’re planning on doing more user testing in the next year, including long-term testing, so please reach out to us (email@example.com) if you’re interested.
In May we also held our first annual Open Justice Research Workshop virtually, with 24 brilliant scholars from law, political science, computer science, sociology, and more, and featured talks from Judge Nancy Gertner (retired) and Julie Ciccolini (National Association of Criminal Defense Lawyers). The workshop was held as an unconference, focusing not on prepared talks but on ideation, and we spent the bulk of our time brainstorming what future research questions we could tackle with greater access to court records. Four new collaborations formed at this workshop and started using the SCALES OKN data for their research over the summer. Having others play in the sandbox with us has been extremely helpful as we look to start finalizing our data models and plan out the best methods for raw data access. We’re looking forward to holding the second Open Justice Research Workshop in 2022 and hoping that we’ll even be able to be in person!
During the last year we’ve also continued working with courts and legal aid organizations by providing court-specific analytics on the variation in grant rates of in forma pauperis requests and document sealing. Our work on document sealing became our first publicly available living report. We’re planning to have more living reports go live in the near future to help support the courts, legal aid organizations, and even researchers with easy-to-reference statistics and visualizations about court activities.
Data and Modelling
We’ve put significant effort into our software stack for acquiring data from PACER and parsing the HTML docket sheets into an easy-to-use, machine-readable format for downstream analyses. Our PACER acquisition software supports downloading queries, docket sheets, documents, and (new!) case summaries from PACER, and our parser supports data extraction for criminal and civil cases from HTML to JSON. Both of these tools are publicly available and licensed for anyone else to integrate into their academic and non-profit projects. If you have any questions about how you can use these tools, please feel free to reach out to us. We’re more than happy to talk with you!
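For readers curious what an HTML-to-JSON docket conversion might look like in miniature, here is a rough sketch using only the Python standard library. It is not the SCALES parser: the three-column table layout, the JSON field names (`date_filed`, `entry_number`, `text`), and the sample HTML are all assumptions made up for illustration, and real docket sheets are considerably messier.

```python
import json
from html.parser import HTMLParser


class DocketTableParser(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Join fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows contain only <th>, so they stay empty
                self.rows.append(self._row)
            self._row = None


def docket_html_to_json(html_text):
    """Turn a (hypothetical) three-column docket table into a JSON string."""
    parser = DocketTableParser()
    parser.feed(html_text)
    parser.close()
    entries = [
        {"date_filed": r[0], "entry_number": r[1], "text": r[2]}
        for r in parser.rows
        if len(r) == 3
    ]
    return json.dumps({"docket": entries}, indent=2)


# Invented sample input, not real PACER output.
SAMPLE_HTML = """
<table>
  <tr><th>Date Filed</th><th>#</th><th>Docket Text</th></tr>
  <tr><td>01/04/2021</td><td>1</td><td>COMPLAINT filed by Jane Roe.</td></tr>
  <tr><td>02/10/2021</td><td>2</td><td>MOTION to Dismiss by   John Smith.</td></tr>
</table>
"""

if __name__ == "__main__":
    print(docket_html_to_json(SAMPLE_HTML))
```

Once each docket entry is a plain JSON object, downstream analyses can filter and count entries without touching HTML again.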
On the data front, we hit one of our big milestones for this year: we finished building an entity recognition model and a disambiguation pipeline for judges in docket entry text! We now automatically know which specific judge was acting in any single docket entry in a case and who that judge actually is (e.g., we know that John M. Doe and John Doe are the same judge). As a part of our disambiguation pipeline, we integrated the Federal Judicial Center’s Biographical Database on Article III judges, so all biographical data for Article III judges is now also available and linked to cases through the identification of which judge was acting on the case. As an added benefit, since we now know who the Article III judges are, we are also able to infer who is acting as a magistrate judge across all 94 districts.
This is an important first hurdle for downstream analytics because robust identification of who is involved in a case at specific points in time is critical for accurate analytics on the broader dataset.
The attribution of a judge in the case header is not necessarily accurate at any one point in time (cases can be transferred, judges can retire or be promoted, etc.), so the use of docket entry attribution in the SCALES OKN is an integral step in supporting our goals of accessible, systematic analysis of court records. The judge identification pipeline is already integrated into the beta SCALES OKN, and we’re currently finishing up the work to make the code publicly available in our research code repository.
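To give a feel for the disambiguation step described above, here is a toy, rule-based sketch. It is not the SCALES pipeline, which relies on a trained entity recognition model. The sketch assumes the judge's name has already been pulled out of the docket entry text, and the judge records, identifiers, and title list are all hypothetical.

```python
import re

# Hypothetical canonical judge records; the identifiers are made up.
KNOWN_JUDGES = {
    ("john", "doe"): "judge-0001",
    ("mary", "major"): "judge-0002",
}

# Words that commonly precede a judge's name and carry no identity.
TITLES = {"judge", "hon", "honorable", "magistrate", "chief", "district"}


def name_key(mention):
    """Reduce a name mention to a (first, last) key, or None if too short."""
    tokens = re.findall(r"[a-z]+", mention.lower())
    tokens = [t for t in tokens if t not in TITLES]
    if len(tokens) < 2:
        return None
    # Keep only the first and last tokens, so middle names and
    # middle initials do not affect the match.
    return (tokens[0], tokens[-1])


def resolve_judge(mention):
    """Return the canonical judge ID for a mention, or None if unknown."""
    key = name_key(mention)
    return KNOWN_JUDGES.get(key) if key else None


def judges_by_entry(entries):
    """Map each docket entry number to the judge resolved from its mention."""
    return {e["entry_number"]: resolve_judge(e["judge_mention"]) for e in entries}


if __name__ == "__main__":
    # Invented example: a case that changes hands partway through.
    entries = [
        {"entry_number": 1, "judge_mention": "Judge John M. Doe"},
        {"entry_number": 2, "judge_mention": "John Doe"},
        {"entry_number": 3, "judge_mention": "Hon. Mary Major"},
    ]
    print(judges_by_entry(entries))
```

A first-and-last-name key is far too crude for real data (two judges can share a name, and hyphenated or apostrophized surnames would be split), which is part of why linking mentions to an authoritative source such as the Federal Judicial Center’s biographical records matters.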
Our litigation event ontology group has also been hard at work on both civil and criminal event ontologies over the last year. These event ontologies are core to our ability to build models that can label docket entries as specific litigation events (e.g., a motion) and identify the broader litigation phases, like settlement, that many of our most basic but pressing questions about the justice system revolve around. Importantly, we’ve created this event ontology by focusing on the docket sheet data, combining our team’s expertise in how courts theoretically function with an examination of how they actually function, as reported in the docket sheets. We’re still working hard at defining litigation phases and how we algorithmically identify them in docket events, but we plan to make our constructed ontologies for observed litigation events publicly available soon.
There’s also been significant work on Satyrn, the underlying technology that powers the user interface of the SCALES OKN. Through beta-testing there have been numerous improvements in both hands-on use (autocompletion across more query fields and increased analytical capabilities) and back-end robustness (scaling to user demand and authentication systems). On a broader technical note, Satyrn also hit a major milestone that’s significant for data users by moving to a no-code configuration system. That means that end-users will be able to bring new datasets for analysis into Satyrn without needing to write any code at all, which greatly lowers the barrier to bringing more data into the ecosystem.
What is our roadmap for the next year?
Over the next year we plan to work furiously to tie up loose ends as we push towards our public launch of the SCALES OKN.
On the data side, we’re planning to crosswalk more datasets into the ecosystem. We’re currently working on merging the USSC sentencing dataset against criminal cases and plan to do something similar to make using the USPTO dataset easy on the civil side. For all of our crosswalks we also plan to make living reports available that document the coverage of the crosswalks, which should be helpful as a quick reference. Right now, we’re working on a living report on our crosswalk of civil cases filed in PACER to civil cases recorded in the Federal Judicial Center’s Integrated Database, making it easy to see where (and potentially why) discrepancies might emerge.
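As a rough illustration of what "coverage of a crosswalk" can mean, the sketch below compares two lists of case identifiers and reports how many match and which appear in only one source. The function name, the identifier format, and the sample values are assumptions for illustration only; a real crosswalk would need a more careful matching key than a single shared string.

```python
def crosswalk_coverage(pacer_ids, idb_ids):
    """Summarize how two sets of case identifiers overlap."""
    pacer, idb = set(pacer_ids), set(idb_ids)
    matched = pacer & idb
    return {
        "matched": len(matched),
        "pacer_only": sorted(pacer - idb),   # in PACER but missing from the IDB
        "idb_only": sorted(idb - pacer),     # in the IDB but missing from PACER
        # Share of PACER cases that found a match (0.0 if there are none).
        "pacer_coverage": len(matched) / len(pacer) if pacer else 0.0,
    }


if __name__ == "__main__":
    # Invented identifiers, not real cases.
    pacer_cases = ["1:20-cv-00001", "1:20-cv-00002", "1:20-cv-00003", "1:20-cv-00004"]
    idb_cases = ["1:20-cv-00001", "1:20-cv-00002", "1:20-cv-00004", "1:20-cv-00009"]
    print(crosswalk_coverage(pacer_cases, idb_cases))
```

Listing the unmatched identifiers on each side, rather than only a percentage, is what makes it possible to investigate where and why discrepancies arise.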
In terms of models, our big push this next year is to continue with our modelling efforts to identify litigation events and phases. We are also going to devote significant effort to expanding our entity recognition beyond just judges. There are many questions that revolve around who is involved in a case, be it a local government or a Fortune 500 company, so we are building models to identify what type of entity each individual party is and expose those annotations for search and analysis. While we won’t be able to disambiguate every entity like we did with the judges (figuring out if plaintiff John Johnson in a case in Illinois is the same person as defendant John Johnson in Georgia is nearly impossible with docket sheets), we are planning to disambiguate lawyers and more prevalent entities like Fortune 500 companies.
Our goal is to keep pushing and enriching the SCALES OKN user interface and data ecosystem throughout the year and, with continued user testing, reach our milestones for a stable and feature-filled public release in Summer 2022. As the year progresses, we plan to make more incremental public releases (ontologies, models, raw data, and more), so make sure to check back here or feel free to reach out. As always, we’re not just excited about building the SCALES OKN; we’re excited about working with the community so that we can all have a resource that can transform our ability to answer systematic questions about the courts.