PACER is a black box when it comes to how exactly it generates its revenues. The system has received criticism from journalists, researchers, and, more recently, members of Congress. The U.S. House of Representatives recently passed H.R. 8235, the Open Courts Act, aimed at making access to PACER more affordable. Given that push, we set out to build an analytical model to estimate what it would cost to acquire PACER’s 2016 contents at current pricing.
From the user side, it is prohibitively expensive to compile large amounts of case data. We estimated it would cost somewhere between $5.5 million and $5.75 million to access and download all case data for Civil and Criminal cases filed in 2016 in the 94 district courts. Although we were modelling a black box, our approach lends itself to full transparency. In this post we outline the methodology we employed in building our estimates of the scale of PACER data costs.
Working with panel data
A lot of work on our data involved joining events, dockets, and documents together with the added complexity of time. Because of this complexity, we found it important to ground our analysis in the concept of panel data. A panel of data is a set of observations that tracks individual entities over some period.
A docket sheet is quite literally the panel of a court case, tracking entries over time. In our cost modelling effort, our data corresponded to the 2016 Cohort (C-2016) of panel data. C-2016 contains every docket sheet for a case filed in 2016. Importantly, this means the documents attached to these cases could have been filed in 2016 or in any later year. Case activity might occur largely in the original filing year, or motions, documents, petitions, judicial orders, etc. might continue to be added over the years until a case terminates. Below is a plot of our panel of sample documents from 2016 dockets in the Northern District of Illinois (NDIL) and when they were filed.
You might be wondering about that tall outlier that shows up in August 2018. Digging through our metadata, we found a multi-district litigation (MDL) case responsible for 755 documents filed on a single day. If you look closely, you can also see federal holidays contribute to “outlier pits” in our distribution: Christmas 2016 and Independence Day 2017 are good examples where filings drop significantly for a day or two.
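As a rough illustration of the cohort idea (not the code used for this analysis), the sketch below keeps every document whose case was filed in 2016, regardless of when the document itself was filed, and then counts filings per day to surface a peak. The record layout, the case IDs, and the specific dates are made up for the example.

```python
from collections import Counter
from datetime import date

# Hypothetical records: (case_id, case_filed_date, document_filed_date).
# Case IDs and dates are invented for illustration only.
documents = [
    ("case-A", date(2016, 3, 1), date(2016, 3, 1)),
    ("case-A", date(2016, 3, 1), date(2018, 8, 15)),
    ("case-B", date(2016, 7, 9), date(2018, 8, 15)),
    ("case-C", date(2015, 12, 30), date(2016, 1, 4)),  # case filed in 2015: not in C-2016
]

def cohort_documents(records, cohort_year=2016):
    """Keep documents whose case was filed in the cohort year,
    whatever year the document itself was filed."""
    return [r for r in records if r[1].year == cohort_year]

def filings_per_day(records):
    """Count documents by the day they were filed."""
    return Counter(doc_date for _, _, doc_date in records)

c2016 = cohort_documents(documents)
daily = filings_per_day(c2016)
peak_day, peak_count = daily.most_common(1)[0]
print(peak_day, peak_count)
```

Note that the 2015 case is dropped even though one of its documents was filed in 2016: membership in the cohort is decided by the case filing date, not the document date.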
What exactly is cost?
When aiming to download data from PACER, there are two distinct components to the cost of a case: (A) the cost of the docket sheet and (B) the cost of the documents attached to the docket. We used two separate methodologies to estimate these cost components for the 94 courts. There are many possible ways to model costs, including predicting net case costs without splitting component A from component B. Given the nature of our disparate data sources, we felt that modelling the components independently would be our best approach.
Docket Sheet Cost
In 2016, roughly 350,000 Civil and Criminal dockets were filed across the 94 district courts. We acquired all 350,000 case IDs by building a dataset from both PACER and CourtListener (RECAP). We purchased roughly 45% of those docket sheets directly from PACER, with the other 55% coming through RECAP. Because of the way RECAP crowdsources data from users, those docket sheets did not come with a clear PACER cost value associated with them.
Fortunately, we saved our PACER bill from the docket sheets we did purchase (sometimes we look at it and cry). We used these real cost values to validate the models we built to estimate the cost of the RECAP dockets. PACER charges $0.10 per billable page and caps its charges at $3.00 for docket sheets, so keep in mind that we are dealing with a tight range of predictive values for the cost of a docket sheet: $0.10 at minimum and $3.00 at maximum. This range of values also comes with a downside: any dockets that are 30 pages or more are all grouped together at $3.00, an imbalanced class at the top end of the range.
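The pricing rule described here is simple enough to write down directly. The sketch below is illustrative only; the function name is our own, and it works in cents to avoid floating-point rounding.

```python
PER_PAGE_CENTS = 10   # $0.10 per billable page
CAP_CENTS = 300       # docket sheet charges are capped at $3.00

def docket_sheet_cost_cents(billable_pages: int) -> int:
    """Cost of one docket sheet in cents: per-page rate, capped at $3.00."""
    if billable_pages < 1:
        raise ValueError("a docket sheet has at least one billable page")
    return min(billable_pages * PER_PAGE_CENTS, CAP_CENTS)

for pages in (1, 29, 30, 120):
    print(pages, docket_sheet_cost_cents(pages) / 100)
```

Every docket of 30 pages or more lands on the same $3.00 value, which is exactly the class imbalance at the top of the range.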
We began by considering what makes up a docket sheet: text blocks for lawyers, defendants, plaintiffs; docket entries of variable text length; court information; etc. If you have two lawyers representing you, that is twice as many lines in the representation block compared to somebody with one lawyer. If you have 100 docket entries, that is roughly 10 times as much text as 10 docket entries. By now you should be picking up the trend: linearity.
Above is a graph of docket length (number of entries) compared to the total billable pages PACER charged for it, broken down by nature of suit subtype. There is clearly some form of linearity at play in this relationship; the challenge for us was using this information to break out of our class imbalance problem. There is a distinct vertical wall of costs locked in at $3.00 that prevents us from extracting an otherwise linear pattern between docket sheet length and docket pages. This led us to attempt a two-pronged model approach. We ensembled the two methods below:
- A logistic regression model to predict if a docket sheet is 30 or more billable pages (the PACER billing cap at $3.00)
- A multiple linear regression model to predict the number of pages a docket contains if it was classified as “under 30 pages”
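To make the two-stage idea concrete, here is a minimal sketch of how a classifier and a regression can be chained. The coefficients are invented placeholders, not fitted values, the two features (entry count and party count) stand in for the fuller feature set, and the nature of suit categories are omitted for brevity. Clamping the regression output to between 1 and 29 pages is also an assumption of this sketch.

```python
import math

PAGE_CAP = 30            # 30 billable pages corresponds to the $3.00 cap
PER_PAGE_DOLLARS = 0.10

def prob_capped(n_entries: int, n_parties: int) -> float:
    """Stage 1: logistic model for P(docket is 30+ billable pages).
    Coefficients are hypothetical placeholders."""
    z = -6.0 + 0.05 * n_entries + 0.10 * n_parties
    return 1.0 / (1.0 + math.exp(-z))

def predict_pages(n_entries: int, n_parties: int) -> float:
    """Stage 2: linear model for billable pages on dockets under the cap.
    Coefficients are hypothetical placeholders."""
    return 1.0 + 0.20 * n_entries + 0.15 * n_parties

def predict_docket_cost(n_entries: int, n_parties: int, threshold: float = 0.5) -> float:
    """Chain the two stages: capped dockets cost $3.00, the rest are priced
    from the (continuous) predicted page count."""
    if prob_capped(n_entries, n_parties) >= threshold:
        pages = float(PAGE_CAP)
    else:
        # Keep the regression output inside the range it was trained on.
        pages = min(max(predict_pages(n_entries, n_parties), 1.0), PAGE_CAP - 1)
    return round(pages * PER_PAGE_DOLLARS, 2)

print(predict_docket_cost(n_entries=10, n_parties=2))
print(predict_docket_cost(n_entries=300, n_parties=10))
```

Splitting the problem this way lets the linear model learn only from dockets where page count still varies, instead of being flattened by the wall of $3.00 observations.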
In a world of advanced models and methods, we settled for Occam’s Razor… er Occam’s Regressions that is. Intuitively, our data has simple linear relationships that are easily captured by these simple regression models. The model features were explainable: the more people or entries there are in a case, the more newlines there will be. Treating the nature of suit types as categorical enabled us to capture any idiosyncrasies with entry lengths. The logistic model for classification was 95% accurate when validated with the real costs of docket reports in our holdout set. Our misclassifications happened more frequently with small dockets being predicted as large – ultimately this means we were probably overestimating their docket costs by 10 or 20 cents.
Once we had a functioning classification model, we trained a multiple linear regression model to predict the number of billable pages on a docket. The training data were dockets with known billed page counts of up to 29 pages. Again, we validated our model with a holdout set and produced an R² value of 0.897. While we would hope for a better value, we feel it is still a strong coefficient of determination given that our target values were discrete (ten-cent intervals) while our model produced continuous predictions. That is, a 5.5-page predicted value for a 5-page actual value is a 10% error, but the practical difference is only 5 cents of cost. Knowing our final estimates were likely to be in the hundreds of thousands of dollars, we did not think it was prudent to tune and chip away at an R² value for a minimal impact on our overall estimates.
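For readers who want to see the arithmetic, the sketch below computes a coefficient of determination on a tiny made-up holdout set and reproduces the 5.5-page versus 5-page example. The page values are invented for illustration and have nothing to do with the 0.897 reported above.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Invented holdout values: discrete actual pages, continuous predictions.
actual_pages = [5, 10, 15, 20]
predicted_pages = [5.5, 9.0, 16.0, 19.5]
print(r_squared(actual_pages, predicted_pages))

# The worked example from the text: 5.5 predicted pages vs 5 actual pages.
pct_error = (5.5 - 5) / 5            # relative error in pages
cost_gap = (5.5 - 5) * 0.10          # dollars, at $0.10 per page
print(pct_error, cost_gap)
```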
Our approach produced a mean predicted cost of roughly $238,000 to access and download every docket sheet filed in 2016. This value includes the known costs of the 45% of dockets we had already purchased, at just over $100,000. We built an 80% prediction interval around this value for the total predicted docket costs. The interval for total docket sheet costs is between $228,000 and $255,000.
With the smaller portion of costs (docket sheets) out of the way, we set our sights on modelling the bulk of PACER costs: documents. Using our sample of 350,000 docket sheets, we ran a parser through the docket entries and identified roughly 11 million documents attached to the 2016 cases. We certainly did not want to download or even attempt to access that many attachments via PACER, so we chose to use the Northern District of Illinois (NDIL) as a document sample. We generated roughly half a million document links and used the document landing page on PACER to extract information.
We tried accessing nearly 515,000 webpages, and our final sample of accessible document metadata was just over 435,000 documents. The roughly 80,000 documents we could not get metadata for were inaccessible due to sensitive information, sealing restrictions, or other PACER-determined document restrictions. We felt confident 435,000 was plenty of documents to work with.
The panel of document data covered 4.5 years of document filings (2016 to 2020) and was stratified by nature of suit subtypes. For each of the nearly 13,000 cases in NDIL, we were able to aggregate total pages, total documents, and total case costs. We chose to group Civil Detainee, Federal Tax Suits, and Forfeiture/Penalty together with Unknown natures of suit into a “Miscellaneous” category due to their relative infrequency. We found a lot of variability between natures of suit, but also within them. For example, some Labor suits might end in 60 days and cost only $8.00 in total while others last 300 days and cost $40.00. This non-uniformity caused us to rule out any sort of simple flat-rate cost extrapolation from NDIL to the other 93 courts.
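The per-case aggregation and the “Miscellaneous” grouping can be sketched in a few lines. The row layout, case IDs, and numbers below are hypothetical; only the four category names being merged come from the text.

```python
# Natures of suit folded into "Miscellaneous" because they are infrequent.
MISC_GROUPS = {"Civil Detainee", "Federal Tax Suits", "Forfeiture/Penalty", "Unknown"}

def normalize_nature_of_suit(label: str) -> str:
    return "Miscellaneous" if label in MISC_GROUPS else label

def aggregate_cases(rows):
    """Roll document rows up to one record per case.
    Each row is (case_id, nature_of_suit, pages, cost_cents)."""
    cases = {}
    for case_id, nature_of_suit, pages, cost_cents in rows:
        entry = cases.setdefault(case_id, {
            "nature_of_suit": normalize_nature_of_suit(nature_of_suit),
            "documents": 0,
            "pages": 0,
            "cost_cents": 0,
        })
        entry["documents"] += 1
        entry["pages"] += pages
        entry["cost_cents"] += cost_cents
    return cases

# Invented document rows for two cases.
rows = [
    ("case-A", "Labor", 10, 100),
    ("case-A", "Labor", 5, 50),
    ("case-B", "Federal Tax Suits", 3, 30),
]
cases = aggregate_cases(rows)
print(cases)
```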
One option we entertained briefly with simulation was distribution fitting by nature of suit. What we mean by this is identifying the document page frequencies by nature of suit and mapping them to a distribution. From that fitted distribution, we could run simulations of document samples. For example, the number of pages for all documents filed in Civil Rights cases looks to be close to a tuned Negative Binomial distribution, and we could probably simulate Civil Rights cases using the Negative Binomial distribution.
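To show what simulating from a fitted distribution could look like, here is a small sketch that draws document page counts from a Negative Binomial using only the standard library. The parameters `r` and `p` are made up rather than fitted to any NDIL data, and shifting the draw by one page (so that every document has at least one page) is an assumption of this sketch.

```python
import random

def sample_negative_binomial(r: int, p: float, rng: random.Random) -> int:
    """Number of failures observed before the r-th success,
    with success probability p on each trial."""
    failures = 0
    successes = 0
    while successes < r:
        if rng.random() < p:
            successes += 1
        else:
            failures += 1
    return failures

def simulate_page_counts(n_docs: int, r: int, p: float, seed: int = 0):
    """Simulate page counts for n_docs documents (at least 1 page each)."""
    rng = random.Random(seed)
    return [1 + sample_negative_binomial(r, p, rng) for _ in range(n_docs)]

# Hypothetical parameters; the expected page count is 1 + r * (1 - p) / p = 7.
pages = simulate_page_counts(n_docs=5000, r=2, p=0.25)
mean_pages = sum(pages) / len(pages)
print(round(mean_pages, 2))
```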
Distribution fitting, however, is time-intensive and hinges on many assumptions, so we decided to sample using raw data distributions. With over 400,000 documents across 14 nature of suit subtypes, we had plenty of data to run stratified sampling with replacement using the real document information. Below is a visual representation of how this sampling method works at the case level. If, for example, the Eastern District of Texas (EDTX) has 100 Civil Rights cases, we can sample 100 random Civil Rights cases out of NDIL and use their total document costs as an estimate for the 100 cases in the Eastern District of Texas. We can then repeat this approach hundreds or thousands of times to build a range of estimates.
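A minimal sketch of case-level stratified sampling with replacement follows. It is an illustration under simplifying assumptions rather than the production pipeline: the function name is our own, the NDIL cost lists are tiny made-up pools, and the EDTX case counts are invented.

```python
import random

def simulate_court_total(target_case_counts, reference_costs,
                         n_simulations=500, seed=0):
    """For each simulation, draw (with replacement) one reference case cost
    per target case, within each nature of suit, and sum them."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_simulations):
        total = 0.0
        for nature_of_suit, n_cases in target_case_counts.items():
            pool = reference_costs[nature_of_suit]
            total += sum(rng.choices(pool, k=n_cases))
        totals.append(total)
    return totals

# Made-up NDIL per-case document costs (dollars), stratified by nature of suit.
ndil_costs = {
    "Civil Rights": [3.2, 8.0, 12.5, 40.0],
    "Labor": [8.0, 40.0],
}
# Made-up case counts for the target district.
edtx_counts = {"Civil Rights": 100, "Labor": 20}

totals = simulate_court_total(edtx_counts, ndil_costs)
print(min(totals), sum(totals) / len(totals), max(totals))
```

Sampling within each nature of suit keeps the target district's case mix intact, while the spread across simulations gives a range rather than a single point estimate.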
When sampling, we computed costs a few ways. We tried four cost formulas to identify which would be most effective for each nature of suit (validating against NDIL). Using the same example as above, let’s say we are trying to model 100 Civil Rights cases in the Eastern District of Texas. We would know there are 100 cases of this type by parsing the docket sheets, and we would know how many documents each of those 100 cases has attached to it as well.
For each of those 100 cases, we will sample a random NDIL Civil Rights case. The NDIL case will have three features of interest:
- Document Accessibility Rate
- Net Document Costs
- Document Cost per Day Case is Open
Using these features, we calculate a cost estimate in four ways:
- Take the NDIL case’s Document Accessibility Rate and multiply it by the EDTX case’s known number of documents (e.g. 80% rate × 100 documents = 80 documents). Sample that many times (e.g. 80) from the NDIL Civil Rights document page length distribution, apply the $0.10 per page rate to those samples, and sum the costs to make one case document cost estimate
- Take the NDIL case’s Document Accessibility Rate and multiply it by the EDTX case’s known number of documents (e.g. 80% rate × 100 documents = 80 documents). Sample that many times (e.g. 80) from the NDIL Civil Rights document cost distribution and sum the costs to make one case document cost estimate
- Take the NDIL case’s net document cost as the EDTX case’s modelled net document cost
- Take the NDIL case’s Document Cost per Day Case is Open value and multiply it by the duration of the EDTX case to get an estimated EDTX case document cost estimate
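The four calculations above can be sketched as four small functions. This is an illustrative reading of the list, not the original implementation: the dictionary keys, function names, and sample pools are hypothetical, and the first method applies the flat $0.10 per page rate exactly as described.

```python
import random

PER_PAGE_DOLLARS = 0.10

def accessible_documents(rate: float, n_docs: int) -> int:
    """Expected number of accessible documents (e.g. 0.8 * 100 -> 80)."""
    return round(rate * n_docs)

def cost_method_1(ref_case, target_n_docs, page_pool, rng):
    """Sample page lengths and price them at $0.10 per page."""
    n = accessible_documents(ref_case["accessibility_rate"], target_n_docs)
    return sum(PER_PAGE_DOLLARS * pages for pages in rng.choices(page_pool, k=n))

def cost_method_2(ref_case, target_n_docs, cost_pool, rng):
    """Sample document costs directly."""
    n = accessible_documents(ref_case["accessibility_rate"], target_n_docs)
    return sum(rng.choices(cost_pool, k=n))

def cost_method_3(ref_case):
    """Use the sampled reference case's net document cost as-is."""
    return ref_case["net_document_cost"]

def cost_method_4(ref_case, target_days_open):
    """Scale the reference case's cost per day by the target case's duration."""
    return ref_case["cost_per_day"] * target_days_open

# Hypothetical sampled NDIL case and NDIL document pools.
ref_case = {"accessibility_rate": 0.8, "net_document_cost": 40.0, "cost_per_day": 0.25}
page_pool = [1, 2, 5, 12, 30]
cost_pool = [0.1, 0.2, 0.5, 1.2, 3.0]

rng = random.Random(0)
m1 = cost_method_1(ref_case, 100, page_pool, rng)
m2 = cost_method_2(ref_case, 100, cost_pool, rng)
m3 = cost_method_3(ref_case)
m4 = cost_method_4(ref_case, target_days_open=300)
print(m1, m2, m3, m4)
```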
We then run this sampling of 100 cases hundreds of times (simulations). We ran this methodology against the Northern District of Illinois and chose the best performing method for each nature of suit in terms of accuracy compared to real costs incurred. The various calculation approaches enable us to be flexible across suit types, especially if some subtypes have an extremely variable number of documents or costs. Using multiple methods also allowed us to fine-tune and find effective estimators for Social Security and Immigration subtypes given their high rate of document restrictions.
Below are the Mean Absolute Percent Errors (MAPEs) for the cost method selection from 500 random simulations. The MAPEs for the Cost Per Day method (CC4, the fourth method listed above) indicate that for most suit types the duration of the suit plays almost no role in the total costs of a case; however, it is oddly an effective approach for the Social Security cases given their document restrictions.
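As a sketch of how such a selection could be computed, the snippet below scores each method's simulated totals against a known real total and keeps the lowest-MAPE method per nature of suit. The dollar figures and method labels are invented for the example, and treating each suit type as a single known total is a simplifying assumption.

```python
def mape(actual: float, estimates) -> float:
    """Mean absolute percent error of simulated totals against one real total."""
    return 100.0 * sum(abs(e - actual) / actual for e in estimates) / len(estimates)

def best_method_per_suit(actual_by_suit, sims_by_suit):
    """Pick, for each nature of suit, the method with the lowest MAPE."""
    best = {}
    for suit, actual in actual_by_suit.items():
        scores = {method: mape(actual, estimates)
                  for method, estimates in sims_by_suit[suit].items()}
        best[suit] = min(scores, key=scores.get)
    return best

# Invented real totals and simulated totals (dollars).
actual_by_suit = {"Civil Rights": 1000.0, "Social Security": 200.0}
sims_by_suit = {
    "Civil Rights": {"method_3": [990.0, 1020.0], "method_4": [1500.0, 600.0]},
    "Social Security": {"method_3": [300.0, 320.0], "method_4": [210.0, 190.0]},
}

selection = best_method_per_suit(actual_by_suit, sims_by_suit)
print(selection)
```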
The optimized cost method selection resulted in predictive accuracy within 1% for the Northern District of Illinois document costs. Given that this result held over 100, 500, and 1,000 simulations, we felt confident in our approach and applied it across all 94 district courts. We make a general assumption that most nature of suit subtypes operate similarly regardless of the district they are in. This assumption does overlook some case nuances by district (such as national security cases in Virginia’s Eastern District or patent cases in the Eastern District of Texas), but running many simulations gives us a range of estimates that we expect to absorb some of that variation if our assumptions were too loose. Our final modelled document cost was on average somewhere between $5.3 million and $5.5 million.
While PACER is a challenge to navigate, we plan to get access to more data in the future to validate our analysis. It would be naïve to think this approach is bulletproof without more nuanced district data or case data beyond 2016. Nevertheless, given the constraints, we’re confident in our estimate of the 2016 costs and believe that we could model the adjacent years with the same data.