This page summarizes important changes to our methodology and data sources that we expect to significantly affect the Unpaywall dataset.


2020-02-25 - began retroactively applying Crossref metadata updates:


We improved our Crossref data collection so that the latest article metadata is always reflected in Unpaywall, and we're retroactively applying Crossref updates from the last six months. This will affect the data feed for about 15 million articles and will produce larger-than-usual files between 2020-03-05 and 2020-03-19. We expect these files to contain about 8 million lines. The majority of these changes are revisions to published_date, publisher, and genre and do not affect open_locations or oa_status.
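If you consume the data feed and want to separate these metadata-only revisions from changes that touch OA information, one rough approach is to compare each incoming record with the copy you already hold. The sketch below assumes records are parsed into Python dicts and that published_date, publisher, and genre are top-level keys. The function names are hypothetical and not part of any Unpaywall tooling.

```python
# Fields named above as accounting for most of the retroactive changes.
# Assumption: these appear as top-level keys in each record.
METADATA_FIELDS = {"published_date", "publisher", "genre"}


def changed_fields(old: dict, new: dict) -> set:
    """Return the top-level keys whose values differ between two records."""
    return {key for key in old.keys() | new.keys() if old.get(key) != new.get(key)}


def is_metadata_only(old: dict, new: dict, ignore: frozenset = frozenset()) -> bool:
    """True if the records differ, and only in the metadata fields listed above.

    `ignore` lets the caller skip keys that change on every update,
    such as a timestamp (hypothetical; adjust to your own data).
    """
    changed = changed_fields(old, new) - ignore
    return bool(changed) and changed <= METADATA_FIELDS
```

A record flagged by is_metadata_only can usually be applied without recomputing anything that depends on OA status; anything else deserves a closer look.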


2019-12-08 - added articles from Semantic Scholar:


We're adding about 8 million PDFs hosted by Semantic Scholar. We already have OA locations for many of these articles, but we expect this to create 3 million new Green OA articles by the end of 2019.



2019-11-14 - improved PDF validation:


Our automated PDF validation processes are now much more robust, allowing us to add about 1.5 million new OA articles. About half of these are in newly identified Gold OA journals that we were previously unable to spot because their articles looked unavailable to us.