This page summarizes important changes to our methodology and data sources that we expect to significantly affect the Unpaywall dataset.
2020-05-01 - reclassified items on preprint servers as Green OA:
As described in What do the types of oa_status mean?, an article is Green OA if the host_type of its best location is "repository". Until now, the page that an article's DOI resolves to was always given host_type "publisher", so the article was classified as Bronze, Hybrid, or Gold, even when that page was on a preprint server. Now, DOI-resolved locations on preprint servers are treated as repositories, and those articles are Green. At the time of this writing, 170,000 articles are affected by this change.
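The rule change can be sketched roughly as follows. This is only an illustration of the logic described above, not Unpaywall's actual code: the function names and the boolean flags `on_preprint_server` and `after_change` are hypothetical, and only the host_type values "publisher" and "repository" come from this page.

```python
def host_type_of_doi_location(on_preprint_server: bool, after_change: bool = True) -> str:
    """Return the host_type assigned to the page an article's DOI resolves to.

    Hypothetical helper: before the 2020-05-01 change every DOI-resolved
    location was "publisher"; afterwards, preprint servers are "repository".
    """
    if on_preprint_server and after_change:
        return "repository"
    return "publisher"


def is_green(best_location_host_type: str) -> bool:
    """An article is Green OA when its best location is a repository."""
    return best_location_host_type == "repository"


# A preprint whose best location is the page its DOI resolves to:
before = host_type_of_doi_location(on_preprint_server=True, after_change=False)
after = host_type_of_doi_location(on_preprint_server=True, after_change=True)
print(before, is_green(before))
print(after, is_green(after))
```

Under these assumptions, the same preprint moves from a publisher-hosted status to Green, while articles whose DOI resolves to an ordinary publisher page are unaffected.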
2020-02-25 - began retroactively applying Crossref metadata updates:
We improved our Crossref data collection so that the latest article metadata is always reflected in Unpaywall, and we're retroactively applying Crossref updates from the last six months. This will affect the data feed for about 15 million articles and will produce larger-than-usual files between 2020-03-05 and 2020-03-19. We expect these files to contain about 8 million lines. The majority of these changes are revisions to published_date, publisher, and genre and do not affect open_locations or oa_status.
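If you consume the data feed and want to separate metadata-only revisions from changes that touch open access information, a comparison along these lines may help. It is a minimal sketch that assumes you already hold the old and new versions of a record as Python dictionaries; the helper names are hypothetical, and the field names are the ones mentioned above.

```python
# Fields named above as carrying open access information.
OA_FIELDS = {"open_locations", "oa_status"}


def changed_fields(old: dict, new: dict) -> set:
    """Return the names of top-level fields whose values differ."""
    keys = old.keys() | new.keys()
    return {key for key in keys if old.get(key) != new.get(key)}


def affects_oa(old: dict, new: dict) -> bool:
    """True if any changed field is one of the OA-related fields."""
    return bool(changed_fields(old, new) & OA_FIELDS)


# Made-up example records, for illustration only.
old_record = {"publisher": "Example Press", "genre": "journal-article", "oa_status": "gold"}
new_record = {"publisher": "Example Press Ltd", "genre": "journal-article", "oa_status": "gold"}

print(changed_fields(old_record, new_record))
print(affects_oa(old_record, new_record))
```

Records for which `affects_oa` returns False could then be handled as metadata refreshes rather than as changes in OA availability.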
2019-12-08 - added articles from Semantic Scholar:
We're adding about 8 million PDFs hosted by Semantic Scholar. We already have OA locations for many of these articles, but we expect this to create 3 million new Green OA articles by the end of 2019.
2019-11-14 - improved PDF validation:
Our automated PDF validation processes are now much more robust, allowing us to add about 1.5 million new OA articles. Half of these are in newly identified Gold OA journals that we were previously unable to spot because their articles looked unavailable to us.