This page summarizes important changes to our methodology and data sources that we expect to significantly affect the Unpaywall dataset.

2021-02-02: Removed duplicate oa_locations in cases where the same publisher page was determined to have an OA copy in two different ways. Previously two oa_locations representing the same page could be created with different licenses and oa_dates.
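
For anyone working with snapshots taken before this change, the same kind of cleanup can be approximated on the client side. The following is only a sketch under stated assumptions, not the logic Unpaywall runs: the function name dedupe_oa_locations is hypothetical, locations are treated as plain dicts, the "url" field is assumed to identify the page, and keeping the first occurrence is an arbitrary choice because this entry does not say which duplicate is retained.

```python
def dedupe_oa_locations(locations):
    """Keep one oa_location per URL, preserving the original order.

    Assumption: two locations are duplicates when their "url" values match.
    Assumption: the first occurrence wins; the source does not specify this.
    """
    seen_urls = set()
    unique = []
    for location in locations:
        url = location.get("url")
        if url in seen_urls:
            continue  # same page already recorded, skip the later copy
        seen_urls.add(url)
        unique.append(location)
    return unique


# Hypothetical example data: the same page detected in two different ways.
example = [
    {"url": "https://example.org/article/1", "license": "cc-by"},
    {"url": "https://example.org/article/1", "license": None},
    {"url": "https://example.org/article/2", "license": None},
]
deduped = dedupe_oa_locations(example)
```

If a different tie-break is wanted (for example, preferring the copy that carries a license), sort the list by that preference before calling the function.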

2021-01-19: Updated our list of "detected OA" journals, as described in this FAQ, for 2021. Added about 200,000 Gold articles from 9,000 journals.

2021-01-13: Began counting journals using "publisher's own license" in DOAJ as Gold OA. See, for example. Added about 100,000 Gold articles.

2020-12-31: Changed the version property of preprint locations from "publishedVersion" to "submittedVersion". This affects the preprints we reclassified as Green OA on 2020-05-01. We previously called these published because the preprint is often the final version, but this conflicts with the common expectation that accepted and published versions are peer-reviewed.

2020-12-14: New journals added to DOAJ are assumed to be Open Access starting on the date they were added, rather than on a start date defined by DOAJ. Journals that already have OA dates from DOAJ will keep those dates.

2020-10-09: Changed the definition of “OA license” as it relates to the distinction between Hybrid and Bronze articles. See What is an OA license? for details.

2020-10-05: Added oa_date property to oa_locations, and first_oa_location to DOI records. See What does oa_date mean and how is it determined? for details.

2020-09-14 - improved detection of Wiley Bronze OA

We improved our Bronze OA validation process for Wiley, which will convert about 1 million Closed or Green articles to Bronze OA over the next few weeks.

2020-05-01 - reclassified items on preprint servers as Green OA

We've reclassified articles hosted on preprint servers to reflect their differences from traditional publishing platforms. Examples of this type of platform are bioRxiv, MDPI Preprints, and ChemRxiv.

As described in What do the types of oa_status mean?, an article is Green OA if the host_type of its best location is "repository". Until now, the page that an article's DOI resolves to was always considered to have host_type "publisher", so the article was classified as Bronze, Hybrid, or Gold. Now, for articles hosted on preprint servers, these locations are considered repositories and the articles are Green. At the time of this writing, 170,000 articles are affected by this change.
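
The classification rule described above can be sketched in a few lines of Python. This is an illustration only, not Unpaywall's implementation: the function name classify_oa_status is hypothetical, the record is assumed to be a dict with "best_oa_location" and "journal_is_oa" keys, and the Gold, Hybrid, and Bronze branches are a simplified reading of the general oa_status definitions rather than something stated in this entry. In particular, the sketch treats any non-empty license as an OA license, which may be looser than the definition given in the FAQ.

```python
def classify_oa_status(record):
    """Return an oa_status string for a DOI record (simplified sketch)."""
    best = record.get("best_oa_location")
    if best is None:
        return "closed"  # no OA location found at all

    # The rule from the text: a repository-hosted best location means Green.
    # After this change, preprint servers count as repositories.
    if best.get("host_type") == "repository":
        return "green"

    # Assumption: remaining locations are publisher-hosted.
    if record.get("journal_is_oa"):
        return "gold"
    # Assumption: any non-empty license counts as an OA license here.
    if best.get("license"):
        return "hybrid"
    return "bronze"


# Hypothetical records: a preprint before and after the reclassification.
before = {"journal_is_oa": False,
          "best_oa_location": {"host_type": "publisher", "license": None}}
after = {"journal_is_oa": False,
         "best_oa_location": {"host_type": "repository", "license": None}}
status_before = classify_oa_status(before)
status_after = classify_oa_status(after)
```

Under these assumptions, the same preprint moves from Bronze to Green purely because its host_type changes, which is the effect this entry describes.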

2020-02-25 - began retroactively applying Crossref metadata updates

We improved our Crossref data collection so that the latest article metadata is always reflected in Unpaywall, and we're retroactively applying Crossref updates from the last six months. This will affect the data feed for about 15 million articles and will produce larger-than-usual files between 2020-03-05 and 2020-03-19. We expect these files to contain about 8 million lines. The majority of these changes are revisions to published_date, publisher, and genre and do not affect oa_locations or oa_status.

2019-12-08 - added articles from Semantic Scholar

We're adding about 8 million PDFs hosted by Semantic Scholar. We already have OA locations for many of these articles, but we expect this to create 3 million new Green OA articles by the end of 2019.

2019-11-14 - improved PDF validation

Our automated PDF validation processes are now much more robust, allowing us to add about 1.5 million new OA articles. Half of these are in newly identified Gold OA journals that we were previously unable to spot because their articles appeared unavailable to us.