Wikidata and R: a perfect pair

Introducing WikidataR: A new package linking the open source programming language R and Wikidata

3 June 2021, Thomas Shafee / Canley.

R is great, and a good match for Wikidata stuff

The open source programming language ‘R’ is a statistical computing environment, which is widely used for data manipulation, analysis and visualisation. It’s also a great match for Wikidata, the massive database of everything.

R is organised into packages that cover different capabilities. The first pair of packages that could read from Wikidata was developed back in 2017, thanks to the great work of Os Keys and Mikhail Popov. Mikhail’s WikidataQueryServiceR can run SPARQL queries to return lists of Wikidata items, and Os’s WikidataR can download and read those items. However, it wasn’t possible to write back to Wikidata from R, until now.
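As a rough illustration of that read path, the sketch below builds a SPARQL query, runs it with WikidataQueryServiceR's query_wikidata function, and passes the resulting QIDs to WikidataR's get_item function. The query, the helper names (qid_from_uri, read_cat_items) and the assumption that the result has an item column holding entity URIs are illustrative choices, not something prescribed by either package. The function needs both packages installed and a network connection, so it is defined here but not called.

```r
# An example query: five items that are an instance of (P31) house cat (Q146).
sparql <- '
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q146 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 5'

# Hypothetical helper: reduce an entity URI such as
# "http://www.wikidata.org/entity/Q42" to the bare QID "Q42".
# A value that is already a bare QID passes through unchanged.
qid_from_uri <- function(uri) sub("^.*/", "", uri)

# Hypothetical wrapper: run the query, then download the matching items.
# Assumes the query result contains a column named `item`.
read_cat_items <- function(query = sparql) {
  results <- WikidataQueryServiceR::query_wikidata(query)
  qids <- qid_from_uri(results$item)
  WikidataR::get_item(id = qids)
}
```

Calling read_cat_items() in an interactive session should return the downloaded items, one per QID found by the query.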

Thanks to a Wikicite grant, we’ve been able to build an expanded version of WikidataR that closes the loop. Items can be created and statements added or deleted in batches using the QuickStatements format (by Magnus Manske). This format happens to match the R ‘tidyverse’ format of tibbles (enhanced data frames) pretty well.
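To show why the two formats line up, here is a minimal sketch that holds one statement per row in a plain data frame and turns each row into a tab-separated QuickStatements line. It uses a base R data frame where a tibble would do equally well. The column names, the to_quickstatements helper and the QIDs are placeholders chosen for illustration, and this is not how WikidataR itself builds its output; the package's README is the place to check the real arguments of its writing functions.

```r
# One statement per row: item, property, value.
# The QIDs are placeholders; P2860 is "cites work" and P921 is "main subject".
statements <- data.frame(
  item     = c("Q4115189", "Q4115189"),
  property = c("P2860", "P921"),
  value    = c("Q13406268", "Q15397819"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one tab-separated QuickStatements line per row.
to_quickstatements <- function(df) {
  paste(df$item, df$property, df$value, sep = "\t")
}

# Print the batch, one statement per line.
cat(to_quickstatements(statements), sep = "\n")
```

Because each row is already a complete statement, filtering or adding rows with ordinary data frame tools changes the batch that would be submitted.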

The other main additions are a range of new read utilities, such as using Wikidata to translate from any identifier to any other (e.g. from ORCID to QID or from VIAF ID to Twitter username).

> identifier_from_identifier('ORCID iD','IMDb ID',c('0000-0002-7865-7235','0000-0003-1079-5604'))
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=03s
# A tibble: 2 x 2
  value               return
  <chr>               <chr>
1 0000-0002-7865-7235 nm2118834
2 0000-0003-1079-5604 nm1821217

Take a look at what you can do

The WikidataR package is flexible, and can enable and enhance a range of workflows when used in combination with the existing capabilities of R and other packages to acquire, process and link data.

An important part of the Wikicite model is its network graph of citations, which uses Wikidata’s cites work (P2860) property. Using the capabilities of other R packages like rvest (web scraping), stringr (string manipulation) and dplyr (data manipulation), DOIs can be extracted from an article’s list of citations in web pages or even PDFs, and sent to WikidataR’s identifier_from_identifier function to return the Wikidata QIDs of those cited articles that are already in Wikidata. WikidataR can then import the list of cited articles into the item of the citing article.
WikidataR can be used to populate the main subject (P921) property of scholarly article items. As with the list of citations, a list of keywords in an article’s abstract can be extracted and then passed to WikidataR’s find_item function, which returns a list of Wikidata items with their labels and descriptions. These can then be assessed and filtered manually or programmatically to return a list of Q-items for the objects or concepts mentioned in the keyword listing.
Reconciliation of datasets being imported is important to prevent duplication of items in Wikidata. WikidataR can help with this task by finding existing items on Wikidata using identifiers in the dataset or, if no identifiers are available, by returning lists of possible matches for the name of a person, object, or concept. Existing items can then be filtered out before beginning the import process.
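The citation and reconciliation workflows above can be sketched in a few lines. This example uses base R regular expressions in place of rvest and stringr to stay dependency-free. The reference strings, the DOIs, the QID placeholder and the helper names (extract_dois, split_by_match) are all hypothetical, and the lookup table is hard-coded where a real run would call a WikidataR lookup such as identifier_from_identifier.

```r
# Hypothetical reference-list text, standing in for scraped HTML or PDF text.
references <- c(
  "Author A (2019). First example paper. doi:10.1234/example.one.",
  "Author B (2020). An entry with no DOI at all.",
  "Author C (2021). Second example paper. https://doi.org/10.5678/example-two (accessed 2021)."
)

# Hypothetical helper: pull DOI-like strings out of free text.
# The pattern is a common heuristic, not a full DOI grammar.
extract_dois <- function(text) {
  pattern <- "10\\.[0-9]{4,9}/[^[:space:]\"<>]+"
  hits <- unlist(regmatches(text, gregexpr(pattern, text)))
  hits <- sub("[.,;)]+$", "", hits)  # drop trailing sentence punctuation
  unique(tolower(hits))
}

dois <- extract_dois(references)

# Stand-in for the result of a Wikidata lookup: one row per DOI, with NA
# where no item was found. The QID here is a placeholder, not a real item.
lookup <- data.frame(
  doi = dois,
  qid = c("Q_PLACEHOLDER", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: separate DOIs that already have an item from those
# that do not, so that only the latter are considered for creation.
split_by_match <- function(lookup) {
  found <- !is.na(lookup$qid)
  list(existing = lookup[found, ], missing = lookup[!found, ])
}

matched <- split_by_match(lookup)
```

Rows in matched$existing can be attached to the citing article with cites work (P2860), while rows in matched$missing are candidates for new items or manual review.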
For more examples and instructions, see the package’s README file.

How can I install it?

WikidataR (version 2.1.5) is available from CRAN (the Comprehensive R Archive Network), and can be installed by running the following command in the R console:

install.packages("WikidataR")

The development version can be installed from the GitHub repository using the install_github function in the devtools package:

library(devtools)
devtools::install_github("TS404/WikidataR")

Summing up

The revised WikidataR package is intended to work in combination with the data handling capabilities of R and other packages, to reduce the friction of acquiring, processing, contributing and publishing data to and from Wikidata.

Thanks to Wikicite for the eScholarship opportunity which enabled the development of the package.

Originally posted on Wikimedia Diff Blog June 3, 2021 by Thomas Shafee and Canley.