# Import the spreadsheet
library(readxl)
df <- read_excel("data/CRM_reports_2023-12-12.xlsx",
                 col_types = c("numeric", "text", "text",
                               "date", "text", "text", "text", "text",
                               "text"))
Developer Notes
This page is an exploration of Quarto, Zotero, and Citation Style Language (CSL) with R. There is an Excel file in the UM SharePoint that lists CRMD Reports located in Kent Hall 202. Sometimes students or the public express an interest in one of these reports. This page contains notes on prepping and presenting the data using the aforementioned tools. This represents a proof-of-concept for using reproducible research methods and open source software to support museum activities.
Excel to Zotero
The following code reads an Excel file that was in the UM’s SharePoint (Figure 1) into R with readxl (Wickham and Bryan 2023), uses janitor (Firke 2023) to modify field names in order to improve importing the table into Zotero, and uses the write_refs() function from synthesisr (Westgate and Grames 2020) to write the data frame to a .bib file. This .bib file was used to populate the UM CRM Reports Group Library on the UM’s Zotero account.
Import
Clean
Modify
df <- as.data.frame(df) # synthesisr requires a data frame
df <- janitor::clean_names(df) # zotero requires single word fields
df$date <- format(df$date, "%Y-%m-%d") # keeping as a date field threw errors so I converted to a string
Add
df$place <- "Las Cruces, NM"
df$institution <- "New Mexico State University"
df$archive <- "University Museum"
df$techreport_type <- "CRMD"
colnames(df)
[1] "report_number" "title" "author" "date"
[5] "key_words" "location" "abstract" "notes"
[9] "collections" "place" "institution" "archive"
[13] "techreport_type"
Zotero made the following changes when importing:
- collections was stored in Zotero’s extra field as tex.collections:
- keyword field was stored as Tags for that reference
- location was stored in Zotero’s Place field
- notes field is stored as a child note for that reference
- report_number was stored in Zotero’s extra field as tex.report_number:
# change report_number to number
colnames(df)[1] <- "number" # number is the bibtex field for report number
# change location to town_range
colnames(df)[6] <- "town_range"
# change keywords to tags
colnames(df)[5] <- "tags"
colnames(df)
[1] "number" "title" "author" "date"
[5] "tags" "town_range" "abstract" "notes"
[9] "collections" "place" "institution" "archive"
[13] "techreport_type"
Output to .bib
# This library has some nice functions for converting bibliographic formats
library(synthesisr)
write_refs(df, format = "bib", tag_naming = "synthesisr", file = "crm_output.bib")
Within the .bib file, synthesisr writes each entry type as @ARTICLE rather than @techreport, so this needs to be changed. The following performs a find and replace in base R.
# find and replace string in file
<- readLines("crm_output.bib")
bib_file <- gsub(pattern = "@ARTICLE", replace = "@techreport", bib_file)
bib_file writeLines(bib_file, con="crm_output.bib")
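As a quick sanity check, the same substitution can be exercised on a toy in-memory example before overwriting the real file (the entry key crm_001 and the title below are made up for illustration):

```r
# Minimal sketch: verify the @ARTICLE -> @techreport swap on toy data
bib_demo <- c("@ARTICLE{crm_001,", "  title = {Example Report},", "}")
bib_demo <- gsub(pattern = "@ARTICLE", replacement = "@techreport", bib_demo)
bib_demo[1]  # "@techreport{crm_001,"
stopifnot(!any(grepl("@ARTICLE", bib_demo, fixed = TRUE)))
```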
Parsing a Paginated API Call Using httr2
When building this page, I was at first stuck on the fact that the Zotero API limits responses to 100 items but the collection has more than 600. One easy workaround was to export the library from the Zotero desktop client; that way all of the items are exported to a single file. While this works, one would have to remember to re-export the library and re-render this page to reflect changes made to the library entries. By calling the Zotero API directly and retrieving .bib data, the page can reflect changes in the Zotero library each time it is rendered. Given that down the line we will likely modify the Zotero library periodically, the need to parse paginated API calls is likely to be a recurrent issue. These notes describe how I figured out how to page through the requests.
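The arithmetic behind the paging is worth sketching first. With a 100-item limit, a collection of more than 600 items needs seven requests, each advancing the start offset by 100. The 650 below is a made-up stand-in for the actual item count, and whether the API treats start as zero- or one-based is worth confirming; the sketch only shows the step size:

```r
# Hypothetical collection size; the text only says "more than 600"
n_items <- 650
limit   <- 100  # maximum items the Zotero API returns per request

# start offsets for each paged request: 0, 100, 200, ...
starts <- seq(0, n_items - 1, by = limit)
starts          # 0 100 200 300 400 500 600
length(starts)  # 7 requests to cover all 650 items
```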
Useful links:
- dev:web_api:v3:start [Zotero Documentation]. Zotero API v3 documentation.
- Perform requests iteratively, generating new requests from previous responses — req_perform_iterative • httr2. This page from the httr2 library documentation had an example that I replicated to make the working solution; I really just changed the URL and began tweaking arguments.
- Limit restricted to 99. This is a discussion thread from the Zotero Dev group. While the conversation did not identify a solution, it is some of the most focused discussion of the issue that I’ve found, and some of the ideas presented helped in tracking down the answer.
- Parse link URL from a response — resp_link_url • httr2. I was able to use this to pull out pieces of the response links, but I did not end up using them in the final solution.
Construct the Request
library(httr2)
The following chunk sets up a basic request and adds an argument to set a limit. Given how httr2 works (Wickham 2023), the query fields can be added as arguments to the function req_url_query() rather than needing to write them into the URL string itself. The req_throttle() call rate-limits the requests, here capping them at 10 requests per second so the script stays polite to the Zotero API.
Note that it was necessary to explicitly set the limit value in req_url_query(); otherwise each paged request only returns 25 records but advances 100 due to the offset declared below in iterate_with_offset().
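To make that concrete, here is a numbers-only sketch (no API calls, toy collection size) of what happens when the per-request limit stays at the API's default of 25 while the offset still advances by 100:

```r
total <- 300  # pretend collection size
step  <- 100  # offset advance per iteration

pages <- seq(0, total - 1, by = step)

# Records retrieved if the limit is left at the default of 25:
got_default  <- unlist(lapply(pages, function(s) s + seq_len(25)))
# Records retrieved with limit explicitly set to 100:
got_explicit <- unlist(lapply(pages, function(s) s + seq_len(100)))

length(got_default)   # 75  -> three-quarters of the records are silently skipped
length(got_explicit)  # 300 -> full coverage
```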
It seems like arguments that remain the same for each iteration should be defined in req_url_query(), while arguments that change each iteration should go in req_perform_iterative() using the helper function iterate_with_offset().
<- request("https://api.zotero.org/groups/5323184/items") |>
req req_throttle(10) |>
req_url_query(limit=100,
format="bibtex")
So we can see that this results in a GET request.
req
This returns a single request. We want several.
Iterating with Offset
The function iterate_with_offset() will allow us to work through the paginated results of the request. At first, I was defining the start and offset under the req_url_query() function, but I removed that and put those arguments under iterate_with_offset(). I think that is cleaner.
resps <- req_perform_iterative(req, iterate_with_offset("start", start = 1, offset = 100))
We can also loop through the URLs to confirm that the start and offset values iterate in successive requests.
Loop method #1
for (i in resps) {
print(i$url)
}
Loop method #2
for (i in 1:length(resps)) {
print(resps[[i]]$url)
}
The following loop works through the resps list and appends the contents of $body to a .bib file. Either loop method should work, I think; here I used the second just because.
for (i in 1:length(resps)) {
  cat(rawToChar(resps[[i]]$body), file = "crm_append.bib", append = TRUE)
}

# Equivalent with loop method #1 (not run here, since running both would append every entry twice):
# for (i in resps) {
#   cat(rawToChar(i$body), file = "crm_append.bib", append = TRUE)
# }
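One caveat with append = TRUE: re-running the chunk keeps appending to the file, so entries get duplicated. A sketch of a delete-first guard plus a rough entry count, using made-up stand-in responses rather than live API results:

```r
# Toy stand-ins for resps: each "response" carries raw BibTeX in $body
fake_resp <- function(txt) list(body = charToRaw(txt))
resps_demo <- list(fake_resp("@techreport{a,\n}\n"),
                   fake_resp("@techreport{b,\n}\n"))

out <- tempfile(fileext = ".bib")
if (file.exists(out)) file.remove(out)  # guard: start fresh each run

for (i in seq_along(resps_demo)) {
  cat(rawToChar(resps_demo[[i]]$body), file = out, append = TRUE)
}

# Rough entry count: one "@" record header per reference
sum(grepl("^@", readLines(out)))  # 2
```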