Developer Notes

Author

Nathan Craig

This page is an exploration of Quarto, Zotero, and Citation Style Language (CSL) with R. There is an Excel file in the UM SharePoint that lists CRMD Reports located in Kent Hall 202. Sometimes students or the public express an interest in one of these reports. This page contains notes on prepping and presenting the data using the aforementioned tools. This represents a proof-of-concept for using reproducible research methods and open source software to support museum activities.

Excel to Zotero

The workflow starts from an Excel file stored in the UM’s SharePoint (Figure 1). The code below reads the file into R with readxl (Wickham and Bryan 2023), uses janitor (Firke 2023) to clean the field names so the table imports more cleanly into Zotero, and uses the write_refs() function from synthesisr (Westgate and Grames 2020) to write the data frame to a .bib file. This .bib file was used to populate the UM CRM Reports Group Library on the UM’s Zotero account.

flowchart LR

Excel[Excel File] --> R[Read in R `readxl` \ncreate dataframe]
R --> Clean[Clean dataframe]
Clean --> Write[Write to .bib file]
Write -.-> Upload[(Upload to Zotero)]

Figure 1: Excel to Zotero workflow

Import

# Import the spreadsheet
library(readxl)
df <- read_excel("data/CRM_reports_2023-12-12.xlsx", 
    col_types = c("numeric", "text", "text", 
        "date", "text", "text", "text", "text", 
        "text"))

Clean

Modify

df <- as.data.frame(df) # synthesisr requires a data frame
df <- janitor::clean_names(df) # zotero requires single word fields
df$date <- format(df$date, "%Y-%m-%d") # keeping as a date field threw errors so I converted to a string
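
As a small illustration of the date conversion above, the sketch below formats a single date-time value into a plain string. read_excel() returns "date" columns as POSIXct, so a made-up POSIXct value stands in for one cell of df$date; the value and the variable names d and s are assumptions for the example only.

```r
# A made-up date-time value standing in for one cell of df$date
d <- as.POSIXct("2023-12-12 00:00:00", tz = "UTC")

# format() turns the date-time into a plain character string
s <- format(d, "%Y-%m-%d")

class(s)  # the result is no longer a date-time object
s
```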

Add

df$place <- "Las Cruces, NM"
df$institution <- "New Mexico State University"
df$archive <- "University Museum"
df$techreport_type <- "CRMD"
colnames(df)
 [1] "report_number"   "title"           "author"          "date"           
 [5] "key_words"       "location"        "abstract"        "notes"          
 [9] "collections"     "place"           "institution"     "archive"        
[13] "techreport_type"

Zotero made the following changes when importing:

  • collections was stored in Zotero’s Extra field as tex.collections:
  • key_words was stored as Tags for that reference
  • location was stored in Zotero’s Place field
  • notes was stored as a child note for that reference
  • report_number was stored in Zotero’s Extra field as tex.report_number:

Based on these results, three columns are renamed before export:

# change report_number to number
colnames(df)[1] <- "number" # number is the BibTeX field for report number
# change key_words to tags
colnames(df)[5] <- "tags"
# change location to town_range so it is not imported as Zotero's Place
colnames(df)[6] <- "town_range"

colnames(df)
 [1] "number"          "title"           "author"          "date"           
 [5] "tags"            "town_range"      "abstract"        "notes"          
 [9] "collections"     "place"           "institution"     "archive"        
[13] "techreport_type"
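
Renaming by position works for this spreadsheet, but it silently renames the wrong column if the column order ever changes. A minimal sketch of renaming by name instead is shown here; the demo data frame and its values are made up, and rename_map is a hypothetical helper name, not part of the original workflow.

```r
# Made-up one-row data frame with a few of the cleaned column names
demo <- data.frame(
  report_number = 1L,
  title = "Example report",
  key_words = "survey",
  location = "example location",
  stringsAsFactors = FALSE
)

# old name = new name
rename_map <- c(
  report_number = "number",
  key_words = "tags",
  location = "town_range"
)

# Find each old name's position, then overwrite it with the new name
idx <- match(names(rename_map), names(demo))
names(demo)[idx] <- unname(rename_map)

names(demo)
```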

Output to .bib

# This library has some nice functions for converting bibliographic formats
library(synthesisr)
write_refs(df, format = "bib", tag_naming = "synthesisr", file = "crm_output.bib")

Within the .bib file, synthesisr writes the entry type as @ARTICLE rather than @techreport, so it needs to be changed. The following performs a find and replace in base R.

# find and replace string in file
bib_file <- readLines("crm_output.bib")
bib_file <- gsub(pattern = "@ARTICLE", replacement = "@techreport", x = bib_file)
writeLines(bib_file, con="crm_output.bib")
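
The replacement above changes every occurrence of @ARTICLE, including any that might appear inside a title or note. A slightly more defensive sketch anchors the pattern to the start of the line, where the entry type sits. The lines below are a made-up stand-in for the contents of crm_output.bib, and the citekey example2023 is hypothetical.

```r
# Made-up lines standing in for readLines("crm_output.bib")
bib_lines <- c(
  "@ARTICLE{example2023,",
  "  title = {An example report},",
  "  note = {Mentions @ARTICLE in running text}",
  "}"
)

# "^" anchors the match to the start of each line, so only the
# entry type is replaced and the note on line 3 is left alone
fixed <- sub("^@ARTICLE", "@techreport", bib_lines)

fixed
```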

Parsing Paginated API Calls Using httr2

When building this page, I was at first stuck on the fact that the Zotero API limits each request to 100 items, but the collection has more than 600. One easy workaround was to export the library from the Zotero desktop client. That way all of the items are exported to a single file. While this works, one would have to remember to re-export the library and re-render this page to reflect changes made to the library entries. By calling the Zotero API directly and retrieving .bib data, the page reflects the latest changes in the Zotero library every time it is rendered. Given that we will likely modify the Zotero library periodically, the need to parse paginated API calls is likely to be a recurrent issue. These notes describe how I figured out how to page through the requests.

Construct the Request

library(httr2)

The following chunk sets up a basic request and sets a limit. Given how httr2 works (Wickham 2023), the query fields can be added as arguments to req_url_query() rather than being written into the URL string itself.

req_throttle() caps the request rate. Its rate argument is the maximum number of requests per second, so req_throttle(10) allows at most 10 requests per second.

Note that it was necessary to set the limit value explicitly in req_url_query(). Otherwise each paged request returns only 25 records (the Zotero API default) while the start value still advances by 100 because of the offset declared below in iterate_with_offset().
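
The effect of a mismatch between page size and offset can be checked with a little arithmetic. The sketch below uses a hypothetical library of 600 records and counts how many come back when the start value advances by 100 each time; total, step, and records_returned() are names invented for this illustration.

```r
total <- 600L   # hypothetical library size
step  <- 100L   # how far the start value advances between requests

# Start values for each request: 0, 100, ..., 500
starts <- seq(0L, total - 1L, by = step)

# Total records returned across all requests for a given page size
records_returned <- function(limit) {
  sum(pmin(limit, total - starts))
}

records_returned(25)   # page size left at the API default
records_returned(100)  # page size matched to the step
```

With these made-up numbers, a page size of 25 returns only 150 of the 600 records, while a page size of 100 returns all of them.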

Note

It seems like arguments that remain the same for each iteration should be defined under req_url_query(), while arguments that change each iteration should go under req_perform_iterative() using the helper function iterate_with_offset().

req <- request("https://api.zotero.org/groups/5323184/items") |>
  req_throttle(10) |>
  req_url_query(limit=100,
                format="bibtex")

Printing the request shows that it is a GET request.

req

This returns a single request. We want several.

Iterating with Offset

The function iterate_with_offset() will allow us to work through the paginated results of the request. At first, I was defining the start and offset under the req_url_query() function, but I removed that and put those arguments under iterate_with_offset(). I think that is cleaner. Note that the Zotero API’s start parameter is a zero-based index, so the first page begins at start = 0.

# Zotero's start index is zero-based; start = 1 would skip the first item
resps <- req_perform_iterative(req, iterate_with_offset("start", start = 0, offset = 100))
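
To make the paging logic concrete without touching the network, here is a small base R simulation of offset pagination. It assumes a zero-based start index, as described in the Zotero API documentation, and stops when a page comes back empty. The items vector and the fetch_page() function are made up for this sketch and are not part of httr2 or the Zotero API.

```r
# Stand-in for the remote library: 250 made-up item ids
items <- sprintf("item%03d", 1:250)

# Mimic one API page: `start` is a zero-based index, `limit` the page size
fetch_page <- function(start, limit = 100) {
  if (start >= length(items)) return(character(0))
  items[(start + 1):min(start + limit, length(items))]
}

start <- 0
pages <- list()
repeat {
  page <- fetch_page(start)
  if (length(page) == 0) break   # an empty page means nothing is left
  pages[[length(pages) + 1]] <- page
  start <- start + 100           # advance by the page size
}

all_items <- unlist(pages)
length(pages)      # number of non-empty pages
length(all_items)  # total items retrieved
```

In httr2 itself, iterate_with_offset() accepts a resp_complete callback that can play the role of the empty-page check, so the iteration stops once a response has no more records.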

We can also loop through the URLs to confirm that the start value advances by the offset in successive requests.

Loop method #1

for (i in resps) {
  print(i$url)
}

Loop method #2

for (i in seq_along(resps)) {
  print(resps[[i]]$url)
}

The following call loops through the resps list and appends the contents of $body to a .bib file. Either loop method works; here I used the second. Only one of the two loops should be run, since running both writes every entry twice. Because append = TRUE adds to any existing file, the old file is removed first so that re-rendering the page does not duplicate entries.

unlink("crm_append.bib") # start from an empty file on each render
for (i in seq_along(resps)) {
  cat(rawToChar(resps[[i]]$body), file = "crm_append.bib", append = TRUE)
}
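
An alternative that avoids append = TRUE altogether is to collect the text first and write it in a single call, so repeated runs always produce the same file. The sketch below uses made-up raw bodies in place of real responses and writes to a temporary file; bodies, bib_text, and out_file are names chosen for this example.

```r
# Made-up raw bodies standing in for resps[[i]]$body
bodies <- list(
  charToRaw("@techreport{example_a,\n}\n"),
  charToRaw("@techreport{example_b,\n}\n")
)

# Convert each raw body to text, then join everything into one string
bib_text <- paste(vapply(bodies, rawToChar, character(1)), collapse = "")

# A temporary file is used here; the real target would be the .bib file
out_file <- tempfile(fileext = ".bib")

# cat() without append = TRUE overwrites, so a second run
# leaves the file unchanged rather than doubling its contents
cat(bib_text, file = out_file)
cat(bib_text, file = out_file)

length(readLines(out_file))
```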

References

Firke, Sam. 2023. “Janitor: Simple Tools for Examining and Cleaning Dirty Data.” https://CRAN.R-project.org/package=janitor.
Westgate, Martin, and Eliza Grames. 2020. “Synthesisr: Import, Assemble, and Deduplicate Bibliographic Datasets.” https://CRAN.R-project.org/package=synthesisr.
Wickham, Hadley. 2023. “Httr2: Perform HTTP Requests and Process the Responses.” https://CRAN.R-project.org/package=httr2.
Wickham, Hadley, and Jennifer Bryan. 2023. “Readxl: Read Excel Files.” https://CRAN.R-project.org/package=readxl.