This repository was archived by the owner on Jul 31, 2025. It is now read-only.

Data Gathering Methodology

Bree Norlander edited this page Jan 3, 2020 · 26 revisions

DRAFT

Intro

This document outlines and explains the steps to gather data in support of our work on data publication practices of public libraries. In short, we have collected a list of city data portals that may or may not contain data published by public libraries. The goal of this work is to systematically collect that data and compare data publishing practices across cities.

Defining Variables for Data Collection

The first step in data collection is to review the variables (or aspects) of each city portal that appear in our dataset (cities were gathered from the US City Open Data Census). The table below defines each variable. You can also see an example of what an entry for a data portal looks like in this csv.

Variable Code | Variable Name | Variable Description
City | City | City Name (spaces allowed)
State | State | 2-Character State Code
Portal_Url | Open Data Portal URL | URL for the city's open data portal. There may be more than one. If this is the case, add another row.
Software | Software Used to Build the Portal | Common answers are ArcGIS, Socrata, CKAN, Custom
TotalDataSetsAvailable | Total Datasets Available | Total number of datasets available on this portal.
NotesDataSetsAvailable | Notes About the Datasets Available | Free form space to write notes about the total number of datasets available.
CountVettedPublicLibData | Count of Vetted Public Library Datasets | Total number of datasets about public libraries. Methodology described in detail below.
TypePLDataAvailable | Type of Public Library Data | Free form list of types of data available about public libraries.
LibraryDataCategories | Library Data Categories | Choose one or more of the following (definitions below): Catalog and Circulation, Events, Facilities, Financial, Geospatial, Human Resources, Patrons, Public Records, Technology Offerings
DateLibDataLastUpdated | Date that Library Data was Last Updated | The most recent update date associated with the public library data.
Notes | Notes | Free form notes about any of the data gathering.
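
To catch entry mistakes early, a record can be checked against the variable definitions in the table above. The following is a minimal sketch, not part of the original workflow. It assumes each row is read as a dictionary of strings (as `csv.DictReader` would produce) and that multiple categories are separated by semicolons; the separator and the function name `check_row` are our own assumptions.

```python
import re

# Column names taken from the "Variable Code" column of the table above.
COLUMNS = [
    "City", "State", "Portal_Url", "Software", "TotalDataSetsAvailable",
    "NotesDataSetsAvailable", "CountVettedPublicLibData", "TypePLDataAvailable",
    "LibraryDataCategories", "DateLibDataLastUpdated", "Notes",
]

# Category names listed in the LibraryDataCategories row of the table.
CATEGORIES = {
    "Catalog and Circulation", "Events", "Facilities", "Financial",
    "Geospatial", "Human Resources", "Patrons", "Public Records",
    "Technology Offerings",
}


def check_row(row, separator=";"):
    """Return a list of problems found in one portal record (a dict of strings)."""
    problems = []
    missing = [c for c in COLUMNS if c not in row]
    if missing:
        problems.append("missing columns: " + ", ".join(missing))
    if not re.fullmatch(r"[A-Z]{2}", row.get("State", "")):
        problems.append("State should be a 2-character code")
    # Counts may be left blank while work is in progress, but not non-numeric.
    for field in ("TotalDataSetsAvailable", "CountVettedPublicLibData"):
        value = row.get(field, "")
        if value and not value.isdigit():
            problems.append(field + " should be a whole number")
    # The separator between categories is an assumption of this sketch.
    raw = row.get("LibraryDataCategories", "")
    for name in (part.strip() for part in raw.split(separator)):
        if name and name not in CATEGORIES:
            problems.append("unknown category: " + name)
    return problems
```

An empty list means no problems were found; anything else is a prompt to re-check the row by hand.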

Collecting Data

After reviewing and familiarizing yourself with the variables, follow these steps to begin collecting data:

  1. Download the blank CSV with city entries here
  2. The above CSV corresponds to the cities found in this dataset. This is the dataset that contains potential links to the city data portals for which you will be gathering data.
  3. For each data repository, record the variables described above.

Here are some helpful tips on how to identify data repositories and find information about them.

Socrata

How do I know it's a Socrata platform? Many Socrata platforms will indicate this by placing a Socrata logo in the footer:

socrata in footer

If the footer does not indicate the platform, you may have to do some digging. If there is a developers page, that will likely indicate the platform:

screenshot of developer resources

You might also have to click through to a dataset (or category of datasets) to find the logo:

socrata logo in footer

Find the city portal:

screenshot of Socrata portal

If there is a "catalog" link or box, choose that:

socrata catalog link

You may still need to clear categories as mentioned in the next step (this will depend on whether the city has added default categories to the catalog).

If there is no "catalog" option, click on any category of data. In this example, I clicked on "Homeless Counts from HUD." You can then clear the categories to search the entire portal by clicking on the "Clear All" button:

screenshot of clear all button

According to Socrata support (https://support.socrata.com/hc/en-us/articles/225465147-Catalog-Search-FAQ), "We also use word stemming, which automatically includes common forms of the same root term. For example, searching for "educational," will also match “education”, “educating” and “educate”." So you can now search for "library" to find any library-related datasets (note: searching for "libr" or "libr*" will NOT work):

Baton Rouge Socrata Search

You will want to filter your search to just "datasets":

screenshot data filter

dataset filter screenshot

Here I found 4 library datasets. However, upon inspection, 1 of these datasets is not specific to libraries; it just contains some data about libraries. In the vetting process, we will eliminate this as a library dataset:

screenshot not library specific data
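
If you prefer to repeat the same search outside the browser, the sketch below builds a search URL for what we assume is Socrata's public catalog API (`api.us.socrata.com/api/catalog/v1`, with `domains`, `q`, `only`, and `limit` parameters). This is not part of the original methodology. The hostname `data.example.gov` in the usage note is a placeholder, and the parameter names should be confirmed against Socrata's documentation before relying on the results.

```python
from urllib.parse import urlencode

# Assumed endpoint: Socrata's public catalog search API.
CATALOG_ENDPOINT = "https://api.us.socrata.com/api/catalog/v1"


def socrata_search_url(domain, query="library", limit=100):
    """Build a catalog search URL restricted to one portal and to datasets."""
    params = {
        "domains": domain,   # hostname of the city portal
        "q": query,          # whole word, since "libr" and "libr*" do not work
        "only": "datasets",  # mirrors the "datasets" filter in the portal UI
        "limit": limit,      # maximum number of results to request
    }
    return CATALOG_ENDPOINT + "?" + urlencode(params)
```

For example, `socrata_search_url("data.example.gov")` produces a URL you can open in a browser or fetch with any HTTP client. The results still need the manual vetting described above.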

CKAN

How do you know it's a CKAN platform? Scroll to the bottom of the open data portal homepage and look in the footer. If you see this, you know it's CKAN:

ckan logo from screenshot

You may have to dig a little. In this example, when I hover over the link "API Docs" I can see that the URL it points to indicates CKAN:

ckan homepage screenshot Philly

Here is an example CKAN homepage:

ckan homepage screenshot albuquerque

You can then search for "library." You will need to use your judgment in determining what is a dataset. Each dataset will have a format type. We're looking for things like csv, xls, kml, geoJSON, JSON... We are not looking for RSS feeds or applications:

search for library in ckan
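
The format check above can be sketched in code. This is only an illustration: it assumes search results shaped like the packages returned by CKAN's `package_search` action, where each package has a `resources` list and each resource has a `format` field. The function names and the sample package names in the usage note are hypothetical.

```python
# Formats named above as data files; extend this set as needed.
DATA_FORMATS = {"csv", "xls", "kml", "geojson", "json"}


def looks_like_dataset(package):
    """True if any resource in a CKAN package has a data file format."""
    formats = {
        (res.get("format") or "").strip().lower()
        for res in package.get("resources", [])
    }
    return bool(formats & DATA_FORMATS)


def filter_datasets(packages):
    """Keep only packages that carry at least one data-format resource."""
    return [p for p in packages if looks_like_dataset(p)]
```

A package whose only resource is an RSS feed, such as `{"name": "library-news", "resources": [{"format": "RSS"}]}`, is dropped, while one with a CSV or KML resource is kept. Borderline cases still call for your judgment.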

DKAN

How do you know it's a DKAN platform? Scroll to the bottom of the open data portal homepage and look in the footer. If you see this, you know it's DKAN:

You may have to dig a little. In this example, when I hover over the link "free open source software" I can see that the URL it points to indicates DKAN:

DKAN platform screenshot

Search for "library" and look specifically for Content Type "dataset":

content type dataset filter

ArcGIS

Most of the time you will know it's an ArcGIS site by looking at the URL:

arcgis homepage screenshot

In this case, I had to navigate all the way to a dataset and click the dropdown for "API" before I found any indication of ArcGIS (though you will start to recognize the platforms after several encounters):

API dropdown choices

You will want to search for data on this platform. Look for something like this:

choose data dropdown

data tab

You can search for both "library" and "libraries" to make sure you're capturing all the relevant datasets (do not search for "libr" or "libr*"):

library search on arcgis

In the above search, there is 1 dataset that is not specific to libraries and will not count in the vetted libraries total:

screenshot not library specific data
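
Because the "library" and "libraries" searches can return overlapping results, the two result lists should be combined without double counting before vetting. Here is a minimal sketch, assuming each result is recorded as a dictionary with a `url` field that identifies the dataset; the field name and the function name are our own assumptions.

```python
def merge_results(*result_lists, key="url"):
    """Combine search results, keeping the first copy of each dataset."""
    seen = set()
    merged = []
    for results in result_lists:
        for item in results:
            ident = item[key]
            if ident not in seen:
                seen.add(ident)
                merged.append(item)
    return merged
```

The merged list is the starting point for vetting; datasets that are not specific to libraries are still removed by hand.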

Custom

Custom data portals come in many different styles. Search around the homepage, URL, and developer resources (if available) for any indication of the platforms listed above. If you find nothing, it's probably a custom platform. Here are two examples:

Custom data portal homepage

Oklahoma City Data Portal

Use your search skills acquired from the portals above to determine if there are any library datasets available on the portal.
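
The URL clues described in the platform sections above can be collected into a rough first guess. The sketch below is an illustration only: the substrings are our own assumptions rather than an official list of platform fingerprints, and a result of "Custom" simply means no clue was found, so the portal still needs a manual look.

```python
# Illustrative substrings only; checked in order against all URLs supplied.
HINTS = [
    ("ArcGIS", "arcgis"),
    ("Socrata", "socrata"),
    ("DKAN", "dkan"),
    ("CKAN", "ckan"),
]


def guess_software(urls):
    """Guess the portal software from the portal URL and any footer/API links."""
    text = " ".join(urls).lower()
    for name, needle in HINTS:
        if needle in text:
            return name
    return "Custom"
```

Pass in the portal URL together with any link targets you found while hovering over footer or developer links, for example `guess_software(["https://data.example.gov", "https://docs.ckan.org/en/latest/api/"])`, where the first hostname is a placeholder.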

Others

This is by no means an exhaustive list of platforms. We have seen datasets hosted on Kaggle, data.world, OpenDataSoft, and more. These are all legitimate for our dataset and will require your judgment in deciding how best to search and label.

From JCDL Paper:

For each city’s data portal that was still in operation or that we were able to find via a Google search, we gathered descriptive information about the repository software used to publish the data, the total number of datasets accessible in each portal (we limited Socrata-hosted sites to "type = Datasets"), and searched specifically for data published by or about public libraries. Within each portal we also searched specifically for library-related datasets using queries such as "Library" or "Libr*", and also made use of faceted browsing features such as "categories" or "Data Owner" to locate relevant public library data.

For library related datasets to be considered relevant to our analysis we examined metadata, contents (e.g. variable names), and dataset descriptions. We excluded datasets that contained the keyword "library" but were, in reality, about broader topics such as Boston’s "CityScore" dataset which provides "metrics on overall city health based on work done across all facets of the City of Boston" [17]. Other false positives were removed based on the identified publisher. For example, the City of Seattle has a number of datasets that are published to the portal by users who have performed a specific analysis of that public library’s data, but were not officially published by the Seattle Public Library.
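
The two exclusion rules above (broader-topic datasets and unofficial publishers) could be turned into a first-pass check that flags search hits for closer review. This sketch is our own illustration and was not used to produce the results in the paper. The field names `title` and `publisher`, the function name, and the idea of keeping a set of official publisher names are all assumptions.

```python
def review_flags(dataset, official_publishers):
    """List reasons a search hit may be a false positive and needs review."""
    flags = []
    # A hit whose title never mentions libraries may be about a broader topic.
    if "librar" not in dataset.get("title", "").lower():
        flags.append("title does not mention libraries")
    # Datasets uploaded by individual users are not official publications.
    if dataset.get("publisher") not in official_publishers:
        flags.append("publisher is not on the official list")
    return flags
```

A flagged dataset is not automatically excluded; the flags only indicate where to examine the metadata, contents, and description more closely.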

Choosing Library Data Categories

Below is the original list of library dataset categories (continue scrolling for updated list):

Code | Definition | Examples of Datasets
Catalog and Circulation | Information about the holdings and lending practices of a library. These will often be annual or quarterly datasets released by system or branches. | Physical Item Circulation, Digital/Electronic Item Circulation, Materials Counts, Digital Holdings Counts, Inter-library Loan Counts/Circulation, Databases Count, Serials Count, Electronic Content Retrievals
Events | Information about scheduling, programming, or regularly offered services for outreach. | Programs Offered, Program Counts, Program Attendance, Meeting Counts, Meeting Attendance
Facilities | Information about the operation of a library and branch system, including electricity use, operating hours, etc. | Count of Libraries (Central & Branch), Count of Bookmobiles, Count of Facilities, Types of Facilities, Square footage, Services offered / operating hours (IMLS records annual hours open)
Financial | Information about revenue, expenditures, etc. Does not include information about staff salary. | (none listed)
Geospatial | Geographic or mapping information, including shapefiles | Library Locations, Facilities Locations
Human Resources | Information about staff, payroll, hiring, and benefits | Salaries & Benefits, Staffing Count
Patrons | Information about the use of library facilities, lending practices, etc. | Door Counts, Patrons by Location (Zip Code)
Public Records | The library's response to public record requests | FOIA / Public Records Requests
Technology Offerings | Information about technology that is available to patrons or in use by the library, including laptops and public terminals | Technology Use, Computer Usage, Technology Counts (# computer stations, laptops, wifi hotspot lending, mobile phone lending, etc.)

Updated Library Categories December 2019

Code | Definition | Examples of Datasets
Catalog & Circulation | Information about the holdings and lending practices of a library. These will often be annual or quarterly datasets released by system or branches. | Physical Item Circulation, Digital/Electronic Item Circulation, Materials Counts, Digital Holdings Counts, Inter-library Loan Counts/Circulation, Databases Count, Database Usage, Serials Count, Electronic Content Retrievals
Events & Outreach | Information about scheduling, programming, or regularly offered services for outreach. | Programs Offered, Program Counts, Program Attendance, Meeting Counts, Meeting Attendance, Librarian Outreach
Facilities & Geospatial | Information about the operation of a library and branch system including electricity use, operating hours, library website/social media usage, wi-fi usage, generator availability, weather and air quality at a facility, geographic or mapping information (including shapefiles) | Count of Libraries (Central & Branch), Count of Bookmobiles, Count of Facilities, Types of Facilities, Square footage, Services offered / operating hours (IMLS records annual hours open), Library Website Analytics, Library Social Media Usage, Weather, Air Quality, Library Locations, Facilities Locations, Rooms Available, Wi-fi Usage, Bookmobile Routes, Power Backups (e.g. generators)
Financial | Information about revenue, expenditures, etc. Does not include information about staff salary (see Human Resources). | (none listed)
Human Resources | Information about paid and unpaid staff, payroll, hiring, and benefits | Salaries & Benefits, Staffing Count, Volunteer Hours, Library Board Members
Patrons | Information about the use of library facilities, lending practices, etc. | Door Counts, Patrons by Location (Zip Code), Reference Interactions
Public Records | The library's response to public record requests | FOIA / Public Records Requests
Technology Offerings | Information about technology that is available to patrons or in use by the library, including laptops and public terminals. (Note that wi-fi usage in the library is in Facilities & Geospatial while hotspot lending is included in Technology Offerings.) | Technology Usage, Computer Usage, Technology Counts (# computer stations, laptops, wifi hotspot lending, mobile phone lending, etc.), Software Availability/Usage
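
Records coded with the original category names can be brought in line with the December 2019 names using a simple lookup. The sketch below is an illustration under one assumption: that a name-for-name mapping is an acceptable first pass. Because some dataset types were also reassigned (for example, wi-fi usage now belongs to Facilities & Geospatial), re-coded records should still be spot-checked. The function name is hypothetical.

```python
# Original category name -> December 2019 category name, per the two tables.
UPDATED_CATEGORY = {
    "Catalog and Circulation": "Catalog & Circulation",
    "Events": "Events & Outreach",
    "Facilities": "Facilities & Geospatial",
    "Geospatial": "Facilities & Geospatial",
    "Financial": "Financial",
    "Human Resources": "Human Resources",
    "Patrons": "Patrons",
    "Public Records": "Public Records",
    "Technology Offerings": "Technology Offerings",
}


def update_categories(old_names):
    """Map original category names to the updated names, dropping duplicates."""
    updated = []
    for name in old_names:
        new_name = UPDATED_CATEGORY[name]  # raises KeyError on an unknown name
        if new_name not in updated:
            updated.append(new_name)
    return updated
```

A record coded as both Facilities and Geospatial collapses to a single Facilities & Geospatial entry.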