Data Gathering Methodology
DRAFT
This document outlines and explains the steps to gather data in support of our work on data publication practices of public libraries. In short, we have collected a list of city data portals that may or may not contain data published by public libraries. The goal of this work is to systematically collect that data and compare data publishing practices across cities.
The first step in data collection is to review the variables (or aspects) of each city portal that appear in our dataset (cities were gathered from the US City Open Data Census). The table below defines each variable. You can also see an example of what an entry for a data portal looks like in this csv.
| Variable Code | Variable Name | Variable Description |
|---|---|---|
| City | City | City Name (spaces allowed) |
| State | State | 2-Character State Code |
| Portal_Url | Open Data Portal URL | URL for the city's open data portal. There may be more than one. If this is the case, add another row. |
| Software | Software Used to Build the Portal | Common answers are ArcGIS, Socrata, CKAN, Custom |
| TotalDataSetsAvailable | Total Datasets Available | Total number of datasets available on this portal. |
| NotesDataSetsAvailable | Notes About the Datasets Available | Free form space to write notes about the total number of datasets available. |
| CountVettedPublicLibData | Count of Vetted Public Library Datasets | Total number of datasets about public libraries. Methodology described in detail below. |
| TypePLDataAvailable | Type of Public Library Data | Free form list of types of data available about public libraries. |
| LibraryDataCategories | Library Data Categories | Choose one or more of the following (definitions below): Catalog and Circulation, Events, Facilities, Financial, Geospatial, Human Resources, Patrons, Public Records, Technology Offerings |
| DateLibDataLastUpdated | Date that Library Data was Last Updated | The most recent update date associated with the public library data. |
| Notes | Notes | Free form notes about any of the data gathering. |
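
As a rough illustration of how one row of the CSV could be represented and sanity-checked, here is a minimal Python sketch. The field names follow the Variable Code column above. The `PortalRecord` class, the `validate` helper, the default values, and the semicolon used to separate multiple categories are all assumptions made for this example; they are not part of the project's tooling.

```python
from dataclasses import dataclass

# Category names as listed for the LibraryDataCategories variable.
LIBRARY_DATA_CATEGORIES = {
    "Catalog and Circulation", "Events", "Facilities", "Financial",
    "Geospatial", "Human Resources", "Patrons", "Public Records",
    "Technology Offerings",
}


@dataclass
class PortalRecord:
    """One row of the data-gathering CSV (hypothetical helper class)."""
    City: str
    State: str
    Portal_Url: str
    Software: str
    TotalDataSetsAvailable: int
    NotesDataSetsAvailable: str = ""
    CountVettedPublicLibData: int = 0
    TypePLDataAvailable: str = ""
    LibraryDataCategories: str = ""  # assumption: categories separated by ";"
    DateLibDataLastUpdated: str = ""
    Notes: str = ""


def validate(record: PortalRecord) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if len(record.State) != 2 or not record.State.isalpha():
        problems.append("State should be a 2-character state code")
    if record.TotalDataSetsAvailable < 0 or record.CountVettedPublicLibData < 0:
        problems.append("Dataset counts should not be negative")
    chosen = [c.strip() for c in record.LibraryDataCategories.split(";") if c.strip()]
    unknown = [c for c in chosen if c not in LIBRARY_DATA_CATEGORIES]
    if unknown:
        problems.append("Unknown categories: " + ", ".join(unknown))
    return problems
```

A check like this only catches formatting slips (a spelled-out state name, a misspelled category). It does not replace the manual review described in the rest of this document.
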
After reviewing and familiarizing yourself with the variables, follow these steps to begin collecting data:
- Download the blank CSV with city entries here
- The above CSV corresponds to the cities found in this dataset. This is the dataset that contains potential links to the city data portals for which you will be gathering data.
- For each data repository, record the variables described above
Here are some helpful tips on how to identify data repositories and find information about them.
How do I know it's a Socrata platform? Many Socrata platforms will indicate this by placing a Socrata logo in the footer:

If the footer does not indicate the platform, you may have to do some digging. If there is a developers page, that will likely indicate the platform:

You might also have to click through to a dataset (or category of datasets) to find the logo:

Find the city portal:

If there is a "catalog" link or box, choose that:

You may still need to clear categories as mentioned in the next step (this will depend on whether the city has added default categories to the catalog).
If there is no "catalog" option, click on any category of data. In this example, I clicked on "Homeless Counts from HUD." You can then clear the categories to search the entire portal by clicking on the "Clear All" button:

According to Socrata support (https://support.socrata.com/hc/en-us/articles/225465147-Catalog-Search-FAQ) "We also use word stemming, which automatically includes common forms of the same root term. For example, searching for "educational," will also match “education”, “educating” and “educate”." So you can now search for "library" to find any library related datasets (note: searching for libr or libr* will NOT work):

You will want to filter your search to just "datasets":

Here I found 4 library datasets. However, upon inspection, 1 of these datasets is not specific to libraries; it just contains some data about libraries. In the vetting process, we will not count it as a library dataset:

How do you know it's a CKAN platform? Scroll to the bottom of the open data portal homepage and look in the footer. If you see this, you know it's CKAN:

You may have to dig a little. In this example, when I hover over the link "API Docs" I can see the url it takes me to indicates CKAN:

Here is an example CKAN homepage:

You can then search for "library." You will need to use your judgment in determining what is a dataset. Each dataset will have a format type. We're looking for things like csv, xls, kml, geoJSON, JSON... We are not looking for RSS feeds or applications:
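The format check described above can be sketched in a few lines of Python. The set of accepted formats below only covers the examples named in this document ("things like csv, xls, kml, geoJSON, JSON"), so treat it as a starting point and not as a complete list. The function name `has_data_format` is hypothetical.

```python
# Formats named in this document as counting toward a dataset.
# Assumption: the list is illustrative and may need extending (e.g. other spreadsheet formats).
DATA_FORMATS = {"csv", "xls", "kml", "geojson", "json"}


def has_data_format(format_labels):
    """Return True if at least one listed format label is a recognized data format.

    Labels are compared case-insensitively, so "CSV" and "GeoJSON" both match.
    RSS feeds and applications never match because they are not in DATA_FORMATS.
    """
    normalized = {label.strip().lower() for label in format_labels}
    return bool(normalized & DATA_FORMATS)
```

For example, `has_data_format(["CSV", "RSS"])` is true because one of the two labels is a data format, while `has_data_format(["RSS"])` is false. An entry that fails this check still deserves a quick look before it is excluded, since format labels on portals are not always filled in consistently.
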

How do you know it's a DKAN platform? Scroll to the bottom of the open data portal homepage and look in the footer. If you see this, you know it's DKAN:
You may have to dig a little. In this example, when I hover over the link "free open source software" I can see the url it takes me to indicates DKAN:

Search for "library" and look specifically for Content Type "dataset":

Most of the time you will know it's an ArcGIS site by looking at the URL:

In this case, I had to navigate all the way to a dataset and click the dropdown for "API" before I found any indication of ArcGIS (though you will start to recognize the platforms after several encounters):

You will want to search for data on this platform. Look for something like this:


You can search for both "library" and "libraries" to make sure you're capturing all the relevant datasets (do not search for "libr" or "libr*"):

In the above search, there is 1 dataset that is not specific to libraries and will not count in the vetted libraries total:

Custom data portals come in many different styles. Search around the homepage, url, and developer resources (if available) for any indication of platforms listed above. If you find nothing, it's probably a custom platform. Here are two examples:


Use the search skills you acquired from the portals above to determine whether any library datasets are available on the portal.
This is by no means an exhaustive list of platforms. We have seen datasets hosted on Kaggle, data.world, OpenDataSoft, and more. These are all legitimate for our dataset and will require your judgment in deciding how best to search and label.
For each city’s data portal that was still in operation or that we were able to find via a Google search, we recorded the repository software used to publish the data and the total number of datasets accessible in the portal (we limited Socrata-hosted sites to "type = Datasets"). Within each portal we then searched for data published by or about public libraries, using queries such as "Library" or "Libr*" and faceted browsing features such as "categories" or "Data Owner".
To decide whether a library-related dataset was relevant to our analysis, we examined its metadata, contents (e.g. variable names), and description. We excluded datasets that contained the keyword "library" but were, in reality, about broader topics, such as Boston’s "CityScore" dataset, which provides "metrics on overall city health based on work done across all facets of the City of Boston" [17]. Other false positives were removed based on the identified publisher. For example, the City of Seattle portal has a number of datasets that were published by users who had performed a specific analysis of the Seattle Public Library’s data, but that were not officially published by the Seattle Public Library.
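The two kinds of false positive described above (keyword matches on broader datasets, and datasets from unofficial publishers) can be pre-flagged before the manual review. The sketch below is one possible heuristic and not the procedure we followed: the `Candidate` fields, the `triage_flags` function, the `"librar"` keyword stem, and the rule "keyword must appear in the title or a variable name" are all assumptions made for illustration. A flagged dataset still needs a human decision.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Candidate:
    """A search hit awaiting vetting (hypothetical structure)."""
    title: str
    description: str
    publisher: str
    column_names: List[str] = field(default_factory=list)


def triage_flags(candidate, official_publishers, keyword="librar"):
    """Return reasons a reviewer should look closely at this candidate.

    An empty list means no flag was raised; it does not mean the dataset is vetted.
    """
    flags = []
    official = {p.lower() for p in official_publishers}
    if candidate.publisher.lower() not in official:
        flags.append("publisher not on the official list")
    in_title = keyword in candidate.title.lower()
    in_columns = any(keyword in name.lower() for name in candidate.column_names)
    if not (in_title or in_columns):
        # The keyword may appear only in the description, as with broad city-wide datasets.
        flags.append("keyword missing from title and variable names")
    return flags
```

The list of official publishers would have to be compiled per city by the person doing the vetting.
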
Below is the original list of library dataset categories (continue scrolling for updated list):
| Code | Definition | Examples of Datasets |
|---|---|---|
| Catalog and Circulation | Information about the holdings and lending practices of a library. These will often be annual or quarterly datasets released by system or branches. | Physical Item Circulation, Digital/Electronic Item Circulation, Materials Counts, Digital Holdings Counts, Inter-library Loan Counts/Circulation, Databases Count, Serials Count, Electronic Content Retrievals |
| Events | Information about scheduling, programming, or regularly offered services for outreach. | Programs Offered, Program Counts, Program Attendance, Meeting Counts, Meeting Attendance |
| Facilities | Information about the operation of a library and branch system, including electricity use, operating hours, etc. | Count of Libraries (Central & Branch), Count of Bookmobiles, Count of Facilities, Types of Facilities, Square footage, Services offered / operating hours (IMLS records annual hours open) |
| Financial | Information about revenue, expenditures, etc. Does not include information about staff salary. | |
| Geospatial | Geographic or mapping information - including shapefiles | Library Locations, Facilities Locations |
| Human Resources | Information about staff, payroll, hiring, and benefits | Salaries & Benefits, Staffing Count |
| Patrons | Information about the use of library facilities, lending practices, etc. | Door Counts, Patrons by Location (Zip Code) |
| Public Records | The library's response to public records requests | FOIA / Public Records Requests |
| Technology Offerings | Information about technology that is available to patrons or in use by library - including laptops and public terminals | Technology Use, Computer Usage, Technology Counts (# computer, stations, laptops, wifi hotspot lending, mobile phone lending, etc.) |

Below is the updated list of library dataset categories:

| Code | Definition | Examples of Datasets |
|---|---|---|
| Catalog & Circulation | Information about the holdings and lending practices of a library. These will often be annual or quarterly datasets released by system or branches. | Physical Item Circulation, Digital/Electronic Item Circulation, Materials Counts, Digital Holdings Counts, Inter-library Loan Counts/Circulation, Databases Count, Database Usage, Serials Count, Electronic Content Retrievals |
| Events & Outreach | Information about scheduling, programming, or regularly offered services for outreach. | Programs Offered, Program Counts, Program Attendance, Meeting Counts, Meeting Attendance, Librarian Outreach |
| Facilities & Geospatial | Information about the operation of a library and branch system including electricity use, operating hours, library website/social media usage, wi-fi usage, generator availability, weather and air quality at a facility, geographic or mapping information (including shapefiles) | Count of Libraries (Central & Branch), Count of Bookmobiles, Count of Facilities, Types of Facilities, Square footage, Services offered / operating hours (IMLS records annual hours open), Library Website Analytics, Library Social Media Usage, Weather, Air Quality, Library Locations, Facilities Locations, Rooms Available, Wi-fi Usage, Bookmobile Routes, Power Backups (e.g. generators) |
| Financial | Information about revenue, expenditures, etc. Does not include information about staff salary (see Human Resources). | |
| Human Resources | Information about paid and unpaid staff, payroll, hiring, and benefits | Salaries & Benefits, Staffing Count, Volunteer Hours, Library Board Members |
| Patrons | Information about the use of library facilities, lending practices, etc. | Door Counts, Patrons by Location (Zip Code), Reference Interactions |
| Public Records | The library's response to public records requests | FOIA / Public Records Requests |
| Technology Offerings | Information about technology that is available to patrons or in use by library - including laptops and public terminals. (Note that wi-fi usage in the library is in Facilities & Geospatial while hotspot lending is included in Technology Offerings.) | Technology Usage, Computer Usage, Technology Counts (# computer, stations, laptops, wifi hotspot lending, mobile phone lending, etc.), Software Availability/Usage |
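
Records coded with the original categories can be carried over to the updated list by name. The mapping below is read directly off the two tables: Events becomes Events & Outreach, and Facilities and Geospatial merge into Facilities & Geospatial. The names `ORIGINAL_TO_UPDATED` and `update_categories` are hypothetical, and this is only a sketch of one way to do the conversion.

```python
# Original category name -> updated category name, per the two tables.
ORIGINAL_TO_UPDATED = {
    "Catalog and Circulation": "Catalog & Circulation",
    "Events": "Events & Outreach",
    "Facilities": "Facilities & Geospatial",
    "Geospatial": "Facilities & Geospatial",
    "Financial": "Financial",
    "Human Resources": "Human Resources",
    "Patrons": "Patrons",
    "Public Records": "Public Records",
    "Technology Offerings": "Technology Offerings",
}


def update_categories(original_names):
    """Map original category names to updated ones, dropping duplicates but keeping order."""
    updated = []
    for name in original_names:
        new_name = ORIGINAL_TO_UPDATED[name.strip()]  # raises KeyError on an unknown name
        if new_name not in updated:
            updated.append(new_name)
    return updated
```

A name-based mapping cannot capture every change. For instance, the updated list places wi-fi usage under Facilities & Geospatial, so a dataset previously coded only as Technology Offerings because of wi-fi usage would need to be recoded by hand.
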