Retrieving information about first names and occupations in Wikipedia via Wikidata.
The idea is to study the data about given names from wikipedia.org and, in particular, the occupations of people with a given name. Scraping Wikipedia to get first names and occupations would be difficult, as the data is unstructured. Fortunately, the Wikidata project exists and gives access to structured data.
The tools are written in R, and you can see them in action on shinyapps.io. The app provides visualizations and lets you download the dataframe directly as a CSV file. You can also download the files and run everything locally within R, either through the app (server.R and ui.R, with queries.R and mySPARQL.R in the same folder) or by using the functions in queries.R (with mySPARQL.R) directly.
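To make the idea concrete, here is a minimal sketch of the kind of SPARQL query that Wikidata's structured data allows. It is not the query used in queries.R: the helper name build_firstname_query and its limit parameter are hypothetical, and the sketch assumes the public endpoint at https://query.wikidata.org/sparql with its predefined prefixes. The Wikidata properties P735 (given name) and P106 (occupation) are the ones relevant here.

```r
# Hypothetical helper (not part of queries.R): builds a SPARQL query string
# listing people whose given name has the English label `firstname`,
# together with their occupations.
build_firstname_query <- function(firstname, limit = 100) {
  # Reject characters that would break the quoted SPARQL literal.
  stopifnot(
    is.character(firstname), length(firstname) == 1,
    !grepl("\"", firstname, fixed = TRUE),
    !grepl("\\", firstname, fixed = TRUE)
  )
  template <- paste(
    "SELECT ?person ?personLabel ?occupationLabel WHERE {",
    "  ?name rdfs:label \"%s\"@en .",
    "  ?person wdt:P735 ?name .",        # P735: given name
    "  ?person wdt:P106 ?occupation .",  # P106: occupation
    "  SERVICE wikibase:label { bd:serviceParam wikibase:language \"en\". }",
    "}",
    "LIMIT %d",
    sep = "\n"
  )
  sprintf(template, firstname, as.integer(limit))
}

query <- build_firstname_query("Marie", limit = 10)
cat(query, "\n")

# Sending the query needs network access and the SPARQL package, so it is
# left commented out here:
# library(SPARQL)
# res <- SPARQL("https://query.wikidata.org/sparql", query)$results
```

Matching on the English label is only one way to select a given name; querying by the name's Wikidata item identifier would be more precise, at the cost of an extra lookup.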
You can knit the Firstnames.Rmd file to get a description of the functions and some examples.
- queries.R provides the main functions
- In particular, queryStream takes a first name as a string and returns the corresponding dataset.
- Note that queryStreamWithProgress does the same while also advancing the progress bar in the app, so it is only useful within the Shiny app.
- mySPARQL.R is a rewrite of certain functions of the SPARQL package that adds support for UTF-8. This support should soon be added to the package itself, at which point this file will no longer be needed.
- server.R and ui.R are the files for the Shiny app that can be seen on shinyapps.io.
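As a rough illustration of using the functions directly, the sketch below loads the two files and queries one first name. The wrapper run_firstname_example is hypothetical (it is not in the repository), and it assumes that mySPARQL.R has to be sourced before queries.R and that queryStream returns a data frame; depending on how the files load their dependencies, the packages listed below may need to be attached first.

```r
# Hypothetical wrapper (not part of the repository): sources the project
# files found in `dir` and queries one first name with queryStream().
# Returns NULL when the files are not present.
run_firstname_example <- function(firstname, dir = ".") {
  files <- file.path(dir, c("mySPARQL.R", "queries.R"))
  if (!all(file.exists(files))) {
    message("mySPARQL.R and queries.R were not found in ", dir)
    return(NULL)
  }
  # Assumption: mySPARQL.R is sourced first so queries.R can use its functions.
  for (f in files) source(f)
  queryStream(firstname)
}

# Usage (needs the project files and network access):
# marie <- run_firstname_example("Marie")
# if (!is.null(marie)) write.csv(marie, "Marie.csv", row.names = FALSE)

# To launch the app locally from the folder containing server.R and ui.R:
# shiny::runApp(".")
```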
All the required packages are available on CRAN:
- WikidataR to get information about items and properties,
- SPARQL for the main query,
- dplyr, tidyr and magrittr for data manipulation and cleaning,
- stringr for string manipulation,
- wordcloud for the word cloud,
- ggplot2 and ggthemes for graphs,
- shiny and shinyjs for the app.
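The following sketch checks which of these packages are not yet installed. The helper missing_packages is hypothetical, and the install step is left commented out so that nothing is installed without your say-so.

```r
# Packages listed above.
pkgs <- c("WikidataR", "SPARQL", "dplyr", "tidyr", "magrittr", "stringr",
          "wordcloud", "ggplot2", "ggthemes", "shiny", "shinyjs")

# Hypothetical helper: returns the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(pkgs)
print(to_install)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```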
Do not hesitate to fork this repository and/or contact me for more information.
Enjoy!