FlorianGD/Firstnames

Firstnames

Retrieving information about first names and occupations in Wikipedia, via Wikidata.

Presentation

The idea is to study the data about given names from wikipedia.org and, in particular, the occupations of people with a given name. Scraping Wikipedia to extract first names and occupations would be difficult, because the data is unstructured. Fortunately, the Wikidata project exists and gives access to structured data.
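To illustrate what "structured data" means here, the following is a minimal sketch of how such a request could be phrased in SPARQL from R. It is not necessarily the query that queries.R sends. The helper name `build_occupation_query`, the English label match, and the default limit are assumptions made for this example; `P735` (given name) and `P106` (occupation) are Wikidata property identifiers.

```r
# Hypothetical helper (not part of this repository): builds a SPARQL query
# listing people with a given first name together with their occupations.
build_occupation_query <- function(firstname, limit = 100) {
  stopifnot(is.character(firstname), length(firstname) == 1)
  # Refuse characters that would break the quoted literal in the query.
  if (grepl('"', firstname, fixed = TRUE) || grepl("\\", firstname, fixed = TRUE)) {
    stop("firstname must not contain double quotes or backslashes")
  }
  paste0(
    "SELECT ?person ?personLabel ?occupationLabel WHERE {\n",
    # Assumption: the given-name item is matched through its English label.
    "  ?name rdfs:label \"", firstname, "\"@en .\n",
    "  ?person wdt:P735 ?name ;\n",       # P735 = given name
    "          wdt:P106 ?occupation .\n", # P106 = occupation
    "  SERVICE wikibase:label { bd:serviceParam wikibase:language \"en\". }\n",
    "}\n",
    "LIMIT ", as.integer(limit)
  )
}

query <- build_occupation_query("Florian", limit = 10)
cat(query, "\n")

# Sending the query needs network access, so it is left commented out:
# res <- SPARQL::SPARQL(url = "https://query.wikidata.org/sparql", query = query)
# head(res$results)
```

Building the query string separately from sending it makes it easy to inspect or paste into the Wikidata Query Service before running it from R.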

The tools are written in R, and you can see them in action on shinyapps.io. The app provides visualizations and lets you download the data frame directly as CSV. You can also download the files and run everything locally in R, either through the app (server.R and ui.R, with queries.R and mySPARQL.R in the same folder) or by calling the functions in queries.R (with mySPARQL.R) directly.

You can knit the Firstnames.Rmd file to get a description of the functions and some examples.

Files and functions

  • queries.R provides the main functions.
    • In particular, queryStream takes a first name as a string and returns the dataset.
    • queryStreamWithProgress does the same but also increments the progress bar in the app, so it is only useful within the shiny app.
  • mySPARQL.R is a rewrite of certain functions of the SPARQL package to add UTF-8 support. This support is expected to be added to the package soon, after which this file will no longer be needed.
  • server.R and ui.R are the files for the shiny app hosted on shinyapps.io.
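As a rough sketch of local use, the snippet below checks that the four files are present and then calls queryStream. The helper `missing_files`, the variable `app_dir`, the order in which the two files are sourced, and the example name "Florian" are assumptions for illustration. The result is assumed to be a data frame, since the app offers it as a CSV download.

```r
# Files the README says must sit in the same folder.
required_files <- c("server.R", "ui.R", "queries.R", "mySPARQL.R")

# Hypothetical helper: returns the required files that are absent from `dir`.
missing_files <- function(dir = ".") {
  required_files[!file.exists(file.path(dir, required_files))]
}

app_dir <- "."  # assumption: the repository was downloaded into the working directory
absent <- missing_files(app_dir)

if (length(absent) > 0) {
  message("Missing files: ", paste(absent, collapse = ", "))
} else {
  # Assumption: mySPARQL.R is loaded first so queries.R can use its functions.
  source(file.path(app_dir, "mySPARQL.R"))
  source(file.path(app_dir, "queries.R"))

  # queryStream takes a first name as a string (needs network access).
  florian <- queryStream("Florian")
  print(head(florian))

  # To start the app instead, run it from the same folder:
  # shiny::runApp(app_dir)
}
```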

Packages needed

All packages are available on CRAN.

Minimal packages to retrieve the data:

  • WikidataR to get item and property information,
  • SPARQL for the main query,
  • dplyr, tidyr and magrittr for data manipulation and cleaning.

Packages for data visualization:

  • stringr for string manipulation,
  • wordcloud for a wordcloud,
  • ggplot2 and ggthemes for graphs.

Packages to run the app:

  • shiny and shinyjs
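The package lists above can be turned into a small setup script. This is only a sketch: the helper `not_installed` and the three vector names are made up for the example, and the install call is left commented out so that nothing is downloaded by accident. It assumes every package is still available from your CRAN mirror.

```r
# Package groups as listed in this README.
minimal_pkgs <- c("WikidataR", "SPARQL", "dplyr", "tidyr", "magrittr")
viz_pkgs     <- c("stringr", "wordcloud", "ggplot2", "ggthemes")
app_pkgs     <- c("shiny", "shinyjs")

# Hypothetical helper: returns the packages from `pkgs` that are not installed yet.
not_installed <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- not_installed(c(minimal_pkgs, viz_pkgs, app_pkgs))
print(to_install)

# Uncomment to install whatever is missing:
# install.packages(to_install)
```

If you only need the data and not the app, pass `minimal_pkgs` alone to `not_installed`.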

Contribute

Do not hesitate to fork the repository and/or contact me for more information.

Enjoy!
