
Commit

update
markolalovic committed Jun 5, 2024
1 parent e348e12 commit 0e85f58
Showing 121 changed files with 38,610 additions and 4,213 deletions.
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -48,3 +48,4 @@ po/*~
# RStudio Connect folder
rsconnect/
.Rproj.user
docs
25 changes: 11 additions & 14 deletions DESCRIPTION
@@ -1,24 +1,21 @@
Package: responsesR
Type: Package
Title: Simulation of Likert Item Responses
Version: 1.2.0
Date: 2024-03-26
Version: 1.2.1
Date: 2024-06-05
Authors@R:
person(given = "Marko",
family = "Lalovic",
role = c("aut", "cre"),
email = "marko[email protected]",
email = "marko@lalovic.me",
comment = c(ORCID = "0000-0002-1305-0192"))
Description: Simulates data sets that mimic the kind of survey data
commonly analyzed in applied social research known as Likert items. The user
can leverage continuous variables with known means and covariance structure,
which undergo discretization into discrete variables. The discrete variables
mirror the original continuous data through Lloyd’s algorithm, a technique
commonly utilized in signal processing and closely linked to k-means clustering.
Furthermore, asymmetry can be introduced by incorporating skew normal distribution.
Additionally, the package enables the reconstruction of continuous variables from
probability distributions of discrete variables, thereby enabling users to replicate
existing survey data more accurately.
URL: https://markolalovic.github.io/responsesR, https://github.com/markolalovic/responsesR
Description: Provides an easy framework to simulate survey data commonly analyzed
in applied social research, specifically Likert items. Users can specify latent variables by
providing means, standard deviations, and optionally, skewness and correlations. The generated
dataset represents responses to Likert scale questions, which can be used for various purposes,
such as validating theoretical findings obtained through factor analysis and structural equation modeling.
The package also allows for the estimation of parameters from existing survey data to replicate it more accurately.
URL: https://lalovic.io/responsesR, https://github.com/markolalovic/responsesR
BugReports: https://github.com/markolalovic/responsesR/issues
License: MIT + file LICENSE
Encoding: UTF-8
125 changes: 43 additions & 82 deletions README.Rmd
@@ -6,7 +6,7 @@ output: github_document
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
dpi=450)
dev = 'svg')
```

## responsesR: simulate Likert item responses in R <img src="./man/figures/logo.png" align="right" height="160" style="float:right; height:160px;"/>
@@ -18,50 +18,33 @@ knitr::opts_chunk$set(
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.10889981.svg)](https://doi.org/10.5281/zenodo.10889981)
<!-- badges: end -->

This package aims to provide an easy way to:

- Simulate Likert-scale data in R, enabling users to define distributions, means, standard deviations, and correlations among latent variables.
- Generate Likert-type responses for single or multiple items.
- Simulate Likert scales with associations between items to measure underlying constructs.
- Create artificial data to validate theoretical findings, when employing statistical techniques such as Factor Analysis and Structural Equation Modeling.
- Estimate means and standard deviations of latent variables and recreate existing rating-scale data.
This package provides an easy framework to simulate survey data commonly analyzed in applied social research, specifically Likert items. Users can specify latent variables by providing means, standard deviations, and optionally, skewness and correlations. The generated data sets represent responses to Likert scale questions, which can be used for various purposes, such as validating theoretical findings obtained through factor analysis and structural equation modeling. The package also allows for the estimation of parameters from existing survey data to replicate it more accurately.

## Installation
You can install the latest version using `devtools`:
You can install the latest version using devtools:
```{r eval=FALSE}
# install.packages("devtools")
library(devtools)
install_github("markolalovic/responsesR")
install.packages("devtools")
devtools::install_github("markolalovic/responsesR")
```

## Examples
Below you'll find two simple examples that illustrate how to create synthetic datasets with responsesR. For further information, refer to the articles on the [package website](https://markolalovic.github.io/responsesR/).

```{r}
library(responsesR)
```
## Code examples
Below are two simple examples. For more details, refer to the articles on the [package website](https://lalovic.io/responsesR/).

### Simulating survey data
The following sample code creates simulated survey data. The hypothetical survey simulation is roughly based on the actual [comparative study](https://arxiv.org/abs/2201.12960) on teaching and learning R in a pair of introductory statistics labs.

Consider a scenario where 10 participants who completed Course A and 20 participants who completed Course B have taken the survey. Let's assume the initial question was:
Here's how to generate simulated survey data. Consider a scenario where 10 participants who completed Course A and 20 participants who completed Course B have answered the question:

> "How would you rate your experience with the course?"
with four possible answers:

> Poor, Fair, Good, and Excellent.
Let's suppose that participants in Course A had a neutral opinion regarding the question, while those in Course B, on average, had a more positive experience.

By choosing appropriate parameters for the latent distributions and setting the number of categories `K = 4`, we can generate hypothetical responses (standard deviation `sd = 1` and skewness `gamma1 = 0`, by default):
Suppose that on average participants in Course A had a neutral experience, while those in Course B had a more positive experience. By choosing appropriate parameters for the latent variables and setting the number of categories (to K = 4 in this example), we can generate hypothetical responses (standard deviation sd = 1 and skewness gamma1 = 0, by default):
```{r}
library(responsesR) # load the package
set.seed(12345) # to ensure reproducible results
course_A <- get_responses(n = 10, mu = 0, K = 4)
course_B <- get_responses(n = 20, mu = 1, K = 4)
```
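
Before plotting, it can help to tabulate the responses with explicit factor levels, so that categories nobody chose still show up with a count of zero. The sketch below is only an illustration: the vectors `course_A_demo` and `course_B_demo` are hand-written stand-ins (not output of `get_responses()`), and it assumes the responses are integer codes from 1 to K.

```r
# Stand-in responses coded 1..K (hypothetical values, not package output).
K <- 4
labels <- c("Poor", "Fair", "Good", "Excellent")
course_A_demo <- c(2, 3, 2, 1, 3, 2, 4, 3, 2, 3)
course_B_demo <- c(3, 4, 3, 3, 4, 2, 3, 4, 4, 3, 3, 4, 2, 3, 4, 3, 3, 4, 3, 4)

# Convert to factors with all K levels so empty categories are kept.
to_factor <- function(x) factor(x, levels = seq_len(K), labels = labels)

counts <- rbind(
  "Course A" = table(to_factor(course_A_demo)),
  "Course B" = table(to_factor(course_B_demo))
)
print(counts)

# Row-wise proportions make groups of different sizes comparable.
props <- round(prop.table(counts, margin = 1), 2)
print(props)
```

Using proportions rather than raw counts matters here because the two groups have different sizes (10 and 20 participants).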

Below are the responses to the question, visualized using a grouped bar chart:
Below are the generated responses visualized using a grouped bar chart:
<details>
<summary><b><a style="cursor: pointer;">Click here to expand </a></b> </summary>

@@ -111,26 +94,18 @@ p
</details>
<p> </p>

```{r courses_grouped_bar_chart, fig.align = 'center', out.width = "80%", echo = FALSE}
knitr::include_graphics("./man/figures/courses_grouped_bar_chart-1.png")
```{r courses_grouped_bar_chart, fig.height=3.3, out.width = "100%", echo = FALSE}
knitr::include_graphics("./man/figures/articles/courses_grouped_bar_chart.svg")
```

Suppose that the survey also asked the participants to rate their skills on a 5-point Likert scale, ranging from 1 (very poor) to 5 (very good) in:
For a pre- and post-course comparison, suppose that the participants completed the survey both before and after taking the course, and that their assessments of their skills in:

* Programming,
* Searching Online,
* Solving Problems.
1. Programming on average increased,
2. Searching online stayed about the same,
3. Solving problems increased in Course A, but decreased for participants in Course B.

The survey was completed by the participants both before and after taking the course for a pre and post-comparison. Suppose that participants' assessments of:

* Programming skills on average increased,
* Searching Online stayed about the same,
* Solving Problems increased in Course A, but decreased for participants in Course B.

Let's simulate the survey data for this scenario (number of categories is `K = 5` by default):
Let's simulate the survey data for this scenario using a 5-point Likert scale (K = 5, by default):
```{r}
set.seed(12345) # to ensure reproducible results
# Pre- and post assessments of skills: 1, 2, 3 for course A
pre_A <- get_responses(n = 10, mu = c(-1, 0, 1))
post_A <- get_responses(n = 10, mu = c(0, 0, 2))
@@ -140,7 +115,7 @@ pre_B <- get_responses(n = 20, mu = c(-1, 0, 1))
post_B <- get_responses(n = 20, mu = c(0, 0, 0)) # <-- decrease for skill 3
```

The grouped bar chart below displays the responses to Likert-scale questions before and after taking the course:
Below is the grouped bar chart of the generated responses:
<details>
<summary><b><a style="cursor: pointer;">Click here to expand </a></b> </summary>

@@ -215,17 +190,19 @@ p

</details>
<p> </p>
```{r courses_stacked_bar_chart, fig.align = 'center', out.width = "80%", echo = FALSE}
knitr::include_graphics("./man/figures/courses_stacked_bar_chart-1.png")

```{r courses_stacked_bar_chart, fig.height=5.6, out.width = "100%", echo = FALSE}
knitr::include_graphics("./man/figures/articles/courses_stacked_bar_chart.svg")
```


### Replicating survey data
The following sample code covers the topic of replicating survey data in order to create scale scores. For this, we will use part of the [bfi dataset](https://search.r-project.org/CRAN/refmans/psych/html/bfi.html) from the psych package, in particular only the first 5 items, A1-A5, which correspond to agreeableness, and the attribute gender:
The following sample code covers the topic of replicating survey data in order to create scale scores. For this, we will use part of the [bfi dataset](https://search.r-project.org/CRAN/refmans/psych/html/bfi.html) from the psych package, in particular the first 5 items, A1-A5, which correspond to agreeableness, and the attribute gender:

```{r}
library(psych)
avars <- c("A1", "A2", "A3", "A4", "A5")
data <- bfi[, c(avars, "gender")]
vars <- c("A1", "A2", "A3", "A4", "A5")
data <- bfi[, c(vars, "gender")]
```

Each item was answered on a six-point scale ranging from 1 (very inaccurate) to 6 (very accurate), and the sizes of the female and male samples were 1881 and 919, respectively:
@@ -238,8 +215,8 @@ mapdf <- data.frame(old = 1:2, new = c("Male", "Female"))
data$gender <- mapdf$new[match(data$gender, mapdf$old)]

# Impute the missing values.
for (avar in avars) {
data[, avar][is.na(data[, avar])] <- median(data[, avar], na.rm=TRUE)
for (var in vars) {
data[, var][is.na(data[, var])] <- median(data[, var], na.rm=TRUE)
}
knitr::kable(head(data), format="html")
table(data$gender)
@@ -253,17 +230,17 @@ mapdf <- data.frame(old = 1:2, new = c("Male", "Female"))
data$gender <- mapdf$new[match(data$gender, mapdf$old)]
# Impute the missing values.
for (avar in avars) {
data[, avar][is.na(data[, avar])] <- median(data[, avar], na.rm=TRUE)
for (var in vars) {
data[, var][is.na(data[, var])] <- median(data[, var], na.rm=TRUE)
}
knitr::kable(head(data), format="html")
table(data$gender)
```
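
The recoding and imputation steps can be tried on a small data frame without loading psych. This is a minimal sketch with made-up values; the data frame `toy` and the vector `items` are hypothetical names used only for illustration.

```r
# Toy data frame shaped like the bfi subset (values are made up).
toy <- data.frame(
  A1 = c(2, NA, 5, 4, 1),
  A2 = c(4, 4, NA, 6, 3),
  gender = c(1, 2, 2, 1, 2)
)
items <- c("A1", "A2")

# Recode gender with a lookup table: 1 -> "Male", 2 -> "Female".
mapdf <- data.frame(old = 1:2, new = c("Male", "Female"),
                    stringsAsFactors = FALSE)
toy$gender <- mapdf$new[match(toy$gender, mapdf$old)]

# Replace each missing item response with that item's median.
for (item in items) {
  missing <- is.na(toy[[item]])
  toy[[item]][missing] <- median(toy[[item]], na.rm = TRUE)
}

print(toy)
```

Note that with an even number of observed values the median can fall between two categories (for example 3.5), so imputed values are not guaranteed to be valid response codes.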

Separate the items into two groups according to their gender.
```{r}
items_M <- data[data$gender == "Male", avars]
items_F <- data[data$gender == "Female", avars]
items_M <- data[data$gender == "Male", vars]
items_F <- data[data$gender == "Female", vars]
```

To reproduce the items, start by estimating the parameters of the latent variables, assuming they are normal (`gamma1 = 0` by default) and providing the number of possible response categories `K = 6`:
@@ -292,8 +269,9 @@ new_items_F <- get_responses(n = nrow(items_F),
```

To compare the results, we can plot the correlation matrix with bar charts on the diagonal:
```{r agree_items_correlations_comparison, fig.align = 'center', out.width = "80%", echo = FALSE}
knitr::include_graphics("./man/figures/agree_items_correlations_comparison-1.png")

```{r agree_items_correlations_comparison, fig.height=10, out.width = "100%", echo = FALSE}
knitr::include_graphics("./man/figures/articles/agree_items_correlations_comparison.svg")
```

The next step would be to create agreeableness scale scores for both groups of participants by taking the average of these 5 items, and to visualize the results with a grouped boxplot:
@@ -311,7 +289,7 @@ data$A1 <- (min(data$A1) + max(data$A1)) - data$A1
new_data$Y1 <- (min(new_data$Y1) + max(new_data$Y1)) - new_data$Y1

# Create agreeableness scale scores
data$agreeable <- rowMeans(data[, avars])
data$agreeable <- rowMeans(data[, vars])
new_data$agreeable <- rowMeans(new_data[, c("Y1", "Y2", "Y3", "Y4", "Y5")])

# And visualize the results with a grouped boxplot.
@@ -334,36 +312,19 @@ plot_grid(p1, p2, nrow = 2)
</details>
<p> </p>

```{r agreeableness_grouped_boxplot, fig.align = 'center', out.width = "60%", echo = FALSE}
knitr::include_graphics("./man/figures/agreeableness_grouped_boxplot-1.png")
```{r agreeableness_grouped_boxplot, fig.height=4.8, out.width = "100%", echo = FALSE}
knitr::include_graphics("./man/figures/articles/agreeableness_grouped_boxplot.svg")
```
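
The reverse-coding and averaging steps behind the scale scores can be checked on a few made-up rows. In this sketch the data frame `toy` and its values are hypothetical; only the reverse-coding formula and the use of `rowMeans()` are taken from the code in this README.

```r
# Toy 6-point responses (made-up values); A1 is reverse-coded as in the README.
toy <- data.frame(
  A1 = c(1, 6, 2, 5),
  A2 = c(6, 1, 5, 2),
  A3 = c(5, 2, 6, 1)
)

# Reverse-code A1 using the observed range: 1 <-> 6, 2 <-> 5, 3 <-> 4.
toy$A1 <- (min(toy$A1) + max(toy$A1)) - toy$A1

# Scale score = mean of the items for each respondent.
toy$agreeable <- rowMeans(toy[, c("A1", "A2", "A3")])
print(toy)
```

The formula relies on both extremes of the scale being observed in the data. If they might not be, using the known endpoints directly (here `1 + 6`) is the safer choice.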

## Dependency statement
To maintain a lightweight package, responsesR only imports [mvtnorm](https://cran.r-project.org/web/packages/mvtnorm/index.html), along with the standard R packages stats and graphics, which are typically included in R releases. An additional suggested dependency is the package [sn](https://cran.r-project.org/web/packages/sn/index.html), necessary only for generating random responses from correlated Likert items with multivariate skew normal latent distribution. However, the package prompts the user to install this dependency during interactive sessions.

## Simulation design
Simulating Likert item responses begins by selecting a continuous distribution, which is then transformed into a discrete probability distribution using a method called discretization. This process is illustrated in Figure 2.

```{r simulation_process_r, fig.align = 'center', out.width = "70%", fig.cap = "Figure 2: Flow diagram of the simulation process.", echo = FALSE}
knitr::include_graphics("./man/figures/simulation_process.png")
```

The transformation is visually depicted in Figures 3 and 4. These figures show the densities of normally distributed X1 and X2 in Figure 3A and skew normally distributed X1 and X2 with skewness `gamma1 = -0.6` in Figure 4A. Corresponding discrete probability distributions of Y1 and Y2 with `K = 5` categories are presented in Figures 3B and 4B.

```{r mapping_normal_r, fig.align = 'center', out.width = "80%", fig.cap = "Figure 3: Relationship between normally distributed X and responses Y.", echo = FALSE}
knitr::include_graphics("./man/figures/mapping_normal.png")
```

```{r mapping_skew_r, fig.align = 'center', out.width = "80%", fig.cap = "Figure 4: Relationship between skew normal X with gamma1 = -0.6, and responses Y.", echo = FALSE}
knitr::include_graphics("./man/figures/mapping_skew.png")
```
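
The discretization step described above can be sketched in base R. This is a simplified illustration and not the package's algorithm: the cut points in `cuts` are fixed by hand (an assumption made for this example), whereas the package description refers to Lloyd's algorithm for the discretization.

```r
# Simplified discretization: a latent normal X is cut into K ordered categories.
# The equally spaced cut points are an assumption for illustration only.
K <- 5
mu <- 0
sd <- 1
cuts <- c(-Inf, -1.5, -0.5, 0.5, 1.5, Inf)

# Probability of each response category: P(cuts[k] < X <= cuts[k + 1]).
p <- diff(pnorm(cuts, mean = mu, sd = sd))
names(p) <- seq_len(K)
print(round(p, 3))

# Draw latent values and map them to responses 1..K.
set.seed(1)
x <- rnorm(1000, mean = mu, sd = sd)
y <- cut(x, breaks = cuts, labels = FALSE)
print(round(table(factor(y, levels = seq_len(K))) / length(y), 3))
```

With a symmetric latent distribution and symmetric cut points, the category probabilities are symmetric around the middle category; shifting `mu` moves probability mass toward one end of the scale.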

## Further Reading
* [Quick Overview](https://markolalovic.github.io/responsesR/articles/responsesR.html)
* [Function Documentation](https://markolalovic.github.io/responsesR/reference/index.html)
## Further reading
* [Get started](https://markolalovic.github.io/responsesR/articles/responsesR.html)
* [Functions reference documentation](https://markolalovic.github.io/responsesR/reference/index.html)
* [Introduction to responsesR package](https://markolalovic.github.io/responsesR/articles/introduction_to_responsesR.html)

## Contributions
Feel free to create issues for bugs or suggestions on the [issues page](https://github.com/markolalovic/responsesR/issues).

You can also fork the responsesR repository, make your changes, and submit a pull request. Contributions may include bug fixes, new features, documentation improvements, or any other features you think will be useful.
You can also make changes and submit a pull request. Contributions may include bug fixes, new features or documentation improvements.