From 6c79a06460195d257d0462498d12e34a970ad6af Mon Sep 17 00:00:00 2001 From: Ofir Date: Wed, 7 Jan 2026 21:31:57 -0500 Subject: [PATCH 1/4] attached is my first sampling assigment --- .../a1_sampling_and_reproducibility.ipynb | 389 +++++++++++++++++- 1 file changed, 378 insertions(+), 11 deletions(-) diff --git a/02_activities/assignments/a1_sampling_and_reproducibility.ipynb b/02_activities/assignments/a1_sampling_and_reproducibility.ipynb index 11852458..63a035b0 100644 --- a/02_activities/assignments/a1_sampling_and_reproducibility.ipynb +++ b/02_activities/assignments/a1_sampling_and_reproducibility.ipynb @@ -9,42 +9,85 @@ "\n", "The code at the end of this file explores contact tracing data about an outbreak of the flu, and demonstrates the dangers of incomplete and non-random samples. This assignment is modified from [Contact tracing can give a biased sample of COVID-19 cases](https://andrewwhitby.com/2020/11/24/contact-tracing-biased/) by Andrew Whitby.\n", "\n", - "Examine the code below. Identify all stages at which sampling is occurring in the model. Describe in words the sampling procedure, referencing the functions used, sample size, sampling frame, any underlying distributions involved. \n" + "\n" ] }, { "cell_type": "markdown", - "id": "4ea73db3", + "id": "85c70f56", "metadata": {}, - "source": [] + "source": [ + "*Explanation/summary of the article (for my use, to ensure I understood the argument  before deciphering the code)*\n", + "- Whitby critiques the NYT article, arguing that small gatherings are not the primary cause of surges in Covid-19 infections. 
\n", + "- Only a small proportion of infections can be identified through contact tracing.\n", + "- Contact tracing does not provide a random sample and, therefore, is not representative of the overall situation.\n", + "- It is easier to trace infections from institutions or events that regularly document data compared to private gatherings, restaurants, supermarkets, etc.\n", + "- There is also a comment on the limitations of small sample sizes.\n", + "- Whitby presents a model based on a sample of 1,000 participants: 200 attended two weddings, while the other 800 participated in 80 brunches, all occurring simultaneously. He initially determines a constant infection coefficient of 0.1 for all 82 events.\n", + "- To introduce randomness into his model, Whitby changes this coefficient to represent the probability of an individual being infected (known in epidemiology as the attack rate) at 10%. This indicates that the infection rate is not constant and can vary. If he were to repeat the trial 50,000 times (i.e., 50,000 samples of 1,000, with 800 attending brunches and 200 attending weddings), he would see varying results.\n", + "- Primary contact tracing refers to the likelihood of linking an infection to a source event, while secondary contact tracing involves identifying all infections after a specific event has been identified.\n", + "- Understanding these two concepts helps clarify the difference between random sampling and observational data obtained through interviews, as weddings are traceable events, unlike brunches. This difference accounts for the variations observed between the blue and red histograms." + ] + }, + { + "cell_type": "markdown", + "id": "a058afa6", + "metadata": {}, + "source": [ + "Question 1: Examine the code below. Identify all stages at which sampling is occurring in the model. Describe in words the sampling procedure, referencing the functions used, sample size, sampling frame, any underlying distributions involved. 
" + ] + }, + { + "cell_type": "markdown", + "id": "1338f7fe", + "metadata": {}, + "source": [ + "Answer:\n", + "\n", + "After creating a (hypothetical/theoretical) dataframe of 1,000 non-infected individuals (800 who attended brunches and 200 who attended weddings), the sampling process occurs in three stages:\n", + "\n", + "(1) Simple Random Sampling (SRS) Without replacement: the code first draws a simple random sample from the 1,000 individuals in the simulation, without replacement. The sample size is determined by multiplying the attack rate (theoretically known parameter) by the sampling frame (i.e., 0.1 * 1,000 = 100). From this sample, 100 individuals are considered infected. This sampling assumes that each individual has a 10% chance of being infected, which is based on a uniform distribution learned in the first lesson, where every outcome in a given set or interval is equally likely.\n", + "\n", + "(2) Primary contact tracing: next, the simulation/code examines the 100 infected individuals selected in the previous step. Here, the code uses a binomial distribution to model the success or failure of contact tracing. This involves 100 independent random Bernoulli trials (using the np.random.rand function). The parameter value for successful contact tracing is set at 0.2, implying that 80% (the complementary distribution) will not be traced. This process samples from this group of 100 individuals to determine how many of their infections have been traced. Drawing on the example from the first lesson (where p=0.5 and n=100), the expected outcome (mean) is 20, with a standard deviation of 4.\n", + "\n", + "(3) Repetition of the sampling process: Finally, the entire simulation is rerun 1,000 times. This process reminded me of bootstrapping. However, the current repetition procedure is not based on empirical data. Instead, it models the population when the parameters (as in our case) are known. 
It does not estimate population parameters such as mean, standard deviation, or confidence intervals. (I am unsure if this process is considered \"sampling\" per se)" + ] }, { "cell_type": "markdown", "id": "3d9b2ccc", "metadata": {}, "source": [ - "Modify the number of repetitions in the simulation to 10 and 100 (from the original 1000). Run the script multiple times and observe the outputted graphs. Comment on the reproducibility of the results." + "Question 2: Modify the number of repetitions in the simulation to 10 and 100 (from the original 1000). Run the script multiple times and observe the outputted graphs. Comment on the reproducibility of the results." ] }, { "cell_type": "markdown", "id": "4cf5d993", "metadata": {}, - "source": [] + "source": [ + "Answer: \n", + "\n", + "Since a random seed was not set, the results vary with each execution of the script. Each time the script runs, the first two sampling stages draw new random values in every repetition. The differences between runs are most visible with 10 repetitions and shrink as the number of repetitions grows, because each histogram is built from more draws. In the next stage, I will add a random seed, as we learned in the Python and LCR modules, to ensure reproducibility." + ] }, { "cell_type": "markdown", "id": "32603ce7", "metadata": {}, "source": [ - "Alter the code so that it is reproducible. Describe the changes you made to the code and how they affected the reproducibility of the script. The script needs to produce the same output when run multiple times." + "Question 3: Alter the code so that it is reproducible. Describe the changes you made to the code and how they affected the reproducibility of the script. The script needs to produce the same output when run multiple times." ] }, { "cell_type": "markdown", "id": "77613cc3", "metadata": {}, - "source": [] + "source": [ + "Answer:\n", + "\n", + "I set a random seed using the \"np.random.seed\" function with the integer 2026 as its argument. I positioned the random seed outside the simulation event function, before the list comprehension that repeats the simulation. 
Because the seed is set once rather than inside the function, each repetition still draws different random values for the first two sampling procedures, but the full sequence of draws, and therefore the output, is identical every time the script is run with a given number of repetitions (e.g., 10, 100, 1000)." + ] }, { "cell_type": "markdown", @@ -56,10 +99,21 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 12, "id": "ab8587a0", "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], "source": [ "# Import necessary libraries\n", "import pandas as pd\n", @@ -146,6 +200,319 @@ "plt.show()" ] }, + { + "cell_type": "code", + "execution_count": 13, + "id": "ada78e4e", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Import necessary libraries\n", + "import pandas as pd\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "\n", + "# Note: Suppressing FutureWarnings to maintain a clean output. This is specifically to ignore warnings about\n", + "# deprecated features in the libraries we're using (e.g., 'use_inf_as_na' option in Pandas, used by Seaborn),\n", + "# which we currently have no direct control over. This action is taken to ensure that our output remains\n", + "# focused on relevant information, acknowledging that we rely on external library updates to fully resolve\n", + "# these deprecations. Always consider reviewing and removing this suppression after significant library updates.\n", + "import warnings\n", + "warnings.simplefilter(action='ignore', category=FutureWarning)\n", + "\n", + "# Constants representing the parameters of the model\n", + "ATTACK_RATE = 0.10\n", + "TRACE_SUCCESS = 0.20\n", + "SECONDARY_TRACE_THRESHOLD = 2\n", + "\n", + "def simulate_event(m):\n", + " \"\"\"\n", + " Simulates the infection and tracing process for a series of events.\n", + " \n", + " This function creates a DataFrame representing individuals attending weddings and brunches,\n", + " infects a subset of them based on the ATTACK_RATE, performs primary and secondary contact tracing,\n", + " and calculates the proportions of infections and traced cases that are attributed to weddings.\n", + " \n", + " Parameters:\n", + " - m: Dummy parameter for iteration purposes.\n", + " \n", + " Returns:\n", + " - A tuple containing the proportion of infections and the proportion of traced cases\n", + " that are attributed to weddings.\n", + " \"\"\"\n", + " # Create DataFrame for people at events with initial infection and traced status\n", + " events = ['wedding'] * 200 + ['brunch'] * 800\n", + " ppl = pd.DataFrame({\n", + " 'event': events,\n", + " 'infected': 
False,\n", + " 'traced': np.nan # Initially setting traced status as NaN\n", + " })\n", + "\n", + " # Explicitly set 'traced' column to nullable boolean type\n", + " ppl['traced'] = ppl['traced'].astype(pd.BooleanDtype())\n", + "\n", + " # Infect a random subset of people\n", + " infected_indices = np.random.choice(ppl.index, size=int(len(ppl) * ATTACK_RATE), replace=False)\n", + " ppl.loc[infected_indices, 'infected'] = True\n", + "\n", + " # Primary contact tracing: randomly decide which infected people get traced\n", + " ppl.loc[ppl['infected'], 'traced'] = np.random.rand(sum(ppl['infected'])) < TRACE_SUCCESS\n", + "\n", + " # Secondary contact tracing based on event attendance\n", + " event_trace_counts = ppl[ppl['traced'] == True]['event'].value_counts()\n", + " events_traced = event_trace_counts[event_trace_counts >= SECONDARY_TRACE_THRESHOLD].index\n", + " ppl.loc[ppl['event'].isin(events_traced) & ppl['infected'], 'traced'] = True\n", + "\n", + " # Calculate proportions of infections and traces attributed to each event type\n", + " ppl['event_type'] = ppl['event'].str[0] # 'w' for wedding, 'b' for brunch\n", + " wedding_infections = sum(ppl['infected'] & (ppl['event_type'] == 'w'))\n", + " brunch_infections = sum(ppl['infected'] & (ppl['event_type'] == 'b'))\n", + " p_wedding_infections = wedding_infections / (wedding_infections + brunch_infections)\n", + "\n", + " wedding_traces = sum(ppl['infected'] & ppl['traced'] & (ppl['event_type'] == 'w'))\n", + " brunch_traces = sum(ppl['infected'] & ppl['traced'] & (ppl['event_type'] == 'b'))\n", + " p_wedding_traces = wedding_traces / (wedding_traces + brunch_traces)\n", + "\n", + " return p_wedding_infections, p_wedding_traces\n", + "\n", + "# Run the simulation 100 times\n", + "results = [simulate_event(m) for m in range(100)]\n", + "props_df = pd.DataFrame(results, columns=[\"Infections\", \"Traces\"])\n", + "\n", + "# Plotting the results\n", + "plt.figure(figsize=(10, 6))\n", + 
"sns.histplot(props_df['Infections'], color=\"blue\", alpha=0.75, binwidth=0.05, kde=False, label='Infections from Weddings')\n", + "sns.histplot(props_df['Traces'], color=\"red\", alpha=0.75, binwidth=0.05, kde=False, label='Traced to Weddings')\n", + "plt.xlabel(\"Proportion of cases\")\n", + "plt.ylabel(\"Frequency\")\n", + "plt.title(\"Impact of Contact Tracing on Perceived Flu Infection Sources\")\n", + "plt.legend()\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "a142fb87", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Import necessary libraries\n", + "import pandas as pd\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "\n", + "# Note: Suppressing FutureWarnings to maintain a clean output. This is specifically to ignore warnings about\n", + "# deprecated features in the libraries we're using (e.g., 'use_inf_as_na' option in Pandas, used by Seaborn),\n", + "# which we currently have no direct control over. This action is taken to ensure that our output remains\n", + "# focused on relevant information, acknowledging that we rely on external library updates to fully resolve\n", + "# these deprecations. Always consider reviewing and removing this suppression after significant library updates.\n", + "import warnings\n", + "warnings.simplefilter(action='ignore', category=FutureWarning)\n", + "\n", + "# Constants representing the parameters of the model\n", + "ATTACK_RATE = 0.10\n", + "TRACE_SUCCESS = 0.20\n", + "SECONDARY_TRACE_THRESHOLD = 2\n", + "\n", + "def simulate_event(m):\n", + " \"\"\"\n", + " Simulates the infection and tracing process for a series of events.\n", + " \n", + " This function creates a DataFrame representing individuals attending weddings and brunches,\n", + " infects a subset of them based on the ATTACK_RATE, performs primary and secondary contact tracing,\n", + " and calculates the proportions of infections and traced cases that are attributed to weddings.\n", + " \n", + " Parameters:\n", + " - m: Dummy parameter for iteration purposes.\n", + " \n", + " Returns:\n", + " - A tuple containing the proportion of infections and the proportion of traced cases\n", + " that are attributed to weddings.\n", + " \"\"\"\n", + " # Create DataFrame for people at events with initial infection and traced status\n", + " events = ['wedding'] * 200 + ['brunch'] * 800\n", + " ppl = pd.DataFrame({\n", + " 'event': events,\n", + " 'infected': 
False,\n", + " 'traced': np.nan # Initially setting traced status as NaN\n", + " })\n", + "\n", + " # Explicitly set 'traced' column to nullable boolean type\n", + " ppl['traced'] = ppl['traced'].astype(pd.BooleanDtype())\n", + "\n", + " # Infect a random subset of people\n", + " infected_indices = np.random.choice(ppl.index, size=int(len(ppl) * ATTACK_RATE), replace=False)\n", + " ppl.loc[infected_indices, 'infected'] = True\n", + "\n", + " # Primary contact tracing: randomly decide which infected people get traced\n", + " ppl.loc[ppl['infected'], 'traced'] = np.random.rand(sum(ppl['infected'])) < TRACE_SUCCESS\n", + "\n", + " # Secondary contact tracing based on event attendance\n", + " event_trace_counts = ppl[ppl['traced'] == True]['event'].value_counts()\n", + " events_traced = event_trace_counts[event_trace_counts >= SECONDARY_TRACE_THRESHOLD].index\n", + " ppl.loc[ppl['event'].isin(events_traced) & ppl['infected'], 'traced'] = True\n", + "\n", + " # Calculate proportions of infections and traces attributed to each event type\n", + " ppl['event_type'] = ppl['event'].str[0] # 'w' for wedding, 'b' for brunch\n", + " wedding_infections = sum(ppl['infected'] & (ppl['event_type'] == 'w'))\n", + " brunch_infections = sum(ppl['infected'] & (ppl['event_type'] == 'b'))\n", + " p_wedding_infections = wedding_infections / (wedding_infections + brunch_infections)\n", + "\n", + " wedding_traces = sum(ppl['infected'] & ppl['traced'] & (ppl['event_type'] == 'w'))\n", + " brunch_traces = sum(ppl['infected'] & ppl['traced'] & (ppl['event_type'] == 'b'))\n", + " p_wedding_traces = wedding_traces / (wedding_traces + brunch_traces)\n", + "\n", + " return p_wedding_infections, p_wedding_traces\n", + "\n", + "# Run the simulation 10 times\n", + "results = [simulate_event(m) for m in range(10)]\n", + "props_df = pd.DataFrame(results, columns=[\"Infections\", \"Traces\"])\n", + "\n", + "# Plotting the results\n", + "plt.figure(figsize=(10, 6))\n", + 
"sns.histplot(props_df['Infections'], color=\"blue\", alpha=0.75, binwidth=0.05, kde=False, label='Infections from Weddings')\n", + "sns.histplot(props_df['Traces'], color=\"red\", alpha=0.75, binwidth=0.05, kde=False, label='Traced to Weddings')\n", + "plt.xlabel(\"Proportion of cases\")\n", + "plt.ylabel(\"Frequency\")\n", + "plt.title(\"Impact of Contact Tracing on Perceived Flu Infection Sources\")\n", + "plt.legend()\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "f6d17b39", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Import necessary libraries\n", + "import pandas as pd\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "\n", + "# Note: Suppressing FutureWarnings to maintain a clean output. This is specifically to ignore warnings about\n", + "# deprecated features in the libraries we're using (e.g., 'use_inf_as_na' option in Pandas, used by Seaborn),\n", + "# which we currently have no direct control over. This action is taken to ensure that our output remains\n", + "# focused on relevant information, acknowledging that we rely on external library updates to fully resolve\n", + "# these deprecations. Always consider reviewing and removing this suppression after significant library updates.\n", + "import warnings\n", + "warnings.simplefilter(action='ignore', category=FutureWarning)\n", + "\n", + "# Constants representing the parameters of the model\n", + "ATTACK_RATE = 0.10\n", + "TRACE_SUCCESS = 0.20\n", + "SECONDARY_TRACE_THRESHOLD = 2\n", + "\n", + "\n", + "def simulate_event(m):\n", + " \"\"\"\n", + " Simulates the infection and tracing process for a series of events.\n", + " \n", + " This function creates a DataFrame representing individuals attending weddings and brunches,\n", + " infects a subset of them based on the ATTACK_RATE, performs primary and secondary contact tracing,\n", + " and calculates the proportions of infections and traced cases that are attributed to weddings.\n", + " \n", + " Parameters:\n", + " - m: Dummy parameter for iteration purposes.\n", + " \n", + " Returns:\n", + " - A tuple containing the proportion of infections and the proportion of traced cases\n", + " that are attributed to weddings.\n", + " \"\"\"\n", + " # Create DataFrame for people at events with initial infection and traced status\n", + " events = ['wedding'] * 200 + ['brunch'] * 800\n", + " ppl = pd.DataFrame({\n", + " 'event': events,\n", + " 
'infected': False,\n", + " 'traced': np.nan # Initially setting traced status as NaN\n", + " })\n", + "\n", + " # Explicitly set 'traced' column to nullable boolean type\n", + " ppl['traced'] = ppl['traced'].astype(pd.BooleanDtype())\n", + "\n", + " # Infect a random subset of people\n", + " infected_indices = np.random.choice(ppl.index, size=int(len(ppl) * ATTACK_RATE), replace=False)\n", + " ppl.loc[infected_indices, 'infected'] = True\n", + "\n", + " # Primary contact tracing: randomly decide which infected people get traced\n", + " ppl.loc[ppl['infected'], 'traced'] = np.random.rand(sum(ppl['infected'])) < TRACE_SUCCESS\n", + "\n", + " # Secondary contact tracing based on event attendance\n", + " event_trace_counts = ppl[ppl['traced'] == True]['event'].value_counts()\n", + " events_traced = event_trace_counts[event_trace_counts >= SECONDARY_TRACE_THRESHOLD].index\n", + " ppl.loc[ppl['event'].isin(events_traced) & ppl['infected'], 'traced'] = True\n", + "\n", + " # Calculate proportions of infections and traces attributed to each event type\n", + " ppl['event_type'] = ppl['event'].str[0] # 'w' for wedding, 'b' for brunch\n", + " wedding_infections = sum(ppl['infected'] & (ppl['event_type'] == 'w'))\n", + " brunch_infections = sum(ppl['infected'] & (ppl['event_type'] == 'b'))\n", + " p_wedding_infections = wedding_infections / (wedding_infections + brunch_infections)\n", + "\n", + " wedding_traces = sum(ppl['infected'] & ppl['traced'] & (ppl['event_type'] == 'w'))\n", + " brunch_traces = sum(ppl['infected'] & ppl['traced'] & (ppl['event_type'] == 'b'))\n", + " p_wedding_traces = wedding_traces / (wedding_traces + brunch_traces)\n", + "\n", + " return p_wedding_infections, p_wedding_traces\n", + "\n", + "# Set the random seed once, before the repetitions, for reproducibility of results\n", + "np.random.seed(2026) \n", + "# Run the simulation 1000 times (reproducible because of the seed above)\n", + "\n", + "results = [simulate_event(m) for m in range(1000)]\n", + "props_df = pd.DataFrame(results, 
columns=[\"Infections\", \"Traces\"])\n", + "\n", + "# Plotting the results\n", + "plt.figure(figsize=(10, 6))\n", + "sns.histplot(props_df['Infections'], color=\"blue\", alpha=0.75, binwidth=0.05, kde=False, label='Infections from Weddings')\n", + "sns.histplot(props_df['Traces'], color=\"red\", alpha=0.75, binwidth=0.05, kde=False, label='Traced to Weddings')\n", + "plt.xlabel(\"Proportion of cases\")\n", + "plt.ylabel(\"Frequency\")\n", + "plt.title(\"Impact of Contact Tracing on Perceived Flu Infection Sources\")\n", + "plt.legend()\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, { "cell_type": "markdown", "id": "f418c720", @@ -193,7 +560,7 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "sampling-env", "language": "python", "name": "python3" }, @@ -207,7 +574,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.13.0" + "version": "3.11.13" } }, "nbformat": 4, From eb7acfff53540b4afc3274942b01e09e14df9cd6 Mon Sep 17 00:00:00 2001 From: Ofir Date: Wed, 7 Jan 2026 21:43:04 -0500 Subject: [PATCH 2/4] attached is my first sampling assigment --- 02_activities/assignments/a1_sampling_and_reproducibility.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/02_activities/assignments/a1_sampling_and_reproducibility.ipynb b/02_activities/assignments/a1_sampling_and_reproducibility.ipynb index 63a035b0..beda05fb 100644 --- a/02_activities/assignments/a1_sampling_and_reproducibility.ipynb +++ b/02_activities/assignments/a1_sampling_and_reproducibility.ipynb @@ -99,7 +99,7 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 22, "id": "ab8587a0", "metadata": {}, "outputs": [ From 60e1623efbba8292cb3f2193dcf9fef47e8085b8 Mon Sep 17 00:00:00 2001 From: Ofir Date: Thu, 15 Jan 2026 03:43:08 -0500 Subject: [PATCH 3/4] attached is my second assignment --- .../a2_survey_design_and_evaluation.md | 135 ++++++++++++++++-- 1 file changed, 122 
insertions(+), 13 deletions(-) diff --git a/02_activities/assignments/a2_survey_design_and_evaluation.md b/02_activities/assignments/a2_survey_design_and_evaluation.md index a955d827..16526304 100644 --- a/02_activities/assignments/a2_survey_design_and_evaluation.md +++ b/02_activities/assignments/a2_survey_design_and_evaluation.md @@ -40,28 +40,53 @@ For the **Canadian General Social Survey on Giving, Volunteering, and Participat ## Part A - Survey Design: -The number of your chosen topic: `#` +The number of your chosen topic: `2` Describe the purpose of your survey: ``` -write your answer here... +The Democratic Squirrel Institute (DSI) is a registered, professional survey institute with over 80 years of experience in public opinion studies and pre-election polling surveys across Canada. The institute is not identified with any political party and has provided its survey and polling services to all Canadian parties. This survey was commissioned by the Beaver & Moose Alliance Party (BMAP) in Canada. The purpose of the survey is to better understand what voters expect from the BMAP and its leader across an array of relevant areas, including welfare and social policies, economic policies, international relations, immigration, education, transportation, housing, climate change/environment, and healthcare. The survey is voluntary and confidential. Additionally, all data are anonymized before being shared with the client (BMAP): i.e., all information that could allow for the identification of participants is removed from the data, in accordance with the Privacy Act. ``` -Describe your target population, sampling frame, sampling units, and observational units: +Describe your target population, sampling frame, sampling units, observational units and overall sampling strategy ``` -write your answer here... +Target Population: Canadian citizens aged 18 years old and above who can prove their identity and address (voting eligibility criteria). 
+ +Sampling Frame: Participants will be recruited from the Public Puffin Panel (PPP) [the rival of Léger Canada]. PPP owns and operates an Internet panel of more than 4.5 million Canadians. Almost 88% of PPP's members have been recruited randomly over the phone since 2002 to ensure that PPP's members' distribution across various demographic characteristics is similar to that of the real Canadian population. This is an electronic and written survey. Accordingly, participants need to have internet access, a computer/mobile phone with a browser, and the ability to read and write in English and/or French. + +Sampling Units: Individuals (Canadian citizens, age >= 18, who are eligible to vote) who are members of the PPP, can read/write in French and/or English, and can access the internet and a browser during the data collection period. + +Observational Units: Individuals who meet the inclusion criteria above and who completed the survey. + +Overall Sampling strategy: We aim for a sample size between 4,000 and 10,000, which is typically used in pre-election polling to learn patterns within subgroups. We will use a stratified-sampling method with proportional allocation to better reflect the national population composition. We will first stratify our sample with proportional allocation by province (to ensure that larger provinces [e.g., Ontario, Quebec, British Columbia] are represented proportionally in the sample [e.g., participants from Ontario will comprise around 38.8% of the final sample]). Then, within each province, we will further stratify with proportional allocation by the following variables: age group (18-29, 30-44, 45-64, 65+), newcomer status (time in Canada less/more than 10 years), and gender (man, woman, nonbinary and other gender identities). Within each stratum, participants will be randomly selected so that the sample better reflects the proportions of these subgroups within each province. 
We chose these variables for stratification (first, province; then age, gender, and time in Canada), since research/literature shows their association with the outcome variables on which the survey collects data (e.g., positions on welfare, economics, climate, transportation). + ``` Your 5-10 question survey: ``` -1. write your question here... -2. write your question here... -3. write your question here... -4. write your question here... -5. write your question here... -6. write your question here... (optional) -7. write your question here... (optional) -8. write your question here... (optional) +1. Which issue is most important for the newly elected government to address first? + [choose one option from the following: a. housing; b. public transportation; c. daycare for children aged 0-5; d. immigration policy; e. environment/climate change; f. universal healthcare; g. socio-economic inequality; h. other: please specify] + +2. How satisfied are you with public transportation in your area of residence (considering accessibility, efficiency/frequency, cost, and maintenance)? +[scale 1-5, with one being very dissatisfied and five being very satisfied] + +3. How well do the current government's immigration policies support the Canadian economy? +[scale 1-5, with one being very poorly and five being very well] + +4. How important is the government's role in reducing socioeconomic gaps/inequalities? +[scale 1-5, with one being not at all important and five being extremely important] + +5. Do you support increasing taxes on high-income earners to fund public services? +[choose one option from the following: a. Yes; b. No; c. Unsure]. + +6. How important is the federal government's role in regulating housing prices? +[scale 1-5, with one being not at all important and five being extremely important] + +7. To what extent do you agree with the following statement: "As one of the leading economies, Canada should play a more active role on the international stage." 
+[scale 1-5, with one being strongly disagree and five being strongly agree] + +8. To what extent do you agree with the following statement: "The Canadian government should do more to comply with the Paris Agreement and aim for net-zero emissions by mid-century." +[scale 1-5, with one being strongly disagree and five being strongly agree] + 9. write your question here... (optional) 10. write your question here... (optional) ``` @@ -71,7 +96,91 @@ Your 5-10 question survey: Identify and describe survey features: ``` -write your answer here +1. Sample type = +The 2018 GSS GVP survey used a stratified probability sampling design, with a simple random sample without replacement within each stratum (household sampling). Then, a respondent from each household was selected using the age-order method to complete the survey. The age-order method is a probability procedure for selecting one participant (age >= 15) from a household. It aims to ensure a random selection without interviewer bias. However, it can be more time-consuming than other methods. It includes the following stages: list eligible members with age >= 15 in ascending order of age, assign each one an ordinal number, and generate a random number between 1 and the number of eligible members. The member whose ordinal number matches the random number is selected to complete the household’s survey. + +* The household is the sampling unit, while the selected individual from each sampled household is the unit of observation (to my understanding). + +2. Sample size = +The target sample size (i.e., the number of respondents excluding ‘rejected’ respondents) for the 2018 GSS GVP was 20,000, while the actual number of respondents (again excluding ‘rejected’ respondents) was 16,149. + +3. Target population = +All persons 15 years of age and older in Canada, excluding: Residents of the Yukon, Northwest Territories, and Nunavut; Full-time residents of institutions. + +4. Sampling frame = +Sampling frame data sources: +a. 
Lists of telephone numbers in use (both landline and cellular) available to Statistics Canada from various sources (telephone companies, Census of Population, etc.); and +b. The Address Register (AR) (i.e., the list of all dwellings within the ten provinces), which was used to group all telephone numbers associated with the same valid address. + +Procedure of sampling frame generation: +a. The AR accounted for 86% of the telephone numbers available to Statistics Canada (source a). Using the AR, telephone numbers were grouped by household [first landline, then cellular] to generate the sampling units [households, also referred to as “records” in the Public Use Microdata File]. The units of observation are the individuals who were selected from each household using an age-order sampling method (see my answer to 1). +b. The remaining 14% of telephone numbers that were not matched to the sampling frame were also included, and about 9% of these were grouped into households using address information from administrative sources. Each of the remaining telephone numbers constitutes a single record on the frame. + +5. Survey mode(s) +Respondents were given the option to complete the questionnaire online (referred to as rEQ – i.e., respondent-completed electronic questionnaire) or by telephone (referred to as iEQ – i.e., interviewer-assisted electronic questionnaire). + +6. Timeline +The 2018 GSS GVP is a cross-sectional survey. Data collection (i.e., interviews) was conducted from September 4th to December 28th, 2018. For most questions in the 2018 GSS GVP questionnaire, the reference period was the 12 months preceding the interview. + +7. Response rate +The overall response rate was 41.9%. + +8. 
Weights +The sample weight is defined in the Public Use Microdata File (PUMF) User Guide document as WGHT_PER (= the basic weighting factor for analysis at the person level, i.e., to calculate estimates of the number of persons (non-institutionalized and aged 15 or over) having one or several given characteristics). WGHT_PER is used for all person-level estimates. +To my understanding, WGHT_PER was calculated using the following formula: +Initial Household Weight x Factor 1 x Number of eligible household members +where: +- Initial Household Weight = (Total number of records in the stratum* on the survey frame)/(Number of records sampled in the stratum), i.e., the inverse of a record's probability of selection +*stratum = geographic area, i.e., one of the seven provinces +- Factor 1 = the adjustments done based on the response rate [to my understanding] +- Number of eligible household members (age>=15) + +* More weight adjustments were performed depending on the province, volunteer status, stratum, age and sex of the respondent. + +- I couldn’t find a total sampling weight [for the entire sample] since adjustments were explicitly made for the province. WGHT_PER has no single overall value; it is a factor for analysis at the person level and is used to calculate estimates of the total number of persons in the Canadian population (non-institutionalized and aged 15 or over) ***who possess specific characteristics*** [that is, each respondent record carries its own WGHT_PER value, which is applied whichever variable is being estimated] + +9. Data processing +Data processing is defined in PUMF as: “all data handling activities – automated and manual – that occur after collection and before the dissemination of estimates”. In the case of the 2018 GSS GVP, the following data procedures took place: +- Re/coding of data into existing harmonized systems of variables (e.g., Statistics Canada classifications of occupation, education, etc. or International Classification of Nonprofit Organizations). 
Data that could not be standardized were coded as “not specified.” +- Edit, cleanup, and imputation (please refer to question 10 below) +- Some variables were created via a derivation process [e.g., collapsing several variables together]. For example, PHSDFLG [= respondent has a spouse/partner in the household] was derived from the 2018 GSS GVP household roster and relationship +question. + +10. Cleaning, imputation, etc + +- Out-of-scope and non-responding records were dropped and excluded from analysis +- Records with missing or incorrect information were, in a small number of cases, corrected deterministically or imputed from other information on the questionnaire. When data were missing for a question that the respondent was expected to answer [and when the answer could be completed based on previous questions (i.e., “on path”)], the expected response was imputed. When the missing response seemed intentional (“off path”), the response was considered inappropriate or not of interest, and the current question was coded as ‘Valid Skip’. +- For item and partial non-response, donor imputation was done for certain variables (i.e., completing the missing data using a “donor” respondent, a similar enough respondent who donates the missing information). +- Variables were imputed in the following order: personal income and family income (not relevant for the 2018 GSS GVP survey), formal volunteering variables, informal volunteering variables, the donation file and the solicitation methods variables. + + +11. Sources of error: + + a. sampling errors: +-  Undercoverage of the population due to the sampling frame - may exclude people without a registered phone number in one of the sources available to Statistics Canada. 
+- Measurement error: using the definition of volunteer work from the 19th International Conference of Labour Statisticians (ICLS) may exclude people who were included in the target population [the definition says: “work performed by persons of working age who, during a short reference period, performed any unpaid, non-compulsory activity to produce goods or provide services for others”]. This definition may exclude retired/older adult participants who volunteer/work. In addition, “working age” is not defined. +- The use of the age-order method may result in over-representation of younger adults/youth (aged 15-24) in one province, which can lead to a reduced response rate for this province (assuming youth are less patient to complete such a long survey) + * The PUMF User Guide notes that, assuming survey estimates are normally distributed, it is possible to estimate the sampling error. For example, there is a 95% chance that the difference between the sample estimate and the true population value is less than two standard errors. + + b. non-sampling errors: + - social desirability - this is relevant for each social science survey/interview, especially in questions regarding the number of volunteering/donation hours + - questionnaire design - the questionnaire is very long (250 pages of survey!), which can cause attrition. + - using donor imputation in items/variables where the non-response was intentional + - not a representative sample due to under-representation of one subgroup (e.g., male respondents aged 15-24). + - Collapsing of variables/categories within variables may result in losing important data. + - Using an outdated AR/Statistics Canada phone list to create the sampling frame might exclude potential participants + - Mistakes during data collection and coding + - Interviewers misunderstanding the instructions + - Respondents’ inability to recall the precise answer. +  +12. 
Limitations, known biases, etc: +* Ordering landline numbers first and then cellular numbers when grouping phone numbers into households may result in overrepresentation of households with landlines in the final sample, since people have a smaller propensity to pick up a call from a survey institution on a cellular phone (especially for the first time). Accordingly, cellular-only households have a greater chance of being considered “non-response” records. +* Different provinces tend to have different response rates, and non-response rates vary with demographic characteristics. The PUMF provides the following example: “Non-respondents are often more likely to be males and more likely to be younger. In the responding sample, 2.6% of persons were males aged 15-24, while in the overall population, approximately 7.4% were males aged 15-24. Therefore, it is clear that unweighted sample counts cannot be considered to be representative of the survey target population”. + + +13. Link to documentation and any additional sources used: +https://www150.statcan.gc.ca/n1/pub/45-25-0001/cat5/c33_2018.zip + ``` ## Rubric From c079f4c4d13e5e116ef3136c1b144f947ce45670 Mon Sep 17 00:00:00 2001 From: Ofir Date: Thu, 15 Jan 2026 03:47:18 -0500 Subject: [PATCH 4/4] attached is my second assignment --- 02_activities/assignments/a2_survey_design_and_evaluation.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/02_activities/assignments/a2_survey_design_and_evaluation.md b/02_activities/assignments/a2_survey_design_and_evaluation.md index 16526304..6d616068 100644 --- a/02_activities/assignments/a2_survey_design_and_evaluation.md +++ b/02_activities/assignments/a2_survey_design_and_evaluation.md @@ -179,7 +179,7 @@ question. 13. Link to documentation and any additional sources used: -https://www150.statcan.gc.ca/n1/pub/45-25-0001/cat5/c33_2018.zip +https://www150.statcan.gc.ca/n1/pub/45-25-0001/cat5/c33_2018.zip ```
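
Note on answers 1 and 8: the age-order selection and the person-weight calculation described there can be sketched in a few lines of Python. This is only an illustrative sketch, not Statistics Canada's implementation. The function names `select_respondent` and `person_weight`, the example household, and all the numbers are hypothetical. The sketch also assumes that the initial household weight is the inverse of the selection probability (records on the frame divided by records sampled in the stratum) and that the response-rate adjustment is a single multiplicative factor.

```python
import random


def select_respondent(ages, rng):
    """Age-order method: pick one eligible household member (age >= 15)."""
    eligible = sorted(age for age in ages if age >= 15)  # list in ascending age order
    if not eligible:
        return None  # household has no eligible member
    position = rng.randint(1, len(eligible))  # random ordinal number between 1 and k
    return eligible[position - 1]


def person_weight(frame_records, sampled_records, nonresponse_factor, n_eligible):
    """Person-level weight before further adjustments (assumed, simplified form)."""
    # Assumption: initial weight = inverse of the record's selection probability.
    initial_household_weight = frame_records / sampled_records
    return initial_household_weight * nonresponse_factor * n_eligible


rng = random.Random(2018)  # fixed seed so the selection is reproducible
household = [8, 17, 42, 45]  # hypothetical ages; the 8-year-old is not eligible
chosen = select_respondent(household, rng)

# Hypothetical stratum: 50,000 records on the frame, 1,000 sampled,
# a non-response adjustment of 2.0, and 3 eligible household members.
weight = person_weight(frame_records=50_000, sampled_records=1_000,
                       nonresponse_factor=2.0, n_eligible=3)
print(chosen, weight)
```

Multiplying by the number of eligible members reflects that only one person per household is selected, so a respondent from a larger household stands in for more people.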