Stat-wizards
Peer review by: ggplotheads
Names of team members that participated in this review: Lindsay Gross, Jake Marrs, Sydney Fox, Ben Yacht, Noah Hirshfield
Describe the goal of the project.
The goal of the project is to analyze the growth in salaries of tech jobs over time and to see whether the results differ for companies in the US vs. outside of the US. The project hypothesizes that jobs in the machine learning/AI sector will show more substantial salary growth than data scientist/analyst jobs, and that tech jobs in the US will pay more than tech jobs abroad.
Describe the data used or collected, if any. If the proposal does not include the use of a specific dataset, comment on whether the project would be strengthened by the inclusion of a dataset.
The data is from ai-jobs.net. It was collected from individuals in the AI/ML/Data Science space.
The data shows various observations (each representing a person) and includes their job and experience levels, as well as the location and size of the hiring company, and their salary.
Describe the approaches, tools, and methods that will be used.
The project compares mean and median salaries for specific jobs. It also uses linear and logistic regressions, and a bootstrap test.
Provide constructive feedback on how the team might be able to improve their project.
Make sure your feedback includes at least one comment on the statistical reasoning aspect of the project, but do feel free to comment on aspects beyond the reasoning as well.
If you are going to use salary in your graphs, we would highly recommend not showing the axis values in exponent (scientific) notation. It is not very legible.
Consider changing the units by mutating the data
(e.g. 1 on the graph = $1,000), with the axis label "Salary (in thousands)"
Because some code is redundant, I’d recommend adding more code chunks
Probability could be considered
Include H0 and Ha (the null and alternative hypotheses) when conducting a hypothesis test, and expand more on the bootstrap distributions
Specify axis titles. For example, the x-axis title “Year” is ambiguous: is this how many years into the job they are, or the calendar year? Be clear and concise
It would be useful to add more explanation throughout the project, rather than including a longer explanation at the end. This would make your graphs and code chunks easier to understand and would give the reader a better grasp of what the goal of each code chunk is
It could be interesting to discuss how there seem to be stark outliers for data analysts and data scientists in year 0 and year 1, but not as many in the later years
The project includes a correct bootstrap, but then does not use the confidence interval to explain anything. Further analysis/explanation of this confidence interval would improve this section of the project.
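To illustrate the kind of interpretation we mean, here is a minimal sketch of a percentile bootstrap confidence interval for a difference in mean salaries. It is written in Python with only the standard library, not in the project's own code, and the salary values, the function name `bootstrap_diff_ci`, and the group names are made up for illustration; they are not taken from the project's data.

```python
import random
import statistics

def bootstrap_diff_ci(a, b, reps=5000, level=0.95, seed=1):
    """Percentile bootstrap CI for mean(a) - mean(b).

    Each group is resampled independently, with replacement.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(reps):
        resample_a = rng.choices(a, k=len(a))
        resample_b = rng.choices(b, k=len(b))
        diffs.append(statistics.mean(resample_a) - statistics.mean(resample_b))
    diffs.sort()
    alpha = 1 - level
    lower = diffs[round(alpha / 2 * (reps - 1))]
    upper = diffs[round((1 - alpha / 2) * (reps - 1))]
    return lower, upper

# Hypothetical salaries in thousands of dollars (NOT the project's data).
us = [150, 142, 165, 130, 175, 160, 148, 155]
non_us = [90, 110, 85, 120, 95, 105, 100, 88]

lower, upper = bootstrap_diff_ci(us, non_us)
print(f"95% CI for the US minus non-US mean salary: ({lower:.1f}, {upper:.1f})")
```

An interval like this could then be tied back to the hypotheses: if the whole interval lies above 0, the data are consistent with US jobs paying more on average, and the endpoints give a plausible range for how much more.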
What aspect of this project are you most interested in and would like to see highlighted in the presentation?
We would like this group to make more visualizations and perform more tests on how machine learning and data scientist salaries compare over time (as machine learning has become more popular). We think this connects the dataset well to current events and trends. It would also be interesting to use this data to predict whether AI jobs will be higher paying in the future, given the current trend.
Were you able to reproduce the project by clicking on Render Website once you cloned it? Were there any issues with reproducibility?
Yes, this project can be reproduced via this method. There were no issues with reproducibility.
Provide constructive feedback on any issues with file and/or code organization.
We believe that much of this code is redundant and therefore unnecessary, specifically the mean and median calculations for each job within and outside of the United States. These could have been produced by one or two code chunks using group_by instead of computing each value separately.
For the means and medians section, you could use group_by on job title to get the means for every job title at once rather than filtering for each one individually.
Other redundant code could be made more concise or reduced to single code chunks.
For example, in the code chunk labeled “means”, you could combine the first block of code with the second block of code to make it less redundant.
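The grouping idea above can be sketched as follows. This is only an illustration in Python's standard library, not the project's code: the records, the field names (`job_title`, `in_us`, `salary`), and the function name `summarize` are all assumptions. In the project itself, a single group_by on job title and location followed by one summarize step would play the same role.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical records (NOT the project's data); field names are assumed.
records = [
    {"job_title": "Data Scientist", "in_us": True, "salary": 150000},
    {"job_title": "Data Scientist", "in_us": True, "salary": 130000},
    {"job_title": "Data Scientist", "in_us": False, "salary": 80000},
    {"job_title": "Data Scientist", "in_us": False, "salary": 90000},
    {"job_title": "Data Scientist", "in_us": False, "salary": 100000},
    {"job_title": "ML Engineer", "in_us": True, "salary": 160000},
    {"job_title": "ML Engineer", "in_us": True, "salary": 180000},
    {"job_title": "ML Engineer", "in_us": True, "salary": 200000},
    {"job_title": "ML Engineer", "in_us": False, "salary": 95000},
]

def summarize(rows):
    """Compute mean, median, and count of salary per (job_title, in_us) group."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["job_title"], row["in_us"])].append(row["salary"])
    return {
        key: {"mean": mean(vals), "median": median(vals), "n": len(vals)}
        for key, vals in sorted(groups.items())
    }

summary = summarize(records)
for (title, in_us), stats in summary.items():
    place = "US" if in_us else "non-US"
    print(f"{title} ({place}): mean={stats['mean']}, median={stats['median']}, n={stats['n']}")
```

One pass over the data produces every job-by-location summary, so adding a new job title requires no new code.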
What have you learned from this team’s project that you are considering implementing in your own project?
Incorporating both means and medians in our data analysis
Seeing the inefficiency of some of their code showed us that we could combine some of our own code to make it more concise
(Optional) Any further comments or feedback?
Maybe mix up the color schemes and types of visualizations to make the project more visually appealing to the audience
Overall, very interesting, and we are excited to see your presentation!! :) <3