You need to choose a data set with which you can explore the concepts learned in your theory class, using R and the data management tools learned in the lab.
I expect you to choose a dataset with more than 500 observations. Pick a dataset simple enough that you can manipulate it and understand its content, but complex enough that you can use it for several different data analyses and visualizations. You have to choose your own dataset.
Here are the requirements and guidelines for this first deliverable:
- You have to submit an HTML document that is the result of knitting an R Markdown document.
- This document should contain an introduction explaining why you chose your data set and what you plan to investigate using it (basically a short research proposal).
- You should state the source of your data set (I provide a list of good data sources below; however, you are free to choose a data set from any source or on any topic of interest to you).
- You have to load your dataset into R and summarize it and/or show its structure, using summary(dataframe) or str(dataframe).
- Your document should contain your code and output.
- At least 70% of your code should be commented explaining what each line of code does.
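As a rough illustration of the loading, summary, and commenting requirements above, here is a minimal sketch. It assumes your data arrive as a CSV file; the built-in `quakes` data set is written to a temporary file only as a stand-in for a file you downloaded, and the object names (`csv_path`, `earthquakes`, `n_obs`) are hypothetical.

```r
# Stand-in for a downloaded file: write the built-in quakes data to a temporary CSV
csv_path <- tempfile(fileext = ".csv")
write.csv(quakes, csv_path, row.names = FALSE)

# Read the CSV file into a data frame
earthquakes <- read.csv(csv_path)

# Count the observations and check the "more than 500" requirement
n_obs <- nrow(earthquakes)
n_obs > 500

# Show the structure: variable names, types, and the first few values
str(earthquakes)

# Show summary statistics for every variable
summary(earthquakes)
```

With your own data you would replace `csv_path` with the path to your file and keep a comment above each step, as in the sketch.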
Some data sources (you don’t need to choose your data set from this list, but these are great suggestions):
U.S. Government data https://www.data.gov/
The Data and Story Library https://dasl.datadescription.com/
United Nations Data http://data.un.org/
This article provides you with a bunch of data resources https://www.springboard.com/blog/free-public-data-sets-data-science-project/
UNICEF offers statistics on the situation of women and children worldwide.
World Health Organization offers world hunger, health, and disease statistics.
World Bank Data offers hundreds of datasets spanning many decades, sortable by topic or country. Data are downloadable in Excel or XML formats.
Here are the common mistakes that I have observed when students choose a data set for the final project.
- Data structure. Your data set has to have a cross-sectional data structure. That means observations (individuals or subjects of analysis) in the rows and attributes or variables in the columns.
- Your variables are mostly characters. Either you imported the data into R incorrectly and it is reading your numerical variables as characters (this often happens when a single entry in a numerical column contains a non-numeric character, so R treats the whole variable as character), or your variables really are mostly categorical. In that case, think about whether you want your regression analysis to include only categorical (factor) variables.
- See the guidelines for the final project in the Rmd template that I am preparing, and use that template to create your project.
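To make the character-variable mistake in the list above concrete, here is a small sketch. The data are made up for illustration, and the assumption is that missing values were typed as "n/a" in the original file; your own file may use a different marker.

```r
# Hypothetical raw file contents: the third income entry is "n/a", not a number
raw_text <- "id,income\n1,1200\n2,1550\n3,n/a\n4,980"

# Default import: the stray "n/a" makes R treat the whole income column as non-numeric
survey_bad <- read.csv(text = raw_text)
is.numeric(survey_bad$income)

# Tell read.csv which string marks a missing value, so income is read as a number
survey_ok <- read.csv(text = raw_text, na.strings = "n/a")
is.numeric(survey_ok$income)

# The missing entry becomes NA, so it has to be excluded explicitly in calculations
mean(survey_ok$income, na.rm = TRUE)
```

Checking `str()` right after importing is usually the quickest way to notice this problem.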
Project Template:
# Introduction: Research Idea
Describe your research idea, give a broad idea of the data and the statistical analysis that you want to perform, and describe the dependent and independent variables.
# The data set
Briefly describe your data source and your data, and explain why you think these data are adequate to answer your research question. Do the required manipulations so that your variables have self-explanatory names; if that is not possible, briefly describe those variables.
Choose only the variables that are part of your analysis and create a data set with those. Do not print your whole dataset in the R Markdown document.
If you have issues with your data structure, it might be because the data are not in the right format. Use the package tidyr, which allows you to change and improve the format of your dataset. Its documentation also shows how to change the names of your variables.
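The selecting, renaming, and reshaping steps described above could look roughly like this. It is only a sketch under stated assumptions: `gdp_wide` and all of its column names are invented, and the reshaping you need depends on how your own file is laid out.

```r
library(dplyr)  # select(), rename(), mutate()
library(tidyr)  # pivot_longer()

# Hypothetical wide table: one row per country, one column per year, plus an unused column
gdp_wide <- data.frame(
  cntry = c("A", "B", "C"),
  X2019 = c(1.2, 3.4, 2.2),
  X2020 = c(1.1, 3.0, 2.5),
  notes = c("", "", "")
)

gdp_long <- gdp_wide %>%
  # Keep only the variables that are part of the analysis
  select(cntry, X2019, X2020) %>%
  # Give the country variable a self-explanatory name
  rename(country = cntry) %>%
  # Turn the year columns into rows: one row per country and year
  pivot_longer(
    cols = c(X2019, X2020),
    names_to = "year",
    names_prefix = "X",
    values_to = "gdp_growth"
  ) %>%
  # Store the year as a number instead of text
  mutate(year = as.integer(year))

# Show only the first rows instead of printing the whole data set
head(gdp_long)
```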
# In this chunk of code load your data set into Rmarkdown
# str(data)
# use stargazer to do a table of the summary statistics of your dataset
# summary(data)
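A chunk following the comments above might be sketched as below. The built-in `quakes` data set and the name `analysis_data` are stand-ins for your own data; `type = "text"` is used so the table is readable in the console, and when knitting to HTML you would typically switch to `type = "html"` together with the chunk option `results='asis'`.

```r
library(stargazer)  # formatted summary tables

# Stand-in data: keep only the variables used in the analysis
analysis_data <- quakes[, c("depth", "mag", "stations")]

# Show the structure of the data set
str(analysis_data)

# Table of summary statistics (n, mean, sd, min, max) for each variable
stargazer(analysis_data, type = "text", title = "Summary statistics", digits = 2)
```

Note that stargazer expects a plain data frame, so a tibble may need `as.data.frame()` first.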
# Graphs
Explain some of the important descriptive statistics of your data set by using data visualization.
Remember to use ggplot2.
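As one possible starting point for this section, here is a sketch of two ggplot2 graphs. It uses the built-in `quakes` data as a stand-in; the choice of variables, the bin width, and the object names are assumptions made only for illustration.

```r
library(ggplot2)  # grammar-of-graphics plotting

# Histogram: distribution of a single numerical variable
mag_hist <- ggplot(quakes, aes(x = mag)) +
  geom_histogram(binwidth = 0.1, fill = "steelblue", color = "white") +
  labs(title = "Distribution of earthquake magnitude",
       x = "Magnitude", y = "Number of earthquakes")

# Scatter plot: relationship between two numerical variables
mag_scatter <- ggplot(quakes, aes(x = mag, y = stations)) +
  geom_point(alpha = 0.4) +
  labs(title = "Reporting stations by magnitude",
       x = "Magnitude", y = "Number of reporting stations")

# Display the graphs
print(mag_hist)
print(mag_scatter)
```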
# Your analysis
This is where you carry out your research idea. Here you can show:
Correlations
Means
Variances
Standard deviations
Samples
Whatever you are planning to investigate with your data, do it here. Remember to use the package dplyr when possible.
If you make tables, remember to use stargazer and/or kable where possible.
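The quantities listed above can be computed with dplyr along these lines. This is a sketch on the built-in `quakes` data; the 300 km cut-off used to define `depth_group`, the sample size of 200, and all object names are arbitrary assumptions for the example.

```r
library(dplyr)  # data manipulation verbs

# Draw a reproducible random sample of 200 observations
set.seed(123)
quake_sample <- quakes %>% slice_sample(n = 200)

# Means, variances, and standard deviations of magnitude by depth group
depth_summary <- quakes %>%
  # Hypothetical grouping: earthquakes shallower than 300 km versus the rest
  mutate(depth_group = if_else(depth < 300, "shallow", "deep")) %>%
  group_by(depth_group) %>%
  summarise(
    n = n(),               # number of observations in the group
    mean_mag = mean(mag),  # mean magnitude
    var_mag = var(mag),    # variance of magnitude
    sd_mag = sd(mag)       # standard deviation of magnitude
  )
depth_summary

# Correlation between magnitude and number of reporting stations
mag_stations_cor <- cor(quakes$mag, quakes$stations)
mag_stations_cor
```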
Note that in the chunk of code below, if you use stargazer you need to use the option results='asis'.
```{r, results='asis'}
# your stargazer table goes here
```
# Inference
Do some hypothesis testing and build confidence intervals to support the claims about your data that you made in the previous section.
Explain your results
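For the inference section, a minimal sketch with base R's `t.test()` is shown here. It again uses the built-in `quakes` data; the null value of 4.5 and the 300 km split are assumptions chosen only to have something to test, not hypotheses you are expected to use.

```r
# One-sample t test: is the mean magnitude different from a hypothetical value of 4.5?
mag_test <- t.test(quakes$mag, mu = 4.5, conf.level = 0.95)

# 95% confidence interval for the mean magnitude
mag_test$conf.int

# p-value of the test
mag_test$p.value

# Two-sample (Welch) t test: do shallow and deep earthquakes differ in mean magnitude?
shallow_mag <- quakes$mag[quakes$depth < 300]
deep_mag <- quakes$mag[quakes$depth >= 300]
depth_test <- t.test(shallow_mag, deep_mag)

# Confidence interval for the difference in means
depth_test$conf.int
```

In your write-up, state the null and alternative hypotheses in words and say what the interval and the p-value imply for your research question.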