Quick Introduction to Linear Regression:
Last week, we covered correlation analyses – which allow us to test for associations between two numeric variables. I mentioned in class that it’s important not to confuse correlations with regressions, which represent a statistical model of how variable Y varies as a function of variable X. Rather than just stating that Y and X tend to covary in the same direction or not, we can build a model that specifically allows us to predict values of Y based on values of X.
Linear regressions work on two different types of study designs: randomly sample individuals from a population where both variables are measured and situations were variation in the explanatory variable X has been manipulated to several different measured levels. In either case, linear regressions work by fitting a best-fit line through the data via the method of “least squares,” resulting in parameter estimates for both the slope and intercept of this line. Confidence interval and hypothesis tests can then be calculated on these estimates. Both linear regressions and ANOVAs are types of “generalized linear models” – so both work though the same basic framework in R: using lm(y~x) to specify the statistical model and then obtaining the relevant information from the resulting liner model object.
An example from the book:
The lion’s nose
Managing the trophy hunting of African lion in an important part of maintaining viable lion populations [my note: that is true where hunting is permitted and thus drives conservation decisions. However, in parts of African where trophy hunting it is strictly forbidden, like Kenya – populations of predators (and ecosystems as a whole) do much better. While hunting is a net conservation benefit in certainly societies that have otherwise abandoned their stewardship of the environment, the idea that managed trophy hunting is the only path forward for Africa’s threatened megafauana is patently nonsense. Our text book authors’ aren’t implying that – I just didn’t want anyone to get the wrong idea]. Knowing the ages of male lion helps, because removing males older than six years has little impact on lion social structure, whereas taking young males is more disruptive. Whitman et al. (2004) showed that the amount of black pigmentation on the nose of male lions increases as they get older and so might be used to estimate the age of unknown lions. Our data from the example come from the known age and pigmentation level of 32 male lions from Tanzania. Let’s use linear regression to determine a predictive relationship between pigmentation and age. MY NOTE: The pigmentation here is reported as “proportion black.” This represents the proportion of nose area assessed as black. While this is area proportion is certainly bound between 0 and 1 – and perhaps prone to issues of right or left skew due to those bounds – it is not a proportion in the sense of a relative frequency of binary events (no one was counting individual nose skin cells here). It is therefore a truly continuous numeric variable, so we don’t have to be as worried about inherent violations of normality and with proportions based on counts.
First, load in and attach the dataframe in lion_noses.txt in which ever manner you prefer at this point. Call the dataframe “lion”
Now, let’s make a scatter plot to examine the data:
plot(ageInYears ~ proportionBlack)
##Note that if you use the tilde (~) between the variables, the first variable appears on the y-axis (as you want in a linear model specified that way). However, if you just use a comma (,) to separate them [i.e. plot(ageInYears , proportionBlack)], the first variable comes up on the x-axis.
Next, we’re going to produce the linear regression of age as a function of black pigmentation. We can call up different components of that resulting linear model object to answer specific questions we might have about the regression. We’re going to use the basic form of: named_model<-lm(y~x)
To construct the model:
lionRegression <- lm(ageInYears ~ proportionBlack) We now have linear model object named “lionRegression.” The most straightforward way to view the output is through the summary(model) function. Summary(lionRegression) This reveals a table with much of the relevant information:1) IQR information about the residuals from the model ( the variation in age not explained by pigmentation – which turns out to be linear distance on the y-axis of each point from the fitted line), 2) the estimated intercept and slope (here named as the variable “proportionBlack,” standard error of those estimates, and the results of t-tests for significance for each, and 3) information about the overall variation explained by the full model.
We can now add the fitted regression line to the plot.
plot(ageInYears ~ proportionBlack)
abline(lionRegression)
We can also use the predict function to predict values of Y based on hypothetical values of X from our model
predict(lionRegression, data.frame(proportionBlack = 0.50), se.fit = TRUE) This returns the predicted age of a lion with 50% nose pigmentation, along with the standard error of that prediction. We can extract the residuals from the model for different downstream uses: lion_residuals<-resid(lionRegression) lion_residuals
We can also view the breakdown of variance in our model in the familiar ANOVA table format:
anova(lionRegression)
Here the “residuals mean square” is equivalent to the error mean square from ANOVA.
We can calculate a confidence interval for our slope and intercept estimates:
confint(lionRegression)
We can even plot the uncertainty in slope estimates by placing confidence bands (or envelopes) around our fitted line. This part gets a bit tricky. We’re going to do this by calculated confidence intervals around predicted values for a set of hypothetical pigmentation values across the full range of our data, and then add those to the plot with the “lines” function”.
xpt <- seq(min(lion$proportionBlack), max(lion$proportionBlack), length.out = 100) Here, we created a variable of 100 values between the minimum and maximum observed values of pigmentation. Now, we’re going to feed those into the predict function and produce a new data frame of predicted values (with their confidence intervals) for each one. ypt <- data.frame( predict(lionRegression, newdata = data.frame(proportionBlack = xpt), interval = “confidence”) ) Now, we’re going to use the $ symbol to subset out the parts of that data frame that we want to plot: lwr, the lower confidence limit and upr, the upper confidence limit. Just to make sure things run smoothly, I’m also going to suggest that we replot the scatter plot and line here first.
plot(ageInYears ~ proportionBlack)
abline(lionRegression)
lines(ypt$lwr ~ xpt, lwd = 1.5, lty = 2) lines(ypt$upr ~ xpt, lwd = 1.5, lty = 2)
ASSIGNMENT:
For out last homework assignment of the semester (!), I want to you to take a crack at expanding the text book example above in a new direction that will tie into some of the last topics we cover in the course: including more than one explanatory variables in your statistical models. To help get you started with this, the basic R syntax for this is to use operators (mostly + and * ) to add additional variable to your standard lm model. Adding variable z with + includes just the main effect of z on y. Adding z with * includes both the main effect of z and the interaction between z and x.
e.g.
model_main_effects<- lm (y ~ x + z)
model_w_interaction_term<- lm (y ~ x * z)
Carrying out the analyses here should represent a fairly straightforward extension of the example above with these new dimensions. However, interpreting the resulting model summary table will require some extra though. Please take the time to think through what this table means and what it says about our data. I’m not expecting you to completely figure it out on your own, but I do want you to give it an honest effort. Spending some time thinking through the interpretation of multiple variables and interaction terms this week will put you on a better footing for understanding that topic next week in class.
“A RAPIDLY EVOLVED SHIFT IN LIFE HISTORY TIMING DURING ECOLOGICAL SPECIATION IS DRIVEN BY THE TRANSITION BETWEEN DEVELOPMENTAL MODULES”
The data from this week assignment comes from one of my own studies that I’m currently prepping for publication. The basic question here to test for potential evolutionary constraints on rapid seasonal adaptation.
Here are a couple of excerpts that help set up this study:
______________________________________________________________________
Phenological adaptation may play an important role in both the origin and maintenance of biodiversity in seasonal environments. Divergent temporal adaptation can be a potent driver of reproductive isolation during ecological speciation (Taylor and Friesen 2017), and altered seasonality is a critical consequence of global climate change for many ecological communities (Miller-Rushing et al. 2010). Understanding how evolution shapes the regulation of seasonal timing has important implications for the tempo of ecological diversification and the potential of biotic communities to persist in the face of global change…
Here, we investigate how these two Rhagoletis host races have shifted their seasonal life history strategies from later to earlier adult emergence with different types of tactics governing eclosion timing in the context of developmental modularity. We used respirometric phenotyping to compare the post-winter metabolic rate trajectories of apple and hawthorn flies from sympatric populations under controlled laboratory conditions … The trajectories of metabolic rate changes were identified with functional parameters representing the baseline metabolic rate during the diapause maintenance module, timing of termination of the maintenance module, the metabolic rate plateau that coincides with early adult morphogenesis during the post-diapause development module, plateau duration, and the exponential increase in metabolic rate that corresponds to late adult morphogenesis in the post-diapause development module (Figure 1). Thus, each functional parameter describes a modular component of the diapause program and we evaluated their assembly with a model of adult emergence phenology as a function of diapause module parameters across host races.
Fig. 1 , example of triphasic metabolic trajectory, where metabolic rate R is modeled as:
______________________________________________________________________________
Our assignment this week picks up with these fitted metabolic rate trajectories. We want to ask how the different components of this trajectory influence the life history timing of these insects (based on when adult flies emerge or eclose from their puparia). The attached data frame “metabolic_trajectories” includes the response variable “Eclosion” (simply the number days after removal from simulated winter that adult flies emerge) and a number of the fitted parameters for to the trajectory model above.
**EDIT** Please use the .txt version of the file rather than the .xlsx spreadsheet version I originally posted. An Excel formatting quirk was leading to R Studio misinterpreting the “termination” variable as a categorical factor – leading not all kinds on nonsensical results. Loading the straight text file should completely take care of this***
Your task is to build a linear regression model to determine how Eclosion varies as a function of the variables “termination”, “baseline”, and “PostD_Dev” and their interaction terms. Your output should include:
- The model summary table for your multiple-variable regression.
- An ANOVA table for your regression model
- A brief paragraph giving your interpretation of these results (don’t worry about getting it spot on – again, I just want you to think through the issues here). Try to answer the question “which of these variables drives variation in adult eclosion timing, and are some more important than others?”
- Scatter plots with confidence bands for the variables that have a significant on Eclosion. Because we’re limited to two dimensions in a normal scatter plot, if we have more than one significant variable, we have to handle them separately on the x-axis in different plots.