All hypothesis tests should include the hypotheses, the test statistic, the p-value or critical value, the decision, and the conclusion. Please minimize the listing of computer output and avoid excessive use of appendices in reporting your results. Summarize the results of each regression model simply by displaying the regression equation, the coefficients and their standard errors, as well as the usual summary statistics such as the standard error, R-square and R-square(adj).

A policy analyst for the Ontario Ministry of Education wanted to determine what relationships between income and the aggregate level of education might be used to encourage students to stay in school. Although there were potential problems with interpreting relationships based on aggregate data, she decided to begin with data from the 2011 National Household Survey. She collected data for the 1075 census tracts in the Toronto area and took a random sample of 250 observations, before compiling a dataset with the following variables:

CensusT: identifying code for the census tract
P_hsgrad: the proportion of adults with high school graduation
P_trades: the proportion of adults with qualifications in a trade
P_collcert: the proportion of adults with a college certificate
P_univdipl: the proportion of adults with a university diploma (no degree)
P_univdegr: the proportion of adults with a university degree
MedInc: the median employment income for individuals above 15 years
AvgInc: the average employment income for individuals above 15 years
MedInc*: the median employment income, with missing values

Note that each proportion tracks the number whose highest level of education is as indicated, and the categories are mutually exclusive. The data are in the files toronto.mtw and toronto.xlsx.

(a) Plot the average incomes against the median incomes. What two words would best describe the shape of income distributions in general?
(b) Perform a multiple regression analysis using the five educational variables as predictor variables and the median income (MedInc) as the response variable.
(c) For the regression model in (b), graph the standardized residuals against the fitted values and comment on whether the linear regression model assumptions are warranted.
(d) The MedInc* variable copies the data from the MedInc variable, but a missing value code has been inserted for a number of census tracts. Examine the MedInc* data and describe the nature of these census tracts. (Hint: look at the standardized residual values for the “unusual observations”.)

The remaining questions pertain to regression models based on the MedInc* variable and not the original MedInc variable. The elimination of these observations means that subsequent models may not predict the median incomes well for these unusual census tracts.

(e) Re-estimate the multiple regression model using MedInc* as the new response variable. For this regression, save the standardized residuals and the fitted values, calculate the Variance Inflation Factors (click the Options button), and plot the standardized residuals against the fitted values. Are there any particular problems with multicollinearity?
(f) What changes do you notice, comparing this model with the previous model?
(g) Examine the standardized residuals using a histogram, a boxplot or a normal probability plot. Is the assumption of normally distributed errors reasonable? Explain briefly.
(h) Plot the standardized residuals against the fits. Do you see any other problems with the model assumptions?
(i) Calculate the correlation coefficient between the fitted values and the MedInc* variable. Show the relationship between this correlation coefficient and the value of R-square.
(j) Perform an F-test for the overall usefulness of the model, using the 1% level of significance. What do you conclude?
(k) Using the model estimated in part (e), test the marginal usefulness or importance of the P_univdegr variable, given the other variables in the model, using the 1% level of significance.
(l) If the proportion of university graduates were to increase by 0.1 in a set of census tracts (that is, from 0.1 to 0.2 or from 0.3 to 0.4, as the case may be), assuming the other predictor variables remain constant, what is the estimated average increase in the median incomes for these census tracts? (Give an estimate using a 99% confidence level.) Would you conclude that a university degree is beneficial in terms of increasing aggregate incomes?
(m) Regress MedInc* against only the P_univdegr variable and find the estimated slope of the regression line. Is the coefficient of the P_univdegr variable in the simple regression model consistent with the slope estimate in the multiple regression model? Explain briefly why they might differ.
(n) Finally, use the model estimated in part (e) to calculate a 99% prediction interval for the actual median income in census tract 61.00. (In the student version of Minitab, click the Options button, then copy and paste the values of the predictor variables for this census tract, making sure there are only spaces between the numerals. In Minitab 17, select “Predict” under “Regression”.) Show manually how the standard error for the prediction interval is calculated using the standard error for the confidence interval and the standard error of the regression estimate.
(o) Explain why you would not expect the prediction interval to cover the actual median income for this census tract.
(p) Re-estimate the multiple regression model, but this time drop the variable that is the least useful. Explain whether this improves the fit.
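
Several of the quantities requested in parts (e), (i), (j) and (n) follow from standard least-squares formulas, and it may help to see them computed outside Minitab. The following is a minimal Python sketch, not part of the assignment. It assumes NumPy is available and uses synthetic stand-in data, since the toronto.xlsx values are not reproduced here. The helper name fit_ols, the coefficient values and the noise level are all hypothetical, so the printed numbers say nothing about the Toronto census tracts.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept.

    Returns the design matrix, the fitted values, SSE and R-square.
    """
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    fitted = A @ beta
    sse = float(np.sum((y - fitted) ** 2))
    sst = float(np.sum((y - np.mean(y)) ** 2))
    return A, fitted, sse, 1.0 - sse / sst

# Synthetic stand-in data (assumption): NOT the values in toronto.xlsx.
rng = np.random.default_rng(0)
n, k = 250, 5                                   # observations, predictors
X = rng.uniform(0.0, 0.2, size=(n, k))          # stand-ins for the five proportions
made_up_coefs = np.array([5000.0, 8000.0, 12000.0, 15000.0, 40000.0])
y = 20000.0 + X @ made_up_coefs + rng.normal(0.0, 3000.0, size=n)

A, fitted, sse, r_squared = fit_ols(X, y)

# Part (e): VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on the remaining predictors.
vifs = [1.0 / (1.0 - fit_ols(np.delete(X, j, axis=1), X[:, j])[3])
        for j in range(k)]

# Part (i): the squared correlation between fitted and observed values
# equals R-square when the model includes an intercept.
r = float(np.corrcoef(fitted, y)[0, 1])

# Part (j): overall F statistic with k and n - k - 1 degrees of freedom.
f_stat = (r_squared / k) / ((1.0 - r_squared) / (n - k - 1))

# Part (n): standard error for a prediction at x0 (here, the first row).
s = float(np.sqrt(sse / (n - k - 1)))           # standard error of the regression
x0 = A[0]
leverage = float(x0 @ np.linalg.solve(A.T @ A, x0))
se_fit = s * np.sqrt(leverage)                  # SE used for the confidence interval
se_pred = float(np.sqrt(s ** 2 + se_fit ** 2))  # SE used for the prediction interval

print(f"r^2 = {r ** 2:.6f}, R-square = {r_squared:.6f}")
print(f"F = {f_stat:.2f}, VIFs = {np.round(vifs, 3)}")
print(f"s = {s:.1f}, SE fit = {se_fit:.1f}, SE prediction = {se_pred:.1f}")
```

With the real data, rows where MedInc* is missing would need to be dropped before fitting, so n would be smaller than 250. The 99% prediction interval is then the fitted value plus or minus the t critical value (n - k - 1 degrees of freedom) times the prediction standard error.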