Project 4: Correlation

Get instructions for creating trendlines:

These links will open a new tab or window on your browser.

Sample
Size
Confidence
90%95%99%
40.9000.9500.990
50.8050.8780.959
60.7290.8110.917
70.6690.7540.875
80.6210.7070.834
90.5820.6660.798
100.5490.6320.765
120.4970.5760.708
150.4410.5140.641
200.3780.4440.561
250.3370.3960.505
300.3060.3610.463
400.2640.3120.403
500.2350.2790.361
Worksheet 1 was a project on linear regression. The regression formula will always find a best fitting line equation, whether there is a significant link between variables (strong correlation) or not (no correlation). In addition to finding this trendline equation, most of these same software packages can also calculate the Pearson Correlation Coefficient, which measures how significant a correlation pattern is.

A correlation coefficient of zero indicates no correlation, while a coefficient of 1 (positive) or –1 (negative) indicates a perfect correlation. If the value of the coefficient is far enough away from zero, compared to a typical correlation coefficient from randomly generated points, the correlation is said to be significant. The more points in the data set, the less strong the correlation must be in order to be significant, because random points are less likely to line up when you have more points.

Whether you are working with a spreadsheet program or a statistics website, these programs will provide a value of “R-squared” rather than the actual coefficient r. To find r, take the square root of this number. Use the positive square root if the trendline has a positive slope, and use the negative square root if the trendline has a negative slope

Part 1

r =

For practice, create five random points using pairs of 20-sided die rolls (if you do not have a 20-sided die handy, use your spreadsheet’s RANDBETWEEN function to create random numbers from 1 to 20 or use random.org). Graph these points, and find the value of the correlation coefficient, r, to three decimal places. Finally, compare this value to the table above to check for significance at each of the three levels

Part 2

    Population vs. Medals

    r =


    GDP vs. Medals

    r =

  1. From worldometers.info, copy the top 30 rows of data from a table containing: Country, GDP (nominal), and Population. It is probably easier to copy everything, then delete the columns you do not need.
  2. Gather information on the number of total Olympic medals won in the most recent summer and winter Olympics combined for these 30 countries. Using lists from a source that can be sorted on country name, like Wikipedia, will make this easiest.
  3. Generate the trendline and the “R-squared” value for two scatter plots (you may need to copy or rearrange the column order to do this): One comparing a country’s population to the number of medals won, and another comparing a country’s GDP to the number of medals won (always use medals won as the second column in the scatter plot). For each scatter plot, calculate the values of the correlation coefficient, r, and compare these to values in the table above to check for significance
  4. Which had the stronger correlation with total medals won, GDP or population?
    Why do you think this variable is so much better than the other in predicting Olympic success?

Once completed, this page to a pdf document; then hand it in through your course's Learning Management System.


 Neil Simonetti

 Back to Intro to QR Worksheets Page

 Back to Neil's Intro to QR Page