# Selecting the Right Statistical Analysis Tool for Your Research

A challenge that many novice researchers face is deciding on the appropriate statistical test for their research problem or research question.  Even experienced researchers or those who have spent time away from statistics may feel a bit rusty and need a refresher.  Let’s say researchers are interested in looking at the relationship(s) across variables. They may ask themselves, “in this situation, do I apply a Pearson’s or Spearman’s Correlation? Or, should I consider setting up cross tabulations or maybe a regression analysis?”  Then, there are those times when researchers want to compare means or hypotheses; the researcher might think about using t-Tests or ANOVA based on the sample data.

To make life easier, I have developed this statistical analysis decision tool.  Read on to learn more.

## Statistical Analysis Decision Tool: An Introduction

This tool is designed to assist the novice and experienced researcher alike in selecting the appropriate statistical procedure for their research problem or question. Below we provide commonly used statistical tests along with easy-to-read tables that are grouped according to the desired outcome of the test. Also provided below are a variety of links for added support. Enjoy, and happy number crunching!

### Instructions

To use this tool, select the goal of your data analysis, then work through the tables from left to right to select the correct statistical test. Make sense?  Let’s try one.

## Descriptive Statistics

Sometimes the first step in any study is to organize the data and understand patterns.  This can be accomplished with descriptive statistics such as frequencies, means, standard deviations, etc.

### Key Terms (from Dodge, 2010):

• Categorical: “Categorical data consists of counts of observations falling into specific classes. […] In a public opinion survey for approving or disapproving a new law, the votes cast can be either ‘yes’ or ‘no’. […] if we are interested in the number of people that have achieved various levels of education, there will probably be a natural ordering of the categories: ‘primary, secondary’ and then university” (p. 59). “Yes” and “no” and the education levels are all examples of categorical data.
• Normal: This refers to the assumption that data is normally distributed, i.e. if one plotted the data it would take the classic “bell curve” shape. Many statistical tests assume that data is normally distributed.
• Over time: This refers to data that occurs over a given time period, for example, the average of a daily observation taken over a three month period.
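As a rough illustration of the descriptive statistics mentioned in this section, the sketch below computes a mean, a standard deviation, and category frequencies using only Python's standard library. The `scores` and `votes` values are made up for the example and do not come from any study.

```python
import statistics
from collections import Counter

# Hypothetical data, invented purely for illustration.
scores = [72, 85, 90, 66, 85, 78, 94, 85, 70, 81]   # numerical data
votes = ["yes", "no", "yes", "yes", "no", "yes"]    # categorical data

# Mean and sample standard deviation (n - 1 in the denominator).
mean_score = statistics.mean(scores)
sd_score = statistics.stdev(scores)

# Frequencies: counts of observations falling into each category.
vote_counts = Counter(votes)

print(f"mean = {mean_score:.2f}, sd = {sd_score:.2f}")
print(dict(vote_counts))
```

Numerical data is summarized with a mean and standard deviation, while categorical data such as “yes”/“no” votes is summarized with counts.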

## Relational Analysis

Relational analysis helps the researcher understand the relationship between two or more variables. “The notion of relation expresses the rapport that exists between two random variables” (Dodge, 2010, p. 455).

Correlation analysis identifies the strength and direction of the relationship between two variables.  The closer the correlation coefficient is to 1, the more positive the relationship; the closer it is to -1, the more negative the relationship. A positive relationship means both variables move in the same direction, whereas a negative relationship means the variables move in opposite directions.  A value of zero means there is no linear relationship between the two variables.
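To make the range from -1 to 1 concrete, here is a minimal sketch of Pearson's correlation coefficient written out by hand. The function name `pearson_r` and the three data lists are assumptions made for this example; statistical packages provide their own, more thoroughly tested versions.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable.
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data, invented purely for illustration.
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 58, 65, 71, 80]   # rises as hours_studied rises
hours_of_tv = [5, 4, 3, 2, 1]       # falls as hours_studied rises

r_pos = pearson_r(hours_studied, exam_score)
r_neg = pearson_r(hours_studied, hours_of_tv)
print(f"positive relationship: r = {r_pos:.3f}")
print(f"negative relationship: r = {r_neg:.3f}")
```

The first pair moves in the same direction and yields a coefficient near 1; the second pair moves in opposite directions and yields a coefficient of -1.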

“Regression analysis is a technique that permits one to study and measure the relation between two or more variables. […] The goal is to estimate the value of one variable as a function of one or more other variables” (Dodge, 2010, p. 450).

Key differences between correlation and regression are articulated in these links (http://www.graphpad.com/faq/viewfaq.cfm?faq=1141 and http://www.statpac.com/statistics-calculator/correlation-regression.htm), but, in summary, regression can do more for your research than correlation, as regression allows you to model multiple variables.

### Key terms (from Dodge, 2010):

• Categorical: “Categorical data consists of counts of observations falling into specific classes. […] In a public opinion survey for approving or disapproving a new law, the votes cast can be either ‘yes’ or ‘no’. […] if we are interested in the number of people that have achieved various levels of education, there will probably be a natural ordering of the categories: ‘primary, secondary’ and then university” (p. 59). “Yes” and “no” and the education levels are all examples of categorical data.
• Mixed: This refers to the fact that some data is in a different format than the other data. For example, some data may be numerical while another set may be categorical (nominal or ordinal).
• Normal: This refers to the assumption that data is normally distributed, i.e. if one plotted the data it would take the classic “bell curve” shape. Many statistical tests assume that data is normally distributed.
• Simple linear regression: This test is used for predicting a value of a dependent variable using an independent variable.
• Multiple linear regression: This test is used to predict values of a dependent variable using two or more independent variables.
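The idea of simple linear regression, predicting a dependent variable from one independent variable, can be sketched with an ordinary least-squares fit. The helper name `fit_simple_linear_regression` and the data are hypothetical and chosen so the arithmetic is easy to follow; this is one common way to fit the line, not necessarily the procedure your software uses.

```python
def fit_simple_linear_regression(x, y):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of x, and cross-products of x and y deviations.
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying exactly on the line y = 2x + 1.
independent = [1, 2, 3, 4, 5]
dependent = [3, 5, 7, 9, 11]

slope, intercept = fit_simple_linear_regression(independent, dependent)

# Use the fitted line to predict the dependent variable for a new value.
predicted = intercept + slope * 6
print(f"slope = {slope}, intercept = {intercept}, prediction at x = 6: {predicted}")
```

Multiple linear regression extends the same idea to two or more independent variables, which in practice is usually handled by a statistics package rather than by hand.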

## Comparison Analysis

Comparison analysis seeks to test hypotheses on a sample mean or to compare means of two samples. The outcome of this type of analysis is usually “there is a statistically significant difference” or that “there is not a statistically significant difference” between/among data sets.

### Key terms (from Dodge, 2010):

• Categorical: “Categorical data consists of counts of observations falling into specific classes. […] In a public opinion survey for approving or disapproving a new law, the votes cast can be either ‘yes’ or ‘no’. […] if we are interested in the number of people that have achieved various levels of education, there will probably be a natural ordering of the categories: ‘primary, secondary’ and then university” (p. 59). “Yes” and “no” and the education levels are all examples of categorical data.
• Independent: This is when no efforts are made to match data points or compare effects (say before or after). An example would be if 100 individuals were recruited and randomly divided into two groups of 50 after which means of the groups were compared.
• Normal: This refers to the assumption that data is normally distributed, i.e. if one plotted the data it would take the classic “bell curve” shape. Many statistical tests assume that data is normally distributed.
• Paired: This refers to cases when each data point (e.g. score) is paired to another data point. Examples of this are when conducting a before and after analysis (pre-test/post-test) or the samples are matched pairs of similar units. Paired is also described by the term “dependent.”
• Repeated measures: This refers to a study that takes multiple measures or time points for each of the subjects. Examples include longitudinal studies or evaluating a measurement under two or more conditions.
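The difference between paired and independent data shows up directly in how a t statistic is computed. The sketch below, using made-up `before` and `after` scores, calculates a paired t statistic from the per-subject differences and, for contrast, an independent-samples t statistic (Welch's version, which does not assume equal variances) on the same numbers. Both function names are assumptions for this example, and the sketch stops at the test statistic; obtaining a p-value requires the t distribution from a statistics package.

```python
import math
import statistics

def paired_t_statistic(first, second):
    """t statistic for paired data, based on per-subject differences."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("paired samples must have the same length (at least 2)")
    diffs = [b - a for a, b in zip(first, second)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error

def welch_t_statistic(group_a, group_b):
    """t statistic for two independent groups (Welch's unequal-variance form)."""
    standard_error = math.sqrt(
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    )
    return (statistics.mean(group_b) - statistics.mean(group_a)) / standard_error

# Hypothetical pre-test/post-test scores for the same five subjects.
before = [10, 12, 9, 11, 13]
after = [12, 14, 10, 13, 15]

t_paired = paired_t_statistic(before, after)
t_independent = welch_t_statistic(before, after)
print(f"paired t = {t_paired:.3f}")
print(f"independent (Welch) t = {t_independent:.3f}")
```

With these numbers the paired statistic is much larger, because pairing removes the subject-to-subject variation. Treating paired data as independent, or the reverse, can therefore change the conclusion, which is why the distinction matters when choosing a test.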

## Parametric versus Non-Parametric Tests

“A parametric test is a form of hypothesis testing in which assumptions are made about the underlying distribution of observed data. […] The Student test is an example of a parametric test. It aims to compare the means of two normally distributed populations” (Dodge, 2010, p. 412). Nonparametric procedures are really handy when you think you are going to use one of the procedures we have discussed, but for one reason or another (often sample size), you cannot. Below is a list of parametric tests along with their non-parametric equivalents.

## References

Dodge, Y. (2010). The concise encyclopedia of statistics. New York: Springer.

Kathleen Andrews | October 12, 2016 1:33 pm MST

Thank you for providing this, Scott.

I would like to have a pdf version to send to my students. Do you have one?

Would you make one for multivariate analysis?

Scott Burrus | October 12, 2016 1:40 pm MST

Hi Kathleen - I'm sure we can post a PDF within the next few days or so.  While ANOVA and Regression are designed to handle multivariate analysis a follow-up to this post could be to address this in more detail.  Thanks, Scott

Louis Daily | November 21, 2016 1:01 pm MST

Scott,

I've talked to you briefly at various meetings.  I'm very interested in analytics.  I'm trying to decide which to focus on:  R  ?  Python?   Rapidminer?   I know you have told me you use SPSS.  As faculty, I have access to SPSS, but I think you need the SPSS "Modeler" for most of the clustering techniques, etc.  Do we have access to this?  Or should I just focus on the capabilities in the base SPSS product?

Any direction you can give will be greatly appreciated.

thanks!

Lou

Scott Burrus 