Outliers

What are outliers?

Outliers are extreme scores or values on one variable (univariate) or on two or more variables (multivariate) that can bias sample statistics and influence various coefficients (Tabachnick & Fidell, 2007, p. 72). Aguinis, Gottfredson, and Joo (2013, pp. 282-288) further classify outliers into three types:

  • Error outliers – extreme scores due to a data entry error, an issue in procedure, or a case not from the population of interest.
  • Interesting outliers – non-error related outliers that could provide valuable knowledge.
  • Influential outliers – outliers that alter the fit of a model (model fit outlier) or change the parameter estimates (prediction outlier).

How can outliers impact your data?

Outliers can distort an otherwise normal distribution by pulling the mean toward the extreme scores, producing a skewed distribution (Howell, 2010). Sample statistics such as the mean, standard deviation, and variance are commonly used in parametric statistical analyses, but each is susceptible to extreme scores (Howell, 2010). As such, it is important to assess for outliers to minimize potential biases in the results and interpretations.
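To make the effect on sample statistics concrete, the following minimal Python sketch compares the mean, standard deviation, and median of a small set of scores before and after one extreme score is added. The scores and the helper name summarize are invented for illustration and do not come from any of the cited sources.

```python
import statistics

# Hypothetical scores; the value 45 is an invented extreme score.
scores = [10, 12, 11, 13, 12, 11, 10, 12]
with_outlier = scores + [45]

def summarize(values):
    """Return the mean, sample standard deviation, and median of the values."""
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "median": statistics.median(values),
    }

clean = summarize(scores)
contaminated = summarize(with_outlier)

print("without extreme score:", clean)
print("with extreme score:   ", contaminated)
```

In this made-up example, the mean and standard deviation shift noticeably once the extreme score is included, while the median barely moves.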

Ways to detect outliers:

Common ways to detect possible univariate outliers are boxplots and z-scores; for multivariate outliers, bivariate scatterplots and Mahalanobis distance are used (Aguinis et al., 2013; Tabachnick & Fidell, 2007). If the variables are normally distributed, standardized scores (z-scores) can be calculated and sorted from highest to lowest, with absolute z-scores greater than 2.5 indicating potential outliers (Schwab, 2002a). Leys, Ley, Klein, Bernard, and Licata (2013) recommend basing the calculation on the median and the absolute deviation around the median rather than on the mean and standard deviation, since the mean and standard deviation are themselves affected by extreme scores. If the variables are not normally distributed, researchers should use boxplots to assess for outliers, with cases between 1.5 and 3 box lengths from the upper or lower edge of the box treated as potential outliers (Schwab, 2002a). If a researcher is going to group the data for analyses, each group should be assessed separately (Tabachnick & Fidell, 2007).
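The three univariate screening rules described above can be sketched in Python using only the standard library. This is an illustration under stated assumptions, not the procedure of any cited author: the function names, the example scores, the 1.4826 scaling constant for the median absolute deviation, and the use of 2.5 as the cutoff for the median-based rule are all choices made for this sketch. Quartiles are computed with the default method of statistics.quantiles, which may differ slightly from the quartiles a statistical package such as SPSS reports.

```python
import statistics

def z_score_flags(values, cutoff=2.5):
    """Flag values whose absolute z-score (mean and SD based) exceeds the cutoff."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [abs(v - mean) / sd > cutoff for v in values]

def median_based_flags(values, cutoff=2.5, scale=1.4826):
    """Flag values using the median and the median absolute deviation (MAD).

    scale=1.4826 makes the MAD comparable to the SD for normal data;
    both defaults are assumptions of this sketch.
    """
    med = statistics.median(values)
    mad = scale * statistics.median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; this rule cannot be applied")
    return [abs(v - med) / mad > cutoff for v in values]

def boxplot_flags(values, k=1.5):
    """Flag values more than k box lengths (IQRs) beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    box_length = q3 - q1
    lower, upper = q1 - k * box_length, q3 + k * box_length
    return [v < lower or v > upper for v in values]

# Hypothetical scores; the value 45 is an invented extreme score.
scores = [10, 12, 11, 13, 12, 11, 10, 12, 45]
z_flags = z_score_flags(scores)
med_flags = median_based_flags(scores)
box_flags = boxplot_flags(scores)

for name, flags in [("z-score", z_flags), ("median/MAD", med_flags), ("boxplot", box_flags)]:
    print(name, [s for s, f in zip(scores, flags) if f])
```

When the data will be analyzed by group, the same functions would be applied to each group's scores separately.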

When running multivariate analyses, it is important to assess for both univariate and multivariate outliers. After assessing for univariate outliers and normality, multivariate outliers can be assessed through scatterplots and Mahalanobis distance (D²), the statistical distance of a case from the centroid of the independent variables (Aguinis et al., 2013; Schwab, 2002b; Stevens, 2007). Henson (1999) provides SPSS syntax to run the MULTINOR procedure as outlined in Thompson (1990). When using Mahalanobis D², a case with a large value has a small associated probability and may be a multivariate outlier (Henson, 1999). Once such cases are found, it is important to run the multivariate analysis with and without the extreme scores (Schwab, 2002b).
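As a rough illustration of Mahalanobis D², the Python sketch below computes the squared distance of each case from the centroid for the special case of two variables, where the 2 x 2 covariance matrix can be inverted by hand. It is not the MULTINOR procedure or the SPSS syntax mentioned above. The data and function names are invented, and the probability is taken from a chi-square distribution with degrees of freedom equal to the number of variables, a common convention that is assumed here; for two variables that upper-tail probability reduces to exp(-D²/2).

```python
import math
import statistics

def mahalanobis_d2(xs, ys):
    """Squared Mahalanobis distance of each (x, y) case from the centroid."""
    n = len(xs)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    # Sample variances and covariance (n - 1 in the denominator).
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2
    d2 = []
    for x, y in zip(xs, ys):
        dx, dy = x - mx, y - my
        # Quadratic form d' S^-1 d, written out for a 2 x 2 covariance matrix S.
        d2.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return d2

def chi2_upper_tail_df2(d2_value):
    """Upper-tail chi-square probability for df = 2 only: exp(-d2 / 2)."""
    return math.exp(-d2_value / 2)

# Hypothetical data: y tracks x closely except for the last case, which is
# within the range of both variables but breaks the joint pattern.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 5]
ys = [1.2, 1.9, 3.1, 4.0, 5.2, 5.9, 7.1, 8.0, 9.1, 9.0]

d2 = mahalanobis_d2(xs, ys)
p_values = [chi2_upper_tail_df2(v) for v in d2]

for case, (dist, p) in enumerate(zip(d2, p_values), start=1):
    print(f"case {case}: D2 = {dist:.2f}, p = {p:.3f}")
```

The last case is not extreme on either variable alone, yet it has the largest D² and the smallest probability, which is the kind of case a univariate screen can miss. With more than two variables, a matrix library would be needed to invert the covariance matrix.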

How to Handle Outliers:

Unfortunately, researchers disagree on how to handle outliers. Aguinis et al. (2013) recommend that researchers report the type of outlier (error, interesting, or influential) and how the outlier was identified and handled. Error outliers should be corrected if possible or deleted. Interesting outliers should be investigated further. Influential outliers should be handled by running the analyses with and without the outlier to assess how the extreme score impacts the parameter estimates (or model fit).
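Running an analysis with and without a flagged case can be sketched as follows for a simple linear regression. The data, the helper name ols_fit, and the choice of which case counts as flagged are all invented for illustration; in practice the flagged cases would come from the screening steps described earlier.

```python
import statistics

def ols_fit(xs, ys):
    """Return (slope, intercept) of a simple least-squares regression of y on x."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical data; the last case is treated as a flagged influential outlier.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.1, 3.9, 6.2, 8.0, 9.9, 12.1, 14.0, 16.1, 18.0, 5.0]
flagged = {9}  # zero-based index of the flagged case (an assumption of this sketch)

kept_xs = [x for i, x in enumerate(xs) if i not in flagged]
kept_ys = [y for i, y in enumerate(ys) if i not in flagged]

slope_with, intercept_with = ols_fit(xs, ys)
slope_without, intercept_without = ols_fit(kept_xs, kept_ys)

print(f"with flagged case:    slope = {slope_with:.2f}, intercept = {intercept_with:.2f}")
print(f"without flagged case: slope = {slope_without:.2f}, intercept = {intercept_without:.2f}")
```

Reporting both sets of estimates, along with how the case was identified, lets readers judge how much the conclusions depend on that one case.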

 


 

References:

Aguinis, H., Gottfredson, R. K., & Joo, H. (2013). Best-practice techniques for defining, identifying, and handling outliers. Organizational Research Methods, 16, 279-301. doi: 10.1177/1094428112470848

Henson, R. K. (1999). Multivariate normality: What is it and how is it assessed. Advances in Social Science Methodology, 5, 193-211.

Howell, D. C. (2010). Statistical methods for psychology (7th ed.). Belmont, CA: Wadsworth Cengage Learning.

Leys, C., Ley, C., Klein, O., Bernard, P., & Licata, L. (2013). Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median. Journal of Experimental Social Psychology, 49, 764-766.

Schwab, A. J. (2002a). Detecting univariate outliers. Retrieved from: https://www.utexas.edu/courses/schwab/sw388r7/Tutorials/Detecting_Outliers_doc_html/002_Detecting_Univariate_Outliers.html

Schwab, A. J. (2002b). Detecting multivariate outliers. Retrieved from: https://www.utexas.edu/courses/schwab/sw388r7/Tutorials/Detecting_Outliers_doc_html/008_Detecting_Multivariate_Outliers.html

Stevens, J. P. (2007). Applied multivariate statistics for the social sciences (5th ed.). New York, NY: Taylor & Francis Group.

Tabachnick, B. G., & Fidell, L. S. (2007). Using multivariate statistics (5th ed.). New York, NY: Pearson Education, Inc.

Thompson, B. (1990). MULTINOR: A FORTRAN program that assists in evaluating multivariate normality. Educational and Psychological Measurement, 50, 845-848.

 

Suggested Readings:

Pedhazur, E. J. (1997). Multiple regression in behavioral research: Explanation and prediction (3rd ed.). Orlando, FL: Harcourt Brace College Publishers.

 -A detailed discussion on assessing influential outliers in regression through several diagnostic tools available in major statistical software packages.

Thompson, B. (2008). Foundations of behavioral statistics: An insight-based approach. New York, NY: Guilford Press.

 -One useful approach to assessing potential outliers is through the use of a Jackknife procedure, which is overviewed in Thompson (2008).

Martin, M. A., & Roberts, S. (2010). Jackknife-after-bootstrap regression influence diagnostics. Journal of Nonparametric Statistics, 22, 257-269.

 -A discussion of the implementation of the Jackknife procedure in regression to assess for potential outliers.

 

 

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.