Key concepts and terms in Multivariate Statistical Methods

08th Oct ’15, 11:36 AM in Data Science

Baiju NT Contributor
This is a handy list of key concepts and terms in Multivariate Statistical Methods. We’ve put together the list, especially for people new to the field. We hope you find this useful. You can download the complete glossary of Multivariate Statistical Methods at www.camo.com.

Analysis of variance (ANOVA)

Classical method to assess the significance of effects by decomposition of a response’s variance into explained parts, related to variations in the predictors, and a residual part which summarizes the experimental error.

The main ANOVA results are: Sum of Squares (SS), number of Degrees of Freedom (DF), Mean Square (MS=SS/DF), F-value, p-value.

The effect of a design variable on a response is regarded as significant if the variations in the response value due to variations in the design variable are large compared with the experimental error. The significance of the effect is given as a p-value: usually, the effect is considered significant if the p-value is smaller than 0.05 (5%).
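
As a rough illustration of the decomposition described above, the following Python sketch computes SS, DF, MS and the F-value for a single design variable. The function name `one_way_anova` and the list-of-lists input format are assumptions made for this example. The p-value is left out because it needs the cumulative F-distribution, which is not part of Python's standard library.

```python
def one_way_anova(groups):
    """Return SS, DF, MS and F for one design variable.

    groups: one list of response values per level of the design variable
    (hypothetical input format chosen for this sketch).
    """
    all_values = [y for g in groups for y in g]
    n_total = len(all_values)
    grand_mean = sum(all_values) / n_total

    # Explained part: variation of the level means around the grand mean.
    ss_effect = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Residual part: variation of each value around the mean of its own level.
    ss_resid = sum((y - sum(g) / len(g)) ** 2 for g in groups for y in g)

    df_effect = len(groups) - 1
    df_resid = n_total - len(groups)
    ms_effect = ss_effect / df_effect
    ms_resid = ss_resid / df_resid
    return {
        "SS": (ss_effect, ss_resid),
        "DF": (df_effect, df_resid),
        "MS": (ms_effect, ms_resid),
        "F": ms_effect / ms_resid,
    }


# Two levels of the design variable, three replicates each (made-up numbers).
table = one_way_anova([[4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])
print(table)
```

A large F-value means the variation between levels is large compared with the variation within levels, which is the comparison the definition above describes.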

Bias

Systematic difference between predicted and measured values. The bias is computed as the average value of the residuals.
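
A minimal Python sketch of this computation follows. The sign convention (measured minus predicted) is an assumption taken from the definition of a residual later in this list; some texts use the opposite sign. The function name `bias` and the numbers are illustrative only.

```python
def bias(measured, predicted):
    """Average residual, with residual = measured - predicted (assumed sign)."""
    residuals = [m - p for m, p in zip(measured, predicted)]
    return sum(residuals) / len(residuals)


# Made-up values: every prediction is a little too low, so the bias is positive.
bias_value = bias(measured=[10.0, 12.0, 14.0], predicted=[9.5, 11.0, 13.5])
print(bias_value)
```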

Center sample

Sample for which the value of every design variable is set at its mid-level (halfway between low and high).

Center samples have a double purpose: introducing one center sample in a screening design enables curvature checking, and replicating the center sample provides a direct estimation of the experimental error. Real center samples can be included when all design variables are continuous. For designs containing category variables, real center points do not exist; however, it is possible to generate faced center points by taking the mid-range values for the continuous variables and selecting a level for the category variables.

Classification

Data analysis method used for predicting class membership. Classification can be seen as a predictive method where the response is a category variable. The purpose of the analysis is to be able to predict which category a new sample belongs to. Classification methods implemented in The Unscrambler® include SIMCA, SVM classification, LDA, and PLS-discriminant analysis.

Classification can for instance be used to determine the geographical origin of a raw material from the levels of various impurities, or to accept or reject a product depending on its quality.

To run a SIMCA classification, one needs:

a. One or several PCA models (one for each class) based on the same variables;

b. Values of those variables collected on known or unknown samples.

Each new sample is projected onto each PCA model. According to the outcome of this projection, the sample is either recognized as a member of the corresponding class, or rejected.

Clustering

Clustering is a classification method that does not require any prior knowledge about the available samples. The basic principle consists in grouping together in a “cluster” several samples which are sufficiently close to each other.

The clustering methods available in The Unscrambler® include the K-means algorithm; the behavior of the algorithm may be tuned by choosing among various ways of computing the distance between samples. Hierarchical clustering can also be run, as can clustering using Ward’s method.

Confusion matrix

The confusion matrix is a matrix used to visualize classification results from supervised methods such as support vector machine classification or linear discriminant analysis. It carries information about the predicted and actual classifications of samples, with each row showing the instances in a predicted class, and each column representing the instances in an actual class.
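
The layout described above (rows for predicted classes, columns for actual classes) can be sketched in a few lines of Python. The function name `confusion_matrix`, the class labels and the sample data are made up for illustration.

```python
def confusion_matrix(predicted, actual, classes):
    """Count samples per (predicted class, actual class) pair."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for p, a in zip(predicted, actual):
        matrix[index[p]][index[a]] += 1  # row = predicted, column = actual
    return matrix


actual = ["A", "A", "B", "B", "B"]
predicted = ["A", "B", "B", "B", "A"]
cm = confusion_matrix(predicted, actual, classes=["A", "B"])

# Correctly classified samples sit on the diagonal.
accuracy = sum(cm[i][i] for i in range(len(cm))) / len(actual)
print(cm, accuracy)
```

Note that some software uses the transposed layout (rows for actual classes), so it is worth checking the convention before reading a matrix.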

Correlation

A unitless measure of the amount of linear relationship between two variables.

The correlation is computed as the covariance between the two variables divided by the square root of the product of their variances. It varies from –1 to +1.

Positive correlation indicates a positive link between the two variables, i.e. when one increases, the other has a tendency to increase too. The closer to +1, the stronger this link.

Negative correlation indicates a negative link between the two variables, i.e. when one increases, the other has a tendency to decrease. The closer to –1, the stronger this link.

Covariance

A measure of the linear relationship between two variables.

The covariance is given on a scale which is a function of the scales of the two variables, and may not be easy to interpret. Therefore, it is usually simpler to study the correlation instead.
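
The short Python sketch below illustrates why correlation is easier to interpret than covariance: rescaling one variable changes the covariance but leaves the correlation untouched. The function names and data are illustrative, and dividing by n - 1 is an assumed convention (the divisor cancels in the correlation anyway).

```python
import math


def covariance(x, y):
    """Sample covariance, dividing by n - 1 (assumed convention)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)


def correlation(x, y):
    """Covariance divided by the square root of the product of the variances."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
x_rescaled = [100.0 * v for v in x]  # same variable expressed in another unit

print(covariance(x, y), covariance(x_rescaled, y))    # scale-dependent
print(correlation(x, y), correlation(x_rescaled, y))  # unchanged by rescaling
```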

Cross validation

Validation method where some samples are kept out of the calibration and used for prediction. This is repeated until all samples have been kept out once. Validation residual variance can then be computed from the prediction residuals.

In segmented cross validation, the samples are divided into subgroups or “segments”. One segment at a time is kept out of the calibration. There are as many calibration rounds as segments, so that predictions can be made on all samples. A final calibration is then performed with all samples.

In full cross validation, only one sample at a time is kept out of the calibration per iteration.
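
The procedure can be sketched in Python as follows. A straight-line model stands in for the calibration model, and the interleaved way of assigning samples to segments is an assumption of this example; the helper names `fit_line` and `cross_validate` are hypothetical.

```python
def fit_line(x, y):
    """Least squares fit of y = slope * x + intercept (stand-in calibration)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sum(
        (a - mean_x) ** 2 for a in x
    )
    return slope, mean_y - slope * mean_x


def cross_validate(x, y, n_segments):
    """Return the validation residual variance (mean squared prediction residual)."""
    n = len(x)
    residuals = [0.0] * n
    for s in range(n_segments):
        held_out = set(range(s, n, n_segments))  # interleaved segment
        train = [i for i in range(n) if i not in held_out]
        slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
        for i in held_out:
            # Predict the kept-out samples with a model that never saw them.
            residuals[i] = y[i] - (slope * x[i] + intercept)
    return sum(r * r for r in residuals) / n


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
segmented = cross_validate(x, y, n_segments=3)    # two samples per segment
full = cross_validate(x, y, n_segments=len(x))    # one sample kept out at a time
print(segmented, full)
```

Setting the number of segments equal to the number of samples turns segmented cross validation into full cross validation.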

Degrees of freedom

The number of degrees of freedom of a phenomenon is the number of independent ways this phenomenon can be varied.

Degrees of freedom are used to compute variances and theoretical variable distributions. For instance, an estimated variance is said to be “corrected for degrees of freedom” if it is computed as the sum of squares of deviations from the mean, divided by the number of degrees of freedom of this sum.
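
As a small worked example in Python, the sum of squared deviations from the mean of n values has n - 1 degrees of freedom, because the deviations must sum to zero. The data are made up; `statistics.variance` from the standard library uses the same n - 1 divisor.

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean = sum(data) / len(data)
sum_of_squares = sum((v - mean) ** 2 for v in data)

uncorrected = sum_of_squares / len(data)        # divides by n
corrected = sum_of_squares / (len(data) - 1)    # divides by n - 1 degrees of freedom

print(uncorrected, corrected, statistics.variance(data))
```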

Distribution

Shape of the frequency diagram of a measured variable or calculated parameter. Observed distributions can be represented by a histogram.

Some statistical parameters have a well-known theoretical distribution which can be used for significance testing.

Experimental design

This is also referred to as Design of Experiments.

Plan for experiments where input variables are varied systematically within predefined ranges, so that their effects on the output variables (responses) can be estimated and checked for significance. Experimental designs are built with a specific objective in mind, namely screening, screening with interaction, or optimization. The number of experiments and the way they are built depend on the objective and on the operational constraints.

Experimental error

Random variation in the response that occurs naturally when performing experiments.

An estimation of the experimental error is used for significance testing, as a comparison to structured variation that can be accounted for by the studied effects.

Experimental error can be measured by replicating some experiments and computing the standard deviation of the response over the replicates. It can also be estimated as the residual variation when all “structured” effects have been accounted for.

Explained variance

Share of the total variance which is accounted for by the model.

Explained variance is computed as the complement to residual variance (total variance minus residual variance), divided by total variance. It is expressed as a percentage.

For instance, an explained variance of 90% means that 90% of the variation in the data is described by the model, while the remaining 10% is noise (or error).
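
A minimal Python sketch of this calculation is given below. It works with sums of squares and assumes the total and residual variances share the same divisor, so that the divisor cancels; the function name and numbers are illustrative.

```python
def explained_variance_percent(observed, fitted):
    """100 * (total - residual) / total, using sums of squares."""
    mean = sum(observed) / len(observed)
    total = sum((o - mean) ** 2 for o in observed)
    residual = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    return 100.0 * (total - residual) / total


observed = [1.0, 2.0, 3.0, 4.0, 5.0]
fitted = [1.5, 2.0, 3.0, 4.0, 4.5]  # made-up model output
explained = explained_variance_percent(observed, fitted)
print(explained)  # share of the variation described by the model, in percent
```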

F-distribution

Fisher distribution is the distribution of the ratio between two variances.

The F-distribution assumes that the individual observations follow an approximate normal distribution.

F-ratio

The F-ratio is the ratio between explained variance (associated to a given predictor) and residual variance. It shows how large the effect of the predictor is, as compared with random noise.

By comparing the F-ratio with its theoretical distribution (F-distribution), one obtains the significance level (given by a p-value) of the effect.

Histogram

A plot showing the observed distribution of data points. The data range is divided into a number of bins (i.e. intervals) and the number of data points that fall into each bin is summed up.

The height of each bar in the histogram shows how many data points fall within the data range of the corresponding bin.
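
The binning step can be sketched in Python as follows. Equal-width bins and the choice to place the maximum value in the last bin are assumptions of this example, and the sketch expects the data to contain at least two distinct values.

```python
def histogram(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum would land one past the end, so clamp it into the last bin.
        i = min(int((v - low) / width), n_bins - 1)
        counts[i] += 1
    return counts


counts = histogram([1, 2, 2, 3, 3, 3, 4, 4, 5], n_bins=4)
for i, c in enumerate(counts):
    print(f"bin {i}: " + "#" * c)  # bar height = number of points in the bin
```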

K-means

An algorithm for data clustering. The samples will be grouped into K (user-determined number) clusters based on a specific distance measurement, so that the sum of distances between each sample and its cluster centroid is minimized.
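
The sketch below shows one common way to run K-means in Python: alternate between assigning each sample to its nearest centroid and moving each centroid to the mean of its members. Euclidean distance, a fixed number of iterations, and the naive choice of the first K samples as starting centroids are all assumptions of this example, not a description of any particular software.

```python
import math


def k_means(samples, k, n_iter=20):
    """Group samples into k clusters; returns (labels, centroids)."""
    centroids = [list(s) for s in samples[:k]]  # naive initialisation (assumption)
    labels = [0] * len(samples)
    for _ in range(n_iter):
        # Assignment step: each sample joins the cluster of its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(s, centroids[c])) for s in samples
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [s for s, label in zip(samples, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


samples = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.9), (0.8, 1.1), (4.8, 5.1)]
labels, centroids = k_means(samples, k=2)
print(labels, centroids)
```

The result can depend on the starting centroids, so in practice the algorithm is often run several times with different starts.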

Linear Discriminant Analysis (LDA)

LDA is the simplest of all possible classification methods that are based on Bayes’ formula. The objective of LDA is to determine the best fit parameters for classification of samples by a developed model.

Least squares criterion

Basis of classical regression methods, which consists in minimizing the sum of squares of the residuals. It is equivalent to minimizing the average squared distance between the original response values and the fitted values.
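
For a single predictor, the criterion has a closed-form solution, sketched here in Python. The function names and data are illustrative. The last lines check that nudging the fitted slope or intercept can only increase the sum of squared residuals.

```python
def least_squares_line(x, y):
    """Slope and intercept minimizing the sum of squared residuals."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sum(
        (a - mean_x) ** 2 for a in x
    )
    return slope, mean_y - slope * mean_x


def sum_of_squares(x, y, slope, intercept):
    """Sum of squared residuals for a given line."""
    return sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))


x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 8.0]
slope, intercept = least_squares_line(x, y)

best = sum_of_squares(x, y, slope, intercept)
steeper = sum_of_squares(x, y, slope + 0.1, intercept)
shifted = sum_of_squares(x, y, slope, intercept + 0.1)
print(slope, intercept, best, steeper, shifted)
```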

Linear model

Regression model including as X-variables the linear effects of each predictor. The linear effects are also called main effects. Linear models are used in the analysis of Plackett-Burman and Resolution III fractional factorial designs. Higher resolution designs allow the estimation of interactions in addition to the linear effects.

Mean

Average value of a variable over a specific sample set. The mean is computed as the sum of the variable values, divided by the number of samples.

The mean gives a value around which all values in the sample set are distributed. In Statistics results, the mean can be displayed together with the standard deviation.

Mean centering

Subtracting the mean (average value) from a variable, for each data point.
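
In Python this is a one-line transformation; after centering, the values of the variable sum to zero. The function name and data are illustrative.

```python
def mean_center(values):
    """Subtract the variable's mean from each data point."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


centered = mean_center([2.0, 4.0, 9.0])
print(centered, sum(centered))
```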

Median

The median of an observed distribution is the variable value that splits the distribution in its middle: half the observations have a lower value than the median, and the other half have a higher value. It is also called the 50% percentile.

Missing values

Whenever the value of a given variable for a given sample is unknown or not available, this results in a hole in the data. Such holes are called missing values, and in The Unscrambler® the corresponding cells of the data table are left empty.

In some cases, it is only natural to have missing values — for instance when the concentration of a compound (Y) in a new sample is supposed to be predicted from its spectrum (X).

Sometimes it would be nice to reconstruct the missing values, for instance when applying a data analysis that does not handle missing values well, like MLR, kernel-PLS or wide-kernel. One may choose to fill missing values by using the command Tasks – Transform – Missing Values….

Multiple Linear Regression (MLR)

A method for relating the variations in a response variable (Y-variable) to the variations of several predictors (X-variables), with explanatory or predictive purposes.

An important assumption for the method is that the X-variables are linearly independent, i.e. that no linear relationship exists between the X-variables. When the X-variables carry common information, problems can arise due to exact or approximate collinearity.

Multivariate analysis

Multivariate analysis (MVA) is based on the statistical principle of multivariate statistics, which involves observation and analysis of more than one statistical variable at a time. In design and analysis, the technique is used to perform trade studies across multiple dimensions while taking into account the effects of all variables on the responses of interest.

Normal distribution

Frequency diagram showing how independent observations, measured on a continuous scale, would be distributed if there were an infinite number of observations and no factors caused systematic effects.

A normal distribution can be described by two parameters:

a. A theoretical mean, which is the center of the distribution;

b. A theoretical standard deviation, which is the spread of the individual observations around the mean.

Outlier

An observation (outlying sample) or variable (outlying variable) which is abnormal compared to the major part of the data.

Extreme points are not necessarily outliers; outliers are points that apparently do not belong to the same population as the others, or that are badly described by a model.

Outliers should be investigated before they are removed from a model, as an apparent outlier may be due to an error in the data.

Overfitting

For a model, overfitting is a tendency to describe too much of the variation in the data, so that not only consistent structure is taken into account, but also some noise or noninformative variation.

Overfitting should be avoided, since it usually results in a lower quality of prediction. Validation is an efficient way to avoid model overfitting.

Principal Component Analysis (PCA)

PCA is a bilinear modeling method which gives an interpretable overview of the main information in a multidimensional data table.

The information carried by the original variables is projected onto a smaller number of underlying (“latent”) variables called principal components. The first principal component covers as much of the variation in the data as possible. The second principal component is orthogonal to the first and covers as much of the remaining variation as possible, and so on.

By plotting the principal components, one can view interrelationships between different variables, and detect and interpret sample patterns, groupings, similarities or differences.

Precision

The precision of an instrument or a measurement method is its ability to give consistent results over repeated measurements performed on the same object. A precise method will give several values that are very close to each other.

Precision can be measured by standard deviation over repeated measurements.

If precision is poor, it can be improved by systematically repeating the measurements over each sample, and replacing the original values by their average for that sample.

Precision differs from accuracy, which has to do with how close the average measured value is to the target value.

P-value

The p-value measures the probability that a parameter estimated from experimental data should be as large as it is, if the real (theoretical, non-observable) value of that parameter were actually zero. Thus, p-value is used to assess the significance of observed effects or variations: a small p-value means a small risk of mistakenly concluding that the observed effect is real.

The usual limit used in the interpretation of a p-value is 0.05 (or 5%). If the p-value is smaller than 0.05, the observed effect can be presumed to be significant, i.e. unlikely to be due to random variations alone. The p-value is also called the “observed significance level”.

Quantile plot

The Quantile plot represents the distribution of a variable in terms of percentiles for a given population. It shows the minimum, the 25% percentile (lower quartile), the median, the 75% percentile (upper quartile) and the maximum.
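
The five values shown in such a plot can be computed with Python's standard library, as sketched below. Several conventions exist for interpolating quartiles; the "inclusive" method of `statistics.quantiles` is an assumption of this example and other software may give slightly different quartiles. The function name and data are illustrative.

```python
import statistics


def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile, maximum."""
    lower, median, upper = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), lower, median, upper, max(values))


# Unsorted made-up data; the quantile function sorts internally.
summary = five_number_summary([7.0, 1.0, 9.0, 3.0, 5.0, 2.0, 8.0, 4.0, 6.0])
print(summary)
```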

Regression coefficient

In a regression model equation, regression coefficients are the numerical coefficients that express the link between variation in the predictors and variation in the response.

Regression

Generic name for all methods relating the variations in one or several response variables (Y-variables) to the variations of several predictors (X-variables), with explanatory or predictive purposes.

Regression can be used to describe and interpret the relationship between the X-variables and the Y-variables, and to predict the Y-values of new samples from the values of the X-variables.

Residual

A measure of the variation that is not taken into account by the model.

The residual for a given sample and a given variable is computed as the difference between observed value and fitted (or projected, or predicted) value of the variable on the sample.

Residual variance

The mean square of all residuals, sample- or variable-wise.

This is a measure of the error made when observed values are approximated by fitted values, i.e. when a sample or a variable is replaced by its projection onto the model.

The complement to residual variance is explained variance.

RMSEC

Root Mean Square Error of Calibration. A measurement of the average difference between predicted and measured response values, at the calibration stage.

RMSEC can be interpreted as the average modeling error, expressed in the same units as the original response values.
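
A minimal Python sketch of the calculation follows. Dividing by the number of samples is an assumption of this example; some software corrects the divisor for degrees of freedom. The function name and numbers are illustrative.

```python
import math


def root_mean_square_error(predicted, measured):
    """Square root of the mean squared difference between predicted and measured."""
    squared = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up calibration samples; the result is in the units of the response.
rmsec = root_mean_square_error(
    predicted=[2.0, 4.0, 6.0, 8.0], measured=[1.0, 5.0, 6.0, 10.0]
)
print(rmsec)
```

Applied to the calibration samples, this gives RMSEC; the same formula applied to validation predictions gives the corresponding validation error.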

R-square

The R-square of a regression model is a measure of the quality of the model. Also known as the coefficient of determination, it is computed as 1 – (Residual Y-variance)/(Total Y-variance), or equivalently as (Explained Y-variance)/100 when the explained variance is expressed as a percentage. For Calibration results, this is also the square of the correlation coefficient between predicted and measured values, and the R-square value is always between 0 and 1. The closer to 1, the better.

The R-square is displayed among the plot statistics of a Predicted vs. Reference plot. When based on the calibration samples, it tells about the quality of the fit. When computed from the validation samples (similar to the “adjusted R-square” found in the literature) it tells about the predictive ability of the model.

Sample

Object or individual on which data values are collected, and which builds up a row in a data table. In experimental design, each separate experiment is a sample.

Singular Value Decomposition (SVD)

In linear algebra, the singular value decomposition (SVD) is an important factorization of a rectangular real or complex matrix, with many applications in signal processing and statistics. Applications which employ the SVD include computing the pseudoinverse, least squares fitting of data, matrix approximation, and determining the rank, range and null space of a matrix.

Standard deviation

Standard deviation (SDev) is a measure of a variable’s spread around its mean value, expressed in the same unit as the original values. It is computed as the square root of the mean square of deviations from the mean.

T-value

The t-value is computed as the ratio between deviation from the mean accounted for by a studied effect, and standard error of the mean.

By comparing the t-value with its theoretical distribution (Student’s t-distribution), one obtains the significance level of the studied effect.
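
As one simple instance of this ratio, the Python sketch below computes a one-sample t-value: the deviation of a sample mean from a reference value, divided by the standard error of the mean. Using the one-sample case, the function name and the data are assumptions of this example. Turning the t-value into a p-value needs the cumulative t-distribution, which is not in Python's standard library, so that step is omitted.

```python
import math
import statistics


def t_value(sample, reference_mean):
    """(sample mean - reference value) / standard error of the mean."""
    standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
    return (statistics.mean(sample) - reference_mean) / standard_error


# Made-up repeated measurements compared against a reference value of 5.0.
t = t_value([5.1, 4.9, 5.3, 5.2, 5.0], reference_mean=5.0)
print(t)
```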
