
FAQ CheMoocs

This Frequently Asked Questions section is here to answer the most common questions you have about chemometrics.
Whenever possible, the questions / answers are grouped by grain, each grain being a CheMoocs course.

Grains 0, 1, 2: Introduction, simple statistics

Question:
I understood the notion of "dimension of a vector space", but as soon as you switch to spectra I get totally lost, in particular with the example of the 1101 spectral variables, which would be a vector of dimension 1101. Could you enlighten me further?

Reply:

When you define your position with 2 numbers, you are moving in a 2-dimensional space. For example, 3 steps west, 2 steps north. Or latitude and longitude. If you add a third number, you go to dimension 3. A vector of a space of dimension 3 is therefore a collection of 3 numbers. And symmetrically any collection of 3 numbers is a vector of a space of dimension 3.

Based on this principle, any collection of 1101 numbers is a vector ... of a space of dimension 1101.

 

Question:
I have trouble distinguishing between "dimension of a vector space", and "dimension of a vector". If a vector has 3 coordinates (x, y, z), we need R3 to represent it (= 3 dimensions). However, the definition of the course says: "The dimension of a vector space is the minimum number of vectors necessary to generate all the vector space", so the dimension of the space generated by this vector is 1? Can you help me?

Reply:

When a spectrum is measured over 1000 wavelengths, it means, mathematically, that it lives in a space of 1000 dimensions. In the same way, when you give the GPS coordinates of a point (latitude + longitude + altitude), you are located in a space of dimension 3 (ignoring the curvature of the Earth). This agrees with what you say: "If a vector has 3 coordinates (x, y, z), we need R3 to represent it (= 3 dimensions)".

Let us now look at 10 different spectra, all measured in a space of dimension 1000. If we form all possible linear combinations of these 10 spectra, we obtain a space of dimension at most 10 (exactly 10 if the spectra are linearly independent), which is a subspace of the starting space (of dimension 1000). So, indeed, the dimension of the space generated by a single non-zero vector is 1, whatever the number of coordinates of this vector.
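As a small numerical illustration (a sketch on simulated random "spectra", not on course data), the dimension of the space generated by a set of vectors can be obtained as the rank of the matrix that stacks them:

```python
import numpy as np

rng = np.random.default_rng(0)

# 10 simulated "spectra", each measured at 1000 wavelengths
# (random values, used only for illustration).
spectra = rng.normal(size=(10, 1000))

# Each spectrum is a vector of a space of dimension 1000.
ambient_dimension = spectra.shape[1]

# Dimension of the subspace generated by the 10 spectra = rank of the matrix.
span_dimension = np.linalg.matrix_rank(spectra)

# A single vector generates a space of dimension 1,
# whatever its number of coordinates.
single_dimension = np.linalg.matrix_rank(spectra[:1])

# A spectrum that is a combination of the others does not add a dimension.
redundant = np.vstack([spectra, 2.0 * spectra[0] - spectra[3]])
redundant_dimension = np.linalg.matrix_rank(redundant)

print(ambient_dimension, span_dimension, single_dimension, redundant_dimension)
```

The 11th spectrum is built from two of the first ten, so the generated space keeps the same dimension.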

 

Question:
In the definitions of the variance and standard deviation of a sample, can you clarify why the division of the sum of the squares of the deviations is done by n and not by n-1?

Reply:

The words "standard deviation" and "variance" are quite ambiguous, as they denote both a measure of dispersion and a property of a probability distribution. The explanations below are given for the standard deviation, but the same reasoning applies to the variance.

When we have a population of values and we want to describe its dispersion, we can calculate the standard deviation of this population as the quadratic mean of the deviations from the mean, i.e. as explained in the grain, with 1/n.

When it comes to estimating the standard deviation of the probability distribution that is "hidden" behind a sample, the formula is different: we divide by (n-1), so that the calculated variance is an unbiased estimator of the variance of the distribution. The standard deviation is then estimated by the square root of this value.
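The two formulas can be compared on a few numbers. This is a minimal sketch on made-up values; in NumPy the choice between 1/n and 1/(n-1) is made through the ddof argument:

```python
import numpy as np

# Made-up values, used only to compare the two formulas.
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = values.size
deviations = values - values.mean()

# Dispersion of a population: divide by n.
var_population = np.sum(deviations**2) / n
std_population = np.sqrt(var_population)

# Estimate for the underlying distribution: divide by n - 1.
var_sample = np.sum(deviations**2) / (n - 1)
std_sample = np.sqrt(var_sample)

# NumPy equivalents: ddof=0 divides by n, ddof=1 divides by n - 1.
same_population = np.isclose(std_population, np.std(values, ddof=0))
same_sample = np.isclose(std_sample, np.std(values, ddof=1))

print(std_population, std_sample)
```

The gap between the two values shrinks as n grows, which is why the distinction matters mostly for small samples.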

 

Grains 3 and 4: PCA

Question: distribution of variables in a PCA

Do the measured variables used in a PCA have to have a normal or known distribution?

Reply:

PCA works very well on variables with a roughly Gaussian (normal) distribution, and still works very well when the distribution remains more or less symmetrical (e.g. uniform or bimodal distributions). PCA seems less robust with very skewed distributions (e.g. many 0s and then some positive values).

 

Question: Dimension of the dataset

I had learned that to do a PCA you need more individuals (rows) than columns (variables), is that true?

 

Reply:

No, a PCA can handle a data table with more variables than individuals. One of the objectives of PCA is to reduce dimensionality, that is, to work on only a few principal components, which are new variables built as linear combinations of the initial variables. In many applications, the number of initial variables is much greater than the number of individuals, in particular in metabolomics, for example with mass spectrometry or NMR signals.
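To make this concrete, here is a minimal sketch of a PCA computed by singular value decomposition on a simulated table with 20 individuals and 500 variables. The data are random and the SVD route is one common way to compute a PCA, not necessarily the one used by a given software:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated table: far more variables (columns) than individuals (rows).
n_individuals, n_variables = 20, 500
X = rng.normal(size=(n_individuals, n_variables))

# Column centering, then PCA through the singular value decomposition.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = U * s                       # coordinates of the individuals
loadings = Vt.T                      # one column per principal component
explained = s**2 / np.sum(s**2)      # share of variance per component

# After centering, at most n_individuals - 1 components carry variance.
n_nonzero = int(np.sum(s > 1e-10 * s[0]))

print(scores.shape, loadings.shape, n_nonzero)
```

The calculation goes through without difficulty; the only consequence of having few individuals is that the number of components with non-zero variance is bounded by the number of individuals minus one.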

 

Question: PCA and sensory variables

Before performing a PCA on sensory variables (scores from 0 to 10 for example), should they be standardized? They are of the same nature (sensory variables) and yet do not represent the same thing. Thank you for your advice.

 

Reply:

This MOOC is mainly focused on the processing of spectroscopic data, in particular NIR data. But all these tools can very well be applied to other types of multivariate data (Raman spectroscopy, NMR, chromatography, mass spectrometry, process parameters, sensory data, ...). If we have data from quantitative descriptive sensory evaluation, i.e. scores on a structured or unstructured scale (e.g. between 0 and 10) for each sensory descriptor (variable), we no longer really speak of chemometrics but of sensometrics. In this case, whether to standardize the columns (reduce the variables) depends on the data and the application. For example, if certain descriptors correspond to defect scores (e.g. musty), these generally have a low variance if the defects are rare. Standardizing then gives as much initial weight to the (rare) defects as to the other descriptors, which is a choice to be made according to the objective of the PCA. Another example: if a descriptor has a low variance because it does not discriminate well between the products evaluated, standardizing will give this descriptor an artificial weight that is probably irrelevant. Last example: if the tasting panel is not well calibrated, the range of scores may be very different from one descriptor to another. In this case, standardization may be necessary.

 

Question: loadings

How can we determine the factorial weights of the variables, or how can we determine the terms (p1, p2, p3 ....) in the linear combination t = p1 x1 + p2 x2 + p3 x3 .... so as to explain the maximum variability with this component t?

 

Reply:

In PCA, the matrix X is decomposed into scores T and loadings P (the eigenvectors) according to the equation X = TP'. One of the properties of PCA is that P'P = I, the identity matrix. So, multiplying each side on the right by P, we obtain T = XP, which is your equation written in matrix form. In practice, there is no need to do these calculations by hand: the software outputs T and P automatically.
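For readers who want to check these relations numerically, here is a minimal sketch on simulated data. The loadings are obtained here from the SVD of the centered matrix, which is one possible way of computing them:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated, column-centered data: 30 individuals, 8 variables.
X = rng.normal(size=(30, 8))
X = X - X.mean(axis=0)

# The right singular vectors of X are the eigenvectors of X'X.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
P = Vt.T          # loadings: column j holds (p1, p2, p3, ...) for component j
T = X @ P         # scores: each column is t = p1*x1 + p2*x2 + p3*x3 + ...

# P'P = I, and X is recovered as TP' when all components are kept.
orthonormal = np.allclose(P.T @ P, np.eye(P.shape[1]))
reconstructed = np.allclose(T @ P.T, X)

# The components come out sorted by decreasing variance.
score_variances = T.var(axis=0)

print(orthonormal, reconstructed, score_variances[:3])
```

The first column of P is the set of weights that gives the score with the largest variance, the second gives the largest variance among directions orthogonal to the first, and so on.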

 

Question: PCA study

How is a general PCA study done?

Reply:

The approach is not exactly the same depending on whether the data is of spectral type or not.

1. Data preparation:

Non-spectral data: center and reduce the variables (scale them to unit variance), because PCA generally gives better results on centered and reduced variables. This is the general trend, but it is far from applying everywhere. Reduction is very dangerous for variables of little or no significance, whose values are often at or close to 0. This is the case in sensory analysis, or in analytical chemistry when the values are at the level of the detection limit of the method.

Spectral data: do not center or reduce, just visualize the raw spectrum; if it contains deformations, the following pretreatments are carried out:

- noise removal: Savitzky-Golay smoothing;

- baseline removal: Detrend or derivative;

- removal of the multiplicative effect: logarithm or normalization;

2. Construction of the principal components: choice of the number of principal components based on the maximization of the explained variance;

3. Detection of atypical points on the score plot (the factorial map of individuals);

4. Interpretation of the factorial map of individuals by looking at:

- whether the characteristics of an individual are specific to it or shared with other individuals;

- which individuals best explain one of the chosen principal components;

5. Loadings (eigenvectors): identification of the variable(s) influencing the model.

 

Question: missing data

How to manage missing data in a PCA?

 

Reply:

There are two solutions: (a) delete the rows and/or columns containing the missing values; (b) estimate the missing values. Again, there is no hard and fast rule. With many missing values, estimating them becomes dangerous; with few, it is feasible; but it is up to each analyst to decide where "many" and "few" lie, depending on the data.

 

Question:
What if our data is a mixture of quantitative and qualitative and spectral data at the same time?

Reply:
PCA is not enough. You must use a multi-table (multiblock) method; see grains 19 and 20 to start.

 

Question: detection of atypical points by T² and Q

The individuals to be considered as possibly atypical in PCA are those with a high T² and a high distance Q to the model. Since both criteria have to be considered, could we not use a single value, T² increased by Q², i.e. (T² + Q²), the squared distance to the center of the model in the starting space? In other words, is it necessary to do a PCA to detect atypical points, or would a simple calculation of the distance from the suspected individual to the average individual not suffice?

 

Reply:

The 2 multivariate distances T² and Q are both useful for detecting atypical points, but they do not detect the same things! The atypical points in T² correspond to samples that are atypical with respect to known phenomena (modeled by the PCA), often because the phenomenon is stronger or weaker than usual in these samples. The atypical points in Q correspond rather to phenomena never encountered before, and therefore poorly modeled. It is therefore worth examining these 2 distances separately. However, there are indeed methods in which a single criterion, a combination of T² and Q, is used. This is the case, in certain situations, with the SIMCA (Soft Independent Modeling of Class Analogies) discrimination method. Be careful, however, because these 2 terms are calculated in different metrics: the calculation of T² involves the variance-covariance matrix, to obtain a Mahalanobis distance, whereas that of Q does not. Regarding the last part of the question, calculating a spectral distance directly in the original space, bypassing the PCA, can indeed be an option. The results will be different, however, and the information specific to T² and Q is lost.
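The difference in metrics can be seen in a small numerical sketch. The data are simulated, the number of components k = 3 is an arbitrary choice, and the definitions used (T² as the sum of squared scores divided by their variances, Q as the sum of squared residuals) are the usual textbook ones, which may differ by a constant factor from those of a given software:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated data: 3 underlying phenomena plus a little noise.
n, p, k = 50, 20, 3
latent = rng.normal(size=(n, k)) * np.array([5.0, 3.0, 2.0])
directions = np.linalg.qr(rng.normal(size=(p, k)))[0]
X = latent @ directions.T + 0.1 * rng.normal(size=(n, p))

# PCA with k components on the centered data.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:k].T
T = Xc @ P

# T²: Mahalanobis distance inside the model (scores scaled by their variance).
score_var = T.var(axis=0, ddof=1)
T2 = np.sum(T**2 / score_var, axis=1)

# Q: squared Euclidean distance between each individual and the model.
residuals = Xc - T @ P.T
Q = np.sum(residuals**2, axis=1)

# Squared Euclidean distance to the average individual, without any PCA.
d2 = np.sum(Xc**2, axis=1)

# d2 splits into an unscaled score part plus Q, but it is not T² + Q.
splits_exactly = np.allclose(d2, np.sum(T**2, axis=1) + Q)
equals_T2_plus_Q = np.allclose(d2, T2 + Q)

print(splits_exactly, equals_T2_plus_Q)
```

The direct distance to the average individual mixes the two parts with Euclidean weights, so a sample that is unusual only in the residual part can be hidden behind the large variance of the first components.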

Question: large variance on the first principal component (PC1)

In the case of a PCA on spectral data, how should the result be interpreted if the variance expressed by PC1 is greater than 95%? Is this PCA usable (the PCA is used to find correlation trends between spectral data and product properties)? Second question: if the loading associated with PC1 is the average spectrum of my database, what can that mean? Do I need to review my pretreatments?

 

Reply:

In the case of spectral data, the variance explained by the PC1 of a PCA is very often greater than 95%, and from time to time even greater than 99%! In this case, the associated loading does indeed represent the average spectrum (or its opposite). This is explained by the fact that the greatest variability in a spectral database is often due not to the chemical or biochemical composition of the sample but to physical information (particle size, etc.). This is called the light scattering effect. On this subject, I refer you to the grains on pretreatments for more details, and to learn how to eliminate as much as possible of this physical information, which can be an artefact if one is looking for chemical or biochemical information.
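This behaviour can be reproduced on simulated spectra. The sketch below is only an illustration under simple assumptions: the physical effect is modeled as a purely multiplicative factor per sample, and the chemical information as one small peak whose height varies; all shapes and ranges are made up:

```python
import numpy as np

rng = np.random.default_rng(7)

# Made-up spectral shapes on an arbitrary wavelength axis.
wavelengths = np.linspace(0.0, 1.0, 200)
base = 1.0 + np.exp(-0.5 * ((wavelengths - 0.4) / 0.05) ** 2)
chem = np.exp(-0.5 * ((wavelengths - 0.7) / 0.03) ** 2)

n = 40
scatter = rng.uniform(0.7, 1.3, size=n)   # physical, multiplicative effect
conc = rng.uniform(0.0, 0.1, size=n)      # small chemical variation

X = scatter[:, None] * (base + conc[:, None] * chem)

# PCA on the column-centered spectra.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained_pc1 = s[0] ** 2 / np.sum(s**2)

# Compare the PC1 loading with the average spectrum (cosine of the angle).
mean_spectrum = X.mean(axis=0)
cosine = abs(Vt[0] @ mean_spectrum) / np.linalg.norm(mean_spectrum)

print(explained_pc1, cosine)
```

With these settings the physical factor dominates, so PC1 carries almost all the variance and its loading is almost collinear with the average spectrum, while the chemical peak only appears in later components.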

 

Question:
How to interpret the curve of loadings = eigenvectors?

Reply:

The loadings curve for a component of the PCA in fact represents the weight (contribution) of each initial (measured) variable in the construction of the principal component considered. We must therefore interpret the loadings in this sense, the weight can be positive or negative.

 

Question: QQ plot to identify atypical values?

Instead of studying T² and Q, can we use a QQ plot and eliminate the values which do not approach the theoretical quantile?

Reply:

The QQ plot can be used to identify individuals who deviate from a population that follows a normal distribution. It can be used for the Ys and also for the scores, but the normality assumption is not always satisfied in practice, which limits its use.

For example, soil organic carbon data does not generally follow a normal distribution, but rather lognormal.

 

 

Grain 5: Pretreatments-1

Question: correction of multiplicative noise?

This grain explains how to reduce the effect of additive noise, and much less the methods to reduce multiplicative noise. Is the part of noise having an additive effect the most important? Or is it just that there are more ways to reduce it?

Reply:

Indeed, there are more methods for additive noise than for multiplicative noise. This may be because, in the laboratory, we work under conditions that avoid multiplicative effects (hence the form of the Beer-Lambert law). However, under less controlled conditions, such as backscattering on powdered products, the multiplicative effect is inevitable and it is important to eliminate it.

Question: pretreatment and Detrend

A linear or quadratic Detrend by a least squares model assumes the normality of the residuals, which is not the case for a spectrum. Does this method introduce artifacts into the corrected spectrum? Is one baseline correction method more recommended than another (ALS, continuum removal, detrend)?

Reply:

It is correct that the linear regression operated by Detrend does not respect the assumptions of normality of the residuals, because a spectrum has a non-Gaussian structure. However, it is worth distinguishing two objectives that can be sought when correcting the baseline:

- remove an additive effect (of the a x lambda + b type) to facilitate the modeling step;

- separate the baseline of the spectrum, in order to obtain a "purer" spectrum.

In the first case, it suffices to perform a transformation which cancels out the effect of the baseline. Here, Detrend does the trick and there is no need to satisfy the normality assumptions. Indeed, if S0 is the spectrum without a baseline, and if S = S0 + a x lambda + b is the spectrum plus a baseline, we can check that Detrend(S) does not depend on a and b.

In the second case, Detrend is not suitable, because Detrend(S0 + a x lambda + b) is not equal to S0. To find S0, you have to use other baseline identification techniques, such as Asymmetric Least Squares (ALS) or Continuum Removal, two techniques widely used in remote sensing, where it is very important to obtain "pure" spectra.
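Both statements can be checked numerically. In this minimal sketch, detrend is a hypothetical helper that subtracts a least-squares straight line fitted to the spectrum; S0 is a made-up Gaussian peak and the values of a and b are arbitrary:

```python
import numpy as np

def detrend(spectrum, wavelengths, degree=1):
    """Subtract the least-squares polynomial fitted to the spectrum."""
    coeffs = np.polyfit(wavelengths, spectrum, degree)
    return spectrum - np.polyval(coeffs, wavelengths)

# Made-up "pure" spectrum: a single Gaussian peak.
wavelengths = np.linspace(1000.0, 2500.0, 301)
S0 = np.exp(-0.5 * ((wavelengths - 1900.0) / 60.0) ** 2)

# Same spectrum with an additive baseline a x lambda + b.
a, b = 4e-4, 0.3
S = S0 + a * wavelengths + b

# First objective: the result does not depend on a and b.
same_after_detrend = np.allclose(detrend(S, wavelengths),
                                 detrend(S0, wavelengths))

# Second objective: Detrend does not give back S0 itself.
recovers_S0 = np.allclose(detrend(S, wavelengths), S0)

print(same_after_detrend, recovers_S0)
```

The first check succeeds because the fitted line absorbs any a x lambda + b term exactly; the second fails because the fitted line also absorbs part of the peak, which is why Detrend(S0) differs from S0.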

 

Grain 6: Unsupervised classifications

 

Question: HAC (hierarchical ascending classification) and calculation of distances

What is the relevance of a hierarchical classification when the variables are strongly correlated, for example spectral variables?
Another question: would a descending hierarchical classification give the same results, and is there an advantage in considering a descending tree rather than an ascending one?

Reply:

A strong correlation between variables does not pose a problem with unsupervised classification methods using the usual Euclidean distance.

In the case of spectral variables, when the first principal component has a very high variance, it is on this dimension that the creation of groups will be done mainly and more marginal phenomena could be "forgotten" during the segmentation.

Regarding the second question, there are indeed top-down algorithms for unsupervised classification. They consist in dividing all the individuals into two groups then, successively, in dividing these groups into two, and so on. It should be noted that these methods do not necessarily produce the same trees as the bottom-up methods.

They are generally used to classify large numbers of individuals (the classification can be long in this case) and/or when only the upper part of the tree interests the analyst. In the spectral case, these techniques might be of interest for classifying variables, for example.

Grain 7: Linear regressions: simple, multiple, PCR, PLSR

Question: multiple correlation coefficient

We are asked to give the multiple correlation coefficient. We saw that, for simple regression, we could deduce R by taking the square root of R² and giving it a positive or negative sign according to the slope of the line. In the case of a multiple linear regression, on the other hand, is it possible to define its sign?

Reply:

In multiple linear regression, the multiple correlation coefficient R is the square root of R², therefore between 0 and 1. To find out the relationship between an explanatory variable and the response, we can look at the sign of its coefficient in the model, or partial correlations.
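A minimal sketch on simulated data illustrates this. The two explanatory variables and the coefficients 2 and -3 are made up; R is computed as the square root of R², and the signs are read from the fitted coefficients:

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated data: y increases with x1 and decreases with x2.
n = 40
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 * x1 - 3.0 * x2 + rng.normal(scale=0.5, size=n)

# Multiple linear regression with an intercept.
A = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ coef

# R² and the multiple correlation coefficient R (between 0 and 1).
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
R = np.sqrt(r2)

# R is also the correlation between y and its fitted values.
R_check = np.corrcoef(y, y_hat)[0, 1]

# The direction of each relationship is read from the coefficient signs.
signs = np.sign(coef[1:])

print(R, signs)
```

R has no sign of its own because it summarizes the link between y and all the explanatory variables at once; here one relationship is positive and the other negative.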

Question: multilinear regression: Y vs X or X vs Y?

In the example on multilinear regression, we try to explain the density of polyethylene terephthalate by spectrometry. The explanatory variables are the spectral data, which are measured and therefore subject to error. The density Y, on the other hand, is measured by a reference method, and therefore with less error. However, the presentation indicates that the explanatory variable(s) must be known without error, the residuals relating only to Y. Haven't you swapped Y and X?

Reply:

MLR theory assumes that only the response Y has an error and that the Xs are independent and identically distributed. Thanks to these assumptions, one can analytically calculate the error of the model and the variance of the slope, which makes it possible to test whether the slope is zero; a zero slope indicates that there is no model.

In practice, particularly in chemometrics, these hypotheses are never verified. The theoretical formulas of MLR are used nevertheless, but the validation of the models relies on tests.

Question: error on X and Y?

Since we know that no measure is known without error, whether on X or on Y, are there any regression methods that account for errors on Y and on X?

Reply:

BLS regression answers your question. The presentation by Marion Berger (Sanofi) and Bernard G. Francq (University of Glasgow) explains its context of use and the functions of the "bivRegBLS" R package, which can handle models with a variance on X and a variance on Y.

Question: MLR with a number of variables much greater than the number of samples?

We are asked in activity 4 to do an MLR on 106 IR spectra. We therefore have a matrix of dimensions 106 x 612 (106 samples for spectra with 612 different wavelengths). The number of variables (the wavelengths) is therefore much greater than the number of samples. How can we do an MLR under these conditions? cf. extract from the pdf of grain 8: "But when the variables in the matrix X are very correlated and in greater number than the individuals, it is not possible to solve this equation".

Reply:

You are absolutely right: it is IMPOSSIBLE to do a true MLR with a 106 x 612 descriptor matrix.

You are in fact the victim of a misuse of language. Many statisticians and chemometricians call MLR an "adapted" form of the true MLR, which consists in "forcing" the inversion of the X'X matrix. In practice, this inversion is replaced by a pseudo-inverse. This pseudo-inverse performs a PCA of X and retains the scores corresponding to a "non-zero" eigenvalue (greater than a certain threshold). It is therefore nothing more than an over-fitted PCR, because it takes as many latent variables as possible. ChemFlow uses this type of MLR.
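The equivalence between this "adapted" MLR and a PCR that keeps every component can be checked on simulated data of the same size (106 x 612). This is a sketch with random numbers and an arbitrary threshold of 1e-10; it does not reproduce the ChemFlow implementation:

```python
import numpy as np

rng = np.random.default_rng(5)

# Random data with the same shape as in the activity: 106 samples, 612 variables.
n, p = 106, 612
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

Xc = X - X.mean(axis=0)
yc = y - y.mean()

# X'X is 612 x 612 but its rank is at most 105, so it cannot be inverted.
rank = np.linalg.matrix_rank(Xc)

# "Adapted" MLR: pseudo-inverse, ignoring singular values below a threshold
# (second argument: relative cutoff, chosen arbitrarily here).
b_pinv = np.linalg.pinv(Xc, 1e-10) @ yc

# PCR keeping every component with a "non-zero" singular value.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
keep = s > 1e-10 * s[0]
T = U[:, keep] * s[keep]                     # scores
P = Vt[keep].T                               # loadings
q = np.linalg.lstsq(T, yc, rcond=None)[0]    # regression of y on the scores
b_pcr = P @ q

same_coefficients = np.allclose(b_pinv, b_pcr)

# The calibration data are fitted perfectly, even though y is pure noise.
fit_residual = np.max(np.abs(Xc @ b_pinv - yc))

print(rank, same_coefficients, fit_residual)
```

The residual on the calibration set is essentially zero although y contains no information at all, which is the signature of an over-fitted model.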

Question: number of samples for a model?

MLR, PCR, PLS are very elegant techniques. But I am puzzled by the amount of samples consumed to build a model. For example 106 determinations of oleic acid by gas chromatography (GC), to provide information on the explained variable. Then 106 infrared spectra for explanatory variables. And if I have understood correctly, it is in order to then measure oleic acid no longer by GC, but by infrared spectrometry on new samples. In this case it must then be necessary to assay a lot of new samples to make profitable the hundreds of assays done in GC just to build the model ... ??

In a laboratory that uses these techniques, what is the proportion between the number of samples actually analyzed (assayed or classified), and the samples used for calibration (or construction of the model)?

Reply:

Regression techniques require a lot of samples for calibration. They are indeed intended for situations where a large number of samples must be analyzed quickly.

The minimum number of observations to build the calibration set is usually several hundred. There is no maximum, and some calibration sets reach tens of thousands of observations. The more observations there are, and the more different conditions they cover (year, place, spectrometer, temperature, humidity, etc.), the more the model will give correct results in different situations. By observation we mean a sample on which both the reference measurement and the infrared spectrum have been acquired.

Building the calibration set requires a certain financial investment, since the reference measurement is expensive, which is why some models are sold. But once the model is built, it can be applied to predict an unlimited number of new samples, on which the reference method is no longer needed, except for a small number of them used by quality procedures to regularly check the quality of the model predictions.

So a good model is always very profitable.

On the other hand, it can happen that, despite a large number of observations, the model does not meet our expectations, quite simply because the spectrum does not contain enough information on the parameter to be predicted. This is the risk.

Question: ypredicted against yobserved, or the reverse?

It is usual to plot ypredicted against yobserved to check the quality of the prediction of PLS models. However, some software plots yobserved against ypredicted instead (ypredicted on the x-axis). I conclude that we plot observed vs predicted because we did an inverse calibration with PLS (we swapped the Xs and Ys compared to a classical regression). What is your opinion on the matter?

Reply:

Here are some references in favor of yobserved = f(ypredicted): 1) T. Fearn, "Which Way Round? A Tale of Two Regressions", NIR News, vol. 9 (3), 1998, pp. 6-7; 2) G. Pineiro, S. Perelman, J. Guerschman and J. M. Paruelo, "How to evaluate models: Observed vs. predicted?", Ecological Modelling, vol. 216 (3-4), 2008, pp. 316-322; 3) plus one (not found) by Pierre Dardenne, who is of the same opinion.

And, in my opinion, the philosophy of "yobserved = f(ypredicted)" is to say that we would expect to have, for "linked data", a slope of 1 and an intercept of 0.

This might have some logic, because in the case of the calibration subset, predictions are made on the very subset that was used to calibrate the model.

We do the linear regression with both "yobserved = f(ypredicted)" AND "ypredicted = f(yobserved)", and only the regression "yobserved = f(ypredicted)" achieves this goal (slope = 1 and intercept = 0).

Having tested this with calibration subsets, the calculations systematically give slopes of 1 and intercepts of 0, with standard deviations much less than 10^-3 in both cases. This would mean that the linear regression is more than perfect... with points that are not aligned on the line (!). With validation subsets, this phenomenon no longer appears, probably because the two subsets are independent (the validation subsets did not take part in building the model).

I am rather in favor of using "ypredicted = f(yobserved)" for the reasons given in Grain 7.
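The behaviour described for calibration subsets can be reproduced in a few lines. The sketch below uses an ordinary least-squares model with an intercept as a stand-in for PLS (an assumption made to keep the example short) and simulated data:

```python
import numpy as np

rng = np.random.default_rng(6)

# Simulated calibration set: 60 samples, 3 predictors, noisy response.
n = 60
X = rng.normal(size=(n, 3))
y_observed = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=1.0, size=n)

# Least-squares model with intercept, then predictions on the same samples.
A = np.column_stack([np.ones(n), X])
y_predicted = A @ np.linalg.lstsq(A, y_observed, rcond=None)[0]

# Regression "yobserved = f(ypredicted)".
slope_obs, intercept_obs = np.polyfit(y_predicted, y_observed, 1)

# Regression "ypredicted = f(yobserved)".
slope_pred, intercept_pred = np.polyfit(y_observed, y_predicted, 1)

r2 = np.corrcoef(y_observed, y_predicted)[0, 1] ** 2

print(slope_obs, intercept_obs, slope_pred, r2)
```

On the calibration samples, the first regression returns a slope of 1 and an intercept of 0 by construction, whatever the scatter of the points, while the slope of the second equals R². On an independent validation set neither property is guaranteed.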


Question: reference methods versus spectroscopic methods?

NIRS-type methods (SPIR in French) are calibrated from the results of quantitative reference analytical methods. A main objective is to be able to replace these reference methods, which are sometimes complicated to implement, by these simpler and faster spectroscopic methods. Is it possible to compare the measurement uncertainty of the reference methods with that of spectroscopic methods based on PLS-type regression models? More generally, how should spectroscopic methods be evaluated (and with what indicators) compared to reference methods?

Reply:

A common belief is that, since NIRS is calibrated on lab values, the NIRS prediction error is necessarily greater than that of the lab. This is often the case, but not always!

When we plot the y predicted by NIRS (Yh) as a function of the lab y (Yl), we observe discrepancies between prediction and lab: D = Yh - Yl. From these we derive quality statistics for the model, such as the SEP, which is the standard deviation of D. However, the standard deviation of the error on Yl (that is, the lab error, denoted SEL) contributes to that of D. The SEP must therefore be corrected as follows: SEPc² = SEP² - SEL²

We can then obtain a NIRS error lower than the lab error. This usually happens when the lab method is relatively imprecise and the calibration set is very large.
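As a worked example of this correction, here is a minimal sketch. The function name corrected_sep and the numerical values (SEP = 0.5, SEL = 0.4) are made up, and the formula is applied under the assumption that the lab errors and the prediction errors are independent:

```python
import math

def corrected_sep(sep, sel):
    """Return SEPc such that SEPc² = SEP² - SEL² (hypothetical helper)."""
    difference = sep**2 - sel**2
    if difference < 0:
        # Can happen with estimated values; the correction is then undefined.
        raise ValueError("SEL is larger than SEP: correction not applicable")
    return math.sqrt(difference)

# Made-up values, in the unit of the measured property.
sep = 0.5   # standard deviation of D = Yh - Yl
sel = 0.4   # standard error of the lab method

sep_c = corrected_sep(sep, sel)
print(sep_c, sep_c < sel)
```

With these values the corrected prediction error (about 0.3) is smaller than the lab error (0.4), which is the situation described above.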

Question: PLS-1 versus PLS-2?

In many scientific papers, authors use PLSR as a univariate method even when they have multiple y values to predict. In other words, they use the full NIRS spectrum to build one model for dry matter, another for crude protein, etc. In addition, each model is built using 4-7 latent variables. Isn't this a misuse of PLSR and a way of over-fitting models?

Reply:

You raise a very important problem, which can fuel an interesting debate. Here are some arguments.

On the algorithm side: it is true that PLS was designed to predict multiple responses, but since its inception the main algorithms have specialized in PLS-1 (one response).

On the application side: let's take your example, which I guess comes from cereals. When we measure grains by NIRS, we can indeed try to calibrate a model to predict, from the same spectrum, the dry matter and the protein content, i.e. two responses, Y1 and Y2. The problem is that these two variables are most certainly correlated, by "nature".

On the strategic side: the question is then whether to build one PLS-2 between the spectra and (Y1, Y2), or two PLS-1 models, one between the spectra and Y1 and one between the spectra and Y2. In theory, you are right, PLS-2 is superior to two PLS-1 models because it takes into account the correlation between Y1 and Y2. However, the observed performance is often much better when each model is optimized for its own response. For example, a baseline correction can be detrimental to the dry matter model and useful to the protein model. Additionally, the calibration against Y1 may actually be based on a calibration against Y2, exploiting the natural correlation between the two variables. Robustness issues then arise when this correlation changes, for example from one harvest year to another.

The ideal would be to "break" the correlation between Y1 and Y2 and to build two PLS-1 models. This could perhaps be achieved with a design of experiments.

Question: How to choose the most suitable method for your problem?

There are many methods for working towards an efficient prediction model: regression methods, variable selection, spectral decomposition, etc. But how do we choose, among the different methods, the one best suited to our problem? Is it a question of habit, of flair, should all the methods be tested, or is there a preferred approach?

Reply:

The methods you cite fall into several categories, which correspond to different purposes.

There are two major categories: exploration and calibration. Exploration can be simple, like PCA or ICA; oriented towards the discovery of groups, like clustering; or oriented towards chemistry, like resolution into pure spectra and concentrations (MCR). Calibration can be oriented towards classes (discrimination) or towards concentrations (regression).

As you can see, each goal has its method.

Grain 11: Good modeling practices

Question: "learning" and "cross validation"?

What do the two terms "learning" and "cross-validation" mean? It would be very kind of you to take a little time to explain them.

Reply:

Building a regression (or discrimination) model consists in calculating a function F that best relates your predictor variables (X) to the variable to be predicted (Y), which we can write Y = F(X) + R. Learning (or training) consists in calculating F, given two matrices X and Y. At the end of this operation, we can assess the fit of the model by evaluating the amount of information left over in R. For example, for a regression, the RMSEC is in a way the norm of R. When the modeling method includes parameters that must be tuned, a common approach is to test the model on a validation set for different values of these parameters. Observing how the validation error varies with the parameter values guides you towards the right values. Cross-validation is a special case of validation, and follows the same principle.
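To make Y = F(X) + R concrete, here is a minimal sketch in Python, assuming the simplest possible F (a straight line fitted by least squares) and made-up data. The function names are illustrative. The RMSEC is computed here as the root mean square of the residuals; some software divides by the number of observations minus the number of model parameters instead.

```python
import math

def fit_line(x, y):
    """Least-squares fit of y = a * x + b; this plays the role of F."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def rmsec(x, y, a, b):
    """Root mean square of the residuals R = Y - F(X) on the training set."""
    residuals = [yi - (a * xi + b) for xi, yi in zip(x, y)]
    return math.sqrt(sum(r ** 2 for r in residuals) / len(residuals))

# Made-up calibration data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = fit_line(x, y)
print(f"F(x) = {a:.3f} * x + {b:.3f}, RMSEC = {rmsec(x, y, a, b):.3f}")
```

The RMSEC only describes how well F fits the data it was trained on; it says nothing about new samples, which is why a validation step is needed.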

Question: different cross validations?

What is the difference between "k-block cross-validation" (the default in Chemflow) and "Venetian blinds cross-validation", which is the one proposed in the activities?

Reply:

Cross-validation involves dividing the observations into several groups, say k groups. There are several ways to do this; here are some of them.

- The k-block method: for each of the k blocks or groups, the observations are drawn at random from the complete starting set. The same observation can be in several groups, or in none! And if you repeat a cross-validation of this type, it is normal not to get exactly the same cross-validation values.

- Venetian blinds: observations are taken in file order and assigned successively to group 1, group 2, etc., up to group k. Observation k+1 is then assigned to group 1, observation k+2 to group 2, ..., observation 2k to group k, observation 2k+1 to group 1, and so on until the last observation.

- Jackknife: the total number of observations is divided by k to give the number of observations per group, say n. The first n observations then go to group 1, the next n to group 2, etc.

It is also possible to define the composition of the groups yourself. This is particularly useful when the dataset includes replicates, because it is essential that the replicates done on the same sample all fall into the same group.
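The deterministic group assignments described above can be sketched in a few lines of Python. The function names are illustrative and are not those of Chemflow; groups are numbered from 0 rather than from 1, and the replicate-aware function is only one possible way of building groups by hand.

```python
def venetian_blinds(n_obs, k):
    """Observation i (in file order) goes to group i mod k."""
    return [i % k for i in range(n_obs)]

def contiguous_blocks(n_obs, k):
    """Consecutive observations share a group: first block -> 0, next -> 1, ..."""
    return [i * k // n_obs for i in range(n_obs)]

def groups_from_samples(sample_ids, k):
    """Keep all replicates of one sample in the same group.

    Samples are dealt to the k groups in order of first appearance.
    """
    group_of_sample = {}
    for sid in sample_ids:
        if sid not in group_of_sample:
            group_of_sample[sid] = len(group_of_sample) % k
    return [group_of_sample[sid] for sid in sample_ids]

# Hypothetical file: 7 spectra, samples "a" to "c" measured twice, "d" once.
sample_ids = ["a", "a", "b", "b", "c", "c", "d"]
print(venetian_blinds(7, 3))
print(contiguous_blocks(7, 3))
print(groups_from_samples(sample_ids, 3))
```

Note that with Venetian blinds the two replicates of sample "a" land in different groups, which is exactly the situation to avoid when the file contains replicates.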

Question: cross validations and model?

Where is the model after the cross-validation? What model do we take?

Reply:

Nowhere... We do not keep any model. Cross-validation is not intended to create a model, but to test modeling options.

Question: error of the cross-validations?

How do you calculate the cross validation error?

Reply:

As cross-validation is based on a partition (complete and without overlap), each individual is predicted once and only once. At the end of the cross-validation we therefore have a ŷ and a y for each individual; comparing them makes it possible to calculate an error.
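As an illustration, here is a minimal Python sketch of this calculation. It assumes a deliberately trivial model (a line through the origin) in place of a real chemometric model, made-up data, and a group vector that forms a partition. The names are illustrative.

```python
import math

def fit_slope(x, y):
    """Toy model: least-squares line through the origin, y = slope * x."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

def cross_validate(x, y, groups):
    """Predict each observation once, with a model fitted without its group."""
    y_hat = [None] * len(y)
    for g in sorted(set(groups)):
        train = [i for i, gi in enumerate(groups) if gi != g]
        test = [i for i, gi in enumerate(groups) if gi == g]
        slope = fit_slope([x[i] for i in train], [y[i] for i in train])
        for i in test:
            y_hat[i] = slope * x[i]
    # PRESS: sum of squared differences between y and its CV prediction.
    press = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    return math.sqrt(press / len(y)), y_hat

# Made-up data and a Venetian blinds partition into 3 groups.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.2, 3.9, 6.1, 8.3, 9.8, 12.1]
groups = [0, 1, 2, 0, 1, 2]
rmsecv, y_hat = cross_validate(x, y, groups)
print(f"RMSECV = {rmsecv:.3f}")
```

The models fitted inside the loop are discarded; only the vector of predictions ŷ is kept, and it contains exactly one value per individual.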

Question: cross validations with a multistage model?

How do you cross-validate when you have a multi-stage model (e.g. PLS + LDA)?

Reply:

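The cross-validation should repeat the entire model: every stage must be recomputed within each cross-validation segment, using only the observations kept for training in that segment.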
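Here is a minimal Python sketch of this principle. To stay short and dependency-free, it does not implement PLS + LDA: it assumes a simpler two-stage model (autoscaling followed by a nearest-centroid classifier) as a stand-in, with made-up data. The point is only that both stages are refitted inside each segment.

```python
import math

def fit_scaler(rows):
    """Stage 1: column means and standard deviations of the training rows."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    stds = []
    for col, m in zip(zip(*rows), means):
        var = sum((v - m) ** 2 for v in col) / n
        stds.append(math.sqrt(var) or 1.0)  # constant column: divide by 1
    return means, stds

def apply_scaler(rows, scaler):
    means, stds = scaler
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

def fit_centroids(rows, labels):
    """Stage 2: one centroid per class, in the scaled space."""
    centroids = {}
    for label in set(labels):
        members = [r for r, l in zip(rows, labels) if l == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def predict(rows, centroids):
    def dist2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [min(centroids, key=lambda c: dist2(r, centroids[c])) for r in rows]

def cv_error(rows, labels, groups):
    """Refit BOTH stages in each segment; return the misclassification rate."""
    errors = 0
    for g in sorted(set(groups)):
        train = [i for i, gi in enumerate(groups) if gi != g]
        test = [i for i, gi in enumerate(groups) if gi == g]
        scaler = fit_scaler([rows[i] for i in train])
        train_scaled = apply_scaler([rows[i] for i in train], scaler)
        centroids = fit_centroids(train_scaled, [labels[i] for i in train])
        test_scaled = apply_scaler([rows[i] for i in test], scaler)
        predicted = predict(test_scaled, centroids)
        errors += sum(p != labels[i] for p, i in zip(predicted, test))
    return errors / len(rows)

# Made-up two-class data, Venetian blinds with 2 groups.
rows = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10], [10, 11], [11, 10], [11, 11]]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]
groups = [0, 1, 0, 1, 0, 1, 0, 1]
print(cv_error(rows, labels, groups))
```

Fitting the first stage once on the whole data set and cross-validating only the second stage would let the test observations influence the model, and the cross-validation error would then tend to be optimistic.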

Question: cross validations and its RMSECV error?

Is the RMSECV a good estimate of the model prediction error?

Reply:

No, not systematically.

Question: cross-validations and bias?

Why is there never a bias between ŷ and y after cross-validation?

Reply:

Haaa, you noticed too!

Grains 12 and 13: Discrimination

Question: Cross-validation in PLS-DA

I am wondering about cross-validation in PLS-DA for choosing the number of PLS components. Two criteria are indicated: minimum PRESS and percentage of correctly classified samples. I would be tempted to use the percentage of correctly classified samples, because I don't know whether the PRESS (sum of squared deviations) has a real meaning in the case of a binary variable (0 and 1). Thank you for your reply.

Reply:

You are quite right. Generally speaking, the best criterion to use in cross-validation is the one closest to the objective of your model. Here, since you want to discriminate, the best option is to look at the classification error.
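A small Python sketch shows how the two criteria can disagree. The cross-validated predictions below are hypothetical values invented for the illustration, and the 0.5 decision threshold is an assumption, not something imposed by PLS-DA.

```python
def press(y_true, y_pred):
    """Sum of squared deviations between the 0/1 class code and its prediction."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))

def classification_error(y_true, y_pred, threshold=0.5):
    """Fraction of observations whose thresholded prediction is wrong."""
    wrong = sum((p > threshold) != (t == 1) for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)

y = [0, 0, 0, 1, 1, 1]
# Hypothetical cross-validated predictions for two candidate models.
model_a = [0.4, 0.4, 0.4, 0.6, 0.6, 0.6]  # every sample on the right side of 0.5
model_b = [0.0, 0.0, 0.6, 1.0, 1.0, 1.0]  # one sample on the wrong side

for name, pred in (("A", model_a), ("B", model_b)):
    print(name, round(press(y, pred), 2), round(classification_error(y, pred), 3))
```

Model B has the lower PRESS although it misclassifies one sample, while model A classifies every sample correctly. If the objective is discrimination, the classification error is the criterion that reflects it.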

Grain 15: Selection of variables

Question: Covsel and PLSR

I would like to know if I have understood the nuance with CovSel correctly. We would rather use CovSel when we want to predict a variable with a minimum of explanatory variables, for example when we want to develop a sensor and select the most interesting wavelengths for this prediction (without it being less efficient), whereas PLSR will be more efficient than CovSel and is rather used to develop a model on the full set of explanatory variables... I would also like more explanation of the CovSel method: how does it manage to choose the variables? I understood that this involves a covariance calculation, but I have a little trouble understanding...

Reply:

You are absolutely right about the use of CovSel.

As for how CovSel works, you can think of it as a PLSR whose latent variables would be forced to be of the form (0000001000000 ... 000), so that one variable is selected at each step.

I am well aware that my explanation can only help you if you have understood the PLSR ...
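For readers who prefer code, here is a minimal Python sketch of the selection loop, assuming a single response and made-up data. At each step it picks the variable with the largest squared covariance with the response, then removes from all variables (and from the response) the part explained by the selected one. The function names are illustrative and this is not the reference implementation.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def deflate(v, z):
    """Remove from v its projection onto z."""
    coef = dot(z, v) / dot(z, z)
    return [vi - coef * zi for vi, zi in zip(v, z)]

def covsel(X, y, n_select):
    """Return the indices of the selected columns of X, in selection order."""
    n, p = len(X), len(X[0])
    # Center each column of X and the response.
    cols = []
    for j in range(p):
        col = [row[j] for row in X]
        mean = sum(col) / n
        cols.append([v - mean for v in col])
    mean_y = sum(y) / n
    resp = [v - mean_y for v in y]

    selected = []
    for _ in range(min(n_select, p)):
        candidates = [j for j in range(p) if j not in selected]
        # Variable with the largest squared covariance with the response.
        best = max(candidates, key=lambda j: dot(cols[j], resp) ** 2)
        z = cols[best][:]
        if dot(z, z) == 0.0:
            break  # nothing left to select from
        selected.append(best)
        # Deflation: keep only what the selected variable does not explain.
        cols = [deflate(c, z) for c in cols]
        resp = deflate(resp, z)
    return selected

# Made-up data: 5 observations, 3 variables; y is the sum of columns 0 and 1.
X = [[1, 2, 1], [0, 4, 2], [1, 6, 3], [0, 8, 4], [1, 10, 6]]
y = [3, 4, 7, 8, 11]
print(covsel(X, y, 2))
```

In this example column 2 is strongly correlated with column 1, so once column 1 has been selected and removed, column 2 has little new information to offer and column 0 is chosen instead. This is the effect of the deflation step.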

Grains 17 and 18: Spectral decomposition, MCR-ALS and ICA

Questions: about ICA

1. Is ICA always able to isolate the noise from the signal, or is having noise on top of the signal in a component the consequence of retaining too many components?

2. What is the recommended algorithm for ICA? What (free) toolboxes can be used in R or Matlab?

3. Can ICA be used to preprocess spectra, in particular to isolate multiplicative and additive effects?

Answers:

1) Even if the ICA or MCR-ALS methods do, in a way, limit the noise present in your data, that noise still propagates into your components, even if you have correctly estimated the rank of your data set.

2) It is the JADE algorithm, implemented in Matlab or Scilab.

3) ICA and MCR-ALS are no different from other exploration methods. The multiplicative and additive effects are sources of variance which, a priori, do not interest you. They must therefore be corrected prior to analysis; otherwise you risk extracting pure spectra and concentrations that do not represent the analytical reality of your system. As for using ICA (or MCR-ALS) as a pretreatment for multiplicative and additive effects, it is very unlikely to work because it is almost impossible to extract pure components for these effects.

Question: PLS vs ICA or MCR-ALS?

Reply:

Regarding your first question, the analytical objectives are not the same when you apply a PLS regression or a resolution (or unmixing) method like ICA or MCR-ALS. In a spectroscopic regression, you want to predict the concentration of a product of interest from a spectrum. Obviously, you must then have a spectrum / concentration pair available for each sample in order to carry out the calibration phase. For resolution methods, you only have the spectral data at your disposal. Your objective here is above all to explore your spectral data set without prior assumptions, even if the final result is expressed in the form of 'pure' spectra and associated concentrations for each sample.

Regarding your second question, even if you have known pure spectra at your disposal, experience shows that it is not advisable to use them, because very often the spectrum of the pure compound of interest within the mixtures studied does not correspond to the one you acquired directly on the pure compound (molecular interactions, baseline variations, deformation of the spectrum due to physical phenomena...).

In the very early days of chemometrics, spectra of pure compounds were used in regression methods of the CLS / ILS type, but difficulties linked to the problems mentioned above were quickly observed.

Grains 19 and 20: Multi-array / multi-block / multiway

Question: examples of multiblock uses

Could you give a concrete example of the use of multi-block methods on arrays where the variables (columns) are identical, and the individuals (rows) are divided into several arrays?

Reply:

First, we are in a multi-block situation as soon as we have more than one table of data. Thus, all data acquired for the purpose of discriminating between two or more groups of samples on the same set of variables are considered multiblock data. There is therefore no shortage of concrete examples: this particular class of multiblock data is as abundant as the situations in which two or more groups of samples are to be discriminated. Then, any method aimed at analyzing multiblock data is considered a multiblock method. From this point of view, all discrimination methods are multi-block methods. The main question with this particular class of multiblock data relates to the objective of the analysis, that is to say: what do we wish to highlight by using one method rather than another?

You may want to discriminate between groups, and this can be done by choosing a discrimination method. Although an unsupervised approach is possible, it seems relevant to systematically choose a supervised method for this objective, because the objective calls for it and a supervised method is designed to highlight the part of the variability that best expresses the difference between groups. Discriminating between groups of samples is not the only possible objective when one has a set of tables where the variables (columns) are identical and the individuals (rows) are divided into several groups. For example, we may want to understand how the information at the scale of the variables is structured from one group to another. In other words, how are the correlations between the variables organized from one group to another? The use of a multi-block method, for example ACOM, is relevant to meet this objective.