Partial Least Squares Model based Process Monitoring using Near Infrared Spectroscopy

On-line analyzers are widely used in the chemical and oil industries to estimate product properties and monitor production processes. Partial Least Squares (PLS) regression is known as a bilinear factor model because it projects input (X) and output (Y) data into low-dimensional spaces. We show how this projection can be utilised in process monitoring and in the validation of on-line analysers. We apply the proposed methodology to a diesel fuel mixer where the main product properties are estimated from near infrared spectra. Results show that the developed 2 Dimensional Partial Least Squares (2DPLS) model not only gives better property estimation performance than the currently applied Topological Near Infrared modelling tool (TOPNIR), but is also able to provide an informative map of the operating regimes of the process.


Introduction
Control of measured process values (e.g. temperature, pressure, flow rate) does not always ensure that product properties (e.g. density, cloud point, flash point) will be in the desired ranges. In the chemical and oil industries some of these properties are not measured online (e.g. cetane index, aromatic field, sulfur content) or are not measured at the frequency necessary for real-time control (e.g. flash point, density, cold filter plugging point). The objective of developing software sensors and online analysers is to support the control of product properties that cannot be measured online or whose offline measurement would be expensive.
The interaction of signals such as temperatures, pressures, flow rates or absorption intensities can be used for calculating unmeasured product properties (flash point, density, etc.). Soft sensors are especially useful in data fusion, where measurements of different characteristics and dynamics are combined.
Near infrared spectroscopy is a widely used on-line measurement technique. There are several multivariate models and methods to support the prediction of product properties based on Near-Infrared (NIR) spectra. These methods can be separated into parametric models (e.g. linear regression, multi-linear regression, Partial Least Squares regression (PLS)) and nonparametric methods (e.g. k-NN [1], False Nearest Neighbors (FNN), Neural Networks, Topological Near-Infrared Modeling (TOPNIR) [2, 3]). The main difference between these two classes is that the nonparametric techniques cannot extrapolate.
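The extrapolation remark can be illustrated with a toy comparison. The following is a minimal sketch on made-up one-dimensional data, not the refinery data; the helper `knn_predict` and all variable names are hypothetical. A k-NN prediction is an average of training outputs, so it cannot leave the range of the training responses, while a fitted linear model continues its trend outside the training range.

```python
import numpy as np

def knn_predict(x_train, y_train, x_new, k=3):
    """Predict by averaging the outputs of the k nearest training samples."""
    distances = np.abs(x_train - x_new)
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()

# Made-up training data: y = 2x observed only on the interval [0, 1].
x_train = np.linspace(0.0, 1.0, 11)
y_train = 2.0 * x_train

# Parametric model: straight line fitted by least squares.
slope, intercept = np.polyfit(x_train, y_train, 1)

# Query point outside the training range.
x_new = 2.0
y_linear = slope * x_new + intercept          # follows the trend beyond the data
y_knn = knn_predict(x_train, y_train, x_new)  # bounded by max(y_train)

print(f"linear: {y_linear:.2f}, k-NN: {y_knn:.2f}")
```

The k-NN estimate stays at or below the largest training response, which is the behaviour the paragraph above refers to.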
The key idea of the paper is to utilise the multivariate signal of NIR analysers not only for building models that estimate product quality, but also for process monitoring and for the validation of the models used in on-line analysers.
A PLS based prediction model has been developed to support both prediction and visualisation (monitoring) [4]. Datasets taken from the Dune Refinery of MOL Ltd were analysed. The PLS model is applied to estimate the cold filter plugging point, the density and one distillation property. For monitoring, the latent space of the PLS model is used. A special orthogonalisation algorithm was applied. The presented mapping is able to visualise the data and give information about the distribution of operating regimes and the quality of the model.

Spectroscopic Modeling
The main task of spectroscopic modeling is to find the relation between recorded spectra and relevant material properties, $y_k = f(x_k)$, where $k$ is the index of the samples [5,6]. Data-driven identification of models requires spectral databases. The first part of the database contains the recorded and preprocessed spectra, $X = [x_1, \ldots, x_N]^T$, where $N$ is the number of samples available for model building and each row of $X$ is one spectrum $x_k^T$. In our case the on-line ABB spectrometer records spectra in the range 4000-4800 cm$^{-1}$. Each recorded spectrum contains 195 equally spaced absorbance values in this range. The second part of the training set contains the property values, $y_k = [y_{k,1}, y_{k,2}, \ldots, y_{k,m}]$, which are the output variables of the prediction model. For model identification the $N$ samples of these properties are also arranged in matrix form, $Y = [y_1, \ldots, y_N]^T$. Figure 1 shows a spectral database which contains 651 samples.
Since the prediction model should perform well over the whole operating range of the process, the development of an appropriate model requires a properly distributed training set. Unfortunately, Figure 1 does not give any useful information about the distribution of the data. To get more insight into the structure of the high-dimensional spectral database, visualisation techniques should be applied that are able to map the original $n = 195$ dimensional space into an easily visualisable two-dimensional map. In the following section such a PLS based method is presented.

PLS Concept
Partial least squares (PLS) is a well-suited method for constructing predictive models from a large number of correlated input variables [7].
PLS was developed in the 1960s by Herman Wold as an econometric technique, but it soon became a widely applied tool in chemical engineering [8]. In addition to spectrometric calibration, PLS is often applied to monitoring and controlling industrial processes, since a complex process can easily have hundreds of process variables [4].
PLS tries to find the multidimensional direction in the X space of the input variables that explains the direction of maximum multidimensional variance in the Y space of the output variables. PLS regression is particularly suited to cases where the matrix of predictors has more variables than observations, and where there is multicollinearity among the X values. By contrast, standard regression fails in these cases.
The general underlying model of multivariate PLS is

$$X = T P^T + E$$
$$Y = U Q^T + F$$

where X is an n × m matrix of predictors, Y is an n × p matrix of responses; T and U are n × l matrices that are, respectively, projections of X (the X score, component or factor matrix) and projections of Y (the Y scores); P and Q are, respectively, m × l and p × l orthogonal loading matrices; and matrices E and F are the error terms, assumed to be i.i.d. normal. The decompositions of X and Y are made so as to maximize the covariance of T and U.
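The bilinear decomposition can be made concrete with a small numerical sketch. The code below assumes the common NIPALS algorithm for a single response (PLS1), which is one standard way of computing such a model and is not necessarily the implementation used in this work. The function `pls1_nipals` and the synthetic data are hypothetical. For a single response the Y-side model reduces to y = T q + f.

```python
import numpy as np

def pls1_nipals(X, y, n_components):
    """NIPALS PLS1 on centered data: returns scores T, loadings P,
    y-loadings q and the residuals E (of X) and f (of y)."""
    Xd, yd = X.copy(), y.copy()
    scores, loadings, y_loadings = [], [], []
    for _ in range(n_components):
        w = Xd.T @ yd                 # direction of maximum covariance with y
        w /= np.linalg.norm(w)
        t = Xd @ w                    # X score vector
        tt = t @ t
        p = Xd.T @ t / tt             # X loading vector
        c = (yd @ t) / tt             # y loading (scalar)
        Xd -= np.outer(t, p)          # deflate X
        yd -= c * t                   # deflate y
        scores.append(t)
        loadings.append(p)
        y_loadings.append(c)
    return (np.column_stack(scores), np.column_stack(loadings),
            np.array(y_loadings), Xd, yd)

# Synthetic, collinear data with more variables than observations.
rng = np.random.default_rng(0)
n, m, l = 30, 195, 3
latent = rng.normal(size=(n, 3))
X = latent @ rng.normal(size=(3, m)) + 0.01 * rng.normal(size=(n, m))
y = latent @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=n)
X -= X.mean(axis=0)
y -= y.mean()

T, P, q, E, f = pls1_nipals(X, y, l)
print("X reconstructed:", np.allclose(X, T @ P.T + E))
print("relative y residual:", np.linalg.norm(f) / np.linalg.norm(y))
```

Even with 195 collinear variables and only 30 samples, a few score vectors are sufficient here because the synthetic data were generated from three latent factors.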

2DPLS based Visualization
For the two-dimensional visualization of the PLS model the algorithm developed in [4] was applied. In this subsection the most important details of this technique are summarized based on [4].
Two components that are informative for visualization may be obtained in several ways. One example is principal components of predictions (PCP), where in the scalar response case the normalized prediction ŷ = X b is used as one component, while the residuals of X not contributing to y are suggested for use as the second component.

Fig. 2. 2D PLS mapping
The basic idea behind the applied mapping is illustrated in Figure 2. The estimator b is found in the space spanned by the loading weight vectors in Ŵ = [ŵ1, ŵ2, ..., ŵA], i.e. it is a linear combination of these vectors. It is, however, also found in the plane defined by ŵ1 and a vector w2 orthogonal to ŵ1, which is a linear combination of the vectors ŵ2, ŵ3, ..., ŵA.
The matrix W = [ŵ1, w2] is thus the loading weight matrix in a two-component PLS solution (2PLS) giving exactly the same estimator b as the original solution using any number of components. What matters in the original PLS model is not the matrix Ŵ as such, but the space spanned by ŵ1, ŵ2, ..., ŵA. In the 2PLS model it is the plane spanned by ŵ1 and w2 that is essential. Note that all samples in X (row vectors) in the original PLS model are projected onto the space spanned by ŵ1, ŵ2, ..., ŵA. Samples may thus be further projected onto the plane spanned by ŵ1 and w2, and form a single score plot containing all y-relevant information. When for some reason e.g. ŵ2 is more informative than ŵ1, a plane through ŵ2 and b may be a better alternative. It will in any case result in a 2PLS model that gives the estimator b, as will in fact all planes through b that are at the same time subspaces of the column space of Ŵ.
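The equivalence of the 2PLS and the original estimator can be checked numerically. The following is a minimal sketch under stated assumptions, not the algorithm of [4]: the loading weights are computed with NIPALS PLS1, and the estimator is written as the least-squares solution restricted to the span of the weights, b = W (WᵀXᵀXW)⁻¹ WᵀXᵀy, which is a standard expression for the PLS1 estimator. The functions `pls1_weights` and `restricted_ls` and the synthetic data are hypothetical.

```python
import numpy as np

def pls1_weights(X, y, n_components):
    """NIPALS PLS1 loading weights (orthonormal columns) for centered data."""
    Xd, weights = X.copy(), []
    for _ in range(n_components):
        w = Xd.T @ y
        w /= np.linalg.norm(w)
        t = Xd @ w
        Xd -= np.outer(t, Xd.T @ t / (t @ t))   # deflate X
        weights.append(w)
    return np.column_stack(weights)

def restricted_ls(X, y, W):
    """Least-squares estimator restricted to the column space of W."""
    XW = X @ W
    return W @ np.linalg.solve(XW.T @ XW, XW.T @ y)

# Synthetic centered data (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))
X -= X.mean(axis=0)
y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=60)
y -= y.mean()

W_hat = pls1_weights(X, y, n_components=5)   # original model, A = 5
b = restricted_ls(X, y, W_hat)

# Second direction: the part of b orthogonal to the first weight vector.
w1 = W_hat[:, 0]
w2 = b - w1 * (w1 @ b)
w2 /= np.linalg.norm(w2)
W_2pls = np.column_stack([w1, w2])

b_2pls = restricted_ls(X, y, W_2pls)   # two-component solution
scores_2d = X @ W_2pls                 # coordinates for a single score plot

print("same estimator:", np.allclose(b, b_2pls))
```

Since b lies in the plane spanned by ŵ1 and w2, and this plane is a subspace of the span of Ŵ, the restricted least-squares solution in the plane coincides with b. The two columns of `scores_2d` can then be plotted against each other to obtain the map.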

Application example
The presented research focuses on two tasks. The first task is the development of a prediction model that can estimate product properties based on spectra taken by online NIR analysers. The second task is the development of a monitoring tool based on the visualisation of the same spectra [9].
Datasets collected at the Dune Refinery of MOL Ltd (Hungary) are analyzed. The first dataset ("DS 1") contains 651 samples collected from a diesel fuel mixing process. Approximately twenty material properties are estimated. The second dataset ("DS 2") consists of 67 samples collected from a different process.

Prediction of product properties
The prediction performance of the models is measured by the correlation coefficient, defined as $R(i, j) = C(i, j) / \sqrt{C(i, i)\, C(j, j)}$, where $C$ is the covariance matrix calculated as $C = \mathrm{cov}(y, \hat{y})$. All the presented algorithms, including the k-NN algorithm that TOPNIR utilises, have been implemented in MATLAB. Similarly to the global statistics feature of TOPNIR, we calculated the basic measures for the k = 3 case. As can be seen, the results are slightly better than the global statistics of TOPNIR. Exact numerical reproduction of the results was not possible, since the documentation of the software and the related patent do not contain every detail and trick related to the calculation of distances. Table 1 shows that the number N of available samples differs for each property. Among the 651 spectra only 560 were different, and in most cases only a fraction of the properties was measured.
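The correlation measure can be computed directly from the covariance matrix of measured and estimated values. The sketch below assumes the usual normalization R(i, j) = C(i, j) / sqrt(C(i, i) C(j, j)); the numbers are made up for illustration and are not results from this study, and the function name is hypothetical.

```python
import numpy as np

def correlation_from_covariance(y, y_hat):
    """Correlation between measured and estimated values via cov(y, y_hat)."""
    C = np.cov(y, y_hat)                      # 2 x 2 covariance matrix
    return C[0, 1] / np.sqrt(C[0, 0] * C[1, 1])

# Made-up measured and estimated values (illustration only).
y = np.array([0.820, 0.830, 0.840, 0.850, 0.860])
y_hat = np.array([0.821, 0.828, 0.843, 0.851, 0.858])

r = correlation_from_covariance(y, y_hat)
print(f"R = {r:.4f}")
```

The value agrees with the off-diagonal element returned by `np.corrcoef(y, y_hat)`.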
First, the effect of the dimensionality of the latent space of the PLS model was analysed (from 2 to 48 dimensions). To perform an adequate comparison, leave-one-out and 10-fold cross-validation were applied. Figure 3 shows the performance (correlation coefficients) [10] of the PLS models. As shown in this figure, the accuracy of the model increases rapidly as the dimensionality of the latent space increases from 2 to 6; however, it reaches a maximum once the complexity of the model exceeds the complexity of the modelled system.
Tab. 2. Effect of the number of latent variables on the performance of the model (correlations between the estimated and measured variables are shown).
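The dimensionality study can be sketched as a cross-validation loop over the number of latent variables. The code below is a minimal illustration on synthetic data, assuming a NIPALS-type PLS1 model with mean centering inside each fold. The functions `pls1_fit` and `cv_correlation`, the fold assignment and the data are hypothetical and do not reproduce the MATLAB implementation or the refinery results.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit PLS1 on centered data; return the means and regression vector b."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    Xd, weights = Xc.copy(), []
    for _ in range(n_components):
        w = Xd.T @ yc
        w /= np.linalg.norm(w)
        t = Xd @ w
        Xd -= np.outer(t, Xd.T @ t / (t @ t))   # deflate X
        weights.append(w)
    W = np.column_stack(weights)
    XW = Xc @ W
    b = W @ np.linalg.solve(XW.T @ XW, XW.T @ yc)
    return x_mean, y_mean, b

def cv_correlation(X, y, n_components, n_folds, seed=0):
    """Correlation between y and its cross-validated predictions."""
    idx = np.random.default_rng(seed).permutation(len(y))
    y_hat = np.empty_like(y)
    for fold in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, fold)
        x_mean, y_mean, b = pls1_fit(X[train], y[train], n_components)
        y_hat[fold] = (X[fold] - x_mean) @ b + y_mean
    return np.corrcoef(y, y_hat)[0, 1]

# Synthetic data: four latent factors of different strength drive X and y.
rng = np.random.default_rng(1)
latent = rng.normal(size=(80, 4))
scales = np.array([5.0, 2.0, 1.0, 0.5])
X = (latent * scales) @ rng.normal(size=(4, 40)) + 0.05 * rng.normal(size=(80, 40))
y = latent @ np.array([0.2, 0.5, 1.0, 1.0]) + 0.1 * rng.normal(size=80)

r_10fold = {a: cv_correlation(X, y, a, n_folds=10) for a in range(1, 7)}
r_loo = cv_correlation(X, y, 4, n_folds=len(y))   # leave-one-out

for a, r in r_10fold.items():
    print(f"{a} latent variables: R = {r:.3f}")
```

Setting `n_folds` to the number of samples gives leave-one-out cross-validation. In this synthetic setting the correlation is expected to stop improving once the number of latent variables reaches the number of underlying factors.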

Visualization of operating regimes
In Section 2.2 a special method was presented that can map the PLS latent space into a two-dimensional space by orthogonal signal correction. This method has been compared with Principal Component Analysis (PCA) [11] and Topological Near-Infrared Modeling (TOPNIR) [2,3], which was developed specifically to visualize NIR spectra and to build topological prediction models with the help of the resulting maps [12,13].
TOPNIR uses nonlinear equation pairs (referred to as aggregates) based on a small set of absorption values. Usually 4-6 characteristic wavelengths are selected to formalise a given aggregate that reflects a material property. To maximise the information content of the mapping, the least correlated pair among 14 predefined aggregates was selected (Figure 4). Figure 5 shows the PCA mapping with the first two principal components [11]. This map is more informative. As can be seen, the database contains samples from two different operating modes (summer and winter diesel), and this mapping is able to separate these operating regimes.
The results of 2D PLS can be seen in Figures 6 and 7. The PLS model is more informative since it also utilizes the output variables for the mapping. Figure 6 shows the mapping using density as the output property. Comparing this mapping with the mapping obtained using CFPP (see Figure 7), one can easily see that the operating regimes have much more impact on CFPP than on density. As can be seen, PLS correctly reflects the operating regions and is much better able to detect outliers than the aggregate based mappings.
In the second part of the case study we demonstrate how outlier samples can be identified in the mapped space. As can be seen in Figure 9, DS 2 contains two samples that lie far from the normal operating range (top right corner). The aggregate based mapping cannot identify both samples; it finds only one of the two outliers.
As can be seen in Figures 10 and 11, the 2D PLS gives detailed information for outlier detection. Comparing these plots with the TOPNIR based mapping (Figure 8) and PCA (Figure 9), it can be concluded that the 2PLS technique is the most efficient for detecting outliers in the spectral or in the property space.

Conclusions
There are several multivariate models and methods to support the prediction of product properties based on NIR spectra. Model development cannot be a fully automated process; human supervision and intervention are always needed. To support model development it is very informative to visualise the hidden structure of a complex spectral database in a low-dimensional space. Industrial applications require an easily implementable, interpretable and accurate projection. TOPNIR utilises heuristic nonlinear functions (aggregates) for the mapping of spectra as high-dimensional objects. We proposed a much more sophisticated approach that can be used for simultaneous prediction and visualisation. We adapted a technique that allows PLS to be applied also to the visualisation of a spectral database.
Datasets taken from the Dune Refinery of MOL Ltd were analysed. The PLS model is applied to estimate the cold filter plugging point, the density and one distillation property. The main benefit of this technique is that it allows the extension of the operating region of the model by extrapolation. The proposed PLS based model is able to simultaneously predict unmeasured material properties and monitor the state of the process. Process monitoring is realized in orthogonal two-dimensional plots. These plots can also be used for the effective identification of outliers.

Fig. 1. DS 1 spectral database containing 651 spectra used for property estimation in a diesel fuel mixing process.