Applications of PCA to the Monitoring of Hydrocarbon Content in Marine Sediments by Means of Gas Chromatographic Measurements


Introduction
The application of Principal Component Analysis (PCA) in biochemical studies lies in the field of chemometrics, the discipline which describes and applies multivariate statistical methods to laboratory studies. PCA, like Cluster Analysis (CA), belongs to the so-called unsupervised pattern recognition methods, multivariate methods which can be applied to any data set without requiring or assuming any preliminary knowledge about the information present in the data (Massart & Kaufman, 1983; Brereton, 2003). PCA has also been defined as "a data reduction form" for its ability to reduce the dimension of an experimental data set without losing the qualitative and quantitative information present (Brereton, 2003). In matrix notation, the PCA decomposition of a multivariate experimental data set X including several samples is given by Equation 1,
$$X = S V' + E \qquad (1)$$
where S is the score matrix, V' is the transposed loading matrix and E is the noise matrix. With respect to the original X set (n samples, t variables), the dimensions of the new matrices change: S has dimension (n, p), V' has dimension (p, t) and only E retains the dimension of X. The term p represents the number of significant principal components, or factors, determined by PCA; they describe a high fraction of the total variance (i.e. information) present in the X matrix and, very importantly, p is always significantly lower than the number t of original variables of the X matrix.
This data reduction ability of PCA is very helpful when large multivariate data sets have to be analysed and interpreted. In common environmental monitoring studies, PCA is applied to discrete multivariate data when, for instance, several sites with their pollutant loads have to be analysed and compared (Cicero et al., 2001; Conti & Mecozzi, 2008). However, in environmental studies the power of PCA becomes even more helpful when large sets of analytical signals such as GC chromatograms have to be analysed. In fact, gas chromatography is a widespread technique for monitoring oil spills in terrestrial and marine environments (Wang et al., 1999); in the case of marine sediments, it is used to establish the total hydrocarbon content and distribution, to test the homogeneity or heterogeneity of pollutant loads and to identify the sources of oil spills (Wang et al., 1999). This last task is hard to achieve because any chromatogram is a multivariate sample in which many hydrocarbons are usually present. A typical GC chromatogram, reported in Figure 1, is a data file with two columns: the acquisition times of the analytical signals and their detected intensities. The hydrocarbons present are identified by means of their retention times (i.e. the time corresponding to the maximum peak intensity).
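The two-column structure of a chromatogram file and the definition of retention time given above can be illustrated with a minimal Python sketch. It assumes whitespace-separated numeric columns with no header line, and the function names and the `min_height` threshold are hypothetical choices for illustration, not part of the authors' procedure.

```python
def parse_chromatogram(text):
    """Parse two-column text (time, intensity) into two lists of floats."""
    times, intensities = [], []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or incomplete lines
        times.append(float(parts[0]))
        intensities.append(float(parts[1]))
    return times, intensities


def peak_retention_times(times, intensities, min_height):
    """Return the times of local maxima with intensity >= min_height."""
    peaks = []
    for i in range(1, len(intensities) - 1):
        y = intensities[i]
        if y >= min_height and y > intensities[i - 1] and y >= intensities[i + 1]:
            peaks.append(times[i])
    return peaks


# Tiny synthetic example (not real GC data).
example = "0.0 1.0\n0.1 2.0\n0.2 9.0\n0.3 2.0\n0.4 1.0\n0.5 6.0\n0.6 1.0"
t, y = parse_chromatogram(example)
print(peak_retention_times(t, y, min_height=5.0))
```

On real chromatograms the noise discussed in the following paragraph would produce many spurious local maxima, so a sketch of this kind is only meaningful after smoothing.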
The chromatogram of Figure 1 shows the presence of more than fifty hydrocarbons; in addition, the fast signal sampling introduces a non-negligible noise which corrupts the real intensity of the signals (Mecozzi & Tomassetti, 2007;Kokaly et al., 2001). As a consequence, a numerical and visual comparison of different chromatograms is hardly feasible when we try to establish homogeneous or heterogeneous hydrocarbon compositions among samples, as shown by the example of Figure 2. According to Equation 1, PCA re-describes the starting X data set by means of a set of p new variables (i.e. factors) which, being far fewer than the original variables, allow samples to be compared by means of simple two- or three-dimensional plots of the score matrix. In addition, PCA examines the variables which determine similarity or dissimilarity among samples by means of the loading analysis. Loadings are the statistical weights of the original t variables of the X matrix and, in the case of chromatographic data, their analysis allows the hydrocarbons which characterise each sample to be identified. This is a peculiar advantage of PCA with respect to Cluster Analysis, which is known as a fast screening method for determining similarity in experimental data sets but allows neither the statistical weight of the variables to be determined nor the specific variables responsible for qualitative similarities and dissimilarities among samples to be studied (Brereton, 2003).
However, the application of PCA to large data sets requires some preprocessing treatments in order to avoid potential misinterpretation of its results. In fact, a GC data file such as the chromatogram of Figure 1 consists of about 20,000 analytical signals, and when we examine a data set including thirty or forty samples, the resulting X matrix has high dimension and redundancy. This causes long PCA computation times and analytical problems such as reduction of the signal to noise (S/N) ratio and baseline drift. The selection of proper preprocessing treatments of chromatograms can solve all these problems and supports the correct application of PCA to large multivariate data sets. In this paper we discuss the application of PCA to hydrocarbon monitoring in two different GC chromatographic sets. Our study takes into account all the steps for a correct application of PCA to high-dimension chromatographic data files. The first set consists of 29 superficial sediments from two different areas along the coasts of the Italian seas, seventeen from the Venice lagoon (Adriatic sea) and twelve from Bagnoli (near Naples, Tyrrhenian sea); the second set consists of 39 subsamples of marine sediments coming from a sediment core taken in the Antarctic sea.
The main purpose of applying PCA is to retrieve information hardly detectable by means of the conventional methods of GC analysis of hydrocarbons in environmental studies.

Experimental section
This experimental study consists of five steps: sampling of marine sediments, hydrocarbon extraction and purification from other lipid compounds present in marine sediments (Mecozzi et al., 2011), gas chromatographic analysis of the extracts, chemometric pretreatment of chromatograms and application of PCA. PCA was applied to two chromatographic data matrices, one including all the samples from the Italian coasts and one including the samples of the Antarctic sediment core.

Sampling of marine sediments
Marine sediments from the Italian coasts were sampled with a box corer, taking the upper 5 cm layer. Figure 3 reports the location of the two sampling areas along the Italian coasts. Samples were stored frozen at -25°C until chemical analysis. The Antarctic sediment core was sampled at the B5/Y5 station (75° 04' South, 164° 13' East) in the Ross bay at a depth of 550 m. This area is characterised by an intense stratification of sediment and of biogenic organic materials. The sediment core was taken by means of a dredge sampler and was stored frozen at -25°C until GC analysis.

Hydrocarbon extraction
Hydrocarbon content was extracted and purified by means of an ultrasound method developed in our laboratory (Mecozzi et al., 2011). n-Hexane (20 ml) and H2O (40 ml), acidified to pH 2 with concentrated HCl, were added to each sediment sample (20 g). The sediment was sonicated in an ultrasound cleaning bath operating at 35 kHz for 20 minutes at room temperature. Then the supernatant was separated from the sediment by centrifugation. The aqueous phase was separated from the organic phase in a separating funnel; then the organic phase was dried over anhydrous Na2SO4. This process was repeated twice more, the extracts were combined and the organic phase was concentrated under vacuum to a final volume of 1 ml for GC analysis.

Gas chromatographic analysis
The hydrocarbons extracted from marine sediments were determined using a Carlo Erba (Milano, Italy) instrument with a flame ionization detector (FID). The apparatus was equipped with a capillary GC column Therm 1 (Thermo Scientific, Milano, Italy), 30 m length, i.d. 0.22 mm. The experimental conditions were injector 320°C and FID detector 360°C, and the sample was introduced in splitless mode (one minute). The temperature program used for the chromatographic separation of hydrocarbons was 70°C for four minutes, then a thermal gradient of 15°C min^-1 to 340°C; this temperature was finally held for fourteen minutes. Chromatograms were saved as ASCII files for further elaboration.

Improvements of analytical quality data and reduction of computation time
Handling large data sets prior to PCA application requires the preliminary solution of several drawbacks; in fact, the high-frequency sampling of analytical signals produces data redundancy and long computation times, together with analytical drawbacks such as reduction of the signal to noise (S/N) ratio and baseline drift (Christensen and Tomasi, 2007). The same authors suggested several chemometric procedures for reducing these effects prior to applying PCA to GC data; with this aim, an in-house MATLAB (Natick, MA, USA) routine was applied to each collected chromatogram. In the Appendix we report a MATLAB routine written according to the algorithms described by Christensen and Tomasi (2007). Figure 4 reports an example of this approach. After this pretreatment, GC chromatograms were saved again as ASCII files.
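The three pretreatment steps named above (redundancy reduction, smoothing and baseline removal) can be sketched in Python as follows. This is only an illustration under stated assumptions and not the authors' MATLAB routine: redundancy is reduced here by block averaging, smoothing uses the standard 5-point quadratic Savitzky-Golay weights, and the baseline is assumed to be the straight line joining the first and last points. The function names are hypothetical.

```python
def reduce_redundancy(signal, factor):
    """Average consecutive blocks of `factor` points (a short tail is dropped)."""
    n_blocks = len(signal) // factor
    return [sum(signal[k * factor:(k + 1) * factor]) / factor
            for k in range(n_blocks)]


# Standard 5-point quadratic Savitzky-Golay smoothing weights.
SG5 = (-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35)


def savgol5(signal):
    """Smooth with the 5-point filter; the two points at each end are kept as is."""
    out = list(signal)
    for i in range(2, len(signal) - 2):
        out[i] = sum(w * signal[i + j - 2] for j, w in enumerate(SG5))
    return out


def remove_linear_baseline(signal):
    """Subtract the straight line joining the first and last points (assumption)."""
    n = len(signal)
    if n < 2:
        return list(signal)
    slope = (signal[-1] - signal[0]) / (n - 1)
    return [y - (signal[0] + slope * i) for i, y in enumerate(signal)]


def pretreat(signal, factor):
    """Apply the three steps in sequence to one chromatogram."""
    return remove_linear_baseline(savgol5(reduce_redundancy(signal, factor)))
```

With a reduction factor of 4, for instance, a chromatogram of about 20,000 points would shrink to about 5,000 points before PCA; the appropriate factor depends on the peak widths and has to be checked on the data.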

Standardisation of the GC data set
Standardisation, also called scaling, is another fundamental step prior to PCA application. It is necessary for reducing the effect of the different magnitudes of intensity variation in multivariate data, which would otherwise cause an incorrect determination of the total variance of the data system (Brereton, 2003;Wang et al., 1999;Noda, 2008). In environmental monitoring, where PCA is often applied to study the distribution of pollutant loads, a common scaling technique is autoscaling; given the column vector Y to be included in the X matrix and consisting of n sampled analytical signals, autoscaling transforms the data according to
$$Y_i^{as} = \frac{Y_i - Y_M}{\sigma} \qquad (2)$$
where Y_i, Y_i^{as}, Y_M and σ are the original i-th value, its autoscaled value, the average value of the Y vector and the standard deviation of the Y vector respectively. After autoscaling, any new Y series to be included in the X matrix has mean 0 and variance 1.
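As a minimal sketch of autoscaling, the Python function below subtracts the mean of a Y vector and divides by its standard deviation. The function name is hypothetical, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption, since the text does not state which convention was used.

```python
import math


def autoscale(y):
    """Return (Y_i - Y_M) / sigma for every element of y."""
    n = len(y)
    mean = sum(y) / n
    # Assumption: sample standard deviation (n - 1 in the denominator).
    sigma = math.sqrt(sum((v - mean) ** 2 for v in y) / (n - 1))
    if sigma == 0.0:
        raise ValueError("constant series cannot be autoscaled")
    return [(v - mean) / sigma for v in y]


scaled = autoscale([1.0, 2.0, 3.0, 4.0, 5.0])
print(scaled)
```

Values below the mean become negative after this transformation, which is relevant for the baseline behaviour of scaled chromatograms.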
This is a very powerful approach for reducing the effect of different size ranges on the total variance of discrete data sets, but when it is applied to other types of variables such as analytical signals, autoscaling has a marked drawback. In fact, digitised files of spectroscopic and chromatographic data generally consist of several thousand signals sampled at a high acquisition frequency. In this case autoscaling can often enhance the noise, because the noise is divided by a small value of standard deviation (Noda, 2008;Kokalj et al., 2011).
Other scaling techniques are available for overcoming the disadvantage of autoscaling. In the mean centring technique, data are scaled according to
$$Y_i^{mc} = Y_i - Y_M \qquad (3)$$
where Y_i^{mc} is the mean centred value of the Y series, while Y_i and Y_M have the same meaning as in Equation 2. Normalization scaling consists of transforming data according to
$$Y_i^{norm} = \frac{Y_i - Y_{min}}{Y_{max} - Y_{min}} \qquad (4)$$
where Y_i^{norm}, Y_{min} and Y_{max} are the normalized Y_i term and the minimum and maximum values of the Y series respectively. After normalization, all the Y vectors range between 0 and 1.
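Mean centring and normalization as described above can be sketched in a few lines of Python. The function names are hypothetical and the sketch operates on a single Y vector given as a list of numbers.

```python
def mean_centre(y):
    """Subtract the mean Y_M from every element of y."""
    mean = sum(y) / len(y)
    return [v - mean for v in y]


def normalize(y):
    """Rescale y linearly so that its minimum maps to 0 and its maximum to 1."""
    lo, hi = min(y), max(y)
    if hi == lo:
        raise ValueError("constant series cannot be normalized")
    return [(v - lo) / (hi - lo) for v in y]


print(mean_centre([1.0, 2.0, 3.0]))   # [-1.0, 0.0, 1.0]
print(normalize([2.0, 4.0, 6.0]))     # [0.0, 0.5, 1.0]
```

Mean centring produces negative values wherever the signal lies below its mean, whereas normalization keeps every value non-negative.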
Pareto scaling is a technique proposed by the Italian economist Vilfredo Pareto (Noda, 2008); it consists of dividing the values of the Y series by the square root of its standard deviation according to
$$Y_i^{p} = \frac{Y_i}{\sqrt{\sigma}} \qquad (5)$$
where Y_i^{p} is the Pareto scaled value of the original Y_i value and σ has the same meaning as in Equation 2.
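A minimal Python sketch of Pareto scaling as described in the text, dividing each value by the square root of the standard deviation, is given below. The function name is hypothetical and the sample standard deviation is an assumption. Note that many descriptions of Pareto scaling in the literature also subtract the mean before dividing; the sketch follows the division-only form stated here.

```python
import math


def pareto_scale(y):
    """Return Y_i / sqrt(sigma) for every element of y."""
    n = len(y)
    mean = sum(y) / n
    # Assumption: sample standard deviation (n - 1 in the denominator).
    sigma = math.sqrt(sum((v - mean) ** 2 for v in y) / (n - 1))
    if sigma == 0.0:
        raise ValueError("constant series cannot be Pareto scaled")
    root = math.sqrt(sigma)
    return [v / root for v in y]
```

Because no mean is subtracted in this form, a non-negative chromatogram stays non-negative after scaling and the ratios between peak intensities are unchanged.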
Each scaling technique produces different effects on the quality of the analytical signals, so that the selection of the appropriate scaling requires a careful evaluation of the results produced.
We report examples of the application of all the above scaling methods in Figures 5 and 6 in order to support the selection of the most appropriate method prior to PCA application to GC data. With respect to the original chromatogram, autoscaling causes a baseline drift with negative analytical signals and, in addition, noise is enhanced in some zones of the chromatogram, as shown by the example of Figure 5 (middle plot).
Mean centred scaling also causes a baseline drift with negative analytical signals, though it does not cause the S/N ratio reduction observed for autoscaling (Figure 5, bottom plot).
Normalization and Pareto scaling cause neither negative baseline drifts nor evident noise enhancement (Figure 6), so we recommend applying one of these as the scaling pretreatment. These techniques can be applied by means of a common spreadsheet such as Excel for Windows. In any case, in the Appendix section we report two ad hoc routines written in MATLAB for applying the above scaling techniques.

Application of PCA to gas chromatographic data set
PCA was applied to GC chromatograms by an in-house routine written in MATLAB (Natick, MA, USA, ver. 5.0) according to the singular value decomposition algorithm described by Geladi (2002). The listing of the routine is reported in the Appendix section.
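For readers without MATLAB, the decomposition into scores and loadings can be illustrated with the dependency-free Python sketch below. It is not the authors' routine and it does not use singular value decomposition: as a plainly swapped-in technique, it applies power iteration with deflation to the small n x n Gram matrix of the samples, which yields the same scores and loadings up to sign when the leading eigenvalues are well separated. Column mean centring, the function name and the iteration count are assumptions.

```python
import math


def pca_gram(X, n_factors=2, n_iter=200):
    """PCA of X (rows = samples, columns = variables) via the n x n Gram matrix.

    Returns (scores, loadings, explained), each indexed by factor first:
    scores[f][i] refers to sample i, loadings[f][j] to variable j.
    """
    n, t = len(X), len(X[0])
    # Assumption: columns are mean centred before the decomposition.
    means = [sum(row[j] for row in X) / n for j in range(t)]
    Xc = [[row[j] - means[j] for j in range(t)] for row in X]
    # The Gram matrix Xc * Xc' is only n x n, which is cheap when n << t.
    G = [[sum(a * b for a, b in zip(Xc[i], Xc[k])) for k in range(n)]
         for i in range(n)]
    total = sum(G[i][i] for i in range(n))  # total sum of squares
    scores, loadings, explained = [], [], []
    for f in range(n_factors):
        u = [math.sin(i + 1.0 + f) for i in range(n)]  # fixed start vector
        for _ in range(n_iter):
            w = [sum(G[i][k] * u[k] for k in range(n)) for i in range(n)]
            norm = math.sqrt(sum(v * v for v in w))
            if norm == 0.0:
                break
            u = [v / norm for v in w]
        Gu = [sum(G[i][k] * u[k] for k in range(n)) for i in range(n)]
        lam = sum(u[i] * Gu[i] for i in range(n))  # eigenvalue estimate
        if total == 0.0 or lam <= 1e-12 * total:
            break  # no variance left to describe
        root = math.sqrt(lam)
        scores.append([root * v for v in u])
        loadings.append([sum(Xc[i][j] * u[i] for i in range(n)) / root
                         for j in range(t)])
        explained.append(lam / total)
        # Deflation: remove the factor just found before seeking the next one.
        G = [[G[i][k] - lam * u[i] * u[k] for k in range(n)] for i in range(n)]
    return scores, loadings, explained
```

Calling `pca_gram(X, n_factors=2)` on a matrix whose rows are scaled chromatograms returns the two score vectors needed for a score plot, the two loading vectors and the fraction of the total variance explained by each factor.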

Chemical reagents
All the chemical reagents used for the experimental work were of analytical reagent grade (Carlo Erba, Milan, Italy) and only ultrapure MilliQ water was used for all chemical treatments of the samples.

Application of PCA to hydrocarbon analysis in sediments from two areas of Italian coasts
Figure 7 reports the score plot of the first vs. the second factor obtained by applying PCA to the GC chromatograms of the superficial sediment samples taken along the coasts of the Adriatic and Tyrrhenian seas. These two factors extracted by PCA explain 90.7% of the total variance of the chromatographic data set. This very high fraction of information, retained in two factors only, is an impressive example of the ability of PCA as a "data reduction form"; the visual comparison of GC samples is now possible by means of a simple two-dimensional plot, thanks to the reduction of the starting 20,000 variables (i.e. the retention times of hydrocarbons) to the two PCA factors.
The clustering of samples, which reveals homogeneity and heterogeneity among them, is also evident and does not require further multivariate methods such as discriminant analysis to investigate the classification of samples. Though these samples come from different seas and areas, some samples of the two areas have comparable hydrocarbon compositions, as shown by the several VL and BG samples present in the same cluster, while samples of the Bagnoli area show different hydrocarbon compositions. This result means that the contributions of several biogenic (i.e. natural) and anthropogenic hydrocarbons can sometimes make even sediments from different areas, such as the two seas, comparable. These results can hardly be retrieved by the visual examination of the 29 chromatographic plots. Moreover, PCA can give additional information concerning the qualitative composition of the samples, because loading analysis can detect the hydrocarbon characteristics determining the similarities and dissimilarities observed in Figure 7.
The loading plot of the first factor (Figure 8) shows the generally high variability present in the hydrocarbon distribution of environmental samples, as this factor explains 83.9% of the total variance. Moreover, Figure 8 allows characteristics of the hydrocarbon distribution of these samples to be retrieved. Pristane and phytane are two hydrocarbons able to characterise the biogenic and the anthropogenic sources present in environmental samples. In fact, pristane is a hydrocarbon typical of biogenic sources whereas phytane is a hydrocarbon typical of anthropogenic sources (Wang et al., 1999;Mecozzi et al., 2008;Duan et al., 2010). In this loading plot, pristane is negligible (retention time 15.5 minutes) whereas phytane is present (retention time 16.2 minutes in Figure 8, upper plot). In addition, the wax hydrocarbons (i.e. those with more than 24 carbon atoms), which are also typical of biogenic sources (Wang et al., 1999;Duane et al., 2010;Ibbotson and Ibhadon, 2010;Ahad et al., 2011), are absent, as shown by the negligible presence of chromatographic peaks with retention times higher than 20 minutes (Mecozzi et al., 2011). So the first loading plot describes the anthropogenic feature of the examined samples. The loading plot of the second factor (Figure 9), though explaining only about 7% of the total variance, shows that the samples in the upper cluster of Figure 7 are characterised by small concentration changes of some specific hydrocarbons related to biogenic hydrocarbon sources. In fact, with respect to the loading plot of Figure 8, here several linear hydrocarbons with carbon number higher than 24 are present, and this is a marker of biogenic hydrocarbons (Wang et al., 1999;Duane et al., 2010).
Obviously, due to the heterogeneity of the hydrocarbon composition, PCA cannot specify the concentration changes of a single hydrocarbon; in any case, it is relevant that we can compare samples of different origins, thus addressing the general lack of methods for comparing regional differences in areas subject to potential hydrocarbon spills (Fraser et al., 2008).
Another interesting and useful advantage of using PCA on GC monitoring data is its support for the application of another widespread unsupervised pattern recognition method, Cluster Analysis. As its name suggests, CA classifies data by identifying clusters of data having relevant similarities, and for this purpose it uses the multivariate distance among samples (Massart and Kaufman, 1983).
CA is considered a fast screening method for exploratory data analysis, though it does not identify the variables which determine similarity or dissimilarity among samples; this remains a peculiar ability of PCA (Figures 8 and 9).
Fig. 9. Loading plot of the second factor for data from the two areas, the Venice lagoon and Bagnoli near Naples. The arrows show the presence of some linear high molecular weight hydrocarbons, with more than 24 carbon atoms, typical of biogenic sources.
However, the application of CA to samples with more than 20,000 variables, as in the case of the GC data, is almost impossible due to computational and collinearity problems among variables (Massart and Kaufman, 1983). Conversely, when CA is applied to the PCA scores, we have many advantages, because this approach requires only a small number of uncorrelated factors and does not require the use of a specific distance such as the Mahalanobis one (Massart and Kaufman, 1983). This approach, reported in Figure 10, shows that the samples are clustered in perfect agreement with Figure 7; now, by means of the data reduction of PCA, we can apply CA to estimate the percent similarity existing among samples.
Fig. 10. Cluster Analysis of GC data performed by means of PCA scores. The ratio (Dlink/Dmax)*100 on the ordinate axis is the quantitative measure of the dissimilarity among samples.
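Clustering on PCA scores, with dissimilarity expressed as (Dlink/Dmax)*100, can be sketched in Python as follows. The text does not state which linkage rule or distance was used, so single linkage with Euclidean distance is an assumption here, as is the reading of Dmax as the largest linkage distance in the tree; the function names are hypothetical.

```python
import math


def single_linkage(points):
    """Agglomerative clustering; returns merges as (cluster_a, cluster_b, distance)."""
    clusters = [[i] for i in range(len(points))]
    merges = []

    def linkage(a, b):
        # Single linkage: smallest Euclidean distance between members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges


def relative_linkage(merges):
    """Express each linkage distance as 100 * Dlink / Dmax."""
    dmax = max(d for _, _, d in merges)
    return [100.0 * d / dmax for _, _, d in merges]


# Hypothetical two-factor scores for four samples (not the paper's data).
scores = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(relative_linkage(single_linkage(scores)))
```

Each point here is the pair of scores of one sample on the first two factors, so the clustering works in two dimensions instead of the original 20,000.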

Application of PCA to hydrocarbon analysis of sediment samples from an Antarctic core
The application of PCA to the chromatographic data set of an Antarctic sediment core (Figure 11) gives even more peculiar results than those obtained in the previous section. As the Antarctic continent is uncontaminated, we can reasonably suppose that the hydrocarbons present in the sediment core samples depend essentially on biogenic contributions, with negligible anthropogenic contributions. If so, the hydrocarbons observed along the sections of the Antarctic sediment core should have a qualitatively homogeneous composition depending on the biogenic contributions. As a consequence, the observed quantitative changes should depend on the natural stratification events only. The results reported in the score plot of Figure 11 support this hypothesis.
Fig. 11. Score plot of the first vs. the second factor from PCA applied to GC chromatograms of the Antarctic sediment core. The two factors explain 67.0% and 11.56% respectively of the total variance.
The first factor explains 67.0% of the total variance and its score values are almost constant, while the positions of the samples change with the scores of the second factor, which explains 11.56% of the variance. The loading analysis reported in Figure 12 gives many details which clarify these findings. In the first factor, a significant hydrocarbon peak at high molecular weight (retention time close to 35 minutes) is evident; it is assigned to the linear hydrocarbon with 38 carbon atoms. This is a wax hydrocarbon, typical of biogenic contributions arising from the degradation of living cells (Duane et al., 2010).
PCA confirms the supposed prevalence of biogenic contributions for these samples, given the prevalent presence of the biogenic linear hydrocarbon with 38 carbon atoms, and suggests a significantly homogeneous composition mostly governed by the natural stratification of sediments. In addition, if the hydrocarbon distribution along the core sections is determined by the natural stratification of sediments only, we can suppose that it is governed by time. In this case, we can test the hypothesis of a time-dependent stratification of the hydrocarbon distribution in sediments by means of the autocorrelation function, a typical approach in time series analysis (Brereton, 2003). In fact, autocorrelation is a tool for studying the time trend and periodicity present in a univariate data set according to the regressive model
$$Y_{t+1} = m Y_t + \mathrm{const}, \qquad t = 1, 2, \ldots, n$$
Autocorrelation is easily applied to univariate time series data, but its application to multivariate data such as chromatographic ones can be performed after a PCA data reduction, under the condition that the first factor explains a high percentage of the total variance (Brereton, 2003). In the case of the Antarctic core samples this condition is fulfilled (i.e. 67% in the first factor) and the first factor can be considered as a univariate time series. So we can examine our data by the autocorrelation method using the score values of the first factor. We report the result of this time series PCA application in Figure 13. The descending time trend of the hydrocarbon distribution, depending on the natural stratification of sediments only, is clearly supported by the shape of the autocorrelation plot.
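The autocorrelation of the first-factor scores can be computed with a short Python sketch such as the one below. The estimator shown, lagged products of mean-centred values divided by the total sum of squares, is a common textbook form and is an assumption here, since the text does not give the exact formula used; the function name and the example series are hypothetical.

```python
def autocorrelation(series, max_lag):
    """Return the autocorrelation coefficients r_0 ... r_max_lag of a series."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    denom = sum(d * d for d in dev)
    if denom == 0.0:
        raise ValueError("constant series has no defined autocorrelation")
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / denom
            for k in range(max_lag + 1)]


# Hypothetical score values ordered by depth along the core (not real data).
first_factor_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
print(autocorrelation(first_factor_scores, max_lag=2))
```

The series must be ordered by depth along the core, since the lag index stands in for time only under the stratification hypothesis discussed above.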
On the basis of this finding, we can verify that the hydrocarbon distribution shows a time trend depending on its biogenic contributions, because if anthropogenic sources were also present, we should observe a more irregular vertical profile and not the time trend suggested by Figure 11 and clearly confirmed by Figure 13.

Conclusion
In this study, we have presented all the aspects of pretreatment, scaling and use related to the application of PCA to complex multivariate chromatographic data sets coming from environmental studies. As far as data pretreatment is concerned, we stress the importance of data redundancy reduction, signal to noise improvement and data scaling. For the latter, we have shown the advantages given by the normalization and Pareto scaling techniques with respect to the more commonly applied autoscaling technique. When applied to two specific cases of environmental studies, PCA allows much more information to be retrieved than that obtained by the conventional visual examination of GC chromatograms. Both case studies show the power of PCA for exploratory data analysis in chromatography; in addition, its ability as a "data reduction form" supports the use of other statistical and complementary techniques such as CA and Time Series Analysis in the interpretation and verification of the environmental results.

Acknowledgments
This study has been jointly supported by the research project "ASTRA" financed by the Ente Nazionale Idrocarburi (ENI), Milan, Italy and by the National Research Antarctic Program (PROGDEF09_125) financed by the Italian Ministry of the University and Research.

Appendix
MATLAB routine for performing reduction of data redundancy, smoothing and baseline correction and removal
function [d,g]=gcpretreatment(chromatogram,factored);
% Routine for data redundancy reduction, Savitzky-Golay filtering and baseline removal in