Smoothing Spline ANOVA Models and their Applications in Complex and Massive Datasets Smoothing Spline ANOVA Models and their Applications in Complex and Massive Datasets

Complex and massive datasets can be easily accessed using the newly developed data acquisition technology. In spite of the fact that the smoothing spline ANOVA models have proven to be useful in a variety of fields, these datasets impose the challenges on the applications of the models. In this chapter, we present a selected review of the smoothing spline ANOVA models and highlight some challenges and opportunities in massive datasets. We review two approaches to significantly reduce the computational costs of fitting the model. One real case study is used to illustrate the performance of the reviewed methods.

In this chapter, we focus on studying the SSANOVA models. Suppose that the data y i ; x i À Á and i ¼ 1, 2, …, n are independent and identically distributed (i.i.d.) copies of Y; X ð Þ, where Y ∈ Y ⊂ R is the response variable and X ∈ X ⊂ R d is the covariate variable. We consider the regression model: where y i is the response, η is the nonparametric function varying in an infinite-dimensional À Á T is on the domain X ⊂ R d , and e i $ i:i:d: N 0; σ 2 À Á . More general cases, in which the conditional distribution of Y given x, denoted as Y|x, which follows different distributions instead of the Gaussian distribution, will be discussed later. The nonparametric function η in (1) can be decomposed into through the functional ANOVA, where η c is a constant function, η j is the main effect function of x j h i , η kj is the interaction effect of x k h i and x j h i , and so on.
In the model (1), η can be estimated by minimizing the following penalized likelihood functional: where L η ð Þ is a log likelihood measuring the goodness of fit of η, J η ð Þ is a quadratic functional on η to quantify its smoothness, and λ is the smoothing parameter balancing trade-offs between the goodness of fit and the smoothness of η [11][12][13]. The computational complexity of estimating η by minimizing (2) is of the order O n 3 À Á for the sample of size n. Therefore, the high computational costs render SSANOVA models impractical for massive datasets. In this chapter, we review two methods to lower the computational costs. One approach is through the adaptive basis selection algorithm [14]. By carefully sampling a smaller set of basic functions conditional on the response variables, the adaptive sampling reduces the computational costs to O nn * 2 À Á , where n * ≪ n is the number of the sampled basis functions. The computational costs can also be reduced by the rounding algorithm [15]. This algorithm can significantly decrease the sample size to μ by rounding the data with a given precision, where μ ≪ n. After rounding, the computational costs can be dramatically reduced to O μ 3 À Á .
The rest of the chapter is organized as follows. Section 2 provides a detailed introduction to SSANOVA models and the model estimation. The details of adaptive basis selection algorithm and rounding algorithm are reviewed in Section 3. In Appendix, we demonstrate the numerical implementations using the R software.

Smoothing spline ANOVA models
In this section, we first review smoothing spline models and the reproducing kernel Hilbert space. Second, we present how to decompose a nonparametric function on tensor product domains, which lays the theoretical foundation for SSANOVA models. In the end, we show the estimation of SSANOVA models and illustrate the model with a real data example.

Introduction of smoothing spline models
In the model (1), η is located in an infinite-dimensional space. One way to estimate it is to add some constraints and estimate η in a finite-dimensional space. With the smoothness constraint, we estimate η by minimizing the penalized likelihood functional (2), and the minimizer of (2) is called a smoothing spline.

Example 1. Cubic smoothing splines
Suppose that Y|x follows a normal distribution, that is, Then, the penalized likelihood functional (2) can be reduced as the penalized least squares: where € η ¼ d 2 η=dx 2 . The minimizer of (3) is called a cubic smoothing spline [16][17][18]. In (3), the first term quantifies the fidelity to the data, and the second term controls the roughness of the function.

Example 2. Exponential family smoothing splines
Suppose that Y|x follows an exponential family distribution: where a > 0, b, and c are known functions and ϕ is either known or a nuisance parameter. Then, η can be estimated by minimizing the following penalized likelihood functional [19,20]: Note that the cubic smoothing spline in Example 1 is a special case of exponential family smoothing splines when Y|x follows the Gaussian distribution.
The smoothing parameter λ is sensitive to the estimation of η (see Figure 1). Therefore, it is crucial to implement some proper smoothing parameter selection methods to estimate λ. One of the most popular methods is the generalized cross validation (GCV) [21,22]. More details will be discussed in Section 2.6.2.

Reproducing kernel Hilbert space
We assume that readers are familiar with Hilbert space, which is a complete vector space with an inner product well defined that allows length and angle to be measured [23]. In a general Hilbert space, the continuity of a functional, which is required in minimizing (2) on g , is not always satisfied. To circumvent the problem, we optimize (2) in a special Hilbert space named reproducing kernel Hilbert space [24].
For each g ∈ H, there exists a corresponding continuous linear functional L g such that L g f ð Þ ¼ g; f h i, where f ∈ H and Á; Á h i defines the inner product in H. Conversely, an element g L ∈ H can also be found such that g L ; f ¼ L f ð Þ for any continuous linear functional L of H [23]. This is known as the Riesz representation theorem.
where g L is called the representer of L. The uniqueness is in the sense that g 1 and g 2 are considered as the same representer for any g 1 and g 2 satisfying For a better construction of estimator minimizing (2), one needs the continuity of evaluation functional x ½ f ¼ f x ð Þ. Roughly speaking, this means that if two functions f and g are close in Figure 1. The data are generated by the model e. y ¼ 5 þ e 3x1 þ 10 6 x 11 2 1 À x 2 ð Þ 6 þ 10 4 x 2 2 1 À x 2 ð Þ 10 þ 5 cos 2π norm, that is, kf À gk is small, then f and g are also pointwise close, that is, |f x ð Þ À g x ð Þ| is small for all x.
The bivariate function R x; y ð Þ ¼ R x ; R y is called the reproducing kernel of H, which is unique if it exists. The essential meaning of the name "reproducing kernel" comes from its reproducing property for any f ∈ H. In general, a reproducing kernel Hilbert space defines a reproducing kernel function that is both symmetric and positive definite. In addition, Moore-Aronszajn theorem states that every symmetric, positive definite kernel defines a unique reproducing kernel Hilbert space [25], and hence one can construct a reproducing kernel Hilbert space simply by specifying its reproducing kernel.
We now introduce the concept of tensor sum decomposition. Suppose that H is a Hilbert space and G is a linear subspace of H. The linear subspace g is called the orthogonal complement of G. It is easy to verify that for any f ∈ H, there exists a unique This decomposition is called a tensor sum decomposition, denoted by H ¼ G ⊕ G c . Suppose that H 1 and H 2 are two Hilbert spaces with inner products Á; Á h i 1 and Á; Á h i 2 . If the only common element of these two spaces is 0, one can also define a tensor sum Hilbert space The following theorem provides the rules in the tensor sum decomposition of a reproducing kernel Hilbert space.
Theorem 2.2 Suppose that R 1 and R 2 are the reproducing kernel Hilbert spaces H 1 and H 2 , respec- Conversely, if the reproducing kernel R of H can be decomposed into R ¼ R 1 þ R 2 , where both R 1 and R 2 are positive definite, and they are orthogonal to each other, that is,

Representer theorem
In (2), the smoothness penalty term J η ð Þ ¼ J η; η ð Þis nonnegative definite, that is, J η; η ð Þ≥ 0, and hence it is a squared semi-norm on the reproducing kernel Hilbert space g as the null space of J η ð Þ and H J as its orthogonal complement. By the tensor sum decomposition H ¼ N J ⊕ H J , one may decompose the η into two parts: one in the null space N J that has no contribution on the smoothness penalty and the other in H J "reproduced" by the reproducing kernel R Á; Á ð Þ [12].
where ξ k ; k ¼ 1; ::; M f g is the basis of null space N J and R Á; Á ð Þ is the reproducing kernel of H J .
This theorem indicates that although the minimization problem is in an infinite-dimensional space, the minimizer of (2) lies in a data-adaptive finite-dimensional space.

Function decomposition
The decomposition of a multivariate function is similar to the classical ANOVA. In this section, we present the functional ANOVA which lays the foundation for SSANOVA models.

One-way ANOVA decomposition
We consider a classical one-way ANOVA model y ij ¼ μ i þ e ij , where y ij is the observed data, μ i is the treatment mean for i ¼ 1, …, K and j ¼ 1, …, J, and e ij s are the random errors. The treatment mean μ i can be further decomposed as μ i ¼ μ þ α i , where μ is the overall mean and α i is the treatment effect with the constraint P K i¼1 α i ¼ 0.
Similar to the classical ANOVA decomposition, a univariate function f x ð Þ can be decomposed as where A is an averaging operator that averages the effect of x and I is an identity operator. The operator A averages a function f to a constant function f c satisfying A I À A ð Þ¼0. For example, Þf is the treatment effect.

Multiway ANOVA decomposition
can be decomposed similarly to the one-way ANOVA decomposition. Let A j , j ¼ 1, …, d, be the average operator on X j , and then A j f is a constant function on X j . One can define the ANOVA decomposition on X as where h i , and so on.

Some examples of model conduction
Smoothing , then the minimizer of (2) is called a polynomial smoothing spline.
Here, we use an inner product One can easily check that (9) is a well-defined inner product in C m ð Þ 0; 1 ½ with To construct the reproducing kernel, define where i ¼ ffiffiffiffiffiffi ffi À1 p . One can verify that Ð 1 0 k μ ð Þ ν x ð Þdx ¼ δ νμ and ν, μ ¼ 0, 1, …, m À 1, where δ νμ is the Kronecker delta [26]. Indeed, k 0 ; …; k mÀ1 f gforms an orthonormal basis of H 0 . Then, the reproducing kernel in H 0 is For space one can check that the reproducing kernel in H 1 is (See details in [11]; Section~2.3).
SSANOVA models on product domains: A natural way to construct reproducing kernel Hilbert space on product domain Q d j¼1 X j is taking the tensor product of spaces constructed on the marginal domains X j s. According to the Moore-Aronszajn theorem, every nonnegative definite function R corresponds to a reproducing kernel Hilbert space with R as its reproducing kernel. Therefore, the construction of the tensor product reproducing kernel Hilbert space is induced by constructing its reproducing kernel.
where S denotes all the subsets of 1; …; d f g . The component f S in (8) is in the space H S . Based on the decomposition, the minimizer of (2) is called a tensor product smoothing spline. One can construct a tensor product smoothing spline following Theorem 2.3, in which the reproducing kernel term may be calculated in the same way as the tensor product (11).
In the following, we will give some examples of tensor product smoothing splines on product domains.

Smoothing splines on 1;
We construct the reproducing kernel Hilbert space by using In this case, the space H can be further decomposed as The reproducing kernels of tensor product cubic spline on 1; …; K f gÂ 0; 1 ½ are listed in Table 1.
On other product domains, for example, 0; 1 ½ 2 , the tensor product reproducing kernel Hilbert space can be decomposed in a similar way. More examples are available in ( [11], Section~2.4).

General form
In general, a tensor product reproducing kernel Hilbert space can be specified as H ¼ ⊕ j H j , where j ∈ B is a genetic index. Suppose that H j is equipped with a reproducing kernel R j and an inner product f ; g h i j . Denote f j as the projection of f onto H j . Then, an inner product in H can be defined as
where θ j ≥ 0 are the tuning parameters. If a penalty J in (2) has the form (13), the SSANOVA models can be defined on the space H ¼ ⊕ j H j with the reproducing kernel:

Estimation
In this section, we show the procedure of estimating the minimizer b η of (2) under the Gaussian assumption and selecting the smoothing parameters.

Penalized least squares
We consider the same model shown in (1), and then the η can be estimated by minimizing the penalized least squares: Let S denote the n Â M matrix with the i; j ð Þth entry ξ j x i ð Þ as in (6) and R denote the n Â n matrix with the i; j ð Þth entry R x i ; x j À Á with the form (14). Then, based on Theorem 2.3, η can be expressed as where y ¼ y 1 ; …; y n À Á T .
By the reproducing property (5), the roughness penalty term can be expressed as Therefore, the penalized least squares criterion (15) becomes 1 n y À Sd À Rc ð Þ T y À Sd À Rc ð Þ þ λc T Rc: The penalized least squares (16) is a quadratic form of both d and c. By differentiating (16), one can obtain the linear system: Note that (17) only works for penalized least squares (15), and hence a normal assumption is needed in this case.

Selection of smoothing parameters
In SSANOVA models, properly selecting smoothing parameters is important to estimate η [9,27,28], as shown in Figure 1. Here, we introduce the generalized cross validation (GCV) method for the smoothing parameter selection.
For the multivariate predictors, the penalty term in (15) has the form where S is the number of smoothing parameters, which is related to the functional ANOVA decomposition, and θ j 's are the extra smoothing parameters. To avoid overparameterization, we treat λ ¼ λ=θ 1 ; ⋯; λ=θ S ð Þ T as the effective smoothing parameters.
A GCV score is defined as where A λ ð Þ is a symmetric matrix similar to the hat matrix in linear regression. We can select a proper λ by minimizing the GCV score [21].

Case study: Twitter data
Tweets in the contiguous United States were collected over five weekdays in January 2014. The dataset contains information of time, GPS location, and tweet counts (see Figure 2). To illustrate the application of SSANOVA models, we study the time and spatial patterns in this data.
is a function of time and location, where x 1 h i denotes the time and x 2 h i represents the longitude and latitude coordinates. We use the thin-plate spline for the spatial variable and cubic spline for the time variable. As a rotation-free method, the thinplate spline is popular for modeling spatial data [29][30][31]. For a better interpretation, we decompose the function η as where η c is a constant function; η 1 and η 2 are the main effects of time and location, respectively; and η 12 is the spatial-time interaction effect.
The main effects of time and location are shown in Figure 3. Obviously, in panel (a), the number of tweets has the periodic effect, where it attains the maximum value at 8:00 p.m. and the minimum value at 5:00 a.m. The main effect of time shows the variations of Twitter usages in the United States. In addition, we can infer how the tweet counts vary across different locations based on panel (b) in Figure 3. There tend to be more tweets in the east than those in the west regions and more tweets in the coastal zone than those in the inland. We use the scaled dot product to quantify the percentage decomposition of the sum of squares of b y [11], where b y ¼ b y 1 ; …; b y n À Á T is the predicted values of log #Tweets ð Þ , and b η 12 ¼ η 12 x 1 ð Þ; …; η 12 x n ð Þ À Á T is the estimated interaction effect term, where η 12 . In our fitted model,  π ¼ 3 Â 10 À16 , which is so small that the interaction term is negligible. This indicates that there is no significant difference for the Twitter usages across time in the contiguous United States.

Efficient approximation algorithm in massive datasets
In this section, we consider SSANOVA models under the big data settings. The computational cost of solving (17) is of the order O n 3 À Á and thus gives rise to a challenge on the application of SSANOVA models when the volume of data grows. To reduce the computational load, an obvious way is to select a subset of basis functions randomly. However, it is hard to keep the data features by uniform sampling. In the following section, we present an adaptive basis selection method and show its advantages over uniform sampling [14]. Instead of selecting basis functions, another approach to reduce the computational cost is shrinking the original sample size by rounding algorithm [15].

Adaptive basis selection
A natural way to select the basis functions is through uniform sampling. Suppose that we randomly select a subset Thus, the kernel matrix would be R x i ; x À Á , i ¼ 1, …, n. Then, one minimizes (17) in the effective model space: The computational cost will be reduced significantly to O n n2 ð Þ if n is much smaller than n. Furthermore, it can be proven that the minimizer of (2), η, by uniform sampling basis selection, has the same asymptotic convergence rate as the full basis minimizer b η.
Although the uniform basis selection reduces the computational cost and the corresponding η achieves the optimal asymptotic convergence rate, it may fail to retain the data features occasionally. For example, when the data are not evenly distributed, it is hard for uniform sampling to capture the feature where there are only a few data points. In [14], an adaptive basis selection method is proposed. The main idea is to sample more basis functions where the response functions change largely and fewer basis functions on those flat regions. More details of adaptive basis selection method are shown in the following procedure: Step 1 Divide the range of responses y i È É n i¼1 into K disjoint intervals, S 1 , …, S K . Denote |S k | as the number of observations in S k .
Step 2 For each S k , draw a random sample of size n k from this collection. Let Step 3 Combine x * 1 ð Þ , …, x * K ð Þ together to form a set of sampled predictor values x * 1 ; …; x * n * È É , where n * ¼ P K k¼1 n k .
Step 4 Define as the effective model space.
By adaptive basis selection, the minimizer of (2) keeps the same form as that in Theorem 2.3: Let R * be an n Â n * matrix, and its i; j ð Þth entry is R x i ; x * j . Let R * * be an n * Â n * matrix, and its Then, the estimator η A satisfies (17), the linear system of equations in this case is The computational complexity of solving (18) is of the order O nn * 2 À Á , so the method decreases the computational cost significantly. It can also be shown that the adaptive sampling basis selection smoothing spline estimator η A has the same convergence property as the full basis method. More details about the consistency theory can be found in [14]. Moreover, adaptive sampling basis selection method for exponential family smoothing spline models was developed in [32].

Rounding algorithm
Other than sampling a smaller set of basis functions to save the computational resources, for example, the adaptive basis selection method presented previously, [15] proposed a new rounding algorithm to fit SSANOVA models in the context of big data.
Rounding algorithm: The details of rounding algorithm can be shown in the following procedure: Step 1 Assume that all predictors are continuous.
Step 3 Round the raw data by using the transformation: where the rounding parameter r j h i ∈ 0; 1 ð and rounding function RD Á ð Þ transform input data to the nearest integer.
Step 4 After replacing x i j h i with z i j h i , we redefine S and R in (16) and then estimate η by minimizing the penalized least squares (16).

Remark 1 In Step 3, if r j
h i is the rounding parameter for jth predictor and its value is 0.03, then each z i j h i is formed by rounding the corresponding x i j h i to its nearest 0.03.
Remark 2 It is evident that the value of rounding parameter can influence the precision of approximation. The smaller the rounding parameter, the better the model estimation and the higher the computational cost.
Computational benefits: We now briefly explain why the implementation of rounding algorithm can reduce the computational loads. For example, if the rounding parameter r ¼ 0:01, it is obvious that u ≤ 101, where u denotes the number of uniquely observed values. In conclusion, using user-tunable rounding algorithm can dramatically reduce the computational burden of fitting SSANOVA models from the order of O n 3 À Á to O u 3 À Á , where u ≪ n.
Case study: To illustrate the benefit of the rounding algorithm, we apply the algorithm to the electroencephalography (EEG) dataset. Note that EEG is a monitoring method to record the electrical activity of the brain. It can be used to diagnose sleep disorders, epilepsy, encephalopathies, and brain death.
The dataset [33] contains 44 controls and 76 alcoholics. Each subject was repeatedly measured 10 times by using visual stimulus at a frequency of 256 Hz. This brings about n ¼ 10 replications Â120 subjects Â256 time points ¼ 307; 200 observations. There are two predictors, time and group (control vs. alcoholic). We apply the cubic spline to the time effect and the nominal spline to the group effect.
After applying the model to the unrounded data, rounded data with rounding parameter r ¼ 0:01 and r ¼ 0:05 for time covariate, we can obtain a summary table about GCV, AIC [34], BIC [35], and running time in Table 2.
Based on Table 2, we can easily see that there are no significant difference among the GCV scores and AIC/BIC. In addition using rounding algorithm reduces 92% CPU time compared to using unrounded dataset.
In this chapter, we introduced the general framework of the SSANOVA models in Section 2. In Section 3, we discussed the models under the big data settings. When the volume of data grows, fitting the models is computing-intensive [11]. The adaptive basis selection algorithm [14] and rounding algorithm [15] we presented can significantly reduce the computational cost.
The se.fit parameter indicates if one can get the pointwise standard errors for the predicted values. The predicted values and Bayesian confidence interval, shown in Figure 4, are generated by: plot(x,y,col=1) Example II: Apply the SSANOVA model to a real dataset.
In this example, we illustrate how to implement the SSANOVA model using the gss package. The data is from an experiment in which a single-cylinder engine is run with ethanol to see how the nox concentration nox in the exhaust depends on the compression ratio comp and the equivalence ratio equi. The fitted model contains two predictors (comp and equi) and one interaction term.