Invariant Random Forest: Tree-Based Model Solution for OOD Generalization

Yufan Liao 1,2, Qi Wu 2, Xing Yan 1,*

1 Institute of Statistics and Big Data, Renmin University of China
2 School of Data Science, City University of Hong Kong
liaoyf@ruc.edu.cn, qi.wu@cityu.edu.hk, xingyan@ruc.edu.cn
* Corresponding author.

arXiv:2312.04273v1 [cs.LG] 7 Dec 2023

Abstract

Out-Of-Distribution (OOD) generalization is an essential topic in machine learning. However, recent research has focused only on the corresponding methods for neural networks. This paper introduces a novel and effective solution for OOD generalization of decision tree models, named Invariant Decision Tree (IDT). IDT enforces a penalty term with regard to the unstable/varying behavior of a split across different environments during the growth of the tree. Its ensemble version, the Invariant Random Forest (IRF), is constructed. Our proposed method is motivated by a theoretical result under mild conditions, and validated by numerical tests with both synthetic and real datasets. The superior performance compared to non-OOD tree models implies that considering OOD generalization for tree models is necessary and deserves more attention.

1 Introduction

Machine learning models achieve great success in many applications, such as image classification and speech recognition, when training data and testing data are generated from the same distribution. However, when the task is about predicting rather than recognizing, the distribution of the data at testing time may differ greatly from the distribution of the training data, and the model needs to predict well when the testing data comes from unseen distributions. For example, in autopilot tasks, we may encounter unseen signs; when predicting the stock market, huge crashes may happen for new reasons. This is called the Out-Of-Distribution (OOD) generalization problem.
The performance of existing machine learning models may drop severely in OOD scenarios (Shen et al. 2021; Zhou et al. 2021; Geirhos et al. 2020). To make up for this weakness, many methods have been proposed to improve the OOD generalization ability of machine learning models, such as IRM (Arjovsky et al. 2019) and REx (Krueger et al. 2021). But almost all of these methods concern Deep Neural Networks (DNNs); for other types of machine learning models, no solutions have been proposed so far. Decision trees are a classic type of machine learning model. As opposed to DNNs, decision trees are white-box models that are simple to interpret. Thus, decision trees can be applied in areas where safety and reliability are required, or where collaboration between humans and AI is needed, such as healthcare, credit assessment, etc. Built upon decision trees, ensemble models like Random Forest (RF) (Breiman 2001), Gradient Boosting Decision Tree (GBDT) (Friedman 2001), and XGBoost (Chen et al. 2015) have become popular choices in real applications. Though tree-based models have high interpretability, they may also suffer from distribution shifts and spurious correlations, just as DNNs may. Figure 1 illustrates an OOD situation tree models may face. With regular splitting criteria, decision trees only consider the average performance and ignore the heterogeneous behaviors across different environments. Unstable behavior means that the performance of a particular split can change or even reverse on the testing set, no matter how good the performance is on the training data. When we already know that the data distribution may change at testing time, it is vital to take precautions instead of relying on traditional tree models. In this paper, we first describe the failure of decision tree models in the case where the testing distribution and the training distribution are different.
Then, we derive an invariant across multiple environments for classification tasks under the setting of stable and unstable features. Based on this invariant, we introduce our proposed model, the Invariant Decision Tree (IDT), and its ensemble version, the Invariant Random Forest (IRF), by designing an additional penalty term in the splitting criterion for growing trees. The penalty term encourages the use of stable features as splitting variables during tree growth. Experiments on both synthetic and real datasets demonstrate the superiority of our proposed method in OOD generalization.

2 Related Works

Invariant Learning To achieve OOD generalization of DNNs, invariant learning methods such as IRM (Arjovsky et al. 2019) and REx (Krueger et al. 2021) propose to pursue an invariant final layer or invariant loss across data from different sources (e.g., locations, time, etc.). The data from different sources are usually called different environments or domains in this literature. IRM seeks invariance of the last layer of the network, and REx requires that the losses in different environments be close. Invariant learning posits that a model with invariant behavior in all training environments will predict consistently in testing environments. In this paper, we discover a new kind of invariant for tree models which is different from those of IRM and REx.

Stable Learning Stable learning shares some common motivations with causal inference (Cui and Athey 2022). In the literature, stable prediction was achieved via variable balancing (Kuang et al. 2018), sample re-weighting (Shen et al. 2020b), and feature decorrelation (Shen et al. 2020a). It was further extended to adversarial problems (Liu et al. 2021), sparsity (Yu et al. 2023), model misspecification (Kuang et al. 2020), and deep models for images (Zhang et al. 2021).
Time Robust Tree Time Robust Tree (TRT) (Moneda and Mauá 2022) brought the OOD generalization problem to decision trees. TRT proposed to restrict the minimum sample size for each environment after a split and to consider the maximum loss across environments instead of the empirical loss. However, this method is limited to time series data only, and its performance is not outstanding on real datasets.

3 Motivation: Stable Split and Unstable Split

Before formally describing our proposed method, we first raise a toy example to show the motivation of our method. Consider a simple binary classification problem, where the label Y ∈ {0, 1}, and the 2-dimensional features X = (X1, X2) ∈ {(0, 0), (0, 1), (1, 0), (1, 1)} are also binary. Suppose we have two training environments e = 1, 2 with the same sample size, and a testing environment e = 3. We consider the out-of-distribution case where, in different environments, the causal relation between X1 and Y is stable, but the causal relation between X2 and Y varies:

    Y = Bern(0.5),
    X1 = |Y − C1|,  C1 ∼ Bern(0.3),
    X2 = |Y − C2|,  C2 ∼ Bern(Ue).    (1)

Here, Bern(p) refers to the Bernoulli distribution with parameter p. Different environments e have different parameters Ue. The equations say that X1 is equal to the label Y, but flips with probability 0.3; X2 is also equal to Y, but flips with probability Ue. Now suppose U1 = 0.1, U2 = 0.4, U3 = 0.7. When a decision tree picks the best split on the training data, it would pick X2 as the splitting variable rather than X1. This is because, in the mixed data of environments 1 and 2, X2 equals the label Y 75% of the time (for X1, it is 70%), showing a stronger relation and predictability to the label Y. However, given the environmental information we have, it is better for the generalization performance of the model to use X1 instead of X2 as the predicting variable.
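As a quick sanity check, the toy example above can be simulated directly. This is our own sketch (the helper `make_env` is not from the paper), reproducing the 75%-vs-70% training statistics and the collapse of X2 at testing time:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_env(u, n=10_000):
    # Toy DGP of (1): X1 flips the label w.p. 0.3, X2 flips it w.p. u.
    y = rng.integers(0, 2, n)
    x1 = np.abs(y - rng.binomial(1, 0.3, n))
    x2 = np.abs(y - rng.binomial(1, u, n))
    return x1, x2, y

# Two training environments (U1 = 0.1, U2 = 0.4) and one testing environment (U3 = 0.7).
x1_tr, x2_tr, y_tr = [np.concatenate(p) for p in zip(make_env(0.1), make_env(0.4))]
x1_te, x2_te, y_te = make_env(0.7)

print((x1_tr == y_tr).mean())   # ~0.70: X1 matches Y 70% of the time in training
print((x2_tr == y_tr).mean())   # ~0.75: X2 looks like the better split on pooled data
print((x1_te == y_te).mean())   # ~0.70: X1 keeps its accuracy at testing time
print((x2_te == y_te).mean())   # ~0.30: the rule based on X2 collapses
```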
The given data-generating process implies that using X1 as the splitting variable is more secure and can ensure good performance at testing time. If we learn from the training data as usual and derive X2 = 1 → Y = 1, we can only achieve 30% accuracy on the testing set; but if we derive X1 = 1 → Y = 1, it is 70%. Figure 1 illustrates this toy example. To solve this problem, one needs to avoid splits that behave differently across different environments, as in the above example, where the split has 90% accuracy in one environment and only 60% in another. With such a huge difference, it is questionable whether this split can be effective on testing data. But how can we rule out these unstable splits systematically and quantitatively? Next, we consider a more general data setting, and develop an invariant that can help us discriminate between stable and unstable splits.

3.1 The Theorem of An Invariant

Now we consider a general binary classification problem. Similar to the assumption in (Rosenfeld, Ravikumar, and Risteski 2020), we assume that in the data-generating process, the label is drawn first, and the features are drawn according to the label. For each environment e,

    Y = 1 w.p. η^e, and Y = 0 otherwise.    (2)

Stable variables S = (S1, S2, ..., Sr) and environmental variables Z = (Z1, Z2, ..., Zt) are generated according to the label Y:

    S ∼ P0(s1, s2, ..., sr), Z ∼ Q0^e(z1, z2, ..., zt), if Y = 0,
    S ∼ P1(s1, s2, ..., sr), Z ∼ Q1^e(z1, z2, ..., zt), if Y = 1.    (3)

Here, P0, Q0^e, P1, Q1^e are cumulative distribution functions. Note that with the same label, the distributions of the stable variables S are the same across different environments, while those of the environmental variables Z are not. We consider a more general setting compared to (Rosenfeld, Ravikumar, and Risteski 2020), where S and Z can follow any distributions instead of only Gaussian distributions.
The goal of the model is to take both stable and environmental variables X = (S, Z) as inputs to predict the label Y. Decision trees use one splitting variable at a node to separate the data into two subsets. As the distribution of Z may change in a new testing environment, we would like to avoid using Z as splitting variables in our tree model. However, we do not know in advance which variables are the stable ones; we have to judge the stability of a variable on our own. To achieve that, we first derive an invariant across environments that holds when the stable variables S are used as splitting variables. Then, this invariant can serve as the judging criterion for the stability of any splitting variable.

Theorem 1. For any subset D ⊂ R^{r+t}, let P̃0, Q̃0^e, P̃1, Q̃1^e be the distributions P0, Q0^e, P1, Q1^e restricted on D, respectively. For a splitting rule Si ≤ c on D, define the changing rate of the positive label as

    CR1_{Si≤c} = P(Y = 1 | Si ≤ c, X ∈ D) / P(Y = 1 | X ∈ D),    (4)

and the changing rate of the negative label as

    CR0_{Si≤c} = P(Y = 0 | Si ≤ c, X ∈ D) / P(Y = 0 | X ∈ D).    (5)

These changing rates can be calculated in any environment. Then, using any stable variable as the splitting variable, the ratio between the changing rates of the positive label and the negative label is invariant across different environments. That is to say, for some k, CR1_{Si≤c} / CR0_{Si≤c} = k holds for every environment e.

Figure 1: The illustration of the example in (1). The number in each circle is E[Y]. The prediction rule X2 = 1 → Y = 1 has different accuracies under the two training environments, so it is questionable whether this rule can still be effective under the testing environment.
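Theorem 1 can be checked numerically on the toy example of Section 3. Below is our own sketch (the helper `changing_rate_ratio` is not from the paper): for the stable X1, the ratio CR1/CR0 equals (0.3/0.5)/(0.7/0.5) = 3/7 in every environment, while for X2 it depends on Ue:

```python
import numpy as np

rng = np.random.default_rng(1)

def changing_rate_ratio(x, y, c):
    # Empirical version of CR1/CR0 in Theorem 1 for the split "x <= c".
    left = x <= c
    cr1 = y[left].mean() / y.mean()                # changing rate of P(Y=1)
    cr0 = (1 - y[left]).mean() / (1 - y).mean()    # changing rate of P(Y=0)
    return cr1 / cr0

results = {}
for u in (0.1, 0.4):                               # the two training environments
    n = 200_000
    y = rng.integers(0, 2, n)
    x1 = np.abs(y - rng.binomial(1, 0.3, n))       # stable variable
    x2 = np.abs(y - rng.binomial(1, u, n))         # environmental variable
    results[u] = (changing_rate_ratio(x1, y, 0), changing_rate_ratio(x2, y, 0))

# Stable X1: ratio ≈ (0.3/0.5)/(0.7/0.5) = 3/7 in both environments.
# Environmental X2: ratio ≈ u/(1-u), i.e. ≈ 0.11 vs ≈ 0.67 -- not invariant.
```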
Proof. We hide the condition X ∈ D for simplicity. For any environment e, the probability of the positive label conditional on Si ≤ c is P(Y = 1 | Si ≤ c) = P(Y = 1) P(Si ≤ c | Y = 1) / P(Si ≤ c). Similarly, P(Y = 0 | Si ≤ c) = P(Y = 0) P(Si ≤ c | Y = 0) / P(Si ≤ c). Dividing the first equation by the second, we have

    P(Y = 1 | Si ≤ c) / P(Y = 0 | Si ≤ c) = [P(Y = 1) P(Si ≤ c | Y = 1)] / [P(Y = 0) P(Si ≤ c | Y = 0)].    (6)

Then,

    [P(Y = 1 | Si ≤ c) / P(Y = 0 | Si ≤ c)] / [P(Y = 1) / P(Y = 0)]
    = P(Si ≤ c | Y = 1, X ∈ D) / P(Si ≤ c | Y = 0, X ∈ D)
    = P_{S∼P̃1}(Si ≤ c) / P_{S∼P̃0}(Si ≤ c).    (7)

Because all the environments share the same distribution of S over D, i.e., P̃0 and P̃1, the right-hand side of the last equation is invariant across different environments, as long as the splitting variable is a stable variable.

The above theorem suggests that, when we use a stable variable as the splitting variable, the ratio of changing rates is an invariant across different environments. It may no longer be an invariant if the splitting variable is an environmental variable, because Q0^e and Q1^e change over environments. Therefore, in order to use more stable variables as splitting variables in the tree model, we can enforce a restriction using this invariant during the splitting procedure.

4 Invariant Tree and Forest

A decision tree recursively partitions the feature space such that samples with the same labels or similar target values are grouped together. Let the training data set at node m be Qm, with sample size nm, and represent each candidate split Xj ≤ c as θ = (j, c). θ splits the data set Qm into Qm,l(θ) and Qm,r(θ) with nm,l and nm,r samples, respectively:

    Qm,l(θ) = {(x, y) ∈ Qm | xj ≤ c},
    Qm,r(θ) = {(x, y) ∈ Qm | xj > c}.    (8)

An impurity function H(·) can measure the similarity of a group of data. Generally, the quality of a candidate split θ is measured by the weighted sum of the impurity functions on the left and right nodes:

    G(Qm, θ) = (nm,l / nm) H(Qm,l) + (nm,r / nm) H(Qm,r).    (9)

To pick out the best split, we select the one that minimizes the impurity objective:

    θ* = arg min_θ G(Qm, θ).    (10)

A grid search over θ is generally adopted. For classification tasks, popular choices of the impurity function are Gini impurity (Gordon et al. 1984) and Shannon information gain (Quinlan 2014); we use Gini impurity throughout the paper. For regression tasks, Mean Squared Error (MSE) and Mean Absolute Error (MAE) can be used as impurity functions; we use MSE throughout the paper. Note that for both types of tasks, the impurity functions measure how concentrated a set of data is.

4.1 Invariant Classification Tree

To consider the case of the classification tree, we first denote the data sets with the environments they come from: Qm^e is the training data set at node m which comes from environment e, where e = 1, ..., E and E is the number of training environments. Then, θ = (j, c) splits Qm^e into two subsets Qm,l^e(θ) and Qm,r^e(θ), as in (8). Note that the merged data set Qm = ∪_{e=1}^E Qm^e is used to compute the impurity objective G(Qm, θ) that needs to be minimized in our tree model. In order to consider the invariance across environments, we propose to put a restriction on the split, together with the original splitting objective G(Qm, θ). In Theorem 1, we defined the changing rates on a set of distributions. Now, we can compute the changing rates given datasets:

    CR1_{Xj≤c}(Qm^e) = [|{(x, y) ∈ Qm,l^e(θ) | y = 1}| / |Qm,l^e(θ)|] / [|{(x, y) ∈ Qm^e | y = 1}| / |Qm^e|],    (11)

    CR0_{Xj≤c}(Qm^e) = [|{(x, y) ∈ Qm,l^e(θ) | y = 0}| / |Qm,l^e(θ)|] / [|{(x, y) ∈ Qm^e | y = 0}| / |Qm^e|].    (12)

By Theorem 1, when the splitting variable Xj is a stable one, an invariant across environments should be

    I(Qm^e, θ) = CR1_{Xj≤c}(Qm^e) / CR0_{Xj≤c}(Qm^e).    (13)
Therefore, when deciding the best split at a node, we can select the split which makes I(Qm^e, θ) unchanged across different environments:

    θ* = arg min_θ G(Qm, θ), s.t. I(Qm^e, θ) = I(Qm^f, θ), ∀ e, f = 1, 2, ..., E.    (14)

By adding this restriction, we can ensure the invariance of the model across environments. However, the restriction is too strong to be satisfied in reality. Thus, we transform this hard restriction into a penalty term: the more I(Qm^e, θ) varies, the bigger the penalty term should be. In this way, the problem becomes solvable:

    θ* = arg min_θ G(Qm, θ) + λ L(Qm^1, ..., Qm^E, θ),    (15)

where Qm^e is the training data set from the e-th environment, L is the penalty term regarding the invariance among environments, and λ ∈ R+ is the penalty weight. Here, we define L to be the ratio between the highest and the lowest I(Qm^e, θ), which is

    L(Qm^1, ..., Qm^E, θ) = max_{e,f} I(Qm^e, θ) / I(Qm^f, θ).    (16)

In the algorithm, to prevent the case when the denominator is 0, we add a dummy sample to the calculation of the changing rates:

    C̃R1_{Xj≤c}(Qm^e) ∝ (|{(x, y) ∈ Qm,l^e(θ) | y = 1}| + 0.5) / (|{(x, y) ∈ Qm^e | y = 1}| + 1),    (17)

    C̃R0_{Xj≤c}(Qm^e) ∝ (|{(x, y) ∈ Qm,l^e(θ) | y = 0}| + 0.5) / (|{(x, y) ∈ Qm^e | y = 0}| + 1),    (18)

    Ĩ(Qm^e, θ) = C̃R1_{Xj≤c}(Qm^e) / C̃R0_{Xj≤c}(Qm^e).    (19)

The penalty term L is computed using Ĩ(Qm^e, θ) instead.

4.2 Invariant Regression Tree

In regression tasks, we cannot define changing rates as in the classification problem. However, we can follow the motivation of changing rates, which is the change of the average label after a split. Thereby, we propose to define the changing rate in the regression problem as the difference of the label mean before and after a split:

    CR_{Xj≤c}(Q) = Σ_{(x,y)∈Ql(θ)} y / |Ql(θ)| − Σ_{(x,y)∈Q} y / |Q|,    (20)

    I(Qm^e, θ) = CR_{Xj≤c}(Qm^e),    (21)

where Q is the training data set at a node in any environment and Ql(θ) is the left-node data set after applying the split θ on Q.
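A minimal sketch of the penalized split selection of Section 4.1 is given below. This is our own simplified implementation (not the authors' code): `gini` is the impurity H, `smoothed_I` follows (17)-(19), and the objective is (15) with the max/min penalty of (16), grid-searched at a single node:

```python
import numpy as np

def gini(y):
    # Gini impurity of a binary label array.
    p = y.mean()
    return 2 * p * (1 - p)

def smoothed_I(x, y, c):
    # Smoothed changing-rate ratio, as in (17)-(19), with the 0.5/1 dummy sample.
    left = x <= c
    cr1 = (np.sum(left & (y == 1)) + 0.5) / (np.sum(y == 1) + 1)
    cr0 = (np.sum(left & (y == 0)) + 0.5) / (np.sum(y == 0) + 1)
    return cr1 / cr0

def penalized_objective(envs, j, c, lam):
    # Objective (15): pooled impurity G plus lambda times the penalty L of (16),
    # where L is the max/min ratio of I over the training environments.
    X_all = np.vstack([X for X, _ in envs])
    y_all = np.concatenate([y for _, y in envs])
    left = X_all[:, j] <= c
    n, nl = len(y_all), int(left.sum())
    if nl == 0 or nl == n:
        return np.inf                      # degenerate split
    G = nl / n * gini(y_all[left]) + (n - nl) / n * gini(y_all[~left])
    Is = [smoothed_I(X[:, j], y, c) for X, y in envs]
    return G + lam * max(Is) / min(Is)

def best_split(envs, lam=1.0):
    # Grid search over all features and thresholds, as in an ordinary tree node.
    X_all = np.vstack([X for X, _ in envs])
    best_obj, best_theta = np.inf, None
    for j in range(X_all.shape[1]):
        for c in np.unique(X_all[:, j])[:-1]:
            obj = penalized_objective(envs, j, c, lam)
            if obj < best_obj:
                best_obj, best_theta = obj, (j, c)
    return best_theta
```

On the toy data of (1), λ = 0 recovers the ordinary tree choice (X2, the lower pooled impurity), while λ = 1 already flips the choice to the stable X1, since the unstable split pays a penalty of roughly max(I)/min(I) ≈ 6 versus ≈ 1.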
For the penalty term, we calculate the variance of the changing rates over all training environments:

    L(Qm^1, ..., Qm^E, θ) = Var_{e=1,...,E}[I(Qm^e, θ)].    (22)

The final penalized impurity objective for minimization is again the one in (15). This idea is in part driven by the changing rate definition in the classification problem, and is also in part similar to REx (Krueger et al. 2021). Differently, REx computes the penalty term using the variance of the loss functions from different training environments, while our model uses the variance of the changing rates.

4.3 Invariant Random Forest

Except for the penalty term added in the splitting criterion, the rest of the training process is the same as for decision trees. Data at a node is recursively split into two subsets until the maximum tree depth is reached or all the data points at a node have the same label. The Invariant Random Forest (IRF), the ensemble of many invariant classification/regression trees, follows a process similar to aggregating decision trees into a Random Forest: we do bootstrap sampling of the training data, separately for each of the training environments, and generate the output by voting. Another key difference from the traditional Random Forest is that IRF does not extract a random subset of features for splitting. The random feature subset is a crucial point in Random Forest, aiming at improving the diversity of trees. But in IRF, the penalty term already avoids using features that are suspected of producing unstable splits. This is another kind of feature selection in splitting, and given the diversity of environments, the random feature selection is unnecessary.

5 Experiments

Next, we test the proposed Invariant Random Forest (IRF) on several datasets with different OOD settings. We compare our method with Random Forest (RF) and XGBoost, two of the most popular ensemble tree models, which are widely used in industry.
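The ensemble procedure of Section 4.3 (per-environment bootstrap, majority voting, no random feature subsets) can be sketched as follows. This is our own illustration; `tree` stands for any fitted invariant tree exposing a `predict` method:

```python
import numpy as np

def bootstrap_envs(envs, rng):
    # Bootstrap sampling performed separately within each training environment,
    # as described in Section 4.3 (no random feature subsets are drawn).
    resampled = []
    for X, y in envs:
        idx = rng.integers(0, len(y), len(y))
        resampled.append((X[idx], y[idx]))
    return resampled

def forest_predict(trees, X):
    # Majority vote of the individual invariant trees (classification case).
    votes = np.stack([tree.predict(X) for tree in trees])
    return (votes.mean(axis=0) >= 0.5).astype(int)
```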
Other methods are not included for comparison because, as far as we know, there are few works considering the OOD generalization of tree models. If not stated otherwise, for classification tasks, the first number in a table is the average accuracy in percentage, and the number in brackets is the average cross-entropy loss. For regression tasks, the first number is the average Mean Squared Error (MSE), and the number in brackets is the standard deviation over multiple runs. For regression tasks, we standardize the MSE of RF to 1, and the MSE of every other method is shown as the ratio to that of RF. That is to say, a number less than 1 in a regression table means better performance than RF.

Table 1: Accuracy and cross-entropy loss results of the synthetic classification task. We consider four cases where the dataset dimension varies in {2, 5, 10, 20}. The numbers reported are the averages over 5 trials with different random seeds.

Dataset Dimension   RF            XGBoost       IRF (λ = 1)   IRF (λ = 5)   IRF (λ = 10)
d = 2               48.74 (0.83)  49.50 (1.24)  50.24 (0.75)  51.20 (0.73)  51.06 (0.73)
d = 5               47.62 (0.85)  49.14 (1.49)  52.24 (0.74)  55.04 (0.70)  55.12 (0.70)
d = 10              43.26 (0.86)  46.64 (1.68)  51.26 (0.74)  53.06 (0.71)  54.94 (0.70)
d = 20              40.08 (0.84)  48.28 (1.51)  52.56 (0.71)  55.08 (0.69)  57.42 (0.68)

Table 2: MSE results of the synthetic regression task. We consider four cases where the dataset dimension varies in {2, 5, 10, 20}. The numbers reported are the averages over 5 trials with different random seeds. Standard deviations are in the brackets.

Dataset Dimension   RF     XGBoost        IRF (λ = 1)    IRF (λ = 5)    IRF (λ = 10)
d = 2               1.000  1.190 (0.033)  0.940 (0.017)  0.893 (0.019)  0.889 (0.024)
d = 5               1.000  1.041 (0.035)  0.874 (0.017)  0.826 (0.022)  0.847 (0.015)
d = 10              1.000  1.054 (0.034)  1.049 (0.049)  0.881 (0.013)  0.900 (0.021)
d = 20              1.000  1.052 (0.023)  1.042 (0.022)  0.916 (0.023)  0.934 (0.021)
5.1 Synthetic Data Experiments

We design a classification dataset and a regression dataset to validate our idea and method. Both datasets allow some difference between the training distribution and the testing distribution. We report the results of IRF with the hyper-parameter λ = 1, 5, 10. The data points of the classification task are generated by:

    Y = Bern(0.5),
    X1 = |Y · 1d − C1| + N1,  C1 ∼ Bernd(0.3),
    X2 = |Y · 1d − C2| + N2,  C2 ∼ Bernd(Ue),
    N1, N2 ∼ N(0, Id).    (23)

Bernd is the d-dimensional Bernoulli distribution with each dimension independent. 1d is the d-dimensional vector with all entries being 1, and Id is the d × d identity matrix. We set U1 = 0.1, U2 = 0.4, and U3 = 0.7. The training data consists of the two environments e = 1, 2, and the testing data is the environment e = 3. This is an upgraded version of the example we proposed previously: both X1 and X2 are d-dimensional, with added Gaussian noise. In addition, the data points of the regression task are generated by:

    X1 ∼ N(0, Id),  Y = 1d⊤ X1 + N(0, 1),
    X2 = Y if e = 1, and X2 = N(0, 1) otherwise.    (24)

Again the training data is from e = 1, 2, and the testing data is the environment e = 3. In environment e = 1, X2 can predict Y exactly without any error, but in the other environments, X2 has no predictability for Y at all. Models that use X1 to predict Y will achieve better performance in the testing environment, because X1 is more stable. The results are shown in Tables 1 and 2. In both tasks, IRF shows superior performance compared to the other methods. This shows that our method can recognize unstable splits and avoid using them during the growing of the trees. A closer look at the details reveals that our method uses X2 less and X1 more for prediction than the others, as shown in the appendix.

5.2 Settings of Real Data Experiments

Next, we test our method on real datasets, to see the practicability of IRF in real prediction applications.
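The regression data-generating process in (24) above can be reproduced with a few lines. This is our own sketch (the function name `regression_env` is ours); it shows that X2 is perfectly correlated with Y only in environment e = 1:

```python
import numpy as np

rng = np.random.default_rng(0)

def regression_env(e, d=10, n=2000):
    # DGP of (24): X1 is stable; X2 equals Y exactly in e = 1 and is pure noise otherwise.
    X1 = rng.normal(size=(n, d))
    Y = X1.sum(axis=1) + rng.normal(size=n)
    X2 = Y.copy() if e == 1 else rng.normal(size=n)
    return np.column_stack([X1, X2]), Y

for e in (1, 2, 3):
    X, Y = regression_env(e)
    print(e, np.corrcoef(X[:, -1], Y)[0, 1])  # ~1.0 for e=1, ~0.0 otherwise
```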
To run the experiments, we first split a dataset into various environments according to some basic variables that contain meta information (year, month, timestamps, etc.). When such a variable is used to obtain environments, it is not used as an input for the models. To test the OOD performance of the models, we use one environment as the testing data; during training, we have no access to this environment at all. For the training data, we consider three different scenarios based on whether the environment information is given:

(S1) Only the pooled data of all environments excluding the testing one is provided as training data, i.e., we do not know any environmental information on the training data.

(S2) The pooled data of most environments is provided, but we have a validation set from a brand-new environment, which can be used for hyper-parameter tuning.

(S3) Training data with full environmental information is provided. That is to say, we take multiple groups of data as inputs of the models, and each group is a single environment.

Table 3: Accuracy and cross-entropy loss results of real-data classification tasks in Scenario 1.

Dataset        RF            XGBoost       IRF (λ = 1)   IRF (λ = 5)   IRF (λ = 10)
Financial      53.07 (0.73)  53.05 (0.98)  55.30 (0.70)  56.34 (0.69)  56.84 (0.69)
Technical      49.07 (0.72)  49.10 (1.10)  49.36 (0.72)  50.07 (0.72)  50.65 (0.72)
Asset Pricing  51.45 (0.71)  51.39 (0.90)  51.96 (0.70)  51.91 (0.70)  51.85 (0.70)

Table 4: MSE results of real-data regression tasks in Scenario 1. Standard deviations are in the brackets.
Dataset    RF     XGBoost        IRF (λ = 1)    IRF (λ = 5)    IRF (λ = 10)
Energy1-1  1.000  1.266 (0.153)  1.009 (0.007)  1.029 (0.016)  1.027 (0.024)
Energy1-2  1.000  1.201 (0.102)  0.986 (0.027)  0.961 (0.086)  0.958 (0.109)
Energy1-3  1.000  1.219 (0.106)  1.018 (0.012)  1.032 (0.004)  1.045 (0.023)
Energy1-4  1.000  1.287 (0.125)  0.933 (0.084)  0.896 (0.129)  0.892 (0.129)
Energy2-1  1.000  1.186 (0.095)  0.891 (0.131)  0.862 (0.206)  0.854 (0.218)
Energy2-2  1.000  1.088 (0.022)  0.997 (0.017)  0.976 (0.083)  0.974 (0.094)
Energy2-3  1.000  1.144 (0.103)  0.973 (0.015)  1.019 (0.075)  1.054 (0.091)
Energy2-4  1.000  1.010 (0.063)  0.988 (0.019)  1.003 (0.007)  0.983 (0.039)
Air1-1     1.000  1.130 (0.015)  0.878 (0.122)  0.850 (0.200)  0.865 (0.222)
Air1-2     1.000  1.088 (0.043)  0.969 (0.016)  0.911 (0.025)  0.886 (0.063)
Air1-3     1.000  1.107 (0.076)  0.983 (0.027)  0.944 (0.085)  0.923 (0.084)
Air1-4     1.000  1.243 (0.061)  0.901 (0.148)  0.974 (0.228)  1.006 (0.246)
Air2-1     1.000  1.214 (0.084)  0.987 (0.003)  0.976 (0.024)  0.995 (0.043)
Air2-2     1.000  1.260 (0.122)  0.981 (0.033)  1.013 (0.064)  1.092 (0.068)
Air2-3     1.000  1.242 (0.144)  0.973 (0.020)  0.965 (0.035)  0.984 (0.026)
Air2-4     1.000  0.858 (0.020)  1.037 (0.009)  1.076 (0.012)  1.057 (0.010)

The last scenario is more popular in research works, but the first two scenarios are realistic too, because sometimes we do not know any meta information. In such cases, we manually split the training data and obtain hand-made environments with a recent simple but effective method for doing so, called Decorr (Liao, Wu, and Yan 2022; Ye et al. 2023; Tong et al. 2023). It finds subsets of the training data that exhibit low correlations among features, in order to reduce spurious correlations in the data. While other methods with the same purpose mainly focus on DNNs, Decorr has no such restriction. The ways of obtaining the initial environment partitions and the settings of the training and testing sets differ between classification and regression; we introduce them separately in the following subsections.
The average results of IRF from multiple runs of training and testing are reported and compared to RF and XGBoost. In Scenario 1, there is no hyper-parameter tuning, and we report λ = 1, 5, 10, respectively. For the validation procedure in Scenarios 2 and 3, we first choose the proper maximum tree depth for RF on the validation set. Then, we use the same maximum depth in IRF and use the validation set to choose the best λ for IRF. For details regarding the validation procedure and the choices of some other hyper-parameters, please refer to the appendix.

5.3 Real-Data Classification Tasks

Three datasets are included here. All of them have heterogeneous distributions or prediction patterns during different periods of time, so we initially obtain the environments by splitting on datetime. Each dataset is introduced in the following. The results are shown in Table 3 and Table 5, where IRF overall performs the best.

Financial The financial indicator dataset1 is a yearly stock return dataset. The features are the financial indicators of each stock, and the label to predict is whether the price goes up or down over the next whole year. The span of this dataset is 5 years, and we treat each single year as an environment. In Scenario 1, we use any 3 environments as the training set, and the other two as the testing set. In Scenarios 2 and 3, we use any 3 environments as the training set, 1 as the validation set, and 1 as the testing set.

Technical The technical indicator dataset2 contains market data and technical indicators of 10 U.S. stocks from 2005 to 2020. The model is expected to predict whether tomorrow's close price of a stock is higher than today's. We split this dataset into several environments based on different time periods. In Scenario 1, the data of the first 60% of the time is used as the training set, and the rest as the testing set.
Table 5: Accuracy and cross-entropy loss results of real-data classification tasks in Scenarios 2 and 3. S2 is the upper part and S3 is the lower part.

Dataset        RF            XGBoost       IRF
Financial      54.42 (0.71)  52.46 (0.85)  56.20 (0.69)
Technical      48.89 (0.71)  49.05 (1.14)  49.18 (0.72)
Asset Pricing  52.50 (0.70)  51.95 (0.75)  52.92 (0.69)
Financial      54.50 (0.70)  52.46 (0.85)  54.27 (0.71)
Technical      48.94 (0.72)  49.05 (1.14)  49.57 (0.72)
Asset Pricing  52.64 (0.70)  51.95 (0.75)  52.79 (0.69)

Table 6: MSE results of real-data regression tasks in Scenarios 2 and 3. Standard deviations are in the brackets. S2 is the upper part and S3 is the lower part.

Dataset    RF     XGBoost        IRF
Energy1-1  1.000  1.064 (0.113)  0.989 (0.047)
Energy1-2  1.000  1.141 (0.231)  0.925 (0.116)
Energy2-1  1.000  1.027 (0.107)  1.006 (0.050)
Energy2-2  1.000  1.059 (0.103)  1.037 (0.121)
Air1-1     1.000  1.052 (0.099)  0.916 (0.106)
Air1-2     1.000  1.112 (0.097)  0.979 (0.064)
Air1-3     1.000  1.173 (0.191)  0.940 (0.141)
Air1-4     1.000  1.099 (0.042)  1.006 (0.013)
Air2-1     1.000  1.187 (0.139)  0.970 (0.078)
Air2-2     1.000  1.191 (0.117)  0.982 (0.062)
Air2-3     1.000  1.236 (0.109)  0.986 (0.042)
Air2-4     1.000  1.089 (0.084)  0.995 (0.031)
Energy1-1  1.000  1.050 (0.086)  0.977 (0.056)
Energy1-2  1.000  1.187 (0.200)  0.979 (0.142)
Energy2-1  1.000  1.008 (0.097)  1.038 (0.143)
Energy2-2  1.000  1.068 (0.099)  1.062 (0.174)
Air1-1     1.000  1.053 (0.076)  0.845 (0.230)
Air1-2     1.000  1.126 (0.076)  0.956 (0.101)
Air1-3     1.000  1.151 (0.202)  0.988 (0.072)
Air1-4     1.000  1.110 (0.046)  0.993 (0.010)
Air2-1     1.000  1.173 (0.115)  0.926 (0.180)
Air2-2     1.000  1.205 (0.126)  1.005 (0.070)
Air2-3     1.000  1.223 (0.108)  1.005 (0.055)
Air2-4     1.000  1.091 (0.076)  0.991 (0.031)

1 https://www.kaggle.com/datasets/cnic92/200-financialindicators-of-us-stocks-20142018
2 https://www.kaggle.com/datasets/nikhilkohli/us-stockmarket-data-60-extracted-features

In Scenarios 2 and 3, we use the first 60% of the data as the training set (split into two environments as the first half and the second half, if needed),
then the subsequent 20% as the validation set, and the last 20% as the testing set.

Asset Pricing Asset Pricing (Gu, Kelly, and Xiu 2020, 2021) is a monthly stock return dataset with different types of financial and technical indicators for thousands of stocks over the period from 1970 to 2021. Models are expected to predict whether a stock has a higher monthly return than at least half of all stocks in the market at the same time. In Scenario 1, we adopt 9-month rolling windows for training and testing: the first six months are the training set, and the next three months are the testing set. In Scenarios 2 and 3, we adopt one-year rolling windows: the first six months are again the training set (split into the first three months and the last three as two environments, if needed), the next three months are the validation set, and the last three are the testing set.

5.4 Real-Data Regression Tasks

The training and testing settings of the regression tasks all follow the same strategy. For the first scenario, we first obtain three environments using the variable containing meta information. We then use any two environments as the training set, and the other one as the testing set (3 runs for each dataset). For the second and third scenarios, we obtain four environments initially and use any two as the training set, any other one as the validation set, and the remaining one as the testing set (12 runs for each dataset). We introduce the four datasets used as follows.

Energy1 & Energy2 The energy datasets3 provide information such as temperature, humidity, and wind speed in a city or a room. We want to predict the energy consumption of that location. Since time has a heterogeneous influence on energy consumption (e.g., a big difference between day and night), in both datasets we use the variable hour to split the data and obtain environments.
There are four different splitting strategies considered for Scenario 1 and two strategies for Scenarios 2 and 3, resulting in more derived datasets named Energy1-i (or Energy2-i). The details of the splitting strategies are in the appendix.

3 https://github.com/LuisM78/Appliances-energyprediction-data/blob/master/energydata complete.csv, https://www.kaggle.com/datasets/gmkeshav/tetuan-city-powerconsumption

Air1 & Air2  The Air datasets4 contain various types of air quality data, and models need to predict the PM2.5 index based on the given air quality data. In these two datasets, we use the variable month to split the data and obtain environments, because the behavior of air pollution may differ across seasons. We again consider different splitting strategies (four for Scenarios 1, 2, and 3), resulting in more derived datasets named Air1-i (or Air2-i).

4 https://github.com/sunilmallya/timeseries/blob/master/data/PRSA data 2010.1.1-2014.12.31.csv,

Results  The results are shown in Table 4 and Table 6, where IRF achieves the best overall performance. On many datasets, the improvements of IRF over the other two models are substantial, with a maximum percentage decrease in error close to 15%; in many cases, the improvement ranges from 1% to 9%. Another finding is that XGBoost always performs worse than RF in all the OOD generalization tasks in this paper, indicating that bagging may be better suited than boosting to OOD settings.

6 Conclusion

Through the discovery of the invariance attained when stable variables are used as the splitting variables of a tree, we construct a method for tree-based models that reduces the use of environmental variables by enforcing penalties when splitting. The proposed IDT and IRF are well-motivated, easy to interpret, and new to the area. Experiments have shown that our method achieves superior performance under OOD generalization settings. Some topics on our method are still worth discussing.
One example is the selection of the hyper-parameter λ when no validation set is available, e.g., in Scenario 1; a validation set extracted i.i.d. from the training set may not be a good choice. Since many existing works assume the existence of meta information, we leave these topics as future work. Other methodologies for the OOD generalization of tree models are also worth exploring.

References

Arjovsky, M.; Bottou, L.; Gulrajani, I.; and Lopez-Paz, D. 2019. Invariant risk minimization. arXiv preprint arXiv:1907.02893.
Breiman, L. 2001. Random forests. Machine Learning, 45: 5–32.
Chen, T.; He, T.; Benesty, M.; Khotilovich, V.; Tang, Y.; Cho, H.; Chen, K.; Mitchell, R.; Cano, I.; Zhou, T.; et al. 2015. XGBoost: extreme gradient boosting. R package version 0.4-2, 1(4): 1–4.
Cui, P.; and Athey, S. 2022. Stable learning establishes some common ground between causal inference and machine learning. Nature Machine Intelligence, 4(2): 110–115.
Friedman, J. H. 2001. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 1189–1232.
Geirhos, R.; Jacobsen, J.-H.; Michaelis, C.; Zemel, R.; Brendel, W.; Bethge, M.; and Wichmann, F. A. 2020. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11): 665–673.
Gordon, A.; Breiman, L.; Friedman, J.; Olshen, R.; and Stone, C. J. 1984. Classification and Regression Trees. Biometrics, 40(3): 874.
Gu, S.; Kelly, B.; and Xiu, D. 2020. Empirical asset pricing via machine learning. The Review of Financial Studies, 33(5): 2223–2273.
Gu, S.; Kelly, B.; and Xiu, D. 2021. Autoencoder asset pricing models. Journal of Econometrics, 222(1): 429–450.
Krueger, D.; Caballero, E.; Jacobsen, J.-H.; Zhang, A.; Binas, J.; Zhang, D.; Le Priol, R.; and Courville, A. 2021. Out-of-distribution generalization via risk extrapolation (REx). In International Conference on Machine Learning, 5815–5826. PMLR.
Kuang, K.; Cui, P.; Athey, S.; Xiong, R.; and Li, B. 2018.
Stable prediction across unknown environments. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1617–1626.
Kuang, K.; Xiong, R.; Cui, P.; Athey, S.; and Li, B. 2020. Stable prediction with model misspecification and agnostic distribution shift. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 4485–4492.
Liao, Y.; Wu, Q.; and Yan, X. 2022. Decorr: Environment Partitioning for Invariant Learning and OOD Generalization. arXiv preprint arXiv:2211.10054.
Liu, J.; Shen, Z.; Cui, P.; Zhou, L.; Kuang, K.; Li, B.; and Lin, Y. 2021. Stable adversarial learning under distributional shifts. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 8662–8670.
Moneda, L.; and Mauá, D. 2022. Time Robust Trees: Using Temporal Invariance to Improve Generalization. In Brazilian Conference on Intelligent Systems, 385–397. Springer.
Quinlan, J. R. 2014. C4.5: Programs for Machine Learning. Elsevier.
Rosenfeld, E.; Ravikumar, P.; and Risteski, A. 2020. The risks of invariant risk minimization. arXiv preprint arXiv:2010.05761.
Shen, Z.; Cui, P.; Liu, J.; Zhang, T.; Li, B.; and Chen, Z. 2020a. Stable learning via differentiated variable decorrelation. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2185–2193.
Shen, Z.; Cui, P.; Zhang, T.; and Kuang, K. 2020b. Stable learning via sample reweighting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 5692–5699.
Shen, Z.; Liu, J.; He, Y.; Zhang, X.; Xu, R.; Yu, H.; and Cui, P. 2021. Towards out-of-distribution generalization: A survey. arXiv preprint arXiv:2108.13624.
Tong, Y.; Yuan, J.; Zhang, M.; Zhu, D.; Zhang, K.; Wu, F.; and Kuang, K. 2023. Quantitatively Measuring and Contrastively Exploring Heterogeneity for Domain Generalization. arXiv preprint arXiv:2305.15889.
Ye, S.; Yu, S.; Hou, W.; Wang, Y.; and You, X. 2023.
Coping with Change: Learning Invariant and Minimum Sufficient Representations for Fine-Grained Visual Categorization. arXiv preprint arXiv:2306.04893.
Yu, H.; Cui, P.; He, Y.; Shen, Z.; Lin, Y.; Xu, R.; and Zhang, X. 2023. Stable Learning via Sparse Variable Independence. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 10998–11006.
Zhang, X.; Cui, P.; Xu, R.; Zhou, L.; He, Y.; and Shen, Z. 2021. Deep stable learning for out-of-distribution generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5372–5382.
Zhou, K.; Liu, Z.; Qiao, Y.; Xiang, T.; and Loy, C. C. 2021. Domain Generalization: A Survey.

A Experiment Details

A.1 Dataset Preprocessing

For most of the datasets used in the experiments, there is no preprocessing. The only exception is the Asset Pricing dataset. Since we take rolling windows to train and forecast, and some features are N/A during some time periods, we remove the features that have N/A values on any row in the selected window. Therefore, the feature dimension in training may be less than the feature dimension of the original dataset, and we may use different features as model inputs in different windows.

A.2 Environment Partitions for Regression Datasets

For the regression datasets in the real-data experiments, we first split the whole dataset into 3 or 4 environments by splitting on a certain variable. The training set, validation set (if needed), and testing set take data from different environments, so these sets have heterogeneous distributions. In this way, the performance on the testing set reflects the OOD generalization ability of different models. The construction of these sets is described in the main content of the paper. However, there is more than one strategy for splitting the chosen variable, resulting in more than one derived dataset. Table 7 lists the details of how we split each dataset into multiple environments.
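Such a partition can be sketched in a few lines. The example below is not the authors' code: it only illustrates binning records into four environments by the hour variable, following the [0, 6), [6, 12), [12, 18), [18, 24) bins used for the Energy datasets in Scenarios 2 and 3 (see Table 7); `assign_environment` and `partition` are hypothetical helper names:

```python
def assign_environment(hour):
    """Map an hour of day (0-23) to one of four environments:
    [0,6) -> 0, [6,12) -> 1, [12,18) -> 2, [18,24) -> 3."""
    return hour // 6

def partition(records):
    """Group records (dicts with an 'hour' field) by environment."""
    envs = {0: [], 1: [], 2: [], 3: []}
    for r in records:
        envs[assign_environment(r["hour"])].append(r)
    return envs

# One record per hour of the day -> six records per environment.
data = [{"hour": h} for h in range(24)]
envs = partition(data)
print([len(envs[k]) for k in sorted(envs)])  # [6, 6, 6, 6]
```

The same pattern applies to the month-based splits of the Air datasets, with different bin boundaries.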
A.3 Training Procedure

If there is no validation set, the maximum depths of RF, IRF, and XGBoost are fixed to 10 in classification tasks and 20 in regression tasks. For each task, we run IRF with λ = 1, 5, 10 respectively. If there is a validation set, RF chooses the best maximum depth from {5, 10, 15} (for classification tasks) or {10, 15, 20} (for regression tasks). IRF uses the same maximum depth as RF and chooses the best λ from {0, 1, 5, 10}. XGBoost also chooses the best maximum depth from {5, 10, 15} (for classification tasks) or {10, 15, 20} (for regression tasks). All these hyper-parameters are chosen using the cross-entropy loss (for classification tasks) or the MSE (for regression tasks) on the validation set. As for the number of trees, in Scenario 2 and Scenario 3 of the real-data regression tasks, the number of ensemble trees is fixed to 10 for RF and IRF; in all other cases, it is fixed to 50 for RF and IRF. For XGBoost, the number of ensemble trees is fixed to 100 in all the experiments.

B More Analysis on Synthetic Data Experiments

In the two synthetic data experiments, besides using the accuracy or MSE on the testing set as evaluation metrics, we can directly check how frequently each variable is chosen as the splitting variable during tree growth. We want our model to use stable variables more and environmental variables less. To measure the frequency of a variable being used as the splitting variable, we define a feature importance for each feature/variable. For a single tree, the initial feature importances are all 0. Whenever a variable is used as the splitting variable at node m, the feature importance of this variable increases by n_m / n, where n_m is the sample size at node m and n is the sample size of the whole training set. The feature importance over a forest is the average feature importance over all its trees.
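The feature importance just defined can be computed by a single traversal of each tree. The following is a minimal sketch, not the authors' implementation; the `Node` class and function names are hypothetical:

```python
import numpy as np

class Node:
    def __init__(self, feature=None, n_samples=0, left=None, right=None):
        self.feature = feature      # splitting feature index (None for leaves)
        self.n_samples = n_samples  # sample size reaching this node
        self.left, self.right = left, right

def tree_importance(root, n_features):
    """Importance for a single tree: each internal node m adds n_m / n
    to its splitting feature, where n_m is the node's sample size and
    n is the training-set size (the root's sample size)."""
    imp = np.zeros(n_features)
    n = root.n_samples
    stack = [root]
    while stack:
        node = stack.pop()
        if node.feature is not None:  # internal node
            imp[node.feature] += node.n_samples / n
            stack.extend([node.left, node.right])
    return imp

def forest_importance(trees, n_features):
    """Forest importance: the average importance over all its trees."""
    return np.mean([tree_importance(t, n_features) for t in trees], axis=0)

# Toy tree: root splits on feature 0 (100 samples),
# its left child splits on feature 1 (60 samples).
leaf = lambda n: Node(n_samples=n)
t = Node(feature=0, n_samples=100,
         left=Node(feature=1, n_samples=60, left=leaf(30), right=leaf(30)),
         right=leaf(40))
print(tree_importance(t, 2).tolist())  # [1.0, 0.6]
```

Summing this quantity over the stable variables and over the environmental variables, respectively, yields the two columns reported in Table 8.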
We sum up the feature importances of all stable variables and all environmental variables in the two synthetic data experiments and present them in Table 8. As we can see, when λ goes up, IRF uses stable variables more and environmental variables less, as we expect.

Table 7: Splitting strategies of the regression datasets. Each dataset is split into three or four environments by the strategies in the table. The training set, validation set, and testing set are constructed by the rule described in the main content of the paper.

Scenario S1:
Dataset    Environment 1         Environment 2              Environment 3
Energy1-1  hour ∈ [10, 18)       hour ∈ [6, 10) ∪ [18, 22)  hour ∈ [0, 6) ∪ [22, 24)
Energy1-2  hour ∈ [4, 12)        hour ∈ [12, 20)            hour ∈ [0, 4) ∪ [20, 24)
Energy1-3  hour ∈ [0, 8)         hour ∈ [8, 16)             hour ∈ [16, 24)
Energy1-4  first 1/3 of time     middle 1/3 of time         last 1/3 of time
Energy2-1  hour ∈ [10, 18)       hour ∈ [6, 10) ∪ [18, 22)  hour ∈ [0, 6) ∪ [22, 24)
Energy2-2  hour ∈ [4, 12)        hour ∈ [12, 20)            hour ∈ [0, 4) ∪ [20, 24)
Energy2-3  hour ∈ [0, 8)         hour ∈ [8, 16)             hour ∈ [16, 24)
Energy2-4  first 1/3 of time     middle 1/3 of time         last 1/3 of time
Air1-1     month = 1, 2, 3, 4    month = 5, 6, 7, 8         month = 9, 10, 11, 12
Air1-2     month = 2, 3, 4, 5    month = 6, 7, 8, 9         month = 10, 11, 12, 1
Air1-3     month = 3, 4, 5, 6    month = 7, 8, 9, 10        month = 11, 12, 1, 2
Air1-4     first 1/3 of time     middle 1/3 of time         last 1/3 of time
Air2-1     month = 1, 2, 3, 4    month = 5, 6, 7, 8         month = 9, 10, 11, 12
Air2-2     month = 2, 3, 4, 5    month = 6, 7, 8, 9         month = 10, 11, 12, 1
Air2-3     month = 3, 4, 5, 6    month = 7, 8, 9, 10        month = 11, 12, 1, 2
Air2-4     first 1/3 of time     middle 1/3 of time         last 1/3 of time

Scenarios S2 & S3:
Dataset    Environment 1       Environment 2        Environment 3       Environment 4
Energy1-1  hour ∈ [0, 6)       hour ∈ [6, 12)       hour ∈ [12, 18)     hour ∈ [18, 24)
Energy1-2  first 1/4 of time   second 1/4 of time   third 1/4 of time   last 1/4 of time
Energy2-1  hour ∈ [0, 6)       hour ∈ [6, 12)       hour ∈ [12, 18)     hour ∈ [18, 24)
Energy2-2  first 1/4 of time   second 1/4 of time   third 1/4 of time   last 1/4 of time
Air1-1     month = 1, 2, 3     month = 4, 5, 6      month = 7, 8, 9     month = 10, 11, 12
Air1-2     month = 2, 3, 4     month = 5, 6, 7      month = 8, 9, 10    month = 11, 12, 1
Air1-3     month = 3, 4, 5     month = 6, 7, 8      month = 9, 10, 11   month = 12, 1, 2
Air1-4     first 1/4 of time   second 1/4 of time   third 1/4 of time   last 1/4 of time
Air2-1     month = 1, 2, 3     month = 4, 5, 6      month = 7, 8, 9     month = 10, 11, 12
Air2-2     month = 2, 3, 4     month = 5, 6, 7      month = 8, 9, 10    month = 11, 12, 1
Air2-3     month = 3, 4, 5     month = 6, 7, 8      month = 9, 10, 11   month = 12, 1, 2
Air2-4     first 1/4 of time   second 1/4 of time   third 1/4 of time   last 1/4 of time

Table 8: Feature importances in the synthetic data tasks. We sum up the feature importances of all the stable variables and all the environmental variables. In this table, Stable stands for the feature importance sum of all the stable variables, and Environmental stands for the feature importance sum of all the environmental variables. We take the average of the results over 5 runs; the number in the bracket is the standard deviation over 5 runs. As we can see, when λ goes up, IRF uses stable variables more and environmental variables less, as we expect.
                         RF (IRF with λ = 0)          IRF (λ = 1)                  IRF (λ = 5)                  IRF (λ = 10)
Dataset                  Stable        Environmental  Stable        Environmental  Stable        Environmental  Stable        Environmental
Classification (d = 2)   4.25 (0.14)   4.76 (0.12)    6.06 (0.46)   3.42 (0.45)    6.44 (0.43)   3.10 (0.42)    6.54 (0.40)   3.02 (0.39)
Classification (d = 5)   3.84 (0.14)   4.83 (0.12)    5.67 (0.30)   3.61 (0.29)    6.25 (0.29)   3.20 (0.29)    6.38 (0.34)   3.11 (0.34)
Classification (d = 10)  3.22 (0.11)   5.18 (0.15)    5.26 (0.30)   3.83 (0.30)    5.86 (0.23)   3.46 (0.23)    6.12 (0.19)   3.27 (0.19)
Classification (d = 20)  2.84 (0.05)   5.44 (0.05)    5.26 (0.23)   3.64 (0.22)    5.70 (0.20)   3.43 (0.19)    5.90 (0.21)   3.31 (0.21)
Regression (d = 2)       7.58 (0.06)   4.05 (0.05)    9.15 (0.13)   2.11 (0.11)    9.89 (0.06)   1.42 (0.03)    10.01 (0.05)  1.38 (0.03)
Regression (d = 5)       8.20 (0.08)   4.00 (0.04)    9.80 (0.13)   1.51 (0.21)    10.47 (0.02)  0.73 (0.01)    10.56 (0.04)  0.69 (0.03)
Regression (d = 10)      8.27 (0.10)   3.87 (0.04)    8.89 (0.39)   2.85 (0.54)    10.80 (0.06)  0.52 (0.07)    10.90 (0.06)  0.41 (0.02)
Regression (d = 20)      8.45 (0.04)   3.81 (0.06)    9.16 (0.22)   2.46 (0.35)    11.09 (0.04)  0.28 (0.03)    11.16 (0.05)  0.23 (0.01)