Invariant Random Forest: Tree-Based Model Solution for OOD Generalization

Yufan Liao 1,2, Qi Wu 2, Xing Yan 1,*

1 Institute of Statistics and Big Data, Renmin University of China
2 School of Data Science, City University of Hong Kong
liaoyf@ruc.edu.cn, qi.wu@cityu.edu.hk, xingyan@ruc.edu.cn
* Corresponding author.

arXiv:2312.04273v1 [cs.LG] 7 Dec 2023

Abstract

Out-Of-Distribution (OOD) generalization is an essential topic in machine learning. However, recent research has focused only on the corresponding methods for neural networks. This paper introduces a novel and effective solution for OOD generalization of decision tree models, named Invariant Decision Tree (IDT). IDT enforces a penalty term with regard to the unstable/varying behavior of a split across different environments during the growth of the tree. Its ensemble version, the Invariant Random Forest (IRF), is constructed. Our proposed method is motivated by a theoretical result under mild conditions, and validated by numerical tests with both synthetic and real datasets. The superior performance compared to non-OOD tree models implies that considering OOD generalization for tree models is necessary and deserves more attention.

1 Introduction

Machine learning models achieve great success in many applications, such as image classification and speech recognition, when training data and testing data are generated from the same distribution. However, when the task is about predicting rather than recognizing, the distribution of the data at testing time may differ greatly from the distribution of the training data, and the model needs to predict well when the testing data comes from unseen distributions. For example, in autopilot tasks, we may encounter unseen signs; when predicting the stock market, huge crashes may happen for new reasons. This is called the Out-Of-Distribution (OOD) generalization problem.
The performance of existing machine learning models may drop severely in OOD scenarios (Shen et al. 2021; Zhou et al. 2021; Geirhos et al. 2020). To make up for this weakness, many methods have been proposed to improve the OOD generalization ability of machine learning models, such as IRM (Arjovsky et al. 2019) and REx (Krueger et al. 2021). But almost all of these methods concern Deep Neural Networks (DNNs); for other types of machine learning models, no solutions have been proposed so far. Decision trees are a classic type of machine learning model. As opposed to DNNs, decision trees are white-box models that are simple to interpret. Thus, decision trees can be applied in areas where safety and reliability are required, or where collaboration between humans and AI is needed, such as healthcare, credit assessment, etc. Built upon decision trees, ensemble models like Random Forest (RF) (Breiman 2001), Gradient Boosting Decision Tree (GBDT) (Friedman 2001), and XGBoost (Chen et al. 2015) have become popular choices in real applications. Though tree-based models have high interpretability, they may also suffer from distribution shifts and spurious correlations, just as DNNs may. Figure 1 illustrates an OOD situation tree models may face. With regular splitting criteria, decision trees only consider the average performance and ignore the heterogeneous behaviors across different environments. Unstable behavior means that the performance of a particular split can change or even reverse on the testing set, no matter how good the performance is on the training data. When we already know that the data distribution may change at testing time, it is vital to take precautions instead of relying on traditional tree models. In this paper, we first describe the failure of decision tree models in the case where the testing distribution and the training distribution are different.
Then, we derive an invariant across multiple environments for classification tasks under the setting of stable and unstable features. Based on this invariant, we introduce our proposed model, the Invariant Decision Tree (IDT), and its ensemble version, the Invariant Random Forest (IRF), by designing an additional penalty term in the splitting criterion for growing trees. The penalty term encourages the use of stable features as splitting variables during tree growth. Experiments on both synthetic and real datasets demonstrate the superiority of our proposed method in OOD generalization.

2 Related Works

Invariant Learning To achieve OOD generalization of DNNs, invariant learning methods such as IRM (Arjovsky et al. 2019) and REx (Krueger et al. 2021) propose to pursue an invariant final layer or invariant loss across data from different sources (e.g., locations, time, etc.). The data from different sources are usually called different environments or domains in this literature. IRM seeks invariance of the last layer of the network, and REx requires that the losses in different environments be close. Invariant learning posits that a model with invariant behavior in all training environments will predict consistently in testing environments. In this paper, we discover a new kind of invariant for tree models which is different from those of IRM and REx.

Stable Learning Stable learning shares some common motivations with causal inference (Cui and Athey 2022). In the literature, stable prediction was achieved via variable balancing (Kuang et al. 2018), sample re-weighting (Shen et al. 2020b), and feature decorrelation (Shen et al. 2020a). It was further extended to adversarial problems (Liu et al. 2021), sparsity (Yu et al. 2023), model misspecification (Kuang et al. 2020), and deep models for images (Zhang et al. 2021).
Time Robust Tree Time Robust Tree (TRT) (Moneda and Mauá 2022) brought the OOD generalization problem to decision trees. TRT proposed to restrict the minimum sample size for each environment after a split and to consider the maximum loss across environments instead of the empirical loss. However, this method is limited to time series data only, and its performance is not outstanding on real datasets.

3 Motivation: Stable Split and Unstable Split

Before formally describing our proposed method, we first raise a toy example to show the motivation of our method. Consider a simple binary classification problem, where the label Y ∈ {0, 1}, and the 2-dimensional features X = (X1, X2) ∈ {(0, 0), (0, 1), (1, 0), (1, 1)} are also binary. Suppose we have two training environments e = 1, 2 with the same sample size, and a testing environment e = 3. We consider the out-of-distribution case where, in different environments, the causal relation between X1 and Y is stable, but the causal relation between X2 and Y varies:

    Y = Bern(0.5),
    X1 = |Y − C1|,  C1 ∼ Bern(0.3),
    X2 = |Y − C2|,  C2 ∼ Bern(Ue).    (1)

Here, Bern(p) refers to the Bernoulli distribution with parameter p. Different environments e have different parameters Ue. The equations say that X1 is equal to the label Y, but flips with probability 0.3; X2 is also equal to Y, but flips with probability Ue. Now suppose U1 = 0.1, U2 = 0.4, U3 = 0.7. When a decision tree picks the best split on the training data, it would pick X2 as the splitting variable rather than X1. This is because, in the mixed data of environments 1 and 2, X2 equals the label Y 75% of the time (for X1, it is 70%), showing a stronger relation and predictability to the label Y. However, given the environmental information we have, it is better for the generalization performance of the model to use X1 instead of X2 as the predicting variable.
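As a quick sanity check, the toy example above can be simulated directly. This is our own sketch (the helper `make_env` is not from the paper), reproducing the 75%-vs-70% training statistics and the collapse of X2 at testing time:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_env(u, n=10_000):
    # Toy DGP of (1): X1 flips the label w.p. 0.3, X2 flips it w.p. u.
    y = rng.integers(0, 2, n)
    x1 = np.abs(y - rng.binomial(1, 0.3, n))
    x2 = np.abs(y - rng.binomial(1, u, n))
    return x1, x2, y

# Two training environments (U1 = 0.1, U2 = 0.4) and one testing environment (U3 = 0.7).
x1_tr, x2_tr, y_tr = [np.concatenate(p) for p in zip(make_env(0.1), make_env(0.4))]
x1_te, x2_te, y_te = make_env(0.7)

print((x1_tr == y_tr).mean())   # ~0.70: X1 matches Y 70% of the time in training
print((x2_tr == y_tr).mean())   # ~0.75: X2 looks like the better split on pooled data
print((x1_te == y_te).mean())   # ~0.70: X1 keeps its accuracy at testing time
print((x2_te == y_te).mean())   # ~0.30: the rule based on X2 collapses
```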
The given data-generating process implies that using X1 as the splitting variable is more secure and can ensure good performance at testing time. If we learn from the training data as usual and derive X2 = 1 → Y = 1, we can only achieve 30% accuracy on the testing set; but if we derive X1 = 1 → Y = 1, it is 70%. Figure 1 illustrates this toy example. To solve this problem, one needs to avoid splits that behave differently across different environments, as in the above example, where the split has 90% accuracy in one environment and only 60% in another. With such a huge difference, it is questionable whether this split can be effective on testing data. But how can we rule out these unstable splits systematically and quantitatively? Next, we consider a more general data setting, and develop an invariant that can help us discriminate between stable and unstable splits.

3.1 The Theorem of An Invariant

Now we consider a general binary classification problem. Similar to the assumption in (Rosenfeld, Ravikumar, and Risteski 2020), we assume that in the data-generating process, the label is drawn first, and the features are drawn according to the label. For each environment e,

    Y = 1 w.p. η^e, and Y = 0 otherwise.    (2)

Stable variables S = (S1, S2, ..., Sr) and environmental variables Z = (Z1, Z2, ..., Zt) are generated according to the label Y:

    S ∼ P0(s1, s2, ..., sr), Z ∼ Q0^e(z1, z2, ..., zt), if Y = 0,
    S ∼ P1(s1, s2, ..., sr), Z ∼ Q1^e(z1, z2, ..., zt), if Y = 1.    (3)

Here, P0, Q0^e, P1, Q1^e are cumulative distribution functions. Note that with the same label, the distributions of the stable variables S are the same across different environments, while those of the environmental variables Z are not. We consider a more general setting compared to (Rosenfeld, Ravikumar, and Risteski 2020), where S and Z can follow any distributions instead of only Gaussian distributions.
The goal of the model is to take both stable and environmental variables X = (S, Z) as inputs to predict the label Y. Decision trees use one splitting variable at a node to separate the data into two subsets. As the distribution of Z may change in a new testing environment, we would like to avoid using Z as splitting variables in our tree model. However, we do not know in advance which variables are the stable ones; we have to judge the stability of a variable on our own. To achieve that, we first derive an invariant across environments that holds when the stable variables S are used as splitting variables. Then, this invariant can serve as the judging criterion for the stability of any splitting variable.

Theorem 1. For any subset D ⊂ R^{r+t}, let P̃0, Q̃0^e, P̃1, Q̃1^e be the distributions P0, Q0^e, P1, Q1^e restricted on D, respectively. For a splitting rule Si ≤ c on D, define the changing rate of the positive label as

    CR1_{Si≤c} = P(Y = 1 | Si ≤ c, X ∈ D) / P(Y = 1 | X ∈ D),    (4)

and the changing rate of the negative label as

    CR0_{Si≤c} = P(Y = 0 | Si ≤ c, X ∈ D) / P(Y = 0 | X ∈ D).    (5)

These changing rates can be calculated in any environment. Then, using any stable variable as the splitting variable, the ratio between the changing rates of the positive label and the negative label is invariant across different environments. That is to say, for some k, CR1_{Si≤c} / CR0_{Si≤c} = k holds for every environment e.

Figure 1: The illustration of the example in (1). The number in each circle is E[Y]. The prediction rule X2 = 1 → Y = 1 has different accuracies under the two training environments, so it is questionable whether this rule can still be effective under the testing environment.
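Theorem 1 can be checked numerically on the toy example of Section 3. Below is our own sketch (the helper `changing_rate_ratio` is not from the paper): for the stable X1, the ratio CR1/CR0 equals (0.3/0.5)/(0.7/0.5) = 3/7 in every environment, while for X2 it depends on Ue:

```python
import numpy as np

rng = np.random.default_rng(1)

def changing_rate_ratio(x, y, c):
    # Empirical version of CR1/CR0 in Theorem 1 for the split "x <= c".
    left = x <= c
    cr1 = y[left].mean() / y.mean()                # changing rate of P(Y=1)
    cr0 = (1 - y[left]).mean() / (1 - y).mean()    # changing rate of P(Y=0)
    return cr1 / cr0

results = {}
for u in (0.1, 0.4):                               # the two training environments
    n = 200_000
    y = rng.integers(0, 2, n)
    x1 = np.abs(y - rng.binomial(1, 0.3, n))       # stable variable
    x2 = np.abs(y - rng.binomial(1, u, n))         # environmental variable
    results[u] = (changing_rate_ratio(x1, y, 0), changing_rate_ratio(x2, y, 0))

# Stable X1: ratio ≈ (0.3/0.5)/(0.7/0.5) = 3/7 in both environments.
# Environmental X2: ratio ≈ u/(1-u), i.e. ≈ 0.11 vs ≈ 0.67 -- not invariant.
```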
Proof. We hide the condition X ∈ D for simplicity. For any environment e, the probability of the positive label conditional on Si ≤ c is P(Y = 1 | Si ≤ c) = P(Y = 1) P(Si ≤ c | Y = 1) / P(Si ≤ c). Similarly, P(Y = 0 | Si ≤ c) = P(Y = 0) P(Si ≤ c | Y = 0) / P(Si ≤ c). Dividing the first equation by the second, we have

    P(Y = 1 | Si ≤ c) / P(Y = 0 | Si ≤ c) = [P(Y = 1) P(Si ≤ c | Y = 1)] / [P(Y = 0) P(Si ≤ c | Y = 0)].    (6)

Then,

    [P(Y = 1 | Si ≤ c) / P(Y = 0 | Si ≤ c)] / [P(Y = 1) / P(Y = 0)]
    = P(Si ≤ c | Y = 1, X ∈ D) / P(Si ≤ c | Y = 0, X ∈ D)
    = P_{S∼P̃1}(Si ≤ c) / P_{S∼P̃0}(Si ≤ c).    (7)

Because all the environments share the same distribution of S over D, i.e., P̃0 and P̃1, the right-hand side of the last equation is invariant across different environments, as long as the splitting variable is a stable variable.

The above theorem suggests that, when we use a stable variable as the splitting variable, the ratio of changing rates is an invariant across different environments. It may no longer be an invariant if the splitting variable is an environmental variable, because Q0^e and Q1^e change over environments. Therefore, in order to use more stable variables as splitting variables in the tree model, we can enforce a restriction using this invariant during the splitting procedure.

4 Invariant Tree and Forest

A decision tree recursively partitions the feature space such that samples with the same labels or similar target values are grouped together. Let the training data set at node m be Qm, with sample size nm, and represent each candidate split Xj ≤ c as θ = (j, c). θ splits the data set Qm into Qm,l(θ) and Qm,r(θ) with nm,l and nm,r samples, respectively:

    Qm,l(θ) = {(x, y) ∈ Qm | xj ≤ c},
    Qm,r(θ) = {(x, y) ∈ Qm | xj > c}.    (8)

An impurity function H(·) can measure the similarity of a group of data. Generally, the quality of a candidate split θ is measured by the weighted sum of the impurity functions on the left and right nodes:

    G(Qm, θ) = (nm,l / nm) H(Qm,l) + (nm,r / nm) H(Qm,r).    (9)

To pick out the best split, we select the one that minimizes the impurity objective:

    θ* = arg min_θ G(Qm, θ).    (10)

A grid search over θ is generally adopted. For classification tasks, popular choices of the impurity function are Gini impurity (Gordon et al. 1984) and Shannon information gain (Quinlan 2014); we use Gini impurity throughout the paper. For regression tasks, Mean Squared Error (MSE) and Mean Absolute Error (MAE) can be used as impurity functions; we use MSE throughout the paper. Note that for both types of tasks, the impurity functions measure how concentrated a set of data is.

4.1 Invariant Classification Tree

To consider the case of the classification tree, we first denote the data sets with the environments they come from: Qm^e is the training data set at node m which comes from environment e, where e = 1, ..., E and E is the number of training environments. Then, θ = (j, c) splits Qm^e into two subsets Qm,l^e(θ) and Qm,r^e(θ), as in (8). Note that the merged data set Qm = ∪_{e=1}^E Qm^e is used to compute the impurity objective G(Qm, θ) that needs to be minimized in our tree model. In order to consider the invariance across environments, we propose to put a restriction on the split, together with the original splitting objective G(Qm, θ). In Theorem 1, we defined the changing rates on a set of distributions. Now, we can compute the changing rates given datasets:

    CR1_{Xj≤c}(Qm^e) = [|{(x, y) ∈ Qm,l^e(θ) | y = 1}| / |Qm,l^e(θ)|] / [|{(x, y) ∈ Qm^e | y = 1}| / |Qm^e|],    (11)

    CR0_{Xj≤c}(Qm^e) = [|{(x, y) ∈ Qm,l^e(θ) | y = 0}| / |Qm,l^e(θ)|] / [|{(x, y) ∈ Qm^e | y = 0}| / |Qm^e|].    (12)

By Theorem 1, when the splitting variable Xj is a stable one, an invariant across environments should be

    I(Qm^e, θ) = CR1_{Xj≤c}(Qm^e) / CR0_{Xj≤c}(Qm^e).    (13)
Therefore, when deciding the best split at a node, we can select the split which makes I(Qm^e, θ) unchanged across different environments:

    θ* = arg min_θ G(Qm, θ), s.t. I(Qm^e, θ) = I(Qm^f, θ), ∀ e, f = 1, 2, ..., E.    (14)

By adding this restriction, we can ensure the invariance of the model across environments. However, the restriction is too strong to be satisfied in reality. Thus, we transform this hard restriction into a penalty term: the more I(Qm^e, θ) varies, the bigger the penalty term should be. In this way, the problem becomes solvable:

    θ* = arg min_θ G(Qm, θ) + λ L(Qm^1, ..., Qm^E, θ),    (15)

where Qm^e is the training data set from the e-th environment, L is the penalty term regarding the invariance among environments, and λ ∈ R+ is the penalty weight. Here, we define L to be the ratio between the highest and the lowest I(Qm^e, θ), which is

    L(Qm^1, ..., Qm^E, θ) = max_{e,f} I(Qm^e, θ) / I(Qm^f, θ).    (16)

In the algorithm, to prevent the case when the denominator is 0, we add a dummy sample to the calculation of the changing rates:

    C̃R1_{Xj≤c}(Qm^e) ∝ (|{(x, y) ∈ Qm,l^e(θ) | y = 1}| + 0.5) / (|{(x, y) ∈ Qm^e | y = 1}| + 1),    (17)

    C̃R0_{Xj≤c}(Qm^e) ∝ (|{(x, y) ∈ Qm,l^e(θ) | y = 0}| + 0.5) / (|{(x, y) ∈ Qm^e | y = 0}| + 1),    (18)

    Ĩ(Qm^e, θ) = C̃R1_{Xj≤c}(Qm^e) / C̃R0_{Xj≤c}(Qm^e).    (19)

The penalty term L is computed using Ĩ(Qm^e, θ) instead.

4.2 Invariant Regression Tree

In regression tasks, we cannot define changing rates as in the classification problem. However, we can follow the motivation of changing rates, which is the change of the average label after a split. Thereby, we propose to define the changing rate in the regression problem as the difference of the label mean before and after a split:

    CR_{Xj≤c}(Q) = Σ_{(x,y)∈Ql(θ)} y / |Ql(θ)| − Σ_{(x,y)∈Q} y / |Q|,    (20)

    I(Qm^e, θ) = CR_{Xj≤c}(Qm^e),    (21)

where Q is the training data set at a node in any environment and Ql(θ) is the left-node data set after applying the split θ on Q.
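A minimal sketch of the penalized split selection of Section 4.1 is given below. This is our own simplified implementation (not the authors' code): `gini` is the impurity H, `smoothed_I` follows (17)-(19), and the objective is (15) with the max/min penalty of (16), grid-searched at a single node:

```python
import numpy as np

def gini(y):
    # Gini impurity of a binary label array.
    p = y.mean()
    return 2 * p * (1 - p)

def smoothed_I(x, y, c):
    # Smoothed changing-rate ratio, as in (17)-(19), with the 0.5/1 dummy sample.
    left = x <= c
    cr1 = (np.sum(left & (y == 1)) + 0.5) / (np.sum(y == 1) + 1)
    cr0 = (np.sum(left & (y == 0)) + 0.5) / (np.sum(y == 0) + 1)
    return cr1 / cr0

def penalized_objective(envs, j, c, lam):
    # Objective (15): pooled impurity G plus lambda times the penalty L of (16),
    # where L is the max/min ratio of I over the training environments.
    X_all = np.vstack([X for X, _ in envs])
    y_all = np.concatenate([y for _, y in envs])
    left = X_all[:, j] <= c
    n, nl = len(y_all), int(left.sum())
    if nl == 0 or nl == n:
        return np.inf                      # degenerate split
    G = nl / n * gini(y_all[left]) + (n - nl) / n * gini(y_all[~left])
    Is = [smoothed_I(X[:, j], y, c) for X, y in envs]
    return G + lam * max(Is) / min(Is)

def best_split(envs, lam=1.0):
    # Grid search over all features and thresholds, as in an ordinary tree node.
    X_all = np.vstack([X for X, _ in envs])
    best_obj, best_theta = np.inf, None
    for j in range(X_all.shape[1]):
        for c in np.unique(X_all[:, j])[:-1]:
            obj = penalized_objective(envs, j, c, lam)
            if obj < best_obj:
                best_obj, best_theta = obj, (j, c)
    return best_theta
```

On the toy data of (1), λ = 0 recovers the ordinary tree choice (X2, the lower pooled impurity), while λ = 1 already flips the choice to the stable X1, since the unstable split pays a penalty of roughly max(I)/min(I) ≈ 6 versus ≈ 1.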
For the penalty term, we calculate the variance of the changing rates over all training environments:

    L(Qm^1, ..., Qm^E, θ) = Var_{e=1,...,E}[I(Qm^e, θ)].    (22)

The final penalized impurity objective for minimization is again the one in (15). This idea is in part driven by the changing rate definition in the classification problem, and is also in part similar to REx (Krueger et al. 2021). Differently, REx computes the penalty term using the variance of the loss functions from different training environments, while our model uses the variance of the changing rates.

4.3 Invariant Random Forest

Except for the penalty term added in the splitting criterion, the rest of the training process is the same as for decision trees. Data at a node is recursively split into two subsets until the maximum tree depth is reached or all the data points at a node have the same label. The Invariant Random Forest (IRF), the ensemble of many invariant classification/regression trees, follows a process similar to aggregating decision trees into a Random Forest: we do bootstrap sampling of the training data, separately for each of the training environments, and generate the output by voting. Another key difference from the traditional Random Forest is that IRF does not extract a random subset of features for splitting. The random feature subset is a crucial point in Random Forest, aiming at improving the diversity of trees. But in IRF, the penalty term already avoids using features that are suspected of producing unstable splits. This is another kind of feature selection in splitting, and given the diversity of environments, the random feature selection is unnecessary.

5 Experiments

Next, we test the proposed Invariant Random Forest (IRF) on several datasets with different OOD settings. We compare our method with Random Forest (RF) and XGBoost, two of the most popular ensemble tree models, which are widely used in industry.
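The ensemble procedure of Section 4.3 (per-environment bootstrap, majority voting, no random feature subsets) can be sketched as follows. This is our own illustration; `tree` stands for any fitted invariant tree exposing a `predict` method:

```python
import numpy as np

def bootstrap_envs(envs, rng):
    # Bootstrap sampling performed separately within each training environment,
    # as described in Section 4.3 (no random feature subsets are drawn).
    resampled = []
    for X, y in envs:
        idx = rng.integers(0, len(y), len(y))
        resampled.append((X[idx], y[idx]))
    return resampled

def forest_predict(trees, X):
    # Majority vote of the individual invariant trees (classification case).
    votes = np.stack([tree.predict(X) for tree in trees])
    return (votes.mean(axis=0) >= 0.5).astype(int)
```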
Other methods are not included for comparison because, as far as we know, there are few works considering the OOD generalization of tree models. If not stated otherwise, for classification tasks, the first number in a table is the average accuracy in percentage, and the number in brackets is the average cross-entropy loss. For regression tasks, the first number is the average Mean Squared Error (MSE), and the number in brackets is the standard deviation over multiple runs. For regression tasks, we standardize the MSE of RF to 1, and the MSE of every other method is shown as the ratio to that of RF. That is to say, a number less than 1 in a regression table means better performance than RF.

Table 1: Accuracy and cross-entropy loss results of the synthetic classification task. We consider four cases where the dataset dimension varies in {2, 5, 10, 20}. The numbers reported are the averages over 5 trials with different random seeds.

Dataset Dimension   RF            XGBoost       IRF (λ = 1)   IRF (λ = 5)   IRF (λ = 10)
d = 2               48.74 (0.83)  49.50 (1.24)  50.24 (0.75)  51.20 (0.73)  51.06 (0.73)
d = 5               47.62 (0.85)  49.14 (1.49)  52.24 (0.74)  55.04 (0.70)  55.12 (0.70)
d = 10              43.26 (0.86)  46.64 (1.68)  51.26 (0.74)  53.06 (0.71)  54.94 (0.70)
d = 20              40.08 (0.84)  48.28 (1.51)  52.56 (0.71)  55.08 (0.69)  57.42 (0.68)

Table 2: MSE results of the synthetic regression task. We consider four cases where the dataset dimension varies in {2, 5, 10, 20}. The numbers reported are the averages over 5 trials with different random seeds. Standard deviations are in the brackets.

Dataset Dimension   RF     XGBoost        IRF (λ = 1)    IRF (λ = 5)    IRF (λ = 10)
d = 2               1.000  1.190 (0.033)  0.940 (0.017)  0.893 (0.019)  0.889 (0.024)
d = 5               1.000  1.041 (0.035)  0.874 (0.017)  0.826 (0.022)  0.847 (0.015)
d = 10              1.000  1.054 (0.034)  1.049 (0.049)  0.881 (0.013)  0.900 (0.021)
d = 20              1.000  1.052 (0.023)  1.042 (0.022)  0.916 (0.023)  0.934 (0.021)
5.1 Synthetic Data Experiments

We design a classification dataset and a regression dataset to validate our idea and method. Both datasets allow some difference between the training distribution and the testing distribution. We report the results of IRF with the hyper-parameter λ = 1, 5, 10. The data points of the classification task are generated by:

    Y = Bern(0.5),
    X1 = |Y · 1d − C1| + N1,  C1 ∼ Bernd(0.3),
    X2 = |Y · 1d − C2| + N2,  C2 ∼ Bernd(Ue),
    N1, N2 ∼ N(0, Id).    (23)

Bernd is the d-dimensional Bernoulli distribution with each dimension independent. 1d is the d-dimensional vector with all entries being 1, and Id is the d × d identity matrix. We set U1 = 0.1, U2 = 0.4, and U3 = 0.7. The training data consists of the two environments e = 1, 2, and the testing data is the environment e = 3. This is an upgraded version of the example we proposed previously: both X1 and X2 are d-dimensional, with added Gaussian noise. In addition, the data points of the regression task are generated by:

    X1 ∼ N(0, Id),  Y = 1d⊤ X1 + N(0, 1),
    X2 = Y if e = 1, and X2 = N(0, 1) otherwise.    (24)

Again the training data is from e = 1, 2, and the testing data is the environment e = 3. In environment e = 1, X2 can predict Y exactly without any error, but in the other environments, X2 has no predictability for Y at all. Models that use X1 to predict Y will achieve better performance in the testing environment, because X1 is more stable. The results are shown in Tables 1 and 2. In both tasks, IRF shows superior performance compared to the other methods. This shows that our method can recognize unstable splits and avoid using them during the growing of the trees. A closer look at the details reveals that our method uses X2 less and X1 more for prediction than the others, as shown in the appendix.

5.2 Settings of Real Data Experiments

Next, we test our method on real datasets, to see the practicability of IRF in real prediction applications.
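The regression data-generating process in (24) above can be reproduced with a few lines. This is our own sketch (the function name `regression_env` is ours); it shows that X2 is perfectly correlated with Y only in environment e = 1:

```python
import numpy as np

rng = np.random.default_rng(0)

def regression_env(e, d=10, n=2000):
    # DGP of (24): X1 is stable; X2 equals Y exactly in e = 1 and is pure noise otherwise.
    X1 = rng.normal(size=(n, d))
    Y = X1.sum(axis=1) + rng.normal(size=n)
    X2 = Y.copy() if e == 1 else rng.normal(size=n)
    return np.column_stack([X1, X2]), Y

for e in (1, 2, 3):
    X, Y = regression_env(e)
    print(e, np.corrcoef(X[:, -1], Y)[0, 1])  # ~1.0 for e=1, ~0.0 otherwise
```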
To run the experiments, we first split a dataset into various environments according to some basic variables that contain meta information (year, month, timestamps, etc.). When such a variable is used to obtain environments, it is not used as an input for the models. To test the OOD performance of the models, we use one environment as the testing data; during training, we have no access to this environment at all. For the training data, we consider three different scenarios based on whether the environment information is given:

(S1) Only the pooled data of all environments excluding the testing one is provided as training data, i.e., we do not know any environmental information on the training data.

(S2) The pooled data of most environments is provided, but we have a validation set from a brand-new environment, which can be used for hyper-parameter tuning.

(S3) Training data with full environmental information is provided. That is to say, we take multiple groups of data as inputs of the models, and each group is a single environment.

Table 3: Accuracy and cross-entropy loss results of real-data classification tasks in Scenario 1.

Dataset        RF            XGBoost       IRF (λ = 1)   IRF (λ = 5)   IRF (λ = 10)
Financial      53.07 (0.73)  53.05 (0.98)  55.30 (0.70)  56.34 (0.69)  56.84 (0.69)
Technical      49.07 (0.72)  49.10 (1.10)  49.36 (0.72)  50.07 (0.72)  50.65 (0.72)
Asset Pricing  51.45 (0.71)  51.39 (0.90)  51.96 (0.70)  51.91 (0.70)  51.85 (0.70)

Table 4: MSE results of real-data regression tasks in Scenario 1. Standard deviations are in the brackets.
Dataset    RF     XGBoost        IRF (λ = 1)    IRF (λ = 5)    IRF (λ = 10)
Energy1-1  1.000  1.266 (0.153)  1.009 (0.007)  1.029 (0.016)  1.027 (0.024)
Energy1-2  1.000  1.201 (0.102)  0.986 (0.027)  0.961 (0.086)  0.958 (0.109)
Energy1-3  1.000  1.219 (0.106)  1.018 (0.012)  1.032 (0.004)  1.045 (0.023)
Energy1-4  1.000  1.287 (0.125)  0.933 (0.084)  0.896 (0.129)  0.892 (0.129)
Energy2-1  1.000  1.186 (0.095)  0.891 (0.131)  0.862 (0.206)  0.854 (0.218)
Energy2-2  1.000  1.088 (0.022)  0.997 (0.017)  0.976 (0.083)  0.974 (0.094)
Energy2-3  1.000  1.144 (0.103)  0.973 (0.015)  1.019 (0.075)  1.054 (0.091)
Energy2-4  1.000  1.010 (0.063)  0.988 (0.019)  1.003 (0.007)  0.983 (0.039)
Air1-1     1.000  1.130 (0.015)  0.878 (0.122)  0.850 (0.200)  0.865 (0.222)
Air1-2     1.000  1.088 (0.043)  0.969 (0.016)  0.911 (0.025)  0.886 (0.063)
Air1-3     1.000  1.107 (0.076)  0.983 (0.027)  0.944 (0.085)  0.923 (0.084)
Air1-4     1.000  1.243 (0.061)  0.901 (0.148)  0.974 (0.228)  1.006 (0.246)
Air2-1     1.000  1.214 (0.084)  0.987 (0.003)  0.976 (0.024)  0.995 (0.043)
Air2-2     1.000  1.260 (0.122)  0.981 (0.033)  1.013 (0.064)  1.092 (0.068)
Air2-3     1.000  1.242 (0.144)  0.973 (0.020)  0.965 (0.035)  0.984 (0.026)
Air2-4     1.000  0.858 (0.020)  1.037 (0.009)  1.076 (0.012)  1.057 (0.010)

The last scenario is more popular in research works, but the first two scenarios are realistic too, because sometimes we do not know any meta information. In such cases, we manually split the training data and obtain hand-made environments with a recent simple but effective method for doing so, called Decorr (Liao, Wu, and Yan 2022; Ye et al. 2023; Tong et al. 2023). It finds subsets of the training data that exhibit low correlations among features, in order to reduce spurious correlations in the data. While other methods with the same purpose mainly focus on DNNs, Decorr has no such restriction. The ways of obtaining the initial environment partitions and the settings of the training and testing sets differ between classification and regression; we introduce them separately in the following subsections.
The average results of IRF from multiple runs of training and testing are reported and compared to RF and XGBoost. In Scenario 1, there is no hyper-parameter tuning, and we report λ = 1, 5, 10, respectively. For the validation procedure in Scenarios 2 and 3, we first choose the proper maximum tree depth for RF on the validation set. Then, we use the same maximum depth in IRF and use the validation set to choose the best λ for IRF. For details regarding the validation procedure and the choices of some other hyper-parameters, please refer to the appendix.

5.3 Real-Data Classification Tasks

Three datasets are included here. All of them have heterogeneous distributions or prediction patterns during different periods of time, so we initially obtain the environments by splitting on datetime. Each dataset is introduced in the following. The results are shown in Table 3 and Table 5, where IRF overall performs the best.

Financial The financial indicator dataset1 is a yearly stock return dataset. The features are the financial indicators of each stock, and the label to predict is whether the price goes up or down over the next whole year. The span of this dataset is 5 years, and we treat each single year as an environment. In Scenario 1, we use any 3 environments as the training set, and the other two as the testing set. In Scenarios 2 and 3, we use any 3 environments as the training set, 1 as the validation set, and 1 as the testing set.

Technical The technical indicator dataset2 contains market data and technical indicators of 10 U.S. stocks from 2005 to 2020. The model is expected to predict whether tomorrow's close price of a stock is higher than today's. We split this dataset into several environments based on different time periods. In Scenario 1, the data of the first 60% of the time is used as the training set, and the rest as the testing set.
Table 5: Accuracy and cross-entropy loss results of real-data classification tasks in Scenarios 2 and 3. S2 is the upper part and S3 is the lower part.

Dataset        RF            XGBoost       IRF
Financial      54.42 (0.71)  52.46 (0.85)  56.20 (0.69)
Technical      48.89 (0.71)  49.05 (1.14)  49.18 (0.72)
Asset Pricing  52.50 (0.70)  51.95 (0.75)  52.92 (0.69)
Financial      54.50 (0.70)  52.46 (0.85)  54.27 (0.71)
Technical      48.94 (0.72)  49.05 (1.14)  49.57 (0.72)
Asset Pricing  52.64 (0.70)  51.95 (0.75)  52.79 (0.69)

Table 6: MSE results of real-data regression tasks in Scenarios 2 and 3. Standard deviations are in the brackets. S2 is the upper part and S3 is the lower part.

Dataset    RF     XGBoost        IRF
Energy1-1  1.000  1.064 (0.113)  0.989 (0.047)
Energy1-2  1.000  1.141 (0.231)  0.925 (0.116)
Energy2-1  1.000  1.027 (0.107)  1.006 (0.050)
Energy2-2  1.000  1.059 (0.103)  1.037 (0.121)
Air1-1     1.000  1.052 (0.099)  0.916 (0.106)
Air1-2     1.000  1.112 (0.097)  0.979 (0.064)
Air1-3     1.000  1.173 (0.191)  0.940 (0.141)
Air1-4     1.000  1.099 (0.042)  1.006 (0.013)
Air2-1     1.000  1.187 (0.139)  0.970 (0.078)
Air2-2     1.000  1.191 (0.117)  0.982 (0.062)
Air2-3     1.000  1.236 (0.109)  0.986 (0.042)
Air2-4     1.000  1.089 (0.084)  0.995 (0.031)
Energy1-1  1.000  1.050 (0.086)  0.977 (0.056)
Energy1-2  1.000  1.187 (0.200)  0.979 (0.142)
Energy2-1  1.000  1.008 (0.097)  1.038 (0.143)
Energy2-2  1.000  1.068 (0.099)  1.062 (0.174)
Air1-1     1.000  1.053 (0.076)  0.845 (0.230)
Air1-2     1.000  1.126 (0.076)  0.956 (0.101)
Air1-3     1.000  1.151 (0.202)  0.988 (0.072)
Air1-4     1.000  1.110 (0.046)  0.993 (0.010)
Air2-1     1.000  1.173 (0.115)  0.926 (0.180)
Air2-2     1.000  1.205 (0.126)  1.005 (0.070)
Air2-3     1.000  1.223 (0.108)  1.005 (0.055)
Air2-4     1.000  1.091 (0.076)  0.991 (0.031)

1 https://www.kaggle.com/datasets/cnic92/200-financialindicators-of-us-stocks-20142018
2 https://www.kaggle.com/datasets/nikhilkohli/us-stockmarket-data-60-extracted-features

In Scenarios 2 and 3, we use the first 60% of the data as the training set (split into two environments as the first half and the second half, if needed),
then the subsequent 20% as the validation set, and the last 20% as the testing set.

Asset Pricing Asset Pricing (Gu, Kelly, and Xiu 2020, 2021) is a monthly stock return dataset with different types of financial and technical indicators for thousands of stocks over the period from 1970 to 2021. Models are expected to predict whether a stock has a higher monthly return than at least half of all stocks in the market at the same time. In Scenario 1, we adopt 9-month rolling windows for training and testing: the first six months are the training set, and the next three months are the testing set. In Scenarios 2 and 3, we adopt one-year rolling windows: the first six months are again the training set (split into the first three months and the last three as two environments, if needed), the next three months are the validation set, and the last three are the testing set.

5.4 Real-Data Regression Tasks

The training and testing settings of the regression tasks all follow the same strategy. For the first scenario, we first obtain three environments using the variable containing meta information. We then use any two environments as the training set, and the other one as the testing set (3 runs for each dataset). For the second and third scenarios, we obtain four environments initially and use any two as the training set, any other one as the validation set, and the remaining one as the testing set (12 runs for each dataset). We introduce the four datasets used as follows.

Energy1 & Energy2 The energy datasets3 provide information such as temperature, humidity, and wind speed in a city or a room. We want to predict the energy consumption of that location. Since time has a heterogeneous influence on energy consumption (e.g., a big difference between day and night), in both datasets we use the variable hour to split the data and obtain environments.
There are four different splitting strategies considered for Scenario 1 and two strategies for Scenarios 2 and 3, resulting in more derived datasets named Energy1-i (or Energy2-i). The details of the splitting strategies are in the appendix.

3 https://github.com/LuisM78/Appliances-energyprediction-data/blob/master/energydata complete.csv, https://www.kaggle.com/datasets/gmkeshav/tetuan-city-powerconsumption

Air1 & Air2  The Air datasets4 contain various types of air quality data, and models need to predict the PM2.5 index based on the given air quality data. In these two datasets, we use the variable month to split the data and obtain environments, because the behavior of air pollution may differ across seasons. We again consider different splitting strategies (four for Scenarios 1, 2, and 3), resulting in more derived datasets named Air1-i (or Air2-i).

4 https://github.com/sunilmallya/timeseries/blob/master/data/PRSA data 2010.1.1-2014.12.31.csv,

Results  The results are shown in Table 4 and Table 6, where IRF achieves the best overall performance. On many datasets, the improvements of IRF over the other two models are substantial, with a maximum percentage decrease in error close to 15%; in many cases, the improvement ranges from 1% to 9%. Another finding is that XGBoost always performs worse than RF in all the OOD generalization tasks in this paper, indicating that bagging may be better suited than boosting to OOD settings.

6 Conclusion

Through the discovery of the invariance attained when stable variables are used as the splitting variables of a tree, we construct a method for tree-based models that reduces the use of environmental variables by enforcing penalties when splitting. The proposed IDT and IRF are well-motivated, easy to interpret, and new to the area. Experiments have shown that our method achieves superior performance under OOD generalization settings. Some topics on our method are still worth discussing.
One example is the selection of the hyper-parameter λ when no validation set is available, e.g., in Scenario 1; a validation set extracted i.i.d. from the training set may not be a good choice. Since many existing works assume the existence of meta information, we leave these topics as future work. Other methodologies for the OOD generalization of tree models are also worth exploring.

References

Arjovsky, M.; Bottou, L.; Gulrajani, I.; and Lopez-Paz, D. 2019. Invariant risk minimization. arXiv preprint arXiv:1907.02893.
Breiman, L. 2001. Random forests. Machine Learning, 45: 5–32.
Chen, T.; He, T.; Benesty, M.; Khotilovich, V.; Tang, Y.; Cho, H.; Chen, K.; Mitchell, R.; Cano, I.; Zhou, T.; et al. 2015. XGBoost: extreme gradient boosting. R package version 0.4-2, 1(4): 1–4.
Cui, P.; and Athey, S. 2022. Stable learning establishes some common ground between causal inference and machine learning. Nature Machine Intelligence, 4(2): 110–115.
Friedman, J. H. 2001. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 1189–1232.
Geirhos, R.; Jacobsen, J.-H.; Michaelis, C.; Zemel, R.; Brendel, W.; Bethge, M.; and Wichmann, F. A. 2020. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11): 665–673.
Gordon, A.; Breiman, L.; Friedman, J.; Olshen, R.; and Stone, C. J. 1984. Classification and Regression Trees. Biometrics, 40(3): 874.
Gu, S.; Kelly, B.; and Xiu, D. 2020. Empirical asset pricing via machine learning. The Review of Financial Studies, 33(5): 2223–2273.
Gu, S.; Kelly, B.; and Xiu, D. 2021. Autoencoder asset pricing models. Journal of Econometrics, 222(1): 429–450.
Krueger, D.; Caballero, E.; Jacobsen, J.-H.; Zhang, A.; Binas, J.; Zhang, D.; Le Priol, R.; and Courville, A. 2021. Out-of-distribution generalization via risk extrapolation (REx). In International Conference on Machine Learning, 5815–5826. PMLR.
Kuang, K.; Cui, P.; Athey, S.; Xiong, R.; and Li, B. 2018.
Stable prediction across unknown environments. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1617–1626.
Kuang, K.; Xiong, R.; Cui, P.; Athey, S.; and Li, B. 2020. Stable prediction with model misspecification and agnostic distribution shift. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 4485–4492.
Liao, Y.; Wu, Q.; and Yan, X. 2022. Decorr: Environment Partitioning for Invariant Learning and OOD Generalization. arXiv preprint arXiv:2211.10054.
Liu, J.; Shen, Z.; Cui, P.; Zhou, L.; Kuang, K.; Li, B.; and Lin, Y. 2021. Stable adversarial learning under distributional shifts. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 8662–8670.
Moneda, L.; and Mauá, D. 2022. Time Robust Trees: Using Temporal Invariance to Improve Generalization. In Brazilian Conference on Intelligent Systems, 385–397. Springer.
Quinlan, J. R. 2014. C4.5: Programs for Machine Learning. Elsevier.
Rosenfeld, E.; Ravikumar, P.; and Risteski, A. 2020. The risks of invariant risk minimization. arXiv preprint arXiv:2010.05761.
Shen, Z.; Cui, P.; Liu, J.; Zhang, T.; Li, B.; and Chen, Z. 2020a. Stable learning via differentiated variable decorrelation. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2185–2193.
Shen, Z.; Cui, P.; Zhang, T.; and Kuang, K. 2020b. Stable learning via sample reweighting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 5692–5699.
Shen, Z.; Liu, J.; He, Y.; Zhang, X.; Xu, R.; Yu, H.; and Cui, P. 2021. Towards out-of-distribution generalization: A survey. arXiv preprint arXiv:2108.13624.
Tong, Y.; Yuan, J.; Zhang, M.; Zhu, D.; Zhang, K.; Wu, F.; and Kuang, K. 2023. Quantitatively Measuring and Contrastively Exploring Heterogeneity for Domain Generalization. arXiv preprint arXiv:2305.15889.
Ye, S.; Yu, S.; Hou, W.; Wang, Y.; and You, X. 2023.
Coping with Change: Learning Invariant and Minimum Sufficient Representations for Fine-Grained Visual Categorization. arXiv preprint arXiv:2306.04893.
Yu, H.; Cui, P.; He, Y.; Shen, Z.; Lin, Y.; Xu, R.; and Zhang, X. 2023. Stable Learning via Sparse Variable Independence. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 10998–11006.
Zhang, X.; Cui, P.; Xu, R.; Zhou, L.; He, Y.; and Shen, Z. 2021. Deep stable learning for out-of-distribution generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5372–5382.
Zhou, K.; Liu, Z.; Qiao, Y.; Xiang, T.; and Loy, C. C. 2021. Domain Generalization: A Survey.

A Experiment Details

A.1 Dataset Preprocessing

For most of the datasets used in the experiments, there is no preprocessing. The only exception is the Asset Pricing dataset. Since we take rolling windows to train and forecast, and some features are N/A during some time periods, we remove the features that have N/A values on any row in the selected window. Therefore, the feature dimension in training may be less than the feature dimension of the original dataset, and we may use different features as model inputs in different windows.

A.2 Environment Partitions for Regression Datasets

For the regression datasets in the real-data experiments, we first split the whole dataset into 3 or 4 environments by splitting on a certain variable. The training set, validation set (if needed), and testing set take data from different environments, so these sets have heterogeneous distributions. In this way, the performance on the testing set reflects the OOD generalization ability of different models. The construction of these sets is described in the main content of the paper. However, there is more than one strategy for splitting the chosen variable, resulting in more than one derived dataset. Table 7 lists the details of how we split each dataset into multiple environments.
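Such a partition can be sketched in a few lines. The example below is not the authors' code: it only illustrates binning records into four environments by the hour variable, following the [0, 6), [6, 12), [12, 18), [18, 24) bins used for the Energy datasets in Scenarios 2 and 3 (see Table 7); `assign_environment` and `partition` are hypothetical helper names:

```python
def assign_environment(hour):
    """Map an hour of day (0-23) to one of four environments:
    [0,6) -> 0, [6,12) -> 1, [12,18) -> 2, [18,24) -> 3."""
    return hour // 6

def partition(records):
    """Group records (dicts with an 'hour' field) by environment."""
    envs = {0: [], 1: [], 2: [], 3: []}
    for r in records:
        envs[assign_environment(r["hour"])].append(r)
    return envs

# One record per hour of the day -> six records per environment.
data = [{"hour": h} for h in range(24)]
envs = partition(data)
print([len(envs[k]) for k in sorted(envs)])  # [6, 6, 6, 6]
```

The same pattern applies to the month-based splits of the Air datasets, with different bin boundaries.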
A.3 Training Procedure

If there is no validation set, the maximum depths of RF, IRF, and XGBoost are fixed to 10 in classification tasks and 20 in regression tasks. For each task, we run IRF with λ = 1, 5, 10 respectively. If there is a validation set, RF chooses the best maximum depth from {5, 10, 15} (for classification tasks) or {10, 15, 20} (for regression tasks). IRF uses the same maximum depth as RF and chooses the best λ from {0, 1, 5, 10}. XGBoost also chooses the best maximum depth from {5, 10, 15} (for classification tasks) or {10, 15, 20} (for regression tasks). All these hyper-parameters are chosen using the cross-entropy loss (for classification tasks) or the MSE (for regression tasks) on the validation set. As for the number of trees, in Scenario 2 and Scenario 3 of the real-data regression tasks, the number of ensemble trees is fixed to 10 for RF and IRF; in all other cases, it is fixed to 50 for RF and IRF. For XGBoost, the number of ensemble trees is fixed to 100 in all the experiments.

B More Analysis on Synthetic Data Experiments

In the two synthetic data experiments, besides using the accuracy or MSE on the testing set as evaluation metrics, we can directly check how frequently each variable is chosen as the splitting variable during tree growth. We want our model to use stable variables more and environmental variables less. To measure the frequency of a variable being used as the splitting variable, we define a feature importance for each feature/variable. For a single tree, the initial feature importances are all 0. Whenever a variable is used as the splitting variable at node m, the feature importance of this variable increases by n_m / n, where n_m is the sample size at node m and n is the sample size of the whole training set. The feature importance over a forest is the average feature importance over all its trees.
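The feature importance just defined can be computed by a single traversal of each tree. The following is a minimal sketch, not the authors' implementation; the `Node` class and function names are hypothetical:

```python
import numpy as np

class Node:
    def __init__(self, feature=None, n_samples=0, left=None, right=None):
        self.feature = feature      # splitting feature index (None for leaves)
        self.n_samples = n_samples  # sample size reaching this node
        self.left, self.right = left, right

def tree_importance(root, n_features):
    """Importance for a single tree: each internal node m adds n_m / n
    to its splitting feature, where n_m is the node's sample size and
    n is the training-set size (the root's sample size)."""
    imp = np.zeros(n_features)
    n = root.n_samples
    stack = [root]
    while stack:
        node = stack.pop()
        if node.feature is not None:  # internal node
            imp[node.feature] += node.n_samples / n
            stack.extend([node.left, node.right])
    return imp

def forest_importance(trees, n_features):
    """Forest importance: the average importance over all its trees."""
    return np.mean([tree_importance(t, n_features) for t in trees], axis=0)

# Toy tree: root splits on feature 0 (100 samples),
# its left child splits on feature 1 (60 samples).
leaf = lambda n: Node(n_samples=n)
t = Node(feature=0, n_samples=100,
         left=Node(feature=1, n_samples=60, left=leaf(30), right=leaf(30)),
         right=leaf(40))
print(tree_importance(t, 2).tolist())  # [1.0, 0.6]
```

Summing this quantity over the stable variables and over the environmental variables, respectively, yields the two columns reported in Table 8.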
We sum up the feature importances of all stable variables and all environmental variables in the two synthetic data experiments and present them in Table 8. As we can see, when λ goes up, IRF uses stable variables more and environmental variables less, as we expect.

Table 7: Splitting strategies of the regression datasets. Each dataset is split into three or four environments by the strategies in the table. The training set, validation set, and testing set are constructed by the rule described in the main content of the paper.

Scenario S1:
Dataset    Environment 1         Environment 2              Environment 3
Energy1-1  hour ∈ [10, 18)       hour ∈ [6, 10) ∪ [18, 22)  hour ∈ [0, 6) ∪ [22, 24)
Energy1-2  hour ∈ [4, 12)        hour ∈ [12, 20)            hour ∈ [0, 4) ∪ [20, 24)
Energy1-3  hour ∈ [0, 8)         hour ∈ [8, 16)             hour ∈ [16, 24)
Energy1-4  first 1/3 of time     middle 1/3 of time         last 1/3 of time
Energy2-1  hour ∈ [10, 18)       hour ∈ [6, 10) ∪ [18, 22)  hour ∈ [0, 6) ∪ [22, 24)
Energy2-2  hour ∈ [4, 12)        hour ∈ [12, 20)            hour ∈ [0, 4) ∪ [20, 24)
Energy2-3  hour ∈ [0, 8)         hour ∈ [8, 16)             hour ∈ [16, 24)
Energy2-4  first 1/3 of time     middle 1/3 of time         last 1/3 of time
Air1-1     month = 1, 2, 3, 4    month = 5, 6, 7, 8         month = 9, 10, 11, 12
Air1-2     month = 2, 3, 4, 5    month = 6, 7, 8, 9         month = 10, 11, 12, 1
Air1-3     month = 3, 4, 5, 6    month = 7, 8, 9, 10        month = 11, 12, 1, 2
Air1-4     first 1/3 of time     middle 1/3 of time         last 1/3 of time
Air2-1     month = 1, 2, 3, 4    month = 5, 6, 7, 8         month = 9, 10, 11, 12
Air2-2     month = 2, 3, 4, 5    month = 6, 7, 8, 9         month = 10, 11, 12, 1
Air2-3     month = 3, 4, 5, 6    month = 7, 8, 9, 10        month = 11, 12, 1, 2
Air2-4     first 1/3 of time     middle 1/3 of time         last 1/3 of time

Scenarios S2 & S3:
Dataset    Environment 1       Environment 2        Environment 3       Environment 4
Energy1-1  hour ∈ [0, 6)       hour ∈ [6, 12)       hour ∈ [12, 18)     hour ∈ [18, 24)
Energy1-2  first 1/4 of time   second 1/4 of time   third 1/4 of time   last 1/4 of time
Energy2-1  hour ∈ [0, 6)       hour ∈ [6, 12)       hour ∈ [12, 18)     hour ∈ [18, 24)
Energy2-2  first 1/4 of time   second 1/4 of time   third 1/4 of time   last 1/4 of time
Air1-1     month = 1, 2, 3     month = 4, 5, 6      month = 7, 8, 9     month = 10, 11, 12
Air1-2     month = 2, 3, 4     month = 5, 6, 7      month = 8, 9, 10    month = 11, 12, 1
Air1-3     month = 3, 4, 5     month = 6, 7, 8      month = 9, 10, 11   month = 12, 1, 2
Air1-4     first 1/4 of time   second 1/4 of time   third 1/4 of time   last 1/4 of time
Air2-1     month = 1, 2, 3     month = 4, 5, 6      month = 7, 8, 9     month = 10, 11, 12
Air2-2     month = 2, 3, 4     month = 5, 6, 7      month = 8, 9, 10    month = 11, 12, 1
Air2-3     month = 3, 4, 5     month = 6, 7, 8      month = 9, 10, 11   month = 12, 1, 2
Air2-4     first 1/4 of time   second 1/4 of time   third 1/4 of time   last 1/4 of time

Table 8: Feature importances in the synthetic data tasks. We sum up the feature importances of all the stable variables and all the environmental variables. In this table, Stable stands for the feature importance sum of all the stable variables, and Environmental stands for the feature importance sum of all the environmental variables. We take the average of the results over 5 runs; the number in the bracket is the standard deviation over 5 runs. As we can see, when λ goes up, IRF uses stable variables more and environmental variables less, as we expect.
                         RF (IRF with λ = 0)          IRF (λ = 1)                  IRF (λ = 5)                  IRF (λ = 10)
Dataset                  Stable        Environmental  Stable        Environmental  Stable        Environmental  Stable        Environmental
Classification (d = 2)   4.25 (0.14)   4.76 (0.12)    6.06 (0.46)   3.42 (0.45)    6.44 (0.43)   3.10 (0.42)    6.54 (0.40)   3.02 (0.39)
Classification (d = 5)   3.84 (0.14)   4.83 (0.12)    5.67 (0.30)   3.61 (0.29)    6.25 (0.29)   3.20 (0.29)    6.38 (0.34)   3.11 (0.34)
Classification (d = 10)  3.22 (0.11)   5.18 (0.15)    5.26 (0.30)   3.83 (0.30)    5.86 (0.23)   3.46 (0.23)    6.12 (0.19)   3.27 (0.19)
Classification (d = 20)  2.84 (0.05)   5.44 (0.05)    5.26 (0.23)   3.64 (0.22)    5.70 (0.20)   3.43 (0.19)    5.90 (0.21)   3.31 (0.21)
Regression (d = 2)       7.58 (0.06)   4.05 (0.05)    9.15 (0.13)   2.11 (0.11)    9.89 (0.06)   1.42 (0.03)    10.01 (0.05)  1.38 (0.03)
Regression (d = 5)       8.20 (0.08)   4.00 (0.04)    9.80 (0.13)   1.51 (0.21)    10.47 (0.02)  0.73 (0.01)    10.56 (0.04)  0.69 (0.03)
Regression (d = 10)      8.27 (0.10)   3.87 (0.04)    8.89 (0.39)   2.85 (0.54)    10.80 (0.06)  0.52 (0.07)    10.90 (0.06)  0.41 (0.02)
Regression (d = 20)      8.45 (0.04)   3.81 (0.06)    9.16 (0.22)   2.46 (0.35)    11.09 (0.04)  0.28 (0.03)    11.16 (0.05)  0.23 (0.01)