################
Categorical Data
################

.. note:: As of XGBoost 1.6, categorical data support is experimental and its feature set is limited.

Starting from version 1.5, XGBoost has experimental support for categorical data available for public testing. For numerical data, the split condition is defined as :math:`value < threshold`, while for categorical data the split is defined depending on whether partitioning or one-hot encoding is used. For partition-based splits, the splits are specified as :math:`value \in categories`, where ``categories`` is the set of categories in one feature. If one-hot encoding is used instead, then the split is defined as :math:`value == category`. More advanced categorical split strategies are planned for future releases. This tutorial details how to inform XGBoost about the data type.

************************************
Training with scikit-learn Interface
************************************

The easiest way to pass categorical data into XGBoost is to use a dataframe and the ``scikit-learn`` interface like :class:`XGBClassifier <xgboost.XGBClassifier>`. To prepare the data, users need to specify the data type of the input predictors as ``category``. For a ``pandas``/``cudf`` dataframe, this can be achieved by

.. code:: python

  X["cat_feature"] = X["cat_feature"].astype("category")

for all columns that represent categorical features. Afterward, users can tell XGBoost to enable training with categorical data. Assuming that you are using :class:`XGBClassifier <xgboost.XGBClassifier>` for a classification problem, specify the parameter ``enable_categorical``:

.. code:: python

  import xgboost as xgb

  # Supported tree methods are `gpu_hist`, `approx`, and `hist`.
  clf = xgb.XGBClassifier(
      tree_method="gpu_hist", enable_categorical=True, use_label_encoder=False
  )
  # X is the dataframe we created in the previous snippet
  clf.fit(X, y)

  # Must use JSON/UBJSON for serialization, otherwise the information is lost.
  clf.save_model("categorical-model.json")

Once training is finished, most other features can work with the model. For instance, one can plot the model and compute the global feature importance:

.. code:: python

  # Get a graph
  graph = xgb.to_graphviz(clf, num_trees=1)
  # Or get a matplotlib axis
  ax = xgb.plot_tree(clf, num_trees=1)
  # Get feature importances
  clf.feature_importances_

The ``scikit-learn`` interface from Dask is similar to the single-node version. The basic idea is to create a dataframe with the category feature type and tell XGBoost to use it by setting the ``enable_categorical`` parameter; a minimal sketch is shown below. See :ref:`sphx_glr_python_examples_categorical.py` for a worked example of using categorical data with the ``scikit-learn`` interface and one-hot encoding. A comparison between using one-hot encoded data and XGBoost's categorical data support can be found in :ref:`sphx_glr_python_examples_cat_in_the_dat.py`.
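The following is a minimal sketch of the Dask variant, assuming a local cluster; the dataframe, column names, and labels are invented for illustration, and ``dask``/``distributed`` are assumed to be installed:

.. code:: python

  import dask.dataframe as dd
  import pandas as pd
  from dask.distributed import Client, LocalCluster

  from xgboost import dask as dxgb

  if __name__ == "__main__":
      with Client(LocalCluster(n_workers=2)) as client:
          # A small pandas frame with a categorical column; the category
          # dtype is preserved when the frame is partitioned with dask.
          df = pd.DataFrame(
              {
                  "cat_feature": pd.Series(["a", "b", "a", "c"], dtype="category"),
                  "num_feature": [1.0, 2.0, 3.0, 4.0],
              }
          )
          labels = pd.Series([0, 1, 0, 1])
          dX = dd.from_pandas(df, npartitions=2)
          dy = dd.from_pandas(labels, npartitions=2)

          # Same `enable_categorical` switch as the single-node estimator.
          clf = dxgb.DaskXGBClassifier(tree_method="hist", enable_categorical=True)
          clf.client = client  # optional; the current client is used by default
          clf.fit(dX, dy)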
********************
Optimal Partitioning
********************

.. versionadded:: 1.6

Optimal partitioning is a technique for partitioning the categorical predictors for each node split; the proof of optimality for numerical output was first introduced by `[1] <#references>`__. The algorithm is used in decision trees `[2] <#references>`__, and later LightGBM `[3] <#references>`__ brought it to the context of gradient boosting trees. It is now also adopted in XGBoost as an optional feature for handling categorical splits.

More specifically, the proof by Fisher `[1] <#references>`__ states that, when trying to partition a set of discrete values into groups based on the distances between a measure of these values, one only needs to look at sorted partitions instead of enumerating all possible permutations. In the context of decision trees, the discrete values are categories, and the measure is the output leaf value. Intuitively, we want to group the categories that output similar leaf values. During split finding, we first sort the gradient histogram to prepare the contiguous partitions, then enumerate the splits according to these sorted values. One of the related parameters for XGBoost is ``max_cat_to_onehot``, which controls whether one-hot encoding or partitioning should be used for each feature; see :doc:`/parameter` for details.

**********************
Using native interface
**********************

The ``scikit-learn`` interface is user friendly, but lacks some features that are only available in the native interface. For instance, users cannot compute SHAP values directly or use a quantized :class:`DMatrix <xgboost.DMatrix>`. Also, the native interface supports data types other than dataframes, like ``numpy``/``cupy`` arrays. To use the native interface with categorical data, we need to pass similar parameters to :class:`DMatrix <xgboost.DMatrix>` and the :func:`train <xgboost.train>` function. For dataframe input:

.. code:: python

  # X is the dataframe we created in the previous snippet
  Xy = xgb.DMatrix(X, y, enable_categorical=True)
  booster = xgb.train({"tree_method": "hist", "max_cat_to_onehot": 5}, Xy)
  # Must use JSON for serialization, otherwise the information is lost
  booster.save_model("categorical-model.json")

SHAP value computation:

.. code:: python

  SHAP = booster.predict(Xy, pred_interactions=True)

  # categorical features are listed as "c"
  print(booster.feature_types)

For other types of input, like a ``numpy`` array, we can tell XGBoost about the feature types by using the ``feature_types`` parameter in :class:`DMatrix <xgboost.DMatrix>`:

.. code:: python

  # "q" marks a numerical feature, while "c" marks a categorical feature
  ft = ["q", "c", "c"]
  X: np.ndarray = load_my_data()
  assert X.shape[1] == 3
  Xy = xgb.DMatrix(X, y, feature_types=ft, enable_categorical=True)

For numerical data, the feature type can be ``"q"`` or ``"float"``, while for categorical features it's specified as ``"c"``. The Dask module in XGBoost has the same interface, so :class:`dask.Array <dask.array.Array>` can also be used for categorical data.

*************
Miscellaneous
*************

By default, XGBoost assumes input categories are integers starting from 0 up to the number of categories, i.e. :math:`[0, n\_categories)`. However, users might provide inputs with invalid values due to mistakes or missing data. These can be negative values, integer values that cannot be accurately represented by 32-bit floating point, or values larger than the actual number of unique categories. During training this is validated, but for prediction such values are treated the same as missing values, for performance reasons. Lastly, missing values are handled the same way as for numerical features (using the learned split direction).

**********
References
**********

[1] Walter D. Fisher. "`On Grouping for Maximum Homogeneity`_." Journal of the American Statistical Association, Vol. 53, No. 284 (Dec., 1958), pp. 789-798.

[2] Trevor Hastie, Robert Tibshirani, Jerome Friedman. "`The Elements of Statistical Learning`_." Springer Series in Statistics, Springer New York Inc. (2001).

[3] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, Tie-Yan Liu. "`LightGBM\: A Highly Efficient Gradient Boosting Decision Tree`_." Advances in Neural Information Processing Systems 30 (NIPS 2017), pp. 3149-3157.

.. _On Grouping for Maximum Homogeneity: https://www.tandfonline.com/doi/abs/10.1080/01621459.1958.10501479
.. _The Elements of Statistical Learning: https://link.springer.com/book/10.1007/978-0-387-84858-7
.. _LightGBM\: A Highly Efficient Gradient Boosting Decision Tree: https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf
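As a final, self-contained sketch tying the above together (illustrative only: the data, column names, and parameters are invented for the example), the snippet below maps a string-valued column to integer codes in :math:`[0, n\_categories)` as described in the Miscellaneous section, marks it with ``feature_types``, and trains through the native interface:

.. code:: python

  import numpy as np
  import pandas as pd

  import xgboost as xgb

  # Hypothetical raw inputs: one numerical column and one string-valued
  # categorical column.
  num = np.array([0.5, 1.5, 2.5, 3.5])
  color = pd.Series(["red", "green", "red", "blue"], dtype="category")
  y = np.array([0.0, 1.0, 0.0, 1.0])

  # `cat.codes` maps each category to an integer in [0, n_categories),
  # which is the encoding assumed for raw array input with feature_types.
  codes = color.cat.codes.to_numpy()  # e.g. blue -> 0, green -> 1, red -> 2

  X = np.column_stack([num, codes]).astype(np.float32)
  Xy = xgb.DMatrix(X, y, feature_types=["q", "c"], enable_categorical=True)
  booster = xgb.train({"tree_method": "hist"}, Xy, num_boost_round=4)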