
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/compose/plot_column_transformer_mixed_types.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_auto_examples_compose_plot_column_transformer_mixed_types.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_compose_plot_column_transformer_mixed_types.py:


===================================
Column Transformer with Mixed Types
===================================

.. currentmodule:: sklearn

This example illustrates how to apply different preprocessing and feature
extraction pipelines to different subsets of features, using
:class:`~compose.ColumnTransformer`. This is particularly handy for the
case of datasets that contain heterogeneous data types, since we may want to
scale the numeric features and one-hot encode the categorical ones.

In this example, the numeric data is standard-scaled after median-imputation,
while the categorical data is one-hot encoded with
:class:`~preprocessing.OneHotEncoder`, which treats missing values as a
category of their own.

In addition, we show two different ways to dispatch the columns to the
particular pre-processor: by column names and by column data types.

Finally, the preprocessing pipeline is integrated in a full prediction pipeline
using :class:`~pipeline.Pipeline`, together with a simple classification
model.
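
One way to give missing categorical values an explicit ``'missing'`` label
before one-hot encoding is sketched below on made-up data. The toy array and
the ``toy_categorical_transformer`` name are assumptions made for illustration
only; they are not part of the pipeline built later in this example.

.. code-block:: python

    import numpy as np
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder

    # Made-up categorical column with one missing entry (hypothetical data).
    toy = np.array([["S"], [np.nan], ["C"], ["S"]], dtype=object)

    toy_categorical_transformer = Pipeline(
        steps=[
            # Replace missing entries with the constant label "missing".
            ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
            ("encoder", OneHotEncoder(handle_unknown="ignore")),
        ]
    )
    encoded = toy_categorical_transformer.fit_transform(toy)

    # The imputed label is learned as a regular category by the encoder.
    categories = list(toy_categorical_transformer.named_steps["encoder"].categories_[0])
    print(categories)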

.. GENERATED FROM PYTHON SOURCE LINES 26-50

.. code-block:: default


    # Author: Pedro Morales <part.morales@gmail.com>
    #
    # License: BSD 3 clause

    import numpy as np

    from sklearn.compose import ColumnTransformer
    from sklearn.datasets import fetch_openml
    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split, GridSearchCV

    np.random.seed(0)

    # Load data from https://www.openml.org/d/40945
    X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

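    # Alternatively, X and y can be obtained from the frame attribute of the
    # Bunch that is returned when return_X_y is left at its default (False):
    # titanic = fetch_openml("titanic", version=1, as_frame=True)
    # X = titanic.frame.drop('survived', axis=1)
    # y = titanic.frame['survived']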







.. GENERATED FROM PYTHON SOURCE LINES 51-69

Use ``ColumnTransformer`` by selecting columns by name
##############################################################################
We will train our classifier with the following features:

Numeric features:

* ``age``: float;
* ``fare``: float.

Categorical features:

* ``embarked``: categories encoded as strings ``{'C', 'S', 'Q'}``;
* ``sex``: categories encoded as strings ``{'female', 'male'}``;
* ``pclass``: ordinal integers ``{1, 2, 3}``.

We create the preprocessing pipelines for both numeric and categorical data.
Note that ``pclass`` could be treated as either a categorical or a numeric
feature.

.. GENERATED FROM PYTHON SOURCE LINES 69-96

.. code-block:: default


    numeric_features = ["age", "fare"]
    numeric_transformer = Pipeline(
        steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
    )

    categorical_features = ["embarked", "sex", "pclass"]
    categorical_transformer = OneHotEncoder(handle_unknown="ignore")

    preprocessor = ColumnTransformer(
        transformers=[
            ("num", numeric_transformer, numeric_features),
            ("cat", categorical_transformer, categorical_features),
        ]
    )

    # Append classifier to preprocessing pipeline.
    # Now we have a full prediction pipeline.
    clf = Pipeline(
        steps=[("preprocessor", preprocessor), ("classifier", LogisticRegression())]
    )

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    clf.fit(X_train, y_train)
    print("model score: %.3f" % clf.score(X_test, y_test))
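
To see what name-based dispatch produces without downloading the dataset, the
same kind of preprocessor can be applied to a small hand-made frame. This is a
minimal sketch: the four rows and the ``toy_*`` names are made up for
illustration and do not come from the Titanic data.

.. code-block:: python

    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # Hypothetical rows with one missing age.
    toy = pd.DataFrame(
        {
            "age": [22.0, np.nan, 35.0, 58.0],
            "fare": [7.25, 71.28, 8.05, 26.55],
            "sex": ["male", "female", "female", "male"],
            "pclass": [3, 1, 3, 2],
        }
    )

    toy_numeric = Pipeline(
        steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
    )
    toy_preprocessor = ColumnTransformer(
        transformers=[
            ("num", toy_numeric, ["age", "fare"]),
            ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex", "pclass"]),
        ]
    )

    transformed = toy_preprocessor.fit_transform(toy)

    # 2 scaled numeric columns + 2 one-hot columns for sex + 3 for pclass.
    print(transformed.shape)  # (4, 7)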


.. GENERATED FROM PYTHON SOURCE LINES 97-101

HTML representation of ``Pipeline`` (display diagram)
##############################################################################
When the ``Pipeline`` is printed out in a Jupyter notebook, an HTML
representation of the estimator is displayed as follows:

.. GENERATED FROM PYTHON SOURCE LINES 101-106

.. code-block:: default

    from sklearn import set_config

    set_config(display="diagram")
    clf
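
Outside a notebook, the same HTML can be obtained as a plain string with
:func:`sklearn.utils.estimator_html_repr`. The following is a small sketch on
a stand-in pipeline; the ``toy_clf`` name and its two steps are assumptions
chosen to keep the snippet self-contained.

.. code-block:: python

    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.utils import estimator_html_repr

    # Stand-in pipeline (hypothetical, not the one fitted in this example).
    toy_clf = Pipeline(
        steps=[("scaler", StandardScaler()), ("classifier", LogisticRegression())]
    )

    # The returned string can be written to an .html file and opened in a browser.
    html = estimator_html_repr(toy_clf)
    print(len(html) > 0)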


.. GENERATED FROM PYTHON SOURCE LINES 107-115

Use ``ColumnTransformer`` by selecting columns by data type
##############################################################################
When dealing with a cleaned dataset, the preprocessing can be automated by
using the data type of each column to decide whether to treat it as a
numerical or categorical feature.
:func:`sklearn.compose.make_column_selector` gives this possibility.
First, let's select only a subset of columns to simplify our
example.

.. GENERATED FROM PYTHON SOURCE LINES 115-119

.. code-block:: default


    subset_feature = ["embarked", "sex", "pclass", "age", "fare"]
    X_train, X_test = X_train[subset_feature], X_test[subset_feature]


.. GENERATED FROM PYTHON SOURCE LINES 120-121

Then, we introspect the information regarding each column data type.

.. GENERATED FROM PYTHON SOURCE LINES 121-124

.. code-block:: default


    X_train.info()


.. GENERATED FROM PYTHON SOURCE LINES 125-130

We can observe that the ``embarked`` and ``sex`` columns were tagged as
``category`` columns when loading the data with ``fetch_openml``. Therefore, we
can use this information to dispatch the categorical columns to the
``categorical_transformer`` and the remaining columns to the
``numeric_transformer``.

.. GENERATED FROM PYTHON SOURCE LINES 132-137

.. note:: In practice, you will have to handle the column data types
   yourself. If you want some columns to be considered as ``category``, you
   will have to convert them into categorical columns. If you are using
   pandas, you can refer to its documentation regarding `Categorical data
   <https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html>`_.
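
As a minimal sketch of such a conversion, the snippet below casts columns of a
made-up frame to the ``category`` dtype and checks which columns each selector
picks up. The frame and its values are hypothetical; only ``astype`` and
``make_column_selector`` are real API calls.

.. code-block:: python

    import pandas as pd
    from sklearn.compose import make_column_selector

    # Hypothetical frame: pclass starts out as an integer column.
    frame = pd.DataFrame(
        {
            "pclass": [1, 3, 2],
            "sex": ["male", "female", "female"],
            "fare": [71.28, 7.25, 13.0],
        }
    )

    # Explicitly mark the columns that should be handled as categorical.
    frame["pclass"] = frame["pclass"].astype("category")
    frame["sex"] = frame["sex"].astype("category")

    categorical_columns = make_column_selector(dtype_include="category")(frame)
    numeric_columns = make_column_selector(dtype_exclude="category")(frame)

    print(categorical_columns)  # ['pclass', 'sex']
    print(numeric_columns)  # ['fare']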

.. GENERATED FROM PYTHON SOURCE LINES 137-154

.. code-block:: default


    from sklearn.compose import make_column_selector as selector

    preprocessor = ColumnTransformer(
        transformers=[
            ("num", numeric_transformer, selector(dtype_exclude="category")),
            ("cat", categorical_transformer, selector(dtype_include="category")),
        ]
    )
    clf = Pipeline(
        steps=[("preprocessor", preprocessor), ("classifier", LogisticRegression())]
    )


    clf.fit(X_train, y_train)
    print("model score: %.3f" % clf.score(X_test, y_test))


.. GENERATED FROM PYTHON SOURCE LINES 155-158

The resulting score is not exactly the same as the one from the previous
pipeline because the dtype-based selector treats the ``pclass`` column as
a numeric feature, whereas it was previously treated as a categorical feature:

.. GENERATED FROM PYTHON SOURCE LINES 158-161

.. code-block:: default


    selector(dtype_exclude="category")(X_train)


.. GENERATED FROM PYTHON SOURCE LINES 162-165

.. code-block:: default


    selector(dtype_include="category")(X_train)


.. GENERATED FROM PYTHON SOURCE LINES 166-174

Using the prediction pipeline in a grid search
#############################################################################
Grid search can also be performed on the different preprocessing steps
defined in the ``ColumnTransformer`` object, together with the classifier's
hyperparameters as part of the ``Pipeline``.
We will search for both the imputer strategy of the numeric preprocessing
and the regularization parameter of the logistic regression using
:class:`~sklearn.model_selection.GridSearchCV`.

.. GENERATED FROM PYTHON SOURCE LINES 174-183

.. code-block:: default


    param_grid = {
        "preprocessor__num__imputer__strategy": ["mean", "median"],
        "classifier__C": [0.1, 1.0, 10, 100],
    }

    grid_search = GridSearchCV(clf, param_grid, cv=10)
    grid_search
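
The keys of ``param_grid`` follow the nesting of the estimators: step names
are joined with double underscores down to the parameter itself. The sketch
below rebuilds a reduced pipeline (an assumption, with only the numeric branch
and a hypothetical ``model`` name) to show that the same key is exposed by
``get_params`` and accepted by ``set_params``.

.. code-block:: python

    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    numeric = Pipeline(
        steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
    )
    model = Pipeline(
        steps=[
            ("preprocessor", ColumnTransformer([("num", numeric, ["age", "fare"])])),
            ("classifier", LogisticRegression()),
        ]
    )

    # "preprocessor" -> "num" -> "imputer" -> "strategy"
    key = "preprocessor__num__imputer__strategy"
    before = model.get_params()[key]

    # GridSearchCV sets candidate values through this same mechanism.
    model.set_params(**{key: "mean", "classifier__C": 10})
    after = model.get_params()[key]

    print(before, "->", after)  # median -> mean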


.. GENERATED FROM PYTHON SOURCE LINES 184-187

Calling ``fit`` triggers the cross-validated search for the best
combination of hyperparameters:


.. GENERATED FROM PYTHON SOURCE LINES 187-192

.. code-block:: default

    grid_search.fit(X_train, y_train)

    print("Best params:")
    print(grid_search.best_params_)


.. GENERATED FROM PYTHON SOURCE LINES 193-194

The internal cross-validation score obtained with those parameters is:

.. GENERATED FROM PYTHON SOURCE LINES 194-196

.. code-block:: default

    print(f"Internal CV score: {grid_search.best_score_:.3f}")


.. GENERATED FROM PYTHON SOURCE LINES 197-198

We can also inspect the top grid search results as a pandas DataFrame:

.. GENERATED FROM PYTHON SOURCE LINES 198-211

.. code-block:: default

    import pandas as pd

    cv_results = pd.DataFrame(grid_search.cv_results_)
    cv_results = cv_results.sort_values("mean_test_score", ascending=False)
    cv_results[
        [
            "mean_test_score",
            "std_test_score",
            "param_preprocessor__num__imputer__strategy",
            "param_classifier__C",
        ]
    ].head(5)


.. GENERATED FROM PYTHON SOURCE LINES 212-216

The best hyperparameters have been used to re-fit a final model on the full
training set. We can evaluate that final model on held-out test data that was
not used for hyperparameter tuning.


.. GENERATED FROM PYTHON SOURCE LINES 216-222

.. code-block:: default

    print(
        (
            "best logistic regression from grid search: %.3f"
            % grid_search.score(X_test, y_test)
        )
    )
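
The re-fitting behavior can be checked on synthetic data: with the default
``refit=True``, the search object exposes the re-fitted model as
``best_estimator_`` and its ``score`` method delegates to that model. This is
a sketch under stated assumptions; the synthetic dataset, the small grid and
the ``toy_search`` name are illustrative and unrelated to the Titanic results.

.. code-block:: python

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, train_test_split

    # Synthetic stand-in data (hypothetical, not the Titanic dataset).
    X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_toy, y_toy, test_size=0.2, random_state=0
    )

    toy_search = GridSearchCV(LogisticRegression(), {"C": [0.1, 1.0, 10]}, cv=5)
    toy_search.fit(X_tr, y_tr)

    # Both calls evaluate the model re-fitted on the full training set.
    search_score = toy_search.score(X_te, y_te)
    refit_score = toy_search.best_estimator_.score(X_te, y_te)
    print(search_score == refit_score)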


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes  0.444 seconds)


.. _sphx_glr_download_auto_examples_compose_plot_column_transformer_mixed_types.py:


.. only :: html

 .. container:: sphx-glr-footer
    :class: sphx-glr-footer-example



  .. container:: sphx-glr-download sphx-glr-download-python

     :download:`Download Python source code: plot_column_transformer_mixed_types.py <plot_column_transformer_mixed_types.py>`



  .. container:: sphx-glr-download sphx-glr-download-jupyter

     :download:`Download Jupyter notebook: plot_column_transformer_mixed_types.ipynb <plot_column_transformer_mixed_types.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_
