pip install skbel
SKBEL is a Python package built on top of scikit-learn to implement the Bayesian Evidential Learning framework as described in Scheidt et al. (2018).
Documentation: skbel.readthedocs.io
Figure 1: The concept of BEL. d = predictor (observed data), h = target (parameter of interest), m = model.
- The idea of BEL is to find a direct relationship between d (predictor) and h (target) in a reduced-dimensional space with machine learning.
- Both d and h are generated by forward modelling from the same set of prior models m.
- Given a new measured predictor d*, this relationship is used to infer the posterior probability distribution of the target, without the need for a computationally expensive inversion.
- The posterior distribution of the target is then sampled and back-transformed from the reduced-dimensional space to the original space to predict posterior realizations of h given d*.
- Examples of both d and h are generated through forward modeling from the same model m. Target and predictor are real, multi-dimensional random variables.
- Specific pre-processing is applied to the data if necessary (such as scaling).
- Principal Component Analysis (PCA) is applied to both target and predictor to aggregate the correlated variables into a few independent Principal Components (PC’s).
- Canonical Correlation Analysis (CCA) transforms the two sets into pairs of Canonical Variates (CV’s) independent of each other.
- Specific post-processing is applied to the CV's if necessary (such as CV normalization).
- The mean μ and covariance Σ of the posterior distribution of an unknown target given an observed d* can be directly estimated from the CV's distribution.
- Alternatively, the posterior conditional distribution can be inferred through KDE (kernel density estimation).
- The posterior distribution is sampled to obtain realizations of h in canonical space, which are successively back-transformed to the original space (see the sketch after Figure 2 below).
Figure 2: Typical BEL workflow.
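For intuition, the workflow above can be mimicked end-to-end with plain scikit-learn and NumPy. The sketch below is illustrative only: the toy data, the component counts, and the least-squares back-mapping out of canonical space are assumptions made for the example, not the SKBEL implementation (which wraps these steps in the BEL object shown later). The KDE alternative could similarly be fitted on the canonical variates, e.g. with scipy.stats.gaussian_kde.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import CCA

# Toy stand-ins for predictor d and target h (400 prior realizations each).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 60))                                        # predictor
Y = X @ rng.normal(size=(60, 40)) + 0.1 * rng.normal(size=(400, 40))  # correlated target
x_star = X[:1]                                                        # "observed" d*

# 1. Scale, then reduce dimension with PCA.
xs, ys = StandardScaler().fit(X), StandardScaler().fit(Y)
pca_x = PCA(n_components=10).fit(xs.transform(X))
pca_y = PCA(n_components=5).fit(ys.transform(Y))
Xp, Yp = pca_x.transform(xs.transform(X)), pca_y.transform(ys.transform(Y))

# 2. CCA pairs the two reduced sets into canonical variates.
cca = CCA(n_components=5).fit(Xp, Yp)
Xc, Yc = cca.transform(Xp, Yp)

# 3. Gaussian posterior of the target CV's given the observed predictor CV's
#    (linear regression of Yc on Xc, constant residual covariance).
G = np.linalg.lstsq(Xc, Yc, rcond=None)[0]
post_cov = np.cov((Yc - Xc @ G).T)
xc_star = cca.transform(pca_x.transform(xs.transform(x_star)))
post_mean = (xc_star @ G).ravel()

# 4. Sample in canonical space and back-transform to the original space.
samples_c = rng.multivariate_normal(post_mean, post_cov, size=200)
B = np.linalg.lstsq(Yc, Yp, rcond=None)[0]  # approximate inverse of the Y canonical transform
h_posterior = ys.inverse_transform(pca_y.inverse_transform(samples_c @ B))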
- All the details about the example can be found in arXiv:2105.05539, and the code in skbel/examples/demo.py.
- It concerns a hydrogeological experiment consisting of predicting the wellhead protection area (WHPA) around a pumping well from breakthrough curves measured at said pumping well.
- Predictor and target are generated through forward modeling from a set of hydrogeological models with different hydraulic conductivity fields (not shown).
- The predictor is the set of breakthrough curves coming from 6 different injection wells around the pumping well (Figure 3).
- The target is the WHPA (Figure 4).
For this example, the data is already pre-processed. We are working with 400 examples of both d and h, and consider one extra pair to be predicted. See details in the reference.
Figure 3: Predictor set. Prior in the background and test data in thick lines.
Figure 4: Target set. Prior in the background (blue) and test data to predict in red.
In this package, a BEL model consists of a succession of Pipelines (imported from scikit-learn).
import os
from os.path import join as jp
import joblib
import pandas as pd
from loguru import logger
from sklearn.cross_decomposition import CCA
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PowerTransformer
import demo_visualization as myvis
from skbel import utils
from skbel.learning.bel import BEL
We can then define a function that returns our desired BEL model:
def init_bel():
    """
    Set all BEL pipelines. This is the blueprint of the framework.
    """
    # Pipeline before CCA
    X_pre_processing = Pipeline(
        [
            ("scaler", StandardScaler(with_mean=False)),
            ("pca", PCA()),
        ]
    )
    Y_pre_processing = Pipeline(
        [
            ("scaler", StandardScaler(with_mean=False)),
            ("pca", PCA()),
        ]
    )
    # Canonical Correlation Analysis
    cca = CCA()
    # Pipeline after CCA
    X_post_processing = Pipeline(
        [("normalizer", PowerTransformer(method="yeo-johnson", standardize=True))]
    )
    Y_post_processing = Pipeline(
        [("normalizer", PowerTransformer(method="yeo-johnson", standardize=True))]
    )
    # Initiate BEL object
    bel_model = BEL(
        X_pre_processing=X_pre_processing,
        X_post_processing=X_post_processing,
        Y_pre_processing=Y_pre_processing,
        Y_post_processing=Y_post_processing,
        cca=cca,
    )
    return bel_model
- The X_pre_processing and Y_pre_processing objects are pipelines which first scale the predictor and target data, then apply the dimension reduction through PCA.
- The X_post_processing and Y_post_processing objects are pipelines which normalize the predictor and target CV's.
- Finally, the BEL model is constructed by passing all these pipelines as arguments to the BEL object.
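Each of these pipelines is a standard scikit-learn Pipeline and can also be fitted on its own, which is convenient for inspecting intermediate results. A small, purely illustrative example (it assumes X_train is the predictor DataFrame loaded in the demo further down):

from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative only: fit the predictor pre-processing pipeline on its own to
# inspect the principal component scores and the cumulative explained variance.
X_pre_processing = Pipeline([("scaler", StandardScaler(with_mean=False)), ("pca", PCA())])
scores = X_pre_processing.fit_transform(X_train)  # array of shape (n_samples, n_components)
explained = X_pre_processing.named_steps["pca"].explained_variance_ratio_.cumsum()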
A simple function can be defined to train our model.
def bel_training(
    bel_,
    *,
    X_train_: pd.DataFrame,
    x_test_: pd.DataFrame,
    y_train_: pd.DataFrame,
    y_test_: pd.DataFrame = None,
    directory: str = None,
):
    """
    :param bel_: BEL model
    :param X_train_: Predictor set for training
    :param x_test_: Predictor "test"
    :param y_train_: Target set for training
    :param y_test_: "True" target (optional)
    :param directory: Path to the directory in which to save the results
    :return:
    """
    # %% Directory in which to store the forecasts
    if directory is None:
        sub_dir = os.getcwd()
    else:
        sub_dir = directory
    # Folders
    obj_dir = jp(sub_dir, "obj")  # Location to save the BEL model
    fig_data_dir = jp(sub_dir, "data")  # Location to save the raw data figures
    fig_pca_dir = jp(sub_dir, "pca")  # Location to save the PCA figures
    fig_cca_dir = jp(sub_dir, "cca")  # Location to save the CCA figures
    fig_pred_dir = jp(sub_dir, "uq")  # Location to save the prediction figures
    # Create directories
    [
        utils.dirmaker(f, erase=True)
        for f in [
            obj_dir,
            fig_data_dir,
            fig_pca_dir,
            fig_cca_dir,
            fig_pred_dir,
        ]
    ]
    # %% Fit BEL model
    bel_.Y_obs = y_test_
    bel_.fit(X=X_train_, Y=y_train_)
    # %% Sample for the observation
    # Extract n random samples (target CV's).
    # The posterior distribution is computed within the method below.
    bel_.predict(x_test_)
    # Save the fitted BEL model
    joblib.dump(bel_, jp(obj_dir, "bel.pkl"))
    msg = f"model trained and saved in {obj_dir}"
    logger.info(msg)
- The example dataset is saved as pandas DataFrames in skbel/examples/dataset.
- An arbitrary choice has to be made on the number of PC's to keep for the predictor and the target. In this case, they are set to 50 and 30, respectively (a sketch of how such a cut could be chosen from the explained variance follows after this list).
- The CCA operator cca is set to keep the maximum number of CV's possible (30).
- Note that the variable y_test is the unknown to predict. It is simply saved within the BEL model for later use (such as plotting or experimental design), but it is ignored during training.
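One common way to choose such a PC cut, sketched here as an illustration only (X_scaled is a hypothetical, already-scaled training array; the demo itself simply fixes 50 and 30 a priori), is to keep the smallest number of components explaining a chosen fraction of the variance:

import numpy as np
from sklearn.decomposition import PCA

# Illustrative only: smallest number of PC's explaining at least 99 % of the
# variance of an already-scaled training array X_scaled (hypothetical name).
pca = PCA().fit(X_scaled)
n_pc = int(np.argmax(np.cumsum(pca.explained_variance_ratio_) >= 0.99) + 1)

The demo script below then ties everything together: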
if __name__ == "__main__":
    # Set directories
    data_dir = jp(os.getcwd(), "dataset")
    output_dir = jp(os.getcwd(), "results")
    # Load dataset
    X_train = pd.read_pickle(jp(data_dir, "X_train.pkl"))
    X_test = pd.read_pickle(jp(data_dir, "X_test.pkl"))
    y_train = pd.read_pickle(jp(data_dir, "y_train.pkl"))
    y_test = pd.read_pickle(jp(data_dir, "y_test.pkl"))
    # Initiate BEL model
    model = init_bel()
    # Set model parameters
    model.mode = "mvn"  # How to compute the posterior conditional distribution
    # Set PC cut
    model.X_n_pc = 50
    model.Y_n_pc = 30
    # Save original dimensions of both predictor and target
    model.X_shape = (6, 200)  # Six curves with 200 time steps each
    model.Y_shape = (1, 100, 87)  # One matrix with 100 rows and 87 columns
    # Number of CCA components is chosen as the min number of PC
    n_cca = min(model.X_n_pc, model.Y_n_pc)
    model.cca.n_components = n_cca
    # Number of samples to be extracted from the posterior distribution
    model.n_posts = 400
    # Train model
    bel_training(
        bel_=model,
        X_train_=X_train,
        x_test_=X_test,
        y_train_=y_train,
        y_test_=y_test,
        directory=output_dir,
    )
    # Plot raw data
    myvis.plot_results(model, base_dir=output_dir)
    # Plot PCA
    myvis.pca_vision(
        model,
        base_dir=output_dir,
    )
    # Plot CCA
    myvis.cca_vision(bel=model, base_dir=output_dir)
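Because the fitted model is persisted with joblib at the end of bel_training, it can be reloaded later (for example in a separate post-processing script) to repeat the posterior sampling for the same observed predictor. A minimal sketch, assuming the demo above was run from the current working directory:

import joblib
import pandas as pd
from os.path import join as jp

# Reload the fitted BEL model saved by bel_training.
bel = joblib.load(jp("results", "obj", "bel.pkl"))
# Repeat the posterior sampling for the observed predictor of the demo dataset.
X_test = pd.read_pickle(jp("dataset", "X_test.pkl"))
bel.predict(X_test)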
Figure 5: A. The full predictor of the test set is the concatenation of all breakthrough curves (thick curves on Figure 3). PCA decomposition with 50 PC’s makes it possible to recover the original curves while smoothing out some noise present in the original dataset. B. Original test target compared to the test target projected then back-transformed with its 30 Principal Components. C. Principal Components of the predictor training set and projected test set. D. Principal Components of the target training set and the projected test set. E. Cumulative explained variance for the Principal Components of the predictor training set. F. Cumulative explained variance for the Principal Components of the target training set.
Figure 6: A, B, C. Canonical Variates bivariate distribution plots for the first 4 pairs of the training set, and the projection in the canonical space of the selected test predictor and associated test target (see notches). The posterior distribution of h^c computed according to BEL and KDE can be compared on the y-marginal plot. D. Decrease of the Canonical Correlation Coefficient r with the number of CV pairs for the training set.
Figure 7: BEL-derived posterior predictions. The red domain corresponds to the envelope of predictions. The uncertainty is reduced around the data sources (injection wells).
Contributions and feedback from users are welcome. Don't hesitate to submit an issue or a PR, or to request a new feature.