Histfit doc #258
base: dev
Changes from 8 commits
29d82ea
3dcaf2c
a548b9b
d283fbc
222df55
ca710c8
337e427
46c9e44
c68cbe4
a0958b7
08779ce
| Original file line number | Diff line number | Diff line change | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -389,6 +389,29 @@ A typical dictionary returned by the :py:meth:`~.FitBase.do_fit` method looks li | |||||||||
| called with the corresponding keyword :code:`fit.do_fit(asymmetric_parameter_errors=True)`. | ||||||||||
| Otherwise they will be :py:obj:`None`. | ||||||||||
|
|
||||||||||
| Histogram Fits | ||||||||||
| --------------- | ||||||||||
|
|
||||||||||
| In physics experiments, data is frequently histogrammed in order to reduce it to a manageable number of bins. | ||||||||||
| *kafe2* provides a dedicated fitting class for histogram fits, intended to be used when datapoints are obtained | ||||||||||
| from a random distribution. Especially when large numbers of datapoints are present, it is more efficient to treat | ||||||||||
|
Collaborator
Suggested change
|
||||||||||
| the data as a histogram. To perform a histogram fit, the datapoints have to be filled into a :py:obj:`HistContainer`. | ||||||||||
|
Collaborator
Suggested change
|
||||||||||
| Then the procedure is similar to the above. By default, the :py:obj:`HistFit` class will use a normal distribution as model | ||||||||||
| function and a Poisson likelihood as cost function. Both can be changed using the `model_function` and | ||||||||||
| `cost_function` keywords. | ||||||||||
|
|
||||||||||
| .. code-block:: python | ||||||||||
|
|
||||||||||
| from kafe2 import Fit, HistContainer | ||||||||||
|
|
||||||||||
| histogram = HistContainer(n_bins=10, bin_range=(-5, 5), | ||||||||||
| fill_data=[-7.5, 1.23, 5.74, 1.9, -0.2, 3.1, -2.75, ...]) | ||||||||||
| hist_fit = Fit(histogram) | ||||||||||
| hist_fit.do_fit() | ||||||||||
|
|
||||||||||
| By default it is assumed that the model function for a `HistFit` object is a probability density function normalized to 1. | ||||||||||
| As a consequence the bin contents are also normalized to 1. | ||||||||||
|
Collaborator
Suggested change
I don't particularly care which spelling convention we use but we should be consistent. |
||||||||||
| To disable this behavior, set `density=False` in the `HistFit` constructor. | ||||||||||
|
|
||||||||||
| .. _plotting: | ||||||||||
|
|
||||||||||
|
|
||||||||||
| Original file line number | Diff line number | Diff line change | ||||||
|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,162 @@ | ||||||||
| #!/usr/bin/env python | ||||||||
|
|
||||||||
| """ | ||||||||
| kafe2 example: Histogram Fit (Pitfalls) | ||||||||
| ======================================= | ||||||||
|
|
||||||||
| This example demonstrates a scenario in which it is more convenient to use the | ||||||||
| HistFit class rather than the XYFit class. | ||||||||
|
|
||||||||
| While it is in principle possible to perform such fits correctly using XYFit, | ||||||||
| more care must be taken to avoid unusable or biased results. | ||||||||
|
Collaborator
Suggested change
|
||||||||
| This example shows common problems that can occur when using XYFit and how to fix them. | ||||||||
| HistFit handles these problems automatically. | ||||||||
|
|
||||||||
| We will try to build the kafe2 HistFit by hand using an XYFit. In general, histogram fits are useful | ||||||||
|
Collaborator
Suggested change
|
||||||||
| when large amounts of data are obtained, since the binning of a histogram reduces the | ||||||||
| amount of computation. | ||||||||
|
Collaborator
Suggested change
|
||||||||
|
|
||||||||
| The default XYFit has some problems that have to be addressed when histogrammed data is | ||||||||
| processed. First of all, when using an XYFit, the data is passed to the Fit object in an | ||||||||
| XYContainer. This container does not automatically fill the datapoints into bins, so this first step | ||||||||
| has to be done manually. In the following, 30 datapoints, sampled from a normal distribution, are used. | ||||||||
|
Collaborator
I think we should just start with something like "The XYFit does not have any built-in functionality for transforming raw data into a histogram. So if we want to use it we need to do this step manually." |
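To make the manual step concrete, here is a minimal sketch of what "filling datapoints into bins" amounts to. It uses only the standard library; the `histogram` helper and the sample values are made up for illustration and are not part of kafe2 or of this PR. The bin convention (half-open bins, last bin closed) is assumed to match `numpy.histogram`.

```python
import bisect


def histogram(data, bin_edges):
    """Count how many values fall into each bin.

    Bins are half-open [low, high), except the last one, which also
    includes its upper edge. Values outside the overall range are ignored.
    """
    counts = [0] * (len(bin_edges) - 1)
    for value in data:
        if value < bin_edges[0] or value > bin_edges[-1]:
            continue  # outside the histogram range
        if value == bin_edges[-1]:
            index = len(counts) - 1  # upper edge belongs to the last bin
        else:
            index = bisect.bisect_right(bin_edges, value) - 1
        counts[index] += 1
    return counts


bin_edges = [-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2]
# Made-up sample values, only for illustration
sample = [-1.9, -0.7, -0.3, -0.1, 0.2, 0.2, 0.4, 0.9, 1.2, 1.8, 2.0, 3.5]

counts = histogram(sample, bin_edges)
# Bin centers serve as x values when the histogram is passed to an XY-style fit
centers = [(lo + hi) / 2 for lo, hi in zip(bin_edges[:-1], bin_edges[1:])]
print(counts)
```

Note that the value 3.5 is silently dropped and one bin stays empty; both cases need attention when the counts are later used as y data.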
||||||||
| """ | ||||||||
| import numpy as np | ||||||||
| import matplotlib.pyplot as plt | ||||||||
| from kafe2 import XYContainer, Fit, HistContainer, Plot | ||||||||
|
|
||||||||
|
|
||||||||
| def normal_distribution(x, mu=0, sigma=1): | ||||||||
| return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / np.sqrt(2.0 * np.pi * sigma**2) | ||||||||
|
|
||||||||
|
|
||||||||
| # Fill the datapoints into bins (sampled from a normal distribution with mu = 0, sigma = 1) | ||||||||
| data = np.array( | ||||||||
| [ | ||||||||
| 0.15452608, | ||||||||
| 0.15814163, | ||||||||
| 1.84110142, | ||||||||
| -0.27350108, | ||||||||
| -1.89289518, | ||||||||
| 0.21379812, | ||||||||
| -0.27967822, | ||||||||
| 0.21451591, | ||||||||
| -0.07221073, | ||||||||
| -0.51282405, | ||||||||
| 0.39306818, | ||||||||
| 1.12542323, | ||||||||
| -0.71070046, | ||||||||
| 1.86784967, | ||||||||
| -0.17510427, | ||||||||
| 1.72786353, | ||||||||
| 0.95243581, | ||||||||
| -0.71476994, | ||||||||
| -0.0816407, | ||||||||
| -0.38590102, | ||||||||
| 0.92506023, | ||||||||
| 0.4333063, | ||||||||
| 0.86662424, | ||||||||
| -0.78075927, | ||||||||
| 1.36575012, | ||||||||
| -0.49828474, | ||||||||
| 0.14649399, | ||||||||
| 1.40194082, | ||||||||
| -1.50842127, | ||||||||
| 1.21646781, | ||||||||
| ] | ||||||||
| ) | ||||||||
|
|
||||||||
| binedges = np.array([-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2]) | ||||||||
|
Collaborator
Suggested change
I think it would make sense to put underscores between words for better legibility. |
||||||||
| bincounts, binedges = np.histogram(data, bins=binedges, density=True) | ||||||||
|
|
||||||||
| # Now naively perform a default XYFit | ||||||||
| x_data = bincenters = np.mean([binedges[:-1], binedges[1:]], axis=0) # use bincenters as x values | ||||||||
| y_data = bincounts # use normalized histogram as y data | ||||||||
| y_error = np.sqrt(bincounts / (len(data) * np.diff(binedges))) # Poisson errors sqrt(n) rescaled to the density (all data lie inside the bin range) | ||||||||
|
|
||||||||
| xy_data = XYContainer(x_data=x_data, y_data=y_data) | ||||||||
| xy_data.add_error(err_val=y_error, axis="y") | ||||||||
|
|
||||||||
| XYFit_1 = Fit(xy_data, normal_distribution) | ||||||||
| XYFit_1.do_fit() | ||||||||
| # create a plot | ||||||||
| Plot1 = Plot(XYFit_1) | ||||||||
| Plot1.plot() | ||||||||
| plt.show() | ||||||||
| """ | ||||||||
| When performing the fit, errors like "The cost function was evaluated as infinite" appear in the output. | ||||||||
| Furthermore, the plot of the fit result makes it clear that this fit did not return the result we expected. | ||||||||
| The starting values for the fit are returned as best-fit values, and the uncertainties are reported as NaN. | ||||||||
| What happened? The problem arises because Poisson uncertainties were assumed for the bin counts. | ||||||||
| If a bin is empty, the uncertainty is treated as zero by the fit. Thus the model function is forced to pass through this datapoint | ||||||||
| exactly or the cost function will be infinite. | ||||||||
| This occurs because the XYFit uses a χ² cost function by default, which is only valid | ||||||||
| for Gaussian uncertainties, but not in the case of Poisson uncertainties. To fix this, the | ||||||||
| cost function of the fit is changed to a Poisson negative log-likelihood (NLL). | ||||||||
|
Comment on lines
+89
to
+97
Collaborator
I think it would make sense to print a warning when using a cost function that needs errors and some of the errors are also == 0. |
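A rough sketch of the failure mode and of the kind of check such a warning could perform. Both `chi2` and `warn_on_zero_errors` are hypothetical helpers written for this comment, not existing kafe2 functions, and the numbers are invented.

```python
import math
import warnings


def chi2(y_data, y_model, y_error):
    """Plain chi-square sum; a zero uncertainty makes the term diverge."""
    total = 0.0
    for y, m, err in zip(y_data, y_model, y_error):
        residual = y - m
        if err == 0.0:
            if residual != 0.0:
                return math.inf  # finite residual divided by zero error
            continue  # model passes exactly through the point
        total += (residual / err) ** 2
    return total


def warn_on_zero_errors(y_error):
    """Hypothetical check: warn if any data point has zero uncertainty."""
    zero_indices = [i for i, err in enumerate(y_error) if err == 0.0]
    if zero_indices:
        warnings.warn(
            f"zero uncertainty for data points {zero_indices}; a chi-square "
            "cost is infinite unless the model matches them exactly"
        )
    return zero_indices


counts = [2, 0, 4, 7]                     # invented bin counts, one empty bin
errors = [math.sqrt(n) for n in counts]   # naive Poisson errors: sqrt(0) == 0
model = [1.5, 0.8, 4.2, 6.5]              # invented model prediction

print(warn_on_zero_errors(errors))
print(chi2(counts, model, errors))
```

Where exactly such a check would live in kafe2 (container, fit constructor, or cost function) is left open here.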
||||||||
| """ | ||||||||
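For readers unfamiliar with the Poisson NLL mentioned at the end of the docstring above, the following standalone sketch shows the quantity being minimized. It is an illustration under the assumption of independent bins with strictly positive expectations, not kafe2's implementation; `poisson_nll` and all numbers are invented.

```python
import math


def poisson_nll(counts, expected):
    """Negative log-likelihood of independent Poisson-distributed bin counts.

    Each expected value must be strictly positive.
    """
    nll = 0.0
    for n, mu in zip(counts, expected):
        # -log P(n | mu) = mu - n * log(mu) + log(n!)
        nll += mu - n * math.log(mu) + math.lgamma(n + 1)
    return nll


counts = [2, 0, 4, 7]          # invented bin counts, one empty bin
good = [1.5, 0.8, 4.2, 6.5]    # model prediction close to the data
bad = [4.0, 3.0, 1.0, 2.0]     # model prediction far from the data

print(poisson_nll(counts, good))
print(poisson_nll(counts, bad))
```

The empty bin contributes a finite term (just its expected value), so no division by a zero uncertainty occurs, and the prediction closer to the data gives the smaller cost.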
|
|
||||||||
|
|
||||||||
| # rescale normal distribution by the total number of entries and the bin width (0.5) | ||||||||
| def normal_distribution_scaled(x, mu=0, sigma=1): | ||||||||
| return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / np.sqrt(2.0 * np.pi * sigma**2) * np.sum(bincounts) * 0.5 | ||||||||
|
|
||||||||
|
|
||||||||
| bincounts, binedges = np.histogram(data, bins=binedges) | ||||||||
|
|
||||||||
| x_data = bincenters = np.mean([binedges[:-1], binedges[1:]], axis=0) # use bincenters as x values | ||||||||
| y_data = bincounts # use bincounts as y data (not normalized now) | ||||||||
|
|
||||||||
| xy_data = XYContainer(x_data=x_data, y_data=y_data) | ||||||||
|
|
||||||||
| XYFit_2 = Fit(xy_data, normal_distribution_scaled, cost_function="poisson") | ||||||||
| XYFit_2.do_fit() | ||||||||
| # create a plot | ||||||||
| Plot2 = Plot(XYFit_2) | ||||||||
| Plot2.plot() | ||||||||
| plt.show() | ||||||||
|
|
||||||||
| """ | ||||||||
| Now it wasn't even necessary to explicitly specify the y-errors. By using a Poisson NLL, | ||||||||
| the y-errors are no longer calculated from the measured bin counts, but instead from the model expectation. | ||||||||
| This handles empty bins correctly and also prevents getting biased uncertainties. | ||||||||
|
|
||||||||
| Another subtlety is how the model is compared to the y_data: so far, the model function was simply evaluated at the midpoint of each bin. | ||||||||
| This is only a linear approximation of the behaviour of the model function between the bin edges. The HistFit class of kafe2, | ||||||||
| on the other hand, uses Simpson's rule, a method that approximates the behaviour quadratically for more accuracy. | ||||||||
|
Collaborator
Suggested change
|
||||||||
| The most accurate albeit computationally expensive method would be to integrate the model function over each bin. | ||||||||
|
|
||||||||
| The implementation of Simpson's rule in our procedure using an XYFit will not be done here, | ||||||||
|
Collaborator
Suggested change
|
||||||||
| since the influence is rather small in this case. Instead, it is shown how much easier it is to just use the HistFit | ||||||||
| class of kafe2. The binning is done automatically by the HistContainer, the Poisson NLL is the default cost function, | ||||||||
| and Simpson's rule is already implemented. | ||||||||
| """ | ||||||||
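To give a feeling for the size of the effect described in the docstring above, the sketch below compares the midpoint approximation, Simpson's rule, and the exact integral of a standard normal density over one bin. It assumes a single Simpson interval per bin, which may differ from what kafe2 does internally; all function names are illustrative.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / math.sqrt(2.0 * math.pi * sigma**2)


def normal_cdf(x, mu=0.0, sigma=1.0):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def midpoint(f, a, b):
    """Bin content approximated by the function value at the bin center."""
    return (b - a) * f(0.5 * (a + b))


def simpson(f, a, b):
    """Bin content approximated by Simpson's rule (one interval)."""
    return (b - a) / 6.0 * (f(a) + 4.0 * f(0.5 * (a + b)) + f(b))


a, b = 1.0, 1.5  # one of the bins used in this example
exact = normal_cdf(b) - normal_cdf(a)
err_mid = abs(midpoint(normal_pdf, a, b) - exact)
err_simpson = abs(simpson(normal_pdf, a, b) - exact)

print(f"exact:    {exact:.6f}")
print(f"midpoint: error {err_mid:.2e}")
print(f"Simpson:  error {err_simpson:.2e}")
```

For this bin both errors are small compared to the statistical uncertainty of 30 datapoints, which is consistent with the statement that the influence is rather small here.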
| # We will use our initial data and binning | ||||||||
| hist_data = HistContainer(bin_edges=binedges, fill_data=data) | ||||||||
|
|
||||||||
|
|
||||||||
| # This is everything we have to prepare before performing the fit | ||||||||
| Histfit = Fit(hist_data, model_function=normal_distribution, density=True) | ||||||||
| Histfit.do_fit() | ||||||||
|
|
||||||||
| Plot5 = Plot(Histfit) | ||||||||
| Plot5.plot() | ||||||||
| plt.show() | ||||||||
|
|
||||||||
| """ | ||||||||
| Now consider the case where systematic uncertainties are present in addition to the statistical ones. Suppose each bin has an additional | ||||||||
| Gaussian uncertainty of 1. Now there are two different types of uncertainties in the fit that can't simply be added. | ||||||||
| However, due to the central limit theorem, for sufficiently large event counts the Poisson distribution approaches | ||||||||
| a normal distribution. Therefore, by switching the cost function from a Poisson NLL to the Gaussian approximation, | ||||||||
| the fit can be performed correctly again. | ||||||||
| """ | ||||||||
| # We will use our initial data and binning | ||||||||
| hist_data = HistContainer(bin_edges=binedges, fill_data=data) | ||||||||
| hist_data.add_error(err_val=1) | ||||||||
|
|
||||||||
| # This is everything we have to prepare before performing the fit | ||||||||
| Histfit = Fit(hist_data, model_function=normal_distribution, density=True, cost_function="gauss-approximation") | ||||||||
| Histfit.do_fit() | ||||||||
|
|
||||||||
| Plot5 = Plot(Histfit) | ||||||||
| Plot5.plot() | ||||||||
| plt.show() | ||||||||
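The Gaussian approximation used in the last fit can be illustrated without kafe2. The sketch below compares a Poisson distribution with a normal distribution of the same mean and variance for a small and a large expected count, and shows how an independent Gaussian uncertainty would be combined with the statistical one in that approximation. The helper names are invented, and adding the variances in quadrature is an assumption about independent uncertainties, not a description of kafe2 internals.

```python
import math


def poisson_pmf(n, mu):
    # Computed via logarithms to stay numerically stable for large n
    return math.exp(n * math.log(mu) - mu - math.lgamma(n + 1))


def gauss_pdf(x, mean, variance):
    return math.exp(-0.5 * (x - mean) ** 2 / variance) / math.sqrt(2.0 * math.pi * variance)


def approximation_error(mu):
    """Sum of |Poisson - Gaussian| over all counts with non-negligible probability."""
    n_max = int(mu + 10.0 * math.sqrt(mu)) + 10
    return sum(abs(poisson_pmf(n, mu) - gauss_pdf(n, mu, mu)) for n in range(n_max + 1))


def combined_sigma(mu, sigma_sys):
    """Statistical variance mu plus an independent Gaussian uncertainty, in quadrature."""
    return math.sqrt(mu + sigma_sys**2)


error_small = approximation_error(3.0)    # few expected events per bin
error_large = approximation_error(100.0)  # many expected events per bin

print(f"mu = 3:   {error_small:.3f}")
print(f"mu = 100: {error_large:.3f}")
print(f"combined sigma for mu = 100, sigma_sys = 1: {combined_sigma(100.0, 1.0):.3f}")
```

With only 30 datapoints spread over 8 bins, the expected counts per bin in this example are small, so the approximation should be read as a demonstration of the interface rather than as a recommendation for such sparse data.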