
Commit 15ae68d

commit version 0.2.1, improvements to Associator()
1 parent f8ece02

7 files changed

Lines changed: 209 additions & 213 deletions


README.md

Lines changed: 9 additions & 12 deletions
@@ -28,7 +28,7 @@ You can also download the source code here and install using the included PKGBUI

 To install python-associations, cd into the python-associations directory and run this command:
 ```
-makepkg -si
+$ makepkg -si
 ```

 # Overview
@@ -69,39 +69,36 @@ Knowing that white males are injured on Tuesday more frequently than black males
As another example, if we want to find the association between amputations and fatalities (diagnosis and disposition), we need to take the same approach. While the likelihood that an amputation is fatal is valuable information, we are more interested in the relative fatality of different diagnoses. Amputations may have a very low likelihood of fatality, but we must compare it to the likelihood that any other diagnosis leads to fatality before we discover whether amputations are relatively likely to be fatal. Therefore, we must take into consideration the extreme infrequency of fatalities in general to get a standardized numerical representation of how associated each field is.

-There are two approaches to resolve our dilemma that are mathematically equivalent. One approach is to divide the number of fatal amputations by the number of amputations with any disposition, which yields the likelihood that an amputation is fatal. Then we divide that by the likelihood that any diagnosis is fatal (total fatalities / total of everything) and that yields the association ratio between amputation and fatality.
+There are two approaches to resolve our dilemma that are mathematically equivalent. One approach is to divide the number of fatal amputations by the number of amputations with any disposition, which yields the likelihood that an amputation is fatal. But we want to normalize this likelihood by scaling it according to the likelihood that anything may be fatal. To do so, we divide it by the overall likelihood of fatality (total fatalities / total of everything), which yields the association ratio between amputation and fatality.

 Identical results would be reached by first dividing fatal amputations by all fatalities (likelihood that a fatality is caused by amputation) and then dividing that by the average likelihood that an amputation is the cause of any disposition (total amputations / total of everything). This results in the exact same association ratio as the first approach.

 Both approaches are the same algorithm run in opposite directions. They are also mathematically equivalent since they both result in the same calculation:
 ```
 association between amputations and fatalities = (fatal amputations)*(total of everything) / ((fatalities)*(amputations))
 ```
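The equivalence of the two normalization orders described above can be sanity-checked with a short sketch. The counts below are made up for illustration and are not from any real data set:

```python
# Hypothetical counts (illustrative only, not real injury data)
fatal_amputations = 2      # records with diagnosis=amputation AND disposition=fatality
amputations = 1000         # all records with diagnosis=amputation
fatalities = 40            # all records with disposition=fatality
total = 100000             # total of everything

# Approach 1: likelihood an amputation is fatal, normalized by the
# overall likelihood that any record is fatal.
ratio1 = (fatal_amputations / amputations) / (fatalities / total)

# Approach 2: likelihood a fatality is an amputation, normalized by the
# overall likelihood that any record is an amputation.
ratio2 = (fatal_amputations / fatalities) / (amputations / total)

# Both reduce to the general formula from the README.
general = fatal_amputations * total / (fatalities * amputations)

assert abs(ratio1 - ratio2) < 1e-9
assert abs(ratio1 - general) < 1e-9
print(general)  # 5.0 -> amputations are 5x as likely as average to be fatal
```

Either direction yields the same association ratio, which is why the general formula can replace the two-step likelihood calculation.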
-Since we must also be able to examine the associations within complex subgroups/subpopulations, we approach the problem by looking at a field name combination (diag, disposition, sex, weekday) and finding the associations for any two values for any two of the fields. We start out with a general combination that we know exists (amputation, fatality, male, Tuesday) and work with every association pair and subgroup within that combination. We could do it in reverse, by looking first at every (diag, disposition) association within every (sex, weekday) population, and that would allow us to use the general formula with minimal histogram reshaping, but it would force us to test many orders of magnitude more combinations than actually exist in the data set.
+Originally, for efficiency, I used a specialized version of the aforementioned algorithm (calculate likelihoods, then divide) in order to naturally cache totals and subtotals for multiple situations. Unfortunately, this led to a very complex and confusing algorithm.
+
+To keep things simple, I have written a new algorithm optimized to use the general formula as efficiently as possible; it has actually turned out to be more efficient than the original. It is also significantly less complex, more maintainable, and easier to understand and use, so it is favored.

-For efficiency, we are forced to limit ourselves to actual combinations that do exist and cycling through different associations within that combination. Likewise, if we wanted to use the general formula, we would have to either reshape our histogram four times for every association ratio or keep several histograms cached at all times (as many as several dozen). Instead, by using an adapted version of the two step algorithm described above, we can calculate broad ratios for many types of subgroups at once and reuse them several times. This increases code complexity, but brings execution time down from minutes or hours to seconds or minutes.
+I still see some potential to optimize a few places in the algorithm to improve efficiency even further, but this would require a lot of benchmarking and will probably not be a huge improvement, so it is not my top priority.

-I have a rough idea of a way to optimize the efficiency of an algorithm using the general formula to reduce code complexity without hurting overall efficiency, but the inclusion of subgroups would make it a time intensive process with a lot of testing and benchmarking, and I simply have not yet found the time to work on this.

 | Attribute | Description |
 | ---------------- | ----------- |
 | ```notable``` | Minimum association ratio (or inverse) to be included.|
 | ```significant```| Minimum number of occurrences (statistical significance).|
 | ```assoc``` | Associations organized by association then subgroup.|
-| ```subgroups``` | Associations organized by subgroup then association.|
-| ```memo``` | Record examined situations to avoid redundancy.|
+| ```subpops``` | Associations organized by subgroup then association.|
 | ```hist``` | ```Histogram()``` object to extract data from.|
-| ```relevant``` | All instances of combination that exist.|

 | Method | Description |
 | --------------------- | ----------- |
-| ```overall_ratios()```| Find general likelihoods (eg. fatality for any diagnosis, not just amputation) |
-| ```test()``` | Find association ratio given overall_ratios and histogram.|
 | ```add()``` | Save association ratio.|
 | ```find()``` | Find the association ratio for every field value combination among a specific field name combination.|

 ### Associations()
-Attributes: self.pairs and self.subgroups contain all association ratios.
+Attributes: self.pairs and self.subpops contain all association ratios.
 ```python
 >>> self.pairs
 {
@@ -113,7 +110,7 @@ Attributes: self.pairs and self.subgroups contain all association ratios.
 }
 ```
 ```python
->>> self.subgroups
+>>> self.subpops
 {
     subgroup_type: {
         frozenset(subgroup/subpopulation): {

associations/analysis.py

Lines changed: 11 additions & 12 deletions
@@ -5,7 +5,6 @@
 import os
 import textwrap
 from .libassoc import invert, istr, iint, pretty
-from .associations import Associator

 class Analysis():
     """ Ideally this class would implement totally generic methods
@@ -44,7 +43,7 @@ def plot_dir(self, folder='plots'):
             os.makedirs(folder)
         return folder

-    def prep_hist(self, field, other_field, notable=1, subgroup=''):
+    def prep_hist(self, field, other_field, notable=1, subpop=''):
         """ Filter out irrelevant data for plotting """
         bins = self.hist.valdicts_dict[field]
         associations = self.assoc.report(field, other_field)
@@ -56,7 +55,7 @@ def prep_hist(self, field, other_field, notable=1, subgroup=''):
         # For non-time, remove excess bins without notable data
         for key in associations:
             try:
-                ratio = associations[key][frozenset(subgroup)]
+                ratio = associations[key][frozenset(subpop)]
             except KeyError:
                 continue
             # Only keep notable bins
@@ -74,7 +73,7 @@ def prep_hist(self, field, other_field, notable=1, subgroup=''):
         bins = self.bin_sort([s for s in bins if s in keep_of])
         return bins, associations

-    def make_hist(self, field, other_field, notable=1, subgroup=''):
+    def make_hist(self, field, other_field, notable=1, subpop=''):
         """ This method creates the data structure for a histogram plot
         given the names of the fields we want to compare. This is done
         manually because we are working with already existent bins.
@@ -89,7 +88,7 @@ def make_hist(self, field, other_field, notable=1, subgroup=''):
         keep_f, values, hists = set(), [], []
         skip, top = False, 0
         bins, associations = self.prep_hist(
-            field, other_field, notable, subgroup
+            field, other_field, notable, subpop
         )
         bindex = invert(bins)
         empty_things = np.zeros(len(bins))
@@ -116,7 +115,7 @@ def make_hist(self, field, other_field, notable=1, subgroup=''):
             actual = myfield
             # Gather association ratio for combination
             try:
-                ratio = associations[key][frozenset(subgroup)]
+                ratio = associations[key][frozenset(subpop)]
             except KeyError:
                 skip = True
             if skip == True:
@@ -173,13 +172,13 @@ def plot_hist(self, title, xlabel, ylabel, bins, ds_names,
         self.plot_counter += 1

     def nice_plot_assoc(self, one, two, title=False, xlabel=False,
-                        bins=False, notable=1.5, subgroup='', force=False):
+                        bins=False, notable=1.5, subpop='', force=False):
         """ Try plot with arbitrary limitation first, change if needed. """
         while notable > 1:
             # This means floats such as 0.9999999999999997 will
             # be excluded, but we don't want < 1.1 anyway.
             bad = self.plot_assoc(
-                one, two, title, xlabel, bins, notable, subgroup
+                one, two, title, xlabel, bins, notable, subpop
             )
             if bad == 'high':
                 notable += 0.1
@@ -192,11 +191,11 @@ def nice_plot_assoc(self, one, two, title=False, xlabel=False,
         else:
             if force:
                 self.plot_assoc(
-                    one, two, title, xlabel, bins, notable, subgroup, force=True
+                    one, two, title, xlabel, bins, notable, subpop, force=True
                 )

     def plot_assoc(self, one, two, title=False, xlabel=False,
-                   bins=False, notable=2, subgroup='', force=False):
+                   bins=False, notable=2, subpop='', force=False):
         """ Plot associations between values one and two. Extract a
         more complete data set from the histogram and make the plot.
         """
@@ -209,7 +208,7 @@ def plot_assoc(self, one, two, title=False, xlabel=False,
             bins = self.hist.valists_dict[one]
         ylabel = 'Association Ratio'
         bins, names, top, data = self.make_hist(
-            one, two, notable, subgroup=subgroup
+            one, two, notable, subpop=subpop
         )
         log = True if top > 10 else False
         if not force:
@@ -218,7 +217,7 @@ def plot_assoc(self, one, two, title=False, xlabel=False,
             if len(data) > 8:
                 return 'high'
         self.plot_hist(title, xlabel, ylabel, bins, names, *data, log=log)
-        cname = one +', '+ two + (' for ' + istr(subgroup) if subgroup else '')
+        cname = one +', '+ two + (' for ' + istr(subpop) if subpop else '')
         fig = plt.gcf()
         fig.set_size_inches(25, 15)
         fig.savefig(
