README.md
Lines changed: 9 additions & 12 deletions
@@ -28,7 +28,7 @@ You can also download the source code here and install using the included PKGBUI
28
28
29
29
To install python-associations, cd into the python-associations directory and run this command:
30
30
```
31
-
makepkg -si
31
+
$ makepkg -si
32
32
```
33
33
34
34
# Overview
@@ -69,39 +69,36 @@ Knowing that white males are injured on Tuesday more frequently than black males
69
69
70
70
As another example, if we want to find the association between amputations and fatalities (diagnosis and disposition), we need to take the same approach. While the likelihood that an amputation is fatal is valuable information, we are more interested in the relative fatality of different diagnoses. Amputations may have a very low likelihood of fatality, but we must compare it to the likelihood that any other diagnosis leads to fatality before we discover whether amputations are relatively likely to be fatal. Therefore, we must take into consideration the extreme infrequency of fatalities in general to get a standardized numerical representation of how associated each field is.
71
71
72
-
There are two approaches to resolve our dilemma that are mathematically equivalent. One approach is to divide the number of fatal amputations by the number of amputations with any disposition, which yields the likelihood that an amputation is fatal. Then we divide that by the likelihood that any diagnosis is fatal (total fatalities / total of everything) and that yields the association ratio between amputation and fatality.
72
+
Two mathematically equivalent approaches resolve this dilemma. One approach is to divide the number of fatal amputations by the number of amputations with any disposition, which yields the likelihood that an amputation is fatal. We then normalize this likelihood against the likelihood that any diagnosis may be fatal (total fatalities / total of everything). Dividing the first likelihood by the second yields the association ratio between amputation and fatality.
73
73
74
74
Identical results would be reached by first dividing fatal amputations by all fatalities (likelihood that a fatality is caused by amputation) and then dividing that by the average likelihood that an amputation is the cause of any disposition (total amputations / total of everything). This results in the exact same association ratio as the first approach.
75
75
76
76
Both approaches are the same algorithm run in opposite directions. They are also mathematically equivalent since they both result in the same calculation:
77
77
```
78
78
association between amputations and fatalities = ((fatal amputations) * (total of everything)) / ((fatalities) * (amputations))
79
79
```
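
As a quick check of the equivalence described above, here is a minimal sketch in Python. The counts are invented purely for illustration, and `association_ratio` is a hypothetical helper name, not part of python-associations.

```python
from fractions import Fraction


def association_ratio(joint, total_a, total_b, total):
    """General formula: (joint * total) / (total_a * total_b).

    Hypothetical helper for illustration; not the library's API.
    """
    return Fraction(joint * total, total_a * total_b)


# Hypothetical counts, invented for this example only.
fatal_amputations = 3   # records that are both amputation and fatal
amputations = 200       # amputations with any disposition
fatalities = 50         # fatalities with any diagnosis
total = 10_000          # total of everything

# Approach 1: likelihood an amputation is fatal / likelihood anything is fatal.
approach_1 = Fraction(fatal_amputations, amputations) / Fraction(fatalities, total)

# Approach 2: likelihood a fatality is an amputation / likelihood anything is an amputation.
approach_2 = Fraction(fatal_amputations, fatalities) / Fraction(amputations, total)

# Both directions collapse to the same general formula.
general = association_ratio(fatal_amputations, amputations, fatalities, total)
assert approach_1 == approach_2 == general

print(float(general))  # 3.0
```

With these made-up numbers, amputations would be three times as likely to be fatal as an average diagnosis. Exact fractions are used so the equality check is not affected by floating-point rounding.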
80
-
Since we must also be able to examine the associations within complex subgroups/subpopulations, we approach the problem by looking at a field name combination (diag, disposition, sex, weekday) and finding the associations for any two values for any two of the fields. We start out with a general combination that we know exists (amputation, fatality, male, Tuesday) and work with every association pair and subgroup within that combination. We could do it in reverse, by looking first at every (diag, disposition) association within every (sex, weekday) population, and that would allow us to use the general formula with minimal histogram reshaping, but it would force us to test many orders of magnitude more combinations than actually exist in the data set.
80
+
Originally, for efficiency, I used a specialized version of the two-step algorithm described above (calculate likelihoods, then divide) so that totals and subtotals were naturally cached and reused across multiple situations. Unfortunately, this made the algorithm very complex and confusing.
81
+
82
+
To keep things simple, I have written a new algorithm optimized to use the general formula as efficiently as possible. It has turned out to be more efficient than the original algorithm, and it is significantly less complex, more maintainable, and easier to understand and use, so it is favored.
81
83
82
-
For efficiency, we are forced to limit ourselves to actual combinations that do exist and cycling through different associations within that combination. Likewise, if we wanted to use the general formula, we would have to either reshape our histogram four times for every association ratio or keep several histograms cached at all times (as many as several dozen). Instead, by using an adapted version of the two step algorithm described above, we can calculate broad ratios for many types of subgroups at once and reuse them several times. This increases code complexity, but brings execution time down from minutes or hours to seconds or minutes.
84
+
I still see potential to optimize a few places in the algorithm further, but doing so would require a lot of benchmarking and would probably not yield a large improvement, so it is not my top priority.
83
85
84
-
I have a rough idea of a way to optimize the efficiency of an algorithm using the general formula to reduce code complexity without hurting overall efficiency, but the inclusion of subgroups would make it a time intensive process with a lot of testing and benchmarking, and I simply have not yet found the time to work on this.
85
86
86
87
| Attribute | Description |
87
88
| ---------------- | ----------- |
88
89
|```notable```| Minimum association ratio (or inverse) to be included.|
89
90
|```significant```| Minimum number of occurrences (statistical significance).|
90
91
|```assoc```| Associations organized by association then subgroup.|
91
-
|```subgroups```| Associations organized by subgroup then association.|
92
-
|```memo```| Record examined situations to avoid redundancy.|
92
+
|```subpops```| Associations organized by subgroup then association.|
93
93
|```hist```|```Histogram()``` object to extract data from.|
94
-
|```relevant```| All instances of combination that exist.|
95
94
96
95
| Method | Description |
97
96
| --------------------- | ----------- |
98
-
|```overall_ratios()```| Find general likelihoods (eg. fatality for any diagnosis, not just amputation) |
99
-
|```test()```| Find association ratio given overall_ratios and histogram.|
100
97
|```add()```| Save association ratio.|
101
98
|```find()```| Find the association ratio for every field value combination among a specific field name combination.|
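
One possible reading of how the `notable` and `significant` thresholds might gate which ratios get saved is sketched below. The function name `is_reportable`, its default values, and the exact comparison rules are assumptions made for illustration; the actual logic in python-associations may differ.

```python
def is_reportable(ratio, occurrences, notable=2.0, significant=30):
    """Hypothetical filter, not the library's implementation.

    Assumes `notable` is a minimum ratio that also applies to the inverse
    (so a ratio of 0.5 counts as notable when notable=2.0), and that
    `significant` is a minimum number of occurrences.
    """
    if occurrences < significant:
        # Too few occurrences to be treated as meaningful.
        return False
    # Keep strongly positive or strongly negative associations.
    return ratio >= notable or ratio <= 1 / notable


print(is_reportable(3.0, 100))   # well above the threshold
print(is_reportable(0.25, 100))  # inverse (4.0) is above the threshold
print(is_reportable(1.2, 100))   # close to 1, so not notable
print(is_reportable(3.0, 5))     # notable ratio, but too few occurrences
```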
102
99
103
100
### Associations()
104
-
Attributes: self.pairs and self.subgroups contain all association ratios.
101
+
Attributes: self.pairs and self.subpops contain all association ratios.
105
102
```python
106
103
>>> self.pairs
107
104
{
@@ -113,7 +110,7 @@ Attributes: self.pairs and self.subgroups contain all association ratios.