-
Notifications
You must be signed in to change notification settings - Fork 1
Description
(wei- please review the following argument)
summary
in the following sim, we find that we need 50 to 150 samples when the number of bins for each numeric attribute is low (e.g. bins=2) and the number of dimensions are low (e.g. dims=3)
comments:
- is it reasonable to use low bins (e.g. bins=2). Wellyes, says audris in his "universal predictor" paper.
- is it reasonable to use low dims (e.g. dims=3). well, yes. wei reports that we often FSS down to dims=3
details
So i assumed that the data had "dims" dimensions, each with a percentile chopped into "bins"
then i said the prodbability of bugs was10% <= p <= 40% and 3 <= bins <= 7 and 2 <= dims <= 7
then i distributed the probability of the bugs around the bins**dims of the hypercube of the independent variables (bugs created randomly using r()**2 so that the mininorty of the bins had the majority of the bigs
then i weighted each part of the hypercube by p/(dims*bins)
then i ran it all 1000 times , increasing the number of samples until we had a 2/3rds chance of finding the bugs (call that the "n66" number) with a sanity condition of n <= 1000
code
from __future__ import division
import random
r = random.random
any = random.choice
def ps(dims,bins,p=0.1,skew=2):
space = bins**dims
f1 = [r()**skew for _ in xrange(space)]
all = sum(f1)
f2 = [x/all for x in f1]
return [p*1/space*x for x in f2]
def ns(dims=3,bins=2,p=0.1):
w = sum(ps(dims,bins,p))
n = 10
found = 0
while found < 0.66 and n < 1000:
n += 10
found = 1 - ((1 - w) ** n)
print "_%s, _%s, _%s, %s" % (int(p*10),dims,bins,n)
print "p,dims,bins,n66"
for _ in xrange(1000):
ns(dims= any([3,4,5,6,7]),
bins= any([2,3,4,5,6,7]),
p = any([0.1,0.2,0.3,0.4]))analysis
then i ran a decision tree learner and got at tree with a correlation of 92% in a 10-way. note that, in the following tree:
- p does not matter. the effects are driven by bins and dims
- we get low number of samples to reach n66 (50 to 200) when
- bins=2 and dims=3,4
- and bins=3 and dims=3
- bins=2 and dims=3,4
- otherwise, we large hundreds of samples
results
bins = _2
| dims = _3 : 50 (25/672) [11/436.36]
| dims = _4 : 107.6 (20/3028.75) [5/3456.25]
| dims = _5 : 203.26 (31/11974.82) [12/10088.09]
| dims = _6 : 336.43 (17/31955.71) [11/31618.62]
| dims = _7 : 613.33 (23/64163.71) [7/87900.89]
bins = _3
| dims = _3 : 159.14 (21/7091.16) [14/8226.53]
| dims = _4 : 455.48 (22/58697.52) [9/107415.61]
| dims = _5 : 905.79 (24/17515.97) [14/14086.41]
| dims = _6 : 1000 (18/0) [18/0]
| dims = _7 : 1000 (17/0) [7/0]
bins = _4
| dims = _3 : 370 (19/27314.13) [8/51881.93]
| dims = _4 : 891.94 (22/16804.96) [9/19220.29]
| dims = _5 : 1000 (33/0) [11/0]
| dims = _6 : 1000 (16/0) [18/0]
| dims = _7 : 1000 (30/0) [18/0]
bins = _5
| dims = _3 : 755.31 (20/61678.75) [12/72781.25]
| dims = _4 : 1000 (23/0) [17/0]
| dims = _5 : 1000 (22/0) [12/0]
| dims = _6 : 1000 (18/0) [12/0]
| dims = _7 : 1000 (20/0) [14/0]
bins = _6
| dims = _3 : 860 (18/28511.11) [9/28511.11]
| dims = _4 : 1000 (18/0) [5/0]
| dims = _5 : 1000 (19/0) [5/0]
| dims = _6 : 1000 (29/0) [15/0]
| dims = _7 : 1000 (25/0) [9/0]
bins = _7 : 998.33 (116/41.88) [52/278.18]t