Simulation 3- assuming hyperbins

(wei- please review the following argument)
# summary

in the following sim, we find that we need 50 to 150 samples when the number of bins for each numeric attribute is low (e.g. bins=2) and the number of dimensions are low (e.g. dims=3)

comments:  
- is it reasonable to use low bins (e.g. bins=2). Wellyes, says audris in his "universal predictor" paper.
- is it  reasonable to use low dims (e.g. dims=3). well, yes. wei reports that we often FSS down to dims=3
# details

So i assumed that the data had "dims" dimensions, each with a  percentile chopped into "bins" 

then i said the prodbability of bugs was10% <= p <= 40% and 3 <= bins <= 7 and 2 <= dims <= 7

then i distributed the probability of the bugs around the bins**dims of the hypercube of the independent variables (bugs created randomly using r()**2 so that the mininorty of the bins had the majority of the bigs

then i weighted each part of the hypercube by p/(dims*bins)  

then i ran it all 1000 times , increasing the number of samples until we had a 2/3rds chance of finding the bugs (call that the "n66" number) with a sanity condition of n <= 1000
## code

``` python
from __future__ import division

import random
r = random.random
any = random.choice

def ps(dims,bins,p=0.1,skew=2):
  space = bins**dims
  f1  = [r()**skew for _ in xrange(space)]
  all = sum(f1)
  f2  = [x/all for x in f1]
  return [p*1/space*x for x in f2]

def ns(dims=3,bins=2,p=0.1):
  w = sum(ps(dims,bins,p))
  n = 10
  found = 0
  while found < 0.66 and n < 1000:
    n += 10
    found = 1 - ((1 - w) ** n)
  print "_%s, _%s, _%s, %s" % (int(p*10),dims,bins,n)

print "p,dims,bins,n66"
for _ in xrange(1000):
  ns(dims= any([3,4,5,6,7]),
     bins= any([2,3,4,5,6,7]),
     p   = any([0.1,0.2,0.3,0.4]))
```
## analysis

then i ran a decision tree learner and got at tree with a correlation of 92% in a 10-way. note that, in the following tree:
- _p_ does not matter. the effects are driven by bins and dims
- we get low number of samples to reach n66 (50 to 200) when 
  - bins=2 and dims=3,4
    - and bins=3 and dims=3
- otherwise, we large hundreds of samples
## results

``` txt
bins =  _2
|   dims =  _3 : 50 (25/672) [11/436.36]
|   dims =  _4 : 107.6 (20/3028.75) [5/3456.25]
|   dims =  _5 : 203.26 (31/11974.82) [12/10088.09]
|   dims =  _6 : 336.43 (17/31955.71) [11/31618.62]
|   dims =  _7 : 613.33 (23/64163.71) [7/87900.89]
bins =  _3
|   dims =  _3 : 159.14 (21/7091.16) [14/8226.53]
|   dims =  _4 : 455.48 (22/58697.52) [9/107415.61]
|   dims =  _5 : 905.79 (24/17515.97) [14/14086.41]
|   dims =  _6 : 1000 (18/0) [18/0]
|   dims =  _7 : 1000 (17/0) [7/0]
bins =  _4
|   dims =  _3 : 370 (19/27314.13) [8/51881.93]
|   dims =  _4 : 891.94 (22/16804.96) [9/19220.29]
|   dims =  _5 : 1000 (33/0) [11/0]
|   dims =  _6 : 1000 (16/0) [18/0]
|   dims =  _7 : 1000 (30/0) [18/0]
bins =  _5
|   dims =  _3 : 755.31 (20/61678.75) [12/72781.25]
|   dims =  _4 : 1000 (23/0) [17/0]
|   dims =  _5 : 1000 (22/0) [12/0]
|   dims =  _6 : 1000 (18/0) [12/0]
|   dims =  _7 : 1000 (20/0) [14/0]
bins =  _6
|   dims =  _3 : 860 (18/28511.11) [9/28511.11]
|   dims =  _4 : 1000 (18/0) [5/0]
|   dims =  _5 : 1000 (19/0) [5/0]
|   dims =  _6 : 1000 (29/0) [15/0]
|   dims =  _7 : 1000 (25/0) [9/0]
bins =  _7 : 998.33 (116/41.88) [52/278.18]
```

t


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Simulation 3- assuming hyperbins #19

summary

details

code

analysis

results

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Simulation 3- assuming hyperbins #19

Description

summary

details

code

analysis

results

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions