
Conversation

@aranas (Member) commented Jun 12, 2018

Hi all, this is what I have so far for the preprocessing-within-cross-validation issue (right now we preprocess before splitting, which biases our performance estimates within CV).
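
For context, the pattern I'm aiming for is roughly the following (a minimal sketch using scikit-learn's KFold; fit_preprocessor and apply_preprocessor are illustrative placeholders, not our actual functions):

    import numpy as np
    from sklearn.model_selection import KFold

    def fit_preprocessor(X):
        # Placeholder: learn scaling parameters from the training fold only
        return {"mean": X.mean(axis=0), "std": X.std(axis=0)}

    def apply_preprocessor(X, params):
        # Apply the training-fold parameters without refitting
        return (X - params["mean"]) / params["std"]

    X = np.random.rand(100, 5)
    for train_idx, val_idx in KFold(n_splits=5).split(X):
        params = fit_preprocessor(X[train_idx])            # fit inside the fold
        X_train = apply_preprocessor(X[train_idx], params)
        X_val = apply_preprocessor(X[val_idx], params)     # no leakage from the held-out fold
        # ...train and score the model on X_train / X_val here...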

The code runs, but I think something is wrong with it, as the score on the validation data is much higher than on the test data.

I would be grateful for any help.

@johannadevos (Member) commented Jun 12, 2018

I'm encountering some errors:

  File "U:/GitHub/fraud_detection/trainer/lightgbm_main.py", line 34, in <module>
    import trainer.plotting_functions as myplot

ImportError: No module named plotting_functions

I could (temporarily) resolve this one by commenting out that import. I don't have a file called plotting_functions.py in my trainer directory.

  File "U:/GitHub/fraud_detection/trainer/lightgbm_main.py", line 166, in main
    test_df = pp.load_test_raw(args.test_file)

TypeError: load_test_raw() takes exactly 2 arguments (1 given)

As some extra info (I don't think it matters), I ran the code with the following arguments: --train-file trainer/data/train_sample.csv --valid-file trainer/data/valid_sample.csv --test-file trainer/data/test_sample.csv --job-dir trainer/results --run optimization

@johannadevos (Member)

Are you aware of these warnings? They occur for multiple variables.

[2018-06-12 14:52:54,851] [INFO] Modifying variables
trainer\preprocessing.py:64: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

According to the documentation, fixing this issue can make the code run significantly faster, which might even make it runnable on computers other than just the DCCN cluster. It seems relatively simple to fix.
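
For reference, a generic example of the chained-indexing pattern pandas is warning about, and the .loc form it recommends (not our actual preprocessing code):

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

    # Chained indexing: the assignment may land on a temporary copy,
    # which triggers SettingWithCopyWarning and can silently have no effect
    sub = df[df["a"] > 1]
    sub["b"] = 0

    # Recommended: a single .loc call on the original frame
    df.loc[df["a"] > 1, "b"] = 0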

@johannadevos (Member) commented Jun 12, 2018

Conceptually, the code seems alright to me (although I didn't read it very thoroughly), but I'm running into multiple errors for different combinations of arguments.

For example, when I run --train-file ./trainer/data/train_sample.csv --valid-file ./trainer/data/valid_sample.csv --job-dir ./trainer/results --run optimization I'm getting an AttributeError:

  File "C:/Users/johan/Documents/GitHub/fraud_detection/trainer/lightgbm_main.py", line 230, in main
    if args.test_df is not None:

AttributeError: 'Namespace' object has no attribute 'test_df'
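
If I had to guess (assuming the argument is declared as --test-file): argparse only creates attributes for declared arguments, and --test-file is stored as args.test_file, so args.test_df never exists. A minimal reproduction:

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--test-file")
    args = parser.parse_args(["--test-file", "test.csv"])

    print(args.test_file)                  # works: dashes become underscores
    print(getattr(args, "test_df", None))  # safe lookup; plain args.test_df raises AttributeError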

Btw, in order to get to this point I had hard-coded the train and validation datasets to just 3000 / 300 lines, because otherwise it took a very long time.

I think it would be a good idea if you ran your code with all combinations of command line parameters (e.g., using both optimization and submission, and including/excluding the valid and test data). It is important that the code runs through before we merge into master.

@johannadevos (Member)

I also don't really understand this line:

_, test_df, _ = pp.preprocess_confidence(pp.preprocess_common(train_df), pp.preprocess_common(test_df))

You're not providing the train data here, even though preprocess_confidence does use the train data.

@aranas (Member, Author) commented Jun 12, 2018

I pushed a fix for the warnings related to the indexing.

@aranas (Member, Author) commented Jun 12, 2018

Concerning the error:
AttributeError: 'Namespace' object has no attribute 'test_df'

This actually arises in a part of main that I have not touched, and it currently occurs on master as well. I think we have a duplicated block of code (the blocks starting on lines 207 and 222 on master), so we have to fix it anyway. I can fix it in both master and this branch to avoid confusion.

Concerning testing all combinations of command line parameters: generally, the test file should only matter when submission is selected anyway. Further, I don't think it makes sense for the code to run without a validation file, as we always want to use early stopping to prevent overfitting!
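
To illustrate why (a minimal sketch with made-up data, assuming the lgb.train keyword API of the LightGBM version we're on; newer releases move this into callbacks): without a validation set there is nothing for early stopping to monitor, so the booster always runs all iterations.

    import numpy as np
    import lightgbm as lgb

    rng = np.random.RandomState(0)
    X_train, y_train = rng.rand(500, 5), rng.randint(0, 2, 500)
    X_valid, y_valid = rng.rand(100, 5), rng.randint(0, 2, 100)

    train_set = lgb.Dataset(X_train, label=y_train)
    valid_set = lgb.Dataset(X_valid, label=y_valid, reference=train_set)

    booster = lgb.train(
        {"objective": "binary", "metric": "auc"},
        train_set,
        num_boost_round=1000,
        valid_sets=[valid_set],
        early_stopping_rounds=50,  # stop once validation AUC stops improving
    )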

@aranas (Member, Author) commented Jun 12, 2018

@johannadevos Thank you for your feedback, but those were all minor code issues. What I am really wondering about is the actual output: why does the preprocess_within_cv branch score so high on the validation data, while master never exceeds 0.95? I don't think I made major changes, but somehow the output is very different.
