
Conversation

@aranas (Member) commented Jun 12, 2018

Hi all, this is what I have so far for the preprocessing-within-cross-validation issue (right now we preprocess before splitting, which biases our performance estimates within CV).
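
For context, the pattern I'm aiming for is roughly the following (a minimal sketch using scikit-learn's KFold; fit_preprocessor and apply_preprocessor are illustrative placeholders, not our actual functions):

    import numpy as np
    from sklearn.model_selection import KFold

    def fit_preprocessor(X):
        # Placeholder: learn scaling parameters from the training fold only
        return {"mean": X.mean(axis=0), "std": X.std(axis=0)}

    def apply_preprocessor(X, params):
        # Apply the training-fold parameters without refitting
        return (X - params["mean"]) / params["std"]

    X = np.random.rand(100, 5)
    for train_idx, val_idx in KFold(n_splits=5).split(X):
        params = fit_preprocessor(X[train_idx])            # fit inside the fold
        X_train = apply_preprocessor(X[train_idx], params)
        X_val = apply_preprocessor(X[val_idx], params)     # no leakage from the held-out fold
        # ...train and score the model on X_train / X_val here...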

The code runs, but I think something is wrong with it, as the score on the validation data is much higher than on the test data.

I would be grateful for any help.

@johannadevos (Member) commented Jun 12, 2018

I'm encountering some errors:

  File "U:/GitHub/fraud_detection/trainer/lightgbm_main.py", line 34, in <module>
    import trainer.plotting_functions as myplot

ImportError: No module named plotting_functions

I could (temporarily) resolve this one by commenting out that import. I don't have a file called plotting_functions.py in my trainer directory.

  File "U:/GitHub/fraud_detection/trainer/lightgbm_main.py", line 166, in main
    test_df = pp.load_test_raw(args.test_file)

TypeError: load_test_raw() takes exactly 2 arguments (1 given)

As some extra info (I don't think it matters), I ran the code with the following arguments: --train-file trainer/data/train_sample.csv --valid-file trainer/data/valid_sample.csv --test-file trainer/data/test_sample.csv --job-dir trainer/results --run optimization

@johannadevos (Member)

Are you aware of these warnings? They occur for multiple variables.

[2018-06-12 14:52:54,851] [INFO] Modifying variables
trainer\preprocessing.py:64: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

According to the documentation, fixing this issue can make the code run significantly faster, which might even make it runnable on computers other than just the DCCN cluster. It seems relatively simple to fix.
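
For reference, a generic example of the chained-indexing pattern pandas is warning about, and the .loc form it recommends (not our actual preprocessing code):

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

    # Chained indexing: the assignment may land on a temporary copy,
    # which triggers SettingWithCopyWarning and can silently have no effect
    sub = df[df["a"] > 1]
    sub["b"] = 0

    # Recommended: a single .loc call on the original frame
    df.loc[df["a"] > 1, "b"] = 0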

@johannadevos (Member) commented Jun 12, 2018

Conceptually, the code seems alright to me (although I didn't read it very thoroughly), but I'm running into multiple errors for different combinations of arguments.

For example, when I run --train-file ./trainer/data/train_sample.csv --valid-file ./trainer/data/valid_sample.csv --job-dir ./trainer/results --run optimization I'm getting an AttributeError:

  File "C:/Users/johan/Documents/GitHub/fraud_detection/trainer/lightgbm_main.py", line 230, in main
    if args.test_df is not None:

AttributeError: 'Namespace' object has no attribute 'test_df'
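
If I had to guess (assuming the argument is declared as --test-file): argparse only creates attributes for declared arguments, and --test-file is stored as args.test_file, so args.test_df never exists. A minimal reproduction:

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--test-file")
    args = parser.parse_args(["--test-file", "test.csv"])

    print(args.test_file)                  # works: dashes become underscores
    print(getattr(args, "test_df", None))  # safe lookup; plain args.test_df raises AttributeError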

Btw, in order to get to this point I had hard-coded the train and validation datasets to just 3000 / 300 lines, because otherwise it took a very long time.

I think it would be a good idea if you ran your code with all combinations of command line parameters (e.g., using both optimization and submission, and including/excluding the valid and test data). It is important that the code runs through before we merge into master.

@johannadevos (Member)

I also don't really understand this line:

_, test_df, _ = pp.preprocess_confidence(pp.preprocess_common(train_df), pp.preprocess_common(test_df))

You're not providing the train data here, even though preprocess_confidence does use the train data.

@aranas (Member, Author) commented Jun 12, 2018

I pushed a fix for the warnings related to the indexing.

@aranas (Member, Author) commented Jun 12, 2018

Concerning the error:
AttributeError: 'Namespace' object has no attribute 'test_df'

This actually arises in a part of main that I have not touched, and it currently occurs on master as well. I think we have a duplicated block of code (the blocks starting on lines 207 and 222 on master), so we have to fix it anyway. I can fix it in both master and this branch to avoid confusion.

Concerning testing all combinations of command line parameters: generally, the test file should only matter when submission is selected anyway. Further, I don't think it makes sense for the code to run without a validation file, as we always want to use early stopping to prevent overfitting!
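
To illustrate why (a minimal sketch with made-up data, assuming the lgb.train keyword API of the LightGBM version we're on; newer releases move this into callbacks): without a validation set there is nothing for early stopping to monitor, so the booster always runs all iterations.

    import numpy as np
    import lightgbm as lgb

    rng = np.random.RandomState(0)
    X_train, y_train = rng.rand(500, 5), rng.randint(0, 2, 500)
    X_valid, y_valid = rng.rand(100, 5), rng.randint(0, 2, 100)

    train_set = lgb.Dataset(X_train, label=y_train)
    valid_set = lgb.Dataset(X_valid, label=y_valid, reference=train_set)

    booster = lgb.train(
        {"objective": "binary", "metric": "auc"},
        train_set,
        num_boost_round=1000,
        valid_sets=[valid_set],
        early_stopping_rounds=50,  # stop once validation AUC stops improving
    )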

@aranas (Member, Author) commented Jun 12, 2018

@johannadevos Thank you for your feedback, but those were all minor code issues. What I am really wondering about is the actual output: why does the preprocess_within_cv branch score so high on the validation data, while master never exceeds 0.95? I don't think I made major changes, but somehow the output is very different.
