Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tree #3

Open
wants to merge 19 commits into
base: master
Choose a base branch
from
Open

Tree #3

wants to merge 19 commits into from

Conversation

manishamde
Copy link
Contributor

Decision Tree algorithm implemented on top of Spark RDD.

Key features:

  • Supports both classification and regressions
  • Supports gini, entropy and variance for information gain calculation
  • Supports calculating quantiles using a configurable fraction of the data
  • Performance accuracy verified by comparing with scikit-learn

@etrain
Copy link
Contributor

etrain commented Oct 11, 2013

This looks awesome, thanks so much for the contribution Manish!

The big question I have is whether you looked at using MLTable and its API for your input? Were there big hurdles preventing that from being an option? We'd like to build ML algorithms around that API, so if there are things we need to change to add this case, let us know! Decision trees are fairly different than algorithms that work by evaluating some linear loss function and optimizing via gradient descent, so this is a good test for something different that may not fit our existing model.

@manishamde
Copy link
Contributor Author

Thanks Evan. Looking forward to contributing more to the library.

Unfortunately, I haven't looked at MLTable since the code was written prior to the open sourcing of the MLI library. As I mentioned in an earlier comment, I will look to make this code compatible with the MLI API and give feedback for any improvements.

The fixes should not take me too long. The non-linear data generator will be the trickiest part.

When do you think we can start testing performance once I am done?

@manishamde
Copy link
Contributor Author

Evan,

I have just performed a major refactoring of the code based on your feedback without changing functionality.

A few tasks remain:

  1. Making the interface to the tree algorithm consistent with the MLI API. I am wondering whether some "implicit" magic might make the algorithm work with both MLTable and RDD. However, I couldn't find any variance calculation logic for MLTable features. I am currently using the StatCounter class in the Spark library for that purpose.
  2. Using the utils and deprecating TreeUtils class. I don't think this should be a show stopper for now and could be deprecated in the future. TreeRunner and TreeUtils are helper/example classes to get started.
  3. Non-linear data generator. I will think about this problem a little more. We will need to come up with a configurable (features, size, etc.) data generator.
  4. Using a CLI parser (scopt/sumac/etc.)

I think task 1 is the most important for now. Task 2 can be done in the future. I am wondering whether we can use the same data that you might have used for testing logistic regression or SVM for performance testing while we work on Task 3. Task 4 is again one for the future.

@manishamde
Copy link
Contributor Author

Some more changes.

  1. Quantile and split calculations are now performed in memory (significant improvement in performance)
  2. Calculated training error after tree building. Could not find the best place to store the training error in the tree model. Didn't want to hack it right now but will do it more systematically while building ensembles.
  3. Started doing testing on a dataset with ~0.5 million instances but limited by lack of access to a Spark cluster to accurately measure performance.

@etrain
Copy link
Contributor

etrain commented Oct 20, 2013

This is awesome, thanks Manish - we'll plan to test your code for
scalability on a cluster this week.

On Sat, Oct 19, 2013 at 6:43 PM, manishamde [email protected]:

Some more changes.

  1. Quantile and split calculations are now performed in memory
    (significant improvement in performance)
  2. Calculated training error after tree building. Could not find the best
    place to store the training error in the tree model. Didn't want to hack it
    right now but will do it more systematically while building ensembles.
  3. Started doing testing on a dataset with ~0.5 million instances but
    limited by lack of access to a Spark cluster.


Reply to this email directly or view it on GitHubhttps://github.com//pull/3#issuecomment-26662937
.

@manishamde
Copy link
Contributor Author

Sounds great Evan!

@@ -15,3 +15,7 @@ project/plugins/project/
#Eclipse specific
.classpath
.project

#IDEA specific
.idea
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good idea, thanks.

* Add logging
* Move metrics to a different package

#Extensions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@etrain
Copy link
Contributor

etrain commented Nov 8, 2013

This is terrific work! Basic functionality is there and scaling well for large datasets based on my tests. Though, I don't see special logic differentiating between continuous and categorical features. (Maybe I'm just missing something).

We should think about optimizing the inner loop a bit more, in particular see my comments about vectorization and avoiding intermediate caching.

Really good stuff and a welcome contribution!

@manishamde
Copy link
Contributor Author

Attempted vectorization of findBestSplit calculation in the recent commit.

@hirakendu
Copy link

Hi guys, this is some great work and sorry for coming late to the party :), but I have two high-level reservations at this stage.

  1. The implementation doesn't support categorical features, which I think is important for decision tree applications. This may require some invasive changes. I see that the histograms are calculated for splits instead of bins of feature values. This is okay for continuous features where the bins are always arranged in the same order - in the order of feature quantiles. But for categorical features, the ordering of bins is based on the average (or centroid) of output values of samples in the bin. So the bins are ordered after the calculation of histograms.
  2. The design is too tightly coupled to binary classification and related loss functions. I don't see hooks for regression and other loss functions and a general interface for implementing them. One would essentially have to write a very different decision tree algorithm for regression. Sorry if the refactoring was meant to be done at a later stage.

As referenced in the previous (automatic) comment, I have submitted a pull request for an implementation of decision trees for Spark/MLLib. It is based on the boosted trees implementation I have been working on. It does address the above two issues. It supports both categorical and continuous features. I have defined a (pretty awesome) generic loss function interface. Observing that we calculate loss functions over groupings of feature bins (each part of split is a group of bins), the interface requires specifying summary statistics for bins that can be "added", from which the loss can be calculated. The actual decision tree algorithm implementation is generic and uses this loss interface and calculates histograms of loss statistics. One can use this algorithm for a variety of loss functions by simply defining and implementing suitable loss statistics and related methods. Both regression and classification tree derivatives are provided using square loss and entropy loss functions. There are also basic tests, synthetic data generators and ample amount of scaladoc comments as documentation. Overall, it is quite robust and performant to my liking from initial tests and benchmarks.

Additional details of this implementation are at hirakendu/boosted_trees/doc. To give it a try, a precompiled mllib jar based on 0.8.0-incubating release is available at hirakendu/boosted_trees/code/spark_mllib_boosted_trees/target. It would be nice if the performance can be tested along-side this implementation. Comments on the design and implementation would be highly appreciated in the other discussion.

</advertisement>

I am curious about the vectorization optimizations. I believe it is related to my previous observation that using Arrays for calculating and storing histograms of RDD partitions and then merging them at master is faster than reduceByKey, almost runs in half the time in my tests. There is quite some shuffle data involved in reduceByKey. Will have a deeper look and see if this vectorization instead of flatMap can improve performance.

I also agree with one of the previous comments about caching at every node. I think this has already been addressed. I have been debating about training level by level, instead of node by node, but there is some book-keeping involved. Secondly, at moderately deep levels, say depth 5, it may already become too many bins for histograms. On a related note, I am going by the rough estimate that we can handle million bins (say 1000 features with 1000 quantiles or categories) on a single machine.

I will try to write a version of boosted trees for MLI as well. Overall I like the MLTable and Algorithm, Model interfaces, not to mention the support for categorical and other types of features. One thing that is curious in both MLLib and MLI is the absence of an interface for loss functions. It looks to be a part of optimization interface currently, which appears specific to linear models. Another curiosity is the non-standard data format for labeled point instances. I think tab delimited values are standard in Hadoop and outside, with schema/header files explaining the columns, and the first column is generally taken to be label. While it's an easy transformation in Spark, it would be good to have it as a standard input format.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants