We recently added the Gradient Boosted Regression and Classification Trees model to the supervised learning toolkit in GraphLab Create 0.9. I decided to try it on an active Kaggle competition to test its performance and usability.
The speed and accuracy of the model are great, and the end-to-end process turns out to be surprisingly easy. It took me 5 minutes from downloading the data to making my first submission, and 30 minutes to rise to the top of the leaderboard. I am excited to share my experience with you!
Task: Predict the bike sharing demand
The task of interest is forecasting bike sharing demand (Kaggle link): given historical data about bike rentals, we want to train a model to predict the number of bikes rented in the test data. The features provided relate to weather and date, and the evaluation metric is RMSLE -- Root Mean Squared Logarithmic Error.
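RMSLE measures error on the log scale, so it penalizes the ratio between prediction and truth rather than the absolute gap. A minimal implementation in plain Python (not part of any competition toolkit, just to make the metric concrete):

```python
import math

def rmsle(predicted, actual):
    """Root Mean Squared Logarithmic Error."""
    total = sum((math.log(1 + p) - math.log(1 + a)) ** 2
                for p, a in zip(predicted, actual))
    return math.sqrt(total / len(predicted))

# Because of the log, the penalty depends on the ratio of
# (1 + predicted) to (1 + actual), not on their absolute gap:
# missing 10 out of 20 bikes costs roughly as much as missing
# 100 out of 200.
print(rmsle([10, 100], [20, 200]))
```

This is why, as we will see later, training on log-transformed counts is a natural fit for this competition.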
Model: Gradient Boosted Trees
The gradient boosted trees model (also called Gradient Boosted Machine or GBM), along with Random Forest, is one of the most effective machine learning models for predictive analytics, making it the industrial workhorse for machine learning. With the rise of competitive data science such as KDD Cups and Kaggle competitions, we have witnessed a good number of winning solutions that contain boosted tree models as a crucial component. The reference section at the bottom has more details.
The gradient boosted trees model is a type of additive model that makes predictions by combining decisions from a sequence of base models. More formally we can write this class of models as:
$$g(x) = f_0(x) + f_1(x) + f_2(x) + ...$$
where the final classifier $g$ is the sum of base (usually simpler) classifiers $f_i$. For a boosted trees model, each base classifier is a simple decision tree. Below is a simple example of a boosted trees model with 3 trees for classifying poisonous mushrooms:
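The additive form above can be sketched in a few lines of plain Python: each base model contributes a small correction, and the ensemble prediction is their sum. The base functions below are trivial stand-ins (hypothetical one-split "trees"), just to illustrate the structure:

```python
# Trivial stand-ins for base models f_0, f_1, f_2.
f0 = lambda x: 0.5                       # initial constant guess
f1 = lambda x: 0.3 if x > 2 else -0.3    # first "tree": one split
f2 = lambda x: 0.1 if x > 5 else -0.1    # second "tree": one split

def g(x, base_models):
    # The ensemble prediction is the sum of the base model outputs
    return sum(f(x) for f in base_models)

print(g(6, [f0, f1, f2]))  # 0.5 + 0.3 + 0.1
```

In real gradient boosting, each successive tree is fit to correct the residual errors of the trees before it, rather than being chosen up front.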


Make your first submission in 10 lines of code
In the following code snippet, we import the GraphLab Create module, load the training data from a CSV file, train a gradient boosted trees model, and predict on the test data loaded in the same fashion.
To run the following code, download the data from here and make sure you have GraphLab Create 0.9.1. For older versions, the API may differ; see the Update section at the end.
```python
import graphlab as gl

# load training data
training_sframe = gl.SFrame.read_csv('train.csv')

# train a model
features = ['datetime', 'season', 'holiday', 'workingday', 'weather',
            'temp', 'atemp', 'humidity', 'windspeed']
m = gl.boosted_trees.create(training_sframe,
                            features=features,
                            target='count', objective='regression',
                            num_iterations=20)

# predict on test data
test_sframe = gl.SFrame.read_csv('test.csv')
prediction = m.predict(test_sframe)
```
The following code saves the prediction to disk, at which point you can submit it to the Kaggle website here.
```python
def make_submission(prediction, filename='submission.txt'):
    with open(filename, 'w') as f:
        f.write('datetime,count\n')
        submission_strings = test_sframe['datetime'] + ',' + prediction.astype(str)
        for row in submission_strings:
            f.write(row + '\n')

make_submission(prediction, 'submission1.txt')
```
Here is a summary of what we just did above:
- The data was loaded into a scalable data frame (SFrame), and all column types were automatically inferred.
- We trained a gradient boosted trees model with 20 trees using `graphlab.boosted_trees.create()`.
- We generated predictions using `m.predict()` and made a submission.
This is a good start. But as we will see next, with some simple feature engineering and hyperparameter tuning, we can obtain much better results. Now let us zoom in on the training data using the awesome Canvas.
```python
training_sframe.show()
```

Feature engineering
Looking at the data in Canvas, I quickly realized that the `datetime` column should be broken into separate columns such as year, month, weekday, and hour. Also, the target column `count`, which we are trying to predict, is really the sum of two other columns: `registered` -- the count of registered users -- and `casual` -- the count of unregistered users. Training separate models to predict each of these counts should yield better results. Finally, the evaluation metric is RMSE in the log domain, so we should transform the target columns into the log domain as well.
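The log-domain trick works because RMSLE on the original counts is exactly RMSE on the `log(1 + x)`-transformed counts, so a model trained to minimize squared error on a log-transformed target is optimizing the competition metric directly. A quick sanity check of that identity in plain Python (with made-up numbers):

```python
import math

preds = [12.0, 40.0, 3.0]
actuals = [10.0, 55.0, 1.0]

# RMSLE computed in the original domain
rmsle = math.sqrt(sum((math.log(1 + p) - math.log(1 + a)) ** 2
                      for p, a in zip(preds, actuals)) / len(preds))

# Plain RMSE after transforming both sides with log(1 + x)
log_p = [math.log(1 + p) for p in preds]
log_a = [math.log(1 + a) for a in actuals]
rmse_log = math.sqrt(sum((lp - la) ** 2
                         for lp, la in zip(log_p, log_a)) / len(preds))

assert abs(rmsle - rmse_log) < 1e-12  # identical by construction
```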
Parse the datetime column
```python
from datetime import datetime

date_format_str = '%Y-%m-%d %H:%M:%S'

def parse_date(date_str):
    """Return parsed datetime tuple"""
    d = datetime.strptime(date_str, date_format_str)
    return {'year': d.year, 'month': d.month, 'day': d.day,
            'hour': d.hour, 'weekday': d.weekday()}

def process_date_column(data_sframe):
    """Split the 'datetime' column of a given sframe"""
    parsed_date = data_sframe['datetime'].apply(parse_date).unpack(column_name_prefix='')
    for col in ['year', 'month', 'day', 'hour', 'weekday']:
        data_sframe[col] = parsed_date[col]

process_date_column(training_sframe)
process_date_column(test_sframe)
```
Transform target counts into log domain
```python
import math

# Create three new columns: log-casual, log-registered, and log-count
for col in ['casual', 'registered', 'count']:
    training_sframe['log-' + col] = training_sframe[col].apply(lambda x: math.log(1 + x))
```
Combine the predictions of separately trained models
```python
new_features = features + ['year', 'month', 'weekday', 'hour']
new_features.remove('datetime')

m1 = gl.boosted_trees.create(training_sframe,
                             features=new_features,
                             target='log-casual')
m2 = gl.boosted_trees.create(training_sframe,
                             features=new_features,
                             target='log-registered')

def fused_predict(m1, m2, test_sframe):
    """
    Fuse the predictions of two separately trained models.
    The input models are trained in the log domain.
    Return the combined predictions in the original domain.
    """
    p1 = m1.predict(test_sframe).apply(lambda x: math.exp(x) - 1)
    p2 = m2.predict(test_sframe).apply(lambda x: math.exp(x) - 1)
    return (p1 + p2).apply(lambda x: x if x > 0 else 0)

prediction = fused_predict(m1, m2, test_sframe)
```
We have done the simple feature engineering of transforming the `datetime` column and the count columns, and written a `fused_predict()` function that combines the predictions of two models trained separately, in the log domain, on the `registered` and `casual` target columns.
Hyperparameter tuning
In this section, we will use the `model_parameter_search()` function to search for the best hyperparameters. There are a couple of important parameters in the gradient boosted trees model:
- `num_iterations` determines the number of trees in the final model. Usually, the more trees, the higher the prediction accuracy, but both training and prediction time also grow linearly in the number of trees.
- `params['max_depth']` restricts the depth of each individual tree to prevent overfitting.
- `params['min_child_weight']` also regularizes the complexity, by restricting the minimum number of observations contained at each leaf node.
- `params['min_loss_reduction']` restricts the minimum reduction of the loss function required for a node split; it works similarly to `min_child_weight`.

In this example, I fixed the number of trees to 500 and tuned `max_depth` and `min_child_weight`.
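Conceptually, a grid search like the one `model_parameter_search()` performs just enumerates the Cartesian product of the candidate values, trains a model for each combination, and keeps the one with the lowest validation error. A minimal sketch of those mechanics in plain Python, with a hypothetical `train_and_score` stand-in for actual model training:

```python
import itertools

grid = {'max_depth': [10, 15, 20], 'min_child_weight': [5, 10, 20]}

def train_and_score(params):
    # Stand-in for "train a model with these params and return its
    # validation RMSE"; a real search would fit boosted trees here.
    return abs(params['max_depth'] - 15) + abs(params['min_child_weight'] - 10)

names = sorted(grid)
candidates = [dict(zip(names, values))
              for values in itertools.product(*(grid[n] for n in names))]
best = min(candidates, key=train_and_score)
print(best)  # {'max_depth': 15, 'min_child_weight': 10}
```

The advantage of the built-in search is that the nine training runs can be farmed out in parallel instead of running sequentially like this loop.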
Setup the environment
Hyperparameter tuning is a task that can be easily parallelized or distributed. GraphLab Create provides different running environments for such tasks. For demonstration, the following uses the local environment, but you can define environments using your local Hadoop cluster or EC2 machines.
```python
env = gl.deploy.environment.Local('hyperparam_search')

training = training_sframe[training_sframe['day'] <= 16]
validation = training_sframe[training_sframe['day'] > 16]
training.save('/tmp/training')
validation.save('/tmp/validation')
```
Define the search space
```python
ntrees = 500
search_space = {
    'params': {
        'max_depth': [10, 15, 20],
        'min_child_weight': [5, 10, 20],
        'step_size': 0.05
    },
    'num_iterations': ntrees
}
```
```python
def parameter_search(training_url, validation_url, default_params):
    """
    Return the optimal parameters in the given search space.
    The parameters returned have the lowest validation RMSE.
    """
    job = gl.toolkits.model_parameter_search(env, gl.boosted_trees.create,
                                             train_set_path=training_url,
                                             save_path='/tmp/job_output',
                                             standard_model_params=default_params,
                                             hyper_params=search_space,
                                             test_set_path=validation_url)

    # When the job is done, the result is stored in an SFrame.
    # The result contains attributes of the models in the search space
    # and the validation error in RMSE.
    result = gl.SFrame('/tmp/job_output').sort('rmse', ascending=True)

    # Return the parameters with the lowest validation error.
    optimal_params = result[['max_depth', 'min_child_weight']][0]
    optimal_rmse = result['rmse'][0]
    print 'Optimal parameters: %s' % str(optimal_params)
    print 'RMSE: %s' % str(optimal_rmse)
    return optimal_params
```
Perform hyperparameter search for both models (Available in 0.9.1)
```python
fixed_params = {'features': new_features,
                'verbose': False}

fixed_params['target'] = 'log-casual'
params_log_casual = parameter_search('/tmp/training',
                                     '/tmp/validation',
                                     fixed_params)

fixed_params['target'] = 'log-registered'
params_log_registered = parameter_search('/tmp/training',
                                         '/tmp/validation',
                                         fixed_params)
```
Train models with the tuned hyperparameters
Doing a hyperparameter search requires us to hold out a validation set from the original training data. For the final submission, we want to train models that take full advantage of the provided training data.
```python
m_log_registered = gl.boosted_trees.create(training_sframe,
                                           features=new_features,
                                           target='log-registered',
                                           num_iterations=ntrees,
                                           params=params_log_registered,
                                           verbose=False)

m_log_casual = gl.boosted_trees.create(training_sframe,
                                       features=new_features,
                                       target='log-casual',
                                       num_iterations=ntrees,
                                       params=params_log_casual,
                                       verbose=False)

final_prediction = fused_predict(m_log_registered, m_log_casual, test_sframe)
make_submission(final_prediction, 'submission2.txt')
```
Try submitting the new result here.
When to use a boosted trees model?
Different kinds of models have different advantages. The boosted trees model is very good at handling tabular data with numerical features, or categorical features with no more than a few hundred categories.
One important note is that tree-based models are not designed to work with very sparse features. When dealing with sparse input data (e.g. high-cardinality categorical features), we can either pre-process the sparse features to generate numerical statistics, or switch to a linear model, which is better suited for such scenarios.
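One common way to turn a high-cardinality categorical feature into a numerical statistic is count (frequency) encoding: replace each category with how often it appears in the training data, so the tree can split on a single dense number instead of hundreds of sparse indicator columns. A minimal sketch with a hypothetical id column (not part of the bike data):

```python
from collections import Counter

# Hypothetical high-cardinality column, e.g. a station or user id
station_ids = ['a12', 'b7', 'a12', 'c3', 'a12', 'b7']

# Count encoding: each category becomes its training-set frequency
counts = Counter(station_ids)
encoded = [counts[s] for s in station_ids]
print(encoded)  # [3, 2, 3, 1, 3, 2]
```

Target (mean-response) encoding is a similar alternative; either way, the statistics must be computed on the training split only to avoid leaking the test labels.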
What is different about GBM in GraphLab Create
Both R and scikit-learn have GBM packages. However, the GBM in GraphLab Create is much faster, more scalable, and has a cleaner API. More importantly, GraphLab Create allows you to easily manipulate data using SFrame, visualize data using Canvas, and deploy models in production.
Gradient Boosted Trees in GLC is a fork of XGBoost, an open source C++ implementation of GBM which is 20 times faster than scikit-learn. Tianqi Chen, the author of XGBoost, is currently interning at GraphLab, helping us improve the toolkit for better scalability. In the coming release, GBM will support out-of-core computation so the data does not have to fit in memory.
Reference
- Wikipedia article on Gradient Tree Boosting
- Trevor Hastie's slides on Boosted Trees and Random Forest
- Ildefons et al. (2013). "Kaggle Algorithmic Trading Challenge with the Composition of Many Models and Feature Engineering"
- 3rd Place interview from the KDD Cup 2014
Update (08/29/14):
- The CSV parser type inference in GLC 0.9 does not recognize the space in the datetime field. This bug has been fixed in the coming release 0.9.1. In GLC 0.9, the code for reading CSV files needs to be changed to the following:
```python
# load training data
training_column_types = [str, int, int, int, int, float, float, int, float, int, int, int]
training_sframe = gl.SFrame.read_csv('train.csv', column_type_hints=training_column_types)

# load test data
test_column_types = [str, int, int, int, int, float, float, int, float]
test_sframe = gl.SFrame.read_csv('test.csv', column_type_hints=test_column_types)
```
- As part of our effort to make the API consistent across toolkits, 0.9.1 made some API changes that break backwards compatibility. One change that affects the code in this blog post is in the `create()` call, where `target_column` is renamed to `target` and `feature_columns` is renamed to `features`.
- The code in the Hyperparameter Search section will only work in 0.9.1.