GraphLab Blog

Using Gradient Boosted Trees to Predict Bike Sharing Demand

Posted by Jay (Haijie) Gu on Aug 19, 2014 2:02:00 PM

We recently added the Gradient Boosted Regression and Classification Trees model to the supervised learning toolkit in GraphLab Create 0.9. I decided to try it on an active Kaggle competition to test its performance and usability.

The speed and accuracy of the model are great, and the end-to-end process turns out to be surprisingly easy. It took me 5 minutes to go from downloading the data to my first submission, and 30 minutes to rise to the top of the leaderboard. I am excited to share my experience with you!

Task: Predict the bike sharing demand

The task of interest is forecasting bike sharing demand (Kaggle link): given historical data about bike rentals, we want to train a model to predict the number of bikes rented in the test data. The features provided relate to weather and date, and the evaluation metric is RMSLE -- Root Mean Squared Logarithmic Error.
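For concreteness, here is a minimal sketch of the metric in plain Python (the rmsle helper is my own, not part of the competition tooling):

```python
import math

def rmsle(predicted, actual):
    """Root Mean Squared Logarithmic Error."""
    squared_log_errors = [(math.log(1 + p) - math.log(1 + a)) ** 2
                          for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

# The log makes the penalty roughly scale-free: predicting 9 when the
# truth is 10 costs about the same as predicting 90 when the truth is 100.
small = rmsle([9], [10])
large = rmsle([90], [100])
```

This scale-invariance is why RMSLE suits demand forecasting, where hourly counts range from a handful to hundreds.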

Model: Gradient Boosted Trees

The gradient boosted trees model (also called Gradient Boosted Machine or GBM), along with Random Forest, is one of the most effective machine learning models for predictive analytics, making it the industrial workhorse for machine learning. With the rise of competitive data science such as KDD Cups and Kaggle competitions, we have witnessed a good number of winning solutions that contain boosted tree models as a crucial component. The reference section at the bottom has more details.

The gradient boosted trees model is a type of additive model that makes predictions by combining decisions from a sequence of base models. More formally we can write this class of models as:

$$g(x) = f_0(x) + f_1(x) + f_2(x) + ...$$

where the final classifier $g$ is the sum of base (usually simpler) classifiers $f_i$. For a boosted trees model, each base classifier is a simple decision tree. Below is a simple example of a boosted trees model with 3 trees for classifying poisonous mushrooms:

[Figure: three example decision trees, tree_0, tree_1, and tree_2, from a boosted model for classifying poisonous mushrooms]

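To make the additive structure concrete, here is a toy sketch in plain Python; the three stump-like base classifiers below are invented for illustration, not learned from the mushroom data:

```python
# Each base model f_i maps a feature dict to a score; the ensemble
# prediction g(x) is simply the sum of the per-tree scores.
def f0(x):
    return 2.0 if x['odor'] == 'foul' else -1.0

def f1(x):
    return 1.0 if x['cap_color'] == 'red' else 0.0

def f2(x):
    return 0.5 if x['gill_size'] == 'narrow' else -0.5

def g(x):
    return f0(x) + f1(x) + f2(x)

mushroom = {'odor': 'foul', 'cap_color': 'red', 'gill_size': 'narrow'}
score = g(mushroom)  # 2.0 + 1.0 + 0.5 = 3.5
```

In the real model, each $f_i$ is a full decision tree fitted to the residual errors of the trees before it, but the prediction rule is exactly this sum.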
Make your first submission in 10 lines of code

In the following code snippet, we import the GraphLab Create module, load the training data from a csv file, train a gradient boosted trees model, and predict on the test data loaded in the same fashion.

To run the following code, download the data from here and make sure you have GraphLab Create 0.9.1. For older versions, the API may differ; see the Update section at the end.

import graphlab as gl

# load training data
training_sframe = gl.SFrame.read_csv('train.csv')

# train a model
features = ['datetime', 'season', 'holiday', 'workingday', 'weather',
            'temp', 'atemp', 'humidity', 'windspeed']
m = gl.boosted_trees.create(training_sframe,
                            features=features, 
                            target='count', objective='regression',
                            num_iterations=20)

# predict on test data
test_sframe = gl.SFrame.read_csv('test.csv')
prediction = m.predict(test_sframe)

The following code saves the prediction to disk, at which point you can submit it to the Kaggle website here.

def make_submission(prediction, filename='submission.txt'):
    with open(filename, 'w') as f:
        f.write('datetime,count\n')
        submission_strings = test_sframe['datetime'] + ',' + prediction.astype(str)
        for row in submission_strings:
            f.write(row + '\n')

make_submission(prediction, 'submission1.txt')

Here is a summary of what we just did above:

  1. The data was loaded into a scalable data frame (SFrame), and all column types are automatically inferred.
  2. We trained a gradient boosted trees model with 20 trees using graphlab.boosted_trees.create().
  3. We generated predictions using m.predict() and made a submission.

This is a good start. But as we will see next, with some simple feature engineering and hyperparameter tuning, we can obtain much better results. Now let us zoom in on the training data using the awesome Canvas.

training_sframe.show()

[Screenshot: Canvas in GraphLab Create]

Feature engineering

Looking at the data in Canvas, I quickly realized that the datetime column should be broken into separate columns for year, month, weekday, and hour. Also, the target column count, which we are trying to predict, is really the sum of two other columns: registered, the count of registered users, and casual, the count of unregistered users. Training separate models to predict each of these counts should yield better results. Finally, the evaluation metric is RMSE in the log domain, so we should transform the target columns into the log domain as well.
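As a quick sanity check with made-up numbers, plain RMSE computed after the log(1 + x) transform matches RMSLE computed on the raw counts, which is what justifies training in the log domain:

```python
import math

actual = [13, 45, 120]
predicted = [10, 50, 100]

# RMSLE computed directly in the original domain
rmsle = math.sqrt(sum((math.log(1 + p) - math.log(1 + a)) ** 2
                      for p, a in zip(predicted, actual)) / len(actual))

# Plain RMSE after mapping both sides through log(1 + x)
log_actual = [math.log(1 + a) for a in actual]
log_predicted = [math.log(1 + p) for p in predicted]
rmse_log = math.sqrt(sum((p - a) ** 2
                         for p, a in zip(log_predicted, log_actual)) / len(actual))

# The two quantities are identical, term by term.
assert abs(rmsle - rmse_log) < 1e-12
```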

Parse the datetime column

from datetime import datetime
date_format_str = '%Y-%m-%d %H:%M:%S'

def parse_date(date_str):
    """Return parsed datetime tuple"""
    d = datetime.strptime(date_str, date_format_str)
    return {'year': d.year, 'month': d.month, 'day': d.day,
            'hour': d.hour, 'weekday': d.weekday()}

def process_date_column(data_sframe):
    """Split the 'datetime' column of a given sframe"""
    parsed_date = data_sframe['datetime'].apply(parse_date).unpack(column_name_prefix='')
    for col in ['year', 'month', 'day', 'hour', 'weekday']:
        data_sframe[col] = parsed_date[col]

process_date_column(training_sframe)
process_date_column(test_sframe)

Transform target counts into log domain

import math

# Create three new columns: log-casual, log-registered, and log-count
for col in ['casual', 'registered', 'count']:
    training_sframe['log-' + col] = training_sframe[col].apply(lambda x: math.log(1 + x))

Combine the predictions of separately trained models

new_features = features + ['year', 'month', 'weekday', 'hour']
new_features.remove('datetime')

m1 = gl.boosted_trees.create(training_sframe,
                             features=new_features,
                             target='log-casual')

m2 = gl.boosted_trees.create(training_sframe,
                             features=new_features,
                             target='log-registered')

def fused_predict(m1, m2, test_sframe):
    """
    Fuse the predictions of two separately trained models.
    The input models are trained in the log domain.
    Return the combined prediction in the original domain.
    """
    p1 = m1.predict(test_sframe).apply(lambda x: math.exp(x)-1)
    p2 = m2.predict(test_sframe).apply(lambda x: math.exp(x)-1)
    return (p1 + p2).apply(lambda x: x if x > 0 else 0)

prediction = fused_predict(m1, m2, test_sframe)

We have done the simple feature engineering of transforming the datetime column and the count columns, and made a fused_predict() function that combines the predictions of two models trained separately on the log-domain registered and casual target columns.

Hyperparameter tuning

In this section, we will use the model_parameter_search() function to search for the best hyperparameters. There are a few important parameters in the gradient boosted trees model:

  • num_iterations determines the number of trees in the final model. Usually, the more trees, the higher the prediction accuracy, but both training and prediction time grow linearly with the number of trees.

  • params['max_depth'] restricts the depth of each individual tree to prevent overfitting.

  • params['min_child_weight'] also regularizes the complexity by restricting the minimum number of observations contained at each leaf node.

  • params['min_loss_reduction'] requires a minimum reduction of the loss function for a node to split; it works similarly to min_child_weight.

In this example, I fixed the number of trees to 500 and tuned max_depth and min_child_weight.

Set up the environment

Hyperparameter tuning is a task that can be easily parallelized or distributed. GraphLab Create provides different running environments for such tasks. For demonstration, the following uses the local environment, but you can define environments using your local Hadoop cluster or EC2 machines.

env = gl.deploy.environment.Local('hyperparam_search')
training = training_sframe[training_sframe['day'] <= 16]
validation = training_sframe[training_sframe['day'] > 16]
training.save('/tmp/training')
validation.save('/tmp/validation')

Define the search space

ntrees = 500
search_space = {
    'params': {
        'max_depth': [10, 15, 20],
        'min_child_weight': [5, 10, 20],
        'step_size': 0.05
    },
    'num_iterations': ntrees
}

def parameter_search(training_url, validation_url, default_params):
    """
    Return the optimal parameters in the given search space.
    The parameter returned has the lowest validation rmse.
    """
    job = gl.toolkits.model_parameter_search(env, gl.boosted_trees.create,
                                             train_set_path=training_url,
                                             save_path='/tmp/job_output',
                                             standard_model_params=default_params,
                                             hyper_params=search_space,
                                             test_set_path=validation_url)


    # When the job is done, the result is stored in an SFrame
    # The result contains attributes of the models in the search space
    # and the validation error in RMSE. 
    result = gl.SFrame('/tmp/job_output').sort('rmse', ascending=True)

    # Return the parameters with the lowest validation error. 
    optimal_params = result[['max_depth', 'min_child_weight']][0]
    optimal_rmse = result['rmse'][0]
    print 'Optimal parameters: %s' % str(optimal_params)
    print 'RMSE: %s' % str(optimal_rmse)
    return optimal_params

Perform hyperparameter search for both models (Available in 0.9.1)

fixed_params = {'features': new_features,
                'verbose': False}

fixed_params['target'] = 'log-casual'
params_log_casual = parameter_search('/tmp/training',
                                     '/tmp/validation',
                                     fixed_params)

fixed_params['target'] = 'log-registered'
params_log_registered = parameter_search('/tmp/training',
                                         '/tmp/validation',
                                         fixed_params)

Train models with the tuned hyperparameters

Doing hyperparameter search requires us to hold out a validation set from the original training data. For the final submission, we want to train models that take full advantage of the provided training data.

m_log_registered = gl.boosted_trees.create(training_sframe,
                                           features=new_features,
                                           target='log-registered',
                                           num_iterations=ntrees,
                                           params=params_log_registered,
                                           verbose=False)

m_log_casual = gl.boosted_trees.create(training_sframe,
                                       features=new_features,
                                       target='log-casual',
                                       num_iterations=ntrees,
                                       params=params_log_casual,
                                       verbose=False)

final_prediction = fused_predict(m_log_registered, m_log_casual, test_sframe)

make_submission(final_prediction, 'submission2.txt')

Try submitting the new result here.

When to use a boosted trees model?

Different kinds of models have different advantages. The boosted trees model is very good at handling tabular data with numerical features, or categorical features with no more than a few hundred categories.

One important note is that tree-based models are not designed to work with very sparse features. When dealing with sparse input data (e.g., categorical features of large dimension), we can either pre-process the sparse features into numerical statistics or switch to a linear model, which is better suited for such scenarios.
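As a sketch of the pre-processing route, one common statistic is the frequency of each categorical value in the training data; the frequency_encode helper and the user_ids column below are hypothetical, not part of this competition's data:

```python
from collections import Counter

def frequency_encode(values):
    """Replace each categorical value with its relative frequency."""
    counts = Counter(values)
    total = float(len(values))
    return [counts[v] / total for v in values]

# A high-cardinality id column becomes a single numerical feature
# that a tree can split on.
user_ids = ['u1', 'u2', 'u1', 'u3', 'u1']
encoded = frequency_encode(user_ids)  # u1 -> 0.6, u2 -> 0.2, u3 -> 0.2
```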

What is different about GBM in GraphLab Create

Both R and scikit-learn have GBM packages. However, GBM in GraphLab Create is much faster, more scalable, and has a cleaner API. More importantly, GraphLab Create allows you to easily manipulate data using SFrame, visualize data using Canvas, and deploy models in production.

Gradient Boosted Trees in GLC is a fork of XGBoost, an open source C++ implementation of GBM that is 20 times faster than scikit-learn's. Tianqi Chen, the author of XGBoost, is currently interning at GraphLab, helping us improve the toolkit for better scalability. In the coming release, GBM will support out-of-core computation so the data does not have to fit in memory.

Reference

Update (08/29/14):

  • The csv parser type inference in GLC 0.9 does not recognize the space in the datetime field. This bug has been fixed in the coming 0.9.1 release. In GLC 0.9, the code for reading csv files needs to be changed to the following:
# load training data
training_column_types = [str,int,int,int,int,float,float,int,float,int,int,int]
training_sframe = gl.SFrame.read_csv('train.csv', column_type_hints=training_column_types)

# load test data
test_column_types = [str,int,int,int,int,float,float,int,float]
test_sframe = gl.SFrame.read_csv('test.csv', column_type_hints=test_column_types)
  • As part of our effort to make the API consistent across toolkits, 0.9.1 has made some API changes that break backwards compatibility. One change that affects the code in this blog post is in the create() call, where target_column is renamed to target, and feature_columns is renamed to features.
  • The code in Hyperparameter Search will only work in 0.9.1.
