
Random Forest

Transitioning from a Single Tree to an Endless Forest: Learning the Decision Logic of Random Forests
As mentioned previously, you are likely familiar with Decision Trees: they are easy to understand; they make decisions sequentially, one at a time, much as a person would; and they are easy to read and therefore easy to communicate.

However, Decision Trees used alone have limitations—they are very sensitive to the input data. What does this mean? Change just a few rows of data, and the entire tree can grow into a significantly different shape. In data analysis vernacular, this is known as high variance. By analogy: a student answers every question correctly on a practice test before the final examination, but cannot recall the answers when the questions are worded differently.

How do we solve these issues? By developing a Random Forest!

A random forest: what is it? #

A Random Forest model is an ensemble of many simple models (dozens, hundreds, or even thousands of them) working together. Instead of relying on what would typically be ONE “expert” (aka tree), you get the combined judgment of an entire panel of experts.

For example, consider the advice you would get from a single consultant or advisor. That person’s opinion may be very credible, but they may also carry bias (e.g., personal beliefs), or they may simply be having a bad day.

In contrast, when a panel of 500 different experts weighs in on a problem, some individual experts will be wrong, but the collective opinion will generally outweigh the inaccurate opinion of any one individual.

The “Random” in the name refers to two deliberate sources of randomness built into the model. Both are intended to produce more variability between trees, so that the trees in the forest do not all reach the same opinion—which in turn yields more accurate predictions on historical data.

Random Forest Algorithm: How it Works #

Each Tree is Built from its own Random Sample of Data

Random Forest builds many trees, and each tree is trained on a random sample of the training data, drawn with replacement (a bootstrap sample). This means that each tree may learn from a somewhat different subset of the data.
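The sampling step above can be sketched in a few lines of NumPy. This is a minimal illustration, not any library's actual implementation; the data and the `bootstrap_sample` helper are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def bootstrap_sample(X, y):
    """Draw a bootstrap sample: n rows sampled with replacement."""
    n = len(X)
    idx = rng.integers(0, n, size=n)  # indices may repeat
    return X[idx], y[idx]

# illustrative data: 10 observations, 2 features
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
X_boot, y_boot = bootstrap_sample(X, y)
# each tree in the forest would train on its own (X_boot, y_boot)
```

Because the draw is with replacement, some rows appear more than once in a given sample while others are left out entirely, which is what makes each tree's view of the data different.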

Each Tree Uses Randomly Sampled Features

While building individual trees, the random forest algorithm does not take into account all of the features/columns of the training dataset at once; for each potential split, the algorithm only considers a subset of the available features. This ensures that the trees are built using different features and are therefore less likely to produce similar predictions.
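The per-split feature subsampling can be sketched the same way. The function name and the feature counts below are illustrative assumptions; taking roughly the square root of the total feature count is a common default for classification, not a rule from this document.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def candidate_features(n_features, max_features):
    """Pick the random subset of columns considered at one split."""
    return rng.choice(n_features, size=max_features, replace=False)

# e.g. 9 total features; sqrt(9) = 3 candidates per split
feats = candidate_features(n_features=9, max_features=3)
# a tree evaluates only these 3 columns when choosing this split
```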

All Trees Predict Values

All of the trees in the random forest make their own predictions based on what they learned from the data they were trained on. Therefore, each tree produces its own output.

Prediction Combination

For classification problems, the predicted output of the random forest is the class that most of the trees predicted, i.e., a majority vote. For regression problems, the predicted output is the mean of all of the trees’ predictions.
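Both aggregation rules fit in a few lines of plain Python. The function names and the example labels below are made up for illustration.

```python
from collections import Counter

def aggregate_classification(tree_votes):
    """Majority vote across the trees' predicted classes."""
    return Counter(tree_votes).most_common(1)[0][0]

def aggregate_regression(tree_preds):
    """Mean of the trees' numeric predictions."""
    return sum(tree_preds) / len(tree_preds)

label = aggregate_classification(["wet", "dry", "wet"])  # majority class: "wet"
value = aggregate_regression([2.0, 4.0, 6.0])            # mean: 4.0
```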

Why Does it Work?

Randomizing both the training data and the training features for each individual tree reduces overfitting and thereby increases the accuracy and reliability of the overall prediction.

Characteristics of Random Forest #

Deals with Missing Values: Random Forest can be used with incomplete datasets; missing values do not prevent the model from being built.

Indicates How Important Each Variable is for the Model: Each individual feature (i.e., each column) in your dataset receives its own importance score, which helps provide insight into your data.

Handles Large Volumes of Complex Data: Random forest can process large amounts of data (in both size and dimensionality) quickly, with little degradation in accuracy.

Capable of Being Used for Very Diverse Applications: Random forests can perform both classification (predicting a type or label) and regression (predicting a number or quantity).
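The variable-importance scores mentioned above can be read straight off a fitted model. This sketch assumes scikit-learn is installed; the synthetic data is purely illustrative, with only the first column actually related to the label.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)  # only column 0 is informative

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = model.feature_importances_  # one score per column, sums to 1
```

On data like this, the informative column should receive by far the largest importance score, which is exactly the kind of insight the scores are meant to provide.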

The following are the underlying assumptions of using a random forest: #

(1) Each tree in the forest operates independently, making its own decisions.

(2) Random samples and random subsets of features are selected in order to minimize error.

(3) Sufficient size of training data set provides enough opportunities for creating trees with unique predictive characteristics and multiple patterns/varieties.

(4) The accumulated predictions of multiple trees produce a more accurate prediction as a whole.

Developing the Regional SDAM Random Forest Models #

The random forest models for the Regional SDAMs were created using data collected during fieldwork as well as geospatial metrics, such as temperature and precipitation. The full dataset was divided so that 80% was used to train each model and the remaining 20% was used to test it. Using the training dataset, the random forest model searched for patterns that would allow it to classify streamflow duration classes. Once a model was built on the training dataset, the testing dataset was used to evaluate model performance and to select and refine the model. Keeping the testing set separate from the training dataset allowed an independent evaluation of the model’s accuracy on data the model had not previously encountered.
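The 80/20 workflow described above can be sketched with scikit-learn. Everything here is a stand-in: the random matrix `X` substitutes for the field and geospatial metrics, and the integer labels `y` substitute for the streamflow duration classes.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(seed=42)
X = rng.normal(size=(100, 4))      # stand-in for field + geospatial metrics
y = rng.integers(0, 3, size=100)   # stand-in for streamflow duration classes

# hold out 20% of the rows for an independent evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)                # learn patterns from the 80%
accuracy = model.score(X_test, y_test)     # evaluate on the unseen 20%
```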

Initially, numerous candidate variables were assessed for inclusion as regional SDAM final indicators; however, far fewer were ultimately chosen for the random forest algorithm. Recursive feature elimination (RFE) was used to identify which variables in each training region contributed most to predictive accuracy. All candidate variables were included at the start, and the least important were eliminated sequentially until, from the variables producing the three models with the closest accuracy (within 1%), a set of fewer than three indicators was identified. This final set was further refined by reducing the number of indicators needed or by simplifying an indicator’s definition (e.g., presence/absence, binned values) to make it easier to use in future field efforts, while still providing the same level of predictive accuracy on the test datasets.
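Recursive feature elimination is available off the shelf in scikit-learn, so the selection step can be sketched as follows. The candidate-variable count, target size, and synthetic data are illustrative assumptions, not the actual SDAM datasets or settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

rng = np.random.default_rng(seed=7)
X = rng.normal(size=(150, 8))             # 8 candidate variables
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # label driven by columns 0 and 1

# drop the least important variable one step at a time until 3 remain
selector = RFE(RandomForestClassifier(n_estimators=50, random_state=7),
               n_features_to_select=3)
selector.fit(X, y)
kept = np.flatnonzero(selector.support_)  # indices of the retained variables
```

At each elimination step, RFE refits the forest and discards the variable with the lowest importance score, mirroring the sequential procedure described above.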

[Figure: Random Forest]

This illustration shows an example of building a random forest model. Each tree in the forest is derived from a different, independent sample of the data: a random sample of observations (visits) used for training, and a random sample of the available candidate metrics.

Because each tree is trained on a random subset of the observations in the original training data set, there will be less correlation among the trees than if all were trained from the same data set. This lack of correlation between the trees reduces variance and helps improve the performance of the overall model.

Each tree contains nodes (blue circles), which are connected to one another by branches. The branches show how the data is split at each node based on the metric(s) used for that split. Some indicators take many discrete or continuous values and can therefore appear at multiple nodes within a single tree. The classifications from the 1500 trees are then combined into a single classification by aggregating the votes (i.e., majority vote).

The relationship between any one indicator and the resulting classification is complex, and it also depends on the relationships among the different sets of indicators within each of the 1500 decision trees. This is because the 1500 individual decision trees that make up the random forest model do not all use exactly the same subsample of indicators.
