Random Forest Regression

Install packages

In [1]:
# On PyPI the library is published as scikit-learn; "sklearn" is only a deprecated alias for it
pip install sklearn
Requirement already satisfied: sklearn in /srv/conda/envs/notebook/lib/python3.6/site-packages (0.0)
Requirement already satisfied: scikit-learn in /srv/conda/envs/notebook/lib/python3.6/site-packages (from sklearn) (0.23.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from scikit-learn->sklearn) (2.1.0)
Requirement already satisfied: scipy>=0.19.1 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from scikit-learn->sklearn) (1.5.3)
Requirement already satisfied: numpy>=1.13.3 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from scikit-learn->sklearn) (1.19.4)
Requirement already satisfied: joblib>=0.11 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from scikit-learn->sklearn) (0.17.0)
Note: you may need to restart the kernel to use updated packages.
In [2]:
pip install pandas
Requirement already satisfied: pandas in /srv/conda/envs/notebook/lib/python3.6/site-packages (1.1.5)
Requirement already satisfied: python-dateutil>=2.7.3 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from pandas) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from pandas) (2020.4)
Requirement already satisfied: numpy>=1.15.4 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from pandas) (1.19.4)
Requirement already satisfied: six>=1.5 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)
Note: you may need to restart the kernel to use updated packages.
In [3]:
pip install numpy
Requirement already satisfied: numpy in /srv/conda/envs/notebook/lib/python3.6/site-packages (1.19.4)
Note: you may need to restart the kernel to use updated packages.
In [4]:
pip install yfinance
Requirement already satisfied: yfinance in /srv/conda/envs/notebook/lib/python3.6/site-packages (0.1.55)
Requirement already satisfied: pandas>=0.24 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from yfinance) (1.1.5)
Requirement already satisfied: numpy>=1.15 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from yfinance) (1.19.4)
Requirement already satisfied: multitasking>=0.0.7 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from yfinance) (0.0.9)
Requirement already satisfied: lxml>=4.5.1 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from yfinance) (4.6.2)
Requirement already satisfied: requests>=2.20 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from yfinance) (2.24.0)
Requirement already satisfied: python-dateutil>=2.7.3 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from pandas>=0.24->yfinance) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from pandas>=0.24->yfinance) (2020.4)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from requests>=2.20->yfinance) (1.25.10)
Requirement already satisfied: chardet<4,>=3.0.2 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from requests>=2.20->yfinance) (3.0.4)
Requirement already satisfied: certifi>=2017.4.17 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from requests>=2.20->yfinance) (2020.6.20)
Requirement already satisfied: idna<3,>=2.5 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from requests>=2.20->yfinance) (2.10)
Requirement already satisfied: six>=1.5 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from python-dateutil>=2.7.3->pandas>=0.24->yfinance) (1.15.0)
Note: you may need to restart the kernel to use updated packages.

Import packages

In [5]:
from sklearn.ensemble import RandomForestRegressor
import pandas as pd
import numpy as np
import yfinance as yf
import math
from datetime import date, timedelta

Data Acquisition for the Random Forest Regressor

In [6]:
# We set the stock we want to work with
data = yf.Ticker('NFLX')

# date.today() returns the current date; it is used later to fetch the most recent prices
today = date.today()

# We extract the price history for the selected ticker between a start date and an end date
# (period is omitted because the start and end dates already define the range)
df = data.history(start="2015-01-01", end="2020-12-01")

# We view the first five rows of the collected data
df.head()
Out[6]:
                 Open       High        Low      Close    Volume  Dividends  Stock Splits
Date
2015-01-02  49.151428  50.331429  48.731430  49.848572  13475000          0           0.0
2015-01-05  49.258572  49.258572  47.147144  47.311428  18165000          0           0.0
2015-01-06  47.347141  47.639999  45.661430  46.501427  16037700          0           0.0
2015-01-07  47.347141  47.421429  46.271427  46.742859   9849700          0           0.0
2015-01-08  47.119999  47.835712  46.478573  47.779999   9601900          0           0.0

Create Testing and Training Data

In [7]:
# We want to predict the closing price n days after a given day
n = 1
# With n = 1, the data from day m is used to predict the closing price on day m+1

# We will work with the Close column, so we turn it into a list
close = df["Close"].tolist()

# The target values are the closing prices n days later
close_n_days = close[n:]

# We drop the last n rows of the dataframe so that features and targets line up:
# in row m, the "Close in n days" value is the closing price on day m+n
# (.copy() avoids a SettingWithCopyWarning when the new column is added)
df = df[:len(df)-n].copy()

df["Close in n days"] = close_n_days


# We take the first p percent of our dataframe to be our training data
p = 90
df_percentage = int((len(close_n_days)*p)/100)

# Rows before position df_percentage are flagged True (training), the rest False (testing)
df['Training Set'] = np.arange(len(df)) < df_percentage

df[(df_percentage-2):(df_percentage+3)]
Out[7]:
                  Open        High         Low       Close    Volume  Dividends  Stock Splits  Close in n days  Training Set
Date
2020-04-27  425.000000  429.000000  420.839996  421.380005   6277500          0           0.0       403.829987          True
2020-04-28  419.989990  421.000000  402.910004  403.829987  10101200          0           0.0       411.890015          True
2020-04-29  399.529999  415.859985  393.600006  411.890015   9693100          0           0.0       419.850006         False
2020-04-30  410.309998  424.440002  408.000000  419.850006   7954000          0           0.0       415.269989         False
2020-05-01  415.100006  427.970001  411.730011  415.269989   8299900          0           0.0       428.149994         False
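
The target column built above can also be produced with pandas' shift method. The following is a minimal sketch on a made-up five-row dataframe (the name toy and its values are illustrative assumptions, not part of the notebook's data); it is an alternative way to get the same alignment, not the method used in the cell above.

```python
import pandas as pd

# Toy closing prices standing in for df["Close"] (made-up values)
toy = pd.DataFrame(
    {"Close": [10.0, 11.0, 12.0, 13.0, 14.0]},
    index=pd.date_range("2021-01-04", periods=5, freq="D"),
)

n = 1
# shift(-n) moves each value n rows up, so row m holds the close of row m+n
toy["Close in n days"] = toy["Close"].shift(-n)

# The last n rows have no future value (NaN), so they are dropped
toy = toy.dropna(subset=["Close in n days"])
print(toy)
```

Because shift keeps the date index attached, there is no need to convert the column to a list and trim the dataframe by hand.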

Creating dataframes with Test Rows and Training Rows

In [8]:
# We split the dataframe into two separate dataframes, one for testing and one for training
train, test = df[df['Training Set']], df[~df['Training Set']]

Displaying the Number of Rows for the Testing and Training Dataframes

In [9]:
print('Number of rows in the training data: ', len(train))
print('Number of rows in the testing data: ', len(test))
Number of rows in the training data:  1339
Number of rows in the testing data:  149
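
A chronological split like the one above can also be obtained from scikit-learn's train_test_split with shuffle=False, which keeps the rows in their original order. The sketch below uses a made-up ten-row dataframe (toy, toy_train and toy_test are illustrative names), so it only shows the mechanics and is not a replacement for the flag column used in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy time series standing in for the price dataframe (made-up values)
toy = pd.DataFrame(
    {"Close": range(10)},
    index=pd.date_range("2021-01-04", periods=10, freq="D"),
)

# shuffle=False keeps the time order: the first 80% of rows go to training,
# the last 20% to testing
toy_train, toy_test = train_test_split(toy, test_size=0.2, shuffle=False)

print(len(toy_train), len(toy_test))
print(toy_train.index.max(), "<", toy_test.index.min())
```

For time series this ordering matters: with the default shuffle=True, rows from the future would end up in the training set.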

Create a List of the Feature Columns' Names

In [10]:
# In this case the list of features is ['Open', 'High', 'Low', 'Close'] and they are used to predict the closing price in n days
features = df.columns[:4]
X = train[features]
y = train['Close in n days']

Creating the Random Forest Regressor

In [11]:
regr = RandomForestRegressor(n_estimators = 50)

Training the Regressor

In [12]:
regr.fit(X, y)
Out[12]:
RandomForestRegressor(n_estimators=50)
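
A random forest is built from random bootstrap samples, so refitting the regressor above can give slightly different numbers on each run. The sketch below, on synthetic data (X_toy, y_toy and the seed values are assumptions chosen for illustration), shows how the random_state parameter of RandomForestRegressor makes two fits agree.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: four features and a noisy target (made-up values)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
y_toy = X_toy.sum(axis=1) + rng.normal(scale=0.1, size=200)

# Two forests with the same random_state draw the same bootstrap samples
model_a = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_toy, y_toy)
model_b = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_toy, y_toy)

same = np.allclose(model_a.predict(X_toy), model_b.predict(X_toy))
print("Identical predictions:", same)
```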

Calculating the Coefficient of Determination R^2 of the Prediction

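The coefficient of determination (R squared) measures how much of the variance in the target the model explains; the closer it is to 1, the better the fit. Note that the score below is computed on the training data, so it shows how well the model fits rows it has already seen, not how well it predicts new ones.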

In [13]:
regr.score(X, y)
Out[13]:
0.9994430722470299
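
To see how the training score can differ from the score on unseen rows, the sketch below fits a forest on a synthetic random walk and scores it on both parts of a chronological 90/10 split. The data, the seed and all variable names are assumptions made for illustration; the numbers it prints say nothing about the NFLX model above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic random walk standing in for a price series (made-up data)
rng = np.random.default_rng(0)
prices = 100 + np.cumsum(rng.normal(size=501))

X_all = prices[:-1].reshape(-1, 1)  # today's price as the only feature
y_all = prices[1:]                  # tomorrow's price as the target

# Chronological 90/10 split, as in the notebook
split = int(len(X_all) * 0.9)
X_tr, y_tr = X_all[:split], y_all[:split]
X_te, y_te = X_all[split:], y_all[split:]

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

train_r2 = model.score(X_tr, y_tr)  # fit on rows the model has seen
test_r2 = model.score(X_te, y_te)   # performance on held-out rows
print("R^2 on training rows:", train_r2)
print("R^2 on held-out rows:", test_r2)
```

In this notebook the equivalent call on the held-out rows would be regr.score(test[features], test['Close in n days']).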

Applying the Trained Regressor to the Testing Data

In [14]:
# We apply the model to our testing dataframe
preds = regr.predict(test[features])

print('First five predicted values: ', preds[0:5])
First five predicted values:  [408.10320129 412.15060059 412.70239868 423.90580017 426.19799805]
In [15]:
# We can compare the predicted values above to the real values shown below
test['Close in n days'].head()
Out[15]:
Date
2020-04-29    419.850006
2020-04-30    415.269989
2020-05-01    428.149994
2020-05-04    424.679993
2020-05-05    434.260010
Name: Close in n days, dtype: float64
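
The comparison can be made quantitative with the mean absolute error. The sketch below copies the five predictions and five actual values shown above and, as an assumed point of reference that is not part of the original notebook, also scores a naive forecast that simply repeats the current day's close. Five rows are far too few to draw conclusions, so this only illustrates the calculation.

```python
import numpy as np

# Values copied from the two outputs above
predicted = np.array([408.10320129, 412.15060059, 412.70239868, 423.90580017, 426.19799805])
actual = np.array([419.850006, 415.269989, 428.149994, 424.679993, 434.260010])

# Naive reference forecast: tomorrow's close equals today's close.
# These are the Close values of the same five test days
# (2020-04-29 to 2020-05-05), read off the tables in this notebook.
naive = np.array([411.890015, 419.850006, 415.269989, 428.149994, 424.679993])

# Mean absolute error: average size of the miss, in price units
mae_model = np.mean(np.abs(actual - predicted))
mae_naive = np.mean(np.abs(actual - naive))
print("MAE of the forest on these five rows:", mae_model)
print("MAE of the naive forecast:           ", mae_naive)
```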

Predicting for one Specific Example

In [16]:
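# We go back two calendar days so that the download covers the two most recent days
# (this assumes both were trading days; around weekends or holidays fewer rows are returned)
two_days_ago = today - timedelta(days=2)

# We take the data from those two days only (the end date itself is not included)
pred_data = data.history(start=two_days_ago, end=today)

# The first row is the earlier of the two days; its prices are the model input
pred_data_y = pred_data[:1]

# We again only take the Open, High, Low and Close features
X = pred_data_y[features]

# We predict the closing price for the following day, which is the last day in pred_data
preds = regr.predict(X)
print('The predicted closing value for the following day is: ', preds[0])
The predicted closing value for the following day is:  424.46599548339844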
In [17]:
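print("The actual closing value on that day was: ")

# The second row holds the prices of the day that was predicted
pred_data_t = pred_data[1:]

# .iloc[0] selects the first row by position
print(pred_data_t["Close"].iloc[0])
The actual closing value on that day was: 
534.4500122070312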
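
One plausible contributor to the gap between the two numbers above is that a random forest cannot extrapolate: each tree predicts an average of training targets, so the forest's output never exceeds the largest target it was trained on. If recent prices are higher than anything in the training rows, the prediction stays capped. The sketch below demonstrates this on a made-up linear trend (all names and values are illustrative assumptions); it does not establish that this is the only cause here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Training data: a simple upward trend, y = 2x for x in 0..99 (made-up values)
X_tr = np.arange(100, dtype=float).reshape(-1, 1)
y_tr = 2.0 * X_tr.ravel()

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Ask for a prediction far outside the training range; the trend would give 400
pred = model.predict(np.array([[200.0]]))[0]
print("Largest training target:", y_tr.max())
print("Prediction at x = 200:  ", pred)
```

A common workaround is to predict a quantity that stays within a stable range, such as the day-to-day return, instead of the price level itself.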