Random Forest - Classification

Package Installation

In [1]:
pip install numpy
Requirement already satisfied: numpy in /srv/conda/envs/notebook/lib/python3.6/site-packages (1.18.5)
Note: you may need to restart the kernel to use updated packages.
In [2]:
pip install pandas
Requirement already satisfied: pandas in /srv/conda/envs/notebook/lib/python3.6/site-packages (1.0.5)
Requirement already satisfied: pytz>=2017.2 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from pandas) (2020.1)
Requirement already satisfied: python-dateutil>=2.6.1 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from pandas) (2.8.1)
Requirement already satisfied: numpy>=1.13.3 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from pandas) (1.18.5)
Requirement already satisfied: six>=1.5 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from python-dateutil>=2.6.1->pandas) (1.15.0)
Note: you may need to restart the kernel to use updated packages.
In [3]:
pip install matplotlib
Requirement already satisfied: matplotlib in /srv/conda/envs/notebook/lib/python3.6/site-packages (3.2.2)
Requirement already satisfied: kiwisolver>=1.0.1 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from matplotlib) (1.2.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from matplotlib) (2.4.7)
Requirement already satisfied: cycler>=0.10 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from matplotlib) (0.10.0)
Requirement already satisfied: python-dateutil>=2.1 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from matplotlib) (2.8.1)
Requirement already satisfied: numpy>=1.11 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from matplotlib) (1.18.5)
Requirement already satisfied: six in /srv/conda/envs/notebook/lib/python3.6/site-packages (from cycler>=0.10->matplotlib) (1.15.0)
Note: you may need to restart the kernel to use updated packages.
In [4]:
pip install sklearn
Requirement already satisfied: sklearn in /srv/conda/envs/notebook/lib/python3.6/site-packages (0.0)
Requirement already satisfied: scikit-learn in /srv/conda/envs/notebook/lib/python3.6/site-packages (from sklearn) (0.23.1)
Requirement already satisfied: joblib>=0.11 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from scikit-learn->sklearn) (0.15.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from scikit-learn->sklearn) (2.1.0)
Requirement already satisfied: numpy>=1.13.3 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from scikit-learn->sklearn) (1.18.5)
Requirement already satisfied: scipy>=0.19.1 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from scikit-learn->sklearn) (1.4.1)
Note: you may need to restart the kernel to use updated packages.
In [5]:
pip install yfinance
Requirement already satisfied: yfinance in /srv/conda/envs/notebook/lib/python3.6/site-packages (0.1.55)
Requirement already satisfied: pandas>=0.24 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from yfinance) (1.0.5)
Requirement already satisfied: numpy>=1.15 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from yfinance) (1.18.5)
Requirement already satisfied: requests>=2.20 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from yfinance) (2.23.0)
Requirement already satisfied: multitasking>=0.0.7 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from yfinance) (0.0.9)
Requirement already satisfied: lxml>=4.5.1 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from yfinance) (4.6.2)
Requirement already satisfied: python-dateutil>=2.6.1 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from pandas>=0.24->yfinance) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from pandas>=0.24->yfinance) (2020.1)
Requirement already satisfied: certifi>=2017.4.17 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from requests>=2.20->yfinance) (2020.6.20)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from requests>=2.20->yfinance) (1.25.9)
Requirement already satisfied: idna<3,>=2.5 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from requests>=2.20->yfinance) (2.9)
Requirement already satisfied: chardet<4,>=3.0.2 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from requests>=2.20->yfinance) (3.0.4)
Requirement already satisfied: six>=1.5 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from python-dateutil>=2.6.1->pandas>=0.24->yfinance) (1.15.0)
Note: you may need to restart the kernel to use updated packages.
In [6]:
pip install datetime
Requirement already satisfied: datetime in /srv/conda/envs/notebook/lib/python3.6/site-packages (4.3)
Requirement already satisfied: pytz in /srv/conda/envs/notebook/lib/python3.6/site-packages (from datetime) (2020.1)
Requirement already satisfied: zope.interface in /srv/conda/envs/notebook/lib/python3.6/site-packages (from datetime) (5.2.0)
Requirement already satisfied: setuptools in /srv/conda/envs/notebook/lib/python3.6/site-packages (from zope.interface->datetime) (47.1.1.post20200529)
Note: you may need to restart the kernel to use updated packages.

Importing Libraries

In [7]:
#Paths
import os
import sys
import requests

#Data Acquisition
import yfinance as yf
from datetime import date

#Calculations
import numpy as np
import pandas as pd
import math

#Plotting
import matplotlib.pyplot as plt

#Sklearn is a machine learning library
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import plot_roc_curve
from sklearn.metrics import accuracy_score, classification_report

Data Acquisition

At first, we need to get the data from the stock we want to work with. This is a crucial first step, since this will provide all the information that will feed our random forest.

In [8]:
# date.today() returns today's date
today = date.today()

# The ticker is the stock we want to work with
ticker = "NFLX"

# We set the starting and ending date of the period we want to work with
start_date = "2018-01-01"
end_date = "2020-12-01"

# We want to store the full price history
# We therefore use the built-in methods from yfinance
data = yf.Ticker(ticker)
price_data = data.history(period="max", start=start_date, end=end_date)

# We want to make sure that the dates are in an ascending order 
price_data = price_data.sort_index(ascending=True, axis=0)

# We want to display the head, meaning the first five rows of our data set
price_data.head()
Out[8]:
Open High Low Close Volume Dividends Stock Splits
Date
2018-01-02 196.100006 201.649994 195.419998 201.070007 10966900 0 0
2018-01-03 202.050003 206.210007 201.500000 205.050003 8591400 0 0
2018-01-04 206.199997 207.050003 204.000000 205.630005 6029600 0 0
2018-01-05 207.250000 210.020004 205.589996 209.990005 7033200 0 0
2018-01-08 210.020004 212.500000 208.440002 212.050003 5580200 0 0

Data Preprocessing

As one can observe in the above output, we currently only have 7 columns, which are the default columns yfinance gives us.

To get by far better results, we will now try to extract some relevant features. A feature in the data can be any useful characteristic that can be used to correctly identify if the stock is going to close up or down.

Change in Price

We want to add a column which will indicate the change in price, meaning the difference between the closing price of two consecutive days.

In fact, we want to calculate $Close_m - Close_{m-1}$ with $Close_m$ being the Close value of the day $m$.

This of course implies that we will not get a result for the first row.

In [9]:
# We add a column which indicates the change in price
# We therefore use the diff() method, which will calculate the difference from one row to the next
price_data["Change in Price"] = price_data["Close"].diff()

price_data.head()
Out[9]:
Open High Low Close Volume Dividends Stock Splits Change in Price
Date
2018-01-02 196.100006 201.649994 195.419998 201.070007 10966900 0 0 NaN
2018-01-03 202.050003 206.210007 201.500000 205.050003 8591400 0 0 3.979996
2018-01-04 206.199997 207.050003 204.000000 205.630005 6029600 0 0 0.580002
2018-01-05 207.250000 210.020004 205.589996 209.990005 7033200 0 0 4.360001
2018-01-08 210.020004 212.500000 208.440002 212.050003 5580200 0 0 2.059998

Relative Strength Index

The Relative Strength Index (RSI) is a indicator that determines whether the stock is overbought or oversold.

A stock is said to be overbought when the demand unjustifiably pushes the price upwards. In general, this is a sign that the price is likely to go down since the stock is overvalued. A stock is said to be oversold when the price goes down sharply to a level below its true value. This is a result caused due to panic selling.

The RSI ranges from $0$ to $100$.

If the RSI is above $70$, it may indicate that the stock is overbought. If the RSI is below $30$, it may indicate that the stock is oversold.

Formula:

$RSI = 100 - \frac{100}{1+RS}$

where $RS = \frac{Average \, Gain \,Over \,Past \,n \,Days}{Average \,Loss \,Over \,Past \,n \,Days}$ is the relative strength.

In [10]:
# We want to calculate the n-day RSI
n = 14

# We make a copy of the data frame
gain = price_data["Change in Price"].copy()
loss = price_data["Change in Price"].copy()

# We create the empty lists which we feed with the data from the change in price column
gain_list = []
loss_list = []

for i in gain:
    gain_list.append(i)

for i in loss:
    loss_list.append(i)


gain_list[0]=0
loss_list[0]=0

# If the change in price is positive, the closing price of the day before is less than the closing price of the day itself.
# If the change in price is negative, the closing price of the day before is greater than the closing price of the day itself.

# This means that we have a loss if the change in price is negative and we have a gain if it is positive.

# We want to exclude all the losses from the gain list. We therefore set all the losses to 0.
for i in range (0,len(gain_list)):
    if gain_list[i]<0:
        gain_list[i]=0

# We want to exclude all the gains from the loss list. We therefore set all the gains to 0.
# We need to take the absolute value, since we want the losses to be positive.
for i in range (0, len(loss_list)):
    if loss_list[i]>0:
        loss_list[i]=0
    else:
        loss_list[i]= - loss_list[i]

# Note that by setting the values to 0, they won't play a part in the calculation of the average.

# We now create the empty average lists which we will feed with the average of the gain (resp. loss) of the past n days
average_gain = []
average_loss = []

# We can not calculate the average of the last n days for the values which don't have n previous values
for i in range (0,n):
    average_gain.append(math.nan)
    average_loss.append(math.nan)

# For the rest of the values, we calculate the average of the past n days
for i in range (n,len(gain_list)):
    gain_sum = 0
    loss_sum = 0
    
    # We take the sum of the past n values
    for j in range (1,n+1):
        gain_sum = gain_list[i-j] + gain_sum
        loss_sum = loss_list[i-j] + loss_sum
    
    # We obtain the average by dividing the sum by n
    average_gain.append(gain_sum/n)
    average_loss.append(loss_sum/n)

# We create the empty relative strength list in order to fill it with the average of gains divided by the average of losses for the past n days.
RS = []

for i in range (0,n):
    RS.append(math.nan)

for i in range (n,len(average_gain)):
    RS.append(average_gain[i]/average_loss[i])
    
# We create the empty relative strength index list and will fill it with the values of the according formula
RSI = []

for i in range (0,n):
    RSI.append(math.nan)

for i in range (n,len(average_gain)):
    RSI.append(100 - ((100)/(1+RS[i])))

# We now want to include our newly obtained data to our price_data
price_data["Gain"] = gain_list
price_data["Loss"] = loss_list
price_data["RSI"] = RSI

# We can observe that the new columns were added properly
price_data[10:20]
Out[10]:
Open High Low Close Volume Dividends Stock Splits Change in Price Gain Loss RSI
Date
2018-01-17 221.000000 221.149994 216.320007 217.500000 9123100 0 0 -4.029999 0.000000 4.029999 NaN
2018-01-18 220.339996 220.580002 216.550003 220.330002 8225300 0 0 2.830002 2.830002 0.000000 NaN
2018-01-19 222.750000 223.490005 218.500000 220.460007 10548600 0 0 0.130005 0.130005 0.000000 NaN
2018-01-22 222.000000 227.789993 221.199997 227.580002 17703300 0 0 7.119995 7.119995 0.000000 NaN
2018-01-23 255.050003 257.709991 248.020004 250.289993 27705300 0 0 22.709991 22.709991 0.000000 83.096120
2018-01-24 250.880005 261.709991 249.309998 261.299988 17352400 0 0 11.009995 11.009995 0.000000 89.212867
2018-01-25 263.000000 272.299988 260.230011 269.700012 15336400 0 0 8.400024 8.400024 0.000000 90.299463
2018-01-26 271.489990 274.600006 268.760010 274.600006 11021800 0 0 4.899994 4.899994 0.000000 91.276894
2018-01-29 274.200012 286.809998 273.920013 284.589996 17529700 0 0 9.989990 9.989990 0.000000 91.337168
2018-01-30 277.000000 282.730011 272.700012 278.799988 12482900 0 0 -5.790009 0.000000 5.790009 92.135218

Stochastic Oscillator

The Stochastic Oscillator follows the speed of the price. In general, the momentum changes before the price changes. It measures the level of the closing price relative to the low-high range over a period of time.

Formula:

$K = 100 * \frac{C-L_n}{H_n - L_n}$

where

$C$ is the Current Closing Price,

$L_n$ is the Lowest Low over the past $n$ days and

$H_n$ is the Highest High over the past $n$ days.

In [11]:
low = price_data["Low"].copy()
high = price_data["High"].copy()
close = price_data["Close"].copy()

# We create the empty lists which we feed with the data from the resp. column
low_list = []
high_list = []
close_list = []

for i in high:
    high_list.append(i)
    
for i in low:
    low_list.append(i)

for i in close:
    close_list.append(i)

# We create the empty lists which we fill with the highest high and the lowest low over the past n days
H_n = []
L_n = []

for i in range (0,n):
    H_n.append(math.nan)
    L_n.append(math.nan)
    
for i in range (n,len(high_list)):
    max = high_list[i]
    min = low_list[i]
    for j in range (1,n+1):
        if max < high_list[i-j]:
            max = high_list[i-j]
        if min > low_list[i-j]:
            min = low_list[i-j]
    H_n.append(max)
    L_n.append(min)

# We create an empty list which we want to fill with the values of the according stochastic oscillator formula
K = []

for i in range (0,n):
    K.append(math.nan)
    
for i in range (n,len(high_list)):
    nom = close_list[i] - L_n[i]
    denom = H_n[i] - L_n[i]
    K.append(100*(nom/denom))

# We add the newly obtained data to price_data
price_data["Highest High"] = H_n
price_data["Lowest Low"] = L_n
price_data["Stochastic Oscillator"] = K

# We can check that the new columns were properly added
price_data[:20]
Out[11]:
Open High Low Close Volume Dividends Stock Splits Change in Price Gain Loss RSI Highest High Lowest Low Stochastic Oscillator
Date
2018-01-02 196.100006 201.649994 195.419998 201.070007 10966900 0 0 NaN 0.000000 0.000000 NaN NaN NaN NaN
2018-01-03 202.050003 206.210007 201.500000 205.050003 8591400 0 0 3.979996 3.979996 0.000000 NaN NaN NaN NaN
2018-01-04 206.199997 207.050003 204.000000 205.630005 6029600 0 0 0.580002 0.580002 0.000000 NaN NaN NaN NaN
2018-01-05 207.250000 210.020004 205.589996 209.990005 7033200 0 0 4.360001 4.360001 0.000000 NaN NaN NaN NaN
2018-01-08 210.020004 212.500000 208.440002 212.050003 5580200 0 0 2.059998 2.059998 0.000000 NaN NaN NaN NaN
2018-01-09 212.110001 212.979996 208.589996 209.309998 6125900 0 0 -2.740005 0.000000 2.740005 NaN NaN NaN NaN
2018-01-10 207.570007 213.639999 206.910004 212.520004 5951500 0 0 3.210007 3.210007 0.000000 NaN NaN NaN NaN
2018-01-11 214.289993 217.750000 213.350006 217.240005 7659500 0 0 4.720001 4.720001 0.000000 NaN NaN NaN NaN
2018-01-12 217.179993 222.550003 216.000000 221.229996 8199400 0 0 3.989990 3.989990 0.000000 NaN NaN NaN NaN
2018-01-16 224.240005 226.070007 217.199997 221.529999 13516100 0 0 0.300003 0.300003 0.000000 NaN NaN NaN NaN
2018-01-17 221.000000 221.149994 216.320007 217.500000 9123100 0 0 -4.029999 0.000000 4.029999 NaN NaN NaN NaN
2018-01-18 220.339996 220.580002 216.550003 220.330002 8225300 0 0 2.830002 2.830002 0.000000 NaN NaN NaN NaN
2018-01-19 222.750000 223.490005 218.500000 220.460007 10548600 0 0 0.130005 0.130005 0.000000 NaN NaN NaN NaN
2018-01-22 222.000000 227.789993 221.199997 227.580002 17703300 0 0 7.119995 7.119995 0.000000 NaN NaN NaN NaN
2018-01-23 255.050003 257.709991 248.020004 250.289993 27705300 0 0 22.709991 22.709991 0.000000 83.096120 257.709991 195.419998 88.087977
2018-01-24 250.880005 261.709991 249.309998 261.299988 17352400 0 0 11.009995 11.009995 0.000000 89.212867 261.709991 201.500000 99.319044
2018-01-25 263.000000 272.299988 260.230011 269.700012 15336400 0 0 8.400024 8.400024 0.000000 90.299463 272.299988 204.000000 96.193300
2018-01-26 271.489990 274.600006 268.760010 274.600006 11021800 0 0 4.899994 4.899994 0.000000 91.276894 274.600006 205.589996 100.000000
2018-01-29 274.200012 286.809998 273.920013 284.589996 17529700 0 0 9.989990 9.989990 0.000000 91.337168 286.809998 206.910004 97.221525
2018-01-30 277.000000 282.730011 272.700012 278.799988 12482900 0 0 -5.790009 0.000000 5.790009 92.135218 286.809998 206.910004 89.974956

Williams Percentage Range

The Williams Percentage Range is similar to the Stochastic Oscillator. In fact, it indicates the level of a market's closing price in relation to the highest price for the look-back period.

It’s value ranges from -100 to 0. When its value is above -20, it indicates a sell signal and when its value is below -80, it indicates a buy signal.

Formula:

$R = -100*\frac{H_n-C}{H_n-L_n}$

where

$C$ is the Current Closing Price,

$L_n$ is the Lowest Low over the past $n$ days and

$H_n$ is the Highest High over the past $n$ days.

In [12]:
# We create an empty list which we want to fill with the values of the williams percentage formula
R = []

for i in range (0,n):
    R.append(math.nan)
    
for i in range (n,len(high_list)):
    nom = H_n[i] - close_list[i]
    denom = H_n[i] - L_n[i]
    R.append(-100*(nom/denom))

# We want to add a williams percentage column to the price_data
price_data["Williams Percentage"] = R

price_data[10:20]
Out[12]:
Open High Low Close Volume Dividends Stock Splits Change in Price Gain Loss RSI Highest High Lowest Low Stochastic Oscillator Williams Percentage
Date
2018-01-17 221.000000 221.149994 216.320007 217.500000 9123100 0 0 -4.029999 0.000000 4.029999 NaN NaN NaN NaN NaN
2018-01-18 220.339996 220.580002 216.550003 220.330002 8225300 0 0 2.830002 2.830002 0.000000 NaN NaN NaN NaN NaN
2018-01-19 222.750000 223.490005 218.500000 220.460007 10548600 0 0 0.130005 0.130005 0.000000 NaN NaN NaN NaN NaN
2018-01-22 222.000000 227.789993 221.199997 227.580002 17703300 0 0 7.119995 7.119995 0.000000 NaN NaN NaN NaN NaN
2018-01-23 255.050003 257.709991 248.020004 250.289993 27705300 0 0 22.709991 22.709991 0.000000 83.096120 257.709991 195.419998 88.087977 -11.912023
2018-01-24 250.880005 261.709991 249.309998 261.299988 17352400 0 0 11.009995 11.009995 0.000000 89.212867 261.709991 201.500000 99.319044 -0.680956
2018-01-25 263.000000 272.299988 260.230011 269.700012 15336400 0 0 8.400024 8.400024 0.000000 90.299463 272.299988 204.000000 96.193300 -3.806700
2018-01-26 271.489990 274.600006 268.760010 274.600006 11021800 0 0 4.899994 4.899994 0.000000 91.276894 274.600006 205.589996 100.000000 -0.000000
2018-01-29 274.200012 286.809998 273.920013 284.589996 17529700 0 0 9.989990 9.989990 0.000000 91.337168 286.809998 206.910004 97.221525 -2.778475
2018-01-30 277.000000 282.730011 272.700012 278.799988 12482900 0 0 -5.790009 0.000000 5.790009 92.135218 286.809998 206.910004 89.974956 -10.025044

On Balance Volume

On balance volume (OBV) utilizes changes in volume to estimate changes in stock prices. This technical indicator is used to find buying and selling trends of a stock, by considering the cumulative volume: it cumulatively adds the volumes on days when the prices go up, and subtracts the volume on the days when prices go down, compared to the prices of the previous day.

Formula:

If $ C(t) > C(t-1): OBV(t) = OBV(t-1) + Vol(t)$

If $ C(t) < C(t-1): OBV(t) = OBV(t-1) - Vol(t)$

If $ C(t) = C(t-1): OBV(t) = OBV(t-1)$

where

$OBV(t)$ is the on balance volume at time t,

$Vol(t)$ is the trading volume at time t and

$C(t)$ is the closing price at time t.

In [13]:
vol = price_data["Low"].copy()

# We create an empty volume list which we will feed with the terms of the volume column
vol_list = []

for i in vol:
    vol_list.append(i)

# We create an empty on balance volume list that we will fill with the values according to the corresponding formula
OBV = []

OBV.append(0)

for i in range (1,len(vol_list)):
    prev_obv = OBV[i-1]
    v = vol_list[i]
    if close_list[i] > close_list[i-1]:
        OBV.append(prev_obv + v)
    elif close_list[i] < close_list[i-1]:
        OBV.append(prev_obv - v)
    else:
        OBV.append(prev_obv)

# We add the on balance volume column to our price_data
price_data["On Balance Volume"] = OBV

price_data.head()
Out[13]:
Open High Low Close Volume Dividends Stock Splits Change in Price Gain Loss RSI Highest High Lowest Low Stochastic Oscillator Williams Percentage On Balance Volume
Date
2018-01-02 196.100006 201.649994 195.419998 201.070007 10966900 0 0 NaN 0.000000 0.0 NaN NaN NaN NaN NaN 0.000000
2018-01-03 202.050003 206.210007 201.500000 205.050003 8591400 0 0 3.979996 3.979996 0.0 NaN NaN NaN NaN NaN 201.500000
2018-01-04 206.199997 207.050003 204.000000 205.630005 6029600 0 0 0.580002 0.580002 0.0 NaN NaN NaN NaN NaN 405.500000
2018-01-05 207.250000 210.020004 205.589996 209.990005 7033200 0 0 4.360001 4.360001 0.0 NaN NaN NaN NaN NaN 611.089996
2018-01-08 210.020004 212.500000 208.440002 212.050003 5580200 0 0 2.059998 2.059998 0.0 NaN NaN NaN NaN NaN 819.529999

Price Rate of Change

The Price Rate of Change (PROC) is a technical indicator which reflects the percentage change in price between the current price and the price over the window that we consider to be the time period of observation.

Formula:

$PROC(t) = \frac{C(t)-C(t-n)}{C(t-n)}$

where

$PROC(t)$ is the price rate of change at time t and

$C(t)$ is the closing price at time t.

In [14]:
# We create an empty price rate of change list to which we will add the values according to the formula
PROC = []

for i in range (0,n):
    PROC.append(math.nan)
    
for i in range (n,len(close_list)):
    nom = close_list[i]-close_list[i-n]
    denom = close_list[i-n]
    PROC.append(nom/denom)

# We add the price rate of change to our price_data
price_data["Price Rate of Change"] = PROC

price_data[10:20]
Out[14]:
Open High Low Close Volume Dividends Stock Splits Change in Price Gain Loss RSI Highest High Lowest Low Stochastic Oscillator Williams Percentage On Balance Volume Price Rate of Change
Date
2018-01-17 221.000000 221.149994 216.320007 217.500000 9123100 0 0 -4.029999 0.000000 4.029999 NaN NaN NaN NaN NaN 1248.080002 NaN
2018-01-18 220.339996 220.580002 216.550003 220.330002 8225300 0 0 2.830002 2.830002 0.000000 NaN NaN NaN NaN NaN 1464.630005 NaN
2018-01-19 222.750000 223.490005 218.500000 220.460007 10548600 0 0 0.130005 0.130005 0.000000 NaN NaN NaN NaN NaN 1683.130005 NaN
2018-01-22 222.000000 227.789993 221.199997 227.580002 17703300 0 0 7.119995 7.119995 0.000000 NaN NaN NaN NaN NaN 1904.330002 NaN
2018-01-23 255.050003 257.709991 248.020004 250.289993 27705300 0 0 22.709991 22.709991 0.000000 83.096120 257.709991 195.419998 88.087977 -11.912023 2152.350006 0.244790
2018-01-24 250.880005 261.709991 249.309998 261.299988 17352400 0 0 11.009995 11.009995 0.000000 89.212867 261.709991 201.500000 99.319044 -0.680956 2401.660004 0.274323
2018-01-25 263.000000 272.299988 260.230011 269.700012 15336400 0 0 8.400024 8.400024 0.000000 90.299463 272.299988 204.000000 96.193300 -3.806700 2661.890015 0.311579
2018-01-26 271.489990 274.600006 268.760010 274.600006 11021800 0 0 4.899994 4.899994 0.000000 91.276894 274.600006 205.589996 100.000000 -0.000000 2930.650024 0.307681
2018-01-29 274.200012 286.809998 273.920013 284.589996 17529700 0 0 9.989990 9.989990 0.000000 91.337168 286.809998 206.910004 97.221525 -2.778475 3204.570038 0.342089
2018-01-30 277.000000 282.730011 272.700012 278.799988 12482900 0 0 -5.790009 0.000000 5.790009 92.135218 286.809998 206.910004 89.974956 -10.025044 2931.870026 0.331996

Prediction Column

Now that we have our technical indicators calculated and our price data cleaned up, we are almost ready to build our model. However, we are missing one critical piece of information that is crucial to the model: the column we wish to predict.

Our goal is to predict whether the day is a down-day or an up-day, since we are trying to solve a classification problem.

We need to determine if the stock closed up or down for any given day.

Therefore we set the value -1 for down days (days, where the price is lower than the day before), 1 for up days (days, where the price is higher than the day before) and 0 for flat days (days, where the price has not changed).

In [15]:
# We create an empty prediction list which we want to fill with only 0,1 and -1 as explained above
prediction = []

CIP = price_data["Change in Price"].copy()

# We will work with the change in price column, so we create a list out of it
CIP_list = []

for i in CIP:
    CIP_list.append(i)

# We fill the prediction list
prediction.append(math.nan)

for i in range (1,len(close_list)):
    if CIP_list[i] > 0:
        prediction.append(1)
    elif CIP_list[i] < 0:
        prediction.append(-1)
    else:
        prediction.append(0)

# We add the prediction column to our price_datab
price_data["Prediction"]=prediction

price_data[:20]
Out[15]:
Open High Low Close Volume Dividends Stock Splits Change in Price Gain Loss RSI Highest High Lowest Low Stochastic Oscillator Williams Percentage On Balance Volume Price Rate of Change Prediction
Date
2018-01-02 196.100006 201.649994 195.419998 201.070007 10966900 0 0 NaN 0.000000 0.000000 NaN NaN NaN NaN NaN 0.000000 NaN NaN
2018-01-03 202.050003 206.210007 201.500000 205.050003 8591400 0 0 3.979996 3.979996 0.000000 NaN NaN NaN NaN NaN 201.500000 NaN 1.0
2018-01-04 206.199997 207.050003 204.000000 205.630005 6029600 0 0 0.580002 0.580002 0.000000 NaN NaN NaN NaN NaN 405.500000 NaN 1.0
2018-01-05 207.250000 210.020004 205.589996 209.990005 7033200 0 0 4.360001 4.360001 0.000000 NaN NaN NaN NaN NaN 611.089996 NaN 1.0
2018-01-08 210.020004 212.500000 208.440002 212.050003 5580200 0 0 2.059998 2.059998 0.000000 NaN NaN NaN NaN NaN 819.529999 NaN 1.0
2018-01-09 212.110001 212.979996 208.589996 209.309998 6125900 0 0 -2.740005 0.000000 2.740005 NaN NaN NaN NaN NaN 610.940002 NaN -1.0
2018-01-10 207.570007 213.639999 206.910004 212.520004 5951500 0 0 3.210007 3.210007 0.000000 NaN NaN NaN NaN NaN 817.850006 NaN 1.0
2018-01-11 214.289993 217.750000 213.350006 217.240005 7659500 0 0 4.720001 4.720001 0.000000 NaN NaN NaN NaN NaN 1031.200012 NaN 1.0
2018-01-12 217.179993 222.550003 216.000000 221.229996 8199400 0 0 3.989990 3.989990 0.000000 NaN NaN NaN NaN NaN 1247.200012 NaN 1.0
2018-01-16 224.240005 226.070007 217.199997 221.529999 13516100 0 0 0.300003 0.300003 0.000000 NaN NaN NaN NaN NaN 1464.400009 NaN 1.0
2018-01-17 221.000000 221.149994 216.320007 217.500000 9123100 0 0 -4.029999 0.000000 4.029999 NaN NaN NaN NaN NaN 1248.080002 NaN -1.0
2018-01-18 220.339996 220.580002 216.550003 220.330002 8225300 0 0 2.830002 2.830002 0.000000 NaN NaN NaN NaN NaN 1464.630005 NaN 1.0
2018-01-19 222.750000 223.490005 218.500000 220.460007 10548600 0 0 0.130005 0.130005 0.000000 NaN NaN NaN NaN NaN 1683.130005 NaN 1.0
2018-01-22 222.000000 227.789993 221.199997 227.580002 17703300 0 0 7.119995 7.119995 0.000000 NaN NaN NaN NaN NaN 1904.330002 NaN 1.0
2018-01-23 255.050003 257.709991 248.020004 250.289993 27705300 0 0 22.709991 22.709991 0.000000 83.096120 257.709991 195.419998 88.087977 -11.912023 2152.350006 0.244790 1.0
2018-01-24 250.880005 261.709991 249.309998 261.299988 17352400 0 0 11.009995 11.009995 0.000000 89.212867 261.709991 201.500000 99.319044 -0.680956 2401.660004 0.274323 1.0
2018-01-25 263.000000 272.299988 260.230011 269.700012 15336400 0 0 8.400024 8.400024 0.000000 90.299463 272.299988 204.000000 96.193300 -3.806700 2661.890015 0.311579 1.0
2018-01-26 271.489990 274.600006 268.760010 274.600006 11021800 0 0 4.899994 4.899994 0.000000 91.276894 274.600006 205.589996 100.000000 -0.000000 2930.650024 0.307681 1.0
2018-01-29 274.200012 286.809998 273.920013 284.589996 17529700 0 0 9.989990 9.989990 0.000000 91.337168 286.809998 206.910004 97.221525 -2.778475 3204.570038 0.342089 1.0
2018-01-30 277.000000 282.730011 272.700012 278.799988 12482900 0 0 -5.790009 0.000000 5.790009 92.135218 286.809998 206.910004 89.974956 -10.025044 2931.870026 0.331996 -1.0

We have enough features to continue with the random forest process. Note that we could add a couple of features, which could improve our results, but we will settle with the ones we have.

But before we continue with the random forest algorithm, we need to take care of the NaN values.

Removing NaN Values

The random forest can't accept NaN values so we need to remove them before feeding the data in.

In [16]:
# We use the built-in method to "delete" all the rows that have a NaN value somewhere
price_data = price_data.dropna()

# We can see that the first lines were removed since they all contained some NaN values
price_data[:20]
Out[16]:
Open High Low Close Volume Dividends Stock Splits Change in Price Gain Loss RSI Highest High Lowest Low Stochastic Oscillator Williams Percentage On Balance Volume Price Rate of Change Prediction
Date
2018-01-23 255.050003 257.709991 248.020004 250.289993 27705300 0 0 22.709991 22.709991 0.000000 83.096120 257.709991 195.419998 88.087977 -11.912023 2152.350006 0.244790 1.0
2018-01-24 250.880005 261.709991 249.309998 261.299988 17352400 0 0 11.009995 11.009995 0.000000 89.212867 261.709991 201.500000 99.319044 -0.680956 2401.660004 0.274323 1.0
2018-01-25 263.000000 272.299988 260.230011 269.700012 15336400 0 0 8.400024 8.400024 0.000000 90.299463 272.299988 204.000000 96.193300 -3.806700 2661.890015 0.311579 1.0
2018-01-26 271.489990 274.600006 268.760010 274.600006 11021800 0 0 4.899994 4.899994 0.000000 91.276894 274.600006 205.589996 100.000000 -0.000000 2930.650024 0.307681 1.0
2018-01-29 274.200012 286.809998 273.920013 284.589996 17529700 0 0 9.989990 9.989990 0.000000 91.337168 286.809998 206.910004 97.221525 -2.778475 3204.570038 0.342089 1.0
2018-01-30 277.000000 282.730011 272.700012 278.799988 12482900 0 0 -5.790009 0.000000 5.790009 92.135218 286.809998 206.910004 89.974956 -10.025044 2931.870026 0.331996 -1.0
2018-01-31 281.940002 282.290009 269.579987 270.299988 11695100 0 0 -8.500000 0.000000 8.500000 88.982378 286.809998 206.910004 79.336657 -20.663343 2662.290039 0.271880 -1.0
2018-02-01 266.410004 271.950012 263.380005 265.070007 9669000 0 0 -5.229980 0.000000 5.229980 80.597323 286.809998 213.350006 70.405673 -29.594327 2398.910034 0.220171 -1.0
2018-02-02 263.000000 270.619995 262.709991 267.429993 9123600 0 0 2.359985 2.359985 0.000000 75.192254 286.809998 216.000000 72.630976 -27.369024 2661.620026 0.208832 1.0
2018-02-05 262.000000 267.899994 250.029999 254.259995 11896100 0 0 -13.169998 0.000000 13.169998 74.758848 286.809998 216.320007 53.823227 -46.176773 2411.590027 0.147745 -1.0
2018-02-06 247.699997 266.700012 245.000000 265.720001 12595800 0 0 11.460007 11.460007 0.000000 65.413961 286.809998 216.320007 70.080864 -29.919136 2656.590027 0.221701 1.0
2018-02-07 266.579987 272.450012 264.329987 264.559998 8981500 0 0 -1.160004 0.000000 1.160004 71.223597 286.809998 216.550003 68.331908 -31.668092 2392.260040 0.200744 -1.0
2018-02-08 267.079987 267.619995 250.000000 250.100006 9306700 0 0 -14.459991 0.000000 14.459991 69.757886 286.809998 218.500000 46.259709 -53.740291 2142.260040 0.134446 -1.0
2018-02-09 253.850006 255.800003 236.110001 249.470001 16906900 0 0 -0.630005 0.000000 0.630005 61.737687 286.809998 221.199997 43.087950 -56.912050 1906.150040 0.096186 -1.0
2018-02-12 252.139999 259.149994 249.000000 257.950012 8534900 0 0 8.480011 8.480011 0.000000 59.138350 286.809998 236.110001 43.076949 -56.923051 2155.150040 0.030605 1.0
2018-02-13 257.290009 261.410004 254.699997 258.269989 6855200 0 0 0.319977 0.319977 0.000000 53.628965 286.809998 236.110001 43.708067 -56.291933 2409.850037 -0.011596 1.0
2018-02-14 260.470001 269.880005 260.329987 266.000000 10972000 0 0 7.730011 7.730011 0.000000 48.402741 286.809998 236.110001 58.954637 -41.045363 2670.180023 -0.013719 1.0
2018-02-15 270.029999 280.500000 267.630005 280.269989 10759700 0 0 14.269989 14.269989 0.000000 48.035669 286.809998 236.110001 87.100574 -12.899426 2937.810028 0.020648 1.0
2018-02-16 278.730011 281.959991 275.690002 278.519989 8312400 0 0 -1.750000 0.000000 1.750000 52.737801 286.809998 236.110001 83.648897 -16.351103 2662.120026 -0.021329 -1.0
2018-02-20 277.739990 285.809998 276.609985 278.549988 7769000 0 0 0.029999 0.029999 0.000000 46.815649 285.809998 236.110001 85.392334 -14.607666 2938.730011 -0.000897 1.0

Splitting the Data

We have to split our data into a training and a testing set.

We can take the RSI, Williams Percentage, Price Rate of Change, ... as our X and our Y column will be the Prediction column, which specifies whether the stock closed up or down compared to the previous day.

Note that the number of input columns is in relation with the accuracy of the program.

In [17]:
# We want to predict if the stock will close up or down n days from the day indicated
n = 1

# If we have the data from a day m, then we want to obtain a prediction for the day m+1 since we set our n = 1

# We select our columns
X_col = price_data[["Open","High","Low", "Close","Volume", "RSI", "Williams Percentage", "Price Rate of Change", "On Balance Volume","Stochastic Oscillator"]]
Y_col = price_data["Prediction"]

# We "delete" the n last rows of the X column and the n first rows in the Y column
# In this way, if we put them side by side, the Y value un row m will tell if the day m+n is a up or down day
X_col = X_col[:len(X_col)-n]
Y_col = Y_col[n:]

# We now want to split our data into a training and testing set
# To do so, we want the training set to be the first p percent of our entire price data

# We set p to 90, which means that we want our training set to be the first 90 percent of our price_data
p = 90
first_p_percent = int((len(X_col)*p)/100)

# We define our training data
X_train = X_col[:first_p_percent]
y_train = Y_col[:first_p_percent]

# We define our testing data to be the rest of the price_data, so everything that follows the data in the training set
X_test = X_col[first_p_percent+1:]
y_test = Y_col[first_p_percent+1:]


# We create our RandomForestClassifier
random_forest_classifier = RandomForestClassifier(n_estimators = 100, oob_score = True, criterion = "gini", random_state = 0)

# We fit the training data to the model using the built-in fit method
random_forest_classifier.fit(X_train,y_train)

# We take the X_test data set and use it to make predictions
y_pred = random_forest_classifier.predict(X_test)

We now built our model. One can see that this part is not the longest and most time intensive part, since the SciKit learn provides us with very useful methods and objects. Indeed the most time consuming part is the data preprocessing.

Of course, we now want to know how accurate it is. Again, SciKit learn makes this process very easy by providing some built-in metrics that we can call.

In [18]:
# The accuracy_score function computes the accuracy
# It returns the fraction as a default, but it can also return the count of correct predictions (normalize=False)
# The accuracy is in regard to the number of accurate predictions the model made on the test set
print("Correct Prediction (%) : ", accuracy_score(y_test, random_forest_classifier.predict(X_test), normalize = True)*100)
Correct Prediction (%) :  59.154929577464785

When it comes to evaluating the model, we generally look at the accuracy. If our accuracy is high, it means our model is correctly classifying items. Luckily, our accuracy is pretty high!

Feature importance

As a last step we want to have an idea of what features are helping explain most of the model, as this can give you insight as to why you're getting the results you are. With Random Forest, we can identify some of our most important features or, in other words, the features that help explain most of the model.

In [19]:
# Calculate feature importance and store in pandas series
feature_imp = pd.Series(random_forest_classifier.feature_importances_, index=X_col.columns).sort_values(ascending=False)
feature_imp
Out[19]:
On Balance Volume        0.116763
Price Rate of Change     0.106781
RSI                      0.106450
High                     0.103462
Volume                   0.101802
Close                    0.099899
Williams Percentage      0.096448
Open                     0.092209
Stochastic Oscillator    0.088484
Low                      0.087700
dtype: float64