pip install numpy
pip install pandas
pip install matplotlib
pip install scikit-learn
pip install yfinance
#Paths
import os
import sys
import requests
#Data Acquisition
import yfinance as yf
from datetime import date
#Calculations
import numpy as np
import pandas as pd
import math
#Plotting
import matplotlib.pyplot as plt
#Sklearn is a machine learning library
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
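# RocCurveDisplay is the current API for ROC plots; plot_roc_curve was removed in scikit-learn 1.2
from sklearn.metrics import RocCurveDisplay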
from sklearn.metrics import accuracy_score, classification_report
First, we need to get the data for the stock we want to work with. This is a crucial step, since it provides all the information that will feed our random forest.
# date.today() returns today's date
today = date.today()
# The ticker is the stock we want to work with
ticker = "NFLX"
# We set the starting and ending date of the period we want to work with
start_date = "2018-01-01"
end_date = "2020-12-01"
# We want to store the price history for the chosen period
# We therefore use the built-in methods from yfinance
data = yf.Ticker(ticker)
# The start and end dates define the window, so no period argument is needed
price_data = data.history(start=start_date, end=end_date)
# We want to make sure that the dates are in an ascending order
price_data = price_data.sort_index(ascending=True, axis=0)
# We want to display the head, meaning the first five rows of our data set
price_data.head()
As one can observe in the above output, we currently only have 7 columns, which are the default columns yfinance gives us.
To get considerably better results, we will now extract some relevant features. A feature can be any characteristic of the data that helps identify whether the stock is going to close up or down.
We want to add a column which will indicate the change in price, meaning the difference between the closing prices of two consecutive days.
Specifically, we want to calculate $Close_m - Close_{m-1}$, where $Close_m$ is the closing price on day $m$.
This of course implies that we will not get a result for the first row.
# We add a column which indicates the change in price
# We therefore use the diff() method, which will calculate the difference from one row to the next
price_data["Change in Price"] = price_data["Close"].diff()
price_data.head()
The Relative Strength Index (RSI) is an indicator that determines whether the stock is overbought or oversold.
A stock is said to be overbought when demand unjustifiably pushes the price upwards. In general, this is a sign that the price is likely to go down, since the stock is overvalued. A stock is said to be oversold when the price drops sharply to a level below its true value. This is often the result of panic selling.
The RSI ranges from $0$ to $100$.
If the RSI is above $70$, it may indicate that the stock is overbought. If the RSI is below $30$, it may indicate that the stock is oversold.
Formula:
$RSI = 100 - \frac{100}{1+RS}$
where $RS = \frac{\text{Average gain over the past } n \text{ days}}{\text{Average loss over the past } n \text{ days}}$ is the relative strength.
# We want to calculate the n-day RSI
n = 14
# We make two copies of the change in price column
gain = price_data["Change in Price"].copy()
loss = price_data["Change in Price"].copy()
# We create the empty lists which we feed with the data from the change in price column
gain_list = []
loss_list = []
for i in gain:
    gain_list.append(i)
for i in loss:
    loss_list.append(i)
# The first change in price is NaN (there is no previous day), so we set it to 0
gain_list[0] = 0
loss_list[0] = 0
# If the change in price is positive, the closing price of the day before is less than the closing price of the day itself.
# If the change in price is negative, the closing price of the day before is greater than the closing price of the day itself.
# This means that we have a loss if the change in price is negative and we have a gain if it is positive.
# We want to exclude all the losses from the gain list. We therefore set all the losses to 0.
for i in range(0, len(gain_list)):
    if gain_list[i] < 0:
        gain_list[i] = 0
# We want to exclude all the gains from the loss list. We therefore set all the gains to 0.
# We need to take the absolute value, since we want the losses to be positive.
for i in range(0, len(loss_list)):
    if loss_list[i] > 0:
        loss_list[i] = 0
    else:
        loss_list[i] = -loss_list[i]
# Note that the values set to 0 add nothing to the sum, although those days still count towards the n days of the average.
# We now create the empty average lists which we will feed with the average of the gain (resp. loss) of the past n days
average_gain = []
average_loss = []
# We cannot calculate the average of the past n days for the values which don't have n previous values
for i in range(0, n):
    average_gain.append(math.nan)
    average_loss.append(math.nan)
# For the rest of the values, we calculate the average of the past n days
for i in range(n, len(gain_list)):
    gain_sum = 0
    loss_sum = 0
    # We take the sum of the past n values
    for j in range(1, n + 1):
        gain_sum = gain_list[i - j] + gain_sum
        loss_sum = loss_list[i - j] + loss_sum
    # We obtain the average by dividing the sum by n
    average_gain.append(gain_sum / n)
    average_loss.append(loss_sum / n)
# We create the empty relative strength list in order to fill it with the average of gains divided by the average of losses for the past n days.
RS = []
for i in range(0, n):
    RS.append(math.nan)
for i in range(n, len(average_gain)):
    # With no losses in the window the ratio is unbounded; math.inf avoids a division by zero and yields an RSI of 100
    if average_loss[i] == 0:
        RS.append(math.inf)
    else:
        RS.append(average_gain[i] / average_loss[i])
# We create the empty relative strength index list and will fill it with the values of the corresponding formula
RSI = []
for i in range(0, n):
    RSI.append(math.nan)
for i in range(n, len(average_gain)):
    RSI.append(100 - (100 / (1 + RS[i])))
# We now want to include our newly obtained data to our price_data
price_data["Gain"] = gain_list
price_data["Loss"] = loss_list
price_data["RSI"] = RSI
# We can observe that the new columns were added properly
price_data[10:20]
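The loops above can also be written with pandas rolling windows. The following is a minimal sketch, not the notebook's own implementation: the helper name `rolling_rsi` and the toy closing prices are assumptions made for illustration. It uses the same simple average over the n days preceding each row, which is why the rolling mean is shifted by one row.

```python
import pandas as pd

def rolling_rsi(close: pd.Series, n: int = 14) -> pd.Series:
    """Simple-average RSI over the n days preceding each row (hypothetical helper)."""
    change = close.diff().fillna(0.0)        # the first row has no previous close
    gain = change.clip(lower=0.0)            # keep gains, set losses to 0
    loss = change.clip(upper=0.0).abs()      # keep losses as positive numbers
    # shift(1) makes each window end on the previous day, as in the loops above
    avg_gain = gain.rolling(n).mean().shift(1)
    avg_loss = loss.rolling(n).mean().shift(1)
    rs = avg_gain / avg_loss                 # infinite when the window has no losses
    return 100.0 - 100.0 / (1.0 + rs)

# Toy closing prices (made up for illustration), with a 2-day window
toy_close = pd.Series([10.0, 11.0, 10.0, 12.0, 11.0, 13.0])
toy_rsi = rolling_rsi(toy_close, n=2)
print(toy_rsi.tolist())
```

On real data one would call it as `rolling_rsi(price_data["Close"], n=14)`; the first n rows stay NaN, just like in the list-based version.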
The Stochastic Oscillator follows the speed, or momentum, of the price. In general, momentum changes direction before the price does. The oscillator measures the level of the closing price relative to the low-high range over a period of time.
Formula:
$K = 100 * \frac{C-L_n}{H_n - L_n}$
where
$C$ is the Current Closing Price,
$L_n$ is the Lowest Low over the past $n$ days and
$H_n$ is the Highest High over the past $n$ days.
low = price_data["Low"].copy()
high = price_data["High"].copy()
close = price_data["Close"].copy()
# We create the empty lists which we feed with the data from the respective column
low_list = []
high_list = []
close_list = []
for i in high:
    high_list.append(i)
for i in low:
    low_list.append(i)
for i in close:
    close_list.append(i)
# We create the empty lists which we fill with the highest high and the lowest low over the past n days
H_n = []
L_n = []
for i in range(0, n):
    H_n.append(math.nan)
    L_n.append(math.nan)
for i in range(n, len(high_list)):
    # We avoid the names max and min, since they would shadow Python's built-in functions
    highest = high_list[i]
    lowest = low_list[i]
    for j in range(1, n + 1):
        if highest < high_list[i - j]:
            highest = high_list[i - j]
        if lowest > low_list[i - j]:
            lowest = low_list[i - j]
    H_n.append(highest)
    L_n.append(lowest)
# We create an empty list which we want to fill with the values of the corresponding stochastic oscillator formula
K = []
for i in range(0, n):
    K.append(math.nan)
for i in range(n, len(high_list)):
    nom = close_list[i] - L_n[i]
    denom = H_n[i] - L_n[i]
    K.append(100 * (nom / denom))
# We add the newly obtained data to price_data
price_data["Highest High"] = H_n
price_data["Lowest Low"] = L_n
price_data["Stochastic Oscillator"] = K
# We can check that the new columns were properly added
price_data[:20]
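The same highest-high and lowest-low search can be expressed with pandas rolling windows. This is a sketch under assumptions: the helper name `stochastic_k` and the toy prices are hypothetical. Note that the loops above look at the current day plus the n days before it, so the window in the sketch spans n + 1 rows.

```python
import pandas as pd

def stochastic_k(high: pd.Series, low: pd.Series, close: pd.Series, n: int = 14) -> pd.Series:
    """Stochastic oscillator over the current day and the n days before it (hypothetical helper)."""
    highest_high = high.rolling(n + 1).max()   # first n rows are NaN
    lowest_low = low.rolling(n + 1).min()
    return 100.0 * (close - lowest_low) / (highest_high - lowest_low)

# Toy prices (made up for illustration), with n = 2
toy = pd.DataFrame({
    "High": [12.0, 13.0, 15.0, 14.0],
    "Low": [9.0, 10.0, 11.0, 12.0],
    "Close": [11.0, 12.0, 14.0, 13.0],
})
toy_k = stochastic_k(toy["High"], toy["Low"], toy["Close"], n=2)
print(toy_k.tolist())
```

For the last toy row the highest high is 15 and the lowest low is 10, so the oscillator is 100 * (13 - 10) / (15 - 10) = 60.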
The Williams Percentage Range is similar to the Stochastic Oscillator. In fact, it indicates the level of a market's closing price in relation to the highest price for the look-back period.
Its value ranges from -100 to 0. A value above -20 indicates a sell signal, and a value below -80 indicates a buy signal.
Formula:
$R = -100*\frac{H_n-C}{H_n-L_n}$
where
$C$ is the Current Closing Price,
$L_n$ is the Lowest Low over the past $n$ days and
$H_n$ is the Highest High over the past $n$ days.
# We create an empty list which we want to fill with the values of the Williams percentage formula
R = []
for i in range(0, n):
    R.append(math.nan)
for i in range(n, len(high_list)):
    nom = H_n[i] - close_list[i]
    denom = H_n[i] - L_n[i]
    R.append(-100 * (nom / denom))
# We want to add a williams percentage column to the price_data
price_data["Williams Percentage"] = R
price_data[10:20]
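Comparing the two formulas shows how closely the indicators are related: since $H_n - C = (H_n - L_n) - (C - L_n)$, it follows that $R = K - 100$ whenever both use the same window. The small check below illustrates this with made-up numbers; the function names are hypothetical and only restate the formulas given above.

```python
def stochastic_k(close: float, lowest_low: float, highest_high: float) -> float:
    """K = 100 * (C - L_n) / (H_n - L_n)."""
    return 100.0 * (close - lowest_low) / (highest_high - lowest_low)

def williams_r(close: float, lowest_low: float, highest_high: float) -> float:
    """R = -100 * (H_n - C) / (H_n - L_n)."""
    return -100.0 * (highest_high - close) / (highest_high - lowest_low)

# Made-up values: close 13, lowest low 10, highest high 15
k = stochastic_k(13.0, 10.0, 15.0)   # 60.0
r = williams_r(13.0, 10.0, 15.0)     # -40.0
print(k, r, k - 100.0)
```

In practice this means the two columns carry the same information, shifted by a constant, which is worth keeping in mind when reading the feature importances later on.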
On balance volume (OBV) utilizes changes in volume to estimate changes in stock prices. This technical indicator is used to find buying and selling trends of a stock, by considering the cumulative volume: it cumulatively adds the volumes on days when the prices go up, and subtracts the volume on the days when prices go down, compared to the prices of the previous day.
Formula:
If $C(t) > C(t-1)$: $OBV(t) = OBV(t-1) + Vol(t)$
If $C(t) < C(t-1)$: $OBV(t) = OBV(t-1) - Vol(t)$
If $C(t) = C(t-1)$: $OBV(t) = OBV(t-1)$
where
$OBV(t)$ is the on balance volume at time t,
$Vol(t)$ is the trading volume at time t and
$C(t)$ is the closing price at time t.
# The on balance volume is built from the trading volume, so we copy the Volume column
vol = price_data["Volume"].copy()
# We create an empty volume list which we will feed with the terms of the volume column
vol_list = []
for i in vol:
    vol_list.append(i)
# We create an empty on balance volume list that we will fill with the values according to the corresponding formula
OBV = []
OBV.append(0)
for i in range (1,len(vol_list)):
prev_obv = OBV[i-1]
v = vol_list[i]
if close_list[i] > close_list[i-1]:
OBV.append(prev_obv + v)
elif close_list[i] < close_list[i-1]:
OBV.append(prev_obv - v)
else:
OBV.append(prev_obv)
# We add the on balance volume column to our price_data
price_data["On Balance Volume"] = OBV
price_data.head()
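The three cases of the OBV formula collapse into a single expression: the sign of the daily price change, multiplied by the volume and summed cumulatively. The sketch below assumes a hypothetical helper name `on_balance_volume` and uses made-up prices and volumes; it starts at 0 on the first day, as the loop above does.

```python
import numpy as np
import pandas as pd

def on_balance_volume(close: pd.Series, volume: pd.Series) -> pd.Series:
    """Cumulative signed volume, starting at 0 on the first row (hypothetical helper)."""
    # +1 on up days, -1 on down days, 0 on flat days and on the first row
    direction = np.sign(close.diff()).fillna(0.0)
    return (direction * volume).cumsum()

# Made-up closing prices and volumes
toy_close = pd.Series([10.0, 11.0, 11.0, 9.0])
toy_volume = pd.Series([100.0, 200.0, 300.0, 400.0])
toy_obv = on_balance_volume(toy_close, toy_volume)
print(toy_obv.tolist())  # [0.0, 200.0, 200.0, -200.0]
```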
The Price Rate of Change (PROC) is a technical indicator which measures the relative change between the current closing price and the closing price $n$ days earlier, where $n$ is the length of the observation window.
Formula:
$PROC(t) = \frac{C(t)-C(t-n)}{C(t-n)}$
where
$PROC(t)$ is the price rate of change at time t and
$C(t)$ is the closing price at time t.
# We create an empty price rate of change list to which we will add the values according to the formula
PROC = []
for i in range(0, n):
    PROC.append(math.nan)
for i in range(n, len(close_list)):
    nom = close_list[i] - close_list[i - n]
    denom = close_list[i - n]
    PROC.append(nom / denom)
# We add the price rate of change to our price_data
price_data["Price Rate of Change"] = PROC
price_data[10:20]
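The PROC formula maps directly onto a shifted column in pandas. A minimal sketch, assuming a hypothetical helper name `price_rate_of_change` and made-up closing prices:

```python
import pandas as pd

def price_rate_of_change(close: pd.Series, n: int = 14) -> pd.Series:
    """(C(t) - C(t-n)) / C(t-n); the first n rows are NaN (hypothetical helper)."""
    previous = close.shift(n)   # closing price n rows earlier
    return (close - previous) / previous

# Made-up closing prices, with n = 2
toy_close = pd.Series([10.0, 11.0, 12.0, 15.0])
toy_proc = price_rate_of_change(toy_close, n=2)
print(toy_proc.tolist())
```

For the third toy row this gives (12 - 10) / 10 = 0.2, that is, a 20 percent rise over two days.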
Now that we have calculated our technical indicators, we are almost ready to build our model. However, we are still missing one crucial piece of information: the column we wish to predict.
Since we are trying to solve a classification problem, our goal is to predict whether a day is a down day or an up day.
We need to determine if the stock closed up or down for any given day.
Therefore we set the value -1 for down days (days where the closing price is lower than the day before), 1 for up days (days where the closing price is higher than the day before) and 0 for flat days (days where the closing price has not changed).
# We create an empty prediction list which we want to fill with only 0,1 and -1 as explained above
prediction = []
CIP = price_data["Change in Price"].copy()
# We will work with the change in price column, so we create a list out of it
CIP_list = []
for i in CIP:
    CIP_list.append(i)
# We fill the prediction list
# The first row has no previous day, so its label is NaN
prediction.append(math.nan)
for i in range(1, len(close_list)):
    if CIP_list[i] > 0:
        prediction.append(1)
    elif CIP_list[i] < 0:
        prediction.append(-1)
    else:
        prediction.append(0)
# We add the prediction column to our price_data
price_data["Prediction"] = prediction
price_data[:20]
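The labelling rule above is exactly the sign function applied to the change in price, so the column could also be produced in one step. A short sketch with made-up values (the variable names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Made-up changes in price; the first entry is NaN because there is no previous day
toy_change = pd.Series([np.nan, 2.0, -1.5, 0.0])

# np.sign returns 1 for positive, -1 for negative, 0 for zero and NaN for NaN
toy_label = np.sign(toy_change)
print(toy_label.tolist())
```

With the notebook's data frame this would read `np.sign(price_data["Change in Price"])`.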
We have enough features to continue with the random forest process. Note that we could add more features, which might improve our results, but we will settle for the ones we have.
But before we continue with the random forest algorithm, we need to take care of the NaN values.
The random forest can't accept NaN values so we need to remove them before feeding the data in.
# We use the built-in method to "delete" all the rows that have a NaN value somewhere
price_data = price_data.dropna()
# We can see that the first lines were removed since they all contained some NaN values
price_data[:20]
We have to split our data into a training and a testing set.
We can take the RSI, Williams Percentage, Price Rate of Change, ... as our X and our Y column will be the Prediction column, which specifies whether the stock closed up or down compared to the previous day.
Note that the choice and number of input columns affect the accuracy of the model.
# We want to predict if the stock will close up or down n days from the day indicated
n = 1
# If we have the data from a day m, then we want to obtain a prediction for the day m+1 since we set our n = 1
# We select our columns
X_col = price_data[["Open","High","Low", "Close","Volume", "RSI", "Williams Percentage", "Price Rate of Change", "On Balance Volume","Stochastic Oscillator"]]
Y_col = price_data["Prediction"]
# We "delete" the n last rows of the X columns and the n first rows of the Y column
# In this way, if we put them side by side, the Y value in row m tells us whether day m+n is an up or a down day
X_col = X_col[:len(X_col)-n]
Y_col = Y_col[n:]
# We now want to split our data into a training and testing set
# To do so, we want the training set to be the first p percent of our entire price data
# We set p to 90, which means that we want our training set to be the first 90 percent of our price_data
p = 90
first_p_percent = int((len(X_col)*p)/100)
# We define our training data
X_train = X_col[:first_p_percent]
y_train = Y_col[:first_p_percent]
# We define our testing data to be the rest of the price_data, so everything that follows the data in the training set
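# The training slice ends just before row first_p_percent, so the test set starts there and no row is skipped
X_test = X_col[first_p_percent:]
y_test = Y_col[first_p_percent:]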
# We create our RandomForestClassifier
random_forest_classifier = RandomForestClassifier(n_estimators = 100, oob_score = True, criterion = "gini", random_state = 0)
# We fit the training data to the model using the built-in fit method
random_forest_classifier.fit(X_train,y_train)
# We take the X_test data set and use it to make predictions
y_pred = random_forest_classifier.predict(X_test)
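The label shift and the chronological split used above can also be packaged in one small function. This is a sketch under assumptions, not the notebook's method: the helper name `chronological_split`, the temporary column `_target` and the toy numbers are all hypothetical. It pairs the features of row m with the label of row m + horizon and keeps the time order intact, so the model is only tested on days that come after its training data.

```python
import pandas as pd

def chronological_split(features: pd.DataFrame, target: pd.Series,
                        horizon: int = 1, train_fraction: float = 0.9):
    """Pair row m of the features with the label of row m + horizon, then split in time order."""
    shifted = target.shift(-horizon)                       # label of the day we want to predict
    aligned = features.assign(_target=shifted).dropna(subset=["_target"])
    cut = int(len(aligned) * train_fraction)               # number of training rows
    train, test = aligned.iloc[:cut], aligned.iloc[cut:]
    return (train.drop(columns="_target"), train["_target"],
            test.drop(columns="_target"), test["_target"])

# Placeholder feature values and labels, made up for illustration
toy_X = pd.DataFrame({"RSI": [30.0, 45.0, 60.0, 55.0, 70.0, 65.0, 40.0, 35.0, 50.0, 52.0, 48.0]})
toy_y = pd.Series([1, -1, 1, 1, -1, -1, 1, 1, -1, 1, -1])
X_tr, y_tr, X_te, y_te = chronological_split(toy_X, toy_y, horizon=1, train_fraction=0.9)
print(len(X_tr), len(X_te))
```

Of the 11 toy rows, the last one has no label one day ahead and is dropped, which leaves 9 rows for training and 1 for testing.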
We have now built our model. One can see that this part is neither the longest nor the most time-intensive, since scikit-learn provides us with very useful methods and objects. Indeed, the most time-consuming part is the data preprocessing.
Of course, we now want to know how accurate the model is. Again, scikit-learn makes this process very easy by providing built-in metrics that we can call.
# The accuracy_score function computes the accuracy
# It returns the fraction as a default, but it can also return the count of correct predictions (normalize=False)
# The accuracy is in regard to the number of accurate predictions the model made on the test set
# We reuse y_pred, the predictions already computed on X_test
print("Correct Prediction (%) : ", accuracy_score(y_test, y_pred, normalize=True) * 100)
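To judge whether an accuracy value is actually good, it can help to compare it with a trivial reference, for instance always predicting the most frequent class of the training set. The sketch below is an illustration only: the function names and the toy labels are assumptions, and the accuracy is computed by hand as the fraction of matching labels, which is the quantity `accuracy_score` returns by default.

```python
import numpy as np

def accuracy(y_true, y_hat) -> float:
    """Fraction of positions where the prediction equals the true label."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_hat)))

def majority_baseline_accuracy(y_train, y_test) -> float:
    """Accuracy of always predicting the most frequent training label (hypothetical helper)."""
    values, counts = np.unique(np.asarray(y_train), return_counts=True)
    majority = values[np.argmax(counts)]
    return accuracy(y_test, np.full(len(y_test), majority))

# Made-up labels and predictions
toy_train = [1, 1, -1, 1, -1, 1]
toy_test = [1, -1, 1, 1]
toy_pred = [1, 1, 1, -1]

print("model   :", accuracy(toy_test, toy_pred))
print("baseline:", majority_baseline_accuracy(toy_train, toy_test))
```

In this toy case the predictions match 2 of 4 labels, while the baseline matches 3 of 4, so the made-up predictions would not beat the trivial reference.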
When it comes to evaluating the model, we generally look at the accuracy. If our accuracy is high, it means our model is correctly classifying items. Luckily, our accuracy is pretty high!
As a last step, we want to know which features contribute most to the model's decisions, as this gives insight into why we get the results we do. With a random forest, we can rank the features by their importance.
# Calculate feature importance and store in pandas series
feature_imp = pd.Series(random_forest_classifier.feature_importances_, index=X_col.columns).sort_values(ascending=False)
feature_imp