In this notebook, we evaluate the daily performance of our systems identification VAR(X) model of gas usage percentages by method, which was initially defined with minute-level data. We backtest on a period containing the atypical patterns analyzed in our initial ad hoc analysis of gas usage by actor/method. The model performs one-step predictions with retraining after each step, and we evaluate it using the Root Mean Squared Error (RMSE) criterion.
RMSE is a commonly used measure of the difference between actual and forecasted values. RMSE is always $\geq 0$, with 0 being a perfect forecast. Since RMSE is the square root of the averaged squared errors, it is sensitive to outliers.
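As a minimal sketch of this measure, the snippet below computes RMSE with NumPy on made-up numbers (the `rmse` helper and all values are illustrative, not Filecoin data). It compares two forecasts with the same total absolute error to show how a single large miss inflates RMSE.

```python
import numpy as np

def rmse(actual, forecast):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return float(np.sqrt(np.mean((forecast - actual) ** 2)))

# Illustrative values only (not Filecoin data).
actual = [1.0, 2.0, 3.0, 4.0]
good_forecast = [1.0, 2.0, 3.0, 4.0]   # matches the actual values exactly
one_outlier = [1.0, 2.0, 3.0, 8.0]     # a single miss of 4
spread_out = [2.0, 3.0, 4.0, 5.0]      # four misses of 1 each

perfect = rmse(actual, good_forecast)
outlier_rmse = rmse(actual, one_outlier)
spread_rmse = rmse(actual, spread_out)
```

Both imperfect forecasts are off by a total of 4, but the forecast with one large miss scores twice the RMSE of the one whose error is spread evenly.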
$$RMSE =\sqrt{\frac{\sum_{t=1}^T (\hat y_t - y_t)^2}{T}}$$

Here $\hat y_t$ is the forecast, $y_t$ is the actual value, and $T$ is the number of forecasts.

Systems identification uses statistical methods to build models of dynamical systems from their observed input and output signals. A dynamical system can be an economic system such as a stock market or, in our case, Filecoin's network gas economy. Our goal is to determine a mathematical model of what is occurring from measurements of the system's behavior and its external inputs. Depending on the level of prior knowledge of the system, we could use a white-box, grey-box, or black-box modeling approach. In our case, no prior model of the gas usage methods is available, so we use the black-box modeling paradigm.
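To illustrate what black-box identification means in practice, here is a hedged sketch that recovers the dynamics of a synthetic two-variable system purely from its observed outputs. It fits a first-order vector autoregression by ordinary least squares in plain NumPy, which is a simplification of the statsmodels `VAR` class used later in this notebook. The matrix `A_true`, the noise model, and the absence of an intercept are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" dynamics, used only to generate synthetic data.
A_true = np.array([[0.5, 0.1],
                   [0.0, 0.3]])

# Simulate y_t = A_true @ y_{t-1} + noise.
n_obs = 2000
y = np.zeros((n_obs, 2))
for t in range(1, n_obs):
    y[t] = A_true @ y[t - 1] + rng.normal(size=2)

# Black-box identification: regress each observation on the previous one,
# without using any knowledge of how the data was generated.
X, Y = y[:-1], y[1:]
B, *_ = np.linalg.lstsq(X, Y, rcond=None)   # least-squares solution of X @ B = Y
A_hat = B.T                                  # so that y_t is approximately A_hat @ y_{t-1}

# One-step-ahead forecast from the last observed state.
next_step = A_hat @ y[-1]
```

With enough observations, `A_hat` should land close to `A_true` even though the fitting step never sees it.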
To learn more about systems identification, visit the links in this write-up.
Backtesting is a process used to validate a model on historical data. With backtesting, a model is tested against a historic time series and compared to actual values to see how it would have performed, if it had been used during the historical period. Backtesting is a valuable tool for determining a model's domain performance, as long as its limitations are understood. Below we will enumerate the pros and cons.
Below we perform one-step forecasts with retraining, forecasting each day from August 30, 2021, through October 15, 2021.
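The general shape of one-step forecasting with retraining (often called walk-forward validation) can be sketched as follows. The `walk_forward` and `last_value` helpers are hypothetical and stand in for the VAR fit used in this notebook. The point is only that each forecast sees all data up to, but not including, the day being predicted.

```python
import numpy as np

def walk_forward(series, n_train, forecaster):
    """One-step forecasts, refitting on all data seen so far at each step."""
    predictions = []
    for i in range(n_train, len(series)):
        history = series[:i]              # everything before the target day
        predictions.append(forecaster(history))
    return np.array(predictions)

def last_value(history):
    # Hypothetical stand-in model: predict that tomorrow equals today.
    return history[-1]

# Illustrative series only (not Filecoin data).
series = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
preds = walk_forward(series, n_train=3, forecaster=last_value)
actuals = series[3:]
errors = preds - actuals
```

Because the training window grows by one observation per step, the number of forecasts equals the length of the series minus the initial training size.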
# Import libraries
import os
os.chdir('..')
from datetime import datetime
from statsmodels.tsa.api import VAR
from filecoin_digital_twin.modeling import gas_dynamics_VAR_prediction, gas_dynamics_VAR_invert
from filecoin_digital_twin.retrieve_data import pull_data, pull_message_count_data, process_message_count_data, compute_difference_vector
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from pandas import DataFrame
CONN_STRING_PATH = 'config/sentinel-conn-string.txt'
The first step is to obtain all of the data for our full period of study. We begin with daily values from July 1, 2021, which gives us 60 days (July 1 through August 29) of 'training' data before the first forecast. We query data through the end of our backtest, October 15, 2021.
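The window sizes implied by these dates can be checked with a little date arithmetic. This sketch assumes one row per calendar day with both endpoints included and no rows dropped. The name `n_train` is introduced here for illustration and corresponds to the literal 60 used in the slicing code further down.

```python
from datetime import datetime, timedelta

start_date = datetime(2021, 7, 1)
end_date = datetime(2021, 10, 15)
n_train = 60  # rows used for the initial fit (illustrative name)

# Inclusive count of daily rows, assuming one row per calendar day.
n_days = (end_date - start_date).days + 1
first_forecast_day = start_date + timedelta(days=n_train)
last_training_day = first_forecast_day - timedelta(days=1)
n_forecasts = n_days - n_train
```

Under those assumptions the full period spans 107 days, the initial training window ends on August 29, and 47 one-step forecasts cover August 30 through October 15.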
start_date = datetime(2021, 7, 1)
end_date = datetime(2021, 10, 15)
truncation_interval = 'DAY'
## compute_gas_dynamics_vector - obtain macro variables for system identification
macro_data = pull_data(truncation_interval=truncation_interval,
                       start_date=start_date,
                       end_date=end_date)
# Pull and process message_count_data
message_count_data = pull_message_count_data(truncation_interval=truncation_interval,
                                             start_date=start_date,
                                             end_date=end_date,
                                             CONN_STRING_PATH=CONN_STRING_PATH)
message_count_data = process_message_count_data(message_count_data, truncation_interval, start_date, end_date)
# Pivot and fill null values with 0 for the vector
# (keyword arguments are required by pandas >= 2.0)
vector = message_count_data.pivot(index="datetime", columns="Actor-Method", values="percentage_gas_used").fillna(0)
# Collapse unknown columns
unknown_cols = [x for x in vector.columns if x.startswith("<unknown>-")]
vector["unknown"] = vector[unknown_cols].sum(axis=1)
vector = vector.drop(columns=unknown_cols)
# Join macro data into the vector
vector = vector.join(macro_data)
# subset to prediction columns
prediction_columns = ['fil/5/account-0',
                      'fil/5/storagemarket-2', 'fil/5/storagemarket-4', 'fil/5/storageminer-11',
                      'fil/5/storageminer-16', 'fil/5/storageminer-25', 'fil/5/storageminer-26',
                      'fil/5/storageminer-5', 'fil/5/storageminer-6', 'fil/5/storageminer-7']
vector = vector[prediction_columns]
#Get rid of rows with null values
vector = vector[~pd.isnull(vector).any(axis=1)]
# create an empty list to hold our predictions
predictions = []
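# subset actual values for comparison: the 47 days forecast by the loop below
# (.copy() avoids modifying a view of vector when the index is reset later)
actual = vector.iloc[60:60+47].copy()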
# iterate through each step training and forecasting the next day.
for i in range(0,47):
    training = vector.iloc[0:60+i]
    # Compute the difference vector
    diff_vector = compute_difference_vector(training)
    pred = gas_dynamics_VAR_prediction(diff_vector,'DAY', lag=5,steps=1)
    # Pull out prior state
    prior_state = training.iloc[-1]
    # Invert the prediction
    pred = gas_dynamics_VAR_invert(pred, prior_state, prediction_columns)
    predictions.append(pred)
# create a dataframe from the predictions
one_step_predictions_df_all = pd.concat(predictions).reset_index()
del one_step_predictions_df_all['index']
actual.reset_index(inplace=True)
del actual['datetime']
# calculate the RMSE of each column over the backtest (axis=0 aggregates across days)
RMSE = sm.tools.eval_measures.rmse(actual,one_step_predictions_df_all,axis=0)
RMSE = np.round(RMSE,decimals=4)
RMSE_dict = dict(zip(one_step_predictions_df_all.columns,RMSE))
# Plot the results
for x in one_step_predictions_df_all.columns:
    actual[x].plot(kind='line')
    one_step_predictions_df_all[x].plot(kind='line')
    title_text = str(x) + ' RMSE: {}'.format(RMSE_dict[x])
    plt.title(title_text)
    plt.legend(["Actual", "Prediction"])
    plt.show()
Our gas usage percentage forecasting model performed well on one-step forecasts during the backtesting period. August 30 through October 15, 2021, was a volatile period. The model's forecasts are more volatile than the actual data, but they correct quickly. In a subsequent notebook, we will examine ways to refine the model so that it behaves more like a filter, muting volatility.