EDA for Gas Fees

In this notebook we perform Exploratory Data Analysis (EDA) on FIL's gas fee mechanism. The goal is to observe the gas fee as a signal and attempt to understand what may be driving it.

We obtained access to Sentinel from Filecoin and will use the derived_gas_outputs table as the primary data source. Sentinel's Data Dictionary was obtained on 6/28/2021 and will drive subsequent analysis.

This notebook will do the following:

  1. Analyze the derived_gas_outputs continuous signals for June 2021, mean aggregated by seconds.
  2. Analyze the individual actors and methods for June 2021, and ascertain the most common methods, actors, and what drives gas usage.
  3. Perform Fourier Analysis of the derived_gas_outputs continuous signals to understand phase shifts, periodicity, and leading indicators.
  4. Combine the derived_gas_outputs with exogenous signals from the chain_economics, chain_powers, and chain_rewards tables.
  5. Perform a VAR analysis and Granger Causality on the key signals combined from #4.
  6. Generate conclusions from our research.

Change Log

TODO:

What are Gas Fees?

Note: this description is copied from the official Filecoin documentation

Executing messages, for example by including transactions or proofs in the chain, consumes both computation and storage resources on the network. Gas is a measure of resources consumed by messages. The gas consumed by a message directly affects the cost that the sender has to pay for it to be included in a new block by a miner.

Historically in other blockchains, miners specify a GasFee in a unit of native currency and then pay the block producing miners a priority fee based on how much gas is consumed by the message. Filecoin works similarly, except an amount of the fees is burned (sent to an irrecoverable address) to compensate for the network expenditure of resources, since all nodes need to validate the messages. The idea is based on Ethereum's EIP1559.

The amount of fees burned in the Filecoin network comes given by a dynamic BaseFee which gets automatically adjusted according to the network congestion parameters (block sizes). The current value can be obtained from one of the block explorers or by inspecting the current head.

Additionally, a number of gas-related parameters are attached to each message and determine the amount of rewards that miners get. Here's an overview of the terms and concepts:

GasUsage: the amount of gas that a message's execution actually consumes. Current protocol does not know how much gas a message will exactly consume ahead of execution, but it can be estimated (see prices). GasUsage is measured in units of Gas.

BaseFee: the amount of FIL that gets burned per unit of gas consumed for the execution of every message. It is measured in units of attoFIL/Gas.

GasLimit: the limit on the amount of gas that a message's execution can consume, estimated and specified by a message sender. It is measured in units of Gas. The sum of GasLimit for all messages included in a block must not exceed the BlockGasLimit. Messages will fail to execute if they run out of Gas, and any effects of the execution will be reverted.

GasFeeCap: the maximum token amount that a sender is willing to pay per GasUnit for including a message in a block. It is measured in units of attoFIL/Gas. A message sender must have a minimum balance of GasFeeCap * GasLimit when sending a message, even though not all of that will be consumed. GasFeeCap can serve as a safeguard against high, unexpected BaseFee fluctuations.

GasPremium: a priority fee that is paid to the block-producing miner. This is capped by GasFeeCap. The BaseFee has a higher priority. It is measured in units of attoFIL/Gas and can be as low as 1 attoFIL/Gas.

Overestimation burn: an additional amount of gas to burn that grows larger when the difference between GasLimit and GasUsage is large.

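The total cost of a message for a sender will be:

GasUsage * BaseFee FIL (burned) + GasLimit * GasPremium FIL (miner reward) + OverEstimationBurn * BaseFee FIL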

An important detail is that a message will always pay the burn fee, regardless of the GasFeeCap used. Thus, a low GasFeeCap may result in a reduced GasPremium or even a negative one! In that case, the miners that include a message will have to pay the needed amounts out of their own pockets, which means they are unlikely to include such messages in new blocks.

Filecoin implementations may choose the heuristics of how their miners select messages for inclusion in new blocks, but they will usually attempt to maximize the miner's rewards.

Data Resources

Sentinel Diagram

Data EDA

Below we download minute averages from the derived_gas_outputs table from June 1st, 2021 to June 30th, 2021. After downloading the data, we view the first and last 5 rows and compute basic statistics on the data.

derived_gas_outputs - copied from Sentinel's Data Dictionary

Derived gas costs resulting from execution of a message in the VM.

| Name | Type | Nullable | Description |
|---|---|---|---|
| actor_name | text | NO | Human readable identifier for the type of the actor. |
| base_fee_burn | text | NO | The amount of FIL (in attoFIL) to burn as a result of the base fee. It is parent_base_fee (or gas_fee_cap if smaller) multiplied by gas_used. Note: successful window PoSt messages are not charged this burn. |
| cid | text | NO | CID of the message. |
| exit_code | bigint | NO | The exit code that was returned as a result of executing the message. Exit code 0 indicates success. Codes 0-15 are reserved for use by the runtime. Codes 16-31 are common codes shared by different actors. Codes 32+ are actor specific. |
| from | text | NO | Address of actor that sent the message. |
| gas_burned | bigint | NO | The overestimated units of gas to burn. It is a portion of the difference between gas_limit and gas_used. |
| gas_fee_cap | text | NO | The maximum price that the message sender is willing to pay per unit of gas. |
| gas_limit | bigint | YES | A hard limit on the amount of gas (i.e., number of units of gas) that a message’s execution should be allowed to consume on chain. It is measured in units of gas. |
| gas_premium | text | NO | The price per unit of gas (measured in attoFIL/gas) that the message sender is willing to pay (on top of the BaseFee) to "tip" the miner that will include this message in a block. |
| gas_refund | bigint | NO | The overestimated units of gas to refund. It is a portion of the difference between gas_limit and gas_used. |
| gas_used | bigint | NO | A measure of the amount of resources (or units of gas) consumed, in order to execute a message. |
| height | bigint | NO | Epoch this message was executed at. |
| method | bigint | YES | The method number to invoke. Only unique to the actor the method is being invoked on. A method number of 0 is a plain token transfer - no method execution. |
| miner_penalty | text | NO | Any penalty fees (in attoFIL) the miner incurred while executing the message. |
| miner_tip | text | NO | The amount of FIL (in attoFIL) the miner receives for executing the message. Typically it is gas_premium * gas_limit but may be lower if the total fees exceed the gas_fee_cap. |
| nonce | bigint | YES | The message nonce, which protects against duplicate messages and multiple messages with the same values. |
| over_estimation_burn | text | NO | The fee to pay (in attoFIL) for overestimating the gas used to execute a message. The overestimated gas to burn (gas_burned) is a portion of the difference between gas_limit and gas_used. The over_estimation_burn value is gas_burned * parent_base_fee. |
| parent_base_fee | text | NO | The set price per unit of gas (measured in attoFIL/gas unit) to be burned (sent to an unrecoverable address) for every message execution. |
| refund | text | NO | The amount of FIL (in attoFIL) to refund to the message sender after base fee, miner tip and overestimation amounts have been deducted. |
| size_bytes | bigint | YES | Size in bytes of the serialized message. |
| state_root | text | NO | CID of the parent state root. |
| to | text | NO | Address of actor that received the message. |
| value | text | NO | The FIL value transferred (attoFIL) to the message receiver. |

Messages calculation mapping

Based on the message cost calculation outlined in Filecoin's official documentation, we will map the data obtained to this calculation.

Filecoin: message_cost_calculation = GasUsage * BaseFee FIL (burned) + GasLimit * GasPremium FIL (miner reward) + OverEstimationBurn * BaseFee FIL

Our downloaded data: message_cost = derived_gas_outputs.mean_gas_used * derived_gas_outputs.mean_base_fee_burn + derived_gas_outputs.mean_gas_limit * derived_gas_outputs.mean_gas_premium + derived_gas_outputs.mean_over_estimation_burn * derived_gas_outputs.mean_base_fee_burn
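As a minimal sketch of the Filecoin formula for a single message, the calculation can be written as a small Python function. The function and argument names are hypothetical, chosen to mirror the data dictionary columns, and every amount is assumed to be an integer (attoFIL/gas for prices, gas units for quantities). The example values are made up for illustration.

```python
def message_cost(gas_used, base_fee, gas_limit, gas_premium, gas_burned):
    """Total sender cost in attoFIL for one message (illustrative helper).

    base_fee and gas_premium are prices in attoFIL/gas; gas_used, gas_limit
    and gas_burned are amounts of gas. gas_burned is the overestimation
    burn expressed in gas units.
    """
    base_fee_burn = gas_used * base_fee            # burned
    miner_reward = gas_limit * gas_premium         # paid to the block producer
    over_estimation_burn = gas_burned * base_fee   # burned for overestimating
    return base_fee_burn + miner_reward + over_estimation_burn


# Hypothetical values, not taken from the chain.
example_cost = message_cost(
    gas_used=1_000, base_fee=100, gas_limit=1_200, gas_premium=2, gas_burned=50
)
print(example_cost)
```

Plain Python integers are used because they have arbitrary precision, and attoFIL amounts can exceed the 64-bit range (the table stores them as text).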

We will focus on analyzing the methods by actor type using the derived_gas_outputs table. The data is very large, so we will perform individual SQL queries in order to obtain all of the required data.

Count of all actor types and methods used during June.

In order to understand what the specific methods signify, we have begun creating a mapping table from Filecoin's system code:

https://docs.google.com/spreadsheets/d/13sfwHtT1YO94a37JmA956a1HcFJ_FtMQegu1oTE3WaQ/edit#gid=0

Method types are not unique across actor types. Without having this formalized and connected to git commits, these values may be inaccurate.

Below we will query all Storage Market Actor transactions for June.

Time Analysis

When examining the derived_gas_outputs_message_level, we developed the following two questions:

  1. Is the distribution of intervals between data points consistent?
  2. What type of sampling time do we have?

Workflow:

1. Verify the sampling intervals.
2. Verify the distribution of the intervals; if it is not normal, check other candidates (Poisson, etc.).

To answer these questions, we will calculate the difference between consecutive timestamps, create a histogram of these differences, and determine whether the data is sampled at equal time intervals.
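The interval check described above could look roughly like the following sketch. It uses NumPy only; the helper names and the four example timestamps are hypothetical and stand in for the timestamp column of the downloaded data.

```python
import numpy as np


def sampling_intervals(timestamps):
    """Differences in seconds between consecutive (sorted) timestamps."""
    ts = np.sort(np.asarray(timestamps, dtype="datetime64[s]"))
    return np.diff(ts).astype("timedelta64[s]").astype(np.int64)


def is_evenly_sampled(timestamps):
    """True when every consecutive gap has the same length."""
    deltas = sampling_intervals(timestamps)
    return deltas.size > 0 and bool(np.all(deltas == deltas[0]))


# Hypothetical timestamps: one minute apart, with a single missing minute.
example = [
    "2021-06-01T00:00:00",
    "2021-06-01T00:01:00",
    "2021-06-01T00:02:00",
    "2021-06-01T00:04:00",
]
deltas = sampling_intervals(example)

# Counts per distinct gap length, i.e. the data behind a histogram.
gap_lengths, gap_counts = np.unique(deltas, return_counts=True)
print(dict(zip(gap_lengths.tolist(), gap_counts.tolist())))
print(is_evenly_sampled(example))
```

A single distinct gap length indicates equal time sampling; several distinct lengths indicate gaps or irregular sampling that should be handled before spectral analysis.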

Exogenous signals

https://github.com/filecoin-project/sentinel/blob/master/docs/db.md

chain_economics

Economic summaries per state root CID.

| Name | Type | Nullable | Description |
|---|---|---|---|
| burnt_fil | text | NO | Total FIL (attoFIL) burned as part of penalties and on-chain computations. |
| circulating_fil | text | NO | The amount of FIL (attoFIL) circulating and tradeable in the economy. The basis for Market Cap calculations. |
| locked_fil | text | NO | The amount of FIL (attoFIL) locked as part of mining, deals, and other mechanisms. |
| mined_fil | text | NO | The amount of FIL (attoFIL) that has been mined by storage miners. |
| parent_state_root | text | NO | CID of the parent state root. |
| vested_fil | text | NO | Total amount of FIL (attoFIL) that is vested from genesis allocation. |

chain_powers

Power summaries from the Power actor.

| Name | Type | Nullable | Description |
|---|---|---|---|
| height | bigint | NO | Epoch this power summary applies to. |
| miner_count | bigint | YES | Total number of miners. |
| participating_miner_count | bigint | YES | Total number of miners with power above the minimum miner threshold. |
| qa_smoothed_position_estimate | text | NO | Total power smoothed position estimate - Alpha Beta Filter "position" (value) estimate in Q.128 format. |
| qa_smoothed_velocity_estimate | text | NO | Total power smoothed velocity estimate - Alpha Beta Filter "velocity" (rate of change of value) estimate in Q.128 format. |
| state_root | text | NO | CID of the parent state root. |
| total_pledge_collateral | text | NO | Total locked FIL (attoFIL) miners have pledged as collateral in order to participate in the economy. |
| total_qa_bytes_committed | text | NO | Total provably committed, quality adjusted storage power in bytes. Quality adjusted power is a weighted average of the quality of its space and it is based on the size, duration and quality of its deals. |
| total_qa_bytes_power | text | NO | Total quality adjusted storage power in bytes in the network. Quality adjusted power is a weighted average of the quality of its space and it is based on the size, duration and quality of its deals. |
| total_raw_bytes_committed | text | NO | Total provably committed storage power in bytes. Raw byte power is the size of a sector in bytes. |
| total_raw_bytes_power | text | NO | Total storage power in bytes in the network. Raw byte power is the size of a sector in bytes. |

chain_rewards

Reward summaries from the Reward actor.

| Name | Type | Nullable | Description |
|---|---|---|---|
| cum_sum_baseline | text | NO | Target that CumsumRealized needs to reach for EffectiveNetworkTime to increase. It is measured in byte-epochs (space * time) representing power committed to the network for some duration. |
| cum_sum_realized | text | NO | Cumulative sum of network power capped by BaselinePower(epoch). It is measured in byte-epochs (space * time) representing power committed to the network for some duration. |
| effective_baseline_power | text | NO | The baseline power (in bytes) at the EffectiveNetworkTime epoch. |
| effective_network_time | bigint | YES | Ceiling of real effective network time "theta" based on CumsumBaselinePower(theta) == CumsumRealizedPower. Theta captures the notion of how much the network has progressed in its baseline and in advancing network time. |
| height | bigint | NO | Epoch this rewards summary applies to. |
| new_baseline_power | text | NO | The baseline power (in bytes) the network is targeting. |
| new_reward | text | YES | The reward to be paid per WinCount to block producers. The actual reward total paid out depends on the number of winners in any round. This value is recomputed every non-null epoch and used in the next non-null epoch. |
| new_reward_smoothed_position_estimate | text | NO | Smoothed reward position estimate - Alpha Beta Filter "position" (value) estimate in Q.128 format. |
| new_reward_smoothed_velocity_estimate | text | NO | Smoothed reward velocity estimate - Alpha Beta Filter "velocity" (rate of change of value) estimate in Q.128 format. |
| state_root | text | NO | CID of the parent state root. |
| total_mined_reward | text | NO | The total FIL (attoFIL) awarded to block miners. |

Fourier Transform Analysis

We will plot each signal, save for the timestamp, below and perform Fourier transforms to search for periodicity.

A Fourier transform (FT) is a mathematical method for decomposing a signal into a sum of periodic components. It is used frequently in signal processing to understand trends and for filtering. We will use the common Fast Fourier Transform (FFT) algorithm to calculate the discrete Fourier transform (DFT) of each signal.
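To make the idea concrete, here is a minimal sketch of finding the strongest periodic component of a signal with NumPy's FFT routines. The helper name and the synthetic hourly signal are assumptions for illustration; the real inputs would be the derived_gas_outputs columns.

```python
import numpy as np


def dominant_period(signal, sample_spacing=1.0):
    """Period (in units of sample_spacing) of the strongest non-constant component."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()                      # remove the constant (zero-frequency) offset
    spectrum = np.abs(np.fft.rfft(x))     # magnitude of each frequency component
    freqs = np.fft.rfftfreq(x.size, d=sample_spacing)
    peak = np.argmax(spectrum[1:]) + 1    # skip the zero-frequency bin
    return 1.0 / freqs[peak]


# Synthetic hourly signal over 28 days with one cycle every 84 hours (3.5 days).
hours = np.arange(28 * 24)
synthetic = 10.0 + np.sin(2 * np.pi * hours / 84.0)

period_hours = dominant_period(synthetic, sample_spacing=1.0)
print(period_hours)
```

The real-input variants (`rfft`, `rfftfreq`) are used because the signals are real-valued, so the negative-frequency half of the spectrum carries no extra information.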

Resources:

Based on the Fourier decompositions, it appears that there is some periodicity to the data, with two spikes a week, approximately on Monday and Thursday.

Decomposed Phase Shifts Overlay

To understand which signals may be leading or lagging indicators, we will overlay the Fourier-decomposed components, in pairs, for analysis.
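A numeric complement to the visual overlay is to compare the phases of two signals at their shared dominant frequency. The sketch below is an assumption about how this could be done, not the notebook's own method: `phase_lag` is a hypothetical helper, and the two sine waves are synthetic.

```python
import numpy as np


def phase_lag(a, b, sample_spacing=1.0):
    """Delay of b relative to a at a's dominant frequency (positive: b lags a)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    fa = np.fft.rfft(a - a.mean())
    fb = np.fft.rfft(b - b.mean())
    k = np.argmax(np.abs(fa[1:])) + 1                 # dominant non-constant bin of a
    freq = np.fft.rfftfreq(a.size, d=sample_spacing)[k]
    # Phase of a minus phase of b, wrapped into (-pi, pi].
    delta_phi = np.angle(fa[k] * np.conj(fb[k]))
    return delta_phi / (2 * np.pi * freq)


# Synthetic pair: the follower repeats the leader 5 samples later.
t = np.arange(672)
leader = np.sin(2 * np.pi * t / 84.0)
follower = np.sin(2 * np.pi * (t - 5) / 84.0)

lag = phase_lag(leader, follower)
print(lag)
```

Because the phase difference is wrapped, the estimate is only unambiguous for delays shorter than half the dominant period.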

Gas premium appears to be a leading indicator for gas fee cap. There is no phase shift (peak to peak distances line up).

Gas limit, at times, appears to be a slight leading indicator.

Message cost, at times, appears to be a slight leading indicator.

Below we will normalize the signals by their individual maximum peaks so that we can plot them on one graph. We will use Matplotlib instead of Plotly because Matplotlib handles complex numbers better.

Data processing

We will now remove the timestamp field and examine the data distributions and determine if any transformations are required prior to our VAR modeling.

We can see from the above histograms that our data is not normally distributed and will need to be transformed prior to modeling. We will take the log of the data to reduce the skewness and take the first difference to make the data stationary.

The derived_gas_outputs data looks normally distributed; however, the chain_economics and other chain signals are still non-normal. We will cautiously proceed.
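The log-and-difference transformation can be sketched in a couple of lines of NumPy. The helper name is hypothetical, and the sketch assumes every value is strictly positive, since the logarithm is undefined otherwise.

```python
import numpy as np


def log_difference(values):
    """First difference of the natural log: log(x[t]) - log(x[t-1])."""
    x = np.asarray(values, dtype=float)
    if np.any(x <= 0):
        raise ValueError("log transform requires strictly positive values")
    return np.diff(np.log(x))


# A series that grows by a constant factor becomes constant after the transform.
growth = [1.0, np.e, np.e ** 2, np.e ** 3]
transformed = log_difference(growth)
print(transformed)
```

Note that differencing shortens the series by one observation, so any aligned timestamp column needs its first row dropped as well.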

As a final check prior to modeling, we will run the Augmented Dickey-Fuller test to ensure that our data is stationary, i.e. has no unit root (a unit root is a stochastic trend in a time series). The test's hypotheses are:

H0: the series has a unit root (it is non-stationary).

H1: the series has no unit root (it is stationary).

Based on the Augmented Dickey-Fuller test, our preprocessing was successful and none of our univariate time series signals has a unit root. We can now proceed to the VAR model.
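To illustrate what a unit-root test measures, the sketch below implements the simple (non-augmented) Dickey-Fuller regression with NumPy. This is a simplified stand-in, not the augmented test: in practice a library routine such as `adfuller` from `statsmodels.tsa.stattools` would be used, which also adds lagged difference terms and reports a p-value. The helper name, the seeded synthetic series, and the use of the approximate large-sample 5% critical value of -2.86 (constant, no trend) are assumptions of this example.

```python
import numpy as np

# Approximate large-sample 5% critical value for the constant-only case.
DF_CRITICAL_5PCT = -2.86


def dickey_fuller_stat(series):
    """t statistic of rho in the regression diff(y)[t] = c + rho * y[t-1] + e[t]."""
    y = np.asarray(series, dtype=float)
    dy = np.diff(y)
    design = np.column_stack([np.ones(dy.size), y[:-1]])
    coef, *_ = np.linalg.lstsq(design, dy, rcond=None)
    resid = dy - design @ coef
    sigma2 = resid @ resid / (dy.size - design.shape[1])
    cov = sigma2 * np.linalg.inv(design.T @ design)
    return float(coef[1] / np.sqrt(cov[1, 1]))


rng = np.random.default_rng(1)
noise = rng.normal(size=500)              # stationary by construction
walk = np.cumsum(rng.normal(size=500))    # random walk, has a unit root

noise_stat = dickey_fuller_stat(noise)
walk_stat = dickey_fuller_stat(walk)

# A statistic below the critical value rejects H0 (unit root).
print(noise_stat < DF_CRITICAL_5PCT)
```

The statistic must be compared against Dickey-Fuller critical values rather than the usual t distribution, because its distribution under H0 is non-standard.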

Model Fit

To determine the ideal number of lags for our model, we will perform a heuristic search. We will fit our model with lag orders between 1 and 15 to ascertain which VAR order has the best (lowest) Akaike information criterion (AIC) score.

The Akaike information criterion (AIC) is an estimator of prediction error, rooted in information theory. Given a collection of models for the data, AIC estimates the quality of each model, relative to each of the other models as a means for model selection.

When a statistical model is used to represent the process that generated the data, the representation will rarely be exact; so some information will be lost by using the model to represent the process. AIC estimates the relative amount of information lost by a given model: the less information a model loses, the higher the quality of that model.

In estimating the amount of information lost by a model, AIC deals with the trade-off between the goodness of fit of the model and the simplicity of the model. In other words, AIC deals with both the risk of overfitting and the risk of underfitting.

Below is the equation for AIC, where $k$ is the number of estimated parameters in the model and $\hat L$ is the maximum value of the likelihood function for the model:

$$\mathrm{AIC} \, = \, 2k - 2\ln(\hat L)$$

Given a set of candidate models for the data, the preferred model is the one with the minimum AIC value; the sign of the AIC value does not matter, only the ordering across models. AIC rewards goodness of fit but also includes a penalty for each additional parameter, which discourages overfitting.

Paraphrased source: https://en.wikipedia.org/wiki/Akaike_information_criterion

Based on our analysis, a lag of 10 appears to be optimal.
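As a small worked example of AIC-based model selection, the sketch below computes $2k - 2\ln(\hat L)$ for a few candidates and picks the minimum. The helper names, log-likelihoods, and parameter counts are hypothetical and are not results from this notebook.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L_hat)."""
    return 2 * n_params - 2 * log_likelihood


def best_by_aic(candidates):
    """Return the key whose (log_likelihood, n_params) pair has the lowest AIC."""
    return min(candidates, key=lambda name: aic(*candidates[name]))


# Hypothetical fits: lag order -> (maximized log-likelihood, parameter count).
candidates = {
    1: (-120.0, 2),
    2: (-100.0, 4),
    3: (-99.5, 6),
}

scores = {lag: aic(*fit) for lag, fit in candidates.items()}
best_lag = best_by_aic(candidates)
print(scores, best_lag)
```

In this made-up example the third candidate fits slightly better than the second but pays for two extra parameters, so the second candidate has the lowest AIC.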

Granger causality

Granger causality is a hypothesis test for determining whether one time series is useful in forecasting another. We say that a variable X that evolves over time Granger-causes another evolving variable Y if predictions of the value of Y based on its own past values and on the past values of X are better than predictions of Y based only on Y's own past values.

Granger causality is a relationship based on the following principles:

  1. The cause happens prior to its effect.
  2. The cause has unique information about the future values of its effect.

Given these two assumptions about causality, Granger proposed to test the following hypothesis for identification of a causal effect of $X$ on $Y$: $$\mathbb{P}[Y(t+1) \in A\mid \mathcal{I}(t)] \neq \mathbb{P}[Y(t+1) \in A\mid \mathcal{I}_{-X}(t)]$$ where $\mathbb{P}$ refers to probability, $A$ is an arbitrary non-empty set, and $\mathcal{I}(t)$ and $\mathcal{I}_{-X}(t)$ respectively denote the information available as of time $t$ in the entire universe, and that in the modified universe in which $X$ is excluded. If the above hypothesis is accepted, we say that $X$ Granger causes $Y$.

In our analysis, we present the hypothesis that gas_used is a driver of message cost. In statistical parlance, we have the following:

H0: gas_used does not Granger-cause message cost.

H1: gas_used Granger-causes message cost.

Granger causality assumes that the time series are stationary, which we verified above with the Augmented Dickey-Fuller test, and autoregressive lags greater than 1.

We will now perform the Granger causality hypothesis test with an $\alpha = 0.05$ value using an F test to determine whether gas used has any causal component for predicting the message cost. If the p-value of the test (the probability of obtaining test results at least as extreme as the results observed) is less than or equal to $\alpha$, we will reject the null hypothesis and determine that gas used is a driver of message cost.

Paraphrased source:

As we have many signals in the analysis, we will loop through all the signals, perform the Granger causality test on each pair, and save the results for analysis.
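The pairwise loop can be sketched as follows. This is a NumPy-only illustration of the F test behind Granger causality, not the notebook's implementation: the helper names are hypothetical, the two signals are synthetic (y is built from the previous value of x), and only the F statistic is returned. A p-value would be obtained from the F distribution with (lags, residual degrees of freedom) degrees of freedom.

```python
import itertools

import numpy as np


def lagged(series, lags):
    """Columns series[t-1], ..., series[t-lags] for t = lags .. n-1."""
    n = len(series)
    return np.column_stack([series[lags - k : n - k] for k in range(1, lags + 1)])


def granger_f_stat(cause, effect, lags=1):
    """F statistic for 'cause does not Granger-cause effect' (H0)."""
    cause = np.asarray(cause, dtype=float)
    effect = np.asarray(effect, dtype=float)
    target = effect[lags:]
    const = np.ones((target.size, 1))
    restricted = np.hstack([const, lagged(effect, lags)])      # own past only
    unrestricted = np.hstack([restricted, lagged(cause, lags)])  # plus past of cause

    def rss(design):
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        return float(resid @ resid)

    rss_r, rss_u = rss(restricted), rss(unrestricted)
    dof = target.size - unrestricted.shape[1]
    return ((rss_r - rss_u) / lags) / (rss_u / dof)


# Synthetic signals: y depends on the previous value of x, not the reverse.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = np.zeros(500)
y[1:] = 0.8 * x[:-1] + 0.1 * rng.normal(size=499)
signals = {"x": x, "y": y}

# Test every ordered pair (cause, effect) and keep the statistic.
results = {
    (cause, effect): granger_f_stat(signals[cause], signals[effect], lags=2)
    for cause, effect in itertools.permutations(signals, 2)
}
print(results)
```

A large F statistic for a (cause, effect) pair means the past of the cause reduces the prediction error of the effect well beyond what the effect's own past achieves.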

Based on the heatmap above, we can see that there are some Granger-causal relationships between signals. To see more granular specifics, we will examine the pairs for which H0 was rejected below.

Based on the above table, we can see some Granger-causal relationships between signals, such as mean_gas_fee_cap Granger-causing mean_gas_limit. We cannot fully rely on the Granger results until we better understand what the signals are and what they represent.

Conclusion

Behavior Model

For our Digital Twin of the Filecoin system, we need to construct a behavior model to forecast daily values. This is a very rough V1.

Behavior next steps

Based on the histograms above, we have some stochastic signals, such as mean_base_fee_burn_log_differenced, and some close-to-constant signals, such as mean_vested_fil_log_differenced. We will divide the dataset into two groups, stochastic and constant (some members of the latter are not truly constant, but are highly skewed), and perform different manipulations on each to create a behavior model.