The pyprocessta API reference¶
Preprocessing¶
Sometimes, different kinds of measurements are sampled at different intervals. This module provides utilities to combine such data. We always operate on pandas dataframes with a datetime index.
- pyprocessta.preprocess.align.align_two_dfs(df_a, df_b, interpolation='linear')[source]¶
Aligns two dataframes with DatetimeIndex. Resamples both dataframes onto the grid of the dataframe with the lowest frequency (largest timestep). The first timepoint in the new dataframe is the later of the two dataframes' first observations.
See also: https://stackoverflow.com/questions/47148446/pandas-resample-interpolate-is-producing-nans and https://stackoverflow.com/questions/66967998/pandas-interpolation-giving-odd-results
- Parameters
df_a (pd.DataFrame) – Dataframe
df_b (pd.DataFrame) – Dataframe
interpolation (Union[str, int], optional) – Interpolation method. If you provide an integer, spline interpolation of that order will be used. Defaults to “linear”.
- Returns
merged dataframe
- Return type
pd.DataFrame
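As a rough illustration of the alignment step, the following is a plain-pandas sketch (not the implementation used by align_two_dfs): the fine-grained dataframe is moved onto the coarser grid, interpolating where needed, and the two frames are joined.

```python
import pandas as pd

# two dataframes sampled at different rates: 1 min and 5 min
df_a = pd.DataFrame(
    {"a": range(11)},
    index=pd.date_range("2021-01-01 00:00", periods=11, freq="1min"),
)
df_b = pd.DataFrame(
    {"b": [0.0, 1.0, 2.0]},
    index=pd.date_range("2021-01-01 00:00", periods=3, freq="5min"),
)

# move the fine-grained frame onto the coarser 5-min grid and join
aligned = pd.concat(
    [df_a.reindex(df_b.index).interpolate("linear"), df_b], axis=1
)
print(aligned)
```

The merged dataframe has one row per coarse-grid timestamp, with both columns present.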
This module contains basic data cleaning functions.
- pyprocessta.preprocess.clean.drop_duplicated_indices(df)[source]¶
If one concatenates dataframes, there might be duplicated indices. This can lead to problems, e.g., in interpolation steps. One easy solution is to drop the duplicated rows.
- Parameters
df (Union[pd.Series, pd.DataFrame]) – Input data
- Returns
Data without duplicated indices
- Return type
Union[pd.Series, pd.DataFrame]
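A minimal sketch of this deduplication in plain pandas (assuming "keep the first occurrence" semantics; the library's exact behavior may differ):

```python
import pandas as pd

df1 = pd.DataFrame({"x": [1, 2]}, index=pd.to_datetime(["2021-01-01", "2021-01-02"]))
df2 = pd.DataFrame({"x": [3, 4]}, index=pd.to_datetime(["2021-01-02", "2021-01-03"]))

concatenated = pd.concat([df1, df2])  # 2021-01-02 now appears twice
# keep only the first row for each duplicated index label
deduplicated = concatenated[~concatenated.index.duplicated(keep="first")]
```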
Some time series contain a trend component that does not interest us, e.g., because we have domain knowledge that the trend is due to another phenomenon such as instrument drift. In this case, we might want to remove the trend component for further modeling. The same holds for the variance: if the variance increases over time, one might want to remove this effect using a Box-Cox transformation [1].
References: [1] https://otexts.com/fpp2/transformations.html#mathematical-transformations
- pyprocessta.preprocess.detrend.detrend_linear_deterministc(data)[source]¶
Removes a deterministic linear trend from a series. Note that we assume that the data is sampled on a regular grid; the trend is estimated as

np.arange(len(series)) * (series.iloc[end] - series.iloc[start]) / (end - start)
- Parameters
data (Union[pd.Series, pd.DataFrame]) – Data to detrend. In case of dataframes we detrend every column separately.
- Returns
Detrended data
- Return type
Union[pd.Series, pd.DataFrame]
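A sketch of this endpoint-based trend estimate, with start and end taken as the first and last observation (an assumption for illustration):

```python
import numpy as np
import pandas as pd

index = pd.date_range("2021-01-01", periods=100, freq="1min")
rng = np.random.default_rng(0)
# noisy series with a linear trend of slope 0.5 per step
series = pd.Series(0.5 * np.arange(100) + rng.normal(0, 0.1, 100), index=index)

# slope estimated from the first and last observation on the regular grid
slope = (series.iloc[-1] - series.iloc[0]) / (len(series) - 1)
detrended = series - np.arange(len(series)) * slope
```

After subtraction, the first and last detrended values coincide by construction, and the trend is gone up to the noise level.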
- pyprocessta.preprocess.detrend.detrend_stochastic(data)[source]¶
Detrends time series data using the difference method y_t - y_{t-1}. This is useful to remove stochastic trends (random walk with trend).
- Parameters
data (Union[pd.Series, pd.DataFrame]) – Time series data to detrend
- Returns
Differenced data
- Return type
Union[pd.Series, pd.DataFrame]
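The difference method y_t - y_{t-1} can be sketched directly with pandas:

```python
import pandas as pd

series = pd.Series([1.0, 3.0, 6.0, 10.0])
# first difference y_t - y_{t-1}; the first element is NaN and is dropped
differenced = series.diff().dropna()
```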
Often, data is not sampled on a regular grid. This module provides utilities to regularize such data.
- pyprocessta.preprocess.resample.resample_regular(df, interval='10min', interpolation='linear', start_time=None)[source]¶
Resamples the dataframe at a desired interval.
- Parameters
df (pd.DataFrame) – Input dataframe
interval (str, optional) – Resampling interval. Defaults to “10min”.
interpolation (Union[str, int], optional) – Interpolation method. If you provide an integer, spline interpolation of that order will be used. Defaults to “linear”.
- Returns
Output data.
- Return type
pd.DataFrame
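A plain-pandas sketch of such regular resampling (not the library's exact implementation): irregular observations are binned onto a regular grid, and empty bins are interpolated.

```python
import pandas as pd

# irregularly sampled observations
index = pd.to_datetime(
    ["2021-01-01 00:00", "2021-01-01 00:07", "2021-01-01 00:23"]
)
df = pd.DataFrame({"x": [0.0, 7.0, 23.0]}, index=index)

# bin onto a regular 10-minute grid, then fill bins without observations
regular = df.resample("10min").mean().interpolate("linear")
```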
- pyprocessta.preprocess.smooth.exponential_window_smoothing(data, window_size, aggregation='mean')[source]¶
- Parameters
data (Union[pd.Series, pd.DataFrame]) – Data to smooth
window_size (int) – size for the exponential window
aggregation (str, optional) – Aggregation function. Defaults to “mean”.
- Returns
Smoothed data
- Return type
Union[pd.Series, pd.DataFrame]
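Exponential-window smoothing of this kind can be sketched with pandas' ewm (an approximation for illustration; the exact window and aggregation semantics of exponential_window_smoothing may differ):

```python
import pandas as pd

series = pd.Series([1.0, 2.0, 10.0, 3.0])
# exponentially weighted mean; span plays the role of the window size,
# and .mean() corresponds to aggregation="mean"
smoothed = series.ewm(span=2).mean()
```

The spike at 10.0 is pulled toward the preceding values instead of dominating the output.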
- pyprocessta.preprocess.smooth.z_score_filter(data, threshold=2, window=10)[source]¶
Replaces spikes (values > threshold * z_score) with the median of the window values before.
- Parameters
data (Union[pd.Series, pd.DataFrame]) – Series to despike
threshold (float, optional) – Threshold on the z-score. Defaults to 2.
window (int, optional) – Window used for the median with which the spike value is replaced. This median only looks back.
- Returns
Despiked series
- Return type
Union[pd.Series, pd.DataFrame]
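A hypothetical re-implementation sketch of such a z-score filter (not the library's code): points whose z-score exceeds the threshold are replaced with the median of the preceding window.

```python
import numpy as np
import pandas as pd

def z_score_despike(series: pd.Series, threshold: float = 2, window: int = 10) -> pd.Series:
    """Replace points with |z-score| > threshold by the median of the
    preceding window (backward-looking only)."""
    z_scores = (series - series.mean()) / series.std()
    despiked = series.copy()
    for i in np.where(z_scores.abs() > threshold)[0]:
        despiked.iloc[i] = series.iloc[max(0, i - window):i].median()
    return despiked

data = pd.Series([1.0] * 20 + [100.0] + [1.0] * 20)  # one obvious spike
clean = z_score_despike(data)
```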
EDA¶
- pyprocessta.eda.statistics.check_granger_causality(x, y, max_lag=20, add_constant=True)[source]¶
Checks if series x is Granger causal for series y. We reject the null hypothesis that x does not Granger cause y if the p-values are below the desired size of the test.
- Parameters
x (pd.Series) – Time series.
y (pd.Series) – Time series.
max_lag (int, optional) – Maximum lag to use for the causality checks. Defaults to 20.
add_constant (bool, optional) – Whether to include a constant term in the test regressions. Defaults to True.
- Returns
results dictionary
- Return type
dict
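The intuition behind the test can be sketched with a plain least-squares comparison of a restricted lag model (y on its own past) against an unrestricted one that adds lagged x; this is only an illustration, not the library's implementation.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=500)
y = np.zeros(500)
for t in range(1, 500):
    # y is driven by the previous value of x, so x should Granger cause y
    y[t] = 0.9 * x[t - 1] + 0.1 * rng.normal()

target = y[1:]
restricted = np.column_stack([np.ones(499), y[:-1]])          # y_t ~ y_{t-1}
unrestricted = np.column_stack([np.ones(499), y[:-1], x[:-1]])  # ... + x_{t-1}

def rss(design, resp):
    coef, *_ = np.linalg.lstsq(design, resp, rcond=None)
    return np.sum((resp - design @ coef) ** 2)

# large F statistic: the lagged x strongly improves the fit
f_stat = (rss(restricted, target) - rss(unrestricted, target)) / (
    rss(unrestricted, target) / (499 - 3)
)
```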
- pyprocessta.eda.statistics.check_stationarity(series, threshold=0.05, regression='c')[source]¶
Performs the augmented Dickey-Fuller (ADF) and Kwiatkowski-Phillips-Schmidt-Shin (KPSS) tests for stationarity.
- Parameters
series (pd.Series) – Time series data
threshold (float, optional) – p-value threshold for the statistical tests. Defaults to 0.05.
regression (str, optional) – If regression=”c”, the tests check for stationarity around a constant. For “ct”, the tests check for stationarity around a trend. Defaults to “c”.
- Returns
Results dictionary with key “stationary” that has a bool as value
- Return type
dict
Causal impact analysis¶
Causal impact analysis uses machine learning to construct a counterfactual (what the results would have been without an intervention), which can be used to estimate an effect size without the need for a control group. A good introduction is https://www.youtube.com/watch?v=GTgZfCltMm8; the original paper is https://storage.googleapis.com/pub-tools-public-publication-data/pdf/41854.pdf. The particular implementation we use is described in https://towardsdatascience.com/implementing-causal-impact-on-top-of-tensorflow-probability-c837ea18b126.
One can use covariates to build the counterfactual, but one needs to be careful that they are not changed by the intervention.
- pyprocessta.causalimpact.run_causal_impact_analysis(df, x_columns, intervention_column, y_column, start, end, p_value_threshold=0.05)[source]¶
Runs the causal impact analysis. Here, we use as covariates all the x columns that are not related to the intervention variable.
- Parameters
df (pd.DataFrame) – Dataframe to run the analysis on
x_columns (List[str]) – All column names that can potentially be used as covariates for the counterfactual model
intervention_column (str) – Name of the column on which the intervention has been performed
y_column (str) – Target column on which we want to understand the effect of the intervention
start (List) – Two elements defining the pre-intervention interval
end (List) – Two elements defining the post-intervention interval
p_value_threshold (float) – H0 that x does not Granger cause y is rejected when p smaller this threshold. Defaults to 0.05.
- Returns
causalimpact object
- Return type
object
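The counterfactual idea can be sketched with a simple pre-period regression (a toy stand-in for the Bayesian structural time-series model the actual implementation uses): fit the target on a covariate before the intervention, predict what the target would have been afterwards, and take the difference as the effect.

```python
import numpy as np

rng = np.random.default_rng(1)
n_pre, n_post = 100, 50
covariate = rng.normal(10.0, 1.0, n_pre + n_post)  # not changed by the intervention
y = 2.0 * covariate + rng.normal(0.0, 0.1, n_pre + n_post)
y[n_pre:] += 5.0  # the intervention lifts y by 5 in the post period

# fit y ~ covariate on the pre-intervention period only
design = np.column_stack([np.ones(n_pre), covariate[:n_pre]])
coef, *_ = np.linalg.lstsq(design, y[:n_pre], rcond=None)

# counterfactual: predicted y without the intervention
counterfactual = coef[0] + coef[1] * covariate[n_pre:]
estimated_effect = float(np.mean(y[n_pre:] - counterfactual))
```

With a covariate that is unaffected by the intervention, the mean difference between the observed and counterfactual post-period recovers the true effect of 5.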