The pyprocessta API reference

Preprocessing

Sometimes, different kinds of measurements are sampled at different intervals. This module provides utilities to combine such data. We always operate on pandas dataframes with datetime indexing.

pyprocessta.preprocess.align.align_two_dfs(df_a, df_b, interpolation='linear')[source]

Aligns two dataframes with DatetimeIndex. Resamples both dataframes onto the grid of the dataframe with the lower sampling frequency (i.e., the larger timestep). The first timepoint in the new dataframe is the later of the two dataframes' first observations.

See also:
https://stackoverflow.com/questions/47148446/pandas-resample-interpolate-is-producing-nans
https://stackoverflow.com/questions/66967998/pandas-interpolation-giving-odd-results

Parameters
  • df_a (pd.DataFrame) – First dataframe

  • df_b (pd.DataFrame) – Second dataframe

  • interpolation (Union[str, int], optional) – Interpolation method. If you provide an integer, spline interpolation of that order will be used. Defaults to “linear”.

Returns

Merged dataframe

Return type

pd.DataFrame
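
A minimal usage sketch (the column names and sampling rates are illustrative):

    import numpy as np
    import pandas as pd

    from pyprocessta.preprocess.align import align_two_dfs

    # temperature sampled every minute, flow only every 10 minutes
    index_fast = pd.date_range("2021-01-01", periods=60, freq="1min")
    index_slow = pd.date_range("2021-01-01", periods=6, freq="10min")

    df_temperature = pd.DataFrame({"T": np.random.rand(60)}, index=index_fast)
    df_flow = pd.DataFrame({"flow": np.random.rand(6)}, index=index_slow)

    # both frames are resampled onto the coarser (10 min) grid and merged
    merged = align_two_dfs(df_temperature, df_flow, interpolation="linear")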

This module contains basic data cleaning functions.

pyprocessta.preprocess.clean.drop_duplicated_indices(df)[source]

If one concatenates dataframes, there might be duplicated indices. This can lead to problems, e.g., in interpolation steps. One easy solution is to simply drop the duplicated rows.

Parameters

df (Union[pd.Series, pd.DataFrame]) – Input data

Returns

Data without duplicated indices

Return type

Union[pd.Series, pd.DataFrame]
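
For example (a minimal sketch; whether the first or the last duplicate is kept is an implementation detail):

    import pandas as pd

    from pyprocessta.preprocess.clean import drop_duplicated_indices

    # concatenating overlapping frames yields a duplicated 2021-01-02 index
    df = pd.concat(
        [
            pd.DataFrame({"a": [1, 2]}, index=pd.to_datetime(["2021-01-01", "2021-01-02"])),
            pd.DataFrame({"a": [3, 4]}, index=pd.to_datetime(["2021-01-02", "2021-01-03"])),
        ]
    )

    clean = drop_duplicated_indices(df)  # each timestamp now appears only once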

In some time series there is a trend component that does not interest us, e.g., because we have domain knowledge that this trend is due to another phenomenon like instrument drift. In this case, we might want to remove the trend component for further modeling. The same holds for the variance: if the variance increases over time, one might want to remove this effect using a Box-Cox transformation [1].

References: [1] https://otexts.com/fpp2/transformations.html#mathematical-transformations

pyprocessta.preprocess.detrend.detrend_linear_deterministc(data)[source]

Removes a deterministic linear trend from a series. Note that we assume that the data is sampled on a regular grid and we estimate the trend as

    np.arange(len(series)) * (series.iloc[end] - series.iloc[start]) / (end - start)

Parameters

data (Union[pd.Series, pd.DataFrame]) – Data to detrend. In case of dataframes we detrend every column separately.

Returns

Detrended data

Return type

Union[pd.Series, pd.DataFrame]
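
A minimal sketch on a synthetic series with a known linear drift:

    import numpy as np
    import pandas as pd

    from pyprocessta.preprocess.detrend import detrend_linear_deterministc

    index = pd.date_range("2021-01-01", periods=100, freq="1h")
    # noise superimposed on a deterministic linear drift
    series = pd.Series(
        np.linspace(0, 10, 100) + np.random.normal(scale=0.1, size=100), index=index
    )

    detrended = detrend_linear_deterministc(series)  # drift removed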

pyprocessta.preprocess.detrend.detrend_stochastic(data)[source]

Detrends time series data by first differencing, y_t - y_{t-1}. This is useful for removing stochastic trends (e.g., a random walk with drift).

Parameters

data (Union[pd.Series, pd.DataFrame]) – Time series data to detrend

Returns

Differenced data

Return type

Union[pd.Series, pd.DataFrame]
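
For instance, differencing turns a random walk with drift into (approximately) stationary increments:

    import numpy as np
    import pandas as pd

    from pyprocessta.preprocess.detrend import detrend_stochastic

    index = pd.date_range("2021-01-01", periods=200, freq="1h")
    # random walk with drift: a stochastic trend
    walk = pd.Series(np.cumsum(np.random.normal(loc=0.1, size=200)), index=index)

    differenced = detrend_stochastic(walk)  # y_t - y_{t-1}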

Often, data is not sampled on a regular grid. This module provides utilities to regularize such data.

pyprocessta.preprocess.resample.resample_regular(df, interval='10min', interpolation='linear', start_time=None)[source]

Resamples the dataframe at a desired interval.

Parameters
  • df (pd.DataFrame) – Input dataframe

  • interval (str, optional) – Resampling interval. Defaults to “10min”.

  • interpolation (Union[str, int], optional) – Interpolation method. If you provide an integer, spline interpolation of that order will be used. Defaults to “linear”.

Returns

Output data.

Return type

pd.DataFrame
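
A minimal sketch with irregularly spaced observations:

    import pandas as pd

    from pyprocessta.preprocess.resample import resample_regular

    # observations that do not fall on a regular grid
    index = pd.to_datetime(
        ["2021-01-01 00:00", "2021-01-01 00:07", "2021-01-01 00:21", "2021-01-01 00:30"]
    )
    df = pd.DataFrame({"value": [1.0, 2.0, 4.0, 5.0]}, index=index)

    # interpolated onto a regular 10 min grid
    regular = resample_regular(df, interval="10min", interpolation="linear")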

pyprocessta.preprocess.smooth.exponential_window_smoothing(data, window_size, aggregation='mean')[source]

Smooths the data using an exponentially weighted window.

Parameters
  • data (Union[pd.Series, pd.DataFrame]) – Data to smooth

  • window_size (int) – Size of the exponential window

  • aggregation (str, optional) – Aggregation function. Defaults to “mean”.

Returns

Smoothed data

Return type

Union[pd.Series, pd.DataFrame]
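
A minimal usage sketch:

    import numpy as np
    import pandas as pd

    from pyprocessta.preprocess.smooth import exponential_window_smoothing

    index = pd.date_range("2021-01-01", periods=100, freq="1min")
    noisy = pd.Series(
        np.sin(np.linspace(0, 6, 100)) + np.random.normal(scale=0.3, size=100), index=index
    )

    smoothed = exponential_window_smoothing(noisy, window_size=10)  # aggregation="mean"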

pyprocessta.preprocess.smooth.z_score_filter(data, threshold=2, window=10)[source]

Replaces spikes (values whose z-score exceeds the threshold) with the median of the window of values preceding them.

Parameters
  • data (Union[pd.Series, pd.DataFrame]) – Series to despike

  • threshold (float, optional) – Threshold on the z-score. Defaults to 2.

  • window (int, optional) – Size of the window used to compute the median that replaces the spike value. This median only looks backwards. Defaults to 10.

Returns

Despiked series

Return type

Union[pd.Series, pd.DataFrame]
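
A minimal sketch with one injected spike:

    import numpy as np
    import pandas as pd

    from pyprocessta.preprocess.smooth import z_score_filter

    index = pd.date_range("2021-01-01", periods=100, freq="1min")
    signal = pd.Series(np.random.normal(size=100), index=index)
    signal.iloc[50] = 20.0  # inject a spike

    # the spike is replaced by the median of the 10 values preceding it
    despiked = z_score_filter(signal, threshold=2, window=10)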

EDA

pyprocessta.eda.statistics.check_granger_causality(x, y, max_lag=20, add_constant=True)[source]

Checks if series x is Granger causal for series y. We reject the null hypothesis that x does not Granger cause y if the p-values are below the desired size (significance level) of the test.

Parameters
  • x (pd.Series) – Time series.

  • y (pd.Series) – Time series.

  • max_lag (int, optional) – Maximum lag to use for the causality checks. Defaults to 20.

  • add_constant (bool, optional) – Whether to add a constant term to the model. Defaults to True.

Returns

Results dictionary

Return type

dict
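
A minimal sketch on synthetic data where x drives y with a one-step lag (the exact contents of the results dictionary depend on the implementation):

    import numpy as np
    import pandas as pd

    from pyprocessta.eda.statistics import check_granger_causality

    rng = np.random.default_rng(42)
    x = pd.Series(rng.normal(size=500))
    # y depends on x lagged by one step, so x should Granger cause y
    y = 0.8 * x.shift(1).fillna(0) + rng.normal(scale=0.1, size=500)

    results = check_granger_causality(x, y, max_lag=20)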

pyprocessta.eda.statistics.check_stationarity(series, threshold=0.05, regression='c')[source]

Performs the Augmented Dickey-Fuller (ADF) and Kwiatkowski-Phillips-Schmidt-Shin (KPSS) tests for stationarity.

Parameters
  • series (pd.Series) – Time series data

  • threshold (float, optional) – p-value threshold for the statistical tests. Defaults to 0.05.

  • regression (str, optional) – If regression=”c”, the tests check for stationarity around a constant. For “ct”, the tests check for stationarity around a trend. Defaults to “c”.

Returns

Results dictionary with key “stationary” that has a bool as value

Return type

dict
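
For example, a white-noise series should be flagged as stationary while a random walk should not:

    import numpy as np
    import pandas as pd

    from pyprocessta.eda.statistics import check_stationarity

    rng = np.random.default_rng(0)
    white_noise = pd.Series(rng.normal(size=500))
    random_walk = pd.Series(np.cumsum(rng.normal(size=500)))

    print(check_stationarity(white_noise)["stationary"])  # expected: True
    print(check_stationarity(random_walk)["stationary"])  # expected: False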

Causal impact analysis

Causal impact analysis uses machine learning to construct a counterfactual (what the results would have been without an intervention), which can be used to estimate an effect size without the need for a control group. A good introduction is https://www.youtube.com/watch?v=GTgZfCltMm8; the original paper is https://storage.googleapis.com/pub-tools-public-publication-data/pdf/41854.pdf. The particular implementation we use is described in https://towardsdatascience.com/implementing-causal-impact-on-top-of-tensorflow-probability-c837ea18b126

One can use covariates to build the counterfactual, but one needs to be careful that they are not themselves changed by the intervention.

pyprocessta.causalimpact.run_causal_impact_analysis(df, x_columns, intervention_column, y_column, start, end, p_value_threshold=0.05)[source]

Runs the causal impact analysis. Here, we use as covariates all the x columns that are not related to the intervention variable.

Parameters
  • df (pd.DataFrame) – Dataframe to run the analysis on

  • x_columns (List[str]) – All column names that can potentially be used as covariates for the counterfactual model

  • intervention_column (str) – Name of the column on which the intervention has been performed

  • y_column (str) – Target column on which we want to understand the effect of the intervention

  • start (List) – Two elements defining the pre-intervention interval

  • end (List) – Two elements defining the post-intervention interval

  • p_value_threshold (float) – The H0 that x does not Granger cause y is rejected when p is smaller than this threshold. Defaults to 0.05.

Returns

causalimpact object

Return type

object
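
A minimal sketch (the column names and data are illustrative; we assume here that the pre- and post-intervention intervals are given as pairs of timestamps):

    import numpy as np
    import pandas as pd

    from pyprocessta.causalimpact import run_causal_impact_analysis

    rng = np.random.default_rng(1)
    index = pd.date_range("2021-01-01", periods=200, freq="1h")
    df = pd.DataFrame(
        {
            "valve_setting": np.r_[np.zeros(100), np.ones(100)],  # step change at hour 100
            "temperature": rng.normal(size=200),
            "pressure": rng.normal(size=200),
            "emissions": rng.normal(size=200),
        },
        index=index,
    )

    ci = run_causal_impact_analysis(
        df,
        x_columns=["temperature", "pressure"],
        intervention_column="valve_setting",
        y_column="emissions",
        start=[index[0], index[99]],    # pre-intervention interval
        end=[index[100], index[-1]],    # post-intervention interval
        p_value_threshold=0.05,
    )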

TCN

Utilities

pyprocessta.utils.is_regular_grid(series)[source]

For many analyses it can be convenient to have the data on a regular grid. This function checks if this is the case.

Parameters

series (pd.Series) – pd.Series of datetime values (e.g., the index of a time series)

Returns

True if the series is sampled on a regular grid, False otherwise

Return type

bool
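
A minimal sketch:

    import pandas as pd

    from pyprocessta.utils import is_regular_grid

    regular = pd.Series(pd.date_range("2021-01-01", periods=10, freq="10min"))
    irregular = pd.Series(
        pd.to_datetime(["2021-01-01 00:00", "2021-01-01 00:07", "2021-01-01 00:30"])
    )

    print(is_regular_grid(regular))    # True
    print(is_regular_grid(irregular))  # False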