DataPreprocessingFuncs module

DataPreprocessingFuncs.build_signal(DC, amp, c_noise: Literal[0, 1, 2, 3, 4], trend_type: Literal[0, 1, 2], N, bool_plot=False, labels: str = ['Full Singal', 'Day', 'Percent of change'])

Constructs a signal from three components: sinusoidal signals, noise, and a long-term trend. The sinusoidal signals are specified by the parameters DC (a 1-D array of each signal’s frequency) and amp (a 1-D array of the corresponding amplitudes).

Parameters:

DC (1-D numpy.array): An array containing the frequencies of the deterministic components.

amp (1-D numpy.array): An array containing the amplitudes of the deterministic components.

c_noise (0, 1, 2, 3, or 4): Noise coefficient that sets the strength of the noise signal. 0 gives no noise; 4 gives a noise signal for which the maximum signal-to-noise ratio is 1. The coefficients scale the noise linearly.

trend_type (0, 1, or 2): 0 -> flat trend: the signal has no long-term trend. 1 -> linear trend: the signal has a linear long-term trend with gradient 5e-2. 2 -> quadratic trend: the signal has a quadratic long-term trend with gradient 10e-5.

N (int): Length of the original signal.

Returns:

full_signal (1-D numpy.array): The signal consisting of the deterministic (sinusoidal) components, additive noise, and a long-term trend.

r (int): The number of deterministic components, that is, the length of DC or amp.
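The construction described above can be sketched as follows. This is a minimal illustration, not the module's implementation: the function name sketch_build_signal, the seed parameter, the use of Gaussian noise, the noise scaling (standard deviation equal to c_noise / 4 times the largest amplitude), the unit of frequency (cycles per sample), and the form of the quadratic trend (10e-5 * t**2) are all assumptions.

```python
import numpy as np


def sketch_build_signal(DC, amp, c_noise, trend_type, N, seed=0):
    """Hypothetical sketch: sinusoids + noise + long-term trend."""
    DC = np.asarray(DC, dtype=float)
    amp = np.asarray(amp, dtype=float)
    t = np.arange(N)

    # Deterministic part: one sinusoid per (frequency, amplitude) pair.
    # Frequencies are assumed to be in cycles per sample.
    deterministic = (amp[:, None] * np.sin(2 * np.pi * DC[:, None] * t)).sum(axis=0)

    # Noise part (assumption): Gaussian noise whose standard deviation
    # grows linearly with c_noise and equals max(amp) when c_noise is 4.
    noise = np.zeros(N)
    if c_noise > 0 and amp.size:
        rng = np.random.default_rng(seed)
        noise = rng.normal(0.0, (c_noise / 4) * amp.max(), size=N)

    # Trend part: flat, linear (gradient 5e-2), or quadratic.
    # The quadratic form 10e-5 * t**2 is an assumption.
    if trend_type == 1:
        trend = 5e-2 * t
    elif trend_type == 2:
        trend = 10e-5 * t ** 2
    else:
        trend = np.zeros(N)

    full_signal = deterministic + noise + trend
    r = DC.size  # number of deterministic components
    return full_signal, r
```

With c_noise=0 and trend_type=0 the sketch returns only the sum of the sinusoids, which is a convenient way to check the deterministic part in isolation.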

DataPreprocessingFuncs.find_components(load_PSD=None, signal=None, threshold: int | None = None, bool_plot: bool = False, save_components: str | None = None)

Performs a discrete Fourier transform and calculates the modulus squared to estimate the signal’s power spectral density.

Parameters:

signal (numpy.array): 1-D data for which the power spectral density is to be estimated.

threshold (int): The minimum amplitude squared required for a component frequency to be added to the list of deterministic components.

bool_plot (bool): If True, will show the periodogram.

save_components (str, optional): The name of the file to which the deterministic components will be saved. If not specified, the deterministic components will not be saved.

Returns:

DC (numpy.array): 1-D array containing the frequencies of the component signals with amplitudes above the threshold (the deterministic components).

amp (numpy.array): 1-D array containing the amplitudes of the deterministic components. DC[i] and amp[i] give the frequency and amplitude, respectively, of one deterministic component signal.

N (int): Length of the original signal.
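The periodogram-and-threshold idea can be sketched with NumPy's real FFT. This is an illustrative sketch under stated assumptions, not the module's code: the name sketch_find_components, the amplitude scaling (2|X|/N), the frequency unit (cycles per sample), and the decision to ignore the zero-frequency bin are all assumptions.

```python
import numpy as np


def sketch_find_components(signal, threshold):
    """Hypothetical sketch: keep frequencies whose squared amplitude
    reaches the threshold."""
    signal = np.asarray(signal, dtype=float)
    N = signal.size

    # Discrete Fourier transform of a real signal and its frequency axis
    # (in cycles per sample).
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(N)

    # Assumed scaling: a sinusoid of amplitude A sitting exactly on a
    # frequency bin yields A here (not valid for the zero and Nyquist bins).
    amplitude = 2.0 * np.abs(spectrum) / N
    power = amplitude ** 2  # modulus squared

    keep = power >= threshold
    keep[0] = False  # assumption: the mean (zero-frequency) term is ignored
    return freqs[keep], amplitude[keep], N


# Two sinusoids; only the stronger one passes a threshold of 1.0.
t = np.arange(200)
x = 2.0 * np.sin(2 * np.pi * 0.05 * t) + 0.5 * np.sin(2 * np.pi * 0.1 * t)
DC, amp, N = sketch_find_components(x, threshold=1.0)
```

Frequencies that do not fall exactly on a bin leak into neighbouring bins, so a real signal may need windowing or a lower threshold than this clean example suggests.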

DataPreprocessingFuncs.gradient(data, bool_plot=False)

Calculates the percent of change of an input dataset.

Parameters:

data (numpy.array): A 1-D array of size N. The sequential data for which the gradient should be calculated.

Returns:

pc (numpy.array): A 1-D array of size N. The gradient (percent of change) of the input data. The nth value is the rate of change from x[n] to x[n+1].
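A percent-of-change calculation of this kind can be sketched as below. The name sketch_percent_change is hypothetical, and two details are assumptions because the description above does not state them: the result is scaled by 100, and the final entry (which has no successor) is filled with NaN so that the output keeps size N.

```python
import numpy as np


def sketch_percent_change(data):
    """Hypothetical sketch: percent change from each value to the next."""
    data = np.asarray(data, dtype=float)

    # Assumption: the last value has no successor, so it is left as NaN.
    pc = np.full(data.shape, np.nan)

    # pc[n] = 100 * (x[n+1] - x[n]) / x[n]
    pc[:-1] = 100.0 * np.diff(data) / data[:-1]
    return pc


pc = sketch_percent_change([100.0, 110.0, 99.0])
```

Values of zero in the input lead to division by zero in this sketch, so such data would need separate handling.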

DataPreprocessingFuncs.load_data(dataset, usecols=<built-in function all>, skiprows=None, sample_size=<built-in function all>, bool_plot=False)

Loads a specified subset of a csv file. Documentation source: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

Parameters:

dataset (str, path object or file-like object): Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.csv. If you want to pass in a path object, pandas accepts any os.PathLike. By file-like object, we refer to objects with a read() method, such as a file handle (e.g. via builtin open function) or StringIO.

usecols (list-like or callable, optional): Return a subset of the columns. If list-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in names or inferred from the document header row(s). If names are given, the document header row(s) are not taken into account. For example, a valid list-like usecols parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Element order is ignored, so usecols=[0, 1] is the same as [1, 0]. To instantiate a DataFrame from data with element order preserved use pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']] for columns in ['foo', 'bar'] order or pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']] for ['bar', 'foo'] order. If callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to True. An example of a valid callable argument would be lambda x: x.upper() in ['AAA', 'BBB', 'DDD'].

skiprows (list-like, int or callable, optional): Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file. If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].

sample_size (int): Number of data points to include, i.e. the number of rows excluding headers.

bool_plot (boolean): If True, will plot the data.

Returns:

dataset (numpy.array): A 1-D array.
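Loading a column subset of a CSV file into a 1-D array can be sketched with pandas.read_csv. The name sketch_load_data is hypothetical, and the mapping of sample_size onto the nrows argument of read_csv is an assumption about how the module limits the number of rows; the sketch also assumes a single selected column and returns the first one.

```python
import io

import numpy as np
import pandas as pd


def sketch_load_data(dataset, usecols=None, skiprows=None, sample_size=None):
    """Hypothetical sketch: read part of a CSV and return a 1-D array."""
    # Assumption: sample_size is passed to read_csv as nrows, which counts
    # data rows and excludes the header.
    frame = pd.read_csv(dataset, usecols=usecols, skiprows=skiprows, nrows=sample_size)

    # Return the first selected column as a 1-D NumPy array.
    return frame.iloc[:, 0].to_numpy()


# Illustrative in-memory CSV; a path or URL string would work the same way.
csv_text = "day,price\n0,10.0\n1,10.5\n2,11.0\n3,10.8\n"
values = sketch_load_data(io.StringIO(csv_text), usecols=["price"], sample_size=3)
```

Passing usecols by name, as here, avoids depending on column position if the file layout changes.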