ebes.data package
Submodules
ebes.data.accessors module
The module contains classes that expose a pd.DataFrame interface to datasets.
- class ebes.data.accessors.InMemoryPandasDataAccessor(*, parquet_path, split_sizes, data_queries=None, split_by_col=None, random_split=False, split_seed=None)
Bases:
PandasDataAccessor
Data accessor that keeps all data in memory.
- get_split(split_idx)
Get split by its index.
- Parameters:
split_idx – positive index of split.
- Return type:
DataFrame
- Returns:
A dataframe or a sequence of dataframes with data in given split.
- class ebes.data.accessors.PandasDataAccessor
Bases:
ABC
Abstract class for all data accessors.
A data accessor is responsible for splitting the data (into train, test, or any other parts), filtering it, and exposing a pd.DataFrame interface to it. The splits are configured in subclass __init__ methods and accessed by their index. Each subclass splits the data (using any specified strategy) and returns a split by its positive index from the get_split method.
- abstract get_split(split_idx)
Get split by its index.
- Parameters:
split_idx (int) – positive index of split.
- Return type:
DataFrame | Sequence[DataFrame]
- Returns:
A dataframe or a sequence of dataframes with data in given split.
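The contract described for PandasDataAccessor (splits configured in `__init__`, retrieved by index through `get_split`) might look roughly like the sketch below. The classes `AccessorSketch` and `FractionSplitAccessor`, the sequential fraction-based strategy, and the zero-based indexing are all assumptions made for illustration; they are not the ebes implementation.

```python
from abc import ABC, abstractmethod

import pandas as pd


class AccessorSketch(ABC):
    """Stand-in for the abstract accessor interface (hypothetical, not the ebes class)."""

    @abstractmethod
    def get_split(self, split_idx: int) -> pd.DataFrame:
        """Return the split stored under the given index."""


class FractionSplitAccessor(AccessorSketch):
    """Hypothetical subclass that splits rows sequentially by the given fractions."""

    def __init__(self, df: pd.DataFrame, split_sizes: list[float]):
        # The splits are configured once, in __init__.
        self._splits = []
        start = 0
        for i, frac in enumerate(split_sizes):
            is_last = i == len(split_sizes) - 1
            # The last split absorbs any rounding leftovers.
            stop = len(df) if is_last else start + round(frac * len(df))
            self._splits.append(df.iloc[start:stop].reset_index(drop=True))
            start = stop

    def get_split(self, split_idx: int) -> pd.DataFrame:
        return self._splits[split_idx]


df = pd.DataFrame({"seq_id": range(10), "value": [0.5 * i for i in range(10)]})
accessor = FractionSplitAccessor(df, split_sizes=[0.7, 0.2, 0.1])
train, val, test = (accessor.get_split(i) for i in range(3))
```

Client code only depends on `get_split`, so a random or column-based strategy could be swapped in without changing callers.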
ebes.data.batch_tfs module
Batch transforms for data loading pipelines.
- class ebes.data.batch_tfs.BatchTransform
Bases:
ABC
Base class for all batch transforms.
The BatchTransform is a Callable object that modifies Batch in-place.
- class ebes.data.batch_tfs.CatToNum
Bases:
BatchTransform
Process categorical features as numerical.
Treat categorical features as numerical (just type cast). Category 0 is converted to NaN value.
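A minimal NumPy sketch of the cast described above. The array layout `(seq_len, batch_size)` and the variable names are assumptions for illustration; the actual transform operates on a Batch object in place.

```python
import numpy as np

# Hypothetical categorical codes with shape (seq_len, batch_size).
cat_codes = np.array([[3, 0],
                      [1, 2],
                      [0, 0]])

as_num = cat_codes.astype(np.float32)  # plain type cast
as_num[cat_codes == 0] = np.nan        # category 0 becomes NaN
```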
- class ebes.data.batch_tfs.ContrastiveTarget
Bases:
BatchTransform
Set target for contrastive losses.
New target is LongTensor such that items with different indices have different target labels.
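One way to build such a target is to map each distinct sequence identifier to an integer label. The sketch below does this with NumPy; the `index` array and the use of `np.unique` are assumptions, and the labeling scheme used inside ebes may differ.

```python
import numpy as np

# Hypothetical sequence identifiers for a batch of 6 items, e.g. after each of
# 3 original sequences was expanded into 2 random slices.
index = np.array([17, 17, 42, 42, 5, 5])

# Map every distinct identifier to a consecutive integer label.
_, target = np.unique(index, return_inverse=True)
target = target.astype(np.int64)  # int64 mirrors the dtype of a LongTensor
```

Items that share an identifier receive the same label, and items with different identifiers receive different labels, which is what a contrastive loss needs to form positive and negative pairs.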
- class ebes.data.batch_tfs.DatetimeToFloat(loc, scale)
Bases:
BatchTransform
Cast time from np.datetime64 to float by rescaling.
- loc: str | datetime64
Location to subtract. If a string is passed, it is converted to np.datetime64 beforehand.
- scale: tuple[int, str] | timedelta64
Scale to divide time by. If a tuple is passed, it is passed to the np.timedelta64 function. The first item is a value and the second is a unit.
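The arithmetic behind this transform can be sketched with plain NumPy. The helper `datetime_to_float` is a hypothetical stand-alone function written for this example, not the ebes class itself.

```python
import numpy as np


def datetime_to_float(times, loc, scale):
    """Sketch of (times - loc) / scale for np.datetime64 arrays."""
    if isinstance(loc, str):
        loc = np.datetime64(loc)        # string is converted beforehand
    if isinstance(scale, tuple):
        scale = np.timedelta64(*scale)  # (value, unit) -> np.timedelta64
    # Dividing a timedelta64 array by a timedelta64 scalar yields floats.
    return (times - loc) / scale


times = np.array(["2020-01-01T00:00", "2020-01-02T12:00"], dtype="datetime64[s]")
days = datetime_to_float(times, loc="2020-01-01", scale=(1, "D"))
```

With `scale=(1, "D")` the result is measured in days since `loc`.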
- class ebes.data.batch_tfs.FillNans(fill_value)
Bases:
BatchTransform
Fill NaNs with specified values.
- fill_value: Mapping[str, float] | float
If float, all NaNs in all numerical features will be replaced with the fill_value. Mapping sets feature-specific replacement values.
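The two forms of `fill_value` can be illustrated on a plain dict of arrays. The function `fill_nans` and the dict-based batch are stand-ins assumed for this sketch; unlike the real transform, it returns a new dict instead of modifying a Batch in place, and leaving unlisted features untouched is also an assumption.

```python
import math
from collections.abc import Mapping

import numpy as np


def fill_nans(features, fill_value):
    """Replace NaNs in a dict of named float arrays (a stand-in for a batch)."""
    filled = {}
    for name, values in features.items():
        if isinstance(fill_value, Mapping):
            # Features missing from the mapping keep their NaNs in this sketch.
            value = fill_value.get(name, math.nan)
        else:
            value = fill_value
        filled[name] = np.where(np.isnan(values), value, values)
    return filled


features = {"amount": np.array([1.0, np.nan]), "balance": np.array([np.nan, 4.0])}
same_for_all = fill_nans(features, 0.0)
per_feature = fill_nans(features, {"amount": -1.0})
```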
- class ebes.data.batch_tfs.ForwardFillNans(backward=False)
Bases:
BatchTransform
Fill NaN values by propagating forward the last non-NaN values.
The algorithm starts from the second time step. If some values are NaN, the values from the previous step are used to fill them. If the first time step contains NaNs, some NaNs will remain after the forward pass. To handle this, backward=True may be specified to fill the remaining NaN values from last to first after the forward pass. Even after a backward pass the batch may contain NaNs if some feature has only NaN values. To fill those, use the FillNans transform.
- backward: bool = False
Whether to do a backward fill after the forward fill (see the class description).
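The forward and backward passes described above can be sketched as follows. The function `forward_fill` is a hypothetical NumPy helper that assumes time runs along the first axis and returns a copy; the real transform works on a Batch in place.

```python
import numpy as np


def forward_fill(x, backward=False):
    """Fill NaNs along the time axis (axis 0) of a float array, returning a copy."""
    x = x.copy()
    # Forward pass: start from the second step and copy from the previous one.
    for t in range(1, len(x)):
        x[t] = np.where(np.isnan(x[t]), x[t - 1], x[t])
    if backward:
        # Backward pass: fill what is left, going from last to first.
        for t in range(len(x) - 2, -1, -1):
            x[t] = np.where(np.isnan(x[t]), x[t + 1], x[t])
    return x


nan = np.nan
x = np.array([[nan, 2.0, nan],
              [1.0, nan, nan],
              [nan, nan, nan],
              [3.0, nan, nan]])
fwd = forward_fill(x)
both = forward_fill(x, backward=True)
```

In this example the first column still starts with NaN after the forward pass and is completed only by the backward pass, while the all-NaN third column stays NaN in both cases.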
- class ebes.data.batch_tfs.Logarithm(names)
Bases:
BatchTransform
Apply natural logarithm to specific features.
- names: list[str]
Feature names to transform by taking the logarithm.
- class ebes.data.batch_tfs.MaskValid
Bases:
BatchTransform
Add mask indicating valid values to batch.
Mask has shape (max_seq_len, batch_size, n_features) and has True values where there are non-NaN values (nonzero category) and where the data is not padded.
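A sketch of how such a mask could be computed for numerical features. It assumes that padding sits at the end of each sequence and that sequence lengths are available as a `seq_lens` array; both are assumptions made for this example, as are the variable names.

```python
import numpy as np

max_seq_len, batch_size, n_features = 4, 2, 3
rng = np.random.default_rng(0)
num_features = rng.normal(size=(max_seq_len, batch_size, n_features))
num_features[1, 0, 2] = np.nan   # a missing value inside a real event
seq_lens = np.array([4, 2])      # the second sequence is padded after step 2

# True for time steps that belong to the sequence, shape (max_seq_len, batch_size).
not_padded = np.arange(max_seq_len)[:, None] < seq_lens[None, :]

# Combine with the non-NaN check, broadcasting over the feature axis.
mask = ~np.isnan(num_features) & not_padded[:, :, None]
```

For categorical features the non-NaN check would be replaced by a nonzero-category check.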
- class ebes.data.batch_tfs.PrimeNetSampler(mask_ratio_per_seg=0.05, segment_num=3, pretrain_tasks='full2')
Bases:
BatchTransform
Contrastive sampling according to PrimeNet.
- Input:
batch: Batch. Masks required.
batch.num_features (T, B, D) -> (T, 2B, D)
batch.cat_features (T, B, D) -> (T, 2B, D)
Masks have an additional dimension for contrastive and interpolation:
batch.num_mask (T, B, D) -> (T, 2B, D, 2)
batch.cat_mask (T, B, D) -> (T, 2B, D, 2)
- dense_sampling_bound = [0.4, 0.6]
- len_sampling_bound = [0.3, 0.7]
- mask_ratio_per_seg: float = 0.05
- pretrain_tasks: str = 'full2'
- segment_num: int = 3
- class ebes.data.batch_tfs.RandomEventsPermutation(keep_last=False)
Bases:
BatchTransform
Permute events in sequence randomly.
Time, target and masks are left unchanged.
- keep_last: bool = False
If True, the last event remains in its place and the others are permuted.
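The effect of `keep_last` can be sketched on a single sequence of events. The helper `permute_events` is hypothetical and only shuffles the rows of a feature array; in the transform itself, time, target and masks stay as they are.

```python
import numpy as np


def permute_events(events, keep_last=False, rng=None):
    """Return a copy of `events` with rows shuffled along axis 0."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(events)
    # With keep_last, only the first n - 1 positions take part in the shuffle.
    n_perm = max(n - 1, 0) if keep_last else n
    order = np.arange(n)
    order[:n_perm] = rng.permutation(n_perm)
    return events[order]


events = np.array([[10, 1], [20, 2], [30, 3], [40, 4]])
shuffled = permute_events(events, keep_last=True, rng=np.random.default_rng(0))
```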
- class ebes.data.batch_tfs.RandomSlices(split_count, cnt_min, cnt_max, short_seq_crop_rate=1.0, seed=None)
Bases:
BatchTransform
Sample random slices from input sequences.
The transform is taken from https://github.com/dllllb/coles-paper. It samples random slices from initial sequences. The batch size after this transform will be split_count times larger.
- cnt_max: int
Maximal sample sequence length.
- cnt_min: int
Minimal sample sequence length.
- seed: int | None = None
Value to seed the random generator.
- short_seq_crop_rate: float = 1.0
Must be from (0, 1]. If short_seq_crop_rate < 1 and a sequence shorter than cnt_min is encountered, the minimum sample length for this sequence is set to short_seq_crop_rate times the actual sequence length.
- split_count: int
How many sample slices to draw for each input sequence.
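How the parameters interact can be sketched in plain Python. This is a simplified illustration under stated assumptions: `random_slices` is a hypothetical helper, slice lengths and start positions are drawn uniformly, and lengths are clipped to the sequence length. The sampling details in the referenced repository may differ.

```python
import random


def random_slices(seq, split_count, cnt_min, cnt_max, short_seq_crop_rate=1.0, rng=None):
    """Draw `split_count` contiguous slices from one sequence."""
    rng = rng or random.Random()
    n = len(seq)
    lo = cnt_min
    if n < cnt_min and short_seq_crop_rate < 1.0:
        # Short sequence: lower the minimum length to a fraction of its length.
        lo = max(1, int(short_seq_crop_rate * n))
    lo = min(lo, n)                # a slice cannot be longer than the sequence
    hi = max(lo, min(cnt_max, n))
    slices = []
    for _ in range(split_count):
        length = rng.randint(lo, hi)        # both bounds are inclusive
        start = rng.randint(0, n - length)
        slices.append(seq[start:start + length])
    return slices


batch = [list(range(20)), list(range(6))]
rng = random.Random(0)
out = [s for seq in batch
       for s in random_slices(seq, split_count=3, cnt_min=8, cnt_max=12,
                              short_seq_crop_rate=0.5, rng=rng)]
```

Two input sequences with `split_count=3` give six output slices, which is the batch-size growth mentioned in the description.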
- class ebes.data.batch_tfs.RandomTime
Bases:
BatchTransform
Replace time with uniformly distributed values.
- class ebes.data.batch_tfs.Rescale(name, loc, scale)
Bases:
BatchTransform
Rescale feature: subtract location and divide by scale.
- loc: Any
Value to subtract from the feature values.
- name: str
Feature name.
- scale: Any
Value to divide the feature values by.
- class ebes.data.batch_tfs.RescaleTime(loc, scale)
Bases:
BatchTransform
Rescale time: subtract location and divide by scale.
- loc: float
Location to subtract from time.
- scale: float
Scale to divide time by.
- class ebes.data.batch_tfs.TargetToLong
Bases:
BatchTransform
Cast target to LongTensor.
- class ebes.data.batch_tfs.TimeToFeatures(process_type='none', time_name='time')
Bases:
BatchTransform
Add time to numerical features.
To apply this transform, first cast time to Tensor. It has to be applied BEFORE mask creation and AFTER DatetoTime.
- process_type: Literal['cat', 'diff', 'none'] = 'none'
How to add time to features. The options are:
"cat": add absolute time to other numerical features.
"diff": add time intervals between sequential events. In this case the first interval in a sequence equals zero.
"none": do not add time to features. This option is added for the ease of optuna usage.
- time_name: str = 'time'
Name of new feature with time, default "time".
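The three `process_type` options can be sketched with NumPy. The function `time_feature` is a hypothetical helper that only computes the column to be added; it assumes time has shape `(seq_len, batch_size)` and does not attach the result to a Batch.

```python
import numpy as np


def time_feature(time, process_type="none"):
    """Return the time-derived feature column, or None when nothing is added."""
    if process_type == "cat":
        return time
    if process_type == "diff":
        # Prepending the first row makes the first interval equal to zero.
        return np.diff(time, axis=0, prepend=time[:1])
    if process_type == "none":
        return None
    raise ValueError(f"unknown process_type: {process_type!r}")


# Time with shape (seq_len, batch_size): two sequences of three events each.
time = np.array([[0.0, 5.0],
                 [1.5, 5.5],
                 [4.0, 9.0]])
intervals = time_feature(time, "diff")
```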
- class ebes.data.batch_tfs.UnsqueezeTarget
Bases:
BatchTransform
Unsqueeze last dimension in target array.
Last linear layer for regression task produces tensors of shape (bs, 1). When calling MSE loss with target of shape (bs,), PyTorch expands it to the shape (bs, bs) and loss is computed incorrectly. This batch transform reshapes the target to (bs, 1), so MSE loss is computed correctly.
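The shape problem can be reproduced with NumPy, whose broadcasting rules PyTorch follows. The arrays below are made up for illustration, and the mean squared error is computed by hand rather than through a loss class.

```python
import numpy as np

bs = 4
pred = np.arange(bs, dtype=np.float32).reshape(bs, 1)  # model output, shape (bs, 1)
target = np.arange(bs, dtype=np.float32)               # target, shape (bs,)

# (bs, 1) against (bs,) broadcasts to (bs, bs): every prediction is compared
# with every target, so the result is nonzero even though pred equals target.
wrong = ((pred - target) ** 2).mean()

# Adding a trailing dimension to the target restores element-wise comparison.
right = ((pred - target[:, None]) ** 2).mean()
```

Here the predictions match the targets exactly, so only the second value reflects the true error of zero.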
ebes.data.datasets module
- class ebes.data.datasets.SeriesDataset(data, batch_size, query=None, drop_incomplete=False, shuffle=False, loop=False, random_seed=None)
Bases:
IterableDataset
An iterable dataset over the DataFrame rows.
- class ebes.data.datasets.SizedSeriesDataset(data, batch_size, query=None, drop_incomplete=False, shuffle=False, loop=False, random_seed=None)
Bases:
SeriesDataset
The same as SeriesDataset, but has __len__ method implemented.
- ebes.data.datasets.series(df)
Return the DataFrame rows as a list of Series.
- Return type:
list[Series]
ebes.data.loading module
- class ebes.data.loading.SequenceCollator(*, time_name, cat_cardinalities=None, num_names=None, index_name=None, target_name=None, max_seq_len=0, batch_transforms=None)
Bases:
object
- cat_cardinalities: Mapping[str, int] | None = None
- index_name: str | None = None
- max_seq_len: int = 0
- num_names: list[str] | None = None
- target_name: str | list[str] | None = None
- time_name: str
ebes.data.utils module
- ebes.data.utils.build_loaders(dataset, loaders, preprocessing)
- Return type:
Mapping[str,DataLoader]
- ebes.data.utils.get_accessor(parquet_path, split_sizes, split_by_col=None, random_split=False, split_seed=None)
- ebes.data.utils.get_collator(time_name, cat_cardinalities=None, num_names=None, index_name=None, target_name=None, max_seq_len=0, batch_transforms=None)
- Return type:
- ebes.data.utils.get_loader(accessor, collators, split_idx, preprocessing, batch_size, query=None, drop_incomplete=False, shuffle=False, loop=False, random_seed=None, num_workers=0, labeled=True)
- Return type:
DataLoader