Commit 377cd826 authored by Liu Xiao (2023 Master's student). Commit message: "upload".
# LLMDetector
This repository contains the implementation for the paper "LLMDetector: Large Language Models with Dual-View Contrastive Learning for Time Series Anomaly Detection".
## Requirements
The dependencies can be installed by:
```pip install -r requirements.txt```
## Data
The datasets can be downloaded from this [link](https://drive.google.com/drive/folders/1ehugseUxqp1o6Xn60woCFxEEvKq4AT0d?usp=sharing). Put the files into `datasets/`, so that each original dataset is located at `datasets/<dataset_name>`. All datasets are preprocessed into `.npy` format for convenience.
## Usage
To train and evaluate LLMDetector on a dataset, run the following command:
```python main.py --anormly_ratio <anormly ratio> --alpha <alpha> --mode <mode> --dataset <data_name> --input_c <input dimension> --win_size <window size> --patch_size <patch size> --step <step>```
The arguments are described in detail as follows:
| Parameter Name | Description |
|------------------|-----------------------------------------------------------------------------|
| anormly_ratio    | Threshold ratio for deciding whether a timestamp is anomalous (the flag is spelled `--anormly_ratio` in the code). |
| alpha | Weighting factor for the instance-level loss in the training objective. |
| mode | Specifies the execution mode: either `train` or `test`. |
| dataset | Name of the dataset. |
| input_c | Number of input channels (i.e., the dimensionality of the input data). |
| win_size | Length of the sliding window applied to the time series. |
| patch_size | Size of each patch segmented from the window. |
| step | Step size between consecutive windows during sliding window segmentation. |
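As a quick sanity check on how these arguments interact, the sketch below (an illustration, not repository code) computes the number of sliding windows and the number of patches per window for the WADI-style settings used later, assuming a hypothetical series length:

```python
# Illustration of how win_size, patch_size and step interact (not repo code).
# N is a hypothetical series length; step == win_size gives non-overlapping windows.
N = 600            # assumed number of timestamps
win_size = 60      # sliding-window length
patch_size = 6     # patch length inside each window
step = 60          # stride between consecutive windows

num_windows = (N - win_size) // step + 1       # same formula the data loaders use
patches_per_window = win_size // patch_size    # each window is cut into equal patches

print(num_windows, patches_per_window)
```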
For example, the WADI dataset can be trained directly with the following command:
```python main.py --anormly_ratio 0.003 --alpha 0.1 --num_epochs 20 --batch_size 64 --mode train --dataset WADI --data_path WADI --input_c 123 --win_size 60 --patch_size 6 --step 60```
To test:
```python main.py --anormly_ratio 0.003 --alpha 0.1 --num_epochs 20 --batch_size 64 --mode test --dataset WADI --data_path WADI --input_c 123 --win_size 60 --patch_size 6 --step 60```
The results will be saved in the directory ```./result```.
For all the scripts we used, please refer to the directory ```./Scripts```, or directly run a `.sh` file, for example:
```sh Scripts/MSL.sh```
python main.py --anormly_ratio 0.007 --alpha 0.4 --num_epochs 1 --batch_size 64 --mode train --dataset MSL --data_path MSL --input_c 55 --win_size 60 --patch_size 6 --step 60
python main.py --anormly_ratio 0.007 --alpha 0.4 --num_epochs 1 --batch_size 64 --mode test --dataset MSL --data_path MSL --input_c 55 --win_size 60 --patch_size 6 --step 60
python main.py --anormly_ratio 0.007 --alpha 0.8 --num_epochs 20 --batch_size 64 --mode train --dataset GECCO --data_path GECCO --input_c 9 --win_size 60 --patch_size 10 --step 60
python main.py --anormly_ratio 0.007 --alpha 0.8 --num_epochs 20 --batch_size 64 --mode test --dataset GECCO --data_path GECCO --input_c 9 --win_size 60 --patch_size 10 --step 60
python main.py --anormly_ratio 0.001 --alpha 0.7 --num_epochs 20 --batch_size 64 --mode train --dataset SWAN --data_path SWAN --input_c 38 --win_size 36 --patch_size 6 --step 36
python main.py --anormly_ratio 0.001 --alpha 0.7 --num_epochs 20 --batch_size 64 --mode test --dataset SWAN --data_path SWAN --input_c 38 --win_size 36 --patch_size 6 --step 36
python main.py --anormly_ratio 0.005 --alpha 0.7 --num_epochs 20 --batch_size 64 --mode train --dataset PSM --data_path PSM --input_c 25 --win_size 60 --patch_size 5 --step 60
python main.py --anormly_ratio 0.005 --alpha 0.7 --num_epochs 20 --batch_size 64 --mode test --dataset PSM --data_path PSM --input_c 25 --win_size 60 --patch_size 5 --step 60
python main.py --anormly_ratio 0.006 --alpha 0.7 --num_epochs 20 --batch_size 64 --mode train --dataset SMAP --data_path SMAP --input_c 25 --patch_size 6 --win_size 60 --step 60
python main.py --anormly_ratio 0.006 --alpha 0.7 --num_epochs 20 --batch_size 64 --mode test --dataset SMAP --data_path SMAP --input_c 25 --patch_size 6 --win_size 60 --step 60
python main.py --anormly_ratio 0.007 --alpha 0.6 --num_epochs 20 --batch_size 64 --mode train --dataset SMD --data_path SMD --input_c 38 --loss_fuc MSE --win_size 60 --patch_size 6 --step 60
python main.py --anormly_ratio 0.007 --alpha 0.6 --num_epochs 20 --batch_size 64 --mode test --dataset SMD --data_path SMD --input_c 38 --loss_fuc MSE --win_size 60 --patch_size 6 --step 60
python main.py --anormly_ratio 0.008 --alpha 0.5 --num_epochs 20 --batch_size 64 --mode train --dataset SWAT --data_path SWAT --input_c 51 --win_size 60 --patch_size 6 --step 60
python main.py --anormly_ratio 0.008 --alpha 0.5 --num_epochs 20 --batch_size 64 --mode test --dataset SWAT --data_path SWAT --input_c 51 --win_size 60 --patch_size 6 --step 60
python main.py --anormly_ratio 0.003 --alpha 0.1 --num_epochs 20 --batch_size 64 --mode train --dataset WADI --data_path WADI --input_c 123 --win_size 60 --patch_size 6 --step 60
python main.py --anormly_ratio 0.003 --alpha 0.1 --num_epochs 20 --batch_size 64 --mode test --dataset WADI --data_path WADI --input_c 123 --win_size 60 --patch_size 6 --step 60
import torch
import os
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
class PSMSegLoader(object):
def __init__(self, data_path, win_size, step, mode="train"):
self.mode = mode
self.step = step
self.win_size = win_size
self.scaler = StandardScaler()
data = pd.read_csv(data_path + '/train.csv')
data = data.values[:, 1:]
data = np.nan_to_num(data)
self.scaler.fit(data)
data = self.scaler.transform(data)
test_data = pd.read_csv(data_path + '/test.csv')
test_data = test_data.values[:, 1:]
test_data = np.nan_to_num(test_data)
self.test = self.scaler.transform(test_data)
self.train = data
self.val = self.test
self.test_labels = pd.read_csv(data_path + '/test_label.csv').values[:, 1:]
def __len__(self):
"""
Number of images in the object dataset.
"""
if self.mode == "train":
return (self.train.shape[0] - self.win_size) // self.step + 1
elif (self.mode == 'val'):
return (self.val.shape[0] - self.win_size) // self.step + 1
elif (self.mode == 'test'):
return (self.test.shape[0] - self.win_size) // self.step + 1
else:
return (self.test.shape[0] - self.win_size) // self.win_size + 1
def __getitem__(self, index):
index = index * self.step
if self.mode == "train":
return np.float32(self.train[index:index + self.win_size]), np.float32(self.test_labels[0:self.win_size])
elif (self.mode == 'val'):
return np.float32(self.val[index:index + self.win_size]), np.float32(self.test_labels[0:self.win_size])
elif (self.mode == 'test'):
return np.float32(self.test[index:index + self.win_size]), np.float32(
self.test_labels[index:index + self.win_size])
else:
return np.float32(self.test[
index // self.step * self.win_size:index // self.step * self.win_size + self.win_size]), np.float32(
self.test_labels[index // self.step * self.win_size:index // self.step * self.win_size + self.win_size])
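The `__len__`/`__getitem__` windowing pattern above can be checked on a toy array. This is a standalone NumPy sketch of the same indexing (with a small synthetic series whose values are assumed for illustration), not repository code:

```python
import numpy as np

# Toy series: 10 timestamps, 2 channels; window of 4 with stride 2 (assumed values).
data = np.arange(20).reshape(10, 2)
win_size, step = 4, 2

num_windows = (data.shape[0] - win_size) // step + 1  # mirrors __len__ above
windows = [data[i * step : i * step + win_size] for i in range(num_windows)]

print(num_windows)             # 4 windows
print(windows[1][0].tolist())  # second window starts at timestamp 2 -> [4, 5]
```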
class MSLSegLoader(object):
def __init__(self, data_path, win_size, step, mode="train"):
self.mode = mode
self.step = step
self.win_size = win_size
self.scaler = StandardScaler()
data = np.load(data_path + "/MSL_train.npy")
self.scaler.fit(data)
data = self.scaler.transform(data)
test_data = np.load(data_path + "/MSL_test.npy")
self.test = self.scaler.transform(test_data)
self.train = data
self.val = self.test
self.test_labels = np.load(data_path + "/MSL_test_label.npy")
def __len__(self):
if self.mode == "train":
return (self.train.shape[0] - self.win_size) // self.step + 1
elif (self.mode == 'val'):
return (self.val.shape[0] - self.win_size) // self.step + 1
elif (self.mode == 'test'):
return (self.test.shape[0] - self.win_size) // self.step + 1
else:
return (self.test.shape[0] - self.win_size) // self.win_size + 1
def __getitem__(self, index):
index = index * self.step
if self.mode == "train":
return np.float32(self.train[index:index + self.win_size]), np.float32(self.test_labels[0:self.win_size])
elif (self.mode == 'val'):
return np.float32(self.val[index:index + self.win_size]), np.float32(self.test_labels[0:self.win_size])
elif (self.mode == 'test'):
return np.float32(self.test[index:index + self.win_size]), np.float32(
self.test_labels[index:index + self.win_size])
else:
return np.float32(self.test[
index // self.step * self.win_size:index // self.step * self.win_size + self.win_size]), np.float32(
self.test_labels[index // self.step * self.win_size:index // self.step * self.win_size + self.win_size])
class SMAPSegLoader(object):
def __init__(self, data_path, win_size, step, mode="train"):
self.mode = mode
self.step = step
self.win_size = win_size
self.scaler = StandardScaler()
data = np.load(data_path + "/SMAP_train.npy")
self.scaler.fit(data)
data = self.scaler.transform(data)
test_data = np.load(data_path + "/SMAP_test.npy")
self.test = self.scaler.transform(test_data)
self.train = data
self.val = self.test
self.test_labels = np.load(data_path + "/SMAP_test_label.npy")
def __len__(self):
if self.mode == "train":
return (self.train.shape[0] - self.win_size) // self.step + 1
elif (self.mode == 'val'):
return (self.val.shape[0] - self.win_size) // self.step + 1
elif (self.mode == 'test'):
return (self.test.shape[0] - self.win_size) // self.step + 1
else:
return (self.test.shape[0] - self.win_size) // self.win_size + 1
def __getitem__(self, index):
index = index * self.step
if self.mode == "train": #train and val did not use label
return np.float32(self.train[index:index + self.win_size]), np.float32(self.test_labels[0:self.win_size])
elif (self.mode == 'val'):
return np.float32(self.val[index:index + self.win_size]), np.float32(self.test_labels[0:self.win_size])
elif (self.mode == 'test'):
return np.float32(self.test[index:index + self.win_size]), np.float32(
self.test_labels[index:index + self.win_size])
else:
return np.float32(self.test[
index // self.step * self.win_size:index // self.step * self.win_size + self.win_size]), np.float32(
self.test_labels[index // self.step * self.win_size:index // self.step * self.win_size + self.win_size])
class SMDSegLoader(object):
def __init__(self, data_path, win_size, step, mode="train"):
self.mode = mode
self.step = step
self.win_size = win_size
self.scaler = StandardScaler()
data = np.load(data_path + "/SMD_train.npy")[:,:]
self.scaler.fit(data)
data = self.scaler.transform(data)
test_data = np.load(data_path + "/SMD_test.npy")[:,:]
self.test = self.scaler.transform(test_data)
self.train = data
data_len = len(self.train)
self.val = self.train[(int)(data_len * 0.8):]
self.test_labels = np.load(data_path + "/SMD_test_label.npy")[:]
def __len__(self):
if self.mode == "train":
return (self.train.shape[0] - self.win_size) // self.step + 1
elif (self.mode == 'val'):
return (self.val.shape[0] - self.win_size) // self.step + 1
elif (self.mode == 'test'):
return (self.test.shape[0] - self.win_size) // self.step + 1
else:
return (self.test.shape[0] - self.win_size) // self.win_size + 1
def __getitem__(self, index):
index = index * self.step
if self.mode == "train":
return np.float32(self.train[index:index + self.win_size]), np.float32(self.test_labels[0:self.win_size])
elif (self.mode == 'val'):
return np.float32(self.val[index:index + self.win_size]), np.float32(self.test_labels[0:self.win_size])
elif (self.mode == 'test'):
return np.float32(self.test[index:index + self.win_size]), np.float32(
self.test_labels[index:index + self.win_size])
else:
return np.float32(self.test[
index // self.step * self.win_size:index // self.step * self.win_size + self.win_size]), np.float32(
self.test_labels[index // self.step * self.win_size:index // self.step * self.win_size + self.win_size])
class NIPS_TS_WaterSegLoader(object):
def __init__(self, data_path, win_size, step, mode="train"):
self.mode = mode
self.step = step
self.win_size = win_size
self.scaler = StandardScaler()
data = np.load(data_path + "/NIPS_TS_Water_train.npy")
self.scaler.fit(data)
data = self.scaler.transform(data)
test_data = np.load(data_path + "/NIPS_TS_Water_test.npy")
self.test = self.scaler.transform(test_data)
self.train = data
self.val = self.test
self.test_labels = np.load(data_path + "/NIPS_TS_Water_test_label.npy")
# print("test:", self.test.shape)
# print("train:", self.train.shape)
def __len__(self):
if self.mode == "train":
return (self.train.shape[0] - self.win_size) // self.step + 1
elif (self.mode == 'val'):
return (self.val.shape[0] - self.win_size) // self.step + 1
elif (self.mode == 'test'):
return (self.test.shape[0] - self.win_size) // self.step + 1
else:
return (self.test.shape[0] - self.win_size) // self.win_size + 1
def __getitem__(self, index):
index = index * self.step
if self.mode == "train":
return np.float32(self.train[index:index + self.win_size]), np.float32(self.test_labels[0:self.win_size])
elif (self.mode == 'val'):
return np.float32(self.val[index:index + self.win_size]), np.float32(self.test_labels[0:self.win_size])
elif (self.mode == 'test'):
return np.float32(self.test[index:index + self.win_size]), np.float32(
self.test_labels[index:index + self.win_size])
else:
return np.float32(self.test[
index // self.step * self.win_size:index // self.step * self.win_size + self.win_size]), np.float32(
self.test_labels[index // self.step * self.win_size:index // self.step * self.win_size + self.win_size])
class NIPS_TS_SwanSegLoader(object):
def __init__(self, data_path, win_size, step, mode="train"):
self.mode = mode
self.step = step
self.win_size = win_size
self.scaler = StandardScaler()
data = np.load(data_path + "/NIPS_TS_Swan_train.npy")
self.scaler.fit(data)
data = self.scaler.transform(data)
test_data = np.load(data_path + "/NIPS_TS_Swan_test.npy")
self.test = self.scaler.transform(test_data)
self.train = data
self.val = self.test
self.test_labels = np.load(data_path + "/NIPS_TS_Swan_test_label.npy")
# # print("test:", self.test.shape)
# # print("train:", self.train.shape)
def __len__(self):
if self.mode == "train":
return (self.train.shape[0] - self.win_size) // self.step + 1
elif (self.mode == 'val'):
return (self.val.shape[0] - self.win_size) // self.step + 1
elif (self.mode == 'test'):
return (self.test.shape[0] - self.win_size) // self.step + 1
else:
return (self.test.shape[0] - self.win_size) // self.win_size + 1
def __getitem__(self, index):
index = index * self.step
if self.mode == "train":
return np.float32(self.train[index:index + self.win_size]), np.float32(self.test_labels[0:self.win_size])
elif (self.mode == 'val'):
return np.float32(self.val[index:index + self.win_size]), np.float32(self.test_labels[0:self.win_size])
elif (self.mode == 'test'):
return np.float32(self.test[index:index + self.win_size]), np.float32(
self.test_labels[index:index + self.win_size])
else:
return np.float32(self.test[
index // self.step * self.win_size:index // self.step * self.win_size + self.win_size]), np.float32(
self.test_labels[index // self.step * self.win_size:index // self.step * self.win_size + self.win_size])
class SWATSegLoader(Dataset):
def __init__(self, root_path, win_size, step=1, flag="train"):
self.flag = flag
self.step = step
self.win_size = win_size
self.scaler = StandardScaler()
train_data = pd.read_csv(os.path.join(root_path, 'swat_train2.csv'))
test_data = pd.read_csv(os.path.join(root_path, 'swat2.csv'))
labels = test_data.values[:, -1:]
train_data = train_data.values[:, :-1]
test_data = test_data.values[:, :-1]
self.scaler.fit(train_data)
train_data = self.scaler.transform(train_data)
test_data = self.scaler.transform(test_data)
self.train = train_data
self.test = test_data
data_len = len(self.train)
self.val = self.train[(int)(data_len * 0.8):]
self.test_labels = labels
# print("test:", self.test.shape)
# print("train:", self.train.shape)
def __len__(self):
"""
Number of images in the object dataset.
"""
if self.flag == "train":
return (self.train.shape[0] - self.win_size) // self.step + 1
elif (self.flag == 'val'):
return (self.val.shape[0] - self.win_size) // self.step + 1
elif (self.flag == 'test'):
return (self.test.shape[0] - self.win_size) // self.step + 1
else:
return (self.test.shape[0] - self.win_size) // self.win_size + 1
def __getitem__(self, index):
index = index * self.step
if self.flag == "train":
return np.float32(self.train[index:index + self.win_size]), np.float32(self.test_labels[0:self.win_size])
elif (self.flag == 'val'):
return np.float32(self.val[index:index + self.win_size]), np.float32(self.test_labels[0:self.win_size])
elif (self.flag == 'test'):
return np.float32(self.test[index:index + self.win_size]), np.float32(
self.test_labels[index:index + self.win_size])
else:
return np.float32(self.test[
index // self.step * self.win_size:index // self.step * self.win_size + self.win_size]), np.float32(
self.test_labels[index // self.step * self.win_size:index // self.step * self.win_size + self.win_size])
class WADILoader(Dataset):
def __init__(self, root_path, win_size, step=1, flag="train"):
self.flag = flag
self.step = step
self.win_size = win_size
self.scaler = StandardScaler()
data = pd.read_csv(os.path.join(root_path, 'WADI_14days_new.csv'), index_col=0)
data = data.dropna(how="all")
data = data.dropna(thresh=2) # Keep only the rows with at least 2 non-NA values.
for col in data.columns:
if col.strip() not in ["Date", "Time"]:
data[col] = pd.to_numeric(data[col], errors='coerce')
data = data.ffill().bfill()  # forward- then backward-fill NaNs (fillna(method=...) is deprecated in recent pandas)
data.dropna(axis='columns', inplace=True)  # drop columns that are still entirely NaN
data=data.values[:, 2:].astype(np.float32) # col0: Date, col1: Time
self.scaler.fit(data)
data = self.scaler.transform(data)
self.train = data
data_len = len(self.train)
self.val = self.train[(int)(data_len * 0.8):]
test_data = pd.read_csv(os.path.join(root_path, 'WADI_attackdataLABLE.csv'), skiprows=[0], index_col=0)
for col in test_data.columns:
if col.strip() not in ["Date", "Time"]:
test_data[col] = pd.to_numeric(test_data[col], errors='coerce')
test_data = test_data.ffill().bfill()  # forward- then backward-fill NaNs (fillna(method=...) is deprecated in recent pandas)
test_data.dropna(axis='columns', inplace=True)  # drop columns that are still entirely NaN
self.test = test_data.values[:, 2:-1].astype(np.float32) # col0: Date, col1: Time, col[-1]: label
self.test_labels = (test_data.values[:, -1] == -1).astype(int)
print(self.train.shape, self.test.shape)
anomaly_nums = len(np.where(self.test_labels == 1)[0])
print('anomaly ratio:', anomaly_nums / self.test_labels.shape[0])
def __len__(self):
if self.flag == "train":
return (self.train.shape[0] - self.win_size) // self.step + 1
elif (self.flag == 'val'):
return (self.val.shape[0] - self.win_size) // self.step + 1
elif (self.flag == 'test'):
return (self.test.shape[0] - self.win_size) // self.step + 1
else:
return (self.test.shape[0] - self.win_size) // self.win_size + 1
def __getitem__(self, index):
index = index * self.step
if self.flag == "train":
return np.float32(self.train[index:index + self.win_size]), np.float32(self.test_labels[0:self.win_size])
elif (self.flag == 'val'):
return np.float32(self.val[index:index + self.win_size]), np.float32(self.test_labels[0:self.win_size])
elif (self.flag == 'test'):
return np.float32(self.test[index:index + self.win_size]), np.float32(
self.test_labels[index:index + self.win_size])
else:
return np.float32(self.test[
index // self.step * self.win_size:index // self.step * self.win_size + self.win_size]), np.float32(
self.test_labels[index // self.step * self.win_size:index // self.step * self.win_size + self.win_size])
def get_loader_segment(index, data_path, batch_size, win_size=100, step=100, mode='train', dataset='KDD'):
if mode=='test' or mode=='thre':
step = win_size
if (dataset == 'SMD'):
dataset = SMDSegLoader(data_path, win_size, step, mode)
elif (dataset == 'MSL'):
dataset = MSLSegLoader(data_path, win_size, step, mode)
elif (dataset == 'SMAP'):
dataset = SMAPSegLoader(data_path, win_size, step, mode)
elif (dataset == 'PSM'):
dataset = PSMSegLoader(data_path, win_size, step, mode)
elif (dataset =='SWAT'):
dataset = SWATSegLoader(data_path,win_size,step,mode)
elif (dataset == 'GECCO'):
dataset = NIPS_TS_WaterSegLoader(data_path, win_size, step, mode)
elif (dataset == 'SWAN'):
dataset = NIPS_TS_SwanSegLoader(data_path, win_size, step, mode)
elif (dataset== 'WADI'):
dataset = WADILoader(data_path,win_size,step,mode)
shuffle = False
if mode == 'train':
shuffle = True
data_loader = DataLoader(dataset=dataset,
batch_size=batch_size,
shuffle=shuffle,
num_workers=8,
drop_last=True,
generator=torch.Generator().manual_seed(2024)
)
return data_loader
import os
import argparse
import numpy as np
from solver import Solver
import warnings
import torch
warnings.filterwarnings('ignore')
import random
def seed_everything(seed=11):
random.seed(seed)
np.random.seed(seed)
os.environ['PYTHONHASHSEED']=str(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.enabled=False
os.environ['CUBLAS_WORKSPACE_CONFIG']=':16:8'
torch.use_deterministic_algorithms(True)
def main(config,setting):
if (not os.path.exists(config.model_save_path)):
os.makedirs(config.model_save_path)
solver = Solver(vars(config))
if config.mode == 'train':
solver.train(setting)
solver.test(setting)
elif config.mode == 'test':
solver.test1(setting)
return solver
if __name__ == '__main__':
parser = argparse.ArgumentParser()
# Alternative
parser.add_argument('--win_size', type=int, default=100)
parser.add_argument('--patch_size', type=int, default=5)
parser.add_argument('--lr', type=float, default=1e-4)
parser.add_argument('--n_heads', type=int, default=1)
parser.add_argument('--d_model', type=int, default=256)
parser.add_argument('--rec_timeseries', action='store_true', default=True)
parser.add_argument('--use_gpu', type=bool, default=True, help='use gpu')
parser.add_argument('--gpu', type=int, default=2, help='gpu')
parser.add_argument('--use_multi_gpu', action='store_true', help='use multiple gpus', default=True)
# Default
parser.add_argument('--index', type=int, default=137)
parser.add_argument('--seed',type=int,default=2024)
parser.add_argument('--num_epochs', type=int, default=10)
parser.add_argument('--batch_size', type=int, default=64)
parser.add_argument('--step',type=int,default=1)
parser.add_argument('--input_c', type=int, default=9)
parser.add_argument('--alpha',type=float,default=1)
parser.add_argument('--patience',type=int,default=3)
parser.add_argument('--dataset', type=str, default='credit')
parser.add_argument('--mode', type=str, default='test', choices=['train', 'test','ana'])
parser.add_argument('--data_path', type=str, default='./dataset/creditcard_ts.csv')
parser.add_argument('--model_save_path', type=str, default='checkpoints')
parser.add_argument('--llm_model',type=str,default='gpt2')
parser.add_argument('--anormly_ratio', type=float, default=4.00)
config = parser.parse_args()
args = vars(config)
seed_everything(config.seed)
setting = '{}_ws{}_pa{}_dm{}_llm{}_nh{}_sp{}_alpha{}'.format(
config.dataset,
config.win_size,
config.patch_size,
config.d_model,
config.llm_model,
config.n_heads,
config.step,
config.alpha
)
print(setting)
main(config,setting)
# used by paper: TSB-UAD as the main evaluator
# github: https://github.com/johnpaparrizos/TSB-UAD/blob/main/TSB_AD/utils/metrics.py
import numpy as np
from sklearn import metrics
from metrics.evaluate_utils import find_length,range_convers_new
def extend_postive_range(x, window=16):
label = x.copy().astype(float)
# print(label)
L = range_convers_new(label) # index of non-zero segments
# print(L)
length = len(label)
for k in range(len(L)):
s = L[k][0]
e = L[k][1]
# x1: indices extended to the right of the segment end e
x1 = np.arange(e, min(e + window // 2, length))
label[x1] += np.sqrt(1 - (x1 - e) / (window))
# before the start-s
x2 = np.arange(max(s - window // 2, 0), s)
label[x2] += np.sqrt(1 - (s - x2) / (window))
label = np.minimum(np.ones(length), label)
return label
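To make the soft boundary extension concrete, here is a self-contained NumPy sketch of the same decay logic. `segments` is a minimal stand-in for `range_convers_new` (whose exact interface is assumed), so this illustrates the idea rather than reusing repository code:

```python
import numpy as np

def segments(label):
    # Minimal stand-in for range_convers_new: (start, end) inclusive index
    # pairs of contiguous non-zero runs (an assumption about its interface).
    idx = np.flatnonzero(label)
    out, s = [], None
    for i in idx:
        if s is None:
            s = e = i
        elif i == e + 1:
            e = i
        else:
            out.append((s, e)); s = e = i
    if s is not None:
        out.append((s, e))
    return out

label = np.zeros(20)
label[8:12] = 1            # one anomaly segment covering indices 8..11
window = 4
ext = label.copy()
for s, e in segments(label):
    x1 = np.arange(e, min(e + window // 2, len(label)))
    ext[x1] += np.sqrt(1 - (x1 - e) / window)   # soft decay after the segment
    x2 = np.arange(max(s - window // 2, 0), s)
    ext[x2] += np.sqrt(1 - (s - x2) / window)   # soft decay before the segment
ext = np.minimum(1.0, ext)

print(ext[6:14].round(2))
```

The anomalous core stays at 1 while the two timestamps on each side get partial credit that decays with distance from the segment boundary.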
def extend_postive_range_individual(x, percentage=0.2):
label = x.copy().astype(float)
L = range_convers_new(label) # index of non-zero segments
length = len(label)
for k in range(len(L)):
s = L[k][0]
e = L[k][1]
l0 = int((e - s + 1) * percentage)
x1 = np.arange(e, min(e + l0, length))
label[x1] += np.sqrt(1 - (x1 - e) / (2 * l0))
x2 = np.arange(max(s - l0, 0), s)
label[x2] += np.sqrt(1 - (s - x2) / (2 * l0))
label = np.minimum(np.ones(length), label)
return label
def TPR_FPR_RangeAUC(labels, pred, P, L):
product = labels * pred
TP = np.sum(product)
# recall = min(TP/P,1)
P_new = (P + np.sum(labels)) / 2 # so TPR is neither large nor small
# P_new = np.sum(labels)
recall = min(TP / P_new, 1)
# recall = TP/np.sum(labels)
# print('recall '+str(recall))
existence = 0
for seg in L:
if np.sum(product[seg[0]:(seg[1] + 1)]) > 0:
existence += 1
existence_ratio = existence / len(L)
# print(existence_ratio)
# TPR_RangeAUC = np.sqrt(recall*existence_ratio)
# print(existence_ratio)
TPR_RangeAUC = recall * existence_ratio
FP = np.sum(pred) - TP
# TN = np.sum((1-pred) * (1-labels))
# FPR_RangeAUC = FP/(FP+TN)
N_new = len(labels) - P_new
FPR_RangeAUC = FP / N_new
Precision_RangeAUC = TP / np.sum(pred)
return TPR_RangeAUC, FPR_RangeAUC, Precision_RangeAUC
def Range_AUC(score_t_test, y_test, window=5, percentage=0, plot_ROC=False, AUC_type='window'):
# AUC_type='window'/'percentage'
score = score_t_test
labels = y_test
score_sorted = -np.sort(-score)
P = np.sum(labels)
# print(np.sum(labels))
if AUC_type == 'window':
labels = extend_postive_range(labels, window=window)
else:
labels = extend_postive_range_individual(labels, percentage=percentage)
# print(np.sum(labels))
L = range_convers_new(labels)
TPR_list = [0]
FPR_list = [0]
Precision_list = [1]
for i in np.linspace(0, len(score) - 1, 250).astype(int):
threshold = score_sorted[i]
# print('thre='+str(threshold))
pred = score >= threshold
TPR, FPR, Precision = TPR_FPR_RangeAUC(labels, pred, P, L)
TPR_list.append(TPR)
FPR_list.append(FPR)
Precision_list.append(Precision)
TPR_list.append(1)
FPR_list.append(1) # otherwise, range-AUC will stop earlier than (1,1)
tpr = np.array(TPR_list)
fpr = np.array(FPR_list)
prec = np.array(Precision_list)
width = fpr[1:] - fpr[:-1]
height = (tpr[1:] + tpr[:-1]) / 2
AUC_range = np.sum(width * height)
width_PR = tpr[1:-1] - tpr[:-2]
height_PR = (prec[1:] + prec[:-1]) / 2
AP_range = np.sum(width_PR * height_PR)
if plot_ROC:
return AUC_range, AP_range, fpr, tpr, prec
return AUC_range
def point_wise_AUC(score_t_test, y_test, plot_ROC=False):
# area under curve
label = y_test
score = score_t_test
auc = metrics.roc_auc_score(label, score)
# plot ROC curve
if plot_ROC:
fpr, tpr, thresholds = metrics.roc_curve(label, score)
# display = metrics.RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=auc)
# display.plot()
return auc, fpr, tpr
else:
return auc
def main():
y_test = np.zeros(100)
y_test[10:20] = 1
y_test[50:60] = 1
pred_labels = np.zeros(100)
pred_labels[15:17] = 0.5
pred_labels[55:62] = 0.7
# pred_labels[51:55] = 1
# true_events = get_events(y_test)
point_auc = point_wise_AUC(pred_labels, y_test)
range_auc = Range_AUC(pred_labels, y_test)
print("point_auc: {}, range_auc: {}".format(point_auc, range_auc))
if __name__ == "__main__":
main()
from sklearn.metrics import confusion_matrix
import numpy as np
def MCC(y_test, pred_labels):
tn, fp, fn, tp = confusion_matrix(y_test, pred_labels).ravel()
MCC_score = (tp*tn-fp*fn)/(((tp+fp)*(tp+fn)*(tn+fp)*(tn+fn))**0.5)
return MCC_score
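The MCC formula above can be verified by hand on a tiny example. The sketch below recomputes the confusion counts with NumPy only (assuming binary 0/1 labels), and matches the closed-form value 2/sqrt(12):

```python
import numpy as np

y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0, 1, 1, 1])

tp = np.sum((y_true == 1) & (y_pred == 1))  # 2
tn = np.sum((y_true == 0) & (y_pred == 0))  # 1
fp = np.sum((y_true == 0) & (y_pred == 1))  # 1
fn = np.sum((y_true == 1) & (y_pred == 0))  # 0

mcc = (tp * tn - fp * fn) / (((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5)
print(round(mcc, 4))  # 2 / sqrt(12) ~= 0.5774
```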
def main():
y_test = np.zeros(100)
y_test[10:20] = 1
y_test[50:60] = 1
pred_labels = np.zeros(100)
pred_labels[15:17] = 1
pred_labels[55:62] = 1
# pred_labels[51:55] = 1
# true_events = get_events(y_test)
mcc_score = MCC(y_test, pred_labels)
print(mcc_score)
if __name__ == "__main__":
main()
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from metrics.affiliation._integral_interval import interval_intersection
def t_start(j, Js = [(1,2),(3,4),(5,6)], Trange = (1,10)):
"""
Helper for `E_gt_func`
:param j: index from 0 to len(Js) (included) on which to get the start
:param Js: ground truth events, as a list of couples
:param Trange: range of the series where Js is included
:return: generalized start such that the middle of t_start and t_stop
always gives the affiliation zone
"""
b = max(Trange)
n = len(Js)
if j == n:
return(2*b - t_stop(n-1, Js, Trange))
else:
return(Js[j][0])
def t_stop(j, Js = [(1,2),(3,4),(5,6)], Trange = (1,10)):
"""
Helper for `E_gt_func`
:param j: index from 0 to len(Js) (included) on which to get the stop
:param Js: ground truth events, as a list of couples
:param Trange: range of the series where Js is included
:return: generalized stop such that the middle of t_start and t_stop
always gives the affiliation zone
"""
if j == -1:
a = min(Trange)
return(2*a - t_start(0, Js, Trange))
else:
return(Js[j][1])
def E_gt_func(j, Js, Trange):
"""
Get the affiliation zone of element j of the ground truth
:param j: index from 0 to len(Js) (excluded) on which to get the zone
:param Js: ground truth events, as a list of couples
:param Trange: range of the series where Js is included, can
be (-math.inf, math.inf) for distance measures
:return: affiliation zone of element j of the ground truth represented
as a couple
"""
range_left = (t_stop(j-1, Js, Trange) + t_start(j, Js, Trange))/2
range_right = (t_stop(j, Js, Trange) + t_start(j+1, Js, Trange))/2
return((range_left, range_right))
def get_all_E_gt_func(Js, Trange):
"""
Get the affiliation partition from the ground truth point of view
:param Js: ground truth events, as a list of couples
:param Trange: range of the series where Js is included, can
be (-math.inf, math.inf) for distance measures
:return: affiliation partition of the events
"""
# E_gt is the limit of affiliation/attraction for each ground truth event
E_gt = [E_gt_func(j, Js, Trange) for j in range(len(Js))]
return(E_gt)
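A worked example may help: for two ground-truth events on a series range, the affiliation zones split the range at the midpoints between consecutive (generalized) event boundaries. This is a compact, self-contained re-implementation of the logic in `t_start`/`t_stop`/`E_gt_func` above, for illustration only:

```python
def zones(Js, Trange):
    # Affiliation zone of each ground-truth event: midpoints between the
    # generalized stop of the previous event and the start of the next one,
    # mirroring t_start/t_stop/E_gt_func above.
    a, b = min(Trange), max(Trange)
    n = len(Js)
    def t_start(j):
        return 2 * b - t_stop(n - 1) if j == n else Js[j][0]
    def t_stop(j):
        return 2 * a - t_start(0) if j == -1 else Js[j][1]
    return [((t_stop(j - 1) + t_start(j)) / 2,
             (t_stop(j) + t_start(j + 1)) / 2) for j in range(n)]

print(zones([(1, 2), (5, 6)], (0, 10)))  # [(0.0, 3.5), (3.5, 10.0)]
```

Every point of the range is attributed to exactly one ground-truth event: the zones partition (0, 10) with the split at 3.5, midway between the first event's stop (2) and the second event's start (5).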
def affiliation_partition(Is = [(1,1.5),(2,5),(5,6),(8,9)], E_gt = [(1,2.5),(2.5,4.5),(4.5,10)]):
"""
Cut the events into the affiliation zones
The presentation given here is from the ground truth point of view,
but it is also used in the reversed direction in the main function.
:param Is: events as a list of couples
:param E_gt: range of the affiliation zones
:return: a list of list of intervals (each interval represented by either
a couple or None for empty interval). The outer list is indexed by each
affiliation zone of `E_gt`. The inner list is indexed by the events of `Is`.
"""
out = [None] * len(E_gt)
for j in range(len(E_gt)):
E_gt_j = E_gt[j]
discarded_idx_before = [I[1] < E_gt_j[0] for I in Is]  # predicted I ends before E begins
discarded_idx_after = [I[0] > E_gt_j[1] for I in Is]  # predicted I starts after E ends
kept_index = [not (a or b) for a, b in zip(discarded_idx_before, discarded_idx_after)]
Is_j = [x for x, y in zip(Is, kept_index) if y]  # keep only the events overlapping this zone
out[j] = [interval_intersection(I, E_gt[j]) for I in Is_j]
return(out)
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import math
from metrics.affiliation.generics import _sum_wo_nan
"""
In order to shorten the length of the variables,
the general convention in this file is to let:
- I for a predicted event (start, stop),
- Is for a list of predicted events,
- J for a ground truth event,
- Js for a list of ground truth events.
"""
def interval_length(J = (1,2)):
"""
Length of an interval
:param J: couple representing the start and stop of an interval, or None
:return: length of the interval, and 0 for a None interval
"""
if J is None:
return(0)
return(J[1] - J[0])
def sum_interval_lengths(Is = [(1,2),(3,4),(5,6)]):
"""
Sum of length of the intervals
:param Is: list of intervals represented by starts and stops
:return: sum of the interval length
"""
return(sum([interval_length(I) for I in Is]))
def interval_intersection(I = (1, 3), J = (2, 4)):
"""
Intersection between two intervals I and J
    I and J should be either empty or represent an interval of positive length (not a single point)
:param I: an interval represented by start and stop
:param J: a second interval of the same form
:return: an interval representing the start and stop of the intersection (or None if empty)
"""
if I is None:
return(None)
if J is None:
return(None)
I_inter_J = (max(I[0], J[0]), min(I[1], J[1]))
if I_inter_J[0] >= I_inter_J[1]:
return(None)
else:
return(I_inter_J)
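A quick standalone sanity check of the intersection logic above (the helper is restated inline in compact form so the snippet runs on its own; it is behavior-equivalent to `interval_intersection` for well-formed intervals):

```python
def interval_intersection(I, J):
    # Overlap of two intervals given as (start, stop), or None when empty.
    if I is None or J is None:
        return None
    lo, hi = max(I[0], J[0]), min(I[1], J[1])
    return (lo, hi) if lo < hi else None

print(interval_intersection((1, 3), (2, 4)))  # (2, 3)
print(interval_intersection((1, 2), (2, 4)))  # None: touching endpoints do not overlap
```

Note that intervals sharing only a boundary point intersect in a zero-length set, which the convention here treats as empty (None).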
def interval_subset(I = (1, 3), J = (0, 6)):
"""
Checks whether I is a subset of J
    :param I: a non empty interval represented by start and stop
:param J: a second non empty interval of the same form
:return: True if I is a subset of J
"""
if (I[0] >= J[0]) and (I[1] <= J[1]):
return True
else:
return False
def cut_into_three_func(I, J):
"""
Cut an interval I into a partition of 3 subsets:
the elements before J,
the elements belonging to J,
and the elements after J
:param I: an interval represented by start and stop, or None for an empty one
:param J: a non empty interval
:return: a triplet of three intervals, each represented by either (start, stop) or None
"""
if I is None:
return((None, None, None))
I_inter_J = interval_intersection(I, J)
if I == I_inter_J:
I_before = None
I_after = None
elif I[1] <= J[0]:
I_before = I
I_after = None
elif I[0] >= J[1]:
I_before = None
I_after = I
elif (I[0] <= J[0]) and (I[1] >= J[1]):
I_before = (I[0], I_inter_J[0])
I_after = (I_inter_J[1], I[1])
elif I[0] <= J[0]:
I_before = (I[0], I_inter_J[0])
I_after = None
elif I[1] >= J[1]:
I_before = None
I_after = (I_inter_J[1], I[1])
else:
raise ValueError('unexpected unconsidered case')
return(I_before, I_inter_J, I_after)
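The case analysis above reduces to clipping I against the two borders of J. A compact standalone restatement (equivalent for well-formed intervals with start < stop), followed by two representative cases:

```python
def interval_intersection(I, J):
    if I is None or J is None:
        return None
    lo, hi = max(I[0], J[0]), min(I[1], J[1])
    return (lo, hi) if lo < hi else None

def cut_into_three(I, J):
    # Split I into the parts strictly before J, inside J, and strictly after J.
    if I is None:
        return (None, None, None)
    mid = interval_intersection(I, J)
    before = (I[0], min(I[1], J[0])) if I[0] < J[0] else None
    after = (max(I[0], J[1]), I[1]) if I[1] > J[1] else None
    return (before, mid, after)

# I straddles J entirely: pieces on both sides plus the overlap.
print(cut_into_three((0, 10), (3, 6)))  # ((0, 3), (3, 6), (6, 10))
# I lies left of J: only a 'before' piece.
print(cut_into_three((0, 2), (3, 6)))   # ((0, 2), None, None)
```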
def get_pivot_j(I, J):
"""
Get the single point of J that is the closest to I, called 'pivot' here,
with the requirement that I should be outside J
:param I: a non empty interval (start, stop)
:param J: another non empty interval, with empty intersection with I
:return: the element j of J that is the closest to I
"""
if interval_intersection(I, J) is not None:
raise ValueError('I and J should have a void intersection')
j_pivot = None # j_pivot is a border of J
if max(I) <= min(J):
j_pivot = min(J)
elif min(I) >= max(J):
j_pivot = max(J)
else:
raise ValueError('I should be outside J')
return(j_pivot)
def integral_mini_interval(I, J):
"""
In the specific case where interval I is located outside J,
integral of distance from x to J over the interval x \in I.
This is the *integral* i.e. the sum.
It's not the mean (not divided by the length of I yet)
    :param I: an interval (start, stop), or None
:param J: a non empty interval, with empty intersection with I
:return: the integral of distances d(x, J) over x \in I
"""
if I is None:
return(0)
j_pivot = get_pivot_j(I, J)
a = min(I)
b = max(I)
return((b-a)*abs((j_pivot - (a+b)/2)))
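The closed form above follows because d(x, J) is linear in x when I is entirely on one side of J, so the integral equals the interval length times the distance at the midpoint. A numeric cross-check for I = (0, 1) left of J = (2, 3), where d(x, J) = 2 - x:

```python
# Closed form: (b - a) * |j_pivot - (a + b)/2| with pivot = min(J) = 2.
a, b, j_pivot = 0.0, 1.0, 2.0
closed_form = (b - a) * abs(j_pivot - (a + b) / 2)

# Midpoint-rule approximation of the integral of (2 - x) over [0, 1]
# (exact for a linear integrand, up to floating point error).
n = 10000
step = (b - a) / n
riemann = sum((j_pivot - (a + step * (i + 0.5))) * step for i in range(n))
print(closed_form, round(riemann, 6))  # 1.5 1.5
```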
def integral_interval_distance(I, J):
"""
For any non empty intervals I, J, compute the
integral of distance from x to J over the interval x \in I.
This is the *integral* i.e. the sum.
It's not the mean (not divided by the length of I yet)
The interval I can intersect J or not
    :param I: an interval (start, stop), or None
:param J: a non empty interval
:return: the integral of distances d(x, J) over x \in I
"""
# I and J are single intervals (not generic sets)
# I is a predicted interval in the range of affiliation of J
def f(I_cut):
return(integral_mini_interval(I_cut, J))
    # If I_middle is fully included into J,
    # the distance to J is always 0
def f0(I_middle):
return(0)
cut_into_three = cut_into_three_func(I, J)
# Distance for now, not the mean:
# Distance left: Between cut_into_three[0] and the point min(J)
d_left = f(cut_into_three[0])
# Distance middle: Between cut_into_three[1] = I inter J, and J
d_middle = f0(cut_into_three[1])
# Distance right: Between cut_into_three[2] and the point max(J)
d_right = f(cut_into_three[2])
# It's an integral so summable
return(d_left + d_middle + d_right)
def integral_mini_interval_P_CDFmethod__min_piece(I, J, E):
"""
Helper of `integral_mini_interval_Pprecision_CDFmethod`
In the specific case where interval I is located outside J,
compute the integral $\int_{d_min}^{d_max} \min(m, x) dx$, with:
- m the smallest distance from J to E,
- d_min the smallest distance d(x, J) from x \in I to J
- d_max the largest distance d(x, J) from x \in I to J
:param I: a single predicted interval, a non empty interval (start, stop)
:param J: ground truth interval, a non empty interval, with empty intersection with I
:param E: the affiliation/influence zone for J, represented as a couple (start, stop)
:return: the integral $\int_{d_min}^{d_max} \min(m, x) dx$
"""
if interval_intersection(I, J) is not None:
raise ValueError('I and J should have a void intersection')
if not interval_subset(J, E):
raise ValueError('J should be included in E')
if not interval_subset(I, E):
raise ValueError('I should be included in E')
e_min = min(E)
j_min = min(J)
j_max = max(J)
e_max = max(E)
i_min = min(I)
i_max = max(I)
d_min = max(i_min - j_max, j_min - i_max)
d_max = max(i_max - j_max, j_min - i_min)
m = min(j_min - e_min, e_max - j_max)
A = min(d_max, m)**2 - min(d_min, m)**2
B = max(d_max, m) - max(d_min, m)
C = (1/2)*A + m*B
return(C)
def integral_mini_interval_Pprecision_CDFmethod(I, J, E):
"""
Integral of the probability of distances over the interval I.
In the specific case where interval I is located outside J,
compute the integral $\int_{x \in I} Fbar(dist(x,J)) dx$.
This is the *integral* i.e. the sum (not the mean)
:param I: a single predicted interval, a non empty interval (start, stop)
:param J: ground truth interval, a non empty interval, with empty intersection with I
:param E: the affiliation/influence zone for J, represented as a couple (start, stop)
:return: the integral $\int_{x \in I} Fbar(dist(x,J)) dx$
"""
integral_min_piece = integral_mini_interval_P_CDFmethod__min_piece(I, J, E)
e_min = min(E)
j_min = min(J)
j_max = max(J)
e_max = max(E)
i_min = min(I)
i_max = max(I)
d_min = max(i_min - j_max, j_min - i_max)
d_max = max(i_max - j_max, j_min - i_min)
integral_linear_piece = (1/2)*(d_max**2 - d_min**2)
integral_remaining_piece = (j_max - j_min)*(i_max - i_min)
DeltaI = i_max - i_min
DeltaE = e_max - e_min
output = DeltaI - (1/DeltaE)*(integral_min_piece + integral_linear_piece + integral_remaining_piece)
return(output)
def integral_interval_probaCDF_precision(I, J, E):
"""
Integral of the probability of distances over the interval I.
Compute the integral $\int_{x \in I} Fbar(dist(x,J)) dx$.
This is the *integral* i.e. the sum (not the mean)
:param I: a single (non empty) predicted interval in the zone of affiliation of J
:param J: ground truth interval
:param E: affiliation/influence zone for J
:return: the integral $\int_{x \in I} Fbar(dist(x,J)) dx$
"""
# I and J are single intervals (not generic sets)
def f(I_cut):
if I_cut is None:
return(0)
else:
return(integral_mini_interval_Pprecision_CDFmethod(I_cut, J, E))
# If I_middle is fully included into J, it is
# integral of 1 on the interval I_middle, so it's |I_middle|
def f0(I_middle):
if I_middle is None:
return(0)
else:
return(max(I_middle) - min(I_middle))
cut_into_three = cut_into_three_func(I, J)
# Distance for now, not the mean:
# Distance left: Between cut_into_three[0] and the point min(J)
d_left = f(cut_into_three[0])
# Distance middle: Between cut_into_three[1] = I inter J, and J
d_middle = f0(cut_into_three[1])
# Distance right: Between cut_into_three[2] and the point max(J)
d_right = f(cut_into_three[2])
# It's an integral so summable
return(d_left + d_middle + d_right)
def cut_J_based_on_mean_func(J, e_mean):
"""
Helper function for the recall.
Partition J into two intervals: before and after e_mean
(e_mean represents the center element of E the zone of affiliation)
:param J: ground truth interval
:param e_mean: a float number (center value of E)
    :return: a couple partitioning J into (J_before, J_after)
"""
if J is None:
J_before = None
J_after = None
elif e_mean >= max(J):
J_before = J
J_after = None
elif e_mean <= min(J):
J_before = None
J_after = J
else: # e_mean is across J
J_before = (min(J), e_mean)
J_after = (e_mean, max(J))
return((J_before, J_after))
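The partition logic is a one-point split with None for empty sides. A minimal standalone sketch (the name `split_at` is illustrative, not part of the repository) mirroring the three branches above:

```python
def split_at(J, m):
    # Split interval J at scalar m into (left, right); either side may be None.
    if J is None:
        return (None, None)
    if m >= J[1]:
        return (J, None)
    if m <= J[0]:
        return (None, J)
    return ((J[0], m), (m, J[1]))

print(split_at((2, 6), 4))  # ((2, 4), (4, 6))
print(split_at((2, 6), 8))  # ((2, 6), None)
```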
def integral_mini_interval_Precall_CDFmethod(I, J, E):
"""
Integral of the probability of distances over the interval J.
In the specific case where interval J is located outside I,
compute the integral $\int_{y \in J} Fbar_y(dist(y,I)) dy$.
This is the *integral* i.e. the sum (not the mean)
:param I: a single (non empty) predicted interval
:param J: ground truth (non empty) interval, with empty intersection with I
:param E: the affiliation/influence zone for J, represented as a couple (start, stop)
:return: the integral $\int_{y \in J} Fbar_y(dist(y,I)) dy$
"""
# The interval J should be located outside I
# (so it's either the left piece or the right piece w.r.t I)
i_pivot = get_pivot_j(J, I)
e_min = min(E)
e_max = max(E)
e_mean = (e_min + e_max) / 2
# If i_pivot is outside E (it's possible), then
    # the distance is worse than for any random element within E,
# so we set the recall to 0
if i_pivot <= min(E):
return(0)
elif i_pivot >= max(E):
return(0)
# Otherwise, we have at least i_pivot in E and so d < M so min(d,M)=d
cut_J_based_on_e_mean = cut_J_based_on_mean_func(J, e_mean)
J_before = cut_J_based_on_e_mean[0]
J_after = cut_J_based_on_e_mean[1]
iemin_mean = (e_min + i_pivot)/2
cut_Jbefore_based_on_iemin_mean = cut_J_based_on_mean_func(J_before, iemin_mean)
J_before_closeE = cut_Jbefore_based_on_iemin_mean[0] # before e_mean and closer to e_min than i_pivot ~ J_before_before
J_before_closeI = cut_Jbefore_based_on_iemin_mean[1] # before e_mean and closer to i_pivot than e_min ~ J_before_after
iemax_mean = (e_max + i_pivot)/2
cut_Jafter_based_on_iemax_mean = cut_J_based_on_mean_func(J_after, iemax_mean)
J_after_closeI = cut_Jafter_based_on_iemax_mean[0] # after e_mean and closer to i_pivot than e_max ~ J_after_before
J_after_closeE = cut_Jafter_based_on_iemax_mean[1] # after e_mean and closer to e_max than i_pivot ~ J_after_after
if J_before_closeE is not None:
j_before_before_min = min(J_before_closeE) # == min(J)
j_before_before_max = max(J_before_closeE)
else:
j_before_before_min = math.nan
j_before_before_max = math.nan
if J_before_closeI is not None:
j_before_after_min = min(J_before_closeI) # == j_before_before_max if existing
j_before_after_max = max(J_before_closeI) # == max(J_before)
else:
j_before_after_min = math.nan
j_before_after_max = math.nan
if J_after_closeI is not None:
j_after_before_min = min(J_after_closeI) # == min(J_after)
j_after_before_max = max(J_after_closeI)
else:
j_after_before_min = math.nan
j_after_before_max = math.nan
if J_after_closeE is not None:
j_after_after_min = min(J_after_closeE) # == j_after_before_max if existing
j_after_after_max = max(J_after_closeE) # == max(J)
else:
j_after_after_min = math.nan
j_after_after_max = math.nan
# <-- J_before_closeE --> <-- J_before_closeI --> <-- J_after_closeI --> <-- J_after_closeE -->
# j_bb_min j_bb_max j_ba_min j_ba_max j_ab_min j_ab_max j_aa_min j_aa_max
# (with `b` for before and `a` for after in the previous variable names)
# vs e_mean m = min(t-e_min, e_max-t) d=|i_pivot-t| min(d,m) \int min(d,m)dt \int d dt \int_(min(d,m)+d)dt \int_{t \in J}(min(d,m)+d)dt
# Case J_before_closeE & i_pivot after J before t-e_min i_pivot-t min(i_pivot-t,t-e_min) = t-e_min t^2/2-e_min*t i_pivot*t-t^2/2 t^2/2-e_min*t+i_pivot*t-t^2/2 = (i_pivot-e_min)*t (i_pivot-e_min)*tB - (i_pivot-e_min)*tA = (i_pivot-e_min)*(tB-tA)
# Case J_before_closeI & i_pivot after J before t-e_min i_pivot-t min(i_pivot-t,t-e_min) = i_pivot-t i_pivot*t-t^2/2 i_pivot*t-t^2/2 i_pivot*t-t^2/2+i_pivot*t-t^2/2 = 2*i_pivot*t-t^2 2*i_pivot*tB-tB^2 - 2*i_pivot*tA + tA^2 = 2*i_pivot*(tB-tA) - (tB^2 - tA^2)
# Case J_after_closeI & i_pivot after J after e_max-t i_pivot-t min(i_pivot-t,e_max-t) = i_pivot-t i_pivot*t-t^2/2 i_pivot*t-t^2/2 i_pivot*t-t^2/2+i_pivot*t-t^2/2 = 2*i_pivot*t-t^2 2*i_pivot*tB-tB^2 - 2*i_pivot*tA + tA^2 = 2*i_pivot*(tB-tA) - (tB^2 - tA^2)
# Case J_after_closeE & i_pivot after J after e_max-t i_pivot-t min(i_pivot-t,e_max-t) = e_max-t e_max*t-t^2/2 i_pivot*t-t^2/2 e_max*t-t^2/2+i_pivot*t-t^2/2 = (e_max+i_pivot)*t-t^2 (e_max+i_pivot)*tB-tB^2 - (e_max+i_pivot)*tA + tA^2 = (e_max+i_pivot)*(tB-tA) - (tB^2 - tA^2)
#
# Case J_before_closeE & i_pivot before J before t-e_min t-i_pivot min(t-i_pivot,t-e_min) = t-e_min t^2/2-e_min*t t^2/2-i_pivot*t t^2/2-e_min*t+t^2/2-i_pivot*t = t^2-(e_min+i_pivot)*t tB^2-(e_min+i_pivot)*tB - tA^2 + (e_min+i_pivot)*tA = (tB^2 - tA^2) - (e_min+i_pivot)*(tB-tA)
# Case J_before_closeI & i_pivot before J before t-e_min t-i_pivot min(t-i_pivot,t-e_min) = t-i_pivot t^2/2-i_pivot*t t^2/2-i_pivot*t t^2/2-i_pivot*t+t^2/2-i_pivot*t = t^2-2*i_pivot*t tB^2-2*i_pivot*tB - tA^2 + 2*i_pivot*tA = (tB^2 - tA^2) - 2*i_pivot*(tB-tA)
# Case J_after_closeI & i_pivot before J after e_max-t t-i_pivot min(t-i_pivot,e_max-t) = t-i_pivot t^2/2-i_pivot*t t^2/2-i_pivot*t t^2/2-i_pivot*t+t^2/2-i_pivot*t = t^2-2*i_pivot*t tB^2-2*i_pivot*tB - tA^2 + 2*i_pivot*tA = (tB^2 - tA^2) - 2*i_pivot*(tB-tA)
# Case J_after_closeE & i_pivot before J after e_max-t t-i_pivot min(t-i_pivot,e_max-t) = e_max-t e_max*t-t^2/2 t^2/2-i_pivot*t e_max*t-t^2/2+t^2/2-i_pivot*t = (e_max-i_pivot)*t (e_max-i_pivot)*tB - (e_max-i_pivot)*tA = (e_max-i_pivot)*(tB-tA)
if i_pivot >= max(J):
part1_before_closeE = (i_pivot-e_min)*(j_before_before_max - j_before_before_min) # (i_pivot-e_min)*(tB-tA) # j_before_before_max - j_before_before_min
part2_before_closeI = 2*i_pivot*(j_before_after_max-j_before_after_min) - (j_before_after_max**2 - j_before_after_min**2) # 2*i_pivot*(tB-tA) - (tB^2 - tA^2) # j_before_after_max - j_before_after_min
part3_after_closeI = 2*i_pivot*(j_after_before_max-j_after_before_min) - (j_after_before_max**2 - j_after_before_min**2) # 2*i_pivot*(tB-tA) - (tB^2 - tA^2) # j_after_before_max - j_after_before_min
part4_after_closeE = (e_max+i_pivot)*(j_after_after_max-j_after_after_min) - (j_after_after_max**2 - j_after_after_min**2) # (e_max+i_pivot)*(tB-tA) - (tB^2 - tA^2) # j_after_after_max - j_after_after_min
out_parts = [part1_before_closeE, part2_before_closeI, part3_after_closeI, part4_after_closeE]
elif i_pivot <= min(J):
part1_before_closeE = (j_before_before_max**2 - j_before_before_min**2) - (e_min+i_pivot)*(j_before_before_max-j_before_before_min) # (tB^2 - tA^2) - (e_min+i_pivot)*(tB-tA) # j_before_before_max - j_before_before_min
part2_before_closeI = (j_before_after_max**2 - j_before_after_min**2) - 2*i_pivot*(j_before_after_max-j_before_after_min) # (tB^2 - tA^2) - 2*i_pivot*(tB-tA) # j_before_after_max - j_before_after_min
part3_after_closeI = (j_after_before_max**2 - j_after_before_min**2) - 2*i_pivot*(j_after_before_max - j_after_before_min) # (tB^2 - tA^2) - 2*i_pivot*(tB-tA) # j_after_before_max - j_after_before_min
part4_after_closeE = (e_max-i_pivot)*(j_after_after_max - j_after_after_min) # (e_max-i_pivot)*(tB-tA) # j_after_after_max - j_after_after_min
out_parts = [part1_before_closeE, part2_before_closeI, part3_after_closeI, part4_after_closeE]
else:
raise ValueError('The i_pivot should be outside J')
out_integral_min_dm_plus_d = _sum_wo_nan(out_parts) # integral on all J, i.e. sum of the disjoint parts
# We have for each point t of J:
# \bar{F}_{t, recall}(d) = 1 - (1/|E|) * (min(d,m) + d)
# Since t is a single-point here, and we are in the case where i_pivot is inside E.
# The integral is then given by:
# C = \int_{t \in J} \bar{F}_{t, recall}(D(t)) dt
# = \int_{t \in J} 1 - (1/|E|) * (min(d,m) + d) dt
# = |J| - (1/|E|) * [\int_{t \in J} (min(d,m) + d) dt]
# = |J| - (1/|E|) * out_integral_min_dm_plus_d
DeltaJ = max(J) - min(J)
DeltaE = max(E) - min(E)
C = DeltaJ - (1/DeltaE) * out_integral_min_dm_plus_d
return(C)
def integral_interval_probaCDF_recall(I, J, E):
"""
Integral of the probability of distances over the interval J.
Compute the integral $\int_{y \in J} Fbar_y(dist(y,I)) dy$.
This is the *integral* i.e. the sum (not the mean)
:param I: a single (non empty) predicted interval
:param J: ground truth (non empty) interval
:param E: the affiliation/influence zone for J
:return: the integral $\int_{y \in J} Fbar_y(dist(y,I)) dy$
"""
# I and J are single intervals (not generic sets)
# E is the outside affiliation interval of J (even for recall!)
# (in particular J \subset E)
#
# J is the portion of the ground truth affiliated to I
# I is a predicted interval (can be outside E possibly since it's recall)
def f(J_cut):
if J_cut is None:
return(0)
else:
return integral_mini_interval_Precall_CDFmethod(I, J_cut, E)
# If J_middle is fully included into I, it is
# integral of 1 on the interval J_middle, so it's |J_middle|
def f0(J_middle):
if J_middle is None:
return(0)
else:
return(max(J_middle) - min(J_middle))
cut_into_three = cut_into_three_func(J, I) # it's J that we cut into 3, depending on the position w.r.t I
# since we integrate over J this time.
#
# Distance for now, not the mean:
# Distance left: Between cut_into_three[0] and the point min(I)
d_left = f(cut_into_three[0])
# Distance middle: Between cut_into_three[1] = J inter I, and I
d_middle = f0(cut_into_three[1])
# Distance right: Between cut_into_three[2] and the point max(I)
d_right = f(cut_into_three[2])
# It's an integral so summable
return(d_left + d_middle + d_right)
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import math
from metrics.affiliation._affiliation_zone import (
get_all_E_gt_func,
affiliation_partition)
from metrics.affiliation._integral_interval import (
integral_interval_distance,
integral_interval_probaCDF_precision,
integral_interval_probaCDF_recall,
interval_length,
sum_interval_lengths)
def affiliation_precision_distance(Is = [(1,2),(3,4),(5,6)], J = (2,5.5)):
"""
Compute the individual average distance from Is to a single ground truth J
:param Is: list of predicted events within the affiliation zone of J
    :param J: couple representing the start and stop of a ground truth interval
:return: individual average precision directed distance number
"""
if all([I is None for I in Is]): # no prediction in the current area
return(math.nan) # undefined
return(sum([integral_interval_distance(I, J) for I in Is]) / sum_interval_lengths(Is))
def affiliation_precision_proba(Is = [(1,2),(3,4),(5,6)], J = (2,5.5), E = (0,8)):
"""
Compute the individual precision probability from Is to a single ground truth J
:param Is: list of predicted events within the affiliation zone of J
    :param J: couple representing the start and stop of a ground truth interval
:param E: couple representing the start and stop of the zone of affiliation of J
:return: individual precision probability in [0, 1], or math.nan if undefined
"""
if all([I is None for I in Is]): # no prediction in the current area
return(math.nan) # undefined
return(sum([integral_interval_probaCDF_precision(I, J, E) for I in Is]) / sum_interval_lengths(Is))
def affiliation_recall_distance(Is = [(1,2),(3,4),(5,6)], J = (2,5.5)):
"""
Compute the individual average distance from a single J to the predictions Is
:param Is: list of predicted events within the affiliation zone of J
    :param J: couple representing the start and stop of a ground truth interval
:return: individual average recall directed distance number
"""
Is = [I for I in Is if I is not None] # filter possible None in Is
if len(Is) == 0: # there is no prediction in the current area
return(math.inf)
E_gt_recall = get_all_E_gt_func(Is, (-math.inf, math.inf)) # here from the point of view of the predictions
    Js = affiliation_partition([J], E_gt_recall) # partition of J depending on the proximity with Is
return(sum([integral_interval_distance(J[0], I) for I, J in zip(Is, Js)]) / interval_length(J))
def affiliation_recall_proba(Is = [(1,2),(3,4),(5,6)], J = (2,5.5), E = (0,8)):
"""
Compute the individual recall probability from a single ground truth J to Is
:param Is: list of predicted events within the affiliation zone of J
    :param J: couple representing the start and stop of a ground truth interval
:param E: couple representing the start and stop of the zone of affiliation of J
:return: individual recall probability in [0, 1]
"""
Is = [I for I in Is if I is not None] # filter possible None in Is
if len(Is) == 0: # there is no prediction in the current area
return(0)
E_gt_recall = get_all_E_gt_func(Is, E) # here from the point of view of the predictions
    Js = affiliation_partition([J], E_gt_recall) # partition of J depending on the proximity with Is
return(sum([integral_interval_probaCDF_recall(I, J[0], E) for I, J in zip(Is, Js)]) / interval_length(J))
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from itertools import groupby
from operator import itemgetter
import math
import gzip
import glob
import os
def convert_vector_to_events(vector = [0, 1, 1, 0, 0, 1, 0]):
"""
Convert a binary vector (indicating 1 for the anomalous instances)
to a list of events. The events are considered as durations,
i.e. setting 1 at index i corresponds to an anomalous interval [i, i+1).
:param vector: a list of elements belonging to {0, 1}
:return: a list of couples, each couple representing the start and stop of
each event
"""
positive_indexes = [idx for idx, val in enumerate(vector) if val > 0]
events = []
for k, g in groupby(enumerate(positive_indexes), lambda ix : ix[0] - ix[1]):
cur_cut = list(map(itemgetter(1), g))
events.append((cur_cut[0], cur_cut[-1]))
# Consistent conversion in case of range anomalies (for indexes):
# A positive index i is considered as the interval [i, i+1),
# so the last index should be moved by 1
events = [(x, y+1) for (x,y) in events]
return(events)
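The grouping trick above keys consecutive indexes by `index - position`, which is constant within a run of consecutive integers. A compact standalone restatement of the same conversion:

```python
from itertools import groupby
from operator import itemgetter

def to_events(vector):
    # Group consecutive positive indexes; index i covers the interval [i, i+1).
    pos = [i for i, v in enumerate(vector) if v > 0]
    runs = [list(map(itemgetter(1), g))
            for _, g in groupby(enumerate(pos), lambda ix: ix[0] - ix[1])]
    return [(r[0], r[-1] + 1) for r in runs]

print(to_events([0, 1, 1, 0, 0, 1, 0]))  # [(1, 3), (5, 6)]
```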
def infer_Trange(events_pred, events_gt):
"""
Given the list of events events_pred and events_gt, get the
smallest possible Trange corresponding to the start and stop indexes
of the whole series.
Trange will not influence the measure of distances, but will impact the
measures of probabilities.
:param events_pred: a list of couples corresponding to predicted events
:param events_gt: a list of couples corresponding to ground truth events
:return: a couple corresponding to the smallest range containing the events
"""
if len(events_gt) == 0:
raise ValueError('The gt events should contain at least one event')
if len(events_pred) == 0:
# empty prediction, base Trange only on events_gt (which is non empty)
return(infer_Trange(events_gt, events_gt))
min_pred = min([x[0] for x in events_pred])
min_gt = min([x[0] for x in events_gt])
max_pred = max([x[1] for x in events_pred])
max_gt = max([x[1] for x in events_gt])
Trange = (min(min_pred, min_gt), max(max_pred, max_gt))
return(Trange)
def has_point_anomalies(events):
"""
Checking whether events contain point anomalies, i.e.
events starting and stopping at the same time.
:param events: a list of couples corresponding to predicted events
    :return: True if the events have any point anomalies, False otherwise
"""
if len(events) == 0:
return(False)
return(min([x[1] - x[0] for x in events]) == 0)
def _sum_wo_nan(vec):
"""
Sum of elements, ignoring math.isnan ones
:param vec: vector of floating numbers
:return: sum of the elements, ignoring math.isnan ones
"""
vec_wo_nan = [e for e in vec if not math.isnan(e)]
return(sum(vec_wo_nan))
def _len_wo_nan(vec):
"""
Count of elements, ignoring math.isnan ones
:param vec: vector of floating numbers
:return: count of the elements, ignoring math.isnan ones
"""
vec_wo_nan = [e for e in vec if not math.isnan(e)]
return(len(vec_wo_nan))
def read_gz_data(filename = 'data/machinetemp_groundtruth.gz'):
"""
Load a file compressed with gz, such that each line of the
    file is either 0 (representing a normal instance) or 1
    (representing an anomalous instance).
:param filename: file path to the gz compressed file
:return: list of integers with either 0 or 1
"""
with gzip.open(filename, 'rb') as f:
content = f.read().splitlines()
content = [int(x) for x in content]
return(content)
def read_all_as_events():
"""
Load the files contained in the folder `data/` and convert
to events. The length of the series is kept.
The convention for the file name is: `dataset_algorithm.gz`
:return: two dictionaries:
- the first containing the list of events for each dataset and algorithm,
- the second containing the range of the series for each dataset
"""
filepaths = glob.glob('data/*.gz')
datasets = dict()
Tranges = dict()
for filepath in filepaths:
vector = read_gz_data(filepath)
events = convert_vector_to_events(vector)
# ad hoc cut for those files
cut_filepath = (os.path.split(filepath)[1]).split('_')
data_name = cut_filepath[0]
algo_name = (cut_filepath[1]).split('.')[0]
if not data_name in datasets:
datasets[data_name] = dict()
Tranges[data_name] = (0, len(vector))
datasets[data_name][algo_name] = events
return(datasets, Tranges)
def f1_func(p, r):
"""
Compute the f1 function
:param p: precision numeric value
:param r: recall numeric value
:return: f1 numeric value
"""
return(2*p*r/(p+r))
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from metrics.affiliation.generics import (
infer_Trange,
has_point_anomalies,
_len_wo_nan,
_sum_wo_nan,
read_all_as_events)
from metrics.affiliation._affiliation_zone import (
get_all_E_gt_func,
affiliation_partition)
from metrics.affiliation._single_ground_truth_event import (
affiliation_precision_distance,
affiliation_recall_distance,
affiliation_precision_proba,
affiliation_recall_proba)
def test_events(events):
"""
Verify the validity of the input events
:param events: list of events, each represented by a couple (start, stop)
    :return: None. Raises an error for incorrectly formed or non-ordered events
"""
if type(events) is not list:
raise TypeError('Input `events` should be a list of couples')
if not all([type(x) is tuple for x in events]):
raise TypeError('Input `events` should be a list of tuples')
if not all([len(x) == 2 for x in events]):
raise ValueError('Input `events` should be a list of couples (start, stop)')
if not all([x[0] <= x[1] for x in events]):
raise ValueError('Input `events` should be a list of couples (start, stop) with start <= stop')
if not all([events[i][1] < events[i+1][0] for i in range(len(events) - 1)]):
raise ValueError('Couples of input `events` should be disjoint and ordered')
def pr_from_events(events_pred, events_gt, Trange):
"""
Compute the affiliation metrics including the precision/recall in [0,1],
along with the individual precision/recall distances and probabilities
:param events_pred: list of predicted events, each represented by a couple
indicating the start and the stop of the event
:param events_gt: list of ground truth events, each represented by a couple
indicating the start and the stop of the event
:param Trange: range of the series where events_pred and events_gt are included,
represented as a couple (start, stop)
:return: dictionary with precision, recall, and the individual metrics
"""
# testing the inputs
test_events(events_pred)
test_events(events_gt)
# other tests
minimal_Trange = infer_Trange(events_pred, events_gt)
if not Trange[0] <= minimal_Trange[0]:
raise ValueError('`Trange` should include all the events')
if not minimal_Trange[1] <= Trange[1]:
raise ValueError('`Trange` should include all the events')
if len(events_gt) == 0:
raise ValueError('Input `events_gt` should have at least one event')
if has_point_anomalies(events_pred) or has_point_anomalies(events_gt):
raise ValueError('Cannot manage point anomalies currently')
if Trange is None:
# Set as default, but Trange should be indicated if probabilities are used
        raise ValueError('Trange should be indicated (or inferred with the `infer_Trange` function)')
E_gt = get_all_E_gt_func(events_gt, Trange)
aff_partition = affiliation_partition(events_pred, E_gt)
# Computing precision distance
d_precision = [affiliation_precision_distance(Is, J) for Is, J in zip(aff_partition, events_gt)]
# Computing recall distance
d_recall = [affiliation_recall_distance(Is, J) for Is, J in zip(aff_partition, events_gt)]
# Computing precision
p_precision = [affiliation_precision_proba(Is, J, E) for Is, J, E in zip(aff_partition, events_gt, E_gt)]
# Computing recall
p_recall = [affiliation_recall_proba(Is, J, E) for Is, J, E in zip(aff_partition, events_gt, E_gt)]
if _len_wo_nan(p_precision) > 0:
p_precision_average = _sum_wo_nan(p_precision) / _len_wo_nan(p_precision)
else:
p_precision_average = p_precision[0] # math.nan
p_recall_average = sum(p_recall) / len(p_recall)
dict_out = dict({'precision': p_precision_average,
'recall': p_recall_average,
'individual_precision_probabilities': p_precision,
'individual_recall_probabilities': p_recall,
'individual_precision_distances': d_precision,
'individual_recall_distances': d_recall})
return(dict_out)
def produce_all_results():
"""
Produce the affiliation precision/recall for all files
contained in the `data` repository
:return: a dictionary indexed by data names, each containing a dictionary
indexed by algorithm names, each containing the results of the affiliation
metrics (precision, recall, individual probabilities and distances)
"""
datasets, Tranges = read_all_as_events() # read all the events in folder `data`
results = dict()
for data_name in datasets.keys():
results_data = dict()
for algo_name in datasets[data_name].keys():
if algo_name != 'groundtruth':
results_data[algo_name] = pr_from_events(datasets[data_name][algo_name],
datasets[data_name]['groundtruth'],
Tranges[data_name])
results[data_name] = results_data
return(results)
from f1_score_f1_pa import *
from fc_score import *
from precision_at_k import *
from customizable_f1_score import *
from AUC import *
from Matthews_correlation_coefficient import *
from affiliation.generics import convert_vector_to_events
from affiliation.metrics import pr_from_events
from vus.models.feature import Window
from vus.metrics import get_range_vus_roc
import numpy as np  # used by main() below
def combine_all_evaluation_scores(y_test, pred_labels, anomaly_scores):
    events_pred = convert_vector_to_events(pred_labels)  # e.g. [(4, 5), (8, 9)]
    events_gt = convert_vector_to_events(y_test)  # e.g. [(3, 4), (7, 10)]
Trange = (0, len(y_test))
affiliation = pr_from_events(events_pred, events_gt, Trange)
true_events = get_events(y_test)
_, _, _, f1_score_ori, f05_score_ori = get_accuracy_precision_recall_fscore(y_test, pred_labels)
f1_score_pa = get_point_adjust_scores(y_test, pred_labels, true_events)[5]
pa_accuracy, pa_precision, pa_recall, pa_f_score = get_adjust_F1PA(y_test, pred_labels)
range_f_score = customizable_f1_score(y_test, pred_labels)
_, _, f1_score_c = get_composite_fscore_raw(y_test, pred_labels, true_events, return_prec_rec=True)
precision_k = precision_at_k(y_test, anomaly_scores, pred_labels)
point_auc = point_wise_AUC(pred_labels, y_test)
range_auc = Range_AUC(pred_labels, y_test)
MCC_score = MCC(y_test, pred_labels)
results = get_range_vus_roc(y_test, pred_labels, 100) # slidingWindow = 100 default
score_list = {"f1_score_ori": f1_score_ori,
"f05_score_ori" : f05_score_ori,
"f1_score_pa": f1_score_pa,
"pa_accuracy":pa_accuracy,
"pa_precision":pa_precision,
"pa_recall":pa_recall,
"pa_f_score":pa_f_score,
"range_f_score": range_f_score,
"f1_score_c": f1_score_c,
"precision_k": precision_k,
"point_auc": point_auc,
"range_auc": range_auc,
"MCC_score":MCC_score,
"Affiliation precision": affiliation['precision'],
"Affiliation recall": affiliation['recall'],
"R_AUC_ROC": results["R_AUC_ROC"],
"R_AUC_PR": results["R_AUC_PR"],
"VUS_ROC": results["VUS_ROC"],
"VUS_PR": results["VUS_PR"]}
return score_list
def main():
y_test = np.zeros(100)
y_test[10:20] = 1
y_test[50:60] = 1
pred_labels = np.zeros(100)
pred_labels[15:17] = 1
pred_labels[55:62] = 1
anomaly_scores = np.zeros(100)
anomaly_scores[15:17] = 0.7
anomaly_scores[55:62] = 0.6
pred_labels[51:55] = 1
true_events = get_events(y_test)
scores = combine_all_evaluation_scores(y_test, pred_labels, anomaly_scores)
# scores = test(y_test, pred_labels)
for key,value in scores.items():
print(key,' : ',value)
if __name__ == "__main__":
main()
# used by paper: Exathlon: A Benchmark for Explainable Anomaly Detection over Time Series_VLDB 2021
# github: https://github.com/exathlonbenchmark/exathlon
import numpy as np
from metrics.evaluate_utils import range_convers_new
# positional bias weight b(i) used inside the overlap reward w
def b(bias, i, length):
if bias == 'flat':
return 1
elif bias == 'front-end bias':
return length - i + 1
elif bias == 'back-end bias':
return i
else:
if i <= length / 2:
return i
else:
return length - i + 1
def w(AnomalyRange, p):
MyValue = 0
MaxValue = 0
start = AnomalyRange[0]
AnomalyLength = AnomalyRange[1] - AnomalyRange[0] + 1
# flat/'front-end bias'/'back-end bias'
bias = 'flat'
for i in range(start, start + AnomalyLength):
bi = b(bias, i, AnomalyLength)
MaxValue += bi
if i in p:
MyValue += bi
return MyValue / MaxValue
def Cardinality_factor(anomaly_range, p_ranges):
    # Count how many predicted ranges overlap the true anomaly range;
    # the factor 1/score penalises fragmented detections.
    start, end = anomaly_range
    score = sum(1 for r in p_ranges if r[0] <= end and r[1] >= start)
    if score == 0:
        return 0
    else:
        return 1 / score
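As a minimal, self-contained sketch of the overlap reward above (positional weight `w` under the flat bias times the cardinality factor): with flat bias, `w` reduces to the fraction of the anomaly range covered by predicted-positive positions. `toy_overlap_reward` is a hypothetical helper for illustration, not part of this module.

```python
def toy_overlap_reward(anomaly_range, pred_positions):
    # Flat-bias positional weight: fraction of the anomaly range covered
    # by predicted-positive positions (b(i) == 1 for every position).
    start, end = anomaly_range
    covered = sum(1 for i in range(start, end + 1) if i in pred_positions)
    return covered / (end - start + 1)

# Anomaly spans positions 10..19; predictions hit positions 15..16 only.
reward = toy_overlap_reward((10, 19), set(range(15, 17)))  # 2 of 10 positions -> 0.2
```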
def existence_reward(labels, preds):
    '''
    labels: list of (start, end) pairs for the true anomaly ranges
    preds: indices of the predicted anomalous positions
    '''
score = 0
for i in labels:
if np.sum(np.multiply(preds <= i[1], preds >= i[0])) > 0:
score += 1
return score
def range_recall_new(labels, preds, alpha):
p = np.where(preds == 1)[0] # positions of predicted label==1
range_pred = range_convers_new(preds)
range_label = range_convers_new(labels)
Nr = len(range_label) # total # of real anomaly segments
ExistenceReward = existence_reward(range_label, p)
OverlapReward = 0
for i in range_label:
OverlapReward += w(i, p) * Cardinality_factor(i, range_pred)
score = alpha * ExistenceReward + (1 - alpha) * OverlapReward
if Nr != 0:
return score / Nr, ExistenceReward / Nr, OverlapReward / Nr
else:
return 0, 0, 0
def customizable_f1_score(y_test, pred_labels, alpha=0.2):
label = y_test
preds = pred_labels
Rrecall, ExistenceReward, OverlapReward = range_recall_new(label, preds, alpha)
Rprecision = range_recall_new(preds, label, 0)[0]
if Rprecision + Rrecall == 0:
Rf = 0
else:
Rf = 2 * Rrecall * Rprecision / (Rprecision + Rrecall)
return Rf
def main():
y_test = np.zeros(100)
y_test[10:20] = 1
y_test[50:60] = 1
pred_labels = np.zeros(100)
pred_labels[15:19] = 1
pred_labels[55:62] = 1
# pred_labels[51:55] = 1
# true_events = get_events(y_test)
Rf = customizable_f1_score(y_test, pred_labels)
print("Rf: {}".format(Rf))
if __name__ == "__main__":
main()
import numpy as np
from statsmodels.tsa.stattools import acf
from scipy.signal import argrelextrema
def get_composite_fscore_from_scores(score_t_test, thres, true_events, prec_t, return_prec_rec=False):
pred_labels = score_t_test > thres
tp = np.sum([pred_labels[start:end + 1].any() for start, end in true_events.values()])
fn = len(true_events) - tp
rec_e = tp / (tp + fn)
    # guard first: computing the harmonic mean with both terms zero would divide by zero
    if prec_t == 0 and rec_e == 0:
        fscore_c = 0
    else:
        fscore_c = 2 * rec_e * prec_t / (rec_e + prec_t)
if return_prec_rec:
return prec_t, rec_e, fscore_c
return fscore_c
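The composite F-score combines point-wise precision with event-wise recall. A hypothetical self-contained sketch of the same computation on toy data (`composite_fscore` and its inputs are illustrative, not part of this module):

```python
import numpy as np

def composite_fscore(y_true, y_pred, events):
    # Point-wise precision combined with event-wise recall.
    tp_t = np.sum(y_pred * y_true)
    prec_t = tp_t / max(np.sum(y_pred), 1)
    detected = sum(1 for s, e in events if y_pred[s:e + 1].any())
    rec_e = detected / len(events)
    if prec_t + rec_e == 0:
        return 0.0
    return 2 * prec_t * rec_e / (prec_t + rec_e)

y_true = np.array([0, 1, 1, 1, 0, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 0, 0, 0, 1])
# One of two events detected (rec_e = 0.5), one of two flagged points correct (prec_t = 0.5).
fc = composite_fscore(y_true, y_pred, [(1, 3), (5, 6)])  # -> 0.5
```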
class NptConfig:
def __init__(self, config_dict):
for k, v in config_dict.items():
setattr(self, k, v)
def find_length(data):
if len(data.shape) > 1:
return 0
data = data[:min(20000, len(data))]
base = 3
auto_corr = acf(data, nlags=400, fft=True)[base:]
local_max = argrelextrema(auto_corr, np.greater)[0]
try:
max_local_max = np.argmax([auto_corr[lcm] for lcm in local_max])
if local_max[max_local_max] < 3 or local_max[max_local_max] > 300:
return 125
return local_max[max_local_max] + base
    except ValueError:
        # argmax over an empty list of local maxima
        return 125
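`find_length` estimates the dominant period as the most prominent local maximum of the autocorrelation function. A hypothetical numpy-only sketch of the same idea (the real function uses statsmodels' `acf` plus `argrelextrema`); `estimate_period` and its 0.2 prominence cutoff are illustrative assumptions:

```python
import numpy as np

def estimate_period(x, max_lag=400):
    # Normalised autocorrelation for lags 0..max_lag via np.correlate.
    x = np.asarray(x, dtype=float) - np.mean(x)
    acf_full = np.correlate(x, x, mode="full")[len(x) - 1:]
    acf_norm = acf_full / acf_full[0]
    # First prominent local maximum of the ACF approximates the period.
    for lag in range(2, min(max_lag, len(acf_norm) - 1)):
        if acf_norm[lag] > acf_norm[lag - 1] and acf_norm[lag] >= acf_norm[lag + 1] \
                and acf_norm[lag] > 0.2:
            return lag
    return None

t = np.arange(2000)
period = estimate_period(np.sin(2 * np.pi * t / 50))  # true period is 50
```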
def range_convers_new(label):
    '''
    input: array of binary values
    output: list of (start, end) index pairs covering the anomalous segments
    '''
L = []
i = 0
j = 0
while j < len(label):
while label[i] == 0:
i += 1
if i >= len(label):
break
j = i + 1
if j >= len(label):
if j == len(label):
L.append((i, j - 1))
break
while label[j] != 0:
j += 1
if j >= len(label):
L.append((i, j - 1))
break
if j >= len(label):
break
L.append((i, j - 1))
i = j
return L
import logging
import os
import pickle
import copy
import json
import numpy as np
import pandas as pd
from logger_configs.configurations import datasets_config, default_thres_config
from datasets.data_preprocess.dataset import get_events
from logger_configs.logger import init_logging
from src.evaluation.evaluation_utils import get_dataset_class, get_algo_class, get_chan_num, collect_eval_metrics, \
combine_entities_eval_metrics, get_dynamic_scores, get_gaussian_kernel_scores
from src.evaluation.evaluation_utils import fit_distributions, get_scores_channelwise
from src.algorithms.algorithm_utils import load_torch_algo
from src.evaluation.trainer import Trainer
def evaluate(saved_model_root, logger, thres_methods=["top_k_time", "best_f1_test"], eval_root_cause=True,
point_adjust=False, eval_R_model=True, eval_dyn=False, thres_config=None,
telem_only=True, make_plots=["prc", "score_t"], composite_best_f1=False):
seed = 42
saved_model_folders = os.listdir(saved_model_root)
saved_model_folders.sort() # Sort directories in alphabetical order
plots_dir = os.path.join(saved_model_root, "plots")
os.makedirs(plots_dir, exist_ok=True)
# Initialize dictionary structure to collect results from each entity
algo_results = {"hr_100_all": [], "hr_150_all": [], "rc_top3_all": [], "val_recons_err": [], "val_loss": [],
"std_scores_train": [], "auroc": [], "avg_prec": []}
eval_methods = ["time-wise"]
for thres_method in thres_methods + ["tail_prob", "pot"]:
algo_results[thres_method] = {"hr_100_tp": [], "hr_150_tp": [], "rc_top3_tp": [], "opt_thres": [],
"fscore_comp": [], "rec_e": []}
for eval_method in eval_methods:
algo_results[thres_method][eval_method] = {"tp": [], "fp": [], "fn": []}
algo_R_results = copy.deepcopy(algo_results)
telemanom_gauss_s_results = copy.deepcopy(algo_results)
algo_dyn_results = copy.deepcopy(algo_results)
algo_dyn_gauss_conv_results = copy.deepcopy(algo_results)
path_decomposition = os.path.normpath(saved_model_root).split(os.path.sep)
algo_name = path_decomposition[-3]
me_ds_name = path_decomposition[-4].split("_me")[0]
ds_class = get_dataset_class(me_ds_name)
rca_possible = eval_root_cause
thres_config_dict = None
for folder_name in saved_model_folders:
if ".ini" in folder_name or ".csv" in folder_name or ".pdf" in folder_name:
continue
elif os.path.split(folder_name)[-1] == "plots":
continue
# get dataset
entity = os.path.split(folder_name)[-1]
if me_ds_name in ["msl", "smd", "smap"]:
entity = entity.split("-", 1)[1]
ds_init_params = {"seed": seed, "entity": entity}
if me_ds_name == "swat-long":
ds_init_params["shorten_long"] = False
if me_ds_name == "damadics-s":
ds_init_params["drop_init_test"] = True
dataset = ds_class(**ds_init_params)
plots_name = os.path.join(plots_dir, algo_name + "_" + me_ds_name + "_" + entity + "_")
# ds_name = dataset.name
logger.info("Processing Folder name: {}, {} on me dataset {}, entity {}".format(folder_name, algo_name,
me_ds_name, entity))
if thres_config is not None:
thres_config_dict = thres_config(me_ds_name)
else:
thres_config_dict = default_thres_config
# get test scores from pkl
raw_preds_file = os.path.join(saved_model_root, folder_name, "raw_predictions")
        try:
            with open(raw_preds_file, 'rb') as file:
                preds = pickle.load(file)
        except (FileNotFoundError, pickle.UnpicklingError):
            logger.info("The raw predictions of %s on %s weren't found, so this run can't be evaluated" % (algo_name,
                                                                                                          me_ds_name))
            return None
# Get the true labels
_, _, _, y_test = dataset.data()
true_events = get_events(y_test)
root_causes = None
if eval_root_cause:
root_causes = dataset.get_root_causes()
# Flag that indicates root cause identification evaluation is possible
rca_possible = eval_root_cause and (preds["score_tc_test"] is not None or preds["error_tc_test"] is not None) \
and root_causes is not None
# Load the predictions
score_t_test = preds["score_t_test"]
score_tc_test = preds["score_tc_test"]
error_tc_test = preds["error_tc_test"]
error_t_test = preds["error_t_test"]
score_t_train = preds["score_t_train"]
score_tc_train = preds["score_tc_train"]
error_tc_train = preds["error_tc_train"]
error_t_train = preds["error_t_train"]
recons_tc_train = preds["recons_tc_train"]
recons_tc_test = preds["recons_tc_test"]
        try:
            val_recons_err = np.nanmean(preds["val_recons_err"])
        except (KeyError, TypeError):
            val_recons_err = None
        val_loss = preds.get("val_loss")
if telem_only and me_ds_name in ["msl", "smap"]:
if error_tc_train is not None:
error_t_train = error_tc_train[:, 0]
error_tc_train = None
if error_tc_test is not None:
error_t_test = error_tc_test[:, 0]
error_tc_test = None
if score_tc_test is not None:
score_t_test = score_tc_test[:, 0]
score_tc_test = None
eval_R = eval_R_model and (error_tc_test is not None or error_t_test is not None)
eval_dyn = eval_dyn and ((error_tc_test is not None) or
(error_t_test is not None))
# Evaluate on each entity
logger.info("Evaluating for score_t")
algo_results = collect_eval_metrics(algo_results=algo_results, score_t_test=score_t_test, y_test=y_test,
thres_methods=thres_methods, logger=logger, true_events=true_events,
rca_possible=rca_possible and (preds["score_tc_test"] is not None),
score_tc_test=score_tc_test,
root_causes=root_causes, score_t_train=score_t_train,
point_adjust=point_adjust, thres_config_dict=thres_config_dict,
eval_methods=eval_methods, make_plots=make_plots, dataset=dataset,
plots_name=plots_name + "base", composite_best_f1=composite_best_f1)
algo_results["val_recons_err"].append(val_recons_err)
algo_results["val_loss"].append(val_loss)
algo_results["std_scores_train"].append(np.std(score_t_train))
if eval_R:
if algo_name == "TelemanomAlgo":
logger.info("Evaluating for static gaussian for TelemanomAlgo")
# get static gaussian scores. This is usually done in the trainer, but not for this algo
distr_names = ["univar_gaussian"]
distr_par_file = os.path.join(saved_model_root, folder_name, "distr_parameters")
if error_t_train is None or error_tc_train is None:
score_t_test_gauss_s = error_t_test
score_t_train_gauss_s = None
score_tc_test_gauss_s = error_tc_test
else:
distr_params = fit_distributions(distr_par_file, distr_names, predictions_dic=
{"train_raw_scores": error_tc_train})[distr_names[0]]
score_t_train_gauss_s, _, score_t_test_gauss_s, score_tc_train_gauss_s, _, score_tc_test_gauss_s = \
get_scores_channelwise(distr_params, train_raw_scores=error_tc_train,
val_raw_scores=None, test_raw_scores=error_tc_test,
logcdf=True)
telemanom_gauss_s_results = collect_eval_metrics(algo_results=telemanom_gauss_s_results,
score_t_test=score_t_test_gauss_s,
y_test=y_test,
thres_methods=thres_methods,
logger=logger,
true_events=true_events,
rca_possible=rca_possible,
score_tc_test=score_tc_test_gauss_s,
root_causes=root_causes,
score_t_train=score_t_train_gauss_s,
point_adjust=point_adjust,
thres_config_dict=thres_config_dict,
eval_methods=eval_methods,
make_plots=make_plots,
dataset=dataset,
plots_name=plots_name + "-gauss-s",
composite_best_f1=composite_best_f1)
if error_tc_train is not None and error_tc_test is not None:
logger.info("Doing mean adjustment of train and test error_tc")
mean_c_train = np.mean(error_tc_train, axis=0)
error_tc_train_normed = error_tc_train - mean_c_train
error_tc_test_normed = error_tc_test - mean_c_train
error_t_train_normed = np.sqrt(np.mean(error_tc_train_normed ** 2, axis=1))
error_t_test_normed = np.sqrt(np.mean(error_tc_test_normed ** 2, axis=1))
else:
error_t_test_normed = error_t_test
error_t_train_normed = error_t_train
error_tc_test_normed = error_tc_test
error_tc_train_normed = None
logger.info("Evaluating for error_t")
algo_R_results = collect_eval_metrics(algo_results=algo_R_results, score_t_test=error_t_test_normed,
y_test=y_test,
thres_methods=thres_methods, logger=logger, true_events=true_events,
rca_possible=rca_possible, score_tc_test=error_tc_test_normed,
root_causes=root_causes, score_t_train=error_t_train_normed,
point_adjust=point_adjust,
thres_config_dict=thres_config_dict, eval_methods=eval_methods,
make_plots=make_plots, dataset=dataset,
plots_name=plots_name + "R",
composite_best_f1=composite_best_f1,
score_tc_train=error_tc_train_normed)
algo_R_results["val_recons_err"].append(val_recons_err)
algo_R_results["val_loss"].append(val_loss)
if error_t_train is not None:
algo_R_results["std_scores_train"].append(np.std(error_t_train))
# dynamic scoring function
if eval_dyn:
# dyn_thres_methods = ["best_f1_test"]
dyn_thres_methods = thres_methods
logger.info("Evaluating gaussian dynamic scoring for error_t with thres_methods {}".format(dyn_thres_methods))
long_window = thres_config_dict["dyn_gauss"]["long_window"]
short_window = thres_config_dict["dyn_gauss"]["short_window"]
if telem_only and me_ds_name in ["msl", "smap"]:
score_t_test_dyn, score_tc_test_dyn, score_t_train_dyn, score_tc_train_dyn = get_dynamic_scores(
error_tc_train=None, error_tc_test=None, error_t_train=error_t_train, error_t_test=error_t_test,
long_window=long_window, short_window=short_window)
else:
score_t_test_dyn, score_tc_test_dyn, score_t_train_dyn, score_tc_train_dyn = get_dynamic_scores(
error_tc_train, error_tc_test, error_t_train, error_t_test, long_window=long_window,
short_window=short_window)
algo_dyn_results = collect_eval_metrics(algo_results=algo_dyn_results, score_t_test=score_t_test_dyn,
y_test=y_test, thres_methods=dyn_thres_methods,
logger=logger, rca_possible=rca_possible, true_events=true_events,
score_tc_test=score_tc_test_dyn, root_causes=root_causes,
score_t_train=score_t_train_dyn, point_adjust=point_adjust,
thres_config_dict=thres_config_dict, eval_methods=eval_methods,
make_plots=make_plots, dataset=dataset,
plots_name=plots_name + "dyn",
composite_best_f1=composite_best_f1,
score_tc_train=score_tc_train_dyn)
algo_dyn_results["val_recons_err"].append(val_recons_err)
algo_dyn_results["val_loss"].append(val_loss)
if score_t_train_dyn is not None:
algo_dyn_results["std_scores_train"].append(np.std(score_t_train_dyn))
kernel_sigma = thres_config_dict["dyn_gauss"]["kernel_sigma"]
score_t_test_dyn_gauss_conv, score_tc_test_dyn_gauss_conv = get_gaussian_kernel_scores(
score_t_test_dyn, score_tc_test_dyn, kernel_sigma)
if score_t_train_dyn is not None:
score_t_train_dyn_gauss_conv, _ = get_gaussian_kernel_scores(score_t_train_dyn, score_tc_train_dyn,
kernel_sigma)
else:
score_t_train_dyn_gauss_conv = None
algo_dyn_gauss_conv_results = collect_eval_metrics(algo_results=algo_dyn_gauss_conv_results,
score_t_test=score_t_test_dyn_gauss_conv,
y_test=y_test,
thres_methods=dyn_thres_methods,
logger=logger,
rca_possible=rca_possible,
true_events=true_events,
score_tc_test=score_tc_test_dyn_gauss_conv,
root_causes=root_causes,
score_t_train=score_t_train_dyn_gauss_conv,
point_adjust=point_adjust,
thres_config_dict=thres_config_dict,
eval_methods=eval_methods,
make_plots=make_plots, dataset=dataset,
plots_name=plots_name + "dyn-gauss-conv",
composite_best_f1=composite_best_f1,
score_tc_train=None)
# Combine results from each entity
final_results, column_names = combine_entities_eval_metrics(algo_results, thres_methods, me_ds_name, algo_name,
rca_possible, eval_methods=eval_methods)
if eval_R_model:
results_R, _ = combine_entities_eval_metrics(algo_R_results, thres_methods, me_ds_name, algo_name + "-R",
rca_possible, eval_methods=eval_methods)
final_results = np.concatenate((final_results, results_R), axis=0)
if algo_name == "TelemanomAlgo":
results_telem, _ = combine_entities_eval_metrics(telemanom_gauss_s_results, thres_methods, me_ds_name,
algo_name + "-Gauss-S", rca_possible, eval_methods=eval_methods)
final_results = np.concatenate((final_results, results_telem), axis=0)
if eval_dyn:
results_dyn, _ = combine_entities_eval_metrics(algo_dyn_results, dyn_thres_methods,
me_ds_name, algo_name + "-dyn",
rca_possible, eval_methods=eval_methods)
final_results = np.concatenate((final_results, results_dyn), axis=0)
results_dyn_gauss_conv, _ = combine_entities_eval_metrics(algo_dyn_gauss_conv_results,
dyn_thres_methods,
me_ds_name, algo_name + "-dyn-gauss-conv",
rca_possible, eval_methods=eval_methods)
final_results = np.concatenate((final_results, results_dyn_gauss_conv), axis=0)
results_df = pd.DataFrame(final_results, columns=column_names)
results_df["folder_name"] = saved_model_root
new_col_order = list(results_df.columns)[:3] + ["point_adjust"] + list(results_df.columns)[3:]
results_df["point_adjust"] = point_adjust
results_df = results_df[new_col_order]
with open(os.path.join(os.path.dirname(saved_model_root), "config.json")) as file:
algo_config = json.dumps(json.load(file))
results_df["config"] = algo_config
if thres_config_dict is not None:
results_df["thres_config"] = str(thres_config_dict)
if point_adjust:
filename = os.path.join(saved_model_root, "results_point_adjust.csv")
else:
filename = os.path.join(saved_model_root, "results.csv")
results_df.to_csv(filename, index=False)
logger.info("Saved results to {}".format(filename))
def analyse_from_pkls(results_root:str, thres_methods=["best_f1_test"], eval_root_cause=True, point_adjust=False,
eval_R_model=True, eval_dyn=True, thres_config=None, logger=None,
telem_only=True, filename_prefix="", rerun_if_ds=None, process_seeds=None, make_plots=[],
composite_best_f1=False):
"""
Function that reads saved predictions and evaluates them for anomaly detection and diagnosis under various
settings.
:param results_root: dir where predictions for the algo generated by the trainer in a specific folder structure.
:param thres_methods: list of thresholding methods with which to evaluate
:param eval_root_cause: Set it to True if root cause is desired and possible (i.e. if channel-wise scores are provided
in the predictions, else False
:param point_adjust: True if point-adjusted evaluation is desired.
:param eval_R_model: Corresponds to using Errors scoring function. Set it to True only if errors_t or errors_tc are
avaiable in the predictions. Pre-requisite to be True for eval_dyn.
:param eval_dyn: Set it to True if Gauss-D and Gauss-D-K scoring function evaluation is desired. Needs eval_R_model
to be True.
:param thres_config: A function that takes the dataset name as input and returns a dictionary corresponding to the
config for each method in thres_methods.
:param logger: for logging.
:param telem_only: Only affects the evaluation for MSL and SMAP datasets. If set to True, only the sensor channel,
i.e. first channel will be used in evaluation. If False, all channels - sensors and commands will be used.
:param filename_prefix: desired prefix on the filename.
:param process_seeds: specify a list if only some of the seeds need to be analyzed. If None, all seeds for which
results.csv doesn't exist will be (re)analyzed.
:param rerun_if_ds: set the names of specific datasets for which (re)analysis is required. Otherwise only datasets
for which results.csv doesn't exist will be (re)analyzed.
:param make_plots: specify which plots are desired. ["prc", "score_t"] are implemented.
:param composite_best_f1: if set to True, the "best-f1" threshold will be computed as "best-fc1" threshold.
:return: None. The evalution results are saved as results.csv for each run.
"""
seed = 42
result_df_list = []
ds_folders = os.listdir(results_root)
if point_adjust:
result_filename = "results_point_adjust.csv"
else:
result_filename = "results.csv"
if logger is None:
init_logging(os.path.join(results_root, 'logs'), prefix="eval")
logger = logging.getLogger(__name__)
for ds_folder in ds_folders:
if ds_folder.endswith(".csv") or ds_folder == "logs" or "thres_results" in ds_folder:
continue
ds_path = os.path.join(results_root, ds_folder)
algo_folders = os.listdir(ds_path)
for algo_folder in algo_folders:
algo_path = os.path.join(ds_path, algo_folder)
config_folders = os.listdir(algo_path)
config_folders = [folder for folder in config_folders if not folder.endswith(".csv")]
for config_folder in config_folders:
config_path = os.path.join(algo_path, config_folder)
run_folders = os.listdir(config_path)
for run_folder in run_folders:
if not run_folder.endswith(".json"):
run_path = os.path.join(config_path, run_folder)
current_seed = int(run_folder.split("-", 2)[0])
if process_seeds is not None:
if current_seed not in process_seeds:
continue
if rerun_if_ds is not None:
if (ds_folder in rerun_if_ds) or (rerun_if_ds == 'all'):
if os.path.exists(os.path.join(run_path, result_filename)):
os.remove(os.path.join(run_path, result_filename))
if not os.path.exists(os.path.join(run_path, result_filename)):
entity_folders = os.listdir(run_path)
skip_this_run = False # Flag to indicate that evaluation for this run is impossible
for entity_folder in entity_folders:
if entity_folder != "plots" and ".csv" not in entity_folder and entity_folder != "logs":
entity_path = os.path.join(run_path, entity_folder)
if not skip_this_run:
try:
with open(os.path.join(entity_path, "raw_predictions"), "rb") as file:
raw_predictions = pickle.load(file)
assert "score_t_test" in raw_predictions.keys()
if np.isnan(raw_predictions["score_t_test"]).any():
skip_this_run = True
except:
me_ds_name = ds_folder.split("_me")[0]
ds_class = get_dataset_class(me_ds_name)
algo_class = get_algo_class(algo_folder)
entity_name = entity_folder.replace("smap-", "").replace("msl-", "").\
replace("smd-", "")
ds_init_params = {"seed": seed, "entity": entity_name}
if me_ds_name == "swat-long":
ds_init_params["shorten_long"] = False
entity_ds = ds_class(**ds_init_params)
repredict = repredict_from_saved_model(entity_path, algo_class=algo_class,
entity=entity_ds, logger=logger)
if not repredict:
logger.warning("Predictions and trained model couldn't be found, evaluation is "
"impossible for run saved at %s" % run_path)
skip_this_run = True
if not skip_this_run:
evaluate(run_path, thres_methods=thres_methods, eval_root_cause=eval_root_cause,
point_adjust=point_adjust, eval_R_model=eval_R_model, eval_dyn=eval_dyn,
thres_config=thres_config, logger=logger, telem_only=telem_only,
make_plots=make_plots, composite_best_f1=composite_best_f1)
try:
result_df = pd.read_csv(os.path.join(run_path, result_filename))
if "point_adjust" not in result_df.columns:
result_df["point_adjust"] = False
result_df_list.append(result_df)
except:
logger.warning("Results table couldn't be found for run saved at %s" % run_path)
overall_results = pd.concat(result_df_list, ignore_index=True)
overall_results.to_csv(os.path.join(results_root, filename_prefix+"overall_" + result_filename))
def repredict_from_saved_model(model_root, algo_class, entity, logger):
algo_config_filename = os.path.join(model_root, "init_params")
saved_model_filename = [os.path.join(model_root, filename) for filename in
os.listdir(model_root) if "trained_model" in filename]
if len(saved_model_filename) == 1:
saved_model_filename = saved_model_filename[0]
else:
saved_model_filename.sort(key=get_chan_num)
    # a single file, unlike the possibly channel-wise trained_model files above
    additional_params_filename = os.path.join(model_root, "additional_params")
try:
algo_reload = load_torch_algo(algo_class, algo_config_filename, saved_model_filename,
additional_params_filename, eval=True)
_ = Trainer.predict(algo_reload, entity, model_root, logger=logger)
return True
except Exception as e:
logger.warning(f"An error occurred while loading saved algo and repredicting: {e}")
return False
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_curve, auc, roc_auc_score, precision_score, recall_score, \
accuracy_score, fbeta_score, average_precision_score
# point-adjusted f-scores: a true event counts as detected if it contains
# more than threshold_k predicted points (threshold_k = 0 unless whether_top_k is set)
def get_point_adjust_scores(y_test, pred_labels, true_events, threshold_k=0, whether_top_k=False):
    tp = 0
    fn = 0
    required_hits = threshold_k if whether_top_k else 0
    for true_event in true_events.keys():
        true_start, true_end = true_events[true_event]
        if pred_labels[true_start:true_end].sum() > required_hits:
            tp += (true_end - true_start)
        else:
            fn += (true_end - true_start)
fp = np.sum(pred_labels) - np.sum(pred_labels * y_test)
prec, rec, fscore = get_prec_rec_fscore(tp, fp, fn)
return fp, fn, tp, prec, rec, fscore
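The point-adjust protocol implemented above can be sketched in isolation: if any point inside a true anomaly event is flagged, the whole event is credited as detected. `point_adjust` below is a hypothetical self-contained helper, not part of this module:

```python
import numpy as np

def point_adjust(y_true, y_pred):
    # Standard point-adjust protocol: if any point inside a true anomaly
    # event is flagged, the whole event is credited as detected.
    adjusted = y_pred.copy()
    in_event = False
    start = 0
    for i, label in enumerate(np.append(y_true, 0)):  # pad to close a trailing event
        if label == 1 and not in_event:
            in_event, start = True, i
        elif label == 0 and in_event:
            if adjusted[start:i].any():
                adjusted[start:i] = 1
            in_event = False
    return adjusted

y_true = np.array([0, 1, 1, 1, 0, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 0, 0, 0, 0])
adjusted = point_adjust(y_true, y_pred)  # first event fully credited, second stays missed
```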
def get_adjust_F1PA(gt, pred):
    # ground-truth labels first, predictions second, matching the call sites;
    # `pred` is point-adjusted in place
anomaly_state = False
for i in range(len(gt)):
if gt[i] == 1 and pred[i] == 1 and not anomaly_state:
anomaly_state = True
for j in range(i, 0, -1):
if gt[j] == 0:
break
else:
if pred[j] == 0:
pred[j] = 1
for j in range(i, len(gt)):
if gt[j] == 0:
break
else:
if pred[j] == 0:
pred[j] = 1
elif gt[i] == 0:
anomaly_state = False
if anomaly_state:
pred[i] = 1
    from sklearn.metrics import precision_recall_fscore_support
accuracy = accuracy_score(gt, pred)
precision, recall, f_score, support = precision_recall_fscore_support(gt, pred,
average='binary')
return accuracy, precision, recall, f_score
# compute precision, recall and f-score from raw tp/fp/fn counts
def get_prec_rec_fscore(tp, fp, fn):
if tp == 0:
precision = 0
recall = 0
else:
precision = tp / (tp + fp)
recall = tp / (tp + fn)
fscore = get_f_score(precision, recall)
return precision, recall, fscore
def get_f_score(prec, rec):
if prec == 0 and rec == 0:
f_score = 0
else:
f_score = 2 * (prec * rec) / (prec + rec)
return f_score
# point-wise (unadjusted) accuracy, precision, recall and f-scores
def get_accuracy_precision_recall_fscore(y_true: list, y_pred: list):
accuracy = accuracy_score(y_true, y_pred)
# warn_for=() avoids log warnings for any result being zero
# precision, recall, f_score, _ = prf(y_true, y_pred, average='binary', warn_for=())
    precision = precision_score(y_true, y_pred, zero_division=0)
    recall = recall_score(y_true, y_pred, zero_division=0)
    if precision + recall == 0:
        f_score = 0
        f05_score = 0
    else:
        f_score = (2 * precision * recall) / (precision + recall)
        f05_score = fbeta_score(y_true, y_pred, average='binary', beta=0.5)
return accuracy, precision, recall, f_score, f05_score
from fc_score import *
from f1_score_f1_pa import *
from evaluate_utils import *
default_thres_config = {"top_k_time": {},
"best_f1_test": {"exact_pt_adj": True},
"thresholded_score": {},
"tail_prob": {"tail_prob": 2},
"tail_prob_1": {"tail_prob": 1},
"tail_prob_2": {"tail_prob": 2},
"tail_prob_3": {"tail_prob": 3},
"tail_prob_4": {"tail_prob": 4},
"tail_prob_5": {"tail_prob": 5},
"dyn_gauss": {"long_window": 10000, "short_window": 1, "kernel_sigma": 10},
"nasa_npt": {"batch_size": 70, "window_size": 30, "telem_only": True,
"smoothing_perc": 0.005, "l_s": 250, "error_buffer": 5, "p": 0.05}}
def threshold_and_predict(score_t_test, y_test, true_events, logger, test_anom_frac, thres_method="top_k_time",
point_adjust=False, score_t_train=None, thres_config_dict=dict(), return_auc=False,
composite_best_f1=False):
if thres_method in thres_config_dict.keys():
config = thres_config_dict[thres_method]
else:
config = default_thres_config[thres_method]
# test_anom_frac = (np.sum(y_test)) / len(y_test)
auroc = None
avg_prec = None
if thres_method == "thresholded_score":
opt_thres = 0.5
if set(score_t_test) - {0, 1}:
logger.error("Score_t_test isn't binary. Predicting all as non-anomalous")
pred_labels = np.zeros(len(score_t_test))
else:
pred_labels = score_t_test
elif thres_method == "best_f1_test" and point_adjust:
prec, rec, thresholds = precision_recall_curve(y_test, score_t_test, pos_label=1)
if not config["exact_pt_adj"]:
fscore_best_time = [get_f_score(precision, recall) for precision, recall in zip(prec, rec)]
opt_num = np.squeeze(np.argmax(fscore_best_time))
opt_thres = thresholds[opt_num]
            # subsample the thresholds and keep the rough optimum (concatenation,
            # not elementwise addition)
            thresholds = np.concatenate([np.random.choice(thresholds, size=5000), [opt_thres]])
fscores = []
for thres in thresholds:
_, _, _, _, _, fscore = get_point_adjust_scores(y_test, score_t_test > thres, true_events)
fscores.append(fscore)
opt_thres = thresholds[np.argmax(fscores)]
pred_labels = score_t_test > opt_thres
elif thres_method == "best_f1_test" and composite_best_f1:
prec, rec, thresholds = precision_recall_curve(y_test, score_t_test, pos_label=1)
precs_t = prec
fscores_c = [get_composite_fscore_from_scores(score_t_test, thres, true_events, prec_t) for thres, prec_t in
zip(thresholds, precs_t)]
        try:
            opt_thres = thresholds[np.nanargmax(fscores_c)]
        except ValueError:
            # nanargmax raises ValueError when fscores_c is empty or all-NaN
            opt_thres = 0.0
pred_labels = score_t_test > opt_thres
elif thres_method == "top_k_time":
opt_thres = np.nanpercentile(score_t_test, 100 * (1 - test_anom_frac), interpolation='higher')
pred_labels = np.where(score_t_test > opt_thres, 1, 0)
elif thres_method == "best_f1_test":
prec, rec, thres = precision_recall_curve(y_test, score_t_test, pos_label=1)
fscore = [get_f_score(precision, recall) for precision, recall in zip(prec, rec)]
opt_num = np.squeeze(np.argmax(fscore))
opt_thres = thres[opt_num]
pred_labels = np.where(score_t_test > opt_thres, 1, 0)
elif "tail_prob" in thres_method:
tail_neg_log_prob = config["tail_prob"]
opt_thres = tail_neg_log_prob
pred_labels = np.where(score_t_test > opt_thres, 1, 0)
elif thres_method == "nasa_npt":
opt_thres = 0.5
pred_labels = get_npt_labels(score_t_test, y_test, config)
else:
        logger.error("Thresholding method {} not in [thresholded_score, top_k_time, best_f1_test, tail_prob, "
                     "nasa_npt]".format(thres_method))
return None, None
if return_auc:
avg_prec = average_precision_score(y_test, score_t_test)
auroc = roc_auc_score(y_test, score_t_test)
return opt_thres, pred_labels, avg_prec, auroc
return opt_thres, pred_labels
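The `top_k_time` branch above thresholds the scores so that roughly the expected anomaly fraction of points is flagged. A hypothetical standalone sketch of that rule (`top_k_threshold` is an illustrative helper, not part of this module):

```python
import numpy as np

def top_k_threshold(scores, anom_frac):
    # Pick the threshold at the (1 - anom_frac) percentile of the scores so
    # that roughly anom_frac of points are flagged as anomalous.
    thres = np.nanpercentile(scores, 100 * (1 - anom_frac))
    return thres, (scores > thres).astype(int)

scores = np.array([0.1, 0.2, 0.9, 0.3, 0.8, 0.1, 0.2, 0.1, 0.1, 0.1])
thres, labels = top_k_threshold(scores, 0.2)  # flags the two highest scores
```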
# top-level evaluation function
def evaluate_predicted_labels(pred_labels, y_test, true_events, logger, eval_method="time-wise", breaks=[],
point_adjust=False):
"""
Computes evaluation metrics for the binary classifications given the true and predicted labels
:param point_adjust: used to judge whether is pa
:param pred_labels: array of predicted labels
:param y_test: array of true labels
:param eval_method: string that indicates whether we evaluate the classification time point-wise or event-wise
:param breaks: array of discontinuities in the time series, relevant only if you look at event-wise
:param return_raw: Boolean that indicates whether we want to return tp, fp and fn or prec, recall and f1
:return: tuple of evaluation metrics
"""
if eval_method == "time-wise":
# point-adjust fscore
if point_adjust:
fp, fn, tp, prec, rec, fscore = get_point_adjust_scores(y_test, pred_labels, true_events)
# normal fscore
else:
_, prec, rec, fscore, _ = get_accuracy_precision_recall_fscore(y_test, pred_labels)
tp = np.sum(pred_labels * y_test)
fp = np.sum(pred_labels) - tp
fn = np.sum(y_test) - tp
# event-wise
else:
logger.error("Evaluation method {} not in [time-wise, event-wise]".format(eval_method))
return 0, 0, 0
return tp, fp, fn, prec, rec, fscore
import numpy as np
from sklearn.metrics import precision_score
def get_events(y_test, outlier=1, normal=0):
events = dict()
label_prev = normal
event = 0 # corresponds to no event
event_start = 0
for tim, label in enumerate(y_test):
if label == outlier:
if label_prev == normal:
event += 1
event_start = tim
else:
if label_prev == outlier:
event_end = tim - 1
events[event] = (event_start, event_end)
label_prev = label
    if label_prev == outlier:
        # the series ends inside an event, which therefore closes at the last index
        event_end = tim
        events[event] = (event_start, event_end)
return events
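The event dictionary built above can equivalently be derived from the boundaries of the label vector. A hypothetical numpy sketch (`events_from_labels` is illustrative, not part of this module):

```python
import numpy as np

def events_from_labels(y):
    # Pad with zeros so diff picks up events touching either boundary.
    edges = np.diff(np.concatenate(([0], y, [0])))
    starts = np.where(edges == 1)[0]
    ends = np.where(edges == -1)[0] - 1
    return {k + 1: (int(s), int(e)) for k, (s, e) in enumerate(zip(starts, ends))}

events = events_from_labels(np.array([0, 1, 1, 0, 0, 1, 1, 1]))
# {1: (1, 2), 2: (5, 7)}
```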
def get_composite_fscore_raw(y_test, pred_labels, true_events, return_prec_rec=False):
tp = np.sum([pred_labels[start:end + 1].any() for start, end in true_events.values()])
fn = len(true_events) - tp
rec_e = tp / (tp + fn)
prec_t = precision_score(y_test, pred_labels)
    # guard first: computing the harmonic mean with both terms zero would divide by zero
    if prec_t == 0 and rec_e == 0:
        fscore_c = 0
    else:
        fscore_c = 2 * rec_e * prec_t / (rec_e + prec_t)
if return_prec_rec:
return prec_t, rec_e, fscore_c
return fscore_c
def main():
y_test = np.zeros(100)
y_test[10:20] = 1
y_test[50:60] = 1
pred_labels = np.zeros(100)
pred_labels[15:17] = 1
pred_labels[55:62] = 1
# pred_labels[51:55] = 1
    true_events = get_events(y_test)
    prec_t, rec_e, fscore_c = get_composite_fscore_raw(y_test, pred_labels, true_events, return_prec_rec=True)
    print("Prec_t: {}, rec_e: {}, fscore_c: {}".format(prec_t, rec_e, fscore_c))
if __name__ == "__main__":
main()
from metrics.f1_score_f1_pa import *
from metrics.fc_score import *
from metrics.precision_at_k import *
from metrics.customizable_f1_score import *
from metrics.AUC import *
from metrics.Matthews_correlation_coefficient import *
from metrics.affiliation.generics import convert_vector_to_events
from metrics.affiliation.metrics import pr_from_events
# from metrics.vus.models.feature import Window
from metrics.vus.metrics import get_range_vus_roc
import numpy as np
def combine_all_evaluation_scores(y_test, pred_labels, anomaly_scores):
    events_pred = convert_vector_to_events(pred_labels)
    events_gt = convert_vector_to_events(y_test)
Trange = (0, len(y_test))
affiliation = pr_from_events(events_pred, events_gt, Trange)
true_events = get_events(y_test)
pa_accuracy, pa_precision, pa_recall, pa_f_score = get_adjust_F1PA(y_test, pred_labels)
MCC_score = MCC(y_test, pred_labels)
    vus_results = get_range_vus_roc(anomaly_scores, y_test, 100)  # signature is (score, labels, slidingWindow); default slidingWindow = 100
score_list_simple = {
"pa_accuracy":pa_accuracy,
"pa_precision":pa_precision,
"pa_recall":pa_recall,
"pa_f_score":pa_f_score,
"MCC_score":MCC_score,
"Affiliation precision": affiliation['precision'],
"Affiliation recall": affiliation['recall'],
"R_AUC_ROC": vus_results["R_AUC_ROC"],
"R_AUC_PR": vus_results["R_AUC_PR"],
"VUS_ROC": vus_results["VUS_ROC"],
"VUS_PR": vus_results["VUS_PR"]
}
# return score_list, score_list_simple
return score_list_simple
if __name__ == '__main__':
    y_test = np.load("data/events_gt_MSL.npy") + 0        # ground-truth labels
    pred_labels = np.load("data/events_pred_MSL.npy") + 0  # predicted labels
anomaly_scores = np.load("data/events_scores_MSL.npy")
print(len(y_test), max(anomaly_scores), min(anomaly_scores))
score_list_simple = combine_all_evaluation_scores(y_test, pred_labels, anomaly_scores)
for key, value in score_list_simple.items():
print('{0:21} :{1:10f}'.format(key, value))
# k is defined as the number of anomalies
# only calculate the range top k not the whole set
import numpy as np
def precision_at_k(y_test, score_t_test, pred_labels):
# top-k
k = int(np.sum(y_test))
threshold = np.percentile(score_t_test, 100 * (1 - k / len(y_test)))
# precision_at_k = metrics.top_k_accuracy_score(label, score, k)
    p_at_k = np.where(score_t_test > threshold)[0]  # the threshold is computed from the scores, so compare against the scores
    TP_at_k = np.sum(y_test[p_at_k])
precision_at_k = TP_at_k / k
return precision_at_k
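`precision_at_k` keeps only the k highest-scoring timestamps, where k equals the number of labeled anomalies, and measures how many of them are true anomalies. An equivalent standalone sketch using `np.argsort` instead of a percentile threshold (the `_demo` name is illustrative):

```python
import numpy as np

def precision_at_k_demo(y_test, score):
    k = int(np.sum(y_test))          # number of true anomalous points
    top_k = np.argsort(score)[-k:]   # indices of the k highest scores
    return float(np.sum(y_test[top_k])) / k

y = np.array([0, 0, 1, 1, 0, 0])
s = np.array([0.1, 0.2, 0.9, 0.3, 0.8, 0.0])
print(precision_at_k_demo(y, s))  # 0.5: top-2 scores sit at indices 2 and 4, only index 2 is a true anomaly
```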
from random import shuffle
import numpy as np
import math
import matplotlib.pyplot as plt
from matplotlib import cm
import pandas as pd
from tqdm import tqdm as tqdm
import time
from sklearn.preprocessing import MinMaxScaler
import random
import os
import sys
module_path = os.path.abspath(os.path.join('../..'))
if module_path not in sys.path:
sys.path.append(module_path)
from metrics.vus.utils.slidingWindows import find_length
from metrics.vus.utils.metrics import metricor
from metrics.vus.models.distance import Fourier
# from metrics.vus.models.feature import Window
def generate_new_label(label,lag):
if lag < 0:
return np.array(list(label[-lag:]) + [0]*(-lag))
elif lag > 0:
return np.array([0]*lag + list(label[:-lag]))
    else:
        return label
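`generate_new_label` shifts a label vector by `lag` positions and zero-pads the exposed end, which is how the robustness experiments below simulate misaligned labels. A standalone sketch (the `_demo` name is illustrative):

```python
import numpy as np

def shift_label_demo(label, lag):
    # positive lag shifts labels to the right, negative to the left; zeros fill the gap
    if lag > 0:
        return np.array([0] * lag + list(label[:-lag]))
    if lag < 0:
        return np.array(list(label[-lag:]) + [0] * (-lag))
    return np.asarray(label)

lab = np.array([0, 1, 1, 0, 0])
print(shift_label_demo(lab, 1))   # [0 0 1 1 0]
print(shift_label_demo(lab, -2))  # [1 0 0 0 0]
```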
def compute_anomaly_acc_lag(methods_scores,label,slidingWindow,methods_keys):
lag_range = list(range(-slidingWindow//4,slidingWindow//4,5))
methods_acc = {}
for i,methods_score in enumerate(tqdm(methods_keys)):
dict_acc = {
'R_AUC_ROC': [],
'AUC_ROC': [],
'R_AUC_PR': [],
'AUC_PR': [],
'VUS_ROC': [],
'VUS_PR': [],
'Precision': [],
'Recall': [],
'F': [],
'ExistenceReward':[],
'OverlapReward': [],
'Precision@k': [],
'Rprecision': [],
'Rrecall': [],
'RF': []}
for lag in tqdm(lag_range):
new_label = generate_new_label(label,lag)
grader = metricor()
R_AUC, R_AP, R_fpr, R_tpr, R_prec = grader.RangeAUC(labels=new_label, score=methods_scores[methods_score], window=slidingWindow, plot_ROC=True)
L, fpr, tpr= grader.metric_new(new_label, methods_scores[methods_score], plot_ROC=True)
precision, recall, AP = grader.metric_PR(new_label, methods_scores[methods_score])
Y, Z, X, X_ap, W, Z_ap,avg_auc_3d, avg_ap_3d = generate_curve(new_label,methods_scores[methods_score],2*slidingWindow)
L1 = [ elem for elem in L]
dict_acc['R_AUC_ROC'] +=[R_AUC]
dict_acc['AUC_ROC'] +=[L1[0]]
dict_acc['R_AUC_PR'] +=[R_AP]
dict_acc['AUC_PR'] +=[AP]
dict_acc['VUS_ROC'] +=[avg_auc_3d]
dict_acc['VUS_PR'] +=[avg_ap_3d]
dict_acc['Precision'] +=[L1[1]]
dict_acc['Recall'] +=[L1[2]]
dict_acc['F'] +=[L1[3]]
dict_acc['ExistenceReward']+=[L1[5]]
dict_acc['OverlapReward'] +=[L1[6]]
dict_acc['Precision@k'] +=[L1[9]]
dict_acc['Rprecision'] +=[L1[7]]
dict_acc['Rrecall'] +=[L1[4]]
dict_acc['RF'] +=[L1[8]]
methods_acc[methods_score] = dict_acc
return methods_acc
def compute_anomaly_acc_percentage(methods_scores,label,slidingWindow,methods_keys,pos_first_anom):
list_pos = []
step_a = max(0,(len(label) - pos_first_anom-200))//20
step_b = max(0,pos_first_anom-200)//20
pos_a = min(len(label),pos_first_anom + 200)
pos_b = max(0,pos_first_anom - 200)
list_pos.append((pos_b,pos_a))
for pos_iter in range(20):
pos_a = min(len(label),pos_a + step_a)
pos_b = max(0,pos_b - step_b)
list_pos.append((pos_b,pos_a))
methods_acc = {}
print(list_pos)
for i,methods_score in enumerate(tqdm(methods_keys)):
dict_acc = {
'R_AUC_ROC': [],
'AUC_ROC': [],
'R_AUC_PR': [],
'AUC_PR': [],
'VUS_ROC': [],
'VUS_PR': [],
'Precision': [],
'Recall': [],
'F': [],
'ExistenceReward':[],
'OverlapReward': [],
'Precision@k': [],
'Rprecision': [],
'Rrecall': [],
'RF': []}
for end_pos in tqdm(list_pos):
new_label = label[end_pos[0]:end_pos[1]]
new_score = np.array(methods_scores[methods_score])[end_pos[0]:end_pos[1]]
grader = metricor()
R_AUC, R_AP, R_fpr, R_tpr, R_prec = grader.RangeAUC(labels=new_label, score=new_score, window=slidingWindow, plot_ROC=True)
L, fpr, tpr= grader.metric_new(new_label, new_score, plot_ROC=True)
precision, recall, AP = grader.metric_PR(new_label, new_score)
Y, Z, X, X_ap, W, Z_ap,avg_auc_3d, avg_ap_3d = generate_curve(new_label,new_score,2*slidingWindow)
L1 = [ elem for elem in L]
dict_acc['R_AUC_ROC'] +=[R_AUC]
dict_acc['AUC_ROC'] +=[L1[0]]
dict_acc['R_AUC_PR'] +=[R_AP]
dict_acc['AUC_PR'] +=[AP]
dict_acc['VUS_ROC'] +=[avg_auc_3d]
dict_acc['VUS_PR'] +=[avg_ap_3d]
dict_acc['Precision'] +=[L1[1]]
dict_acc['Recall'] +=[L1[2]]
dict_acc['F'] +=[L1[3]]
dict_acc['ExistenceReward']+=[L1[5]]
dict_acc['OverlapReward'] +=[L1[6]]
dict_acc['Precision@k'] +=[L1[9]]
dict_acc['Rprecision'] +=[L1[7]]
dict_acc['Rrecall'] +=[L1[4]]
dict_acc['RF'] +=[L1[8]]
methods_acc[methods_score] = dict_acc
return methods_acc
def compute_anomaly_acc_noise(methods_scores,label,slidingWindow,methods_keys):
lag_range = list(range(-slidingWindow//2,slidingWindow//2,10))
methods_acc = {}
for i,methods_score in enumerate(tqdm(methods_keys)):
dict_acc = {
'R_AUC_ROC': [],
'AUC_ROC': [],
'R_AUC_PR': [],
'AUC_PR': [],
'VUS_ROC': [],
'VUS_PR': [],
'Precision': [],
'Recall': [],
'F': [],
'ExistenceReward':[],
'OverlapReward': [],
'Precision@k': [],
'Rprecision': [],
'Rrecall': [],
'RF': []}
for lag in tqdm(lag_range):
new_label = label
grader = metricor()
noise = np.random.normal(-0.1,0.1,len(methods_scores[methods_score]))
new_score = np.array(methods_scores[methods_score]) + noise
new_score = (new_score - min(new_score))/(max(new_score) - min(new_score))
R_AUC, R_AP, R_fpr, R_tpr, R_prec = grader.RangeAUC(labels=new_label, score=new_score, window=slidingWindow, plot_ROC=True)
L, fpr, tpr= grader.metric_new(new_label, new_score, plot_ROC=True)
precision, recall, AP = grader.metric_PR(new_label, new_score)
Y, Z, X, X_ap, W, Z_ap,avg_auc_3d, avg_ap_3d = generate_curve(new_label,new_score,2*slidingWindow)
L1 = [ elem for elem in L]
dict_acc['R_AUC_ROC'] +=[R_AUC]
dict_acc['AUC_ROC'] +=[L1[0]]
dict_acc['R_AUC_PR'] +=[R_AP]
dict_acc['AUC_PR'] +=[AP]
dict_acc['VUS_ROC'] +=[avg_auc_3d]
dict_acc['VUS_PR'] +=[avg_ap_3d]
dict_acc['Precision'] +=[L1[1]]
dict_acc['Recall'] +=[L1[2]]
dict_acc['F'] +=[L1[3]]
dict_acc['ExistenceReward']+=[L1[5]]
dict_acc['OverlapReward'] +=[L1[6]]
dict_acc['Precision@k'] +=[L1[9]]
dict_acc['Rprecision'] +=[L1[7]]
dict_acc['Rrecall'] +=[L1[4]]
dict_acc['RF'] +=[L1[8]]
methods_acc[methods_score] = dict_acc
return methods_acc
def compute_anomaly_acc_pairwise(methods_scores,label,slidingWindow,method1,method2):
lag_range = list(range(-slidingWindow//4,slidingWindow//4,5))
methods_acc = {}
method_key = [method1]
if method2 is not None:
method_key = [method1,method2]
for i,methods_score in enumerate(tqdm(method_key)):
dict_acc = {
'R_AUC_ROC': [],
'AUC_ROC': [],
'R_AUC_PR': [],
'AUC_PR': [],
'VUS_ROC': [],
'VUS_PR': [],
'Precision': [],
'Recall': [],
'F': [],
'ExistenceReward':[],
'OverlapReward': [],
'Precision@k': [],
'Rprecision': [],
'Rrecall': [],
'RF': []}
for lag in tqdm(range(60)):
new_lag = random.randint(-slidingWindow//4,slidingWindow//4)
new_label = generate_new_label(label,new_lag)
noise = np.random.normal(-0.1,0.1,len(methods_scores[methods_score]))
new_score = np.array(methods_scores[methods_score]) + noise
new_score = (new_score - min(new_score))/(max(new_score) - min(new_score))
grader = metricor()
R_AUC, R_AP, R_fpr, R_tpr, R_prec = grader.RangeAUC(labels=new_label, score=new_score, window=slidingWindow, plot_ROC=True)
L, fpr, tpr= grader.metric_new(new_label, new_score, plot_ROC=True)
precision, recall, AP = grader.metric_PR(new_label, new_score)
#range_anomaly = grader.range_convers_new(new_label)
Y, Z, X, X_ap, W, Z_ap,avg_auc_3d, avg_ap_3d = generate_curve(new_label,new_score,2*slidingWindow)
L1 = [ elem for elem in L]
dict_acc['R_AUC_ROC'] +=[R_AUC]
dict_acc['AUC_ROC'] +=[L1[0]]
dict_acc['R_AUC_PR'] +=[R_AP]
dict_acc['AUC_PR'] +=[AP]
dict_acc['VUS_ROC'] +=[avg_auc_3d]
dict_acc['VUS_PR'] +=[avg_ap_3d]
dict_acc['Precision'] +=[L1[1]]
dict_acc['Recall'] +=[L1[2]]
dict_acc['F'] +=[L1[3]]
dict_acc['ExistenceReward']+=[L1[5]]
dict_acc['OverlapReward'] +=[L1[6]]
dict_acc['Precision@k'] +=[L1[9]]
dict_acc['Rprecision'] +=[L1[7]]
dict_acc['Rrecall'] +=[L1[4]]
dict_acc['RF'] +=[L1[8]]
methods_acc[methods_score] = dict_acc
return methods_acc
def normalize_dict_exp(methods_acc_lag,methods_keys):
key_metrics = [
'VUS_ROC',
'VUS_PR',
'R_AUC_ROC',
'R_AUC_PR',
'AUC_ROC',
'AUC_PR',
'Rprecision',
'Rrecall',
'RF',
'Precision',
'Recall',
'F',
'Precision@k'
][::-1]
norm_methods_acc_lag = {}
for key in methods_keys:
norm_methods_acc_lag[key] = {}
for key_metric in key_metrics:
ts = methods_acc_lag[key][key_metric]
new_ts = list(np.array(ts) - np.mean(ts))
norm_methods_acc_lag[key][key_metric] = new_ts
return norm_methods_acc_lag
def group_dict(methods_acc_lag,methods_keys):
key_metrics = [
'VUS_ROC',
'VUS_PR',
'R_AUC_ROC',
'R_AUC_PR',
'AUC_ROC',
'AUC_PR',
'Rprecision',
'Rrecall',
'RF',
'Precision',
'Recall',
'F',
'Precision@k'
][::-1]
norm_methods_acc_lag = {key:[] for key in key_metrics}
for key in methods_keys:
for key_metric in key_metrics:
ts = list(methods_acc_lag[key][key_metric])
new_ts = list(np.array(ts) - np.mean(ts))
norm_methods_acc_lag[key_metric] += new_ts
return norm_methods_acc_lag
def generate_curve(label,score,slidingWindow):
tpr_3d, fpr_3d, prec_3d, window_3d, avg_auc_3d, avg_ap_3d = metricor().RangeAUC_volume(labels_original=label, score=score, windowSize=1*slidingWindow)
X = np.array(tpr_3d).reshape(1,-1).ravel()
X_ap = np.array(tpr_3d)[:,:-1].reshape(1,-1).ravel()
Y = np.array(fpr_3d).reshape(1,-1).ravel()
W = np.array(prec_3d).reshape(1,-1).ravel()
Z = np.repeat(window_3d, len(tpr_3d[0]))
Z_ap = np.repeat(window_3d, len(tpr_3d[0])-1)
return Y, Z, X, X_ap, W, Z_ap,avg_auc_3d, avg_ap_3d
def box_plot(data, edge_color, fill_color, ax=None):
    ax = ax if ax is not None else plt.gca()  # ax was previously undefined in this scope
    bp = ax.boxplot(data, patch_artist=True)
for element in ['boxes', 'whiskers', 'fliers', 'means', 'medians', 'caps']:
plt.setp(bp[element], color=edge_color)
for patch in bp['boxes']:
patch.set(facecolor=fill_color)
return bp
import numpy as np
import math
import matplotlib.pyplot as plt
from matplotlib import cm
import pandas as pd
from tqdm import tqdm as tqdm
import time
from sklearn.preprocessing import MinMaxScaler
import random
import os
import sys
module_path = os.path.abspath(os.path.join('../..'))
if module_path not in sys.path:
sys.path.append(module_path)
from metrics.vus.utils.slidingWindows import find_length
from metrics.vus.utils.metrics import metricor
from metrics.vus.models.distance import Fourier
from metrics.vus.models.feature import Window
from metrics.vus.models.cnn import cnn
from metrics.vus.models.AE_mlp2 import AE_MLP2
from metrics.vus.models.lstm import lstm
from metrics.vus.models.ocsvm import OCSVM
from metrics.vus.models.poly import POLY
from metrics.vus.models.pca import PCA
from metrics.vus.models.norma import NORMA
from metrics.vus.models.matrix_profile import MatrixProfile
from metrics.vus.models.lof import LOF
from metrics.vus.models.iforest import IForest
def find_section_length(label,length):
best_i = None
best_sum = None
current_subseq = False
for i in range(len(label)):
changed = False
if label[i] == 1:
if current_subseq == False:
current_subseq = True
if best_i is None:
changed = True
best_i = i
best_sum = np.sum(label[max(0,i-200):min(len(label),i+9800)])
else:
if np.sum(label[max(0,i-200):min(len(label),i+9800)]) < best_sum:
changed = True
best_i = i
best_sum = np.sum(label[max(0,i-200):min(len(label),i+9800)])
else:
changed = False
if changed:
diff = i+9800 - len(label)
pos1 = max(0,i-200 - max(0,diff))
pos2 = min(i+9800,len(label))
else:
current_subseq = False
if best_i is not None:
return best_i-pos1,(pos1,pos2)
else:
return None,None
def generate_data(filepath,init_pos,max_length):
df = pd.read_csv(filepath, header=None).to_numpy()
name = filepath.split('/')[-1]
#max_length = 30000
data = df[init_pos:init_pos+max_length,0].astype(float)
label = df[init_pos:init_pos+max_length,1]
pos_first_anom,pos = find_section_length(label,max_length)
data = df[pos[0]:pos[1],0].astype(float)
label = df[pos[0]:pos[1],1]
slidingWindow = find_length(data)
#slidingWindow = 70
X_data = Window(window = slidingWindow).convert(data).to_numpy()
data_train = data[:int(0.1*len(data))]
data_test = data
X_train = Window(window = slidingWindow).convert(data_train).to_numpy()
X_test = Window(window = slidingWindow).convert(data_test).to_numpy()
return pos_first_anom,slidingWindow,data,X_data,data_train,data_test,X_train,X_test,label
def compute_score(methods,slidingWindow,data,X_data,data_train,data_test,X_train,X_test):
methods_scores = {}
for method in methods:
start_time = time.time()
if method == 'IForest':
clf = IForest(n_jobs=1)
x = X_data
clf.fit(x)
score = clf.decision_scores_
score = MinMaxScaler(feature_range=(0,1)).fit_transform(score.reshape(-1,1)).ravel()
score = np.array([score[0]]*math.ceil((slidingWindow-1)/2) + list(score) + [score[-1]]*((slidingWindow-1)//2))
elif method == 'LOF':
clf = LOF(n_neighbors=20, n_jobs=1)
x = X_data
clf.fit(x)
score = clf.decision_scores_
score = MinMaxScaler(feature_range=(0,1)).fit_transform(score.reshape(-1,1)).ravel()
score = np.array([score[0]]*math.ceil((slidingWindow-1)/2) + list(score) + [score[-1]]*((slidingWindow-1)//2))
elif method == 'MatrixProfile':
clf = MatrixProfile(window = slidingWindow)
x = data
clf.fit(x)
score = clf.decision_scores_
score = MinMaxScaler(feature_range=(0,1)).fit_transform(score.reshape(-1,1)).ravel()
score = np.array([score[0]]*math.ceil((slidingWindow-1)/2) + list(score) + [score[-1]]*((slidingWindow-1)//2))
elif method == 'NormA':
clf = NORMA(pattern_length = slidingWindow, nm_size=3*slidingWindow)
x = data
clf.fit(x)
score = clf.decision_scores_
score = MinMaxScaler(feature_range=(0,1)).fit_transform(score.reshape(-1,1)).ravel()
score = np.array([score[0]]*((slidingWindow-1)//2) + list(score) + [score[-1]]*((slidingWindow-1)//2))
elif method == 'PCA':
clf = PCA()
x = X_data
clf.fit(x)
score = clf.decision_scores_
score = MinMaxScaler(feature_range=(0,1)).fit_transform(score.reshape(-1,1)).ravel()
score = np.array([score[0]]*math.ceil((slidingWindow-1)/2) + list(score) + [score[-1]]*((slidingWindow-1)//2))
elif method == 'POLY':
clf = POLY(power=3, window = slidingWindow)
x = data
clf.fit(x)
measure = Fourier()
measure.detector = clf
measure.set_param()
clf.decision_function(measure=measure)
score = clf.decision_scores_
score = MinMaxScaler(feature_range=(0,1)).fit_transform(score.reshape(-1,1)).ravel()
elif method == 'OCSVM':
X_train_ = MinMaxScaler(feature_range=(0,1)).fit_transform(X_train.T).T
X_test_ = MinMaxScaler(feature_range=(0,1)).fit_transform(X_test.T).T
clf = OCSVM(nu=0.05)
clf.fit(X_train_, X_test_)
score = clf.decision_scores_
score = np.array([score[0]]*math.ceil((slidingWindow-1)/2) + list(score) + [score[-1]]*((slidingWindow-1)//2))
score = MinMaxScaler(feature_range=(0,1)).fit_transform(score.reshape(-1,1)).ravel()
elif method == 'LSTM':
clf = lstm(slidingwindow = slidingWindow, predict_time_steps=1, epochs = 50, patience = 5, verbose=0)
clf.fit(data_train, data_test)
measure = Fourier()
measure.detector = clf
measure.set_param()
clf.decision_function(measure=measure)
score = clf.decision_scores_
score = MinMaxScaler(feature_range=(0,1)).fit_transform(score.reshape(-1,1)).ravel()
elif method == 'AE':
clf = AE_MLP2(slidingWindow = slidingWindow, epochs=100, verbose=0)
clf.fit(data_train, data_test)
score = clf.decision_scores_
score = MinMaxScaler(feature_range=(0,1)).fit_transform(score.reshape(-1,1)).ravel()
elif method == 'CNN':
clf = cnn(slidingwindow = slidingWindow, predict_time_steps=1, epochs = 100, patience = 5, verbose=0)
clf.fit(data_train, data_test)
measure = Fourier()
measure.detector = clf
measure.set_param()
clf.decision_function(measure=measure)
score = clf.decision_scores_
score = MinMaxScaler(feature_range=(0,1)).fit_transform(score.reshape(-1,1)).ravel()
#end_time = time.time()
#time_exec = end_time - start_time
#print(method,"\t time: {}".format(time_exec))
methods_scores[method] = score
return methods_scores
from .utils.metrics import metricor
from .analysis.robustness_eval import generate_curve
def get_range_vus_roc(score, labels, slidingWindow):
grader = metricor()
R_AUC_ROC, R_AUC_PR, _, _, _ = grader.RangeAUC(labels=labels, score=score, window=slidingWindow, plot_ROC=True)
_, _, _, _, _, _,VUS_ROC, VUS_PR = generate_curve(labels, score, 2*slidingWindow)
metrics = {'R_AUC_ROC': R_AUC_ROC, 'R_AUC_PR': R_AUC_PR, 'VUS_ROC': VUS_ROC, 'VUS_PR': VUS_PR}
return metrics
# -*- coding: utf-8 -*-
"""Classes of distance measure for model type A
"""
import numpy as np
# import matplotlib.pyplot as plt
# import random
from arch import arch_model
# import pandas as pd
import math
# import pmdarima as pm
# from pmdarima import model_selection
# import os
# import dis
# import statistics
# from sklearn import metrics
# import sklearn
class Euclidean:
""" The function class for Lp euclidean norm
----------
    power : int, optional (default=1)
        The power of the lp norm. For power = k, the measure is calculated by |x - y|_k
    neighborhood : int, optional (default=max(100, 10*window size))
        The length of the neighborhood used to derive the normalizing constant D, which is based on
        the difference between the maximum and minimum in the neighborhood minus the window.
    window: int, optional (default = length of input data)
        The length of the subsequence to be compared
Attributes
----------
decision_scores_ : numpy array of shape (n_samples,)
The outlier scores of the training data.
The higher, the more abnormal. Outliers tend to have higher
scores. This value is available once the detector is
fitted.
detector: Object classifier
the anomaly detector that is used
"""
def __init__(self, power = 1, neighborhood = 100, window = 20, norm = False):
self.power = power
self.window = window
self.neighborhood = neighborhood
self.detector = None
self.decision_scores_ = []
self.norm = norm
self.X_train = 2
def measure(self, X, Y, index):
"""Derive the decision score based on the given distance measure
Parameters
----------
X : numpy array of shape (n_samples, )
The real input samples subsequence.
Y : numpy array of shape (n_samples, )
The estimated input samples subsequence.
Index : int
the index of the starting point in the subsequence
Returns
-------
        score : float
            dissimilarity score between the two subsequences
"""
X_train = self.X_train
X_train = self.detector.X_train_
power = self.power
window = self.window
neighborhood = self.neighborhood
norm = self.norm
data = X_train
if norm == False:
if X.shape[0] == 0:
score = 0
else:
score = np.linalg.norm(X-Y, power)/(X.shape[0])
self.decision_scores_.append((index, score))
return score
elif type(X_train) == int:
print('Error! Detector is not fed to the object and X_train is not known')
elif neighborhood != 'all':
length = X.shape[0]
neighbor = int(self.neighborhood/2)
if index + neighbor < self.n_train_ and index - neighbor > 0:
region = np.concatenate((data[index - neighbor: index], data[index + window: index + neighbor] ))
D = np.max(region) - np.min(region)
elif index + neighbor >= self.n_train_ and index + window < self.n_train_:
region = np.concatenate((data[self.n_train_ - neighborhood: index], data[index + window: self.n_train_] ))
D = np.max(region) - np.min(region)
elif index + window >= self.n_train_:
region = data[self.n_train_ - neighborhood: index]
D = np.max(region) - np.min(region)
else:
region = np.concatenate((data[0: index], data[index + window: index + neighborhood] ))
D = np.max(region) - np.min(region)
score = np.linalg.norm(X-Y, power)/D/(X.shape[0]**power)
self.decision_scores_.append((index, score))
return score
def set_param(self):
if self.detector != None:
self.window = self.detector.window
self.neighborhood = self.detector.neighborhood
self.n_train_ = self.detector.n_train_
self.X_train = self.detector.X_train_
else:
print('Error! Detector is not fed to the object and X_train is not known')
return self
class Mahalanobis:
""" The function class for Mahalanobis measure
----------
    probability : boolean, optional (default=False)
        Whether to derive the anomaly score from the probability that such a point occurs
    neighborhood : int, optional (default=max(100, 10*window size))
        The length of the neighborhood used to derive the normalizing constant D, which is based on
        the difference between the maximum and minimum in the neighborhood minus the window.
Attributes
----------
decision_scores_ : numpy array of shape (n_samples,)
The outlier scores of the training data.
The higher, the more abnormal. Outliers tend to have higher
scores. This value is available once the detector is
fitted.
detector: Object classifier
the anomaly detector that is used
"""
def __init__(self, probability = False):
self.probability = probability
self.detector = None
self.decision_scores_ = []
self.mu = 0
def set_param(self):
'''update the parameters with the detector that is used
'''
self.n_initial_ = self.detector.n_initial_
self.estimation = self.detector.estimation
self.X_train = self.detector.X_train_
self.window = self.detector.window
window = self.window
resid = self.X_train - self.estimation
number = max(100, self.window)
self.residual = np.zeros((window, number))
for i in range(number):
self.residual[:, i] = resid[self.n_initial_+i:self.n_initial_+i+window]
self.mu = np.zeros(number)
self.cov = np.cov(self.residual, rowvar=1)
if self.window == 1:
self.cov = (np.sum(np.square(self.residual))/(number - 1))**0.5
return self
def norm_pdf_multivariate(self, x):
        '''multivariate normal density function
'''
try:
mu = self.mu
except:
mu = np.zeros(x.shape[0])
sigma = self.cov
size = x.shape[0]
if size == len(mu) and (size, size) == sigma.shape:
det = np.linalg.det(sigma)
if det == 0:
raise NameError("The covariance matrix can't be singular")
norm_const = 1.0/ ( math.pow((2*math.pi),float(size)/2) * math.pow(det,1.0/2) )
x_mu = np.matrix(x - mu)
inv = np.linalg.inv(sigma)
result = math.pow(math.e, -0.5 * (x_mu * inv * x_mu.T))
return norm_const * result
else:
raise NameError("The dimensions of the input don't match")
def normpdf(self, x):
'''univariate normal
'''
mean = 0
        sd = float(self.cov)  # np.asscalar was removed from NumPy; float() works for scalars
var = float(sd)**2
denom = (2*math.pi*var)**.5
num = math.exp(-(float(x)-float(mean))**2/(2*var))
return num/denom
def measure(self, X, Y, index):
"""Derive the decision score based on the given distance measure
Parameters
----------
X : numpy array of shape (n_samples, )
The real input samples subsequence.
Y : numpy array of shape (n_samples, )
The estimated input samples subsequence.
Index : int
the index of the starting point in the subsequence
Returns
-------
        score : float
            dissimilarity score between the two subsequences
"""
mu = np.zeros(self.detector.window)
cov = self.cov
if self.probability == False:
if X.shape[0] == mu.shape[0]:
score = np.matmul(np.matmul((X-Y-mu).T, cov), (X-Y-mu))/(X.shape[0])
self.decision_scores_.append((index, score))
return score
else:
return (X-Y).T.dot(X-Y)
else:
if len(X) > 1:
prob = self.norm_pdf_multivariate(X-Y)
elif len(X) == 1:
                X = np.asarray(X).item()  # np.asscalar was removed from NumPy
                Y = np.asarray(Y).item()
prob = self.normpdf(X-Y)
else:
prob = 1
score = 1 - prob
score = max(score, 0)
self.decision_scores_.append((index, score))
return score
class Garch:
""" The function class for garch measure
----------
p, q : int, optional (default=1, 1)
The order of the garch model to be fitted on the residual
mean : string, optional (default='zero' )
The forecast conditional mean.
    vol: string, optional (default = 'garch')
        The forecast conditional variance.
Attributes
----------
decision_scores_ : numpy array of shape (n_samples,)
The outlier scores of the training data.
The higher, the more abnormal. Outliers tend to have higher
scores. This value is available once the detector is
fitted.
detector: Object classifier
the anomaly detector that is used
"""
def __init__(self, p = 1, q = 1, mean = 'zero', vol = 'garch'):
self.p = p
self.q = q
self.vol = vol
self.mean = mean
self.decision_scores_ = []
def set_param(self):
'''update the parameters with the detector that is used
'''
q = self.q
p=self.p
mean = self.mean
vol = self.vol
if self.detector != None:
self.n_initial_ = self.detector.n_initial_
self.estimation = self.detector.estimation
self.X_train = self.detector.X_train_
self.window = self.detector.window
window = self.window
resid = 10 * (self.X_train - self.estimation)
model = arch_model(resid, mean=mean, vol=vol, p=p, q=q)
model_fit = model.fit(disp='off')
self.votility = model_fit.conditional_volatility/10
else:
print('Error! Detector not fed to the measure')
return self
def measure(self, X, Y, index):
"""Derive the decision score based on the given distance measure
Parameters
----------
X : numpy array of shape (n_samples, )
The real input samples subsequence.
Y : numpy array of shape (n_samples, )
The estimated input samples subsequence.
Index : int
the index of the starting point in the subsequence
Returns
-------
        score : float
            dissimilarity score between the two subsequences
"""
X = np.array(X)
Y = np.array(Y)
length = len(X)
score = 0
if length != 0:
for i in range(length):
sigma = self.votility[index + i]
if sigma != 0:
score += abs(X[i]-Y[i])/sigma
score = score/length
return score
class SSA_DISTANCE:
    """ The function class for SSA measure,
    good for contextual anomalies
----------
method : string, optional (default='linear' )
The method to fit the line and derives the SSA score
e: float, optional (default = 1)
The upper bound to start new line search for linear method
Attributes
----------
decision_scores_ : numpy array of shape (n_samples,)
The outlier scores of the training data.
The higher, the more abnormal. Outliers tend to have higher
scores. This value is available once the detector is
fitted.
detector: Object classifier
the anomaly detector that is used
"""
def __init__(self, method ='linear', e = 1):
self.method = method
self.decision_scores_ = []
self.e = e
def Linearization(self, X2):
"""Obtain the linearized curve.
Parameters
----------
X2 : numpy array of shape (n, )
the time series curve to be fitted
e: float, integer, or numpy array
weights to obtain the
Returns
-------
fit: parameters for the fitted linear curve
"""
e = self.e
i = 0
fit = {}
fit['index'] = []
fit['rep'] = []
while i < len(X2):
fit['index'].append(i)
try:
fit['Y'+str(i)]= X2[i]
except:
print(X2.shape, X2)
fit['rep'].append(np.array([i, X2[i]]))
if i+1 >= len(X2):
break
k = X2[i+1]-X2[i]
b = -i*(X2[i+1]-X2[i])+X2[i]
fit['reg' +str(i)]= np.array([k, b])
i += 2
if i >= len(X2):
break
d = np.abs(X2[i]- (k * i+b))
while d < e:
i +=1
if i >= len(X2):
break
d = np.abs(X2[i]- (k * i+b))
return fit
def set_param(self):
'''update the parameters with the detector that is used.
        Since the SSA measure doesn't need the attributes of the detector
or characteristics of X_train, the process is omitted.
'''
return self
def measure(self, X2, X3, start_index):
"""Obtain the SSA similarity score.
Parameters
----------
X2 : numpy array of shape (n, )
the reference timeseries
X3 : numpy array of shape (n, )
the tested timeseries
e: float, integer, or numpy array
weights to obtain the
Returns
-------
score: float, the higher the more dissimilar are the two curves
"""
#linearization of data X2 and X3
X2 = np.array(X2)
X3 = np.array(X3)
e = self.e
fit = self.Linearization(X2)
fit2 = self.Linearization(X3)
        # line alignment: merge the breakpoint indices of both linearizations, preserving order
        Index = []
        test_list = fit['index'] + fit2['index']
        for x in test_list:
            if x not in Index:
                Index.append(x)
Y = 0
#Similarity Computation
for i in Index:
if i in fit['index'] and i in fit2['index']:
Y += abs(fit['Y'+str(i)]-fit2['Y'+str(i)])
elif i in fit['index']:
J = np.max(np.where(np.array(fit2['index']) < i ))
index = fit2['index'][J]
k = fit2['reg'+str(index)][0]
b = fit2['reg'+str(index)][1]
value = abs(k * i + b - fit['Y'+str(i)])
Y += value
elif i in fit2['index']:
J = np.max(np.where(np.array(fit['index']) < i ))
index = fit['index'][J]
k = fit['reg'+str(index)][0]
b = fit['reg'+str(index)][1]
value = abs(k * i + b - fit2['Y'+str(i)])
Y += value
if len(Index) != 0:
score = Y/len(Index)
else:
score = 0
self.decision_scores_.append((start_index, score))
if len(X2) == 1:
print('Error! SSA measure doesn\'t apply to singleton' )
else:
return score
class Fourier:
    """ The function class for Fourier measure,
    good for contextual anomalies
    ----------
    power: int, optional (default = 2)
        Lp norm for the dissimilarity measure considered
Attributes
----------
decision_scores_ : numpy array of shape (n_samples,)
The outlier scores of the training data.
The higher, the more abnormal. Outliers tend to have higher
scores. This value is available once the detector is
fitted.
detector: Object classifier
the anomaly detector that is used
"""
def __init__(self, power = 2):
self.decision_scores_ = []
self.power = power
def set_param(self):
'''update the parameters with the detector that is used
        since the FFT measure doesn't need the attributes of the detector
or characteristics of X_train, the process is omitted.
'''
return self
def measure(self, X2, X3, start_index):
"""Obtain the SSA similarity score.
Parameters
----------
X2 : numpy array of shape (n, )
the reference timeseries
X3 : numpy array of shape (n, )
the tested timeseries
        index: int,
            current index of the subsequence that is being measured
Returns
-------
score: float, the higher the more dissimilar are the two curves
"""
power = self.power
X2 = np.array(X2)
X3 = np.array(X3)
if len(X2) == 0:
score = 0
else:
            X2 = np.fft.fft(X2)
            X3 = np.fft.fft(X3)
score = np.linalg.norm(X2 - X3, ord = power)/len(X3)
self.decision_scores_.append((start_index, score))
return score
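The Fourier measure compares two subsequences in the frequency domain: an Lp distance between their FFT coefficients, normalized by length. A standalone sketch of the same idea (the `_demo` name is illustrative):

```python
import numpy as np

def fourier_distance_demo(x, y, power=2):
    # Lp distance between FFT coefficients, normalized by sequence length
    X, Y = np.fft.fft(x), np.fft.fft(y)
    return np.linalg.norm(X - Y, ord=power) / len(y)

a = np.array([0.0, 1.0, 0.0, -1.0])
print(fourier_distance_demo(a, a))  # 0.0 for identical inputs
print(fourier_distance_demo(a, np.zeros(4)))
```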
class DTW:
""" The function class for dynamic time warping measure
----------
method : string, optional (default='L2' )
        The distance measure used to derive DTW.
        Available: "L2", "L1", or a custom callable
Attributes
----------
decision_scores_ : numpy array of shape (n_samples,)
The outlier scores of the training data.
The higher, the more abnormal. Outliers tend to have higher
scores. This value is available once the detector is
fitted.
detector: Object classifier
the anomaly detector that is used
"""
def __init__(self, method = 'L2'):
self.decision_scores_ = []
if type(method) == str:
if method == 'L1':
distance = lambda x, y: abs(x-y)
elif method == 'L2':
distance = lambda x, y: (x-y)**2
else:
distance = method
self.distance = distance
def set_param(self):
'''update the parameters with the detector that is used
        since the DTW measure doesn't need the attributes of the detector
or characteristics of X_train, the process is omitted.
'''
return self
def measure(self, X1, X2, start_index):
"""Obtain the SSA similarity score.
Parameters
----------
X1 : numpy array of shape (n, )
the reference timeseries
X2 : numpy array of shape (n, )
the tested timeseries
start_index: int,
current index for the subsequence that is being measured
Returns
-------
score: float, the higher the more dissimilar are the two curves
"""
distance = self.distance
X1 = np.array(X1)
X2 = np.array(X2)
value = 1
if len(X1)==0:
value =0
X1= np.zeros(5)
X2 = X1
M = np.zeros((len(X1), len(X2)))
for index_i in range(len(X1)):
for index_j in range(len(X1) - index_i):
L = []
i = index_i
j = index_i + index_j
D = distance(X1[i], X2[j])
try:
L.append(M[i-1, j-1])
except:
L.append(np.inf)
try:
L.append(M[i, j-1])
except:
L.append(np.inf)
try:
L.append(M[i-1, j])
except:
L.append(np.inf)
D += min(L)
M[i,j] = D
if i !=j:
L = []
j = index_i
i = index_i + index_j
D = distance(X1[i], X2[j])
try:
L.append(M[i-1, j-1])
except:
L.append(np.inf)
try:
L.append(M[i, j-1])
except:
L.append(np.inf)
try:
L.append(M[i-1, j])
except:
L.append(np.inf)
D += min(L)
M[i,j] = D
score = M[len(X1)-1, len(X1)-1]/len(X1)
if value == 0:
score = 0
self.decision_scores_.append((start_index, score))
return score
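The class above fills the cost matrix diagonal-by-diagonal with try/except guards at the borders; the textbook formulation of the same recurrence uses an explicit infinite border instead. A minimal sketch (hypothetical helper, not this class):

```python
import numpy as np

def dtw_distance(a, b, dist=lambda x, y: (x - y) ** 2):
    """Classic O(n*m) dynamic time warping with an infinite border row/column."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(a[i - 1], b[j - 1])
            # best of insertion, deletion, and match
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

x = [0.0, 1.0, 2.0, 1.0, 0.0]
print(dtw_distance(x, x))                         # 0.0
print(dtw_distance(x, [0.0, 2.0, 1.0, 0.0]) > 0)  # True
```

The infinite border replaces the try/except bookkeeping: out-of-range predecessors simply cost infinity.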
class EDRS:
""" The function class for edit distance on real sequences
----------
method : string, optional (default='L1')
The distance measure to derive EDRS.
Available "L1" and custom
ep: float, optional (default = False)
the matching threshold used to decide D_ij
vol : boolean, optional (default = False)
whether to adopt a changing volatility estimated by GARCH
for ep at different windows.
Attributes
----------
decision_scores_ : numpy array of shape (n_samples,)
The outlier scores of the training data.
The higher, the more abnormal. Outliers tend to have higher
scores. This value is available once the detector is
fitted.
detector: Object classifier
the anomaly detector that is used
"""
def __init__(self, method = 'L1', ep = False, vol = False):
self.decision_scores_ = []
if type(method) == str:
if method == 'L1':
distance = lambda x, y: abs(x-y)
else:
distance = method
self.distance = distance
self.ep = ep
self.vot = vol
def set_param(self):
'''update ep based on the volatility of the model residuals
'''
estimation = np.array(self.detector.estimation )
initial = self.detector.n_initial_
X = np.array(self.detector.X_train_)
self.initial = initial
residual = estimation[initial:] - X[initial:]
number = len(residual)
#var = (np.sum(np.square(residual))/(number - 1))**0.5
vot = self.vot
if vot == False:
var = np.var(residual)
else:
model = arch_model(10 * residual, mean='Constant', vol='garch', p=1, q=1)
model_fit = model.fit(disp='off')
var = model_fit.conditional_volatility/10
if self.ep == False:
self.ep = 3 * (np.sum(np.square(residual))/(len(residual) - 1))**0.5
return self
def measure(self, X1, X2, start_index):
"""Obtain the SSA similarity score.
Parameters
----------
X1 : numpy array of shape (n, )
the reference timeseries
X2 : numpy array of shape (n, )
the tested timeseries
start_index: int,
current index for the subsequence that is being measured
Returns
-------
score: float, the higher the more dissimilar are the two curves
"""
distance = self.distance
X1 = np.array(X1)
X2 = np.array(X2)
vot = self.vot
if vot == False:
ep = self.ep
else:
try:
ep = self.ep[start_index - self.initial]
except:
#sometime start_index is the length of the number
ep = 0
value = 1
if len(X1)==0:
value =0
X1= np.zeros(5)
X2 = X1
M = np.zeros((len(X1), len(X2)))
M[:, 0] = np.arange(len(X1))
M[0, :] = np.arange(len(X2))
for index_i in range(1, len(X1)):
for index_j in range(len(X1) - index_i):
L = []
i = index_i
j = index_i + index_j
D = distance(X1[i], X2[j])
if D < ep:
M[i, j]= M[i-1, j-1]
else:
try:
L.append(M[i-1, j-1])
except:
L.append(np.inf)
try:
L.append(M[i, j-1])
except:
L.append(np.inf)
try:
L.append(M[i-1, j])
except:
L.append(np.inf)
M[i,j] = 1 + min(L)
if i !=j:
L = []
j = index_i
i = index_i + index_j
D = distance(X1[i], X2[j])
if D < ep:
M[i, j]= M[i-1, j-1]
else:
try:
L.append(M[i-1, j-1])
except:
L.append(np.inf)
try:
L.append(M[i, j-1])
except:
L.append(np.inf)
try:
L.append(M[i-1, j])
except:
L.append(np.inf)
M[i,j] = 1 + min(L)
score = M[len(X1)-1, len(X1)-1]/len(X1)
if value == 0:
score = 0
self.decision_scores_.append((start_index, score))
return score
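EDRS counts edit operations: two points match for free when their distance is below the threshold ep, and every insertion, deletion, or substitution costs 1. A compact standalone sketch of that recurrence (illustrative, not this class):

```python
import numpy as np

def edr_distance(a, b, eps=0.1):
    """Edit distance on real sequences: points within eps match for free."""
    n, m = len(a), len(b)
    D = np.zeros((n + 1, m + 1))
    D[:, 0] = np.arange(n + 1)  # deleting all of a
    D[0, :] = np.arange(m + 1)  # inserting all of b
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if abs(a[i - 1] - b[j - 1]) < eps else 1
            D[i, j] = min(D[i - 1, j - 1] + sub, D[i - 1, j] + 1, D[i, j - 1] + 1)
    return D[n, m]

print(edr_distance([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 0.0
print(edr_distance([1.0, 2.0, 3.0], [1.0, 2.0, 9.0]))  # 1.0
```

One mismatched point costs exactly one edit, which makes the score robust to outliers compared with DTW's squared penalties.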
class TWED:
""" Function class for Time-warped edit distance(TWED) measure
----------
method : string, optional (default='L2' )
The distance measure to derive DTW.
Avaliable "L2", "L1", and custom
gamma: float, optiona (default = 0.1)
mismatch penalty
v : float, optional (default = False)
stifness parameter
Attributes
----------
decision_scores_ : numpy array of shape (n_samples,)
The outlier scores of the training data.
The higher, the more abnormal. Outliers tend to have higher
scores. This value is available once the detector is
fitted.
detector: Object classifier
the anomaly detector that is used
"""
def __init__(self, gamma = 0.1, v = 0.1):
self.decision_scores_ = []
self.gamma = gamma
self.v = v
def set_param(self):
'''No need'''
return self
def measure(self, A, B, start_index):
"""Obtain the SSA similarity score.
Parameters
----------
A : numpy array of shape (n, )
the reference timeseries
B : numpy array of shape (n, )
the tested timeseries
start_index: int,
current index for the subsequence that is being measured
Returns
-------
score: float, the higher the more dissimilar are the two curves
"""
# code modified from Wikipedia
Dlp = lambda x,y: abs(x-y)
timeSB = np.arange(1,len(B)+1)
timeSA = np.arange(1,len(A)+1)
nu = self.v
_lambda = self.gamma
# Reference :
# Marteau, P.; F. (2009). "Time Warp Edit Distance with Stiffness Adjustment for Time Series Matching".
# IEEE Transactions on Pattern Analysis and Machine Intelligence. 31 (2): 306–318. arXiv:cs/0703033
# http://people.irisa.fr/Pierre-Francois.Marteau/
# Check if input arguments
if len(A) != len(timeSA):
print("The length of A is not equal length of timeSA")
return None, None
if len(B) != len(timeSB):
print("The length of B is not equal length of timeSB")
return None, None
if nu < 0:
print("nu is negative")
return None, None
# Add padding
A = np.array([0] + list(A))
timeSA = np.array([0] + list(timeSA))
B = np.array([0] + list(B))
timeSB = np.array([0] + list(timeSB))
n = len(A)
m = len(B)
# Dynamical programming
DP = np.zeros((n, m))
# Initialize DP Matrix and set first row and column to infinity
DP[0, :] = np.inf
DP[:, 0] = np.inf
DP[0, 0] = 0
# Compute minimal cost
for i in range(1, n):
for j in range(1, m):
# Calculate and save cost of various operations
C = np.ones((3, 1)) * np.inf
# Deletion in A
C[0] = (
DP[i - 1, j]
+ Dlp(A[i - 1], A[i])
+ nu * (timeSA[i] - timeSA[i - 1])
+ _lambda
)
# Deletion in B
C[1] = (
DP[i, j - 1]
+ Dlp(B[j - 1], B[j])
+ nu * (timeSB[j] - timeSB[j - 1])
+ _lambda
)
# Keep data points in both time series
C[2] = (
DP[i - 1, j - 1]
+ Dlp(A[i], B[j])
+ Dlp(A[i - 1], B[j - 1])
+ nu * (abs(timeSA[i] - timeSB[j]) + abs(timeSA[i - 1] - timeSB[j - 1]))
)
# Choose the operation with the minimal cost and update DP Matrix
DP[i, j] = np.min(C)
distance = DP[n - 1, m - 1]
self.M = DP
self.decision_scores_.append((start_index, distance))
return distance
# -*- coding: utf-8 -*-
"""Classes of feature mapping for model type B
"""
import numpy as np
# import matplotlib.pyplot as plt
# import random
# from arch import arch_model
import pandas as pd
import math
# import pmdarima as pm
# from pmdarima import model_selection
# import os
# import dis
# import statistics
# from sklearn import metrics
# import sklearn
from tsfresh import extract_features
from statsmodels.tsa.seasonal import seasonal_decompose
# import itertools
# import functools
import warnings
from builtins import range
# from collections import defaultdict
from numpy.linalg import LinAlgError
# from scipy.signal import cwt, find_peaks_cwt, ricker, welch
# from scipy.stats import linregress
# from statsmodels.tools.sm_exceptions import MissingDataError
with warnings.catch_warnings():
# Ignore warnings of the patsy package
warnings.simplefilter("ignore", DeprecationWarning)
from statsmodels.tsa.ar_model import AR
# from statsmodels.tsa.stattools import acf, adfuller, pacf
from hurst import compute_Hc
class Window:
""" The class for rolling window feature mapping.
The mapping converts the original timeseries X into a matrix.
The matrix consists of rows of sliding windows of original X.
"""
def __init__(self, window = 100):
self.window = window
self.detector = None
def convert(self, X):
n = self.window
X = pd.Series(X)
L = []
if n == 0:
df = X
else:
for i in range(n):
L.append(X.shift(i))
df = pd.concat(L, axis = 1)
df = df.iloc[n-1:]
return df
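`convert` stacks lagged copies of the series so that row t holds the window ending at t (current value first, older lags after). A quick check of the resulting shape and layout (values are illustrative):

```python
import numpy as np
import pandas as pd

def to_window_matrix(x, window):
    """Rows are sliding windows of x (current value first, lags after)."""
    s = pd.Series(x)
    cols = [s.shift(i) for i in range(window)]          # lag-0, lag-1, ...
    df = pd.concat(cols, axis=1).iloc[window - 1:]      # drop rows with NaNs
    return df.to_numpy()

M = to_window_matrix([0, 1, 2, 3, 4], window=3)
print(M.shape)  # (3, 3)
print(M[0])     # [2. 1. 0.] -> value at t=2 plus its two lags
```

Each row of the matrix is then treated as one sample by the downstream detector.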
class tf_Stat:
'''Statistic feature extraction using the tsfresh package.
It calculates 763 features in total, so it might be overly complicated for some models.
Recommended for methods like Isolation Forest, which randomly pick a feature
and then perform the classification. To use with other distance-based models like KNN,
LOF, CBLOF, etc., first pass a function that weights individual features so that
inconsequential features won't cloud the important ones (mean, variance, kurtosis, etc.).
'''
def __init__(self, window = 100, step = 25):
self.window = window
self.step = step
self.detector = None
def convert(self, X):
window = self.window
step = self.step
pos = math.ceil(window/2)
#step <= window
length = X.shape[0]
Xd = pd.DataFrame(X)
Xd.columns = pd.Index(['x'], dtype='object')
Xd['id'] = 1
Xd['time'] = Xd.index
test = np.array(extract_features(Xd.iloc[0+pos-math.ceil(window/2):0+pos + math.floor(window/2)], column_id="id", column_sort="time", column_kind=None, column_value=None).fillna(0))
M = np.zeros((length - window, test.shape[1]+1 ))
i = 0
while i + window <= M.shape[0]:
M[i:i+step, 0]= X[pos + i: pos + i + step]
vector = np.array(extract_features(Xd.iloc[i+pos-math.ceil(window/2):i+pos + math.floor(window/2)], column_id="id", column_sort="time", column_kind=None, column_value=None).fillna(0))
M[i:i+step, 1:] = vector
i+= step
num = M.shape[0]
if i < num:
M[i: num, 0]= X[pos + i: pos + num]
M[i: num, 1:] = np.array(extract_features(Xd.iloc[i+pos-math.ceil(window/2):], column_id="id", column_sort="time", column_kind=None, column_value=None).fillna(0))
return M
class Stat:
'''Statistic feature extraction.
Features include [mean, variance, skewness, kurtosis, autocorrelation, maximum,
minimum, entropy, seasonality, Hurst exponent, AR coef]
'''
def __init__(self, window = 100, data_step = 10, param = [{"coeff": 0, "k": 5}], lag = 1, freq = 720):
self.window = window
self.data_step = data_step
self.detector = None
self.param = param
self.lag = lag
self.freq =freq
if data_step > int(window/2):
raise ValueError('data_step shouldn\'t be greater than half of the window')
def convert(self, X):
freq = self.freq
n = self.window
data_step = self.data_step
X = pd.Series(X)
L = []
if n == 0:
df = X
raise ValueError('window length is set to zero')
else:
for i in range(n):
L.append(X.shift(i))
df = pd.concat(L, axis = 1)
df = df.iloc[n:]
df2 = pd.concat(L[:data_step], axis = 1)
df = df.reset_index()
#value
x0 = df2[math.ceil(n/2) : - math.floor(n/2)].reset_index()
#mean
x1 = (df.mean(axis=1))
#variance
x2 = df.var(axis=1)
#AR-coef
self.ar_function = lambda x: self.ar_coefficient(x)
x3 = df.apply(self.ar_function, axis =1, result_type='expand' )
#autocorrelation
self.auto_function = lambda x: self.autocorrelation(x)
x4 = df.apply(self.auto_function, axis =1, result_type='expand' )
#kurtosis
x5 = (df.kurtosis(axis=1))
#skewness
x6 = (df.skew(axis=1))
#maximum
x7 = (df.max(axis=1))
#minimum
x8 = (df.min(axis=1))
#entropy
self.entropy_function = lambda x: self.sample_entropy(x)
x9 = df.apply(self.entropy_function, axis =1, result_type='expand')
#seasonality
result = seasonal_decompose(X, model='additive', freq = freq, extrapolate_trend='freq')
#seasonal
x10 = pd.Series(np.array(result.seasonal[math.ceil(n/2) : - math.floor(n/2)]))
#trend
x11 = pd.Series(np.array(result.trend[math.ceil(n/2) : - math.floor(n/2)]))
#resid
x12 = pd.Series(np.array(result.resid[math.ceil(n/2) : - math.floor(n/2)]))
#Hurst exponent
self.hurst_function = lambda x: self.hurst_f(x)
x13 = df.apply(self.hurst_function, axis =1, result_type='expand')
L = [x0, x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12, x13]
M = pd.concat(L, axis = 1)
M = M.drop(columns=['index'])
return M
def ar_coefficient(self, x):
"""
This feature calculator fits the unconditional maximum likelihood
of an autoregressive AR(k) process.
The k parameter is the maximum lag of the process
.. math::
X_{t}=\\varphi_0 +\\sum _{{i=1}}^{k}\\varphi_{i}X_{{t-i}}+\\varepsilon_{t}
For each configuration in param, which should contain the maxlag "k", such an AR process is calculated. Then
the coefficients :math:`\\varphi_{i}` whose index :math:`i` is contained in "coeff" are returned.
:param x: the time series to calculate the feature of
:type x: numpy.ndarray
:param param: contains dictionaries {"coeff": x, "k": y} with x,y int
:type param: list
:return x: the different feature values
:return type: pandas.Series
"""
calculated_ar_params = {}
param = self.param
x_as_list = list(x)
res = {}
for parameter_combination in param:
k = parameter_combination["k"]
p = parameter_combination["coeff"]
column_name = "coeff_{}__k_{}".format(p, k)
if k not in calculated_ar_params:
try:
calculated_AR = AR(x_as_list)
calculated_ar_params[k] = calculated_AR.fit(maxlag=k, solver="mle").params
except (LinAlgError, ValueError):
calculated_ar_params[k] = [np.NaN] * k
mod = calculated_ar_params[k]
if p <= k:
try:
res[column_name] = mod[p]
except IndexError:
res[column_name] = 0
else:
res[column_name] = np.NaN
L = [(key, value) for key, value in res.items()]
L0 = []
for item in L:
L0.append(item[1])
return L0
def autocorrelation(self, x):
"""
Calculates the autocorrelation of the specified lag, according to the formula [1]
.. math::
\\frac{1}{(n-l)\\sigma^{2}} \\sum_{t=1}^{n-l}(X_{t}-\\mu )(X_{t+l}-\\mu)
where :math:`n` is the length of the time series :math:`X_i`, :math:`\\sigma^2` its variance and :math:`\\mu` its
mean. `l` denotes the lag.
.. rubric:: References
[1] https://en.wikipedia.org/wiki/Autocorrelation#Estimation
:param x: the time series to calculate the feature of
:type x: numpy.ndarray
:param lag: the lag
:type lag: int
:return: the value of this feature
:return type: float
"""
lag = self.lag
# This is important: If a series is passed, the product below is calculated
# based on the index, which corresponds to squaring the series.
if isinstance(x, pd.Series):
x = x.values
if len(x) < lag:
return np.nan
# Slice the relevant subseries based on the lag
y1 = x[:(len(x) - lag)]
y2 = x[lag:]
# Subtract the mean of the whole series x
x_mean = np.mean(x)
# The result is sometimes referred to as "covariation"
sum_product = np.sum((y1 - x_mean) * (y2 - x_mean))
# Return the normalized unbiased covariance
v = np.var(x)
if np.isclose(v, 0):
return np.NaN
else:
return sum_product / ((len(x) - lag) * v)
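The estimator above is the standard lag-l sample autocorrelation: the lagged covariance divided by (n - l) times the full-series variance. A condensed sketch plus a sanity check on an alternating series (the helper name is illustrative):

```python
import numpy as np

def acf_lag(x, lag):
    """Lag-`lag` sample autocorrelation, normalized by the full-series variance."""
    x = np.asarray(x, float)
    if len(x) < lag:
        return np.nan
    mu, var = x.mean(), x.var()
    if np.isclose(var, 0):
        return np.nan
    return np.sum((x[:-lag] - mu) * (x[lag:] - mu)) / ((len(x) - lag) * var)

x = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
print(acf_lag(x, 1))  # -1.0: an alternating series is perfectly anti-correlated at lag 1
```
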
def _into_subchunks(self, x, subchunk_length, every_n=1):
"""
Split the time series x into subwindows of length "subchunk_length", starting every "every_n".
For example, the input data [0, 1, 2, 3, 4, 5, 6] will be turned into the matrix
[[0, 1, 2],
[2, 3, 4],
[4, 5, 6]]
with the settings subchunk_length = 3 and every_n = 2
"""
len_x = len(x)
assert subchunk_length > 1
assert every_n > 0
# how often can we shift a window of size subchunk_length over the input?
num_shifts = (len_x - subchunk_length) // every_n + 1
shift_starts = every_n * np.arange(num_shifts)
indices = np.arange(subchunk_length)
indexer = np.expand_dims(indices, axis=0) + np.expand_dims(shift_starts, axis=1)
return np.asarray(x)[indexer]
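The indexer built above is a (num_shifts, subchunk_length) matrix of positions, so a single fancy-indexing step extracts every window at once instead of looping. The same trick in isolation:

```python
import numpy as np

def into_subchunks(x, subchunk_length, every_n=1):
    """Extract all sliding windows of x with one fancy-indexing step."""
    num_shifts = (len(x) - subchunk_length) // every_n + 1
    starts = every_n * np.arange(num_shifts)
    # broadcasting: row k holds starts[k] + [0, 1, ..., subchunk_length-1]
    idx = np.arange(subchunk_length)[None, :] + starts[:, None]
    return np.asarray(x)[idx]

print(into_subchunks([0, 1, 2, 3, 4, 5, 6], 3, every_n=2))
# [[0 1 2]
#  [2 3 4]
#  [4 5 6]]
```
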
def sample_entropy(self, x):
"""
Calculate and return sample entropy of x.
.. rubric:: References
| [1] http://en.wikipedia.org/wiki/Sample_Entropy
| [2] https://www.ncbi.nlm.nih.gov/pubmed/10843903?dopt=Abstract
:param x: the time series to calculate the feature of
:type x: numpy.ndarray
:return: the value of this feature
:return type: float
"""
x = np.array(x)
# if one of the values is NaN, we can not compute anything meaningful
if np.isnan(x).any():
return np.nan
m = 2 # common value for m, according to wikipedia...
tolerance = 0.2 * np.std(x) # 0.2 is a common value for r, according to wikipedia...
# Split time series and save all templates of length m
# Basically we turn [1, 2, 3, 4] into [1, 2], [2, 3], [3, 4]
xm = self._into_subchunks(x, m)
# Now calculate the maximum distance between each of those pairs
# np.abs(xmi - xm).max(axis=1)
# and check how many are below the tolerance.
# For speed reasons, we are not doing this in a nested for loop,
# but with numpy magic.
# Example:
# if x = [1, 2, 3]
# then xm = [[1, 2], [2, 3]]
# so we will subtract xm from [1, 2] => [[0, 0], [-1, -1]]
# and from [2, 3] => [[1, 1], [0, 0]]
# taking the abs and max gives us:
# [0, 1] and [1, 0]
# as the diagonal elements are always 0, we subtract 1.
B = np.sum([np.sum(np.abs(xmi - xm).max(axis=1) <= tolerance) - 1 for xmi in xm])
# Similar for computing A
xmp1 = self._into_subchunks(x, m + 1)
A = np.sum([np.sum(np.abs(xmi - xmp1).max(axis=1) <= tolerance) - 1 for xmi in xmp1])
# Return SampEn
return -np.log(A / B)
def hurst_f(self, x):
H,c, M = compute_Hc(x)
return [H, c]
from sklearn import metrics
import numpy as np
import math
# import matplotlib.pyplot as plt
class metricor:
def __init__(self, a = 1, probability = True, bias = 'flat', ):
self.a = a
self.probability = probability
self.bias = bias
def detect_model(self, model, label, contamination = 0.1, window = 100, is_A = False, is_threshold = True):
if is_threshold:
score = self.scale_threshold(model.decision_scores_, model._mu, model._sigma)
else:
score = self.scale_contamination(model.decision_scores_, contamination = contamination)
if is_A is False:
scoreX = np.zeros(len(score)+window)
scoreX[math.ceil(window/2): len(score)+window - math.floor(window/2)] = score
else:
scoreX = score
self.score_=scoreX
L = self.metric(label, scoreX)
return L
def labels_conv(self, preds):
'''return indices of predicted anomaly
'''
# p = np.zeros(len(preds))
index = np.where(preds >= 0.5)
return index[0]
def labels_conv_binary(self, preds):
'''return predicted label
'''
p = np.zeros(len(preds))
index = np.where(preds >= 0.5)
p[index[0]] = 1
return p
def w(self, AnomalyRange, p):
MyValue = 0
MaxValue = 0
start = AnomalyRange[0]
AnomalyLength = AnomalyRange[1] - AnomalyRange[0] + 1
for i in range(start, start +AnomalyLength):
bi = self.b(i, AnomalyLength)
MaxValue += bi
if i in p:
MyValue += bi
return MyValue/MaxValue
def Cardinality_factor(self, Anomolyrange, Prange):
score = 0
start = Anomolyrange[0]
end = Anomolyrange[1]
for i in Prange:
if i[0] >= start and i[0] <= end:
score +=1
elif start >= i[0] and start <= i[1]:
score += 1
elif end >= i[0] and end <= i[1]:
score += 1
elif start >= i[0] and end <= i[1]:
score += 1
if score == 0:
return 0
else:
return 1/score
def b(self, i, length):
bias = self.bias
if bias == 'flat':
return 1
elif bias == 'front-end bias':
return length - i + 1
elif bias == 'back-end bias':
return i
else:
if i <= length/2:
return i
else:
return length - i + 1
def scale_threshold(self, score, score_mu, score_sigma):
return (score >= (score_mu + 3*score_sigma)).astype(int)
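`scale_threshold` binarizes scores with the three-sigma rule: anything at least three standard deviations above the mean is flagged as anomalous. For example (the score values are illustrative):

```python
import numpy as np

scores = np.array([0.1, 0.2, 0.15, 0.18, 5.0])      # illustrative anomaly scores
mu, sigma = scores[:-1].mean(), scores[:-1].std()   # stats of the "normal" part
preds = (scores >= mu + 3 * sigma).astype(int)
print(preds)  # only the outlying last score is flagged
```
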
def metric_new(self, label, score, plot_ROC=False, alpha=0.2,coeff=3):
'''input:
Real labels and anomaly score in prediction
output:
AUC,
Precision,
Recall,
F-score,
Range-precision,
Range-recall,
Range-Fscore,
Precision@k,
k is chosen to be # of outliers in real labels
'''
if np.sum(label) == 0:
print('All labels are 0. Labels must contain ground truth values for calculating AUC score.')
return None
if np.isnan(score).any() or score is None:
print('Score must not be none.')
return None
#area under curve
auc = metrics.roc_auc_score(label, score)
# plot ROC curve
if plot_ROC:
fpr, tpr, thresholds = metrics.roc_curve(label, score)
# display = metrics.RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=auc)
# display.plot()
#precision, recall, F
preds = score > (np.mean(score)+coeff*np.std(score))
if np.sum(preds) == 0:
preds = score > (np.mean(score)+2*np.std(score))
if np.sum(preds) == 0:
preds = score > (np.mean(score)+1*np.std(score))
Precision, Recall, F, Support = metrics.precision_recall_fscore_support(label, preds, zero_division=0)
precision = Precision[1]
recall = Recall[1]
f = F[1]
#range anomaly
Rrecall, ExistenceReward, OverlapReward = self.range_recall_new(label, preds, alpha)
Rprecision = self.range_recall_new(preds, label, 0)[0]
if Rprecision + Rrecall==0:
Rf=0
else:
Rf = 2 * Rrecall * Rprecision / (Rprecision + Rrecall)
# top-k
k = int(np.sum(label))
threshold = np.percentile(score, 100 * (1-k/len(label)))
# precision_at_k = metrics.top_k_accuracy_score(label, score, k)
p_at_k = np.where(preds > threshold)[0]
TP_at_k = sum(label[p_at_k])
precision_at_k = TP_at_k/k
L = [auc, precision, recall, f, Rrecall, ExistenceReward, OverlapReward, Rprecision, Rf, precision_at_k]
if plot_ROC:
return L, fpr, tpr
return L
def metric_PR(self, label, score):
precision, recall, thresholds = metrics.precision_recall_curve(label, score)
# plt.figure()
# disp = metrics.PrecisionRecallDisplay(precision=precision, recall=recall)
# disp.plot()
AP = metrics.auc(recall, precision)
#AP = metrics.average_precision_score(label, score)
return precision, recall, AP
def range_recall_new(self, labels, preds, alpha):
p = np.where(preds == 1)[0] # positions of predicted label==1
range_pred = self.range_convers_new(preds)
range_label = self.range_convers_new(labels)
Nr = len(range_label) # total # of real anomaly segments
ExistenceReward = self.existence_reward(range_label, p)
OverlapReward = 0
for i in range_label:
OverlapReward += self.w(i, p) * self.Cardinality_factor(i, range_pred)
score = alpha * ExistenceReward + (1-alpha) * OverlapReward
if Nr != 0:
return score/Nr, ExistenceReward/Nr, OverlapReward/Nr
else:
return 0,0,0
def range_convers_new(self, label):
'''
input: array of binary values
output: list of ordered pairs [(a0,b0), (a1,b1), ...] giving the runs of 1s in the input
'''
L = []
i = 0
j = 0
while j < len(label):
# print(i)
while label[i] == 0:
i+=1
if i >= len(label):
break
j = i+1
# print('j'+str(j))
if j >= len(label):
if j==len(label):
L.append((i,j-1))
break
while label[j] != 0:
j+=1
if j >= len(label):
L.append((i,j-1))
break
if j >= len(label):
break
L.append((i, j-1))
i = j
return L
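The while-loops above walk the label array to emit inclusive (start, end) pairs for each run of 1s. An equivalent vectorized sketch (hypothetical helper) that pads and diffs instead:

```python
import numpy as np

def binary_to_ranges(label):
    """Inclusive (start, end) pairs of the runs of 1s in a 0/1 array."""
    label = np.asarray(label)
    padded = np.concatenate(([0], label, [0]))  # sentinels so edges are transitions
    diff = np.diff(padded)
    starts = np.where(diff == 1)[0]             # 0 -> 1 transitions
    ends = np.where(diff == -1)[0] - 1          # 1 -> 0 transitions, inclusive
    return list(zip(starts.tolist(), ends.tolist()))

print(binary_to_ranges([0, 1, 1, 0, 0, 1, 0, 1, 1, 1]))
# [(1, 2), (5, 5), (7, 9)]
```
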
def existence_reward(self, labels, preds):
'''
labels: list of ordered pairs
preds: predicted anomaly positions
'''
score = 0
for i in labels:
if np.sum(np.multiply(preds <= i[1], preds >= i[0])) > 0:
score += 1
return score
def num_nonzero_segments(self, x):
count=0
if x[0]>0:
count+=1
for i in range(1, len(x)):
if x[i]>0 and x[i-1]==0:
count+=1
return count
def extend_postive_range(self, x, window=5):
label = x.copy().astype(float)
L = self.range_convers_new(label) # index of non-zero segments
length = len(label)
for k in range(len(L)):
s = L[k][0]
e = L[k][1]
x1 = np.arange(e,min(e+window//2,length))
label[x1] += np.sqrt(1 - (x1-e)/(window))
x2 = np.arange(max(s-window//2,0),s)
label[x2] += np.sqrt(1 - (s-x2)/(window))
label = np.minimum(np.ones(length), label)
return label
def extend_postive_range_individual(self, x, percentage=0.2):
label = x.copy().astype(float)
L = self.range_convers_new(label) # index of non-zero segments
length = len(label)
for k in range(len(L)):
s = L[k][0]
e = L[k][1]
l0 = int((e-s+1)*percentage)
x1 = np.arange(e,min(e+l0,length))
label[x1] += np.sqrt(1 - (x1-e)/(2*l0))
x2 = np.arange(max(s-l0,0),s)
label[x2] += np.sqrt(1 - (s-x2)/(2*l0))
label = np.minimum(np.ones(length), label)
return label
def TPR_FPR_RangeAUC(self, labels, pred, P, L):
product = labels * pred
TP = np.sum(product)
# recall = min(TP/P,1)
P_new = (P+np.sum(labels))/2 # so TPR is neither large nor small
# P_new = np.sum(labels)
recall = min(TP/P_new,1)
# recall = TP/np.sum(labels)
# print('recall '+str(recall))
existence = 0
for seg in L:
if np.sum(product[seg[0]:(seg[1]+1)])>0:
existence += 1
existence_ratio = existence/len(L)
# print(existence_ratio)
# TPR_RangeAUC = np.sqrt(recall*existence_ratio)
# print(existence_ratio)
TPR_RangeAUC = recall*existence_ratio
FP = np.sum(pred) - TP
# TN = np.sum((1-pred) * (1-labels))
# FPR_RangeAUC = FP/(FP+TN)
N_new = len(labels) - P_new
FPR_RangeAUC = FP/N_new
Precision_RangeAUC = TP/np.sum(pred)
return TPR_RangeAUC, FPR_RangeAUC, Precision_RangeAUC
def RangeAUC(self, labels, score, window=0, percentage=0, plot_ROC=False, AUC_type='window'):
# AUC_type='window'/'percentage'
score_sorted = -np.sort(-score)
P = np.sum(labels)
# print(np.sum(labels))
if AUC_type=='window':
labels = self.extend_postive_range(labels, window=window)
else:
labels = self.extend_postive_range_individual(labels, percentage=percentage)
# print(np.sum(labels))
L = self.range_convers_new(labels)
TPR_list = [0]
FPR_list = [0]
Precision_list = [1]
for i in np.linspace(0, len(score)-1, 250).astype(int):
threshold = score_sorted[i]
# print('thre='+str(threshold))
pred = score>= threshold
TPR, FPR, Precision = self.TPR_FPR_RangeAUC(labels, pred, P,L)
TPR_list.append(TPR)
FPR_list.append(FPR)
Precision_list.append(Precision)
TPR_list.append(1)
FPR_list.append(1) # otherwise, range-AUC will stop earlier than (1,1)
tpr = np.array(TPR_list)
fpr = np.array(FPR_list)
prec = np.array(Precision_list)
width = fpr[1:] - fpr[:-1]
height = (tpr[1:] + tpr[:-1])/2
AUC_range = np.sum(width*height)
width_PR = tpr[1:-1] - tpr[:-2]
height_PR = (prec[1:] + prec[:-1])/2
AP_range = np.sum(width_PR*height_PR)
if plot_ROC:
return AUC_range, AP_range, fpr, tpr, prec
return AUC_range
# TPR_FPR_window
def RangeAUC_volume(self, labels_original, score, windowSize):
score_sorted = -np.sort(-score)
tpr_3d=[]
fpr_3d=[]
prec_3d=[]
auc_3d=[]
ap_3d=[]
window_3d = np.arange(0, windowSize+1, 1)
P = np.sum(labels_original)
for window in window_3d:
labels = self.extend_postive_range(labels_original, window)
# print(np.sum(labels))
L = self.range_convers_new(labels)
TPR_list = [0]
FPR_list = [0]
Precision_list = [1]
for i in np.linspace(0, len(score)-1, 250).astype(int):
threshold = score_sorted[i]
# print('thre='+str(threshold))
pred = score>= threshold
TPR, FPR, Precision = self.TPR_FPR_RangeAUC(labels, pred, P,L)
TPR_list.append(TPR)
FPR_list.append(FPR)
Precision_list.append(Precision)
TPR_list.append(1)
FPR_list.append(1) # otherwise, range-AUC will stop earlier than (1,1)
tpr = np.array(TPR_list)
fpr = np.array(FPR_list)
prec = np.array(Precision_list)
tpr_3d.append(tpr)
fpr_3d.append(fpr)
prec_3d.append(prec)
width = fpr[1:] - fpr[:-1]
height = (tpr[1:] + tpr[:-1])/2
AUC_range = np.sum(width*height)
auc_3d.append(AUC_range)
width_PR = tpr[1:-1] - tpr[:-2]
height_PR = (prec[1:] + prec[:-1])/2
AP_range = np.sum(width_PR*height_PR)
ap_3d.append(AP_range)
return tpr_3d, fpr_3d, prec_3d, window_3d, sum(auc_3d)/len(window_3d), sum(ap_3d)/len(window_3d)
def generate_curve(label,score,slidingWindow):
tpr_3d, fpr_3d, prec_3d, window_3d, avg_auc_3d, avg_ap_3d = metricor().RangeAUC_volume(labels_original=label, score=score, windowSize=1*slidingWindow)
X = np.array(tpr_3d).reshape(1,-1).ravel()
X_ap = np.array(tpr_3d)[:,:-1].reshape(1,-1).ravel()
Y = np.array(fpr_3d).reshape(1,-1).ravel()
W = np.array(prec_3d).reshape(1,-1).ravel()
Z = np.repeat(window_3d, len(tpr_3d[0]))
Z_ap = np.repeat(window_3d, len(tpr_3d[0])-1)
return Y, Z, X, X_ap, W, Z_ap,avg_auc_3d, avg_ap_3d
from statsmodels.tsa.stattools import acf
from scipy.signal import argrelextrema
import numpy as np
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
# determine sliding window (period) based on ACF
def find_length(data):
if len(data.shape)>1:
return 0
data = data[:min(20000, len(data))]
base = 3
auto_corr = acf(data, nlags=400, fft=True)[base:]
local_max = argrelextrema(auto_corr, np.greater)[0]
try:
max_local_max = np.argmax([auto_corr[lcm] for lcm in local_max])
if local_max[max_local_max]<3 or local_max[max_local_max]>300:
return 125
return local_max[max_local_max]+base
except:
return 125
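`find_length` estimates the period as the lag of the strongest local maximum of the ACF, falling back to 125 when no plausible peak exists. A numpy-only sketch of the same idea, without the statsmodels/scipy helpers (names are illustrative):

```python
import numpy as np

def estimate_period(x, max_lag=400):
    """Pick the lag of the largest autocorrelation peak as the period."""
    x = np.asarray(x, float) - np.mean(x)
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]  # lags 0..n-1
    acf = acf / acf[0]                                  # normalize so acf[0] == 1
    acf = acf[:max_lag]
    # local maxima: strictly larger than both neighbours
    peaks = [k for k in range(1, len(acf) - 1) if acf[k - 1] < acf[k] > acf[k + 1]]
    return max(peaks, key=lambda k: acf[k]) if peaks else None

x = np.sin(2 * np.pi * np.arange(1000) / 50)  # true period 50
print(estimate_period(x))  # 50
```

The first ACF peak is usually the strongest because finite-length overlap shrinks later multiples of the period.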
import torch
import torch.nn as nn
from einops import rearrange,repeat
from math import sqrt
from .attn import FullAttention
from .embed import DataEmbedding
from transformers import LlamaConfig, LlamaModel, LlamaTokenizer, GPT2Config, GPT2Model, GPT2Tokenizer
from einops import reduce
class Linear_Encoder(nn.Module):
def __init__(self, patch_size,patch_num, d_model,channel):
super(Linear_Encoder, self).__init__()
self.channel=channel
self.patch_encoder=nn.Linear(d_model,d_model)
self.point_encoder=nn.Linear(d_model,d_model)
def forward(self, x_patch_size, x_patch_num, x_ori, attn_mask=None):
series=self.patch_encoder(x_patch_size)
prior=self.point_encoder(x_patch_num)
series=reduce(series,"(b c) m n -> b m n","mean", c=self.channel)
prior=reduce(prior,"(b c) m n -> b m n","mean", c=self.channel)
return series, prior
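The `reduce(..., "(b c) m n -> b m n", "mean", c=self.channel)` calls above average the channel copies that were stacked into the batch dimension. The numpy equivalent of that einops pattern, with illustrative shapes:

```python
import numpy as np

b, c, m, n = 2, 3, 4, 5
x = np.random.rand(b * c, m, n)           # stacked as (batch*channel, m, n)
# einops: reduce(x, "(b c) m n -> b m n", "mean", c=c)
y = x.reshape(b, c, m, n).mean(axis=1)    # average the c channel copies per batch
print(y.shape)  # (2, 4, 5)
```

Since `(b c)` groups consecutive channel entries per batch item, `y[0]` equals the mean of `x[0:c]`.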
class LLMDetector(nn.Module):
def __init__(self, win_size, enc_in, n_heads=1, d_model=256,llm_model="gpt2", patch_size=5, channel=1, dropout=0.0, output_attention=True,description=None):
super(LLMDetector, self).__init__()
self.output_attention = output_attention
self.patch_size = patch_size
self.channel = channel
self.win_size = win_size
self.patch_num=win_size//patch_size
self.dataset_description=description
self.channel_fusion = FullAttention(channel, n_heads, attention_dropout=dropout)
self.embedding_patch_size=DataEmbedding(self.patch_size, d_model, dropout)
self.embedding_patch_num=DataEmbedding(self.win_size//self.patch_size, d_model, dropout)
self.embedding_window_size = DataEmbedding(enc_in, d_model, dropout)
self.encoder=Linear_Encoder(patch_size,patch_size,d_model,channel)
if llm_model=="gpt2":
self.max_token = 1024
self.gpt2_config = GPT2Config.from_pretrained("LLM/openai-community/gpt2")
self.gpt2_config.output_attentions = True
self.gpt2_config.output_hidden_states = True
try:
self.llm_model = GPT2Model.from_pretrained(
"LLM/openai-community/gpt2",
local_files_only=True,
config=self.gpt2_config,
)
except EnvironmentError: # downloads the model from HF if not already done
print("Local model files not found. Attempting to download...")
self.llm_model = GPT2Model.from_pretrained(
"LLM/openai-community/gpt2",
local_files_only=False,
config=self.gpt2_config,
)
try:
self.tokenizer = GPT2Tokenizer.from_pretrained(
"LLM/openai-community/gpt2",
local_files_only=True
)
except EnvironmentError: # downloads the tokenizer from HF if not already done
print("Local tokenizer files not found. Atempting to download them..")
self.tokenizer = GPT2Tokenizer.from_pretrained(
"LLM/openai-community/gpt2",
local_files_only=False
)
elif llm_model=="llama2":
self.max_token=2048
self.llama2_config=LlamaConfig.from_pretrained("LLM/Llama/llama2")
self.llama2_config.output_attentions = True
self.llama2_config.output_hidden_states = True
# self.llm_dim=1024
try:
self.llm_model=LlamaModel.from_pretrained(
"LLM/Llama/llama2",
local_files_only=True,
config=self.llama2_config
)
except EnvironmentError:
print("Local model files not found. Attempting to download...")
self.llm_model=LlamaModel.from_pretrained(
"LLM/Llama/llama2",
local_files_only=False,
config=self.llama2_config
)
try:
self.tokenizer=LlamaTokenizer.from_pretrained(
"LLM/Llama/llama2",
local_files_only=True
)
except EnvironmentError:
print("Local tokenizer files not found. Attempting to download...")
self.tokenizer=LlamaTokenizer.from_pretrained(
"LLM/Llama/llama2",
local_files_only=False
)
if self.tokenizer.eos_token:
self.tokenizer.pad_token = self.tokenizer.eos_token
else:
pad_token = "[PAD]"
self.tokenizer.add_special_tokens({"pad_token": pad_token})
self.tokenizer.pad_token = pad_token
for param in self.llm_model.parameters():
param.requires_grad = False
self.word_embeddings = self.llm_model.get_input_embeddings().weight
llm_dim = self.word_embeddings.shape[-1]
self.vocab_size = self.word_embeddings.shape[0]
self.num_tokens = 1000
self.mapping_layer = nn.Linear(self.vocab_size, self.num_tokens)
self.alignment_layer = FeatureAlignmentLayer(d_model, n_heads, d_llm=llm_dim)
def forward(self, x):
B, L, M = x.shape #Batch win_size channel
means = x.mean(1, keepdim=True).detach()
x = x - means
stdev = torch.sqrt(
torch.var(x, dim=1, keepdim=True, unbiased=False) + 1e-5)
x /= stdev
x_ori = self.embedding_window_size(x)
x=self.channel_fusion(x)
x_patch, x_point = x, x
x_patch = rearrange(x_patch, "b l m -> b m l") #Batch channel win_size
x_point = rearrange(x_point, "b l m -> b m l") #Batch channel win_size
x_patch = rearrange(x_patch, "b m (n p) -> (b m) n p", p = self.patch_size)
x_patch = self.embedding_patch_size(x_patch)
x_point = rearrange(x_point, "b m (p n) -> (b m) p n", p = self.patch_size)
x_point = self.embedding_patch_num(x_point)
source_embeddings = self.mapping_layer(self.word_embeddings.permute(1, 0)).permute(1, 0)
patch, point = self.encoder(x_patch, x_point, x_ori)
patch=self.alignment_layer(patch,source_embeddings,source_embeddings)
point=self.alignment_layer(point,source_embeddings,source_embeddings)
prompt1=[
f"Background: We are working with a time series dataset that includes a {self.channel}-dimensional time series with a length of {self.win_size}. "
f"Dataset Description: {self.dataset_description} "
f"Goal: Your objective is to use the patch-wise and point-wise representations of the time series to generate more generalized and robust representations. "
f"Notes: The patch-wise representation captures the global information and is generated based on dependencies between patches. "
f"The point-wise representation captures the local information and is generated based on dependencies between points in a patch. "
f"As anomalies are rare and normal points share latent patterns, the two representations should be similar for non-anomalous time points and more different for anomalous time points. "
f"Constraints: Generate different representations for suspected abnormal points and similar representations for normal points. "
f"Patch-wise representation:"
]
prompt2=["Point-wise representation:"]
prompt1=self.tokenizer(prompt1, return_tensors="pt", padding=True, truncation=True,
max_length=2048).input_ids
prompt2=self.tokenizer(prompt2, return_tensors="pt", padding=True, truncation=True,
max_length=2048).input_ids
prompt1_embeddings=self.llm_model.get_input_embeddings()(prompt1.to(x.device))
prompt2_embeddings=self.llm_model.get_input_embeddings()(prompt2.to(x.device))
prompt1_embeddings = repeat(prompt1_embeddings, "b n d -> (b m) n d", m=B)
prompt2_embeddings = repeat(prompt2_embeddings, "b n d -> (b m) n d", m=B)
llm_in=torch.cat([prompt1_embeddings[:,:(prompt1_embeddings.shape[1]-1),:],patch,prompt2_embeddings[:,1:-1,:],point],dim=1)
llm_out=self.llm_model(inputs_embeds=llm_in).last_hidden_state
rep_1=llm_out[:,-self.win_size:,:]
rep_2=llm_out[:,-2*self.win_size:-self.win_size,:]
return rep_1, rep_2
class FeatureAlignmentLayer(nn.Module):
def __init__(self, d_model, n_heads, d_keys=None, d_llm=None, attention_dropout=0.1):
super(FeatureAlignmentLayer, self).__init__()
d_keys = d_keys or (d_model // n_heads)
self.query_projection = nn.Linear(d_model, d_keys * n_heads)
self.key_projection = nn.Linear(d_llm, d_keys * n_heads)
self.value_projection = nn.Linear(d_llm, d_keys * n_heads)
self.out_projection = nn.Linear(d_keys * n_heads, d_llm)
self.n_heads = n_heads
self.dropout = nn.Dropout(attention_dropout)
def forward(self, target_embedding, source_embedding, value_embedding):
B, L, _ = target_embedding.shape
S, _ = source_embedding.shape
H = self.n_heads
target_embedding = self.query_projection(target_embedding).view(B, L, H, -1)
source_embedding = self.key_projection(source_embedding).view(S, H, -1)
value_embedding = self.value_projection(value_embedding).view(S, H, -1)
out = self.alignment(target_embedding, source_embedding, value_embedding)
out = out.reshape(B, L, -1)
return self.out_projection(out)
def alignment(self, target_embedding, source_embedding, value_embedding):
B, L, H, E = target_embedding.shape
scale = 1. / sqrt(E)
scores = torch.einsum("blhe,she->bhls", target_embedding, source_embedding)
A = self.dropout(torch.softmax(scale * scores, dim=-1))
output = torch.einsum("bhls,she->blhe", A, value_embedding)
return output
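As a shape sanity check, the cross-attention core of `FeatureAlignmentLayer` can be exercised standalone: a (B, L, H, E) query tensor attends over S prototype embeddings that are shared across the batch and returns a (B, L, H, E) output. The dimensions below are illustrative, not taken from the paper.

```python
# Standalone sketch of the cross-attention used by FeatureAlignmentLayer:
# queries come from the time-series branch, keys/values from the mapped
# word-embedding prototypes (shared across the batch).
import torch
from math import sqrt

B, L, S, H, E = 2, 60, 100, 8, 32   # batch, seq len, prototypes, heads, head dim
target = torch.randn(B, L, H, E)    # projected queries
source = torch.randn(S, H, E)       # projected keys
value = torch.randn(S, H, E)        # projected values

scores = torch.einsum("blhe,she->bhls", target, source) / sqrt(E)
A = torch.softmax(scores, dim=-1)               # (B, H, L, S) attention weights
out = torch.einsum("bhls,she->blhe", A, value)  # (B, L, H, E)
print(out.reshape(B, L, -1).shape)              # torch.Size([2, 60, 256])
```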
import torch
import torch.nn as nn
from math import sqrt
class FullAttention(nn.Module):
def __init__(self, d_model, n_heads, d_keys=None, attention_dropout=0.1):
super(FullAttention, self).__init__()
d_keys = d_keys or (d_model // n_heads)
self.query_projection = nn.Linear(d_model, d_keys * n_heads)
self.key_projection = nn.Linear(d_model, d_keys * n_heads)
self.value_projection = nn.Linear(d_model, d_keys * n_heads)
# self.out_projection = nn.Linear(d_keys * n_heads, d_llm)
self.n_heads = n_heads
self.dropout = nn.Dropout(attention_dropout)
def forward(self, x):
B, L, D= x.shape
H = self.n_heads
query = self.query_projection(x).view(B, L, H, -1)
key = self.key_projection(x).view(B, L, H, -1)
value = self.value_projection(x).view(B, L, H, -1)
E= value.shape[-1]
scale = 1. / sqrt(E)
scores = torch.einsum("blhe,bshe->bhls", query, key)
A = self.dropout(torch.softmax(scale * scores, dim=-1))
out = torch.einsum("bhls,bshe->blhe", A, value)
out = out.reshape(B, L, -1)
return out
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils import weight_norm
import math
class PositionalEmbedding(nn.Module):
def __init__(self, d_model, max_len=5000):
super(PositionalEmbedding, self).__init__()
# Compute the positional encodings once in log space.
pe = torch.zeros(max_len, d_model).float()
pe.requires_grad = False
position = torch.arange(0, max_len).float().unsqueeze(1)
div_term = (torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)).exp()
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
pe = pe.unsqueeze(0)
self.register_buffer('pe', pe)
def forward(self, x):
return self.pe[:, :x.size(1)]
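A quick check of the sinusoidal encoding built above: even channels carry sine, odd channels carry cosine, so position 0 encodes to alternating zeros and ones and every entry stays within [-1, 1]. The small `d_model`/`max_len` values are illustrative only.

```python
# Reproduce the PositionalEmbedding table for a tiny configuration.
import torch, math

d_model, max_len = 8, 16
pe = torch.zeros(max_len, d_model)
position = torch.arange(0, max_len).float().unsqueeze(1)
div_term = (torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)).exp()
pe[:, 0::2] = torch.sin(position * div_term)  # even channels: sin
pe[:, 1::2] = torch.cos(position * div_term)  # odd channels: cos
print(pe[0].tolist())  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```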
class TokenEmbedding(nn.Module):
def __init__(self, c_in, d_model):
super(TokenEmbedding, self).__init__()
padding = 1 if torch.__version__ >= '1.5.0' else 2
self.tokenConv = nn.Conv1d(in_channels=c_in, out_channels=d_model,
kernel_size=3, padding=padding, padding_mode='circular', bias=False)
for m in self.modules():
if isinstance(m, nn.Conv1d):
nn.init.kaiming_normal_(m.weight, mode='fan_in', nonlinearity='leaky_relu')
def forward(self, x):
x = self.tokenConv(x.permute(0, 2, 1)).transpose(1, 2)
return x
class DataEmbedding(nn.Module):
def __init__(self, c_in, d_model, dropout=0.05):
super(DataEmbedding, self).__init__()
self.value_embedding = TokenEmbedding(c_in=c_in, d_model=d_model)
self.position_embedding = PositionalEmbedding(d_model=d_model)
self.dropout = nn.Dropout(p=dropout)
def forward(self, x):
x = self.value_embedding(x) + self.position_embedding(x)
return self.dropout(x)
import torch
class TriangularCausalMask():
def __init__(self, B, L, device="cpu"):
mask_shape = [B, 1, L, L]
with torch.no_grad():
self._mask = torch.triu(torch.ones(mask_shape, dtype=torch.bool), diagonal=1).to(device)
@property
def mask(self):
return self._mask
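The mask built by `TriangularCausalMask` can be visualized on a tiny example: entries strictly above the diagonal are `True` and get masked out, so each position attends only to itself and earlier positions.

```python
# Visualize the strictly-upper-triangular causal mask for L = 4.
import torch

L = 4
mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
print(mask.int().tolist())
# [[0, 1, 1, 1], [0, 0, 1, 1], [0, 0, 0, 1], [0, 0, 0, 0]]
```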
# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: linux-64
_libgcc_mutex=0.1=conda_forge
_openmp_mutex=4.5=2_kmp_llvm
accelerate=0.29.2=pyhd8ed1ab_0
aiohttp=3.9.3=py39h5eee18b_0
aiosignal=1.2.0=pyhd3eb1b0_0
annotated-types=0.6.0=py39h06a4308_0
aom=3.6.0=h6a678d5_0
arch-py=6.3.0=py39h44dd56e_0
async-timeout=4.0.3=py39h06a4308_0
attrs=23.1.0=py39h06a4308_0
axial-positional-embedding=0.2.1=pyhd8ed1ab_0
black=24.8.0=py39h06a4308_0
blas=1.0=mkl
blosc=1.21.3=h6a678d5_0
bottleneck=1.3.7=py39ha9d4c09_0
brotli=1.0.9=h5eee18b_7
brotli-bin=1.0.9=h5eee18b_7
brotli-python=1.0.9=py39h6a678d5_7
brunsli=0.1=h2531618_0
bzip2=1.0.8=h5eee18b_5
c-ares=1.19.1=h5eee18b_0
ca-certificates=2024.11.26=h06a4308_0
calflops=0.3.2=pypi_0
certifi=2024.8.30=py39h06a4308_0
cfitsio=3.470=h5893167_7
charls=2.2.0=h2531618_0
charset-normalizer=2.0.4=pyhd3eb1b0_0
click=8.1.7=py39h06a4308_0
colorama=0.4.6=pyhd8ed1ab_0
contourpy=1.2.0=py39hdb19cb5_0
cuda-cudart=12.1.105=0
cuda-cupti=12.1.105=0
cuda-libraries=12.1.0=0
cuda-nvrtc=12.1.105=0
cuda-nvtx=12.1.105=0
cuda-opencl=12.4.127=0
cuda-runtime=12.1.0=0
cuda-version=11.8=hcce14f8_3
cudatoolkit=11.8.0=h6a678d5_0
cudnn=8.9.7.29=hbc23b4c_3
cycler=0.11.0=pyhd3eb1b0_0
cyrus-sasl=2.1.28=h52b45da_1
dav1d=1.2.1=h5eee18b_0
dbus=1.13.18=hb2f20db_0
decorator=5.1.1=pyhd8ed1ab_0
deepspeed=0.14.0=cuda118py39h87c11cd_200
einops=0.7.0=pyhd8ed1ab_1
et_xmlfile=1.1.0=py39h06a4308_0
expat=2.6.2=h6a678d5_0
fastdtw=0.3.4=py39h8003fee_4
ffmpeg=4.3=hf484d3e_0
filelock=3.13.1=py39h06a4308_0
fontconfig=2.14.1=h55d465d_3
fonttools=4.51.0=py39h5eee18b_0
freetype=2.12.1=h4a9f257_0
frozenlist=1.4.0=py39h5eee18b_0
fsspec=2024.3.1=pyhca7485f_0
giflib=5.2.2=h5eee18b_0
glib=2.78.4=h6a678d5_0
glib-tools=2.78.4=h6a678d5_0
gmp=6.2.1=h295c915_3
gmpy2=2.1.2=py39heeb90bb_0
gnutls=3.6.15=he1e5248_0
gst-plugins-base=1.14.1=h6a678d5_1
gstreamer=1.14.1=h5eee18b_1
hjson-py=3.1.0=pyhd8ed1ab_0
huggingface_hub=0.23.4=pyhd8ed1ab_0
icu=73.1=h6a678d5_0
idna=3.4=py39h06a4308_0
imagecodecs=2023.1.23=py39hc4b7b5f_0
imageio=2.33.1=py39h06a4308_0
importlib-metadata=7.0.1=py39h06a4308_0
importlib_resources=6.1.1=py39h06a4308_1
intel-openmp=2023.1.0=hdb19cb5_46306
jinja2=3.1.3=py39h06a4308_0
joblib=1.2.0=py39h06a4308_0
jpeg=9e=h5eee18b_1
jxrlib=1.1=h7b6447c_2
kiwisolver=1.4.4=py39h6a678d5_0
krb5=1.20.1=h143b758_1
lame=3.100=h7b6447c_0
lazy_loader=0.4=pyhd8ed1ab_0
lcms2=2.12=h3be6417_0
ld_impl_linux-64=2.38=h1181459_1
lerc=3.0=h295c915_0
libabseil=20240116.2=cxx17_h6a678d5_0
libaec=1.0.4=he6710b0_1
libaio=0.3.113=h5eee18b_0
libavif=0.11.1=h5eee18b_0
libblas=3.9.0=20_linux64_mkl
libbrotlicommon=1.0.9=h5eee18b_7
libbrotlidec=1.0.9=h5eee18b_7
libbrotlienc=1.0.9=h5eee18b_7
libcblas=3.9.0=20_linux64_mkl
libclang=14.0.6=default_hc6dbbc7_1
libclang13=14.0.6=default_he11475f_1
libcublas=12.1.0.26=0
libcufft=11.0.2.4=0
libcufile=1.9.1.3=0
libcups=2.4.2=h2d74bed_1
libcurand=10.3.5.147=0
libcurl=8.9.1=h251f7ec_0
libcusolver=11.4.4.55=0
libcusparse=12.0.2.55=0
libdeflate=1.17=h5eee18b_1
libedit=3.1.20230828=h5eee18b_0
libev=4.33=h7f8727e_1
libffi=3.4.4=h6a678d5_0
libgcc=14.2.0=h77fa898_1
libgcc-ng=14.2.0=h69a702a_1
libgfortran-ng=11.2.0=h00389a5_1
libgfortran5=11.2.0=h1234567_1
libglib=2.78.4=hdc74915_0
libgomp=14.2.0=h77fa898_1
libiconv=1.16=h7f8727e_2
libidn2=2.3.4=h5eee18b_0
libjpeg-turbo=2.0.0=h9bf148f_0
liblapack=3.9.0=20_linux64_mkl
libllvm14=14.0.6=hdb19cb5_3
libmagma=2.7.2=h09b5827_2
libmagma_sparse=2.7.2=h09b5827_3
libnghttp2=1.57.0=h2d74bed_0
libnpp=12.0.2.50=0
libnvjitlink=12.1.105=0
libnvjpeg=12.1.1.14=0
libpng=1.6.39=h5eee18b_0
libpq=12.17=hdbd6064_0
libprotobuf=4.25.3=he621ea3_0
libssh2=1.11.1=h251f7ec_0
libstdcxx-ng=13.2.0=hc0a3c3a_6
libtasn1=4.19.0=h5eee18b_0
libtiff=4.5.1=h6a678d5_0
libtorch=2.1.2=cuda118_h9f31b6d_304
libunistring=0.9.10=h27cfd23_0
libuuid=1.41.5=h5eee18b_0
libuv=1.48.0=hd590300_0
libwebp-base=1.3.2=h5eee18b_0
libxcb=1.15=h7f8727e_0
libxkbcommon=1.0.1=h097e994_2
libxml2=2.13.1=hfdd30dd_2
libxslt=1.1.41=h097e994_0
libzlib=1.2.13=hd590300_5
libzopfli=1.0.3=he6710b0_0
llvm-openmp=18.1.5=ha31de31_0
local-attention=1.9.3=pyhd8ed1ab_0
lxml=5.2.1=py39h57af460_1
lz4-c=1.9.4=h6a678d5_0
magma=2.7.2=h4aca40b_3
markupsafe=2.1.3=py39h5eee18b_0
matplotlib=3.8.4=py39h06a4308_0
matplotlib-base=3.8.4=py39h1128e8f_0
mkl=2023.2.0=h84fe81f_50496
mkl-service=2.4.0=py39h5eee18b_1
mkl_fft=1.3.8=py39h5eee18b_0
mkl_random=1.2.4=py39hdb19cb5_0
mne-base=1.7.1=pyha770c72_0
mpc=1.1.0=h10f8cd9_1
mpfr=4.0.2=hb69a4c5_1
mpmath=1.3.0=py39h06a4308_0
multidict=6.0.4=py39h5eee18b_0
mypy_extensions=1.0.0=py39h06a4308_0
mysql=5.7.24=h721c034_2
nccl=2.21.5.1=h6103f9b_0
ncurses=6.4=h6a678d5_0
nettle=3.7.3=hbbd107a_1
networkx=3.1=py39h06a4308_0
numexpr=2.8.7=py39h85018f9_0
numpy=1.26.4=py39h5f9d8c6_0
numpy-base=1.26.4=py39hb5e798b_0
openh264=2.1.1=h4ff587b_0
openjpeg=2.4.0=h3ad879b_0
openpyxl=3.1.5=py39h5eee18b_0
openssl=3.4.0=hb9d3cd8_0
packaging=24.0=pyhd8ed1ab_0
pandas=2.2.1=py39h6a678d5_0
pathspec=0.10.3=py39h06a4308_0
patool=1.15.0=pyhd8ed1ab_0
patsy=0.5.3=py39h06a4308_0
pcre2=10.42=hebb0a14_0
performer-pytorch=1.1.4=pyhd8ed1ab_0
pillow=10.2.0=py39h5eee18b_0
pip=24.2=py39h06a4308_0
platformdirs=4.2.2=pyhd8ed1ab_0
ply=3.11=py39h06a4308_0
pooch=1.8.2=pyhd8ed1ab_0
protobuf=4.25.3=py39h12ddb61_0
psutil=5.9.8=py39hd1e30aa_0
py-cpuinfo=9.0.0=py39h06a4308_0
pydantic=2.5.3=py39h06a4308_0
pydantic-core=2.14.6=py39hb02cf49_0
pyg=2.5.2=py39_torch_2.1.0_cu121
pynvml=11.5.0=pyhd8ed1ab_0
pyparsing=3.0.9=py39h06a4308_0
pyqt=5.15.10=py39h6a678d5_0
pyqt5-sip=12.13.0=py39h5eee18b_0
pysocks=1.7.1=py39h06a4308_0
python=3.9.19=h955ad1f_0
python-dateutil=2.8.2=pyhd3eb1b0_0
python-tzdata=2023.3=pyhd3eb1b0_0
python_abi=3.9=2_cp39
pytorch=2.1.2=cuda118_py39hd44be3b_304
pytorch-cuda=12.1=ha16c6d3_5
pytorch-mutex=1.0=cuda
pytz=2023.3.post1=py39h06a4308_0
pyyaml=6.0.1=py39h5eee18b_0
qt-main=5.15.2=h53bd1ea_10
readline=8.2=h5eee18b_0
regex=2023.10.3=py39h5eee18b_0
requests=2.31.0=py39h06a4308_1
safetensors=0.4.3=py39h9fdd4d6_0
scikit-base=0.8.2=pyhecae5ae_0
scikit-image=0.24.0=py39h1128e8f_0
scikit-learn=1.3.0=py39h1128e8f_1
scipy=1.12.0=py39h5f9d8c6_0
seaborn=0.13.2=py39h06a4308_0
selenium=3.141.0=py39h27cfd23_1000
sentencepiece=0.1.99=py39hdb19cb5_0
setuptools=68.2.2=py39h06a4308_0
sip=6.7.12=py39h6a678d5_0
six=1.16.0=pyhd3eb1b0_1
sktime=0.31.0=py39hf3d152e_0
sleef=3.5.1=h9b69904_2
snappy=1.2.1=h6a678d5_0
sqlite=3.41.2=h5eee18b_0
statsmodels=0.14.0=py39ha9d4c09_0
sympy=1.12=py39h06a4308_0
tbb=2021.8.0=hdb19cb5_0
threadpoolctl=2.2.0=pyh0d69192_0
tifffile=2023.4.12=py39h06a4308_0
tk=8.6.12=h1ccaba5_0
tokenizers=0.15.1=py39h22610ee_0
toml=0.10.2=pyhd3eb1b0_0
tomli=2.0.1=py39h06a4308_0
torchaudio=2.1.2=py39_cu121
torchinfo=1.8.0=pyhd8ed1ab_0
torchtriton=2.2.0=py39
torchvision=0.15.2=cuda118py39h196c800_0
tornado=6.3.3=py39h5eee18b_0
tqdm=4.66.2=pyhd8ed1ab_0
transformers=4.37.2=py39h06a4308_0
typing-extensions=4.9.0=py39h06a4308_1
typing_extensions=4.9.0=py39h06a4308_1
tzdata=2024a=h04d1e81_0
unicodedata2=15.1.0=py39h5eee18b_0
urllib3=2.0.3=py39h06a4308_0
wheel=0.41.2=py39h06a4308_0
xz=5.4.6=h5eee18b_0
yaml=0.2.5=h7b6447c_0
yarl=1.9.3=py39h5eee18b_0
zfp=1.0.0=h6a678d5_0
zipp=3.17.0=py39h06a4308_0
zlib=1.2.13=hd590300_5
zstd=1.5.6=ha6fb4c9_0
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import os
import time
from utils.utils import *
from model.LLMDetector import LLMDetector
from data_factory.data_loader import get_loader_segment
from einops import rearrange
from metrics.metrics import *
import warnings
from accelerate import Accelerator
from accelerate import DistributedDataParallelKwargs
warnings.filterwarnings('ignore')
from accelerate.utils import set_seed
def my_loss_function(p, q, input, dis='cos'):
loss=nn.MSELoss(reduction='none')
res=loss(p,q)
# print(torch.sum(res,dim=-1))
input_sim=torch.zeros((input.shape[0],input.shape[0]))
input=rearrange(input,'b l c -> b (l c)')
#--------instance-wise-----------------------
if dis=='cos':
input_sim=torch.einsum('ij,jk->ik',input,input.transpose(0,1))/torch.einsum('i,j->ij',torch.norm(input,dim=-1),torch.norm(input,dim=-1))
input_sim=torch.clamp(input_sim,min=0)
input_logits=torch.tril(input_sim,diagonal=-1)[:, :-1]
input_logits+=torch.triu(input_sim,diagonal=1)[:, 1:]
p_ins=rearrange(p,'b l c -> b (l c)')
q_ins=rearrange(q,'b l c -> b (l c)')
p_sim=torch.matmul(p_ins,p_ins.transpose(0,1))/torch.einsum('i,j->ij',torch.norm(p_ins,dim=-1),torch.norm(p_ins,dim=-1))
q_sim=torch.matmul(q_ins,q_ins.transpose(0,1))/torch.einsum('i,j->ij',torch.norm(q_ins,dim=-1),torch.norm(q_ins,dim=-1))
p_logits=torch.tril(p_sim,diagonal=-1)[:, :-1]
p_logits+=torch.triu(p_sim,diagonal=1)[:, 1:]
q_logits=torch.tril(q_sim,diagonal=-1)[:, :-1]
q_logits+=torch.triu(q_sim,diagonal=1)[:, 1:]
p_logits=-F.log_softmax(p_logits,dim=-1)
q_logits=-F.log_softmax(q_logits,dim=-1)
ins_loss=torch.mean(p_logits*input_logits)
ins_loss+=torch.mean(q_logits*input_logits)
return torch.mean(res,dim=-1),ins_loss
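The `tril`/`triu` indexing in `my_loss_function` implements a diagonal-removal trick: for an (n, n) similarity matrix it keeps, per row, only the n-1 off-diagonal entries (so each sample is never contrasted with itself). A small sketch on a toy matrix:

```python
# Diagonal-removal trick: combine the strictly-lower triangle (last column
# dropped) with the strictly-upper triangle (first column dropped) to get an
# (n, n-1) matrix of off-diagonal similarities per row.
import torch

sim = torch.arange(16.0).reshape(4, 4)
logits = torch.tril(sim, diagonal=-1)[:, :-1] + torch.triu(sim, diagonal=1)[:, 1:]
print(logits.shape)       # torch.Size([4, 3])
print(logits[1].tolist()) # [4.0, 6.0, 7.0] -- row 1 without sim[1, 1]
```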
class Solver(object):
DEFAULTS = {}
def __init__(self, config):
kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
self.__dict__.update(Solver.DEFAULTS, **config)
set_seed(self.seed)
self.accelerator = Accelerator(kwargs_handlers=[kwargs])
if self.mode == 'train':
self.accelerator.print("\n\n")
self.accelerator.print('================ Hyperparameters ===============')
for k, v in sorted(config.items()):
self.accelerator.print('%s: %s' % (str(k), str(v)))
self.accelerator.print('==================== Train ===================')
# print("Loading dataset")
self.train_loader = get_loader_segment(self.index, 'datasets/'+self.data_path, batch_size=self.batch_size, win_size=self.win_size, mode='train', dataset=self.dataset,step=self.step )
# print("Train dataset loaded.")
self.vali_loader = get_loader_segment(self.index, 'datasets/'+self.data_path, batch_size=self.batch_size, win_size=self.win_size, mode='val', dataset=self.dataset,step=self.step)
self.test_loader = get_loader_segment(self.index, 'datasets/'+self.data_path, batch_size=self.batch_size, win_size=self.win_size, mode='test', dataset=self.dataset,step=self.step)
self.device = torch.device(f"cuda:{self.gpu}" if torch.cuda.is_available() else "cpu")
self.accelerator.print('Data loaded')
try:
with open(os.path.join('datasets/'+self.data_path, 'description.txt'), 'r') as f:
self.des = f.read()
except OSError: # fall back to an empty description if the file is missing
self.des=''
self.build_model()
def build_model(self):
self.model = LLMDetector(win_size=self.win_size, enc_in=self.input_c, n_heads=self.n_heads, llm_model=self.llm_model,\
d_model=self.d_model, patch_size=self.patch_size, channel=self.input_c,description=self.des)
self.optimizer = torch.optim.Adam(self.model.parameters(), lr=self.lr)
def vali(self, vali_loader):
self.model.eval()
loss_1 = []
vali_loader=self.accelerator.prepare(vali_loader)
for i, (input_data, _) in enumerate(vali_loader):
input=input_data.float()
series, prior = self.model(input)
series_loss,series_ins_loss = my_loss_function(series, prior,input)
prior_loss,prior_ins_loss = my_loss_function(prior, series,input)
loss=torch.mean(prior_loss + series_loss)-(series_ins_loss+prior_ins_loss)
all_losses=self.accelerator.gather(loss)
batch_loss=0
if all_losses.shape:
for l in all_losses:
batch_loss+=l.item()
loss_1.append(batch_loss)
else:
loss_1.append(all_losses.item())
return np.average(loss_1)
def train(self,setting):
time_now = time.time()
path = 'checkpoints/' + setting + '/'
if not os.path.exists(path):
os.makedirs(path)
early_stopping = EarlyStopping(accelerator=self.accelerator,patience=self.patience, verbose=True, dataset_name=self.data_path)
train_steps = len(self.train_loader)
self.model, self.optimizer, self.train_loader=self.accelerator.prepare(self.model,self.optimizer,self.train_loader)
for epoch in range(self.num_epochs):
iter_count = 0
epoch_time = time.time()
self.model.train()
for i, (input_data, labels) in enumerate(self.train_loader):
self.optimizer.zero_grad()
iter_count += 1
input=input_data.float()
series, prior = self.model(input)
series_loss,series_ins_loss = my_loss_function(series, prior.detach(),input)
prior_loss, prior_ins_loss= my_loss_function(prior, series.detach(),input)
if self.use_insloss and self.use_reploss:
loss = (1-self.alpha)*torch.mean(prior_loss + series_loss)-self.alpha*(series_ins_loss+prior_ins_loss)
elif self.use_reploss:
loss=torch.mean(prior_loss + series_loss)
elif self.use_insloss:
loss=-(series_ins_loss+prior_ins_loss)
else:
raise ValueError('Loss is undefined: enable at least one of use_reploss or use_insloss.')
if (i + 1) % 100 == 0:
speed = (time.time() - time_now) / iter_count
left_time = speed * ((self.num_epochs - epoch) * train_steps - i)
self.accelerator.print('\tspeed: {:.4f}s/iter; left time: {:.4f}s'.format(speed, left_time))
iter_count = 0
time_now = time.time()
self.accelerator.backward(loss)
self.optimizer.step()
with torch.no_grad():
vali_loss= self.vali(self.test_loader)
self.accelerator.print(
"Epoch: {0}, Cost time: {1:.3f}s ".format(
epoch + 1, time.time() - epoch_time))
early_stopping(vali_loss, self.model, path)
if early_stopping.early_stop:
break
adjust_learning_rate(self.optimizer, epoch + 1, self.lr)
def test(self,setting):
with torch.no_grad():
path = 'checkpoints/' + setting + '/'
self.build_model()
self.model.load_state_dict(
torch.load(
path+self.data_path+'_model.pth'))
self.model.eval()
test_labels = []
attens_energy = []
self.test_loader,self.model=self.accelerator.prepare(self.test_loader,self.model)
for i, (input_data, labels) in enumerate(self.test_loader):
input = input_data.float()
series, prior = self.model(input)
series_loss,_=my_loss_function(series, prior, input)
prior_loss,_=my_loss_function(prior, series, input)
loss = -series_loss - prior_loss
mean=torch.mean(loss,dim=-1).unsqueeze(1)
std=torch.std(loss,dim=-1).unsqueeze(1)
metric=(loss-mean)/std
attens_energy.append(metric)
test_labels.append(labels)
attens_energy=self.accelerator.gather(attens_energy)
test_labels=self.accelerator.gather(test_labels)
attens_energy=torch.stack(attens_energy).detach().cpu().numpy()
test_labels=torch.stack(test_labels).detach().cpu().numpy()
attens_energy = np.concatenate(attens_energy, axis=0).reshape(-1)
test_labels = np.concatenate(test_labels, axis=0).reshape(-1)
test_energy = np.array(attens_energy)
test_labels = np.array(test_labels)
if self.accelerator.is_main_process:
np.save(path + self.dataset+ 'test_energy.npy',test_energy)
np.save(path+self.dataset+'test_labels.npy',test_labels)
def test1(self,setting):
path= 'checkpoints/' + setting + '/'
test_energy=np.load(path + self.dataset+'test_energy.npy', )
print(test_energy.shape,self.train_loader.dataset.test.shape)
test_labels=np.load(path+ self.dataset+'test_labels.npy')
thresh = np.percentile(test_energy, (1 - self.anormly_ratio)*100)
print('Anomaly Ratio:', self.anormly_ratio, (1 - self.anormly_ratio)*100)
print("Threshold :", thresh)
pred = (test_energy > thresh).astype(int)
gt = test_labels.astype(int)
anomaly_state = False
for i in range(len(gt)):
if gt[i] == 1 and pred[i] == 1 and not anomaly_state:
anomaly_state = True
for j in range(i, 0, -1):
if gt[j] == 0:
break
else:
if pred[j] == 0:
pred[j] = 1
for j in range(i, len(gt)):
if gt[j] == 0:
break
else:
if pred[j] == 0:
pred[j] = 1
elif gt[i] == 0:
anomaly_state = False
if anomaly_state:
pred[i] = 1
pred = np.array(pred)
gt = np.array(gt)
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(gt, pred)
precision, recall, f_score, support = precision_recall_fscore_support(gt, pred, average='binary')
print("Accuracy:{:0.4f}, Precision:{:0.4f}, Recall:{:0.4f}, F-score:{:0.4f} ".format(accuracy, precision, recall, f_score))
result_path='result/'+setting+'_result.txt'
with open(result_path, 'a') as f:
f.write("Thresh:{:0.4f}, Anomaly Ratio:{:0.4f}. ".format(thresh,self.anormly_ratio))
f.write("Dataset:{}, Accuracy:{:0.4f}, Precision:{:0.4f}, Recall:{:0.4f}, F-score:{:0.4f}.".format(self.dataset, accuracy, precision, recall, f_score))
f.write('\n')
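The loop in `test1` applies the standard point-adjustment protocol: once any timestamp inside a ground-truth anomaly segment is predicted anomalous, the whole segment is counted as detected before precision/recall are computed. A toy illustration (standalone, not tied to the `Solver`):

```python
# Point adjustment on a toy sequence: segment [1, 3] contains one predicted
# hit, so the whole segment is marked; segment [6, 7] has no hit and stays 0.
import numpy as np

gt = np.array([0, 1, 1, 1, 0, 0, 1, 1, 0])
pred = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0])

adjusted = pred.copy()
in_segment = False
for i in range(len(gt)):
    if gt[i] == 1 and pred[i] == 1 and not in_segment:
        in_segment = True
        j = i
        while j >= 0 and gt[j] == 1:   # expand the hit backwards
            adjusted[j] = 1
            j -= 1
        j = i
        while j < len(gt) and gt[j] == 1:  # and forwards
            adjusted[j] = 1
            j += 1
    elif gt[i] == 0:
        in_segment = False

print(adjusted.tolist())  # [0, 1, 1, 1, 0, 0, 0, 0, 0]
```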
class EarlyStopping:
def __init__(self, accelerator=None, patience=3, dataset_name='',verbose=False, delta=0, save_mode=True):
self.accelerator = accelerator
self.patience = patience
self.dataset=dataset_name
self.verbose = verbose
self.counter = 0
self.best_score = None
self.early_stop = False
self.val_loss_min = np.Inf
self.delta = delta
self.save_mode = save_mode
def __call__(self, val_loss, model, path):
score = -val_loss
if self.best_score is None:
self.best_score = score
if self.save_mode:
self.save_checkpoint(val_loss, model, path)
elif score < self.best_score + self.delta:
self.counter += 1
if self.accelerator is None:
print(f'EarlyStopping counter: {self.counter} out of {self.patience}')
else:
self.accelerator.print(f'EarlyStopping counter: {self.counter} out of {self.patience}')
if self.counter >= self.patience:
self.early_stop = True
else:
self.best_score = score
if self.save_mode:
self.save_checkpoint(val_loss, model, path)
self.counter = 0
def save_checkpoint(self, val_loss, model, path):
if self.verbose:
if self.accelerator is not None:
self.accelerator.print(
f'Validation loss decreased ({self.val_loss_min:.6f} --> {val_loss:.6f}). Saving model ...')
else:
print(
f'Validation loss decreased ({self.val_loss_min:.6f} --> {val_loss:.6f}). Saving model ...')
if self.accelerator is not None:
model = self.accelerator.unwrap_model(model)
torch.save(model.state_dict(), path + '/' + self.dataset+'_model.pth')
else:
torch.save(model.state_dict(), path + '/' + self.dataset+'_model.pth')
self.val_loss_min = val_loss
def adjust_learning_rate(optimizer, epoch, lr_):
lr_adjust = {epoch: lr_ * (0.5 ** ((epoch - 1) // 1))}
if epoch in lr_adjust.keys():
lr = lr_adjust[epoch]
for param_group in optimizer.param_groups:
param_group['lr'] = lr
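Since `(epoch - 1) // 1` is just `epoch - 1`, `adjust_learning_rate` halves the learning rate after every epoch. A quick sketch of the resulting schedule (base rate is illustrative):

```python
# The schedule above halves the base learning rate each epoch:
# epoch 1 -> lr, epoch 2 -> lr/2, epoch 3 -> lr/4, ...
base_lr = 1e-4
lrs = [base_lr * (0.5 ** ((epoch - 1) // 1)) for epoch in range(1, 5)]
print(lrs)  # [0.0001, 5e-05, 2.5e-05, 1.25e-05]
```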
# Code referenced from https://gist.github.com/gyglim/1f8dfb1b5c82627ae3efcfbbadb9f514
import tensorflow as tf
import numpy as np
import scipy.misc
try:
from StringIO import StringIO # Python 2.7
except ImportError:
from io import BytesIO # Python 3.5+
class Logger(object):
def __init__(self, log_dir):
"""Create a summary writer logging to log_dir."""
self.writer = tf.summary.FileWriter(log_dir)
def scalar_summary(self, tag, value, step):
"""Log a scalar variable."""
summary = tf.Summary(value=[tf.Summary.Value(tag=tag, simple_value=value)])
self.writer.add_summary(summary, step)
def image_summary(self, tag, images, step):
"""Log a list of images."""
img_summaries = []
for i, img in enumerate(images):
# Write the image to a string
try:
s = StringIO()
except:
s = BytesIO()
scipy.misc.toimage(img).save(s, format="png")
# Create an Image object
img_sum = tf.Summary.Image(encoded_image_string=s.getvalue(),
height=img.shape[0],
width=img.shape[1])
# Create a Summary value
img_summaries.append(tf.Summary.Value(tag='%s/%d' % (tag, i), image=img_sum))
# Create and write Summary
summary = tf.Summary(value=img_summaries)
self.writer.add_summary(summary, step)
def histo_summary(self, tag, values, step, bins=1000):
"""Log a histogram of the tensor of values."""
# Create a histogram using numpy
counts, bin_edges = np.histogram(values, bins=bins)
# Fill the fields of the histogram proto
hist = tf.HistogramProto()
hist.min = float(np.min(values))
hist.max = float(np.max(values))
hist.num = int(np.prod(values.shape))
hist.sum = float(np.sum(values))
hist.sum_squares = float(np.sum(values ** 2))
# Drop the start of the first bin
bin_edges = bin_edges[1:]
# Add bin edges and counts
for edge in bin_edges:
hist.bucket_limit.append(edge)
for c in counts:
hist.bucket.append(c)
# Create and write Summary
summary = tf.Summary(value=[tf.Summary.Value(tag=tag, histo=hist)])
self.writer.add_summary(summary, step)
self.writer.flush()
import torch
import numpy as np
class EarlyStopping:
def __init__(self, accelerator=None, patience=7, verbose=False, delta=0, save_mode=True):
self.accelerator = accelerator
self.patience = patience
self.verbose = verbose
self.counter = 0
self.best_score = None
self.early_stop = False
self.val_loss_min = np.Inf
self.delta = delta
self.save_mode = save_mode
def __call__(self, val_loss, model, path):
score = -val_loss
if self.best_score is None:
self.best_score = score
if self.save_mode:
self.save_checkpoint(val_loss, model, path)
elif score < self.best_score + self.delta:
self.counter += 1
if self.accelerator is None:
print(f'EarlyStopping counter: {self.counter} out of {self.patience}')
else:
self.accelerator.print(f'EarlyStopping counter: {self.counter} out of {self.patience}')
if self.counter >= self.patience:
self.early_stop = True
else:
self.best_score = score
if self.save_mode:
self.save_checkpoint(val_loss, model, path)
self.counter = 0
def save_checkpoint(self, val_loss, model, path):
if self.verbose:
if self.accelerator is not None:
self.accelerator.print(
f'Validation loss decreased ({self.val_loss_min:.6f} --> {val_loss:.6f}). Saving model ...')
else:
print(
f'Validation loss decreased ({self.val_loss_min:.6f} --> {val_loss:.6f}). Saving model ...')
if self.accelerator is not None:
model = self.accelerator.unwrap_model(model)
torch.save(model.state_dict(), path + '/' + 'model.pth')
else:
torch.save(model.state_dict(), path + '/' + 'model.pth')
self.val_loss_min = val_loss
def adjust_learning_rate(optimizer, epoch, lr_):
lr_adjust = {epoch: lr_ * (0.5 ** ((epoch - 1) // 1))}
if epoch in lr_adjust.keys():
lr = lr_adjust[epoch]
for param_group in optimizer.param_groups:
param_group['lr'] = lr
def reset_learning_rate(optimizer, lr):
for param_group in optimizer.param_groups:
param_group['lr'] = lr
def print_args(args):
print("\033[1m" + "Basic Config" + "\033[0m")
print(f' {"Task Name:":<20}{"Anomaly Detection":<20}{"Is Training:":<20}{args.is_training:<20}')
print(f' {"Model:":<20}{"LLMDetector":<20}{"Seed:":<20}{args.seed:<20}')
print()
print("\033[1m" + "Data Loader" + "\033[0m")
print(f' {"Data:":<20}{args.data:<20}{"Root Path:":<20}{args.root_path:<20}')
# print(f' {"Data Path:":<20}{args.data_path:<20}{"Features:":<20}{args.features:<20}')
# print(f' {"Target:":<20}{args.target:<20}{"Freq:":<20}{args.freq:<20}')
print(f' {"Checkpoints:":<20}{args.checkpoints:<20}')
print()
# if args.task_name in ['long_term_forecast', 'short_term_forecast']:
# print("\033[1m" + "Forecasting Task" + "\033[0m")
# print(f' {"Seq Len:":<20}{args.seq_len:<20}{"Label Len:":<20}{args.label_len:<20}')
# print(f' {"Pred Len:":<20}{args.pred_len:<20}{"Seasonal Patterns:":<20}{args.seasonal_patterns:<20}')
# print(f' {"Inverse:":<20}{args.inverse:<20}')
# print()
#
# if args.task_name == 'imputation':
# print("\033[1m" + "Imputation Task" + "\033[0m")
# print(f' {"Mask Rate:":<20}{args.mask_rate:<20}')
# print()
# if args.task_name == 'anomaly_detection':
# print("\033[1m" + "Anomaly Detection Task" + "\033[0m")
# print(f' {"Anomaly Ratio:":<20}{args.anomaly_ratio:<20}')
# print()
print("\033[1m" + "Model Parameters" + "\033[0m")
print(f' {"Gnn:":<20}{args.gnn:<20}{"Anomaly Ratio:":<20}{args.anomaly_ratio:<20}')
# print(f' {"Top k:":<20}{args.top_k:<20}{"Num Kernels:":<20}{args.num_kernels:<20}')
print(f' {"Patch_len:":<20}{args.patch_len:<20}{"Stride:":<20}{args.stride:<20}')
# print(f' {"Enc In:":<20}{args.enc_in:<20}{"Dec In:":<20}{args.dec_in:<20}')
print(f' {"n heads:":<20}{args.n_heads:<20}{"d model:":<20}{args.d_model:<20}')
# print(f' {"C Out:":<20}{args.c_out:<20}{"e layers:":<20}{args.e_layers:<20}')
# print(f' {"d layers:":<20}{args.d_layers:<20}{"d FF:":<20}{args.d_ff:<20}')
# print(f' {"Moving Avg:":<20}{args.moving_avg:<20}{"Factor:":<20}{args.factor:<20}')
# print(f' {"Distil:":<20}{args.distil:<20}{"Dropout:":<20}{args.dropout:<20}')
# print(f' {"Embed:":<20}{args.embed:<20}{"Activation:":<20}{args.activation:<20}')
print()
print("\033[1m" + "Run Parameters" + "\033[0m")
print(f' {"Num Workers:":<20}{args.num_workers:<20}{"Itr:":<20}{args.itr:<20}')
print(f' {"Train Epochs:":<20}{args.train_epochs:<20}{"Batch Size:":<20}{args.batch_size:<20}')
print(f' {"Patience:":<20}{args.patience:<20}{"Learning Rate:":<20}{args.lr:<20}')
# print(f' {"Des:":<20}{args.des:<20}{"Loss:":<20}{args.loss:<20}')
# print(f' {"Lradj:":<20}{args.lradj:<20}{"Use Amp:":<20}{args.use_amp:<20}')
print()
print("\033[1m" + "GPU" + "\033[0m")
print(f' {"Use GPU:":<20}{args.use_gpu:<20}{"GPU:":<20}{args.gpu:<20}')
print(f' {"Use Multi GPU:":<20}{args.use_multi_gpu:<20}{"Devices:":<20}{args.devices:<20}')
print()
# print("\033[1m" + "De-stationary Projector Params" + "\033[0m")
# p_hidden_dims_str = ', '.join(map(str, args.p_hidden_dims))
# print(f' {"P Hidden Dims:":<20}{p_hidden_dims_str:<20}{"P Hidden Layers:":<20}{args.p_hidden_layers:<20}')
print()
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
import numpy as np
def to_var(x, volatile=False):
if torch.cuda.is_available():
x = x.cuda()
return Variable(x, volatile=volatile)
def mkdir(directory):
if not os.path.exists(directory):
os.makedirs(directory)