Features Analysis

This module provides an overview of input datasets. It displays the boxplot, the histogram and the distribution of the values of each feature. If annotations are provided (with -a GROUND_TRUTH or -a <partial_annotations>.csv) the analyses are carried out for each label (malicious and benign) independently and the discriminating power of each feature is assessed with various indicators (chi2, f_classif, and mutual information).

Features analysis should be executed before running more complex algorithms such as clustering or training a classification model. Indeed, this module can help detect bugs in the feature extraction process (e.g. a feature defined as a ratio that does not have all its values between 0 and 1). Besides, it allows to identify useless features those variance is null on a given dataset.

Usage: SecuML_features_analysis <project> <dataset>.
For more information about the available options:
SecuML_features_analysis <project> <dataset> -h.

Graphical User Interface

_images/main2.png

The graphical user interface displays the boxplot, the histogram and the distribution of the values of each features.

The features are listed with their ids (see Data). The panel at the bottom left displays the name and the description of the feature if provided in a description file.

The features can be sorted alphabetically, according to their discriminating power (if annotations are provided) or according to their variance. Features with a null variance on the input dataset can be set apart with the null_variance sorting criterion.