.. _misc_large_datasets: Running SecuML on Large Datasets ================================ The following tips allow to run SecuML on large datasets. Storing Features as Sparse Matrices ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ SecuML supports sparse feature matrices for some types of experiments. CSC, CSR and LIL `scipy sparse formats `_ are supported (see :ref:`Data`). :ref:`Clustering ` and :ref:`Projection ` do not support sparse matrices. :ref:`Features Analysis ` supports sparse matrices and the CSC format should be preferred for maximum efficiency. Regarding :ref:`DIADEM ` and :ref:`ILAB `, it depends on the selected model class. SecuML relies on scikit-learn learning algorithms and some of them do not support sparse matrices. You must then refer to the scikit-learn documentation to check whether a given model class can be trained from sparse features and to know the best suited sparse format for maximum efficency. Proving the Features' Types ^^^^^^^^^^^^^^^^^^^^^^^^^^^ The types of the features can be provided in a :ref:`description file ` to load the dataset more quickly. When the features' types are provided, SecuML does not need to infer them. Reducing the Number of Parallel Jobs ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Various experiments can be parallelized with the ``--n-jobs`` argument. Reducing the number of jobs allows to decrease the memory usage. Streaming Validation ^^^^^^^^^^^^^^^^^^^^ Detection models trained with :ref:`DIADEM ` can be tested on a validation dataset in streaming thanks to the arguments ``--validation-mode ValidationDatasets --validation-datasets --streaming``. This way the validation instances are not loaded into memory at once which allows to process bigger datasets. .. note:: Scipy sparse matrices cannot be processed in streaming. Selecting an Appropriate Optimization Algorithm ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Logistic regression can be trained with various optimization algorithms (``liblinear``, ``lbfgs``, ``sag``, and ``saga``). By default, SecuML trains logistic regression models with ``liblinear`` which suits small datasets. ``sag`` and ``saga`` are more suitable for large datasets.