.. include:: images.rst

XGBoost with cWB
-----------------------

| **Library** - XGBoost integrated in cWB: `GIT repository `__
| **XGBoost** - version 1.7.6: `GIT repository `__

**References**

- `Optimization of model independent gravitational wave search for binary black hole mergers using machine learning `__
- `Search for binary black hole mergers in the third observing run of Advanced LIGO-Virgo using coherent WaveBurst enhanced with machine learning `__
- `Search for gravitational-wave bursts in the third Advanced LIGO-Virgo run with coherent WaveBurst enhanced by Machine Learning `__

**What is eXtreme Gradient Boosting (XGBoost)?**

+---------------------------------------+------------------------------+
| `Introduction <#introduction>`__:     | eXtreme Gradient Boosting    |
+---------------------------------------+------------------------------+
| `Why XGBoost? <#why-xgboost>`__:      | Veto Method Limitations      |
+---------------------------------------+------------------------------+

**Enhancing cWB with XGBoost**

+------------------------------------------------------------------+--------------------------------------------------------------+
| `Method <#method>`__:                                            | Overall idea for the integration of XGBoost into cWB         |
+------------------------------------------------------------------+--------------------------------------------------------------+
| `cWB Summary Statistics <#cwb-summary-statistics>`__:            | List of input features (cWB summary statistics) for XGBoost  |
+------------------------------------------------------------------+--------------------------------------------------------------+
| `Reduced Detection Statistic <#reduced-detection-statistic>`__:  | XGBoost Ranking Statistic                                    |
+------------------------------------------------------------------+--------------------------------------------------------------+

**XGBoost Model Implementation**

+-----------------------------------------+--------------------------------------------------------------------------------------+
| `Data Splitting <#data-splitting>`__:   | Setting up cWB working directories, data splitting for XGBoost training and testing |
+-----------------------------------------+--------------------------------------------------------------------------------------+
| `Tuning <#tuning>`__:                   | Finding the optimal set of XGBoost hyper-parameters                                 |
+-----------------------------------------+--------------------------------------------------------------------------------------+
| `Training <#training>`__:               | Creating and storing the XGBoost model                                              |
+-----------------------------------------+--------------------------------------------------------------------------------------+
| `Testing <#testing>`__:                 | Predicting and estimating the significance using the stored XGBoost model           |
+-----------------------------------------+--------------------------------------------------------------------------------------+
| `Report <#report>`__:                   | Creating the final report page and additional plotting options                      |
+-----------------------------------------+--------------------------------------------------------------------------------------+

XGBoost - eXtreme Gradient Boosting
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Introduction
^^^^^^^^^^^^^^^^

| Our goal is to improve the sensitivity of the cWB searches by using Machine Learning (ML) algorithms. Traditional ML algorithms, e.g. support vector machines, nearest neighbor algorithms, and decision-tree based algorithms, are typically applicable when the data is pre-processed and the dimensionality of the data is limited. Deep learning methods, on the other hand, are generally suitable for processing raw, high-dimensional data.
| `XGBoost `__, a.k.a. eXtreme Gradient Boosting, is an ensemble-learning, decision-tree based, supervised ML algorithm. In XGBoost, instead of using a single decision tree to classify events, an ensemble of decision trees is generated. A decision tree is used as the base learner, and subsequent learners (trees) are formed based on the residual errors obtained after each iteration (boosting). The misclassified instances from each tree (the residuals) are weighted and fed into the subsequent tree, as shown in the figure below.

+--------------------+
| |image224|         |
+--------------------+

* **XGBoost flow chart** for building an ensemble of trees. For a given dataset X, XGBoost builds subsequent trees by learning the residual of the previous tree as the input for the next tree [`image credit `__].

|
| The output of a decision tree is given by the *leaf score* (estimated by the log-odds output), which helps in determining the category of a given observation. In the XGBoost classifier, the leaf scores from all the decision trees in the ensemble are added together to get a margin score. The final output :math:`P_{XGB}` is computed by taking the sigmoid of the margin score, where a value close to zero denotes a noise-like event and a value close to one denotes a signal-like event.

Why XGBoost?
^^^^^^^^^^^^^^^^^^^

| GW detector data contains non-Gaussian noise artifacts known as glitches; some of these noise events are reconstructed by the cWB pipeline and pollute the data used for analysis. In the standard cWB analysis, a series of a priori defined veto thresholds is applied to a list of cWB summary statistics to target and remove these glitches. This approach, henceforth known as the *veto method*, improves the significance of candidate GW events by removing noise events and reducing excess background.

**Veto method and its limitations**

| The veto method classifies reconstructed events into one of two categories: signal-like events and noise-like events. Events that fall into the noise-like category are removed from the analysis. While this method works well, it inevitably discards borderline GW events which do not pass the veto thresholds, and at the same time leaves the pipeline vulnerable to high SNR glitches which do pass the vetoes. Moreover, designing vetoes manually in the multidimensional space of the summary statistics is challenging, and requires re-tuning of the veto thresholds for each detector network configuration and each observing run.
| Binary classification is a standard problem in the ML literature, with many prominent ML algorithms based on the decision-tree structure.

**XGBoost Classifier**

| We find that XGBoost is the most suitable choice of ML algorithm for the cWB classification problem. Moreover, XGBoost is computationally efficient: the model training and testing procedures are completed within minutes to tens of minutes using one CPU core or a GPU.
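
As a concrete illustration of the classifier described above, the snippet below trains a toy binary XGBoost model and recovers :math:`P_{XGB}` as the sigmoid of the margin score. It is a minimal sketch using the public xgboost Python API with random placeholder data; the feature matrix and hyper-parameter values are not the cWB configuration (which is described in the following sections):

.. code-block:: python

   # Minimal sketch of a binary signal/noise XGBoost classifier; the data and
   # hyper-parameter values are placeholders, not the cWB configuration.
   import numpy as np
   import xgboost as xgb

   rng = np.random.default_rng(0)
   X = rng.normal(size=(1000, 14))       # toy stand-in for 14 summary statistics
   y = rng.integers(0, 2, size=1000)     # 1 = signal-like, 0 = noise-like

   model = xgb.XGBClassifier(
       objective="binary:logistic",      # cost function of the XGBoost classifier
       tree_method="hist",               # histogram-based tree construction
       n_estimators=100,
   )
   model.fit(X, y)

   margin = model.predict(X, output_margin=True)   # summed leaf scores
   p_xgb = model.predict_proba(X)[:, 1]            # P_XGB = sigmoid(margin)
   assert np.allclose(p_xgb, 1.0 / (1.0 + np.exp(-margin)), atol=1e-5)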
Enhancing cWB with XGBoost
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Method
^^^^^^^^^

| XGBoost improves the sensitivity of the cWB searches by replacing the veto method. In order to construct the XGBoost model, we carefully select a subset of the summary statistics estimated by cWB as the input features for the ML algorithm. This is done by examining the correlations among the summary statistics, using the Spearman correlation coefficient, and picking the summary statistics that are least correlated with each other. We use the accumulated background data and generated signal (simulation) events to train and test our supervised XGBoost algorithm.
| The XGBoost hyper-parameter values are optimized specifically to prevent overfitting by tuning over six standard XGBoost hyper-parameters using a grid search with 10-fold cross-validation, with respect to the precision-recall area under the curve (AUC PR) metric. A separate ML model is trained for each search configuration and each observing run.

.. For training, we select 100 yr of background data per data chunk. For every 100 yr of background data we select approximately 1000 simulation events.

Once trained, the ML model predictions are directly incorporated into the standard cWB search's detection statistic :math:`\eta_0` to obtain the **reduced detection statistic** :math:`\eta_r`.

| The list of cWB summary statistics used as the input features for the XGBoost model is described in `cWB Summary Statistics <#cwb-summary-statistics>`__. Subsequent testing showed that adding supplementary statistics contributed redundant information and did not markedly improve classification, while further feature pruning degraded classification performance.

Reduced Detection Statistic
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

| The standard cWB detection statistic :math:`\eta_0` was updated to be more sensitive to the :math:`\chi^2` factor as follows:

.. math::

   \eta_{0} = \sqrt{\frac{E_{c}}{1+\chi^2(\text{max}(1, \chi^2)-1)}}

| where :math:`E_c` is the coherent energy of the event and :math:`\chi^2 \equiv E_n / N_{df}` is the noise energy per degree of freedom, measuring the quality of the waveform reconstruction.

**Post-O3 approach**

| The ML model predictions were incorporated as a penalty factor multiplying the standard detection statistic :math:`\eta_0` to produce the reduced detection statistic :math:`\eta_r`. The XGBoost output :math:`P_{XGB} \in [0, 1]` was first corrected to suppress short-duration noise events (*blips*), one-cycle glitches common in GW detector data, and then transformed via a monotonic mapping into the penalty factor :math:`W_{XGB}` (defined in `Mishra et al. 2021 `__). The resulting reduced detection statistic was:

.. math::

   \eta_{r} = \eta_{0} \cdot W_{XGB}

| The end result was a detection statistic :math:`\eta_r` resistant to overfitting on low-SNR noise events. In addition, to further mitigate blip glitches, :math:`\eta_r` was then fed into nonlinear penalization functions of a few morphological parameters of the trigger candidate.

**O4 approach**

| In the current cWB-2G version used for O4, the XGBoost output score is mostly used directly as the sole ranking statistic (`Martini et al. 2025 `__). The monotonically stretched score :math:`W_{XGB} \in [0, 1]` is the new ranking statistic :math:`\eta'_0`, designed to mitigate numerical resolution limitations in the high-score range.

.. math::

   \eta'_{0} = W_{XGB}
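
The exact stretching used to define :math:`W_{XGB}` is given in the referenced papers; the toy example below only illustrates the numerical resolution problem it addresses. Raw sigmoid scores saturate to 1.0 in floating point at large margins, while an equivalent monotonic log-based quantity stays well resolved:

.. code-block:: python

   # Toy illustration (not the cWB definition of W_XGB) of why high scores
   # need a monotonic stretch: sigmoid outputs saturate to 1.0 in float64.
   import numpy as np

   margin = np.array([20.0, 40.0, 60.0])     # raw XGBoost margin scores
   p = 1.0 / (1.0 + np.exp(-margin))         # sigmoid: [0.99999999, 1.0, 1.0]

   # Same ranking, computed stably from the margin:
   # -log(1 - P) = log(1 + e^m) = logaddexp(0, m)
   stretched = np.logaddexp(0.0, margin)     # [20.0, 40.0, 60.0], still resolved
   print(p, stretched)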
| The :math:`\eta'_0` statistic replaces both the penalty-factor approach and all the ad-hoc nonlinear blip penalizations, which are no longer needed.

*For more information please check the papers linked here* - `1 `__, `2 `__, `3 `__, `4 `__.

XGBoost model implementation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

| The XGBoost framework has been integrated with the cWB pipeline, and the different XGBoost utilities can be accessed with cwb commands. A new cWB command, **cwb_xgboost**, has been added (details can be found in the commands page).
| The following sections show how to implement the XGBoost method for the BBH/IMBH search configurations. *Separate sections on different searches will be added soon*

..

Data Splitting
^^^^^^^^^^^^^^^^^

| In order to train and test the XGBoost model, we start by splitting the available data (both background and simulation) into two parts: training data and testing data.

**Setting up the working directories**

- We start by setting up 3 directories:

  - **BKG** - To store the background data for each chunk involved in the XGBoost model.
  - **SIM** - To store the simulation data.
  - **XGB** - To store the trained XGBoost model.

- Clone working directories for XGBoost use by utilizing the **cwb_clonedir** command (for simulation directories add the option **'--simulation true'** while cloning). **This step is optional**; the XGBoost analysis can be performed in the original working directories as well.

.. code-block:: bash

   cwb_clonedir /in_path_BKG/BKG_dir /out_path/BKG/BKG_dir_xgb1 '--output smerge --label M1.V_hvetoLH'
   cwb_clonedir /in_path_SIM/SIM_dir /out_path/SIM/SIM_dir_xgb1 '--output smerge --label M1.C_U.V_hvetoLH --simulation true'

- Create the XGB directory using **cwb_mkdir /out_path/XGB/model_dir_XGB**.

**XGB cuts**

- Frequency cuts are applied prior to XGBoost processing. The frequency cuts differ between search configurations:

  - BBH configuration: 60 Hz < :math:`f_0` < 300 Hz
  - IMBH configuration: :math:`f_0` < 200 Hz
  - BurstLF configuration: 16 Hz < :math:`f_0` < 2048 Hz

- The wave root files get the label *xgb_cut* at the end to denote that the frequency cut has been applied to those files.

.. code-block:: bash

   cwb_setcuts M1.V_hvetoLH '--wcuts ((frequency[0]>60&&frequency[0]<300)&&(!veto_hveto_H1&&!veto_hveto_L1)) --label xgb_cut'

**Data split**

- We split the data for training and testing.
- BKG data - For each chunk of available background data (usually around 1000 years, unless the chunk hosts some special GW events), 100 years of background data are kept for training, and the remaining data are set aside for testing.
- SIM data - For every 100 years of background data used for training, we select 1000 simulation events for training and keep the rest of the simulation events for testing purposes.
- The **datasplit** option is utilized for splitting the data. Its **--ifrac** option selects whether the root files are split by an absolute amount (``--ifrac -x``, where x is a positive integer: the number of years for BKG, the number of events for SIM) or by the fraction of the stored root events (``--ifrac y``, where 0 < y < 1); see the sketch after this list.

.. code-block:: bash

   cwb_xgboost datasplit '--mlabel M1.V_hvetoLH.C_xgb_cut --ifrac -100 --verbose true --cfile $CWB_CONFIG/O3a/CHUNKS/BBH/Chunk_List.lst'

- The **--cfile** option points to the GPS times of the available chunks in order to assign the respective chunk numbers :math:`C_n`.
- After the split we end up with two new wave root files, one for training and one for testing.
- Create the following files in the XGB directory for tuning and training:

  - **nfile.py** - Input background files (wave_*.root/.py) for training.
  - **sfile.py** - Input simulation files (wave_*.root/.py) for training.
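
The two ``--ifrac`` modes can be summarized with a small sketch. This is a hypothetical illustration of the semantics described above, not the actual ``cwb_xgboost datasplit`` implementation, which operates on wave root files:

.. code-block:: python

   # Hypothetical illustration of the two --ifrac modes described above; the
   # real cwb_xgboost datasplit command operates on wave root files, not lists.
   def datasplit(events, ifrac):
       """Split `events` into (training, testing) according to `ifrac`."""
       if ifrac < 0:
           # --ifrac -x: keep an absolute amount for training
           # (years of livetime for BKG, number of events for SIM)
           n_train = int(-ifrac)
       elif 0 < ifrac < 1:
           # --ifrac y: keep a fraction y of the stored root events
           n_train = int(round(ifrac * len(events)))
       else:
           raise ValueError("ifrac must be a negative integer or in (0, 1)")
       return events[:n_train], events[n_train:]

   # e.g. --ifrac -100: 100 units for training, the rest for testing
   train, test = datasplit(list(range(1000)), ifrac=-100)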
.. note:: The **cwb_xgboost_config.py** script is used as the default input XGBoost configuration for the **tuning/training/prediction/report** functionalities of the **cwb_xgboost** command. The default config values and options can be changed by including user defined **user_xgboost_config_*.py** files in the cwb_xgboost command options. The `cwb_xgboost_config.py` script is included at the end of this page for completeness.

Tuning
^^^^^^^^^^

| XGBoost has a number of free `hyper-parameters `__ which control various properties of the learning process to prevent overfitting. These hyper-parameters need to be tuned for each specific application. Some primary XGBoost hyper-parameters are held fixed, e.g.:

- **objective**: *binary:logistic* - the cost function used by the XGBoost classifier.
- **tree_method**: *hist* - trees are built using the fast histogram-based algorithm.
- **grow_policy**: *lossguide* - the split is made at the node with the highest loss change.
- **n_estimators**: *20,000* - the maximum number of trees generated in the model.

* A method known as *early stopping* is employed to optimize the total number of trees generated. In early stopping, a small fraction of the training data set is set aside for validation. When the validation AUC PR score stops improving, the training ends to prevent XGBoost from overfitting.

| We perform a grid search over a range of six standard XGBoost hyper-parameters, listed below. We find the optimal set by evaluating each configuration of XGBoost hyper-parameters with respect to the precision-recall area under the curve (AUC PR) over 10-fold cross-validation. The optimal configuration of hyper-parameters, according to this criterion, is shown in bold.

+-------------------------------+-------------------------+
| **XGBoost hyper-parameter**   | tuned values            |
+-------------------------------+-------------------------+
| learning_rate                 | **0.03**, 0.1           |
+-------------------------------+-------------------------+
| max_depth                     | **13**                  |
+-------------------------------+-------------------------+
| min_child_weight              | **10.0**                |
+-------------------------------+-------------------------+
| colsample_bytree              | **1.0**                 |
+-------------------------------+-------------------------+
| subsample                     | 0.4, **0.6**, 0.8       |
+-------------------------------+-------------------------+
| gamma                         | **2.0**, 5.0, 10.0      |
+-------------------------------+-------------------------+

| Here, the **learning_rate** parameter regulates how much each generated tree affects the final prediction. The **max_depth** parameter determines the depth of each tree and thus affects the overall complexity of the model. The other parameters (**min_child_weight**, **colsample_bytree**, **subsample**, **gamma**) act as different forms of regularization that keep the algorithm conservative and prevent overfitting.
| We use around 20% of the training data for tuning.

**Sample Weight options**

- Apart from the XGBoost hyper-parameters, there are some user defined hyper-parameters that control the shape of a custom :math:`\eta_0` dependent sample weight, which is applied to the background events to minimize the importance given by the algorithm to low SNR glitches. All the simulation events are assigned a sample weight of 1. For the background (noise) events, we first divide the interval :math:`6.5 < \eta_0 < 20` into :math:`nbins` = 100 percentile bins of the simulation events, such that the number of simulation events in each bin is the same. The lower threshold at :math:`\eta_0` = 6.5 gets rid of excess background with minimal or no loss of simulated events. The capping at :math:`\eta_0` = 20.0 prevents the algorithm from being affected by high SNR background events.
- Sample weight :math:`w`: for simulation events, :math:`w_{S} = 1`. For background events:

.. math::

   w_{B}(i) = \frac{N_{S}(i)}{N_{B}(i)}\, e^{\ln(A)\left(1-\frac{i}{nbins}\right)^q}

where :math:`i = 1, 2, \ldots, nbins` is a given bin, and :math:`N_{S}(i)` and :math:`N_{B}(i)` are the numbers of simulation and background events in the :math:`i`-th bin. :math:`(q, A)` are weight options: :math:`A` is called the balance parameter, with :math:`A = \frac{N_{S}(1)}{N_{B}(1)}` the class balance (:math:`N_{S}/N_{B}`) for the first bin at :math:`\eta_0 \geq 6.5`, and :math:`q` is called the slope parameter, which controls the rate of change of the weighted background distribution (defined in `Mishra et al. 2022 `__). For all events with :math:`\eta_0 \geq 20`, the number of simulation events is re-sampled to match the number of background events, giving a perfect class balance (:math:`N_{S}/N_{B} = 1`). For a given combination of :math:`(q, A)` values, we can achieve any monotonic distribution of our choice.

+-------------------------------+-------------------------+
| **Sample weight option**      | value                   |
+-------------------------------+-------------------------+
| weight *balance*              | :math:`A` = **40**      |
+-------------------------------+-------------------------+
| weight *slope*                | :math:`q` = **5**       |
+-------------------------------+-------------------------+
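
The weight formula above can be transcribed directly into code. A minimal sketch, where the per-bin counts :math:`N_{S}(i)` and :math:`N_{B}(i)` are illustrative placeholders:

.. code-block:: python

   # Direct transcription of the background sample-weight formula above.
   # The per-bin counts N_S(i), N_B(i) are illustrative placeholders; in the
   # pipeline they come from the percentile binning of the training events.
   import numpy as np

   nbins, A, q = 100, 40.0, 5.0                 # tuned weight options
   i = np.arange(1, nbins + 1)                  # bin index i = 1, ..., nbins
   N_S = np.full(nbins, 10.0)                   # simulation events per bin
   N_B = np.linspace(400.0, 10.0, nbins)        # background events per bin

   # w_B(i) = N_S(i)/N_B(i) * exp(ln(A) * (1 - i/nbins)^q)
   w_B = (N_S / N_B) * np.exp(np.log(A) * (1.0 - i / nbins) ** q)
   w_S = 1.0                                    # all simulation events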
.. toggle-header::
   :header: *Show/Hide distribution of background events and simulation events after application of sample weight*

   +--------------------+
   | |image225|         |
   +--------------------+

   * :math:`\eta_0` **dependent sample weight** - Distribution of weighted background events and simulation events in the range :math:`6.5 < \eta_0 < 20` after the application of sample_weight with (:math:`q` = **5**, :math:`A` = **40**).

|

- The **user_pparameters.C** file is modified for tuning, and the commands for tuning are as follows:

.. code-block:: bash

   cwb_condor xgb r1
   cwb_xgboost report '--type tuning --ulabel r1 --verbose true'

.. toggle-header::
   :header: *Show/Hide user_pparameters.C file*

   .. literalinclude:: logs/xgb_user_pparameters_tuning.txt
      :language: bash

|

Training
^^^^^^^^^^^^^

| Training is done in the **/out_path/XGB/model_dir_XGB** directory. Create the following config file, **user_xgboost_config_v1.py**, in the config folder of the XGB model directory:

.. code-block:: bash

   /out_path/XGB/model_dir_XGB/config/user_xgboost_config_v1.py

.. toggle-header::
   :header: *Show/Hide user_xgboost_config_v1.py file*

   .. literalinclude:: logs/xgb_user_xgboost_config_v1.txt
      :language: bash

|
| Once the config file is set up as per the requirements, the XGBoost model is trained by running the following command in the XGB model directory:
.. code-block:: bash

   cwb_xgboost training '--nfile nfile.py --sfile sfile.py --model xgb/v1/xgb_model_name_v1.dat --nfrac 1.0 --sfrac 1.0 --search bbh --verbose true --dump true --config config/user_xgboost_config_v1.py'
   cwb_mkhtml xgb/v1/   # plots are stored in the public_html training report page (LDAS)

| After training is completed, the trained model is stored as **/out_path/XGB/model_dir_XGB/xgb/v1/xgb_model_name_v1.dat**, and the training information and plots can be found in the training report page.

**A few example plots stored in the training report page**

.. toggle-header::
   :header: *Show/Hide training report page plots*

   +--------------------+
   | |image226|         |
   +--------------------+

   * **Last Tree Generated** - Illustration of the last tree generated in the ensemble before the training procedure stops.

   +--------------------+
   | |image227|         |
   +--------------------+

   * **Feature importances** - Importance of the input features/summary statistics in training the XGBoost model.

   +--------------------+
   | |image228|         |
   +--------------------+

   * :math:`Q_a` **vs** :math:`Q_p` **training plot** - 2D histogram of :math:`Q_a` vs :math:`Q_p` with a scatter plot of high SNR background events.

   +--------------------+
   | |image229|         |
   +--------------------+

   * **Chirp mass** :math:`\mathcal{M}` - Distribution of :math:`\mathcal{M}` or *mchirp* for training background and simulation events.

   +--------------------+
   | |image230|         |
   +--------------------+

   * **Central frequency** :math:`f_0` - Distribution of the central frequency :math:`f_0` for training background and simulation events.

   +--------------------+
   | |image231|         |
   +--------------------+

   * :math:`\eta_0` - Distribution of :math:`\eta_0` or :math:`\rho_0` for training background and simulation events.

|

Testing
^^^^^^^^^^^

| The trained XGBoost model is now called to predict :math:`W_{XGB}` for a given cWB event and subsequently estimate the reduced detection statistic :math:`\eta_r`. The :math:`\eta_r` statistic is then used to assign significance to simulation events based on the background rate distribution, and to conduct further analysis.
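
Schematically, the prediction step amounts to the following. This is only a sketch of the logic: in practice it is performed by the **cwb_xgboost** command on the wave root files, the feature matrix, the :math:`\eta_0` values, and the monotonic map are placeholders, and loading the stored model may differ depending on how it was serialized:

.. code-block:: python

   # Schematic sketch of the prediction step; in practice this is done by the
   # cwb_xgboost command on wave root files. The feature matrix, the eta_0
   # values, and the monotonic map are placeholders (the real W_XGB map is
   # defined in the papers), and loading the stored .dat model may differ
   # depending on how it was serialized.
   import numpy as np
   import xgboost as xgb

   model = xgb.XGBClassifier()
   model.load_model("xgb/v1/xgb_model_name_v1.dat")    # stored trained model

   X = np.zeros((5, 14))                          # placeholder: 14 summary statistics
   eta_0 = np.array([7.0, 8.5, 10.0, 12.0, 20.0])     # placeholder eta_0 values

   p_xgb = model.predict_proba(X)[:, 1]           # P_XGB in [0, 1]
   w_xgb = p_xgb                                  # placeholder for the monotonic map

   eta_r = eta_0 * w_xgb                          # post-O3: eta_r = eta_0 * W_XGB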
The testing/prediction procedure is as follows:

**BACKGROUND** - **/out_path/BKG/BKG_dir_xgb1**

- For each chunk of background data, we go to the respective working directory and run the following commands:

**Standard Veto Method for comparison**

| Set the standard PP cuts/veto method for the testing wave root file and create a standard cWB report.

.. code-block:: bash

   cwb_setcuts M1.V_hvetoLH.C_xgb_cut.XK_Test_Y100_TXXXX '--tcuts bin1_cut --label bin1_cut'
   cwb_report M1.V_hvetoLH.C_xgb_cut.XK_Test_Y100_TXXXX.C_bin1_cut create

| Collect the background root files after the standard PP cuts application in the **nfile_bkg_std.py** file for testing with XGBoost.

**Prediction with XGBoost**

| Call the stored XGBoost model for prediction and then create a standard cWB report page.

.. code-block:: bash

   cwb_xgboost M1.V_hvetoLH.C_xgb_cut.XK_Test_Y100_TXXXX '--model /out_path/XGB/model_dir_XGB/xgb/v1/xgb_model_name_v1.dat --ulabel v1 --rthr 0 --search bbh'
   cwb_report M1.V_hvetoLH.C_xgb_cut.XK_Test_Y100_TXXXX.XGB_rthr0_v1 create

| Once the prediction step is completed for all the chunks, we collect the far_rho.txt files for each chunk and store them in the **far_rho_v1.lst** file in a new folder (/out_path/BKG/XGB/far_rho_v1.lst). Similarly, we collect the predicted root files for each chunk in the **nfile_bkg_rhor_v1.py** file in the same folder (/out_path/BKG/XGB/nfile_bkg_rhor_v1.py).

**SIMULATION** - **/out_path/SIM/SIM_dir_xgb1**

- For each simulation directory, we run the following commands:

**Standard Veto Method for comparison**

| Set the standard PP cuts/veto method for the testing wave root file and set the IFAR.

.. code-block:: bash

   # set standard PP cuts
   cwb_setcuts M1.C_U.V_hvetoLH.C_xgb_cut.XK_Test_NXX_TXXX '--tcuts bin1_cut --label bin1_cut'
   # set ifar chunkwise, far_bin1_cut_file pointing to the standard far_rho.txt files for the respective BKG chunk
   cwb_setifar M1.C_U.V_hvetoLH.C_xgb_cut.XK_Test_NXX_TXXX.C_bin1_cut '--tsel O2_K01_cut --label ifar --file far_bin1_cut_file[1] --mode exclusive'
   cwb_setifar M1.C_U.V_hvetoLH.C_xgb_cut.XK_Test_NXX_TXXX.C_bin1_cut.S_ifar '--tsel O2_K02_cut --label ifar --file far_bin1_cut_file[2] --mode exclusive'
   # ... and so on for all the chunks.

**Prediction with XGBoost**

| We first make the following addition to the **config/user_pparameters.C** file in the simulation directory:

.. code-block:: bash

   /out_path/SIM/SIM_dir_xgb1/config/user_pparameters.C

.. toggle-header::
   :header: *Show/Hide user_pparameters.C for setting IFAR*

   .. literalinclude:: logs/xgb_user_pparameters_testing.txt
      :language: bash

|
| After editing the user_pparameters.C file, we proceed to the prediction for the simulation events, followed by setting the IFAR using the predicted background FAR.

.. code-block:: bash

   # prediction
   cwb_xgboost M1.C_U.V_hvetoLH.C_xgb_cut.XK_Test_NXX_TXXX '--model /out_path/XGB/model_dir_XGB/xgb/v1/xgb_model_name_v1.dat --ulabel v1 --rthr 0 --search bbh'
   # set ifar
   cwb_setifar M1.C_U.V_hvetoLH.C_xgb_cut.XK_Test_NXX_TXXX.XGB_rthr0_v1 '--tsel O2_K01_cut --label ifar --file xgb_far_bin1_cut_file[1] --mode exclusive'
   cwb_setifar M1.C_U.V_hvetoLH.C_xgb_cut.XK_Test_NXX_TXXX.XGB_rthr0_v1.S_ifar '--tsel O2_K02_cut --label ifar --file xgb_far_bin1_cut_file[2] --mode exclusive'
   # ... and so on for all the chunks.

| Once the above step is performed for all the simulations available for the specific search, we proceed to produce the final report pages with the result plots.
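
Conceptually, the **cwb_setifar** step assigns each event an inverse false-alarm rate (IFAR) by looking up its ranking statistic in the background FAR distribution of the corresponding chunk. Below is a hedged sketch of this lookup, assuming a table of (ranking-statistic threshold, FAR) pairs of the kind tabulated in the far_rho.txt files; the actual file format and the exclusive chunk-wise bookkeeping are handled by the command:

.. code-block:: python

   # Hedged sketch of the IFAR assignment idea behind cwb_setifar. The
   # (threshold, FAR) table below is an illustrative stand-in for a chunk's
   # far_rho.txt background distribution; the real file format differs.
   import numpy as np

   far_table = np.array([[7.0, 1.0e2],    # ranking-statistic threshold, FAR [1/yr]
                         [8.0, 1.0e0],
                         [9.0, 1.0e-2],
                         [10.0, 1.0e-4]])

   def ifar(rho):
       """IFAR of an event: inverse FAR at the loudest threshold it exceeds."""
       passed = far_table[far_table[:, 0] <= rho]
       if passed.size == 0:
           return 0.0                     # quieter than any tabulated threshold
       return 1.0 / passed[-1, 1]

   print(ifar(9.5))                       # 100.0 years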
Make a new folder, **/out_path/SIM/SIM_dir_test**, where we store the following files:

- **sfile_rhor_v1.py** - list of all the root files predicted with XGBoost for all the simulation directories used (reduced detection statistic, rhor or :math:`\eta_r`).
- **sfile_rho1.py** - list of all the standard PP cuts root files for all the simulation directories used (standard detection statistic, rho1 or :math:`\eta_1`).

Report
^^^^^^^^^^^

| In order to create the final report containing the comparison and result plots, we create a **report_config_v1.py** file:

.. code-block:: bash

   /out_path/SIM/SIM_dir_test/report_config_v1.py

.. toggle-header::
   :header: *Show/Hide report_config_v1.py file*

   .. literalinclude:: logs/xgb_report_config_v1.txt
      :language: bash

|
| XGBoost final report generation:

.. code-block:: bash

   cwb_xgboost report '--type prediction --subtype roc/hrho/efreq/lrho/qaqp --config report_config_v1.py'
   cwb_mkhtml xgb/v1/

| All the generated plots are stored in **/out_path/SIM/SIM_dir_test/xgb/v1/**.
| **cwb_xgboost report** contains more options, and they are described in detail in the **cwb_xgboost** command page.

**Plots in the final XGBoost report page**

+--------------------+
| |image232|         |
+--------------------+

* **roc** - Detection efficiency vs FAR for the ML-enhanced cWB search compared with the standard cWB search.

.. toggle-header::
   :header: *Show/Hide other important plots from the testing report page*

   +--------------------+
   | |image233|         |
   +--------------------+

   * **efreq** - Detection efficiency at IFAR = 1 yr vs central frequency :math:`f_0` (or freq[0]) for the ML-enhanced cWB search compared with the standard cWB search.

   +--------------------+
   | |image234|         |
   +--------------------+

   * **hrho** - Distribution of :math:`\eta_r` or :math:`\rho_r` for testing background and simulation events.

   +--------------------+
   | |image235|         |
   +--------------------+

   * **qaqp** - 2D histogram of :math:`Q_a` vs :math:`Q_p` with a scatter plot of high SNR background events, for testing events.

   +--------------------+
   | |image236|         |
   +--------------------+

   * **lrho** - :math:`\log_{10}(\eta_r)` or :math:`\log_{10}(\rho_r)` vs :math:`0.5 \cdot \log_{10}(\text{likelihood})` for testing simulation events, with a scatter plot of high SNR testing background events.

|

cWB Summary Statistics
^^^^^^^^^^^^^^^^^^^^^^^^^

The following 14 cWB summary statistics were used as the input features for the XGBoost BBH search and IMBH search models:

- :math:`\eta_0`: The effective correlated SNR and the main cWB detection statistic used for the generic GW search.
  As an input feature for XGBoost, the :math:`\eta_0` statistic is capped at 20 (any event with higher :math:`\eta_0` is assigned the value 20). The capping prevents the algorithm from being affected by high SNR background events, whose distribution falls steeply with increasing :math:`\eta_0`.

- :math:`c_c`: Coherent energy divided by the sum of the coherent energy and the null energy.
- :math:`n_f`: Effective number of time-frequency resolutions used for event detection and waveform reconstruction.
- :math:`S_i/L`: Ratio of the square of the reconstructed waveform's SNR in each detector (:math:`i`) to the network likelihood. :math:`S_0/L` is chosen for the two detector network (HL), and both :math:`S_0/L` and :math:`S_1/L` are considered for the three detector network (HLV). This allows us to increase the sensitivity of the ML algorithm to different detector networks.
- :math:`\Delta T_s`: Energy weighted signal duration.
- :math:`\Delta F_s`: Energy weighted signal bandwidth.
- :math:`f_0`: Energy weighted signal central frequency.
- :math:`\mathcal{M}`: Chirp mass parameter estimated in the time-frequency domain, defined in Ref. `Tiwari et al. 2015 `__.
- :math:`e_M`: Chirp mass goodness-of-fit metric, presented in Ref. `Tiwari et al. 2015 `__.
- :math:`Q_a`: The waveform shape parameter **Qveto[0]**, developed to identify a characteristic family of (blip) glitches present in the detectors.
- :math:`Q_p`: An estimate of the effective number of cycles in a cWB event, computed by dividing the quality factor of the reconstructed waveform by an appropriate function of the coherent energy.
- :math:`L_v`: For the loudest pixel, the ratio between the pixel energy and the total energy of the event (**Lveto[2]**).
- :math:`\chi^2`: Quality of the event reconstruction (**penalty**), :math:`\chi^2 = E_n / N_{df}`, where :math:`E_n` is the estimated residual noise energy and :math:`N_{df}` is the number of independent wavelet amplitudes describing the event.
- :math:`C_n`: Data chunk number. LIGO-Virgo data is divided into time segments known as chunks, which typically contain a few days of strain data. Including the data chunk number allows the ML algorithm to respond to changes in detector sensitivity across separate observing runs and chunks.

| The entire **cwb_xgboost_config.py** is given below:

.. toggle-header::
   :header: *Show/Hide complete cwb_xgboost_config.py file*

   .. literalinclude:: logs/xgb_cwb_xgboost_config_py.txt
      :language: bash

|