Metadata-Version: 2.4
Name: CleaveRNA
Version: 1.0.0
Summary: Advanced machine learning-based computational tool for scoring candidate DNAzyme cleavage sites
Home-page: https://github.com/reyhaneh-tavakoli/CleaveRNA
Author: reyhaneh tavakoli and contributors
Author-email: rey.ta.kop.biochem@gmail.com
License: MIT License with Attribution Requirements
Project-URL: Bug Reports, https://github.com/reyhaneh-tavakoli/CleaveRNA/issues
Project-URL: Source, https://github.com/reyhaneh-tavakoli/CleaveRNA
Project-URL: Documentation, https://github.com/reyhaneh-tavakoli/CleaveRNA/wiki
Keywords: DNAzyme,cleavage sites,prediction,machine learning,bioinformatics,RNA
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Operating System :: OS Independent
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE.txt
Requires-Dist: pandas>=1.3.0
Requires-Dist: numpy>=1.20.0
Requires-Dist: scikit-learn>=1.0.0
Requires-Dist: matplotlib>=3.3.0
Requires-Dist: tqdm>=4.60.0
Requires-Dist: argparse
Provides-Extra: dev
Requires-Dist: pytest>=6.0; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: black; extra == "dev"
Requires-Dist: flake8; extra == "dev"
Provides-Extra: conda
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: license
Dynamic: license-file
Dynamic: project-url
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# CleaveRNA

CleaveRNA is a machine-learning-based computational tool for scoring candidate DNAzyme cleavage sites in substrate sequences.
 
# Documentation

## Overview

## Dependencies

The following tools must be available in your environment, e.g. via a corresponding conda setup:

- Python v3 (python)
- IntaRNA (intarna)
- RNAplfold (viennarna)

## Usage and Parameters

The **CleaveRNA** algorithm provides two different modes: **training** and **prediction**.

- **Training Mode**: Allows you to create a training file using your own experimental data. If you have experimental data on the fraction of DNAzyme cleavage at different target sites, you can use this mode to generate a custom training file for prediction.  
- **Prediction Mode**: Using either your own training file or the provided default ones, you can score cleavage sites on target sequences and select the most suitable DNAzyme based on your specific needs.

If you don’t have experimental data, you can use the **default training files** we provide. These were generated from experimental data published prior to the development of this algorithm.

---

### Training Mode

This section explains how the default training files were generated and how you can create your own training set for use in prediction mode.

If you have your own dataset (see details in the **`data_preparation`** folder), you must first run this mode to generate the **pre_train** file.

#### Steps

1. **Prepare the target sequence files in FASTA format**
     
- Example test files: [`BCL-1.fasta`, `BCL-2.fasta`, `BCL-3.fasta`, `BCL-4.fasta`, `BCL-5.fasta`, `HPV.fasta`](https://github.com/reytakop/CleaveRNA/tree/main/CleaveRNA/Train_mode/HPBC)
- **Notes:**  
  - The minimum sequence length must be **150 nt**.  
  - The sequence name must match the FASTA file name.  
  - **Example:** If the target file is [`BCL-1.fasta`](https://github.com/reytakop/CleaveRNA/blob/main/CleaveRNA/Train_mode/HPBC/BCL-1.fasta), the header must start with:  

```fasta
>BCL-1
GTTGGCCCCCGTTACTTTTCCTCTGGGAAATATGGCGCACGCTGGGAGAACAGGGTACGATAACCGGGAGATAGTGATGAAGTACATCCATTATAAGCTGTCGCAGAGGGGCTACGAGTGGGATGCGGGAGATGTGGGCGCCGCGCCCCCGGGGGCCGCCCCCGCGCCGGGCATCTTCTCCTCGCAGCCCGGGCACACGCCCCATACAGC...
 ```
- The target files must be provided with the `--targets` flag:  
```bash
--targets HPV.fasta BCL-1.fasta BCL-2.fasta BCL-3.fasta BCL-4.fasta BCL-5.fasta
```
  
2. **Prepare the parameter file (default mode)**  
- Example: [`test_default.csv`](https://github.com/reytakop/CleaveRNA/blob/main/CleaveRNA/Train_mode/HPBC/test_default.csv)  
- This file contains **five columns**, described below:  
  - **LA**: Left binding arm length of the DNAzyme  
  - **RA**: Right binding arm length of the DNAzyme  
  - **CS**: Cleavage site dinucleotide of the DNAzyme. In this example, the catalytic core of the **10-23 DNAzyme** is used ([Reference: Nat. Chem. 2021](https://doi.org/10.4103/1673-5374.335157))  
  - **Tem**: Reaction temperature of the DNAzyme  
  - **CA**: Catalytic core sequence of the DNAzyme
- Provide the default mode and parameter file with:
   
```bash
--feature_mode default --params test_default.csv
```
3. **Define the output directory**
    
```bash
--output_dir
```
4. **Specify the model name**  
- Provide the model name using the `--model_name` flag:
     
```bash
--model_name "HPBC"
```
5. **Run the shell script**
- Update **lines 3–12** of [`run.sh`](https://github.com/reytakop/CleaveRNA/blob/main/CleaveRNA/Train_mode/HPBC/run.sh) to match your conda environment.  
- In the input files directory, run the tool with:
     
```bash
bash run.sh
```   
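Before running the script, the two FASTA notes from step 1 (minimum length of 150 nt, header matching the file name) can be checked up front. The following is a minimal sketch, not part of CleaveRNA itself: the helper name `check_fasta` is hypothetical, and it assumes each file holds a single sequence.

```python
from pathlib import Path

MIN_LENGTH = 150  # minimum target length in nt, as stated in the notes of step 1


def check_fasta(path):
    """Return a list of problems found in a FASTA file (assumed: one sequence per file)."""
    path = Path(path)
    # Drop empty lines and surrounding whitespace
    lines = [ln.strip() for ln in path.read_text().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        return [f"{path.name}: first line is not a FASTA header"]

    problems = []
    header = lines[0][1:]
    # The header must start with the file name without its extension, e.g. BCL-1
    if not header.startswith(path.stem):
        problems.append(
            f"{path.name}: header '{header}' does not start with '{path.stem}'"
        )

    # Join the sequence lines that follow the header
    sequence = "".join(ln for ln in lines[1:] if not ln.startswith(">"))
    if len(sequence) < MIN_LENGTH:
        problems.append(
            f"{path.name}: sequence has {len(sequence)} nt, minimum is {MIN_LENGTH}"
        )
    return problems
```

Calling `check_fasta("BCL-1.fasta")` on each file passed to `--targets` returns an empty list when both conditions hold.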
### Output File

- The tool will generate the **pre_train file**: [HPBC_user_merged_num.csv](https://github.com/reytakop/CleaveRNA/blob/main/CleaveRNA/Train_mode/HPBC/HPBC_user_merged_num.csv)  
- In this file, you can find the generated DNAzymes (`seq_2` column) based on your defined parameters.  
- All dinucleotide cleavage sites (`id2` column) are included, along with the generated feature sets for each cleavage site.
  
- **Note:**  
  - Two different **pre_train** files are provided, generated from the largest fraction cleavage dataset published prior to the development of this tool.

- If you have your own dataset (or a newly published one), please:  
  - Create a new folder and name it according to your `model_name`.  
  - Prepare all the required input files as described above.  
  - Update the `run.sh` script and run it. 
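
To get a quick overview of a generated pre_train file, the `id2` and `seq_2` columns described above can be summarized with the standard library. This is a rough sketch under assumptions: the function name `summarize_pre_train` is hypothetical, and the file is assumed to be comma-separated with a header row.

```python
import csv
from collections import Counter


def summarize_pre_train(handle):
    """Count rows per cleavage site (id2) and collect the distinct DNAzymes (seq_2)."""
    sites = Counter()
    dnazymes = set()
    for row in csv.DictReader(handle):
        sites[row["id2"]] += 1      # cleavage-site identifier
        dnazymes.add(row["seq_2"])  # generated DNAzyme sequence
    return sites, dnazymes


# Example usage (file name taken from the example above):
# with open("HPBC_user_merged_num.csv", newline="") as fh:
#     sites, dnazymes = summarize_pre_train(fh)
#     print(len(sites), "cleavage sites,", len(dnazymes), "distinct DNAzymes")
```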

---

### Prediction Mode (Default)

In this mode, you can use the generated **pre_train file** to score the cleavage sites on your target files based on machine learning predictions. First, the DNAzyme sequences are designed according to the given parameters, and then all cleavage site positions are classified and scored based on the AI predictions.

The required input files are:  
- **Target sequence FASTA files**  
  - These are the sequences you want to consider as targets for DNAzyme.  
  - Example test files: [`BCL-1.fasta`, `BCL-2.fasta`, `BCL-3.fasta`, `BCL-4.fasta`, `BCL-5.fasta`, `HPV.fasta`](https://github.com/reytakop/CleaveRNA/tree/main/CleaveRNA/Prediction_mode/default/HPBC)  
  - **Notes**:  
    - The minimum sequence length must be **150 nt**.  
    - The sequence name must match the FASTA file name. For example, if the target file is [`BCL-1.fasta`](https://github.com/reytakop/CleaveRNA/blob/main/CleaveRNA/Prediction_mode/default/HPBC/BCL-1.fasta), the header must start with:  

```fasta
>BCL-1
GTTGGCCCCCGTTACTTTTCCTCTGGGAAATATGGCGCACGCTGGGAGAACAGGGTACGATAACCGGGAGATAGTGATGAAGTACATCCATTATAAGCTGTCGCAGAGGGGCTACGAGTGGGATGCGGGAGATGTGGGCGCCGCGCCCCCGGGGGCCGCCCCCGCGCCGGGCATCTTCTCCTCGCAGCCCGGGCACACGCCCCATACAGC...
```

- **Parameter file**
  
   In this example, the default parameter mode is used.  
   - Example: [`test_default.csv`](https://github.com/reytakop/CleaveRNA/blob/main/CleaveRNA/Train_mode/HPBC/test_default.csv)  
   - This file contains **five columns** (defined previously) and must be passed with the following flags:
     
```bash
--feature_mode default --params test_default.csv
```

- **Pre_train file**
   
   - Example: [`HPBC_user_merged_num.csv`](https://github.com/reytakop/CleaveRNA/blob/main/CleaveRNA/Prediction_mode/default/HPBC/HPBC_user_merged_num.csv)  
   - **Notes**:  
     - The pre_train file is either generated during `train_mode` or you can use the default provided file.  
     - Pass this file using:
       
```bash
--prediction_mode HPBC_user_merged_num.csv
```

- **Model name**
  
  Select the model name, which will be used as the prefix for all generated output files.  
  - Specify it with:  

```bash
--model_name HPBC
```

- **Classification score file**
  
   - Example: [`HPBC_target.csv`](https://github.com/reytakop/CleaveRNA/blob/main/CleaveRNA/Prediction_mode/default/HPBC/HPBC_target.csv)  
     - This file contains **two columns**: **id2**, which represents the cleavage site index, and **Y**, which indicates the classification score of that position based on experimental data.  
     - **Notes**: This file must be prepared by the user, or you can use the corresponding file from the default training files. If using your own dataset, the fraction cleavage of each site can be converted to a binary classification as described in the **`data_preparation`** folder.

  - Pass this file with:  

```bash
--ML_target HPBC_target.csv
```

- **Define the output directory**  

```bash
--output_dir
```
      
- **Run the shell script**

  - Update **lines 3–14** of [`run.sh`](https://github.com/reytakop/CleaveRNA/blob/main/CleaveRNA/Prediction_mode/default/HPBC/run.sh) to match your conda environment.  
  - From the input files directory, run the tool using:  

```bash
bash run.sh
```
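For the classification score file described above, the conversion from fraction cleavage to binary labels could look roughly like the sketch below. The cutoff of `0.5` and both function names are assumptions made for illustration only; the actual conversion rule is the one documented in the `data_preparation` folder. Only the column names `id2` and `Y` come from the file description above.

```python
import csv

THRESHOLD = 0.5  # hypothetical cutoff; see data_preparation for the actual rule


def to_ml_target(fractions, threshold=THRESHOLD):
    """Map {cleavage-site index: fraction cleaved} to sorted (id2, Y) rows."""
    return [
        (site, 1 if fraction >= threshold else 0)
        for site, fraction in sorted(fractions.items())
    ]


def write_ml_target(rows, handle):
    """Write the rows as a two-column CSV with the header id2,Y."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["id2", "Y"])
    writer.writerows(rows)


# Example usage with made-up measurements:
# with open("my_target.csv", "w", newline="") as fh:
#     write_ml_target(to_ml_target({45: 0.8, 12: 0.1}), fh)
```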

### Output

The tool will generate the following output files:

1. **Model performance metrics**  
   - Example: [`HPBC_ML_metrics.csv`](https://github.com/reytakop/CleaveRNA/blob/main/CleaveRNA/Prediction_mode/default/HPBC/HPBC_ML_metrics.csv) 
   - Contains all machine learning scores related to the prediction.  

2. **Prediction file**  
   - Example: [`HPBC_CleaveRNA_output.csv`](https://github.com/reytakop/CleaveRNA/blob/main/CleaveRNA/Prediction_mode/default/HPBC/HPBC_CleaveRNA_output.csv)  
   - This file reports candidate cleavage sites scored by their accessibility for DNAzyme cleavage reactions.
     
   **Columns included:**  
     - **CS_Index** → Nucleotide index of the cleavage site on the target sequence  
     - **Dz_Seq** → DNAzyme sequence designed for each cleavage site  
     - **CS_Target_File** → Target file name associated with each cleavage site  
     - **Classification_score** → Binary classification of the cleavage sites based on ML prediction  
     - **Prediction_score** → Score reflecting the accuracy of prediction at each position  
     - **Decision_score** → Model decision score.
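
One way to shortlist candidates from the prediction file is to keep the sites classified as positive and sort them by decision score. The sketch below is illustrative only: `rank_candidates` is a hypothetical helper, and it assumes that the positive class is encoded as `1` in **Classification_score** and that a larger **Decision_score** indicates a more confident positive call.

```python
import csv


def rank_candidates(handle, top=5):
    """Return up to `top` predicted-positive rows, highest Decision_score first."""
    rows = [
        row for row in csv.DictReader(handle)
        if float(row["Classification_score"]) == 1.0  # assumed positive label
    ]
    rows.sort(key=lambda row: float(row["Decision_score"]), reverse=True)
    return rows[:top]


# Example usage (file name taken from the example above):
# with open("HPBC_CleaveRNA_output.csv", newline="") as fh:
#     for row in rank_candidates(fh):
#         print(row["CS_Target_File"], row["CS_Index"], row["Dz_Seq"])
```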
     
 ---

### Prediction Mode (Target_check)

In this mode, prediction and scoring of candidate cleavage sites are performed only for a specific region of the target sequence.  
- The parameter file in **target_check** mode contains an additional column called **Start_End_Index**, which defines the desired region within the target sequence.  
- Example file: [test_target_check.csv](https://github.com/reytakop/CleaveRNA/blob/main/CleaveRNA/Prediction_mode/target_check/test/test_target_check.csv)  
- To run the prediction in **target_check** mode, update the [run.sh](https://github.com/reytakop/CleaveRNA/blob/main/CleaveRNA/Prediction_mode/target_check/test/run.sh) script (lines 3–14) and use the provided [input files](https://github.com/reytakop/CleaveRNA/tree/main/CleaveRNA/Prediction_mode/target_check/test).  

---

### Prediction Mode (Target_screen)

In this mode, the prediction is performed only for the cleavage sites whose indices are provided by the user.  
- The input files in this mode are the same as in the default mode, except for the **parameter file**. See [`test_target_screen.csv`](https://github.com/reytakop/CleaveRNA/blob/main/CleaveRNA/Prediction_mode/target_screen/test/test_target_screen.csv) for an example.  

- The parameter file in this mode contains one additional column as described below:
  - **CS_index**: This column contains the name of the target file and the index of the cleavage site.

- You can run this mode with all the required input files in the [`target_screen`](https://github.com/reytakop/CleaveRNA/tree/main/CleaveRNA/Prediction_mode/target_screen/test) directory.  

  - **Note**: Please update lines 3–14 in the [`run.sh`](https://github.com/reytakop/CleaveRNA/blob/main/CleaveRNA/Prediction_mode/target_screen/test/run.sh) file according to your conda environment.  

---
 
### Prediction Mode (Specific_query)

- If you already have a DNAzyme sequence and want to predict its cleavage efficiency at a target site, run CleaveRNA in this mode.  
- You must provide a parameter file with these columns:
   - **LA_seq**: The sequence of the left binding arm of the DNAzyme  
   - **RA_seq**: The sequence of the right binding arm of the DNAzyme
   - **CS**: Cleavage site dinucleotide of the enzyme.
   - **CS_Index_query**: The index of dinucleotide cleavage site on the target file.
   - **Tem**: Reaction temperature for DNAzyme activity.  
   - **CA**: Catalytic core sequence of the DNAzyme.
- For an example file, see [test_specific_query.csv](https://github.com/reytakop/CleaveRNA/blob/main/CleaveRNA/Prediction_mode/specific_query/test/test_specific_query.csv).
- You can run this mode with all the required files in the [specific_query folder](https://github.com/reytakop/CleaveRNA/tree/main/CleaveRNA/Prediction_mode/specific_query/test).
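
Since each mode expects a slightly different parameter file, a small header check can catch a missing column before a run. The sketch below collects the column names listed in the sections above. It rests on assumptions: the header names in the CSV are taken to match these names exactly, the dictionary keys are labels chosen for this example rather than confirmed command-line values, and `missing_columns` is a hypothetical helper.

```python
import csv

BASE = ["LA", "RA", "CS", "Tem", "CA"]  # the five columns of the default mode

# Column names as listed in this README; keys are illustrative labels only
REQUIRED_COLUMNS = {
    "default": BASE,
    "target_check": BASE + ["Start_End_Index"],
    "target_screen": BASE + ["CS_index"],
    "specific_query": ["LA_seq", "RA_seq", "CS", "CS_Index_query", "Tem", "CA"],
}


def missing_columns(handle, mode):
    """Return the required columns that are absent from the CSV header."""
    header = next(csv.reader(handle), [])
    present = {name.strip() for name in header}
    return [name for name in REQUIRED_COLUMNS[mode] if name not in present]


# Example usage:
# with open("test_specific_query.csv", newline="") as fh:
#     print(missing_columns(fh, "specific_query"))
```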
     


---
  
