This Python script converts protein sequences into numerical descriptors based on amino acid properties. It supports two descriptor types:
- VHSE (Vector of Hydrophobic, Steric, and Electronic properties): Based on the paper: https://doi.org/10.1002/bip.20296
- Z-scales: Based on the paper: https://pubs.acs.org/doi/full/10.1021/jm9700575
Requirements:
- Python 3.x
- pandas
- scikit-learn
To install the required packages:
pip install pandas scikit-learn
Files:
- descriptor_converter.py: Main Python script
- example.csv: Input file containing protein sequences (one sequence per line, no header)
- data_vhse.csv: Output file with calculated descriptors
Usage:
- Prepare your input file (example.csv) with one protein sequence per line
- Run the script:
python descriptor_converter.py
- The script will generate data_vhse.csv containing the numerical descriptors
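The input format described above (one sequence per line, no header) can be produced and checked with a few lines of Python. This is only a sketch using the standard library: the sequences, the helper names `write_sequences` and `read_sequences`, and the temporary directory are assumptions made for illustration, and the script itself may read the file differently (for example with pandas).

```python
import tempfile
from pathlib import Path

# Hypothetical example sequences, written with one-letter amino acid codes.
sequences = ["ACDE", "KLMN", "GHIK"]


def write_sequences(path, seqs):
    """Write one sequence per line, with no header row."""
    Path(path).write_text("\n".join(seqs) + "\n", encoding="ascii")


def read_sequences(path):
    """Read the sequences back, skipping blank lines and stray whitespace."""
    with open(path, encoding="ascii") as handle:
        return [line.strip() for line in handle if line.strip()]


# A temporary directory keeps the sketch from overwriting a real example.csv.
with tempfile.TemporaryDirectory() as tmp:
    input_path = Path(tmp) / "example.csv"
    write_sequences(input_path, sequences)
    loaded = read_sequences(input_path)

print(loaded)
```

Reading the file back before running the converter is a cheap way to confirm that no header row or empty line slipped in.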
You can modify the descriptor type by changing the parameter in the script:
# For VHSE descriptors (8 values per amino acid)
data_descriptors = get_descriptors(data_list, descriptor='vhse')
# For Z-scales descriptors (5 values per amino acid)
data_descriptors = get_descriptors(data_list, descriptor='zscales')
Main functions:
- assign_descriptor(sequence, descriptor): Converts a single protein sequence to descriptors
- get_descriptors(data_list, descriptor): Processes a list of sequences
- norm_features(data): Normalizes the descriptors using MinMaxScaler
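To illustrate how the three functions listed above could fit together, here is a minimal sketch. It is not the script's actual implementation. Several things are assumptions: the functions here take a lookup table instead of a descriptor name, `TOY_TABLE` holds placeholder numbers (not the published VHSE or Z-scale values), the scaling is written by hand to mimic column-wise min-max scaling to [0, 1] rather than calling MinMaxScaler, and all sequences are assumed to have the same length.

```python
# Placeholder 2-value table for illustration only.
# These are NOT the published VHSE or Z-scale values.
TOY_TABLE = {
    "A": [0.0, 1.0],
    "C": [2.0, 3.0],
    "D": [4.0, 5.0],
    "E": [6.0, 7.0],
}


def assign_descriptor(sequence, table):
    """Concatenate the per-residue values for every residue in one sequence."""
    values = []
    for residue in sequence.upper():
        if residue not in table:
            raise ValueError(f"Unknown residue {residue!r} in {sequence!r}")
        values.extend(table[residue])
    return values


def get_descriptors(data_list, table):
    """Apply assign_descriptor to each sequence in a list."""
    return [assign_descriptor(seq, table) for seq in data_list]


def norm_features(rows):
    """Column-wise min-max scaling to [0, 1]; constant columns map to 0.0."""
    columns = list(zip(*rows))  # assumes every row has the same length
    lows = [min(col) for col in columns]
    spans = [max(col) - min(col) for col in columns]
    return [
        [(v - lo) / span if span else 0.0 for v, lo, span in zip(row, lows, spans)]
        for row in rows
    ]


raw = get_descriptors(["AC", "DE"], TOY_TABLE)
scaled = norm_features(raw)
print(raw)
print(scaled)
```

Passing the table as an argument keeps the sketch independent of the real descriptor values; the script instead selects the table through the `descriptor` parameter.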
The output file contains numerical descriptors where:
- Each row represents one protein sequence
- Each column represents a descriptor value
- For VHSE: 8 values per amino acid
- For Z-scales: 5 values per amino acid
Example, for the input protein sequence "ACDE":
- With VHSE, this would generate 4×8 = 32 descriptor values
- With Z-scales, this would generate 4×5 = 20 descriptor values
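The arithmetic above generalizes to any sequence: the number of descriptor values is the sequence length times the number of values per amino acid. A small sketch, where the helper name `descriptor_length` and the dictionary `VALUES_PER_RESIDUE` are hypothetical and not part of the script:

```python
# Values per amino acid, as stated in this README.
VALUES_PER_RESIDUE = {"vhse": 8, "zscales": 5}


def descriptor_length(sequence, descriptor):
    """Return how many descriptor values a sequence would produce."""
    return len(sequence) * VALUES_PER_RESIDUE[descriptor]


print(descriptor_length("ACDE", "vhse"))     # 32
print(descriptor_length("ACDE", "zscales"))  # 20
```

Comparing this number with the column count of a row in the output file is a quick sanity check that every residue was converted.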
If using this tool for research, please cite the relevant descriptor papers:
- VHSE: Mei H, Liao ZH, Zhou Y, Li SZ. A new set of amino acid descriptors and its application in peptide QSARs. Biopolymers. 2005.
- Z-scales: Sandberg M, Eriksson L, Jonsson J, Sjöström M, Wold S. New chemical descriptors relevant for the design of biologically active peptides. A multivariate characterization of 87 amino acids. Journal of Medicinal Chemistry. 1998.