***************************************************
****************** Word2vec.ipynb *****************
***************************************************
1- How to generate w2vec embeddings?
Open w2vec.ipynb
- Run all the cells that define the functions used to create the w2vec model, e.g. (getWord_model)
- Run the helper function that converts DNA sequences to k-mers (DNAToWOrd)
- Split the two sequences of each pair into separate lists (getDNA_split)
- Take the mean of the embeddings (getAvgFeatureVecs)
- The remaining functions write the results to the appropriate npy and csv files
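The k-mer conversion step can be sketched as below. This is a minimal illustration, not the notebook's code: the function name dna_to_kmers is hypothetical, and overlapping k-mers with a stride of 1 are an assumption, since the README does not state the stride.

```python
def dna_to_kmers(seq, k=6):
    """Split a DNA string into overlapping k-mers (stride 1 is an assumption)."""
    seq = seq.strip().upper()
    # A sequence shorter than k yields an empty list.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


kmers = dna_to_kmers("ACGTACGTA", k=6)
print(kmers)  # ['ACGTAC', 'CGTACG', 'GTACGT', 'TACGTA']
```

Each sequence then becomes a list of "words", which is the input shape a word2vec-style model expects.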
Cell 6: (the cell where these functions are used)
- Set the k-mer size (here it is 6)
kmer = 6
- Declare the path to the train or test data
- seqfile = 'cross_test_data.fa'
- Read the data (error_bad_lines is deprecated since pandas 1.3 and removed in pandas 2.0; use on_bad_lines='skip' there)
- DNAseq = pd.read_csv(seqfile,sep = "\t",error_bad_lines=False)
- Define the number of positive samples in the file
- # pos_number = 2096 # train pos samples
- pos_number = 1658 # cross test
- Split each pair of DNA sequences
- words1,words2 = getDNA_split(DNAseq,kmer)
- Get the w2vec model, where Unfile represents the k-mers present in the sequences
- word_model = getWord_model(kmer,fea_num,min_fea,model,Unfile)
Cell 7:
- After training or loading the model
- Average the features as described above
- dataDataVecs = getAvgFeatureVecs(words1,words2,word_model,fea_num)
- Convert the embeddings to csv
- The csv file contains the label and the corresponding feature vector of the sequences
fea_svm = '%d_test_vecs.svm'%(kmer)
fea_csv = '%d_test.csv'%(kmer)
npyTosvm(fea_npy, fea_svm,pos_number)
SVMtoCSV(fea_svm, fea_csv)
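The averaging and labelling steps of Cell 7 can be sketched as follows. This is a rough illustration under stated assumptions, not the notebook's implementation. All function names are hypothetical, and a plain dict stands in for the trained model. Concatenating the two averaged vectors of a pair is an assumption, as is the file listing its first pos_number rows as positives.

```python
import csv
import io


def average_vector(kmers, embeddings, dim):
    """Mean embedding of the k-mers found in the lookup; zeros if none are found."""
    vecs = [embeddings[w] for w in kmers if w in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def pair_features(words1, words2, embeddings, dim):
    """One row per pair: both averaged vectors concatenated (assumed)."""
    return [average_vector(a, embeddings, dim) + average_vector(b, embeddings, dim)
            for a, b in zip(words1, words2)]


def write_labeled_csv(rows, pos_number, handle):
    """Label the first pos_number rows 1 and the rest 0 (assumed file ordering)."""
    writer = csv.writer(handle)
    for i, row in enumerate(rows):
        writer.writerow([1 if i < pos_number else 0] + row)


# Toy 2-dimensional embeddings standing in for the trained model.
toy = {"AC": [1.0, 0.0], "CG": [0.0, 1.0]}
rows = pair_features([["AC", "CG"], ["AC"]], [["CG"], ["TT"]], toy, dim=2)
buf = io.StringIO()
write_labeled_csv(rows, pos_number=1, handle=buf)
print(buf.getvalue())
```

Because the labels depend only on row position, a wrong pos_number silently mislabels samples. The value should be checked against the input file.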
***************************************************
****************** CNN_model.ipynb ****************
***************************************************
Cell 1:
- Load all libraries
Cell 2:
- Load the default evaluation measures
- METRICS = [
keras.metrics.TruePositives(name='tp'),
keras.metrics.FalsePositives(name='fp'),
keras.metrics.TrueNegatives(name='tn'),
keras.metrics.FalseNegatives(name='fn'),
keras.metrics.BinaryAccuracy(name='accuracy'),
keras.metrics.Precision(name='precision'),
keras.metrics.Recall(name='recall'),
keras.metrics.AUC(name='auc'),
]
- Function to create the deep learning model (dnn_model())
- Inputs
- Conv1(inputs)
- cc1 = concatenate(inputs, Conv1)
- Conv2(Conv1)
- Conv3(cc1)
- cc2 = concatenate(inputs, Conv3)
- dropout on cc1, cc2
- Dense1(cc1)
- Dense2(cc2)
- cc3 = concatenate(Dense1, Dense2)
- Dense3(Conv2)
- cc4 = concatenate(cc3, Dense3)
- Dense4, sigmoid => final prediction
(Optimizer: Adam; loss: binary cross-entropy (BCE))
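The layer wiring listed above can be sketched with the Keras functional API as shown below. This is a hedged illustration of the topology only. The input shape, filter counts, kernel size, dropout rate and dense widths are all assumptions, as are "same" padding and flattening before the dense layers. The name dnn_model_sketch is hypothetical, and the metric list is shortened.

```python
from tensorflow import keras
from tensorflow.keras import layers


def dnn_model_sketch(seq_len=200, channels=1):
    """Wire the layers in the order given in the README; all sizes are assumed."""
    inputs = keras.Input(shape=(seq_len, channels))

    # "same" padding keeps the length so conv outputs can be concatenated with inputs.
    conv1 = layers.Conv1D(32, 3, padding="same", activation="relu")(inputs)
    cc1 = layers.concatenate([inputs, conv1])
    conv2 = layers.Conv1D(32, 3, padding="same", activation="relu")(conv1)
    conv3 = layers.Conv1D(32, 3, padding="same", activation="relu")(cc1)
    cc2 = layers.concatenate([inputs, conv3])

    # Dropout on cc1 and cc2, then flatten so Dense layers see one vector per sample.
    cc1_flat = layers.Flatten()(layers.Dropout(0.3)(cc1))
    cc2_flat = layers.Flatten()(layers.Dropout(0.3)(cc2))

    dense1 = layers.Dense(64, activation="relu")(cc1_flat)
    dense2 = layers.Dense(64, activation="relu")(cc2_flat)
    cc3 = layers.concatenate([dense1, dense2])
    dense3 = layers.Dense(64, activation="relu")(layers.Flatten()(conv2))
    cc4 = layers.concatenate([cc3, dense3])

    # Dense4 with a sigmoid gives the final binary prediction.
    outputs = layers.Dense(1, activation="sigmoid")(cc4)

    model = keras.Model(inputs=inputs, outputs=outputs)
    model.compile(
        optimizer="adam",
        loss="binary_crossentropy",
        metrics=[keras.metrics.BinaryAccuracy(name="accuracy"),
                 keras.metrics.AUC(name="auc")],
    )
    return model


model = dnn_model_sketch(seq_len=20, channels=4)
model.summary()
```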
Cell 3:
- 5-Fold Validation
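A 5-fold split can be sketched in plain Python as below. The README does not say which library the notebook uses or whether the folds are stratified, so this shuffled, unstratified split and the name k_fold_indices are assumptions for illustration only.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, val_indices) for each of k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Every k-th shuffled index goes to the same fold, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val


for fold_no, (train_idx, val_idx) in enumerate(k_fold_indices(10, k=5), start=1):
    print(fold_no, len(train_idx), len(val_idx))
```

In each round the model is trained from scratch on the training indices and evaluated on the held-out fold, and the metrics are then averaged over the five rounds.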
Cell 4:
- Independent testing
(NOTE: PLEASE SET THE NUMBER OF POSITIVE SAMPLES BY CROSS-CHECKING IT AGAINST THE RELEVANT FILES IN THE RESPECTIVE DIRECTORIES)