Commit

small bug fix
Tobias Hofmann committed May 11, 2018
1 parent 030cda2 commit e31aab9
Showing 13 changed files with 51 additions and 628 deletions.
123 changes: 14 additions & 109 deletions docs/notebook/.ipynb_checkpoints/cleaning_trimming-checkpoint.ipynb
@@ -17,9 +17,7 @@
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"metadata": {},
"outputs": [
{
"name": "stdout",
@@ -59,96 +57,15 @@
"metadata": {},
"source": [
"### 2. Quality-check your raw (and dirty) reads\n",
"To convince yourself that the raw reads are not fit for further processing, it is a good idea to first run some quality tests on the raw fastq files. I'll show you how:\n",
"\n",
"#### a) Prepare text-file with file-paths\n",
"Prepare a file that contains the file paths (absolute paths or relative to your work_dir) of all cleaned fastq-files of interest (you can do it manually or use the following commands, after inserting the correct path to your output folder in the bash for-loop)\n",
"\n",
"<div class=\"alert alert-block alert-info\">**Adjust path:** Replace `../../data/raw/fastq/*/` with the path to your folder containing the raw reads</div>\n",
"\n",
" for dir in ../../data/raw/fastq/*R1.fastq; do echo $dir; done > ../../data/processed/raw_fastq_file_list.txt\n",
" \n",
" for dir in ../../data/raw/fastq/*R2.fastq; do echo $dir; done >> ../../data/processed/raw_fastq_file_list.txt\n",
" \n",
"After successfully running these for-loops, the resulting file `raw_fastq_file_list.txt` should contain the path to all samples that you want to quality check:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"../../data/raw/fastq/1061_R1.fastq\n",
"../../data/raw/fastq/1063_R1.fastq\n",
"../../data/raw/fastq/1064_R1.fastq\n",
"../../data/raw/fastq/1065_R1.fastq\n",
"../../data/raw/fastq/1068_R1.fastq\n",
" ... \n",
"../../data/raw/fastq/1140_R2.fastq\n",
"../../data/raw/fastq/1164_R2.fastq\n",
"../../data/raw/fastq/1165_R2.fastq\n",
"../../data/raw/fastq/1166_R2.fastq\n",
"../../data/raw/fastq/1167_R2.fastq\n"
]
}
],
"source": [
"%%bash\n",
"head -n 5 ../../data/processed/raw_fastq_file_list.txt\n",
"echo ' ... '\n",
"tail -n 5 ../../data/processed/raw_fastq_file_list.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### b) Run `fastqc` for quality check"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now run `fastqc` for all fastq files in order to produce a quality check for all these samples. Make sure to create the output directory manually before running the command (otherwise `fastqc` will return an error).\n",
"\n",
" mkdir ../../data/processed/fastqc_results/raw\n",
" fastqc -o ../../data/processed/fastqc_results/raw -f fastq $(cat ../../data/processed/raw_fastq_file_list.txt) \n",
"To convince yourself that the raw reads are not fit for further processing, it is a good idea to first run some quality tests on the raw fastq files. This is easy and straightforward with the `secapr quality_check` function:\n",
"\n",
"\n",
"`fastqc` produces two output files per sample: one zip-archive and one .html file. The easiest way to look at the test results of a specific file is to open the html file in your favorite html-reader (e.g. Firefox, Safari, etc.)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 3. Visualize results"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since it is somewhat cumbersome to look at all test results for all samples by manually checking all html files, we provide an R-script (in the `src/` folder of the `secapr` GitHub project) which produces a graphical overview of the test results of all samples. This makes it easier to see which samples passed which tests (rather than having to go open each individual report).\n",
"\n",
" Rscript ../../src/fastqc_visualization.r -i ../../data/processed/fastqc_results/raw -o ../../data/processed/fastqc_results/raw/summary_all_samples_raw.pdf\n",
"\n",
"This is what the quality check results look like for the uncleaned reads: "
"`secapr quality_check --input ../../data/raw/fastq/ --output ../../data/processed/fastqc_results/raw`"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
@@ -211,7 +128,6 @@
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
@@ -323,9 +239,7 @@
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"metadata": {},
"outputs": [
{
"name": "stdout",
@@ -395,7 +309,7 @@
"Let's run the script as in this example command:\n",
"<div class=\"alert alert-block alert-warning\">**Please check:** Is `secapr_env` activated? You can test with `conda info --envs`. Activate the correct environment with `source activate secapr_env`</div>\n",
"\n",
" secapr clean_reads --input data/raw/fastq/ --config data/raw/adapter_info.txt --output data/processed/cleaned_trimmed_reads_default --index single\n",
" secapr clean_reads --input ../../data/raw/fastq/ --config ../../data/raw/adapter_info.txt --output ../../data/processed/cleaned_trimmed_reads_default --index single\n",
" \n",
"`secapr clean_reads` produces a subfolder for each sample in the output directory, containing the cleaned reads for the respective sample."
]
@@ -405,24 +319,15 @@
"metadata": {},
"source": [
"#### c) Check quality of the results\n",
"After cleaning the reads with `secapr clean_reads` with default settings we again perform the quality tests on all cleaned files, just as we did above for the raw reads. Therefore we create a text file with the file paths to all cleaned fastq files and run `fastqc` by providing this list of files. Note that the `secapr clean_reads` function named all forward-read files with the tag '_READ1_' and all backward-read files with '_READ2_'.\n",
"\n",
" for dir in ../../data/processed/cleaned_trimmed_reads_default/*/*READ1.fastq; do echo $dir; done > ../../data/processed/fastq_file_list.txt\n",
" \n",
" for dir in ../../data/processed/cleaned_trimmed_reads_default/*/*READ2.fastq; do echo $dir; done >> ../../data/processed/fastq_file_list.txt\n",
"After cleaning the reads with `secapr clean_reads` with default settings we again perform the quality tests on all cleaned files, just as we did above for the raw reads. \n",
"\n",
" mkdir ../../data/processed/fastqc_results/cleaned_default_settings\n",
" fastqc -o ../../data/processed/fastqc_results/cleaned_default_settings -f fastq $(cat ../../data/processed/fastq_file_list.txt)\n",
" \n",
"Finally we run our R-script which plots the results:"
"`secapr quality_check --input ../../data/processed/cleaned_trimmed_reads_default --output ../../data/processed/fastqc_results/cleaned_default_settings`"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": false
},
"metadata": {},
"outputs": [
{
"name": "stdout",
@@ -480,17 +385,17 @@
"source": [
"As we see above, running the script with default settings improved the file quality but there is a lot of room for improvement. After reviewing the initial quality reports and after trying a bunch of different flags and values, I ended up with this command for the example data. See the script documentation for more information about the different flags (`secapr clean_reads -h`).\n",
"\n",
" secapr clean_reads --input data/raw/fastq/ --config data/raw/adapter_info.txt --output data/processed/cleaned_trimmed_reads --index single --simpleClipThreshold 5 --palindromeClipThreshold 20 --seedMismatches 5 --headCrop 10\n",
" secapr clean_reads --input ../../data/raw/fastq/ --config ../../data/raw/adapter_info.txt --output ../../data/processed/cleaned_trimmed_reads --index single --simpleClipThreshold 5 --palindromeClipThreshold 20 --seedMismatches 5 --headCrop 10\n",
"\n",
"For producing the plots you repeat the commands from above (3. Clean reads with `secapr` (default settings))."
"Let's check the final quality of the data:\n",
"\n",
" secapr quality_check --input ../../data/processed/cleaned_trimmed_reads --output ../../data/processed/fastqc_results/custom_default_settings"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"metadata": {},
"outputs": [
{
"name": "stdout",
@@ -559,7 +464,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.0"
"version": "3.6.4"
}
},
"nbformat": 4,
123 changes: 14 additions & 109 deletions docs/notebook/cleaning_trimming.ipynb
@@ -17,9 +17,7 @@
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"metadata": {},
"outputs": [
{
"name": "stdout",
@@ -59,96 +57,15 @@
"metadata": {},
"source": [
"### 2. Quality-check your raw (and dirty) reads\n",
"To convince yourself that the raw reads are not fit for further processing, it is a good idea to first run some quality tests on the raw fastq files. I'll show you how:\n",
"\n",
"#### a) Prepare text-file with file-paths\n",
"Prepare a file that contains the file paths (absolute paths or relative to your work_dir) of all cleaned fastq-files of interest (you can do it manually or use the following commands, after inserting the correct path to your output folder in the bash for-loop)\n",
"\n",
"<div class=\"alert alert-block alert-info\">**Adjust path:** Replace `../../data/raw/fastq/*/` with the path to your folder containing the raw reads</div>\n",
"\n",
" for dir in ../../data/raw/fastq/*R1.fastq; do echo $dir; done > ../../data/processed/raw_fastq_file_list.txt\n",
" \n",
" for dir in ../../data/raw/fastq/*R2.fastq; do echo $dir; done >> ../../data/processed/raw_fastq_file_list.txt\n",
" \n",
"After successfully running these for-loops, the resulting file `raw_fastq_file_list.txt` should contain the path to all samples that you want to quality check:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"../../data/raw/fastq/1061_R1.fastq\n",
"../../data/raw/fastq/1063_R1.fastq\n",
"../../data/raw/fastq/1064_R1.fastq\n",
"../../data/raw/fastq/1065_R1.fastq\n",
"../../data/raw/fastq/1068_R1.fastq\n",
" ... \n",
"../../data/raw/fastq/1140_R2.fastq\n",
"../../data/raw/fastq/1164_R2.fastq\n",
"../../data/raw/fastq/1165_R2.fastq\n",
"../../data/raw/fastq/1166_R2.fastq\n",
"../../data/raw/fastq/1167_R2.fastq\n"
]
}
],
"source": [
"%%bash\n",
"head -n 5 ../../data/processed/raw_fastq_file_list.txt\n",
"echo ' ... '\n",
"tail -n 5 ../../data/processed/raw_fastq_file_list.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### b) Run `fastqc` for quality check"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now run `fastqc` for all fastq files in order to produce a quality check for all these samples. Make sure to create the output directory manually before running the command (otherwise `fastqc` will return an error).\n",
"\n",
" mkdir ../../data/processed/fastqc_results/raw\n",
" fastqc -o ../../data/processed/fastqc_results/raw -f fastq $(cat ../../data/processed/raw_fastq_file_list.txt) \n",
"To convince yourself that the raw reads are not fit for further processing, it is a good idea to first run some quality tests on the raw fastq files. This is easy and straightforward with the `secapr quality_check` function:\n",
"\n",
"\n",
"`fastqc` produces two output files per sample: one zip-archive and one .html file. The easiest way to look at the test results of a specific file is to open the html file in your favorite html-reader (e.g. Firefox, Safari, etc.)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 3. Visualize results"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since it is somewhat cumbersome to look at all test results for all samples by manually checking all html files, we provide an R-script (in the `src/` folder of the `secapr` GitHub project) which produces a graphical overview of the test results of all samples. This makes it easier to see which samples passed which tests (rather than having to go open each individual report).\n",
"\n",
" Rscript ../../src/fastqc_visualization.r -i ../../data/processed/fastqc_results/raw -o ../../data/processed/fastqc_results/raw/summary_all_samples_raw.pdf\n",
"\n",
"This is what the quality check results look like for the uncleaned reads: "
"`secapr quality_check --input ../../data/raw/fastq/ --output ../../data/processed/fastqc_results/raw`"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
@@ -211,7 +128,6 @@
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
@@ -323,9 +239,7 @@
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"metadata": {},
"outputs": [
{
"name": "stdout",
@@ -395,7 +309,7 @@
"Let's run the script as in this example command:\n",
"<div class=\"alert alert-block alert-warning\">**Please check:** Is `secapr_env` activated? You can test with `conda info --envs`. Activate the correct environment with `source activate secapr_env`</div>\n",
"\n",
" secapr clean_reads --input data/raw/fastq/ --config data/raw/adapter_info.txt --output data/processed/cleaned_trimmed_reads_default --index single\n",
" secapr clean_reads --input ../../data/raw/fastq/ --config ../../data/raw/adapter_info.txt --output ../../data/processed/cleaned_trimmed_reads_default --index single\n",
" \n",
"`secapr clean_reads` produces a subfolder for each sample in the output directory, containing the cleaned reads for the respective sample."
]
@@ -405,24 +319,15 @@
"metadata": {},
"source": [
"#### c) Check quality of the results\n",
"After cleaning the reads with `secapr clean_reads` with default settings we again perform the quality tests on all cleaned files, just as we did above for the raw reads. Therefore we create a text file with the file paths to all cleaned fastq files and run `fastqc` by providing this list of files. Note that the `secapr clean_reads` function named all forward-read files with the tag '_READ1_' and all backward-read files with '_READ2_'.\n",
"\n",
" for dir in ../../data/processed/cleaned_trimmed_reads_default/*/*READ1.fastq; do echo $dir; done > ../../data/processed/fastq_file_list.txt\n",
" \n",
" for dir in ../../data/processed/cleaned_trimmed_reads_default/*/*READ2.fastq; do echo $dir; done >> ../../data/processed/fastq_file_list.txt\n",
"After cleaning the reads with `secapr clean_reads` with default settings we again perform the quality tests on all cleaned files, just as we did above for the raw reads. \n",
"\n",
" mkdir ../../data/processed/fastqc_results/cleaned_default_settings\n",
" fastqc -o ../../data/processed/fastqc_results/cleaned_default_settings -f fastq $(cat ../../data/processed/fastq_file_list.txt)\n",
" \n",
"Finally we run our R-script which plots the results:"
"`secapr quality_check --input ../../data/processed/cleaned_trimmed_reads_default --output ../../data/processed/fastqc_results/cleaned_default_settings`"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": false
},
"metadata": {},
"outputs": [
{
"name": "stdout",
@@ -480,17 +385,17 @@
"source": [
"As we see above, running the script with default settings improved the file quality but there is a lot of room for improvement. After reviewing the initial quality reports and after trying a bunch of different flags and values, I ended up with this command for the example data. See the script documentation for more information about the different flags (`secapr clean_reads -h`).\n",
"\n",
" secapr clean_reads --input data/raw/fastq/ --config data/raw/adapter_info.txt --output data/processed/cleaned_trimmed_reads --index single --simpleClipThreshold 5 --palindromeClipThreshold 20 --seedMismatches 5 --headCrop 10\n",
" secapr clean_reads --input ../../data/raw/fastq/ --config ../../data/raw/adapter_info.txt --output ../../data/processed/cleaned_trimmed_reads --index single --simpleClipThreshold 5 --palindromeClipThreshold 20 --seedMismatches 5 --headCrop 10\n",
"\n",
"For producing the plots you repeat the commands from above (3. Clean reads with `secapr` (default settings))."
"Let's check the final quality of the data:\n",
"\n",
" secapr quality_check --input ../../data/processed/cleaned_trimmed_reads --output ../../data/processed/fastqc_results/custom_default_settings"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"metadata": {},
"outputs": [
{
"name": "stdout",
@@ -559,7 +464,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.0"
"version": "3.6.4"
}
},
"nbformat": 4,