Merge pull request #142 from tobyhodges/heading-levels
standardise heading levels
hoytpr authored Apr 11, 2022
2 parents 2b86f1a + 2f7e833 commit f9007c0
Showing 3 changed files with 65 additions and 65 deletions.
18 changes: 9 additions & 9 deletions _episodes/01-tidiness.md
@@ -14,17 +14,17 @@ keypoints:
- "Tabular data needs to be structured to be able to work with it effectively."
---

-# Introduction
+## Introduction

-When we think about the data for a sequencing project, we often start by thinking about the sequencing data that we get back from the sequencing center, but just as important, if not more so, is the data you've generated about the sequences before it ever goes to the sequencing center. This is the data about the data, often called the metadata. Without the information about what you sequenced, the sequence data itself is useless.
+When we think about the data for a sequencing project, we often start by thinking about the sequencing data that we get back from the sequencing center, but just as important, if not more so, is the data you've generated about the sequences before it ever goes to the sequencing center. This is the data about the data, often called the metadata. Without the information about what you sequenced, the sequence data itself is useless.

> ## Discussion
> With the person next to you, discuss:
>
> What kinds of data and information have you generated before you sent your DNA/RNA off for sequencing?
>
> > ## Solution
-> > Types of files and information you have generated:
+> > Types of files and information you have generated:
> > - Spreadsheet or tabular data with the data from your experiment and whatever you were measuring for your study.
> > - Lab notebook notes about how you conducted those experiments.
> > - Spreadsheet or tabular data about the samples you sent off for sequencing. Sequencing centers often have a particular format they need with the name of the sample, DNA concentration and other information.
@@ -70,7 +70,7 @@ consistent and can be used across the field.
>
{: .callout}

-### Structuring data in spreadsheets
+## Structuring data in spreadsheets

Independent of the type of data you're collecting, there are standard ways to enter that data into the spreadsheet, to make it easier to analyze later. We often enter data that makes it easy for us as humans to read and work with it, because we're human! Computers need data structured in a way that they can use it. So to use this data in a computational workflow, we need to think like computers when we use spreadsheets.

@@ -81,26 +81,26 @@ The cardinal rules of using spreadsheet programs for data:
- Put all your variables in columns - the things that vary between samples, like ‘strain’ or ‘DNA-concentration’.
- Have column names be explanatory, but without spaces. Use '-', '_' or [camel case](https://en.wikipedia.org/wiki/Camel_case) instead of a space. For instance, 'library-prep-method' or 'LibraryPrep' is better than 'library preparation method' or 'prep', because computers interpret spaces in particular ways.
- Do not combine multiple pieces of information in one cell. Sometimes it just seems like one thing, but think if that’s the only way
-you’ll want to be able to use or sort that data. For example, instead of having a column with species and strain name (e.g. *E. coli*
-K12) you would have one column with the species name (*E. coli*) and another with the strain name (K12). Depending on the type of
+you’ll want to be able to use or sort that data. For example, instead of having a column with species and strain name (e.g. *E. coli*
+K12) you would have one column with the species name (*E. coli*) and another with the strain name (K12). Depending on the type of
analysis you want to do, you may even separate the genus and species names into distinct columns.
- Export the cleaned data to a text-based format like CSV (comma-separated values) format. This ensures that anyone can use the data, and is required by most data repositories.
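
The rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the lesson or the commit: the two records, the column names, and the helper functions (`clean_column_name`, `tidy`, `to_csv`) are hypothetical, and the split assumes the strain is always the last space-separated token in the cell.

```python
import csv
import io
import re

# Hypothetical messy records: a column name containing spaces and a cell
# that combines species and strain.
messy_rows = [
    {"organism": "E. coli K12", "library preparation method": "paired-end"},
    {"organism": "E. coli REL606", "library preparation method": "single-end"},
]


def clean_column_name(name):
    """Collapse each run of whitespace in a column name into one hyphen."""
    return re.sub(r"\s+", "-", name.strip())


def tidy(row):
    """Split the combined organism cell and clean the remaining column names."""
    # Assumption: the strain is the final space-separated token of the cell.
    species, strain = row["organism"].rsplit(" ", 1)
    tidy_row = {"species": species, "strain": strain}
    for name, value in row.items():
        if name != "organism":
            tidy_row[clean_column_name(name)] = value
    return tidy_row


def to_csv(rows):
    """Serialize a list of dicts that share the same keys as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


tidy_rows = [tidy(row) for row in messy_rows]
csv_text = to_csv(tidy_rows)
print(csv_text)
```

Keeping species and strain in separate columns means either one can be sorted or filtered on its own, and the CSV text can be read by any downstream tool.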

[![Messy spreadsheet](../fig/01_tidiness_datasheet_example_messy.png)](https://github.com/datacarpentry/organization-genomics/raw/gh-pages/files/Ecoli_metadata_composite_messy.xlsx)

> ## Discussion
-> This is some potential spreadsheet data generated about a sequencing experiment. With the person next to you, for about 2 minutes, discuss some of the problems with the spreadsheet data shown above. You can look at the image, or download the file to your computer via this [link](https://github.com/datacarpentry/organization-genomics/raw/gh-pages/files/Ecoli_metadata_composite_messy.xlsx) and open it in a spreadsheet reader like Excel.
+> This is some potential spreadsheet data generated about a sequencing experiment. With the person next to you, for about 2 minutes, discuss some of the problems with the spreadsheet data shown above. You can look at the image, or download the file to your computer via this [link](https://github.com/datacarpentry/organization-genomics/raw/gh-pages/files/Ecoli_metadata_composite_messy.xlsx) and open it in a spreadsheet reader like Excel.
>
>
> > ## Solution
> > A full list of the types of issues with spreadsheet data is at the [Data Carpentry Ecology spreadsheet lesson](http://www.datacarpentry.org/spreadsheet-ecology-lesson/02-common-mistakes/). Not all are present in this example. Discuss with the group what they found. Some problems include: not all datasets have the same columns, datasets are split into their own tables, color is used to encode information, column names differ, and some column names contain spaces. Here is a "clean" version of the same spreadsheet:
> >
> >[Cleaned spreadsheet](https://raw.githubusercontent.com/datacarpentry/wrangling-genomics/gh-pages/files/Ecoli_metadata_composite.tsv)
-> >Download the file using right-click (PC)/command-click (Mac).
+> >Download the file using right-click (PC)/command-click (Mac).
> {: .solution}
{: .challenge}

-## Further notes on data tidiness
+### Further notes on data tidiness

Data organization at this point of your experiment will help facilitate your analysis later, as well as prepare your data and notes for data deposition now often required by journals and funding agencies. If this is a collaborative project, as most projects are now, it's also information that collaborators will need to interpret your data and results and is very useful for communication and efficiency.

32 changes: 16 additions & 16 deletions _episodes/02-project-planning.md
@@ -8,7 +8,7 @@ questions:
- "What are the guidelines for data storage?"
objectives:
- Understand the data we send to and get back from a sequencing center.
-- Make decisions about how (if) data will be stored, archived, shared, etc.
+- Make decisions about how (if) data will be stored, archived, shared, etc.
keypoints:
- "Data being sent to a sequencing center also needs to be structured so you can use it."
- "Raw sequencing data should be kept raw somewhere, so you can always go back to the original files."
@@ -18,7 +18,7 @@ There are a variety of ways to work with a large sequencing dataset. You may be
bioinformatics tools beyond doing BLAST searches. You may have bioinformatics experience with other types of data
and are working with high-throughput (NGS) sequence data for the first time. In the most important ways, the
methods and approaches we need in bioinformatics are the same ones we need at the bench or in the field -
-*planning, documenting, and organizing* are the key to good reproducible science.
+*planning, documenting, and organizing* are the key to good reproducible science.

> ## Discussion
>
@@ -28,14 +28,14 @@ methods and approaches we need in bioinformatics are the same ones we need at th
>
> **Working with sequence data**
>
-> What challenges do you think you'll face (or have already faced) in working with a large sequence dataset?
-> What is your strategy for saving and sharing your sequence files?
-> How can you be sure that your raw data have not been unintentionally corrupted?
-> Where/how will you (did you) analyze your data - what software, what computer(s)?
+> What challenges do you think you'll face (or have already faced) in working with a large sequence dataset?
+> What is your strategy for saving and sharing your sequence files?
+> How can you be sure that your raw data have not been unintentionally corrupted?
+> Where/how will you (did you) analyze your data - what software, what computer(s)?
{: .challenge}


-# Sending samples to the facility
+## Sending samples to the facility

The first step in sending your sample for sequencing will be to complete a form documenting the metadata for the
facility. Take a look at the following example submission spreadsheet.
@@ -58,14 +58,14 @@ with Excel or another spreadsheet program.
> > - Capitalization of the replicate column changes
> > - Volume and concentration column headers have unusual (not allowed) characters
> > - Volume, concentration, and RIN column decimal accuracy changes
-> > - The prep_date and ship_date formats are different, and prep_date has multiple formats
+> > - The prep_date and ship_date formats are different, and prep_date has multiple formats
> > - Are there others not mentioned?
> >
> > Improvements in naming
> > - Shorten client_sample_id names, and maybe just call them "names"
> > - For example: "wt" for "wild-type". Also, they are all "1hr", so that is superfluous information
> > - The prep_date and ship_date might not be needed
-> > - Use "microliters" for "Volume (µL)" etc.
+> > - Use "microliters" for "Volume (µL)" etc.
> >
> > Errors hard to spot:
> > - No space between "wild" and "type", repeated barcode numbers, missing data, duplicate names
@@ -74,7 +74,7 @@ with Excel or another spreadsheet program.
> {: .solution}
{: .challenge}
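
Some of the hard-to-spot errors listed in the solution above (repeated barcode numbers, missing data, duplicate names) can be caught with a short script. The sketch below is an assumption-laden illustration rather than part of the lesson: the sample rows, the `barcode` column name, and both helper functions are hypothetical.

```python
from collections import Counter

# Hypothetical rows from a submission sheet; the column names are illustrative.
samples = [
    {"client_sample_id": "wt_A", "barcode": "101"},
    {"client_sample_id": "wt_B", "barcode": "102"},
    {"client_sample_id": "wt_A", "barcode": "102"},
    {"client_sample_id": "mut_A", "barcode": ""},
]


def duplicated_values(rows, column):
    """Return the sorted non-empty values that occur more than once in a column."""
    counts = Counter(row[column].strip() for row in rows if row[column].strip())
    return sorted(value for value, n in counts.items() if n > 1)


def rows_missing(rows, column):
    """Return the 1-based positions of rows whose value in a column is empty."""
    return [i for i, row in enumerate(rows, start=1) if not row[column].strip()]


print(duplicated_values(samples, "client_sample_id"))
print(duplicated_values(samples, "barcode"))
print(rows_missing(samples, "barcode"))
```

Running checks like these before shipping samples is cheaper than discovering a repeated barcode after the sequencing run.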

-# Retrieving sample sequencing data from the facility
+## Retrieving sample sequencing data from the facility

When the data come back from the sequencing facility, you will receive some documentation (metadata) as well as
the sequence files themselves. Download and examine the following example file - here provided as a text file and
@@ -89,20 +89,20 @@ Excel file:
> 2. If you wanted to relate file names to the sample names submitted above (e.g. wild type...) could you do so?
> 3. What do the \_R1/\_R2 extensions mean in the file names?
> 4. What does the '.gz' extension on the filenames indicate?
-> 5. What is the total file size - what challenges in downloading and sharing these data might exist?
+> 5. What is the total file size - what challenges in downloading and sharing these data might exist?
>
> > ## Solution
> >
> > 1. Samples are organized by sample_id
> > 2. To relate filenames use the sample_id, and do a VLOOKUP on submission sheet
> > 3. The \_R1/\_R2 extensions mean "Read 1" and "Read 2" of each sample
> > 4. The '.gz' extension means it is a compressed "gzip" type format to save disk space
-> > 5. The size of all the files combined is 1113.60 Gb (over a terabyte!). To transfer files this large you should validate the file size following transfer. Absolute file integrity checks following transfers and methods for faster file transfers are possible but beyond the scope of this lesson.
+> > 5. The size of all the files combined is 1113.60 Gb (over a terabyte!). To transfer files this large you should validate the file size following transfer. Absolute file integrity checks following transfers and methods for faster file transfers are possible but beyond the scope of this lesson.
> >
> {: .solution}
{: .challenge}
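
The lesson leaves file integrity checks out of scope, but the basic idea can be sketched briefly: compute a checksum of the file before and after transfer and compare the two. The code below is a minimal illustration, assuming the sender publishes a SHA-256 digest; the function names and the chunk size are hypothetical choices, and many facilities provide MD5 digests instead.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks.

    Reading in chunks keeps memory use low for multi-gigabyte files.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_is_intact(path, expected_digest):
    """Compare a file's digest with the digest recorded before the transfer."""
    return sha256_of_file(path) == expected_digest.strip().lower()
```

A matching file size only shows that no bytes were lost; a matching digest also shows that none of them changed.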

-# Storing data
+## Storing data

The raw data you get back from the sequencing center is the foundation of your sequencing analysis. You need to keep this data, so that you can always come back to it if there are any questions or you need to re-run an analysis, or try a new analysis approach.

@@ -120,19 +120,19 @@ If you do not have access to these resources, you can back up on hard drives. Ha

You can also use resources like [Amazon S3](https://aws.amazon.com/s3/), [Microsoft Azure](https://azure.microsoft.com/en-us/pricing/details/storage/blobs/), [Google Cloud](https://cloud.google.com/storage/) or others for cloud storage. The [open science framework](https://osf.io) is a free option for storing files up to 5 GB. See more in the lesson ["Introduction to Cloud Computing for Genomics"](http://www.datacarpentry.org/cloud-genomics/04-which-cloud/).

-# Summary
+## Summary

Before analysis of data has begun, there are already many potential areas for errors and omissions. Keeping
organized and keeping a critical eye can help catch mistakes.

One of Data Carpentry's goals is to help you achieve *competency* in working with bioinformatics. This means that
you can accomplish routine tasks, under normal conditions, in an acceptable amount of time. While an expert might
be able to get to a solution on instinct alone - taking your time, using Google or another Internet search engine,
-and asking for help are all valid ways of solving your problems. As you complete the lessons you'll be able to use all of those methods more efficiently.
+and asking for help are all valid ways of solving your problems. As you complete the lessons you'll be able to use all of those methods more efficiently.

> ## Where to go from here?
>
-> More reading about core competencies
+> More reading about core competencies
>
>L. Welch, F. Lewitter, R. Schwartz, C. Brooksbank, P. Radivojac, B. Gaeta and M. Schneider, '[Bioinformatics Curriculum Guidelines: Toward a Definition of Core Competencies](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3945096/)', PLoS Comput Biol, vol. 10, no. 3, p. e1003496, 2014.
>