In this example we’ll be using the WGS example report used in the
MultiQC documentation. You can find view the report in your browser here,
and find the associated multiqc_data.json
file here.
Feel free to download this file and use it to follow along with this
vignette.
In the rest of the vignette we will use the
multiqc_data_path
variable to indicate the path to this
multiqc_data.json
file. Feel free to set this variable to
the path to this file on your system.
First, install the package using:
Then if you want, you can load the package:
However, for the sake of this tutorial we will explicitly use
namespaced functions, so we won’t be using library
a great
deal.
The main entry point to the TidyMultiqc
package is the
function load_multiqc
. A basic invocation of this function
looks like this:
(note the arrow for scrolling through columns)
df = TidyMultiqc::load_multiqc(multiqc_data_path)
df
#> # A tibble: 6 × 165
#> metadata.sample_id general.total_reads general.mapped_reads
#> <chr> <dbl> <dbl>
#> 1 P4107_1003 868204107 847562410
#> 2 P4107_1004 1002828927 985115356
#> 3 P4107_1005 974955793 955921317
#> 4 P4107_1002 865975844 847067526
#> 5 P4107_1006 912383669 894970438
#> 6 P4107_1001 772071557 751147332
#> # ℹ 162 more variables: general.percentage_aligned <dbl>,
#> # general.median_coverage <int>, general.median_insert_size <int>,
#> # general.avg_gc <dbl>, general.1_x_pc <dbl>, general.5_x_pc <dbl>,
#> # general.10_x_pc <dbl>, general.30_x_pc <dbl>, general.50_x_pc <dbl>,
#> # general.genome <chr>, general.number_of_variants_before_filter <dbl>,
#> # general.number_of_known_variants_brie_non_empty_id <dbl>,
#> # general.number_of_known_variants_brie_non_empty_id_percent <dbl>, …
We’ve now generated a tibble
(a kind of data frame),
whose rows are samples in the QC report, and whose columns are QC data
and metadata about these samples.
By default this function only returns the “general” statistics, which
are the ones in the “General Statistics” table at the top of the MultiQC
report. In TidyMultiqc, these statistics are all prefixed by
general.
We can also extract the “raw” statistics, which
includes some fields normally hidden from the report. These statistics
will have the prefix raw.<toolname>.
where
<toolname>
is the QC tool used to calculate it.
TidyMultiqc::load_multiqc(multiqc_data_path, sections = 'raw')
#> # A tibble: 6 × 224
#> metadata.sample_id raw.qualimap_bamqc_genome_results.…¹ raw.qualimap_bamqc_g…²
#> <chr> <chr> <dbl>
#> 1 P4107_1003 /scratch/109642/ANALYSIS/P4107/pipe… 868204107
#> 2 P4107_1004 /scratch/108816/ANALYSIS/P4107/pipe… 1002828927
#> 3 P4107_1005 /scratch/108818/ANALYSIS/P4107/pipe… 974955793
#> 4 P4107_1002 /scratch/108825/ANALYSIS/P4107/pipe… 865975844
#> 5 P4107_1006 /scratch/108820/ANALYSIS/P4107/pipe… 912383669
#> 6 P4107_1001 /scratch/108823/ANALYSIS/P4107/pipe… 772071557
#> # ℹ abbreviated names: ¹raw.qualimap_bamqc_genome_results.bam_file,
#> # ²raw.qualimap_bamqc_genome_results.total_reads
#> # ℹ 221 more variables: raw.qualimap_bamqc_genome_results.mapped_reads <dbl>,
#> # raw.qualimap_bamqc_genome_results.mapped_bases <dbl>,
#> # raw.qualimap_bamqc_genome_results.sequenced_bases <dbl>,
#> # raw.qualimap_bamqc_genome_results.mean_insert_size <dbl>,
#> # raw.qualimap_bamqc_genome_results.median_insert_size <dbl>, …
Often you won’t care about fields like
raw.qualimap_bamqc_genome_results.bam_file
, the path to the
original BAM file, but ‘raw’ at least provides this option.
You can also combine both general
and raw
sections by passing in a longer vector:
df_both = TidyMultiqc::load_multiqc(multiqc_data_path, sections = c('raw', 'general'))
ncol(df_both)
#> [1] 388
That’s a lot of columns!
This section will briefly talk about some downstream use-cases for this package.
One use for this data frame is creating QC plots. For example, to visualise the duplication rate per sample:
library(magrittr)
df %>%
ggplot2::ggplot(ggplot2::aes(x=metadata.sample_id, y=general.percent_duplication)) +
ggplot2::geom_col()
Of course, this is basically just replicating a plot already in the MultiQC report, but now we can customise it how we like!
With all this data, we might also want to test a specific hypothesis! For example, we might want to test the hypothesis that the mean GC content is the same as the mean GC content in the human genome (41%). If we assume that GC content is normally distributed, we can do the following:
t.test(df$general.percent_gc, mu=41)
#>
#> One Sample t-test
#>
#> data: df$general.percent_gc
#> t = -1, df = 5, p-value = 0.3632
#> alternative hypothesis: true mean is not equal to 41
#> 95 percent confidence interval:
#> 40.40490 41.26176
#> sample estimates:
#> mean of x
#> 40.83333
It seems that we cannot reject this hypothesis, so these may well be human samples!
It may be the case that your samples have important metadata that you
want in your data frame. For example, it is common for the sample names
or file names to be composed of a number of metadata fields, and indeed
the report we are working with has IDs such as P4107_1003
,
which is composed of two identifiers.
To include this metadata in our output, we need to provide the
find_metadata
argument to load_multiqc
, which
is a function that is called for each sample, and which returns a named
vector of metadata fields for that sample. It also gets passed the
entire parsed MultiQC JSON report, so the function can traverse the
structure as it wants to extract metadata.
Here is an example that parses the input file names to annotate
additional metadata. Notice that the first argument our function is
passed is a string which is the sample identifier for a sample, and how
in this example we ignore the parsed
argument. Also notice
that the names we give to our return value (“batch” and “sample”) are
prefixed by “metadata” to become the final names in the data frame.
TidyMultiqc::load_multiqc(
multiqc_data_path,
find_metadata = function(sample, parsed) {
# Split the sample ID to obtain some metadata
segments <- stringr::str_split(sample, "_")[[1]]
c(
batch = segments[[1]],
sample = segments[[2]]
)
}
)
#> # A tibble: 6 × 167
#> metadata.sample_id metadata.batch metadata.sample general.total_reads
#> <chr> <chr> <chr> <dbl>
#> 1 P4107_1003 P4107 1003 868204107
#> 2 P4107_1004 P4107 1004 1002828927
#> 3 P4107_1005 P4107 1005 974955793
#> 4 P4107_1002 P4107 1002 865975844
#> 5 P4107_1006 P4107 1006 912383669
#> 6 P4107_1001 P4107 1001 772071557
#> # ℹ 163 more variables: general.mapped_reads <dbl>,
#> # general.percentage_aligned <dbl>, general.median_coverage <int>,
#> # general.median_insert_size <int>, general.avg_gc <dbl>,
#> # general.1_x_pc <dbl>, general.5_x_pc <dbl>, general.10_x_pc <dbl>,
#> # general.30_x_pc <dbl>, general.50_x_pc <dbl>, general.genome <chr>,
#> # general.number_of_variants_before_filter <dbl>,
#> # general.number_of_known_variants_brie_non_empty_id <dbl>, …
We can extend this approach, but this time actually look up the file
paths within the report_data_sources
section of the MultiQC
report:
TidyMultiqc::load_multiqc(
multiqc_data_path,
find_metadata = function(sample, parsed) {
# This gives us the path to the fastqc output file
filepath = parsed$report_data_sources$FastQC$all_sections[[sample]]
# Split into path segments
path_segments = stringr::str_split(filepath, "/")[[1]]
# The filename is the last path segment
filename = dplyr::last(path_segments)
# Split the filename using dots and underscores
name_segments = stringr::str_split(filename, "[_\\.]")[[1]]
# Arbitrarily assign names for the outputs
name_segments %>% purrr::set_names(LETTERS[1:length(name_segments)])
}
)
#> # A tibble: 6 × 173
#> metadata.sample_id metadata.A metadata.B metadata.C metadata.D metadata.E
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 P4107_1003 P4107 1003 S3 L004 R1
#> 2 P4107_1004 P4107 1004 S4 L005 R1
#> 3 P4107_1005 P4107 1005 S5 L006 R2
#> 4 P4107_1002 P4107 1002 S2 L003 R2
#> 5 P4107_1006 P4107 1006 S6 L007 R2
#> 6 P4107_1001 P4107 1001 S1 L002 R2
#> # ℹ 167 more variables: metadata.F <chr>, metadata.G <chr>, metadata.H <chr>,
#> # general.total_reads <dbl>, general.mapped_reads <dbl>,
#> # general.percentage_aligned <dbl>, general.median_coverage <int>,
#> # general.median_insert_size <int>, general.avg_gc <dbl>,
#> # general.1_x_pc <dbl>, general.5_x_pc <dbl>, general.10_x_pc <dbl>,
#> # general.30_x_pc <dbl>, general.50_x_pc <dbl>, general.genome <chr>,
#> # general.number_of_variants_before_filter <dbl>, …
Of course in a real application we would choose specific names for each field.
Finally, we might want to include metadata that doesn’t relate to the
sample at all. For example, MultiQC has a number of report fields
prefixed by config_
that we might want to store:
TidyMultiqc::load_multiqc(
multiqc_data_path,
find_metadata = function(sample, parsed) {
parsed[c(
"config_creation_date",
"config_version"
)]
}
)
#> # A tibble: 6 × 167
#> metadata.sample_id metadata.config_creation_date metadata.config_version
#> <chr> <chr> <chr>
#> 1 P4107_1003 2021-03-08, 21:54 1.10
#> 2 P4107_1004 2021-03-08, 21:54 1.10
#> 3 P4107_1005 2021-03-08, 21:54 1.10
#> 4 P4107_1002 2021-03-08, 21:54 1.10
#> 5 P4107_1006 2021-03-08, 21:54 1.10
#> 6 P4107_1001 2021-03-08, 21:54 1.10
#> # ℹ 164 more variables: general.total_reads <dbl>, general.mapped_reads <dbl>,
#> # general.percentage_aligned <dbl>, general.median_coverage <int>,
#> # general.median_insert_size <int>, general.avg_gc <dbl>,
#> # general.1_x_pc <dbl>, general.5_x_pc <dbl>, general.10_x_pc <dbl>,
#> # general.30_x_pc <dbl>, general.50_x_pc <dbl>, general.genome <chr>,
#> # general.number_of_variants_before_filter <dbl>,
#> # general.number_of_known_variants_brie_non_empty_id <dbl>, …
It is occasionally useful to extract QC data from the MultiQC plots. For example, let’s say we want to calculate the median quality score of every base in each sample. Unfortunately, MultiQC provides no numerical summary statistic for the read quality, it only has mapping quality and pass/fails for the per-base sequence quality:
df_both %>% dplyr::select(dplyr::contains('quality'))
#> # A tibble: 6 × 5
#> raw.qualimap_bamqc_genome_resu…¹ raw.fastqc.sequences…² raw.fastqc.per_base_…³
#> <dbl> <dbl> <chr>
#> 1 50.3 0 pass
#> 2 50.4 0 pass
#> 3 50.3 0 pass
#> 4 50.3 0 pass
#> 5 50.3 0 pass
#> 6 50.3 0 pass
#> # ℹ abbreviated names: ¹raw.qualimap_bamqc_genome_results.mean_mapping_quality,
#> # ²raw.fastqc.sequences_flagged_as_poor_quality,
#> # ³raw.fastqc.per_base_sequence_quality
#> # ℹ 2 more variables: raw.fastqc.per_tile_sequence_quality <chr>,
#> # raw.fastqc.per_sequence_quality_scores <chr>
However, our MultiQC report does have plots that contain this information. In particular, let’s look at the “Per Sequence Quality Scores” plot.
Firstly, we need the ID of the plot we want. This isn’t necessarily obvious from just looking at the report, so we can use a utility function here:
#> # A tibble: 20 × 2
#> id title
#> <chr> <chr>
#> 1 qualimap_coverage_histogram Qualimap BamQC: Coverage histogram
#> 2 qualimap_genome_fraction Qualimap BamQC: Genome fraction cove…
#> 3 qualimap_insert_size Qualimap BamQC: Insert size histogram
#> 4 qualimap_gc_content Qualimap BamQC: GC content distribut…
#> 5 snpeff_variant_effects_region SnpEff: Counts by Genomic Region
#> 6 snpeff_variant_effects_impact SnpEff: Counts by Effects Impact
#> 7 snpeff_effects SnpEff: Counts by Effect Types
#> 8 snpeff_variant_effects_class SnpEff: Counts by Functional Class
#> 9 snpeff_qualities SnpEff: Qualities
#> 10 gatk_varianteval_variant_plot GATK VariantEval: Variant Counts
#> 11 picard_deduplication Picard: Deduplication Stats
#> 12 fastqc_sequence_counts_plot FastQC: Sequence Counts
#> 13 fastqc_per_base_sequence_quality_plot FastQC: Mean Quality Scores
#> 14 fastqc_per_sequence_quality_scores_plot FastQC: Per Sequence Quality Scores
#> 15 fastqc_per_sequence_gc_content_plot FastQC: Per Sequence GC Content
#> 16 fastqc_per_base_n_content_plot FastQC: Per Base N Content
#> 17 fastqc_sequence_duplication_levels_plot FastQC: Sequence Duplication Levels
#> 18 fastqc_overrepresented_sequencesi_plot FastQC: Overrepresented sequences
#> 19 fastqc_adapter_content_plot FastQC: Adapter Content
#> 20 fastqc-status-check-heatmap FastQC: Status Checks
Now, we know we want the “Per Sequence Quality Scores” plot, and by
looking at the data frame above we can tell that the corresponding ID is
fastqc_per_sequence_quality_scores_plot
.
Now that we have the plot ID, we can load the plot data. First, we
need to tell TidyMultiqc
to load include some plots by
using sections = "plot"
(you can load other sections at the
same time, as explained above). Also, we need to pass the plot ID from
the previous step into the plots
argument:
df = TidyMultiqc::load_multiqc(
multiqc_data_path,
sections = 'plot',
plots = "fastqc_per_sequence_quality_scores_plot"
)
df
#> # A tibble: 6 × 2
#> metadata.sample_id plot.fastqc_per_sequence_quality_scores_plot
#> <chr> <list>
#> 1 P4107_1001 <tibble [35 × 2]>
#> 2 P4107_1002 <tibble [38 × 2]>
#> 3 P4107_1003 <tibble [31 × 2]>
#> 4 P4107_1004 <tibble [32 × 2]>
#> 5 P4107_1005 <tibble [40 × 2]>
#> 6 P4107_1006 <tibble [40 × 2]>
We now have the plot data, but it’s not in a very usable form! This
is because each sample has an entire data frame of plot data. At this
point if you’re comfortable using dplyr
and
tidyr
to deal with nested data frames, you probably know
what to do. Otherwise, read on.
Recall that we are after the median quality score of each sample. First, we should look at the plot data for a single sample to know what we’re dealing with:
df$plot.fastqc_per_sequence_quality_scores_plot[[1]]
#> # A tibble: 35 × 2
#> x y
#> <dbl> <dbl>
#> 1 7 2
#> 2 8 13
#> 3 9 121
#> 4 10 702
#> 5 11 3962
#> 6 12 11946
#> 7 13 33566
#> 8 14 77653
#> 9 15 157422
#> 10 16 294209
#> # ℹ 25 more rows
So each data frame is a set of x, y pairs. As it’s a histogram plot,
we know that the x
value is the quality score, and
y
is the number of times that score has been counted.
One possible way to process this nested data is to use
tidyr
:
df %>%
tidyr::unnest(cols = plot.fastqc_per_sequence_quality_scores_plot)
#> # A tibble: 216 × 3
#> metadata.sample_id x y
#> <chr> <dbl> <dbl>
#> 1 P4107_1001 7 2
#> 2 P4107_1001 8 13
#> 3 P4107_1001 9 121
#> 4 P4107_1001 10 702
#> 5 P4107_1001 11 3962
#> 6 P4107_1001 12 11946
#> 7 P4107_1001 13 33566
#> 8 P4107_1001 14 77653
#> 9 P4107_1001 15 157422
#> 10 P4107_1001 16 294209
#> # ℹ 206 more rows
As you can see, if we unnest in this way, we now have multiple rows
for the same sample, which is a bit confusing (and not Tidy). However,
if we use group_by
and then summarise
, this
can be a useful way to calculate summary statistics. For example, if we
want the total number of reads, we could do the following:
df %>%
tidyr::unnest(cols = plot.fastqc_per_sequence_quality_scores_plot) %>%
dplyr::group_by(metadata.sample_id) %>%
dplyr::summarise(total_reads = sum(y))
#> # A tibble: 6 × 2
#> metadata.sample_id total_reads
#> <chr> <dbl>
#> 1 P4107_1001 383592756
#> 2 P4107_1002 430185524
#> 3 P4107_1003 431379000
#> 4 P4107_1004 498165399
#> 5 P4107_1005 484233426
#> 6 P4107_1006 453238610
Although unnesting worked well in this example, it can get a bit
awkward for more complex operations. In these cases we can use
purrr
. Refer to the next section for an example that
compares both approaches. The below example sums the number of reads for
each sample, as we have done above, but this time it uses
purrr::map_dbl
to map over the list of data frames:
df %>%
dplyr::mutate(
total_reads = purrr::map_dbl(plot.fastqc_per_sequence_quality_scores_plot, ~sum(.$y)),
plot.fastqc_per_sequence_quality_scores_plot = NULL
)
#> # A tibble: 6 × 2
#> metadata.sample_id total_reads
#> <chr> <dbl>
#> 1 P4107_1001 383592756
#> 2 P4107_1002 430185524
#> 3 P4107_1003 431379000
#> 4 P4107_1004 498165399
#> 5 P4107_1005 484233426
#> 6 P4107_1006 453238610
Of course, we actually want to find the median here, which is a bit
harder. Luckily there exists a package called HistDat
for
generating summary statistics from histogram-type data. You can check
out the package’s manual and vignettes here, but in
brief, we want to convert each of these plot data frames into a
HistDat
object, which we can do using the same strategies
as before. Then, using HistDat
, we can calculate our
summary statistics in one of the two ways mentioned above.
Using the tidyr
approach, we can unnest the plot data,
group it, create a HistDat
object for each group, and then
produce new columns using the new hist
column:
df %>%
tidyr::unnest(cols = plot.fastqc_per_sequence_quality_scores_plot) %>%
dplyr::group_by(metadata.sample_id) %>%
dplyr::mutate(hist = list(HistDat::HistDat(vals = x, counts = y)), .keep = "unused") %>%
dplyr::mutate(
mean_coverage = hist %>% dplyr::first() %>% mean(),
median_coverage = hist %>% dplyr::first() %>% median(),
max_coverage = hist %>% dplyr::first() %>% max(),
hist= NULL
) %>%
dplyr::slice(1)
#> # A tibble: 6 × 4
#> # Groups: metadata.sample_id [6]
#> metadata.sample_id mean_coverage median_coverage max_coverage
#> <chr> <dbl> <dbl> <dbl>
#> 1 P4107_1001 37.4 40 41
#> 2 P4107_1002 37.9 40 41
#> 3 P4107_1003 39.4 41 41
#> 4 P4107_1004 40.0 41 41
#> 5 P4107_1005 38.3 40 41
#> 6 P4107_1006 37.6 40 41
Alternatively, using the purrr
method, we can just map
each plot data frame into a row of summary statistics. Much neater!
df %>%
dplyr::mutate(
purrr::map_dfr(plot.fastqc_per_sequence_quality_scores_plot, function(plot_df){
hist = HistDat::HistDat(vals=plot_df$x, counts = plot_df$y)
list(
mean_coverage = mean(hist),
median_coverage = median(hist),
max_coverage = max(hist)
)
}),
plot.fastqc_per_sequence_quality_scores_plot = NULL
)
#> # A tibble: 6 × 4
#> metadata.sample_id mean_coverage median_coverage max_coverage
#> <chr> <dbl> <dbl> <dbl>
#> 1 P4107_1001 37.4 40 41
#> 2 P4107_1002 37.9 40 41
#> 3 P4107_1003 39.4 41 41
#> 4 P4107_1004 40.0 41 41
#> 5 P4107_1005 38.3 40 41
#> 6 P4107_1006 37.6 40 41
So far, we have used TidyMultiqc
’s built-in parsers.
Each parser is a function that converts the plot JSON into a list of
data frames. However, it is possible that we might encounter a new type
of plot that is not yet implemented in TidyMultiqc
. If this
happens, the first thing you should do is file an issue
against this package.
Then, if you’re daring, you can try to implement a parser. For your
reference, you can refer to the existing parsers in the source code.
Then, you can pass in your custom parser using the
plot_parsers
argument to load_multiqc
:
TidyMultiqc::load_multiqc(
multiqc_data_path,
sections = 'plot',
plots = "fastqc_per_sequence_quality_scores_plot",
plot_parsers = list(
# This fake parser function takes a plot and just returns the iris dataset
xy_line = function(plot_data, name){
list(
sample_1 = list(
plot_name = list(iris)
)
)
}
)
)
#> # A tibble: 1 × 2
#> metadata.sample_id plot_name
#> <chr> <list>
#> 1 sample_1 <df [150 × 5]>
Finally, if your parser works, please submit a pull request to
TidyMultiqc
to share this with everyone!
Using the strategies and patterns explained in this section, you should be in good stead to handle whatever plots MultiQC throws at you.