Recent developments in data science, and in computational biology in particular, often integrate data from several sources, diverse experiments, or databases. This leaves the challenge of truthfully visualizing data where the number of data points varies between classes. Plot types such as bar charts, violin plots, strip charts, and box-and-whisker plots can convey the mean or median, the variance, the number of data points, or the density distribution of the data, but only pairs of plots or dense overlays of these plot types provide all the relevant information. To aid the presentation of datasets with differing sample sizes, we have developed a new type of plot that overcomes limitations of current standard visualization charts.
sinaplot is inspired by the strip chart and the violin plot. By letting the normalized density of points restrict the jitter along the x-axis, the plot displays the same contour as a violin plot but resembles a simple strip chart when the number of data points is small. In this way the plot conveys the number of data points, the density distribution, outliers, and spread in a simple, comprehensible, and condensed format.
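The idea of density-restricted jitter can be sketched in a few lines of base R. This is only an illustration of the principle described above, not the package's implementation; the function name `sina_jitter` and the parameter `max_width` are hypothetical.

```r
# Minimal sketch of density-restricted jitter for a single group.
# `sina_jitter` and `max_width` are hypothetical names, not part of sinaplot.
sina_jitter <- function(y, center = 1, max_width = 0.4) {
  d <- density(y)                                  # kernel density estimate
  dens_at_y <- approx(d$x, d$y, xout = y)$y        # density at each data point
  half_width <- max_width * dens_at_y / max(d$y)   # normalise to [0, max_width]
  # Jitter each point uniformly within its own density-dependent band
  center + runif(length(y), -half_width, half_width)
}

set.seed(1)
y <- rnorm(300, mean = 5, sd = 1)
x_pos <- sina_jitter(y)
plot(x_pos, y, pch = 20, xlim = c(0.5, 1.5), xaxt = "n",
     xlab = "", ylab = "value")
```

Points in dense regions are spread widely and points in the tails stay close to the centre line, which is what gives the violin-like contour.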
x <- c(rnorm(200, 4, 1), rnorm(200, 5, 2), rnorm(400, 6, 1.5))
groups <- c(rep("Cond1", 200), rep("Cond2", 200), rep("Cond3", 400))
library(sinaplot)
## Loading required package: plyr
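With the simulated values and group labels in place, a call along the following lines should draw one sina per condition. The sketch assumes that `sinaplot()` accepts the values as its first argument and the grouping vector as its second; the seed and the `requireNamespace()` guard are additions for reproducibility and are not part of the original example.

```r
# Sketch only: assumes sinaplot() takes the values first and the groups second.
set.seed(42)  # added for reproducibility; not in the original example
x <- c(rnorm(200, 4, 1), rnorm(200, 5, 2), rnorm(400, 6, 1.5))
groups <- c(rep("Cond1", 200), rep("Cond2", 200), rep("Cond3", 400))

# Plot only when the package is installed
if (requireNamespace("sinaplot", quietly = TRUE)) {
  sinaplot::sinaplot(x, groups, pch = 20)
}
```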
We use a cohort of 2095 AML, ALL and healthy bone marrow samples to illustrate some of the strengths of sinaplot.
Class | Gene |
---|---|
ALL t(12;21) | 7.553129 |
ALL t(12;21) | 7.252447 |
ALL t(12;21) | 5.608201 |
ALL t(12;21) | 5.971710 |
ALL t(12;21) | 6.554109 |
ALL t(12;21) | 5.655416 |
ALL t(12;21) | 6.127554 |
ALL t(12;21) | 6.043007 |
ALL t(12;21) | 7.681021 |
ALL t(12;21) | 5.959204 |
By setting the argument scale = FALSE we turn off the group-wise scaling based on the class with the highest density.
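The effect of the scaling can be illustrated with a small calculation on simulated data. The sketch below is hedged: the names `vals`, `scaled_width`, and `unscaled_width` are hypothetical, and the assumption that every class may use the full width once scaling is off is our reading of the text, not a statement about the package internals.

```r
# Illustration only: relative widths with and without scaling against the
# densest class. All object names here are hypothetical.
set.seed(3)
vals <- list(narrow = rnorm(200, 5, 0.5), wide = rnorm(200, 5, 2))

# Peak of the kernel density estimate in each class
peak <- sapply(vals, function(v) max(density(v)$y))

# Scaling against the densest class: only that class reaches full width
scaled_width <- peak / max(peak)

# Assumed behaviour with scaling off: every class may use the full width
unscaled_width <- peak / peak

# The corresponding package call, run only if sinaplot is installed
if (requireNamespace("sinaplot", quietly = TRUE)) {
  y <- unlist(vals, use.names = FALSE)
  g <- rep(names(vals), times = lengths(vals))
  sinaplot::sinaplot(y, g, pch = 20, scale = FALSE)
}
```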
Using method = "counts" to compute the borders, we get a less smooth spread of the samples because no kernel density estimate is applied.
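Why count-based borders look rougher can be seen by comparing binned counts with a kernel density estimate for the same values. This base R sketch uses `hist()` as a stand-in for whatever binning the package applies, so the bin count and all object names are assumptions made for illustration.

```r
# Illustration only: bin counts versus a kernel density estimate as the border.
set.seed(7)
y <- rnorm(200, mean = 6, sd = 1.5)

# Count-based border: points per bin, scaled so the fullest bin equals 1.
# breaks = 50 is only a suggestion to hist(); the bin count is an assumption.
h <- hist(y, breaks = 50, plot = FALSE)
count_border <- h$counts / max(h$counts)

# Density-based border: kernel density estimate, scaled the same way
d <- density(y)
density_border <- d$y / max(d$y)

# Step line for the counts, dashed smooth line for the density
plot(h$mids, count_border, type = "s", xlab = "value", ylab = "relative width")
lines(d$x, density_border, lty = 2)
```

The stepped line jumps from bin to bin, while the dashed density curve varies smoothly.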
Sinaplot aesthetics can be tweaked in the same manner as in graphics::plot.
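# Enlarge the bottom margin to make room for rotated class labels
par(mar = c(9, 4, 4, 2) + 0.1)
n_groups <- length(levels(blood$Class))
# Draw the plot without the default x-axis, annotations, or box
sinaplot(Gene ~ Class, data = blood, pch = 20, xaxt = "n",
         col = rainbow(n_groups), ann = FALSE, bty = "n")
# Add tick marks only, then place the class names at a 45 degree angle
axis(1, at = 1:n_groups, labels = FALSE)
usr <- par("usr")
text(x = 1:n_groups,
     y = usr[3] - 0.1 * (usr[4] - usr[3]),
     labels = levels(blood$Class), srt = 45, xpd = TRUE, adj = 1,
     cex = 0.8)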
Using a subset of the blood dataset, we compare sinaplot with five popular plotting strategies and show how our package integrates features from these methods to achieve a truthful yet simple representation of multiclass, single-variable data.
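As a rough point of reference, two of the conventional strategies mentioned in the introduction can be drawn with base graphics. The sketch below uses simulated classes of very different sizes instead of the blood subset, and all object names are hypothetical. It shows that a box plot looks similar for every class while the group sizes are only available in the returned object.

```r
# Illustration only: box plot and strip chart on classes of unequal size.
set.seed(11)
sizes <- c(small = 20, medium = 200, large = 2000)
dat <- data.frame(
  value = unlist(lapply(sizes, function(n) rnorm(n, mean = 6, sd = 1)),
                 use.names = FALSE),
  class = factor(rep(names(sizes), times = sizes), levels = names(sizes))
)

op <- par(mfrow = c(1, 2))
# The box plot hides the sample size; it is only reported in bp$n
bp <- boxplot(value ~ class, data = dat, main = "boxplot")
# The strip chart shows every point but overplots heavily in large classes
stripchart(value ~ class, data = dat, method = "jitter", vertical = TRUE,
           pch = 20, main = "stripchart")
par(op)

bp$n  # number of observations per class
```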
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] sinaplot_1.1.1 plyr_1.8.9 rmarkdown_2.29
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.37 R6_2.5.1 fastmap_1.2.0 xfun_0.49
## [5] maketools_1.3.1 cachem_1.1.0 knitr_1.49 htmltools_0.5.8.1
## [9] buildtools_1.0.0 lifecycle_1.0.4 cli_3.6.3 sass_0.4.9
## [13] jquerylib_0.1.4 compiler_4.4.2 sys_3.4.3 tools_4.4.2
## [17] evaluate_1.0.1 bslib_0.8.0 Rcpp_1.0.13-1 yaml_2.3.10
## [21] jsonlite_1.8.9 rlang_1.1.4