This vignette demonstrates a compact RNA-seq preprocessing workflow
using Rtoolset functions for filtering, normalization, and
feature selection.
Overview
Rtoolset provides functions for:
- filtering low-expression genes
- calculating counts per million (CPM)
- log transformation with pseudo counts
- identifying the most variable genes
The filtering function returns edgeR DGEList objects, so its output can
be passed directly to edgeR for downstream differential expression analysis.
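As a rough illustration of what CPM and a pseudocount log transform mean, the base R sketch below computes both on a made-up count matrix. It is a minimal sketch that assumes plain library sizes with no normalization factors; the object names (toy_counts, cpm_mat, log_cpm) are hypothetical, and Rtoolset's own implementation may differ in detail.

```r
# Toy count matrix: 3 genes (rows) x 2 samples (columns); values are made up.
toy_counts <- matrix(
  c(10, 0, 90,
    200, 50, 750),
  nrow = 3,
  dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2"))
)

# Library size = total counts per sample (column sums).
lib_size <- colSums(toy_counts)

# CPM: divide each column by its library size and scale to one million.
cpm_mat <- sweep(toy_counts, 2, lib_size, "/") * 1e6

# Log transform with a pseudocount of 1 so that zero counts stay finite.
log_cpm <- log2(cpm_mat + 1)

log_cpm
```

The pseudocount matters for genes such as geneB in sample s1: its count is zero, and log2(0) would be negative infinity, whereas log2(0 + 1) is simply 0.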
Complete Workflow
Step 2: Filter and Normalize
The filter_calcpm_dge() function performs multiple
preprocessing steps in one call:
- Creates a DGEList object
- Calculates CPM
- Filters low-expression genes
- Normalizes library sizes
- Computes filtered CPM and log-transformed CPM
result <- filter_calcpm_dge(
  count_df = count_df,
  group = group,
  min_count = 10,
  min_prop = 0.1
)
Step 3: Access Results
# Original DGEList
result$dge_count
# CPM for all genes
result$dge_cpm
# Filtered DGEList
result$dge_keep
# CPM for retained genes
result$dge_keep_cpm
# Log2(CPM + 1) matrix for retained genes
result$dge_keep_cpm_log
Step 4: Continue with Differential Expression
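# Use an explicit no-intercept design so that contrast = c(-1, 1)
# compares the second group level against the first
design <- model.matrix(~ 0 + factor(group))
dge <- estimateDisp(result$dge_keep, design)
fit <- glmQLFit(dge, design)
qlf <- glmQLFTest(fit, contrast = c(-1, 1))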
top_genes <- topTags(qlf, n = 100)
Individual Functions
Log Transformation
Use log_transform() when you want to control the pseudo
count or log base:
log_data <- log_transform(
  df_mat = count_df,
  count = 1,
  base = 2
)
Most Variable Genes
Use get_top_var_mat() to focus on the most variable
features for PCA, clustering, or exploratory analysis:
var_result <- get_top_var_mat(
  count_df = result$dge_keep_cpm_log,
  prop = 0.2
)
var_result$var_genes
var_result$topN_mat
Filtering Parameters
min_count
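The minimum count a gene must reach to be kept. A higher value filters more strictly; a lower value retains more low-expression genes:
# Strict filter: higher count threshold
result_strict <- filter_calcpm_dge(
  count_df = count_df,
  min_count = 50,
  min_prop = 0.1
)
# Lenient filter: lower count threshold
result_lenient <- filter_calcpm_dge(
  count_df = count_df,
  min_count = 5,
  min_prop = 0.1
)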
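To make the effect of the threshold concrete, the sketch below applies a simple count filter to simulated data. The helper keep_genes() is hypothetical and not part of Rtoolset: it assumes a gene is kept when at least ceiling(min_prop * smallest group size) samples reach min_count, which may not match the exact rule inside filter_calcpm_dge().

```r
# Hypothetical helper, NOT part of Rtoolset: keep a gene when enough samples
# reach min_count. The rule used by filter_calcpm_dge() may differ.
keep_genes <- function(counts, group, min_count, min_prop) {
  smallest_group <- min(table(group))
  # Number of samples that must reach the threshold (at least one).
  min_samples <- max(1, ceiling(min_prop * smallest_group))
  rowSums(counts >= min_count) >= min_samples
}

set.seed(1)
toy_group <- rep(c("Control", "Treatment"), each = 4)
# Simulated counts: 200 genes x 8 samples, for illustration only.
toy_counts <- matrix(rnbinom(200 * 8, mu = 20, size = 1), nrow = 200)

n_strict  <- sum(keep_genes(toy_counts, toy_group, min_count = 50, min_prop = 0.1))
n_lenient <- sum(keep_genes(toy_counts, toy_group, min_count = 5,  min_prop = 0.1))

# Number of genes retained under each threshold.
c(strict = n_strict, lenient = n_lenient)
```

Under this assumed rule, every gene that passes the strict threshold also passes the lenient one, so the strict setting never retains more genes.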
min_prop
The minimum proportion of samples in the smallest group that must pass the expression threshold:
result <- filter_calcpm_dge(
  count_df = count_df,
  group = group,
  min_count = 10,
  min_prop = 0.5
)
Suggested Use Cases
Standard Differential Expression Analysis
library(Rtoolset)
library(edgeR)
count_df <- read.csv("counts.csv", row.names = 1)
group <- c(rep("Control", 4), rep("Treatment", 4))
result <- filter_calcpm_dge(
  count_df = count_df,
  group = group,
  min_count = 10,
  min_prop = 0.1
)
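# Use an explicit no-intercept design so that contrast = c(-1, 1)
# tests Treatment against Control
design <- model.matrix(~ 0 + factor(group))
dge <- estimateDisp(result$dge_keep, design)
fit <- glmQLFit(dge, design)
qlf <- glmQLFTest(fit, contrast = c(-1, 1))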
top_genes <- topTags(qlf, n = 100)
Exploratory Data Analysis
var_result <- get_top_var_mat(
  count_df = result$dge_keep_cpm_log,
  prop = 0.2
)
pca_data <- var_result$topN_mat
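One way to carry a matrix like this into PCA is sketched below in base R on simulated data. This is a minimal sketch, assuming genes are in rows and samples in columns; the orientation of topN_mat is not stated here, so check it before transposing. The variance-based selection shown is a generic stand-in, not necessarily how get_top_var_mat() works internally, and all object names are hypothetical.

```r
set.seed(42)
# Simulated log-expression matrix: 100 genes (rows) x 8 samples (columns).
toy_log <- matrix(
  rnorm(100 * 8),
  nrow = 100,
  dimnames = list(paste0("gene", 1:100), paste0("s", 1:8))
)

# Keep the top 20% of genes ranked by variance across samples.
gene_var <- apply(toy_log, 1, var)
n_top <- ceiling(0.2 * nrow(toy_log))
top_mat <- toy_log[order(gene_var, decreasing = TRUE)[seq_len(n_top)], ]

# prcomp() expects observations (samples) in rows, so transpose.
pca <- prcomp(t(top_mat), center = TRUE, scale. = FALSE)

# Sample coordinates on the first two principal components.
head(pca$x[, 1:2])
```

The first two columns of pca$x can then be plotted and colored by group to check whether samples separate as expected.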