Utility Functions for RNA-seq Analysis

This vignette demonstrates a complete RNA-seq analysis workflow using Rtoolset functions for filtering, normalization, and preprocessing count data.

Overview

Rtoolset provides functions for:

Filtering low-expression genes
Calculating counts per million (CPM)
Log transformation with pseudo counts
Identifying most variable genes

These functions integrate seamlessly with the edgeR package for differential expression analysis. The filter_calcpm_dge() function is a comprehensive workflow function that performs multiple preprocessing steps in one call, while individual functions like log_transform() and get_top_var_mat() provide flexibility for custom workflows.

Complete Workflow

Step 1: Load and Prepare Data

library(Rtoolset)
library(edgeR)

# Your count matrix: genes (rows) x samples (columns)
# count_df should be a data.frame or matrix
count_df <- read.csv("count_matrix.csv", row.names = 1)

# Group information (optional)
group <- c(rep("Control", 3), rep("Treatment", 3))

Step 2: Filter and Normalize

The filter_calcpm_dge() function performs multiple steps:

Creates a DGEList object
Calculates CPM
Filters low-expression genes
Normalizes library sizes
Calculates filtered CPM and log-transformed CPM

result <- filter_calcpm_dge(
  count_df = count_df,
  group = group,
  min_count = 10,      # Minimum count threshold
  min_prop = 0.1       # Minimum proportion of samples
)

Step 3: Access Results

The function returns a list with multiple objects:

# Original DGEList (all genes)
result$dge_count

# CPM for all genes
result$dge_cpm

# Filtered DGEList (kept genes only)
result$dge_keep

# CPM for filtered genes
result$dge_keep_cpm

# Log2(CPM + 1) for filtered genes (most commonly used)
result$dge_keep_cpm_log

Step 4: Continue with Differential Expression

Use the filtered DGEList for downstream analysis:

# Estimate dispersions
dge <- estimateDisp(result$dge_keep)

# Fit model
fit <- glmQLFit(dge)

# Test for differential expression
qlf <- glmQLFTest(fit, contrast = c(-1, 1))

# Get results
top_genes <- topTags(qlf, n = 100)

Individual Functions

Log Transformation

For custom transformations:

# Log transform with custom parameters
log_data <- log_transform(
  df_mat = count_df,
  count = 1,    # Pseudo count
  base = 2      # Log base
)

Most Variable Genes

Identify genes with highest variance:

# Get top 20% most variable genes
var_result <- get_top_var_mat(
  count_df = result$dge_keep_cpm_log,
  prop = 0.2
)

# Or get top N genes
var_result <- get_top_var_mat(
  count_df = result$dge_keep_cpm_log,
  topN = 1000
)

# Access results
var_result$var_genes      # Variance for all genes
var_result$topN_mat       # Expression matrix for top genes

Filtering Parameters

min_count

Minimum count threshold for a gene to be kept:

# Stricter filtering (keep only highly expressed genes)
result_strict <- filter_calcpm_dge(
  count_df = count_df,
  min_count = 50,  # Higher threshold
  min_prop = 0.1
)

# Lenient filtering (keep more genes)
result_lenient <- filter_calcpm_dge(
  count_df = count_df,
  min_count = 5,   # Lower threshold
  min_prop = 0.1
)

min_prop

Minimum proportion of samples in the smallest group that must have the minimum count:

# Require gene to be expressed in at least 50% of smallest group
result <- filter_calcpm_dge(
  count_df = count_df,
  group = group,
  min_count = 10,
  min_prop = 0.5  # More stringent
)

Use Cases

1. Standard Differential Expression Analysis

Complete workflow from raw counts to differential expression:

library(Rtoolset)
library(edgeR)

# Load and preprocess
count_df <- read.csv("counts.csv", row.names = 1)
group <- c(rep("Control", 4), rep("Treatment", 4))

# Filter and normalize
result <- filter_calcpm_dge(
  count_df = count_df,
  group = group,
  min_count = 10,
  min_prop = 0.1
)

# Differential expression
dge <- estimateDisp(result$dge_keep)
fit <- glmQLFit(dge)
qlf <- glmQLFTest(fit, contrast = c(-1, 1))
top_genes <- topTags(qlf, n = 100)

2. Exploratory Data Analysis

Identify most variable genes for visualization:

# Get top variable genes
var_result <- get_top_var_mat(
  count_df = result$dge_keep_cpm_log,
  prop = 0.2  # Top 20% most variable
)

# Use for PCA or clustering
pca_data <- var_result$topN_mat

3. Custom Filtering Workflow

Build a custom preprocessing pipeline:

# Custom log transformation
log_data <- log_transform(
  df_mat = count_df,
  count = 1,
  base = 2
)

# Manual filtering
keep <- rowSums(count_df >= 10) >= 3
filtered_data <- count_df[keep, ]

# Calculate CPM
dge <- DGEList(counts = filtered_data)
dge <- calcNormFactors(dge)
cpm_data <- cpm(dge, log = TRUE, prior.count = 1)

Best Practices

Group Information: Always provide group information when available for better filtering
Filtering Thresholds:
- min_count = 10 and min_prop = 0.1 are good defaults
- Adjust based on your data characteristics
- More stringent filtering (higher thresholds) reduces noise but may lose lowly expressed genes
Normalization: The function uses edgeR’s calcNormFactors() for TMM normalization
Log Transformation: Use dge_keep_cpm_log for most downstream analyses (PCA, clustering, etc.)
Variable Genes: Use get_top_var_mat() to identify genes for visualization and dimensionality reduction
Performance: Caching is enabled for faster vignette rebuilds
Reproducibility: Set seeds when using random processes in downstream analysis

Technical Details

Filtering Strategy

The filter_calcpm_dge() function uses edgeR’s filterByExpr() function, which:

Calculates the minimum count threshold based on library sizes
Requires genes to have at least min_count reads in at least min_prop proportion of samples in the smallest group
This ensures genes are expressed in a meaningful number of samples

Normalization

The function uses TMM (Trimmed Mean of M-values) normalization via calcNormFactors(), which:

Accounts for differences in library sizes
Is robust to outliers
Is widely used in RNA-seq analysis

Log Transformation

The log transformation uses:

$\text{log2(CPM + 1)}$

where the pseudo-count of 1 prevents log(0) and stabilizes variance for low counts.

Example: Complete Analysis

library(Rtoolset)
library(edgeR)

# 1. Load data
count_df <- read.csv("counts.csv", row.names = 1)
group <- c(rep("Control", 4), rep("Treatment", 4))

# 2. Filter and normalize
result <- filter_calcpm_dge(
  count_df = count_df,
  group = group,
  min_count = 10,
  min_prop = 0.1
)

# 3. Get most variable genes for visualization
var_genes <- get_top_var_mat(
  count_df = result$dge_keep_cpm_log,
  prop = 0.2
)

# 4. Continue with edgeR for differential expression
dge <- estimateDisp(result$dge_keep)
fit <- glmQLFit(dge)
qlf <- glmQLFTest(fit, contrast = c(-1, 1))
top_genes <- topTags(qlf, n = 100)

Overview

Complete Workflow

Step 1: Load and Prepare Data

Step 2: Filter and Normalize

Step 3: Access Results

Step 4: Continue with Differential Expression

Individual Functions

Log Transformation

Most Variable Genes

Filtering Parameters

min_count

min_prop

Use Cases

1. Standard Differential Expression Analysis

2. Exploratory Data Analysis

3. Custom Filtering Workflow

Best Practices

Technical Details

Filtering Strategy

Normalization

Log Transformation

Example: Complete Analysis

See Also