RNA-seq

This vignette demonstrates a compact RNA-seq preprocessing workflow using Rtoolset functions for filtering, normalization, and feature selection.

Overview

Rtoolset provides functions for:

filtering low-expression genes
calculating counts per million (CPM)
log transformation with pseudo counts
identifying the most variable genes

The filtering function returns edgeR DGEList objects, so its output can be passed directly to edgeR for downstream differential expression analysis.

Complete Workflow

Step 1: Load and Prepare Data

library(Rtoolset)
library(edgeR)

# Count matrix: genes (rows) x samples (columns)
count_df <- read.csv("count_matrix.csv", row.names = 1)

# Group labels (optional): one per sample column, in column order
group <- c(rep("Control", 3), rep("Treatment", 3))
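Before filtering, it can help to confirm that the group labels line up with the columns of the count matrix. The sketch below uses a small made-up data frame; toy_counts and toy_group are hypothetical stand-ins for the contents of count_matrix.csv and the group vector above.

```r
# Toy count matrix standing in for count_matrix.csv (hypothetical values)
toy_counts <- data.frame(
  C1 = c(120, 0, 15), C2 = c(98, 1, 22), C3 = c(110, 0, 18),
  T1 = c(300, 2, 17), T2 = c(280, 0, 25), T3 = c(310, 1, 20),
  row.names = c("geneA", "geneB", "geneC")
)
toy_group <- c(rep("Control", 3), rep("Treatment", 3))

# One label per sample column, in column order
stopifnot(length(toy_group) == ncol(toy_counts))

# Counts should be non-negative and free of missing values
stopifnot(!anyNA(toy_counts), all(as.matrix(toy_counts) >= 0))

# Number of samples in each group
group_sizes <- table(toy_group)
print(group_sizes)
```

If either stopifnot() call fails, the labels or the input file should be fixed before moving on to filtering.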

Step 2: Filter and Normalize

The filter_calcpm_dge() function performs multiple preprocessing steps in one call:

Creates a DGEList object
Calculates CPM
Filters low-expression genes
Normalizes library sizes
Computes CPM and log2(CPM + 1) for the retained genes

result <- filter_calcpm_dge(
  count_df = count_df,
  group = group,
  min_count = 10,
  min_prop = 0.1
)
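For intuition, CPM rescales each sample's counts by its library size, and the log-transformed matrix described in this vignette is log2(CPM + 1). The base R sketch below reproduces that arithmetic on a made-up matrix. It ignores the normalization factors that edgeR can apply, and it is an illustration of the formulas rather than Rtoolset's internal code.

```r
# Toy counts: 3 genes x 2 samples (hypothetical values, filled column by column)
toy <- matrix(
  c(10, 90, 900,
    40, 60, 900),
  nrow = 3,
  dimnames = list(c("g1", "g2", "g3"), c("s1", "s2"))
)

# Library size = total counts per sample
lib_size <- colSums(toy)

# CPM: divide each column by its library size, then scale to one million
toy_cpm <- sweep(toy, 2, lib_size, "/") * 1e6

# log2(CPM + 1): the pseudocount of 1 keeps zero counts finite
toy_log_cpm <- log2(toy_cpm + 1)

print(round(toy_cpm))
print(round(toy_log_cpm, 2))
```

Each column of toy_cpm sums to one million by construction, which is a quick check that the scaling was applied per sample and not per gene.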

Step 3: Access Results

# Original DGEList
result$dge_count

# CPM for all genes
result$dge_cpm

# Filtered DGEList
result$dge_keep

# CPM for retained genes
result$dge_keep_cpm

# Log2(CPM + 1) matrix for retained genes
result$dge_keep_cpm_log

Step 4: Continue with Differential Expression

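# No-intercept design: with one column per group, contrast = c(-1, 1) tests Treatment - Control
group_f <- factor(group, levels = c("Control", "Treatment"))
design <- model.matrix(~ 0 + group_f)
dge <- estimateDisp(result$dge_keep, design)
fit <- glmQLFit(dge, design)
qlf <- glmQLFTest(fit, contrast = c(-1, 1))
top_genes <- topTags(qlf, n = 100)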
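The meaning of a contrast vector depends on the columns of the design matrix, so it is worth inspecting them before testing. The sketch below uses only base R's model.matrix() to compare two common parameterizations; the object names are illustrative.

```r
group <- c(rep("Control", 3), rep("Treatment", 3))
group_f <- factor(group, levels = c("Control", "Treatment"))

# No-intercept design: one column per group mean
design_means <- model.matrix(~ 0 + group_f)
print(colnames(design_means))
# With these columns, contrast c(-1, 1) is Treatment minus Control

# Intercept design: the second column is already the Treatment-vs-Control difference
design_intercept <- model.matrix(~ group_f)
print(colnames(design_intercept))
# Here the comparison would be tested with coef = 2 rather than a contrast
```

With the intercept design, c(-1, 1) would subtract the Control baseline from the difference term, which is not the Treatment-vs-Control comparison.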

Individual Functions

Log Transformation

Use log_transform() when you want to control the pseudocount (count) or the log base (base):

log_data <- log_transform(
  df_mat = count_df,
  count = 1,
  base = 2
)
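A transformation of this kind is commonly computed as log(x + pseudocount) in the chosen base. The sketch below assumes that form; log_with_pseudocount() is a hypothetical helper written for illustration and is not part of Rtoolset.

```r
# Hypothetical helper, written only to illustrate the arithmetic
log_with_pseudocount <- function(x, pseudocount = 1, base = 2) {
  log(as.matrix(x) + pseudocount, base = base)
}

# Toy values: 2 genes x 2 samples (filled column by column)
toy <- matrix(c(0, 1, 3, 7), nrow = 2,
              dimnames = list(c("g1", "g2"), c("s1", "s2")))

toy_log2  <- log_with_pseudocount(toy)             # log2(x + 1)
toy_log10 <- log_with_pseudocount(toy, base = 10)  # log10(x + 1)

print(toy_log2)
```

A pseudocount of 1 maps zero counts to zero on the log scale. Smaller pseudocounts spread low values further apart, which can exaggerate noise among weakly expressed genes.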

Most Variable Genes

Use get_top_var_mat() to focus on the most variable features for PCA, clustering, or exploratory analysis:

var_result <- get_top_var_mat(
  count_df = result$dge_keep_cpm_log,
  prop = 0.2
)

var_result$var_genes
var_result$topN_mat
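To make the selection concrete, the base R sketch below ranks genes by their variance across samples and keeps the top fraction. It assumes that prop is the fraction of genes to keep. How Rtoolset ranks genes or rounds the count is not stated here, so treat this as an approximation on simulated data.

```r
set.seed(1)
# Toy log-expression matrix: 10 genes x 6 samples (simulated values)
toy_log <- matrix(rnorm(60), nrow = 10,
                  dimnames = list(paste0("gene", 1:10), paste0("s", 1:6)))

prop <- 0.2

# Variance of each gene across samples
gene_var <- apply(toy_log, 1, var)

# Number of genes to keep (at least one)
n_top <- max(1, ceiling(prop * nrow(toy_log)))

# Names of the most variable genes, highest variance first
top_genes <- names(sort(gene_var, decreasing = TRUE))[seq_len(n_top)]

# Sub-matrix restricted to those genes
top_mat <- toy_log[top_genes, , drop = FALSE]
print(dim(top_mat))
```

Computing variance on log-scale values, as here, keeps a few highly expressed genes from dominating the ranking.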

Filtering Parameters

`min_count`

The minimum count a gene must reach to be kept. Higher values filter more strictly; lower values retain more genes:

result_strict <- filter_calcpm_dge(
  count_df = count_df,
  min_count = 50,
  min_prop = 0.1
)

result_lenient <- filter_calcpm_dge(
  count_df = count_df,
  min_count = 5,
  min_prop = 0.1
)

`min_prop`

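The minimum proportion of samples in the smallest group that must pass the expression threshold. The example below raises it from 0.1 to 0.5, so a gene must be expressed in more samples to be kept: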

result <- filter_calcpm_dge(
  count_df = count_df,
  group = group,
  min_count = 10,
  min_prop = 0.5
)
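The interaction between min_count and min_prop can be illustrated with a deliberately simplified rule: keep a gene if its count reaches min_count in at least ceiling(min_prop x smallest group size) samples. This rule is an assumption made for illustration. The actual filter in Rtoolset (or in edgeR's filterByExpr()) may use CPM-based cutoffs and different sample-size logic. keep_genes() is a hypothetical helper.

```r
# Toy counts: 4 genes x 6 samples (hypothetical values)
toy <- matrix(
  c(50, 60, 55, 70, 65, 80,   # passes in every sample
     0,  0,  0, 12, 15, 11,   # passes in Treatment only
     0,  2,  0,  1,  0, 12,   # passes in a single sample
     0,  1,  0,  2,  0,  1),  # never passes
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("g", 1:4), paste0("s", 1:6))
)
group <- c(rep("Control", 3), rep("Treatment", 3))

# Simplified rule (an assumption, not the package's exact logic)
keep_genes <- function(counts, group, min_count, min_prop) {
  smallest <- min(table(group))                       # size of the smallest group
  n_required <- max(1, ceiling(min_prop * smallest))  # samples that must pass
  rowSums(counts >= min_count) >= n_required
}

keep_lenient <- keep_genes(toy, group, min_count = 10, min_prop = 0.1)
keep_strict  <- keep_genes(toy, group, min_count = 10, min_prop = 0.5)

print(rbind(keep_lenient, keep_strict))
```

Under this simplified rule, g3 is kept with min_prop = 0.1 (one passing sample suffices) but dropped with min_prop = 0.5 (two are required).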

Suggested Use Cases

Standard Differential Expression Analysis

library(Rtoolset)
library(edgeR)

count_df <- read.csv("counts.csv", row.names = 1)
group <- c(rep("Control", 4), rep("Treatment", 4))

result <- filter_calcpm_dge(
  count_df = count_df,
  group = group,
  min_count = 10,
  min_prop = 0.1
)

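# No-intercept design: with one column per group, contrast = c(-1, 1) tests Treatment - Control
group_f <- factor(group, levels = c("Control", "Treatment"))
design <- model.matrix(~ 0 + group_f)
dge <- estimateDisp(result$dge_keep, design)
fit <- glmQLFit(dge, design)
qlf <- glmQLFTest(fit, contrast = c(-1, 1))
top_genes <- topTags(qlf, n = 100)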

Exploratory Data Analysis

var_result <- get_top_var_mat(
  count_df = result$dge_keep_cpm_log,
  prop = 0.2
)

pca_data <- var_result$topN_mat
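One way to continue from here is a PCA with base R's prcomp(). The sketch below uses a simulated matrix in place of var_result$topN_mat and assumes the same genes-by-samples layout as the count matrix, so the matrix is transposed before the call.

```r
set.seed(42)
# Toy matrix standing in for var_result$topN_mat: 50 genes x 6 samples
toy_mat <- matrix(rnorm(50 * 6), nrow = 50,
                  dimnames = list(paste0("gene", 1:50), paste0("s", 1:6)))

# prcomp() expects observations in rows, so transpose to samples x genes
pca <- prcomp(t(toy_mat), center = TRUE, scale. = FALSE)

# Sample coordinates on the first two principal components
pc_scores <- pca$x[, 1:2]

# Proportion of variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

print(round(var_explained[1:2], 3))
```

The rows of pc_scores are samples, so they can be plotted directly and colored by group to check whether replicates cluster together.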