Seurat包整合多数据
单细胞数据整合的两个方面
1 如何将同一组细胞类型基于不同技术平台或由不同批次实验产生的测序数据进行修正与合并?
2 如何将同一组细胞类型的多组学特征信息整合从而完整描绘细胞状态? DOI:https://doi.org/10.1016/j.cell.2019.05.031 SueratV3的文章
作者分析了这4种方法对一个包含8个独立亚数据集的胰岛单细胞RNA测序数据集的整合效果。经由UMAP降维和可视化
SueratV2(canonical correlation analysis, CCA)
mnnCorrect (mutual nearest neighbor, MNN)
SueratV3(新算法综合了前述两类分别基于CCA降维和MNN搜索互近邻的方案)
scanorama
为了定量分析上述由聚类模式展示的不同算法在数据修正和整合上的表现差异,研究者试图从两个角度对其进行考察:一是整合后来源于不同批次的数据点的混杂程度,二是整合后同一批次中的数据点的原有聚类模式的保留。这两个参数分别表征了数据整合对批次效应的去除程度和对真实生物学效应的保存程度。
Integration 聚类特征应由真实的生物学特征决定而不是由批次效应决定
#可以写到记事本里.r用Source快速library
library(Seurat)
library(SeuratData)
library(patchwork)
# install dataset,这里如果因为网速下载失败可以手动安装
InstallData("ifnb")
#查询R包安装位置,解压好放到library
.libPaths()
# load dataset
LoadData("ifnb")
# split the dataset into a list of two seurat objects (stim and CTRL)
ifnb.list <- SplitObject(ifnb, split.by = "stim")
# normalize and identify variable features for each dataset independently
ifnb.list <- lapply(X = ifnb.list, FUN = function(x) {
x <- NormalizeData(x)#对数据进行标准化
x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)#筛选高变基因
})
# select features that are repeatedly variable across datasets for integration
#选择数据集间重复可变基因,作为下一步的anchor. features
features <- SelectIntegrationFeatures(object.list = ifnb.list)
#数据整合SueratV3
Seurat包的函数基本是函数名精准定位其功能,最后的数据整合所用到的函数就是IntegrateData, 其中的参数是anchorset,所以上一步就要FindIntegrationAnchors找到整合的锚定点, FindIntegrationAnchors这个函数里面的参数有object.list = ,anchor.features = ,故其上一步是选features 和数学里的反证法好像。
immune.anchors <- FindIntegrationAnchors(object.list = ifnb.list,
anchor.features = features)
# this command creates an 'integrated' data assay
immune.combined <- IntegrateData(anchorset = immune.anchors)
Fast integration using reciprocal PCA (RPCA)
用于整合 scRNA-seq 数据集
虽然命令列表几乎相同,但此流程要求用户在整合之前对每个数据集单独运行主成分分析
(PCA)。
FindIntegrationAnchors()
。用户还应该在运行时将“reduction”参数设置为“rpca”
# load dataset
LoadData("ifnb")
# split the dataset into a list of two seurat objects (stim and CTRL)
ifnb.list <- SplitObject(ifnb, split.by = "stim")
# normalize and identify variable features for each dataset independently
ifnb.list <- lapply(X = ifnb.list, FUN = function(x) {
x <- NormalizeData(x)
x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
})
# select features that are repeatedly variable across datasets for integration run PCA on each
# dataset using these features
features <- SelectIntegrationFeatures(object.list = ifnb.list)
ifnb.list <- lapply(X = ifnb.list, FUN = function(x) {
x <- ScaleData(x, features = features, verbose = FALSE)
x <- RunPCA(x, features = features, verbose = FALSE)
})#对每个数据集单独运行主成分分析 (PCA)
immune.anchors <- FindIntegrationAnchors(object.list = ifnb.list, anchor.features = features, reduction = "rpca")#reduction”参数设置为“rpca”
# this command creates an 'integrated' data assay
immune.combined <- IntegrateData(anchorset = immune.anchors)
整合大型数据集的技巧
创建要集成的 Seurat 对象列表
CreateSeuratObject()
standard normalization and variable feature selection
NormalizeData()
FindVariableFeatures()
select features for downstream integration, and run PCA on each object in the list
SelectIntegrationFeatures()
ScaleData()
RunPCA()
整合数据集,并进行联合分析
FindIntegrationAnchors()
IntegrateData()
##两个方法提高效率
Reciprocal PCA (RPCA)
Reference-based integration
bm280k.data <- Read10X_h5("../data/ica_bone_marrow_h5.h5")
bm280k <- CreateSeuratObject(counts = bm280k.data, min.cells = 100, min.features = 500)
bm280k.list <- SplitObject(bm280k, split.by = "orig.ident")
bm280k.list <- lapply(X = bm280k.list, FUN = function(x) {
x <- NormalizeData(x, verbose = FALSE)
x <- FindVariableFeatures(x, verbose = FALSE)
})
features <- SelectIntegrationFeatures(object.list = bm280k.list)
bm280k.list <- lapply(X = bm280k.list, FUN = function(x) {
x <- ScaleData(x, features = features, verbose = FALSE)
x <- RunPCA(x, features = features, verbose = FALSE)#这里
})
anchors <- FindIntegrationAnchors(object.list = bm280k.list, reference = c(1, 2), reduction = "rpca",
dims = 1:50)#这里注意一下
bm280k.integrated <- IntegrateData(anchorset = anchors, dims = 1:50)