Integrate scRNA-seq datasets
!lamin load test-scrna
💡 found cached instance metadata: /home/runner/.lamin/instance--testuser1--test-scrna.env
✅ loaded instance: testuser1/test-scrna
import lamindb as ln
import lnschema_bionty as lb
import pandas as pd
import anndata as ad
✅ loaded instance: testuser1/test-scrna (lamindb 0.50.7)
ln.track()
💡 notebook imports: anndata==0.9.2 lamindb==0.50.7 lnschema_bionty==0.29.6 pandas==1.5.3
✅ saved: Transform(id='agayZTonayqAz8', name='Integrate scRNA-seq datasets', short_name='scrna2', stem_id='agayZTonayqA', version='0', type=notebook, updated_at=2023-08-17 14:11:53, created_by_id='DzTjkKse')
✅ saved: Run(id='0Zhg72h2cHaaONf6p0pc', run_at=2023-08-17 14:11:53, transform_id='agayZTonayqAz8', created_by_id='DzTjkKse')
Query files based on metadata
ln.File.filter(tissues__name__icontains="lymph node").distinct().df()
id | storage_id | key | suffix | accessor | description | version | initial_version_id | size | hash | hash_type | transform_id | run_id | updated_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
goAhrxKrgthN83xZwUcA | u9weBVGy | None | .h5ad | AnnData | Detmar22 | None | None | 17342743 | rk5lSoJvz6PHRRjmcB919w | md5 | Nv48yAceNSh8z8 | MtzJ6VmKpLQiUtbzoMTo | 2023-08-17 14:11:12 | DzTjkKse |
AXO3ps8RX50KjYoWYb58 | u9weBVGy | None | .h5ad | AnnData | Conde22 | None | None | 28061905 | 3cIcmoqp1MxjX8NlRkKGlQ | md5 | Nv48yAceNSh8z8 | MtzJ6VmKpLQiUtbzoMTo | 2023-08-17 14:11:35 | DzTjkKse |
ln.File.filter(cell_types__name__icontains="monocyte").distinct().df()
id | storage_id | key | suffix | accessor | description | version | initial_version_id | size | hash | hash_type | transform_id | run_id | updated_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AXO3ps8RX50KjYoWYb58 | u9weBVGy | None | .h5ad | AnnData | Conde22 | None | None | 28061905 | 3cIcmoqp1MxjX8NlRkKGlQ | md5 | Nv48yAceNSh8z8 | MtzJ6VmKpLQiUtbzoMTo | 2023-08-17 14:11:35 | DzTjkKse |
K88ck4vLa862rEysKlNO | u9weBVGy | None | .h5ad | AnnData | 10x reference pbmc68k | None | None | 589484 | eKVXV5okt5YRYjySMTKGEw | md5 | Nv48yAceNSh8z8 | MtzJ6VmKpLQiUtbzoMTo | 2023-08-17 14:11:44 | DzTjkKse |
ln.File.filter(labels__name="female").distinct().df()
id | storage_id | key | suffix | accessor | description | version | initial_version_id | size | hash | hash_type | transform_id | run_id | updated_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
goAhrxKrgthN83xZwUcA | u9weBVGy | None | .h5ad | AnnData | Detmar22 | None | None | 17342743 | rk5lSoJvz6PHRRjmcB919w | md5 | Nv48yAceNSh8z8 | MtzJ6VmKpLQiUtbzoMTo | 2023-08-17 14:11:12 | DzTjkKse |
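The filters above use Django-style lookups: `name__icontains` is a case-insensitive substring match on the linked record's name, and `.distinct()` removes the duplicate rows that a join across a many-to-many relation can produce. The following plain-Python sketch only illustrates these semantics on made-up toy records (the `toy_files` list and the `icontains` helper are hypothetical); the actual query is executed by the database, not in Python.

```python
def icontains(value, needle):
    """Case-insensitive substring test, mirroring an `icontains` lookup."""
    return needle.lower() in value.lower()


# Hypothetical toy records: each file is linked to one or more tissue names.
toy_files = [
    {"description": "toy A", "tissues": ["Thoracic Lymph Node", "mesenteric lymph node"]},
    {"description": "toy B", "tissues": ["blood"]},
    {"description": "toy C", "tissues": ["lymph node"]},
]

# A join yields one row per matching (file, tissue) pair, so "toy A" appears twice.
joined = [
    f["description"]
    for f in toy_files
    for tissue in f["tissues"]
    if icontains(tissue, "lymph node")
]

# Dropping duplicates while keeping the first occurrence plays the role of .distinct().
distinct = list(dict.fromkeys(joined))

print(joined)
print(distinct)
```

Without the deduplication step, a file linked to several matching tissues would be listed once per match.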
Intersect measured genes between two datasets
file1 = ln.File.filter(description="Conde22").one()
file2 = ln.File.filter(description="10x reference pbmc68k").one()
file1.describe()
💡 File(id=AXO3ps8RX50KjYoWYb58, key=None, suffix=.h5ad, accessor=AnnData, description=Conde22, version=None, size=28061905, hash=3cIcmoqp1MxjX8NlRkKGlQ, hash_type=md5, created_at=2023-08-17 14:11:35.656298+00:00, updated_at=2023-08-17 14:11:35.656325+00:00)
Provenance:
  storage: Storage(id='u9weBVGy', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-08-17 14:11:51, created_by_id='DzTjkKse')
  transform: Transform(id='Nv48yAceNSh8z8', name='Curate & link scRNA-seq datasets', short_name='scrna', stem_id='Nv48yAceNSh8', version='0', type='notebook', updated_at=2023-08-17 14:11:44, created_by_id='DzTjkKse')
  run: Run(id='MtzJ6VmKpLQiUtbzoMTo', run_at=2023-08-17 14:10:54, transform_id='Nv48yAceNSh8z8', created_by_id='DzTjkKse')
  created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-17 14:11:51)
Features:
  var (X):
    index (36503, bionty.Gene.id): ['qo0IbL5zk0Ho', '0qs9l4ehlo5k', 'Wt6KcsC2HNFM', 'XzlnJbBfj81b', '2T9MPEOUfnUy'...]
  external:
    species (1, bionty.Species): ['human']
  obs (metadata):
    cell_type (32, bionty.CellType): ['mucosal invariant T cell', 'CD8-positive, alpha-beta memory T cell', 'CD16-negative, CD56-bright natural killer cell, human', 'group 3 innate lymphoid cell', 'progenitor cell']
    assay (3, bionty.ExperimentalFactor): ["10x 5' v2", "10x 5' v1", "10x 3' v3"]
    tissue (17, bionty.Tissue): ['thoracic lymph node', 'blood', 'lamina propria', 'thymus', 'transverse colon']
    donor (12, core.Label): ['D496', 'A35', 'D503', 'A36', 'A31']
file1.view_lineage()
file2.describe()
💡 File(id=K88ck4vLa862rEysKlNO, key=None, suffix=.h5ad, accessor=AnnData, description=10x reference pbmc68k, version=None, size=589484, hash=eKVXV5okt5YRYjySMTKGEw, hash_type=md5, created_at=2023-08-17 14:11:44.665221+00:00, updated_at=2023-08-17 14:11:44.665270+00:00)
Provenance:
  storage: Storage(id='u9weBVGy', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-08-17 14:11:51, created_by_id='DzTjkKse')
  transform: Transform(id='Nv48yAceNSh8z8', name='Curate & link scRNA-seq datasets', short_name='scrna', stem_id='Nv48yAceNSh8', version='0', type='notebook', updated_at=2023-08-17 14:11:44, created_by_id='DzTjkKse')
  run: Run(id='MtzJ6VmKpLQiUtbzoMTo', run_at=2023-08-17 14:10:54, transform_id='Nv48yAceNSh8z8', created_by_id='DzTjkKse')
  created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-17 14:11:51)
Features:
  var (X):
    index (695, bionty.Gene.id): ['l77NXBSjDNfh', 'GuX3mcJ3R0JC', '5GuQULEkSkBV', 'k7KXOWF0TiRf', '20169ReyBOCR'...]
  obs (metadata):
    cell_type (9, bionty.CellType): ['conventional dendritic cell', 'B cell, CD19-positive', 'dendritic cell', 'CD14-positive, CD16-negative classical monocyte', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell']
file2.view_lineage()
file1_adata = file1.load()
file2_adata = file2.load()
💡 adding file AXO3ps8RX50KjYoWYb58 as input for run 0Zhg72h2cHaaONf6p0pc, adding parent transform Nv48yAceNSh8z8
💡 adding file K88ck4vLa862rEysKlNO as input for run 0Zhg72h2cHaaONf6p0pc, adding parent transform Nv48yAceNSh8z8
file2_adata.obs.cell_type.head()
index
GCAGGGCTGGATTC-1 dendritic cell
CTTTAGTGGTTACG-6 B cell, CD19-positive
TGACTGGAACCATG-7 dendritic cell
TCAATCACCCTTCG-8 B cell, CD19-positive
CGTTATACAGTACC-8 effector memory CD4-positive, alpha-beta T cel...
Name: cell_type, dtype: category
Categories (9, object): ['CD8-positive, CD25-positive, alpha-beta regul..., 'effector memory CD4-positive, alpha-beta T ce..., 'cytotoxic T cell', 'CD38-negative naive B cell', ..., 'B cell, CD19-positive', 'conventional dendritic cell', 'CD16-positive, CD56-dim natural killer cell, ..., 'dendritic cell']
The shared genes can be computed from the feature sets registered for each file, without loading either file:
file1_genes = file1.features["var"]
file2_genes = file2.features["var"]
shared_genes = file1_genes & file2_genes
shared_genes.list("symbol")[:10]
['IGFBP7',
'U2AF1',
'PRDX6',
'SBDS',
'AKR1C3',
'LAMTOR4',
'LCK',
'IMPDH2',
'RABAC1',
'RNH1']
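Conceptually, the `&` between the two gene sets is an intersection: it keeps the genes registered for both files. A minimal plain-Python sketch of that idea follows, using toy lists of gene symbols. The `shared` helper and the membership of each toy list are made up for illustration; this is not how lamindb implements the operation.

```python
def shared(first, second):
    """Return the items of `first` that also occur in `second`.

    Order follows `first`; repeated items are reported once.
    """
    lookup = set(second)
    seen = set()
    result = []
    for item in first:
        if item in lookup and item not in seen:
            seen.add(item)
            result.append(item)
    return result


# Toy gene lists (membership is invented for this example).
toy_genes_a = ["LCK", "CD63", "SBDS", "LCK"]
toy_genes_b = ["SBDS", "GSTK1", "LCK"]

toy_shared = shared(toy_genes_a, toy_genes_b)
print(toy_shared)
```

Because only the registered gene identifiers are compared, no expression matrix needs to be read for this step.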
To concatenate the two datasets, their gene names must match. We therefore rename the var index of file1_adata from Ensembl gene IDs to symbols, using a mapper built from the genes of file2:
mapper = (
pd.DataFrame(file2_genes.values_list("ensembl_gene_id", "symbol"))
.drop_duplicates(0)
.set_index(0)[1]
)
mapper.head()
0
ENSG00000197448 GSTK1
ENSG00000135404 CD63
ENSG00000198546 ZNF511
ENSG00000116171 SCP2
ENSG00000213658 LAT
Name: 1, dtype: object
file1_adata.var.rename(index=mapper, inplace=True)
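Two details of this step are easy to miss: `drop_duplicates(0)` keeps the first symbol seen for each Ensembl ID, and renaming with a mapper leaves index labels that are absent from the mapper unchanged. The sketch below reproduces both behaviours with a plain dictionary. The first two ID/symbol pairs come from the output above; the duplicate entry and the unmapped label are hypothetical and exist only to exercise the edge cases.

```python
# (ensembl_gene_id, symbol) pairs; the third pair is a hypothetical duplicate.
pairs = [
    ("ENSG00000197448", "GSTK1"),
    ("ENSG00000135404", "CD63"),
    ("ENSG00000197448", "HYPOTHETICAL-ALIAS"),
]

# Keep the first symbol per Ensembl ID, like drop_duplicates with keep="first".
mapper = {}
for ensembl_id, symbol in pairs:
    mapper.setdefault(ensembl_id, symbol)

# Toy var index; the last label is a made-up ID that the mapper does not cover.
toy_index = ["ENSG00000135404", "ENSG00000197448", "ENSG_NOT_IN_MAPPER"]

# Labels missing from the mapper are passed through unchanged.
renamed = [mapper.get(label, label) for label in toy_index]

print(mapper)
print(renamed)
```

The pass-through behaviour means genes outside the mapper keep their Ensembl IDs, so they will not match any symbol in the other dataset.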
Intersect cell types
file1_celltypes = file1.cell_types.all()
file2_celltypes = file2.cell_types.all()
shared_celltypes = file1_celltypes & file2_celltypes
shared_celltypes_names = shared_celltypes.list("name")
shared_celltypes_names
['CD16-positive, CD56-dim natural killer cell, human',
'conventional dendritic cell']
We can now subset the two datasets by shared cell types:
file1_adata_subset = file1_adata[
file1_adata.obs["cell_type"].isin(shared_celltypes_names)
]
file1_adata_subset.obs["cell_type"].value_counts()
CD16-positive, CD56-dim natural killer cell, human 114
conventional dendritic cell 7
Name: cell_type, dtype: int64
file2_adata_subset = file2_adata[
file2_adata.obs["cell_type"].isin(shared_celltypes_names)
]
file2_adata_subset.obs["cell_type"].value_counts()
CD16-positive, CD56-dim natural killer cell, human 3
conventional dendritic cell 2
Name: cell_type, dtype: int64
adata_concat = ad.concat(
[file1_adata_subset, file2_adata_subset],
label="file",
keys=[file1.description, file2.description],
)
adata_concat
AnnData object with n_obs × n_vars = 126 × 695
obs: 'cell_type', 'file'
obsm: 'X_umap'
adata_concat.obs.value_counts()
cell_type file
CD16-positive, CD56-dim natural killer cell, human Conde22 114
conventional dendritic cell Conde22 7
CD16-positive, CD56-dim natural killer cell, human 10x reference pbmc68k 3
conventional dendritic cell 10x reference pbmc68k 2
dtype: int64
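The concatenated object has 126 cells (114 + 7 + 3 + 2) and 695 genes, which matches the gene count of the smaller dataset: `ad.concat` uses an inner join by default, so only variables present in every input are kept, and `label="file"` records the source of each cell. The sketch below mimics that behaviour on toy data using only the standard library. The `concat_inner` function and the toy values are hypothetical, and the real implementation in anndata differs.

```python
def concat_inner(datasets, label):
    """Stack the rows of toy datasets, keeping only variables found in all of them.

    `datasets` maps a key to a dict with "var_names", "cell_type" and "X" (rows).
    Returns the shared variable names, the stacked rows and per-row annotations.
    """
    var_sets = [set(data["var_names"]) for data in datasets.values()]
    first = next(iter(datasets.values()))
    # Shared variables, in the order of the first dataset.
    shared_vars = [v for v in first["var_names"] if all(v in s for s in var_sets)]

    rows, obs = [], []
    for key, data in datasets.items():
        column = {name: i for i, name in enumerate(data["var_names"])}
        for cell_type, values in zip(data["cell_type"], data["X"]):
            # Reorder each row so that columns line up across datasets.
            rows.append([values[column[v]] for v in shared_vars])
            obs.append({"cell_type": cell_type, label: key})
    return shared_vars, rows, obs


# Toy inputs; names and values are invented for this example.
toy = {
    "A": {"var_names": ["LCK", "CD63", "SBDS"], "cell_type": ["x", "y"], "X": [[1, 2, 3], [4, 5, 6]]},
    "B": {"var_names": ["SBDS", "LCK"], "cell_type": ["x"], "X": [[7, 8]]},
}

shared_vars, rows, obs = concat_inner(toy, label="file")
print(shared_vars)
print(rows)
print(obs)
```

Note that the columns of dataset "B" are reordered to match the shared variable order, which is why aligning gene names before concatenation matters.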
!lamin delete --force test-scrna
!rm -r ./test-scrna
💡 deleting instance testuser1/test-scrna
✅ deleted instance settings file: /home/runner/.lamin/instance--testuser1--test-scrna.env
✅ instance cache deleted
✅ deleted '.lndb' sqlite file
❗ consider manually deleting your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna