Pseudobulk visualization

Pseudobulk visualization¶

Now we have all the ingredients we need to generate the pseudobulk files to compare the cell types. With this function we will generate fragments files per group and the corresponding bigwigs in general.

Here, we prepared another function which use the matrix instead of fragments.

In [ ]:

Copied!

!git clone https://github.com/DBinary/Epiverse.git
!git clone https://github.com/DBinary/Epiverse.git

Cloning into 'Epiverse'...
remote: Enumerating objects: 463, done.
remote: Counting objects: 100% (52/52), done.
remote: Compressing objects: 100% (42/42), done.
remote: Total 463 (delta 23), reused 19 (delta 9), pack-reused 411
Receiving objects: 100% (463/463), 6.14 MiB | 28.59 MiB/s, done.
Resolving deltas: 100% (174/174), done.

In [ ]:

Copied!

!pip install pyranges
!pip install pyranges

Collecting pyranges
  Downloading pyranges-0.0.129-py3-none-any.whl (1.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.5/1.5 MB 15.5 MB/s eta 0:00:00
Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from pyranges) (1.5.3)
Collecting ncls>=0.0.63 (from pyranges)
  Downloading ncls-0.0.68-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.4/2.4 MB 63.7 MB/s eta 0:00:00
Requirement already satisfied: tabulate in /usr/local/lib/python3.10/dist-packages (from pyranges) (0.9.0)
Collecting sorted-nearest>=0.0.33 (from pyranges)
  Downloading sorted_nearest-0.0.39-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.0/6.0 MB 81.2 MB/s eta 0:00:00
Requirement already satisfied: natsort in /usr/local/lib/python3.10/dist-packages (from pyranges) (8.4.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from ncls>=0.0.63->pyranges) (1.23.5)
Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas->pyranges) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->pyranges) (2023.3.post1)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.1->pandas->pyranges) (1.16.0)
Installing collected packages: sorted-nearest, ncls, pyranges
Successfully installed ncls-0.0.68 pyranges-0.0.129 sorted-nearest-0.0.39

In [1]:

Copied!





import omicverse as ov
import scanpy as sc
import Epiverse as ev
ov.plot_set()
import omicverse as ov
import scanpy as sc
import Epiverse as ev
ov.plot_set()

   ____            _     _    __                  
  / __ \____ ___  (_)___| |  / /__  _____________ 
 / / / / __ `__ \/ / ___/ | / / _ \/ ___/ ___/ _ \ 
/ /_/ / / / / / / / /__ | |/ /  __/ /  (__  )  __/ 
\____/_/ /_/ /_/_/\___/ |___/\___/_/  /____/\___/                                              

Version: 1.5.8, Tutorials: https://omicverse.readthedocs.io/

Load Data¶

We recommend the use of scanpy Anndata objects as the preferred mode of loading and filtering data.

A sample datset is available for download with the instructions listed below. This is a filtered, unnormalized counts of multiome dataset of CD34+ sorted bone marrow cells to profile human hematopoiesis [Dataset ref TBD].

Uncomment the following lines to download the sample dataset in a Unix-based system. For non-UNIX systems, download the files using the URL

In [ ]:

Copied!





!mkdir data/
!wget https://dp-lab-data-public.s3.amazonaws.com/SEACells-multiome/cd34_multiome_atac.h5ad -O data/cd34_multiome_atac.h5ad # ATAC data
!wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE200nnn/GSE200046/suppl/GSE200046_RAW.tar -O data/GSE200046_RAW.tar
!tar xvf data/GSE200046_RAW.tar
!mkdir data/
!wget https://dp-lab-data-public.s3.amazonaws.com/SEACells-multiome/cd34_multiome_atac.h5ad -O data/cd34_multiome_atac.h5ad # ATAC data
!wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE200nnn/GSE200046/suppl/GSE200046_RAW.tar -O data/GSE200046_RAW.tar
!tar xvf data/GSE200046_RAW.tar

--2023-09-08 15:37:03--  https://dp-lab-data-public.s3.amazonaws.com/SEACells-multiome/cd34_multiome_atac.h5ad
Resolving dp-lab-data-public.s3.amazonaws.com (dp-lab-data-public.s3.amazonaws.com)... 3.5.25.253, 52.217.168.177, 54.231.137.225, ...
Connecting to dp-lab-data-public.s3.amazonaws.com (dp-lab-data-public.s3.amazonaws.com)|3.5.25.253|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 843871190 (805M) [binary/octet-stream]
Saving to: ‘data/cd34_multiome_atac.h5ad’

data/cd34_multiome_ 100%[===================>] 804.78M  34.3MB/s    in 26s     

2023-09-08 15:37:29 (31.0 MB/s) - ‘data/cd34_multiome_atac.h5ad’ saved [843871190/843871190]

--2023-09-08 15:37:29--  ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE200nnn/GSE200046/suppl/GSE200046_RAW.tar
           => ‘data/GSE200046_RAW.tar’
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.12, 130.14.250.10, 2607:f220:41e:250::7, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.12|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /geo/series/GSE200nnn/GSE200046/suppl ... done.
==> SIZE GSE200046_RAW.tar ... 9667829760
==> PASV ... done.    ==> RETR GSE200046_RAW.tar ... done.
Length: 9667829760 (9.0G) (unauthoritative)

GSE200046_RAW.tar   100%[===================>]   9.00G  91.0MB/s    in 95s     

2023-09-08 15:39:04 (97.1 MB/s) - ‘data/GSE200046_RAW.tar’ saved [9667829760]

GSM6005302_BM_CD34_Rep1_filtered_feature_bc_matrix.h5
GSM6005303_BM_CD34_Rep1_atac_fragments.tsv.gz
GSM6005304_BM_CD34_Rep2_filtered_feature_bc_matrix.h5
GSM6005305_BM_CD34_Rep2_atac_fragments.tsv.gz
GSM6005306_BM_Tcelldep_Rep1_filtered_feature_bc_matrix.h5
GSM6005307_BM_Tcelldep_Rep1_atac_fragments.tsv.gz
GSM6005308_BM_Tcelldep_Rep2_filtered_feature_bc_matrix.h5
GSM6005309_BM_Tcelldep_Rep2_atac_fragments.tsv.gz

In [3]:

Copied!

adata=sc.read('data/cd34_multiome_atac.h5ad')
adata
adata=sc.read('data/cd34_multiome_atac.h5ad')
adata

Out[3]:

AnnData object with n_obs × n_vars = 6881 × 246113
    obs: 'Sample', 'TSSEnrichment', 'ReadsInTSS', 'ReadsInPromoter', 'ReadsInBlacklist', 'PromoterRatio', 'PassQC', 'NucleosomeRatio', 'nMultiFrags', 'nMonoFrags', 'nFrags', 'nDiFrags', 'BlacklistRatio', 'Clusters', 'ReadsInPeaks', 'FRIP', 'leiden', 'phenograph', 'celltype', 'SEACell'
    var: 'seqnames', 'start', 'end', 'width', 'strand', 'score', 'replicateScoreQuantile', 'groupScoreQuantile', 'Reproducibility', 'GroupReplicate', 'nearestGene', 'distToGeneStart', 'peakType', 'distToTSS', 'nearestTSS', 'GC', 'idx'
    uns: 'celltype_colors'
    obsm: 'X_svd', 'X_umap'

We require peaks to be expressed in at least 0.01% of the cells.

In [4]:

Copied!

sc.pp.filter_genes(adata, min_cells = adata.shape[0]*0.001)
adata
sc.pp.filter_genes(adata, min_cells = adata.shape[0]*0.001)
adata

filtered out 760 genes that are detected in less than 6.881 cells

Out[4]:

AnnData object with n_obs × n_vars = 6881 × 245353
    obs: 'Sample', 'TSSEnrichment', 'ReadsInTSS', 'ReadsInPromoter', 'ReadsInBlacklist', 'PromoterRatio', 'PassQC', 'NucleosomeRatio', 'nMultiFrags', 'nMonoFrags', 'nFrags', 'nDiFrags', 'BlacklistRatio', 'Clusters', 'ReadsInPeaks', 'FRIP', 'leiden', 'phenograph', 'celltype', 'SEACell'
    var: 'seqnames', 'start', 'end', 'width', 'strand', 'score', 'replicateScoreQuantile', 'groupScoreQuantile', 'Reproducibility', 'GroupReplicate', 'nearestGene', 'distToGeneStart', 'peakType', 'distToTSS', 'nearestTSS', 'GC', 'idx', 'n_cells'
    uns: 'celltype_colors'
    obsm: 'X_svd', 'X_umap'

Load chromsizes¶

In order to produce the bigwig files, we also need to know the overall size of the chromosomes. We can easily download this information from the UCSC.

In [5]:

Copied!





# Get chromosome sizes (for hg38 here)
import pyranges as pr
import requests
import pandas as pd
target_url='http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes'
chromsizes=pd.read_csv(target_url, sep='\t', header=None)
chromsizes.columns=['Chromosome', 'End']
chromsizes['Start']=[0]*chromsizes.shape[0]
chromsizes=chromsizes.loc[:,['Chromosome', 'Start', 'End']]
# Exceptionally in this case, to agree with CellRangerARC annotations
chromsizes['Chromosome'] = [chromsizes['Chromosome'][x].replace('v', '.') for x in range(len(chromsizes['Chromosome']))]
chromsizes['Chromosome'] = [chromsizes['Chromosome'][x].split('_')[1] if len(chromsizes['Chromosome'][x].split('_')) > 1 else chromsizes['Chromosome'][x] for x in range(len(chromsizes['Chromosome']))]
chromsizes=pr.PyRanges(chromsizes)
# Get chromosome sizes (for hg38 here)
import pyranges as pr
import requests
import pandas as pd
target_url='http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes'
chromsizes=pd.read_csv(target_url, sep='\t', header=None)
chromsizes.columns=['Chromosome', 'End']
chromsizes['Start']=[0]*chromsizes.shape[0]
chromsizes=chromsizes.loc[:,['Chromosome', 'Start', 'End']]
# Exceptionally in this case, to agree with CellRangerARC annotations
chromsizes['Chromosome'] = [chromsizes['Chromosome'][x].replace('v', '.') for x in range(len(chromsizes['Chromosome']))]
chromsizes['Chromosome'] = [chromsizes['Chromosome'][x].split('_')[1] if len(chromsizes['Chromosome'][x].split('_')) > 1 else chromsizes['Chromosome'][x] for x in range(len(chromsizes['Chromosome']))]
chromsizes=pr.PyRanges(chromsizes)

Analysis the pseudobulk of each celltypes¶

We provide the ev.single.pseudobulk function to get a bigwig file for each cell type from the matrix

In [5]:

Copied!

!pip install pyrle
!pip install pyrle

Requirement already satisfied: pyrle in /usr/local/lib/python3.10/dist-packages (0.0.38)
Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from pyrle) (1.5.3)
Requirement already satisfied: tabulate in /usr/local/lib/python3.10/dist-packages (from pyrle) (0.9.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from pyrle) (1.23.5)
Requirement already satisfied: natsort in /usr/local/lib/python3.10/dist-packages (from pyrle) (8.4.0)
Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas->pyrle) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->pyrle) (2023.3.post1)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.1->pandas->pyrle) (1.16.0)

In [6]:

Copied!





!mkdir temp
ev.single.pseudobulk(adata,chromsizes,cluster_key='celltype',
           clusters=['cDC','pDC','CLP'],size=1000,
           chr=['seqnames','start','end'],
           bigwig_path='temp')
!mkdir temp
ev.single.pseudobulk(adata,chromsizes,cluster_key='celltype',
           clusters=['cDC','pDC','CLP'],size=1000,
           chr=['seqnames','start','end'],
           bigwig_path='temp')

['cDC', 'pDC', 'CLP']

  0%|          | 0/3 [00:00<?, ?it/s]

cDC chr_value
cDC chr_start
cDC chr_end
cDC Name
cDC Score
cDC write

 33%|███▎      | 1/3 [00:16<00:32, 16.02s/it]

pDC chr_value
pDC chr_start
pDC chr_end
pDC Name
pDC Score
pDC write

 67%|██████▋   | 2/3 [00:46<00:24, 24.77s/it]

CLP chr_value
CLP chr_start
CLP chr_end
CLP Name
CLP Score
CLP write

100%|██████████| 3/3 [01:31<00:00, 30.56s/it]

Visualization pseudobulk¶

We can use the bigwig visualisation in the bulk step to analyse the pseudobulk generated by scATAC

In [7]:

Copied!





bw_dict={
    'cDC':'temp/cDC.bw',
    'pDC':'temp/pDC.bw',
    'CLP':'temp/CLP.bw',
}
bw_obj=ev.bulk.bigwig(bw_dict)
bw_obj.read()
bw_dict={
    'cDC':'temp/cDC.bw',
    'pDC':'temp/pDC.bw',
    'CLP':'temp/CLP.bw',
}
bw_obj=ev.bulk.bigwig(bw_dict)
bw_obj.read()

......Loading cDC
......Loading pDC
......Loading CLP

In [8]:

Copied!

color_dict=dict(zip(adata.obs['celltype'].cat.categories,adata.uns['celltype_colors']))
color_dict=dict(zip(adata.obs['celltype'].cat.categories,adata.uns['celltype_colors']))

The gtf files could be downloaded from gencode.

Here, we using GRCh38 v43 to perfrom the annotation.

In [9]:

Copied!

!wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.basic.annotation.gtf.gz
!wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.basic.annotation.gtf.gz

--2023-09-08 16:21:45--  ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.basic.annotation.gtf.gz
           => ‘gencode.v44.basic.annotation.gtf.gz.1’
Resolving ftp.ebi.ac.uk (ftp.ebi.ac.uk)... 193.62.193.165
Connecting to ftp.ebi.ac.uk (ftp.ebi.ac.uk)|193.62.193.165|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/databases/gencode/Gencode_human/release_44 ... done.
==> SIZE gencode.v44.basic.annotation.gtf.gz ... 29570410
==> PASV ... done.    ==> RETR gencode.v44.basic.annotation.gtf.gz ... done.
Length: 29570410 (28M) (unauthoritative)

gencode.v44.basic.a 100%[===================>]  28.20M  10.7MB/s    in 2.6s    

2023-09-08 16:21:49 (10.7 MB/s) - ‘gencode.v44.basic.annotation.gtf.gz.1’ saved [29570410]

In [9]:

Copied!

bw_obj.load_gtf('/mnt/home/zehuazeng/data/gtf/gencode.v43.basic.annotation.gtf')
bw_obj.load_gtf('/mnt/home/zehuazeng/data/gtf/gencode.v43.basic.annotation.gtf')

......Loading gtf file

In [14]:

Copied!





#BARX2: chr11:129,375,233-129,452,279
region_dict={
    'region1':[129375233, 129452279],
}
#BARX2: chr11:129,375,233-129,452,279
region_dict={
    'region1':[129375233, 129452279],
}

We prepared ev.bulk.plotloc to calculate the location automatically to peak visualization.

In [11]:

Copied!

loc_obj=ev.bulk.plotloc('chr11',129375233, 129452279)
start1,end1=loc_obj.cal()
loc_obj=ev.bulk.plotloc('chr11',129375233, 129452279)
start1,end1=loc_obj.cal()

bw_obj.plot_track is a very powerful function for visualising ATAC-seq. We provide quite a few parameters to be used as visualisation effect customisations.

chrom: the chromosome of region.
chromstart: the start position of region.
chromend: the end position of region.
nbins: the number of bins. default is 700
value_type: the type of value. Can be set as 'mean','max','min','coverage','std','sum'.
transform: the transform of value. Can be set as 'log','log2','log10','log1p','-log' or 'no'.
figwidth: the width of figure.
figheight: the height of figure.
plot_names: the name of bigwig file need to be plotted.
color_dict: the color of bigwig file need to be plotted.
region_dict: the region of interest.
gtf_color: the color of gtf. default is #000000
prefered_name: the prefered name of gtf. default is gene_id
jump_symbols: If the gene name contains characters within list, the text for that gene is not displayed.
ymax: Since there are differences in our bigwig files for each type of cell, we can specify the maximum value to ensure a high degree of consistency when visualising.
text_fontsize: the size of text in figures

In [19]:

Copied!





bw_obj.plot_track(chrom=loc_obj.chr,chromstart=start1,chromend=end1,
                  plot_names=['cDC','pDC','CLP'],figwidth=6,figheight=1.5,
                  color_dict=color_dict,
                  region_dict=region_dict,
                  #transform='log1p',
                  jump_symbols=['ENSG','LINC'],
                  prefered_name='gene_name',
                  ymax=3,text_fontsize=10
                 )
bw_obj.plot_track(chrom=loc_obj.chr,chromstart=start1,chromend=end1,
                  plot_names=['cDC','pDC','CLP'],figwidth=6,figheight=1.5,
                  color_dict=color_dict,
                  region_dict=region_dict,
                  #transform='log1p',
                  jump_symbols=['ENSG','LINC'],
                  prefered_name='gene_name',
                  ymax=3,text_fontsize=10
                 )

Out[19]:

(<Figure size 480x160 with 4 Axes>,
 array([<AxesSubplot: ylabel='cDC'>, <AxesSubplot: ylabel='pDC'>,
        <AxesSubplot: ylabel='CLP'>, <AxesSubplot: >], dtype=object))

No description has been provided for this image

Visualization pseudobulk with fragment¶

We can use the bigwig visualisation in the bulk step to analyse the pseudobulk generated by scATAC and fragment, the fragment files can help us generate th pseudobulk result accurately.

In [12]:

Copied!

import pandas as pd
fragment_2 = pd.read_csv('/content/GSM6005305_BM_CD34_Rep2_atac_fragments.tsv.gz',sep='\t')
fragment_2
import pandas as pd
fragment_2 = pd.read_csv('/content/GSM6005305_BM_CD34_Rep2_atac_fragments.tsv.gz',sep='\t')
fragment_2

Out[12]:

	chr1	10073	10327	CCGCACACAGAAATGC-1	1
0	chr1	10084	10285	CCACAGGGTGAACAAA-1	1
1	chr1	10084	10340	TCTCCTCGTTAGGTTG-1	1
2	chr1	10085	10278	CTCGACAAGTGAGCGG-1	1
3	chr1	10085	10320	AGTTATGTCGGTTTCC-1	2
4	chr1	10085	10327	CTTCAATTCGTTTCGC-1	5
...	...	...	...	...	...
200568152	KI270713.1	37661	37875	TGAGCTTAGGCTGGCT-1	1
200568153	KI270713.1	37663	37882	TCAGCAAGTCATGCCC-1	1
200568154	KI270713.1	38991	39208	CGATTCCTCAGGTTTA-1	3
200568155	KI270713.1	39071	39228	GTGCACGGTCACCTAT-1	1
200568156	KI270713.1	40664	40709	TCGTGCTTCTAATCCT-1	1

200568157 rows × 5 columns

In [13]:

Copied!





adata.obs['barcode']=adata.obs.index
for i in adata.obs.index:
    adata.obs.loc[i,'barcode'] = adata.obs.loc[i,'barcode'].split('#')[1]
adata.obs.index = adata.obs['barcode']
adata.obs['barcode']=adata.obs.index
for i in adata.obs.index:
    adata.obs.loc[i,'barcode'] = adata.obs.loc[i,'barcode'].split('#')[1]
adata.obs.index = adata.obs['barcode']

In [14]:

Copied!

adata.obs.loc[adata.obs.index.isin(set(fragment_2.iloc[:,3])), 'type'] = "cd34_multiome_rep2"
adata.obs
adata.obs.loc[adata.obs.index.isin(set(fragment_2.iloc[:,3])), 'type'] = "cd34_multiome_rep2"
adata.obs

Out[14]:

	Sample	TSSEnrichment	ReadsInTSS	ReadsInPromoter	ReadsInBlacklist	PromoterRatio	PassQC	NucleosomeRatio	nMultiFrags	nMonoFrags	...	BlacklistRatio	Clusters	ReadsInPeaks	FRIP	leiden	phenograph	celltype	SEACell	barcode	type
barcode
AAACAGCCACTCGCTC-1	cd34_multiome_rep1	10.131	6559	7086	227	0.186062	1	1.015666	2757	9447	...	0.005961	C8	12388	0.654204	3	5	HMP	cd34_multiome_rep1#AAACAGCCACTCGCTC-1	AAACAGCCACTCGCTC-1	cd34_multiome_rep2
AAACAGCCACTGACCG-1	cd34_multiome_rep1	9.216	6381	7340	239	0.154260	1	1.373404	3747	10024	...	0.005023	C13	12880	0.639777	0	2	DCPre	cd34_multiome_rep1#TTGCAGCCAGGCGATA-1	AAACAGCCACTGACCG-1	cd34_multiome_rep2
AAACAGCCATAATCAC-1	cd34_multiome_rep1	11.886	5264	5476	143	0.222638	1	1.149624	1899	5721	...	0.005814	C5	8390	0.731218	5	0	HSC	cd34_multiome_rep1#TACAGCTAGTAATCCA-1	AAACAGCCATAATCAC-1	cd34_multiome_rep2
AAACATGCAAATTCGT-1	cd34_multiome_rep1	12.290	7994	8203	214	0.231266	1	1.431784	3004	7293	...	0.006033	C5	10992	0.749591	4	0	HSC	cd34_multiome_rep2#TAACCTAAGGATTGAG-1	AAACATGCAAATTCGT-1	cd34_multiome_rep2
AAACATGCAGCATGTC-1	cd34_multiome_rep1	9.746	8067	8928	293	0.168218	1	1.517742	4481	10540	...	0.005521	C1	13298	0.627383	7	6	Ery	cd34_multiome_rep1#TAGCGGCTCATTATGG-1	AAACATGCAGCATGTC-1	cd34_multiome_rep2
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
TTTGTCTAGGGCCACT-1	cd34_multiome_rep2	9.709	10267	11241	377	0.163753	1	1.138771	5234	16048	...	0.005492	C1	18997	0.589639	7	6	Ery	cd34_multiome_rep1#TAGCGGCTCATTATGG-1	TTTGTCTAGGGCCACT-1	cd34_multiome_rep2
TTTGTGAAGGAAGCAC-1	cd34_multiome_rep2	10.617	5147	5139	134	0.211813	1	1.447740	2023	4956	...	0.005523	C5	7285	0.731867	6	0	HSC	cd34_multiome_rep1#TACAGCTAGTAATCCA-1	TTTGTGAAGGAAGCAC-1	cd34_multiome_rep2
TTTGTGAAGTAACCAC-1	cd34_multiome_rep2	12.637	9464	9801	203	0.222770	1	1.230808	3482	9861	...	0.004614	C6	14692	0.742020	5	5	HSC	cd34_multiome_rep1#GCAGGCTTCAAACACC-1	TTTGTGAAGTAACCAC-1	cd34_multiome_rep2
TTTGTGGCATACTCCT-1	cd34_multiome_rep2	9.624	2012	2201	63	0.219136	1	1.140665	723	2346	...	0.006272	C12	3266	0.694894	15	15	Ery	cd34_multiome_rep1#GATTCAGGTCGTTATC-1	TTTGTGGCATACTCCT-1	cd34_multiome_rep2
TTTGTTGGTTCGCTCA-1	cd34_multiome_rep2	10.941	2978	3164	84	0.191502	1	1.316601	1364	3566	...	0.005084	C4	5097	0.710482	5	5	HMP	cd34_multiome_rep1#GTACTTCGTTGGTGAC-1	TTTGTTGGTTCGCTCA-1	cd34_multiome_rep2

6881 rows × 22 columns

In [15]:

Copied!

fragments_dict = {'cd34_multiome_rep1': '/content/GSM6005303_BM_CD34_Rep1_atac_fragments.tsv.gz',
          'cd34_multiome_rep2': '/content/GSM6005305_BM_CD34_Rep2_atac_fragments.tsv.gz'}
fragments_dict = {'cd34_multiome_rep1': '/content/GSM6005303_BM_CD34_Rep1_atac_fragments.tsv.gz',
          'cd34_multiome_rep2': '/content/GSM6005305_BM_CD34_Rep2_atac_fragments.tsv.gz'}

In [ ]:

Copied!





ev.single.pseudobulk_with_fragments(
    input_data = adata.obs,
    chromsizes = chromsizes,
    cluster_key = 'celltype',
    clusters = ['cDC','pDC','CLP'],
    bed_path = 'result/pseudobulk_bed_files/',
    bigwig_path = 'result/pseudobulk_bw_files/',
    verbose = True,
    path_to_fragments = fragments_dict,
    sample_id_col = 'type'
)
ev.single.pseudobulk_with_fragments(
    input_data = adata.obs,
    chromsizes = chromsizes,
    cluster_key = 'celltype',
    clusters = ['cDC','pDC','CLP'],
    bed_path = 'result/pseudobulk_bed_files/',
    bigwig_path = 'result/pseudobulk_bw_files/',
    verbose = True,
    path_to_fragments = fragments_dict,
    sample_id_col = 'type'
)

In [37]:

Copied!





bw_dict={
    'cDC':'result/pseudobulk_bw_files/cDC.bw',
    'pDC':'result/pseudobulk_bw_files/pDC.bw',
    'CLP':'result/pseudobulk_bw_files/CLP.bw',
}
bw_obj=ev.bulk.bigwig(bw_dict)
bw_obj.read()
bw_dict={
    'cDC':'result/pseudobulk_bw_files/cDC.bw',
    'pDC':'result/pseudobulk_bw_files/pDC.bw',
    'CLP':'result/pseudobulk_bw_files/CLP.bw',
}
bw_obj=ev.bulk.bigwig(bw_dict)
bw_obj.read()

......Loading cDC
......Loading pDC
......Loading CLP

In [38]:

Copied!

color_dict=dict(zip(adata.obs['celltype'].cat.categories,adata.uns['celltype_colors']))
color_dict=dict(zip(adata.obs['celltype'].cat.categories,adata.uns['celltype_colors']))

In [39]:

Copied!

bw_obj.load_gtf('/content/gencode.v44.basic.annotation.gtf.gz')
bw_obj.load_gtf('/content/gencode.v44.basic.annotation.gtf.gz')

......Loading gtf file

In [41]:

Copied!





bw_obj.plot_track(chrom='chr3',chromstart=124200000,chromend=128600000,
                  plot_names=['cDC','pDC','CLP'],figwidth=6,figheight=0.6,
                  color_dict=color_dict,
                  region_dict=region_dict
                 )
bw_obj.plot_track(chrom='chr3',chromstart=124200000,chromend=128600000,
                  plot_names=['cDC','pDC','CLP'],figwidth=6,figheight=0.6,
                  color_dict=color_dict,
                  region_dict=region_dict
                 )

Out[41]:

(<Figure size 480x192 with 4 Axes>,
 array([<AxesSubplot: ylabel='cDC'>, <AxesSubplot: ylabel='pDC'>,
        <AxesSubplot: ylabel='CLP'>, <AxesSubplot: >], dtype=object))