Build an On-Disk Database
AnnSQL supports two types of databases. The first is a simple in-memory database for smaller datasets. The second is an on-disk database, which this notebook demonstrates how to build. An on-disk AnnSQL database lets you query, filter, and run basic statistics on larger-than-memory datasets, even on a laptop. Any modifications to an on-disk database are saved to the database automatically.
Install the AnnSQL package
pip install annsql
Import Libraries
from AnnSQL import AnnSQL
from AnnSQL.MakeDb import MakeDb
import scanpy as sc
import os
Load the dataset
Here, we load the processed pbmc3k sample dataset provided by Scanpy. Note: For very large datasets, it is necessary to open the dataset in AnnData backed mode, which is fully supported. When a dataset is opened in backed mode, the database is built in chunks. Depending on the size of your dataset and your compute resources, this process may take time.
adata = sc.datasets.pbmc3k_processed()
print(adata)
AnnData object with n_obs × n_vars = 2638 × 1838
    obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain'
    var: 'n_cells'
    uns: 'draw_graph', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
    obsm: 'X_pca', 'X_tsne', 'X_umap', 'X_draw_graph_fr'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'
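To make the idea of building "in chunks" concrete, the sketch below splits a row count into consecutive, inclusive row ranges for a given chunk size. It is illustrative only: `chunk_ranges` is a hypothetical helper, not part of AnnSQL, and it does not claim to mirror how the library iterates internally.

```python
def chunk_ranges(n_rows, chunk_size):
    """Yield inclusive (start, end) row ranges that cover n_rows in chunk_size steps."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_rows, chunk_size):
        # The last chunk may be shorter than chunk_size.
        end = min(start + chunk_size, n_rows) - 1
        yield start, end


# The pbmc3k dataset above has 2638 cells (rows).
print(list(chunk_ranges(2638, 1000)))
# [(0, 999), (1000, 1999), (2000, 2637)]
```

With a chunk size larger than the number of rows, the whole dataset falls into a single range.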
Build the AnnSQL database
adata = sc.read_h5ad("data/pbmc3k_processed.h5ad", backed="r")
# Delete any existing database files. This is for testing purposes only.
if os.path.exists("db/pbmc3k.asql"):
    os.remove("db/pbmc3k.asql")
if os.path.exists("db/pbmc3k.asql.wal"):
    os.remove("db/pbmc3k.asql.wal")
# High system memory (>24 GB)
MakeDb(adata=adata, db_name="pbmc3k", db_path="db/", chunk_size=5000)
# Medium system memory (16-24 GB)
# MakeDb(adata=adata, db_name="pbmc3k", db_path="db/", chunk_size=2500)
# Low system memory (<=16 GB)
# MakeDb(adata=adata, db_name="pbmc3k", db_path="db/", chunk_size=1000, make_buffer_file=True)
Time to make var_names unique: 0.12565255165100098
Time to create X table structure: 0.020343780517578125
Starting backed mode X table data insert. Total rows: 2638
Processed chunk 0-2637 in 0.8914670944213867 seconds
Finished inserting X data.
<AnnSQL.MakeDb.MakeDb at 0x75a5e070cbf0>
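The three memory tiers in the build cell can be collected into one small helper, so the choice is made once rather than by commenting lines in and out. This is a sketch under assumptions: `makedb_options` is a hypothetical function, and because the tiers in the comments overlap at exactly 16 GB, the sketch treats 16 GB as the low-memory tier.

```python
def makedb_options(system_memory_gb):
    """Return keyword arguments for MakeDb based on available system memory in GB.

    Hypothetical helper; the thresholds are taken from the comments in the build cell.
    """
    if system_memory_gb > 24:
        return {"chunk_size": 5000}
    if system_memory_gb > 16:
        return {"chunk_size": 2500}
    # 16 GB or less: smaller chunks plus the buffer file option.
    return {"chunk_size": 1000, "make_buffer_file": True}


options = makedb_options(16)
print(options)
# {'chunk_size': 1000, 'make_buffer_file': True}
```

The returned dictionary could then be unpacked into the call shown above, for example `MakeDb(adata=adata, db_name="pbmc3k", db_path="db/", **options)`.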
Open the Database
Below, we instantiate the AnnSQL class with the db parameter pointing to our newly created database. By default, database files use the .asql extension.
asql = AnnSQL(db="db/pbmc3k.asql")
asql.show_tables()
| | table_name |
---|---|
0 | obs |
1 | obsm_X_draw_graph_fr |
2 | obsm_X_pca |
3 | obsm_X_tsne |
4 | obsm_X_umap |
5 | obsp_connectivities |
6 | obsp_distances |
7 | uns_raw |
8 | var |
9 | varm_PCs |
10 | var_names |
11 | X |
12 | adata |
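Judging from the listing above, the keyed AnnData containers (obsm, varm, obsp) each appear as one table named with the container as a prefix. The sketch below reproduces that naming pattern as inferred from this output only; `keyed_table_names` is a hypothetical helper and is not taken from the AnnSQL source.

```python
def keyed_table_names(obsm=(), varm=(), obsp=()):
    """Build table names of the form '<container>_<key>', sorted alphabetically."""
    names = [f"obsm_{key}" for key in obsm]
    names += [f"varm_{key}" for key in varm]
    names += [f"obsp_{key}" for key in obsp]
    return sorted(names)


# Keys as printed for the pbmc3k AnnData object.
tables = keyed_table_names(
    obsm=["X_pca", "X_tsne", "X_umap", "X_draw_graph_fr"],
    varm=["PCs"],
    obsp=["distances", "connectivities"],
)
for name in tables:
    print(name)
```

Each generated name matches one of the `obsm_`, `obsp_`, or `varm_` entries returned by `show_tables()`.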
Query the Database
asql.query("SELECT * FROM X LIMIT 5")
| | cell_id | TNFRSF4 | CPSF3L | ATAD3C | C1orf86 | RER1 | TNFRSF25 | TNFRSF9 | CTNNBIP1 | SRM | ... | DSCR3 | BRWD1 | BACE2 | SIK1 | C21orf33 | ICOSLG | SUMO3 | SLC19A1 | S100B | PRMT2 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AAACATACAACCAC-1 | -0.171470 | -0.280812 | -0.046677 | -0.475169 | -0.544024 | 4.928495 | -0.038028 | -0.280573 | -0.341788 | ... | -0.226570 | -0.236269 | -0.102943 | -0.222116 | -0.312401 | -0.121678 | -0.521229 | -0.098269 | -0.209095 | -0.531203 |
1 | AAACATTGAGCTAC-1 | -0.214582 | -0.372653 | -0.054804 | -0.683391 | 0.633951 | -0.334837 | -0.045589 | -0.498264 | -0.541914 | ... | -0.317530 | 2.568866 | 0.007155 | -0.445372 | 1.629285 | -0.058662 | -0.857164 | -0.266844 | -0.313146 | -0.596654 |
2 | AAACATTGATCAGC-1 | -0.376887 | -0.295084 | -0.057528 | -0.520972 | 1.332647 | -0.309362 | -0.103108 | -0.272526 | -0.500798 | ... | -0.302938 | -0.239801 | -0.071774 | -0.297857 | -0.410920 | -0.070431 | -0.590721 | -0.158656 | -0.170876 | 1.379000 |
3 | AAACCGTGCTTCCG-1 | -0.285241 | -0.281735 | -0.052227 | -0.484929 | 1.572679 | -0.271825 | -0.074552 | -0.258876 | -0.416752 | ... | -0.262978 | -0.231807 | -0.093818 | -0.247770 | 2.552078 | -0.097402 | 1.631685 | -0.119462 | -0.179120 | -0.505670 |
4 | AAACCGTGTATGCG-1 | -0.256483 | -0.220394 | -0.046800 | -0.345859 | -0.333409 | -0.208122 | -0.069514 | 5.806442 | -0.283112 | ... | -0.202237 | -0.176765 | -0.167350 | -0.098665 | -0.275836 | -0.139482 | -0.310096 | -0.006877 | -0.109614 | -0.461946 |
5 rows × 1839 columns
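Filtering and basic statistics use the same kind of SQL. The sketch below runs two such statements on a tiny stand-in table built with Python's standard-library `sqlite3` module, so that it runs without the database file. The stand-in engine, the reduced three-column `X` table, and the variable names are assumptions for illustration; the values are copied from the five rows shown above.

```python
import sqlite3

# Five cells from the output above, reduced to two gene columns.
rows = [
    ("AAACATACAACCAC-1", -0.171470, -0.544024),
    ("AAACATTGAGCTAC-1", -0.214582, 0.633951),
    ("AAACATTGATCAGC-1", -0.376887, 1.332647),
    ("AAACCGTGCTTCCG-1", -0.285241, 1.572679),
    ("AAACCGTGTATGCG-1", -0.256483, -0.333409),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE X (cell_id TEXT, TNFRSF4 REAL, RER1 REAL)")
con.executemany("INSERT INTO X VALUES (?, ?, ?)", rows)

# Filter: cells whose scaled RER1 expression exceeds 1.
filter_sql = "SELECT cell_id, RER1 FROM X WHERE RER1 > 1 ORDER BY RER1 DESC"
high_rer1 = con.execute(filter_sql).fetchall()

# Basic statistics over the same table.
stats_sql = "SELECT COUNT(*), AVG(TNFRSF4), MAX(RER1) FROM X"
n_cells, mean_tnfrsf4, max_rer1 = con.execute(stats_sql).fetchone()

print(high_rer1)
print(n_cells, round(mean_tnfrsf4, 6), max_rer1)
con.close()
```

Assuming the on-disk database accepts the same standard SQL, strings such as `filter_sql` and `stats_sql` could be passed to `asql.query(...)` in the same way as the `SELECT * FROM X LIMIT 5` statement above.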