Protein Dimension DB
Datasets with PLM embeddings, GO annotations and taxonomy representations for all proteins in Uniprot/Swiss-Prot
Current Release
Proteins are sorted by length. Every file lists the proteins in the same order, so the "ids.txt" file can be used as the row names.
Protein Language Model Embeddings
Several models are used to create computational descriptions of the Swiss-Prot proteins:
Name | Model | Vector Length | File Size | Download Links |
---|---|---|---|---|
emb.prottrans.parquet | prottrans_t5_xl_u50 (calculated by Uniprot) | 1024 | 1.3G | HF, UFRN |
emb.ankh_large.parquet | ankh-large | 1536 | 3.4G | HF, UFRN |
emb.ankh_base.parquet | ankh-base | 768 | 1.7G | HF, UFRN |
emb.esm2_t36.parquet | esm2_t36_3B_UR50D | 2560 | 5.7G | HF, UFRN |
emb.esm2_t33.parquet | esm2_t33_650M_UR50D | 1280 | 2.8G | HF, UFRN |
emb.esm2_t30.parquet | esm2_t30_150M_UR50D | 640 | 1.4G | HF, UFRN |
emb.esm2_t12.parquet | esm2_t12_35M_UR50D | 480 | 1G | HF, UFRN |
emb.esm2_t6.parquet | esm2_t6_8M_UR50D | 320 | 700M | HF, UFRN |
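
As a rough illustration of how one of these files might be consumed, the sketch below reads a parquet file into a dense matrix. It assumes the two-column layout ("id" and "emb") described under File Formats, assumes every vector in a file has the same length, and relies on pandas with a parquet engine such as pyarrow being installed. The helper names `embeddings_to_matrix` and `load_embeddings` are hypothetical and not part of this repository.

```python
import numpy as np


def embeddings_to_matrix(emb_column):
    """Stack per-protein vectors into a 2-D float array and flag usable rows."""
    matrix = np.stack([np.asarray(vec, dtype=np.float32) for vec in emb_column])
    # A row is treated as undefined if any of its components is NaN.
    valid = ~np.isnan(matrix).any(axis=1)
    return matrix, valid


def load_embeddings(path):
    """Read one *.parquet file and return (ids, matrix, valid_mask)."""
    # Imported here so the helper above stays usable without pandas.
    import pandas as pd

    df = pd.read_parquet(path, columns=["id", "emb"])
    matrix, valid = embeddings_to_matrix(df["emb"])
    return df["id"].tolist(), matrix, valid
```

A call such as `ids, X, valid = load_embeddings("emb.esm2_t6.parquet")` would then give a matrix whose rows follow the order of `ids`; `X[valid]` keeps only the proteins with a defined embedding.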
Uniprot/Swiss-Prot
Name | Content | Download Links |
---|---|---|
ids.txt | Uniprot Accession IDs | UFRN |
uniprot_sorted.fasta.gz | Amino acid sequences of Swiss-Prot proteins | UFRN |
taxid.tsv | NCBI taxon ID of each protein | UFRN |
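
Because every file shares the protein order of "ids.txt", a quick consistency check can be useful before combining files. The sketch below parses "ids.txt" and "taxid.tsv" and reports rows where the two disagree. It assumes neither file has a header row, which is not confirmed here; the function names are hypothetical.

```python
import csv


def parse_ids(lines):
    """One accession per line; blank lines are ignored."""
    return [line.strip() for line in lines if line.strip()]


def parse_taxids(lines):
    """Rows of (UniprotID, NCBI Taxon ID), both kept as strings."""
    reader = csv.reader(lines, delimiter="\t")
    return [(row[0], row[1]) for row in reader if row]


def misaligned_rows(ids, taxid_rows):
    """Return the row indices where taxid.tsv does not match ids.txt."""
    if len(ids) != len(taxid_rows):
        raise ValueError("files have different numbers of rows")
    return [i for i, (uid, (tid, _)) in enumerate(zip(ids, taxid_rows)) if uid != tid]


# Typical use, with both files in the working directory:
#   with open("ids.txt") as f_ids, open("taxid.tsv", newline="") as f_tax:
#       bad = misaligned_rows(parse_ids(f_ids), parse_taxids(f_tax))
```

An empty result from `misaligned_rows` means the accession list can be used directly as row names for the taxon table.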
Protein Annotations
These files contain all Gene Ontology annotations of Swiss-Prot proteins, excluding computational, non-traceable and no-data annotations. The full list of ignored evidence codes is available in evi_not_to_use.txt. Annotations have been "expanded upwards": the parent terms of each existing annotation are also included in these files.
Name | Content | Download Links |
---|---|---|
go.expanded.tsv.gz | MF, BP and CC annotations in simplified GAF format | UFRN |
go.experimental.mf.tsv.gz | Molecular Functions | UFRN |
go.experimental.bp.tsv.gz | Biological Processes | UFRN |
go.experimental.cc.tsv.gz | Cellular Components | UFRN |
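
The "expanded upwards" step can be pictured as a transitive closure over parent links in the ontology. The sketch below is only an illustration of that idea, not the pipeline's own code: the `parents` mapping is assumed to be built elsewhere (for example from the GO OBO file), which relation types the pipeline actually follows is not stated here, and `TOY_PARENTS` is a three-term excerpt used purely as an example.

```python
def expand_upwards(terms, parents):
    """Return the given terms plus every ancestor reachable through `parents`."""
    expanded = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term in expanded:
            continue  # already visited through another path
        expanded.add(term)
        stack.extend(parents.get(term, ()))
    return expanded


# Illustrative excerpt: hydrolase activity -> catalytic activity -> molecular_function
TOY_PARENTS = {
    "GO:0016787": {"GO:0003824"},
    "GO:0003824": {"GO:0003674"},
    "GO:0003674": set(),
}
```

With this toy mapping, a protein annotated only with GO:0016787 would also receive GO:0003824 and GO:0003674 after expansion.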
Taxonomy
Numerical representations of the NCBI taxon ID of each protein. Instead of the original NCBI taxonomy tree, we use the custom taxonomy created by the taxallnomy project, because it assigns the same number of parent taxa (genus, family, order…) to every species ID.
Name | Description | Vector Length | Download Links |
---|---|---|---|
emb.taxa_profile_256.parquet | Proximity (from 0.0 to 1.0) to each of the 256 most annotated taxa | 256 | UFRN |
emb.taxa_profile_128.parquet | Proximity (from 0.0 to 1.0) to each of the 128 most annotated taxa | 128 | UFRN |
onehot.taxa_256.parquet | Taxa One-Hot Encoding | 256 | HF, UFRN |
onehot.taxa_128.parquet | Taxa One-Hot Encoding | 128 | HF, UFRN |
File Formats
Files | Format Descriptions |
---|---|
ids.txt | One UniprotID per line |
taxid.tsv | Tab-separated table with columns: UniprotID, NCBI Taxon ID |
go.expanded.tsv.gz | Tab-separated table with columns: UniprotID, GO ID, ECO ID, NCBI Taxon ID, GO Ontology Code |
go.experimental.*.tsv.gz | Tab-separated table with columns: UniprotID, GO IDs separated by "," |
*.parquet | Parquet-formatted dataset with exactly two columns ("id" and "emb"). Rows for which an embedding could not be computed hold a vector of np.NaN values. |
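
The two GO table layouts above can be read with the standard library alone. The following is a minimal sketch under the assumption that the files carry no header row; the names `ExpandedAnnotation`, `parse_experimental`, `parse_expanded` and `open_tsv_gz` are hypothetical helpers, not part of this repository.

```python
import gzip
from typing import NamedTuple


class ExpandedAnnotation(NamedTuple):
    """One row of go.expanded.tsv.gz, in the column order listed above."""
    uniprot_id: str
    go_id: str
    eco_id: str
    taxon_id: str
    ontology: str


def parse_experimental(lines):
    """Map each UniprotID to its list of GO IDs (go.experimental.*.tsv.gz layout)."""
    annotations = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        uniprot_id, _, go_field = line.partition("\t")
        annotations[uniprot_id] = [go for go in go_field.split(",") if go]
    return annotations


def parse_expanded(lines):
    """Turn five-column rows into ExpandedAnnotation records; other rows are skipped."""
    records = []
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) == 5:
            records.append(ExpandedAnnotation(*fields))
    return records


def open_tsv_gz(path):
    """Open a gzip-compressed TSV file in text mode."""
    return gzip.open(path, "rt", encoding="utf-8")
```

For example, `with open_tsv_gz("go.experimental.mf.tsv.gz") as fh: mf = parse_experimental(fh)` would give a dictionary keyed by accession, which can then be looked up in the order given by "ids.txt".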
Create Release
Requirements to generate the datasets from scratch:
- Nextflow >= 24
- Mamba package manager
- Fast and stable internet connection to download original datasets
- At least 16GB of RAM
Test:
$ mkdir test
$ nextflow run main.nf --mode test --release_dir test
Full release:
$ mkdir <path to generate database at>
$ nextflow run main.nf --mode release --release_dir <path to generate database at>