Pending updates for multiome #1287

Closed
brianraymor opened this issue Mar 6, 2025 · 1 comment · Fixed by #1293
Labels: 5.3 (Next minor CELLxGENE schema version after 5.2), schema (CELLxGENE Discover dataset schema)

Comments

@brianraymor
Contributor

brianraymor commented Mar 6, 2025

Context

@jahilton noted that this information from #1013 needs to be captured:

One key decision is to accept unpaired scATAC data. This is based on many users finding them valuable, especially because 10x multiome data can be of poor quality.
Unpaired scATAC-seq Datasets will use the gene activity matrix (not a peak matrix). Paired scATAC-seq (e.g. 10x multiome) Datasets will use the gene expression matrix (RNA data).
In the Matrix Layers table, the row "Accessibility (e.g. ATAC-seq, mC-seq)" can be changed to "unpaired Accessibility (e.g. ATAC-seq, mC-seq)".
We will need to communicate this distinction clearly to the user outside the schema.

…and for clarity, the scRNA-seq (UMI, e.g. 10x v3, Slide-seqV2) can have 10x multiome added to the list

Design (@brianraymor)

@jahilton - I could move the definitions for paired and unpaired to the X (Matrix Layers) section. Another approach is to inline the gene activity matrix requirement in the table row with unpaired accessibility?

X (Matrix Layers)

...

Definitions for scATAC-seq assays

paired assay. obs['assay_ontology_term_id'] is a descendant of both "EFO:0010891" for scATAC-seq and "EFO:0008913" for single-cell RNA sequencing. A gene expression matrix (RNA data) is required.

unpaired assay. obs['assay_ontology_term_id'] is "EFO:0010891" for scATAC-seq or one of its descendants, and is not a descendant of "EFO:0008913" for single-cell RNA sequencing. A gene activity matrix, not a peak matrix, is required.
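The two definitions above can be read as a small classification rule. The following is a minimal sketch of that rule, not the validator's implementation. The two root term IDs come from the definitions; the function name `classify_assay`, the `TOY_ANCESTORS` table, and the `EX:` term IDs are hypothetical stand-ins for a real ontology lookup.

```python
SCATAC = "EFO:0010891"  # scATAC-seq (from the definitions above)
SCRNA = "EFO:0008913"   # single-cell RNA sequencing (from the definitions above)

# Hypothetical toy ancestor table: term ID -> set of all its ancestors.
# A real check would query the EFO ontology instead.
TOY_ANCESTORS = {
    "EX:paired_assay": {SCATAC, SCRNA},
    "EX:unpaired_assay": {SCATAC},
    "EX:rna_only_assay": {SCRNA},
    SCATAC: set(),
}


def classify_assay(term_id, ancestors=TOY_ANCESTORS):
    """Return 'paired', 'unpaired', or 'other' for an assay term ID."""
    anc = ancestors.get(term_id, set())
    descends_from_atac = SCATAC in anc
    descends_from_rna = SCRNA in anc
    # paired: a descendant of both scATAC-seq and single-cell RNA sequencing
    if descends_from_atac and descends_from_rna:
        return "paired"
    # unpaired: scATAC-seq itself or a descendant, and not an RNA descendant
    if (term_id == SCATAC or descends_from_atac) and not descends_from_rna:
        return "unpaired"
    return "other"
```

Under this reading, a "paired" result implies that a gene expression matrix is required, and an "unpaired" result implies that a gene activity matrix is required.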

The following table describes the matrix data and layers requirements that are assay-specific. If an entry in the table is empty, the schema does not have any other requirements on data in those layers beyond the ones listed above.

| Assay | "raw" required? | "raw" location | "normalized" required? | "normalized" location |
|---|---|---|---|---|
| scRNA-seq (UMI, e.g. 10x multiome, 10x v3, Slide-seqV2) | REQUIRED. Values MUST be de-duplicated molecule counts. Each cell MUST contain at least one non-zero value. All non-zero values MUST be positive integers stored as numpy.float32. | AnnData.raw.X unless no "normalized" is provided, then AnnData.X | STRONGLY RECOMMENDED | AnnData.X |
| Visium Spatial (e.g. V1, CytAssist) | REQUIRED. Values MUST be de-duplicated molecule counts. All non-zero values MUST be positive integers stored as numpy.float32.<br><br>If uns['spatial']['is_single'] is False then each cell MUST contain at least one non-zero value.<br><br>If uns['spatial']['is_single'] is True then the unfiltered feature-barcode matrix (raw_feature_bc_matrix) MUST be used. See Space Ranger Feature-Barcode Matrices.<br><br>If assay_ontology_term_id is "EFO:0022860" for Visium CytAssist Spatial Gene Expression, 11mm, this matrix MUST contain 14336 rows; otherwise, this matrix MUST contain 4992 rows.<br><br>If the obs['in_tissue'] value is 1, then the cell MUST contain at least one non-zero value. If any obs['in_tissue'] values are 0, then at least one cell corresponding to an obs['in_tissue'] with a value of 0 MUST contain a non-zero value. | AnnData.raw.X unless no "normalized" is provided, then AnnData.X | STRONGLY RECOMMENDED | AnnData.X |
| scRNA-seq (non-UMI, e.g. SS2) | REQUIRED. Values MUST be one of read counts (e.g. FeatureCounts) or estimated fragments (e.g. output of RSEM). Each cell MUST contain at least one non-zero value. All non-zero values MUST be positive integers stored as numpy.float32. | AnnData.raw.X unless no "normalized" is provided, then AnnData.X | STRONGLY RECOMMENDED | AnnData.X |
| unpaired Accessibility (e.g. ATAC-seq, mCT-seq) | NOT REQUIRED | | REQUIRED | AnnData.X |
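The "raw" requirements shared by the RNA rows of the table (numpy.float32 storage, positive-integer non-zero values, at least one non-zero value per cell) could be checked along the following lines. This is a rough sketch under stated assumptions, not the validator's code: `check_raw_counts` is a hypothetical helper, and it assumes a dense cells-by-genes array, whereas real AnnData matrices are often sparse.

```python
import numpy as np


def check_raw_counts(raw, require_nonzero_cells=True):
    """Return a list of problems found; an empty list means all checks passed.

    Hypothetical helper. Assumes `raw` is a dense 2-D array, cells x genes.
    """
    problems = []
    raw = np.asarray(raw)
    # Values must be stored as numpy.float32.
    if raw.dtype != np.float32:
        problems.append("dtype is not numpy.float32")
    # All non-zero values must be positive integers.
    nonzero = raw[raw != 0]
    if np.any(nonzero < 0) or np.any(nonzero != np.round(nonzero)):
        problems.append("non-zero values are not all positive integers")
    # Each cell (row) must contain at least one non-zero value.
    if require_nonzero_cells and np.any((raw != 0).sum(axis=1) == 0):
        problems.append("at least one cell has no non-zero value")
    return problems
```

For the Visium case where uns['spatial']['is_single'] is True, the per-cell check does not apply in this simple form, so a caller might pass `require_nonzero_cells=False` and handle the obs['in_tissue'] rules separately.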
brianraymor added the 5.3 (Next minor CELLxGENE schema version after 5.2) and schema (CELLxGENE Discover dataset schema) labels on Mar 6, 2025
brianraymor self-assigned this on Mar 6, 2025
@jahilton
Collaborator

LGTM
