Quickstart

This page shows the shortest path from installation to harmonized metadata tables. It assumes no prior knowledge of the repository.

Install From PyPI

Install the current pre-alpha release:

python -m pip install omicsmeta

Check that the command-line interface is available:

omicsmeta --help

For an isolated command-line install, pipx install omicsmeta is also a good option if pipx is available on your system.

Create A Tiny Input File

omicsmeta reads CSV or TSV files with column names in the first row. The column names do not need to be standardized, but descriptive names such as species, tissue, disease, cell line, and sex help the field detector.

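# Write three tab-separated rows; printf reuses the format for each group of six values.
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  sample_id species tissue disease 'cell line' sex \
  sample_1 'Homo sapiens' lung NSCLC A549 female \
  sample_2 'Homo sapiens' breast 'breast cancer' MCF-7 female \
  > metadata.tsv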
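
The same two-sample table can also be written from Python with tab delimiters, which keeps the file content consistent with its .tsv extension. This is a minimal sketch using only the standard library; the helper name write_metadata_tsv is illustrative and not part of omicsmeta.

```python
import csv
from pathlib import Path

# Example rows copied from the quickstart table above.
ROWS = [
    ["sample_id", "species", "tissue", "disease", "cell line", "sex"],
    ["sample_1", "Homo sapiens", "lung", "NSCLC", "A549", "female"],
    ["sample_2", "Homo sapiens", "breast", "breast cancer", "MCF-7", "female"],
]


def write_metadata_tsv(path="metadata.tsv"):
    """Write the example rows as a tab-separated file and return its path."""
    path = Path(path)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(ROWS)
    return path
```

Calling write_metadata_tsv() with no arguments creates metadata.tsv in the current directory.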

Harmonize The File

Run the default offline mapper and write the main output tables:

omicsmeta harmonize metadata.tsv \
  --output harmonized.tsv \
  --unmapped unmapped.tsv \
  --unmapped-summary-output unmapped_summary.tsv \
  --sample-output samples.tsv \
  --report qc_report.html

The command produces:

harmonized.tsv: accepted ontology mappings with confidence scores and source-column provenance.
unmapped.tsv: candidate terms that were not accepted automatically.
unmapped_summary.tsv: deduplicated review terms, useful for manual curation.
samples.tsv: one row per sample with sample-wide ontology columns.
qc_report.html: a compact HTML summary of mapping rates and warnings.
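
A quick way to check the tabular outputs listed above is to read each header and count its rows. The sketch below assumes only that the .tsv outputs are tab-separated with a header row; it makes no assumption about specific column names, and the helper names are illustrative.

```python
import csv
from pathlib import Path


def summarize_table(path):
    """Return (column_names, row_count) for a tab-separated file with a header row."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader, [])
        # Skip completely blank lines so they are not counted as data rows.
        row_count = sum(1 for row in reader if row)
    return header, row_count


def summarize_outputs(paths):
    """Summarize every listed table that exists on disk; missing files are skipped."""
    return {str(p): summarize_table(p) for p in map(Path, paths) if p.is_file()}
```

For example, summarize_outputs(["harmonized.tsv", "unmapped.tsv", "unmapped_summary.tsv", "samples.tsv"]) returns one entry per file that was actually written.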

The built-in mapper covers common demonstration terms and works without network access. For real projects, add managed ontology resources or local OBO files as described below.

Fetch Metadata From GEO

Use --geo-accession to fetch GEO SOFT metadata directly from NCBI GEO:

omicsmeta harmonize \
  --geo-accession GSE123456 \
  --output harmonized.tsv \
  --unmapped unmapped.tsv \
  --unmapped-summary-output unmapped_summary.tsv \
  --sample-output samples.tsv \
  --report qc_report.html

Network access is required for direct GEO fetching. If you already have a SOFT snippet on disk, use --input-type geo_soft.
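
To get a feel for what a SOFT file contains before harmonizing it, the sample characteristics can be pulled out with a few lines of Python. This is a rough sketch of the general GEO SOFT layout (entity lines starting with ^, attribute lines starting with !), not omicsmeta's own parser; the function name and the accessions in the usage note are made up for illustration.

```python
def parse_soft_characteristics(text):
    """Map each sample accession to its 'key: value' characteristics."""
    samples = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("^"):
            # Entity line, e.g. "^SAMPLE = GSM0000001".
            entity, _, value = line[1:].partition("=")
            current = value.strip() if entity.strip() == "SAMPLE" else None
            if current is not None:
                samples.setdefault(current, {})
        elif current and line.startswith("!Sample_characteristics_ch"):
            # Attribute line, e.g. "!Sample_characteristics_ch1 = tissue: lung".
            _, _, value = line.partition("=")
            key, sep, val = value.partition(":")
            if sep:
                samples[current][key.strip()] = val.strip()
    return samples
```

Given a block with "^SAMPLE = GSM0000001" followed by "!Sample_characteristics_ch1 = tissue: lung", the function returns a dictionary keyed by GSM0000001 with a tissue entry. Characteristics without a colon are ignored in this sketch.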

Read BioSample Or SRA XML

Use --input-type biosample_xml for NCBI BioSample XML exports:

omicsmeta harmonize biosample.xml \
  --input-type biosample_xml \
  --output harmonized.tsv \
  --unmapped unmapped.tsv \
  --unmapped-summary-output unmapped_summary.tsv \
  --sample-output samples.tsv \
  --report qc_report.html

Use --input-type sra_xml for SRA XML files that contain SAMPLE and SAMPLE_ATTRIBUTE blocks:

omicsmeta harmonize sra.xml \
  --input-type sra_xml \
  --output harmonized.tsv \
  --unmapped unmapped.tsv \
  --unmapped-summary-output unmapped_summary.tsv \
  --sample-output samples.tsv \
  --report qc_report.html
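
The SAMPLE and SAMPLE_ATTRIBUTE blocks mentioned above are plain XML, so they can be inspected with the standard library before running the harmonizer. The sketch below assumes the usual SRA layout, with TAG and VALUE children inside each SAMPLE_ATTRIBUTE; it is an illustration, not the reader that omicsmeta uses, and the fallback key naming is hypothetical.

```python
import xml.etree.ElementTree as ET


def sra_sample_attributes(xml_text):
    """Return {sample_key: {tag: value}} for every SAMPLE element in the document."""
    root = ET.fromstring(xml_text)
    samples = {}
    for sample in root.iter("SAMPLE"):
        # Prefer the accession, then the alias, then a positional fallback name.
        key = sample.get("accession") or sample.get("alias") or f"sample_{len(samples) + 1}"
        attributes = {}
        for attr in sample.iter("SAMPLE_ATTRIBUTE"):
            tag = (attr.findtext("TAG") or "").strip()
            value = (attr.findtext("VALUE") or "").strip()
            if tag:
                attributes[tag] = value
        samples[key] = attributes
    return samples
```

Reading the file with Path("sra.xml").read_text() and passing the result to sra_sample_attributes shows which attribute names are available to the field detector.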

Add Ontology Resources

List managed ontology resources:

omicsmeta ontologies list

Download selected OBO resources and build a local SQLite synonym index:

omicsmeta ontologies download doid uberon cl
omicsmeta ontologies index --resource doid --resource uberon --resource cl
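
Conceptually, a synonym index maps every term name and synonym in an OBO file to its ontology identifier. The sketch below shows that idea with the standard library only. The table schema, function names, and lowercase matching are assumptions made for illustration and may differ from the index that omicsmeta builds.

```python
import re
import sqlite3

# Matches the quoted text of an OBO synonym line, e.g. synonym: "NSCLC" EXACT []
SYNONYM_RE = re.compile(r'^synonym:\s+"((?:[^"\\]|\\.)*)"')


def parse_obo_labels(text):
    """Yield (term_id, label) pairs for names and synonyms inside [Term] stanzas."""
    in_term = False
    term_id = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are indexed.
            in_term = line == "[Term]"
            term_id = None
        elif not in_term:
            continue
        elif line.startswith("id:"):
            term_id = line[3:].strip()
        elif term_id and line.startswith("name:"):
            yield term_id, line[5:].strip()
        elif term_id:
            match = SYNONYM_RE.match(line)
            if match:
                yield term_id, match.group(1)


def build_synonym_index(obo_text, db_path=":memory:"):
    """Create a (label, term_id) table; the schema is hypothetical."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS synonyms "
        "(label TEXT, term_id TEXT, PRIMARY KEY (label, term_id))"
    )
    rows = [(label.lower(), term_id) for term_id, label in parse_obo_labels(obo_text)]
    conn.executemany("INSERT OR IGNORE INTO synonyms VALUES (?, ?)", rows)
    conn.commit()
    return conn


def lookup(conn, label):
    """Return the term IDs whose name or synonym equals the label, ignoring case."""
    cursor = conn.execute(
        "SELECT term_id FROM synonyms WHERE label = ? ORDER BY term_id",
        (label.lower(),),
    )
    return [row[0] for row in cursor]
```

Exact, case-insensitive matching keeps the sketch short; a production index would likely also normalize punctuation and record synonym scope.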

Use cached resources during harmonization:

omicsmeta harmonize metadata.tsv \
  --ontology-resource doid \
  --ontology-resource uberon \
  --ontology-resource cl \
  --output harmonized.tsv \
  --unmapped unmapped.tsv \
  --unmapped-summary-output unmapped_summary.tsv \
  --sample-output samples.tsv \
  --report qc_report.html

By default, resources are stored under ~/.cache/omicsmeta/ontologies. Use --cache-dir with omicsmeta ontologies download or omicsmeta ontologies index to choose another cache location, and use --ontology-cache-dir with omicsmeta harmonize to read from that location.

You can also load local OBO files directly:

omicsmeta harmonize metadata.tsv \
  --ontology-obo disease_slim.obo \
  --output harmonized.tsv \
  --unmapped unmapped.tsv \
  --unmapped-summary-output unmapped_summary.tsv \
  --sample-output samples.tsv \
  --report qc_report.html

Batch Harmonization

Use the batch subcommand to harmonize multiple files or GEO accessions in one run:

omicsmeta batch \
  --input metadata_a.tsv \
  --input metadata_b.tsv \
  --output harmonized.tsv \
  --unmapped unmapped.tsv \
  --unmapped-summary-output unmapped_summary.tsv \
  --sample-output samples.tsv \
  --report qc_report.html

Batch outputs include a batch_source column so rows can be traced back to their input file or accession.
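
The batch_source column makes it easy to check how many rows each input contributed. The following sketch assumes the batch output is a tab-separated file with a header row; the function name is illustrative.

```python
import csv
from collections import Counter


def rows_per_source(path, column="batch_source"):
    """Count the rows contributed by each input file or accession."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        if column not in (reader.fieldnames or []):
            raise KeyError(f"{column!r} column not found in {path}")
        return Counter(row[column] for row in reader)
```

For example, rows_per_source("harmonized.tsv").most_common() lists the sources from largest to smallest contribution.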

Python API

Use the API when harmonization is part of a larger workflow:

from omicsmeta.core.harmonizer import Harmonizer

result = Harmonizer(confidence_threshold=0.70).from_file(
    "metadata.tsv",
    file_type="tabular",
)

print(result.qc_summary)
print(result.sample_table)

See the API reference for result objects, mapper backends, and output writers.
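
When a workflow would rather call the command-line interface than import the package, the harmonize flags shown earlier on this page can be assembled in Python. This is a minimal sketch; build_harmonize_command, run_harmonize, and the out_dir layout are illustrative helpers and not part of omicsmeta.

```python
import subprocess
from pathlib import Path


def build_harmonize_command(input_path, out_dir="."):
    """Build the argument list for `omicsmeta harmonize` using the documented flags."""
    out = Path(out_dir)
    return [
        "omicsmeta", "harmonize", str(input_path),
        "--output", str(out / "harmonized.tsv"),
        "--unmapped", str(out / "unmapped.tsv"),
        "--unmapped-summary-output", str(out / "unmapped_summary.tsv"),
        "--sample-output", str(out / "samples.tsv"),
        "--report", str(out / "qc_report.html"),
    ]


def run_harmonize(input_path, out_dir="."):
    """Run the CLI; subprocess raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_harmonize_command(input_path, out_dir), check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.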

Development Install

Use the editable install only when working from a source checkout:

git clone https://github.com/qchiujunhao/omicsmeta.git
cd omicsmeta
python -m pip install -e ".[dev,docs]"
python -m pytest

The repository includes examples/basic/ and benchmarks/ fixtures for local development, documentation checks, and known-answer benchmark runs.