Design

omicsmeta is built around a narrow pipeline goal: convert messy public omics metadata into ontology-mapped tables that can be reviewed, benchmarked, and used by downstream analysis workflows.

Pipeline

The harmonization pipeline runs these steps:

Read metadata from a tabular file, GEO SOFT snippet, BioSample XML file, SRA XML file, or fetched GEO accession.
Detect the semantic role of each column, such as disease, tissue, cell line, species, sex, age, or treatment.
Normalize and split values into candidate terms.
Map each term to ontology candidates with a pluggable mapper backend.
Route confident mappings to the harmonized table and lower-confidence terms to manual review outputs.
Add transparent inferred terms when a recognized cell line implies species, tissue, or disease and the source row lacks those fields.
Emit detailed, sample-wide, unmapped-summary, and QC-report outputs.

Conservative Field Routing

Public metadata often uses vague columns such as phenotype, characteristics, or sample type. omicsmeta avoids treating those names as enough evidence for ontology routing. Column-name hints are combined with value-level evidence, and low-confidence or ambiguous terms are kept in the unmapped review outputs.

This design favors reviewable false negatives over silent false positives.

Mapper Boundaries

Term-to-ontology matching is intentionally pluggable. The built-in mapper is an offline fallback and test fixture backend. The optional text2term adapter lets users delegate broad biomedical term grounding to an external package while keeping omicsmeta responsible for metadata-specific preprocessing, field routing, provenance, and output tables.

The package should not become a replacement for mature ontology matching tools. Its contribution is the metadata harmonization workflow around those tools.

Output Tables

The detailed harmonized and unmapped tables preserve term-level provenance: input row, sample identifier, source column, raw value, normalized term, detected field type, ontology candidate, confidence score, backend, and accepted status.

The sample-wide table is intended for analysis workflows that need one row per sample. It aggregates ontology IDs, labels, ontologies, source columns, and confidence scores by semantic field.

The unmapped summary groups repeated review terms across samples and batch inputs. It is designed for curator triage: frequent repeated failures appear first, with sample IDs, source columns, example text, and best candidate metadata.

Validation and Inference

Validation is implemented as warnings, not hard failures. Current checks focus on row-level consistency and missing expected context. Cell-line inference is explicitly marked with backend=inference and includes provenance columns such as inferred_from in the detailed output.

Inference records are useful for downstream completeness but should be treated separately from direct source metadata during benchmarking.

Galaxy Wrapper

The repository includes a Galaxy wrapper scaffold under galaxy-omicsmeta/. It wraps the same CLI workflow, emits the same five output files, and includes small test data for Planemo-oriented validation. The wrapper is not yet a Tool Shed release because the Python package still needs external packaging through a Galaxy-compatible channel.

Current Limitations

GEO SOFT, tabular, BioSample XML, and SRA XML inputs are supported. The XML readers cover common sample attributes but are not complete parsers for every NCBI export shape.
The bundled vocabulary is intentionally small and should be extended with managed ontology resources or user-provided OBO files for real projects.
The benchmark command currently scores known-answer fixtures, not a publication-scale curated benchmark set.
The Galaxy wrapper is a scaffold and has not yet been submitted to the Galaxy Tool Shed.
The project is pre-alpha and not yet ready for JOSS submission.