Model Card for INDUS-SDE-ST (INDUS-SDE Sentence Transformer)

Paper: INDUS-SDE: A Language Model for Scientific Content Curation and Discovery — KDD 2026, AI for Sciences Track. INDUS-SDE-ST is a sentence transformer for semantic scientific discovery, built on the INDUS-SDE encoder. Code: NASA-IMPACT/st-training-workflow · Binary (EQAT) variant: indus-sde-st-equat-v0.1

This model was first fine-tuned on the sentence embedding task, starting from the previous checkpoint (nasa-impact/indus-sde-st-v0.1) and using the stage 2 (scientific) dataset for one epoch. It was then fine-tuned for two more epochs on the NASA SDE and NASA ADS corpora.

The initial stage of INDUS-SDE-ST training focused on adapting the base INDUS-SDE model to general-domain semantics and sentence-pair relationships. The stage 2 dataset was designed for scientific domain adaptation. The primary objective was to establish a broad linguistic foundation before specializing in scientific content in subsequent stages. This was achieved using a diverse corpus of text pairs from S2ORC, arXiv, PubMed, NASA ADS, and NASA SDE, trained with a contrastive learning objective: Multiple Negatives Ranking loss.
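As a rough illustration of the Multiple Negatives Ranking objective named above, the sketch below computes the loss for a batch of (anchor, positive) pairs in plain Python. Each anchor is scored against every positive in the batch by cosine similarity, and the loss is the cross-entropy of selecting its own positive, so the other positives act as in-batch negatives. The function name `mnr_loss`, the toy vectors, and the similarity scale of 20 are assumptions for illustration, not values taken from the INDUS-SDE-ST training configuration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mnr_loss(anchors, positives, scale=20.0):
    """Mean in-batch cross-entropy: positives[i] is the match for anchors[i].

    `scale` multiplies the cosine scores before the softmax; 20.0 is an
    assumed value, not one documented for this model.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(x - top) for x in logits))
        # Negative log-probability of the correct (diagonal) pair.
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# Toy batch (hypothetical embeddings): two anchors and their positives.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each positive is close to its anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]   # positives paired with the wrong anchor

loss_aligned = mnr_loss(anchors, aligned)
loss_swapped = mnr_loss(anchors, swapped)
```

With this toy batch, `loss_aligned` is close to zero while `loss_swapped` is large, which is the behavior the objective rewards: matched pairs should score higher than every other pairing in the batch.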

Dataset table

Dataset Name	Data Points	Type	Link
S2ORC_title_abstract	~41.8M	Title-Body	Link
S2ORC_abstract_citation	~39.6M	Body-Body	Link
S2ORC_title_citation	~51M	Title-Title	Link
arxiv_title_abstract	~2.7M	Title-Body	Link
PubMed	~24M	Title-Body	Link
specter	~684K	Title-Body	Link
nasa_ads	~2.66M	Title-Abstract	Link
SDE-synthesized	177486	question-answer	Link
SDE-synthesized	194382	search_terms-document
CMR-natural	53974	Title-Description
PDS-natural	9832	Title-Description
CMR-synthesized	796097	search_terms-document
PDS-synthesized	52777	search_terms-document
Total	~162.4M

Evaluation

We evaluate the model on several benchmark datasets, in particular the NASA SDE IR Benchmark, Nano BEIR, and the NASA SMD IR Benchmark.

We observe that the model from this stage performs better overall than the original INDUS Sentence Transformer and the ModernBERT-based sentence transformer.

The model uploaded to Hugging Face is indus-sde-st-v0.2_vocal-river-16.

# Models compared in the evaluation, keyed by checkpoint name.
models = {
    "modernbert-embed-base": "ModernBERT based embedding model",
    "nasa-smd-ibm-st-v2": "Original Indus Sentence Transformer",
    "indus-sde-st-v0.1": "Indus-SDE Stage 1 Sentence Transformer",
    "indus-sde-st-v0.2_whole-moon-14": "Indus-SDE Stage 2 Sentence Transformer (Trained on full dataset and faster learning rate)",
    "indus-sde-st-v0.2_atomic-plasma-15": "Indus-SDE Stage 2 Sentence Transformer (Trained just on the sde/ads dataset)",
    "indus-sde-st-v0.2_vocal-river-16": "Indus-SDE Stage 2 Sentence Transformer (Trained on top of model 14 with nasa sde/ads for 2 epochs)",
}
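For readers who want to try the uploaded checkpoint for retrieval, the following is a minimal sketch, assuming the model loads through the `sentence-transformers` library under the repository name `nasa-impact/indus-sde-st-v0.2` and needs no query or document prefix. Neither assumption is confirmed by this card. The helper names `embed`, `rank_by_cosine`, and `demo`, and the example query and documents, are hypothetical.

```python
import math

# Assumed Hugging Face repository name for the uploaded checkpoint.
MODEL_ID = "nasa-impact/indus-sde-st-v0.2"


def embed(texts, model_id=MODEL_ID):
    """Encode texts into unit-length embeddings (downloads the model on first use)."""
    # Imported lazily so rank_by_cosine can be used without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_id)
    return model.encode(texts, normalize_embeddings=True).tolist()


def rank_by_cosine(query_vec, doc_vecs):
    """Return document indices ordered from most to least similar to the query."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm

    scores = [cosine(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


def demo():
    """Rank a few hypothetical documents against a hypothetical query."""
    query = "sea surface temperature datasets"
    docs = [
        "Monthly global sea surface temperature analysis from satellite radiometers.",
        "Lunar regolith sample catalog from the Apollo missions.",
    ]
    vectors = embed([query] + docs)
    return [docs[i] for i in rank_by_cosine(vectors[0], vectors[1:])]
```

Calling `demo()` downloads the model and returns the documents in ranked order. The ranking helper is kept separate from the model call so it can be checked on small hand-written vectors.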

NASA SDE IR Benchmark

Nano BEIR

NASA SMD IR Benchmark

Citation

If you use INDUS-SDE-ST (or INDUS-SDE), please cite:

@inproceedings{pantha2026indussde,
  author    = {Pantha, Nishan and Awale, Sajil and Kuruvanthodi, Vishnudev and KC, Simran and Ramasubramanian, Muthukumaran and Davis, Carson and Praveen, Bishwas and Foshee, Emily and Bhattacharjee, Bishwaranjan and Bugbee, Kaylin and Ramachandran, Rahul},
  title     = {{INDUS-SDE}: A Language Model for Scientific Content Curation and Discovery},
  year      = {2026},
  isbn      = {979-8-4007-2259-2},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  doi       = {10.1145/3770855.3818847},
  booktitle = {Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD '26)},
  location  = {Jeju Island, Republic of Korea},
  series    = {KDD '26}
}


Safetensors · Model size: 0.1B params · Tensor type: BF16
