Model Card for INDUS-SDE-ST (INDUS-SDE Sentence Transformer)

Paper: INDUS-SDE: A Language Model for Scientific Content Curation and Discovery — KDD 2026, AI for Sciences Track. INDUS-SDE-ST is a sentence transformer for semantic scientific discovery, built on the INDUS-SDE encoder. Code: NASA-IMPACT/st-training-workflow · Binary (EQAT) variant: indus-sde-st-equat-v0.1

This model was trained in two steps. First, the previous checkpoint (nasa-impact/indus-sde-st-v0.1) was further fine-tuned on the sentence-embedding task for one epoch using the stage 2 (scientific) dataset. The resulting model was then fine-tuned for two more epochs on the NASA SDE and NASA ADS corpora.

The initial stage of INDUS-SDE-ST training adapted the base INDUS-SDE model to general-domain semantics and sentence-pair relationships. Its primary objective was to establish a broad linguistic foundation before specializing in scientific content in subsequent stages. The stage 2 dataset was designed for scientific domain adaptation: it is a diverse corpus of text pairs drawn from S2ORC, arXiv, PubMed, NASA ADS, and NASA SDE, trained with a contrastive learning objective, Multiple Negatives Ranking loss.
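To make the contrastive objective concrete, here is a minimal pure-Python sketch of a Multiple Negatives Ranking loss over one batch of (anchor, positive) pairs. It is an illustration only, not the training code used for this model. The function names, the toy vectors, and the `scale` value of 20.0 are assumptions chosen for the example. For each anchor, its paired positive is the target and every other positive in the batch acts as a negative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for
    anchors[i] and the remaining positives serve as in-batch negatives.
    `scale` is an illustrative temperature-like factor (assumption)."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the batch.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings (hypothetical): two anchors and their paired texts.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each positive is close to its anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each positive is close to the wrong anchor

print("aligned pairs:", mnr_loss(anchors, aligned))
print("swapped pairs:", mnr_loss(anchors, swapped))
```

The loss is low when each anchor is most similar to its own positive and high when it is more similar to another item in the batch, which is why larger and more varied batches tend to supply harder negatives.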

Dataset table

| Dataset Name | Data Points | Type |
|---|---|---|
| S2ORC_title_abstract | ~41.8M | Title-Body |
| S2ORC_abstract_citation | ~39.6M | Body-Body |
| S2ORC_title_citation | ~51M | Title-Title |
| arxiv_title_abstract | ~2.7M | Title-Body |
| PubMed | ~24M | Title-Body |
| specter | ~684K | Title-Body |
| nasa_ads | ~2.66M | Title-Abstract |
| SDE-synthesized | 177,486 | question-answer |
| SDE-synthesized | 194,382 | search_terms-document |
| CMR-natural | 53,974 | Title-Description |
| PDS-natural | 9,832 | Title-Description |
| CMR-synthesized | 796,097 | search_terms-document |
| PDS-synthesized | 52,777 | search_terms-document |
| Total | ~162.4M | |
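As a quick sanity check on the table, the row counts can be tallied as in the sketch below. The approximate rows (marked "~") are taken at their rounded values, so the sums are approximate too. The grouping into "large corpora" and "NASA-specific sets" is an assumption made for this example, not a split defined by the authors.

```python
# Rounded counts for the rows marked "~" in the table (approximate).
large_corpora = {
    "S2ORC_title_abstract": 41_800_000,
    "S2ORC_abstract_citation": 39_600_000,
    "S2ORC_title_citation": 51_000_000,
    "arxiv_title_abstract": 2_700_000,
    "PubMed": 24_000_000,
    "specter": 684_000,
    "nasa_ads": 2_660_000,
}

# Exact counts for the SDE, CMR and PDS rows (labels are shorthand).
nasa_sets = {
    "SDE question-answer": 177_486,
    "SDE search_terms-document": 194_382,
    "CMR natural": 53_974,
    "PDS natural": 9_832,
    "CMR search_terms-document": 796_097,
    "PDS search_terms-document": 52_777,
}

large_total = sum(large_corpora.values())
nasa_total = sum(nasa_sets.values())

print(f"large corpora:      {large_total / 1e6:.1f}M")
print(f"NASA-specific sets: {nasa_total / 1e6:.2f}M")
print(f"all rows:           {(large_total + nasa_total) / 1e6:.1f}M")
```

Under these rounded values, the seven large corpora alone sum to about 162.4M pairs, and the six SDE, CMR and PDS sets add roughly 1.28M more. The stated total therefore appears to match the large corpora, with the smaller sets falling within rounding of the approximate rows.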

Evaluation

We evaluate the model on a variety of benchmark datasets, in particular the NASA SDE IR Benchmark, Nano BEIR, and the NASA SMD IR Benchmark.

Overall, the model from this stage outperforms both the original INDUS Sentence Transformer and the ModernBERT-based sentence transformer.

The model uploaded to Hugging Face is indus-sde-st-v0.2_vocal-river-16.
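A minimal usage sketch follows. It assumes the uploaded checkpoint loads through the `sentence-transformers` library and that its repository id is `nasa-impact/indus-sde-st-v0.2`. Neither is stated in the text above, so treat both as assumptions. The helper names (`embed`, `rank`, `demo`) and the example texts are hypothetical.

```python
import math

# Assumed repository id; adjust if the checkpoint is published under another name.
MODEL_ID = "nasa-impact/indus-sde-st-v0.2"


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def rank(query_vec, doc_vecs):
    """Return document indices sorted by descending cosine similarity."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


def embed(texts, model_id=MODEL_ID):
    """Encode texts with the model (downloads weights on first use)."""
    # Imported lazily so `rank` can be used without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_id)
    return model.encode(texts).tolist()


def demo():
    """Rank a few hypothetical documents against a hypothetical query."""
    docs = [
        "Sea surface temperature anomalies derived from satellite radiometers.",
        "Spectral analysis of exoplanet atmospheres using transit photometry.",
        "Soil moisture retrievals from L-band microwave observations.",
    ]
    query = "datasets about ocean temperature"
    vectors = embed([query] + docs)
    for i in rank(vectors[0], vectors[1:]):
        print(docs[i])
```

Calling `demo()` requires network access and the `sentence-transformers` package. The `rank` helper itself has no external dependencies and can be reused with embeddings from any of the models compared here.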

models = {
    "modernbert-embed-base": "ModernBERT based embedding model",
    "nasa-smd-ibm-st-v2": "Original Indus Sentence Transformer",
    "indus-sde-st-v0.1": "Indus-SDE Stage 1 Sentence Transformer",
    "indus-sde-st-v0.2_whole-moon-14": "Indus-SDE Stage 2 Sentence Transformer (Trained on full dataset and faster learning rate)",
    "indus-sde-st-v0.2_atomic-plasma-15": "Indus-SDE Stage 2 Sentence Transformer (Trained just on the sde/ads dataset)",
    "indus-sde-st-v0.2_vocal-river-16": "Indus-SDE Stage 2 Sentence Transformer (Trained on top of model 14 with nasa sde/ads for 2 epochs)",
}
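The benchmarks compared here are information-retrieval tasks. This model card does not state which metric the result plots report, so the sketch below uses nDCG@k purely as an assumed example of a metric commonly used for BEIR-style retrieval evaluation. The function names and the toy relevance judgments are hypothetical.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k positions."""
    return sum(g / math.log2(pos + 2) for pos, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids, relevance, k=10):
    """nDCG@k for one query.

    ranked_doc_ids: document ids in the order the model returned them.
    relevance: dict mapping relevant document ids to a positive gain.
    """
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


# Toy judgments: only document "a" is relevant to the query.
relevance = {"a": 1}
print(ndcg_at_k(["a", "b", "c"], relevance))  # relevant document ranked first
print(ndcg_at_k(["b", "a", "c"], relevance))  # relevant document ranked second
```

Averaging this score over all queries in a benchmark gives a single number per model, which is one way the entries in the `models` dictionary could be compared.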

NASA SDE IR Benchmark

[Figure: NASA SDE IR Benchmark results]

Nano BEIR

[Figure: Nano BEIR results]

NASA SMD IR Benchmark

[Figure: NASA SMD IR Benchmark results]

Citation

If you use INDUS-SDE-ST (or INDUS-SDE), please cite:

@inproceedings{pantha2026indussde,
  author    = {Pantha, Nishan and Awale, Sajil and Kuruvanthodi, Vishnudev and KC, Simran and Ramasubramanian, Muthukumaran and Davis, Carson and Praveen, Bishwas and Foshee, Emily and Bhattacharjee, Bishwaranjan and Bugbee, Kaylin and Ramachandran, Rahul},
  title     = {{INDUS-SDE}: A Language Model for Scientific Content Curation and Discovery},
  year      = {2026},
  isbn      = {979-8-4007-2259-2},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  doi       = {10.1145/3770855.3818847},
  booktitle = {Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD '26)},
  location  = {Jeju Island, Republic of Korea},
  series    = {KDD '26}
}

Model size: 0.1B params (Safetensors). Tensor type: BF16.