Model Card for INDUS-SDE-ST (INDUS-SDE Sentence Transformer)
Paper: INDUS-SDE: A Language Model for Scientific Content Curation and Discovery — KDD 2026, AI for Sciences Track. INDUS-SDE-ST is a sentence transformer for semantic scientific discovery, built on the INDUS-SDE encoder. Code: NASA-IMPACT/st-training-workflow · Binary (EQAT) variant: indus-sde-st-equat-v0.1
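As a minimal usage sketch, the snippet below shows one way the model might be loaded and used for semantic search. It assumes the checkpoint loads through the `sentence-transformers` library (this card does not confirm the exact loading path), and the helper names `load_model`, `rank_by_cosine`, and `demo`, as well as the example query and documents, are hypothetical.

```python
import math

MODEL_ID = "nasa-impact/indus-sde-st-v0.2"


def load_model(model_id=MODEL_ID):
    # Assumption: the checkpoint is compatible with sentence-transformers.
    # Imported lazily so the ranking helper works without the library installed.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(model_id)


def rank_by_cosine(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar to the query."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    scores = [cos(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


def demo():
    # Hypothetical documents and query, made up for illustration only.
    docs = [
        "Sea surface temperature measurements from satellite radiometers.",
        "Spectral observations of exoplanet atmospheres.",
    ]
    query = "ocean temperature data"
    model = load_model()
    doc_vecs = model.encode(docs).tolist()
    query_vec = model.encode([query])[0].tolist()
    for i in rank_by_cosine(query_vec, doc_vecs):
        print(docs[i])
```

Calling `demo()` downloads the model and prints the documents in ranked order. The ranking helper is kept independent of the model so it can be checked with plain lists of floats.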
The model was first fine-tuned on a sentence-embedding task, starting from the previous checkpoint (nasa-impact/indus-sde-st-v0.1), using the stage 2 (scientific) dataset for one epoch. It was then fine-tuned for two more epochs on the NASA SDE and NASA ADS corpus.
The initial stage of INDUS-SDE-ST training adapted the base INDUS-SDE model to general-domain semantics and sentence-pair relationships. Its primary objective was to establish a broad linguistic foundation before specializing in scientific content in subsequent stages. The stage 2 dataset was designed for scientific domain adaptation: it is a diverse corpus of text pairs from S2ORC, arXiv, PubMed, NASA ADS, and NASA SDE, used with a contrastive learning objective, Multiple Negatives Ranking loss.
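To make the objective concrete, here is a minimal pure-Python sketch of Multiple Negatives Ranking loss on toy vectors. It is not the training code used for this model. It assumes cosine similarity and a scale factor of 20 (the defaults in the `sentence-transformers` implementation of this loss), and the function names and toy embeddings are illustrative only.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the target for anchors[i]
    and every other positives[j] in the batch serves as a negative."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the whole batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# Toy 2-D "embeddings": in `aligned`, each anchor is closest to its own positive;
# in `swapped`, each anchor is closest to the other pair's positive.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = mnr_loss(anchors, aligned)
loss_swapped = mnr_loss(anchors, swapped)
```

The loss is small when each anchor is most similar to its own pair and large when another pair in the batch scores higher, which is why larger batches supply more in-batch negatives for each pair.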
Dataset table
| Dataset Name | Data Points | Type | Link |
|---|---|---|---|
| S2ORC_title_abstract | ~41.8M | Title-Body | Link |
| S2ORC_abstract_citation | ~39.6M | Body-Body | Link |
| S2ORC_title_citation | ~51M | Title-Title | Link |
| arxiv_title_abstract | ~2.7M | Title-Body | Link |
| PubMed | ~24M | Title-Body | Link |
| specter | ~684K | Title-Body | Link |
| nasa_ads | ~2.66M | Title-Abstract | Link |
| SDE-synthesized | 177486 | question-answer | Link |
| SDE-synthesized | 194382 | search_terms-document | |
| CMR-natural | 53974 | Title-Description | |
| PDS-natural | 9832 | Title-Description | |
| CMR-synthesized | 796097 | search_terms-document | |
| PDS-synthesized | 52777 | search_terms-document | |
| Total | ~162.4M | | |
Evaluation
We evaluate the model on several benchmark datasets, in particular the NASA SDE IR Benchmark, Nano BEIR, and the NASA SMD IR Benchmark.
We observe that the model from this stage performs better overall than the original INDUS Sentence Transformer and the ModernBERT-based sentence transformer.
The model uploaded to the Hugging Face Hub is indus-sde-st-v0.2_vocal-river-16.
models = {
    "modernbert-embed-base": "ModernBERT based embedding model",
    "nasa-smd-ibm-st-v2": "Original Indus Sentence Transformer",
    "indus-sde-st-v0.1": "Indus-SDE Stage 1 Sentence Transformer",
    "indus-sde-st-v0.2_whole-moon-14": "Indus-SDE Stage 2 Sentence Transformer (Trained on full dataset and faster learning rate)",
    "indus-sde-st-v0.2_atomic-plasma-15": "Indus-SDE Stage 2 Sentence Transformer (Trained just on the SDE/ADS dataset)",
    "indus-sde-st-v0.2_vocal-river-16": "Indus-SDE Stage 2 Sentence Transformer (Trained on top of model 14 with NASA SDE/ADS for 2 epochs)",
}
NASA SDE IR Benchmark
Nano BEIR
NASA SMD IR Benchmark
Citation
If you use INDUS-SDE-ST (or INDUS-SDE), please cite:
@inproceedings{pantha2026indussde,
author = {Pantha, Nishan and Awale, Sajil and Kuruvanthodi, Vishnudev and KC, Simran and Ramasubramanian, Muthukumaran and Davis, Carson and Praveen, Bishwas and Foshee, Emily and Bhattacharjee, Bishwaranjan and Bugbee, Kaylin and Ramachandran, Rahul},
title = {{INDUS-SDE}: A Language Model for Scientific Content Curation and Discovery},
year = {2026},
isbn = {979-8-4007-2259-2},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
doi = {10.1145/3770855.3818847},
booktitle = {Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD '26)},
location = {Jeju Island, Republic of Korea},
series = {KDD '26}
}


