Model Card for INDUS-SDE

This model starts from nasa-smd-ibm-v0.1. Its context size was first extended, and it was then further pre-trained on the full Science Discovery Engine (SDE) website data with a Masked Language Modeling (MLM) task.

Paper: INDUS-SDE: A Language Model for Scientific Content Curation and Discovery (KDD 2026, AI for Sciences Track). INDUS-SDE prioritizes scientific terminology during pretraining via Weighted Dynamic Masking, which combines YAKE keyword masking with random masking, on NASA's noisy, web-sourced SDE corpus.
Code: NASA-IMPACT/mlm-fine-tuning
Sentence transformer: indus-sde-st-v0.2

Model Details

Base Model: nasa-impact/nasa-smd-ibm-v0.1
Tokenizer: nasa-impact/nasa-smd-ibm-v0.1
Parameters: 125M
Pretraining Strategy: Masked Language Modeling (MLM)
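
As a quick way to try the model, the following is a minimal sketch that assumes the checkpoint is published with a masked-language-modeling head under the id nasa-impact/indus-sde-v0.2 and that the transformers library is installed. The helper name fill_mask and the "[MASK]" placeholder convention are illustrative choices, not part of the released code.

```python
MODEL_ID = "nasa-impact/indus-sde-v0.2"  # assumption: Hub id of this checkpoint


def fill_mask(text, top_k=3):
    """Return the top_k (token, score) guesses for one "[MASK]" placeholder.

    `text` must contain exactly one "[MASK]"; it is swapped for the
    tokenizer's real mask token before the model is called.
    """
    # Imported lazily so the module can be imported without transformers.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=MODEL_ID)
    prompt = text.replace("[MASK]", fill.tokenizer.mask_token)
    return [(p["token_str"], p["score"]) for p in fill(prompt, top_k=top_k)]


if __name__ == "__main__":
    # Downloads the model weights on first use.
    for token, score in fill_mask("Aerosol optical [MASK] is retrieved from satellite data."):
        print(f"{token!r}: {score:.3f}")
```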

Training Data

Full Science Discovery Engine (SDE) Website Data

Training Procedure

transformers Version: 4.48.3
Strategy: Masked Language Modeling (MLM)
Stage 1 Training: Increase the context size from 512 to 1024 tokens, then slowly pre-train only the position encoding layer for 1 epoch. (This is done to retain the representations learned by the original upstream INDUS model, which was trained on a large scientific corpus.)
Stage 2 Training: Full training with a cosine LR scheduler with warmup for 5 epochs
Masking Strategy:
- Weighted Dynamic Masking based on Keyword Importance (YAKE) and Random Masking
  - Masking important keywords is meant to force the model to generalize over the "science" keywords that give a high signal for the document
- Masked Language Model Probability: 30%
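
The masking strategy above can be illustrated with a small sketch. This is not the released implementation: how keyword scores and random masking are actually combined is not specified here, so the weighting scheme below (a uniform base weight plus a per-keyword bonus) is an assumption, as are the names weighted_dynamic_mask, keyword_weight and base_weight. The keyword_weight mapping stands in for YAKE output; note that YAKE itself assigns lower scores to more important keywords, so real scores would need to be inverted first.

```python
import random

MASK_TOKEN = "<mask>"  # assumption: placeholder string, not read from the real tokenizer


def weighted_dynamic_mask(tokens, keyword_weight, mlm_probability=0.30,
                          base_weight=1.0, rng=None):
    """Mask about `mlm_probability` of the tokens, favouring keywords.

    keyword_weight maps a lowercased token to an extra non-negative weight
    (higher means more important). Tokens missing from the mapping keep only
    `base_weight`, which plays the role of plain random masking.
    """
    rng = rng or random.Random()
    n_mask = max(1, round(len(tokens) * mlm_probability))

    # Weighted sampling without replacement (Efraimidis-Spirakis):
    # draw u ~ U(0, 1) per token and keep the largest u ** (1 / weight).
    keys = []
    for i, tok in enumerate(tokens):
        weight = base_weight + keyword_weight.get(tok.lower(), 0.0)
        keys.append((rng.random() ** (1.0 / weight), i))
    chosen = {i for _, i in sorted(keys, reverse=True)[:n_mask]}

    masked = [MASK_TOKEN if i in chosen else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in chosen else None for i, tok in enumerate(tokens)]
    return masked, labels


if __name__ == "__main__":
    sentence = "the aerosol optical depth retrieved from modis data over the ocean".split()
    weights = {"aerosol": 9.0, "modis": 9.0}  # hypothetical keyword importances
    print(weighted_dynamic_mask(sentence, weights, rng=random.Random(0))[0])
```

Calling the function afresh for every batch, rather than masking once up front, is what makes the masking "dynamic": the same document receives a different mask each time it is seen.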
Batch Size: 6
Learning rate: 5e-5
Warmup ratio: 0.1
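
To make the learning-rate settings concrete, here is a minimal sketch of a cosine schedule with warmup, using the learning rate (5e-5) and warmup ratio (0.1) listed above. It assumes linear warmup followed by a single half-cosine decay to zero, which is one common reading of "cosine warmup"; the exact scheduler used in training is not stated here. The total_steps value in the example is hypothetical.

```python
import math


def cosine_lr_with_warmup(step, total_steps, base_lr=5e-5, warmup_ratio=0.1):
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warmup phase.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000  # hypothetical number of optimizer steps
    for s in (0, 50, 100, 550, 1000):
        print(s, cosine_lr_with_warmup(s, total))
```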

Dataset

Total Data Size: 545,717
Validation Data Size: 10% of total size
Test Data Size: 10% of total size
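
For reference, the split sizes implied by these figures can be worked out as follows. This sketch assumes the 10% shares are rounded down and that the remainder is used for training; the actual rounding and splitting procedure is not stated here.

```python
TOTAL_SIZE = 545_717


def split_sizes(total, val_pct=10, test_pct=10):
    """Return (train, validation, test) sizes, rounding the held-out shares down."""
    n_val = total * val_pct // 100
    n_test = total * test_pct // 100
    return total - n_val - n_test, n_val, n_test


if __name__ == "__main__":
    print(split_sizes(TOTAL_SIZE))  # (436575, 54571, 54571)
```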

Evaluation

Top-k Test Mask Accuracy: {'top1': 0.7814, 'top2': 0.8319, 'top3': 0.8548}
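
As an illustration of the metric, top-k mask accuracy can be computed as the share of masked positions whose true token appears among the model's k highest-ranked predictions. The sketch below assumes that definition; the function name and the toy inputs are hypothetical and are not the evaluation code or data behind the numbers above.

```python
def top_k_mask_accuracy(ranked_predictions, gold_tokens, ks=(1, 2, 3)):
    """Fraction of masked positions whose gold token is in the top-k guesses.

    ranked_predictions[i] is the model's candidate list for masked position i,
    ordered from most to least likely; gold_tokens[i] is the true token.
    """
    if len(ranked_predictions) != len(gold_tokens):
        raise ValueError("one prediction list is required per gold token")
    if not gold_tokens:
        raise ValueError("at least one masked position is required")

    hits = {k: 0 for k in ks}
    for candidates, gold in zip(ranked_predictions, gold_tokens):
        for k in ks:
            if gold in candidates[:k]:
                hits[k] += 1
    return {f"top{k}": hits[k] / len(gold_tokens) for k in ks}


if __name__ == "__main__":
    predictions = [
        ["ozone", "aerosol", "cloud"],
        ["soil", "ice", "sea"],
        ["wind", "rain", "snow"],
        ["solar", "lunar", "stellar"],
    ]
    gold = ["ozone", "ice", "dust", "stellar"]
    print(top_k_mask_accuracy(predictions, gold))
    # {'top1': 0.25, 'top2': 0.5, 'top3': 0.75}
```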

Citation

If you use INDUS-SDE, please cite:

@inproceedings{pantha2026indussde,
  author    = {Pantha, Nishan and Awale, Sajil and Kuruvanthodi, Vishnudev and KC, Simran and Ramasubramanian, Muthukumaran and Davis, Carson and Praveen, Bishwas and Foshee, Emily and Bhattacharjee, Bishwaranjan and Bugbee, Kaylin and Ramachandran, Rahul},
  title     = {{INDUS-SDE}: A Language Model for Scientific Content Curation and Discovery},
  year      = {2026},
  isbn      = {979-8-4007-2259-2},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  doi       = {10.1145/3770855.3818847},
  booktitle = {Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD '26)},
  location  = {Jeju Island, Republic of Korea},
  series    = {KDD '26}
}
