Model Card for INDUS-SDE

This model extends the context size of nasa-smd-ibm-v0.1 and then continues pre-training it on the full Science Discovery Engine (SDE) website data with a Masked Language Modeling (MLM) objective.

Paper: INDUS-SDE: A Language Model for Scientific Content Curation and Discovery (KDD 2026, AI for Sciences Track).

INDUS-SDE prioritizes scientific terminology during pretraining via Weighted Dynamic Masking (YAKE keyword masking plus random masking) on NASA's noisy, web-sourced SDE corpus.

Code: NASA-IMPACT/mlm-fine-tuning · Sentence transformer: indus-sde-st-v0.2

Model Details

  • Base Model: nasa-impact/nasa-smd-ibm-v0.1
  • Tokenizer: nasa-impact/nasa-smd-ibm-v0.1
  • Parameters: 125M
  • Pretraining Strategy: Masked Language Modeling (MLM)
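
A minimal usage sketch, assuming the checkpoint is published on the Hugging Face Hub as nasa-impact/indus-sde-v0.2 and loads with the standard transformers fill-mask pipeline. The helper names fill_template and predict_masked, the [MASK] placeholder convention, and the example sentence are illustrative and not part of this card.

```python
MODEL_ID = "nasa-impact/indus-sde-v0.2"  # assumed Hub repository id


def fill_template(template: str, mask_token: str, placeholder: str = "[MASK]") -> str:
    """Swap a neutral placeholder for the tokenizer's real mask token."""
    if template.count(placeholder) != 1:
        raise ValueError("template must contain exactly one placeholder")
    return template.replace(placeholder, mask_token)


def predict_masked(template: str, top_k: int = 3, model_id: str = MODEL_ID):
    """Return (token, score) pairs for the masked slot.

    Downloads the weights on first use, so it needs network access.
    """
    from transformers import pipeline  # imported lazily: heavy dependency

    fill = pipeline("fill-mask", model=model_id)
    # Read the mask token from the tokenizer instead of hard-coding it.
    text = fill_template(template, fill.tokenizer.mask_token)
    return [(p["token_str"], p["score"]) for p in fill(text, top_k=top_k)]
```

Calling predict_masked("The rover landed on the surface of [MASK].") would return the three highest-scoring completions with their probabilities.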

Training Data

  • Full Science Discovery Engine (SDE) Website Data

Training Procedure

  • transformers Version: 4.48.3
  • Strategy: Masked Language Modeling (MLM)
  • Stage 1 Training: Increase the context size from 512 to 1024 tokens, then slowly pre-train only the position encoding layer for 1 epoch. (This preserves the representations learned by the upstream INDUS model, which was trained on a large scientific corpus.)
  • Stage 2 Training: Full training for 5 epochs with a cosine learning-rate scheduler with warmup
  • Masking Strategy:
    • Weighted Dynamic Masking based on Keyword Importance (YAKE) and Random Masking
      • The idea behind masking important keywords is to force the model to generalize over "science" keywords, which give a high signal for the document
    • Masked Language Model Probability: 30%
  • Batch Size: 6
  • Learning rate: 5e-5
  • Warmup ratio: 0.1
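
The masking strategy above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the exact weighting scheme is not given in this card, so the sketch simply adds a keyword importance to a uniform base weight and samples the 30% mask budget without replacement. The function name, the keyword_weight input format, and base_weight are hypothetical. Note that YAKE itself assigns lower scores to more relevant keywords, so the weights are assumed to have already been converted to "higher is more important".

```python
import random


def weighted_dynamic_mask(tokens, keyword_weight, mask_prob=0.30,
                          base_weight=1.0, mask_token="<mask>", rng=None):
    """Mask about `mask_prob` of the tokens, favouring high-weight keywords.

    keyword_weight: dict mapping a lowercased token to a non-negative
    importance (higher = more important). Hypothetical input format.
    """
    rng = rng or random.Random()
    if not tokens:
        return [], []
    n_mask = max(1, round(mask_prob * len(tokens)))

    # Every token keeps `base_weight`, so non-keywords are still masked at
    # random; keywords get extra weight on top of it.
    weights = [base_weight + keyword_weight.get(t.lower(), 0.0) for t in tokens]

    # Weighted sampling without replacement (Efraimidis-Spirakis keys):
    # draw u ~ U(0, 1), rank by u ** (1 / w), keep the n_mask largest.
    keys = sorted(((rng.random() ** (1.0 / w), i) for i, w in enumerate(weights)),
                  reverse=True)
    chosen = {i for _, i in keys[:n_mask]}

    masked = [mask_token if i in chosen else t for i, t in enumerate(tokens)]
    labels = [t if i in chosen else None for i, t in enumerate(tokens)]
    return masked, labels
```

Calling the function again on the same text, for example once per epoch, draws a fresh mask, which is what makes the masking dynamic.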

Dataset

  • Total Data Size: 545,717
  • Validation Data Size: 10% of total size
  • Test Data Size: 10% of total size
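
As a rough arithmetic sketch, the split sizes and the Stage 2 learning-rate curve implied by the numbers in this card can be worked out as below. Several assumptions are made that the card does not confirm: that 545,717 counts individual examples, that the remaining 80% is the training split, that the batch size of 6 is the effective batch size (single device, no gradient accumulation), and that the schedule is linear warmup followed by cosine decay to zero, the form used by the transformers helper get_cosine_schedule_with_warmup.

```python
import math

TOTAL_EXAMPLES = 545_717   # "Total Data Size"; unit assumed to be examples
VAL_FRACTION = 0.10
TEST_FRACTION = 0.10
BATCH_SIZE = 6             # assumed to be the effective batch size
EPOCHS = 5                 # Stage 2
PEAK_LR = 5e-5
WARMUP_RATIO = 0.1

n_val = int(TOTAL_EXAMPLES * VAL_FRACTION)
n_test = int(TOTAL_EXAMPLES * TEST_FRACTION)
n_train = TOTAL_EXAMPLES - n_val - n_test   # assumed: the rest is training data

steps_per_epoch = math.ceil(n_train / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = int(total_steps * WARMUP_RATIO)


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero."""
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the split is 436,575 / 54,571 / 54,571 examples, and Stage 2 runs 363,815 optimizer steps, of which the first 36,381 are warmup.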

Evaluation

  • Top-k Test Mask Accuracy: {'top1': 0.7814, 'top2': 0.8319, 'top3': 0.8548}
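
For reference, top-k mask accuracy is commonly computed by counting a masked position as correct at k when the true token is among the model's k highest-scoring predictions. The sketch below illustrates that definition; the function name and input format are hypothetical, and the card does not state the exact evaluation script used.

```python
def topk_mask_accuracy(ranked_predictions, targets, ks=(1, 2, 3)):
    """Fraction of masked positions whose true token is in the top k.

    ranked_predictions[i]: candidate tokens for masked position i, best first.
    targets[i]: the token that was actually masked at position i.
    """
    if len(ranked_predictions) != len(targets):
        raise ValueError("need one target per masked position")
    if not targets:
        raise ValueError("no masked positions to score")
    hits = {k: 0 for k in ks}
    for ranked, target in zip(ranked_predictions, targets):
        for k in ks:
            if target in ranked[:k]:
                hits[k] += 1
    return {f"top{k}": hits[k] / len(targets) for k in ks}
```

By construction the score cannot decrease as k grows, which matches the ordering of the reported top1, top2, and top3 values.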


Citation

If you use INDUS-SDE, please cite:

@inproceedings{pantha2026indussde,
  author    = {Pantha, Nishan and Awale, Sajil and Kuruvanthodi, Vishnudev and KC, Simran and Ramasubramanian, Muthukumaran and Davis, Carson and Praveen, Bishwas and Foshee, Emily and Bhattacharjee, Bishwaranjan and Bugbee, Kaylin and Ramachandran, Rahul},
  title     = {{INDUS-SDE}: A Language Model for Scientific Content Curation and Discovery},
  year      = {2026},
  isbn      = {979-8-4007-2259-2},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  doi       = {10.1145/3770855.3818847},
  booktitle = {Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD '26)},
  location  = {Jeju Island, Republic of Korea},
  series    = {KDD '26}
}