| license: apache-2.0 | |
| language: | |
| - code | |
| library_name: peft | |
| tags: | |
| - llm2vec | |
| - mntp | |
| - decoder-only | |
| - pre-training | |
| - llama2 | |
| ## Are Decoder-Only Large Language Models the Silver Bullet for Code Search? | |
| This model is an official artifact from our research paper: **"[Are Decoder-Only Large Language Models the Silver Bullet for Code Search?](https://arxiv.org/abs/2410.22240)"**. | |
| In this work, we conduct a large-scale systematic evaluation of decoder-only Large Language Models for the task of code search and present a set of effective fine-tuning and optimization strategies. | |
| For complete details on all our experiments, to reproduce the full training/evaluation pipeline, or to use other models from the paper, please visit our official GitHub repository: | |
| ➡️ **[GitHub: Georgepitt/DecoderLLMs-CodeSearch](https://github.com/Georgepitt/DecoderLLMs-CodeSearch)** | |
| --- | |
| ## Model Card: Llama-2-13B - MNTP Pre-trained Model | |
| ### Model Description | |
| This is a PEFT adapter for the **`meta-llama/Llama-2-13b-hf`** model, pre-trained with the **Masked Next Token Prediction (MNTP)** objective from the [llm2vec](https://github.com/McGill-NLP/llm2vec) framework. | |
| **Important Note on its Role**: | |
| This model is **not intended for direct downstream task evaluation**. Instead, it serves as a crucial **foundational prerequisite** for our supervised fine-tuned (SupCon) models. The MNTP pre-training enables the decoder-only model to learn bidirectional representations, which is an essential step before applying supervised contrastive learning. | |
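To make the objective concrete, here is a minimal sketch of how an MNTP training example can be laid out, assuming the formulation described in the llm2vec paper: some input tokens are replaced by a mask token, and each masked token at position `i` is predicted from the model output at position `i - 1`. The function name `make_mntp_example`, the `"<mask>"` placeholder, and the toy token list are illustrative assumptions and are not taken from the repository's training code.

```python
def make_mntp_example(tokens, mask_positions, mask_token="<mask>"):
    """Build a toy MNTP example (hypothetical helper, for illustration only).

    Returns the masked token sequence and a dict that maps the position
    whose output is used for prediction (i - 1) to the original token
    that was masked at position i.
    """
    masked = list(tokens)
    targets = {}
    for i in sorted(set(mask_positions)):
        # In this sketch position 0 is never masked, because there is
        # no previous position to predict it from.
        if i <= 0 or i >= len(tokens):
            raise ValueError(f"cannot mask position {i}")
        masked[i] = mask_token
        targets[i - 1] = tokens[i]
    return masked, targets


tokens = ["def", "add", "(", "a", ",", "b", ")", ":"]
masked_tokens, targets = make_mntp_example(tokens, mask_positions=[1, 5])
print(masked_tokens)
print(targets)
```

Because the masked input is processed with bidirectional attention, the output at position `i - 1` can draw on tokens to the right of the mask as well as to the left, which is what lets the adapter learn bidirectional representations.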
| ### How to Use | |
| #### Standalone Use (for Base Embeddings) | |
| Although its main role is to serve as the starting point for the SupCon models, this MNTP model can be used on its own to produce base text or code embeddings. | |
| ```python | |
| import torch | |
| from transformers import AutoTokenizer, AutoModel, AutoConfig | |
| from peft import PeftModel | |
| from llm2vec import LLM2Vec | |
| # The base model is a gated repository; access must be granted on the Hub first. | |
| base_model_id = "meta-llama/Llama-2-13b-hf" | |
| mntp_model_id = "SYSUSELab/DCS-Llama2-13B-It-MNTP" | |
| # Load the tokenizer and the base model in bfloat16. | |
| tokenizer = AutoTokenizer.from_pretrained(base_model_id) | |
| config = AutoConfig.from_pretrained(base_model_id, trust_remote_code=True) | |
| model = AutoModel.from_pretrained(base_model_id, trust_remote_code=True, config=config, | |
|                                   torch_dtype=torch.bfloat16, device_map="auto") | |
| # Attach the MNTP adapter weights to the base model. | |
| model = PeftModel.from_pretrained(model, mntp_model_id) | |
| # Wrap the model so that inputs are encoded and mean-pooled into one vector each. | |
| l2v = LLM2Vec(model, tokenizer, pooling_mode="mean", max_length=512) | |
| embeddings = l2v.encode(["def hello_world():\n    print('Hello, World!')"]) | |
| print("Embedding from MNTP model:", embeddings.shape) | |
| ``` | |
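The `pooling_mode="mean"` setting averages token representations into one vector per input, and in code search such vectors are commonly compared by cosine similarity. The sketch below illustrates both steps on toy numbers using only the standard library. The helpers `mean_pool`, `cosine`, and `rank_by_similarity` are hypothetical and are not part of `llm2vec`; the small vectors stand in for real model output, and excluding padding positions via an attention mask is an assumption of this sketch.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the vectors of positions whose mask entry is non-zero."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_similarity(query_vec, candidate_vecs):
    """Return candidate indices ordered from most to least similar."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(candidate_vecs)]
    return [idx for _, idx in sorted(scored, key=lambda s: s[0], reverse=True)]


# Three token vectors; the last one is padding and is masked out.
pooled = mean_pool([[1.0, 3.0], [3.0, 5.0], [9.0, 9.0]], [1, 1, 0])

# Toy query and candidate embeddings standing in for encoded text and code.
query = [1.0, 0.0]
candidates = [[0.0, 1.0], [2.0, 0.1], [1.0, 1.0]]
order = rank_by_similarity(query, candidates)
print(pooled, order)
```

With real embeddings, `query` would come from encoding a natural-language description and `candidates` from encoding code snippets, with the top-ranked indices returned as search results.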
| ### Training Methodology | |
| This model was pre-trained using the **MNTP** objective as described in the `llm2vec` paper. If you wish to train your own MNTP model from scratch, please refer to the instructions in the `Fine-tuning/Fine-tuning_method/MNTP/` directory of our GitHub repository. | |
| ### Citation | |
| If you use this model, please cite both our paper and the foundational work of `llm2vec`. | |
| ```bibtex | |
| @article{chen2024decoder, | |
| title={Are Decoder-Only Large Language Models the Silver Bullet for Code Search?}, | |
| author={Chen, Yuxuan and Liu, Mingwei and Ou, Guangsheng and Li, Anji and Dai, Dekun and Wang, Yanlin and Zheng, Zibin}, | |
| journal={arXiv preprint arXiv:2410.22240}, | |
| year={2024} | |
| } | |
| @article{behnamghader2024llm2vec, | |
| title={LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders}, | |
| author={BehnamGhader, Parishad and Adlakha, Vaibhav and Mosbach, Marius and Bahdanau, Dzmitry and Chapados, Nicolas and Reddy, Siva}, | |
| journal={arXiv preprint arXiv:2404.05961}, | |
| year={2024} | |
| } | |
| ``` |