| license: apache-2.0 | |
| datasets: | |
| - bigcode/the-stack-dedup | |
| library_name: transformers | |
| language: | |
| - code | |
| ## CodeSage-Large | |
| ### Updates | |
| * [12/2024] <span style="color:blue">We are excited to announce the release of the CodeSage V2 model family, with substantially improved performance and flexible embedding dimensions!</span> Please check out our [models](https://huggingface.co/codesage) and [blogpost](https://code-representation-learning.github.io/codesage-v2.html) for more details. | |
| * [11/2024] You can now access CodeSage models through SentenceTransformer. | |
| ### Model description | |
| CodeSage is a family of open code embedding models with an encoder architecture that supports a wide range of source code understanding tasks. It was introduced in the paper: | |
| [Code Representation Learning At Scale by | |
| Dejiao Zhang*, Wasi Uddin Ahmad*, Ming Tan, Hantian Ding, Ramesh Nallapati, Dan Roth, Xiaofei Ma, Bing Xiang](https://arxiv.org/abs/2402.01935) (* indicates equal contribution). | |
| ### Pretraining data | |
| This checkpoint is trained on [The Stack (deduplicated)](https://huggingface.co/datasets/bigcode/the-stack-dedup). Nine languages are supported: c, c-sharp, go, java, javascript, typescript, php, python, ruby. | |
| ### Training procedure | |
| This checkpoint is first trained on code data via masked language modeling (MLM) and then on bimodal text-code pair data. Please refer to the paper for more details. | |
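To make the MLM stage concrete, here is a minimal, dependency-free sketch of how one training example could be corrupted. The mask symbol, the 15% mask rate, and the `mask_tokens` helper are illustrative assumptions, not the settings used to train this checkpoint; the paper describes the actual recipe.

```python
import random

MASK_TOKEN = "[MASK]"  # hypothetical mask symbol, chosen for illustration


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Return (corrupted_tokens, labels) for one MLM example.

    labels[i] holds the original token at masked positions and None
    elsewhere, so only masked positions would contribute to the loss.
    mask_rate=0.15 is an assumed value, not taken from the model card.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    masked_positions = set(rng.sample(range(len(tokens)), n_to_mask))

    corrupted, labels = [], []
    for i, tok in enumerate(tokens):
        if i in masked_positions:
            corrupted.append(MASK_TOKEN)  # hide the token from the encoder
            labels.append(tok)            # the model must predict it back
        else:
            corrupted.append(tok)
            labels.append(None)
    return corrupted, labels


tokens = "def add ( a , b ) : return a + b".split()
corrupted, labels = mask_tokens(tokens)
```

The encoder sees `corrupted` and is trained to recover the entries of `labels` that are not `None`.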
| ### How to Use | |
| This checkpoint consists of an encoder (1.3B parameters) that can be used to extract 1024-dimensional code embeddings. | |
| 1. Accessing CodeSage via Hugging Face Transformers: the model loads through `AutoModel` and uses the [StarCoder tokenizer](https://arxiv.org/pdf/2305.06161.pdf). | |
| ``` | |
| from transformers import AutoModel, AutoTokenizer | |
| checkpoint = "codesage/codesage-large" | |
| device = "cuda" # use "cpu" instead to run on CPU | |
| # Note: CodeSage requires adding eos token at the end of each tokenized sequence | |
| tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True, add_eos_token=True) | |
| model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device) | |
| inputs = tokenizer.encode("def print_hello_world():\tprint('Hello World!')", return_tensors="pt").to(device) | |
| embedding = model(inputs)[0] | |
| ``` | |
| 2. Accessing CodeSage via SentenceTransformer | |
| ``` | |
| from sentence_transformers import SentenceTransformer | |
| model = SentenceTransformer("codesage/codesage-large", trust_remote_code=True) | |
| ``` | |
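Once embeddings are available, a common next step is to compare them, for example to rank code snippets against a query. The sketch below is a dependency-free illustration under two assumptions that the card does not confirm: that the Transformers output holds one vector per token, and that averaging those vectors (mean pooling) is a reasonable way to get one vector per snippet. The helper names `mean_pool`, `cosine_similarity`, and `rank_by_similarity` are hypothetical, and the vectors are toy values rather than real model output.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, candidate_vecs):
    """Return candidate indices ordered from most to least similar."""
    scores = [cosine_similarity(query_vec, c) for c in candidate_vecs]
    return sorted(range(len(candidate_vecs)), key=lambda i: scores[i], reverse=True)


# Toy 3-dimensional stand-ins for real embeddings.
token_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [9.0, 9.0, 9.0]]  # last row is padding
query = mean_pool(token_vectors, attention_mask=[1, 1, 0])
candidates = [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
ranking = rank_by_similarity(query, candidates)
```

With the SentenceTransformer route, `model.encode(list_of_strings)` already returns one vector per input, so only the similarity and ranking helpers would be needed.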
| ### BibTeX entry and citation info | |
| ``` | |
| @inproceedings{ | |
| zhang2024codesage, | |
| title={CodeSage: Code Representation Learning At Scale}, | |
| author={Dejiao Zhang* and Wasi Ahmad* and Ming Tan and Hantian Ding and Ramesh Nallapati and Dan Roth and Xiaofei Ma and Bing Xiang}, | |
| booktitle={The Twelfth International Conference on Learning Representations}, | |
| year={2024}, | |
| url={https://openreview.net/forum?id=vfzRRjumpX} | |
| } | |
| ``` |