TheoDB committed on Apr 7

Commit 3ac36c5 (verified)

1 Parent(s): ce07143

Upload BidirLM-0.6B-Embedding

Files changed (18)

.gitattributes +2 -0
1_Pooling/config.json +10 -0
README.md +229 -0
added_tokens.json +28 -0
config.json +38 -0
config_sentence_transformers.json +14 -0
configuration_bidirlm.py +200 -0
final_results.png +3 -0
merges.txt +0 -0
model.safetensors +3 -0
modeling_bidirlm.py +666 -0
modules.json +14 -0
mteb_v2_eval_prompts.json +262 -0
sentence_bert_config.json +4 -0
special_tokens_map.json +38 -0
tokenizer.json +3 -0
tokenizer_config.json +241 -0
vocab.json +0 -0

.gitattributes CHANGED

@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+final_results.png filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text

1_Pooling/config.json ADDED

	@@ -0,0 +1,10 @@

+{
+    "word_embedding_dimension": 1024,
+    "pooling_mode_cls_token": false,
+    "pooling_mode_mean_tokens": true,
+    "pooling_mode_max_tokens": false,
+    "pooling_mode_mean_sqrt_len_tokens": false,
+    "pooling_mode_weightedmean_tokens": false,
+    "pooling_mode_lasttoken": false,
+    "include_prompt": true
+}
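
The pooling configuration above enables only `pooling_mode_mean_tokens`, so a sentence vector is the average of its token vectors over the non-padding positions. A minimal sketch of that computation follows. It uses plain Python lists instead of tensors, and the function name `mean_pool` and the toy inputs are illustrative assumptions, not code from this repository.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of one sequence where attention_mask is 1."""
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy sequence: three real tokens followed by one padding position.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)  # [3.0, 4.0]
```

The padding vector `[9.0, 9.0]` does not affect the result, which is why the attention mask has to be passed alongside the token embeddings.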

README.md ADDED

	@@ -0,0 +1,229 @@

+---
+tags:
+  - mteb
+  - sentence-transformers
+  - transformers
+  - embedding
+  - bidirectional
+  - multilingual
+pipeline_tag: sentence-similarity
+license: apache-2.0
+base_model: BidirLM/BidirLM-0.6B-Base
+language:
+  - multilingual
+  - af
+  - am
+  - ar
+  - az
+  - be
+  - bg
+  - bn
+  - bs
+  - ca
+  - ceb
+  - cs
+  - cy
+  - da
+  - de
+  - el
+  - en
+  - es
+  - et
+  - eu
+  - fa
+  - fi
+  - fr
+  - ga
+  - gl
+  - gu
+  - ha
+  - he
+  - hi
+  - hr
+  - ht
+  - hu
+  - hy
+  - id
+  - ig
+  - is
+  - it
+  - ja
+  - jv
+  - ka
+  - kk
+  - kn
+  - ko
+  - ky
+  - lt
+  - lv
+  - mg
+  - mk
+  - ml
+  - mr
+  - ms
+  - mt
+  - my
+  - nb
+  - ne
+  - nl
+  - nso
+  - ny
+  - pa
+  - pl
+  - ps
+  - pt
+  - ro
+  - ru
+  - sd
+  - si
+  - sk
+  - sl
+  - sn
+  - so
+  - sq
+  - sr
+  - su
+  - sv
+  - sw
+  - ta
+  - te
+  - th
+  - tl
+  - tr
+  - uk
+  - ur
+  - vi
+  - wo
+  - xh
+  - yo
+  - zh
+  - zu
+---
+# BidirLM-0.6B
+BidirLM is a family of 5 frontier bidirectional encoders, including an omnimodal variant at 2.5B, adapted from causal decoder LLMs. Unlike contrastive-only models, BidirLM relies on a prior masking phase (MNTP) that enables state-of-the-art results on task-specific fine-tuning (NER, classification, NLI) while achieving frontier performance on embedding benchmarks (MTEB) against open-source alternatives.
+![Multilingual model performance by size on XTREME-Benchmark Augmented and MTEB Multilingual V2](final_results.png)
+| Model | Base LLM | Parameters | Embedding Dim | Max Tokens | MTEB Multi. V2 (Mean Task) |
+|---|---|---|---|---|---|
+| BidirLM-270M | Gemma3-270M | 268M | 640 | 512 | 55.5 |
+| **BidirLM-0.6B** | **Qwen3-0.6B** | **596M** | **1024** | **512** (\*) | **59.6** |
+| BidirLM-1B | Gemma3-1B | 1001M | 1152 | 512 | 62.1 |
+| BidirLM-1.7B | Qwen3-1.7B | 1721M | 2048 | 512 | 62.9 |
+| BidirLM-Omni-2.5B | Qwen3-1.7B | 2.5B | 2048 | 512 | 63.1 |
+(\*) While evaluated on MTEB with a max length of 512, the underlying architecture supports up to 40,960 context length (Qwen3). Longer sequences can be used by adjusting `model.max_seq_length` in Sentence Transformers or `max_length` in the tokenizer.
+## Supported Tasks
+**General embeddings** (via Sentence Transformers): retrieval, semantic similarity (STS), clustering, classification, pair classification, reranking, bitext mining, multilabel classification
+**Downstream fine-tuning** (via Transformers): sequence classification (e.g. MNLI, XNLI, PAWS-X, MathShepherd), token classification (e.g. PAN-X, POS), information retrieval (e.g. MIRACL, CodeSearchNet), sequence regression (e.g. Seahorse)
+## Usage
+### Sentence Transformers
+Use Sentence Transformers to compute embeddings for any text representation task.
+```python
+from sentence_transformers import SentenceTransformer
+model = SentenceTransformer("BidirLM/BidirLM-0.6B", trust_remote_code=True)
+queries = [
+    "What is the capital of France?",
+    "How does photosynthesis work?",
+]
+documents = [
+    "Paris is the capital and largest city of France, situated on the river Seine.",
+    "Photosynthesis is the process by which plants convert sunlight, water, and CO2 into glucose and oxygen.",
+]
+query_embeddings = model.encode(queries)
+document_embeddings = model.encode(documents)
+similarities = model.similarity(query_embeddings, document_embeddings)
+print(similarities)
+```
+### Fine-tuning for Downstream Tasks
+BidirLM can be directly fine-tuned for downstream tasks:
+```python
+from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoModelForTokenClassification
+tokenizer = AutoTokenizer.from_pretrained("BidirLM/BidirLM-0.6B", trust_remote_code=True)
+# Sequence classification (e.g., NLI: entailment, neutral, contradiction)
+seq_model = AutoModelForSequenceClassification.from_pretrained(
+    "BidirLM/BidirLM-0.6B",
+    trust_remote_code=True,
+    num_labels=3,
+)
+# Token classification (e.g., NER)
+tok_model = AutoModelForTokenClassification.from_pretrained(
+    "BidirLM/BidirLM-0.6B",
+    trust_remote_code=True,
+    num_labels=7,
+)
+# Fine-tune with HuggingFace Trainer
+```
+## Evaluation
+To reproduce our scores, follow the instructions in the [mteb repository](https://github.com/embeddings-benchmark/mteb). The evaluation prompts used for each task are also available at [mteb_v2_eval_prompts.json](mteb_v2_eval_prompts.json).
+## Supported Languages
+Supports over 119 languages, inherited from the Qwen3 base model and reinforced through contrastive training on 87 languages.
+## Requirements
+This model requires `trust_remote_code=True` as it uses a custom bidirectional architecture.
+```
+transformers>=4.57.6,<5.0.0
+sentence-transformers>=5.0.0
+```
+## FAQ
+### 1. What pooling strategy does this model use?
+The model uses **mean pooling**. This is handled automatically when using Sentence Transformers.
+### 2. Do I need `trust_remote_code=True`?
+Yes. BidirLM uses a custom bidirectional architecture (`BidirLMModel`) that requires loading custom code from the repository.
+### 3. Why are my reproduced results slightly different from those reported in the model card?
+Different versions of `transformers` and `pytorch` could cause negligible but non-zero performance differences. This model was trained and evaluated with `transformers==4.57.6` and `pytorch==2.6.0`.
+### 4. What is the relationship between BidirLM-0.6B and BidirLM-0.6B-Base?
+[BidirLM/BidirLM-0.6B-Base](https://huggingface.co/BidirLM/BidirLM-0.6B-Base) is the intermediate MNTP-adapted checkpoint (bidirectional pretraining stage). BidirLM-0.6B is the final contrastive fine-tuned version optimized for both sentence embeddings and downstream fine-tuning.
+### 5. How is BidirLM different from other embedding models?
+Most embedding models (BGE-M3, KaLM, EmbedGemma, Qwen3-Embedding) use contrastive-only training, which optimizes embeddings but sacrifices fine-tuning ability. BidirLM restores a prior MNTP phase, advancing the Pareto frontier on both MTEB and XTREME simultaneously.
+## Citation
+```bibtex
+@misc{boizard2026bidirlmtextomnimodalbidirectional,
+      title={BidirLM: From Text to Omnimodal Bidirectional Encoders by Adapting and Composing Causal LLMs},
+      author={Nicolas Boizard and Théo Deschamps-Berger and Hippolyte Gisserot-Boukhlef and Céline Hudelot and Pierre Colombo},
+      year={2026},
+      eprint={2604.02045},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2604.02045},
+}
+```

added_tokens.json ADDED

	@@ -0,0 +1,28 @@

+{
+  "</think>": 151668,
+  "</tool_call>": 151658,
+  "</tool_response>": 151666,
+  "<think>": 151667,
+  "<tool_call>": 151657,
+  "<tool_response>": 151665,
+  "<|box_end|>": 151649,
+  "<|box_start|>": 151648,
+  "<|endoftext|>": 151643,
+  "<|file_sep|>": 151664,
+  "<|fim_middle|>": 151660,
+  "<|fim_pad|>": 151662,
+  "<|fim_prefix|>": 151659,
+  "<|fim_suffix|>": 151661,
+  "<|im_end|>": 151645,
+  "<|im_start|>": 151644,
+  "<|image_pad|>": 151655,
+  "<|mask|>": 151663,
+  "<|object_ref_end|>": 151647,
+  "<|object_ref_start|>": 151646,
+  "<|quad_end|>": 151651,
+  "<|quad_start|>": 151650,
+  "<|video_pad|>": 151656,
+  "<|vision_end|>": 151653,
+  "<|vision_pad|>": 151654,
+  "<|vision_start|>": 151652
+}

config.json ADDED

	@@ -0,0 +1,38 @@

+{
+  "architectures": [
+    "BidirLMModel"
+  ],
+  "attention_bias": false,
+  "attention_dropout": 0.0,
+  "auto_map": {
+    "AutoConfig": "configuration_bidirlm.BidirLMConfig",
+    "AutoModel": "modeling_bidirlm.BidirLMModel",
+    "AutoModelForMaskedLM": "modeling_bidirlm.BidirLMForMaskedLM",
+    "AutoModelForPreTraining": "modeling_bidirlm.BidirLMPreTrainedModel",
+    "AutoModelForSequenceClassification": "modeling_bidirlm.BidirLMForSequenceClassification",
+    "AutoModelForTokenClassification": "modeling_bidirlm.BidirLMForTokenClassification"
+  },
+  "bos_token_id": 151644,
+  "clf_pooling": "late",
+  "dtype": "bfloat16",
+  "eos_token_id": 151645,
+  "head_dim": 128,
+  "hidden_act": "silu",
+  "hidden_size": 1024,
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "mask_token": "<|mask|>",
+  "mask_token_id": 151663,
+  "max_position_embeddings": 40960,
+  "model_type": "bidirlm",
+  "num_attention_heads": 16,
+  "num_hidden_layers": 28,
+  "num_key_value_heads": 8,
+  "pad_token_id": 151645,
+  "rms_norm_eps": 1e-06,
+  "rope_scaling": null,
+  "rope_theta": 1000000,
+  "tie_word_embeddings": true,
+  "transformers_version": "4.57.6",
+  "vocab_size": 151936
+}
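
The attention settings above use grouped-query attention with an explicit `head_dim`, so the projection widths are not simply `hidden_size`. The sketch below derives them and, as a rough cross-check, estimates the parameter count. The per-layer layout (two RMSNorms per block plus a final norm) is an assumption based on the standard Qwen3 block, not something stated in this file.

```python
# Values copied from config.json above.
config = {
    "vocab_size": 151936,
    "hidden_size": 1024,
    "intermediate_size": 3072,
    "num_hidden_layers": 28,
    "num_attention_heads": 16,
    "num_key_value_heads": 8,
    "head_dim": 128,
}

hidden = config["hidden_size"]
q_width = config["num_attention_heads"] * config["head_dim"]   # query projection width
kv_width = config["num_key_value_heads"] * config["head_dim"]  # key/value projection width
kv_groups = config["num_attention_heads"] // config["num_key_value_heads"]

# Assumed layer layout (standard Qwen3 block, no biases since attention_bias is false):
# q/k/v/o projections, per-head q_norm and k_norm, a gated MLP, and two RMSNorms.
attention = 2 * hidden * q_width + 2 * hidden * kv_width + 2 * config["head_dim"]
mlp = 3 * hidden * config["intermediate_size"]
per_layer = attention + mlp + 2 * hidden

embeddings = config["vocab_size"] * hidden  # shared with the output head when tied
total = embeddings + config["num_hidden_layers"] * per_layer + hidden  # + final norm

print(q_width, kv_width, kv_groups)  # 2048 1024 2
print(total)                         # 596049920
```

Under these assumptions the estimate lands at roughly 596M parameters, in line with the figure quoted in the README table. Each key/value head is shared by two query heads.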

config_sentence_transformers.json ADDED

	@@ -0,0 +1,14 @@

+{
+  "model_type": "SentenceTransformer",
+  "__version__": {
+    "sentence_transformers": "5.2.3",
+    "transformers": "4.57.6",
+    "pytorch": "2.6.0"
+  },
+  "prompts": {
+    "query": "",
+    "document": ""
+  },
+  "default_prompt_name": null,
+  "similarity_fn_name": "cosine"
+}
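
With `similarity_fn_name` set to `cosine`, the similarity between a query and a document is the cosine of the angle between their embeddings. A small stdlib-only sketch of that scoring, with hypothetical helper names and toy two-dimensional vectors standing in for real 1024-dimensional embeddings:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(queries, documents):
    """Row i, column j holds the similarity of query i and document j."""
    return [[cosine_similarity(q, d) for d in documents] for q in queries]


queries = [[1.0, 0.0], [0.0, 2.0]]
documents = [[3.0, 0.0], [0.0, -1.0]]
scores = similarity_matrix(queries, documents)
print(scores)  # [[1.0, 0.0], [0.0, -1.0]]
```

Vector length does not matter: the first query and first document differ in magnitude but point the same way, so they score 1.0.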

configuration_bidirlm.py ADDED

	@@ -0,0 +1,200 @@

+# coding=utf-8
+# Copyright 2024 The Qwen team, Alibaba Group and the HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""BidirLM model configuration"""
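+import transformers
+from packaging.version import Version
+_v = transformers.__version__
+# Compare release tuples; plain string comparison would order "4.57.10" before "4.57.6".
+if not ((4, 57, 6) <= Version(_v).release < (5,)):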
+    raise ImportError(
+        f"BidirLM requires transformers>=4.57.6,<5.0.0 (found {_v}). "
+        f"Install a compatible version: pip install 'transformers>=4.57.6,<5.0.0'"
+    )
+from transformers.configuration_utils import PretrainedConfig
+from transformers.modeling_rope_utils import rope_config_validation
+from transformers.utils import logging
+logger = logging.get_logger(__name__)
+class BidirLMConfig(PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of a [`BidirLMModel`]. It is used to instantiate a
+    BidirLM model according to the specified arguments, defining the model architecture. Instantiating a configuration
+    with the defaults will yield a similar configuration to that of
+    Qwen3-8B [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) WITH BIDIRECTIONAL ATTENTION MECHANISM.
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+    Args:
+        vocab_size (`int`, *optional*, defaults to 151936):
+            Vocabulary size of the Qwen3 model. Defines the number of different tokens that can be represented by the
+            `input_ids` passed when calling [`Qwen3Model`]
+        hidden_size (`int`, *optional*, defaults to 4096):
+            Dimension of the hidden representations.
+        intermediate_size (`int`, *optional*, defaults to 22016):
+            Dimension of the MLP representations.
+        num_hidden_layers (`int`, *optional*, defaults to 32):
+            Number of hidden layers in the Transformer encoder.
+        num_attention_heads (`int`, *optional*, defaults to 32):
+            Number of attention heads for each attention layer in the Transformer encoder.
+        num_key_value_heads (`int`, *optional*, defaults to 32):
+            This is the number of key_value heads that should be used to implement Grouped Query Attention. If
+            `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
+            `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When
+            converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
+            by meanpooling all the original heads within that group. For more details, check out [this
+            paper](https://huggingface.co/papers/2305.13245). If it is not specified, will default to `32`.
+        head_dim (`int`, *optional*, defaults to 128):
+            The attention head dimension.
+        hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
+            The non-linear activation function (function or string) in the decoder.
+        max_position_embeddings (`int`, *optional*, defaults to 32768):
+            The maximum sequence length that this model might ever be used with.
+        initializer_range (`float`, *optional*, defaults to 0.02):
+            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+        rms_norm_eps (`float`, *optional*, defaults to 1e-06):
+            The epsilon used by the rms normalization layers.
+        tie_word_embeddings (`bool`, *optional*, defaults to `False`):
+            Whether the model's input and output word embeddings should be tied.
+        rope_theta (`float`, *optional*, defaults to 10000.0):
+            The base period of the RoPE embeddings.
+        rope_scaling (`Dict`, *optional*):
+            Dictionary containing the scaling configuration for the RoPE embeddings. NOTE: if you apply new rope type
+            and you expect the model to work on longer `max_position_embeddings`, we recommend you to update this value
+            accordingly.
+            Expected contents:
+                `rope_type` (`str`):
+                    The sub-variant of RoPE to use. Can be one of ['default', 'linear', 'dynamic', 'yarn', 'longrope',
+                    'llama3'], with 'default' being the original RoPE implementation.
+                `factor` (`float`, *optional*):
+                    Used with all rope types except 'default'. The scaling factor to apply to the RoPE embeddings. In
+                    most scaling types, a `factor` of x will enable the model to handle sequences of length x *
+                    original maximum pre-trained length.
+                `original_max_position_embeddings` (`int`, *optional*):
+                    Used with 'dynamic', 'longrope' and 'llama3'. The original max position embeddings used during
+                    pretraining.
+                `attention_factor` (`float`, *optional*):
+                    Used with 'yarn' and 'longrope'. The scaling factor to be applied on the attention
+                    computation. If unspecified, it defaults to value recommended by the implementation, using the
+                    `factor` field to infer the suggested value.
+                `beta_fast` (`float`, *optional*):
+                    Only used with 'yarn'. Parameter to set the boundary for extrapolation (only) in the linear
+                    ramp function. If unspecified, it defaults to 32.
+                `beta_slow` (`float`, *optional*):
+                    Only used with 'yarn'. Parameter to set the boundary for interpolation (only) in the linear
+                    ramp function. If unspecified, it defaults to 1.
+                `short_factor` (`list[float]`, *optional*):
+                    Only used with 'longrope'. The scaling factor to be applied to short contexts (<
+                    `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden
+                    size divided by the number of attention heads divided by 2
+                `long_factor` (`list[float]`, *optional*):
+                    Only used with 'longrope'. The scaling factor to be applied to long contexts (>
+                    `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden
+                    size divided by the number of attention heads divided by 2
+                `low_freq_factor` (`float`, *optional*):
+                    Only used with 'llama3'. Scaling factor applied to low frequency components of the RoPE
+                `high_freq_factor` (`float`, *optional*):
+                    Only used with 'llama3'. Scaling factor applied to high frequency components of the RoPE
+        attention_bias (`bool`, *optional*, defaults to `False`):
+            Whether to use a bias in the query, key, value and output projection layers during self-attention.
+        layer_types (`list`, *optional*):
+            Attention pattern for each layer.
+        attention_dropout (`float`, *optional*, defaults to 0.0):
+            The dropout ratio for the attention probabilities.
+    ```python
+    >>> from transformers import Qwen3Model, Qwen3Config
+    >>> # Initializing a Qwen3 style configuration
+    >>> configuration = Qwen3Config()
+    >>> # Initializing a model from the Qwen3-8B style configuration
+    >>> model = Qwen3Model(configuration)
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```"""
+    model_type = "bidirlm"
+    keys_to_ignore_at_inference = ["past_key_values"]
+    # Default tensor parallel plan for base model, same as `Qwen3`
+    base_model_tp_plan = {
+        "layers.*.self_attn.q_proj": "colwise",
+        "layers.*.self_attn.k_proj": "colwise",
+        "layers.*.self_attn.v_proj": "colwise",
+        "layers.*.self_attn.o_proj": "rowwise",
+        "layers.*.mlp.gate_proj": "colwise",
+        "layers.*.mlp.up_proj": "colwise",
+        "layers.*.mlp.down_proj": "rowwise",
+    }
+    base_model_pp_plan = {
+        "embed_tokens": (["input_ids"], ["inputs_embeds"]),
+        "layers": (["hidden_states", "attention_mask"], ["hidden_states"]),
+        "norm": (["hidden_states"], ["hidden_states"]),
+    }
+    def __init__(
+        self,
+        vocab_size=151936,
+        hidden_size=4096,
+        intermediate_size=22016,
+        num_hidden_layers=32,
+        num_attention_heads=32,
+        num_key_value_heads=32,
+        head_dim=128,
+        hidden_act="silu",
+        max_position_embeddings=32768,
+        initializer_range=0.02,
+        rms_norm_eps=1e-6,
+        tie_word_embeddings=False,
+        rope_theta=10000.0,
+        rope_scaling=None,
+        attention_bias=False,
+        attention_dropout=0.0,
+        classifier_pooling="late",
+        **kwargs,
+    ):
+        self.vocab_size = vocab_size
+        self.max_position_embeddings = max_position_embeddings
+        self.hidden_size = hidden_size
+        self.intermediate_size = intermediate_size
+        self.num_hidden_layers = num_hidden_layers
+        self.num_attention_heads = num_attention_heads
+        # for backward compatibility
+        if num_key_value_heads is None:
+            num_key_value_heads = num_attention_heads
+        self.num_key_value_heads = num_key_value_heads
+        self.head_dim = head_dim
+        self.hidden_act = hidden_act
+        self.initializer_range = initializer_range
+        self.rms_norm_eps = rms_norm_eps
+        self.rope_theta = rope_theta
+        self.rope_scaling = rope_scaling
+        self.attention_bias = attention_bias
+        self.attention_dropout = attention_dropout
+        self.clf_pooling = classifier_pooling
+        # Validate the correctness of rotary position embeddings parameters
+        # BC: if there is a 'type' field, move it to 'rope_type'.
+        if self.rope_scaling is not None and "type" in self.rope_scaling:
+            self.rope_scaling["rope_type"] = self.rope_scaling["type"]
+        rope_config_validation(self)
+        super().__init__(
+            tie_word_embeddings=tie_word_embeddings,
+            **kwargs,
+        )
+__all__ = ["BidirLMConfig"]
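
The guard at the top of this file enforces `transformers>=4.57.6,<5.0.0`. Version strings do not order correctly under plain string comparison, so range checks of this kind are usually done on parsed numeric components. A minimal stdlib-only sketch; the function names are hypothetical and pre-release ordering is deliberately ignored (`packaging.version` handles that case properly):

```python
import re


def release_tuple(version_string):
    """Return the leading numeric components, e.g. '4.57.10.dev0' -> (4, 57, 10)."""
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        raise ValueError(f"unparseable version: {version_string!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def in_supported_range(version_string, lower=(4, 57, 6), upper=(5, 0, 0)):
    """True when lower <= version < upper, compared numerically."""
    return lower <= release_tuple(version_string) < upper


print("4.57.10" < "4.57.6")           # True: strings compare character by character
print(in_supported_range("4.57.10"))  # True
print(in_supported_range("4.57.5"))   # False
print(in_supported_range("5.0.0"))    # False
```

The first line shows the pitfall: as strings, `"4.57.10"` sorts before `"4.57.6"` because `'1'` is less than `'6'`, even though 4.57.10 is the later release.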

final_results.png ADDED

Git LFS Details

SHA256: 165257135cb40c3b9e7dcec94d225a0cce3a5c12f9ef0d152972eec048d0493a
Pointer size: 131 Bytes
Size of remote file: 738 kB

merges.txt ADDED

The diff for this file is too large to render.

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:0e9585aa599a24d18bfaacadb05b3d302003ea6221c457e721bfdd1bdfc38ae0
+size 1192133232
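
The LFS pointer size gives a quick sanity check on the parameter count. The arithmetic below assumes every tensor is stored as bfloat16 (two bytes per value, matching `"dtype": "bfloat16"` in config.json) and ignores the small safetensors header, so it slightly overestimates.

```python
file_size_bytes = 1192133232  # "size" field of the Git LFS pointer above
bytes_per_param = 2           # assumption: every tensor is stored as bfloat16

approx_params = file_size_bytes // bytes_per_param
print(approx_params)                  # 596066616
print(round(approx_params / 1e6, 1))  # 596.1 (millions)
```

The result is consistent with the 596M parameters listed for BidirLM-0.6B in the README table.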

modeling_bidirlm.py ADDED

	@@ -0,0 +1,666 @@

+from typing import Optional
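+import transformers
+from packaging.version import Version
+_v = transformers.__version__
+# Compare release tuples; plain string comparison would order "4.57.10" before "4.57.6".
+if not ((4, 57, 6) <= Version(_v).release < (5,)):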
+    raise ImportError(
+        f"BidirLM requires transformers>=4.57.6,<5.0.0 (found {_v}). "
+        f"Install a compatible version: pip install 'transformers>=4.57.6,<5.0.0'"
+    )
+import torch
+from torch import nn
+from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
+from transformers.activations import ACT2FN
+from transformers.modeling_attn_mask_utils import AttentionMaskConverter
+from transformers.modeling_layers import (
+    GradientCheckpointingLayer,
+)
+from transformers.modeling_rope_utils import ROPE_INIT_FUNCTIONS, dynamic_rope_update
+from transformers.modeling_utils import PreTrainedModel
+from .configuration_bidirlm import BidirLMConfig
+from transformers.modeling_outputs import BaseModelOutput, MaskedLMOutput, SequenceClassifierOutput, TokenClassifierOutput
+try:
+    import flash_attn
+    FLASH_ATTN_AVAILABLE = True
+except ImportError:
+    FLASH_ATTN_AVAILABLE = False
+class Qwen3RMSNorm(nn.Module):
+    def __init__(self, hidden_size, eps=1e-6):
+        """
+        Qwen3RMSNorm is equivalent to T5LayerNorm
+        """
+        super().__init__()
+        self.weight = nn.Parameter(torch.ones(hidden_size))
+        self.variance_epsilon = eps
+    def forward(self, hidden_states):
+        input_dtype = hidden_states.dtype
+        hidden_states = hidden_states.to(torch.float32)
+        variance = hidden_states.pow(2).mean(-1, keepdim=True)
+        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
+        return self.weight * hidden_states.to(input_dtype)
+    def extra_repr(self):
+        return f"{tuple(self.weight.shape)}, eps={self.variance_epsilon}"
+class Qwen3MLP(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.config = config
+        self.hidden_size = config.hidden_size
+        self.intermediate_size = config.intermediate_size
+        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
+        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
+        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
+        self.act_fn = ACT2FN[config.hidden_act]
+    def forward(self, x):
+        down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
+        return down_proj
+def rotate_half(x):
+    """Rotates half the hidden dims of the input."""
+    x1 = x[..., : x.shape[-1] // 2]
+    x2 = x[..., x.shape[-1] // 2 :]
+    return torch.cat((-x2, x1), dim=-1)
+def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
+    """Applies Rotary Position Embedding to the query and key tensors.
+    Args:
+        q (`torch.Tensor`): The query tensor.
+        k (`torch.Tensor`): The key tensor.
+        cos (`torch.Tensor`): The cosine part of the rotary embedding.
+        sin (`torch.Tensor`): The sine part of the rotary embedding.
+        position_ids (`torch.Tensor`, *optional*):
+            Deprecated and unused.
+        unsqueeze_dim (`int`, *optional*, defaults to 1):
+            The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
+            sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
+            that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
+            k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
+            cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
+            the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
+    Returns:
+        `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
+    """
+    cos = cos.unsqueeze(unsqueeze_dim)
+    sin = sin.unsqueeze(unsqueeze_dim)
+    q_embed = (q * cos) + (rotate_half(q) * sin)
+    k_embed = (k * cos) + (rotate_half(k) * sin)
+    return q_embed, k_embed
+def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
+    """
+    This is the equivalent of torch.repeat_interleave(x, dim=0, repeats=n_rep) for packed (unbatched) inputs. The
+    hidden states go from (num_key_value_heads, seqlen, head_dim) to (num_attention_heads, seqlen, head_dim)
+    """
+    num_key_value_heads, slen, head_dim = hidden_states.shape
+    if n_rep == 1:
+        return hidden_states
+    hidden_states = hidden_states[:, None, :, :].expand(num_key_value_heads, n_rep, slen, head_dim)
+    return hidden_states.reshape(num_key_value_heads * n_rep, slen, head_dim)
+def batch_input_to_cu_seqlens(x: torch.Tensor, attention_mask: torch.Tensor):
+    lengths = attention_mask.sum(dim=1)
+    max_seqlen = int(lengths.max().item())
+    cu_seqlens = torch.zeros(lengths.size(0) + 1, dtype=torch.int32, device=x.device)
+    cu_seqlens[1:] = torch.cumsum(lengths, dim=0)
+    x = x[attention_mask.bool()]
+    return x, cu_seqlens, max_seqlen
+def cu_seqlens_to_batch_input(x: torch.Tensor, cu_seqlens: torch.Tensor, max_seqlen: int):
+    B = cu_seqlens.size(0) - 1
+    D = x.size(1)
+    idx = torch.arange(max_seqlen, device=x.device).expand(B, max_seqlen)
+    lens = (cu_seqlens[1:] - cu_seqlens[:-1]).unsqueeze(1)
+    mask = idx < lens
+    base = cu_seqlens[:-1].unsqueeze(1)
+    gather_idx = (idx + base) * mask
+    out = torch.zeros(B, max_seqlen, D, device=x.device, dtype=x.dtype)
+    out[mask] = x[gather_idx[mask]]
+    return out
+def cu_attention_weight_to_batch(hidden_states, cu_seqlens, max_seqlen):
+    H, T, _ = hidden_states.shape
+    device = hidden_states.device
+    cu_seqlens = cu_seqlens.to(device, dtype=torch.long)
+    B = cu_seqlens.numel() - 1
+    start = cu_seqlens[:-1]
+    end = cu_seqlens[1:]
+    L = end - start
+    p = torch.arange(max_seqlen, device=device)
+    valid = p.unsqueeze(0) < L.unsqueeze(1)
+    rel = p.unsqueeze(0)
+    abs_idx = start.unsqueeze(1) + rel
+    abs_idx = torch.where(valid, abs_idx, torch.zeros_like(abs_idx))
+    attn = hidden_states.unsqueeze(0).expand(B, -1, -1, -1)
+    row_index = abs_idx[:, None, :, None].expand(B, H, max_seqlen, T)
+    attn_rows = torch.gather(attn, dim=2, index=row_index)
+    col_index = abs_idx[:, None, None, :].expand(B, H, max_seqlen, max_seqlen)
+    attn_padded = torch.gather(attn_rows, dim=3, index=col_index)
+    mask = valid.to(attn_padded.dtype)
+    attn_padded = attn_padded * mask[:, None, :, None] * mask[:, None, None, :]
+    return attn_padded
+def create_packed_seqs_mask(
+    cu_seqlens: torch.Tensor,
+    causal: bool = True,
+    device: torch.device = torch.device("cpu"),
+) -> torch.Tensor:
+    """
+    Create a causal or non-causal attention mask for packed sequences.
+    Args:
+        cu_seqlens (torch.Tensor): Cumulative sequence lengths of shape [batch + 1].
+        causal (bool): If True, create a causal (lower triangular) mask within
+            each sequence. If False, a full attention mask is created within each sequence.
+        device (torch.device): Target device for the mask.
+    Returns:
+        torch.Tensor: Attention mask of shape [total_len, total_len] with 0.0 (allowed)
+            and -inf (masked).
+    """
+    total_len = cu_seqlens[-1].item()
+    seq_lengths = cu_seqlens[1:] - cu_seqlens[:-1]
+    seq_indices = torch.repeat_interleave(
+        torch.arange(len(seq_lengths), device=device),
+        seq_lengths
+    )
+    seq_mask = seq_indices.unsqueeze(0) == seq_indices.unsqueeze(1)
+    if causal:
+        causal_mask = torch.tril(torch.ones(total_len, total_len, device=device, dtype=torch.bool))
+        combined_mask = seq_mask & causal_mask
+    else:
+        combined_mask = seq_mask
+    attention_mask = torch.full((total_len, total_len), float('-inf'), device=device)
+    attention_mask.masked_fill_(combined_mask, 0.0)
+    return attention_mask
+def sdpa_attention_forward(
+        q, k, v,
+        cu_seqlens,
+        scaling,
+        dropout: float = 0.0,
+        causal: bool = True
+    ):
+    """Compute scaled dot-product attention for packed sequences."""
+    attn_weights = torch.matmul(q, k.transpose(1, 2)) * scaling
+    mask = create_packed_seqs_mask(cu_seqlens, causal, q.device)
+    attn_weights = attn_weights + mask
+    attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(q.dtype)
+    attn_weights = nn.functional.dropout(attn_weights, p=dropout)
+    attn_output = torch.matmul(attn_weights, v)
+    attn_output = attn_output.transpose(0, 1).contiguous()
+    return attn_output, attn_weights
+class Qwen3Attention(nn.Module):
+    """Multi-headed attention from 'Attention Is All You Need' paper"""
+    def __init__(self, config: BidirLMConfig):
+        super().__init__()
+        self.config = config
+        self.head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
+        self.num_key_value_groups = config.num_attention_heads // config.num_key_value_heads
+        self.scaling = self.head_dim**-0.5
+        self.attention_dropout = config.attention_dropout
+        self.q_proj = nn.Linear(
+            config.hidden_size, config.num_attention_heads * self.head_dim, bias=config.attention_bias
+        )
+        self.k_proj = nn.Linear(
+            config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias
+        )
+        self.v_proj = nn.Linear(
+            config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias
+        )
+        self.o_proj = nn.Linear(
+            config.num_attention_heads * self.head_dim, config.hidden_size, bias=config.attention_bias
+        )
+        self.q_norm = Qwen3RMSNorm(self.head_dim, eps=config.rms_norm_eps)  # unlike olmo, only on the head dim!
+        self.k_norm = Qwen3RMSNorm(self.head_dim, eps=config.rms_norm_eps)  # thus post q_norm does not need reshape
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        position_embeddings: tuple[torch.Tensor, torch.Tensor],
+        cu_seqlens: Optional[torch.Tensor],
+        max_seqlen: Optional[int],
+    ) -> tuple[torch.Tensor, Optional[torch.Tensor]]:
+        input_shape = hidden_states.shape[:-1]
+        hidden_shape = (*input_shape, -1, self.head_dim)
+        query_states = self.q_norm(self.q_proj(hidden_states).view(hidden_shape)).transpose(0, 1)
+        key_states = self.k_norm(self.k_proj(hidden_states).view(hidden_shape)).transpose(0, 1)
+        value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(0, 1)
+        query_states, key_states = query_states.unsqueeze(0), key_states.unsqueeze(0)
+        cos, sin = position_embeddings
+        query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
+        query_states, key_states = query_states.squeeze(0), key_states.squeeze(0)
+        key_states = repeat_kv(key_states, self.num_key_value_groups)
+        value_states = repeat_kv(value_states, self.num_key_value_groups)
+        if self.config._attn_implementation == "flash_attention_2":
+            attn_weights = None
+            attn_output = flash_attn.flash_attn_varlen_func(
+                query_states.transpose(0, 1),
+                key_states.transpose(0, 1),
+                value_states.transpose(0, 1),
+                cu_seqlens,
+                cu_seqlens,
+                max_seqlen_q=max_seqlen,
+                max_seqlen_k=max_seqlen,
+                dropout_p=self.attention_dropout if self.training else 0.0,
+                softmax_scale=self.scaling,
+                causal=False,
+            ).contiguous()
+        else:
+            attn_output, attn_weights = sdpa_attention_forward(
+                query_states,
+                key_states,
+                value_states,
+                cu_seqlens=cu_seqlens,
+                dropout=self.attention_dropout if self.training else 0.0,
+                scaling=self.scaling,
+                causal=False,
+            )
+        attn_output = attn_output.reshape(*input_shape, -1).contiguous()
+        attn_output = self.o_proj(attn_output)
+        return attn_output, attn_weights
+class Qwen3EncoderLayer(GradientCheckpointingLayer):
+    def __init__(self, config: BidirLMConfig):
+        super().__init__()
+        self.hidden_size = config.hidden_size
+        self.self_attn = Qwen3Attention(config=config)
+        self.mlp = Qwen3MLP(config)
+        self.input_layernorm = Qwen3RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+        self.post_attention_layernorm = Qwen3RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        cu_seqlens: Optional[torch.Tensor] = None,
+        max_seqlen: Optional[int] = None,
+        position_embeddings: Optional[tuple[torch.Tensor, torch.Tensor]] = None,  # necessary, but kept here for BC
+        output_attentions: Optional[bool] = False,
+    ) -> tuple[torch.FloatTensor, Optional[tuple[torch.FloatTensor, torch.FloatTensor]]]:
+        residual = hidden_states
+        hidden_states = self.input_layernorm(hidden_states)
+        hidden_states, self_attn_weights = self.self_attn(
+            hidden_states=hidden_states,
+            cu_seqlens=cu_seqlens,
+            max_seqlen=max_seqlen,
+            position_embeddings=position_embeddings,
+        )
+        hidden_states = residual + hidden_states
+        residual = hidden_states
+        hidden_states = self.post_attention_layernorm(hidden_states)
+        hidden_states = self.mlp(hidden_states)
+        hidden_states = residual + hidden_states
+        outputs = (hidden_states,)
+        if output_attentions:
+            outputs += (self_attn_weights,)
+        return outputs
+class BidirLMPreTrainedModel(PreTrainedModel):
+    config: BidirLMConfig
+    base_model_prefix = "model"
+    _supports_flash_attn = True
+    _supports_sdpa = True
+    _can_record_outputs = {}
+class Qwen3RotaryEmbedding(nn.Module):
+    def __init__(self, config: BidirLMConfig, device=None):
+        super().__init__()
+        # BC: "rope_type" was originally "type"
+        if hasattr(config, "rope_scaling") and isinstance(config.rope_scaling, dict):
+            self.rope_type = config.rope_scaling.get("rope_type", config.rope_scaling.get("type"))
+        else:
+            self.rope_type = "default"
+        self.max_seqlen_cached = config.max_position_embeddings
+        self.original_max_seqlen = config.max_position_embeddings
+        self.config = config
+        self.rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type]
+        inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device)
+        self.register_buffer("inv_freq", inv_freq, persistent=False)
+        self.original_inv_freq = self.inv_freq
+    @torch.no_grad()
+    @dynamic_rope_update  # power user: used with advanced RoPE types (e.g. dynamic rope)
+    def forward(self, x, position_ids):
+        inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1).to(x.device)
+        position_ids_expanded = position_ids[:, None, :].float()
+        device_type = x.device.type if isinstance(x.device.type, str) and x.device.type != "mps" else "cpu"
+        with torch.autocast(device_type=device_type, enabled=False):  # Force float32
+            freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
+            emb = torch.cat((freqs, freqs), dim=-1)
+            cos = emb.cos() * self.attention_scaling
+            sin = emb.sin() * self.attention_scaling
+        return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
+class BidirLMModel(BidirLMPreTrainedModel):
+    def __init__(self, config: BidirLMConfig):
+        super().__init__(config)
+        self.padding_idx = config.pad_token_id
+        self.vocab_size = config.vocab_size
+        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
+        self.layers = nn.ModuleList([Qwen3EncoderLayer(config) for _ in range(config.num_hidden_layers)])
+        self.norm = Qwen3RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+        self.rotary_emb = Qwen3RotaryEmbedding(config=config)
+        self.gradient_checkpointing = False
+        self.mask_converter = AttentionMaskConverter(True)
+        self.post_init()
+    def forward(
+        self,
+        input_ids: torch.LongTensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        *,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        **kwargs
+    ) -> tuple[torch.Tensor] | BaseModelOutput:
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        all_hidden_states = () if output_hidden_states else None
+        all_self_attns = () if output_attentions else None
+        # MNTP experiment: prepend a fixed start token (hard-coded ID 151644) so the sequence is shifted right by one.
+        batch_size, seq_len = input_ids.size()
+        new_input_ids = torch.empty((batch_size, seq_len + 1), dtype=input_ids.dtype, device=input_ids.device)
+        new_input_ids[:, 0] = 151644
+        new_input_ids[:, 1:] = input_ids
+        if attention_mask is None:
+            # No mask given: treat every token as valid so cu_seqlens and max_seqlen are always defined below.
+            attention_mask = torch.ones_like(input_ids)
+        new_attention_mask = torch.empty((batch_size, seq_len + 1), dtype=attention_mask.dtype, device=attention_mask.device)
+        new_attention_mask[:, 0] = 1
+        new_attention_mask[:, 1:] = attention_mask
+        attention_mask = new_attention_mask
+        input_ids, cu_seqlens, max_seqlen = batch_input_to_cu_seqlens(new_input_ids, attention_mask)
+        hidden_states = self.embed_tokens(input_ids)
+        position_ids = torch.arange(len(input_ids), device=input_ids.device).unsqueeze(0)
+        position_embeddings = self.rotary_emb(hidden_states, position_ids)
+        for encoder_layer in self.layers[: self.config.num_hidden_layers]:
+            if output_hidden_states:
+                if attention_mask is not None:
+                    all_hidden_states += (cu_seqlens_to_batch_input(hidden_states, cu_seqlens, attention_mask.shape[-1]),)
+                else:
+                    all_hidden_states += (hidden_states,)
+            layer_outputs = encoder_layer(
+                hidden_states,
+                cu_seqlens=cu_seqlens,
+                max_seqlen=max_seqlen,
+                position_embeddings=position_embeddings,
+                output_attentions=output_attentions,
+            )
+            hidden_states = layer_outputs[0]
+            if output_attentions:
+                if attention_mask is not None:
+                    all_self_attns += (cu_attention_weight_to_batch(layer_outputs[1], cu_seqlens, attention_mask.shape[-1]),)
+                else:
+                    all_self_attns += (layer_outputs[1],)
+        hidden_states = self.norm(hidden_states)
+        if attention_mask is not None:
+            hidden_states = cu_seqlens_to_batch_input(hidden_states, cu_seqlens, attention_mask.shape[-1])
+        if output_hidden_states:
+            all_hidden_states += (hidden_states,)
+        # MNTP experiment: drop the final position so the outputs match the original sequence length.
+        output = BaseModelOutput(
+            last_hidden_state=hidden_states[:, :-1, :],
+            hidden_states=tuple(h[:, :-1, :] for h in all_hidden_states) if all_hidden_states is not None else None,
+            attentions=tuple(a[:, :, :-1, :-1] for a in all_self_attns) if all_self_attns is not None else None,
+        )
+        return output if return_dict else output.to_tuple()
+class BidirLMForMaskedLM(BidirLMPreTrainedModel):
+    config_class = BidirLMConfig
+    _tied_weights_keys = ["lm_head.weight"]
+    def __init__(self, config):
+        super().__init__(config)
+        self.model = BidirLMModel(config)
+        self.vocab_size = config.vocab_size
+        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+        self.post_init()
+    def forward(
+        self,
+        input_ids: torch.LongTensor = None,
+        *,
+        attention_mask: Optional[torch.Tensor] = None,
+        labels: Optional[torch.LongTensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        **kwargs
+    ) -> tuple[torch.Tensor] | MaskedLMOutput:
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        encoder_output = self.model(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=True,  # the code below reads .hidden_states and .attentions, which a tuple lacks
+        )
+        logits = self.lm_head(encoder_output[0])
+        loss = None
+        if labels is not None:
+            loss = self.loss_function(
+                logits, labels, vocab_size=self.config.vocab_size
+            )
+        output = MaskedLMOutput(
+            loss=loss,
+            logits=logits,
+            hidden_states=encoder_output.hidden_states,
+            attentions=encoder_output.attentions,
+        )
+        return output if return_dict else output.to_tuple()
+class BidirLMForSequenceClassification(BidirLMPreTrainedModel):
+    def __init__(self, config: BidirLMConfig):
+        super().__init__(config)
+        self.num_labels = config.num_labels
+        self.clf_pooling = config.clf_pooling
+        self.model = BidirLMModel(config)
+        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
+        self.activation = nn.GELU()
+        self.classifier = nn.Linear(config.hidden_size, self.num_labels)
+        self.post_init()
+    def forward(
+            self,
+            input_ids: Optional[torch.LongTensor] = None,
+            attention_mask: Optional[torch.Tensor] = None,
+            labels: Optional[torch.LongTensor] = None,
+            output_attentions: Optional[bool] = None,
+            output_hidden_states: Optional[bool] = None,
+            return_dict: Optional[bool] = None,
+            **kwargs
+        ) -> tuple[torch.Tensor] | SequenceClassifierOutput:
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        encoder_output = self.model(
+            input_ids,
+            attention_mask=attention_mask,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=True,  # the code below reads .hidden_states and .attentions, which a tuple lacks
+        )
+        last_hidden_state = encoder_output[0]
+        if self.clf_pooling in ["bos", "mean"]:
+            if self.clf_pooling == "bos":
+                pooled_output = last_hidden_state[:, 0]
+            elif self.clf_pooling == "mean":
+                if attention_mask is None:
+                    pooled_output = last_hidden_state.mean(dim=1)
+                else:
+                    pooled_output = (last_hidden_state * attention_mask.unsqueeze(-1)).sum(dim=1)
+                    pooled_output /= attention_mask.sum(dim=1, keepdim=True)
+            pooled_output = self.dense(pooled_output)
+            pooled_output = self.activation(pooled_output)
+            logits = self.classifier(pooled_output)
+        elif self.clf_pooling == "late":
+            x = self.dense(last_hidden_state)
+            x = self.activation(x)
+            logits = self.classifier(x)
+            if attention_mask is None:
+                logits = logits.mean(dim=1)
+            else:
+                logits = (logits * attention_mask.unsqueeze(-1)).sum(dim=1)
+                logits /= attention_mask.sum(dim=1, keepdim=True)
+        else:
+            raise ValueError(f"Unsupported clf_pooling: {self.clf_pooling!r} (expected 'bos', 'mean', or 'late')")
+        loss = None
+        if labels is not None:
+            labels = labels.to(logits.device)
+            if self.config.problem_type is None:
+                if self.num_labels == 1:
+                    self.config.problem_type = "regression"
+                elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
+                    self.config.problem_type = "single_label_classification"
+                else:
+                    self.config.problem_type = "multi_label_classification"
+            if self.config.problem_type == "regression":
+                loss_fct = MSELoss()
+                if self.num_labels == 1:
+                    loss = loss_fct(logits.squeeze(), labels.squeeze())
+                else:
+                    loss = loss_fct(logits, labels)
+            elif self.config.problem_type == "single_label_classification":
+                loss_fct = CrossEntropyLoss()
+                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
+            elif self.config.problem_type == "multi_label_classification":
+                loss_fct = BCEWithLogitsLoss()
+                loss = loss_fct(logits, labels)
+        output = SequenceClassifierOutput(
+            loss=loss,
+            logits=logits,
+            hidden_states=encoder_output.hidden_states,
+            attentions=encoder_output.attentions,
+        )
+        return output if return_dict else output.to_tuple()
+class BidirLMForTokenClassification(BidirLMPreTrainedModel):
+    def __init__(self, config: BidirLMConfig):
+        super().__init__(config)
+        self.num_labels = config.num_labels
+        self.model = BidirLMModel(config)
+        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
+        self.post_init()
+    def forward(
+        self,
+        input_ids: Optional[torch.LongTensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        labels: Optional[torch.LongTensor] = None,
+        use_cache: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> tuple[torch.Tensor] | TokenClassifierOutput:
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        outputs = self.model(
+            input_ids,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            inputs_embeds=inputs_embeds,
+            use_cache=use_cache,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+        sequence_output = outputs[0]
+        logits = self.classifier(sequence_output)
+        loss = None
+        if labels is not None:
+            loss_fct = CrossEntropyLoss()
+            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
+        if not return_dict:
+            output = (logits,) + outputs[1:]  # the encoder tuple has no pooled output, so extras start at index 1
+            return ((loss,) + output) if loss is not None else output
+        return TokenClassifierOutput(
+            loss=loss,
+            logits=logits,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )
+__all__ = [
+    "BidirLMPreTrainedModel",
+    "BidirLMModel",
+    "BidirLMForMaskedLM",
+    "BidirLMForSequenceClassification",
+    "BidirLMForTokenClassification",
+]
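
The `sdpa_attention_forward` fallback in `modeling_bidirlm.py` adds a mask built from `cu_seqlens` so that tokens of different sequences packed into one tensor cannot attend to each other. The helper `create_packed_seqs_mask` is not shown in this part of the diff, so the following is only a minimal sketch of the idea. It uses plain lists instead of tensors to stay dependency-free, and the name `packed_seqs_mask` is hypothetical rather than the file's actual implementation.

```python
import math


def packed_seqs_mask(cu_seqlens, causal=False):
    """Build an additive attention mask for sequences packed back to back.

    cu_seqlens holds cumulative boundaries, e.g. [0, 2, 5] for lengths 2 and 3.
    Entries are 0.0 where attention is allowed and -inf where it is blocked.
    """
    total = cu_seqlens[-1]
    # Map every packed position to the index of the sequence it belongs to.
    seq_id = [0] * total
    for idx, (start, end) in enumerate(zip(cu_seqlens[:-1], cu_seqlens[1:])):
        for pos in range(start, end):
            seq_id[pos] = idx
    mask = [[-math.inf] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            same_sequence = seq_id[q] == seq_id[k]
            # With causal=True a query may only see keys at or before its own position.
            if same_sequence and (not causal or k <= q):
                mask[q][k] = 0.0
    return mask


# Two packed sequences of lengths 2 and 3, bidirectional as in the encoder above.
mask = packed_seqs_mask([0, 2, 5])
for row in mask:
    print(["x" if value == 0.0 else "." for value in row])
```

Adding this mask to the raw attention scores before the softmax sends cross-sequence weights to zero, which is what lets one flat tensor hold a whole batch without padding.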

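`BidirLMModel.forward` prepends token ID 151644 to every row, extends the attention mask by one position, and slices the last position off its outputs, so the returned tensors keep the original sequence length. The sketch below mimics only that bookkeeping with plain lists. The stand-in "encoder" that labels each position is purely illustrative, and `prepend_start` and `drop_last` are hypothetical helper names, not functions from the file.

```python
START_TOKEN_ID = 151644  # the ID hard-coded in BidirLMModel.forward


def prepend_start(rows, fill):
    """Return a copy of each row with `fill` inserted at position 0."""
    return [[fill] + list(row) for row in rows]


def drop_last(rows):
    """Return a copy of each row without its final position."""
    return [row[:-1] for row in rows]


input_ids = [[11, 12, 13]]
attention_mask = [[1, 1, 1]]

shifted_ids = prepend_start(input_ids, START_TOKEN_ID)
shifted_mask = prepend_start(attention_mask, 1)

# Stand-in for the encoder: one labelled output per input position.
encoder_out = [[f"h({tok})" for tok in row] for row in shifted_ids]

# Output slot i now holds the state computed at the token one step earlier.
aligned = drop_last(encoder_out)
print(aligned)
```
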
modules.json ADDED

	@@ -0,0 +1,14 @@

+[
+  {
+    "idx": 0,
+    "name": "0",
+    "path": "",
+    "type": "sentence_transformers.models.Transformer"
+  },
+  {
+    "idx": 1,
+    "name": "1",
+    "path": "1_Pooling",
+    "type": "sentence_transformers.models.Pooling"
+  }
+]
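
`modules.json` chains a `Transformer` module with a `Pooling` module loaded from `1_Pooling`, and the pooling config in this commit enables `pooling_mode_mean_tokens`. As a rough illustration of what mean pooling computes for one sentence, here is a dependency-free sketch. The `mean_pool` function is hypothetical; the sentence-transformers module itself works on batched tensors.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose attention_mask entry is 1."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [value / count for value in total]


# Three token vectors; the last one is padding and must not affect the result.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
pooled = mean_pool(tokens, [1, 1, 0])
print(pooled)
```

Masking before averaging matters because padded positions would otherwise pull the sentence embedding toward whatever the model emits for padding tokens.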

mteb_v2_eval_prompts.json ADDED

	@@ -0,0 +1,262 @@

+{
+    "AmazonCounterfactualClassification": "Given an Amazon review, judge whether it is counterfactual.",
+    "AmazonPolarityClassification": "Classifying Amazon reviews into positive or negative sentiment",
+    "AmazonReviewsClassification": "Classifying the given Amazon review into its appropriate rating category",
+    "Banking77Classification": "Given an online banking query, find the corresponding intents",
+    "EmotionClassification": "Classify the emotion expressed in the given Twitter message into one of the six emotions: anger, fear, joy, love, sadness, and surprise",
+    "ImdbClassification": "Classifying the sentiment expressed in the given movie review text from the IMDB dataset",
+    "MassiveIntentClassification": "Given a user utterance as query, find the user intents",
+    "MassiveScenarioClassification": "Given a user utterance as query, find the user scenarios",
+    "MTOPDomainClassification": "Classifying the intent domain of the given utterance in task-oriented conversation",
+    "MTOPIntentClassification": "Classifying the intent of the given utterance in task-oriented conversation",
+    "ToxicConversationsClassification": "Classifying the given comments as either toxic or not toxic",
+    "TweetSentimentExtractionClassification": "Classifying the sentiment of a given tweet as either positive, negative, or neutral",
+    "TNews": "Categorizing the given news title",
+    "IFlyTek": "Given an App description text, find the appropriate fine-grained category",
+    "MultilingualSentiment": "Classifying sentiment of the customer review into positive, neutral, or negative",
+    "JDReview": "Classifying sentiment of the customer review for iPhone into positive or negative",
+    "OnlineShopping": "Classifying sentiment of the customer review into positive or negative",
+    "Waimai": "Classify the customer review from a food takeaway platform into positive or negative",
+    "ArxivClusteringP2P": "Identify the main and secondary category of Arxiv papers based on the titles and abstracts",
+    "ArxivClusteringS2S": "Identify the main and secondary category of Arxiv papers based on the titles",
+    "BiorxivClusteringP2P": "Identify the main category of Biorxiv papers based on the titles and abstracts",
+    "BiorxivClusteringS2S": "Identify the main category of Biorxiv papers based on the titles",
+    "MedrxivClusteringP2P": "Identify the main category of Medrxiv papers based on the titles and abstracts",
+    "MedrxivClusteringS2S": "Identify the main category of Medrxiv papers based on the titles",
+    "RedditClustering": "Identify the topic or theme of Reddit posts based on the titles",
+    "RedditClusteringP2P": "Identify the topic or theme of Reddit posts based on the titles and posts",
+    "StackExchangeClustering": "Identify the topic or theme of StackExchange posts based on the titles",
+    "StackExchangeClusteringP2P": "Identify the topic or theme of StackExchange posts based on the given paragraphs",
+    "TwentyNewsgroupsClustering": "Identify the topic or theme of the given news articles",
+    "CLSClusteringS2S": "Identify the main category of scholar papers based on the titles",
+    "CLSClusteringP2P": "Identify the main category of scholar papers based on the titles and abstracts",
+    "ThuNewsClusteringS2S": "Identify the topic or theme of the given news articles based on the titles",
+    "ThuNewsClusteringP2P": "Identify the topic or theme of the given news articles based on the titles and contents",
+    "AskUbuntuDupQuestions": "Given a question, retrieve detailed question descriptions from Stackexchange that are duplicates to the given question",
+    "MindSmallReranking": "Given a query, retrieve documents that answer the query.",
+    "SciDocsRR": "Given a query, retrieve documents that answer the query.",
+    "StackOverflowDupQuestions": "Given a question, retrieve detailed question descriptions from Stackexchange that are duplicates to the given question",
+    "SprintDuplicateQuestions": "Given a question, retrieve detailed question descriptions from Stackexchange that are duplicates to the given question",
+    "TwitterSemEval2015": "Retrieve semantically similar text.",
+    "TwitterURLCorpus": "Retrieve semantically similar text.",
+    "T2Reranking": "Given a query, retrieve documents that answer the query.",
+    "MmarcoReranking": "Given a query, retrieve documents that answer the query.",
+    "CMedQAv1": "Given a query, retrieve documents that answer the query.",
+    "CMedQAv2": "Given a query, retrieve documents that answer the query.",
+    "Ocnli": "Retrieve semantically similar text.",
+    "Cmnli": "Retrieve semantically similar text.",
+    "ArguAna": {
+        "query": "Given a claim, retrieve documents that support or refute the claim",
+        "passage": "Given a claim, retrieve documents that support or refute the claim"
+    },
+    "ClimateFEVER": "Given a claim, retrieve documents that support or refute the claim",
+    "ClimateFEVERHardNegatives": "Given a claim, retrieve documents that support or refute the claim",
+    "DBPedia": "Given a query, retrieve documents that answer the query.",
+    "FEVER": "Given a claim, retrieve documents that support or refute the claim",
+    "FEVERHardNegatives": "Given a claim, retrieve documents that support or refute the claim",
+    "FiQA2018": "Given a query, retrieve documents that answer the query.",
+    "HotpotQA": "Given a multi-hop question, retrieve documents that can help answer the question",
+    "HotpotQAHardNegatives": "Given a multi-hop question, retrieve documents that can help answer the question",
+    "MSMARCO": "Given a web search query, retrieve relevant passages that answer the query",
+    "NFCorpus": "Given a question, retrieve relevant documents that best answer the question",
+    "NQ": "Given a question, retrieve Wikipedia passages that answer the question",
+    "QuoraRetrieval": "Given a query, retrieve documents that answer the query.",
+    "SCIDOCS": "Given a query, retrieve documents that answer the query.",
+    "SciFact": "Given a scientific claim, retrieve documents that support or refute the claim",
+    "Touche2020": "Given a query, retrieve documents that answer the query.",
+    "Touche2020Retrieval.v3": "Given a query, retrieve documents that answer the query.",
+    "TRECCOVID": "Given a query, retrieve documents that answer the query.",
+    "T2Retrieval": "Given a question, retrieve passages that answer the question",
+    "MMarcoRetrieval": "Given a web search query, retrieve relevant passages that answer the query",
+    "DuRetrieval": "Given a question, retrieve passages that answer the question",
+    "CovidRetrieval": "Given a query on COVID-19, retrieve documents that answer the query",
+    "CmedqaRetrieval": "Given a query, retrieve documents that answer the query.",
+    "EcomRetrieval": "Given a query, retrieve documents that answer the query.",
+    "MedicalRetrieval": "Given a query, retrieve documents that answer the query.",
+    "VideoRetrieval": "Given a query, retrieve documents that answer the query.",
+    "STSBenchmarkMultilingualSTS": "Retrieve semantically similar text",
+    "SICKFr": "Retrieve semantically similar text",
+    "SummEvalFr": "Retrieve semantically similar text.",
+    "MasakhaNEWSClassification": "Categorizing the given news title",
+    "OpusparcusPC": "Retrieve semantically similar text",
+    "PawsX": "Retrieve semantically similar text",
+    "AlloProfClusteringP2P": "Identify the main category of scholar papers based on the titles and abstracts",
+    "AlloProfClusteringS2S": "Identify the main category of scholar papers based on the titles",
+    "HALClusteringS2S": "Identify the main category of scholar papers based on the titles",
+    "MasakhaNEWSClusteringP2P": "Identify the topic or theme of the given news articles based on the titles and contents",
+    "MasakhaNEWSClusteringS2S": "Identify the topic or theme of the given news articles based on the titles",
+    "MLSUMClusteringP2P": "Identify the topic or theme of Reddit posts based on the titles and posts",
+    "MLSUMClusteringS2S": "Identify the topic or theme of Reddit posts based on the titles",
+    "SyntecReranking": "Given a question, retrieve passages that answer the question",
+    "AlloprofReranking": "Given a question, retrieve passages that answer the question",
+    "AlloprofRetrieval": "Given a question, retrieve passages that answer the question",
+    "BSARDRetrieval": "Given a question, retrieve passages that answer the question",
+    "SyntecRetrieval": "Given a question, retrieve passages that answer the question",
+    "XPQARetrieval": "Given a question, retrieve passages that answer the question",
+    "MintakaRetrieval": "Given a question, retrieve passages that answer the question",
+    "CBD": "Classifying the sentiment of a given tweet as either positive, negative, or neutral",
+    "PolEmo2.0-IN": "Classifying sentiment of the customer review into positive, neutral, or negative",
+    "PolEmo2.0-OUT": "Classifying sentiment of the customer review into positive, neutral, or negative",
+    "AllegroReviews": "Classifying sentiment of the customer review into positive, neutral, or negative",
+    "PAC": "Classify the sentence into one of the two types: \"BEZPIECZNE_POSTANOWIENIE_UMOWNE\" and \"KLAUZULA_ABUZYWNA\"",
+    "SICK-E-PL": "Retrieve semantically similar text",
+    "SICK-R-PL": "Retrieve semantically similar text",
+    "STS22": "Retrieve semantically similar text",
+    "AFQMC": "Retrieve semantically similar text",
+    "BQ": "Retrieve semantically similar text",
+    "LCQMC": "Retrieve semantically similar text",
+    "PAWSX": "Retrieve semantically similar text",
+    "QBQTC": "Retrieve semantically similar text",
+    "STS12": "Retrieve semantically similar text",
+    "PPC": "Retrieve semantically similar text",
+    "CDSC-E": "Retrieve semantically similar text",
+    "PSC": "Retrieve semantically similar text",
+    "8TagsClustering": "Identify the topic or theme of the given news articles",
+    "ArguAna-PL": "Given a claim, retrieve documents that support or refute the claim",
+    "DBPedia-PL": "Given a query, retrieve documents that answer the query.",
+    "FiQA-PL": "Given a query, retrieve documents that answer the query.",
+    "HotpotQA-PL": "Given a multi-hop question, retrieve documents that can help answer the question",
+    "MSMARCO-PL": "Given a web search query, retrieve relevant passages that answer the query",
+    "NFCorpus-PL": "Given a question, retrieve relevant documents that best answer the question",
+    "NQ-PL": "Given a question, retrieve Wikipedia passages that answer the question",
+    "Quora-PL": "Given a query, retrieve documents that answer the query.",
+    "SCIDOCS-PL": "Given a query, retrieve documents that answer the query.",
+    "SciFact-PL": "Given a scientific claim, retrieve documents that support or refute the claim",
+    "TRECCOVID-PL": "Given a query, retrieve documents that answer the query.",
+    "GeoreviewClassification": "Classifying sentiment of the customer review into positive, neutral, or negative",
+    "HeadlineClassification": "Categorizing the given news title",
+    "InappropriatenessClassification": "Classifying the given comments as either toxic or not toxic",
+    "KinopoiskClassification": "Classifying the sentiment expressed in the given movie review text from the IMDB dataset",
+    "RuReviewsClassification": "Classifying sentiment of the customer review into positive, neutral, or negative",
+    "RuSciBenchGRNTIClassification": "Categorizing the given news title",
+    "RuSciBenchOECDClassification": "Categorizing the given news title",
+    "GeoreviewClusteringP2P": "Identify the topic or theme of Reddit posts based on the titles and posts",
+    "RuSciBenchGRNTIClusteringP2P": "Identify the main category of scholar papers based on the titles and abstracts",
+    "RuSciBenchOECDClusteringP2P": "Identify the main category of scholar papers based on the titles and abstracts",
+    "TERRa": "Retrieve semantically similar text.",
+    "RuBQReranking": "Given a question, retrieve Wikipedia passages that answer the question",
+    "RiaNewsRetrieval": "Given a query, retrieve documents that answer the query.",
+    "RuBQRetrieval": "Given a question, retrieve Wikipedia passages that answer the question",
+    "RUParaPhraserSTS": "Retrieve semantically similar text",
+    "RuSTSBenchmarkSTS": "Retrieve semantically similar text",
+    "AppsRetrieval": "Given a query, retrieve documents that answer the query.",
+    "COIRCodeSearchNetRetrieval": "Given a query, retrieve documents that answer the query.",
+    "CodeEditSearchRetrieval": "Given a query, retrieve documents that answer the query.",
+    "CodeFeedbackMT": "Given a query, retrieve documents that answer the query.",
+    "CodeFeedbackST": "Given a query, retrieve documents that answer the query.",
+    "CodeSearchNetCCRetrieval": "Given a query, retrieve documents that answer the query.",
+    "CodeSearchNetRetrieval": "Given a query, retrieve documents that answer the query.",
+    "CodeTransOceanContest": "Given a query, retrieve documents that answer the query.",
+    "CodeTransOceanDL": "Given a query, retrieve documents that answer the query.",
+    "CosQA": "Given a query, retrieve documents that answer the query.",
+    "StackOverflowQA": "Given a query, retrieve documents that answer the query.",
+    "SyntheticText2SQL": "Given a query, retrieve documents that answer the query.",
+    "BibleNLPBitextMining": "Retrieve semantically similar text.",
+    "BUCC.v2": "Retrieve semantically similar text.",
+    "DiaBlaBitextMining": "Retrieve semantically similar text.",
+    "FloresBitextMining": "Retrieve semantically similar text.",
+    "IN22GenBitextMining": "Retrieve semantically similar text.",
+    "IndicGenBenchFloresBitextMining": "Retrieve semantically similar text.",
+    "NollySentiBitextMining": "Retrieve semantically similar text.",
+    "NTREXBitextMining": "Retrieve semantically similar text.",
+    "NusaTranslationBitextMining": "Retrieve semantically similar text.",
+    "NusaXBitextMining": "Retrieve semantically similar text.",
+    "Tatoeba": "Retrieve semantically similar text.",
+    "BulgarianStoreReviewSentimentClassfication": "Classifying sentiment of the customer review into positive, neutral, or negative",
+    "CzechProductReviewSentimentClassification": "Classifying sentiment of the customer review into positive, neutral, or negative",
+    "GreekLegalCodeClassification": "Categorizing the given news title",
+    "DBpediaClassification": "Given an App description text, find the appropriate fine-grained category",
+    "FinancialPhrasebankClassification": "Classifying sentiment of the customer review into positive, neutral, or negative",
+    "PoemSentimentClassification": "Classifying sentiment of the customer review into positive, neutral, or negative",
+    "TweetTopicSingleClassification": "Categorizing the given news title",
+    "EstonianValenceClassification": "Classifying sentiment of the customer review into positive, neutral, or negative",
+    "FilipinoShopeeReviewsClassification": "Classifying sentiment of the customer review into positive, neutral, or negative",
+    "GujaratiNewsClassification": "Categorizing the given news title",
+    "SentimentAnalysisHindi": "Classifying sentiment of the customer review into positive, neutral, or negative",
+    "IndonesianIdClickbaitClassification": "Categorizing the given news title",
+    "ItaCaseholdClassification": "Categorizing the given news title",
+    "KorSarcasmClassification": "Classifying sentiment of the customer review into positive, neutral, or negative",
+    "KurdishSentimentClassification": "Classifying sentiment of the customer review into positive, neutral, or negative",
+    "MacedonianTweetSentimentClassification": "Classifying the sentiment of a given tweet as either positive, negative, or neutral",
+    "AfriSentiClassification": "Classifying sentiment of the customer review into positive, neutral, or negative",
+    "CataloniaTweetClassification": "Classifying the sentiment of a given tweet as either positive, negative, or neutral",
+    "CyrillicTurkicLangClassification": "Given a text, classify its language",
+    "IndicLangClassification": "Given a text, classify its language",
+    "MultiHateClassification": "Classifying the given comments as either toxic or not toxic",
+    "NusaParagraphEmotionClassification": "Classify the emotion expressed in the given Twitter message into one of the six emotions: anger, fear, joy, love, sadness, and surprise",
+    "NusaX-senti": "Classifying sentiment of the customer review into positive, neutral, or negative",
+    "SwissJudgementClassification": "Classifying sentiment of the customer review into positive, neutral, or negative",
+    "NepaliNewsClassification": "Categorizing the given news title",
+    "OdiaNewsClassification": "Categorizing the given news title",
+    "PunjabiNewsClassification": "Categorizing the given news title",
+    "SinhalaNewsClassification": "Categorizing the given news title",
+    "CSFDSKMovieReviewSentimentClassification": "Classifying the sentiment expressed in the given movie review text from the IMDB dataset",
+    "SiswatiNewsClassification": "Categorizing the given news title",
+    "SlovakMovieReviewSentimentClassification": "Classifying the sentiment expressed in the given movie review text from the IMDB dataset",
+    "SwahiliNewsClassification": "Categorizing the given news title",
+    "TswanaNewsClassification": "Categorizing the given news title",
+    "IsiZuluNewsClassification": "Categorizing the given news title",
+    "WikiCitiesClustering": "Identify the topic or theme of the given news articles",
+    "RomaniBibleClustering": "Identify the topic or theme of the given news articles",
+    "ArXivHierarchicalClusteringP2P": "Identify the main and secondary category of Arxiv papers based on the titles and abstracts",
+    "ArXivHierarchicalClusteringS2S": "Identify the main and secondary category of Arxiv papers based on the titles",
+    "BigPatentClustering.v2": "Identify the main category of scholar papers based on the titles and abstracts",
+    "AlloProfClusteringS2S.v2": "Identify the main category of scholar papers based on the titles",
+    "HALClusteringS2S.v2": "Identify the main category of scholar papers based on the titles",
+    "SIB200ClusteringS2S": "Identify the topic or theme of the given news articles",
+    "WikiClusteringP2P.v2": "Identify the topic or theme of the given news articles",
+    "PlscClusteringP2P.v2": "Identify the main category of scholar papers based on the titles and abstracts",
+    "KorHateSpeechMLClassification": "Classifying the given comments as either toxic or not toxic",
+    "MalteseNewsClassification": "Categorizing the given news title",
+    "MultiEURLEXMultilabelClassification": "Categorizing the given news title",
+    "BrazilianToxicTweetsClassification": "Classifying the given comments as either toxic or not toxic",
+    "CTKFactsNLI": "Retrieve semantically similar text",
+    "indonli": "Retrieve semantically similar text",
+    "ArmenianParaphrasePC": "Retrieve semantically similar text",
+    "PawsXPairClassification": "Retrieve semantically similar text",
+    "RTE3": "Retrieve semantically similar text",
+    "XNLI": "Retrieve semantically similar text",
+    "PpcPC": "Retrieve semantically similar text",
+    "GermanSTSBenchmark": "Retrieve semantically similar text",
+    "SICK-R": "Retrieve semantically similar text",
+    "STS13": "Retrieve semantically similar text",
+    "STS14": "Retrieve semantically similar text",
+    "STSBenchmark": "Retrieve semantically similar text",
+    "FaroeseSTS": "Retrieve semantically similar text",
+    "FinParaSTS": "Retrieve semantically similar text",
+    "JSICK": "Retrieve semantically similar text",
+    "IndicCrosslingualSTS": "Retrieve semantically similar text",
+    "SemRel24STS": "Retrieve semantically similar text",
+    "STS17": "Retrieve semantically similar text",
+    "STS22.v2": "Retrieve semantically similar text",
+    "STSES": "Retrieve semantically similar text",
+    "STSB": "Retrieve semantically similar text",
+    "AILAStatutes": "Given a query, retrieve documents that answer the query.",
+    "HagridRetrieval": "Given a query, retrieve documents that answer the query.",
+    "LegalBenchCorporateLobbying": "Given a query, retrieve documents that answer the query.",
+    "LEMBPasskeyRetrieval": "Given a query, retrieve documents that answer the query.",
+    "BelebeleRetrieval": "Given a query, retrieve documents that answer the query.",
+    "MLQARetrieval": "Given a query, retrieve documents that answer the query.",
+    "StatcanDialogueDatasetRetrieval": "Given a query, retrieve documents that answer the query.",
+    "WikipediaRetrievalMultilingual": "Given a query, retrieve documents that answer the query.",
+    "Core17InstructionRetrieval": "Given a query, retrieve documents that answer the query.",
+    "News21InstructionRetrieval": "Given a query, retrieve documents that answer the query.",
+    "Robust04InstructionRetrieval": "Given a query, retrieve documents that answer the query.",
+    "WebLINXCandidatesReranking": "Given a query, retrieve documents that answer the query.",
+    "WikipediaRerankingMultilingual": "Given a query, retrieve documents that answer the query.",
+    "STS15": "Retrieve semantically similar text",
+    "MIRACLRetrievalHardNegatives": "Given a question, retrieve passages that answer the question",
+    "BIOSSES": "Retrieve semantically similar text",
+    "CQADupstackRetrieval": "Given a question, retrieve detailed question descriptions from Stackexchange that are duplicates to the given question",
+    "CQADupstackGamingRetrieval": {
+        "query": "Given a question, retrieve detailed question descriptions from Stackexchange that are duplicates to the given question",
+        "passage": "Given a question, retrieve detailed question descriptions from Stackexchange that are duplicates to the given question"
+    },
+    "CQADupstackUnixRetrieval": {
+        "query": "Given a question, retrieve detailed question descriptions from Stackexchange that are duplicates to the given question",
+        "passage": "Given a question, retrieve detailed question descriptions from Stackexchange that are duplicates to the given question"
+    },
+    "STS16": "Retrieve semantically similar text",
+    "SummEval": "Retrieve semantically similar text",
+    "ATEC": "Retrieve semantically similar text"
+}
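
Note on the prompt file above: most entries map a task name to one instruction string, but a few (for example `CQADupstackGamingRetrieval` and `CQADupstackUnixRetrieval`) map to an object with separate `query` and `passage` keys. Any code reading the file has to handle both shapes. The sketch below shows one way to do that. The helper name `resolve_prompt` and the `role` parameter are hypothetical and are not part of this commit or of any library; the sample entries are copied from the diff.

```python
from typing import Mapping, Union

# A prompt entry is either a plain string or a {"query": ..., "passage": ...} mapping.
PromptEntry = Union[str, Mapping[str, str]]

_CQA = (
    "Given a question, retrieve detailed question descriptions from "
    "Stackexchange that are duplicates to the given question"
)

# Two entries copied from mteb_v2_eval_prompts.json; the real file has many more.
PROMPTS: Mapping[str, PromptEntry] = {
    "STS16": "Retrieve semantically similar text",
    "CQADupstackUnixRetrieval": {"query": _CQA, "passage": _CQA},
}


def resolve_prompt(prompts: Mapping[str, PromptEntry], task: str, role: str = "query") -> str:
    """Return the instruction for `task`, choosing the `role` key when the entry is a mapping."""
    entry = prompts[task]  # raises KeyError for tasks that are not listed
    if isinstance(entry, str):
        # A single string applies to every role.
        return entry
    return entry[role]
```

For a string entry the `role` argument is ignored, so callers can pass it unconditionally.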

sentence_bert_config.json ADDED

	@@ -0,0 +1,4 @@

+{
+    "max_seq_length": 512,
+    "do_lower_case": false
+}

special_tokens_map.json ADDED

	@@ -0,0 +1,38 @@

+{
+  "additional_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>"
+  ],
+  "eos_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "<|mask|>",
+    "lstrip": true,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}
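
Note on the token map above: `pad_token` and `eos_token` are both `<|endoftext|>`, so padding positions cannot be told apart from a genuine end-of-text token by token ID alone. Because `1_Pooling/config.json` in this commit enables mean pooling, padded positions should be excluded using the attention mask. The sketch below illustrates masked mean pooling in plain Python. It is an illustration under that assumption, not the repository's pooling code, and the function name `mean_pool` is hypothetical.

```python
from typing import List, Sequence


def mean_pool(token_embeddings: Sequence[Sequence[float]], attention_mask: Sequence[int]) -> List[float]:
    """Average only the token vectors whose attention-mask value is non-zero."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    kept = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if not keep:
            continue  # padded position: skip it, whatever its token ID is
        kept += 1
        for i, value in enumerate(vector):
            total[i] += value
    if kept == 0:
        raise ValueError("attention mask excludes every token")
    return [component / kept for component in total]


# Two real tokens followed by one padded position with an arbitrary vector.
pooled = mean_pool([[1.0, 3.0], [3.0, 5.0], [100.0, 100.0]], [1, 1, 0])
```

Here the third vector is ignored because its mask value is 0, so `pooled` is the average of the first two vectors only.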

tokenizer.json ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:8fed6809a51ebca45ce2df9ac10fc3aa4ed7f05c659dc09e96be7d814271c46b
+size 11422913

tokenizer_config.json ADDED

	@@ -0,0 +1,241 @@

+{
+  "add_bos_token": false,
+  "add_eos_token": false,
+  "add_prefix_space": false,
+  "added_tokens_decoder": {
+    "151643": {
+      "content": "<|endoftext|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151644": {
+      "content": "<|im_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151645": {
+      "content": "<|im_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151646": {
+      "content": "<|object_ref_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151647": {
+      "content": "<|object_ref_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151648": {
+      "content": "<|box_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151649": {
+      "content": "<|box_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151650": {
+      "content": "<|quad_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151651": {
+      "content": "<|quad_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151652": {
+      "content": "<|vision_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151653": {
+      "content": "<|vision_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151654": {
+      "content": "<|vision_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151655": {
+      "content": "<|image_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151656": {
+      "content": "<|video_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151657": {
+      "content": "<tool_call>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151658": {
+      "content": "</tool_call>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151659": {
+      "content": "<|fim_prefix|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151660": {
+      "content": "<|fim_middle|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151661": {
+      "content": "<|fim_suffix|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151662": {
+      "content": "<|fim_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151663": {
+      "content": "<|mask|>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151664": {
+      "content": "<|file_sep|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151665": {
+      "content": "<tool_response>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151666": {
+      "content": "</tool_response>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151667": {
+      "content": "<think>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151668": {
+      "content": "</think>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    }
+  },
+  "additional_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>"
+  ],
+  "bos_token": null,
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|endoftext|>",
+  "errors": "replace",
+  "extra_special_tokens": {},
+  "mask_token": "<|mask|>",
+  "model_max_length": 131072,
+  "pad_token": "<|endoftext|>",
+  "split_special_tokens": false,
+  "tokenizer_class": "Qwen2Tokenizer",
+  "unk_token": null
+}
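
Note on sequence length: this file sets `model_max_length` to 131072, while `sentence_bert_config.json` in the same commit sets `max_seq_length` to 512. The two limits differ by a factor of 256, so which one applies depends on how the model is loaded. The sketch below simply takes the stricter of the two. That is an assumption about the intended behaviour, not something this commit states, and `effective_max_length` is a hypothetical helper.

```python
import json
from typing import Optional

# Values copied from the two config files in this commit.
sentence_bert_config = json.loads('{"max_seq_length": 512, "do_lower_case": false}')
tokenizer_config = json.loads('{"model_max_length": 131072}')


def effective_max_length(sbert_cfg: dict, tok_cfg: dict) -> Optional[int]:
    """Return the smaller of the two configured limits, ignoring any that is missing."""
    limits = [
        value
        for value in (sbert_cfg.get("max_seq_length"), tok_cfg.get("model_max_length"))
        if value is not None
    ]
    return min(limits) if limits else None


limit = effective_max_length(sentence_bert_config, tokenizer_config)
```

With the values above, `limit` is 512, which means inputs longer than 512 tokens would be truncated under this assumption.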

vocab.json ADDED

The diff for this file is too large to render.