trustchainai-codebert

Fine-tuned CodeBERT for Solidity smart contract vulnerability detection.

Part of the TrustChainAI project, an AI-powered smart contract auditor with explainability and ethics monitoring, built to make blockchain security accessible to African and emerging-market Web3 ecosystems.


Model Performance

Metric                   Score
F1 (weighted, test set)  98.6%
Eval loss                0.0428
Test samples             1,032 contracts
Classes                  13 vulnerability categories
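
For readers unfamiliar with the weighted F1 reported above, the following is a minimal sketch of how that metric is commonly computed: per-class F1 averaged with weights proportional to each class's support in the true labels. The function name weighted_f1 and the toy labels are illustrative assumptions; the card does not state which library produced the reported score.

```python
from collections import Counter
from typing import Hashable, Sequence


def weighted_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Support-weighted average of per-class F1 scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * n / total  # weight each class by its share of the samples
    return score


# Toy example (not taken from the model's test set)
y_true = ["safe", "safe", "reentrancy", "reentrancy"]
y_pred = ["safe", "reentrancy", "reentrancy", "reentrancy"]
print(round(weighted_f1(y_true, y_pred), 4))
```

Because the weights follow class frequency, a high weighted F1 can coexist with weak performance on a rare class, so per-class scores are worth checking as well.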

How to Use

from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="emekaphilians/trustchainai-codebert"
)

contract = """
pragma solidity ^0.8.0;
contract Vulnerable {
    mapping(address => uint) public balances;
    function withdraw() external {
        uint amt = balances[msg.sender];
        (bool ok,) = msg.sender.call{value: amt}("");
        balances[msg.sender] = 0;
    }
}
"""

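# Truncate by tokens (the model's limit is 512 tokens), not by characters
result = classifier(contract, truncation=True, max_length=512)
print(result)
# Example output: [{'label': 'reentrancy', 'score': 0.997}]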
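
Contracts longer than the 512-token limit listed under Training Details lose everything past the cutoff. One possible workaround, which this model card does not describe and which has not been validated against this model, is to classify overlapping token windows and keep the most confident non-safe verdict. The helper names sliding_windows and aggregate, the window size of 510 (leaving room for the two special tokens the RoBERTa-style tokenizer adds), and the stride are all assumptions made for illustration.

```python
from typing import Dict, List


def sliding_windows(token_ids: List[int], size: int = 510, stride: int = 255) -> List[List[int]]:
    """Split token_ids into overlapping windows of at most `size` tokens."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    if len(token_ids) <= size:
        return [list(token_ids)]
    windows = []
    start = 0
    while True:
        windows.append(list(token_ids[start:start + size]))
        if start + size >= len(token_ids):
            break  # this window already reaches the end of the input
        start += stride
    return windows


def aggregate(predictions: List[Dict]) -> Dict:
    """Return the most confident non-safe prediction, or the most confident one if all are safe."""
    if not predictions:
        raise ValueError("no predictions to aggregate")
    flagged = [p for p in predictions if p["label"] != "safe"]
    pool = flagged or predictions
    return max(pool, key=lambda p: p["score"])


# Hypothetical wiring with the pipeline from the example above:
# ids = classifier.tokenizer(contract, add_special_tokens=False)["input_ids"]
# texts = [classifier.tokenizer.decode(w) for w in sliding_windows(ids)]
# verdict = aggregate([classifier(t, truncation=True, max_length=512)[0] for t in texts])
```

Since the model was trained on inputs truncated at 512 tokens, windows that start mid-contract may be out of distribution; treat the aggregated verdict as a triage hint rather than a finding.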

Label Schema

ID  Label                    Description
0   safe                     No vulnerability detected
1   reentrancy               Reentrancy attack (DAO-style)
2   integer_overflow         Arithmetic overflow / underflow
3   access_control           Unprotected ownership or selfdestruct
4   tx_origin_phishing       tx.origin used for authentication
5   dos_gas                  Unbounded loop / gas exhaustion
6   unchecked_call           External call return value ignored
7   front_running_mev        Mempool-visible state / TOD
8   timestamp_dependence     block.timestamp manipulation
9   proxy_storage_collision  Delegatecall storage slot collision
10  flash_loan_oracle        Oracle price manipulation via flash loan
11  flash_loan_single_block  Single-block liquidity attack
12  misnamed_constructor     Pre-Solidity-0.5 constructor naming bug
13  other                    Multi-class or miscellaneous vulnerability
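
For downstream code, the schema above can be written as a plain mapping: 14 ids in total, covering the 13 vulnerability categories plus safe. This is a sketch transcribed from the table; the names ID2LABEL, LABEL2ID, and is_vulnerable are illustrative and are not claimed to match identifiers in the TrustChainAI codebase.

```python
# Label schema transcribed from the table above
ID2LABEL = {
    0: "safe",
    1: "reentrancy",
    2: "integer_overflow",
    3: "access_control",
    4: "tx_origin_phishing",
    5: "dos_gas",
    6: "unchecked_call",
    7: "front_running_mev",
    8: "timestamp_dependence",
    9: "proxy_storage_collision",
    10: "flash_loan_oracle",
    11: "flash_loan_single_block",
    12: "misnamed_constructor",
    13: "other",
}
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}


def is_vulnerable(label: str) -> bool:
    """True for every known label except 'safe'; unknown labels raise KeyError."""
    if label not in LABEL2ID:
        raise KeyError(f"unknown label: {label}")
    return label != "safe"


print(LABEL2ID["reentrancy"], is_vulnerable("reentrancy"))
```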

Training Data

Assembled from four open-source datasets, plus synthetic augmentation, using the prepare_datasets.py pipeline:

Source                  Contracts
SmartBugs Curated       143
SolidiFI Benchmark      1,700
DeFiHackLabs            729
Not-So-Smart Contracts  25
Synthetic augmentation  3,600
Total (after dedup)     6,879

Split: 70% train / 15% val / 15% test (stratified by label).
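
A stratified 70/15/15 split can be sketched with the standard library alone, as below. This is an illustration of the idea, not the logic of prepare_datasets.py, which is not shown in this card; the function name stratified_split, the seed, and the rounding rule are assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def stratified_split(labels: Sequence[int], train: float = 0.70, val: float = 0.15,
                     seed: int = 42) -> Dict[str, List[int]]:
    """Split sample indices so each label keeps roughly the same train/val/test ratio."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        n_train = round(len(idxs) * train)
        n_val = round(len(idxs) * val)
        splits["train"] += idxs[:n_train]
        splits["val"] += idxs[n_train:n_train + n_val]
        splits["test"] += idxs[n_train + n_val:]  # remainder goes to test
    return splits


# Toy example: 100 samples of class 0 and 20 samples of class 1
toy_labels = [0] * 100 + [1] * 20
parts = stratified_split(toy_labels)
print({name: len(idxs) for name, idxs in parts.items()})
```

Applied to the 6,879 contracts listed above, a 15% test share comes to roughly 1,032 samples, which matches the test-set size reported under Model Performance.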


Training Details

Parameter         Value
Base model        microsoft/codebert-base
Epochs            5 (best checkpoint at epoch 2)
Batch size        16
Learning rate     2e-5
Optimizer         AdamW (weight decay 0.01, warmup 100 steps)
Max token length  512
Mixed precision   fp16
Hardware          Google Colab T4 GPU
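
The hyperparameters above could be expressed as keyword arguments for transformers.TrainingArguments, as in the sketch below. Whether the original run used the Trainer API is an assumption; the card lists only the values. The step arithmetic also assumes a training set of about 4,815 contracts (70% of 6,879), a single GPU, and no gradient accumulation.

```python
import math
from typing import Tuple

# Values copied from the table above; the keys are TrainingArguments parameter names.
# Passing them as TrainingArguments(output_dir=..., **TRAINING_KWARGS) is an assumed usage.
TRAINING_KWARGS = {
    "num_train_epochs": 5,
    "per_device_train_batch_size": 16,
    "learning_rate": 2e-5,
    "weight_decay": 0.01,
    "warmup_steps": 100,
    "fp16": True,  # requires a CUDA GPU such as the T4 listed above
}


def optimizer_steps(n_train: int, batch_size: int, epochs: int) -> Tuple[int, int]:
    """Return (steps per epoch, total steps), assuming no gradient accumulation."""
    steps_per_epoch = math.ceil(n_train / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


approx_train = round(6879 * 0.70)  # assumed training-set size, derived from the split
per_epoch, total = optimizer_steps(
    approx_train,
    TRAINING_KWARGS["per_device_train_batch_size"],
    TRAINING_KWARGS["num_train_epochs"],
)
print(per_epoch, total)
```

Under these assumptions the 100 warmup steps cover roughly the first third of epoch 1, and the best checkpoint at epoch 2 falls well before the final step.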

Intended Use

  • Pre-deployment security screening of Solidity smart contracts
  • Automated vulnerability triage for DeFi protocols
  • Research baseline for smart contract security ML benchmarks
  • Integration into the TrustChainAI multi-agent audit pipeline

Out-of-Scope Use

  • This model is not a substitute for a full professional security audit on high-value contracts
  • Performance on Vyper, Yul, or non-EVM contracts is untested
  • The tx_origin_phishing class has limited real training samples (28); treat predictions for this class with extra caution
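
One way to act on the caution above is to route certain predictions to a human reviewer. The sketch below flags any prediction in a low-support class, or any prediction below a confidence threshold. The function name needs_manual_review and the 0.90 threshold are hypothetical choices, not values recommended by the model authors.

```python
from typing import Dict

# Classes the card identifies as having few real training samples
LOW_SUPPORT_LABELS = {"tx_origin_phishing"}


def needs_manual_review(prediction: Dict, min_score: float = 0.90) -> bool:
    """Flag a pipeline prediction ({'label': ..., 'score': ...}) for human review."""
    label = prediction["label"]
    score = prediction["score"]
    return label in LOW_SUPPORT_LABELS or score < min_score


print(needs_manual_review({"label": "reentrancy", "score": 0.997}))
print(needs_manual_review({"label": "tx_origin_phishing", "score": 0.99}))
```

The threshold is best tuned on a held-out set, since softmax scores from fine-tuned transformers are often overconfident.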

Limitations & Bias

  • Synthetic augmentation was used for 9 of 13 classes to compensate for dataset scarcity. Synthetic contracts may not fully capture real-world obfuscation patterns.
  • The tx_origin_phishing class had only 28 real-world training samples; predictions for this class may be less reliable in practice than the aggregate score suggests.
  • Training data skews toward older Solidity vulnerability patterns (pre-0.8). Newer attack vectors may be underrepresented.

Citation

@misc{trustchainai2025,
  author = {Emeka Philian},
  title  = {TrustChainAI: AI-Powered Smart Contract Auditor},
  year   = {2025},
  url    = {https://github.com/emekaphilian/TrustChainAI}
}
