# Developer Documentation

## Project Overview

This project implements a Vietnamese text classification system using the UTS2017_Bank dataset from Hugging Face. The model uses TF-IDF feature extraction combined with either Logistic Regression or SVM classifiers to categorize Vietnamese banking-related text into 14 categories.

## Setup

### Prerequisites

- Python 3.8+
- uv (for dependency management)

### Installation

1. Clone the repository:

   ```bash
   git clone
   cd sonar_core_1
   ```

2. Install dependencies using uv:

   ```bash
   uv sync
   ```

## Usage

### Command Line Interface

Train the model with default parameters:

```bash
uv run python train.py
```

Train with custom parameters:

```bash
# Train VNTC dataset with SVM classifier
uv run python train.py --dataset vntc --model svc_linear --max-features 10000

# Quick test with limited samples
uv run python train.py --dataset vntc --num-rows 1000

# Compare specific models on VNTC dataset
uv run python train.py --compare-models logistic svc_linear random_forest --compare-dataset vntc

# Compare all models with subset of data
uv run python train.py --compare-dataset vntc --num-rows 5000 --compare

# Train banking dataset with neural network
uv run python train.py --dataset uts2017 --model mlp --num-rows 500
```

### Python API

```python
from train import train_model

# Train a single model
result = train_model(
    model_name="logistic",  # or "svc"
    max_features=20000,
    ngram_range=(1, 2),
    split_ratio=0.2,
    n_samples=None  # Use all data
)

# Access results
print(f"Test accuracy: {result['test_accuracy']}")
print(f"Training time: {result['train_time']} seconds")
```

### Google Colab / Jupyter Notebook

The training script is fully compatible with Google Colab and Jupyter notebooks. Here's how to run comprehensive model comparisons:

#### Quick Setup in Google Colab

```python
# 1. Install dependencies (run this first)
!pip install datasets scikit-learn joblib requests numpy

# 2. Download the training script
!wget https://raw.githubusercontent.com/your-repo/sonar_core_1/main/train.py

# 3. Import and use the training functions
from train import train_notebook, train_model
```

#### Compare All Available Models

```python
# Compare all 9 advanced algorithms on VNTC dataset
from train import train_notebook

# Full comparison (may take 10-15 minutes)
results = train_notebook(
    dataset="vntc",
    compare=True
)

# Quick comparison with subset of data (recommended for Colab)
results = train_notebook(
    dataset="vntc",
    compare=True,
    num_rows=1000  # Use 1000 samples for faster training
)
```

#### Compare Specific Models

```python
# Compare just the best performing algorithms
from train import train_all_configurations

results = train_all_configurations(
    dataset="vntc",
    models=["logistic", "svc_linear", "random_forest", "naive_bayes"],
    num_rows=2000
)

# Compare SVM variants
results = train_all_configurations(
    dataset="vntc",
    models=["svc_linear", "svc_rbf", "logistic"],
    num_rows=1000
)
```

#### Training Individual Models

```python
# Train specific models with custom parameters
result = train_notebook(
    dataset="vntc",  # or "uts2017" for banking dataset
    model_name="random_forest",
    max_features=20000,
    ngram_min=1,
    ngram_max=2,
    num_rows=5000  # Use subset for faster training
)

print(f"Test accuracy: {result['test_accuracy']:.4f}")
print(f"Training time: {result['train_time']:.2f}s")
```

#### Available Models for Comparison

```python
# All 9 available algorithms:
models = [
    "logistic",        # Logistic Regression
    "svc_linear",      # SVM with Linear Kernel
    "svc_rbf",         # SVM with RBF Kernel
    "naive_bayes",     # Multinomial Naive Bayes
    "decision_tree",   # Decision Tree
    "random_forest",   # Random Forest
    "gradient_boost",  # Gradient Boosting
    "ada_boost",       # AdaBoost
    "mlp"              # Multi-layer Perceptron
]

# Compare all models with small dataset for quick results
results = train_all_configurations(
    dataset="vntc",
    models=models,
    num_rows=500  # Very fast for testing
)
```

#### Analyzing Results in Colab

```python
# Create analysis script inline
analysis_code = '''
import json
from pathlib import Path

def analyze_results():
    results = []
    runs_dir = Path("runs")

    for run_dir in runs_dir.glob("*/"):
        metadata_path = run_dir / "metadata.json"
        if metadata_path.exists():
            with open(metadata_path) as f:
                data = json.load(f)
            results.append({
                'model': data.get('classifier', 'Unknown'),
                'test_accuracy': data.get('test_accuracy', 0),
                'train_time': data.get('train_time', 0),
                'max_features': data.get('max_features', 0)
            })

    # Sort by accuracy
    results.sort(key=lambda x: x['test_accuracy'], reverse=True)

    print("\\nModel Comparison Results:")
    print("-" * 60)
    print(f"{'Model':<20} {'Test Acc':<10} {'Train Time':<12} {'Features':<10}")
    print("-" * 60)

    for r in results:
        model = r['model'][:18]
        acc = f"{r['test_accuracy']:.4f}"
        time = f"{r['train_time']:.1f}s"
        feat = f"{r['max_features']//1000}k" if r['max_features'] > 0 else "N/A"
        print(f"{model:<20} {acc:<10} {time:<12} {feat:<10}")

    return results

# Run analysis
results = analyze_results()
'''

# Execute the analysis
exec(analysis_code)
```

#### Recommended Colab Workflow

```python
# Step 1: Quick test with small data to verify setup
train_notebook(dataset="vntc", model_name="logistic", num_rows=100)

# Step 2: Compare fast algorithms with moderate data
train_all_configurations(
    dataset="vntc",
    models=["logistic", "naive_bayes", "decision_tree"],
    num_rows=1000
)

# Step 3: Compare best performers with larger data
train_all_configurations(
    dataset="vntc",
    models=["logistic", "svc_linear", "random_forest"],
    num_rows=5000
)

# Step 4: Full comparison (if you have time)
train_all_configurations(dataset="vntc", num_rows=10000)
```

## Available Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `dataset` | str | "uts2017" | Dataset: "vntc" (news) or "uts2017" (banking) |
| `model_name` | str | "logistic" | Model type (see Available Models below) |
| `max_features` | int | 20000 | Maximum number of TF-IDF features |
| `ngram_range` | tuple | (1, 2) | N-gram range for feature extraction |
| `split_ratio` | float | 0.2 | Test set ratio (0.2 = 20% test data) |
| `num_rows` | int | None | Limit samples for quick testing |
| `compare` | bool | False | Compare multiple model configurations |

### Available Models

| Model Name | Algorithm | Description |
|------------|-----------|-------------|
| `logistic` | Logistic Regression | Fast, highly effective for text classification |
| `svc_linear` | SVM Linear | Support Vector Machine with linear kernel |
| `svc_rbf` | SVM RBF | Support Vector Machine with RBF kernel |
| `naive_bayes` | Multinomial NB | Naive Bayes classifier for text |
| `decision_tree` | Decision Tree | Tree-based classifier |
| `random_forest` | Random Forest | Ensemble of decision trees |
| `gradient_boost` | Gradient Boosting | Advanced boosting algorithm |
| `ada_boost` | AdaBoost | Adaptive boosting algorithm |
| `mlp` | Neural Network | Multi-layer perceptron |

## Datasets

The project supports two Vietnamese text classification datasets:

### VNTC Dataset (Vietnamese News Classification)

- **Categories**: 10 news categories (politics, sports, health, etc.)
- **Training samples**: 33,759
- **Test samples**: 50,373
- **Language**: Vietnamese
- **Best accuracy**: 92.33% with Logistic Regression

### UTS2017_Bank Dataset (Vietnamese Banking Classification)

- **Categories**: 14 banking service categories
- **Training samples**: 1,581
- **Test samples**: 396
- **Language**: Vietnamese
- **Best accuracy**: 70.96% with Logistic Regression

Banking categories include:

- ACCOUNT, CARD, CUSTOMER_SUPPORT, DISCOUNT
- INTEREST_RATE, INTERNET_BANKING, LOAN, MONEY_TRANSFER
- OTHER, PAYMENT, PROMOTION, SAVING, SECURITY, TRADEMARK

Both datasets are automatically downloaded from Hugging Face on first use.

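
The UTS2017_Bank sizes listed above (1,581 training and 396 test samples) are consistent with a `split_ratio` of 0.2 applied to 1,977 samples. The sketch below works through that arithmetic. It assumes the split rounds the test share up, as scikit-learn's `train_test_split` does for a float `test_size`; the actual splitting code is not shown in this document, and `split_sizes` is a hypothetical helper, not part of `train.py`.

```python
import math
from typing import Tuple


def split_sizes(n_samples: int, split_ratio: float) -> Tuple[int, int]:
    """Return (n_train, n_test), rounding the test share up.

    Hypothetical helper: mirrors the ceil-based rounding that
    scikit-learn's train_test_split applies to a float test_size.
    """
    n_test = math.ceil(n_samples * split_ratio)
    return n_samples - n_test, n_test


# UTS2017_Bank: 1,581 + 396 = 1,977 samples in total
n_train, n_test = split_sizes(1581 + 396, 0.2)
print(n_train, n_test)  # 1581 396
```

The same helper gives a quick estimate of how many test samples a `num_rows` subset would leave, for example `split_sizes(1000, 0.2)`.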
## Model Architecture

```
Text Input (Vietnamese)
    ↓
CountVectorizer (n-gram term counts)
    ↓
TfidfTransformer (IDF weighting)
    ↓
Classifier (LogisticRegression or SVM)
    ↓
Predicted Category
```

## Output Structure

After training, the following files are created:

```
runs/
  /
    training.log        # Detailed training logs
    metadata.json       # Training configuration and results
    models/
      model.pkl         # Main trained model
      .pkl              # Model with configuration name
      labels.txt        # List of all labels
```

## Performance

With default parameters (Logistic Regression, 20k features, unigrams and bigrams) on the UTS2017_Bank dataset:

- **Training samples**: 1,581
- **Test samples**: 396
- **Test accuracy**: ~71%
- **Training time**: <1 second

## API Reference

### Core Functions

#### `train_model()`

Trains a single model with specified parameters.

```python
def train_model(
    model_name="logistic",
    max_features=20000,
    ngram_range=(1, 2),
    split_ratio=0.2,
    n_samples=None
) -> dict
```

Returns a dictionary containing:

- `timestamp`: Training run timestamp
- `config_name`: Configuration identifier
- `train_accuracy`: Training set accuracy
- `test_accuracy`: Test set accuracy
- `train_time`: Training duration
- `classification_report`: Detailed per-class metrics
- `confusion_matrix`: Confusion matrix

#### `train_all_configurations()`

Trains multiple model configurations for comparison.

```python
def train_all_configurations() -> list
```

Returns a list of training results, one per configuration.

#### `train_notebook()`

Convenience wrapper for notebook environments.

```python
def train_notebook(
    model_name="logistic",
    max_features=20000,
    ngram_min=1,
    ngram_max=2,
    split_ratio=0.2,
    n_samples=None,
    compare=False
) -> dict
```

## Loading Trained Models

```python
import joblib

# Load the model
model = joblib.load("runs//models/model.pkl")

# Make predictions
text = ["Tôi muốn mở tài khoản ngân hàng"]  # "I want to open a bank account"
prediction = model.predict(text)
print(f"Predicted category: {prediction[0]}")

# Get prediction probabilities
# (only for classifiers that implement predict_proba, e.g. Logistic Regression)
probabilities = model.predict_proba(text)
```

## Development

### Project Structure

```
sonar_core_1/
  train.py          # Main training script
  run_train.py      # Simple runner script
  pyproject.toml    # Project configuration
  uv.lock           # Locked dependencies
  runs/             # Training outputs
  DEVELOPERS.md     # This file
```

### Adding New Features

1. **New Classifiers**: Add to the `get_available_models()` function
2. **Feature Engineering**: Modify the pipeline in `train_model()`
3. **Metrics**: Extend the metadata dictionary with new metrics

### Testing

Run quick tests with limited samples:

```bash
# Test with 100 samples on VNTC dataset
uv run python train.py --dataset vntc --num-rows 100

# Test specific model on banking dataset
uv run python train.py --dataset uts2017 --model svc_linear --num-rows 200

# Quick comparison of top models
uv run python train.py --compare-models logistic svc_linear random_forest --compare-dataset vntc --num-rows 500
```

### Debugging

For detailed logs, check the `runs//training.log` file after each training run.

## Common Issues

### Memory Issues

If you run out of memory, reduce `max_features` or use a smaller `num_rows`. For Colab, try `num_rows=1000`.

### Google Colab Compatibility

The script automatically handles Colab's kernel arguments. Use `train_notebook()` for the best experience, and always install dependencies first.

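
Notebook kernels start Python with extra command-line arguments (for example `-f` followed by a connection file), which a strict `argparse` parser rejects. One common way to tolerate them is `parse_known_args`. Whether `train.py` does exactly this is an assumption; `parse_cli` and the subset of flags below are illustrative only.

```python
import argparse
from typing import List, Optional


def parse_cli(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse a subset of the training flags, ignoring unknown arguments.

    Hypothetical helper, not taken from train.py.
    """
    parser = argparse.ArgumentParser(description="Illustrative training CLI")
    parser.add_argument("--dataset", default="uts2017", choices=["vntc", "uts2017"])
    parser.add_argument("--model", default="logistic")
    parser.add_argument("--num-rows", type=int, default=None)
    # parse_known_args returns (namespace, leftovers) instead of exiting
    # on arguments the parser does not recognise, such as a kernel's -f flag.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulate the argument list a notebook kernel might pass
args = parse_cli(["-f", "/tmp/kernel-123.json", "--dataset", "vntc", "--num-rows", "1000"])
print(args.dataset, args.model, args.num_rows)  # vntc logistic 1000
```

The trade-off of this approach is that misspelled flags are silently ignored rather than reported, so a stricter `parse_args` is usually preferable outside notebooks.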
### Slow Training

- **SVM (svc_rbf)** and **Neural Networks (mlp)** are slower than the other models
- **Random Forest** and **Gradient Boosting** can be slow with large `num_rows`
- For quick comparisons, set `num_rows` between 500 and 1000
- **Logistic Regression** and **Naive Bayes** are the fastest choices for initial testing

### Model Selection Tips

- **Logistic Regression**: Best reported accuracy on both datasets (92.33% on VNTC, 70.96% on UTS2017_Bank)
- **SVM Linear**: Good alternative to Logistic Regression
- **Random Forest**: Good for ensemble learning, interpretable
- **Naive Bayes**: Very fast, good baseline
- **Neural Networks**: May overfit on small datasets

## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Run tests with small samples
5. Submit a pull request

## License

[Add license information here]

## Support

For issues or questions, please open an issue on the GitHub repository.