| --- |
| datasets: |
| - Souha-BH/DetectingRiskyHealthBehaviorsInTikTokVideos |
| --- |
| # Model Card for Multimodal Risk Behavior Detection Model |
|
|
| ## Model Overview |
| The Multimodal Risk Behavior Detection Model is designed to detect risky health behaviors in TikTok videos. |
| It uses both visual and textual features from each video to classify whether the video portrays risky health behaviors such as smoking, alcohol consumption, or unhealthy eating habits. |
| The model integrates two pre-trained architectures: BERT for text feature extraction and ResNet50 for video frame analysis, combining their outputs to make predictions. |
|
|
| ## Training Data |
| The model was trained on the "Detecting Risky Health Behaviors in TikTok Videos" dataset: https://huggingface.co/datasets/Souha-BH/DetectingRiskyHealthBehaviorsInTikTokVideos. |
| This dataset includes video metadata, captions, and video clips, which are labeled as either risky or non-risky. The data was collected using the Apify TikTok Hashtag Scraper and annotated for risky health behaviors. |
| The model uses the text column (captions) and the corresponding video files from the dataset to extract text and visual features. |
|
|
| - Training-Validation-Test Split: The dataset was split into training, validation, and test sets using the "Split" column. |
|   - Training set: Used to train the model's parameters. |
|   - Validation set: Used to tune hyperparameters and avoid overfitting. |
|   - Test set: Used to evaluate the final performance of the model. |
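
The split described above can be sketched as a simple grouping on the "Split" column. This is a minimal illustration, not the author's loading code: the split values (`train`, `val`, `test`), the `label` column name, and the `partition_by_split` helper are assumptions made for the example.

```python
from collections import defaultdict


def partition_by_split(rows, split_column="Split"):
    """Group metadata rows by the value found in their split column."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[split_column]].append(row)
    return dict(partitions)


# Hypothetical rows; the real split values and label column may differ.
rows = [
    {"text": "caption a", "label": 1, "Split": "train"},
    {"text": "caption b", "label": 0, "Split": "train"},
    {"text": "caption c", "label": 1, "Split": "val"},
    {"text": "caption d", "label": 0, "Split": "test"},
]

parts = partition_by_split(rows)
print({name: len(items) for name, items in parts.items()})
```

Grouping on the stored column, rather than re-splitting at random, keeps the test set identical across runs.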
|
|
| ## Model Architecture |
| The Multimodal Risk Behavior Detection Model follows a multimodal approach that integrates both textual and visual modalities. |
|
|
| - Textual Features: Extracted using BERT (bert-base-uncased), with tokenized video captions passed through BERT's transformer layers. |
| - Visual Features: Extracted using ResNet50, where frames from each TikTok video are resized and processed to generate high-level visual embeddings. |
| - Feature Fusion: The embeddings from BERT and ResNet50 are concatenated and passed through a series of fully connected layers with ReLU activations and dropout regularization to prevent overfitting. |
| - Classification Layer: The final layer is a single-unit sigmoid layer that outputs a probability between 0 and 1, with 0.5 as the threshold for classification. |
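
The fusion and classification steps listed above might look roughly like the following PyTorch sketch. It is not the released implementation: the `FusionHead` name, the hidden size of 512, the dropout rate of 0.3, and the use of a single fully connected hidden layer are assumptions, and random tensors stand in for the BERT and ResNet50 embeddings. The 768 and 2048 dimensions are the standard output sizes of `bert-base-uncased` and of ResNet50's pooled features.

```python
import torch
from torch import nn


class FusionHead(nn.Module):
    """Concatenate text and visual embeddings, then classify with a sigmoid unit."""

    def __init__(self, text_dim=768, visual_dim=2048, hidden_dim=512, dropout=0.3):
        super().__init__()
        # hidden_dim and dropout are illustrative values, not taken from the model card.
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + visual_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, text_emb, visual_emb):
        fused = torch.cat([text_emb, visual_emb], dim=1)  # shape: (batch, 768 + 2048)
        return self.classifier(fused).squeeze(1)          # shape: (batch,)


head = FusionHead().eval()
text_emb = torch.randn(4, 768)     # stand-in for BERT caption embeddings
visual_emb = torch.randn(4, 2048)  # stand-in for per-video ResNet50 embeddings
with torch.no_grad():
    probs = head(text_emb, visual_emb)
preds = (probs >= 0.5).long()      # 1 = risky, 0 = non-risky at the 0.5 threshold
print(probs.shape, preds.tolist())
```

How the per-frame ResNet50 features are pooled into one vector per video is not stated in this card; averaging over frames is one common choice.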
|
|
| ### Training Procedure |
|
|
| - Loss Function: Binary Cross-Entropy Loss (BCE) was used to compute the error between predicted probabilities and true labels. |
| - Optimizer: Adam optimizer with a learning rate of 2e-5. |
| - Batch Size: 4 video samples per batch. |
| - Epochs: The model was trained for 5 epochs. |
| - Video Frame Limit: 10 frames were sampled from each video to reduce computational overhead. |
| - Resizing and Normalization: Frames were resized to 224x224 and normalized using ImageNet's mean and standard deviation. |
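
The frame limit and normalization steps above can be illustrated with a small dependency-free sketch. The card does not say how the 10 frames are chosen, so evenly spaced sampling is an assumption here, and both helper names are hypothetical. The mean and standard deviation values are the standard ImageNet statistics.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def sample_frame_indices(total_frames, num_samples=10):
    """Pick up to num_samples evenly spaced frame indices (assumed strategy)."""
    if total_frames <= num_samples:
        return list(range(total_frames))
    step = total_frames / num_samples
    return [int(i * step) for i in range(num_samples)]


def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple(
        (channel / 255.0 - mean) / std
        for channel, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


print(sample_frame_indices(300))
print(normalize_pixel((255, 255, 255)))
```

In practice the same normalization is usually applied to whole frame tensors at once, after resizing to 224x224.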
|
| ### Evaluation Metrics |
| The model was evaluated on the test set using the following metrics: |
|
|
| - Accuracy: Measures the proportion of correct predictions. |
| - Precision: Measures how many of the predicted "risky" videos were actually risky. |
| - Recall: Measures how many of the actual risky videos were correctly identified. |
| - F1 Score: The harmonic mean of precision and recall, balancing both metrics. |
| - ROC-AUC: Measures the area under the ROC curve, showing the model's ability to distinguish between risky and non-risky videos. |
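
For reference, the metrics above can be computed from labels and predicted probabilities as in the sketch below. The function names and the toy labels and scores are illustrative only and are not taken from the model's evaluation code. ROC-AUC is computed here through its pairwise interpretation: the fraction of (risky, non-risky) pairs in which the risky video receives the higher score.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = risky)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true, y_score):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data for illustration only.
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
y_pred = [int(s >= 0.5) for s in y_score]

metrics = binary_metrics(y_true, y_pred)
auc = roc_auc(y_true, y_score)
print(metrics, auc)
```

Unlike the other four metrics, ROC-AUC does not depend on the 0.5 threshold, because it uses the raw scores.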
|
|
| ### Model Performance |
| After training for 5 epochs, the model's performance on the test set was as follows: |
|
|
| - Accuracy: 63.33% |
| - Precision: 55.00% |
| - Recall: 84.62% |
| - F1 Score: 66.67% |
| - ROC-AUC: 65.84% |
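
As a quick consistency check, the reported F1 score follows from the reported precision and recall. The confusion counts in the second half of the sketch are an inference, not figures given in this card: they are one set of counts that happens to reproduce the reported accuracy, precision, and recall.

```python
# Reported test-set figures from the list above.
precision, recall = 0.5500, 0.8462
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 from reported precision and recall: {f1:.2%}")

# Hypothetical confusion counts (inferred, not reported) that match the figures.
tp, fp, fn, tn = 11, 9, 2, 8
accuracy_check = (tp + tn) / (tp + fp + fn + tn)
precision_check = tp / (tp + fp)
recall_check = tp / (tp + fn)
print(f"accuracy={accuracy_check:.2%} precision={precision_check:.2%} recall={recall_check:.2%}")
```

If those counts are close to the real ones, the test set is small, so each metric can shift by several points when a single prediction changes.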
|
|
| ### Usage |
|
|
| - Input: A TikTok video and its corresponding caption. |
| - Output: A probability score indicating the likelihood that the video depicts a risky health behavior. |
|
|
| ### Limitations |
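| - Data Balance: If the dataset is imbalanced (more non-risky videos than risky ones), precision may suffer. The reported test results, with 55.00% precision against 84.62% recall, show that the model flags many non-risky videos as risky. |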
| - Contextual Understanding: The model relies heavily on textual captions. If a caption does not explicitly describe risky behavior, the model may underperform. |
| - Limited Frame Sampling: Only 10 frames per video are processed, which may miss important content, especially for longer videos. |