Project for the Kaggle competition "LLM - Detect AI Generated Text": https://www.kaggle.com/competitions/llm-detect-ai-generated-text
AI-Generated Text Detection using BERT is a project that detects AI-generated text segments within a given dataset. It uses BERT (Bidirectional Encoder Representations from Transformers) to distinguish human-authored content from machine-generated text, with the aim of supporting cybersecurity and integrity in digital communications.
The project follows a structured workflow:
- Data Preprocessing: Cleaning and preprocessing textual data to remove noise, stop words, punctuation, and non-alphabetic characters using BERT-preprocess.
- Additional Datasets: Collecting various datasets from competitions and concatenating them to increase the number of training instances. This step enhances the model's ability to identify features and patterns effectively.
- Model Training: Utilizing a BERT-based sequence classification model to train the system to distinguish between human and AI-generated text segments accurately.
- Predictions: Generating predictions on test data to highlight potential AI-generated content segments.
- Result Analysis: Saving the results in a CSV file for submission and further analysis.
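The workflow above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code. The function names, the tiny stop-word list, the `text`/`generated` and `id`/`generated` column names, the duplicate filtering, and the convention that label 1 means "AI-generated" are all assumptions. Training is omitted: `score_texts` assumes a fine-tuned checkpoint already exists in a hypothetical `model_dir`, and it imports the Hugging Face `transformers` library lazily so the data-handling helpers run without it.

```python
import csv
import io
import re

# Illustrative stop-word list (assumption); a real run would use a fuller list.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "it"}


def clean_text(text):
    """Lowercase, drop non-alphabetic characters, and remove stop words."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(w for w in text.split() if w not in STOP_WORDS)


def concat_datasets(*datasets):
    """Merge several lists of {"text", "generated"} rows into one training set.

    Skipping exact duplicates after cleaning is an added assumption,
    not something the workflow above states.
    """
    seen, merged = set(), []
    for rows in datasets:
        for row in rows:
            cleaned = clean_text(row["text"])
            if cleaned and cleaned not in seen:
                seen.add(cleaned)
                merged.append({"text": cleaned, "generated": int(row["generated"])})
    return merged


def score_texts(texts, model_dir):
    """Return P(AI-generated) per text from a fine-tuned checkpoint.

    Assumes a two-label sequence classifier saved in model_dir where
    label index 1 means "AI-generated" (hypothetical convention).
    """
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    model.eval()
    enc = tokenizer(texts, truncation=True, padding=True,
                    max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    return torch.softmax(logits, dim=-1)[:, 1].tolist()


def write_submission(ids, scores, fileobj):
    """Write an id,generated CSV to an open text file object."""
    writer = csv.writer(fileobj)
    writer.writerow(["id", "generated"])
    for essay_id, score in zip(ids, scores):
        writer.writerow([essay_id, f"{score:.4f}"])
```

In use, the merged rows would feed the training step, and the output of `score_texts` on the test essays would be passed to `write_submission` together with the essay ids.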
The project includes an in-depth analysis of how BERT detects AI-generated text, covering features such as semantic differences, vocabulary usage, statistical distributions, and sentiment analysis measures. The analysis also examines black-box detection algorithms for AI text detection to clarify which underlying mechanisms separate human-written from AI-generated content.
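As a rough illustration of the vocabulary-usage and statistical-distribution features mentioned above, the sketch below computes a few simple surface statistics for a text. The specific features (type-token ratio, sentence-length mean and spread) and the function name are assumptions chosen for illustration; they are not taken from the project's analysis.

```python
import re
from statistics import mean, pstdev


def vocabulary_features(text):
    """Compute simple surface statistics for one text (illustrative only)."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return {"type_token_ratio": 0.0,
                "mean_sentence_length": 0.0,
                "sentence_length_std": 0.0}
    # Split on sentence-ending punctuation and count words per sentence.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(re.findall(r"[a-z']+", s.lower())) for s in sentences]
    return {
        # Share of distinct words: lower values mean more repetition.
        "type_token_ratio": len(set(words)) / len(words),
        "mean_sentence_length": mean(lengths),
        # Population standard deviation of sentence lengths.
        "sentence_length_std": pstdev(lengths),
    }
```

Comparing such statistics between the human-written and AI-generated groups is one simple way to look for the kind of differences the analysis describes.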
The project addresses edge cases and potential anomalies in AI-generated text detection. Detailed explanations and possible solutions for edge cases are provided, enhancing the model's robustness and accuracy.
The project highlights notable points and findings, including observations on the differences between human-authored and AI-generated content. Insights from research papers and analysis provide valuable information for understanding and addressing challenges in AI text detection.
A summary of the project's results and findings is presented, including model performance, leaderboard (LB) scores, and recommendations for further analysis. Comparing the effectiveness of different models and techniques contributes to research in AI text detection.
Various research papers and resources are referenced for further analysis and exploration of AI text detection. These references provide valuable insights and perspectives for continued research and development in the field.
Kairvee Vaswani