Select a pre-trained model or fine-tune a sentiment analysis model. #2329

Closed · Tracked by #2073
DonnieBLT opened this issue Jun 13, 2024 · 14 comments

@DonnieBLT (Collaborator)

No description provided.

@DonnieBLT DonnieBLT mentioned this issue Jun 13, 2024
@github-project-automation github-project-automation bot moved this to Backlog in 📌 All Jun 13, 2024
@DonnieBLT DonnieBLT moved this from Backlog to Ready in 📌 All Jun 13, 2024
@Uttkarsh-raj (Contributor)

I think we can go with DistilBERT, which is maintained by Hugging Face and has many advantages over the models generally used for sentiment analysis.

DistilBERT

The DistilBERT model was proposed in the blog post Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT, and in the paper DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. DistilBERT is a small, fast, cheap, and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than google-bert/bert-base-uncased and runs 60% faster, while preserving over 95% of BERT's performance as measured on the GLUE language understanding benchmark.
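For a quick feel of what this looks like in practice, here is a minimal sketch using the transformers pipeline with the standard SST-2 fine-tuned DistilBERT checkpoint (the example sentence is made up):

```python
from transformers import pipeline

# Off-the-shelf DistilBERT fine-tuned on SST-2; downloads on first use.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

print(classifier("The app crashes every time I open the camera."))
# e.g. [{'label': 'NEGATIVE', 'score': 0.99}]
```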

BERT

Google developed BERT as a bidirectional transformer model that examines words in text using both left-to-right and right-to-left context. It helps computer systems understand text, as opposed to generating text, which is what GPT models are made for. BERT excels at NLU tasks, including sentiment analysis, and is well suited to applications such as search queries and customer-feedback analysis.

How is BERT different from GPT?

GPT models differ from BERT in both their objectives and their use cases. GPT models are a form of generative AI that produces original text and other content; they are also well suited to summarizing long or hard-to-interpret text. The two model families differ not only in scope and applications but also in architecture.

@Sarthak5598 (Member)

I have two issues with this:

First, it requires a powerful PC or a paid server to train and use the model, as free tiers can't handle the load.

Recommended Specifications for Better Performance:

  • CPU: High-end multi-core processor (e.g., Intel i9 or AMD Ryzen 9)
  • RAM: 32 GB or more
  • Storage: SSD with 256 GB or more to accommodate large datasets and multiple versions of models

The second issue is: why use something complex when a simple machine-learning project could suffice? The functionality is very basic, and with just logistic regression I achieved 92% accuracy. By stacking multiple machine-learning algorithms, the accuracy could easily reach 98% or even higher.
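For reference, a minimal sketch of the baseline being described, assuming texts and labels are placeholder lists of report texts and their sentiment labels:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# texts / labels are placeholders for the report texts and sentiment labels.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)

# TF-IDF features + logistic regression: the simple baseline in question.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```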

Please share your thoughts on this
@Uttkarsh-raj @DonnieBLT @arkid15r @AtmegaBuzz

@Uttkarsh-raj (Contributor)

This is something we should definitely look into. As for the training part, Google Colab does provide the resources to train the model. I have trained a model this way and was able to achieve around 90% accuracy, though the accuracy depends entirely on the dataset selected. DistilBERT can understand the context of a sentence, so if it encounters a new word or arrangement of words it can still interpret it, which is not possible with a trained regression model. Also, once trained, the model doesn't need to be retrained on every request. We could see if there is a hosting platform available at minimal cost, since we'd have to host the model somewhere anyway, but I guess we could try hosting it on the same server where the backend is currently hosted. Would definitely like the mentors' opinion on this, though.
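For anyone who wants to reproduce this on Colab, a rough sketch of fine-tuning DistilBERT with the Hugging Face Trainer; the CSV file names, column names, and hyperparameters are placeholders, not necessarily what was used here:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Placeholder CSVs with "text" and "label" columns.
ds = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
ds = ds.map(lambda batch: tok(batch["text"], truncation=True), batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="out",
        num_train_epochs=2,
        per_device_train_batch_size=16,
    ),
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
    tokenizer=tok,  # enables dynamic padding via the default data collator
)
trainer.train()
print(trainer.evaluate())  # eval loss; add compute_metrics for accuracy
```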

@AtmegaBuzz (Collaborator)

True, running an LLM just for simpler tasks will consume a lot of resources. Try TensorFlow or pre-built traditional models from GitHub.

@Sarthak5598 (Member)

As an update, I increased the dataset to almost 5k samples; I think focusing on the dataset is really important. Beyond that, I tried stacking ensemble learning (one of the three common ways to combine multiple models), and with the right base models I think we can do a lot better and wouldn't need to buy servers or anything. Training only needs to be done once here too: we can use the joblib library to save the model and reuse it (see the sketch below).
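A minimal sketch of that idea, assuming X_train and y_train already hold the vectorized features and labels; the base models are illustrative picks, not necessarily the ones used here:

```python
import joblib
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Illustrative base models; X_train / y_train are placeholders
# (e.g. TF-IDF features and sentiment labels).
stack = StackingClassifier(
    estimators=[
        ("nb", MultinomialNB()),
        ("svm", LinearSVC()),
        ("rf", RandomForestClassifier(n_estimators=200)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X_train, y_train)

# Train once, persist, then reload at request time instead of retraining.
joblib.dump(stack, "sentiment_stack.joblib")
model = joblib.load("sentiment_stack.joblib")
```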

@Sarthak5598 (Member)

[screenshot: training results showing ~96% accuracy]
I have achieved almost 96% accuracy, but we need to work on improving the dataset.

@Uttkarsh-raj (Contributor)

I was trying to create a dataset out of the current issues on the server, but there are some problems with this:

  • There are too many issues to extract, and doing this manually is not the best approach.
  • Too many of the issues are about the BLT app itself, filed repeatedly by anonymous users.
  • The issue titles and descriptions are not the kind of text the model can be trained on.

So the problem of the dataset still remains. @DonnieBLT, what should we do regarding this?

@Sarthak5598 (Member)

If you want to use that approach, the best idea would be web scraping, but not directly the way you're doing it. Instead, we can offer users all the label options from now on, and within a month or two we will have a good dataset that can be used for training. The question is whether it will be enough: I created a dataset of about 5k and it's still not enough; we would need around 10k. My dataset also isn't that good, since it was generated by GPT.

@Sarthak5598 (Member)

How about we add the labeling option as I suggested: after we collect a dataset of about 2k we can start using it, and we keep storing future bug reports and adding them to the dataset. Once we find the dataset and the model good enough, we can label the earlier bugs too, which would give us a big dataset. Not sure if this is the best approach.

@Uttkarsh-raj (Contributor)

If you want to use that approach, the best idea would be web scraping, but not directly the way you're doing it. Instead, we can offer users all the label options from now on, and within a month or two we will have a good dataset that can be used for training. The question is whether it will be enough: I created a dataset of about 5k and it's still not enough; we would need around 10k. My dataset also isn't that good, since it was generated by GPT.

I would have been in favor of this, but currently the bugs being reported are mostly from anonymous users and mostly about the BLT app itself, and I'm not sure of the real traffic on the application right now. We also can't rely on web scraping, since some sites use proxies and other anti-scraping protections that can break the scraper. We can't rely only on the bugs reported on BLT for the dataset; we need other sources too. I tried looking for such a dataset on Kaggle, but with no success.

@Sarthak5598 (Member)

There are no such datasets online, and if that's the case then we will have to use GPT for it, at least for now. One thing you could look into is Jira; if you can find a dataset from it, that would be more than enough. (Research what Jira is first.)

@Uttkarsh-raj (Contributor)

I have worked with Jira before, but I didn't know it provided a dataset. Thanks for the info, but I could only find this:
https://zenodo.org/records/5901804#:~:text=Description,using%20the%20Jira%20API%20V2.

@Sarthak5598 (Member)

I don't think they share any dataset officially. It was just an idea; if we can get our hands on any dataset, it would be helpful.

@DonnieBLT DonnieBLT added this to the GSOC 2024 milestone Aug 2, 2024
@Uttkarsh-raj (Contributor)

I think we can close this issue since we are going with GPT to generate the fields and other models.

@github-project-automation github-project-automation bot moved this from Ready to Done in 📌 All Aug 6, 2024