
Onboarding docs with contribution guidelines #115

Draft: wants to merge 11 commits into base: main
1 change: 1 addition & 0 deletions docs-website/.gitignore
@@ -7,6 +7,7 @@
# Generated files
.docusaurus
.cache-loader
.yarn

# Misc
.DS_Store
2 changes: 1 addition & 1 deletion docs-website/docs/acknowledgements.md
@@ -1,5 +1,5 @@
---
- sidebar_position: 7
+ sidebar_position: 8
---

# 🙏 Acknowledgements
2 changes: 1 addition & 1 deletion docs-website/docs/api.md
@@ -1,5 +1,5 @@
---
- sidebar_position: 6
+ sidebar_position: 7
---

# 📄 API Documentation
4 changes: 4 additions & 0 deletions docs-website/docs/contributing/_category_.json
@@ -0,0 +1,4 @@
{
"label": "🤝 Contributing",
"position": 6
}
27 changes: 27 additions & 0 deletions docs-website/docs/contributing/api-integration.md
@@ -0,0 +1,27 @@
---
sidebar_position: 3
---

# 🧩 API integration

**If you'd like to add the Kindly API to a project you're working on, get in touch with us!**

There is currently one available endpoint for Kindly: `/detect`.
For a more in-depth look, check out our [API docs](../api.md).

## New to integrating APIs into your own project? Here's some guidance!

You'll be accessing the `/detect` endpoint to check whether some input text has cyberbullying intent.
To integrate Kindly into your project you will need a couple of things to get started:
- text input functionality with a trigger that sends the text as a request to an endpoint
  - see the web client code for an example of this on the [index page](https://github.com/unicef/kindly/blob/main/client/pages/index.vue)
- input text in JSON format, for example:
  ```json
  {
    "text": "this movie is great"
  }
  ```
- authorization headers and/or token keys

Test basic functionality and integration with your project by setting up the API in your local development environment. Follow the instructions in the [development docs](../technical/development.md) to get the API up and running. You won't need the web client; you can just send requests to the endpoint on your local environment.
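
As a concrete illustration, here is a minimal sketch of such a request in Python. The local address and the `Authorization` header shown here are assumptions, not confirmed requirements of the API; adjust them to match your setup and credentials.

```python
# Minimal sketch: send input text to a locally running Kindly API's /detect endpoint.
# The base URL and the auth header are assumptions; substitute your own values.
import requests

API_URL = "http://localhost:8080/detect"            # assumed local development address
payload = {"text": "this movie is great"}           # the JSON payload described above
headers = {"Authorization": "Bearer <your-token>"}  # placeholder credentials

response = requests.post(API_URL, json=payload, headers=headers)
response.raise_for_status()
print(response.json())  # the API's prediction for the submitted text
```

In your own project, the same call would be wired to whatever trigger sends the user's text, with the response used to decide how to handle the message.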
11 changes: 11 additions & 0 deletions docs-website/docs/contributing/api-maintenance.md
@@ -0,0 +1,11 @@
---
sidebar_position: 2
---

# 🛠 Project contributions
Keep an eye on our GitHub repos for [Kindly](https://github.com/unicef/kindly) and the [Kindly website](https://github.com/unicef/kindly-website) for any issues that come up which you think you can help with. Remember to comment so we can assign the issue to you!

Check out the [development docs](../technical/development) to see how to set up a development environment so you can work on the Kindly repo.
Our backend is set up as an API, so you can also make calls to it in Postman or with `curl`. We also have a [website](https://kindly.unicef.io) where you can check phrases with Kindly; alternatively, once you've set up the development environment you can use the [web client](../overview#web-client) to test phrases.

## Contribution Guidelines
Collaborator comment from the PR author:
@nathanfletcher We can maybe add the guidelines at this point? Or even a link if we're doing them somewhere else.

17 changes: 17 additions & 0 deletions docs-website/docs/contributing/overview.md
@@ -0,0 +1,17 @@
---
sidebar_position: 1
---

# 🤝 Contributing

There are multiple ways to contribute to, and work with, the Kindly project:

## Training Data
We are collecting training data from children on the [Kindly website](https://kindly.unicef.io/contribute). These contributions, classified as either `offensive` or `non-offensive`, will be added to the [training dataset](../ml-model/training-data) to improve the model's accuracy. See the [training data docs](../ml-model/training-data) for a more technical overview. See our [Privacy and Terms of Use](https://kindly.unicef.io/legal) page for information on privacy, how we use data, and our terms of use.

## [Project Maintenance](project-maintenance)
Kindly is built as an open-source solution and is a certified [DPG](https://digitalpublicgoods.net/registry/kindly.html). We are on [GitHub](https://github.com/unicef/kindly) and we welcome your help, whatever your skills. Check out [this page](project-maintenance) to learn more about our contribution guidelines.

## [API Integration](api-integration)
Integrate the Kindly API with your own program.
The [API docs](../api.md) contain all the available endpoints.
2 changes: 1 addition & 1 deletion docs-website/docs/ml-model/build-from-scratch.md
@@ -6,7 +6,7 @@ sidebar_position: 3

Multiple different strategies can be used to build a machine learning model from scratch:

- "Bag-of-words" is one simple strategy. Refer to this Google CodeLab that walks you thorugh the process of building a simple model from scratch.
- "Bag-of-words" is one simple strategy. Refer to this Google CodeLab that walks you through the process of building a simple model from scratch.

- A model based on the TF-IDF Vectorizer strategy was contributed to the project by community member Emmanuel Djaba. This new base model would be our second approach to growing Kindly's own model using traditional machine learning, until such time as the strategy has to be changed due to new information.

10 changes: 5 additions & 5 deletions docs-website/docs/ml-model/training-data.md
@@ -10,7 +10,7 @@ In Machine Learning (ML) and Natural Language Processing (NLP), data is sorted into
- Validation dataset (text + labels)
- Testing dataset (text + label)

- Kindly is currently compiling a training dataset to both [tranfer learning](prebuilt#transfer-learning) from the current model (see [cardiffnlp/twitter-roberta-base-offensive](https://github.com/cardiffnlp/tweeteval)) using data submitted by children, and [to build a new model from scratch](./build-from-scratch).
+ Kindly is currently compiling a training dataset, using data submitted by children, both for [transfer learning](prebuilt#transfer-learning) from the current model (see [cardiffnlp/twitter-roberta-base-offensive](https://github.com/cardiffnlp/tweeteval)) and for [building a new model from scratch](./build-from-scratch).

Once data has been compiled, it will be split into the three groups mentioned above to create the model. You can refer to the notebook [`KindlyModel.ipynb`](https://github.com/unicef/kindly/blob/main/modeling/KindlyModel.ipynb) to see how this split is currently being done with the initial dataset.

@@ -22,7 +22,7 @@ This initial data was extracted from its original JSON format and put into two

The data is tagged in the file `modeling/dataset/offensive_train_labels.txt` so the model can learn which submissions are offensive and which are not. The numbers `1` and `0` in the labels file correspond to `offensive` and `non-offensive` respectively.
The number on each line in the file directly corresponds to the sentence or phrase in each line of the training text file `offensive_train_text.txt`.
- These numerical tags will be validated as more data comes in through the *Data Colection* process mentioned below.
+ These numerical tags will be validated as more data comes in through the *Data Collection* process mentioned below.
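
As an illustration of how the two files line up, here is a minimal Python sketch that pairs each sentence with its label. The file paths follow the layout described above; adjust them if your checkout differs.

```python
# Minimal sketch: pair each training sentence with its label,
# where 1 = offensive and 0 = non-offensive, matched line by line.
with open("modeling/dataset/offensive_train_text.txt", encoding="utf-8") as texts, \
     open("modeling/dataset/offensive_train_labels.txt", encoding="utf-8") as labels:
    pairs = [(text.strip(), int(label)) for text, label in zip(texts, labels)]

for text, label in pairs[:5]:
    print("offensive" if label == 1 else "non-offensive", "-", text)
```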

## Data Collection

@@ -31,8 +31,8 @@ request training data from children. The form simulates a hypothetical chat inte

![Data Collection](/img/ml-model/data-collection.svg)

- Every time the child submits text through the form, data is both stored and submitted for analysis by the current model. The model then returns its prediction on whether it detects cyberbulling intent, and the child can validate whether the prediction is acurate or not. The validation from the user is apended to the initial data entry.
+ Every time the child submits text through the form, data is both stored and submitted for analysis by the current model. The model then returns its prediction on whether it detects cyberbullying intent, and the child can validate whether the prediction is accurate or not. The validation from the user is appended to the initial data entry.

- No data is collected from the user other than the text that that the child enters in the form and their validation on whether the preditcion is accurate or not. The data submission is completely anonymous.
+ No data is collected from the user other than the text that the child enters in the form and their validation on whether the prediction is accurate or not. The data submission is completely anonymous.

- Data submitted by children is stored pending review by UNICEF staff. The human review validates that the data is relevant and anonymizes it by removing any personal identifiable information (PII) that the user may have submited inadvertently or by mistake (e.g. replacing proper nouns with a generic reference such as "*@user*"). The data cleansing includes the removal of individual names, physical or email addresses, school names, phone numbers or any other personal or device identifiers, as well as location data. At the discretion of UNICEF staff, if we can think of any other data (or combinations thereof) that may lead back to a child, then that will also be removed. After the new data has been reviewed, it is then added to Kindly's [open dataset](https://github.com/unicef/kindly/tree/main/modeling/dataset). This open dataset is used to periodically re-train and improve the existing machine learning model.
+ Data submitted by children is stored pending review by UNICEF staff. The human review validates that the data is relevant and anonymizes it by removing any personally identifiable information (PII) that the user may have submitted inadvertently or by mistake (e.g. replacing proper nouns with a generic reference such as "*@user*"). The data cleansing includes the removal of individual names, physical or email addresses, school names, phone numbers or any other personal or device identifiers, as well as location data. At the discretion of UNICEF staff, if we can think of any other data (or combinations thereof) that may lead back to a child, then that will also be removed. After the new data has been reviewed, it is then added to Kindly's [open dataset](https://github.com/unicef/kindly/tree/main/modeling/dataset). This open dataset is used to periodically re-train and improve the existing machine learning model.
32 changes: 32 additions & 0 deletions docs-website/docs/overview.md
@@ -0,0 +1,32 @@
---
sidebar_position: 2
---

# 🗺 Overview

This is a broad overview of the Kindly App and how all the components fit together. There are currently 2 GitHub repos associated with Kindly: the [Kindly](https://github.com/unicef/kindly) repo, which contains the API, web client and ML training server, and the [Kindly Website](https://github.com/unicef/kindly-website) repo, which contains the front-facing [website](https://kindly.unicef.io) with project details, contact info, and the contribution form.

## Kindly
The [Kindly API repo](https://github.com/unicef/kindly) contains the [API](https://github.com/unicef/kindly/tree/main/api), the [web client](https://github.com/unicef/kindly/tree/main/client), and the [ML training server](https://github.com/unicef/kindly/tree/main/modeling). It also houses the [docs website](https://github.com/unicef/kindly/tree/main/docs-website). More information on how they fit together is below.

### API
The [API](https://github.com/unicef/kindly/tree/main/api) includes the [application](https://github.com/unicef/kindly/tree/main/api/api.py), where the endpoints and functions are found, as well as [unit tests](https://github.com/unicef/kindly/blob/main/api/test_api.py) which cover the endpoints.
There are currently 3 endpoints, with 1 more for ML training yet to be completed. The `/` route returns the API glossary listing the functional and relevant endpoints. The `/test-ui` endpoint renders the [`index.html` template](https://github.com/unicef/kindly/tree/main/api/templates) for testing. The `/detect` endpoint is used to process input text.

#### Tests
The [unit tests](https://github.com/unicef/kindly/blob/main/api/test_api.py) are set up using pytest and run with the [`main.yml`](https://github.com/unicef/kindly/blob/main/.github/workflows/main.yml) workflow. They target the API URL `localhost:8080` and test for success, 403, 404 and 400 errors (see the [development docs](technical/development#testing) for a detailed outline).
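
As a rough illustration of that setup (not the project's actual `test_api.py`), pytest-style checks against a local instance might look like the sketch below; the `localhost:8080` address comes from the description above, while the payload and status-code expectations are assumptions.

```python
# Illustrative pytest-style sketch, assuming the API is already running on localhost:8080.
import requests

BASE_URL = "http://localhost:8080"

def test_detect_returns_success():
    resp = requests.post(f"{BASE_URL}/detect", json={"text": "this movie is great"})
    # May instead be 403 if your local instance requires authorization headers.
    assert resp.status_code == 200

def test_unknown_route_returns_404():
    resp = requests.get(f"{BASE_URL}/no-such-route")
    assert resp.status_code == 404
```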

### Web Client
The [web client](https://github.com/unicef/kindly/tree/main/client) is a basic frontend used only for demo and development purposes when working on the API; it is never used in production. It contains a page to [test](https://github.com/unicef/kindly/blob/main/client/pages/index.vue) the API and a page to [contribute](https://github.com/unicef/kindly/blob/main/client/pages/contribute.vue), although the latter is not functioning at the moment.

### ML training server
To learn more about how the machine learning training server works, see the [ML Model docs](ml-model/overview). Kindly is currently using a [prebuilt](ml-model/prebuilt) model for detecting toxic messaging, the [twitter-roberta-base-offensive](https://huggingface.co/cardiffnlp/twitter-roberta-base-offensive) model. We are concurrently collecting new training data for [transfer learning](ml-model/prebuilt#transfer-learning), where we use an existing model to validate new training data while adapting the results to be relevant to what we need for our model; in this case, to be relevant for language used by children and young people.
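
As a sketch of the transfer-learning starting point (illustrative only; see the ML Model docs and `KindlyModel.ipynb` for the project's actual approach), the prebuilt model can be loaded and queried with the Hugging Face `transformers` library before any fine-tuning on Kindly's own data:

```python
# Minimal sketch: load the prebuilt offensive-language model and score a phrase.
# Fine-tuning on Kindly's collected data would start from this same checkpoint.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "cardiffnlp/twitter-roberta-base-offensive"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

inputs = tokenizer("this movie is great", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)[0]
# Label order follows the TweetEval offensive task: 0 = not-offensive, 1 = offensive.
print({"not-offensive": float(probs[0]), "offensive": float(probs[1])})
```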

## Kindly Website
The [Kindly Website repo](https://github.com/unicef/kindly-website), found in the [UNICEF GitHub organisation](https://github.com/unicef), is where the development code is stored. However, the Kindly website is actually hosted through Cloudflare, with the fork created by Victor at [lacabra/kindly-website](https://github.com/lacabra/kindly-website). The workflow [`push-downstream.yml`](https://github.com/unicef/kindly-website/blob/main/.github/workflows/push-downstream.yml) is a continuous integration solution that pushes any changes committed on unicef/kindly-website downstream to lacabra/kindly-website.

### The Contribution Form
The [Kindly form](https://github.com/unicef/kindly-website/blob/main/src/components/KindlyForm.js) on the [contribution page](https://github.com/unicef/kindly-website/blob/main/src/Contribute.js) is linked to the Kindly API and Google Sheets to collect training data. When someone enters a phrase into the form, the input text is passed to the [`/detect` endpoint](https://github.com/unicef/kindly/blob/6a39f09eec60f8f3d0c0809e35aa9352075e46ca/api/api.py#L50) of the API and the results are returned; the user is then prompted to confirm whether the phrase has been correctly classified as cyberbullying or not. The phrase and the intent are then passed through to the [Google Training Sheet](https://github.com/unicef/kindly/blob/main/scripts/OutputFile.gs) as `formData` in [`handleFeedback`](https://github.com/unicef/kindly-website/blob/0556e79a3b1393a55df68e46cd663990a4d40b91/src/components/KindlyForm.js#L70).

### Testing Kindly Functionality
Users can test Kindly on the main page using the [Kindly form](https://github.com/unicef/kindly-website/blob/main/src/components/KindlyForm.js), which is linked only to the [`/detect` endpoint](https://github.com/unicef/kindly/blob/6a39f09eec60f8f3d0c0809e35aa9352075e46ca/api/api.py#L50) of the API. This will return a result indicating whether the submitted phrase is considered cyberbullying or not.
2 changes: 1 addition & 1 deletion docs-website/docs/research.md
@@ -5,7 +5,7 @@ sidebar_position: 3
# 📚 Benchmarking Research

The following projects have also leveraged different forms of technology to combat cyberbullying. These served as additional inspiration as we aimed to create an open-source, community-driven product.
- * [Perspective](https://www.perspectiveapi.com/) is a free API that uses machine learning to identify "toxic" comments. Perspective returns a percentage that represents the likelihood that someone will perceive the text as toxic. Perspecive requires users to register in order to access the API. It also requires users to have a Google account and Google Cloud project to authenticate API requests. Currently, there is no fee to use it but in the future, increases to QPS may incur a fee ([Source](https://developers.perspectiveapi.com/s/)).
+ * [Perspective](https://www.perspectiveapi.com/) is a free API that uses machine learning to identify "toxic" comments. Perspective returns a percentage that represents the likelihood that someone will perceive the text as toxic. Perspective requires users to register in order to access the API. It also requires users to have a Google account and Google Cloud project to authenticate API requests. Currently, there is no fee to use it but in the future, increases to QPS may incur a fee ([Source](https://developers.perspectiveapi.com/s/)).
* AS Tracking by [STEER](https://steer.global/en) is an AI solution that compares the online psychological test results provided by students with its psychological model to flag which students may need more attention and support. This is a commercial product that aims to sell to various school groups.
* [@dhavalpotdar](https://github.com/dhavalpotdar/cyberbullying-detection/commits?author=dhavalpotdar) created a [project](https://github.com/dhavalpotdar/cyberbullying-detection) on GitHub to detect cyberbullying in tweets using ML Classification Algorithms. However, this project is not active as the last commit was made 2 years ago.
* Academic researchers from Ghent University, University of Antwerp, and University of Cape Town published a research paper focusing on automatic cyberbullying detection in social media text by modelling posts written by bullies, victims, and bystanders of online bullying (Van Hee, Cynthia et al. “Automatic detection of cyberbullying in social media text.” PloS one vol. 13,10 e0203794. 8 Oct. 2018, [doi:10.1371/journal.pone.0203794](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0203794)).