Image Folder Organizer using Google Cloud Vision OCR (Python)
- Setting Up Google Cloud: https://cloud.google.com/vision/docs/setup
- Setting Up Google Cloud Vision API: https://cloud.google.com/vision/docs/labels
- Setting Up sklearn: https://scikit-learn.org/stable/install.html
- Setting Up PyDictionary: https://pypi.org/project/PyDictionary/
- Setting Up nltk (For Lemmatization): https://www.nltk.org/
This is the response JSON returned by Google Cloud Vision after a label-detection request for a picture. Note that score and topicality currently come back with the same value, which is a known issue (https://issuetracker.google.com/issues/117855698?pli=1).
```json
{
  "responses": [
    {
      "labelAnnotations": [
        {
          "mid": "/m/0199g",
          "description": "Bicycle",
          "score": 0.96705616,
          "topicality": 0.96705616
        },
        {
          "mid": "/m/0h9mv",
          "description": "Tire",
          "score": 0.9641615,
          "topicality": 0.9641615
        }
      ]
    }
  ]
}
```
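For illustration, the top-3 labels by topicality can be extracted from a response like the one above with plain Python. This is a sketch over the raw JSON shape; the project itself reads these fields from the Vision client's response objects instead:

```python
# Sample response in the shape returned by Vision API label detection
response = {
    "responses": [
        {
            "labelAnnotations": [
                {"mid": "/m/0199g", "description": "Bicycle",
                 "score": 0.96705616, "topicality": 0.96705616},
                {"mid": "/m/0h9mv", "description": "Tire",
                 "score": 0.9641615, "topicality": 0.9641615},
            ]
        }
    ]
}

def top_labels(response, n=3):
    """Return the n labels with the highest topicality score."""
    annotations = response["responses"][0]["labelAnnotations"]
    ranked = sorted(annotations, key=lambda a: a["topicality"], reverse=True)
    return [(a["description"], a["topicality"]) for a in ranked[:n]]

print(top_labels(response))  # [('Bicycle', 0.96705616), ('Tire', 0.9641615)]
```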
- tmp: Folder that contains the pictures before clustering
- results: Folder that contains the result pictures
- pictures: Folder that contains the pictures after clustering
- csv: Folder that contains the CSV files of the dataframes used in the project
- FolderCreater.py: Creates a folder for each cluster and moves each image into the folder matching its cluster result
- KMeanClustering.py: Uses scikit-learn's KMeans to create 5 clusters of images
- Lemmatization.py: Uses the nltk WordNetLemmatizer to preprocess (lemmatize) each word (e.g. computer, computing, computerize -> compute)
- Main.py: Creates the dataframe using the labels from the Google Vision API
- OCR.py: Opens the connection to Google Cloud and runs Google Vision API labeling
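The folder-creation step can be sketched roughly as below. This is a minimal illustration, not the actual code in FolderCreater.py: the function name, arguments, and the filename-to-cluster mapping format are my own.

```python
import os
import shutil

def move_to_clusters(src_dir, dst_dir, assignments):
    """Create one folder per cluster under dst_dir and move each image
    from src_dir into the folder of its assigned cluster.

    assignments maps image filename -> cluster number, e.g. {"cat.jpg": 2}.
    """
    for filename, cluster in assignments.items():
        cluster_dir = os.path.join(dst_dir, f"cluster_{cluster}")
        os.makedirs(cluster_dir, exist_ok=True)  # idempotent folder creation
        shutil.move(os.path.join(src_dir, filename),
                    os.path.join(cluster_dir, filename))
```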
In this project, I mainly used the Google Cloud Vision API to extract labels for each picture. The response JSON format is described above, and more detail can be found at https://cloud.google.com/vision/docs/reference/rest/v1/AnnotateImageResponse#EntityAnnotation.
Then, I mainly used
label.description
label.topicality
to get each label's text and its topicality score for the picture. Due to the timing issue, I only used the top 3 labels by topicality score when creating the dataframe (more in Issues).
Using the labels from the Google Vision API, I used PyDictionary, a Python library that provides word definitions, to create an information document for each image.
From the definition documents, I applied lemmatization and TF-IDF to create a vector score for each image. Then, using the vector scores, I ran the K-means algorithm to create the clusters of images.
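The TF-IDF plus K-means pipeline can be sketched as follows. This is a simplified stand-in (it skips the lemmatization step, uses toy documents rather than real PyDictionary output, and uses k=2 instead of the project's k=5), assuming scikit-learn is installed:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# One "definition document" per image (toy examples, not real PyDictionary output)
documents = [
    "bicycle tire wheel vehicle road",
    "tire wheel rubber vehicle",
    "cat animal fur pet whisker",
    "dog animal fur pet tail",
]

# TF-IDF turns each document into a weighted term vector
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(documents)

# K-means groups the vectors into k clusters (the project uses k=5)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(tfidf)
print(labels)  # one cluster id per document
```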
Information about PyDictionary can be found at: https://pypi.org/project/PyDictionary/
Information about nltk lemmatization can be found at: https://www.nltk.org/_modules/nltk/stem/wordnet.html
Information about sklearn KMeans can be found at: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
Information about sklearn TfidfVectorizer can be found at: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
The result dataframe before clustering:
The result dataframe after clustering:
You can find the csv file under resources/csv
Right now, the folders are named with the cluster number. In the future, I will try to extract the main concept words from each cluster and use them as the folder name.
One issue the project currently has is timing. Even though the code's own processing time is relatively low (~2 seconds total), waiting for the API results takes about 3-5 minutes for 12 pictures, so the total clustering time will grow as the number of pictures increases.
Another issue is that the Google Cloud Vision API I am using is the free trial version, which means I cannot use the project after August. Moreover, since it is the free trial version, there is a limit on the number of pictures that can be processed for labeling. Pricing can be found at https://cloud.google.com/vision/pricing.
In a later version, I will drop PyDictionary and use only the labels and their topicality scores for clustering (maybe using cosine similarity): https://en.wikipedia.org/wiki/Cosine_similarity
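As a reference for that future direction, cosine similarity between two topicality vectors can be computed directly. The vectors below are made-up examples over a hypothetical shared label vocabulary:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Topicality vectors over a shared label vocabulary, e.g. [Bicycle, Tire, Cat]
image_a = [0.97, 0.96, 0.0]
image_b = [0.90, 0.88, 0.0]
image_c = [0.0, 0.0, 0.95]

print(cosine_similarity(image_a, image_b))  # close to 1.0: similar images
print(cosine_similarity(image_a, image_c))  # 0.0: no shared labels
```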
- Document Clustering using Sklearn: https://romg2.github.io/mlguide/03_%EB%A8%B8%EC%8B%A0%EB%9F%AC%EB%8B%9D-%EC%99%84%EB%B2%BD%EA%B0%80%EC%9D%B4%EB%93%9C-08.-%ED%85%8D%EC%8A%A4%ED%8A%B8%EB%B6%84%EC%84%9D-%EB%AC%B8%EC%84%9C-%EA%B5%B0%EC%A7%91%ED%99%94/
- Lemmatization: https://en.wikipedia.org/wiki/Lemmatisation
- TF-IDF Example: https://en.wikipedia.org/wiki/Tf%E2%80%93idf