Kindly refer to this page for all the projects being undertaken by CLiPS in GSoC 2018.
Architecture and Setup Instructions: Click Here
Usage Guide: Click Here
GSoC Project Page: Click Here
Text anonymization is the processing of text to strip it of identifying attributes, thereby hiding sensitive details and protecting the identity of users.
This project consists of two principal parts: entity/identifier recognition and the subsequent anonymization. First, sensitive chunks of text are identified using several approaches, including Named Entity Recognition, regular-expression-based pattern matching, and TF-IDF-based rare token detection. Once identified, the sensitive attributes are suppressed, generalized, or deleted/replaced. Approaches to generalization include word-vector-based obfuscation and the use of part holonyms.
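To make the two-stage flow concrete, here is a minimal sketch of the regular-expression route only. It is not the project's implementation: the `PATTERNS` and `ACTIONS` dictionaries, the `anonymize` function, and the action names `replace`, `suppress`, and `delete` are all hypothetical names chosen for illustration, and the patterns are deliberately simplistic.

```python
import re

# Hypothetical patterns for illustration; real-world patterns would need to be
# broader and locale-aware.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}

# Hypothetical mapping from attribute to the action applied to each match.
ACTIONS = {"EMAIL": "replace", "PHONE": "suppress"}


def anonymize(text, patterns=PATTERNS, actions=ACTIONS):
    """Detect sensitive spans with regexes, then apply the configured action."""
    for label, pattern in patterns.items():
        action = actions.get(label, "suppress")  # default to masking
        if action == "replace":
            # Swap the match for a placeholder naming the attribute.
            text = pattern.sub(f"<{label}>", text)
        elif action == "suppress":
            # Mask every character of the match, keeping its length.
            text = pattern.sub(lambda m: "*" * len(m.group()), text)
        elif action == "delete":
            # Drop the match entirely.
            text = pattern.sub("", text)
    return text


if __name__ == "__main__":
    print(anonymize("Mail jane.doe@example.com or call +32 470 12 34 56."))
```

Keeping detection (the patterns) separate from the action mapping mirrors the split described above, so the same detected attribute can be handled differently without touching the recognizers.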
This system is built on top of a Django web app. It provides a dashboard where users can map each attribute to the appropriate action and configure it. It also offers accessibility features such as RESTful API endpoints for anonymization, a token-level anonymization detail API, and both GUI-based and API-based anonymization of uploaded files.
This system will provide a seamless, end-to-end solution for the text anonymization needs of a firm or an individual user.