Traditional Chinese Handwriting Dataset

Preface

On the path of data science, we believe every scholar and scientist has heard of the MNIST dataset (handwritten digits), and perhaps played with Fashion MNIST. As Traditional Chinese users, we couldn't help but wonder: can machine learning and neural networks also recognize handwritten Traditional Chinese characters? Let's take on the challenge!

Description

The original dataset was produced with Tegaki, an open-source package. It covers 13,065 distinct Chinese characters, with an average of about 50 samples per character.
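To check the per-character sample count on an extracted copy, a minimal sketch along these lines may help. It assumes one subfolder per character directly under the dataset root and a `.png` extension; both are assumptions about the layout, and the function names are hypothetical.

```python
from collections import Counter
from pathlib import Path


def samples_per_character(root):
    """Count image files in each character folder directly under root.

    Assumes a <root>/<character>/<image>.png layout (not confirmed by the README).
    """
    counts = Counter()
    for image_path in Path(root).glob("*/*.png"):
        counts[image_path.parent.name] += 1
    return counts


def summarize(counts):
    """Return (number of characters, number of images, mean images per character)."""
    total = sum(counts.values())
    mean = total / len(counts) if counts else 0.0
    return len(counts), total, mean
```

With the figures quoted for the whole dataset (684,677 images over 13,065 characters), the mean works out to roughly 52.4 samples per character, consistent with the "about 50" stated above.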

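Updates

2021.04.17 Derived application: web-based model training and handwriting recognition
2021.04.14 (Not directly related) Trend Micro T-brain E.SUN AI Challenge 2021 Summer: Traditional Chinese scene text recognition competition
2020.09.03 Released the whole dataset (13,065 characters; image size: 300x300 pixels; 684,677 images in total)
2020.04.29 Shared a convolutional neural network handwriting recognition implementation built on the Traditional Chinese handwriting dataset (thanks to Dr. Yen-Lin for the enthusiastic contribution)
2020.04.21 Added a dataset deployment example (thanks to Dr. Yen-Lin for the enthusiastic contribution)
2020.04.20 Uploaded the first dataset (4,803 common characters; image size: 50x50 pixels; 250,712 images in total) (cf. the Ministry of Education's list of 4,808 common characters)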

Data samples

Whole dataset: per-character sample folders
Handwritten samples of "自由" (freedom)

Usage

1. Whole dataset (13,065 characters)

git clone https://github.com/chenkenanalytic/handwritting_data_all.git

cat (file_path)/all_data.zip* > (file_path)/all_data.zip

unzip -O big5 (file_path)/all_data.zip -d (output_path)

※ Replace (file_path) and (output_path) with your actual file locations. After extraction the folder is named cleaned_data and contains 684,677 images.
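For environments without `cat` and `unzip`, the same two steps can be sketched in Python. This is an illustration under stated assumptions, not the project's own tooling: the function names are hypothetical, and the archive is assumed to store its filenames in Big5 without the ZIP UTF-8 flag, which is what the `-O big5` option above suggests.

```python
import shutil
import zipfile
from pathlib import Path


def join_parts(parts, dest):
    """Concatenate the split archive parts, in the given order, into one file."""
    with open(dest, "wb") as out:
        for part in parts:
            with open(part, "rb") as chunk:
                shutil.copyfileobj(chunk, out)


def big5_name(info):
    """Recover a Big5 filename from a ZipInfo entry.

    zipfile decodes names as CP437 unless the entry's UTF-8 flag (0x800) is
    set, so undo that decoding and reinterpret the raw bytes as Big5.
    """
    if info.flag_bits & 0x800:
        return info.filename
    return info.filename.encode("cp437").decode("big5")


def extract_big5(zip_path, output_dir):
    """Extract every entry under its Big5-decoded name (trusted archives only)."""
    output_dir = Path(output_dir)
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            target = output_dir / big5_name(info)
            if info.is_dir():
                target.mkdir(parents=True, exist_ok=True)
                continue
            target.parent.mkdir(parents=True, exist_ok=True)
            with archive.open(info) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
```

A typical call would pass `sorted(Path(file_path).glob("all_data.zip*"))` as `parts` and write the joined archive under a different name, so that the output file is never matched by the glob on a second run.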

Whole dataset: deployment

Colab example code

2. Common-character dataset (4,803 characters)

git clone https://github.com/AI-FREE-Team/Traditional-Chinese-Handwriting-Dataset.git

※ After downloading the common-character dataset, extract the four files in the data folder. The extracted folder is named cleaned_data(50_50) and contains 250,712 images.
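The extraction step can be scripted as in the sketch below. It assumes the four files in the data folder are independent `.zip` archives and that the images use a `.png` extension; neither is stated in this README, and the function names are hypothetical.

```python
import zipfile
from pathlib import Path


def extract_all_archives(data_dir, output_dir):
    """Extract every .zip found in data_dir into output_dir.

    Returns the archive names that were processed, in sorted order.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive_path in sorted(Path(data_dir).glob("*.zip")):
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(output_dir)
        extracted.append(archive_path.name)
    return extracted


def count_images(root):
    """Count .png files anywhere below root."""
    return sum(1 for _ in Path(root).rglob("*.png"))
```

After extraction, comparing `count_images(output_dir)` against the 250,712 images quoted above is a quick way to spot an incomplete download.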

Common-character dataset: deployment (thanks to Dr. Yen-Lin for the enthusiastic contribution)

Colab example code

Local example code

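Issues

Because the common-character dataset was downscaled to 50x50 pixels, some images have unclear or overlapping strokes. (The whole dataset, at 300x300 pixels, is largely free of this problem.)
~~When the whole-dataset deployment example is extracted on Colab, the Chinese filenames appear garbled.~~ (Issue solved; see issue #1, credit to ling199104.)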
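Why strokes merge at 50x50 can be illustrated with a toy example. The sketch below shrinks a 300x300 grayscale image by averaging 6x6 blocks; this is an assumed stand-in, since the README does not say which resampling method produced the 50x50 images, and the function name is hypothetical.

```python
def downsample_block_mean(image, factor):
    """Shrink a grayscale image (list of rows) by averaging factor x factor blocks."""
    height, width = len(image), len(image[0])
    if height % factor or width % factor:
        raise ValueError("image dimensions must be divisible by factor")
    small = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            block = [image[y][x]
                     for y in range(top, top + factor)
                     for x in range(left, left + factor)]
            row.append(sum(block) / len(block))
        small.append(row)
    return small


# Two 2-pixel-wide dark strokes (value 0) on a white (255) 300x300 canvas,
# separated by a 2-pixel white gap at columns 102-103.
canvas = [[255] * 300 for _ in range(300)]
for y in range(300):
    for x in (100, 101, 104, 105):
        canvas[y][x] = 0

small = downsample_block_mean(canvas, 6)
# Columns 16 and 17 of the 50x50 result are now the same mid-gray and sit
# side by side: the white gap between the two strokes is no longer visible.
print(small[0][15], small[0][16], small[0][17], small[0][18])
```

In the full-size canvas the two strokes are clearly separate; after downsampling they fall into adjacent output pixels with identical gray values, which is the kind of overlap described in the first issue.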

Handwritten Chinese Character Recognition

Repo Introduction

Applies the Traditional-Chinese-Handwriting-Dataset to handwriting recognition with a CNN model.

If you are interested in further details of this implementation, please refer to this article.
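Before any classifier can be trained, each character needs an integer class id and the samples need a train/validation split. The sketch below shows one way to prepare both with the standard library only. It is not taken from the referenced implementation; the function names, the 20% validation fraction, and the example paths are assumptions for illustration.

```python
import random
from collections import defaultdict


def build_label_index(characters):
    """Map each distinct character to an integer class id, in sorted order."""
    return {char: idx for idx, char in enumerate(sorted(set(characters)))}


def stratified_split(samples, validation_fraction=0.2, seed=0):
    """Split (path, character) pairs per character.

    Every character with at least two samples contributes at least one
    sample to the validation set; single-sample characters stay in training.
    """
    by_char = defaultdict(list)
    for path, char in samples:
        by_char[char].append(path)
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, validation = [], []
    for char in sorted(by_char):
        paths = sorted(by_char[char])
        rng.shuffle(paths)
        n_val = 0
        if len(paths) > 1:
            n_val = max(1, round(len(paths) * validation_fraction))
        validation += [(p, char) for p in paths[:n_val]]
        train += [(p, char) for p in paths[n_val:]]
    return train, validation


# Hypothetical sample list: five images each for two characters.
samples = [(f"{char}_{i}.png", char) for char in ("自", "由") for i in range(5)]
label_index = build_label_index(char for _, char in samples)
train, validation = stratified_split(samples)
```

Splitting per character rather than over the whole list keeps rare characters from ending up entirely in one subset, which matters with several thousand classes and only about 50 samples each.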

Project Application: web-based model training and handwriting recognition

This application was developed from the week 2 assignment of Browser-based Models with TensorFlow.js, the first course of DeepLearning.AI's TensorFlow: Data and Deployment Specialization on Coursera.

If you are interested in this project, please refer to this article.

License

(CC BY-NC-SA 4.0)
This dataset is released under the Attribution-NonCommercial-ShareAlike 4.0 International license.

※ When using, adapting, or sharing the dataset, please include the following information:

This dataset was adapted and developed by AI . FREE Team from [STUST EECS_Chinese MNIST(總集)]. If you use, modify, or share it, please cite the source and include this message.
(source: https://github.com/AI-FREE-Team/Traditional-Chinese-Handwriting-Dataset )

Citing

@misc{AI.FREE2020,
  author = {Po-Chuan Chen},
  title = {Traditional Chinese Handwriting Dataset},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/AI-FREE-Team/Traditional-Chinese-Handwriting-Dataset}},
}

Source

Original dataset source: https://scidm.nchc.org.tw/dataset/stusteecs_chinese_mnist

Introduction video: https://www.youtube.com/watch?v=eJy1BtkqHX4

Description: This dataset was developed and modified from the Chinese handwriting dataset provided by the Dept. of EECS, Southern Taiwan University of Science and Technology.
