Lmdb Dataset Format Conversion Tool

In PaddleOCR's text recognition tasks, two dataset formats are supported: SimpleDataset (also called the ICDAR2015 dataset format) and LmdbDataset.

When your training dataset is very large, SimpleDataset (ICDAR2015 format) performs a huge number of small file reads, which lowers system I/O throughput and can slow down training dramatically.

To solve this problem, I wrote a Python script that converts the SimpleDataset format to the LmdbDataset format.

Python package dependencies

  • lmdb
  • opencv-python
  • numpy

You can also use pip install -r requirements.txt to install those packages.

How to use

This project contains a single executable Python script, make_lmdb.py, so it is very easy to use.

Parameter descriptions:

Args:
    --data_root_dir: A directory that contains all images or all image subdirectories
                     (e.g. train_data).
    --label_file_paths: Txt files that store the image paths (relative to ${data_root_dir})
                        and labels. If you have more than one txt file, separate them with
                        spaces (e.g. label1.txt label2.txt label3.txt).
    --delimiter: Delimiter used in ${label_file_paths} to separate the image path from the
                 image label; only 'blank' and 'tab' are supported. By default 'tab' (\t) is
                 used, and \t is also the recommended delimiter (see the example label file
                 below).
    --lmdb_out_dir: Output lmdb directory.
    --check: If set, every image is checked for validity and invalid images are discarded,
             giving a cleaner lmdb dataset at the cost of conversion speed.

Demo

For example, if the training set has the following file structure:

|-train_data
  |-rec
    |- gt_label1.txt
    |- gt_label2.txt
    |- gt_label3.txt
    |- train
        |- word_001.png
        |- word_002.jpg
        |- word_003.jpg
        | ...

We can use the following command to generate the lmdb dataset:

python3 make_lmdb.py \
    --data_root_dir train_data \
    --label_file_paths train_data/rec/gt_label1.txt train_data/rec/gt_label2.txt train_data/rec/gt_label3.txt \
    --delimiter tab \
    --lmdb_out_dir ${output lmdb dir you specified}
In text recognition tasks the training data is usually very large and a slow conversion is hard to tolerate, so it is generally better not to pass the --check parameter.
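For readers who want to understand or adapt the conversion, the following is a minimal, simplified sketch of the kind of loop the script performs, not the actual make_lmdb.py code. It assumes the key scheme that PaddleOCR's lmdb recognition reader expects (image-%09d, label-%09d and num-samples); the helper function names here are hypothetical:

import os

import cv2
import lmdb
import numpy as np


def image_is_valid(img_bin):
    # Decode the raw bytes; treat empty or undecodable images as invalid.
    if img_bin is None or len(img_bin) == 0:
        return False
    img = cv2.imdecode(np.frombuffer(img_bin, dtype=np.uint8), cv2.IMREAD_COLOR)
    return img is not None and img.shape[0] > 0 and img.shape[1] > 0


def write_cache(env, cache):
    # Commit a batch of key/value pairs in a single write transaction.
    with env.begin(write=True) as txn:
        for k, v in cache.items():
            txn.put(k.encode('utf-8'), v)


def convert_to_lmdb(data_root_dir, label_file_paths, lmdb_out_dir,
                    delimiter='\t', check=False):
    os.makedirs(lmdb_out_dir, exist_ok=True)
    # map_size is an upper bound on the database size (1 TB here).
    env = lmdb.open(lmdb_out_dir, map_size=1099511627776)
    cache, count = {}, 0
    for label_file in label_file_paths:
        with open(label_file, 'r', encoding='utf-8') as f:
            for line in f:
                line = line.rstrip('\n')
                if not line:
                    continue
                img_rel_path, label = line.split(delimiter, 1)
                with open(os.path.join(data_root_dir, img_rel_path), 'rb') as img_f:
                    img_bin = img_f.read()
                if check and not image_is_valid(img_bin):
                    continue  # drop unreadable images when --check is set
                count += 1
                cache['image-%09d' % count] = img_bin
                cache['label-%09d' % count] = label.encode('utf-8')
                if count % 1000 == 0:
                    write_cache(env, cache)  # flush in batches to bound memory
                    cache = {}
    cache['num-samples'] = str(count).encode('utf-8')
    write_cache(env, cache)
    env.close()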
