Improve pdf creation: implement --pages-per-dict, support for background images #10

Open
akryukov opened this issue Dec 18, 2009 · 25 comments

Comments

@akryukov

Hi,

I propose a patch which changes jbig2 behavior in two respects. First, the files generated in the "-p" mode now retain their original names, and only the extension is changed (I use ".jbig2", but anything else would be OK). A numerical suffix is added in case of name clashes (or for images which come from multipage TIFF files). For this reason the 'basename' parameter is gone. The reason for this change is that source images may have accompanying files (such as background images previously separated with a scan processing application). In such cases the file names contain useful information which should not be lost during processing/conversion.
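The naming rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the patch's code: the function name `page_name`, the `taken` collection and the four-digit `_NNNN` suffix format are assumptions made for the example.

```python
import os

def page_name(source, taken, multipage=False, ext=".jbig2"):
    """Derive an output name from `source` that is not already in `taken`.

    Hypothetical helper: keeps the original stem, swaps the extension,
    and appends a numeric suffix only when the plain name is taken.
    """
    stem, _ = os.path.splitext(source)
    # Pages from a multipage file get a suffix from the start,
    # so that the first page sorts before the others.
    candidate = stem + ("_0000" if multipage else "") + ext
    if candidate not in taken:
        return candidate
    n = 1
    while True:
        candidate = "%s_%04d%s" % (stem, n, ext)
        if candidate not in taken:
            return candidate
        n += 1
```

For example, `page_name("scan/p1.tif", set())` keeps the stem `scan/p1` and only swaps the extension, while a second input that maps to the same name receives a numbered variant.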

The second change makes it possible to generate more than one symbol dictionary, so that the loading speed for large PDF files can be increased. There is now a new option (-P, --pages-per-dict), which specifies how many pages should be processed in the same pass. The default value for this parameter is 15.
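To make the effect of -P concrete, here is a minimal sketch of how a page list might be split into groups that each share one symbol dictionary. The function name `group_pages` is hypothetical, and treating a non-positive value as "one dictionary for everything" is an assumption of this example.

```python
def group_pages(pages, pages_per_dict=15):
    """Split `pages` into consecutive groups of at most `pages_per_dict`.

    Each group would be compressed in one pass with its own dictionary.
    A value <= 0 is taken to mean a single group (assumption).
    """
    pages = list(pages)
    if not pages:
        return []
    if pages_per_dict <= 0:
        return [pages]
    return [pages[i:i + pages_per_dict]
            for i in range(0, len(pages), pages_per_dict)]
```

With 40 pages and the default of 15, this yields groups of 15, 15 and 10 pages, so a viewer only has to decode the dictionary belonging to the group it is displaying.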

I also propose a modified version of pdf.py, implementing support for background images, which can be combined with the foreground mask in the same pdf file. Several graphical formats (PNG, TIFF, JPEG) are supported. It is possible either to use graphics stripped by jbig2 at the previous stage, or to prepare images separately in a different application, provided that the file names follow the same convention.
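Pairing a foreground mask with its background by file name could look roughly like the sketch below. The `.bg.<ext>` suffixes and the function name `find_background` are assumptions chosen for illustration; the convention actually expected by the modified pdf.py may differ.

```python
import os

def find_background(mask_path, exts=(".bg.png", ".bg.jpg", ".bg.tif")):
    """Return the background image that belongs to `mask_path`, or None.

    Hypothetical convention: the background shares the mask's stem and
    carries one of the suffixes in `exts`.
    """
    stem, _ = os.path.splitext(mask_path)
    for ext in exts:
        candidate = stem + ext
        if os.path.exists(candidate):
            return candidate
    return None
```

Because the lookup depends only on the stem, it works the same whether the background was written by jbig2 or prepared in another application.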

BTW it might be reasonable to rename pdf.py to something more meaningful, so that the script could be safely installed somewhere into the PATH.

The files can be downloaded here:
http://www.thessalonica.org.ru/downloads/jbig2.patch.gz
http://www.thessalonica.org.ru/downloads/pdf.py.gz

@DingoDog

I downloaded your patch and tried to apply it, but I get these errors:

patch -p0 < jbig2.patch

can't find file to patch at input line 4
Perhaps you used the wrong -p or --strip option?

The text leading up to this was:

|diff -ur agl-jbig2enc-git.orig//jbig2.cc agl-jbig2enc-git//jbig2.cc
|--- agl-jbig2enc-git.orig//jbig2.cc 2009-11-05 11:27:45.000000000 +0300

|+++ agl-jbig2enc-git//jbig2.cc 2009-11-07 00:31:39.000000000 +0300

File to patch: jbig2.cc
patching file jbig2.cc
Hunk #1 FAILED at 39.
Hunk #2 FAILED at 191.
Hunk #3 FAILED at 304.
Hunk #4 FAILED at 354.
Hunk #5 FAILED at 393.
Hunk #6 FAILED at 431.
Hunk #7 FAILED at 571.
7 out of 7 hunks FAILED -- saving rejects to file jbig2.cc.rej

@akryukov
Author

That's my fault: the patch was prepared according to my directory tree (i.e. it assumed the unpatched sources had been downloaded into a directory called agl-jbig2enc-git). I have now uploaded a corrected version of the patch to the same location. This version should be placed directly into the directory with the jbig2enc sources before you execute

patch -p0 < jbig2.patch
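The failure with the first version of the patch comes down to how the -p option treats the paths in the diff header: -pN drops N leading directory components before looking for the file. A simplified sketch of that rule, for relative paths only (the helper name `strip_components` is made up for this example, and real patch handles absolute paths and other details differently):

```python
def strip_components(path, n):
    """Roughly mimic `patch -pN` for a relative path.

    Runs of adjacent slashes count as a single separator.
    """
    parts = [p for p in path.split("/") if p]
    return "/".join(parts[n:])
```

With the header `agl-jbig2enc-git//jbig2.cc`, -p0 keeps the directory prefix, which does not exist in a freshly unpacked source tree, while stripping one component leaves plain `jbig2.cc`.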

@DingoDog

Thanks (also for your fonts, especially "Old Standard", which I use).

I applied the patch; this is the output:

patch -p0<jbig2.patch

patching file jbig2.cc
Hunk #1 succeeded at 37 (offset -2 lines).
Hunk #3 succeeded at 302 (offset -2 lines).
Hunk #5 succeeded at 391 (offset -2 lines).
Hunk #6 FAILED at 429.
Hunk #7 succeeded at 572 (offset -1 lines).
1 out of 7 hunks FAILED -- saving rejects to file jbig2.cc.rej

After patching I tried to build, but something is wrong:

make

g++ -c jbig2enc.cc -I../leptonlib-1.58/src -Wall -I/usr/include -L/usr/lib -O3
g++ -c jbig2arith.cc -I../leptonlib-1.58/src -Wall -I/usr/include -L/usr/lib -O3
g++ -c jbig2sym.cc -DUSE_EXT -I../leptonlib-1.58/src -Wall -I/usr/include -L/usr/lib -O3
ar -rcv libjbig2enc.a jbig2enc.o jbig2arith.o jbig2sym.o
a - jbig2enc.o
a - jbig2arith.o
a - jbig2sym.o
g++ -o jbig2 jbig2.cc -L. -ljbig2enc ../leptonlib-1.58/src/liblept.a -I../leptonlib-1.58/src -Wall -I/usr/include -L/usr/lib -O3 -lpng -ljpeg -ltiff -lm
jbig2.cc: In function 'int main(int, char**)':
jbig2.cc:501: warning: format '%s' expects type 'char*', but argument 3 has type 'char* (*)(const char*)throw ()'
jbig2.cc:538: warning: format '%s' expects type 'char*', but argument 3 has type 'char* (*)(const char*)throw ()'
jbig2.cc:553: warning: format '%s' expects type 'char*', but argument 3 has type 'char* (*)(const char*)throw ()'
jbig2.cc:564: error: 'pages_to_compress' was not declared in this scope
jbig2.cc:567: error: 'cnt' was not declared in this scope
jbig2.cc:578: error: 'cnt' was not declared in this scope
jbig2.cc: At global scope:
jbig2.cc:207: warning: 'char* replace_suffix(char*, const char*)' defined but not used
jbig2.cc:220: warning: 'char* get_page_or_dict_name(char**, int, const char*, int)' defined but not used
jbig2.cc:273: warning: 'int is_tiff_format(int)' defined but not used
make: *** [jbig2] Error 1

@akryukov
Author

DingoDog

It looks like you are attempting to patch the wrong version. You should download the most recent sources from git:

git clone git://github.com/agl/jbig2enc.git

@DingoDog

Many thanks for your answer, first of all.

Yes, I tried to apply the patch to jbig2 0.27, downloadable at:

http://github.com/agl/jbig2enc/tarball/0.27

Now I used Git, but it seems to fail:

git clone git://git://github.com/agl/jbig2enc.gitgithub.com/agl/jbig2enc.git

Initialized empty Git repository in /root/NewDir/jbig2enc/.git/
fatal: Unable to look up git (port ) (Servname not supported for ai_socktype)
fetch-pack from 'git://git://github.com/agl/jbig2enc.gitgithub.com/agl/jbig2enc.git' failed.

This command worked instead:

git clone git://github.com/agl/jbig2enc.git src

Initialized empty Git repository in /root/NewDir/src/.git/
remote: Counting objects: 118, done.
remote: Compressing objects: 100% (112/112), done.
Indexing 118 objects...
remote: Total 118 (delta 75), reused 0 (delta 0)
100% (118/118) done
Resolving 75 deltas...
100% (75/75) done


and applying the patch was successful.

I then downloaded the leptonica libs 1.63 and built jbig2enc (not yet tried). I cannot wait to try it! Meanwhile, thanks again for your patch and your answers.

EDIT:

Tried it; it is working. Only, when I use your modified pdf.py, it says:

File "/root/my-applications/bin/thessalonica-pdf.py", line 27, in <module>
    from PIL import Image

So I think I do not have the Python Imaging Library (PIL), is that right? I am currently looking for it but have not found it yet.
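A quick way to confirm that diagnosis is to ask Python whether the module can be found at all. The sketch below assumes a current Python 3 interpreter (the script in this thread was written for Python 2), and `has_module` is just an illustrative helper name. On current systems the `PIL` module is provided by the Pillow package.

```python
import importlib.util

def has_module(name):
    """Return True if the top-level module `name` can be imported."""
    return importlib.util.find_spec(name) is not None

if __name__ == "__main__":
    if has_module("PIL"):
        print("PIL is available")
    else:
        print("PIL is missing; installing the Pillow package provides it")
```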

EDIT:

I built PIL from sources and first ran

jbig2

and then your modified pdf.py

pdf.py *.jbig2 out>test.pdf

but the resulting PDF has b/w images. I thought the pictures would be in color; maybe I did not understand the meaning of your sentence:

"which can be combined with the foreground mask in the same pdf file"

How can this be done (mixing the foreground mask with b/w text)? Excuse my ignorance.

@mistydemeo
Contributor

@akryukov, I recognize this is a very old issue, but wanted to mention I'd consider pulling this in mistydemeo/jbig2enc.

Since you've implemented both the symbol page limiting functionality and the image/text layer functionality in PDFBeads, do you think there's anything significant here that is still worth including directly in jbig2enc? I think the layer functionality is out of scope for pdf.py, since that's really just a simple demo utility - I would rather avoid adding new dependencies to it.

The symbol page feature seems useful, however. Probably still within scope of jbig2enc. If you think it's still relevant, would you mind rebasing your patch on the current master at my fork and submitting a pull request there?

Thanks!

@zdenop
Collaborator

zdenop commented Jun 29, 2012

For http://www.thessalonica.org.ru/downloads/pdf.py.gz I got an error message (File Not Found!).
Can you (or somebody else who has a copy) post this file once again?

@akryukov
Author

On Fri, 29 Jun 2012 06:46:32 -0700
zdenop wrote:

> For http://www.thessalonica.org.ru/downloads/pdf.py.gz I got an error
> message (File Not Found!). Can you (or somebody else who has a copy)
> post this file once again?

It is obsolete and no longer needed: try pdfbeads (which works even with
unpatched jbig2enc) instead.

Regards,
Alexey Kryukov

Moscow State University
Faculty of History

@DingoDog

download from here:

http://ge.tt/7GmUTpJ/v/0

@zdenop
Collaborator

zdenop commented Jun 29, 2012

Thanks - I am aware of pdfbeads and glad it exists.
I just want to evaluate your proposed functionality and probably merge it into my fork of jbig2enc...

BTW: I am not sure if it is a good idea to use the .jbig2 or .jb2 extension for the current jbig2 output. I was not able to read these files with stduviewer (it should be able to open and read jbig2 files). I plan to do more tests on this.
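One possible explanation is that the "-p" output is meant to be embedded in a PDF and so does not start with the JBIG2 file header that standalone viewers look for. Assuming that is the cause, a file can be checked for the 8-byte JBIG2 file ID string as sketched below; the helper name `has_jbig2_file_header` is made up for this example.

```python
# The 8-byte ID string that opens a standalone JBIG2 file.
JBIG2_FILE_ID = bytes([0x97, 0x4A, 0x42, 0x32, 0x0D, 0x0A, 0x1A, 0x0A])

def has_jbig2_file_header(data):
    """Return True if `data` (bytes) starts with the JBIG2 file ID."""
    return data[:8] == JBIG2_FILE_ID
```

Reading the first few bytes of a page file with `open(path, "rb").read(8)` and passing them to this function shows whether a generic viewer has any chance of recognising the file.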

@akryukov
Author

On Fri, 29 Jun 2012 11:48:44 -0700
zdenop wrote:

> BTW: I am not sure if it is a good idea to use the .jbig2 or .jb2
> extension for the current jbig2 output. I was not able to read these files
> with stduviewer (it should be able to open and read jbig2 files). I
> plan to do more tests on this.

Of course you are absolutely right here, but... can you propose another
meaningful extension for those files?

Regards,
Alexey Kryukov

Moscow State University
Faculty of History

@galex751

Hi Mr Kryukov, I'd like to test your patch with the -P parameter, but I'm not able to download it from http://www.thessalonica.org.ru/downloads/jbig2.patch.gz. Could you post the sources somewhere so that they can be downloaded?

Many Thanks
Alessandro

@yb85

yb85 commented Nov 1, 2019

Dear @akryukov,
I am very interested in your patch, as I encounter serious slowdowns on large documents (>100p). Would it be possible to post it online?
thanks
yann

@DingoDog

DingoDog commented May 3, 2020

> Dear @akryukov,
> I am very interested in your patch, as I encounter serious slowdowns on large documents (>100p). Would it be possible to post it online?
> thanks
> yann

Sorry for the delay. I uploaded the patch here:

http://ge.tt/8BllCy23

@DingoDog

DingoDog commented May 3, 2020

> Hi Mr Kryukov, I'd like to test your patch with the -P parameter, but I'm not able to download it from http://www.thessalonica.org.ru/downloads/jbig2.patch.gz. Could you post the sources somewhere so that they can be downloaded?
>
> Many Thanks
> Alessandro

I sent a full pack with the patch and other goodies to the mail address you gave me on the diybookscanner forum.

@useretail

could you guys re-upload the patch please?

@DingoDog

I re-uploaded the patch to my site:

http://dokupuppylinux.info/media/jbig2.patch.zip

@useretail

mirror: https://pastebin.com/raw/WT4TwUxZ

@jaumegs

jaumegs commented Feb 17, 2023

@DingoDog

Would it be possible for you to re-upload the latest version of "pdf.py.gz", as it was before it was obsoleted in favor of PDFBeads?

I'm interested in the other goodies from the diybookscanner forum too...

Thank you.

@DingoDog

> @DingoDog
>
> Would it be possible for you to re-upload the latest version of "pdf.py.gz", as it was before it was obsoleted in favor of PDFBeads?
>
> I'm interested in the other goodies from the diybookscanner forum too...
>
> Thank you.

Sure. Here is the code of the modified pdf.py:


# Note: this is Python 2 code (it uses file() and the print statement).
import sys
import re
import struct
import glob
import os

# This is a very simple script to make a PDF file out of the output of a
# multipage symbol compression.
# Run ./jbig2 -s -p image1.jpeg image1.jpeg ...
# python pdf.py output > out.pdf

class Ref:
    def __init__(self, x):
        self.x = x

    def __str__(self):
        return "%d 0 R" % self.x

class Dict:
    def __init__(self, values = {}):
        self.d = {}
        self.d.update(values)

    def __str__(self):
        s = ['<< ']
        for (x, y) in self.d.items():
            s.append('/%s ' % x)
            s.append(str(y))
            s.append("\n")
        s.append(">>\n")

        return ''.join(s)

global_next_id = 1

class Obj:
    next_id = 1

    def __init__(self, d = {}, stream = None):
        global global_next_id

        if stream is not None:
            d['Length'] = str(len(stream))
        self.d = Dict(d)
        self.stream = stream
        self.id = global_next_id
        global_next_id += 1

    def __str__(self):
        s = []
        s.append(str(self.d))
        if self.stream is not None:
            s.append('stream\n')
            s.append(self.stream)
            s.append('\nendstream\n')
        s.append('endobj\n')

        return ''.join(s)

class Doc:
    def __init__(self):
        self.objs = []
        self.pages = []

    def add_object(self, o):
        self.objs.append(o)
        return o

    def add_page(self, o):
        self.pages.append(o)
        return self.add_object(o)

    def __str__(self):
        a = []
        j = [0]       # running byte offset, boxed so that add() can update it
        offsets = []  # byte offset of each object, for the xref table

        def add(x):
            a.append(x)
            j[0] += len(x) + 1
        add('%PDF-1.4')
        for o in self.objs:
            offsets.append(j[0])
            add('%d 0 obj' % o.id)
            add(str(o))
        xrefstart = j[0]
        a.append('xref')
        a.append('0 %d' % (len(offsets) + 1))
        a.append('0000000000 65535 f ')
        for o in offsets:
            a.append('%010d 00000 n ' % o)
        a.append('')
        a.append('trailer')
        a.append('<< /Size %d\n/Root 1 0 R >>' % (len(offsets) + 1))
        a.append('startxref')
        a.append(str(xrefstart))
        a.append('%%EOF')

        # sys.stderr.write(str(offsets) + "\n")

        return '\n'.join(a)

def ref(x):
    return '%d 0 R' % x

def main(symboltable='symboltable', pagefiles=glob.glob('page-*')):
    doc = Doc()
    doc.add_object(Obj({'Type' : '/Catalog', 'Outlines' : ref(2), 'Pages' : ref(3)}))
    doc.add_object(Obj({'Type' : '/Outlines', 'Count': '0'}))
    pages = Obj({'Type' : '/Pages'})
    doc.add_object(pages)
    symd = doc.add_object(Obj({}, file(symboltable, 'r').read()))
    page_objs = []

    for p in pagefiles:
        try:
            contents = file(p).read()
        except IOError:
            sys.stderr.write("error reading page file %s\n"% p)
            continue
        # Page width and height are stored as big-endian 32-bit integers.
        (width, height) = struct.unpack('>II', contents[11:19])
        xobj = Obj({'Type': '/XObject', 'Subtype': '/Image', 'Width':
            str(width), 'Height': str(height), 'ColorSpace': '/DeviceGray',
            'BitsPerComponent': '1', 'Filter': '/JBIG2Decode', 'DecodeParms':
            ' << /JBIG2Globals %d 0 R >>' % symd.id}, contents)
        contents = Obj({}, 'q %d 0 0 %d 0 0 cm /Im1 Do Q' % (width, height))
        resources = Obj({'ProcSet': '[/PDF /ImageB]',
            'XObject': '<< /Im1 %d 0 R >>' % xobj.id})
        page = Obj({'Type': '/Page', 'Parent': '3 0 R',
            'MediaBox': '[ 0 0 %d %d ]' % (width, height),
            'Contents': ref(contents.id),
            'Resources': ref(resources.id)})
        [doc.add_object(x) for x in [xobj, contents, resources, page]]
        page_objs.append(page)

    pages.d.d['Count'] = str(len(page_objs))
    pages.d.d['Kids'] = '[' + ' '.join([ref(x.id) for x in page_objs]) + ']'

    print str(doc)

def usage(script, msg):
    if msg:
        sys.stderr.write("%s: %s\n"% (script, msg))
    sys.stderr.write("Usage: %s [file_basename] > out.pdf\n"% script)
    sys.exit(1)

if __name__ == '__main__':

    if len(sys.argv) == 2:
        sym = sys.argv[1] + '.sym'
        pages = glob.glob(sys.argv[1] + '.[0-9]*')
    elif len(sys.argv) == 1:
        sym = 'symboltable'
        pages = glob.glob('page-*')
    else:
        usage(sys.argv[0], None)

    if not os.path.exists(sym):
        usage(sys.argv[0], "symbol table %s not found!"% sym)
    elif len(pages) == 0:
        usage(sys.argv[0], "no pages found!")

    main(sym, pages)
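The `struct.unpack('>II', contents[11:19])` call in the script is the least obvious step: it reads the page width and height straight out of the page file. The Python 3 sketch below isolates that step. It assumes, as the script does, that each page file begins with an 11-byte segment header followed by the page information data; the function name `page_size` is introduced here for illustration only.

```python
import struct

def page_size(page_data):
    """Return (width, height) from the start of a page file.

    Assumes an 11-byte segment header precedes the page information
    data, whose first two fields are big-endian 32-bit integers.
    """
    if len(page_data) < 19:
        raise ValueError("too short to contain a page information segment")
    width, height = struct.unpack(">II", page_data[11:19])
    return width, height
```

In Python 3 the page file has to be opened in binary mode (`open(path, "rb")`) so that `page_data` is a `bytes` object.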

@Mark-Joy

@DingoDog
Could you please re-upload jbig2.patch.gz?

@zdenop zdenop added the pdf label Dec 19, 2024
@zdenop
Collaborator

zdenop commented Dec 19, 2024

There is an updated script to generate PDFs: https://github.com/agl/jbig2enc/blob/master/jbig2topdf.py
Feel free to send PRs/patches to improve it.

@zdenop zdenop closed this as completed Dec 19, 2024
@zdenop zdenop reopened this Dec 19, 2024
@zdenop zdenop added the patch label Dec 19, 2024
@zdenop zdenop changed the title a patch to improve pdf creation Improve pdf creation: implement --pages-per-dict, support for background images Dec 19, 2024
@zvezdochiot
Contributor

zvezdochiot commented Dec 26, 2024

Hi all.

Repatch jbig2.cc "save filename and pages-per-dict"

git diff >jbig2_changes.patch
diff --git a/src/jbig2.cc b/src/jbig2.cc
index f84af70..3f99dcb 100644
--- a/src/jbig2.cc
+++ b/src/jbig2.cc
@@ -53,15 +53,16 @@
 #define BW_GLOBAL_THRESHOLD_DEF 128
 
 static void
-usage(const char *argv0) {
+usage(const char *argv0)
+{
   fprintf(stderr, "Usage: %s [options] <input filenames...>\n", argv0);
   fprintf(stderr, "Options:\n");
-  fprintf(stderr, "  -b <basename>: output file root name when using symbol coding\n");
   fprintf(stderr, "  -d --duplicate-line-removal: use TPGD in generic region coder\n");
   fprintf(stderr, "  -p --pdf: produce PDF ready data\n");
   fprintf(stderr, "  -s --symbol-mode: use text region, not generic coder\n");
   fprintf(stderr, "  -t <threshold>: set classification threshold for symbol coder (def: %0.2f)\n", JBIG2_THRESHOLD_DEF);
   fprintf(stderr, "  -w <weight>: set classification weight for symbol coder (def: %0.2f)\n", JBIG2_WEIGHT_DEF);
+  fprintf(stderr, "  -P <number> --pages-per-dict <number>: pages per dictionary (default 15)\n");
   fprintf(stderr, "  -T <bw threshold>: set 1 bpp threshold (def: %d)\n", BW_LOCAL_THRESHOLD_DEF);
   fprintf(stderr, "  -G --global: use global BW threshold on 8 bpp images;\n"
                   "               the default is to use local (adaptive) thresholding\n");
@@ -82,9 +83,11 @@ static bool verbose = false;
 
 
 static void
-pixInfo(PIX *pix, const char *msg) {
+pixInfo(PIX *pix, const char *msg)
+{
   if (msg != NULL) fprintf(stderr, "%s ", msg);
-  if (pix == NULL) {
+  if (pix == NULL)
+  {
     fprintf(stderr, "NULL pointer!\n");
     return;
   }
@@ -98,7 +101,8 @@ pixInfo(PIX *pix, const char *msg) {
 // -----------------------------------------------------------------------------
 #include <stdarg.h>
 int
-asprintf(char **strp, const char *fmt, ...) {
+asprintf(char **strp, const char *fmt, ...)
+{
   va_list va;
   va_start(va, fmt);
 
@@ -138,7 +142,8 @@ static const char *segment_dilation_sequence = "d3.3";
 // -----------------------------------------------------------------------------
 
 static PIX*
-segment_image(PIX **ppixb, PIX *piximg) {
+segment_image(PIX **ppixb, PIX *piximg)
+{
   PIX *pixb = *ppixb;
   // Make a mask over the non-text (graphics) part of the input 1 bpp image
   // Do this by making a seed and mask, and filling the seed into the mask
@@ -165,7 +170,8 @@ segment_image(PIX **ppixb, PIX *piximg) {
   l_int32  pcount;
   pixCountPixels(pixd, &pcount, tab);
   if (verbose) fprintf(stderr, "pixel count of graphics image: %u\n", pcount);
-  if (pcount < 100) {
+  if (pcount < 100)
+  {
     pixDestroy(&pixd);
     return NULL;
   }
@@ -173,38 +179,51 @@ segment_image(PIX **ppixb, PIX *piximg) {
   // If no text portion is found, destroy the input binary image.
   pixCountPixels(pixb, &pcount, tab);
   if (verbose) fprintf(stderr, "pixel count of binary image: %u\n", pcount);
-  if (pcount < 100) {
+  if (pcount < 100)
+  {
     pixDestroy(ppixb);  // destroy & set caller handle to NULL
     pixb = NULL;  // needed later in this function for pixInfo()
   }
 
   PIX *piximg1;
-  if (piximg->d == 1 || piximg->d == 8 || piximg->d == 32) {
+  if (piximg->d == 1 || piximg->d == 8 || piximg->d == 32)
+  {
     piximg1 = pixClone(piximg);
-  } else if (piximg->d > 8) {
+  }
+  else if (piximg->d > 8)
+  {
     piximg1 = pixConvertTo32(piximg);
-  } else {
+  }
+  else
+  {
     piximg1 = pixConvertTo8(piximg, FALSE);
   }
 
   PIX *pixd1;
-  if (piximg1->d == 32) {
+  if (piximg1->d == 32)
+  {
     pixd1 = pixConvertTo32(pixd);
-  } else if (piximg1->d == 8) {
+  }
+  else if (piximg1->d == 8)
+  {
     pixd1 = pixConvertTo8(pixd, FALSE);
-  } else {
+  }
+  else
+  {
     pixd1 = pixClone(pixd);
   }
   pixDestroy(&pixd);
 
-  if (verbose) {
+  if (verbose)
+  {
     pixInfo(pixd1, "binary mask image:");
     pixInfo(piximg1, "graphics image:");
   }
   pixRasteropFullImage(pixd1, piximg1, PIX_SRC | PIX_DST);
 
   pixDestroy(&piximg1);
-  if (verbose) {
+  if (verbose)
+  {
     pixInfo(pixb, "segmented binary text image:");
     pixInfo(pixd1, "segmented graphics image:");
   }
@@ -212,26 +231,118 @@ segment_image(PIX **ppixb, PIX *piximg) {
   return pixd1;
 }
 
+static int
+get_ext_delim_pos(const char *fname)
+{
+  unsigned int pos = strcspn(fname,".");
+  unsigned int last = 0;
+
+  while (last + pos != strlen(fname))
+  {
+    last += (pos + 1);
+    pos = strcspn(fname + last,".");
+  }
+  return last;
+}
+
+static char*
+replace_suffix(char *name, const char *suffix)
+{
+  int extpos = get_ext_delim_pos(name);
+  char *ret;
+    
+  asprintf(&ret, "%s ", name);
+    
+  ret[extpos] = '\0';;
+  strcat(ret, suffix);
+  return ret;
+}
+
+static char*
+get_page_or_dict_name(char **elements, int cnt, const char *fname, int m_tiff)
+{
+  int i, extpos, same=-1;
+  char *page_name, *pattern;
+  const char *page_ext = ".jbig2";
+
+  extpos = get_ext_delim_pos(fname);
+  page_name = (char *) malloc(extpos + 12);
+  memset(page_name,'\0',extpos + 12);
+  if (extpos > 0)
+    strncpy(page_name, fname, extpos-1);
+  
+  // Make sure first page from a multipage tiff also has a numerical extension
+  // (this is necessary to guarantee it is sorted first)
+  if (m_tiff)
+    strcat(page_name, "_0000");
+  strcat(page_name, page_ext);
+
+  for (i=0; i<cnt; i++ )
+  {
+    if (strcmp(page_name, elements[i]) == 0)
+    {
+        same = i;
+        break;
+    }
+  }
+
+  if (same != -1)
+  {
+    int previdx=0, idx=0, res;
+
+    pattern = (char *) malloc(extpos + 12);
+    strcpy(pattern, page_name);
+    pattern[extpos-1] = '\0';
+    strcat(pattern, "_%4d.");
+
+    for (i=same; i<cnt; i++)
+    {
+      res = sscanf(elements[i],pattern,&idx);
+      if (res && idx > previdx) previdx = idx;
+    }
+    if (idx == 9999)
+    {
+      fprintf(stderr, "Cannot generate a unique name for %s\n", fname);
+      exit(1);
+    }
+    sprintf(page_name + (extpos - 1), "_%04d%s", idx+1, page_ext);
+    free(pattern);
+    fprintf(stderr,"name %s\n",page_name);
+  }
+  return(page_name);
+}
+
+static int
+is_tiff_format(int filetype)
+{
+  return (filetype == IFF_TIFF || 
+          filetype == IFF_TIFF_PACKBITS || filetype == IFF_TIFF_RLE ||
+          filetype == IFF_TIFF_G3 || filetype == IFF_TIFF_G4 ||
+          filetype == IFF_TIFF_LZW || filetype == IFF_TIFF_ZIP);
+}
+
 int
-main(int argc, char **argv) {
+main(int argc, char **argv)
+{
   bool duplicate_line_removal = false;
   bool pdfmode = false;
   bool globalmode = false;
   int bw_threshold = BW_LOCAL_THRESHOLD_DEF;
   float threshold = JBIG2_THRESHOLD_DEF;
   float weight = JBIG2_WEIGHT_DEF;
+  int pages_per_dict = 15;
   bool symbol_mode = false;
   bool refine = false;
   bool up2 = false, up4 = false;
   const char *output_threshold_image = NULL;
-  const char *basename = "output";
   l_int32 img_fmt = IFF_PNG;
-  const char *img_ext = "png";
+  const char *img_ext = "bg.png";
+  char **fnames;
   bool segment = false;
   bool auto_thresh = false;
   bool hash = true;
   int dpi = 0;
-  int i;
+  int i, j;
 
   #ifdef WIN32
     int result = _setmode(_fileno(stdout), _O_BINARY);
@@ -239,47 +350,47 @@ main(int argc, char **argv) {
       fprintf(stderr, "Cannot set mode to binary for stdout\n");
   #endif
 
-  for (i = 1; i < argc; ++i) {
+  for (i = 1; i < argc; ++i)
+  {
     if (strcmp(argv[i], "-h") == 0 ||
-        strcmp(argv[i], "--help") == 0) {
+        strcmp(argv[i], "--help") == 0)
+    {
       usage(argv[0]);
       return 0;
       continue;
     }
 
     if (strcmp(argv[i], "-V") == 0 ||
-        strcmp(argv[i], "--version") == 0) {
+        strcmp(argv[i], "--version") == 0)
+    {
       fprintf(stderr, "jbig2enc %s\n", getVersion());
       return 0;
     }
 
-    if (strcmp(argv[i], "-b") == 0 ||
-        strcmp(argv[i], "--basename") == 0) {
-      basename = argv[i+1];
-      i++;
-      continue;
-    }
-
     if (strcmp(argv[i], "-d") == 0 ||
-        strcmp(argv[i], "--duplicate-line-removal") == 0) {
+        strcmp(argv[i], "--duplicate-line-removal") == 0)
+    {
       duplicate_line_removal = true;
       continue;
     }
 
     if (strcmp(argv[i], "-p") == 0 ||
-        strcmp(argv[i], "--pdf") == 0) {
+        strcmp(argv[i], "--pdf") == 0)
+    {
       pdfmode = true;
       continue;
     }
 
     if (strcmp(argv[i], "-s") == 0 ||
-        strcmp(argv[i], "--symbol-mode") == 0) {
+        strcmp(argv[i], "--symbol-mode") == 0)
+    {
       symbol_mode = true;
       continue;
     }
 
     if (strcmp(argv[i], "-r") == 0 ||
-        strcmp(argv[i], "--refine") == 0) {
+        strcmp(argv[i], "--refine") == 0)
+    {
       fprintf(stderr, "Refinement broke in recent releases since it's "
                       "rarely used. If you need it you should bug "
                       "[email protected] to fix it\n");
@@ -288,44 +399,52 @@ main(int argc, char **argv) {
       continue;
     }
 
-    if (strcmp(argv[i], "-2") == 0) {
+    if (strcmp(argv[i], "-2") == 0)
+    {
       up2 = true;
       continue;
     }
-    if (strcmp(argv[i], "-4") == 0) {
+    if (strcmp(argv[i], "-4") == 0)
+    {
       up4 = true;
       continue;
     }
 
-    if (strcmp(argv[i], "-O") == 0) {
+    if (strcmp(argv[i], "-O") == 0)
+    {
       output_threshold_image = argv[i+1];
       i++;
       continue;
     }
 
-    if (strcmp(argv[i], "-S") == 0) {
+    if (strcmp(argv[i], "-S") == 0)
+    {
       segment = true;
       continue;
     }
 
     if (strcmp(argv[i], "-j") == 0 ||
-        strcmp(argv[i], "--jpeg-output") == 0) {
-      img_ext = "jpg";
+        strcmp(argv[i], "--jpeg-output") == 0)
+    {
+      img_ext = "bg.jpg";
       img_fmt = IFF_JFIF_JPEG;
       continue;
     }
 
-    if (strcmp(argv[i], "-t") == 0) {
+    if (strcmp(argv[i], "-t") == 0)
+    {
       char *endptr;
       threshold = strtod(argv[i+1], &endptr);
-      if (*endptr) {
+      if (*endptr)
+      {
         fprintf(stderr, "Cannot parse float value: %s\n", argv[i+1]);
         usage(argv[0]);
         return 1;
       }
 
       if ((threshold < JBIG2_THRESHOLD_MIN) ||
-          (threshold > JBIG2_THRESHOLD_MAX)) {
+          (threshold > JBIG2_THRESHOLD_MAX))
+      {
         fprintf(stderr, "Invalid value for threshold\n");
         fprintf(stderr, "(must be between %0.2f and %0.2f)\n",
                 JBIG2_THRESHOLD_MIN, JBIG2_THRESHOLD_MAX);
@@ -335,16 +454,19 @@ main(int argc, char **argv) {
       continue;
      }
 
-    if (strcmp(argv[i], "-w") == 0) {
+    if (strcmp(argv[i], "-w") == 0)
+    {
       char *endptr;
       weight = strtod(argv[i+1], &endptr);
-      if (*endptr) {
+      if (*endptr)
+      {
         fprintf(stderr, "Cannot parse float value: %s\n", argv[i+1]);
         usage(argv[0]);
         return 1;
       }
 
-      if ((weight < JBIG2_WEIGHT_MIN) || (weight > JBIG2_WEIGHT_MAX)) {
+      if ((weight < JBIG2_WEIGHT_MIN) || (weight > JBIG2_WEIGHT_MAX))
+      {
         fprintf(stderr, "Invalid value for weight\n");
         fprintf(stderr, "(must be between %0.2f and %0.2f)\n",
                 JBIG2_WEIGHT_MIN, JBIG2_WEIGHT_MAX);
@@ -354,25 +476,44 @@ main(int argc, char **argv) {
       continue;
     }
 
+    if (strcmp(argv[i], "-P") == 0 ||
+        strcmp(argv[i], "--pages-per-dict") == 0)
+    {
+      char *endptr;
+      pages_per_dict = strtol(argv[i+1], &endptr, 10);
+      if (*endptr)
+      {
+        fprintf(stderr, "Cannot parse int value: %s\n", argv[i+1]);
+        usage(argv[0]);
+        return 1;
+      }
+      i++;
+      continue;
+    }
+
     // Local BW thresholding is the default.  However, if global
     // BW thresholding is requested, use its default threshold.
     if (strcmp(argv[i], "-G") == 0 ||
-        strcmp(argv[i], "--global") == 0) {
+        strcmp(argv[i], "--global") == 0)
+    {
       globalmode = true;
       bw_threshold = BW_GLOBAL_THRESHOLD_DEF;
       continue;
     }
 
     // If a BW threshold value is requested, overwrite the default value.
-    if (strcmp(argv[i], "-T") == 0) {
+    if (strcmp(argv[i], "-T") == 0)
+    {
       char *endptr;
       bw_threshold = strtol(argv[i+1], &endptr, 10);
-      if (*endptr) {
+      if (*endptr)
+      {
         fprintf(stderr, "Cannot parse int value: %s\n", argv[i+1]);
         usage(argv[0]);
         return 1;
       }
-      if (bw_threshold < BW_THRESHOLD_MIN || bw_threshold > BW_THRESHOLD_MAX) {
+      if (bw_threshold < BW_THRESHOLD_MIN || bw_threshold > BW_THRESHOLD_MAX)
+      {
         fprintf(stderr, "Invalid bw threshold: (%d..%d)\n",
                 BW_THRESHOLD_MIN, BW_THRESHOLD_MAX);
         return 11;
@@ -383,31 +524,37 @@ main(int argc, char **argv) {
 
     // engage auto thresholding
     if (strcmp(argv[i], "--auto-thresh") == 0 ||
-        strcmp(argv[i], "-a") == 0 ) {
+        strcmp(argv[i], "-a") == 0 )
+    {
       auto_thresh = true;
       continue;
     }
 
-    if (strcmp(argv[i], "--no-hash") == 0) {
+    if (strcmp(argv[i], "--no-hash") == 0)
+    {
       hash = false;
       continue;
     }
 
-    if (strcmp(argv[i], "-v") == 0) {
+    if (strcmp(argv[i], "-v") == 0)
+    {
       verbose = true;
       continue;
     }
 
     if (strcmp(argv[i], "-D") == 0 ||
-        strcmp(argv[i], "--dpi") == 0) {
+        strcmp(argv[i], "--dpi") == 0)
+    {
       char *endptr;
       long t_dpi = strtol(argv[i+1], &endptr, 10);
-      if (*endptr) {
-    fprintf(stderr, "Cannot parse int value: %s\n", argv[i+1]);
-    usage(argv[0]);
-    return 1;
+      if (*endptr)
+      {
+        fprintf(stderr, "Cannot parse int value: %s\n", argv[i+1]);
+        usage(argv[0]);
+        return 1;
       }
-      if (t_dpi <= 0 || t_dpi > 9600) {
+      if (t_dpi <= 0 || t_dpi > 9600)
+      {
         fprintf(stderr, "Invalid dpi: (1..9600)\n");
         return 12;
       } 
@@ -419,191 +566,257 @@ main(int argc, char **argv) {
     break;
   }
 
-  if (i == argc) {
+  if (i == argc)
+  {
     fprintf(stderr, "No filename given\n\n");
     usage(argv[0]);
     return 4;
   }
 
-  if (refine && !symbol_mode) {
+  if (refine && !symbol_mode)
+  {
     fprintf(stderr, "Refinement makes not sense unless in symbol mode!\n");
     fprintf(stderr, "(if you have -r, you must have -s)\n");
     return 5;
   }
 
-  if (up2 && up4) {
+  if (up2 && up4)
+  {
     fprintf(stderr, "Can't have both -2 and -4!\n");
     return 6;
   }
 
-  struct jbig2ctx *ctx = jbig2_init(threshold, weight, 0, 0,
-                         !pdfmode, refine ? 10 : -1);
-  int pageno = -1;
-
-  int numsubimages=0, subimage=0, num_pages = 0;
-  while (i < argc) {
-    if (subimage==numsubimages) {
-      subimage = numsubimages = 0;
-      FILE *fp;
-      if (verbose) fprintf(stderr, "Processing \"%s\"...\n", argv[i]);
-      if ((fp=lept_fopen(argv[i], "r"))==NULL) {
-        fprintf(stderr, "Unable to open \"%s\"\n", argv[i]);
-        return 1;
-      }
-      l_int32 filetype;
-      findFileFormatStream(fp, &filetype);
-      if (filetype==IFF_TIFF && tiffGetCount(fp, &numsubimages)) {
-        return 1;
-      }
-      lept_fclose(fp);
+  int numsubimages, subimage, num_pages = 0, num_images = argc - i, cnt = 0;
+  char **images = argv + i;
+  for (i = 0; i < num_images; i++)
+  {
+    subimage = numsubimages = 0;
+    FILE *fp;
+    if ((fp=lept_fopen(images[i], "r"))==NULL)
+    {
+      fprintf(stderr, "Unable to open \"%s\"\n", images[i]);
+      return 1;
     }
 
-    PIX *source;
-    if (numsubimages<=1) {
-      source = pixRead(argv[i]);
-      numsubimages = 0;
-    } else {
-      source = pixReadTiff(argv[i], subimage++);
+    l_int32 filetype;
+    findFileFormatStream(fp, &filetype);
+    if (is_tiff_format(filetype) && tiffGetCount(fp, &numsubimages))
+    {
+      return 1;
     }
+    fclose(fp);
+    
+    if (numsubimages) num_pages += numsubimages;
+    else num_pages++;
+  }
+  subimage = numsubimages = i = 0;
+  if (pages_per_dict <= 0) pages_per_dict = num_pages;
+  fnames = (char **) malloc(sizeof(char *) * num_pages);
+
+  for (i = cnt = 0; i < num_images && cnt < num_pages; )
+  {
+    struct jbig2ctx *ctx = jbig2_init(threshold, weight, 0, 0,
+                                      !pdfmode, refine ? 10 : -1);
+    int pages_to_compress = num_pages - cnt;
+    int first_in_group = cnt;
+    
+    if (pages_to_compress > pages_per_dict)
+      pages_to_compress = pages_per_dict;
+    
+    for (j = 0; j < pages_to_compress; cnt++, j++ )
+    {
+      if (subimage==numsubimages)
+      {
+        subimage = numsubimages = 0;
+        FILE *fp;
+        if ((fp=lept_fopen(images[i], "r"))==NULL)
+        {
+          fprintf(stderr, "Unable to open \"%s\"\n", images[i]);
+          return 1;
+        }
+        l_int32 filetype;
+        findFileFormatStream(fp, &filetype);
+        if (is_tiff_format(filetype) && tiffGetCount(fp, &numsubimages))
+        {
+          return 1;
+        }
+        lept_fclose(fp);
+      }
 
-    if (dpi != 0 && source->xres == 0 && source->yres == 0) {
-      source->xres = dpi;
-      source->yres = dpi;
-    }
+      PIX *source;
+      if (numsubimages==0)
+      {
+        source = pixRead(images[i]);
+      }
+      else
+      {
+        source = pixReadTiff(images[i], subimage++);
+      }
 
-    if (!source) return 3;
-    if (verbose)
-      pixInfo(source, "source image:");
+      if (source && dpi != 0 && source->xres == 0 && source->yres == 0)
+      {
+        source->xres = dpi;
+        source->yres = dpi;
+      }
+      fnames[cnt] = get_page_or_dict_name(fnames, cnt, images[i], numsubimages > 1);
 
-    PIX *pixl, *gray, *adapt, *pixt;
-    if ((pixl = pixRemoveColormap(source, REMOVE_CMAP_BASED_ON_SRC)) == NULL) {
-      fprintf(stderr, "Failed to remove colormap from %s\n", argv[i]);
-      return 1;
-    }
-    pixDestroy(&source);
-    pageno++;
-
-    if (pixl->d > 1) {
-      if (pixl->d > 8) {
-        gray = pixConvertRGBToGrayFast(pixl);
-        if (!gray) return 1;
-      } else if (pixl->d == 4 || pixl->d == 8) {
-        gray = pixClone(pixl);
-      } else {
-        fprintf(stderr, "Unsupported input image depth: %d\n", pixl->d);
+      if (!source) return 3;
+      if (verbose)
+        pixInfo(source, "source image:");
+
+      PIX *pixl, *gray, *adapt, *pixt;
+      if ((pixl = pixRemoveColormap(source, REMOVE_CMAP_BASED_ON_SRC)) == NULL)
+      {
+        fprintf(stderr, "Failed to remove colormap from %s\n", images[i]);
         return 1;
       }
-      if (!globalmode) {
-        adapt = pixCleanBackgroundToWhite(gray, NULL, NULL, 1.0, 90, 190);
-      } else {
-        adapt = pixClone(gray);
+      pixDestroy(&source);
+
+      if (pixl->d > 1)
+      {
+        if (pixl->d > 8)
+        {
+          gray = pixConvertRGBToGrayFast(pixl);
+          if (!gray) return 1;
+        }
+        else if (pixl->d == 4 || pixl->d == 8)
+        {
+          gray = pixClone(pixl);
+        }
+        else
+        {
+          fprintf(stderr, "Unsupported input image depth: %d\n", pixl->d);
+          return 1;
+        }
+        if (globalmode)
+        {
+          adapt = pixClone(gray);
+        }
+        else
+        {
+          adapt = pixCleanBackgroundToWhite(gray, NULL, NULL, 1.0, 90, 190);
+        }
+        pixDestroy(&gray);
+        if (up2)
+        {
+          pixt = pixScaleGray2xLIThresh(adapt, bw_threshold);
+        }
+        else if (up4)
+        {
+          pixt = pixScaleGray4xLIThresh(adapt, bw_threshold);
+        }
+        else
+        {
+          pixt = pixThresholdToBinary(adapt, bw_threshold);
+        }
+        pixDestroy(&adapt);
       }
-      pixDestroy(&gray);
-      if (up2) {
-        pixt = pixScaleGray2xLIThresh(adapt, bw_threshold);
-      } else if (up4) {
-        pixt = pixScaleGray4xLIThresh(adapt, bw_threshold);
-      } else {
-        pixt = pixThresholdToBinary(adapt, bw_threshold);
+      else
+      {
+        pixt = pixClone(pixl);
       }
-      pixDestroy(&adapt);
-    } else {
-      pixt = pixClone(pixl);
-    }
-    if (!pixt) {
-      fprintf(stderr, "Failed to convert input image to binary\n");
-      return 1;
-    }
-    if (verbose)
-      pixInfo(pixt, "thresholded image:");
 
-    if (output_threshold_image) {
-      pixWrite(output_threshold_image, pixt, IFF_PNG);
-    }
+      if (verbose)
+        pixInfo(pixt, "thresholded image:");
 
-    if (segment && pixl->d > 1) {
-      // If no text is found, pixt is destroyed
-      PIX *graphics = segment_image(&pixt, pixl);
-      pixDestroy(&pixl);  // if pixt == NULL, the loop exits at 'continue'
-      if (graphics) {
-        if (verbose)
-          pixInfo(graphics, "graphics image:");
-        char *filename;
-        asprintf(&filename, "%s.%04d.%s", basename, pageno, img_ext);
-        pixWrite(filename, graphics, img_fmt);
-        free(filename);
-        pixDestroy(&graphics);
-      } else if (verbose) {
-        fprintf(stderr, "%s: no graphics found in input image\n", argv[i]);
-      }
-      if (pixt == NULL) {
-        fprintf(stderr, "%s: no text portion found in input image\n", argv[i]);
-        i++;
-        continue;
+      if (output_threshold_image)
+      {
+        pixWrite(output_threshold_image, pixt, IFF_PNG);
       }
-    }
-
-    pixDestroy(&pixl);
 
-    if (!symbol_mode) {
-      int length;
-      uint8_t *ret;
-      ret = jbig2_encode_generic(pixt, !pdfmode, 0, 0, duplicate_line_removal,
-                                 &length);
-      write(1, ret, length);
-      return 0;
-    }
+      if (segment && pixl->d > 1)
+      {
+        PIX *graphics = segment_image(&pixt, pixl);
+        if (graphics)
+        {
+          if (verbose)
+            pixInfo(graphics, "graphics image:");
+          char *filename = replace_suffix(fnames[cnt], img_ext);
+          pixWrite(filename, graphics, img_fmt);
+          free(filename);
+          pixDestroy(&graphics);
+        }
+        else if (verbose)
+        {
+          fprintf(stderr, "%s: no graphics found in input image\n", images[i]);
+        }
+        if (!pixt)
+        {
+          fprintf(stderr, "%s: no text portion found in input image\n", images[i]);
+          i++;
+          continue;
+        }
+      }
+      pixDestroy(&pixl);
+
+      if (!symbol_mode)
+      {
+        int length;
+        uint8_t *ret;
+        ret = jbig2_encode_generic(pixt, !pdfmode, 0, 0, duplicate_line_removal,
+                                   &length);
+        write(1, ret, length);
+        return 0;
+      }
 
-    jbig2_add_page(ctx, pixt);
-    pixDestroy(&pixt);
-    num_pages++;
-    if (subimage==numsubimages) {
-      i++;
+      jbig2_add_page(ctx, pixt);
+      pixDestroy(&pixt);
+      if (subimage==numsubimages)
+      {
+        i++;
+      }
     }
-  }
 
-  if (auto_thresh) {
-    if (hash) {
-      jbig2enc_auto_threshold_using_hash(ctx);
-    } else {
-      jbig2enc_auto_threshold(ctx);
+    if (auto_thresh)
+    {
+      if (hash)
+      {
+        jbig2enc_auto_threshold_using_hash(ctx);
+      }
+      else
+      {
+        jbig2enc_auto_threshold(ctx);
+      }
     }
-  }
 
-  uint8_t *ret;
-  int length;
-  ret = jbig2_pages_complete(ctx, &length);
-  if (pdfmode) {
-    char *filename;
-    asprintf(&filename, "%s.sym", basename);
-    const int fd = open(filename, O_WRONLY | O_TRUNC | O_CREAT | WINBINARY, 0600);
-    free(filename);
-    if (fd < 0) abort();
-    write(fd, ret, length);
-    close(fd);
-  } else {
-    write(1, ret, length);
-  }
-  free(ret);
-
-  for (int i = 0; i < num_pages; ++i) {
-    ret = jbig2_produce_page(ctx, i, -1, -1, &length);
-    if (pdfmode) {
-      char *filename;
-      asprintf(&filename, "%s.%04d", basename, i);
-      const int fd = open(filename, O_WRONLY | O_CREAT | O_TRUNC | WINBINARY, 0600);
-      free(filename);
+    uint8_t *ret;
+    int length;
+    ret = jbig2_pages_complete(ctx, &length);
+    if (pdfmode)
+    {
+      char *dict_name = replace_suffix(fnames[first_in_group], "sym");
+      const int fd = open(dict_name, O_WRONLY | O_TRUNC | O_CREAT | WINBINARY, 0600);
+      free(dict_name);
       if (fd < 0) abort();
       write(fd, ret, length);
       close(fd);
-    } else {
+    }
+    else
+    {
       write(1, ret, length);
     }
     free(ret);
+
+    for (j = 0; j < pages_to_compress; ++j)
+    {
+      ret = jbig2_produce_page(ctx, j, -1, -1, &length);
+      if (pdfmode)
+      {
+        const int fd = open(fnames[cnt - pages_to_compress + j], O_WRONLY | O_CREAT | O_TRUNC | WINBINARY, 0600);
+        if (fd < 0) abort();
+        write(fd, ret, length);
+        close(fd);
+      }
+      else
+      {
+        write(1, ret, length);
+      }
+      free(ret);
+    }
+    jbig2_destroy(ctx);
   }
+  for (i=0; i<cnt; i++) free(fnames[i]);
+  free(fnames);
 
-  jbig2_destroy(ctx);
   return 0;
-
 }
-
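
For anyone skimming the diff above, the batching it introduces can be sketched in isolation. This is only an illustration of the grouping arithmetic, not code from the patch: `group_pages` is a hypothetical helper, and the only names taken from the diff are `num_pages`, `pages_per_dict`, `cnt` and `pages_to_compress`. It assumes, as the diff does, that a non-positive `pages_per_dict` means "one dictionary for all pages".

```cpp
#include <utility>
#include <vector>

// Hypothetical helper (not part of the patch): split num_pages pages into
// consecutive groups of at most pages_per_dict pages. Each pair holds
// (index of the first page in the group, number of pages in the group).
std::vector<std::pair<int, int>> group_pages(int num_pages, int pages_per_dict) {
  std::vector<std::pair<int, int>> groups;
  // Same fallback as in the diff: a non-positive value means a single group.
  if (pages_per_dict <= 0) pages_per_dict = num_pages;
  for (int cnt = 0; cnt < num_pages; ) {
    int pages_to_compress = num_pages - cnt;
    if (pages_to_compress > pages_per_dict) pages_to_compress = pages_per_dict;
    groups.emplace_back(cnt, pages_to_compress);
    cnt += pages_to_compress;  // the next group starts right after this one
  }
  return groups;
}
```

With 40 pages and 15 pages per dictionary this yields three groups starting at pages 0, 15 and 30, the last one holding the remaining 10 pages. In the diff, each such group gets its own `jbig2ctx` and its own `.sym` file named after the first page in the group.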

Good luck. 😄

@zdenop
Collaborator

zdenop commented Dec 28, 2024

Many of the changes are formatting-only. Can you make a patch/diff without them?

@zvezdochiot
Contributor

zvezdochiot commented Dec 28, 2024

@zdenop, I'm happy with this formatting. 😄

See also ImageProcessing-ElectronicPublications@bcc9b59
