Parsing Differences Array #73

tjanela · 2014-01-30T13:55:20Z

I've been playing around with the code in this repo and I've been able to correctly interpret the differences array.

Basically I've loaded Adobe's glyphlist.txt:
http://partners.adobe.com/public/developer/en/opentype/glyphlist.txt

I then built a differences map (whenever the font has a Differences array) that maps a byte to a Unicode character. If the byte is not present in the differences map, the character from the existing toUnicode lookup is returned instead.
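To make the discussion concrete, here is a minimal sketch of reading one line of glyphlist.txt in C. It assumes the file's usual layout: `#` starts a comment line, and every other line is `name;XXXX`, where the part after the semicolon is one or more space-separated hexadecimal code points. The function name `parse_glyphlist_line` and its signature are hypothetical and not taken from the repo.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not from the repo): parses one glyphlist.txt line
 * of the form "name;XXXX" or "name;XXXX XXXX ...".
 * Copies the glyph name into name_out and up to max_points code points
 * into points_out. Returns the number of code points read, or 0 for
 * comment, blank, or malformed lines. */
static size_t parse_glyphlist_line(const char *line,
                                   char *name_out, size_t name_cap,
                                   unsigned long *points_out, size_t max_points)
{
    /* Skip comments and empty lines. */
    if (line[0] == '#' || line[0] == '\n' || line[0] == '\r' || line[0] == '\0')
        return 0;

    /* The glyph name is everything before the semicolon. */
    const char *semi = strchr(line, ';');
    if (semi == NULL)
        return 0;
    size_t name_len = (size_t)(semi - line);
    if (name_len == 0 || name_len >= name_cap)
        return 0;
    memcpy(name_out, line, name_len);
    name_out[name_len] = '\0';

    /* Read hexadecimal code points until none are left. */
    size_t count = 0;
    const char *p = semi + 1;
    while (count < max_points) {
        char *end;
        unsigned long cp = strtoul(p, &end, 16);
        if (end == p)
            break; /* no more hex digits on this line */
        points_out[count++] = cp;
        p = end;
    }
    return count;
}
```

Calling this once per line at startup and storing each name with its code points in a lookup table would give the name-to-Unicode map that the Differences handling needs. Names that map to several code points are the reason the sketch returns a count rather than a single value.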

I'm interested in debating this approach and a possible pull request with you.

Best Regards!

KurtCode · 2014-01-30T21:16:20Z

Hi!

Sounds interesting, it’s quite a long list of characters. I see it provides both the short name and the Unicode value. I am not sure exactly what your implementation looks like, but like I said, sounds interesting.

Marcus


tjanela · 2014-01-30T21:50:24Z

Hi there Kurt,

As someone said, madness lurks in the PDF spec.

I confess that I do not know much about PDF, or at least not as much as you do.
I tried to stand on the shoulders of your implementation and dig a little deeper.

I found a comment in -[SimpleFont setEncodingWithEncodingObject:] that caught my attention.
I was trying out a search function based on Scanner to extract textual content while preserving whole words.
I noticed that some unichars were completely off when printed in the console.
After some digging into the reproducible crashes I was having, I understood that the toUnicode function sometimes did not output the expected result.
I learned about the Differences array and how it influences glyph resolution.
And then I met the glyph list.
It is included as a resource file and shipped as such.
There is another such list:
http://partners.adobe.com/public/developer/en/opentype/aglfn13.txt
I'm sure it needs to be loaded and used where appropriate, as some of the characters that I’ve been obtaining with this algorithm are printed as gremlins in the console.
I’m betting on symbol fonts and other special, create-your-own-circus stuff.

Cookies crumble as follows:
-> Preload the glyph list at application startup.
   -> Notes: Ugly. Modifiable. Cardinal sin committed in the name of development.
-> -[SimpleFont setEncodingWithEncodingObject:]
   -> If no encoding is found, use StandardEncoding instead of doing nothing.
   -> Try to get the Differences array.
   -> Process it in order to get a DifferencesDictionary.
      -> I’m probably missing something here, but the Differences array is simply a map coded as an array of values.
      -> It contains numbers (the codes used in text sections to dictate which glyph is to be used).
      -> It contains names (the same names present in the glyph list).
      -> Rule: “178, a, z, b, c, d, e, 100, two” means the value 178 represents the glyph named a, 179 -> z, 180 -> b, 181 -> c, 182 -> d, 183 -> e, and the value 100 represents the glyph named two, i.e. “2”.
      -> Add the number as a key mapped to the string containing the Unicode character.
   -> Store the result in a dictionary property.
   -> Notes: Very localized change. All the glyphlist.txt parsing is in a C function invoked at startup.
-> -[SimpleFont stringWithPDFString:]
   -> Add the case when there is a Differences property.
   -> For each byte in the string bytes do:

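// Cast guards against sign extension if bytes is a plain char buffer.
unichar cid = (unsigned char)bytes[i];
// Prefer the mapping from the font's Differences array, if there is one.
NSString *uni = [self.differences objectForKey:@(cid)];
if (!uni) {
    // Code not overridden: fall back to the existing toUnicode lookup.
    uni = [NSString stringWithFormat:@"%C", [self.toUnicode unicodeCharacter:cid]];
}
//NSLog(@"(%hu) %C -> %@", cid, cid, uni);
[unicodeString appendString:uni];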

-> Notes: This will probably break in some cases, as I’m sure it can’t be this simple. What if the encoding is not standard?
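The Differences rule from the steps above can be sketched in plain C as follows. This is an illustration under assumptions, not the repo's code: the `DiffToken` struct and `expand_differences` function are hypothetical, and the array is assumed to be already tokenized into codes and glyph names. Each number resets the current code, and each name that follows is assigned to the current code, which then advances by one.

```c
#include <stddef.h>

/* Hypothetical token type for illustration: one element of a Differences
 * array is either a character code or a glyph name. */
typedef struct {
    int is_name;       /* 0: this token is a code, 1: it is a glyph name */
    int code;          /* valid when is_name == 0 */
    const char *name;  /* valid when is_name == 1 */
} DiffToken;

/* Fills table[code] with the glyph name for every code the array overrides.
 * table has 256 entries; codes not mentioned are left untouched, so the
 * caller can pre-fill it with the base encoding or with NULL. */
static void expand_differences(const DiffToken *tokens, size_t count,
                               const char *table[256])
{
    int next = -1; /* no starting code seen yet */
    for (size_t i = 0; i < count; i++) {
        if (!tokens[i].is_name) {
            /* A number restarts the run at that code. */
            next = tokens[i].code;
        } else if (next >= 0 && next < 256) {
            /* A name takes the current code, then the code advances. */
            table[next++] = tokens[i].name;
        }
    }
}
```

With the example from this thread (178, a, z, b, c, d, e, 100, two), the table ends up with 178 -> a through 183 -> e and 100 -> two. Looking each name up in the glyph list then yields the Unicode string to store under that code.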

And that’s it.

Setting up a test battery for this might take some time.
Special PDFs need to be fabricated in order to properly check the behavior of the algorithms.

These are my insights.

Best regards,

Tiago Janela
Bliss Applications

