Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add blob.length property #225

Merged
merged 1 commit into from
Dec 22, 2024
Merged

add blob.length property #225

merged 1 commit into from
Dec 22, 2024

Conversation

mcratt
Copy link
Contributor

@mcratt mcratt commented Dec 20, 2024

I found that uharfbuzz has already bind the hb_blob_get_length API, but it doesn't use this API and instead returns the length using return len(self._data). However, this leads to an issue: after creating a blob using hb.Blob.from_file_path, the length obtained by len(blob) or len(blob.data) is 0. Only after creating hb.Face(blob), can the correct length be obtained through len(face.blob) or len(face.blob.data). This causes a significant amount of disk I/O when reading many files.

def test_len(fontPath):
    fontSize = os.path.getsize(fontPath)
    print(f"fontSize: {fontSize}")

    blob = uharfbuzz.Blob.from_file_path(fontPath)
    len_blob = len(blob)
    print(f"len(blob): {len_blob}")

    face = uharfbuzz.Face(blob)
    len_face_blob = len(face.blob)
    print(f"len(face.blob): {len_face_blob}")

def test_datalen(fontPath):
    fontSize = os.path.getsize(fontPath)
    print(f"fontSize: {fontSize}")

    blob = uharfbuzz.Blob.from_file_path(fontPath)
    len_blob = len(blob.data)
    print(f"len(blob.data): {len_blob}")

    face = uharfbuzz.Face(blob)
    len_face_blob = len(face.blob.data)
    print(f"len(face.blob.data): {len_face_blob}")

def test_length(fontPath):
    fontSize = os.path.getsize(fontPath)
    print(f"fontSize: {fontSize}")

    blob = uharfbuzz.Blob.from_file_path(fontPath)
    blob_length = blob.length
    print(f"blob.length: {blob_length}")

    face = uharfbuzz.Face(blob)
    face_blob_length = face.blob.length
    print(f"face.blob.length: {face_blob_length}")

result:

test_len:
fontSize: 16998496
len(blob): 0
len(face.blob): 16998496

test_datalen:
fontSize: 16998496
len(blob.data): 0
len(face.blob.data): 16998496

test_length:
fontSize: 16998496
blob.length: 16998496
face.blob.length: 16998496

IO:

test_len
test_datalen
test_length

@khaledhosny
Copy link
Collaborator

The whole _data dance seems broken to me. I don’t think we should keep it around at all. Instead we should create the blob in any of the ways currently supported, and then return data and length using the C API exclusively.

@mcratt
Copy link
Contributor Author

mcratt commented Dec 22, 2024

I'm hesitant to modify the _data parameter because hb_blob_get_length only accepts the blob, not blob.data. If I modify the __len__() method, it might cause errors in the new version since previous code may have used len(blob.data)

@mcratt
Copy link
Contributor Author

mcratt commented Dec 22, 2024

Modifying __len__ seems to work, and len(blob) already produces the correct result.

def __len__(self) -> int:
    return hb_blob_get_length(self._hb_blob)

code:

def test_len(fontPath):
    fontSize = os.path.getsize(fontPath)
    print(f"fontSize: {fontSize}")

    blob = uharfbuzz.Blob.from_file_path(fontPath)
    len_blob = len(blob)
    print(f"len(blob): {len_blob}")

    face = uharfbuzz.Face(blob)
    len_face_blob = len(face.blob)
    print(f"len(face.blob): {len_face_blob}")

def test_datalen(fontPath):
    fontSize = os.path.getsize(fontPath)
    print(f"fontSize: {fontSize}")

    blob = uharfbuzz.Blob.from_file_path(fontPath)
    len_blob = len(blob.data)
    print(f"len(blob.data): {len_blob}")

    face = uharfbuzz.Face(blob)
    len_face_blob = len(face.blob.data)
    print(f"len(face.blob.data): {len_face_blob}")

result:

test_len:
fontSize: 16998496
len(blob): 16998496
len(face.blob): 16998496

test_datalen:
fontSize: 16998496
len(blob.data): 0
len(face.blob.data): 16998496

@mcratt
Copy link
Contributor Author

mcratt commented Dec 22, 2024

If self._data is not used for storage, perhaps hb_blob_get_data can be used to retrieve the data, but I'm not sure if this will cause other issues.

@mcratt
Copy link
Contributor Author

mcratt commented Dec 22, 2024

It seems to be running well, although reading the data part inevitably involves accessing the hard drive, as the data is in bytes.

def test_len(fontPath):
    fontSize = os.path.getsize(fontPath)
    print(f"fontSize: {fontSize}")

    blob = uharfbuzz.Blob.from_file_path(fontPath)
    print(f"len(blob): {len(blob)}")

    face = uharfbuzz.Face(blob)
    print(f"len(face.blob): {len(face.blob)}")

def test_datalen(fontPath):
    fontSize = os.path.getsize(fontPath)
    print(f"fontSize: {fontSize}")

    blob = uharfbuzz.Blob.from_file_path(fontPath)
    print(f"len(blob.data): {len(blob.data)}")

    face = uharfbuzz.Face(blob)
    print(f"len(face.blob.data): {len(face.blob.data)}")

def test_face_blob(fontPath):
    fontSize = os.path.getsize(fontPath)
    print(f"fontSize: {fontSize}")

    with open(fontPath, "rb") as f:
        fontBytes = f.read()
    face = uharfbuzz.Face(fontBytes)
    print(f"len(face.blob): {len(face.blob)}")
    print(f"len(face.blob.data): {len(face.blob.data)}")

result:

test_len:
fontSize: 16998496
len(blob): 16998496
len(face.blob): 16998496

test_datalen:
fontSize: 16998496
len(blob.data): 16998496
len(face.blob.data): 16998496

test_face_blob:
fontSize: 16998496
len(face.blob): 16998496
len(face.blob.data): 16998496

@khaledhosny
Copy link
Collaborator

although reading the data part inevitably involves accessing the hard drive

If the blob is created from a file path, HarfBuzz will mmap the file, so reading the data will indeed involve file I/O.

@mcratt
Copy link
Contributor Author

mcratt commented Dec 22, 2024

Yes, but now the build is failing, and it's strange because I can't find where the issue is. Is it possible that _hb_blob is not being cleaned up properly?

Instead of keeping blob data at Python side and risking it being not
synced with the C side, always use C API to get blob length and data.
@khaledhosny khaledhosny merged commit 2c9c0be into harfbuzz:main Dec 22, 2024
3 of 4 checks passed
@khaledhosny
Copy link
Collaborator

Thanks!

@mcratt
Copy link
Contributor Author

mcratt commented Dec 23, 2024

Looking forward to the new version.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants