Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid extra call of safe_path in file_extension() #141

Merged
merged 1 commit into from
Jan 15, 2024
Merged

Conversation

hagenw
Copy link
Member

@hagenw hagenw commented Jan 15, 2024

This speeds up audiofile.read() and all the functions from audiofile.core.info by not calling audeer.safe_path() when getting the file extension of a file, as we called audeer.safe_path(file) before.

Copy link

codecov bot commented Jan 15, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (fe06fcf) 100.0% compared to head (9f7fd22) 100.0%.

❗ Current head 9f7fd22 differs from pull request most recent head 291b597. Consider uploading reports for the commit 291b597 to get more accurate results

Additional details and impacted files
Files Coverage Δ
audiofile/core/utils.py 100.0% <100.0%> (ø)

@frankenjoe
Copy link
Collaborator

Is the speed up measurable?

@hagenw
Copy link
Member Author

hagenw commented Jan 15, 2024

Not with the benchmarks in the docs as there we look only at 160 files. But it can be measured when using more files, e.g.

Execution time when reading 100,000 files with a duration of 0.1 s and a sampling rate of 8000 Hz.

main this branch
15.2 s 11.5 s
Benchmark code
import os
import time

import numpy as np

import audeer
import audiofile


duration = 0.1
sampling_rate = 8000
repetitions = 100000

# Create files for benchmark
folder = audeer.mkdir('mix')
files = [
    audeer.path(folder, f'long-{n}.wav')
    for n in range(repetitions)
]
for file in files:
    if not os.path.exists(file):
        signal = np.random.normal(0, 0.1, (1, int(sampling_rate * duration)))
        signal /= (np.max(np.abs(signal)) + 0.0001)
        audiofile.write(file, signal, sampling_rate)

# benchmark
start = time.time()
for file in files:
    audiofile.read(file)
end = time.time()
print(end - start)

I also saw this when profiling auglib, which spent 1 s on audeer.safe_path() and half of it inside file_extension().

@frankenjoe
Copy link
Collaborator

Wow, I was expecting a call to audeer.path() to not have a measurable impact. At least not if the path is absolute and does not contain relative sub-paths. Since we are using this function all over the place, maybe we could add such a check to audeer.path() then?

@hagenw
Copy link
Member Author

hagenw commented Jan 15, 2024

Yes, it has a big impact if you loop over lot of files and all your other processing is relatively fast. E.g. when removing audeer.safe_path() completely when executing the benchmark above, the execution time is reduced to 8.0 s.
The expensive part inside audeer.path() is os.path.realpath() which translates symbolic links to the actual path.

One of the advantages of using os.path.realpath() is the following:

import audeer
import os

audeer.touch('test.wav')
os.symlink('test.wav', 'test.mp3')

audeer.file_extension('test.mp3')

returns

'wav'

@frankenjoe
Copy link
Collaborator

Ok, I see. Basically, it means there is at least one disk operation we cannot avoid since we cannot derive from the string if a path is symbolic or not.

@hagenw
Copy link
Member Author

hagenw commented Jan 15, 2024

Yes, we can only say we don't care and replace os.path.realpath by os.path.abspath which is much faster, but will break the current behavior of audeer.path() for symbolic links.

@frankenjoe
Copy link
Collaborator

Does Windows support symbolic links? If not, we could at least there switch to os.path.abspath.

@hagenw
Copy link
Member Author

hagenw commented Jan 15, 2024

I added audeering/audeer#131 for further discussion on the audeer.path() issue as this is not relevant to the proposed changes of this pull request.

@frankenjoe frankenjoe merged commit 27c1eb9 into main Jan 15, 2024
32 checks passed
@frankenjoe frankenjoe deleted the speedup-read branch January 15, 2024 09:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants