The URL parser can be viewed in 2 steps:

1. Scan the URL string, make sure we can write it as a normalised URL string.
2. Allocate storage, write the normalised URL string to it.
This leaves out one important consideration: we don't know the size of the URL string until we try to write it. It may be a lot larger than the input if anything needs to be percent-encoded (percent-encoding turns each encoded byte into three), or it may be a lot smaller if it contains a complex path which can be simplified, if some parts come from the base URL, etc.
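As a rough illustration of the expansion (using Foundation's percent-encoding here, not this library's API):

```swift
import Foundation

// Illustration only (Foundation, not this library): every byte that needs
// encoding becomes a three-byte "%XX" sequence, so the output can be up to
// 3x the size of the input.
let query = "q=hello world & more"   // 20 bytes
let encoded = query.addingPercentEncoding(withAllowedCharacters: .urlQueryAllowed)!
// "q=hello%20world%20&%20more"      // 26 bytes
```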
In order to provide scalable, predictable performance, the allocation & writing step above is broken down into 2 sub-steps (see ParsedURLString.constructURLObject):
2a. Perform a dry-run, writing the URL using a StructureAndMetricsCollector, so we know the precise capacity required.
2b. Allocate storage with that capacity, and write again, this time for real.
This has the really nice property that parsing any URL, no matter how large or complex, results in only a single heap allocation (the result). The dry-run also stores a bunch of flags on the side which can make the second writing pass faster (e.g. whether we can skip percent-encoding for a particular component).
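Roughly, the pattern looks like this (a sketch with hypothetical names; the real ParsedURLString / StructureAndMetricsCollector types are considerably more involved). The writing logic is generic over a "writer", and the dry-run simply plugs in a writer that only counts:

```swift
// Hypothetical sketch of the two-pass pattern; not the library's actual types.
protocol URLWriter {
    mutating func writeBytes<C: Collection>(_ bytes: C) where C.Element == UInt8
}

// Pass 1 (dry-run): a writer that only measures, so we learn the exact capacity.
struct MetricsCollector: URLWriter {
    var requiredCapacity = 0
    mutating func writeBytes<C: Collection>(_ bytes: C) where C.Element == UInt8 {
        requiredCapacity += bytes.count
    }
}

// Pass 2: a writer that appends into storage reserved up-front.
struct StorageWriter: URLWriter {
    var storage: [UInt8] = []
    init(capacity: Int) { storage.reserveCapacity(capacity) }
    mutating func writeBytes<C: Collection>(_ bytes: C) where C.Element == UInt8 {
        storage.append(contentsOf: bytes)
    }
}

// The normalisation logic is written once, generic over the writer,
// so both passes run exactly the same code path.
func writeNormalizedURL<W: URLWriter>(_ input: String, to writer: inout W) {
    // (the real logic normalises each component; here we just copy the bytes)
    writer.writeBytes(Array(input.utf8))
}

func parse(_ input: String) -> [UInt8] {
    var metrics = MetricsCollector()
    writeNormalizedURL(input, to: &metrics)                      // dry-run: count only
    var writer = StorageWriter(capacity: metrics.requiredCapacity)
    writeNormalizedURL(input, to: &writer)                       // single real allocation
    return writer.storage
}
```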
There are other approaches we could take: for example, we could just make a storage object, guess the required capacity, and keep appending to it. Unfortunately, this can lead to repeated copies and allocations, and in the worst case has quadratic complexity (that's why the standard library advises you to reserve capacity in advance if appending to Array/String in a loop).
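For reference, that standard-library advice looks like this in practice:

```swift
// Reserving capacity up-front means the appending loop below performs
// a single allocation instead of growing (and copying) repeatedly.
let inputBytes = Array("https://example.com/some/fairly/long/path".utf8)

var output: [UInt8] = []
output.reserveCapacity(inputBytes.count)  // one allocation up-front

for byte in inputBytes {
    output.append(byte)                   // stays within the reserved capacity
}
```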
The way Array and String solve this is by using a geometric growth strategy; this reserves a bunch of extra capacity on each reallocation to try to amortise the cost. It's based on the idea that if you're appending a lot of elements to your array, you're probably going to keep on appending a lot. While this makes sense for appending to a general-purpose type such as Array, which people use in all sorts of varied situations, I'm not convinced that it makes sense for parsing a URL string.
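You can watch the geometric growth directly (the exact capacities are an implementation detail and may differ between releases):

```swift
// Observe Array's capacity jump geometrically as elements are appended.
var numbers: [Int] = []
var lastCapacity = numbers.capacity

for i in 0..<1_000 {
    numbers.append(i)
    if numbers.capacity != lastCapacity {
        print("count \(numbers.count) -> capacity \(numbers.capacity)")
        lastCapacity = numbers.capacity
    }
}
// Each reallocation roughly doubles the capacity, amortising the copies.
```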
Another approach would be if the first writing pass (currently a dry-run) took a guess at the required storage capacity, at least for small URLs. If it's correct, great! We saved a second writing pass, and cut parsing time roughly in half. If it isn't, we wasted an allocation, but we can still continue calculating the actual required capacity. Then the second pass can make storage of the correct size, copy the partial result from the "guess storage", and resume writing from where the first pass ran out of space.
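A sketch of how that first pass might look (again, hypothetical names, and the 200-byte default guess is only illustrative): the writer fills a fixed-size guess buffer, and once the guess is exhausted it keeps counting, so the true capacity is still known at the end.

```swift
// Hypothetical "guess first" writer; names and the default guess size are
// illustrative, not part of this library.
struct GuessingWriter {
    private(set) var buffer: [UInt8] = []   // partial result, at most `guess` bytes
    private(set) var requiredCapacity = 0   // total bytes the full URL needs
    private let guess: Int

    init(guess: Int = 200) {
        self.guess = guess
        buffer.reserveCapacity(guess)
    }

    // True if the guess was too small and a second pass is still required.
    var overflowed: Bool { requiredCapacity > guess }

    mutating func writeBytes<C: Collection>(_ bytes: C) where C.Element == UInt8 {
        requiredCapacity += bytes.count
        let remaining = guess - buffer.count
        if remaining >= bytes.count {
            buffer.append(contentsOf: bytes)                    // fits in the guess
        } else if remaining > 0 {
            buffer.append(contentsOf: bytes.prefix(remaining))  // keep what fits
        }
        // Beyond this point we only count. A second pass would allocate
        // `requiredCapacity` bytes, copy `buffer`, and resume writing from
        // byte offset `buffer.count`.
    }
}
```

If overflowed is false after the first pass, buffer already holds the complete normalised URL and the second pass (and its allocation) can be skipped entirely.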
Profiling shows that writing the normalised URL accounts for ~80-90% of execution time, split evenly between the dry-run and writing to storage. If small URLs can avoid the second pass, parsing them could be up to 2x as fast as it is today. Large URLs will likely get a bit slower due to the extra allocation, but they'll also be able to skip a lot more on the second pass, so I expect it will be much less than 2x slower (maybe 1.3x-1.5x, at a guess). So that sounds like an overall win.
And when I say "small", I'm talking <200 bytes. For comparison, this reddit URL is 153 bytes. And just by looking at it, it's longer than most of the URLs you're likely to parse in typical application scenarios. So the vast majority of URLs should see a big speed increase.