Handling errors without raising exceptions #90

argenisleon · 2020-05-25T05:27:43Z

I need to parse millions of rows, and raising exceptions could slow down the process. Is there any way to return None, or just the same object, instead?

movermeyer · 2020-05-25T11:28:51Z

Summary

I'm not against the idea of returning None (after the user applies some flag/setting), assuming that it can be shown to speed things up.

What is your use-case?

I would like to know more about your use-case though.
How high is your invalid data rate that the cost of Exception creation matters? Where is this data coming from?

I'm mostly surprised and curious.

> could slow down the process

Have you tested it to get a sense of the slowdown? Or is this speculation?

Optimized for valid data

The length of time that ciso8601 takes while parsing invalid data varies a lot. This is because:

- "How" the data is invalid determines how much of the parsing is done before it is noticed as invalid
  - ex. an empty string fails to parse faster than a string with invalid data in the time zone, since there is much less parsed
- The code was optimized with the assumption of a low invalid data rate.
  - ex. ciso8601 does the full parse and validation of the first fields in the datetime, even before the later fields have been verified as existing.

If one happened to know that their data was incorrect in specific ways, one could re-organize the code to have better average run-time over the invalid data. But in general, I'm not sure that we can know that.

Which is to say that, in general, we should not expect invalid data to parse as quickly as valid data (though we can possibly do better).
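To make the point about "how" the data is invalid concrete, here is a minimal timing sketch. It uses `datetime.fromisoformat` from the standard library as a stand-in parser, so the numbers it prints say nothing about ciso8601 itself; passing `ciso8601.parse_datetime` instead should measure the real thing. The helper names `attempt` and `time_inputs` and the sample strings are made up for illustration.

```python
import timeit
from datetime import datetime


def attempt(parse, text):
    """Return parse(text), or None if the parser raises ValueError."""
    try:
        return parse(text)
    except ValueError:
        return None


def time_inputs(parse, samples, number=10_000):
    """Map each sample string to the seconds taken for `number` parse attempts."""
    return {
        text: timeit.timeit(lambda: attempt(parse, text), number=number)
        for text in samples
    }


samples = [
    "2020-05-25T05:27:43+00:00",  # valid
    "",                           # invalid: nothing to parse at all
    "2020-05-25T05:27:43+zz:00",  # invalid: only the time zone is bad
]

# Stand-in parser (assumption): swap in ciso8601.parse_datetime to measure ciso8601.
timings = time_inputs(datetime.fromisoformat, samples)
for text, seconds in timings.items():
    print(f"{text!r:32} {seconds:.4f}s")
```

Comparing the second and third rows gives a rough sense of how much the position of the error matters for a given parser.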

Again, I'm not against the idea, just providing some background. I'd like to see some analysis of the slowdown done before we make this change (perhaps I'll find time for this...maybe...)

argenisleon · 2020-05-26T05:32:56Z

@movermeyer thanks for the fast response.

I am working on Optimus https://github.com/ironmussa/Optimus/tree/develop-3.0. Optimus is a library to process and explore big data using pyspark/dask.

Because this is big data, in my particular use case I cannot know:

- The % of valid data
- The size

I would like to parse strings to dates as fast as possible.

About the performance: if I try to parse these strings to int, each call raises an exception, and it takes about 10x as long as it does with valid data. This takes approx. 13 sec:

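import pandas as pd

N = 10000000
# N random 10-character strings (pd.util.testing is deprecated since pandas 1.0)
s_arr = pd.util.testing.rands_array(10, N)

%%time
# Separate notebook cell: int() raises ValueError for nearly every random string
for i in s_arr:
    try:
        int(i)
    except ValueError:
        pass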

This takes 1.76 sec:

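import pandas as pd

N = 10000000
# Same input as above: N random 10-character strings
s_arr = pd.util.testing.rands_array(10, N)

%%time
# Separate notebook cell: str() succeeds on every element, so nothing is raised
for i in s_arr:
    try:
        str(i)
    except ValueError:
        pass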
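For anyone who wants to rerun this comparison without the deprecated `pd.util.testing` helpers, here is a standard-library sketch that also varies the invalid-data rate, which is the number asked about earlier in the thread. `make_strings` and `time_int_parse` are hypothetical helpers written for this example, and `N` is much smaller than the 10,000,000 used above so the run stays short; absolute timings will differ by machine.

```python
import random
import string
import time


def make_strings(n, invalid_rate, seed=0):
    """Build n 10-character strings; about invalid_rate of them cannot be parsed as int."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        if rng.random() < invalid_rate:
            # Letters only, so int() always raises ValueError.
            out.append("".join(rng.choices(string.ascii_letters, k=10)))
        else:
            # Digits only, so int() always succeeds.
            out.append("".join(rng.choices(string.digits, k=10)))
    return out


def time_int_parse(strings):
    """Return (elapsed seconds, failure count) for calling int() on every string."""
    failures = 0
    start = time.perf_counter()
    for s in strings:
        try:
            int(s)
        except ValueError:
            failures += 1
    return time.perf_counter() - start, failures


N = 100_000
for rate in (0.0, 0.01, 0.5, 1.0):
    elapsed, failures = time_int_parse(make_strings(N, rate))
    print(f"invalid rate {rate:.0%}: {elapsed:.3f}s, {failures} failures")
```

Unlike the `int(i)` versus `str(i)` comparison, both ends of this sweep call the same function, so the difference between the 0% and 100% rows is closer to the cost of raising and catching the exception alone.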

In my case, I would need to return a date object if it is a valid date or return the same object if it can not be parsed.
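The behaviour described here can be approximated today with a thin wrapper, sketched below. `parse_or_passthrough` is a hypothetical name, and `datetime.fromisoformat` stands in for the parser; the assumption is that the real parser signals bad input with `ValueError`. Note that the exception is still raised and caught inside the wrapper, so this changes what the caller sees but not the cost being discussed.

```python
from datetime import datetime


def parse_or_passthrough(value, parse=datetime.fromisoformat):
    """Return a datetime if value parses; otherwise return value unchanged."""
    try:
        return parse(value)
    except (ValueError, TypeError):
        # ValueError: malformed string. TypeError: not a string at all.
        return value


rows = ["2020-05-25T05:27:43", "not a date", ""]
parsed = [parse_or_passthrough(r) for r in rows]
print(parsed)
```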

movermeyer · 2020-05-28T11:05:13Z

> In my case, I would need to return a date object if it is a valid date or return the same object if it can not be parsed.

Please excuse my ignorance. It seems like an odd way to do error handling.
What does a user do afterwards? Presumably, they now have a dataframe of datetime objects mixed with strings. In this case, I suppose the next step is to filter out the strings? What would the next step be if the function could return strings?

I'm surprised that you'd want the same object back, especially since it's just as fast to return None.
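As a rough illustration of the follow-up step in question, the sketch below returns None for unparseable input and then separates good rows from bad ones with a plain `is None` check instead of a type check. `parse_or_none` is a hypothetical wrapper around `datetime.fromisoformat`, used here only as a stand-in parser.

```python
from datetime import datetime


def parse_or_none(value, parse=datetime.fromisoformat):
    """Return a datetime if value parses, otherwise None."""
    try:
        return parse(value)
    except ValueError:
        return None


rows = ["2020-05-25T05:27:43", "not a date", "2020-05-28T11:05:13"]
parsed = [parse_or_none(r) for r in rows]

# Every surviving element is a datetime, so no isinstance() check is needed.
valid = [d for d in parsed if d is not None]
# The original strings are still available from the input when needed.
rejected = [r for r, d in zip(rows, parsed) if d is None]

print(valid)
print(rejected)
```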

argenisleon changed the title ~~Han~~ Handling error without raising exceptions May 25, 2020

argenisleon changed the title ~~Handling error without raising exceptions~~ Handling errors without raising exceptions May 25, 2020
