Handling errors without raising exceptions #90

Open
argenisleon opened this issue May 25, 2020 · 3 comments

Comments

@argenisleon

argenisleon commented May 25, 2020

I need to parse millions of rows, and raising exceptions could slow down the process. Is there any way to return None, or just the same object, instead?

argenisleon changed the title from "Han" to "Handling error without raising exceptions" on May 25, 2020
argenisleon changed the title from "Handling error without raising exceptions" to "Handling errors without raising exceptions" on May 25, 2020
@movermeyer
Collaborator

movermeyer commented May 25, 2020

Summary

I'm not against the idea of returning None (after the user applies some flag/setting), assuming that it can be shown to speed things up.

What is your use-case?

I would like to know more about your use-case though.
How high is your invalid data rate that the cost of Exception creation matters? Where is this data coming from?

I'm mostly surprised and curious.

> could slow down the process

Have you tested it to get a sense of the slowdown? Or is this speculation?

Optimized for valid data

The length of time that ciso8601 takes while parsing invalid data varies a lot. This is because:

  • "How" the data is invalid determines how much of the parsing is done before it is noticed as invalid
    • ex. an empty string fails to parse faster than a string with invalid data in the time zone, since there is much less parsed
  • the code was optimized with the assumption of a low invalid data rate.
    • ex. ciso8601 does the full parse and validation of the first fields in the datetime, even before the later fields have been verified as existing.

If one happened to know that their data was incorrect in specific ways, one could re-organize the code to have better average run-time over the invalid data. But in general, I'm not sure that we can know that.

Which is to say that, in general, we should not expect invalid data to parse as quickly as valid data (though we can possibly do better).
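
To illustrate the point about "how" the data is invalid, here is a toy sketch. It is not ciso8601's actual implementation (which is written in C); `check_date` and its field limits are hypothetical and only handle `YYYY-MM-DD`. It validates fields left to right and reports how many fields were examined before the input was accepted or rejected:

```python
from typing import Tuple


def check_date(value: str) -> Tuple[bool, int]:
    """Toy left-to-right validator for YYYY-MM-DD (hypothetical, illustration only).

    Returns (is_valid, fields_checked) so the amount of work done is visible.
    """
    # Allowed range per field: year, month, day (day-of-month is not checked per month).
    limits = [(1, 9999), (1, 12), (1, 31)]
    parts = value.split("-")
    fields_checked = 0
    for part, (low, high) in zip(parts, limits):
        fields_checked += 1
        # Reject as soon as a field is non-numeric or out of range.
        if not part.isdecimal() or not low <= int(part) <= high:
            return False, fields_checked
    # Every examined field was fine; the input is valid only if there were exactly three.
    return len(parts) == 3, fields_checked


print(check_date(""))            # rejected at the first field
print(check_date("2020-05-xx"))  # rejected only at the third field
print(check_date("2020-05-25"))  # accepted
```

An empty string is rejected after one field, while a bad day is rejected only after the year and month have already been fully validated, so the second kind of invalid input costs more.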


Again, I'm not against the idea, just providing some background. I'd like to see some analysis of the slowdown done before we make this change (perhaps I'll find time for this...maybe...)

@argenisleon
Author

@movermeyer thanks for the fast response.

I am working on Optimus https://github.com/ironmussa/Optimus/tree/develop-3.0. Optimus is a library to process and explore big data using pyspark/dask.

Because this is big data, in my particular use case I cannot know:

  • The % of valid data
  • The size

I would like to parse strings into dates as fast as possible.

About the performance: if I try to parse these strings to int, each call raises an exception, and it takes about 10x as long as with valid data. This takes approximately 13 seconds:

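import pandas as pd

N = 10000000
# Random 10-character alphanumeric strings; almost none of them parse as an int.
# Note: pd.util.testing has been deprecated since pandas 1.0.
s_arr = pd.util.testing.rands_array(10, N)

# Next notebook cell (%%time has to be the first line of its cell):
%%time
for i in s_arr:
    try:
        int(i)
    except ValueError:  # raised for every non-numeric string
        pass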

This version, which never raises, takes 1.76 seconds:

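import pandas as pd

N = 10000000
# Same random strings as in the first benchmark.
s_arr = pd.util.testing.rands_array(10, N)

# Next notebook cell (%%time has to be the first line of its cell):
%%time
for i in s_arr:
    try:
        str(i)  # always succeeds, so the except branch is never taken
    except Exception:
        pass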
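
For anyone who wants to reproduce this outside a notebook, here is a rough sketch using the standard-library `timeit` module. It compares `int()` on invalid strings against `int()` on valid digit strings, so both loops do the same kind of work. The helper name `count_parsed`, the sample strings, and the smaller `N` are assumptions made for illustration; absolute timings will differ by machine and Python version.

```python
import timeit


def count_parsed(values):
    """Try int() on every value and return how many parsed successfully."""
    parsed = 0
    for value in values:
        try:
            int(value)
            parsed += 1
        except ValueError:
            # Invalid input: swallow the exception, as in the loops above.
            pass
    return parsed


N = 100_000  # assumed size, much smaller than the 10 million rows above
invalid = ["abcdefghij"] * N  # int() raises ValueError for every element
valid = ["1234567890"] * N    # int() succeeds for every element

t_invalid = timeit.timeit(lambda: count_parsed(invalid), number=5)
t_valid = timeit.timeit(lambda: count_parsed(valid), number=5)

print(f"invalid input: {t_invalid:.3f} s")
print(f"valid input:   {t_valid:.3f} s")
```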

In my case, I would need to return a date object if it is a valid date or return the same object if it can not be parsed.
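
As a rough sketch of that behaviour in user code today: the wrapper below is hypothetical, not an existing ciso8601 feature. It uses the standard library's `datetime.fromisoformat` as a stand-in parser; the assumption is that any parser passed in signals invalid input with `ValueError` or `TypeError`. It still pays the cost of the exception internally, so it only shows the desired interface, not the speed-up.

```python
from datetime import datetime


def parse_or_passthrough(value, parser=datetime.fromisoformat):
    """Return a datetime if `value` parses, otherwise return `value` unchanged.

    `parser` is any callable that raises ValueError or TypeError on bad input
    (assumption); the default is the standard library's ISO 8601 parser.
    """
    try:
        return parser(value)
    except (ValueError, TypeError):
        # Not a valid date: hand back the same object.
        return value


print(parse_or_passthrough("2020-05-25"))  # a datetime object
print(parse_or_passthrough("not a date"))  # the original string
```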

@movermeyer
Collaborator

> In my case, I would need to return a date object if it is a valid date or return the same object if it can not be parsed.

Please excuse my ignorance. It seems like an odd way to do error handling.
What does a user do afterwards? Presumably, they now have a dataframe of datetime objects mixed with strings. In this case, I suppose the next step is to filter out the strings? What would the next step be if the function could return strings?

I'm surprised that you'd want the same object back, especially since it's just as fast to return None.
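
For comparison, a minimal sketch of the None-returning variant and the filtering step that would follow it. `parse_or_none` is a hypothetical helper built on the standard library's `datetime.fromisoformat`, not a ciso8601 API, and the sample values are made up:

```python
from datetime import datetime


def parse_or_none(value):
    """Return a datetime if `value` parses as ISO 8601, otherwise None."""
    try:
        return datetime.fromisoformat(value)
    except (ValueError, TypeError):
        return None


raw = ["2020-05-25", "garbage", "2020-05-26T10:30:00"]  # made-up sample data
parsed = [parse_or_none(v) for v in raw]

# A single `is not None` check separates good rows from bad ones,
# and the original strings are still available for reporting.
valid = [d for d in parsed if d is not None]
rejected = [v for v, d in zip(raw, parsed) if d is None]

print(valid)
print(rejected)
```

With None as the sentinel, the result column holds a single type plus missing values, rather than datetime objects mixed with strings.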
