-
-
Notifications
You must be signed in to change notification settings - Fork 112
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Migrate http client from requests to httpx async client #739
base: main
Are you sure you want to change the base?
Conversation
ce26428
to
8dd8337
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting PR. I gave it a look through and it looks fine as far as I can tell, but I worry about how this will affect callers of this library. Won't they all need to be rewritten whever this version goes live?
Yes, callers will need to be tweaked to make async calls but that should be fairly straight forwards. |
Hm, I was afraid of that, this library is used by others as well. If this is an important change to make, we'll want to do a full version bump, or if there's a way to keep it backwards compatible, we'll want to do that. If we do do the version bump, is there a way for others using the library to upgrade without going full async? We should write something in the changelog about that too before releasing this. |
Not really a good way to do this without making things a lot more complex.
Yeah, I mean we can generally use |
So, for many years I've occasionally seen libraries that use tornado or other async frameworks, and whenever I did, I ran away as quickly as I could, knowing that it was a whole universe of bugs I didn't understand or what to learn about. I'm fully on board now that async is the right architecture for this library, but I'm afraid that if we make this library async-first, that'll have a similar affect on anybody that is considering using it, even if we document the heck out of it. I know as recently as a year ago, I'd have run away from this library as soon as I saw the word async. So I'm not sure what to do with this PR. Async is the right thing to do architecturally, but the wrong thing to do for the community, at least for now. A couple thoughts:
I'm also aware of the fact that Juriscraper has a number of things about it that are really bad:
I can't help but think that the thing to do is abandon ship and start breaking things out. First, we do async PACER in a new library then we do states, one at a time, in a greenfield approach, then we see if there's anything else we care about that's left behind. |
I feel like it was alluded to above, but what's wrong with synchronous wrappers around asynchronous core functions?
I do not think that the PACER/CMECF code is unpopular because it is "buried." |
Well, it's just kind of ugly? Everybody that uses Juriscaper would have to sprinkle little
Sure, so we can fix that too with a clean break. |
Sorry, no. I mean we retain our existing interface and make that interface a syncrhonous wrapper around the asynchronous core interface. No changes for juriscraper callers or users, unless they want to use async functions, in which case they have to call the async version. |
Oh, so you're proposing the django thing of every method having an |
Well native asyncio is a bit different than the many older 3rd party library variants since it's built in to python.
I think this probably the best approach.
Was the opposite for me generally, non-async functions doing network-IO always seemed to be potential performance foot-gun to be avoided.
IMO for something like this incremental refactoring is probably a better option.
I think it's better to have calling code handle the synchronous wrappers and not implement it in juriscraper directly, we don't really want to make assumptions here about say the |
What I like about forking is that it allows the old Juriscraper to continue living on (in infamy?), whereas if we don't fork, it's pretty hard to do bug fixes or anything like that for folks that are on the old version. This feels big enough to me that a fork makes sense. It's also probably time to fix all these many issues, and make a more focused PACER-only library.
Are there many others and would they be incompatible? Django made it's decision, I guess, but it feels like if Django can do a-prefixed async functions alongside the regular ones, we should be able to also. (But if we fork, this doesn't matter so much.) |
I mean, you can create a legacy branch if there's any actual interest in a sync version still(which is kind of an unknown at this point).
Well there's stuff like
Well django internally is more sync oriented still than async and tends to shim things in the reverse direction(ie sync native functions shimmed to have async semantics using thread pools), while this would be native async being shimmed to sync so it's not quite the same. |
You're focused on the PACER part, right? Why not do a async-native PACER fork in a fresh library? It shouldn't be too terribly hard and it'd be useful for others. We can start there, then do something for the state scrapers if we want to, in a separate approach/library/version/etc. But if we look just at PACER, I think that shouldn't be too terribly hard? |
Well anything doing network IO I was trying to convert.
Maintaining two codebases that basically do the same thing seems to be a lot of extra work for no significant benefit(unless there really are a lot of applications that depend on juriscraper that can't be modified to handle an async API easially). |
I don't propose we maintain the code here anymore. We would say that it's old and that we're focused on ajuriscraper or whatever we wind up calling it, and that we recommend people use that instead. |
I mean, if the code here wouldn't be maintained then what's the point of creating a separate project? |
That would let us untether parts of Juriscraper. So PACER stuff goes into a new library and the PACER stuff here goes into a slow death spiral. Meanwhile, the rest of Juriscraper can continue advancing. It also gives a strong signal of "This is something new and different," and lets us think in greenfield kinds of ways. |
It's a bit unclear to me what we want to untether exactly, but I'm not seeing an advantage to creating a separate juriscraper project for that.
I don't really see how that would be better than just refactoring things here as needed, then we don't have two implementations doing effectively the same thing.
If there's two different versions of juriscraper that have significant overlapping functionality I think it becomes less clear which one projects should be using vs a single version. Also large flag day upgrades tend to be difficult and risky in general so they often are avoided vs say more incremental migrations/refactoring over time. |
I think the biggest advantage is that when you Google for "pacer python" you'd get a package called The second advantage is just making the package smaller in general. If all you want to do is scrape PACER, you get a lean library just for that. If you want to fix a bug in the PACER library, you can do that and run tests more quickly, without worrying about the rest, etc. I much prefer that as a library user, rather than big ones that do a lot more than I need. For example, if we do this, it gets a lot easier to document the library, than if we want to document all of Juriscraper and its various modules. The third thing is that it'd allow us to open our minds to bigger changes to the API (go async, clean up a few other things), because we can change things without worrying about breaking other people's systems or demanding that they sprinkle |
I mean, juriscraper is the top active python pacer scraping library in google, the others are either completely abandoned/unmaintained or unrelated to pacer. Splitting things to a separate library may make things worse there.
You can run a subset of tests already, for example like this:
As long as the library is modular where you can pull in on the functionality you need I don't really think this is a big issue, and having some things in the same library does have benefits in regards to simplifying code sharing/maintenance.
I still don't see a good reason for forking the library unless there's a longterm plan to maintain both versions which is kinda a PITA. By the way I tried to find any active projects using juriscraper other than courtlistener but didn't see any so this seems likely to be only a theoretical issue to some degree. I'd say we should refactor/clean things up here and if anyone complains then see what migration/compatibility strategy appears to best address their concerns at that point. Trying to prematurely anticipate if changing something like going async will cause major downstream issues is kinda hard without knowing specifics regarding the downstream users use case/architecture. |
Discussed via Slack, and @ttys0dev wins. I'll give this another look early next week. |
Man, time flies, but it'd be good to return to this, if you're interested, @ttys0dev. The new challenge we have is that a lot of work is being done on Juriscraper now by our new developer, @grossir and by @flooie, so we'd need to have a branch with this that stays up to date with their work until we cut this version, and then we'd quickly need to have a PR in CL that's ready to use the new version. I think I'd propose the following, but I'd love input from the three of you:
An optional step is to repeat this process for each module of Juriscraper, so we do PACER first, then opinions, then oral arguments, etc, but I'm not sure it's necessary. It would be safer, but it'd also wind up with Juriscaper 2.0, 2.1, 2.2, etc (i.e., lots of breaking releases). Would the approach above be safe, incremental, and efficient to get this big change done? |
No description provided.