Loading big repodata (uncompressed other.xml that has 3GB) #493
Oh no, making that work will not be much fun. And the next limit will be 4GB, of course. Regarding a streaming implementation: you would need to do the parsing and solv file writing in one step. Certainly doable, but not trivial and quite a bit of code. Once the solv file is written and loaded again, the memory usage is quite small, as the changelog author/text is put into the "paged" part of the solv file ("vertical data") and thus not read in.
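To make the write-then-reload pattern concrete, here is a minimal sketch built on libsolv's public entry points (`repo_add_rpmmd`, `repo_write`, `repo_empty`, `repo_add_solv`). It assumes the current two-argument `repo_write(repo, fp)` form, omits error cleanup and real cache-path handling, and is only an illustration of the idea, not dnf's actual code:

```c
#include <stdio.h>
#include <solv/pool.h>
#include <solv/repo.h>
#include <solv/repo_rpmmd.h>
#include <solv/repo_solv.h>
#include <solv/repo_write.h>

/* Parse other.xml, write the solv cache, then drop the parsed data and
 * reload from the cache so the changelog author/text lives in the paged
 * "vertical" section on disk and is only fetched on demand. */
static int cache_and_reload(Pool *pool, FILE *otherxml, const char *solvpath)
{
    Repo *repo = repo_create(pool, "big-repo");

    /* 1. parsing is where the memory peak happens */
    if (repo_add_rpmmd(repo, otherxml, NULL, 0))
        return -1;

    /* 2. write the solv cache file */
    FILE *fp = fopen(solvpath, "w");
    if (!fp || repo_write(repo, fp))
        return -1;
    fclose(fp);

    /* 3. throw away the parsed data and reload it from the cache;
     *    the vertical data now stays on disk in 32K pages */
    repo_empty(repo, 0);
    fp = fopen(solvpath, "r");
    if (!fp || repo_add_solv(repo, fp, 0))
        return -1;
    fclose(fp);
    return 0;
}
```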
Is that >3GB repo public, so that I can do some testing?
Oh... would that be a reason to reload the cache data immediately after writing? Does it somehow reduce the memory footprint (even though the data were already loaded into the same pool)? I removed the reloading in libdnf 5 because I saw no purpose to it and it had a lackluster comment that didn't really explain it. What does the "paged" part of the solv file (vertical data) mean? Is it not read into memory?
The data is segmented into 32K pages and read on demand.
From the file? Does libsolv store the path and reopen the file if needed? Because we are passing an open file descriptor.
Hi. I submitted one of the original bugs to RHEL. I'd just like to point out that the huge XML file is the other.xml, not one of the primary ones. Specifically, it's the %changelog entries. For packages with huge, ancient %changelogs, if there are 20 copies of them, those ancient %changelog entries are replicated 20x, which greatly bloats this file. Anyway, I would not expect this data to be crucial to solving package dependencies. So, for other.xml data and maybe also for filelists.xml data, a lazy loading approach might make sense.
Clarification: *...if the repo contains 20 different versions of these packages...
I don't think it is easily available, but I made a Python script that should suffice: https://github.com/kontura/createrepo_dummy
Regarding the paging: repo_add_solv dup()s away the file descriptor and uses pread to read the pages.
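For anyone not familiar with that trick, a dup()-plus-pread() paging scheme looks roughly like the sketch below. The struct and function names are made up for illustration; this is not libsolv's actual code:

```c
#include <stdint.h>
#include <unistd.h>

#define PAGE_SIZE (32 * 1024)          /* 32K pages, as described above */

struct paged_store {
    int   fd;                          /* dup()'d descriptor owned by the store */
    off_t base;                        /* file offset where the vertical data starts */
};

/* Duplicate the caller's descriptor so the pages can be re-read later,
 * independently of whatever the caller does with its own FILE/fd. */
static int paged_store_init(struct paged_store *ps, int callers_fd, off_t base)
{
    ps->fd = dup(callers_fd);
    ps->base = base;
    return ps->fd < 0 ? -1 : 0;
}

/* Fetch a single page on demand; pread() does not disturb any file offset. */
static ssize_t paged_store_read_page(const struct paged_store *ps,
                                     uint32_t pagenum, void *buf)
{
    return pread(ps->fd, buf, PAGE_SIZE, ps->base + (off_t)pagenum * PAGE_SIZE);
}
```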
Regarding the lazy conversion to a solv file: doesn't dnf do that already?
Regarding laziness: I was talking about a more fine-grained laziness, where dnf could request changelog or filelists data only for specific packages, so that it didn't need to load the entire other.xml or filelists.xml.
That would indeed be nice, but then we'd run into weird cases if the remote repo is changed and the files cannot be accessed anymore. So I think the default should be to do the download and the solv file conversion on demand.
@etoddallen Please point me to that Bugzilla, because I will gladly raise the volume on it. This has caused a multitude of problems for us. The root of the issue is that they have been keeping all of the changelogs, for every package, due to a bug. The packages that are updated most frequently have the longest changelog lists, and also the most copies of those changelogs, because all the old packages are kept as well. I'm not going to work out the big-O on that, but you can see why it scales poorly. If you take the same metadata and drop all but the last 10 changelogs per package, it ends up at about 10% of the size.
Sure. The one that I reported is here:
bump
Some infra is being updated to restrict the number of changelogs kept per package, but I don't know precisely when it will all be in production. Most distros are using createrepo_c, I believe, so they already get this for "free".
The gigantic other.xml issue is now resolved at the source. The issue was partly that, on top of keeping full copies of the metadata for every version of the RPM, some RPMs such as OpenSSL, Samba and so forth actually keep changelog history going back more than 20 years. So it was a really huge quantity of data. Restricting to the 10 most recent changelogs (as most distributions do) shrunk RHEL's other.xml.gz by 90-99%. So this issue as written can probably be closed, though we may still want one open for the
Apologies for so many issues lately, but I have got another one.
In RHEL we have some big repodata, such as an other.xml that is 3 GB uncompressed.
This fails to load because of an overflow. I think it happens e.g. here: https://github.com/openSUSE/libsolv/blob/master/src/repodata.c#L2528 since `data->attrdatalen` is an `unsigned int` but `repodata_set()` takes an `Id` (this conversion happens at other places as well). The value is then stored, and during internalization it tries to use it as an `int` in `extdata`, which results in a crash. Is there something more we can do about this apart from catching the overflow?
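For what it's worth, a toy program shows the truncation directly. The `Id` typedef mirrors libsolv's `typedef int Id` from pooltypes.h, but the rest is just an illustration, not libsolv code:

```c
#include <stdio.h>

typedef int Id;   /* mirrors libsolv's pooltypes.h */

int main(void)
{
    /* pretend the accumulated attr/string data has grown past 2 GiB */
    unsigned int attrdatalen = 0x90000000u;   /* ~2.25 GiB */
    Id as_id = (Id)attrdatalen;               /* out of range for a signed int;
                                                 wraps to a negative value on
                                                 common platforms */
    printf("offset %u stored as Id becomes %d\n", attrdatalen, as_id);
    return 0;
}
```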
I tried converting some of the involved parameters and variables to `Offset`s but didn't manage to make it work. I am also not sure if it is even a possible/valid approach; if I understand correctly, we would need some way to differentiate `Id`s from `Offset`s? Perhaps a new `REPOKEY_TYPE_STR_OFFSET`. Still, that doesn't seem like a scalable approach, since it would only buy some time until the `unsigned int` overflows. Maybe another possibility could be adding additional string data spaces?
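For reference, both typedefs live in libsolv's pooltypes.h, which is why a plain switch to `Offset` would only move the ceiling from roughly 2 GiB to 4 GiB; a tiny illustration:

```c
#include <limits.h>
#include <stdio.h>

typedef int Id;              /* as in libsolv's pooltypes.h */
typedef unsigned int Offset; /* as in libsolv's pooltypes.h */

int main(void)
{
    /* the largest byte offset each type can address into the string data */
    printf("Id     ceiling: %d bytes (~2 GiB)\n", INT_MAX);
    printf("Offset ceiling: %u bytes (~4 GiB)\n", UINT_MAX);
    return 0;
}
```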
A related issue is that in order to parse the metadata we need to load it all into RAM at once, even though the resulting solv file is around one third of the size. Could there be some streaming parsing model, where we process and internalize the metadata continually?
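Purely as a thought experiment, such a streaming pipeline could look roughly like the sketch below; none of these helpers exist in libsolv today, and all the names (`parse_next_package`, `solv_stream_add`, ...) are invented for illustration:

```c
#include <stdio.h>
#include <stdlib.h>

struct pkg_changelog { char *name; char *text; };  /* one <package> worth of data */
struct solv_stream   { FILE *out; };               /* stand-in for an incremental solv writer */

/* Stub: a real version would drive an XML pull/SAX parser (e.g. expat)
 * and return 1 per parsed package, 0 at end of input. */
static int parse_next_package(FILE *xml, struct pkg_changelog **pkg)
{
    (void)xml; (void)pkg;
    return 0;
}

/* Stub: a real version would append the package's data to the solv file. */
static int solv_stream_add(struct solv_stream *s, struct pkg_changelog *pkg)
{
    (void)s; (void)pkg;
    return 0;
}

/* Parse one package at a time and emit it immediately, so neither the whole
 * XML nor the full in-memory representation has to exist at once. */
static int convert_streaming(FILE *xml, struct solv_stream *out)
{
    struct pkg_changelog *pkg;
    while (parse_next_package(xml, &pkg) > 0) {
        if (solv_stream_add(out, pkg))
            return -1;
        free(pkg->name);
        free(pkg->text);
        free(pkg);
    }
    return 0;
}

int main(void)
{
    struct solv_stream s = { stdout };
    return convert_streaming(stdin, &s);
}
```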