-
Notifications
You must be signed in to change notification settings - Fork 76
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
.NET Core based parser - 2x speedup #114
Conversation
After applying these two changes, this program went from taking 22 seconds for "labels" to taking 14 seconds. It is possible to make it even faster, but that will require replacing the de-serializer with a hand-written solution, which is a lot of code to write, so it may not be worth it. |
Good news: I quickly hacked together a de-serializer for "labels". Just replace the current ParseStreamAsync() with this one: public async Task ParseStreamAsync(Stream stream)
{
int objectCount = 0;
var settings = new XmlReaderSettings
{
ConformanceLevel = ConformanceLevel.Fragment,
Async = true,
};
using XmlReader reader = XmlReader.Create(stream, settings);
await reader.MoveToContentAsync();
await reader.ReadAsync();
while (!reader.EOF)
{
if (reader.Name == _typeName)
{
var lbl = new label();
while (reader.Read())
{
if (reader.IsStartElement(_typeName))
break;
if (reader.IsStartElement("id"))
lbl.id = reader.ReadElementContentAsString();
if (reader.IsStartElement("name"))
lbl.name = reader.ReadElementContentAsString();
if (reader.IsStartElement("contactinfo"))
lbl.contactinfo = reader.ReadElementContentAsString();
if (reader.IsStartElement("profile"))
lbl.profile = reader.ReadElementContentAsString();
if (reader.IsStartElement("data_quality"))
lbl.data_quality = reader.ReadElementContentAsString();
if (reader.IsStartElement("parentLabel"))
lbl.parentLabel = new parentLabel { id = reader.GetAttribute("id"), name = reader.ReadElementContentAsString() };
if (reader.IsStartElement("images"))
{
var images = new List<image>();
while (reader.Read() && reader.IsStartElement("image"))
{
var image = new image { type = reader.GetAttribute("type"), width = reader.GetAttribute("width"), height = reader.GetAttribute("height") };
images.Add(image);
}
lbl.images = images.ToArray();
}
if (reader.IsStartElement("urls"))
{
reader.Read();
var urls = new List<string>();
while (reader.IsStartElement("url"))
{
var url = reader.ReadElementContentAsString();
if (!string.IsNullOrWhiteSpace(url))
urls.Add(url);
}
lbl.urls = urls.ToArray();
}
}
if (lbl.id is null)
continue;
var obj = (T)(object)lbl;
await _exporter.ExportAsync(obj);
objectCount++;
if (objectCount % _throttle == 0) OnSucessfulParse(null, new ParseEventArgs { ParseCount = objectCount });
}
else
{
await reader.ReadAsync();
}
}
await _exporter.CompleteExportAsync(objectCount);
} |
@MuleaneEve - that was a fantastic improvement, thank you so much for it. Do you think you could take a look at the Edit: I actually think I figured it out in this commit, but it seems clumsy. Could you still please take a look and see if there's a more elegant way? |
The way you are parsing I'm glad that you were able to integrate these improvements so quickly! |
Some comments on the new branch:
discogs-xml2db/alternatives/dotnet/discogs/DiscogsLabel.cs Lines 20 to 30 in 864887d
Currently, we are "lucky" that the Here is my implementation: I renamed the if (reader.IsStartElement("sublabels"))
{
reader.Read();
var sublabelList = new List<labelReference>();
while (reader.IsStartElement("label"))
{
var label = new labelReference
{
id = reader.GetAttribute("id"),
name = reader.ReadElementContentAsString()
};
sublabelList.Add(label);
}
sublabels = sublabelList.ToArray();
} With that, we never need to check |
Famous last words... Joking aside, it's gotten better (the xml files didn't even use to be well-formed at one point - "Pepperidge Farms meme"); still would like to be a bit midful. |
OK for the new features. Regarding invalid data in general, the way I usually deal with this possibility is that I add a lot of validation code in Debug mode, that way, when I get a new data dump file, I validate it once, without affecting performance in Release mode. |
@MuleaneEve - I've pushed the master parser to the dotnet_parser_specific_xml_parsing. In trying to make it work, I've noticed a few peculiarities of this new SAX-style parsing system. To be honest, I'm completely new to the finer details of XML stream parsing, so there may be obvious "d'oh" to someone more experienced.
The speedup is amazing (artist get parsed in about 50 seconds on this machine vs almost 11 minutes for the python version). I need to become more comfortable with the SAX-style parsing and its idiosyncrasies and I'm hoping that meanwhile you can help me make that version more resilient. I'm inclining to close this particular PR as soon as we get some feedback from people testing it and take on the conversation to a new PR opened against the "even faster speedup" 😄 branch. |
Looks good to me. The Regarding the forward-only parsing, it is a lot easier when the elements are always in the same order, that way, the parsing code can match that order, and you won't even need the For XML documents that are more free-formed, the only choice is the Regarding white spaces, the trick is to replace public static bool MoveToElement(this XmlReader reader)
{
reader.Read();
while (reader.NodeType != XmlNodeType.Element && !reader.EOF)
reader.Skip();
return !reader.EOF;
} This is similar to MoveToContent() but focused on elements. I recommend that you read the whole documentation of XmlReader. There are indeed a lot of little details to be aware of. |
# Conflicts: # .gitignore
This branch contains a .NET Core based parser which is at least 2x faster than the Python one.
Note: tests consistent (same OS, files, etc) only across same file
This alternative parser:
How to test:
discogs
executable and adiscogs.pdb
support file.discogs
executable, passing it the path to discogs files, for example:./discogs /tmp/discogs/discogs_20200806_labels.xml.gz
/tmp/discogs
in the above example.--gz
argument--dry-run
argument@ijabz, @berz - do you think you could take a swing at this?
I tested on: