.NET Core based parser - 2x speedup #114

philipmat · 2020-08-31T15:07:49Z

This branch contains a .NET Core based parser which is at least 2x faster than the Python one.

| File | Record count | Python | C# |
| --- | --- | --- | --- |
| discogs_20200806_artists.xml.gz | 7,046,615 | 6:22 | 2:35 |
| discogs_20200806_labels.xml.gz | 1,571,873 | 1:15 | 0:22 |
| discogs_20200806_masters.xml.gz | 1,734,371 | 3:56 | 1:57 |
| discogs_20200806_releases.xml.gz | 12,867,980 | 1:45:16 | 42:38 |

Note: test conditions (same OS, files, etc.) are consistent only within each file's row, so compare the Python and C# timings per file rather than across files.

This alternative parser:

exports compatible CSV files (with the exception of track_id, which has been made more resilient but requires a column change);
is self-contained: it does not require any program or module to be installed;
compresses using gzip (.gz) rather than Python's bzip2 (.bz2);
lacks, as of this PR, a few "nice" features of the main parser (API counts, etc.), but those can be built on quickly.

How to test:

In the repo, visit Actions -> DotNet Build -- direct link
Find and open the latest build. Its "Artifacts" section lists the builds available for multiple platforms.
Download and un-zip the build for the applicable platform. Within it there will be 2 files: a discogs executable and a discogs.pdb support file.
Place these two files in a local folder, then invoke the discogs executable, passing it the path to discogs files, for example: ./discogs /tmp/discogs/discogs_20200806_labels.xml.gz
csv files will be exported in the same folder as the discogs file, /tmp/discogs in the above example.
to compress the csv files, pass the --gz argument
to perform a dry-run, only outputting file counts, pass the --dry-run argument

@ijabz, @berz - do you think you could take a swing at this?

I tested on:

macOS Catalina (extensively)
Windows 10 v2004
Ubuntu Focal - ubuntu:latest docker image


alternatives/dotnet/discogs/CsvExporter.cs

alternatives/dotnet/discogs/Parser.cs

MuleaneEve · 2020-09-01T20:18:48Z

After applying these two changes, this program went from taking 22 seconds for "labels" to taking 14 seconds.
"Masters" went from 1:21 to 57 seconds on my PC.

It is possible to make it even faster, but that will require replacing the de-serializer with a hand-written solution, which is a lot of code to write, so it may not be worth it.
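For context on the kind of change behind the timings above: the commit list on this PR mentions a 1 MB buffer and "avoids creating a new serializer for every node". The following is a minimal sketch of the second idea only, not the PR's actual code. `LabelStub`, `ReuseSketch`, and `ReadIds` are hypothetical names, and the input is assumed to be a root element containing `<label>` children.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Serialization;

[XmlRoot("label")]
public class LabelStub            // hypothetical stand-in for the real label class
{
    [XmlElement("id")] public string Id { get; set; }
}

public static class ReuseSketch
{
    // Built once and shared; constructing an XmlSerializer per node is the cost being avoided.
    private static readonly XmlSerializer Serializer = new XmlSerializer(typeof(LabelStub));

    public static List<string> ReadIds(TextReader input)
    {
        var ids = new List<string>();
        using var reader = XmlReader.Create(input);
        reader.MoveToContent();   // position on the root element, e.g. <labels>
        reader.Read();            // step to the first node inside the root
        while (reader.MoveToContent() == XmlNodeType.Element)
        {
            if (reader.Name == "label")
                // Deserialize consumes the whole <label> element and leaves the reader after it.
                ids.Add(((LabelStub)Serializer.Deserialize(reader)).Id);
            else
                reader.Skip();    // not a label: skip the element and its children
        }
        return ids;
    }
}
```

Usage: `ReuseSketch.ReadIds(new StringReader(xml))` returns the ids in document order.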

MuleaneEve · 2020-09-01T21:36:53Z

Good news: I quickly hacked together a de-serializer for "labels".
This is a proof of concept that shows that it is possible to parse all labels in under 4 seconds ;)
The output CSV files are identical.

Just replace the current ParseStreamAsync() with this one:

public async Task ParseStreamAsync(Stream stream)
{
  int objectCount = 0;
  var settings = new XmlReaderSettings
  {
    ConformanceLevel = ConformanceLevel.Fragment,
    Async = true,
  };
  using XmlReader reader = XmlReader.Create(stream, settings);

  await reader.MoveToContentAsync();
  await reader.ReadAsync();
  while (!reader.EOF)
  {
    if (reader.Name == _typeName)
    {
      var lbl = new label();
      while (reader.Read())
      {
        if (reader.IsStartElement(_typeName))
          break;
        if (reader.IsStartElement("id"))
          lbl.id = reader.ReadElementContentAsString();
        if (reader.IsStartElement("name"))
          lbl.name = reader.ReadElementContentAsString();
        if (reader.IsStartElement("contactinfo"))
          lbl.contactinfo = reader.ReadElementContentAsString();
        if (reader.IsStartElement("profile"))
          lbl.profile = reader.ReadElementContentAsString();
        if (reader.IsStartElement("data_quality"))
          lbl.data_quality = reader.ReadElementContentAsString();
        if (reader.IsStartElement("parentLabel"))
          lbl.parentLabel = new parentLabel { id = reader.GetAttribute("id"), name = reader.ReadElementContentAsString() };
        if (reader.IsStartElement("images"))
        {
          var images = new List<image>();
          while (reader.Read() && reader.IsStartElement("image"))
          {
            var image = new image { type = reader.GetAttribute("type"), width = reader.GetAttribute("width"), height = reader.GetAttribute("height") };
            images.Add(image);
          }
          lbl.images = images.ToArray();
        }
        if (reader.IsStartElement("urls"))
        {
          reader.Read();
          var urls = new List<string>();
          while (reader.IsStartElement("url"))
          {
            var url = reader.ReadElementContentAsString();
            if (!string.IsNullOrWhiteSpace(url))
              urls.Add(url);
          }
          lbl.urls = urls.ToArray();
        }
      }
      if (lbl.id is null)
        continue;
      var obj = (T)(object)lbl;
      await _exporter.ExportAsync(obj);

      objectCount++;
      if (objectCount % _throttle == 0) OnSucessfulParse(null, new ParseEventArgs { ParseCount = objectCount });
    }
    else
    {
      await reader.ReadAsync();
    }
  }
  await _exporter.CompleteExportAsync(objectCount);
}


philipmat · 2020-09-02T02:59:03Z

@MuleaneEve - that was a fantastic improvement, thank you so much for it.
I've started to make some design changes and they're on the dotnet_parser_specific_xml_parsing branch.

Do you think you could take a look at the members parsing for artists - I can't quite nail the XML node parsing for it.

https://github.com/philipmat/discogs-xml2db/blob/dotnet_parser_specific_xml_parsing/alternatives/dotnet/discogs/DiscogsArtist.cs#L111

Edit: I actually think I figured it out in this commit, but it seems clumsy. Could you still please take a look and see if there's a more elegant way?

MuleaneEve · 2020-09-02T07:19:40Z

The way you are parsing members looks good. I wonder why the Discogs dump bothers mixing id elements in there...

I'm glad that you were able to integrate these improvements so quickly!
I will take a look later today.

MuleaneEve · 2020-09-02T10:30:56Z

Some comments on the new branch:

Would you please include artist.groups? I think that other people using your library (like me) may want that.

discogs-xml2db/alternatives/dotnet/discogs/DiscogsArtist.cs

Line 35 in 864887d

// public name[] groups {get;set;}

label.contactinfo was turned into this:

discogs-xml2db/alternatives/dotnet/discogs/DiscogsLabel.cs

Lines 20 to 30 in 864887d

    
private string contactinfo1;

public string Getcontactinfo()
{
    return contactinfo1;
}

public void Setcontactinfo(string value)
{
    contactinfo1 = value;
}

label needs to populate its sublabels:

Currently, we are "lucky" that the <sublabels /> element always comes at the end, so when parsing, we skip them without losing any data.
But, it is more correct to parse them accurately.

Here is my implementation:

I renamed the parentLabel class into labelReference, removed SubId & SubName, then added at the end of Populate():

if (reader.IsStartElement("sublabels"))
{
  reader.Read();
  var sublabelList = new List<labelReference>();
  while (reader.IsStartElement("label"))
  {
    var label = new labelReference
    {
      id = reader.GetAttribute("id"),
      name = reader.ReadElementContentAsString()
    };
    sublabelList.Add(label);
  }
  sublabels = sublabelList.ToArray();
}

With that, we never need to check IsValid(), which was a sign that the parser was missing something. It should be impossible for the Discogs dump to contain invalid data.

philipmat · 2020-09-02T13:48:10Z

I've created "Export artist groups" #116 for exporting artist groups. Please see if you're OK with my export proposal.
The current parser doesn't do this either, so I want to achieve 100% parity before I take on new features; it makes it easier to compare files.
Fixed contactinfo1. It was an overly eager VS Code helper.
As far as I can tell, the Python parser doesn't export sublabels either. Same as point 1; I've opened "Export sublabels" #117 for it.
I'll add your code and keep it commented out. The reason it was using that silly trick with SubId and SubName is that I was trying to do it with the XmlSerializer, which doesn't like multiple classes having the same name. While the new parser doesn't have that limitation, I'd like to keep the structure until this becomes the main parser, and then we can make some changes (the lowercase class and property names are highly offensive to my C# sensibilities).

It should be impossible for the Discogs dump to contain invalid data.

Famous last words... Joking aside, it's gotten better (the XML files didn't even use to be well-formed at one point - "Pepperidge Farms meme"); I would still like to be a bit mindful.

MuleaneEve · 2020-09-02T20:30:19Z

OK for the new features.
Still, you should detect <sublabels> and <groups> to skip them. The parser will be more correct then (and a bit faster).

Regarding invalid data in general, the way I usually deal with this possibility is to add a lot of validation code that only runs in Debug mode. That way, when I get a new data dump file, I validate it once, without affecting performance in Release mode.
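One way to get Debug-only validation in C# is the `[Conditional("DEBUG")]` attribute from `System.Diagnostics`: the compiler drops calls to such a method when the `DEBUG` symbol is not defined. This is a sketch of that idea, not code from the PR. `DumpValidation` and `RequireId` are hypothetical names.

```csharp
using System;
using System.Diagnostics;

public static class DumpValidation
{
    // Calls to this method are compiled only when DEBUG is defined,
    // so Release builds pay nothing for the check.
    [Conditional("DEBUG")]
    public static void RequireId(string id, string typeName)
    {
        if (string.IsNullOrWhiteSpace(id))
            throw new InvalidOperationException($"<{typeName}> element without an id");
    }
}
```

A parser could call `DumpValidation.RequireId(lbl.id, "label")` right after populating each object. An `#if DEBUG` block around the checks would be an alternative with the same effect.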

philipmat · 2020-09-03T03:21:02Z

@MuleaneEve - I've pushed the master parser to the dotnet_parser_specific_xml_parsing branch.

In trying to make it work, I've noticed a few peculiarities of this new SAX-style parsing system. To be honest, I'm completely new to the finer details of XML stream parsing, so there may be obvious "d'oh" moments here for someone more experienced.

The new parsing approach seems highly dependent on the order of nodes. Because ReadElementContentAsString advances the reader past the element it reads, a node that is not matched by one of the remaining if-tests in the current pass of the loop gets skipped.
For example, if the code tests for "year" first and "title" second, but the XML document has <title> before <year>, it will correctly populate this.title but miss this.year, because at the end of the loop body while (reader.Read()) advances past <year>.
I would also like to make it more resilient against unknown nodes. Many of the inner loops have the form while (reader.IsStartElement("a") || reader.IsStartElement("b")) { ... }. If a node <c> is introduced before <a>, the loop ends without ever parsing <a> and <b>.
The parsing also seems sensitive to whether there is any text or white space between nodes.

The speedup is amazing (artists get parsed in about 50 seconds on this machine vs almost 11 minutes for the Python version).

I need to become more comfortable with the SAX-style parsing and its idiosyncrasies and I'm hoping that meanwhile you can help me make that version more resilient.

I'm inclined to close this particular PR as soon as we get some feedback from people testing it, and to take the conversation to a new PR opened against the "even faster speedup" 😄 branch.

MuleaneEve · 2020-09-03T05:44:12Z

Looks good to me. The XmlReader-based implementation can be pushed at a later time when it is ready.

Regarding the forward-only parsing, it is a lot easier when the elements are always in the same order. The parsing code can then match that order, and you won't even need the while loop. From what I can tell with artists and labels, that order is indeed deterministic. So, assuming it is also the case for masters and releases, you can update your code accordingly. It also becomes possible to notice when an unknown element is introduced. Then, you can decide if you want to implement it or skip it.
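A minimal sketch of order-matched parsing, assuming a `<label>` whose children always appear as id, name, profile (the real dump has more fields, and this order is an assumption for illustration). `OrderedSketch`, `ReadOptional`, and `ReadLabel` are hypothetical names. Each field is read only if it is the next element, and anything left over before `</label>` is reported as unknown.

```csharp
using System.Xml;

public static class OrderedSketch
{
    // Reads an optional element that, if present, must be the next element in document order.
    private static string ReadOptional(XmlReader reader, string name) =>
        reader.MoveToContent() == XmlNodeType.Element && reader.Name == name
            ? reader.ReadElementContentAsString()
            : null;

    // Hypothetical reader; assumes a non-empty <label> element.
    public static (string Id, string Name, string Profile) ReadLabel(XmlReader reader)
    {
        reader.ReadStartElement("label");           // step inside <label>
        var id = ReadOptional(reader, "id");
        var name = ReadOptional(reader, "name");
        var profile = ReadOptional(reader, "profile");
        // Any element still pending here is one this sketch does not know about.
        if (reader.MoveToContent() == XmlNodeType.Element)
            throw new XmlException($"Unexpected element <{reader.Name}>");
        reader.ReadEndElement();                    // consume </label>
        return (id, name, profile);
    }
}
```

Throwing on an unknown element is one possible policy. Calling `reader.Skip()` instead would ignore it silently.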

For XML documents that are more free-formed, the only choice is the while loop. Still, it is possible to make it more resilient.
As a practical example, take a look at this RSS parser from Microsoft: Think of the switch on ElementType as a switch on reader.Name. And in our case, the default case would be the unknown elements...
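The switch-on-`reader.Name` idea could look roughly like the sketch below. It is not taken from the RSS parser mentioned above or from this PR. `MasterSketch` and `ReadMaster` are hypothetical names, and a non-empty `<master>` element with simple text children is assumed. Known elements are read in whatever order they arrive, and the default case skips unknown ones together with their children.

```csharp
using System.Xml;

public static class MasterSketch
{
    // Hypothetical helper: reads <title> and <year> from a <master> element in any order.
    public static (string Title, string Year) ReadMaster(XmlReader reader)
    {
        string title = null, year = null;
        reader.ReadStartElement("master");           // step inside <master>
        // MoveToContent skips white space and stops on the next element or end tag.
        while (reader.MoveToContent() == XmlNodeType.Element)
        {
            switch (reader.Name)
            {
                case "title":
                    title = reader.ReadElementContentAsString();
                    break;
                case "year":
                    year = reader.ReadElementContentAsString();
                    break;
                default:
                    reader.Skip();                   // unknown element: skip it and its children
                    break;
            }
        }
        return (title, year);
    }
}
```

Because every branch leaves the reader just past the element it handled, the loop never needs a bare `reader.Read()` and does not depend on element order or on white space between elements.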

Regarding white spaces, the trick is to replace reader.Read() with a more targeted solution, like:

public static bool MoveToElement(this XmlReader reader)
{
  reader.Read();
  while (reader.NodeType != XmlNodeType.Element && !reader.EOF)
    reader.Skip();
  return !reader.EOF;
}

This is similar to MoveToContent() but focused on elements.

I recommend that you read the whole documentation of XmlReader. There are indeed a lot of little details to be aware of.
For example, it is also important to be aware of IsEmptyElement when reading lists of items...
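To illustrate the IsEmptyElement point: a self-closing `<urls/>` has no end tag, so code that steps "inside" it would actually land on the following sibling and start consuming it. The sketch below guards against that. `UrlListSketch` and `ReadUrls` are hypothetical names, and the reader is assumed to be positioned on the `<urls>` start tag.

```csharp
using System.Collections.Generic;
using System.Xml;

public static class UrlListSketch
{
    // Assumes the reader is positioned on the <urls> start tag.
    public static List<string> ReadUrls(XmlReader reader)
    {
        var urls = new List<string>();
        if (reader.IsEmptyElement)
        {
            reader.Read();                // <urls/> has no children and no end tag: step past it
            return urls;
        }
        reader.ReadStartElement("urls");  // step inside <urls>
        while (reader.MoveToContent() == XmlNodeType.Element)
        {
            if (reader.Name == "url")
            {
                var url = reader.ReadElementContentAsString();
                if (!string.IsNullOrWhiteSpace(url)) urls.Add(url);
            }
            else
            {
                reader.Skip();            // anything other than <url> is skipped
            }
        }
        reader.ReadEndElement();          // consume </urls>
        return urls;
    }
}
```

In both cases the reader ends up on the node that follows the list, so the caller can continue with the next sibling.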


philipmat added 30 commits August 25, 2020 17:14

adds c# parser

b3d674e

creates menu entries

0950d08

sample label to test deserialization

571cda3

deserializes everything but sublabels

77e632f

deserializing sub-labels

3a42aca

adds a helper class for csv writing

c07a1e2

expands paths to absolute paths

3752e39

exports labels

3cbc22b

reads releases; isolates namespace

a99f89f

removes serialize-release

ea2e274

adds common interface for export types

91917e6

sub_track example

d3a33c5

CSV export implementation for releases

44516f9

adds progress bars and some counts to drive the bar

5290a89

adds extra artists

2efb8d4

parses tracks, subtracks, and track artists

1fae50a

parses the "tracks" tag info for artists

c7f46ed

better guard for missing elements

ed93352

parses artists

96ef24e

fixes name variation export scheme

afd28af

employs a reading method that accounts for tags not being separated by white space

19d3cf9

parses masters

fc92218

moves parsing code into parser class

8b080dc

separates exporter into own class

8e5b1bb

adds test project

e8189ce

allows for stream parsing

0304ed3

moves single files under tests

c29e34b

artist deserialization tests

c95861f

uses deserialization method from parser

46f62fe

ensures event is not null on initialization

4af5b77

changes track_id to text instead of int

3f9588c

MuleaneEve reviewed Sep 1, 2020

alternatives/dotnet/discogs/CsvExporter.cs

MuleaneEve reviewed Sep 1, 2020

alternatives/dotnet/discogs/Parser.cs

philipmat added 5 commits September 1, 2020 19:05

uses a 1mb buffer size when writing csv

d18c881

sets some security options on xml reader

d9a8cde

sets a 1mb read buffer

c107738

avoids creating a new serializer for every node

7b9e581

Merge branch 'dotnet_parser' of github.com:philipmat/discogs-xml2db into dotnet_parser

8e52a19

philipmat added 9 commits September 3, 2020 20:49

adds parameters; they do nothing, yet

9ceb7d3

repoints same files to test folder

9798b30

implements --dry-run support

fec651c

introduces RunOptions class to capture ... run options

9dd18c7

supports compress output via --gz

a470a88

formatting

dfc1559

mentions the .NET alternative and provides a readme file for this alternative

dbeefb7

adds .net build badge

cb9e825

Merge branch 'develop' into dotnet_parser

2632f89

# Conflicts: # .gitignore

philipmat closed this Sep 8, 2020

philipmat reopened this Sep 8, 2020

philipmat merged commit cd0929a into develop Sep 8, 2020

philipmat deleted the dotnet_parser branch September 8, 2020 14:34
