Pretty slow section of code #51
ross-spencer started this conversation in Show and tell
Replies: 1 comment
-
For comparison, the database creation for the 230MB report (no checksums, 580,000 files), which didn't complete even after 4 hours of processing, is now down to just over two minutes:

--- 119.1657600402832 seconds ---

real 2m1.929s
user 1m59.907s
sys 0m1.821s

So that helps. The demystify part of this still takes over 40 minutes, so I will be looking for optimizations there too.
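The "--- seconds ---" line above reads like a simple wall-clock measurement. A minimal sketch of that pattern, assuming Python's standard time module and a hypothetical build_database() stand-in for the database-creation step (not the real sqlitefid entry point):

```python
import time


def build_database():
    """Hypothetical stand-in for the database-creation step being timed."""
    time.sleep(2)  # simulate work


start_time = time.time()
build_database()

# Prints a wall-clock figure in the same "--- N seconds ---" style quoted above.
print("--- %s seconds ---" % (time.time() - start_time))
```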
-
This change in sqlitefid points to a pretty slow piece of code. I was using a loop to iterate through an increasing amount of data to update a field in a dict, and each time the loop ran I'd also be updating the field with exactly the same information... pretty redundant. Don't do it, kids!
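The actual change lives in sqlitefid, but the general shape of the mistake is easy to sketch. A minimal illustration, assuming a list of per-file dicts and a placeholder "source" field; the names and data shapes here are hypothetical, not the real sqlitefid code:

```python
def update_records_slow(file_records):
    """Antipattern: on every new record, re-walk ALL records seen so far and
    rewrite a field with exactly the same value it already has."""
    seen = []
    for record in file_records:
        seen.append(record)
        # The inner loop grows with every outer iteration (O(n^2) overall),
        # and every assignment after the first is redundant.
        for earlier in seen:
            earlier["source"] = "filesystem scan"
    return seen


def update_records_fast(file_records):
    """Fix: set the field once per record and move on."""
    for record in file_records:
        record["source"] = "filesystem scan"
    return file_records


records = [{"name": "file-%d" % i} for i in range(5)]
print(update_records_fast(records))
```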
Anyway, the optimization means the govdocs sample, which previously took over 2 minutes, now only takes 6 seconds, unless I've missed something.

This will be fixed with the py2/py3 release, but it might be worth back-porting to the py2-only release. We'll see which can be delivered first. The hope is that the code will continue to work on both interpreters.

The code is also a bit of a building site right now, but it is slowly moving. The diff from today's fix, along with some other small changes I started making while tracking it down, is below for anyone who wants to apply the patch themselves.

As you can already see from the diff, there will be more tests and there should be some further optimizations in time, but this one does seem pretty chunky. I'm about to try it on a 230MB report which was taking over 4 hours yesterday, so we'll see how much this helps.
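For anyone wanting to track down similar hot spots, Python's built-in cProfile module is one option (not necessarily how this fix was found). A minimal sketch, with run_report_import() as a hypothetical stand-in for the slow step:

```python
import cProfile
import pstats


def run_report_import(path):
    """Hypothetical stand-in for the slow database build being investigated."""
    rows = [{"path": "%s/%d" % (path, i)} for i in range(100000)]
    return len(rows)


profiler = cProfile.Profile()
profiler.enable()
run_report_import("report.droid.csv")
profiler.disable()

# Sort by cumulative time so the slowest call chains float to the top.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```

The same view is also available without touching the code, via `python -m cProfile -s cumulative yourscript.py`.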