
Finish integrating new parser into pipeline #186

Merged: 2 commits into main from brandon/parser-integrate, Feb 14, 2024

Conversation

@bcspragu (Collaborator):

This PR finishes integrating the new parser logic from [1] into our pipeline.

It parses the `processed_portfolios.json` file from the output directory (in this case `/home/portfolio-parser/output`) and uses it both to correlate input and output files and to upload the output CSV files. Since the R code now includes a row count, we no longer need to parse the files manually.
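
For illustration, a minimal sketch of that manifest-driven flow. Only `SourceFile`, `InputFilename`, `Portfolios`, and `OutputFilename` appear in the actual diff; the JSON keys and the `RowCount` field name are assumptions for the sketch, not the exact types from this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// Portfolio describes one output CSV produced from a source file. The
// JSON keys and the RowCount field name are assumptions for this sketch.
type Portfolio struct {
	OutputFilename string `json:"output_filename"`
	RowCount       int    `json:"row_count"`
}

// SourceFile describes one uploaded input file and the portfolios the
// parser extracted from it.
type SourceFile struct {
	InputFilename string      `json:"input_filename"`
	Portfolios    []Portfolio `json:"portfolios"`
}

func main() {
	outputDir := "/home/portfolio-parser/output"
	omf, err := os.Open(filepath.Join(outputDir, "processed_portfolios.json"))
	if err != nil {
		panic(err)
	}
	defer omf.Close()

	var sourceFiles []SourceFile
	if err := json.NewDecoder(omf).Decode(&sourceFiles); err != nil {
		panic(fmt.Errorf("failed to decode processed_portfolios.json as JSON: %w", err))
	}

	// Correlate each input file with the output CSVs it produced. The row
	// count comes straight from the manifest, so the CSVs are never re-parsed.
	for _, sf := range sourceFiles {
		for _, p := range sf.Portfolios {
			fmt.Printf("%s -> %s (%d rows)\n",
				sf.InputFilename, filepath.Join(outputDir, p.OutputFilename), p.RowCount)
		}
	}
}
```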

This all mostly works as expected. A few sharp edges (relying on UUIDs from the R code) are noted in the PR, and there's metadata produced by the new code (both at the input file level and the output file level) that we aren't currently recording anywhere.

Adjacent changes:

  • In creating the `parser`, I also duplicated the `taskrunner` package. That has been hoisted to the top level and de-duped
  • Assorted refactorings and renamings to make sure the `pactaparser` image gets invoked correctly

[1] https://github.com/RMI-PACTA/workflow.portfolio.parsing

@bcspragu requested a review from gbdubs on February 14, 2024 04:47

@gbdubs (Contributor) left a comment:

👍 Psyched to see the complete coupling come into focus!

async/async.go Outdated
return fmt.Errorf("failed to parse file type from file name %q: %w", fileName, err)
for _, sf := range sourceFiles {

// TODO: There's lots of metadata associated with the input files (e.g.

@gbdubs (Contributor):

Can you create a TODO to track this? Capturing and storing this data is more likely than not to save us time and energy down the road if/when something goes wrong. Storing it as a blob (using our existing infrastructure) and then either attaching that blob to the incomplete_upload or creating a new table for it is probably the right call.

@bcspragu (Collaborator, Author):

Done, see #187

async/async.go Outdated
for _, p := range sf.Portfolios {
outPath := filepath.Join(outputDir, p.OutputFilename)

// XXX: One risk here is that we're depending on the R code to generate truly

@gbdubs (Contributor):

Why rely on that at all? The info we actually need for semantics is in the portfolio metadata. Let's create UUIDs or similar here?

@bcspragu (Collaborator, Author):

It was mostly because there are already a bunch of UUIDs flying around and being mapped to things, and I didn't want to add cognitive load with another opaque UUID -> UUID mapping. But agreed, we shouldn't rely on this behavior since we don't have to.

Done. It was also simpler than I thought: we don't need to maintain the mapping, since we just store the blob URI as-is.
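
(For reference, minting an ID on our side is a one-liner with `github.com/google/uuid`; a sketch only, since the PR ultimately sidesteps the mapping by storing the blob URI directly:)

```go
package main

import (
	"fmt"

	"github.com/google/uuid"
)

func main() {
	// Generate the ID ourselves rather than relying on the R code to
	// produce truly unique UUIDs.
	id := uuid.New().String()
	fmt.Println(id)
}
```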

async/async.go (resolved thread)
async/async.go Outdated
return fmt.Errorf("failed to parse file type from file name %q: %w", p.OutputFilename, err)
}

sourceURI, ok := localCSVToBlob[sf.InputFilename]

@gbdubs (Contributor):

Optional: consider adding another check that no additional output files are present that aren't mentioned in the manifest. That kind of defensive programming is probably warranted over a trusted-but-black-box interface like this one.

(after the loop completes, obviously. A map/set should be sufficient)

@bcspragu (Collaborator, Author):

I considered this and then got lazy; it's done now.
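
A minimal sketch of that post-loop check, reusing the `SourceFile` type from the sketch in the PR description; `checkNoStrayOutputs` is a hypothetical helper name, not the code from this PR:

```go
import (
	"fmt"
	"os"
)

// checkNoStrayOutputs verifies that every file in the output directory is
// accounted for by the manifest, catching anything the black-box parser
// wrote but never mentioned.
func checkNoStrayOutputs(outputDir string, sourceFiles []SourceFile) error {
	expected := map[string]bool{
		"processed_portfolios.json": true, // the manifest itself
	}
	for _, sf := range sourceFiles {
		for _, p := range sf.Portfolios {
			expected[p.OutputFilename] = true
		}
	}

	entries, err := os.ReadDir(outputDir)
	if err != nil {
		return fmt.Errorf("failed to read output dir: %w", err)
	}
	for _, e := range entries {
		if !expected[e.Name()] {
			return fmt.Errorf("output file %q is not mentioned in the manifest", e.Name())
		}
	}
	return nil
}
```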

paths = append(paths, strings.TrimSpace(line[idx+17:]))
var sourceFiles []parsed.SourceFile
if err := json.NewDecoder(omf).Decode(&sourceFiles); err != nil {
return fmt.Errorf("failed to decode processed_portfolios.json as JSON: %w", err)

@gbdubs (Contributor):

If this step fails, we probably want a more complete accounting of why. Would logging the full manifest as a string be inappropriate? Maybe upload it to a cloud bucket?

@bcspragu (Collaborator, Author):

This is a great call. The R code actually already logs the output processed_portfolios.json file to stdout (or stderr), so this will be covered by #185
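
For completeness, a sketch of the buffered variant discussed above, so the raw manifest can be surfaced in the error if decoding fails. This is not what the PR does (#185 covers it via the R-side logging), and it reuses the `SourceFile` type from the earlier sketch:

```go
import (
	"encoding/json"
	"fmt"
	"io"
)

// decodeManifest buffers the manifest before decoding so its raw contents
// can be included in the error for a fuller accounting of what went wrong.
func decodeManifest(r io.Reader) ([]SourceFile, error) {
	raw, err := io.ReadAll(r)
	if err != nil {
		return nil, fmt.Errorf("failed to read processed_portfolios.json: %w", err)
	}
	var sourceFiles []SourceFile
	if err := json.Unmarshal(raw, &sourceFiles); err != nil {
		return nil, fmt.Errorf("failed to decode processed_portfolios.json as JSON: %w; raw manifest: %s", err, raw)
	}
	return sourceFiles, nil
}
```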

@@ -0,0 +1,35 @@
// Package parsed just holds the domain types for dealing with the output of the

@gbdubs (Contributor):

Perhaps link to the repo that this relies upon? It's not obvious from this comment that this describes the contours of an external dependency.

@bcspragu (Collaborator, Author):

Good call, done
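
Presumably something along these lines (the exact wording is assumed):

```go
// Package parsed holds the domain types for dealing with the output of
// the portfolio parser, an external dependency:
// https://github.com/RMI-PACTA/workflow.portfolio.parsing
package parsed
```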

@@ -327,10 +327,13 @@ func (s *Server) handleParsedPortfolio(id string, resp *task.ParsePortfolioRespo
if err != nil {
return fmt.Errorf("creating blob %d: %w", i, err)
}

// TODO: There's other metadata in output.Portfolio, like `InvestorName`, that

@gbdubs (Contributor):

Can you add a bug for this? I don't want to lose track of it.

@bcspragu (Collaborator, Author):

Done

@bcspragu (Collaborator, Author) left a comment:

Good calls across the board. You've once again caught all the small validations and other things I cheaped out on, and the code is better for it!

@bcspragu force-pushed the brandon/parser-deploy branch from ffe410c to d894cbc on February 14, 2024 18:57
@bcspragu force-pushed the brandon/parser-integrate branch from 799a333 to 9fafd18 on February 14, 2024 18:57
@bcspragu force-pushed the brandon/parser-deploy branch from d894cbc to 4410253 on February 14, 2024 18:59
@bcspragu force-pushed the brandon/parser-integrate branch from 9fafd18 to 5cac6c8 on February 14, 2024 18:59
@bcspragu changed the base branch from brandon/parser-deploy to main on February 14, 2024 19:26
@bcspragu force-pushed the brandon/parser-integrate branch from 5cac6c8 to 9ed862c on February 14, 2024 19:26
@bcspragu merged commit a4d1c4c into main on Feb 14, 2024; 2 checks passed
@bcspragu deleted the brandon/parser-integrate branch on February 14, 2024 19:31