Support exporting source media to google drive #115

philmcmahon · 2025-01-08T17:20:04Z

What does this change?

This PR adds a new option to the export page to export the original source media to Google Drive. As part of this I've redesigned the export page - it now looks like this:

The export has several stages - see below for screenshots of that

On the assumption that, following this change, users will frequently be exporting more than one file, all exported files (even if there's just one) are now stored in a subfolder of the 'Guardian Transcribe Tool' folder (suffixed with date/time in case they have multiple files with the same name).
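To illustrate the folder naming described above, here is a minimal sketch of building a timestamped subfolder name. The helper name `exportFolderName` and the exact timestamp format are assumptions for illustration, not taken from this PR.

```typescript
// Hypothetical helper: builds a per-export subfolder name from the original
// file name plus a timestamp, so repeated exports of a file with the same
// name end up in distinct folders.
const exportFolderName = (originalFilename: string, now: Date): string => {
	// ISO-8601 in UTC, trimmed to seconds, with the 'T' swapped for a space.
	const timestamp = now.toISOString().slice(0, 19).replace('T', ' ');
	return `${originalFilename} ${timestamp}`;
};

console.log(
	exportFolderName('interview.mp4', new Date(Date.UTC(2025, 0, 8, 17, 20, 4))),
);
// interview.mp4 2025-01-08 17:20:04
```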

I decided to use a lambda to perform the export to Google Drive. This has the advantages of being easier to set up than an ECS task and faster to start. The disadvantages are that we can only export files up to 10GB (the maximum ephemeral storage) and that we only have 15 minutes to do the upload. In my (limited) testing, the lambda exported a 1.2GB file in 70 seconds, so I suspect we'll be limited more by the max file size than the timeout - but only just.
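The "only just" can be checked with some back-of-the-envelope arithmetic. This sketch uses only the figures quoted above and assumes throughput stays constant for larger files, which the limited testing doesn't confirm.

```typescript
// Figures quoted in the description above.
const testFileGb = 1.2;
const testSeconds = 70;
const maxEphemeralStorageGb = 10;
const lambdaTimeoutSeconds = 15 * 60;

// Observed export throughput, in GB per second.
const gbPerSecond = testFileGb / testSeconds;

// How much could be exported before the timeout at that rate.
const maxGbBeforeTimeout = gbPerSecond * lambdaTimeoutSeconds;

// How long the largest file that fits on disk would take.
const secondsForLargestFile = maxEphemeralStorageGb / gbPerSecond;

console.log(maxGbBeforeTimeout.toFixed(1)); // 15.4
console.log(Math.round(secondsForLargestFile)); // 583
```

So at the observed rate a 10GB file would take just under 10 minutes, and the timeout would only bite at around 15GB.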

I had to use a separate lambda function for this rather than the API itself because API Gateway has a 30s timeout, and once the lambda returns an HTTP response it gets terminated. There are workarounds for this, but I couldn't find anything that works nicely with serverless-express, so I decided to create a separate function (this has the advantage that we don't need our API lambda to have loads of memory/disk space).

There is some error reporting for when the file is too large. I still need to add an error for when the lambda times out whilst performing the export - might leave that for a future PR though.

The feature relies on the file extension to tell Google Drive what type the file is - this seems to work reasonably well. A future feature could run Apache Tika or something similar on the file to determine the file type.

In theory the uploadFileToGoogleDrive function should be streaming the file 128MB at a time. In practice I found that the function ran out of memory when uploading a 1.2GB file when the lambda only had 512MB. This needs more investigation - for now I have set the memory to 2GB. I think it's worth getting in as is because my 1.2GB test file came from a 1h30 YouTube video, and I suspect many videos will be under this length. Might be a bit of fun though to try and work out how memory management in node works.
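For context on what "128MB at a time" means at the protocol level: Google Drive's resumable upload sends each chunk with a `Content-Range: bytes start-end/total` header, and chunk sizes other than the last must be multiples of 256 KiB. The sketch below only computes those ranges; `chunkRanges` is a hypothetical helper, not the function in this PR.

```typescript
interface ChunkRange {
	start: number; // inclusive byte offset
	end: number; // inclusive byte offset
	contentRange: string; // value for the Content-Range header
}

// Hypothetical helper: splits a file of `fileSize` bytes into consecutive
// ranges of at most `chunkSize` bytes.
const chunkRanges = (fileSize: number, chunkSize: number): ChunkRange[] => {
	const ranges: ChunkRange[] = [];
	for (let offset = 0; offset < fileSize; offset += chunkSize) {
		const end = Math.min(offset + chunkSize, fileSize) - 1;
		ranges.push({
			start: offset,
			end,
			contentRange: `bytes ${offset}-${end}/${fileSize}`,
		});
	}
	return ranges;
};

const MB = 1024 * 1024;
for (const range of chunkRanges(300 * MB, 128 * MB)) {
	console.log(range.contentRange);
}
// bytes 0-134217727/314572800
// bytes 134217728-268435455/314572800
// bytes 268435456-314572799/314572800
```

If each range is read from disk separately (for example with `fs.createReadStream(path, { start, end })`, whose bounds are also inclusive), only one chunk should need to be held in memory at a time. Whether that explains the out-of-memory behaviour here is untested.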

How to test

This is currently live on CODE - you can try it out here: https://transcribe.code.dev-gutools.co.uk/

Screenshots

github-actions · 2025-01-08T17:23:35Z

Deploy build 795 of `investigations::transcription-service` to CODE

All deployment options

From guardian/actions-riff-raff.

github-actions · 2025-01-08T17:23:39Z

Deploy build 675 of `investigations::transcription-service-repository` to CODE

All deployment options

From guardian/actions-riff-raff.

philmcmahon · 2025-01-13T11:38:00Z

packages/api/src/services/googleDrive.ts

+			);
+		}
+
+		offset += chunkSize;


if lambda times out during this process can we find a way of telling the user?
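One possible approach, sketched under assumptions: the Lambda Node.js context object exposes `getRemainingTimeInMillis()`, so the upload loop could check it before each chunk and stop while there is still time to record a failure status. `uploadWithDeadline`, `sendChunk` and the 30 second margin are hypothetical names and values, not code from this PR.

```typescript
// Minimal slice of the AWS Lambda context object used here.
interface LambdaContextLike {
	getRemainingTimeInMillis(): number;
}

type UploadOutcome =
	| { status: 'success'; chunksSent: number }
	| { status: 'timed-out'; chunksSent: number };

// Hypothetical loop: `sendChunk` stands in for uploading one chunk.
const uploadWithDeadline = async (
	context: LambdaContextLike,
	totalChunks: number,
	sendChunk: (index: number) => Promise<void>,
	safetyMarginMs = 30000,
): Promise<UploadOutcome> => {
	for (let i = 0; i < totalChunks; i++) {
		// Stop early while there is still time left to tell the user.
		if (context.getRemainingTimeInMillis() < safetyMarginMs) {
			return { status: 'timed-out', chunksSent: i };
		}
		await sendChunk(i);
	}
	return { status: 'success', chunksSent: totalChunks };
};
```

The caller could then write the `timed-out` outcome to the export status before the lambda is killed. This only works if a single chunk reliably uploads within the safety margin.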

philmcmahon · 2025-01-13T11:38:57Z

packages/api/src/services/googleDrive.ts

+	return folderId;
+};
+
+export const uploadFileToGoogleDrive = async (


delete - moved to media-export lambda

zekehuntergreen · 2025-01-13T16:44:46Z

packages/api/src/export.ts

+import Drive = drive_v3.Drive;
+import Docs = docs_v1.Docs;


not sure I've come across this syntax before!
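For reference, this is TypeScript's import alias declaration (`import X = Namespace.Member`), which creates a local name for something inside a namespace. A minimal sketch, with a made-up stand-in for the `drive_v3` namespace that googleapis exports:

```typescript
// Stand-in for the `drive_v3` namespace; the contents here are invented
// purely for illustration.
namespace drive_v3 {
	export class Drive {
		describe(): string {
			return 'drive client';
		}
	}
}

// Import alias declaration: `Drive` now refers to `drive_v3.Drive` and can be
// used both as a type and as a value.
import Drive = drive_v3.Drive;

const client: Drive = new Drive();
console.log(client.describe()); // drive client
```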

zekehuntergreen · 2025-01-13T16:46:18Z

packages/media-export/src/googleDrive.ts

+
+		if (response.ok) {
+			// Response status is 308 until the final chunk. Final response includes file metadata
+			return ((await response.json()) as { id: string }).id;


can this be zodified?

zekehuntergreen · 2025-01-13T16:48:33Z

packages/api/src/index.ts

-			const dynamoClient = getDynamoClient(
-				config.aws.region,
-				config.aws.localstackEndpoint,
+			const id = req.query.id as string;


can we use zod to parse the request like we're doing for some of the endpoints above to avoid a type error here if query.id isn't a string?
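In the codebase this would presumably be a zod schema along the lines of `z.object({ id: z.string() })` with `safeParse`. To keep the example dependency-free, the sketch below hand-rolls the same runtime narrowing; `parseIdQuery` and `ParsedId` are hypothetical names.

```typescript
type ParsedId = { ok: true; id: string } | { ok: false; error: string };

// Hypothetical helper. Express types `req.query.id` loosely (string, array,
// nested object or undefined), so it is narrowed at runtime instead of cast.
const parseIdQuery = (query: { id?: unknown }): ParsedId => {
	const { id } = query;
	if (typeof id !== 'string' || id.length === 0) {
		return {
			ok: false,
			error: 'Query parameter "id" must be a non-empty string',
		};
	}
	return { ok: true, id };
};

console.log(parseIdQuery({ id: 'abc-123' }).ok); // true
console.log(parseIdQuery({ id: ['a', 'b'] }).ok); // false (?id=a&id=b)
console.log(parseIdQuery({}).ok); // false
```

The handler can then return a 400 on `ok: false` rather than passing a non-string into the Dynamo lookup.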

zekehuntergreen · 2025-01-13T16:51:36Z

packages/backend-common/src/s3.ts

+		throw new Error(`Failed to retrieve object ${key} from bucket ${bucket}`);
+	}
+	await downloadS3Data(
+		data.Body as Readable,


can this be parsed as well?

zekehuntergreen · 2025-01-13T16:55:00Z

packages/api/src/export.ts

+	if (isS3Failure(transcriptText)) {
+		if (transcriptText.failureReason === 'NoSuchKey') {
+			const msg = `Failed to export transcript - file has expired. Please re-upload the file and try again.`;
+			logger.error(msg);


might be better to have separate messages for the user and for devs looking through error logs wherever we're instructing a user on what to do next, as we are here.
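A rough sketch of what that split could look like. The `ExportFailure` shape, the helper name and the example key are all hypothetical; only the user-facing wording comes from the diff above.

```typescript
// Hypothetical shape: one message for the UI, one for the logs.
interface ExportFailure {
	userMessage: string;
	logMessage: string;
}

const expiredFileFailure = (itemId: string, key: string): ExportFailure => ({
	// Tells the user what to do next.
	userMessage:
		'Failed to export transcript - file has expired. Please re-upload the file and try again.',
	// Tells a developer what actually happened.
	logMessage: `Export failed for item ${itemId}: S3 returned NoSuchKey for ${key}`,
});

const failure = expiredFileFailure('item-1', 'transcripts/item-1.json');
console.error(failure.logMessage); // goes to the logs
console.log(failure.userMessage); // goes back to the client
```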

zekehuntergreen · 2025-01-13T18:04:13Z

packages/api/src/export.ts

+			exportType: format,
+		};
+	}
+	const exportResult = await uploadToGoogleDocs(


is exportResult the google doc's id? Might be worth renaming the variable to make this more clear.

zekehuntergreen · 2025-01-13T18:10:53Z

packages/api/src/export.ts

+};
+
+export const updateStatus = (
+	status: ExportStatus,


maybe renaming to something like statusToUpdate could make it clearer what's going on here

zekehuntergreen · 2025-01-13T18:26:19Z

packages/api/src/index.ts

+			let currentStatuses: ExportStatuses = exportStatusInProgress(
+				exportRequest.data.items,
+			);
+			await writeTranscriptionItem(dynamoClient, config.app.tableName, {


is it worth doing a single write to the db after the attempted uploads of text and srt files?
might also be able to simplify the code below:

let currentStatuses: ExportStatuses = await Promise.all(
	exportStatusInProgress(exportRequest.data.items).map(
		(exportStatus: ExportStatus) => {
			if (exportStatus.exportType == 'source-media') {
				return exportStatus;
			}
			return exportTranscriptToDoc(
				config,
				s3Client,
				item,
				exportStatus.exportType,
				exportRequest.data.folderId,
				driveClients.drive,
				driveClients.docs,
			);
		},
	),
);

zekehuntergreen · 2025-01-13T18:34:32Z

packages/media-export/src/index.ts

+
+	const fileName = item.originalFilename.endsWith(`.${extensionOrMp4}`)
+		? item.originalFilename
+		: `${item.originalFilename}.${extensionOrMp4 || 'mp4'}`;


already defaulting

Suggested change

-		: `${item.originalFilename}.${extensionOrMp4 || 'mp4'}`;
+		: `${item.originalFilename}.${extensionOrMp4}`;
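To spell out why the second fallback is redundant, here is a self-contained sketch of the same logic. How `extensionOrMp4` is actually derived in the PR isn't shown in this hunk, so the `extension || 'mp4'` line and the `buildFileName` wrapper are assumptions.

```typescript
// Hypothetical wrapper around the logic in the diff above.
const buildFileName = (originalFilename: string, extension?: string): string => {
	// The default is applied once, here...
	const extensionOrMp4 = extension || 'mp4';
	// ...so no second `|| 'mp4'` is needed when building the name.
	return originalFilename.endsWith(`.${extensionOrMp4}`)
		? originalFilename
		: `${originalFilename}.${extensionOrMp4}`;
};

console.log(buildFileName('clip.mov', 'mov')); // clip.mov
console.log(buildFileName('clip', 'mov')); // clip.mov
console.log(buildFileName('clip')); // clip.mp4
```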

zekehuntergreen · 2025-01-13T18:35:59Z

packages/api/src/export.ts

+	}));
+};
+
+export const updateStatus = (


might be clearer if this was updateStatuses plural?
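For illustration, here is how the function might read with both naming suggestions applied. The body is a guess at the behaviour (replace the entry with the matching export type), and the `status` values are hypothetical; only `exportType` and `'source-media'` appear in the diff.

```typescript
// Hypothetical types: the real definitions are not shown in this hunk.
type ExportType = 'text' | 'srt' | 'source-media';
interface ExportStatus {
	exportType: ExportType;
	status: 'in-progress' | 'success' | 'failure';
}
type ExportStatuses = ExportStatus[];

// Returns a new list in which the entry for `statusToUpdate.exportType`
// is replaced; other entries are left untouched.
const updateStatuses = (
	statusToUpdate: ExportStatus,
	currentStatuses: ExportStatuses,
): ExportStatuses =>
	currentStatuses.map((current) =>
		current.exportType === statusToUpdate.exportType ? statusToUpdate : current,
	);

const before: ExportStatuses = [
	{ exportType: 'text', status: 'in-progress' },
	{ exportType: 'source-media', status: 'in-progress' },
];
console.log(updateStatuses({ exportType: 'text', status: 'success' }, before));
```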

philmcmahon requested a review from a team as a code owner January 8, 2025 17:20

philmcmahon marked this pull request as draft January 9, 2025 10:03

philmcmahon force-pushed the pm-save-media-google-drive branch from d6baa92 to 0ef1107 Compare January 9, 2025 12:35

philmcmahon marked this pull request as ready for review January 10, 2025 15:34

philmcmahon added 22 commits January 10, 2025 15:35

Add 'extension' field to metadata of scraped media

0091de7

Add functionality to export original source media to google drive

a2dcfd3

Increase ephemeral storage of lambda to 10gb, memory to 512mb, to allow for downloading/uploading large files

f083b13

Set originalfilename and extension metadata on files uploaded via the ui to s3

5c6b98c

Prevent double . in filename

de469a0

Tidy up filename

9126bbe

Show links to individual files when export complete

60ab235

Rename ExportButton ExportForm

56327a0

Fix dynamo table name

a5f4b4c

Include date and time in folder name

7a78c1f

Bump lambda timeout to 15 minutes

7144de2

Refactor so that export returns immediately then client polls status endpoint

b31667c

Fix export status reporting

9ca9a87

Add extra logging for export

4be932d

Disable callbackWaitsForEmptyEventLoop to allow lambda to continue running after returning a response

d13c138

Source media is now input media, fix logs

3ea1060

Fix media download file path

6dde0d2

Log progress of s3 download

02c368c

Try setting resolution mode to callback

15712e9

Remove await from export promise

f0d5af5

Add empty media-export lambda

c803341

Move export to google drive functionality to export-media lambda

f2e538f

philmcmahon force-pushed the pm-save-media-google-drive branch from 6a5806f to f2e538f Compare January 10, 2025 15:37

philmcmahon mentioned this pull request Jan 10, 2025

Improve url form validation #117

Open

philmcmahon commented Jan 13, 2025

Move uploadFileToGoogleDrive to media-export service

ebe74ed

zekehuntergreen reviewed Jan 13, 2025
