NLP Submission Process

sdard edited this page Aug 16, 2022 · 3 revisions

If you are reading this, you are an N3C site that has either already agreed to take part in the optional natural language processing (NLP) component of data extraction, or a site that is curious about the steps necessary to participate. Either way, welcome!

First things first: If you have not already contacted the Mayo Clinic OHNLP team leading the NLP effort to express your interest in participating, please reach out to RSTNLP<at>mayo.edu as soon as possible. The Mayo team can set up an orientation meeting for you and your team to get you started.

The process will work as shown in this flow chart; detailed descriptions of each step can be found below.

Flow diagram

Detailed Steps

Overall note: At no point will you send free-text to N3C. The only data N3C will receive from this process are the structured data points that the NLP process derives from your notes, as well as some metadata about the notes. The only HIPAA identifiers sent during this process are dates, which are allowable as part of N3C’s HIPAA limited data set. Notes themselves are never transmitted. All the same, please work with your local regulatory and compliance folks to determine whether any additional steps need to be taken at your site for working with notes before proceeding.

Action: Populate N3C_COHORT table. If you’re already submitting data to N3C, you’re already doing this step--N3C_COHORT is the table that is automatically populated each time you run the phenotype code. The patients in this table drive which notes are selected in the NLP process. If you are a TriNetX site, the table that you will need to use is the patient table (field: patient_id), supplied to you in the zipped N3C payload that you download from the TriNetX appliance.

Question: Are your notes in the same database as your CDM? For most sites, the answer is “no,” which means an extra step will be required. If the answer is “yes,” skip to the next Action step.

If “no,” action: Build an ID crosswalk between N3C_COHORT patient IDs and the IDs used in your note database. We don’t want to be prescriptive about how this is done, because sites are set up completely differently. The end result needs to be that (1) the patient IDs associated with your notes are the same IDs used in your N3C_COHORT table, and (2) the set of patients for whom you extract notes is the same set of patients in N3C_COHORT. So long as you achieve these two ends, the path you choose to get there is up to you--but here are three possible workflows to give you some ideas:

  1. You could start pulling notes into your CDM during your regular CDM ETL process for use in N3C and other projects. Note that ACT, PCORnet, and TriNetX do not have tables intended to store notes, so if you use one of those models you’ll need to define your own structure or use the [OMOP NOTE table (v5.3.1)](https://ohdsi.github.io/CommonDataModel/cdm531.html#NOTE). (The latter is a good idea for N3C, as that’s the structure we will ask you to use.)
  2. If you have write access to your note database (e.g., Epic Clarity), you can create an ID crosswalk table in that database that you populate each time you re-run the N3C phenotype. This will enable you to pull notes for that cohort, with the correct IDs.
  3. If you can use the same database client to cross-query your CDM and your note database, you can do suggestion #2 even without write access to your note database.

There are many more possible workflows to get this done. Please feel free to stop by office hours if you would like to discuss options with us; we are happy to help brainstorm.
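As a concrete illustration of workflow #2 above, here is a minimal sketch of a crosswalk table that is rebuilt each time the phenotype is re-run. All table and column names (the MRN-style note-side IDs, `note_db_patients`, `id_crosswalk`) are illustrative assumptions, not names N3C mandates; SQLite stands in for whatever database you actually use.

```python
import sqlite3

# Sketch of workflow #2: a crosswalk table mapping N3C_COHORT patient IDs
# to the MRN-style IDs used in a hypothetical note database.
# All names here are illustrative, not N3C-mandated.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
CREATE TABLE n3c_cohort (patient_id TEXT);  -- populated by the phenotype run
CREATE TABLE note_db_patients (mrn TEXT, patient_id TEXT);  -- note-side IDs
CREATE TABLE id_crosswalk (cohort_patient_id TEXT, note_mrn TEXT);
""")

cur.executemany("INSERT INTO n3c_cohort VALUES (?)", [("P1",), ("P2",)])
cur.executemany("INSERT INTO note_db_patients VALUES (?,?)",
                [("MRN-100", "P1"), ("MRN-200", "P2"), ("MRN-300", "P9")])

# Rebuild the crosswalk on every phenotype re-run, keeping only
# patients currently in N3C_COHORT.
cur.execute("DELETE FROM id_crosswalk")
cur.execute("""
INSERT INTO id_crosswalk (cohort_patient_id, note_mrn)
SELECT c.patient_id, p.mrn
FROM n3c_cohort c
JOIN note_db_patients p ON p.patient_id = c.patient_id
""")
conn.commit()

rows = cur.execute(
    "SELECT cohort_patient_id, note_mrn FROM id_crosswalk ORDER BY 1").fetchall()
print(rows)  # [('P1', 'MRN-100'), ('P2', 'MRN-200')]
```

Note that patient P9, who has notes but is not in N3C_COHORT, is excluded by the join--exactly the cohort-scoping behavior the two requirements above call for.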

Action: Build the OMOP NOTE and NOTE_NLP tables in your CDM. Even if you’re not an OMOP site, N3C asks that you use the OMOP note tables. So, as an example, if you’re a PCORnet site, you’ll have a PCORnet CDM with two OMOP tables (NOTE and NOTE_NLP) tacked on.

TIP for non-OMOP sites: OMOP’s person_id is an integer by default. If your data has non-numeric patient IDs, remember to change the datatype when creating the table.
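The tip above can be sketched in DDL form. The columns below are an abbreviated subset of the OMOP v5.3.1 NOTE and NOTE_NLP specifications (consult the full spec linked earlier for all columns); the only deliberate deviation is `person_id` as TEXT rather than INTEGER, for sites with alphanumeric patient IDs. SQLite is used here purely for illustration.

```python
import sqlite3

# Abbreviated sketch of OMOP v5.3.1 NOTE / NOTE_NLP DDL, with person_id
# widened to TEXT for sites whose patient IDs are non-numeric.
# See the OMOP CDM v5.3.1 spec for the complete column lists.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE note (
    note_id               INTEGER NOT NULL,
    person_id             TEXT    NOT NULL,  -- INTEGER in stock OMOP
    note_date             DATE    NOT NULL,
    note_type_concept_id  INTEGER NOT NULL,
    note_class_concept_id INTEGER NOT NULL,
    note_title            TEXT,
    note_text             TEXT,              -- stays local; never sent to N3C
    encoding_concept_id   INTEGER NOT NULL,
    language_concept_id   INTEGER NOT NULL
);

CREATE TABLE note_nlp (
    note_nlp_id           INTEGER NOT NULL,
    note_id               INTEGER NOT NULL,  -- FK back to note
    lexical_variant       TEXT    NOT NULL,
    note_nlp_concept_id   INTEGER,           -- normalized OMOP concept
    nlp_date              DATE    NOT NULL,
    term_exists           TEXT,
    term_temporal         TEXT
);
""")
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)  # ['note', 'note_nlp']
```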

Action: ETL notes into the OMOP NOTE table. The Mayo team can provide guidance on this step during and after your orientation meeting. You will want to set this up as a recurring ETL job, where new notes are incrementally loaded each time the job runs (rather than doing a full truncate and reload each time).
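One common pattern for the incremental load described above is a high-water-mark job: on each run, only notes with IDs above the largest `note_id` already loaded are pulled. The sketch below assumes a hypothetical `source_notes` staging table and uses SQLite for illustration; your real source and scheduling mechanism will differ.

```python
import sqlite3

# Sketch of an incremental note ETL: each run loads only rows newer than
# the last loaded note_id, instead of truncate-and-reload.
# Source/target names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source_notes (note_id INTEGER, person_id TEXT, note_text TEXT);
CREATE TABLE note         (note_id INTEGER, person_id TEXT, note_text TEXT);
""")
conn.executemany("INSERT INTO source_notes VALUES (?,?,?)",
                 [(1, "P1", "note one"), (2, "P2", "note two")])

def incremental_load(conn):
    # High-water mark: the largest note_id already in the OMOP NOTE table.
    (hwm,) = conn.execute("SELECT COALESCE(MAX(note_id), 0) FROM note").fetchone()
    conn.execute(
        "INSERT INTO note SELECT note_id, person_id, note_text "
        "FROM source_notes WHERE note_id > ?", (hwm,))
    conn.commit()

incremental_load(conn)                       # first run: loads notes 1 and 2
conn.execute("INSERT INTO source_notes VALUES (3, 'P1', 'note three')")
incremental_load(conn)                       # second run: loads only note 3
(count,) = conn.execute("SELECT COUNT(*) FROM note").fetchone()
print(count)  # 3
```

If your note IDs are not monotonically increasing, a last-modified timestamp works the same way as the high-water mark.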

Action: Run NLP process to populate the NOTE_NLP table. This is the part of the process where NLP is actually run on the notes you’ve loaded. The Mayo team will provide all the tools you need to do this during and after your orientation meeting. Mayo has a full set of documentation on this process on their GitHub site, at this link. We also welcome sites’ existing NLP solutions for extracting COVID-19-related concepts (e.g., signs and symptoms), as long as the extracted concepts can be normalized to OMOP CDM concept IDs.
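To illustrate the normalization requirement in the last sentence, here is a toy sketch of mapping extracted lexical variants to OMOP concept IDs before writing NOTE_NLP rows. The term-to-concept dictionary is a stand-in for a real vocabulary lookup, and the concept IDs shown are illustrative--verify any real mapping against the OMOP vocabularies (e.g., via Athena).

```python
# Toy sketch of normalizing extracted terms to OMOP concept IDs before
# writing NOTE_NLP rows. The map below stands in for a real vocabulary
# lookup; verify concept IDs against the OMOP vocabularies before use.
TERM_TO_CONCEPT = {
    "fever": 437663,   # illustrative OMOP concept ID
    "cough": 254761,   # illustrative OMOP concept ID
}

def to_note_nlp_rows(note_id, extracted_terms):
    """Build NOTE_NLP-shaped dicts for terms we can normalize; skip the rest."""
    rows = []
    for term in extracted_terms:
        concept_id = TERM_TO_CONCEPT.get(term.lower())
        if concept_id is None:
            continue  # unmapped concepts cannot be submitted to N3C
        rows.append({
            "note_id": note_id,
            "lexical_variant": term,
            "note_nlp_concept_id": concept_id,
        })
    return rows

rows = to_note_nlp_rows(42, ["Fever", "Cough", "chartreuse"])
print([r["note_nlp_concept_id"] for r in rows])  # [437663, 254761]
```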

Action: Export the NOTE and NOTE_NLP tables and submit to N3C. Once your tables are populated for your most recent N3C_COHORT, it’s time to package up the results and submit to N3C. A few critical points:

  1. These extracts should end up in the DATAFILES folder of your usual zip file sent to N3C, just like any other table you submit.
  2. We have strategically nulled out any fields that could contain PHI in our extract scripts. Please use our extract scripts rather than just dumping the tables on your own.
  3. If you use the Python or R exporter to extract your data and want to add the note tables to that process, please review the documentation for your exporter of choice. (Python docs here, R docs here.) Both sets of documentation have been updated with instructions for the NLP component.
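For illustration only, the following sketch shows the idea behind point #2 above: PHI-bearing fields are emptied before any row leaves the site. This is not a replacement for the official extract scripts--please use those for real submissions. The field list here is an assumption for the example, not N3C’s exact set.

```python
import csv
import io

# Illustration of why the official extract scripts null out PHI-bearing
# fields -- use N3C's provided scripts for real submissions.
PHI_FIELDS = {"note_text", "note_title"}  # illustrative list, not N3C's exact set

def export_rows(rows, fieldnames):
    """Write rows to CSV with PHI-bearing fields blanked out."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        safe = {k: ("" if k in PHI_FIELDS else v) for k, v in row.items()}
        writer.writerow(safe)
    return buf.getvalue()

csv_text = export_rows(
    [{"note_id": 1, "person_id": "P1",
      "note_title": "Discharge", "note_text": "free text"}],
    ["note_id", "person_id", "note_title", "note_text"])
print(csv_text)  # note_text and note_title columns are empty in the output
```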

As always, if you have questions or issues, please stop by office hours, send us a Slack message, or send us an email.