46: Final docs check through (#47)
* Docs readthrough

* Changed pprl to PPRL Toolkit in verknupfung tutorial
matweldon authored Apr 4, 2024
1 parent 50b6643 commit 7dd1f7a
Showing 13 changed files with 127 additions and 95 deletions.
7 changes: 3 additions & 4 deletions .github/ISSUE_TEMPLATE/bug_report.md
@@ -7,7 +7,7 @@ assignees: ''

---

-Please be aware that, as the `pprl_toolkit` is an experimental package, ONS cannot promise to resolve bugs.
+Please be aware that, as pprl is an experimental package, ONS cannot promise to resolve bugs.

### Describe the bug
A clear and concise description of what the bug is.
@@ -22,16 +22,15 @@ Steps to reproduce the behaviour:
### Expected behaviour
A clear and concise description of what you expected to happen.

-### Evidence (tracebacks and screenshots
+### Evidence (tracebacks and screenshots)
If applicable, please add any tracebacks or screenshots to help explain your problem.

### System information
Please provide the following information about your environment:

- OS: [e.g. macOS]
- Browser (when using the client-side app or GCP): [e.g. Chrome, Safari]
-- `pprl_toolkit` version: [e.g. 0.0.1]
+- pprl version: [e.g. 0.0.1]

### Additional context
Add any other context about the problem here.

3 changes: 1 addition & 2 deletions .github/ISSUE_TEMPLATE/feature-idea.md
@@ -7,7 +7,7 @@ assignees: ''

---

-Please be aware that, as the `pprl_toolkit` is an experimental package, ONS cannot promise to implement feature ideas.
+Please be aware that, as pprl is an experimental package, ONS cannot promise to implement feature ideas.

### Does your feature idea solve a problem?
If this applies to your idea, please provide a clear and concise description of what the problem is.
@@ -20,4 +20,3 @@ A clear and concise description of any alternative solutions or features you've

### Additional context
Add any other context or screenshots about the feature request here.

14 changes: 8 additions & 6 deletions README.md
@@ -1,10 +1,10 @@
![ONS and DSC logos](https://github.com/datasciencecampus/awesome-campus/blob/master/ons_dsc_logo.png)

-# `pprl_toolkit`: a toolkit for privacy-preserving record linkage
+# PPRL Toolkit: A toolkit for Privacy-Preserving Record Linkage

> "We find ourselves living in a society which is rich with data and the opportunities that comes with this. Yet, when disconnected, this data is limited in its usefulness. ... Being able to link data will be vital for enhancing our understanding of society, driving policy change for greater public good." Sir Ian Diamond, the National Statistician
-The Privacy Preserving Record Linkage (PPRL) toolkit demonstrates the feasibility of record linkage in difficult 'eyes off' settings. It has been designed for a situation where two organisations (perhaps in different jurisdictions) want to link their datasets at record level, to enrich the information they contain, but neither party is able to send sensitive personal identifiers -- such as names, addresses or dates of birth -- to the other. Building on [previous ONS research](https://www.gov.uk/government/publications/joined-up-data-in-government-the-future-of-data-linking-methods/privacy-preserving-record-linkage-in-the-context-of-a-national-statistics-institute), the toolkit implements a well-known privacy-preserving linkage method in a new way to improve performance, and wraps it in a secure cloud architecture to demonstrate the potential of a layered approach.
+The Privacy Preserving Record Linkage (PPRL) toolkit demonstrates the feasibility of record linkage in difficult 'eyes off' settings. It has been designed for a situation where two organisations (perhaps in different jurisdictions) want to link their datasets at record level, to enrich the information they contain, but neither party is able to send sensitive personal identifiers - such as names, addresses or dates of birth - to the other. Building on [previous ONS research](https://www.gov.uk/government/publications/joined-up-data-in-government-the-future-of-data-linking-methods/privacy-preserving-record-linkage-in-the-context-of-a-national-statistics-institute), the toolkit implements a well-known privacy-preserving linkage method in a new way to improve performance, and wraps it in a secure cloud architecture to demonstrate the potential of a layered approach.

The toolkit has been developed by data scientists at the [Data Science Campus](https://datasciencecampus.ons.gov.uk/) of the UK Office for National Statistics. This project has benefitted from early collaborations with colleagues at NHS England.

@@ -13,7 +13,9 @@ The two parts of the toolkit are:
* a Python package for privacy-preserving record linkage with Bloom filters and hash embeddings, that can be used locally with no cloud set-up
* instructions, scripts and resources to run record linkage in a cloud-based secure enclave. This part of the toolkit requires you to set up Google Cloud accounts with billing

-We're publishing the repo as a prototype and teaching tool. Please feel free to download, adapt and experiment with it in compliance with the open-source license. You can submit issues [here](https://github.com/datasciencecampus/pprl_toolkit/issues). However, as this is an experimental repo, the development team cannot commit to maintaining the repo or responding to issues. If you'd like to collaborate with us, to put these ideas into practice for the public good, please [get in touch](https://datasciencecampus.ons.gov.uk/contact/).
+We're publishing the repo as a prototype and teaching tool. Please feel free to download, adapt and experiment with it in compliance with the open-source license. The reference documentation and tutorials are published [here](https://datasciencecampus.github.io/pprl_toolkit). You can submit issues [here](https://github.com/datasciencecampus/pprl_toolkit/issues). However, as this is a prototype, the development team cannot commit to maintaining the repo indefinitely or responding to all issues.
+
+This toolkit is not assured for use in production settings, but we believe the tools and methods demonstrated here have great potential for positive impact with further development and adaptation. If you'd like to collaborate with us, to put these ideas into practice for the public good, please [get in touch](https://datasciencecampus.ons.gov.uk/contact/).

## Installation

@@ -84,7 +86,7 @@ matching. We will use the toolkit to identify these matches.
> These datasets don't have the same column names or follow the same encodings,
> and there are several spelling mistakes in the names of the band members.
>
-> Thankfully, the `pprl_toolkit` is flexible enough to handle this!
+> Thankfully, the PPRL Toolkit is flexible enough to handle this!
### Creating and assigning a feature factory

@@ -148,7 +150,7 @@ Lastly, we compute the matching using an adapted Hungarian algorithm with local

```

-So, all three of the records in each dataset were matched correctly. Excellent!
+So, all three of the records in each dataset were matched correctly. Excellent! You can find a longer version of this tutorial [here](https://datasciencecampus.github.io/pprl_toolkit/docs/tutorials/example-verknupfung.html).
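For intuition, the assignment step can be pictured with SciPy's Hungarian solver plus a similarity floor. This toy sketch is not the toolkit's adapted algorithm, and the similarity values are invented:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy similarity matrix between three records in each dataset (made-up values).
similarity = np.array([
    [0.91, 0.12, 0.20],
    [0.15, 0.83, 0.09],
    [0.08, 0.31, 0.76],
])

# The Hungarian algorithm finds the assignment maximising total similarity;
# a threshold then vetoes pairings that are not similar enough.
rows, cols = linear_sum_assignment(similarity, maximize=True)
threshold = 0.5  # one global floor here; the toolkit computes one per record
matches = [(r, c) for r, c in zip(rows, cols) if similarity[r, c] >= threshold]
print(matches)  # [(0, 0), (1, 1), (2, 2)]
```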


## Working in the cloud
@@ -169,7 +171,7 @@ parties, a workload author, and a workload operator. These roles can be summaris
- The workload **operator** sets up and runs the Confidential
Space virtual machine, which uses the Docker image to perform the record linkage.

-We have set up `pprl_toolkit` to allow any configuration of these roles among
+We have set up the PPRL Toolkit to allow any configuration of these roles among
users. You could do it all yourself, split the workload roles between two
data owning-parties, or ask a trusted third party to maintain the
workload.
14 changes: 11 additions & 3 deletions _quarto.yml
@@ -2,7 +2,7 @@ project:
type: website

website:
-title: "`pprl`"
+title: "**pprl**"
navbar:
left:
- href: index.qmd
@@ -15,9 +15,9 @@ website:
- icon: github
menu:
- text: Source code
-url: https://github.com/datasciencecampus/pprl
+url: https://github.com/datasciencecampus/pprl_toolkit
- text: Open an issue
-url: https://github.com/datasciencecampus/pprl/issues
+url: https://github.com/datasciencecampus/pprl_toolkit/issues
sidebar:
style: docked
search: true
@@ -75,3 +75,11 @@ quartodoc:
package: pprl.app
contents:
- utils
+- title: Server functions
+desc: >
+Functions for the matching workload server. Used in `scripts/server.py`
+package: pprl.matching
+contents:
+- cloud
+- local
+- perform
43 changes: 29 additions & 14 deletions docs/tutorials/example-verknupfung.qmd
@@ -22,7 +22,7 @@ df1 = pd.DataFrame(
{
"first_name": ["Laura", "Kaspar", "Grete"],
"last_name": ["Daten", "Gorman", "Knopf"],
-"gender": ["f", "m", "f"],
+"gender": ["F", "M", "F"],
"date_of_birth": ["01/03/1977", "31/12/1975", "12/7/1981"],
"instrument": ["bass", "guitar", "drums"],
}
@@ -37,11 +37,12 @@ df2 = pd.DataFrame(
)
```

-> [!NOTE]
-> These datasets don't have the same column names or follow the same encodings,
-> and there are several spelling mistakes in the names of the band members, as well as a typo in the dates.
->
-> Thankfully, the `pprl_toolkit` is flexible enough to handle this!
+::: {.callout-note}
+These datasets don't have the same column names or follow the same encodings,
+and there are several spelling mistakes in the names of the band members, as well as a typo in the dates.
+
+Thankfully, the PPRL Toolkit is flexible enough to handle this!
+:::

### Creating and assigning a feature factory

@@ -72,24 +73,27 @@ spec1 = dict(
spec2 = dict(name="name", sex="sex", main_instrument="instrument", birth_date="dob")
```

-> [!TIP]
-> The feature generation functions, `features.gen_XXX_features` have sensible default parameters, but sometimes have to be passed in to the feature factory with different parameters, such as to set a feature label in the example above.
-> There are two ways to achieve this. Either use `functools.partial` to set parameters (as above), or pass keyword arguments as a dictionary of dictionaries to the `Embedder` as `ff_args`.
+::: {.callout-tip}
+The feature generation functions, `features.gen_XXX_features` have sensible default parameters, but sometimes have to be passed in to the feature factory with different parameters, such as to set a feature label in the example above.
+There are two ways to achieve this. Either use `functools.partial` to set parameters (as above), or pass keyword arguments as a dictionary of dictionaries to the `Embedder` as `ff_args`.
+:::
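For the first route, here is a minimal sketch of a factory built with `functools.partial`; the feature functions and keyword names are hypothetical stand-ins for the toolkit's `features.gen_XXX_features` family:

```python
from functools import partial

# Hypothetical feature functions standing in for features.gen_XXX_features.
def gen_name_features(values, ngram_lengths=(2, 3), label="name"):
    ...

def gen_sex_features(values, label="sex"):
    ...

factory = dict(
    name=partial(gen_name_features, label="forename"),  # override a default parameter
    sex=gen_sex_features,                               # defaults are fine here
)
```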

### Embedding the data

With our specifications sorted out, we can get to creating our Bloom filter
embedding. Before doing so, we need to decide on two parameters: the size of
-the filter and the number of hashes. By default, these are `1024` and `2`,
+the filter and the number of hashes. By default, these are 1024 and 2,
respectively.

Once we've decided, we can create our `Embedder` instance and use it to embed
our data with their column specifications.

```{python}
+#| warning: false
from pprl.embedder.embedder import Embedder
embedder = Embedder(factory, bf_size=1024, num_hashes=2)
edf1 = embedder.embed(df1, colspec=spec1, update_thresholds=True)
edf2 = embedder.embed(df2, colspec=spec2, update_thresholds=True)
```
@@ -103,15 +107,26 @@ three additional columns: `bf_indices`, `bf_norms` and `thresholds`.
edf1.columns
```

-The `bf_indices` column contains the Bloom filters, represented compactly as a list of non-zero indices for each record. The `bf_norms` column contains the norm of each Bloom filter with respect to the Soft Cosine Measure (SCM) matrix. In this case since we are using an untrained model, the SCM matrix is an identity matrix, and the norm is just the Euclidean norm of the Bloom filter represented as a binary vector, which is equal to `np.sqrt(len(bf_indices[i]))` for record `i`. The norm is used to scale the similarity measures so that they take values between -1 and 1.
+The `bf_indices` column contains the Bloom filters, represented compactly as a list of non-zero indices for each record.
+
+```{python}
+print(edf1.bf_indices[0])
+```

-The `thresholds` column is calculated to provide, for each record, a threshold similarity score below which it will not be matched. It's like a reserve price in an auction -- it stops a record being matched to another record when the similarity isn't high enough. In this feature, the method implemented here differs from other linkage methods, which typically only have one global threshold score for the entire dataset.
+The `bf_norms` column contains the norm of each Bloom filter with respect to the Soft Cosine Measure (SCM) matrix. In this case since we are using an untrained model, the SCM matrix is an identity matrix, and the norm is just the Euclidean norm of the Bloom filter represented as a binary vector, which is equal to `np.sqrt(len(bf_indices[i]))` for record `i`. The norm is used to scale the similarity measures so that they take values between -1 and 1.
+
+The `thresholds` column is calculated to provide, for each record, a threshold similarity score below which it will not be matched. It's like a reserve price in an auction -- it stops a record being matched to another record when the similarity isn't high enough. This is an innovative feature of our method; other linkage methods typically only have one global threshold score for the entire dataset.
+
+```{python}
+print(edf1.loc[:,["bf_norms","thresholds"]])
+print(edf2.loc[:,["bf_norms","thresholds"]])
+```
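As a quick check of that norm relationship (a minimal sketch, assuming the untrained identity-SCM setting described above):

```python
import numpy as np

# With an identity SCM matrix, each Bloom filter norm is the Euclidean norm
# of the binary vector: the square root of the number of non-zero indices.
for i in range(len(edf1)):
    assert np.isclose(edf1.bf_norms[i], np.sqrt(len(edf1.bf_indices[i])))
```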

<!-- ToDO: Write an explainer on the threshold method, and link it here -->

### The processed features

-Let's take a look at how the features are processed into small text strings (shingles) before being hashed into the Bloom filter. The first record in the first dataset is the same person as the first record in the second dataset, although the data is not identical, so we can compare the processed features for these records to see how `pprl_toolkit` puts them into a format where they can be compared.
+Let's take a look at how the features are processed into small text strings (shingles) before being hashed into the Bloom filter. The first record in the first dataset is the same person as the first record in the second dataset, although the data is not identical, so we can compare the processed features for these records to see how pprl puts them into a format where they can be compared.

First, we'll look at date of birth:

@@ -129,7 +144,7 @@ print(edf1.first_name_features[0] + edf1.last_name_features[0])
print(edf2.name_features[0])
```

-The two datasets store the names differently, but this doesn't matter for the Bloom filter method because it treats each record like a bag of features. By default, the name processor produces 2-grams, 3-grams and 4-grams.
+The two datasets store the names differently, but this doesn't matter for the Bloom filter method because it treats each record like a bag of features. By default, the name processor produces 2-grams and 3-grams.
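For intuition, character n-gram shingling works roughly like this (a simplified sketch, not the toolkit's exact implementation):

```python
def shingles(value, sizes=(2, 3)):
    """Split a string into overlapping character n-grams."""
    s = value.lower()
    return [s[i : i + n] for n in sizes for i in range(len(s) - n + 1)]

print(shingles("Laura"))  # ['la', 'au', 'ur', 'ra', 'lau', 'aur', 'ura']
```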

The sex processing function just converts different formats to lowercase and takes the first letter. This will often be enough:

47 changes: 18 additions & 29 deletions docs/tutorials/in-the-cloud.qmd
@@ -4,7 +4,7 @@ description: >
Get you and your collaborators performing linkage in the cloud
---

-This tutorial provides an overview of how to use `pprl_toolkit` on
+This tutorial provides an overview of how to use the PPRL Toolkit on
Google Cloud Platform (GCP). We go over how to assemble and assign roles in a
linkage team, how to set up everybody's projects, and end with executing the
linkage itself.
@@ -34,14 +34,14 @@ model that allows one of the data-owning parties to author the workload while
the other is the operator.

::: {.callout-tip}
-In fact, `pprl_toolkit` is set up to allow any configuration of these roles
+In fact, the PPRL Toolkit is set up to allow any configuration of these roles
among up to four people.
:::

In any case, you must decide who will be doing what from the outset. Each role
comes with different responsibilities, but all roles require a GCP account and
access to the `gcloud` command-line tool. Additionally, everyone in the linkage
-project will need to install `pprl_toolkit`.
+project will need to install the PPRL Toolkit.

### Data-owning party

@@ -89,34 +89,23 @@ unique. This will ensure that bucket names are also globally unique.
Our aim is to create a globally unique name (and thus ID) for each project.
:::

-For example, say the US Census Bureau and UK Office for National Statistics
-(ONS) are looking to link some data on ex-patriated residents with PPRL. Then
-they might use `us-cb` and `uk-ons` as their party names, which are succinct
+For example, say a UK bank and a US bank are looking to link some data on international
+transactions to fit a machine learning model to predict fraud. Then
+they might use `us-eaglebank` and `uk-royalbank` as their party names, which are succinct
and descriptive. However, they are generic and rule out future PPRL projects
with the same names.

-As a remedy, they could make a hash of their project description to create an
+As a remedy, they could make a short hash of their project description to create an
identifier:

```bash
-$ echo -n "pprl us-cb uk-ons ex-pats-analysis" | sha256sum
-d59a50241dc78c3f926b565937b99614b7bb7c84e44fb780440718cb2b0ddc1b -
+$ echo -n "pprl us-eaglebank uk-royalbank fraud apr 2024" | sha256sum | cut -c 1-7
+4fb6720
```

-This is very long. You might only want to use the first few characters of this
-hash. Note that Google Cloud bucket names also can't be more than 63 characters
-long without dots.
-
-You can trim it down like so:
-
-```bash
-$ echo -n "pprl us-cb uk-ons ex-pats-analysis" | sha256sum | cut -c 1-7
-d59a502
-```

-So, our names would be: `uk-ons-d59a502`, `us-cb-d59a502`. If they had a
+So, our project names would be: `uk-royalbank-4fb6720`, `us-eaglebank-4fb6720`. If they had a
third-party linkage administrator (authoring and operating the workload), they
-would have a project called something like `admin-d59a502`.
+would have a project called something like `admin-4fb6720`.


## Setting up your projects
@@ -169,32 +158,32 @@ The workload operator requires three IAM roles:
| Storage Admin | `roles/storage.admin` | Managing a shared bucket |


-## Configuring `pprl_toolkit`
+## Configuring the PPRL Toolkit

Now your linkage team has its projects made up, you need to configure
-`pprl_toolkit`. This configuration tells the package where to look and what to
+the PPRL Toolkit. This configuration tells the package where to look and what to
call things; we do this with a single environment file containing a short
collection of key-value pairs.

We have provided an example environment file in `.env.example`. Copy or rename
-that file to `.env` in the root of the `pprl_toolkit` installation. Then, fill
+that file to `.env` in the root of the PPRL Toolkit installation. Then, fill
in your project details as necessary.

For our example above, let's say the ONS will be the workload author and the US
Census Bureau will be the workload operator. The environment file would look
something like this:

```bash
-PARTY_1_PROJECT=us-cb-d59a502
+PARTY_1_PROJECT=uk-royalbank-4fb6720
PARTY_1_KEY_VERSION=1

-PARTY_2_PROJECT=uk-ons-d59a502
+PARTY_2_PROJECT=us-eaglebank-4fb6720
PARTY_2_KEY_VERSION=1

-WORKLOAD_AUTHOR_PROJECT=uk-ons-d59a502
+WORKLOAD_AUTHOR_PROJECT=uk-royalbank-4fb6720
WORKLOAD_AUTHOR_PROJECT_REGION=europe-west2

-WORKLOAD_OPERATOR_PROJECT=us-cb-d59a502
+WORKLOAD_OPERATOR_PROJECT=us-eaglebank-4fb6720
WORKLOAD_OPERATOR_PROJECT_ZONE=us-east4-a
```

4 changes: 2 additions & 2 deletions docs/tutorials/index.qmd
@@ -9,8 +9,8 @@ listing:
filter-ui: false
---

-These tutorials walk you through some of the essential workflows for `pprl`.
-The purpose of these documents is for you to learn how to use the `pprl`
+These tutorials walk you through some of the essential workflows for pprl.
+The purpose of these documents is for you to learn how to use the pprl
package for your own linkage projects.

<br>
