Run on VM #142

Merged Feb 24, 2024 (17 commits)
165 changes: 162 additions & 3 deletions README.md
@@ -1,6 +1,6 @@
# workflow.data.preparation

`workflow.data.preparation` orchestrates the PACTA data preparation process, combining production, financial, scenario, and currency data into a format suitable for use in a PACTA for investors analysis. Assuming that the computing resource being used has sufficient memory (which can be >16 GB, depending on the inputs), storage space, and access to the necessary inputs, this is intended to work on a desktop or laptop using RStudio or run using the included [Dockerfile](https://github.com/RMI-PACTA/workflow.data.preparation/blob/main/Dockerfile) and [docker-compose.yml](https://github.com/RMI-PACTA/workflow.data.preparation/blob/main/docker-compose.yml).

## Running in RStudio

@@ -12,7 +12,7 @@ Running workflow.data.preparation has a number of R package dependencies that ar

To make things easier, the recommended way to specify the desired config set when running locally in RStudio is by setting the active config set to `desktop` and modifying/adding only a few of the properties in the `desktop` config set. By doing so, you benefit from inheriting many of the appropriate configuration values without having to explicitly specify each one.

You will need to set the `inherits` parameter, e.g. `inherits: 2022Q4`, to select the desired config set from those specified in the config.yml file.

You will need to set `data_prep_outputs_path` to an *existing* directory where you want the outputs to be saved, e.g. `data_prep_outputs_path: "./outputs"` for a directory named `outputs` in the working directory of the R session you will be running data.prep in. This directory must exist before running data.prep (and ideally be empty); the script will throw an error early on if it does not exist.
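
A minimal sketch of what the `desktop` config set might look like in `config.yml`, assuming only the two properties discussed above need to be set locally (the layout follows the R `config` package's inheritance conventions; all other values are inherited from the `2022Q4` set):

```yml
# config.yml (sketch; remaining properties are inherited from the 2022Q4 config set)
desktop:
  inherits: 2022Q4
  data_prep_outputs_path: "./outputs"
```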

@@ -34,7 +34,8 @@ Running the workflow requires a file `.env` to exist in the root directory, that

```sh
# .env
HOST_FACTSET_EXTRACTED_PATH=/PATH/TO/factset-extracted
HOST_ASSET_IMPACT_PATH=/PATH/TO/asset-impact
HOST_OUTPUTS_PATH=/PATH/TO/YYYYQQ_pacta_analysis_inputs_YYYY-MM-DD/YYYYQQ
GITHUB_PAT=ghp_XXXXxxXxXXXxXxxX
R_CONFIG_ACTIVE=YYYYQQ
@@ -57,6 +58,164 @@ Run `docker-compose up` from the root directory, and docker will build the image

Use `docker-compose build --no-cache` to force a rebuild of the Docker image.

## Running Data Preparation interactively on an Azure VM

*Instructions specific to the RMI-PACTA team's Azure instance are in italics.*

0. **Prerequisites:**
   *These steps have been completed on the RMI Azure instance.*
   - Ensure a Virtual Network with a Gateway has been set up, permitting SSH (port 22) access.
     Details of setting this up are out of scope for these instructions.
     Talk to your network coordinator for help.
   - Set up Storage Accounts containing the [required files](#required-input-files).
     While all the files can exist on a single file share in a single storage account, the workflow can access different storage accounts, allowing read-only access to raw data and preventing accidental manipulation of source data.
     The recommended structure (*used by RMI*) is:
     - Storage Account `pactadatadev` (read/write).
       Naming note: *RMI QAs datasets prior to moving them to PROD with [`workflow.pacta.data.qa`](https://github.com/RMI-PACTA/workflow.pacta.data.qa)*.
       - File Share `workflow-data-preparation-outputs`: outputs from this workflow.
     - Storage Account `pactarawdata` (read-only).
       - File Share `factset-extracted`: outputs from [`workflow.factset`](https://github.com/RMI-PACTA/workflow.factset).
       - File Share `AssetImpact`: raw data files from [Asset Impact](https://asset-impact.gresb.com/).
   - (Optional, but recommended) Create a User Assigned Managed Identity.
     Alternatively, after creating the VM with a system-assigned identity, you can grant that identity all appropriate permissions. ***RMI:** The `workflow-data-preparation` Identity exists with all the appropriate permissions.*
   - Grant the appropriate permissions to the Identity (a sketch of the role assignment follows this list):
     - `pactadatadev`: "Reader and Data Access".
     - `pactarawdata`: "Reader and Data Access".

     Note that this grants read/write access to the Storage Account via the Storage Account Key.
     To grant read-only access to the VM, use the `mount_afs` script without the `-w` flag, as shown below.
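
Granting a role assignment might look like the following sketch. The identity and storage-account names mirror the RMI examples in this README; the resource group `RMI-SP-PACTA-PROD` is an assumption based on the identity ID used in step 1, so adjust names for your own instance.

```sh
# Sketch: grant the managed identity "Reader and Data Access" on a storage account.
# Resource groups and names here are assumptions mirroring the RMI examples; adjust as needed.
PRINCIPAL_ID=$(az identity show \
  --resource-group "RMI-SP-PACTA-PROD" \
  --name "workflow-data-preparation" \
  --query principalId --output tsv)
STORAGE_ID=$(az storage account show \
  --resource-group "RMI-SP-PACTA-PROD" \
  --name "pactarawdata" \
  --query id --output tsv)
az role assignment create \
  --assignee "$PRINCIPAL_ID" \
  --role "Reader and Data Access" \
  --scope "$STORAGE_ID"
# Repeat for the pactadatadev storage account.
```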

1. **Start a VM**
While the machine can be deployed via the Portal (web UI), the following code block is provided for simplicity and to ensure consistency:

```sh
# The options here work with the RMI-PACTA team's Azure setup.
# Change values for your own instance as needed.

# Get Network details.
VNET_RESOURCE_GROUP="RMI-PROD-EU-VNET-RG"
VNET_NAME="RMI-PROD-EU-VNET"
SUBNET_NAME="RMI-SP-PACTA-DEV-VNET"
SUBNET_ID=$(az network vnet subnet show --resource-group $VNET_RESOURCE_GROUP --name $SUBNET_NAME --vnet-name $VNET_NAME --query id -o tsv)

# Use the identity previously setup (see Prerequisites)
MACHINEIDENTITY="/subscriptions/feef729b-4584-44af-a0f9-4827075512f9/resourceGroups/RMI-SP-PACTA-PROD/providers/Microsoft.ManagedIdentity/userAssignedIdentities/workflow-data-preparation"
# This size has 2 vCPUs and 32 GiB memory (recommended settings).
MACHINE_SIZE="Standard_E4-2as_v4"
# Using epoch to give machine a (probably) unique name
MACHINE_NAME="dataprep-runner-$(date +%s)"
# NOTE: Change this to your own RG as needed.
VM_RESOURCE_GROUP="RMI-SP-PACTA-DEV"

# **NOTE: Check these options prior to running**
# Non-RMI users may omit the --public-ip-address "" line below to have a public IP created for SSH access.

az vm create \
--admin-username azureuser \
--assign-identity "$MACHINEIDENTITY" \
--generate-ssh-keys \
--image Ubuntu2204 \
--name "$MACHINE_NAME" \
--nic-delete-option delete \
--os-disk-delete-option delete \
--public-ip-address "" \
--resource-group "$VM_RESOURCE_GROUP" \
--size "$MACHINE_SIZE" \
--subnet "$SUBNET_ID"

```

If this command runs successfully, it will output a JSON block describing the created resource (VM).
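
To double-check the deployment, you can query the VM's provisioning state (a sketch using the variables defined above):

```sh
# Confirm the VM deployed successfully (expected output: "Succeeded")
az vm show \
  --name "$MACHINE_NAME" \
  --resource-group "$VM_RESOURCE_GROUP" \
  --query provisioningState \
  --output tsv
```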

2. **Connect to the Network.** (Optional)
***RMI:** Connecting to the VPN will enable SSH access.*
Connect to the Virtual Network specified above, as the command above does not create a Public IP Address.
Details for this are out of scope for these instructions.
Contact your network coordinator for help.

3. **Connect to the newly created VM via SSH.**

```sh
# This connects to the VM created above via SSH.
# See above block for envvars referenced here.

az ssh vm \
--local-user azureuser \
--name "$MACHINE_NAME" \
--prefer-private-ip \
--resource-group "$VM_RESOURCE_GROUP"

```
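If `az ssh vm` is unavailable, plain `ssh` against the private IP should also work (a sketch, assuming the key generated by `--generate-ssh-keys` is in the default location):

```sh
# Look up the VM's private IP and connect with plain ssh
PRIVATE_IP=$(az vm list-ip-addresses \
  --name "$MACHINE_NAME" \
  --resource-group "$VM_RESOURCE_GROUP" \
  --query "[0].virtualMachine.network.privateIpAddresses[0]" \
  --output tsv)
ssh "azureuser@$PRIVATE_IP"
```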

4. **Connect the VM to required resources**
Clone this repo, install the `az` cli utility, and mount the appropriate Azure File Shares.

```sh
# Clone this repo through https to avoid need for an SSH key
git clone https://github.com/RMI-PACTA/workflow.data.preparation.git ~/workflow.data.preparation

# Install az cli
sudo apt update
# See https://aka.ms/installcli for alternate instructions
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash

# Login to azure with assigned identity
az login --identity

# Use script from this repo to connect to file shares
~/workflow.data.preparation/scripts/mount_afs.sh -r "RMI-SP-PACTA-PROD" -a "pactarawdata" -f "factset-extracted" -m "/mnt/factset-extracted"
~/workflow.data.preparation/scripts/mount_afs.sh -r "RMI-SP-PACTA-PROD" -a "pactarawdata" -f "asset-impact" -m "/mnt/asset-impact"

# Note the outputs directory has the -w flag, meaning write permissions are enabled.
~/workflow.data.preparation/scripts/mount_afs.sh -r "RMI-SP-PACTA-DEV" -a "pactadatadev" -f "workflow-data-preparation-outputs" -m "/mnt/workflow-data-preparation-outputs" -w

```
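A quick way to confirm all three shares are mounted:

```sh
# List CIFS mounts and their options
# (file_mode=0555 in the options indicates a read-only share; 0777 read/write)
findmnt --types cifs
```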

5. **Install Docker**

```sh
# install docker
sudo apt -y install \
docker-compose \
docker.io

# Allow azureuser to run docker without sudo
sudo usermod -aG docker azureuser
```

At this point, you need to log out of the shell so that the new group membership (`azureuser` joining the `docker` group) is re-evaluated.
You can log back in with the `az ssh` command from step 3.
Once you are back in the shell, run `docker run --rm hello-world` to confirm that docker is working correctly and that you can run it as a non-root user.

6. **Prepare `.env` file**
The `ubuntu2204` image used for the VM includes both `vim` and `nano`.
Create a `.env` file in the `workflow.data.preparation` directory, according to the instructions in the [running locally](#running-locally-with-docker-compose) section of this file.
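
Using the mount points from step 4, the `.env` might look like the following sketch (note that `HOST_OUTPUTS_PATH` may need to point at a quarter-specific subdirectory of the mounted share, as in the example earlier in this README):

```sh
# ~/workflow.data.preparation/.env (sketch; paths match the mounts from step 4)
HOST_FACTSET_EXTRACTED_PATH=/mnt/factset-extracted
HOST_ASSET_IMPACT_PATH=/mnt/asset-impact
HOST_OUTPUTS_PATH=/mnt/workflow-data-preparation-outputs
GITHUB_PAT=ghp_XXXXxxXxXXXxXxxX
R_CONFIG_ACTIVE=YYYYQQ
```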

7. **Build Docker image**
The cloned git repo in your home directory and the mounted directories should still be in place after logging back in.
Additionally, `azureuser` should now be part of the `docker` group.
You can confirm this with:

```sh
groups
ls ~
ls /mnt
```

With that in place, you are ready to build the `workflow.data.preparation` docker image.
**To ensure that a dropped network connection does not kill the process, you should run this in `tmux`.**

```sh
# navigate to the workflow.data.preparation repo
cd ~/workflow.data.preparation

tmux

docker-compose build

docker-compose up

```
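If the connection drops mid-run, the `tmux` session keeps `docker-compose` alive; after reconnecting via step 3, you can pick the session back up:

```sh
# Reattach to the tmux session running docker-compose
tmux attach
# Detach deliberately with Ctrl-b d; list sessions with:
tmux ls
```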

## Required Input Files

All required files must exist at `$HOST_INPUTS_PATH`, in a single directory (no subdirectories).
11 changes: 9 additions & 2 deletions docker-compose.yml
@@ -4,10 +4,17 @@ services:
  data_prep:
    build:
      context: .
    environment:
      - LOG_LEVEL=TRACE
    volumes:
      - type: bind
        source: ${HOST_FACTSET_EXTRACTED_PATH}
        target: /mnt/factset-extracted
        read_only: true
      - type: bind
        source: ${HOST_ASSET_IMPACT_PATH}
        target: /mnt/asset-impact
        read_only: true
      - type: bind
        source: ${HOST_OUTPUTS_PATH}
        target: /mnt/outputs
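
The compose file pins `LOG_LEVEL=TRACE`; for a quieter run you can override it per invocation (a sketch, assuming the workflow's logger accepts standard levels such as `INFO`, and using the `data_prep` service name from the compose file):

```sh
# One-off run with a less verbose log level
docker-compose run --rm -e LOG_LEVEL=INFO data_prep
```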
117 changes: 117 additions & 0 deletions scripts/mount_afs.sh
@@ -0,0 +1,117 @@
#! /bin/sh

# Mount an Azure File Share at a given location.
# Requires az cli to be installed and logged in.

usage() {
  echo "Usage: mount_afs.sh [-h] [-v] [-w] -r <resource group> -a <storage account name> -f <file share name> -m <mount point>"
  echo "  -h: help (this message)"
  echo "  -v: verbose"
  echo "  -w: allow write access to the file share (default is read-only)"
  echo "  -r: resource group (Required)"
  echo "  -a: storage account name (Required)"
  echo "  -f: file share name (Required)"
  echo "  -m: mount point (Required)"
  echo "  -?: help"
  exit 1
}

while getopts "h?vwr:a:f:m:" opt; do
  case "$opt" in
    h|\?)
      usage
      ;;
    v)
      VERBOSE=1
      ;;
    w)
      ALLOW_WRITE=1
      ;;
    r)
      RESOURCEGROUP=$OPTARG
      ;;
    a)
      STORAGEACCOUNTNAME=$OPTARG
      ;;
    f)
      FILESHARENAME=$OPTARG
      ;;
    m)
      MOUNTPOINT=$OPTARG
      ;;
    *)
      usage
      ;;
  esac
done

missing_opts=0
if [ -z "$RESOURCEGROUP" ]; then
  echo "ERROR: Resource group is required"
  missing_opts=1
fi

if [ -z "$STORAGEACCOUNTNAME" ]; then
  echo "ERROR: Storage account name is required"
  missing_opts=1
fi

if [ -z "$FILESHARENAME" ]; then
  echo "ERROR: File share name is required"
  missing_opts=1
fi

if [ -z "$MOUNTPOINT" ]; then
  echo "ERROR: Mount point is required"
  missing_opts=1
fi

if [ $missing_opts -eq 1 ]; then
  usage
fi

if [ -n "$VERBOSE" ]; then
  echo "RESOURCEGROUP: $RESOURCEGROUP"
  echo "STORAGEACCOUNTNAME: $STORAGEACCOUNTNAME"
  echo "FILESHARENAME: $FILESHARENAME"
  echo "MOUNTPOINT: $MOUNTPOINT"
fi

# This command assumes you have logged in with az login

if [ -n "$VERBOSE" ]; then
  echo "Getting https endpoint for storage account $STORAGEACCOUNTNAME"
fi

httpEndpoint=$(az storage account show \
  --resource-group "$RESOURCEGROUP" \
  --name "$STORAGEACCOUNTNAME" \
  --query "primaryEndpoints.file" --output tsv | tr -d '"')

# Strip the "https:" scheme so the endpoint becomes a UNC-style //host/ path
smbPath=$(echo "$httpEndpoint" | cut -c7-${#httpEndpoint})$FILESHARENAME
fileHost=$(echo "$httpEndpoint" | cut -c7-${#httpEndpoint} | tr -d "/")

# Check that the SMB port (445) on the file host is reachable
nc -zvw3 "$fileHost" 445

if [ -n "$VERBOSE" ]; then
  echo "httpEndpoint: $httpEndpoint"
  echo "smbPath: $smbPath"
  echo "fileHost: $fileHost"
fi

if [ -n "$VERBOSE" ]; then
  echo "Getting storage account key"
fi
storageAccountKey=$(az storage account keys list \
  --resource-group "$RESOURCEGROUP" \
  --account-name "$STORAGEACCOUNTNAME" \
  --query "[0].value" --output tsv | tr -d '"')

if [ -n "$VERBOSE" ]; then
  echo "Creating mount path: $MOUNTPOINT"
fi
sudo mkdir -p "$MOUNTPOINT"

if [ -n "$VERBOSE" ]; then
  echo "Mounting $smbPath to $MOUNTPOINT"
fi

# Read-only by default; the -w flag mounts with full read/write permissions
if [ -n "$ALLOW_WRITE" ]; then
  permissions="file_mode=0777,dir_mode=0777"
else
  permissions="file_mode=0555,dir_mode=0555"
fi

sudo mount -t cifs "$smbPath" "$MOUNTPOINT" -o username="$STORAGEACCOUNTNAME",password="$storageAccountKey",serverino,nosharesock,actimeo=30,nobrl,"$permissions",vers=3.1.1
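
For reference, a typical invocation mirroring step 4 above (`-v` prints the resolved endpoint and mount details; `-w` enables write access, so it is used here for the outputs share):

```sh
# Mount the outputs share read-write, with verbose diagnostics
~/workflow.data.preparation/scripts/mount_afs.sh -v -w \
  -r "RMI-SP-PACTA-DEV" \
  -a "pactadatadev" \
  -f "workflow-data-preparation-outputs" \
  -m "/mnt/workflow-data-preparation-outputs"
```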