diff --git a/README.md b/README.md index ac5d44a..834e447 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,6 @@ # workflow.data.preparation -`workflow.data.preparation` orchestrates the PACTA data preparation process, combining production, financial, scenario, and currency data into a format suitable for use in a PACTA for investors analysis. Assuming that the computing resource being used has sufficient memory (which can be >16gb depending on the inputs), storage space, and access to the necessary inputs, this is intended to work on a desktop or laptop using RStudio or run using the included [Dockerfile](https://github.com/RMI-PACTA/workflow.data.preparation/blob/main/Dockerfile) and [docker-compose.yml](https://github.com/RMI-PACTA/workflow.data.preparation/blob/main/docker-compose.yml). +`workflow.data.preparation` orchestrates the PACTA data preparation process, combining production, financial, scenario, and currency data into a format suitable for use in a PACTA for investors analysis. Assuming that the computing resource being used has sufficient memory (which can be >16 GB depending on the inputs), storage space, and access to the necessary inputs, this is intended to work on a desktop or laptop using RStudio or run using the included [Dockerfile](https://github.com/RMI-PACTA/workflow.data.preparation/blob/main/Dockerfile) and [docker-compose.yml](https://github.com/RMI-PACTA/workflow.data.preparation/blob/main/docker-compose.yml). ## Running in RStudio @@ -12,7 +12,7 @@ Running workflow.data.preparation has a number of R package dependencies that ar To make things easier, the recommended way to specify the desired config set when running locally in RStudio is by setting the active config set to `desktop` and modifying/adding only a few of the properties in the `desktop` config set. By doing so, you benefit from inheriting many of the appropriate configuration values without having to explicitly specify each one.
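For illustration, a minimal `desktop` config set might look like the following sketch (the values shown are hypothetical; `inherits` and `data_prep_outputs_path` are the properties most likely to need local overrides):

```yml
# config.yml (excerpt) -- a hypothetical `desktop` config set.
# It inherits most values from the `2022Q4` config set and
# overrides only the properties that differ locally.
desktop:
  inherits: 2022Q4
  data_prep_outputs_path: "./outputs"
```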
-You will need to set the `inherits` parameter, e.g. `inherits: 2022Q4`, to select which of the config sets specified in the config.yml file that is desired. +You will need to set the `inherits` parameter, e.g. `inherits: 2022Q4`, to select which of the config sets specified in the config.yml file should be used. You will also need to set `data_prep_outputs_path` to an *existing* directory where you want the outputs to be saved, e.g. `data_prep_outputs_path: "./outputs"` to point to a directory named `outputs` in the working directory of the R session that will run data.prep. This directory must exist before running data.prep (and should ideally be empty); the script will throw an error early on if it does not exist. @@ -34,7 +34,8 @@ Running the workflow requires a file `.env` to exist in the root directory, that ```sh # .env -HOST_INPUTS_PATH=/PATH/TO/inputs +HOST_FACTSET_EXTRACTED_PATH=/PATH/TO/factset-extracted +HOST_ASSET_IMPACT_PATH=/PATH/TO/asset-impact HOST_OUTPUTS_PATH=/PATH/TO/YYYYQQ_pacta_analysis_inputs_YYYY-MM-DD/YYYYQQ GITHUB_PAT=ghp_XXXXxxXxXXXxXxxX R_CONFIG_ACTIVE=YYYYQQ @@ -57,6 +58,164 @@ Run `docker-compose up` from the root directory, and docker will build the image Use `docker-compose build --no-cache` to force a rebuild of the Docker image. +## Running Data Preparation interactively on Azure VM + +*Instructions specific to the RMI-PACTA team's Azure instance are in italics.* + +0. **Prerequisites:** + *These steps have been completed on the RMI Azure instance.* + - Ensure a Virtual Network with a Gateway has been set up, permitting SSH (Port 22) access. + Details of setting this up are out of scope for these instructions. + Talk to your network coordinator for help. + - Set up Storage Accounts containing the [required files](#required-input-files).
+ While all the files can exist on a single file share in a single storage account, the workflow can access different storage accounts, allowing read-only access to raw data to prevent accidental manipulation of source data. + The recommended structure (*used by RMI*) is: + - Storage Account: `pactadatadev`: (read/write). + Naming note: *RMI QAs datasets prior to moving them to PROD with [`workflow.pacta.data.qa`](https://github.com/RMI-PACTA/workflow.pacta.data.qa)*. + - File Share `workflow-data-preparation-outputs`: Outputs from this workflow. + - Storage Account: `pactarawdata` (read-only) + - File Share `factset-extracted`: Outputs from [`workflow.factset`](https://github.com/RMI-PACTA/workflow.factset) + - File Share `AssetImpact`: Raw data files from [Asset Impact](https://asset-impact.gresb.com/) + - (Optional, but recommended) Create a User Assigned Managed Identity. + Alternatively, after creating the VM with a system-managed identity, you can assign all appropriate permissions. ***RMI:** The `workflow-data-preparation` Identity exists with all the appropriate permissions.* + - Grant appropriate permissions to the Identity: + - `pactadatadev`: "Reader and Data Access". + - `pactarawdata`: "Reader and Data Access". + Note that this gives read/write access to the Storage Account via the Storage Account Key. + To grant read-only access to the VM, use the `mount_afs` script without the `-w` flag, as shown below. + +1. **Start a VM** + While the machine can be deployed via the Portal (WebUI), for simplicity, the following code block is provided to ensure consistency: + + ```sh + # The options here work with the RMI-PACTA team's Azure setup. + # Change values for your own instance as needed. + + # Get Network details.
+ VNET_RESOURCE_GROUP="RMI-PROD-EU-VNET-RG" + VNET_NAME="RMI-PROD-EU-VNET" + SUBNET_NAME="RMI-SP-PACTA-DEV-VNET" + SUBNET_ID=$(az network vnet subnet show --resource-group $VNET_RESOURCE_GROUP --name $SUBNET_NAME --vnet-name $VNET_NAME --query id -o tsv) + + # Use the identity previously set up (see Prerequisites) + MACHINEIDENTITY="/subscriptions/feef729b-4584-44af-a0f9-4827075512f9/resourceGroups/RMI-SP-PACTA-PROD/providers/Microsoft.ManagedIdentity/userAssignedIdentities/workflow-data-preparation" + # This size has 2 vCPUs and 32 GiB of memory, the recommended settings. + MACHINE_SIZE="Standard_E4-2as_v4" + # Using epoch to give the machine a (probably) unique name + MACHINE_NAME="dataprep-runner-$(date +%s)" + # NOTE: Change this to your own RG as needed. + VM_RESOURCE_GROUP="RMI-SP-PACTA-DEV" + + # **NOTE: Check these options prior to running** + # Non-RMI users may choose to omit the --public-ip-address line for public SSH access. + + az vm create \ + --admin-username azureuser \ + --assign-identity "$MACHINEIDENTITY" \ + --generate-ssh-keys \ + --image Ubuntu2204 \ + --name "$MACHINE_NAME" \ + --nic-delete-option delete \ + --os-disk-delete-option delete \ + --public-ip-address "" \ + --resource-group "$VM_RESOURCE_GROUP" \ + --size "$MACHINE_SIZE" \ + --subnet "$SUBNET_ID" + + ``` + + If this command runs successfully, it will output a JSON block describing the resource (VM) created. + +2. **Connect to the Network.** (Optional) + ***RMI:** Connecting to the VPN will enable SSH access.* + Connect to the Virtual Network specified above, as the command above does not create a Public IP Address. + Details for this are out of scope for these instructions. + Contact your network coordinator for help. + +3. **Connect to the newly created VM via SSH.** + + ```sh + # This connects to the VM created above via SSH. + # See above block for envvars referenced here.
+ + az ssh vm \ + --local-user azureuser \ + --name "$MACHINE_NAME" \ + --prefer-private-ip \ + --resource-group "$VM_RESOURCE_GROUP" + + ``` + +4. **Connect the VM to required resources** + Clone this repo, install the `az` cli utility, and mount the appropriate Azure File Shares. + + ```sh + # Clone this repo through https to avoid need for an SSH key + git clone https://github.com/RMI-PACTA/workflow.data.preparation.git ~/workflow.data.preparation + + # Install az cli + sudo apt update + # See https://aka.ms/installcli for alternate instructions + curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash + + # Login to azure with assigned identity + az login --identity + + # Use script from this repo to connect to file shares + ~/workflow.data.preparation/scripts/mount_afs.sh -r "RMI-SP-PACTA-PROD" -a "pactarawdata" -f "factset-extracted" -m "/mnt/factset-extracted" + ~/workflow.data.preparation/scripts/mount_afs.sh -r "RMI-SP-PACTA-PROD" -a "pactarawdata" -f "asset-impact" -m "/mnt/asset-impact" + + # Note the outputs directory has the -w flag, meaning write permissions are enabled. + ~/workflow.data.preparation/scripts/mount_afs.sh -r "RMI-SP-PACTA-DEV" -a "pactadatadev" -f "workflow-data-preparation-outputs" -m "/mnt/workflow-data-preparation-outputs" -w + + ``` + +5. **Install Docker** + + ```sh + # install docker + sudo apt -y install \ + docker-compose \ + docker.io + + # Allow azureuser to run docker without sudo + sudo usermod -aG docker azureuser + ``` + + At this point, you need to log out of the shell so that the new group membership (`azureuser` added to the `docker` group) takes effect. + You can log back in with the `az ssh` command from step 3. + Once you are back in the shell, you can run `docker run --rm hello-world` to confirm that Docker is working correctly and that you can run it as a non-root user. + +6. **Prepare `.env` file** + The `ubuntu2204` image used for the VM includes both `vim` and `nano`.
+ Create a `.env` file in the `workflow.data.preparation` directory, according to the instructions in the [running locally](#running-locally-with-docker-compose) section of this file. + +7. **Build Docker image** + The cloned git repo in the home directory and the mounted directories should still be in place after logging in again. + Additionally, `azureuser` should be part of the `docker` group. + You can confirm this with: + + ```sh + groups + ls ~ + ls /mnt + ``` + + With that in place, you are ready to build the `workflow.data.preparation` docker image. + **To ensure that a dropped network connection does not kill the process, you should run this in `tmux`.** + + ```sh + # navigate to the workflow.data.preparation repo + cd ~/workflow.data.preparation + + tmux + + docker-compose build + + docker-compose up + + ``` + ## Required Input Files All required files must exist at `$HOST_INPUTS_PATH`, in a single directory (no subdirectories). diff --git a/docker-compose.yml b/docker-compose.yml index 2c1e990..400d371 100644 --- a/docker-compose.yml +++ b/docker-compose.yml @@ -4,10 +4,17 @@ services: data_prep: build: context: . + environment: + - LOG_LEVEL=TRACE volumes: - type: bind - source: ${HOST_INPUTS_PATH} - target: /inputs + source: ${HOST_FACTSET_EXTRACTED_PATH} + target: /mnt/factset-extracted + read_only: true + - type: bind + source: ${HOST_ASSET_IMPACT_PATH} + target: /mnt/asset-impact + read_only: true - type: bind source: ${HOST_OUTPUTS_PATH} target: /mnt/outputs diff --git a/scripts/mount_afs.sh b/scripts/mount_afs.sh new file mode 100755 index 0000000..d0f58c9 --- /dev/null +++ b/scripts/mount_afs.sh @@ -0,0 +1,117 @@ +#! /bin/sh + +# mount an Azure File Share at a given location. +# Requires az cli to be installed and logged in.
+ +usage() { + echo "Usage: mount_afs.sh [-h] [-v] [-w] -r <resource-group> -a <storage-account> -f <file-share> -m <mount-point>" + echo " -h: help (this message)" + echo " -v: verbose" + echo " -w: Allow write access to the file share (default is read-only)" + echo " -r: resource group (Required)" + echo " -a: storage account name (Required)" + echo " -f: file share name (Required)" + echo " -m: mount point (Required)" + echo " -?: help" + exit 1 +} + +while getopts "h?vwr:a:f:m:" opt; do + case "$opt" in + h|\?) + usage + ;; + v) VERBOSE=1 + ;; + w) ALLOW_WRITE=1 + ;; + r) RESOURCEGROUP=$OPTARG + ;; + a) STORAGEACCOUNTNAME=$OPTARG + ;; + f) FILESHARENAME=$OPTARG + ;; + m) MOUNTPOINT=$OPTARG + ;; + *) + usage + ;; + esac +done + +missing_opts=0 +if [ -z "$RESOURCEGROUP" ]; then + echo "ERROR: Resource group is required" + missing_opts=1 +fi + +if [ -z "$STORAGEACCOUNTNAME" ]; then + echo "ERROR: Storage account name is required" + missing_opts=1 +fi + +if [ -z "$FILESHARENAME" ]; then + echo "ERROR: File share name is required" + missing_opts=1 +fi + +if [ -z "$MOUNTPOINT" ]; then + echo "ERROR: Mount point is required" + missing_opts=1 +fi + +if [ $missing_opts -eq 1 ]; then + usage +fi + +if [ -n "$VERBOSE" ]; then + echo "RESOURCEGROUP: $RESOURCEGROUP" + echo "STORAGEACCOUNTNAME: $STORAGEACCOUNTNAME" + echo "FILESHARENAME: $FILESHARENAME" + echo "MOUNTPOINT: $MOUNTPOINT" +fi + +# This command assumes you have logged in with az login + +if [ -n "$VERBOSE" ]; then + echo "Getting https endpoint for storage account $STORAGEACCOUNTNAME" +fi + +httpEndpoint=$(az storage account show \ + --resource-group "$RESOURCEGROUP" \ + --name "$STORAGEACCOUNTNAME" \ + --query "primaryEndpoints.file" --output tsv | tr -d '"') +smbPath=$(echo "$httpEndpoint" | cut -c7-${#httpEndpoint})$FILESHARENAME +fileHost=$(echo "$httpEndpoint" | cut -c7-${#httpEndpoint}| tr -d "/") +nc -zvw3 "$fileHost" 445 + +if [ -n "$VERBOSE" ]; then + echo "httpEndpoint: $httpEndpoint" + echo "smbPath: $smbPath" + echo "fileHost: $fileHost" +fi + +if [ -n "$VERBOSE" ]; then + echo "Getting storage account key" +fi +storageAccountKey=$(az storage account keys list \ + --resource-group "$RESOURCEGROUP" \ + --account-name "$STORAGEACCOUNTNAME" \ + --query "[0].value" --output tsv | tr -d '"') + +if [ -n "$VERBOSE" ]; then + echo "Creating mount path: $MOUNTPOINT" +fi +sudo mkdir -p "$MOUNTPOINT" + +if [ -n "$VERBOSE" ]; then + echo "Mounting $smbPath to $MOUNTPOINT" +fi + +if [ -n "$ALLOW_WRITE" ]; then + permissions="file_mode=0777,dir_mode=0777" +else + permissions="file_mode=0555,dir_mode=0555" +fi + +sudo mount -t cifs "$smbPath" "$MOUNTPOINT" -o username="$STORAGEACCOUNTNAME",password="$storageAccountKey",serverino,nosharesock,actimeo=30,nobrl,"$permissions",vers=3.1.1
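As a sanity check of the path handling in `mount_afs.sh`, the snippet below reproduces the endpoint-to-SMB-path transformation in isolation, using a hypothetical storage account endpoint; no Azure access is needed to run it:

```sh
# The `primaryEndpoints.file` value returned by `az storage account show`
# has the form "https://<account>.file.core.windows.net/".
# The account and share names below are hypothetical examples.
httpEndpoint="https://pactadatadev.file.core.windows.net/"
FILESHARENAME="workflow-data-preparation-outputs"

# `cut -c7-` drops the leading "https:" (6 characters), leaving the
# "//host/" form expected by `mount -t cifs`; `tr -d "/"` then strips
# the slashes to recover the bare host for the port-445 reachability check.
smbPath=$(echo "$httpEndpoint" | cut -c7-${#httpEndpoint})$FILESHARENAME
fileHost=$(echo "$httpEndpoint" | cut -c7-${#httpEndpoint} | tr -d "/")

echo "$smbPath"   # //pactadatadev.file.core.windows.net/workflow-data-preparation-outputs
echo "$fileHost"  # pactadatadev.file.core.windows.net
```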