
Develop vm #146

Closed
wants to merge 58 commits into `main` from `develop-vm`
3625964
add default 2023Q4 config
cjyetman Feb 15, 2024
268c3cc
Merge branch 'main' into add-default-2023Q4-config
cjyetman Feb 15, 2024
de2bd60
add `factset_industry_map_bridge_filename`
cjyetman Feb 15, 2024
f192844
Merge branch 'main' into add-default-2023Q4-config
cjyetman Feb 15, 2024
bdbe25d
Merge branch 'main' into add-default-2023Q4-config
cjyetman Feb 15, 2024
58e365c
add `factset_manual_pacta_sector_override`
cjyetman Feb 15, 2024
5a2ec24
add `_filename` suffix
cjyetman Feb 15, 2024
a96f286
add AI dataset filenames
cjyetman Feb 15, 2024
6e34fde
change default "desktop" config to use 2023Q4
cjyetman Feb 15, 2024
965c2da
Update from main
cjyetman Feb 16, 2024
e6c114b
Merge branch 'main' into add-default-2023Q4-config
cjyetman Feb 16, 2024
3a98177
Merge branch 'main' into add-default-2023Q4-config
cjyetman Feb 16, 2024
00c26bc
add more parameters to review
cjyetman Feb 17, 2024
58d78fa
docs(deploy): Define prerequisites
AlexAxthelm Feb 17, 2024
032459e
docs(deploy): Instructions up through connecting
AlexAxthelm Feb 17, 2024
080ad61
feat(deploy): Add mount_afs script
AlexAxthelm Feb 17, 2024
222375e
fix(deploy): Update script to default to read-only
AlexAxthelm Feb 17, 2024
c2e798c
feat(deploy): Use new split inputs in docker-compose
AlexAxthelm Feb 18, 2024
4f42a33
feat(deploy): Change AI File paths
AlexAxthelm Feb 18, 2024
13d2d3c
feat(deploy): Update Factset file paths for 2022Q4
AlexAxthelm Feb 18, 2024
dfc4825
ci(deploy): Add verbose logging for remote environment
AlexAxthelm Feb 18, 2024
1249902
docs(deploy): Update README instructions
AlexAxthelm Feb 18, 2024
e730ab7
fix(deploy): fix path in docker volume mount
AlexAxthelm Feb 18, 2024
90b3293
feat(deploy): make docker-compose mounts read-only
AlexAxthelm Feb 18, 2024
fa02e01
docs(deploy): update Readme
AlexAxthelm Feb 18, 2024
3020e58
Add current working config for 2022q4
AlexAxthelm Feb 18, 2024
665a676
return config to `main`
AlexAxthelm Feb 18, 2024
2790485
Merge branch 'docs/how-to-run-on-vm' into develop-vm
AlexAxthelm Feb 18, 2024
78f5c4a
Merge branch 'build/143-2022q4-config' into develop-vm
AlexAxthelm Feb 18, 2024
b3ddeed
Add temporary step for targeting this branch
AlexAxthelm Feb 18, 2024
2f63203
feat(package): #147 Update dependency
AlexAxthelm Feb 18, 2024
88c1f7d
hack(docker): Use dev version of pacta.data.prep
AlexAxthelm Feb 18, 2024
2866159
feat(app): #147 convert checks to use input_filepaths
AlexAxthelm Feb 18, 2024
e9f5e18
fix(app): #147 Move data prep outputs path out of input files
AlexAxthelm Feb 18, 2024
32b7c23
feat(app): #147 Use `input_filepaths` in `parameters`
AlexAxthelm Feb 18, 2024
2d2c72f
feat(app): #148 Export data from scraped files
AlexAxthelm Feb 18, 2024
35fd720
refactor(app): #147 Move code to avoid conflict
AlexAxthelm Feb 18, 2024
55a82ad
Merge branch 'feat/148-export-scraped-inputs' into feat/147-update-ca…
AlexAxthelm Feb 18, 2024
15058c5
feat(app): #147 INclude preflight paths in input_filepaths
AlexAxthelm Feb 18, 2024
1f9d65b
Merge branch 'feat/148-export-scraped-inputs' into develop-vm
AlexAxthelm Feb 18, 2024
1260620
Merge branch 'feat/147-update-calls-to-write_manifest' into develop-vm
AlexAxthelm Feb 18, 2024
b27ca13
feat(app): #151 Put outputs into unique directory
AlexAxthelm Feb 18, 2024
27be26f
refactor(deploy): Harmonize docker mount points
AlexAxthelm Feb 18, 2024
713eb5c
Merge branch 'harmonize-output-filepath-for-docker' into develop-vm
AlexAxthelm Feb 18, 2024
d415ea4
feat(app): #152 Export archives of outputs and inputs
AlexAxthelm Feb 18, 2024
1b8cd49
Merge branch 'feat/151-isolate-outputs' into feat/147-update-calls-to…
AlexAxthelm Feb 18, 2024
4239eae
feat(app): #147 Use explicit filepaths for archives
AlexAxthelm Feb 18, 2024
ac8cce6
Merge branch 'feat/151-isolate-outputs' into develop-vm
AlexAxthelm Feb 18, 2024
4a5b477
Merge branch 'feat/147-update-calls-to-write_manifest' into develop-vm
AlexAxthelm Feb 18, 2024
ee97f26
build(docker): #138 Allow .git in build context
AlexAxthelm Feb 18, 2024
dc8c83c
Merge branch 'build/138-include-git-in-docker-image' into develop-vm
AlexAxthelm Feb 18, 2024
ef49c6e
build(deploy): #144 Draft 2023Q4 Config
AlexAxthelm Feb 18, 2024
10595f7
Merge branch 'build/144-2023q4-config' into develop-vm
AlexAxthelm Feb 18, 2024
a490e3c
build(deploy): #144 Only change data sources from 2022Q4
AlexAxthelm Feb 18, 2024
99b87e6
Merge branch 'build/144-2023q4-config' into develop-vm
AlexAxthelm Feb 18, 2024
93459a6
fix(deploy): #144 fix bad factset path
AlexAxthelm Feb 18, 2024
588a47f
Merge branch 'build/144-2023q4-config' into develop-vm
AlexAxthelm Feb 18, 2024
02db084
Merge branch 'main' into develop-vm
AlexAxthelm Feb 24, 2024
2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -42,7 +42,7 @@ Imports:
stringr,
tidyr
Remotes:
RMI-PACTA/pacta.data.preparation,
RMI-PACTA/pacta.data.preparation#341,
RMI-PACTA/pacta.data.scraping,
RMI-PACTA/pacta.scenario.preparation
Depends:
171 changes: 168 additions & 3 deletions README.md
@@ -1,6 +1,6 @@
# workflow.data.preparation

`workflow.data.preparation` orchestrates the PACTA data preparation process, combining production, financial, scenario, and currency data into a format suitable for use in a PACTA for investors analysis. Assuming that the computing resource being used has sufficient memory (which can be >16gb depending on the inputs), storage space, and access to the necessary inputs, this is intended to work on a desktop or laptop using RStudio or run using the included [Dockerfile](https://github.com/RMI-PACTA/workflow.data.preparation/blob/main/Dockerfile) and [docker-compose.yml](https://github.com/RMI-PACTA/workflow.data.preparation/blob/main/docker-compose.yml).
`workflow.data.preparation` orchestrates the PACTA data preparation process, combining production, financial, scenario, and currency data into a format suitable for use in a PACTA for investors analysis. Assuming that the computing resource being used has sufficient memory (which can be >16Gb depending on the inputs), storage space, and access to the necessary inputs, this is intended to work on a desktop or laptop using RStudio or run using the included [Dockerfile](https://github.com/RMI-PACTA/workflow.data.preparation/blob/main/Dockerfile) and [docker-compose.yml](https://github.com/RMI-PACTA/workflow.data.preparation/blob/main/docker-compose.yml).

## Running in RStudio

@@ -12,7 +12,7 @@ Running workflow.data.preparation has a number of R package dependencies that ar

To make things easier, the recommended way to specify the desired config set when running locally in RStudio is by setting the active config set to `desktop` and modifying/adding only a few of the properties in the `desktop` config set. By doing so, you benefit from inheriting many of the appropriate configuration values without having to explicitly specify each one.

You will need to set the `inherits` parameter, e.g. `inherits: 2022Q4`, to select which of the config sets specified in the config.yml file that is desired.
You will need to set the `inherits` parameter, e.g. `inherits: 2022Q4`, to select which of the config sets specified in the config.yml file that is desired.

You will need to set `data_prep_outputs_path` to an *existing* directory where you want the outputs to be saved, e.g. `data_prep_outputs_path: "./outputs"` to point to an existing directory named `outputs` in the working directory of the R session you will be running data.prep in. This directory must exist before running data.prep (and ideally be empty). The script will throw an error early on if it does not exist.
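Putting the two settings above together, a minimal `desktop` config set might look like the following sketch (the inherited set name and output path are illustrative examples, not required values):

```yml
# config.yml (sketch; values are examples)
desktop:
  inherits: 2022Q4
  data_prep_outputs_path: "./outputs"
```

All other properties are inherited from the `2022Q4` config set unless explicitly overridden here.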

@@ -34,7 +34,8 @@ Running the workflow requires a file `.env` to exist in the root directory, that

```sh
# .env
HOST_INPUTS_PATH=/PATH/TO/inputs
HOST_FACTSET_EXTRACTED_PATH=/PATH/TO/factset-extracted
HOST_ASSET_IMPACT_PATH=/PATH/TO/asset-impact
HOST_OUTPUTS_PATH=/PATH/TO/YYYYQQ_pacta_analysis_inputs_YYYY-MM-DD/YYYYQQ
GITHUB_PAT=ghp_XXXXxxXxXXXxXxxX
R_CONFIG_ACTIVE=YYYYQQ
@@ -57,6 +58,170 @@ Run `docker-compose up` from the root directory, and docker will build the image

Use `docker-compose build --no-cache` to force a rebuild of the Docker image.

## Running Data Preparation interactively on Azure VM

*Instructions specific to the RMI-PACTA team's Azure instance are in italics.*

0. **Prerequisites:**
*These steps have been completed on the RMI Azure instance.*
- Ensure a Virtual Network with a Gateway has been set up, permitting SSH (Port 22) access.
Details of setting this up are out of scope for these instructions.
Talk to your network coordinator for help.
- Set up Storage Accounts containing the [required files](#required-input-files).
While all the files can exist on a single file share in a single storage account, the workflow can access different storage accounts, allowing read-only access to raw data to prevent accidental manipulation of source data.
The recommended structure (*used by RMI*) is:
- Storage Account: `pactadatadev`: (read/write).
Naming note: *RMI QAs datasets prior to moving them to PROD with [`workflow.pacta.data.qa`](https://github.com/RMI-PACTA/workflow.pacta.data.qa)*.
- File Share `workflow-data-preparation-outputs`: Outputs from this workflow.
- Storage Account: `pactarawdata` (read-only)
- File Share `factset-extracted`: Outputs from [`workflow.factset`](https://github.com/RMI-PACTA/workflow.factset)
- File Share `AssetImpact`: Raw data files from [Asset Impact](https://asset-impact.gresb.com/)
- (Optional, but recommended) Create a User Assigned Managed Identity.
Alternatively, after creating the VM with a system-assigned identity, you can grant it the appropriate permissions. ***RMI:** The `workflow-data-preparation` Identity exists with all the appropriate permissions.*
- Grant Appropriate permissions to the Identity:
- `pactadatadev`: "Reader and Data Access".
- `pactarawdata`: "Reader and Data Access"
Note that this grants read/write access to the Storage Account via the Storage Account Key.
To grant read-only access to the VM, use the `mount_afs` script without the `-w` flag, as shown below.

1. **Start a VM**
While the machine can be deployed via the Portal (WebUI), the following code block is provided for simplicity and to ensure consistency:

```sh
# The options here work with the RMI-PACTA team's Azure setup.
# Change values for your own instance as needed.

# Get Network details.
VNET_RESOURCE_GROUP="RMI-PROD-EU-VNET-RG"
VNET_NAME="RMI-PROD-EU-VNET"
SUBNET_NAME="RMI-SP-PACTA-DEV-VNET"
SUBNET_ID=$(az network vnet subnet show --resource-group $VNET_RESOURCE_GROUP --name $SUBNET_NAME --vnet-name $VNET_NAME --query id -o tsv)

# Use the identity previously set up (see Prerequisites)
MACHINEIDENTITY="/subscriptions/feef729b-4584-44af-a0f9-4827075512f9/resourceGroups/RMI-SP-PACTA-PROD/providers/Microsoft.ManagedIdentity/userAssignedIdentities/workflow-data-preparation"
# This size has 2 vCPUs and 32 GiB memory, the recommended settings.
MACHINE_SIZE="Standard_E4-2as_v4"
# Using epoch to give machine a (probably) unique name
MACHINE_NAME="dataprep-runner-$(date +%s)"
# NOTE: Change this to your own RG as needed.
VM_RESOURCE_GROUP="RMI-SP-PACTA-DEV"

# **NOTE: Check these options prior to running**
# Non-RMI users may choose to omit the --public-ip-address line below,
# so that the VM gets a public IP for SSH access.

az vm create \
--admin-username azureuser \
--assign-identity "$MACHINEIDENTITY" \
--generate-ssh-keys \
--image Ubuntu2204 \
--name "$MACHINE_NAME" \
--nic-delete-option delete \
--os-disk-delete-option delete \
--public-ip-address "" \
--resource-group "$VM_RESOURCE_GROUP" \
--size "$MACHINE_SIZE" \
--subnet "$SUBNET_ID"

```

If this command runs successfully, it will output a JSON block describing the created VM.

2. **Connect to the Network.** (Optional)
***RMI:** Connecting to the VPN will enable SSH access.*
Connect to the Virtual Network specified above, as the command above does not create a Public IP Address.
Details for this are out of scope for these instructions.
Contact your network coordinator for help.

3. **Connect to the newly created VM via SSH.**

```sh
# This connects to the VM created above via SSH.
# See above block for envvars referenced here.

az ssh vm \
--local-user azureuser \
--name "$MACHINE_NAME" \
--prefer-private-ip \
--resource-group "$VM_RESOURCE_GROUP"

```

4. **Connect the VM to required resources**
Clone this repo, install the `az` cli utility, and mount the appropriate Azure File Shares.

```sh
# Clone this repo through https to avoid need for an SSH key
git clone https://github.com/RMI-PACTA/workflow.data.preparation.git ~/workflow.data.preparation

# Temporary step: check out the develop-vm branch
cd ~/workflow.data.preparation
git fetch
git checkout develop-vm
cd ~

# Install az cli
sudo apt update
# See https://aka.ms/installcli for alternate instructions
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash

# Login to azure with assigned identity
az login --identity

# Use script from this repo to connect to file shares
~/workflow.data.preparation/scripts/mount_afs.sh -r "RMI-SP-PACTA-PROD" -a "pactarawdata" -f "factset-extracted" -m "/mnt/factset-extracted"
~/workflow.data.preparation/scripts/mount_afs.sh -r "RMI-SP-PACTA-PROD" -a "pactarawdata" -f "asset-impact" -m "/mnt/asset-impact"

# Note the outputs directory has the -w flag, meaning write permissions are enabled.
~/workflow.data.preparation/scripts/mount_afs.sh -r "RMI-SP-PACTA-DEV" -a "pactadatadev" -f "workflow-data-preparation-outputs" -m "/mnt/workflow-data-preparation-outputs" -w

```

5. **Install Docker**

```sh
# install docker
sudo apt -y install \
docker-compose \
docker.io

# Allow azureuser to run docker without sudo
sudo usermod -aG docker azureuser
```

At this point, you need to log out of the shell so that the new group membership (`docker` added to `azureuser`) takes effect.
You can log back in with the `az ssh` command from step 3.
Once you are back in the shell, run `docker run --rm hello-world` to confirm that docker is working correctly and that you can run it as a non-root user.

6. **Prepare `.env` file**
The `ubuntu2204` image used for the VM includes both `vim` and `nano`.
Create a `.env` file in the `workflow.data.preparation` directory, according to the instructions in the [running locally](#running-locally-with-docker-compose) section of this file.
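As a sketch, assuming the mount points created in step 4 and a 2023Q4 config set (the PAT value is a placeholder and the config set name is an example), the file could be written with a heredoc from inside `~/workflow.data.preparation`:

```shell
# Write a .env in the current directory (the repo root).
# Replace GITHUB_PAT with a real token and R_CONFIG_ACTIVE
# with the config set you intend to run.
cat > .env <<'EOF'
HOST_FACTSET_EXTRACTED_PATH=/mnt/factset-extracted
HOST_ASSET_IMPACT_PATH=/mnt/asset-impact
HOST_OUTPUTS_PATH=/mnt/workflow-data-preparation-outputs
GITHUB_PAT=ghp_XXXXxxXxXXXxXxxX
R_CONFIG_ACTIVE=2023Q4
EOF
```

The host paths match the `-m` mount points passed to `mount_afs.sh` in step 4; adjust them if you mounted the shares elsewhere.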

7. **Build Docker image**
The cloned git repo in the home directory and the mounted directories should still be in place after logging in again.
Additionally, `azureuser` should be part of the `docker` group.
You can confirm this with:

```sh
groups
ls ~
ls /mnt
```

With that in place, you are ready to build the `workflow.data.preparation` docker image.
**To ensure that a dropped network connection does not kill the process, you should run the build in `tmux`.** If the connection drops, log back in and run `tmux attach` to reattach to the session.

```sh
# navigate to the workflow.data.preparation repo
cd ~/workflow.data.preparation

tmux

docker-compose build

docker-compose up

```

## Required Input Files

All required files must exist at `$HOST_INPUTS_PATH`, in a single directory (no subdirectories).
11 changes: 9 additions & 2 deletions docker-compose.yml
@@ -4,10 +4,17 @@ services:
data_prep:
build:
context: .
environment:
- LOG_LEVEL=TRACE
volumes:
- type: bind
source: ${HOST_INPUTS_PATH}
target: /inputs
source: ${HOST_FACTSET_EXTRACTED_PATH}
target: /mnt/factset-extracted
read_only: true
- type: bind
source: ${HOST_ASSET_IMPACT_PATH}
target: /mnt/asset-impact
read_only: true
- type: bind
source: ${HOST_OUTPUTS_PATH}
target: /mnt/outputs
1 change: 0 additions & 1 deletion run_pacta_data_preparation.R
@@ -33,7 +33,6 @@ config <-

asset_impact_data_path <- config$asset_impact_data_path
factset_data_path <- config$factset_data_path
data_prep_outputs_path <- config$data_prep_outputs_path
masterdata_ownership_filename <- config$masterdata_ownership_filename
masterdata_debt_filename <- config$masterdata_debt_filename
ar_company_id__factset_entity_id_filename <- config$ar_company_id__factset_entity_id_filename
117 changes: 117 additions & 0 deletions scripts/mount_afs.sh
@@ -0,0 +1,117 @@
#! /bin/sh

# mount an Azure File Share at a given location.
# Requires az cli to be installed and logged in.

usage() {
echo "Usage: mount_afs.sh [-h] [-v] [-w] -r <resource group> -a <storage account name> -f <file share name> -m <mount point>"
echo " -h: help (this message)"
echo " -v: verbose"
echo " -w: Allow write access to the file share (default is read-only)"
echo " -r: resource group (Required)"
echo " -a: storage account name (Required)"
echo " -f: file share name (Required)"
echo " -m: mount point (Required)"
echo " -?: help"
exit 1
}

while getopts "h?vwr:a:f:m:" opt; do
case "$opt" in
h|\?)
usage
;;
v) VERBOSE=1
;;
w) ALLOW_WRITE=1
;;
r) RESOURCEGROUP=$OPTARG
;;
a) STORAGEACCOUNTNAME=$OPTARG
;;
f) FILESHARENAME=$OPTARG
;;
m) MOUNTPOINT=$OPTARG
;;
*)
usage
;;
esac
done

missing_opts=0
if [ -z "$RESOURCEGROUP" ]; then
echo "ERROR: Resource group is required"
missing_opts=1
fi

if [ -z "$STORAGEACCOUNTNAME" ]; then
echo "ERROR: Storage account name is required"
missing_opts=1
fi

if [ -z "$FILESHARENAME" ]; then
echo "ERROR: File share name is required"
missing_opts=1
fi

if [ -z "$MOUNTPOINT" ]; then
echo "ERROR: Mount point is required"
missing_opts=1
fi

if [ $missing_opts -eq 1 ]; then
usage
fi

if [ -n "$VERBOSE" ]; then
echo "RESOURCEGROUP: $RESOURCEGROUP"
echo "STORAGEACCOUNTNAME: $STORAGEACCOUNTNAME"
echo "FILESHARENAME: $FILESHARENAME"
echo "MOUNTPOINT: $MOUNTPOINT"
fi

# This command assumes you have logged in with az login

if [ -n "$VERBOSE" ]; then
echo "Getting https endpoint for storage account $STORAGEACCOUNTNAME"
fi

httpEndpoint=$(az storage account show \
--resource-group "$RESOURCEGROUP" \
--name "$STORAGEACCOUNTNAME" \
--query "primaryEndpoints.file" --output tsv | tr -d '"')
smbPath=$(echo "$httpEndpoint" | cut -c7-${#httpEndpoint})$FILESHARENAME
fileHost=$(echo "$httpEndpoint" | cut -c7-${#httpEndpoint}| tr -d "/")
nc -zvw3 "$fileHost" 445

if [ -n "$VERBOSE" ]; then
echo "httpEndpoint: $httpEndpoint"
echo "smbPath: $smbPath"
echo "fileHost: $fileHost"
fi

if [ -n "$VERBOSE" ]; then
echo "Getting storage account key"
fi
storageAccountKey=$(az storage account keys list \
--resource-group "$RESOURCEGROUP" \
--account-name "$STORAGEACCOUNTNAME" \
--query "[0].value" --output tsv | tr -d '"')

if [ -n "$VERBOSE" ]; then
echo "Creating mount path: $MOUNTPOINT"
fi
sudo mkdir -p "$MOUNTPOINT"

if [ -n "$VERBOSE" ]; then
echo "Mounting $smbPath to $MOUNTPOINT"
fi

if [ -n "$ALLOW_WRITE" ]; then
permissions="file_mode=0777,dir_mode=0777"
else
permissions="file_mode=0555,dir_mode=0555"
fi

sudo mount -t cifs "$smbPath" "$MOUNTPOINT" -o username="$STORAGEACCOUNTNAME",password="$storageAccountKey",serverino,nosharesock,actimeo=30,nobrl,"$permissions",vers=3.1.1
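
The endpoint parsing in the script can be exercised in isolation. The following sketch mirrors the `cut`/`tr` derivation above with an example endpoint (the account and share names are illustrative):

```shell
# Mirror mount_afs.sh's derivation of smbPath and fileHost from the
# storage account's HTTPS file endpoint.
httpEndpoint="https://pactadatadev.file.core.windows.net/"
FILESHARENAME="workflow-data-preparation-outputs"
# Drop the leading "https:" (characters 1-6) to get the UNC-style path.
smbPath=$(echo "$httpEndpoint" | cut -c7-${#httpEndpoint})$FILESHARENAME
# Also strip the slashes to get the bare host used for the port-445 check.
fileHost=$(echo "$httpEndpoint" | cut -c7-${#httpEndpoint} | tr -d "/")
echo "$smbPath"   # //pactadatadev.file.core.windows.net/workflow-data-preparation-outputs
echo "$fileHost"  # pactadatadev.file.core.windows.net
```

This is why the script passes `$smbPath` to `mount -t cifs` but probes `$fileHost` with `nc`: CIFS wants the `//host/share` form, while the reachability check wants the hostname alone.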