
MLTable - AzureML - Cache Environment variables #3143

Open · FrsECM opened this issue Apr 26, 2024 · 4 comments

FrsECM commented Apr 26, 2024

Operating System

Linux

Version Information

mltable-1.6.1
azureml-dataprep-rslex~=2.22.2dev0

Steps to reproduce

  1. Run a job on a compute whose disk size is S
  2. Mount a datastore as a folder with mltable, where the datastore's total size > S
  3. Wait...
  4. Crash

For example, in Azure Machine Learning:

import mltable

storage_paths = [
    {'folder': 'azureml://subscriptions/$sub/resourcegroups/$rg/workspaces/$ws/datastores/$ds/paths/'}
]
tbl = mltable.from_paths(storage_paths)
mount_context = tbl._mount()
mount_context.start()
# Iterate over files
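For context, the "iterate over files" step is just a walk over the mount point; a minimal sketch (it only uses mount_context.mount_point, the same attribute as in the wrapper below):

import os

for root, _dirs, files in os.walk(mount_context.mount_point):
    for name in files:
        # Every read goes through the FUSE mount and lands in the on-disk cache.
        with open(os.path.join(root, name), 'rb') as f:
            f.read()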

In order to fix my issue, I need to add extra mount settings:
https://learn.microsoft.com/en-us/azure/machine-learning/how-to-read-write-data-v2?view=azureml-api-2&tabs=python#available-mount-settings

I use a wrapper class in order to do this across multiple storages / containers:

import os
from dataclasses import dataclass, field
from typing import Any, List

import mltable

@dataclass
class MyStorage:
    mount_paths: List[dict] = field(init=False, default_factory=list)
    _is_mounted: bool = field(init=False, default=False)
    _mount_context: Any = field(init=False, default=None)

    def __post_init__(self):
        # Negative value: keep at least 40GB free on the cluster.
        os.environ['DATASET_MOUNT_CACHE_SIZE'] = "-40GB"
        os.environ['DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED'] = "True"

    def mount(self):
        print('Start mounting storage...')
        for path in self.mount_paths:
            print(f"- {path['folder']}")
        tbl = mltable.from_paths(self.mount_paths)
        self._mount_context = tbl._mount()
        self._mount_context.start()
        self._is_mounted = True
        print(f'Mount done - {self._mount_context.mount_point}')

    def umount(self):
        if self._is_mounted:
            print(f'Start unmounting - {self._mount_context.mount_point}')
            self._mount_context.stop()
            self._mount_context = None
            self._is_mounted = False
            print('Unmount done...')

    def __del__(self):
        # Best effort: __del__ is not guaranteed to run, prefer calling umount() explicitly.
        self.umount()

storage = MyStorage()
storage.mount_paths = storage_paths
storage.mount()
# Do stuff
del storage
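As a side note, a context-manager variant avoids relying on __del__ for cleanup; a small sketch reusing the class above:

from contextlib import contextmanager

@contextmanager
def mounted_storage(paths):
    # Mount on entry, always unmount on exit (even on exceptions).
    storage = MyStorage()
    storage.mount_paths = paths
    storage.mount()
    try:
        yield storage
    finally:
        storage.umount()

# Usage:
# with mounted_storage(storage_paths) as storage:
#     ...  # do stuff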

I also tried to set the environment variables in the job YAML:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json

experiment_name: LARGE-JOB
display_name: Large Job

environment_variables:
  DATASET_MOUNT_CACHE_SIZE: "-40 GB"
  DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED: "True"
  DATASET_MOUNT_FILE_CACHE_PRUNE_TARGET: "0.0"

....

But none of these solutions works.

Expected behavior

I expect the disk cache to be pruned once it reaches the -40GB limit, i.e. once free disk space on the compute machine drops below 40 GB.

Actual behavior

Currently, the cache continues to grow:
[screenshot: disk usage climbing]

until the job fails:
[screenshot: out-of-disk failure]

even if I set the environment variables in the YAML:
[screenshot: environment_variables in the job YAML]

or in code:
[screenshot: os.environ assignments]

And I can confirm that the environment variables are set in the job:
[screenshot: variables listed in the job environment]

But mltable seems to ignore them.
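For reference, the confirmation above is a check along these lines, run inside the job:

import os

for key in ('DATASET_MOUNT_CACHE_SIZE',
            'DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED',
            'DATASET_MOUNT_FILE_CACHE_PRUNE_TARGET'):
    print(key, '=', os.environ.get(key))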

Additional information

No response

@FrsECM FrsECM added the bug label Apr 26, 2024

FrsECM commented Apr 26, 2024

For people who may run into this problem, I found a workaround:

import os
import re

# MountOptions lives in azureml-dataprep's fuse module
from azureml.dataprep.fuse.dprepfuse import MountOptions

def mount_options() -> MountOptions:
    """Translate the DATASET_MOUNT_CACHE_SIZE env variable into MountOptions."""
    max_size = None
    free_space_required = None
    cache_param = os.getenv('DATASET_MOUNT_CACHE_SIZE')
    if cache_param:
        # e.g. "-40GB" -> keep 40 GB free; "500MB" -> cap the cache at 500 MB
        CACHE_SIZE_PATTERN = r'^(?P<sign>-?)(?P<val>\d+).*(?P<size>[A-Z]{2})$'
        match = re.match(CACHE_SIZE_PATTERN, cache_param)
        if match:
            size = match.group('size')
            if size == 'GB':
                coeff = 1024 ** 3
            elif size == 'MB':
                coeff = 1024 ** 2
            else:
                raise NotImplementedError(f'Not implemented for size {size}')
            value = int(match.group('val')) * coeff
            if match.group('sign') == '-':
                # We are in mode "free_space_required"
                print(f'MountOption : {value} bytes free space required')
                free_space_required = value
            else:
                # We are in mode "max_size"
                print(f'MountOption : {value} bytes max size')
                max_size = value
    return MountOptions(max_size=max_size, free_space_required=free_space_required)

# You can now consume your MLTable
import mltable

storage_paths = [
    {'folder': 'azureml://subscriptions/$sub/resourcegroups/$rg/workspaces/$ws/datastores/$ds/paths/'}
]
tbl = mltable.from_paths(storage_paths)
mount_context = tbl._mount(mount_options=mount_options())
mount_context.start()
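A quick sanity check of the parsing (the expected outputs come from the print statements inside mount_options above):

os.environ['DATASET_MOUNT_CACHE_SIZE'] = '-40GB'
mount_options()   # prints: MountOption : 42949672960 bytes free space required

os.environ['DATASET_MOUNT_CACHE_SIZE'] = '500MB'
mount_options()   # prints: MountOption : 524288000 bytes max size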

Done this way, it works, but the prune target is still ignored:
[screenshot: prune target not honored]

Anyway, this is a bug in my view; the behaviour should be consistent with the documentation.


IvanHahan commented Aug 20, 2024

I have the same bug: data caching eats up all the space on a 64 GB disk, so I can't store training checkpoints.
I tried setting DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED: true, but an error arises because the value can't be a boolean.
When I set DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED: "true", nothing happens; data keeps getting cached.


FrsECM commented Aug 29, 2024

> I have the same bug: data caching eats up all the space on a 64 GB disk, so I can't store training checkpoints.
> I tried setting DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED: true, but an error arises because the value can't be a boolean.
> When I set DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED: "true", nothing happens; data keeps getting cached.

Normally you can use the workaround I posted above: set the DATASET_MOUNT_CACHE_SIZE environment variable to a size and pass the parsed MountOptions to _mount(); it should then be honoured.

But anyway, it should be fixed...
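Note that environment variables can only be strings, which is why the unquoted true is rejected; plain Python shows the same constraint:

import os

os.environ['DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED'] = "True"   # OK, value is a string
try:
    os.environ['DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED'] = True  # raises TypeError: str expected
except TypeError as err:
    print(err)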


FrsECM commented Sep 23, 2024

Another concern we have is that we cannot set other parameters, like these two:
[screenshot: two additional mount parameters]

Being able to set them would let us fetch less data than we currently do: with a shuffled dataloader there is no point in caching more blocks than the average image size.
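For illustration, the mount-settings page linked above documents read-tuning knobs such as DATASET_MOUNT_READ_BLOCK_SIZE and DATASET_MOUNT_READ_BUFFER_BLOCK_COUNT (not necessarily the two in the screenshot); if they were honoured, tuning for shuffled reads of ~1 MB images could look like this sketch (values are guesses):

import os

# Hypothetical tuning: read roughly one image per block, and don't
# prefetch blocks we will never revisit under a shuffled access pattern.
os.environ['DATASET_MOUNT_READ_BLOCK_SIZE'] = str(1 * 1024 ** 2)
os.environ['DATASET_MOUNT_READ_BUFFER_BLOCK_COUNT'] = '1'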

Would it be possible to open-source mltable?
