
[Port HIL-SERL] Add HF vision encoder option in SAC #651

Open. Wants to merge 11 commits into base branch user/adil-zouitine/2025-1-7-port-hil-serl-new.
Conversation

@ChorntonYoel ChorntonYoel commented Jan 21, 2025

What this does

Adds a pretrained vision encoder from the Hugging Face Hub to the SAC modeling code, and enables the appropriate image transforms during training.
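A minimal sketch of what such a wrapper could look like (the class name, `freeze` flag, and pooling fallback are assumptions for illustration, not the PR's actual implementation; it assumes `torch` and the optional `transformers` dependency):

```python
import torch
from torch import nn


class HFVisionEncoderWrapper(nn.Module):
    """Hypothetical sketch: a (optionally frozen) Hugging Face vision
    backbone producing one flat feature vector per camera image."""

    def __init__(self, vision_encoder_name: str, freeze: bool = True):
        super().__init__()
        from transformers import AutoModel  # optional dependency, imported lazily

        self.backbone = AutoModel.from_pretrained(vision_encoder_name)
        if freeze:
            self.backbone.requires_grad_(False)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        out = self.backbone(pixel_values=pixel_values)
        # Hub vision models expose either a pooled vector or a token grid;
        # fall back to mean pooling over tokens when no pooled output exists.
        if getattr(out, "pooler_output", None) is not None:
            return out.pooler_output
        return out.last_hidden_state.mean(dim=1)
```

The SAC observation encoder could then call this wrapper once per `observation.image*` key and concatenate the resulting feature vectors with the state features.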

How it was tested

Ran a SAC training with moss.
My SAC training script and associated modeling code differ slightly from this branch, so it would be good to double-check with a training loop that incorporates real images and the current SAC training script.

@ChorntonYoel ChorntonYoel marked this pull request as draft January 21, 2025 12:16
@@ -74,7 +74,23 @@ def make_dataset(cfg, split: str = "train") -> LeRobotDataset | MultiLeRobotData

image_transforms = None
if cfg.training.image_transforms.enable:
cfg_tf = cfg.training.image_transforms
default_tf = OmegaConf.create(
@ChorntonYoel (Author)
Added this to have more flexibility when creating transforms. Let me know if you don't think we need it, or if you have a better idea.

@ChorntonYoel ChorntonYoel marked this pull request as ready for review January 21, 2025 15:29
@helper2424 (Contributor) left a comment

The code looks fine in general. I haven't checked it manually since my environment is a little broken; I will try to do it tomorrow.

@@ -27,6 +27,7 @@
import torch.nn.functional as F # noqa: N812
from huggingface_hub import PyTorchModelHubMixin
from torch import Tensor
from transformers import AutoModel
@helper2424 (Contributor)
This import needs to be moved under the class definition, as was done for the reward classifier: https://github.com/huggingface/lerobot/pull/565/files#diff-160b98695ab8295e4ac586d8b9e50cb8e849b2bc31b2daceb28ded10580ab574R46. The reason is that transformers is not installed by default, so the library would crash on import unless the hil-serl dependencies are installed.
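The deferred-import pattern the reviewer suggests can be sketched as follows (the class name and error message are illustrative, not the repository's actual code):

```python
class PretrainedImageEncoder:
    """Sketch of the deferred-import pattern.

    Importing transformers inside __init__ rather than at module top level
    means this module can still be imported when the optional hil-serl
    dependencies are absent; the error only surfaces if a user actually
    requests a pretrained encoder.
    """

    def __init__(self, vision_encoder_name: str):
        try:
            from transformers import AutoModel  # optional dependency
        except ImportError as e:
            raise ImportError(
                "Using a pretrained vision encoder requires the optional "
                "`transformers` dependency (install the hil-serl extras)."
            ) from e
        self.model = AutoModel.from_pretrained(vision_encoder_name)
```

With the import at module scope instead, merely importing the SAC modeling file would fail in a default installation.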

@@ -476,7 +505,13 @@ def forward(self, obs_dict: dict[str, Tensor]) -> Tensor:
# Concatenate all images along the channel dimension.
image_keys = [k for k in self.config.input_shapes if k.startswith("observation.image")]
for image_key in image_keys:
feat.append(flatten_forward_unflatten(self.image_enc_layers, obs_dict[image_key]))
if self.config.vision_encoder_name is not None:
@helper2424 (Contributor)
Just a small refactoring: let's extract this check into a separate method, has_pretrained_visual_encoder.
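The suggested extraction might look like this (the `SACConfig` dataclass here is a hypothetical, trimmed stand-in for the real config class; only the `vision_encoder_name` field is taken from the diff):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SACConfig:
    # Hypothetical subset of the real SAC config.
    vision_encoder_name: Optional[str] = None


class SACObservationEncoder:
    def __init__(self, config: SACConfig):
        self.config = config

    @property
    def has_pretrained_visual_encoder(self) -> bool:
        # Extracted check, per the review suggestion: True when a Hub
        # encoder name is configured, False for the built-in CNN path.
        return self.config.vision_encoder_name is not None
```

The forward pass can then branch on `self.has_pretrained_visual_encoder` instead of repeating the `is not None` comparison inline.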
