Add Prompt Depth Anything Model #35401
base: main
Conversation
@NielsRogge @qubvel @pcuenca Could you help review this PR when you have some time? Thanks so much in advance! Let me know if you have any questions or suggestions. 😊
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()

# config.backbone = "resnet18"
To be removed?
I removed them, as Prompt Depth Anything only supports the DINO backbone.
Thank you for taking the time to review this PR. I have fixed the issues above.
Hi @haotongl! Thanks for working on integrating the model into transformers 🤗 I'm on holidays until Jan 3rd, and I'll do a review after that if it's still necessary.
<!--
<Tip>

[Prompt Depth Anything V2](prompt_depth_anything_v2) was released in June 2024. It retains the same architecture as the original Prompt Depth Anything, ensuring compatibility with all existing code examples and workflows. However, it utilizes synthetic data and a larger capacity teacher model to deliver more precise and robust depth predictions.
The Markdown link to prompt_depth_anything_v2 probably won't work. However, we can link to depth_anything_v2 instead: https://huggingface.co/docs/transformers/main/en/model_doc/depth_anything_v2
*Prompts play a critical role in unleashing the power of language and vision foundation models for specific tasks. For the first time, we introduce prompting into depth foundation models, creating a new paradigm for metric depth estimation termed Prompt Depth Anything. Specifically, we use a low-cost LiDAR as the prompt to guide the Depth Anything model for accurate metric depth output, achieving up to 4K resolution. Our approach centers on a concise prompt fusion design that integrates the LiDAR at multiple scales within the depth decoder. To address training challenges posed by limited datasets containing both LiDAR depth and precise GT depth, we propose a scalable data pipeline that includes synthetic data LiDAR simulation and real data pseudo GT depth generation. Our approach sets new state-of-the-arts on the ARKitScenes and ScanNet++ datasets and benefits downstream applications, including 3D reconstruction and generalized robotic grasping.*
<img src="https://promptda.github.io/assets/teaser.jpg" |
Feel free to open a PR on this repo, specifically this folder: https://huggingface.co/datasets/huggingface/documentation-images/tree/main/transformers/model_doc to add a prompt_depth_anything_architecture.jpg picture
predicted_depth = self.conv1(hidden_states)
predicted_depth = nn.functional.interpolate(
    predicted_depth,
    (int(patch_height * self.patch_size), int(patch_width * self.patch_size)),
Could you use the `torch_int` helper function from utils instead of `int`? The current approach means the model can't be traced, and results in the following warning during export:
/usr/local/lib/python3.10/dist-packages/transformers/models/prompt_depth_anything/modeling_prompt_depth_anything.py:352: TracerWarning: Converting a tensor to a Python integer might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
(int(patch_height * self.patch_size), int(patch_width * self.patch_size)),
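A minimal sketch of what the suggested fix could look like (the `mode` and `align_corners` arguments are assumptions about the surrounding call, which is truncated above):

```python
# Sketch only: torch_int (from transformers.utils) returns a torch int64
# tensor while tracing and a plain Python int in eager mode, so the
# interpolation size is no longer baked in as a constant during export.
from transformers.utils import torch_int

predicted_depth = self.conv1(hidden_states)
predicted_depth = nn.functional.interpolate(
    predicted_depth,
    (torch_int(patch_height * self.patch_size), torch_int(patch_width * self.patch_size)),
    mode="bilinear",       # assumed to match the original call
    align_corners=True,    # assumed to match the original call
)
```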
if prompt_depth is not None:
    # normalize prompt depth
    B = len(prompt_depth)
Suggested change: replace `B = len(prompt_depth)` with `B = prompt_depth.shape[0]`, since `len()` of a tensor causes issues during tracing.
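For illustration, a tiny repro of the tracing difference (not from the PR, just a sketch):

```python
import torch

def with_len(t):
    # len(t) yields a plain Python int, which torch.jit.trace bakes in as
    # a constant, so the traced graph would not generalize to other batch sizes.
    return torch.zeros(len(t))

def with_shape(t):
    # t.shape[0] is recorded as a size op and stays dynamic in the trace.
    return torch.zeros(t.shape[0])

example = torch.zeros(2, 3)
traced = torch.jit.trace(with_shape, example)  # traces cleanly
# torch.jit.trace(with_len, example) emits a TracerWarning about using len()
```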
>>> prompt_depth_url = "https://github.com/DepthAnything/PromptDA/blob/main/assets/example_images/arkit_depth.png?raw=true"
>>> prompt_depth = Image.open(requests.get(prompt_depth_url, stream=True).raw)
>>> prompt_depth = torch.tensor((np.asarray(prompt_depth) / 1000.0).astype(np.float32))
>>> prompt_depth = prompt_depth.unsqueeze(0).unsqueeze(0) |
Usage-wise, it might be a good idea to create a `PromptDepthAnythingProcessor` to help handle processing the input image (via the image processor) and the optional `prompt_depth` input.
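For instance, reusing the `image` and `prompt_depth` from the example above, usage might look like the following sketch (the processor class, checkpoint ids, and `prompt_depth` argument are hypothetical, mirroring the suggestion rather than any merged API):

```python
# Hypothetical API sketch; PromptDepthAnythingProcessor does not exist yet.
from transformers import AutoModelForDepthEstimation

processor = PromptDepthAnythingProcessor.from_pretrained("...")  # placeholder checkpoint id
model = AutoModelForDepthEstimation.from_pretrained("...")       # placeholder checkpoint id

inputs = processor(images=image, prompt_depth=prompt_depth, return_tensors="pt")
outputs = model(**inputs)
predicted_depth = outputs.predicted_depth
```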
What does this PR do?
This PR adds the Prompt Depth Anything Model. Prompt Depth Anything builds upon Depth Anything V2 and incorporates metric prompt depth to enable accurate and high-resolution metric depth estimation.
The implementation leverages Modular Transformers. The main file can be found here.
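For context on the modular approach, a rough sketch of what a modular definition can look like (the class names below are assumptions based on the existing Depth Anything code, not necessarily what this PR's modular file contains):

```python
# Hedged sketch of the modular-transformers pattern: the modular file
# subclasses existing Depth Anything components, and the flattened
# modeling_prompt_depth_anything.py is auto-generated from it by
# utils/modular_model_converter.py.
from transformers.models.depth_anything.modeling_depth_anything import (
    DepthAnythingDepthEstimationHead,
)


class PromptDepthAnythingDepthEstimationHead(DepthAnythingDepthEstimationHead):
    # Override only the pieces that change (e.g. prompt depth fusion);
    # everything else is inherited and regenerated by the converter.
    pass
```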
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.