<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta property="og:title" content="All Things ViTs: Understanding and Interpreting Attention in Vision"/>
<meta property="og:url" content="https://all-things-vits.github.io/atv/"/>
<meta property="og:image" content="static/figures/teaser.jpg" />
<meta property="og:image:width" content="1200"/>
<meta property="og:image:height" content="630"/>
<title>All Things ViTs</title>
<link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro"
rel="stylesheet">
<link rel="icon" href="static/figures/elephant.jpeg">
<link rel="stylesheet" href="static/css/bulma.min.css">
<link rel="stylesheet" href="static/css/bulma-carousel.min.css">
<link rel="stylesheet" href="static/css/bulma-slider.min.css">
<link rel="stylesheet" href="static/css/fontawesome.all.min.css">
<link rel="stylesheet"
href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
<link rel="stylesheet" href="static/css/index.css">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
<script src="https://documentcloud.adobe.com/view-sdk/main.js"></script>
<script defer src="static/js/fontawesome.all.min.js"></script>
<script src="static/js/bulma-carousel.min.js"></script>
<script src="static/js/bulma-slider.min.js"></script>
<script src="static/js/index.js"></script>
</head>
<body>
<section class="publication-header">
<div class="hero-body">
<div class="container is-max-widescreen">
<!-- <div class="columns is-centered"> -->
<div class="column has-text-centered">
<h1 class="title is-1 publication-title">All Things ViTs: Understanding and Interpreting Attention in Vision</h1>
</div>
</div>
</div>
</section>
<section class="publication-author-block">
<div class="hero-body">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column has-text-centered">
<div class="is-size-4 publication-authors">
<span class="author-block"><a href="https://hila-chefer.github.io/" target="_blank">Hila Chefer</a><sup>1</sup>,</span>
<span class="author-block"><a href="https://sayak.dev/" target="_blank">Sayak Paul</a><sup>2</sup></span>
</div>
<div class="is-size-6 publication-authors">
<span class="author-block"><sup>1</sup>Tel Aviv University, Google <sup>2</sup>Hugging Face
</div>
<div class="is-size-3 publication-authors">
<span class="author-block">CVPR 2023 Hybrid Tutorial</span>
</div>
<div class="is-size-4 publication-authors">
<span class="author-block">Sunday, June 18th 9:00-12:00, West 211</span>
</div>
<div class="column has-text-centered">
<div class="publication-links">
<span class="link-block">
<a href="https://github.com/all-things-vits/code-samples" target="_blank"
class="external-link button is-normal is-rounded">
<span class="icon">
<i class="fab fa-github"></i>
</span>
<span>Code</span>
</a>
</span>
<span class="link-block">
<a href="https://huggingface.co/all-things-vits" target="_blank"
class="external-link button is-normal is-rounded">
<span class="icon">
<i class="fas fa-laptop"></i>
</span>
<span>Demos</span>
</a>
</span>
<span class="link-block">
<a href="https://cvpr2023.thecvf.com/virtual/2023/tutorial/18574" target="_blank"
class="external-link button is-normal is-rounded">
<span class="icon">
<i class="fa fa-globe"></i>
</span>
<span>Virtual Site</span>
</a>
</span>
<span class="link-block">
<a href="https://drive.google.com/file/d/10NaQNVybucl8i2Or0iA_DC_NWkhs_IgV/view?usp=sharing" target="_blank"
class="external-link button is-normal is-rounded">
<span class="icon">
<i class="fa fa-desktop"></i>
</span>
<span>Slides</span>
</a>
</span>
<span class="link-block">
<a href="https://youtu.be/ma3NYVo8Im0" target="_blank"
class="external-link button is-normal is-rounded">
<span class="icon">
<i class="fab fa-brands fa-youtube"></i>
</span>
<span>Recording</span>
</a>
</span>
</div>
</div>
</div>
</div>
</div>
</div>
</section>
<section class="hero is-small">
<!-- <div class="hero-body"> -->
<section class="hero teaser">
<div class="container is-max-desktop">
<div class="hero-body">
<!-- <div id="results-carousel" class="carousel results-carousel"> -->
<div class="container">
<div class="item">
<div class="column is-centered has-text-centered">
<img src="static/figures/teaser.jpg" alt="All_Things_ViTs" width="850px"/>
<h2 class="subtitle">
In this tutorial, we explore the use of attention in vision. From left to right: <b>(i)</b> attention can be used to explain a model's predictions (e.g., CLIP for an image-text pair); <b>(ii)</b> attention-based models can be probed to study what they learn; <b>(iii)</b> the cross-attention maps of multi-modal models can be used to guide generative models (e.g., mitigating neglect in Stable Diffusion).
</h2>
</div>
</div>
</div>
<!-- </div> -->
</div>
</div>
<!-- </div> -->
</section>
</section>
<section class="section hero is-light">
<div class="container is-max-desktop">
<!-- Abstract. -->
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h2 class="title is-3">Abstract</h2>
<div class="content has-text-justified">
<p>
The attention mechanism has revolutionized deep learning research across many disciplines, starting from NLP and expanding to vision, speech, and more. Unlike other mechanisms, attention is elegant and general: it is easily adaptable and eliminates modality-specific inductive biases. As attention becomes increasingly popular, it is crucial to develop tools that allow researchers to understand and explain its inner workings, to facilitate better and more responsible use of it. This tutorial focuses on understanding and interpreting attention in the vision and multi-modal settings. We present state-of-the-art research on representation probing, interpretability, and attention-based semantic guidance, alongside hands-on demos to facilitate interactivity. Additionally, we discuss open questions arising from recent works and future research directions.
</p>
</div>
</div>
</div>
</div>
</section>
<section class="section hero">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column is-four-fifths">
<h2 class="title is-3 has-text-centered">Tutorial Outline</h2>
<div class="content has-text-justified">
<p>
The following is an outline of the topics we will cover in the tutorial. A detailed description can be found <a href="https://docs.google.com/document/d/1AHYQyi5rvTGZC8kKS1TEOMewl5_b1M6gHrTyUt38oFs/edit#heading=h.4fa4qoz6sg55">in this document</a>.
</p>
<h5 class="title is-5">Interpreting Attention</h5>
<ul>
<li>
Brief history of interpretability for DNNs
</li>
<li>
Attention vs. Convolutions
</li>
<li>
Using attention as an explanation (see the attention-rollout sketch after this list)
</li>
</ul>
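<p>
As a concrete flavour of "attention as explanation", the sketch below computes attention rollout [4] for a pretrained ViT. It is a minimal illustration, assuming the Hugging Face <code>transformers</code> library, the public <code>google/vit-base-patch16-224</code> checkpoint, and a placeholder image path; it is not the reference implementation used in our demos.
</p>
<pre><code class="language-python">
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

# Load a pretrained ViT and ask it to return per-layer attention maps.
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTModel.from_pretrained("google/vit-base-patch16-224")

image = Image.open("example.jpg")  # any RGB image; the path is a placeholder
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions holds one (batch, heads, tokens, tokens) tensor per layer.
rollout = None
for layer_attention in outputs.attentions:
    attention = layer_attention.mean(dim=1)                      # average over heads
    attention = attention + torch.eye(attention.size(-1))        # account for residual connections
    attention = attention / attention.sum(dim=-1, keepdim=True)  # re-normalize rows
    rollout = attention if rollout is None else attention @ rollout

# Row 0 relates the [CLS] token to every patch; its 196 patch entries can be
# reshaped to a 14x14 map and overlaid on the input image as a coarse explanation.
cls_to_patches = rollout[0, 0, 1:].reshape(14, 14)
</code></pre>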
<h5 class="title is-5">Probing Attention</h5>
<ul>
<li>
Depth and breadth of attention layers
</li>
<li>
Representational similarities between CNNs and Transformers (see the CKA sketch after this list)
</li>
<li>
Probing cross-attention
</li>
</ul>
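<p>
Much of the CNN-vs-Transformer probing in [2] rests on centered kernel alignment (CKA). The snippet below is a hedged sketch of linear CKA between two activation matrices; the random feature tensors are stand-ins for real layer activations, not the paper's exact experimental setup.
</p>
<pre><code class="language-python">
import torch

def linear_cka(x, y):
    """Linear CKA between activation matrices of shape (num_examples, num_features)."""
    x = x - x.mean(dim=0, keepdim=True)  # center each feature dimension
    y = y - y.mean(dim=0, keepdim=True)
    numerator = (y.T @ x).norm() ** 2                  # ||Y^T X||_F^2
    denominator = (x.T @ x).norm() * (y.T @ y).norm()  # ||X^T X||_F * ||Y^T Y||_F
    return (numerator / denominator).item()

# Stand-ins for, e.g., a ViT block output and a CNN stage output on the same batch.
vit_features = torch.randn(512, 768)
cnn_features = torch.randn(512, 1024)
print(linear_cka(vit_features, cnn_features))  # near 0 for unrelated random features
</code></pre>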
<h5 class="title is-5">Leveraging Attention as Explanation </h5>
<ul>
<li>
<i><b><span class="title is-5 blink">New!</span></b></i> <a href="https://rmokady.github.io/">Ron Mokady</a> will share his seminal research on employing attention for text-based image editing. His slides are available <a href="https://drive.google.com/file/d/18U9rMGrMelC1oMv4c7j6aJwqkv5puSA9/view?usp=sharing">here</a>.
</li>
<li>
Attention-based semantic guidance (see the cross-attention sketch after this list)
</li>
</ul>
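<p>
The semantic-guidance and editing methods above ([5], [6], [7]) all start from access to a model's cross-attention maps. The snippet below is a framework-agnostic sketch of capturing such maps with a PyTorch forward hook on a toy cross-attention module; the module, shapes, and token counts are illustrative rather than those of any particular diffusion model.
</p>
<pre><code class="language-python">
import torch
import torch.nn as nn

# Toy cross-attention block: image tokens (queries) attend to text tokens (keys/values).
cross_attention = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

captured_maps = []

def capture_attention(module, inputs, outputs):
    # nn.MultiheadAttention returns (attn_output, attn_weights) when need_weights=True.
    captured_maps.append(outputs[1].detach())

handle = cross_attention.register_forward_hook(capture_attention)

image_tokens = torch.randn(1, 196, 64)  # one query token per image patch
text_tokens = torch.randn(1, 7, 64)     # one key/value token per prompt token
cross_attention(image_tokens, text_tokens, text_tokens, need_weights=True)
handle.remove()

# captured_maps[0] has shape (batch, image_tokens, text_tokens): for each patch,
# a distribution over prompt tokens, the kind of map that methods such as
# Prompt-to-Prompt and Attend-and-Excite read out and manipulate.
print(captured_maps[0].shape)  # torch.Size([1, 196, 7])
</code></pre>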
</div>
<h2 class="title is-3 has-text-centered">Tutorial Logistics</h2>
<div class="content has-text-justified">
<p>
Our tutorial will be conducted in a hybrid manner on June 18, 2023, from 9:00 AM onwards, and we aim to conclude by 12:00 PM (Vancouver local time). Due to visa issues, Sayak won't be able to present in person, so he will join and present virtually. Our guest speaker Ron will also present virtually, while Hila will present in person.
<ul>
<li><b>For in-person attendees: </b>Our tutorial will take place at the Vancouver Convention Center, room <b>West 211</b>.</li>
<li><b>For virtual attendees: </b>Please see the <a href="https://cvpr2023.thecvf.com/virtual/current/index.html">virtual site</a>, where you can find the homepage of <a href="https://cvpr2023.thecvf.com/virtual/2023/tutorial/18574">our tutorial</a>. All registered CVPR participants (both in-person and virtual) will have access to a Zoom link to join the tutorial live and ask the speakers questions via RocketChat.</li>
</ul>
</p></div>
<p class="content">
</div>
</div>
</div>
</section>
<section class="section hero is-light">
<div class="container is-max-desktop">
<!-- Abstract. -->
<h2 class="title is-4 is-centered has-text-centered">References</h2>
<div class="columns is-centered has-text-centered">
<div class="content has-text-justified">
<p>
[1] <i>Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers</i>, <b>Chefer et al.</b>
<br>
[2] <i>Do Vision Transformers See Like Convolutional Neural Networks?</i>, <b>Raghu et al.</b>
<br>
[3] <i>What do Vision Transformers Learn? A Visual Exploration</i>, <b>Ghiasi et al.</b>
<br>
[4] <i>Quantifying Attention Flow in Transformers</i>, <b>Abnar et al.</b>
<br>
[5] <i>Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models</i>, <b>Chefer et al.</b>
<br>
[6] <i>Prompt-to-Prompt Image Editing with Cross-Attention Control</i>, <b>Hertz et al.</b>
<br>
[7] <i>Null-text Inversion for Editing Real Images using Guided Diffusion Models</i>, <b>Mokady et al.</b>
<br>
</p>
</div>
</div>
</div>
</section>
<footer class="footer">
<div class="columns is-centered">
<div class="column is-8">
<div class="content">
<p>
This website is licensed under a <a rel="license"
href="http://creativecommons.org/licenses/by-sa/4.0/">Creative
Commons Attribution-ShareAlike 4.0 International License</a>.
</p>
<p>
Website source code based on the <a href="https://nerfies.github.io/"> Nerfies</a> project page. If you want to reuse their <a
href="https://github.com/nerfies/nerfies.github.io">source code</a>, please credit them appropriately.
</p>
</div>
</div>
</div>
</footer>
<script type="text/javascript">
var sc_project=12351448;
var sc_invisible=1;
var sc_security="c676de4f";
</script>
<script type="text/javascript"
src="https://www.statcounter.com/counter/counter.js"
async></script>
<noscript><div class="statcounter"><a title="Web Analytics"
href="https://statcounter.com/" target="_blank"><img
class="statcounter"
src="https://c.statcounter.com/12351448/0/c676de4f/1/"
alt="Web Analytics"></a></div></noscript>
<!-- End of Statcounter Code -->
</body>
</html>