[FLEX ATTENTION]

Key Contributions

| method | speed | detail |
| --- | --- | --- |
| high-resolution attention | slow | high |
| low-resolution attention | fast | low |
| FlexAttention | medium | high |

High-Resolution Feature Selection

  1. Run self-attention on the low-resolution feature map.
  2. Find the pixels with high attention scores.
  3. Extract those pixels from the high-resolution feature map (see the sketch after this list).
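
A minimal PyTorch sketch of these three steps. Assumptions not from the paper: the function name, the `top_k` budget, scoring a position by the total attention it receives, and a high-resolution map that carries `r` patches per low-resolution position.

```python
import torch

def select_high_res_features(low_res, high_res, top_k=64):
    """Select high-resolution tokens at the positions that the
    low-resolution self-attention attends to most.

    low_res:  (B, N, D)    low-resolution tokens
    high_res: (B, N, r, D) r high-resolution patches per low-res position
    """
    B, N, D = low_res.shape
    r = high_res.shape[2]
    # 1. Self-attention over the low-resolution tokens.
    attn = torch.softmax(low_res @ low_res.transpose(-2, -1) / D ** 0.5, dim=-1)
    # 2. Score each position by the total attention it receives,
    #    then keep the top_k positions (assumed criterion).
    scores = attn.sum(dim=1)                  # (B, N)
    idx = scores.topk(top_k, dim=-1).indices  # (B, top_k)
    # 3. Gather the matching patches from the high-resolution map.
    gathered = high_res.gather(1, idx[:, :, None, None].expand(B, top_k, r, D))
    return gathered.reshape(B, top_k * r, D)  # (B, M, D) with M = top_k * r
```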

Hierarchical Self-attention

Attention is computed over the following token counts (sketched below):

N: number of low-resolution pixels
M: number of selected high-resolution pixels
Q: N (low-resolution tokens only)
K: N + M (low-resolution plus selected high-resolution tokens)
V: N + M (low-resolution plus selected high-resolution tokens)
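
A minimal PyTorch sketch with these shapes, simplified to a single head; the module name and the single-head formulation are assumptions:

```python
import torch
import torch.nn as nn

class HierarchicalSelfAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, low_res, high_res_sel):
        """
        low_res:      (B, N, D) low-resolution tokens
        high_res_sel: (B, M, D) selected high-resolution tokens
        returns:      (B, N, D) updated low-resolution tokens
        """
        # Queries come from the N low-res tokens only;
        # keys and values see all N + M tokens.
        kv = torch.cat([low_res, high_res_sel], dim=1)  # (B, N+M, D)
        q, k, v = self.q(low_res), self.k(kv), self.v(kv)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v  # (B, N, D)
```

Because only the N low-resolution tokens act as queries, the attention matrix is N × (N + M) rather than (N + M) × (N + M), which is why the speed sits between the pure high- and low-resolution variants.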

Reference

FlexAttention for Efficient High-Resolution Vision-Language Models
