A Brief Overview of Multi-Scale Vision Longformer

Last updated: August 14, 2024 (afternoon)

Title: Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding

2022 week7 reading

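This work is another improvement on ViT, with two main contributions:

A multi-scale structure implemented through patch embedding, which is essentially the same idea as the pyramid and hierarchical designs covered in earlier posts
An attention mechanism based on Longformer, whose complexity is linear in the number of tokens with no noticeable loss of accuracy, which greatly improves efficiency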

multi-scale

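As noted above, the multi-scale structure works like other hierarchical designs: each stage has its own patch embedding. The lower stages use a small hidden dimension, and going up, the feature map resolution reduces while the hidden dimension increases, which balances computation and memory. The authors point out, however, that even with the reduced hidden dimension, the cost at the lower stages still grows enormously. Attention complexity is quadratic in the number of tokens, and because an image is a 2D grid, the number of tokens is itself quadratic in the resolution.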
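To make the resolution and hidden dimension trade-off concrete, here is a minimal sketch of how token counts shrink from stage to stage. The function name `stage_shapes` and the image size, patch sizes, and hidden dimensions are all assumptions chosen for illustration, not the configuration reported in the paper.

```python
def stage_shapes(image_size, patch_sizes, hidden_dims):
    """Return (side, num_tokens, hidden_dim) for each stage of a patch-embedding pyramid."""
    shapes = []
    side = image_size
    for patch, dim in zip(patch_sizes, hidden_dims):
        if side % patch != 0:
            raise ValueError("feature map side must be divisible by the patch size")
        side //= patch  # each patch embedding shrinks the feature map
        shapes.append((side, side * side, dim))
    return shapes


# Hypothetical configuration, for illustration only.
shapes = stage_shapes(224, [4, 2, 2, 2], [96, 192, 384, 768])
for side, tokens, dim in shapes:
    print(f"{side}x{side} feature map -> {tokens} tokens, hidden dim {dim}")
```

With these assumed numbers the first stage holds far more tokens than the last, which is why the lower stages dominate the attention cost.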

For example, the computational complexity of a 4x higher resolution multihead self attention (MSA) layer (hidden dimension reduced by 4, i.e., 4H x 4W x D/4) equals to that of 64 layers in the original size (i.e., H x W x D).
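The factor of 64 can be checked with a few lines of arithmetic. This sketch assumes the cost of full self-attention is dominated by the N^2 x D attention matrix products (N = number of tokens) and ignores the linear projections. The helper `attention_cost` and the sizes H, W, D are hypothetical.

```python
def attention_cost(height, width, dim):
    """Cost of the attention matrix products in full self-attention, up to a constant: N^2 * D."""
    num_tokens = height * width
    return num_tokens * num_tokens * dim


H, W, D = 14, 14, 384  # hypothetical sizes; the ratio does not depend on them

# 4x higher resolution on each side, hidden dimension reduced by 4.
ratio = attention_cost(4 * H, 4 * W, D // 4) / attention_cost(H, W, D)
print(ratio)  # 64.0
```

The token count grows by 16, so the quadratic term grows by 256, and dividing the hidden dimension by 4 only brings that down to 64.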

So the attention itself also needs to be improved.

Longformer

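This work therefore uses a transformer variant called Longformer. Longformer comes from an NLP paper whose goal is likewise to keep the computation from growing quadratically with sequence length, and the authors apply its global + sliding window idea to ViT.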

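As the figure above shows, global + sliding window works as follows: every token first attends only to its nearby tokens, which forms a local attention. On top of that, a few selected tokens get full attention over the whole sequence, and together these form a global + local attention.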

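This idea carries over to ViT: it likewise forms a global + local attention, in which the global and local tokens also exchange information with each other. The authors also compare against PVT here and point out that this is exactly what improves the performance of the multi-scale model. Combined with the analysis of PVT in an earlier post, PVT has only the local attention part shown in the left figure and no global part, which lowers the model's performance.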
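As a rough check on the linear-complexity claim in 2D, the sketch below counts query-key pairs for a local window on an image grid plus a few global tokens, and compares that with full attention. The function names, the window radius, and the grid sizes are assumptions for illustration; the paper's actual settings may differ.

```python
def attended_pairs(height, width, window, num_global):
    """Count query-key pairs under 2D local-window attention plus global tokens."""
    local = 0
    for r in range(height):
        for c in range(width):
            # Size of the (2*window+1)^2 neighbourhood, clipped at the image border.
            rows = min(height - 1, r + window) - max(0, r - window) + 1
            cols = min(width - 1, c + window) - max(0, c - window) + 1
            local += rows * cols
    num_local = height * width
    # Local tokens attend to global tokens and vice versa; global tokens attend to each other.
    global_part = 2 * num_local * num_global + num_global * num_global
    return local + global_part


def full_pairs(height, width, num_global):
    """Count query-key pairs under full self-attention over all tokens."""
    n = height * width + num_global
    return n * n


# Hypothetical setting: window radius 7 and one global token; double the side length.
sparse_growth = attended_pairs(56, 56, 7, 1) / attended_pairs(28, 28, 7, 1)
full_growth = full_pairs(56, 56, 1) / full_pairs(28, 28, 1)
print(f"sparse attention grows {sparse_growth:.2f}x, full attention grows {full_growth:.2f}x")
```

Quadrupling the number of tokens makes the sparse pattern grow by a bit more than 4x (border tokens have smaller windows), while full attention grows by roughly 16x.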

About weekly reading

I upload the paper reading part of my weekly reports to my blog for reference. I hope it is of some help to you.

weekly reading

#machine learning #transformer #vit

https://asteriscus.cat/posts/cd2b69fe/

Author

Asterisk

Published on

February 18, 2022
