Transformer-based models are now the norm in computer vision and natural language processing research. Since their introduction in 2017, transformer architectures have delivered exceptional performance gains, driving significant advances in deep learning and artificial intelligence. However, the substantial computational and memory requirements of their self-attention mechanism, which captures diverse syntactic and semantic representations from long input sequences and whose cost grows quadratically with sequence length, have limited their broader real-world applicability.
In a recent research paper titled “Transformers with Multiresolution Attention Heads,” researchers propose MrsFormer, a novel transformer architecture that uses Multiresolution-head Attention (MrsHA) to approximate the output sequences of its attention heads. Compared to a baseline softmax transformer, this architecture considerably reduces head redundancy and lowers the transformer’s computational and memory costs without compromising accuracy. The authors’ names and institutions are withheld because the paper is currently under double-blind review for ICLR 2023.
The research team’s primary contribution starts with the derivation of an approximation of an attention head at different scales. The derivation proceeds in two steps: the first directly approximates the output sequence H, and the second approximates the value matrix V, which serves as the dictionary containing the bases of H. The team then built MrsHA, a novel multi-head attention (MHA) whose heads approximate the output sequences H_h (with h = 1, …, H) at different scales. Finally, the team proposed MrsFormer, a new class of transformers that uses MrsHA in its attention layers.
The proposed MrsFormer replaces the traditional self-attention mechanism with one based on multiresolution approximation (MRA). Traditional self-attention learns long-sequence representations by comparing the tokens of the input sequence and updating the corresponding positions in the output sequence. MRA, by contrast, decomposes a signal into components that lie in orthogonal subspaces at different scales. Analogously, MrsFormer splits the attention heads of multi-head attention into fine-scale and coarse-scale heads, which model the attention patterns between individual tokens and between groups of tokens, respectively.
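To make the fine-scale versus coarse-scale distinction concrete, here is a minimal NumPy sketch. It is not the paper's algorithm: the function coarse_attention_head, its scale parameter, and the choice of average pooling followed by nearest-neighbor upsampling are all assumptions made for illustration. The sketch only shows how a head that attends between groups of tokens works with a much smaller score matrix than a head that attends between individual tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(Q, K, V):
    """Baseline softmax attention for one head (fine scale).

    Q, K, V have shape (N, d); the score matrix has shape (N, N).
    """
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

def coarse_attention_head(Q, K, V, scale):
    """Hypothetical coarse-scale head (illustrative assumption, not the paper's method).

    Tokens are average-pooled into groups of `scale`, attention is computed
    between groups, and each group's output is copied back to its tokens.
    The score matrix shrinks from (N, N) to (N / scale, N / scale).
    """
    N = Q.shape[0]
    if N % scale != 0:
        raise ValueError("sequence length must be divisible by scale")

    def pool(X):
        # Average every `scale` consecutive tokens into one group vector.
        return X.reshape(N // scale, scale, -1).mean(axis=1)

    H_coarse = attention_head(pool(Q), pool(K), pool(V))  # (N / scale, d)
    return np.repeat(H_coarse, scale, axis=0)             # back to (N, d)

rng = np.random.default_rng(0)
N, d = 8, 4
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))

fine = attention_head(Q, K, V)                    # 8 x 8 = 64 scores
coarse = coarse_attention_head(Q, K, V, scale=2)  # 4 x 4 = 16 scores
print(fine.shape, coarse.shape)
```

In this sketch, a head with scale 2 computes a quarter of the attention scores of the baseline head, and setting scale to 1 recovers the baseline exactly. How MrsFormer actually assigns scales to heads and reconstructs the output sequence is described in the paper.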
In their empirical evaluation, the researchers verified MrsFormer’s advantage over conventional transformers across several applications, including image classification and time series classification. The findings show that approximating the attention heads as MrsFormer does decreases head redundancy, lowers computation and memory costs, and keeps accuracy on par with the baseline. Overall, this research shows the potential of a new family of efficient transformers that can substantially reduce computational and memory costs without compromising model performance.
All credit for this research goes to the researchers on this project.
Khushboo Gupta is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Goa. She is passionate about Machine Learning, Natural Language Processing, and Web Development, and she enjoys deepening her technical knowledge by participating in challenges.