Proves Q-Former is a Multi-Head MIL module due to permutation invariance in its cross-attention. Notes its limitation: it assumes i.i.d. instances, overlooking crucial instance correlation.

MIL Perspective: Analyzing Q-Former as a Multi-Head Mechanism


Abstract and 1 Introduction

  2. Related Work

    2.1. Multimodal Learning

    2.2. Multiple Instance Learning

  3. Methodology

    3.1. Preliminaries and Notations

    3.2. Relations between Attention-based VPG and MIL

    3.3. MIVPG for Multiple Visual Inputs

    3.4. Unveiling Instance Correlation in MIVPG for Enhanced Multi-instance Scenarios

  4. Experiments and 4.1. General Setup

    4.2. Scenario 1: Samples with Single Image

    4.3. Scenario 2: Samples with Multiple Images, with Each Image as a General Embedding

    4.4. Scenario 3: Samples with Multiple Images, with Each Image Having Multiple Patches to be Considered and 4.5. Case Study

  5. Conclusion and References

Supplementary Material

A. Detailed Architecture of QFormer

B. Proof of Proposition

C. More Experiments

3.2. Relations between Attention-based VPG and MIL

In AB-MIL [16], the attention weight assigned to each instance is calculated as in Equation 5:

$$a_i = \frac{\exp\left(\mathbf{w}^\top \tanh\left(\mathbf{V}\mathbf{h}_i\right)\right)}{\sum_{j=1}^{N} \exp\left(\mathbf{w}^\top \tanh\left(\mathbf{V}\mathbf{h}_j\right)\right)} \qquad (5)$$

where $\mathbf{h}_i$ denotes the embedding of instance $i$, $N$ is the number of instances in the bag, and $\mathbf{w}$ and $\mathbf{V}$ are learnable parameters.
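To make this pooling concrete, here is a minimal PyTorch sketch of the attention-based pooling in Equation 5. It is illustrative rather than the authors' code; the dimension names `embed_dim` and `attn_dim` are assumptions.

```python
import torch
import torch.nn as nn

class ABMILPooling(nn.Module):
    """Attention-based MIL pooling (Eq. 5): each instance embedding h_i
    receives a weight a_i, and the bag embedding is the weighted sum."""
    def __init__(self, embed_dim: int, attn_dim: int = 128):
        super().__init__()
        self.V = nn.Linear(embed_dim, attn_dim, bias=False)
        self.w = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        # H: (num_instances, embed_dim) -- one bag of instance embeddings
        scores = self.w(torch.tanh(self.V(H)))   # (num_instances, 1)
        a = torch.softmax(scores, dim=0)         # weights sum to 1 over the bag
        return (a * H).sum(dim=0)                # (embed_dim,) bag embedding

# A bag of 10 instances; permuting them leaves the bag embedding unchanged.
pool = ABMILPooling(embed_dim=64)
H = torch.randn(10, 64)
perm = torch.randperm(10)
assert torch.allclose(pool(H), pool(H[perm]), atol=1e-6)
```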

Proposition 1. QFormer belongs to the category of Multiple Instance Learning modules.

Within the cross-attention layer of QFormer, every query token computes weights over the image embeddings. The query embeddings, being learnable parameters, can be seen as a linear transformation from an instance to its weight. To clarify, each row of the attention map A holds the weights assigned to the instances for aggregation, analogous to Equation 5. Consequently, the cross-attention between the learnable query embeddings and the input is permutation invariant with respect to the instances.
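This invariance is easy to check numerically. The sketch below (illustrative, not the paper's code) builds a single-head cross-attention with learnable queries and verifies that shuffling the instance embeddings leaves every query's output unchanged; each of the `num_queries` rows of the attention map acts as one MIL pooling head.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_queries, num_instances, dim = 32, 257, 64

Q = nn.Parameter(torch.randn(num_queries, dim))   # learnable query embeddings
Wq, Wk, Wv = (nn.Linear(dim, dim, bias=False) for _ in range(3))

def cross_attention(X: torch.Tensor) -> torch.Tensor:
    # X: (num_instances, dim) image embeddings (the "instances")
    A = torch.softmax(Wq(Q) @ Wk(X).T / dim**0.5, dim=-1)  # (queries, instances)
    return A @ Wv(X)   # each row: one attention-weighted MIL aggregation

X = torch.randn(num_instances, dim)
out = cross_attention(X)
out_perm = cross_attention(X[torch.randperm(num_instances)])
assert torch.allclose(out, out_perm, atol=1e-5)  # invariant to instance order
```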

The result of the cross-attention is combined with the original query embeddings through a residual connection. This process can be expressed as shown in Equation 6; replacing pool with Equation 1 and setting λ = γ = I yields Equation 7, which is permutation equivariant.
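Equations 6 and 7 themselves are not reproduced in this excerpt. A plausible form consistent with the surrounding text is sketched below; the residual structure and the choice λ = γ = I come from the text above, while the projection notation is an assumption:

$$\tilde{Q} = \lambda Q + \gamma\,\underbrace{\operatorname{softmax}\!\left(\frac{QW_Q\,(XW_K)^\top}{\sqrt{d}}\right)XW_V}_{\text{attention-weighted pooling, cf. Eq. 1}}, \qquad \lambda = \gamma = I.$$

Permuting the rows of $X$ (the instances) reorders the columns of the attention map and the rows of $XW_V$ consistently, leaving $\tilde{Q}$ unchanged; permuting the rows of $Q$ permutes the rows of $\tilde{Q}$ in the same way, which is the equivariance referred to above.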


Figure 2. Overview of MIVPG. 2a: When handling multiple visual inputs, the initial step involves aggregating them at the image level. QFormer can be treated as a Multiple Instance Learning module that takes multiple samples as instances. The MIVPG complements QFormer by introducing a correlated self-attention module and the pyramid positional encoding module, depending on the specific scenario. 2b: Image-level aggregation can employ various MIL strategies, either learnable, such as AB-MIL, or fixed, for example, always selecting a specific token. 2c: The visual prompt embeddings produced by Q-Former are combined with textual prompt embeddings and forwarded to the LLM for generating outputs.

Considering that the self-attention layer within the QFormer block likewise satisfies permutation equivariance, we can conceptualize the QFormer as a multi-head MIL mechanism, in which each query token realizes one attention-based pooling of the instances.

From the standpoint of MIL, the weighted pooling in Equation 1 operates under the assumption that instances are independent and identically distributed (i.i.d.) [34]. In practical scenarios, however, instances may exhibit correlations, and accounting for instance correlation can lead to improved performance. It is worth noting that when each sample contains only one image, the input to QFormer consists of patch embeddings that have already incorporated correlations through the self-attention layers in ViT. Moreover, a further performance gain is attainable by integrating a Pyramid Positional Encoding Generator (PPEG) [34], which complements the proposed MIVPG when handling single-image inputs.
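For reference, below is a minimal sketch of a PPEG block in the style of TransMIL [34]; the depthwise convolutions and kernel sizes follow that paper, and this is not the MIVPG authors' code. The idea is to lay the instance tokens out on a 2D grid and inject local correlations at several receptive-field scales:

```python
import torch
import torch.nn as nn

class PPEG(nn.Module):
    """Pyramid Positional Encoding Generator (TransMIL-style sketch):
    depthwise convs of kernel sizes 7/5/3 over tokens laid out on a grid."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj7 = nn.Conv2d(dim, dim, 7, padding=3, groups=dim)
        self.proj5 = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        self.proj3 = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)

    def forward(self, x: torch.Tensor, H: int, W: int) -> torch.Tensor:
        # x: (batch, H*W, dim) instance tokens; H, W: grid layout of tokens
        B, N, C = x.shape
        feat = x.transpose(1, 2).reshape(B, C, H, W)
        feat = feat + self.proj7(feat) + self.proj5(feat) + self.proj3(feat)
        return feat.flatten(2).transpose(1, 2)  # back to (batch, H*W, dim)

# Example: 196 patch tokens arranged on a 14x14 grid
ppeg = PPEG(dim=64)
tokens = torch.randn(2, 196, 64)
out = ppeg(tokens, H=14, W=14)
assert out.shape == tokens.shape
```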


:::info Authors:

(1) Wenliang Zhong, The University of Texas at Arlington (wxz9204@mavs.uta.edu);

(2) Wenyi Wu, Amazon (wenyiwu@amazon.com);

(3) Qi Li, Amazon (qlimz@amazon.com);

(4) Rob Barton, Amazon (rab@amazon.com);

(5) Boxin Du, Amazon (boxin@amazon.com);

(6) Shioulin Sam, Amazon (shioulin@amazon.com);

(7) Karim Bouyarmane, Amazon (bouykari@amazon.com);

(8) Ismail Tutar, Amazon (ismailt@amazon.com);

(9) Junzhou Huang, The University of Texas at Arlington (jzhuang@uta.edu).

:::


:::info This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.

:::

