This article evaluates retrieval recall across multiple AI models for medical imaging, including DreamSim, ResNet50, and DINO variants. It compares slice-wise, volume-based, region-based, and localized retrieval methods on both coarse (29) and fine-grained (104) anatomical structures. Results show DreamSim excels at slice-wise and region-based recall, ResNet50 performs best at coarse volume retrieval, while DINO models lead in localized and fine-grained recall. The study highlights trade-offs between model types and retrieval approaches, underscoring the importance of context, granularity, and localization in advancing medical AI retrieval systems.

Medical AI Models Battle It Out—And the Winner Might Surprise You


Abstract and 1. Introduction

  2. Materials and Methods

    2.1 Vector Database and Indexing

    2.2 Feature Extractors

    2.3 Dataset and Pre-processing

    2.4 Search and Retrieval

    2.5 Re-ranking retrieval and evaluation

  3. Evaluation and 3.1 Search and Retrieval

    3.2 Re-ranking

  4. Discussion

    4.1 Dataset and 4.2 Re-ranking

    4.3 Embeddings

    4.4 Volume-based, Region-based and Localized Retrieval and 4.5 Localization-ratio

  5. Conclusion, Acknowledgement, and References

3 Evaluation

In this section, we evaluate the retrieval recall of the methods explained in Section 2.4 and Section 2.5. Results for the 29 coarse anatomical structures from Table 1 and for the original 104 fine-grained anatomical structures from Wasserthal et al. [2023] are presented separately below. In the tables in this section, the average and standard deviation (STD) columns make it possible to identify classes that are difficult across models (low average) and classes whose recall varies strongly between models (high STD). The average and STD rows report the average and STD over all classes for each model.
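For readers who want to reproduce these summary statistics, the following is a minimal sketch, assuming the per-class recalls have been collected into a pandas DataFrame with one row per class and one column per model (the values and class names below are purely illustrative):

```python
import pandas as pd

# Illustrative per-class recall values: one column per model, one row per class.
recalls = pd.DataFrame(
    {
        "DreamSim": [0.95, 0.62, 0.88],
        "DINOv1":   [0.93, 0.55, 0.90],
        "ResNet50": [0.90, 0.40, 0.85],
    },
    index=["liver", "gallbladder", "spleen"],
)

# Average/STD columns: aggregate over models for each class.
# A low average flags a class that is difficult for all models; a high STD flags
# a class whose recall differs strongly between models.
per_class = recalls.agg(["mean", "std"], axis=1).rename(columns={"mean": "average", "std": "STD"})

# Average/STD rows: aggregate over classes for each model.
per_model = recalls.agg(["mean", "std"])

print(recalls.join(per_class))
print(per_model)
```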

3.1 Search and Retrieval

3.1.1 Slice-wise

Detailed computation of the recall measure for the different retrieval methods is explained in Section 2.4. Table 2 and Table 3 show the retrieval recall of the 29 coarse anatomical regions and the 104 original TS anatomical regions, respectively, using the slice-wise method. The slice-wise recall is considered a lower bound on recall, because a perfect score requires all anatomical regions present in the query slice to appear in the retrieved slice.
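As a rough sketch of what one slice-wise query and its recall look like in practice, assuming an hnswlib HNSW index built over the database slice embeddings and a lookup table from slice id to its set of anatomical labels (all names, dimensions, and the random placeholder data below are illustrative, not the authors' implementation):

```python
import hnswlib
import numpy as np

dim, num_slices = 768, 10_000          # embedding size and database size (illustrative)

# HNSW index over the database slice embeddings (cosine distance).
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_slices, ef_construction=200, M=16)
db_embeddings = np.random.rand(num_slices, dim).astype(np.float32)   # placeholder embeddings
index.add_items(db_embeddings, np.arange(num_slices))
index.set_ef(64)

# labels_of[i] holds the set of anatomical classes present in database slice i
# (derived from the segmentation masks); left empty here as a placeholder.
labels_of = {i: set() for i in range(num_slices)}

def slice_wise_recall(query_embedding: np.ndarray, query_labels: set) -> float:
    """Fraction of the query slice's classes that also appear in the top-1 retrieved slice."""
    retrieved_ids, _ = index.knn_query(query_embedding, k=1)
    retrieved_labels = labels_of[int(retrieved_ids[0][0])]
    tp = len(query_labels & retrieved_labels)      # classes present in both slices
    fn = len(query_labels - retrieved_labels)      # classes missing from the retrieved slice
    return tp / (tp + fn) if (tp + fn) else 1.0
```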

In slice-wise retrieval, DreamSim is the best-performing model, with a retrieval recall of .863 ± .107 and .797 ± .129 for the coarse and original TS classes, respectively. ResNet50 pre-trained on fractal images has the lowest retrieval recall on almost every anatomical region for both the 29 and 104 classes. This is, however, expected given that its pre-training images are synthetically generated.

In Table 3, the gallbladder has the lowest retrieval rate, followed by vertebrae C4 and C5 (see the average column). However, in Table 2 the vertebrae class shows a higher recall, which indicates that the vertebrae were detected but the exact level, i.e., C4 or C5, was mismatched. The same pattern can be observed for the rib classes.
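This effect can be made concrete by mapping each fine TS label to its coarse group before computing recall, as in the following sketch (the mapping shown is an illustrative excerpt only, not the exact grouping of Table 1):

```python
# Illustrative excerpt of a fine-to-coarse mapping: all vertebra levels collapse into one
# "vertebrae" group, so a retrieval that confuses C4 with C5 still counts as a coarse hit.
FINE_TO_COARSE = {
    "vertebrae_C4": "vertebrae",
    "vertebrae_C5": "vertebrae",
    "rib_left_1": "ribs",
    "rib_right_1": "ribs",
    "gallbladder": "gallbladder",
}

def to_coarse(fine_labels: set) -> set:
    """Map a set of fine-grained TS labels to their coarse anatomical groups."""
    return {FINE_TO_COARSE[label] for label in fine_labels if label in FINE_TO_COARSE}

# Example: the query contains C4 and the retrieval returns C5 -> a miss at the fine level,
# but a hit once both are mapped to the coarse "vertebrae" group.
query, retrieved = {"vertebrae_C4"}, {"vertebrae_C5"}
print(to_coarse(query) <= to_coarse(retrieved))   # True
```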

Table 2: Slice-wise recall of coarse anatomical regions (29 classes) using HNSW Indexing. In each row, bold numbers represent the best-performing values, while italicized numbers indicate the worst-performing. The separate average and standard deviation (STD) columns are color-coded, with blue indicating the best-performing values and yellow indicating the worst-performing values across different models. Additionally, bold numbers in colored columns represent the best classes in terms of average and standard deviation, while italicized values represent the worst-performing class across the models.


Table 3: Slice-wise recall of all TS anatomical regions (104 classes) using HNSW Indexing. In each row, bold numbers represent the best-performing values, while italicized numbers indicate the worst-performing. The separate average and standard deviation (STD) columns are color-coded, with blue indicating the best-performing values and yellow indicating the worst-performing values across different models. Additionally, bold numbers in colored columns represent the best classes in terms of average and standard deviation, while italicized values represent the worst-performing class across the models.

3.1.2 Volume-based

This section presents the recall of the volume-based retrieval explained in Section 2.4.1. An overview of the evaluation is shown in Figure 2. In volume-based retrieval, for each query volume, one volume is retrieved. In the recall computation, the classes present in both the query and the retrieved volume are counted as true positives (TP). The classes that are present in the query volume but missing from the retrieved volume are counted as false negatives (FN).
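A minimal sketch of this TP/FN bookkeeping is shown below; the same helper also covers the region-based recall of Section 3.1.3 if the class set of the query sub-volume is passed instead of that of the whole query volume (function and class names are illustrative):

```python
def retrieval_recall(query_classes: set, retrieved_classes: set) -> float:
    """Recall for one query: TP = classes found in both, FN = classes only in the query."""
    tp = len(query_classes & retrieved_classes)
    fn = len(query_classes - retrieved_classes)
    return tp / (tp + fn) if (tp + fn) else 1.0

# Example: the retrieved volume misses the gallbladder present in the query volume.
print(retrieval_recall({"liver", "spleen", "gallbladder"},
                       {"liver", "spleen", "kidney_left"}))   # 0.666...
```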

Table 4 and Table 5 present the retrieval recall of the volume-based method on the 29 and 104 classes, respectively. The overall recall rates increase compared to slice-wise retrieval, which is expected due to the aggregation and contextual effect of neighboring slices.

Table 4 shows that ResNet50 trained on RadImageNet outperforms the other methods, with an average recall of .952 ± .043. However, in Table 5, DINOv1 outperforms all models, including ResNet50, with an average recall of .923 ± .077. This suggests that ResNet50 retrieves embeddings of fine-grained classes but assigns them to a different, similar class, so its performance improves when moving from fine to coarse classes. Moreover, all the self-supervised methods in Table 5 outperform the supervised methods. Although some models perform slightly better than others on isolated classes, overall the models perform on par.

Table 4: Volume-based retrieval recall of coarse anatomical regions (29 classes) using HNSW Indexing. In each row, bold numbers represent the best-performing values, while italicized numbers indicate the worst-performing. The separate average and standard deviation (STD) columns are color-coded, with blue indicating the best-performing values and yellow indicating the worst-performing values across different models. Additionally, bold numbers in colored columns represent the best classes in terms of average and standard deviation, while italicized values represent the worst-performing class across the models.


Table 5: Volume-based retrieval recall of all TS anatomical regions (104 classes) using HNSW Indexing. In each row, bold numbers represent the best-performing values, while italicized numbers indicate the worst-performing. The separate average and standard deviation (STD) columns are color-coded, with blue indicating the best-performing values and yellow indicating the worst-performing values across different models. Additionally, bold numbers in colored columns represent the best classes in terms of average and standard deviation, while italicized values represent the worst-performing class across the models.

3.1.3 Region-based

This section presents the recall of region-based retrieval. An overview of the evaluation is shown in Figure 3. In region-based retrieval, for each anatomical region in the query volume, one volume is retrieved. In the recall computation, the classes present in both the query sub-volume and the corresponding retrieved volume are counted as TP. The classes that are present in the query sub-volume but missing from the retrieved volume are counted as FN.

Table 6 and Table 7 present the retrieval recalls. Compared to volume-based retrieval, the average recall for the regions is higher, and the performance of the models is very close. DreamSim performs slightly better, with an average recall of .979 ± .037 for the coarse anatomical regions and .983 ± .032 for the 104 anatomical regions. The retrieval recall for many classes is 1.0. The standard deviation among classes and models is low, with the highest standard deviation being .076 and .092 for the coarse and fine classes, respectively.

Table 6: Region-based retrieval recall of coarse anatomical regions (29 classes) using HNSW Indexing. In each row, bold numbers represent the best-performing values, while italicized numbers indicate the worst-performing. The separate average and standard deviation (STD) columns are color-coded, with blue indicating the best-performing values and yellow indicating the worst-performing values across different models. Additionally, bold numbers in colored columns represent the best classes in terms of average and standard deviation, while italicized values represent the worst-performing class across the models.


Table 7: Region-based retrieval recall of all TS anatomical regions (104 classes) using HNSW Indexing. In each row, bold numbers represent the best-performing values, while italicized numbers indicate the worst-performing. The separate average and standard deviation (STD) columns are color-coded, with blue indicating the best-performing values and yellow indicating the worst-performing values across different models. Additionally, bold numbers in colored columns represent the best classes in terms of average and standard deviation, while italicized values represent the worst-performing class across the models.

3.1.4 Localized

This section presents the recall and localization-ratio of localized retrieval. An overview of the evaluation is shown in Figure 4. In localized retrieval, for each anatomical region in the query volume, one volume is retrieved.

Localized Retrieval Recall. The recall calculation for localized retrieval is explained in Section 2.4.3, and an overview is shown in Figure 4. Table 8 and Table 9 present the retrieval recalls. Compared to region-based retrieval, the average recall is lower, which is expected given the stricter metric. The performance of the models is close, especially among the self-supervised models. DINOv2 performs best for the 29 coarse anatomical regions, with an average recall of .941 ± .077. For the 104 regions, the performance of the models is even closer, with DINOv1 performing slightly better at an average recall of .929 ± .085.

Table 8: Localized retrieval recall of coarse anatomical regions (29 classes) using HNSW Indexing, L = 15. In each row, bold numbers represent the best-performing values, while italicized numbers indicate the worst-performing. The separate average and standard deviation (STD) columns are color-coded, with blue indicating the best-performing values and yellow indicating the worst-performing values across different models. Additionally, bold numbers in colored columns represent the best classes in terms of average and standard deviation, while italicized values represent the worst-performing class across the models.


Table 9: Localized retrieval recall of all TS anatomical regions (104 classes) using HNSW Indexing, L = 15. In each row, bold numbers represent the best-performing values, while italicized numbers indicate the worst-performing. The separate average and standard deviation (STD) columns are color-coded, with blue indicating the best-performing values and yellow indicating the worst-performing values across different models. Additionally, bold numbers in colored columns represent the best classes in terms of average and standard deviation, while italicized values represent the worst-performing class across the models.

Localization-ratio. The localization-ratio is computed based on (2). This measure shows how many of the slices that contributed to the retrieval of the volume actually contained the desired organ. Table 10 and Table 11 show the localization-ratio for the 29 coarse and 104 original TS classes. DreamSim shows the best average localization-ratio, with .864 ± .145 and .803 ± .130 for the coarse and original TS classes, respectively.
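Without reproducing equation (2), the idea can be sketched as follows, assuming the set of database slices that contributed to retrieving the volume is known together with their label sets (this is an interpretation of the description above, not the authors' exact formula):

```python
def localization_ratio(contributing_slice_labels: list, organ: str) -> float:
    """Fraction of the slices that contributed to retrieving the volume
    which actually contain the queried organ (sketch of the idea behind Eq. (2))."""
    if not contributing_slice_labels:
        return 0.0
    hits = sum(organ in labels for labels in contributing_slice_labels)
    return hits / len(contributing_slice_labels)

# Example: 3 of the 4 contributing slices contain the liver.
slices = [{"liver", "spleen"}, {"liver"}, {"stomach"}, {"liver", "kidney_right"}]
print(localization_ratio(slices, "liver"))   # 0.75
```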

Table 10: Localization-ratio of coarse anatomical regions (29 classes) using HNSW Indexing, L = 15. In each row, bold numbers represent the best-performing values, while italicized numbers indicate the worst-performing. The separate average and standard deviation (STD) columns are color-coded, with blue indicating the best-performing values and yellow indicating the worst-performing values across different models. Additionally, bold numbers in colored columns represent the best classes in terms of average and standard deviation, while italicized values represent the worst-performing class across the models.


Table 11: Localization-ratio of all TS anatomical regions (104 classes) using HNSW Indexing, L = 15. In each row, bold numbers represent the best-performing values, while italicized numbers indicate the worst-performing. The separate average and standard deviation (STD) columns are color-coded, with blue indicating the best-performing values and yellow indicating the worst-performing values across different models. Additionally, bold numbers in colored columns represent the best classes in terms of average and standard deviation, while italicized values represent the worst-performing class across the models.


:::info Authors:

(1) Farnaz Khun Jush, Bayer AG, Berlin, Germany (farnaz.khunjush@bayer.com);

(2) Steffen Vogler, Bayer AG, Berlin, Germany (steffen.vogler@bayer.com);

(3) Tuan Truong, Bayer AG, Berlin, Germany (tuan.truong@bayer.com);

(4) Matthias Lenga, Bayer AG, Berlin, Germany (matthias.lenga@bayer.com).

:::


:::info This paper is available on arxiv under CC BY 4.0 DEED license.

:::

