
Fraesormer: Learning Adaptive Sparse Transformer for Efficient Food Recognition
*Equal Contribution, †Corresponding authors
arXiv | PDF | Code
Abstract
In recent years, Transformer-based models have driven significant progress in food recognition. However, most existing approaches still face two critical challenges in lightweight food recognition: (1) quadratic complexity and redundant feature representations caused by interactions with irrelevant tokens; (2) static feature recognition and single-scale representation, which overlook the unstructured, non-fixed nature of food images and the need for multi-scale features. To address these, we propose Fraesormer, an adaptive and efficient sparse Transformer architecture with two core designs: Adaptive Top-k Sparse Partial Attention (ATK-SPA) and a Hierarchical Scale-Sensitive Feature Gating Network (HSSFGN). ATK-SPA uses a learnable Gated Dynamic Top-K Operator (GDTKO) to retain only the most critical attention scores, filtering out low query-key matches that hinder feature aggregation. It also introduces a partial channel mechanism to reduce redundancy and promote expert information flow, enabling local-global collaborative modeling. HSSFGN employs a gating mechanism to achieve multi-scale feature representation, enhancing contextual semantic information. Extensive experiments show that Fraesormer outperforms state-of-the-art methods.
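For intuition, below is a minimal PyTorch sketch of the top-k sparse attention idea underlying ATK-SPA. It is an illustration under simplifying assumptions, not the released implementation: the paper's GDTKO learns how many scores to keep, whereas this sketch fixes the budget with a k_ratio hyperparameter, and the partial channel mechanism is omitted.

import torch
import torch.nn as nn

class TopKSparseAttention(nn.Module):
    """Sketch of top-k sparse self-attention: only the k largest
    query-key scores per query survive; the rest are masked before softmax.
    Note: k is a fixed hyperparameter here, not learned as in GDTKO."""
    def __init__(self, dim, num_heads=4, k_ratio=0.5):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.k_ratio = k_ratio  # fraction of tokens each query may attend to
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, heads, N, N)

        # Keep only the top-k scores per query row; mask the rest to -inf
        # so they receive zero weight after softmax.
        topk = max(1, int(self.k_ratio * N))
        kth = attn.topk(topk, dim=-1).values[..., -1, None]  # k-th largest score
        attn = attn.masked_fill(attn < kth, float("-inf"))

        out = attn.softmax(dim=-1) @ v  # (B, heads, N, head_dim)
        out = out.transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

# Usage: a batch of 8 images tokenized into 196 patches of dimension 64.
x = torch.randn(8, 196, 64)
print(TopKSparseAttention(dim=64)(x).shape)  # torch.Size([8, 196, 64])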
Overall Pipeline

Visualization

Visualization of the impact of each spatial location on the final prediction of the DeiT-S model. The results show that the prediction relies primarily on a small set of highly influential tokens, indicating that a large portion of tokens can be removed without affecting performance.
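The figure itself comes from the paper's attribution analysis. As a rough, hypothetical proxy for ranking token influence, one can average the attention the [CLS] token pays to each patch token across heads and layers; the sketch below (the function name token_influence and the averaging scheme are illustrative assumptions, not the paper's procedure) shows this idea.

import torch

def token_influence(attn_maps):
    """Rough proxy for per-token influence: the average attention paid by
    the [CLS] token to each patch token, across heads and layers.

    attn_maps: list of (B, heads, N, N) attention tensors, one per layer,
    where index 0 along the token axes is the [CLS] token.
    """
    cls_to_patches = torch.stack([a[:, :, 0, 1:] for a in attn_maps])  # (L, B, heads, N-1)
    return cls_to_patches.mean(dim=(0, 2))  # (B, N-1): one score per patch token

# Usage with dummy attention maps (12 layers, 6 heads, 197 tokens as in DeiT-S):
maps = [torch.rand(1, 6, 197, 197).softmax(dim=-1) for _ in range(12)]
scores = token_influence(maps)
keep = scores.topk(int(0.5 * scores.shape[-1]), dim=-1).indices  # keep top 50%
print(keep.shape)  # torch.Size([1, 98])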
Quantitative Results

Seven-Dimensional Radar Map

Ablation Study

BibTeX
@article{zou2025fraesormer,
  title={Fraesormer: Learning Adaptive Sparse Transformer for Efficient Food Recognition},
  author={Zou, Shun and Zou, Yi and Zhang, Mingya and Luo, Shipeng and Chen, Zhihao and Gao, Guangwei},
  journal={},
  year={2025}
}