Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks

Xiao, Bin; Wu, Haiping; Xu, Weijian; Dai, Xiyang; Hu, Houdong; Lu, Yumao; Zeng, Michael; Liu, Ce; Yuan, Lu

doi:10.1109/cvpr52733.2024.00461

Article | Jun 16, 2024 | Closed access

Microsoft Research (United Kingdom)

Indexed in Crossref

Abstract

We introduce Florence-2, a novel vision foundation model with a unified, prompt-based representation for various computer vision and vision-language tasks. While existing large vision models excel in transfer learning, they struggle to perform diverse tasks with simple instructions, a capability that implies handling the complexity of various spatial hierarchy and semantic granularity. Florence-2 was designed to take text-prompt as task instructions and generate desirable results in text forms, whether it be captioning, object detection, grounding or segmentation. This multi-task learning setup demands large-scale, high-quality annotated data. To this end, we co-developed FLD-5B that consists of 5.4 billion…

Citation impact

Total citations: 196

FWCI: 43.81
Percentile: 100%
References: 81

Authors: 9

Topics & keywords

Keywords

Variety (cybernetics)
Computer science
Representation (politics)
Cognitive science
Artificial intelligence
Human–computer interaction
Computer vision
Psychology