Article · Jun 16, 2024 · Closed access

Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks

Microsoft Research (United Kingdom)

Indexed in Crossref

Abstract

We introduce Florence-2, a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks. While existing large vision models excel in transfer learning, they struggle to perform diverse tasks from simple instructions, a capability that requires handling varying levels of spatial hierarchy and semantic granularity. Florence-2 is designed to take a text prompt as the task instruction and generate the desired result in text form, whether the task is captioning, object detection, grounding, or segmentation. This multi-task learning setup demands large-scale, high-quality annotated data. To this end, we co-developed FLD-5B, which consists of 5.4 billion…

Citation impact

Total citations: 196
FWCI: 43.81
Percentile: 100%
References: 81

Authors

9 authors

Topics & keywords

Keywords
  • Variety (cybernetics)
  • Computer science
  • Representation (politics)
  • Cognitive science
  • Artificial intelligence
  • Human–computer interaction
  • Computer vision
  • Psychology