Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
Microsoft Research (United Kingdom)
Abstract
We introduce Florence-2, a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks. While existing large vision models excel in transfer learning, they struggle to perform diverse tasks from simple instructions, a capability that requires handling varied spatial hierarchies and levels of semantic granularity. Florence-2 is designed to take text prompts as task instructions and generate the desired results in text form, whether the task is captioning, object detection, grounding, or segmentation. This multi-task learning setup demands large-scale, high-quality annotated data. To this end, we co-developed FLD-5B, which consists of 5.4 billion…
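To make the idea of returning results "in text form" concrete, here is a minimal sketch of how an object detection result could be serialized as a plain string. It is an illustration under stated assumptions, not the authors' implementation: the `<loc_N>` token format, the 1000-bin quantization, and the names `Box`, `quantize`, and `boxes_to_text` are all hypothetical and do not come from the abstract.

```python
from dataclasses import dataclass

# Assumption for illustration: coordinates are quantized into 1000 bins.
NUM_BINS = 1000


@dataclass
class Box:
    """A labeled bounding box in pixel coordinates (hypothetical helper type)."""
    label: str
    x0: float
    y0: float
    x1: float
    y1: float


def quantize(value: float, size: float, num_bins: int = NUM_BINS) -> int:
    """Map a pixel coordinate to an integer bin in [0, num_bins - 1]."""
    bin_index = int(value / size * num_bins)
    # Clamp so that a coordinate on the far edge still maps to a valid bin.
    return max(0, min(num_bins - 1, bin_index))


def boxes_to_text(boxes: list[Box], width: float, height: float) -> str:
    """Serialize boxes as 'label<loc_x0><loc_y0><loc_x1><loc_y1>' strings."""
    parts = []
    for box in boxes:
        coords = [
            quantize(box.x0, width),
            quantize(box.y0, height),
            quantize(box.x1, width),
            quantize(box.y1, height),
        ]
        # Assumed token syntax: one '<loc_N>' token per quantized coordinate.
        parts.append(box.label + "".join(f"<loc_{c}>" for c in coords))
    return "".join(parts)


if __name__ == "__main__":
    detections = [Box("cat", 0, 0, 320, 240)]
    print(boxes_to_text(detections, width=640, height=480))
```

Under this kind of scheme, a single sequence-to-sequence model can in principle emit captions, boxes, and region descriptions through the same text output, with the task selected by the prompt.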
Citation impact
- FWCI: 43.81
- Percentile: 100%
- References: 81
Authors
- 9
Topics & keywords
- Variety (cybernetics)
- Computer science
- Representation (politics)
- Cognitive science
- Artificial intelligence
- Human–computer interaction
- Computer vision
- Psychology