Article · Harvard Data Science Review · Mar 12, 2024 · Hybrid OA

How Is ChatGPT’s Behavior Changing Over Time?

Stanford University · University of California, Berkeley

Indexed in Crossref, DOAJ

Abstract

GPT-3.5 and GPT-4 are the two most widely used large language model (LLM) services. However, when and how these models are updated over time is opaque. Here, we evaluate the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 on several diverse tasks: 1) math problems, 2) sensitive/dangerous questions, 3) opinion surveys, 4) multi-hop knowledge-intensive questions, 5) generating code, 6) US Medical License tests, and 7) visual reasoning. We find that the performance and behavior of both GPT-3.5 and GPT-4 can vary greatly over time. For example, GPT-4 (March 2023) was reasonable at identifying prime vs. composite numbers (84% accuracy) but GPT-4 (June 2023) was poor on these same questions (51% accuracy).…
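The accuracy comparison mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the helper names (`is_prime`, `accuracy`), the yes/no answer format, and the sample numbers and responses below are all assumptions made up for illustration, and they are not drawn from the paper's data set or prompts.

```python
def is_prime(n: int) -> bool:
    """Ground-truth primality check by trial division."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def accuracy(numbers, answers):
    """Fraction of yes/no answers that agree with the true primality.

    Assumes (hypothetically) that each answer starts with "yes" when the
    model claims the number is prime.
    """
    correct = sum(
        ans.strip().lower().startswith("yes") == is_prime(n)
        for n, ans in zip(numbers, answers)
    )
    return correct / len(numbers)


# Made-up responses standing in for two snapshots of the same service.
numbers = [7, 9, 11, 15]
march_answers = ["Yes", "No", "Yes", "No"]
june_answers = ["No", "No", "No", "No"]

print("March snapshot accuracy:", accuracy(numbers, march_answers))
print("June snapshot accuracy:", accuracy(numbers, june_answers))
```

Scoring both snapshots against the same fixed question set is what makes the two numbers comparable; a snapshot that answers "No" to everything still scores near 50% on a balanced set, which is why an accuracy close to one half can indicate degenerate behavior.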

Citation impact

  • Total citations: 274
  • FWCI: 26.39
  • Percentile: 100%
  • References: 0

Authors

3

Topics & keywords

Keywords
  • Computer science
  • Task (project management)
  • Psychology
  • Engineering