Abstract: Large language models (LLMs), trained on vast corpora, have emerged as a groundbreaking innovation in question answering and conversational agents. Among these LLMs, ChatGPT has pioneered a new phase of AI, adeptly handling varied tasks from writing essays and biographies to solving complex mathematical problems. However, assessing the quality of ChatGPT's output poses a challenge, particularly for queries that lack clear objective criteria for correctness. We examine the efficacy of ChatGPT (GPT-4) in generating correct code for programming problems, evaluating both the correctness of its solutions and their efficiency in terms of time and memory complexity. A custom dataset of Leetcode problems spanning various topics and difficulty levels was used. The study reveals an overall success rate of 71.875%, the proportion of problems for which ChatGPT provided correct solutions that passed all of Leetcode's test cases. ChatGPT exhibits strength on structured problems, and its success rate correlates linearly with problem acceptance rates. However, it struggles to improve incorrect solutions based on feedback, pointing to potential shortcomings in debugging tasks. These findings provide a compact yet insightful glimpse into ChatGPT's capabilities and areas for improvement.