Benchmarking Code Generation: A Look at Chinese and International LLMs
Abstract: With the rapid advancement of artificial intelligence technology, the application of large language models (LLMs) in software programming has become a focal point of interest for both industry and academia. This paper takes mainstream Chinese and international LLMs as its research subjects and conducts a systematic comparative analysis of their programming capabilities across multiple dimensions, including performance on programming benchmarks, code-generation quality, engineering-practice competence, adaptability to Chinese programming scenarios, and ecosystem development. The study finds that Chinese LLMs, represented by DeepSeek, Qwen, and MiniMax, have matched, and in certain areas surpassed, internationally recognized flagship models on standardized programming benchmarks. In particular, from late 2025 to early 2026, MiniMax M2.5 and GLM-5 entered the global top three on the SWE-bench Verified engineering-level benchmark, marking a historic breakthrough in the programming capabilities of Chinese LLMs. However, international models still hold a relative advantage in systematizing agentic coding, building robust code-security frameworks, and fostering a mature developer ecosystem. This paper aims to provide researchers, engineers, and decision-makers with an objective and comprehensive perspective for reference.