SWE-bench Leaderboard: February 2026 Rankings
The latest SWE-bench Verified scores show Kimi K2.5 and Qwen3.5 nearly tied at the top, separated by less than half a percentage point. Here is the full leaderboard breakdown.
GPTUni Team
SWE-bench Verified, the industry-standard benchmark for evaluating AI models on real-world software engineering tasks, has seen significant movement in early 2026. The benchmark tests models on their ability to resolve actual GitHub issues from popular open-source repositories.
The current top performers as of February 2026:
1. Kimi K2.5 — 76.8%
2. Qwen3.5 397B — 76.4%
3. Qwen3 Coder Next — 74.2%
4. Claude Opus 4 — 72.0%
5. o3 — 71.7%
6. GPT-4.1 — 69.3%
7. DeepSeek R1 — 65.8%
8. Gemini 2.5 Pro — 63.8%
The gap between first and eighth place is now just 13 percentage points, compared to 25+ points a year ago. This compression reflects the rapid improvement across all major providers. Models from Chinese labs (Moonshot AI, Alibaba/Qwen, DeepSeek) now hold four of the top seven positions.
What makes SWE-bench particularly valuable as a benchmark is that it tests end-to-end software engineering ability, not just code completion. Models must read issue descriptions, navigate codebases, identify the relevant files, and produce working patches. This requires a combination of code understanding, reasoning, and attention to detail that correlates well with real-world usefulness.
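For readers who want to see what one of these tasks actually looks like, the Verified subset is distributed as a public dataset. The following is a minimal sketch, assuming the Hugging Face datasets library and the commonly referenced princeton-nlp/SWE-bench_Verified dataset; the field names shown are those typically documented and may differ across dataset revisions.

```python
# Sketch: inspect a single SWE-bench Verified task instance.
# Assumes the Hugging Face "datasets" library is installed and that the
# "princeton-nlp/SWE-bench_Verified" dataset with the fields below is
# available; treat the field names as assumptions, not a guaranteed schema.
from datasets import load_dataset

# Each instance pairs a real GitHub issue with the repository state it was
# filed against and the tests that verify a correct fix.
ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

example = ds[0]
print(example["repo"])               # source repository, e.g. "astropy/astropy"
print(example["instance_id"])        # unique identifier for this task
print(example["base_commit"])        # commit to check the repository out at
print(example["problem_statement"])  # the issue text the model must read

# An evaluated model sees the problem statement plus the repo at base_commit
# and must produce a patch; the harness applies it and runs the
# FAIL_TO_PASS tests to decide whether the issue counts as resolved.
print(example["FAIL_TO_PASS"])       # tests expected to pass after the fix
```

Looking at even one instance makes the end-to-end nature of the benchmark concrete: the model is given prose and a repository snapshot, not a function signature to complete.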
The benchmark continues to evolve, with plans to add more complex multi-file issues and repository-level tasks in future updates.