SWE-bench Leaderboard: February 2026 Rankings
The latest SWE-bench Verified scores show Kimi K2.5 and Qwen3.5 nearly tied at the top, separated by less than half a percentage point. Here is the full leaderboard breakdown.
GPTUni Team
SWE-bench Verified, the industry-standard benchmark for evaluating AI models on real-world software engineering tasks, has seen significant movement in early 2026. The benchmark tests models on their ability to resolve actual GitHub issues from popular open-source repositories.
The current top performers as of February 2026:
1. Kimi K2.5 — 76.8%
2. Qwen3.5 397B — 76.4%
3. Qwen3 Coder Next — 74.2%
4. Claude Opus 4 — 72.0%
5. o3 — 71.7%
6. GPT-4.1 — 69.3%
7. DeepSeek R1 — 65.8%
8. Gemini 2.5 Pro — 63.8%
The gap between first and eighth place is now just 13 percentage points, compared to 25+ points a year ago. This compression reflects the rapid improvement across all major providers. Models from Chinese labs (Moonshot AI, Alibaba/Qwen, DeepSeek) now hold four of the top seven positions.
What makes SWE-bench particularly valuable as a benchmark is that it tests end-to-end software engineering ability, not just code completion. Models must read issue descriptions, navigate codebases, identify the relevant files, and produce working patches. This requires a combination of code understanding, reasoning, and attention to detail that correlates well with real-world usefulness.
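For readers who want to see what one of these tasks actually looks like, the Verified subset is distributed as a public dataset. The following is a minimal sketch, assuming the Hugging Face datasets library and the commonly referenced princeton-nlp/SWE-bench_Verified dataset; the field names shown are those typically documented and may differ across dataset revisions.

```python
# Sketch: inspect a single SWE-bench Verified task instance.
# Assumes the Hugging Face "datasets" library is installed and that the
# "princeton-nlp/SWE-bench_Verified" dataset with the fields below is
# available; treat the field names as assumptions, not a guaranteed schema.
from datasets import load_dataset

# Each instance pairs a real GitHub issue with the repository state it was
# filed against and the tests that verify a correct fix.
ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

example = ds[0]
print(example["repo"])               # source repository, e.g. "astropy/astropy"
print(example["instance_id"])        # unique identifier for this task
print(example["base_commit"])        # commit to check the repository out at
print(example["problem_statement"])  # the issue text the model must read

# An evaluated model sees the problem statement plus the repo at base_commit
# and must produce a patch; the harness applies it and runs the
# FAIL_TO_PASS tests to decide whether the issue counts as resolved.
print(example["FAIL_TO_PASS"])       # tests expected to pass after the fix
```

Looking at even one instance makes the end-to-end nature of the benchmark concrete: the model is given prose and a repository snapshot, not a function signature to complete.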
The benchmark continues to evolve, with plans to add more complex multi-file issues and repository-level tasks in future updates.