AI Trends
Benchmarks
Benchmark-related AI news, summarized in Korean.
SWE-bench
27
ARC-AGI
20
ARC-AGI-2
16
NeurIPS
10
Terminal-Bench
8
METR
6
LMSYS Arena
6
MineBench
5
SWE-bench Pro
3
FACTS Grounding
3
DREAM
3
AtCoder
3
FrontierMath
3
GAIA
3
OODSelect
2
WebShop
2
AI GameStore
2
Berkeley Function Calling Leaderboard
2
AI WeatherQuest
2
How2Everything
2
HLE
2
GPQA
2
VisGym
2
IRPAPERS
2
Defects4J
2
ALE-Bench
2
MCIF
2
AA-Omniscience
2
Euro NCAP
2
KernelBench
2
SWE-rebench V2
2
Pass@K
2
OSWorld
2
VBVR
2
Core War
2
GDPVal
2
Google Research Football
2
CIFAR-10
2
MetaDrive
2
CAPTURE
2
LOLAMEME
2
MedAgentsBench
2
PersonalRewardBench
2
DORA
2
FinanceBench
2
AI Arena
2
FrontierScience
2
Humanity's Last Exam
2
DeepResearch Bench
2
Calendar Gym
2
AssetOpsBench
2
IT-Bench
2
Mathet's Gamma
1
SpeechMap
1
Petri
1
BugBench
1
SimpleQA
1
Age of Empires 2
1
BaldurBench
1
ScienceWorld
1
ScholarQABench
1
LLM Engineering Evaluation
1
DTR
1
OmniGAIA
1
CiteAudit
1
MMR-Life
1
CGBench
1
Ref-Adv
1
RefCOCO
1
DystopiaBench
1
StructEval
1
T2S-Bench
1
Live Gaming Benchmark
1
MindEval
1
τ²-bench
1
whatllm.org
1
DPG-Bench
1
CHAIN
1
ForesightSafety Bench
1
Workunit Benchmarks
1
OmniDocBench
1
RubricBench
1
HackBench
1
IRT
1
LoCoMo
1
LLM-as-a-Judge
1
LABBench2
1
Qwen-Bench
1
DraftNEPABench
1
APEX Testing
1
GSM-Symbolic
1
PhotoBench
1
PinchBench
1
MTEB
1
CHAIR
1
AIME
1
1-WL
1
InsanityBench
1
Webgrid Eval
1
RubberDuckBench
1
LiveCodeBench
1
DeepSearchQA
1
Craftax
1
SteerEval
1
CyberMetric
1
Evaluation Benchmark
1
RCT
1
Skills Benchmark
1
BeyondSWE
1
RewardBench
1
Minimax Rate
1
DeepMind Control Suite
1
DMLab
1
UniG2U-Bench
1
LIBERO
1
Humanity's Last Exam
1
TimeMachine-bench
1
MMLU-Pro
1
How2Bench
1
Android Bench
1
Kaggle Game Arena
1
APEX
1
Bullshit Benchmark
1
LunarLander-v3
1
GDP-Eval
1
Eleusis
1
CC-Bench-V2
1
SWE-bench Verified
1
MMEB
1
PostTrainBench
1
scheduler_perf
1
March Machine Learning Mania
1
MMLU
1
MATH500
1
VBVR-Bench
1
DexMimicGen
1
EarthSpatialBench
1
AoE2 LLM Benchmark
1
First Proof
1
DARPA SubT Challenge
1
LunarLander
1
Pencil Puzzle Bench
1
ROUGE
1
Anthropic Economic Index
1
Conv-FinRe
1
LongCLI-Bench
1
GSM8K
1
HealthBench
1
F1-score
1
BEIR
1
ViDoRe
1
MOSES
1
LLM-as-judge
1
VBench
1
τ2-bench
1
MobilityBench
1
Together Evaluations
1
ALFWorld
1
AUC
1
ARLArena
1
Interactive Benchmarks
1
pass@k
1
CoW-Bench
1
MedAgentBench
1
LMSYS Chatbot Arena
1
AgentVista
1