
A Guide to Building an Agent Eval Harness: Proving Reliability with Numbers, Not Feelings

by The era of AI 2026. 4. 11.

Part 2 of the series. In Part 1, Introduction to Harness Engineering, we said that an agent has to be wrapped in constraints, tools, and feedback loops. This post covers the heart of that feedback loop: the eval harness.

Introduction: "Is it better than yesterday?"

You changed one line of the prompt, added one tool, and upgraded the model from Sonnet to Opus. Did the agent get better? Or did it get worse?

Teams that answer "it feels like it improved" soon fall apart in production, because without an eval harness you cannot tell an improvement from a regression.

An eval harness is automated test infrastructure that measures an agent's behavior in a reproducible way and catches regressions.

 

It resembles unit testing in traditional software, but it differs fundamentally in one respect: the output is nondeterministic, so the same input can pass on one run and fail on the next.
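One way to handle that nondeterminism is to run each case several times and report a pass rate with an uncertainty range instead of a single pass/fail. The sketch below is an illustration under assumptions, not part of the original post: `run_once` is a hypothetical callable standing in for one agent run plus scoring, and the Wilson score interval is one common choice among several.

```python
import math
import random
from typing import Callable


def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95%)."""
    if trials == 0:
        return (0.0, 1.0)
    p = passes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def pass_rate(run_once: Callable[[], bool], trials: int = 10) -> dict:
    """Run the same case `trials` times and summarize how often it passed."""
    passes = sum(1 for _ in range(trials) if run_once())
    return {
        "passes": passes,
        "trials": trials,
        "rate": passes / trials,
        "ci": wilson_interval(passes, trials),
    }


if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-in for a real agent run: passes about 80% of the time.
    print(pass_rate(lambda: rng.random() < 0.8, trials=20))
```

With only a handful of trials the interval is wide, which is a useful reminder that a small difference between two builds may just be noise.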


ํ‰๊ฐ€ ํ•˜๋„ค์Šค์˜ 4๊ฐ€์ง€ ๊ตฌ์„ฑ ์š”์†Œ

๊ตฌ์„ฑ ์š”์†Œ ์—ญํ•  ์ „ํ†ต SW ๋น„์Šˆ
๋ฐ์ดํ„ฐ์…‹(Dataset) ์ž…๋ ฅ + ๊ธฐ๋Œ€ ๋™์ž‘ ํ…Œ์ŠคํŠธ ์ผ€์ด์Šค
๋Ÿฌ๋„ˆ(Runner) ์—์ด์ „ํŠธ๋ฅผ ๊ฒฉ๋ฆฌ ํ™˜๊ฒฝ์—์„œ ์‹คํ–‰ ํ…Œ์ŠคํŠธ ๋Ÿฌ๋„ˆ
์ฑ„์ ๊ธฐ(Scorer) ๊ฒฐ๊ณผ๋ฅผ ์ ์ˆ˜๋กœ ๋ณ€ํ™˜ assert ๋ฌธ
๋Œ€์‹œ๋ณด๋“œ(Dashboard) ์‹œ๊ณ„์—ด ์ถ”์  + ํšŒ๊ท€ ์•Œ๋ฆผ CI ๋ฆฌํฌํŠธ

1๏ธโƒฃ ๋ฐ์ดํ„ฐ์…‹ — ๊ฐ€์žฅ ์ค‘์š”ํ•˜๊ณ  ๊ฐ€์žฅ ์†Œํ™€ํžˆ ๋‹ค๋ค„์ง€๋Š” ๋ถ€๋ถ„

์ข‹์€ eval ๋ฐ์ดํ„ฐ์…‹์˜ 3๊ฐ€์ง€ ์›์น™:

  • ๋‹ค์–‘์„ฑ(Diversity) — ์‰ฌ์šด ์ผ€์ด์Šค 60%, ์–ด๋ ค์šด ์ผ€์ด์Šค 30%, ์—ฃ์ง€ ์ผ€์ด์Šค 10%
  • ํ˜„์‹ค์„ฑ(Realism) — ํ•ฉ์„ฑ ๋ฐ์ดํ„ฐ๋ณด๋‹ค ํ”„๋กœ๋•์…˜ ๋กœ๊ทธ์—์„œ ์ถ”์ถœํ•œ ์‹ค์ œ ์ผ€์ด์Šค๊ฐ€ ํ›จ์”ฌ ๊ฐ•๋ ฅ
  • ๊ณ„์ธตํ™”(Stratification) — ๊ธฐ๋Šฅ๋ณ„·๋‚œ์ด๋„๋ณ„ ํƒœ๊น…์œผ๋กœ ํšŒ๊ท€ ์œ„์น˜๋ฅผ ์ขํž ์ˆ˜ ์žˆ์–ด์•ผ ํ•จ
# ์ข‹์€ eval case ์˜ˆ์‹œ
{
  "id": "refactor-001",
  "category": "code_refactor",
  "difficulty": "medium",
  "input": "Extract the validation logic in user_service.py into a separate module",
  "repo_snapshot": "fixtures/user_service_v1/",
  "expected": {
    "must_pass_tests": ["test_user_validation"],
    "must_not_break": ["test_user_create", "test_user_update"],
    "files_modified_max": 3
  },
  "tags": ["refactor", "python", "imports"]
}
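
The diversity and stratification principles can be checked mechanically once cases follow a shape like the one above. The following is a minimal sketch, assuming each case is a dict with `id`, `difficulty`, and `tags` keys; the labels `easy`, `hard`, and `edge`, the helper names, and the 10-point tolerance are illustrative assumptions, not something the original post specifies.

```python
from collections import Counter, defaultdict

# Target mix from the diversity principle; the label names are assumed.
TARGET_MIX = {"easy": 0.60, "hard": 0.30, "edge": 0.10}


def mix_gaps(cases, target=TARGET_MIX, tolerance=0.10):
    """Return labels whose share is off target by more than `tolerance`.

    Values are (actual share - target share), rounded to 3 decimals.
    """
    if not cases:
        return {}
    counts = Counter(case["difficulty"] for case in cases)
    total = len(cases)
    gaps = {}
    for label in set(target) | set(counts):
        actual = counts.get(label, 0) / total
        expected = target.get(label, 0.0)
        if abs(actual - expected) > tolerance:
            gaps[label] = round(actual - expected, 3)
    return gaps


def index_by_tag(cases):
    """Map each tag to the ids of the cases carrying it."""
    index = defaultdict(list)
    for case in cases:
        for tag in case.get("tags", []):
            index[tag].append(case["id"])
    return dict(index)
```

A tag index like this is what lets a failing run be narrowed to, say, "everything tagged `imports` regressed" instead of "the score dropped".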

Key tip: every time a bug report comes in, preserve it as an eval case. A regression you have caught once should never happen again.


2๏ธโƒฃ ๋Ÿฌ๋„ˆ — ๊ฒฉ๋ฆฌ, ์žฌํ˜„, ๋ณ‘๋ ฌํ™”

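# Assumes isolated_sandbox, build_agent, and EvalResult are provided by your harness.
# build_agent is a hypothetical factory that turns agent_config into an agent.
async def run_eval(case, agent_config):
    agent = build_agent(agent_config)
    with isolated_sandbox(case.repo_snapshot) as sandbox:
        trace = await agent.run(
            task=case.input,
            workspace=sandbox.path,
            max_steps=50,       # hard cap on agent steps
            max_cost_usd=2.0,   # hard cap on spend per case
            seed=42,  # pin determinism where the model or provider allows it
        )
        # Capture the diff while the sandbox still exists
        diff = sandbox.diff()
    return EvalResult(case_id=case.id, trace=trace, sandbox_diff=diff)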

Requirements: an isolated workspace, cost and step limits, storage of the full trace, and parallel execution.
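
The parallel-execution requirement can be met with bounded concurrency, so that many cases run at once without exhausting rate limits or machine resources. This is a sketch under assumptions: `run_one` is a hypothetical async callable that runs a single case, each case is a dict with an `id` key, and the concurrency and timeout defaults are placeholders.

```python
import asyncio


async def run_all(cases, run_one, max_concurrency=4, timeout_s=600.0):
    """Run every case with at most `max_concurrency` in flight at a time.

    A failure or timeout in one case is recorded and never aborts the batch.
    Results come back in the same order as `cases`.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(case):
        async with semaphore:
            try:
                result = await asyncio.wait_for(run_one(case), timeout=timeout_s)
                return {"case_id": case["id"], "status": "ok", "result": result}
            except asyncio.TimeoutError:
                return {"case_id": case["id"], "status": "timeout", "result": None}
            except Exception as exc:  # keep going; record what went wrong
                return {"case_id": case["id"], "status": "error", "result": repr(exc)}

    return await asyncio.gather(*(guarded(case) for case in cases))
```

Recording errors as results, instead of letting them propagate, keeps one crashing case from hiding the outcome of the other cases in the batch.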


3๏ธโƒฃ ์ฑ„์ ๊ธฐ — 3๊ฐ€์ง€ ๋ ˆ์ด์–ด๋ฅผ ์กฐํ•ฉํ•˜๋ผ

ํ‰๊ฐ€๋Š” ๋‹จ์ผ ์ ์ˆ˜๊ฐ€ ์•„๋‹™๋‹ˆ๋‹ค. ์ €๋ ดํ•˜๊ณ  ํ™•์‹คํ•œ ๊ฒƒ๋ถ€ํ„ฐ, ๋น„์‹ธ๊ณ  ๋ชจํ˜ธํ•œ ๊ฒƒ๊นŒ์ง€ ๊ณ„์ธต์ ์œผ๋กœ ์Œ“์Šต๋‹ˆ๋‹ค.

๐ŸŽฏ ์ฑ„์  ํ”ผ๋ผ๋ฏธ๋“œ

๋ ˆ์ด์–ด ๋น„์šฉ ์‹ ๋ขฐ๋„ ์˜ˆ์‹œ
L1: ๊ฒฐ์ •์  ๊ฒ€์ฆ ๋งค์šฐ ๋‚ฎ์Œ ๋งค์šฐ ๋†’์Œ ํ…Œ์ŠคํŠธ ํ†ต๊ณผ? ์ปดํŒŒ์ผ๋จ? ์Šคํ‚ค๋งˆ ์ผ์น˜?
L2: ํœด๋ฆฌ์Šคํ‹ฑ ๋‚ฎ์Œ ์ค‘๊ฐ„ ํŒŒ์ผ ์ˆ˜์ • ๊ฐœ์ˆ˜, ํ† ํฐ ์‚ฌ์šฉ๋Ÿ‰, ๊ธˆ์ง€ ํŒจํ„ด ๋ฏธ์‚ฌ์šฉ
L3: LLM-as-judge ๋†’์Œ ๋‚ฎ์Œ~์ค‘๊ฐ„ "์ฝ”๋“œ ํ’ˆ์งˆ์ด ์ ์ ˆํ•œ๊ฐ€?", "์„ค๋ช…์ด ๋ช…ํ™•ํ•œ๊ฐ€?"

์ค‘์š”ํ•œ ์›์น™: L1์œผ๋กœ ์žก์„ ์ˆ˜ ์žˆ์œผ๋ฉด ์ ˆ๋Œ€ L3์„ ์“ฐ์ง€ ๋งˆ์„ธ์š”. LLM ์ฑ„์ ๊ธฐ๋Š” ๋น„์‹ธ๊ณ , ๋ถˆ์•ˆ์ •ํ•˜๋ฉฐ, ์ž์ฒด ํŽธํ–ฅ์ด ์žˆ์Šต๋‹ˆ๋‹ค.

# Assumes run_tests and llm_judge are provided by your harness, and that
# result.sandbox still points at the workspace the agent modified.
async def score_refactor_case(result, case):  # async because llm_judge is awaited
    scores = {}
    # L1: checks that must always pass
    scores["tests_pass"] = run_tests(result.sandbox, case.expected.must_pass_tests)
    scores["no_regression"] = run_tests(result.sandbox, case.expected.must_not_break)
    # L2: heuristics
    scores["files_within_budget"] = len(result.modified_files) <= case.expected.files_modified_max
    # L3: LLM judge (optional)
    scores["code_quality"] = await llm_judge(
        rubric="Follows the single responsibility principle; clear naming",
        diff=result.sandbox.diff()
    )
    return scores
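
The "L1 before L3" principle can also be enforced in the control flow: call the judge only after the deterministic checks have passed. The sketch below assumes each check is a zero-argument callable; `score_with_gating` and its return shape are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, Optional


def score_with_gating(
    l1_checks: Dict[str, Callable[[], bool]],
    l2_checks: Dict[str, Callable[[], bool]],
    judge: Optional[Callable[[], float]] = None,
) -> dict:
    """Run cheap layers first and call the LLM judge only if L1 passes."""
    l1 = {name: bool(check()) for name, check in l1_checks.items()}
    passed_l1 = all(l1.values())
    l2 = {name: bool(check()) for name, check in l2_checks.items()}
    # A case that fails its tests is already a failure, so do not spend
    # judge tokens grading its code quality.
    judge_score = judge() if (judge is not None and passed_l1) else None
    return {"l1": l1, "l2": l2, "judge": judge_score, "passed": passed_l1}
```

Skipping the judge on L1 failures saves cost and also keeps noisy judge scores out of cases whose outcome is already decided.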

Avoiding the pitfalls of LLM-as-judge

  • ❌ Do not ask directly for a score from 1 to 10. LLM judges tend to cluster their answers between 6 and 8
  • ✅ Have it do A/B pairwise comparisons, or give it a clear rubric (checklist)
  • ✅ Use a judge model that differs from the model being evaluated (to avoid self-preference bias)
  • ✅ The judge itself must also be validated against a meta-evaluation set
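
Validating the judge against a meta-evaluation set can be as simple as comparing its verdicts with human labels on the same items. The sketch below assumes both are lists of pass/fail booleans in the same order. It reports raw agreement and Cohen's kappa, one common chance-corrected statistic, though the original post does not prescribe a specific metric.

```python
from typing import List, Optional


def judge_agreement(human: List[bool], judge: List[bool]) -> dict:
    """Compare judge verdicts against human labels on a meta-eval set."""
    if not human or len(human) != len(judge):
        raise ValueError("human and judge must be non-empty and the same length")
    n = len(human)
    agreement = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected by chance, from each rater's overall pass rate.
    p_human = sum(human) / n
    p_judge = sum(judge) / n
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    # Kappa is undefined when chance agreement is already 1.0.
    kappa: Optional[float] = None
    if expected < 1.0:
        kappa = (agreement - expected) / (1 - expected)
    return {"agreement": agreement, "kappa": kappa, "n": n}
```

If agreement on the meta-evaluation set is low, fix the rubric or change the judge before trusting any L3 score on the main dataset.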

4๏ธโƒฃ ๋Œ€์‹œ๋ณด๋“œ — ํšŒ๊ท€๋ฅผ ์ฆ‰์‹œ ์•Œ์•„์ฐจ๋ฆฌ๊ธฐ

๋Œ€์‹œ๋ณด๋“œ๊ฐ€ ๋ณด์—ฌ์ค˜์•ผ ํ•  ๊ฒƒ:

  • ์นดํ…Œ๊ณ ๋ฆฌ๋ณ„ ํ†ต๊ณผ์œจ (์ „์ฒด๊ฐ€ ์•„๋‹ˆ๋ผ ๋ถ„ํ•ด๋œ ์ ์ˆ˜)
  • ํšŒ๊ท€ ์•Œ๋ฆผ (์ด์ „ ๋นŒ๋“œ ๋Œ€๋น„ N% ์ด์ƒ ํ•˜๋ฝ ์‹œ ์Šฌ๋ž™)
  • ์‹คํŒจ ์ผ€์ด์Šค์˜ ํŠธ๋ ˆ์ด์Šค ๋งํฌ (๋””๋ฒ„๊น…๊นŒ์ง€ 1ํด๋ฆญ)
  • ๋น„์šฉ๊ณผ ์ง€์—ฐ์‹œ๊ฐ„ (์ •ํ™•๋„๊ฐ€ ๊ฐ™๋‹ค๋ฉด ๋” ์‹ธ๊ณ  ๋น ๋ฅธ ๊ฒŒ ์ด๊น€)
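
The first two items can be computed from the raw results before any dashboard exists. The following is a minimal sketch, assuming each result is a dict with `category` and `passed` keys and treating "N%" as a drop measured in percentage points; the function names and the 5-point default threshold are assumptions, and actually posting the message to Slack is left out.

```python
from collections import defaultdict


def pass_rate_by_category(results):
    """Return {category: pass rate} for a list of per-case results."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for result in results:
        totals[result["category"]] += 1
        if result["passed"]:
            passes[result["category"]] += 1
    return {category: passes[category] / totals[category] for category in totals}


def find_regressions(previous, current, max_drop=0.05):
    """Return categories whose pass rate fell by at least `max_drop`.

    `previous` and `current` are {category: pass rate} dicts from two builds.
    """
    regressions = {}
    for category, before in previous.items():
        after = current.get(category)
        if after is not None and before - after >= max_drop:
            regressions[category] = (before, after)
    return regressions


def format_alert(regressions):
    """Build a plain-text alert body; sending it is up to your tooling."""
    lines = [
        f"{category}: {before:.0%} -> {after:.0%}"
        for category, (before, after) in sorted(regressions.items())
    ]
    return "Eval regression detected\n" + "\n".join(lines)
```

Comparing per category matters because an overall average can stay flat while one category collapses and another improves.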

Roadmap for building an eval harness

Week 1: Extract 20 cases from production logs + L1 scorer
Week 2: Integrate into CI and run automatically on every PR
Week 3: Category tagging, build the dashboard
Week 4: Expand to 50 cases, add the LLM judge, meta-evaluation
After that: Add a case for every bug report, review the dataset every quarter

Start small. Twenty well-curated cases are far more valuable than 2,000 synthetic ones.


Closing: Eval is a product

Many teams treat the eval harness as a "nice-to-have tool", but in mature AI teams the eval dataset itself is the most important asset. You can swap out a model, but you cannot swap out 1,000 golden cases curated over six months.

"If you can't measure it, you can't improve it." (commonly attributed to Peter Drucker)

In nondeterministic systems, this maxim carries twice the weight.

 

๋‹ค์Œ ๊ธ€์—์„œ๋Š” ๋„๊ตฌ ์„ค๊ณ„ ๋ฒ ์ŠคํŠธ ํ”„๋ž™ํ‹ฐ์Šค — ์–ด๋–ค tool spec์ด ์—์ด์ „ํŠธ๋ฅผ ๋˜‘๋˜‘ํ•ด ๋ณด์ด๊ฒŒ ๋งŒ๋“œ๋Š”์ง€ — ๋ฅผ ๋‹ค๋ฃจ๊ฒ ์Šต๋‹ˆ๋‹ค.


#AIAgents #EvalHarness #LLMOps #AgentEvaluation #HarnessEngineering

 
