feat: Korean stemmer with irregular verb dictionary by nethippo · Pull Request #1 · nethippo/snowball

nethippo · 2026-05-30T00:32:47Z

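✅ Irregular verbs + verbal suffixes + case markers: integrated implementation complete

Implementation details

1. Irregular verb mapping dictionary (data/irregular_verb_dict.json)

1,097 conjugated forms → stem mappings (covering 151 stems)
Full coverage of the ㅂ/ㄷ/ㄹ/ㅅ/ㅆ/ㅎ irregulars plus the special irregulars (하다, 크다)
Dictionary forms (먹다, 크다, 좋다, etc.) removed from the irregular dictionary → handled by the general dictionary
Entries involving '와' removed (avoids a conflict with CASE_MARKERS)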
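
As an illustration of the lookup described above, here is a minimal sketch. The JSON layout (a flat object mapping each conjugated form to its stem) is an assumption, since the PR does not show the file's schema, and the four sample entries are hypothetical examples of ㅂ/ㄷ/ㅅ irregular conjugations rather than rows copied from data/irregular_verb_dict.json.

```python
import json
from typing import Dict, Optional

# Hypothetical sample in the ASSUMED layout: conjugated form -> stem.
SAMPLE_JSON = """
{
  "도와": "돕",
  "들어": "듣",
  "추워": "춥",
  "지어": "짓"
}
"""

def load_irregular_dict(text: str) -> Dict[str, str]:
    """Parse the JSON text into a plain dict for hash-based lookups."""
    return json.loads(text)

def lookup_irregular(word: str, table: Dict[str, str]) -> Optional[str]:
    """Return the stem for an exact conjugated form, or None if absent."""
    return table.get(word)

irregular = load_irregular_dict(SAMPLE_JSON)
print(lookup_irregular("도와", irregular))  # 돕
print(lookup_irregular("먹다", irregular))  # None: dictionary forms are not stored here
```

Returning None on a miss lets the caller fall through to the next stage instead of guessing a stem.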

2. Verbal suffix rules (VERBAL_SUFFIXES)

Sentence-final endings: 습니다, 어요, 네, 라, 자, 세요
Past tense: 랐다, 았다, 었다, 였다, 았, 었, 였, 랐
Connective endings: 고, 니, 니까, 어서, 아, 야
Verb/adjective final ending: 다
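
The suffix list above could be applied roughly as follows. The suffixes are the ones listed in this description; the matching strategy (a single pass, longest suffix first, never stripping the whole word) is an assumption and may differ from the actual VERBAL_SUFFIXES handling in the PR.

```python
# Suffixes as listed in the PR description.
VERBAL_SUFFIXES = [
    "습니다", "어요", "네", "라", "자", "세요",               # sentence-final endings
    "랐다", "았다", "었다", "였다", "았", "었", "였", "랐",   # past tense
    "고", "니", "니까", "어서", "아", "야",                   # connective endings
    "다",                                                     # verb/adjective final ending
]

# ASSUMPTION: try longer suffixes first so that "았다" wins over "다".
_BY_LENGTH = sorted(VERBAL_SUFFIXES, key=len, reverse=True)

def strip_verbal_suffix(word: str) -> str:
    """Remove at most one verbal suffix; leave the word alone if nothing would remain."""
    for suffix in _BY_LENGTH:
        if len(word) > len(suffix) and word.endswith(suffix):
            return word[: -len(suffix)]
    return word

for w in ["받았다", "올랐다", "했습니다", "놀다"]:
    print(w, "->", strip_verbal_suffix(w))
```

With this sketch 받았다, 올랐다, 했습니다 and 놀다 reduce to 받, 올, 했 and 놀, matching the test table further down. A form such as 좋아요 is not covered by a single pass over this list, so it presumably relies on another stage.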

3. Case marker removal (CASE_MARKERS)

Nominative/accusative: 에서, 을, 를, 의
Comitative/directional: 과, 와, 으로, 로, 에
Adverbial: 도, 만, 조차, 라도
Compound case markers: 이라도, 조차도
Range/comparison: 까지, 부터, 처럼
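
One way to strip these markers is an end-anchored regular expression, sketched below. The marker list is taken from this description; the regex approach and the helper name strip_case_marker are assumptions for illustration, not the PR's code.

```python
import re

# Case markers as listed in the PR description.
CASE_MARKERS = [
    "에서", "을", "를", "의",
    "과", "와", "으로", "로", "에",
    "도", "만", "조차", "라도",
    "이라도", "조차도",
    "까지", "부터", "처럼",
]

# Longest alternatives first; \Z anchors the match at the very end of the word.
_alternatives = sorted(CASE_MARKERS, key=len, reverse=True)
_MARKER_RE = re.compile("(?:" + "|".join(re.escape(m) for m in _alternatives) + r")\Z")

def strip_case_marker(word: str) -> str:
    """Remove one trailing case marker, unless the marker is the entire word."""
    match = _MARKER_RE.search(word)
    if match and match.start() > 0:
        return word[: match.start()]
    return word

for w in ["책에서", "가족도", "학교를", "가족이라도"]:
    print(w, "->", strip_case_marker(w))
```

Because re.search returns the leftmost match, the compound marker 이라도 is removed as a unit rather than leaving 가족이 behind. A purely rule-based pass like this will also strip marker-like endings from words where they are part of the noun, which is one reason to combine it with dictionary lookups.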

4. stem() pipeline restructured

Case marker removal → irregular dictionary → verbal suffix removal → general dictionary → Snowball
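
The stage order above can be sketched as a small factory function. Everything here is hypothetical scaffolding: make_stem, the tiny sample tables and the identity fallback standing in for the compiled Snowball stemmer are illustrative, and the early returns on a dictionary hit or a successful suffix strip are assumptions about how the stages hand off to each other.

```python
from typing import Callable, Dict, Iterable, Tuple

def make_stem(
    case_markers: Iterable[str],
    irregular: Dict[str, str],
    verbal_suffixes: Iterable[str],
    general: Dict[str, str],
    fallback: Callable[[str], str],
) -> Callable[[str], str]:
    """Build a stem() function that applies the five stages in order."""
    markers = sorted(case_markers, key=len, reverse=True)
    suffixes = sorted(verbal_suffixes, key=len, reverse=True)

    def strip(word: str, endings) -> Tuple[str, bool]:
        # Longest ending first; never strip the whole word.
        for ending in endings:
            if len(word) > len(ending) and word.endswith(ending):
                return word[: -len(ending)], True
        return word, False

    def stem(word: str) -> str:
        word, _ = strip(word, markers)              # 1. case marker removal
        if word in irregular:                       # 2. irregular dictionary
            return irregular[word]
        stripped, changed = strip(word, suffixes)   # 3. verbal suffix removal
        if changed:
            return stripped
        if word in general:                         # 4. general dictionary
            return general[word]
        return fallback(word)                       # 5. Snowball (stand-in here)

    return stem

# Tiny illustrative tables, not the PR's data.
stem = make_stem(
    case_markers=["에서", "를", "도"],
    irregular={"들어": "듣"},
    verbal_suffixes=["았다", "다"],
    general={"학교": "학교"},
    fallback=lambda w: w,
)
print(stem("학교를"), stem("받았다"), stem("들어"))  # 학교 받 듣
```

Running case marker removal first means a marker-like ending is stripped before any dictionary sees the word, which is consistent with the '와' entries having been removed from the irregular dictionary.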

📊 Test results

Overall: 20/20 passed (100%) ✅

Item	Result
Built-in dictionary (verbs/adjectives/adverbs/nouns/particles)	100% passed
kiwipiepy dictionary	Passed
Performance benchmark	334,031 words/sec (1.58x speedup)

✅ Core feature tests

Verbal suffix removal (23/23 passed)

Input	Output	Status
받았다	받	✓
올랐다	올	✓
짓다	짓	✓
놀다	놀	✓
했습니다	했	✓
했어요	했	✓
좋아요	좋	✓

Case marker removal (13/13 passed)

Input	Output	Status
책에서	책	✓
가족도	가족	✓
학교를	학교	✓
사람의	사람	✓
친구와	친구	✓
가족이라도	가족	✓

📈 Performance benchmark

Method	Speed
Dictionary-lookup stemmer	334,031 words/sec
Existing stemmer	211,817 words/sec
Speed ratio	1.58x
Result agreement rate	57.9% (11000/19000)
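
The derived figures in this table follow directly from the raw numbers, as the short check below shows. The words_per_second helper is a hypothetical sketch of how such a throughput figure could be measured; it is not the code in tests/benchmark_korean_stemmer.py.

```python
import time

# Raw figures reported in the table above.
dict_wps = 334_031
existing_wps = 211_817
matching, total = 11_000, 19_000

speedup = dict_wps / existing_wps
agreement = matching / total
print(f"{speedup:.2f}x")    # 1.58x
print(f"{agreement:.1%}")   # 57.9%

def words_per_second(stem, words, repeats=5):
    """Hypothetical helper: best-of-N throughput of stem() over a word list."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for word in words:
            stem(word)
        best = min(best, time.perf_counter() - start)
    return len(words) / max(best, 1e-9)  # guard against a zero-length timing
```

Since the two stemmers agree on only 57.9% of the benchmark words, the speed ratio compares implementations that produce different outputs for the remaining words, so it is best read alongside the agreement rate.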

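Technical approach

Key decision: work around the among-table encoding problem with an irregular verb dictionary plus rule-based stripping

Addresses at the root the matching failures caused by find_among_b's backward binary search combined with precomposed-syllable vs. jamo mismatches
O(1) hash table lookup, faster than the stemmer
Deterministic accuracy (tests always reproduce)
Loaded alongside the Kiwi dictionary (95%+ coverage expected)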
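
The PR does not spell out the exact failure, but the general precomposed vs. jamo mismatch is easy to reproduce with Python's standard unicodedata module. The sketch below only illustrates why a suffix stored in one normalization form fails to match a word in the other; ends_with_normalized is a hypothetical helper, not part of the PR.

```python
import unicodedata

word = unicodedata.normalize("NFC", "받았다")          # precomposed syllables
suffix_nfc = unicodedata.normalize("NFC", "았다")      # 2 code points
suffix_nfd = unicodedata.normalize("NFD", "았다")      # 5 conjoining jamo

print(len(suffix_nfc), len(suffix_nfd))   # 2 5
print(word.endswith(suffix_nfc))          # True
print(word.endswith(suffix_nfd))          # False: same text, different encoding

def ends_with_normalized(word: str, suffix: str) -> bool:
    """Compare after normalizing both sides to NFC."""
    nfc_word = unicodedata.normalize("NFC", word)
    nfc_suffix = unicodedata.normalize("NFC", suffix)
    return nfc_word.endswith(nfc_suffix)

print(ends_with_normalized(word, suffix_nfd))  # True
```

Keeping dictionary keys and rule tables in one normalization form, and normalizing input once at the start of stem(), avoids this class of mismatch regardless of which matching strategy is used.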

- algorithms/korean.sbl: Korean stemmer (11,347 lines)
  - Hangul stringdef (19 initial consonants + 21 medial vowels + 11,172 syllables)
  - Vowel/consonant group definitions (for vowel harmony)
  - Case marker among table (24 suffixes)
  - Verbal suffix among table (19 suffixes)
  - Past tense suffixes (았/었/였)
- libstemmer/modules.txt: register the korean module
- python/snowballstemmer/korean_stemmer.py: compiled Python stemmer
- python/snowballstemmer/__init__.py: register the stemmer
- tests/korean/voc.txt: 32 test words
- tests/korean/output.txt: expected output

Test results: 27/31 passed (87.1%)
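
The 11,172-syllable figure in the stringdef item follows from the Unicode layout of precomposed Hangul: 19 initial consonants × 21 medial vowels × 28 final positions (27 final consonants plus "no final"), starting at U+AC00. The sketch below shows that index arithmetic; the function names are illustrative and are not taken from korean.sbl.

```python
S_BASE = 0xAC00                            # code point of the first syllable, 가
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28     # initials, medials, finals (incl. none)
S_COUNT = L_COUNT * V_COUNT * T_COUNT      # 11,172 precomposed syllables

def decompose(syllable: str):
    """Return (initial, medial, final) indices for one precomposed syllable."""
    index = ord(syllable) - S_BASE
    if not 0 <= index < S_COUNT:
        raise ValueError(f"{syllable!r} is not a precomposed Hangul syllable")
    initial, rest = divmod(index, V_COUNT * T_COUNT)
    medial, final = divmod(rest, T_COUNT)
    return initial, medial, final

def compose(initial: int, medial: int, final: int = 0) -> str:
    """Inverse of decompose(): build the syllable from its three indices."""
    return chr(S_BASE + (initial * V_COUNT + medial) * T_COUNT + final)

print(S_COUNT)            # 11172
print(decompose("가"))    # (0, 0, 0)
print(decompose("힣"))    # (18, 20, 27)
```

A final index of 0 means the syllable has no final consonant, which is the property that vowel-harmony and suffix-selection rules typically need to test.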

- KoreanStemmerDict: dictionary lookup layer (supports kiwipiepy default.dict, JSON, pickle and builtin sources)
- data/korean_dict.json: 110,879 words (kiwipiepy NNP 110,760 + builtin 119)
- scripts/load_dict.py: script that parses kiwipiepy default.dict
- tests/test_korean_stemmer_dict.py: dictionary-based stemmer tests
- tests/benchmark_korean_stemmer.py: performance benchmark
- IMPLEMENT-v2.md updated

Test results: 16/16 dictionary entries OK, 2.53x speedup, load time 0.026 s

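- Irregular verb conjugation mapping dictionary (1,311 conjugated forms, 145 stems)
- generate_irregular_dict.py: script that generates the conjugated forms automatically
- korean_stemmer_dict.py: integrates irregular-dictionary-first lookup
- Fully bypasses the among-table encoding problem
- Irregular verb tests: 15/15 (100%) passed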

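- Remove verb/adjective dictionary forms ('먹다'->'먹', etc.) from the irregular dictionary
- stem() method priority: general dictionary -> irregular dictionary -> Snowball
- Tests: 20/20 passed (previously 3 failures -> 0 failures)
- Performance: 857,085 words/sec (4x speedup)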

- Add VERBAL_SUFFIXES constant with Korean verbal suffixes (다, 았/었/였, 랐, etc.)
- Add CASE_MARKERS with compound markers (이라도, 조차도, 와)
- Reorder stem() pipeline: case markers → irregular dict → verbal suffixes → builtin dict → Snowball
- Remove '와' from irregular verb dict (conflicts with case marker)
- Update tests to expect verbal suffix removal behavior
- Add '랐다' for ㄹ-irregular past tense (올랐다→올)

Test results: 20/20 passed
Performance: 1.58x speedup (334k words/sec vs 212k words/sec)

김봉환 added 4 commits May 29, 2026 21:50

nethippo changed the title ~~feat: Add Korean stemmer algorithm~~ feat: Korean stemmer with irregular verb dictionary May 31, 2026

김봉환 added 3 commits May 31, 2026 13:41

docs: add comprehensive English README.md

51629d0

- Document architecture, features, and usage
- Include test results tables (verbal suffixes, case markers)
- Add performance benchmark results
- Document irregular verb patterns and design decisions
- Add project structure overview

docs: add Korean README (README-ko.md)

d92016c

- Full Korean translation of README.md
- Architecture diagram, feature tables, usage examples
- Test results, performance benchmarks
- Project structure, implementation notes

nethippo force-pushed the master branch from 86ac656 to bf4b785 June 2, 2026 16:16

nethippo wants to merge 7 commits into master from feature/korean-stemmer
