boostcampaitech8/pro-nlp-mrc-nlp-03

🏆 NLP Project: Open-Domain Question Answering

✏️ Competition Overview

| Item | Description |
| --- | --- |
| Topic | Open-Domain Question Answering (ODQA) competition of the NLP track, Naver Boostcamp AI-Tech, 8th cohort |
| Task | Build a system that retrieves the documents relevant to a question from a large corpus (Retriever) and infers the answer from them (Reader) |
| Period | December 3, 2025 to December 11, 2025 |
| Data | Wikipedia passages (60,613), KorQuAD v1/v2 training data |
| Metrics | Exact Match (EM): whether the prediction matches the answer exactly (primary metric); F1 score |
| Wrap-up report | Gen for NLP NLP-03 wrap-up report |
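
The two metrics in the table can be sketched as follows. This is a minimal illustration, not the competition's official scorer: the whitespace-only normalization and the character-level granularity of F1 are assumptions, and the function names are hypothetical.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Collapse runs of whitespace (an assumed, minimal normalization)."""
    return " ".join(text.split())


def exact_match(prediction: str, truth: str) -> int:
    """Return 1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(truth))


def char_f1(prediction: str, truth: str) -> float:
    """F1 over characters, ignoring spaces (granularity is an assumption)."""
    pred_chars = list(normalize(prediction).replace(" ", ""))
    true_chars = list(normalize(truth).replace(" ", ""))
    if not pred_chars or not true_chars:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_chars == true_chars)
    overlap = sum((Counter(pred_chars) & Counter(true_chars)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_chars)
    recall = overlap / len(true_chars)
    return 2 * precision * recall / (precision + recall)
```

A partially correct span such as `"Sejong"` against `"King Sejong"` scores 0 on EM but still earns partial credit on F1, which is why the two metrics are reported together.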

🎖️ Leader Board

Private Leader Board (6th place)

image

Public Leader Board (10th place)

image

👨‍💻 Contributors

최준호
김윤희
박서진
곽나영
김이슬
우혜진

👼 Roles

| Name | Role |
| --- | --- |
| 김윤희 | Data EDA; design and implementation of Hybrid Retrieval (BM25 + Dense); Reader fine-tuning; ensemble strategy |
| 박서진 | Dense Retrieval experiments and model selection (BGE-M3 and others); Retrieval fine-tuning experiments; data preprocessing |
| 곽나영 | Data preprocessing pipeline; Hybrid Retrieval implementation and performance optimization; Retrieval fine-tuning |
| 김이슬 | KorQuAD 2.0 preprocessing and augmentation; ElasticSearch-based Retrieval implementation and experiments |
| 우혜진 | BM25-based Sparse Retrieval implementation; Reader model improvements (Negative Passage, Curriculum Learning); Qwen3 experiments |
| 최준호 | Curriculum Learning difficulty metric (Embedding + Position) and model training; ensemble (Soft/Hard Voting) implementation |

✍🏻 Project Overview

This project aims to maximize the performance of an ODQA system that finds the information relevant to a question among tens of thousands of Wikipedia passages and returns an accurate answer. Beyond simple search, it uses strategies that exploit semantic similarity and control training difficulty.

Key Features

  • Hybrid Retrieval & Reranking:
    • Combines BM25 (sparse) and BGE-M3 (dense) to capture keywords and meaning at the same time
    • Fuses the two rankings stably with RRF (Reciprocal Rank Fusion), then reorders the top documents with a Cross-Encoder reranker
  • Reader Optimization:
    • Negative Passage Training: includes passages that do not contain the answer in training to sharpen the model's ability to discriminate
    • Curriculum Learning: uses the number of passages and the answer position to train in stages, Easy → Medium → Hard
  • Robustness:
    • Passage chunking with a stride of 128 to minimize context loss
    • Randomized answer position to reduce position bias
  • Ensemble Strategy: Soft Voting based on string similarity, so that the models compensate for one another's wrong answers
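
The rank-fusion step listed above can be sketched as follows. This is a minimal illustration of Reciprocal Rank Fusion, not the repository's code: the function name, the passage ids, and the constant `k=60` (the value commonly used in the RRF literature) are assumptions.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of passage ids.

    Each passage scores the sum over lists of 1 / (k + rank), rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_top = ["p3", "p1", "p7"]   # hypothetical sparse (BM25) ranking
dense_top = ["p1", "p9", "p3"]  # hypothetical dense (BGE-M3) ranking
fused = rrf_fuse([bm25_top, dense_top])
print(fused)
```

Because RRF only looks at ranks, it needs no score normalization between BM25 and dense similarities. Passages that appear in both lists (`p1`, `p3`) rise above passages found by only one retriever.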

📃 System Architecture

image

📁 Folder Structure

korean-mrc-negative-passage/
├── data_preparation/          # dataset creation
│   ├── create_negative_passage_dataset.py
│   └── build_passages.py
├── retrieval/                 # retrieval modules
│   ├── retrieval.py
│   ├── retrieval_bm25.py
│   ├── retrieval_dense.py
│   ├── retrieval_hybrid.py
│   ├── retrieval_hybrid_passage.py
│   └── retrieval_hybrid_passage_rerank_only.py
├── training/                  # training modules
│   ├── train.py
│   ├── trainer_qa.py
│   ├── arguments.py
│   └── utils_qa.py
├── inference/                 # inference modules
│   ├── inference.py
│   ├── inference_bm25.py
│   └── inference_hybrid_passage_rerank_only.py
├── ensemble/                  # ensemble module
│   └── ensemble_voting.ipynb
├── scripts/                   # run scripts
│   └── create_negative_passage.sh
└── analysis/                  # analysis tools
    ├── compare_predictions.py
    └── analy

💻 Installation

pip install -r requirements.txt

⚙️ Usage

0. Build the passage corpus from Wikipedia documents

python data_preparation/build_passages.py
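
A sliding-window chunker of the kind this step relies on might look like the sketch below. It is not the code in `build_passages.py`: the function name is hypothetical, and reading the `256_128` in `wikipedia_passages_256_128.json` as a 256-token window with a 128-token stride is an assumption.

```python
def chunk_tokens(tokens, window=256, stride=128):
    """Split a token list into overlapping windows.

    Consecutive chunks start `stride` tokens apart, so with window=256 and
    stride=128 each chunk shares half of its tokens with the next one.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if not tokens:
        return []
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the current window already reaches the end
        start += stride
    return chunks


# A hypothetical 600-token document.
chunks = chunk_tokens(list(range(600)))
print(len(chunks), [len(c) for c in chunks])
```

The overlap means an answer that straddles a chunk boundary still appears whole in a neighbouring chunk. With these particular values, reading 128 as the overlap size instead of the step gives the same chunks.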

1. Create the Negative Passage dataset

Basic usage (fixed count)

python data_preparation/create_negative_passage_dataset.py \
    --train_dataset_path ../data/train_dataset \
    --passages_path ../data/wikipedia_passages_256_128.json \
    --output_path ../data/train_dataset_negative_passage \
    --top_k_retrieval 100 \
    --rerank_top_k 5 \
    --alpha 0.7 \
    --use_rerank

Curriculum Learning (Easy → Medium → Hard)

Easy (3 passages):

python data_preparation/create_negative_passage_dataset.py \
    --curriculum_mode easy \
    --output_path ../data/train_dataset_easy

Medium (5 passages):

python data_preparation/create_negative_passage_dataset.py \
    --curriculum_mode medium \
    --output_path ../data/train_dataset_medium

Hard (7 passages):

python data_preparation/create_negative_passage_dataset.py \
    --curriculum_mode hard \
    --output_path ../data/train_dataset_hard
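
To make the three modes concrete, the sketch below builds one training example by mixing an answer-bearing passage with negatives and placing it at a random position. It is an illustration only, not the logic of `create_negative_passage_dataset.py`: the function name, the single-positive layout, and counting the positive passage toward the 3/5/7 total are assumptions.

```python
import random

# Passage counts per curriculum mode, as listed above.
PASSAGES_PER_MODE = {"easy": 3, "medium": 5, "hard": 7}


def build_example(positive, answer, negatives, mode, rng, sep=" "):
    """Mix one answer-bearing passage with negatives.

    Returns the joined context and the answer's character offset in it.
    """
    n_negatives = PASSAGES_PER_MODE[mode] - 1  # assumes the total includes the positive
    if len(negatives) < n_negatives:
        raise ValueError("not enough negative passages")
    passages = list(negatives[:n_negatives])
    # Put the positive passage in a random slot to reduce position bias.
    slot = rng.randrange(len(passages) + 1)
    passages.insert(slot, positive)
    # Character offset of the positive passage inside the joined context.
    offset = sum(len(p) + len(sep) for p in passages[:slot])
    answer_start = offset + positive.index(answer)
    return sep.join(passages), answer_start


rng = random.Random(0)  # fixed seed so the example is reproducible
context, start = build_example(
    "Seoul is the capital of Korea.", "Seoul",
    ["neg a", "neg b", "neg c"], "easy", rng,
)
print(context[start:start + len("Seoul")])
```

Recomputing `answer_start` after shuffling matters: an extractive Reader is trained on character offsets, so a stale offset would silently point at the wrong span.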

2. Train the model

Single-stage training

python -m training.train \
    --model_name_or_path HANTAEK/klue-roberta-large-korquad-v1-qa-finetuned \
    --dataset_name ../data/train_dataset_negative_passage \
    --output_dir ../models/reader_negative_passage \
    --do_train \
    --do_eval \
    --num_train_epochs 3 \
    --per_device_train_batch_size 16 \
    --fp16

Curriculum Learning (sequential training)

# Stage 1: Easy
python -m training.train \
    --model_name_or_path HANTAEK/klue-roberta-large-korquad-v1-qa-finetuned \
    --dataset_name ../data/train_dataset_easy \
    --output_dir ../models/curriculum_stage1_easy \
    --num_train_epochs 2 \
    --per_device_train_batch_size 16

# Stage 2: Medium (starts from the previous stage's model)
python -m training.train \
    --model_name_or_path ../models/curriculum_stage1_easy \
    --dataset_name ../data/train_dataset_medium \
    --output_dir ../models/curriculum_stage2_medium \
    --num_train_epochs 2 \
    --per_device_train_batch_size 16

# Stage 3: Hard
python -m training.train \
    --model_name_or_path ../models/curriculum_stage2_medium \
    --dataset_name ../data/train_dataset_hard \
    --output_dir ../models/curriculum_stage3_hard \
    --num_train_epochs 2 \
    --per_device_train_batch_size 16
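
The three stages above differ only in the dataset, the output directory, and the starting checkpoint, so they can be generated in a loop. The helper below is a hypothetical convenience sketch that only builds the command lines shown above; it is not part of the repository.

```python
import shlex

BASE_MODEL = "HANTAEK/klue-roberta-large-korquad-v1-qa-finetuned"
STAGES = ["easy", "medium", "hard"]


def curriculum_commands(base_model=BASE_MODEL, stages=STAGES):
    """Return one argv list per stage; each stage starts from the previous output."""
    commands = []
    model = base_model
    for i, stage in enumerate(stages, start=1):
        output_dir = f"../models/curriculum_stage{i}_{stage}"
        commands.append([
            "python", "-m", "training.train",
            "--model_name_or_path", model,
            "--dataset_name", f"../data/train_dataset_{stage}",
            "--output_dir", output_dir,
            "--num_train_epochs", "2",
            "--per_device_train_batch_size", "16",
        ])
        model = output_dir  # the next stage continues from this checkpoint
    return commands


for cmd in curriculum_commands():
    print(shlex.join(cmd))
```

To run the stages instead of printing them, each list can be passed to `subprocess.run(cmd, check=True)`, which stops the chain if a stage fails.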

3. Inference

python -m inference.inference_hybrid_passage_rerank_only \
    --model_name_or_path ../models/curriculum_stage3_hard \
    --dataset_name ../data/test_dataset \
    --output_dir ../outputs/predictions
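
Predictions from several Reader models can then be combined with the similarity-based Soft Voting listed under Key Features. The sketch below is one possible reading of that idea, not the contents of `ensemble/ensemble_voting.ipynb`: the scoring rule, the use of `difflib.SequenceMatcher` as the similarity measure, and the sample predictions are assumptions.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, a, b).ratio()


def soft_vote(predictions):
    """Pick the answer with the highest similarity-weighted support.

    `predictions` is a list of (answer_text, confidence) pairs, one per model.
    Each candidate is credited with every model's confidence, scaled by how
    similar that model's answer is to the candidate.
    """
    best_answer, best_score = None, float("-inf")
    for candidate, _ in predictions:
        score = sum(conf * similarity(candidate, other)
                    for other, conf in predictions)
        if score > best_score:
            best_answer, best_score = candidate, score
    return best_answer


# Hypothetical outputs of three models for one question.
predictions = [
    ("King Sejong", 0.6),
    ("King Sejong the Great", 0.5),
    ("1446", 0.7),
]
print(soft_vote(predictions))
```

Here the single most confident model answers `1446`, but the two overlapping answers reinforce each other, so `King Sejong` is chosen. Hard voting would count these as three different strings and could not break the tie.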

🔗 References

📂 Datasets

  • KorQuAD 1.0 - Korean question answering dataset (main training/validation data)
  • KorQuAD 2.0 - Large-scale Korean question answering dataset (used for data augmentation and as external data)
  • KLUE Benchmark - Standard benchmark dataset for Korean natural language understanding

🤖 Models & Libraries

  • KLUE-RoBERTa (Reader) - klue/roberta-large, used as the Reader base model
  • BGE-M3 (Retrieval) - High-performance multilingual embedding model that supports hybrid retrieval
  • Ko-Reranker - Korean-specific Cross-Encoder model for improving the precision of retrieval results
  • Rank-BM25 - Algorithm library for implementing keyword-based Sparse Retrieval
  • FAISS - High-performance library from Facebook AI Research for large-scale dense vector search

📄 Papers & Technical Concepts

  • Reciprocal Rank Fusion (RRF) - Ranking technique that effectively fuses different retrieval results (Sparse & Dense)
  • Curriculum Learning - Strategy that gradually raises the difficulty of the training data to optimize model performance
  • Dense Passage Retrieval (DPR) - Semantic document retrieval framework built on a dual-encoder architecture

About

pro-nlp-mrc-nlp-03 created by GitHub Classroom
