Scikit-LLM Tutorial: Benchmarking Text Classification with TF-IDF, BART, and an LLM

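Environment Setup
Data Preparation
Approach 1: TF-IDF + Logistic Regression
Approach 2: BART Zero-Shot Classification
Approach 3: Scikit-LLM + Groq LLM
Benchmark Results Compared
When to Use Which Method
Strengths of Scikit-LLM
References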

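In text classification, "when does an LLM beat traditional methods?" is not a simple question. This tutorial benchmarks three approaches under identical conditions on a customer-support ticket classification task and uses the results to examine each method's trade-offs.

Approach 1: TF-IDF + logistic regression (classic baseline)
Approach 2: BART-based zero-shot classification (deep learning transformer)
Approach 3: Scikit-LLM + Groq LLM (modern, prompt-based)

A Groq API key is all you need to run everything at no cost.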

Environment Setup

pip install scikit-learn transformers torch scikit-llm

Get a Groq API key at console.groq.com/keys.

Data Preparation

The task is to classify customer-support tickets into five categories.

from sklearn.model_selection import train_test_split

# Sample data (shaped like real customer-support tickets)
data = {
    "text": [
        "I can't access my account",
        "I was charged twice for my subscription",
        "I need a refund for my last order",
        "I'm interested in your enterprise plan",
        "The app keeps crashing on my phone",
        # ... more data
    ],
    "label": ["Account", "Billing", "Refund", "Sales", "Technical"]
}

X_train, X_test, y_train, y_test = train_test_split(
    data["text"], data["label"], test_size=0.3, random_state=42
)
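
The five sample tickets above carry exactly one example per category, so a 70/30 split leaves some categories out of the training set until more data is added. The following is a minimal sketch of a pre-split sanity check; the helper name `underrepresented_classes` and the threshold of two examples per class are illustrative assumptions, not part of the original tutorial.

```python
from collections import Counter


def underrepresented_classes(labels, minimum=2):
    """Return the classes that have fewer than `minimum` examples."""
    counts = Counter(labels)
    return sorted(label for label, n in counts.items() if n < minimum)


# The label list from the sample data: one ticket per category.
labels = ["Account", "Billing", "Refund", "Sales", "Technical"]
sparse = underrepresented_classes(labels)
print(sparse)  # every category is listed until more tickets are added
```

Once every class has enough examples, passing `stratify=data["label"]` to `train_test_split` keeps the class proportions the same in both splits.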

Approach 1: TF-IDF + Logistic Regression

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import time

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("clf", LogisticRegression(max_iter=1000))
])

start = time.time()
pipeline.fit(X_train, y_train)
y_pred_tfidf = pipeline.predict(X_test)
tfidf_latency = time.time() - start

print(f"TF-IDF Latency: {tfidf_latency:.4f}s")
print(classification_report(y_test, y_pred_tfidf))

Characteristics: fast and lightweight, but brittle on wording that did not appear in the training data.
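
The brittleness comes from the fixed vocabulary: words the vectorizer never saw during fitting contribute nothing to the feature vector. A small sketch with two made-up training sentences (chosen here only for illustration) shows a ticket with entirely new wording collapsing to an all-zero vector.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny illustrative corpus; a real run would fit on X_train.
train_texts = [
    "I was charged twice for my subscription",
    "The app keeps crashing on my phone",
]
vectorizer = TfidfVectorizer()
vectorizer.fit(train_texts)

# Familiar wording maps onto known vocabulary entries.
seen = vectorizer.transform(["The app keeps crashing"])
# None of these words were in the training texts.
unseen = vectorizer.transform(["Login fails repeatedly"])

print(seen.nnz)    # number of non-zero features for familiar wording
print(unseen.nnz)  # 0: the classifier receives no signal at all
```

With an all-zero input, logistic regression can only fall back on its learned intercepts, which is why paraphrased tickets tend to be misclassified.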

Approach 2: BART Zero-Shot Classification

from transformers import pipeline as hf_pipeline

zero_shot = hf_pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
candidate_labels = ["Account", "Billing", "Refund", "Sales", "Technical"]

start = time.time()
y_pred_bart = []
for text in X_test:
    result = zero_shot(text, candidate_labels)
    y_pred_bart.append(result["labels"][0])
bart_latency = time.time() - start

print(f"BART Latency: {bart_latency:.4f}s")
print(classification_report(y_test, y_pred_bart))

Characteristics: needs no labeled data, but accuracy drops on domain-specific wording.
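
It helps to see where the predicted label comes from. The zero-shot pipeline turns each candidate label into a hypothesis sentence and asks the NLI model how strongly the ticket entails it. The sketch below reproduces only the final scoring step, under two assumptions: the hypothesis template shown is the pipeline's default, and in single-label mode the entailment logits are normalized across labels with a softmax. The logits themselves are made up for illustration and do not come from a real model run.

```python
import math


def softmax(values):
    """Normalize raw scores into probabilities that sum to 1."""
    shift = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


candidate_labels = ["Account", "Billing", "Refund", "Sales", "Technical"]
template = "This example is {}."  # assumed default hypothesis template
hypotheses = [template.format(label) for label in candidate_labels]

# Hypothetical entailment logits for "I was charged twice for my subscription".
# A real run would obtain one logit per hypothesis from the NLI model.
entailment_logits = [0.2, 2.9, 1.7, -0.8, -1.1]

scores = softmax(entailment_logits)
ranked = sorted(zip(candidate_labels, scores), key=lambda pair: pair[1], reverse=True)
predicted_label = ranked[0][0]
print(hypotheses[1], "->", predicted_label)
```

Because the model judges generic sentences such as "This example is Refund.", labels that are close in everyday language (Billing and Refund in this made-up case) can score similarly, which is one plausible source of the domain-specific errors noted above.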

Approach 3: Scikit-LLM + Groq LLM

from skllm.config import SKLLMConfig
from skllm.models.gpt.classification.zero_shot import ZeroShotGPTClassifier
import getpass

api_key = getpass.getpass("Groq API Key: ")
SKLLMConfig.set_openai_key(api_key)
SKLLMConfig.set_gpt_url("https://api.groq.com/openai/v1/")

llm_clf = ZeroShotGPTClassifier(model="custom_url::llama-3.3-70b-versatile")

start = time.time()
llm_clf.fit(X_train, y_train)
y_pred_llm = llm_clf.predict(X_test)
llm_latency = time.time() - start

print(f"Scikit-LLM Latency: {llm_latency:.4f}s")
print(classification_report(y_test, y_pred_llm))
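
The timing snippets above are not measuring quite the same thing: the TF-IDF and Scikit-LLM timers cover both fit and predict, while the BART timer covers prediction only. A small helper like the one below makes it easy to time each phase separately. The name `timed` is a hypothetical helper, and `sorted` stands in as the workload so the sketch runs without any model installed.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload; replace with a model's fit or predict method.
result, seconds = timed(sorted, [3, 1, 2])
print(result, f"{seconds:.6f}s")
```

Applied to the tutorial's objects, this would look like `_, fit_s = timed(pipeline.fit, X_train, y_train)` followed by `y_pred, predict_s = timed(pipeline.predict, X_test)`, giving comparable prediction latencies across all three approaches.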

Benchmark Results Compared

Metric	TF-IDF + LR	BART zero-shot	Scikit-LLM (Groq)
F1 score	~0.70	~0.65	0.86–0.87
Latency	Very low	High	2.6 s
Labeled data required	Yes	No	No
Cost	Free	GPU cost	API cost
Implementation difficulty	Low	Medium	Low

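Scikit-LLM (Groq) led by a wide margin on accuracy, with an F1 of 0.86–0.87 against roughly 0.70 for TF-IDF + LR and 0.65 for BART zero-shot. It was even faster than BART zero-shot. It reached this accuracy without large-scale training data because LLaMA 3.3 70B already knows the concepts of the customer-support domain.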
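
The table reports a single "F1 score" per method without saying how it is averaged across the five categories. Assuming it is the macro average (an assumption; `classification_report` prints both macro and weighted averages), the number can be reproduced by hand as sketched below. The labels in the example are invented purely to exercise the function.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    classes = sorted(set(y_true))
    f1_scores = []
    for cls in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Invented example: one Billing ticket is mistaken for Refund.
y_true = ["Billing", "Billing", "Refund", "Technical"]
y_pred = ["Billing", "Refund", "Refund", "Technical"]
score = macro_f1(y_true, y_pred)
print(round(score, 3))  # 0.778
```

Macro averaging weights every category equally, so a method that fails on one rare category is penalized as much as one that fails on a common category.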

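When to Use Which Method

Situation	Recommended method
Plenty of labeled data and cost matters	TF-IDF + LR
Fast prototyping without labels	Scikit-LLM
On-premises execution with no API cost	BART zero-shot
Accuracy is the top priority	Scikit-LLM (large LLM)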

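Strengths of Scikit-LLM

The core value of Scikit-LLM is its standardized interface. Swapping LogisticRegression for ZeroShotGPTClassifier is the only code change needed to move a classic ML pipeline to an LLM-based one. Experiments iterate quickly, and the library is fully compatible with existing sklearn pipelines.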
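
What the shared interface buys in practice is that evaluation code never needs to know which kind of model it is handed. The sketch below uses two hypothetical stand-in classifiers (`MajorityClassifier` and `KeywordClassifier`, both invented here so the example runs offline) to show one evaluation function serving any object with `fit` and `predict`.

```python
class MajorityClassifier:
    """Hypothetical stand-in: always predicts the most frequent training label."""

    def fit(self, X, y):
        labels = list(y)
        self.label_ = max(set(labels), key=labels.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class KeywordClassifier:
    """Hypothetical stand-in: predicts 'Billing' when 'charged' appears."""

    def fit(self, X, y):
        return self  # nothing to learn; the rule is hard-coded

    def predict(self, X):
        return ["Billing" if "charged" in text.lower() else "Technical" for text in X]


def accuracy_of(estimator, X_train, y_train, X_test, y_test):
    """Fit any fit/predict estimator and return its accuracy on the test set."""
    estimator.fit(X_train, y_train)
    predictions = estimator.predict(X_test)
    return sum(p == t for p, t in zip(predictions, y_test)) / len(y_test)


X_train = ["The app keeps crashing", "The screen freezes", "I was charged twice"]
y_train = ["Technical", "Technical", "Billing"]
X_test = ["I was charged again", "The app crashed"]
y_test = ["Billing", "Technical"]

results = {
    type(model).__name__: accuracy_of(model, X_train, y_train, X_test, y_test)
    for model in (MajorityClassifier(), KeywordClassifier())
}
print(results)
```

The same `accuracy_of` call should accept the tutorial's `pipeline` or `llm_clf` unchanged. One caveat, as far as the examples above indicate: the LLM classifier is fitted on raw text, so it would stand in for the whole TF-IDF pipeline rather than only its final `clf` step.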

References

Scikit-LLM vs. Traditional Text Classifiers: When Should You Use an LLM? — Machine Learning Mastery (2026-06-02)

scikit-llm: overview of Scikit-LLM
llm-explainability: LLM explainability (interpreting the reasoning behind a classification)

AI Sparkup