How Far Can Classical NLP Go? From Bag-of-Words to Stacking on Spooky Author Identification

Contents

Dataset and Evaluation Setup
1. Word-only Vowpal Wabbit baseline
2. Rich VW: adding style-aware features
3. TF-IDF word and character features
NB-SVM-style Logistic Regression
4. Stacking with out-of-fold predictions
5. Final full-data refit and Kaggle submission
6. Error analysis
7. A representation survey
Results at a glance
What actually helped
Limitations and next steps
Conclusion
Data source and license
Links

Authorship attribution is a good way to test NLP models because it depends not only on what a sentence says, but also on how it is written. Kaggle’s Spooky Author Identification competition is a compact version of this challenge: given a single sentence from gothic or horror fiction, the model has to predict whether it was written by Edgar Allan Poe (EAP), Mary Wollstonecraft Shelley (MWS), or H. P. Lovecraft (HPL).

At first glance, this looks like a typical three-class text classification problem, but it is harder than that. The authors all write about similar themes (fear, mystery, death, atmosphere, and the supernatural), so simple keywords are not enough to tell them apart. Instead, the important clues are often stylistic: function words, punctuation, character patterns, short phrases, sentence rhythm, and the way each author builds a sentence.

This made the project a good way to explore a specific question:

How far can classical NLP go when we choose representations carefully and evaluate them honestly?

I approached the task by building a sequence of increasingly capable classical models:

a fast Vowpal Wabbit word baseline,
a richer VW model with punctuation and character n-grams,
a tuned TF-IDF ensemble,
a stacked sparse-text ensemble using out-of-fold predictions,
a small representation survey comparing sparse features, BM25, Word2Vec, and FastText.

The goal was not only to improve the score, but also to understand which representations helped, which metrics improved, and which evaluation setup each result came from.

This article focuses on the project’s methodology, results, and interpretation. I’ll go over the main implementation choices and share the key code snippets, but I won’t include every line from the notebook. The complete executed notebook, including the full implementation and outputs, is available in the GitHub repository linked at the end.

Dataset and Evaluation Setup

The dataset contains 19,579 labeled training sentences and 8,392 unlabeled test sentences. The class distribution is mildly imbalanced:

Figure 1. Class distribution in the training set. The dataset is mildly imbalanced, with EAP making up the largest share of examples and HPL the smallest.

I encoded the labels as 1-based integers because Vowpal Wabbit’s One-Against-All multiclass mode expects labels starting at 1.

import pandas as pd

# DATA_DIR is assumed to be a pathlib.Path pointing at the Kaggle data folder.
train_texts = pd.read_csv(DATA_DIR / "train.csv", index_col="id")
test_texts = pd.read_csv(DATA_DIR / "test.csv", index_col="id")

AUTHOR_CODE = {"EAP": 1, "MWS": 2, "HPL": 3}
train_texts["author_code"] = train_texts["author"].map(AUTHOR_CODE)

print(f"Train: {len(train_texts)} sentences   Test: {len(test_texts)} sentences")
print(train_texts["author"].value_counts(normalize=True).round(3))

To compare models locally, I used a single stratified 70/30 train-validation split with a fixed random seed. This kept the class proportions stable and ensured that every model was evaluated on the same held-out examples.

from sklearn.model_selection import train_test_split

train_texts_part, valid_texts = train_test_split(
    train_texts,
    test_size=0.3,
    random_state=17,
    stratify=train_texts["author_code"]
)

y_part = train_texts_part["author_code"].values
y_valid = valid_texts["author_code"].values

I focused on three main metrics:

Accuracy: straightforward to understand, but it only measures the final top-class decision.
Macro-F1: useful for checking whether performance is balanced across the three authors.
Multiclass log loss: the official Kaggle metric and the most important metric for this project, because it evaluates the quality of the predicted probabilities, not just the predicted class.

Log loss rewards confident correct predictions and heavily penalizes confident wrong predictions. This matters in a competition where the submission is a probability distribution over EAP, HPL, and MWS.
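
To make that penalty concrete, here is a small sketch that computes multiclass log loss by hand. The probabilities are made-up illustrations, not model outputs, and the helper name `multiclass_log_loss` is hypothetical.

```python
import math


def multiclass_log_loss(y_true, proba, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: list of 0-based column indices of the true class.
    proba:  list of rows, each a probability distribution over the classes.
    """
    total = 0.0
    for label, row in zip(y_true, proba):
        p = min(max(row[label], eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(y_true)


# Columns follow the submission order EAP, HPL, MWS; the true class is EAP (index 0).
confident_right = [[0.97, 0.02, 0.01]]
hedged_right = [[0.60, 0.25, 0.15]]
confident_wrong = [[0.02, 0.97, 0.01]]

for name, rows in [
    ("confident right", confident_right),
    ("hedged right", hedged_right),
    ("confident wrong", confident_wrong),
]:
    print(f"{name:16s} {multiclass_log_loss([0], rows):.3f}")
```

The three cases come out at roughly 0.030, 0.511, and 3.912. A single confident miss costs about as much as a hundred confident hits, which is why probability quality matters more here than the top-class decision alone.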

1. Word-only Vowpal Wabbit baseline

I started with Vowpal Wabbit because it is fast, handles sparse data well, and is well-suited to linear text models. VW trains online linear models, hashes features into a fixed feature space, and handles multiclass classification through One-Against-All.

For the first baseline, I used only lowercased words of three or more characters. VW’s ngram=2 option then adds word bigrams on top of these unigrams.

import re


def to_vw_words(df, is_train=True):
    """VW line: '<label> |text <words of 3+ chars>'."""
    lines = []

    for i in range(len(df)):
        label = df["author_code"].iloc[i] if is_train else 1
        text = df["text"].iloc[i].lower().replace("|", "").replace(":", "")
        words = " ".join(re.findall(r"\w{3,}", text))
        lines.append(f"{label} |text {words}\n")

    return lines

One implementation detail that mattered was how VW handles multiple passes. When VW reads a file directly, options such as passes and cache behave as expected. When feeding examples manually through the Python API, I had to loop over the file myself.

from vowpalwabbit import Workspace  # top-level class in vowpalwabbit 9+

N_PASSES = 10

vw = Workspace(
    oaa=3,
    loss_function="logistic",
    ngram=2,
    b=28,
    quiet=True,
    final_regressor=f"{OUTPUT_DIR}/spooky_words.vw"
)

# Replay the training file once per pass, feeding VW one example at a time.
for _ in range(N_PASSES):
    with open(f"{OUTPUT_DIR}/train_words.vw") as f:
        for line in f:
            vw.learn(line)

vw.finish()

On the 70/30 holdout split, the word-only VW baseline reached:

**Holdout performance of the word-only Vowpal Wabbit baseline.** Even with simple word and bigram features, the fast linear VW model provides a strong starting point.

At 0.8332 holdout accuracy, this was already a strong result, and it set a useful bar: any added representation or ensemble layer needed to clear it.

2. Rich VW: adding style-aware features

Authorship attribution involves more than classifying topics. A model also needs access to cues that reflect writing style. For the richer VW model, I separated the input into three namespaces:

|w for words, including short function words,
|p for punctuation,
|c for character n-grams.

def char_ngrams(text, ns=(2, 3, 4)):
    """Boundary-aware character n-grams; whitespace/edges become '_'."""
    t = "_" + re.sub(r"\s+", "_", text.strip()) + "_"
    return [t[i:i + n] for n in ns for i in range(len(t) - n + 1)]


def to_vw_rich(df, is_train=True, char_ns=(2, 3, 4)):
    """Three namespaces: |w words, |p punctuation, |c character n-grams."""
    lines = []
    texts = df["text"].values
    labels = df["author_code"].values if is_train else None

    for i, text in enumerate(texts):
        safe = str(text).lower().replace("|", " ").replace(":", " ")

        label = labels[i] if is_train else 1
        words = " ".join(re.findall(r"\w+", safe))
        punct = " ".join(re.findall(r"[^\w\s]", safe))
        chars = " ".join(char_ngrams(safe, ns=char_ns))

        lines.append(f"{label} |w {words} |p {punct} |c {chars}\n")

    return lines
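
To see what the model actually receives, the sketch below formats one made-up sentence the same way as `to_vw_rich`. The helper `rich_vw_line` and the sample sentence are illustrative only, and the character n-grams are limited to bigrams to keep the output short (the article uses lengths 2 to 4).

```python
import re


def char_ngrams(text, ns=(2, 3, 4)):
    """Boundary-aware character n-grams; whitespace and edges become '_'."""
    t = "_" + re.sub(r"\s+", "_", text.strip()) + "_"
    return [t[i:i + n] for n in ns for i in range(len(t) - n + 1)]


def rich_vw_line(text, label=1, char_ns=(2,)):
    """Format one sentence as a three-namespace VW example (illustrative helper)."""
    # '|' and ':' are reserved in the VW text format, so replace them first.
    safe = text.lower().replace("|", " ").replace(":", " ")
    words = " ".join(re.findall(r"\w+", safe))
    punct = " ".join(re.findall(r"[^\w\s]", safe))
    chars = " ".join(char_ngrams(safe, ns=char_ns))
    return f"{label} |w {words} |p {punct} |c {chars}"


line = rich_vw_line("No; it was I!", label=1)
print(line)
# 1 |w no it was i |p ; ! |c _n no o; ;_ _i it t_ _w wa as s_ _i i! !_
```

Short function words such as "it" and "i" survive in the |w namespace, the semicolon and exclamation mark get their own |p features, and the |c bigrams keep word-boundary information through the underscores.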

This model used more passes (15 instead of 10) and twice the hash space (29 bits instead of 28) of the word-only baseline.

N_PASSES = 15

vw = Workspace(
    oaa=3,
    loss_function="logistic",
    ngram=2,
    b=29,
    quiet=True,
    final_regressor=f"{OUTPUT_DIR}/spooky_rich.vw"
)

for _ in range(N_PASSES):
    with open(f"{OUTPUT_DIR}/train_rich.vw") as f:
        for line in f:
            vw.learn(line)

vw.finish()

This improved the holdout result:

**Effect of adding style-aware VW features on holdout performance.** Adding punctuation and character n-grams improves both accuracy and Macro-F1 over the word-only VW baseline.

The gain is meaningful: holdout accuracy rose from 0.8332 to 0.8553, which suggests that punctuation and character-level structure helped the model capture style beyond plain word choice.

3. TF-IDF word and character features

Next, I wanted to see whether another classical sparse-text pipeline could match or exceed the VW results. I built a TF-IDF feature matrix using two views of the text:

word-level unigrams and bigrams,
character-level 2-to-5-grams inside word boundaries.

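import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer

CLASSES = np.array([1, 2, 3])  # 1=EAP, 2=MWS, 3=HPL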

def build_tfidf(fit_texts):
    word_vectorizer = TfidfVectorizer(
        sublinear_tf=True,
        ngram_range=(1, 2),
        min_df=2
    ).fit(fit_texts)

    char_vectorizer = TfidfVectorizer(
        sublinear_tf=True,
        analyzer="char_wb",
        ngram_range=(2, 5),
        min_df=2
    ).fit(fit_texts)

    return word_vectorizer, char_vectorizer


def tfidf_features(word_vectorizer, char_vectorizer, texts):
    X_word = word_vectorizer.transform(texts)
    X_char = char_vectorizer.transform(texts)
    return sp.hstack([X_word, X_char]).tocsr()

The word features capture vocabulary and phrase-level evidence. The character features capture spelling fragments, suffixes, prefixes, punctuation-adjacent patterns, and other small details that are useful for style classification.

I trained three complementary models on this representation:

Logistic Regression,
NB-SVM-style Logistic Regression,
Complement Naive Bayes.

For Logistic Regression and the NB-SVM-style model, I tuned the C values with inner cross-validation on the training split only, leaving the holdout set untouched.

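from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold


def tune_lr_C(X, y, C_grid=(0.1, 0.3, 1, 3, 10, 30), n_splits=5):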
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    rows = []

    for C in C_grid:
        oof = np.zeros((X.shape[0], len(CLASSES)))

        for tr_idx, va_idx in cv.split(X, y):
            clf = LogisticRegression(C=C, max_iter=3000)
            clf.fit(X[tr_idx], y[tr_idx])
            oof[va_idx] = align_proba(clf, X[va_idx])

        rows.append({"C": C, "log_loss": log_loss(y, oof, labels=CLASSES)})

    return pd.DataFrame(rows)
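
The tuning loop relies on a helper, `align_proba`, whose definition lives in the notebook and is not shown here. The version below is a guess at its behavior, not the notebook's code: it reorders the classifier's probability columns to match the fixed class order and fills a zero column for any class the model never saw. It is written with plain lists so it runs without dependencies, and `_StubClassifier` is a hypothetical stand-in for a fitted scikit-learn model.

```python
def align_proba(clf, X, classes=(1, 2, 3)):
    """Return predict_proba columns reordered to `classes` (assumed behavior).

    Classes missing from clf.classes_ get a zero column.
    """
    proba = clf.predict_proba(X)
    column_of = {label: j for j, label in enumerate(clf.classes_)}
    return [
        [float(row[column_of[c]]) if c in column_of else 0.0 for c in classes]
        for row in proba
    ]


class _StubClassifier:
    """Stand-in for a fitted model that only saw classes 3 and 1, in that order."""

    classes_ = [3, 1]

    def predict_proba(self, X):
        return [[0.7, 0.3] for _ in X]


aligned = align_proba(_StubClassifier(), X=["a sentence", "another one"])
print(aligned)
# [[0.3, 0.0, 0.7], [0.3, 0.0, 0.7]]
```

With stratified folds every class should appear in every training fold, so the zero-fill branch is mostly a safeguard. Wrapping the return value in `np.asarray` would give the array form the surrounding code assigns into `oof`.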

The best inner-CV tuning results were:

**Inner cross-validation results for tuning the TF-IDF linear models.** NB-SVM-style Logistic Regression achieved a lower inner-CV log loss, suggesting a stronger tuned linear component.

The final 3-model probability average reached:

**Holdout performance of the tuned TF-IDF 3-model average.** Averaging the model probabilities produced strong accuracy and a competitive log loss on the 70/30 holdout split.

The accuracy gain over rich VW was modest, but the log loss was strong. Since Kaggle evaluates probability distributions, this was an important improvement.

NB-SVM-style Logistic Regression

The NB-SVM-style model gets its own section because it is a simple yet effective classical text-classification trick.

The idea is to compute a per-feature log-count ratio: how much more often a feature appears in one class than in the others. Each feature is then multiplied by this ratio before fitting a linear classifier.

def nbsvm_proba(X_train, y_train, X_test, C=10):
    probas = []

    for cls in CLASSES:
        y_binary = (y_train == cls).astype(int)

        p = X_train[y_binary == 1].sum(axis=0) + 1
        q = X_train[y_binary == 0].sum(axis=0) + 1

        r = np.log((p / p.sum()) / (q / q.sum()))
        r = np.asarray(r).ravel()

        clf = LogisticRegression(C=C, max_iter=3000)
        clf.fit(X_train.multiply(r), y_binary)

        probas.append(clf.predict_proba(X_test.multiply(r))[:, 1])

    proba = np.vstack(probas).T
    proba = np.clip(proba, 1e-15, 1 - 1e-15)
    return proba / proba.sum(axis=1, keepdims=True)

Despite the name, my implementation is not a pure SVM. It uses Logistic Regression trained on Naive-Bayes-weighted sparse features. The benefit is that features strongly associated with a specific author are amplified before the linear model is trained.
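
A toy calculation may help show what the log-count ratio does to individual features. The counts below are invented for illustration, and `log_count_ratio` is a hypothetical helper that mirrors the `p`, `q`, and `r` lines of `nbsvm_proba` on plain lists. In the article's code the sums are taken over TF-IDF values, not raw counts.

```python
import math


def log_count_ratio(pos_counts, neg_counts, alpha=1.0):
    """Per-feature r = log((p / sum(p)) / (q / sum(q))) with add-alpha smoothing."""
    p = [c + alpha for c in pos_counts]
    q = [c + alpha for c in neg_counts]
    p_total, q_total = sum(p), sum(q)
    return [math.log((pi / p_total) / (qi / q_total)) for pi, qi in zip(p, q)]


features = ["upon", "the", "cthulhu"]
eap_counts = [30, 100, 0]    # made-up feature totals in EAP sentences
rest_counts = [5, 200, 12]   # made-up feature totals in MWS + HPL sentences

ratios = log_count_ratio(eap_counts, rest_counts)
for name, r in zip(features, ratios):
    print(f"{name:8s} r = {r:+.2f}")
```

In this toy setup "upon" gets a clearly positive weight for the EAP-vs-rest classifier, "the" stays near zero, and "cthulhu" is pushed strongly negative. Multiplying each column by its ratio rescales the feature before Logistic Regression sees it.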

4. Stacking with out-of-fold predictions

After the TF-IDF ensemble, I had several useful base models. A flat average gives each model equal weight, but there is no reason to assume every model is equally reliable for every class. Stacking lets a second-level model learn how to combine them.

The main leakage risk is training the meta-learner on predictions from base models that have already seen the same examples. To avoid that, I used out-of-fold predictions:

For training examples, each base model predicts a fold only after being trained on the remaining folds, so it never scores a sentence it has already seen.
For holdout or test examples, predictions are averaged across fold-trained versions of each base model.
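
The out-of-fold property described above can be checked at the index level without training anything. This sketch uses a simple round-robin split as a stand-in (the article uses a shuffled StratifiedKFold), and `kfold_indices` is a hypothetical helper.

```python
def kfold_indices(n_samples, n_folds):
    """Round-robin folds; a simplification of a shuffled, stratified split."""
    folds = [[] for _ in range(n_folds)]
    for i in range(n_samples):
        folds[i % n_folds].append(i)
    return folds


n_samples, n_folds = 10, 5
folds = kfold_indices(n_samples, n_folds)

predicted_by = {}  # sample index -> fold in which it was held out and scored
for k, valid_idx in enumerate(folds):
    train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
    # A base model fitted on train_idx would score valid_idx at this point.
    assert not set(train_idx) & set(valid_idx)  # no row is both trained on and scored
    for i in valid_idx:
        predicted_by[i] = k

print(predicted_by)
```

Every training row ends up with exactly one prediction, produced by a model that never saw it, which is what makes the stacked features safe to train the meta-learner on.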

The base models were:

BASE_MODELS = ["lr", "nbsvm", "cnb", "mnb", "sgd"]

BASE_PARAM_GRIDS = {
    "lr": {"C": [1, 3, 10, 30]},
    "nbsvm": {"C": [1, 3, 10, 30]},
    "cnb": {"alpha": [0.1, 0.3, 0.5, 1.0]},
    "mnb": {"alpha": [0.1, 0.3, 0.5, 1.0]},
    "sgd": {"alpha": [1e-6, 3e-6, 1e-5, 3e-5]},
}

The stacking feature builder creates a matrix with one block of probability columns per base model. With five base models and three authors, the meta-learner receives 15 probability features per example.

def build_stack_features(X_train, y_train, X_test, best_params_by_model,
                         n_folds=5, seed=17):
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)

    n_classes = len(CLASSES)
    n_models = len(BASE_MODELS)

    oof_stack = np.zeros((X_train.shape[0], n_classes * n_models))
    test_stack = np.zeros((X_test.shape[0], n_classes * n_models))

    for j, kind in enumerate(BASE_MODELS):
        start = j * n_classes
        end = start + n_classes
        params = best_params_by_model[kind]

        for tr_idx, va_idx in skf.split(X_train, y_train):
            oof_stack[va_idx, start:end] = base_proba(
                kind,
                X_train[tr_idx],
                y_train[tr_idx],
                X_train[va_idx],
                params
            )

            test_stack[:, start:end] += base_proba(
                kind,
                X_train[tr_idx],
                y_train[tr_idx],
                X_test,
                params
            ) / n_folds

    return oof_stack, test_stack

I tuned the Logistic Regression meta-learner using cross-validation on the stacked probability features.

def tune_meta_C(oof_stack, y, C_grid=(0.03, 0.1, 0.3, 1, 3, 10, 30)):
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)

    for C in C_grid:
        oof_meta = np.zeros((oof_stack.shape[0], len(CLASSES)))

        for tr_idx, va_idx in skf.split(oof_stack, y):
            meta = LogisticRegression(C=C, max_iter=3000)
            meta.fit(oof_stack[tr_idx], y[tr_idx])
            oof_meta[va_idx] = align_proba(meta, oof_stack[va_idx])

        print(C, log_loss(y, oof_meta, labels=CLASSES))

On the 70/30 holdout split, the best base-model settings were:

**Best base-model hyperparameters used in the stacked ensemble on the 70/30 holdout split.** These tuned base models produced the probability features used by the Logistic Regression meta-learner.

The best meta-learner setting was C=3.

The stacked model reached:

**Final holdout performance of the tuned stacked ensemble.** The ensemble substantially improved probability quality, achieving the lowest holdout log loss among the classical pipelines.

This was the strongest holdout result in the project: 0.8687 accuracy and 0.3504 log loss. The biggest improvement was not raw accuracy; it was log loss. That means the ensemble improved the probability estimates, which is exactly what the Kaggle metric rewards.

5. Final full-data refit and Kaggle submission

For the final submission, I refit the TF-IDF representation on the full labeled training data, retuned the base models, rebuilt the stacking features, trained the final meta-learner, and generated predictions for the test set.

On the full training data, the best base-model parameters were:

**Best full-data base-model hyperparameters for the final stacked submission.** These settings were selected after refitting the pipeline on the full labeled training set.

The best final meta-learner setting was C=30.

The code also explicitly mapped my internal class order [1, 2, 3] = [EAP, MWS, HPL] into Kaggle’s required submission column order: EAP, HPL, MWS.

meta_final = LogisticRegression(C=best_full_meta_C, max_iter=3000)
meta_final.fit(oof_full, y_full)

proba_test = align_proba(meta_final, test_stack)

proba_test = np.clip(proba_test, 1e-15, 1 - 1e-15)
proba_test = proba_test / proba_test.sum(axis=1, keepdims=True)

submission = pd.DataFrame({
    "id": test_texts.index,
    "EAP": proba_test[:, 0],   # class 1
    "HPL": proba_test[:, 2],   # class 3
    "MWS": proba_test[:, 1],   # class 2
})

submission.to_csv(OUTPUT_DIR / "spooky_submission.csv", index=False)

The full-data level-2 OOF estimate for the meta-learner was:

**Full-data level-2 out-of-fold estimate for the final meta-learner.** This estimate is useful as a sanity check, but it is not directly comparable to the earlier 70/30 holdout results.

Because this number comes from a different evaluation setup, it should not be compared directly with the earlier 70/30 holdout rows. It evaluates the meta-learner using out-of-fold stacking features over the full training data, not a fully nested cross-validation of the entire pipeline.

On Kaggle, the final stacked model scored:

**Kaggle leaderboard performance of the final tuned stacked model.** The private score was close to the full-data level-2 OOF estimate, which suggests that the validation setup was reasonably reliable.

The private leaderboard score landed close to the full-data level-2 OOF estimate, which is encouraging. I would still treat that as validation evidence, not proof that the setup is fully unbiased.

6. Error analysis

Aggregate metrics are useful, but they can hide where the model fails. I used the holdout predictions from the stacked model to inspect the confusion matrix, per-author recall, and high-confidence mistakes.

from sklearn.metrics import confusion_matrix

AUTHORS = {1: "EAP", 2: "MWS", 3: "HPL"}

cm = confusion_matrix(y_valid, valid_predictions, labels=CLASSES)

cm_df = pd.DataFrame(
    cm,
    index=[f"true_{AUTHORS[c]}" for c in CLASSES],
    columns=[f"pred_{AUTHORS[c]}" for c in CLASSES]
)

display(cm_df)

The confusion matrix was:

**Confusion matrix for the tuned stacked model on the 70/30 holdout split.** Most predictions fall on the diagonal, while the largest off-diagonal errors come from confusion between MWS and EAP.

Per-author recall was relatively balanced:

**Per-author recall for the tuned stacked model on the 70/30 holdout split.** Recall is fairly balanced across all three authors, suggesting that the model does not rely heavily on a single majority class.

The most common confusions were:

**Most common misclassification pairs for the tuned stacked model.** The largest errors occur between MWS and EAP, followed by HPL and EAP, showing that the remaining mistakes are mostly between stylistically overlapping authors.

The main point is that the model did not simply collapse into predicting the largest class. The recall scores were close across all three authors, and the errors were bidirectional. MWS and EAP were often confused with each other, while HPL and EAP also overlapped on some short or stylistically neutral sentences.

I also inspected high-confidence mistakes. One example was the sentence:

“I walked the cellar from end to end.”

The true author was EAP, but the model assigned HPL a probability above 0.97. This is a useful reminder that single-sentence authorship can be underdetermined. Some sentences simply do not carry enough distinctive stylistic evidence for a sparse linear model to separate three similar gothic authors reliably.
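
One simple way to surface such cases is to rank the misclassified rows by the probability given to the wrong pick. The sketch below is an assumption about how that could be done, not the notebook's code; the probabilities are invented and `high_confidence_mistakes` is a hypothetical helper.

```python
def high_confidence_mistakes(y_true, proba, classes=(1, 2, 3), top=5):
    """Return misclassified rows, most confident wrong prediction first."""
    mistakes = []
    for i, (label, row) in enumerate(zip(y_true, proba)):
        best = max(range(len(row)), key=lambda j: row[j])
        if classes[best] != label:
            # (confidence, row index, true label, predicted label)
            mistakes.append((row[best], i, label, classes[best]))
    mistakes.sort(reverse=True)
    return mistakes[:top]


# Made-up probabilities in the internal class order [EAP, MWS, HPL].
y_true = [1, 2, 1, 3]
proba = [
    [0.02, 0.01, 0.97],  # true EAP, confidently predicted HPL
    [0.10, 0.80, 0.10],  # correct
    [0.40, 0.45, 0.15],  # true EAP, narrowly predicted MWS
    [0.20, 0.10, 0.70],  # correct
]

worst = high_confidence_mistakes(y_true, proba)
for confidence, row_id, true_label, predicted in worst:
    print(row_id, true_label, predicted, confidence)
```

A row like the first one is expensive under log loss: if the wrong class receives more than 0.97, the true class can hold at most 0.03, so that single sentence contributes at least -ln(0.03), about 3.5, to the sum.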

7. A representation survey

To put the main pipeline in context, I also tested several foundational representations on the same holdout split.

For Bag-of-Words, I used word counts with unigrams and bigrams:

from sklearn.feature_extraction.text import CountVectorizer

bow = CountVectorizer(
    ngram_range=(1, 2),
    min_df=2
)

X_bow_tr = bow.fit_transform(train_texts_part["text"])
X_bow_va = bow.transform(valid_texts["text"])

bow_lr = LogisticRegression(C=10, max_iter=3000)
bow_lr.fit(X_bow_tr, y_part)

For BM25, I treated retrieval as a nearest-neighbor classifier. This is not BM25’s natural use case, but it was useful as a point of comparison.

K = 15
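# Dense BM25 scores for one batch of query rows (start:end) against every row of bm25_docs.
scores = (query_terms[start:end] @ bm25_docs.T).toarray()
# Indices of the K highest-scoring documents per query (unordered within the top K).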
topk = np.argpartition(-scores, kth=K - 1, axis=1)[:, :K]
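
The fragment above stops at neighbor selection, so here is a self-contained sketch of the whole idea in plain Python. It is an illustration under assumptions, not the notebook's vectorized implementation: the `k1` and `b` values are common defaults, the majority vote is one of several possible ways to turn neighbors into a prediction, and the toy sentences are invented.

```python
import math
from collections import Counter


def bm25_knn_predict(train_docs, train_labels, query, k=3, k1=1.5, b=0.75):
    """Score a tokenized query against each training doc with BM25,
    then let the k best-scoring docs vote on the label."""
    n_docs = len(train_docs)
    avg_len = sum(len(d) for d in train_docs) / n_docs
    doc_freq = Counter(term for d in train_docs for term in set(d))

    scores = []
    for doc in train_docs:
        tf = Counter(doc)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)

    top = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)[:k]
    votes = Counter(train_labels[i] for i in top)
    return votes.most_common(1)[0][0]


train_docs = [
    ["the", "raven", "tapped", "upon", "my", "door"],
    ["upon", "the", "bust", "of", "pallas"],
    ["the", "creature", "fled", "across", "the", "ice"],
    ["my", "creature", "wept", "in", "misery"],
    ["the", "nameless", "city", "beneath", "the", "sea"],
    ["cyclopean", "city", "of", "nameless", "horror"],
]
train_labels = ["EAP", "EAP", "MWS", "MWS", "HPL", "HPL"]

prediction = bm25_knn_predict(
    train_docs, train_labels, ["nameless", "horror", "beneath", "the", "city"]
)
print(prediction)  # HPL
```

Converting neighbor votes into probabilities (for example, vote shares with smoothing) would be needed before a log loss could be computed; that step is left out here.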

For Word2Vec and FastText, I trained embeddings on the training split, then represented each sentence as an IDF-weighted average of its word vectors.

def document_vectors(model, tokenized_docs):
    vectors = np.zeros((len(tokenized_docs), model.vector_size), dtype=np.float32)

    for i, tokens in enumerate(tokenized_docs):
        doc_vecs, doc_weights = [], []

        for token in tokens:
            try:
                doc_vecs.append(model.wv[token])
                doc_weights.append(idf_weight.get(token, 1.0))
            except KeyError:
                continue

        if doc_vecs:
            vectors[i] = np.average(doc_vecs, axis=0, weights=doc_weights)

    return vectors

The results were:

**Representation survey on the 70/30 holdout split.** Sparse count-based features performed better than BM25 retrieval and simple averaged Word2Vec/FastText embeddings on this short-text authorship task.

Sparse count-based features were clearly stronger in this setup than simple averaged embeddings. That does not mean Word2Vec or FastText are generally weak. It means that for this short-text authorship task, averaging word vectors blurred many of the stylistic details that sparse word, character, and punctuation features preserved.

Results at a glance

All holdout rows use the same stratified 70/30 split, so they are directly comparable.

**Summary of the main model results across validation settings.** The holdout rows are directly comparable, while the full-data level-2 OOF estimate is included as a separate sanity check for the final stacked model.

Kaggle submission:

**Kaggle leaderboard score for the final tuned stacked model.** The final submission achieved a private log loss of 0.30414 and a public log loss of 0.33621.

The level-2 OOF estimate is not directly comparable to the holdout rows because it uses a different evaluation setup.

What actually helped

Most of the useful improvements came from better representations and cleaner validation, not from adding complexity for its own sake.

Sparse word and character features carried the strongest signal.
The task is stylistic, and sparse n-gram features preserved details that pooled dense vectors tended to smooth away.

Punctuation and character n-grams improved authorship modeling.
Adding style-aware features increased the VW holdout accuracy from 0.8332 to 0.8553.

TF-IDF improved probability quality.
The tuned TF-IDF ensemble did not dramatically improve accuracy, but it produced a strong log loss result, which is what the competition optimizes.

Stacking helped most with log loss.
The stacked model improved holdout log loss from 0.3843 to 0.3504. This suggests that the meta-learner found a better way to combine probability estimates than a flat average.

Evaluation separation matters.
I kept three result types separate: the 70/30 holdout, the full-data level-2 OOF estimate, and the Kaggle leaderboard scores. They answer different questions, so mixing them would make the results look more certain than they really are.

Limitations and next steps

There are several ways I would extend this project.

First, the stacking pipeline was evaluated with a single holdout split plus a full-data level-2 OOF estimate. A fully nested cross-validation design would provide a more conservative estimate of the whole modeling and tuning process.

Second, I used log loss as the main probability-quality metric, but I did not include explicit calibration diagnostics such as reliability diagrams or expected calibration error. Since the final objective is probability quality, calibration analysis would be a natural next step.

Third, I did not compare against a transformer baseline such as DistilBERT or BERT. A fine-tuned transformer would be the obvious next benchmark, especially to test how much contextual representation improves over sparse classical features on short literary sentences.

Fourth, the hyperparameter search was intentionally limited. A broader search over TF-IDF ranges, VW settings, smoothing values, regularization strengths, and stacking design choices could improve the final score.

Finally, the dataset is small and domain-specific. These results support conclusions about short-text authorship attribution in this setting, not a universal ranking of NLP methods.
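
For the calibration check mentioned among the next steps, expected calibration error is straightforward to compute once the top-class probability and correctness of each prediction are known. This is a minimal sketch of the usual equal-width-bin definition with invented inputs; it was not part of the project, and `expected_calibration_error` is a hypothetical helper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Top-label ECE: weighted mean of |accuracy - confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, hit))

    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


# Made-up top-class probabilities and whether the top class was right (1) or wrong (0).
confidences = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
correct = [1, 1, 1, 0, 1, 0]

ece = expected_calibration_error(confidences, correct)
print(round(ece, 4))
```

In this toy input the 0.95 bin is right only three times out of four, so the model is overconfident there, and that gap dominates the score.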

Conclusion

This project shows that classical NLP can still go surprisingly far when the representation matches the problem. A word-only Vowpal Wabbit baseline was already strong, but adding style-aware features, TF-IDF word and character n-grams, probability-focused tuning, and stacked generalization further improved the model.

The strongest classical pipeline reached 0.8687 accuracy and 0.3504 log loss on the 70/30 holdout split, and the final stacked submission scored 0.30414 private and 0.33621 public log loss on Kaggle.

The main takeaway is not just that stacking improved the score. It is that authorship attribution rewards the details: punctuation, subword patterns, function words, and careful probability estimates. Before reaching for heavier contextual models, a well-validated sparse-text baseline can still be a serious competitor.

Data source and license

This article uses Kaggle’s Spooky Author Identification dataset, a text-classification dataset built from excerpts of public-domain fiction by Edgar Allan Poe, H. P. Lovecraft, and Mary Wollstonecraft Shelley. The task is to predict the author of each sentence among three labels: EAP for Edgar Allan Poe, HPL for H. P. Lovecraft, and MWS for Mary Wollstonecraft Shelley.

The dataset is listed on Kaggle under the CC BY 4.0 license. This license permits sharing and adaptation, including for commercial purposes, provided appropriate attribution is given. In this article, the dataset is used for an educational machine-learning walkthrough, and attribution links are provided in this section.

Thanks for making it all the way to the end! I hope you found this project as fun and useful as I did. If you have thoughts, questions, or ideas for extending the experiment, feel free to reach out through LinkedIn or my website.

Links

• Full notebook + code
• LinkedIn
• Website