First steps in NLP - Training a Philosopher Classifier with Hugging Face

While working through the fast.ai course I’ve neglected a central tenet of its learning philosophy: build some models! I kept telling myself that I would build one when a course topic intersected with my interests. I’ve been interested in NLP and time-series analysis, so I was excited to see that we would get a taste of the former as early as Lesson 4.

Lesson 4 of the fast.ai course deviated from the book and introduced us to the Hugging Face ecosystem and its ensemble of pretrained NLP models. After the lesson I decided to pursue that tangent a bit further: study the Hugging Face course and build myself a model before continuing with the lectures.

I may write a summary of what I learned from the course in another post, but the course is so clear that I think it might be unnecessary. This post catalogues my beginning-to-end training of a Hugging Face model to answer the question

Given a sentence, which philosopher was most likely to have said it?

More precisely, we walk through

  1. Preprocessing a dataset using Hugging Face’s data structures and tokenizers.
  2. Training and evaluating the model.
  3. Results and reflections.

1. Data Preprocessing

The dataset with which we’ll train our model is the History of Philosophy dataset from Kaggle. Each sample is a sentence pulled from the texts of one of 36 philosophers, labeled with author, philosophical school (though that’s subjective), publication date, and corpus date. It also includes some preprocessed fields, such as tokenizations, but we won’t use those in our training.

In contrast to fast.ai, which uses DataLoaders, the data structure expected by pretrained Hugging Face models is the Dataset (not to be confused with fast.ai Datasets, which are something completely different). The first step is to convert the provided .csv file into a native data structure following these instructions. This gave me one DatasetDict with all of the data:

from datasets import load_dataset

phil_dataset_raw = load_dataset("csv", data_files="philosophy_data.csv")
phil_dataset_raw

-- Output -- 
DatasetDict({
    train: Dataset({
        features: ['title', 'author', 'school', 'sentence_spacy', 'sentence_str', 'original_publication_date', 'corpus_edition_date', 'sentence_length', 'sentence_lowered', 'tokenized_txt', 'lemmatized_str'],
        num_rows: 360808
    })
})

The DatasetDict object, as the name suggests, is a dictionary containing your training, testing, and evaluation data. By default, it throws everything into the ‘train’ Dataset. One of the features of DatasetDict is that it enables you to perform the same preprocessing for every partition of your dataset at once, which greatly simplifies preprocessing.

I begin by removing the extraneous preprocessing columns. Also, as Jeremy said, Hugging Face demands that the target be in a column called labels. Since I want to predict the author of a sentence, I renamed the ‘author’ field to ‘labels’:

phil_dataset = phil_dataset_raw.remove_columns(['sentence_length', 'sentence_lowered', 'tokenized_txt', 'lemmatized_str', 'sentence_spacy'])

phil_dataset = phil_dataset.rename_column("author", "labels")

-- Output -- 
DatasetDict({
    train: Dataset({
        features: ['title', 'labels', 'school', 'sentence_str', 'original_publication_date', 'corpus_edition_date'],
        num_rows: 360808
    })
})

To construct validation and test sets, I used Hugging Face’s built-in train_test_split method (the analogue of sklearn’s function of the same name) twice: once to split the train set into a training portion and the rest, and then to cut the rest into validation and test sets:

ds_split = phil_dataset["train"].train_test_split(test_size=0.1)
ds_train, ds_eval_and_test = ds_split["train"], ds_split["test"]
ds_eval_split = ds_eval_and_test.train_test_split(test_size=0.2)
ds_eval, ds_test = ds_eval_split["train"], ds_eval_split["test"]

(If you want to see what can happen when you do this step wrong, scroll to the end of the post 😅.) I can then manually put these together into a DatasetDict via

from datasets import DatasetDict

phil_dataset_full = DatasetDict({'train': ds_train, 'valid': ds_eval, 'test': ds_test})

Next we need to tokenize our sentences and append the tokenizations to our dataset as features. We’ll do this using the neat Dataset.map() method introduced in this chapter. In Hugging Face, the tokenizer is attached to the model we want to fine-tune: each model expects a certain tokenization, e.g. with regard to padding and separators, and different models follow different conventions. For this training, we’ll fine-tune an uncased version of the BERT model:

from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

With tokenizer in hand, we can define a function, compatible with batching, which tokenizes the sentence_str field of our dataset:

def tokenize_function(example):
    return tokenizer(example["sentence_str"], truncation=True)

and we can append the tokenizations to our dataset via

tokenized_datasets = phil_dataset_full.map(tokenize_function, batched=True)

I tried to train the model with this dataset but it failed. To see why, let’s take a look at the features of our dataset.

tokenized_datasets["train"].features

-- Output --
{'title': Value(dtype='string', id=None),
 'labels': Value(dtype='string', id=None),
 'school': Value(dtype='string', id=None),
 'sentence_str': Value(dtype='string', id=None),
 'original_publication_date': Value(dtype='int64', id=None),
 'corpus_edition_date': Value(dtype='int64', id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}

It makes sense that the training failed: Hugging Face models don’t automatically convert the labels feature into integers the model understands, so of course it complains when I feed it strings as labels! That design choice makes sense from a batching perspective: if you don’t know the categorical encodings of all of your labels beforehand (a ‘global’ property of the dataset), you can’t consistently encode a batch (which sees only a ‘local’ subset) on the fly.

In the course examples, the labels were instances of the ClassLabel class. As I understand it, this class encodes the categorical values as integers (like the OrdinalEncoder from sklearn, for those familiar with it). Let me convert the labels column using the class_encode_column method:

tokenized_datasets = tokenized_datasets.class_encode_column("labels")

What are the features now?

tokenized_datasets["train"].features

-- Output --
{'title': Value(dtype='string', id=None),
 'labels': ClassLabel(names=['Aristotle', 'Beauvoir', 'Berkeley', 'Davis', 'Deleuze', 'Derrida', 'Descartes', 'Epictetus', 'Fichte', 'Foucault', 'Hegel', 'Heidegger', 'Hume', 'Husserl', 'Kant', 'Keynes', 'Kripke', 'Leibniz', 'Lenin', 'Lewis', 'Locke', 'Malebranche', 'Marcus Aurelius', 'Marx', 'Merleau-Ponty', 'Moore', 'Nietzsche', 'Plato', 'Popper', 'Quine', 'Ricardo', 'Russell', 'Smith', 'Spinoza', 'Wittgenstein', 'Wollstonecraft'], id=None),
 'school': Value(dtype='string', id=None),
 'sentence_str': Value(dtype='string', id=None),
 'original_publication_date': Value(dtype='int64', id=None),
 'corpus_edition_date': Value(dtype='int64', id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}

That looks better!

2. Training and Evaluating the Model

Let me try to train it now. The extra objects we need to create the Trainer are a data_collator, which tells the trainer how to batch samples (e.g. whether to pad each batch to the longest sequence in the batch, the longest in the dataset, or a fixed length), and a training_args object which, for our purposes, simply says where to store the model checkpoints.

from transformers import DataCollatorWithPadding, TrainingArguments

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
training_args = TrainingArguments("phil-test-trainer")

Let’s try again?

from transformers import AutoModelForSequenceClassification, Trainer

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["valid"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)
trainer.train()

-- Output --
RuntimeError: CUDA error: device-side assert triggered

Oops. Just as the example in the course needed num_labels=2 to classify sentence pairs as equivalent or not, we need num_labels equal to the number of labels in our dataset.

num_labels = len(tokenized_datasets["train"].features['labels'].names)
num_labels

-- Output --
36

Let’s reinitialize our model:

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=num_labels)

I don’t know if it’s necessary, but I tell the model to train on the GPU instead of the CPU:

import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
device

-- Output --
device(type='cuda')

model.to(device)

We then recreate the Trainer object and try training again:

trainer.train()

aaaaand training is going to take 5 hours, even with CUDA enabled! To make sure it’s actually working, we should try training on a much smaller fraction of the data. To do this, I use the shard method of a Dataset, running the following before splitting the dataset into training, validation, and test sets:

phil_dataset = DatasetDict({'train': phil_dataset["train"].shard(num_shards=10, index=0)})

Going through the above process with this smaller dataset reduces the training time to a much more manageable 30 minutes or so (on a laptop GPU).
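As a quick sanity check, the split sizes work out as follows (assuming train_test_split rounds roughly like Python’s round; the exact rounding may differ by a row or two):

```python
shard_rows = 36_081                  # one of 10 shards of the 360,808-row dataset

n_rest = round(shard_rows * 0.10)    # first split holds out 10%
n_test = round(n_rest * 0.20)        # second split: 20% of the holdout -> test
n_valid = n_rest - n_test            # remaining 80% of the holdout -> validation
n_train = shard_rows - n_rest        # 90% of the shard -> training

print(n_train, n_valid, n_test)      # 32473 2886 722
```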

The moment of truth: what’s the accuracy of the model? We compute it on the test set, which consists of 722 sample sentences:

import numpy as np
import evaluate

accuracy_metric = evaluate.load("accuracy")

predictions = trainer.predict(tokenized_datasets["test"])
preds = np.argmax(predictions.predictions, axis=-1)
accuracy_metric.compute(predictions=preds, references=predictions.label_ids)

-- Output --
{'accuracy': 0.7063711911357341}

It’s honestly higher than I expected: the model picked the correct philosopher, out of 36, a solid 70% of the time. For a better look at the kinds of predictions it was making, let’s examine some directly.

names_list = tokenized_datasets["train"].features['labels'].names

def str_and_preds(index):
    idx_pred = np.argmax(predictions.predictions[index])
    idx_label = predictions.label_ids[index]
    name_pred, name_label = names_list[idx_pred], names_list[idx_label]
    text = tokenized_datasets["test"][index]["sentence_str"]
    return f'''Text: {text}
    Prediction: {name_pred}
    Actual: {name_label}'''

I inspected swathes of the predictions with a loop over the indices. Some of the correct predictions are spot-on:

Text: In any case, what is demanded of her is self forgetting and love.

Prediction: Beauvoir; Actual: Beauvoir

And some of the predictions I can’t really blame the model for missing:

Text: As I have already said, I disagree.

Prediction: Lewis; Actual: Kripke

This is honestly a flaw in the dataset. In retrospect, I should have examined the dataset more thoroughly (e.g. looking at random samples instead of just browsing its head). Lesson learned.

Just out of curiosity, I’m going to try predicting the school instead of the precise author of a text. All that’s involved is renaming the ‘school’ feature to ‘labels’ instead of ‘author’. The result is actually not much better, which surprises me a lot:

{'accuracy': 0.7479224376731302}

3. Results and Reflections

I think it’s true in general that metrics don’t give a complete picture of the behavior of a model, and it’s especially true here where both our input (‘philosophical’ sentences) and output (philosophical schools or authors) are not concrete quantities. So I want to take a look at what the model is doing, and speculate on what it might be learning.

Some of the misclassifications are interesting:

Text: Our senses are also hostile and averse to the new; and generally, even in the simplest processes of sensation, the emotions such as fear, love, hatred, and the passive emotion of indolence.

Prediction: rationalism; Actual: nietzsche

And some of the predictions which are wrong are humanly understandable:

Text: But what is there to lead, and, more than that, authorize us to supplement the facts of the case in this way?

Prediction: plato; Actual: german_idealism

or

Text: For that is the goal of its impulse.

Prediction: german_idealism; Actual: aristotle

I’m undecided whether it’s a bug or a feature of this dataset that it contains sentences a human would call ambiguous. On one hand, it introduces samples that the model shouldn’t be able to guess, which is a bit strange; they tend to be the philosophically empty sentences, and I don’t think the model learns anything productive from being trained on them. On the other hand, it’s simply the reality that some sentences or claims, especially tiny, philosophically empty ones like the above, can be found in many contexts.

As expected, some of the predictions are harder to be charitable about:

Text: These values fall on a linear scale with arbitrary zero and unit.

Prediction: continental; Actual: analytic

Finally, let’s see how the model behaves on authors who were not in the dataset. To get a sense of the probabilities that the model was assigning, we define a function which returns the top num_preds predictions and their probabilities:

def str_predictions(text, num_preds):
    str_pred = trainer.predict([tokenizer(text)])
    top_preds = np.argsort(str_pred.predictions)[0][-num_preds:]

    logits = torch.from_numpy(str_pred.predictions[0])
    probs = torch.nn.functional.softmax(logits, dim=-1)
    for i in range(len(top_preds)):
        print(f'{len(top_preds) - i}: {names_list[top_preds[i]]}: {probs[top_preds[i]]:.3f}')

For example, evaluated on the following quote by Sartre,

str_predictions("It is therefore senseless to think of complaining since nothing foreign has decided what we feel, what we live, or what we are.", 5)

-- Output --
5: phenomenology: 0.014
4: continental: 0.019
3: rationalism: 0.103
2: aristotle: 0.296
1: plato: 0.550

the top three candidates are resolutely not where a human would classify Sartre! Others are better:

str_predictions("Man is condemned to be free; because once thrown into the world, he is responsible for everything he does. It is up to you to give life a meaning.", 5)

-- Output --
5: stoicism: 0.000
4: phenomenology: 0.000
3: continental: 0.001
2: nietzsche: 0.001
1: feminism: 0.997

Here are a couple of revealing examples, by Levinas and yours truly:

str_predictions("Faith is not a question of the existence or non-existence of God. It is believing that love without reward is valuable.", 5)

-- Output --
5: phenomenology: 0.003
4: analytic: 0.007
3: feminism: 0.024
2: nietzsche: 0.026
1: rationalism: 0.934

str_predictions("She loves peanut butter", 5)

-- Output --
5: nietzsche: 0.000
4: rationalism: 0.000
3: communism: 0.000
2: analytic: 0.001
1: feminism: 0.998

str_predictions("He loves peanut butter", 5)

-- Output --
5: nietzsche: 0.002
4: communism: 0.002
3: rationalism: 0.009
2: feminism: 0.023
1: analytic: 0.958

Final Thoughts

My speculation is that the model cannot detect philosophical content at all, and instead learned the minute stylistic tendencies of each author and school, such as the dominant vocabulary of their discourse and their distinctive cadences. This aligns with what I’ve seen even in recent, cutting-edge NLP models (like ChatGPT, released a couple of days ago as of this writing) and with what Jeremy has cautioned us about: NLP models learn enough to give context-appropriate responses, but they’re terribly unreliable at understanding the actual content of language. I wonder if that’s what’s going on here.

If I come back to this dataset, I want to come equipped with more metrics and more ways to understand the results. Something else I felt while working on this NLP project was that I didn’t know what levers there were to pull in the model. Are there any meaningful hyperparameters? Beyond choosing other pretrained models to fine-tune, what can I do? What sorts of feature engineering are productive with a language dataset, and does the model do any of it automatically? I learned a lot about the tooling through this exercise; next I want to understand how to use these tools better.

Aside: A Disaster

On my first successful training of the author-prediction model, I got an impossibly high accuracy of

{'accuracy': 0.9376731301939059}

I originally attributed it to the model being extremely good at pinpointing writing style, but with a score that high I should have scrutinized my methodology first. While writing this post, I realized that I made a dumb mistake in the train-test split: instead of

ds_split = phil_dataset["train"].train_test_split(test_size=0.1)
ds_train, ds_eval_and_test = ds_split["train"], ds_split["test"]

like above, I instead did

ds_train = phil_dataset["train"].train_test_split(test_size=0.1)["train"]
ds_eval_and_test = phil_dataset["train"].train_test_split(test_size=0.1)["test"]

These datasets aren’t disjoint, since train_test_split randomly shuffles the data each time it’s called! Because of this mistake, I was evaluating my model on samples that were in the training set, a cardinal sin.
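The failure mode is easy to reproduce with plain Python, no Hugging Face required. With 8 + 3 > 10 draws from 10 items, the two independently shuffled “splits” are guaranteed to overlap by pigeonhole:

```python
import random

items = list(range(10))

# WRONG: two independent shuffles -- the second "test" can reuse "train" rows.
# With 8 + 3 > 10 picks, an overlap is guaranteed by pigeonhole.
train = set(random.sample(items, 8))
test = set(random.sample(items, 3))
assert train & test  # leakage!

# RIGHT: shuffle once, then slice disjoint pieces -- which is what a
# single train_test_split call does internally
shuffled = random.sample(items, len(items))
train2, test2 = set(shuffled[:8]), set(shuffled[8:])
assert not (train2 & test2)  # disjoint by construction
```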