NLP & Hugging Face

Tokenization, embeddings, BERT, GPT, Hugging Face Transformers, fine-tuning, pipelines, inference

24 interview questions
Senior
1

What is the main function of tokenization in natural language processing?

Answer

Tokenization splits raw text into smaller units called tokens, which can be words, subwords, or characters. This step is essential because language models operate on numbers, not raw strings: each token is mapped to an integer ID in the model's vocabulary, and those IDs are what the model takes as input.
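A minimal sketch of the two steps described above (splitting, then mapping tokens to IDs), assuming a simple word-and-punctuation splitter. The `tokenize` and `build_vocab` helpers are hypothetical names used for illustration; production tokenizers typically use subword schemes and a fixed, pre-trained vocabulary.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(tokens):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab

text = "Tokenization splits text. Text becomes tokens."
tokens = tokenize(text)
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]  # the numerical sequence a model would consume

print(tokens)
print(ids)
```

Repeated tokens (here "text" and ".") receive the same ID, which is what lets the model treat them as the same symbol.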

2

What is the main advantage of the BPE (Byte Pair Encoding) algorithm over word-level tokenization?

Answer

BPE handles out-of-vocabulary (OOV) words by decomposing them into known subword units. Word-level tokenization replaces any unknown word with a special [UNK] token, whereas BPE can represent a word as a sequence of subwords from its vocabulary, provided the vocabulary covers the word's base characters (byte-level BPE guarantees this). This lets the model generalize to words never seen during training.
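The contrast can be illustrated with a small sketch. The merge list and word-level vocabulary below are made up for the example (they are not taken from any real tokenizer), and `bpe_encode` is a simplified encoder that applies each merge rule in its learned order.

```python
def bpe_encode(word, merges):
    """Start from characters and apply each merge rule in priority order."""
    symbols = list(word)
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            # Merge the pair when it appears at the current position.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

def word_level_encode(word, vocab):
    """Word-level lookup: anything outside the vocabulary becomes [UNK]."""
    return word if word in vocab else "[UNK]"

# Hypothetical merges, as if learned from a corpus containing "low" and "newest".
merges = [("l", "o"), ("lo", "w"), ("e", "s"), ("es", "t")]
word_vocab = {"low", "newest"}

unseen = "lowest"  # never seen as a whole word
print(word_level_encode(unseen, word_vocab))
print(bpe_encode(unseen, merges))
```

The word-level lookup discards all information about "lowest", while the subword encoder recovers the pieces "low" and "est" from rules learned on other words.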

3

What is the fundamental difference between WordPiece and BPE for vocabulary construction?

Answer

BPE merges the most frequent pair of adjacent tokens at each step, while WordPiece chooses the merge that most increases the likelihood of the training corpus. WordPiece thus uses a probabilistic criterion rather than raw frequency, which can produce slightly different splits that may be better suited to the final language model.
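To make the two criteria concrete, here is a hedged sketch on a toy corpus. It assumes the commonly cited approximation of the WordPiece criterion, score = count(pair) / (count(first) × count(second)); the original WordPiece training code is not public, so treat this formula as an assumption. The corpus counts are invented, and the "##" continuation prefix is omitted for simplicity.

```python
from collections import Counter

def pair_and_symbol_counts(corpus):
    """Count adjacent symbol pairs and individual symbols, weighted by word frequency."""
    pairs, symbols = Counter(), Counter()
    for word_symbols, freq in corpus:
        for s in word_symbols:
            symbols[s] += freq
        for a, b in zip(word_symbols, word_symbols[1:]):
            pairs[(a, b)] += freq
    return pairs, symbols

# Toy corpus: (word split into characters, word frequency). Invented numbers.
corpus = [
    (("h", "u", "g"), 10),
    (("p", "u", "g"), 5),
    (("p", "u", "n"), 12),
    (("b", "u", "n"), 4),
    (("h", "u", "g", "s"), 5),
]

pairs, symbols = pair_and_symbol_counts(corpus)

# BPE criterion: pick the most frequent adjacent pair.
bpe_choice = max(pairs, key=lambda p: pairs[p])

# WordPiece-style criterion (assumed formula): normalize by the parts' frequencies.
def wordpiece_score(pair):
    a, b = pair
    return pairs[pair] / (symbols[a] * symbols[b])

wordpiece_choice = max(pairs, key=wordpiece_score)

print("BPE merges:", bpe_choice)
print("WordPiece merges:", wordpiece_choice)
```

On this corpus BPE picks the frequent pair ("u", "g"), whereas the normalized score favors ("g", "s"): "s" is rare on its own, so seeing it next to "g" is more informative than seeing the very common "u" next to anything.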

4

What is the main difference between static word embeddings (Word2Vec) and contextual embeddings (BERT)?

5

What are the two pre-training tasks used by BERT?
