In the first NLP class of the series, we will teach you the foundations needed to analyze linguistic data and understand basic Natural Language Processing concepts.
The training starts with a discussion of the challenges of linguistic data, followed by techniques for handling, cleaning, and normalizing text data. The lesson concludes with two language models: *N-grams* and *word embeddings*. The latter is discussed further in NLP 2, as it requires a better understanding of RNNs and deep learning.
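As a taste of the first of those two models, here is a minimal bigram (2-gram) language model sketched in pure Python. The toy corpus and whitespace tokenization are illustrative simplifications, not the course material itself:

```python
from collections import Counter, defaultdict

# Toy corpus, crudely tokenized by whitespace (real pipelines use a proper tokenizer).
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count each (previous word, current word) pair.
bigram_counts = defaultdict(Counter)
for prev, curr in zip(corpus, corpus[1:]):
    bigram_counts[prev][curr] += 1

def bigram_prob(prev, curr):
    """Maximum-likelihood estimate P(curr | prev) = count(prev, curr) / count(prev, *)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][curr] / total if total else 0.0

print(bigram_prob("the", "cat"))  # 0.25: "the" is followed once each by cat, mat, dog, rug
print(bigram_prob("sat", "on"))   # 1.0: "sat" is always followed by "on" in this corpus
```

Real N-gram models add smoothing so that unseen word pairs do not get probability zero; that refinement is part of the lesson, not this sketch.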
The theoretical lesson is followed by lab exercises in which participants familiarize themselves with the main NLP toolkits used in industry (NLTK, spaCy, Gensim) and train a Bayesian model to predict an author's gender from word-frequency features extracted from Twitter data.
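A stripped-down version of that lab exercise can be sketched as a Naive Bayes classifier over word counts. The four "tweets" and labels below are invented placeholders, not the actual Twitter dataset used in the lab:

```python
import math
from collections import Counter

# Invented placeholder data: (text, label) pairs.
train = [
    ("love this cute puppy so much", "F"),
    ("just watched the game great match", "M"),
    ("shopping with friends love it", "F"),
    ("great game tonight watching football", "M"),
]

class_counts = Counter(label for _, label in train)
word_counts = {label: Counter() for label in class_counts}
for text, label in train:
    word_counts[label].update(text.split())
vocab = {w for counts in word_counts.values() for w in counts}

def predict(text):
    """Pick the class maximizing log P(class) + sum of log P(word | class),
    with add-one (Laplace) smoothing for words unseen in a class."""
    best, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / len(train))
        total = sum(word_counts[label].values())
        for word in text.split():
            score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

print(predict("love watching this puppy"))  # "F": its words mostly match the F class counts
```

The lab version builds the same idea on real data with a toolkit classifier rather than hand-rolled counts.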
The training includes theory, demos, and hands-on exercises.
By the end of the training, participants will have gained knowledge of:
- Techniques for handling, cleaning, and normalizing linguistic data (Tokenization, Normalization, Stemming, Lemmatization, Stop-word removal)
- Modelling language to derive insights (Statistical language modelling; Word frequency: Bag of words and Tf-idf, i.e. term frequency-inverse document frequency; N-grams; Word embeddings and vector representation of words)
- Useful methods for topic classification and sentiment analysis
- The main NLP toolkits used in industry (NLTK, spaCy, Gensim), introduced in the lab
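To make the word-frequency representations in the list above concrete, here is a short pure-Python sketch of bag of words and tf-idf over a toy corpus, using the standard formulas (tf = raw count, idf = log of corpus size over document frequency):

```python
import math
from collections import Counter

# Toy corpus; real pipelines would tokenize and normalize properly first.
docs = [
    "the cat sat on the mat",
    "the dog ate my homework",
    "the cat ate the fish",
]
tokenized = [d.split() for d in docs]        # crude whitespace tokenization
bags = [Counter(toks) for toks in tokenized]  # bag of words: term -> count per document

def tf_idf(term, doc_index):
    """tf-idf = count(term, doc) * log(N / number of docs containing term)."""
    df = sum(1 for bag in bags if term in bag)
    if df == 0:
        return 0.0
    return bags[doc_index][term] * math.log(len(docs) / df)

print(tf_idf("the", 0))  # 0.0: "the" occurs in every document, so its idf is zero
print(tf_idf("mat", 0))  # positive: "mat" is distinctive to the first document
```

This illustrates why tf-idf downweights ubiquitous words like "the" while highlighting document-specific terms, which is exactly what makes it a stronger feature than raw counts for classification.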