A01: Introduction and Regular Expressions, Text Normalization, and Edit Distances

Read Chapters 1 and 2 of J&M 2nd edition. The 3rd edition has only a placeholder for Chapter 1. It is assumed that you are all familiar with regular expressions (regexes). If not, you will need to read and learn Section 2.1. Section 2.2 is for background, but spend enough time to also master the ideas in Sections 2.3 and 2.4. Text normalization, even if not all of its procedures are used for every NLP project, is basic to the entire process. The Porter Stemmer is the default unless it is not sufficient for a project. Edit distances will be needed as the basis for Chapter 5 on Spelling Correction and Noisy Channels. Most of the corpora mentioned in this chapter are freely available over the Internet, but you may have to create accounts with their sponsors.
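As a rough illustration of two of the ideas named above (regex-based tokenization and minimum edit distance), here is a minimal Python sketch that uses only the standard library. The function names `tokenize` and `min_edit_distance`, the token pattern, and the default substitution cost of 2 are assumptions chosen for this example. They are not taken from J&M or BKL, and an actual assignment response may call for different choices.

```python
import re


def tokenize(text):
    """Lowercase the text and pull out word tokens with a simple regex.

    The pattern is an illustrative assumption: runs of letters or digits,
    optionally followed by an apostrophe and more letters (e.g. "isn't").
    """
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())


def min_edit_distance(source, target, sub_cost=2):
    """Minimum edit distance by dynamic programming.

    Insertions and deletions cost 1 each. The substitution cost is a
    parameter: 2 treats a substitution as a delete plus an insert,
    while 1 gives the plain Levenshtein distance.
    """
    n, m = len(source), len(target)
    # dist[i][j] holds the cost of turning source[:i] into target[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i  # delete every character of source[:i]
    for j in range(1, m + 1):
        dist[0][j] = j  # insert every character of target[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + 1,                             # deletion
                dist[i][j - 1] + 1,                             # insertion
                dist[i - 1][j - 1] + (0 if same else sub_cost)  # match or substitution
            )
    return dist[n][m]


print(tokenize("The cat's hat, isn't it?"))
print(min_edit_distance("intention", "execution"))
print(min_edit_distance("intention", "execution", sub_cost=1))
```

With these settings, "intention" to "execution" comes out as 8 when substitutions cost 2 and as 5 when they cost 1. The stemming step is left out here because it depends on an external package such as NLTK.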

Read the Preface and Chapters 1 and 2 from BKL for Python 3.

Responses to assignments can be done in Python, R, Java, or any other suitable software. At times other software may be useful. Indicate what your instructor may need to know and should have in order to review it.

Considering the popularity of Python (its NLTK package is widely used), it will be the default software for this course, but R and Java have similar packages. All three were initially based on what is now the Stanford JavaNLP API (https://nlp.stanford.edu/nlp/javadoc/javanlp/). You are welcome to use them but will need to find books and documentation comparable to the Bird, Klein, and Loper book (BKL). The contents of BKL updated for Python 3 can be read at http://www.nltk.org/book/. If problems occur when implementing code from BKL, check this BKL Python 3 website before examining other fixes.

This is not a course in learning Python. It is assumed you know it, or an alternate language or software package, well enough to at least get going with the assignments. If you need to learn more, do so, but let me know of any limitations in your knowledge and in the capability of the software and its libraries, packages, and resources.