CS114 Assignment 1 Finite State Transducers Solved

35.00 $

Category:

Description

5/5 - (2 votes)

This assignment is on finite state transducers and will require you to create transducers in Python. There are two problems, and we have provided some code to get you started, as well as a few utilities that will make it easier for you to debug and test your code.

Some minimal test cases are provided as well. Before you turn in your work, you should at least pass all of the test cases we provided although it is neither sufficient nor necessary for getting full credit. You should add your own test cases to help yourself debug your program.

Problem 1: Soundex[1]

The Soundex algorithm is a phonetic algorithm commonly used by libraries and the Census Bureau to represent people’s names as they are pronounced in English. It has the advantage that name variations with minor spelling differences will map to the same representation, as long as they have the same pronunciation in English. Here is how the algorithm works:

Step 1: Retain the first letter of the name. This may be uppercased or lowercased.

Step 2: Remove all non-initial occurrences of the following letters: a, e, h, i, o, u, w, y. (To clarify, this step removes all occurrences of the given characters except when they occur in the first position.)

Step 3: Replace the remaining letters (except the first) with numbers:

b, f, p, v→ 1 c, g, j, k, q, s, x, z→ 2 d, t→ 3 l→ 4 m, n→ 5 r→ 6

If two or more letters from the same number group were adjacent in the original name, then only replace the first of those letters with the corresponding number and ignore the others. This rule also applies at the beginning of the name: retain the first letter, but ignore the others.

Step 4: If there are more than 3 digits in the resulting output, then drop the extra ones.

Step 5: If there are less than 3 digits, then pad at the end with the required number of trailing zeros.

The final output of applying Soundex algorithm to any input string should be of the form Letter Digit Digit Digit. Table 1 shows the output of the Soundex algorithm for some example names.

Input Output
Jurafsky J612
Jarovski J612
Resnik R252
Reznick R252
Euler E460
Peterson P362
Ashcroft A226
Pfister P236

Table 1: Example outputs for the Soundex algorithm.

Construct an fst that implements the Soundex algorithm. Obviously, it is non-trivial to implement a single transducer for the entire algorithm. Therefore, the strategy we will adopt is a bottom-up one: implement multiple transducers, each performing a simpler task, and then compose them together to get the final output. One possibility is to partition the algorithm across three transducers:

  1. Transducer 1: (letters_to_numbers) Performs steps 1-3 of the algorithm, i.e, retaining the first letter, removing letters and replacing letters with numbers.
  2. Transducer 2: (truncate_to_three_digits) Performs step 4 of the algorithm, i.e., truncating extra digits.
  3. Transducer 3: (add_zero_padding) Performs step 5 of the algorithm, i.e., padding with zeros if required.

Note that each of these three transducers will have characters as input/output symbols.

To make things easier for you, we have provided the file soundex.py which is where you will write your code. It already imports all needed modules and functions (including fsmutils.py). It also creates three transducer objects—as dictated by the bottom-up strategy outlined above—such that all you should have to do is to figure out the states and arcs required by each transducer. It also contains code that allows you to input a single name on the command line to get the output.

Note that while we have provided you with sample unit tests containing some names, it might be very useful to test your code on other names. For comparison purposes, you may use one of the many Soundex calculators available online to create more test cases. (Note that some calculators may use additional rules; you only need to implement the rules listed above.)

Problem 2: English Morphology

English often requires some spelling changes at the morpheme boundary between the stem and the affixes that indicate the inflection of the verb. Specifically, we will build a finite state transducer to handle the K-insertion rule in English.

If a verb ends with vowel + c, then we add k before adding in the -ed and -ing suffixes. For example:

panicked ↔ panic + ed ↔ panic+past form panicking ↔ panic + ing ↔ panic+present participle form lick ↔ lick + ed ↔ lick+past form lick ↔ lick + ing ↔ lick+present participle form

As you can see in the examples, the morphological parsing process can be done in two steps (although you may skip the intermediate step). So you might want to (but don’t have to) build two separate FSTs and compose them. The lexicon FST only has to include two inflectional morphemes (-ed and -ing) and five verbs: want, sync, panic, havoc, and lick. The other FST will handle the actual K-insertion rule.

The parser FST will be used as a morphological parser to transduce the surface form (e.g panicked) to the lexical form (panic + past form). The generator FST will be used to transduce the lexical form to the surface form.

How to use the FST code

We’ve provided a modified fst implementation from nltk. However, the documentation is less than perfect. Therefore, in this section, we will give you a brief introduction to everything that you will need to understand how to build finite state transducers in Python.

To start building fsts, you need to first import the fst module into your program’s namespace. Then, you need to instantiate an FST object. Once you have such an object, you can start adding states and arcs to it. Listing 1 shows how to build a very simple finite state transducer—one that removes all vowels from any given word.

Feel free to try out the example (provided as devowelizer.py) to see how it works on some of your own input. There are a few points worth mentioning:

  1. The Python string module comes with a few built-in strings that you might be able to use in this assignment for purposes of iteration as used in the example on line 23. These are:

string.ascii letters : All letters, upppercased and lowercased string.ascii lowercase : All lowercased letters string.ascii uppercase : All uppercased letters

  1. States can be added to an FST object by using its add state() This method takes a single argument: a unique string identifier for the state. Our example has only one state (line 13). Furthermore, there can only be one initial state and this is indicated by assigning the state identifier to the FST object’s initial state field (line 17). However, there may be multiple final states in an FST. In fact, it is almost always necessary to have multiple final states when working with transducers. All final states may be so indicated by using the FST object’s set final() method (line 18).
  2. Arcs can be added between the states of an FST object by using its add arc() This method takes the following arguments (in order): the starting state, the ending state, the input symbol and, finally, the output symbol.

Listing 1: A 1-state transducer that deletes vowels

17 f.initial_state = ’1’ 18 f.set_final(’1’)

20 # Now, we need to add an arc for each letter; if the letter is 21 # a vowel then the transition outputs nothing but otherwise it 22 # outputs the same letter that it consumed.

  • forletterinstring.ascii_lowercase:
  • ifletterinvowels: 25 f.add_arc(’1’, ’1’, letter, ’’)
  • else:
  • f.add_arc(’1’, ’1’, letter, letter)
  • # Evaluate it on some example words
  • print(’’.join(f.transduce([’v’, ’o’, ’w’, ’e’, ’l’])[0]))
  • print(’’.join(f.transduce(’exception’)[0]))
  1. An FST object can be evaluated against any input string by using its transduce() method. Here’s how:
  • If your transducer uses characters as input/output symbols, then the input to transduce() must be a list of characters. You may either directly input a list of characters (line 30) or a string (which is just a list of characters) (lines 31 and 32). Note that the output of transduce() is a list of possible transductions. For the assignment, you should make sure your transductions are deterministic; i.e., there should be at most one element in the list.
  • If your transducer uses words as input/output symbols, then the input to transduce() should be a list of words. Again, you can either explicitly use a list of words or call the split method on a string of words separated by whitespace. For example, say your FST maps from strings like ten and twenty to number strings 10 and 20, then to evaluate it on the input string ten twenty, you

should use either:

OR

  1. In fsmutils.py , we provide the code for composition. It does not actually perform the composition and return a newly composed FST. It cascades the output from one FST to the next.

The above function call computes (f3 ◦ f2 ◦ f1)(S). i.e., it will first  apply transducer f1 to the given input S, use the output of this transduction as input to transducer f2 and so on and so forth. S must be a list of characters. Note that like transduce(), the compose() function returns a list of possible transductions; make sure there is at most one element in the list.

[1] This problem is adapted from Jordan Boyd-Graber’s course at UMD.

  • Finite-state-transducers-ggjn1q.zip