What's the usage of minimum edit distance?

Week1 went through autocorrect first, then minimum edit distance.

After finding minimum edit distance, what’s the usage of it?

Minimal edit distance is used to find words that are closest to the input word. You can see autocorrect options be populated by candidates closest to the user input, sorted in increasing order of edit distance.

Thanks for the reply.

For example, in the assignment, after we find the suggestion for correction via edit and probability, would the next step be to calculate the minimum edit distance, then replace the misspelled word by the one with closest suggestion?

Week1 only introduced the concept of it and how to calculate it, but not touched how to use it after getting the number.
Is it covered somewhere in later courses?

You’re welcome. I don’t remember if it’s covered in the videos. What I shared is a practical application of edit distance.

So which way would be more common in the application? probability as in the assignment, or minimum edit distance?

Please look at the course 2 week 1 assignment again. get_corrections in exercise 10 answers your question.

That’s what I meant, get_corrections has nothing to do with “minimum edit distance”.
Exercise 10 is to find autocorrection with probabilities through words editing, which does not relate to “minimum edit distance”.

Exercise 11 is to calculate “minimum edit distance”, but what’s next, and how to use that?

Thanks for asking this question. I was wondering the same thing. This conversation helps to make sense of Assignment 1.

This is confusing -

Assume that you have 1 misspelled word and you want to generate all possible words with 1 or 2 edit distances using brute force. Assume that there are 500 possibilities and out of those 200 are in vocabulary

Alternative suggested is to use dynamic programming where we calculate edit distances of all words in vocab against this misspelled word and pick up whichever are less than 3 edit distances. So, for a vocab size of 100k, 100k min edit distance matrix should be computed and then back-traced to find relevant candidates each time 1 misspelled word is found.

I dont see how the 2nd approach has lesser computations than the brute force approach ?

Adding @arvyzukai, @Elemento

Hi @Mayank11

I’m not sure I understand you because there are no alternatives in this assignment. Let me elaborate in simple words on what’s going on.

For example, you have a misspelled word “dys”:

  • First, we check if the word is in the vocabulary (if it needs any correction). If it’s not in the vocabulary, we continue.
  • Then, we generate 1 edit (brute force), and let’s assume we find only two words in the vocabulary (“dye” and “days”). Next, we check which has a higher probability and choose that one (“days”). If there was none, then we would continue.
  • After that, we generate 2 edits (again, brute force) and do the same thing as with 1 edit (choose the word with the highest probability). If none is found, we return the word without correction.

Now, with the dynamic programming (part 4 in the Assignment), we can find how we can go from “dys” to “days”. The dynamic programming part here is that we don’t need brute force to generate all the “million” ways we can get from “dys” to “days”. (I think I don’t need to elaborate more, since the example is in the assignment (from “play” to “stay”)).

In other words, in the first part, we find the candidates and choose one, while in the second part, we find how we can get from the input word to the candidate (and what is the edit distance of that).

Does that make sense?

Yes @arvyzukai . My question was a reply to another post which claimed that dynamic programming is more efficient at finding auto-correct candidates than brute force. But as you rightly said these are 2 separate things. Dynamic programming is just about finding how to go from “days” to “dys” most efficiently. The task of finding the candidates for auto-correct has to use brute force only as taught in the assignment. To sum up, calculating min. edit distance has nothing to do with autocorrect.

And one more additional thing to note that I thought after the reply.

You could have asked why when we generate (brute force) the candidates why don’t we store the edit and the edit cost in the first place? So instead of having just a list with strings, we would have a list with tuples (candidate, operation, cost), for our example (“days”, (insert_id, “a”), 1), or some other form (like {“days”: (insert_id, “a”, 1)}, etc.). And we could extend that to edit two letters. This way, we could save computation (in expense of memory) by keeping track of the operations and their associated costs (which might be feasible for short words).

But I think the creators of the course wanted to introduce some concepts rather than making the most compute-efficient auto-correct.

Cheers