Testcase 2 of UNQ_C21

When I ran the test cases for UNQ_C21, I kept getting a failure with this message:

Wrong chosen neighbor ids. 
	Expected: [8413, 2085, 4802, 5526].
	Got: [8413, 3614, 4934, 5526].

So, for debugging, I added this testing code at the end of approximate_knn:

    # My own Test
    print("Cosine_similarity of the chosen vectors")
    for id in nearest_neighbor_idx_l:
        print(ids_to_consider_l[id], cosine_similarity(v, vecs_to_consider_arr[id]))
        
    print("Cosine_similarity of the expected vectors")
    Expect = [8413, 2085, 4802, 5526]
    idx_l = [idx for idx in range(len(ids_to_consider_l)) if ids_to_consider_l[idx] in Expect]
    for id in idx_l:
        print(ids_to_consider_l[id], cosine_similarity(v, vecs_to_consider_arr[id]))

And I got:

Cosine_similarity of the chosen vectors
8413 0.9999999999999996
3614 0.9999999999999996
4934 0.9999999999999996
5526 0.9999999999999996
Cosine_similarity of the expected vectors
2085 0.10372946727000373
4802 0.5021306764187731
5526 0.9999999999999996
8413 0.9999999999999996

Obviously, 2085 and 4802 are far less similar than 4934 and 3614. Is there anything wrong with my test code? Did I misunderstand something? Or is the expected answer wrong?

Hi @Jack_Changfan,

Can you share your lab ID with me? In the assignment, when you click the “Help” button at the top right, a panel will open and your lab ID will be shown at the bottom.

I shall take a look.

When you reply back, kindly tag me in the post so that I’m notified.

Thanks,
Mubsi

Hi, @Mubsi
My lab ID is xtbbiimx. I noticed that a new ipynb file was pushed today. The one I was working on is C1_W4_Assignment_2022_03_28_20_14_26.ipynb.

But when I restarted my kernel, I got totally different cosine similarities for the same ids. I suspect that, for some reason, my ids don’t map to the correct vectors. I’ll debug a bit before asking for your help.

Thanks.
Jack

Hi, @Mubsi
As I said in my previous reply, I got different results after restarting the kernel. Now both cases fail. Here are my steps for debugging this issue.
I copied the test_approximate_knn function into the code block above the final one and renamed it my_test_approximate_knn. In it, I added this code to check the similarity for each test case:

            v = test_case["input"]["v"]
            
            print("Cosine_similarity of the chosen vectors")
            for id in result:
                print(id, cosine_similarity(v, document_vecs[id]))
    
            print("Cosine_similarity of the expected vector")
            for id in test_case["expected"]:
                print(id, cosine_similarity(v, document_vecs[id]))

You may notice that I use document_vecs[id] to access the vectors instead of going through id_tables and hash_tables, because I think those will be out of sync if we remove ids from new_ids_to_consider as instructed (another issue I raised in NLP C1_W4_Assignment UNQ_C21 bug? - #5 by Jack_Changfan). Since document_vecs is the source used to create those tables, it should be the more reliable reference.

Now, with the code added, I see this for case 1:

Wrong chosen neighbor ids. 
	Expected: [51, 701, 2140].
	Got: [51, 2478, 105].
Cosine_similarity of the chosen vectors
51 0.9999999999999994
2478 0.9999999999999994
105 0.9999999999999994
Cosine_similarity of the expected vector
51 0.9999999999999994
701 0.5945676144082521
2140 0.23063321136649556

and case 2:

Wrong chosen neighbor ids. 
	Expected: [8413, 2085, 4802, 5526].
	Got: [992, 4750, 9486, 9101].
Cosine_similarity of the chosen vectors
992 0.9999999999999996
4750 0.9999999999999996
9486 0.9999999999999996
9101 0.9999999999999996
Cosine_similarity of the expected vector
8413 0.020437931293617165
2085 0.5947266612048572
4802 0.35082855420666725
5526 0.2648313322065428

It seems the cosine similarities of the expected vectors are all low. I’m not sure whether I misunderstood the tests or whether there is indeed something wrong.

Hi @Jack_Changfan,

I’d like to mention that you should try not to change the original assignment code and functions, as this will make the grader malfunction upon submission. Also, whenever a new assignment version is pushed, please start working on it instead of a previous version; for one thing, submissions from previous versions will fail.

You have added A LOT of custom code to your assignment. You are welcome to explore the assignment, but failing to revert it to its original state will make the grader fail (a warning about this was added in the latest assignment version), so I’d recommend you copy/paste your graded functions into the new assignment version.

As for your issue, it was here

# remove the id of the document that we're searching
if doc_id in new_ids_to_consider:
    None
    print(
        f"removed doc_id {doc_id} of input vector from new_ids_to_search")

You had written the correct code in place of the None, but for some reason you had that code commented out, which was causing the test cases to fail.
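
For reference, with that line restored the block looks roughly like this (a sketch; your exact print message may differ):

# remove the id of the document that we're searching
if doc_id in new_ids_to_consider:
    new_ids_to_consider.remove(doc_id)
    print(
        f"removed doc_id {doc_id} of input vector from new_ids_to_search")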

Cheers,
Mubsi

Hi, @Mubsi
Regarding that line, I actually think removing doc_id from the list is wrong. At the beginning, document_vectors_l and new_ids_to_consider have the same length, with new_ids_to_consider[i] corresponding to document_vectors_l[i]. If any item in new_ids_to_consider is removed, that correspondence is broken, and document_vectors_l[i] will no longer be the vector that new_ids_to_consider[i] maps to, so the loop after this will be wrong. That is why I commented it out. The same issue is discussed in NLP C1_W4_Assignment UNQ_C21 bug? - #5 by Jack_Changfan.
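
Here is a tiny, self-contained sketch of the misalignment I mean (toy ids and strings standing in for vectors, not the assignment's data):

    # ids and vectors start out aligned by position
    new_ids_to_consider = [10, 25, 42, 77]
    document_vectors_l = ["vec_10", "vec_25", "vec_42", "vec_77"]

    doc_id = 25
    new_ids_to_consider.remove(doc_id)  # only the id list shrinks

    # positions no longer line up: index 1 is now id 42 but still "vec_25"
    for i, new_id in enumerate(new_ids_to_consider):
        print(new_id, document_vectors_l[i])
    # 10 vec_10
    # 42 vec_25   <-- wrong pairing
    # 77 vec_42   <-- wrong pairing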

And that is also why I tried to get the vectors from the source, document_vecs, rather than from id_tables and hash_tables in the test code I added. I suspect the expected answer was generated from tables that had already been corrupted by the destructive removal in approximate_knn. What do you think?

Thanks.

Hi @Jack_Changfan,

That could be the case. I’ll have it checked. Thanks for pointing it out.

Cheers,
Mubsi

Hi, @Mubsi,
Have you had a chance to check whether my question about the correctness of the expected result is valid? I’m about to submit my assignment, but I don’t want to submit a solution I think is wrong just because its result matches the expected result.
Thanks.
Jack

Hi @Jack_Changfan,

The request was forwarded, and we are looking into it.

As the weekend is here, I wouldn’t recommend you wait; instead, submit your assignment with the solution that gives you “All tests passed” (otherwise the grader will throw an error as well).

Best,
Mubsi

Also having an issue with UNQ_C21.
I’m getting “list index out of range” on the last line of the code:
nearest_neighbor_ids = [ids_to_consider_l[idx] for idx in nearest_neighbor_idx_l]

Trying to figure out where the code above has gone off track.
Fast considering 77 vecs
vecs_to_consider_arr.shape = (77, 300)
nearest_neighbor_idx_l = [ 0 8 26]

All the code above passes all tests and matches the expected outputs.
Not really sure where to begin troubleshooting.
Not sure if this has anything to do with the other comments regarding removing the doc_id.
if doc_id in new_ids_to_consider:
    # new_ids_to_consider.remove(doc_id)

Commenting that out doesn’t change my outcome.

Hi, @yodester
Be aware that removing doc_id from new_ids_to_consider actually changes the list in id_table globally and makes it inconsistent with hash_table. So, after you comment that line out, you’ll have to re-run create_hash_id_tables (Section 3.6) to recover the original hash tables.
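
To see why, here is a minimal standalone example (toy data, not the real tables): because new_ids_to_consider is assigned directly from id_table[hash_value], calling .remove() on it mutates the table entry itself.

    # toy table for a single hash bucket
    id_table = {0: [10, 25, 42]}
    hash_value = 0

    new_ids_to_consider = id_table[hash_value]  # same list object, not a copy
    new_ids_to_consider.remove(25)

    print(id_table[0])  # [10, 42] -- the global table has changed too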

Thanks for the reply. I think I have an error somewhere else/too.
I commented it out, restarted the kernel, cleared the outputs, and reran everything. Same out-of-range error.

It did change this.
nearest_neighbor_idx_l = [0 7 1].

In some other threads, one of the suggestions was to download a fresh copy of the notebook, but after redoing all the code (filling in the Nones), I got the same results.
My lab ID is yjsgavns in case a mentor needs to look.

Found my mistake.
# append the new_id (the index for the document) to the list of ids to consider
# new_ids_to_consider.append(new_id) # wrong one
ids_to_consider_l.append(new_id) # correct one
I was appending to the wrong list.
I need to read the directions a little more closely next time. DOH!

Also, commenting out that remove line is a mistake.
When I put it back in, rebuilt the hash tables, and reran the whole notebook, I submitted it for grading and got 110/110.

Hi, @yodester
Regarding your comment about commenting out the remove line: I understand that with that line in place, the result matches the expected result. That is exactly why I was asking @Mubsi whether the expected result should be reviewed; I suspect it was generated from a hash table that was already corrupted by the incorrect removal.
Think about it: when we remove an id from new_ids_to_consider, it actually changes the list id_table[h], but the corresponding vector list in hash_table[h] is not changed. So id_table[h][i] is no longer the id of hash_table[h][i], yet the next loop still assumes the two match. That would be a mistake. Also, removing the id from id_table makes the query destructive, i.e., later queries would return different results. I don’t think that is correct.
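
As a quick toy illustration of the destructive part (a hypothetical helper, not the assignment code): after the first call the table has permanently lost an id, so every later query sees a smaller candidate set.

    id_table = {0: [10, 25, 42]}

    def toy_candidates(doc_id, table):
        ids = table[0]          # alias of the table entry, not a copy
        if doc_id in ids:
            ids.remove(doc_id)  # permanently shrinks the table
        return list(ids)

    print(toy_candidates(25, id_table))  # [10, 42]
    print(toy_candidates(10, id_table))  # [42] -- 25 is gone for all later queries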

I think that since new_ids_to_consider is just a reference to the list in id_table (not an independent copy), .remove() on it does modify the original as well, and that has the ripple effects you noticed. But not submitting an assignment because it might contain ‘wrong’ code seems self-defeating. IMO the point of the unit tests and grader is to demonstrate the ability to match deeplearning.ai’s expected results. You can’t change the boilerplate code and then be surprised the tests fail. In the past it has sometimes taken multiple years for class source code to get updated, so I say: if you understand the principles and know how to satisfy the grader, report the issue and move on.

Hi, @ai_curious @Mubsi
Yes, I’ve submitted my assignment and got the “expected result,” which I believe is incorrect. Thanks for your suggestions. Meanwhile, I still think it is important to have this problem clarified. If the expected result is incorrect, it is better to correct it as soon as possible, or more people will learn it the wrong way. If I’m the one who is mistaken, let me know so I can correct my misconception.
Thanks.
Jack

If your theory is correct, shouldn’t replacing this line

# get the subset of documents to consider as nearest neighbors from this id_table dictionary
new_ids_to_consider = id_table[hash_value]

with this one

# get the subset of documents to consider as nearest neighbors from this id_table dictionary
new_ids_to_consider = copy.deepcopy(id_table[hash_value])

break the unit tests? Because now the user-produced outputs and the test case’s expected outputs are created with different inputs to nearest_neighbor()? That’s not the behavior I see…

# Test your function
w4_unittest.test_approximate_knn(approximate_knn, hash_tables, id_tables)
...
 All tests passed

No, even if you deep-copy that list, it is still not correct: the loop after it would still incorrectly pair document_vectors_l[i] with new_ids_to_consider[i]. You would also need to deep-copy document_vectors_l and delete the corresponding item from it.
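
Something roughly like this is what I have in mind (a sketch reusing the loop’s variable names; I’m assuming document_vectors_l comes from hash_table[hash_value], and this is not tested against the grader):

    import copy

    # work on copies so the global tables are never mutated
    new_ids_to_consider = copy.deepcopy(id_table[hash_value])
    document_vectors_l = list(hash_table[hash_value])

    # remove the query document from BOTH lists at the same position,
    # so new_ids_to_consider[i] still corresponds to document_vectors_l[i]
    if doc_id in new_ids_to_consider:
        i = new_ids_to_consider.index(doc_id)
        new_ids_to_consider.pop(i)
        document_vectors_l.pop(i)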

I just tried. If I remove the corresponding item from document_vectors_l, the test fails as well.

Thanks.
Jack

Yeah I agree. Initially I thought the issue was that the id_table was being modified. But now I see it’s actually that after modifying the new_ids_to_consider list, the wrong embeddings are carried forward because the document_vectors_l list still has the original length and order.

Hi @Jack_Changfan,

Thank you for pointing this out. Turns out, you are right.

We shall be updating everything soon.

Thanks,
Mubsi