Assignment scoring

Hi,
In order to understand how the assignment scoring system works, I made some assumptions (in no particular order):
1. The assignment produces correct output.
2. Formulae are interpreted correctly.
3. Instructions and guidance are followed accurately.
4. The code is written in precisely the same way, and in the same sequence, as a ‘master’ or template.
5. Variables are assigned in the same sequence as in the template.
Can you advise if this is correct or if there are other criteria, please?
Regards
Ian

I am not sure I understand the question, but I think you are heavily “over-assuming” here. With the disclaimer that I do not have any visibility into the actual mechanisms of the grader, here’s how I believe it works (based on quite a bit of actual experience and observation):

It runs all your functions from the notebook individually with its own test cases, which may well be different from the test cases in the notebook. It judges success only by whether the answers your code generates on those test cases agree with the “known correct” answers that it uses as the standard.

That’s it. It does not care how many lines of code you write, whether your code is vectorized or not, or any other particulars of how you wrote it. It literally does not do any source code analysis: it just runs the code and checks the results.
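
To make that concrete, here is a minimal sketch of the run-and-compare idea. The function name and test values are made up for illustration; the real grader’s tests and internals are not visible to us.

```python
import numpy as np

# Hypothetical illustration only: this just shows the
# "run the function, compare the output" idea.

def learner_sigmoid(z):
    """Stand-in for a function you would submit."""
    return 1 / (1 + np.exp(-z))

def grade_one_case(func, test_input, expected):
    # The grader calls your function on its OWN input ...
    result = func(test_input)
    # ... and only checks whether the output matches the known-correct answer.
    return np.allclose(result, expected)

# A "hidden" test case, different from the one in the notebook:
hidden_input = np.array([0.0, 2.0, -1.0])
hidden_expected = 1 / (1 + np.exp(-hidden_input))

print(grade_one_case(learner_sigmoid, hidden_input, hidden_expected))  # True
```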

Note an important logical conclusion one can draw from the above:

Passing all the tests in the notebook is a necessary, but not sufficient, condition for getting full points from the grader. In many cases there is only one test in the notebook, so it’s still possible to pass those local tests with code that is not “general”, e.g. by hard-coding dimensions, referencing global variables, and the like.
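
Here is a hypothetical example of that failure mode (the function name and sizes are invented, not taken from any particular assignment):

```python
import numpy as np

def initialize_zeros_hardcoded(n_x):
    # BUG: ignores n_x and hard-codes the dimension that happens to
    # match the notebook's one public test.
    return np.zeros((3, 1))

def initialize_zeros_general(n_x):
    # Correct: uses the argument, so it works for any size the grader picks.
    return np.zeros((n_x, 1))

# The notebook's single public test (say n_x = 3) passes for both versions:
assert initialize_zeros_hardcoded(3).shape == (3, 1)
assert initialize_zeros_general(3).shape == (3, 1)

# The grader's own test (say n_x = 5) only passes for the general version:
print(initialize_zeros_hardcoded(5).shape)  # (3, 1)  -> grader rejects this
print(initialize_zeros_general(5).shape)    # (5, 1)  -> grader accepts this
```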

The other thing that has been experimentally observed is that the grader environment is less forgiving than the notebook environment in some ways. As you are no doubt aware, indentation is a critical part of the syntax in Python. The notebook seems not to care if you sometimes mix tabs and spaces to achieve the correct indentation, but the grader apparently does not allow that, meaning that it’s possible to get a syntax error from the grader with a notebook that runs successfully for you locally. The grader also depends on parts of the notebook’s infrastructure beyond your specific lines of code, so it is possible to cause an overall grader failure by modifying other parts of the notebook. It’s not illegal to modify things outside the bounds of the “YOUR CODE HERE” blocks, but extreme care is required.
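
If you want to see the kind of error mixed indentation can trigger, here is a small standalone demonstration of standard Python 3 behavior. Whether a particular notebook kernel or grader image tolerates it is a separate question; this only shows what the language itself can complain about.

```python
# Compile a tiny function whose body mixes a tab and spaces for indentation.
source_with_mixed_indentation = (
    "def f(x):\n"
    "\ty = x + 1\n"        # first body line indented with a tab
    "        return y\n"   # second body line indented with eight spaces
)

try:
    compile(source_with_mixed_indentation, "<submission>", "exec")
except IndentationError as err:   # TabError is a subclass of IndentationError
    print(type(err).__name__, "-", err)
```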

On the question of assigning variables in a pre-determined order, note that the grader does not care about that either. Of course correctness dictates some of that (no forward references), but there is one very specific case in which order matters for correctness in a way that you cannot deduce from the structure of the computation graph:

If you are using the numpy random calls to initialize variables, the notebook code sets the seed for you, so that the sequence generated is predictable. But that also assumes you initialize the variables in the same order that the test case in the grader does; otherwise you will get the predictable values assigned to different variables and will thus fail the grader. E.g. if you’ve got a loop initializing W^{[l]} and b^{[l]} for each layer, you’ll get different answers if you do b^{[l]} first at each layer. This should make sense based on our other recent conversation about how seeds work with the PRNG functions.
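
Here is a toy sketch of that effect. The layer sizes are made up, and b is drawn from the PRNG here purely to show the draw-order issue, regardless of how a particular assignment initializes it.

```python
import numpy as np

def init_layer_W_then_b(seed=1):
    np.random.seed(seed)
    W1 = np.random.randn(3, 2) * 0.01   # consumes the first 6 values in the stream
    b1 = np.random.randn(3, 1) * 0.01   # consumes the next 3
    return W1, b1

def init_layer_b_then_W(seed=1):
    np.random.seed(seed)
    b1 = np.random.randn(3, 1) * 0.01   # same calls, opposite order
    W1 = np.random.randn(3, 2) * 0.01
    return W1, b1

W_a, b_a = init_layer_W_then_b()
W_b, b_b = init_layer_b_then_W()

# Same seed, same total randomness consumed, but the values land in different
# variables, so a comparison against a fixed "expected" answer fails for one order.
print(np.allclose(W_a, W_b))  # False
```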

The only thing I might add to the typically insightful replies from @paulinpaloalto is that type matters, too, in addition to value. That is, if the expected output is a scalar value but you return a list of length 1, the grader will likely beef, regardless of whether that single element has the expected value or not. I know this was true prior to 2021 because I worked on the code for a bit, and I assume from comments in this forum that it is still true. So the type has to match, and if the type matches, the values must match, or be within tolerance for numerical results. See math.isclose()
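
For instance, a toy illustration (not grader code, and the values are invented):

```python
import math

expected = 0.6931471805599453          # a scalar float

returned_scalar = 0.6931471805599454   # tiny rounding difference: fine
returned_list = [0.6931471805599453]   # right value, wrong type

print(math.isclose(returned_scalar, expected))   # True: within default tolerance
print(type(returned_list) == type(expected))     # False: list vs float, so a
                                                 # type-checking grader would reject it
```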

Yes, that is a good point! Typically the grader applies three separate evaluations to each returned data item (a rough sketch follows the list):

  1. The type (scalar float, array, list …).
  2. The shape, if it is a compound type like an array or list. E.g. the dimensions of an array or length of a list.
  3. The actual values. As ai_curious mentioned above, for floating point values, they typically use numpy “isclose” or “allclose” so that there is some room for different implementations that give different rounding behavior.
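
Here is the rough sketch mentioned above. It is only an illustration of those three checks under my own assumptions about the details, not the grader’s actual code.

```python
import numpy as np

def check_output(got, expected):
    # 1. Type
    if type(got) != type(expected):
        return "Wrong type: expected {}, got {}".format(type(expected), type(got))
    # 2. Shape (or length) for compound types
    if isinstance(expected, np.ndarray) and got.shape != expected.shape:
        return "Wrong shape: expected {}, got {}".format(expected.shape, got.shape)
    if isinstance(expected, list) and len(got) != len(expected):
        return "Wrong length: expected {}, got {}".format(len(expected), len(got))
    # 3. Values, with floating point tolerance for numerical results
    if isinstance(expected, (np.ndarray, list)):
        return "OK" if np.allclose(got, expected) else "Wrong values"
    return "OK" if np.isclose(got, expected) else "Wrong value"

expected = np.array([[0.5, 0.25]])
print(check_output(np.array([[0.5, 0.25]]), expected))   # OK
print(check_output(np.array([0.5, 0.25]), expected))     # Wrong shape
print(check_output([[0.5, 0.25]], expected))             # Wrong type
```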

The above is usually true for the public tests in the notebook as well. So if you pass two of those checks for a given variable and fail the third, that’s not a big reason for celebration (it’s a pretty low bar, in other words).

It is also worth mentioning that the grader does not usually give the same granular level of output that the public tests in the notebook give. So in a case where you fail one of the above three checks for a particular value, the grader typically just tells you that the whole function under test is incorrect. And in the worst case, it says that the tests you failed are “hidden” and doesn’t even tell you which function is incorrect. If you hit a case like that, please ask for help on Discourse.

Thanks for all the responses above, quite enlightening.
I ask purely out of interest.
Understanding how a system assesses something like software is quite interesting, and it lets me know whether the system is really marking my understanding of the topic or simply identifying a few key features to score against.
I really don’t want to experiment with something that crucial to the course and my scores, so it was easier to pose the question.

Here is how it worked as of 2021; I cannot imagine it has changed all that much. When you submit a notebook, it is stored in a special directory called student_submissions or similar. The file is read and the Jupyter cell types are inspected for cells of type code, i.e. not markdown. Inside gradeable code cells are tags with unique cell names. The unique cell names are used to look up blocks of associated grader code. The grader code parses and executes the gradeable cell using the Python exec() function (see Built-in Functions in the Python documentation).

The output of that execution is then compared as described in the above responses. There is no examination or evaluation of the code for aesthetic or engineering quality, best practices, intent, etc. In my distant previous lives I have built and used automated code quality tools that attempted that kind of code review, but it is non-trivial since there are so many ways to implement the same thing. In any case, to the best of my knowledge it isn’t done in this course. What you may see is a comparison of the generated output to commonly made mistakes and a hint about it, e.g. “Did you forget to normalize the inputs?”. HTH
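
As a very rough sketch of that flow: an .ipynb file is just JSON, so the idea is to read it, pick out the tagged code cells, run them with exec(), and compare what they produce. The path, metadata key, and check table below are simplified guesses for illustration, not the actual grader code.

```python
import json

def run_graded_cells(notebook_path, grader_checks):
    with open(notebook_path) as f:
        notebook = json.load(f)

    namespace = {}
    results = {}
    for cell in notebook["cells"]:
        if cell["cell_type"] != "code":        # skip markdown cells
            continue
        source = "".join(cell["source"])
        exec(source, namespace)                # execute the learner's code
        cell_name = cell.get("metadata", {}).get("name")  # hypothetical tag location
        if cell_name in grader_checks:
            # Compare what the cell produced against the known-correct answer.
            results[cell_name] = grader_checks[cell_name](namespace)
    return results

# Example check for a hypothetical cell named "ex_2" that defines sigmoid():
checks = {"ex_2": lambda ns: abs(ns["sigmoid"](0.0) - 0.5) < 1e-9}
# run_graded_cells("student_submissions/submission.ipynb", checks)
```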
