More detail about multivariat_gaussian numpy implementation

Is there a more detail explanation about the multivariat_gaussian utils numpy implementation for C3W1A2?

def multivariate_gaussian(X, mu, var):
    k = len(mu)
    if var.ndim == 1:
        var = np.diag(var)
    X = X - mu
    p = (2* np.pi)**(-k/2) * np.linalg.det(var)**(-0.5) * \
        np.exp(-0.5 * np.sum(np.matmul(X, np.linalg.pinv(var)) * X, axis=1))
    return p

Seems not so intuitive when comparing with the formula introduced in lecture.

1 Like

Hi @Feihong_YANG,

The assignment developer was trying to implement the Multivariate Gaussian distribution in a vectorized way.

The Multivariate Gaussian distribution can be reduced to the lecture’s formula because the covariance matrix \Sigma is diagonal. In other words, they are equivalent because the covariance matrix is diagonal.

It looks different from even the Multivariate Gaussian distribution, such as np.sum is used and no transpose is done to the X on the left hand side of the inverse covariance matrix, because it wants to compute all samples in X in a vectorized way.

Here are my suggestions for you to dig into it:

  1. inspect any unfamiliar numpy function and check out its official docs

  2. to verify, create a set of simple and small X, mu and var, and compare that implementation’s result with either your hand calculation based on the lecture’s formula, or some library’s (like scipy) implementation

  3. to see how it does it vectorized-wise, you might need to inspect the code piece by piece with your simple X, mu and var, after you verified as step 2 suggested to build up the confidence.

Good luck!

1 Like

Hey @rmwkwok

I saw you explanation about regarding the vectorized implementation, actually the most difficulty for me to understand it is the mapping to linear algebra. I just spent for a while and looked up a few resources to better understand determinant and finally got the geometric meaning is the only way to map the implementation to the original formula. For now going to the exponential part, since it’s my first time to see pinv, might spend more time to figure it out …

1 Like

I see. Thank you, @Feihong_YANG, for sharing your progress, which is important because I could not have guessed precisely determinant was a problem.

Just in case it helps, I am comparing them like this:

\Sigma is a covariance matrix, so it is easier to be compared with by rewriting \sigma to (\sigma^2)^{1/2}, where \sigma^2 is variance and can be compared to the elements in a covariance matrix.

I believe the key of pinv is to compute the inverse of matrix, and as for how it computes, it may be less important.


1 Like


It may also be helpful to expand the exponent term in a 2 dimenional case (k=2), so that, under the assumption of a diagonal covariance matrix, you see how it reduces back to two univariate Gaussian terms.

If you expand the exponent term, you will find, besides two familiar univariate terms, an additional term that is controlled by the covariance of the 2 dimensions.

With the covariance equals to zero, that additional term is completely removed, leaving us only the univariate terms. In such a case, if we draw the PDF out, it will be a regular and “circular” bell:

With non-zero covariance, that additional term remains, and thus stretches the shape of the PDF from a “circular” bell to a “elliptical” one:


If you just look at the first column, where only the covariance (correlation) changes, and if we compare the first and the third graphs, we see that in the third, as x_1 becomes large, the probability of x_2 shifts from peaking at zero to peaking at a larger value. Shifting to a larger value (but not smaller) reflects the positive covariance, and such shifting is explained by the additional term. So, if you want to see the additional term, then try to expand the exponent.

Let me know if you are not sure about that expansion.


1 Like

Hey @rmwkwok
Thanks for providing this PDF image! Indeed I had stuck on how the pinv mapped to the formula introduced in the lecture. After explore some concept like svd, mp inv, eig val/vec, det, rank, etc I have finally found the code exactly performing the same computation as the formula provided in the lecture, but still confused why did it provided in this complicate way?

Such as why did it uses det(diag(var)) rather than just prod(var)?
And since diag(var) is a square diagnal matrix, why it compute with pinv rather than simply inv? Even simply diag(1/var) which w/o computing the inverse matrix can easily get what you want right? Per my understanding both inv and pinv are high computation intensity operation which require like det or decomposition, but what you want is just the inverse of variance for each feature…

Actually I’m a little lost while you introducing covariance matrix and I didn’t find it in the lecture formula

Also hand computed the image for multivariant formula, if I didn’t make it wrongly you x is a column vector for a single data point, but how did you get the covariance of 2 dimensions? What did you mean “under the assumption of a diagonal covariance matrix”? How did you get the covariance term of 2 dimensions if this assumption is not hold?

1 Like

Hi @Feihong_YANG,

I see a lot of confusions from your message :smiley:, but that also lets me understand the situation better.

Let’s start from something more understandable to anyone who is OK with the lecture’s content.

  1. The assignment could have implemented the multivariate gaussian distribution as presented in the slide you shared:
    In that way, there will be no matrix, no determinant, no inverse of matrix, and nothing suprising. However, it requires to compute sample-by-sample and feature-by-feature, which is not exploiting vectorization. If you have taken course 1 & 2, you know we used many for-loops, and while they work, they are slower.

  2. You questioned many times why they did it this way or that way. To me, they are all because we want to vectorize it, and vectorize it for speed and for code clarity ( :stuck_out_tongue_winking_eye: I know, you may disagree).

  3. Don’t go too deep into question like “why it compute with pinv rather than simply inv?”. As long as you verify that it works, you can use it. Considering what we are doing, there is no need for us to understand all possible varieties before picking one. Verify → use → verify → use, ok? We are not studying algorithms for computing inverse of matrix, and I am not going down that way too.

1 Like

@Feihong_YANG, even we were sitting at the same table, discussing things face-to-face, it was still not possible for us to address all of your confusion like that, we need to sort them.

Here are my suggestions:

  1. You do it your preferred way, by using the lecture’s formula.

  2. Find your way to accept the multivariate gaussian distribution form. The form is here - look for its PDF form.
    [Below is completely optional]
    If you are familiar with probability, transformation of probability density function, and matrix algebra, and if you are eager to read some less rigorously presented “proof”, here is one.
    You go from the case of uncorrelated features such that it is legal to just multiply those probabilities up, to “Jacobian tranforming it” with Y = AX + b where A messes those uncorrelated features up and make them correlated.

  3. Plug numbers into the lecture’s formula, into the multivariate gaussian formula, into the assignment’s code, and verify that all of them give the same result.

Make sure you have done these three steps first. See it for yourself that all of them deliver the same result.

1 Like

Exploit vectorization. Can you think of an alternative that exploits vectorization - I am not kidding, only when you have seriously thought about it, can you finally feel it yourself.

The proof of the pudding is in the eating.

(I hope it is big enough to catch attention :wink: )

Do it your way, if you know inv will work.

We may finally find no wisdom digging too much into that question, so I am avoiding that.

You verify it. I don’t answer questions that you can verify yourself.

Thanks for the feedback!

I recommend to you this reading. It uses the term “variance-covariance matrix” in place of my “covariance matrix”, but they are talking about the same thing.

However, I must warn you, if you do the reading, then you are about to jump into learning probability. You don’t have to learn it as long as your goal is to verify that the assignment’s formula is consistent with the lecture’s formula. To achieve that goal, you only need to follow my three steps in the last post.

  1. the assumption has to hold as long as we are discussing the assignment.

  2. Outside of the assignment, if it does not hold, I will assume the covariance as c, just like we assume the variance to be \sigma^2.

  3. go through the suggested reading for how it is treated.


1 Like

Hi @rmwkwok

Thanks for providing these steps and materials. I have just finished reading the first two links, since my knowledge of linear algebra still not advance enough and I didn’t take multivariable calculus systematically, it did take my some time to figure it out the Jacbian tansforming and how the Y = AX + b mapping to multivariant gaussian distribution, thus have not step into the third one yet.

Let me see if I made it correct. For each row of Ai, it represent for “How much of each N(0,1) value in X factor in the corresponding variable in Y”, thus it’s a sdd vector and matmul(A, A.T) is the one you mentioned covariance matrix which is applied directly in the code as function signature. I think the author of code read the similar material thus implemented in this specific way from many possibilities (even though all comply with vectorization), will go back to this if any tricky found in the third material.

Don’t go too deep into question like “why it compute with pinv rather than simply inv?" …
We may finally find no wisdom digging too much into that question…

Actually the motive for me to explore the library code is to find standard way of implementation. Currently I’m planning to pursue a machine learning opportunity in the industry, like MLE…, and coding would be a big challenge. Since I didn’t find to many material of platform to practice for machine learning algorithm like common code challenge like Leetcode, the course code is what I have to start with. I think the code is professional enough if I apply it to crack the code challenge at least in the interview phase, thus I’m trying to understand it better, hope this make sense :upside_down_face:

1 Like

Hi @Feihong_YANG

Yes. This makes “each value” in Y correlated with each other, in contrast to uncorelated “values” in X.

Not really. I suppose SDD means Standard Derivation.

First, A is a matrix, and its component ain’t standard derivations. For example, the diagonal elements in AA^T are not a result of just squaring any element in A.

I see. I suppose one sensible thing to do is to find out how your dream jobs’ coding tests are like, but it is a good idea to work your way through others’ codes. Speaking of that, you probably want more diverse sources than just this course, so, besides leetcode, I could also find some other platforms by googling “leetcode like platforms”. On the other hand, if you want to challenge yourself with some more open ended questions, perhaps contributing in stackoverflow would be an option too. I used stackoverflow for practicing too.


Hi @rmwkwok

if you want to challenge yourself with some more open ended questions, perhaps contributing in stackoverflow…

Aha, I saw it as a Q&A platform by community and I did used to apply it as a platform to find solution why get stuck, so I guess you challenge mean try to answer question post as a practice right? :rofl:

I suppose one sensible thing to do is to find out how your dream jobs’ coding tests are like…

I did have chance to participate a company interview for MLE and messed it up badly since not proficiency enough with neither ML algorithm / metrics implementation through code nor ML system design. It’s something like finish the partial implemented code for gradient decent and regularization, take care of OOP and metrics calculation in a vectorized way, I can share it for more insight if you want … the awful experience is the main motive to take the courses… :rofl:

Not really. I suppose SDD means Standard Derivation.

Actually I meant the values (components) of AX before sum up horizontally… Thus it’s not SDD of Y, perhaps it’s so called “SDD” since not present during matmul…


Hi @Feihong_YANG,

Yes - answer ML/coding questions there.

Do you think you can finish those coding test questions now?


Hi @rmwkwok
It’s been familiar to me for the code I have to finish the questions.
But not sure if it’s a new one and more than 100 lines of code provided for a specific ML model algorithm class, it doesn’t take me too much time to local which line of the code need to be updated to make like maybe the fit function perform properly under the pressure of timing and interview env :upside_down_face:

Hello @Feihong_YANG

I see. Then I think you are doing some right things - inspecting others’ (e.g. course’s) codes, although just by reading is not the most useful.

I probably would split my day into some parts:

  1. my own ML projects. From your interview experience - ML algorithm / metrics / gradient descent / regularization / vectorization - it is obvious that they are some basics, and we can be familiar to them only after we do them ourselves. You may do projects with existing libraries, and you may do from scratch. The more from-scratch it is, more challenging on how you write efficient code and understanding of basics. The more libraries-dependent it is, more experienced you are with the tools. You probably want all of them but need to pick one side to start with, and compare that side’s result with the other side’s in terms of accuracy and time efficiency.

  2. others’ problems - something for broadening horizons and avoiding trapping yourself by yourself. Identify some interesting ones in StackExchange for Machine Learning, StackOverflow, Google (keywords like machine learning, questions, interview)…

  3. keep gathering info on how your future interviews will be like, and adjust accordingly :wink:


I see. Thanks for the advice, and indeed from-scratch (if I understand it correctly numpy and pandas do not count as tool~ :smiling_imp:) to map formula / algorithm to code is what I want at the moment (from the utilitarianism point of view for cracking the code challenge, they would not ask for I didn’t put tfx, beam, k8s on my CV :face_in_clouds:) … btw have you ever try kaggle? Currently trying to explore for study some from scratch notebooks, seems library based are the majority…

That’s easy, @Feihong_YANG. We just need to think differently - we replace library functions with our own.

For example, this library function does the K-fold CV, that function adds Adam to gradient descent, then we implement them one by one. Set requirements according to your past interview experience: No outside help? Refrain. With numpy? Ok. Using Python Class? Ok. Unit tests? No problem. Performance comparison with the library function? Do it. Finally, of course, look for others’ works or comments.

@rmwkwok Aha that make sense! If that’s still around data science, it should be in the limited scope and won’t be overwhelm. Let me try your solution, I think it should be more efficient than looking from Numpy and Pandas since there are plenty of API and hard to decide which is applied more often and to be a checkpoint~