UNQ_C1 value_counts() vs values

I tried a couple of ways to calculate data leakage between two datasets: one using the Pandas value_function() call, and the other using pandas.Series.values. The former almost works (it fails one of the 6 automated tests). As a learning opportunity, I wanted to understand the difference between the two for a single integer DataFrame column. Thanks.

If you’re talking about check_for_leakage in the C1 W1 assignment, you don’t need to use any of those functions. They showed us a simple way to accomplish this using the python set datatype in one of the optional notebooks.

The cool thing about using “set” is that it solves the two hard problems here: uniqueness within the two lists and then taking the intersection to find the common values. If you want to do that using straight “list” primitives, then you have to come up with your own mechanisms for solving those two issues.
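To make the set idea concrete, here is a minimal sketch. It is not the assignment's reference solution; the data and the column name "patient_id" are made up for illustration.

```python
import pandas as pd

# Hypothetical data: the column name "patient_id" is an assumption, not from the assignment.
df1 = pd.DataFrame({"patient_id": [1, 2, 2, 3]})
df2 = pd.DataFrame({"patient_id": [3, 4, 4, 5]})

# Building a set removes duplicates within each column.
ids1 = set(df1["patient_id"].values)
ids2 = set(df2["patient_id"].values)

# Set intersection keeps only the values present in both columns.
common = ids1 & ids2
has_overlap = len(common) > 0
```

Here the duplicates (2 and 4) disappear when the sets are built, and the intersection contains only the value the two columns share.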

Here’s the root of the Pandas documentation for the Series class. And here is the page for Series.values. It just gives you the contents of the series as an ndarray.

I don’t find anything called value_function in the Pandas docs, but both Series and DataFrame have a value_counts method. Here’s the doc for Series.value_counts. It solves one part of the problem for you: it returns a new Series whose index holds the unique values in a given Series (or column of a DataFrame) and whose values are the counts, sorted in descending order of frequency by default. But it doesn’t really give you any help in comparing the values between the two datasets. You’d still need to sort both results by the “key” value (the index) and then compare.
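A small sketch of what value_counts returns, using made-up values. The unique values end up in the index of the result, ordered by how often they occur rather than by value.

```python
import pandas as pd

# Made-up integers with distinct frequencies so the ordering is unambiguous.
s = pd.Series([7, 7, 7, 2, 2, 9])

counts = s.value_counts()

# The index holds the unique values; the values hold how often each one occurs.
unique_values = counts.index.tolist()
frequencies = counts.tolist()

# To compare against another column you would still have to pull the index
# out and compare it yourself, for example by turning it into a set.
unique_as_set = set(counts.index)
```

In this example the unique values come back as 7, 2, 9 (most frequent first), which is why two value_counts results cannot be compared position by position.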

Using python sets is a lot easier. :nerd_face:


Hi @Venu_Vasudevan

The pd.Series.values attribute returns the underlying numpy array that contains the values of a pandas series. It is a one-dimensional array and can be used to access the values of a pandas series in a similar way as you would use a numpy array.

The pd.DataFrame.values attribute returns a two-dimensional numpy array that contains the values of a pandas dataframe. It can be used to access the values of a pandas dataframe in a similar way as you would use a two-dimensional numpy array.

The pd.Series.values and pd.DataFrame.values are used to access the values of a pandas series and dataframe respectively in a low-level, numpy-like way.

The pd.DataFrame.values will return a 2D numpy array, even if the dataframe has only one column, whereas pd.Series.values will return a 1D numpy array.

In the case of a DataFrame with a single integer column, pd.DataFrame.values returns a 2D array of shape (n, 1), where n is the number of rows, whereas pd.Series.values for that same column returns a 1D array of shape (n,), which is a flattened version of the 2D array.
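The shape difference can be checked directly. This is a sketch with made-up data; the column name "patient_id" is hypothetical.

```python
import pandas as pd

# Hypothetical single-column DataFrame.
df = pd.DataFrame({"patient_id": [10, 20, 30]})

arr_2d = df.values                 # whole DataFrame: one row per record, one column
arr_1d = df["patient_id"].values   # single column selected as a Series

shape_2d = arr_2d.shape
shape_1d = arr_1d.shape

# Iterating over the 2D array yields 1D row arrays, which are not hashable,
# so building a set from it fails; the 1D array works fine.
try:
    set(arr_2d)
    set_from_2d_ok = True
except TypeError:
    set_from_2d_ok = False

set_from_1d = set(arr_1d)
```

Selecting the column with single brackets gives a Series and therefore the 1D array; calling .values on the whole DataFrame gives the (n, 1) array even though there is only one column.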

In your case, you are probably comparing the values of two pandas objects where one gives you a 2D array and the other a 1D array, and that is why you are getting a mismatch.

You can use pd.DataFrame.values.flatten() to convert the 2D array to a 1D array and then compare (pd.Series.values is already 1D, so flattening it changes nothing). You can also use pd.Series.equals() or pd.DataFrame.equals() to compare two pandas objects, but note that these methods expect both objects to be the same kind of object with the same shape and elements, so they do not reconcile a difference in dimensionality for you.
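A short sketch of both options with made-up data (the column name "patient_id" is hypothetical). It shows how flatten() lines up the two arrays and how equals() behaves when the two objects differ in kind or in element order.

```python
import numpy as np
import pandas as pd

# Hypothetical single-column DataFrame and the same column as a Series.
df = pd.DataFrame({"patient_id": [10, 20, 30]})
series = df["patient_id"]

# Flattening the (n, 1) array gives an (n,) array that can be compared directly.
flat = df.values.flatten()
same_values = np.array_equal(flat, series.values)

# equals() compares objects of the same kind, element by element.
series_vs_frame = series.equals(df)      # Series compared with a DataFrame
frame_vs_copy = df.equals(df.copy())     # identical DataFrames
reordered = series.equals(pd.Series([30, 20, 10], name="patient_id"))
```

Because equals() is sensitive to order and shape, it answers "are these two objects identical?" rather than "do these two columns share any values?", which is the question a leakage check asks.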

Hope this answers your question.

Muhammad John Abbas