I noticed that using
labels = np.append(labels, line[0])
takes extremely long when used inside the loop over the lines from the csv reader. Instead,
labels.append(line[0])
is much faster.
Why is that?
The behavior of those two operations is different, both in terms of memory management and in terms of the output type.
np.append creates a complete new copy of the input data on every call (the old array is then garbage collected), and it returns a numpy array rather than a list. Called inside a loop, each iteration copies everything appended so far, so the total work grows quadratically with the number of lines.
list.append is an in-place operation: the existing elements are not copied, and each append takes amortized constant time. If the list gets long, this is a lot faster.
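The usual idiom, then, is to collect the values in a plain Python list inside the loop and convert to a numpy array once at the end. A minimal sketch (the `rows` data here is a made-up stand-in for the csv reader loop from the question):

```python
import numpy as np

# Hypothetical data standing in for csv.reader rows
rows = [["cat", 1], ["dog", 2], ["bird", 3]]

labels = []                 # plain list: append is amortized O(1)
for line in rows:
    labels.append(line[0])  # no copying of the existing elements

labels = np.array(labels)   # single O(n) conversion after the loop
print(labels)
```

This keeps the loop linear in the number of rows and still ends up with a numpy array.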
You can verify the copy-versus-in-place difference by using Python's id() function to see whether the object in memory changes or not. Here's a little code block to show what I mean:
import numpy as np

np.random.seed(42)
A = list(np.random.randint(0, 10, (8,)))
print(f"type(A) = {type(A)}")
print(f"A = {A}")
print(f"id(A) = {id(A)}")

# np.append returns a brand-new array; A itself is untouched
B = np.append(A, 42)
print(f"type(B) = {type(B)}")
print(f"B = {B}")
print(f"id(B) = {id(B)}")

# list.append mutates A in place; its id stays the same
A.append(42)
print(f"type(A) = {type(A)}")
print(f"A = {A}")
print(f"id(A) = {id(A)}")
Running that gives this output:
type(A) = <class 'list'>
A = [6, 3, 7, 4, 6, 9, 2, 6]
id(A) = 127705198963040
type(B) = <class 'numpy.ndarray'>
B = [ 6  3  7  4  6  9  2  6 42]
id(B) = 127705198964128
type(A) = <class 'list'>
A = [6, 3, 7, 4, 6, 9, 2, 6, 42]
id(A) = 127705198963040
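The performance gap itself is easy to measure. A hedged sketch (the loop size n is arbitrary, chosen only so the difference is visible):

```python
import time
import numpy as np

n = 5000  # illustrative iteration count, not from the original post

# np.append: every call copies the whole array -> O(n^2) total work
start = time.perf_counter()
a = np.array([], dtype=int)
for i in range(n):
    a = np.append(a, i)
t_np = time.perf_counter() - start

# list.append: amortized O(1) per call -> O(n) total work
start = time.perf_counter()
b = []
for i in range(n):
    b.append(i)
t_list = time.perf_counter() - start

print(f"np.append loop:   {t_np:.4f}s")
print(f"list.append loop: {t_list:.4f}s")
```

On any recent machine the np.append loop should come out orders of magnitude slower, and the gap widens as n grows, which is exactly the quadratic-versus-linear behavior described above.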
For other examples of how in-place operations behave differently even when the contents of the data are the same, here's another thread worth a look. Please read the whole thread, not just the linked post, to see some other surprising results of object and procedure call semantics in Python.
Great, thanks for the explanation.