For the week 1 assignment of this course, one of the graded function tasks was to split the image dataset into ‘Training’ and ‘Validation’.
The following is my code:
def split_data(SOURCE_DIR, TRAINING_DIR, VALIDATION_DIR, SPLIT_SIZE):
    """
    Splits the data into train and test sets
    Args:
    SOURCE_DIR (string): directory path containing the images
    TRAINING_DIR (string): directory path to be used for training
    VALIDATION_DIR (string): directory path to be used for validation
    SPLIT_SIZE (float): proportion of the dataset to be used for training
    Returns:
    None
    """
    source_images = os.listdir(SOURCE_DIR)
    for img in source_images:
        if os.path.getsize(SOURCE_DIR+img) == 0:
            print(f"{img} is zero length, so ignoring.")
            source_images.remove(img)
    train_sample = random.sample(source_images, int(len(source_images)*SPLIT_SIZE))
    for img in source_images:
        if img in train_sample:
            copyfile(SOURCE_DIR+img, TRAINING_DIR+img)
        else:
            copyfile(SOURCE_DIR+img, VALIDATION_DIR+img)
But this failed one of the test cases.
The test case:
Failed test case: incorrect number of (training, validation) images when using a split of 0.5 and 12 images (6 are zero-sized).
Expected:
(3, 3),
but got:
(4, 5).
I don’t understand why it failed. Could anyone help me understand the error in my code logic here?
You posted this under General Discussion. I don’t recognize the code from DLS, which is the specialization I’m familiar with. You’ll have better luck getting the attention of the right people if you post it in the right category for the specialization and course you’re taking. You can move it by using the little “edit pencil” on the title.
Thank you for the edit suggestion.
I have made the change accordingly.
This is my first time navigating through this community page.
Removing objects while iterating through the same collection is a bad idea.
Please use a separate list for those images whose size is not zero.
Here’s an example where removing an object while iterating over the same list produces incorrect results:
>>> l = [str(i) for i in range(10)]
>>> import random
>>> random.shuffle(l)
>>> l
['6', '2', '4', '8', '9', '3', '7', '0', '1', '5']
>>> for item in l:
...     if int(item) % 2 == 0:
...         l.remove(item)
...
>>> l
['2', '8', '9', '3', '7', '1', '5']
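The same skipping would account for the (4, 5) in the failed test. Here’s a minimal sketch that simulates it without touching the file system. The file names, the `sizes` dict, and the assumption that the six empty files come first in the listing are all made up for illustration (`os.listdir` doesn’t guarantee any particular order):

```python
# Hypothetical stand-in for the test directory: 12 files, the first 6 empty.
sizes = {f"img{i}.jpg": (0 if i < 6 else 100) for i in range(12)}

source_images = list(sizes)  # plays the role of os.listdir(SOURCE_DIR)
for img in source_images:
    if sizes[img] == 0:  # plays the role of os.path.getsize(...) == 0
        source_images.remove(img)  # mutates the list being iterated over

# After each removal the loop skips the next element, so 3 empty files survive.
n_total = len(source_images)
n_train = int(n_total * 0.5)
print(n_total, (n_train, n_total - n_train))  # 9 (4, 5)
```

With 9 names left instead of 6, a split of 0.5 gives 4 training and 5 validation images, and some of those are still zero-sized.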
I’m trying to understand why this fails.
Is it because:
- The for loop iterates over a list and internally stores an incrementing index value.
- The list that is being iterated over is losing length from the .remove() call.
- The for loop increments the index, which ends up skipping a value in the list since a value was removed?
Yes, that is the point that Balaji was making in his earlier response. The list is changing underneath you as you iterate. You should build a new separate output list to hold the things that you don’t want to delete.
Or better yet, think of ways to implement this without an explicit for loop, either by “logical indexing” or by using a Python “enumeration” (more properly called a list comprehension). The comprehension is really a loop, but you can express it more simply, and it supports the idea of subsetting the list because the subsetting operation does not happen “in place”. It’s the “in-place-ness” that’s getting you in trouble with the current implementation.
I experimented a bit and I have not yet figured out how to get “logical indexing” to work with lists, but it works with arrays. Here’s an example:
import numpy as np
np.random.seed(42)
A = np.random.randint(0, 10, (12,))
print(type(A))
print(f"A = {A}")
B = A[A < 6]
print(type(B))
print(f"B = {B}")
<class 'numpy.ndarray'>
A = [6 3 7 4 6 9 2 6 7 4 3 7]
<class 'numpy.ndarray'>
B = [3 4 2 4 3]
Of course this is not exactly the problem you are trying to code, but I’m just showing the technique without writing out your solution for you. The point of that technique is how clean and expressive the code is.
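For plain lists, one thing that comes close to a boolean mask is `itertools.compress` from the standard library. This is only a sketch of the idea: the values are copied from the array output above, and `mask` and `selected` are names I made up for the illustration.

```python
from itertools import compress

values = [6, 3, 7, 4, 6, 9, 2, 6, 7, 4, 3, 7]  # same numbers as A above
mask = [v < 6 for v in values]                  # one boolean per element
selected = list(compress(values, mask))         # keep items whose mask entry is True
print(selected)  # [3, 4, 2, 4, 3]
```

Like the array version, this builds a new list and leaves `values` untouched.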
Here’s an example of how to use an enumeration (a list comprehension) that works with a list:
Alist = A.tolist()  # plain Python ints, so the list prints the same on any NumPy version
print(type(Alist))
print(f"Alist = {Alist}")
C = [x for x in Alist if x < 6]
print(type(C))
print(f"C = {C}")
<class 'list'>
Alist = [6, 3, 7, 4, 6, 9, 2, 6, 7, 4, 3, 7]
<class 'list'>
C = [3, 4, 2, 4, 3]
That code has the advantage that it does not use “in-place” operations on the input list. I guess it’s a matter of taste whether you think that’s cleaner code than explicitly writing it as a for loop that appends to a new output list under the appropriate condition. But it is pretty “pythonic” FWIW.
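Applied to the file-size check, the two styles might look like the sketch below. It covers only the filtering step, not the whole assignment, and the temporary directory and file names are made up so that the snippet runs on its own. It also uses `os.path.join` rather than string concatenation, which is an assumption on my part about how you would want to build the paths.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as source_dir:
    # Hypothetical test data: 12 files, 6 of them empty.
    for i in range(12):
        with open(os.path.join(source_dir, f"img{i}.jpg"), "wb") as f:
            f.write(b"" if i < 6 else b"x")

    all_files = os.listdir(source_dir)

    # Style 1: explicit loop that appends to a new output list.
    non_empty_loop = []
    for name in all_files:
        if os.path.getsize(os.path.join(source_dir, name)) > 0:
            non_empty_loop.append(name)

    # Style 2: the same filter written as a list comprehension.
    non_empty_comp = [
        name for name in all_files
        if os.path.getsize(os.path.join(source_dir, name)) > 0
    ]

n_train = int(len(non_empty_comp) * 0.5)
print(len(all_files), len(non_empty_loop), len(non_empty_comp))  # 12 6 6
print((n_train, len(non_empty_comp) - n_train))                   # (3, 3)
```

Neither style modifies `all_files` while looping over it, so every file gets checked exactly once.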