How to create an array?

In one of our Coursera courses, Natural Language Processing with Sequence Models, I am asked to '# create an array with the indexes of data_lines that can be shuffled'.

The line of code is:

lines_index = [*range(num_lines)]

The function containing it is defined as:

def data_generator(batch_size, max_length, data_lines, line_to_tensor=line_to_tensor, shuffle=True):
    """Generator function that yields batches of data

    Args:
        batch_size (int): number of examples (in this case, sentences) per batch.
        max_length (int): maximum length of the output tensor.
        NOTE: max_length includes the end-of-sentence character that will be added
                to the tensor.  
                Keep in mind that the length of the tensor is always 1 + the length
                of the original line of characters.
        data_lines (list): list of the sentences to group into batches.
        line_to_tensor (function, optional): function that converts line to tensor. Defaults to line_to_tensor.
        shuffle (bool, optional): True if the generator should generate random batches of data. Defaults to True.

    Yields:
        tuple: two copies of the batch (jax.interpreters.xla.DeviceArray) and mask (jax.interpreters.xla.DeviceArray).
        NOTE: jax.interpreters.xla.DeviceArray is trax's version of numpy.ndarray
    """
    # initialize the index that points to the current position in the lines index array
    index = 0
    
    # initialize the list that will contain the current batch
    cur_batch = []
    
    # count the number of lines in data_lines
    num_lines = len(data_lines)
    
    # create an array with the indexes of data_lines that can be shuffled
    lines_index = [*range(num_lines)]
    
    # shuffle line indexes if shuffle is set to True
    if shuffle:
        rnd.shuffle(lines_index)
....
....# some more code
....
            # convert the batch (data type list) to a numpy array
            batch_np_arr = np.array(batch)
            mask_np_arr = np.array(mask)
            
            ### END CODE HERE ##
            
            # Yield two copies of the batch and mask.
            yield batch_np_arr, batch_np_arr, mask_np_arr
            
            # reset the current batch to an empty list
            cur_batch = []

Can someone explain to me what '[*range(num_lines)]' is?
I don't understand the structure of this expression, because I thought that an array could be created with np.array(). Thank you!

That is a great example of both the good AND the bad about Python. The good is that there are always multiple ways to achieve some functional objective, with many terse shortcuts built into the language. The bad is exactly the same thing, because those shortcuts are sometimes hard to unpack (pun intended) precisely because of how tersely they are written.

In this case, it’s probably best to decompose and assess it in parts.

x = []

That expression tells Python to create a new multi-valued object, an empty list.

range(num_lines)

This tells Python to create another multi-valued object, an immutable sequence to be precise, starting at 0 and ending at num_lines - 1.

* plays several roles in Python. Between two operands, it indicates multiplication. Here, as a prefix in front of a single operand, it is the unpacking operator. It extracts, or unpacks, the elements of the sequence produced by range() and places them as individual values inside the [ ].

The end result is a new list whose elements are the integers from 0 to num_lines - 1.
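
To connect this back to the course code, here is a minimal sketch of why a list is built at all. It assumes that rnd in the assignment is Python's standard random module (the import is not shown in the snippet above), and num_lines = 5 is just a made-up value for illustration. The point it tries to show is that a range cannot be shuffled in place, while the list produced by the unpacking can.

```python
import random

num_lines = 5  # hypothetical value; in the course code it is len(data_lines)

# Two equivalent ways to turn the range into a list of indexes
lines_index = [*range(num_lines)]
same_thing = list(range(num_lines))
print(lines_index)                # [0, 1, 2, 3, 4]
print(lines_index == same_thing)  # True

# A range is immutable, so shuffling it in place fails
try:
    random.shuffle(range(num_lines))
    shuffle_failed = False
except TypeError:
    shuffle_failed = True
print(shuffle_failed)             # True

# The list can be shuffled in place; it still holds the same indexes
random.shuffle(lines_index)
print(sorted(lines_index) == same_thing)  # True
```

So the word "array" in the course comment is used loosely: what gets created is a plain Python list, not a NumPy array.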


That’s a great explanation! Maybe one further detail that would add ε more clarity would be to consider what happens if you do not include the * there to do the “unpack” operation. The point is that range(num_lines) is a Python object of type range. So if you just say this without the *:

x = [range(num_lines)]

what you get is a list which contains one item, which is the range. But that’s not what you want. Let’s work a concrete example:

rg = range(8)
print(f"type(rg) = {type(rg)}")
print(f"len(rg) = {len(rg)}")
x = [rg]
print(f"type(x) = {type(x)}")
print(f"len(x) = {len(x)}")
print(f"x = {x}")

Running that gives this:

type(rg) = <class 'range'>
len(rg) = 8
type(x) = <class 'list'>
len(x) = 1
x = [range(0, 8)]

Note that x contains only one element. But now watch this:

x = [*rg]
print(f"type(x) = {type(x)}")
print(f"len(x) = {len(x)}")
print(f"x = {x}")

Running that gives this:

type(x) = <class 'list'>
len(x) = 8
x = [0, 1, 2, 3, 4, 5, 6, 7]

That’s more like it!

Not to belabor the point (well, maybe just a little), but of course there’s a “meta” lesson here: Python is an interactive language. When you see something that looks mysterious, you don’t just have to wonder what it does. You can actually try it for yourself and observe how it works.
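
In that spirit, here are a few more experiments with the star that you could paste into a notebook cell or the Python prompt. This is only a sketch with made-up variable names (combined, first, rest); none of it comes from the course code.

```python
rg = range(3)

# Unpacking into a function call: same as print(0, 1, 2)
print(*rg)

# Several unpackings can be mixed inside one list
combined = [*rg, *range(10, 13)]
print(combined)  # [0, 1, 2, 10, 11, 12]

# The star also works on the left side of an assignment
first, *rest = rg
print(first, rest)  # 0 [1, 2]

# With an operand on each side, * is ordinary multiplication
print(2 * 3)  # 6
```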

Thank you all. I already like this group.