C3_W1 assignment

hi,

I’m having problems with the C3_W1 assignment and am wondering if you could help.
I have remove_stopwords(sentence) working correctly and am now working on parse_data_from_file().

I am getting the error "AttributeError: 'list' object has no attribute 'lower'"

and it stems from
sentence = sentence.lower()
in the “remove_stopwords(sentence)” function

Somehow the function is being passed a list of strings instead of a single string.

Any suggestions? I can share my code if you would like to have a look.

thank you,
Ed

sentence should be a single Python string, not a list. Please use this hint to fix the calling code.

hi

The "# Test your function" cell works correctly. For:

remove_stopwords("I am about to go to the store and get any snack")

I get:
'go store get snack'

But in:

def parse_data_from_file(filename):
   sentences = []
   ....
   sentences.append(row[1])
   sentences = remove_stopwords(sentences)

The error is:
sentence = sentence.lower()
AttributeError: 'list' object has no attribute 'lower'

Please remove stopwords from each row's text before appending it to sentences. remove_stopwords expects one string, not the whole list.
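A minimal sketch of that advice, assuming a shortened, hypothetical stopword list and a hypothetical helper named clean_rows (the assignment supplies its own list and function names). It cleans each string before it is appended, so the list itself is never passed to remove_stopwords:

```python
# Hypothetical, shortened stopword list; the assignment provides a longer one.
STOPWORDS = {"i", "am", "about", "to", "the", "and", "any"}


def remove_stopwords(sentence):
    """Lowercase a single string and drop the stopwords from it."""
    sentence = sentence.lower()
    return " ".join(word for word in sentence.split() if word not in STOPWORDS)


def clean_rows(rows):
    """Clean the text in column 1 of each row, one string at a time."""
    sentences = []
    for row in rows:
        # row[1] is a str, so .lower() inside remove_stopwords works.
        sentences.append(remove_stopwords(row[1]))
    return sentences


# Made-up row, laid out like the row[1] access in the post above.
rows = [["some-id", "I am about to go to the store and get any snack"]]
print(clean_rows(rows))
```

Calling remove_stopwords(sentences) on the list after the loop is what raises the AttributeError, because a list has no lower() method.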

hi

I think there is something wrong with either the assignment template or the bbc-text.csv (or bbc-train.csv files).

here is what I have:

for row in reader:
    # article ID (row[0]) is omitted
    sentences.append(remove_stopwords(row[1]))  # actual text
    labels.append(row[2])  # category

but row[2] causes an error:

IndexError: list index out of range

but from https://www.kaggle.com/competitions/learn-ai-bbc/data?select=BBC+News+Train.csv

it is clear there are 3 fields:

ArticleId - Article id unique # given to the record
Article - text of the header and article
Category - category of the article (tech, business, sport, entertainment, politics)

But the “bbc-text.csv” that we were given in the assignment only has 2 fields: the article ID and the text (no category).

thanks for your help
Ed

Please execute this cell in the notebook to know the data format:

with open("./bbc-text.csv", 'r') as csvfile:
    print(f"First line (header) looks like this:\n\n{csvfile.readline()}")
    print(f"Each data point looks like this:\n\n{csvfile.readline()}")

The first row of the CSV holds the column headers, i.e. category,text
Data starts from the second row. Based on the column headers, row[0] is the label and row[1] is the text.
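As a hedged illustration of reading the file by header rather than by fixed position: the sketch below assumes the column names category and text from the header above, and parse_data is a hypothetical name, not the assignment's function. It fails with a readable message when the file has a different header, instead of an IndexError deep in the loop:

```python
import csv
import io


def parse_data(csvfile, text_col="text", label_col="category"):
    """Return (sentences, labels), locating columns by header name."""
    reader = csv.reader(csvfile)
    header = next(reader)  # the first row is the header, not data
    missing = [col for col in (text_col, label_col) if col not in header]
    if missing:
        raise ValueError(f"Missing columns {missing}; header is {header}")
    text_idx = header.index(text_col)
    label_idx = header.index(label_col)

    sentences, labels = [], []
    for row in reader:
        sentences.append(row[text_idx])
        labels.append(row[label_idx])
    return sentences, labels


# In-memory stand-in for the course file, with made-up content.
sample = io.StringIO("category,text\nsport,keeper joins on loan\n")
print(parse_data(sample))
```

With a file whose header is ArticleId,Text, this raises a ValueError naming the missing columns, which makes a wrong copy of the dataset easy to spot.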

Hope this helps.

I’m afraid that’s not the case.

When I run the template code:

with open("./bbc-text.csv", 'r') as csvfile:
    print(f"First line (header) looks like this:\n\n{csvfile.readline()}")
    print(f"Each data point looks like this:\n\n{csvfile.readline()}")     

I get:

First line (header) looks like this:

ArticleId,Text

Each data point looks like this:

1018,qpr keeper day heads for preston queens park rangers keeper chris day is set to join preston on a month s loan.  day has been displaced by the arrival of simon royce  who is in his second month on loan from charlton. qpr have also signed italian generoso rossi. r s manager ian holloway said:  some might say it s a risk as he can t be recalled during that month and simon royce can now be recalled by charlton.  but i have other irons in the fire. i have had a  yes  from a couple of others should i need them.   day s rangers contract expires in the summer. meanwhile  holloway is hoping to complete the signing of middlesbrough defender andy davies - either permanently or again on loan - before saturday s match at ipswich. davies impressed during a recent loan spell at loftus road. holloway is also chasing bristol city midfielder tom doherty.

Please use the file(s) provided by Coursera, not the ones from Kaggle. See below the output of executing the cell in the Coursera lab environment.

I am!

Was there an update?

Based on your output, seems so.

Please refresh your workspace and try again.
See the "Refresh your Lab Workspace" section here

Could you attach the correct ‘bbc-text.csv’ file, please?

thanks,

I’m working locally, using Visual Studio Code

thanks

Go ahead and download the dataset from the Coursera lab environment.

Thanks

Can you send me the link again? There isn’t anything in the current specialization.

found it:

Here’s the link to the assignment page.

works now!

Is it because I ‘reset’ the course that there were changes?

Sorry. I don’t know about the state of your workspace. Good to know that resetting the lab got things back to the latest version.

I just downloaded the ‘lab files’

thank you for your help

I’m now on the last section

label_sequences, label_word_index = tokenize_labels(labels)
print(f"Vocabulary of labels looks like this {label_word_index}\n")
print(f"First ten sequences {label_sequences[:10]}\n")

It prints out a LOT of values:

Vocabulary of labels looks like this {'<OOV>': 1, 's': 2, 'said': 3, 'will': 4, 'not': 5, 'mr': 6, 'year': 7, 'also': 8, 'people': 9, 'new': 10, 'us': 11, 'one': 12, 'can': 13, 'last': 14, 'first': 15, 't': 16, 'time': 17, 'two': 18, .....  all the way to 'allocating': 29713, 'heerenveen': 29714}

First ten sequences [[96, 176, 1157, 1220, 54, 1122, 742, 5211, 85, 1074, 4267, 147, 184, 4127, 1344, 1311, 1595, 47, 9, 949, 96, 4, 6516, 329, 92, 23, 17, 140, 3128, 1330, 2519, 576, 419, 1277, 72, 2963, 3046, 1755, 10, 894, 4, 755, 12, … all the way to 14996, 14997, 6527, 4802, 31, 5813, 10942, 19540, 19541, 19542, 19543, 162, 59, 949, 27, 4003, 8836, 5003, 3, 30, 63, 2884, 4420, 2, 63, 4004]]

So I am clearly not getting the “Expected Output” of:


Vocabulary of labels looks like this {'sport': 1, 'business': 2, 'politics': 3, 'tech': 4, 'entertainment': 5}

First ten sequences [[4], [2], [1], [1], [5], [3], [3], [1], [1], [5]]
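
The thread does not record how this was resolved. One thing worth checking, offered as a guess rather than a confirmed diagnosis: the printed vocabulary contains article words and an '<OOV>' entry, which would be consistent with the tokenizer having been fit on the article text, or created with an OOV token, instead of on the category labels alone. A minimal pure-Python sketch of the kind of mapping the expected output implies, where build_label_index is a hypothetical stand-in and not the Keras Tokenizer:

```python
from collections import Counter


def build_label_index(labels):
    """Map each distinct label to a 1-based integer, most frequent first."""
    counts = Counter(labels)
    # sorted() is stable, so labels with equal counts keep first-seen order.
    ranked = sorted(counts, key=lambda label: -counts[label])
    return {label: i for i, label in enumerate(ranked, start=1)}


# Made-up labels; the real list would come from the parsed CSV.
labels = ["tech", "business", "sport", "sport", "entertainment"]
index = build_label_index(labels)
sequences = [[index[label]] for label in labels]

print(index)
print(sequences)
```

A quick sanity check before tokenizing is to print set(labels): it should contain only the five category names, with no article text and no '<OOV>' entry.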