I’m now on the last section:
label_sequences, label_word_index = tokenize_labels(labels)
print(f"Vocabulary of labels looks like this {label_word_index}\n")
print(f"First ten sequences {label_sequences[:10]}\n")
It prints out a LOT of values:
Vocabulary of labels looks like this {'<OOV>': 1, 's': 2, 'said': 3, 'will': 4, 'not': 5, 'mr': 6, 'year': 7, 'also': 8, 'people': 9, 'new': 10, 'us': 11, 'one': 12, 'can': 13, 'last': 14, 'first': 15, 't': 16, 'time': 17, 'two': 18, ..... all the way to 'allocating': 29713, 'heerenveen': 29714}
First ten sequences [[96, 176, 1157, 1220, 54, 1122, 742, 5211, 85, 1074, 4267, 147, 184, 4127, 1344, 1311, 1595, 47, 9, 949, 96, 4, 6516, 329, 92, 23, 17, 140, 3128, 1330, 2519, 576, 419, 1277, 72, 2963, 3046, 1755, 10, 894, 4, 755, 12, … all the way to 14996, 14997, 6527, 4802, 31, 5813, 10942, 19540, 19541, 19542, 19543, 162, 59, 949, 27, 4003, 8836, 5003, 3, 30, 63, 2884, 4420, 2, 63, 4004]]
So I’m clearly not getting the “Expected Output” of:
Vocabulary of labels looks like this {'sport': 1, 'business': 2, 'politics': 3, 'tech': 4, 'entertainment': 5}
First ten sequences [[4], [2], [1], [1], [5], [3], [3], [1], [1], [5]]
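For what it’s worth, the symptoms point at what the tokenizer was fit on: a word index of ~29k entries containing '<OOV>' means it was almost certainly fit on the article texts (with an oov_token) rather than on the label list, which is small and closed and needs no OOV entry. Here is a minimal pure-Python sketch of what tokenize_labels should produce (the function name comes from the post; the implementation is my own illustration, mimicking Keras Tokenizer's 1-indexed, frequency-ordered word index):

```python
from collections import Counter

def tokenize_labels(labels):
    # The label set is small and closed, so no OOV token is needed.
    # Mimic Keras Tokenizer: indices start at 1, most frequent label first.
    counts = Counter(labels)
    word_index = {label: i + 1 for i, (label, _) in enumerate(counts.most_common())}
    # Each label becomes a one-element sequence, e.g. 'sport' -> [1]
    sequences = [[word_index[label]] for label in labels]
    return sequences, word_index

seqs, wi = tokenize_labels(['sport', 'business', 'politics', 'politics', 'tech'])
print(wi)    # {'politics': 1, 'sport': 2, 'business': 3, 'tech': 4}
print(seqs)  # [[2], [3], [1], [1], [4]]
```

With the real Keras Tokenizer, the equivalent fix is to call fit_on_texts on the labels list (and drop the oov_token argument), not on the training sentences.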