Need advice on text cleaning exercise for an NLP classification task

I am doing this as a university assignment where I am asked to apply a word embedding technique (bag of words, etc.) and then use a naive Bayes, logistic regression, or RNN model to classify movie reviews as positive or negative in sentiment.

I just wanted to ask whether I need to remove numbers, parentheses, commas, and $ signs from my example?

Here are a couple of examples:

note : some may consider portions of the following text to be spoilers .  be forewarned .  it's startling to consider that it was only a few years ago that film distributors would worriedly rearrange their summer release schedules in order to give the annual disney animated feature juggernaut a wide berth .  the lion king had just cracked $300 million domestic in gross to become one of the most profitable ventures in film history , continuing to build on a sturdy base left by prior flicks aladdin and beauty and the beast .  since then , though , disney's animated features have shown an unbroken string of diminishing returns , with pocahontas , the hunchback of notre dame , and this year's hercules successively proving less and less potent .  with the once seemingly-impregnable disney stranglehold on the market share suddenly looking mighty vulnerable , and faced with their first serious competition in the animated film market from fox's anastasia , disney has brought xmas home early by dusting off the feature which sparked the modern revival of feature animation , the little mermaid .  while the animation for the film is , as is typically the case for disney films , unquestionably top-notch , the magic in the little mermaid is not its animation , but the wonderful innocence of its story and its rousingly superb music .  the film's storyline is fairly straightforward -- young teen falls for handsome man , father disapproves and assigns hapless chaperone to his daughter , teen disobeys father and goes to desperate lengths to win her man -- except in this case , the chaperone is a crab , the teen is a young mermaid , and the object of her desire is a human prince .  what makes the little mermaid so affecting and so emotionally resonant is the richness and charm of its characters and the sheer clarity and honest simplicity of their emotions .  
from the moment mermaid ariel lays her eyes on prince eric , she's resolutely smitten , and she's such a pure and endearing character that one can't help but invest their heart with her .  this simple but touching love story , coupled with a healthy dose of smart humour , makes the little mermaid a remarkably captivating picture .  one of the interesting things about the little mermaid is something which now curiously dates it : the voices cast for its motley crew of characters .  this film was produced just before the distracting concept of using celebrity voices became in vogue , which started to a certain degree with beauty and the beast and was irrevocably exacerbated by robin williams' much-heralded turn in aladdin ; by the release of the lion king and henceforth , the majority of characters in the animated films were voiced by celebrities .  while it's understandable that animated features lacking the name-recognition or drawing power of disney ( say , balto's use of kevin bacon and bridget fonda , or even anastasia's showcasing of meg ryan and john cusack ) would be forced to turn to this strategy in order to hype their products , it's unfortunate that even disney has embraced this policy .  do we really need to hear , say , demi moore as esmerelda in the hunchback of notre dame ?  is the film's entertainment value really augmented by hearing a recognizable voice , rather than a voice which best suits the role ?   ( i'm not exactly on the edge of my seat for eddie murphy in the upcoming mulan . )  fortunately , the performers who voice the characters in the little mermaid , although perhaps more obscure , are impeccably cast .  chief among them is jodi benson , a 1992 tony nominee for her stage work in crazy for you , who voices the film's heroine ariel to perfection ; with a wonderfully expressive speaking voice full of youthful vigor , and gorgeous singing voice , ms . benson provides a most engaging anchor for the film .   
( she's the only reason i'd even consider catching flubber . )  similarly , samuel e . wright is terrific in the showy role of sebastian , the weary guardian crab .  he easily milks his lovable character's comic moments for all they're worth , and his rendering of two of the little mermaid's big tunes -- " under the sea " and " kiss the girl " -- have become the stuff of legend .  pat carrol is deliciously villainous and vampy as the evil sea-witch ursula , while kenneth mars' booming voice conveys the stern yet affectionate authority of ariel's father , king triton .  in large roles and small ( edie mcclurg as dotting busybody carlotta is ideal , and rene auberjonois has great fun with his exuberant french chef ) , the little mermaid is impeccably cast .  of course , the little mermaid will probably be best remembered for its remarkable collection of songs composed by the songwriting team of alan menken ( music ) and howard ashman ( lyrics ) , who had created little shop of horrors and would go on to compose beauty and the beast and aladdin for disney before mr . ashman's untimely death .  not only are mr . menken's tunes unbearably catchy , but mr . ashman's charming lyrics are fully integrated into the film's storyline so that the songs are a virtual extension of the character's dialogue , and consequently work wonderfully within the context of the film .  mr . menken's score for the film is equally top-notch ; the sequence where eric ( voiced by christopher daniel barnes ) and ariel tour his kingdom in a horse-drawn carriage becomes magical and wondrous with mr . menken's fine score .  it appears that most people prefer the delightfully colourful production number for the calypso-styled " under the sea " as joyfully crooned by mr . 
wright , which won the academy award and golden globe awards for best song -- indeed , one of the many little joys in screening the film during its re-release was listening to children scattered throughout the audience singing along with the tune -- but my favourite is ms . benson's heartfelt rendition of the ballad " part of your world " , an achingly beautiful tune of yearning and hope ( wonderfully lyricized by mr . ashman ) which , accompanied by the film's most dazzlingly polished animation sequence , packs an emotional wallop which literally brought tears to my eyes .  during the song's reprise , which builds to a crescendo with ariel arching on a rock as a wave crashes in , the cumulative effect is nothing short of breathtaking , and one becomes acutely aware that this single instance is one of the finest in animation history .  as of this writing , november 1997 has come to an end , as has disney's limited 17-day re-release of the little mermaid .  there's no question that the primary motivation for , if not the film's reissue itself , at least its timing , was to reinforce disney's dominance in the animation market and provide direct competition to fox's costly new upstart animation division and their first major venture , anastasia .  
in every respect , the re-release of the little mermaid appears to be a success -- the film's 1997 grosses have pushed its cumulative domestic gross over the magic $100 million mark ; the little mermaid proved to have remarkably strong drawing power for a film initially released only eight years ago and in many homes on video , pulling in close to $10 million in its opening weekend ; and although nobody could possibly expect the little mermaid to possibly defeat the aggressively-marketed anastasia in head-to-head competition , it siphoned enough from the fox film's opening weekend totals to keep anastasia from the coveted weekend leader spot , allowing for disney's odious flubber to sweep in on the subsequent week and wrestle the family demographic market share away .  but although disney's motives in the reissue of the little mermaid were self-serving and protectionist , the real winner is the public .  any reason to put this film back into theatres is a good one , and it's a true joy to see this heartwarming gem back on the silver screen .  the little mermaid is the best film to come out of the disney's modern animation renaissance , and one of the greatest animated films ever made .   1
note : some may consider portions of the following text to be spoilers .  be forewarned .  among my fanatical ticker tape-worshipping friends , there's one who happens to share the same philosophy espoused by the central character in darren aronofsky's darkly original pi : the entire stock market can be reduced to nothing but a series of patterns which , through analysis , will produce information to accurately forecast future behaviour .   ( an example of the mentality involved : if the stock price goes up like this , and then down like that , and then sharply up this way , it then will go * this * way . )  while i freely admit that i know less than nothing about the market ( knowledge check : prices up -- good ; prices down -- bad ; most of the time , at least ) and hence really couldn't comment with any authority , it's always nonetheless struck me as an incredibly naive oversimplification of an astonishingly complex system ( and besides , if it were that simple , no doubt somebody would've already figured it all out ) .  the difference in this case is that while my colleague ( an otherwise assuredly realistic individual ) truly believes in this in and of itself as a valid forecaster , pi uses this ideology as a device with which to investigate its character's psychosis .  it's also vastly more convincing with its argument .   " mathematics is the language of the universe , " insists genius protagonist maximillian cohen ( sean gullette ) in a cool , mantra-like voice-over which repeats throughout the picture .  since nature can be expressed in numbers , and there are patterns everywhere in nature , he reasons with eminent logic that finding the patterns will allow him to predict anything -- the ups and downs of the stock market , how many games the yankees will win this year , the flavour of jam i'm going to put on my toast tomorrow morning .  
obsessed with finding the proverbial key to the universe , max lives in paranoid , self-imposed solitude in a seedy nyc chinatown apartment , single-mindedly toiling away with his monstrous homemade computer system .  sullenly withdrawn and plauged by debilitating migraines , the elusive pursuit of a mysterious 216-digit number his machine spits out one day is driving him into madness .  the story , then , is basically an eccentricity , but it's a clever , astute eccentricity , perceptively zeroing in on the modern mistrust of mathematical reductionism ; in an age where a dominant societal phobia is one's individualism being replaced by a series of numeric identifiers , max's all-consuming penchent for numbers at once creates a lingering , unsettling mood .  it helps matters that he's not a particularly likable protagonist .  all attempts of friendliness from neighbours are curtly rebuffed by max , a spindly , neurotic-looking individual who hasn't the time to indulge in pleasantries .  for a film which puts its lead character front and center ( mr . gullette appears in virtually every scene ) , pi takes a refreshing and effective approach in avoiding conventional aesthetics ; because of our ambivalence with max , we're not so much avidly rooting for him to triumph with a moment of epiphany as we're following him through this plot with a sense of mixed dread and morbid fascination -- it's more disturbing journey than quest .  still , we do care about max's fate .  teetering on the edge of dementia , he winds up being pursued by two different groups which want to pick his brain , both fronted by deliciously perky , resolutely cheerful representatives with inevitably duplictious intentions .  as we know , in films where paranoia is a dominant element ( see the truman show's laura linney character ) , or for that matter , in real life , it's always the ones who never stop smiling at you and are overly friendly that are the ones of which to be wary .  
pi , a film that addresses patterns , itself intentionally adheres to an identifiable pattern cycle -- headache scene ; important revelation or bit of plot development ; pill-popping montage ; hallucinatory nightmare ( with decidedly cronenberg-esque undertones -- few other directors are as equally adept in bridging unsettling concepts and body-themed horror ) ; nosebleeding reality .  the repetitiveness , far from being tedious , is effectively maddening ; more than anything , the picture aims to get under our skins and take in events from max's claustrophobic perspective .  in this regard , it wildly succeeds due to mr . aronofsky's striking direction .  it's a rarity that a film so completely immerses itself into a protagonist's warped perspective of his surrounding , and high contrast black-and-white cinematography combined with constant usage of extreme close-ups lend a heightened sense of paranoia to the proceedings .   ( in some scenes , the stark composition in conjunction with the lumbering approach by mr . gullette make his character curiously resemble a latter-day max schreck , from nosferatu . )  using savage , jittery lensing and rapid cuts to create a sense of disorientation , the picture is often dizzying to behold , and max's effective isolationism is emphasized by shots from the so-called snorri cam , which keep him in plain focus while the environment races by in blurred bursts .  pi's raw , aggressive visuals are reminiscent of david lynch's early work ( in particular , eraserhead ) .  the film's sinister tone splashes onto the screen immediately with a dazzling opening credit sequence ably backed by a sly electronic score by clint mansell , and gradually increases in intensity .  still , amidst all its kafkaesque qualities and overall dispassionate mood , pi does occasionally display a sense of humour .  at one point , marcy dawson ( pamela hart , great fun ) entices max with the offer of an invaluable treasure : a one-of-a-kind . . .  
computer chip .   " isn't it beautiful , " she coos .  a showcase for mr . aronofsky's technical virtuosity ( made for $60 000 , it's since gone on to capture acclaim at the 1998 sundance film festival ) , pi is an intriguingly cerebral story which , ironically , is perhaps the most purely visceral film of the year .   1

After applying nltk.word_tokenize(), I still have all these punctuation marks and brackets: "'s", ",", ".", "(" and ":".

Also, would you apply an RNN or an LSTM for better accuracy, given that these are long texts and word order may matter?
Thank you!

Hi,

Here is just my point of view.
Have you checked tf.keras.preprocessing.text.Tokenizer in the TensorFlow Core v2.8.0 documentation?

Special characters can be filtered out when you use Tokenizer.
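
To give a rough idea of what that filtering does: as far as I know, Tokenizer takes a `filters` argument (a string of characters to strip) and a `lower` flag. The sketch below does not call Keras at all. It is a plain-Python imitation of that behaviour, using what I believe is the documented default filter string. The helper name `filter_and_split` is made up for illustration.

```python
# Plain-Python imitation of filter-then-split tokenization; this does NOT call Keras.
# The string below is assumed to match the documented default of Tokenizer's `filters`.
KERAS_DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def filter_and_split(text, filters=KERAS_DEFAULT_FILTERS, lower=True):
    """Hypothetical helper: replace each filtered character with a space, then split."""
    if lower:
        text = text.lower()
    # Map every filtered character to a space so neighbouring words stay separated.
    table = str.maketrans({ch: " " for ch in filters})
    return text.translate(table).split()


review = "the lion king had just cracked $300 million ( a record ) , and it's top-notch ."
tokens = filter_and_split(review)
print(tokens)
```

Note that the apostrophe is not in that filter string, so "it's" survives as one token, while "top-notch" is split in two and "$300" becomes "300". If you want numbers gone as well, you would need to extend the filter string or drop those tokens afterwards.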

Hopefully, it helps. :laughing:

My understanding is that whether to keep punctuation in a vocabulary depends on the NLP task you are performing. I would suggest that punctuation does not provide much predictive strength for a single classification of an entire review. It plays a more important role when you need to find sub-document boundaries, such as paragraph or sentence boundaries.
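
As a minimal sketch of that idea for a bag-of-words model: the sample reviews above appear to be already split with spaces around punctuation, so a plain `split()` is assumed to be enough here. The helper `content_tokens` and its `keep_numbers` flag are hypothetical names for illustration, not part of NLTK or Keras.

```python
from collections import Counter


def content_tokens(text, keep_numbers=False):
    """Hypothetical helper: split on whitespace and drop punctuation-only tokens."""
    kept = []
    for tok in text.split():
        if any(ch.isalpha() for ch in tok):
            # Keeps words, including forms like "film's" and "top-notch".
            kept.append(tok)
        elif keep_numbers and any(ch.isdigit() for ch in tok):
            # Optionally keeps tokens such as "1997" or "$100".
            kept.append(tok)
    return kept


sample = "the film is top-notch , and the film's 1997 grosses passed $100 million ."
bag = Counter(content_tokens(sample))  # bag-of-words counts for one review
print(bag)
```

Whether to keep the numbers is a judgment call. One option is to train the classifier both ways and compare the validation accuracy.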

I can’t tell if you are taking the DLS sequences class or the NLP Specialization. If the former, you might want to at least watch some lecture videos from the NLP classes, which specifically cover sentiment analysis using several different techniques including Naive Bayes and deep learning models.

I have watched all of the videos (and did some notebooks) from the NLP Specialisation, and am now on Weeks 1 and 2 of Sequence Models. However, this task is from my university’s master’s Data Engineering programme: they decided to include an NLP task in my Python module!

Do you mean I could use tf.keras.preprocessing.text.Tokenizer() instead of nltk.word_tokenize()?
Thank you, I’ll have a look!

Yes, this is what I mean. Of course, you can use other libraries too. :grin: I am only suggesting this because I am familiar with TF Keras for vanilla NLP tasks. :relaxed:

Here is a mini workout that you can use as a reference.

Hopefully, it helps :+1:

Yeah… I have Keras and Trax installed on my home computer, so I am ready to try everything! Thank you.