1. Import the Twitter dataset using the readLines() function.
2. Inspect the first 10 tweets from the data.
3. We will specifically look at the tweet at index number 8. Assign this tweet to a variable, for example: tweet8.
4. Check the data type of tweet8. Before converting tweet8 to String type, first apply stemming to it. How many words in this sentence have been stemmed? What are the original form and the base form of each?
5. Remove the stop words from the stemmed tweet8. Compare the original tweet8 with the transformed tweet8. Which stop words were removed?
6. Reassign the 8th tweet to tweet8 and lemmatize it. Which words have been lemmatized? What are the original form and the base form of each?
7. Change tweet8 to String type.
8. Use a sentence tokenization function to segment tweet8. How many sentences are generated after tokenization?
9. Use a word tokenization function to split the sentences into words. How many words are generated? Display the words and sentences.
10. Use a part-of-speech tagging function to assign a POS tag to each word. Check the word and POS-tag frequencies. How many words have been assigned the POS tag "VBD" (verb, past tense)? Which words are tagged "VBD"?
11. Use a named entity recognition function to detect named entities in this tweet. Does this function detect any named entities?
12. Use a parsing function to parse this tweet. How many verb phrases (VP) are there? What components make up the last verb phrase? (If your parser does not work, you may skip this question.)
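The steps above can be sketched in R as follows. This is only a starting-point sketch, not the required solution: it assumes the dataset is a plain-text file with one tweet per line (the filename "tweets.txt" is a placeholder), and it uses the SnowballC, tm, textstem, tokenizers, NLP, and openNLP packages; substitute whichever packages your course uses.

```r
# Assumed filename -- replace with the actual dataset path.
tweets <- readLines("tweets.txt")

# Q2: inspect the first 10 tweets
head(tweets, 10)

# Q3-4: select the tweet at index 8, check its type, then stem it
tweet8 <- tweets[8]
class(tweet8)
words   <- unlist(strsplit(tweet8, "\\s+"))
stemmed <- SnowballC::wordStem(words)
data.frame(original = words, stemmed = stemmed)[words != stemmed, ]

# Q5: remove stop words from the stemmed tweet and see what was dropped
kept <- stemmed[!tolower(stemmed) %in% tm::stopwords("english")]
setdiff(stemmed, kept)

# Q6: reassign and lemmatize
tweet8 <- tweets[8]
lemmas <- textstem::lemmatize_words(words)
data.frame(original = words, lemma = lemmas)[words != lemmas, ]

# Q7-9: convert to character (String type) and tokenize
tweet8 <- as.character(tweet8)
sents  <- tokenizers::tokenize_sentences(tweet8)[[1]]
length(sents)
toks   <- tokenizers::tokenize_words(tweet8)[[1]]
length(toks)

# Q10-12: POS tagging, NER, and parsing via openNLP annotators
# (these require the openNLPmodels.en model package)
library(NLP); library(openNLP)
s <- as.String(tweet8)
a <- annotate(s, list(Maxent_Sent_Token_Annotator(),
                      Maxent_Word_Token_Annotator()))
pos   <- annotate(s, Maxent_POS_Tag_Annotator(), a)
wordA <- subset(pos, type == "word")
tags  <- sapply(wordA$features, `[[`, "POS")
table(tags)               # POS-tag frequency
s[wordA][tags == "VBD"]   # words tagged VBD
ent <- annotate(s, Maxent_Entity_Annotator(), pos)   # named entities
# For Q12, Parse_Annotator() can be applied the same way if the
# parsing model is installed; otherwise skip as the question allows.
```

Remember that the answers to the "how many / which words" questions must be written as comments in your script, not just printed to the console.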
Submission:
Create an R script file and write the R commands for each question. Write the answers to the questions, such as "How many words in this sentence have been stemmed?", as comments in the same R script file. Submit the file on Blackboard.