Solution
Task Title: Data Analytic Technical Report
Subject Code: SIT717
Objective: The objective of this part of the assignment is to apply data-analytics techniques to real-world data and extract useful information. Students may use any of the following strategies for this assignment: supervised learning, unsupervised learning, time series prediction, text mining, etc.
Overview: This project is designed to give students an opportunity to use data mining and machine learning methods to discover knowledge from a dataset and explore applications for business intelligence. It is the second part of the individual project work, and you are required to implement the required analysis together with a written report. The written assessment is a technical report of no fewer than 3000 words.
University: Deakin University
Tool requirement:
- Weka Machine Learning Tool: This tool implements almost every popular machine learning algorithm (supervised and unsupervised).
- MS-Excel: A spreadsheet helps to filter the data before analysis.
Task Description:
In this portfolio, we present a text-mining approach to the assignment. We used Yelp’s labelled data from the ‘Sentiment Labelled Sentences Data Set’ in the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences). We built a Random forest based supervised classifier to identify the sentiment of a comment.
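As a minimal sketch of how the labelled sentences might be read before any conversion, assuming each line of the Yelp file holds one sentence and a 0/1 label separated by a tab. The function name `parse_labelled_lines` and the sample sentences are illustrative and not taken from the project:

```python
from typing import Iterable, List, Tuple


def parse_labelled_lines(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Parse 'sentence<TAB>label' lines into (sentence, label) pairs."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the last tab so a tab inside the sentence does not break parsing.
        text, label = line.rsplit("\t", 1)
        rows.append((text.strip(), int(label)))
    return rows


# Illustrative sample lines (made up for this sketch, not from the dataset).
sample = ["Great food and friendly staff.\t1\n", "The service was slow.\t0\n"]
rows = parse_labelled_lines(sample)
print(rows)
```

In practice the same function could be fed an open file handle instead of the in-memory list used here.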
Steps of Development:
- CSV to ARFF conversion: CSV is a comma-separated representation of the data, which is not the preferred format in Weka. We converted the CSV file to Weka's native format, known as ARFF (.arff).
- String Tokenization: StringToWordVector is used to convert a string to a word vector.
- Token Filtering: Stop-word removal, the IDF and TF transforms (IDFTransform, TFTransform) and the Lovins stemmer are used for the filtering.
- Dataset Splitting: The data is split into training and test sets in an 80:20 ratio.
- Classifier Design: J-48 and Random forest classifiers were designed for this task. Based on their performance, we prefer Random forest.
- Verification of the Classifier: We verify the accuracy of each classifier on the test dataset. The J-48 classifier obtained 54.5% accuracy and the Random forest classifier 72%.
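The conversion and splitting steps above were done inside Weka. Purely as an illustration of what they amount to, the sketch below writes (sentence, label) rows as ARFF text and makes an 80:20 split. The helper names (`quote_arff_string`, `to_arff`, `split_80_20`), the relation and attribute names, the seed, and the shuffle-then-cut strategy are all assumptions for this example, not the project's actual settings:

```python
import random
from typing import List, Sequence, Tuple

Row = Tuple[str, int]


def quote_arff_string(text: str) -> str:
    """Wrap text in single quotes, escaping backslashes and single quotes."""
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def to_arff(rows: Sequence[Row], relation: str = "sentiment") -> str:
    """Render rows as ARFF with one string attribute and one nominal class."""
    lines = [
        f"@relation {relation}",
        "",
        "@attribute text string",    # assumed attribute name
        "@attribute class {0,1}",    # 0 = negative, 1 = positive
        "",
        "@data",
    ]
    lines += [f"{quote_arff_string(text)},{label}" for text, label in rows]
    return "\n".join(lines) + "\n"


def split_80_20(rows: Sequence[Row], seed: int = 1) -> Tuple[List[Row], List[Row]]:
    """Shuffle a copy of the rows and cut it so 80% goes to training."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 8 // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


# Illustrative rows (made up for this sketch).
rows = [(f"sample sentence {i}", i % 2) for i in range(10)]
train, test = split_80_20(rows)
print(len(train), len(test))
print(to_arff(rows[:2]))
```

A plain shuffle does not guarantee the same class balance in both parts. A stratified split would be a reasonable alternative if the balance matters.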
Sample Output:
Summary

| Result | J-48 | Random forest |
| --- | --- | --- |
| Correctly Classified Instances | 109 (54.5 %) | 144 (72 %) |
| Incorrectly Classified Instances | 91 (45.5 %) | 56 (28 %) |
| Kappa statistic | 0.1209 | 0.4344 |
| Mean absolute error | 0.4404 | 0.3582 |
| Root mean squared error | 0.5042 | 0.4202 |
| Relative absolute error | 87.8155 % | 71.4271 % |
| Root relative squared error | 100.5014 % | 83.7519 % |
| Total Number of Instances | 200 | 200 |

Detailed Accuracy: [per-class figures missing from the source]

Confusion Matrix (rows: actual class; columns: classified as)

| Actual class | J-48: a | J-48: b | Random forest: a | Random forest: b |
| --- | --- | --- | --- | --- |
| a = 0 | 65 | 24 | 62 | 27 |
| b = 1 | 67 | 44 | 29 | 82 |
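The accuracy and kappa figures in the summary can be cross-checked against the two confusion matrices. The sketch below is a plain-Python check, not Weka's own evaluation code, and the function name `accuracy_and_kappa` is illustrative. It takes rows as actual classes and columns as predicted classes:

```python
from typing import List, Tuple


def accuracy_and_kappa(matrix: List[List[int]]) -> Tuple[float, float]:
    """Return (accuracy, Cohen's kappa) for a square confusion matrix.

    Rows are actual classes and columns are predicted classes.
    """
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / total**2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Confusion matrices as reported in the sample output.
j48 = [[65, 24], [67, 44]]
random_forest = [[62, 27], [29, 82]]

for name, matrix in [("J-48", j48), ("Random forest", random_forest)]:
    acc, kappa = accuracy_and_kappa(matrix)
    print(f"{name}: accuracy={acc:.3f}, kappa={kappa:.4f}")
```

Run on these matrices, the check reproduces the reported accuracies (54.5 % and 72 %) and kappa statistics (0.1209 and 0.4344).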
Expert’s Comments: Every machine learning-based project is challenging. Selecting a suitable dataset and algorithm is very important. Our team, which has extensive industry experience, helps students achieve the best results on this type of assignment.