SIT717 Assignment 2 - Technical Report

Solution

Task Title: Data Analytic Technical Report

Subject Code: SIT717

Objective: The objective of this part of the assignment is to apply data-analytics techniques to real-world data and extract useful information. Students may use any of the following strategies for this assignment: supervised learning, unsupervised learning, time-series prediction, text mining, etc.

 

Overview: This project gives students an opportunity to use data mining and machine learning methods to discover knowledge from a dataset and to explore applications for business intelligence. It is the second part of the individual project work, and you are required to implement the required analysis together with a written report. The written assessment is a technical report of no fewer than 3000 words.

 

University: Deakin University  

Tool requirement:

  • Weka Machine Learning Tool: This tool implements most popular machine learning algorithms (supervised and unsupervised).
  • MS-Excel: A spreadsheet helps to filter the data before analysis.

Task Description:

In this portfolio, we present a text-mining approach to the assignment. For this task, we used Yelp’s labelled data from the ‘Sentiment Labelled Sentences Data Set’ in the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences). We built a Random forest based supervised classifier to identify the sentiment of a comment.

Steps of Development:

  • CSV to ARFF conversion: CSV is a comma-separated representation of the data, which is not the preferred input format for Weka. We converted the CSV to Weka's native format, known as ARFF (.arff).
  • String Tokenization: StringToWordVector is used to convert each string into a word vector.
  • Token Filtering: Stop-word removal, the IDF and TF transformations, and the Lovins stemmer are used for filtering.
  • Dataset Splitting: The data is split into training and test sets in an 80:20 ratio.
  • Classifier Design: J-48 and Random forest classifiers were designed for this task. Based on their performance, we prefer Random forest.
  • Verification of the Classifier: We verify the accuracy of each classifier on the test dataset. The J-48 classifier reaches 54.5% accuracy and the Random forest classifier 72%.
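
The preparation steps above can be sketched outside Weka as follows. This is a minimal Python illustration, not the Weka workflow that produced the reported results: the function names, the relation name `yelp_sentiment`, the attribute names `text` and `class`, the shuffle seed and the two sample sentences are all assumptions made for the example. The weighting is a log-scaled TF and IDF in the spirit of the StringToWordVector options; Weka's exact formulas and normalisation may differ, and stemming is omitted.

```python
import math
import random
import re
from collections import Counter


def to_arff(rows, relation="yelp_sentiment"):
    """Render (sentence, label) pairs as ARFF text (names are assumptions)."""
    lines = [
        f"@relation {relation}",
        "",
        "@attribute text string",
        "@attribute class {0,1}",
        "",
        "@data",
    ]
    for sentence, label in rows:
        # Quote the string value and escape backslashes and single quotes.
        escaped = sentence.replace("\\", "\\\\").replace("'", "\\'")
        lines.append(f"'{escaped}',{label}")
    return "\n".join(lines) + "\n"


def split_train_test(rows, train_fraction=0.8, seed=1):
    """Shuffle a copy of the rows and cut it at the training fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def tf_idf(sentences, stop_words=frozenset()):
    """Weight each token by log(1 + tf) * log(N / df) after stop-word removal."""
    tokenised = [
        [t for t in re.findall(r"[a-z']+", s.lower()) if t not in stop_words]
        for s in sentences
    ]
    doc_freq = Counter(t for tokens in tokenised for t in set(tokens))
    n_docs = len(sentences)
    return [
        {t: math.log(1 + count) * math.log(n_docs / doc_freq[t])
         for t, count in Counter(tokens).items()}
        for tokens in tokenised
    ]


# Invented sample rows, not taken from the data set.
rows = [("The food was great", 1), ("The service was slow", 0)]
print(to_arff(rows))
print(tf_idf([text for text, _ in rows], stop_words={"the", "was"}))
```

A token that occurs in every sentence gets weight zero under this scheme, which is why very common words add little to the word vector even before stop-word removal.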

 

 

Sample Output:

 

 

 

Result

Summary

Metric                            J-48            Random forest
Correctly Classified Instances    109 (54.5 %)    144 (72 %)
Incorrectly Classified Instances  91 (45.5 %)     56 (28 %)
Kappa statistic                   0.1209          0.4344
Mean absolute error               0.4404          0.3582
Root mean squared error           0.5042          0.4202
Relative absolute error           87.8155 %       71.4271 %
Root relative squared error       100.5014 %      83.7519 %
Total Number of Instances         200             200
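
As a cross-check, the accuracy and Kappa statistic in the summary can be recomputed from the two confusion matrices reported at the end of this report. The sketch below is illustrative Python rather than Weka output; the function and variable names are assumptions, and rows are taken to be the actual classes and columns the predicted classes.

```python
def accuracy_and_kappa(matrix):
    """Return (accuracy, Cohen's kappa) for a square confusion matrix."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of instances on the diagonal.
    observed = sum(matrix[i][i] for i in range(size)) / total
    # Chance agreement: product of row and column totals per class.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(size)
    ) / total ** 2
    return observed, (observed - expected) / (1 - expected)


# Confusion matrices copied from this report (rows: actual, columns: predicted).
J48 = [[65, 24], [67, 44]]
RANDOM_FOREST = [[62, 27], [29, 82]]

for name, matrix in (("J-48", J48), ("Random forest", RANDOM_FOREST)):
    accuracy, kappa = accuracy_and_kappa(matrix)
    print(name, round(accuracy, 3), round(kappa, 4))
```

Rounded to four decimals, this gives a Kappa of 0.1209 for J-48 and 0.4344 for Random forest, in agreement with the summary.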

Detailed Accuracy

J-48

Metric      Class 0   Class 1   Weighted Avg
TP Rate     0.730     0.396     0.545
FP Rate     0.604     0.270     0.418
Precision   0.492     0.647     0.578
Recall      0.730     0.396     0.545
F-Measure   0.588     0.492     0.535
MCC         0.133     0.133     0.133
ROC Area    0.638     0.638     0.638
PRC Area    0.548     0.705     0.635

Random forest

Metric      Class 0   Class 1   Weighted Avg
TP Rate     0.697     0.739     0.720
FP Rate     0.261     0.303     0.285
Precision   0.681     0.752     0.721
Recall      0.697     0.739     0.720
F-Measure   0.689     0.745     0.720
MCC         0.434     0.434     0.434
ROC Area    0.814     0.814     0.814
PRC Area    0.754     0.862     0.813
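
The per-class figures follow directly from the confusion matrices reported at the end of this report. The Python sketch below shows one way to derive them; it is an illustration under the assumption that rows are actual classes and columns are predicted classes, and the function name and dictionary keys are invented for the example. ROC Area and PRC Area are left out because they need the classifier's predicted probabilities, which the report does not include.

```python
import math


def class_metrics(matrix, positive):
    """Per-class metrics, treating class index `positive` as the positive class."""
    tp = matrix[positive][positive]
    fn = sum(matrix[positive]) - tp                  # actual positive, predicted other
    fp = sum(row[positive] for row in matrix) - tp   # actual other, predicted positive
    tn = sum(sum(row) for row in matrix) - tp - fn - fp
    return {
        "tp_rate": tp / (tp + fn),                   # identical to recall
        "fp_rate": fp / (fp + tn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "f_measure": 2 * tp / (2 * tp + fp + fn),
        "mcc": (tp * tn - fp * fn)
        / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }


# Confusion matrices copied from this report (rows: actual, columns: predicted).
J48 = [[65, 24], [67, 44]]
RANDOM_FOREST = [[62, 27], [29, 82]]

for name, matrix in (("J-48", J48), ("Random forest", RANDOM_FOREST)):
    for cls in (0, 1):
        rounded = {k: round(v, 3) for k, v in class_metrics(matrix, cls).items()}
        print(name, "class", cls, rounded)
```

For a two-class problem the MCC is the same whichever class is treated as positive, which is why both class columns show the same MCC value.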

 

Confusion Matrix

J-48

  a  b   <-- classified as
 65 24 |  a = 0
 67 44 |  b = 1

Random forest

  a  b   <-- classified as
 62 27 |  a = 0
 29 82 |  b = 1
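
To make the layout concrete: each row of a confusion matrix counts the instances of one actual class, and each column counts how they were classified. A minimal Python sketch of how such a matrix is tallied is given below; the function name and the toy label lists are invented for illustration and are not taken from the data set.

```python
def confusion_matrix(actual, predicted, labels=(0, 1)):
    """Count instances by (actual class, predicted class)."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1  # row: actual class, column: predicted class
    return matrix


# Toy labels, invented for the example.
actual = [0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 1, 1, 1, 0, 1]
print(confusion_matrix(actual, predicted))
```

Read this way, the matrix with rows 62 27 and 29 82 says that 62 of the 89 class-0 test sentences and 82 of the 111 class-1 test sentences were classified correctly.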

 

Expert’s Comments: Every machine-learning project is challenging, and selecting a suitable dataset and algorithm is very important. Our team, with its extensive industry experience, helps students achieve the best results on this type of assignment.