Data Analytic Technical Report

Solution

Task Title: Data Analytic Technical Report

Subject Code: SIT717

Objective: The objective of this part of the assignment is to apply data-analytics techniques to real-world data and extract useful information. Students may use any of the following strategies: supervised learning, unsupervised learning, time series prediction, text mining, etc.

 

Overview: This project gives students an opportunity to use data mining and machine learning methods to discover knowledge from a dataset and to explore applications for business intelligence. It is the second part of the individual project work, and you are required to implement the analysis together with a written report. The written assessment is a technical report of no fewer than 3000 words.

 

University: Deakin University  

Tool requirement:

  • Weka Machine Learning Tool: Weka implements most of the popular machine learning algorithms, both supervised and unsupervised.
  • MS-Excel: A spreadsheet is used to filter the data before analysis.

Task Description:

This portfolio takes a text-mining approach to the assignment. We used the labelled Yelp data from the ‘Sentiment Labelled Sentences Data Set’ in the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences) and built a supervised random forest classifier to predict the sentiment of a comment.

Steps of Development:

  • CSV to ARFF conversion: CSV is a comma-separated representation of the data and is not Weka's preferred input. We converted the CSV file to Weka's native format, known as ARFF (.arff).
  • String Tokenization: The StringToWordVector filter converts each string into a word vector.
  • Token Filtering: Stop-word removal, the IDFTransform and TFTransform options, and the Lovins stemmer are applied to the tokens.
  • Dataset Splitting: The data is split into training and test sets in an 80:20 ratio.
  • Classifier Design: J48 and random forest classifiers were built for this task. Based on their performance, we prefer random forest.
  • Verification of the Classifier: The accuracy of each classifier is verified on the test set. J48 reached 54.5% accuracy and random forest reached 72%.
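The preparation steps above can be approximated outside Weka. The following Python sketch is illustrative only and is not the Weka implementation: the helper names (`to_arff`, `tokenize`, `tf_idf`, `split_80_20`), the tiny `STOP_WORDS` set, the attribute names `text` and `class`, and the relation name `sentiment` are all hypothetical. The weighting log(1 + tf) × log(N / df) is an assumption about how the TF and IDF options combine, and Lovins stemming is left out.

```python
import math
import random
import re
from collections import Counter

# Hypothetical stop list for illustration; Weka uses its own, larger list.
STOP_WORDS = {"a", "an", "and", "is", "it", "the", "this", "was"}


def to_arff(rows, relation="sentiment"):
    """Render (sentence, label) pairs as ARFF text with one string and one nominal attribute."""
    lines = [
        f"@relation {relation}",
        "",
        "@attribute text string",
        "@attribute class {0,1}",
        "",
        "@data",
    ]
    for sentence, label in rows:
        # Escape backslashes and single quotes so the sentence survives ARFF quoting.
        quoted = sentence.replace("\\", "\\\\").replace("'", "\\'")
        lines.append(f"'{quoted}',{label}")
    return "\n".join(lines) + "\n"


def tokenize(sentence):
    """Lower-case the sentence, keep alphabetic runs, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", sentence.lower()) if t not in STOP_WORDS]


def tf_idf(documents):
    """Return one {token: weight} dict per document, weight = log(1 + tf) * log(N / df)."""
    token_lists = [tokenize(d) for d in documents]
    # Document frequency: the number of documents that contain each token.
    doc_freq = Counter(tok for toks in token_lists for tok in set(toks))
    n_docs = len(token_lists)
    vectors = []
    for toks in token_lists:
        counts = Counter(toks)
        vectors.append({
            tok: math.log(1 + c) * math.log(n_docs / doc_freq[tok])
            for tok, c in counts.items()
        })
    return vectors


def split_80_20(rows, seed=1):
    """Shuffle a copy of the rows and cut it into 80% training and 20% test data."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 80 // 100
    return shuffled[:cut], shuffled[cut:]
```

With 1000 labelled sentences, `split_80_20` leaves 200 test instances, the same test-set size reported in the results. A token that occurs in every document gets weight 0, because log(N / df) is then log(1).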

 

 

Sample Output:

 

 

 

Result

Summary

Metric                            J48           Random forest
Correctly Classified Instances    109 (54.5 %)  144 (72 %)
Incorrectly Classified Instances  91 (45.5 %)   56 (28 %)
Kappa statistic                   0.1209        0.4344
Mean absolute error               0.4404        0.3582
Root mean squared error           0.5042        0.4202
Relative absolute error           87.8155 %     71.4271 %
Root relative squared error       100.5014 %    83.7519 %
Total Number of Instances         200           200

Detailed Accuracy

J48

Metric     Class 0  Class 1  Avg
TP Rate    0.730    0.396    0.545
FP Rate    0.604    0.270    0.418
Precision  0.492    0.647    0.578
Recall     0.730    0.396    0.545
F-Measure  0.588    0.492    0.535
MCC        0.133    0.133    0.133
ROC Area   0.638    0.638    0.638
PRC Area   0.548    0.705    0.635

Random forest

Metric     Class 0  Class 1  Avg
TP Rate    0.697    0.739    0.720
FP Rate    0.261    0.303    0.285
Precision  0.681    0.752    0.721
Recall     0.697    0.739    0.720
F-Measure  0.689    0.745    0.720
MCC        0.434    0.434    0.434
ROC Area   0.814    0.814    0.814
PRC Area   0.754    0.862    0.813

 

Confusion Matrix

J48

  a    b   <-- classified as
 65   24 |  a = 0
 67   44 |  b = 1

Random forest

  a    b   <-- classified as
 62   27 |  a = 0
 29   82 |  b = 1
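The headline figures can be rechecked from the two confusion matrices alone. The Python sketch below uses hypothetical helper names (`per_class`, `summarise`). It assumes that rows are actual classes and columns are predicted classes, as in the printout above, and that the Avg column is a mean weighted by the number of actual instances in each class.

```python
import math

# Confusion matrices copied from the results: rows = actual class, columns = predicted class.
J48 = [[65, 24], [67, 44]]
RANDOM_FOREST = [[62, 27], [29, 82]]


def per_class(matrix, positive):
    """Precision, recall and F-measure for one class of a 2x2 confusion matrix."""
    negative = 1 - positive
    tp = matrix[positive][positive]
    fp = matrix[negative][positive]
    fn = matrix[positive][negative]
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


def summarise(matrix):
    """Accuracy, kappa, MCC, and support-weighted precision, recall and F-measure."""
    total = sum(map(sum, matrix))
    supports = [sum(row) for row in matrix]          # actual instances per class
    predicted = [sum(col) for col in zip(*matrix)]   # predicted instances per class
    accuracy = (matrix[0][0] + matrix[1][1]) / total
    # Agreement expected by chance, from the actual and predicted marginals.
    expected = sum(s * p for s, p in zip(supports, predicted)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)
    mcc = (matrix[0][0] * matrix[1][1] - matrix[0][1] * matrix[1][0]) / math.sqrt(
        supports[0] * supports[1] * predicted[0] * predicted[1]
    )
    classes = [per_class(matrix, c) for c in (0, 1)]
    weighted = [
        sum(metrics[k] * s for metrics, s in zip(classes, supports)) / total
        for k in range(3)
    ]
    return {
        "accuracy": accuracy,
        "kappa": kappa,
        "mcc": mcc,
        "precision": weighted[0],
        "recall": weighted[1],
        "f_measure": weighted[2],
    }


if __name__ == "__main__":
    for name, matrix in (("J48", J48), ("Random forest", RANDOM_FOREST)):
        print(name, {k: round(v, 4) for k, v in summarise(matrix).items()})
```

Under these assumptions the accuracies come out as 0.545 and 0.72 and the kappa values round to 0.1209 and 0.4344, in line with the Weka summaries. The weighted recall always equals the accuracy, which is why the two figures coincide in the tables.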

 

Expert’s Comments: Every machine learning project is challenging, and choosing an appropriate dataset and algorithm is critical. Our team draws on extensive industry experience to help students get the best results on this type of assignment.

SIT717 2019 T2 Assignment 2: Technical Report

SIT717 Enterprise Business Intelligence 2019 T2

Assignment 2 (Project 2): Data Analytic Technical Report

Individual task – 50% + (10% bonus)

Due Date: 11:59pm (AEST), Monday, September 9th, 2019

A document (pdf or word document) should be submitted via CloudDeakin. NO email or Hardcopy assignments accepted. Photos of the document or photos/scanned copy of the handwritten documents are NOT accepted.

[Description]: This project is designed to provide students a good opportunity to use data mining and machine learning method in discovering knowledge from a dataset and explore the applications for business intelligence. It is the second part of the individual project work, and you are required to implement the required analysis together with a written report. This written assessment will be a technical report with no less than 3000 words. This is a hands-on task that requires you to utilize suitable methods and models to explore the data from your chosen topic in Assignment 1. This task evaluates your technical skills on the mining of projected data in real applications. You will practise the problem solving, self-guided information discovery and written communication.

[Tasks and Requirements]: The content of your technical report should include the following A-H aspects:

A. A meaningful title (5-20 words) followed by your name and student ID

The title describes your topic and points out your research direction, so it is very important and we list the tips below.

[Topic selection tips]:

Tip 1: The title may be narrowed down further from the title of your Assignment 1.

Tip 2: Make the best use of Practicals to implement your data analysis project. First, please learn a framework of basic skills of using Weka to prepare, process and analyze data in Practicals 1-3 and focus on Practical 6 (Performance Evaluation). Practical 6 comprises the skills to compare multiple data mining methods in different metrics, which directly helps your comparison and evaluation of different techniques in your technical report. Then, you can focus on one of the following techniques for analysing different types of data according to your topic:

  • Practical 4 or Practical 5 for processing general numeric data
  • Practical 7 Predicting Time Series
  • Practical 8 Text Mining
  • Practical 9 Image Analysis
  • Others learnt by yourself

Practical 10 Recommendation provides an application example using techniques in Practicals 4-9.

Tip 3: An example. If the title in Assignment 1 is "Using Classification Method to Discover Events from Twitter Data", then the title in Assignment 2 may be "Using Decision Tree to Predict A Event Trend from Twitter Data". If you choose this topic for Assignment 2, you can focus on Practical 8, which provides you details for classifying short text documents using Weka. Here, the event trend can be replaced by "a user's preference trend for a commercial product" and twitter data (taking one tweet as one short text document) also can be other short text data, such as user's comments for the product, etc. You may prefer to Decision Tree (J48 In Weka) as shown in this title, however, you also will present a counterpart method to do comparison.

B. An abstract (100-200 words)

C. An introduction of a data analytic application background, motivation and aim (200-300 words)

D. A summary of your dataset, including data type (general numeric data, short text, time series, image etc.), data size, data quality and data pre-processing (300-500 words)

E. The main data mining techniques you adopt to satisfy your application aim (800-1000 words)

In this section, you will point out

  • whether it is a clustering problem or classification problem,
  • what is the data mining algorithm you will adopt to analyse your data, what are the steps of the adopted algorithm and what are its advantages and disadvantages,
  • what is the counterpart algorithm that maybe an alternative choice to analyse your data.

F. Evaluation and demonstration (800-1000 words)

  • what is the difference between your adopted algorithm and the counterpart algorithm, you may use performance evaluation skills learnt in Practical 6 to compare them. And you must demonstrate (i.e., show result accuracies as evidence) why one is better than another.

G. Conclusions (100-200 words)

H. List of References (IEEE and Harvard are preferred). Please prepare your references according to the guidance at http://www.deakin.edu.au/students/study-support/referencing You can reuse the reference from Assignment 1 and add more publications. Most of them should be formal publications/papers.

[Submission]:

  • You must submit your completed document (pdf or word doc) in the Dropbox in CloudDeakin.
  • Remember that late submissions will be penalised. Further, the CloudDeakin server is the ultimate time keeper when it comes to determining whether your submission has been received on time.
  • You are also reminded to keep a backup copy for record.

Marking Criteria

The technical report of data analysis will be marked using the following marking criteria:

1. (1 mark) The title of the report is clearly specified.

2. (1 mark) The abstract effectively summarises all the content.

3. (3 marks) The introduction is clearly specified the application purpose

4. (5 marks) The dataset is described clearly.

5. (15 marks) The main techniques (at least two data mining algorithms) are provided in detail.

6. (20 marks) Adequate evaluation and comparison of the experimental results for at least two data mining algorithms are provided.

7. (1 mark) The conclusion is made and data analysis experience is summarised.

8. (4 marks) The technical report is clearly structured (title, abstract, introduction, dataset, main techniques, experimental evaluation, conclusions and references), nicely presented, and well written. The length of the report is within the scope given in the guideline.

9. (10 bonus marks) An improved algorithm is provided in pseudo codes based on an existing algorithm. Use the skills learnt by yourself to implement the proposed algorithms by programming, and use experimental results to demonstrate the proposed algorithm is better.

Table 1. Marking Scheme.

Assignment Task 2: 50% + 10% (bonus) = 60 Marks

Criteria (levels: Excellent, Good, Marginal, Not Shown)

1: Specify a title.

  • Clear (1 Mark)
  • Intelligible but not sharp (0.5 Mark)
  • Not specified (0 Mark)

2: Specify an abstract.

  • Effective summary (1 Mark)
  • Limited summary (0.5 Mark)
  • Not specified (0 Mark)

3: Specify an introduction.

  • A focused purpose (3 Mark)
  • Need further narrow down (2 Mark)
  • Infeasible purpose (1 Mark)
  • Not specified (0 Mark)

4: Describe the dataset used

  • Clear in data type, data size and data quality and preprocessing is conducted (4-5 Marks)
  • Clear in data type, data size and data quality. No preprocessing is conducted (2-3 Marks)
  • Marginal description of data type, data size and data quality. (1 Marks)
  • Not specified (0 Mark)

5: Detail the main techniques.

  • At least two data mining algorithms are provided in detail with adequate theoretical comparison (12-15 Marks)
  • At least two data mining algorithms are provided in detail with limited theoretical comparison (8-11 Marks)
  • Only one data mining algorithm is provided and present clearly (4-7 Marks)
  • Only one algorithm is presented with limited details. (0-3 Mark)

6. Demonstrate by experiments

  • Adequate evaluation and comparison of the experimental results between at least two data mining algorithms. (15-20 Marks)
  • The experimental results between at least two algorithms are provided but with limited evaluation and comparison. (10-14 Marks)
  • The experimental results of only one algorithm are provided and presented clearly. (5-9 Marks)
  • Limited experimental results are provided for any algorithm. (0-4 Marks)

7. Specify an conclusion.

  • Clear (1 Mark)
  • Intelligible but not sharp (0.5 Mark)
  • Not specified (0 Mark)

8. Report with suitable structure and writing skills.

  • Clear structure, nice presentation and excellent written (4 Marks)
  • Intelligible structure, good presentation and written (2-3 Marks)
  • Marginal content and hard to follow (1 Mark)
  • Not specified (0 Mark)

*9. Bonus marks (Improved data mining algorithm is provided and implemented in programming)

  • Provide an improved data mining algorithm in pseudocodes based on an existing algorithm. Demonstrate it is better by implementing in programming (screenshot is provided). (7-10 Marks)
  • Provide an improved data mining algorithm in pseudocodes based on an existing algorithm. No programming demonstration but with adequate theoretical analysis. (4-6 Marks)
  • Provide an improved data mining algorithm in pseudocodes based on an existing algorithm, with only limited theoretical demonstration. (1-3 Marks)
  • Not specified (0 Mark)

*Not everyone will achieve bonus marks, it is only for exceptional students.