Authorship detection: 2010 group
Revision as of 12:48, 7 March 2011
Supervisors
Collaborators
- François-Pierre Huchet, ITII Pays de la Loire, Nantes, France.
- J. José Alviar, University of Navarra, Spain
Students
Weekly progress and questions
Semester 2, Week 1
Jie Dong
Progress and Status this week:
- First meeting with Derek, Brian and Maryam and the other group members, Leng and Tien-en.
- Derek, Brian and Maryam introduced the basic idea of this data mining project
- The idea of authorship detection was introduced
- Several applications to which data mining techniques can be applied were mentioned
- Research by past years' students was mentioned, and Maryam sent us several past research reports together with the code
- Research on the project, especially on SVM and some algorithms
Plan and Goals for new week:
- Prepare for the proposal seminar.
- Read research reports from past years' students.
- Understand project handbook.
Leng Tan
Progress and Status This Week
- the 1st meeting for the final year project was held with the supervisor, Prof Derek Abbott, the co-supervisor, Dr Brian Ng, and Mrs Maryam, along with the team members.
- the initial project scope was introduced and the general aim of the project was discussed.
- the basic idea of the techniques of authorship detection was shown as well.
- several ideas for future applications of this project were highlighted.
- some hints on getting started were given, namely to read Talis's final year report, which will be provided by Mrs Maryam.
- we were reminded of the first milestone of the project, which is the proposal seminar.
Plan and Goals for Next Week
- fully read and understand Talis's report.
- have a brief look at the code that will be supplied by Mrs Maryam.
- do some research on the background of some controversial cases such as the works of William Shakespeare, the Federalist Papers and the Letter to the Hebrews.
- read through the 2010 project handbook to get a rough idea of all the project milestones, focusing on the project seminar.
Tien-en Phua
Progress and Status this week:
- Met up with project supervisor, Prof Derek Abbott, co-supervisor, Dr Brian Ng, and Mrs Maryam
- Derek discussed the concept behind authorship detection
- Derek explained how multi-dimensional graphs can link a disputed text to a known author.
- Discussed possible future applications. Brian suggested code plagiarism and possibly music.
- Was provided by Maryam with other students' projects and started to go through the report by Talis.
- Went through the FYP Project handbook
Plan and Goals for new week:
- Identify the methods Talis used in his report
- Research on various methods
- Read up on past works regarding authorship detection
- Research on controversy
Semester 2, Week 2
Jie Dong
Progress and Status this week:
- Three methods are chosen for this project: word frequency, word recurrence interval, and trigram Markov model
- Reading material on SVM (SVM tutorial)
- Experiment with SVM software in MATLAB
- Prepare slides for proposal seminar presentation on project aim, background, and part of project process
Plan and Goals for new week:
- Combine slides with other group members and make some modifications
- Send slides draft to supervisor for feedback
- Make more modifications
- Presentation on Thursday
Leng Tan
Progress and Status This Week
- identified the 3 methods that were mentioned by Talis.
- gained brief knowledge of the controversial issues.
- have a brief idea of the upcoming proposal seminar.
Plan and Goals for Next Week
- research on SVM.
- research on the background history of the project
- research on the different techniques used before in history
- prepare project proposal
Tien-en Phua
Progress and Status this week:
- Identified the three methods that Talis applied in his project, namely Word Frequency, Word Recurrence and Trigram Markov
- Briefly understand how the three methods work
- Identified the past works done by other researchers.
- Identified three main controversies, namely the Federalist Papers, Shakespeare's plays and the Letter to the Hebrews
Plan and Goals for new week:
- Prepare for Project Proposal
- Develop Gantt chart, project budget and risk analysis
- Identify major milestones in project
- Write up on controversy
- Further research on three methods
Semester 2, Week 3
Jie Dong
Progress and Status this week:
- At Monday's meeting we were introduced to Matthew and François-Pierre Huchet, who are also participating in this project.
- Came up with the first complete draft of the proposal presentation slides. Discussed the role of each person.
- Sent slides to Brian and Matthew for feedback
- Modified our slides
- Presentation on Thursday
Plan and Goals for new week:
- Do more research on the three methods and SVM
- Prepare for stage 1 design document
Leng Tan
Progress and Status This Week
- rough draft slides on the past research have been done for the proposal seminar.
- a comparison list of the different techniques is done.
- started research on SVM, which is to be added to the slides alongside the different techniques
- had a meeting with supervisors, and was introduced to Dr Matthew.
- focus 100% on the proposal seminar.
Plan and Goals for Next Week
- have a more detailed review on the 3 methods.
- read the criteria for the stage 1 design document.
Tien-en Phua
Progress and Status this week:
- Prepare for project proposal
- Developed Gantt chart, project budget and risk analysis
- Developed slides for milestones and controversy
- Research on SVM (Support Vector Machine)
- Gain a better understanding of Word Frequency, WRI and Trigram Markov
Plan and Goals for new week:
- Proceed to develop Stage 1 Design Document
- Understand SVM
- Develop Work Breakdown Structure
- Delegate tasks to individual members
- Read up on the other 4 reports
Semester 2, Week 4
Jie Dong
Progress and Status This Week
- In this project, we plan to have each person working on one method -- I am working on Trigram Markov model
- Read past reports for trigram Markov information
- Make stage 1 design document template
- Write project aim, background, and project approach in design document
Plan and Goals for Next Week
- Modify the design document draft
- Send to supervisors for feedback
- More modification
- Prepare a tutorial on SVM for other group members
Leng Tan
Progress and Status This Week
- research on the 3 methods has been completed.
- fully read and understood the criteria for stage 1 design document.
- had a brief meeting with group members to delegate the tasks in preparing the stage 1 design document.
Plan and Goals for Next Week
- do a rough draft of the tasks that are allocated.
- do a layout design for the document.
Tien-en Phua
Progress and Status this week:
- Develop Work Breakdown Structure
- Identified tasks required for Stage 1 Design Document
- Broke down tasks and assigned them to each member
- In the process of developing the Stage 1 Design Document
- Further research on SVM and Word Frequency
Plan and Goals for new week:
- Complete write up on Word Frequency and SVM
- Complete Stage 1 Design Document
- Coding and further research on Word Frequency
- Read up on the other 4 reports
Semester 2, Week 5
Jie Dong
Progress and Status this week:
- Done abstract, project aim, background and significance
- Done description of data extraction part for Trigram Markov model in design document
- Feedback from supervisors on design document
- Final modification on design document
- Format the design document on wiki
Plan and Goals for Next Week:
- Design the Trigram Markov model
- learn to use SVM
- a bit of coding on the trigram Markov model
Leng Tan
Progress and Status this week:
- Done Literature Review of design document
- Done description of data extraction part for WRI in design document
- Done project approach and milestone for design document
- added modified WBS in appendix
- done initial check and compilation of Design document
Plan and Goals for Next Week:
- start a rough design for WRI data extraction in Java
- read up on SVM
Tien-en Phua
Progress and Status this week:
- Completed design document
- Project Requirements
- Description of data extraction of Function Word Frequency analysis
- Project Budget
- Background and Significance of Hebrews
- Edited Gantt Chart, WBS to synchronise
- Editing, grammar check, etc.
 
- Basic layout of software design for data extraction algorithm
- Wiki page
Plan and Goals for Next Week:
- Commence programming of algorithm using Java
- Read up on SVM
Semester 2, Week 6
Jie Dong
Progress and Status this week:
- Research on Trigram Markov model
- Two models are proposed:
- Simple Trigram Markov model: only consider the effect of trigram in the text
- Potential problem with the first model: sparse data. New trigrams appearing in the test text lead to poor cross entropy
- Second model: hidden Markov model on trigrams. Not only trigram counts but also unigram and bigram effects are taken into consideration. The transition probability is composed of all three probabilities.
 
- The existence of punctuation and uppercase letters should be considered for text written in English.
- Programming on text file input and exception handling in Java
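The log does not give the formula used to combine the unigram, bigram and trigram probabilities in the second model. One common way to do it is linear interpolation, sketched below purely as an illustration. The class name, the weights (0.1, 0.3, 0.6) and the tokeniser, which lowercases the text and drops punctuation, are all assumptions and are not taken from the group's code.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only: linear interpolation of unigram, bigram and
// trigram estimates, so that an unseen trigram does not get probability zero.
class InterpolatedTrigramModel {
    // Hypothetical weights (not from the log); they must sum to 1.
    static final double W_UNIGRAM = 0.1, W_BIGRAM = 0.3, W_TRIGRAM = 0.6;

    private final Map<String, Integer> unigrams = new HashMap<>();
    private final Map<String, Integer> bigrams = new HashMap<>();
    private final Map<String, Integer> trigrams = new HashMap<>();
    private int totalWords = 0;

    // Simplification: lowercase everything and treat non-letters as spaces.
    static String[] tokenize(String text) {
        String cleaned = text.toLowerCase(Locale.ROOT).replaceAll("[^a-z\\s]", " ").trim();
        return cleaned.isEmpty() ? new String[0] : cleaned.split("\\s+");
    }

    // Count every unigram, bigram and trigram in the training text.
    void train(String text) {
        String[] w = tokenize(text);
        for (int i = 0; i < w.length; i++) {
            unigrams.merge(w[i], 1, Integer::sum);
            totalWords++;
            if (i >= 1) bigrams.merge(w[i - 1] + " " + w[i], 1, Integer::sum);
            if (i >= 2) trigrams.merge(w[i - 2] + " " + w[i - 1] + " " + w[i], 1, Integer::sum);
        }
    }

    // Interpolated estimate of P(w3 | w1 w2).
    double probability(String w1, String w2, String w3) {
        double p1 = ratio(count(unigrams, w3), totalWords);
        double p2 = ratio(count(bigrams, w2 + " " + w3), count(unigrams, w2));
        double p3 = ratio(count(trigrams, w1 + " " + w2 + " " + w3), count(bigrams, w1 + " " + w2));
        return W_UNIGRAM * p1 + W_BIGRAM * p2 + W_TRIGRAM * p3;
    }

    private static int count(Map<String, Integer> counts, String key) {
        return counts.getOrDefault(key, 0);
    }

    // Returns 0 when the context was never seen, instead of dividing by zero.
    private static double ratio(int numerator, int denominator) {
        return denominator == 0 ? 0.0 : (double) numerator / denominator;
    }

    public static void main(String[] args) {
        InterpolatedTrigramModel model = new InterpolatedTrigramModel();
        model.train("The cat sat on the mat. The cat ran.");
        System.out.println(model.probability("the", "cat", "sat")); // seen trigram
        System.out.println(model.probability("mat", "on", "sat"));  // unseen trigram
    }
}
```

With this scheme an unseen trigram still receives a small non-zero probability from the unigram term, which is the sparse-data problem noted for the first model. Keeping punctuation and case, as the note above suggests, would need a different tokeniser.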
Plan and Goals for new week:
- Discuss the models with supervisor
- SVM problem
- Programming on first model
Leng Tan
Progress and Status this week:
- Done a design for the WRI code after discussion with group members.
- written about 50% of the code for data extraction using WRI.
- read a bit on SVM but still don't understand it.
Plan and Goals for new week:
- finish the coding for WRI.
- try to get help for SVM.
Tien-en Phua
Progress and Status this week:
- Finished designing the algorithm (pseudo-code) for function word frequency in Java.
- Started implementing the algorithm.
- The code is halfway done.
Plan and Goals for new week:
- Finish coding.
- Discuss about SVM problems.
Semester 2, Week 7
Jie Dong
Progress and Status this week:
- Reading the chapter on hidden Markov chains in "Statistical Language Learning"
- Came up with my own test text to verify my code is working properly
- Meeting with Brian to discuss my current work. The current approach does not work efficiently
Plan and Goals for new week:
- The previous algorithm only considers the effect of word trigrams. The result for a test paragraph contains a lot of useless information: about 70% of the trigrams appear only once, and only about 10% of the information is worth using in classification. When common trigrams are extracted from several test texts, few of them are left. Hence another, enhanced model, in which unigrams and bigrams are also taken into consideration, will be tested in the following week.
- SVM will also be used to test the results in the coming week. Investigating how to use the SVM functions in MATLAB, svmtrain and svmclassify (Bioinformatics Toolbox)
Leng Tan
Progress and Status this week:
- Finished the Java coding for the WRI technique in the data extraction algorithm.
- Tested and verified that the code is working properly using a small test file (a text file with only a few sentences).
- Had a meeting with Brian to discuss the SVM input and output.
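As a rough illustration of the word recurrence interval (WRI) idea, the sketch below measures the gaps, in words, between successive occurrences of one word and then takes the standard deviation of those gaps. The log does not spell out the group's exact definition, so the class name, the use of the population standard deviation and the absence of any scaling are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: recurrence intervals of a single word.
class WordRecurrenceInterval {
    // Distance in word positions between successive occurrences of target.
    static List<Integer> intervals(String[] words, String target) {
        List<Integer> gaps = new ArrayList<>();
        int last = -1; // index of the previous occurrence, -1 if none yet
        for (int i = 0; i < words.length; i++) {
            if (words[i].equals(target)) {
                if (last >= 0) gaps.add(i - last);
                last = i;
            }
        }
        return gaps;
    }

    // Population standard deviation; returns 0 for an empty list.
    static double standardDeviation(List<Integer> values) {
        if (values.isEmpty()) return 0.0;
        double mean = 0.0;
        for (int v : values) mean += v;
        mean /= values.size();
        double sumSquares = 0.0;
        for (int v : values) sumSquares += (v - mean) * (v - mean);
        return Math.sqrt(sumSquares / values.size());
    }

    public static void main(String[] args) {
        String[] words = "a b a c c a b b b a".split(" ");
        List<Integer> gaps = intervals(words, "a");
        System.out.println(gaps + " -> " + standardDeviation(gaps));
    }
}
```

One such value per chosen word would give a feature vector for a text, which matches the later note that the WRI input to SVM consists of standard deviation columns.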
Plan and Goals for new week:
- Figure out SVM.
- Test and try out SVM in MATLAB using small test files.
Tien-en Phua
Progress and Status this week:
- Completed coding for data extraction algorithm (DEA)
- Discuss implementation of output of data from DEA to SVM
- Analyse how other researchers analyse their data
Plan and Goals for new week:
- Modification and refining of DEA code
- Continue analysis of how other researchers used this DEA for authorship attribution
- Try applying data to SVM
Semester 2, Week 8
Jie Dong
Progress and Status this week:
- Peer review assessment on the design document on "Audio assisted vision system"
- Investigation on SVM in MATLAB
- Working on modified trigram model
Plan and Goals for new week:
- Test the results of my Java program with SVM
Leng Tan
Progress and Status this week:
- Received a stage 1 design document on "Audio Assisted Vision System for Visually Impaired People".
- The document was fully read and notes were taken on presentation and various other aspects.
- The document was reviewed and a formal peer review report was produced.
- Investigation of SVM in MATLAB was halted for a moment due to the peer review report.
Plan and Goals for new week:
- Figure out SVM.
- Test and try out SVM in MATLAB using small test files.
Tien-en Phua
Progress and Status this week:
- Completed the coding of the Data Extraction Algorithm. It is able to load a file, remove punctuation, and create a new output file as input for the Support Vector Machine
- Reviewed the peer document and did some research on the principles of echolocation performed by bats to understand the document
- Completed Peer Review on Audio Assisted Vision System For Visually Impaired People
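The load, strip-punctuation and write-out steps of a function word frequency extractor might look like the sketch below. It is not the group's code: the class name, the five-word function word list, the relative-frequency normalisation and the single space-separated output line are all assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only: relative frequencies of a few function words.
class FunctionWordFrequency {
    // Hypothetical, very short list; a real study would use a longer one.
    static final String[] FUNCTION_WORDS = {"the", "of", "and", "to", "in"};

    // Lowercase the text and replace punctuation and digits with spaces.
    static String[] tokenize(String text) {
        String cleaned = text.toLowerCase(Locale.ROOT).replaceAll("[^a-z\\s]", " ").trim();
        return cleaned.isEmpty() ? new String[0] : cleaned.split("\\s+");
    }

    // Frequency of each function word divided by the total word count.
    static Map<String, Double> frequencies(String text) {
        String[] words = tokenize(text);
        Map<String, Double> freq = new LinkedHashMap<>();
        for (String fw : FUNCTION_WORDS) freq.put(fw, 0.0);
        if (words.length == 0) return freq;
        for (String w : words) {
            if (freq.containsKey(w)) freq.put(w, freq.get(w) + 1.0);
        }
        for (String fw : FUNCTION_WORDS) freq.put(fw, freq.get(fw) / words.length);
        return freq;
    }

    // Load one text file and write its feature values as a single line.
    static void extract(Path input, Path output) throws IOException {
        String text = new String(Files.readAllBytes(input), StandardCharsets.UTF_8);
        StringBuilder line = new StringBuilder();
        for (double f : frequencies(text).values()) {
            if (line.length() > 0) line.append(' ');
            line.append(f);
        }
        Files.write(output, Collections.singletonList(line.toString()), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(frequencies("The cat, and the dog: in the end!"));
    }
}
```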
Plan and Goals for new week:
- Apply the data generated by the data extraction algorithm to the support vector machine
- Determine progress of project and review schedule.
Semester 2, Week 9
Jie Dong
Progress and Status this week:
- The hidden Markov model is implemented in Java, and the program produces a table of probability information for trigrams that are common to the input texts. The current problem is that, because I am feeding all the words that appear in the texts into the program, there are few common trigrams among a given number of input texts. For example, with a total of 20 input texts from two authors, they have just one trigram in common. I also set the program to allow a trigram to be shared by only part of the texts, with zero probabilities put in for the others, but the result is still not efficient.
- Reading through Talis's trigram description and code, I found that he simplified the method and extracted the key specification by deleting the non-key words. Testing his idea in Java, I found it does extract a lot more information than mine. However, it raises the question of whether it reduces the accuracy of classification, since it changes the original text into another. This simplification needs to be justified.
- The result produced by the extraction algorithm was fed into the MATLAB SVM functions (svmtrain and svmclassify), which shows my extraction algorithm is not working properly: sometimes the predicted author for the chosen texts is correct and sometimes it is not. As for the SVM functions themselves, they only support classification into two groups, and multi-group classification produces an error. In addition, they can only plot the SVM structure for two-dimensional data. Hence, more advanced SVM toolboxes should be studied.
Plan and Goals for next week:
- GUI design
- Test efficiency using different groups of input texts
- Try another SVM toolbox from: http://asi.insa-rouen.fr/enseignants/~arakotom/toolbox/index.html
Leng Tan
Progress and Status this week:
- A basic SVM code which receives a text file as input was produced.
- The SVM code needs 2 training data groups and a number of test data groups.
- The standardised format for the input to SVM was decided by team members.
- The input format is an MxN matrix where the first column is the author and the subsequent columns are the data (in my case, standard deviations).
- Initial data uses 20 standard deviation columns.
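The agreed MxN layout, with the author in the first column and the data in the remaining columns, could be written out from Java roughly as follows. The numeric author label, the space delimiter and the class and method names are assumptions. The log only fixes the column order.

```java
// Illustrative sketch only: one text per row, author label first,
// followed by that text's feature values.
class SvmInputMatrix {
    // Formats a single row, e.g. label 1 with features {0.5, 2.0}.
    static String formatRow(int authorLabel, double[] features) {
        StringBuilder sb = new StringBuilder();
        sb.append(authorLabel);
        for (double f : features) sb.append(' ').append(f);
        return sb.toString();
    }

    // Formats the whole matrix, one row per line.
    static String formatMatrix(int[] labels, double[][] features) {
        if (labels.length != features.length) {
            throw new IllegalArgumentException("one label per row is required");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < labels.length; i++) {
            if (i > 0) sb.append('\n');
            sb.append(formatRow(labels[i], features[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] labels = {1, 2};
        double[][] features = {{0.5, 2.0}, {1.5, 3.0}};
        System.out.println(formatMatrix(labels, features));
    }
}
```

A plain numeric text file of this shape is easy to load on the MATLAB side, where the first column can be split off as the group vector.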
Plan and Goals for next week:
- The SVM predicts the author wrongly and this needs to be resolved.
- Might be due to insufficient training data.
- Further testing is required.
- Might consider implementing GUI.
- Need to have a meeting with supervisors on progress and GUI implementation (can the Java and MATLAB GUIs be combined?)
Tien-en Phua
Progress and Status this week:
- Researched statistical software for obtaining the covariance of data (StatGraphics)
- Downloaded and installed the chosen software and attempted to operate the program
- Researched a book discussing the possible author of Hebrews: Nacsbt: Lukan Authorship Of Hebrews
Plan and Goals for next week:
- Obtain the covariance of the data
- Check to see if the data extraction algorithm produces results similar to Talis's
- Produce code to "chop" all text files to a specific length for analysis
- Input data to SVM and observe the outcome
- Combine functions for analysis
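The planned "chop" step could be sketched as below. The log does not say whether the length is counted in words or characters, or what happens to a leftover tail, so this sketch assumes a length in words and discards a final partial chunk so that every chunk has exactly the same length. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: split a text into equal-length word chunks.
class TextChopper {
    static List<String> chop(String text, int wordsPerChunk) {
        if (wordsPerChunk <= 0) {
            throw new IllegalArgumentException("wordsPerChunk must be positive");
        }
        List<String> chunks = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) return chunks;
        String[] words = trimmed.split("\\s+");
        // Only full chunks are kept; a shorter tail is dropped (assumption).
        for (int start = 0; start + wordsPerChunk <= words.length; start += wordsPerChunk) {
            chunks.add(String.join(" ", Arrays.copyOfRange(words, start, start + wordsPerChunk)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(chop("a b c d e f g", 3));
    }
}
```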
Semester 2, Week 10
Jie Dong
Progress and Status this week:
- The original Java program is rebuilt as a standard Eclipse project
- Deleted the Transition class, which is no longer used
- Changed three classes (State, Gram, Record) to inner classes
- Reduced the original three main methods in separate classes to a single one in the Driver class
- Moved methods for user inputs to the Driver class, including parameters and paths
- Added three header lines to the Java program output: number of texts, number of disputed texts, number of trigrams used
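A minimal sketch of such a three-line header is given below, assuming each count sits on its own line as a plain integer in the order listed above. The class and field names are hypothetical, and the exact file layout used by the group is not recorded in this log.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: the three header values at the top of the
// extraction output, one integer per line.
class FeatureFileHeader {
    final int textCount;      // number of texts
    final int disputedCount;  // number of disputed texts
    final int trigramCount;   // number of trigrams used

    FeatureFileHeader(int textCount, int disputedCount, int trigramCount) {
        this.textCount = textCount;
        this.disputedCount = disputedCount;
        this.trigramCount = trigramCount;
    }

    // The three header lines, in the order they are written.
    List<String> toLines() {
        return Arrays.asList(
                Integer.toString(textCount),
                Integer.toString(disputedCount),
                Integer.toString(trigramCount));
    }

    // Reads the header back from the first three lines of a file.
    static FeatureFileHeader parse(List<String> lines) {
        if (lines.size() < 3) {
            throw new IllegalArgumentException("expected three header lines");
        }
        return new FeatureFileHeader(
                Integer.parseInt(lines.get(0).trim()),
                Integer.parseInt(lines.get(1).trim()),
                Integer.parseInt(lines.get(2).trim()));
    }

    public static void main(String[] args) {
        FeatureFileHeader header = new FeatureFileHeader(20, 2, 50);
        System.out.println(header.toLines());
    }
}
```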
Plan and Goals for next week:
- Standardise three algorithms into one project folder
- Use same training data, unknown data to test three extraction algorithms
- Compare their accuracies in different situations (number of key words, number of texts, etc.)
Leng Tan
Progress and Status this week:
- had a meeting with the supervisors and reported on the progress of the project.
- SVM code remains the same for the time being.
- A table of results should be produced to compare the data extraction algorithms.
- the main idea of the progress report was discussed.
Plan and Goals for next week:
- A standardised template to combine all 3 data extraction algorithms was discussed.
- WRI code needs to be slightly modified.
- need to plan the initial design for the GUI.
Tien-en Phua
Progress and Status this week:
- Modify code to accept multiple inputs
- Extracted the Federalist Papers for testing on the support vector machine using function word analysis
- Meeting with supervisors on Wednesday for progress updates and guidance on next step
- Commencement of progress report
Plan and Goals for next week:
- Produce a table of results displaying the accuracy of the algorithm for each SVM kernel function
- Complete progress report, project background, project specification, progress thus far and project management
- Combine the three algorithms into a single driver file
- Discuss and design possible implementation of a GUI
Semester 2, Week 11
Jie Dong
Progress and Status this week:
- Update progress report
- Java program modification:
- Sort the list of files read in by name
- Replaced manual parameter setup with automatically reading in the data and forming the training set and testing set according to the three header lines
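The two modifications above might be sketched as follows. The sketch assumes that the first header line counts all texts including the disputed ones, that the data rows follow the three header lines, and that the disputed texts come last. None of this is stated in the log, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: sort file names, then split data rows into
// training and testing sets using the three header lines.
class DataSetSplitter {
    // Lexicographic order, so "text10.txt" sorts before "text2.txt".
    static String[] sortByName(String[] fileNames) {
        String[] sorted = fileNames.clone();
        Arrays.sort(sorted);
        return sorted;
    }

    // lines: three header lines followed by one row per text.
    // Returns a two-element list: training rows, then testing rows.
    static List<List<String>> split(List<String> lines) {
        int textCount = Integer.parseInt(lines.get(0).trim());
        int disputedCount = Integer.parseInt(lines.get(1).trim());
        // The third header line (number of trigrams) is not needed here.
        List<String> rows = lines.subList(3, 3 + textCount);
        int trainCount = textCount - disputedCount; // assumption: total includes disputed
        List<List<String>> result = new ArrayList<>();
        result.add(new ArrayList<>(rows.subList(0, trainCount)));
        result.add(new ArrayList<>(rows.subList(trainCount, textCount)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sortByName(new String[]{"b.txt", "a.txt", "c.txt"})));
        System.out.println(split(Arrays.asList("3", "1", "5", "row1", "row2", "row3")));
    }
}
```

Sorting by name keeps the row order reproducible between runs, which matters if the training and testing sets are identified purely by position.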
Plan and Goals for next week:
Leng Tan
Progress and Status this week:
- Do progress report.
Plan and Goals for next week:
- catch up on assignments and prepare for exams.
Tien-en Phua
Progress and Status this week:
- Update of progress report
Plan and Goals for next week:
- Complete 4 upcoming assignments
- Prepare for power system quiz
Semester 2, Week 12
Jie Dong
Leng Tan
Progress and Status this week:
- assignments due this week are completed.
Tien-en Phua
Progress and Status this week:
- Completed all assignments due this week
Plan and Goals for next week:
- Need to prepare for exams. SWOT week next week.
- Project will "pause" till after exam period, 20 Nov 2010, thereafter the team will be working individually back in their home country and update each other via email
Semester 1, Week 1
Jie Dong
Progress and Status this week:
- Had a small discussion with the team members and work on SVM.
- Modify SVM program to support multi-group classification function
- Test the accuracy of the whole classifying program with English texts
- Generate accuracy table with respect to three different variables: tolerance, number of key words and kernel function (linear, quadratic, rbf, polynomial)
Plan and Goals for new week:
- Discuss with supervisor about the performance of current program and suggest ways to increase accuracy
- Apply interface developed by Joel
Leng Tan
Progress and Status this week:
- Brief discussion with team members on the project.
- the English texts are used to test the accuracy of the program.
- Tried different SVM kernel functions while testing the accuracy.
Plan and Goals for Next Week
- Organize a meeting with the supervisors for updates.
- discuss a constant text length with Joel.
- try to combine the code.
Tien-en Phua
Progress and Status this week:
- Conducted a brief meeting with team members to further evaluate the SVM.
Plan and Goals for new week:
- Have a meeting with supervisors showing the results.
See also
- Authorship detection: 2010 group
- Authorship detection: Who wrote the Letter to the Hebrews?
- Minutes of Meeting 2010: Who wrote the Letter to the Hebrews?
- Critical design review 2010: Who wrote the Letter to the Hebrews?
- Progress Report 2010: Who wrote the Letter to the Hebrews?
- Final report 2010: Who wrote the Letter to the Hebrews?