# Small World Networks 2008

## Contents

## Supervisors

Prof Derek Abbott & Dr Matthew Berryman

## Weekly progress and questions

### Week 2

Wenrui (Linda):

Sweet and I had a short meeting last Wednesday, we decided to separate our proposal into three parts. Each person will talk about a little bit in each part. because we think it's necessary for both of us to understand every parts to get started.

For myself, I have prepared some definition about the seminar, and also I have read the book called "linked". But I haven't finish it yet. For the coming week, I will continue work on the proposal. maybe need to get start to make the PPT. find some fun video from www.youtube.com.

Zhujing (Sweet):

Linda & I managed our coming proposal this Wendnesday. There will be 3 parts and each part will take 5 mins(total 15 mins):

1.Introductions and background

What is "Small World Network"?

What is "Six Degrees of Separation"?

Is this theory existing in our life?

2.Show some fun things to support that the theory is existing around us

Relationship network of Hollywood stars

High school dating

A pretended network

About the pretended one, I have an idea. Recently, I am playing a wii game, "Super Mario Galaxy". The whole world in the game just 6 galaxys and each galaxy has 4-5 planets. If Mario wanna go any planet, he needs to fly to the centre of the galaxy first which the planet belongs to, and then tracks the network amoung the planets to get the destination.I sopport every planet is a individual node and each flying track is a link. We can calculate manually that how many degrees Mario needs :)

3.The serious things, we have no idea about this part, but we will think about it this weekend.

### Week 3

Wenrui (Linda):

For this week, I just use all the information I prepared for last week to made the ppt, basically on the introduction part and the fun part.

For the coming week, we will have our regular meeting(once two weeks) to discuss the serious part. such as the hierarchical clustering algorithms, so hopefully we will work out more about this part next week.

Zhujing (Sweet):

This week, we continued prepare our part 1&2, and had some ideas of the part 3.

1> I decided to change "Mario galaxy" to another game which mentioned in fun part last week, because the other one has a more clear network image.

2> Start drawing some pretended small networks(with 5-10 nodes) to find some measures for the part 3.

3> Read the book "Linked" to get some more ideas.

### Week 4

Wenrui (Linda):

We had a regular meeting with Matthew last Tuesday, during the meeting, we talked about the serious part. we talked about three types of failure - random link (or node) failure, targeted link (or node) failure, and cascading failures (one link fails, then the next links fail), and also an example of Google's algorithm ranks. hopefully it will be very helpful for the seminar next week.

For the coming weekends,I will finish my PPT, and send it to Matthew on the early of next week, hopefully can get some good feedback before the formal seminar.

Zhujing (Sweet):

After our regular meeting, we got some useful sense of the coming seminar, especially the most important part, the serious one.

1> About the fun part, we decided change the movie to a new series, because this series has better and more smooth images. And in the old movie there are some kinds of different languages, we must to use the subtitle. But the series is all English, that is more convenient for our work.

2>For the serious part, I was praparing the pertended networks. And I changed something of my old networks to make these show some different characters, such as target link. Because of these changes I need to recalculate some stuffs.

3>Countiue reading "Linked" to get more sense :)

### Week 5

Wenrui(Linda):

last week, we did our proposal seminar, I didn't do that well, I talked too much about the hierarchical networks. However, I didn't clearly explain something important, and didn't include something important. e.g.

- we should include a plan for our software design.

- I should talk more about our Gantt chat, and as well as the break down structures.

- In the risks analysis part, the idea is to talk about the technical risks in the project itself, but not some personal issues.

- We should include the discussion of the data sets we will use.

- We should include the discussion of relevance to electrical engineering.

- we should mention what is our aim for the project.

- we should include the slides to explain finding shortest path, analyzing network efficiency, finding important hubs, finding weaknesses that can cause failure of a network, identifying weaknesses that make the network vulnerable to attack.

During the presentation, we learned our lessons, i.e. if we couldn't answer the question from the audiences, at least we should try to say something which is related to the question, it's very bad to just stand there and say nothing. if couldn't understand the question, just ask them to ask in a easy way. if still cannot answer, then we can say:" good question, We do not know the answer to that and we will definitely follow it up later."

For the coming week, I am planning to work on the software design. I will basically read the algorithms on the internet. hopefully I can write some simple code myself, and start to test it with the Mini networks.

Zhujian(Sweet):

This week, we just finished our proposal seminar. There were some weaknesses in our presentation, but we will improve these next time. The biggest problem for me is too nervous to listen to the questions. After prsentation, I just recalled these and found I could answer most of these. That is the most important thing I need to improve.

For the coming week, we will start some matlab analysis and read the references about our project.

### Week 6

Wenrui(Linda):

For last week, I have searched something about the software design part on the internet, but I haven't really finish that yet. I will continue my research next week as we got more time for the holidays.

Use enron as an example, we can start to write some simple program to analysis the degree distribution and as well as the clustering coefficient. e.g. the train map in Hongkong or New York.(any big cities)

what's more, I need to prepare for next week's meeting, we will talk about the software design by using enron as an example, hopefully next week's regular meeting will be helpful for our progress.

Zhujian(Sweet):

I was programming some functions in Java and read more references about the properties of network to figure out the measures of my hand-calculating part. I am not very sure whether my measures are correct. Next week we will have a meeting, I will ask for some helps.

Btw, when I run some Java codes in windows system, my computer alway not very stable :( Maybe I need install a Linux system :(

### Week 7

Wenrui(Linda):

According to our meeting on Wednesday, we talked about how to write the Pajek .net file from a simple state diagram. How we calculate the average degree distributions, and hence the find out the clustering coefficients.

For the coming week ,we will design a simple state diagram ourselves, and try to use the same method as Matthew showed us to calculate the clustering coefficients. I was planning to use New York city train map as an example to build our network, but Matthew gave us a better suggestion, that is to use the Singapore train map, because the road the is smaller and more organized. So that can reduce our work load.

Zhujian(Sweet):

We just had meeting on Wednesday. And then, I find some mistakes of my measures. I will fix these and figure out how to test in Pajek. There are still some problems of Java running in windows system. I am trying solve this this week. Maybe something softwares crash.

### Week 8

Wenrui(Linda):

From last week, I have started some calculations with the simple systems(train map), this is basic on the clustering coefficients calculation. Moreover, I have done the introduction part of the report which is due on week 8, it is similar to what we have done for proposal anyway.

For the coming week, we will try to do as much as we can for the Critical Design Review report.

Zhujian(Sweet):

This week, I start to read a new book named "Models and Methods in Social Network Analysis". That is very helpful for my understanding of the measures. I try to prove some fomulas by my own way. That is a very interesting thing :)

### Week 9

Wenrui(Linda)

From last week, I have finished the calculations of the clustering coefficient by using both the undirected and directed graphs, for undirected graph, I used Adelaide train map system as an example, set each station as a node in the graph. I used the Adelaide road map as an example of the directed graph, because some of the street are only on way to go. but some of them are both ways. I have also finish the project management part, which is the risks anysis, and the budget.

For the comming week, we will finish the report.

Zhujian(Sweet)

1> I have finished all the mini network examples which I mentioned in the proposal seminar manual calculation. And change these undirected networks to directed networks to compare the difference between them. The calculations include average degree (for directed networks include in-degree and out-degree), clustering coefficient, short path between 2 nodes, the longest distance between 2 nodes in the network, etc.

2> I downloaded a book named “Exploratory Social Network Analysis with Pajek” to know more properties of network and be familiar with Pajek, But still have some confused problem of using this software. I was trying solving this.

3> Preparing the critical design review report. I am in the charge of manual calculation and coding test. Coz I got in trouble of using Pajek, I will do the test next week, just need some hints. And for the JUNG part, now I decide to temporary stop. I forgot a lot of about java programming stuffs. That need a lot of time to recall. I will do that after the due week.

### Week 10

Wenrui(Linda)

For this week, we have corrected the calculation mistakes that Matthew corrected for us, what's more, we have finished the report.

For the comming week, i will work on the peer review. read the book and as well as study more about the software.

Matthew, do we need more meetings for the last 3 weeks of this semester?

Zhujian(Sweet)

1> Corrected the mistakes in our calculation parts, found the right way to use Pajek to test and finally, finished the report.

2> Download the book "Models and Methods in Social Network Analysis", which I mentioned in mid-break, to solve some problems I met in book "Exploratory Social Network Analysis with Pajek".

3> Prapared peer review report.

### Week 11 & 12

Wenrui(Linda)

For these two weeks, we had just read some reference books, we got our feedback for critical design review, so that we can improve our final project report. For the coming winter holidays, we decided to work on the software part of the project, some functionality of the software, such as java, Pajek

Zhujian(Sweet)

1> Read the books which I mentioned before

2> Find some courses which I am studing using our "Small world network". Some ones are quite similar, and the others are not but use some theories of our project. Such as, shortest path, degree, target link

### Week 13

Wenrui(Linda)

Last week, I have watched the movie "number" that Matthew gave to us. Volume one is talking about to find the central node of a network system, it's pretty relevant to our project. hopefully we can also work out some big events later by applying our discover during the project.

Also I will ask sweet to get the copy of "Models and Methods in Social Network Analysis" for me to read through.

One more question, for the enron data set, Matthew, you have already gave us the code, but it's in Python which i have never use before, should i rewrite it in java again?

Zhujian(Sweet)

This week, I was thinking a problem, the following is what I thought

The relationship is very important of society, especially in China. And I find the weak relationship always to be the keypiont to spread my own social network. For example, if I want to find a job, that is hard to have some helps from my best friends, because we always have the similar social networks. But the acquaintances in my network may give me more chances to get a job, because they may have different social relationships from mine. In engineering aspect, we always want to improve the weak link(eg:target link) to be stronger. In contrast, we always ignore these weak relationships in real social network, but sometimes these weak ones seem to be the keypoint to spread our relation nets. Should we need to improve these ones to be stronger like in the engineering aspects or just keep these weak?

Making social network links stronger requires time and energy. You have to look at the benefits to improving links to those people. Perhaps in terms of finding a job, you can just keep a weak link to them. A key difference between social networks and engineering networks is that in a social network you only have one link to another person. In networks like power networks and communications networks, you can usually add in multiple links between two nodes, and get an improvement in robustness. So it's a different kind of strength. --Mattjb 11:39, 16 June 2008 (CST)

### Week 14

Wenrui(Linda)

For the last few weeks of the semester, we will be busy for the exams and some of the homeworks. so maybe we will stop our project for the exam period, will continue our research after the exams. during these time, we will just read some reference book, such as "Linked" and some other papers. Matthew, could you give us a tasks list, i think it will be more clear for us to know our assignment for the project. we will figure that out during the semester break.

Graph file format converter (GML to Pajek), draw graphs, calculate network measures for networks (see network data below), matlab analysis and graphing of results, hierarchy algorithm (Enron data), --Mattjb 11:58, 16 June 2008 (CST)

Zhujian(Sweet)

1. After the meeting, I have done some Graph file format converter for Pajek. It works well. Later I will try to do some more like high school dating graph.

2. I am still studing for how to use Pajek. I have tried some examples in the book and make some other manual calculations to test the functions in Pajek.

Sorry about my late update, I just forget my password :)

### Week 1&2

Wenrui(Linda)

For the last two weeks, i have been doing some jave and matlab reviewing. i have tried to read the data from the data sets, later i will try to find the relationship between every single data, try to link them together, and finally use matlab to show the result by graphs.

Zhujian(Sweet)

1. These 2 weeks I have tried to build the High school dating for Pajek and also done the degree distribution, cluster coefficient atc. Now I am doing the analysis of these datas. I think that I need some other same kind graphs but with different number of nodes, links, then I can compare these and analyse better. Can I just change the original graph?

If you want to have some data to test your algorithms on, then this is a great idea, but don't draw any conclusions about social networks from it. --Mattjb 10:37, 18 August 2008 (CST)

2. A firend of mine is studing sociology. We are doing a small social network reaserch now. Can I use the data from our reaserch for this project?

Sure, but be wary that if your friend has collected the data and you use it, to make sure that you acknowledge this properly in your report & presentation. --Mattjb 10:37, 18 August 2008 (CST)

### Week 3&4

Wenrui(Linda)

For the enron email date set, i understand that the enron.net file is where to store every single code and their email address. enron.pairs file is where to store who actually send to the email to whom, which presented by code, it shows the direction as well. md5-authors.txt gives every single email address a uniq code, with the same length and the same type. However, when i try to match them, i can't find some of the email address in the first file which is enron.net. is there any data missing?

The enron.net Pajek file lists the nodes (number ID of person plus their email address) *and* a set of directed edges representing emails being sent. The other files are those I generated enron.net from. The message IDs are different to the person IDs (my software generates these). --Mattjb 14:01, 8 September 2008 (CST)

should i use java or C++ to make them look like what we need for the pajek software? just like what you had in md5-recipients.txt file? that will looks like Directed graph's file.

It's already in a Pajek format. I think there is still an issue with quote marks needed around the email addresses, and I will fix this up tonight. I can't test from home though. --Mattjb 14:01, 8 September 2008 (CST)

I have been trying to use the data from my facebook to make the file as well, i made the data set myself, what i did is make my name as the first node, all my friends in the facebook are the second level of the hierchical system, and then my friends' friends, and so on. it seems hard to get similar style of the file. facebook is where to know a friend, if you are a friend of some one, then the relationship should be both sides, so should be a undirected graph.

Yes, it's undirected, because each time a person adds someone, that someone must confirm the friendship link. --Mattjb 14:01, 8 September 2008 (CST)

it's obviously a hierachical.

While it has levels, it's not hierarchical in the sense that any person has a position above another person, nor is it a tree structure, since you can have cycles (A is friends with B is friends with C is friends with A, etc.) --Mattjb 14:01, 8 September 2008 (CST)

To anlaysis the third level of the data set, you need the data from the forth level, if i want to get a general result of the friendships, then that will be hard, because i will be need a really large data set, or maybe a data base. what's more, if just consider about the direct relationship but not 2 degrees or 3 degrees, in the hierachical system, we will only consider about the upper level and the lower level, then i don't know what the point to do that. However, if you find someone you know in your friend's list, add him/her to be your friend, that person may not know you at all. so in this situation the relationship is only one side.

For the matlab part, what kind of result should we show? i mean should we use matlab to draw the graphs to show all our value which we got from the pajek calculation?

Yes :) --Mattjb 14:01, 8 September 2008 (CST)

Zhujian(Sweet)

I think we got lots of social network. Should we do something about the Engineering network?

I have put some Internet networks up that you can manually turn into Pajek format files and process. Feel free to search for any other engineered networks. --Mattjb 14:01, 8 September 2008 (CST)

About my research, my friend and I just finished all the data collection. We are doing the analysis of these stuff. That is also about the social network. In my research, we just sort the links to 2 type, the "Money friendship" and "Emotion friendship". We will do the research of the 2 kind of links. I will do the analysis of data, my friend will do the part of literal analysis. I think that is better than we just use the data from internet.

For the Engineering network, I prefer the telecommunication network, like the telephone links, the base stations distribution of mobile phone, the exchange of a city or province, atc. That is just my viewpoint. I think Linda can do the part of engineering. Choose some networks have the hierarchical directory structure and include other normal or special properties, then start ur work :)

### Week 5

Wenrui(Linda):

software study:

Java: 1. readFile function: public class ReadFile; FileNotFoundException; import java.io.IOException; import java.io.File;

2. remove the duplicate data.

Zhujian(Sweet):
Start social network analysis

1>Using stage 1 and 2 database to draw different pictures

2>Analyze the network

3>find the typical group structure

### Week 6

Wenrui(Linda):

1. Analysis Enron data sets by java.

2. file reader.

3. file converter.

Zhujian(Sweet):

Go on to do the stage 3-5 social network analysis

1>Find special case in the network

2>Observe the data change

3>Using Pajek calculate special parameters

Start software Design

1> Properties choose

2> GUI structure

### Week 7

Wenrui(Linda):

Find Engineering network examples:

router network: 1. degree distribution

2. shortest distance

Zhujian(Sweet):

Finish all the social network anlysis

Focus on software design

start GUI programming

### Week 8

Wenrui(Linda):

some important properties of the network analysis:

1. target link 2. central node 3. distance 4. degree distribution 5. cluster coefficient 6. weak ties and strong ties

Zhujian(Sweet):

Debug GUI code

Algorithm analysis by java

1>shortest distance

2>find target link

3>cluster coefficient calculation

4>degree distribution calculation

### Week 9

Wenrui(Linda):

Engineering network example: Internode

Analysis the internode network by those important properties

1. target link

2. cluster coefficient

Zhujian(Sweet):

### Week 10

Wenrui(Linda):

1. prepare the ppt for the final seminar

2. practice the final seminar

3. Demonstration

Zhujian(Sweet):

### Week 11

Wenrui(Linda):

software analysis

1. java: file reader; file converter; interface;

2. matlab: draw the degree distribution, and the degree by using matlab, hence easy to observe the average degree.

3. prepare the final report

Zhujian(Sweet):

### Week 12

Wenrui(Linda):

prepare the final project report.

prepare the exhibition.

Zhujian(Sweet):

## Project proposal

For the project proposal you need to have:

- Some fun stuff (fun small-world networks)
- Some more serious engineering reasons why small-world networks are important - namely robustness of power networks, communications networks, Google searching, finding who is the leader in social networks (like we discussed about Enron)
- Some slides on network measures (correlation coefficient, cluster detection algorithms).
- How you will start with some small networks that you can test your software on before moving on to a large data set.
- Some slides on project management - how you might structure your software modules & data flow, and definitely Gantt Chart (including slack space to cope with risks), budget, milestones, who is doing what (& why).

Here are some fun network graphs: [1] There are some fun and some more serious networks here: [2]

When you include graphics from the web, don't forget to include on the slide the web address you found them on. It is also a good idea to have a slide at the end with some key references (e.g. Barabasi book, Milgram's paper, the paper I gave you on finding hierarchies).

## Critical Design Review

For the Critical Design Review (not due until Week 8) you need to have the following parts:

- Project Aim & Background
- Literature review, where you discuss a number (say 10-15) of papers / books and their contributions. Try and summarise in a few sentences what each one is about.
- Your approach - what you are going to do, your software design (see below), and who is going to do what.
- Analysis of your software design - what are the hard sections, what are the critical parts that you need to get working first.
- Project management stuff. For risks you don't need to worry much about Occupational Health & Safety (apart from general computer set up) but there are other risks in your project to do with task timing, what happens if someone (including Derek and I) is unavailable, budget over-runs (should be minimal risk but you can still mention it).

Aside from the engineering applications (where small-world networks are important), the other engineering in this project is software engineering. You should have a block diagram showing the main parts of the software, i.e. graph file format converter, graph drawer, network measure calculators, hierarchy algorithm, and matlab analysis (graphing and statistics), and my software for converting the email data set into a graph file. For some of these you can then include what sub-modules they have, e.g. the network measure calculator might have sub-modules for degree distribution, for clustering coefficient. You should also have a flow diagram showing the data flow between these modules, and in the critical design review give details of the algorithms. You can software you find on the web but you will need to acknowledge it in your proposal report, make sure you understand how it works, and test it. It may be easier though, to write the software yourself then writing a file format converter to get other software to work.

The Critical Design Review can just be emailed directly to myself and Derek on the due date, and you only need one for your group. For the final report you need to write one each (please don't copy each other's writing but write it in your words - you can copy equations, flow charts, graphs, diagrams, reference list, software though. Derek and I are happy to read draft copies.

## Worked examples

### Directed graph

http://www.berrymanconsulting.com/images/SimpleDirectedGraph.jpg

Pajek .net file

*vertices 4 1 "1" 2 "2" 3 "3" 4 "4" *arcslist 1 2 4 2 3 4 3 1 4 1 2

Note that in my drawing done with different software I have drawn the bidirectional edges as two separate edges, whereas Pajek draws them as a single edge with arrows on both ends.

In-degree of nodes:

- Node 1 has 2 incoming edges
- 2
- 1
- 2

So the in-degree distribution is:

- 1
- 3

(i.e. the number of edges with 1 incoming edge is 1, and the number of edges with 2 incoming edges is 3) Average in-degree is (1*1+3*2)/4=1.75

Out-degree of nodes:

- 2
- 2
- 1
- 2

So the out-degree distribution is:

- 1
- 3

Average out-degree is (1*1+3*2)/4=1.75 Note in general that for a directed graph, the out-degree distribution is not the same as the in-degree.

To calculate the clustering coefficients we first need to find the neighbourhoods of each node. The neighbourhood of node i, is all the other nodes that either have an edge from node i or to node i.
The neighbourhood of node 1 is {2,3,4}, the neighbourhood of node 2 is {1,3,4}, the neighbourhood of node 3 is {1,2} and the neighbourhood of node 4 is {1,2}.

We also need the total degree of each node = in-degree + out-degree, thus

- 4
- 4
- 2
- 4

Using the notation from here, these are the k_i (k underscore i).

We also need the number of edges between nodes j and k in the neighbourhood of i, i.e. for node 1, we count all the edges between nodes 2, 3, and 4 (the neighbourhood of 1). The counts are:

- 3
- 3
- 1
- 1

So the clustering coefficient for each node is the count divided by k_i(k_i - 1), i.e.

- 3/(4*3) = 1/4
- 3/(4*3) = 1/4
- 1/(2*1) = 1/2
- 1/(4*3) = 1/12

The clustering coefficient for the graph as a whole is just the average of the clustering coefficients of the nodes, i.e. (1/4 + 1/4 + 1/2 + 1/12) / 4 = 13/48 approx. 0.271

### Undirected Graph

http://www.berrymanconsulting.com/images/SimpleUndirectedGraph.jpg

Pajek .net file:

*vertices = 5 1 "1" 2 "2" 3 "3" 4 "4" 5 "5" *edges 1 2 1 3 2 3 2 4 2 5 3 5 4 5

Degree:

- 2
- 4
- 3
- 2
- 3

Degree distribution:

- 0
- 2
- 2
- 1

Average degree = (2*2 + 2*3 + 4)/5 = 14/5 = 2.8

Neighbourhoods:

- {2,3}
- {1,3,4,5}
- {1,2,5}
- {2,5}
- {2,3,4}

Number of edges within neighbourhoods:

- 1
- 3
- 2
- 1
- 2

Clustering coefficients

- 2*1/(2*1) = 1
- 2*3/(4*3) = 1/2
- 2*2/(3*2) = 2/3
- 2*1/(2*1) = 1
- 2*2/(3*2) = 2/3

Average clustering coefficient = (1 + 1/2 + 2/3 + 1 + 2/3) / 5 approx. 0.767

## Reading list

## Simulations

## Data sets

- Enron email data
- My conversion of data from Andres Corrada-Emmanuel's Enron research page
- Network data compiled by Mark Newman. Most of the data is in the GML format. This data includes the karate club social data.
- iiNet report has some networks on p9 and p10.
- Internode network
- Telstra's international network

## Software

- Pajek
- Software for random hierarchical graphs
- Pajek file format (also supported by JUNG)
- JUNG - Java Universal Network / Graph framework
- To use JUNG you will also need:
- COLT jar file
- Xerces (only if you want to read Graph ML files).

- Eclipse Java development environment for Windows
- Tutorial on using Eclipse to start a new Java project
- Note that this window has the option for External JAR files (which you can use to include the COLT and Xerces files).