Final Report 2011
===Technical Challenges===
Technical challenges encountered in the Web Crawler module included the following:
# Robots Exclusion Protocol
# Webpage content extraction
# Multithreading
# Crawling https (secure) sites
# Accessing the Internet through secured proxies that require a username and password

The Robots Exclusion Protocol is the policy governing the ethical behaviour of web crawlers. The policy for each site is stored in a text file named robots.txt (available at <domain>/robots.txt). The challenge for a web crawler is to read the robots.txt file and obey any instructions that limit crawling across that domain.

In interfacing with the pattern matcher module, the web crawler is required to supply the contents of each webpage it crawls. Any HTML code must be ignored and not passed on.

The complex nature of web crawlers meant that parallel processing through multithreading was required. This introduced the technical challenge of controlling threads externally.

Finally, providing the ability to access encrypted sessions, that is https (secure) sites, is a challenge that is yet to be overcome. In addition, the web crawler is currently unable to function in an operating environment where access to the internet is secured by a proxy that requires a username and password. An example of this type of environment is the University of Adelaide student computer accounts. An update to the web crawler has been designed to remedy this and will be implemented prior to project closeout.
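
The first two challenges, obeying robots.txt and passing on page content without HTML code, can be illustrated with a minimal Python sketch. This is not the project's implementation: the report does not state which language or libraries the Web Crawler module uses, so the function names <code>is_allowed</code> and <code>extract_text</code>, the class <code>TextExtractor</code>, and the user agent string in the usage note are all hypothetical. The sketch relies only on the Python standard library (<code>urllib.robotparser</code> and <code>html.parser</code>) and assumes that the robots.txt text and the page HTML have already been downloaded.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt text permits user_agent to fetch url.

    Hypothetical helper: the robots.txt content is assumed to have been
    downloaded already from <domain>/robots.txt.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class TextExtractor(HTMLParser):
    """Collects text content and ignores tags, scripts and style sheets."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside script and style elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return " ".join(self._chunks)


def extract_text(html: str) -> str:
    """Return the text of an HTML page with all markup removed."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.text()
```

Under these assumptions, a crawler would call <code>is_allowed(robots_txt, "ExampleCrawler", url)</code> before fetching a page and pass only the result of <code>extract_text(html)</code> to the pattern matcher module. The sketch does not address thread control, https sites, or authenticated proxies.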