A Java hyperlink validator
Copyright © 1999 Robert Forsman
GNU General Public License
and GNU Lesser General Public License
JCheckLinks is a Java™ application that validates hyperlinks in web sites. It includes no native code, so it should run on any Java 1.1.7 virtual machine. The licensing terms are LGPL, with the main app class being GPL (which means you can incorporate the meat of the source into a commercial app, but not the front-end class).
Download version 0.4b (gzipped tar).
HTTPClient package home page.
Java for Linux: the JDK used by the author.
Checkers: another Java hyperlink validator.
LinkChecker: another Java hyperlink validator.
This page is falling out of sync with the package as I rush to release new features.
JCheckLinks is multithreaded to take advantage of multi-processor machines. There are harvester threads and non-harvester threads. This allows the user to limit the number of CPU-intensive threads while allowing a large number of low-impact probing threads.
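The split between a few CPU-heavy threads and many cheap probing threads can be illustrated with two separately sized thread pools. This is only a sketch of the general idea, not JCheckLinks' actual code: it uses java.util.concurrent, which is far newer than Java 1.1.7, and the class name TwoPoolSketch, the method runAll, and the placeholder tasks are all hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class TwoPoolSketch {
    /**
     * Runs placeholder probe and harvest tasks on two separately sized pools
     * and returns {probesDone, harvestsDone}. Real tasks would issue HTTP
     * requests (probing) or parse HTML (harvesting).
     */
    public static int[] runAll(int proberThreads, int harvesterThreads,
                               int probeTasks, int harvestTasks) {
        // Many low-impact threads that mostly wait on the network.
        ExecutorService probers = Executors.newFixedThreadPool(proberThreads);
        // A small number of CPU-intensive threads.
        ExecutorService harvesters = Executors.newFixedThreadPool(harvesterThreads);
        AtomicInteger probed = new AtomicInteger();
        AtomicInteger harvested = new AtomicInteger();

        for (int i = 0; i < probeTasks; i++) {
            probers.execute(() -> probed.incrementAndGet());
        }
        for (int i = 0; i < harvestTasks; i++) {
            harvesters.execute(() -> harvested.incrementAndGet());
        }

        // Stop accepting work and wait for the queued tasks to finish.
        probers.shutdown();
        harvesters.shutdown();
        try {
            probers.awaitTermination(30, TimeUnit.SECONDS);
            harvesters.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new int[] { probed.get(), harvested.get() };
    }

    public static void main(String[] args) {
        int[] done = runAll(8, 2, 20, 5);
        System.out.println("probes: " + done[0] + ", harvests: " + done[1]);
    }
}
```

Sizing the two pools independently is what lets the harvester count stay near the number of processors while the prober count grows much larger.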
JCheckLinks is currently limited to probing HTTP documents and local files. It has no support for https:, ftp:, gopher: or mailto:. Although some of these are in the TODO, others are impractical (mailto: verification is nigh impossible in the age of anti-spam warfare).
JCheckLinks only downloads and harvests URLs from documents with a Content-Type of text/html. I can be persuaded to add support for other document types if you can tell me how.
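A check of this kind might look like the following sketch. It assumes that a header such as "text/html; charset=ISO-8859-1" should still count as HTML and that the comparison is case-insensitive; the page does not say how JCheckLinks treats either case. The class name ContentTypeSketch and the method isHtml are hypothetical.

```java
import java.util.Locale;

public class ContentTypeSketch {
    /**
     * Returns true if the Content-Type header value names text/html.
     * Parameters after ';' (for example a charset) are ignored, which is
     * an assumption of this sketch.
     */
    public static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false; // no header: do not harvest
        }
        int semi = contentType.indexOf(';');
        String mediaType = (semi >= 0) ? contentType.substring(0, semi) : contentType;
        return mediaType.trim().toLowerCase(Locale.ROOT).equals("text/html");
    }

    public static void main(String[] args) {
        System.out.println(isHtml("text/html; charset=ISO-8859-1"));
        System.out.println(isHtml("image/gif"));
    }
}
```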
JCheckLinks extracts URLs from the following HTML tag attributes:
<base href=>
<a href=>
<link href=>
<area href=>
<img src= longdesc= usemap=>
<input src= usemap=>
<frame src= longdesc=>
<iframe src= longdesc=>
<style src=>
<script src= for=>
<object codebase= classid= data= archive= usemap=>
<applet codebase= code= archive=>
<head profile=>
<body background=>
<blockquote cite=>
<q cite=>
<ins cite=>
<del cite=>

The only tag which is not harvested is <form action=>. I consider that a dangerous proposition.
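To make the harvesting rule concrete, here is a rough sketch that looks up each tag in a table of URL-bearing attributes and collects the matching values. It is an illustration only: it uses regular expressions rather than whatever parser JCheckLinks really uses, it covers just a subset of the tags listed above, and it will be confused by a '>' inside a quoted attribute value. The class name HarvestSketch and the method harvest are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HarvestSketch {
    // Subset of the tag/attribute table; <form action=> is deliberately absent.
    private static final Map<String, List<String>> URL_ATTRS = new HashMap<>();
    static {
        URL_ATTRS.put("base", Arrays.asList("href"));
        URL_ATTRS.put("a", Arrays.asList("href"));
        URL_ATTRS.put("link", Arrays.asList("href"));
        URL_ATTRS.put("img", Arrays.asList("src", "longdesc", "usemap"));
        URL_ATTRS.put("frame", Arrays.asList("src", "longdesc"));
        URL_ATTRS.put("body", Arrays.asList("background"));
        URL_ATTRS.put("blockquote", Arrays.asList("cite"));
    }

    // An opening tag: name in group 1, raw attribute text in group 2.
    private static final Pattern TAG =
        Pattern.compile("<([A-Za-z][A-Za-z0-9]*)\\b([^>]*)>");
    // name="value", name='value', or name=value (groups 2, 3, 4).
    private static final Pattern ATTR = Pattern.compile(
        "([A-Za-z][A-Za-z0-9-]*)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s\"'>]+))");

    /** Returns the raw (unresolved) URL strings found in the given HTML. */
    public static List<String> harvest(String html) {
        List<String> urls = new ArrayList<>();
        Matcher tag = TAG.matcher(html);
        while (tag.find()) {
            List<String> wanted = URL_ATTRS.get(tag.group(1).toLowerCase(Locale.ROOT));
            if (wanted == null) {
                continue; // tag is not in the table, e.g. <form>
            }
            Matcher attr = ATTR.matcher(tag.group(2));
            while (attr.find()) {
                if (!wanted.contains(attr.group(1).toLowerCase(Locale.ROOT))) {
                    continue;
                }
                String value = attr.group(2) != null ? attr.group(2)
                             : attr.group(3) != null ? attr.group(3)
                             : attr.group(4);
                urls.add(value);
            }
        }
        return urls;
    }
}
```

A real harvester would also resolve each value against the document URL (or the <base href=> value when one is present) before probing it.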
When JCheckLinks finishes, it creates two files in the current directory: statuses and references.
Each line of the ./references output file consists of two %-encoded URLs separated by whitespace. The second URL can be malformed if the HTML document has a malformed reference.
Each line of the ./statuses output file has a status string (usually a number from the HTTP response code, but sometimes indicating an exception), the %-encoded URL which we probed to get the status, and sometimes a third %-encoded URL which is present when we got a redirection (301, 302, or 307).
There is currently no report generator to parse these files. I am hacking up a Perl script.
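In the meantime, a line of the statuses file is simple enough to pick apart by hand. The Java sketch below is not part of JCheckLinks. It assumes the fields are separated by whitespace (as in the references file) and that the status string itself contains no whitespace, neither of which is confirmed here. The class name StatusLineSketch and the isBroken rule (any non-numeric status, or a code of 400 or above) are hypothetical choices for illustration.

```java
public class StatusLineSketch {
    public final String status;   // HTTP code or exception indicator
    public final String url;      // %-encoded URL that was probed
    public final String redirect; // %-encoded redirect target, or null

    private StatusLineSketch(String status, String url, String redirect) {
        this.status = status;
        this.url = url;
        this.redirect = redirect;
    }

    /** Parses one line: status, probed URL, optional redirect URL. */
    public static StatusLineSketch parse(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length < 2 || fields.length > 3) {
            throw new IllegalArgumentException("unexpected statuses line: " + line);
        }
        return new StatusLineSketch(fields[0], fields[1],
                                    fields.length == 3 ? fields[2] : null);
    }

    /** Treats non-numeric statuses and codes >= 400 as broken (an assumption). */
    public boolean isBroken() {
        try {
            return Integer.parseInt(status) >= 400;
        } catch (NumberFormatException e) {
            return true; // the status string was not an HTTP code
        }
    }

    public static void main(String[] args) {
        StatusLineSketch s = parse("404 http://www.example.org/missing%20page.html");
        System.out.println(s.url + " broken? " + s.isBroken());
    }
}
```

The URLs are left %-encoded on purpose, so they can be matched directly against the entries in the references file.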
bash$ export CLASSPATH=/usr/local/lib/site-java/LinkChecker-v0.1.jar
bash$ java -mx64M CheckLinks -nthreads 3 http://web.ortge.ufl.edu/Guide/

I use -mx64M because my web site is fairly large and otherwise java gets OutOfMemoryErrors. It's probably a bad idea to set java's maximum heap size to be larger than your actual RAM because I think the garbage collector causes severe VM thrashing. If you find you actually need more memory than you have RAM, buy more RAM.
JCheckLinks was announced to the Tek list March 12, 1999.
Robert Forsman <thoth@purplefrog.com> Last modified: Mon Oct 14 15:04:10 1996