JCheckLinks

A Java hyperlink validator

JCheckLinks is a Java™ application that validates hyperlinks in web sites. It includes no native code, so it should run on any Java 1.1.7 virtual machine. The code is licensed under the LGPL, except for the main app class, which is GPL (which means you can incorporate the meat of the source into a commercial app, but not the front-end class).

ftp site.

download version 0.4b (gzipped tar)
V0.4b is a non-expiring version of V0.4
README .. CHANGELOG
HTTPClient package home page.
Java at Sun.
Java for Linux. This is the JDK used by the author.
Checkers: another Java hyperlink validator.
LinkChecker: another Java hyperlink validator.

This page is falling out of sync with the package as I rush to release new features.

JCheckLinks is multithreaded to take advantage of multi-processor machines. There are harvester threads and non-harvester threads. This allows the user to limit the number of CPU-intensive threads while allowing a large number of low-impact probing threads.
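The split between the two kinds of threads can be sketched with two separately sized pools. This is only an illustration, not the JCheckLinks source: the class and method names are hypothetical, the assumption that harvesting is the CPU-intensive job is my reading of the paragraph above, and it uses java.util.concurrent, which is far newer than Java 1.1.7.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: a small pool for harvesting, a large pool for probing.
final class TwoPoolSketch {
    private final ExecutorService harvesters; // few threads: download and parse
    private final ExecutorService probers;    // many threads: only fetch a status
    final AtomicInteger harvested = new AtomicInteger();
    final AtomicInteger probed = new AtomicInteger();

    TwoPoolSketch(int harvesterThreads, int proberThreads) {
        harvesters = Executors.newFixedThreadPool(harvesterThreads);
        probers = Executors.newFixedThreadPool(proberThreads);
    }

    // Queue a URL on the pool that matches the kind of work it needs.
    // The real work is replaced by a counter so the sketch stays offline.
    void submit(String url, boolean needsHarvest) {
        if (needsHarvest) {
            harvesters.execute(() -> harvested.incrementAndGet());
        } else {
            probers.execute(() -> probed.incrementAndGet());
        }
    }

    // Stop accepting work and wait for both pools to drain.
    // Returns false on timeout or interruption.
    boolean shutdownAndWait(long seconds) {
        harvesters.shutdown();
        probers.shutdown();
        try {
            return harvesters.awaitTermination(seconds, TimeUnit.SECONDS)
                && probers.awaitTermination(seconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Sizing the pools independently is what lets a machine with few CPUs run, say, two harvesters alongside dozens of probers that spend most of their time waiting on the network.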

JCheckLinks is currently limited to probing HTTP documents and local files. It has no support for https:, ftp:, gopher:, or mailto: URLs. Some of these are on the TODO list; others are impractical (mailto: verification is nigh impossible in the age of anti-spam warfare).

JCheckLinks only downloads and harvests URLs from documents with a Content-Type of text/html. I can be persuaded to add support for other document types if you can tell me how to extract URLs from them.
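A minimal sketch of that Content-Type test follows. The class and method names are hypothetical, and ignoring parameters such as charset and ignoring letter case are my assumptions about a sensible check, not a description of what JCheckLinks actually does.

```java
import java.util.Locale;

// Hypothetical sketch: decide whether a response body should be harvested.
final class ContentTypeSketch {
    static boolean isHarvestable(String contentType) {
        if (contentType == null) {
            return false; // no header: probe only, do not harvest
        }
        // Drop parameters such as "; charset=ISO-8859-1".
        int semi = contentType.indexOf(';');
        String mediaType = semi >= 0 ? contentType.substring(0, semi) : contentType;
        // Media type names are case-insensitive.
        return mediaType.trim().toLowerCase(Locale.ROOT).equals("text/html");
    }
}
```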

JCheckLinks extracts URLs from the following HTML tag attributes:

<base href=>
<a href=>
<link href=>
<area href=>
<img src= longdesc= usemap=>
<input src= usemap=>
<frame src= longdesc=>
<iframe src= longdesc=>
<style src=>
<script src= for=>
<object codebase= classid= data= archive= usemap=>
<applet codebase= code= archive=>
<head profile=>
<body background=>
<blockquote cite=>
<q cite=>
<ins cite=>
<del cite=>

The only URL-bearing attribute that is not harvested is <form action=>. I consider probing form actions a dangerous proposition.
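The attribute list above can be expressed as a lookup table. The sketch below is not the JCheckLinks parser: the names are hypothetical, and it uses regular expressions (added to Java long after 1.1.7) as a stand-in for a real HTML tokenizer, so it will misread comments, scripts, and attribute values that contain ">". It returns raw attribute values without resolving them against the base URL.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: table-driven harvesting of the attributes listed above.
final class AttributeHarvestSketch {
    private static final Map<String, List<String>> HARVESTED = new HashMap<>();

    private static void put(String tag, String... attrs) {
        HARVESTED.put(tag, Arrays.asList(attrs));
    }

    static {
        put("base", "href");
        put("a", "href");
        put("link", "href");
        put("area", "href");
        put("img", "src", "longdesc", "usemap");
        put("input", "src", "usemap");
        put("frame", "src", "longdesc");
        put("iframe", "src", "longdesc");
        put("style", "src");
        put("script", "src", "for");
        put("object", "codebase", "classid", "data", "archive", "usemap");
        put("applet", "codebase", "code", "archive");
        put("head", "profile");
        put("body", "background");
        put("blockquote", "cite");
        put("q", "cite");
        put("ins", "cite");
        put("del", "cite");
        // <form action=> is deliberately absent.
    }

    // Opening tag: group 1 is the tag name, group 2 the attribute text.
    private static final Pattern TAG =
        Pattern.compile("<([A-Za-z][A-Za-z0-9]*)\\b([^>]*)>");
    // name="value", name='value', or name=value (groups 2, 3, 4).
    private static final Pattern ATTR = Pattern.compile(
        "([A-Za-z][A-Za-z0-9_-]*)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s\"'>]+))");

    static List<String> harvest(String html) {
        List<String> found = new ArrayList<>();
        Matcher tag = TAG.matcher(html);
        while (tag.find()) {
            List<String> wanted = HARVESTED.get(tag.group(1).toLowerCase(Locale.ROOT));
            if (wanted == null) {
                continue; // tag carries no attribute we harvest
            }
            Matcher attr = ATTR.matcher(tag.group(2));
            while (attr.find()) {
                if (!wanted.contains(attr.group(1).toLowerCase(Locale.ROOT))) {
                    continue;
                }
                String value = attr.group(2) != null ? attr.group(2)
                             : attr.group(3) != null ? attr.group(3)
                             : attr.group(4);
                found.add(value);
            }
        }
        return found;
    }
}
```

Keeping the tag/attribute pairs in one table means supporting another attribute is a one-line change, and leaving <form action=> out of the table is all it takes to never probe it.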

usage

java CheckLinks [ -nthreads n ] [ -scanners n ] [ -checkpoint nmin ] [ -progress nmin ] [ -proxy host:port ] [ -exact ] [ -loose ] [ -noautoinclude ] [ -include URLprefix ]* [ -includeexact URLprefix ]* [ -exclude URLprefix ]* [ -excludeexact URLprefix ]* URL1 [ URL2 ... ]

When it finishes, it creates two files in the current directory: statuses and references.

Each line of the ./references output file consists of two %-encoded URLs separated by whitespace. The second URL can be malformed if the HTML document has a malformed reference.

Each line of the ./statuses output file has two or three fields: a status string (usually the numeric HTTP response code, but sometimes the name of an exception), the %-encoded URL that was probed to get that status, and, when the response was a redirection (301, 302, or 307), a third %-encoded URL.

There is currently no report generator to parse these files. I am hacking up a Perl script.
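In the meantime, a line of the statuses file can be picked apart with very little code. This sketch is not part of the package: the class name is hypothetical, and it assumes that no field contains whitespace, which holds for %-encoded URLs but is only a guess for status strings that describe an exception.

```java
// Hypothetical sketch: split one line of ./statuses into its fields.
final class StatusLineSketch {
    final String status;   // usually an HTTP code such as "404"
    final String url;      // the %-encoded URL that was probed
    final String redirect; // the third field, or null when absent

    private StatusLineSketch(String status, String url, String redirect) {
        this.status = status;
        this.url = url;
        this.redirect = redirect;
    }

    static StatusLineSketch parse(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length < 2 || fields.length > 3) {
            throw new IllegalArgumentException("expected 2 or 3 fields: " + line);
        }
        return new StatusLineSketch(fields[0], fields[1],
                                    fields.length == 3 ? fields[2] : null);
    }

    // True for the redirection codes named in the format description.
    boolean isRedirect() {
        return status.equals("301") || status.equals("302") || status.equals("307");
    }
}
```

The URLs are left %-encoded on purpose, so they can be compared directly against the entries in the references file.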

Example

bash$ export CLASSPATH=/usr/local/lib/site-java/LinkChecker-v0.1.jar
bash$ java -mx64M CheckLinks -nthreads 3 http://web.ortge.ufl.edu/Guide/

I use -mx64M because my web site is fairly large and java otherwise throws OutOfMemoryError. Setting java's maximum heap size larger than your physical RAM is probably a bad idea, because I think the garbage collector then causes severe virtual-memory thrashing. If you find you actually need more memory than you have RAM, buy more RAM.

robots.txt

JCheckLinks adheres to the Robots Exclusion Protocol. For the purposes of exclusion, it is called "jchecklinks". Its User-Agent string is "JCheckLinks/0.1 RPT-HTTPClient/0.3-1".
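For site owners wondering how a record for jchecklinks is interpreted, here is a simplified sketch of the classic User-agent / Disallow matching. It is not the code JCheckLinks runs: the names are hypothetical, agent names are compared by case-insensitive equality rather than the looser substring match the protocol recommends, and only Disallow prefixes are understood.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: may `agent` fetch `path` under this robots.txt?
final class RobotsSketch {
    static boolean allowed(String robotsTxt, String agent, String path) {
        List<String> specific = new ArrayList<>(); // rules from records naming the agent
        List<String> wildcard = new ArrayList<>(); // rules from "User-agent: *" records
        boolean sawSpecific = false;
        boolean inSpecific = false;
        boolean inWildcard = false;
        boolean lastWasAgent = false;

        for (String raw : robotsTxt.split("\r?\n")) {
            int hash = raw.indexOf('#'); // strip comments
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or unparseable line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();

            if (field.equals("user-agent")) {
                if (!lastWasAgent) { // first User-agent line of a new record
                    inSpecific = false;
                    inWildcard = false;
                }
                if (value.equals("*")) {
                    inWildcard = true;
                } else if (value.equalsIgnoreCase(agent)) {
                    inSpecific = true;
                    sawSpecific = true;
                }
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                // An empty Disallow value excludes nothing.
                if (field.equals("disallow") && !value.isEmpty()) {
                    if (inSpecific) specific.add(value);
                    if (inWildcard) wildcard.add(value);
                }
            }
        }

        // A record that names the agent replaces the "*" record entirely.
        List<String> rules = sawSpecific ? specific : wildcard;
        for (String prefix : rules) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

Under this reading, a site that wants to keep JCheckLinks out of /cgi-bin/ while leaving other robots alone would add a record with "User-agent: jchecklinks" followed by "Disallow: /cgi-bin/".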

passwords

JCheckLinks cannot currently probe URLs which require authorization or cookies. This is a design decision which can easily be reversed.

JCheckLinks was announced to the Tek list March 12, 1999.

Robert Forsman <thoth@purplefrog.com>

Last modified: Mon Oct 14 15:04:10 1996