I have used the Perl CPAN implementation of the link checker.
One way to check a lot of pages from a command prompt is with a script like this:
(
cd /home/webserver/pages
baseurl='http://mywebserver.com/pages/'
# check every page and append each report to a single output file
for page in *.htm *.html; do
    /usr/bin/linkcheck -s "${baseurl}${page}" >> outputfile
done
)
This may take a while to run: the checking process is very thorough and the reports are quite verbose.
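Because the run can take so long, it can be convenient to detach it from the terminal. A minimal sketch, assuming the loop above has been saved as a script (checkpages.sh is just an example name; the link reports still go to outputfile as before):

nohup ./checkpages.sh > checkpages.log 2>&1 &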
A common requirement is to check just for 404 (bad link) errors. To report only these, I filtered the output file through an awk script:
# linkfilter.awk
BEGIN { url = ""; }

# each page report starts with a "Processing <url>" line
/^Processing/ { url = $2; errorCount = 0; }

# remember the most recent link and the line numbers it appears on
/^http:/    { link = $1; }
/^ Lines: / { lines = $2 $3; }

# report only links that came back with a 404
/^ Code: 404 Not Found/ {
    if (!errorCount) printf "\n\nCompany page: %s\n", url;
    errorCount++;
    printf "link: %s lines %s; %s\n", link, lines, $0;
}
This lists only pages where 404 errors have occurred; OK pages and other 'errors' such as redirections or ignored links are not reported.
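To produce the filtered report, run the awk script over the output file written by the loop above. A minimal sketch (the report filename is just an example):

awk -f linkfilter.awk outputfile > broken-links.txt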