sitedir: the directory the site is in

run WinHTTrack with the full debug log enabled
take the lines beginning with "##:##:## Info:  engine: transfer-status: link added:" (where # is a digit)
remove the first 53 + urlLength chars of each line, where urlLength is the length of the site's root URL (e.g. 8 for site.com) (this leaves a leading /, which is important later)
do s/ -> .*$// to remove local file references
do s/[?].*$// to remove URL parameters to PHP pages, etc.
save resulting file to goodlist.txt
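
The filtering steps above could be strung together roughly as follows. This is only a sketch: the function name extract_good_links is made up for illustration, and it assumes the log lines have exactly the 53-character prefix described above, followed directly by the site root.

```shell
# Hypothetical helper: reads a WinHTTrack debug log on stdin and writes
# site-relative paths (one per line, each with a leading /) on stdout.
# $1 is the site root exactly as it appears in the log, e.g. site.com
extract_good_links() {
    root=$1
    # 53 chars of timestamp/"link added: " prefix, plus the site root itself
    skip=$((53 + ${#root}))
    grep -E '^[0-9]{2}:[0-9]{2}:[0-9]{2} Info:  engine: transfer-status: link added:' |
        cut -c "$((skip + 1))-" |
        sed -e 's/ -> .*$//' -e 's/[?].*$//'
}
```

Usage would be something like `extract_good_links site.com < hts-log.txt > goodlist.txt`, where hts-log.txt is assumed to be the name of the debug log.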
cp -R sitedir sitedir2
create an empty mirror of sitedir's directory structure in sitedir_good (run from the parent of sitedir):
	(cd sitedir && find . -type d -exec mkdir -p ../sitedir_good/{} \;)
move all good files out of sitedir2 into sitedir_good:
	while IFS= read -r filename; do
	mv "sitedir2$filename" "sitedir_good$filename";
	done < goodlist.txt
	unset filename
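
A slightly more defensive variant of the move step is sketched below. It is not part of the original procedure: the name move_good_files is hypothetical, and it assumes every line of the list is a path with a leading /. It creates each target directory on demand and skips entries that are not regular files (blank lines, directory URLs such as /, files that were never downloaded).

```shell
# Hypothetical helper: move every path listed in file $1 from tree $2 to tree $3.
move_good_files() {
    list=$1 src=$2 dst=$3
    while IFS= read -r path; do
        # skip anything that is not a regular file in the source tree
        [ -f "$src$path" ] || continue
        # make sure the matching directory exists in the target tree
        mkdir -p "$dst$(dirname "$path")"
        mv "$src$path" "$dst$path"
    done < "$list"
    unset path
}
```

It would be called as `move_good_files goodlist.txt sitedir2 sitedir_good`.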
check whether any good files depend on the remaining files (e.g. through server-side includes, which would not show up in the WinHTTrack logs):
	(for filename in `ls -ARp ../sican2/* | grep -vE "\>:" | grep -vE "\>/" | grep -v "^$"`; do
	 linksto $filename; done; unset filename ) > ../linkagereport.txt
	 


problems:
- pathnames are not preserved by the linksto command, so linksto "foo.htm"
  returns pages with links to /foo/bar/foo.htm and pages with links to
  /foo/foo.htm without distinction
- because of the above problem, files show up as needed on the strength of
  bad links that do not actually point at them (if foobar.html links to
  /foo.htm and the remaining foo.htm is in /bar/, foo.htm will still show up
  as a needed file); such links may need to be changed, since the files they
  want are still available
- somehow reconcile the above two?
- partial filenames also match, e.g. a search for logo.jpg matches links to
  cap_logo.jpg
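
One possible way around the partial-filename problem is to require that the name is not glued onto other filename characters. The sketch below is a stand-in, not the linksto command itself: links_to_exact is a hypothetical name, it assumes a grep that supports -r, and it treats letters, digits, _, . and - as filename characters. It still does not resolve paths, so the first two problems remain.

```shell
# Hypothetical helper: list files under directory $2 that mention the bare
# file name $1 as a whole name (so logo.jpg does not match cap_logo.jpg).
links_to_exact() {
    name=$1 dir=$2
    # escape regex metacharacters so the name is matched literally
    escaped=$(printf '%s\n' "$name" | sed 's/[][\.*^$+?(){}|]/\\&/g')
    # the name must be preceded and followed by a non-filename character
    # (or by the start/end of the line)
    grep -rlE "(^|[^A-Za-z0-9_.-])${escaped}([^A-Za-z0-9_.-]|\$)" "$dir"
}
```

For example, `links_to_exact logo.jpg sitedir_good` would list only the good pages that reference a file named exactly logo.jpg, in whatever directory.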