


Subject: Re: upstreams maintainer conflict, was: wget: remove outdated manual page
In-Reply-To: <19980516225619.U3613@test.legislate.com>; from Raul Miller on Sat, May 16, 1998 at 10:56:19PM -0400
Organization: Kathie Lee's Sweatshops
X-Operating-System: Linux/i486 2.1.99
X-Mutt-References: <19980516225619.U3613@test.legislate.com>

On May 16, Raul Miller wrote:
> [Aside: it would be nice to have a mechanism that just generates
> a unique list of referenced URLs.  This would allow more complicated
> filtering schemes to determine what to download (at the expense
> of having to run wget twice -- but it's easy enough to set up a
> web proxy).  --spider only checks a single file.]

A list of all URLs in a particular web page can be fairly easily generated;
see e.g. my findnew Python script (http://www.linux-m68k.org/py/findnew.py),
which does exactly this as part of its processing.
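
For illustration only, here is a rough sketch in modern Python of how such
an extractor might look -- this is NOT the contents of findnew.py, just an
assumed approach: feed the file through an HTML parser, collect href/src
attribute values, and print each one once, sorted.

#!/usr/bin/env python3
# Toy URL extractor -- not the real findnew.py, just an illustration.
# Prints each href/src value found in an HTML file exactly once, sorted.
import sys
from html.parser import HTMLParser

class URLCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.urls = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.urls.add(value)

collector = URLCollector()
with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
    collector.feed(fh.read())
for url in sorted(collector.urls):
    print(url)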

I'm sure one could write an ugly regular expression that could be awk'd over
an HTML file, the output of which could be piped through sort | uniq to
achieve the same effect.  One could even do this in C if one cared (but I'll
stick with my Python version; at least it's readable...).
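
Along the lines of that ugly-regex idea, a hypothetical Python equivalent
(crude, and it will miss unquoted attributes and anything generated by
scripts) might be:

#!/usr/bin/env python3
# Regex version of the same idea: grab quoted href/src attribute values
# and print the unique ones, sorted.  Illustration only.
import re
import sys

with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
    text = fh.read()

pattern = re.compile(r'(?:href|src)\s*=\s*["\']([^"\']+)', re.IGNORECASE)
for url in sorted(set(pattern.findall(text))):
    print(url)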


Chris
-- 
=============================================================================
|      Chris Lawrence      |      The Realistic Consolidation Proposal      |
|  <quango@ix.netcom.com>  |  http://newforesthills.base.org/regional.html  |
|                          |                                                |
|    Contract Programmer   |       Join the party that opposed the CDA      |
|    FedEx Ops Research    |               http://www.lp.org/               |
=============================================================================



