HURL: rough design notes

Thoughts on the design of HURL-NG: (written in 1998-2000)


All URIs must be bookmarkable; exceptions to this should be clearly marked.
(eventually in a machine-readable way)

URIs shouldn't expose implementation internals: use msgids in the URIs
instead of arbitrary sequence numbers as hypermail does, so archives
can be moved without breaking links when one installation disappears.

It should be easy to set up large numbers of separately-named archives,
like 100-200 of them. (hwg, w3c lists.)

Maybe http://impressive.net/archives should be a single CGI script
whose first path argument is the name of the archive? (It would then
use that arg to load the appropriate DB/indexes.)

Would be nice if people could add archives without having to tweak
their httpd.conf's to add extra ScriptAliases or whatever.


URIs for a typical archive:

have a certain number of predefined names like 'search', 'status';
otherwise assume it's a msgid and try to look it up.

    http://impressive.net/archives/fogo/             nice overview a la egroups

    http://impressive.net/archives/fogo/1999/12      monthly overview pages

    http://impressive.net/archives/fogo/search       search form
    http://impressive.net/archives/fogo/status       archive status

    http://impressive.net/archives/fogo/search?subject=cmgi
    http://impressive.net/archives/fogo/search?subject=foo+bar&range=1-50
    http://impressive.net/archives/fogo/search?subject=foo+bar&range=51-100
    http://impressive.net/archives/fogo/20000214012338.E18072@impressive.net
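
A rough sketch of how the lookup rule above might be dispatched. The
sub name route(), the %RESERVED table and the handler names it returns
are made up for illustration; the query string is assumed to be handled
separately by the CGI layer, and URL-decoding of the msgid is left out.

```perl
use strict;
use warnings;

# Hypothetical dispatcher: maps the path after /archives/ to a handler name.
# %RESERVED and the handler names are illustrative, not part of HURL.
my %RESERVED = map { $_ => 1 } qw(search status);

sub route {
    my ($path) = @_;
    $path =~ s{^/+}{};                          # drop leading slashes
    my ($archive, $rest) = split m{/}, $path, 2;
    return ('error', undef, undef)
        unless defined $archive && length $archive;
    $rest = '' unless defined $rest;
    $rest =~ s{/+$}{};                          # ignore a trailing slash

    return ('overview', $archive, undef) if $rest eq '';
    return ($rest,      $archive, undef) if $RESERVED{$rest};
    return ('month',    $archive, "$1$2") if $rest =~ m{^(\d{4})/(\d{2})$};
    return ('message',  $archive, $rest);       # anything else: treat as a msgid
}
```

Each call returns (handler, archive, argument), so
route('/fogo/1999/12') yields the 'month' handler for archive 'fogo'.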

Don't have "next" and "previous" links on the message pages any more.
(well, maybe. But use real cookies this time instead of URL munging.)


Adding articles to the archive:

add_articles:
    takes a list of filenames on stdin, invoked from crontab or procmailrc
    or find . -type f -print | ./add_articles
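
A minimal sketch of the stdin loop, assuming one filename per line.
The $store callback stands in for the real "parse and store" step,
which these notes don't pin down; the names here are placeholders.

```perl
use strict;
use warnings;

# Sketch only: read one filename per line from a filehandle and hand each
# readable plain file to a callback. Unreadable or missing files are
# skipped with a warning rather than aborting the whole run.
sub add_articles {
    my ($fh, $store) = @_;
    my ($added, $skipped) = (0, 0);
    while (my $file = <$fh>) {
        chomp $file;
        next unless length $file;          # ignore blank lines
        if (-f $file && -r _) {
            $store->($file);
            $added++;
        } else {
            warn "add_articles: skipping $file\n";
            $skipped++;
        }
    }
    return ($added, $skipped);
}

# As a filter it would be called like: add_articles(\*STDIN, \&store_article);
```

Taking the filehandle as an argument keeps the same code usable from
crontab, procmailrc or a find pipeline.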

Support MH format archives first, then mboxes, then gzipped mboxes (?),
eventually http access to mboxes (?)


Internals:

Each new article gets assigned a unique ID when it is added to the
archive, then referenced internally by that ID instead of the msgid.

Msgids are stored in a table that maps each one to its ID; IDs are
never exposed to the outside world.
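
Roughly, assuming the DB is just a hash (possibly tied to a DBM file)
and that msgid entries live under a "msgid:" key prefix; both the
prefix and the sub names are assumptions for the sake of the sketch:

```perl
use strict;
use warnings;

# Sketch: hand out sequential internal IDs and remember msgid -> ID.
# $db is a hash ref; $db->{id} holds the last ID handed out.
sub assign_id {
    my ($db, $msgid) = @_;
    my $key = "msgid:$msgid";
    return $db->{$key} if exists $db->{$key};   # already archived
    my $id = ++$db->{id};                       # next sequence number
    $db->{$key} = $id;
    return $id;
}

# Returns the internal ID for a msgid, or undef if it isn't in the archive.
sub lookup_id {
    my ($db, $msgid) = @_;
    return $db->{"msgid:$msgid"};
}
```

Adding the same msgid twice returns the existing ID, so re-running an
import shouldn't create duplicates.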

Log variable amounts of stuff, configurable with $loglevel

For msgids mentioned in the body of other messages
(in <foo@bar.com>, joe wrote:),
don't precalculate and store which of them are valid msgids;
do that on the fly when displaying the message. (shouldn't be too
expensive, just a couple extra hashed lookups.)
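
Something like this at display time, where $known is whatever hash
holds the msgid table (the sub name and the regex for spotting
candidate msgids are guesses, not a full Message-ID parser):

```perl
use strict;
use warnings;

# Sketch: pull <local@domain> tokens out of a chunk of body text and keep
# only those that are msgids of messages we actually have.
sub known_msgids_in {
    my ($text, $known) = @_;
    my @candidates = $text =~ /<([^<>\s\@]+\@[^<>\s\@]+)>/g;
    return grep { exists $known->{$_} } @candidates;
}
```

Email addresses in angle brackets match the same pattern, which is
why the hash lookup is needed before turning anything into a link.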

keep track of various things in variables in the DB, so I can
display a nice "status of the archive" page:

$DB{id}: numeric sequence id for messages
$DB{lock}: pid of current process (?)
or just use "lockfile" from procmail distribution instead?

$DB{date:1998}: number of articles in the archive for 1998
$DB{date:199811}: number of articles in the archive for 1998/11
$DB{date:19981101}: number of articles in the archive for 1998/11/01
$DB{date:earliest}: date of the earliest article in the archive
$DB{date:latest}: date of the latest (latest meaning having the most
    recent date, not most-recently-added) article in the archive
$DB{date:19981101:list}: space-separated list of $DB{id}'s of
    messages posted on that date ($DB{date:199810:list} can be
    generated by the yyyymmdd:list's, no need to store it separately)
$DB{num-articles}: number of articles in the archive
etc.

should they have a namespace a la $DB{hurl-internal.date:1998} ?
to avoid clashing with msgids?
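
A sketch of the bookkeeping when one article is added, using the key
names listed above without the extra namespace. The sub names and the
yyyymmdd argument format are assumptions.

```perl
use strict;
use warnings;

# Sketch: update the per-date counters when article $id dated $ymd
# (yyyymmdd) is added. $db is a hash ref standing in for %DB.
sub count_article {
    my ($db, $id, $ymd) = @_;
    die "bad date: $ymd\n" unless $ymd =~ /^(\d{4})(\d{2})(\d{2})$/;
    my ($y, $m) = ($1, $2);

    $db->{"date:$_"}++ for $y, "$y$m", $ymd;    # year, month, day counters
    $db->{'num-articles'}++;

    my $list = "date:${ymd}:list";
    $db->{$list} = defined $db->{$list} ? "$db->{$list} $id" : $id;

    # yyyymmdd strings sort chronologically, so string compares are enough
    $db->{'date:earliest'} = $ymd
        if !defined $db->{'date:earliest'} || $ymd lt $db->{'date:earliest'};
    $db->{'date:latest'} = $ymd
        if !defined $db->{'date:latest'} || $ymd gt $db->{'date:latest'};
    return;
}

# The monthly list is derived from the daily lists rather than stored.
sub month_list {
    my ($db, $ym) = @_;
    my @days = grep { /^date:\Q$ym\E\d{2}:list$/ } keys %$db;
    return map { split ' ', $db->{$_} } sort { $a cmp $b } @days;
}
```

Scanning all keys in month_list is fine for a sketch; a real DB would
probably just try the 31 possible day keys.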



ideas on OO-happy code structure:

$m = new Message;
$m->lookup( $msgid );
$m->parse;
$m->header('from');             (calls $m->parse if not done already)
$m->header('subject');
$m->subject;                    (?)
...

$m->body;
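
A bare-bones sketch of that interface with the lazy parse. lookup()
is left out because how a msgid maps to raw text isn't decided yet, so
new() just takes the raw message as an argument here (an assumption);
the header parsing is deliberately naive.

```perl
use strict;
use warnings;

package Message;

sub new {
    my ($class, %args) = @_;
    return bless { raw => $args{raw}, parsed => 0 }, $class;
}

# Split headers from body once; later calls are no-ops.
sub parse {
    my ($self) = @_;
    return $self if $self->{parsed};
    my ($head, $body) = split /\n\n/, $self->{raw}, 2;
    $head =~ s/\n[ \t]+/ /g;                     # unfold continuation lines
    for my $line (split /\n/, $head) {
        my ($name, $value) = $line =~ /^([^:\s]+):\s*(.*)$/ or next;
        $self->{header}{lc $name} = $value;      # last occurrence wins
    }
    $self->{body}   = defined $body ? $body : '';
    $self->{parsed} = 1;
    return $self;
}

# Accessors parse on demand, so callers never need to call parse themselves.
sub header  { my ($self, $name) = @_; $self->parse; return $self->{header}{lc $name} }
sub subject { $_[0]->header('subject') }
sub body    { my ($self) = @_; $self->parse; return $self->{body} }

package main;
```

With this, Message->new(raw => $text)->subject works without an
explicit parse call.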

use MJD's Memoize.pm to cache function calls:
http://www.plover.com/~mjd/perl/Memoize/

most of these functions should be Memoizable.

This could simplify the design a lot. Might want bits of that
cache to be persistent on disk; does Memoize do that? If not,
store frequently-used stuff in the main DB so it can be accessed
via hashes.
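
For the in-memory case, a sketch of what memoizing one function looks
like. thread_subject() is a made-up example of something that might be
called repeatedly with the same argument; $calls is only there to show
the cache working.

```perl
use strict;
use warnings;
use Memoize;

# Hypothetical helper: normalize a subject line for threading.
# Stands in for anything expensive that is called with repeated arguments.
my $calls = 0;
sub thread_subject {
    my ($subject) = @_;
    $calls++;
    1 while $subject =~ s/^\s*(?:re|fwd?):\s*//i;   # strip reply prefixes
    return lc $subject;
}

memoize('thread_subject');   # later calls with the same argument hit the cache

# How many times the real function body has run.
sub calls { return $calls }
```

Memoize only makes sense for functions whose result depends on nothing
but their arguments, so anything that reads a changing DB would need
care.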


Templates:

all pages should be customizable using templates.

But what language to use? Text::Template? Webmacro? PHP?
Or make up my own?

Simply linking to external style sheets should handle a lot of
the customization that most users will need.


Misc frills:

Make the format of the "search results" pages defined by a variable
a la "date:10;from:20;subject:50"; hardcode this for now, make it
handle arbitrary formats later?
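
A sketch of parsing that variable and laying out one row with it.
The sub names are placeholders, and messages are assumed to be plain
hashes keyed by field name.

```perl
use strict;
use warnings;

# Sketch: turn a spec like "date:10;from:20;subject:50" into a list of
# [field, width] pairs.
sub parse_format {
    my ($spec) = @_;
    return map {
        my ($field, $width) = /^(\w+):(\d+)$/
            or die "bad column spec: $_\n";
        [ $field, $width ];
    } split /;/, $spec;
}

# Lay out one result row: each field is padded or truncated to its width.
sub format_row {
    my ($columns, $msg) = @_;
    my $row = join ' ', map {
        my ($field, $width) = @$_;
        my $value = defined $msg->{$field} ? $msg->{$field} : '';
        sprintf '%-*s', $width, substr($value, 0, $width);
    } @$columns;
    $row =~ s/\s+$//;                     # no trailing padding
    return $row;
}
```

Hardcoding for now would just mean calling parse_format on a constant
string.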

Also, on the search results page headers, have arrows to widen or
narrow each field a la:

      Date <>   Author <>   Subject <>
or:  <Date>    <Author>    <Subject>
or: < Date >  < Author >  < Subject >
or:   Date -+   Author -+   Subject -+

and if the field is less than 10 chars wide, widen/narrow it by 1
column; if it's between 10 and 20, widen/narrow by 2; between 20
and 40, by 5; greater than 40, by 15? coooool....
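
The step rule as a sketch. What happens at exactly 10, 20 or 40 isn't
decided above, so the boundaries here are a guess, as are the sub
names.

```perl
use strict;
use warnings;

# Sketch: how many columns one click on an arrow changes a field by.
sub step_for {
    my ($width) = @_;
    return 15 if $width > 40;
    return 5  if $width > 20;
    return 2  if $width > 10;
    return 1;
}

# $direction is +1 to widen, -1 to narrow; never shrink below one column.
sub resize {
    my ($width, $direction) = @_;
    my $new = $width + $direction * step_for($width);
    return $new < 1 ? 1 : $new;
}
```

Note that widening and then narrowing doesn't always return to the
starting width, since the step depends on the current width.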

At the bottom of search result pages, display:
"Next 20" "Next 50" "Next 100" "Next 200"

article transforms:

when displaying an article, apply a list of transforms to each
line of text, supplied in config.pl; for the hwg list archives,
look for html elements and attributes and link them to the html
spec via dtrt, etc.; for a perl newsgroup, look for perl functions
and builtins, link to online perl manual; expensive but cool!
(here's a sample of the HTML element one)
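
Separately from that sample, a rough sketch of the transform list for
the perl newsgroup case. The URL is a placeholder, not a real manual
location; %BUILTIN is a tiny stand-in list; and the text is assumed to
be HTML-escaped already before the transforms run.

```perl
use strict;
use warnings;

# A transform is a code ref taking a line and returning a line.
my $PERLDOC = 'http://example.org/perlfunc#';             # placeholder URL
my %BUILTIN = map { $_ => 1 } qw(chomp split sprintf);    # sample list only

# Link words that are known perl builtins.
sub link_builtins {
    my ($line) = @_;
    $line =~ s{\b(\w+)\b}{
        $BUILTIN{$1} ? qq{<a href="$PERLDOC$1">$1</a>} : $1
    }ge;
    return $line;
}

# In practice this list would come from config.pl.
my @transforms = (\&link_builtins);

# Run every transform, in order, over every line of the text.
sub apply_transforms {
    my ($text, @t) = @_;
    return join "\n", map {
        my $line = $_;
        $line = $_->($line) for @t;
        $line;
    } split /\n/, $text, -1;
}
```

Order matters: a later transform sees the markup added by earlier
ones, so transforms that add links probably want to run last.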


See also

Hypermail vs HURL discussion on www-talk, Dec 1995


$Id: index.html,v 1.22 2009/05/08 06:30:14 gerald Exp $