Harvester Roadmap

Michael Allan mike at zelea.com
Fri May 11 06:24:12 EDT 2012


Hi C,

So to cover a new forum, we just give it a page in the pollwiki, like
these: http://zelea.com/w/Concept:Forum
Then its messages will automatically appear in the harvest?
http://votorola.polyc0l0r.net/javadoc/votorola/s/wap/HarvestWAP.html
Is that right?

> I see two possible roads going from here. 1) Develop a sample
> Detector for Mailman/Maildir and hammer out the API for it, so we
> get <10s updates to the feed. 2) Build the talk track to embed it in
> the HUD on all (or most pages).  As the benefits of 1) only play out
> if people actually see the messages (assuming the current feed is
> only a demo and not used) and the Detector API can still be hooked
> in later, I tend to 2) and this also seems to follow our design
> philosophy. What do you think?

Whatever the order, we might not want to deploy a talk track (2) on
stage unless it can sync with the forums.  Unlike the old prototype
scenes, the stage is really intended for usable beta code.  Of
course, the talk track could sit in a dev repo/branch until a suitable
forum detector (1) is coded.  Or the detector could be coded first and
deployed immediately.  The users would not see any benefit, that's
true, but it would at least improve the existing feed client as a
prototype/demo.

The detector problem is framed by the use case in this thread:
http://mail.zelea.com/list/votorola/2012-January/001291.html
http://mail.zelea.com/list/votorola/2012-January/001292.html
http://mail.zelea.com/list/votorola/2012-February/001293.html

I can build detectors into (i) the bridge scene, and (ii) the staging
of the forum archive.  The first would be the most useful.  It would
also be easier for you to support.  You just have to ensure that
redundant kicks are handled efficiently because I'd be calling your
Kicker whenever a diff is viewed on the bridge.
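
To illustrate what I mean by handling redundant kicks efficiently, here
is a minimal sketch.  It is not the actual Kicker API.  The class name
CoalescingKicker, its methods, and the use of a forum URL string as the
key are all assumptions made for illustration.  The idea is only that a
second kick for a forum that already has one pending costs a set lookup
and nothing more.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/** Hypothetical sketch, not the real Kicker: coalesces redundant kicks per forum. */
final class CoalescingKicker {

    /** Forums that have been kicked but not yet harvested. */
    private final Set<String> pending = ConcurrentHashMap.newKeySet();

    /** Called once per pending forum when the kicks are drained (assumed callback). */
    private final Consumer<String> harvest;

    CoalescingKicker(Consumer<String> harvest) {
        this.harvest = harvest;
    }

    /** Registers a kick; returns false if one was already pending for this forum. */
    boolean kick(String forumUrl) {
        return pending.add(forumUrl);
    }

    /** Harvests each pending forum once and returns how many were harvested. */
    int drain() {
        int harvested = 0;
        for (String forumUrl : pending) {
            // remove() guards against a concurrent drain handling the same forum
            if (pending.remove(forumUrl)) {
                harvest.accept(forumUrl);
                ++harvested;
            }
        }
        return harvested;
    }
}
```

With something like this, the bridge could call kick() on every diff view,
and the harvester would call drain() on its own schedule.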

I have one doubt about that.  I vaguely recollect discussing an
email-based subscription detector for Mailman. (?)  Why did we discuss
that?  Do you recall?  Hopefully it's not needed, because a bridge
detector is much easier.

> http://votorola.polyc0l0r.net/javadoc/votorola/a/diff/harvest/Configurator.html
> Or is ConfigDetector better? as it also emits Kicks and detects
> something... I think is still distinct from MaildirDetector or
> IRCDetector.

I guess it's some kind of baseline detector, not of messages, but of
whole forums.  I didn't think we needed that, because new forums can
be revealed by other detectors, such as the bridge detector.  (The
first use case in the thread above dealt with something like that.)
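
As a rough sketch of why a baseline detector might be unnecessary,
suppose every kick names the forum it came from.  Then whatever
receives the kicks can learn of new forums as a side effect.  The Kick
and ForumRegistry classes below are hypothetical and are not taken from
the existing code.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical: a kick that names the forum in which activity was detected. */
final class Kick {
    final String forumUrl;

    Kick(String forumUrl) {
        this.forumUrl = forumUrl;
    }
}

/** Hypothetical registry that discovers forums from the kicks it receives. */
final class ForumRegistry {

    /** Forums seen so far, in order of discovery. */
    private final Set<String> known = new LinkedHashSet<>();

    /** Records the kick's forum; returns true if the forum was previously unknown. */
    synchronized boolean receive(Kick kick) {
        return known.add(kick.forumUrl);
    }

    /** Number of forums discovered so far. */
    synchronized int size() {
        return known.size();
    }
}
```

Any message detector (bridge, Maildir, IRC) could feed such a registry,
so a forum would be revealed by its first kick.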

-- 
Michael Allan

Toronto, +1 416-699-9528
http://zelea.com/


conseo said:
> Hi,
> 
> > 
> > > 1. Extend the PipermailHarvester to track any number of forums and
> > >    not just one.
> > > 2. Track the state properly for each forum (internal to
> > >    PipermailHarvester).
> > > 3. Make this state persistent on disk (internal to
> > >    PipermailHarvester).
> > > 4. Design a proper SQL table layout for the gathered messages
> > >    (internal to DiffMessageTable).
> > > 5. Configure the forums by querying the wiki on startup instead of
> > >    hardcoding them (PipermailHarvester). This should also happen
> > >    from time to time during runtime.
> 
> DONE.(1) The new Configurator (2) loads all forum configurations from the Wiki 
> (might be a separate forum-wiki later as Mike has suggested, to ease cross 
> project collaboration) and crawls all Pipermail archives. 
> I have also been able to get proper TCP session reuse in the HTTP client code
> by using the newly released httpasyncclient of Apache (3). This might or might
> not keep working in the long run with the few lines used now, but it fits our
> use case nicely and is still accessible through http-client/core APIs, so
> thanks Apache! HarvestRunner is now at least 100 lines smaller.
> 
> I have added votorola/a/diff/harvest/harvest-cache.js as an example config 
> file showing how one can crawl for all subdomains of zelea.com (this config 
> file is optional). Config updates (including update crawls) happen every 3 
> hours atm. (hardcoded).
> 
> I see two possible roads going from here. 1) Develop a sample Detector for 
> Mailman/Maildir and hammer out the API for it, so we get <10s updates to the 
> feed. 2) Build the talk track to embed it in the HUD on all (or most pages). 
> As the benefits of 1) only play out if people actually see the messages 
> (assuming the current feed is only a demo and not used) and the Detector API 
> can still be hooked in later, I tend to 2) and this also seems to follow our 
> design philosophy. What do you think?
> 
> conseo
> 
> (1) http://votorola.polyc0l0r.net/hg/rev/552a9eec9d96
> 
> (2) 
> http://votorola.polyc0l0r.net/javadoc/votorola/a/diff/harvest/Configurator.html
> Or is ConfigDetector better? as it also emits Kicks and detects something... I 
> think is still distinct from MaildirDetector or IRCDetector.
> 
> (3)
> https://hc.apache.org/httpcomponents-asyncclient-dev/index.html
> It is still beta, but the API is supposed to be stable. I will upgrade it once
> it is released. It builds upon the already used http-core, http-client,
> http-client-nio packages.


