<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN" "http://www.w3.org/TR/REC-html40/strict.dtd"> <html><head><meta name="qrichtext" content="1" /><style type="text/css"> p, li { white-space: pre-wrap; } </style></head><body style=" font-family:'Monospace'; font-size:10pt; font-weight:400; font-style:normal;">
Hi M and everybody else interested :-),

I have worked on the PipermailHarvester prototype and on the scheduling framework for scraping jobs. The framework is meant to balance load on the servers, provide almost live updates, and keep our I/O overhead for so many connections reasonable. We use http-core and http-core-nio (1). We had some discussion on IRC (2), which I will respond to in detail here:

>[09:08:22] <mcallan> conseo: looking now at your code. (1) DiffKick looks dangerous, because a harvest based on a kick should be no different than any other harvest, otherwise it might not be possible to regenerate the archive by a crawl harvest. to be sure of this, it would be best not to rely on contextual information from the kicker

Sure, and it doesn't rely on it. We provide the context information because we can. It is helpful because a Kick triggers a "burst", and we can end the burst once we have found the message matching the context of the kick and resume a normal harvest afterwards. All data is still parsed from the web and not taken from the context. I can make that concept private to DiffKick: instead of exposing the context itself, I can offer a way to match against it.

>[09:08:56] <mcallan> (regenerate the cache of the archive)
>[09:17:10] <mcallan> (2) you receive a kick. you ignore its forum property, i guess because this is just test code with a hard-coded forum (ok). then it looks like you start a crawl consisting of many scheduled jobs of different types. this seems overcomplicated... or at least i don't see the design yet.

A pipermail archive has three levels of HTML which we parse.
First is the index itself (InitJob). For each listed month it schedules the "date.html" post listing (MonthJob), which in turn schedules each posting from its list (PageJob). By scheduler design, each HarvestJob represents exactly one remote archive HTML page. These pages are the ones given by Pipermail, so I haven't added anything to the remote archive structure. These levels (scraping backwards in time) seem to be pretty common for most web forums. You could call "InitJob" "UpdateJob" if you like; although I haven't modelled that concretely, it already does the same.

>[09:45:03] <mcallan> I think you need a solid design before you get too far into the code. I would start with a simple napkin sketch. Here's my rough attempt:
>[09:45:15] <mcallan> (a) Receive kick
>[09:45:15] <mcallan> (b) Schedule update job
>[09:45:15] <mcallan> (c) Let update job run and schedule further update jobs as needed

Yep.

>[09:45:15] <mcallan> Let's look at the detail of (c), because that's obviously the heart of it.

I see problems in your following proposal:

>[09:45:15] <mcallan> (c1) Read local marker recording the last message M0 cached.

1. Markers are a concept of ours to avoid double crawling. They are not guaranteed by the remote archive. IRC archives don't have message IDs, for example, so we fall back on date ordering only, which basically gives us a list. Dates don't have previous and next items (they are non-discrete), so we cannot create such a structure in a Harvester per se. This also means btw. that HarvestHistory should be optional, as it is not guaranteed to represent the remote archive structure in the best way.

>[09:45:17] <mcallan> (c2) Find M0 in the remote archive.
>[09:45:20] <mcallan> (c3) If M0 is the latest message (no more to read), then quit.
>[09:45:22] <mcallan> (c4) Try incrementing local marker to next message M1, or goto (c1) if another job has since incremented it.

2.
We then harvest forward and not backward, which gives us no guarantee that we can meet the <10s live criterion, or else we have to burst forwards through any number of new posts (which means we burst that way on every Kick!). Picture 100 posts sent since the last update, which we cannot rule out imo.

>[09:45:25] <mcallan> (c5) Read M1 from the remote archive.
>[09:45:28] <mcallan> (c6) If M1 contains a diff URL, then cache it.
>[09:45:31] <mcallan> (c7) If M1 is the latest message (no more to read), then quit.

3. We don't know when to stop. A 404 can be caused by any number of issues, including a missing message ID, which even happens in the metagov pipermail archive. If we instead fetch the index of the latest month, we can go backwards until we match our context or reach the part already covered by HarvestHistory (which makes the jobs stop). Compared to walking the markers and waiting for a 404, this has no drawbacks: the overhead is the same, namely one page fetch to determine the start point (month of the current date) or the end point (with a 404) of the job. We will also very likely stay within <10s, because the Kick has just been received and our burst will very likely hit its message first. If the burst goes backwards and we can match the DiffKick context, we can immediately degrade to the 1s stepping (schedule a normal UpdateJob or whatever it is), so 100 new posts are no problem.

>[09:45:33] <mcallan> (c8) Schedule another update job.
>[09:48:13] <mcallan> conseo: i'll be up in 10 hours or so, and we can discuss
>[10:07:25] <mcallan> this is what i meant by sketching the algorithm of a single job. note this design does not depend on the structure of the archive, and includes very few implementation details. the details do not matter a whole lot because they can always be changed after the fact. the design cannot be changed so easily once the code is written, so it's crucial to get it right.
not sure this is right, but it's a first stab

See above for the current design rationale, which I have developed through this prototype and my past experiences with pipermail and irssilog. Sorry that I couldn't do it before, but I wanted to get my hands a bit dirty to understand the potential problems of the scheduling better (that is what the prototyping was for). While I have now maybe clarified the design rationale a bit more, what I actually wanted was feedback on whether the scheduling is done right (independent of how a harvester is run). The concept is:

1) Extend HarvestJob (I can separate it into an interface if you don't like inheritance) and set the URL for each job.
2) Implement the run() method to read the InputStream, which is created by HarvestRunner, and deal with the content of the fetched HTML page.
3) Schedule the job (3). The scheduler asynchronously fetches the job's URL in the next possible step for this host and then runs the job inside its thread pool. Do some checks (internal to the harvester) with HarvestHistory or your own persistent state tracker to avoid double crawls.

conseo

(1) https://hc.apache.org/httpcomponents-core-ga/
(2) http://zelea.com/var/cache/irc/votorola/12-03/22 and http://zelea.com/var/cache/irc/votorola/12-03/23
(3) http://zelea.com/project/votorola/_/javadoc/votorola/a/diff/harvest/HarvestRunner.html#schedule%28votorola.a.diff.harvest.HarvestJob%29</body></html>