Bug#776658: lintian: Memory consumption of harness and html_reports

To: Russ Allbery <rra@debian.org>, 776658@bugs.debian.org
Subject: Bug#776658: lintian: Memory consumption of harness and html_reports
From: Niels Thykier <niels@thykier.net>
Date: Sat, 31 Jan 2015 10:06:15 +0100
Message-id: <54CC9B07.2080900@thykier.net>
Reply-to: Niels Thykier <niels@thykier.net>, 776658@bugs.debian.org
In-reply-to: <87egqbsp9b.fsf@hope.eyrie.org>
References: <20150130173041.9273.15937.reportbug@mangetsu.thykier.net> <87egqbsp9b.fsf@hope.eyrie.org>

On 2015-01-31 02:38, Russ Allbery wrote:
> Niels Thykier <niels@thykier.net> writes:
> 
>> The html_reports process itself consumes up to 2GB while processing
>> templates.  It is possible that there is nothing we can do about that
>> as there *is* a lot of data in play.  But even then, we can free it as
>> soon as possible (so we do not keep it while running gnuplot at the
>> end of the run).
> 
> I think the code currently takes a very naive approach and loads the
> entire state of the world into memory, and Perl's memory allocation is
> known to aggressively trade space for speed.
> 

It does try to share a lot of the inner data structures, though there
are indeed still some deficiencies in it.  I really wish one could do
things like string interning in Perl.

> If instead it stored the various things it cared about in a local SQLite
> database, it would be a bit slower, but it would consume much less
> memory.  I bet the speed difference wouldn't be too bad.  And this would
> have the possibly useful side effect of creating a SQLite database full of
> interesting statistics that one could run rich queries against.
> 

That is definitely worth consideration - thanks for the suggestion.  It
would imply an immense rewrite of html_reports.  While it is certainly
long overdue, it is not something I suspect I will have time (or mental
capacity) to do on my own.

I have started a different approach (see [1] for WIP code).  It is
mostly a parallel track to your idea, so they can certainly co-exist.

The goal of this approach is to:

 * Reduce harness to a "simple" coordinator
 * Remove the Lab as a (primary) data store (it is too fragile)
 * Use the harness state as the data store

The details of my design decisions are:

Harness - simple coordinator
============================

In my opinion, a lot of the (to quote private/TODO) "yuckness" of
harness happens because we want very well determined failure handling,
but never wrote harness with a structure that makes that trivial.
Notably, we do not want harness to crash (without logging it first) and
especially not while working on the Lab (see next section).

By moving logic out of harness, this rewrite will become easier as there
is less to juggle.  Further, by moving it out of harness (and into
another process), we can ensure that any memory consumed by the task
will definitely be freed when the child process terminates.  I have
previously tried to make harness free some of its memory, with no luck.
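
To illustrate the point about child processes, here is a minimal sketch
in Python (not Perl, and not harness code).  The names CHILD_CODE and
run_in_child are hypothetical.  The heavy work runs in a separate
interpreter and only a small summary is passed back, so whatever the
child allocated is returned to the OS when it exits:

```python
import json
import subprocess
import sys

# Hypothetical heavy task, executed by a separate interpreter.
CHILD_CODE = """
import json, sys
n = int(sys.argv[1])
data = [str(i) * 10 for i in range(n)]  # stands in for the big report data
json.dump({"items": len(data), "chars": sum(map(len, data))}, sys.stdout)
"""


def run_in_child(n_items):
    """Run the heavy task in a child process and return its small summary."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_CODE, str(n_items)],
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        # The coordinator gets a chance to log the failure instead of crashing.
        raise RuntimeError(
            f"child failed ({proc.returncode}): {proc.stderr.strip()}"
        )
    return json.loads(proc.stdout)


if __name__ == "__main__":
    print(run_in_child(1000))
```

The coordinator only ever holds the summary, and a crash in the child
shows up as a non-zero exit status that can be logged.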

Removing the Lab as data store
==============================

For me, there are several advantages in this.  Firstly, the lab is very
fragile - if anything crashes (or is interrupted) while updating the
lab, the metadata is often trivially out of sync and the lab is (partly)
corrupted[0].  The end result is often that lintian/harness croaks on
importing stuff until someone manually runs a $lab->repair.  However,
this does not fix all types of corruptions (see the FIXME in
L::Lab->repair), so... /o\

By removing the Lab as a data store, we can use a simpler and more
robust data store (more on that in the next section) AND use throw away
labs.  I had a talk with DSA (I think weasel) about getting a tmpfs disk
on another machine for the heavy lifting.  This implies that we *can* in
fact throw away the lab after every run.

Harness state as datastore
==========================

I introduced a "harness state cache" a couple of versions back to track
which packages needed to be reprocessed, when we uploaded a new version
of lintian.  This (YAML) file can be trivially extended to contain all
the necessary information required by harness and html_reports to
replace the Lab as a data store.  It already has several advantages
over the Lab, namely:

 * Atomic updates of the content (see save_state_cache in harness)
 * Automatically recreated from scratch if it "vanishes".
 * We can add/remove information to/from it without having to update
   the lab metadata.
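
The first two properties can be sketched as follows.  This is a minimal
Python illustration under assumptions, not the save_state_cache code
from harness: it uses JSON instead of YAML to stay within the standard
library, and save_state, load_state and the example keys are
hypothetical.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Write state atomically: readers see the old or the new file,
    never a partially written one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(
        dir=directory, prefix=".state-", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle, indent=1, sort_keys=True)
            handle.flush()
            os.fsync(handle.fileno())  # data on disk before the rename
        # Atomic when both paths are on the same filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        # Drop the temporary file; the previous state file is untouched.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load_state(path):
    """Return the saved state, or a fresh empty one if the file vanished."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        return {}


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    cache = os.path.join(workdir, "state.json")
    state = load_state(cache)  # {} on the first run
    state["hello/2.10-1"] = {"last-processed-by": "example-version"}
    save_state(cache, state)
    print(load_state(cache))
```

Writing to a temporary file in the same directory and renaming it over
the target is what makes an interrupted run harmless: the old file
stays valid until the new one is complete.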

Certainly, this file can (also?) be replaced by an SQL(-lite) database.
If someone is willing to do, or help me with, the SQL(-lite) part, I am
definitely open to it.
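
For a rough idea of what the SQLite variant could look like, here is a
small Python sketch using the standard sqlite3 module.  The table
layout, column names and function names are assumptions made up for
illustration; nothing like this exists in harness today.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (and if needed create) a hypothetical state database."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS package_state (
               name TEXT NOT NULL,
               version TEXT NOT NULL,
               last_processed_by TEXT,
               PRIMARY KEY (name, version)
           )"""
    )
    return conn


def record_run(conn, name, version, lintian_version):
    """Record which lintian version last processed a package."""
    # "with conn" commits on success and rolls back on an exception.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO package_state VALUES (?, ?, ?)",
            (name, version, lintian_version),
        )


def needs_reprocessing(conn, lintian_version):
    """Packages whose last run used a different lintian version (or none)."""
    rows = conn.execute(
        "SELECT name, version FROM package_state "
        "WHERE last_processed_by IS NOT ? ORDER BY name, version",
        (lintian_version,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    store = open_store()
    record_run(store, "hello", "2.10-1", "old")
    record_run(store, "zsh", "5.0-1", "new")
    print(needs_reprocessing(store, "new"))
```

Each update runs in a transaction, so the atomic-update property of the
state cache would carry over, and the reprocessing question becomes a
plain query.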

~Niels

[0] Unless you manage to successfully run $LAB->close - harness does
not, lintian generally does.

[1]
http://anonscm.debian.org/cgit/users/nthykier/lintian.git/log/?h=reporting-rewrite

NB: Rebased regularly.
