Bug#908678: Testing the filter-branch scripts
On 2018-11-10 18:56:01, Daniel Lange wrote:
> Antoine,
>
> thank you very much for your filter-branch scripts.
you're welcome! glad it can be of use.
> I tested each:
>
> 1) the golang version:
> It completes after 3h36min:
>
> # git filter-branch --tree-filter '/split-by-year' HEAD
> Rewrite a09118bf0a33f3721c0b8f6880c4cbb1e407a39d (68282/68286) (12994 seconds passed, remaining 0 predicted)
> Ref 'refs/heads/master' was rewritten
>
> But it doesn't Close() the os.OpenFile handles so ...
> all data/CVE/list.yyyy files are 0 bytes long. Sic!
Well. That explains part of the performance difference. ;)
There were multiple problems with the golang source - variable shadowing
and, yes, a missing Close(). Surprisingly, the fixed version results is
*slower* than the equivalent Python code, taking about one second per
run or 1102 seconds for the last 1000 commits. I'm at a loss as to how I
managed to make go run slower than Python here (and can't help but think
C would have been easier, again). Probably poor programming on my
part. New version attached.
[...]
> 2.1) the Python version
> You claim #!/usr/bin/python3 in the shebang, so I tried that first:
>
> # git filter-branch --tree-filter '/usr/bin/python3 /__pycache__/split-by-year.cpython-35.pyc' HEAD
> Rewrite 990d3c4bbb49308fb3de1e0e91b9ba5600386f8a (1220/68293) (41 seconds passed, remaining 2254 predicted)
> Traceback (most recent call last):
> File "split-by-year.py", line 13, in <module>
> File "/usr/lib/python3.5/codecs.py", line 321, in decode
> (result, consumed) = self._buffer_decode(data, self.errors, final)
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 5463: invalid start byte
> tree filter failed: /usr/bin/python3 /__pycache__/split-by-year.cpython-35.pyc
I suspected this would be a problem, but didn't find any occurence in
the shallow clone so I forgot about it. Note that the golang version
takes great care to treat the data as binary...
> The offending commit is:
> * 990d3c4bbb - Rename sarge-checks data to something not specific to sarge, since we're working on etch now.
> Sorry for the probable annoyance, but it had to be done. (13 years ago) [Joey Hess]
>
> There will be many more like this, so for Python3
> this needs needs to be made unicode-agnostic.
... so I rewrote the thing to handle only binary and tested it against
that version of the file. It seems to work fine.
> Notice I compiled the .py to .pyc which makes it
> much faster and thus well usable.
Interesting. I didn't see much difference in performance in my
benchmarks on average, but the worst-case run did improve by 150ms, so I
guess this is worth the trouble. For those who didn't know (like me)
this means running:
python -m compileall bin/split-by-year.py
Whenever the .py file changes (right?).
> 2.2) Python, when a string was a string .. Python2
> Your code is actually Python2, so why not give that a try:
>
> # git filter-branch --tree-filter '/usr/bin/python2 /split-by-year.pyc' HEAD
> Rewrite b59da20b82011ffcfa6c4a453de9df58ee036b2c (2516/68293) (113 seconds passed, remaining 2954 predicted)
> Traceback (most recent call last):
> File "split-by-year.py", line 18, in <module>
> yearly = 'data/CVE/list.{:d}'.format(year)
> NameError: name 'year' is not defined
> tree filter failed: /usr/bin/python2 /split-by-year.pyc
>
> The offending commit is:
> * b59da20b82 - claim (13 years ago) [Moritz Muehlenhoff]
> | diff --git a/data/CVE/list b/data/CVE/list
> | index 7b5d1d21d6..cdf0b74dd0 100644
> | --- a/data/CVE/list
> | +++ b/data/CVE/list
> | @@ -1,3 +1,4 @@
> | +begin claimed by jmm
> | CVE-2005-3276 (The sys_get_thread_area function in process.c in Linux 2.6 before ...)
> | TODO: check
> | CVE-2005-3275 (The NAT code (1) ip_nat_proto_tcp.c and (2) ip_nat_proto_udp.c in ...)
> | @@ -34,6 +35,7 @@ CVE-2005-3260 (Multiple cross-site scripting (XSS) vulnerabilities in ...)
> | TODO: check
> | CVE-2005-3259 (Multiple SQL injection vulnerabilities in versatileBulletinBoard (vBB) ...)
> | TODO: check
> | +end claimed by jmm
> | CVE-2005-XXXX [Insecure caching of user id in mantis]
> | - mantis <unfixed> (bug #330682; unknown)
> | CVE-2005-XXXX [Filter information disclosure in mantis]
>
> As you see the line "+begin claimed by jmm" breaks the too simplistic parser logic.
> Unfortunately dry-running against a current version of data/CVE/list such errors do not show up.
> The "violations" of the file format are transient and buried in history.
Hmm... That's a trickier one. I guess we could just pretend that line
doesn't exist and drop it from history... But I chose to buffer it and
treat it like the CVE line so it gets attached to the right file. See if
it does what you expect.
git cat-file -p b59da20b82:data/CVE/list > data/CVE/list.b59da20b82
split-by-year.py data/CVE/list.b59da20b82
Performance-wise, I shaved off a surprising 60ms by enclosing all the
code in a function (yes, it's crazy), but the buffering to deal with the
above issue added another 40ms so performance should be similar.
I'll start a run on the whole history to see if I can find any problems,
as soon as a first clone finishes resolving those damn deltas. ;)
Thanks for the review!
A.
--
Premature optimization is the root of all evil
- Donald Knuth
#!/usr/bin/python3
import os
import sys
def main(path):
fds = {}
year = None
buffer = b''
with open(path, 'rb') as source:
for line in source:
if line.startswith(b'CVE-'):
buffer += line
year = line.split(b'-')[1]
year = int(year.decode('ascii', errors='surrogateescape'))
elif year:
yearly = 'data/CVE/list.{:d}'.format(year)
target = fds.get(year, None)
if target is None:
fds[year] = target = open(yearly, 'ab')
if buffer:
target.write(buffer)
buffer = b''
target.write(line)
else:
buffer += line
for year, fd in fds.items():
fd.close()
os.unlink(path)
if __name__ == '__main__':
path = 'data/CVE/list'
if len(sys.argv) > 1:
path = sys.argv[1]
main(path)
package main
import (
"bufio"
"bytes"
"io"
"log"
"os"
"strconv"
"strings"
)
func main() {
file, err := os.Open("data/CVE/list")
if err != nil {
log.Fatal(err)
}
defer file.Close()
var (
line []byte
cve []byte
year uint64
year_str string
target *os.File
header bool
ok bool
)
fds := make(map[uint64]*os.File, 20)
scanner := bufio.NewReader(file)
for {
line, err = scanner.ReadBytes('\n')
if err != nil {
break
}
if bytes.HasPrefix(line, []byte("CVE-")) {
cve = line
year_str = strings.Split(string(line), "-")[1]
year, _ = strconv.ParseUint(year_str, 0, 0)
header = true
} else {
if target, ok = fds[year]; !ok {
target, err = os.Create("data/CVE/list." + year_str)
if err != nil {
log.Fatal(err)
}
fds[year] = target
}
if header {
_, err := target.Write(cve)
if err != nil {
log.Println("error writing", string(cve), target, err)
break
}
header = false
}
_, err := target.Write(line)
if err != nil {
log.Println("error writing", string(line), target, err)
break
}
}
}
if err != io.EOF {
log.Fatal(err)
}
os.Remove("data/CVE/list")
}
Reply to: