Posted Feb 16, 2009
bigreplace: Regex find-and-replace for files bigger than RAM
grep and sed choke on enormous files. This doesn’t as much.
A 12GB odyssey
The other day, a partner came to me with a very slow Plone site. Almost immediately, I noticed their event.log was 12GB, most of it added in the last few hours. Through judicious use of head and tail, I was able to identify some suspicious tracebacks (each several hundred lines long), but it wasn't easy to get an overview of the whole mess, since I could only fit so much on my screen. What I really needed was to regex-replace the tracebacks I'd already looked at with a short token so I could see if there were any other interesting errors.
So off I went with grep and sed and brethren, and verily I did run out of core and thrash my computer into oblivion. Huh. What's more, I couldn't seem to find any existing tools to do regex find-and-replace without loading (or mmapping) the whole thing into RAM first. Imagine my shock!
So I went off and wrote bigreplace: regular expression search and replace for files that don't fit in RAM.
Usage: bigreplace [options] PATTERN REPLACEMENT
Do search-and-replace on files too big to fit in memory.
PATTERN -- A regular expression. Be sure to pass M to --flags if it's multiple lines.
REPLACEMENT -- What to replace it with. May include backreferences to captured groups.
Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -i IN_FILE, --in=IN_FILE
                        the file to search. Defaults to stdin.
  -o OUT_FILE, --out=OUT_FILE
                        the file to which to write the results. Defaults to
                        stdout.
  -f FLAGS, --flags=FLAGS
                        one or more regular expression flags, documented at
                        http://docs.python.org/library/re.html#re.I. Example:
                        --flags=ILMSUX
  -t, --test            run self tests
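For the curious, here is roughly how a flag string like the one above could be turned into something Python's re module understands. This is a minimal sketch, not necessarily how bigreplace does it: parse_flags is a hypothetical helper name, and the sample pattern and text are made up. One thing worth knowing about the real re module: M only changes where ^ and $ match, while S is what lets . cross newlines, so a pattern meant to swallow a whole multi-line traceback will usually want S as well.

```python
import re

def parse_flags(letters):
    """Combine single-letter flag names (e.g. "IM") into one re flags value.

    Hypothetical helper for illustration; bigreplace may parse --flags differently.
    """
    combined = 0
    for letter in letters.upper():
        if letter not in "ILMSUX":
            raise ValueError("unknown regex flag: %r" % letter)
        # re.I, re.L, re.M, re.S, re.U and re.X are real module attributes.
        combined |= getattr(re, letter)
    return combined

# I = ignore case, M = let ^ and $ match at every line boundary.
flags = parse_flags("IM")
regex = re.compile(r"^error: (\w+)$", flags)

# \1 in the replacement is a backreference to the first captured group.
result = regex.sub(r"<\1>", "Error: disk\nok\nERROR: net")
```

The same kind of backreference is what REPLACEMENT accepts on the command line.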
An interesting caveat
This implementation is good enough for log analysis, and it works 100% if you don't use regular expressions (as might other tools), but it's not totally correct. bigreplace uses a default buffer size of 10MB. If your search pattern matches a span that approaches that size, it's possible a match might not be recognized in its entirety (it might not be "greedy enough"). This is because, from Python, I have no way of telling when the regex's finite state machine wants to look past the end of the buffer. I pretty much just say "10MB is pretty big. Find all the matches you can in there, and when you run out of matches, I'll lop off the matched parts and append another 10MB to what remained unmatched." This is expressed more concisely and precisely in the test_match_bigger_than_buffer test, and I welcome ideas on how to address this without dropping down to C and writing my own FSM. (However, I will gladly accept custom Flying Spaghetti Monsters if you care to contribute them.)
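To make the buffering idea concrete, here is a minimal sketch of that loop. It is an illustration under stated assumptions, not the actual bigreplace source: chunked_sub and its parameters are hypothetical names, the files are assumed to be opened in text mode, and the cap on how much unmatched text gets carried over is my own addition to keep memory bounded. It also ignores the fact that anchors and lookbehinds see each trimmed buffer as if it were the start of the input.

```python
import io
import re

def chunked_sub(pattern, replacement, in_file, out_file, flags=0,
                buffer_size=10 * 1024 * 1024):
    """Stream in_file to out_file, replacing regex matches buffer by buffer."""
    regex = re.compile(pattern, flags)
    buf = ""  # unmatched text carried over from earlier reads
    while True:
        chunk = in_file.read(buffer_size)
        if not chunk:
            # End of input: the carried-over text was already scanned, so flush it.
            out_file.write(buf)
            return
        buf += chunk
        pos = 0  # index just past the last match written out
        for match in regex.finditer(buf):
            out_file.write(buf[pos:match.start()])
            # expand() fills in backreferences such as \1 in the replacement.
            out_file.write(match.expand(replacement))
            pos = match.end()
        # Lop off everything up to the last match. Assumption: also flush all
        # but the last buffer_size characters of the unmatched remainder, so a
        # file with no matches cannot pile up in memory.
        keep_from = max(pos, len(buf) - buffer_size)
        out_file.write(buf[pos:keep_from])
        buf = buf[keep_from:]

# A tiny buffer forces matches to straddle read boundaries.
src = io.StringIO("ok\nTraceback A\nok\nTraceback B\n")
dst = io.StringIO()
chunked_sub(r"Traceback (\w)", r"<TB \1>", src, dst, buffer_size=8)
```

Fixed-text patterns split across two reads are still found, because the unmatched tail is rescanned once more data arrives. The "not greedy enough" case shows up when a match can end early: with a buffer of 3 characters, the pattern ab+ run over "abbbb" stops at the buffer edge and leaves the last two b's behind.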
But for the 99.99% case, bigreplace is pretty handy, so I'm tossing it out there.
