
Posted Feb 16, 2009

bigreplace: Regex find-and-replace for files bigger than RAM

by Erik Rose

grep and sed choke on enormous files. This doesn’t as much.

A 12GB odyssey

The other day, a partner came to me with a very slow Plone site. Almost immediately, I noticed their event.log was 12GB, most of it added in the last few hours. Through judicious use of head and tail, I was able to identify some suspicious tracebacks (each several hundred lines long), but it wasn't easy to get an overview of the whole mess, since I could only fit so much on my screen. What I really needed was to regex-replace the tracebacks I'd already looked at with a short token so I could see if there were any other interesting errors.

So off I went with grep and sed and brethren, and verily I did run out of core and thrash my computer into oblivion. Huh. What's more, I couldn't seem to find any existing tools to do regex find-and-replace without loading (or mmapping) the whole thing into RAM first. Imagine my shock!

So I went off and wrote bigreplace: regular expression search and replace for files that don't fit in RAM.

Usage: bigreplace [options] PATTERN REPLACEMENT
Do search-and-replace on files too big to fit in memory.
PATTERN -- A regular expression. Be sure to pass M to --flags if it's multiple lines.
REPLACEMENT -- What to replace it with. May include backreferences to captured groups.

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -i IN_FILE, --in=IN_FILE
                        the file to search. Defaults to stdin.
  -o OUT_FILE, --out=OUT_FILE
                        the file to which to write the results. Defaults to
                        stdout.
  -f FLAGS, --flags=FLAGS
                        one or more regular expression flags, documented at
                        http://docs.python.org/library/re.html#re.I. Example:
                        --flags=ILMSUX
  -t, --test            run self tests
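
To make the PATTERN and --flags advice concrete, here is a small Python sketch of the kind of traceback-collapsing pattern described above, run with plain `re.sub` on a tiny made-up log. The `parse_flags` helper is hypothetical: it only illustrates how flag letters such as `MS` could map onto the `re` module's constants, and it is not taken from bigreplace's source.

```python
import re


def parse_flags(letters):
    """Combine flag letters such as "MS" into one re flags value.

    Hypothetical helper for illustration; bigreplace may do this differently.
    """
    combined = 0
    for letter in letters.upper():
        combined |= getattr(re, letter)  # re.I, re.M, re.S, ...
    return combined


# A made-up log excerpt, standing in for a real event.log.
log = (
    "INFO request ok\n"
    "Traceback (most recent call last):\n"
    '  File "a.py", line 1, in f\n'
    "KeyError: 'x'\n"
    "INFO request ok\n"
)

# M lets ^ and $ anchor at every line; S lets . match across newlines.
pattern = r"^Traceback.*?^KeyError: [^\n]*$"
collapsed = re.sub(pattern, "<KeyError traceback>", log, flags=parse_flags("MS"))
print(collapsed)
```

Without `M`, the `^` and `$` anchors would only match at the very start and end of the text, and without `S` the `.*?` could not cross the line breaks inside the traceback.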

An interesting caveat

This implementation is good enough for log analysis, and it works 100% if you don't use regular expressions (as might other tools), but it's not totally correct. bigreplace uses a default buffer size of 10MB. If your search pattern matches a span that approaches that size, it's possible a match might not be recognized in its entirety (it might not be "greedy enough"). This is because, from Python, I have no way of telling when the regex's finite state machine wants to look past the end of the buffer. I pretty much just say "10MB is pretty big. Find all the matches you can in there, and when you run out of matches, I'll lop off the matched parts and append another 10MB to what remained unmatched." This is expressed more concisely and precisely in the test_match_bigger_than_buffer test, and I welcome ideas on how to address this without dropping down to C and writing my own FSM. (However, I will gladly accept custom Flying Spaghetti Monsters if you care to contribute them.)
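
The buffering strategy described above can be sketched roughly as follows. This is a minimal illustration, not bigreplace's actual source: the name `buffered_sub`, the `chunk_size` parameter, the use of text-mode file objects, and the step that flushes an oversized unmatched tail (to keep memory bounded when nothing matches) are all assumptions.

```python
import io
import re


def buffered_sub(pattern, replacement, in_file, out_file,
                 flags=0, chunk_size=10 * 1024 * 1024):
    """Regex-replace a stream while holding only a couple of chunks in memory."""
    regex = re.compile(pattern, flags)
    tail = ""  # text after the last match, not yet written out
    while True:
        chunk = in_file.read(chunk_size)
        if not chunk:
            break
        buf = tail + chunk
        last_end = 0
        # Find all the matches available in the current buffer.
        for match in regex.finditer(buf):
            out_file.write(buf[last_end:match.start()])
            out_file.write(match.expand(replacement))  # handles backreferences
            last_end = match.end()
        # Lop off the matched part; keep what remained unmatched.
        tail = buf[last_end:]
        # Assumption: cap the carried-over tail at one chunk.
        if len(tail) > chunk_size:
            cut = len(tail) - chunk_size
            out_file.write(tail[:cut])
            tail = tail[cut:]
    out_file.write(tail)


# A match that straddles a chunk boundary is still found...
out = io.StringIO()
buffered_sub("o b", "X", io.StringIO("foo bar"), out, chunk_size=4)
print(out.getvalue())  # foXar

# ...but a match longer than the buffer is split: the "not greedy enough" case.
out = io.StringIO()
buffered_sub("a+", "X", io.StringIO("aaaaaa"), out, chunk_size=4)
print(out.getvalue())  # XX, where an unbuffered re.sub would give X
```

The second call shows the caveat in miniature: the regex is satisfied with the four characters it can see, so the run of six is replaced twice instead of once.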

But for the 99.99% case, bigreplace is pretty handy, so I'm tossing it out there.


Very Useful Tool

Posted by Casey at Apr 03, 2009 08:57 PM
This was very useful for me when I was trying to restore a MySQL backup (editing the 2GB SQL file is very tedious). Best of all, it's written in Python so it works in Windows. Thanks again!