Eliminate File Redundancy with Ruby

Say you have a file with many repeated, unnecessary lines that you want to remove. For safety’s sake, you would rather write an abbreviated copy of the file than overwrite the original. Ruby makes this a cinch. You just iterate over the file, recording every line the computer has already “seen” in a hash (Ruby’s dictionary type). If a line is not in the hash, it must be new, so write it to the output file and record it. Here’s the code, designed with .tex files in mind but easily adaptable (enter the filename without the .tex extension, since the script appends it):

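# Ask for the base name; the .tex extension is appended below.
puts 'Filename?'
filename = gets.chomp
input = File.open(filename+'.tex')
output = File.open(filename+'2.tex', 'w')
seen = {}  # lines already written to the output file
input.each do |line|
  if (seen[line])
    # duplicate line: skip it
  else
    output.write(line)
    seen[line] = true
  end
end
input.close()
output.close()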
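
For comparison, here is a minimal sketch of the same idea wrapped in a reusable method. The method name dedupe_file is made up for illustration, and swapping the hash for a Set is my substitution, not part of the script above. The block forms of File.open and File.foreach close the files automatically, even if something goes wrong partway through.

```ruby
require 'set'

# Hypothetical helper (name is illustrative): copy src to dest,
# writing each distinct line only the first time it appears.
def dedupe_file(src, dest)
  seen = Set.new
  File.open(dest, 'w') do |out|
    File.foreach(src) do |line|
      # Set#add? returns nil when the line is already in the set,
      # so the write happens only for lines not seen before.
      out.write(line) if seen.add?(line)
    end
  end
end
```

Calling dedupe_file('plot.tex', 'plot2.tex') would mirror what the script does for the name "plot". As in the original, lines are compared exactly, including the trailing newline, so a final line without a newline will not match an earlier copy that has one.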

Where would this come in handy? Well, the .tex extension probably already gave you a clue that I am reducing redundancy in a \LaTeX file. In particular, I have an R plot generated as a tikz graphic. The R plot includes a rug at the bottom (tick marks indicating data observations), but the data set includes over 9,000 observations, so many of the lines are drawn right on top of each other. The \LaTeX compiler got peeved at having to draw so many lines, so Ruby helped it out by eliminating the redundancy. One special tweak for using the script above to modify tikz graphics files is to change the line

if (seen[line])

to

if (seen[line]) && !(line.include? 'node') && !(line.include? 'scope') && !(line.include? 'path') && !(line.include? 'define')

if your plot has multiple panes (e.g. par(mfrow=c(1,2)) in R), so that Ruby won’t drop seemingly redundant lines that actually specify new panes. With this change, any line containing one of those four keywords is always written, even when it repeats. The modified line is a little long and messy, but it works, and that was the main goal here. The resulting \LaTeX file compiles easily and more quickly than it did with all those redundant lines, thanks to Ruby.
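
If the long condition bothers you, one possible tidy-up is to move the keywords into a list. This is only a sketch: the names KEEP_KEYWORDS, structural? and skip_line? are invented here, and the keyword list is simply copied from the condition above.

```ruby
# Keywords copied from the long condition above; lines containing
# any of them are always kept. (Constant and method names are illustrative.)
KEEP_KEYWORDS = %w[node scope path define].freeze

# True when the line contains at least one of the keywords.
def structural?(line)
  KEEP_KEYWORDS.any? { |word| line.include?(word) }
end

# Truthy only for lines that were seen before AND contain no keyword,
# which is the same test as the chained && !(...) version.
def skip_line?(line, seen)
  seen[line] && !structural?(line)
end
```

In the script, the test would then read if skip_line?(line, seen), and adding another keyword means editing the list rather than extending the condition.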

2 thoughts on “Eliminate File Redundancy with Ruby”

  1. Nice to see you working in Ruby! One way to make that code even Rubier (?) would be to replace:

    if (seen[line])
    else

    with:

    unless seen[line]
