I have a Rich Text Format (RTF) file that contains data organized as an outline. I need to parse the file and extract its text, but the formatting matters just as much as the text: I need it to build data structures that reflect the relationships between the lines of the outline. For example, a line indented by one tab stop is a child of the most recent line that has no indentation.
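To make the goal concrete, here is a minimal Python sketch of the kind of structure I'm after. It assumes the indentation has already been recovered somehow as a list of (depth, text) pairs, where depth 0 means no indentation; the `Node` class and `build_outline` function are names I made up for illustration, not part of any library.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One line of the outline plus the lines nested beneath it."""
    text: str
    children: List["Node"] = field(default_factory=list)


def build_outline(lines):
    """Turn (depth, text) pairs into a tree; depth 0 means no indentation."""
    root = Node("<root>")
    # stack[0] is the root; stack[d + 1] is the most recent node at depth d.
    stack = [root]
    for depth, text in lines:
        # Assumption: a line can be at most one level deeper than the
        # line before it, so clamp anything deeper than that.
        depth = min(depth, len(stack) - 1)
        # Discard nodes at this depth or deeper; the new top is the parent.
        del stack[depth + 1:]
        node = Node(text)
        stack[-1].children.append(node)
        stack.append(node)
    return root


outline = build_outline([
    (0, "Fruit"),
    (1, "Apple"),
    (1, "Pear"),
    (0, "Vegetables"),
    (1, "Kale"),
])
```

With that input, "Apple" and "Pear" end up as children of "Fruit", and "Kale" as a child of "Vegetables". The hard part, of course, is getting the depth of each line out of the RTF in the first place.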
Given that RTF has been around forever, I figured that finding a good library to parse it would be pretty easy. Sadly, that’s not the case. I could only find two candidates: RTF::Parser and iText.
RTF::Parser was disqualified because it’s a Perl library and I haven’t written any Perl in years. And I’ve only just gotten over the experience.
iText is awesome. I’ve used it to generate PDFs before, and it works really well. Unfortunately, there’s no documentation for the RTF code. After playing around with it for about an hour, I had figured out how to pull the text out of the document, but not the formatting. Frustration ensued.
Fortunately, a little bird suggested that perhaps I could use a program that can read RTF and save it in a format that’s easier to parse. Also fortunately, the first such program I tried was OS X’s TextEdit. Not only can TextEdit read in RTF and spit out HTML, the HTML it produces is squeaky-clean and uses a tiny block of CSS to format the text rather than repeating the formatting over and over again throughout the entire document.
Why do I care? Well, now that the formatting information isn’t duplicated everywhere, the resulting HTML file is about 25% of the size of the RTF file. Another handy benefit of using CSS to do the styling is that the CSS classes used to apply the formatting effectively group the lines in the file by indentation level. So every line that’s indented by one tab stop has a CSS class of (for example) “p7”, and every line indented by two tab stops has a class of “p9”. That’s much less convoluted than trying to understand how RTF’s formatting works.
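Here is a rough sketch of how those classes could be turned into indentation levels, using only Python's standard `html.parser` module. The class-to-depth mapping is an assumption: “p7” and “p9” are the example classes mentioned above, “p5” for unindented lines is something I invented for the sample, and the real names will differ from document to document, so check the CSS block in the exported file. `ParagraphCollector` and `extract_lines` are likewise just illustrative names.

```python
from html.parser import HTMLParser

# Hypothetical mapping from CSS class to indentation depth. Read the real
# class names out of the <style> block of the exported HTML file.
CLASS_TO_DEPTH = {"p5": 0, "p7": 1, "p9": 2}


class ParagraphCollector(HTMLParser):
    """Collect (css_class, text) for every <p> element in the document."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_p = False
        self._cls = None
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._cls = dict(attrs).get("class")
            self._buf = []

    def handle_data(self, data):
        # Text inside nested tags such as <span> still arrives here.
        if self._in_p:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            text = "".join(self._buf).strip()
            if text:
                self.rows.append((self._cls, text))
            self._in_p = False


def extract_lines(html, class_to_depth):
    """Return (depth, text) pairs for paragraphs with a known class."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    return [(class_to_depth[c], t) for c, t in parser.rows if c in class_to_depth]


sample = (
    '<p class="p5">Fruit</p>\n'
    '<p class="p7"><span class="s1">Apple</span></p>\n'
    '<p class="p9">Granny Smith</p>\n'
)
lines = extract_lines(sample, CLASS_TO_DEPTH)
```

For the made-up sample, `lines` comes out as a list of (depth, text) pairs with "Fruit" at depth 0, "Apple" at depth 1, and "Granny Smith" at depth 2, which is exactly the shape needed to build the outline's parent/child relationships.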
Apple sometimes describes its mission as attempting to “surprise and delight” its customers. I am surprised. I am delighted. If only because this means I won’t have to write any Perl.