Writing Parsers is Easy

Warning! Some information on this page is older than 5 years now. I keep it for reference, but it probably doesn't reflect my current knowledge and beliefs.

Dec 2009

Using binary file formats is easier as data can be read and written almost directly as they are in memory, but sometimes we have to use some text formats. Whenever it is neither a standard format like XML or YAML, nor it is so simple we can load it with std::cin or scanf, we have to write a parser to read it. Writing a parser - code that reads and interprets some text format (e.g. file format or network protocol) looks like very complex task, but here I want to convince you that's not true - it is actually quite easy.

Why do I think so? It may look like it is a very long way from reading characters from a file till interpreting them as data records containing numbers, strings and other meaningful information. My answer for that is to split this problem into three layers.

The lowest layer is called CharReader. All it can do is reading single characters. It provides a layer of abstraction over reading data from a real source so you can forget whether you read from a file (hopefully with some buffering - never do file reads byte-after-byte!) or just iterate over a string fully loaded into memory. You can also forger about encoding so you see you data as characters, not bytes. It's a good idea to keep last read character somewhere so it can be "peeked" (read without losing it) many times before going to next one. The interface of a simple CharReader may look like this:

class CharReader {
  CharReader(const std::string &filePath);
  // Returns last read character as out.
  // If cursor at end, returns false.
  bool Peek(char &out);
  // Moves cursor one character forward.
  void Next();

Next layer is called Tokenizer. We need it because it's still complicated to interpret characters directly as some structured information. Data in almost every text format is made of multiple-character pieces of information like numbers, strings etc. so it's convenient to provide another layer of abstraction to see data as a sequence of tokens, not single characters. Here are some typical kinds of tokens that can be found also in programming languages:

Tokenizer uses CharReader to read chars and interprets these chars as tokens. Just like CharReader, it buffers and provides access to single token that was parsed recently so you can read it many times before anvancing to next one. It's also Tokenizer's responsibility to skip comments and whitespaces between tokens. Example interface may look like this:

class Tokenizer {
  Tokenizer(CharReader &cr);
  enum TOKEN_TYPE {
  // Returns type of the last read token.
  TOKEN_TYPE GetType();
  // Returns content of the last read token.
  const std::string & GetContent();
  // Moves cursor one token forward.
  void Next();

And finally, when being able to see file as made of tokens, it's easy to code the third layer - Parser - on top of that. Parser reads subsequent tokens and interprets them as some meaningful data. For example, you can parse a Key="Value" pair by expecting token of type identifier and saving its content as Key, then expecting the (=) symbol and finally expecting a string token and saving its content as Value. Of course you also have to find a way to report errors on all these stages, including information about current row and column.

Comments | #algorithms Share


[Download] [Dropbox] [pub] [Mirror] [Privacy policy]
Copyright © 2004-2022