
Wednesday, February 24, 2010

Parse CSV File With Boost Tokenizer In C++

Often the data an application needs is available in CSV-formatted files. In C++ it is easy to read a file line by line; all that is left is to extract the fields from each line and insert them into a data structure held in memory. Boost Tokenizer is a package that provides an easy way to break a string or sequence of characters into a sequence of tokens, and it provides the standard iterator interface to traverse those tokens. I will show a simple way of using Boost Tokenizer to parse data from a CSV file.

Boost provides tokenizers that are easy to construct and use. To set up a tokenizer you select one of the provided tokenizer functions and instantiate the tokenizer with the string to be parsed. You can then use the standard iterator interface to access the parsed tokens; the tokenizer and tokenizer function take care of parsing the string. Optionally you can use other standard algorithms that operate on iterators; the example below initializes a std::vector from the begin() and end() iterators.

A simple example that parses a CSV file into records:

#include <iostream>     // cout, endl
#include <fstream>      // fstream
#include <vector>
#include <string>
#include <algorithm>    // copy
#include <iterator>     // ostream_iterator

#include <boost/tokenizer.hpp>

int main()
{
    using namespace std;
    using namespace boost;

    string data("data.csv");

    ifstream in(data.c_str());
    if (!in.is_open()) return 1;

    typedef tokenizer< escaped_list_separator<char> > Tokenizer;

    vector< string > vec;
    string line;

    while (getline(in,line))
    {
        Tokenizer tok(line);
        vec.assign(tok.begin(),tok.end());

        // skip records that did not parse into at least three fields
        if (vec.size() < 3) continue;

        copy(vec.begin(), vec.end(),
             ostream_iterator<string>(cout, "|"));

        cout << "\n----------------------" << endl;
    }
}

First, the boost::tokenizer is set up using the boost::escaped_list_separator tokenizer function, which specifies how the string is parsed. By default it uses ',' as the field separator, '"' as the quote character and backslash as the escape character, which matches common CSV conventions.

typedef tokenizer< escaped_list_separator<char> > Tokenizer;
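
If your file uses a different delimiter, the separator can be constructed explicitly. A small sketch for a semicolon-separated file (the escaped_list_separator constructor takes the escape, separator and quote characters, in that order):

escaped_list_separator<char> sep("\\", ";", "\"");
Tokenizer tok(line, sep);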

Next, the tokenizer is initialized with each line read from the CSV file:

Tokenizer tok(line);

Now the tokens for one record are available via the begin() and end() iterators, and the std::vector is assigned the data from the parsed line:

vec.assign(tok.begin(),tok.end());

The vector now contains the parsed fields. The example dumps them to standard output using the copy algorithm and an ostream_iterator that pipes the data into cout, using the string "|" to separate tokens.

copy(vec.begin(),vec.end(),ostream_iterator<string>(cout,"|"));

Often it is desirable to perform basic checks on the data, such as verifying that each line was parsed properly by checking the number of fields extracted. This is easily done by checking the number of elements in the vector; the example skips any record that has fewer than three fields:

if (vec.size() < 3) continue;

Compiling With Boost Tokenizer

To compile you need to add -I/usr/local/include/boost-1_42/ to your compile flags so that the compiler can find the Boost headers. No library needs to be linked; Boost Tokenizer is header-only.
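
For example, with g++ (the source file name parse_csv.cpp is just an assumed example):

g++ -I/usr/local/include/boost-1_42/ parse_csv.cpp -o parse_csv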

Iterating Over Tokens Using Standard Iterator Interface

You can also use the standard iterator interface directly to access the tokens as they are parsed:

vector< string > vec;
vec.clear();
Tokenizer tok(line);
for (Tokenizer::iterator it(tok.begin()),
                         end(tok.end());
     it != end; ++it)
{
    vec.push_back((*it));
}

Trim Strings

If the CSV file includes spaces between delimiters and values, the extracted tokens will contain those extra spaces. We can apply trim from the Boost String Algorithms library to remove whitespace from the front and back of a string:

#include <boost/algorithm/string/trim.hpp>
trim(vec[0]);
trim(vec[1]);
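
To trim every field instead of naming each one, the same call can be applied in a loop (a small sketch):

// trim leading and trailing whitespace from every extracted field
for (vector<string>::iterator it(vec.begin()); it != vec.end(); ++it)
{
    trim(*it);
}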

Store Data In Boost Bimap

In a previous post I showed how to use boost::bimap to keep a bidirectional map between two unique sets of values. We now have a way to extract data from a CSV file and insert it into that data structure for lookup:

string data("map.csv");

ifstream in(data.c_str());
if (!in.is_open()) return 1;

using namespace boost::bimaps;
typedef bimap< unordered_set_of< string >,
               unordered_set_of< string > > symbol_map_type;

symbol_map_type m_symbol_map;

typedef tokenizer< escaped_list_separator<char> > Tokenizer;

vector< string > vec;
string line;

while (getline(in,line))
{
    vec.clear();

    Tokenizer tok(line);
    vec.assign(tok.begin(),tok.end());

    if (vec.size() < 2) continue;

    trim(vec[0]);
    trim(vec[1]);

    m_symbol_map.insert( symbol_map_type::value_type(vec[0],
                                                     vec[1]) );
}

Now we can access the values in both directions. Key to value:

symbol_map_type::left_map& map_view = m_symbol_map.left;
for (symbol_map_type::left_map::iterator it(map_view.begin()),
                                         end(map_view.end());
     it != end; ++it)
{
    cout << "[" << (*it).first
         << "] - [" << (*it).second << "]" << endl;
}

And the reverse, value to key:

symbol_map_type::right_map& map_view = m_symbol_map.right;
for (symbol_map_type::right_map::iterator it(map_view.begin()),
                                          end(map_view.end());
     it != end; ++it)
{
    cout << "[" << (*it).first
         << "] - [" << (*it).second << "]" << endl;
}

See my previous post on searching boost::bimap data structures.
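
As a quick sketch (not from that post), the left view also supports map-style lookup with find():

// look up a key in the left view; with unordered_set_of this behaves
// like a lookup in an unordered map
symbol_map_type::left_map::const_iterator pos(m_symbol_map.left.find("ABC"));
if (pos != m_symbol_map.left.end())
    cout << pos->first << " maps to " << pos->second << endl;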

Data And Output

The data.csv file used is a slightly modified version of the file from the boost::tokenizer example. Note that the second line from the bottom does not appear in the output because of the check for at least three fields per record:

Field 1,Field 2,Field 3
Field 1,"Field 2, with comma",Field 3
Field 1,Field 2 with \"embedded quote\",Field 3
Field 1, Field 2 with \n new line,Field 3
Field 1, Field 2 with embedded \\ ,Field 3
Field 1, Field 2 with missing third field so it is skipped and will not appear in the output
Field 11, ,,Field 33

Output:

Field 1|Field 2|Field 3|
----------------------
Field 1|Field 2, with comma|Field 3|
----------------------
Field 1|Field 2 with "embedded quote"|Field 3|
----------------------
Field 1| Field 2 with
 new line|Field 3|
----------------------
Field 1| Field 2 with embedded \ |Field 3|
----------------------
Field 11| ||Field 33|
----------------------

Data For Map

Example data file map.csv used with boost::bimap. Note the extra space after each comma:

ABC, cba.abc.cba
EFG, gfe.efg.gfe
HIJ, jih.hij.jih
KLM, mlk.klm.mlk
NOP, pon.nop.pon

Trimmed data output by iterating boost::bimap in both directions:

[EFG] - [gfe.efg.gfe]
[NOP] - [pon.nop.pon]
[KLM] - [mlk.klm.mlk]
[ABC] - [cba.abc.cba]
[HIJ] - [jih.hij.jih]

The other way ...

[gfe.efg.gfe] - [EFG]
[cba.abc.cba] - [ABC]
[pon.nop.pon] - [NOP]
[jih.hij.jih] - [HIJ]
[mlk.klm.mlk] - [KLM]

I have shown an easy way to parse CSV data with boost::tokenizer and how to insert the data into boost::bimap. Enjoy parsing CSV files.

4 comments:

  1. I wonder whether these tools are capable of parsing lines with fields including unescaped newlines. For example:

    Name;Address;Sport
    Joe Smith;"101 Main Street
    Springfield, Anystate";Basketball
    Will Brown;;Baseball

  2. The code above will not be able to parse an embedded new line in a field, as shown in the first record of your example.

    This is not an issue with boost::tokenizer; you can specify ';' as the delimiter.

    The issue is that the code above assumes records are stored one per line, so a line at a time is read and parsed.

    The reading code could be adjusted to skim through each line, check whether there is a new line inside a quoted string, and keep reading lines from the file until the whole field with embedded new lines has been read, as sketched below.
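
    A minimal sketch of that idea (it ignores escaped quotes and assumes '"' is the quote character):

    #include <algorithm>   // count
    #include <istream>
    #include <string>

    // Read one logical record: keep appending physical lines while the
    // accumulated text contains an odd number of '"' characters, i.e.
    // while a quoted field is still open.
    bool getrecord(std::istream& in, std::string& record)
    {
        std::string line;
        record.clear();
        while (std::getline(in, line))
        {
            record += line;
            if (std::count(record.begin(), record.end(), '"') % 2 == 0)
                return true;   // quotes balanced, record complete
            record += "\n";    // preserve the embedded new line
        }
        return !record.empty();   // return any trailing partial record
    }

    The main loop would then call getrecord instead of getline.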

  3. I have added another post that shows one way of dealing with records that have embedded line breaks and a semicolon separator: http://mybyteofcode.blogspot.com/2010/11/parse-csv-file-with-embedded-new-lines.html

  4. Thanks! This was very helpful. I had another method, but it didn't like zero-length fields (i.e. commas with nothing between) and was slow. This is faster and handles it all.
