Tuesday, March 10, 2009

Reading CSV file in Regex

Csv filws come in many shapes and forms, and its trivial to read most a csv files. But then again it can be quite headache when the csv file has a mixture of formats and control characters.

The simplest way to parse a csv file it to use a simple regexp. One might start with something as simple as this:
split(/,/, $line);
But of course this is to simple. It will split on every coma and thus not work for a line like this:
123,asd,"123, asd"
Similarly the following will work:
split(/","/, $line);
"123","asd","123, asd"
But what if we don't have control over the quoter and a mixture of both is generated? And furthermore what if the fields can contain quotes and comas themselves?


In my case I have to parse a file that is quoted as such:
005101,"LITERATURE-P/S, WARRANTY","4,345,211.0000",0.0292"70P"
To phase this the regexp is a bit more complicated:
@line = split(/,(?!(?:[^",]|[^"],[^"])+")/, $line);
for $item (@line)
{
$item[$c] =~ s/^"(.*)"$/$1/;
}
In this case we first split the string on any coma but look ahead each time to check and see if the current coma is part of a quoted literal. If that is the case then the coma is skipped over. Next we look through each item and remove the quotation's if they are present.

This previous example is almost perfect but there is one more problem. Try the next item and it will parse wrong.
005101,"LITERATURE-P/S WARRANTY,","4,345,211.0000",0.0292,"70P"

Note the extra coma at the very end of the literals. This trips up the regexp logic. The simplest solution I found to this problem was to just add a extra space in this case and then trim it at the end.
$line =~ s/,",/, ",/g;
@line = split(/,(?!(?:[^",]|[^"],[^"])+")/, $line);
for $item (@line)
{
$line[$c] =~ s/, $/,/;
$item[$c] =~ s/^"(.*)"$/$1/;
}

The final script can be found below. The script included does a bit more then just phase a csv, it writes it back out as a fixed width file. This can be easily changed to just store the data in an array.
csvtofixed.pl
#!/usr/bin/perl

# Input File.
$in = shift;
# Output File.
$out = shift;
# Fixed width padding.
@size = split(/,/, shift);

if(!$in || !$out || $#size == -1)
{
print "usage: csvtofixed input output columns\n";
print "\tinput: Filename to read in.\n";
print "\toutput: Filename to write out to, will overwrite.\n";
print "\tcoumns: Field widths to pad input fields to. Example: 20,15,3,10\n";
exit;
}

if($in =~ $out)
{
print "Aborted: Can't use same input and output file. Please use a temparay $
exit;
}

# Open Files.
open(IN, $in) or die "Can't open input file $in";
open(OUT, ">$out") or die "Can't create output file $out";

# While there is input.
while(&th;in>)
{
# Read the next line.
$line = $_;

# Trim off the end.
$line =~ s/\r\n//;
# Fix for coma as the last char in a quote bug.
$line =~ s/,",/, ",/g;

# Split the line into its parts.
@line = split(/,(?!(?:[^",]|[^"],[^"])+")/, $line);

# For each column.
for($c=0;$c<=$#size;$c++)
{
# Trim and quoted fields.
$line[$c] =~ s/^"(.*)"$/$1/;
# Remove the extra space for the coma fix.
$line[$c] =~ s/, $/,/;

# Print out the field.
printf OUT "%*.*s|", $size[$c], $size[$c], $line[$c];
}

# Finish the line.
print OUT "\r\n";
}

# Close the files.
close(IN);
close(OUT);

exit;

No comments: