The simplest way to parse a CSV file is to use a regexp. One might start with something as simple as this:
split(/,/, $line);
But of course this is too simple. It will split on every comma and thus will not work for a line like this:
123,asd,"123, asd"
Similarly, the following will work when every field is quoted:
split(/","/, $line);
"123","asd","123, asd"
But what if we don't have control over the quoting and a mixture of both styles is generated? And furthermore, what if the fields can contain quotes and commas themselves?
In my case I have to parse a file that is quoted as such:
005101,"LITERATURE-P/S, WARRANTY","4,345,211.0000",0.0292,"70P"
To parse this, the regexp is a bit more complicated:
@line = split(/,(?!(?:[^",]|[^"],[^"])+")/, $line);
In this case we split the string on every comma, but use a negative lookahead each time to check whether the current comma is part of a quoted literal. If it is, the comma is skipped. Next we loop through each item and remove the quotation marks if they are present.
for $item (@line)
{
    $item =~ s/^"(.*)"$/$1/;
}
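As a quick check of the two steps above, here is a minimal sketch that runs both the plain comma split and the lookahead split on the sample line. The helper name `parse_line` is an assumption made for illustration; it is not part of the final script.

```perl
use strict;
use warnings;

my $line = '005101,"LITERATURE-P/S, WARRANTY","4,345,211.0000",0.0292,"70P"';

# Plain split: every comma is a separator, including those inside quotes.
my @naive = split(/,/, $line);

# parse_line is a hypothetical helper name used only in this sketch.
sub parse_line {
    my ($text) = @_;
    # Split only on commas that are not followed by text running up to a closing quote.
    my @fields = split(/,(?!(?:[^",]|[^"],[^"])+")/, $text);
    # The loop variable aliases each element, so the quotes are stripped in place.
    for my $item (@fields) {
        $item =~ s/^"(.*)"$/$1/;
    }
    return @fields;
}

my @fields = parse_line($line);
print scalar(@naive), " fields from the plain split\n";
print scalar(@fields), " fields from the lookahead split:\n";
print "  $_\n" for @fields;
```

On this sample the plain split yields 8 pieces, while the lookahead version yields the 5 intended fields.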
The previous example is almost perfect, but there is one more problem. Try the next line and it will be parsed incorrectly.
005101,"LITERATURE-P/S WARRANTY,","4,345,211.0000",0.0292,"70P"
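Note the extra comma at the very end of the quoted literal. This trips up the regexp logic: the comma is followed directly by the closing quote, and the lookahead needs at least one character between the comma and the quote to recognize the comma as part of the literal. The simplest solution I found to this problem was to add an extra space in this case and then trim it at the end.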
$line =~ s/,",/, ",/g;
@line = split(/,(?!(?:[^",]|[^"],[^"])+")/, $line);
for $item (@line)
{
    # Strip the surrounding quotes first, then drop the space added above.
    $item =~ s/^"(.*)"$/$1/;
    $item =~ s/, $/,/;
}
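To make the effect of the workaround visible, the following sketch splits the problem line once without the extra space and once with it. The variable names are chosen for illustration only.

```perl
use strict;
use warnings;

my $line = '005101,"LITERATURE-P/S WARRANTY,","4,345,211.0000",0.0292,"70P"';
my $re   = qr/,(?!(?:[^",]|[^"],[^"])+")/;

# Without the workaround, the comma just before the closing quote is
# treated as a separator, so the quoted field is cut apart.
my @broken = split($re, $line);
print scalar(@broken), " fields without the fix\n";

# With the workaround: pad, split, strip quotes, then remove the padding.
(my $padded = $line) =~ s/,",/, ",/g;
my @fixed = split($re, $padded);
for my $item (@fixed) {
    $item =~ s/^"(.*)"$/$1/;
    $item =~ s/, $/,/;
}
print scalar(@fixed), " fields with the fix\n";
print "  $_\n" for @fixed;
```

Without the fix the line falls apart into 6 pieces; with it, the 5 fields come back and the second one keeps its trailing comma.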
The final script can be found below. It does a bit more than just parse a CSV file: it writes the data back out as a fixed-width file. This can easily be changed to just store the data in an array.
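The fixed-width output in the script relies on the `%*.*s` conversion of printf, where both the minimum width and the maximum length (the precision) are read from the argument list. A minimal sketch of that behaviour, with a width of 5 and sample strings chosen purely for illustration:

```perl
use strict;
use warnings;

# Width and precision are both taken from the argument list via '*'.
my $short = sprintf("%*.*s|", 5, 5, "abc");       # shorter than 5: padded on the left
my $long  = sprintf("%*.*s|", 5, 5, "abcdefgh");  # longer than 5: cut to 5 characters
print "$short\n$long\n";
```

Short values are right-justified by default; adding the `-` flag, as in `%-*.*s`, would left-justify them instead.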
csvtofixed.pl
#!/usr/bin/perl
# Input file.
$in = shift;
# Output file.
$out = shift;
# Fixed-width padding: one width per column.
@size = split(/,/, shift);
if(!$in || !$out || $#size == -1)
{
    print "usage: csvtofixed input output columns\n";
    print "\tinput: Filename to read in.\n";
    print "\toutput: Filename to write out to, will overwrite.\n";
    print "\tcolumns: Field widths to pad input fields to. Example: 20,15,3,10\n";
    exit;
}
# Compare the filenames as strings rather than as a regexp match.
if($in eq $out)
{
    print "Aborted: Can't use same input and output file. Please use a temporary file.\n";
    exit;
}
# Open files.
open(IN, $in) or die "Can't open input file $in";
open(OUT, ">$out") or die "Can't create output file $out";
# While there is input.
while(<IN>)
{
    # Read the next line.
    $line = $_;
    # Trim off the line ending (CRLF or LF).
    $line =~ s/\r?\n$//;
    # Fix for the comma-as-last-character-in-a-quote bug.
    $line =~ s/,",/, ",/g;
    # Split the line into its parts.
    @line = split(/,(?!(?:[^",]|[^"],[^"])+")/, $line);
    # For each column.
    for($c=0;$c<=$#size;$c++)
    {
        # Strip the quotes from quoted fields.
        $line[$c] =~ s/^"(.*)"$/$1/;
        # Remove the extra space added for the comma fix.
        $line[$c] =~ s/, $/,/;
        # Print out the field, padded or truncated to the column width.
        printf OUT "%*.*s|", $size[$c], $size[$c], $line[$c];
    }
    # Finish the line.
    print OUT "\r\n";
}
# Close the files.
close(IN);
close(OUT);
exit;
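For the case where the data should simply end up in an array, the parsing steps can be collected into a helper that returns one array reference per line. This is only a sketch: `parse_csv_lines` is a hypothetical name, and it takes lines as arguments instead of a filename so that it can be run without an input file.

```perl
use strict;
use warnings;

# parse_csv_lines is a hypothetical helper, not part of csvtofixed.pl.
sub parse_csv_lines {
    my @rows;
    for my $line (@_) {
        # Work on a copy and trim the line ending (CRLF or LF).
        (my $copy = $line) =~ s/\r?\n$//;
        # Workaround for a comma as the last character inside quotes.
        $copy =~ s/,",/, ",/g;
        my @fields = split(/,(?!(?:[^",]|[^"],[^"])+")/, $copy);
        for my $item (@fields) {
            $item =~ s/^"(.*)"$/$1/;   # strip surrounding quotes
            $item =~ s/, $/,/;         # undo the padding added above
        }
        push @rows, \@fields;
    }
    return @rows;
}

my @rows = parse_csv_lines(
    qq{005101,"LITERATURE-P/S, WARRANTY","4,345,211.0000",0.0292,"70P"\r\n},
    qq{005101,"LITERATURE-P/S WARRANTY,","4,345,211.0000",0.0292,"70P"\r\n},
);
print join(' | ', @$_), "\n" for @rows;
```

Each row is then available as `$rows[$n]`, and a single field as `$rows[$n][$c]`.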