March 19, 2010

Use gawk to convert fixed width files to delimited

I had a bunch of fixed-width text files that I needed to be convert to CSV. Each file was different and determining the start and end of each field would be time consuming so this needed to be done dynamically. Fortunately the second line of each file could be used to do exactly that and I was able to run the script unmodified against each file.

An example of the input file:
empno ,ename    ,job        ,hiredate  ,sal
------,---------,-----------,----------,---------
7782  ,CLARK    ,MANAGER    ,09-JUN-81 ,2450.00
7839  ,KING     ,PRESIDENT  ,17-NOV-81 ,5000.00
7934  ,MILLER   ,CLERK      ,23-JAN-82 ,1300.00

The script looks at the second line of each file (i.e. the hyphenated line) to dynamically determine the start and end of each field:
#!/usr/local/bin/gawk -f
BEGIN {
 FS=","
 RS="\n"
 OFS=","
}

# Determine fixed length positions of each field by
# looking at second record (i.e. header row) and output
# each field separated by chosen delimiter.
NR==1{
 widthstring = (length($1)+1)
 for(i=2; i < NF; i++) {
    widthstring = widthstring " " (length($i)+1)
    if(i!=(NF-1))
     printf("%s",OFS ltrim(rtrim(rmlastcomma($i))))
 }
 i=NF-1
 print(OFS ltrim(rtrim(rmlastcomma($i))))
 FIELDWIDTHS = widthstring
}

# Now read from line 3 (skipping first two rows) using newly
# determined field lengths and remove whitespaces. Output
# each field separated by chosen delimiter.
NR>2{
 for(i=2; i < NF; i++)
    printf("%s",OFS ltrim(rtrim(rmlastcomma($i))))
 print(OFS ltrim(rtrim(rmlastcomma($NF))))
}

function ltrim(v) { gsub(/^[ ]+/, "", v); return v}
function rtrim(v) { gsub(/[ ]+$/, "", v); return v}
function rmlastcomma(v) { sub(/[,]+$/, "", v); return v}

The resulting output will be a regular CSV file:
7782,CLARK,MANAGER,09-JUN-81,2450.00
7839,KING,PRESIDENT,17-NOV-81,5000.00
7934,MILLER,CLERK,23-JAN-82,1300.00

Example command  to start gawk in Windows (reading in all files):
for %n in (*.txt) do gawk -f fixed2csv.awk %n >%n.csv

Yes, I did this in Windows!