How to Parse CSV Data with Line Breaks within a field value using Groovy

Document created by Adam Arrowsmith Employee on Oct 26, 2011
You need to map a CSV file with values containing line breaks. For example:

Name,Address,City,State
"Boomi Inc.","First Ave","Philadelphia","PA"
"Customer ABC","123 Main St
Suite 120","San Francisco","CA"


The AtomSphere data parser will not recognize the line break contained within the second record's "Address" value and will not map properly. Use this script to temporarily massage the data before mapping or performing other steps that require parsing the data using a Profile.
Copy and paste the script below into a Data Process - Custom Scripting step before your map. The script parses the data and replaces any line break that appears within a text-qualified field with a placeholder value of your choosing (e.g. "%BREAK%"). After the map, use another Data Process - Search/Replace step to restore the original line breaks. Note that the CSV data fields must be text qualified.
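To make the round trip concrete, here is a minimal sketch of what the masked second record looks like and how the placeholder maps back to a line break. In a real process the Search/Replace step does the restore. The helper name restoreLineBreaks is hypothetical and only illustrates the substitution.

```groovy
// Hypothetical helper (not part of the original script): swaps the
// placeholder back to a real line break using a literal, non-regex replace.
String restoreLineBreaks(String text, String replaceNewline, String lineBreak) {
    return text.replace(replaceNewline, lineBreak)
}

// The second record from the example, as it looks after masking.
String masked = '"Customer ABC","123 Main St%BREAK%Suite 120","San Francisco","CA"'

// Prints the record with the Address value split across two lines again.
println restoreLineBreaks(masked, '%BREAK%', '\n')
```

Choose a placeholder that cannot occur in the real data, otherwise the restore step would insert line breaks that were never there.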

Note the three variables at the beginning of the script. These can be modified if necessary.

String delimiter = ",";
String textQualifier = "\"";
String replaceNewline = "%BREAK%";


Code:

// This script parses the CSV data to replace any line breaks within a given field. 

import java.util.Properties;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.BufferedReader;
import java.io.Writer;
import java.io.OutputStreamWriter;
import com.boomi.util.TempOutputStream;

String LINE_SEPARATOR = System.getProperty("line.separator");

String delimiter = ",";
String textQualifier = "\"";
String replaceNewline = "%BREAK%";

for( int i = 0; i < dataContext.getDataCount(); i++ ) {
    InputStream is = dataContext.getStream(i);
    Properties props = dataContext.getProperties(i);

    BufferedReader reader = new BufferedReader(new InputStreamReader(is));
    String lineRead = reader.readLine();

    // Carried across lines: true while a text qualified field is still open
    boolean inTextQualifiedField = false;
    int index = 0;

    TempOutputStream tmpOut = new TempOutputStream();
    Writer output = new OutputStreamWriter(tmpOut);

    while( lineRead != null ) {
        boolean start = true;
        while( index >= 0 ) {
            if( inTextQualifiedField ) {
                // Look for the closing text qualifier
                index = lineRead.indexOf(textQualifier, index);
                if( index > -1 ) {
                    index += textQualifier.length();
                    inTextQualifiedField = false;
                } else {
                    // Hit the end of line (i.e. newline in text qualified field).
                    // Add replaceNewline and continue.
                    output.write(lineRead);
                    output.write(replaceNewline);
                }
            } else {
                if( start ) {
                    // Check beginning of line for text qualifier
                    if( lineRead.indexOf(textQualifier, index) == 0 ) {
                        inTextQualifiedField = true;
                        index += textQualifier.length();
                    }
                    start = false;
                    continue;
                }

                index = lineRead.indexOf(delimiter, index);
                if( index > -1 ) {
                    // Found delimiter, check next character
                    index += delimiter.length();

                    if( lineRead.indexOf(textQualifier, index) == index ) {
                        // Next character is textQualifier
                        inTextQualifiedField = true;
                        index += textQualifier.length();
                    }
                }
            }
        }

        // Line ended outside a text qualified field: keep the real line break
        if( !inTextQualifiedField ) {
            output.write(lineRead);
            output.write(LINE_SEPARATOR);
        }
        lineRead = reader.readLine();
        index = 0;
    }
    output.flush();
    dataContext.storeStream(tmpOut.toInputStream(), props);
}
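
To try the masking idea outside AtomSphere, where dataContext is not available, the following is a minimal standalone sketch that works on an in-memory String. It is not the script above: it uses a simpler technique that toggles an "inside a qualified field" flag on every text qualifier it meets, rather than checking for a qualifier only at the start of a line or right after a delimiter. The function name maskLineBreaks is hypothetical, and the sketch assumes a non-empty text qualifier.

```groovy
// Hypothetical standalone helper for local experiments; assumes textQualifier
// is non-empty. Every qualifier flips the "inside a qualified field" flag.
String maskLineBreaks(String csv, String textQualifier, String replaceNewline) {
    StringBuilder out = new StringBuilder()
    boolean inQualifiedField = false
    int i = 0
    while (i < csv.length()) {
        if (csv.startsWith(textQualifier, i)) {
            // Opening or closing qualifier: flip the state and copy it through.
            inQualifiedField = !inQualifiedField
            out.append(textQualifier)
            i += textQualifier.length()
        } else if (csv.startsWith('\r\n', i)) {
            // Windows line break: mask inside a field, otherwise emit \n.
            out.append(inQualifiedField ? replaceNewline : '\n')
            i += 2
        } else if (csv.startsWith('\n', i) || csv.startsWith('\r', i)) {
            // Single-character line break, same rule as above.
            out.append(inQualifiedField ? replaceNewline : '\n')
            i += 1
        } else {
            // Any other character is copied unchanged.
            out.append(csv.charAt(i))
            i += 1
        }
    }
    return out.toString()
}

String sample = 'Name,Address,City,State\n' +
        '"Customer ABC","123 Main St\nSuite 120","San Francisco","CA"\n'
println maskLineBreaks(sample, '"', '%BREAK%')
```

Because each qualifier flips the flag, a doubled qualifier inside a field (such as "") flips it twice and leaves the state unchanged. Note that this sketch normalizes record-ending line breaks to \n, whereas the script above writes the platform line separator.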
