Possible newlineFailsafe feature? #51

Open
cmangani opened this issue Oct 1, 2018 · 0 comments

cmangani commented Oct 1, 2018

Hi,

Great library. Just wanted to let you know about an issue I ran into. SFDC has a non-compliant CSV export (from its bulk export tool), which produces rows like the "00T1p00002Pq1pREAR" row below. This causes your parser to stay in the _isQuoted state as it flows through potentially hundreds of records, until it encounters another quote that ends this state. For instance, if you save the sample below as a CSV file, the 4 rows after 00T1p00002Pq1pREAR don't get parsed.

Our workaround was a local version of your csv-streamify with the change below. It is just a sort of "escape hatch". Using the existing csv-streamify, out of 82 million SFDC "task" records we only processed 76 million; with my code change we processed 81,994,204 records thanks to the "recovery" code.

I know this doesn't work in all cases (e.g. a String field whose contents just happen to match my regex, which is very improbable in our use case). Perhaps you have a real fix for the issue, but I figured I'd drop a line.

Thanks again for the great work!

My Changed Code in csv-streamify

/*
...
 * @param {boolean} [opts.columns=false] Whether to parse headers
 * @param {object} [opts.newlineRegexFailsafe] If the parser gets tripped up, detect the start of a new line with a regex. Fields: maxReadAheadLength and regex
 * @param {function} [cb] Callback function
 */

...

// newline
if (!state._isQuoted && (c === opts.newline || c === opts.newline[0])) {
  state._newlineDetected = true
  queue(c)
  continue
}

// added code: failsafe for a newline seen while still in the _isQuoted state
if (opts.newlineRegexFailsafe && state._isQuoted && (c === opts.newline || c === opts.newline[0])) {
  // find the next delimiter, then apply the regex to see if we are at a newline
  var buff = [];
  var z = 1;
  var nxtChar;
  while (
    (!opts.newlineRegexFailsafe.maxReadAheadLength || z < opts.newlineRegexFailsafe.maxReadAheadLength) // restrict how many characters to read ahead (performance optimization)
    && i + z < data.length // don't read past end of input
    && (nxtChar = data.charAt(i + z)) != opts.delimiter) { // read up until the next encounter of delimiter
    buff.push(nxtChar);
    z++;
  }
  var succeedingCharacters = buff.join("");
  if (succeedingCharacters.match(opts.newlineRegexFailsafe.regex)) {
    emitLine(this)
    continue
  }
}
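
For anyone who wants to try the read-ahead check outside the parser, here is a minimal standalone sketch of the same idea. The function name `looksLikeRecordStart`, the shortened prefix list, and the sample string are made up for illustration; none of this is part of csv-streamify.

```javascript
// Hypothetical helper, not part of csv-streamify.
// Returns true when the text between the newline at index `i` and the next
// delimiter matches `failsafe.regex`, i.e. it looks like the start of a record.
function looksLikeRecordStart (data, i, delimiter, failsafe) {
  var limit = failsafe.maxReadAheadLength || Infinity
  var ahead = ''
  // z starts at 1 to skip the newline character itself
  for (var z = 1; z < limit && i + z < data.length; z++) {
    var next = data.charAt(i + z)
    if (next === delimiter) break
    ahead += next
  }
  return new RegExp(failsafe.regex).test(ahead)
}

// Shortened prefix list, for illustration only
var failsafe = {
  regex: '^(00T|001|003)[a-zA-Z0-9]{15}$',
  maxReadAheadLength: 20
}

// Made-up input: an unterminated quote, then a line starting with an 18-character Id
var sample = 'x,"broken\n00T1p00002Pq1pSEAR,next\nplain text,more'
var firstNewline = sample.indexOf('\n')
var secondNewline = sample.indexOf('\n', firstNewline + 1)

console.log(looksLikeRecordStart(sample, firstNewline, ',', failsafe))  // true
console.log(looksLikeRecordStart(sample, secondNewline, ',', failsafe)) // false
```

Because the regex is anchored with `$`, a first field longer than the read-ahead window simply fails to match, so the check stays conservative.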

My Calling App

const csv = require('csv-streamify');
const input = 'subset.csv';
const fs = require('fs');
// Failsafe for malformed, multi-quoted fields: if the parser is in the _isQuoted state and reaches a newline
// followed by 18 characters whose first 3 match a certain SFDC Id convention (e.g. 00T),
// we consider this the new line. The previous record that we emit will be incomplete (i.e. it will
// not contain all of the columns, as it is caught up in the multi-quote column). It is left to the caller
// to check for the correct number of columns, and to discard or otherwise deal with the errant row.
const parser = csv({"newlineRegexFailsafe" : {"regex" : "^(00T|001|003|005|a21|801|006|00U|a25)[a-zA-Z0-9]{15}$", "maxReadAheadLength" : 20}});
//const parser = csv()// test with this one.. you will see it fails on the 00T1p00002Pq1pREAR record

// emits each line as a buffer or as a string representing an array of fields
var idx = 0;
parser.on('data', function (values) {
  console.log(idx + ":" + values[0] + ":" + values.length);
  // NOTE: this is a "Task", 79 columns.
  // if values.length == 79: probably a good record (perhaps add a simple heuristic to verify the expected contents of a few columns)
  // if values.length < 79: a victim of the double-double-quote issue; probably easier to discard (and log) the record than to try to recover it
  // if values.length > 79: a rare/unfortunate victim, in that the parser's chunk boundary straddled the maxReadAheadLength
  // string, so the parser ran on into the next record. NOTE: at most we will lose 2 records to the double-double-quote issue, as the code
  // reads in the next chunk and continues through the next record (making values.length > 79) in the _isQuoted state
  // until it again encounters the newlineRegexFailsafe regex on the subsequent record.
  idx++;
});

//001, 003, 00U, 006, 801, a21, a25, 00T, 00500T1p00002Pq1pREAR
fs.createReadStream(input, {start: 0}).pipe(parser);
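
The column-count check described in the comments above could be sketched roughly like this. `classifyRow`, `tally`, and the label names are hypothetical; the 79 comes from the "Task" export described above and would differ for other objects.

```javascript
// Hypothetical caller-side helper, not part of csv-streamify.
var EXPECTED_COLUMNS = 79 // column count of the "Task" export, per the comments above

// Label a parsed row by comparing its column count with the expected one.
function classifyRow (values, expected) {
  if (values.length === expected) return 'ok'        // probably a good record
  if (values.length < expected) return 'truncated'   // cut short by the failsafe
  return 'overrun'                                   // ran on into the next record
}

// Running totals, so discarded rows can be logged and counted.
var counts = { ok: 0, truncated: 0, overrun: 0 }
function tally (values) {
  counts[classifyRow(values, EXPECTED_COLUMNS)]++
}

// Made-up rows standing in for parser output
tally(new Array(79).fill(''))
tally(new Array(12).fill(''))
tally(new Array(150).fill(''))
console.log(counts) // { ok: 1, truncated: 1, overrun: 1 }
```

In the calling app, `tally(values)` would go inside the `parser.on('data', ...)` handler, with only the `'ok'` rows passed on for further processing.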

Partial CSV Export from SFDC

00T1p00002Pq1pOEAR,0032400000pPGSuAAO,0012400000poV9uAAE,E-mailed - Anord,2017-03-01,Completed,Normal,false,005U0000000OflxIAC,,Attempted Contact,false,0012400000poV9uAAE,true,2017-03-01T14:55:03.000+0000,005U0000000OflxIAC,2017-03-01T14:55:03.000+0000,005U0000000OflxIAC,2018-03-14T16:47:33.000+0000,false,false,,,,,,false,,false,,,,,,,,,,,Task,Attempted Contact,false,,,,,R.348341659,,Non-Target,,,,Attempted Contact,,false,,0.0,,,,,,,,,,,,,,,,,R.348341659,,,,,
00T1p00002Pq1pREAR,0032400000pPGSuAAO,0012400000poV9uAAE,Roofbuilders,2018-03-07,Completed,Normal,false,0051p000008bcDxAAI,"""",Call,false,0012400000poV9uAAE,true,2018-03-07T17:50:08.000+0000,0051p000008bcDxAAI,2018-03-08T09:39:04.000+0000,0051p000008bcDxAAI,2018-03-14T16:47:33.000+0000,false,false,,,,,,false,,false,,,,,,,,,,,Task,Call,false,,,,,R.412305459,,Non-Target,"ONS-Richmond, VA-00494",00eU0000000eAbrIAE,Aerotek Chesapeake,Call,,true,2018-03-08T09:39:04.000+0000,0.0,,,,,,,,,,,,,,,,,R.412305459,,,,,
00T1p00002Pq1pSEAR,0032400000pPGSuAAO,0012400000poV9uAAE,Left Message & E-mailed - Roofbuilders,2018-03-08,Completed,Normal,false,0051p000008bcDxAAI,#NIS,Attempted Contact,false,0012400000poV9uAAE,true,2018-03-08T09:39:03.000+0000,0051p000008bcDxAAI,2018-03-08T09:39:03.000+0000,0051p000008bcDxAAI,2018-03-14T16:47:33.000+0000,false,false,,,,,,false,,false,,,,,,,,,,,Task,Attempted Contact,false,,,,,R.412377485,,Non-Target,"ONS-Richmond, VA-00494",00eU0000000eAbrIAE,Aerotek Chesapeake,Attempted Contact,,false,,0.0,,,,,,,,,,,,,,,,,R.412377485,,,,,
00T1p00002Pq1pWEAR,0032400000pPGSuAAO,0012400000poV9uAAE,Left Message - Attempted Contact,2017-07-28,Completed,Normal,false,005U0000004YtsvIAC,,Attempted Contact,false,0012400000poV9uAAE,true,2017-07-28T16:33:47.000+0000,005U0000004YtsvIAC,2017-07-28T16:33:47.000+0000,005U0000004YtsvIAC,2018-03-14T16:47:33.000+0000,false,false,,,,,,false,,false,,,,,,,,,,,Task,Attempted Contact,false,,,,,R.374017937,,Non-Target,"ONS-Fairfax, VA - P&I-00422",00eU0000000eAbrIAE,Aerotek Chesapeake,Attempted Contact,,false,,0.0,,,,,,,,,,,,,,,,,R.374017937,,,,,
00T1p00002Pq1pXEAR,0032400000pPGSuAAO,0012400000poV9uAAE,Candidate Summary/G2 Edited by Joseph Henry Breithaupt,2017-07-05,Completed,Normal,false,00524000003MqTYAA0,"really nice guy, jumpy between contracts, worked through people solutions from 2012-14, ennis flint left because he got his bachelors degree and got offered the position at alloy polymers, he is interested in getting more hands on with PLC work or a higher paying maintenance engineer role, he is working a split shift at alloy which he hates (comes in for the morning, leaves and comes back for the evening) he is currently making 28/hr but would be interested in 60k and up because he does get overtime, sending him job descriptions for foley and sabra",G2,false,0012400000poV9uAAE,true,2017-07-05T09:04:25.000+0000,00524000003MqTYAA0,2017-07-05T09:04:25.000+0000,00524000003MqTYAA0,2018-03-14T16:47:33.000+0000,false,false,,,,,,false,,false,,,,,,,,,,,Task,G2,false,,,,,R.369709668,,Non-Target,"ONS-Richmond, VA-00494",00eU0000000eAbrIAE,Aerotek Chesapeake,G2,,false,,0.0,,,,,,,,,,,,,,,,,R.369709668,,,,,
00T1p00002Pq1pYEAR,0032400000pPGSuAAO,0012400000poV9uAAE,TT,2017-07-06,Completed,Normal,false,00524000003MqTYAA0,"not the right experience for either sabra or foley, he is interested in staying in touch for other roles moving forward, sharp guy",Call,false,0012400000poV9uAAE,true,2017-07-06T11:07:09.000+0000,00524000003MqTYAA0,2017-07-06T11:07:09.000+0000,00524000003MqTYAA0,2018-03-14T16:47:33.000+0000,false,false,,,,,,false,,false,,,,,,,,,,,Task,Call,false,,,,,R.370010467,,Non-Target,"ONS-Richmond, VA-00494",00eU0000000eAbrIAE,Aerotek Chesapeake,Call,,true,2017-07-06T00:00:00.000+0000,0.0,,,,,,,,,,,,,,,,,R.370010467,,,,,
