  • PLAN A - Do what is natural in the business logic. Many a bulk file load has no transactional nature at the business-logic level -- any one record could be cancelled separately. And sometimes the whole file is a transaction to the business, too. For some of my feeds we are required to reject the whole file if any record rejects; for others we are required to accept the rest of the records. I would claim the natural "transaction scope" for loading those files differs accordingly.

    However, what is natural to the business logic and what is operationally efficient are different things. Having each file restartable is quite desirable at the operations level, but keeping any one pending transaction from growing too large is a very real operational issue for the DBAs. Your debate with the DBA is strictly in operational space; even if there were a natural business answer, the operational issues might trump it anyway. This is, however, a solvable conundrum.

    PLAN B -- Well, maybe you could use Perl or *nix split(1) [google.com] to split the input file into subfiles of 1,000 records each and process them in sequence, marking each one as completed when it commits. Then if you need to restart, you can skip the ones already done and pick up at the 1,000-line boundary to which the DB rolled back.

    $ split --lines=1000 inputfile pieces.
    $ ls piece*
    pieces.aa pieces.ab pieces.ac pieces.ad pieces.ae
    ...
    # now process pieces.?? in sequence, commit each.
    #  = pieces.a[a-j] for your 10k records ...
    This requires only wrapping your existing job with a splitter/re-sequencer script.
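
    A rough Perl sketch of such a wrapper follows; load_piece() here is a hypothetical stand-in for running your existing loader on one piece and committing it:

    # splitter/re-sequencer wrapper -- sketch only
    use strict;
    use warnings;

    system('split', '--lines=1000', 'inputfile', 'pieces.') == 0
        or die "split failed: $?";

    for my $piece (sort glob 'pieces.??') {
        next if -e "$piece.done";          # committed on a previous run; skip it
        load_piece($piece);                # hypothetical: your existing load + commit
        open my $mark, '>', "$piece.done" or die "can't mark $piece done: $!";
        close $mark;
    }

    The .done marker files are just a cheap way to remember which pieces already committed, so a rerun only replays work the DB actually rolled back.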

    PLAN C -- Alternatively, you include in each transaction an update to a side table or the status table recording the last file and line processed. Upon restart, that's where you resume. You probably already have a table listing files received with their status; if so, you just need to add a "number of lines accepted" column and update it at each partial-file commit point, not just at EOF.

    This is somewhat more invasive, changing the schema and the basic logic, but probably well worth it.
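
    Roughly, in DBI terms -- a sketch only, assuming a files_received status table with a lines_accepted column, and a hypothetical insert_record() doing your existing per-record work:

    use strict;
    use warnings;
    use DBI;

    my ($file, $dsn, $user, $pass) = @ARGV;
    my $dbh = DBI->connect($dsn, $user, $pass,
                           { RaiseError => 1, AutoCommit => 0 });

    # resume point: how many lines of this file did a prior run already commit?
    my ($done) = $dbh->selectrow_array(
        'SELECT lines_accepted FROM files_received WHERE filename = ?',
        undef, $file);
    $done ||= 0;

    open my $in, '<', $file or die "can't open $file: $!";
    my $line_no = 0;
    while (my $rec = <$in>) {
        $line_no++;
        next if $line_no <= $done;         # accepted last time; skip
        insert_record($dbh, $rec);         # hypothetical: your existing insert
        if ($line_no % 1000 == 0) {
            $dbh->do('UPDATE files_received SET lines_accepted = ? WHERE filename = ?',
                     undef, $line_no, $file);
            $dbh->commit;                  # data and progress marker commit together
        }
    }
    $dbh->do('UPDATE files_received SET lines_accepted = ? WHERE filename = ?',
             undef, $line_no, $file);
    $dbh->commit;                          # final partial batch

    The point is that the "lines accepted" update rides inside the same transaction as its 1,000 inserts, so the marker can never disagree with what actually got committed.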

    PLAN F -- The obvious answer of having the SQL skip records that already exist is of course so very, very wrong that I hope I don't need to explain why.
    (Hint: speed and/or race conditions.)

    Your instinct to ask the DBA why this DB can't handle a transaction of 10k records when other DBs can is, of course, a good one. The DBA may have a good answer, though -- if the DBA is any good, they're paid to have a good answer to that question, one we outsiders do not want to have to understand. If the DB in question is optimized for on-line usage, it may have internal structures that make queueing a huge transaction awkward. Or it may be running on junk hardware, or an older rev of the DB, or whatever. If there's a real answer, be prepared to fall back to plan B or plan C above (split).

    If your business requires you to accept all valid records immediately and re-try only the invalid ones, you would not roll back at all: commit the 999 good records in one batch and write the single invalid record to an exception file that can be reprocessed after repair. That is stored-proc ugliness ... or a front-end validation program.
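
    A minimal front-end-validation sketch along those lines; validate() and insert_record() are placeholders for your real logic, and the connection details are assumed to come from the caller:

    use strict;
    use warnings;
    use DBI;

    my ($file, $dsn, $user, $pass) = @ARGV;
    my $dbh = DBI->connect($dsn, $user, $pass,
                           { RaiseError => 1, AutoCommit => 0 });

    open my $in,  '<', $file           or die "can't read $file: $!";
    open my $bad, '>', "$file.rejects" or die "can't write rejects: $!";

    while (my $rec = <$in>) {
        if (validate($rec)) {              # placeholder validation
            insert_record($dbh, $rec);     # placeholder insert, part of one batch
        }
        else {
            print {$bad} $rec;             # fix these up and feed them back in later
        }
    }
    $dbh->commit;                          # every good record goes in together
    close $bad;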

    --
    Bill
    # I had a sig when sigs were cool
    use Sig;