...

[11:36:24 CST(-0600)] <js70> but after a bean validation processing step

[11:37:32 CST(-0600)] <TonyUnicon> if my understanding of the 'itemWriter' validation is correct, we want to catch these natural key type errors against existing external data… you've had to go to the db

[11:37:35 CST(-0600)] <js70> In use it made a real difference validating the customer data. I know that is a concern

[11:38:34 CST(-0600)] <TonyUnicon> but I guess you can write one giant query for the whole file

[11:38:34 CST(-0600)] <js70> eventually, first step was to ensure Natural Keys were not null.

[11:38:50 CST(-0600)] <TonyUnicon> that would be the raw data validation i would expect

[11:39:09 CST(-0600)] <TonyUnicon> I don't think the raw data should need to concern itself with existing data

[11:39:09 CST(-0600)] <dmccallum54> the "raw data" validations that jim was running against beans did not involve database interactions

[11:39:16 CST(-0600)] <TonyUnicon> right

[11:39:21 CST(-0600)] <TonyUnicon> and I would expect that

[11:39:24 CST(-0600)] <dmccallum54> and you're right… it did not concern itself with existing data
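
As an aside, a minimal sketch of the kind of raw-data check being described, assuming JSR-303 Bean Validation; the bean and field names here are hypothetical:

```java
import javax.validation.ConstraintViolation;
import javax.validation.Validation;
import javax.validation.Validator;
import javax.validation.constraints.NotNull;
import java.util.Set;

// Hypothetical raw-record bean; per-row checks need no database access.
public class RawPersonRecord {

    @NotNull  // natural key must be present before anything else happens
    private String schoolId;

    @NotNull
    private String firstName;

    public static Set<ConstraintViolation<RawPersonRecord>> check(RawPersonRecord row) {
        Validator validator = Validation.buildDefaultValidatorFactory().getValidator();
        return validator.validate(row);  // empty set == row passed
    }
}
```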

[11:40:00 CST(-0600)] <TonyUnicon> so would you expect the validation step to happen in one giant query

[11:40:14 CST(-0600)] <TonyUnicon> or on an unbatched, query-per-row basis

[11:41:06 CST(-0600)] <js70> you're talking about after the raw step -> then one giant validation step to ensure, for example, unique keys?

[11:41:22 CST(-0600)] <TonyUnicon> yeah, we take the vetted file

[11:41:26 CST(-0600)] <TonyUnicon> and before we start to batch write

[11:41:37 CST(-0600)] <js70> yeah, in the initial implementation I elected to take that one giant query and break it up into itsy bitsy steps.

[11:41:58 CST(-0600)] <TonyUnicon> so for N rows, how many queries were you firing to validate?

[11:42:05 CST(-0600)] <js70> N

[11:42:05 CST(-0600)] <TonyUnicon> N/batchsize ?

[11:42:09 CST(-0600)] <TonyUnicon> ok

[11:42:48 CST(-0600)] <TonyUnicon> do we still want to do that? it has performance bottleneck written all over it

[11:42:55 CST(-0600)] <js70> nope

[11:43:12 CST(-0600)] <TonyUnicon> the advantage of the giant intersect query

[11:43:25 CST(-0600)] <TonyUnicon> is we could identify all the bad rows in the file I guess

[11:43:27 CST(-0600)] <dmccallum54> my assumption is that once we had the validated file we simply batch inserts to the stage tables. if there is a duplicated key in a batch, it invalidates the batch

[11:43:29 CST(-0600)] <TonyUnicon> as opposed to just the first
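
A sketch of what that "one giant intersect query" could look like over JDBC; the table and column names are placeholders, and it assumes the file contents have already been loaded somewhere queryable:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class NaturalKeyCollisionCheck {

    // One round trip that reports every colliding natural key at once,
    // rather than failing on the first. Names are hypothetical.
    static List<String> findCollisions(Connection conn) throws SQLException {
        String sql = "SELECT school_id FROM incoming_person "
                   + "INTERSECT "
                   + "SELECT school_id FROM stg_person";
        List<String> dups = new ArrayList<>();
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                dups.add(rs.getString(1));
            }
        }
        return dups;  // every bad key in the file, not just the first
    }
}
```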

[11:44:37 CST(-0600)] <TonyUnicon> now I'm confused

[11:44:54 CST(-0600)] <TonyUnicon> when we spoke about in memory validation before we write

[11:45:05 CST(-0600)] <TonyUnicon> I thought you meant before we write to the staging tables

[11:45:29 CST(-0600)] <dmccallum54> raw files -> per-row type, width, nullity validations -> filtered/validated files -> batched inserts to stage tables -> upserts to live tables
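
For the [batched inserts to stage tables] step, a plain-JDBC sketch; the table, columns, and row shape are again placeholders:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class StageWriter {

    // Batched inserts into a hypothetical stage table. A duplicate key
    // anywhere in the batch surfaces as one failure of executeBatch().
    static void writeBatch(Connection conn, List<String[]> rows) throws SQLException {
        String sql = "INSERT INTO stg_person (school_id, first_name) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch();
            }
            ps.executeBatch();  // throws BatchUpdateException on a bad row
        }
    }
}
```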

[11:46:52 CST(-0600)] <TonyUnicon> but let's say we have a dup row in the file

[11:46:56 CST(-0600)] <TonyUnicon> in that workflow

[11:47:04 CST(-0600)] <TonyUnicon> it would fail on the stage inserts

[11:47:11 CST(-0600)] <TonyUnicon> which I thought you wanted to avoid

[11:47:24 CST(-0600)] <dmccallum54> that particular error would not be caught until [batched inserts to stage tables] and the resulting error would be as inspecific as the batch size is large

[11:47:42 CST(-0600)] <dmccallum54> inspecific. wtf.

[11:47:44 CST(-0600)] <dmccallum54> non-specific

[11:47:52 CST(-0600)] <TonyUnicon> you've been talking to me too much

[11:48:16 CST(-0600)] <TonyUnicon> ok, so then I'm not sure what we were arguing about on the call

[11:48:34 CST(-0600)] <TonyUnicon> I thought we were talking about what we wanted to do before we wrote to the database in terms of validation

[11:48:55 CST(-0600)] <dmccallum54> what i heard on the call was a proposal to move all validations to database operations

[11:49:12 CST(-0600)] <dmccallum54> specifically to attempt a non-batched insert into stage tables for each raw record

[11:49:18 CST(-0600)] <TonyUnicon> right right, ok I didn't know we definitely wanted to do that

[11:49:28 CST(-0600)] <TonyUnicon> that is the argument I'm making, lean on the DB

[11:50:15 CST(-0600)] <TonyUnicon> I don't think we want to do non-batched inserts

[11:51:04 CST(-0600)] <TonyUnicon> especially if we want to use indices

[11:51:27 CST(-0600)] <TonyUnicon> I think

[11:51:31 CST(-0600)] <TonyUnicon> despite what the doc says

[11:51:46 CST(-0600)] <dmccallum54> if we use the db entirely, one downside is we can report at most one error per row

[11:51:51 CST(-0600)] <TonyUnicon> we can determine the bad row via the database error itself and not any sort of state the framework stores

[11:52:09 CST(-0600)] <dmccallum54> the other is the row-specificity issue that you're talking about now

[11:53:29 CST(-0600)] <TonyUnicon> right

[11:53:33 CST(-0600)] <TonyUnicon> the more in-memory validation we have

[11:53:51 CST(-0600)] <TonyUnicon> the more detail we can give about the validity of the file

[11:54:06 CST(-0600)] <TonyUnicon> we can probably identify all bad rows in one shot, and maybe multiple errors per row

[11:54:09 CST(-0600)] <TonyUnicon> relying on the database

[11:54:16 CST(-0600)] <TonyUnicon> would mean it would fail fast on the first error

[11:54:21 CST(-0600)] <TonyUnicon> which could mean more iteration

[11:54:37 CST(-0600)] <TonyUnicon> so

[11:54:56 CST(-0600)] <TonyUnicon> I think that is a good enough reason to try to put as much validation in the Java as we can

[11:55:05 CST(-0600)] <dmccallum54> that is still my vote (smile)

[11:55:14 CST(-0600)] <TonyUnicon> the ayes have it
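
A sketch of the "all bad rows in one shot, multiple errors per row" idea; the rules shown (null key, column width) are illustrative stand-ins for the per-row type/width/nullity checks discussed above:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class InMemoryFileValidator {

    // Validate every row and keep going, so a single pass can report
    // every bad row and every error on each row.
    static Map<Integer, List<String>> validate(List<String[]> rows) {
        Map<Integer, List<String>> errors = new LinkedHashMap<>();
        for (int i = 0; i < rows.size(); i++) {
            String[] row = rows.get(i);
            List<String> rowErrors = new ArrayList<>();
            if (row[0] == null || row[0].isEmpty()) {
                rowErrors.add("natural key is null");
            }
            if (row[1] != null && row[1].length() > 80) {
                rowErrors.add("first_name exceeds column width");
            }
            if (!rowErrors.isEmpty()) {
                errors.put(i, rowErrors);  // row index -> all of its errors
            }
        }
        return errors;  // empty map == the whole file passed
    }
}
```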

[11:55:30 CST(-0600)] <dmccallum54> poking around on SO re error granularity in batched jdbc statements

[11:55:53 CST(-0600)] <dmccallum54> looks like identifying the bad row might be a bit driver specific, if possible at all

[11:56:30 CST(-0600)] <TonyUnicon> ok, well in that case I'll do my best to put as much info into the logs as we can, at least for postgres and sqlserver

[11:56:31 CST(-0600)] <dmccallum54> i.e. if the driver keeps ploughing ahead after failed statements, getUpdateCounts won't help
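
For reference, what that driver-specific inspection looks like with the standard JDBC API; how informative getUpdateCounts() is depends on whether the driver stops at the first failure or ploughs ahead:

```java
import java.sql.BatchUpdateException;
import java.sql.Statement;

public class BatchErrorReport {

    // Best-effort identification of the bad row after a failed batch.
    static void report(BatchUpdateException e) {
        int[] counts = e.getUpdateCounts();
        for (int i = 0; i < counts.length; i++) {
            if (counts[i] == Statement.EXECUTE_FAILED) {
                // driver kept going and flagged this statement as failed
                System.err.println("statement " + i + " in the batch failed");
            }
        }
        // If the driver stopped at the first failure instead, counts is
        // shorter than the batch and counts.length marks the failing row.
    }
}
```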

[11:57:59 CST(-0600)] <dmccallum54> cool. sounds like we are agreed, then

[11:58:04 CST(-0600)] <TonyUnicon> yep

[11:58:06 CST(-0600)] <TonyUnicon> thanks

[12:22:11 CST(-0600)] <js70> interesting: https://github.com/42BV/jarb/ https://blog.42.nl/articles/using-database-constraints-in-java/

[12:26:12 CST(-0600)] <js70> so, an outstanding question that I have: we still need a little information to start using the metadata for validation, mainly the table name. That is going to come from the file name, correct? Headers contain the column names and away we go?

[12:27:41 CST(-0600)] <dmccallum54> simplest thing would be for the file names to match table names and file column headers to match db column names

[12:27:57 CST(-0600)] <dmccallum54> the result, of course, is that if the db names change, the file protocol changes

[12:28:22 CST(-0600)] <dmccallum54> but… the advantage is that it's totally obvious how to go from our published spec for the db tables to what your CSV files need to look like

[12:29:08 CST(-0600)] <dmccallum54> so my vote is to try to get as far as we can with unmapped correlations between file/table and column/column names

[12:30:20 CST(-0600)] <js70> k.
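
With that convention (file name matches table name, headers match column names), the CSV headers can be checked straight against database metadata. A sketch using the standard DatabaseMetaData API; the case handling is simplified:

```java
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.HashSet;
import java.util.Set;

public class HeaderCheck {

    // Returns the CSV headers that have no matching column in the table
    // named after the file.
    static Set<String> unknownHeaders(Connection conn, String tableName,
                                      Set<String> csvHeaders) throws SQLException {
        DatabaseMetaData md = conn.getMetaData();
        Set<String> dbColumns = new HashSet<>();
        try (ResultSet rs = md.getColumns(null, null, tableName, null)) {
            while (rs.next()) {
                dbColumns.add(rs.getString("COLUMN_NAME").toLowerCase());
            }
        }
        Set<String> unknown = new HashSet<>();
        for (String h : csvHeaders) {
            if (!dbColumns.contains(h.toLowerCase())) {
                unknown.add(h);
            }
        }
        return unknown;  // headers with no matching db column
    }
}
```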