jasig-ssp IRC Logs-2013-11-06

[11:09:48 CST(-0600)] <dmccallum54> TonyUnicon here is the section of the SpringBatch reference docs i was talking about w/r/t error granularity

[11:09:51 CST(-0600)] <dmccallum54> http://docs.spring.io/spring-batch/reference/html-single/index.html#databaseItemWriters

[11:10:57 CST(-0600)] <dmccallum54> what i'd like to do is preserve the ability to report very fine-grained validation errors and the flexibility to aggressively batch inserts to the staging tables for performance (using JdbcBatchItemWriter perhaps)

[11:12:18 CST(-0600)] <dmccallum54> because of that, we can't rely exclusively on db inserts to check the validity of inbound flat file records

[11:12:59 CST(-0600)] <dmccallum54> that's not to say all inserts are guaranteed to succeed, but the idea is to catch as many fine-grained errors as early and as cheaply as possible using in memory validations

[11:15:35 CST(-0600)] <dmccallum54> js70… for db metadata… http://docs.oracle.com/javase/6/docs/api/java/sql/DatabaseMetaData.html#getColumns(java.lang.String, java.lang.String, java.lang.String, java.lang.String) and http://docs.oracle.com/javase/6/docs/api/java/sql/DatabaseMetaData.html#getPrimaryKeys(java.lang.String, java.lang.String, java.lang.String)

[11:18:17 CST(-0600)] <TonyUnicon> reading now

[11:23:45 CST(-0600)] <dmccallum54> key non-uniqueness is one example of an invalidity that's not going to be easily detectable until inserts. in that case we're going to need to accept that the error granularity will be unfortunately coarse

[11:24:03 CST(-0600)] <js70> yep

[11:24:25 CST(-0600)] <dmccallum54> unless we require you to deploy on unix and we just call out to the shell… then it's trivial

[11:25:34 CST(-0600)] <TonyUnicon> so

[11:25:45 CST(-0600)] <TonyUnicon> im fine wish using the in memory approach btw

[11:25:55 CST(-0600)] <TonyUnicon> but I think what the doc itself is saying

[11:26:04 CST(-0600)] <TonyUnicon> there is no way for the framework to know which one failed

[11:26:36 CST(-0600)] <TonyUnicon> but you should be able to tell from the database error you can propagate into the logs which row failed

[11:27:11 CST(-0600)] <TonyUnicon> if you're worried about partial writes

[11:27:54 CST(-0600)] <TonyUnicon> if we really want restartability to be an option we have to be able to pick up after a failed batch

[11:28:24 CST(-0600)] <TonyUnicon> and from what I gather from the frame, it may be for free

[11:28:31 CST(-0600)] <TonyUnicon> framework docs*

[11:29:11 CST(-0600)] <TonyUnicon> it will be easy to see as I write the code

[11:32:16 CST(-0600)] <TonyUnicon> and btw, if the validations have to be done by running queries

[11:32:24 CST(-0600)] <TonyUnicon> and those queries aren't batched

[11:32:39 CST(-0600)] <TonyUnicon> won't that be expensive as well?

[11:32:57 CST(-0600)] <TonyUnicon> just playing devils advocate

[11:33:22 CST(-0600)] <js70> you make a good one.

[11:33:51 CST(-0600)] <TonyUnicon> just a single intersection query?

[11:34:08 CST(-0600)] <TonyUnicon> i could look at the groovy code

[11:35:49 CST(-0600)] <js70> It was my understanding that the initial validation would not be against the database. The groovy was slow each bean was porcessed with a select/ update or insert.

[11:36:24 CST(-0600)] <js70> but after a bean validation processing step

[11:37:32 CST(-0600)] <TonyUnicon> if my understanding of the 'itemWriter' validation is correct, we want to catch these natural key type errors against existing external data… youve had to goto the db

[11:37:35 CST(-0600)] <js70> In use it was made a use difference validating the customer data. I know that is a concern

[11:38:34 CST(-0600)] <TonyUnicon> but i guess you can write one giant query for the whole file

[11:38:34 CST(-0600)] <js70> eventually, first step was to ensure Natural Keys were not null.

[11:38:50 CST(-0600)] <TonyUnicon> that would be the raw data validation i would expect

[11:39:09 CST(-0600)] <TonyUnicon> i dont think the raw data should need to concern itself with existing data

[11:39:09 CST(-0600)] <dmccallum54> the "raw data" validations that jim was running against beans did not involve database interactions

[11:39:16 CST(-0600)] <TonyUnicon> right

[11:39:21 CST(-0600)] <TonyUnicon> and I would expect that

[11:39:24 CST(-0600)] <dmccallum54> and you're right… it did not concern itself with existing data

[11:40:00 CST(-0600)] <TonyUnicon> so would you expect the validation step to have in one giant query

[11:40:14 CST(-0600)] <TonyUnicon> or on a unbatched - query per row basis

[11:40:35 CST(-0600)] <TonyUnicon> an*

[11:41:06 CST(-0600)] <js70> your talking after the raw -> then one giant step validation to insure for example unique keys?

[11:41:22 CST(-0600)] <TonyUnicon> yeah, we take the vetted file

[11:41:26 CST(-0600)] <TonyUnicon> and before we start to batch write

[11:41:37 CST(-0600)] <js70> yeah, in the initial implmentation I elected to take that one giant query and break it up into itsy bitsy steps.

[11:41:58 CST(-0600)] <TonyUnicon> so for N rows, how many queries were you firing to validate?

[11:42:05 CST(-0600)] <js70> N

[11:42:05 CST(-0600)] <TonyUnicon> N/batchsize ?

[11:42:09 CST(-0600)] <TonyUnicon> ok

[11:42:48 CST(-0600)] <TonyUnicon> do we still want to do that? it has performance bottleneck written all over it

[11:42:55 CST(-0600)] <js70> nope

[11:43:12 CST(-0600)] <TonyUnicon> the advantage if the giant intersect query

[11:43:25 CST(-0600)] <TonyUnicon> is we could identify all the bad rows in the file I guess

[11:43:27 CST(-0600)] <dmccallum54> my assumption is that once we had the validated file we simply batch inserts to the stage tables. if there is a duplicated key in a batch, it invalidates the batch

[11:43:29 CST(-0600)] <TonyUnicon> as opposed to just the first

[11:44:37 CST(-0600)] <TonyUnicon> now im confused

[11:44:54 CST(-0600)] <TonyUnicon> when we spoke about in memory validation before we write

[11:45:05 CST(-0600)] <TonyUnicon> I thought you meant before we write to the staging tables

[11:45:29 CST(-0600)] <dmccallum54> raw files -> per-row type, width, nullity validations -> filtered/validated files -> batched inserts to stage tables -> upserts to live tables

[11:46:52 CST(-0600)] <TonyUnicon> but lets say we have a dup row in the file

[11:46:56 CST(-0600)] <TonyUnicon> in that workflow

[11:47:04 CST(-0600)] <TonyUnicon> it would fail on the stage inserts

[11:47:11 CST(-0600)] <TonyUnicon> which I thought you wanted to avoid

[11:47:24 CST(-0600)] <dmccallum54> that particular error would not be caught until [batched inserts to stage tables] and the resulting error would be as inspecific as the batch size is large

[11:47:42 CST(-0600)] <dmccallum54> inspecific. wtf.

[11:47:44 CST(-0600)] <dmccallum54> non-specific

[11:47:52 CST(-0600)] <TonyUnicon> you've been talking to me too much

[11:48:16 CST(-0600)] <TonyUnicon> ok, so then I'm not sure what we were arguing about on the call

[11:48:34 CST(-0600)] <TonyUnicon> I thought we were talking about what we wanted to do before we wrote to the database in terms of validation

[11:48:55 CST(-0600)] <dmccallum54> what i heard on the call was a proposal to move all validations to database operations

[11:49:12 CST(-0600)] <dmccallum54> specifically to attempt a non-batched insert into stage tables for each raw record

[11:49:18 CST(-0600)] <TonyUnicon> right right, ok I didn't know we definitely wanted to do that

[11:49:28 CST(-0600)] <TonyUnicon> that is the argument I'm making, lean on the DB

[11:50:15 CST(-0600)] <TonyUnicon> i dont think we want to do non-batched inserts

[11:51:04 CST(-0600)] <TonyUnicon> especially if we want to use indices

[11:51:27 CST(-0600)] <TonyUnicon> I think

[11:51:31 CST(-0600)] <TonyUnicon> despite what the doc says

[11:51:46 CST(-0600)] <dmccallum54> if we use the db entirely, one downside is we can report at most one error per row

[11:51:51 CST(-0600)] <TonyUnicon> we can determine the bad row via the database error itself and not any sort of state the framework stores

[11:52:09 CST(-0600)] <dmccallum54> the other is the row-specificity issue that you're talking about now