# ssp-data-importer
ssp-data-importer is a tool for importing CSV data into the external tables of the SSP database. The parser allows some latitude in the CSV format, including whitespace, separator, and quote characters. Currently, the program supports only inserts and updates. It is possible to upload partial data that updates specific columns; when updating already existing data, the only requirement is that the columns containing key values be supplied.
## High-Level Workflow
The ssp-data-importer is expected to be run as part of a cron job or other periodic scheduling method. The job should be scheduled with a period appropriate to how often the external tables are to be updated, and uploads to the input folder should be timed against the running of the cron job.

1. A folder that will contain the .csv files to import (the input folder) is monitored at runtime.
2. If the folder contains .csv files that have had sufficient soak time since their last modification, the files are transferred to a second folder (the process folder) for reading and validation. Validation is performed on a column basis against database metadata constraints; all errors in a given row are documented for the final report.
3. Validated rows are written to another folder (the upsert folder).
4. From the upsert folder, rows are upserted into the staging tables in the database.
5. Additional validation steps are taken on the complete data set, checking for inconsistencies and any potential duplications.
6. The validated staging tables are then used to update/insert data into the corresponding external tables.
7. A report is generated giving pertinent information, including any validation errors, total lines processed per table, etc.
8. Finally, the staging tables are truncated, the process and upsert folders are removed, and the processed files can be archived.
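For example, on a Unix host the four working folders referenced in this workflow (and configured later in the properties file) might be laid out like this; the /opt/ssp-import root is purely an illustrative choice:

```shell
mkdir -p /opt/ssp-import/input    # uploaded .csv files land here
mkdir -p /opt/ssp-import/process  # files are moved here for reading and validation
mkdir -p /opt/ssp-import/upsert   # validated rows are written here before upsert
mkdir -p /opt/ssp-import/archive  # processed files can be archived here
```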
## Prerequisites

At a minimum you'll need a Java JDK installed; this program requires JDK 1.6+. The installation process varies widely from platform to platform, so you're on your own for that one. You will also need a PostgreSQL 9.1+ or SQL Server 2008+ database. To build, you will need a Maven distribution.
## Building

ssp-data-importer is a Maven project and can be built and tested using the standard mvn commands. To build for distribution, an assembly/descriptor.xml is supplied, as well as .bat and .sh scripts for running the importer.
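A typical build might look like the following (a sketch assuming a standard Maven setup; the exact assembly goal depends on how the assembly plugin is bound in the POM):

```shell
# compile, run tests, and package the project
mvn clean package

# build the distribution assembly described by assembly/descriptor.xml
mvn assembly:single
```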
## Running the Program

It is recommended that the supplied scripts, runJob.bat (Windows) or runJob.sh (Unix), be used to run ssp-data-importer.

The following arguments must be supplied:

- the database profile: `-Dspring.profiles.active=postgres|sqlserver`
- the location of the override properties file folder: `-Dssp.importer.configdir=/location/of/override/properties/file/folder/`
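For example, a one-off run against a PostgreSQL database (the properties folder path is illustrative):

```shell
./runJob.sh -Dspring.profiles.active=postgres -Dssp.importer.configdir=/ssp-properties/
```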
To run hourly using crontab, edit the crontab from the command line:

```
# crontab -e
```

then add the following line:

```
0 0-23 * * * /root/scripts/sh/runJob.sh -Dspring.profiles.active=postgres -Dssp.importer.configdir=/ssp-properties/
```
## Properties File Description

There are a number of properties required for the program to run properly. All values marked with * must be set, as defaults are not supplied for them.

IMPORTANT! These properties are stored in a file titled ssp-importer.properties, the location of which is set at run time. See Running the Program.
### FOLDER LOCATIONS

NOTE: the `file:` prefix must be used for location identification.

- `*batch.tables.input.folder=file:/location/of/input/folder/` full path to the folder that will contain the initial csv files
- `*batch.tables.process.folder=file:/location/of/process/folder/` full path to the folder where csv files will be processed
- `*batch.tables.upsert.folder=file:/location/of/upsert/folder/` full path to the folder where csv files will be upserted
- `*batch.tables.archive.folder=file:/location/of/archive/` full path to the archive folder
### FILE LOCATIONS

- `batch.tables.input.files=${batch.tables.input.folder}*.csv` pattern matching the uploaded files
- `batch.tables.process.files=${batch.tables.process.folder}*.csv` pattern matching the files to process
- `batch.tables.upsert.files=${batch.tables.upsert.folder}*.csv` pattern matching the files to upsert
### INITIALIZATION

- `batch.tables.lagTimeBeforeStartInMinutes=10` number of minutes files must be unmodified before processing begins (default: 10)
### ARCHIVING

- `batch.tables.retain.input.archive=true` turns archiving on (default: true)
- `batch.tables.archive=UNIQUE` which files to archive: ALL, NONE, or UNIQUE (default: UNIQUE)
### TOLERANCE

- `batch.rawitem.skip.limit=10` number of lines generating validation errors to allow during read/write
- `batch.rawitem.commit.interval=100` size of the batch
- `batch.upsertitem.skip.limit=10` number of lines generating validation errors to allow during upsert
- `batch.upsertitem.commit.interval=100` size of the batch; larger batch sizes reduce processing time but make errors less specific
### TESTING

- `batch.table.input.duplicate=false`
- `exists.only.for.testing.1=default.value.1`
- `exists.only.for.testing.2=default.value.2`
### DATABASE

- `db_name=ssp` name of the database (default: ssp)
- `batch.jdbc.url=jdbc:postgresql://127.0.0.1:5432/${db_name}` the full URL to the source database (default: jdbc:postgresql://localhost:5432/ssp)
- `batch.jdbc.driver=org.postgresql.Driver` the JDBC driver to be used (default: org.postgresql.Driver; for SQL Server use net.sourceforge.jtds.jdbc.Driver)
- `batch.jdbc.user=sspadmin` the username for the source database (default: sspadmin)
- `batch.jdbc.password=sspadmin` the password for the source database (default: sspadmin)
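For a SQL Server deployment using the jTDS driver mentioned above, the database properties would look roughly like this (host, port, database name, and credentials are illustrative):

```properties
batch.jdbc.url=jdbc:jtds:sqlserver://127.0.0.1:1433/ssp
batch.jdbc.driver=net.sourceforge.jtds.jdbc.Driver
batch.jdbc.user=sspadmin
batch.jdbc.password=sspadmin
```

Remember to pair these with `-Dspring.profiles.active=sqlserver` at run time.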
### REPORTING

- `batch.title=ssp_import`
### EMAIL NOTIFICATION

NOTE: ** indicates required if batch.sendEmail is true.

- `batch.sendEmail=true` activate email (default: true)
- `**batch.smtp.host=localhost` host name (default: localhost)
- `**batch.smtp.port=25` port (default: 25)
- `**batch.smtp.protocol=smtp` protocol to be used (default: smtp)
- `**batch.smtp.username=sysadmin` username (default: sysadmin)
- `**batch.smtp.password=password` password (default: password)

RECIPIENTS: comma-separated addresses as strings. Each address must follow RFC 822 syntax (default: "SSP DATA IMPORTER"<sysadmin@localhost>).

- `**batch.email.recipients="SSP DATA IMPORTER"<sysadmin@localhost>`
- `**batch.email.replyTo="SSP DATA IMPORTER"<sysadmin@localhost>`
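Putting the required pieces together, a minimal ssp-importer.properties for a PostgreSQL install might look like the following sketch (the folder paths are illustrative placeholders; the asterisks above are documentation markers and do not appear in the actual file):

```properties
batch.tables.input.folder=file:/opt/ssp-import/input/
batch.tables.process.folder=file:/opt/ssp-import/process/
batch.tables.upsert.folder=file:/opt/ssp-import/upsert/
batch.tables.archive.folder=file:/opt/ssp-import/archive/

batch.jdbc.url=jdbc:postgresql://127.0.0.1:5432/ssp
batch.jdbc.driver=org.postgresql.Driver
batch.jdbc.user=sspadmin
batch.jdbc.password=sspadmin

batch.smtp.host=localhost
batch.smtp.port=25
batch.smtp.protocol=smtp
batch.smtp.username=sysadmin
batch.smtp.password=password
batch.email.recipients="SSP DATA IMPORTER"<sysadmin@localhost>
batch.email.replyTo="SSP DATA IMPORTER"<sysadmin@localhost>
```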
## CSV File Format

- The name of the csv file MUST correspond to the name of the external table to be imported. Example: external_person.csv
- The supported format is comma-delimited. The quote character is " and the escape character is also ".
- Each file MUST have an initial line containing the column names in the format of the external table. For example, the first line for external_faculty_course would contain: faculty_school_id,term_code,formatted_course,title,section_code,section_number
- Partial data updates are supported. However, in addition to the data being updated, the key values for the specific table MUST be supplied (see the example following the key table below).
For SSP version 2.0, the keys for each external table are:

| TABLE | KEY |
|-------|-----|
| external_course | code |
| external_course_program | program_code, course_code |
| external_course_requisite | requiring_course_code, required_course_code, requisite_code |
| external_course_tag | course_code, tag |
| external_course_term | course_code, term_code |
| external_department | code |
| external_division | code |
| external_faculty_course | (NONE) recommended: faculty_school_id, term_code, formatted_course, title |
| external_faculty_course_roster | (NONE) recommended: faculty_school_id, school_id, term_code, formatted_course, title |
| external_person | school_id |
| external_person_note | code |
| external_person_planning_status | school_id |
| external_program | code |
| external_registration_status_by_term | school_id, term_code |
| external_student_academic_program | school_id, degree_code, program_code |
| external_student_financial_aid | school_id |
| external_student_test | school_id, test_code, sub_test_code, test_date, discriminator |
| external_student_transcript | school_id |
| external_student_transcript_course | school_id, term_code, formatted_course, section_code |
| external_student_transcript_term | school_id, term_code |
| external_term | code |
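As a sketch of a partial update (the data values here are made up), a file named external_faculty_course.csv could update only the section_code column by supplying the recommended key columns plus the column being changed; section_number is omitted from the header entirely and is left untouched:

```
faculty_school_id,term_code,formatted_course,title,section_code
sherman123,FA12,Biology 101,Introduction To Biology,B101-001
```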
For a complete list of column names and constraints, please see the relevant mappings for your version of SSP:

- Data Integration Mappings for Version 2.1.0
- Data Integration Mappings for Version 2.0.1
- Data Integration Mappings for Version 2.0.0
- Data Integration Mappings for Version 1.2.1
For questions on csv formatting, please see the expected csv format examples below. Empty strings (including those consisting entirely of whitespace) will be entered as null values.
Acceptable examples:

No quotes:

```
faculty_school_id,term_code,formatted_course,title,section_code,section_number
sherman123,FA12,Biology 101,Introduction To Biology,,
```

All quotes (note the enclosed comma in column 1):

```
"faculty_school_id","term_code","formatted_course","title","section_code","section_number"
"sherman, 123","FA12","Biology 101","Introduction To Biology","",""
```

All quotes with escaped quotes (note the enclosed comma in column 1):

```
"faculty_school_id","term_code","formatted_course","title","section_code","section_number"
"sherman"", 123","FA12","this is a string "", with escaped quotes","Introduction To Biology","",""
```