Do as I say, not as I do.
Because, like most of you, I am very lazy.
Case in point: loading some data from a CSV into an Oracle table. I can use a wizard in SQL Developer and, in a few clicks, have it loaded. Usually I’m playing with a hundred rows. Or maybe a few thousand.
But this time I needed to load up about 150MB of CSV, which isn’t very much. But it’s 750k rows, and it was taking more than 5 minutes to run the INSERTs against the table. And I thought that was pretty good, considering. My entire setup is running on a MacBook Air and our OTN VirtualBox Database image.
I’m setting up a scenario that others can run, and the entire lab is only allotted 30 minutes. So I can’t reserve 10 minutes of that just to do the data load.
The Solution: EXTERNAL TABLE
If you have access to the database server via a DIRECTORY object, then you are good to go. This means I can put the CSV (or CSVs) onto the server, in a directory that the database can access.
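If you’re not sure which directories you can already use, a quick query against the data dictionary will show you:

-- directories this account can see, and where they point on the server
SELECT directory_name, directory_path
FROM   all_directories;

If nothing comes back, you’ll need someone with the right privileges to create the directory and grant you READ/WRITE on it.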
This wizard is pretty much exactly the same as I’ve shown you before. There’s an additional dialog, and the output is a script that you run.
You need to give us the database directory name, the name of your file, and, if you want bad, discard, and log files, what you want to call them as well.
But when we’re done, we get a script.
The script will create the directory (if you need it), grant privileges (if you need them), and drop your staging table (if you want to). That’s why those steps are commented out.
And I tweaked my script even further, swapping out the INSERT statement to include a function call…but setting up the table from the CSV file was a lot easier using the wizard.
SET DEFINE OFF

-- uncomment these if you need the directory, the grants, or a fresh staging table
--CREATE OR REPLACE DIRECTORY ORDER_ENTRY AS '/Users/oracle/data_loads';
--GRANT READ ON DIRECTORY ORDER_ENTRY TO HR;
--GRANT WRITE ON DIRECTORY ORDER_ENTRY TO HR;
--DROP TABLE OPENDATA_TEST_STAGE;

CREATE TABLE OPENDATA_TEST_STAGE
(
  NAME         VARCHAR2(256),
  AMENITY      VARCHAR2(256),
  ID           NUMBER(11),
  WHO          VARCHAR2(256),
  VISIBLE      VARCHAR2(26),
  SOURCE       VARCHAR2(512),
  OTHER_TAGS   VARCHAR2(4000),
  "WHEN"       VARCHAR2(40),   -- WHEN is a reserved word, so it has to be quoted
  GEO_POINT_2D VARCHAR2(26)
)
ORGANIZATION EXTERNAL
(
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY ORDER_ENTRY
  ACCESS PARAMETERS
  (
    RECORDS DELIMITED BY '\r\n'
    CHARACTERSET AL32UTF8
    BADFILE ORDER_ENTRY:'openstreetmap-pois-usa.bad'
    DISCARDFILE ORDER_ENTRY:'openstreetmap-pois-usa.discard'
    LOGFILE ORDER_ENTRY:'openstreetmap-pois-usa.log'
    SKIP 1
    FIELDS TERMINATED BY ';'
    OPTIONALLY ENCLOSED BY '"' AND '"'
    LRTRIM
    MISSING FIELD VALUES ARE NULL
    (
      NAME         CHAR(4000),
      AMENITY      CHAR(4000),
      ID           CHAR(4000),
      WHO          CHAR(4000),
      VISIBLE      CHAR(4000),
      SOURCE       CHAR(4000),
      OTHER_TAGS   CHAR(4000),
      "WHEN"       CHAR(4000),
      GEO_POINT_2D CHAR(4000)
    )
  )
  LOCATION ('openstreetmap-pois-usa.csv')
)
REJECT LIMIT UNLIMITED;

-- eyeball the first 100 rows to make sure the parse looks right
SELECT * FROM OPENDATA_TEST_STAGE WHERE ROWNUM <= 100;

INSERT INTO OPENDATA_TEST
(
  NAME, AMENITY, ID, WHO, VISIBLE, SOURCE, OTHER_TAGS, "WHEN", GEO_POINT_2D
)
SELECT NAME,
       AMENITY,
       ID,
       WHO,
       VISIBLE,
       SOURCE,
       OTHER_TAGS,
       -- convert the ISO 8601 string to a real TIMESTAMP WITH TIME ZONE
       TO_TIMESTAMP_TZ("WHEN", 'YYYY-MM-DD"T"HH24:MI:SSTZR'),
       GEO_POINT_2D
FROM   OPENDATA_TEST_STAGE;
A Small Tweak
My table has a TIMESTAMP column. I REFUSE to store DATES as strings. It bites me in the butt EVERY SINGLE TIME. So what I did here, because I’m lazy, was load the EXTERNAL TABLE column containing the timestamp as a VARCHAR2. But in my INSERT..SELECT, I throw in a TO_TIMESTAMP_TZ() to do the conversion.
The hardest part, for me, was figuring out the format mask that represented the timestamp data. After a little trial and error, I worked out that 2009-03-08T19:25:16-04:00 equated to 'YYYY-MM-DD"T"HH24:MI:SSTZR'. I got tripped up because I was single-quote escaping the 'T' instead of double quoting it as "T". And then I got tripped up again because I was using TO_TIMESTAMP() vs TO_TIMESTAMP_TZ().
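By the way, you can sanity-check a format mask before wiring it into the INSERT by running the conversion against a literal sample value from the file:

-- quick test of the mask against one value from the CSV
SELECT TO_TIMESTAMP_TZ('2009-03-08T19:25:16-04:00',
                       'YYYY-MM-DD"T"HH24:MI:SSTZR')
FROM   dual;

If the mask is wrong, you get the error on one row instead of 750k of them.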
With my boo-boos fixed, instead of taking 5, almost 6, minutes to run:
747,973 rows inserted.
Elapsed: 00:00:27.987
Commit complete.
Elapsed: 00:00:00.156
Not too shabby. And the CREATE TABLE…ORGANIZATION EXTERNAL itself is instantaneous. The data isn’t read in until you need it.
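One way to see that for yourself: the CREATE runs in milliseconds, but the first query that has to touch every row is when the file actually gets parsed (and when any rejects land in the BADFILE):

-- forces a full read of the CSV through the external table
SELECT COUNT(*) FROM OPENDATA_TEST_STAGE;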
Last time I checked, 28 seconds vs 5 minutes is a lot better. Even on my VirtualBox database running on my MacBook Air.