
Loading Data via External Tables = Fast


Do as I say, not as I do.

Because I am like most of you, I am very lazy.

Case in point: loading some data from a CSV into an Oracle table. I can use a wizard in SQL Developer and in a few clicks, have it loaded. Usually I’m playing with a hundred rows. Or maybe a few thousand.

But this time I needed to load up about 150MB of CSV, which isn’t very much. But it’s 750k rows, and it was taking more than 5 minutes to run the INSERTs against the table. And I thought that was pretty good, considering. My entire setup is running on a MacBook Air and our OTN VirtualBox Database image.

I’m setting up a scenario that others can run, and the entire lab is only allotted 30 minutes. So I can’t reserve 10 minutes of that just to do the data load.

The Solution: EXTERNAL TABLE

If you have access to the database server via a DIRECTORY object, then you are good to go. This means I can put the CSV (or CSVs) onto the server, in a directory that the database can access.
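
If you are not sure which DIRECTORY objects your account can already use, a quick dictionary query will tell you. This is a minimal sketch that assumes nothing beyond the standard ALL_DIRECTORIES view; the directory names you see will depend on your own database.

```sql
-- List the DIRECTORY objects visible to the current user,
-- along with the server-side path each one points to.
SELECT directory_name,
       directory_path
FROM   all_directories
ORDER  BY directory_name;
```

If the directory you want shows up here, you can drop your CSV into that path on the server and point the external table at it.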


If you don’t have access to the server directly, then SQL*Loader is your next best bet.

This wizard is pretty much exactly the same as I’ve shown you before. There’s an additional dialog, and the output is a script that you run.

You need to give us the database directory name, the name of your file, and if you want an errors and logging file, what you want to call them as well.

But when we’re done, we get a script.

The script will create the directory, if you need it, grant privs, if you need them, and drop your staging table, if you want to. Those steps are optional, which is why they are commented out.

And I tweaked my script even further, changing out the INSERT script to include a function call…but setting up the table from the CSV file was a lot easier using the wizard.

SET DEFINE OFF
--CREATE OR REPLACE DIRECTORY DATA_PUMP_DIR AS '/Users/oracle/data_loads';
--GRANT READ ON DIRECTORY DATA_PUMP_DIR TO hr;
--GRANT WRITE ON DIRECTORY DATA_PUMP_DIR TO hr;
--drop table OPENDATA_TEST_STAGE;
CREATE
  TABLE OPENDATA_TEST_STAGE
  (
    NAME         VARCHAR2(256),
    AMENITY      VARCHAR2(256),
    ID           NUMBER(11),
    WHO          VARCHAR2(256),
    VISIBLE      VARCHAR2(26),
    SOURCE       VARCHAR2(512),
    OTHER_TAGS   VARCHAR2(4000),
    WHEN         VARCHAR2(40),
    GEO_POINT_2D VARCHAR2(26)
  )
  ORGANIZATION EXTERNAL
  (
    TYPE ORACLE_LOADER DEFAULT DIRECTORY ORDER_ENTRY ACCESS PARAMETERS
    (records delimited BY '\r\n' CHARACTERSET AL32UTF8
    BADFILE ORDER_ENTRY:' openstreetmap-pois-usa.bad'
    DISCARDFILE ORDER_ENTRY:' openstreetmap-pois-usa.discard'
    LOGFILE ORDER_ENTRY:' openstreetmap-pois-usa.log'
    skip 1
    FIELDS TERMINATED BY ';'
    OPTIONALLY ENCLOSED BY '"'
    AND '"'
    lrtrim
    missing FIELD VALUES are NULL
    ( NAME       CHAR(4000),
    AMENITY      CHAR(4000),
    ID           CHAR(4000),
    WHO          CHAR(4000),
    VISIBLE      CHAR(4000),
    SOURCE       CHAR(4000),
    OTHER_TAGS   CHAR(4000),
    WHEN         CHAR(4000),
    GEO_POINT_2D CHAR(4000)
    )
    ) LOCATION ('openstreetmap-pois-usa.csv')
  )
  REJECT LIMIT UNLIMITED;
 
 
SELECT * FROM OPENDATA_TEST_STAGE WHERE ROWNUM <= 100;
 
 
INSERT
INTO
  OPENDATA_TEST
  (
    NAME,
    AMENITY,
    ID,
    WHO,
    VISIBLE,
    SOURCE,
    OTHER_TAGS,
    WHEN,
    GEO_POINT_2D
  )
SELECT
  NAME,
  AMENITY,
  ID,
  WHO,
  VISIBLE,
  SOURCE,
  OTHER_TAGS,
  to_timestamp_tz(WHEN, 'YYYY-MM-DD"T"HH24:MI:SSTZR'),
  GEO_POINT_2D
FROM
  OPENDATA_TEST_STAGE ;
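
The script above assumes the target table OPENDATA_TEST already exists; its DDL is not shown in this post. A hypothetical sketch might look like the following, assuming the columns mirror the staging table and only WHEN changes type. The column sizes are copied from the staging definition and are an assumption, not my actual table.

```sql
-- Hypothetical target table: same shape as the staging table,
-- except WHEN is stored as a real timestamp instead of a string.
CREATE TABLE OPENDATA_TEST
  (
    NAME         VARCHAR2(256),
    AMENITY      VARCHAR2(256),
    ID           NUMBER(11),
    WHO          VARCHAR2(256),
    VISIBLE      VARCHAR2(26),
    SOURCE       VARCHAR2(512),
    OTHER_TAGS   VARCHAR2(4000),
    WHEN         TIMESTAMP WITH TIME ZONE,
    GEO_POINT_2D VARCHAR2(26)
  );
```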

A Small Tweak

My TABLE has a timestamp column. I REFUSE to store DATES as strings. It bites me in the butt EVERY SINGLE TIME. So what I did here, because I’m lazy, is define the EXTERNAL TABLE column containing the TIMESTAMP as a VARCHAR2. But in my INSERT..SELECT, I throw in a TO_TIMESTAMP_TZ to do the conversion.


EXTERNAL TABLES are marked in the navigator with the GREEN ARROW decorators. In the external table, my timestamp strings contain a literal ‘T’ character to mark the start of the ‘time’ portion.

The hardest part, for me, was figuring out the format mask that represented the timestamp data. After some trial and error I worked out that 2009-03-08T19:25:16-04:00 equates to YYYY-MM-DD"T"HH24:MI:SSTZR. I got tripped up because I was wrapping the T in single quotes instead of double quotes. And then I got tripped up again because I was using TO_TIMESTAMP() instead of TO_TIMESTAMP_TZ().
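
A cheap way to do that trial and error is to test the mask against a single literal before touching the 750k rows. This is a minimal sketch using the sample value and the mask from this post; the PARSED_WHEN alias is just an illustrative name.

```sql
-- Try the format mask on one value before running the full load.
-- The double-quoted "T" in the mask is matched as literal text.
-- TO_TIMESTAMP_TZ (not TO_TIMESTAMP) is needed to keep the -04:00 offset.
SELECT TO_TIMESTAMP_TZ('2009-03-08T19:25:16-04:00',
                       'YYYY-MM-DD"T"HH24:MI:SSTZR') AS parsed_when
FROM   dual;
```

If the mask is wrong, this fails right away with a format error instead of failing partway through the INSERT.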

With my boo-boos fixed, instead of taking 5, almost 6, minutes to run:

747,973 ROWS inserted.
 
Elapsed: 00:00:27.987
Commit complete.
 
Elapsed: 00:00:00.156

Not too shabby. And the CREATE TABLE…ORGANIZATION EXTERNAL itself is instantaneous. The data isn’t read in until you need it.
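
You can see that for yourself by querying the staging table: the CSV is only read when a query runs against it. A minimal sketch, using the staging table from the script above (the STAGED_ROWS alias is illustrative):

```sql
-- The file is read at query time, so this is the point where
-- the database actually opens and parses the CSV.
SELECT COUNT(*) AS staged_rows
FROM   OPENDATA_TEST_STAGE;
```

Because the table was created with REJECT LIMIT UNLIMITED, rows that fail to parse are skipped rather than aborting the query. If the count comes back lower than you expect, check the BADFILE and LOGFILE named in the access parameters.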

Last time I checked, 28 seconds vs 5 minutes is a lot better. Even on my VirtualBox database running on my MacBook Air.

