TL;DR:
DBMS_CLOUD.CREATE_EXTERNAL_TABLE creates an external table that reads Parquet directly from S3. Define the table once, then query it with standard SQL. No data loading, no ETL.
With S3 connectivity working (Part 1 and Part 2), this post shows how to create DBMS_CLOUD external tables over Parquet files in S3 and query them from Oracle 26ai. Performance numbers throughout come from our test environment.
Why Query Parquet from Oracle?
Ad-hoc analytics on lake data. Your data engineers write Parquet to a data lake. Your analysts know SQL, not PySpark. Create an external table and let them query it directly.
Data validation before loading. You received a multi-GB Parquet dataset in S3. Before running a full ETL load, create an external table to inspect the schema, check row counts, validate data quality, and run aggregations without importing a single row.
Test Environment
All scenarios in this post come from:
– Database server: Oracle 26ai Enterprise Edition (23.26.1.0.0)
– S3 endpoint: Pure Storage FlashBlade on the same LAN (~0.1ms RTT)
– Dataset: 20 Parquet files, ~552 MB each, 11 GB total, 2,556,520 rows, 41 columns
– Parallelism: DOP 40 (from cpu_count=40, parallel_degree_policy=MANUAL)
Your performance will vary with network latency, file count/size, and available parallelism.
Prerequisites
- Completed Part 1 and Part 2 (S3 connectivity, credentials, DNS, wallet, ACLs)
- SYSDBA access to the PDB
- Parquet files in an S3-compatible bucket
- python3 with pyarrow on the Oracle server (for schema inspection)
How to Create a DBMS_CLOUD External Table for Parquet
Step 1: List S3 Contents
Use DBMS_CLOUD to see what’s in the bucket:
SELECT object_name, bytes
FROM DBMS_CLOUD.LIST_OBJECTS(
credential_name => 'FLASHBLADE_CRED',
location_uri => 'https://s3.us-east-1.amazonaws.com/lb-bronze/customer/interactions/'
)
WHERE ROWNUM <= 5;
OBJECT_NAME BYTES
-------------------------- -----------
part-000000.parquet 552775111
part-000001.parquet 552836156
part-000002.parquet 552903890
part-000003.parquet 552932410
part-000004.parquet 552904057
20 Parquet files, ~552 MB each, totaling ~11 GB.
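As a quick sanity check, the byte counts from the listing line up with the stated totals. A few lines of Python (using the five sizes shown above and extrapolating to all 20 files) confirm the arithmetic:

```python
# Byte counts for the first five objects, as returned by LIST_OBJECTS above.
sizes = [552775111, 552836156, 552903890, 552932410, 552904057]

avg_bytes = sum(sizes) / len(sizes)
est_total_gb = 20 * avg_bytes / 1e9  # extrapolate the average to all 20 files

print(f"avg file size: {avg_bytes / 1e6:.0f} MB")   # ~553 MB
print(f"estimated total: {est_total_gb:.1f} GB")    # ~11.1 GB
```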
Step 2: Inspect the Parquet Schema
Download one file and inspect with PyArrow:
aws s3 cp s3://lb-bronze/customer/interactions/part-000000.parquet /tmp/sample.parquet \
--endpoint-url https://s3.us-east-1.amazonaws.com
python3 -c "
import pyarrow.parquet as pq
t = pq.read_table('/tmp/sample.parquet')
print(f'Rows: {t.num_rows}, Columns: {t.num_columns}')
for f in t.schema:
print(f'{f.name:40s} {f.type}')
"
Rows: 127826, Columns: 41
id int64
row_id int64
event_timestamp timestamp[us]
event_id string
session_id string
customer_id int64
email_raw string
...
transaction_amount double
...
loyalty_tier string
...
interaction_payload string
Step 3: Create the External Table with DBMS_CLOUD
Use DBMS_CLOUD.CREATE_EXTERNAL_TABLE to point directly at S3. The file_uri_list accepts wildcards:
BEGIN
DBMS_CLOUD.CREATE_EXTERNAL_TABLE(
table_name => 'BRONZE_S3',
credential_name => 'FLASHBLADE_CRED',
file_uri_list => 'https://s3.us-east-1.amazonaws.com/lb-bronze/customer/interactions/*.parquet',
format => JSON_OBJECT('type' VALUE 'parquet'),
column_list => '
id NUMBER,
row_id NUMBER,
event_timestamp TIMESTAMP,
event_id VARCHAR2(4000),
session_id VARCHAR2(4000),
customer_id NUMBER,
email_raw VARCHAR2(4000),
phone_raw VARCHAR2(4000),
interaction_type VARCHAR2(4000),
product_id VARCHAR2(4000),
product_category VARCHAR2(4000),
transaction_amount BINARY_DOUBLE,
currency VARCHAR2(100),
channel VARCHAR2(200),
device_type VARCHAR2(200),
browser VARCHAR2(200),
ip_address VARCHAR2(200),
city_raw VARCHAR2(500),
state_raw VARCHAR2(500),
zip_code VARCHAR2(100),
page_views NUMBER,
time_on_site_seconds NUMBER,
bounce_rate BINARY_DOUBLE,
click_count NUMBER,
cart_value BINARY_DOUBLE,
items_in_cart NUMBER,
support_ticket_id VARCHAR2(4000),
issue_category VARCHAR2(4000),
satisfaction_score NUMBER,
campaign_id VARCHAR2(4000),
utm_source VARCHAR2(4000),
utm_medium VARCHAR2(4000),
loyalty_member VARCHAR2(10),
loyalty_tier VARCHAR2(200),
points_earned NUMBER,
points_redeemed NUMBER,
data_source VARCHAR2(200),
data_quality_flag VARCHAR2(200),
raw_user_agent VARCHAR2(4000),
session_fingerprint VARCHAR2(4000),
interaction_payload VARCHAR2(4000)
'
);
END;
/
Under the hood, DBMS_CLOUD creates an ORGANIZATION EXTERNAL table with TYPE ORACLE_BIGDATA, the credential reference, and the S3 URL as the location.
Alternative: Auto-Schema Inference
Instead of manually specifying column_list, let Oracle infer the schema from the first Parquet file:
BEGIN
DBMS_CLOUD.CREATE_EXTERNAL_TABLE(
table_name => 'BRONZE_S3_AUTO',
credential_name => 'FLASHBLADE_CRED',
file_uri_list => 'https://s3.us-east-1.amazonaws.com/lb-bronze/customer/interactions/*.parquet',
format => JSON_OBJECT('type' VALUE 'parquet', 'schema' VALUE 'first')
);
END;
/
This reads the schema from the first file and creates columns automatically. Type mappings differ slightly from the manual column_list: Parquet bool maps to NUMBER(1,0) (0/1) rather than VARCHAR2(10) ('true'/'false'). The trade-off is less control over column sizing; strings default to VARCHAR2(4000).
Step 4: Query the External Table
Row count (all 20 files):
SELECT COUNT(*) FROM bronze_s3;
COUNT(*)
----------
2556520
Elapsed: 00:00:00.42
2.5 million rows from 11 GB of Parquet in S3, counted in 0.42 seconds.
Sample rows:
SELECT customer_id, email_raw, interaction_type, transaction_amount, channel
FROM bronze_s3
WHERE ROWNUM <= 5;
CUSTOMER_ID EMAIL_RAW INTERACTION_TYPE TRANSACTION_AMOUNT CHANNEL
----------- ------------------------- ---------------- ------------------ ----------
5 user184373@gmail.com browse 0 social_media
7 user808669@outlook.com purchase 101.30 web
7 user707229@hotmail.com purchase 13.25 mobile_app
3 user441693@icloud.com login 0 social_media
2 user334394@yahoo.com login 0 web
Elapsed: 00:00:00.24
Aggregation queries:
SELECT interaction_type, COUNT(*) cnt
FROM bronze_s3
GROUP BY interaction_type
ORDER BY cnt DESC;
INTERACTION_TYPE CNT
------------------- --------
browse 895159
login 505040
purchase 461137
abandoned_cart 385081
support 310103
Elapsed: 00:00:00.41
SELECT COUNT(*) total_purchases,
ROUND(SUM(transaction_amount), 2) total_revenue,
ROUND(AVG(transaction_amount), 2) avg_transaction
FROM bronze_s3
WHERE interaction_type = 'purchase';
TOTAL_PURCHASES TOTAL_REVENUE AVG_TRANSACTION
--------------- ------------- ---------------
461137 69304299.63 150.28
Elapsed: 00:00:00.42
Step 5: Understanding the Execution Plan
EXPLAIN PLAN FOR
SELECT channel, COUNT(*), SUM(transaction_amount)
FROM bronze_s3
GROUP BY channel;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 10223 |
| 1 | PX COORDINATOR | | | |
| 2 | PX SEND QC (RANDOM) | :TQ10001 | 1 | 10223 |
| 3 | HASH GROUP BY | | 1 | 10223 |
| 4 | PX RECEIVE | | 1 | 10223 |
| 5 | PX SEND HASH | :TQ10000 | 1 | 10223 |
| 6 | HASH GROUP BY | | 1 | 10223 |
| 7 | PX BLOCK ITERATOR | | | |
| 8 | EXTERNAL TABLE ACCESS FULL | BRONZE_S3 | 2.5M | 10217 |
---------------------------------------------------------------------------
Note: automatic DOP: Computed Degree of Parallelism is 40
Key observations:
– EXTERNAL TABLE ACCESS FULL: Oracle reads the Parquet files from S3, distributing work across 40 parallel query slaves
– PX BLOCK ITERATOR: each parallel slave reads different files/blocks from S3
– HASH GROUP BY at both PX slave and coordinator levels: distributed aggregation
– DOP = 40: computed from cpu_count (40) with parallel_degree_policy=MANUAL
Parquet to Oracle Type Mapping
| Parquet Type | Oracle Type | Notes |
|---|---|---|
| int64 | NUMBER | Works directly |
| int32 | NUMBER | Works directly |
| double | BINARY_DOUBLE | Use BINARY_DOUBLE, not NUMBER, for Parquet doubles |
| string | VARCHAR2(4000) | 4000 is safe default; use tighter sizes where you know max length |
| bool | VARCHAR2(10) | Manual column_list: 'true'/'false' strings. Auto-inference: NUMBER(1,0) with 0/1 |
| timestamp[us] | TIMESTAMP | Microsecond precision preserved |
| date32[day] | DATE | Works directly |
| decimal128(p,s) | NUMBER(p,s) | Match precision and scale |
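The table above can be turned into a small generator that emits a column_list from a PyArrow-style schema dump (like the one printed in Step 2). This is an illustrative sketch, not part of DBMS_CLOUD; the to_oracle and column_list helpers are our own, and the VARCHAR2(4000) default follows the manual column_list convention used earlier:

```python
import re

def to_oracle(parquet_type: str) -> str:
    """Map a PyArrow type string to an Oracle type, per the table above.

    Follows the manual column_list convention: bool becomes VARCHAR2(10)
    holding 'true'/'false' strings.
    """
    fixed = {
        "int64": "NUMBER",
        "int32": "NUMBER",
        "double": "BINARY_DOUBLE",
        "string": "VARCHAR2(4000)",
        "bool": "VARCHAR2(10)",
        "timestamp[us]": "TIMESTAMP",
        "date32[day]": "DATE",
    }
    if parquet_type in fixed:
        return fixed[parquet_type]
    m = re.fullmatch(r"decimal128\((\d+),\s*(\d+)\)", parquet_type)
    if m:
        return f"NUMBER({m.group(1)},{m.group(2)})"
    raise ValueError(f"unmapped Parquet type: {parquet_type}")

def column_list(schema: list[tuple[str, str]]) -> str:
    """Render (name, parquet_type) pairs as a DBMS_CLOUD column_list string."""
    return ",\n".join(f"{name} {to_oracle(t)}" for name, t in schema)

print(column_list([
    ("id", "int64"),
    ("event_timestamp", "timestamp[us]"),
    ("transaction_amount", "double"),
    ("loyalty_member", "bool"),
]))
```

For a 41-column table, generating the column_list this way is less error-prone than transcribing the schema by hand.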
Gotchas
No Column Pruning
Oracle reads all columns from each Parquet file, regardless of your SELECT list. There’s no projection pushdown into the Parquet format. We confirmed this by timing SELECT COUNT(customer_id) (1 column) vs SELECT COUNT(*) (all 41 columns) on the same 2.5M-row dataset: 1.68s vs 1.49s, essentially identical. For wide tables, queries that only need a few columns still read everything. Plan your data layout accordingly.
File Count Affects Performance
More files means more S3 GET requests and HTTP round-trips. We saw this clearly comparing Bronze (20 files, 0.42s for COUNT) vs Silver (730 files, 2.86s for COUNT) on similar row counts. Fewer, larger files perform better for full-scan analytics.
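A rough linear model over the two measurements above puts a number on the per-file cost. Treating total time as base + n × per_file is a simplification (it ignores parallelism and caching), but it illustrates why file count matters:

```python
# Two COUNT(*) measurements from our environment: (file_count, seconds).
bronze = (20, 0.42)
silver = (730, 2.86)

# Fit time = base + n * per_file through the two points.
per_file = (silver[1] - bronze[1]) / (silver[0] - bronze[0])
base = bronze[1] - bronze[0] * per_file

print(f"~{per_file * 1000:.1f} ms of overhead per file")  # ~3.4 ms
print(f"~{base:.2f} s base cost")                         # ~0.35 s
```

At roughly 3-4 ms of overhead per file, a few hundred small files add seconds to every full scan.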
Wildcards in file_uri_list
DBMS_CLOUD supports * wildcards in file_uri_list. The pattern *.parquet matches all Parquet files in the directory. For partitioned data with subdirectories, use */*.parquet to match one level of nesting.
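To reason about which keys a pattern covers, the semantics described above (each * matches within a single path level and does not cross /) can be emulated locally. This is our own sketch under that assumption, not the DBMS_CLOUD implementation, and the dt=2024-01-01 partition prefix is a hypothetical example:

```python
import re

def matches(pattern: str, key: str) -> bool:
    """Emulate the wildcard semantics described above: each '*' matches
    any run of characters within one path level (never across '/')."""
    regex = "".join("[^/]*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.fullmatch(regex, key) is not None

assert matches("*.parquet", "part-000000.parquet")
assert not matches("*.parquet", "dt=2024-01-01/part-000000.parquet")   # nested
assert matches("*/*.parquet", "dt=2024-01-01/part-000000.parquet")
```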
Boolean Column Handling
With manual column_list, Parquet bool maps to VARCHAR2(10) giving 'true' and 'false' strings. Filter with WHERE loyalty_member = 'true'. With auto-inference ('schema' VALUE 'first'), bool maps to NUMBER(1,0) with values 0 and 1, which is generally better for filtering and aggregation.
Content-Encoding: aws-chunked
Some S3 clients upload objects with Content-Encoding: aws-chunked. The KUPC module cannot decode this and fails with KUP-13015: unsupported algorithm. Fix by re-copying affected objects:
aws s3 cp --recursive s3://bucket/path/ s3://bucket/path/ \
--endpoint-url https://s3.us-east-1.amazonaws.com \
--content-encoding "" --metadata-directive REPLACE
Summary
DBMS_CLOUD.CREATE_EXTERNAL_TABLE creates S3-backed external tables that query Parquet without loading data into Oracle. Auto-schema inference saves time but limits control over column sizing. Oracle reads all columns regardless of SELECT list (no column pruning), and fewer files means faster queries due to reduced S3 round-trips. Parallel execution is automatic.
