Querying Parquet Files on S3 with DBMS_CLOUD in Oracle 26ai

TL;DR: DBMS_CLOUD.CREATE_EXTERNAL_TABLE creates an external table that reads Parquet directly from S3. Define the table once, query it with standard SQL. No data loading, no ETL.

With S3 connectivity working (Part 1 and Part 2), this post shows how to create DBMS_CLOUD external tables over Parquet files in S3 and query them from Oracle 26ai. Performance numbers throughout come from our test environment.

Why Query Parquet from Oracle?

Ad-hoc analytics on lake data. Your data engineers write Parquet to a data lake. Your analysts know SQL, not PySpark. Create an external table and let them query it directly.

Data validation before loading. You received a multi-GB Parquet dataset in S3. Before running a full ETL load, create an external table to inspect the schema, check row counts, validate data quality, and run aggregations without importing a single row.

Test Environment

All scenarios in this post come from:
  • Database server: Oracle 26ai Enterprise Edition (23.26.1.0.0)
  • S3 endpoint: Pure Storage FlashBlade on the same LAN (~0.1 ms RTT)
  • Dataset: 20 Parquet files, ~552 MB each, 11 GB total, 2,556,520 rows, 41 columns
  • Parallelism: DOP 40 (from cpu_count=40, parallel_degree_policy=MANUAL)

Your performance will vary with network latency, file count/size, and available parallelism.

Prerequisites

  • Completed Part 1 and Part 2 (S3 connectivity, credentials, DNS, wallet, ACLs)
  • SYSDBA access to the PDB
  • Parquet files in an S3-compatible bucket
  • python3 with pyarrow on the Oracle server (for schema inspection)

How to Create a DBMS_CLOUD External Table for Parquet

Step 1: List S3 Contents

Use DBMS_CLOUD.LIST_OBJECTS to see what’s in the bucket:

SELECT object_name, bytes
FROM DBMS_CLOUD.LIST_OBJECTS(
  credential_name => 'FLASHBLADE_CRED',
  location_uri    => 'https://s3.us-east-1.amazonaws.com/lb-bronze/customer/interactions/'
)
WHERE ROWNUM <= 5;
OBJECT_NAME                BYTES
-------------------------- -----------
part-000000.parquet        552775111
part-000001.parquet        552836156
part-000002.parquet        552903890
part-000003.parquet        552932410
part-000004.parquet        552904057

20 Parquet files, ~552 MB each, totaling ~11 GB.
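The same LIST_OBJECTS table function can produce those totals directly instead of eyeballing the listing (a sketch reusing the credential and URI above):

```sql
SELECT COUNT(*)                               AS file_count,
       ROUND(SUM(bytes)/1024/1024/1024, 1)    AS total_gb
FROM DBMS_CLOUD.LIST_OBJECTS(
  credential_name => 'FLASHBLADE_CRED',
  location_uri    => 'https://s3.us-east-1.amazonaws.com/lb-bronze/customer/interactions/'
);
```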

Step 2: Inspect the Parquet Schema

Download one file and inspect with PyArrow:

aws s3 cp s3://lb-bronze/customer/interactions/part-000000.parquet /tmp/sample.parquet \
  --endpoint-url https://s3.us-east-1.amazonaws.com

python3 -c "
import pyarrow.parquet as pq
t = pq.read_table('/tmp/sample.parquet')
print(f'Rows: {t.num_rows}, Columns: {t.num_columns}')
for f in t.schema:
    print(f'{f.name:40s} {f.type}')
"
Rows: 127826, Columns: 41
id                                       int64
row_id                                   int64
event_timestamp                          timestamp[us]
event_id                                 string
session_id                               string
customer_id                              int64
email_raw                                string
...
transaction_amount                       double
...
loyalty_tier                             string
...
interaction_payload                      string
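If pyarrow isn’t available on a host, a cheap sanity check that a downloaded object really is Parquet needs only the standard library: every Parquet file begins and ends with the 4-byte magic `PAR1`. A minimal sketch (the path is illustrative):

```python
def looks_like_parquet(path: str) -> bool:
    """Cheap validity check: Parquet files begin and end with the magic bytes b'PAR1'."""
    with open(path, "rb") as f:
        head = f.read(4)     # magic at the start of the file
        f.seek(-4, 2)        # seek to 4 bytes before EOF
        tail = f.read(4)     # magic at the end, after the footer
    return head == b"PAR1" and tail == b"PAR1"

# Example: looks_like_parquet('/tmp/sample.parquet')
```

This won’t catch corruption inside the file, but it does catch the common failure of an HTML error page saved with a .parquet name.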

Step 3: Create the External Table with DBMS_CLOUD

Use DBMS_CLOUD.CREATE_EXTERNAL_TABLE to point directly at S3. The file_uri_list accepts wildcards:

BEGIN
  DBMS_CLOUD.CREATE_EXTERNAL_TABLE(
    table_name      => 'BRONZE_S3',
    credential_name => 'FLASHBLADE_CRED',
    file_uri_list   => 'https://s3.us-east-1.amazonaws.com/lb-bronze/customer/interactions/*.parquet',
    format          => JSON_OBJECT('type' VALUE 'parquet'),
    column_list     => '
      id NUMBER,
      row_id NUMBER,
      event_timestamp TIMESTAMP,
      event_id VARCHAR2(4000),
      session_id VARCHAR2(4000),
      customer_id NUMBER,
      email_raw VARCHAR2(4000),
      phone_raw VARCHAR2(4000),
      interaction_type VARCHAR2(4000),
      product_id VARCHAR2(4000),
      product_category VARCHAR2(4000),
      transaction_amount BINARY_DOUBLE,
      currency VARCHAR2(100),
      channel VARCHAR2(200),
      device_type VARCHAR2(200),
      browser VARCHAR2(200),
      ip_address VARCHAR2(200),
      city_raw VARCHAR2(500),
      state_raw VARCHAR2(500),
      zip_code VARCHAR2(100),
      page_views NUMBER,
      time_on_site_seconds NUMBER,
      bounce_rate BINARY_DOUBLE,
      click_count NUMBER,
      cart_value BINARY_DOUBLE,
      items_in_cart NUMBER,
      support_ticket_id VARCHAR2(4000),
      issue_category VARCHAR2(4000),
      satisfaction_score NUMBER,
      campaign_id VARCHAR2(4000),
      utm_source VARCHAR2(4000),
      utm_medium VARCHAR2(4000),
      loyalty_member VARCHAR2(10),
      loyalty_tier VARCHAR2(200),
      points_earned NUMBER,
      points_redeemed NUMBER,
      data_source VARCHAR2(200),
      data_quality_flag VARCHAR2(200),
      raw_user_agent VARCHAR2(4000),
      session_fingerprint VARCHAR2(4000),
      interaction_payload VARCHAR2(4000)
    '
  );
END;
/

Under the hood, DBMS_CLOUD creates an ORGANIZATION EXTERNAL table with TYPE ORACLE_BIGDATA, the credential reference, and the S3 URL as the location.
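As a rough sketch, the hand-written equivalent has this shape (abbreviated; the access parameter names here are an approximation, not Oracle’s literal output):

```sql
-- Sketch of roughly equivalent DDL, not the exact text DBMS_CLOUD generates
CREATE TABLE bronze_s3 (
  id NUMBER,
  row_id NUMBER
  -- ... remaining 39 columns ...
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_BIGDATA
  ACCESS PARAMETERS (
    com.oracle.bigdata.fileformat=parquet
    com.oracle.bigdata.credential.name=FLASHBLADE_CRED
  )
  LOCATION ('https://s3.us-east-1.amazonaws.com/lb-bronze/customer/interactions/*.parquet')
)
REJECT LIMIT UNLIMITED
PARALLEL;
```

To see what was actually created, use SELECT DBMS_METADATA.GET_DDL('TABLE', 'BRONZE_S3') FROM dual;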

Alternative: Auto-Schema Inference

Instead of manually specifying column_list, let Oracle infer the schema from the first Parquet file:

BEGIN
  DBMS_CLOUD.CREATE_EXTERNAL_TABLE(
    table_name      => 'BRONZE_S3_AUTO',
    credential_name => 'FLASHBLADE_CRED',
    file_uri_list   => 'https://s3.us-east-1.amazonaws.com/lb-bronze/customer/interactions/*.parquet',
    format          => JSON_OBJECT('type' VALUE 'parquet', 'schema' VALUE 'first')
  );
END;
/

This reads the schema from the first file and creates the columns automatically. Type mappings differ slightly from the manual approach: Parquet bool maps to NUMBER(1,0) (0/1) rather than VARCHAR2(10) ('true'/'false'). The trade-off is less control over column sizing: strings default to VARCHAR2(4000).
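To review what inference chose, query the standard data dictionary:

```sql
SELECT column_name, data_type, data_length
FROM user_tab_columns
WHERE table_name = 'BRONZE_S3_AUTO'
ORDER BY column_id;
```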

Step 4: Query the External Table

Row count (all 20 files):

SELECT COUNT(*) FROM bronze_s3;
  COUNT(*)
----------
   2556520
Elapsed: 00:00:00.42

2.5 million rows from 11 GB of Parquet in S3, counted in 0.42 seconds.

Sample rows:

SELECT customer_id, email_raw, interaction_type, transaction_amount, channel
FROM bronze_s3
WHERE ROWNUM <= 5;
CUSTOMER_ID EMAIL_RAW                 INTERACTION_TYPE TRANSACTION_AMOUNT CHANNEL
----------- ------------------------- ---------------- ------------------ ----------
          5 user184373@gmail.com      browse                            0 social_media
          7 user808669@outlook.com    purchase                    101.30 web
          7 user707229@hotmail.com    purchase                     13.25 mobile_app
          3 user441693@icloud.com     login                             0 social_media
          2 user334394@yahoo.com      login                             0 web
Elapsed: 00:00:00.24

Aggregation queries:

SELECT interaction_type, COUNT(*) cnt
FROM bronze_s3
GROUP BY interaction_type
ORDER BY cnt DESC;
INTERACTION_TYPE        CNT
------------------- --------
browse                895159
login                 505040
purchase              461137
abandoned_cart        385081
support               310103
Elapsed: 00:00:00.41

SELECT COUNT(*) total_purchases,
       ROUND(SUM(transaction_amount), 2) total_revenue,
       ROUND(AVG(transaction_amount), 2) avg_transaction
FROM bronze_s3
WHERE interaction_type = 'purchase';
TOTAL_PURCHASES TOTAL_REVENUE AVG_TRANSACTION
--------------- ------------- ---------------
         461137   69304299.63          150.28
Elapsed: 00:00:00.42

Step 5: Understanding the Execution Plan

EXPLAIN PLAN FOR
SELECT channel, COUNT(*), SUM(transaction_amount)
FROM bronze_s3
GROUP BY channel;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
---------------------------------------------------------------------------
| Id | Operation                          | Name      | Rows  | Cost      |
---------------------------------------------------------------------------
|  0 | SELECT STATEMENT                   |           |     1 | 10223     |
|  1 |  PX COORDINATOR                    |           |       |           |
|  2 |   PX SEND QC (RANDOM)              | :TQ10001  |     1 | 10223     |
|  3 |    HASH GROUP BY                   |           |     1 | 10223     |
|  4 |     PX RECEIVE                     |           |     1 | 10223     |
|  5 |      PX SEND HASH                  | :TQ10000  |     1 | 10223     |
|  6 |       HASH GROUP BY                |           |     1 | 10223     |
|  7 |        PX BLOCK ITERATOR           |           |       |           |
|  8 |         EXTERNAL TABLE ACCESS FULL | BRONZE_S3 | 2.5M  | 10217     |
---------------------------------------------------------------------------
Note: automatic DOP: Computed Degree of Parallelism is 40

Key observations:
  • EXTERNAL TABLE ACCESS FULL: Oracle reads the Parquet files from S3, distributing work across 40 parallel query slaves
  • PX BLOCK ITERATOR: each parallel slave reads different files/blocks from S3
  • HASH GROUP BY at both the PX slave and coordinator levels: distributed aggregation
  • DOP = 40: computed from cpu_count (40) with parallel_degree_policy=MANUAL
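One way to confirm the runtime degree (rather than the planned one) is to check session PX statistics right after running the query; v$pq_sesstat reports the last parallel query in the current session (statistic names as commonly documented):

```sql
SELECT statistic, last_query
FROM v$pq_sesstat
WHERE statistic IN ('Queries Parallelized', 'Allocation Height', 'Server Threads');
```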

Parquet to Oracle Type Mapping

Parquet Type      Oracle Type      Notes
----------------  ---------------  ---------------------------------------------------------------
int64             NUMBER           Works directly
int32             NUMBER           Works directly
double            BINARY_DOUBLE    Use BINARY_DOUBLE, not NUMBER, for Parquet doubles
string            VARCHAR2(4000)   4000 is a safe default; use tighter sizes where max length is known
bool              VARCHAR2(10)     Manual column_list: 'true'/'false' strings. Auto-inference: NUMBER(1,0) with 0/1
timestamp[us]     TIMESTAMP        Microsecond precision preserved
date32[day]       DATE             Works directly
decimal128(p,s)   NUMBER(p,s)      Match precision and scale
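The mapping above is mechanical enough to script. A sketch that turns the type names printed by the Step 2 schema dump into a column_list fragment; the lookup table mirrors this post, and anything unlisted falls back to VARCHAR2(4000) for manual review:

```python
# Map PyArrow type names (as printed by the Step 2 schema dump) to Oracle column types.
# decimal128(p,s) is intentionally omitted: it needs per-column precision/scale,
# so unlisted types fall back to VARCHAR2(4000) and get edited by hand.
ORACLE_TYPES = {
    "int64": "NUMBER",
    "int32": "NUMBER",
    "double": "BINARY_DOUBLE",
    "string": "VARCHAR2(4000)",
    "bool": "VARCHAR2(10)",          # manual column_list convention: 'true'/'false'
    "timestamp[us]": "TIMESTAMP",
    "date32[day]": "DATE",
}

def column_list(schema_pairs):
    """Build a DBMS_CLOUD column_list string from (name, pyarrow_type_name) pairs."""
    return ",\n".join(
        f"{name} {ORACLE_TYPES.get(ptype, 'VARCHAR2(4000)')}"
        for name, ptype in schema_pairs
    )

print(column_list([
    ("customer_id", "int64"),
    ("transaction_amount", "double"),
    ("event_timestamp", "timestamp[us]"),
]))
```

For the three sample columns this prints `customer_id NUMBER`, `transaction_amount BINARY_DOUBLE`, and `event_timestamp TIMESTAMP`, comma-separated across lines, ready to paste into CREATE_EXTERNAL_TABLE.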

Gotchas

No Column Pruning

Oracle reads all columns from each Parquet file, regardless of your SELECT list. There’s no projection pushdown into the Parquet format. We confirmed this by timing SELECT COUNT(customer_id) (1 column) vs SELECT COUNT(*) (all 41 columns) on the same 2.5M-row dataset: 1.68s vs 1.49s, essentially identical. For wide tables, queries that only need a few columns still read everything. Plan your data layout accordingly.

File Count Affects Performance

More files means more S3 GET requests and HTTP round-trips. We saw this clearly comparing Bronze (20 files, 0.42s for COUNT) vs Silver (730 files, 2.86s for COUNT) on similar row counts. Fewer, larger files perform better for full-scan analytics.

Wildcards in file_uri_list

DBMS_CLOUD supports * wildcards in file_uri_list. The pattern *.parquet matches all Parquet files in the directory. For partitioned data with subdirectories, use */*.parquet to match one level of nesting.

Boolean Column Handling

With manual column_list, Parquet bool maps to VARCHAR2(10) giving 'true' and 'false' strings. Filter with WHERE loyalty_member = 'true'. With auto-inference ('schema' VALUE 'first'), bool maps to NUMBER(1,0) with values 0 and 1, which is generally better for filtering and aggregation.
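Side by side, the two filters look like this (BRONZE_S3_AUTO is the auto-inferred table created earlier):

```sql
-- Manual column_list: boolean stored as VARCHAR2(10)
SELECT COUNT(*) FROM bronze_s3 WHERE loyalty_member = 'true';

-- Auto-inferred schema: boolean stored as NUMBER(1,0)
SELECT COUNT(*) FROM bronze_s3_auto WHERE loyalty_member = 1;
```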

Content-Encoding: aws-chunked

Some S3 clients upload objects with Content-Encoding: aws-chunked. The KUPC module cannot decode this and fails with KUP-13015: unsupported algorithm. Fix by re-copying affected objects:

aws s3 cp --recursive s3://bucket/path/ s3://bucket/path/ \
  --endpoint-url https://s3.us-east-1.amazonaws.com \
  --content-encoding "" --metadata-directive REPLACE

Summary

DBMS_CLOUD.CREATE_EXTERNAL_TABLE creates S3-backed external tables that query Parquet without loading data into Oracle. Auto-schema inference saves time but limits control over column sizing. Oracle reads all columns regardless of SELECT list (no column pruning), and fewer files means faster queries due to reduced S3 round-trips. Parallel execution is automatic.