Oracle 26ai can join external tables backed by S3 Parquet files with regular internal tables in a single SQL statement, without data movement, ETL pipelines, or Spark. The optimizer treats S3-backed external tables as another data source and picks the right join strategy automatically.
This post shows how to build a customer dimension table from S3 data, join it back against the external table for analytics, compare execution plans, and measure the actual performance. On 2.5 million rows read directly from S3, hybrid joins complete in under a second.
If you followed Part 1, Part 2, and Part 3, you have S3-backed external tables over Bronze (raw Parquet), Silver (cleaned Parquet from Iceberg), and Gold (aggregated Parquet from Iceberg). Now we join them with internal Oracle tables. The Silver and Gold external tables point at the Parquet data files produced by Iceberg, not at the Iceberg metadata. Oracle treats them identically to the Bronze Parquet tables.
Where You’d Use This
Enriching lake data with reference tables. Your data lake has raw interaction events in S3. Your Oracle database has customer master data, product catalogs, or pricing tables. Join them without importing the lake data.
Building dimension tables from lake data. Pull a customer dimension from S3 Parquet into Oracle via CTAS, add a primary key and indexes, then join it back against the S3-backed external fact table. Indexed lookups on the dimension, parallel S3 scans on the fact data.
Prerequisites
- Completed Part 3 (BRONZE_S3, SILVER_INTERACTIONS, and GOLD_DASHBOARD external tables from S3)
- SYSDBA access to the PDB
- Familiarity with Oracle execution plans (DBMS_XPLAN.DISPLAY)
Step-by-Step
Step 1: Build an Internal Customers Dimension
Create a customers table from the S3-backed bronze external table using CTAS. This aggregates 2.5 million interaction rows into one row per customer:
CREATE TABLE customers AS
SELECT customer_id,
MAX(email_raw) AS email,
MAX(city_raw) AS city,
MAX(state_raw) AS state,
MIN(event_timestamp) AS first_seen,
MAX(event_timestamp) AS last_seen,
COUNT(*) AS interaction_count,
SUM(CASE WHEN interaction_type = 'purchase' THEN 1 ELSE 0 END) AS purchase_count,
ROUND(SUM(CASE WHEN interaction_type = 'purchase' THEN transaction_amount ELSE 0 END), 2) AS total_spend,
MAX(loyalty_tier) AS loyalty_tier,
MAX(channel) AS preferred_channel
FROM bronze_s3
GROUP BY customer_id;
Table created.
Elapsed: 00:00:01.43
26,151 customers from 2,556,520 interactions. The CTAS reads the S3 dataset (20 Parquet files), aggregates in parallel (DOP 40), and writes the result to an internal heap table in 1.43 seconds.
Add a primary key and gather statistics:
ALTER TABLE customers ADD CONSTRAINT pk_customers PRIMARY KEY (customer_id);
EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'CUSTOMERS');
The primary key gives the optimizer an index for nested loop joins. Statistics let it estimate cardinalities accurately.
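Both are easy to confirm from the data dictionary before moving on; a quick sketch using the standard USER_INDEXES and USER_TABLES views:

```sql
-- The PK constraint should have created a unique index
SELECT index_name, uniqueness, status
FROM   user_indexes
WHERE  table_name = 'CUSTOMERS';

-- Statistics should be present and recent
SELECT num_rows, last_analyzed
FROM   user_tables
WHERE  table_name = 'CUSTOMERS';
```

If NUM_ROWS is null or LAST_ANALYZED is stale, rerun DBMS_STATS.GATHER_TABLE_STATS before testing joins.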
Step 2: Verify the Dimension
SELECT COUNT(*) FROM customers;
COUNT(*)
----------
26151
SELECT customer_id, email, interaction_count, purchase_count, total_spend
FROM customers
ORDER BY total_spend DESC
FETCH FIRST 5 ROWS ONLY;
CUSTOMER_ID EMAIL INTERACTION_COUNT PURCHASE_COUNT TOTAL_SPEND
----------- ------------------------- ----------------- -------------- -----------
1 user9999@outlook.com 978844 175675 26437946.9
2 user99999@gmail.com 346637 62455 9377389.42
3 user999996@outlook.com 188646 34030 5084541.66
4 user999982@outlook.com 121647 22095 3288312.38
5 user9999@yahoo.com 87362 15724 2361275.82
Step 3: Hybrid Query, Top Customers by Spend
Join the internal dimension against the S3-backed external fact table. The subquery aggregates purchases from S3 Parquet; the outer query enriches with customer metadata from Oracle:
SELECT c.customer_id, c.email, c.interaction_count,
ext.purchase_count AS ext_purchases,
ROUND(ext.total_spend, 2) AS ext_total_spend
FROM customers c
JOIN (
SELECT customer_id, COUNT(*) AS purchase_count, SUM(transaction_amount) AS total_spend
FROM bronze_s3
WHERE interaction_type = 'purchase'
GROUP BY customer_id
) ext ON c.customer_id = ext.customer_id
ORDER BY ext.total_spend DESC NULLS LAST
FETCH FIRST 10 ROWS ONLY;
CUSTOMER_ID EMAIL INTERACTION_COUNT EXT_PURCHASES EXT_TOTAL_SPEND
----------- ------------------------- ----------------- ------------- ---------------
1 user9999@outlook.com 978844 175675 26437946.90
2 user99999@gmail.com 346637 62455 9377389.42
3 user999996@outlook.com 188646 34030 5084541.66
4 user999982@outlook.com 121647 22095 3288312.38
5 user9999@yahoo.com 87362 15724 2361275.82
6 user999996@outlook.com 66621 12115 1874226.37
7 user999964@outlook.com 52606 9437 1413831.60
8 user999887@gmail.com 43332 7841 1129578.43
9 user999970@yahoo.com 36136 6425 1013184.00
10 user999985@gmail.com 30949 5634 861616.76
Elapsed: 00:00:00.60
Oracle reads 20 Parquet files from S3, filters to purchases, aggregates by customer, then joins against the internal customers table. 0.60 seconds.
Step 4: Hybrid Query, Customer Segmentation
Use the internal dimension to segment customers, then aggregate S3-backed fact data per segment:
SELECT
CASE
WHEN c.interaction_count > 10 THEN 'Power User'
WHEN c.interaction_count > 3 THEN 'Active'
ELSE 'Casual'
END AS segment,
COUNT(DISTINCT c.customer_id) AS customers,
COUNT(*) AS total_interactions,
ROUND(AVG(b.transaction_amount), 2) AS avg_txn,
ROUND(SUM(b.transaction_amount), 2) AS total_revenue
FROM customers c
JOIN bronze_s3 b ON c.customer_id = b.customer_id
WHERE b.interaction_type = 'purchase'
GROUP BY
CASE
WHEN c.interaction_count > 10 THEN 'Power User'
WHEN c.interaction_count > 3 THEN 'Active'
ELSE 'Casual'
END
ORDER BY total_revenue DESC;
SEGMENT CUSTOMERS TOTAL_INTERACTIONS AVG_TXN TOTAL_REVENUE
------------ --------- ----------------- --------- -------------
Power User 2060 452575 150.61 68161937.30
Casual 4639 4835 153.47 742011.12
Active 1632 2610 153.39 400351.21
Elapsed: 00:00:00.52
2,060 power users (7.9% of customers) drive 98.4% of revenue. This query joins 26,151 internal rows against 2,556,520 S3-backed external rows, filters to purchases, segments by interaction count, and aggregates in 0.52 seconds.
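The revenue-share figure can also be computed in one pass with Oracle's RATIO_TO_REPORT analytic function instead of post-processing the output by hand; a sketch over the same hybrid join:

```sql
SELECT CASE
         WHEN c.interaction_count > 10 THEN 'Power User'
         WHEN c.interaction_count > 3  THEN 'Active'
         ELSE 'Casual'
       END AS segment,
       -- Each segment's share of total purchase revenue, as a percentage
       ROUND(100 * RATIO_TO_REPORT(SUM(b.transaction_amount)) OVER (), 1) AS revenue_pct
FROM   customers c
JOIN   bronze_s3 b ON c.customer_id = b.customer_id
WHERE  b.interaction_type = 'purchase'
GROUP  BY CASE
            WHEN c.interaction_count > 10 THEN 'Power User'
            WHEN c.interaction_count > 3  THEN 'Active'
            ELSE 'Casual'
          END
ORDER  BY revenue_pct DESC;
```

RATIO_TO_REPORT runs as an analytic over the grouped rows, so the percentage denominator is total revenue across all segments.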
Step 5: Cross-Layer Query, Internal + Silver from S3
Join the internal customers table with the Silver S3 external table to compare raw and cleaned data:
SELECT c.customer_id, c.email AS raw_email, s.email_clean AS cleaned_email,
c.state AS raw_state, s.state_standardized AS clean_state,
s.customer_value_tier, s.churn_risk_indicator, s.engagement_score
FROM customers c
JOIN silver_interactions s ON c.customer_id = s.customer_id
WHERE ROWNUM <= 10;
CUSTOMER_ID RAW_EMAIL CLEANED_EMAIL RAW_STATE CLEAN_STATE CUSTOMER_VALUE_TIER CHURN_RISK_INDICATOR ENGAGEMENT_SCORE
----------- ---------------------------- ---------------------------- ---------- ------------ -------------------- -------------------- ----------------
4 user999982@outlook.com user337396@hotmail.com tx PA browser_only unknown_risk 3
8 user999887@gmail.com user987427@gmail.com tx IL browser_only unknown_risk 2
2 user99999@gmail.com user131575@outlook.com tx NY medium_value unknown_risk 1
1 user9999@outlook.com user641643@yahoo.com tx CA browser_only unknown_risk 0
57 user999302@gmail.com user377056@outlook.com tx TX low_value unknown_risk 2
2 user99999@gmail.com user312539@yahoo.com tx NY low_value unknown_risk 4
15 user999945@yahoo.com user531743@yahoo.com tx ILLINOIS browser_only unknown_risk 4
7 user999964@outlook.com user756613@hotmail.com tx PA browser_only unknown_risk 0
5 user9999@yahoo.com user244092@icloud.com tx AZ browser_only unknown_risk 0
2 user99999@gmail.com user628848@hotmail.com tx CA browser_only low_risk 0
Elapsed: 00:00:00.23
This joins an internal Oracle table with a Silver-layer external table. The result shows raw values from the dimension alongside cleaned/enriched values from the data lake: email normalization, state standardization, customer value tier, churn risk.
Step 6: Check the Execution Plan
Here’s how Oracle actually joins them:
EXPLAIN PLAN FOR
SELECT c.customer_id, COUNT(*), SUM(b.transaction_amount)
FROM customers c
JOIN bronze_s3 b ON c.customer_id = b.customer_id
WHERE b.interaction_type = 'purchase'
GROUP BY c.customer_id;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 10222 |
| 1 | PX COORDINATOR | | | |
| 2 | PX SEND QC (RANDOM) | :TQ10001 | 1 | 10222 |
| 3 | HASH GROUP BY | | 1 | 10222 |
| 4 | PX RECEIVE | | 1 | 10222 |
| 5 | PX SEND HASH | :TQ10000 | 1 | 10222 |
| 6 | HASH GROUP BY | | 1 | 10222 |
| 7 | NESTED LOOPS | | 1.1M | 10220 |
| 8 | PX BLOCK ITERATOR | | | |
|* 9 | EXTERNAL TABLE ACCESS STORAGE FULL| BRONZE_S3| 1.1M | 10217 |
|*10 | INDEX UNIQUE SCAN | PK_CUSTOMERS | 1 | 0 |
---------------------------------------------------------------------------
Predicate Information:
9 - storage("B"."INTERACTION_TYPE"='purchase')
filter("B"."INTERACTION_TYPE"='purchase')
10 - access("C"."CUSTOMER_ID"="B"."CUSTOMER_ID")
Note: automatic DOP: Computed Degree of Parallelism is 40 because of degree limit
What the plan tells us:
- NESTED LOOPS: Oracle streams the S3 external table through parallel slaves, then looks up each customer via the primary key index.
- EXTERNAL TABLE ACCESS STORAGE FULL with PX BLOCK ITERATOR: parallel full scan of S3 Parquet files at DOP 40.
- INDEX UNIQUE SCAN on PK_CUSTOMERS: point lookups per row, not a broadcast.
- Storage and filter predicates on interaction_type='purchase': Oracle applies both a storage-level and a row-level filter. In practice both evaluate after the Parquet row groups are read; full-scan timing with and without the filter is nearly identical (0.41s vs 0.38s).
- DOP 40: auto-computed from cpu_count=40 with parallel_degree_policy=MANUAL.
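The inputs to that DOP calculation are visible in V$PARAMETER (assuming you have SELECT access to the V$ views, which SYSDBA does):

```sql
-- Inspect the parameters that drive the computed degree of parallelism
SELECT name, value
FROM   v$parameter
WHERE  name IN ('cpu_count', 'parallel_degree_policy', 'parallel_degree_limit');
```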
Performance Summary
All queries on 2,556,520 S3-backed external rows joined with 26,151 internal rows:
| Query | Elapsed | Notes |
|---|---|---|
| CTAS (build dimension from S3) | 1.43s | 20 Parquet files → 26K internal rows |
| Top customers by spend | 0.60s | Subquery aggregate + nested loops |
| Customer segmentation | 0.52s | Hash join with broadcast |
| Cross-layer (internal + Silver) | 0.23s | Nested loops with index lookups |
| COUNT(*) on bronze_s3 | 0.30s | Full S3 scan baseline |
Gotchas
CTAS from S3 External Tables Creates Uncompressed Heap Tables
When you CREATE TABLE AS SELECT from an S3-backed external table, Oracle creates a standard heap table. For large datasets, consider adding compression:
CREATE TABLE customers_compressed
COMPRESS BASIC
AS SELECT ... FROM bronze_s3 ...;
Note: COMPRESS FOR QUERY HIGH (Hybrid Columnar Compression) requires Exadata or ZFS storage. On standard ASM or filesystem storage, use COMPRESS BASIC or ROW STORE COMPRESS ADVANCED instead.
Gather Statistics After CTAS
Since 12c, online statistics gathering collects basic table statistics during a CTAS, but it skips histograms and index statistics. Without complete stats, the optimizer can make poor join decisions, so gather them explicitly:
EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'TABLE_NAME');
S3 External Tables Don’t Support Indexes
You can’t create indexes on S3-backed external tables. If you need indexed lookups on lake data, load it into an internal table first. External tables are best for full-scan analytics, not point queries.
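If a workload genuinely needs point lookups on lake data, the fix is the same CTAS-plus-index pattern from Step 1; a sketch (the recent_purchases table name and column list are illustrative, assuming the series schema):

```sql
-- Materialize the hot subset from S3 into an internal table...
CREATE TABLE recent_purchases AS
SELECT customer_id, event_timestamp, transaction_amount
FROM   bronze_s3
WHERE  interaction_type = 'purchase';

-- ...then index it and gather stats for indexed access paths
CREATE INDEX ix_rp_customer ON recent_purchases (customer_id);
EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'RECENT_PURCHASES');
```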
Join Strategy Varies by Query Shape
The optimizer picks different join strategies depending on the query. For simple GROUP BY joins, it chose nested loops, streaming the S3 external table through parallel slaves and doing index lookups on the internal dimension. For the segmentation query with COUNT(DISTINCT), it switched to a hash join, broadcasting the smaller customers table and hash-joining against the S3 scan. Both completed in under a second. If you need to override the optimizer’s choice:
SELECT /*+ LEADING(small_table) USE_HASH(large_external_table) */ ...
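Applied to this post's tables, a hinted version of the hybrid join might look like the following (a sketch showing hint placement only; whether the hint is actually beneficial depends on your cardinalities):

```sql
-- Force customers as the driving table and a hash join against the S3 scan
SELECT /*+ LEADING(c) USE_HASH(b) */
       c.customer_id,
       COUNT(*)                  AS purchases,
       SUM(b.transaction_amount) AS spend
FROM   customers c
JOIN   bronze_s3 b ON c.customer_id = b.customer_id
WHERE  b.interaction_type = 'purchase'
GROUP  BY c.customer_id;
```

Verify the hint took effect with DBMS_XPLAN.DISPLAY before trusting it; the optimizer silently ignores hints it can't apply.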
Summary
Oracle 26ai joins internal tables with S3-backed external Parquet data in a single SQL statement. The optimizer chooses between nested loop joins (with primary key index lookups) and hash joins (with broadcast of the smaller table) depending on the query shape, both with automatic DOP 40 parallelism. All hybrid queries on 2.5 million S3-sourced rows complete in under a second.
Across this four-part series, we connected Oracle 26ai to S3-compatible object storage (Part 1), solved the SigV4 signing problem with DNS and TLS, queried Parquet and Iceberg data directly from S3 (Part 2, Part 3), and joined S3 lake data with Oracle internal tables, all without ETL, Spark, or any middleware. The data stays in the lake. Oracle queries it on demand.
