Hybrid Analytics with Oracle 26ai and S3 Data

Oracle 26ai can join external tables backed by S3 Parquet files with regular internal tables in a single SQL statement, without data movement, ETL pipelines, or Spark. The optimizer treats S3-backed external tables as another data source and picks the right join strategy automatically.

This post shows how to build a customer dimension table from S3 data, join it back against the external table for analytics, compare execution plans, and measure the actual performance. On 2.5 million rows read directly from S3, hybrid joins complete in under a second.

If you followed Part 1, Part 2, and Part 3, you have S3-backed external tables over Bronze (raw Parquet), Silver (cleaned Parquet from Iceberg), and Gold (aggregated Parquet from Iceberg). Now we join them with internal Oracle tables. The Silver and Gold external tables point at the Parquet data files produced by Iceberg, not at the Iceberg metadata. Oracle treats them identically to the Bronze Parquet tables.

Where You’d Use This

Enriching lake data with reference tables. Your data lake has raw interaction events in S3. Your Oracle database has customer master data, product catalogs, or pricing tables. Join them without importing the lake data.

Building dimension tables from lake data. Pull a customer dimension from S3 Parquet into Oracle via CTAS, add a primary key and indexes, then join it back against the S3-backed external fact table. Indexed lookups on the dimension, parallel S3 scans on the fact data.

Prerequisites

  • Completed Part 3 (BRONZE_S3, SILVER_INTERACTIONS, and GOLD_DASHBOARD external tables from S3)
  • SYSDBA access to the PDB
  • Familiarity with Oracle execution plans (DBMS_XPLAN.DISPLAY)

Step-by-Step

Step 1: Build an Internal Customers Dimension

Create a customers table from the S3-backed bronze external table using CTAS. This aggregates 2.5 million interaction rows into one row per customer:

CREATE TABLE customers AS
SELECT customer_id,
       MAX(email_raw) AS email,
       MAX(city_raw) AS city,
       MAX(state_raw) AS state,
       MIN(event_timestamp) AS first_seen,
       MAX(event_timestamp) AS last_seen,
       COUNT(*) AS interaction_count,
       SUM(CASE WHEN interaction_type = 'purchase' THEN 1 ELSE 0 END) AS purchase_count,
       ROUND(SUM(CASE WHEN interaction_type = 'purchase' THEN transaction_amount ELSE 0 END), 2) AS total_spend,
       MAX(loyalty_tier) AS loyalty_tier,
       MAX(channel) AS preferred_channel
FROM bronze_s3
GROUP BY customer_id;
Table created.
Elapsed: 00:00:01.43

26,151 customers from 2,556,520 interactions. The CTAS reads the S3 dataset (20 Parquet files), aggregates in parallel (DOP 40), and writes the result to an internal heap table in 1.43 seconds.

Add a primary key and gather statistics:

ALTER TABLE customers ADD CONSTRAINT pk_customers PRIMARY KEY (customer_id);
EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'CUSTOMERS');

The primary key gives the optimizer an index for nested loop joins. Statistics let it estimate cardinalities accurately.
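To confirm both landed, query the data dictionary (a quick sanity check; these are standard Oracle dictionary views):

-- Row count and last-analyzed timestamp for the new dimension
SELECT table_name, num_rows, last_analyzed
FROM   user_tab_statistics
WHERE  table_name = 'CUSTOMERS';

-- The unique index created to back the primary key constraint
SELECT index_name, uniqueness
FROM   user_indexes
WHERE  table_name = 'CUSTOMERS';

NUM_ROWS should match the 26,151 we counted; if LAST_ANALYZED is null, the stats gather didn't run.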

Step 2: Verify the Dimension

SELECT COUNT(*) FROM customers;
  COUNT(*)
----------
     26151
SELECT customer_id, email, interaction_count, purchase_count, total_spend
FROM customers
ORDER BY total_spend DESC
FETCH FIRST 5 ROWS ONLY;
CUSTOMER_ID EMAIL                     INTERACTION_COUNT PURCHASE_COUNT TOTAL_SPEND
----------- ------------------------- ----------------- -------------- -----------
          1 user9999@outlook.com                 978844         175675  26437946.9
          2 user99999@gmail.com                  346637          62455  9377389.42
          3 user999996@outlook.com               188646          34030  5084541.66
          4 user999982@outlook.com               121647          22095  3288312.38
          5 user9999@yahoo.com                    87362          15724  2361275.82

Step 3: Hybrid Query, Top Customers by Spend

Join the internal dimension against the S3-backed external fact table. The subquery aggregates purchases from S3 Parquet; the outer query enriches with customer metadata from Oracle:

SELECT c.customer_id, c.email, c.interaction_count,
       ext.purchase_count AS ext_purchases,
       ROUND(ext.total_spend, 2) AS ext_total_spend
FROM customers c
JOIN (
    SELECT customer_id, COUNT(*) AS purchase_count, SUM(transaction_amount) AS total_spend
    FROM bronze_s3
    WHERE interaction_type = 'purchase'
    GROUP BY customer_id
) ext ON c.customer_id = ext.customer_id
ORDER BY ext.total_spend DESC NULLS LAST
FETCH FIRST 10 ROWS ONLY;
CUSTOMER_ID EMAIL                     INTERACTION_COUNT EXT_PURCHASES EXT_TOTAL_SPEND
----------- ------------------------- ----------------- ------------- ---------------
          1 user9999@outlook.com                 978844        175675     26437946.90
          2 user99999@gmail.com                  346637         62455      9377389.42
          3 user999996@outlook.com               188646         34030      5084541.66
          4 user999982@outlook.com               121647         22095      3288312.38
          5 user9999@yahoo.com                    87362         15724      2361275.82
          6 user999996@outlook.com                66621         12115      1874226.37
          7 user999964@outlook.com                52606          9437      1413831.60
          8 user999887@gmail.com                  43332          7841      1129578.43
          9 user999970@yahoo.com                  36136          6425      1013184.00
         10 user999985@gmail.com                  30949          5634       861616.76

Elapsed: 00:00:00.60

Oracle reads 20 Parquet files from S3, filters to purchases, aggregates by customer, then joins against the internal customers table. 0.60 seconds.

Step 4: Hybrid Query, Customer Segmentation

Use the internal dimension to segment customers, then aggregate S3-backed fact data per segment:

SELECT
    CASE
        WHEN c.interaction_count > 10 THEN 'Power User'
        WHEN c.interaction_count > 3  THEN 'Active'
        ELSE 'Casual'
    END AS segment,
    COUNT(DISTINCT c.customer_id) AS customers,
    COUNT(*) AS total_interactions,
    ROUND(AVG(b.transaction_amount), 2) AS avg_txn,
    ROUND(SUM(b.transaction_amount), 2) AS total_revenue
FROM customers c
JOIN bronze_s3 b ON c.customer_id = b.customer_id
WHERE b.interaction_type = 'purchase'
GROUP BY
    CASE
        WHEN c.interaction_count > 10 THEN 'Power User'
        WHEN c.interaction_count > 3  THEN 'Active'
        ELSE 'Casual'
    END
ORDER BY total_revenue DESC;
SEGMENT      CUSTOMERS TOTAL_INTERACTIONS   AVG_TXN TOTAL_REVENUE
------------ --------- ----------------- --------- -------------
Power User        2060            452575    150.61   68161937.30
Casual            4639              4835    153.47     742011.12
Active            1632              2610    153.39     400351.21

Elapsed: 00:00:00.52

2,060 power users (7.9% of customers) drive 98.4% of revenue. This query joins 26,151 internal rows against 2,556,520 S3-backed external rows, filters to purchases, segments by interaction count, and aggregates in 0.52 seconds.
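Those percentages can come straight out of the query instead of a calculator, using the RATIO_TO_REPORT analytic function over the same segmentation logic (a sketch; same tables and CASE expression as above):

SELECT segment,
       customers,
       ROUND(total_revenue, 2) AS total_revenue,
       ROUND(100 * RATIO_TO_REPORT(customers)     OVER (), 1) AS pct_customers,
       ROUND(100 * RATIO_TO_REPORT(total_revenue) OVER (), 1) AS pct_revenue
FROM (
    SELECT CASE
               WHEN c.interaction_count > 10 THEN 'Power User'
               WHEN c.interaction_count > 3  THEN 'Active'
               ELSE 'Casual'
           END AS segment,
           COUNT(DISTINCT c.customer_id) AS customers,
           SUM(b.transaction_amount)     AS total_revenue
    FROM customers c
    JOIN bronze_s3 b ON c.customer_id = b.customer_id
    WHERE b.interaction_type = 'purchase'
    GROUP BY CASE
                 WHEN c.interaction_count > 10 THEN 'Power User'
                 WHEN c.interaction_count > 3  THEN 'Active'
                 ELSE 'Casual'
             END
)
ORDER BY total_revenue DESC;

RATIO_TO_REPORT(x) OVER () divides each row's value by the sum over the whole result set, so pct_revenue for the Power User segment should come out at 98.4.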

Step 5: Cross-Layer Query, Internal + Silver from S3

Join the internal customers table with the Silver S3 external table to compare raw and cleaned data:

SELECT c.customer_id, c.email AS raw_email, s.email_clean AS cleaned_email,
       c.state AS raw_state, s.state_standardized AS clean_state,
       s.customer_value_tier, s.churn_risk_indicator, s.engagement_score
FROM customers c
JOIN silver_interactions s ON c.customer_id = s.customer_id
WHERE ROWNUM <= 10;
CUSTOMER_ID RAW_EMAIL                    CLEANED_EMAIL                RAW_STATE  CLEAN_STATE  CUSTOMER_VALUE_TIER  CHURN_RISK_INDICATOR ENGAGEMENT_SCORE
----------- ---------------------------- ---------------------------- ---------- ------------ -------------------- -------------------- ----------------
          4 user999982@outlook.com       user337396@hotmail.com       tx         PA           browser_only         unknown_risk                        3
          8 user999887@gmail.com         user987427@gmail.com         tx         IL           browser_only         unknown_risk                        2
          2 user99999@gmail.com          user131575@outlook.com       tx         NY           medium_value         unknown_risk                        1
          1 user9999@outlook.com         user641643@yahoo.com         tx         CA           browser_only         unknown_risk                        0
         57 user999302@gmail.com         user377056@outlook.com       tx         TX           low_value            unknown_risk                        2
          2 user99999@gmail.com          user312539@yahoo.com         tx         NY           low_value            unknown_risk                        4
         15 user999945@yahoo.com         user531743@yahoo.com         tx         ILLINOIS     browser_only         unknown_risk                        4
          7 user999964@outlook.com       user756613@hotmail.com       tx         PA           browser_only         unknown_risk                        0
          5 user9999@yahoo.com           user244092@icloud.com        tx         AZ           browser_only         unknown_risk                        0
          2 user99999@gmail.com          user628848@hotmail.com       tx         CA           browser_only         low_risk                            0

Elapsed: 00:00:00.23

This joins an internal Oracle table with a Silver-layer external table. The result shows raw values from the dimension alongside cleaned/enriched values from the data lake: email normalization, state standardization, customer value tier, churn risk.
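A useful follow-up on the same join is to quantify the cleanup rather than eyeball ten rows, for example counting where the raw and standardized states disagree (illustrative, using the columns shown above):

SELECT s.state_standardized,
       COUNT(*) AS rows_cleaned
FROM   customers c
JOIN   silver_interactions s ON c.customer_id = s.customer_id
WHERE  UPPER(c.state) <> s.state_standardized
GROUP BY s.state_standardized
ORDER BY rows_cleaned DESC;

Same hybrid execution as before: full parallel scan of the Silver Parquet files, index lookups on the internal dimension.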

Step 6: Check the Execution Plan

Here’s how Oracle actually joins them:

EXPLAIN PLAN FOR
SELECT c.customer_id, COUNT(*), SUM(b.transaction_amount)
FROM customers c
JOIN bronze_s3 b ON c.customer_id = b.customer_id
WHERE b.interaction_type = 'purchase'
GROUP BY c.customer_id;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
-------------------------------------------------------------------------------------
| Id | Operation                                    | Name         | Rows  | Cost  |
-------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT                             |              |     1 | 10222 |
|  1 |  PX COORDINATOR                              |              |       |       |
|  2 |   PX SEND QC (RANDOM)                        | :TQ10001     |     1 | 10222 |
|  3 |    HASH GROUP BY                             |              |     1 | 10222 |
|  4 |     PX RECEIVE                               |              |     1 | 10222 |
|  5 |      PX SEND HASH                            | :TQ10000     |     1 | 10222 |
|  6 |       HASH GROUP BY                          |              |     1 | 10222 |
|  7 |        NESTED LOOPS                          |              |  1.1M | 10220 |
|  8 |         PX BLOCK ITERATOR                    |              |       |       |
|* 9 |          EXTERNAL TABLE ACCESS STORAGE FULL  | BRONZE_S3    |  1.1M | 10217 |
|*10 |         INDEX UNIQUE SCAN                    | PK_CUSTOMERS |     1 |     0 |
-------------------------------------------------------------------------------------

Predicate Information:
   9 - storage("B"."INTERACTION_TYPE"='purchase')
       filter("B"."INTERACTION_TYPE"='purchase')
  10 - access("C"."CUSTOMER_ID"="B"."CUSTOMER_ID")

Note: automatic DOP: Computed Degree of Parallelism is 40 because of degree limit

What the plan tells us:

  • NESTED LOOPS: Oracle streams the S3 external table through parallel slaves, then looks up each customer via the primary key index.
  • EXTERNAL TABLE ACCESS STORAGE FULL with PX BLOCK ITERATOR: parallel full scan of S3 Parquet files at DOP 40.
  • INDEX UNIQUE SCAN on PK_CUSTOMERS: point lookups per row, not a broadcast.
  • Storage and filter predicates on interaction_type='purchase': Oracle applies both a storage-level and row-level filter. In practice both evaluate after Parquet row group reads; full-scan timing with and without the filter is nearly identical (0.41s vs 0.38s).
  • DOP 40: auto-computed from cpu_count=40, parallel_degree_policy=MANUAL.
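EXPLAIN PLAN shows estimates only. To compare them against actual row counts after a run, the usual approach is DBMS_XPLAN.DISPLAY_CURSOR with the GATHER_PLAN_STATISTICS hint (a sketch; the statement must still be in the cursor cache, and for parallel queries 'ALLSTATS ALL' shows per-slave stats where 'ALLSTATS LAST' reports only the query coordinator):

SELECT /*+ GATHER_PLAN_STATISTICS */ c.customer_id, COUNT(*)
FROM customers c
JOIN bronze_s3 b ON c.customer_id = b.customer_id
WHERE b.interaction_type = 'purchase'
GROUP BY c.customer_id;

-- NULL, NULL = last statement executed in this session
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL, NULL, 'ALLSTATS LAST'));

The E-Rows vs A-Rows columns in that output are the quickest way to spot a cardinality misestimate on the external table.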

Performance Summary

All queries on 2,556,520 S3-backed external rows joined with 26,151 internal rows:

Query                            Elapsed  Notes
-------------------------------  -------  -------------------------------------
CTAS (build dimension from S3)   1.43s    20 Parquet files → 26K internal rows
Top customers by spend           0.60s    Subquery aggregate + nested loops
Customer segmentation            0.52s    Hash join with broadcast
Cross-layer (internal + Silver)  0.23s    Nested loops with index lookups
COUNT(*) on bronze_s3            0.30s    Full S3 scan baseline

Gotchas

CTAS from S3 External Tables Creates Uncompressed Heap Tables

When you CREATE TABLE AS SELECT from an S3-backed external table, Oracle creates a standard uncompressed heap table. For large datasets, consider adding compression:

CREATE TABLE customers_compressed
COMPRESS BASIC
AS SELECT ... FROM bronze_s3 ...;

Note: COMPRESS FOR QUERY HIGH (Hybrid Columnar Compression) requires Exadata or ZFS storage. On standard ASM or filesystem storage, use COMPRESS BASIC or ROW STORE COMPRESS ADVANCED instead.
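You can confirm what a table actually got from the data dictionary (COMPRESS_FOR is null for uncompressed heap tables; CUSTOMERS_COMPRESSED is the hypothetical table from the snippet above):

SELECT table_name, compression, compress_for
FROM   user_tables
WHERE  table_name IN ('CUSTOMERS', 'CUSTOMERS_COMPRESSED');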

Gather Statistics After CTAS

Oracle doesn’t automatically gather statistics on CTAS tables. Without stats, the optimizer can badly misestimate cardinalities and pick poor join strategies:

EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'TABLE_NAME');

S3 External Tables Don’t Support Indexes

You can’t create indexes on S3-backed external tables. If you need indexed lookups on lake data, load it into an internal table first. External tables are best for full-scan analytics, not point queries.
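If a workload genuinely needs point lookups on lake data, the pattern is the same CTAS-then-index flow used for the dimension in Step 1 (a sketch; the table name and column list are illustrative):

-- Materialize the hot subset internally, then index it
CREATE TABLE purchases_local AS
SELECT customer_id, event_timestamp, transaction_amount
FROM   bronze_s3
WHERE  interaction_type = 'purchase';

CREATE INDEX ix_purch_local_cust ON purchases_local (customer_id);
EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'PURCHASES_LOCAL');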

Join Strategy Varies by Query Shape

The optimizer picks different join strategies depending on the query. For simple GROUP BY joins, it chose nested loops, streaming the S3 external table through parallel slaves and doing index lookups on the internal dimension. For the segmentation query with COUNT(DISTINCT), it switched to a hash join, broadcasting the smaller customers table and hash-joining against the S3 scan. Both completed in under a second. If you need to override the optimizer’s choice:

SELECT /*+ LEADING(small_table) USE_HASH(large_external_table) */ ...
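Applied to the tables in this post, that template could look like the following (illustrative; hints are directives the optimizer silently ignores if they're invalid, so verify with DBMS_XPLAN that the hint was honored):

SELECT /*+ LEADING(c) USE_HASH(b) */
       c.customer_id, COUNT(*) AS purchases
FROM customers c
JOIN bronze_s3 b ON c.customer_id = b.customer_id
WHERE b.interaction_type = 'purchase'
GROUP BY c.customer_id;

Here LEADING(c) fixes the internal dimension as the build side and USE_HASH(b) forces a hash join against the S3 scan instead of nested loops.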

Summary

Oracle 26ai joins internal tables with S3-backed external Parquet data in a single SQL statement. The optimizer chooses between nested loop joins (with primary key index lookups) and hash joins (with broadcast of the smaller table) depending on the query shape, both with automatic DOP 40 parallelism. All hybrid queries on 2.5 million S3-sourced rows complete in under a second.

Across this four-part series, we connected Oracle 26ai to S3-compatible object storage (Part 1), solved the SigV4 signing problem with DNS and TLS, queried Parquet and Iceberg data directly from S3 (Part 2, Part 3), and joined S3 lake data with Oracle internal tables, all without ETL, Spark, or any middleware. The data stays in the lake. Oracle queries it on demand.