AI Analyst Platform
Technical Documentation & Architecture Reference
market-analysis-451415
01 Architecture Overview

Pipeline Architecture

Eight source platforms deliver data files to a shared GCS storage bucket. From there, an automated three-stage pipeline processes each file through BigQuery — first capturing it as-is, then cleaning and transforming it, then rolling it into the analysis tables that power the AI Analyst.

Figure 1.1 — End-to-end data flow
Source platform (delivered reports)                          → raw target
Google Ads   (delivery report, match type report)            → raw_google_ads
Microsoft    (search delivery report)                        → raw_microsoft
Meta         (delivery report, performance report)           → raw_meta
Nexxen       (delivery report)                               → raw_nexxen
MTA          (media cost, conversions (P2C), final attribution) → raw_mta
Innovid      (creative distribution)                         → raw_innovid
Polk         (registrations)                                 → raw_polk
Toyota       (LeadView lead list)                            → raw_toyota
Files reach the GCS bucket by manual upload or automated export, in CSV, ZIP, or CSV.GZ format.
GCS Storage Bucket — tcaa_ne_data_ingestion
A single shared storage bucket organized by source path (e.g. google_ads/delivery/). Uploading any file here automatically kicks off the pipeline — no manual trigger needed.
The upload event is detected automatically. The file path is used to look up routing and configuration in ref.pipeline_sources. Metadata is written back to the GCS file to mark it as processed.
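The routing step can be sketched in plain Python: the upload event carries the object path, and its leading path segments select a configuration entry. The dictionary below stands in for ref.pipeline_sources; its keys, field names, and values are illustrative, not the real schema.

```python
# Hypothetical stand-in for ref.pipeline_sources: prefix -> routing config.
PIPELINE_SOURCES = {
    "google_ads/delivery": {"raw_table": "raw_google_ads.delivery", "format": "csv"},
    "mta/media_cost":      {"raw_table": "raw_mta.media_cost", "format": "csv.gz"},
}

def route(object_path: str) -> dict:
    """Match the longest configured prefix of the uploaded object's path."""
    parts = object_path.split("/")
    for depth in range(len(parts) - 1, 0, -1):
        prefix = "/".join(parts[:depth])
        if prefix in PIPELINE_SOURCES:
            return PIPELINE_SOURCES[prefix]
    raise KeyError(f"no pipeline_sources entry for {object_path}")
```

An upload to google_ads/delivery/2024-05.csv would resolve to the google_ads/delivery entry; an unrecognized path raises, which is the kind of failure that would surface in ref.pipeline_log.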
Stage 1 — Ingest
The file is read, column names are standardized to snake_case, and all values are loaded into BigQuery exactly as they arrived — no type-casting or transformation at this stage. ZIP and GZ files are decompressed automatically.
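The column standardization can be illustrated with a small helper; the exact normalization rules used by the pipeline are not documented here, so the rules below (camelCase splitting, punctuation collapsed to underscores) are an assumption.

```python
import re

def to_snake_case(col: str) -> str:
    """Normalize a delivered column header to snake_case (illustrative rules)."""
    col = col.strip()
    col = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", col)  # split camelCase boundaries
    col = re.sub(r"[^0-9a-zA-Z]+", "_", col)           # spaces/punctuation -> _
    return col.strip("_").lower()
```

For example, a header like "CampaignName" becomes campaign_name, and "Impr. (Top) %" becomes impr_top; the values in those columns are still loaded untouched, as strings.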
On successful ingest, a notification triggers Stage 2.
Raw tables
One table per source file. All values stored as plain text strings. Serves as the permanent record of exactly what each platform delivered.
Stage 2 — Clean & Transform
Values are cast to proper types (dates, integers, decimals). Market names are resolved against ref.dim_market. Source-specific logic is applied — e.g. parsing MTA placement strings, classifying Innovid creative types.
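A minimal sketch of the casting step, assuming BigQuery-style type names and a tolerant policy where empty or unparseable cells become NULL rather than aborting the load (the real pipeline's error handling may differ):

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

def cast_value(raw: str, target_type: str):
    """Cast a raw Stage 1 string to its Stage 2 type; None for empty/bad cells.
    Type names and accepted formats here are illustrative assumptions."""
    raw = (raw or "").strip()
    if raw == "":
        return None
    try:
        if target_type == "DATE":
            return datetime.strptime(raw, "%Y-%m-%d").date()
        if target_type == "INT64":
            return int(raw.replace(",", ""))
        if target_type == "NUMERIC":
            return Decimal(raw.replace(",", "").replace("$", ""))
    except (ValueError, InvalidOperation):
        return None
    return raw  # unknown types pass through as text
```

Source-specific logic (MTA placement parsing, Innovid creative classification) would sit alongside this generic casting, keyed off the same per-source configuration.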
Clean tables trigger Stage 3. The dependency map in ref.pipeline_analysis_dependencies determines which analysis tables need to be rebuilt based on which source just updated.
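The fan-out from a clean table to its dependent analysis tables is a reverse lookup over the dependency map. The mapping below mirrors the role of ref.pipeline_analysis_dependencies; the specific rows shown are an illustrative guess.

```python
# Hypothetical stand-in for ref.pipeline_analysis_dependencies:
# analysis table -> the set of clean tables it is built from.
DEPENDENCIES = {
    "analysis.media_delivery": {
        "clean_google_ads.delivery", "clean_microsoft.delivery",
        "clean_meta.delivery", "clean_nexxen.delivery",
    },
}

def tables_to_rebuild(updated_clean_table: str) -> list:
    """Return every analysis table whose inputs include the updated table."""
    return sorted(t for t, deps in DEPENDENCIES.items()
                  if updated_clean_table in deps)
```

So a refresh of clean_meta.delivery would queue a rebuild of analysis.media_delivery, while a source with no downstream dependents triggers nothing.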
Stage 3 — Analysis
Cross-source joins produce the final reporting tables used by the AI Analyst. Each table is rebuilt in full whenever its upstream dependencies update.
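In spirit, a Stage 3 rebuild is a join of clean source rows against shared dimensions. The toy example below joins platform spend rows to ref.dim_market-style rows on a market key; every field name is illustrative, not the real analysis.media_delivery schema.

```python
# Clean-layer rows from two platforms (illustrative shapes).
spend = [
    {"date": "2024-05-01", "market_id": 7, "platform": "meta",       "spend": 125.40},
    {"date": "2024-05-01", "market_id": 7, "platform": "google_ads", "spend": 98.10},
]
# Dimension rows, keyed by market_id (stands in for ref.dim_market).
markets = {7: {"market_name": "Boston"}}

# Full rebuild: every output row is recomputed from its inputs.
media_delivery = [
    {**row, **markets[row["market_id"]]}
    for row in spend
    if row["market_id"] in markets
]
```

Because each analysis table is rebuilt in full rather than incrementally appended, a corrected source file only needs to be re-uploaded for its downstream tables to come out consistent.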

Current outputs:
analysis.media_delivery

Additional tables for registrations, search performance, and attribution are planned and will be added as their source data pipelines are completed.
AI Analyst
Queries the analysis dataset to answer questions about campaign performance, market trends, attribution, and more.
Audit trail — ref.pipeline_log
Every step across all three stages is logged here: the source file, target table, timestamp, row count, status, and any error message. In addition, metadata is written back to the GCS file itself to mark it as processed. If something fails at any point in the pipeline, this is the first place to check.
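A log row carrying the fields listed above might look like the following; the exact column names in ref.pipeline_log are not specified here, so this dataclass is an assumed shape for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class PipelineLogRow:
    """One row in ref.pipeline_log (field names are an illustrative guess)."""
    source_file: str
    target_table: str
    stage: str                 # "ingest" / "clean" / "analysis"
    status: str                # e.g. "ok" / "error"
    row_count: int = 0
    error_message: str = ""
    logged_at: str = field(default="")

    def __post_init__(self):
        # Stamp the row at creation time if no timestamp was supplied.
        if not self.logged_at:
            self.logged_at = datetime.now(timezone.utc).isoformat()
```

Filtering this table by source_file reconstructs a single upload's journey through all three stages, which is why it is the first stop when debugging a failed run.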

All data remains within GCP (us-east1). The pipeline is fully event-driven — uploading a file is the only action required. Manual rebuilds of any analysis table are available via the corresponding manual_refresh_[table name] saved query in BigQuery.