AI Analyst Platform
Technical Documentation & Architecture Reference
market-analysis-451415
01 Architecture Overview

Pipeline Architecture

Eight source platforms deliver data files to a shared GCS storage bucket. From there, an automated three-stage pipeline processes each file through BigQuery — first capturing it as-is, then cleaning and transforming it, then rolling it into the analysis tables that power the AI Analyst.

Figure 1.1 — End-to-end data flow
Source platforms, their report files, and the raw tables they feed:
• Google Ads (Delivery report, Match type report) → raw_google_ads
• Microsoft (Search delivery report) → raw_microsoft
• Meta (Delivery report, Performance report) → raw_meta
• Nexxen (Delivery report) → raw_nexxen
• MTA (Media cost, Conversions (P2C), Final attribution) → raw_mta
• Innovid (Creative distribution) → raw_innovid
• Polk (Registrations) → raw_polk
• Toyota (LeadView lead list) → raw_toyota
Files are uploaded manually or via automated export to the GCS bucket, in formats including CSV, ZIP, and CSV.GZ.
GCS Storage Bucket — tcaa_ne_data_ingestion
A single shared storage bucket organized by source path (e.g. google_ads/delivery/). Uploading any file here automatically kicks off the pipeline — no manual trigger needed.
The upload event is detected automatically. The file path is used to look up routing and configuration in ref.pipeline_sources. Metadata is written back to the GCS file to mark it as processed.
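The routing step above can be sketched as a prefix lookup keyed on the file's folder path. This is an illustrative sketch only: the in-memory dict stands in for ref.pipeline_sources, and the column names (raw_table, load_mode) are assumptions, not confirmed by this document.

```python
# Stand-in for ref.pipeline_sources; rows and column names are illustrative.
PIPELINE_SOURCES = {
    "google_ads/delivery/": {"raw_table": "raw_google_ads", "load_mode": "append"},
    "microsoft/delivery/":  {"raw_table": "raw_microsoft",  "load_mode": "replace"},
}

def route_file(gcs_path: str):
    """Return routing config for an uploaded file, or None if unrecognized."""
    for prefix, config in PIPELINE_SOURCES.items():
        if gcs_path.startswith(prefix):
            return config
    # Per the configuration model, unrecognized sources are simply not acted on.
    return None
```

A file landing at an unlisted path returns None and is ignored, which matches the "single source of truth" behavior described in section 02.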
Stage 1 — Ingest
The file is read, column names are standardized to snake_case, and all values are loaded into BigQuery exactly as they arrived — no type-casting or transformation at this stage. ZIP and GZ files are decompressed automatically.
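The column standardization in Stage 1 can be sketched as a small normalization function. The exact rules the pipeline applies are not documented here, so the handling of camelCase and punctuation below is an assumption.

```python
import re

def to_snake_case(name: str) -> str:
    """Normalize a source column header to snake_case (sketch of the Stage 1 step)."""
    name = name.strip()
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)  # split camelCase boundaries
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name)           # spaces/punctuation -> underscore
    return name.strip("_").lower()
```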
On successful ingest, a notification triggers Stage 2.
Raw tables
One table per source file. All values stored as plain text strings. Serves as the permanent record of exactly what each platform delivered.
Stage 2 — Clean & Transform
Values are cast to proper types (dates, integers, decimals). Market names are resolved against ref.dim_market. Source-specific logic is applied — e.g. parsing MTA placement strings, classifying Innovid creative types.
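The casting step can be sketched as a dispatch on target type. The date format and the handling of thousands separators and currency symbols below are assumptions for illustration; the real Stage 2 logic is source-specific.

```python
from datetime import date, datetime
from decimal import Decimal

def cast_value(raw: str, target_type: str):
    """Sketch of Stage 2 casting: raw text values -> typed values."""
    raw = raw.strip()
    if raw == "":
        return None  # empty strings in the raw layer become NULLs
    if target_type == "date":
        return datetime.strptime(raw, "%Y-%m-%d").date()
    if target_type == "integer":
        return int(raw.replace(",", ""))
    if target_type == "decimal":
        return Decimal(raw.replace(",", "").lstrip("$"))
    return raw  # strings pass through unchanged
```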
Clean tables trigger Stage 3. The dependency map in ref.pipeline_analysis_dependencies determines which analysis tables need to be rebuilt based on which source just updated.
Stage 3 — Analysis
Cross-source joins produce the final reporting tables used by the AI Analyst. Each table is rebuilt in full whenever its upstream dependencies update.

Current outputs:
analysis.media_delivery
The primary cross-source delivery table, joining cleaned impressions, clicks, and spend data across Google Ads, Microsoft, Meta, and Nexxen.

Additional tables for registrations, search performance, and attribution are planned and will be added as their source data pipelines are completed.
Analysis tables are queried by the AI Analyst.
AI Analyst
Queries the analysis dataset to answer questions about campaign performance, market trends, attribution, and more.
Audit trail — ref.pipeline_log
Every step across all three stages is logged here: the source file, target table, timestamp, row count, status, and any error message. If something fails at any point in the pipeline, this is the first place to check.
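The fields listed above suggest a log row shaped roughly like the following. The exact column names in ref.pipeline_log are assumptions based on this description.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class PipelineLogRow:
    """One audit row per pipeline step; field names are illustrative."""
    source_file: str                 # GCS path of the file being processed
    target_table: str                # BigQuery table written in this step
    status: str                      # e.g. "success" or "error"
    row_count: Optional[int] = None
    error_message: Optional[str] = None
    logged_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```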

All data remains within GCP (us-east1). The pipeline is fully event-driven — uploading a file is the only action required. Manual rebuilds of any analysis table are available via the corresponding manual_refresh_[table name] saved query in BigQuery.

02 Pipeline Configuration

How the Pipeline Is Configured

The pipeline's behavior is controlled by two configuration tables in BigQuery, not by code. Adding a new data source or changing what gets rebuilt downstream means updating a table row — no Cloud Function changes required.

Figure 2.1 — Configuration tables and processing layer
Source Configuration — ref.pipeline_sources
Every recognized data source has a row in this table. When a file arrives in the GCS bucket, the pipeline reads its folder path and looks it up here to determine two things: which raw BigQuery table the data should land in, and how it should be loaded — either replacing the table each time a new file arrives, or accumulating new rows on top of what's already there.

This table is the single source of truth for what the pipeline recognizes. If a source isn't listed here, the pipeline won't act on it.
Analysis Dependencies — ref.pipeline_analysis_dependencies
Once a source file is processed and its clean table is updated, the pipeline checks this table to determine which analysis tables need to be rebuilt as a result. Each row maps a clean source table to the analysis table or tables that depend on it.

This is what makes the pipeline fully automatic end to end. When new delivery data arrives from Google Ads, Microsoft, Meta, or Nexxen, the pipeline knows — without any manual instruction — that analysis.media_delivery needs to be refreshed.
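The dependency check can be sketched as a map from clean tables to the analysis tables built from them. The clean table names below are assumptions (the document does not name them); the mapping stands in for ref.pipeline_analysis_dependencies.

```python
# Stand-in for ref.pipeline_analysis_dependencies; rows are illustrative.
ANALYSIS_DEPENDENCIES = {
    "clean_google_ads": ["analysis.media_delivery"],
    "clean_microsoft":  ["analysis.media_delivery"],
    "clean_meta":       ["analysis.media_delivery"],
    "clean_nexxen":     ["analysis.media_delivery"],
}

def tables_to_rebuild(updated_clean_tables):
    """Return the unique analysis tables affected by the clean tables just updated."""
    rebuild = set()
    for table in updated_clean_tables:
        rebuild.update(ANALYSIS_DEPENDENCIES.get(table, []))
    return sorted(rebuild)
```

Note that several sources fan in to one analysis table, so updates from any of the four delivery sources trigger the same rebuild.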
Cloud Functions — the processing layer
The actual work of reading files, transforming data, and writing to BigQuery happens inside three Cloud Functions that run automatically in response to pipeline events. They are not visible in BigQuery or GCS, but they are what connects an uploaded file to finished data in the analysis layer.

The configuration tables above define the rules — the Cloud Functions carry them out.
To register a new data source, add a row to ref.pipeline_sources with its GCS folder path and target raw table. If it should contribute to an analysis table, add a corresponding row to ref.pipeline_analysis_dependencies. No code changes are needed.
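The two registration steps above amount to two new rows. A hypothetical example, with the source name, paths, and column names all assumed for illustration:

```python
# Row for ref.pipeline_sources: routes files under this folder path
# to a raw table and sets the load behavior. All values are hypothetical.
new_source_row = {
    "gcs_folder_path": "newsource/delivery/",
    "raw_table": "raw_newsource",
    "load_mode": "append",  # or "replace"
}

# Row for ref.pipeline_analysis_dependencies: makes updates to the
# source's clean table trigger a rebuild of the analysis table.
new_dependency_row = {
    "clean_table": "clean_newsource",
    "analysis_table": "analysis.media_delivery",
}
```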