AI Analyst Platform
Technical Documentation & Architecture Reference
market-analysis-451415
01 Architecture Overview

Pipeline Architecture

Eight source platforms deliver data files to a shared GCS storage bucket. From there, an automated three-stage pipeline processes each file through BigQuery — first capturing it as-is, then cleaning and transforming it, then rolling it into the analysis tables that power the AI Analyst.

Figure 1.1 — End-to-end data flow
Source platforms, their report files, and the raw tables they feed:
• Google Ads (Delivery report, Match type report) → raw_google_ads
• Microsoft (Search delivery report) → raw_microsoft
• Meta (Delivery report, Performance report) → raw_meta
• Nexxen (Delivery report) → raw_nexxen
• MTA (Media cost, Conversions (P2C), Final attribution) → raw_mta
• Innovid (Creative distribution) → raw_innovid
• Polk (Registrations) → raw_polk
• Toyota (LeadView lead list) → raw_toyota
Files are uploaded manually or via automated export to the GCS bucket, in formats including CSV, ZIP, and CSV.GZ.
GCS Storage Bucket — tcaa_ne_data_ingestion
A single shared storage bucket organized by source path (e.g. google_ads/delivery/). Uploading any file here automatically kicks off the pipeline — no manual trigger needed.
The upload event is detected automatically. The file path is used to look up routing and configuration in ref.pipeline_sources. Metadata is written back to the GCS file to mark it as processed.
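The routing step above can be sketched as a prefix lookup keyed on the file's folder path. This is an illustrative sketch only: the in-memory dict stands in for ref.pipeline_sources, and the column names (raw_table, load_mode) are assumptions, not confirmed by this document.

```python
# Stand-in for ref.pipeline_sources; rows and column names are illustrative.
PIPELINE_SOURCES = {
    "google_ads/delivery/": {"raw_table": "raw_google_ads", "load_mode": "append"},
    "microsoft/delivery/":  {"raw_table": "raw_microsoft",  "load_mode": "replace"},
}

def route_file(gcs_path: str):
    """Return routing config for an uploaded file, or None if unrecognized."""
    for prefix, config in PIPELINE_SOURCES.items():
        if gcs_path.startswith(prefix):
            return config
    # Per the configuration model, unrecognized sources are simply not acted on.
    return None
```

A file landing at an unlisted path returns None and is ignored, which matches the "single source of truth" behavior described in section 02.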
Stage 1 — Ingest
The file is read, column names are standardized to snake_case, and all values are loaded into BigQuery exactly as they arrived — no type-casting or transformation at this stage. ZIP and GZ files are decompressed automatically.
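The column standardization in Stage 1 can be sketched as a small normalization function. The exact rules the pipeline applies are not documented here, so the handling of camelCase and punctuation below is an assumption.

```python
import re

def to_snake_case(name: str) -> str:
    """Normalize a source column header to snake_case (sketch of the Stage 1 step)."""
    name = name.strip()
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)  # split camelCase boundaries
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name)           # spaces/punctuation -> underscore
    return name.strip("_").lower()
```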
On successful ingest, a notification triggers Stage 2.
Raw tables
One table per source file. All values stored as plain text strings. Serves as the permanent record of exactly what each platform delivered.
Stage 2 — Clean & Transform
Values are cast to proper types (dates, integers, decimals). Market names are resolved against ref.dim_market. Source-specific logic is applied — e.g. parsing MTA placement strings, classifying Innovid creative types.
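The casting step can be sketched as a dispatch on target type. The date format and the handling of thousands separators and currency symbols below are assumptions for illustration; the real Stage 2 logic is source-specific.

```python
from datetime import date, datetime
from decimal import Decimal

def cast_value(raw: str, target_type: str):
    """Sketch of Stage 2 casting: raw text values -> typed values."""
    raw = raw.strip()
    if raw == "":
        return None  # empty strings in the raw layer become NULLs
    if target_type == "date":
        return datetime.strptime(raw, "%Y-%m-%d").date()
    if target_type == "integer":
        return int(raw.replace(",", ""))
    if target_type == "decimal":
        return Decimal(raw.replace(",", "").lstrip("$"))
    return raw  # strings pass through unchanged
```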
Clean tables trigger Stage 3. The dependency map in ref.pipeline_analysis_dependencies determines which analysis tables need to be rebuilt based on which source just updated.
Stage 3 — Analysis
Cross-source joins produce the final reporting tables used by the AI Analyst. Each table is rebuilt in full whenever its upstream dependencies update.

Current outputs:
analysis.media_delivery
The primary cross-source delivery table, joining cleaned impressions, clicks, and spend data across Google Ads, Microsoft, Meta, and Nexxen.

Additional tables for registrations, search performance, and attribution are planned and will be added as their source data pipelines are completed.
Analysis tables are queried by the AI Analyst.
AI Analyst
Queries the analysis dataset to answer questions about campaign performance, market trends, attribution, and more.
Audit trail — ref.pipeline_log
Every step across all three stages is logged here: the source file, target table, timestamp, row count, status, and any error message. If something fails at any point in the pipeline, this is the first place to check.
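The fields listed above suggest a log row shaped roughly like the following. The exact column names in ref.pipeline_log are assumptions based on this description.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class PipelineLogRow:
    """One audit row per pipeline step; field names are illustrative."""
    source_file: str                 # GCS path of the file being processed
    target_table: str                # BigQuery table written in this step
    status: str                      # e.g. "success" or "error"
    row_count: Optional[int] = None
    error_message: Optional[str] = None
    logged_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```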

All data remains within GCP (us-east1). The pipeline is fully event-driven — uploading a file is the only action required. Manual rebuilds of any analysis table are available via the corresponding manual_refresh_[table name] saved query in BigQuery.

02 Pipeline Configuration

How the Pipeline Is Configured

The pipeline's behavior is controlled by two configuration tables in BigQuery, not by code. Adding a new data source or changing what gets rebuilt downstream means updating a table row — no Cloud Function changes required.

Figure 2.1 — Configuration tables and processing layer
Source Configuration — ref.pipeline_sources
Every recognized data source has a row in this table. When a file arrives in the GCS bucket, the pipeline reads its folder path and looks it up here to determine two things: which raw BigQuery table the data should land in, and how it should be loaded — either replacing the table each time a new file arrives, or accumulating new rows on top of what's already there.

This table is the single source of truth for what the pipeline recognizes. If a source isn't listed here, the pipeline won't act on it.
Analysis Dependencies — ref.pipeline_analysis_dependencies
Once a source file is processed and its clean table is updated, the pipeline checks this table to determine which analysis tables need to be rebuilt as a result. Each row maps a clean source table to the analysis table or tables that depend on it.

This is what makes the pipeline fully automatic end to end. When new delivery data arrives from Google Ads, Microsoft, Meta, or Nexxen, the pipeline knows — without any manual instruction — that analysis.media_delivery needs to be refreshed.
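The dependency check can be sketched as a map from clean tables to the analysis tables built from them. The clean table names below are assumptions (the document does not name them); the mapping stands in for ref.pipeline_analysis_dependencies.

```python
# Stand-in for ref.pipeline_analysis_dependencies; rows are illustrative.
ANALYSIS_DEPENDENCIES = {
    "clean_google_ads": ["analysis.media_delivery"],
    "clean_microsoft":  ["analysis.media_delivery"],
    "clean_meta":       ["analysis.media_delivery"],
    "clean_nexxen":     ["analysis.media_delivery"],
}

def tables_to_rebuild(updated_clean_tables):
    """Return the unique analysis tables affected by the clean tables just updated."""
    rebuild = set()
    for table in updated_clean_tables:
        rebuild.update(ANALYSIS_DEPENDENCIES.get(table, []))
    return sorted(rebuild)
```

Note that several sources fan in to one analysis table, so updates from any of the four delivery sources trigger the same rebuild.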
Cloud Functions — the processing layer
The actual work of reading files, transforming data, and writing to BigQuery happens inside three Cloud Functions that run automatically in response to pipeline events. They are not visible in BigQuery or GCS, but they are what connects an uploaded file to finished data in the analysis layer.

The configuration tables above define the rules — the Cloud Functions carry them out.
To register a new data source, add a row to ref.pipeline_sources with its GCS folder path and target raw table. If it should contribute to an analysis table, add a corresponding row to ref.pipeline_analysis_dependencies. No code changes are needed.
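The two registration steps above amount to two new rows. A hypothetical example, with the source name, paths, and column names all assumed for illustration:

```python
# Row for ref.pipeline_sources: routes files under this folder path
# to a raw table and sets the load behavior. All values are hypothetical.
new_source_row = {
    "gcs_folder_path": "newsource/delivery/",
    "raw_table": "raw_newsource",
    "load_mode": "append",  # or "replace"
}

# Row for ref.pipeline_analysis_dependencies: makes updates to the
# source's clean table trigger a rebuild of the analysis table.
new_dependency_row = {
    "clean_table": "clean_newsource",
    "analysis_table": "analysis.media_delivery",
}
```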