Reference answer
I can tell you about a project where I built our customer analytics platform's core data model from scratch. We had a clear business need: our marketing and product teams couldn't get a unified view of customer behavior across our website, mobile app, and CRM. The raw data was incredibly messy. Website clickstream data came from Snowplow, app events from Amplitude, and CRM data from Salesforce, all landing in separate S3 buckets as JSON files or directly into our Snowflake raw layer. Each source had its own event naming conventions, user identifiers, and timestamps.
My first step was to load these raw sources into staging tables in Snowflake, applying basic cleansing like standardizing column names and casting data types. For example, event_timestamp might have been a string in one source and an epoch integer in another, so I converted them all to a consistent DATETIME format. The biggest challenge was unifying customer identities. Snowplow used anonymous device_ids, Amplitude had its own amplitude_user_id, and Salesforce used crm_user_id. We had a lookup table that linked these various IDs when a user logged in or made a purchase. I built a stg_customer_id_map model that leveraged this lookup, creating a single master_customer_id for each user across all platforms. This was a critical piece, as it allowed us to stitch together a complete customer journey.
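To make the staging logic concrete, here is a minimal Python sketch of the two ideas in that step: normalizing mixed timestamp formats and resolving source-specific IDs to one master ID. The real models were SQL in Snowflake, so this only illustrates the logic. The function names, the sample IDs, and the assumptions that numeric timestamps are epoch seconds and that offset-less strings are UTC are all hypothetical.

```python
from datetime import datetime, timezone


def to_utc_datetime(value):
    """Normalize an epoch-seconds number or an ISO-8601 string to an aware UTC datetime."""
    if isinstance(value, (int, float)):
        # Assumption: numeric timestamps are epoch seconds, not milliseconds.
        return datetime.fromtimestamp(value, tz=timezone.utc)
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: strings without an offset are already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


# Hypothetical lookup rows: (source system, source-specific ID) -> master_customer_id.
ID_MAP = {
    ("snowplow", "device-abc"): "cust-001",
    ("amplitude", "amp-42"): "cust-001",
    ("salesforce", "crm-7"): "cust-001",
}


def resolve_master_id(source, source_id, id_map=ID_MAP):
    """Return the master_customer_id for a source ID, or None if it was never linked."""
    return id_map.get((source, source_id))
```

Returning None for unlinked IDs keeps anonymous traffic visible instead of silently dropping it, which leaves the decision about how to treat those events to the downstream model.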
Next, I built a series of intermediate models. I created int_web_events and int_app_events to standardize event names (e.g., page_view from Snowplow and screen_view from Amplitude both became page_view_event) and to filter out bot traffic and irrelevant events. I then joined these with stg_customer_id_map to associate each event with our master_customer_id. The final output was a fact_customer_events table. This table contained a single row for every user interaction, normalized across all sources, with consistent event names and precise timestamps, and each row was linked to the master_customer_id. I also created a dim_customer table by combining customer attributes from the CRM with first-seen dates from our event streams, ensuring it had exactly one row per customer and accurate demographic information.
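The intermediate step can be sketched in Python as follows. This is an illustration of the renaming and filtering logic only, not the actual dbt SQL. The mapping table uses the two event names mentioned above; the bot markers, the user_agent field, and the function names are assumptions made for the example.

```python
# Source-specific event names mapped to one canonical name.
EVENT_NAME_MAP = {
    ("snowplow", "page_view"): "page_view_event",
    ("amplitude", "screen_view"): "page_view_event",
}

# Assumption: bots are detected by simple substrings in the user agent.
BOT_MARKERS = ("bot", "crawler", "spider")


def is_bot(user_agent):
    """Return True if the user agent contains any known bot marker."""
    ua = (user_agent or "").lower()
    return any(marker in ua for marker in BOT_MARKERS)


def standardize_events(raw_events, source):
    """Rename events to canonical names; drop bot traffic and unmapped events."""
    standardized = []
    for event in raw_events:
        name = EVENT_NAME_MAP.get((source, event["event_name"]))
        if name is None or is_bot(event.get("user_agent")):
            continue  # irrelevant (unmapped) event or bot traffic
        standardized.append({**event, "event_name": name, "source": source})
    return standardized


web_events = standardize_events(
    [
        {"event_name": "page_view", "user_agent": "Mozilla/5.0"},
        {"event_name": "page_view", "user_agent": "Googlebot/2.1"},
        {"event_name": "heartbeat", "user_agent": "Mozilla/5.0"},
    ],
    source="snowplow",
)
```

Keeping the mapping in one table means adding a new source or event is a data change and not a logic change.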
To make this production-ready, I implemented extensive dbt tests: unique and not_null on every primary key, relationships tests between the fact and dimension tables, and custom tests to check that event counts stayed within expected ranges. I also set up dbt exposures for the downstream Looker dashboards, linking specific reports to their underlying dbt models, so analysts could easily find the data they needed and understand its lineage. The project significantly reduced data discrepancies, gave our teams a 360-degree view of customer behavior, and cut report creation time from days to hours.
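The custom event-count check can be illustrated with a small Python sketch. It mirrors the way a dbt test works (the test passes when the check returns no failing rows), but the function name, the thresholds, and the sample counts are all hypothetical.

```python
def out_of_range_days(daily_counts, min_count, max_count):
    """Return the days whose event count falls outside [min_count, max_count]."""
    return sorted(
        day
        for day, count in daily_counts.items()
        if not (min_count <= count <= max_count)
    )


# Hypothetical daily event counts for the fact table.
daily_counts = {
    "2024-01-01": 10_500,
    "2024-01-02": 120,      # suspiciously low: a source may have stopped loading
    "2024-01-03": 9_800,
}

failures = out_of_range_days(daily_counts, min_count=5_000, max_count=50_000)
# An empty list means the check passes.
```

In practice the bounds would come from historical volumes, not fixed constants, so that normal growth does not trigger false alarms.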