إجابة مرجعية
This is always a good question because you might be asked about possible ways to ensure data quality in your data platform. It is one of the data engineer's daily routine jobs to improve data pipelines in terms of data accuracy. Data engineers connect data sources and deploy pipelines where data must be extracted and then very often it has to be transformed according to business requirements.
We would want to make sure that all required fields exist (data quality) and no data is missing (reliability).
How do we do it? It's always good to mention self-fixing pipelines and that you know how to deploy them. Data engineers can deploy data quality pipelines in a similar way they deploy ETL pipelines. To put it simply, you would want to use row conditions for one dataset and based on the outcome deploy a fixing step, i.e. extract missing data and load it.
Using row conditions for your datasets aims to ensure data quality.
All data quality checks can be scheduled as scripts and if any of them fail to meet certain conditions then we can send an email notification. It's worth saying that modern data warehouse solutions allow SQL scripts to do such checks but it doesn't have to be limited to SQL. Any data check script can be run on data in the data lake or anywhere else. It just depends on the type of our data platform. Good coding skills are a must in this case so we would want to demonstrate that we know how to create a simple patrol application that can scan our data depending on where it is located physically.
The SQL-based answer is also good but it would be more suitable for the Data Developer role as SQL is often considered the main data querying dialect in analytics. Consider this example below. It will use SQL with row conditions to check if there are any records with NULL payment_date. It will also check for duplicates.
with checks as (
select
count( transaction_id ) as t_cnt
, count(distinct transaction_id) as t_cntd
, count(distinct (case when payment_date is null then transaction_id end)) as pmnt_date_null
from
production.user_transaction
)
, row_conditions as (
select if(t_cnt = 0,'Data for yesterday missing; ', NULL) as alert from checks
union all
select if(t_cnt != t_cntd,'Duplicate transactions found; ', NULL) from checks
union all
select if(pmnt_date_null != 0, cast(pmnt_date_null as string )||' NULL payment_date found', NULL) from checks
)
, alerts as (
select
array_to_string(
array_agg(alert IGNORE NULLS)
,'.; ') as stringify_alert_list
, array_length(array_agg(alert IGNORE NULLS)) as issues_found
from
row_conditions
)
select
alerts.issues_found,
if(alerts.issues_found is null, 'all good'
, ERROR(FORMAT('ATTENTION: production.user_transaction has potential data quality issues for yesterday: %t. Check dataChecks.check_user_transaction_failed_v for more info.'
, stringify_alert_list)))
from
alerts
;
As a result BigQuery will send an automated email containing the alert.