Mock Interview Questions for GCP Data Engineers

1

How do you optimize storage costs in GCP?

Reference answer

- Archive data in Coldline or Nearline.
- Delete obsolete datasets.
- Compress files using efficient formats like Parquet.

Example: We saved 35% on storage costs by archiving historical data in Coldline Storage.

2

How do you prioritize tasks in a data engineering project?

Reference answer

Prioritization strategies might include:

- Assessing business impact and urgency of each task
- Considering dependencies between tasks
- Evaluating resource availability and constraints
- Using techniques like the Eisenhower Matrix or MoSCoW method
- Regular communication with stakeholders to align priorities

3

How do you implement blue/green deployment on GCP?

Reference answer

Use Cloud Run or GKE to run two environments, blue (the current version) and green (the new version). Deploy the new version to the green environment, test it, and then switch all traffic from blue to green. Keep the blue environment available until green is confirmed healthy so that you can roll back quickly.

4

Name three benefits of cloud services.

Reference answer

Three benefits of cloud services are cost savings by eliminating capital expenditure on hardware, flexibility and scalability to adjust resources based on demand, and disaster recovery capabilities with data backup and replication across multiple locations.

5

A table has two entries every day: the number of apples sold and the number of oranges sold. Write a query to get the difference between the apples and oranges sold on a given day.

Reference answer

Assuming the table has columns like date, fruit_type, and quantity, write a query that filters for the specific day, sums quantities for apples and oranges separately (e.g., using CASE WHEN with SUM), and computes the absolute difference between the two sums.
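The approach above can be sketched as follows. To stay self-contained, the query runs against an in-memory SQLite database from Python. The table name `sales` and the column names `sale_date`, `fruit_type`, and `quantity` are assumptions for illustration, since the question does not give a schema.

```python
import sqlite3

# Hypothetical schema: one row per fruit per day.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, fruit_type TEXT, quantity INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("2024-01-01", "apples", 10),
        ("2024-01-01", "oranges", 7),
        ("2024-01-02", "apples", 4),
        ("2024-01-02", "oranges", 9),
    ],
)

# Conditional aggregation: sum each fruit separately, then take the
# absolute difference between the two sums.
QUERY = """
SELECT sale_date,
       ABS(SUM(CASE WHEN fruit_type = 'apples'  THEN quantity ELSE 0 END)
         - SUM(CASE WHEN fruit_type = 'oranges' THEN quantity ELSE 0 END)) AS diff
FROM sales
WHERE sale_date = ?
GROUP BY sale_date
"""


def fruit_difference(day):
    """Return |apples - oranges| sold on the given day, or None if no rows."""
    row = conn.execute(QUERY, (day,)).fetchone()
    return row[1] if row else None
```

Dropping `ABS` gives a signed difference instead, which may be what the interviewer wants; it is worth asking.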

6

What is the purpose of Dataflow autoscaling?

Reference answer

Autoscaling dynamically adjusts the number of worker nodes based on data processing demands, optimizing cost and performance. Example: I enabled autoscaling in a streaming pipeline during peak hours to process 3x the normal data volume without impacting performance.

7

What are the key features of Cloud Spanner?

Reference answer

Cloud Spanner is a fully managed, scalable, globally distributed SQL database. Its key features include strong consistency, horizontal scalability, ACID transactions, and high availability.

8

What are service accounts, and which kinds are available? How would you create one?

Reference answer

This is a common question in interviews for Google Cloud jobs. Service accounts are special accounts associated with a project. They authorize Google Compute Engine to act on the user's behalf, which gives the service access to non-sensitive data. Google offers several kinds of service accounts; the most commonly used are the Google Cloud Platform Console and Google Compute Engine service accounts. The user does not need to create the service account manually: Compute Engine generates it automatically whenever a new instance is created. When an instance is created, an administrator can also restrict the privileges of the service account attached to it.

9

How do you manage table partition expiration and lifecycle policies in BigQuery?

Reference answer

Set partition expiration on ingestion-time partitioned tables using the `partition_expiration_days` option, which automatically deletes partitions older than the specified number of days. For custom partitioning, use a time-based partitioning column and combine with a scheduled query or Dataflow job to delete old partitions. Use BigQuery's DDL (e.g., `ALTER TABLE ... SET OPTIONS`) to modify expiration. For lifecycle policies, implement data retention rules at the dataset level, and use scripts to manage table deletions or archiving to GCS via exports.

10

What are the key benefits of using Vertex AI in GCP?

Reference answer

- Managed platform for end-to-end ML workflows.
- Integration with BigQuery and Dataflow.
- Automated hyperparameter tuning.

Example: We deployed a real-time predictive model using Vertex AI, which improved customer engagement rates by 15%.

11

How do you optimize BigQuery queries for performance?

Reference answer

- Partitioning and clustering tables
- Using the WITH clause for subqueries
- Avoiding SELECT * and specifying only required columns
- Caching query results
- Materialized views

12

What is star schema?

Reference answer

Star schema is a data warehouse schema where a central fact table is surrounded by dimension tables. It's called a star schema because the diagram resembles a star, with the fact table at the center and dimension tables as points.

13

What is the difference between Dataflow and Dataproc in GCP?

Reference answer

Dataflow is a fully managed, serverless service built on Apache Beam for both batch and streaming data processing. Dataproc is a managed Hadoop and Spark cluster service used for big data processing workloads. Use Dataflow when you want a serverless, auto-scaling pipeline with no cluster management. Use Dataproc when you have existing Spark or Hadoop jobs you want to migrate to the cloud with more control over the cluster environment.

14

How do you test and validate GCP data pipelines?

Reference answer

- Unit tests for Dataflow transformations
- Integration tests for pipeline components
- Use sample datasets for validation

Example: In an e-commerce project, I wrote unit tests for transformation logic and used integration tests to validate pipeline correctness during code updates.

15

Name three advantages of Google Cloud Hosting.

Reference answer

Three advantages of Google Cloud Hosting include scalability to handle traffic spikes, reliability with high uptime guarantees, and cost-effectiveness with a pay-as-you-go pricing model.

16

What is the difference between Cloud Router and VPN tunnels in GCP?

Reference answer

Cloud Router enables dynamic routing between the networks in your Virtual Private Cloud (VPC) and other networks. It is a fully managed service that automatically advertises routes to your VPC networks. VPN tunnels, on the other hand, use encrypted communication over the public internet to provide secure connections between your VPC network and your on-premises network. In short, VPN tunnels securely extend your network into on-premises environments, while Cloud Router handles dynamic routing on the Google Cloud Platform side.

17

How do you choose the right partitioning strategy for a data warehouse?

Reference answer

Choose partitions based on access patterns—commonly by date, region, or customer ID. The goal is to reduce the amount of scanned data during queries. Avoid high cardinality columns and monitor skew in partition sizes.

18

What are the challenges of handling schema evolution in a data pipeline, and how does Google Cloud address this issue?

Reference answer

Handling schema evolution in a data pipeline can be challenging, especially when dealing with semi-structured or unstructured data. As data sources evolve or new data types are added, the schema may change, leading to compatibility issues that can break downstream processes. Google Cloud addresses schema evolution in several ways. In BigQuery, users can enable schema auto-detection, which automatically adjusts to changes in incoming data formats. This makes it easier to ingest new data sources without manually altering the schema. In Cloud Dataflow, schema changes can be managed through flexible transformations that allow for dynamic schema updates. The service allows data engineers to define how data should be transformed based on different schema versions, ensuring compatibility across different stages of the pipeline. Additionally, tools like Cloud Pub/Sub allow for message validation before processing, enabling safe schema changes without disrupting the flow of data.

19

What is the connection between Google Compute Engine and Google App Engine?

Reference answer

Google App Engine and Google Compute Engine complement one another. Google App Engine (GAE) is a Platform-as-a-Service (PaaS), whereas Google Compute Engine (GCE) is an Infrastructure-as-a-Service (IaaS). GAE is commonly used to power mobile backends, web-based apps, and line-of-business applications. If we require additional control over the underlying infrastructure, Google Compute Engine is an excellent choice. GCE, for example, can be used to implement custom business logic or to run our own storage solution.

20

Explain the use of Google Cloud Storage Transfer Service for data migration.

Reference answer

Google Cloud Storage Transfer Service enables data migration between Google Cloud Storage buckets or between Google Cloud Storage and other cloud storage providers. It simplifies the migration process by handling data transfer securely and efficiently. Users can schedule one-time or recurring transfers and choose options like overwrite, delete source, and verification to ensure data consistency during migration.

21

Write a query that returns all neighborhoods that have 0 users.

Reference answer

This is a featured question at Google. The query should likely use a LEFT JOIN or NOT EXISTS pattern to find neighborhoods that do not appear in the users table.
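Both patterns mentioned above can be sketched like this, using an in-memory SQLite database from Python so the example is self-contained. The tables `neighborhoods(id, name)` and `users(id, neighborhood_id)` are assumed for illustration; the question does not specify a schema.

```python
import sqlite3

# Hypothetical schema and sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE neighborhoods (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE users (id INTEGER PRIMARY KEY, neighborhood_id INTEGER);
INSERT INTO neighborhoods VALUES (1, 'Castro'), (2, 'Mission'), (3, 'Sunset');
INSERT INTO users VALUES (10, 1), (11, 1), (12, 3);
""")

# Pattern 1: LEFT JOIN, then keep only rows where no user matched.
LEFT_JOIN = """
SELECT n.name
FROM neighborhoods AS n
LEFT JOIN users AS u ON u.neighborhood_id = n.id
WHERE u.id IS NULL
ORDER BY n.name
"""

# Pattern 2: NOT EXISTS with a correlated subquery.
NOT_EXISTS = """
SELECT n.name
FROM neighborhoods AS n
WHERE NOT EXISTS (SELECT 1 FROM users AS u WHERE u.neighborhood_id = n.id)
ORDER BY n.name
"""

empty_via_join = [row[0] for row in conn.execute(LEFT_JOIN)]
empty_via_not_exists = [row[0] for row in conn.execute(NOT_EXISTS)]
```

With this sample data both queries return only the neighborhood that has no matching user rows.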

22

Explain how you would view past transactions in the GCP.

Reference answer

To view past transactions in GCP, sign in to the GCP console, open the left navigation pane and select Billing, choose the "Go to linked billing account" option, and then navigate to Transactions. You can also filter transactions by transaction type, view summaries of the transaction history, or change the date range.

23

What are some best practices to secure a GCP environment?

Reference answer

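Some of the key best practices to secure a GCP environment are:

- Use Identity and Access Management (IAM) to control permissions.
- Enable audit logging to track and monitor activity.
- Encrypt data in transit and at rest.
- Use Security Command Center for threat detection and compliance checks.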

24

How to sum all values in a range of values between A and B.

Reference answer

To sum all values in a range between A and B, you can use a loop to iterate through the range and accumulate the sum. For example, in Python, `sum(range(A, B+1))` sums the range inclusively. In shell scripting, you might use `seq A B | paste -sd+ | bc`. For large ranges, consider a closed-form formula such as `B*(B+1)//2 - (A-1)*A//2` for efficiency.
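As a small illustration of the loop and the closed-form variants, here is a sketch that assumes A and B are integers and that the range is inclusive. The function names are made up for this example.

```python
def sum_range_loop(a, b):
    """Sum the integers from a to b inclusive by iterating: O(b - a)."""
    total = 0
    for n in range(a, b + 1):
        total += n
    return total


def sum_range_formula(a, b):
    """Same sum in O(1), as (1 + ... + b) minus (1 + ... + (a - 1))."""
    if a > b:
        return 0  # empty range, matching the loop version
    # The product of two consecutive integers is always even,
    # so the floor divisions below are exact.
    return b * (b + 1) // 2 - (a - 1) * a // 2
```

The formula version avoids touching every element, which matters once the range spans billions of values.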

25

SLA compliance for daily partition loads

Reference answer

Given pipeline run logs and daily partition load targets, return one row per `pipeline_id` and `partition_date` with the latest successful load time and a boolean `is_sla_met`, where the SLA is met if `latest_success_at <= sla_deadline_ts`. Only consider partitions in the targets table and treat partitions with no successful run as not met. Output columns: pipeline_id, partition_date, latest_success_at, sla_deadline_ts, is_sla_met.

Pipeline run logs:

| pipeline_id | run_id | partition_date | status | finished_at |
|---|---|---|---|---|
| p1 | r101 | 2026-01-01 | SUCCESS | 2026-01-01 05:10:00 |
| p1 | r102 | 2026-01-01 | FAILED | 2026-01-01 05:30:00 |
| p1 | r103 | 2026-01-02 | SUCCESS | 2026-01-02 07:05:00 |
| p2 | r201 | 2026-01-01 | SUCCESS | 2026-01-01 09:15:00 |
| p2 | r202 | 2026-01-02 | FAILED | 2026-01-02 08:55:00 |

Daily partition load targets:

| pipeline_id | partition_date | sla_deadline_ts |
|---|---|---|
| p1 | 2026-01-01 | 2026-01-01 06:00:00 |
| p1 | 2026-01-02 | 2026-01-02 06:30:00 |
| p2 | 2026-01-01 | 2026-01-01 09:00:00 |
| p2 | 2026-01-02 | 2026-01-02 09:00:00 |
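One possible solution, sketched with an in-memory SQLite database driven from Python so it can be run as-is. The table names `pipeline_runs` and `partition_targets` are assumptions (the problem does not name the tables), and timestamps are stored as ISO-formatted text, which compares correctly as strings. In BigQuery the same shape of query should work with native TIMESTAMP columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pipeline_runs (
    pipeline_id TEXT, run_id TEXT, partition_date TEXT,
    status TEXT, finished_at TEXT
);
CREATE TABLE partition_targets (
    pipeline_id TEXT, partition_date TEXT, sla_deadline_ts TEXT
);
INSERT INTO pipeline_runs VALUES
    ('p1', 'r101', '2026-01-01', 'SUCCESS', '2026-01-01 05:10:00'),
    ('p1', 'r102', '2026-01-01', 'FAILED',  '2026-01-01 05:30:00'),
    ('p1', 'r103', '2026-01-02', 'SUCCESS', '2026-01-02 07:05:00'),
    ('p2', 'r201', '2026-01-01', 'SUCCESS', '2026-01-01 09:15:00'),
    ('p2', 'r202', '2026-01-02', 'FAILED',  '2026-01-02 08:55:00');
INSERT INTO partition_targets VALUES
    ('p1', '2026-01-01', '2026-01-01 06:00:00'),
    ('p1', '2026-01-02', '2026-01-02 06:30:00'),
    ('p2', '2026-01-01', '2026-01-01 09:00:00'),
    ('p2', '2026-01-02', '2026-01-02 09:00:00');
""")

# Start from the targets so every target partition appears exactly once.
# The status filter lives in the JOIN condition (not WHERE) so partitions
# without a successful run are kept with a NULL latest_success_at.
QUERY = """
SELECT t.pipeline_id,
       t.partition_date,
       MAX(r.finished_at) AS latest_success_at,
       t.sla_deadline_ts,
       COALESCE(MAX(r.finished_at) <= t.sla_deadline_ts, 0) AS is_sla_met
FROM partition_targets AS t
LEFT JOIN pipeline_runs AS r
       ON r.pipeline_id = t.pipeline_id
      AND r.partition_date = t.partition_date
      AND r.status = 'SUCCESS'
GROUP BY t.pipeline_id, t.partition_date, t.sla_deadline_ts
ORDER BY t.pipeline_id, t.partition_date
"""

rows = conn.execute(QUERY).fetchall()
```

SQLite has no boolean type, so `is_sla_met` comes back as 1 or 0. On the sample data only p1 on 2026-01-01 meets its SLA; p2 on 2026-01-02 has no successful run and is reported as not met.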

26

How would you design a data solution for a company with high-volume real-time data processing needs?

Reference answer

For high-volume real-time data processing:

- Data Ingestion: Use Cloud Pub/Sub for ingesting streaming data.
- Data Processing: Use Cloud Dataflow (running Apache Beam pipelines) to process and transform data in real time.
- Data Storage: Store processed data in BigQuery for analytics or Cloud Storage for raw data.
- Data Visualization: Use Looker or Looker Studio (formerly Data Studio) to create real-time dashboards and reports.

27

How does BigQuery handle schema changes?

Reference answer

BigQuery supports schema changes such as adding new columns and modifying column descriptions. You can add new columns without affecting existing data:

- Adding Columns: Use the `ALTER TABLE ... ADD COLUMN` statement.
- Deleting Columns: Use `ALTER TABLE ... DROP COLUMN`, or create a new table with the desired schema and copy the data over.
- Schema Auto-Detection: When loading new data, BigQuery can automatically detect the schema from the incoming data.

28

What is Google App Engine?

Reference answer

Google App Engine is one of Google Cloud's fully managed platform-as-a-service (PaaS) products. It lets developers build and run scalable web services and applications. The platform takes care of infrastructure concerns such as scaling, load balancing, and monitoring. Several programming languages are supported, including Go, Java, Python, and Node.js.

29

What's the difference between @staticmethod and @classmethod in Python? How are they used in GCP SDKs?

Reference answer

Both @staticmethod and @classmethod are decorators in Python used to define methods inside a class that aren't tied to instance objects. They differ in how they access class and instance data. In GCP SDKs: - @staticmethod is often used for helper functions that perform generic tasks, like formatting or validation, which don't depend on class or instance state. - @classmethod is useful for alternative constructors or methods that need to access or modify class-level configurations, such as creating client instances with specific settings.
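To make the difference concrete, here is a minimal sketch. `StorageClient` and `RetryingStorageClient` are hypothetical classes written for this example and are not part of any real GCP SDK, and the bucket-name check is deliberately simplified. The pattern mirrors alternative constructors such as `from_service_account_json` in the google-cloud Python client libraries.

```python
class StorageClient:
    """Hypothetical client class, for illustration only."""

    default_project = "demo-project"

    def __init__(self, project, timeout=30):
        self.project = project
        self.timeout = timeout

    @classmethod
    def from_config(cls, config):
        """Alternative constructor.

        Receives the class itself as `cls`, so it can read class-level
        settings and builds an instance of whichever subclass it was
        called on.
        """
        return cls(
            config.get("project", cls.default_project),
            timeout=config.get("timeout", 30),
        )

    @staticmethod
    def is_valid_bucket_name(name):
        """Helper that receives neither `self` nor `cls`.

        Simplified check; the real bucket naming rules are stricter.
        """
        return 3 <= len(name) <= 63 and name == name.lower()


class RetryingStorageClient(StorageClient):
    default_project = "retry-project"
```

Calling `RetryingStorageClient.from_config({})` returns a `RetryingStorageClient` with the subclass's default project, which a static method could not do without hard-coding the class name.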

30

String Shift

Reference answer

A coding interview question. Likely involves rotating or shifting characters in a string by a given number of positions, either left or right.
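Since the exact problem statement is not given, the following is only a sketch of one common reading: rotate the string by `k` positions to the left or right, wrapping around. The function name and the `direction` parameter are assumptions.

```python
def shift_string(s, k, direction="left"):
    """Rotate s by k positions; k may be larger than len(s)."""
    if not s:
        return s
    k %= len(s)  # a full rotation leaves the string unchanged
    if direction == "right":
        # A right rotation by k equals a left rotation by len(s) - k.
        k = (len(s) - k) % len(s)
    return s[k:] + s[:k]
```

Slicing keeps the solution at O(n) time and avoids shifting one character at a time.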

31

Can you design a simple OLTP architecture that will convince the Redbus team to give X project to you?

Reference answer

Design an OLTP system for Redbus (a bus ticketing platform) with a normalized schema for transactions: tables for buses, routes, schedules, seats, bookings, customers, and payments. Ensure ACID compliance for booking transactions. Use indexing on key columns (e.g., schedule_id, seat_id) for fast reads/writes. Implement a queue for concurrent bookings to avoid race conditions.
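The double-booking concern can be illustrated with a small sketch. Note that it uses a database uniqueness constraint inside a transaction rather than the queue mentioned above; that is a swapped-in, simpler technique chosen so the example is runnable with SQLite. The `bookings` table and its columns are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bookings (
    booking_id  INTEGER PRIMARY KEY,
    schedule_id INTEGER NOT NULL,
    seat_id     INTEGER NOT NULL,
    customer_id INTEGER NOT NULL,
    UNIQUE (schedule_id, seat_id)   -- at most one booking per seat per trip
);
""")


def book_seat(schedule_id, seat_id, customer_id):
    """Try to book a seat atomically; return True on success."""
    try:
        with conn:  # commits on success, rolls back if an error is raised
            conn.execute(
                "INSERT INTO bookings (schedule_id, seat_id, customer_id) "
                "VALUES (?, ?, ?)",
                (schedule_id, seat_id, customer_id),
            )
        return True
    except sqlite3.IntegrityError:
        # Someone else already holds this seat on this schedule.
        return False
```

Letting the database reject the second insert means two concurrent requests for the same seat cannot both succeed, whichever one arrives first.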

32

How do you automate ETL processes on GCP?

Reference answer

- Use Cloud Composer for orchestration
- Automate tasks with Airflow DAGs
- Schedule Dataflow and BigQuery jobs

Example: I created a Cloud Composer workflow that automated daily data ingestion, processing, and reporting, reducing manual effort by 90%.

33

What is the Function of a Bucket in Google Cloud Storage?

Reference answer

Buckets are the basic containers that hold your data. Everything stored in Cloud Storage must be contained in a bucket. There is no limit on the number of buckets that can be created or deleted. Unlike directories and folders, however, buckets cannot be nested.

34

What is Cloud Composer?

Reference answer

Cloud Composer is a fully-managed workflow orchestration service from Google Cloud. It enables users to author, schedule, and monitor multi-step data pipelines using popular open-source tools such as Apache Airflow. With Cloud Composer, users can create and manage complex workflows that integrate with other cloud services, making it easier to build scalable and reliable data pipelines in the cloud.

35

Explain the concept of partitioning in Google BigQuery. How does it improve query performance?

Reference answer

Partitioning in Google BigQuery involves breaking a table into smaller, manageable segments based on a column's values (e.g., date or timestamp). When querying partitioned tables, BigQuery only processes the partitions relevant to the query, reducing the amount of data scanned. This significantly improves query performance and reduces costs, as only the required data is processed.

36

What is Cloud Source Repositories?

Reference answer

Cloud Source Repositories is a fully managed Git repository service on Google Cloud. It offers a scalable and secure environment to host code, which enables collaboration and version control among development teams.

37

What are the four Vs of Big Data?

Reference answer

The four characteristics, or four Vs, of Big Data are:

- Volume
- Veracity
- Velocity
- Variety

38

Explain the concept of windowing in Apache Beam/Dataflow.

Reference answer

Windowing in Apache Beam/Dataflow allows you to divide the data stream into finite and logical time intervals called windows for processing. It enables you to perform computations over time-based or event-based windows, such as fixed windows, sliding windows, and session windows. Windowing helps manage the processing of streaming data by providing control over how data is grouped and aggregated within specific time boundaries.
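The idea of assigning elements to windows can be sketched in plain Python. This is not Beam code: Beam provides its own window types (such as `FixedWindows` and `SlidingWindows`), and the helper names below are made up for illustration. Timestamps and sizes are integers in seconds, and each window is the half-open interval `[start, start + size)`.

```python
from collections import defaultdict


def fixed_window(ts, size):
    """Start of the single fixed window that contains timestamp ts."""
    return ts - ts % size


def sliding_windows(ts, size, period):
    """Starts of every sliding window [start, start + size) containing ts.

    A new window opens every `period` seconds, so when size > period
    each element belongs to several overlapping windows.
    """
    starts = []
    start = ts - ts % period  # latest window that has already opened
    while start > ts - size:
        starts.append(start)
        start -= period
    return sorted(starts)


def sum_per_fixed_window(events, size):
    """Group (timestamp, value) events into fixed windows and sum values."""
    totals = defaultdict(int)
    for ts, value in events:
        totals[fixed_window(ts, size)] += value
    return dict(totals)
```

For example, with 60-second windows that open every 30 seconds, an element at second 65 falls into the windows starting at 30 and at 60.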

39

How do you prioritize tasks when managing multiple deadlines?

Reference answer

Mention frameworks like Eisenhower Matrix or Agile sprints. Explain how you balance high-priority business needs with technical debt and proactively flag risk if bandwidth becomes a blocker.

40

Given a dataset, find the time period when the most people were online, measured in seconds.

Reference answer

You need to analyze a dataset containing user online/offline timestamps. The solution involves parsing timestamps, identifying overlapping intervals, and calculating the total duration in seconds when the maximum number of concurrent users were online. Typically, this is solved using window functions or event-based aggregation.
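The event-based approach can be sketched as a sweep over login and logout events. The input format, a list of `(login_ts, logout_ts)` pairs in epoch seconds, is an assumption, as is treating each session as the half-open interval `[login_ts, logout_ts)`.

```python
def peak_online(sessions):
    """Return (peak_users, peak_intervals, total_seconds_at_peak)."""
    events = []
    for start, end in sessions:
        events.append((start, 1))   # user comes online
        events.append((end, -1))    # user goes offline
    # Tuples sort by time first; at equal times -1 sorts before +1,
    # so a logout is applied before a login at the same second.
    events.sort()

    peak, current = 0, 0
    intervals = []
    for i, (ts, delta) in enumerate(events):
        current += delta
        if i + 1 == len(events):
            break
        next_ts = events[i + 1][0]
        if next_ts == ts:
            continue  # more events at this timestamp still to apply
        # `current` users are online during [ts, next_ts).
        if current > peak:
            peak, intervals = current, [(ts, next_ts)]
        elif current == peak and peak > 0:
            intervals.append((ts, next_ts))
    total = sum(end - start for start, end in intervals)
    return peak, intervals, total
```

Sorting dominates the cost, so the sweep runs in O(n log n). Adjacent peak intervals are reported separately rather than merged, which does not affect the total number of seconds.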

41

Write a query to report the median of a user's searches, rounding the median to one decimal point.

Reference answer

Write a SQL query that calculates the median of search counts per user and rounds the result to one decimal place, likely using percentile functions or window functions.
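The schema is not given, so the following only sketches the logic in Python rather than SQL: count searches per user, take the median of those counts, and round to one decimal place. The input format (one user ID per search event) is an assumption. In BigQuery, the median step would typically rely on `PERCENTILE_CONT(..., 0.5)`.

```python
from collections import Counter
from statistics import median


def median_searches(search_log):
    """Median number of searches per user, rounded to one decimal place.

    search_log: iterable of user IDs, one entry per search performed.
    Raises statistics.StatisticsError if the log is empty.
    """
    per_user = Counter(search_log)  # user_id -> number of searches
    return round(median(per_user.values()), 1)
```

With an even number of users the median is the mean of the two middle counts, which is why the rounding step matters.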

42

How do you create data pipelines?

Reference answer

You would want to make clear that you are confident working with both third-party ETL tools (Fivetran, Stitch, etc.) and bespoke data connectors you can write yourself. A data pipeline is something that extracts, transforms and/or loads data from point A into the destination at point B [4]. So all you need is to demonstrate that you know how to do it following three main data pipeline design patterns – batch (aggregate and process in chunks), streaming (process and load record by record), change data capture (CDC, identify and capture changes at point A to process and load into B). CDC and streaming are closely connected. For example, we can use MySQL binary log file to move data into our DWH solution in real time. It must be used with care and is not always the most cost-effective tool for data pipelines but it is worth mentioning this. Keep everything in order following the conceptual design diagram. It helps to explain many ETL things.

43

Which ETL tools do you know, and how is ETL different from ELT?

Reference answer

Answering this question we would want to demonstrate that we know how to extract, transform and load the data not only with third-party tools but also by writing our own bespoke data connectors and loaders. You can start with a quick note that there are managed solutions like Fivetran, Stitch, etc. that help with ETL. Don't forget to mention their pricing models that often are based on the number of records processed. You don't need third-party ETL tools when you know how to code. Don't be shy about saying this phrase. It is fairly easy to create your own ETL tool and then load the data into the DWH solution of your choice. Consider one of my previous articles where I extract millions of rows of data from MySQL or Postgres databases as an example. It explains how to create a robust data connector and extract data in chunks in a memory-efficient manner [12]. Things like this were designed to be serverless and can be easily deployed and scheduled in the cloud. We can even create our own bespoke data loading manager if we need to prepare and transform data before loading it into the DWH destination using cloud SDKs. It's a fairly complex application but it's worth learning it.

44

What is Cloud Spanner?

Reference answer

Cloud Spanner is a completely managed, strongly consistent and horizontally scalable relational DB service. It is crafted for mission-critical apps that need global distribution, ACID transactions and high availability.

45

Explain the concept of "windowing" in Dataflow.

Reference answer

Windowing allows grouping of elements into finite chunks based on time, event triggers, or other criteria, essential for processing unbounded datasets in streaming pipelines.

46

Tell me about a time you had to handle trade offs and ambiguity.

Reference answer

Describe a situation where requirements were unclear or conflicting. For example, needing to choose between a fast but less scalable solution versus a slower but scalable one. Explain how you gathered data, consulted stakeholders, made a decision based on priorities (e.g., time-to-market vs. long-term maintainability), and adapted as more information became available.

47

How have you handled a failed Dataflow job in a production environment?

Reference answer

- Analyzed error logs using Cloud Logging
- Identified resource bottlenecks and increased worker nodes
- Implemented checkpointing for job recovery

Example: In a production ETL pipeline, a Dataflow job failed due to resource exhaustion during high traffic. By enabling autoscaling and optimizing transformations, I ensured job completion without manual intervention.

48

How does Google BigQuery ensure high availability and reliability?

Reference answer

Google BigQuery ensures high availability and reliability through data replication and automatic backups. BigQuery replicates data across multiple data centers, providing redundancy and minimizing the risk of data loss. It also performs automatic backups of data and metadata, allowing recovery to any point within the last seven days. Additionally, Google's infrastructure and network architecture contribute to its overall reliability.

49

How do you ensure data is secure when you transfer it?

Reference answer

To keep data secure while it is being transferred, verify that it is encrypted with the intended encryption key and check that no data is leaked during the transfer.

50

What is Google Cloud Pub/Sub, and how can it be used in data engineering?

Reference answer

Google Cloud Pub/Sub is a messaging service designed for real-time event-driven applications. It allows decoupling of components in a system, ensuring reliable and scalable data ingestion and delivery. In data engineering, Pub/Sub can be used to ingest streaming data from various sources like IoT devices or log streams. Data can then be processed in real-time using services like Cloud Dataflow or stored in databases like BigQuery for further analysis.

51

What are Google Cloud Functions, and when would you use them?

Reference answer

Google Cloud Functions is a serverless, event-driven service that lets users run code in response to events triggered by Google Cloud services or external sources. It provides a scalable and inexpensive way to run short pieces of code without having to manage infrastructure. Use it for jobs where you need to respond to events quickly and efficiently without worrying about server management, such as data processing, automation, or lightweight APIs.

52

How do you integrate data from multiple systems?

Reference answer

Approaches include using ETL/ELT pipelines, API integrations, data virtualization, or message queues. Discuss considerations like data schema mapping, handling duplicates, and ensuring data quality.

53

What is Google Cloud Shell?

Reference answer

Google Cloud Shell is a browser-based command-line interface (CLI) provided by Google Cloud that enables users to manage their Google Cloud Platform resources from anywhere with an internet connection. It provides a pre-configured environment with popular tools and SDKs, allowing users to easily access and manage their cloud resources using CLI commands. Google Cloud Shell also supports file editing, version control, and customization, making it a powerful tool for cloud development and administration.

54

You need to process a 100 GB log file in Python. How do you read it without crashing the machine?

Reference answer

    # Use iterators/generators - memory efficient
    # (process_line and transform_line are placeholders for your own logic)
    with open('huge_file.log') as f:
        for line in f:  # Reads one line at a time
            process_line(line)

    # Or with generators for processing pipelines
    def process_logs(filename):
        with open(filename) as f:
            for line in f:
                yield transform_line(line)

Red flag: "I'll just use pandas.read_csv()". This loads everything into memory and will crash.

Why this matters: This is the #1 mistake junior developers make with large files. Senior engineers know memory management is critical.

55

How do you propose getting a larger quota for the project?

Reference answer

Every Google Compute Engine project is assigned a default quota of resources. Quotas can also be increased on a project-by-project basis. The Quotas tab of the Google Cloud Platform Console shows the limits currently in place for the project. If you find that a quota limit has been reached and you want to request more resources, you can do so from the Quotas page under IAM. Click the Edit Quotas link in the top right corner of the page to request an increase.

56

What are the connections between Google Compute Engine and Google App Engine?

Reference answer

Google Compute Engine (GCE) and App Engine (GAE) are core Google Cloud services that work together for scalable, high-performance applications.

- Compute vs. Serverless: GCE offers customizable VMs for full control, while GAE provides a fully managed, auto-scaling platform for hassle-free app deployment.
- Scalability & Flexibility: App Engine auto-scales with traffic, ideal for web apps, while Compute Engine requires manual scaling but allows custom CPU, memory, and OS settings.
- Seamless Networking: GAE can connect with GCE for backend processing, AI, and high-performance computing via Google's global network.
- Hybrid Deployments: Businesses use GAE for APIs and frontend apps, leveraging GCE for databases, machine learning, and heavy processing.
- Deep Cloud Integration: Both services connect with Cloud Storage, BigQuery, Firestore, and AI tools for smooth data handling.

57

How can strings be divided (using Python or any other language)? How can it be scaled into a large number of records?

Reference answer

This question assesses your ability to manipulate strings efficiently and scale operations for large datasets. In Python, strings can be divided using methods like .split() for delimiters, slicing for fixed positions, or regular expressions for complex patterns. To scale this to a large number of records, you can use distributed processing frameworks like Apache Spark (with PySpark) or parallelize operations using multiprocessing or threading in Python. For very large datasets, consider using map-reduce paradigms or cloud-based solutions like Google Cloud Dataflow to process records in batches.
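The three splitting techniques named above can be sketched on a made-up record format, `id|name|tags`, where the first four characters of the ID are a year. The format and function names are assumptions for illustration.

```python
import re


def split_record(line):
    """Split one 'id|name|tags' record using three techniques."""
    record_id, name, tags = line.split("|")   # split on a delimiter
    year = record_id[:4]                      # slice at a fixed position
    # Regular expression: tags may be separated by commas,
    # semicolons, or whitespace in any combination.
    tag_list = [t for t in re.split(r"[,;\s]+", tags) if t]
    return {"id": record_id, "year": year, "name": name, "tags": tag_list}


def split_records(lines):
    """Generator, so large inputs are handled one record at a time."""
    for line in lines:
        yield split_record(line.rstrip("\n"))
```

Because `split_record` is a pure function of one line, the same function could in principle be applied per element in a distributed framework such as Spark or Dataflow, or handed to a `multiprocessing` pool, without changing its logic.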

58

What is the difference between Dataproc and Dataflow, and when would you use each?

Reference answer

- Dataproc: Best for existing Hadoop/Spark workloads
- Dataflow: Ideal for stream and batch data processing with minimal infrastructure management

Example: In a machine learning project, I used Dataproc to run Spark jobs for large-scale model training, whereas Dataflow was utilized for real-time feature engineering.

59

How do you ensure the reproducibility and scalability of your machine-learning experiments on GCP?

Reference answer

To ensure the reproducibility and scalability of my machine-learning experiments on GCP, I version datasets and models to keep track of changes and updates. I use AI Platform Pipelines to orchestrate ML workflows and ML Metadata for tracking metadata related to experiments. Additionally, I use Kubernetes Engine to create containerized environments, which ensures consistent and scalable runs of my experiments.

60

How can you ensure high availability and fault tolerance in GCP?

Reference answer

GCP offers regional and multi-regional options for deploying services across multiple zones and regions. Load balancing, auto-scaling, and managed instance groups help ensure high availability and fault tolerance.

61

What is Cloud Logging?

Reference answer

Cloud Logging is a Google Cloud service that enables users to store, search, and analyze logs from their cloud resources and applications. It provides real-time and historical insights into system events, errors, and performance, allowing users to troubleshoot issues and debug their cloud deployments. Cloud Logging integrates with other Google Cloud services, such as Cloud Monitoring and Cloud Trace, to provide a unified view of the cloud environment.

62

How do you ensure data security and compliance in GCP?

Reference answer

To ensure data security and compliance in Google Cloud Platform (GCP), use Identity and Access Management (IAM) to control permissions, enable audit logging to track and monitor activity, and encrypt data both in transit and at rest. Apply security patches and updates regularly, and use GCP's built-in security tools, such as Security Command Center, for threat detection and compliance checks. In addition, periodic security reviews and adherence to regulations such as GDPR and HIPAA help maintain continuous compliance and security.

63

How would you handle large data processing tasks in Google Cloud?

Reference answer

For large data processing tasks in Google Cloud, you can:

- Use Google Cloud Dataproc for running Hadoop and Spark workloads in a managed environment.
- Implement Google Cloud Dataflow to process data in both real-time and batch.
- Optimize your pipeline by partitioning data and using BigQuery for scalable data analytics.
- Leverage Cloud Storage to store large datasets and ensure the pipeline scales automatically.

64

What services does GCP provide?

Reference answer

Google Cloud Platform (GCP) provides a wide range of services. Here are some, categorized by domain:

Compute:
- Google Compute Engine (Virtual Machines)
- Google Kubernetes Engine (Container-based applications)

Storage & Databases:
- Google Cloud Storage
- Cloud SQL
- Firestore

Networking:
- Google Virtual Private Cloud (VPC)
- Cloud Load Balancing

Big Data:
- BigQuery
- Cloud Dataflow

Machine Learning:
- Google AI Platform
- AutoML

Identity & Security:
- Cloud Identity and Access Management (IAM)
- Cloud Identity-Aware Proxy

65

Write a SQL query to find the top 10 most expensive products from a BigQuery dataset.

Reference answer

To find the top 10 most expensive products from a BigQuery dataset, you can use the following SQL query:

    SELECT product_name, price
    FROM products
    ORDER BY price DESC
    LIMIT 10;

This query selects the product names and prices, sorts them in descending order by price, and limits the results to the top 10.

66

What is the purpose of the BigQuery Data Transfer Service?

Reference answer

The BigQuery Data Transfer Service automates the process of moving data from various sources into BigQuery, making it easier to manage and analyze data. It supports a wide range of data sources, including Google Ads, YouTube, and external SaaS applications, simplifying the ETL process.

67

Why did you choose this algorithm, and can you compare it with other similar algorithms?

Reference answer

Interviewers want to know what you think about choosing one algorithm over another. It might be easiest to focus on a project you worked on and link any follow-up questions to that project. If you have an example of a project and an algorithm that relates to the company's work, choose that one. List the models you worked with, and then explain the analysis, results, and impact. The interviewer might also ask: 'What is the scalability of this algorithm?' or 'What would you do differently if you were to do the project again?'

68

What strategies do you use for optimizing query performance in large datasets?

Reference answer

Strategies for optimizing query performance include: - Proper indexing of frequently queried columns - Partitioning large tables - Using materialized views for complex, frequently-run queries - Query optimization and rewriting - Implementing caching mechanisms - Using columnar storage formats for analytical workloads - Leveraging distributed computing for large-scale data processing

69

Design a data model in order to track product from the vendor to the Amazon warehouse to delivery to the customer.

Reference answer

Create a star schema with a fact table for product movement events (e.g., event_id, product_id, vendor_id, warehouse_id, order_id, event_type, event_timestamp). Dimension tables include product, vendor, warehouse, order, and customer. Track status changes like 'received from vendor', 'stored in warehouse', 'shipped to customer', 'delivered'. Use SCD for product attributes.

70

What are the advantages of using Google Cloud Data Fusion over custom ETL solutions?

Reference answer

Google Cloud Data Fusion offers several advantages over custom ETL (Extract, Transform, Load) solutions: - No-code/low-code development: Data Fusion's visual interface allows users to build ETL pipelines without writing complex code, reducing development time and effort. - Simplified deployment and management: Data Fusion is a fully managed service, eliminating the need for manual infrastructure setup and maintenance. - Scalability: Data Fusion automatically scales resources based on the workload, ensuring seamless handling of large-scale data processing. - Pre-built connectors: Data Fusion provides a wide range of pre-built connectors to various data sources, making it easier to integrate with different data systems.

71

Explain the main firewall rules in cloud computing.

Reference answer

Main firewall rules in cloud computing control inbound and outbound traffic to virtual machine instances, based on parameters like source IP ranges, destination ports, and protocols, to secure the network.

72

What is Apache Flink?

Reference answer

Apache Flink is an open-source stream processing framework for distributed, high-performing, always-available, and accurate data streaming applications. It provides precise control of time and state, allowing for consistent and accurate results even in the face of out-of-order or late-arriving data.

73

Which Google Cloud solution provides global, strongly consistent ACID transactions with SQL access and supports concurrent updates across multiple regions at approximately 30 million operations per day?

Reference answer

B. Cloud Spanner with locking read write transactions. The correct option is Cloud Spanner with locking read write transactions. It is the only Google Cloud database that delivers global strongly consistent ACID transactions with SQL while supporting concurrent updates across multiple regions at the described scale. Locking read write transactions in this service provide serializable isolation for reads and writes which is the highest level of transactional correctness for concurrent updates. It uses TrueTime to achieve external consistency across regions and replicates data synchronously so reads and writes remain strongly consistent worldwide. The workload of about 30 million operations per day is well within its horizontally scalable architecture. Cloud SQL with BigQuery federation is not suitable because Cloud SQL is a regional service and cross region replication is asynchronous which does not provide strongly consistent multi region writes. Federation in BigQuery is for analytical querying of external data and it does not offer transactional guarantees or support for distributed ACID updates. AlloyDB for PostgreSQL with read replicas is also not suitable because it is a regional system and its replicas are for reads. It does not offer globally strongly consistent multi region write transactions or external consistency for concurrent updates across regions. When you see requirements that include global scope, strongly consistent ACID transactions, and multi region concurrency with SQL, map directly to Spanner with locking read write transactions. Performance figures like tens of millions of operations per day are a good fit for horizontally scalable distributed databases.
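
As a rough sanity check on the scale in the question, the daily figure can be converted into an average per-second rate. The sketch below assumes the load is spread evenly over the day, which real workloads rarely are, so the hypothetical `peak_factor` parameter is only an illustrative allowance for bursts.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def average_ops_per_second(ops_per_day: float) -> float:
    """Average rate if operations were spread evenly across the day."""
    return ops_per_day / SECONDS_PER_DAY


def peak_ops_per_second(ops_per_day: float, peak_factor: float) -> float:
    """Rough peak estimate; peak_factor is an assumed burst multiplier."""
    return average_ops_per_second(ops_per_day) * peak_factor


avg = average_ops_per_second(30_000_000)
peak = peak_ops_per_second(30_000_000, peak_factor=10)
print(f"average: {avg:.1f} ops/s")  # about 347.2 ops/s
print(f"assumed 10x peak: {peak:.0f} ops/s")
```

An average of roughly 350 operations per second is consistent with the answer's point that this volume is well within range for a horizontally scalable database.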

74

Tell me about a time you were part of an organization in transition and how you helped them move forward.

Reference answer

Example: During a migration from on-premise to cloud, you helped by learning new tools (e.g., GCP), documenting processes, training team members, and automating migration scripts. Show how you facilitated the transition, minimized downtime, and improved team efficiency.

75

Your Pub/Sub topic is receiving millions of messages per second. How do you ensure no data is lost?

Reference answer

Enable message retention on the Pub/Sub topic and use acknowledged delivery. Connect it to a Dataflow streaming pipeline with checkpointing enabled. If the consumer falls behind, retained messages ensure no data is dropped during high-traffic periods.
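
The "acknowledged delivery" point can be sketched as a subscriber callback that acknowledges a message only after it has been processed successfully. This is a minimal sketch rather than a complete subscriber: `process` is a hypothetical handler, and `FakeMessage` is a stand-in written for this example that mimics the `data`, `ack()`, and `nack()` members of a Pub/Sub client library message.

```python
from typing import Callable


def make_callback(process: Callable[[bytes], None]) -> Callable:
    """Build a callback that acknowledges only after successful processing."""

    def callback(message) -> None:
        try:
            process(message.data)
        except Exception:
            # Not acknowledged: the message stays eligible for redelivery.
            message.nack()
        else:
            # Processing finished, so it is safe to acknowledge.
            message.ack()

    return callback


class FakeMessage:
    """Stand-in for a Pub/Sub message, used only to demonstrate the callback."""

    def __init__(self, data: bytes) -> None:
        self.data = data
        self.status = None

    def ack(self) -> None:
        self.status = "acked"

    def nack(self) -> None:
        self.status = "nacked"


def flaky_process(data: bytes) -> None:
    """Hypothetical handler that fails on one specific payload."""
    if data == b"bad":
        raise ValueError("cannot process payload")


good, bad = FakeMessage(b"good"), FakeMessage(b"bad")
callback = make_callback(flaky_process)
callback(good)
callback(bad)
print(good.status, bad.status)  # acked nacked
```

With the real client library, a callback of this shape would be passed to the subscriber client's `subscribe` method. Because a failed message is never acknowledged, it remains available for redelivery instead of being lost.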

76

What are the advantages or benefits of using Compute Engine?

Reference answer

Compute Engine offers kernel-level control and encryption, and makes it easy to create and configure high-performance virtual machines that scale quickly to workloads of any size. Advantages include: - Storage efficiency - Stability - Easy integration - Confidential Computing - Security - Global compute capacity as required

77

What is BigQuery ML and how can it be useful in data engineering?

Reference answer

BigQuery ML allows building and training machine learning models directly within BigQuery using SQL queries. This eliminates data movement, speeds up the development process, and enables data engineers to integrate ML tasks seamlessly with analytics workflows.

78

What are the components of Hadoop?

Reference answer

Hadoop has the following components: - Hadoop Common: A collection of Hadoop tools and libraries. - Hadoop HDFS: The Hadoop Distributed File System (HDFS) is Hadoop's storage unit. It stores data in a distributed fashion and has two kinds of nodes: a name node and data nodes. A cluster has one active name node and can have many data nodes. - Hadoop MapReduce: MapReduce is Hadoop's processing unit. Processing is done on the worker nodes, and the final result is delivered to the master node. - Hadoop YARN: YARN stands for Yet Another Resource Negotiator. It is Hadoop's resource management unit, introduced as a component in Hadoop version 2. It manages cluster resources to avoid overloading a single machine.

79

Provide a rundown of the most significant advantages gained by utilizing Google's cloud services.

Reference answer

The primary advantages of GCP include: - Google Cloud Platform makes it simple to fine-tune the CPU, RAM, and storage of a virtual machine (VM). The VM rightsizing recommendations quickly show whether the machines in your environment are using an appropriate amount of hardware. - You have access to Cloud Shell, which comes pre-loaded with a broad set of helpful tools and lets you manage your infrastructure with a few keystrokes. Docker, Gradle, Make, npm, nvm, pip, and much more are pre-installed and ready to use. - Configurable CPU, RAM, and storage let you quickly prototype new machine configurations. - Preemptible virtual machines can cut costs by as much as 70 percent for fault-tolerant and batch workloads. - When automatic storage increase is enabled, Cloud SQL checks the database's available storage every 30 seconds and adds more if required. - A persistent disk can be resized while it is in use, without disrupting service. Its size can be increased but not decreased.

80

What is Cloud SQL?

Reference answer

Cloud SQL is a fully managed relational database service provided by GCP that allows users to host and manage MySQL, PostgreSQL, and SQL Server databases on the cloud. It provides features like automatic backups, replication, and high availability that make it easy to build and maintain databases on the cloud.

81

How would you set up logging and monitoring for a GCP data pipeline?

Reference answer

- Use Cloud Logging for real-time log aggregation - Configure Cloud Monitoring to track metrics - Set alerts for job failures using Alert Policies

82

Explain the concept of federated queries in BigQuery.

Reference answer

Federated queries allow you to query data stored outside of BigQuery, such as in Cloud SQL, Google Cloud Storage, Google Sheets, or Cloud Bigtable, without needing to load it into BigQuery first. Data in Cloud Storage, Google Sheets, and Bigtable is queried through external tables, while operational databases such as Cloud SQL are queried through a connection by using the EXTERNAL_QUERY function.

83

What is Google Cloud Identity and Access Management (IAM)?

Reference answer

Google Cloud Identity and Access Management, more often called IAM, offers fine-grained access control over GCP resources. IAM helps administrators manage who has what kind of access to which resources. It also supports role-based access control (RBAC) and integrates with multiple identity providers for centralized access management.

84

Explain the use of Cloud Key Management Service (KMS) in GCP

Reference answer

Cloud KMS is a managed service in GCP for generating, using, and managing encryption keys. It helps you encrypt data and control access to sensitive information.

85

You need to load data from a third-party REST API into BigQuery daily. How would you automate this?

Reference answer

Write a Python Cloud Function to call the API and store the response in Cloud Storage. Schedule it using Cloud Scheduler. Trigger a BigQuery load job after the file lands. This creates a lightweight, serverless, and fully automated daily ingestion workflow.
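
One step in this workflow that often needs care is the format of the file written to Cloud Storage. BigQuery load jobs accept newline-delimited JSON, so the conversion step inside the function might be sketched as follows. The function name `to_ndjson` and the sample record fields are hypothetical; calling the API and writing to the bucket are omitted.

```python
import json
from typing import Iterable, Mapping


def to_ndjson(records: Iterable[Mapping]) -> str:
    """Serialize records as newline-delimited JSON, one object per line."""
    # sort_keys keeps the output stable, which makes files easier to diff.
    return "".join(json.dumps(record, sort_keys=True) + "\n" for record in records)


# Hypothetical API response already parsed into Python dictionaries.
sample = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
payload = to_ndjson(sample)
print(payload, end="")
```

Writing one JSON object per line means the load job can treat each line as one row, and an empty API response produces an empty file rather than a malformed one.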

86

Write a Python script to upload a file to Google Cloud Storage using the google-cloud-storage library.

Reference answer

To upload a file to Google Cloud Storage using the google-cloud-storage library, first install the library with pip install google-cloud-storage. Then, authenticate using a service account key and write a Python script to create a storage client, specify the bucket, and upload the file.
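
Since the question asks for a script, here is a minimal sketch of those steps. It assumes credentials are already available to the client (for example, a service account key referenced by the `GOOGLE_APPLICATION_CREDENTIALS` environment variable). The bucket, file, and object names in the usage comment are placeholders, and the optional `client` parameter is an addition for this example so the function can be exercised without real credentials.

```python
def upload_file(bucket_name: str, source_path: str, destination_name: str, client=None) -> str:
    """Upload a local file to a bucket and return its gs:// URI."""
    if client is None:
        # Imported lazily so the function can also be called with a stub client.
        from google.cloud import storage

        client = storage.Client()

    bucket = client.bucket(bucket_name)      # reference to the target bucket
    blob = bucket.blob(destination_name)     # object name inside the bucket
    blob.upload_from_filename(source_path)   # performs the upload
    return f"gs://{bucket_name}/{destination_name}"


# Example call (placeholder names, requires real credentials):
# upload_file("my-bucket", "local/report.csv", "reports/report.csv")
```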

87

How do you monitor and debug jobs running in Dataflow?

Reference answer

Use Dataflow's Monitoring UI to view job graphs, step metrics (e.g., element counts, throughput, system lag), and worker logs. Set up Cloud Monitoring alerts for key metrics like job failure, high system lag, or backlog. Use Cloud Logging (formerly Stackdriver Logging) to capture detailed logs from pipeline steps. Enable Dataflow's built-in support for metrics like watermark lag and data freshness. For debugging, use the pipeline's execution graph to identify slow or failing steps, and use the 'Step' view to inspect element counts and errors. Also, implement custom counters in the pipeline code for business-level monitoring.

88

Can you explain the concept of data partitioning and why it is used?

Reference answer

Data partitioning is the process of dividing a database or data warehouse into smaller, more manageable pieces, or partitions. It is used to improve performance, manageability, and scalability. By partitioning data, queries can be executed more efficiently, as they can target specific partitions rather than scanning the entire dataset.
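
The idea that a query can target specific partitions instead of scanning everything can be illustrated with a toy in-memory table. This is only a sketch of the principle, not of how any particular warehouse implements it; the class name, the date-based partition key, and the `rows_scanned` counter are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import date


class PartitionedTable:
    """Toy table whose rows are grouped into one partition per event date."""

    def __init__(self) -> None:
        self.partitions = defaultdict(list)
        self.rows_scanned = 0  # counts rows actually read by queries

    def insert(self, event_date: date, row: dict) -> None:
        self.partitions[event_date].append(row)

    def query(self, start: date, end: date) -> list:
        """Return rows dated start..end, reading only matching partitions."""
        result = []
        for day, rows in self.partitions.items():
            if start <= day <= end:  # partitions outside the range are skipped
                self.rows_scanned += len(rows)
                result.extend(rows)
        return result


table = PartitionedTable()
for day in range(1, 11):          # 10 daily partitions
    for i in range(100):          # 100 rows each, 1,000 rows in total
        table.insert(date(2024, 1, day), {"day": day, "value": i})

rows = table.query(date(2024, 1, 3), date(2024, 1, 4))
print(len(rows), table.rows_scanned)  # 200 200
```

Only 200 of the 1,000 rows are read, because the eight partitions outside the date range are never opened.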

89

Which Python libraries would you recommend for effective data processing?

Reference answer

This question allows the hiring manager to determine whether the candidate understands the fundamentals of Python, which is the most commonly used language among data engineers. Your answer should include NumPy, which is used for efficient processing of numeric arrays, and pandas, which is useful for statistics and for preparing data for machine learning work.

90

Describe a time you disagreed with a teammate on a technical approach. How did you resolve it?

Reference answer

We were designing a new API, and there was disagreement between me and another engineer about whether to use REST or gRPC. I advocated for gRPC because we were building microservices that needed low latency. The other engineer wanted REST because it's simpler and more familiar to the team. Instead of arguing, I proposed we evaluate both against our actual requirements. We created a simple benchmark using our typical payload sizes and latencies. gRPC was about 30% faster but added complexity to the build process and client tooling. We then talked to the team: how important is that 30% improvement? How much does complexity hurt? Turns out, for our use cases, we weren't latency-bound—the 30% didn't matter for the business. But the complexity did matter for the team's ability to debug and maintain the system. We went with REST. My colleague made good points I hadn't fully considered. In retrospect, I was optimizing for performance when the real constraint was maintainability. Since then, I approach these discussions differently—I lead with requirements first, then evaluate solutions against those requirements. It's less about who's right and more about what the data says.

91

What is the slowly changing dimension (SCD)?

Reference answer

Slowly changing dimension (SCD) is a concept in data warehousing that describes how to handle changes to dimension data over time. There are different types of SCDs, with the most common being: - Type 1: Overwrite the old value - Type 2: Create a new row with the changed data - Type 3: Add a new column to track changes
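
Type 2 is the variant that most often needs explaining, so here is a minimal sketch of it in plain Python. The row layout (`key`, `attrs`, `valid_from`, `valid_to`) and the function name `apply_scd2` are assumptions for illustration; real warehouses usually express the same logic in SQL.

```python
from datetime import date


def apply_scd2(history: list[dict], key: str, new_attrs: dict, effective: date) -> list[dict]:
    """Type 2 update: close the current row for `key` and append a new version."""
    current = next(
        (row for row in history if row["key"] == key and row["valid_to"] is None),
        None,
    )
    if current is not None:
        if current["attrs"] == new_attrs:
            return history  # nothing changed, so no new version is needed
        current["valid_to"] = effective  # end-date the old version
    history.append(
        {"key": key, "attrs": dict(new_attrs), "valid_from": effective, "valid_to": None}
    )
    return history


history: list[dict] = []
apply_scd2(history, "cust-1", {"city": "Austin"}, date(2023, 1, 1))
apply_scd2(history, "cust-1", {"city": "Denver"}, date(2024, 6, 1))
for row in history:
    print(row["attrs"]["city"], row["valid_from"], row["valid_to"])
```

After the second call the table holds two rows for the same customer: the Austin row closed on 2024-06-01 and an open-ended Denver row. A Type 1 update would instead have overwritten the city in place and lost the history.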

92

Can you explain the steps and considerations for migrating a large-scale on-premises application to GCP?

Reference answer

To migrate a large-scale on-premises application to GCP, I would start with an assessment phase to evaluate the current infrastructure and identify dependencies. In the planning phase, I would design the migration strategy, including selecting appropriate GCP services and tools like Migrate for Compute Engine for VM migration and data transfer options. During the execution phase, I would re-architect the application for the cloud, handle dependencies, perform thorough testing, and implement strategies to minimize downtime.

93

If you accidentally delete an instance, would you be able to retrieve it?

Reference answer

Deleted instances no longer form a part of the organization's project and cannot be retrieved. However, if an engineer has simply stopped an instance, they can restart it.

94

Write a SQL query to perform a window function that ranks products based on sales within each category.

Reference answer

To rank products based on sales within each category, you can use the RANK window function along with the PARTITION BY clause to group the data by category. Here's the SQL query:

SELECT
  product_name,
  category,
  sales,
  RANK() OVER (PARTITION BY category ORDER BY sales DESC) AS sales_rank
FROM sales_table;
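
To make the behavior of RANK with ties concrete, the same partition-then-order logic can be mimicked in plain Python. This is an illustrative sketch only: the sample rows are made up, and the output key `sales_rank` is a name chosen for the example.

```python
def rank_within_category(rows: list[dict]) -> list[dict]:
    """Mimic RANK() OVER (PARTITION BY category ORDER BY sales DESC)."""
    by_category: dict = {}
    for row in rows:
        by_category.setdefault(row["category"], []).append(row)

    ranked = []
    for items in by_category.values():
        items.sort(key=lambda r: r["sales"], reverse=True)
        prev_sales, prev_rank = None, 0
        for position, row in enumerate(items, start=1):
            # Tied rows share a rank; the next distinct value skips ahead.
            rank = prev_rank if row["sales"] == prev_sales else position
            ranked.append({**row, "sales_rank": rank})
            prev_sales, prev_rank = row["sales"], rank
    return ranked


sample = [
    {"product_name": "A", "category": "toys", "sales": 100},
    {"product_name": "B", "category": "toys", "sales": 100},
    {"product_name": "C", "category": "toys", "sales": 50},
    {"product_name": "D", "category": "books", "sales": 70},
]
result = {r["product_name"]: r["sales_rank"] for r in rank_within_category(sample)}
print(result)  # {'A': 1, 'B': 1, 'C': 3, 'D': 1}
```

Products A and B tie at rank 1, so C receives rank 3 rather than 2. If gaps after ties are not wanted, DENSE_RANK is the usual alternative.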

95

What are Google Cloud Functions used for?

Reference answer

Google Cloud Functions are well suited to lightweight, event-driven applications. Typical uses include handling Cloud Pub/Sub messages, reacting to changes in Cloud Storage, and processing HTTP requests. Cloud Functions scale automatically and remove the need for infrastructure management.

96

How do you handle missing or NULL values in a BigQuery SQL query?

Reference answer

SELECT
  COALESCE(email, 'not_provided') AS email,
  IFNULL(age, 0) AS age,
  IF(city IS NULL, 'unknown', city) AS city
FROM users;

Use COALESCE, IFNULL, or IF to replace NULLs with meaningful default values.

97

Given a list of words, return the top K most frequent words. When two words tie, the lexicographically smaller word comes first.

Reference answer

from collections import Counter


def top_k_words(words: list[str], k: int) -> list[str]:
    counts = Counter(words)
    ordered = sorted(counts.items(), key=lambda x: (-x[1], x[0]))
    return [w for w, _ in ordered[:k]]

Why this works: Counter(words) builds the frequency map in one O(n) pass. The sort key (-count, word) orders by count descending (negation flips ascending into descending) and then by word ascending, which is exactly the prompt's tie-break rule. Slicing [:k] bounds output to k items, and the overall complexity is O(n + u log u) where u = len(counts). For typical text data u ≪ n, so this is effectively linear with a tiny log factor.

98

In what ways can information stored in the cloud be safeguarded?

Reference answer

Every GCP customer is provided with a comprehensive set of preventative and detective safeguards. Customers of Google Cloud Platform (GCP) are given access to resources such as Virtual Private Cloud (VPC) networks, Identity and Access Management (IAM), and firewall rules that follow GCP best practices. Together, these help secure all services.

99

BeaconPlay is a media startup that serves soccer fans around the globe. The platform offers live broadcasts and an on-demand library of recorded matches, and the lead engineer wants viewers to have consistent playback quality for the recorded videos no matter where they are located. Which Google Cloud service should be used to efficiently deliver the on-demand content to a worldwide audience?

Reference answer

C. Cloud CDN. The correct option is Cloud CDN. Cloud CDN caches frequently accessed content at Google edge locations worldwide. This reduces latency and helps deliver consistent playback quality for recorded videos to viewers everywhere. It integrates with Cloud Storage and HTTP or HTTPS load balancers and serves media efficiently. Caching video segments near users reduces origin load and improves throughput, which is exactly what is needed for a global on demand library. Cloud Storage multi-region stores objects redundantly across multiple locations for durability and availability. It does not provide edge caching or global content acceleration, so it alone cannot ensure low latency playback for a worldwide audience. Cloud Load Balancing distributes traffic across backends and regions for scalability and uptime. It does not cache content at the edge and is not a content delivery network, so it will not on its own provide the consistent global performance needed for recorded video delivery. Cloud Storage Nearline is a storage class designed for infrequently accessed data. It has higher access and retrieval costs and is not intended for serving frequently watched media, and it does not provide global delivery optimizations. When a question emphasizes global delivery and low latency for static or recorded media, map the requirement to a CDN. Storage classes address cost and durability and load balancing addresses backend distribution, while the CDN solves edge caching and geographic proximity.

100

How do you handle schema evolution in Google BigQuery?

Reference answer

In Google BigQuery, schema evolution can be handled with schema auto-detection, which infers the schema of incoming data at load time, and with schema update options on load and query jobs, which allow adding new columns or relaxing REQUIRED columns to NULLABLE. You can also manually alter the schema by using ALTER TABLE statements or by creating views that allow flexibility in handling different data versions. BigQuery's support for nested and repeated fields in schemas also facilitates managing evolving data structures.

101

What are some of the popular open-source cloud computing platforms?

Reference answer

Some of the important open-source cloud computing platforms are listed as below:

102

How do you ensure compliance in a GCP environment?

Reference answer

To ensure compliance, use Cloud IAM for access control, set up audit logging, apply organization policies, and use tools such as Security Command Center to monitor and enforce security best practices.

103

How do you secure sensitive information in a GCP data pipeline?

Reference answer

- Encrypt data in transit and at rest - Use DLP API for masking sensitive data - IAM roles for access control - Secure keys with Cloud KMS Example: For a fintech client, we encrypted payment data using Cloud KMS and masked sensitive information using DLP API before storage in BigQuery.

104

What are the different data types supported by BigQuery?

Reference answer

BigQuery supports various data types including STRING, INTEGER, FLOAT, BOOLEAN, and TIMESTAMP. It also supports complex data types like ARRAY and STRUCT, which are essential for advanced data modeling and querying.

105

Explain the difference between structured and unstructured data.

Reference answer

Structured data is made up of well-defined data types with patterns (using algorithms and coding) that make them easily searchable. Unstructured data is a bundle of files in various formats, such as videos, photos, texts, audio, and more. Data engineers turn unstructured data into structured data for data analysis using different methods for transformation, often using ELT tools to transform and integrate data into a cloud-based data warehouse.

106

What is the difference between a region and a zone in GCP?

Reference answer

A region is a distinct geographic area composed of multiple zones. Within a region, a zone is an isolated deployment area that provides resources for fault tolerance and high availability. Zones enable redundancy within a region, while regions allow resources to be distributed worldwide. If a failure occurs, this setup helps maintain service continuity and balance the load.

107

Describe how you would set up a CI/CD pipeline using GCP services for a microservices-based application.

Reference answer

To set up a CI/CD pipeline using GCP services for a microservices-based application, I would begin by using Cloud Build for building and testing the code. The built container images would then be stored in Artifact Registry (the successor to Container Registry). For deployment, I would use Kubernetes Engine or Cloud Run, depending on the application requirements. Additionally, I would employ Infrastructure as Code (IaC) tools like Deployment Manager or Terraform to manage infrastructure, and I would monitor deployments with Google Cloud's Operations Suite to ensure smooth operation and quick issue resolution.

108

What is the difference between Google Cloud Storage and Persistent Disks?

Reference answer

- Cloud Storage: Scalable, object-based storage for unstructured data - Persistent Disks: Block storage for virtual machine instances

109

What are the various layers in the cloud architecture?

Reference answer

The different layers that constitute the cloud architecture are as follows: - Physical Layer: This constitutes the physical servers, network, and other aspects. - Infrastructure Layer: This layer includes storage, virtualized layers, and so on. - Platform Layer: This includes the operating system, apps, and other aspects. - Application Layer: This is the layer that the end-user directly interacts with.

110

What are best practices for cost optimization in BigQuery?

Reference answer

- Use capacity-based pricing (slot reservations, formerly flat-rate billing) for predictable workloads - Optimize queries to select only the required columns - Partition and cluster tables - Use materialized views for repetitive queries Example: Implementing partitioning by event_date reduced monthly query costs by 50% for a log analytics solution.

111

Compare GCS and Bigtable for storing semi-structured data.

Reference answer

GCS is a scalable object store ideal for storing large volumes of semi-structured data (e.g., JSON, Avro, Parquet) as files. It is cost-effective for archival and batch processing but has higher latency for point lookups. Bigtable is a fully managed, scalable NoSQL database designed for low-latency, high-throughput access to semi-structured data (e.g., time-series, event logs). It supports single-row lookups and scans with millisecond latency but is more expensive than GCS for storage. GCS is better for data lakes and analytics, while Bigtable is better for real-time applications requiring fast reads/writes.

112

Explain a complex scenario with joins involving 3+ tables and subqueries.

Reference answer

-- Find top customers by total order value with product details
SELECT
  c.customer_name,
  c.customer_id,
  SUM(oi.quantity * p.price) AS total_spent,
  COUNT(DISTINCT o.order_id) AS total_orders
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
WHERE c.customer_id IN (
    -- Subquery: customers who made orders in the last 6 months
    SELECT DISTINCT customer_id
    FROM orders
    WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH)
  )
  AND p.category IN (
    -- Subquery: top 3 product categories by sales
    SELECT category
    FROM (
      SELECT p2.category, SUM(oi2.quantity * p2.price) AS category_sales
      FROM products p2
      JOIN order_items oi2 ON p2.product_id = oi2.product_id
      GROUP BY p2.category
      ORDER BY category_sales DESC
      LIMIT 3
    ) AS top_categories
  )
GROUP BY c.customer_name, c.customer_id
HAVING total_spent > 1000
ORDER BY total_spent DESC;

113

What are the different methods for the authentication of Google Compute Engine API?

Reference answer

There are different methods for the authentication of Google Compute Engine API. They are as follows: - Through the client library - Using OAuth 2.0 - Directly using an access token

114

What is a PCollection in Apache Beam?

Reference answer

A PCollection is the core data abstraction in Apache Beam. It represents a distributed dataset that your pipeline works on, similar to how a DataFrame works in pandas but designed for distributed processing. A PCollection can be bounded, meaning it has a finite size like a batch file, or unbounded, meaning it is a continuous stream of data. Every transformation in a Beam pipeline takes one or more PCollections as input and produces a new PCollection as output.

115

What is the use of MFA?

Reference answer

MFA stands for Multi-factor authentication. It helps you protect your user accounts and company data with a wide variety of MFA verification methods such as push notifications, Google Authenticator, phishing-resistant Titan Security Keys, and using your Android or iOS device as a security key.

116

Explain the difference between Google Cloud Datastore and Google Cloud Bigtable.

Reference answer

Google Cloud Datastore is a NoSQL document database designed for small-to-medium-sized operational applications. It offers high availability and automatic scaling but may not be suitable for very large datasets. On the other hand, Google Cloud Bigtable is a NoSQL wide-column store, optimized for handling massive amounts of data with low latency. It is well-suited for analytical and time-series workloads, making it a preferred choice for big data scenarios.

117

What are the differences between structured and unstructured data?

Reference answer

| On the basis of | Structured | Unstructured |
|---|---|---|
| Storage | Structured data is stored in DBMS. | It is stored in unmanaged file structures. |
| Flexibility | It is less flexible as it is dependent on the schema. | It is more flexible. |
| Scalability | Not easy to scale. | Easy to scale. |
| Performance | Since we can perform a structured query, the performance is high. | The performance of unstructured data is low. |
| Analysis factor | Easy to analyze. | Hard to analyze. |

118

Write a SQL query to find the total number of orders placed by each customer.

Reference answer

To find the total number of orders placed by each customer, you can use the GROUP BY clause to group the orders by customer and the COUNT function to count the number of orders for each customer. Here's the SQL query:

SELECT customer_id, COUNT(order_id) AS total_orders
FROM orders
GROUP BY customer_id;

119

For a Pub/Sub push subscription, how should you configure retry behavior and dead lettering so messages survive short outages, retry with gradual delays, and are routed to a different topic after 10 delivery attempts?

Reference answer

B. Use exponential backoff for retries and configure a dead letter topic that is different from the source with a maximum of 10 delivery attempts. The correct option is Use exponential backoff for retries and configure a dead letter topic that is different from the source with a maximum of 10 delivery attempts. This configuration satisfies all three requirements. Exponential backoff spaces out push delivery retries which helps messages survive short outages without overwhelming the endpoint and the interval grows as failures continue. A dead letter policy then moves the message to a separate topic after the tenth failed delivery which prevents loops and provides a clear handoff path for failed processing. Use immediate retry and enable dead lettering to a different topic with a cap of 10 delivery attempts is incorrect because immediate retry can flood the endpoint during an outage and it does not provide gradual retry behavior. Set the acknowledgement deadline to 20 minutes is incorrect because the acknowledgement deadline does not control push retry pacing and it does not configure a dead letter route or enforce a delivery attempt limit. When you see requirements for surviving short outages and gradual retries and routing after a fixed number of attempts, choose exponential backoff with a dead letter topic and set maxDeliveryAttempts to the specified value.
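
The "gradual delays" in this answer can be pictured with a small schedule calculation. The sketch below assumes a simple doubling rule capped at a maximum, which is an illustration of exponential backoff in general and not a statement of the exact algorithm Pub/Sub uses; the 10-second minimum and 600-second maximum are example settings.

```python
def backoff_schedule(min_backoff: int, max_backoff: int, max_delivery_attempts: int) -> list[int]:
    """Delays in seconds before each redelivery, doubling up to a cap."""
    delays = []
    delay = min_backoff
    # The first attempt has no delay, so N attempts mean N - 1 retries.
    for _ in range(max_delivery_attempts - 1):
        delays.append(min(delay, max_backoff))
        delay *= 2
    return delays


schedule = backoff_schedule(min_backoff=10, max_backoff=600, max_delivery_attempts=10)
print(schedule)       # [10, 20, 40, 80, 160, 320, 600, 600, 600]
print(sum(schedule))  # 2430 seconds of waiting before dead lettering
```

Under these assumptions the endpoint has roughly 40 minutes to recover before the tenth failed attempt sends the message to the dead letter topic, whereas immediate retries would use up all ten attempts almost at once.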

120

Explain what Google Cloud SDK is.

Reference answer

Google Cloud SDK is a set of command-line tools and libraries that allow developers to interact with Google Cloud services, manage resources, and automate workflows from their local environment.

121

How do you load a CSV file from Cloud Storage into BigQuery using Python?

Reference answer

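from google.cloud import bigquery

client = bigquery.Client()

# Describe the source file: CSV with one header row, schema inferred.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)

# Start the load job and block until it finishes.
load_job = client.load_table_from_uri(
    "gs://bucket/file.csv",
    "project.dataset.table",
    job_config=job_config,
)
load_job.result()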

122

What is Cloud Dataproc?

Reference answer

Cloud Dataproc is a fully managed data processing service that allows users to easily create and manage Apache Hadoop, Apache Spark, and other big data clusters. A serverless option is also available for Spark workloads. It provides a highly scalable, performant, and cost-effective environment for running data processing workloads, and it integrates with other GCP services.

123

Is it possible to share data across pipeline instances?

Reference answer

There is no Dataflow-specific cross-pipeline communication mechanism for sharing data or processing context between pipelines. Instead, you can use durable storage such as Cloud Storage, or an in-memory cache such as the one App Engine provides, to share data between pipeline instances.

124

Explain the role of Cloud Storage in GCP

Reference answer

Cloud Storage is an object storage service provided by GCP. It offers scalable, durable, and highly available storage for objects of any size. It can be used for storing files, backups, and serving static content.

125

What are some common bottlenecks in GCP data pipelines, and how do you resolve them?

Reference answer

- Inefficient transformations: Optimize logic and minimize data shuffling - Large data volumes: Use partitioning and clustering - Slow streaming jobs: Use autoscaling and checkpointing

126

Explain what Google BigQuery is.

Reference answer

Google BigQuery is a fully managed, serverless data warehouse that enables scalable analysis of large datasets using SQL queries. It handles infrastructure management and provides built-in machine learning capabilities.

127

Explain how columnar storage increases query speed.

Reference answer

Columnar storage is a critical factor in analytic query speed because it dramatically reduces total disk I/O and the amount of data that must be loaded from disk. With columnar storage, each data block stores the values of a single column for multiple rows, so a query reads only the columns it references.
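To make the I/O argument concrete, here is a minimal sketch that stores the same 1,000 records in a row-oriented buffer and in per-column buffers, then sums one column from each. The record layout (`id`, `amount`, `name`) and function names are made up for illustration; real columnar engines add compression and encoding on top of this.

```python
import struct

# 1,000 hypothetical records: (id, amount, 8-byte name).
rows = [(i, i * 1.5, f"user{i:04d}".encode()) for i in range(1000)]

# Row-oriented layout: all three fields of a record are stored together.
ROW_FMT = "<qd8s"  # int64 id, float64 amount, 8-byte name (24 bytes total)
row_store = b"".join(struct.pack(ROW_FMT, *r) for r in rows)

# Column-oriented layout: one contiguous buffer per column.
col_store = {
    "id": struct.pack(f"<{len(rows)}q", *(r[0] for r in rows)),
    "amount": struct.pack(f"<{len(rows)}d", *(r[1] for r in rows)),
    "name": b"".join(r[2] for r in rows),
}


def sum_amount_row_store():
    """Sum `amount` by reading every full record."""
    size = struct.calcsize(ROW_FMT)
    total, bytes_read = 0.0, 0
    for offset in range(0, len(row_store), size):
        _, amount, _ = struct.unpack_from(ROW_FMT, row_store, offset)
        total += amount
        bytes_read += size
    return total, bytes_read


def sum_amount_col_store():
    """Sum `amount` by reading only the amount column."""
    buf = col_store["amount"]
    values = struct.unpack(f"<{len(rows)}d", buf)
    return sum(values), len(buf)


print(sum_amount_row_store())  # reads 24,000 bytes
print(sum_amount_col_store())  # reads 8,000 bytes
```

Both functions return the same total, but the columnar version touches a third of the bytes because it never reads the `id` and `name` fields.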

128

What is the difference between Kafka and traditional message queues like RabbitMQ?

Reference answer

Kafka is designed for high-volume, distributed, real-time data ingestion. Unlike RabbitMQ, which typically removes a message once it has been consumed and acknowledged, Kafka retains messages in a durable on-disk log for a configurable period, so consumers can replay them. It also scales better through partitions and consumer groups. Kafka is ideal for event-driven architectures and analytics use cases.

129

What do you mean by data pipeline?

Reference answer

A data pipeline is a system for transporting data from one location (the source) to another (the destination, such as a data warehouse). Data is transformed and optimized along the way, and it eventually reaches a state in which it can be analyzed and used to produce business insights. The term also covers the procedures involved in aggregating, organizing, and transporting the data. Modern data pipelines automate many of the manual tasks needed to process and improve continuous data loads.

130

How do you use Terraform with GCP?

Reference answer

Using Terraform with GCP involves three things: writing configuration files that define GCP resources, running Terraform commands to plan and apply infrastructure changes, and storing state files securely. Together these enable infrastructure as code (IaC), which ensures repeatable and consistent deployments.

131

Explain the use of Cloud Run in GCP

Reference answer

Cloud Run is a fully managed serverless execution environment in GCP. It allows you to run containers without worrying about infrastructure provisioning or scaling.

132

What are some strategies for optimizing Dataflow performance?

Reference answer

- Enable Streaming Engine for low-latency streaming pipelines - Set appropriate worker machine types - Optimize the number of parallel shards - Minimize data shuffling Example: By adjusting the worker type to n1-highmem-8 and tuning parallelism, I reduced Dataflow job completion time by 30% in a log processing pipeline.

133

How do you design a comprehensive backup strategy for a million-scale data storage?

Reference answer

Use a 3-2-1 rule: 3 copies of data, on 2 different media types, with 1 offsite backup. Implement daily full backups and hourly incremental backups. Use replication across geographic regions for high availability. For databases, use point-in-time recovery. For large-scale data, use distributed storage with snapshots (e.g., HDFS snapshots, cloud snapshots). Automate backup verification and test restores periodically.

134

Who are the system integrators when it comes to cloud computing?

Reference answer

Because there are so many moving pieces, clouds can be difficult to understand. A system integrator is the person or organization that provides the overarching strategy for the complicated process of designing a cloud platform and assembling the elements needed for a public, private, or hybrid cloud infrastructure.

135

Describe the process of deploying a containerized application using Google Kubernetes Engine (GKE).

Reference answer

To deploy a containerized application using Google Kubernetes Engine (GKE), first create a Kubernetes cluster in GKE. Then build the container image and push it to a registry such as Artifact Registry (the successor to Container Registry), and deploy the application to the cluster using kubectl commands.

136

What role does Google Cloud Build play in a CI/CD pipeline?

Reference answer

Google Cloud Build plays a central role in a CI/CD pipeline. It automates the build process: compiling code, running tests, and producing artifacts. It also integrates with source repositories so that builds are triggered on code commits, which enables continuous integration.

137

Tell me about a time when a data pipeline failed in production. How did you respond?

Reference answer

Frame your response using the STAR method. Explain the incident, how you diagnosed the root cause, involved stakeholders, restored service, and implemented preventive monitoring or alerts. Highlight your ownership and communication clarity.

138

What are the best practices for cost management in GCP for data engineering workloads?

Reference answer

Best practices include using committed use discounts for predictable workloads, selecting appropriate machine types, leveraging preemptible VMs for batch jobs, optimizing storage with lifecycle policies, monitoring costs with Cloud Billing reports, and setting budget alerts.

139

How do you manage IAM roles and permissions in GCP?

Reference answer

To manage IAM roles and permissions in GCP, you need to create and assign roles to users or groups, ensuring secure access control. It's crucial to follow the principle of least privilege to minimize security risks.

140

Tell me about a time you used data to measure impact.

Reference answer

Describe a scenario where you set up metrics to evaluate the success of a project. For instance, after optimizing a data pipeline, you measured reduction in latency or cost. Or after implementing a recommendation system, you tracked user engagement metrics. Explain how you collected and analyzed data and what the results showed.

141

What is windowing in Dataflow, and how have you used it in a real-world scenario?

Reference answer

Windowing divides a continuous data stream into time-based chunks for processing. Types include fixed, sliding, and session windows. Example: I used fixed windowing to aggregate clickstream data every minute for a web analytics platform. This setup allowed us to generate near-real-time dashboards without overwhelming the system.
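The fixed-window case can be sketched in plain Python: each event is assigned to the window whose start is its timestamp rounded down to a multiple of the window size. This is only a model of the idea, with hypothetical names and integer event times in seconds; in an actual Dataflow pipeline the Apache Beam SDK performs the window assignment and also handles late data and triggers.

```python
from collections import Counter


def fixed_window_counts(events, window_size_s=60):
    """Count events per fixed (tumbling) window.

    events: iterable of (event_time_seconds, key) pairs.
    Returns a dict mapping (window_start, key) to the event count.
    """
    counts = Counter()
    for ts, key in events:
        # Round the timestamp down to the start of its window.
        window_start = (ts // window_size_s) * window_size_s
        counts[(window_start, key)] += 1
    return dict(counts)


clicks = [(5, "/home"), (42, "/home"), (61, "/cart"), (119, "/home"), (120, "/home")]
print(fixed_window_counts(clicks))
```

An event at exactly second 120 falls into the window starting at 120, not the one starting at 60, because fixed windows are half-open intervals in this sketch.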

142

A Dataflow job is failing midway through processing. How would you troubleshoot it?

Reference answer

Check Dataflow job logs in Cloud Logging for specific error messages. Identify whether the failure is in a specific PTransform or data issue. Enable retry logic and test the pipeline locally using Direct Runner before redeploying to isolate the root cause quickly.

143

How do you delete records older than 90 days from a BigQuery table using Python?

Reference answer

from google.cloud import bigquery

client = bigquery.Client()

# created_at is assumed to be a DATE column.
query = """
    DELETE FROM project.dataset.table
    WHERE created_at < DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
"""

client.query(query).result()  # wait for the DML job to complete

144

What are common pitfalls when exporting BigQuery data to external systems?

Reference answer

Common pitfalls include: data type incompatibility (e.g., BigQuery's nested/repeated fields not supported by target systems); export format limitations (e.g., CSV not handling all data types well); large exports causing timeouts or memory issues; lack of proper partitioning leading to full table scans; not accounting for data changes (incremental vs full export); and cost management (e.g., exporting the same data multiple times). Solutions include using Avro or Parquet formats, using partitioned exports, and scheduling exports with incremental logic.

145

What's your strategy for building slowly changing dimensions (SCD) in your ETL jobs?

Reference answer

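Cover the choice between SCD Type 1 (overwrite) and SCD Type 2 (history-preserving), the use of delta tables to stage changed records, and MERGE statements to apply those changes to the dimension.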

146

At AuroraRetail Co., you ingest several terabytes of event data from Google Analytics 4 into BigQuery each day. Customer attributes such as preferences and loyalty tiers are stored in two transactional systems. One is a Cloud SQL for MySQL instance and the other is a Cloud SQL for PostgreSQL instance that backs your CRM. The growth team wants to combine behavioral events with customer records to target customers active in the last year. They plan to run these campaigns about 120 times on a regular day and up to 360 times during major promotions. You must support frequent queries without placing heavy read load on the Cloud SQL systems. What should you do?

Reference answer

B. Set up Datastream to continuously replicate the necessary tables from both Cloud SQL instances into BigQuery and run all campaign queries only in BigQuery. The correct option is Set up Datastream to continuously replicate the necessary tables from both Cloud SQL instances into BigQuery and run all campaign queries only in BigQuery. This approach continuously captures changes from both Cloud SQL for MySQL and PostgreSQL and brings them into BigQuery with near real time freshness. You remove read pressure from the transactional systems and you run all 120 to 360 daily campaign queries inside BigQuery where large joins with GA4 event data scale well. Change data capture ensures the customer attributes remain current so you can reliably target customers active in the last year. Create BigQuery connections to both Cloud SQL databases and run federated queries that join Cloud SQL tables with the BigQuery events for each campaign is not suitable because each query still reads from Cloud SQL and adds connection and throughput overhead. Federated queries have limitations and quotas and they do not scale well for frequent large joins, which risks performance issues on the databases. Trigger a Dataproc Serverless Spark job for each campaign to read from both Cloud SQL databases and from BigQuery directly adds unnecessary complexity and latency and it repeatedly pulls from Cloud SQL which creates the same read load problem. The workload is analytic SQL that fits BigQuery better than spinning up many Spark jobs throughout the day. Create read replicas for both Cloud SQL databases and point BigQuery federated queries at the replicas to isolate the primaries still leaves the replicas handling many ad hoc analytical reads and the same federation limits apply. This does not match the scale and frequency needed and it increases operational burden without solving the core load and scalability concerns. 
When you see frequent analytical joins across BigQuery and OLTP data, think about using change data capture to land the operational tables in BigQuery and avoid direct federation so you protect transactional systems and gain scalable performance.

147

What is a Data Warehouse, and how is it different from a Data Lake?

Reference answer

A Data Warehouse is a centralized repository designed to store structured data for analysis and reporting. It is typically used for querying and analyzing historical data. Data Lakes, on the other hand, store raw, unstructured, or semi-structured data, allowing for more flexibility in handling various types of data (e.g., logs, videos, and text). The key difference is that data warehouses typically process cleaned and structured data, while data lakes allow for both structured and unstructured data.

148

What are the different models for deployment in cloud computing?

Reference answer

The main deployment models in cloud computing are private, public, and hybrid cloud. A fourth model, community cloud, is shared by several organizations with common requirements.

149

What data engineering frameworks do you know?

Reference answer

Nobody knows everything, and you don't need experience with every data engineering tool and framework. You can name a few: Python ETL (PETL), Bonobo, Apache Airflow, Bubbles, Kestra, and Luigi are all examples from the explosion of ETL frameworks over the past couple of years. What matters is demonstrating confidence. To do that, learn at least one or two tools well and then reason from basic data engineering principles. With this approach you can answer almost every DE question of the form "Why did you do it this way?" with "I derived it from basic principles." It is perfectly fine to learn a few things from Apache Airflow and demonstrate them with a simple pipeline example. For example, we can run ml_engine_training_op after we export data into Cloud Storage (bq_export_op) and schedule this workflow to run daily or weekly.

150

Explain the differences between SQL-based data analysis and NoSQL-based data analysis in Google Cloud.

Reference answer

SQL-based data analysis involves querying structured data in relational databases using SQL queries. In Google Cloud, tools like BigQuery are optimized for SQL-based analysis, supporting complex joins, aggregations, and window functions. It is highly suitable for analytical workloads on large, structured datasets. NoSQL-based data analysis, however, involves working with unstructured or semi-structured data, often using key-value pairs or document models. Google Cloud Bigtable and Firestore are examples of NoSQL databases that provide flexible, schema-less data models. They are better suited for applications requiring low-latency data access and rapid scaling across large datasets.

151

What is BigQuery and how does it differ from traditional databases?

Reference answer

BigQuery is a fully-managed, serverless data warehouse designed for large-scale data analytics, utilizing a columnar storage format and distributed architecture for fast query performance. Unlike traditional row-based databases that require manual scaling and management, BigQuery offers automatic scaling and high-speed querying capabilities.

152

What are Managed Instance Groups (MIGs), and how do you use them?

Reference answer

Managed Instance Groups (MIGs) are groups of identical virtual machine instances in Google Cloud that are managed as a single entity. A MIG can autoscale and autoheal its instances, and it can provide high availability by distributing instances across multiple zones. To use one, you create an instance template, create the group from that template, and configure its autoscaling policy. MIGs make it easier to add capacity and handle heavy workloads efficiently.

153

What is Cloud Functions?

Reference answer

Cloud Functions is Google Cloud's serverless compute service; comparable services on other platforms include AWS Lambda and Azure Functions. It allows developers to write and deploy code that runs in response to events or HTTP requests without the need to manage infrastructure. It scales automatically, making it ideal for building event-driven and microservices-based applications in the cloud.

154

How can you configure IAM roles to secure sensitive backups?

Reference answer

- Use the least privilege principle by assigning roles like Storage Object Viewer for viewing backups and Storage Admin for creating/restoring them. - Set up audit logs to monitor access.

155

Explain Hadoop.

Reference answer

Hadoop is an open-source software framework for storing data and running applications on clusters, providing massive amounts of storage and processing power. It runs on many kinds of commodity hardware, which makes it easy to adopt. Hadoop supports rapid, distributed processing of data stored across the cluster. By default, its file system (HDFS) keeps three replicas of each block on different nodes.

156

If a customer says X to you, how would you respond?

Reference answer

Demonstrate empathy, active listening, and problem-solving. Acknowledge the customer's concern, clarify the issue by asking questions, provide a clear explanation or solution, and follow up to ensure satisfaction. Use a specific example from past experience, e.g., handling a data discrepancy request or technical support issue.

157

What is the difference between Spark and MapReduce?

Reference answer

Spark is an improvement on Hadoop MapReduce. The difference between Spark and MapReduce is that Spark processes and retains data in memory for later steps, whereas MapReduce writes intermediate results to disk. As a result, Spark's data processing can be up to 100 times faster than MapReduce for smaller workloads. Spark also constructs a Directed Acyclic Graph (DAG) to schedule tasks and orchestrate nodes throughout the Hadoop cluster, as opposed to MapReduce's two-stage execution procedure.

158

What is the difference between a database and a data warehouse?

Reference answer

Databases are optimized for transactional operations (INSERT, UPDATE, and DELETE statements) and focus on the speed and efficiency of those operations, so analyzing data in them can be more challenging. Data warehouses focus primarily on calculations, aggregations, and SELECT statements, which makes them ideal for data analysis.

159

What is the Google Cloud Platform Marketplace?

Reference answer

The Google Cloud Platform Marketplace is an online marketplace for third-party software and services that are tested, verified, and optimized to run on GCP. It offers software packages and solutions, including databases, web servers, and machine learning tools, that allow users to easily deploy and manage their cloud applications. It also provides integration with other GCP services like Cloud Storage and Cloud Logging.

160

Good Grades and Favorite Colors

Reference answer

A SQL or data analysis question. Likely involves joining tables of students, their grades, and favorite colors to find patterns or correlations.

161

Have you ever worked with big data in a cloud computing environment?

Reference answer

Since most companies are now shifting to cloud-based environments, this question lets the interviewer know how prepared you are to work in a cloud-based environment. You should show your preparedness and familiarity with the cloud-based environment along with the pros of cloud computing such as: - Its flexibility and scalability. - Security and mobility. - Risk-free data access from anywhere.

162

What are the key differences between Google Cloud Storage and Google Cloud SQL?

Reference answer

Google Cloud Storage is a scalable object storage service for unstructured data, such as images and videos, while Google Cloud SQL is a fully managed relational database service for structured data, supporting MySQL, PostgreSQL, and SQL Server.

163

How do you handle Terraform state management and what best practices do you follow?

Reference answer

Terraform state is the source of truth for your infrastructure, so treating it carefully is non-negotiable. For every project, I:

- Store state remotely in Cloud Storage: Never in local .tfstate files. Remote state lets the team share state and enables automation. I configure the backend like this:

    terraform {
      backend "gcs" {
        bucket = "my-org-terraform-state"
        prefix = "prod/my-project"
      }
    }

- Enable state locking: This prevents simultaneous applies from corrupting state. The gcs backend supports state locking automatically.
- Version and encrypt state: I enable GCS versioning on the state bucket so I can recover from accidental deletions. I also enable server-side encryption, because state files contain sensitive data like database passwords.
- Restrict access: Only CI/CD systems and specific team members can access the state bucket. I use IAM roles, with no blanket permissions.
- Implement safeguards against mistakes: require plan review before apply (via Cloud Build); for production, enforce manual approval on sensitive resource changes; never allow terraform destroy without multiple approvals.

One mistake I made: Early on, I manually edited state with terraform state rm to work around a problem. That was a bad call. It got me out of that pinch but created inconsistencies. Now I fix state issues through code (updating Terraform configs) rather than manually editing.

Current workflow: A developer creates a branch with Terraform changes. On push, Cloud Build runs terraform plan and posts the output to the PR. Another team member reviews both the code and the plan. Only after approval does the apply happen via Cloud Build. This slows down deployments slightly, but catches mistakes early and gives the team visibility into infrastructure changes.

164

How do I get 10 out of 10 in SQL?

Reference answer

It would be something very tricky and closely tied to your expert knowledge of a particular tool, e.g. converting a table into an array of structs and passing them to a UDF. This is useful when you need to apply a user-defined function (UDF) with some complex logic to each row or table. You can always consider your table as an array of STRUCT objects and then pass each one of them to the UDF. It depends on your logic. For example, I use it in purchase stacking to calculate expire times:

select
    target_id
  , product_id
  , product_type_id
  , production.purchase_summary_udf(
        ARRAY_AGG(
            STRUCT(
                target_id
              , user_id
              , product_type_id
              , product_id
              , item_count
              , days
              , expire_time_after_purchase
              , transaction_id
              , purchase_created_at
              , updated_at
            )
            order by purchase_created_at
        )
    ) AS processed
from new_batch
-- one array of purchases per (target_id, product_id, product_type_id)
group by target_id, product_id, product_type_id
;

165

What are the differences between GCP's regional and multi-regional storage options?

Reference answer

Regional storage in GCP stores data in a specific geographic location, providing lower-latency access within that region. In contrast, multi-regional storage replicates data across multiple regions, ensuring higher availability and redundancy for global applications.

166

How do you deal with problems? What are your strengths and weaknesses?

Reference answer

A data engineer's main responsibility is to build systems that collect, manage, and convert raw data into usable information for data scientists and business analysts to interpret. This question aims to ask about any obstacles you may have faced when dealing with a problem and how you solved it. Describe how you make data more accessible through coding and algorithms. Remember the specific responsibilities listed in the job description and incorporate them into your answer. The interviewer might also ask: 'How do you solve a business problem?', 'What is your process for dealing with and solving problems during a project?', or 'Can you describe a time when you encountered a problem and solved it in an innovative manner?'

167

Explain how you would approach implementing a data lake on Google Cloud.

Reference answer

Implementing a data lake on Google Cloud involves ingesting raw, unstructured, and semi-structured data from various sources and storing it in Cloud Storage. This serves as the foundation of the data lake, where different formats like JSON, Parquet, and Avro can be ingested. Data is then cataloged using Google Cloud Data Catalog, which provides metadata management and governance. For data processing and transformation, services like Cloud Dataflow and Dataproc can be used to clean and structure the raw data. Once processed, the data can be loaded into BigQuery for analysis. A key part of the implementation involves setting up security and governance controls using IAM, Data Loss Prevention, and Cloud Security Command Center.

168

What is the BigQuery Storage Read API and why is it important?

Reference answer

The BigQuery Storage Read API allows high-throughput parallel reading of BigQuery table data directly into processing frameworks like Apache Spark, Beam, or TensorFlow without going through slow export jobs. It is important because it significantly reduces the time needed to move large datasets from BigQuery into external compute environments for machine learning or advanced analytics workloads. It supports column and row filtering, which means only the required data is transferred, reducing both cost and processing time.

169

What are the main advantages of using Google Cloud Platform?

Reference answer

Google Cloud Platform is gaining popularity among cloud professionals and users because of its advantages: - GCP offers competitive pricing. - Google Cloud servers allow access to information from anywhere. - GCP has overall better performance and service compared to other hosting cloud services. - Google Cloud provides speedy and efficient server and security updates. - The security level of Google Cloud Platform is exemplary; the cloud platform and networks are secured and encrypted with various security measures.

170

What is a Slowly Changing Dimension (SCD)?

Reference answer

A Slowly Changing Dimension (SCD) is a dimension in a data warehouse that changes slowly over time, rather than changing on a regular schedule or in real-time. There are different types of SCDs: - SCD Type 1: Overwrites existing data, no history tracking. - SCD Type 2: Adds new records for changes, keeps full history with separate surrogate keys. - SCD Type 3: Adds new columns to track limited history (typically one previous value).
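The difference between Type 1 and Type 2 is easiest to see side by side. The sketch below uses plain Python structures in place of warehouse tables; the column names (`surrogate_id`, `valid_from`, `valid_to`, `is_current`) are a common convention chosen here for illustration, not something prescribed by the definition above.

```python
from datetime import date


def apply_scd1(dim, key, attrs):
    """Type 1: overwrite the existing record, keeping no history."""
    dim[key] = dict(attrs)


def apply_scd2(dim_rows, key, attrs, effective):
    """Type 2: close the current version and append a new one."""
    for row in dim_rows:
        if row["key"] == key and row["is_current"]:
            if row["attrs"] == attrs:
                return  # nothing changed, keep the current version
            row["valid_to"] = effective
            row["is_current"] = False
    dim_rows.append({
        "surrogate_id": len(dim_rows) + 1,  # new surrogate key per version
        "key": key,
        "attrs": dict(attrs),
        "valid_from": effective,
        "valid_to": None,
        "is_current": True,
    })


type1 = {}
apply_scd1(type1, "cust-1", {"tier": "silver"})
apply_scd1(type1, "cust-1", {"tier": "gold"})  # silver is lost

type2 = []
apply_scd2(type2, "cust-1", {"tier": "silver"}, date(2024, 1, 1))
apply_scd2(type2, "cust-1", {"tier": "gold"}, date(2024, 6, 1))  # silver is kept
```

After these calls, `type1` holds only the gold tier, while `type2` holds two versions of the customer: a closed silver row and a current gold row. In a warehouse, the Type 2 update is typically expressed as a MERGE statement rather than a loop.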

171

What is the purpose of Cloud Identity and Access Management (IAM) in GCP?

Reference answer

GCP provides Cloud Identity and Access Management (IAM) for managing access control and permissions to GCP resources. IAM allows you to define fine-grained access policies and grant access to specific users or groups.

172

Given an order-stream CSV with columns customer_id, amount, status where amount may be malformed and only status == "paid" rows count, produce a dataframe with one row per customer, columns total (sum of amounts) and orders (count of paid orders), sorted by total descending.

Reference answer

import pandas as pd


def per_customer_totals(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    paid = df[(df["status"] == "paid") & df["amount"].notna()]
    out = (
        paid.groupby("customer_id")
        .agg(total=("amount", "sum"), orders=("amount", "count"))
        .reset_index()
        .sort_values("total", ascending=False, kind="stable")
    )
    return out

Why this works: to_numeric(errors="coerce") turns malformed amounts into NaN without raising, which the boolean mask then drops alongside non-paid rows. Named aggregation produces clean output column names so downstream consumers do not depend on tuple-style multi-level column names. kind="stable" keeps tied totals in customer-id order for deterministic output, which is important if downstream tests compare row order. Total complexity is O(N) for the coerce + filter and O(N log N) for the sort.

173

Explain the role of Google Cloud Dataproc in big data processing. How does it differ from Google Cloud Dataflow?

Reference answer

Google Cloud Dataproc is a managed Apache Hadoop and Apache Spark service, designed for big data processing at scale. It allows you to create and manage Hadoop and Spark clusters effortlessly. Unlike Google Cloud Dataflow, which focuses on stream and batch processing with serverless capabilities, Dataproc provides more control over cluster configuration and is well-suited for complex, long-running big data workloads.

174

How do you handle access control in BigQuery?

Reference answer

- Assign IAM roles (roles/bigquery.dataViewer, roles/bigquery.admin). - Use authorized views for restricted data access. Example: We created authorized views to share aggregated insights without exposing raw data.

175

What are the common challenges in data pipeline development?

Reference answer

Some challenges include: - Data quality issues (nulls, schema drift) - Late-arriving or out-of-order data - Scaling batch jobs under high volume - Orchestrating dependencies across sources

176

What is the difference between ETL and ELT?

Reference answer

In ETL, data is extracted, transformed on a staging server, and then loaded into the data warehouse. In ELT, data is loaded into the warehouse first and then transformed using the warehouse's computing power. ELT is preferred in cloud-native stacks like Snowflake or BigQuery due to their scalability.
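The ordering difference can be shown with a small sketch that uses Python's built-in sqlite3 module as a stand-in for the warehouse. The table names, the sample rows, and the cleaning rule are all invented for illustration; in BigQuery or Snowflake the ELT transformation would be written in that engine's SQL dialect.

```python
import sqlite3

# Hypothetical raw extract: (customer_id, amount as text, status).
raw = [("c1", " 10.5 ", "paid"), ("c2", "oops", "paid"), ("c1", "4.5", "refunded")]


def to_amount(text):
    """Parse an amount, returning None when it is malformed."""
    try:
        return float(text)
    except ValueError:
        return None


def etl(conn):
    # ETL: transform in application code first, then load only clean rows.
    clean = [
        (cust, to_amount(amount))
        for cust, amount, status in raw
        if status == "paid" and to_amount(amount) is not None
    ]
    conn.execute("CREATE TABLE orders_etl (customer_id TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders_etl VALUES (?, ?)", clean)


def elt(conn):
    # ELT: load the raw strings as-is, then transform with the engine's SQL.
    conn.execute("CREATE TABLE orders_raw (customer_id TEXT, amount TEXT, status TEXT)")
    conn.executemany("INSERT INTO orders_raw VALUES (?, ?, ?)", raw)
    conn.execute("""
        CREATE TABLE orders_elt AS
        SELECT customer_id, CAST(TRIM(amount) AS REAL) AS amount
        FROM orders_raw
        WHERE status = 'paid' AND TRIM(amount) GLOB '[0-9]*'
    """)


conn = sqlite3.connect(":memory:")
etl(conn)
elt(conn)
```

Both paths end with the same clean table, but the ELT path also keeps `orders_raw` in the warehouse, so the transformation can be changed and rerun later without extracting the data again.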

177

What is partitioning in BigQuery, and how does it help in query optimization?

Reference answer

Partitioning in BigQuery is the process of dividing a table into segments based on a column, typically a date or timestamp field. This segmentation helps optimize query performance by allowing queries to scan only relevant portions of data, reducing the amount of data processed and speeding up query times. Partitioned tables also allow for automatic data retention management, which is useful for cost optimization.
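A toy model makes the pruning effect visible. The class below is purely illustrative (its name and methods are hypothetical and unrelated to the BigQuery API): rows are grouped by day, and a query with a date filter reads only the matching groups.

```python
from collections import defaultdict
from datetime import date


class PartitionedTable:
    """Toy table partitioned by day."""

    def __init__(self):
        self.partitions = defaultdict(list)

    def insert(self, day, row):
        self.partitions[day].append(row)

    def total_rows(self):
        return sum(len(part) for part in self.partitions.values())

    def scan(self, start, end):
        """Return (rows, rows_scanned), reading only partitions in [start, end]."""
        rows, scanned = [], 0
        for day, part in self.partitions.items():
            if start <= day <= end:  # partition pruning: skip other days entirely
                rows.extend(part)
                scanned += len(part)
        return rows, scanned


table = PartitionedTable()
for d in range(1, 31):
    for i in range(10):
        table.insert(date(2024, 1, d), {"day": d, "i": i})

rows, scanned = table.scan(date(2024, 1, 15), date(2024, 1, 15))
print(scanned, "of", table.total_rows(), "rows scanned")
```

A one-day filter scans 10 of the 300 rows here. In BigQuery the same effect shows up as fewer bytes processed, provided the query filters on the partitioning column.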

178

What is the Lambda architecture?

Reference answer

The Lambda architecture is a data processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream processing methods. It consists of three layers: - Batch layer: Manages the master dataset and pre-computes batch views - Speed layer: Handles real-time data processing - Serving layer: Responds to queries by combining results from batch and speed layers
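How the three layers fit together can be sketched with a simple event counter. Everything here is a simplification with hypothetical names: the batch view covers events up to a cutoff, the speed view covers events after it, and the serving function merges the two at query time.

```python
def count_by_key(events, start, end):
    """Count events per key with start < timestamp <= end."""
    counts = {}
    for ts, key in events:
        if start < ts <= end:
            counts[key] = counts.get(key, 0) + 1
    return counts


def serve(key, batch_view, speed_view):
    """Serving layer: combine the precomputed and the real-time counts."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)


# Master dataset of (timestamp, key) events; the batch job last ran at t=5.
events = [(1, "a"), (2, "b"), (3, "a"), (8, "a"), (9, "b")]
BATCH_CUTOFF = 5

batch_view = count_by_key(events, 0, BATCH_CUTOFF)             # batch layer
speed_view = count_by_key(events, BATCH_CUTOFF, float("inf"))  # speed layer

print(serve("a", batch_view, speed_view))
```

When the next batch run completes, the cutoff advances and the speed view can discard everything the batch view now covers.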

179

What are the advantages of using BigQuery ML for machine learning in GCP?

Reference answer

BigQuery ML enables you to: - Train machine learning models directly within BigQuery using SQL queries, eliminating the need for data movement. - Support regression, classification, clustering, and forecasting tasks efficiently. - Use BigQuery datasets as input without additional preprocessing steps. - Integrate with Vertex AI for advanced model deployment and orchestration.

180

What are the different types of GCP projects?

Reference answer

Projects on Google Cloud Platform (GCP) can be grouped into several types by workload: compute projects, which use services like Compute Engine and Kubernetes Engine; storage projects, which use Cloud Storage and Bigtable; data analytics projects, which use BigQuery and Dataflow; and machine learning projects, which use AI Platform and AutoML. Organizing projects this way helps with performance and resource management, because each project is set up for specific tasks and requirements.

181

How good are you with CLI tools and shell scripting?

Reference answer

Cloud vendor command-line tools are built on REST APIs and give data engineers a powerful command-line interface for communicating with cloud service endpoints to describe and modify resources. Data engineers use CLI tools with bash scripting to chain commands. This makes it easy to create powerful scripts and interact with cloud services. Consider the example below. It invokes the AWS Lambda function called pipeline-manager:

aws lambda invoke \
    --function-name pipeline-manager \
    --payload '{ "key": "something" }' \
    response.json

We can create something even more powerful to deploy our serverless microservices. Consider the example below. It checks whether the storage bucket for the Lambda package exists, then uploads and deploys our ETL service as a Lambda function [10]:

# ./deploy.sh
# Run ./deploy.sh
LAMBDA_BUCKET=$1 # your-lambda-packages.aws
STACK_NAME=SimpleETLService
APP_FOLDER=pipeline_manager
# Get date and time to create unique s3-key for deployment package:
date
TIME=`date +"%Y%m%d%H%M%S"`
# Get the name of the base application folder, i.e. pipeline_manager.
base=${PWD##*/}
# Use this name to name zip:
zp=$base".zip"
echo $zp
# Remove old package if exists:
rm -f $zp
# Package Lambda
zip -r $zp "./${APP_FOLDER}" -x deploy.sh
# Check if Lambda bucket exists:
LAMBDA_BUCKET_EXISTS=$(aws s3 ls ${LAMBDA_BUCKET} --output text)
# If NOT:
if [[ $? -eq 254 ]]; then
    # create a bucket to keep Lambdas packaged files:
    echo "Creating Lambda code bucket ${LAMBDA_BUCKET} "
    CREATE_BUCKET=$(aws s3 mb s3://${LAMBDA_BUCKET} --output text)
    echo ${CREATE_BUCKET}
fi
# Upload the package to S3:
aws s3 cp ./${base}.zip s3://${LAMBDA_BUCKET}/${APP_FOLDER}/${base}${TIME}.zip
# Deploy / Update:
aws --profile $PROFILE cloudformation deploy \
    --template-file stack.yaml \
    --stack-name $STACK_NAME \
    --capabilities CAPABILITY_IAM \
    --parameter-overrides \
        "StackPackageS3Key"="${APP_FOLDER}/${base}${TIME}.zip" \
        "AppFolder"=$APP_FOLDER \
        "LambdaCodeLocation"=$LAMBDA_BUCKET \
        "Environment"="staging" \
        "Testing"="false"

182

Design an algorithm to efficiently sort and identify the n-th value.

Reference answer

Use the Quickselect algorithm, which is based on the partition step of Quicksort. Choose a pivot and partition the array into elements less than and greater than the pivot, then recurse only into the partition that contains the index of the n-th smallest value. For an array of length m, the average time complexity is O(m) and the worst case is O(m^2). Alternatively, to find the k-th smallest value, keep a max-heap of the k smallest elements seen so far, which takes O(m log k) time.
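A minimal sketch of Quickselect, written iteratively with a random pivot and a Lomuto-style partition. The function name `quickselect` and the 1-indexed `k` argument are choices made for this illustration, not part of the original answer.

```python
import random


def quickselect(values, k):
    """Return the k-th smallest element of values (k is 1-indexed)."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and len(values)")
    items = list(values)  # work on a copy so the caller's list is untouched
    target = k - 1        # index the answer would have in sorted order
    lo, hi = 0, len(items) - 1
    while True:
        if lo == hi:
            return items[lo]
        # Move a randomly chosen pivot to the end of the current window.
        pivot_index = random.randint(lo, hi)
        items[pivot_index], items[hi] = items[hi], items[pivot_index]
        pivot = items[hi]
        # Partition: everything smaller than the pivot goes to the left.
        store = lo
        for i in range(lo, hi):
            if items[i] < pivot:
                items[i], items[store] = items[store], items[i]
                store += 1
        items[store], items[hi] = items[hi], items[store]
        # The pivot is now in its final sorted position.
        if target == store:
            return items[store]
        if target < store:
            hi = store - 1
        else:
            lo = store + 1
```

For the heap-based alternative, the standard library call `heapq.nsmallest(k, values)[-1]` returns the same value.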

183

Explain the concept of data lineage.

Reference answer

Data lineage refers to the tracking and visualization of data as it flows from its source to its destination. It helps in understanding the data's origin, transformations, and journey through various processes, ensuring transparency, traceability, and data quality.

184

What are some data ingestion techniques in GCP?

Reference answer

- Batch ingestion using Cloud Storage - Streaming ingestion using Pub/Sub - Federated queries in BigQuery Example: I implemented a hybrid ingestion approach where Cloud Storage was used for daily batch loads, while Pub/Sub handled real-time transaction events.

185

What is a data engineer responsible for?

Reference answer

Recruiters want to know that you are aware of the duties of a data engineer. You should be able to describe the typical responsibilities, as well as who a data engineer works with on a team. If you have experience as a data scientist or analyst, you may want to describe how you have worked with data engineers in the past. The interviewer might also ask: 'What do data engineers do?', 'How do data engineers work within a team?', or 'What impact does a data engineer have?'

186

In a given array of numbers, if a number is Even, divide by 2. If a given number is Odd, multiply by 3 and add 1.

Reference answer

Write a function that iterates through an array and applies the Collatz conjecture rule: for even numbers divide by 2, for odd numbers multiply by 3 and add 1. Return the transformed array.
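One possible implementation, assuming the input is a list of integers and that a new list should be returned instead of modifying the input in place. The name `collatz_step` is chosen for this sketch.

```python
def collatz_step(numbers):
    """Apply one Collatz step to each integer and return a new list."""
    # Even numbers are halved; odd numbers become 3n + 1.
    return [n // 2 if n % 2 == 0 else 3 * n + 1 for n in numbers]


print(collatz_step([1, 2, 3, 4]))  # prints [4, 1, 10, 2]
```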

187

Explain the difference between Google Cloud Firestore and Google Cloud Datastore.

Reference answer

Google Cloud Datastore is a NoSQL document database built for automatic scaling, high performance, and ease of application development. Google Cloud Firestore is its next-generation evolution: a flexible, scalable NoSQL document database designed for mobile, web, and server development that adds real-time synchronization, offline support, and a richer query language. Existing Datastore databases now run as Firestore in Datastore mode.

188

Explain the use of Cloud Datalab in GCP.

Reference answer

Cloud Datalab is an interactive data exploration and visualization tool in GCP. It provides a Jupyter notebook interface for analyzing and visualizing data using Python, SQL, and BigQuery. Note that Google has deprecated Cloud Datalab in favor of Vertex AI Workbench notebooks.

189

What are the design schemas available in data modeling?

Reference answer

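The two most common design schemas in data modeling are: - Star Schema: a central fact table joined directly to denormalized dimension tables - Snowflake Schema: a star schema whose dimension tables are normalized into further sub-tables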
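A small star schema can be sketched with the standard library `sqlite3` module. The table names (`fact_sales`, `dim_product`, `dim_date`) and the sample rows are invented for this illustration. In a snowflake schema, the `category` column would move out of `dim_product` into its own table.

```python
import sqlite3


def build_star_schema():
    """Create an in-memory star schema: one fact table, two dimensions."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dim_product (
            product_id INTEGER PRIMARY KEY, name TEXT, category TEXT
        );
        CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, day TEXT);
        CREATE TABLE fact_sales (
            product_id INTEGER REFERENCES dim_product(product_id),
            date_id INTEGER REFERENCES dim_date(date_id),
            quantity INTEGER
        );
        INSERT INTO dim_product VALUES
            (1, 'apple', 'fruit'), (2, 'carrot', 'vegetable');
        INSERT INTO dim_date VALUES (1, '2024-01-01');
        INSERT INTO fact_sales VALUES (1, 1, 5), (2, 1, 3), (1, 1, 2);
        """
    )
    return conn


def quantity_by_category(conn):
    """Join the fact table to a dimension and aggregate by category."""
    rows = conn.execute(
        """
        SELECT p.category, SUM(f.quantity)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        GROUP BY p.category
        ORDER BY p.category
        """
    ).fetchall()
    return dict(rows)
```

Calling `quantity_by_category(build_star_schema())` needs only a single join from the fact table to the dimension, which is the main appeal of the star layout.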

190

What type of technology would you need to build YouTube?

Reference answer

Building YouTube requires a robust tech stack including distributed storage (e.g., Google File System or HDFS), a processing framework (e.g., Apache Spark or MapReduce), a scalable database (e.g., Bigtable for metadata), streaming infrastructure (e.g., Kafka for upload and playback events), content delivery networks (CDNs) for video serving, and machine learning for recommendations. You also need a data pipeline that can handle the massive scale of user interactions and video data.

191

Explain backfilling and how you handled it in a production pipeline.

Reference answer

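Backfilling means re-running a pipeline over past date ranges to load or correct historical data. A good answer covers Airflow backfill jobs, historical data loads, and how you ensured data correctness.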
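The core idea behind a safe backfill can be sketched without Airflow: process one date partition at a time and overwrite that partition, so that re-running the same range does not duplicate data. The function names, the `load_partition` callback, and the dictionary standing in for a partitioned table are all hypothetical.

```python
from datetime import date, timedelta


def backfill_dates(start, end):
    """Yield every date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def run_backfill(start, end, load_partition, store):
    """Reload each daily partition, overwriting whatever was there before."""
    for day in backfill_dates(start, end):
        # Overwriting (not appending) keeps the job idempotent.
        store[day.isoformat()] = load_partition(day)
    return store
```

Running `run_backfill` twice over the same range leaves `store` in the same state, which is what makes a failed backfill safe to retry.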

192

How can you monitor and analyze GCP costs?

Reference answer

GCP provides tools like Cloud Billing, Cost Management, and budgets to monitor, analyze, and optimize costs associated with your GCP resources and services.

193

How do you handle schema changes in BigQuery tables?

Reference answer

- Use schema updates to add new columns - Maintain backward compatibility - Use schema inference in data ingestion jobs
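Backward compatibility can be checked before a schema change is applied. The sketch below is a hypothetical helper, not a BigQuery API: it represents a schema as a dictionary of column name to `(type, mode)` and accepts only additive changes plus relaxing a column from REQUIRED to NULLABLE. It treats every type change as incompatible, which is a deliberately strict assumption.

```python
def is_backward_compatible(old_schema, new_schema):
    """Return True if new_schema only adds optional columns or relaxes modes."""
    for name, (old_type, old_mode) in old_schema.items():
        if name not in new_schema:
            return False  # dropping a column breaks existing readers
        new_type, new_mode = new_schema[name]
        if new_type != old_type:
            return False  # assumption: any type change is rejected
        relaxed = old_mode == "REQUIRED" and new_mode == "NULLABLE"
        if new_mode != old_mode and not relaxed:
            return False
    for name, (_, mode) in new_schema.items():
        # New columns must be optional so that old rows stay valid.
        if name not in old_schema and mode == "REQUIRED":
            return False
    return True
```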

194

Explain the four Vs of big data.

Reference answer

The four Vs are volume, velocity, variety, and veracity. Volume refers to the size of the data sets (terabytes or petabytes) that need to be processed. Velocity refers to the speed at which the data is generated. Variety refers to the many sources and file types of structured and unstructured data. Veracity refers to the quality of the data being analyzed. Together, the four Vs should produce a fifth V: value.

195

How do you use Cloud Composer in orchestrating workflows?

Reference answer

Cloud Composer (based on Apache Airflow) helps schedule and automate data pipelines. DAGs (Directed Acyclic Graphs) define tasks and the dependencies between them. Example: I created a DAG to automate daily data ingestion from Cloud Storage, processing in Dataflow, and loading results into BigQuery. This eliminated manual pipeline runs.
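How a DAG turns dependencies into an execution order can be illustrated with the standard library `graphlib` module (Python 3.9+). This is not Airflow code; the task names are invented to mirror the example in the answer.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
dependencies = {
    "ingest_from_gcs": set(),
    "process_in_dataflow": {"ingest_from_gcs"},
    "load_to_bigquery": {"process_in_dataflow"},
}


def execution_order(deps):
    """Return one valid run order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(deps).static_order())


print(execution_order(dependencies))
# prints ['ingest_from_gcs', 'process_in_dataflow', 'load_to_bigquery']
```

The "acyclic" requirement is what guarantees such an order exists: a cycle would mean two tasks each wait on the other.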

196

What are some best practices for designing cloud-native data pipelines?

Reference answer

- Use event-driven architecture (e.g., Cloud Functions, Lambda triggers) - Decouple compute from storage (S3, GCS, ADLS) - Build idempotent, retry-safe ETL jobs - Use managed orchestration tools like Cloud Composer or Azure Data Factory
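The "idempotent, retry-safe" point can be sketched in plain Python. Both helpers are hypothetical: `with_retries` re-runs a step after a failure, and `upsert_rows` writes by key so that a retried or duplicated run does not create duplicate rows. A dictionary stands in for the target table.

```python
import time


def with_retries(func, attempts=3, delay_seconds=0.0):
    """Call func until it succeeds or the attempts are used up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except Exception as error:  # a real job would catch narrower errors
            last_error = error
            time.sleep(delay_seconds)
    raise last_error


def upsert_rows(target, rows, key="id"):
    """Write rows by key: replaying the same rows leaves target unchanged."""
    for row in rows:
        target[row[key]] = row
    return target
```

Retries are only safe because the write is idempotent. With an append-only write, every retry after a partial failure could duplicate data.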

197

How do you handle data privacy and compliance requirements in your projects?

Reference answer

Approaches to handling data privacy and compliance include: - Implementing data classification and tagging - Applying appropriate data masking and encryption techniques - Implementing role-based access control (RBAC) - Maintaining audit logs for data access and modifications - Implementing data retention and deletion policies - Conducting regular privacy impact assessments - Staying updated with relevant regulations (e.g., GDPR, CCPA)
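Two of the masking techniques above can be sketched with the standard library. Both function names are invented for this example. `pseudonymize` uses a keyed hash (HMAC-SHA256), so the same input always maps to the same token and joins still work. `mask_email` hides most of the local part for display. Whether either approach meets a given regulation depends on the context and is not something this sketch can decide.

```python
import hashlib
import hmac


def pseudonymize(value, secret_key):
    """Replace a value with a stable keyed hash (secret_key is bytes)."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


def mask_email(email):
    """Keep the first character and the domain; hide the rest."""
    local, _, domain = email.partition("@")
    if not domain:
        return "***"  # not an email address: mask everything
    return local[:1] + "***@" + domain


print(mask_email("alice@example.com"))  # prints a***@example.com
```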

198

Explain the use of Cloud CDN (Content Delivery Network) in GCP.

Reference answer

Cloud CDN is a global content delivery network that caches and delivers content from GCP to users with low latency and high bandwidth. It improves the performance of web applications and reduces serving costs.

199

Explain how you would access a Google Cloud API.

Reference answer

To access a Google Cloud API, you would enable the API in the Cloud Console, authenticate using OAuth 2.0 or a service account, and then make HTTP requests or use client libraries in supported programming languages.
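The "make HTTP requests" step might look like the sketch below, which builds (but does not send) an authenticated request to the Cloud Storage JSON API endpoint for listing buckets. The function name is invented here. The access token is assumed to have been obtained separately, for example with `gcloud auth print-access-token` or from a service account. In practice the official client libraries handle authentication for you.

```python
import urllib.request
from urllib.parse import urlencode


def build_list_buckets_request(project_id, access_token):
    """Build a GET request that lists Cloud Storage buckets in a project."""
    query = urlencode({"project": project_id})
    url = f"https://storage.googleapis.com/storage/v1/b?{query}"
    return urllib.request.Request(
        url,
        headers={
            # OAuth 2.0 access tokens are sent as a Bearer token.
            "Authorization": f"Bearer {access_token}",
            "Accept": "application/json",
        },
        method="GET",
    )
```

Passing the returned object to `urllib.request.urlopen` would send the request. That call is left out so the sketch runs without credentials.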

200

What's your experience with CI/CD on GCP? Walk me through a pipeline you've built.

Reference answer

I've built several CI/CD pipelines on GCP using Cloud Build as the orchestrator. For a recent microservices project, here's the pipeline:

Trigger: On push to main branch, Cloud Build automatically kicks off.

Stages:
- Build and test: Cloud Build checks out the code, runs unit tests, lints the code, and builds a Docker image. Everything runs in parallel where possible to keep build time under 5 minutes.
- Push to registry: If tests pass, the Docker image gets pushed to Artifact Registry with a tag based on the commit SHA.
- Deploy to staging: Automatically deploy to a staging GKE cluster using Helm. Run smoke tests: HTTP requests to key endpoints, checking for expected responses.
- Manual approval: Staging looks good? Team member approves the deployment to production in Cloud Build.
- Deploy to production: Helm deploy with a canary strategy. First, roll out to 10% of pods, monitor metrics for 5 minutes, then complete the rollout if everything looks good.
- Smoke tests in production: Final check that services are responding correctly.

Configuration: The entire pipeline is defined in a cloudbuild.yaml file in the repo, so infrastructure engineers can see and review changes to the pipeline just like code.

What makes it reliable: We treat staging like production, with the same infrastructure, same data (anonymized), and same monitoring. If it works in staging, it works in production.

Improvements I'd make: We sometimes have long waits for approval. I'd like to implement automatic promotions based on predefined criteria: if a canary deploys successfully and error rates stay below baseline, automatically promote without waiting for a person to click approve.
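The automatic promotion criterion described at the end of this answer could be expressed as a small decision function. This is a hypothetical sketch: the function name, the optional `tolerance` parameter, and the idea of passing in a list of error-rate samples from the monitoring window are assumptions, and a real pipeline would read those samples from its monitoring system.

```python
def should_promote(canary_error_rates, baseline_error_rate, tolerance=0.0):
    """Promote only if every canary sample stays at or below the baseline."""
    if not canary_error_rates:
        return False  # no data from the canary: do not promote blindly
    limit = baseline_error_rate + tolerance
    return all(rate <= limit for rate in canary_error_rates)
```

Refusing to promote when there are no samples keeps the check conservative: a canary that reports nothing is treated as a failure, not a success.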

Mock Interview Questions for GCP Data Engineers | SPOTO
