1
Sample answer
A data warehouse is a centralized repository that stores large amounts of structured data from various sources in an organization. It is designed for query and analysis rather than for transaction processing.
2
Sample answer
I explained the benefits of a data lake vs. a data warehouse to a business executive by using an analogy: a data lake is like a storage room for raw materials, while a warehouse is a showroom for finished goods. I focused on cost, flexibility, and how each supports business decisions, avoiding jargon.
3
Sample answer
In a data warehouse, a star schema consists of one central fact table and a number of associated dimension tables around it. It is called a star schema because the layout resembles a star. It is the simplest type of data warehouse schema, is also known as the star join schema, and is well suited to querying very large data sets.
4
Sample answer
When this comes up, walk through a real example. Explain the problem (e.g., a slow sales dashboard), the schema decision you made (e.g., moving to a star schema with pre-aggregations), and the outcome (e.g., queries that ran 10x faster). Emphasize the business impact—such as enabling executives to make faster decisions or cutting costs.
5
Sample answer
Apache Kafka is a distributed streaming platform used to build real-time data pipelines and streaming applications. It's used in data engineering to handle high-throughput, low-latency data streams, enabling real-time processing and analytics.
6
Sample answer
Use dbt when working in SQL-first environments with cloud warehouses like BigQuery or Snowflake. Choose Spark for large-scale distributed processing where you need programmatic control or support for multiple formats (e.g., Parquet, Avro). dbt is preferred for analytics workflows; Spark is better for compute-heavy batch jobs.
7
Sample answer
I handle missing/corrupt data based on the context. This might involve dropping records, imputing values (mean, median, or specific indicators), or using data validation rules to identify and quarantine bad data during ETL.
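As a rough illustration of the options in this answer, the sketch below imputes a missing value with the median, flags it with an indicator, and quarantines a corrupt value. The field names (`temp_c`, `temp_c_imputed`) and the sample rows are hypothetical, and the right policy depends on the dataset.

```python
from statistics import median

# Hypothetical sensor readings; "temp_c" is an assumed field name.
rows = [
    {"id": 1, "temp_c": 21.5},
    {"id": 2, "temp_c": None},   # missing value
    {"id": 3, "temp_c": "n/a"},  # corrupt value
    {"id": 4, "temp_c": 23.5},
]

def is_number(value):
    """True for ints and floats, excluding booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)

# Median of the valid readings, used to fill gaps.
fill_value = median(r["temp_c"] for r in rows if is_number(r["temp_c"]))

clean, quarantine = [], []
for row in rows:
    if row["temp_c"] is None:
        # Impute and keep an indicator so downstream users can see the value was filled.
        clean.append({**row, "temp_c": fill_value, "temp_c_imputed": True})
    elif is_number(row["temp_c"]):
        clean.append({**row, "temp_c_imputed": False})
    else:
        # Set the record aside for inspection instead of silently dropping it.
        quarantine.append(row)
```

Keeping the quarantined rows makes it possible to reprocess them once the upstream problem is understood.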
8
Sample answer
The following are some of Hive's built-in table-generating functions (UDTFs):
- explode(array)
- explode(map)
- json_tuple()
- stack()
9
Sample answer
The ETL process involves three key steps:
- Extract: Data is extracted from various source systems, which can include databases, APIs, files, or logs. This step often involves connecting to different systems and pulling out the required data.
- Transform: The extracted data is then transformed to ensure consistency and compatibility with the target system. This step may involve cleaning the data (removing duplicates, handling missing values), applying business rules, aggregating data, and converting data types. The goal is to convert raw data into a structured format that meets the needs of the target system, typically a data warehouse or data lake.
- Load: Finally, the transformed data is loaded into the target system, where it can be stored and made available for querying and analysis. The loading process needs to be efficient and should ensure that the data is properly indexed and accessible.

The ETL process is important because it enables organizations to consolidate data from various sources into a single, coherent system. This allows for more accurate reporting, better decision-making, and the ability to perform advanced analytics.
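The three steps can be sketched end to end with the Python standard library. This is a toy example under stated assumptions: the CSV content, the `orders` table, and the rule that a missing amount becomes 0.0 are all made up for illustration, and SQLite stands in for a real warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file, API, or database.
RAW_CSV = """order_id,customer,amount
1,alice,10.50
2,BOB,
2,BOB,
3,carol,7.25
"""

def extract(text):
    """Extract: read rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop duplicates, handle missing values, convert types."""
    seen, clean = set(), []
    for row in rows:
        if row["order_id"] in seen:
            continue  # remove duplicate orders
        seen.add(row["order_id"])
        # Assumed business rule: a missing amount is stored as 0.0.
        amount = float(row["amount"]) if row["amount"] else 0.0
        clean.append((int(row["order_id"]), row["customer"].strip().lower(), amount))
    return clean

def load(conn, rows):
    """Load: write the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
loaded = conn.execute(
    "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
).fetchall()
print(loaded)
```

Keeping each step in its own function makes it easier to test the transform logic separately from the source and target systems.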
10
Sample answer
Data engineering involves designing, constructing, and maintaining the infrastructure and systems that store, process, and analyze large-scale data. It encompasses data pipelines, databases, data warehouses, and big data frameworks to ensure data is clean, reliable, and available for analysis.
11
Sample answer
Cloud-based pipelines offer scalability, lower infrastructure overhead, pay-as-you-go pricing, and faster deployment cycles. Services like AWS Glue or GCP Dataflow allow engineers to focus on logic rather than server management. They also integrate easily with cloud-native storage, compute, and monitoring tools.
12
Sample answer
Some challenges include:
- Data quality issues (nulls, schema drift)
- Late-arriving or out-of-order data
- Scaling batch jobs under high volume
- Orchestrating dependencies across sources
13
Sample answer
A decorator is a function that wraps another function to extend its behavior (like adding logging or timing) without modifying the original function's code.
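A minimal Python sketch of the timing case mentioned in this answer follows. The names `timed`, `row_count`, and the `last_elapsed` attribute are illustrative choices, not part of any library.

```python
import functools
import time

def timed(func):
    """Decorator that records how long each call to func takes."""
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        wrapper.last_elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {wrapper.last_elapsed:.6f}s")
        return result
    wrapper.last_elapsed = None
    return wrapper

@timed
def row_count(rows):
    """Count the rows in a batch; the body itself knows nothing about timing."""
    return len(rows)

count = row_count([1, 2, 3])
```

The `@timed` line is shorthand for `row_count = timed(row_count)`, which is why the original function body stays untouched.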
14
Sample answer
Interviewers ask this to assess your decision-making and your understanding of various tools. Use the question to explain why you chose certain tools over others. Tell the interviewer which tools you used and why, and mention their strengths and drawbacks. Also take the opportunity to explain how you could use those tools for the company's benefit.
15
Sample answer
A strong answer describes the incident, immediate actions taken to restore service, communication with stakeholders, root cause analysis, and preventive measures implemented. It shows calm decision-making and operational maturity.
16
Sample answer
Ensuring data quality and integrity in a data pipeline involves several key practices:
- Data Validation: Implementing validation checks at the ingestion stage is critical. This can include schema validation (ensuring the data adheres to the expected format and structure), range checks (validating numerical values are within acceptable ranges), and completeness checks (ensuring no required fields are missing).
- Data Cleaning: Once the data is ingested, it's important to clean it by handling missing values, removing duplicates, and correcting any inconsistencies. Tools like Apache Spark, Python with Pandas, or ETL tools like Talend can be used for these cleaning operations.
- Monitoring and Alerts: Continuous monitoring of the data pipeline is essential to catch issues as they arise. Tools like Apache Airflow, AWS CloudWatch, or Datadog can be set up to monitor data flows, detect anomalies, and trigger alerts if data quality issues are detected, such as sudden drops in data volume or schema changes.
- Automated Testing: Implementing automated tests within the pipeline helps ensure that transformations are applied correctly and that data integrity is maintained throughout the process. This might include unit tests for individual transformations or end-to-end tests that verify the output data meets expectations.
- Auditing and Logging: Keeping detailed logs of data processing steps and transformations can help trace the data's journey through the pipeline and identify where issues may have occurred. This is especially important for compliance and debugging purposes.
- Data Governance: Implementing data governance policies, such as defining data ownership, access controls, and data stewardship roles, ensures that data quality is maintained across the organization.
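The validation checks listed first (schema, range, completeness) can be sketched in plain Python. The schema, the amount range, and the sample records below are hypothetical. A real pipeline would more likely use a dedicated validation framework.

```python
# Hypothetical expectations for an "orders" feed.
SCHEMA = {"order_id": int, "amount": float, "country": str}
AMOUNT_RANGE = (0.0, 10_000.0)

def validate(record):
    """Return a list of problems found in one record (empty list means valid)."""
    errors = []
    for field, expected_type in SCHEMA.items():
        if record.get(field) is None:
            errors.append(f"missing {field}")        # completeness check
        elif not isinstance(record[field], expected_type):
            errors.append(f"bad type for {field}")   # schema check
    amount = record.get("amount")
    if isinstance(amount, float) and not AMOUNT_RANGE[0] <= amount <= AMOUNT_RANGE[1]:
        errors.append("amount out of range")         # range check
    return errors

records = [
    {"order_id": 1, "amount": 25.0, "country": "JP"},
    {"order_id": 2, "amount": -5.0, "country": "US"},
    {"order_id": "3", "amount": 12.0, "country": None},
]

# Valid records continue down the pipeline; the rest are kept with their error list.
valid = [r for r in records if not validate(r)]
quarantined = [(r, validate(r)) for r in records if validate(r)]
```

Returning every problem per record, rather than stopping at the first, makes the quarantine output more useful for debugging.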
17
Sample answer
- Volume: The size of the data.
- Velocity: The speed at which data is generated and processed.
- Variety: The different types of data (structured, semi-structured, unstructured).
- Veracity: The quality and trustworthiness of the data. Without Veracity, the other three are useless.
18
Sample answer
The four Vs are volume, velocity, variety, and veracity. Volume refers to the size of the data sets (terabytes or petabytes) that need to be processed. Velocity refers to the speed at which the data is generated. Variety refers to the many sources and file types of structured and unstructured data. Veracity refers to the quality of the data being analyzed. Together, these should produce a fifth V: value.
19
Sample answer
With a product table defined with a name, SKU, and price, you would construct a query that shows the lowest priced item by using an aggregate function like MIN(price) or by ordering the results by price ascending and limiting to one. For example: SELECT name, sku, price FROM products ORDER BY price ASC LIMIT 1;
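The two approaches behave differently when several items share the lowest price. The sketch below runs both against an in-memory SQLite database. The sample rows are made up, and the extra `sku` sort key is an addition for this sketch so that the single-row result is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, sku TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [("Mug", "SKU-1", 8.0), ("Pen", "SKU-2", 1.5), ("Cap", "SKU-3", 1.5)],
)

# ORDER BY ... LIMIT 1 returns a single row, even when several items tie.
one_cheapest = conn.execute(
    "SELECT name, sku, price FROM products ORDER BY price ASC, sku ASC LIMIT 1"
).fetchall()

# MIN(price) in a subquery returns every item that shares the lowest price.
all_cheapest = conn.execute(
    "SELECT name, sku, price FROM products "
    "WHERE price = (SELECT MIN(price) FROM products) ORDER BY sku"
).fetchall()
```

In an interview it is worth asking whether ties should be returned, since that decides which form is appropriate.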
20
Sample answer
When this comes up, explain that surrogate keys are system-generated identifiers (like integers) that uniquely identify rows in dimension tables. You should highlight that they are preferred over natural keys to avoid business logic changes breaking relationships. Emphasize that surrogate keys improve join performance and support slowly changing dimensions.
21
Sample answer
This question is about your relationship with data engineering. Keep your answer focused on your path to becoming a data engineer. What attracted you to this career or industry? How did you develop your technical skills?
22
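Sample answer
HDFS splits large files into blocks that are stored and processed independently. A block is the smallest unit of data that HDFS stores (128 MB by default in Hadoop 2 and later). The block scanner periodically verifies the blocks present on a DataNode to detect corruption.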
23
Sample answer
Prioritize based on impact and urgency. Triage quickly, communicate status, and delegate if possible. Use ticketing or task management tools. Stay calm and methodical. Follow up with post-incident reviews.
24
Sample answer
ACID ensures database transactions are processed reliably.
- Atomicity: All parts of a transaction succeed, or the entire transaction fails. (All or nothing).
- Consistency: The database moves from one valid state to another valid state. Constraints are enforced.
- Isolation: Concurrent transactions do not interfere with each other.
- Durability: Once a transaction is committed, it remains committed even in the event of a power loss.
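Atomicity and consistency can be demonstrated with a small SQLite example. The `accounts` table, the CHECK constraint, and the `transfer` helper are assumptions made for this sketch. The point is that a failed debit also undoes the credit that ran before it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts; both updates commit together or not at all."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected the debit, so the credit above was rolled back too.
        return False

ok = transfer(conn, "alice", "bob", 30)
rejected = transfer(conn, "alice", "bob", 500)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After the rejected transfer, the balances are the same as they were after the first successful one.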
25
Sample answer
Data partitioning is the process of dividing a large dataset into smaller, more manageable pieces called partitions. This technique is used to improve query performance, enable parallel processing, and manage large datasets more effectively. Common partitioning strategies include:
- Range partitioning
- Hash partitioning
- List partitioning
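The three strategies differ only in how a row is mapped to a partition. The sketch below shows one possible mapping for each. The month-based ranges, the region lookup, and the sample rows are hypothetical.

```python
import zlib
from collections import defaultdict

def hash_partition(key, num_partitions):
    """Hash partitioning: a stable hash of the key picks the partition."""
    # crc32 is used because Python's built-in hash() of strings varies between processes.
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions

def range_partition(date_str):
    """Range partitioning: ISO dates (YYYY-MM-DD) are grouped by month."""
    return date_str[:7]

def list_partition(country):
    """List partitioning: explicit value lists map to a partition (hypothetical regions)."""
    regions = {"JP": "apac", "AU": "apac", "DE": "emea", "FR": "emea"}
    return regions.get(country, "other")

rows = [
    {"user_id": 1, "date": "2024-01-15", "country": "JP"},
    {"user_id": 2, "date": "2024-01-20", "country": "DE"},
    {"user_id": 3, "date": "2024-02-01", "country": "BR"},
]

# Group user ids by their month partition.
by_month = defaultdict(list)
for row in rows:
    by_month[range_partition(row["date"])].append(row["user_id"])
```

Range partitioning suits queries that filter on the partition column, while hash partitioning spreads rows evenly when no natural range exists.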
26
Sample answer
- Star Schema: Consists of a central fact table with dimension tables radiating from it. This design is intuitive and optimized for analytical queries.
- Snowflake Schema: Similar to a star schema but with the additional layer of normalized dimension tables. This design offers better space efficiency at the cost of potential increased query complexity.
- Galaxy Schema (Constellation Schema): Contains multiple fact tables that share dimension tables.
27
Sample answer
I layer it. Source-level checks on row counts and schema drift, staging-level tests using dbt tests or Great Expectations for nulls, uniqueness, referential integrity, and accepted ranges. At the mart layer I add business logic tests — revenue never negative, active_users >= paying_users, and so on. All tests run in CI on pull requests against a sample, then again post-load in production with alerting to a dedicated Slack channel. Critical tables also get freshness SLAs monitored independently.
28
Sample answer
To recommend UI changes through user journey analysis, start by examining user event data to identify drop-off points and engagement levels. Analyze user flows to pinpoint friction areas, then segment users based on behavior. Use visualizations to present findings and suggest UI improvements that enhance user experience. Document insights for future reference and continuous improvement.
29
Sample answer
To gather stakeholder input effectively, start by conducting surveys and interviews to capture their needs. Utilize direct observations to understand workflows and review existing logs for insights. Document findings to ensure alignment and maintain open communication throughout the project, fostering collaboration and clarity.
30
Sample answer
Be honest and show growth. Describe the criticism, how you received it openly, what you learned, and how you improved. Emphasize your ability to accept feedback and iterate.
31
Sample answer
Good managers balance innovation with pragmatism:
- Assess the technology against current stack and team skills.
- Run proof-of-concept projects to validate claims.
- Evaluate total cost of ownership (licensing, infrastructure, training).
- Plan a gradual rollout with clear success metrics.
- Gather feedback from the team before full adoption.
32
Sample answer
Interviewers expect specifics here. Mention tools like:
- Airflow: DAGs, task dependencies, custom operators
- dbt: modular SQL modeling, testing, documentation
- Fivetran/Stitch: plug-and-play connectors for SaaS data
- Kafka: stream ingestion and integration into pipelines
33
Sample answer
Effective monitoring and alerting involves:
- Implementing comprehensive logging across all system components
- Setting up real-time monitoring dashboards
- Defining key performance indicators (KPIs) and service level objectives (SLOs)
- Implementing proactive alerting for potential issues
- Using anomaly detection techniques for identifying unusual patterns
- Establishing an incident response process
- Conducting regular system health checks and audits
34
Sample answer
Change data capture (CDC) captures and tracks changes in source data so that downstream systems can be updated in near real time. Example: Using Debezium to track changes in a MySQL database and publish them to a Kafka topic for downstream applications. Importance: CDC ensures data freshness and supports near real-time analytics.
35
Sample answer
Explain your quality assurance process: code reviews, testing (unit, integration, regression), monitoring, documentation, and continuous improvement. Provide an example.
36
Sample answer
A micro-partition is a small, immutable storage unit in Snowflake that enables pruning. Check if the clustering key is doing its job by examining the clustering depth and the pruning statistics in the query profile.
37
Sample answer
In your response, be sure to emphasize your strong communication skills, showcasing how you can effectively work with teams from various backgrounds. Highlight how well you adapt to changing project requirements and timelines. Additionally, illustrate your ability to translate complex technical details into actionable insights for stakeholders, ensuring that all team members, regardless of their technical expertise, are aligned with the project goals and understand their role in achieving success. This demonstrates not only technical proficiency but also leadership and collaborative skills critical in a cross-functional team setting.
38
Sample answer
I'd start by profiling the source — schema stability, volume, update patterns, and whether it supports CDC or just full snapshots. For a typical batch source I'd land raw data in S3 or GCS as Parquet, then use dbt on Snowflake for transformations into staging, intermediate, and mart layers. Airflow or Dagster would orchestrate, with idempotent tasks, retries, and alerting via PagerDuty. I'd also add Great Expectations tests on staging tables and monitor row counts and freshness in Monte Carlo or a homegrown dashboard.
39
Sample answer
Hadoop is a tool that many hiring managers ask about during interviews. When a question is this specific, it is highly likely that you will be required to use the tool on the job. To prepare, do your homework and make sure you are familiar with the languages and tools the company uses; more often than not, you can find that information in the job description. If you are experienced with the tool, give a detailed explanation of your project to highlight your skills and knowledge of the tool's capabilities. If you have not worked with it, do at least enough research to demonstrate basic familiarity with its features.

Answer example: "I've used the Hadoop framework while working on a team project focused on increasing data processing efficiency. We chose to implement it because of its ability to increase data processing speeds while, at the same time, preserving quality through its distributed processing. We also decided to implement Hadoop because of its scalability, as the company I worked for expected a considerable increase in its data processing needs over the next few months. In addition, Hadoop is an open-source framework, which made it the best option given the limited resources for the project. Not to mention that it's Java-based, so it was easy for everyone on the team to use and no additional training was required."
40
Sample answer
Challenges include the cost of scanning terabytes of data, schema drift over time, and downstream load. These are addressed by chunking backfills, validating schemas, and scheduling work during off-peak hours to avoid business disruption.
41
Sample answer
Big Data refers to large volumes of structured and unstructured data that traditional data storage and processing methods cannot handle easily. Hadoop is one of the most powerful tools for Big Data processing.
42
Sample answer
In a previous role, one batch pipeline was taking over six hours to process daily sales data. I reviewed the SQL queries and discovered multiple unnecessary joins and unindexed columns. I rewrote the queries, added proper indexing, and used partitioned data in S3. The processing time dropped to under one hour, improving data availability for downstream reports.
43
Sample answer
Design a low-latency, event-driven streaming pipeline as follows:
- Ingestion: Use Azure Event Hubs or Azure IoT Hub to capture high-frequency sensor data.
- Processing: Analyze data in real time using Azure Stream Analytics or Databricks with Spark Structured Streaming.
- Storage and alerts: Store data in Azure Cosmos DB for fast access or Data Lake for historical analysis.
44
Sample answer
A heartbeat message is how the DataNode communicates with the NameNode. It is a signal that the DataNode sends to the NameNode at a regular interval to indicate that it is operational.
45
Sample answer
OLTP (Online Transaction Processing) and OLAP (Online Analytical Processing) are two paradigms that cater to distinct data-handling needs.
- OLTP:
  - Focus: Day-to-day transactions, serving as a live data nerve-center for activities like order processing, banking transactions, and online bookings.
  - Data Freshness: The emphasis is on real-time data updates.
  - Query Complexity: Typically standardized, simple queries.
  - Database Design: Normalized to minimize redundancy.
  - Example Use-cases: Point of Sale (POS) systems, online banking, ticket booking platforms.
- OLAP:
  - Focus: Extracting insights from data, supporting tasks such as reporting, data mining, and business intelligence.
  - Data Freshness: Data is periodically refreshed, often in near real-time, and sometimes in scheduled intervals.
  - Query Complexity: Ad-hoc, complex queries to analyze large data sets.
  - Database Design: Denormalized to optimize for query performance.
  - Example Use-cases: Business reporting, data analysis, market research.
- Data Model: OLTP focuses on a detailed, current-state data model, while OLAP adopts a summarized, historical data model for analysis.
- Query Optimization: OLTP prioritizes quick data modifications, whereas OLAP focuses on efficient, often parallelized, read-heavy operations.
- Data Consistency: In OLTP, transactions need to be ACID-compliant; in OLAP, eventual consistency is often acceptable.
46
Sample answer
| | OLTP | OLAP |
|---|---|---|
| BASIS | Online Transactional Processing system to handle large numbers of small online transactions | Online Analytical Processing system for data retrieving and analysis |
| FOCUS | INSERT, UPDATE, DELETE operations | Complex queries with aggregations |
| OPTIMISATION | Write | Read |
| TRANSACTIONS | Short | Long |
| DATA QUALITY | ACID compliant | Data may not be as organized |
| EXAMPLE | E-commerce purchases table | Average daily sales for the last month |
47
Sample answer
You can use a combination of ORDER BY, LIMIT, and an offset to find the second highest value.

    SELECT column_name FROM table_name ORDER BY column_name DESC LIMIT 1 OFFSET 1;

- ORDER BY column_name DESC: Sorts the column in descending order, so the highest values appear first.
- LIMIT 1: Selects only one row.
- OFFSET 1: Skips the first row (highest value), returning the second row (second highest value).

Alternatively, you can use a subquery:

    SELECT MAX(column_name) AS second_highest FROM table_name WHERE column_name < (SELECT MAX(column_name) FROM table_name);

- Inner Query: Finds the maximum value in the column.
- Outer Query: Finds the maximum value that is less than the maximum value (i.e., the second highest value).
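One detail worth checking is how the two queries handle ties at the top. The sketch below runs both on an in-memory SQLite table where the highest value appears twice. The `salaries` table and its rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE salaries (amount INTEGER)")
conn.executemany("INSERT INTO salaries VALUES (?)", [(300,), (300,), (200,), (100,)])

# OFFSET 1 skips one row, not one distinct value, so a tie at the top is returned again.
by_offset = conn.execute(
    "SELECT amount FROM salaries ORDER BY amount DESC LIMIT 1 OFFSET 1"
).fetchone()[0]

# The subquery form compares values, so it returns the second highest distinct value.
by_subquery = conn.execute(
    "SELECT MAX(amount) FROM salaries WHERE amount < (SELECT MAX(amount) FROM salaries)"
).fetchone()[0]

# Adding DISTINCT makes the OFFSET form agree with the subquery form.
by_distinct_offset = conn.execute(
    "SELECT DISTINCT amount FROM salaries ORDER BY amount DESC LIMIT 1 OFFSET 1"
).fetchone()[0]
```

Which result is "correct" depends on whether the question means the second row or the second distinct value, so it is reasonable to ask the interviewer.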
48
Sample answer
I'm a data engineer who enjoys building reliable systems that turn raw data into useful insights for the business. I started out working with databases and reporting, which led me to become interested in how data flows through organisations. Over the past few years I've focused on building and maintaining data pipelines, improving data quality, and ensuring analysts and data scientists have trusted data to work with. In my current role I help manage pipelines that process millions of records each day using SQL, cloud data warehouses, and orchestration tools. One project I'm particularly proud of involved redesigning an ETL workflow that reduced processing time by about 40 percent. I enjoy solving complex data problems and building systems that scale, and I'm now looking for an opportunity where I can continue developing high quality data platforms that support better decision making.
49
Sample answer
I have extensive experience with AWS data services. In my current role, I architect solutions using S3 for storage, Glue for ETL, and Redshift for warehousing. I recently migrated our on-premise data warehouse to AWS, reducing costs by 40% while improving performance. I'm particularly experienced with AWS Lambda for event-driven processing and have built serverless pipelines that automatically process files as they arrive in S3. I also have some experience with Azure Data Factory and am currently learning Databricks to expand my multi-cloud skills.
50
Sample answer
Azure provides core data services for managing, processing, and analyzing data, including:
- Azure Data Factory (ADF) – A cloud-based ETL service for orchestrating and automating data movement.
- Azure Synapse Analytics – A data warehousing and analytics service for querying large datasets with SQL and big data processing.
- Azure Databricks – A big data and AI/ML platform on Apache Spark for large-scale transformations, real-time analytics, and machine learning.

There are many others, but the above are the most important ones for data engineers.
51
Sample answer
Kafka Streams provides a Java API for real-time transformations directly on Kafka topics. ksqlDB offers a SQL-like interface for stream processing without writing Java code.
52
Sample answer
To design a Tinder-style dating app database, you need to create tables for users, swipes, matches, and possibly messages. Optimizations might include indexing frequently queried fields, using efficient data types, and implementing caching strategies to improve performance.
53
Sample answer
In data modeling, two schemas are most common:
- Star schema: A central fact table connected to dimension tables.
- Snowflake schema: An extension of the star schema where dimension tables are normalized into multiple related tables.

When explaining, mention tradeoffs: star schema offers faster query performance, while snowflake saves storage and enforces consistency.
54
Sample answer
Some of the XML configuration files present in Hadoop are:
- hdfs-site.xml (one of the most important XML configuration files)
- core-site.xml
- yarn-site.xml
- mapred-site.xml
55
Sample answer
Efficiency comes from using vectorized operations in NumPy/pandas, minimizing loops, and applying efficient data structures (set, dict). Profiling tools (cProfile, line_profiler) help identify bottlenecks. Caching results, parallelizing tasks, and memory management (iterators, generators) further improve performance in data engineering pipelines.
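Two of these ideas, efficient data structures and generators, can be shown with the standard library alone. The event shape and the blocked-user rule below are invented for the sketch.

```python
def read_events(n):
    """Generator: yields one record at a time instead of building a full list in memory."""
    for i in range(n):
        yield {"user_id": i % 1000, "amount": i}

# A set gives average O(1) membership tests; a list would be scanned on every lookup.
blocked_users = set(range(0, 1000, 10))

# The generator expression streams through the events without materializing them.
total = sum(
    e["amount"] for e in read_events(100_000) if e["user_id"] not in blocked_users
)
```

Before optimizing further, it is usually worth profiling (for example with `cProfile`) to confirm where the time actually goes.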
56
Sample answer

    SELECT
        name,
        department,
        salary,
        RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS salary_rank,
        DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dense_salary_rank,
        ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS row_num
    FROM employees;

| name | department | salary | salary_rank | dense_salary_rank | row_num |
|---|---|---|---|---|---|
| Alice | Engineering | 150000 | 1 | 1 | 1 |
| Bob | Engineering | 150000 | 1 | 1 | 2 |
| Carol | Engineering | 120000 | 3 | 2 | 3 |
| Dan | Sales | 90000 | 1 | 1 | 1 |

Why interviewers ask this: Window functions separate junior SQL users from intermediate ones. RANK, DENSE_RANK, and ROW_NUMBER behave differently with ties, and choosing wrong creates incorrect analytics. This appears in almost every SQL interview.
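To make the tie behavior concrete, the sketch below recomputes the three columns in plain Python for the same sample rows. The function `rank_within_department` is a hypothetical helper written for this illustration. It mirrors the semantics of the window functions, not how a database implements them.

```python
from itertools import groupby

employees = [
    ("Alice", "Engineering", 150000),
    ("Bob", "Engineering", 150000),
    ("Carol", "Engineering", 120000),
    ("Dan", "Sales", 90000),
]

def rank_within_department(rows):
    """Return (name, department, salary, rank, dense_rank, row_number) tuples."""
    out = []
    # Equivalent of PARTITION BY department ORDER BY salary DESC.
    ordered = sorted(rows, key=lambda r: (r[1], -r[2]))
    for _, group in groupby(ordered, key=lambda r: r[1]):
        rank = dense_rank = 0
        previous_salary = None
        for row_number, (name, dept, salary) in enumerate(group, start=1):
            if salary != previous_salary:
                rank = row_number   # RANK jumps to the row position, leaving gaps after ties
                dense_rank += 1     # DENSE_RANK increases by one per distinct value
                previous_salary = salary
            out.append((name, dept, salary, rank, dense_rank, row_number))
    return out

ranked = rank_within_department(employees)
```

Note that ROW_NUMBER breaks the Alice/Bob tie arbitrarily in SQL unless the ORDER BY includes a tiebreaker; here the stable sort keeps the input order.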
57
Sample answer
HDFS is an acronym for Hadoop Distributed File System. It is a distributed file system that runs on commodity hardware and can handle massive data collections.
58
Sample answer
Use simple language and analogies instead of jargon. Share visuals like dashboards or diagrams to make complex points clearer. End with insights that connect to business value rather than just technical details.
59
Sample answer
Definition: Data anonymization is the process of removing or obfuscating personally identifiable information (PII) from datasets to ensure privacy and security while retaining the data's utility for analysis.

Example Use Case: Suppose a company wants to analyze user behavior to optimize its product offerings. Before sharing this data with the analytics team, the company anonymizes sensitive details like user IDs, phone numbers, and addresses by replacing them with hashed values or generalized data.

Key Techniques:
- Masking: Replacing PII with a placeholder or fake values (e.g., replacing names with pseudonyms).
- Aggregation: Grouping data to prevent identifying individuals (e.g., showing only age ranges instead of specific ages).
- Tokenization: Replacing sensitive data with tokens linked to the original data stored in a secure environment.
- Differential Privacy: Adding statistical noise to datasets to obscure individual-level information.

Why Critical?
- Compliance with Privacy Regulations: Data anonymization ensures adherence to laws such as GDPR, CCPA, and HIPAA that mandate protecting user privacy.
- Security: Prevents misuse or unauthorized access to sensitive information during data sharing or processing.
- Trust: Builds user confidence by safeguarding their personal data.
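A few of these techniques can be sketched with the standard library. The record, the field names, and the secret key below are hypothetical. Keyed hashing is strictly pseudonymization rather than full anonymization, so the output may still count as personal data under regulations such as GDPR.

```python
import hashlib
import hmac

# Hypothetical key; real keys belong in a secrets manager, not in source code.
SECRET_KEY = b"replace-with-a-managed-secret"

def pseudonymize(value):
    """Replace an identifier with a keyed hash so records can still be joined."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

def mask_phone(phone):
    """Masking: keep only the last four digits."""
    digits = [c for c in phone if c.isdigit()]
    return "*" * (len(digits) - 4) + "".join(digits[-4:])

def age_band(age):
    """Aggregation: report a ten-year range instead of an exact age."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

user = {"user_id": "u-1001", "phone": "+1 415-555-0123", "age": 37}
anonymized = {
    "user_id": pseudonymize(user["user_id"]),
    "phone": mask_phone(user["phone"]),
    "age": age_band(user["age"]),
}
```

A keyed hash (HMAC) is used instead of a bare hash because low-entropy values such as phone numbers are otherwise easy to recover by hashing every candidate.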
60
Sample answer
I implement security at every layer of the data pipeline. For encryption, I use AES-256 for data at rest and TLS for data in transit. I implement role-based access controls and regularly audit permissions. In my previous role handling healthcare data, I ensured HIPAA compliance by implementing field-level encryption for PII, maintaining audit logs of all data access, and using data masking for non-production environments. I also worked closely with our legal team to implement data retention policies and right-to-be-forgotten procedures for GDPR compliance.
61
Sample answer
Explain how you proactively gathered requirements, made reasonable assumptions, and started with a minimal viable solution. Show how you iterated as more information became available.
62
Sample answer
Candidates should discuss strategies for regular communication, documentation and knowledge sharing. Strong candidates will emphasize the importance of understanding other stakeholders' needs and working toward common goals.
63
Sample answer
To stay updated with the latest data engineering trends and technologies, I actively participate in online forums like Stack Overflow and follow influential blogs in the field. I also attend industry conferences and webinars to learn from experts and network with peers. I enjoy working on personal data engineering projects and collaborating with colleagues to explore and apply new technologies.
64
Sample answer
A Star Schema has a central fact table and denormalized dimension tables, leading to faster queries but redundant data. A Snowflake Schema normalizes dimension tables into multiple related tables, saving storage but slowing down performance due to complex joins.
65
Sample answer
In Azure, data transformation using SQL is commonly performed in services like Azure SQL Database, Azure Synapse Analytics (Dedicated or Serverless SQL pools), or via mapping data flows in Azure Data Factory. These transformations help shape raw data into clean, structured, and insightful formats for reporting and analytics. Here are some widely used SQL operations for data transformation:
- Aggregation (SUM, AVG, COUNT) – Summarizes data.
- CASE statements – Applies conditional logic.
- String functions (UPPER, LOWER, CONCAT) – Modifies text.
- Date functions (YEAR, MONTH, DATEDIFF) – Extracts date details.
- Window functions (RANK, ROW_NUMBER, LEAD, LAG) – Enables analytics.

Example: Transformation Query in Azure Synapse SQL

    SELECT
        customer_id,
        UPPER(TRIM(customer_name)) AS cleaned_name,
        SUM(order_amount) AS total_spent,
        RANK() OVER (ORDER BY SUM(order_amount) DESC) AS spending_rank,
        CASE
            WHEN SUM(order_amount) > 5000 THEN 'High Value'
            WHEN SUM(order_amount) > 1000 THEN 'Medium Value'
            ELSE 'Low Value'
        END AS customer_segment
    FROM sales_data
    GROUP BY customer_id, customer_name;

The above kind of SQL transformation could be part of a Synapse pipeline used to power dashboards or feed into machine learning models.
66
Sample answer
Describe how you used existing tools, volunteered time, or repurposed resources to achieve a goal. Highlight resourcefulness and determination.
67
Sample answer
Data lineage refers to the lifecycle of data, including its origins, movements, transformations, and impacts. It's important because it:
- Helps in understanding data provenance and quality
- Facilitates impact analysis for proposed changes
- Aids in regulatory compliance and auditing
- Supports troubleshooting and debugging of data issues
- Enhances data governance and metadata management
68
Sample answer
Our nightly revenue pipeline failed silently one Monday because a source system started sending timestamps in a different timezone. Dashboards showed a 40% drop in Sunday sales. I caught it in Slack that morning, rolled the mart tables back to Friday's snapshot within 30 minutes so the exec team had working numbers, then traced the issue to a schema contract we had not enforced. I added a test for timezone format on ingest and wrote a short post-mortem. Nothing fancy, just fast triage, clear comms, and a durable fix.
69
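Sample answer
Adding a salt column creates random variations on the join key, distributing skewed data across partitions. The skewed side gets a random salt, and the other side is replicated once per salt value so that every row still finds its match.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import array, explode, floor, lit, rand

    spark = SparkSession.builder.getOrCreate()
    SALT_BUCKETS = 3  # the salt range

    # Original DataFrames; key 1 is the skewed key
    df1 = spark.createDataFrame([(1, "A"), (1, "B"), (2, "C")], ["key", "value1"])
    df2 = spark.createDataFrame([(1, "D"), (1, "E"), (2, "F")], ["key", "value2"])

    # Skewed side: assign each row a random salt in [0, SALT_BUCKETS)
    df1_salted = df1.withColumn("salt", floor(rand() * SALT_BUCKETS).cast("int"))
    # Other side: replicate each row once per salt value so no match is lost
    df2_salted = df2.withColumn("salt", explode(array(*[lit(i) for i in range(SALT_BUCKETS)])))

    # Join on both key and salt to spread the skewed key across partitions
    df_joined = df1_salted.join(df2_salted, on=["key", "salt"], how="inner").drop("salt")
    df_joined.show()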
70
Sample answer
Approaches to handling data privacy and compliance include:
- Implementing data classification and tagging
- Applying appropriate data masking and encryption techniques
- Implementing role-based access control (RBAC)
- Maintaining audit logs for data access and modifications
- Implementing data retention and deletion policies
- Conducting regular privacy impact assessments
- Staying updated with relevant regulations (e.g., GDPR, CCPA)
71
Sample answer
Data lineage tracks where data comes from, how it's transformed, and where it goes. It answers: “How was this number calculated?”

Why it matters:
- Debugging: “Sales are wrong, which upstream table changed?”
- Impact analysis: “If I modify this column, what breaks downstream?”
- Compliance: “Auditors want to know how we calculated this metric”
- Trust: Business users trust data they can trace

Tools: dbt (automatic lineage from refs), DataHub, Amundsen, Monte Carlo
72
Sample answer
Explain how you respectfully voiced your disagreement, supported your viewpoint with data, listened to others, and either convinced the group or committed to the decision after discussion. Emphasize backbone and commitment.
73
参考回答
This scenario is a variation of the failure question. With this question, a framework like STAR can help you describe the situation, the task, your actions, and the results. Remember: Your answer should provide clear insights into your resilience.
74
参考回答
To find duplicates in a single column:

    SELECT column_name, COUNT(column_name)
    FROM table_name
    GROUP BY column_name
    HAVING COUNT(column_name) > 1;

This returns each value that appears more than once in the column, together with its count.

To find duplicates across multiple columns of a table:

    SELECT column1_name, column2_name, COUNT(*)
    FROM table_name
    GROUP BY column1_name, column2_name
    HAVING COUNT(*) > 1;

This returns each combination of column1 and column2 values that appears more than once.
75
参考回答
An idempotent pipeline produces the same result whether run once or multiple times. This is critical for handling retries and backfills.

Strategies:
- Use MERGE/UPSERT instead of INSERT:

    MERGE INTO target_table AS target
    USING staging_table AS source
    ON target.id = source.id
    WHEN MATCHED THEN UPDATE SET ...
    WHEN NOT MATCHED THEN INSERT ...;

- Delete before insert (for date-partitioned data):

    DELETE FROM sales_daily WHERE sale_date = '2024-01-15';
    INSERT INTO sales_daily SELECT * FROM staging WHERE sale_date = '2024-01-15';

- Use the logical execution date, not wall-clock time:

    # Bad: uses the current time, so every rerun produces a different value
    df['processed_at'] = datetime.now()
    # Good: uses the logical execution date passed from the orchestrator
    df['processed_at'] = execution_date

Why interviewers ask this: Pipelines fail. Networks time out. Idempotency means you can safely retry without creating duplicates or corrupting data.
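As a rough, runnable illustration of the delete-before-insert strategy, the sketch below uses Python's built-in sqlite3 module as a stand-in for a warehouse. The `load_partition` helper and the three-column `sales_daily` layout are assumptions made for the example, not part of any particular pipeline.

```python
import sqlite3

def load_partition(conn, sale_date, rows):
    """Idempotently (re)load one day's partition: delete it, then insert it."""
    with conn:  # one transaction: both statements commit together or roll back
        conn.execute("DELETE FROM sales_daily WHERE sale_date = ?", (sale_date,))
        conn.executemany(
            "INSERT INTO sales_daily (sale_date, order_id, amount) VALUES (?, ?, ?)",
            [(sale_date, order_id, amount) for order_id, amount in rows],
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_daily (sale_date TEXT, order_id INTEGER, amount REAL)")

rows = [(1, 10.0), (2, 25.5)]
load_partition(conn, "2024-01-15", rows)
load_partition(conn, "2024-01-15", rows)  # simulated retry of the same run

row_count = conn.execute("SELECT COUNT(*) FROM sales_daily").fetchone()[0]
total = conn.execute("SELECT SUM(amount) FROM sales_daily").fetchone()[0]
print(row_count, total)
```

Running the load twice leaves the table in the same state as running it once, which is the property a retry or backfill relies on. Wrapping both statements in one transaction matters: a failure between the delete and the insert would otherwise leave the partition empty.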
76
参考回答
A stream records the row-level changes made to a table (change data capture, CDC). A task runs SQL on a schedule. Use streams to track changes and tasks to automate processing; the two are often combined to build incremental pipelines.
77
参考回答
To handle data versioning and lineage, I would utilize a version control system like Git to track changes in the data pipeline code. I would also implement metadata management tools like Apache Atlas, which can capture data lineage information. Proper data cataloging practices would ensure the traceability of data transformations and changes.
78
参考回答
To clarify, the roles of data engineers and data scientists are distinct yet complementary within the data ecosystem. Data engineers focus primarily on building and maintaining the infrastructure required for data generation, collection, and analysis. This includes designing and implementing databases, data storage solutions, and data systems that enable large-scale data analytics. Data scientists, on the other hand, use this infrastructure to analyze data sources. They analyze and interpret complex data to help organizations make informed decisions. Their work involves statistical analysis, machine learning model development, and data visualization to extract meaningful insights from data.
79
参考回答
Begin by detailing the initial objectives of the project, including specific goals you aimed to achieve. Explain the technologies and methodologies you chose to use and why they were selected for this particular project. Mention any challenges or obstacles you encountered along the way and how you addressed them. Finish off by describing the outcomes of the project, both expected and unexpected, and how they reflected on your project management skills and your ability to deliver tangible results. This question is designed to assess your comprehensive project management skills and your capability to navigate through challenges to deliver successful outcomes.
80
参考回答
Workflow orchestration manages the execution of interdependent tasks in a pipeline, ensuring they run in the correct sequence and are monitored for failures.

Example Use Case: Using Apache Airflow to orchestrate a pipeline that ingests raw data, transforms it, and loads it into a data warehouse.

Steps to Design:

Define Dependencies:
- Identify task dependencies to ensure correct execution order.
- Example: Ensure data extraction completes before transformation.

Configure Schedules and Triggers:
- Set up schedules (e.g., daily, hourly) or event-based triggers.
- Example: Triggering a workflow when a file is uploaded to S3.

Monitor Task Status:
- Use monitoring tools to track task progress and retry failed tasks.
- Example: Airflow UI displays task success, failures, and logs for debugging.

Optimize for Scalability:
- Distribute tasks across resources to handle high loads.
- Example: Running tasks in parallel on a Kubernetes cluster.
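The "define dependencies" step can be sketched without any orchestrator, using only the standard library. This is not Airflow code: the task names and the `run_pipeline` helper are hypothetical, and a real orchestrator adds scheduling, retries, and monitoring on top of the ordering shown here.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "quality_check": {"transform"},
    "load": {"transform", "quality_check"},
}

def run_pipeline(deps, run_task):
    """Run every task only after all of its upstream tasks; return the order used."""
    order = []
    for task in TopologicalSorter(deps).static_order():
        run_task(task)
        order.append(task)
    return order

executed = run_pipeline(dependencies, run_task=lambda name: print("running", name))
```

`TopologicalSorter` raises `graphlib.CycleError` if the dependencies contain a cycle, which mirrors the reason orchestrators require the workflow to be a directed acyclic graph.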
81
参考回答
When asked about orchestration, begin by explaining that tools like Airflow give flexibility and open-source control, while managed services like Step Functions reduce operational overhead and integrate tightly with cloud ecosystems. You should highlight that you choose based on context: Airflow for complex DAGs and hybrid environments, Step Functions when reliability and scaling matter more than customization. This demonstrates that you weigh tradeoffs based on team resources and long-term maintenance.
82
参考回答
In this question, the interviewer will inquire about your capacity to handle unexpected problems along with the creativity you use while solving them. Ideally, candidates will come prepared with several experiences they can choose from to answer this question.
83
参考回答
Use managed identity to authenticate Azure Functions to Key Vault without storing secrets. Assign the function a system-assigned or user-assigned managed identity, grant it access to Key Vault secrets, and reference secrets via the Key Vault URL in the function configuration.
84
参考回答
Choose ETL when source data is complex or needs heavy transformation before loading, or when the target warehouse has limited processing power. Choose ELT when the warehouse is powerful (like Snowflake or BigQuery), when raw data needs to be preserved, or when transformation logic changes frequently.
85
参考回答
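- Static Partitioning: The partition value is specified manually before writing, for example by writing one slice of the data to a known partition directory.
- Dynamic Partitioning: Spark derives the partitions from the data and auto-creates a folder for each unique value of the partition column(s); this is what partitionBy does. Setting hive.exec.dynamic.partition.mode to "nonstrict" is needed when inserting into Hive tables with every partition column dynamic; a plain partitionBy write to Parquet files does not depend on it.

    from pyspark.sql.functions import col

    data = [("Alice", "2023-01", 85), ("Bob", "2023-02", 90), ("Alice", "2023-01", 95)]
    df = spark.createDataFrame(data, ["name", "date", "score"])

    # Static partitioning: the target partition (date=2023-01) is chosen by hand
    df.filter(col("date") == "2023-01").drop("date") \
        .write.mode("overwrite").parquet("/tmp/static_partitioned_table/date=2023-01")

    # Dynamic partitioning: Spark creates one folder per (name, date) combination
    spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")  # only relevant for Hive table inserts
    df.write.mode("overwrite").partitionBy("name", "date").parquet("/tmp/dynamic_partitioned_table")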
86
参考回答
A strong answer covers the full lifecycle: understanding requirements, designing the solution, building and testing, deploying to production, monitoring, and supporting downstream users. It shows end-to-end accountability.
87
参考回答
Here's a basic script using Pandas to clean missing values:

    import pandas as pd

    # Load dataset
    df = pd.read_csv("data.csv")

    # Drop rows with any missing values
    df_cleaned = df.dropna()

    # Or fill missing values with defaults
    # df_cleaned = df.fillna({'age': 0, 'income': df['income'].mean()})

    print(df_cleaned.head())

This script loads the data, drops rows with nulls, or optionally fills them with defaults like zero or column means.
88
参考回答
No, star schemas are intentionally denormalized for analytical performance. Dimension tables are flattened to reduce joins, and fact tables contain measures and foreign keys. Over-normalizing a star schema would defeat its purpose and degrade query performance.
89
参考回答
Partitioning divides a large database table into smaller, more manageable parts based on a column (e.g., date). This improves query performance by allowing the database to scan only relevant partitions.
90
参考回答
ETL (Extract, Transform, Load): Data is extracted from source systems, transformed (for example, cleaned), and finally loaded into a data warehouse.

ELT (Extract, Load, Transform): With storage separated from compute, it has become economical to store data first and transform it as required. All data is loaded immediately into the target system (a data warehouse, data mart, or data lake). This can include raw, unstructured, semi-structured, and structured data. Only then is the data transformed in the target system so that BI or data analytics tools can analyze it.
91
参考回答
Provide technical specifics: 'Instead of traditional batch ETL, I built a streaming pipeline using Kafka and Flink that processed data in near real-time, reducing latency from 1 hour to 30 seconds. This required custom windowing logic and state management.'
92
参考回答
Data partitioning divides large datasets into smaller, manageable chunks (partitions) based on criteria like time, region, or ID.

Example: Partitioning in Azure Data Lake Storage. A retail company storing sales data in Azure Data Lake Storage (ADLS) can organize it by year, month, and day instead of using one large file:

    /sales_data/year=2023/month=12/day=01/
    /sales_data/year=2023/month=12/day=02/
    /sales_data/year=2023/month=12/day=03/

This structure lets queries target only relevant partitions, greatly improving performance.
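To make the pruning idea concrete, here is a small sketch that computes which partition directories a date-range query would need to read. The helper names `partition_path` and `paths_for_range` are made up for illustration; query engines perform this pruning themselves from the filter predicate.

```python
from datetime import date, timedelta

def partition_path(root, d):
    """Build the year=/month=/day= directory for one day."""
    return f"{root}/year={d.year}/month={d.month:02d}/day={d.day:02d}/"

def paths_for_range(root, start, end):
    """List only the partitions a query over [start, end] has to read."""
    days = (end - start).days
    return [partition_path(root, start + timedelta(days=i)) for i in range(days + 1)]

selected = paths_for_range("/sales_data", date(2023, 12, 1), date(2023, 12, 3))
for path in selected:
    print(path)
```

A three-day query touches three directories regardless of how many years of history sit under the same root.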
93
参考回答
Challenges include late data, low-latency requirements, duplicate events, and cost management. These are addressed with watermarking, idempotent processing, partition pruning, and active monitoring of lag. Effective solutions balance correctness, speed, and cost.
94
参考回答
A pivot table is a tool consisting of a table of grouped values where individual items of a larger, more extensive table aggregate within one or more discrete categories. It is useful for quick summarization of large unstructured data. It can automatically perform sort, total, count, or average of the data in the spreadsheet and display the results in another spreadsheet. Pivot tables save time and allow linking external data sources to Excel.
95
参考回答
Strong answers include using ROW_NUMBER() for deduplication, RANK() for ranking within partitions, LAG()/LEAD() for comparing sequential rows, and SUM() OVER() for running totals. Candidates should provide concrete examples like calculating moving averages or identifying top records per group.
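A compact way to rehearse these functions is to run them against an in-memory SQLite database from Python. The `sales` table and its values are invented for the example, and SQLite 3.25 or newer is assumed, since that is where window functions were introduced.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, day INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", 1, 100), ("east", 2, 150), ("east", 3, 120),
     ("west", 1, 80), ("west", 2, 60)],
)

query = """
SELECT region, day, amount,
       ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount DESC) AS rn,
       LAG(amount)  OVER (PARTITION BY region ORDER BY day)         AS prev_amount,
       SUM(amount)  OVER (PARTITION BY region ORDER BY day)         AS running_total
FROM sales
ORDER BY region, day
"""
rows = conn.execute(query).fetchall()

# Top record per group: keep the row whose ROW_NUMBER() is 1 in each region
top_per_region = {region: amount for region, _, amount, rn, _, _ in rows if rn == 1}
print(rows)
print(top_per_region)
```

Filtering on `rn = 1` is the same pattern used for deduplication: partition by the business key, order by recency, and keep the first row.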
96
参考回答
I've worked mainly on AWS and GCP. In AWS, I've used S3 for storage, Glue for ETL, Redshift for warehousing, and Lambda for serverless processing. On GCP, I've used BigQuery, Cloud Storage, and Dataflow for building batch and streaming pipelines. I choose platforms based on project needs, data volume, and integration requirements.
97
参考回答
The UNIQUE constraint ensures that all the values in a particular column are different. Both UNIQUE and PRIMARY KEY guarantee that a column contains only unique values. However, a table can have only one PRIMARY KEY, whereas you can define UNIQUE constraints on multiple columns. You can also add or drop UNIQUE constraints after the table has been created.
98
参考回答
You can use the Advanced Criteria Filter to analyze a list or in cases where you need to test more than two conditions.
99
参考回答
A data engineer's primary concerns should be maintaining the accuracy of the data and preventing data loss. The purpose of this question is to help the hiring managers understand how you would validate data. You must be able to explain the suitable validation types in various instances. For instance, you might suggest that validation can be done through a basic comparison or after the complete data migration.
100
参考回答
ETL is a fundamental process in data engineering, so most hiring managers will ask about your knowledge of it. Interviewers will be especially interested in your experience with different ETL tools, so reflect on the tools you have worked with before. When you are asked for your favorite, answer in a way that also demonstrates your knowledge of the ETL process more generally.
101
参考回答
The PySpark code below processes a streaming DataFrame and handles the deduplication of records using watermarks:
- The .withWatermark("event_timestamp", "10 minutes") call sets a watermark on the event_timestamp column, allowing data that arrives up to 10 minutes late to be processed. Data older than the watermark may be dropped.
- The .dropDuplicates(["record_id", "event_timestamp"]) call removes duplicate records. Including the event-time column alongside record_id lets Spark use the watermark to discard old deduplication state instead of keeping it forever.
- The .writeStream.format("parquet") call writes the deduplicated stream in Parquet format to the specified output path (/path/to/output) as a continuous streaming job. File sinks also need a checkpoint location; the path used here is a placeholder.

    # Assuming the Kafka stream produces records with a unique UUID identifier
    deduplicated_stream = incoming_stream \
        .withWatermark("event_timestamp", "10 minutes") \
        .dropDuplicates(["record_id", "event_timestamp"])

    deduplicated_stream.writeStream \
        .format("parquet") \
        .option("path", "/path/to/output") \
        .option("checkpointLocation", "/path/to/checkpoint") \
        .start()
102
参考回答
Choose a star schema for analytics and reporting use cases where query simplicity and performance are critical. It reduces join complexity and is more intuitive for business users. Normalized structures are better for transactional systems or when storage efficiency and data integrity are the primary concern.
103
参考回答
Verify source data first. Check ETL transformations for errors. Implement validation checks and alerts to catch issues early.
104
参考回答
Data modeling is a technique that defines and analyzes the data requirements needed to support business processes. It involves creating a visual representation of an entire system of data or a part of it. The process of data modeling begins with stakeholders providing business requirements to the data engineering team.
105
参考回答
This question tests query readability and modularization skills. It specifically checks whether you can use CTEs to simplify subqueries and complex joins. Define a temporary result set with WITH, then reference it in the main query. Multiple CTEs can also be chained for layered logic. In real-world analytics, CTEs make ETL transformations and reporting queries more maintainable, especially when debugging multi-step calculations.
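A minimal sketch of chained CTEs, run through Python's sqlite3 module so it can be tried locally. The `orders` table and its rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ann", 50.0), ("ann", 70.0), ("bob", 20.0), ("cy", 200.0)],
)

query = """
WITH customer_totals AS (          -- step 1: total spend per customer
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
),
overall AS (                       -- step 2: average of those totals
    SELECT AVG(total) AS avg_total
    FROM customer_totals
)
SELECT c.customer, c.total         -- step 3: customers above the average
FROM customer_totals AS c
CROSS JOIN overall AS o
WHERE c.total > o.avg_total
ORDER BY c.customer
"""
above_average = conn.execute(query).fetchall()
print(above_average)
```

Each CTE can be selected from on its own while debugging, which is the maintainability benefit mentioned above.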
106
参考回答
Data partitioning means dividing a large dataset into smaller, manageable chunks based on keys like date, region, or ID. This improves performance by allowing queries to scan only the relevant partitions instead of the whole dataset. It also enables parallel processing, which speeds up ETL and analytics tasks. In distributed systems, partitioning helps balance load across nodes and reduces bottlenecks.
107
参考回答
CAP stands for Consistency, Availability, and Partition Tolerance. A distributed system can guarantee only two of these at any given time; in practice, when a network partition occurs, it must choose between consistency and availability. For example, Cassandra favors availability and partition tolerance at the expense of strict consistency, while relational databases often prioritize consistency and availability.
108
参考回答
Many options are available; here are a few:
- Don't use SELECT *.
- Aggregate data: when appropriate, use aggregates to pre-calculate results and reduce the amount of computation needed.
- Filter by the PARTITION column.
- Filter by the CLUSTERED column.
- Use the table preview instead of a SELECT when you only want to inspect table contents.
- Implement data retention policies to automatically archive or delete data that is no longer needed.
- In some cases, denormalize tables to reduce the need for complex joins and improve query performance.
- Use materialized views to store precomputed results and reduce the need for expensive computations during queries.
- Select the appropriate instance types based on your workload requirements to avoid over-provisioning.
109
参考回答
- Metadata Management Tools: Hive Metastore and AWS Glue Catalog.
  Example: Hive Metastore manages metadata for tables in Hadoop clusters.
- Data Lineage Tools: Apache Atlas or DataHub.
  Example: Apache Atlas tracks data flow in an ETL pipeline for auditing purposes.
110
参考回答
I use a framework like pytest. I create a small, "mock" dataset with known values, pass it through the transformation function, and assert that the output matches the expected "golden" result.
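A sketch of that approach is shown below. The `add_revenue` transformation and its column names are hypothetical stand-ins for real pipeline logic; the test itself uses only plain `assert`, which is all pytest needs.

```python
def add_revenue(rows):
    """Hypothetical transformation: add revenue = quantity * unit_price to each row."""
    return [{**row, "revenue": row["quantity"] * row["unit_price"]} for row in rows]

def test_add_revenue():
    # Small mock dataset with known values, including a zero-quantity edge case
    mock_input = [
        {"order_id": 1, "quantity": 2, "unit_price": 5.0},
        {"order_id": 2, "quantity": 0, "unit_price": 9.5},
    ]
    # Expected "golden" output, written by hand
    golden = [
        {"order_id": 1, "quantity": 2, "unit_price": 5.0, "revenue": 10.0},
        {"order_id": 2, "quantity": 0, "unit_price": 9.5, "revenue": 0.0},
    ]
    assert add_revenue(mock_input) == golden

test_add_revenue()  # pytest would discover and run this function automatically
```

Keeping the transformation a pure function of its input is what makes this kind of test cheap: no database or cluster is needed to run it.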
111
参考回答
When handling data schema evolution, I would adopt techniques like using Avro or Protobuf to define schema changes in a backward-compatible manner. This ensures that existing data pipelines can continue to process new data without any disruptions. Rigorous testing and versioning of data structures would be necessary to guarantee smooth transitions and prevent data inconsistency.
112
参考回答
Popular formats include:
- CSV (simple but inefficient)
- Parquet and ORC (columnar, analytics-optimized)
- Avro (schema-based, streaming-friendly)

Choosing the right format impacts performance and cost.
113
参考回答
Role-Based Access Control (RBAC) is an Azure security model that limits resource access based on user roles. Instead of full access, RBAC grants only the permissions needed for each role. RBAC assigns roles to users, groups, or apps at various scopes, such as subscriptions, resource groups, or specific resources.

Common roles include:
- Owner
- Contributor
- Reader
- Data Reader/Writer

Key benefits:
- Prevents unauthorized data access
- Minimizes risk by enforcing least privilege
- Enables auditing to track access and changes
114
参考回答
Mention frameworks like Eisenhower Matrix or Agile sprints. Explain how you balance high-priority business needs with technical debt and proactively flag risk if bandwidth becomes a blocker.
115
参考回答
- Flow Logs: Analyze your VPC flow logs in Amazon S3 or Amazon CloudWatch to gain operational visibility into network dependencies and traffic patterns, detect anomalies, prevent data leakage, and so on.
- Network Access Analyzer: Helps you verify that your AWS network meets the network security and compliance requirements that you define.
- Traffic Mirroring: Gives you direct access to the network packets flowing through your VPC. It lets you send a copy of the traffic on the elastic network interfaces of Amazon EC2 instances to security and monitoring appliances for packet inspection.
116
参考回答
| NAS | DAS |
|---|---|
| 10^9 to 10^12 bytes of storage capacity | 10^9 bytes of storage capacity |
| Moderate management cost per GB | High management cost per GB |
| Data transmission uses Ethernet or TCP/IP. | Data transmission uses IDE/SCSI. |
117
参考回答
- By using broadcast(df_small), we hint Spark to use a BroadcastHashJoin.
- Setting spark.sql.autoBroadcastJoinThreshold to -1 disables automatic broadcasting, so Spark typically falls back to a SortMergeJoin for this equi-join.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("JoinStrategies").getOrCreate()

    # Create sample DataFrames
    df_large = spark.range(1000000).withColumnRenamed("id", "key")
    df_small = spark.range(100).withColumnRenamed("id", "key")

    # Broadcast join (explicit hint to broadcast the smaller DataFrame)
    df_broadcast_join = df_large.join(broadcast(df_small), on="key")
    print("Broadcast Join Plan:")
    df_broadcast_join.explain()  # Look for 'BroadcastHashJoin' in the physical plan

    # Sort-merge join (automatic broadcasting disabled via the threshold)
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
    df_sort_merge_join = df_large.join(df_small, on="key")
    print("Sort-Merge Join Plan:")
    df_sort_merge_join.explain()  # Look for 'SortMergeJoin' in the physical plan
118
参考回答
While answering, ensure you have a firm understanding of data engineering, why it appeals to you, any background or previous experience that will help you excel in this field, and why you are the best person to implement data engineering for the organization. Read the job description and research the company to help you answer this question successfully.
119
参考回答
When discussing Hadoop, focus on its core features: fault tolerance ensures data is not lost, distributed processing allows handling large datasets across clusters, scalability enables growth with data volume, and reliability guarantees consistent performance. Use examples to illustrate each feature's impact on data projects.
120
参考回答
This is a common question at the intermediate level. An operational database uses INSERT, UPDATE, and DELETE statements as its standard operations and focuses on efficiency and speed. Consequently, analyzing data in it is more difficult. Data warehouses, by contrast, focus primarily on SELECT statements, aggregations, and calculations, making them better suited for data analysis.
121
参考回答
To protect sensitive customer data, combine RBAC and Managed Identities:
- RBAC for granular permissions: Assign least-privilege roles in Storage, Synapse, and Data Factory.
- Managed identities for authentication: Avoid storing credentials; use Managed Identities for service access.
- Row-Level Security (RLS): Apply RLS in Synapse or SQL Database to restrict access by user role.

This approach ensures secure, role-based access across the pipeline.
122
参考回答
In a snowflake schema, dimension tables are normalized into multiple related tables. This reduces data redundancy but adds complexity to queries. It's typically used when storage efficiency or multi-level hierarchies are critical.
123
参考回答
The SUM function may be useful for finding the sum of columns in an Excel spreadsheet. =SUM(A5:F5) can be useful to find the sum of values in the columns A-F of the 5th row.
124
参考回答
Event Hubs is used for real-time data ingestion, such as telemetry, IoT events, or clickstream data, which can then be processed in Azure Stream Analytics or Databricks.
125
参考回答
| Data warehouse | Operational database |
|---|---|
| Data warehouses generally support high-volume analytical data processing (OLAP). | Operational databases typically support high-volume transaction processing (OLTP). |
| You may add new data regularly, but once you add the data, it does not change very frequently. | Data is regularly updated. |
| Data warehouses are optimized to handle complex queries, which can access multiple rows across many tables. | Operational databases are ideal for queries that return single rows at a time per table. |
| There is a large amount of data involved. | The amount of data is usually smaller. |
| A data warehouse is usually suitable for fast retrieval of data from relatively large volumes of data. | Operational databases are optimized to handle fast inserts and updates on a smaller scale of data. |
126
参考回答
FSCK (File System Check) is a command used in HDFS to check for inconsistencies in the file system. It helps administrators find and diagnose problems such as missing blocks, under-replicated blocks, and corrupted files. FSCK does not fix these issues but provides crucial information that can be used to take corrective actions, such as replicating missing blocks or recovering corrupted data. This tool is vital for maintaining the health and integrity of the data stored within HDFS, ensuring data reliability and system robustness.
127
参考回答
- The DELETE statement deletes rows from a table.
- The TRUNCATE command deletes all the rows from a table and frees the space the table was using.
- The DROP command removes an object from the database. If you drop a table, all its rows are deleted and the table structure is removed from the database.
128
参考回答
A data pipeline is an automated workflow that moves data from sources through transformations and loads it into a destination like a data warehouse. It ensures data is available where needed.
129
参考回答
Airflow is an open-source orchestration tool that defines workflows as Directed Acyclic Graphs (DAGs). It is popular because of its flexibility, strong community, and ability to schedule and monitor complex pipelines. Airflow also integrates easily with cloud services and data platforms.
130
参考回答
The CAP theorem states that a distributed data store can only provide two of three guarantees: Consistency, Availability, and Partition Tolerance. It is relevant to distributed systems because it forces tradeoffs when designing systems that span multiple nodes, particularly in handling network partitions.
131
参考回答
Azure Cosmos DB offers five distinct consistency levels, listed here from strongest to weakest:
- Strong: Guarantees linearizability. Reads always return the most recent committed version of an item, and uncommitted or partial writes are never visible to clients.
- Bounded staleness: Reads honor the consistent prefix guarantee. Reads may lag behind writes by at most "K" versions (that is, updates) of an item or by a time interval "T", whichever is reached first.
- Session: Within a single client session, reads honor the consistent prefix, monotonic reads, monotonic writes, read-your-writes, and write-follows-reads guarantees. This assumes a single "writer" session, or several writers sharing the same session token.
- Consistent prefix: Reads never see out-of-order writes; the updates returned are always a prefix of all updates, with no gaps.
- Eventual: There is no ordering guarantee for reads. In the absence of further writes, the replicas gradually converge.
132
参考回答
Tables: Customer (customer_id, name, contact), Product (product_id, name, category, price), Order (order_id, customer_id, order_date, total_amount), Order_Item (order_item_id, order_id, product_id, quantity, price), Inventory (product_id, store_id, quantity), Store (store_id, location). Include primary/foreign keys and appropriate relationships.
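One way to sanity-check the keys and relationships is to create the tables in an in-memory SQLite database. The sketch below makes several assumptions: the column types are guesses, the Order table is named `orders` because ORDER is a reserved word, and Inventory uses a composite primary key of product and store.

```python
import sqlite3

DDL = """
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL, contact TEXT);
CREATE TABLE product  (product_id  INTEGER PRIMARY KEY, name TEXT NOT NULL, category TEXT, price REAL);
CREATE TABLE store    (store_id    INTEGER PRIMARY KEY, location TEXT);
CREATE TABLE orders (
    order_id     INTEGER PRIMARY KEY,
    customer_id  INTEGER NOT NULL REFERENCES customer(customer_id),
    order_date   TEXT,
    total_amount REAL
);
CREATE TABLE order_item (
    order_item_id INTEGER PRIMARY KEY,
    order_id      INTEGER NOT NULL REFERENCES orders(order_id),
    product_id    INTEGER NOT NULL REFERENCES product(product_id),
    quantity      INTEGER,
    price         REAL
);
CREATE TABLE inventory (
    product_id INTEGER NOT NULL REFERENCES product(product_id),
    store_id   INTEGER NOT NULL REFERENCES store(store_id),
    quantity   INTEGER,
    PRIMARY KEY (product_id, store_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript(DDL)

conn.execute("INSERT INTO customer VALUES (1, 'Ann', 'ann@example.com')")
conn.execute("INSERT INTO orders VALUES (10, 1, '2024-01-15', 30.0)")

# An order pointing at a customer that does not exist should be rejected
try:
    conn.execute("INSERT INTO orders VALUES (11, 999, '2024-01-15', 5.0)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
print("orphan order rejected:", orphan_rejected)
```

Storing `price` on `order_item` as well as on `product` is deliberate in the original table list: it preserves the price actually charged even if the catalog price changes later.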
133
参考回答
Type 1 overwrites the old value — fine for correcting typos or when history does not matter. Type 2 preserves history by inserting a new row with effective_from and effective_to timestamps and a current_flag — essential for things like sales territory changes where you need point-in-time reporting. Type 3 adds a previous_value column, which I rarely use because it only captures one level of history. In practice I default to Type 2 for anything business-critical and Type 1 for lookup attributes, using hashed surrogate keys to make joins stable.
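The Type 2 mechanics can be sketched in a few lines of plain Python. The `apply_scd2` helper, the in-memory list standing in for the dimension table, and the `attrs` dictionary are all illustrative assumptions; in a warehouse the same logic is usually expressed as a MERGE.

```python
from datetime import date

def apply_scd2(dimension, key, new_attrs, effective):
    """Close the current row for `key` if its attributes changed, then append a new row."""
    current = next((r for r in dimension if r["key"] == key and r["current_flag"]), None)
    if current is not None:
        if current["attrs"] == new_attrs:
            return dimension              # nothing changed: keep history as is
        current["effective_to"] = effective
        current["current_flag"] = False   # expire the old version
    dimension.append({
        "key": key,
        "attrs": new_attrs,
        "effective_from": effective,
        "effective_to": None,             # open-ended until the next change
        "current_flag": True,
    })
    return dimension

dim = []
apply_scd2(dim, "cust-1", {"territory": "North"}, date(2024, 1, 1))
apply_scd2(dim, "cust-1", {"territory": "North"}, date(2024, 2, 1))  # no change, no new row
apply_scd2(dim, "cust-1", {"territory": "South"}, date(2024, 3, 1))  # territory change

for row in dim:
    print(row)
```

Point-in-time reporting then becomes a range lookup: pick the row whose `effective_from` is on or before the report date and whose `effective_to` is either empty or after it.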
134
参考回答
First, decide which project you'd want to talk about. If you have a real-world example in your field of expertise and an algorithm relevant to the company's work, utilize it to capture the hiring manager's attention. Maintain a list of all the models and analyses you deployed. Begin with simple models and avoid overcomplicating things. The hiring supervisors want you to describe the outcomes and their significance. There could be follow-up questions like: - Why did you choose this algorithm? - What is the scalability of your model? - If you were given more time, what could you improve?
135
参考回答
First, I'd clarify the requirements—are we serving millions of users with sub-100ms latency? For the architecture, I'd use a lambda pattern: Kafka for real-time event ingestion, Spark Streaming for real-time feature updates, and a batch layer using Spark for training recommendation models. For serving, I'd use Redis for fast lookups of precomputed recommendations and a feature store like Feast for real-time features. The key challenge is balancing model freshness with serving latency, so I'd implement a hybrid approach where popular items get real-time updates while long-tail items use batch-computed recommendations.
136
参考回答
A situation where the producer sends data faster than the consumer can process it. The system must have a way to signal the producer to slow down to prevent crashes.
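This situation is commonly called backpressure. A bounded queue is one of the simplest ways to get the "slow down" signal: when the buffer is full, the producer blocks until the consumer catches up. The sketch below uses standard-library threads; the buffer size, item count, and sleep duration are arbitrary choices for the demonstration.

```python
import queue
import threading
import time

buffer = queue.Queue(maxsize=5)   # bounded buffer: the backpressure mechanism
stats = {"max_depth": 0}
consumed = []

def producer(n):
    for i in range(n):
        buffer.put(i)             # blocks while the buffer is full
        stats["max_depth"] = max(stats["max_depth"], buffer.qsize())
    buffer.put(None)              # sentinel: no more items

def consumer():
    while True:
        item = buffer.get()
        if item is None:
            break
        time.sleep(0.001)         # simulate slow processing
        consumed.append(item)

t_prod = threading.Thread(target=producer, args=(50,))
t_cons = threading.Thread(target=consumer)
t_prod.start()
t_cons.start()
t_prod.join()
t_cons.join()

print("items consumed:", len(consumed), "max queue depth:", stats["max_depth"])
```

The queue never holds more than five items, so memory stays bounded no matter how fast the producer could run. Streaming systems apply the same idea at larger scale with bounded buffers, rate limits, or pull-based consumption.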
137
参考回答
For datasets that exceed memory, use chunked processing with pandas (read_csv with chunksize), leverage Dask or PySpark for distributed processing, or use databases to stream queries. Compression and optimized file formats like Parquet also reduce memory footprint. This ensures scalability for production-grade pipelines handling terabytes of data.
138
参考回答
Handling schema evolution involves:
- Backward Compatibility: Ensuring new schema changes don't break existing queries.
- Version Control: Managing different schema versions and tracking changes.
- Migration Scripts: Using scripts to automate the process of updating schemas.
- Data Governance: Establishing rules and procedures for managing schema changes.
139
参考回答
This PySpark code snippet establishes a streaming data pipeline that reads events from a Kafka topic and writes them to a Cassandra database:
- A Spark session named KafkaToCassandra is created, which is essential for working with DataFrames and streaming data in Spark.
- The readStream method is used to create a streaming DataFrame (kafkaStream) that reads data from the Kafka topic named events, connecting to a Kafka broker at localhost:9092.
- The code uses the from_json function, which requires a schema, to parse the JSON data contained in the Kafka message values and creates a new column called event_data. The transformed DataFrame (transformed_df) is then constructed by selecting relevant fields from the parsed JSON, specifically user_id, event_timestamp, and event_type.
- The transformed DataFrame is written to a Cassandra database. The writeStream method specifies that the output format is Cassandra, targeting the user_ks keyspace and the user_events table. The stream starts with the start() method, which initiates the continuous data ingestion process.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    # Create Spark session
    spark = SparkSession.builder \
        .appName("KafkaToCassandra") \
        .getOrCreate()

    # Schema of the JSON payload (field types are assumptions)
    event_schema = StructType([
        StructField("user_id", StringType()),
        StructField("event_timestamp", TimestampType()),
        StructField("event_type", StringType()),
    ])

    # Reading the data stream via Kafka
    kafkaStream = spark.readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "localhost:9092") \
        .option("subscribe", "events") \
        .load()

    # Transformation logic: parse the Kafka value as JSON using the schema
    kafkaStream = kafkaStream.withColumn(
        "event_data", from_json(col("value").cast("string"), event_schema)
    )
    transformed_df = kafkaStream.select(
        col("event_data.user_id"),
        col("event_data.event_timestamp"),
        col("event_data.event_type")
    )

    # Write to Cassandra using the Spark-Cassandra Connector
    # (the checkpoint path is a placeholder)
    transformed_df.writeStream \
        .format("org.apache.spark.sql.cassandra") \
        .option("keyspace", "user_ks") \
        .option("table", "user_events") \
        .option("checkpointLocation", "/path/to/checkpoint") \
        .start()
140
参考回答
A chasm trap occurs in data modeling when two fact tables share a common dimension but are not directly related, leading to double-counting or incorrect results when queries join across both facts. It is resolved by ensuring proper grain alignment or using bridge tables.
141
参考回答
- Batch Processing: Processing data in chunks at set intervals (e.g., daily). Use this for complex reporting where data freshness (latency) is not critical, but accuracy and completeness are.
- Streaming Processing: Processing data item-by-item as it arrives. Use this for fraud detection or real-time monitoring where low latency is critical.
142
参考回答
Common tools include workflow orchestrators like Airflow, processing frameworks like Spark for big data, databases (PostgreSQL, Cassandra), cloud storage (S3), and cloud data warehouses (Redshift, BigQuery).
143
参考回答
select min(sales) from (select distinct sales from Apparels order by sales desc) where rownum < 3;
144
参考回答
Strategies for managing technical debt include:
- Regular code reviews and refactoring sessions
- Implementing CI/CD practices for consistent deployments
- Maintaining comprehensive documentation
- Prioritizing critical updates and migrations
- Allocating time for system improvements in project planning
- Conducting periodic architecture reviews
- Implementing automated testing to catch regressions
145
参考回答
The CAP theorem states a distributed system can only guarantee two of three properties: Consistency (all nodes see same data), Availability (system always responds), Partition Tolerance (system works despite network partitions).
146
参考回答
Data validation ensures that data entering the pipeline meets predefined quality standards, preventing errors or inconsistencies downstream.

Example Use Case: A Python script validates incoming datasets for a data warehouse. It checks for:
- Missing values in critical columns.
- Mismatched data types (e.g., numeric data in a text field).
- Outliers in numerical columns using statistical thresholds.

Key Validation Steps:

Schema Validation:
- Ensure data conforms to the expected schema (e.g., field names, data types).
- Example: Using Apache Avro to enforce schema consistency.

Range and Boundary Checks:
- Validate numerical fields fall within acceptable ranges.
- Example: Ensuring transaction amounts are greater than zero.

Completeness Checks:
- Verify no critical fields are missing.
- Example: Checking that every sales record has a non-null order ID.

Business Rule Validation:
- Ensure data aligns with domain-specific rules.
- Example: Checking that dates are not in the future for historical sales data.
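The validation steps above can be sketched as a single function that collects problems per record. The field names (`order_id`, `amount`, `sale_date`) and the `validate_record` helper are assumptions chosen to match the examples in the text; a production pipeline would more likely use a schema or data-quality library.

```python
from datetime import date

def validate_record(record, today):
    """Return a list of human-readable problems found in one sales record."""
    errors = []

    # Completeness check: critical fields must be present and non-null
    for field in ("order_id", "amount", "sale_date"):
        if record.get(field) is None:
            errors.append(f"missing {field}")

    # Schema (type) check, then range check on the amount
    amount = record.get("amount")
    if amount is not None:
        if isinstance(amount, bool) or not isinstance(amount, (int, float)):
            errors.append("amount is not numeric")
        elif amount <= 0:
            errors.append("amount must be greater than zero")

    # Business rule: historical sales cannot be dated in the future
    sale_date = record.get("sale_date")
    if isinstance(sale_date, date) and sale_date > today:
        errors.append("sale_date is in the future")

    return errors

today = date(2024, 1, 15)
good = {"order_id": 1, "amount": 19.99, "sale_date": date(2024, 1, 10)}
bad = {"order_id": None, "amount": -5, "sale_date": date(2030, 1, 1)}

print(validate_record(good, today))
print(validate_record(bad, today))
```

Returning every problem rather than stopping at the first one makes the output more useful for alerting, since a single bad upstream change often breaks several rules at once.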
147
参考回答
Here's a basic script using Pandas to clean missing values:
import pandas as pd

# Load dataset
df = pd.read_csv("data.csv")

# Drop rows with any missing values
df_cleaned = df.dropna()

# Or fill missing values with default
# df_cleaned = df.fillna({'age': 0, 'income': df['income'].mean()})

print(df_cleaned.head())
This script loads the data, drops rows with nulls, or optionally fills them with defaults like zero or column means.
148
参考回答
def coin_combinations(coins, target):
    dp = [0] * (target + 1)
    dp[0] = 1
    for coin in coins:
        for amount in range(coin, target + 1):
            dp[amount] += dp[amount - coin]
    return dp[target]

# coin_combinations([1, 2, 5], 20) returns the number of ways.
149
参考回答
First, examine the query execution plan. I look for missing indexes, inefficient joins, or full table scans. I'd consider adding indexes, rewriting the query, or optimizing table structure.
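As a rough illustration of reading a plan before and after adding an index, here is a sketch using Python's standard-library sqlite3 module and SQLite's EXPLAIN QUERY PLAN. The orders table, its columns, and the index name are assumptions for the example; other databases expose plans through their own commands and with different output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)

def plan(sql):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

query = "SELECT * FROM orders WHERE customer_id = 42"

plan_before = plan(query)  # no usable index yet: expect a full table scan
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = plan(query)   # the filter can now be answered through the index

print(plan_before)
print(plan_after)
```

The exact wording of the plan text varies between SQLite versions, so the useful signal is the change from a scan to an index search rather than the literal string.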
150
参考回答
The candidate should suggest SQL's DISTINCT keyword to remove duplicates from query results and UNIQUE constraints to stop duplicates from being stored in the first place. After that, they should also suggest other ways to deal with duplicate data points, such as grouping the data using GROUP BY and filtering it further. They should also ask clarifying questions about what kind of data you are working with and what columns or values would likely be duplicated.
151
参考回答
A data warehouse usually stores structured data and enforces a consistent schema. A data lake does not; it can hold all types of data, including structured and unstructured. It also works well for exploration and massive data volumes because it doesn't impose a predetermined format for storing data. Data warehouses are a good fit for data analytics, whereas data lakes are great places to store and investigate raw information.
152
参考回答
A database consistency model specifies how and when a successful write or update is reflected in a later read of the same data.
- Eventual consistency suits systems that can tolerate briefly stale reads. It is Amazon DynamoDB's default read consistency model and maximizes read throughput. However, an eventually consistent read might not reflect the results of a recently completed write.
- In Amazon DynamoDB, a strongly consistent read returns a result that reflects all writes that received a successful response before the read. To request one, set the ConsistentRead parameter to true in the read request. A strongly consistent read consumes more resources than an eventually consistent read.
153
参考回答
The NameNode is the centerpiece of HDFS. It stores the directory tree of every file in the file system and tracks where across the cluster each file's data blocks are kept. It holds only this metadata; the file data itself lives on the DataNodes.
154
参考回答
Batch pipelines process data at fixed intervals (e.g., daily reports), while streaming pipelines ingest and process data continuously (e.g., fraud detection). Streaming is typically built using Kafka, Spark Streaming, or Flink, whereas batch may use Airflow, dbt, or Glue.
155
参考回答
The primary difference lies in how data is processed and utilized. Batch processing involves collecting data over a period, then processing it all at once at a later time. This method of data preparation is often suitable for scenarios where time-sensitivity is not crucial, such as daily sales reports or monthly inventory checks. On the other hand, real-time streaming processes data instantly as it comes in, making it invaluable for scenarios that require immediate analysis and action. This is particularly important in applications such as fraud detection in financial transactions, live traffic monitoring, and data validation for dynamic pricing models. In these cases, the ability to process and act on data in real time can significantly enhance decision-making processes and operational efficiency.
156
参考回答
Apache Kafka is a distributed streaming platform designed for high-throughput, low-latency data streaming. It is commonly used for building real-time data pipelines that can handle large volumes of data across distributed systems. Kafka operates on the concept of a distributed commit log, where data is stored as records (messages) in topics, and producers can publish messages while consumers subscribe to and process them.
In a data engineering ecosystem, Kafka plays several key roles:
- Data Ingestion: Kafka is often used to ingest large volumes of data from various sources, such as logs, sensors, or transactional databases. It can handle data streams in real-time, ensuring that data is reliably captured and made available for downstream processing.
- Data Streaming: Kafka supports real-time data streaming by allowing consumers to process data as it arrives. This makes it ideal for scenarios where immediate data processing is required, such as real-time analytics, monitoring systems, or alerting mechanisms.
- Decoupling Systems: Kafka decouples data producers from consumers, allowing different parts of a data pipeline to operate independently. This reduces dependencies between systems and improves scalability and fault tolerance. For example, a Kafka topic can be used to buffer data, ensuring that even if the downstream system is temporarily unavailable, the data is not lost.
- Event Sourcing and Stream Processing: Kafka is often used in event-driven architectures, where events are captured and processed in real-time. It integrates well with stream processing frameworks like Apache Flink or Apache Spark Streaming, enabling complex event processing, transformations, and aggregations.
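To make the commit-log and decoupling ideas concrete, here is a toy in-memory model in Python. It is not Kafka's API: ToyLog, publish, and poll are hypothetical names, and the sketch ignores partitions, replication, and persistence. It only shows that records are appended once and that each consumer tracks its own read offset.

```python
from collections import defaultdict

class ToyLog:
    """In-memory stand-in for a single topic: an append-only list of records."""

    def __init__(self):
        self.records = []
        self.offsets = defaultdict(int)  # consumer name -> next offset to read

    def publish(self, message):
        """Append a record and return its offset in the log."""
        self.records.append(message)
        return len(self.records) - 1

    def poll(self, consumer, max_records=10):
        """Return unread records for this consumer and advance only its offset."""
        start = self.offsets[consumer]
        batch = self.records[start:start + max_records]
        self.offsets[consumer] = start + len(batch)
        return batch

topic = ToyLog()
for event in ["order-1", "order-2", "order-3"]:
    topic.publish(event)

analytics_batch = topic.poll("analytics")  # reads everything published so far
topic.publish("order-4")
alerts_batch = topic.poll("alerts")        # a late consumer still sees the full log
analytics_next = topic.poll("analytics")   # resumes from where it left off
print(analytics_batch, alerts_batch, analytics_next)
```

Because reading never removes records, a consumer that was offline (here, "alerts") can catch up later, which is the buffering behavior the decoupling bullet describes.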
157
参考回答
In Azure SQL DB, there are several data security options:
- Azure SQL Firewall Rules: Azure offers two levels of firewall rules.
  - Server-level firewall rules are stored in the master database and control access to the whole logical server.
  - Database-level firewall rules control access to an individual database.
- Azure SQL Database Auditing: The SQL Database service in Azure offers auditing features. It allows you to define the audit policy at the database server or database level.
- Azure SQL Transparent Data Encryption: TDE performs real-time encryption and decryption of the database, its backups, and its transaction log files at rest.
- Azure SQL Always Encrypted: This feature safeguards sensitive data in the Azure SQL database, such as credit card details.
158
参考回答
When asked this, explain that you use partitioning, clustering, indexing, and materialized views. You should highlight file format choices (Parquet/ORC), compression, and pruning as cost-saving strategies. Emphasize that query optimization directly reduces both compute costs and end-user latency.
159
参考回答
The Hadoop job scheduler allocates resources to various tasks and manages their execution within the cluster. In classic Hadoop (MapReduce v1), the default is the FIFO (First In, First Out) scheduler, which processes jobs in the order they are submitted; YARN-based Apache Hadoop releases default to the Capacity Scheduler instead. While simple, the FIFO scheduler can lead to inefficient resource utilization, because a large job at the head of the queue holds up every job behind it. For more complex scheduling and better resource utilization, Hadoop administrators use the Capacity Scheduler or the Fair Scheduler, which allocate resources based on specific policies or priorities to maximize throughput and minimize job waiting time.
160
参考回答
Any data engineer worth their salt needs to know when to use one type of database over another. There may have been times when you chose a NoSQL database rather than a relational database, and your interviewer may be interested in learning why. These questions probe your knowledge of databases in general, so be sure to demonstrate it with concrete examples.
161
参考回答
The core responsibilities of a data engineer cover the tasks needed to manage data and make it available for analysis. This includes developing, constructing, testing, and maintaining data architectures such as large-scale data processing systems. Data engineers are responsible for ensuring the integrity and accessibility of data, optimizing data flow within the organization, and implementing algorithms that allow for efficient data storage and retrieval. They convert raw data into usable information, which ultimately supports decision-making across the organization.
162
参考回答
Idempotency refers to the property of an operation that allows it to be applied multiple times without changing the result beyond the initial application. In data engineering, this concept is crucial when designing data pipelines, APIs, or any other system that may need to handle retries, failures, or duplicate requests.
Importance of Idempotency:
- Handling Retries: In distributed systems, network failures, timeouts, or other issues can cause operations to be retried automatically. If an operation is not idempotent, these retries could lead to unintended side effects, such as duplicate entries in a database or incorrect data aggregation. By designing operations to be idempotent, the system ensures that repeated execution of the same operation produces the same result, preventing data corruption.
- Data Integrity: Idempotency is crucial for maintaining data integrity in systems that process large volumes of data or involve complex data transformations. For example, in an ETL pipeline, if a data transformation step is idempotent, running it multiple times on the same input data will yield the same output, ensuring consistent results.
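One common way to get this property in a load step is to key each row and upsert instead of appending. The sketch below uses Python's standard-library sqlite3 module; the daily_sales table and the load_batch helper are assumptions for illustration, and other warehouses would typically use MERGE or an equivalent upsert.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (sale_date TEXT PRIMARY KEY, total REAL)")

def load_batch(rows):
    """Upsert keyed on sale_date, so replaying a batch overwrites instead of duplicating."""
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT OR REPLACE INTO daily_sales (sale_date, total) VALUES (?, ?)",
            rows,
        )

batch = [("2024-01-01", 120.0), ("2024-01-02", 95.5)]
load_batch(batch)
load_batch(batch)  # simulated retry after a timeout

result = conn.execute(
    "SELECT sale_date, total FROM daily_sales ORDER BY sale_date"
).fetchall()
print(result)
```

With a plain INSERT and no key, the retry would have doubled the row count; with the keyed upsert, the table ends in the same state no matter how many times the batch is replayed.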
163
参考回答
Check logs for error patterns. Assess business impact. Apply a temporary fix if needed. Investigate root causes like data changes, resource limits, or concurrency issues. Implement retry logic, monitoring, and permanent resolution.
164
参考回答
This is the fundamental distinction in database architecture.
- OLTP (Online Transaction Processing): These systems are designed for transactional speed and data integrity. They handle a high volume of small, fast transactions (inserts, updates, deletes).
  - Example: A bank ATM system or an e-commerce checkout.
  - Structure: Highly normalized (3NF) to avoid redundancy.
- OLAP (Online Analytical Processing): These systems are designed for complex queries and data analysis. They read historical data to find trends.
  - Example: A business intelligence dashboard or a Data Warehouse.
  - Structure: Denormalized (Star or Snowflake Schema) to optimize read speeds.
165
参考回答
ROW_NUMBER() assigns a unique sequential number to every row. RANK() assigns the same rank to ties but skips the next number (1, 2, 2, 4). DENSE_RANK() assigns the same rank to ties but does not skip any numbers (1, 2, 2, 3).
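The difference is easy to see by computing all three numberings side by side. The following is a plain-Python sketch of the semantics, ordering scores from highest to lowest; rank_scores is a hypothetical helper, not a database function.

```python
def rank_scores(scores):
    """Return (score, row_number, rank, dense_rank) tuples, highest score first."""
    ordered = sorted(scores, reverse=True)
    rows = []
    rank = dense = 0
    previous = None
    for position, score in enumerate(ordered, start=1):
        if score != previous:
            rank = position  # RANK jumps to the row position, leaving gaps after ties
            dense += 1       # DENSE_RANK increases by one for each new value
            previous = score
        rows.append((score, position, rank, dense))
    return rows

for row in rank_scores([90, 100, 80, 90]):
    print(row)
```

For the scores 100, 90, 90, 80 this yields row numbers 1, 2, 3, 4, ranks 1, 2, 2, 4, and dense ranks 1, 2, 2, 3, matching the sequences in the answer above.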
166
参考回答
Use a hybrid pipeline combining real-time and batch processing:
- Real-time (fraud detection): Ingest order events via Azure Event Hubs.
- Batch (financial reporting): Store raw transactions in Azure Data Lake Storage (ADLS).
- Orchestration: Use Azure Logic Apps to trigger real-time alerts and integrate with fraud detection services.
167
参考回答
Snowflake schema is a variation of the star schema where dimension tables are normalized into multiple related tables. This creates a structure that looks like a snowflake, with the fact table at the center and increasingly granular dimension tables branching out.
168
参考回答
The snowflake schema is an extension of the star schema in which the dimension tables are normalized and split into additional related tables. The name comes from the resulting shape: the dimensions branch out from the central fact table like the arms of a snowflake.
169
参考回答
This question isn't just about theory — it tests your ability to balance performance and data integrity. Normalization reduces redundancy, but denormalization helps with speed. If you explain both sides and mention when you'd make the tradeoff, it shows you're practical and project-focused, not just academic.
170
参考回答
I would assess the impact and pattern of missing values. Strategies include dropping rows or columns with excessive missing data, imputing with mean, median, or mode for numerical data, using forward/backward fill for time-series data, or applying predictive models to estimate missing values. The choice depends on the dataset and use case.
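Two of the strategies named above, statistical imputation and forward fill, can be sketched with the standard library alone. The helper names and sample values are assumptions for illustration; in practice a dataframe library would normally do this work.

```python
from statistics import mean, median

def fill_with(values, statistic):
    """Replace None with a statistic computed from the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistic(observed)
    return [fill if v is None else v for v in values]

def forward_fill(values):
    """Carry the last observed value forward; leading gaps stay None."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

ages = [30, None, 50, 40]                  # cross-sectional numeric column
readings = [None, 1.5, None, None, 2.0]    # time-ordered sensor readings

print(fill_with(ages, mean))
print(fill_with(ages, median))
print(forward_fill(readings))
```

Forward fill only makes sense when row order carries meaning, as in a time series, which is why the choice of strategy depends on the dataset.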
171
参考回答
The strong answer involves taking the stakeholder seriously, investigating their numbers, and either explaining why your data is right or admitting where it was wrong, without making them feel like they wasted your time. The good outcome is alignment, not victory.
172
参考回答
Skewed tables are a type of table in which some values in a column appear more frequently than others. The distribution is skewed as a result of this. When a table is created in Hive with the SKEWED option, the skewed values are written to separate files, while the remaining data are written to another file.
173
参考回答
- Skewness is detected by grouping and counting occurrences of each key.
- Using repartitionByRange spreads rows across partitions by key range, which helps when many distinct keys are unevenly distributed. Rows that share a single hot key still land in the same partition.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Reuse the active session or start one (the snippet needs a `spark` object)
spark = SparkSession.builder.getOrCreate()

# Sample DataFrame with skewed data
data = [(1, "A"), (1, "B"), (1, "C"), (2, "D"), (3, "E")]
df_skewed = spark.createDataFrame(data, ["key", "value"])

# Calculate distribution to detect skewness
df_skewed.groupBy("key").count().orderBy(col("count").desc()).show()

# Repartition by range to manage skewness
df_balanced = df_skewed.repartitionByRange(3, "key")
print(f"Partitioning after repartitionByRange: {df_balanced.rdd.glom().map(len).collect()}")
174
参考回答
The Metastore is the central repository for Hive metadata. It holds table definitions, schemas, and the mappings between tables and their underlying data, and it persists this metadata in a relational database (RDBMS).
175
参考回答
Yes, creating more than one table for a data file is possible. In Hive, schemas are stored in the Metastore separately from the data, so several table definitions can point at the same file and each one returns results for that data according to its own schema.
176
参考回答
IT departments must maintain many servers and apps, but doing it manually isn't scalable. The more complicated an IT system is, the more difficult it is to keep track of all the moving parts. As the number of automated jobs and their configurations across groups of systems or machines grows, so does the need to coordinate them. This is where orchestration comes in handy. Orchestration is the automated configuration, management, and coordination of computer systems, applications, and services. It lets IT manage complicated processes and workflows more easily. There are many container orchestration platforms available, such as Kubernetes and OpenShift.
177
参考回答
Practice explaining something complex (e.g., late-arriving data, star schema, cost spike) without using words like 'schema,' 'join,' or 'denormalize.' The executive won't know those terms and will tune out.
178
参考回答
Stored procedures are used in SQL to package a task that needs to run repeatedly. You save the procedure once and reuse it whenever required.
Syntax for creating a stored procedure (T-SQL):
CREATE PROCEDURE procedure_name *params*
AS
sql_statement
GO;
Syntax for executing a stored procedure:
EXEC procedure_name *params*;
A stored procedure can take parameters at the time of execution, so that it runs based on the values passed in.
179
参考回答
Interviewers want to know if you can compare big data tools and pick the right one for the job. Hadoop MapReduce processes data in batches and writes intermediate results to disk between stages, which is slower but works well for long-running jobs. Spark handles both batch and streaming and keeps intermediate data in memory where possible, which usually makes it much faster, especially for iterative workloads. The best answers also note that Spark can run on top of Hadoop's storage layer (HDFS), so they often work together.
180
参考回答
To remove duplicates, you can use the ROW_NUMBER() function to assign a unique number to each row within a group and keep only the first occurrence.
WITH ranked_data AS (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY column1, column2  -- Specify columns that define uniqueness
            ORDER BY id                    -- Optional: Use a unique identifier for ordering
        ) AS row_num
    FROM table_name
)
SELECT * FROM ranked_data WHERE row_num = 1;
- PARTITION BY column1, column2: Groups rows by the columns that define uniqueness.
- ORDER BY id: Orders rows within each group (optional but useful for consistency).
- ROW_NUMBER(): Assigns a unique number to each row in the group.
- WHERE row_num = 1: Keeps only the first occurrence of each unique combination.
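The same keep-the-first-row-per-key logic can be sketched in plain Python, which is handy for checking expectations on a small sample. The deduplicate helper and the sample rows are assumptions for illustration; column1, column2, and id mirror the placeholders in the query above.

```python
def deduplicate(rows, key_fields, order_field):
    """Keep one row per key; 'first' means the lowest value of order_field."""
    kept = {}
    for row in sorted(rows, key=lambda r: r[order_field]):
        key = tuple(row[f] for f in key_fields)  # plays the role of PARTITION BY
        kept.setdefault(key, row)                # later rows with the same key are ignored
    return list(kept.values())

rows = [
    {"id": 3, "column1": "a", "column2": "x"},
    {"id": 1, "column1": "a", "column2": "x"},
    {"id": 2, "column1": "b", "column2": "y"},
]
unique_rows = deduplicate(rows, ["column1", "column2"], "id")
print([row["id"] for row in unique_rows])
```

Sorting by the ordering column first is what makes the result deterministic, which is the reason the ORDER BY inside the window is worth keeping.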
181
参考回答
The Interviewer's Goal: To see how you handle upstream changes breaking your code.
The Answer: When a source system adds, removes, or changes a column, it is the #1 cause of pipeline failure. I handle this in three ways:
- The Technical Fix (Schema Registry): For streaming (Kafka), I use a Schema Registry (like Confluent) which rejects incompatible messages that don't match the agreed-upon format (Protobuf/Avro).
- The Design Fix (Forward Compatibility): I build consumers that are resilient. They explicitly select columns they need (SELECT id, name) rather than using SELECT *, so new columns don't break the code.
- The Organizational Fix: This is the most effective. I implement a 'Data Contract' where the software engineering team cannot change the database schema without alerting the data team first.
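The design fix can be sketched as a consumer that projects only the columns it depends on. This is a minimal illustration in plain Python: REQUIRED_COLUMNS and project are hypothetical names, and the id and name columns mirror the SELECT example above.

```python
REQUIRED_COLUMNS = ("id", "name")  # hypothetical contract for this consumer

def project(record):
    """Keep only the contracted columns; fail loudly if one has been removed."""
    missing = [c for c in REQUIRED_COLUMNS if c not in record]
    if missing:
        raise ValueError(f"schema drift: missing columns {missing}")
    return {c: record[c] for c in REQUIRED_COLUMNS}

# An added upstream column is ignored instead of breaking the consumer.
widened = project({"id": 1, "name": "Ada", "loyalty_tier": "gold"})
print(widened)

# A removed column is reported immediately rather than producing silent nulls.
try:
    project({"id": 2})
except ValueError as error:
    drift_message = str(error)
    print(drift_message)
```

Added columns are tolerated and removed columns fail fast, so the consumer survives the harmless kind of drift and surfaces the harmful kind at ingestion time.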
182
参考回答
Use schema registries (e.g., Confluent Schema Registry) for version control and compatibility checks in streaming. In batch systems, validate schemas at ingestion and use tools like dbt for versioned model management. Avoid SELECT * queries to prevent breakage due to added columns.
183
参考回答
Prepare to answer a variety of questions on SQL queries, including how to write efficient queries, the different types of joins and when to use them, subqueries and their use cases, as well as database optimization techniques. Demonstrating your proficiency in SQL, through explaining your thought process in selecting specific queries or optimizations, is often crucial for showcasing your skills and understanding of database management and manipulation.
184
参考回答
Data vault splits entities into hubs (business keys), links (relationships), and satellites (descriptive attributes with full history). It is append-only, highly auditable, and great when you are integrating many source systems with changing schemas. Dimensional modelling — facts and conformed dimensions — is better for consumption because it is intuitive for analysts. In practice I often use vault as the raw integration layer and build dimensional marts on top, though for smaller orgs I skip vault entirely and go straight to Kimball-style dimensional.
185
参考回答
- ETL (Extract, Transform, Load): Data is transformed before being loaded into the target system.
- ELT (Extract, Load, Transform): Data is loaded into the target system in its raw form and transformed after loading, often used in big data environments.
186
参考回答
Describe a decision with significant impact. Explain the context, the options considered, your rationale, and the outcome. Show responsibility and foresight.
187
参考回答
ETL (Extract, Transform, Load) is a core data engineering process:
- Extract data from multiple sources
- Transform it into a usable format
- Load it into a warehouse or analytics system
ETL ensures data consistency, cleanliness, and reliability, which are critical for reporting, analytics, and downstream machine learning use cases.
188
参考回答
The Interviewer's Goal: To test your understanding of the ELT (Extract, Load, Transform) pattern.
The Answer: I design pipelines with 'Replayability' in mind. Here is the architecture:
- Orchestration (Airflow/Prefect): This triggers the pipeline on a schedule and manages dependencies.
- The 'Raw Landing' (S3/GCS): This is crucial. I extract the JSON from the API and dump it untouched into a Data Lake (S3).
  - Why? If my transformation logic has a bug, I can fix the code and re-process the raw files without calling the slow API again.
- Loading (Snowflake/BigQuery): I load the raw JSON into a variant/struct column in the warehouse.
- Transformation (dbt): I use SQL to parse the JSON, clean the data, and model it into Fact and Dimension tables for the end users.
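The replayability idea can be sketched locally with the standard library. In this illustration a temporary directory stands in for the S3/GCS landing zone and a hard-coded list stands in for the API response; land_raw, transform, and the field names are assumptions, not part of any real pipeline.

```python
import json
import tempfile
from pathlib import Path

def land_raw(payload, landing_dir, name):
    """Write the API payload untouched, so it can be re-processed later."""
    path = Path(landing_dir) / f"{name}.json"
    path.write_text(json.dumps(payload))
    return path

def transform(landing_dir):
    """Parse and clean every landed file; safe to re-run after fixing a bug."""
    rows = []
    for path in sorted(Path(landing_dir).glob("*.json")):
        for item in json.loads(path.read_text()):
            rows.append(
                {"order_id": item["id"], "amount": round(float(item["amount"]), 2)}
            )
    return rows

with tempfile.TemporaryDirectory() as landing:
    api_response = [{"id": 1, "amount": "19.990"}, {"id": 2, "amount": "5"}]  # stand-in for the API call
    land_raw(api_response, landing, "orders_2024-01-01")
    first_run = transform(landing)
    replay = transform(landing)  # re-processing reads the landed files, not the API

print(first_run)
```

Because the transform reads only from the landing directory, changing the cleaning logic and re-running it never requires another extraction.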
189
参考回答
The ORDER BY clause is useful for sorting the query result in ascending or descending order. By default, the query sorts in ascending order. The following statement sorts in descending order instead:
SELECT expressions FROM table_name WHERE conditions ORDER BY expression DESC;
190
参考回答
Apache Hadoop is an open-source framework for distributed storage and batch processing of large datasets. It allows systems to scale horizontally across commodity hardware. Although newer tools exist, Hadoop concepts still form the foundation of big data engineering.
191
参考回答
Your interviewer will be most interested in the improvements you can bring to the table as a data engineering candidate. They may ask some variation of this question to see how you take the initiative in improving things in your role. If you are asked this question, be sure to point out how your previous experience demonstrates that you are a self-starter. However, if you do not yet have this experience, be sure to prepare some remarks on the improvements you would and could be making if offered the job. Ultimately, be sure to keep your answer focused on the actual methods you employ as a data engineer to improve the quality of data for your organization.
192
参考回答
This is a scenario-based question. Walk through your thinking:
- Identify the business process: Rides connecting riders with drivers
- Identify the grain: One row per ride
- Identify dimensions: rider, driver, vehicle, pickup_location, dropoff_location, date/time
- Identify facts/measures: fare, distance, duration, tip, surge_multiplier
-- Fact table
CREATE TABLE fact_rides (
    ride_id BIGINT PRIMARY KEY,
    rider_id INT,
    driver_id INT,
    vehicle_id INT,
    pickup_location_id INT,
    dropoff_location_id INT,
    ride_start_datetime TIMESTAMP,
    ride_end_datetime TIMESTAMP,
    distance_miles DECIMAL(10,2),
    duration_minutes INT,
    base_fare DECIMAL(10,2),
    surge_multiplier DECIMAL(3,2),
    tip_amount DECIMAL(10,2),
    total_fare DECIMAL(10,2)
);
-- Dimension tables
CREATE TABLE dim_rider (...);
CREATE TABLE dim_driver (...);
CREATE TABLE dim_vehicle (...);
CREATE TABLE dim_location (...);
Why interviewers ask this: This tests whether you can apply theoretical knowledge to real scenarios. They want to see your thought process, not just the final answer.
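To show how such a model is queried, here is a cut-down sketch using Python's standard-library sqlite3 module. It keeps only a few fact columns, and because the answer above leaves the dimension columns unspecified, the driver_name column and all sample rows are purely hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- driver_name is a made-up attribute; the source does not list dimension columns
    CREATE TABLE dim_driver (driver_id INTEGER PRIMARY KEY, driver_name TEXT);
    CREATE TABLE fact_rides (
        ride_id INTEGER PRIMARY KEY,
        driver_id INTEGER REFERENCES dim_driver (driver_id),
        distance_miles REAL,
        total_fare REAL
    );
    INSERT INTO dim_driver VALUES (1, 'Ana'), (2, 'Ben');
    INSERT INTO fact_rides VALUES (10, 1, 3.0, 12.0), (11, 1, 5.0, 18.0), (12, 2, 2.0, 9.5);
    """
)

# Typical star-schema query: join the fact table to a dimension, then aggregate.
fare_by_driver = conn.execute(
    """
    SELECT d.driver_name, COUNT(*) AS rides, SUM(f.total_fare) AS fares
    FROM fact_rides AS f
    JOIN dim_driver AS d ON d.driver_id = f.driver_id
    GROUP BY d.driver_name
    ORDER BY d.driver_name
    """
).fetchall()
print(fare_by_driver)
```

With one row per ride as the grain, measures such as total_fare can be summed safely across any dimension without double counting.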
193
参考回答
The SUBSTITUTE function in Excel finds matches for a particular text and replaces them. The REPLACE function replaces text that you identify by its position.
SUBSTITUTE syntax
=SUBSTITUTE(text, text_to_be_replaced, text_to_replace_old_text_with, [instance_number])
Where
text refers to the text in which you perform the replacements
instance_number specifies which occurrence of the match to replace; if omitted, every occurrence is replaced
E.g. consider a cell A5 which contains "Bond007"
=SUBSTITUTE(A5, "0", "1", 1) gives the result "Bond107"
=SUBSTITUTE(A5, "0", "1", 2) gives the result "Bond017"
=SUBSTITUTE(A5, "0", "1") gives the result "Bond117"
REPLACE syntax
=REPLACE(old_text, start_num, num_chars, new_text)
Where
start_num - starting position in old_text of the characters to be replaced
num_chars - number of characters to be replaced
new_text - the text that replaces those characters
E.g. consider a cell A5 which contains "Bond007"
=REPLACE(A5, 5, 1, "99") gives the result "Bond9907"
194
参考回答
This is a data design pattern used to organize data within a lakehouse. It consists of three layers: Bronze (raw ingestion), Silver (filtered, cleaned, and joined data), and Gold (business-level aggregates and specialized tables for reporting).
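The three layers can be sketched as successive transformations over the same records. This is a toy, in-memory illustration in plain Python; the field names, the cleaning rules, and the to_silver and to_gold helpers are assumptions, and a real lakehouse would store each layer as tables.

```python
# Bronze: raw records exactly as ingested (strings, inconsistent casing, bad values).
bronze = [
    {"order_id": "1", "region": " EU ", "amount": "10.50"},
    {"order_id": "2", "region": "US", "amount": "bad"},
    {"order_id": "3", "region": "eu", "amount": "4.50"},
]

def to_silver(rows):
    """Clean and type the raw rows; drop rows whose amount cannot be parsed."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # a real pipeline would typically quarantine this row
        clean.append(
            {
                "order_id": int(row["order_id"]),
                "region": row["region"].strip().upper(),
                "amount": amount,
            }
        )
    return clean

def to_gold(rows):
    """Business-level aggregate: total amount per region."""
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Keeping bronze untouched means silver and gold can always be rebuilt from it when cleaning rules or business definitions change.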
195
参考回答
A data mart is a subset of a data warehouse, focused on a specific business line or department. While a data warehouse is a centralized repository for the entire organization's data, a data mart serves the needs of a particular group, providing quicker access to relevant data.
196
参考回答
The Star Schema, also called the Star Join Schema, is the simplest data warehousing schema type. It gets its name from its basic structure, which resembles a star: a single fact table sits at the centre, surrounded by several dimension tables linked to it. This schema helps data engineers query large volumes of data efficiently.
197
参考回答
Approaching data security in data engineering projects involves implementing a combination of best practices, tools, and policies to protect data at all stages of its lifecycle: during collection, storage, processing, and transmission.
Key Strategies:
- Data Encryption:
  - At Rest: Ensure that all sensitive data is encrypted at rest using strong encryption algorithms like AES-256. This applies to databases, data lakes, and any storage services used in the project.
  - In Transit: Data should also be encrypted in transit using protocols like TLS (Transport Layer Security) to protect it from interception during transmission between systems.
- Access Control:
  - Implement strict access control mechanisms to ensure that only authorized users and systems can access the data. This involves using role-based access control (RBAC) and enforcing the principle of least privilege, where users are given the minimum access necessary to perform their tasks.
  - Use IAM (Identity and Access Management) tools provided by cloud platforms (e.g., AWS IAM, Google Cloud IAM) to manage and audit access permissions.
- Data Masking and Anonymization:
  - For sensitive data, implement data masking or anonymization techniques to protect personally identifiable information (PII) while still allowing the data to be used for analysis. Techniques like tokenization or pseudonymization can be used to obscure sensitive details.
- Audit Logging:
  - Maintain detailed audit logs of all data access and processing activities. These logs should capture who accessed the data, what actions were taken, and when they occurred. Audit logs are essential for detecting unauthorized access and for compliance with regulations like GDPR or HIPAA.
- Regular Security Audits and Penetration Testing:
  - Conduct regular security audits and penetration testing to identify and address vulnerabilities in the data infrastructure. This includes reviewing configurations, patching software, and ensuring compliance with security policies.
- Data Governance and Compliance:
  - Implement data governance policies to ensure that data is managed and protected according to legal and regulatory requirements. This includes defining data ownership, handling data classification, and ensuring compliance with data protection laws like GDPR, CCPA, or HIPAA.
198
参考回答
Batch processing handles data in large blocks at scheduled intervals, making it cost-effective for massive volumes. Stream processing (real-time) handles data as it arrives, offering sub-second latency for use cases like fraud detection or live dashboards.
199
参考回答
In Excel, you can protect a worksheet so that data cannot be copied and pasted from its cells. To allow copying and pasting from part of a protected worksheet, remove the sheet protection, unlock all cells, and then lock only the cells that must not be changed or removed. To protect a worksheet, go to Menu -> Review -> Protect Sheet -> Password. With a unique password, you can protect the sheet from being copied by others.
200
参考回答
The replication factor is the number of copies of each data block that HDFS keeps across the cluster. Replicating blocks is what provides fault tolerance: if one node fails, the block can still be read from another. The replication factor is 3 by default, but it can be lowered (for example to 2) or raised above 3 to meet your needs.
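The storage cost of replication is simple arithmetic, sketched below in Python. The hdfs_footprint helper is hypothetical, and the 128 MB block size is an assumption (the usual default in recent Hadoop releases, but configurable per cluster).

```python
import math

def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Estimate block count and raw storage for one file under a replication factor."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    return {
        "blocks": blocks,
        "block_replicas": blocks * replication,
        # The last block only occupies its actual size, so raw usage is size x replication.
        "raw_storage_mb": file_size_mb * replication,
    }

footprint = hdfs_footprint(1024)  # a 1 GB file with the assumed defaults
print(footprint)
```

A 1 GB file therefore splits into 8 blocks, is stored as 24 block replicas, and consumes about 3 GB of raw disk, which is the trade-off to weigh when raising or lowering the factor.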