Sample answer
This question can be a trap if earlier questions covered data transformation with Python. If you like Python, you are probably a fan of the Pandas library and have likely already mentioned it during the interview. This is the kind of question where you would not want to reach for Pandas. By default, Pandas loads the whole dataset into memory, so any transformation on a data frame is limited by the memory of the machine it runs on.
The right answer is to say that when memory is limited you would look for a solution that scales. That can be as simple as a Python generator that processes the file piece by piece. It may take a long time, but it will not run out of memory.
# Create a file first: ./very_big_file.csv as:
# transaction_id,user_id,total_cost,dt
# 1,John,10.99,2023-04-15
# 2,Mary, 4.99,2023-04-12

# example.py
def etl(item):
    # Do some ETL here
    return item.replace("John", "****")

# Create a generator
def batch_read_file(file_object, batch_size=19):
    """Lazy function (generator) that reads a file in chunks.
    Default chunk: 19 characters, kept tiny so the sample file yields several batches."""
    # Note: a fixed-size chunk can cut a value such as "John" in half,
    # so a real job would usually split on line boundaries instead.
    while True:
        data = file_object.read(batch_size)
        if not data:
            break
        yield data

# and read in chunks
with open('very_big_file.csv') as f:
    for batch in batch_read_file(f):
        print(etl(batch))

# In command line run
# python example.py
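Fixed-size chunks are easy to write, but a chunk boundary can fall in the middle of a record. A minimal sketch of a line-aligned variant follows, assuming the same CSV layout as the sample file. The names `batch_read_rows` and `mask_user` are hypothetical, and an in-memory `io.StringIO` stands in for the real file so the example runs on its own.

```python
import csv
import io
from itertools import islice


def batch_read_rows(file_object, batch_size=1000):
    """Yield lists of up to batch_size parsed CSV rows (header excluded)."""
    # skipinitialspace drops the blank after a comma, as in "Mary, 4.99"
    reader = csv.DictReader(file_object, skipinitialspace=True)
    while True:
        # islice pulls at most batch_size rows without reading the rest
        batch = list(islice(reader, batch_size))
        if not batch:
            break
        yield batch


def mask_user(row):
    """Hypothetical transform: mask the name John in the user_id column."""
    return {**row, "user_id": row["user_id"].replace("John", "****")}


# In-memory stand-in for very_big_file.csv
sample = io.StringIO(
    "transaction_id,user_id,total_cost,dt\n"
    "1,John,10.99,2023-04-15\n"
    "2,Mary, 4.99,2023-04-12\n"
)

# Collected into a list only because the sample is tiny;
# a real job would write each batch out and discard it.
masked_rows = []
for batch in batch_read_rows(sample, batch_size=1):
    masked_rows.extend(mask_user(row) for row in batch)

for row in masked_rows:
    print(row)
```

Only one batch is held in memory at a time, so `batch_size` becomes the knob that trades memory for fewer iterations. For a file on disk, the `csv` module documentation recommends opening it with `newline=''`.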
The optimal answer should include transforming the data with distributed computing, ideally using a tool that is fast at this kind of work and scales well. Spark or Hive-based tools might be a good choice.
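As a rough illustration of the distributed route, here is a sketch of the same masking step written with PySpark. It assumes PySpark and a Spark runtime are available, and the function name `mask_users_with_spark`, the app name, and both paths are hypothetical. It is not tuned for any particular cluster.

```python
def mask_users_with_spark(input_path, output_path):
    """Mask the name John in the user_id column using Spark DataFrames."""
    # Imported here so this file still loads where PySpark is not installed
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("mask-users").getOrCreate()

    # Spark splits the input into partitions and processes them in parallel,
    # so no single machine has to hold the whole file in memory
    df = spark.read.csv(input_path, header=True, ignoreLeadingWhiteSpace=True)

    masked = df.withColumn(
        "user_id", F.regexp_replace("user_id", "John", "****")
    )

    # The result is written as a directory of part files, one per partition
    masked.write.mode("overwrite").csv(output_path, header=True)
    spark.stop()
```

Calling `mask_users_with_spark("very_big_file.csv", "masked_output")` would run the job. The transformation is declared once and Spark decides how to spread it across the available workers, which is what lets the approach scale beyond one machine.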