Reference answer
The main ways to handle null values in Spark are:
- Filtering null values
- Replacing null values
- Dropping rows with null values
- Coalesce
- To filter rows based on null values in a specific column (or columns), use the .filter() or .where() methods.
- For example, the code below keeps only the rows where the name column is not null.
# Create a sample DataFrame with null values
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.appName("NullHandling").getOrCreate()
data = [(1, "Alice"), (2, None), (3, "Bob"), (None, "Eve")]
df = spark.createDataFrame(data, ["id", "name"])
# Filter rows where the 'name' column is NOT null
df_filtered = df.filter(col("name").isNotNull())
df_filtered.show()
- To replace null values, use the .fillna() method or its alias .na.fill(). Pass a dictionary to set a different replacement per column, or a single scalar to fill every column whose type matches the value.
- In the example below, nulls in name are replaced with "Unknown" and nulls in id are replaced with -1. A single scalar can be used instead, but it only affects columns of a compatible type: a string fills string columns and a number fills numeric columns.
# Replace nulls in 'name' with "Unknown" and nulls in 'id' with -1
df_replaced = df.fillna({"name": "Unknown", "id": -1})
df_replaced.show()
- To drop rows containing null values, use the .dropna() method. You can control the behavior using parameters such as how and thresh.
The parameters behave as follows:
how="any" (the default) removes rows that contain at least one null value.
how="all" removes rows only if every column is null.
thresh sets the minimum number of non-null values a row needs in order to be kept; when thresh is set, it overrides how.
# Drop rows with any null values
df_dropped_any = df.dropna()
df_dropped_any.show()
# Drop rows if all values in the row are null
df_dropped_all = df.dropna(how="all")
df_dropped_all.show()
# Keep rows with at least 1 non-null value (thresh=1); no row in this sample is entirely null, so nothing is dropped
df_dropped_thresh = df.dropna(thresh=1)
df_dropped_thresh.show()
- The coalesce() function from pyspark.sql.functions returns the first non-null value among the columns passed to it, which is useful for substituting alternative values when a column is null. It should not be confused with the DataFrame.coalesce() method, which reduces the number of partitions. In the example below, coalesce returns the first non-null value among name, gender, and id for each row: if name is null, it falls back to gender, and then to id. This is particularly useful when several columns may contain nulls and a fallback value is needed.
from pyspark.sql.functions import coalesce
# Create a sample DataFrame with multiple columns, some containing nulls
data = [(1, None, "Alice"), (2, "M", None), (3, None, "Bob")]
df_multi = spark.createDataFrame(data, ["id", "gender", "name"])
# Use coalesce to select the first non-null value in the specified columns;
# id is cast to string so that all arguments share one type
df_coalesced = df_multi.withColumn("final_name", coalesce(col("name"), col("gender"), col("id").cast("string")))
df_coalesced.show()