Handling NaN in Python and CSV Data: Cleaning, Transforming, and Displaying

a Number). Handling NaN involves cleaning and transforming data. Also, raw data might be inconsistent, incomplete, or structured in ways that hinder proper processing. This article covers cleaning NaN values, transforming data for calculations, and formatting challenges when displaying CSV data in tables or charts.

Cleaning NaN Values in CSV Data

Cleaning data means making it usable. Raw data from sources like sensors, databases, or user inputs frequently contain missing entries. Python’s pandas library provides tools for dealing with this problem.

Common approaches to cleaning NaN values:

Fill Missing Values: Replace NaN with a default value.

df.fillna(0, inplace=True)  # Replace NaN with 0
df.fillna("Unknown", inplace=True)  # Replace NaN with a placeholder string
  • This approach prevents gaps in data displays.

Forward or Backward Filling: Fill gaps using neighboring values.

df.fillna(method="ffill", inplace=True)  # Forward fill using previous row
df.fillna(method="bfill", inplace=True)  # Backward fill using next row
  • Time-series data benefits from this technique.

Drop Rows or Columns with NaN: If too many values are missing, removal is an option.

df.dropna(inplace=True)  # Drop all rows with NaN
df.dropna(axis=1, inplace=True)  # Drop columns with NaN
  • Use this when missing data is excessive.

Interpolate Missing Values: Fill gaps with estimated values.

df.interpolate(inplace=True)

Displaying CSV Data in a Table with NaN Handling

When presenting raw CSV data in a table, NaN values can create gaps, making the table look unstructured. The approach to handling them depends on the context.

Basic display with NaN replacements:

import pandas as pd

df = pd.read_csv("data.csv")
df.fillna("N/A", inplace=True)  # Replace NaN for better readability

import ace_tools as tools
tools.display_dataframe_to_user("Cleaned CSV Data", df)

Replacing NaN with "N/A" or "Missing" makes tables more readable.

Tables that include numerical data should avoid empty cells. Users expect a logical flow when scanning through numbers.

Transforming Data: Math, Grouping, and Aggregation

When calculations are involved, NaN values can break operations. Aggregations like sum, mean, or median must handle missing data correctly.

Summing Columns with NaN

df["total"] = df["price"] * df["quantity"]
df["total"] = df["total"].fillna(0)  # Avoid NaN in calculated values

Without filling NaN, calculations can produce unwanted results.

Grouping and Aggregation

Grouping data requires handling NaN carefully. Consider sales data where missing prices could impact averages.

grouped_df = df.groupby("category").agg({"price": "mean", "quantity": "sum"})
grouped_df.fillna(0, inplace=True)  # Replace NaN after grouping

Grouping introduces gaps when values are missing in a category. Setting NaN to zero ensures no blank results.

Handling NaN in Charts

Charts need complete numerical data. Missing values create incorrect visualizations. The solution depends on the data type.

Replace NaN with Zeros: Ensures continuity in line or bar charts.

df.fillna(0, inplace=True)

Interpolate Missing Values: Smoothens trends.

df.interpolate(inplace=True)

Drop Rows with NaN: Use when missing values are minimal.

df.dropna(inplace=True)

Charts displaying categorical data may have missing labels. Avoid empty categories in bar charts by setting missing labels to "Unknown".

Formatting Issues in Tables and Charts

Column Data Type Mismatches

CSV files may store numbers as strings. This leads to sorting and calculation errors.

df["price"] = pd.to_numeric(df["price"], errors="coerce")  # Convert to number

Any invalid values convert to NaN, which should then be handled.

Incorrect Date Formats

Date columns might be stored as strings, preventing time-based calculations.

df["date"] = pd.to_datetime(df["date"], errors="coerce")

Converting to datetime ensures proper sorting and filtering.

Inconsistent Casing in Categories

Categorical data can be inconsistent.

df["category"] = df["category"].str.lower().str.strip()

Standardizing casing prevents duplicate-looking categories like “Electronics” and “electronics”.

Duplicate Entries

Duplicate rows skew calculations.

df.drop_duplicates(inplace=True)

Removing duplicates ensures accuracy.

Here’s a comprehensive list of pandas DataFrame (df) methods for handling NaN (missing values) and data transformation.

1. Handling NaN

(Missing Values)

-----------------------------------------------------------------------
| Method                        | Description                         |
-----------------------------------------------------------------------
| df.isna()                     | Returns True for NaN values.        |
| df.isnull()                   | Same as isna(), checks for NaN.     |
| df.notna()                    | Returns True for non-NaN values.    |
| df.notnull()                  | Same as notna(), checks for non-NaN.|
| df.dropna()                   | Removes rows with NaN values.       |
| df.dropna(axis=1)             | Removes columns with NaN values.    |
| df.dropna(subset=["column"])  | Drops rows where a specific column has NaN. |
| df.fillna(value)              | Replaces NaN with a specified value.|
| df.fillna(method="ffill")     | Forward fills missing values.       |
| df.fillna(method="bfill")     | Backward fills missing values.      |
| df.interpolate()              | Uses linear interpolation to fill NaN. |
-----------------------------------------------------------------------

2. Transforming Data

Changing Data Types

| Method                                         | Description                                          |
---------------------------------------------------------------------------------------------------------
| df.astype({"column": type})                    | Converts column data type.                           |
| pd.to_numeric(df["column"], errors="coerce")   | Converts column to number, replacing errors with NaN. |
| pd.to_datetime(df["column"])                   | Converts column to datetime format.                   |
---------------------------------------------------------------------------------------------------------

Handling Duplicates

----------------------------------------------------------------
| Method                    | Description                      |
----------------------------------------------------------------
| df.duplicated()           | Returns True for duplicate rows. |
| df.drop_duplicates()      | Removes duplicate rows.          |
----------------------------------------------------------------

Replacing Values

--------------------------------------------------------------------------------
| Method                                | Description                          |
--------------------------------------------------------------------------------
| df.replace(to_replace, value)         | Replaces values with another value.   |
| df["column"].replace({"old": "new"})  | Replaces specific values in a column. |
--------------------------------------------------------------------------------

String Transformation

----------------------------------------------------------------------------
| Method                                 | Description                     |
----------------------------------------------------------------------------
| df["column"].str.lower()               | Converts text to lowercase.     |
| df["column"].str.upper()               | Converts text to uppercase.     |
| df["column"].str.strip()               | Removes leading/trailing spaces. |
| df["column"].str.replace("old", "new") | Replaces substrings in text.    |
----------------------------------------------------------------------------

Grouping & Aggregation

-------------------------------------------------------------------------------------------------------
| Method                                                    | Description                             |
-------------------------------------------------------------------------------------------------------
| df.groupby("column").sum()                                | Groups by a column and sums values.     |
| df.groupby("column").mean()                               | Groups by a column and calculates mean. |
| df.groupby("column").agg({"col1": "sum", "col2": "mean"}) | Custom aggregation.                     |
-------------------------------------------------------------------------------------------------------

Sorting

---------------------------------------------------------------------------
| Method                                     | Description                |
---------------------------------------------------------------------------
| df.sort_values("column")                   | Sorts DataFrame by column. |
| df.sort_values("column", ascending=False)  | Sorts in descending order. |
---------------------------------------------------------------------------

Creating New Columns

----------------------------------------------------------------------------------------
| Method                                     | Description                             |
----------------------------------------------------------------------------------------
| df["new_col"] = df["col1"] * df["col2"]    | Creates a new column from calculations. |
| df.assign(new_col=df["col1"] + df["col2"]) | Adds a new column dynamically.          |
----------------------------------------------------------------------------------------

Thank you for reading this article. I hope you found it helpful and informative. If you have any questions, or if you would like to suggest new Python code examples or topics for future tutorials, please feel free to reach out. Your feedback and suggestions are always welcome!

Happy coding!
Py-Core.com Python Programming

You can also find this article at Medium.com

Leave a Reply