How to Detect and Remove Corrupted Image Files in an Image Dataset Using Python?

3 min read · Jun 4, 2024

In a large image dataset, even a few corrupted files can make a data loader raise an exception partway through a run and interrupt AI model training. Detecting and removing these files before training avoids that. In this tutorial, you’ll learn how to scan directories, identify corrupted image files, and remove them using Python and the Pillow library.

Step 1: Setting Up Your Environment

Ensure you have Python and Pillow installed. If not, you can install Pillow with the following command:

pip install pillow
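
To confirm that the installation worked, you can print the installed version from Python. This is just a quick sanity check, not a required step:

```python
import PIL

# PIL.__version__ holds the installed Pillow version as a string
print(f"Pillow version: {PIL.__version__}")
```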

Step 2: Importing Required Libraries

Start by importing the necessary libraries: os for walking directories and deleting files, and the Image module from Pillow (which is imported under the package name PIL) for opening and verifying images.

import os
from PIL import Image

Step 3: Defining a Function to Check and Remove Corrupted Images

Define a function that checks whether an image file is corrupted by opening it with Pillow and calling verify(). If Pillow raises an error, the function deletes the file and returns True; otherwise it returns False.

def check_and_remove_corrupted_image(file_path):
    try:
        with Image.open(file_path) as img:
            img.verify()  # Verify the image file integrity
        return False  # Image is not corrupted
    except (IOError, SyntaxError) as e:
        print(f"Removing corrupted image: {file_path} - {e}")
        os.remove(file_path)  # Remove corrupted image file
        return True  # Image was corrupted and removed
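
One caveat: Pillow's verify() inspects the file without decoding the pixel data, so some truncated files can pass it. A stricter check reopens the file and calls load() to force a full decode. The sketch below is one way to do that; the helper name is_image_readable is hypothetical and not part of the original code, and it only reports the result instead of deleting anything:

```python
from PIL import Image

def is_image_readable(file_path):
    """Return True if the file passes verify() and can also be fully decoded."""
    try:
        with Image.open(file_path) as img:
            img.verify()  # structural check only; pixel data is not decoded
        # After verify() the image object must not be reused, so reopen the file
        with Image.open(file_path) as img:
            img.load()  # forces a full decode; truncated files typically fail here
        return True
    except (OSError, SyntaxError):  # IOError is an alias of OSError in Python 3
        return False
```

The full decode makes the scan slower on large datasets, so whether it is worth it depends on how strict the check needs to be.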

Step 4: Scanning Directories for Corrupted Images

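Write a function that walks a directory and all of its subfolders and passes each image file to check_and_remove_corrupted_image. Files are filtered by extension first, because Pillow also raises an error for files that are not images, and those would otherwise be deleted as corrupted.

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".bmp", ".webp")  # assumed list; adjust for your dataset

def scan_and_clean_directory(directory):
    for root, _dirs, files in os.walk(directory):
        for file in files:
            # Skip non-image files so they are not mistaken for corrupted images
            if not file.lower().endswith(IMAGE_EXTENSIONS):
                continue
            file_path = os.path.join(root, file)
            check_and_remove_corrupted_image(file_path)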

Step 5: Running the Code

Run the script on a sample directory containing image files first, since os.remove deletes files permanently. Then replace your_directory_path with the path to your image dataset.

if __name__ == "__main__":
    directory = "your_directory_path"
    scan_and_clean_directory(directory)
    print("Directory scan and cleanup complete.")
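
Because the script deletes files permanently, it can help to list the suspect files before removing anything. The following is a minimal dry-run sketch under a few assumptions: the function name find_corrupted_images and the IMAGE_EXTENSIONS list are hypothetical and not part of the original code. It applies the same verify() check but only collects paths:

```python
import os
from PIL import Image

# Assumed extension list; adjust it to the formats in your dataset
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".bmp", ".webp")

def find_corrupted_images(directory):
    """Return paths of image files that fail Pillow's verify(); delete nothing."""
    corrupted = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if not name.lower().endswith(IMAGE_EXTENSIONS):
                continue  # ignore files that are not images
            path = os.path.join(root, name)
            try:
                with Image.open(path) as img:
                    img.verify()
            except (OSError, SyntaxError):
                corrupted.append(path)
    return corrupted

if __name__ == "__main__":
    for path in find_corrupted_images("your_directory_path"):
        print(path)
```

Once the printed list looks right, the deleting version can be run on the same directory.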

Complete Code

Here is the complete code for reference:

import os
from PIL import Image

def check_and_remove_corrupted_image(file_path):
    try:
        with Image.open(file_path) as img:
            img.verify()  # Verify the image file integrity
        return False  # Image is not corrupted
    except (IOError, SyntaxError) as e:
        print(f"Removing corrupted image: {file_path} - {e}")
        os.remove(file_path)  # Remove corrupted image file
        return True  # Image was corrupted and removed
            
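IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".bmp", ".webp")  # assumed list; adjust for your dataset

def scan_and_clean_directory(directory):
    for root, _dirs, files in os.walk(directory):
        for file in files:
            # Skip non-image files so they are not mistaken for corrupted images
            if not file.lower().endswith(IMAGE_EXTENSIONS):
                continue
            file_path = os.path.join(root, file)
            check_and_remove_corrupted_image(file_path)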

if __name__ == "__main__":
    directory = "your_directory_path"
    scan_and_clean_directory(directory)
    print("Directory scan and cleanup complete.")

GitHub - Aravinda89/detect_corrpted_imgs: How to Detect and Remove Corrupted Image Files in an…