How to Detect and Remove Corrupted Image Files in an Image Dataset Using Python

Aravinda 加阳
3 min read · Jun 4, 2024


When working with large image datasets, corrupted images can interrupt or crash AI model training. Detecting and removing these files before training avoids that. In this tutorial, you’ll learn how to scan directories, identify corrupted image files, and remove them using Python and the Pillow library.

Step 1: Setting Up Your Environment

Ensure you have Python and Pillow installed. If not, you can install Pillow with the following command:

pip install pillow

Step 2: Importing Required Libraries

Start by importing the necessary libraries. Use os for file handling and the Image module from Pillow (imported under the package name PIL) for image processing.

import os
from PIL import Image

Step 3: Defining a Function to Check and Remove Corrupted Images

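Define a function that checks whether an image file is corrupted. It opens the file with Pillow and calls verify(), which looks for problems in the file without decoding the image data. If opening or verifying raises an error, the function deletes the file and returns True; otherwise it returns False.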

def check_and_remove_corrupted_image(file_path):
    try:
        with Image.open(file_path) as img:
            img.verify()  # Verify the image file integrity
        return False  # Image is not corrupted
    except (IOError, SyntaxError) as e:
        print(f"Removing corrupted image: {file_path} - {e}")
        os.remove(file_path)  # Remove corrupted image file
        return True  # Image was corrupted and removed
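Because verify() does not decode the pixel data, a file whose header is intact but whose image data is cut short may still pass. The following is a minimal sketch of a stricter check, not part of the original script: the function name is_image_readable is hypothetical, and it assumes Pillow's default settings (in particular, that ImageFile.LOAD_TRUNCATED_IMAGES has not been switched on). It only reports the result and deletes nothing.

```python
from PIL import Image


def is_image_readable(file_path):
    """Return True if Pillow can both verify and fully decode the file."""
    try:
        # First pass: structural check that does not decode pixel data.
        with Image.open(file_path) as img:
            img.verify()
        # Second pass: the image object cannot be reused after verify(),
        # so reopen the file and force a full decode of the pixel data.
        with Image.open(file_path) as img:
            img.load()
        return True
    except Exception:
        # Deliberately broad: different decoders raise different exception
        # types, and any failure here means the file is not usable.
        return False
```

The second pass decodes every image in full, so it is slower than verify() alone. Whether that cost is worth it depends on the size of the dataset.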

Step 4: Scanning Directories for Corrupted Images

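Write a function that walks a given directory and all of its subfolders and passes every file it finds to check_and_remove_corrupted_image. Note that the function does not filter by file type: Pillow cannot open a non-image file such as a CSV or text file, so that file is treated as corrupted and deleted too. Run it only on folders that contain nothing but images.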

def scan_and_clean_directory(directory):
    for root, dirs, files in os.walk(directory):
        for file in files:
            file_path = os.path.join(root, file)
            check_and_remove_corrupted_image(file_path)
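If the dataset folder also holds labels, annotations, or other non-image files, one option is to look only at files with an image extension. Here is a small sketch of that idea using just the standard library. The helper name iter_image_files and the extension list are assumptions for illustration, so adjust the list to the formats in your dataset.

```python
import os

# Assumed extension list; adjust it to the formats in your dataset.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".webp"}


def iter_image_files(directory, extensions=IMAGE_EXTENSIONS):
    """Yield paths under directory whose extension looks like an image."""
    for root, _dirs, files in os.walk(directory):
        for name in files:
            # Compare in lower case so "photo.JPG" is matched too.
            if os.path.splitext(name)[1].lower() in extensions:
                yield os.path.join(root, name)
```

Each path it yields can then be passed to check_and_remove_corrupted_image, and files such as labels.csv are never touched.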

Step 5: Running the Code

Execute the function on a sample directory containing image files. Replace your_directory_path with the path to your image dataset.

if __name__ == "__main__":
    directory = "your_directory_path"
    scan_and_clean_directory(directory)
    print("Directory scan and cleanup complete.")
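Since os.remove deletes files permanently, it can be worth doing a dry run first. The sketch below is one possible way to do that: it uses the same Pillow check but only collects the paths that fail, without deleting anything. The function name find_corrupted_images is hypothetical and not part of the script above.

```python
import os
from PIL import Image


def find_corrupted_images(directory):
    """Return paths that fail Pillow's verify() check, without deleting anything."""
    corrupted = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            file_path = os.path.join(root, name)
            try:
                with Image.open(file_path) as img:
                    img.verify()
            except (OSError, SyntaxError):
                # Same exception types as the tutorial's function
                # (IOError is an alias of OSError in Python 3).
                corrupted.append(file_path)
    return corrupted
```

Printing the returned list lets you review what would be removed before running the real cleanup.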

Complete Code

Here is the complete code for reference:

import os
from PIL import Image

def check_and_remove_corrupted_image(file_path):
    try:
        with Image.open(file_path) as img:
            img.verify()  # Verify the image file integrity
        return False  # Image is not corrupted
    except (IOError, SyntaxError) as e:
        print(f"Removing corrupted image: {file_path} - {e}")
        os.remove(file_path)  # Remove corrupted image file
        return True  # Image was corrupted and removed

def scan_and_clean_directory(directory):
    for root, dirs, files in os.walk(directory):
        for file in files:
            file_path = os.path.join(root, file)
            check_and_remove_corrupted_image(file_path)

if __name__ == "__main__":
    directory = "your_directory_path"
    scan_and_clean_directory(directory)
    print("Directory scan and cleanup complete.")

Conclusion

In this tutorial, I demonstrated how to detect and remove corrupted image files from a dataset using Python and the Pillow library. Automating this step helps keep datasets clean, saves time, and prevents failures during data processing and AI model training.

You can extend this script by adding features like logging, handling various image formats, or creating dataset backups before removing corrupted files. Happy coding!
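As one example of the backup idea, here is a rough sketch that moves a suspect file into a separate quarantine folder instead of deleting it. The function name quarantine_file and the folder layout are assumptions for illustration. You could call it in place of os.remove.

```python
import shutil
from pathlib import Path


def quarantine_file(file_path, quarantine_dir):
    """Move file_path into quarantine_dir instead of deleting it; return the new path."""
    source = Path(file_path)
    target_dir = Path(quarantine_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    # Avoid overwriting an earlier quarantined file that has the same name.
    target = target_dir / source.name
    counter = 1
    while target.exists():
        target = target_dir / f"{source.stem}_{counter}{source.suffix}"
        counter += 1

    shutil.move(str(source), str(target))
    return target
```

Keep the quarantine folder outside the dataset directory, otherwise a later scan will walk into it and check the same files again.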

