stefan's blag and stuff

Blog – 2016-04-17 – Sort MP3 Files

After a hard disk recovery session with photorec you end up with thousands of images, zips, documents and audio files on your hard drive. Except for the filename suffix, the names are mostly uninformative. So, as a good hard disk recovery wizard for your friends and family, it's a good habit to rename the files and clean up the mess.

The Ubuntu wiki already has a Cleaning_up section on its DataRecovery page. It contains commands to move files based on their file type, remove small image files (thumbnails), rename images based on their EXIF data and remove duplicates based on the filename.

Using the JPG EXIF metadata is an especially big win. For example, using the date of the image as the filename automatically groups the images into events. You could argue that after this transformation the photo collection is often better sorted than before.

I was facing the same problem with MP3 files. Luckily the format also carries metadata such as artist, album and track in its ID3 tags. You can use the utility mp3info to read them. So I hacked together a python3 script to rename all MP3 files based on that information. Here is the introduction text:

Simple program to sort MP3 files into the folder hierarchy

   <output dir>/<artist>/<album>/<track>.mp3

I used it to sort thousands of MP3 files recovered by 'photorec'
(http://www.cgsecurity.org/wiki/PhotoRec). You have to install
the utility 'mp3info':

   sudo apt-get install mp3info

Usage:

   mp3sort <input directory of files> <output directory prefix>

then disable 'dryrun' in the source code. Sorry, no command-line
switch for it yet. It's safe to re-execute the script multiple
times. It will not overwrite existing files.

Here is the download link: mp3sort.py (2016-10-05: mp3sort.py (v2)). The license is GPLv3+ of course.
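
For readers who only want the idea, here is a minimal sketch of how such a script could work. It is not the actual mp3sort.py. All function names are made up for illustration, the "unknown" placeholder for empty tags and the replacement of '/' in tags are my own assumptions, and I assume <track> means the track title (mp3info's %t) rather than the track number.

```python
import os
import shutil
import subprocess


def parse_tags(raw):
    """Turn mp3info output (one field per line) into (artist, album, title)."""
    # Pad with empty fields so that missing lines do not raise an error.
    fields = (raw.split(b"\n") + [b"", b"", b""])[:3]
    cleaned = []
    for field in fields:
        # A '/' inside a tag would otherwise create an extra directory level.
        field = field.strip().replace(b"/", b"_")
        cleaned.append(field or b"unknown")  # assumed placeholder for empty tags
    return tuple(cleaned)


def read_tags(path):
    """Ask the 'mp3info' utility for artist (%a), album (%l) and title (%t)."""
    result = subprocess.run(
        [b"mp3info", b"-p", b"%a\n%l\n%t\n", os.fsencode(path)],
        stdout=subprocess.PIPE,
        check=True,
    )
    return parse_tags(result.stdout)


def target_path(output_dir, tags):
    """Build <output dir>/<artist>/<album>/<track>.mp3 as a bytes path."""
    artist, album, title = tags
    return os.path.join(os.fsencode(output_dir), artist, album, title + b".mp3")


def move_file(src, dst, dryrun=True):
    """Move src to dst; never overwrite, and do nothing in dry-run mode."""
    if os.path.exists(dst):
        return "skip"
    if dryrun:
        return "dryrun"
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    shutil.move(src, dst)
    return "moved"


def sort_mp3s(input_dir, output_dir, dryrun=True):
    """Walk input_dir and sort every *.mp3 file below output_dir."""
    for root, _dirs, files in os.walk(os.fsencode(input_dir)):
        for name in files:
            if not name.lower().endswith(b".mp3"):
                continue
            src = os.path.join(root, name)
            try:
                tags = read_tags(src)
            except subprocess.CalledProcessError:
                continue  # mp3info reported an error; leave the file alone
            dst = target_path(output_dir, tags)
            print(move_file(src, dst, dryrun), os.fsdecode(src), "->", os.fsdecode(dst))
```

Calling sort_mp3s("recovered", "sorted") only prints what would happen. Passing dryrun=False moves the files. Because existing targets are skipped, running it a second time does no harm, but two files with identical tags also mean that only the first one gets moved.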

Encoding Issues

As always when a program deals with string data and filesystems, there are encoding problems. I tried to implement it correctly by using python3 bytes objects and not assuming any specific encoding like UTF-8 in the code. Nevertheless, the underlying NTFS filesystem had problems creating directories like 'Die Ärzte'. So I implemented a quick hack: the code strips any non-ASCII characters. Damn it.
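
A hack like this could look roughly as follows. This is a sketch working on bytes, not the code from mp3sort.py, and the function name strip_non_ascii is made up for illustration.

```python
def strip_non_ascii(name):
    """Drop every byte outside the 7-bit ASCII range from a bytes name."""
    return bytes(b for b in name if b < 0x80)


# 'Die Ärzte' encoded as UTF-8 and as Latin-1; both lose the 'Ä'.
print(strip_non_ascii(b"Die \xc3\x84rzte"))  # b'Die rzte'
print(strip_non_ascii(b"Die \xc4rzte"))      # b'Die rzte'
```

The nice part is that it works without knowing the encoding of the tag. The ugly part is that a name consisting only of non-ASCII characters ends up empty, so a real script needs some fallback for that case.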