stefan's blag and stuff

blockdiff

blockdiff is a binary patch tool for block based file and disk formats (like ext2,3,4 and btrfs). It's similar to bsdiff but not as general, because blockdiff is built on a more stricter assumption about the internal file format. The diff algorithm only considers very long byte sequences, blocks of around ~4 KiB, not single bytes. That's why the runtime and memory usage of blockdiff is minimal compared to bsdiff. Of course at the cost of not being as general applicable as bsdiff. The main usage area of blockdiff should be filesystem based A/B Updates of embedded devices.

For the commandline interface and examples see blockdiff's manpage.

For more information about binary patch tools see bsdiff's homepage, the Wikipedia article Delta Update and Google's Chrome Updater Courgette.

NOTE: The commandline interface and the patch file format of blockdiff is not considered stable yet. Wait for the 1.0.0 release.

Features and Limitations

Features:

Limitations (Or Keep in mind):

Homepage, Code and Contribution

There is a project's wobsite, but it contains the same information as the github repo.

The source code can be found in the git repositories:

$ git clone https://github.com/lengfeld/blockdiff
$ git clone https://git.stefanchrist.eu/blockdiff.git

If you found a bug or want to brainstrom about additional features, open a issue on github. To contribute code, open a pull request on github or just send patches to stefan+blockdiff@lengfeld.xyz.

Documentation and Installation

To generate the manpage and documentation just execute:

$ make

Then to view the manpage execute:

$ man ./blockdiff.1

To install the program in $HOME/bin execute:

$ make install

To install the program and the manpage system-wide in /usr execute:

$ make prefix=/usr install install-doc

For an overview of all Makefile targets execute

$ make help
all             Generate documentation
tests           Runs the python unit tests
check           Runs the pep8 source code checker
install         Installs program to $(prefix)/
help            Show the help prompt

Usage Example

See file blockdiff.md for examples.

License

The source code of blockdiff is licensed under GPLv2 and GPLv3. For the license text see files LICENSE.GPLv2.txt and LICENSE.GPLv3.txt or visit GNU.org Licenses.

I chose not to license the program as GPLv3 only, because the normal company lawyer freaks out when she/he hears the license name GPLv3. So she/he can calm down and only have to acknowledge the GPLv2 license text.

Algorithm

Compared to bsdiff the diff algorithm in blockdiff is braindead simple. It's the natural implementation that you would come up yourself if you restricted the binary diff algorithm to equally sized blocks and not bother with arbitrary byte movements of data. The presented algorithm works for blockbased file formats and filesystems that allows movement of data only at block length offsets.

The diff algorithm defined in functions readSource and readTargetAndGenPatchCommands is based on the user defined blocksize argument in command blockdiff patch.

The function readSource splits the source file into equally sized chunks (here named blocks) of blocksize bytes. This implies that the source file's length is a multiple of the blocksize.

Source file:
0                                                              EOF
|-- 0 --|-- 1 --|-- 2 --|-- 3 --|--    ...    --|-(n-1)-|-- n --|
    |       |       |                               |       |
    |       |       \- block 2                      |       |
    |       \- block 1                 block (n-1) -/       |
    \- block 0                        block n [last block] -/

For each block in the source file a checksum is calculated. blockdiff uses MD5 internally. The block's checksums are used to build a big hashtable for the source file. The hashtable key is the checksum of a block and the hashtable's value is the block index. Since multiple blocks in the source file can be identical and map to the same checksum, multiple indices are saved as a list as the hashtable's value. The hashtable looks like:

hashtable = {checksum_a: [4, 3],
             checksum_b: [1],
             checksum_c: [0],
             checksum_d: [2, n, n-1],
             ...}

As an optimization blocks of all zeros or ones bits are ignored and not saved in the hashtable. For filesystem zero blocks are quite common, because it's the default for non-used space.

The hashtable makes it easy and fast to look up whether a given block can be found in the source file and get the index of it.

After the hashtable of the source file is generated the target file is processed. The goal is to produce a sequence of commands, the patch file. The patch file can rebuild the target file, based on the available data in the source and patch file itself.

The function readTargetAndGenPatchCommands splits the target file into equally sized blocks of blocksize bytes and processes one block at a time. The block's checksum is calculated and looked up in the hashtable of the source file. If the same block is available in the source file, it emits a 'copy-from-block-index-n-in-source' command. If the block is not available in the source file, it emits and writes the whole target block content into the patch command. For all zero or ones blocks a special patch command is emitted. The sequence of patch commands may look like:

commands:
  target block 0: copy from source file block 2
  target block 1: copy from source file block (n-1)
  target block 2: write block of all ones
  target block 3: write block content b"a3f1deadbeaf..."
  target block 4: write block of all zeros
  ...
  target block n: copy from source file block 0

Apart from the sequence of patch commands the patch file formats contains an additional header and footer including extra information like the count of source and target blocks and extra checksums to verify the integrity of the generated target file in the command blockdiff patch.

Patch File Format

Goals and spec:

Contents/entries:

[header(magic="BDIF"), command(type=!stop)*, command(type=stop), footer(magic="BDIE")]

Command entries:

For more details see function writePatch.

Todos

Implement exact properties of supported file sizes. 2^32 or 2^64. Check boundary checks with generic implementation that also supports smaller and testable sizes 2^16 and 2^8.

Fix file format to support 2^64 file lengths.

Add tool and built-in feature for blockdiff patch. Read target blocks before writing them. Maybe they are already written by a previous interrupted patch operation. Minimize write operations.

IDEA: Allow in-place patching for (system/rescue update concepts). Think about interrupts/power-cuts.

Some infos about exit codes http://www.tldp.org/LDP/abs/html/exitcodes.html

IDEA: block diff should produce file format for bspatch. So the patch can be applied with the normal bspatch tool.

test ubifs: # mkfs.ubifs -r xx -m 1024 -e 15360 -c 2000 ubifs.img NAND Chip 8GiB (MX60LF8G18AC): 2K page size, 128K blocksize # for two gibs of ubifs space sudo mkfs.ubifs --max-leb-cnt 16384 -m 2048 -e 131072 ubifs.0.10.0.img -x none -r mnt/

$ blockdiff diff ubifs.0.9.0.img ubifs.0.10.0.img t.patch

$ time bsdiff ubifs.0.9.0.img ubifs.0.9.0.img bsdiff.0.10.0.patch real 3m20.795s user 3m19.308s sys 0m1.484s -> corrupted patch

Add a --dry-run option to blockdiff patch. Don't write the target file. Just check whether the patch file is not corrupted and all copied blocks are available in the source file.

Add multiple source blocks in the command copy. Detect data corruption in the source file and fallback to another block. This should make the patch operation more resilient against bit errors on the storage media.

The command blockdiff info should be able to read "gz" files. A user of blockdiff is going to compress the patch files.

Support interrupted patch operations. No need to download the whole patch file again. Just continue the download and patch operation in the middle of the patch file.

Use asciidoc instead of pandoc to generate the manpage and the html for the homepage. Pandoc is a haskell programm and requires the haskell toolchain installed on the build system.

Add argument "auto" for option "--blocksize" in command 'blockdiff diff'. The program should try to detect the suitable blocksize from the file, e.g. header of ext2,3,4 images or perdefined for tar.

Add some sort or progress stderr output in 'blockdiff patch'. Currently there is no indication that the patch operation is running.

Follow the recommendation of the REUSE documenation by FSF for software licenses.