xar

eXtensible ARchiver

A fork/clone of the Subversion xar repository at http://xar.googlecode.com/svn/ with several enhancements and bug fixes, including very basic command line signature support (master branch). See the xar project page for more information.

The Git utf8bom Filter

The top-level .gitattributes file assigns a filter=utf8bom attribute to several files. These files contain UTF-8 characters, and the utf8bom filter ensures they start with a UTF-8 BOM so that all editors handle them properly. The BOM is nothing more than the Unicode value U+FEFF encoded into UTF-8, which is just the three bytes 0xEF 0xBB 0xBF.

Strictly speaking, a UTF-8 BOM (Byte Order Mark) at the beginning of a UTF-8 file should not be necessary, but without it some editors assume the wrong text encoding.
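To check whether a given file currently starts with the BOM, something like the following sketch could be used. It assumes a POSIX shell with od, tr and mktemp available. The has_utf8_bom function name is made up for illustration and is not part of this repository.

```shell
#!/bin/sh
# has_utf8_bom FILE: succeed (exit status 0) if FILE starts with EF BB BF.
# Hypothetical helper for illustration; not part of the xar repository.
has_utf8_bom() {
  # od prints the first three bytes as hex; tr strips all whitespace.
  first3=$(od -An -tx1 -N3 "$1" | tr -d ' \t\n')
  [ "$first3" = "efbbbf" ]
}

# Demonstration on two temporary files (octal escapes keep printf portable).
tmpdir=$(mktemp -d)
printf '\357\273\277hello\n' > "$tmpdir/with_bom.txt"
printf 'hello\n' > "$tmpdir/without_bom.txt"
has_utf8_bom "$tmpdir/with_bom.txt" && echo "with_bom.txt: BOM present"
has_utf8_bom "$tmpdir/without_bom.txt" || echo "without_bom.txt: no BOM"
rm -rf "$tmpdir"
```

The function only inspects the first three bytes, so it says nothing about whether the rest of the file is valid UTF-8.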

Configuring the Filter

The repository can be used just fine without configuring your own copy of the utf8bom filter. In this case Git will just ignore the filter=utf8bom attribute.

This means that if you edit one of the files marked with this attribute using an editor that strips off the UTF-8 BOM (some editors really do this) and then check the edited file back in, the UTF-8 BOM will be lost.

The whole point of the utf8bom filter is to prevent this from happening.
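To see which files Git considers marked with the attribute, git check-attr can be queried. The sketch below runs against a throwaway repository so that it is self-contained. The *.txt pattern and the file names are assumptions chosen for the demonstration, not the contents of this repository's .gitattributes, and filter_attr_of is a hypothetical helper name.

```shell
#!/bin/sh
# filter_attr_of REPO PATH: print the value of the "filter" attribute that
# Git assigns to PATH inside the working tree at REPO.
# Hypothetical helper for illustration; not part of the xar repository.
filter_attr_of() {
  # check-attr prints "<path>: filter: <value>"; keep only <value>.
  git -C "$1" check-attr filter -- "$2" | sed 's/.*: filter: //'
}

# Demonstration in a temporary repository (skipped if git is unavailable).
if command -v git >/dev/null 2>&1; then
  demo=$(mktemp -d)
  git init -q "$demo"
  # Assumed pattern for the demo only.
  printf '*.txt filter=utf8bom\n' > "$demo/.gitattributes"
  filter_attr_of "$demo" notes.txt   # prints: utf8bom
  filter_attr_of "$demo" notes.c     # prints: unspecified
  rm -rf "$demo"
fi
```

In a real clone, pass the path of the clone and one of its files instead of the demo values.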

The filter lines from a ~/.gitconfig file that sets up the utf8bom filter look like this:

[filter "utf8bom"]
	clean = utf8bomcat
	smudge = utf8bomcat

These Git commands create the above lines in your global Git configuration file (typically ~/.gitconfig):

git config --global filter.utf8bom.clean utf8bomcat
git config --global filter.utf8bom.smudge utf8bomcat

The above configuration assumes that the utf8bomcat command is available to Git via the current PATH. If that is not the case, adjust the configuration to include the full path to the utf8bomcat command.
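One way to find out whether the PATH assumption holds is sketched below. The resolve_filter_cmd function is a hypothetical helper, and /opt/tools/utf8bomcat is only an assumed install location that you would replace with wherever your copy actually lives. The script prints the suggested git config commands rather than running them.

```shell
#!/bin/sh
# resolve_filter_cmd NAME: print what the shell would run for NAME (the full
# path for an external command), or fail with a message on standard error.
# Hypothetical helper for illustration; not part of the xar repository.
resolve_filter_cmd() {
  path=$(command -v "$1") || {
    echo "$1 not found in PATH" >&2
    return 1
  }
  printf '%s\n' "$path"
}

# If utf8bomcat is not on PATH, suggest a configuration with a full path.
# /opt/tools/utf8bomcat is an assumed location; substitute your own.
if ! resolve_filter_cmd utf8bomcat >/dev/null 2>&1; then
  echo "git config --global filter.utf8bom.clean /opt/tools/utf8bomcat"
  echo "git config --global filter.utf8bom.smudge /opt/tools/utf8bomcat"
fi
```

Keep in mind that the PATH seen by Git when it is launched from a GUI tool may differ from the PATH of your interactive shell.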

The utf8bomcat Script

For the filter to work, a copy of the utf8bomcat script is required.

The utf8bomcat script is so named because it behaves almost exactly like the cat command except that it does not support any options or the special file name “-” (but it will read from standard input if no file names are given).

utf8bomcat checks the first three bytes that would be written to standard output. If they are not the UTF-8 BOM (0xEF 0xBB 0xBF), it writes the UTF-8 BOM to standard output first. It then copies the input to standard output.

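To be used successfully as a Git filter, it's important that utf8bomcat not add a UTF-8 BOM if one is already present, since the same command serves as both the clean and the smudge filter and will therefore be given files that already start with a BOM.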

Any command or script that has this behavior (it need only support reading from standard input to be a Git filter) may be used as the utf8bom filter command.
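A candidate filter command can be checked for this behavior before it is wired into Git. The following is a minimal sketch, assuming a POSIX shell with od, tr, cmp and mktemp. All three function names are hypothetical: check_bom_filter is the checker, bom_filter is a simple stand-in that buffers its input in a temporary file (it is not the utf8bomcat script), and always_prepend is a deliberately wrong filter used as a counterexample.

```shell
#!/bin/sh
# bom_filter: stand-in filter that copies standard input to standard output,
# adding a UTF-8 BOM only if the input does not already start with one.
bom_filter() {
  tmp=$(mktemp) || return 1
  cat > "$tmp"
  first3=$(od -An -tx1 -N3 "$tmp" | tr -d ' \t\n')
  [ "$first3" = "efbbbf" ] || printf '\357\273\277'
  cat "$tmp"
  rm -f "$tmp"
}

# always_prepend: deliberately wrong filter that adds a BOM unconditionally.
always_prepend() {
  printf '\357\273\277'
  cat
}

# check_bom_filter CMD [ARGS...]: succeed if CMD, run as a stdin-to-stdout
# filter, adds a BOM to BOM-less input and leaves its own output unchanged
# when run a second time.
check_bom_filter() {
  work=$(mktemp -d) || return 1
  printf 'plain text\n' > "$work/in"
  printf '\357\273\277plain text\n' > "$work/expected"
  "$@" < "$work/in" > "$work/once"
  "$@" < "$work/once" > "$work/twice"
  cmp -s "$work/once" "$work/expected" && cmp -s "$work/once" "$work/twice"
  result=$?
  rm -rf "$work"
  return $result
}

check_bom_filter bom_filter && echo "bom_filter: ok"
check_bom_filter always_prepend || echo "always_prepend: adds a second BOM"
```

To check an installed command instead of the stand-in, pass its name, for example check_bom_filter utf8bomcat.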

Here is one possible script that may be used. It requires the bash shell and the hexdump and cat utilities:

#!/bin/bash

# utf8bomcat -- prepend a UTF-8 BOM to the output if not already present
# Copyright (C) 2011,2012 Kyle J. McKay.  All rights reserved.

# This program is free software.  It comes WITHOUT ANY WARRANTY; without even
# the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
# You can redistribute it and/or modify it under the terms of the WTFPL,
# Version 2, as published by Sam Hocevar.
# See http://sam.zoy.org/wtfpl/COPYING for more details.

# Version 1.1.2

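# Only add a BOM if one is not already present

# Use the C locale so that read counts bytes rather than multibyte characters
export LC_ALL=C
cat "$@" |
{
  # Read up to the first three bytes of the input into REPLY
  read -d '' -rn3
  # Convert those bytes to a lowercase hex string for comparison
  hexval="$(hexdump -n3 -e '/1 "%02x"' <<< "$REPLY")"
  if [ "$hexval" != "efbbbf" ]; then
    printf '\xef\xbb\xbf'
  fi
  # Write back the bytes already consumed, then copy the rest of the input
  printf '%s' "$REPLY"
  exec cat
}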

The utf8bomcat.c Source File

On systems where the bash shell, the hexdump utility or the cat utility is not available, the following source file may be compiled (typically with something like cc -o utf8bomcat -O utf8bomcat.c) to produce a version of utf8bomcat that can be used as a Git filter instead of the previous script:

/*
 * utf8bomcat.c -- prepend a UTF-8 BOM to the output if not already present
 * Copyright (C) 2012 Kyle J. McKay.  All rights reserved.
 *
 * This program is free software.  It comes WITHOUT ANY WARRANTY; without even
 * the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 * You can redistribute it and/or modify it under the terms of the WTFPL,
 * Version 2, as published by Sam Hocevar.
 * See http://sam.zoy.org/wtfpl/COPYING for more details.
 */

/*
 * Version 1.0
 */

#include <stddef.h>
#include <stdlib.h>
#include <stdio.h>

#define RBUFSIZ 32768

int main(void)
{
  FILE *inbinary = freopen(NULL, "rb", stdin);
  FILE *outbinary = freopen(NULL, "wb", stdout);
  unsigned char *buff = (unsigned char *)malloc(RBUFSIZ);
  int checkdone = 0;

  if (!inbinary) {
    fprintf(stderr, "freopen(NULL, \"rb\", stdin) failed\n");
    return 1;
  }
  if (!outbinary) {
    fprintf(stderr, "freopen(NULL, \"wb\", stdout) failed\n");
    return 1;
  }
  if (!buff) {
    fprintf(stderr, "malloc(%d) failed\n", RBUFSIZ);
    return 1;
  }
  while (!feof(inbinary) && !ferror(inbinary)) {
    size_t count = fread(buff, 1, RBUFSIZ, inbinary);
    if (!checkdone) {
      if (count < 3 || buff[0] != 0xEF || buff[1] != 0xBB || buff[2] != 0xBF) {
        if (fwrite("\xEF\xBB\xBF", 3, 1, outbinary) != 1) {
          fprintf(stderr, "fwrite failed writing UTF-8 BOM to stdout\n");
          return 1;
        }
      }
      checkdone = 1;
    }
    if (count) {
      if (fwrite(buff, count, 1, outbinary) != 1) {
        fprintf(stderr, "fwrite failed writing input to stdout\n");
        return 1;
      }
    }
  }
  if (ferror(inbinary)) {
    fprintf(stderr, "fread failed reading input\n");
    return 1;
  }
  free(buff);
  return 0;
}

Project Page

Visit the xar project page (source code etc.) at http://mackyle.github.io/xar.