Don’t use memcmp to compare structs or classes

Structs and classes can be laid out with padding between the data members for alignment purposes, and this padding doesn’t get initialised unless you specifically zero it out yourself. So if you’re using memcmp and comparing memory at pointers to structs you might have a problem if you’re expecting structs with equal data members to always be equal.

Compiler Error C2662

I was forced to lookup Visual Studio’s Compiler Error C2662 because I was getting it in my code. Turns out this happens when your const’s don’t line up. In my case I had const in the client declaration but the member function I was calling didn’t have the const specification.

This little nugget of code from the referenced article shows how to fix up the problem:

// C2662.cpp
class C {
public:
   void func1();
   void func2() const{}
} const c;

int main() {
   c.func1();   // C2662
   c.func2();   // OK
}

Shell scripting for archive restoration

I’ve been restoring my archives. Basically I have a bit over 1.3TB of data that I’ve tarballed up and stashed on some disconnected SATA disks, and now that I have a computer with the capacity to hold all that data I’m resurrecting the file shares from the archived tarballs. You can see my restore script here:

restore.sh

#!/bin/bash
cd "`dirname $0`"
data_path=/var/sata2/data
tar xf $data_path/1999.tar.gz --hard-dereference >  output.txt 2>&1
tar xf $data_path/2001.tar.gz --hard-dereference >> output.txt 2>&1
tar xf $data_path/2002.tar.gz --hard-dereference >> output.txt 2>&1
tar xf $data_path/2003.tar.gz --hard-dereference >> output.txt 2>&1
tar xf $data_path/2004.tar.gz --hard-dereference >> output.txt 2>&1
tar xf $data_path/2005.tar.gz --hard-dereference >> output.txt 2>&1
tar xf $data_path/2006.tar.gz --hard-dereference >> output.txt 2>&1
tar xf $data_path/2007.tar.gz --hard-dereference >> output.txt 2>&1
tar xf $data_path/2008.tar.gz --hard-dereference >> output.txt 2>&1

The restore.sh script creates an output.txt file that lists any errors from tar during the restore process. I then have a set of scripts that process this output.txt file fixing up two types of common errors.

Fixing dates

The first error is that the date of the file in the archive isn’t a reasonable value. For example, I had files reporting modification time somewhere back in 1911, before computers. To fix the dates with this problem I run the following scripts:

fix-dates

#!/bin/bash
cd "`dirname $0`";
./bad-date | xargs -0 touch --no-create

bad-date

#!/bin/bash
awk -f bad-date.awk < output.txt | while read line
do
  # note: both -n and \c achieve the same end.
  echo -n -e "$line\0\c"
done

bad-date.awk

{
  if ( /tar: ([^:]*): implausibly old time stamp/ ) {
    split( $0, array, ":" )
    filepath = array[ 2 ]
    sub( / /, "", filepath )
    printf( "%s\n", filepath )
  }
}

Fixing hard links

The second class of error that I can receive is that the file that is being extracted from the archive is a hard link to an already existing file, but the hard link cannot be created because the number of links to the target has reached its limit. I think I used ReiserFS as my file system the archives were on originally, and I’m using Ext4 now. Ext4 seems to have limitations that ReiserFS didn’t. Anyway, it’s not big deal, because I can just copy the target to the path that failed to link. This creates a duplicate file, but that’s not a great concern. I’ll try to fix up such duplicates with my pcdedupe project.

fix-links

#!/bin/bash
cd "`dirname $0`";
./bad-link | xargs -0 ./fix-link

bad-link

#!/bin/bash
awk -f bad-link.awk < output.txt | while read line
do
  # note: both -n and \c achieve the same end.
  echo -n -e "$line\0\c"
done

bad-link.awk

{
  if ( /tar: ([^:]*): Cannot hard link to `([^']*)': Too many links/ ) {
    split( $0, array, ":" )
    linkpath = array[ 2 ]
    sub( / /, "", linkpath )
    filepath = array[ 3 ]
    sub( / Cannot hard link to `/, "", filepath )
    filepath = substr( filepath, 0, length( filepath ) )
    printf( "%s:%s\n", filepath, linkpath )
  }
}

fix-link

#!/bin/bash
cd "`dirname $0`";
spec="$1"
file=`echo "$spec" | sed 's/\([^:]*\):.*/\1/'`
link=`echo "$spec" | sed 's/[^:]*:\(.*\)/\1/'`
#echo "$spec"
#echo Linking "'""$link""'" to "'""$file""'"...
#echo ""
if [ ! -f "$file" ]; then
  echo Missing "'""$file""'"...
  exit 1;
fi
cp "$file" "$link"

check-output

I then checked for anything that I’d missed with my scripts with the following:

#!/bin/bash
cd "`dirname $0`";
cat output.txt | grep -v "Cannot hard link" | grep -v "implausibly old time"

tr

I learned about the ‘tr’ Unix command today. It’s for translating text in streams. The particular example was:

  echo | tr '012' '001'

And I didn’t really understand what that did, but now I do. Basically the ‘echo’ part will echo a new line character, which is octal 012. Then tr will read its input stream and read that new line. It then has a rule to translate 012 (new line) to 001 (Ctrl+A), which it does. So basically it’s just a way of getting a Ctrl+A character in a stream. If you use Ctrl+A as your regular expression delimiter you’re unlikely to have a collision in the expression itself.

FS variable in awk

I was reading about environment variables and I also found this article Internal Variables that describes the variables used by bash. In reading that I learned about the awk FS variable which aids in field splitting. See page 146 of sed & awk by Dougherty and Robbins for details, but basically you can set FS to a single character to have lines split into fields based on that character, or you can specify a regular expression such as “\t+” (any number of tabs separates fields) or “[,;]” (a single comma or fullstop will separate fields).

Environment Variables and Secure Programming for Linux

I read the Environment Variables section of Secure Programming for Linux and Unix HOWTO and learned about the IFS environment variable.

I also read CS 15-392 Secure Programming – Environment Variables.

The IFS environment variable is the “internal field separator” and it is typically space, tab, new line. I.e. white space used to separate fields. So in bash you can delete the IFR variable and it will default to ” \t\n” or you can set it explicitly to that value. So that explains why I found a script that unset the IFR variable — it’s a secure programming practice.

File names on Windows

I’ve been reading up on file names in Windows because I’m having a problem with my C++ code processing a file with an odd TM character in the file name. I’m not sure why, but it seems that file names returned by the POSIX readdir function don’t necessarily exist when then given to ifstream.open, or some weird character encoding thing is going on. Hopefully I get to the bottom of it. It’s a complete fluke that I actually had a file this failed on available and did testing on it, lucky I guess.

I did some more reading and discovered that there are ‘wide character’ versions of the file functions for Windows that use UTF-16 encoded strings rather than ‘code page’ encoded strings, I guess. Anyway, I don’t think I’m going to bother with such things, if the file can’t be opened because it has a weird character in it then I’ll just fail with an error message and the user can look at fixing it. This program is only being “developed and tested” in Windows, there are no plans to actually run it on Windows, it will run on Linux, which won’t have this weird character encoding issue.

The difference between delete and delete[] in C++

I just learned about the difference between delete and delete[] in C++ by reading this article from StackOverflow. Basically you use delete[] to delete arrays, and delete for everything else. There was conflicting information about whether C++ runs the destructors on objects in an array when the array is deleted. Some people said it did, others said it didn’t. I should do an experiment to see one day. The reason for the difference between delete and delete[] seems to be that when C++ allocates an array it allocates memory for storing the size of the array as well as the array elements, and then returns a pointer to the first array element, which is beyond the start of the allocated memory, because the array size takes up the first bit of space.