So I read up on Strange Loop. It’s a tech conference which was started back in 2009 and each year’s talks are available in the archive. I can see myself binge watching…
Tag Archives: archive
Archive processing
Processed: /data/archive
Processing started: 2017-05-02 13:29:32
Processing completed: 2017-05-03 11:09:08
Processing time: 21.66h
Processed 2,151,351 directories.
Processed 13,138,509 files at 168 files/sec.
Found 43,139 symlinks.
Average number of files per directory is 6.11.
Created 10,278,937 hard links.
You have 2,859,572 unique files.
Your percentage of unique files is 21.76%.
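Those derived figures check out against the raw counts, for what it's worth. Here's a quick throwaway sanity check from the shell (just bc, nothing to do with the processing tool itself):

files=13138509; unique=2859572; hours=21.66
echo "scale=2; 100 * $unique / $files" | bc      # -> 21.76 (percentage of unique files)
echo "scale=0; $files / ($hours * 3600)" | bc    # -> 168 (files per second)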
Mailman 3.0 and Postfix Virtual Domains
Read the spec for Mailman 3.0. Looks like it will be pretty good. The feature I'm interested in, and that I'm annoyed I can't do with my current version of Mailman, is being able to put a link to the web-archived message at the bottom of the outgoing SMTP message, i.e. so there's a link back to that message on the web in the message itself. That would be really handy for referencing. At the moment, if I want a link I have to go to the web archive for the particular list and find the message there.
While I was reading the Mailman 3.0 spec I noticed a link to the Postfix Virtual Domain Hosting Howto. I think I might have read (at least some of) that before. But… reading it is now definitely on my TODO list.
Extracting a single file from a tar archive
You can pass the name of the file you want as an extra argument alongside the -x command line switch to tell the tar command that you just want to extract that one particular file from the archive. For example:
$ tar -x filename/to/extract -f tarfile.tar
You can use the -O command line switch in conjunction with the above to have the file’s contents printed on stdout rather than created in the file system.
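For example, to print that same file to stdout and page through it instead of creating it in the file system (same made-up file and archive names as above):

$ tar -x filename/to/extract -f tarfile.tar -O | less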
Shell scripting for archive restoration
I’ve been restoring my archives. Basically I have a bit over 1.3TB of data that I’ve tarballed up and stashed on some disconnected SATA disks, and now that I have a computer with the capacity to hold all that data I’m resurrecting the file shares from the archived tarballs. You can see my restore script here:
restore.sh
#!/bin/bash
cd "`dirname $0`"

data_path=/var/sata2/data

# Extract each year's archive, capturing any errors from tar in output.txt
# for the fix-up scripts below to process.
tar xf $data_path/1999.tar.gz --hard-dereference > output.txt 2>&1
tar xf $data_path/2001.tar.gz --hard-dereference >> output.txt 2>&1
tar xf $data_path/2002.tar.gz --hard-dereference >> output.txt 2>&1
tar xf $data_path/2003.tar.gz --hard-dereference >> output.txt 2>&1
tar xf $data_path/2004.tar.gz --hard-dereference >> output.txt 2>&1
tar xf $data_path/2005.tar.gz --hard-dereference >> output.txt 2>&1
tar xf $data_path/2006.tar.gz --hard-dereference >> output.txt 2>&1
tar xf $data_path/2007.tar.gz --hard-dereference >> output.txt 2>&1
tar xf $data_path/2008.tar.gz --hard-dereference >> output.txt 2>&1
The restore.sh script creates an output.txt file that lists any errors from tar during the restore process. I then have a set of scripts that process this output.txt file, fixing up the two common types of error.
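To give an idea of what the scripts below are matching on, the relevant lines in output.txt look roughly like the following. The paths (and the 1911 timestamp) are made up here, and the message text is reconstructed from the patterns the awk scripts match:

tar: some/path/oldfile.txt: implausibly old time stamp 1911-01-01 00:00:00
tar: some/path/copy.dat: Cannot hard link to `some/path/original.dat': Too many links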
Fixing dates
The first error is that the date on a file in the archive isn't a reasonable value. For example, I had files reporting a modification time somewhere back in 1911, long before computers. To fix files with this problem I run the following scripts:
fix-dates
#!/bin/bash
cd "`dirname $0`";
# touch --no-create resets each listed file's timestamp to the current
# time without creating any files that happen to be missing.
./bad-date | xargs -0 touch --no-create
bad-date
#!/bin/bash
awk -f bad-date.awk < output.txt | while read line
do
    # note: both -n and \c achieve the same end.
    echo -n -e "$line\0\c"
done
bad-date.awk
{
    if ( /tar: ([^:]*): implausibly old time stamp/ )
    {
        # the path is the second colon-delimited field; strip the leading space
        split( $0, array, ":" )
        filepath = array[ 2 ]
        sub( / /, "", filepath )
        printf( "%s\n", filepath )
    }
}
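For the first of the example output.txt lines above, bad-date would emit just the offending path, NUL-terminated so that xargs -0 can hand it to touch safely even if it contains spaces; piping through tr makes that visible (made-up path again):

$ ./bad-date | tr '\0' '\n'
some/path/oldfile.txt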
Fixing hard links
The second class of error that I can receive is that the file being extracted from the archive is a hard link to an already existing file, but the hard link cannot be created because the number of links to the target has reached the file system's limit. I think I used ReiserFS as the file system the archives were originally on, and I'm using Ext4 now; Ext4 seems to have a limit on hard links per file that ReiserFS didn't. Anyway, it's no big deal, because I can just copy the target to the path that failed to link. This creates a duplicate file, but that's not a great concern. I'll try to fix up such duplicates with my pcdedupe project.
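As an aside, if you want to see how many hard links a given target already has, GNU stat will report the count (the path here is just an example):

$ stat -c %h some/path/original.dat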
fix-links
#!/bin/bash
cd "`dirname $0`";
# fix-link handles one file:link spec per invocation, hence -n 1.
./bad-link | xargs -0 -n 1 ./fix-link
bad-link
#!/bin/bash
awk -f bad-link.awk < output.txt | while read line
do
    # note: both -n and \c achieve the same end.
    echo -n -e "$line\0\c"
done
bad-link.awk
{
    if ( /tar: ([^:]*): Cannot hard link to `([^']*)': Too many links/ )
    {
        split( $0, array, ":" )

        # second colon-delimited field: the path tar failed to create
        linkpath = array[ 2 ]
        sub( / /, "", linkpath )

        # third field: the existing target; strip the message prefix and then
        # the trailing quote (substr with a start of 0 drops the last character)
        filepath = array[ 3 ]
        sub( / Cannot hard link to `/, "", filepath )
        filepath = substr( filepath, 0, length( filepath ) )

        printf( "%s:%s\n", filepath, linkpath )
    }
}
fix-link
#!/bin/bash
cd "`dirname $0`";

spec="$1"

file=`echo "$spec" | sed 's/\([^:]*\):.*/\1/'`
link=`echo "$spec" | sed 's/[^:]*:\(.*\)/\1/'`

#echo "$spec"
#echo Linking "'""$link""'" to "'""$file""'"...
#echo ""

if [ ! -f "$file" ]; then
    echo Missing "'""$file""'"...
    exit 1;
fi

cp "$file" "$link"
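Putting it together, a by-hand run of fix-link with the made-up paths from the earlier example would look like this; the part of the argument before the colon is the existing target and the part after it is the path that failed to link:

$ ./fix-link "some/path/original.dat:some/path/copy.dat"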
check-output
I then checked for anything that my scripts had missed with the following:
#!/bin/bash
cd "`dirname $0`";
cat output.txt | grep -v "Cannot hard link" | grep -v "implausibly old time"