tr

I learned about the ‘tr’ Unix command today. It translates characters in a stream. The particular example was:

  echo | tr '\012' '\001'

I didn’t really understand what that did at first, but now I do. The ‘echo’ part writes a single newline character, which is octal 012. tr then reads that newline from its input stream and applies its rule, translating \012 (newline) to \001 (Ctrl+A). So it’s just a way of getting a Ctrl+A character into a stream. Note the backslashes: without them tr would treat ‘012’ and ‘001’ as the literal characters 0, 1 and 2 rather than as octal escapes. If you use Ctrl+A as your regular expression delimiter you’re unlikely to have a collision in the expression itself.
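A quick way to convince yourself of this is to dump the byte that comes out and then try it as a sed delimiter. This is only a sketch: it assumes the octal escapes are written with backslashes, and the variable names and the sample path are made up for illustration.

```shell
# Turn the newline written by echo into Ctrl+A and show the byte in octal.
ctrl_a_octal=$(echo | tr '\012' '\001' | od -An -to1 | tr -d ' \n')
echo "$ctrl_a_octal"

# Capture the Ctrl+A byte itself and use it as the delimiter of a sed
# substitution, so the slashes in the pattern need no escaping.
d=$(echo | tr '\012' '\001')
result=$(printf '/usr/local\n' | sed "s${d}/${d}:${d}g")
echo "$result"
```

The first echo should show the octal value of the single byte produced, and the second shows the path with every slash replaced by a colon.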

FS variable in awk

I was reading about environment variables and I also found this article, Internal Variables, that describes the variables used by bash. In reading that I learned about the awk FS variable, which controls field splitting. See page 146 of sed & awk by Dougherty and Robbins for details, but basically you can set FS to a single character to have lines split into fields based on that character, or you can specify a regular expression such as “\t+” (one or more tabs separate fields) or “[,;]” (a single comma or semicolon will separate fields).
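Here is a small sketch of both regular-expression forms of FS, run from the shell. The input strings and variable names are invented for the example.

```shell
# One or more tabs separate fields, so the doubled tab yields no empty field.
tab_field=$(printf 'a\t\tb\tc\n' | awk 'BEGIN { FS = "\t+" } { print $2 }')

# A single comma or semicolon separates fields.
semi_field=$(printf 'x,y;z\n' | awk 'BEGIN { FS = "[,;]" } { print $3 }')

echo "$tab_field $semi_field"   # prints "b z"
```

Setting FS in a BEGIN block means it is in effect before the first line is read.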

Environment Variables and Secure Programming for Linux

I read the Environment Variables section of Secure Programming for Linux and Unix HOWTO and learned about the IFS environment variable.

I also read CS 15-392 Secure Programming – Environment Variables.

The IFS environment variable is the “internal field separator”, and it is typically space, tab, newline, i.e. the white space used to separate fields. In bash you can unset the IFS variable, in which case the shell behaves as if it were “ \t\n”, or you can set it explicitly to that value. So that explains why I found a script that unset the IFS variable: it’s a secure programming practice.
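To see why a script might not want to trust an inherited IFS, here is a minimal sketch for a POSIX-style shell such as bash. The colon value stands in for a hypothetical hostile setting, and the variable names are made up.

```shell
line='alpha beta gamma:delta'

# Pretend the script inherited an unexpected IFS from its environment.
IFS=':'
set -- $line            # unquoted on purpose, so the shell splits on IFS
hostile_count=$#        # 2 fields: "alpha beta gamma" and "delta"

# Unsetting IFS makes the shell split on space, tab and newline again.
unset IFS
set -- $line
default_count=$#        # 3 fields: "alpha", "beta" and "gamma:delta"

echo "$hostile_count $default_count"
```

The same string splits into a different number of words depending on IFS, which is the kind of surprise the unset guards against.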

File names on Windows

I’ve been reading up on file names in Windows because I’m having a problem with my C++ code processing a file with an odd TM (trademark) character in the file name. I’m not sure why, but it seems that file names returned by the POSIX readdir function don’t necessarily exist when they are then passed to ifstream::open, or some weird character encoding thing is going on. Hopefully I’ll get to the bottom of it. It’s a complete fluke that I actually had a file this failed on available and tested with it. Lucky, I guess.

I did some more reading and discovered that there are ‘wide character’ versions of the file functions for Windows that use UTF-16 encoded strings rather than ‘code page’ encoded strings, I guess. Anyway, I don’t think I’m going to bother with such things. If a file can’t be opened because it has a weird character in its name then I’ll just fail with an error message and the user can look at fixing it. This program is only being “developed and tested” on Windows. There are no plans to actually run it on Windows; it will run on Linux, which won’t have this weird character encoding issue.

The difference between delete and delete[] in C++

I just learned about the difference between delete and delete[] in C++ by reading this article from StackOverflow. Basically you use delete[] to delete arrays allocated with new[], and delete for everything else allocated with new. There was conflicting information about whether C++ runs the destructors on objects in an array when the array is deleted. Some people said it did, others said it didn’t. As far as the language is concerned, delete[] does call the destructor of every element, and using plain delete on an array is undefined behaviour, but I should still do an experiment to see one day. The reason for the difference between delete and delete[] seems to be that, in typical implementations, when C++ allocates an array it also allocates memory for storing the size of the array, and then returns a pointer to the first array element, which is beyond the start of the allocated memory because the array size takes up the first bit of space.

Pcdedupe

I’m working on a new ProgClub project called pcdedupe. It’s a file system de-duplicator and it’s a C++ system based on rdfind. I haven’t created the project page on the wiki yet, but the source code is available.

Basically I’m going to take a new angle on the rdfind software and tailor it to suit my particular environment (I have ten million files with massive duplication and rdfind isn’t optimised for that kind of scale).