Using od To See How Text Is Formatted

A large part of working in a shell is manipulating text with tools such as grep, sed and cut. System administrators often have to take a file in an unfamiliar format and extract only the strings they need.

This is usually fairly simple - until it’s not. The complication is that terminals and tools such as cat and less render certain characters instead of showing them literally. The two most common are the newline, written \n, which ends a line, and the tab, written \t, which moves the cursor to the next tab stop.

This matters because a \t looks just like a run of blank spaces when viewed with less or cat (how many depends on the tab stops, typically every eight columns), yet it changes how commands like cut behave, since cut splits fields on tabs by default.
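As a small illustration of the point above, the sketch below feeds cut two made-up lines that look alike on screen. The sample strings and the variable names tabbed and spaced are assumptions for the example, not taken from a real file.

```shell
# Two lines that look alike on screen: one uses a tab, one uses four spaces.
tabbed=$(printf 'alpha\tbeta\n' | cut -f2)
spaced=$(printf 'alpha    beta\n' | cut -f2)

# cut finds the tab and returns the second field.
echo "tab-separated:   $tabbed"
# With no tab on the line, cut passes the whole line through unchanged.
echo "space-separated: $spaced"
```

The second result is the kind of surprise that makes it worth checking what the whitespace in a file really is.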

It is, therefore, sometimes necessary to see exactly what some text contains.

This is easily done with the command od. The od command will “dump” its input, displaying it in a different human-readable format. Any special characters such as \t will be printed as \t and will not be interpreted and displayed as blank space.

od is very simple to use. It will read either from a pipe or from redirected input (it will also accept a file name as an argument), e.g.:

cat data.txt | od -tc
od -tc <data.txt

The options that we use here are:

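-t Select the output format; the letter that follows it names the type.
c The type passed to -t: show printable characters as themselves and special characters as backslash escapes (e.g. \t). od -c is an equivalent shorthand.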
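To try these options without a file to hand, a short string can be piped straight into od. This is only a sketch: the sample string and the variable name dump are made up, and the -An option (which hides the offset column) is an addition not used elsewhere in this article.

```shell
# Dump a short string that contains a tab and ends with a newline.
# -An suppresses the offset column so only the characters are shown.
dump=$(printf 'a\tb\n' | od -An -tc)
echo "$dump"
```

The tab and the newline should appear as \t and \n rather than as blank space and a line break.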

The following two examples are from two sitemap.xml files. The first is from Google and the second is from a WordPress site I was working on:

An extract from Google’s sitemap.xml file:

    <url>
        <loc>https://edu.google.com/components/</loc>
    </url>

And an extract from the WordPress sitemap.xml file:

        <url>
                <loc>http://wordpress.pinktuxedo.net/2009/05/</loc>
        </url>

The two look similar, differing only in how far each line is indented. However, it turns out that the indentation itself is made of different characters.

Here is what the output of od -tc on the two examples looks like:

od -tc <google-sitemap.xml
0000000                   <   u   r   l   >  \n
0000020           <   l   o   c   >   h   t   t   p   s   :   /   /   e
0000040   d   u   .   g   o   o   g   l   e   .   c   o   m   /   c   o
0000060   m   p   o   n   e   n   t   s   /   <   /   l   o   c   >  \n
0000100                   <   /   u   r   l   >  \n
0000114

od -tc <wordpress-sitemap.xml
0000000  \t   <   u   r   l   >  \n  \t  \t   <   l   o   c   >   h   t
0000020   t   p   :   /   /   w   o   r   d   p   r   e   s   s   .   p
0000040   i   n   k   t   u   x   e   d   o   .   n   e   t   /   2   0
0000060   0   9   /   0   5   /   <   /   l   o   c   >  \n  \t   <   /
0000100   u   r   l   >  \n
0000105

Google indents with space characters (which od shows as blank columns), while WordPress indents with tabs (which od shows as \t).
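Once the difference is known, a command can be written to cope with either style. The following is a minimal sketch, assuming each <loc> element sits on a line of its own; the two sample lines are made up for the example and the sed expression is just one possible approach.

```shell
# Hypothetical sample: the first line is indented with spaces, the second with tabs.
sample=$(printf '    <loc>https://edu.google.com/components/</loc>\n\t\t<loc>http://wordpress.example.com/2009/05/</loc>\n')

# [[:space:]]* matches any run of spaces or tabs, so both indentation styles work.
urls=$(printf '%s\n' "$sample" | sed -n 's|^[[:space:]]*<loc>\(.*\)</loc>[[:space:]]*$|\1|p')
echo "$urls"
```

Matching on [[:space:]] rather than on a literal space or tab means the command does not need to know which kind of indentation a given file uses.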

od will also quickly reveal whether a text file was created on a Windows machine and uses Windows-style \r\n line endings, where a carriage return (\r) precedes every newline. The stray \r can cause issues on Linux machines, e.g.:

od -tc <Hello-World.txt
0000000   H   e   l   l   o       W   o   r   l   d   !  \r  \n
0000016
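If od does show carriage returns, one common way to remove them is tr. The sketch below assumes that every \r in the input is unwanted; tr is not otherwise covered in this article, and the variable names before and after are made up for the example.

```shell
# Count the bytes in a line with a Windows line ending (\r\n).
before=$(printf 'Hello World!\r\n' | wc -c)

# Delete every carriage return with tr, then count again.
after=$(printf 'Hello World!\r\n' | tr -d '\r' | wc -c)
echo "before: $before bytes, after: $after bytes"

# Check the cleaned text with od: the \r should no longer appear.
printf 'Hello World!\r\n' | tr -d '\r' | od -tc
```

Running od on the result again is a quick way to confirm that only the \n remains at the end of the line.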

The od command allows us to quickly find out exactly how a file is formatted and modify any scripts or commands to work with it.