A large part of using a shell is manipulating text strings using tools such as grep, sed and cut. Systems administrators often have to take a file with an unknown format as input and extract only the text strings they need.
This is usually fairly simple - until it’s not. The situation is complicated because certain characters in a string are not displayed literally: the terminal renders them as layout instead. The two most common are \n for a new line (a line feed) and \t for a tab.
This can complicate some operations, as a \t looks just like a run of blank spaces when viewed with less or cat, but it affects how commands like cut work.
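To make the difference concrete, here is a minimal sketch using two made-up sample strings (alpha and beta are placeholders, not data from a real file). cut splits fields on tabs by default, so the two lines behave differently even though they look alike on screen:

```shell
# One line separated by a tab, one separated by four spaces.
tabbed=$(printf 'alpha\tbeta\n' | cut -f2)
spaced=$(printf 'alpha    beta\n' | cut -f2)

printf 'tab-separated:   %s\n' "$tabbed"
printf 'space-separated: %s\n' "$spaced"
```

The first command prints beta. The second finds no tab in the line, so cut passes the whole line through unchanged.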
It is, therefore, sometimes necessary to see exactly what some text contains.
This is easily done with the command od. The od command will “dump” or display input data in a different human-readable output format. Any special characters such as \t will be printed as \t and will not get interpreted and displayed as a space.
od is very simple to use: it will accept either the standard output of a command through a pipe or input redirection, e.g.:
cat data.txt | od -tc
od -tc <data.txt
The options that we use here are:
-t
Set what the output format will be.
c
The format given to -t (-c on its own is equivalent): display printable characters and backslash escapes (e.g. \t).
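If you do not have a file to hand, a quick way to try this out is to pipe a known string into od. The sketch below uses a made-up sample string; -An is a standard od option that hides the offset column, used here only to keep the output short:

```shell
# Dump a short string that contains a tab and a newline.
# -An suppresses the offset column; -tc selects character output.
dump=$(printf 'a\tb\n' | od -An -tc)

# printf '%s' is used instead of echo so the backslashes are printed as-is.
printf '%s\n' "$dump"
```

The tab and the newline show up as \t and \n. The exact column spacing may differ slightly between od implementations.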
The following two examples are from two sitemap.xml files. The first is from Google and the second is from a WordPress site I was working on:
An extract from Google’s sitemap.xml file:
<url>
<loc>https://edu.google.com/components/</loc>
</url>
And an extract from the WordPress sitemap.xml file:
<url>
<loc>http://wordpress.example.com/2009/05/</loc>
</url>
They look similar, differing only in spacing and alignment. However, it turns out that they are formatted quite differently.
Here is what the output of od -tc on the two examples looks like:
od -tc <google-sitemap.xml
0000000 < u r l > \n
0000020 < l o c > h t t p s : / / e
0000040 d u . g o o g l e . c o m / c o
0000060 m p o n e n t s / < / l o c > \n
0000100 < / u r l > \n
0000114
od -tc <wordpress-sitemap.xml
0000000 \t < u r l > \n \t \t < l o c > h t
0000020 t p : / / w o r d p r e s s . p
0000040 i n k t u x e d o . n e t / 2 0
0000060 0 9 / 0 5 / < / l o c > \n \t < /
0000100 u r l > \n
0000105
Google has used spaces (which appear as blanks in the od output) and WordPress uses tabs (shown as \t by od).
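Once od has shown which kind of whitespace a file uses, one way to cope with both is to match on the [[:space:]] character class, which covers spaces and tabs alike. The sketch below is only an illustration: strip_indent is a hypothetical helper name, the sample lines are modelled on the extracts above, and the four-space indent on the first one is an assumption.

```shell
# Remove leading whitespace, whether it is made of spaces or tabs.
strip_indent() { sed 's/^[[:space:]]*//'; }

# One sample line indented with spaces, one indented with tabs.
from_spaces=$(printf '    <loc>https://edu.google.com/components/</loc>\n' | strip_indent)
from_tabs=$(printf '\t\t<loc>http://wordpress.example.com/2009/05/</loc>\n' | strip_indent)

printf '%s\n' "$from_spaces" "$from_tabs"
```

Both lines come out starting at <loc>, so later commands no longer need to care which style of indentation the file used.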
od will also quickly reveal whether a text file was created on a Windows machine and uses Windows-style \r\n line endings, whose extra \r causes issues on Linux machines, e.g.:
od -tc <Hello-World.txt
0000000 H e l l o W o r l d ! \r \n
0000016
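After od has confirmed that a file has Windows line endings, the carriage returns can be counted and removed with standard tools. This is a minimal sketch rather than the only way to do it; sample is a hypothetical helper that stands in for a real file such as Hello-World.txt:

```shell
# Stand-in for a file with Windows line endings.
sample() { printf 'Hello World!\r\n'; }

# Count the lines that end in a carriage return.
crlf_lines=$(sample | grep -c "$(printf '\r')\$")

# Delete every carriage return from the input.
cleaned=$(sample | tr -d '\r')

printf 'lines ending in CR: %s\n' "$crlf_lines"
printf 'cleaned text: %s\n' "$cleaned"
```

With a real file, replace the sample call with input redirection, for example tr -d '\r' <Hello-World.txt.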
The od command allows us to quickly find out exactly how a file is formatted and modify any scripts or commands to work with it.