Copying lots of small files from one Linux host to another can take a long time, much longer than the data alone should take to copy.
Whenever I've had to copy lots of files, I've wondered what the fastest way to do it actually is.
Let’s find out!
The Setup
To start these tests I created 100K 64-byte files of random data, all contained in a directory called 100k.
This directory was 825M on disk.
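For reference, a test set of that shape could be generated with something like the following. This is a rough sketch, not necessarily the exact commands used, and the file_$i names are purely illustrative:
mkdir 100k
for i in $(seq 1 100000); do head -c 64 /dev/urandom > "100k/file_$i"; done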
The two Linux systems were two Debian 10 DigitalOcean droplets connected over a VLAN.
The Results
First, I simply copied the directory using rsync and scp.
The two commands were:
rsync -a 100k joe@10.1.0.1:~/
scp -r 100k joe@10.1.0.1:~/
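The format of the timings below suggests the shell's time builtin; a sketch of how such a measurement could be reproduced, using the same commands prefixed with time:
time rsync -a 100k joe@10.1.0.1:~/
time scp -r 100k joe@10.1.0.1:~/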
The results were:
| Command | Time |
|---|---|
| rsync | 0m21.064s |
| scp | 3m56.392s |
rsync was around 11 times faster. Dump scp if you are still using it out of habit.
Can we do better?
What if we make a tar archive out of the files?
tar -cf 100kfiles.tar 100k
The resultant archive 100kfiles.tar was 152M. This is because the smallest filesystem block size is 4 KiB, so every file, even though it holds just 64 bytes of data, takes up 4 KiB of space on disk. The archive only includes the data the files actually contain.
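You can see the difference between on-disk usage and actual data size with du, which reports block usage by default and the logical size with --apparent-size (assuming GNU coreutils):
du -sh 100k                  # size on disk, rounded up to filesystem blocks
du -sh --apparent-size 100k  # sum of what the files actually contain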
Copying these produced the following results:
| Command | Time |
|---|---|
| rsync | 0m1.182s |
| scp | 0m1.271s |
If you’ve got the space to create a local archive then do this before making the copy.
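A minimal sketch of that workflow, assuming the same host, user, and archive name as above, and remembering that the archive still needs to be unpacked on the remote side:
tar -cf 100kfiles.tar 100k
rsync -a 100kfiles.tar joe@10.1.0.1:~/
ssh joe@10.1.0.1 "tar -xf 100kfiles.tar"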
What if you don’t have the local space?
No problem. Use tar to dump the archive directly into SSH and un-tar it at the other end. This gives you the benefits of using an archive without needing any local space.
This is the command I used:
tar -cf - 100k | ssh joe@10.1.0.1 "tar -xf - -C /destination/"
This method gives us the winner at 0m8.542s. Unlike the roughly one-second copies above, that time includes creating the archive and extracting it on the remote end, not just transferring an archive that already exists.
Note, the - in the first tar command instructs tar to send its output to stdout instead of writing it to an archive file.
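The same pattern works in reverse if you ever need to pull files from the remote host instead; a quick sketch using the same hypothetical host and directory names:
ssh joe@10.1.0.1 "tar -cf - 100k" | tar -xf -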