We were in the process of migrating our customers from a private cloud to AWS EKS. During our dry-runs, we noticed that Customer A's 21GB of files transferred from a local disk in the private cloud to EFS in 32 minutes. However, Customer B, with only 12GB of files, took 2.5 hours.

This wouldn't have been an issue if we were only planning to migrate one or two customer instances per week. But our weekly deployment window is only a few hours long, and to migrate more customers within the same window, we had to do something about this bottleneck.

First, I repeated the dry-run to make sure this wasn't a one-off event. It wasn't. This time, however, I noticed that the EFS throughput was nowhere near the dedicated maximum throughput we had allocated, which was 40Mbps.

So I ran a couple of benchmarks on EFS using the fio tool. Strangely, this time the throughput actually hit the maximum.
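For reference, the benchmark was something along these lines (the mount point and job parameters here are illustrative, not our exact invocation):

# Sequential write throughput test against the EFS mount
fio --name=efs-write --directory=/data --rw=write \
    --bs=1M --size=1G --numjobs=4 --ioengine=libaio \
    --direct=1 --group_reporting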

It finally hit me, and I checked the file counts for both customers using a simple find /data | wc -l. It turned out Customer A had a total of only 10,000 files, whereas Customer B had a whopping 122,000 files!

I realized that rsync and cp are single-threaded processes, and that we'd need to parallelize the copy using something like GNU Parallel. After some googling, I came across AWS's EBS to EFS throughput guide, which suggested using fpart + cpio + GNU Parallel for optimal performance.

fpart is a tool that takes the list of all files in a directory and splits it into partitions of roughly equal size, based on file count and total file size. We could then feed each output list into its own rsync or cpio process to speed up the copy, with each worker getting a similar workload. A rough sketch of the pattern is shown below.
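Here is a minimal sketch of that idea, assuming the same /tmp/data to /data copy used later in this post (the partition count and paths are illustrative):

# Split the source tree into 8 lists of files with similar total size
cd /tmp/data
fpart -n 8 -o /tmp/partition .

# One rsync worker per list (--files-from paths are relative to the
# source dir, which is why the workers run from inside /tmp/data)
parallel -j 8 "rsync -a --files-from={} . /data/" ::: /tmp/partition.*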

However, a prebuilt fpart binary was not readily available, and I couldn't find any Docker images that included it. So I had to create our own Docker image and build fpart from source.

FROM ubuntu:18.04

# Build tools for fpart, plus the transfer utilities (rsync, cpio,
# GNU Parallel) and nload for watching network throughput
RUN apt-get update && \
    apt-get install -y build-essential autoconf rsync ca-certificates cpio parallel nload && \
    gcc --version && make --version

ADD https://github.com/martymac/fpart/archive/fpart-1.1.0.tar.gz /

# Build and install fpart from source
RUN tar -xvzf /fpart-1.1.0.tar.gz && \
    ls -lsh / && \
    cd /fpart-fpart-1.1.0/ && \
    autoreconf -i && \
    ./configure && \
    make && \
    make install && \
    PATH=$PATH:/usr/local/bin && which fpart && fpart -V

# Smoke test: fpsync the fpart source tree into /home, then clean up
RUN export THREADS=$(($(nproc --all) * 16)) && \
    echo $THREADS && \
    ls -lsh /home && \
    fpsync -n $THREADS -v -o "-achv --delete" /fpart-fpart-1.1.0/ /home/ && \
    ls -lsh /home && \
    rm -rf /home/* && \
    ls -lsh /home

CMD ["/bin/sh"]
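In our case the copy ran inside the cluster, but for local experimentation you'd build the image and run it with the source and EFS mounts attached, something like this (the image name and mount paths are illustrative):

docker build -t fpart-tools .
docker run --rm -it \
    -v /tmp/data:/tmp/data:ro \
    -v /mnt/efs:/data \
    fpart-tools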

I couldn't get the exact command mentioned in the AWS guide to work, even after hours of trial and error. We finally settled on using just fpsync (fpart driving parallel rsync workers) for simplicity. fpsync is part of the fpart package built in the Docker image above.

# Use 16 fpsync workers per CPU core
export THREADS=$(($(nproc --all) * 16))
echo $THREADS
fpsync -n $THREADS -v -o "-achv --delete" /tmp/data /data
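Here, -n sets the number of concurrent sync workers, -v enables verbose output, and -o passes the quoted options through to each underlying rsync invocation. The 16-workers-per-core multiplier was an empirical choice on our part; since each worker spends most of its time waiting on network round-trips to EFS rather than using the CPU, oversubscribing the cores this aggressively paid off.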

Even with this suboptimal solution, we managed to reduce the EFS copying time from 2.5 hours to just 20 minutes.

Ideally, we should move these files to S3 and serve them from there, but that requires application-level changes, and we don't have that kind of bandwidth right now.