A few days ago, our Google Coral Edge TPU dropped into the mailbox: a USB accelerator that runs TensorFlow Lite models at pretty impressive speeds.
First, let's talk a little about Edge AI, and why we want it.
What is it?
The name 'Edge AI' pretty much says it all: it's about running Artificial Intelligence on the 'edge', which simply means running inferences locally, without needing a connection to a powerful server. 'Locally' here usually means a portable (often battery-powered) device.
Why would we want this?
Many AI applications run in the cloud, which lets you run heavy models on heavy data. But that means you always have to send your data through an internet connection, wait for the server to do the heavy lifting, and get the results sent back. This can be pretty quick on a strong wired connection, but can take from hundreds of milliseconds up to a few seconds on bad connections or devices with not-so-potent networking capabilities.
Also, we humans like things that feel instant, which is commonly said to be anything that reacts in under 100 ms (a threshold found in many usability studies that are easy to find on the internet). 100 ms seems like a long time in computing terms, but once you add a bit of input and output latency and some computing latency, the spare milliseconds diminish quickly; add in a bad connection and your 100 ms budget starts to look a lot like trying to buy a real house with toy-house money.
Edge AI is the answer to this problem: by running a lightweight model locally in just a few milliseconds, you can use the results to drive actuators or show output that feels instantaneous. Control systems that need a high refresh rate to work properly also become a possibility.
This last paragraph turned out to be longer than I expected, but I'll keep it this way because I feel like it is still a summary of what should be said.
What is the Coral?
The Coral USB Accelerator is a USB-connected hardware accelerator containing the new Google Edge TPU (Tensor Processing Unit): a small ASIC (Application-Specific Integrated Circuit) that packs huge performance on TensorFlow Lite models (100+ fps on MobileNet V2 SSD) for very little power (Google only specifies the need for a 500 mA, 5 V USB port, so that would be a maximum of 2.5 W).
Why is it fast?
ASICs are pieces of electronics designed to do one exactly specified task. So unlike a CPU, which can do basically anything, they can do nothing but the exact thing they were designed for. But where a CPU typically needs around 6 clock cycles to get a multiplication done, an ASIC designed to multiply can do it in much less than one clock cycle, since the engineers could predict exactly what happens every time data flows through it. The only time it takes is the propagation delay between input and output along the datapath with the largest transistor count, which falls into the sub-picosecond range for a decent FET, and I'm sure the Edge TPU is full of those.
Now, if you're not following the talk about propagation delays, FETs and clock cycles, I suggest looking those terms up and reading into the basics of how CPUs and digital logic work; explaining it here would make this post a 50-hour read.
Back of the envelope calculations
Let's say the added propagation delay in the Edge TPU is 10 ps; that means data points flow through at 100 GHz. That is not like a 100 GHz CPU clock, but it would mean 100,000,000,000 multiplications done every second. My guess on what the Edge TPU does, in broad lines: parallel multiplications, with a pyramid-like addition architecture on the outputs of the multiplications.
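To make that guess of "parallel multiplies feeding a pyramid of adders" a bit more concrete, here is a toy Python sketch. This is purely illustrative: the real Edge TPU datapath is not public, and the function name and 10 ps figure are my own assumptions from the estimate above.

```python
def mac_tree(weights, inputs):
    """Toy model of a hardware multiply-accumulate unit: all
    multiplications happen 'in parallel', then the products are
    summed pairwise in a pyramid (an adder tree)."""
    products = [w * x for w, x in zip(weights, inputs)]  # parallel multipliers
    while len(products) > 1:  # each iteration models one layer of adders
        products = [sum(products[i:i + 2]) for i in range(0, len(products), 2)]
    return products[0]

# Back-of-the-envelope throughput for the assumed 10 ps delay:
delay_s = 10e-12                # assumed propagation delay of one multiply
mults_per_second = 1 / delay_s  # 1e11, i.e. 100 billion multiplications/s

print(mac_tree([1, 2, 3, 4], [5, 6, 7, 8]))  # → 70, same as the dot product
```

Note that the adder tree finishes in log2(n) layers instead of n sequential additions, which is exactly why this kind of fixed-function datapath beats a general-purpose CPU loop.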
The Coral is supported on systems running Debian-based Linux, which means Raspbian is supported! Unfortunately even the latest Pi 3B+ doesn't have a USB 3.0 port, so it won't exploit the full potential of the Coral, but I think it will still be pretty awesome!
I've heard some people are also running Jessie on a ZedBoard (Zynq-7000)... Just imagine running a dual-core Cortex-A9, half a million logic cells, and the Coral Edge TPU in parallel. I don't think the world is ready to see the insane things a nicely worked-out project could do on that setup...
Ok, enough with the blabber, let's get down to business.
It's Google, it's powerful, it's cool, and of course it's easy to use. If you have a running Raspberry Pi and a Coral USB Accelerator, click this link: Get-Started Guide, follow it, and within 5 minutes you'll be running the nice demos on the Edge TPU.
python3 demo/object_detection.py \
--model test_data/mobilenet_ssd_v2_face_quant_postprocess_edgetpu.tflite \
--input test_data/face.jpg
And BAM! You're detecting faces in almost no time!
Quick performance test
I modified the original classifier demo to do exactly the same thing, but 250 times in a row with a 224x224 image of a hummingbird, and timed it.
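My modification boils down to wrapping the classification call in a timing loop, roughly like this sketch (the helper name is mine, and the classify call in the usage comment is a hypothetical placeholder for whatever inference function the demo uses):

```python
import time

def benchmark(fn, runs=250):
    """Call fn() `runs` times and return total wall-clock seconds."""
    start = time.monotonic()  # monotonic clock: immune to system time changes
    for _ in range(runs):
        fn()
    return time.monotonic() - start

# Usage sketch (hypothetical inference call):
# total = benchmark(lambda: engine.classify(image))
# print('----- Time was', total, '-----')
```

`time.monotonic()` is the safer choice over `time.time()` for measuring durations, since it can't jump backwards.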
----- Time was 3.719806671142578 ----- # On the Edge TPU
----- Time was 62.72629117965698 ----- # On the embedded CPU of the Pi
Apparently, it is a 'Violet Sabrewing' btw ;-)
So on the Pi, that is a speedup of almost 17x! It also brings us to 60-70 fps of classification. I don't know about you, but I find this incredibly satisfying.
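The speedup and frame rate follow directly from the two reported timings:

```python
runs = 250                        # inferences per timed run
tpu_seconds = 3.719806671142578   # measured on the Edge TPU
cpu_seconds = 62.72629117965698   # measured on the Pi's embedded CPU

speedup = cpu_seconds / tpu_seconds  # roughly 16.9x
fps = runs / tpu_seconds             # roughly 67 classifications per second
```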
I know a lot of you would like to see a comparison between the Coral and a decent GPU, but that would take more time than I have available, and it would also not be a fair comparison. A GTX 1080 takes up to 180 W, which is huge compared to the 2.5 W of the Coral, and a GTX 1080 gets its data through a high-speed PCIe bus, compared to a USB 2.0 port for the Coral. All that said, I'm not sure a GTX 1080 would really be faster than this, so maybe, one day, I'll compare them anyway ;-)
This is where I finish this story. The least I can say is that I'm positively impressed by the Coral USB Accelerator! I will most certainly keep testing it and use it in a few projects. Possibly I'll also write more about my adventures in Edge AI in the future, so keep checking back!