Updated: May 8, 2019
Time & Location
Apr 18, 6:30 PM – 8:00 PM
Santa Clara, CA, USA
Robin.ly held its first small-group member meetup on April 18 to discuss the trend and challenges of computer architecture in the AI era. Member engineers and researchers from relevant fields and companies, such as NVIDIA, AMD, and Google, participated in an engaging roundtable discussion and mingled with peers. Below are some highlights from the background presentation and member discussion.
Facilitator: Chulian Zhang, Compute Architect @ NVIDIA
The single-thread performance of microprocessors grew exponentially starting in the 1970s. Around 2010, that growth slowed dramatically with the slowing of Moore's Law and the end of Dennard scaling.
When the growth stops, we need to build better hardware and domain-specific architectures to meet the growing demand for computing power. The GPU is a good example of a domain-specific architecture: while single-threaded CPU performance grew roughly 1.5x per year until around 2010 and then flattened, GPU-accelerated computing has continued to grow rapidly.
As deep learning becomes more popular, almost every major company is building a DL accelerator. Google's TPU (Tensor Processing Unit) is an important example. The TPU is designed for neural network processing, and its key component is the matrix unit (MXU), a processing unit for matrix-matrix multiplication. Most of the computation in neural networks eventually comes down to matrix-matrix multiplication, so most of the die area within a TPU is allocated to the MXU.
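To illustrate why matrix multiplication dominates, note that a fully connected layer is just a matrix-matrix product plus a bias and a nonlinearity. A minimal NumPy sketch (the layer sizes below are arbitrary, for illustration only):

```python
import numpy as np

# A fully connected layer: outputs = activation(inputs @ weights + bias).
# Virtually all of the arithmetic here is the matrix-matrix multiply,
# which is exactly the operation the TPU's matrix unit accelerates.
def dense_layer(inputs, weights, bias):
    return np.maximum(inputs @ weights + bias, 0.0)  # ReLU activation

batch, in_dim, out_dim = 32, 256, 128  # arbitrary example sizes
x = np.random.randn(batch, in_dim)
w = np.random.randn(in_dim, out_dim)
b = np.zeros(out_dim)

y = dense_layer(x, w, b)
print(y.shape)  # (32, 128)
```

Convolutions, attention, and recurrent layers similarly reduce to batched matrix multiplies, which is why a single well-fed matrix unit can serve most of a network's compute.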
Roundtable Discussion Highlights
1. What kind of hardware environment should we use for machine learning? Shall we use the cloud, requesting GPUs and specifying the configuration, or build our own IT infrastructure with dedicated compute and storage resources?
Cloud can be very expensive. To run anything on a TPU, it costs around $800 just to get started, with the total fee based on the number of hours, and TPUs are really focused on providing raw compute power. The main problem is that the TPU's baseline cost is much higher: your demand for teraflops has to be high enough before a TPU is even worth trying, because TPU resources are allocated differently from GPU resources.
For personal or startup experimentation, the cheapest way is to buy an inexpensive GPU and install it in your own computer. For more compute power, you can go to AWS or Google Cloud.
For example, if one training run of your neural network takes four days, the cloud cost of those four days can already buy you a GPU. If you are experimenting and learning, go for a GTX 1060 or an RTX 2070. The 2070 is the best balance; the 2060 or 1060 are on the cheaper side but still fine to experiment with. Those options are really nice to have in the beginning, and as you grow you can look at others. In a typical month, about half of my model training runs locally and the rest runs in the cloud.
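The buy-versus-rent decision above can be framed as a simple break-even calculation. A hedged sketch with illustrative numbers (the hourly rate and card price below are assumptions for the example, not real quotes):

```python
# Break-even point between renting a cloud GPU and buying your own.
# Both prices below are illustrative assumptions, not actual quotes.
CLOUD_RATE_PER_HOUR = 0.90  # assumed cloud GPU rental price, $/hour
LOCAL_GPU_COST = 500.00     # assumed one-time price of a consumer card

def breakeven_hours(cloud_rate, gpu_cost):
    """Hours of training after which buying beats renting."""
    return gpu_cost / cloud_rate

hours = breakeven_hours(CLOUD_RATE_PER_HOUR, LOCAL_GPU_COST)
print(f"Buying pays off after about {hours:.0f} GPU-hours")
```

Under these assumed prices, a single 4-day (96-hour) training run already covers a meaningful fraction of the card's cost, which matches the speaker's rule of thumb.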
You can also access free TPU resources through Google Colab. It works like a Python Jupyter notebook: you can start training in the browser with access to TPUs and NVIDIA T4 GPUs. The only problem with Colab is that if you have your own data set, you have to set it up every time; and if you connect it to Google Drive and pull all of your data into Colab before running, it can be slow. But it is a very good option for beginners.
2. In the 5G IoT era, are GPUs, FPGAs, and ASICs capable of lightweight use on IoT edge devices?
There are a couple of solutions in the market right now. I think an ASIC is potentially a good solution when you can put an optimized ASIC into a particular IoT application. But the problem is that machine learning algorithms themselves are still evolving really fast, which puts GPUs and FPGAs in a much better position: you can build a very optimized ASIC design, and 1.5 years after tape-out your algorithm is already outdated. The industry will probably converge on particular algorithms eventually, and then a lot of use cases will shift back to ASICs.
As the algorithms evolve, ASICs also evolve. It's not as if an ASIC is designed, implemented, and then stays there forever. Every year or half year, a new generation of IoT devices comes out capable of executing more advanced algorithms. In that sense ASICs and FPGAs are not so different, because they both just carry out what people tell them to do.
An FPGA is something you can keep developing on; you can reprogram it at any time. You get flexibility, but not as good performance. An ASIC, on the other hand, is optimized for a specific purpose, yet you can still release improved versions and keep iterating. The combination of the two is the trend now. SiFive, an Intel-backed chip startup, recently acquired a company called Open-Silicon, whose specialized ASICs embed programmable logic like an eFPGA, giving them the ability to recompile and support new algorithms.
3. DL accelerators are split between training and inference. Do you think they are going to diverge or converge?
My understanding is that they won't converge, because the energy-efficiency requirements and optimization targets are so different that they will end up as two markets. Companies in these two markets already face different requirements. Take NVIDIA as an example: they are doing well in data-center training, and even in some data-center inference. But when you come down to edge inference, there are just so many competitors in the market right now. So I think they will keep diverging, especially considering how specialized edge applications are. I think a lot of companies other than NVIDIA or Google can survive by supplying a very niche market and diving deep vertically.
From an architectural perspective, I think there's a possibility that they can converge, because they are still solving similar problems. But for a given architecture, they can have different implementations to make it work.
I think for inference we probably care more about efficiency. Training usually happens in a data center, but inference usually happens on edge devices, so the architecture is designed to use fewer bits. That's the reason for having two different approaches.
I think these really are two different markets with different goals. For training, what you care about is throughput, as much as possible. For inference, it's about latency and power. So when you design the architecture, you first think about the final goal and then shape the architecture accordingly. For example, for training in the data center, you don't care about the latency of a single network; you only care about how many networks you can train in an hour. For inference on an edge device, you care about how much time one inference takes and how much power it burns. As a result, I think they will stay separate.
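"Using fewer bits" for inference usually means quantization: weights trained in float32 are mapped to low-precision integers before deployment on an edge device. A minimal NumPy sketch of symmetric per-tensor int8 quantization (the textbook scheme, not any specific accelerator's implementation):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization of float32 weights."""
    scale = np.abs(w).max() / 127.0  # map the largest |weight| to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# int8 storage is 4x smaller than float32, at the cost of a small
# per-weight rounding error bounded by half the quantization step.
print("max abs error:", np.abs(w - w_hat).max())
```

Shrinking weights and activations this way cuts memory traffic and lets the multiply-accumulate units be smaller and lower-power, which is exactly the latency-and-power goal described above for edge inference.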
4. What kinds of accelerators does Google develop in house?
Within the TensorFlow ecosystem itself, Google has many accelerators, such as the TPU and Edge TPU. There are also many in-house developments that are not open sourced.
Google recently released an open-source project called MLIR (Multi-Level Intermediate Representation). TensorFlow represents models as a graph, and it is not efficient to build a separate compiler to lower that graph to every different backend. MLIR serves as an intermediate language, a bridge between TensorFlow and specialized learning accelerators, TPUs, and other backends.
The fast development process was driven by the lead engineer Chris Lattner, who designed LLVM and the Swift programming language.
View our previous events and download slides here.
Robin.ly is a content platform dedicated to helping engineers and researchers develop leadership, entrepreneurship, and AI insights to scale their impacts in the new tech era.
Sign up with us to stay updated and access exclusive event, career, and business mentorship opportunities.