My Blog

I think we are hitting hardware limits in the field of deep learning

The popularity of deep learning frameworks has grown exponentially, but sadly the same is not true of hardware. Experts say we are hitting the end of Moore's law. Although Nvidia and AMD push their fabrication limits every year, we are seeing raw performance plateau.

What is the reason? The answer is the sheer amount of data we generate. Every person with a smart device creates a lot of data, and we need to train increasingly complex models to capture its variations. There has been a huge increase in the number of layers in popular deep learning models like VGG and Inception, and we can't expect the next model to be any smaller than the existing ones.
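To get a feel for how big these popular architectures are, here is a minimal sketch that counts the convolutional/linear layers and parameters of VGG-16 and Inception v3. It assumes a recent torchvision (0.13+) is installed; exact numbers depend on the library version.

```python
# Rough sketch: compare the size of two popular architectures.
# Assumes torchvision >= 0.13; layer/parameter counts may vary by version.
import torch.nn as nn
import torchvision.models as models

def summarize(name, model):
    n_layers = sum(1 for m in model.modules()
                   if isinstance(m, (nn.Conv2d, nn.Linear)))
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_layers} conv/linear layers, {n_params / 1e6:.1f}M parameters")

summarize("VGG-16", models.vgg16(weights=None))
summarize("Inception v3", models.inception_v3(weights=None, init_weights=True))
```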

Why is AMD failing to enter the deep learning market?

Personally, I love AMD. It is one of those companies that promotes freedom for the user. Last year we saw AMD surpass Intel in benchmarks. AMD's main advantage is its ability to make chips on a 7 nm process; Intel couldn't achieve the same. So why hasn't AMD been able to enter the deep learning market? The reason is software support: Nvidia has a great advantage thanks to cuDNN and TensorFlow's library support. Although AMD came up with ROCm, it hasn't been able to penetrate the market. Maybe we can still expect AMD to rise, as happened against Intel in the CPU market. Let's wait and see.

AMD boasts that its MI125 data center GPU yields more FLOPS per watt than Nvidia's offerings. The MI125 uses a passive cooling system, which in itself suggests how much headroom AMD is holding back. They almost killed Intel with Zen 2 CPUs. I am eagerly waiting to see what happens next.
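For context, FLOPS per watt is just peak throughput divided by board power. Here is a toy sketch of the calculation; the spec values are purely hypothetical placeholders, not real numbers for the MI125 or any Nvidia card.

```python
# Toy calculation of FLOPS per watt.
# The inputs below are hypothetical placeholders, not real GPU specs.
def flops_per_watt(peak_tflops, board_power_watts):
    return peak_tflops * 1e12 / board_power_watts

print(flops_per_watt(peak_tflops=25.0, board_power_watts=300))  # about 8.3e10 FLOPS/W
```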

Deep learning is a field with intense computational requirements, and your choice of GPU will fundamentally determine your deep learning experience. But which features matter if you want to buy a new GPU? GPU RAM, cores, tensor cores? How do you make a cost-efficient choice? This blog post will delve into these questions and give you advice that will help you make the choice that is right for you.
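If you already have a card and want to see what you are working with, here is a minimal sketch, assuming a PyTorch build with GPU support (CUDA or ROCm), that prints the memory and compute-unit count the framework reports.

```python
# Print basic properties of the first visible GPU.
# Assumes a PyTorch build with GPU support (CUDA or ROCm).
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"Name:               {props.name}")
    print(f"Memory:             {props.total_memory / 1024**3:.1f} GiB")
    print(f"Multiprocessors:    {props.multi_processor_count}")
    print(f"Compute capability: {props.major}.{props.minor}")
else:
    print("No GPU visible to PyTorch.")
```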

AMD: Powerful But Lacking Support

HIP via ROCm unifies NVIDIA and AMD GPUs under a common programming language, which is compiled into the respective GPU language before it is compiled to GPU assembly. If all our GPU code were in HIP, that would be a major milestone, but porting the TensorFlow and PyTorch code bases is difficult. TensorFlow and PyTorch have some support for AMD GPUs, and all major networks can be run on them, but if you want to develop new networks, some details might be missing, which could prevent you from implementing what you need. The ROCm community is also not very large, so it is not straightforward to get issues fixed quickly. AMD invests little in its deep learning software, and as such one cannot expect the software gap between NVIDIA and AMD to close.
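In practice, the appeal of the ROCm route is that existing framework code mostly runs unchanged. As a minimal sketch, assuming a ROCm build of PyTorch (which exposes the familiar torch.cuda interface), the same script runs on either vendor's GPU, or falls back to the CPU.

```python
# Sketch: the same PyTorch code runs on an NVIDIA (CUDA) or AMD (ROCm) build,
# because the ROCm build exposes the familiar torch.cuda interface.
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 10)).to(device)
x = torch.randn(32, 256, device=device)
print(model(x).shape)  # torch.Size([32, 10]) on either vendor's GPU, or on CPU
```

The gaps tend to show up only when you step outside the common operators, which is exactly the risk described above.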