2021 | Google Board Chairman John Hennessy: AI Progress Is Slowing and We Are in a Semiconductor Industry Winter | TMTPost T-EDGE (Part 5)


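The second is a hardware-centric approach. Can we change the way we design hardware architectures to make them more efficient? This approach is called domain-specific architectures or domain-specific accelerators. The idea is to have the hardware do one particular task and to optimize it extremely well. We have already seen examples of this in graphics processing and in the modem inside your cell phone. These use intensive computational techniques and are not general purpose: they are not designed to do arbitrary computations, only the set of operations that graphics or a modem needs.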
And then, of course, some combination of these. Can we come up with languages that match these new domain-specific architectures? Domain-specific languages can improve efficiency and let us code a range of applications very effectively.
This is a fascinating slide from a paper by Charles Leiserson and his colleagues at MIT, published in Science and titled "There's Plenty of Room at the Top."
What they observe is that software inefficiency, and the inefficiency of matching software to hardware, means we have lots of opportunity to improve performance. They took an admittedly very simple program, matrix multiply, written initially in Python, and ran it on an 18-core Intel processor. Simply by rewriting the code from Python to C, they got a factor of 47 in improvement. Then introducing parallel loops gave them another factor of approximately eight.
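As a rough illustration of the kind of baseline being described, a naive matrix multiply in pure Python might look like the sketch below. This is not the code from the paper; the function name `matmul_naive` and the list-of-rows representation are assumptions made for illustration.

```python
def matmul_naive(a, b):
    """Multiply two matrices stored as lists of rows, using three nested loops."""
    n, m = len(a), len(b)
    p = len(b[0])
    # Every row of `a` must have as many entries as `b` has rows.
    if any(len(row) != m for row in a):
        raise ValueError("inner dimensions do not match")
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):          # rows of a
        for j in range(p):      # columns of b
            total = 0.0
            for k in range(m):  # dot product of row i and column j
                total += a[i][k] * b[k][j]
            c[i][j] = total
    return c


if __name__ == "__main__":
    print(matmul_naive([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Every addition and multiplication here goes through the interpreter, which is one reason a straightforward rewrite in a compiled language can be so much faster.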
Then they introduced memory optimizations. If you're familiar with large-scale matrix multiply, doing it in blocked fashion dramatically improves how effectively you use the cache, and from that they got another factor of a little under 20, about 15. And then finally, using the vector instructions inside the Intel processor, they were able to gain another factor of 10. Overall, this final program runs more than 62,000 times faster than the initial Python program.
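The blocked approach mentioned here can be sketched as follows. This is a minimal illustration of the loop structure only, not the paper's implementation: the name `matmul_blocked` and the `block` parameter are hypothetical, and in pure Python the cache benefit is largely hidden by interpreter overhead, so the payoff really shows up in compiled code.

```python
def matmul_blocked(a, b, block=2):
    """Multiply matrices (lists of rows) one block-by-block tile at a time."""
    n, m = len(a), len(b)
    p = len(b[0])
    c = [[0.0] * p for _ in range(n)]
    # The outer three loops pick a tile; the inner three work inside it.
    for i0 in range(0, n, block):
        for k0 in range(0, m, block):
            for j0 in range(0, p, block):
                for i in range(i0, min(i0 + block, n)):
                    for k in range(k0, min(k0 + block, m)):
                        aik = a[i][k]  # reused across the whole inner loop
                        for j in range(j0, min(j0 + block, p)):
                            c[i][j] += aik * b[k][j]
    return c


if __name__ == "__main__":
    identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    print(matmul_blocked([[1, 2, 3], [4, 5, 6], [7, 8, 9]], identity))
```

The idea is that each tile of the inputs is small enough to stay in cache while it is being reused, rather than being evicted and fetched again from main memory.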
Now, this is not to say that you would get this for larger-scale programs or in all kinds of environments, but it is an example of how much inefficiency there is, at least for one simple application. Of course, not many performance-sensitive things are written in Python, but even the improvement from C to the fully parallel version of C that uses SIMD instructions is similar to what you would get if you used a domain-specific processor. It is significant just in its own right. That's nearly a factor of 100, more than 100, it's almost 150.
So there are lots of opportunities here, and that's the key point behind this slide.
