KiloCore [1, 2] was recently published, and it is interesting to ask which applications suit this chip. What jobs can KiloCore do better than Xeon Phi Knights Landing and GPGPUs?
This chip is close to an application-specific design: each core runs only a tiny piece of code, at most 128 instructions, with about 512 bytes of memory for local variables. Under this constraint, fetching data from neighboring cores or recomputing it is more efficient than accessing main memory. The biggest issue with this chip is the size of its data memory. Adding a new cache technology, or a larger SRAM built with 3D IC stacking, could be a good way to keep this processor running without long stalls.
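To make the "talk to your neighbor instead of main memory" idea concrete, here is a minimal sketch in Python. It is not KiloCore code and does not model its instruction set; the `Core` class and `fir_chain` function are hypothetical names made up for illustration. Each simulated core keeps only one coefficient and one delayed sample, and a FIR filter emerges from the chain of cores passing values along.

```python
class Core:
    """One tiny core: a single coefficient plus a one-sample delay register."""

    def __init__(self, coeff):
        self.coeff = coeff
        self.delayed = 0  # the sample this core saw on the previous tick

    def step(self, sample, acc):
        # Add this core's contribution to the running sum from the left neighbor.
        acc += self.coeff * sample
        # Hand the previous sample to the right neighbor, then remember the new one.
        out_sample = self.delayed
        self.delayed = sample
        return out_sample, acc


def fir_chain(coeffs, samples):
    """Run samples through a chain of cores; no core ever touches shared memory."""
    cores = [Core(c) for c in coeffs]
    outputs = []
    for x in samples:
        sample, acc = x, 0
        for core in cores:  # data only moves from neighbor to neighbor
            sample, acc = core.step(sample, acc)
        outputs.append(acc)
    return outputs


if __name__ == "__main__":
    # An impulse in, the coefficients come out one per tick.
    print(fir_chain([1, 2, 3], [1, 0, 0, 0]))
```

The per-core state here is two numbers, which is the kind of footprint that fits comfortably in a few hundred bytes of local memory. A real mapping would also have to deal with communication latency and buffering between cores, which this sketch ignores.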
On the other hand, to me KiloCore resembles a Coarse-Grained Reconfigurable Architecture (CGRA). Reconfigurable cores and many cores on one chip have long been a hot topic in FPGA design, since they enable parallelism while keeping code easy to write in C/C++. A recent work [3] presented an OpenCL-to-FPGA compiler targeting a customized VLIW chip multiprocessor (CMP) architecture known as the LE1. Using the LLVM compilation framework, the authors developed and open sourced [4] a prototype that executes OpenCL applications on the LE1 CPU. It's quite fun to see these two works matching each other.
Finally, I would like to bring up some research ideas for KiloCore. I think this processor fits better as a co-processor or accelerator than as the main processor. In that scenario, applications could be offloaded to KiloCore for better performance and power efficiency. OpenCL could be a good starting point for making such tasks easy to program. Parker et al. have shown impressive results with LE1 cores. KiloCore could be the next one! Besides OpenCL, HSA is also a good choice, offering easier coding through a flat address space and low-overhead AQL queues. I believe a thinner runtime layer for heterogeneous systems is the trend, and power efficiency is a key goal for every company.
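As a rough feel for the power-efficiency argument, the figures in the title of the KiloCore paper [2] can be turned into a power number. This is only back-of-the-envelope arithmetic in Python, assuming the 5.8 pJ/op figure applies at the 115 billion ops/sec operating point, as the title suggests; the function and variable names are my own.

```python
# Figures as quoted in the title of the KiloCore VLSI paper [2].
ENERGY_PER_OP_J = 5.8e-12  # 5.8 pJ per operation
OPS_PER_SECOND = 115e9     # 115 billion operations per second


def power_watts(energy_per_op_j, ops_per_second):
    """Power = energy per operation x operations per second."""
    return energy_per_op_j * ops_per_second


def ops_per_joule(energy_per_op_j):
    """Efficiency expressed as operations per joule (same as ops/sec per watt)."""
    return 1.0 / energy_per_op_j


kilocore_power = power_watts(ENERGY_PER_OP_J, OPS_PER_SECOND)
kilocore_efficiency = ops_per_joule(ENERGY_PER_OP_J)

if __name__ == "__main__":
    print(f"power at 115 Gops/s: {kilocore_power:.3f} W")
    print(f"efficiency: {kilocore_efficiency / 1e9:.0f} Gops/s per watt")
```

Under that assumption the whole array draws well under one watt at its low-power operating point, which is why the accelerator role looks attractive.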
Extended reading on a 260-core processor: Sunway TaihuLight [5], a new supercomputer record! (But it comes with the weakness of slow communication.)
- Architecture: 64-bit RISC processor (ShenWei)
- Clock: 1.45 GHz per CPU
- Cores (total): 10.65 million
- Memory (total): 1.3 PB, DDR3
- Nodes: 40,960 (grouped into supernodes of 256 nodes each)
- Performance: 125 petaflops peak (93 petaflops in the Linpack benchmark)
- Power consumption: 15.3 megawatts, 6 gigaflops/watt
- Operating system: Sunway RaiseOS (Linux)
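The numbers in this list can be cross-checked against each other with a few lines of Python. This is only a sanity check, assuming one 260-core processor per node; the variable names are my own.

```python
NODES = 40_960
CORES_PER_NODE = 260         # assumption: one 260-core processor per node
BENCHMARK_FLOPS = 93e15      # 93 petaflops in the benchmark
POWER_WATTS = 15.3e6         # 15.3 megawatts

# Total core count: should land near the quoted 10.65 million.
total_cores = NODES * CORES_PER_NODE

# Energy efficiency: should land near the quoted 6 gigaflops/watt.
gflops_per_watt = BENCHMARK_FLOPS / POWER_WATTS / 1e9

if __name__ == "__main__":
    print(f"total cores: {total_cores:,}")
    print(f"efficiency: {gflops_per_watt:.2f} gigaflops/watt")
```

Both results agree with the list: 10,649,600 cores rounds to 10.65 million, and the efficiency comes out at about 6 gigaflops/watt when computed from the benchmark result rather than the peak.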
- [1] Original news: https://www.ucdavis.edu/news/worlds-first-1000-processor-chip
- [2] A 5.8 pJ/Op 115 Billion Ops/sec, to 1.78 Trillion Ops/sec 32nm 1000-Processor Array. Symposium on VLSI Technology and Circuits, 2016. http://vcl.ece.ucdavis.edu/pubs/2016.06.vlsi.symp.kiloCore/2016.vlsi.symp.kiloCore.pdf
- [3] Parker, Samuel J. An automated OpenCL FPGA compilation framework targeting a configurable, VLIW chip multiprocessor. Diss. Loughborough University, 2015.
- [4] esdg-opencl source code: https://github.com/grubbymits/esdg-opencl
- [5] News on Sunway TaihuLight: http://www.top500.org/news/china-tops-supercomputer-rankings-with-new-93-petaflop-machine/