Research Projects

Data-Centric Computing

A wide range of application domains are emerging as computing platforms of all types become more ubiquitous in society. Many of these applications are data centric: they spend a significant fraction of their execution time accessing and processing very large datasets. Examples include graph frameworks, precision medicine, computer vision, deep learning, and mobile device workloads. Unfortunately, the hardware platforms executing these applications remain compute centric, rooted in decades-old computer architecture design principles. Running modern data-centric applications on compute-centric platforms is highly inefficient, wasting significant energy and stalling programs. Data-centric platforms can eliminate these inefficiencies, but require the community to fundamentally rethink our approach to computer design.

Our work is designed to enable the development and widespread adoption of data-centric computers. This work spans the entire compute stack, starting from the device level up to software runtimes. We have developed an application-driven approach for identifying how and when to make use of data-centric computing, and have developed mechanisms that can provide essential and efficient support for traditional multithreaded programming. We are also exploring several application domains that can benefit from data-centric computing, and are performing hardware/software co-design to develop novel solutions for these domains.

Selected works:

S. Ghose, A. Boroumand, J. S. Kim, J. Gómez-Luna, and O. Mutlu
in IBM Journal of Research and Development (JRD), Vol. 63, No. 6, November/December 2019
A. Boroumand, S. Ghose, M. Patel, H. Hassan, B. Lucia, R. Ausavarungnirun, K. Hsieh, N. Hajinazar, K. T. Malladi, H. Zheng, and O. Mutlu
in Proc. of the International Symposium on Computer Architecture (ISCA), Phoenix, AZ, June 2019
A. Boroumand, S. Ghose, Y. Kim, R. Ausavarungnirun, E. Shiu, R. Thakur, D. Kim, A. Kuusela, A. Knies, P. Ranganathan, and O. Mutlu
in Proc. of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Williamsburg, VA, March 2018

Load Criticality

Imagine what would happen if FedEx ignored the urgency of a package, and simply tried to ship as many packages as possible at a given time. Their throughput would increase significantly, but time-sensitive packages would be delayed, creating issues for customers waiting on these shipments. This is exactly how memory schedulers have been designed in the past -- bandwidth is maximized through memory-level parallelism, under the mistaken assumption that higher throughput alone improves performance.

Our work has the core itself identify which loads are most important to it, and prioritizes those loads in memory. We find that, even when this comes at the expense of memory throughput, properly identifying load criticality provides significant performance improvements. Surprisingly, this processor-assisted information is so helpful that even a tiny criticality prediction mechanism garners large gains.

With the memory controller on chip, bandwidth between the controller and the cores is no longer at a premium. Furthermore, growing DRAM frequencies will make it exceedingly difficult to continue adding complexity to the memory scheduler. With our processor-assisted approach, we decouple load analysis from the scheduler by piggybacking criticality information on each memory request, keeping the scheduler lean yet sophisticated. Using this improved scheduling approach, we achieve mean speedups of 9.3% for parallel applications, all with less than 2 kB of total storage overhead for an 8-core, 4-memory-channel system.

S. Ghose, H. Lee, and J. F. Martínez
in Proc. of the International Symposium on Computer Architecture (ISCA), Tel Aviv, Israel, June 2013, pp. 84 - 95

Improving Fusible Multicore Architectures

Fusible multicore architectures, such as Core Fusion (İpek et al., ISCA '07), attack the inefficiencies of critical section execution on traditional chip multiprocessors (where most of the cores remain idle). They accomplish this by dynamically combining several components of the cores into a single, highly-superscalar processor when a critical section is reached.

Our work addresses a key deficiency of Core Fusion: its fused large core underperformed an area-equivalent, monolithic core. Two techniques (NOP compression and an improved instruction steering algorithm) allow us to recover this performance loss. We used offline genetic programming on a training set to search for the best steering policy. To validate that the steering algorithm met our aggressive 4 GHz target frequency, we created a detailed circuit-level model in a 45 nm process. Through several logical optimizations and careful transistor sizing, we designed a circuit that, when scaled conservatively to smaller process technologies, met our timing requirements with adequate margins.

J. Mukundan, S. Ghose, R. Karmazin, E. İpek, and J. F. Martínez
in Proc. of the International Conference on Supercomputing (ICS), Venice, Italy, June 2012, pp. 101 - 110

Memory Violation Detection

The C and C++ programming languages are notorious for providing unsafe access to memory objects. Bugs originating from illegal memory pointer usage are a major source of program error. Previous schemes to detect such violations have been software-based, and result in overheads that can slow programs down by an order of magnitude.

We developed compiler-assisted architectural mechanisms that perform runtime detection of out-of-bounds references. With optimizations that reduce detection latency and overhead, we provide full-coverage out-of-bounds detection with only a 5% average performance impact.

S. Ghose, L. Gilgeous, P. Dudnik, A. Aggarwal, and C. Waxman
in Proc. of the Design, Automation and Test in Europe Conference (DATE), Nice, France, April 2009, pp. 652 - 657

Selected Projects Completed at Cornell

Training Robots to Mimic Human Actions

Work by Koppula et al. (IJRR '13) demonstrated how a Willow Garage PR2 robot could use RGB-D video from the Kinect sensor to observe and learn human activities and object affordances (relations between objects and other objects, as well as between objects and actions). We extended this work by using the learned trajectories and affordances to extract generalized motion paths for the robot to execute, in effect learning how to mimic the actions demonstrated by a human. Using inverse reinforcement learning, we employed several combinations of features to extrapolate the trajectories for four observed activities.

S. Ghose
Learning Affordance-Labeled Activity Trajectories to Mimic Human Actions
CS 6758 (Robot Learning), May 2013

Benchmark Acceleration Using Portable Subword SIMD

Data-parallel architectures have made a resurgence in computer architecture research. Recently, GCC 4.7 added support for portable vector operations, known as the GCC Vector Extensions. Using a beta version of the compiler, we present a case study comparing the performance of an explicitly-vectorized version of the PARSEC streamcluster benchmark using the Vector Extensions against the automatic vectorization of the Intel C++ Compiler for a Core i7 family processor.

S. Ghose, S. Srinath, and J. D. Tse
Accelerating a PARSEC Benchmark Using Portable Subword SIMD
CS 5220 (Applications of Parallel Computers), December 2011

Ad-Hoc Networks for Systems-on-Chip

Systems-on-chip (SoCs) have primarily used simple on-chip networks for communications between different components, typically relying on buses or point-to-point networks. In contrast, Dally and Towles (DAC '01) argue that regular packet-switched networks provide better scalability and easier design. However, total rigidity is not conducive to the heterogeneity of SoCs, as each component can have vast differences in communication requirements. We designed an automated tool flow that selectively replaces low-throughput point-to-point SoC channels with ad hoc routers, delivering greater overall network bandwidth as needed.

S. Ghose and S. Ravi
Alleviating Serialization Latencies in System-on-Chip Designs Using Ad-Hoc Networks
ECE 5970 (Chip-Level Interconnection Networks), May 2010

Memory-Aware DVFS for CMP Systems

VSV (Li et al., MICRO '03) proposed to scale down processor voltage any time an L2 miss occurred. We extended VSV to work on a chip multiprocessor executing multiprogrammed workloads. As part of our work, we evaluated how VSV, along with our multicore extension, operated for various DVFS transition latencies. We found that memory-aware scaling only provides tangible ED2 improvements for very short latencies (12 ns), and even then does so at a significant performance cost.

S. Ghose and J. D. Tse
Memory-Aware DVFS for CMP Systems
ECE 573 (Memory Systems), May 2009

Ultra Low Power 16-Bit Processor

A student team designed the ISA and microarchitecture for an ultra low power processor, for use in a low power GPS tracking device. The processor uses a modified version of the 16-bit MIPS ISA, with custom instructions for common GPS operations, and power optimizations such as dual-phase clocking. The chip was fabricated in a 130 nm process, with a target frequency of 50 MHz and a 35 mW TDP. We ported GNU binutils and GCC to our ISA to provide a toolchain for high-level language compilation.

D. I. Davis, S. Ghose, M. J. Kotler, M. Mahajan, J. E. Maxwell, H. M. McCreary, K. S. Parmar, and J. A. Sicilia
Big Red Chip
technical report, May 2008

Learning Instruction Criticality

Instruction criticality has previously been modeled as the critical path through a directed acyclic graph of dynamic dependences, found using reverse graph traversal -- a traversal that is difficult to perform at runtime, when only the forward flow of instructions is available (Fields et al., ISCA '01). We explored how several machine learning algorithms could improve upon the token-based Fields criticality predictor, by training them on a number of dynamic instruction features. We found that decision trees offer the greatest accuracy, correctly predicting 95% of the test set.

M. Kırman, N. Kırman, D. M. Lockhart, and S. Ghose
Predicting Instruction Criticality in Microprocessors by Learning Statistical Observations
CS 478 (Machine Learning), May 2008

Asynchronous Floating-Point Adder

Synchronous floating-point units often incur high latencies, as their circuit designs must account for the worst-case timing of a number of corner cases that require significant additional computation yet rarely occur. Unlike synchronous circuits, asynchronous circuits can significantly reduce the latency of the common cases, by allowing them to complete earlier and notify their consumer immediately. Corner cases simply diverge in the datapath, extending the unit's calculation latency only when necessary. We completed a first-pass design of a single-precision IEEE 754 asynchronous floating-point adder, and implemented much of this design at the gate level.

B. R. Sheikh, M. J. Kotler, and S. Ghose
An Asynchronous Floating-Point Adder for Wall Street
ECE 574 (Advanced Digital VLSI Design), December 2007