Publications

2023
Freddy Gabbay, Firas Ramadan, and Majd Ganaiem. 2023. Clock Tree Design Considerations in the Presence of Asymmetric Transistor Aging. In DVCon Europe 2023; Design and Verification Conference and Exhibition Europe, pp. 14-20.
Freddy Gabbay, Firas Ramadan, and Majd Ganaiem. 2023. Effect of Asymmetric Transistor Aging on GPGPUs. In Microelectronic Devices and Technologies, pp. 52-58.
Freddy Gabbay and Avi Mendelson. 2023. Electromigration-Aware Instruction Execution for Modern Microprocessors. In Microelectronic Devices and Technologies, pp. 60-66.
Freddy Gabbay and Avi Mendelson. 2023. Electromigration-Aware Memory Hierarchy Architecture. Journal of Low Power Electronics and Applications, 13, p. 44. doi:10.3390/jlpea13030044.
New mission-critical applications, such as autonomous vehicles and life-support systems, set a high bar for the reliability of modern microprocessors that operate in highly challenging conditions. However, while cutting-edge integrated circuit (IC) technologies have advanced microprocessors by providing remarkable reductions in silicon area and power consumption, they also introduce new reliability challenges through the complex design rules they impose, creating a significant hurdle in the design process. In this paper, we focus on electromigration (EM), which is a crucial factor impacting IC reliability. EM refers to the degradation process of IC metal nets when used for both power supply and interconnecting signals. Typically, EM concerns have been addressed at the backend, circuit, and layout levels, where EM rules are enforced assuming extreme conditions to identify and resolve violations. This study presents new techniques that leverage architectural features to mitigate the effect of EM on the memory hierarchy of modern microprocessors. Architectural approaches can reduce the complexity of solving EM-related violations, and they can also complement and enhance common existing methods. In this study, we present a comprehensive simulation analysis that demonstrates how the proposed solution can significantly extend the lifetime of a microprocessor’s memory hierarchy with minimal overhead in terms of performance, power, and area while relaxing EM design efforts.
Samer Kurzum, Gil Shomron, Freddy Gabbay, and Uri Weiser. 2023. Enhancing DNN Training Efficiency via Dynamic Asymmetric Architecture. IEEE Computer Architecture Letters, 22, pp. 49-52. doi:10.1109/LCA.2023.3275909.
2022
Freddy Gabbay, Rotem Lev Aharoni, and Ori Schweitzer. 2022. Deep Neural Network Memory Performance and Throughput Modeling and Simulation Framework. Mathematics, 10, p. 4144. doi:10.3390/math10214144.
Deep neural networks (DNNs) are widely used in various artificial intelligence applications and platforms, such as sensors in internet of things (IoT) devices, speech and image recognition in mobile systems, and web searching in data centers. While DNNs achieve remarkable prediction accuracy, they introduce major computational and memory bandwidth challenges due to the increasing model complexity and the growing amount of data used for training and inference. These challenges introduce major difficulties not only due to the constraints of system cost, performance, and energy consumption, but also due to limitations in currently available memory bandwidth. The recent advances in semiconductor technologies have further intensified the gap between computational hardware performance and memory systems bandwidth. Consequently, memory systems are, today, a major performance bottleneck for DNN applications. In this paper, we present DRAMA, a deep neural network memory simulator. DRAMA extends the SCALE-Sim simulator for DNN inference on systolic arrays with a detailed, accurate, and extensive modeling and simulation environment of the memory system. DRAMA can simulate in detail the hierarchical main memory components—such as memory channels, modules, ranks, and banks—and related timing parameters. In addition, DRAMA can explore tradeoffs for memory system performance and identify bottlenecks for different DNNs and memory architectures. We demonstrate DRAMA’s capabilities through a set of experimental simulations based on several use cases.
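The memory-bandwidth bottleneck this abstract describes can be illustrated with a minimal roofline-style check (a generic sketch, not DRAMA's actual model; the function name and all numbers below are hypothetical):

```python
def classify_bottleneck(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Return which resource limits a DNN layer on a given accelerator.

    flops          -- arithmetic operations the layer performs
    bytes_moved    -- DRAM traffic (weights + activations) in bytes
    peak_flops     -- accelerator peak compute throughput, ops/s
    peak_bandwidth -- memory system peak bandwidth, bytes/s
    """
    compute_time = flops / peak_flops
    memory_time = bytes_moved / peak_bandwidth
    # Whichever phase takes longer dominates the layer's runtime.
    return "memory-bound" if memory_time > compute_time else "compute-bound"

# A layer moving 0.8 GB at 100 GB/s takes 8 ms, while its 1 GFLOP of
# compute at 1 TFLOP/s takes only 1 ms -- the memory system dominates.
verdict = classify_bottleneck(1e9, 8e8, 1e12, 1e11)  # "memory-bound"
```

A simulator such as DRAMA refines this coarse picture by modeling channels, ranks, banks, and timing parameters, but the compute-versus-bandwidth comparison above is the tradeoff being explored.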
Freddy Gabbay, Avi Mendelson, Basel Salameh, and Majd Ganaiem. 2022. A Design Flow and Tool for Avoiding Asymmetric Aging. IEEE Design & Test, 39, pp. 111-118. doi:10.1109/MDAT.2022.3183552.
F. Gabbay, B. Salomon, R. Cohen, and Y. Stav. 2022. RISC-V and Machine Learning Accelerator Hackathon – Enhancing Undergraduate Students’ Perceptions of Essential Chip Design Skills. In INTED2022 Proceedings, pp. 2921-2926. IATED. doi:10.21125/inted.2022.0832.
Freddy Gabbay, Benjamin Salomon, and Gil Shomron. 2022. Structured Compression of Convolutional Neural Networks for Specialized Tasks. Mathematics, 10, p. 3679. doi:10.3390/math10193679.
Convolutional neural networks (CNNs) offer significant advantages when used in various image classification tasks and computer vision applications. CNNs are increasingly deployed in environments from edge and Internet of Things (IoT) devices to high-end computational infrastructures, such as supercomputers, cloud computing, and data centers. The growing amount of data and the growth in their model size and computational complexity, however, introduce major computational challenges. Such challenges present entry barriers for IoT and edge devices as well as increase the operational expenses of large-scale computing systems. Thus, it has become essential to optimize CNN algorithms. In this paper, we introduce the S-VELCRO compression algorithm, which exploits value locality to trim filters in CNN models utilized for specialized tasks. S-VELCRO uses structured compression, which can save costs and reduce overhead compared with unstructured compression. The algorithm runs in two steps: a preprocessing step identifies the filters with a high degree of value locality, and a compression step trims the selected filters. As a result, S-VELCRO reduces the computational load of the channel activation function and avoids the convolution computation of the corresponding trimmed filters. Compared with typical CNN compression algorithms that run heavy back-propagation training computations, S-VELCRO has significantly fewer computational requirements. Our experimental analysis shows that S-VELCRO achieves a compression-saving ratio between 6% and 30%, with no degradation in accuracy for ResNet-18, MobileNet-V2, and GoogLeNet when used for specialized tasks.
2021
Freddy Gabbay, Avi Mendelson, Basel Salameh, and Majd Ganaiem. 2021. Asymmetric Aging Avoidance EDA Tool. In 2021 34th SBC/SBMicro/IEEE/ACM Symposium on Integrated Circuits and Systems Design (SBCCI), pp. 1-6. doi:10.1109/SBCCI53441.2021.9529984.
Freddy Gabbay and Gil Shomron. 2021. Compression of Neural Networks for Specialized Tasks via Value Locality. Mathematics, 9, p. 2612. doi:10.3390/math9202612.
Convolutional Neural Networks (CNNs) are broadly used in numerous applications such as computer vision and image classification. Although CNN models deliver state-of-the-art accuracy, they require heavy computational resources that are not always affordable or available on every platform. Limited performance, system cost, and energy consumption, such as in edge devices, argue for the optimization of computations in neural networks. Toward this end, we propose herein the value-locality-based compression (VELCRO) algorithm for neural networks. VELCRO is a method to compress general-purpose neural networks that are deployed for a small subset of focused specialized tasks. Although this study focuses on CNNs, VELCRO can be used to compress any deep neural network. VELCRO relies on the property of value locality, which suggests that activation functions exhibit values in proximity through the inference process when the network is used for specialized tasks. VELCRO consists of two stages: a preprocessing stage that identifies output elements of the activation function with a high degree of value locality, and a compression stage that replaces these elements with their corresponding average arithmetic values. As a result, VELCRO not only saves the computation of the replaced activations but also avoids processing their corresponding output feature map elements. Unlike common neural network compression algorithms, which require computationally intensive training processes, VELCRO introduces significantly fewer computational requirements. An analysis of our experiments indicates that, when CNNs are used for specialized tasks, they introduce a high degree of value locality relative to the general-purpose case. In addition, the experimental results show that without any training process, VELCRO produces a compression-saving ratio in the range 13.5–30.0% with no degradation in accuracy. Finally, the experimental results indicate that, when VELCRO is used with a relatively low compression target, it significantly improves the accuracy by 2–20% for specialized CNN tasks.
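The two-stage structure described in the abstract can be sketched in a few lines (a toy illustration under assumed shapes and a hypothetical threshold, not the paper's implementation):

```python
from statistics import fmean, pstdev

def velcro_sketch(activation_samples, locality_threshold=0.05):
    """Toy two-stage value-locality compression.

    activation_samples -- list of equal-length activation vectors collected
    while running the network on the specialized task (a calibration set).
    Returns, per element, whether it is value-local and the constant
    (calibration-set mean) that would replace its computation.
    """
    columns = list(zip(*activation_samples))  # per-element value streams
    means = [fmean(col) for col in columns]
    is_local = []
    for col, m in zip(columns, means):
        # Preprocessing stage: relative spread of the element's values
        # across inferences; a small spread indicates value locality.
        spread = pstdev(col) / (abs(m) + 1e-8)
        is_local.append(spread < locality_threshold)
    # Compression stage: value-local elements are not computed at
    # inference time; they are replaced by their average values.
    return is_local, means

# Element 0 is nearly constant across inferences; element 1 is not.
samples = [[1.00, 0.1], [1.01, 5.0], [0.99, 2.5]]
local, consts = velcro_sketch(samples)  # local == [True, False]
```

Only elements flagged value-local are replaced, which is why the method saves computation without retraining: no weights change, only a subset of activation outputs is frozen to constants.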
Alex Karbachevsky, Chaim Baskin, Evgenii Zheltonozhskii, Yevgeny Yermolin, Freddy Gabbay, Alex M. Bronstein, and Avi Mendelson. 2021. Early-Stage Neural Network Hardware Performance Analysis. Sustainability, 13, p. 717. doi:10.3390/su13020717.
The demand for running neural networks (NNs) in embedded environments has increased significantly in recent years due to the significant success of convolutional neural network (CNN) approaches in various tasks, including image recognition and generation. The task of achieving high accuracy on resource-restricted devices, however, is still considered to be challenging, which is mainly due to the vast number of design parameters that need to be balanced. While the quantization of CNN parameters leads to a reduction of power and area, it can also generate unexpected changes in the balance between communication and computation. This change is hard to evaluate, and the lack of balance may lead to lower utilization of either memory bandwidth or computational resources, thereby reducing performance. This paper introduces a hardware performance analysis framework for identifying bottlenecks in the early stages of CNN hardware design. We demonstrate how the proposed method can help in evaluating different architecture alternatives of resource-restricted CNN accelerators (e.g., part of real-time embedded systems) early in design stages and, thus, prevent making design mistakes.
Freddy Gabbay, Shirly Bar-Lev, Ofer Montano, and Noam Hadad. 2021. A LIME-Based Explainable Machine Learning Model for Predicting the Severity Level of COVID-19 Diagnosed Patients. Applied Sciences, 11, p. 10417. doi:10.3390/app112110417.
The fast and seemingly uncontrollable spread of the novel coronavirus disease (COVID-19) poses great challenges to an already overloaded health system worldwide. It thus exemplifies an urgent need for fast and effective triage. Such triage can help in the implementation of the necessary measures to prevent patient deterioration and conserve strained hospital resources. We examine two types of machine learning models, multilayer perceptron artificial neural networks and decision trees, to predict the severity level of illness for patients diagnosed with COVID-19, based on their medical history and laboratory test results. In addition, we combine the machine learning models with a LIME-based explainable model to provide explainability of the model prediction. Our experimental results indicate that the model can achieve up to 80% prediction accuracy for the dataset we used. Finally, we integrate the explainable machine learning models into a mobile application to enable the usage of the proposed models by medical staff worldwide.
Gil Shomron, Freddy Gabbay, Samer Kurzum, and Uri Weiser. 2021. Post-Training Sparsity-Aware Quantization. In Advances in Neural Information Processing Systems, 34, pp. 17737–17748. Curran Associates, Inc.
2001
Avi Mendelson and Freddy Gabbay. 2001. The Effect of Seance Communication on Multiprocessing Systems. ACM Trans. Comput. Syst., 19, pp. 252–281. doi:10.1145/377769.377780.
This paper introduces the seance communication phenomenon and analyzes its effect on a multiprocessing environment. Seance communication is an unnecessary coherency-related activity that is associated with dead cache information. Dead information may reside in the cache for various reasons: task migration, context switches, or working-set changes. Dead information does not have a significant performance impact on a single-processor system; however, it can dominate the performance of a multicache environment. In order to evaluate the overhead of seance communication, we develop an analytical model that is based on the fractal behavior of the memory references. So far, all previous works that used the same modeling approach extracted the fractal parameters of a program manually. This paper provides an additional important contribution by demonstrating how these parameters can be automatically extracted from the program trace. Our analysis indicates that seance communication may severely reduce the overall system performance when using write-update or write-invalidate cache coherency protocols. In addition, we find that the performance of write-update protocols is affected more severely than that of write-invalidate protocols. The results that are provided by our model are important for a better understanding of the coherency-related overhead in multicache systems and for better development of parallel applications and operating systems.
2000
M. Bekerman, A. Yoaz, F. Gabbay, S. Jourdan, M. Kalaev, and R. Ronen. 2000. Early Load Address Resolution via Register Tracking. In Proceedings of the 27th International Symposium on Computer Architecture (IEEE Cat. No. RS00201), pp. 306-315. doi:10.1109/ISCA.2000.854400.
1999
Freddy Gabbay and Avi Mendelson. 1999. The “Smart” Simulation Environment – A Tool-Set to Develop New Cache Coherency Protocols. Journal of Systems Architecture, 45, pp. 619-632. doi:10.1016/S1383-7621(98)00007-1.
“Smart” is a new system-level simulation environment that was developed in order to evaluate and improve algorithms for distributed and parallel systems. In this paper we focus our discussion on the development of new cache coherency mechanisms that were optimized to handle system-level effects such as process switching and task migration. The development of new cache coherency protocols is a good example to demonstrate many of the important features of Smart, since system-level events have a major influence on the effectiveness of different cache coherency policies and the overall performance of multicache systems. The Smart simulation environment was built as a separate layer that extends existing multi-processing simulators, so we could take advantage of mature and reliable simulation engines. Smart also provides a friendly graphical user interface (GUI) that allows: (1) control of different system parameters and mechanisms, such as the cache coherency protocol type, cache organization, and scheduling policies of processes and threads; (2) simulation of the execution of shared-memory parallel architectures and measurement of different system performance parameters; and (3) use as a powerful visual debugging tool. Although this paper presents a version of Smart which is dedicated to shared-bus architectures, other libraries of the tool can simulate different parallel and distributed architectures as well.
1998
Freddy Gabbay and Avi Mendelson. 1998. The Effect of Instruction Fetch Bandwidth on Value Prediction. SIGARCH Comput. Archit. News, 26, pp. 272–281. doi:10.1145/279361.278058.
Value prediction attempts to eliminate true-data dependencies by dynamically predicting the outcome values of instructions and executing true-data dependent instructions based on that prediction. In this paper we attempt to understand the limitations of using this paradigm in realistic machines. We show that the instruction-fetch bandwidth and the issue rate have a very significant impact on the efficiency of value prediction. In addition, we study how recent techniques to improve the instruction-fetch rate affect the efficiency of value prediction and its hardware organization.
Freddy Gabbay and Avi Mendelson. 1998. The Effect of Instruction Fetch Bandwidth on Value Prediction. In Proceedings of the 25th Annual International Symposium on Computer Architecture, pp. 272–281. USA: IEEE Computer Society. doi:10.1145/279358.278058.