Publications
2023
- [ISMB/ECCB] RawHash: Enabling Fast and Accurate Real-Time Analysis of Raw Nanopore Signals for Large Genomes. Can Firtina, Nika Mansouri Ghiasi, Joël Lindegger, Gagandeep Singh, Meryem Banu Cavlak, Haiyu Mao, and Onur Mutlu. In Proceedings of the 31st Annual Conference on Intelligent Systems for Molecular Biology (ISMB) and the 22nd European Conference on Computational Biology (ECCB), Jul 2023
Nanopore sequencers generate electrical raw signals in real-time while sequencing long genomic strands. These raw signals can be analyzed as they are generated, providing an opportunity for real-time genome analysis. An important feature of nanopore sequencing, Read Until, can eject strands from sequencers without fully sequencing them, which provides opportunities to computationally reduce the sequencing time and cost. However, existing works utilizing Read Until either 1) require powerful computational resources that may not be available for portable sequencers or 2) lack scalability for large genomes, rendering them inaccurate or ineffective. We propose RawHash, the first mechanism that can accurately and efficiently perform real-time analysis of nanopore raw signals for large genomes using a hash-based similarity search. To enable this, RawHash ensures the signals corresponding to the same DNA content lead to the same hash value, regardless of the slight variations in these signals. RawHash achieves an accurate hash-based similarity search via an effective quantization of the raw signals such that signals corresponding to the same DNA content have the same quantized value and, subsequently, the same hash value. We evaluate RawHash on three applications: 1) read mapping, 2) relative abundance estimation, and 3) contamination analysis. Our evaluations show that RawHash is the only tool that can provide high accuracy and high throughput for analyzing large genomes in real-time. When compared to the state-of-the-art techniques, UNCALLED and Sigmap, RawHash provides 1) 25.8x and 3.4x better average throughput and 2) significantly better accuracy for large genomes, respectively. Source code is available at https://github.com/CMU-SAFARI/RawHash.
@inproceedings{firtina_rawhash_2023, booktitle = {Proceedings of the 31st Annual Conference on Intelligent Systems for Molecular Biology (ISMB) and the 22nd European Conference on Computational Biology (ECCB)}, title = {{RawHash}: {Enabling} {Fast} and {Accurate} {Real}-{Time} {Analysis} of {Raw} {Nanopore} {Signals} for {Large} {Genomes}}, doi = {10.1093/bioinformatics/btad272}, author = {Firtina, Can and Ghiasi, Nika Mansouri and Lindegger, Joel and Singh, Gagandeep and Cavlak, Meryem Banu and Mao, Haiyu and Mutlu, Onur}, month = jul, year = {2023}, }
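To make the quantize-then-hash idea above concrete, here is a minimal Python sketch. The event values, bucket width, seed length, and hash function are illustrative assumptions, not RawHash's actual parameters; the point is only that nearby signal values fall into the same bucket and therefore produce the same hash value.

```python
# Minimal sketch of quantize-then-hash for raw nanopore signals (assumed
# parameters; not RawHash's actual quantization or hashing scheme).
import hashlib

def quantize(events, bucket_width=0.35):
    # Map each normalized signal event to a coarse bucket so that two
    # slightly different measurements of the same DNA content collide.
    return tuple(round(e / bucket_width) for e in events)

def seed_hash(events, k=6):
    # Hash k consecutive quantized events into one seed hash value.
    q = quantize(events[:k])
    return hashlib.blake2b(repr(q).encode(), digest_size=8).hexdigest()

# Two noisy reads of the same DNA content map to the same hash value.
a = [0.91, -0.42, 1.30, 0.05, -1.10, 0.77]
b = [0.93, -0.40, 1.28, 0.07, -1.12, 0.75]  # slight signal variation
assert seed_hash(a) == seed_hash(b)
```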
- [AACBB] GAPiM: a hardware acceleration of Genome Analysis pipeline using Processing in Memory. Naomie Abecassis, Juan Gómez-Luna, Onur Mutlu, Ran Ginosar, Aphélie Moisson-Franckhauser, and Leonid Yavits. In Proceedings of the 5th Workshop on Accelerator Architecture in Computational Biology and Bioinformatics (AACBB), Jun 2023
Variant calling is a fundamental stage in genome analysis that identifies mutations (variations) in a sequenced genome relative to a known reference genome. Pair-HMM is a key component of the variant calling algorithm and its most compute-intensive part. In recent years, Processing-in-Memory (PiM) solutions, which consist of placing compute capabilities near/inside memory, have been proposed to speed up the genome analysis pipeline. We implement the Pair-HMM algorithm on a commercial PiM platform developed by UPMEM. We modify the Pair-HMM algorithm to make it more suitable for PiM execution with acceptable loss of accuracy. We evaluate our implementation on single chromosomes and whole genome sequencing datasets, demonstrating up to 2x speedup compared to existing CPU accelerations and up to 3x speedup compared to FPGA accelerations.
@inproceedings{abecasis2023, booktitle = {Proceedings of the 5th Workshop on Accelerator Architecture in Computational Biology and Bioinformatics (AACBB)}, title = {{GAPiM}: a hardware acceleration of Genome Analysis pipeline using Processing in Memory}, doi = {10.1101/2023.07.26.550623}, url = {https://www.biorxiv.org/content/10.1101/2023.07.26.550623v1}, author = {Abecassis, Naomie and G{\'o}mez-Luna, Juan and Mutlu, Onur and Ginosar, Ran and Moisson-Franckhauser, Aph{\'e}lie and Yavits, Leonid}, month = jun, year = {2023}, }
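For readers unfamiliar with the kernel being accelerated, the sketch below shows the Pair-HMM forward recursion over match/insert/delete states in plain Python. The transition and emission probabilities are illustrative assumptions, not the calibrated parameters of production variant callers, and real implementations work in log space or with scaling to avoid underflow.

```python
# Pair-HMM forward algorithm sketch (illustrative parameters).
def pair_hmm_forward(read, hap, p_match=0.9, gap_open=0.05, gap_ext=0.1):
    """Likelihood of `read` given haplotype `hap`, summed over alignments."""
    n, m = len(read), len(hap)
    fM = [[0.0] * (m + 1) for _ in range(n + 1)]  # end in match state
    fI = [[0.0] * (m + 1) for _ in range(n + 1)]  # end in insertion
    fD = [[0.0] * (m + 1) for _ in range(n + 1)]  # end in deletion
    for j in range(m + 1):
        fD[0][j] = 1.0 / m  # the read may start anywhere on the haplotype
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            emit = p_match if read[i-1] == hap[j-1] else (1 - p_match) / 3
            fM[i][j] = emit * ((1 - 2 * gap_open) * fM[i-1][j-1]
                               + (1 - gap_ext) * (fI[i-1][j-1] + fD[i-1][j-1]))
            fI[i][j] = 0.25 * (gap_open * fM[i-1][j] + gap_ext * fI[i-1][j])
            fD[i][j] = gap_open * fM[i][j-1] + gap_ext * fD[i][j-1]
    # The read may also end anywhere on the haplotype.
    return sum(fM[n][j] + fI[n][j] for j in range(1, m + 1))

print(pair_hmm_forward("ACGT", "TTACGTTT"))
```

The doubly nested loop over read length times haplotype length is what makes this kernel dominate variant-calling runtime and makes it attractive for in-memory parallelization.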
- [DAC] Accelerating Genome Analysis via Algorithm-Architecture Co-Design. Onur Mutlu and Can Firtina. In Proceedings of the 60th Design Automation Conference (DAC), Jul 2023
High-throughput sequencing (HTS) technologies have revolutionized the field of genomics, enabling rapid and cost-effective genome analysis for various applications. However, the increasing volume of genomic data generated by HTS technologies presents significant challenges for computational techniques to effectively analyze genomes. To address these challenges, several algorithm-architecture co-design works have been proposed, targeting different steps of the genome analysis pipeline. These works explore emerging technologies to provide fast, accurate, and low-power genome analysis. This paper provides a brief review of the recent advancements in accelerating genome analysis, covering the opportunities and challenges associated with the acceleration of the key steps of the genome analysis pipeline. Our analysis highlights the importance of integrating multiple steps of genome analysis using suitable architectures to unlock significant performance improvements and reduce data movement and energy consumption. We conclude by emphasizing the need for novel strategies and techniques to address the growing demands of genomic data generation and analysis.
@inproceedings{mutlu2023accelerating, booktitle = {Proceedings of the 60th Design Automation Conference (DAC)}, series = {{DAC} '23}, title = {Accelerating Genome Analysis via Algorithm-Architecture Co-Design}, author = {Mutlu, Onur and Firtina, Can}, journal = {arXiv preprint arXiv:2305.00492}, doi = {10.48550/arXiv.2305.00492}, url = {https://doi.org/10.48550/arXiv.2305.00492}, year = {2023}, month = jul, }
- [MMDCS] Will computing in memory become a new dawn of associative processors? Leonid Yavits. Memories - Materials, Devices, Circuits and Systems, Jul 2023
Computer architecture has faced an enormous challenge in recent years: while the demand for performance is constantly growing, the performance improvement of general-purpose CPUs has almost stalled. Among the reasons are the memory and power walls, due to which data transfer increasingly dominates computing. By significantly reducing data transfer, data-centric (or in-memory) computing promises to alleviate the memory and power walls. The associative processor is a non-von Neumann computer invented in the 1960s but effectively cast aside until recently. It computes using associative memory in a perfect-induction-like fashion, using associative memory cells for both data storage and processing. Associative processors can be implemented using conventional CMOS as well as emerging memories. We show that an associative processor can outperform state-of-the-art computing platforms by up to almost two orders of magnitude in a variety of data-intensive workloads.
@article{Yavits2023, author = {Yavits, Leonid}, title = {Will computing in memory become a new dawn of associative processors?}, journal = {Memories - Materials, Devices, Circuits and Systems}, issn = {2773-0646}, pages = {100033}, volume = {4}, doi = {10.1016/j.memori.2023.100033}, month = jul, year = {2023} }
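As a rough illustration of the perfect-induction style of computing described above, the toy Python model below adds two columns of words "in place", one bit position at a time. A real associative processor would apply each truth-table row to all memory rows in a single parallel compare-and-write cycle; the word width and inputs here are illustrative.

```python
# Toy model of associative-processor-style bit-serial addition
# (illustrative; real APs process all rows in parallel per cycle).
def ap_add(a_col, b_col, nbits=8):
    carry = [0] * len(a_col)
    result = [[0] * nbits for _ in a_col]
    for bit in range(nbits):          # one truth-table pass per bit slice
        for row in range(len(a_col)): # parallel across rows in hardware
            a = (a_col[row] >> bit) & 1
            b = (b_col[row] >> bit) & 1
            result[row][bit] = a ^ b ^ carry[row]
            carry[row] = (a & b) | (carry[row] & (a ^ b))
    return [sum(v << i for i, v in enumerate(bits)) for bits in result]

print(ap_add([3, 10, 250], [4, 5, 6]))  # -> [7, 15, 0] (8-bit wraparound)
```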
- [ISPASS] TransPimLib: A Library for Efficient Transcendental Functions on Processing-in-Memory Systems. Maurus Item, Juan Gómez-Luna, Yuxin Guo, Geraldo F. Oliveira, Mohammad Sadrosadati, and Onur Mutlu. In Proceedings of the 24th International Symposium on Performance Analysis of Systems and Software (ISPASS), Apr 2023
Processing-in-memory (PIM) promises to alleviate the data movement bottleneck in modern computing systems. However, current real-world PIM systems have the inherent disadvantage that their hardware is more constrained than in conventional processors (CPU, GPU), due to the difficulty and cost of building processing elements near or inside the memory. As a result, general-purpose PIM architectures support fairly limited instruction sets and struggle to execute complex operations such as transcendental functions and other hard-to-calculate operations (e.g., square root). These operations are particularly important for some modern workloads, e.g., activation functions in machine learning applications. In order to provide support for transcendental (and other hard-to-calculate) functions in general-purpose PIM systems, we present TransPimLib, a library that provides CORDIC-based and LUT-based methods for trigonometric functions, hyperbolic functions, exponentiation, logarithm, square root, etc. We develop an implementation of TransPimLib for the UPMEM PIM architecture and perform a thorough evaluation of TransPimLib’s methods in terms of performance and accuracy, using microbenchmarks and three full workloads (Blackscholes, Sigmoid, Softmax). We open-source all our code and datasets at https://github.com/CMU-SAFARI/transpimlib.
@inproceedings{gomez2023transpimlib, booktitle = {Proceedings of the 24th International Symposium on Performance Analysis of Systems and Software (ISPASS)}, series = {{ISPASS} '23}, title = {TransPimLib: A Library for Efficient Transcendental Functions on Processing-in-Memory Systems}, author = {Item, Maurus and G{\'o}mez-Luna, Juan and Guo, Yuxin and Oliveira, Geraldo F and Sadrosadati, Mohammad and Mutlu, Onur}, journal = {arXiv preprint arXiv:2304.01951}, doi = {10.48550/arXiv.2304.01951}, url = {https://doi.org/10.48550/arXiv.2304.01951}, month = apr, year = {2023} }
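As an illustration of the CORDIC-based approach the library provides, here is a minimal floating-point sine/cosine sketch in Python that uses only shifts, adds, and a small angle table per iteration. The iteration count is an illustrative choice; TransPimLib's actual fixed-point implementations and LUT variants differ.

```python
# Minimal CORDIC sine/cosine sketch (illustrative, float instead of
# fixed-point; valid for |theta| up to ~1.74 rad without range reduction).
import math

N = 24
ANGLES = [math.atan(2.0 ** -i) for i in range(N)]
K = 1.0
for i in range(N):
    K /= math.sqrt(1.0 + 2.0 ** (-2 * i))  # pre-computed CORDIC gain

def cordic_sin_cos(theta):
    # Rotate the vector (K, 0) toward angle theta with shift-and-add steps.
    x, y, z = K, 0.0, theta
    for i in range(N):
        d = 1.0 if z >= 0 else -1.0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ANGLES[i]
    return y, x  # (sin, cos)

s, c = cordic_sin_cos(0.5)
print(s, math.sin(0.5))  # agree to roughly 7 decimal places
```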
- [ISPASS] Evaluating Machine Learning Workloads on Memory-Centric Computing Systems. Juan Gómez-Luna, Yuxin Guo, Sylvan Brocard, Julien Legriel, Remy Cimadomo, Geraldo F. Oliveira, Gagandeep Singh, and Onur Mutlu. In Proceedings of the 24th International Symposium on Performance Analysis of Systems and Software (ISPASS), Apr 2023
Training machine learning (ML) algorithms is a computationally intensive process, which is frequently memory-bound due to repeatedly accessing large training datasets. As a result, processor-centric systems (e.g., CPU, GPU) suffer from costly data movement between memory units and processing units, which consumes large amounts of energy and execution cycles. Memory-centric computing systems, i.e., with processing-in-memory (PIM) capabilities, can alleviate this data movement bottleneck. Our goal is to understand the potential of modern general-purpose PIM architectures to accelerate ML training. To do so, we (1) implement several representative classic ML algorithms (namely, linear regression, logistic regression, decision tree, K-Means clustering) on a real-world general-purpose PIM architecture, (2) rigorously evaluate and characterize them in terms of accuracy, performance and scaling, and (3) compare to their counterpart implementations on CPU and GPU. Our evaluation on a real memory-centric computing system with more than 2500 PIM cores shows that general-purpose PIM architectures can greatly accelerate memory-bound ML workloads, when the necessary operations and datatypes are natively supported by PIM hardware. For example, our PIM implementation of decision tree is 27x faster than a state-of-the-art CPU version on an 8-core Intel Xeon, and 1.34x faster than a state-of-the-art GPU version on an NVIDIA A100. Our K-Means clustering on PIM is 2.8x and 3.2x faster than state-of-the-art CPU and GPU versions, respectively. To our knowledge, our work is the first one to evaluate ML training on a real-world PIM architecture. We conclude with key observations, takeaways, and recommendations that can inspire users of ML workloads, programmers of PIM architectures, and hardware designers & architects of future memory-centric computing systems.
@inproceedings{gomez2023evaluating, booktitle = {Proceedings of the 24th International Symposium on Performance Analysis of Systems and Software (ISPASS)}, series = {{ISPASS} '23}, title = {Evaluating Machine Learning Workloads on Memory-Centric Computing Systems}, author = {G{\'o}mez-Luna, Juan and Guo, Yuxin and Brocard, Sylvan and Legriel, Julien and Cimadomo, Remy and Oliveira, Geraldo F and Singh, Gagandeep and Mutlu, Onur}, journal = {arXiv preprint arXiv:2207.07886}, doi = {10.48550/arXiv.2207.07886}, url = {https://doi.org/10.48550/arXiv.2207.07886}, month = apr, year = {2023} }
- [Bioinformatics] Scrooge: A Fast and Memory-Frugal Genomic Sequence Aligner for CPUs, GPUs, and ASICs. Joël Lindegger, Damla Senol Cali, Mohammed Alser, Juan Gómez-Luna, Nika Mansouri Ghiasi, and Onur Mutlu. Bioinformatics, Mar 2023
Pairwise sequence alignment is a very time-consuming step in common bioinformatics pipelines. Speeding up this step requires heuristics, efficient implementations, and/or hardware acceleration. A promising candidate for all of the above is the recently proposed GenASM algorithm. We identify and address three inefficiencies in the GenASM algorithm: it has a high amount of data movement, a large memory footprint, and does some unnecessary work. We propose Scrooge, a fast and memory-frugal genomic sequence aligner. Scrooge includes three novel algorithmic improvements which reduce the data movement, memory footprint, and the number of operations in the GenASM algorithm. We provide efficient open-source implementations of the Scrooge algorithm for CPUs and GPUs, which demonstrate the significant benefits of our algorithmic improvements. For long reads, the CPU version of Scrooge achieves a 20.1x, 1.7x, and 2.1x speedup over KSW2, Edlib, and a CPU implementation of GenASM, respectively. The GPU version of Scrooge achieves a 4.0x, 80.4x, 6.8x, 12.6x, and 5.9x speedup over the CPU version of Scrooge, KSW2, Edlib, Darwin-GPU, and a GPU implementation of GenASM, respectively. We estimate an ASIC implementation of Scrooge to use 3.6x less chip area and 2.1x less power than a GenASM ASIC while maintaining the same throughput. Further, we systematically analyze the throughput and accuracy behavior of GenASM and Scrooge under various configurations. As the best configuration of Scrooge depends on the computing platform, we make several observations that can help guide future implementations of Scrooge. Source code is available at https://github.com/CMU-SAFARI/Scrooge.
@article{Lindegger2023, author = {Lindegger, Joël and Cali, Damla Senol and Alser, Mohammed and Gómez-Luna, Juan and Ghiasi, Nika Mansouri and Mutlu, Onur}, journal = {Bioinformatics}, title = {Scrooge: A Fast and Memory-Frugal Genomic Sequence Aligner for {CPUs}, {GPUs}, and {ASICs}.}, year = {2023}, issn = {1367-4811}, month = mar, citation-subset = {IM}, country = {England}, doi = {10.1093/bioinformatics/btad151}, issn-linking = {1367-4803}, nlm-id = {9808944}, owner = {NLM}, pii = {btad151}, pmid = {36961334}, pubmodel = {Print-Electronic}, pubstate = {aheadofprint}, revised = {2023-03-24}, }
- [Bioinformatics] A framework for high-throughput sequence alignment using real processing-in-memory systems. Safaa Diab, Amir Nassereldine, Mohammed Alser, Juan Gómez Luna, Onur Mutlu, and Izzat El Hajj. Bioinformatics, Mar 2023
Sequence alignment is a memory-bound computation whose performance in modern systems is limited by the memory bandwidth bottleneck. Processing-in-memory (PIM) architectures alleviate this bottleneck by providing the memory with computing competencies. We propose Alignment-in-Memory (AIM), a framework for high-throughput sequence alignment using PIM, and evaluate it on UPMEM, the first publicly available general-purpose programmable PIM system. Our evaluation shows that a real PIM system can substantially outperform server-grade multi-threaded CPU systems running at full scale when performing sequence alignment for a variety of algorithms, read lengths, and edit distance thresholds. We hope that our findings inspire more work on creating and accelerating bioinformatics algorithms for such real PIM systems. Our code is available at https://github.com/safaad/aim.
@article{Diab2023, author = {Diab, Safaa and Nassereldine, Amir and Alser, Mohammed and Gómez Luna, Juan and Mutlu, Onur and El Hajj, Izzat}, journal = {Bioinformatics}, title = {A framework for high-throughput sequence alignment using real processing-in-memory systems.}, issn = {1367-4811}, month = mar, volume = {39}, citation-subset = {IM}, completed = {2023-05-08}, country = {England}, doi = {10.1093/bioinformatics/btad155}, issn-linking = {1367-4803}, issue = {5}, keywords = {Sequence Alignment; Algorithms; Software; Computational Biology; Sequence Analysis, DNA; High-Throughput Nucleotide Sequencing}, nlm-id = {9808944}, owner = {NLM}, pii = {btad155}, pmc = {PMC10159653}, pmid = {36971586}, pubmodel = {Print}, pubstate = {ppublish}, revised = {2023-05-09}, year = {2023} }
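For reference, the sketch below shows the classic Wagner-Fischer edit-distance recurrence of the kind AIM offloads to PIM, with a row-wise early exit once a distance threshold is exceeded. The threshold handling is an illustrative assumption, not AIM's exact kernels.

```python
# Edit-distance kernel sketch with an optional threshold (illustrative).
def edit_distance(read, ref, k=None):
    n, m = len(read), len(ref)
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if read[i-1] == ref[j-1] else 1
            cur[j] = min(prev[j-1] + cost,  # match / substitution
                         prev[j] + 1,       # deletion
                         cur[j-1] + 1)      # insertion
        if k is not None and min(cur) > k:
            return None  # entire row exceeds the threshold: give up early
        prev = cur
    return prev[m]

print(edit_distance("ACGTTA", "ACGATA"))  # -> 1 (one substitution)
```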
- [JETCAS] AM4: MRAM Crossbar Based CAM/TCAM/ACAM/AP for In-Memory Computing. Esteban Garzón, Marco Lanuzza, Adam Teman, and Leonid Yavits. IEEE J. Emerg. Sel. Topics Circuits Syst., Mar 2023
In-memory computing seeks to minimize data movement and alleviate the memory wall by computing in situ, in the same place where the data is located. One of the key emerging technologies that promises to enable such computing-in-memory is spin-transfer torque magnetic tunnel junction (STT-MTJ). This paper proposes AM4, a combined STT-MTJ-based Content Addressable Memory (CAM), Ternary CAM (TCAM), approximate-matching (similarity search) CAM (ACAM), and in-memory Associative Processor (AP) design, inspired by the recently announced Samsung MRAM crossbar. We demonstrate and evaluate the performance and energy efficiency of the AM4-based AP using a variety of data-intensive workloads. We show that an AM4-based AP outperforms state-of-the-art solutions both in performance (with an average speedup of about 10x) and energy efficiency (by about 60x on average).
@article{Garzon2023, author = {Garz{\'{o}}n, Esteban and Lanuzza, Marco and Teman, Adam and Yavits, Leonid}, journal = {{IEEE} J. Emerg. Sel. Topics Circuits Syst.}, title = {AM4: {MRAM} Crossbar Based {CAM/TCAM/ACAM/AP} for In-Memory Computing}, number = {1}, pages = {408--421}, volume = {13}, bibsource = {dblp computer science bibliography, https://dblp.org}, biburl = {https://dblp.org/rec/journals/esticas/GarzonLTY23.bib}, doi = {10.1109/JETCAS.2023.3243222}, month = mar, year = {2023} }
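The following toy Python model shows only the lookup semantics of the approximate-matching CAM (ACAM) described above: every stored row is compared against the query, and rows within a Hamming-distance threshold report a match. In AM4 this comparison happens in parallel inside the MRAM crossbar in a single lookup.

```python
# Software model of an approximate-matching CAM lookup (semantics only).
def acam_lookup(stored_rows, query, max_mismatches=2):
    """Return indices of rows within `max_mismatches` Hamming distance."""
    hits = []
    for idx, row in enumerate(stored_rows):
        dist = sum(a != b for a, b in zip(row, query))
        if dist <= max_mismatches:
            hits.append(idx)
    return hits

rows = ["ACGTACGT", "ACGTTCGT", "TTTTACGT", "ACGAACGA"]
print(acam_lookup(rows, "ACGTACGT"))  # -> [0, 1, 3]
```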
- [NARGAB] BLEND: a fast, memory-efficient and accurate mechanism to find fuzzy seed matches in genome analysis. Can Firtina, Jisung Park, Mohammed Alser, Jeremie S. Kim, Damla Senol Cali, Taha Shahroodi, Nika Mansouri Ghiasi, Gagandeep Singh, Konstantinos Kanellopoulos, Can Alkan, and Onur Mutlu. NAR Genomics and Bioinformatics, Mar 2023
Generating the hash values of short subsequences, called seeds, enables quickly identifying similarities between genomic sequences by matching seeds with a single lookup of their hash values. However, these hash values can be used only for finding exact-matching seeds as the conventional hashing methods assign distinct hash values for different seeds, including highly similar seeds. Finding only exact-matching seeds causes either (i) increasing the use of the costly sequence alignment or (ii) limited sensitivity. We introduce BLEND, the first efficient and accurate mechanism that can identify both exact-matching and highly similar seeds with a single lookup of their hash values, called fuzzy seed matches. BLEND (i) utilizes a technique called SimHash, which can generate the same hash value for similar sets, and (ii) provides the proper mechanisms for using seeds as sets with the SimHash technique to find fuzzy seed matches efficiently. We show the benefits of BLEND when used in read overlapping and read mapping. For read overlapping, BLEND is faster by 2.4x–83.9x (on average 19.3x), has a lower memory footprint by 0.9x–14.1x (on average 3.8x), and finds higher-quality overlaps leading to more accurate de novo assemblies than the state-of-the-art tool, minimap2. For read mapping, BLEND is faster by 0.8x–4.1x (on average 1.7x) than minimap2. Source code is available at https://github.com/CMU-SAFARI/BLEND.
@article{firtina_blend_2023, title = {{BLEND}: a fast, memory-efficient and accurate mechanism to find fuzzy seed matches in genome analysis}, volume = {5}, url = {https://academic.oup.com/nargab/article/5/1/lqad004/6993940}, doi = {10.1093/nargab/lqad004}, number = {1}, journal = {NAR Genomics and Bioinformatics}, author = {Firtina, Can and Park, Jisung and Alser, Mohammed and Kim, Jeremie S and Cali, Damla Senol and Shahroodi, Taha and Ghiasi, Nika Mansouri and Singh, Gagandeep and Kanellopoulos, Konstantinos and Alkan, Can and Mutlu, Onur}, month = mar, year = {2023}, pages = {lqad004}, }
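To illustrate the SimHash mechanism BLEND builds on, the sketch below hashes a seed as a set of overlapping items and keeps, for each output bit, the majority vote across the items' hashes. The item decomposition, hash function, and bit width are illustrative assumptions rather than BLEND's actual design; the property to notice is that a small change to the seed flips few, if any, output bits.

```python
# SimHash sketch for fuzzy seed matching (illustrative parameters).
import hashlib

def simhash(items, bits=32):
    counts = [0] * bits
    for item in items:
        h = int.from_bytes(
            hashlib.blake2b(item.encode(), digest_size=8).digest(), "big")
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    # Keep the majority bit in each position.
    return sum((counts[b] > 0) << b for b in range(bits))

def seed_items(seed, k=4):
    # Treat a seed as the set of its overlapping k-long items.
    return [seed[i:i + k] for i in range(len(seed) - k + 1)]

s1 = "ACGTACGTACGTACG"
s2 = "ACGTACGAACGTACG"  # one base differs
print(hex(simhash(seed_items(s1))), hex(simhash(seed_items(s2))))
# The two values are identical or differ in very few bits, so similar
# seeds can collide in a single hash-table lookup.
```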
- [Chips] Approximate Content-Addressable Memories: A Review. Esteban Garzón, Leonid Yavits, Adam Teman, and Marco Lanuzza. Chips, Mar 2023
Content-addressable memory (CAM) has been part of the memory market for more than five decades. CAM can carry out a single-clock-cycle lookup based on the content rather than an address. Thanks to this attractive feature, CAM is utilized in memory systems where a high-speed content lookup technique is required. However, typical CAM applications only support exact matching, as opposed to approximate matching, where a certain Hamming distance (several mismatching characters between a query pattern and the dataset stored in CAM) needs to be tolerated. Recent interest in approximate search has led to the development of new CAM-based alternatives, accelerating the processing of large data workloads in the realm of big data, genomics, and other data-intensive applications. In this review, we provide an overview of approximate CAM and describe its current and potential applications that would benefit from approximate search computing.
@article{Garzon2023a, author = {Garzón, Esteban and Yavits, Leonid and Teman, Adam and Lanuzza, Marco}, journal = {Chips}, title = {Approximate Content-Addressable Memories: A Review}, year = {2023}, issn = {2674-0729}, number = {2}, pages = {70--82}, volume = {2}, doi = {10.3390/chips2020005}, owner = {calkan}, url = {https://www.mdpi.com/2674-0729/2/2/5}, }
Posters
2023
- [RECOMB] Characterization of Alignment and Search Algorithms for Short Read, Long Read, and Graph Mappers. Ecem İlgün, Ömer Yavuz Öztürk, Klea Zambaku, Juan Gómez Luna, Mohammed Alser, Ricardo Román-Brenes, The BioPIM Project, and Can Alkan. In RECOMB 2023, Apr 2023
We recently started a project funded by the Horizon Europe program, BioPIM, which aims to accelerate various bioinformatics algorithms using processing-in-memory (PIM) technologies. PIM is a type of computer architecture that aims to solve the issue of data movement between the CPU and memory being a bottleneck in data-intensive applications. PIM accomplishes this by integrating computing units directly into the memory chip, which reduces latency and increases bandwidth by bringing the computing units closer to the memory. Since read mapping is a crucial step in almost all genome analysis studies, we first aimed to understand how to accelerate read mapping in this project. As the first step, we analyzed the computational workload of BWA-MEM and Bowtie2 for short reads, NGMLR, Minimap2 and LRA for long reads, and finally minigraph, vg, GraphAligner and GWFA for read-to-graph alignment. Here we present a first-pass workload analysis using the Intel VTune tool to identify the most time- and memory-bandwidth-consuming functions used by these algorithms. Our preliminary results show that the resource usage of these algorithms varies significantly depending on the type of data and the algorithm used. We have identified several functions that could potentially be improved by PIM. Furthermore, we discuss processing-in-memory architectures to accelerate alignment and search algorithms for resequencing experiments. Overall, our study provides insights into the resource requirements of different alignment and search algorithms for different types of sequencing data, which can guide the selection of the most appropriate algorithms for different resequencing experiments. Our findings can also inform the development of more efficient and accurate algorithms for processing sequencing data, which is critical for advancing our understanding of the genetic basis of complex diseases.
@inproceedings{ilgun_characterization_2023, booktitle = {RECOMB 2023}, series = {{RECOMB} '23}, title = {Characterization of Alignment and Search Algorithms for Short Read, Long Read, and Graph Mappers}, author = {İlgün, Ecem and Öztürk, Ömer Yavuz and Zambaku, Klea and Luna, Juan Gómez and Alser, Mohammed and Román-Brenes, Ricardo and Project, The BioPIM and Alkan, Can}, note = {Poster}, url = {https://docs.google.com/document/d/1OK2oSsDjQYuo20jcJXwf8H5H_PMF6t_MOWO9QZOkFPg/view}, year = {2023}, month = apr, }
Related Publications
2022
- [ISCA] EDAM: Edit Distance Tolerant Approximate Matching Content Addressable Memory. Robert Hanhan, Esteban Garzón, Zuher Jahshan, Adam Teman, Marco Lanuzza, and Leonid Yavits. In Proceedings of the 49th Annual International Symposium on Computer Architecture (ISCA), 2022
We propose a novel edit distance-tolerant content addressable memory (EDAM) for energy-efficient approximate search applications. Unlike state-of-the-art approximate search solutions that tolerate certain Hamming distance between the query pattern and the stored data, EDAM tolerates edit distance, which makes it especially efficient in applications such as text processing and genome analysis. EDAM was designed using a commercial 65 nm 1.2 V CMOS technology and evaluated through extensive Monte Carlo simulations, while considering different process corners. Simulation results show that EDAM can achieve robust approximate search operation with a wide range of edit distance threshold levels. EDAM is functionally evaluated as a pathogen DNA detection and classification accelerator. EDAM achieves up to 1.7x higher F1 score for high-quality DNA reads and up to 19.55x higher F1 score for DNA reads with 15% error rate, compared to the state-of-the-art DNA classification tool Kraken2. Simulated at 667 MHz, EDAM provides 1,214x average speedup over Kraken2. This makes EDAM suitable for hardware acceleration of genomic surveillance of outbreaks, such as the ongoing COVID-19 pandemic.
@inproceedings{hanhan_edam_2022, address = {New York, NY, USA}, series = {{ISCA} '22}, title = {{EDAM}: {Edit} {Distance} {Tolerant} {Approximate} {Matching} {Content} {Addressable} {Memory}}, isbn = {978-1-4503-8610-4}, url = {https://doi.org/10.1145/3470496.3527424}, doi = {10.1145/3470496.3527424}, booktitle = {Proceedings of the 49th {Annual} {International} {Symposium} on {Computer} {Architecture}}, publisher = {Association for Computing Machinery}, author = {Hanhan, Robert and Garzón, Esteban and Jahshan, Zuher and Teman, Adam and Lanuzza, Marco and Yavits, Leonid}, year = {2022}, pages = {495--507}, }
2021
- [VLSI Tech.] HERMES Core – A 14nm CMOS and PCM-based In-Memory Compute Core using an array of 300ps/LSB Linearized CCO-based ADCs and local digital processing. R. Khaddam-Aljameh, M. Stanisavljevic, J. Fornt Mas, G. Karunaratne, M. Braendli, F. Liu, A. Singh, S. M. Müller, U. Egger, A. Petropoulos, T. Antonakopoulos, K. Brew, S. Choi, I. Ok, F. L. Lie, N. Saulnier, V. Chan, I. Ahsan, V. Narayanan, S. R. Nandakumar, M. Le Gallo, P. A. Francese, A. Sebastian, and E. Eleftheriou. In 2021 Symposium on VLSI Technology, 2021
We present a 256x256 in-memory compute (IMC) core designed and fabricated in 14nm CMOS with backend-integrated multi-level phase-change memory (PCM). It comprises 256 linearized current controlled oscillator (CCO)-based ADCs at a compact 4µm pitch and a local digital processing unit performing affine scaling and ReLU operations. A novel frequency-linearization technique for CCOs is introduced, leading to accurate on-chip matrix-vector-multiply (MVM) when operating at over 1 GHz. Measured classification accuracies on MNIST and CIFAR-10 datasets are presented when two cores are employed for deep learning (DL) inference. The measured energy efficiency is 10.5 TOPS/W at a performance density of 1.59 TOPS/mm².
@inproceedings{khaddam-aljameh_hermes_2021, title = {{HERMES} {Core} – {A} 14nm {CMOS} and {PCM}-based {In}-{Memory} {Compute} {Core} using an array of 300ps/{LSB} {Linearized} {CCO}-based {ADCs} and local digital processing}, booktitle = {2021 {Symposium} on {VLSI} {Technology}}, author = {Khaddam-Aljameh, R. and Stanisavljevic, M. and Mas, J. Fornt and Karunaratne, G. and Braendli, M. and Liu, F. and Singh, A. and Müller, S. M. and Egger, U. and Petropoulos, A. and Antonakopoulos, T. and Brew, K. and Choi, S. and Ok, I. and Lie, F. L. and Saulnier, N. and Chan, V. and Ahsan, I. and Narayanan, V. and Nandakumar, S. R. and Gallo, M. Le and Francese, P. A. and Sebastian, A. and Eleftheriou, E.}, year = {2021}, pages = {1--2}, }
2020
- [BIBM] Variant Calling Parallelization on Processor-in-Memory Architecture. D. Lavenier, R. Cimadomo, and R. Jodin. In 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Dec 2020
This paper introduces a new combination of software and hardware PIM (Processing-in-Memory) architecture to accelerate the variant calling genomic process. PIM translates into bringing data-intensive calculations directly where the data is: within the DRAM, enhanced with thousands of processing units. The energy consumption, in large part due to data movement, is significantly lowered at a marginal additional hardware cost. Such a design allows an unprecedented level of parallelism to process billions of short reads. Experiments on real PIM devices developed by the UPMEM company show significant speed-up compared to a pure software implementation. The PIM solution also compares favorably to FPGA- or GPU-based acceleration, delivering similar to twice the processing speed while being 5 to 8 times cheaper to deploy and consuming up to 6 times less power.
@inproceedings{lavenier_variant_2020, address = {Los Alamitos, CA, USA}, title = {Variant {Calling} {Parallelization} on {Processor}-in-{Memory} {Architecture}}, url = {https://doi.ieeecomputersociety.org/10.1109/BIBM49941.2020.9313351}, doi = {10.1109/BIBM49941.2020.9313351}, booktitle = {2020 {IEEE} {International} {Conference} on {Bioinformatics} and {Biomedicine} ({BIBM})}, publisher = {IEEE Computer Society}, author = {Lavenier, D. and Cimadomo, R. and Jodin, R.}, month = dec, year = {2020}, pages = {204--207}, }
- [MICRO] GenASM: A High-Performance, Low-Power Approximate String Matching Acceleration Framework for Genome Sequence Analysis. Damla Senol Cali, Gurpreet S. Kalsi, Zülal Bingöl, Can Firtina, Lavanya Subramanian, Jeremie S. Kim, Rachata Ausavarungnirun, Mohammed Alser, Juan Gomez-Luna, Amirali Boroumand, Anant Nori, Allison Scibisz, Sreenivas Subramoney, Can Alkan, Saugata Ghose, and Onur Mutlu. In 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec 2020
Genome sequence analysis has enabled significant advancements in medical and scientific areas such as personalized medicine, outbreak tracing, and the understanding of evolution. To perform genome sequencing, devices extract small random fragments of an organism’s DNA sequence (known as reads). The first step of genome sequence analysis is a computational process known as read mapping. In read mapping, each fragment is matched to its potential location in the reference genome with the goal of identifying the original location of each read in the genome. Unfortunately, rapid genome sequencing is currently bottlenecked by the computational power and memory bandwidth limitations of existing systems, as many of the steps in genome sequence analysis must process a large amount of data. A major contributor to this bottleneck is approximate string matching (ASM), which is used at multiple points during the mapping process. ASM enables read mapping to account for sequencing errors and genetic variations in the reads. We propose GenASM, the first ASM acceleration framework for genome sequence analysis. GenASM performs bitvector-based ASM, which can efficiently accelerate multiple steps of genome sequence analysis. We modify the underlying ASM algorithm (Bitap) to significantly increase its parallelism and reduce its memory footprint. Using this modified algorithm, we design the first hardware accelerator for Bitap. Our hardware accelerator consists of specialized systolic-array-based compute units and on-chip SRAMs that are designed to match the rate of computation with memory capacity and bandwidth, resulting in an efficient design whose performance scales linearly as we increase the number of compute units working in parallel. We demonstrate that GenASM provides significant performance and power benefits for three different use cases in genome sequence analysis. First, GenASM accelerates read alignment for both long reads and short reads. For long reads, GenASM outperforms state-of-the-art software and hardware accelerators by 116x and 3.9x, respectively, while reducing power consumption by 37x and 2.7x. For short reads, GenASM outperforms state-of-the-art software and hardware accelerators by 111x and 1.9x. Second, GenASM accelerates pre-alignment filtering for short reads, with 3.7x the performance of a state-of-the-art pre-alignment filter, while reducing power consumption by 1.7x and significantly improving the filtering accuracy. Third, GenASM accelerates edit distance calculation, with 22-12501x and 9.3-400x speedups over the state-of-the-art software library and FPGA-based accelerator, respectively, while reducing power consumption by 548-582x and 67x. We conclude that GenASM is a flexible, high-performance, and low-power framework, and we briefly discuss four other use cases that can benefit from GenASM.
@inproceedings{cali_genasm_2020, title = {{GenASM}: {A} {High}-{Performance}, {Low}-{Power} {Approximate} {String} {Matching} {Acceleration} {Framework} for {Genome} {Sequence} {Analysis}}, doi = {10.1109/MICRO50266.2020.00081}, booktitle = {2020 53rd {Annual} {IEEE}/{ACM} {International} {Symposium} on {Microarchitecture} ({MICRO})}, author = {Cali, Damla Senol and Kalsi, Gurpreet S. and Bingöl, Zülal and Firtina, Can and Subramanian, Lavanya and Kim, Jeremie S. and Ausavarungnirun, Rachata and Alser, Mohammed and Gomez-Luna, Juan and Boroumand, Amirali and Nori, Anant and Scibisz, Allison and Subramoney, Sreenivas and Alkan, Can and Ghose, Saugata and Mutlu, Onur}, year = {2020}, pages = {951--966}, }
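For context, here is the textbook bitvector (Bitap) recurrence that GenASM modifies, in a minimal Python sketch that handles substitutions only; GenASM's modified algorithm also covers insertions and deletions, reduces the working set, and restructures the recurrence for hardware parallelism.

```python
# Bitap (shift-or) approximate matching sketch, substitutions only.
def bitap_fuzzy(text, pattern, k=1):
    """Start index of the first match with up to k substitutions, or -1."""
    m = len(pattern)
    mask = {}  # bit i of mask[c] is 0 iff pattern[i] == c
    for i, ch in enumerate(pattern):
        mask[ch] = mask.get(ch, ~0) & ~(1 << i)
    R = [~1] * (k + 1)  # R[d]: 0-bits mark prefixes matching with d errors
    for i, ch in enumerate(text):
        old = R[0]
        R[0] = (R[0] | mask.get(ch, ~0)) << 1
        for d in range(1, k + 1):
            tmp = R[d]
            # survive if the level-d match continues, or if level d-1
            # matched up to the previous position (substitute this char)
            R[d] = (old & (R[d] | mask.get(ch, ~0))) << 1
            old = tmp
        if (R[k] & (1 << m)) == 0:  # whole pattern matched with <= k errors
            return i - m + 1
    return -1

print(bitap_fuzzy("GGACGTTACGT", "ACGAT", k=1))  # -> 2 ("ACGTT", 1 sub)
```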
- [SYSTOR] BioSEAL: In-Memory Biological Sequence Alignment Accelerator for Large-Scale Genomic Data. Roman Kaplan, Leonid Yavits, and Ran Ginosar. In Proceedings of the 13th ACM International Systems and Storage Conference (SYSTOR), 2020
Genome sequences contain hundreds of millions of DNA base pairs. Finding the degree of similarity between two genomes requires executing a compute-intensive dynamic programming algorithm, such as Smith-Waterman. Traditional von Neumann architectures have limited parallelism and cannot provide an efficient solution for large-scale genomic data. Approximate heuristic methods (e.g., BLAST) are commonly used. However, they are suboptimal and still compute-intensive. In this work, we present BioSEAL, a biological sequence alignment accelerator. BioSEAL is a massively parallel non-von Neumann processing-in-memory architecture for large-scale DNA and protein sequence alignment. BioSEAL is based on resistive content addressable memory, capable of energy-efficient and high-performance associative processing. We present an associative processing algorithm for entire database sequence alignment on BioSEAL and compare its performance and power consumption with state-of-the-art solutions. We show that BioSEAL can achieve up to 57x speedup and 156x better energy efficiency, compared with existing solutions for genome sequence alignment and protein sequence database search.
@inproceedings{kaplan_bioseal_2020, address = {New York, NY, USA}, series = {{SYSTOR} '20}, title = {{BioSEAL}: {In}-{Memory} {Biological} {Sequence} {Alignment} {Accelerator} for {Large}-{Scale} {Genomic} {Data}}, isbn = {978-1-4503-7588-7}, url = {https://doi.org/10.1145/3383669.3398279}, doi = {10.1145/3383669.3398279}, booktitle = {Proceedings of the 13th {ACM} {International} {Systems} and {Storage} {Conference}}, publisher = {Association for Computing Machinery}, author = {Kaplan, Roman and Yavits, Leonid and Ginosar, Ran}, year = {2020}, pages = {36--48}, }
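For reference, here is the Smith-Waterman recurrence that BioSEAL maps onto associative processing, as a plain Python scoring sketch; the match/mismatch/gap scores are illustrative choices, and only the best score (not the traceback) is computed.

```python
# Smith-Waterman local alignment score sketch (illustrative scoring).
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            # Local alignment: scores never drop below zero.
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman("ACACACTA", "AGCACACA"))  # -> 12
```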
2019
- [IEEE Micro] RASSA: Resistive Prealignment Accelerator for Approximate DNA Long Read Mapping. Roman Kaplan, Leonid Yavits, and Ran Ginosar. IEEE Micro, Jul 2019
DNA read mapping is a computationally expensive bioinformatics task, required for genome assembly and consensus polishing. It requires finding the best-fitting location of each DNA read on a long reference sequence. A novel resistive approximate similarity search accelerator (RASSA) exploits charge distribution and parallel in-memory processing to reflect a mismatch count between DNA sequences. RASSA implementation of DNA long-read prealignment outperforms the state-of-the-art solution, minimap2, by 16–77x with comparable accuracy and provides two orders of magnitude higher throughput than GateKeeper, a short-read prealignment hardware architecture implemented in FPGA.
@article{kaplan_rassa_2019, title = {{RASSA}: {Resistive} {Prealignment} {Accelerator} for {Approximate} {DNA} {Long} {Read} {Mapping}}, volume = {39}, issn = {0272-1732}, url = {https://doi.org/10.1109/MM.2018.2890253}, doi = {10.1109/MM.2018.2890253}, number = {4}, journal = {IEEE Micro}, author = {Kaplan, Roman and Yavits, Leonid and Ginosar, Ran}, month = jul, year = {2019}, pages = {44--54}, }
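A toy software model of the prealignment idea RASSA implements in resistive hardware: slide the read along the reference, count position-wise matches (in hardware, via parallel charge accumulation across all positions at once), and keep only locations above a match threshold for full alignment. The threshold and sequences are illustrative.

```python
# Sliding mismatch-count prealignment sketch (illustrative threshold).
def prealign_candidates(reference, read, min_match_frac=0.8):
    L = len(read)
    hits = []
    for pos in range(len(reference) - L + 1):
        matches = sum(r == c for r, c in zip(read, reference[pos:pos + L]))
        if matches >= min_match_frac * L:  # cheap filter before alignment
            hits.append((pos, matches))
    return hits

ref = "TTGACCTAGGACCTTAGGACCTA"
print(prealign_candidates(ref, "AGGACCT"))  # -> [(7, 7), (15, 7)]
```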
2018
- [BMC Genomics] GRIM-Filter: Fast seed location filtering in DNA read mapping using processing-in-memory technologies. Jeremie S. Kim, Damla Senol Cali, Hongyi Xin, Donghyuk Lee, Saugata Ghose, Mohammed Alser, Hasan Hassan, Oguz Ergin, Can Alkan, and Onur Mutlu. BMC Genomics, May 2018
Seed location filtering is critical in DNA read mapping, a process where billions of DNA fragments (reads) sampled from a donor are mapped onto a reference genome to identify genomic variants of the donor. State-of-the-art read mappers 1) quickly generate possible mapping locations for seeds (i.e., smaller segments) within each read, 2) extract reference sequences at each of the mapping locations, and 3) check similarity between each read and its associated reference sequences with a computationally-expensive algorithm (i.e., sequence alignment) to determine the origin of the read. A seed location filter comes into play before alignment, discarding seed locations that alignment would deem a poor match. The ideal seed location filter would discard all poor match locations prior to alignment such that there is no wasted computation on unnecessary alignments.
@article{kim_grim-filter_2018, title = {{GRIM}-{Filter}: {Fast} seed location filtering in {DNA} read mapping using processing-in-memory technologies}, volume = {19}, issn = {1471-2164}, url = {https://doi.org/10.1186/s12864-018-4460-0}, doi = {10.1186/s12864-018-4460-0}, number = {2}, journal = {BMC Genomics}, author = {Kim, Jeremie S. and Senol Cali, Damla and Xin, Hongyi and Lee, Donghyuk and Ghose, Saugata and Alser, Mohammed and Hassan, Hasan and Ergin, Oguz and Alkan, Can and Mutlu, Onur}, month = may, year = {2018}, pages = {89}, }
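A small sketch of the bin-and-bitvector filtering idea behind GRIM-Filter: precompute, for each genome bin, a bitvector recording which q-grams occur there, then accept a candidate seed location only if enough of the read's q-grams exist in that bin. Bin contents, q, and the acceptance threshold are illustrative assumptions, not GRIM-Filter's actual parameters.

```python
# q-gram existence-bitvector filter sketch (illustrative parameters).
from itertools import product

Q = 3
TOKENS = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=Q))}

def bin_bitvector(bin_seq):
    # One bit per possible q-gram: set if the q-gram occurs in this bin.
    bv = 0
    for i in range(len(bin_seq) - Q + 1):
        bv |= 1 << TOKENS[bin_seq[i:i + Q]]
    return bv

def passes_filter(read, bv, min_hit_frac=0.9):
    grams = [read[i:i + Q] for i in range(len(read) - Q + 1)]
    hits = sum(1 for g in grams if (bv >> TOKENS[g]) & 1)
    return hits >= min_hit_frac * len(grams)  # else skip costly alignment

bv = bin_bitvector("ACGTACGTTTACGGA")
print(passes_filter("ACGTACG", bv))  # True: all its 3-mers occur in the bin
print(passes_filter("GGGGGGG", bv))  # False: GGG never occurs in the bin
```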
2016
- [BIBM] DNA mapping using Processor-in-Memory architecture. Dominique Lavenier, Jean-Francois Roy, and David Furodet. In 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2016
This paper presents the implementation of a mapping algorithm on a new Processing-in-Memory (PIM) architecture developed by the UPMEM Company. UPMEM’s solution consists of adding processing units into the DRAM, to minimize data access time and maximize bandwidth, in order to drastically accelerate data-consuming algorithms. The technology developed by UPMEM makes it possible to combine 256 cores with 16 GBytes of DRAM on a standard DIMM module. Experiments with DNA mapping on a human genome dataset show that a speed-up of 25x can be obtained with UPMEM technology compared to fast mapping software such as BWA, Bowtie2 or NextGenMap running on 16 Intel threads. Experiments also highlight that data transfer from the storage device limits the performance of the implementation. The use of SSD drives can boost the speed-up to 80x.
@inproceedings{lavenier_dna_2016, title = {{DNA} mapping using {Processor}-in-{Memory} architecture}, doi = {10.1109/BIBM.2016.7822732}, booktitle = {2016 {IEEE} {International} {Conference} on {Bioinformatics} and {Biomedicine} ({BIBM})}, author = {Lavenier, Dominique and Roy, Jean-Francois and Furodet, David}, year = {2016}, pages = {1429--1435}, }