Publications
2024
- IEEEXploreMiMyCS: A Processing-in-Memory Read Mapper for Compressing Next-Gen Sequencing DatasetsFlorestan Moor, Meven Mognol, Charles Deltel, Erwan Drezen, Julien Legriel, and Dominique LavenierDec 2024
As Next-Gen sequencing (NGS) technologies keep improving their accuracy and get largely deployed in human health care infrastructures, it is critical to design efficient reference-based compressors that fully leverage the capabilities of modern processors and hardware accelerators. This work proposes MiMyCS: a C++ software to achieve Mapping in Memory for Compressing Short reads. It performs lossless reference-based compression of NGS datasets such as Illumina reads. To this end, MiMyCS computes a non-exhaustive mapping against a reference genome and accelerates this step with the Processing-in-Memory architecture developed by the UPMEM company. Such architecture extends the computational power of a machine by adding dual in-line memory modules on which each memory bank has its own processing unit that runs up to 16 threads. This creates a massively parallel environment, well-fitted to alleviate memory bottlenecks. To reduce the overall amount of sequence comparisons and accelerate further the process, MiMyCS also incorporates a Bloom filters-based dispatcher that predicts against which genome parts reads are most likely to be mapped. We show with real whole human sequencing datasets that MiMyCS is able to achieve a speed-up between 1.2x and 2.7x compared to Genozip, the current leading state-of-the-art compressor, while maintaining a comparable compression ratio and lowering the overall energy consumption. The code of MiMyCS is available at https://gitlab.inria.fr/pim/org.pim.srm.
- IEEE AccessRawAlign: Accurate, Fast, and Scalable Raw Nanopore Signal Mapping via Combining Seeding and AlignmentJoël Lindegger, Can Firtina, Nika Mansouri Ghiasi, Mohammad Sadrosadati, Mohammed Alser, and Onur MutluIEEE Access, Dec 2024
Nanopore sequencers generate raw electrical signals representing the contents of a biological sequence molecule passing through the nanopore. These signals can be analyzed directly, avoiding basecalling entirely. We observe that while existing proposals for raw signal analysis typically do well in all metrics for small genomes (e.g., viral genomes), they all perform poorly for large genomes (e.g., the human genome). Our goal is to analyze raw nanopore signals in an accurate, fast, and scalable manner. To this end, we propose RawAlign, the first work to integrate fine-grained signal alignment into the state-of-the-art raw signal mapper. To enable accurate, fast, and scalable mapping with alignment, RawAlign implements three algorithmic improvements and hardware acceleration via a vectorized implementation of fine-grained alignment. Together, these significantly reduce the overhead of typically computationally expensive fine-grained alignment. Our extensive evaluations on different use cases and various datasets show RawAlign provides 1) the most accurate mapping for large genomes and 2) and on-par performance compared to RawHash (between 0.80x-1.08x), while achieving better performance than UNCALLED and Sigmap by on average (geo. mean) 2.83x and 2.06x, respectively.
- IEEE VLSIDesigning Precharge-Free Energy-Efficient Content-Addressable MemoriesRamiro Taco, Esteban Garzón, Robert Hanhan, Adam Teman, Leonid Yavits, and Marco LanuzzaIEEE Transactions on Very Large Scale Integration (VLSI) Systems, Oct 2024
Content-addressable memory (CAM) is a specialized type of memory that facilitates massively parallel comparison of a search pattern against its entire content. State-of-the-art (SOTA) CAM solutions are either fast but power-hungry (NOR CAM) or slow while consuming less power (nand CAM). These limitations stem from the dynamic precharge operation, leading to excessive power consumption in NOR CAMs and charge-sharing issues in NAND CAMs. In this work, we propose a precharge-free CAM (PCAM) class for energy-efficient applications. By avoiding precharge operation, PCAM consumes less energy than a NAND CAM, while achieving search speed comparable to a NOR CAM. PCAM was designed using a 65-nm CMOS technology and comprehensively evaluated under extensive Monte Carlo (MC) simulations while taking into account layout parasitics. When benchmarked against conventional NAND CAM, PCAM demonstrates improved search run time (reduced by more than 30%) and 15% less search energy. Moreover, PCAM can cut energy consumption by more than 75% when compared to conventional NOR CAM. We further extend our analysis to the application level, functionally evaluating the CAM designs as a fully associative cache using a CPU simulator running various benchmark workloads. This analysis confirms that PCAMs represent an optimal energy-performance design choice for associative memories and their broad spectrum of applications.
- ICPPParallelization of the Banded Needleman & Wunsch Algorithm on UPMEM PiM Architecture for Long DNA Sequence AlignmentMeven Mognol, Dominique Lavenier, and Julien LegrielIn Proceedings of the 53rd International Conference on Parallel Processing, Aug 2024
Sequence alignment is an ubiquitous task in genomics analysis whose performance can be affected by the memory-wall on a processor-centric architecture. Processing-in-Memory (PiM) architectures provide a memory system with integrated computing capabilities to alleviate this bottleneck. In this paper, we present a long read optimized version of the Needleman and Wunsch (N&W) algorithm based on PiM devices developed by the UPMEM company. Such memories add 128 computing units to a standard 16GB DIMM. On a server equipped with 20 UPMEM PiMs, our N&W implementation outperforms standard alignment tools by an order of magnitude compared to a traditional multicore server. Code available at https://github.com/upmem/usecase_dpu_alignment.
- ISCAQUETZAL: Vector Acceleration Framework for Modern Genome Sequence Analysis AlgorithmsJulian Pavon, Ivan Vargas Valdivieso, Carlos Rojas, Cesar Hernandez, Mehmet Aslan, Roger Figueras, Yichao Yuan, Joël Lindegger, Mohammed Alser, Francesc Moll, Santiago Marco-Sola, Oguz Ergin, Nishil Talati, Onur Mutlu, Osman Unsal, Mateo Valero, and Adrian CristalJun 2024
Genome sequence analysis is fundamental to medical breakthroughs such as developing vaccines, enabling genome editing, and facilitating personalized medicine. The exponentially expanding sequencing datasets and complexity of sequencing algorithms necessitate performance enhancements. While the performance of software solutions is constrained by their underlying hardware platforms, the utility of fixed-function accelerators is restricted to only certain sequencing algorithms.This paper presents QUETZAL, the first general-purpose vector acceleration framework designed for high efficiency and broad applicability across a diverse set of genomics algorithms. While a commercial CPU’s vector datapath is a promising candidate to exploit the data-level parallelism in genomics algorithms, our analysis finds that its performance is often limited due to long-latency scatter/gather memory instructions. QUETZAL introduces a hardware-software co-design comprising an accelerator microarchitecture closely integrated with the CPU’s vector datapath, alongside novel vector instructions to fully capitalize on the proposed hardware. QUETZAL integrates a set of scratchpad-style buffers meticulously designed to minimize latency associated with scatter/gather instructions during the retrieval of input genome sequences data. QUETZAL supports both short and long reads, and different types of sequencing data formats. A combination of hardware and software techniques enables QUETZAL to reduce the latency of memory instructions, perform complex computation using a single instruction, and transform data representations at runtime, resulting in overall efficiency gain. QUETZAL significantly accelerates a vectorized CPU baseline on modern genome sequence analysis algorithms by 5.7×, while incurring a small area overhead of 1.4% post place-and-route at the 7nm technology node compared to an HPC ARM CPU.
- ISCAMegIS: High-Performance, Energy-Efficient, and Low-Cost Metagenomic Analysis with In-Storage ProcessingNika Mansouri Ghiasi, Mohammad Sadrosadati, Harun Mustafa, Arvid Gollwitzer, Can Firtina, Julien Eudine, Haiyu Mao, Joël Lindegger, Meryem Banu Cavlak, Mohammed Alser, Jisung Park, and Onur MutluJun 2024
Metagenomics, the study of the genome sequences of diverse organisms in a common environment, has led to significant advances in many fields. Since the species present in a metagenomic sample are not known in advance, metagenomic analysis commonly involves the key tasks of determining the species present in a sample and their relative abundances. These tasks require searching large metagenomic databases containing information on different species’ genomes. Metagenomic analysis suffers from significant data movement overhead due to moving large amounts of low-reuse data from the storage system to the rest of the system. In-storage processing can be a fundamental solution for reducing this overhead. However, designing an in-storage processing system for metagenomics is challenging because existing approaches to metagenomic analysis cannot be directly implemented in storage effectively due to the hardware limitations of modern SSDs.We propose MegIS, the first in-storage processing system designed to significantly reduce the data movement overhead of the end-to-end metagenomic analysis pipeline. MegIS is enabled by our lightweight design that effectively leverages and orchestrates processing inside and outside the storage system. Through our detailed analysis of the end-to-end metagenomic analysis pipeline and careful hardware/software co-design, we address in-storage processing challenges for metagenomics via specialized and efficient 1) task partitioning, 2) data/computation flow coordination, 3) storage technology-aware algorithmic optimizations, 4) data mapping, and 5) lightweight in-storage accelerators. MegIS’s design is flexible, capable of supporting different types of metagenomic input datasets, and can be integrated into various metagenomic analysis pipelines. Our evaluation shows that MegIS outperforms the state-of-the-art performance- and accuracy-optimized software metagenomic tools by 2.7× – 37.2× and 6.9×–100.2×, respectively, while matching the accuracy of the accuracy-optimized tool. MegIS achieves 1.5×–5.1× speedup compared to the state-of-the-art metagenomic hardware-accelerated (using processing-in-memory) tool, while achieving significantly higher accuracy.
- LNCSEnergy Efficiency Impact of Processing in Memory: A Comprehensive Review of Workloads on the UPMEM ArchitectureYann Falevoz, and Julien LegrielApr 2024
Processing-in-Memory (PIM) architectures have emerged as a promising solution for data-intensive applications, providing significant speedup by processing data directly within the memory. However, the impact of PIM on energy efficiency is not well characterized. In this paper, we provide a comprehensive review of workloads ported to the first PIM product available on the market, namely the UPMEM architecture, and quantify the impact on each workload in terms of energy efficiency. Less than the half of the reviewed papers provide insights on the impact of PIM on energy efficiency, and the evaluation methods differ from one paper to the other. To provide a comprehensive overview, we propose a methodology for estimating energy consumption and efficiency for both the PIM and baseline systems at data center level, enabling a direct comparison of the two systems. Our results show that PIM can provide significant energy savings for data intensive workloads. We also identify key factors that impact the energy efficiency of UPMEM PIM, including the workload characteristics. Overall, this paper provides valuable insights for researchers and practitioners looking to optimize energy efficiency in data-intensive applications using UPMEM PIM architecture.
- IEEE CALMajorK: Majority Based kmer Matching in Commodity DRAMIEEE Computer Architecture Letters, Apr 2024
Fast parallel search capabilities on large datasets are required across multiple application domains. One such domain is genome analysis, which requires high-performance k mer matching in large genome databases. Recently proposed solutions implemented k mer matching in DRAM, utilizing its sheer capacity and parallelism. However, their operation is essentially bit-serial, which ultimately limits the performance, especially when matching long strings, as customary in genome analysis pipelines. The proposed solution, MajorK, enables bit-parallel majority based k mer matching in an unmodified commodity DRAM. MajorK employs multiple DRAM row activation, where the search patterns (query k mers) are coded into DRAM addresses. We evaluate MajorK on viral genome k mer matching and show that it can achieve up to 2.7 \times higher performance while providing a better matching accuracy compared to state-of-the-art DRAM based k mer matching accelerators.
- IEEE AccessViRAL: Vision Transformer Based Accelerator for ReAL Time Lineage Assignment of Viral PathogensZuher Jahshan, Esteban Garzón, and Leonid YavitsIEEE Access, Feb 2024
Real-time genome detection, classification and lineage assignment are critical for efficient tracking of emerging mutations and variants during viral pandemics such as Covid-19. For genomic surveillance to work effectively, each new viral genome sequence must be quickly and accurately associated with an existing viral family (lineage). ViRAL is a hardware-accelerated platform for real-time viral genome lineage assignment based on minhashing and Vision Transformer. Minhashing is a locality sensitive hashing based technique for finding regions of similarity within sequenced genomes. Vision Transformer is a model for image classification that employs a Transformer-like architecture over patches of images. In ViRAL, such image patches are genome fragments extracted from the regions of high similarity. ViRAL is especially efficient in lineage assignment of extremely low quality (or highly ambiguous) genomic data, i.e. when a large fraction of DNA bases are missing in an assembled genome. We implement ViRAL on CPU, GPU and a custom-designed hardware accelerator denoted ACMI. ViRAL assigns newly sequenced SARS-CoV-2 genomes to existing lineages with the top-1 accuracy of 94.2%. The probability of the correct assignment to be found among the five most likely placements generated by ViRAL (top-5 accuracy) is 99.8%. Accelerated ViRAL outperforms the fastest state-of-the-art assignment tools by 69.4\times . It also outperforms ViRAL GPU implementation by 19.5\times . ViRAL strongly outperforms the state-of-the-art solutions in assigning highly-ambiguous genomes: while state-of-the-art tools fail to assign lineage to genomes with 50% ambiguity, ViRAL achieves 77.6% assignment accuracy. We make ViRAL available to the research community through GitHub.
- BioinformaticsViTAL: Vision TrAnsformer based Low coverage SARS-CoV-2 lineage assignmentBioinformatics, Feb 2024
Rapid spread of viral diseases such as Coronavirus disease 2019 (COVID-19) highlights an urgent need for efficient surveillance of virus mutation and transmission dynamics, which requires fast, inexpensive and accurate viral lineage assignment. The first two goals might be achieved through low-coverage whole-genome sequencing (LC-WGS) which enables rapid genome sequencing at scale and at reduced costs. Unfortunately, LC-WGS significantly diminishes the genomic details, rendering accurate lineage assignment very challenging. We present ViTAL, a novel deep learning algorithm specifically designed to perform lineage assignment of low coverage-sequenced genomes. ViTAL utilizes a combination of MinHash for genomic feature extraction and Vision Transformer for fine-grain genome classification and lineage assignment. We show that ViTAL outperforms state-of-the-art tools across diverse coverage levels, reaching up to 87.7% lineage assignment accuracy at 1x coverage where state-of-the-art tools such as UShER and Kraken2 achieve the accuracy of 5.4% and 27.4% respectively. ViTAL achieves comparable accuracy results with up to 8x lower coverage than state-of-the-art tools. We explore ViTAL’s ability to identify the lineages of novel genomes, i.e. genomes the Vision Transformer was not trained on. We show how ViTAL can be applied to preliminary phylogenetic placement of novel variants. The data underlying this article are available in https://github.com/zuherJahshan/vital and can be accessed with 10.5281/zenodo.10688110.
- IEEE SSCLOCCAM: An Error Oblivious CAMYuval Harary, Paz Snapir, Eyal Reshef, Esteban Garzón, and Leonid YavitsIEEE Solid-State Circuits Letters, Feb 2024
Content addressable memories (CAMs) are widely used in many applications in general purpose computer microarchitecture, networking and domain-specific hardware accelerators. In addition to storing and reading data, CAMs enable simultaneous compare of query datawords with the entire memory content. Similar to SRAM and DRAM, CAMs are prone to errors and faults. While error correcting codes (ECCs) are widely used in DRAM and SRAM, they are not directly applicable in CAM: if a dataword that is supposed to match a query altered due to an error, it will falsely mismatch even if it is ECC-encoded. We propose OCCAM, an error oblivious CAM, which combines ECC and approximate search (matching) to allow tolerating a large and dynamically configurable number of errors. We manufactured the OCCAM silicon prototype using 65-nm commercial process and verified its error tolerance capabilities through silicon measurements. OCCAM tolerates 11% error rate (7 bit errors in each 64-bit memory row) with 100% sensitivity and specificity.
- IEEE AccessFASTA: Revisiting Fully Associative Memories in Computer MicroarchitectureEsteban Garzón, Robert Hanhan, Marco Lanuzza, Adam Teman, and Leonid YavitsIEEE Access, Jan 2024
Associative access is widely used in fundamental microarchitectural components, such as caches and TLBs. However, associative (or content addressable) memories (CAMs) have been traditionally considered too large, too energy-hungry, and not scalable, and therefore, have limited use in modern computer microarchitecture. This work revisits these presumptions and proposes an energy-efficient fully-associative tag array (FASTA) architecture, based on a novel complementary CAM (CCAM) bitcell. CCAM offers a full CMOS solution for CAM, removing the need for time- and energy-consuming precharge and combining the speed of NOR CAM and low energy consumption of NAND CAM. While providing better performance and energy consumption, CCAM features a larger area compared to state-of-the-art CAM designs. We further show how FASTA can be used to construct a novel aliasing-free, energy-efficient, Very-Many-Way Associative (VMWA) cache. Circuit-level simulations using 16 nm FinFET technology show that a 128 kB FASTA-based 256-way 8-set associative cache is 28% faster and consumes 88% less energy-per-access than a same sized 8-way (256-set) SRAM based cache, while also providing aliasing-free operation. System-level evaluation performed on the Sniper simulator shows that the VMWA cache exhibits lower Misses Per Kilo Instructions (MPKI) for the majority of benchmarks. Specifically, the 256-way associative cache achieves 17.3%, 11.5%, and 1.2% lower average MPKI for L1, L2, and L3 caches, respectively, compared to a 16-way associative cache. The average IPC improvement for L1, L2, and L3 caches are 1.6%, 1.4%, and 0.2%, respectively.
2023
- arXivDRAMA: Commodity DRAM based Content Addressable MemoryarXiv, Dec 2023
Fast parallel search capabilities on large datasets provided by content addressable memories (CAM) are required across multiple application domains. However compared to RAM, CAMs feature high area overhead and power consumption, and as a result, they scale poorly. The proposed solution, DRAMA, enables CAM, ternary CAM (TCAM) and approximate (similarity) search CAM functionalities in unmodified commodity DRAM. DRAMA performs compare operation in a bit-serial fashion, where the search pattern (query) is coded in DRAM addresses. A single bit compare (XNOR) in DRAMA is identical to a regular DRAM read. AND and OR operations required for NAND CAM and NOR CAM respectively are implemented using nonstandard DRAM timing. We evaluate DRAMA on bacterial DNA classification and show that DRAMA can achieve 3.6 times higher performance and 19.6 times lower power consumption compared to state-of-the-art CMOS CAM based genome classification accelerator.
- Genome BiologyA Framework for Designing Efficient Deep Learning-Based Genomic BasecallersGagandeep Singh, Mohammed Alser, Alireza Khodamoradi, Kristof Denolf, Can Firtina, Meryem Banu Cavlak, Henk Corporaal, and Onur MutlubioRxiv, Nov 2023
Nanopore sequencing is a widely-used high-throughput genome sequencing technology that can sequence long fragments of a genome. Nanopore sequencing generates noisy electrical signals that need to be converted into a standard string of DNA nucleotide bases using a computational step called basecalling. The performance of basecalling has critical implications for all later steps in genome analysis. Many researchers adopt complex deep learning-based models from the speech recognition domain to perform basecalling without considering the compute demands of such models, which leads to slow, inefficient, and memory-hungry basecallers. Therefore, there is a need to reduce the computation and memory cost of basecalling while maintaining accuracy. However, developing a very fast basecaller that can provide high accuracy requires a deep understanding of genome sequencing, machine learning, and hardware design. Our goal is to develop a comprehensive framework for creating deep learning-based basecallers that provide high efficiency and performance. We introduce RUBICON, a framework to develop hardware-optimized basecallers. RUBICON consists of two novel machine-learning techniques that are specifically designed for basecalling. First, we introduce the first quantization-aware basecalling neural architecture search (QABAS) framework to specialize the basecalling neural network architecture for a given hardware acceleration platform while jointly exploring and finding the best bit-width precision for each neural network layer. Second, we develop SkipClip, the first technique to remove the skip connections present in modern basecallers to greatly reduce resource and storage requirements without any loss in basecalling accuracy. We demonstrate the benefits of RUBICON by developing RUBICALL, the first hardware-optimized basecaller that performs fast and accurate basecalling. Our experimental results on state-of-the-art computing systems show that RUBICALL is a fast, memory-efficient, and hardware-friendly basecaller. Compared to the fastest state-of-the-art basecaller, RUBICALL provides a 3.96× speedup with 2.97% higher accuracy. Compared to an expert-designed basecaller, RUBICALL provides a 141.15× speedup without losing accuracy while also achieving a 6.88× and 2.94× reduction in neural network model size and the number of parameters, respectively. We show that RUBICON helps researchers develop hardware-optimized basecallers that are superior to expert-designed models and can inspire independent future ideas.
- PACTSimplePIM: A Software Framework for Productive and Efficient Processing-in-MemoryJinfan Chen, Juan Gómez-Luna, Izzat El Hajj, Yuxin Guo, and Onur MutluParallel Architectures and Compilation Techniques, Oct 2023
Data movement between memory and processors is a major bottleneck in modern computing systems. The processing-in-memory (PIM) paradigm aims to alleviate this bottleneck by performing computation inside memory chips. Real PIM hardware (e.g., the UPMEM system) is now available and has demonstrated potential in many applications. However, programming such real PIM hardware remains a challenge for many programmers. This paper presents a new software framework, SimplePIM, to aid programming real PIM systems. The framework processes arrays of arbitrary elements on a PIM device by calling iterator functions from the host and provides primitives for communication among PIM cores and between PIM and the host system. We implement SimplePIM for the UPMEM PIM system and evaluate it on six major applications. Our results show that SimplePIM enables 66.5% to 83.1% reduction in lines of code in PIM programs. The resulting code leads to higher performance (between 10% and 37% speedup) than hand-optimized code in three applications and provides comparable performance in three others. SimplePIM is fully and freely available at https://github.com/CMU-SAFARI/SimplePIM
- IEEE TCSA Low-Complexity Sensing Scheme for Approximate Matching Content-Addressable MemoryEsteban Garzón, Roman Golman, Marco Lanuzza, Adam Teman, and Leonid YavitsIEEE Transactions on Circuits and Systems II, Oct 2023
The need for approximate rather than exact search arises in numerous compare-intensive applications, from networking to computational genomics. This brief presents a novel sensing approach for approximate matching content-addressable memory (CAM) designed to handle large Hamming distances (HDs) between the query pattern and stored data. The proposed matchline sensing scheme (MLSS) employs a replica mechanism and a 12-transistor positive feedback sense amplifier to effectively resolve the approximate match operation. The MLSS was integrated into a 4 kB approximate CAM array and fabricated in a 65 nm CMOS technology. With an overall area footprint of 0.0048 mm2, which includes 512 sense amplifiers and the replica mechanism, the MLSS allows a flexible and dynamic adjustment of the HD tolerance threshold via several design variables. Experimental measurements demonstrate the efficiency of our sensing scheme in tolerating very large HDs with the highest sensitivity.
- ISPASSAn Experimental Evaluation of Machine Learning Training on a Real Processing-in-Memory SystemJuan Gómez-Luna, Yuxin Guo, Sylvan Brocard, Julien Legriel, Remy Cimadomo, Geraldo F. Oliveira, Gagandeep Singh, and Onur MutluSep 2023
Training machine learning (ML) algorithms is a computationally intensive process, which is frequently memory-bound due to repeatedly accessing large training datasets. As a result, processor-centric systems (e.g., CPU, GPU) suffer from costly data movement between memory units and processing units, which consumes large amounts of energy and execution cycles. Memory-centric computing systems, i.e., with processing-in-memory (PIM) capabilities, can alleviate this data movement bottleneck. Our goal is to understand the potential of modern general-purpose PIM architectures to accelerate ML training. To do so, we (1) implement several representative classic ML algorithms (namely, linear regression, logistic regression, decision tree, K-Means clustering) on a real-world general-purpose PIM architecture, (2) rigorously evaluate and characterize them in terms of accuracy, performance and scaling, and (3) compare to their counterpart implementations on CPU and GPU. Our evaluation on a real memory-centric computing system with more than 2500 PIM cores shows that general-purpose PIM architectures can greatly accelerate memory-bound ML workloads, when the necessary operations and datatypes are natively supported by PIM hardware. For example, our PIM implementation of decision tree is 27× faster than a state-of-the-art CPU version on an 8-core Intel Xeon, and 1.34× faster than a state-of-the-art GPU version on an NVIDIA A100. Our K-Means clustering on PIM is 2.8× and 3.2× than state-of-the-art CPU and GPU versions, respectively. To our knowledge, our work is the first one to evaluate ML training on a real-world PIM architecture. We conclude with key observations, takeaways, and recommendations that can inspire users of ML workloads, programmers of PIM architectures, and hardware designers & architects of future memory-centric computing systems.
- GREfficient mapping of accurate long reads in minimizer space with mapquikBaris Ekim, Kristoffer Sahlin, Paul Medvedev, Bonnie Berger, and Rayan ChikhiGenome Research, Jun 2023
DNA sequencing data continue to progress toward longer reads with increasingly lower sequencing error rates. We focus on the critical problem of mapping, or aligning, low-divergence sequences from long reads (e.g., Pacific Biosciences [PacBio] HiFi) to a reference genome, which poses challenges in terms of accuracy and computational resources when using cutting-edge read mapping approaches that are designed for all types of alignments. A natural idea would be to optimize efficiency with longer seeds to reduce the probability of extraneous matches; however, contiguous exact seeds quickly reach a sensitivity limit. We introduce mapquik, a novel strategy that creates accurate longer seeds by anchoring alignments through matches of k consecutively sampled minimizers (k-min-mers) and only indexing k-min-mers that occur once in the reference genome, thereby unlocking ultrafast mapping while retaining high sensitivity. We show that mapquik significantly accelerates the seeding and chaining steps—fundamental bottlenecks to read mapping—for both the human and maize genomes with Formula sensitivity and near-perfect specificity. On the human genome, for both real and simulated reads, mapquik achieves a Formula speedup over the state-of-the-art tool minimap2, and on the maize genome, mapquik achieves a Formula speedup over minimap2, making mapquik the fastest mapper to date. These accelerations are enabled from not only minimizer-space seeding but also a novel heuristic Formula pseudochaining algorithm, which improves upon the long-standing Formula bound. Minimizer-space computation builds the foundation for achieving real-time analysis of long-read sequencing data.
- IEEE AccessA low-energy DMTJ-based ternary content- addressable memory with reliable sub-nanosecond search operationEsteban Garzón, Leonid Yavits, Giovanni Finocchio, Mario Carpentieri, Adam Teman, and Marco LanuzzaIEEE Access, Feb 2023
In this paper, we propose an energy-efficient, reliable, hybrid, 10-transistor/2-Double-Barrier-Magnetic-Tunnel-Junction (10T2DMTJ) non-volatile (NV) ternary content-addressable memory (TCAM) with sub-nanosecond search operation. Our cell design relies on low-energy-demanding MTJs organized in a low-complexity voltage-divider-based circuit along with a simple dynamic logic CMOS matching network, which improves the search reliability. The proposed NV-TCAM was designed in 28 nm FDSOI process and evaluated under exhaustive Monte Carlo simulations. When compared to the best previous proposed NV-TCAMs, our solution achieves lower search error rate (3.8×) and lower write and search energy (–73% and–79%, respectively), while also exhibiting smaller area footprint (–74%). Such benefits are achieved at the expense of reduced search speed.
- ACM TACOApHMM: Accelerating Profile Hidden Markov Models for Fast and Energy-Efficient Genome AnalysisCan Firtina, Kamlesh Pillai, Gurpreet S. Kalsi, Bharathwaj Suresh, Damla Senol Cali, Jeremie S. Kim, Taha Shahroodi, Meryem Banu Cavlak, Joël Lindegger, Mohammed Alser, Juan Gómez Luna, Sreenivas Subramoney, and Onur MutluACM Trans. Archit. Code Optim., Dec 2023
Profile hidden Markov models (pHMMs) are widely employed in various bioinformatics applications to identify similarities between biological sequences, such as DNA or protein sequences. In pHMMs, sequences are represented as graph structures, where states and edges capture modifications (i.e., insertions, deletions, and substitutions) by assigning probabilities to them. These probabilities are subsequently used to compute the similarity score between a sequence and a pHMM graph. The Baum-Welch algorithm, a prevalent and highly accurate method, utilizes these probabilities to optimize and compute similarity scores. Accurate computation of these probabilities is essential for the correct identification of sequence similarities. However, the Baum-Welch algorithm is computationally intensive, and existing solutions offer either software-only or hardware-only approaches with fixed pHMM designs. When we analyze state-of-the-art works, we identify an urgent need for a flexible, high-performance, and energy-efficient hardware-software co-design to address the major inefficiencies in the Baum-Welch algorithm for pHMMs. We introduce ApHMM, the first flexible acceleration framework designed to significantly reduce both computational and energy overheads associated with the Baum-Welch algorithm for pHMMs. ApHMM employs hardware-software co-design to tackle the major inefficiencies in the Baum-Welch algorithm by 1) designing flexible hardware to accommodate various pHMM designs, 2) exploiting predictable data dependency patterns through on-chip memory with memoization techniques, 3) rapidly filtering out unnecessary computations using a hardware-based filter, and 4) minimizing redundant computations. ApHMM achieves substantial speedups of 15.55x - 260.03x, 1.83x - 5.34x, and 27.97x when compared to CPU, GPU, and FPGA implementations of the Baum-Welch algorithm, respectively. ApHMM outperforms state-of-the-art CPU implementations in three key bioinformatics applications: 1) error correction, 2) protein family search, and 3) multiple sequence alignment, by 1.29x - 59.94x, 1.03x - 1.75x, and 1.03x - 1.95x, respectively, while improving their energy efficiency by 64.24x - 115.46x, 1.75x, 1.96x.
- IEEE TCDIPER: Detection and Identification of Pathogens using Edit distance-tolerant Resistive CAMItay Merlin, Esteban Garzón, Alex Fish, and Leonid YavitsIEEE Transactions on Computers, Dec 2023
We propose a novel resistive edit distance-tolerant content addressable memory for computational genomics applications, particularly for detection and identification of pathogens of pandemic importance. Unlike state-of-the-art approximate search solutions that tolerate small number of replacements between the query pattern and the stored data, DIPER tolerates insertions and deletions, ubiquitous in genomics. DIPER achieves up to 1.7× higher F1 score for high-quality DNA reads and up to 6.2× higher F1 score for DNA reads with 15% error rate, compared to state-of-the-art DNA classification tool Kraken2. Simulated at 500MHz, DIPER provides 910× average speedup over Kraken2.
- MICROSwordfish: A Framework for Evaluating Deep Neural Network-Based Basecalling Using Computation-In-Memory with Non-Ideal MemristorsTaha Shahroodi, Gagandeep Singh, Mahdi Zahedi, Haiyu Mao, Joel Lindegger, Can Firtina, Stephan Wong, Onur Mutlu, and Said HamdiouiIn Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Oct 2023
Basecalling, an essential step in many genome analysis studies, relies on large Deep Neural Network s (DNN s) to achieve high accuracy. Unfortunately, these DNN s are computationally slow and inefficient, leading to considerable delays and resource constraints in the sequence analysis process. A Computation-In-Memory (CIM) architecture using memristors can significantly accelerate the performance of DNN s. However, inherent device non-idealities and architectural limitations of such designs can greatly degrade the basecalling accuracy, which is critical for accurate genome analysis. To facilitate the adoption of memristor-based CIM designs for basecalling, it is important to (1) conduct a comprehensive analysis of potential CIM architectures and (2) develop effective strategies for mitigating the possible adverse effects of inherent device non-idealities and architectural limitations. This paper proposes Swordfish, a novel hardware/software co-design framework that can effectively address the two aforementioned issues. Swordfish incorporates seven circuit and device restrictions or non-idealities from characterized real memristor-based chips. Swordfish leverages various hardware/software co-design solutions to mitigate the basecalling accuracy loss due to such non-idealities. To demonstrate the effectiveness of Swordfish, we take Bonito, the state-of-the-art (i.e., accurate and fast), open-source basecaller as a case study. Our experimental results using Swordfish show that a CIM architecture can realistically accelerate Bonito for a wide range of real datasets by an average of 25.7x, with an accuracy loss of 6.01%.
- bioRxivDASH-CAM: Dynamic Approximate SearcH Content Addressable Memory for genome classificationZuher Jahshan, Itay Merlin, Esteban Garzón, and Leonid YavitsbioRxiv, Oct 2023
We propose a novel dynamic storage-based approximate search content addressable memory (DASH-CAM) for computational genomics applications, particularly for identification and classification of viral pathogens of epidemic significance. DASH-CAM provides 5.5x better density compared to state-of-the-art SRAM-based approximate search CAM. This allows using DASH-CAM as a portable classifier that can be applied to pathogen surveillance in low-quality field settings during pandemics, as well as to pathogen diagnostics at points of care. DASH-CAM approximate search capabilities allow a high level of flexibility when dealing with a variety of industrial sequencers with different error profiles. DASH-CAM achieves up to 30% and 20% higher F1 score when classifying DNA reads with 10% error rate, compared to state-of-the-art DNA classification tools MetaCache-GPU and Kraken2 respectively. Simulated at 1GHz, DASH-CAM provides 1,178x and 1,040x average speedup over MetaCache-GPU and Kraken2 respectively.
- arXivRawHash2: Accurate and Fast Mapping of Raw Nanopore Signals using a Hash-based Seeding MechanismCan Firtina, Melina Soysal, Joël Lindegger, and Onur MutluarXiv, Sep 2023
Raw nanopore signals can be analyzed while they are being generated, a process known as real-time analysis. Real-time analysis of raw signals is essential to utilize the unique features that nanopore sequencing provides, enabling the early stopping of the sequencing of a read or the entire sequencing run based on the analysis. The state-of-the-art mechanism, RawHash, offers the first hash-based efficient and accurate similarity identification between raw signals and a reference genome by quickly matching their hash values. In this work, we introduce RawHash2, which provides major improvements over RawHash, including a more sensitive chaining implementation, weighted mapping decisions, frequency filters to reduce ambiguous seed hits, minimizers for hash-based sketching, and support for the R10.4 flow cell version and various data formats such as POD5. Compared to RawHash, RawHash2 provides the best F1 accuracy in all datasets and provides better average throughput by 2.5x. Source code is available at https://github.com/CMU-SAFARI/RawHash.
- arXivGateSeeder: Near-memory CPU-FPGA Acceleration of Short and Long Read MappingarXiv, Sep 2023
Motivation: Read mapping is a computationally expensive process and a major bottleneck in genomics analyses. The performance of read mapping is mainly limited by the performance of three key computational steps: Index Querying, Seed Chaining, and Sequence Alignment. The first step is dominated by how fast and frequent it accesses the main memory (i.e., memory-bound), while the latter two steps are dominated by how fast the CPU can compute their computationally-costly dynamic programming algorithms (i.e., compute-bound). Accelerating these three steps by exploiting new algorithms and new hardware devices is essential to accelerate most genome analysis pipelines that widely use read mapping. Given the large body of work on accelerating Sequence Alignment, this work focuses on significantly improving the remaining steps. Results: We introduce GateSeeder, the first CPU-FPGA-based near-memory acceleration of both short and long read mapping. GateSeeder exploits near-memory computation capability provided by modern FPGAs that couple a reconfigurable compute fabric with high-bandwidth memory (HBM) to overcome the memory-bound and compute-bound bottlenecks. GateSeeder also introduces a new lightweight algorithm for finding the potential matching segment pairs. Using real ONT, HiFi, and Illumina sequences, we experimentally demonstrate that GateSeeder outperforms Minimap2, without performing sequence alignment, by up to 40.3x, 4.8x, and 2.3x, respectively. When performing read mapping with sequence alignment, GateSeeder outperforms Minimap2 by 1.15-4.33x (using KSW2) and by 1.97-13.63x (using WFA-GPU)
- IEEE VLSIClaPIM: Scalable Sequence CLAssification using Processing-In-MemoryMarcel Khalifa, Barak Hoffer, Orian Leitersdorf, Robert Hanhan, Ben Perach, Leonid Yavits, and Shahar KvatinskyIEEE Transactions on Very Large Scale Integration (VLSI) Systems, Jul 2023
DNA sequence classification is a fundamental task in computational biology with vast implications for applications such as disease prevention and drug design. Therefore, fast high-quality sequence classifiers are significantly important. This paper introduces ClaPIM, a scalable DNA sequence classification architecture based on the emerging concept of hybrid in-crossbar and near-crossbar memristive processing-in-memory (PIM). We enable efficient and high-quality classification by uniting the filter and search stages within a single algorithm. Specifically, we propose a custom filtering technique that drastically narrows the search space and a search approach that facilitates approximate string matching through a distance function. ClaPIM is the first PIM architecture for scalable approximate string matching that benefits from the high density of memristive crossbar arrays and the massive computational parallelism of PIM. Compared with Kraken2, a state-of-the-art software classifier, ClaPIM provides significantly higher classification quality (up to 20x improvement in F1 score) and also demonstrates a 1.8x throughput improvement. Compared with EDAM, a recently-proposed SRAM-based accelerator that is restricted to small datasets, we observe both a 30.4x improvement in normalized throughput per area and a 7% increase in classification precision.
- BioinformaticsRawHash: Enabling Fast and Accurate Real-Time Analysis of Raw Nanopore Signals for Large GenomesCan Firtina, Nika Mansouri Ghiasi, Joel Lindegger, Gagandeep Singh, Meryem Banu Cavlak, Haiyu Mao, and Onur MutluBioinformatics, Jul 2023Proceedings of the 31st Annual Conference on Intelligent Systems for Molecular Biology (ISMB) and the 22nd European Conference on Computational Biology (ECCB)
Nanopore sequencers generate electrical raw signals in real-time while sequencing long genomic strands. These raw signals can be analyzed as they are generated, providing an opportunity for real-time genome analysis. An important feature of nanopore sequencing, Read Until, can eject strands from sequencers without fully sequencing them, which provides opportunities to computationally reduce the sequencing time and cost. However, existing works utilizing Read Until either 1) require powerful computational resources that may not be available for portable sequencers or 2) lack scalability for large genomes, rendering them inaccurate or ineffective. We propose RawHash, the first mechanism that can accurately and efficiently perform real-time analysis of nanopore raw signals for large genomes using a hash-based similarity search. To enable this, RawHash ensures the signals corresponding to the same DNA content lead to the same hash value, regardless of the slight variations in these signals. RawHash achieves an accurate hash-based similarity search via an effective quantization of the raw signals such that signals corresponding to the same DNA content have the same quantized value and, subsequently, the same hash value. We evaluate RawHash on three applications: 1) read mapping, 2) relative abundance estimation, and 3) contamination analysis. Our evaluations show that RawHash is the only tool that can provide high accuracy and high throughput for analyzing large genomes in real-time. When compared to the state-of-the-art techniques, UNCALLED and Sigmap, RawHash provides 1) 25.8x and 3.4x better average throughput and 2) significantly better accuracy for large genomes, respectively. Source code is available at https://github.com/CMUSAFARI/RawHash.
- DACAccelerating Genome Analysis via Algorithm-Architecture Co-DesignOnur Mutlu, and Can FirtinaIn Proceedings of the 60th Design Automation Conference (DAC), Jul 2023
High-throughput sequencing (HTS) technologies have revolutionized the field of genomics, enabling rapid and cost-effective genome analysis for various applications. However, the increasing volume of genomic data generated by HTS technologies presents significant challenges for computational techniques to effectively analyze genomes. To address these challenges, several algorithm-architecture co-design works have been proposed, targeting different steps of the genome analysis pipeline. These works explore emerging technologies to provide fast, accurate, and low-power genome analysis. This paper provides a brief review of the recent advancements in accelerating genome analysis, covering the opportunities and challenges associated with the acceleration of the key steps of the genome analysis pipeline. Our analysis highlights the importance of integrating multiple steps of genome analysis using suitable architectures to unlock significant performance improvements and reduce data movement and energy consumption. We conclude by emphasizing the need for novel strategies and techniques to address the growing demands of genomic data generation and analysis.
- MMDCSWill computing in memory become a new dawn of associative processors?IEEE J. Emerg. Sel. Topics Circuits Syst., Jul 2023
Computer architecture faces an enormous challenge in recent years: while the demand for performance is constantly growing, the performance improvement of general-purpose CPU has almost stalled. Among the reasons are memory and power walls, due to which data transfer increasingly dominates computing. By significantly reducing data transfer, data-centric (or in-memory) computing promises to alleviate the memory and power walls. Associative processor is a non von Neumann computer invented in the 1960s but effectively cast aside until recently. It computes using associative memory in a perfect induction like fashion, using associative memory cells for both data storage and processing. Associative processor can be implemented using conventional CMOS as well as emerging memories. We show that associative processor can outperform state-of-the-art computing platforms by up to almost two orders of magnitude in a variety of data-intensive workloads.
- AACBBGAPiM: a hardware acceleration of Genome Analysis pipeline using Processing in MemoryNaomie Abecassis, Juan Gómez-Luna, Onur Mutlu, Ran Ginosar, Aphélie Moisson-Franckhauser, and Leonid YavitsIn Proceedings of the 5th Workshop on Accelerator Architecture in Computational Biology and Bioinformatics (AACBB), Jun 2023
Variant calling is a fundamental stage in genome analysis that identifies mutations (variations) in a sequenced genome relative to a known reference genome. Pair-HMM is a key part of the variant calling algorithm and its most compute-intensive part. In recent years, Processing-in-Memory (PiM) solutions, which consist of placing compute capabilities near/inside memory, have been proposed to speed up the genome analysis pipeline. We implement the Pair-HMM algorithm on a commercial PiM platform developed by UPMEM. We modify the Pair-HMM algorithm to make it more suitable for PiM execution with acceptable loss of accuracy. We evaluate our implementation on single chromosomes and whole genome sequencing datasets, demonstrating up to 2x speedup compared to existing CPU accelerations and up to 3x speedup compared to FPGA accelerations.
- ISPASSTransPimLib: A Library for Efficient Transcendental Functions on Processing-in-Memory SystemsMaurus Item, Juan Gómez-Luna, Yuxin Guo, Geraldo F Oliveira, Mohammad Sadrosadati, and Onur MutluIn Proceedings of the 24th International Symposium on Performance Analysis of Systems and Software (ISPASS), Apr 2023
Processing-in-memory (PIM) promises to alleviate the data movement bottleneck in modern computing systems. However, current real-world PIM systems have the inherent disadvantage that their hardware is more constrained than in conventional processors (CPU, GPU), due to the difficulty and cost of building processing elements near or inside the memory. As a result, general-purpose PIM architectures support fairly limited instruction sets and struggle to execute complex operations such as transcendental functions and other hard-to-calculate operations (e.g., square root). These operations are particularly important for some modern workloads, e.g., activation functions in machine learning applications. In order to provide support for transcendental (and other hard-to-calculate) functions in general-purpose PIM systems, we present TransPimLib, a library that provides CORDIC-based and LUT-based methods for trigonometric functions, hyperbolic functions, exponentiation, logarithm, square root, etc. We develop an implementation of TransPimLib for the UPMEM PIM architecture and perform a thorough evaluation of TransPimLib’s methods in terms of performance and accuracy, using microbenchmarks and three full workloads (Blackscholes, Sigmoid, Softmax). We open-source all our code and datasets at https://github.com/CMU-SAFARI/transpimlib.
- ISPASSEvaluating Machine Learning Workloads on Memory-Centric Computing SystemsJuan Gómez-Luna, Yuxin Guo, Sylvan Brocard, Julien Legriel, Remy Cimadomo, Geraldo F Oliveira, Gagandeep Singh, and Onur MutluIn Proceedings of the 24th International Symposium on Performance Analysis of Systems and Software (ISPASS), Apr 2023
Training machine learning (ML) algorithms is a computationally intensive process, which is frequently memory-bound due to repeatedly accessing large training datasets. As a result, processor-centric systems (e.g., CPU, GPU) suffer from costly data movement between memory units and processing units, which consumes large amounts of energy and execution cycles. Memory-centric computing systems, i.e., with processing-in-memory (PIM) capabilities, can alleviate this data movement bottleneck. Our goal is to understand the potential of modern general-purpose PIM architectures to accelerate ML training. To do so, we (1) implement several representative classic ML algorithms (namely, linear regression, logistic regression, decision tree, K-Means clustering) on a real-world general-purpose PIM architecture, (2) rigorously evaluate and characterize them in terms of accuracy, performance and scaling, and (3) compare to their counterpart implementations on CPU and GPU. Our evaluation on a real memory-centric computing system with more than 2500 PIM cores shows that general-purpose PIM architectures can greatly accelerate memory-bound ML workloads, when the necessary operations and datatypes are natively supported by PIM hardware. For example, our PIM implementation of decision tree is 27x faster than a state-of-the-art CPU version on an 8-core Intel Xeon, and 1.34x faster than a state-of-the-art GPU version on an NVIDIA A100. Our K-Means clustering on PIM is 2.8x and 3.2x than state-of-the-art CPU and GPU versions, respectively. To our knowledge, our work is the first one to evaluate ML training on a real-world PIM architecture. We conclude with key observations, takeaways, and recommendations that can inspire users of ML workloads, programmers of PIM architectures, and hardware designers & architects of future memory-centric computing systems.
- BioinformaticsScrooge: A Fast and Memory-Frugal Genomic Sequence Aligner for CPUs, GPUs, and ASICs.Joël Lindegger, Damla Senol Cali, Mohammed Alser, Juan Gómez-Luna, Nika Mansouri Ghiasi, and Onur MutluBioinformatics, Mar 2023
Pairwise sequence alignment is a very time-consuming step in common bioinformatics pipelines. Speeding up this step requires heuristics, efficient implementations, and/or hardware acceleration. A promising candidate for all of the above is the recently proposed GenASM algorithm. We identify and address three inefficiencies in the GenASM algorithm: it has a high amount of data movement, a large memory footprint, and does some unnecessary work. We propose Scrooge, a fast and memory-frugal genomic sequence aligner. Scrooge includes three novel algorithmic improvements which reduce the data movement, memory footprint, and the number of operations in the GenASM algorithm. We provide efficient open-source implementations of the Scrooge algorithm for CPUs and GPUs, which demonstrate the significant benefits of our algorithmic improvements. For long reads, the CPU version of Scrooge achieves a 20.1x, 1.7x, and 2.1x speedup over KSW2, Edlib, and a CPU implementation of GenASM, respectively. The GPU version of Scrooge achieves a 4.0x 80.4x, 6.8x, 12.6x and 5.9x speedup over the CPU version of Scrooge, KSW2, Edlib, Darwin-GPU, and a GPU implementation of GenASM, respectively. We estimate an ASIC implementation of Scrooge to use 3.6x less chip area and 2.1x less power than a GenASM ASIC while maintaining the same throughput. Further, we systematically analyze the throughput and accuracy behavior of GenASM and Scrooge under various configurations. As the best configuration of Scrooge depends on the computing platform, we make several observations that can help guide future implementations of Scrooge. https://github.com/CMU-SAFARI/Scrooge.
- BioinformaticsA framework for high-throughput sequence alignment using real processing-in-memory systems.Bioinformatics, Mar 2023
Sequence alignment is a memory bound computation whose performance in modern systems is limited by the memory bandwidth bottleneck. Processing-in-memory (PIM) architectures alleviate this bottleneck by providing the memory with computing competencies. We propose Alignment-in-Memory (AIM), a framework for high-throughput sequence alignment using PIM, and evaluate it on UPMEM, the first publicly available general-purpose programmable PIM system. Our evaluation shows that a real PIM system can substantially outperform server-grade multi-threaded CPU systems running at full-scale when performing sequence alignment for a variety of algorithms, read lengths, and edit distance thresholds. We hope that our findings inspire more work on creating and accelerating bioinformatics algorithms for such real PIM systems. Our code is available at https://github.com/safaad/aim.
- JETCASAM4: MRAM Crossbar Based CAM/TCAM/ACAM/AP for In-Memory ComputingEsteban Garzón, Marco Lanuzza, Adam Teman, and Leonid YavitsIEEE J. Emerg. Sel. Topics Circuits Syst., Mar 2023
In-memory computing seeks to minimize data movement and alleviate the memory wall by computing in-situ, in the same place that the data is located. One of the key emerging technologies that promises to enable such computing-in-memory is spin-transfer torque magnetic tunnel junction (STT-MTJ). This paper proposes AM4, a combined STT-MTJ-based Content Addressable Memory (CAM), Ternary CAM (TCAM), approximate matching (similarity search) CAM (ACAM), and in-memory Associative Processor (AP) design, inspired by the recently announced Samsung MRAM crossbar. We demonstrate and evaluate the performance and energy-efficiency of the AM4-based AP using a variety of data intensive workloads. We show that an AM4-based AP outperforms state-of-the-art solutions both in performance (with the average speedup of about 10 x) and energy-efficiency (by about 60 x on average).
- NARGABBLEND: a fast, memory-efficient and accurate mechanism to find fuzzy seed matches in genome analysisCan Firtina, Jisung Park, Mohammed Alser, Jeremie S Kim, Damla Senol Cali, Taha Shahroodi, Nika Mansouri Ghiasi, Gagandeep Singh, Konstantinos Kanellopoulos, Can Alkan, and Onur MutluNAR Genomics and Bioinformatics, Mar 2023
Generating the hash values of short subsequences, called seeds, enables quickly identifying similarities between genomic sequences by matching seeds with a single lookup of their hash values. However, these hash values can be used only for finding exact-matching seeds as the conventional hashing methods assign distinct hash values for different seeds, including highly similar seeds. Finding only exact-matching seeds causes either (i) increasing the use of the costly sequence alignment or (ii) limited sensitivity. We introduce BLEND, the first efficient and accurate mechanism that can identify both exact-matching and highly similar seeds with a single lookup of their hash values, called fuzzy seed matches. BLEND (i) utilizes a technique called SimHash, that can generate the same hash value for similar sets, and (ii) provides the proper mechanisms for using seeds as sets with the SimHash technique to find fuzzy seed matches efficiently. We show the benefits of BLEND when used in read overlapping and read mapping. For read overlapping, BLEND is faster by 2.4x–83.9x (on average 19.3x), has a lower memory footprint by 0.9x–14.1x (on average 3.8x), and finds higher quality overlaps leading to accurate de novo assemblies than the state-of-the-art tool, minimap2. For read mapping, BLEND is faster by 0.8x–4.1x (on average 1.7x) than minimap2. Source code is available at https://github.com/CMU-SAFARI/BLEND.
- ChipsApproximate Content-Addressable Memories: A ReviewEsteban Garzón, Leonid Yavits, Adam Teman, and Marco LanuzzaChips, Mar 2023
Content-addressable memory (CAM) has been part of the memory market for more than five decades. CAM can carry out a single clock cycle lookup based on the content rather than an address. Thanks to this attractive feature, CAM is utilized in memory systems where a high-speed content lookup technique is required. However, typical CAM applications only support exact matching, as opposed to approximate matching, where a certain Hamming distance (several mismatching characters between a query pattern and the dataset stored in CAM) needs to be tolerated. Recent interest in approximate search has led to the development of new CAM-based alternatives, accelerating the processing of large data workloads in the realm of big data, genomics, and other data-intensive applications. In this review, we provide an overview of approximate CAM and describe its current and potential applications that would benefit from approximate search computing.
Posters
2024
- GIHardware/software co-design for sequence-to-pangenome mappingZülal Bingöl, Ccan Fırtına, Konstantina Koliogeorgi, Ricardo Roman-Brenes, Onur Mutlu, and Can AlkanIn Genome Informatics Conference 2024, Nov 2024
The reference genome is the representative genome of a species. Most genomics analyses begin with or involve a comparison with the reference genome of the same species, which serves as a template for the newly sequenced sample of an individual. However, using a single reference genome creates reference bias since it does not contain information about all individuals of a species to represent the genomic diversity. The rapid development of sequencing technologies has led to an increasing number of samples being studied. With the growing number of genetic materials sampled, genomics pursues utilizing this data to express all samples and capture vast genetic diversity. So, to alleviate these problems and enable diverse variance studies (e.g., bacterial pathogenicity, antibacterial resistance, virulence, human leukocyte antigens), pangenome structures are proposed. As the efforts to collect more data to improve the diversity representation on the pangenome are growing, the computational methods to assist this field become more vital to meet the high demand. Therefore, there is a pressing need for acceleration methods tuned for mapping sequences to pangenomes. To address this problem, in this work, we propose a hardware/software co-design framework for optimizing the pangenome graphs for compact storage and developing efficient algorithms that eventually lead to fast and accurate sequence-to-pangenome mapping. Our framework combines several steps (e.g., indexing, seeding, and alignment), and the accelerator includes optimized designs based on each step’s requirements.
2023
- RECOMBCharacterization of Alignment and Search Algorithms for Short Read, Long Read, and Graph MappersEcem İlgün, Ömer Yavuz Öztürk, Klea Zambaku, Juan Gómez Luna, Mohammed Alser, Ricardo Román-Brenes, The BioPIM Project, and Can AlkanIn RECOMB 2023, Apr 2023
We recently started a project funded by the Horizon Europe program, BioPIM, which aims to accelerate various bioinformatics algorithms using processing-in-memory (PIM) technologies. PIM is a type of computer architecture that aims to solve the issue of data movement between the CPU and memory being a bottleneck in data-intensive applications. PIM accomplishes this by integrating computing units directly into the memory chip, which reduces latency and increases bandwidth by bringing the computing units closer to the memory. Since read mapping is a crucial step in almost all genome analysis studies, we first aimed to understand how to accelerate read mapping in this project. As the first step, we analyzed the computational workload of BWA-MEM and Bowtie2 for short reads, NGMLR, Minimap2 and LRA for long reads, and finally minigraph, vg, GraphAligner and GWFA for read-to-graph aligners. Here we present a first-pass workload analysis using the Intel vtune tool to identify the most time and memory-bandwidth consuming functions used by these algorithms. Our preliminary results show that the resource usage of these algorithms varied significantly depending on the type of data and the algorithm used. We have identified several potential functions that could potentially be improved by PIM. Furthermore, we discuss processing-in-memory architectures to accelerate alignment and search algorithms for resequencing experiments. Overall, our study provides insights into the resource requirements of different alignment and search algorithms for different types of sequencing data, which can guide the selection of the most appropriate algorithms for different resequencing experiments. Our findings can also inform the development of more efficient and accurate algorithms for processing sequencing data, which is critical for advancing our understanding of the genetic basis of complex diseases.
Related Publications
2022
- ISCAEDAM: Edit Distance Tolerant Approximate Matching Content Addressable MemoryRobert Hanhan, Esteban Garzón, Zuher Jahshan, Adam Teman, Marco Lanuzza, and Leonid YavitsIn Proceedings of the 49th Annual International Symposium on Computer Architecture, 2022
We propose a novel edit distance-tolerant content addressable memory (EDAM) for energy-efficient approximate search applications. Unlike state-of-the-art approximate search solutions that tolerate certain Hamming distance between the query pattern and the stored data, EDAM tolerates edit distance, which makes it especially efficient in applications such as text processing and genome analysis. EDAM was designed using a commercial 65 nm 1.2 V CMOS technology and evaluated through extensive Monte Carlo simulations, while considering different process corners. Simulation results show that EDAM can achieve robust approximate search operation with a wide range of edit distance threshold levels. EDAM is functionally evaluated as a pathogen DNA detection and classification accelerator. EDAM achieves up to 1.7x higher F1 score for high-quality DNA reads and up to 19.55x higher F1 score for DNA reads with 15% error rate, compared to state-of-the-art DNA classification tool Kraken2. Simulated at 667 MHz, EDAM provides 1, 214x average speedup over Kraken2. This makes EDAM suitable for hardware acceleration of genomic surveillance of outbreaks, such as the ongoing Covid-19 pandemic.
2021
- VLSI Tech.HERMES Core – A 14nm CMOS and PCM-based In-Memory Compute Core using an array of 300ps/LSB Linearized CCO-based ADCs and local digital processingR. Khaddam-Aljameh, M. Stanisavljevic, J. Fornt Mas, G. Karunaratne, M. Braendli, F. Liu, A. Singh, S. M. Müller, U. Egger, A. Petropoulos, T. Antonakopoulos, K. Brew, S. Choi, I. Ok, F. L. Lie, N. Saulnier, V. Chan, I. Ahsan, V. Narayanan, S. R. Nandakumar, M. Le Gallo, P. A. Francese, A. Sebastian, and E. EleftheriouIn 2021 Symposium on VLSI Technology, 2021
We present a 256x256 in-memory compute (IMC) core designed and fabricated in 14nm CMOS with backend-integrated multi-level phase-change memory (PCM). It comprises 256 linearized current controlled oscillator (CCO)-based ADCs at a compact 4µm pitch and a local digital processing unit performing affine scaling and ReLU operations. A novel frequency-linearization technique for CCOs is introduced, leading to accurate on-chip matrix-vector-multiply (MVM) when operating over 1 GHz. Measured classification accuracies on MNIST and CIFAR-10 datasets are presented when two cores are employed for deep learning (DL) inference. The measured energy efficiency is 10.5 TOPS/W at a performance density of 1.59 TOPS/mm 2 .
2020
- BIBMVariant Calling Parallelization on Processor-in-Memory ArchitectureD. Lavenier, R. Cimadomo, and R. JodinIn 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Dec 2020
This paper introduces a new combination of software and hardware PIM (Process-in-Memory) architecture to accelerate the variant calling genomic process. PIM translates into bringing data intensive calculations directly where the data is: within the DRAM, enhanced with thousands of processing units. The energy consumption, in large part due to data movement, is significantly lowered at a marginal additional hardware cost. Such design allows an unprecedented level of parallelism to process billions of short reads. Experiments on real PIM devices developed by the UPMEM company show significant speed-up compared to pure software implementation. The PIM solution also compared nicely to FPGA or GPU based acceleration bringing similar to twice the processing speed but most importantly being 5 to 8 times cheaper to deploy with up to 6 times less power consumption.
- MICROGenASM: A High-Performance, Low-Power Approximate String Matching Acceleration Framework for Genome Sequence AnalysisDamla Senol Cali, Gurpreet S. Kalsi, Zülal Bingöl, Can Firtina, Lavanya Subramanian, Jeremie S. Kim, Rachata Ausavarungnirun, Mohammed Alser, Juan Gomez-Luna, Amirali Boroumand, Anant Norion, Allison Scibisz, Sreenivas Subramoneyon, Can Alkan, Saugata Ghose, and Onur MutluIn 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec 2020
Genome sequence analysis has enabled significant advancements in medical and scientific areas such as personalized medicine, outbreak tracing, and the understanding of evolution. To perform genome sequencing, devices extract small random fragments of an organism’s DNA sequence (known as reads). The first step of genome sequence analysis is a computational process known as read mapping. In read mapping, each fragment is matched to its potential location in the reference genome with the goal of identifying the original location of each read in the genome. Unfortunately, rapid genome sequencing is currently bottlenecked by the computational power and memory bandwidth limitations of existing systems, as many of the steps in genome sequence analysis must process a large amount of data. A major contributor to this bottleneck is approximate string matching (ASM), which is used at multiple points during the mapping process. ASM enables read mapping to account for sequencing errors and genetic variations in the reads. We propose GenASM, the first ASM acceleration framework for genome sequence analysis. GenASM performs bitvectorbased ASM, which can efficiently accelerate multiple steps of genome sequence analysis. We modify the underlying ASM algorithm (Bitap) to significantly increase its parallelism and reduce its memory footprint. Using this modified algorithm, we design the first hardware accelerator for Bitap. Our hardware accelerator consists of specialized systolic-array-based compute units and on-chip SRAMs that are designed to match the rate of computation with memory capacity and bandwidth, resulting in an efficient design whose performance scales linearly as we increase the number of compute units working in parallel. We demonstrate that GenASM provides significant performance and power benefits for three different use cases in genome sequence analysis. First, GenASM accelerates read alignment for both long reads and short reads. For long reads, GenASM outperforms state-of-the-art software and hardware accelerators by 116x and 3.9x, respectively, while reducing power consumption by 37x and 2.7x. For short reads, GenASM outperforms state-of-the-art software and hardware accelerators by 111x and 1.9x. Second, GenASM accelerates pre-alignment filtering for short reads, with 3.7x the performance of a state-of-the-art pre-alignment filter, while reducing power consumption by 1.7x and significantly improving the filtering accuracy. Third, GenASM accelerates edit distance calculation, with 22-12501x and 9.3-400x speedups over the state-of-the-art software library and FPGA-based accelerator, respectively, while reducing power consumption by 548-582x and 67x. We conclude that GenASM is a flexible, high-performance, and low-power framework, and we briefly discuss four other use cases that can benefit from GenASM.
- SYSTORBioSEAL: In-Memory Biological Sequence Alignment Accelerator for Large-Scale Genomic DataRoman Kaplan, Leonid Yavits, and Ran GinosarIn Proceedings of the 13th ACM International Systems and Storage Conference, Dec 2020
Genome sequences contain hundreds of millions of DNA base pairs. Finding the degree of similarity between two genomes requires executing a compute-intensive dynamic programming algorithm, such as Smith-Waterman. Traditional von Neumann architectures have limited parallelism and cannot provide an efficient solution for large-scale genomic data. Approximate heuristic methods (e.g. BLAST) are commonly used. However, they are suboptimal and still compute-intensive.In this work, we present BioSEAL, a biological sequence alignment accelerator. BioSEAL is a massively parallel non-von Neumann processing-in-memory architecture for large-scale DNA and protein sequence alignment. BioSEAL is based on resistive content addressable memory, capable of energy-efficient and highperformance associative processing.We present an associative processing algorithm for entire database sequence alignment on BioSEAL and compare its performance and power consumption with state-of-art solutions. We show that BioSEAL can achieve up to 57x speedup and 156x better energy efficiency, compared with existing solutions for genome sequence alignment and protein sequence database search.
2019
- IEEE MicroRASSA: Resistive Prealignment Accelerator for Approximate DNA Long Read MappingRoman Kaplan, Leonid Yavits, and Ran GinosarIEEE Micro, Jul 2019
DNA read mapping is a computationally expensive bioinformatics task, required for genome assembly and consensus polishing. It requires to find the best-fitting location for each DNA read on a long reference sequence. A novel resistive approximate similarity search accelerator (RASSA) exploits charge distribution and parallel in-memory processing to reflect a mismatch count between DNA sequences. RASSA implementation of DNA long-read prealignment outperforms the state-of-the-art solution, minimap2, by 16–77x with comparable accuracy and provides two orders of magnitude higher throughput than GateKeeper, a short-read prealignment hardware architecture implemented in FPGA.
2018
- BMC GenomicsGRIM-Filter: Fast seed location filtering in DNA read mapping using processing-in-memory technologiesJeremie S. Kim, Damla Senol Cali, Hongyi Xin, Donghyuk Lee, Saugata Ghose, Mohammed Alser, Hasan Hassan, Oguz Ergin, Can Alkan, and Onur MutluBMC Genomics, May 2018
Seed location filtering is critical in DNA read mapping, a process where billions of DNA fragments (reads) sampled from a donor are mapped onto a reference genome to identify genomic variants of the donor. State-of-the-art read mappers 1) quickly generate possible mapping locations for seeds (i.e., smaller segments) within each read, 2) extract reference sequences at each of the mapping locations, and 3) check similarity between each read and its associated reference sequences with a computationally-expensive algorithm (i.e., sequence alignment) to determine the origin of the read. A seed location filter comes into play before alignment, discarding seed locations that alignment would deem a poor match. The ideal seed location filter would discard all poor match locations prior to alignment such that there is no wasted computation on unnecessary alignments.
2017
2016
- BIBMDNA mapping using Processor-in-Memory architectureDominique Lavenier, Jean-Francois Roy, and David FurodetIn 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), May 2016
This paper presents the implementation of a mapping algorithm on a new Processing-in-Memory (PIM) architecture developed by UPMEM Company. UPMEM’s solution consists in adding processing units into the DRAM, to minimize data access time and maximize bandwidth, in order to drastically accelerate data-consuming algorithms. The technology developed by UPMEM makes it possible to combine 256 cores with 16 GBytes of DRAM, on a standard DIMM module. An experimentation of DNA Mapping on Human genome dataset shows that a speed-up of 25 can be obtained with UPMEM technology compared to fast mapping software such as BWA, Bowtie2 or NextGenMap running on 16 Intel threads. Experimentation also highlight that data transfer from storage device limits the performances of the implementation. The use of SSD drives can boost the speed-up to 80.