2024

Nat Methods

Constructing and personalizing population pangenome graphs.

Rayan Chikhi, Yoann Dufresne, and Paul Medvedev

Nature methods, Nov 2024

@article{Chikhi2024,
  author = {Chikhi, Rayan and Dufresne, Yoann and Medvedev, Paul},
  journal = {Nature methods},
  title = {Constructing and personalizing population pangenome graphs.},
  year = {2024},
  issn = {1548-7105},
  month = nov,
  pages = {1980--1981},
  volume = {21},
  citation-subset = {IM},
  country = {United States},
  doi = {10.1038/s41592-024-02402-7},
  issn-linking = {1548-7091},
  issue = {11},
  mid = {NIHMS2069661},
  nlm-id = {101215604},
  owner = {NLM},
  pii = {10.1038/s41592-024-02402-7},
  pmc = {PMC11962983},
  pmid = {39433877},
  pubmodel = {Print},
  pubstate = {ppublish},
  revised = {2025-04-03},
}

IEEE TBCS
GCOC: A genome classifier-on-chip based on similarity search content addressable memory

Yuval Harary, Paz Snapir, Shir Siman Tov, Chen Kruphman, Eyal Rechef, Zuher Jahshan, Esteban Garzon, and Leonid Yavits

IEEE Transactions on Biomedical Circuits and Systems, Nov 2024

Abstract BIB URL

GCOC is a genome classification system-on-chip (SoC) that classifies genomes by k-mer matching, an approach that divides a DNA query sequence into a set of short DNA fragments of size k, which are searched in a reference genome database, with the underlying assumption that sequenced DNA reads of the same organism (or its close variants) share most of such k-mers. At the core of GCOC is a similarity, or approximate search-capable Content Addressable Memory (SAS-CAM), which in addition to exact match, also supports approximate, or Hamming distance tolerant search. Classification operation is controlled by an embedded RISC-V processor. GCOC classification platform was designed and manufactured in a commercial 65nm process. We conduct a thorough analysis of GCOC classification efficiency as well as its performance, silicon area, and power consumption using silicon measurements. GCOC classifies 769.2K short DNA reads/sec. The silicon area of GCOC SoC is 3.12mm2 and its power consumption is 1.27mW. We envision GCOC deployed as a field (for example at points of care) portable classifier where the classification is required to be real-time, easy to operate and energy efficient.
@article{harary2024gcoc, title = {GCOC: A genome classifier-on-chip based on similarity search content addressable memory}, author = {Harary, Yuval and Snapir, Paz and Tov, Shir Siman and Kruphman, Chen and Rechef, Eyal and Jahshan, Zuher and Garzon, Esteban and Yavits, Leonid}, journal = {IEEE Transactions on Biomedical Circuits and Systems}, year = {2024}, publisher = {IEEE} }
IEEEXplore
MiMyCS: A Processing-in-Memory Read Mapper for Compressing Next-Gen Sequencing Datasets

Florestan Moor, Meven Mognol, Charles Deltel, Erwan Drezen, Julien Legriel, and Dominique Lavenier

Dec 2024

Abstract BIB URL PDF

As Next-Gen sequencing (NGS) technologies keep improving their accuracy and get largely deployed in human health care infrastructures, it is critical to design efficient reference-based compressors that fully leverage the capabilities of modern processors and hardware accelerators. This work proposes MiMyCS: a C++ software to achieve Mapping in Memory for Compressing Short reads. It performs lossless reference-based compression of NGS datasets such as Illumina reads. To this end, MiMyCS computes a non-exhaustive mapping against a reference genome and accelerates this step with the Processing-in-Memory architecture developed by the UPMEM company. Such architecture extends the computational power of a machine by adding dual in-line memory modules on which each memory bank has its own processing unit that runs up to 16 threads. This creates a massively parallel environment, well-fitted to alleviate memory bottlenecks. To reduce the overall amount of sequence comparisons and accelerate further the process, MiMyCS also incorporates a Bloom filters-based dispatcher that predicts against which genome parts reads are most likely to be mapped. We show with real whole human sequencing datasets that MiMyCS is able to achieve a speed-up between 1.2x and 2.7x compared to Genozip, the current leading state-of-the-art compressor, while maintaining a comparable compression ratio and lowering the overall energy consumption. The code of MiMyCS is available at https://gitlab.inria.fr/pim/org.pim.srm.
@article{demoor:hal-04821180, title = {MiMyCS: A Processing-in-Memory Read Mapper for Compressing Next-Gen Sequencing Datasets}, author = {de Moor, Florestan and Mognol, Meven and Deltel, Charles and Drezen, Erwan and Legriel, Julien and Lavenier, Dominique}, year = {2024}, month = dec, booktitle = {BIBM2024}, number = {1}, pages = {7176}, }
IEEE Access
RawAlign: Accurate, Fast, and Scalable Raw Nanopore Signal Mapping via Combining Seeding and Alignment

Joël Lindegger, Can Firtina, Nika Mansouri Ghiasi, Mohammad Sadrosadati, Mohammed Alser, and Onur Mutlu

IEEE Access, Dec 2024

Abstract BIB URL PDF

Nanopore sequencers generate raw electrical signals representing the contents of a biological sequence molecule passing through the nanopore. These signals can be analyzed directly, avoiding basecalling entirely. We observe that while existing proposals for raw signal analysis typically do well in all metrics for small genomes (e.g., viral genomes), they all perform poorly for large genomes (e.g., the human genome). Our goal is to analyze raw nanopore signals in an accurate, fast, and scalable manner. To this end, we propose RawAlign, the first work to integrate fine-grained signal alignment into the state-of-the-art raw signal mapper. To enable accurate, fast, and scalable mapping with alignment, RawAlign implements three algorithmic improvements and hardware acceleration via a vectorized implementation of fine-grained alignment. Together, these significantly reduce the overhead of typically computationally expensive fine-grained alignment. Our extensive evaluations on different use cases and various datasets show RawAlign provides 1) the most accurate mapping for large genomes and 2) and on-par performance compared to RawHash (between 0.80x-1.08x), while achieving better performance than UNCALLED and Sigmap by on average (geo. mean) 2.83x and 2.06x, respectively.
@article{rawalign2024, title = {RawAlign: Accurate, Fast, and Scalable Raw Nanopore Signal Mapping via Combining Seeding and Alignment}, author = {Lindegger, Joël and Firtina, Can and Ghiasi, Nika Mansouri and Sadrosadati, Mohammad and Alser, Mohammed and Mutlu, Onur}, year = {2024}, month = dec, journal = {IEEE Access}, volume = {12}, pages = {196855-196865}, keywords = {Genomics;Nanobioscience;Dynamic programming;Accuracy;Sequential analysis;Heuristic algorithms;Costs;Real-time systems;Single instruction multiple data;Signal mapping;Bioinformatics;genome sequence analysis;read mapping;raw signal analysis;nanopore sequencing}, doi = {10.1109/ACCESS.2024.3520669}, }
IEEE VLSI
Designing Precharge-Free Energy-Efficient Content-Addressable Memories

Ramiro Taco, Esteban Garzón, Robert Hanhan, Adam Teman, Leonid Yavits, and Marco Lanuzza

IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Oct 2024

Abstract BIB URL PDF

Content-addressable memory (CAM) is a specialized type of memory that facilitates massively parallel comparison of a search pattern against its entire content. State-of-the-art (SOTA) CAM solutions are either fast but power-hungry (NOR CAM) or slow while consuming less power (nand CAM). These limitations stem from the dynamic precharge operation, leading to excessive power consumption in NOR CAMs and charge-sharing issues in NAND CAMs. In this work, we propose a precharge-free CAM (PCAM) class for energy-efficient applications. By avoiding precharge operation, PCAM consumes less energy than a NAND CAM, while achieving search speed comparable to a NOR CAM. PCAM was designed using a 65-nm CMOS technology and comprehensively evaluated under extensive Monte Carlo (MC) simulations while taking into account layout parasitics. When benchmarked against conventional NAND CAM, PCAM demonstrates improved search run time (reduced by more than 30%) and 15% less search energy. Moreover, PCAM can cut energy consumption by more than 75% when compared to conventional NOR CAM. We further extend our analysis to the application level, functionally evaluating the CAM designs as a fully associative cache using a CPU simulator running various benchmark workloads. This analysis confirms that PCAMs represent an optimal energy-performance design choice for associative memories and their broad spectrum of applications.
@article{taco-precharge-free2024, title = {Designing Precharge-Free Energy-Efficient Content-Addressable Memories}, author = {Taco, Ramiro and Garzón, Esteban and Hanhan, Robert and Teman, Adam and Yavits, Leonid and Lanuzza, Marco}, year = {2024}, month = oct, journal = {IEEE Transactions on Very Large Scale Integration (VLSI) Systems}, volume = {32}, number = {12}, pages = {2303-2314}, keywords = {Transistors;Random access memory;Power demand;Discharges (electric);Capacitance;Memory management;Microprocessors;Energy consumption;Pattern matching;Europe;Content-addressable memory (CAM);fully associative cache;low-power;NAND CAM;NOR CAM;precharge-free}, doi = {10.1109/TVLSI.2024.3475036}, }
ICPP
Parallelization of the Banded Needleman & Wunsch Algorithm on UPMEM PiM Architecture for Long DNA Sequence Alignment

Meven Mognol, Dominique Lavenier, and Julien Legriel

In Proceedings of the 53rd International Conference on Parallel Processing, Aug 2024

Abstract BIB URL PDF

Sequence alignment is an ubiquitous task in genomics analysis whose performance can be affected by the memory-wall on a processor-centric architecture. Processing-in-Memory (PiM) architectures provide a memory system with integrated computing capabilities to alleviate this bottleneck. In this paper, we present a long read optimized version of the Needleman and Wunsch (N&W) algorithm based on PiM devices developed by the UPMEM company. Such memories add 128 computing units to a standard 16GB DIMM. On a server equipped with 20 UPMEM PiMs, our N&W implementation outperforms standard alignment tools by an order of magnitude compared to a traditional multicore server. Code available at https://github.com/upmem/usecase_dpu_alignment.
@inproceedings{mognol_2024_parallel_proc, title = {Parallelization of the Banded Needleman \& Wunsch Algorithm on UPMEM PiM Architecture for Long DNA Sequence Alignment}, author = {Mognol, Meven and Lavenier, Dominique and Legriel, Julien}, year = {2024}, month = aug, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, doi = {10.1145/3673038.3673094}, booktitle = {Proceedings of the 53rd International Conference on Parallel Processing}, pages = {1062–1071}, numpages = {10}, keywords = {Processing-in-Memory, long read Alignment, parallelism}, location = {Gotland, Sweden}, series = {ICPP '24}, }
ISCA
QUETZAL: Vector Acceleration Framework for Modern Genome Sequence Analysis Algorithms

Julian Pavon, Ivan Vargas Valdivieso, Carlos Rojas, Cesar Hernandez, Mehmet Aslan, Roger Figueras, Yichao Yuan, Joël Lindegger, Mohammed Alser, Francesc Moll, Santiago Marco-Sola, Oguz Ergin, Nishil Talati, Onur Mutlu, Osman Unsal, Mateo Valero, and Adrian Cristal

Jun 2024

Abstract BIB URL PDF

Genome sequence analysis is fundamental to medical breakthroughs such as developing vaccines, enabling genome editing, and facilitating personalized medicine. The exponentially expanding sequencing datasets and complexity of sequencing algorithms necessitate performance enhancements. While the performance of software solutions is constrained by their underlying hardware platforms, the utility of fixed-function accelerators is restricted to only certain sequencing algorithms.This paper presents QUETZAL, the first general-purpose vector acceleration framework designed for high efficiency and broad applicability across a diverse set of genomics algorithms. While a commercial CPU’s vector datapath is a promising candidate to exploit the data-level parallelism in genomics algorithms, our analysis finds that its performance is often limited due to long-latency scatter/gather memory instructions. QUETZAL introduces a hardware-software co-design comprising an accelerator microarchitecture closely integrated with the CPU’s vector datapath, alongside novel vector instructions to fully capitalize on the proposed hardware. QUETZAL integrates a set of scratchpad-style buffers meticulously designed to minimize latency associated with scatter/gather instructions during the retrieval of input genome sequences data. QUETZAL supports both short and long reads, and different types of sequencing data formats. A combination of hardware and software techniques enables QUETZAL to reduce the latency of memory instructions, perform complex computation using a single instruction, and transform data representations at runtime, resulting in overall efficiency gain. QUETZAL significantly accelerates a vectorized CPU baseline on modern genome sequence analysis algorithms by 5.7×, while incurring a small area overhead of 1.4% post place-and-route at the 7nm technology node compared to an HPC ARM CPU.
@article{10609714, title = {QUETZAL: Vector Acceleration Framework for Modern Genome Sequence Analysis Algorithms}, author = {Pavon, Julian and Valdivieso, Ivan Vargas and Rojas, Carlos and Hernandez, Cesar and Aslan, Mehmet and Figueras, Roger and Yuan, Yichao and Lindegger, Joël and Alser, Mohammed and Moll, Francesc and Marco-Sola, Santiago and Ergin, Oguz and Talati, Nishil and Mutlu, Onur and Unsal, Osman and Valero, Mateo and Cristal, Adrian}, booktitle = {2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA)}, year = {2024}, month = jun, pages = {597-612}, keywords = {Sequential analysis;Sequences;Software algorithms;Genomics;Approximation algorithms;Vectors;Hardware;Vector architecture;Genome Sequence Analysis;Flexible domain specific accelerator;CPU microarchitecture;Vector acceleration framework;Scratchpad memory}, doi = {10.1109/ISCA59077.2024.00050}, }
ISCA
MegIS: High-Performance, Energy-Efficient, and Low-Cost Metagenomic Analysis with In-Storage Processing

Nika Mansouri Ghiasi, Mohammad Sadrosadati, Harun Mustafa, Arvid Gollwitzer, Can Firtina, Julien Eudine, Haiyu Mao, Joël Lindegger, Meryem Banu Cavlak, Mohammed Alser, Jisung Park, and Onur Mutlu

Jun 2024

Abstract BIB URL PDF

Metagenomics, the study of the genome sequences of diverse organisms in a common environment, has led to significant advances in many fields. Since the species present in a metagenomic sample are not known in advance, metagenomic analysis commonly involves the key tasks of determining the species present in a sample and their relative abundances. These tasks require searching large metagenomic databases containing information on different species’ genomes. Metagenomic analysis suffers from significant data movement overhead due to moving large amounts of low-reuse data from the storage system to the rest of the system. In-storage processing can be a fundamental solution for reducing this overhead. However, designing an in-storage processing system for metagenomics is challenging because existing approaches to metagenomic analysis cannot be directly implemented in storage effectively due to the hardware limitations of modern SSDs.We propose MegIS, the first in-storage processing system designed to significantly reduce the data movement overhead of the end-to-end metagenomic analysis pipeline. MegIS is enabled by our lightweight design that effectively leverages and orchestrates processing inside and outside the storage system. Through our detailed analysis of the end-to-end metagenomic analysis pipeline and careful hardware/software co-design, we address in-storage processing challenges for metagenomics via specialized and efficient 1) task partitioning, 2) data/computation flow coordination, 3) storage technology-aware algorithmic optimizations, 4) data mapping, and 5) lightweight in-storage accelerators. MegIS’s design is flexible, capable of supporting different types of metagenomic input datasets, and can be integrated into various metagenomic analysis pipelines. Our evaluation shows that MegIS outperforms the state-of-the-art performance- and accuracy-optimized software metagenomic tools by 2.7× – 37.2× and 6.9×–100.2×, respectively, while matching the accuracy of the accuracy-optimized tool. MegIS achieves 1.5×–5.1× speedup compared to the state-of-the-art metagenomic hardware-accelerated (using processing-in-memory) tool, while achieving significantly higher accuracy.
@article{10609570, title = {MegIS: High-Performance, Energy-Efficient, and Low-Cost Metagenomic Analysis with In-Storage Processing}, author = {Ghiasi, Nika Mansouri and Sadrosadati, Mohammad and Mustafa, Harun and Gollwitzer, Arvid and Firtina, Can and Eudine, Julien and Mao, Haiyu and Lindegger, Joël and Cavlak, Meryem Banu and Alser, Mohammed and Park, Jisung and Mutlu, Onur}, year = {2024}, month = jun, booktitle = {2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA)}, pages = {660-677}, }
LNCS
Energy Efficiency Impact of Processing in Memory: A Comprehensive Review of Workloads on the UPMEM Architecture

Yann Falevoz, and Julien Legriel

Apr 2024

Abstract BIB URL

Processing-in-Memory (PIM) architectures have emerged as a promising solution for data-intensive applications, providing significant speedup by processing data directly within the memory. However, the impact of PIM on energy efficiency is not well characterized. In this paper, we provide a comprehensive review of workloads ported to the first PIM product available on the market, namely the UPMEM architecture, and quantify the impact on each workload in terms of energy efficiency. Less than the half of the reviewed papers provide insights on the impact of PIM on energy efficiency, and the evaluation methods differ from one paper to the other. To provide a comprehensive overview, we propose a methodology for estimating energy consumption and efficiency for both the PIM and baseline systems at data center level, enabling a direct comparison of the two systems. Our results show that PIM can provide significant energy savings for data intensive workloads. We also identify key factors that impact the energy efficiency of UPMEM PIM, including the workload characteristics. Overall, this paper provides valuable insights for researchers and practitioners looking to optimize energy efficiency in data-intensive applications using UPMEM PIM architecture.
@article{yann2024_energy_review, title = {Energy Efficiency Impact of Processing in Memory: A Comprehensive Review of Workloads on the UPMEM Architecture}, author = {Falevoz, Yann and Legriel, Julien}, year = {2024}, month = apr, booktitle = {Euro-Par 2023: Parallel Processing Workshops}, pages = {155-166}, doi = {10.1007/978-3-031-48803-0_13}, }
IEEE CAL
MajorK: Majority Based kmer Matching in Commodity DRAM

Zuher Jahshan, and Leonid Yavits

IEEE Computer Architecture Letters, Apr 2024

Abstract BIB URL

Fast parallel search capabilities on large datasets are required across multiple application domains. One such domain is genome analysis, which requires high-performance k mer matching in large genome databases. Recently proposed solutions implemented k mer matching in DRAM, utilizing its sheer capacity and parallelism. However, their operation is essentially bit-serial, which ultimately limits the performance, especially when matching long strings, as customary in genome analysis pipelines. The proposed solution, MajorK, enables bit-parallel majority based k mer matching in an unmodified commodity DRAM. MajorK employs multiple DRAM row activation, where the search patterns (query k mers) are coded into DRAM addresses. We evaluate MajorK on viral genome k mer matching and show that it can achieve up to 2.7 \times higher performance while providing a better matching accuracy compared to state-of-the-art DRAM based k mer matching accelerators.
@article{jahshan2024_majork, title = {MajorK: Majority Based kmer Matching in Commodity DRAM}, author = {Jahshan, Zuher and Yavits, Leonid}, year = {2024}, month = apr, journal = {IEEE Computer Architecture Letters}, }
IEEE Access
ViRAL: Vision Transformer Based Accelerator for ReAL Time Lineage Assignment of Viral Pathogens

Zuher Jahshan, Esteban Garzón, and Leonid Yavits

IEEE Access, Feb 2024

Abstract BIB URL PDF

Real-time genome detection, classification and lineage assignment are critical for efficient tracking of emerging mutations and variants during viral pandemics such as Covid-19. For genomic surveillance to work effectively, each new viral genome sequence must be quickly and accurately associated with an existing viral family (lineage). ViRAL is a hardware-accelerated platform for real-time viral genome lineage assignment based on minhashing and Vision Transformer. Minhashing is a locality sensitive hashing based technique for finding regions of similarity within sequenced genomes. Vision Transformer is a model for image classification that employs a Transformer-like architecture over patches of images. In ViRAL, such image patches are genome fragments extracted from the regions of high similarity. ViRAL is especially efficient in lineage assignment of extremely low quality (or highly ambiguous) genomic data, i.e. when a large fraction of DNA bases are missing in an assembled genome. We implement ViRAL on CPU, GPU and a custom-designed hardware accelerator denoted ACMI. ViRAL assigns newly sequenced SARS-CoV-2 genomes to existing lineages with the top-1 accuracy of 94.2%. The probability of the correct assignment to be found among the five most likely placements generated by ViRAL (top-5 accuracy) is 99.8%. Accelerated ViRAL outperforms the fastest state-of-the-art assignment tools by 69.4\times . It also outperforms ViRAL GPU implementation by 19.5\times . ViRAL strongly outperforms the state-of-the-art solutions in assigning highly-ambiguous genomes: while state-of-the-art tools fail to assign lineage to genomes with 50% ambiguity, ViRAL achieves 77.6% assignment accuracy. We make ViRAL available to the research community through GitHub.
@article{jahshan2024_viRal, title = {ViRAL: Vision Transformer Based Accelerator for ReAL Time Lineage Assignment of Viral Pathogens}, author = {Jahshan, Zuher and Garzón, Esteban and Yavits, Leonid}, year = {2024}, month = feb, journal = {IEEE Access}, }
Bioinformatics
ViTAL: Vision TrAnsformer based Low coverage SARS-CoV-2 lineage assignment

Zuher Jahshan, and Leonid Yavits

Bioinformatics, Feb 2024

Abstract BIB URL PDF

Rapid spread of viral diseases such as Coronavirus disease 2019 (COVID-19) highlights an urgent need for efficient surveillance of virus mutation and transmission dynamics, which requires fast, inexpensive and accurate viral lineage assignment. The first two goals might be achieved through low-coverage whole-genome sequencing (LC-WGS) which enables rapid genome sequencing at scale and at reduced costs. Unfortunately, LC-WGS significantly diminishes the genomic details, rendering accurate lineage assignment very challenging. We present ViTAL, a novel deep learning algorithm specifically designed to perform lineage assignment of low coverage-sequenced genomes. ViTAL utilizes a combination of MinHash for genomic feature extraction and Vision Transformer for fine-grain genome classification and lineage assignment. We show that ViTAL outperforms state-of-the-art tools across diverse coverage levels, reaching up to 87.7% lineage assignment accuracy at 1x coverage where state-of-the-art tools such as UShER and Kraken2 achieve the accuracy of 5.4% and 27.4% respectively. ViTAL achieves comparable accuracy results with up to 8x lower coverage than state-of-the-art tools. We explore ViTAL’s ability to identify the lineages of novel genomes, i.e. genomes the Vision Transformer was not trained on. We show how ViTAL can be applied to preliminary phylogenetic placement of novel variants. The data underlying this article are available in https://github.com/zuherJahshan/vital and can be accessed with 10.5281/zenodo.10688110.
@article{jahshan2024_viTal, title = {ViTAL: Vision TrAnsformer based Low coverage SARS-CoV-2 lineage assignment}, author = {Jahshan, Zuher and Yavits, Leonid}, year = {2024}, month = feb, journal = {Bioinformatics}, }
IEEE SSCL
OCCAM: An Error Oblivious CAM

Yuval Harary, Paz Snapir, Eyal Reshef, Esteban Garzón, and Leonid Yavits

IEEE Solid-State Circuits Letters, Feb 2024

Abstract BIB URL PDF

Content addressable memories (CAMs) are widely used in many applications in general purpose computer microarchitecture, networking and domain-specific hardware accelerators. In addition to storing and reading data, CAMs enable simultaneous compare of query datawords with the entire memory content. Similar to SRAM and DRAM, CAMs are prone to errors and faults. While error correcting codes (ECCs) are widely used in DRAM and SRAM, they are not directly applicable in CAM: if a dataword that is supposed to match a query altered due to an error, it will falsely mismatch even if it is ECC-encoded. We propose OCCAM, an error oblivious CAM, which combines ECC and approximate search (matching) to allow tolerating a large and dynamically configurable number of errors. We manufactured the OCCAM silicon prototype using 65-nm commercial process and verified its error tolerance capabilities through silicon measurements. OCCAM tolerates 11% error rate (7 bit errors in each 64-bit memory row) with 100% sensitivity and specificity.
@article{harary2024_occam, title = {OCCAM: An Error Oblivious CAM}, author = {Harary, Yuval and Snapir, Paz and Reshef, Eyal and Garzón, Esteban and Yavits, Leonid}, year = {2024}, month = feb, journal = {IEEE Solid-State Circuits Letters}, }
IEEE Access
FASTA: Revisiting Fully Associative Memories in Computer Microarchitecture

Esteban Garzón, Robert Hanhan, Marco Lanuzza, Adam Teman, and Leonid Yavits

IEEE Access, Jan 2024

Abstract BIB URL PDF

Associative access is widely used in fundamental microarchitectural components, such as caches and TLBs. However, associative (or content addressable) memories (CAMs) have been traditionally considered too large, too energy-hungry, and not scalable, and therefore, have limited use in modern computer microarchitecture. This work revisits these presumptions and proposes an energy-efficient fully-associative tag array (FASTA) architecture, based on a novel complementary CAM (CCAM) bitcell. CCAM offers a full CMOS solution for CAM, removing the need for time- and energy-consuming precharge and combining the speed of NOR CAM and low energy consumption of NAND CAM. While providing better performance and energy consumption, CCAM features a larger area compared to state-of-the-art CAM designs. We further show how FASTA can be used to construct a novel aliasing-free, energy-efficient, Very-Many-Way Associative (VMWA) cache. Circuit-level simulations using 16 nm FinFET technology show that a 128 kB FASTA-based 256-way 8-set associative cache is 28% faster and consumes 88% less energy-per-access than a same sized 8-way (256-set) SRAM based cache, while also providing aliasing-free operation. System-level evaluation performed on the Sniper simulator shows that the VMWA cache exhibits lower Misses Per Kilo Instructions (MPKI) for the majority of benchmarks. Specifically, the 256-way associative cache achieves 17.3%, 11.5%, and 1.2% lower average MPKI for L1, L2, and L3 caches, respectively, compared to a 16-way associative cache. The average IPC improvement for L1, L2, and L3 caches are 1.6%, 1.4%, and 0.2%, respectively.
@article{esteban2024_fasta, title = {FASTA: Revisiting Fully Associative Memories in Computer Microarchitecture}, author = {Garzón, Esteban and Hanhan, Robert and Lanuzza, Marco and Teman, Adam and Yavits, Leonid}, year = {2024}, month = jan, journal = {IEEE Access}, }

2023

Nat Methods
SVDSS: structural variation discovery in hard-to-call genomic regions using sample-specific strings from accurate long reads.

Luca Denti, Parsoa Khorsand, Paola Bonizzoni, Fereydoun Hormozdiari, and Rayan Chikhi

Nature methods, Apr 2023

Abstract BIB URL

Structural variants (SVs) account for a large amount of sequence variability across genomes and play an important role in human genomics and precision medicine. Despite intense efforts over the years, the discovery of SVs in individuals remains challenging due to the diploid and highly repetitive structure of the human genome, and by the presence of SVs that vastly exceed sequencing read lengths. However, the recent introduction of low-error long-read sequencing technologies such as PacBio HiFi may finally enable these barriers to be overcome. Here we present SV discovery with sample-specific strings (SVDSS)-a method for discovery of SVs from long-read sequencing technologies (for example, PacBio HiFi) that combines and effectively leverages mapping-free, mapping-based and assembly-based methodologies for overall superior SV discovery performance. Our experiments on several human samples show that SVDSS outperforms state-of-the-art mapping-based methods for discovery of insertion and deletion SVs in PacBio HiFi reads and achieves notable improvements in calling SVs in repetitive regions of the genome.
@article{Denti2023, author = {Denti, Luca and Khorsand, Parsoa and Bonizzoni, Paola and Hormozdiari, Fereydoun and Chikhi, Rayan}, journal = {Nature methods}, title = {{SVDSS}: structural variation discovery in hard-to-call genomic regions using sample-specific strings from accurate long reads.}, year = {2023}, issn = {1548-7105}, month = apr, pages = {550--558}, volume = {20}, citation-subset = {IM}, completed = {2023-04-13}, country = {United States}, doi = {10.1038/s41592-022-01674-1}, issn-linking = {1548-7091}, issue = {4}, keywords = {Humans; Sequence Analysis, DNA, methods; High-Throughput Nucleotide Sequencing, methods; Genomics, methods; Genome, Human; Repetitive Sequences, Nucleic Acid}, nlm-id = {101215604}, owner = {NLM}, pii = {10.1038/s41592-022-01674-1}, pmc = {4108431}, pmid = {36550274}, pubmodel = {Print-Electronic}, pubstate = {ppublish}, revised = {2024-03-16}, }
arXiv
DRAMA: Commodity DRAM based Content Addressable Memory

Leonid Yavits

arXiv, Dec 2023

Abstract BIB URL PDF

Fast parallel search capabilities on large datasets provided by content addressable memories (CAM) are required across multiple application domains. However compared to RAM, CAMs feature high area overhead and power consumption, and as a result, they scale poorly. The proposed solution, DRAMA, enables CAM, ternary CAM (TCAM) and approximate (similarity) search CAM functionalities in unmodified commodity DRAM. DRAMA performs compare operation in a bit-serial fashion, where the search pattern (query) is coded in DRAM addresses. A single bit compare (XNOR) in DRAMA is identical to a regular DRAM read. AND and OR operations required for NAND CAM and NOR CAM respectively are implemented using nonstandard DRAM timing. We evaluate DRAMA on bacterial DNA classification and show that DRAMA can achieve 3.6 times higher performance and 19.6 times lower power consumption compared to state-of-the-art CMOS CAM based genome classification accelerator.
@article{yavits2023drama, title = {DRAMA: Commodity DRAM based Content Addressable Memory}, author = {Yavits, Leonid}, year = {2023}, month = dec, eprint = {2312.15527}, journal = {arXiv}, }
Genome Biology
A Framework for Designing Efficient Deep Learning-Based Genomic Basecallers

Gagandeep Singh, Mohammed Alser, Alireza Khodamoradi, Kristof Denolf, Can Firtina, Meryem Banu Cavlak, Henk Corporaal, and Onur Mutlu

bioRxiv, Nov 2023

Abstract BIB URL PDF

Nanopore sequencing is a widely-used high-throughput genome sequencing technology that can sequence long fragments of a genome. Nanopore sequencing generates noisy electrical signals that need to be converted into a standard string of DNA nucleotide bases using a computational step called basecalling. The performance of basecalling has critical implications for all later steps in genome analysis. Many researchers adopt complex deep learning-based models from the speech recognition domain to perform basecalling without considering the compute demands of such models, which leads to slow, inefficient, and memory-hungry basecallers. Therefore, there is a need to reduce the computation and memory cost of basecalling while maintaining accuracy. However, developing a very fast basecaller that can provide high accuracy requires a deep understanding of genome sequencing, machine learning, and hardware design. Our goal is to develop a comprehensive framework for creating deep learning-based basecallers that provide high efficiency and performance. We introduce RUBICON, a framework to develop hardware-optimized basecallers. RUBICON consists of two novel machine-learning techniques that are specifically designed for basecalling. First, we introduce the first quantization-aware basecalling neural architecture search (QABAS) framework to specialize the basecalling neural network architecture for a given hardware acceleration platform while jointly exploring and finding the best bit-width precision for each neural network layer. Second, we develop SkipClip, the first technique to remove the skip connections present in modern basecallers to greatly reduce resource and storage requirements without any loss in basecalling accuracy. We demonstrate the benefits of RUBICON by developing RUBICALL, the first hardware-optimized basecaller that performs fast and accurate basecalling. Our experimental results on state-of-the-art computing systems show that RUBICALL is a fast, memory-efficient, and hardware-friendly basecaller. Compared to the fastest state-of-the-art basecaller, RUBICALL provides a 3.96× speedup with 2.97% higher accuracy. Compared to an expert-designed basecaller, RUBICALL provides a 141.15× speedup without losing accuracy while also achieving a 6.88× and 2.94× reduction in neural network model size and the number of parameters, respectively. We show that RUBICON helps researchers develop hardware-optimized basecallers that are superior to expert-designed models and can inspire independent future ideas.
@article{singh2023_DL_basecaller, author = {Singh, Gagandeep and Alser, Mohammed and Khodamoradi, Alireza and Denolf, Kristof and Firtina, Can and Cavlak, Meryem Banu and Corporaal, Henk and Mutlu, Onur}, journal = {bioRxiv}, title = {A Framework for Designing Efficient Deep Learning-Based Genomic Basecallers}, year = {2023}, month = nov, doi = {10.1101/2022.11.20.517297}, }
PACT
SimplePIM: A Software Framework for Productive and Efficient Processing-in-Memory

Jinfan Chen, Juan Gómez-Luna, Izzat El Hajj, Yuxin Guo, and Onur Mutlu

Parallel Architectures and Compilation Techniques, Oct 2023

Abstract BIB URL PDF

Data movement between memory and processors is a major bottleneck in modern computing systems. The processing-in-memory (PIM) paradigm aims to alleviate this bottleneck by performing computation inside memory chips. Real PIM hardware (e.g., the UPMEM system) is now available and has demonstrated potential in many applications. However, programming such real PIM hardware remains a challenge for many programmers. This paper presents a new software framework, SimplePIM, to aid programming real PIM systems. The framework processes arrays of arbitrary elements on a PIM device by calling iterator functions from the host and provides primitives for communication among PIM cores and between PIM and the host system. We implement SimplePIM for the UPMEM PIM system and evaluate it on six major applications. Our results show that SimplePIM enables 66.5% to 83.1% reduction in lines of code in PIM programs. The resulting code leads to higher performance (between 10% and 37% speedup) than hand-optimized code in three applications and provides comparable performance in three others. SimplePIM is fully and freely available at https://github.com/CMU-SAFARI/SimplePIM
@article{jinfan2023_simplePIM, title = {SimplePIM: A Software Framework for Productive and Efficient Processing-in-Memory}, author = {Chen, Jinfan and Gómez-Luna, Juan and Hajj, Izzat El and Guo, Yuxin and Mutlu, Onur}, year = {2023}, month = oct, journal = {Parallel Architectures and Compilation Techniques}, }
IEEE TCS
A Low-Complexity Sensing Scheme for Approximate Matching Content-Addressable Memory

Esteban Garzón, Roman Golman, Marco Lanuzza, Adam Teman, and Leonid Yavits

IEEE Transactions on Circuits and Systems II, Oct 2023

Abstract BIB URL PDF

The need for approximate rather than exact search arises in numerous compare-intensive applications, from networking to computational genomics. This brief presents a novel sensing approach for approximate matching content-addressable memory (CAM) designed to handle large Hamming distances (HDs) between the query pattern and stored data. The proposed matchline sensing scheme (MLSS) employs a replica mechanism and a 12-transistor positive feedback sense amplifier to effectively resolve the approximate match operation. The MLSS was integrated into a 4 kB approximate CAM array and fabricated in a 65 nm CMOS technology. With an overall area footprint of 0.0048 mm2, which includes 512 sense amplifiers and the replica mechanism, the MLSS allows a flexible and dynamic adjustment of the HD tolerance threshold via several design variables. Experimental measurements demonstrate the efficiency of our sensing scheme in tolerating very large HDs with the highest sensitivity.
@article{esteban2023_sensing_scheme, title = {A Low-Complexity Sensing Scheme for Approximate Matching Content-Addressable Memory}, author = {Garzón, Esteban and Golman, Roman and Lanuzza, Marco and Teman, Adam and Yavits, Leonid}, year = {2023}, month = oct, journal = {IEEE Transactions on Circuits and Systems II}, }
ISPASS
An Experimental Evaluation of Machine Learning Training on a Real Processing-in-Memory System

Juan Gómez-Luna, Yuxin Guo, Sylvan Brocard, Julien Legriel, Remy Cimadomo, Geraldo F. Oliveira, Gagandeep Singh, and Onur Mutlu

Sep 2023

Abstract BIB URL PDF

Training machine learning (ML) algorithms is a computationally intensive process, which is frequently memory-bound due to repeatedly accessing large training datasets. As a result, processor-centric systems (e.g., CPU, GPU) suffer from costly data movement between memory units and processing units, which consumes large amounts of energy and execution cycles. Memory-centric computing systems, i.e., with processing-in-memory (PIM) capabilities, can alleviate this data movement bottleneck. Our goal is to understand the potential of modern general-purpose PIM architectures to accelerate ML training. To do so, we (1) implement several representative classic ML algorithms (namely, linear regression, logistic regression, decision tree, K-Means clustering) on a real-world general-purpose PIM architecture, (2) rigorously evaluate and characterize them in terms of accuracy, performance and scaling, and (3) compare to their counterpart implementations on CPU and GPU. Our evaluation on a real memory-centric computing system with more than 2500 PIM cores shows that general-purpose PIM architectures can greatly accelerate memory-bound ML workloads, when the necessary operations and datatypes are natively supported by PIM hardware. For example, our PIM implementation of decision tree is 27× faster than a state-of-the-art CPU version on an 8-core Intel Xeon, and 1.34× faster than a state-of-the-art GPU version on an NVIDIA A100. Our K-Means clustering on PIM is 2.8× and 3.2× than state-of-the-art CPU and GPU versions, respectively. To our knowledge, our work is the first one to evaluate ML training on a real-world PIM architecture. We conclude with key observations, takeaways, and recommendations that can inspire users of ML workloads, programmers of PIM architectures, and hardware designers & architects of future memory-centric computing systems.
@article{gomezluna_ispass_2023, title = {An Experimental Evaluation of Machine Learning Training on a Real Processing-in-Memory System}, author = {Gómez-Luna, Juan and Guo, Yuxin and Brocard, Sylvan and Legriel, Julien and Cimadomo, Remy and Oliveira, Geraldo F. and Singh, Gagandeep and Mutlu, Onur}, year = {2023}, month = sep, }
GR
Efficient mapping of accurate long reads in minimizer space with mapquik

Baris Ekim, Kristoffer Sahlin, Paul Medvedev, Bonnie Berger, and Rayan Chikhi

Genome Research, Jun 2023

Abstract BIB URL PDF

DNA sequencing data continue to progress toward longer reads with increasingly lower sequencing error rates. We focus on the critical problem of mapping, or aligning, low-divergence sequences from long reads (e.g., Pacific Biosciences [PacBio] HiFi) to a reference genome, which poses challenges in terms of accuracy and computational resources when using cutting-edge read mapping approaches that are designed for all types of alignments. A natural idea would be to optimize efficiency with longer seeds to reduce the probability of extraneous matches; however, contiguous exact seeds quickly reach a sensitivity limit. We introduce mapquik, a novel strategy that creates accurate longer seeds by anchoring alignments through matches of k consecutively sampled minimizers (k-min-mers) and only indexing k-min-mers that occur once in the reference genome, thereby unlocking ultrafast mapping while retaining high sensitivity. We show that mapquik significantly accelerates the seeding and chaining steps—fundamental bottlenecks to read mapping—for both the human and maize genomes with Formula sensitivity and near-perfect specificity. On the human genome, for both real and simulated reads, mapquik achieves a Formula speedup over the state-of-the-art tool minimap2, and on the maize genome, mapquik achieves a Formula speedup over minimap2, making mapquik the fastest mapper to date. These accelerations are enabled from not only minimizer-space seeding but also a novel heuristic Formula pseudochaining algorithm, which improves upon the long-standing Formula bound. Minimizer-space computation builds the foundation for achieving real-time analysis of long-read sequencing data.
@article{ekim2023_mapquik, title = {Efficient mapping of accurate long reads in minimizer space with mapquik}, author = {Ekim, Baris and Sahlin, Kristoffer and Medvedev, Paul and Berger, Bonnie and Chikhi, Rayan}, year = {2023}, month = jun, journal = {Genome Research}, }
IEEE Access
A low-energy DMTJ-based ternary content- addressable memory with reliable sub-nanosecond search operation

Esteban Garzón, Leonid Yavits, Giovanni Finocchio, Mario Carpentieri, Adam Teman, and Marco Lanuzza

IEEE Access, Feb 2023

Abstract BIB URL PDF

In this paper, we propose an energy-efficient, reliable, hybrid, 10-transistor/2-Double-Barrier-Magnetic-Tunnel-Junction (10T2DMTJ) non-volatile (NV) ternary content-addressable memory (TCAM) with sub-nanosecond search operation. Our cell design relies on low-energy-demanding MTJs organized in a low-complexity voltage-divider-based circuit along with a simple dynamic logic CMOS matching network, which improves the search reliability. The proposed NV-TCAM was designed in 28 nm FDSOI process and evaluated under exhaustive Monte Carlo simulations. When compared to the best previous proposed NV-TCAMs, our solution achieves lower search error rate (3.8×) and lower write and search energy (–73% and–79%, respectively), while also exhibiting smaller area footprint (–74%). Such benefits are achieved at the expense of reduced search speed.
@article{garzon2023dmtj, author = {Garzón, Esteban and Yavits, Leonid and Finocchio, Giovanni and Carpentieri, Mario and Teman, Adam and Lanuzza, Marco}, journal = {IEEE Access}, title = {A low-energy DMTJ-based ternary content- addressable memory with reliable sub-nanosecond search operation}, year = {2023}, volume = {11}, number = {}, month = feb, pages = {16812-16819}, keywords = {Computer architecture;Microprocessors;Transistors;Magnetic tunneling;Switches;Integrated circuit reliability;Error analysis;Double-barrier MTJ;non-volatile TCAM (NV-TCAM);energy-efficiency;low-power}, doi = {10.1109/ACCESS.2023.3245981}, }
ACM TACO
ApHMM: Accelerating Profile Hidden Markov Models for Fast and Energy-Efficient Genome Analysis

Can Firtina, Kamlesh Pillai, Gurpreet S. Kalsi, Bharathwaj Suresh, Damla Senol Cali, Jeremie S. Kim, Taha Shahroodi, Meryem Banu Cavlak, Joël Lindegger, Mohammed Alser, Juan Gómez Luna, Sreenivas Subramoney, and Onur Mutlu

ACM Trans. Archit. Code Optim., Dec 2023

Abstract BIB URL PDF Code

Profile hidden Markov models (pHMMs) are widely employed in various bioinformatics applications to identify similarities between biological sequences, such as DNA or protein sequences. In pHMMs, sequences are represented as graph structures, where states and edges capture modifications (i.e., insertions, deletions, and substitutions) by assigning probabilities to them. These probabilities are subsequently used to compute the similarity score between a sequence and a pHMM graph. The Baum-Welch algorithm, a prevalent and highly accurate method, utilizes these probabilities to optimize and compute similarity scores. Accurate computation of these probabilities is essential for the correct identification of sequence similarities. However, the Baum-Welch algorithm is computationally intensive, and existing solutions offer either software-only or hardware-only approaches with fixed pHMM designs. When we analyze state-of-the-art works, we identify an urgent need for a flexible, high-performance, and energy-efficient hardware-software co-design to address the major inefficiencies in the Baum-Welch algorithm for pHMMs. We introduce ApHMM, the first flexible acceleration framework designed to significantly reduce both computational and energy overheads associated with the Baum-Welch algorithm for pHMMs. ApHMM employs hardware-software co-design to tackle the major inefficiencies in the Baum-Welch algorithm by 1) designing flexible hardware to accommodate various pHMM designs, 2) exploiting predictable data dependency patterns through on-chip memory with memoization techniques, 3) rapidly filtering out unnecessary computations using a hardware-based filter, and 4) minimizing redundant computations. ApHMM achieves substantial speedups of 15.55x - 260.03x, 1.83x - 5.34x, and 27.97x when compared to CPU, GPU, and FPGA implementations of the Baum-Welch algorithm, respectively. ApHMM outperforms state-of-the-art CPU implementations in three key bioinformatics applications: 1) error correction, 2) protein family search, and 3) multiple sequence alignment, by 1.29x - 59.94x, 1.03x - 1.75x, and 1.03x - 1.95x, respectively, while improving their energy efficiency by 64.24x - 115.46x, 1.75x, 1.96x.
@article{firtina_aphmm_2023, title = {ApHMM: Accelerating Profile Hidden Markov Models for Fast and Energy-Efficient Genome Analysis}, journal = {ACM Trans. Archit. Code Optim.}, author = {Firtina, Can and Pillai, Kamlesh and Kalsi, Gurpreet S. and Suresh, Bharathwaj and Cali, Damla Senol and Kim, Jeremie S. and Shahroodi, Taha and Cavlak, Meryem Banu and Lindegger, Jo\"{e}l and Alser, Mohammed and Luna, Juan G\'{o}mez and Subramoney, Sreenivas and Mutlu, Onur}, year = {2023}, issn = {1544-3566}, month = dec, doi = {10.1145/3632950} }
IEEE TC
DIPER: Detection and Identification of Pathogens using Edit distance-tolerant Resistive CAM

Itay Merlin, Esteban Garzón, Alex Fish, and Leonid Yavits

IEEE Transactions on Computers, Dec 2023

Abstract BIB URL PDF

We propose a novel resistive edit distance-tolerant content addressable memory for computational genomics applications, particularly for detection and identification of pathogens of pandemic importance. Unlike state-of-the-art approximate search solutions that tolerate small number of replacements between the query pattern and the stored data, DIPER tolerates insertions and deletions, ubiquitous in genomics. DIPER achieves up to 1.7× higher F1 score for high-quality DNA reads and up to 6.2× higher F1 score for DNA reads with 15% error rate, compared to state-of-the-art DNA classification tool Kraken2. Simulated at 500MHz, DIPER provides 910× average speedup over Kraken2.
@article{Merlin2023, author = {Merlin, Itay and Garzón, Esteban and Fish, Alex and Yavits, Leonid}, journal = {IEEE Transactions on Computers}, title = {DIPER: Detection and Identification of Pathogens using Edit distance-tolerant Resistive CAM}, year = {2023}, issn = {0018-9340}, pages = {1-12}, doi = {10.1109/tc.2023.3315829}, owner = {calkan}, }
MICRO
Swordfish: A Framework for Evaluating Deep Neural Network-Based Basecalling Using Computation-In-Memory with Non-Ideal Memristors

Taha Shahroodi, Gagandeep Singh, Mahdi Zahedi, Haiyu Mao, Joel Lindegger, Can Firtina, Stephan Wong, Onur Mutlu, and Said Hamdioui

In Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Oct 2023

Abstract BIB URL PDF

Basecalling, an essential step in many genome analysis studies, relies on large Deep Neural Network s (DNN s) to achieve high accuracy. Unfortunately, these DNN s are computationally slow and inefficient, leading to considerable delays and resource constraints in the sequence analysis process. A Computation-In-Memory (CIM) architecture using memristors can significantly accelerate the performance of DNN s. However, inherent device non-idealities and architectural limitations of such designs can greatly degrade the basecalling accuracy, which is critical for accurate genome analysis. To facilitate the adoption of memristor-based CIM designs for basecalling, it is important to (1) conduct a comprehensive analysis of potential CIM architectures and (2) develop effective strategies for mitigating the possible adverse effects of inherent device non-idealities and architectural limitations. This paper proposes Swordfish, a novel hardware/software co-design framework that can effectively address the two aforementioned issues. Swordfish incorporates seven circuit and device restrictions or non-idealities from characterized real memristor-based chips. Swordfish leverages various hardware/software co-design solutions to mitigate the basecalling accuracy loss due to such non-idealities. To demonstrate the effectiveness of Swordfish, we take Bonito, the state-of-the-art (i.e., accurate and fast), open-source basecaller as a case study. Our experimental results using Swordfish show that a CIM architecture can realistically accelerate Bonito for a wide range of real datasets by an average of 25.7x, with an accuracy loss of 6.01%.
@inproceedings{shahroodi_swordfish_2023, title = {Swordfish: A Framework for Evaluating Deep Neural Network-Based Basecalling Using Computation-In-Memory with Non-Ideal Memristors}, booktitle = {Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)}, author = {Shahroodi, Taha and Singh, Gagandeep and Zahedi, Mahdi and Mao, Haiyu and Lindegger, Joel and Firtina, Can and Wong, Stephan and Mutlu, Onur and Hamdioui, Said}, year = {2023}, month = oct, pages = {1437–1452}, isbn = {9798400703294}, doi = {10.1145/3613424.3614252}, }
bioRxiv
DASH-CAM: Dynamic Approximate SearcH Content Addressable Memory for genome classification

Zuher Jahshan, Itay Merlin, Esteban Garzón, and Leonid Yavits

bioRxiv, Oct 2023

Abstract BIB URL PDF

We propose a novel dynamic storage-based approximate search content addressable memory (DASH-CAM) for computational genomics applications, particularly for identification and classification of viral pathogens of epidemic significance. DASH-CAM provides 5.5x better density compared to state-of-the-art SRAM-based approximate search CAM. This allows using DASH-CAM as a portable classifier that can be applied to pathogen surveillance in low-quality field settings during pandemics, as well as to pathogen diagnostics at points of care. DASH-CAM approximate search capabilities allow a high level of flexibility when dealing with a variety of industrial sequencers with different error profiles. DASH-CAM achieves up to 30% and 20% higher F1 score when classifying DNA reads with 10% error rate, compared to state-of-the-art DNA classification tools MetaCache-GPU and Kraken2 respectively. Simulated at 1GHz, DASH-CAM provides 1,178x and 1,040x average speedup over MetaCache-GPU and Kraken2 respectively.
@article{Jahshan2023, author = {Jahshan, Zuher and Merlin, Itay and Garz{\'o}n, Esteban and Yavits, Leonid}, title = {DASH-CAM: Dynamic Approximate SearcH Content Addressable Memory for genome classification}, elocation-id = {2023.09.29.560142}, year = {2023}, month = oct, doi = {10.1101/2023.09.29.560142}, publisher = {Cold Spring Harbor Laboratory}, journal = {bioRxiv}, }
arXiv
RawHash2: Accurate and Fast Mapping of Raw Nanopore Signals using a Hash-based Seeding Mechanism

Can Firtina, Melina Soysal, Joël Lindegger, and Onur Mutlu

arXiv, Sep 2023

Abstract BIB URL PDF Code

Raw nanopore signals can be analyzed while they are being generated, a process known as real-time analysis. Real-time analysis of raw signals is essential to utilize the unique features that nanopore sequencing provides, enabling the early stopping of the sequencing of a read or the entire sequencing run based on the analysis. The state-of-the-art mechanism, RawHash, offers the first hash-based efficient and accurate similarity identification between raw signals and a reference genome by quickly matching their hash values. In this work, we introduce RawHash2, which provides major improvements over RawHash, including a more sensitive chaining implementation, weighted mapping decisions, frequency filters to reduce ambiguous seed hits, minimizers for hash-based sketching, and support for the R10.4 flow cell version and various data formats such as POD5. Compared to RawHash, RawHash2 provides the best F1 accuracy in all datasets and provides better average throughput by 2.5x. Source code is available at https://github.com/CMU-SAFARI/RawHash.
@article{firtina_rawhash2_2023, title = {{RawHash2}: Accurate and Fast Mapping of Raw Nanopore Signals using a Hash-based Seeding Mechanism}, author = {Firtina, Can and Soysal, Melina and Lindegger, Joël and Mutlu, Onur}, journal = {arXiv}, year = {2023}, month = sep, doi = {10.48550/arXiv.2309.05771}, }
arXiv
GateSeeder: Near-memory CPU-FPGA Acceleration of Short and Long Read Mapping

Julien Eudine, Mohammed Alser, Gagandeep Singh, Can Alkan, and Onur Mutlu

arXiv, Sep 2023

Abstract BIB URL PDF Code

Motivation: Read mapping is a computationally expensive process and a major bottleneck in genomics analyses. The performance of read mapping is mainly limited by the performance of three key computational steps: Index Querying, Seed Chaining, and Sequence Alignment. The first step is dominated by how fast and frequent it accesses the main memory (i.e., memory-bound), while the latter two steps are dominated by how fast the CPU can compute their computationally-costly dynamic programming algorithms (i.e., compute-bound). Accelerating these three steps by exploiting new algorithms and new hardware devices is essential to accelerate most genome analysis pipelines that widely use read mapping. Given the large body of work on accelerating Sequence Alignment, this work focuses on significantly improving the remaining steps. Results: We introduce GateSeeder, the first CPU-FPGA-based near-memory acceleration of both short and long read mapping. GateSeeder exploits near-memory computation capability provided by modern FPGAs that couple a reconfigurable compute fabric with high-bandwidth memory (HBM) to overcome the memory-bound and compute-bound bottlenecks. GateSeeder also introduces a new lightweight algorithm for finding the potential matching segment pairs. Using real ONT, HiFi, and Illumina sequences, we experimentally demonstrate that GateSeeder outperforms Minimap2, without performing sequence alignment, by up to 40.3x, 4.8x, and 2.3x, respectively. When performing read mapping with sequence alignment, GateSeeder outperforms Minimap2 by 1.15-4.33x (using KSW2) and by 1.97-13.63x (using WFA-GPU)
@article{Eudine2023gateseeder, title = {GateSeeder: Near-memory CPU-FPGA Acceleration of Short and Long Read Mapping}, author = {Eudine, Julien and Alser, Mohammed and Singh, Gagandeep and Alkan, Can and Mutlu, Onur}, year = {2023}, month = sep, eprint = {2309.17063}, archiveprefix = {arXiv}, journal = {arXiv}, }
IEEE VLSI
ClaPIM: Scalable Sequence CLAssification using Processing-In-Memory

Marcel Khalifa, Barak Hoffer, Orian Leitersdorf, Robert Hanhan, Ben Perach, Leonid Yavits, and Shahar Kvatinsky

IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Jul 2023

Abstract BIB URL PDF

DNA sequence classification is a fundamental task in computational biology with vast implications for applications such as disease prevention and drug design. Therefore, fast high-quality sequence classifiers are significantly important. This paper introduces ClaPIM, a scalable DNA sequence classification architecture based on the emerging concept of hybrid in-crossbar and near-crossbar memristive processing-in-memory (PIM). We enable efficient and high-quality classification by uniting the filter and search stages within a single algorithm. Specifically, we propose a custom filtering technique that drastically narrows the search space and a search approach that facilitates approximate string matching through a distance function. ClaPIM is the first PIM architecture for scalable approximate string matching that benefits from the high density of memristive crossbar arrays and the massive computational parallelism of PIM. Compared with Kraken2, a state-of-the-art software classifier, ClaPIM provides significantly higher classification quality (up to 20x improvement in F1 score) and also demonstrates a 1.8x throughput improvement. Compared with EDAM, a recently-proposed SRAM-based accelerator that is restricted to small datasets, we observe both a 30.4x improvement in normalized throughput per area and a 7% increase in classification precision.
@article{Khalifa2023, author = {Khalifa, Marcel and Hoffer, Barak and Leitersdorf, Orian and Hanhan, Robert and Perach, Ben and Yavits, Leonid and Kvatinsky, Shahar}, journal = {IEEE Transactions on Very Large Scale Integration (VLSI) Systems}, title = {ClaPIM: Scalable Sequence CLAssification using Processing-In-Memory}, year = {2023}, volume = {31}, number = {9}, month = jul, pages = {1347--1357}, archiveprefix = {arXiv}, doi = {10.48550/arXiv.2302.08284}, eid = {arXiv:2302.08284}, eprint = {2302.08284}, keywords = {Computer Science - Machine Learning, Electrical Engineering and Systems Science - Systems and Control}, owner = {calkan}, }
Bioinformatics
RawHash: Enabling Fast and Accurate Real-Time Analysis of Raw Nanopore Signals for Large Genomes

Can Firtina, Nika Mansouri Ghiasi, Joel Lindegger, Gagandeep Singh, Meryem Banu Cavlak, Haiyu Mao, and Onur Mutlu

Bioinformatics, Jul 2023

Proceedings of the 31st Annual Conference on Intelligent Systems for Molecular Biology (ISMB) and the 22nd European Conference on Computational Biology (ECCB)

Abstract BIB URL PDF Code

Nanopore sequencers generate electrical raw signals in real-time while sequencing long genomic strands. These raw signals can be analyzed as they are generated, providing an opportunity for real-time genome analysis. An important feature of nanopore sequencing, Read Until, can eject strands from sequencers without fully sequencing them, which provides opportunities to computationally reduce the sequencing time and cost. However, existing works utilizing Read Until either 1) require powerful computational resources that may not be available for portable sequencers or 2) lack scalability for large genomes, rendering them inaccurate or ineffective. We propose RawHash, the first mechanism that can accurately and efficiently perform real-time analysis of nanopore raw signals for large genomes using a hash-based similarity search. To enable this, RawHash ensures the signals corresponding to the same DNA content lead to the same hash value, regardless of the slight variations in these signals. RawHash achieves an accurate hash-based similarity search via an effective quantization of the raw signals such that signals corresponding to the same DNA content have the same quantized value and, subsequently, the same hash value. We evaluate RawHash on three applications: 1) read mapping, 2) relative abundance estimation, and 3) contamination analysis. Our evaluations show that RawHash is the only tool that can provide high accuracy and high throughput for analyzing large genomes in real-time. When compared to the state-of-the-art techniques, UNCALLED and Sigmap, RawHash provides 1) 25.8x and 3.4x better average throughput and 2) significantly better accuracy for large genomes, respectively. Source code is available at https://github.com/CMUSAFARI/RawHash.
@article{firtina_rawhash_2023, author = {Firtina, Can and Ghiasi, Nika Mansouri and Lindegger, Joel and Singh, Gagandeep and Cavlak, Meryem Banu and Mao, Haiyu and Mutlu, Onur}, journal = {Bioinformatics}, title = {{RawHash}: {Enabling} {Fast} and {Accurate} {Real}-{Time} {Analysis} of {Raw} {Nanopore} {Signals} for {Large} {Genomes}}, year = {2023}, month = jul, note = {Proceedings of the 31st Annual Conference on Intelligent Systems for Molecular Biology (ISMB) and the 22nd European Conference on Computational Biology (ECCB)}, number = {Supplement 1}, pages = {i297–i307}, volume = {39}, doi = {10.1093/bioinformatics/btad272}, }
DAC
Accelerating Genome Analysis via Algorithm-Architecture Co-Design

Onur Mutlu, and Can Firtina

In Proceedings of the 60th Design Automation Conference (DAC), Jul 2023

Abstract BIB URL PDF

High-throughput sequencing (HTS) technologies have revolutionized the field of genomics, enabling rapid and cost-effective genome analysis for various applications. However, the increasing volume of genomic data generated by HTS technologies presents significant challenges for computational techniques to effectively analyze genomes. To address these challenges, several algorithm-architecture co-design works have been proposed, targeting different steps of the genome analysis pipeline. These works explore emerging technologies to provide fast, accurate, and low-power genome analysis. This paper provides a brief review of the recent advancements in accelerating genome analysis, covering the opportunities and challenges associated with the acceleration of the key steps of the genome analysis pipeline. Our analysis highlights the importance of integrating multiple steps of genome analysis using suitable architectures to unlock significant performance improvements and reduce data movement and energy consumption. We conclude by emphasizing the need for novel strategies and techniques to address the growing demands of genomic data generation and analysis.
@inproceedings{mutlu2023accelerating, booktitle = {Proceedings of the 60th Design Automation Conference (DAC)}, series = {{DAC} '23}, title = {Accelerating Genome Analysis via Algorithm-Architecture Co-Design}, author = {Mutlu, Onur and Firtina, Can}, journal = {arXiv preprint arXiv:2305.00492}, doi = {10.48550/arXiv.2305.00492}, year = {2023}, month = jul, }
MMDCS
Will computing in memory become a new dawn of associative processors?

Leonid Yavits

IEEE J. Emerg. Sel. Topics Circuits Syst., Jul 2023

Abstract BIB URL PDF

Computer architecture faces an enormous challenge in recent years: while the demand for performance is constantly growing, the performance improvement of general-purpose CPU has almost stalled. Among the reasons are memory and power walls, due to which data transfer increasingly dominates computing. By significantly reducing data transfer, data-centric (or in-memory) computing promises to alleviate the memory and power walls. Associative processor is a non von Neumann computer invented in the 1960s but effectively cast aside until recently. It computes using associative memory in a perfect induction like fashion, using associative memory cells for both data storage and processing. Associative processor can be implemented using conventional CMOS as well as emerging memories. We show that associative processor can outperform state-of-the-art computing platforms by up to almost two orders of magnitude in a variety of data-intensive workloads.
@article{Yavits2023, author = {Yavits, Leonid}, title = {Will computing in memory become a new dawn of associative processors?}, journal = {{IEEE} J. Emerg. Sel. Topics Circuits Syst.}, issn = {2773-0646}, pages = {100033}, volume = {4}, doi = {10.1016/j.memori.2023.100033}, month = jul, year = {2023} }
AACBB
GAPiM: a hardware acceleration of Genome Analysis pipeline using Processing in Memory

Naomie Abecassis, Juan Gómez-Luna, Onur Mutlu, Ran Ginosar, Aphélie Moisson-Franckhauser, and Leonid Yavits

In Proceedings of the 5th Workshop on Accelerator Architecture in Computational Biology and Bioinformatics (AACBB), Jun 2023

Abstract BIB URL Code

Variant calling is a fundamental stage in genome analysis that identifies mutations (variations) in a sequenced genome relative to a known reference genome. Pair-HMM is a key part of the variant calling algorithm and its most compute-intensive part. In recent years, Processing-in-Memory (PiM) solutions, which consist of placing compute capabilities near/inside memory, have been proposed to speed up the genome analysis pipeline. We implement the Pair-HMM algorithm on a commercial PiM platform developed by UPMEM. We modify the Pair-HMM algorithm to make it more suitable for PiM execution with acceptable loss of accuracy. We evaluate our implementation on single chromosomes and whole genome sequencing datasets, demonstrating up to 2x speedup compared to existing CPU accelerations and up to 3x speedup compared to FPGA accelerations.
@inproceedings{abecasis2023, booktitle = {Proceedings of the 5th Workshop on Accelerator Architecture in Computational Biology and Bioinformatics (AACBB)}, title = {{GAPiM}: a hardware acceleration of Genome Analysis pipeline using Processing in Memory}, doi = {10.1101/2023.07.26.550623v1}, author = {Abecassis, Naomie and G{\'o}mez-Luna, Juan and Mutlu, Onur and Ginosar, Ran and Moisson-Franckhauser, Aph{\'e}lie and Yavits, Leonid}, month = jun, year = {2023}, }
ISPASS
TransPimLib: A Library for Efficient Transcendental Functions on Processing-in-Memory Systems

Maurus Item, Juan Gómez-Luna, Yuxin Guo, Geraldo F Oliveira, Mohammad Sadrosadati, and Onur Mutlu

In Proceedings of the 24th International Symposium on Performance Analysis of Systems and Software (ISPASS), Apr 2023

Abstract BIB URL PDF

Processing-in-memory (PIM) promises to alleviate the data movement bottleneck in modern computing systems. However, current real-world PIM systems have the inherent disadvantage that their hardware is more constrained than in conventional processors (CPU, GPU), due to the difficulty and cost of building processing elements near or inside the memory. As a result, general-purpose PIM architectures support fairly limited instruction sets and struggle to execute complex operations such as transcendental functions and other hard-to-calculate operations (e.g., square root). These operations are particularly important for some modern workloads, e.g., activation functions in machine learning applications. In order to provide support for transcendental (and other hard-to-calculate) functions in general-purpose PIM systems, we present TransPimLib, a library that provides CORDIC-based and LUT-based methods for trigonometric functions, hyperbolic functions, exponentiation, logarithm, square root, etc. We develop an implementation of TransPimLib for the UPMEM PIM architecture and perform a thorough evaluation of TransPimLib’s methods in terms of performance and accuracy, using microbenchmarks and three full workloads (Blackscholes, Sigmoid, Softmax). We open-source all our code and datasets at https://github.com/CMU-SAFARI/transpimlib.
@inproceedings{gomez2023transpimlib, booktitle = {Proceedings of the 24th International Symposium on Performance Analysis of Systems and Software (ISPASS)}, series = {{ISPASS} '23}, title = {TransPimLib: A Library for Efficient Transcendental Functions on Processing-in-Memory Systems}, author = {Item, Maurus and G{\'o}mez-Luna, Juan and Guo, Yuxin and Oliveira, Geraldo F and Sadrosadati, Mohammad and Mutlu, Onur}, journal = {arXiv preprint arXiv:2304.01951}, doi = {10.48550/arXiv.2304.01951}, month = apr, year = {2023} }
ISPASS
Evaluating Machine Learning Workloads on Memory-Centric Computing Systems

Juan Gómez-Luna, Yuxin Guo, Sylvan Brocard, Julien Legriel, Remy Cimadomo, Geraldo F Oliveira, Gagandeep Singh, and Onur Mutlu

In Proceedings of the 24th International Symposium on Performance Analysis of Systems and Software (ISPASS), Apr 2023

Abstract BIB URL PDF

Training machine learning (ML) algorithms is a computationally intensive process, which is frequently memory-bound due to repeatedly accessing large training datasets. As a result, processor-centric systems (e.g., CPU, GPU) suffer from costly data movement between memory units and processing units, which consumes large amounts of energy and execution cycles. Memory-centric computing systems, i.e., with processing-in-memory (PIM) capabilities, can alleviate this data movement bottleneck. Our goal is to understand the potential of modern general-purpose PIM architectures to accelerate ML training. To do so, we (1) implement several representative classic ML algorithms (namely, linear regression, logistic regression, decision tree, K-Means clustering) on a real-world general-purpose PIM architecture, (2) rigorously evaluate and characterize them in terms of accuracy, performance and scaling, and (3) compare to their counterpart implementations on CPU and GPU. Our evaluation on a real memory-centric computing system with more than 2500 PIM cores shows that general-purpose PIM architectures can greatly accelerate memory-bound ML workloads, when the necessary operations and datatypes are natively supported by PIM hardware. For example, our PIM implementation of decision tree is 27x faster than a state-of-the-art CPU version on an 8-core Intel Xeon, and 1.34x faster than a state-of-the-art GPU version on an NVIDIA A100. Our K-Means clustering on PIM is 2.8x and 3.2x than state-of-the-art CPU and GPU versions, respectively. To our knowledge, our work is the first one to evaluate ML training on a real-world PIM architecture. We conclude with key observations, takeaways, and recommendations that can inspire users of ML workloads, programmers of PIM architectures, and hardware designers & architects of future memory-centric computing systems.
@inproceedings{gomez2023evaluating, booktitle = {Proceedings of the 24th International Symposium on Performance Analysis of Systems and Software (ISPASS)}, series = {{ISPASS} '23}, title = {Evaluating Machine Learning Workloads on Memory-Centric Computing Systems}, author = {G{\'o}mez-Luna, Juan and Guo, Yuxin and Brocard, Sylvan and Legriel, Julien and Cimadomo, Remy and Oliveira, Geraldo F and Singh, Gagandeep and Mutlu, Onur}, journal = {arXiv preprint arXiv:2207.07886}, doi = {10.48550/arXiv.2207.07886}, month = apr, year = {2023} }
Bioinformatics
Scrooge: A Fast and Memory-Frugal Genomic Sequence Aligner for CPUs, GPUs, and ASICs.

Joël Lindegger, Damla Senol Cali, Mohammed Alser, Juan Gómez-Luna, Nika Mansouri Ghiasi, and Onur Mutlu

Bioinformatics, Mar 2023

Abstract BIB URL PDF

Pairwise sequence alignment is a very time-consuming step in common bioinformatics pipelines. Speeding up this step requires heuristics, efficient implementations, and/or hardware acceleration. A promising candidate for all of the above is the recently proposed GenASM algorithm. We identify and address three inefficiencies in the GenASM algorithm: it has a high amount of data movement, a large memory footprint, and does some unnecessary work. We propose Scrooge, a fast and memory-frugal genomic sequence aligner. Scrooge includes three novel algorithmic improvements which reduce the data movement, memory footprint, and the number of operations in the GenASM algorithm. We provide efficient open-source implementations of the Scrooge algorithm for CPUs and GPUs, which demonstrate the significant benefits of our algorithmic improvements. For long reads, the CPU version of Scrooge achieves a 20.1x, 1.7x, and 2.1x speedup over KSW2, Edlib, and a CPU implementation of GenASM, respectively. The GPU version of Scrooge achieves a 4.0x 80.4x, 6.8x, 12.6x and 5.9x speedup over the CPU version of Scrooge, KSW2, Edlib, Darwin-GPU, and a GPU implementation of GenASM, respectively. We estimate an ASIC implementation of Scrooge to use 3.6x less chip area and 2.1x less power than a GenASM ASIC while maintaining the same throughput. Further, we systematically analyze the throughput and accuracy behavior of GenASM and Scrooge under various configurations. As the best configuration of Scrooge depends on the computing platform, we make several observations that can help guide future implementations of Scrooge. https://github.com/CMU-SAFARI/Scrooge.
@article{Lindegger2023, author = {Lindegger, Joël and Cali, Damla Senol and Alser, Mohammed and Gómez-Luna, Juan and Ghiasi, Nika Mansouri and Mutlu, Onur}, journal = {Bioinformatics}, title = {Scrooge: A Fast and Memory-Frugal Genomic Sequence Aligner for {CPUs}, {GPUs}, and {ASICs}.}, year = {2023}, issn = {1367-4811}, month = mar, citation-subset = {IM}, country = {England}, doi = {10.1093/bioinformatics/btad151}, issn-linking = {1367-4803}, nlm-id = {9808944}, owner = {NLM}, pii = {btad151}, pmid = {36961334}, pubmodel = {Print-Electronic}, pubstate = {aheadofprint}, revised = {2023-03-24}, }
Bioinformatics
A framework for high-throughput sequence alignment using real processing-in-memory systems.

Safaa Diab, Amir Nassereldine, Mohammed Alser, Juan Gómez Luna, Onur Mutlu, and Izzat El Hajj

Bioinformatics, Mar 2023

Abstract BIB URL PDF

Sequence alignment is a memory bound computation whose performance in modern systems is limited by the memory bandwidth bottleneck. Processing-in-memory (PIM) architectures alleviate this bottleneck by providing the memory with computing competencies. We propose Alignment-in-Memory (AIM), a framework for high-throughput sequence alignment using PIM, and evaluate it on UPMEM, the first publicly available general-purpose programmable PIM system. Our evaluation shows that a real PIM system can substantially outperform server-grade multi-threaded CPU systems running at full-scale when performing sequence alignment for a variety of algorithms, read lengths, and edit distance thresholds. We hope that our findings inspire more work on creating and accelerating bioinformatics algorithms for such real PIM systems. Our code is available at https://github.com/safaad/aim.
@article{Diab2023, author = {Diab, Safaa and Nassereldine, Amir and Alser, Mohammed and Gómez Luna, Juan and Mutlu, Onur and El Hajj, Izzat}, journal = {Bioinformatics}, title = {A framework for high-throughput sequence alignment using real processing-in-memory systems.}, issn = {1367-4811}, month = mar, volume = {39}, citation-subset = {IM}, completed = {2023-05-08}, country = {England}, doi = {10.1093/bioinformatics/btad155}, issn-linking = {1367-4803}, issue = {5}, keywords = {Sequence Alignment; Algorithms; Software; Computational Biology; Sequence Analysis, DNA; High-Throughput Nucleotide Sequencing}, nlm-id = {9808944}, owner = {NLM}, pii = {btad155}, pmc = {PMC10159653}, pmid = {36971586}, pubmodel = {Print}, pubstate = {ppublish}, revised = {2023-05-09}, year = {2023} }
JETCAS
AM4: MRAM Crossbar Based CAM/TCAM/ACAM/AP for In-Memory Computing

Esteban Garzón, Marco Lanuzza, Adam Teman, and Leonid Yavits

IEEE J. Emerg. Sel. Topics Circuits Syst., Mar 2023

Abstract BIB URL PDF

In-memory computing seeks to minimize data movement and alleviate the memory wall by computing in-situ, in the same place that the data is located. One of the key emerging technologies that promises to enable such computing-in-memory is spin-transfer torque magnetic tunnel junction (STT-MTJ). This paper proposes AM4, a combined STT-MTJ-based Content Addressable Memory (CAM), Ternary CAM (TCAM), approximate matching (similarity search) CAM (ACAM), and in-memory Associative Processor (AP) design, inspired by the recently announced Samsung MRAM crossbar. We demonstrate and evaluate the performance and energy-efficiency of the AM4-based AP using a variety of data intensive workloads. We show that an AM4-based AP outperforms state-of-the-art solutions both in performance (with the average speedup of about 10 x) and energy-efficiency (by about 60 x on average).
@article{Garzon2023, author = {Garz{\'{o}}n, Esteban and Lanuzza, Marco and Teman, Adam and Yavits, Leonid}, journal = {{IEEE} J. Emerg. Sel. Topics Circuits Syst.}, title = {AM4: {MRAM} Crossbar Based {CAM/TCAM/ACAM/AP} for In-Memory Computing}, number = {1}, pages = {408--421}, volume = {13}, bibsource = {dblp computer science bibliography, https://dblp.org}, biburl = {https://dblp.org/rec/journals/esticas/GarzonLTY23.bib}, doi = {10.1109/JETCAS.2023.3243222}, month = mar, year = {2023} }
NARGAB
BLEND: a fast, memory-efficient and accurate mechanism to find fuzzy seed matches in genome analysis

Can Firtina, Jisung Park, Mohammed Alser, Jeremie S Kim, Damla Senol Cali, Taha Shahroodi, Nika Mansouri Ghiasi, Gagandeep Singh, Konstantinos Kanellopoulos, Can Alkan, and Onur Mutlu

NAR Genomics and Bioinformatics, Mar 2023

Abstract BIB URL Code

Generating the hash values of short subsequences, called seeds, enables quickly identifying similarities between genomic sequences by matching seeds with a single lookup of their hash values. However, these hash values can be used only for finding exact-matching seeds as the conventional hashing methods assign distinct hash values for different seeds, including highly similar seeds. Finding only exact-matching seeds causes either (i) increasing the use of the costly sequence alignment or (ii) limited sensitivity. We introduce BLEND, the first efficient and accurate mechanism that can identify both exact-matching and highly similar seeds with a single lookup of their hash values, called fuzzy seed matches. BLEND (i) utilizes a technique called SimHash, that can generate the same hash value for similar sets, and (ii) provides the proper mechanisms for using seeds as sets with the SimHash technique to find fuzzy seed matches efficiently. We show the benefits of BLEND when used in read overlapping and read mapping. For read overlapping, BLEND is faster by 2.4x–83.9x (on average 19.3x), has a lower memory footprint by 0.9x–14.1x (on average 3.8x), and finds higher quality overlaps leading to accurate de novo assemblies than the state-of-the-art tool, minimap2. For read mapping, BLEND is faster by 0.8x–4.1x (on average 1.7x) than minimap2. Source code is available at https://github.com/CMU-SAFARI/BLEND.
@article{firtina_blend_2023, title = {{BLEND}: a fast, memory-efficient and accurate mechanism to find fuzzy seed matches in genome analysis}, volume = {5}, doi = {10.1093/nargab/lqad004}, number = {1}, journal = {NAR Genomics and Bioinformatics}, author = {Firtina, Can and Park, Jisung and Alser, Mohammed and Kim, Jeremie S and Cali, Damla Senol and Shahroodi, Taha and Ghiasi, Nika Mansouri and Singh, Gagandeep and Kanellopoulos, Konstantinos and Alkan, Can and Mutlu, Onur}, month = mar, year = {2023}, pages = {lqad004}, }
Chips
Approximate Content-Addressable Memories: A Review

Esteban Garzón, Leonid Yavits, Adam Teman, and Marco Lanuzza

Chips, Mar 2023

Abstract BIB URL

Content-addressable memory (CAM) has been part of the memory market for more than five decades. CAM can carry out a single clock cycle lookup based on the content rather than an address. Thanks to this attractive feature, CAM is utilized in memory systems where a high-speed content lookup technique is required. However, typical CAM applications only support exact matching, as opposed to approximate matching, where a certain Hamming distance (several mismatching characters between a query pattern and the dataset stored in CAM) needs to be tolerated. Recent interest in approximate search has led to the development of new CAM-based alternatives, accelerating the processing of large data workloads in the realm of big data, genomics, and other data-intensive applications. In this review, we provide an overview of approximate CAM and describe its current and potential applications that would benefit from approximate search computing.
@article{Garzon2023a, author = {Garzón, Esteban and Yavits, Leonid and Teman, Adam and Lanuzza, Marco}, journal = {Chips}, title = {Approximate Content-Addressable Memories: A Review}, year = {2023}, issn = {2674-0729}, number = {2}, pages = {70--82}, volume = {2}, doi = {10.3390/chips2020005}, owner = {calkan}, url = {https://www.mdpi.com/2674-0729/2/2/5}, }

Posters

2024

GI
Hardware/software co-design for sequence-to-pangenome mapping

Zülal Bingöl, Ccan Fırtına, Konstantina Koliogeorgi, Ricardo Roman-Brenes, Onur Mutlu, and Can Alkan

In Genome Informatics Conference 2024, Nov 2024

Abstract BIB URL

The reference genome is the representative genome of a species. Most genomics analyses begin with or involve a comparison with the reference genome of the same species, which serves as a template for the newly sequenced sample of an individual. However, using a single reference genome creates reference bias since it does not contain information about all individuals of a species to represent the genomic diversity. The rapid development of sequencing technologies has led to an increasing number of samples being studied. With the growing number of genetic materials sampled, genomics pursues utilizing this data to express all samples and capture vast genetic diversity. So, to alleviate these problems and enable diverse variance studies (e.g., bacterial pathogenicity, antibacterial resistance, virulence, human leukocyte antigens), pangenome structures are proposed. As the efforts to collect more data to improve the diversity representation on the pangenome are growing, the computational methods to assist this field become more vital to meet the high demand. Therefore, there is a pressing need for acceleration methods tuned for mapping sequences to pangenomes. To address this problem, in this work, we propose a hardware/software co-design framework for optimizing the pangenome graphs for compact storage and developing efficient algorithms that eventually lead to fast and accurate sequence-to-pangenome mapping. Our framework combines several steps (e.g., indexing, seeding, and alignment), and the accelerator includes optimized designs based on each step’s requirements.
booktitle = {Genome Informatics Conference 2024}, series = {{GI} '24}, title = {Hardware/software co-design for sequence-to-pangenome mapping}, author = {Bingöl, Zülal and Fırtına, Ccan and Koliogeorgi, Konstantina and Roman-Brenes, Ricardo and Mutlu, Onur and Alkan, Can}, journal = {Poster presented}, url = {https://coursesandconferences.wellcomeconnectingscience.org/event/genome-informatics-20241113/}, year = {2024}, month = nov, }

2023

RECOMB
Characterization of Alignment and Search Algorithms for Short Read, Long Read, and Graph Mappers

Ecem İlgün, Ömer Yavuz Öztürk, Klea Zambaku, Juan Gómez Luna, Mohammed Alser, Ricardo Román-Brenes, The BioPIM Project, and Can Alkan

In RECOMB 2023, Apr 2023

Abstract BIB URL

We recently started a project funded by the Horizon Europe program, BioPIM, which aims to accelerate various bioinformatics algorithms using processing-in-memory (PIM) technologies. PIM is a type of computer architecture that aims to solve the issue of data movement between the CPU and memory being a bottleneck in data-intensive applications. PIM accomplishes this by integrating computing units directly into the memory chip, which reduces latency and increases bandwidth by bringing the computing units closer to the memory. Since read mapping is a crucial step in almost all genome analysis studies, we first aimed to understand how to accelerate read mapping in this project. As the first step, we analyzed the computational workload of BWA-MEM and Bowtie2 for short reads, NGMLR, Minimap2 and LRA for long reads, and finally minigraph, vg, GraphAligner and GWFA for read-to-graph aligners. Here we present a first-pass workload analysis using the Intel vtune tool to identify the most time and memory-bandwidth consuming functions used by these algorithms. Our preliminary results show that the resource usage of these algorithms varied significantly depending on the type of data and the algorithm used. We have identified several potential functions that could potentially be improved by PIM. Furthermore, we discuss processing-in-memory architectures to accelerate alignment and search algorithms for resequencing experiments. Overall, our study provides insights into the resource requirements of different alignment and search algorithms for different types of sequencing data, which can guide the selection of the most appropriate algorithms for different resequencing experiments. Our findings can also inform the development of more efficient and accurate algorithms for processing sequencing data, which is critical for advancing our understanding of the genetic basis of complex diseases.
booktitle = {RECOMB 2023}, series = {{RECOMB} '23}, title = {Characterization of Alignment and Search Algorithms for Short Read, Long Read, and Graph Mappers}, author = {İlgün, Ecem and Öztürk, Ömer Yavuz and Zambaku, Klea and Luna, Juan Gómez and Alser, Mohammed and Román-Brenes, Ricardo and Project, The BioPIM and Alkan, Can}, journal = {Poster presented}, url = {https://docs.google.com/document/d/1OK2oSsDjQYuo20jcJXwf8H5H_PMF6t_MOWO9QZOkFPg/view}, year = {2023}, month = apr, }

Related Publications

2022

ISCA
EDAM: Edit Distance Tolerant Approximate Matching Content Addressable Memory

Robert Hanhan, Esteban Garzón, Zuher Jahshan, Adam Teman, Marco Lanuzza, and Leonid Yavits

In Proceedings of the 49th Annual International Symposium on Computer Architecture, 2022

Abstract BIB URL

We propose a novel edit distance-tolerant content addressable memory (EDAM) for energy-efficient approximate search applications. Unlike state-of-the-art approximate search solutions that tolerate certain Hamming distance between the query pattern and the stored data, EDAM tolerates edit distance, which makes it especially efficient in applications such as text processing and genome analysis. EDAM was designed using a commercial 65 nm 1.2 V CMOS technology and evaluated through extensive Monte Carlo simulations, while considering different process corners. Simulation results show that EDAM can achieve robust approximate search operation with a wide range of edit distance threshold levels. EDAM is functionally evaluated as a pathogen DNA detection and classification accelerator. EDAM achieves up to 1.7x higher F1 score for high-quality DNA reads and up to 19.55x higher F1 score for DNA reads with 15% error rate, compared to state-of-the-art DNA classification tool Kraken2. Simulated at 667 MHz, EDAM provides 1, 214x average speedup over Kraken2. This makes EDAM suitable for hardware acceleration of genomic surveillance of outbreaks, such as the ongoing Covid-19 pandemic.
@inproceedings{hanhan_edam_2022, address = {New York, NY, USA}, series = {{ISCA} '22}, title = {{EDAM}: {Edit} {Distance} {Tolerant} {Approximate} {Matching} {Content} {Addressable} {Memory}}, isbn = {978-1-4503-8610-4}, url = {https://doi.org/10.1145/3470496.3527424}, doi = {10.1145/3470496.3527424}, booktitle = {Proceedings of the 49th {Annual} {International} {Symposium} on {Computer} {Architecture}}, publisher = {Association for Computing Machinery}, author = {Hanhan, Robert and Garzón, Esteban and Jahshan, Zuher and Teman, Adam and Lanuzza, Marco and Yavits, Leonid}, year = {2022}, pages = {495--507}, }

2021

VLSI Tech.
HERMES Core – A 14nm CMOS and PCM-based In-Memory Compute Core using an array of 300ps/LSB Linearized CCO-based ADCs and local digital processing

R. Khaddam-Aljameh, M. Stanisavljevic, J. Fornt Mas, G. Karunaratne, M. Braendli, F. Liu, A. Singh, S. M. Müller, U. Egger, A. Petropoulos, T. Antonakopoulos, K. Brew, S. Choi, I. Ok, F. L. Lie, N. Saulnier, V. Chan, I. Ahsan, V. Narayanan, S. R. Nandakumar, M. Le Gallo, P. A. Francese, A. Sebastian, and E. Eleftheriou

In 2021 Symposium on VLSI Technology, 2021

Abstract BIB URL

We present a 256x256 in-memory compute (IMC) core designed and fabricated in 14nm CMOS with backend-integrated multi-level phase-change memory (PCM). It comprises 256 linearized current controlled oscillator (CCO)-based ADCs at a compact 4µm pitch and a local digital processing unit performing affine scaling and ReLU operations. A novel frequency-linearization technique for CCOs is introduced, leading to accurate on-chip matrix-vector-multiply (MVM) when operating over 1 GHz. Measured classification accuracies on MNIST and CIFAR-10 datasets are presented when two cores are employed for deep learning (DL) inference. The measured energy efficiency is 10.5 TOPS/W at a performance density of 1.59 TOPS/mm 2 .
@inproceedings{khaddam-aljameh_hermes_2021, title = {{HERMES} {Core} – {A} 14nm {CMOS} and {PCM}-based {In}-{Memory} {Compute} {Core} using an array of 300ps/{LSB} {Linearized} {CCO}-based {ADCs} and local digital processing}, booktitle = {2021 {Symposium} on {VLSI} {Technology}}, author = {Khaddam-Aljameh, R. and Stanisavljevic, M. and Mas, J. Fornt and Karunaratne, G. and Braendli, M. and Liu, F. and Singh, A. and Müller, S. M. and Egger, U. and Petropoulos, A. and Antonakopoulos, T. and Brew, K. and Choi, S. and Ok, I. and Lie, F. L. and Saulnier, N. and Chan, V. and Ahsan, I. and Narayanan, V. and Nandakumar, S. R. and Gallo, M. Le and Francese, P. A. and Sebastian, A. and Eleftheriou, E.}, year = {2021}, pages = {1--2}, }

2020

BIBM
Variant Calling Parallelization on Processor-in-Memory Architecture

D. Lavenier, R. Cimadomo, and R. Jodin

In 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Dec 2020

Abstract BIB URL

This paper introduces a new combination of software and hardware PIM (Process-in-Memory) architecture to accelerate the variant calling genomic process. PIM translates into bringing data intensive calculations directly where the data is: within the DRAM, enhanced with thousands of processing units. The energy consumption, in large part due to data movement, is significantly lowered at a marginal additional hardware cost. Such design allows an unprecedented level of parallelism to process billions of short reads. Experiments on real PIM devices developed by the UPMEM company show significant speed-up compared to pure software implementation. The PIM solution also compared nicely to FPGA or GPU based acceleration bringing similar to twice the processing speed but most importantly being 5 to 8 times cheaper to deploy with up to 6 times less power consumption.
@inproceedings{lavenier_variant_2020, address = {Los Alamitos, CA, USA}, title = {Variant {Calling} {Parallelization} on {Processor}-in-{Memory} {Architecture}}, url = {https://doi.ieeecomputersociety.org/10.1109/BIBM49941.2020.9313351}, doi = {10.1109/BIBM49941.2020.9313351}, booktitle = {2020 {IEEE} {International} {Conference} on {Bioinformatics} and {Biomedicine} ({BIBM})}, publisher = {IEEE Computer Society}, author = {Lavenier, D. and Cimadomo, R. and Jodin, R.}, month = dec, year = {2020}, pages = {204--207}, }
MICRO
GenASM: A High-Performance, Low-Power Approximate String Matching Acceleration Framework for Genome Sequence Analysis

Damla Senol Cali, Gurpreet S. Kalsi, Zülal Bingöl, Can Firtina, Lavanya Subramanian, Jeremie S. Kim, Rachata Ausavarungnirun, Mohammed Alser, Juan Gomez-Luna, Amirali Boroumand, Anant Norion, Allison Scibisz, Sreenivas Subramoneyon, Can Alkan, Saugata Ghose, and Onur Mutlu

In 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec 2020

Abstract BIB URL PDF Code Slides Slides (PPT)

Genome sequence analysis has enabled significant advancements in medical and scientific areas such as personalized medicine, outbreak tracing, and the understanding of evolution. To perform genome sequencing, devices extract small random fragments of an organism’s DNA sequence (known as reads). The first step of genome sequence analysis is a computational process known as read mapping. In read mapping, each fragment is matched to its potential location in the reference genome with the goal of identifying the original location of each read in the genome. Unfortunately, rapid genome sequencing is currently bottlenecked by the computational power and memory bandwidth limitations of existing systems, as many of the steps in genome sequence analysis must process a large amount of data. A major contributor to this bottleneck is approximate string matching (ASM), which is used at multiple points during the mapping process. ASM enables read mapping to account for sequencing errors and genetic variations in the reads. We propose GenASM, the first ASM acceleration framework for genome sequence analysis. GenASM performs bitvectorbased ASM, which can efficiently accelerate multiple steps of genome sequence analysis. We modify the underlying ASM algorithm (Bitap) to significantly increase its parallelism and reduce its memory footprint. Using this modified algorithm, we design the first hardware accelerator for Bitap. Our hardware accelerator consists of specialized systolic-array-based compute units and on-chip SRAMs that are designed to match the rate of computation with memory capacity and bandwidth, resulting in an efficient design whose performance scales linearly as we increase the number of compute units working in parallel. We demonstrate that GenASM provides significant performance and power benefits for three different use cases in genome sequence analysis. First, GenASM accelerates read alignment for both long reads and short reads. For long reads, GenASM outperforms state-of-the-art software and hardware accelerators by 116x and 3.9x, respectively, while reducing power consumption by 37x and 2.7x. For short reads, GenASM outperforms state-of-the-art software and hardware accelerators by 111x and 1.9x. Second, GenASM accelerates pre-alignment filtering for short reads, with 3.7x the performance of a state-of-the-art pre-alignment filter, while reducing power consumption by 1.7x and significantly improving the filtering accuracy. Third, GenASM accelerates edit distance calculation, with 22-12501x and 9.3-400x speedups over the state-of-the-art software library and FPGA-based accelerator, respectively, while reducing power consumption by 548-582x and 67x. We conclude that GenASM is a flexible, high-performance, and low-power framework, and we briefly discuss four other use cases that can benefit from GenASM.
@inproceedings{cali_genasm_2020, title = {{GenASM}: {A} {High}-{Performance}, {Low}-{Power} {Approximate} {String} {Matching} {Acceleration} {Framework} for {Genome} {Sequence} {Analysis}}, doi = {10.1109/MICRO50266.2020.00081}, booktitle = {2020 53rd {Annual} {IEEE}/{ACM} {International} {Symposium} on {Microarchitecture} ({MICRO})}, author = {Cali, Damla Senol and Kalsi, Gurpreet S. and Bingöl, Zülal and Firtina, Can and Subramanian, Lavanya and Kim, Jeremie S. and Ausavarungnirun, Rachata and Alser, Mohammed and Gomez-Luna, Juan and Boroumand, Amirali and Norion, Anant and Scibisz, Allison and Subramoneyon, Sreenivas and Alkan, Can and Ghose, Saugata and Mutlu, Onur}, year = {2020}, pages = {951--966}, }
SYSTOR
BioSEAL: In-Memory Biological Sequence Alignment Accelerator for Large-Scale Genomic Data

Roman Kaplan, Leonid Yavits, and Ran Ginosar

In Proceedings of the 13th ACM International Systems and Storage Conference, Dec 2020

Abstract BIB URL

Genome sequences contain hundreds of millions of DNA base pairs. Finding the degree of similarity between two genomes requires executing a compute-intensive dynamic programming algorithm, such as Smith-Waterman. Traditional von Neumann architectures have limited parallelism and cannot provide an efficient solution for large-scale genomic data. Approximate heuristic methods (e.g. BLAST) are commonly used. However, they are suboptimal and still compute-intensive.In this work, we present BioSEAL, a biological sequence alignment accelerator. BioSEAL is a massively parallel non-von Neumann processing-in-memory architecture for large-scale DNA and protein sequence alignment. BioSEAL is based on resistive content addressable memory, capable of energy-efficient and highperformance associative processing.We present an associative processing algorithm for entire database sequence alignment on BioSEAL and compare its performance and power consumption with state-of-art solutions. We show that BioSEAL can achieve up to 57x speedup and 156x better energy efficiency, compared with existing solutions for genome sequence alignment and protein sequence database search.
@inproceedings{kaplan_bioseal_2020, address = {New York, NY, USA}, series = {{SYSTOR} '20}, title = {{BioSEAL}: {In}-{Memory} {Biological} {Sequence} {Alignment} {Accelerator} for {Large}-{Scale} {Genomic} {Data}}, isbn = {978-1-4503-7588-7}, url = {https://doi.org/10.1145/3383669.3398279}, doi = {10.1145/3383669.3398279}, booktitle = {Proceedings of the 13th {ACM} {International} {Systems} and {Storage} {Conference}}, publisher = {Association for Computing Machinery}, author = {Kaplan, Roman and Yavits, Leonid and Ginosar, Ran}, year = {2020}, pages = {36--48}, }

2019

IEEE Micro
RASSA: Resistive Prealignment Accelerator for Approximate DNA Long Read Mapping

Roman Kaplan, Leonid Yavits, and Ran Ginosar

IEEE Micro, Jul 2019

Abstract BIB URL

DNA read mapping is a computationally expensive bioinformatics task, required for genome assembly and consensus polishing. It requires to find the best-fitting location for each DNA read on a long reference sequence. A novel resistive approximate similarity search accelerator (RASSA) exploits charge distribution and parallel in-memory processing to reflect a mismatch count between DNA sequences. RASSA implementation of DNA long-read prealignment outperforms the state-of-the-art solution, minimap2, by 16–77x with comparable accuracy and provides two orders of magnitude higher throughput than GateKeeper, a short-read prealignment hardware architecture implemented in FPGA.
@article{kaplan_rassa_2019, title = {{RASSA}: {Resistive} {Prealignment} {Accelerator} for {Approximate} {DNA} {Long} {Read} {Mapping}}, volume = {39}, issn = {0272-1732}, url = {https://doi.org/10.1109/MM.2018.2890253}, doi = {10.1109/MM.2018.2890253}, number = {4}, journal = {IEEE Micro}, author = {Kaplan, Roman and Yavits, Leonid and Ginosar, Ran}, month = jul, year = {2019}, pages = {44--54}, }

2018

BMC Genomics
GRIM-Filter: Fast seed location filtering in DNA read mapping using processing-in-memory technologies

Jeremie S. Kim, Damla Senol Cali, Hongyi Xin, Donghyuk Lee, Saugata Ghose, Mohammed Alser, Hasan Hassan, Oguz Ergin, Can Alkan, and Onur Mutlu

BMC Genomics, May 2018

Abstract BIB URL PDF Code Slides Slides (PPT)

Seed location filtering is critical in DNA read mapping, a process where billions of DNA fragments (reads) sampled from a donor are mapped onto a reference genome to identify genomic variants of the donor. State-of-the-art read mappers 1) quickly generate possible mapping locations for seeds (i.e., smaller segments) within each read, 2) extract reference sequences at each of the mapping locations, and 3) check similarity between each read and its associated reference sequences with a computationally-expensive algorithm (i.e., sequence alignment) to determine the origin of the read. A seed location filter comes into play before alignment, discarding seed locations that alignment would deem a poor match. The ideal seed location filter would discard all poor match locations prior to alignment such that there is no wasted computation on unnecessary alignments.
@article{kim_grim-filter_2018, title = {{GRIM}-{Filter}: {Fast} seed location filtering in {DNA} read mapping using processing-in-memory technologies}, volume = {19}, issn = {1471-2164}, url = {https://doi.org/10.1186/s12864-018-4460-0}, doi = {10.1186/s12864-018-4460-0}, number = {2}, journal = {BMC Genomics}, author = {Kim, Jeremie S. and Senol Cali, Damla and Xin, Hongyi and Lee, Donghyuk and Ghose, Saugata and Alser, Mohammed and Hassan, Hasan and Ergin, Oguz and Alkan, Can and Mutlu, Onur}, month = may, year = {2018}, pages = {89}, }

2017

2016

BIBM
DNA mapping using Processor-in-Memory architecture

Dominique Lavenier, Jean-Francois Roy, and David Furodet

In 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), May 2016

Abstract BIB URL

This paper presents the implementation of a mapping algorithm on a new Processing-in-Memory (PIM) architecture developed by UPMEM Company. UPMEM’s solution consists in adding processing units into the DRAM, to minimize data access time and maximize bandwidth, in order to drastically accelerate data-consuming algorithms. The technology developed by UPMEM makes it possible to combine 256 cores with 16 GBytes of DRAM, on a standard DIMM module. An experimentation of DNA Mapping on Human genome dataset shows that a speed-up of 25 can be obtained with UPMEM technology compared to fast mapping software such as BWA, Bowtie2 or NextGenMap running on 16 Intel threads. Experimentation also highlight that data transfer from storage device limits the performances of the implementation. The use of SSD drives can boost the speed-up to 80.
@inproceedings{lavenier_dna_2016, title = {{DNA} mapping using {Processor}-in-{Memory} architecture}, doi = {10.1109/BIBM.2016.7822732}, booktitle = {2016 {IEEE} {International} {Conference} on {Bioinformatics} and {Biomedicine} ({BIBM})}, author = {Lavenier, Dominique and Roy, Jean-Francois and Furodet, David}, year = {2016}, pages = {1429--1435}, }