Proxima: Near-storage Acceleration for Graph-based Approximate Nearest Neighbor Search in 3D NAND

Weihong Xu†, Junwei Chen†, Po-Kai Hsu‡, Jaeyoung Kang†, Minxuan Zhou†, Sumukh Pinge†, Shimeng Yu‡, and Tajana Rosing†
University of California San Diego (UCSD)†   Georgia Institute of Technology‡

arXiv:2312.04257v1 [cs.AR] 7 Dec 2023

Abstract—Approximate nearest neighbor search (ANNS) plays an indispensable role in a wide variety of applications, including recommendation systems, information retrieval, and semantic search. Among the cutting-edge ANNS algorithms, graph-based approaches provide superior accuracy and scalability on massive datasets. However, the best-performing graph-based ANN search solutions incur memory footprints of tens to hundreds of gigabytes as well as costly distance computation, thus hindering their efficient deployment at scale. 3D NAND flash is emerging as a promising device for data-intensive applications due to its high density and nonvolatility. In this work, we present Proxima, a near-storage processing (NSP)-based ANNS solution that accelerates graph-based ANNS with algorithm-hardware co-design in 3D NAND flash. Proxima significantly reduces the complexity of graph search by leveraging distance approximation and early termination. On top of the algorithmic enhancement, we implement the Proxima search algorithm in 3D NAND flash using the heterogeneous integration technique. To maximize 3D NAND's bandwidth utilization, we present a customized dataflow and an optimized data allocation scheme. Our evaluation results show that, compared to graph ANNS on CPU and GPU, Proxima achieves an order-of-magnitude improvement in throughput or energy efficiency. Proxima yields 7× to 13× speedup over existing ASIC designs. Furthermore, Proxima achieves a good balance between accuracy, efficiency and storage density compared to previous NSP-based accelerators.

I. INTRODUCTION

Nearest neighbor search (NNS) is a fundamental workload that plays an important role in a wide variety of applications, such as recommendation systems [17], [24], [53], media data retrieval [3], [46], and bioinformatics data management [7]. The state-of-the-art NNS system adopts semantic-based search for unstructured data such as images, texts, videos, and speech. The feature vectors of product catalogs are first generated using neural embedding techniques that can effectively capture the semantics of objects. Then the recommendation results are returned by finding products whose embeddings are closest to the embedded search query. For example, the major eCommerce players Amazon [45] and Alibaba [16] build semantic search engines to recommend products.

While exhaustive search is the most accurate way to perform NNS, the ever-expanding data volume makes it impractical for meeting low-latency requirements at scale, because it requires expensive distance computation in high-dimensional space and linear search time. To address this issue, modern NNS systems [19], [36] adopt approximate NNS (ANNS) schemes, which provide significantly lower query latency by approximating NNS results. ANNS achieves high efficiency mainly by reducing search space, distance computation, and data access. It relies on pre-built indices that heuristically guide the search process. Popular indexing methods include hashing [25], inverted file (IVF) [62], and vector compression [33].
State-of-the-art ANNS tools with advanced indexing, such as Google's ScaNN [23], Facebook's Faiss [36], and Microsoft's DiskANN [32], can deliver millisecond query latency even on large-scale datasets. Meanwhile, several hardware designs [42], [66] have been presented to push efficiency and performance further beyond CPU and GPU. Although the aforementioned designs improve both computational and memory efficiency, they still suffer from significant accuracy degradation due to lossy compression and approximation. The experiments on FAISS [58] show that even the carefully optimized IVF-PQ [20] only achieves approximately 80% recall on datasets with 10 million items. In comparison, recent graph-based ANNS algorithms [19], [32], [50] demonstrate superior performance and complexity tradeoffs. HNSW [50] and DiskANN [32] achieve a recall rate of > 98% with promising throughput across various large datasets. As such, graph-based methods have not only been integrated into ANNS tools [15], [19], [32], [36], [50] but also deployed in large-scale commercial systems, such as Alibaba's visual search engine [67].

However, implementing graph-based ANNS efficiently poses great challenges for two reasons. First, graph-based ANNS is notorious for its massive memory footprint because both the raw data and the graph index need to be stored for the best accuracy. Advanced graph ANNS tools, such as HNSW [50], HM-ANN [57], and NSG [19], require 300-700GB of memory to store the data structures for billion-scale datasets. Hence, it is challenging to store and process the large-scale graph index in CPU or GPU memory. Second, the query search over the graph is, in essence, a best-first (BF) graph traversal characterized by random data access. This nature entails low parallelism and expensive distance computation. Our profiling results in Section II-D show that the irregular data access during graph traversal incurs an 80% to 95% cache miss rate. Additionally, the graph search requires thousands of distance computations for high-dimensional data. Data fetching and distance computation consume up to 80% of the total runtime.

Several attempts have been made to boost graph-based ANNS efficiently. NSG [19] and GGNN [22] distribute the large graph index to multiple machines [19] or GPUs [22]. However, this approach dramatically increases the deployment cost while sacrificing energy efficiency due to the communication overhead. SONG [68] optimizes the graph data organization for GPU to exploit GPU's computation parallelism. It uses hashing to fit large indexes into limited GPU memory, but significant performance degradation occurs in the high-recall regime due to the hash approximation. Besides, several works have offloaded the memory-intensive graph index from DRAM to slower but cheaper and denser memory devices. Emerging non-volatile memories, like solid state drives (SSDs) and Optane [57], are ideal devices to store the graph index. DiskANN [32] offloads the graph index from DRAM to SSD whereas the host memory only caches frequently accessed data. HM-ANN [57] builds a heterogeneous memory architecture that combines DRAM and non-volatile Optane memory. However, the main drawback of these solutions is that the data in storage needs to travel through multiple levels of the memory hierarchy for computing, thus incurring costly data movement.
To reduce the data movement cost, VStore [44] presents a near-storage accelerator, which yields an order of magnitude improvement in energy efficiency compared to previous CPU and GPU baselines [19], [22], [68]. However, the graph search algorithms in HM-ANN [57] and VStore [44] are inefficient since both require thousands of expensive distance computations for high-dimensional data, which become the bottleneck to obtaining higher efficiency. Additionally, existing near-memory/in-memory graph accelerators [52], [66] are not suitable for graph-based ANNS as they are primarily designed for graph updating, while the graph index for ANNS is fixed during query search.

3D NAND flash [48] and near-data processing present a unique opportunity to tackle the mentioned challenges for graph-based ANNS. The ultra-high density of 3D NAND flash aligns well with the significant memory space requirements of graph ANNS, which is highly advantageous for low-cost implementation. Recent works [39], [48] have demonstrated that the internal structure of 3D NAND flash offers high parallelism, which can be exploited by graph-based ANNS. Moreover, the non-volatile nature of 3D NAND flash allows for better data persistency and energy efficiency, which is crucial for the deployment of graph ANNS in the data center.

In this work, we present an algorithm-hardware co-design, Proxima, to accelerate graph-based ANNS. We use high-density 3D NAND flash as the basic memory unit and develop specialized hardware on top of 3D NAND using heterogeneous integration technology [61]. To summarize, our contributions are as follows:
• We provide an in-depth profiling and analysis for graph-based ANNS and identify its inefficiencies on CPUs and GPUs. Based on the analysis, we design the novel Proxima graph search algorithm that improves throughput and efficiency by combining product quantization (PQ) and early termination.
• We leverage heterogeneous integration techniques to devise Proxima, an NSP accelerator in 3D NAND tailored for graph-based ANNS workloads. We present ASIC designs to efficiently implement the proposed graph search algorithm.
• We introduce various types of data mapping optimizations for Proxima, which reduce query latency while providing scalable approaches to handle different dataset sizes. The experiments show that the proposed hot node repetition achieves 3× latency reduction without additional hardware.
• Proxima achieves up to 15× speedup and energy efficiency improvements compared to state-of-the-art hardware solutions [22], [42], [44].

II. BACKGROUND AND MOTIVATIONS

A. Approximate Nearest Neighbor Search

Exact k-NN Search. The k-nearest neighbor (k-NN) search retrieves the k nearest neighbors R from a dataset X = {x1, ..., xN} that have the smallest distances to a given query q. The k-NN search for D-dimensional vectors xi, q ∈ R^D in Euclidean space is given by:

    R = arg min_{R ⊆ X, |R| = k} dist(q, x),    (1)

where dist(·, ·) denotes the distance between two given vectors, e.g., Euclidean, cosine, or inner product.

Approximate k-NN Search. The exact k-NN search using exhaustive search requires O(N·D) complexity, and is thus computationally inefficient as the data size N = |X| becomes million- or billion-scale. This makes exact k-NN search impractical for real-time systems. Instead of precisely retrieving the k NNs, state-of-the-art tools [23], [36], [50] relax the exact search conditions and retrieve approximate k NNs expressed as R̂. These approximate k-NN search algorithms effectively reduce the search complexity by only visiting a small portion of the dataset. A good trade-off between search complexity and accuracy heavily depends on the pre-built data index that guides the search process. Popular data indexing approaches adopted by existing large-scale NN search libraries include IVF [62], graphs [19], [32], [50], and hashing [25].

Evaluation Metrics. Recall, query latency, and throughput are the three key metrics to evaluate ANNS performance. The recall measures the overlap between the approximate k-NN set R̂ and the exact k-NN set R, which is computed by:

    Recall(R̂, R) = |R̂ ∩ R| / k.    (2)

Query latency measures the response latency of a given ANNS system, while throughput is measured in terms of queries per second (QPS). Therefore, the design goal is to obtain high QPS and low latency while providing satisfactory recall.
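To make the definitions above concrete, the following C++ sketch (our own illustration, not part of the Proxima codebase; the Vec alias and function names are hypothetical) computes the exact k-NN set by exhaustive search and then evaluates the recall of Eq. (2) for an approximate result against it.

#include <algorithm>
#include <cstddef>
#include <unordered_set>
#include <utility>
#include <vector>

using Vec = std::vector<float>;

// Squared Euclidean distance between two D-dimensional vectors.
float dist(const Vec& a, const Vec& b) {
    float s = 0.f;
    for (size_t i = 0; i < a.size(); ++i) { float d = a[i] - b[i]; s += d * d; }
    return s;
}

// Exact k-NN by exhaustive search: O(N * D) distance work plus a partial sort.
std::vector<int> exact_knn(const std::vector<Vec>& X, const Vec& q, int k) {
    std::vector<std::pair<float, int>> scored;
    for (int i = 0; i < (int)X.size(); ++i) scored.push_back({dist(q, X[i]), i});
    std::partial_sort(scored.begin(), scored.begin() + k, scored.end());
    std::vector<int> R;
    for (int i = 0; i < k; ++i) R.push_back(scored[i].second);
    return R;
}

// Recall(R_hat, R) = |R_hat ∩ R| / k, as in Eq. (2).
float recall(const std::vector<int>& approx, const std::vector<int>& exact) {
    std::unordered_set<int> truth(exact.begin(), exact.end());
    int hits = 0;
    for (int id : approx) hits += (int)truth.count(id);
    return (float)hits / (float)exact.size();
}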
B. Graph-based ANN Search

The experiments [58] using Facebook's FAISS [36] show that IVF [62] and hashing [25] methods are memory-efficient, but the yielded recall saturates around 80% on 10M- and 100M-scale datasets. In comparison, graph-based methods [19], [32], [50] demonstrate superior performance with polylogarithmic search and graph building complexity. Graph-based ANNS includes two phases: 1. graph building and 2. search. The graph building generates a sparse proximity graph, G(V, E), as the data index. Each data point xi ∈ X is uniquely represented by a vertex vi ∈ V over the graph. Each edge e ∈ E represents the neighborhood relationship between the connected vertices.

Fig. 1: Illustration of graph-based ANN search.

Although existing graph-based ANN search algorithms [19], [32], [50] impose diverged constraints during graph building, most of them share a similar heuristic searching procedure. The search flow is a best-first traversal illustrated in Fig. 1. As a new query q comes, the search process starts from a pre-defined entry point (vs) and greedily traverses the graph to reach the nearest neighbors of q. A candidate list L is maintained to store (distance, id) pairs of the best evaluated vertices, which are sorted in ascending order of their distances to q. L has a predefined size L and is initialized with (dist(vs, q), vs). The search process iteratively "evaluates" the first unevaluated candidate in L by visiting its neighborhood and computing the distances between its neighbors and q. These neighbors are inserted into L along with their distances. Then L is sorted and only keeps the top L nearest candidates to q. This process is repeated until all candidates in L have been evaluated. Then the first k candidates in L are returned as an approximation of the k nearest neighbors to q. The candidate list size L can be used to control the search accuracy. In particular, with a higher L, more vertices in the graph are evaluated before the search terminates, so the search returns more accurate results.
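The best-first traversal described above can be summarized by the following C++ sketch (a simplified illustration under our own naming, not the authors' implementation): a bounded candidate list of (distance, id) pairs is kept sorted, and the closest unevaluated candidate is expanded until every candidate has been evaluated.

#include <algorithm>
#include <iterator>
#include <set>
#include <utility>
#include <vector>

// Greedy best-first graph search with a bounded candidate list of size L.
// graph[v] holds the neighbor ids of vertex v; distance(v) is an assumed
// callback returning the distance between the query and vertex v's vector.
template <class DistFn>
std::vector<int> best_first_search(const std::vector<std::vector<int>>& graph,
                                   DistFn distance, int entry, int L, int k) {
    std::set<std::pair<float, int>> cand = {{distance(entry), entry}};  // sorted by distance
    std::set<int> evaluated;
    while (true) {
        // Pick the closest candidate that has not been evaluated yet.
        auto it = std::find_if(cand.begin(), cand.end(),
                               [&](const auto& p) { return !evaluated.count(p.second); });
        if (it == cand.end()) break;              // all candidates evaluated: stop
        int v = it->second;
        evaluated.insert(v);
        for (int n : graph[v]) cand.insert({distance(n), n});            // expand neighbors
        while ((int)cand.size() > L) cand.erase(std::prev(cand.end()));  // keep top L
    }
    std::vector<int> result;                      // first k candidates approximate the k-NNs
    for (const auto& p : cand) { result.push_back(p.second); if ((int)result.size() == k) break; }
    return result;
}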
C. 3D NAND-based Near-Storage Processing

Near-storage Processing (NSP). As emerging algorithms become increasingly IO-bound, the growing demand for large-scale data processing presents significant challenges for conventional von Neumann architectures, which require massive data transfers between the processor and off-chip memory. Limited bandwidth and expensive data movement hinder computing system scaling for data-intensive workloads, causing the well-known "memory wall" problem. To overcome this limitation, in/near-memory processing architectures based on emerging non-volatile memories (NVMs) [18], [63] and volatile dynamic random access memory (DRAM) [35], [59] have shown promise in building optimized accelerators for various big data applications. However, these devices may not be suitable for larger planar array setups due to their low on/off current ratio, high on-current, and relatively lower density to support large datasets. Recently, 3D NAND flash memory has emerged as a promising in-storage processing (ISP) candidate [39], [48], [61], offering high density and bandwidth. However, ISP designs using 3D NAND flash memory suffer from: 1. device and circuit non-idealities that reduce computing accuracy, 2. variations in process, voltage and temperature, and 3. device aging [27], [47]. To avoid these issues, near-storage processing (NSP), as an alternative, can still provide an energy-efficient and low-latency solution that accelerates large-scale data processing without offloading massive data from the 3D NAND chip.

Heterogeneous Integration for NSP. Although NSP avoids massive data transfers, the integration of processing elements could be area-costly. Heterogeneous integration [61] using CMOS under Array (CUA) [10] and Cu-Cu bonding [2] can address the cost and area concerns associated with integrating processing elements onto a 3D NAND chip. As shown in Fig. 2, CUA overlaps memory peripherals under the array, reducing the area of a single tier. The CMOS wafer and NAND wafer can be manufactured independently using different technology nodes [2], [61]. After the manufacturing of the CMOS and NAND wafers is finished, high-density inter-chip Cu-Cu bonding connects the processing elements on the CMOS wafer to the 3D NAND wafer, providing seamless integration. As a result, NSP with heterogeneous integration can provide a compact solution for large-scale data processing with improved performance. This approach opens new opportunities for the development of low-power, high-performance, and compact data processing systems for various applications.

Fig. 2: Proxima's near-storage architecture with 3D NAND and heterogeneous integration techniques.

D. Motivations

Profiling Graph-based ANNS. To understand the bottlenecks of graph ANNS algorithms, we profile the state-of-the-art graph ANNS tools (NSG [19], HNSW [50], and DiskANN [32]) on a 16-core CPU with 64GB memory. We choose three popular datasets: SIFT [34], GLOVE [56], and DEEP [5]. We identify that graph-based ANNS faces the following challenges:

Challenge 1: Irregular Data Access. Graph ANN search is a best-first (BF) graph traversal process, which involves: 1. fetching the D-dimensional data vectors and R NN indices, and
2. calculating and comparing distances for the fetched data vectors. The roofline analysis in Fig. 3-(a) shows that graph ANNS algorithms [19], [32], [50] exhibit low computational intensity and fall into the memory-bound region. Moreover, graph traversal is notorious for its random memory access behavior. Fig. 3-(b) depicts the average last-level cache (LLC) miss rate for graph-based ANNS. Due to the random access pattern, the LLC miss rate during search is as high as 80% to 90%, becoming the other bottleneck.

Fig. 3: Profiling results for graph ANNS tools: (a) Roofline model on AMD EPYC 7543 CPU. (b) LLC miss rate and distance computation overhead for HNSW [50].

Challenge 2: Expensive Distance Computation. To guarantee high recall, graph-based ANNS relies on accurate distance computation over the high-dimensional raw data. The distance computation for D-dimensional data mostly consumes over 50% of the total runtime on CPU, as illustrated in Fig. 3-(b). However, the majority of the distance computations are redundant and have no impact on the final outcome. Hence, reducing distance computations while providing satisfactory performance is the key to accelerating graph ANN algorithms.

Challenge 3: Large Memory Consumption. Graph-based ANN algorithms require storing two basic data types: the graph index and the raw data. The memory complexity to store the graph index in adjacency list format is O(bindex · N · R), where N and R denote the number of data points and the maximum vertex degree, respectively, and bindex is the bit width of each vertex index. For raw data, the required memory space is O(braw · N · D), where D is the data dimension. The raw data size is fixed for a specific dataset while the graph index size depends on the graph building parameters. The graph index size for most datasets is even larger than the raw data size because bindex is generally 32-b while R is 32 to 96. As a result, the graph index size is over 203GB while the raw data are 178GB for billion-scale datasets.

Solutions. In general, graph-based ANNS is a memory-intensive workload. We perform algorithm and hardware co-optimization to develop a near-storage accelerator that can efficiently handle datasets of various scales. Section III presents the Proxima graph search approach, which improves search efficiency over existing works. To fully exploit the potential of Proxima's algorithm, we propose Proxima's near-storage hardware architecture based on 3D NAND flash and heterogeneous integration [2] in Section IV. 3D NAND is a high-density and nonvolatile memory that can address the large memory footprint of graph-based ANNS. Meanwhile, heterogeneous integration is a promising solution for developing advanced CMOS logic on top of 3D NAND.

III. PROXIMA GRAPH SEARCH

According to Section II-D, ANNS graph search is dominated by costly distance computation and memory footprint. We present the optimized Proxima graph search scheme to address these bottlenecks at the algorithm level.

A. Overview

Fig. 4 shows the overall flow of Proxima search optimization. Proxima graph search is general and can be applied to graphs generated by various graph building algorithms, such as HNSW [50], DiskANN [32], and NSG [19]. Before query search, the gap encoding in Proxima first compresses the built graphs to reduce the index size. During query search, we propose three strategies, namely product quantization (PQ) distance-based search, accurate distance-based reranking, and a dynamic list with early termination, to perform accurate and low-complexity search.

Fig. 4: Proxima graph search algorithms.
B. PQ Distance-based Search

Graph ANN search algorithms [19], [32], [50] incur hundreds to thousands of distance computations, but most do not contribute to the final results. A natural idea is to approximate them. Proxima traverses the graph using the approximate distance obtained by PQ [33].

PQ splits vectors into M subdimensions and encodes each vector x into an M · log2(C)-bit code, where C is the number of centroids of each subdimension from k-means. Each subvector of x is quantized to its k-means centroid and encoded with the centroid index of log2(C) bits. PQ approximates the distance dist̂ between query q and x as:

    dist̂(q, x) = Σ_{i=1}^{M} ADT_i[c_i],    (3)

where ADT_i is the Asymmetric Distance Table computed when we receive q, which holds the distances between subvector q_i of q and the C centroids of subdimension i; c_i is the centroid index of subvector x_i of x. The PQ approximate distance dist̂ is computed by summing the distance between each subvector of q and the quantized subvector of x. Therefore, a total of M LUT (look-up table) accesses and additions are needed for each distance computation. Fig. 5-(b) depicts an example of computing the ADT and the final PQ distance for D = 4, M = 2, C = 3. Note that Proxima only uses PQ during the search process. The graph index is built using existing algorithms [19], [32], [50] with full-precision coordinates to ensure a correct structure.

Fig. 5: (a) Gap encoding and (b) product quantization (PQ) distance approximation in Proxima.
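The following C++ sketch illustrates the two halves of Eq. (3) as we understand them: building the asymmetric distance table (ADT) once per query from a per-subdimension codebook, and then approximating dist(q, x) for any base vector by M table lookups and additions. The data layout and function names are our own illustration, not the Proxima code.

#include <cstdint>
#include <vector>

// codebook[i][c] is the (D/M)-dimensional centroid c of subdimension i
// (C centroids per subdimension, obtained offline by k-means; D is assumed
// to be divisible by M for simplicity).
using Codebook = std::vector<std::vector<std::vector<float>>>;

// Build the M x C asymmetric distance table: adt[i][c] is the squared
// Euclidean distance between subvector i of the query and centroid c.
std::vector<std::vector<float>> build_adt(const Codebook& codebook,
                                          const std::vector<float>& q) {
    size_t M = codebook.size();
    size_t sub_dim = q.size() / M;
    std::vector<std::vector<float>> adt(M);
    for (size_t i = 0; i < M; ++i) {
        adt[i].resize(codebook[i].size());
        for (size_t c = 0; c < codebook[i].size(); ++c) {
            float s = 0.f;
            for (size_t d = 0; d < sub_dim; ++d) {
                float diff = q[i * sub_dim + d] - codebook[i][c][d];
                s += diff * diff;
            }
            adt[i][c] = s;
        }
    }
    return adt;
}

// Eq. (3): dist_hat(q, x) = sum_i adt[i][code[i]], where code[i] is the
// centroid index of subvector i of x (M lookups and additions per vector).
float pq_distance(const std::vector<std::vector<float>>& adt,
                  const std::vector<uint8_t>& code) {
    float s = 0.f;
    for (size_t i = 0; i < code.size(); ++i) s += adt[i][code[i]];
    return s;
}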
Algorithm 1: Proxima Search Algorithm
Data: Graph G(V, E); query q; entry point vs; repetition rate r; search step Tstep; candidate list size L; PQ threshold beta.
Result: k-NN results for query q.
1   Candidate set L = {(dist̂(q, vs), vs)};
2   Evaluated set E = {};
3   while T ≤ L do
4       (dist̂', v') = the first candidate in L with v' ∉ E;
5       E = E ∪ {v'};
6       foreach n ∈ neighbors of v' do
7           dist̂_n = dist̂(q, n);                 /* PQ distance */
8           L = L ∪ {(dist̂_n, n)};
9       end
10      Sort L based on dist̂ and keep top L candidates;
11      if ∀a ∈ top T candidates of L, a ∈ E then
12          Rerank top T candidates in L;
13          if early_termination_check(r) == True then
14              break;                             /* Early termination */
15          end
16          T = T + Tstep;                         /* Dynamic list */
17      end
18  end
19  foreach c ∈ L with dist̂_c < dist̂_{L[T]} * beta do
20      Rerank c;                                  /* Reranking */
21  end
22  return top-k reranked candidates

C. Accurate Distance-based Reranking

When the search ends, we obtain a list L of the best-matched vertices found during graph traversal using PQ, but we still need to rerank these vertices to return the actual top k results in L. Compared to the thousands of accurate distance computations in traditional graph searching algorithms, the cost of reranking (typically around one hundred distance computations) is trivial.

Although PQ with reranking has been a common technique for graph-based ANN algorithms [32], [36], they have not taken full consideration of the inaccuracy of PQ. Let dist̂_{L[T]} be the distance of the most distant candidate in L; then vertices whose accurate distance is less than dist̂_{L[T]} but whose PQ distance is greater than dist̂_{L[T]} are discarded merely because of PQ's inaccuracy. To address this issue, we propose an optimized reranking strategy by introducing a new parameter beta (PQ error ratio) into Algorithm 1 (line 19) and nesting the original search list of size T inside a larger list of size L. beta can be chosen through empirical evaluation. For example, by sampling base vertices as queries and constructing 32-byte PQ codes, we observe that 99 percent of PQ distances for the SIFT dataset are within 1.06 (beta) times their accurate distances. Then, we rerank all vertices with PQ distances less than dist̂_{L[T]} * beta. This is possible because we keep more vertices in the larger list of size L. We may need a few more accurate distance computations, but we can improve the recall by up to 10% over DiskANN [32] at low recall with negligible impact on QPS (Fig. 11).

D. Dynamic List and Early Termination

We propose a dynamic list and early termination strategy to make use of the information gathered during graph traversal, as shown in Algorithm 1. We observe that most queries converge (find their true k-NNs) at a small T (candidate list size). Increasing T further will not improve the recall of these queries, but only increases the computation cost. Fig. 6-(a) shows this trend for DiskANN [32]: we see a rapid increase of the convergence ratio at small T's. For datasets like GLOVE that have low recall even at big T's, we expect fewer queries to converge at a given T, but the general trend is the same. Based on this, we design a novel early termination technique that uses a dynamic list to iteratively adjust T (line 16) during graph traversal. In each iteration, we compare the new and old top-k reranked candidates, and terminate the search if they are the same for r consecutive iterations (lines 12-14); otherwise, we increase T by Tstep (line 16) and continue. We control the early termination outcome by setting r and Tstep. We store the computed distances to amortize the overhead of accurate distance computations in each iteration. We also keep L at a relatively large size L to preserve useful information at small T's. This also allows us to apply the optimized reranking scheme in Section III-C. We combine early termination with optimized reranking and find around 10% reduction in distance computations at the same recall for all 6 datasets in Table I, compared to optimized reranking alone. This suggests that early termination can scale well to datasets of different sizes and distributions.

Fig. 6: (a) Search convergence trend for GLOVE, DEEP-10M, SIFT. (b) Memory traffic breakdown using different degrees R.
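A compact C++ sketch of the control logic in lines 11-21 of Algorithm 1, under our own simplified data structures (names such as Candidate, TerminationCheck, and exact_dist are illustrative assumptions, not the authors' API): the active window grows by Tstep whenever its top T candidates have all been evaluated, the search stops once the reranked top-k is stable for r consecutive checks, and the final reranking covers every candidate whose PQ distance falls below beta times the T-th best PQ distance.

#include <algorithm>
#include <utility>
#include <vector>

struct Candidate { float pq_dist; int id; };   // kept sorted by pq_dist during search

// Early-termination check (lines 13-14): stop once the reranked top-k ids
// have stayed identical for r consecutive checks.
struct TerminationCheck {
    int r;
    int stable = 0;
    std::vector<int> last_topk;
    bool stop(const std::vector<int>& topk) {
        stable = (topk == last_topk) ? stable + 1 : 0;
        last_topk = topk;
        return stable >= r;
    }
};

// Dynamic list growth (line 16): enlarge the active window T by Tstep,
// never beyond the full candidate list size L.
int grow_window(int T, int Tstep, int L) { return std::min(T + Tstep, L); }

// Beta-threshold reranking (lines 19-21): rerank every candidate whose PQ
// distance is within beta of the T-th best PQ distance; exact_dist is an
// assumed callback that recomputes the accurate distance from raw data.
template <class ExactDist>
std::vector<int> final_rerank(const std::vector<Candidate>& list, int T, int k,
                              float beta, ExactDist exact_dist) {
    float bound = list[T - 1].pq_dist * beta;
    std::vector<std::pair<float, int>> reranked;
    for (const Candidate& c : list)
        if (c.pq_dist < bound) reranked.push_back({exact_dist(c.id), c.id});
    std::sort(reranked.begin(), reranked.end());
    std::vector<int> topk;
    for (int i = 0; i < k && i < (int)reranked.size(); ++i) topk.push_back(reranked[i].second);
    return topk;
}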
E. Gap Encoding for Vertex Indices

Graph ANNS algorithms store the NN indices and the raw data. This leads to two problems: 1. fetching vertex indices (NN indices) creates significant data access overhead during search, accounting for 80% to 90% of memory traffic as shown in Fig. 6-(b); 2. the NN index has a size comparable to the raw data. Existing algorithms [19], [32], [50] use a uniform 32-b integer to represent the vertex index. But a uniform bit width across graph scales is redundant since ⌈log2 N⌉-b is sufficient to represent the vertex index for a dataset of size N. Proxima uses gap encoding [6] to save the space and data movement required by NN indices. Fig. 5-(a) illustrates an example with 4 indices and 3 NNs each, expressed by an adjacency list of 12 elements. The uncompressed 32-b adjacency list consumes 384-b of space. In comparison, gap encoding includes two steps: first, sort the NN indices in each row in ascending order; then convert the sorted indices into difference values with respect to the previous index, except for the first one. In this case, the bit width is determined by the bits needed for the maximum difference value. Therefore, the required space is reduced to 168-b. Our experiments show that graphs of 1M to 100M datasets need 20-b to 26-b gap encoding, leading to at least 19% to 37% graph index data compression. The compressed graph index also helps achieve faster graph traversal as the overhead of index fetching is reduced.
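The gap encoding step can be expressed in a few lines of C++. The sketch below (our own illustration under assumed type names) sorts each adjacency row, stores the first index in full, replaces the rest with differences, and sizes the per-row bit width by the largest gap that remains, mirroring the 384-b to 168-b example of Fig. 5-(a).

#include <algorithm>
#include <cstdint>
#include <vector>

struct GapEncodedRow {
    uint32_t first;               // first (smallest) neighbor index, stored in full
    std::vector<uint32_t> gaps;   // differences to the previous sorted index
    int bit_width;                // bits needed for the largest stored gap
};

static int bits_needed(uint32_t v) {
    int b = 1;
    while (v >>= 1) ++b;
    return b;
}

// Gap-encode one adjacency row: sort ascending, keep the first index,
// and store each following index as the gap to its predecessor.
GapEncodedRow gap_encode(std::vector<uint32_t> row) {
    std::sort(row.begin(), row.end());
    GapEncodedRow out{row[0], {}, 1};
    for (size_t i = 1; i < row.size(); ++i) {
        uint32_t gap = row[i] - row[i - 1];
        out.gaps.push_back(gap);
        out.bit_width = std::max(out.bit_width, bits_needed(gap));
    }
    return out;
}

// Decoding is simply a prefix sum over the gaps.
std::vector<uint32_t> gap_decode(const GapEncodedRow& row) {
    std::vector<uint32_t> ids = {row.first};
    for (uint32_t g : row.gaps) ids.push_back(ids.back() + g);
    return ids;
}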
IV. PROXIMA ACCELERATOR

Although Proxima's search optimization effectively improves graph search, it is still limited by costly data movement between SSD, host memory, and cache. For better throughput and efficiency, we architect a near-storage accelerator using 3D NAND and heterogeneous integration in this section to implement the Proxima graph search algorithm.

A. Architecture Overview

Fig. 7 illustrates the architecture of the Proxima accelerator, consisting of two main parts: 3D NAND tiles and a search engine. Proxima is a standalone near-storage accelerator (similar to [28], [44]) that realizes both data storage and an efficient ANNS function. This is achieved by utilizing the advanced heterogeneous integration technique [2], which connects the 3D NAND wafer and the CMOS wafer for better efficiency and storage capacity.

Fig. 7: Proxima accelerator in 3D NAND flash.

3D NAND Tiles are implemented on the 3D NAND wafer and are used to store three types of graph data: 1. raw data, 2. the raw data's PQ codes, and 3. compressed graph NN indices. To ensure architectural extensibility, the two-level spatial hierarchy of previous works [39], [61] is adopted to construct the 3D NAND tiles, with Ntile tiles in total, each tile including Ncore cores. The tiles and cores at the same hierarchy level are connected via the tile or core H-tree bus, respectively. Inside each core, there are Nsub blocks of 3D NAND subarrays with peripheral circuits. To boost data access during graph search, we customize the 3D NAND core by adding a BL MUX between the BLs and the page buffer. Meanwhile, a smaller array size is chosen for lower read latency. The details are presented in Section IV-C.

Search Engine is manufactured on the CMOS wafer to maximize the logic density. It is connected to the tile controller via the H-tree bus and further connected to the memory array through Cu-Cu bonding [2]. The search engine efficiently implements the Proxima graph search algorithm and is used for: 1. scheduling and processing incoming queries, and 2. fetching graph data from the 3D NAND cores. The 3D NAND flash core is the minimal unit that can be accessed by the search engine for data fetching. The design of the search engine is given in Section IV-D.

B. Proxima Execution Flow

The execution of the Proxima accelerator is mainly composed of two steps: 1. graph data preloading and 2. query graph search. Before graph search, the raw data, the raw data's PQ codes, and the NN indices need to be prestored at the corresponding physical addresses using the proposed data mapping scheme described in Section IV-E.

After the required data have been stored, Fig. 8 illustrates the data flow inside the search engine during graph search. The entire graph search consists of four steps. Step 1 is the initial phase for new queries, where the PQ module computes the asymmetric distance table (ADT) based on the query data. The computed ADT is transmitted to the ADT memory in the target queue specified by the scheduler. In Step 2, the candidate list in the queue pops the unevaluated vertex and the arbiter generates the corresponding address to fetch the NN indices from the 3D NAND cores (Lines 4-6 in Algorithm 1). The fetched NN indices first pass through the Bloom filter to record newly visited vertices and filter out previously computed vertices. The PQ codes of the unvisited vertices are then fetched and computed by the distance computation module (Lines 6-8 in Algorithm 1). In Step 3, a sort is needed to select the top L candidates (Line 10 in Algorithm 1) after all the neighbors have been visited. The implemented Bitonic sorter is shared by all search queues since it provides sufficient throughput. After the PQ distance-based search is finished, the top vertices in the candidate list memory are reranked using their raw data (Line 12 in Algorithm 1); this is illustrated by Step 4 in Fig. 8. Meanwhile, the candidate list also checks whether the early termination condition is satisfied.

Fig. 8: Internal architecture of search engine.
C. 3D NAND Core Design

The generic 3D NAND flash chips [26], [37], [40] used in commercial SSDs are not suitable for graph ANNS. First, 3D NAND chips in SSDs are optimized for storage density, so they use large 3D NAND array sizes. As a result, the long precharging and discharging times lead to long page read latency, ranging from 15µs to 90µs [26], [37], [40]. This is unfavorable for graph traversal, which needs a lot of random data access and low processing latency. The second drawback is that the large page size (8KB to 16KB) used by existing 3D NAND flashes is too coarse for graph traversal workloads because the data reading operations of graph search (fetching raw data or graph indices) only need <0.5KB from the storage each time. Hence, a finer data access granularity with lower latency is needed. Thirdly, to attain higher storage density, most SSDs store multiple bits per cell (MLC). However, MLC incurs one to two orders of magnitude higher error rates than single-level cell (SLC) [29], [49]. Our experiments in Section V-E provide numerical data to demonstrate that MLC without error correction (ECC) dramatically degrades ANNS accuracy. While commercial SSDs are generally equipped with ECC modules to correct potential bit errors when reading data [60], adding ECC to Proxima is infeasible due to the large area overhead of ECC [60], which would deteriorate system efficiency, and the difficulty of matching the ECC module to the high data rate over the Cu-Cu bonding between the CMOS and 3D NAND wafers.

We develop a simulator based on the 3D NAND simulator [39] to project the tradeoff between density, area, and read latency for 96-layer 3D NAND in Fig. 9, which works as the design guidance for Proxima cores. The large page size could incur over 10^4 ns read latency. Previous work [55] shows that the precharge and discharge processes take ≈ 90% of the page read latency. The long precharge and discharge time results from the large capacitance load created by hundreds of blocks and the extensive page size. We customize the Proxima 3D NAND core to reduce the long read latency and increase data granularity. First, we reduce the number of BLs, NBL, as well as the number of blocks, Nsub, to decrease the capacitance load. Specifically, we use NBL = 36768, NSSL = 4, and Nblock = 64 to build each core. Dramatically decreasing the page size also decreases the area efficiency because a large portion of the NAND array is occupied by the peripheral circuits. To reduce the overhead of the peripheral page buffer, a BL MUX is implemented between the page buffer and the 3D NAND blocks for a reduced page buffer size and better data granularity. Each time, Proxima's core only precharges a small portion of the BLs instead of the entire page. As a result, we use a 32:1 MUX to achieve 128B data granularity while the Proxima core achieves a read latency < 300ns. The other benefit of partial precharging is that it reduces the area overhead of the peripheral circuits in the page buffer by a factor of 32, thereby increasing the storage density.

Fig. 9: The density, area, and read latency trade-offs for 96-layer 3D NAND flash.

D. Search Engine Design

The search engine for graph search consists of the following components:

Scheduler and Arbiter. The scheduler adopts a simple round-robin strategy with a first-come-first-serve policy to allocate each new query to an idle queue. The scheduler keeps track of all queues' status in an Nq-b buffer. The arbiter allocates data fetching requests from the queues to the target 3D NAND core. To this end, the arbiter first translates the vertex information from the queue into the associated physical address. If the destination core is idle, the physical address is sent via the H-tree bus. Otherwise, the data fetching request is temporarily stalled by the arbiter and waits for the next-round allocation.

Product Quantization (PQ) Module contains a codebook memory and a total of 32 FP16 multiply-and-accumulate (MAC) units to compute the C×M ADT. The codebook memory stores the C × D PQ codebook trained by offline k-means. Since the parameters C = 256 and M = 32 are fixed for different datasets while the vector dimension D ranges from 96 to 128, we use a 64kB SRAM as the codebook memory and a 16kB SRAM as the ADT memory. The ADT computation with O(C · D) complexity incurs a latency of 8D (angular distance) to 24D (Euclidean distance) clock cycles.

Distance Computation Module inside each queue consists of one ADT memory, one query buffer, and one MAC unit. It supports two types of distance computations: the approximate PQ distance and the accurate distance. For the PQ distance, the distance computation module uses the fetched 256-b PQ code to look up the partial distances of the M sub-vectors in the ADT memory, and then computes the accumulation (Eq. (3)) in M clock cycles. For the accurate distance, the base data vectors are first fetched and computed against the query data cached in the query buffer, requiring D clock cycles in total.
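As a behavioral reference for the two distance modes described above, the C++ sketch below (our own model, not RTL, with assumed data structures) mirrors the per-queue datapath: the PQ path accumulates one ADT entry per cycle for M cycles, and the accurate path accumulates one dimension per cycle for D cycles; the returned cycle counts reflect the latencies stated in the text.

#include <cstdint>
#include <vector>

struct DistResult { float value; int cycles; };

// PQ mode: one ADT lookup-and-accumulate per cycle, M cycles in total.
DistResult pq_mode(const std::vector<std::vector<float>>& adt,
                   const std::vector<uint8_t>& pq_code) {
    float acc = 0.f;
    for (size_t i = 0; i < pq_code.size(); ++i) acc += adt[i][pq_code[i]];
    return {acc, (int)pq_code.size()};          // M cycles
}

// Accurate mode: one MAC on (query[d] - base[d]) per cycle, D cycles in total.
DistResult accurate_mode(const std::vector<float>& query_buffer,
                         const std::vector<float>& base_vector) {
    float acc = 0.f;
    for (size_t d = 0; d < base_vector.size(); ++d) {
        float diff = query_buffer[d] - base_vector[d];
        acc += diff * diff;                     // squared Euclidean distance
    }
    return {acc, (int)base_vector.size()};      // D cycles
}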
Queue works independently to process the incoming data from the scheduler or the arbiter. The data from the PQ module include the ADT and query vector, which are used to initialize the ADT memory and query buffer. The queue sends vertex information to the arbiter to fetch data stored in the 3D NAND flash tiles. The read data (PQ codes, raw base vectors, or NN indices) are sent back to the corresponding queue via the arbiter. One key design for high-performance graph search is the search queues. The Proxima accelerator contains a total of Ncore = 512 cores, which provide high internal data parallelism. A single queue is unable to fully leverage this parallelism. To increase core utilization as well as search throughput, Nq queues are implemented in the search queue module of Fig. 8.

Bloom Filter. Some vertices may be visited repeatedly during graph search. To save redundant computations, the indices of visited vertices are saved and checked. Based on our simulation, the number of visited vertices increases linearly with the candidate list size |L|; at most 8000 vertices need to be saved for |L| = 250. We use the memory-efficient Bloom filter [68] to detect visited vertices. A Bloom filter is a probabilistic data structure [8] with a false positive probability of (1 − e^{−kn/m})^k, where k is the number of hash functions, m is the bit array size, and n is the number of elements to be inserted. We implement a 12kB SRAM with 8 lightweight SeaHashes [1] in the Bloom filter module to guarantee a false positive probability < 0.02%, providing negligible recall loss [68].
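A minimal C++ sketch of the visited-vertex filter, assuming the stated configuration (a 12kB bit array and k = 8 hash functions); we use std::hash with per-hash seeds instead of SeaHash, so the exact false-positive rate differs from the deployed design, but the (1 − e^{−kn/m})^k behavior is the same.

#include <cmath>
#include <cstdint>
#include <functional>
#include <vector>

class VisitedFilter {
public:
    explicit VisitedFilter(size_t bits = 12 * 1024 * 8, int k = 8) : bits_(bits, false), k_(k) {}

    // Returns true if the vertex was possibly seen before; inserts it either way.
    bool test_and_set(uint32_t vertex_id) {
        bool seen = true;
        for (int i = 0; i < k_; ++i) {
            size_t pos = hash(vertex_id, i) % bits_.size();
            if (!bits_[pos]) { seen = false; bits_[pos] = true; }
        }
        return seen;
    }

    // Expected false-positive probability after n insertions: (1 - e^{-kn/m})^k.
    double false_positive_rate(size_t n) const {
        double m = (double)bits_.size();
        return std::pow(1.0 - std::exp(-(double)k_ * (double)n / m), (double)k_);
    }

private:
    // Cheap seeded hash standing in for the 8 SeaHash instances of the paper.
    static size_t hash(uint32_t id, int seed) {
        return std::hash<uint64_t>{}(((uint64_t)seed << 32) ^ id);
    }
    std::vector<bool> bits_;
    int k_;
};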
Bitonic Sorter and Candidate List. During graph traversal, the candidate list in each search queue requires sorting to keep the candidates ordered. In the worst case, over 200 distances need to be sorted. We use a parallel Bitonic sorter that provides constant latency to avoid excessive sorting latency. The Bitonic sorter is stage-pipelined and accepts Nsorter inputs in parallel each cycle. The sorting latency for Nsorter inputs is a constant 2 log2 Nsorter clock cycles. Proxima implements only one global Nsorter = 256-point Bitonic sorter, which satisfies the latency and throughput requirements of the used list size |L| and queues. The candidate list is a 2kB buffer that stores the candidate set L in Algorithm 1, including each candidate's distance and the corresponding vertex index. The candidate list also stores the previous k-NN results from the last iteration to support the early termination in Section III-D.

E. Data Mapping Strategy

Before query graph search, the raw, graph, and PQ data need to be preloaded and mapped into the 3D NAND cores based on the data mapping scheme. The data mapping not only determines the conversion between logical and physical addresses but also affects the system's performance. We consider three design factors in developing Proxima's data mapping strategy: 1. the data mapping strategy needs to provide low-complexity address translation logic for the arbiter; 2. the data mapping strategy should maximize the system's query search performance; 3. the data mapping strategy should be scalable to support different ANNS dataset scales. To this end, we propose the following schemes to achieve efficient data mapping:

Graph Index Reordering. The graph generated by ANNS tools [32], [50] has a highly random index distribution. Instead of directly mapping the irregular graph indices, we first reorder the indices to increase their locality. As shown in Fig. 10-(a), the graph index reordering is based on the vertices' visiting frequency. The hotter (more frequent) vertices have smaller indices, which means the entry point starts from 0. The vertices' visiting frequency is calculated from the graph search trace of randomly sampled base data. The graph index reordering helps to increase the index locality.

Hot Nodes Repetition. After graph index reordering, the hot vertices and their NNs are located at the top of the graph index. Proxima selects the hottest nodes, which have the smallest indices, as the hot nodes. The hot nodes repetition scheme ensures that the graph search process for the most frequently accessed vertices can be significantly accelerated. As shown in Fig. 10-(a), each hot node's NN index is followed by the corresponding PQ code. In this case, the NNs' PQ codes and indices are stored together, so computing each vertex can be done in one shot, which means only one WL setup of the 3D NAND is sufficient. The cost of the hot nodes repetition scheme is the additional bits for the R PQ codes; the total bits for each hot node are R × (bindex + bPQ) + bPQ. We do not implement additional caches or faster buffers to store hot nodes. Instead, hot nodes reside in the 3D NAND cores like normal data, only in the repeated format described above.

Data Allocation and Address Translation. Most graph search operations involve PQ codes and graph indices, while access to raw data is needed only for accurate distance computation during the reranking step. We let Proxima store the raw data individually in some 3D NAND cores while PQ codes and graph indices are stored together. As shown in Fig. 10-(b), we use a core-level round-robin address mapping scheme, such that data with consecutive indices are assigned to consecutive cores. This scheme helps maximize memory utilization. Each page uses the same bit length to store node vectors and the associated NN indices (nodes with degree < R are padded to R to align addresses). Fig. 10-(a) shows the coupled storage format of PQ codes and graph indices in each page of a 3D NAND core. Each data frame for one vertex vi starts with the R NNs' graph indices and ends with the corresponding PQ code of vi, so R × bindex + bPQ bits are needed per vertex and each page can store at most ⌊NBL / (R × bindex + bPQ)⌋ frames. Likewise, each raw data vector consumes braw × D bits and each page can store ⌊NBL / (braw × D)⌋ raw data frames. Besides the raw data, PQ codes, and graph indices, Proxima stores the hot nodes introduced above as a special data type.

Fig. 10: (a) Graph index reordering and hot nodes repetition. (b) Address mapping.
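The round-robin, frame-per-vertex layout lends itself to a closed-form address translation, sketched below in C++ under hypothetical parameter names and example values (core count, usable bits per page, R, and the index/PQ bit widths are assumptions drawn from the text, not a verified configuration); the arbiter in Proxima performs the equivalent computation in hardware.

#include <cstdint>

struct MappingParams {
    uint32_t num_cores = 512;     // cores holding PQ codes + NN indices
    uint32_t page_bits = 36768;   // usable bits per page (N_BL), assumed
    uint32_t degree_R  = 64;      // maximum vertex degree after padding
    uint32_t b_index   = 26;      // gap-encoded index width, example value
    uint32_t b_pq      = 256;     // PQ code width (32 bytes)
};

struct PhysicalAddr { uint32_t core, page, frame; };

// Core-level round-robin mapping: consecutive vertex ids go to consecutive
// cores; within a core, frames of (R*b_index + b_pq) bits are packed per page.
PhysicalAddr translate(uint32_t vertex_id, const MappingParams& p) {
    uint32_t frame_bits      = p.degree_R * p.b_index + p.b_pq;
    uint32_t frames_per_page = p.page_bits / frame_bits;      // floor
    uint32_t core            = vertex_id % p.num_cores;       // round-robin over cores
    uint32_t local           = vertex_id / p.num_cores;       // position within the core
    return {core, local / frames_per_page, local % frames_per_page};
}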
V. EVALUATION

A. Evaluation Methodology

Baselines. Proxima graph search is compared with three graph-based ANNS baselines: HNSW [50] and DiskANN [32] on CPU, and GGNN [22] on GPU. We also consider a compression-based algorithm, IVF-PQ in FAISS [36], as the non-graph baseline. We use the graph building parameters of ANN-Benchmarks [4]: the maximum out degree R = 64 with list size L = 150 for DiskANN and L = 500 for HNSW. Proxima hardware is compared with three domain-specific accelerators: ANNA [42] (an IVF-PQ-based ASIC), VStore [44] (a near-storage graph ANNS accelerator), and ICE [28].

Benchmarks. Table I lists the ANNS datasets used, on three scales (1M, 10M and 100M). The 10M and 100M datasets are subsets extracted from DEEP1B [5] and BIGANN1B [34], respectively. The software baselines are evaluated on an AMD EPYC 7543 CPU with 128GB DDR4-3200 memory and NVIDIA A40 GPUs with 40GB GDDR6 memory to measure their performance, energy consumption, and recall. The 100M datasets for GGNN [22] are evaluated using two A40 GPUs.

TABLE I: Specifications of evaluated datasets.
Dataset            Distance        #Base   #Query   Dimension D
SIFT [33]          Euclidean       1M      10K      128
GLOVE [56]         Angular         1M      10K      100
DEEP-10M [5]       Inner Product   10M     10K      96
BIGANN-10M [34]    Euclidean       10M     10K      128
DEEP-100M [5]      Inner Product   100M    10K      96
BIGANN-100M [34]   Euclidean       100M    10K      128

Proxima Software Implementation. We implement the Proxima graph search algorithm in C++ upon the DiskANN codebase [32]. The code is compiled with the g++ compiler with -O3 optimization. The data vectors are divided into 32 subvectors with 256 centroids per subvector. The PQ threshold is beta = 1.06 as illustrated in Section III-C, the repetition rate r is in the range 1 to 15, and the search step is Tstep = 4 as in Section III-D.

Proxima Hardware. Proxima performance is estimated using an in-house simulator. The simulator's back-end is built upon 3D-FPIM [39] to compute the parasitics, energy, and latency of the 3D NAND based on the parameters of Samsung's 96-layer NAND flash [64] and the configurations in Table II, where the total storage capacity is 432Gb. The simulator's front-end is a modified version of NeuroSIM [14] which faithfully reflects the behavior of Proxima graph search; it accepts the trace generated by the software and calculates the corresponding performance metrics. The ASIC search engine is implemented in Verilog HDL with TSMC 40nm technology. The clock frequency is 1GHz and the design is scaled to 22nm. The timing and energy parameters of the SRAMs and buffers are calculated using CACTI-3DD [13] at 22nm.

TABLE II: Area and power breakdown of Proxima accelerator.

B. ANNS Algorithm Evaluation

Improvements over Graph Baselines. We evaluate Proxima graph search and compare it with DiskANN [32] and HNSW [50] on CPU. Fig. 11 shows throughput (QPS) and recall, where Proxima achieves comparable recall and throughput across all datasets and up to 10% recall improvement over DiskANN and HNSW at the same throughput on the 1M datasets. The improvements over HNSW come from using lightweight PQ distance approximations instead of expensive accurate distances. Compared to DiskANN, which also uses PQ, such improvements come from our implementation of the novel search strategy in Algorithm 1.

Fig. 11: Throughput (QPS) and recall comparison with HNSW [50], DiskANN [32], and FAISS-IVF [36] on CPU.

Comparison with Non-graph Baseline. We also evaluate the non-graph algorithm FAISS-IVF [36], which uses PQ compression. Although the memory footprint of FAISS-IVF is smaller, lossy PQ compression introduces significant approximation error and causes the recall to saturate around 90% and 85% for small and large datasets, respectively. Graph-based methods consistently achieve a better recall vs. throughput tradeoff than FAISS-IVF.
C. Hardware Performance Comparison

Proxima vs. CPU, GPU, and ASIC. We compare the Proxima near-storage accelerator in 3D NAND with state-of-the-art CPU, GPU, and ASIC designs for ANNS. Fig. 12-(a) summarizes the throughput comparison, which is conducted on two small-scale datasets (SIFT and GLOVE) and two large-scale 100M datasets (BIGANN-100M and DEEP-100M). GGNN [22] is a high-performance GPU tool optimized for graph-based ANNS, while ANNA [42] is a PQ-IVF-based ASIC accelerator. We adopt HNSW [50] as the CPU baseline. Proxima achieves the highest throughput among all baselines, while GPU-based GGNN is the second fastest tool. Compared to the ASIC design ANNA [42], Proxima is 6.6× to 13× faster on the 1M and 100M datasets. Compared to the CPU baseline, Proxima's speedup is more significant for large-scale or high-complexity datasets, such as GLOVE [56], which needs many more distance computations (6× to 8×) to achieve the same recall.

We measure the energy efficiency of Proxima and the baselines in Fig. 12-(b). Energy efficiency is measured in QPS/W. Proxima has the highest energy efficiency on all datasets. ANNA's energy efficiency is up to 17× inferior to Proxima's because ANNA needs large on-chip SRAMs and frequent off-chip data transfers; the expensive data movement increases the energy consumption. Compared to the CPU, Proxima is three orders of magnitude more energy efficient because Proxima, implemented near 3D NAND, has much shorter data access latency and lower read energy.

Fig. 12: Throughput and energy efficiency comparison for HNSW [50], GGNN [22], ANNA [42], and Proxima: (a) throughput (QPS); (b) energy efficiency (QPS/W).

Proxima vs. Existing ANNS Accelerators. Additionally, we compare Proxima with state-of-the-art ANNS accelerators in Table III, including CPU-based DiskANN-PQ [32], GPU-based GGNN [22], ASIC-based ANNA [42], and NSP-based VStore [44]. Proxima and VStore are the only two designs that can persist the graph data in flash storage, so no off-chip communication is needed at the next power-on. In comparison, GGNN and ANNA rely on off-chip or on-chip data communication, which incurs higher energy consumption. VStore relies on an SSD internal interface like SimpleSSD to realize the data communication with processors, which delivers very low bandwidth (9.9GB/s). The low bandwidth severely limits scaling to large-scale graph ANNS applications. Compared to VStore, Proxima has a more compact design, shorter memory latency, and 26× higher peak memory bandwidth due to the high-speed heterogeneous integration. GGNN with an NVIDIA V100 GPU and HBM2 memory delivers the highest bandwidth, but its bit density is only 41% of Proxima's. Overall, Proxima balances well between memory capacity, density, and bandwidth, and can be regarded as a promising solution for data-intensive graph ANNS workloads.

Hardware Overhead. Table II summarizes the area and energy breakdown of the Proxima near-storage accelerator. The area of the 3D NAND part is dominated by the memory tier, determined by the size of the memory array. The areas of the peripheral circuits and H-Tree buses are factored out by incorporating the heterogeneous integration. The search engine area is also factored out by fitting the search engine on the top CMOS tier. The overall Proxima area is 258.56 mm², a compact solution for graph-based ANN; Proxima's area is 2.4× smaller than the A40 GPU's 628 mm² die area.

D. Effectiveness of Proxima Optimizations

Gap Encoding and Early Termination.
We analyze the memory traffic of different graph ANNS search algorithms in Fig. 14. HNSW [50], which does not use any distance approximation, incurs the largest data movement overhead. DiskANN-PQ denotes DiskANN with product quantization [32]; it reduces total memory traffic by 12% to 40% because most raw data accesses are skipped using PQ. Gap encoding in Proxima further decreases the data access of NN indices, while early termination saves 10% to 25% of redundant operations over DiskANN-PQ. As a result, Proxima achieves a traffic reduction of 1.9× to 2.4× over HNSW [50].

Fig. 14: Memory traffic breakdown for HNSW [50], DiskANN-PQ [32], and Proxima with gap encoding and early termination.

Hot Nodes Repetition. Hot node repetition is an effective scheme to reduce latency while improving the overall efficiency. Fig. 15 illustrates the runtime breakdown for hot node percentages from 0.0% to 7.0% on the 100M datasets. When hot node repetition is disabled, the data access latency caused by the 3D NAND core and H-tree bus contributes 80% of the overall latency. Adding 1% of hot nodes (1M points for the 100M datasets) reduces the overall search latency by 2.2×. The gain comes from the increased data locality of hot nodes, which reduces the required data access time to the 3D NAND cores, as one round of NN index fetching and PQ distance computations (Lines 6-9 in Algorithm 1) can be finished within one page access. When the percentage of hot nodes further increases to 3%, the speedup is ≈ 3× compared to the performance without hot nodes. However, the benefit of adding more than 3% hot nodes reaches a plateau. We use 3% as the default value in Proxima.

Fig. 15: Runtime breakdown for different hot node percentages.

Improvements over Other Graph ANNS Algorithms. The Proxima accelerator is general enough to support various graph ANNS algorithms. We compare the performance of two variants of Proxima with two state-of-the-art works, HNSW [50] and DiskANN-PQ [32]. All these graph algorithms run on the proposed 3D NAND accelerator using the same hardware configurations in Table II. Fig. 13 shows the throughput, energy efficiency, and query latency on the 10M and 100M datasets. Proxima with gap encoding and early termination achieves moderate throughput and efficiency improvements over HNSW and DiskANN-PQ via software optimization alone.

Fig. 13: Performance comparison for various graph ANNS algorithms run on the proposed 3D NAND-based NSP accelerator: HNSW [50], DiskANN-PQ [32], and Proxima (G: gap encoding, E: early termination, H: hot nodes repetition).

TABLE III: Comparison with existing CPU, GPU, ASIC, and NSP accelerators.
Design              DiskANN-PQ [32]   GGNN [22]   ANNA [42]   VStore [44]            Proxima
Platform            CPU               GPU         ASIC        FPGA+SSD               3D NAND SLC
Including storage?  No                No          No          Yes                    Yes
Memory Type         DRAM-DDR4-3200    HBM2        DRAM        DRAM+SSD               3D NAND
Memory Capacity     128GB             32GB        -           32GB                   54GB
Peak Bandwidth      102GB/s           900GB/s     64GB/s      9.9GB/s (aggregated)   254GB/s
Memory Bit Density  0.2Gb/mm2         0.7Gb/mm2   0.2Gb/mm2   4.2Gb/mm2              1.7Gb/mm2
The early termination and gap encoding help to reduce both computational redundancy and data movement during search. HNSW yields the worst performance because it requires many accurate distance computations. After adopting the hot nodes repetition, Proxima delivers about 2× speedup, 3× energy efficiency improvement, and 3× latency reduction. The benefits mainly result from the better data locality due to graph index reordering and hot nodes repetition, which reduce the memory accesses during graph traversal.

E. Scalability and Sensitivity Analysis

Queue Size. The search queue design is key to improving throughput and the utilization of the cores and bandwidth. We vary the queue size Nq from 32 to 256 and simulate the normalized throughput, energy efficiency, and 3D NAND core utilization on the 100M datasets without using hot node repetition. Fig. 16 shows that increasing the queue size to 256 significantly improves the throughput, by 3.8×. This is because more parallel search queues can handle more queries simultaneously, thereby increasing the achievable number of parallel memory requests. The increased memory requests raise core utilization from 17.9% to 68%. However, increasing the queue size also decreases energy efficiency by almost 20% due to two factors: 1. the additional queues may conflict when two different queues access the same core, which slows down the query search; 2. more search queues dissipate more static power during graph search. We observe that further increasing the queue size beyond Nq = 256 does not significantly speed up QPS while adding more power consumption to the system. This suggests that the queues have essentially saturated the internal bandwidth of the 3D NAND cores; therefore, it is not cost-effective to further increase the queue size. We choose Nq = 256 as the default queue size.

Fig. 16: Performance and efficiency of Proxima using different queue sizes on 100M datasets.

3D NAND Error. Proxima uses SLC-based 3D NAND without any ECC modules to deliver the best performance. It is therefore critical to evaluate whether the ECC-free design is able to tolerate the errors that occur. Typically, the raw bit error rate of SLC 3D NAND is lower than 1e−5 [29], while it is larger than 1e−4 for MLC 3D NAND [49] and TLC 3D NAND [54]. Fig. 17 shows the simulated search recall degradation under the impact of bit errors. It shows that, using SLC 3D NAND, Proxima can retain the recall rate without ECC because the recall degradation is less than 3% when the error rate increases to 1e−4.

Fig. 17: Impact of data errors in 3D NAND flash on the search recall across different datasets.

VI. RELATED WORKS

A. Large-scale ANN Algorithms

Various approaches have been proposed to address large-scale ANNS. Compression-based algorithms compress data based on space partitioning and clustering, including IVF [62], product quantization [33], [51], and hashing methods [11], [12]. Previous experiments [4], [9] show that the low memory consumption of these compression-based methods comes at the cost of moderate accuracy degradation, especially on large-scale datasets. Graph-based methods [19], [32] build an approximate graph and prune edges by proximity graph criteria [65]. Graph-based methods are fast, accurate, and scale well to large datasets. There are various emerging graph algorithms, such as Filtered-DiskANN [21] and OOD-DiskANN [30], that optimize the ANNS problem for specific cases. These works can benefit from the Proxima design to enhance their performance because most existing graph methods share the same graph search strategy.
B. Acceleration for Graph and Graph ANNS

There exist various hardware accelerators that aim to address large-scale ANNS with high energy efficiency and low query latency, including VStore [44], ICE [28], ES4D [38], SSAM [41], and CXL-ANNS [31]. Near-storage processing [31], [41], [43], [44] or in-storage processing [28], [31] is commonly utilized to achieve better system efficiency and higher storage density on various types of memory devices. Proxima co-designs the graph search algorithm and the 3D NAND-based hardware, thereby achieving a better trade-off among memory density, energy efficiency, and bandwidth compared to existing works.

VII. CONCLUSION

This paper presents Proxima, an NSP-based accelerator for graph-based ANN search. Based on an analysis of existing ANN search tools, we characterize the challenges the algorithm faces in memory consumption, expensive distance computation, and irregular data access. Our solution significantly reduces the computational complexity through distance approximation and early termination of distance computation while showing comparable recall to competing tools. Furthermore, we devise a 3D NAND flash-based NSP accelerator that efficiently processes the optimized ANN search algorithm, with strategies to maximize internal parallelism and bandwidth utilization. Compared to state-of-the-art ANNS accelerators, Proxima achieves one to two orders of magnitude speedup and energy efficiency improvements.

REFERENCES

[1] “Crate seahash.” [Online]. Available: https://docs.rs/seahash/latest/seahash/
[2] “Techinsights, YMTC 232L Xtacking3.0.” [Online]. Available: https://www.techinsights.com/blog/ymtc-leading-pioneer-3d-nand
[3] D. Aiger, E. Kokiopoulou, and E. Rivlin, “Random grids: Fast approximate nearest neighbors and range searching for image search,” in ICCV 2013, 2013.
[4] M. Aumüller, E. Bernhardsson, and A. Faithfull, “Ann-benchmarks: A benchmarking tool for approximate nearest neighbor algorithms,” Information Systems, vol. 87, p. 101374, 2020.
[5] A. Babenko and V. Lempitsky, “Efficient indexing of billion-scale datasets of deep descriptors,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2055–2063.
[6] M. Besta and T. Hoefler, “Survey and taxonomy of lossless graph compression and space-efficient graph representations,” arXiv preprint arXiv:1806.01799, 2018.
[7] W. Bittremieux, K. Laukens, and W. S. Noble, “Extremely fast and accurate open modification spectral library searching of high-resolution mass spectra using feature hashing and graphics processing units,” Journal of Proteome Research, vol. 18, no. 10, pp. 3792–3799, Aug. 2019. [Online]. Available: https://doi.org/10.1021/acs.jproteome.9b00291
[8] B. H. Bloom, “Space/time trade-offs in hash coding with allowable errors,” Communications of the ACM, vol. 13, no. 7, pp. 422–426, 1970.
[9] D. Cai, “A revisit of hashing algorithms for approximate nearest neighbor search,” IEEE Transactions on Knowledge and Data Engineering, vol. 33, no. 6, pp. 2337–2348, 2019.
[10] C. Caillat, K. Beaman, A. Bicksler, E. Camozzi, T. Ghilardi, G. Huang, H. Liu, Y. Liu, D. Mao, S. Mujumdar, N. Righetti, M. Ulrich, C. Venkatasubramanian, X. Yang, A. Goda, S. Gowda, H. Mebrahtu, H. Sanda, Y. Yuwen, and R. Koval, “3d nand gidl-assisted body biasing for erase enabling cmos under array (cua) architecture,” in 2017 IEEE International Memory Workshop (IMW), 2017, pp. 1–4.
[11] Y. Cao, S. Chen, J. Gui, H. Qi, Z. Li, and C.
Liu, “Hash learning with variable quantization for large-scale retrieval,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 5, pp. 2624–2637, 2021. [12] M. S. Charikar, “Similarity estimation techniques from rounding algorithms,” in Proceedings of the thiry-fourth annual ACM symposium on Theory of computing, 2002, pp. 380–388. [13] K. Chen, S. Li, N. Muralimanohar, J. H. Ahn, J. B. Brockman, and N. P. Jouppi, “Cacti-3dd: Architecture-level modeling for 3d die-stacked dram main memory,” in 2012 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2012, pp. 33–38. [14] P.-Y. Chen, X. Peng, and S. Yu, “Neurosim: A circuit-level macro model for benchmarking neuro-inspired architectures in online learning,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 12, pp. 3067–3080, 2018. [15] Q. Chen, H. Wang, M. Li, G. Ren, S. Li, J. Zhu, J. Li, C. Liu, L. Zhang, and J. Wang, SPTAG: A library for fast approximate nearest neighbor search, 2018. [Online]. Available: https://github.com/Microsoft/SPTAG [16] Q. Chen, H. Zhao, W. Li, P. Huang, and W. Ou, “Behavior sequence transformer for e-commerce recommendation in alibaba,” in Proceedings of the 1st International Workshop on Deep Learning Practice for HighDimensional Sparse Data, 2019, pp. 1–4. [17] R. Chen, B. Liu, H. Zhu, Y. Wang, Q. Li, B. Ma, Q. Hua, J. Jiang, Y. Xu, and H. Deng, “Approximate nearest neighbor search under neural similarity metric for large-scale recommendation,” arXiv preprint arXiv:2202.10226, 2022. [18] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, “Prime: A novel processing-in-memory architecture for neural network computation in reram-based main memory,” in Proceedings of the 43rd International Symposium on Computer Architecture, ser. ISCA ’16. IEEE Press, 2016, p. 27–39. [Online]. Available: https://doi.org/10.1109/ISCA.2016.13 [19] C. Fu, C. Xiang, C. Wang, and D. Cai, “Fast approximate nearest neighbor search with the navigating spreading-out graph,” Proceedings of the VLDB Endowment, vol. 12, no. 5, pp. 461–474, 2019. [20] T. Ge, K. He, Q. Ke, and J. Sun, “Optimized product quantization,” IEEE transactions on pattern analysis and machine intelligence, vol. 36, no. 4, pp. 744–755, 2013. [21] S. Gollapudi, N. Karia, V. Sivashankar, R. Krishnaswamy, N. Begwani, S. Raz, Y. Lin, Y. Zhang, N. Mahapatro, P. Srinivasan, A. Singh, and H. V. Simhadri, “Filtered-diskann: Graph algorithms for approximate nearest neighbor search with filters,” in Proceedings of the ACM Web Conference 2023, ser. WWW ’23. New York, NY, USA: Association for Computing Machinery, 2023, p. 3406–3416. [Online]. Available: https://doi.org/10.1145/3543507.3583552 [22] F. Groh, L. Ruppert, P. Wieschollek, and H. Lensch, “Ggnn: Graph-based gpu nearest neighbor search,” IEEE Transactions on Big Data, 2022. [23] R. Guo, P. Sun, E. Lindgren, Q. Geng, D. Simcha, F. Chern, and S. Kumar, “Accelerating large-scale inference with anisotropic vector quantization,” in International Conference on Machine Learning. PMLR, 2020, pp. 3887–3896. [24] Y. Guo, M. Imani, J. Kang, S. Salamat, J. Morris, B. Aksanli, Y. Kim, and T. Rosing, “Hyperrec: Efficient recommender systems with hyperdimensional computing,” in 2021 26th Asia and South Pacific Design Automation Conference (ASP-DAC). IEEE, 2021, pp. 384–389. [25] K. He, F. Wen, and J. 
Sun, “K-means hashing: An affinity-preserving quantization method for learning binary compact codes,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2013, pp. 2938–2945. [26] T. Higuchi, T. Kodama, K. Kato, R. Fukuda, N. Tokiwa, M. Abe, T. Takagiwa, Y. Shimizu, J. Musha, and K. Sakurai, “30.4 a 1tb 3b/cell 3d-flash memory in a 170+ word-line-layer technology,” in 2021 IEEE International Solid-State Circuits Conference (ISSCC), vol. 64. IEEE, 2021, pp. 428–430. [27] P.-K. Hsu, P.-Y. Du, C. R. Lo, H.-T. Lue, W.-C. Chen, T.-H. Hsu, T.-H. Yeh, C.-C. Hsieh, M.-L. Wei, K.-C. Wang, and C.-Y. Lu, “An approach of 3d nand flash based nonvolatile computing-in-memory (nvcim) accelerator for deep neural networks (dnns) with calibration and read disturb analysis,” in 2020 IEEE International Memory Workshop (IMW), 2020, pp. 1–4. [28] H.-W. Hu, W.-C. Wang, Y.-H. Chang, Y.-C. Lee, B.-R. Lin, H.-M. Wang, Y.-P. Lin, Y.-M. Huang, C.-Y. Lee, and T.-H. Su, “Ice: An intelligent cognition engine with 3d nand-based in-memory computing for vector similarity search acceleration,” in 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2022, pp. 763–783. [29] A. Jagmohan, M. Franceschini, L. A. Lastras-Montaño, and J. Karidis, “Adaptive endurance coding for nand flash,” in 2010 IEEE Globecom Workshops, 2010, pp. 1841–1845. [30] S. Jaiswal, R. Krishnaswamy, A. Garg, H. V. Simhadri, and S. Agrawal, “Ood-diskann: Efficient and scalable graph anns for out-of-distribution queries,” 2022. [31] J. Jang, H. Choi, H. Bae, S. Lee, M. Kwon, and M. Jung, “{CXLANNS}:{Software-Hardware} collaborative memory disaggregation and computation for {Billion-Scale} approximate nearest neighbor search,” in 2023 USENIX Annual Technical Conference (USENIX ATC 23), 2023, pp. 585–600. [32] S. Jayaram Subramanya, F. Devvrit, H. V. Simhadri, R. Krishnawamy, and R. Kadekodi, “Diskann: Fast accurate billion-point nearest neighbor search on a single node,” Advances in Neural Information Processing Systems, vol. 32, 2019. [33] H. Jegou, M. Douze, and C. Schmid, “Product quantization for nearest neighbor search,” IEEE transactions on pattern analysis and machine intelligence, vol. 33, no. 1, pp. 117–128, 2010. [34] H. Jégou, R. Tavenard, M. Douze, and L. Amsaleg, “Searching in one billion vectors: re-rank with source coding,” in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2011, pp. 861–864. [35] L. Jiang, M. Kim, W. Wen, and D. Wang, “Xnor-pop: A processing-inmemory architecture for binary convolutional neural networks in wide-io2 drams,” in 2017 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), 2017, pp. 1–6. [36] J. Johnson, M. Douze, and H. Jégou, “Billion-scale similarity search with GPUs,” IEEE Transactions on Big Data, vol. 7, no. 3, pp. 535–547, 2019. [37] D. Kang, M. Kim, S. C. Jeon, W. Jung, J. Park, G. Choo, D.-k. Shim, A. Kavala, S.-B. Kim, and K.-M. Kang, “13.4 a 512gb 3-bit/cell 3d 6 th-generation v-nand flash memory with 82mb/s write throughput and 1.2 gb/s interface,” in 2019 IEEE International Solid-State Circuits Conference-(ISSCC). IEEE, 2019, pp. 216–218. [38] J. Kim, J. Seo, J. Park, S.-W. Lee, H. Roh, and H. Cho, “Es4d: Accelerating exact similarity search for high-dimensional vectors via vector slicing and in-ssd computation,” in 2022 IEEE 40th International Conference on Computer Design (ICCD). IEEE, 2022, pp. 298–306. [39] H. Lee, M. Kim, D. Min, J. Kim, J. Back, H. Yoo, J.-H. Lee, and J. 
Kim, “3d-fpim: An extreme energy-efficient dnn acceleration system using 3d nand flash-based in-situ pim unit,” in 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2022, pp. 1359–1376. [40] S. Lee, C. Kim, M. Kim, S.-m. Joe, J. Jang, S. Kim, K. Lee, J. Kim, J. Park, and H.-J. Lee, “A 1tb 4b/cell 64-stacked-wl 3d nand flash memory with 12mb/s program throughput,” in 2018 IEEE International Solid-State Circuits Conference-(ISSCC). IEEE, 2018, pp. 340–342. [41] V. T. Lee, A. Mazumdar, C. C. del Mundo, A. Alaghi, L. Ceze, and M. Oskin, “Application codesign of near-data processing for similarity search,” in 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2018, pp. 896–907. [42] Y. Lee, H. Choi, S. Min, H. Lee, S. Beak, D. Jeong, J. W. Lee, and T. J. Ham, “Anna: Specialized architecture for approximate nearest neighbor search,” in 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2022, pp. 169–183. [43] S. Liang, Y. Wang, Y. Lu, Z. Yang, H. Li, and X. Li, “Cognitive {SSD}: A deep learning engine for {In-Storage} data retrieval,” in 2019 USENIX Annual Technical Conference (USENIX ATC 19), 2019, pp. 395–410. [44] S. Liang, Y. Wang, Z. Yuan, C. Liu, H. Li, and X. Li, “Vstore: instorage graph based vector search accelerator,” in Proceedings of the 59th ACM/IEEE Design Automation Conference, 2022, pp. 997–1002. [45] G. Linden, B. Smith, and J. York, “Amazon. com recommendations: Item-to-item collaborative filtering,” IEEE Internet computing, vol. 7, no. 1, pp. 76–80, 2003. [46] T. Liu, C. Rosenberg, and H. A. Rowley, “Clustering billions of images with large scale nearest neighbor search,” in 2007 IEEE Workshop on Applications of Computer Vision (WACV ’07), 2007, pp. 28–28. [47] H.-T. Lue, P.-K. Hsu, K.-C. Wang, and C.-Y. Lu, “Introduction of nonvolatile computing in memory (nvcim) by 3d nand flash for inference accelerator of deep neural network (dnn) and the read disturb reliability evaluation : (invited paper),” in 2020 IEEE International Reliability Physics Symposium (IRPS), 2020, pp. 1–6. [48] H.-T. Lue, P.-K. Hsu, M.-L. Wei, T.-H. Yeh, P.-Y. Du, W.-C. Chen, K.-C. Wang, and C.-Y. Lu, “Optimal design methods to transform 3d nand flash into a high-density, high-bandwidth and low-power nonvolatile computing in memory (nvcim) accelerator for deep-learning neural networks (dnn),” in 2019 IEEE International Electron Devices Meeting (IEDM), 2019, pp. 38.1.1–38.1.4. [49] Y. Luo, S. Ghose, Y. Cai, E. F. Haratsch, and O. Mutlu, “Improving 3d nand flash memory lifetime by tolerating early retention loss and process variation,” Proc. ACM Meas. Anal. Comput. Syst., vol. 2, no. 3, dec 2018. [Online]. Available: https://doi.org/10.1145/3224432 [50] Y. A. Malkov and D. A. Yashunin, “Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs,” IEEE transactions on pattern analysis and machine intelligence, vol. 42, no. 4, pp. 824–836, 2018. [51] S. Morozov and A. Babenko, “Unsupervised neural quantization for compressed-domain similarity search,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 3036–3045. [52] L. Nai, R. Hadidi, J. Sim, H. Kim, P. Kumar, and H. Kim, “Graphpim: Enabling instruction-level pim offloading in graph computing frameworks,” in 2017 IEEE International symposium on high performance computer architecture (HPCA). IEEE, 2017, pp. 457–468. [53] X. Ning, C. Desrosiers, and G. 
Karypis, “A comprehensive survey of neighborhood-based recommendation methods,” Recommender systems handbook, pp. 37–76, 2015. [54] N. Papandreou, H. Pozidis, T. Parnell, N. Ioannou, R. Pletka, S. Tomic, P. Breen, G. Tressler, A. Fry, and T. Fisher, “Characterization and analysis of bit errors in 3d tlc nand flash memory,” in 2019 IEEE International Reliability Physics Symposium (IRPS), 2019, pp. 1–6. [55] J. Park, M. Kim, M. Chun, L. Orosa, J. Kim, and O. Mutlu, “Reducing solid-state drive read latency by optimizing read-retry,” in Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2021, pp. 702–716. [56] J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation,” in Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014, pp. 1532–1543. [57] J. Ren, M. Zhang, and D. Li, “Hm-ann: Efficient billion-point nearest neighbor search on heterogeneous memory,” Advances in Neural Information Processing Systems, vol. 33, pp. 10 672–10 684, 2020. [58] F. Research, “Indexing 1g vectors Wiki,” https://github.com/facebookresearch/faiss/wiki/Indexing-1G-vectors, apr 1 2021. [59] V. Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand, J. Kim, M. A. Kozuch, O. Mutlu, P. B. Gibbons, and T. C. Mowry, “Ambit: In-memory accelerator for bulk bitwise operations using commodity dram technology,” in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-50 ’17. New York, NY, USA: Association for Computing Machinery, 2017, p. 273–287. [Online]. Available: https://doi.org/10.1145/3123939.3124544 [60] W. Shao, J. Sha, and C. Zhang, “Dispersed array ldpc codes and decoder architecture for nand flash memory,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 65, no. 8, pp. 1014–1018, 2017. [61] W. Shim and S. Yu, “System-technology codesign of 3-d nand flash-based compute-in-memory inference engine,” IEEE Journal on Exploratory Solid-State Computational Devices and Circuits, vol. 7, no. 1, pp. 61–69, 2021. [62] J. Sivic and A. Zisserman, “Video google: A text retrieval approach to object matching in videos,” in Computer Vision, IEEE International Conference on, vol. 3. IEEE Computer Society, 2003, pp. 1470–1470. [63] L. Song, X. Qian, H. Li, and Y. Chen, “Pipelayer: A pipelined reram-based accelerator for deep learning,” in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2017, pp. 541–552. [64] TechInsight, “Samsung k93kgd8u0m die.” [Online]. Available: https: //www.techinsights.com/products/mfr-1812-801 [65] M. Wang, X. Xu, Q. Yue, and Y. Wang, “A comprehensive survey and experimental comparison of graph-based approximate nearest neighbor search,” arXiv preprint arXiv:2101.12631, 2021. [66] J. Zhang, S. Khoram, and J. Li, “Efficient large-scale approximate nearest neighbor search on opencl fpga,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4924–4932. [67] K. Zhao, P. Pan, Y. Zheng, Y. Zhang, C. Wang, Y. Zhang, Y. Xu, and R. Jin, “Large-scale visual search with binary distributed graph at alibaba,” in Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2019, pp. 2567–2575. [68] W. Zhao, S. Tan, and P. Li, “Song: Approximate nearest neighbor search on gpu,” in IEEE 36th International Conference on Data Engineering (ICDE). IEEE, 2020, pp. 1033–1044.