"""Benchmark of the Python (HuggingFace) GPT-2 tokenizer against the Rust PyGpt2Tokenizer,
timing tokenization plus a DistilGPT2 forward pass over a fixed batch of sentences."""

import math
import tempfile
from pathlib import Path
from timeit import default_timer as timer

import torch
from transformers.file_utils import get_from_cache
from transformers.tokenization_gpt2 import GPT2Tokenizer
from transformers.modeling_gpt2 import GPT2Model

from rust_transformers import PyGpt2Tokenizer


class TestBenchmarkDistilGPT2:
    def setup_class(self):
        self.use_gpu = torch.cuda.is_available()
        self.test_dir = Path(tempfile.mkdtemp())

        self.base_tokenizer = GPT2Tokenizer.from_pretrained('distilgpt2', do_lower_case=True,
                                                            cache_dir=self.test_dir)
        self.rust_tokenizer = PyGpt2Tokenizer(
            get_from_cache(self.base_tokenizer.pretrained_vocab_files_map['vocab_file']['distilgpt2']),
            get_from_cache(self.base_tokenizer.pretrained_vocab_files_map['merges_file']['distilgpt2']))
        self.model = GPT2Model.from_pretrained('distilgpt2', output_attentions=False).eval()
        if self.use_gpu:
            self.model.cuda()

        self.sentence_list = [
            "Deep learning (also known as deep structured learning or hierarchical learning) is part of a broader family of machine learning methods based on artificial neural networks. Learning can be supervised, semi-supervised or unsupervised.",
            "Deep learning is a class of machine learning algorithms that[11](pp199–200) uses multiple layers to progressively extract higher level features from the raw input.",
            "For example, in image processing, lower layers may identify edges, while higher layers may identify the concepts relevant to a human such as digits or letters or faces.",
            "Most modern deep learning models are based on artificial neural networks, specifically, Convolutional Neural Networks (CNN)s, although they can also include propositional formulas organized layer-wise in deep generative models.",
            "In deep learning, each level learns to transform its input data into a slightly more abstract and composite representation.",
            "In an image recognition application, the raw input may be a matrix of pixels; the first representational layer may abstract the pixels and encode edges; the second layer may compose and encode arrangements of edges;",
            "The third layer may encode a nose and eyes; and the fourth layer may recognize that the image contains a face. Importantly, a deep learning process can learn which features to optimally place in which level on its own.",
            "(Of course, this does not completely eliminate the need for hand-tuning; for example, varying numbers of layers and layer sizes can provide different degrees of abstraction.)",
            "The word \"deep\" in \"deep learning\" refers to the number of layers through which the data is transformed. More precisely, deep learning systems have a substantial credit assignment path (CAP) depth. The CAP is the chain of transformations from input to output.",
            "CAPs describe potentially causal connections between input and output. For a feedforward neural network, the depth of the CAPs is that of the network and is the number of hidden layers plus one (as the output layer is also parameterized).",
            "For recurrent neural networks, in which a signal may propagate through a layer more than once, the CAP depth is potentially unlimited.[2] No universally agreed upon threshold of depth divides shallow learning from deep learning.",
            "CAP of depth 2 has been shown to be a universal approximator in the sense that it can emulate any function.[14] Beyond that, more layers do not add to the function approximator ability of the network.",
            "Deep models (CAP > 2) are able to extract better features than shallow models and hence, extra layers help in learning the features effectively. Deep learning architectures can be constructed with a greedy layer-by-layer method.",
            "Deep learning helps to disentangle these abstractions and pick out which features improve performance.[1]. For supervised learning tasks, deep learning methods eliminate feature engineering, by translating the data into compact intermediate representations",
            "Deep learning algorithms can be applied to unsupervised learning tasks. This is an important benefit because unlabeled data are more abundant than the labeled data. Examples of deep structures that can be trained in an unsupervised manner are neural history compressors and deep belief networks.",
            "Deep neural networks are generally interpreted in terms of the universal approximation theorem or probabilistic inference. The classic universal approximation theorem concerns the capacity of feedforward neural networks with a single hidden layer of finite size to approximate continuous functions.",
            "In 1989, the first proof was published by George Cybenko for sigmoid activation functions and was generalised to feed-forward multi-layer architectures in 1991 by Kurt Hornik. Recent work also showed that universal approximation also holds for non-bounded activation functions such as the rectified linear unit.",
            "The universal approximation theorem for deep neural networks concerns the capacity of networks with bounded width but the depth is allowed to grow. Lu et al. proved that if the width of a deep neural network with ReLU activation is strictly larger than the input dimension, then the network can approximate any Lebesgue integrable function",
            "The probabilistic interpretation[24] derives from the field of machine learning. It features inference, as well as the optimization concepts of training and testing, related to fitting and generalization, respectively",
            "More specifically, the probabilistic interpretation considers the activation nonlinearity as a cumulative distribution function. The probabilistic interpretation led to the introduction of dropout as regularizer in neural networks.",
            "The probabilistic interpretation was introduced by researchers including Hopfield, Widrow and Narendra and popularized in surveys such as the one by Bishop. The term Deep Learning was introduced to the machine learning community by Rina Dechter in 1986",
            "The first general, working learning algorithm for supervised, deep, feedforward, multilayer perceptrons was published by Alexey Ivakhnenko and Lapa in 1965.[32] A 1971 paper described already a deep network with 8 layers trained by the group method of data handling algorithm.",
            "Other deep learning working architectures, specifically those built for computer vision, began with the Neocognitron introduced by Kunihiko Fukushima in 1980.[34] In 1989, Yann LeCun et al. applied the standard backpropagation algorithm",
            "By 1991 such systems were used for recognizing isolated 2-D hand-written digits, while recognizing 3-D objects was done by matching 2-D images with a handcrafted 3-D object model. Weng et al. suggested that a human brain does not use a monolithic 3-D object model and in 1992 they published Cresceptron",
            "Because it directly used natural images, Cresceptron started the beginning of general-purpose visual learning for natural 3D worlds. Cresceptron is a cascade of layers similar to Neocognitron. But while Neocognitron required a human programmer to hand-merge features, Cresceptron learned an open number of features in each layer without supervision",
            "Cresceptron segmented each learned object from a cluttered scene through back-analysis through the network. Max pooling, now often adopted by deep neural networks (e.g. ImageNet tests), was first used in Cresceptron to reduce the position resolution by a factor of (2x2) to 1 through the cascade for better generalization",
            "In 1994, André de Carvalho, together with Mike Fairhurst and David Bisset, published experimental results of a multi-layer boolean neural network, also known as a weightless neural network, composed of a 3-layers self-organising feature extraction neural network module (SOFT) followed by a multi-layer classification neural network module (GSN)",
            "In 1995, Brendan Frey demonstrated that it was possible to train a network containing six fully connected layers and several hundred hidden units using the wake-sleep algorithm, co-developed with Peter Dayan and Hinton. Many factors contribute to the slow speed, including the vanishing gradient problem analyzed in 1991 by Sepp Hochreiter",
            "Simpler models that use task-specific handcrafted features such as Gabor filters and support vector machines (SVMs) were a popular choice in the 1990s and 2000s, because of artificial neural network's (ANN) computational cost and a lack of understanding of how the brain wires its biological networks.",
            "Both shallow and deep learning (e.g., recurrent nets) of ANNs have been explored for many years.[47][48][49] These methods never outperformed non-uniform internal-handcrafting Gaussian mixture model/Hidden Markov model (GMM-HMM) technology based on generative models of speech trained discriminatively.",
            "Key difficulties have been analyzed, including gradient diminishing[45] and weak temporal correlation structure in neural predictive models.[51][52] Additional difficulties were the lack of training data and limited computing power. Most speech recognition researchers moved away from neural nets to pursue generative modeling.",
            "An exception was at SRI International in the late 1990s. Funded by the US government's NSA and DARPA, SRI studied deep neural networks in speech and speaker recognition. The speaker recognition team led by Larry Heck achieved the first significant success with deep neural networks.",
            "While SRI experienced success with deep neural networks in speaker recognition, they were unsuccessful in demonstrating similar success in speech recognition. The principle of elevating \"raw\" features over hand-crafted optimization was first explored successfully in the architecture of deep autoencoder on the \"raw\" spectrogram",
        ]

        # Run one full batch up-front (weight download, CUDA context and memory allocation)
        # so the timed benchmarks below are not skewed by one-off setup costs.
        tokens_list = [self.base_tokenizer.tokenize(sentence) for sentence in self.sentence_list]
        features = [self.base_tokenizer.convert_tokens_to_ids(tokens) for tokens in tokens_list]
        features = [self.base_tokenizer.prepare_for_model(ids, None, add_special_tokens=True, max_length=128)
                    for ids in features]
        max_len = max([len(feature['input_ids']) for feature in features])
        # Pad every sequence to the longest one in the batch (pad id assumed to be 0)
        all_input_ids = torch.tensor([feature['input_ids'] + [0] * (max_len - len(feature['input_ids']))
                                      for feature in features],
                                     dtype=torch.long)
        if self.use_gpu:
            all_input_ids = all_input_ids.cuda()
        with torch.no_grad():
            _ = self.model(all_input_ids)[0].cpu().numpy()

    def setup_base_tokenizer(self):
        self.base_tokenizer = GPT2Tokenizer.from_pretrained('distilgpt2', do_lower_case=True,
                                                            cache_dir=self.test_dir)

    def setup_rust_tokenizer(self):
        self.rust_tokenizer = PyGpt2Tokenizer(
            get_from_cache(self.base_tokenizer.pretrained_vocab_files_map['vocab_file']['distilgpt2']),
            get_from_cache(self.base_tokenizer.pretrained_vocab_files_map['merges_file']['distilgpt2']))

    def baseline_batch(self):
        # Python tokenizer path: tokenize, convert to ids, truncate/pad, run the model once.
        tokens_list = [self.base_tokenizer.tokenize(sentence) for sentence in self.sentence_list]
        features = [self.base_tokenizer.convert_tokens_to_ids(tokens) for tokens in tokens_list]
        features = [self.base_tokenizer.prepare_for_model(ids, None, add_special_tokens=True, max_length=128)
                    for ids in features]
        max_len = max([len(feature['input_ids']) for feature in features])
        all_input_ids = torch.tensor([feature['input_ids'] + [0] * (max_len - len(feature['input_ids']))
                                      for feature in features],
                                     dtype=torch.long)
        if self.use_gpu:
            all_input_ids = all_input_ids.cuda()
        with torch.no_grad():
            output = self.model(all_input_ids)[0].cpu().numpy()
        return output

    def rust_batch_single_threaded(self):
        # Rust tokenizer path: encoding and truncation happen in a single call per sentence.
        features = [self.rust_tokenizer.encode(sentence, max_length=128,
                                               truncation_strategy='longest_first', stride=0)
                    for sentence in self.sentence_list]
        max_len = max([len(feature.token_ids) for feature in features])
        all_input_ids = torch.tensor([feature.token_ids + [0] * (max_len - len(feature.token_ids))
                                      for feature in features],
                                     dtype=torch.long)
        if self.use_gpu:
            all_input_ids = all_input_ids.cuda()
        with torch.no_grad():
            output = self.model(all_input_ids)[0].cpu().numpy()
        return output

    def test_distilgpt2_baseline(self):
        values = []
        for i in range(10):
            self.setup_base_tokenizer()
            t0 = timer()
            self.baseline_batch()
            t1 = timer()
            values.append((t1 - t0) * 1000)  # elapsed time in milliseconds
        mean = sum(values) / len(values)
        std_dev = math.sqrt(sum([(value - mean) ** 2 for value in values]) / len(values))
        print(f'baseline - mean: {mean:.2f}, std. dev: {std_dev:.2f}')

    def test_distilgpt2_rust_single_threaded(self):
        values = []
        for i in range(10):
            self.setup_rust_tokenizer()
            t0 = timer()
            self.rust_batch_single_threaded()
            t1 = timer()
            values.append((t1 - t0) * 1000)  # elapsed time in milliseconds
        mean = sum(values) / len(values)
        std_dev = math.sqrt(sum([(value - mean) ** 2 for value in values]) / len(values))
        print(f'rust single thread - mean: {mean:.2f}, std. dev: {std_dev:.2f}')
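

# A minimal usage sketch, assuming this module sits under tests/ and is normally collected by
# pytest (e.g. `pytest -s tests/test_benchmark_distilgpt2.py`; -s is only needed to see the
# printed timings, since the tests assert nothing). The block below drives the same benchmarks
# directly without pytest; setup_class is called by hand to build the model and sentence batch.
if __name__ == '__main__':
    benchmark = TestBenchmarkDistilGPT2()
    benchmark.setup_class()
    benchmark.test_distilgpt2_baseline()
    benchmark.test_distilgpt2_rust_single_threaded()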