{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Bling Fire Tokenizer Demo\n", "This notebook shows how to use the Bling Fire tokenizer. We build a simple token-based classifier on a Stack Overflow classification data set and compare its accuracy with the Bling Fire tokenizer against the tokenizer built into scikit-learn's `TfidfVectorizer`.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Basic Usage" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "All tokens: ['In', 'order', 'to', 'login', 'to', 'Café', 'use', 'pi', '@', '1', '.', '2', '.', '1', '.', '2', '.', 'Split', 'the', 'data', 'into', 'train', '/', 'test', 'with', 'a', 'test', 'size', 'of', '20', '%', 'then', 'use', 'recurrent', 'model', '(', 'use', 'LSTM', 'or', 'GRU', ')', '.']\n", "Sentence text: In order to login to Café use pi@1.2.1.2.\n", "Tokenized sentence: In order to login to Café use pi @ 1 . 2 . 1 . 2 .\n", "Sentence text: Split the data into train/test with a test size of 20% then use recurrent model (use LSTM or GRU).\n", "Tokenized sentence: Split the data into train / test with a test size of 20 % then use recurrent model ( use LSTM or GRU ) .\n" ] } ], "source": [ "from blingfire import text_to_words, text_to_sentences\n", "\n", "text = \"In order to login to Café use pi@1.2.1.2. 
Split the data into train/test with a test size of 20% then use recurrent model (use LSTM or GRU).\"\n", "\n", "# tokenize the text without sentence boundaries\n", "ws = text_to_words(text).split(' ')\n", "print(\"All tokens: \", ws)\n", "\n", "# first break the text into sentences, then break each sentence into words\n", "sents = text_to_sentences(text).split('\\n')\n", "for sent in sents:\n", "    print(\"Sentence text: \" + sent)\n", "    print(\"Tokenized sentence: \" + text_to_words(sent))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get the Data Set Ready" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Download the Data Set, if needed" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import requests\n", "import os.path\n", "\n", "def download_file(filename, url):\n", "    \"\"\"\n", "    Download a URL to a file\n", "    \"\"\"\n", "    # Request first, so a failed request does not leave an empty file behind\n", "    response = requests.get(url, stream=True)\n", "    response.raise_for_status()\n", "    # Write response data to file\n", "    with open(filename, 'wb') as fout:\n", "        for block in response.iter_content(4096):\n", "            fout.write(block)\n", "\n", "def download_if_not_exists(filename, url):\n", "    \"\"\"\n", "    Download a URL to a file if the file\n", "    does not exist already.\n", "    Returns\n", "    -------\n", "    True if the file was downloaded,\n", "    False if it already existed\n", "    \"\"\"\n", "    if not os.path.exists(filename):\n", "        download_file(filename, url)\n", "        return True\n", "    return False" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "INPUT_DATA = './stack-overflow-data.csv'\n", "url = 'https://storage.googleapis.com/tensorflow-workshop-examples/stack-overflow-data.csv'\n", "download_if_not_exists(INPUT_DATA, url)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", 
"output_type": "stream", "text": [ "-rwxrwxrwx 1 root root 44319561 Mar 28 21:49 stack-overflow-data.csv\r\n" ] } ], "source": [ "!ls -l stack-overflow-data.csv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Read the data with Pandas" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 39999 entries, 0 to 39998\n", "Data columns (total 2 columns):\n", "post 39999 non-null object\n", "tags 39999 non-null object\n", "dtypes: object(2)\n", "memory usage: 625.1+ KB\n" ] } ], "source": [ "import pandas as pd\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "CSV_COLUMN_NAMES = [\"post\", \"tags\"]\n", "TEST_SIZE = 0.2\n", "\n", "# Parse the local CSV file.\n", "all_data = pd.read_csv(filepath_or_buffer=INPUT_DATA,\n", "                       names=CSV_COLUMN_NAMES,  # list of column names\n", "                       header=1  # row 1 is taken as the header, so CSV rows 0 and 1 are both skipped\n", "                       )\n", "all_data.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Split into Train and Test Sets" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train data length 31999\n", "Train labels length 31999\n" ] } ], "source": [ "train, test = train_test_split(all_data, test_size=TEST_SIZE)\n", "\n", "train_x, train_y = train.pop('post'), train.pop('tags')\n", "test_x, test_y = test.pop('post'), test.pop('tags')\n", "\n", "all_texts = all_data.pop('post')\n", "\n", "print('Train data length ' + str(len(train_x)))\n", "print('Train labels length ' + str(len(train_y)))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "31866 how to multiply a number after each 60 minute ...\n", "12131 priorityqueue 
returning incorrect ordering for...\n", "35742 how can i create a method so that the below co...\n", "22475 long.tryparse and negative values i m writing...\n", "24315 how to add space between two divs in html i ...\n", "Name: post, dtype: object\n", "31866 javascript\n", "12131 java\n", "35742 c#\n", "22475 c#\n", "24315 html\n", "Name: tags, dtype: object\n" ] } ], "source": [ "print(train_x.head())\n", "print(train_y.head())\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "def texts2ids(vocab, series):\n", "    \"\"\"Map each label to an integer id, adding unseen labels to vocab.\"\"\"\n", "    new_series = []\n", "    for line in series:\n", "        word = line.strip()\n", "        if word not in vocab:\n", "            vocab[word] = len(vocab)\n", "        new_series.append(vocab[word])\n", "    return new_series" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'javascript': 0, 'objective-c': 8, 'c': 16, 'asp.net': 6, 'jquery': 19, 'ios': 14, 'java': 1, 'python': 13, 'mysql': 15, 'c++': 4, 'angularjs': 18, 'iphone': 9, 'c#': 2, 'css': 11, 'html': 3, 'sql': 12, 'android': 17, '.net': 10, 'ruby-on-rails': 7, 'php': 5}\n", "[0, 1, 2, 2, 3, 4, 5, 6, 5, 7, 8, 3, 5, 7, 9, 10, 9, 11, 5, 9, 12, 5, 9, 8, 3, 9, 13, 3, 10, 14, 0, 9, 15, 5, 1, 6, 9, 6, 9, 11, 9, 14, 7, 8, 4, 16, 7, 14, 2, 5, 10, 13, 17, 0, 10, 10, 18, 11, 11, 8, 5, 2, 3, 8, 16, 4, 6, 7, 13, 9, 19, 10, 1, 11, 16, 1, 4, 13, 8, 16, 2, 1, 8, 6, 7, 4, 14, 5, 14, 0, 2, 3, 5, 14, 7, 6, 16, 18, 17, 6]\n", "{'javascript': 0, 'objective-c': 8, 'c': 16, 'asp.net': 6, 'jquery': 19, 'ios': 14, 'java': 1, 'python': 13, 'mysql': 15, 'c++': 4, 'angularjs': 18, 'iphone': 9, 'c#': 2, 'css': 11, 'html': 3, 'sql': 12, 'android': 17, '.net': 10, 'ruby-on-rails': 7, 'php': 5}\n", "[0, 16, 8, 19, 18, 0, 19, 11, 18, 0, 2, 6, 4, 14, 6, 13, 6, 1, 1, 17, 7, 3, 1, 5, 0, 0, 5, 13, 10, 1, 10, 5, 17, 15, 7, 12, 9, 6, 8, 16, 17, 11, 2, 19, 0, 11, 3, 3, 16, 19, 14, 11, 17, 2, 0, 14, 18, 15, 15, 7, 5, 11, 12, 8, 2, 18, 12, 5, 
17, 8, 2, 6, 9, 17, 3, 5, 2, 1, 14, 1, 0, 3, 10, 3, 1, 12, 0, 13, 17, 15, 2, 1, 5, 16, 13, 19, 13, 6, 8, 17]\n" ] } ], "source": [ "label_vocab = {}\n", "\n", "train_y_ids = texts2ids(label_vocab, train_y)\n", "print(label_vocab)\n", "print(train_y_ids[0:100])\n", "\n", "test_y_ids = texts2ids(label_vocab, test_y)\n", "print(label_vocab)\n", "print(test_y_ids[0:100])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## First, let's use the built-in tokenizer" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# use all texts to collect the 1000 most frequent words\n", "vectorizer_all = TfidfVectorizer(max_features=1000)\n", "vectorizer_all.fit(all_texts)\n", "\n", "# this vectorizer reuses the precomputed vocabulary of 1000 words\n", "vectorizer = TfidfVectorizer(vocabulary=vectorizer_all.vocabulary_)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "20\n" ] } ], "source": [ "# fit the IDF weights on the training set only, then reuse them for the test set\n", "train_x_tfidf = vectorizer.fit_transform(train_x).toarray()\n", "test_x_tfidf = vectorizer.transform(test_x).toarray()\n", "\n", "import numpy as np\n", "from sklearn import preprocessing\n", "\n", "train_y_onehot = preprocessing.OneHotEncoder().fit_transform(np.array(train_y_ids).reshape(-1,1)).toarray()\n", "test_y_onehot = preprocessing.OneHotEncoder().fit_transform(np.array(test_y_ids).reshape(-1,1)).toarray()\n", "\n", "print(train_y_onehot.shape[1])" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "_________________________________________________________________\n", "Layer (type) Output Shape Param # \n", "=================================================================\n", "dense (Dense) (None, 1000) 1001000 \n", "_________________________________________________________________\n", "batch_normalization (BatchNo (None, 1000) 4000 \n", 
"_________________________________________________________________\n", "activation (Activation) (None, 1000) 0 \n", "_________________________________________________________________\n", "dropout (Dropout) (None, 1000) 0 \n", "_________________________________________________________________\n", "dense_1 (Dense) (None, 100) 100100 \n", "_________________________________________________________________\n", "batch_normalization_1 (Batch (None, 100) 400 \n", "_________________________________________________________________\n", "activation_1 (Activation) (None, 100) 0 \n", "_________________________________________________________________\n", "dropout_1 (Dropout) (None, 100) 0 \n", "_________________________________________________________________\n", "dense_2 (Dense) (None, 20) 2020 \n", "=================================================================\n", "Total params: 1,107,520\n", "Trainable params: 1,105,320\n", "Non-trainable params: 2,200\n", "_________________________________________________________________\n" ] } ], "source": [ "import tensorflow as tf\n", "\n", "C = tf.feature_column\n", "E = tf.estimator\n", "D = tf.data\n", "L = tf.keras.layers\n", "\n", "model = tf.keras.models.Sequential()\n", "model.add(L.Dense(1000, input_shape=train_x_tfidf.shape[1:]))\n", "model.add(L.BatchNormalization())\n", "model.add(L.Activation('tanh'))\n", "model.add(L.Dropout(0.5))\n", "model.add(L.Dense(100))\n", "model.add(L.BatchNormalization())\n", "model.add(L.Activation('tanh'))\n", "model.add(L.Dropout(0.5))\n", "model.add(L.Dense(train_y_onehot.shape[1], activation='softmax'))\n", "model.compile(tf.keras.optimizers.Adam(lr=0.0001), 'categorical_crossentropy', metrics=['accuracy'])\n", "model.summary()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train on 28799 samples, validate on 3200 samples\n", "Epoch 1/8\n", "28799/28799 [==============================] - 13s 439us/step - 
loss: 1.9058 - acc: 0.4548 - val_loss: 0.9705 - val_acc: 0.7406\n", "Epoch 2/8\n", "28799/28799 [==============================] - 4s 146us/step - loss: 1.1049 - acc: 0.6775 - val_loss: 0.8033 - val_acc: 0.7712\n", "Epoch 3/8\n", "28799/28799 [==============================] - 4s 148us/step - loss: 0.9139 - acc: 0.7304 - val_loss: 0.7512 - val_acc: 0.7834\n", "Epoch 4/8\n", "28799/28799 [==============================] - 4s 148us/step - loss: 0.8149 - acc: 0.7575 - val_loss: 0.7232 - val_acc: 0.7869\n", "Epoch 5/8\n", "28799/28799 [==============================] - 4s 147us/step - loss: 0.7544 - acc: 0.7737 - val_loss: 0.7097 - val_acc: 0.7887\n", "Epoch 6/8\n", "28799/28799 [==============================] - 4s 147us/step - loss: 0.7115 - acc: 0.7854 - val_loss: 0.7020 - val_acc: 0.7863\n", "Epoch 7/8\n", "28799/28799 [==============================] - 4s 148us/step - loss: 0.6834 - acc: 0.7904 - val_loss: 0.6946 - val_acc: 0.7878\n", "Epoch 8/8\n", "28799/28799 [==============================] - 4s 147us/step - loss: 0.6485 - acc: 0.8003 - val_loss: 0.6942 - val_acc: 0.7925\n" ] } ], "source": [ "h = model.fit(train_x_tfidf, train_y_onehot, epochs=8, validation_split=0.1)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "31999/31999 [==============================] - 1s 41us/step\n" ] }, { "data": { "text/plain": [ "[0.46620027987434476, 0.8610269070703569]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.evaluate(train_x_tfidf, train_y_onehot)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "8000/8000 [==============================] - 0s 41us/step\n" ] }, { "data": { "text/plain": [ "0.791" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "loss, acc = model.evaluate(test_x_tfidf, test_y_onehot, 
batch_size=32)\n", "acc" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "pd.DataFrame(h.history).plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Now let's use Bling Fire" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "def my_tokenizer(s):\n", "    # split Bling Fire's space-delimited output into a list of tokens\n", "    return text_to_words(s).split(' ')\n", "\n", "vectorizer_all = TfidfVectorizer(tokenizer=my_tokenizer, max_features=1000)\n", "vectorizer_all.fit(all_texts)\n", "\n", "vectorizer = TfidfVectorizer(tokenizer=my_tokenizer, vocabulary=vectorizer_all.vocabulary_)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "20\n" ] } ], "source": [ "# fit the IDF weights on the training set only, then reuse them for the test set\n", "train_x_tfidf = vectorizer.fit_transform(train_x).toarray()\n", "test_x_tfidf = vectorizer.transform(test_x).toarray()\n", "\n", "import numpy as np\n", "from sklearn import preprocessing\n", "\n", "train_y_onehot = preprocessing.OneHotEncoder().fit_transform(np.array(train_y_ids).reshape(-1,1)).toarray()\n", "test_y_onehot = preprocessing.OneHotEncoder().fit_transform(np.array(test_y_ids).reshape(-1,1)).toarray()\n", "\n", "print(train_y_onehot.shape[1])" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "_________________________________________________________________\n", "Layer (type) Output Shape Param # \n", "=================================================================\n", "dense_3 (Dense) (None, 1000) 1001000 \n", "_________________________________________________________________\n", "batch_normalization_2 (Batch (None, 1000) 4000 \n", "_________________________________________________________________\n", "activation_2 (Activation) (None, 1000) 0 \n", "_________________________________________________________________\n", "dropout_2 (Dropout) (None, 1000) 0 \n", "_________________________________________________________________\n", 
"dense_4 (Dense) (None, 100) 100100 \n", "_________________________________________________________________\n", "batch_normalization_3 (Batch (None, 100) 400 \n", "_________________________________________________________________\n", "activation_3 (Activation) (None, 100) 0 \n", "_________________________________________________________________\n", "dropout_3 (Dropout) (None, 100) 0 \n", "_________________________________________________________________\n", "dense_5 (Dense) (None, 20) 2020 \n", "=================================================================\n", "Total params: 1,107,520\n", "Trainable params: 1,105,320\n", "Non-trainable params: 2,200\n", "_________________________________________________________________\n" ] } ], "source": [ "model = tf.keras.models.Sequential()\n", "model.add(L.Dense(1000, input_shape=train_x_tfidf.shape[1:]))\n", "model.add(L.BatchNormalization())\n", "model.add(L.Activation('tanh'))\n", "model.add(L.Dropout(0.5))\n", "model.add(L.Dense(100))\n", "model.add(L.BatchNormalization())\n", "model.add(L.Activation('tanh'))\n", "model.add(L.Dropout(0.5))\n", "model.add(L.Dense(train_y_onehot.shape[1], activation='softmax'))\n", "model.compile(tf.keras.optimizers.Adam(lr=0.0001), 'categorical_crossentropy', metrics=['accuracy'])\n", "model.summary()" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train on 28799 samples, validate on 3200 samples\n", "Epoch 1/8\n", "28799/28799 [==============================] - 5s 171us/step - loss: 1.8345 - acc: 0.4695 - val_loss: 0.8957 - val_acc: 0.7606\n", "Epoch 2/8\n", "28799/28799 [==============================] - 4s 148us/step - loss: 1.0535 - acc: 0.6977 - val_loss: 0.7377 - val_acc: 0.7891\n", "Epoch 3/8\n", "28799/28799 [==============================] - 4s 147us/step - loss: 0.8592 - acc: 0.7475 - val_loss: 0.6831 - val_acc: 0.7994\n", "Epoch 4/8\n", "28799/28799 [==============================] - 4s 
146us/step - loss: 0.7652 - acc: 0.7730 - val_loss: 0.6589 - val_acc: 0.8078\n", "Epoch 5/8\n", "28799/28799 [==============================] - 4s 149us/step - loss: 0.6935 - acc: 0.7915 - val_loss: 0.6430 - val_acc: 0.8128\n", "Epoch 6/8\n", "28799/28799 [==============================] - 4s 148us/step - loss: 0.6538 - acc: 0.8032 - val_loss: 0.6393 - val_acc: 0.8103\n", "Epoch 7/8\n", "28799/28799 [==============================] - 4s 147us/step - loss: 0.6226 - acc: 0.8105 - val_loss: 0.6348 - val_acc: 0.8103\n", "Epoch 8/8\n", "28799/28799 [==============================] - 4s 147us/step - loss: 0.5974 - acc: 0.8166 - val_loss: 0.6330 - val_acc: 0.8103\n" ] } ], "source": [ "h = model.fit(train_x_tfidf, train_y_onehot, epochs=8, validation_split=0.1)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "31999/31999 [==============================] - 1s 42us/step\n" ] }, { "data": { "text/plain": [ "[0.4210543013810404, 0.8734335448075885]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.evaluate(train_x_tfidf, train_y_onehot)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "8000/8000 [==============================] - 0s 42us/step\n" ] }, { "data": { "text/plain": [ "0.813125" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "loss, acc = model.evaluate(test_x_tfidf, test_y_onehot, batch_size=32)\n", "acc" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "pd.DataFrame(h.history).plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "\n", "In this run the classifier reaches 79.1% test accuracy with the built-in tokenizer and 81.3% with Bling Fire. Tokenization quality therefore should not be underestimated. Correct and consistent tokenization matters even more when the model uses pre-trained embeddings.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 2 }