{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# News Categorization using Multinomial Naive Bayes\n", "## by [Andrés Soto](https://www.linkedin.com/in/andres-soto-villaverde-36198a5/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once upon a time, while searching the internet, I discovered [this site](https://www.kaggle.com/uciml/news-aggregator-dataset), where I found this challenge: \n", "* Using the News Aggregator Data Set, can we predict the category (business, entertainment, etc.) of a news article given only its headline? \n", "\n", "So I decided to try it using the Multinomial Naive Bayes method." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The News Aggregator Data Set comes from the UCI Machine Learning Repository. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Lichman, M. (2013). [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This specific dataset can be found in the UCI ML Repository at [this URL](http://archive.ics.uci.edu/ml/datasets/News+Aggregator)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This dataset contains headlines, URLs, and categories for 422,937 news stories collected by a web aggregator between March 10th, 2014 and August 10th, 2014. News categories in this dataset are labelled:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Label | Category\t| News \n", "-------|------------|----------\n", "b\t| business\t|