{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Polars Example\n", "This is a small example using `nu_plugin_polars` to show that dataframes work well with this kernel. `nu_plugin_polars` is based on the dataframe library [Polars](https://pola.rs). It is written entirely in Rust (just like this kernel) and usually processes dataframes much faster than [Pandas](https://pandas.pydata.org). Keep in mind that we lose some performance to communication from your Jupyter client to the kernel, to the plugin, and all the way back. It should still be fast to use.\n", "\n", "To start using the plugin, you need to have it installed. This can easily be done via `cargo install nu_plugin_polars` on a machine that has the full Rust toolchain installed. Then you need to add the plugin to your plugin registry. If you have `nushell` installed (I guess so, because you are using this kernel), you simply run `plugin add` with the path to the installed plugin binary. This adds it to your plugin registry, and this kernel will be able to pick it up. If you don't have `nushell` on your machine (kinda weird, not gonna lie), you can also run the `plugin add` command in a notebook. For further help, try running `plugin --help` or check the [Plugins section](https://www.nushell.sh/book/plugins.html) on the nushell website." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plugins in `nushell` are automatically loaded from your registry when you open the shell. However, for this kernel they need to be loaded manually. This ensures that other users of your notebook understand where some of the commands you're using come from." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# load the `polars` plugin to use its commands\n", "plugin use polars" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "You also need some data for this notebook. This notebook uses the [New Zealand business demography](https://www.stats.govt.nz/assets/Uploads/New-Zealand-business-demography-statistics/New-Zealand-business-demography-statistics-At-February-2020/Download-data/Geographic-units-by-industry-and-statistical-area-2000-2020-descending-order-CSV.zip) dataset. You can load it via the adjacent `polars-data.nu` file. If it ran successfully, the `ls data` output should list the file `Data7602DescendingYearOrder.csv`." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "|name|type|size|modified|\n", "|-|-|-|-|\n", "|data\\Data7602DescendingYearOrder.csv|file|129.3 MiB|Wed, 5 Oct 2022 14:44:48 +0200 (2 years ago)|\n", "|data\\Metadata for Data7602DescendingYearOrder.xlsx|file|106.1 KiB|Thu, 20 Oct 2022 10:41:12 +0200 (2 years ago)|\n", "|data\\nz-stats.zip|file|22.9 MiB|Wed, 15 May 2024 12:30:31 +0200 (4 months ago)|" ], "text/plain": [ "name: data\\Data7602DescendingYearOrder.csv, type: file, size: 129.3 MiB, modified: Wed, 5 Oct 2022 14:44:48 +0200 (2 years ago)\r\n", "name: data\\Metadata for Data7602DescendingYearOrder.xlsx, type: file, size: 106.1 KiB, modified: Thu, 20 Oct 2022 10:41:12 +0200 (2 years ago)\r\n", "name: data\\nz-stats.zip, type: file, size: 22.9 MiB, modified: Wed, 15 May 2024 12:30:31 +0200 (4 months ago)" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ls data | nuju display md" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, then let's load the CSV file and check that the plugin loaded it."
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "|key|created|columns|rows|type|estimated_size|span_contents|span_start|span_end|reference_count|\n", "|-|-|-|-|-|-|-|-|-|-|\n", "|208664bf-188b-4d93-9079-1eae3d7a6811|Fri, 27 Sep 2024 13:26:24 +0200 (now)|5|5985364|LazyFrame|194.0 MiB|polars open|6394|6405|1|\n" ], "text/plain": [ "key: 208664bf-188b-4d93-9079-1eae3d7a6811, created: Fri, 27 Sep 2024 13:26:24 +0200 (now), columns: 5, rows: 5985364, type: LazyFrame, estimated_size: 194.0 MiB, span_contents: polars open, span_start: 6394, span_end: 6405, reference_count: 1" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "let df = polars open data/Data7602DescendingYearOrder.csv\n", "polars store-ls | nuju display md" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also check the shape and schema of the dataset." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "application/json": [ { "columns": 5, "index": 0, "rows": 5985364 } ], "text/html": [ "[{index: 0, rows: 5985364, columns: 5}]" ], "text/markdown": [ "[{index: 0, rows: 5985364, columns: 5}]\n" ], "text/plain": [ "index: 0, rows: 5985364, columns: 5" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/json": { "Area": "str", "anzsic06": "str", "ec_count": "i64", "geo_count": "i64", "year": "i64" }, "text/csv": [ "anzsic06,Area,year,geo_count,ec_count\n", "str,str,i64,i64,i64\n" ], "text/html": [ "<table><thead><tr><th>anzsic06</th><th>Area</th><th>year</th><th>geo_count</th><th>ec_count</th></tr></thead><tbody><tr><td>str</td><td>str</td><td>i64</td><td>i64</td><td>i64</td></tr></tbody></table>" ], "text/markdown": [ "|anzsic06|Area|year|geo_count|ec_count|\n", "|-|-|-|-|-|\n", "|str|str|i64|i64|i64|\n" ], "text/plain": [ "anzsic06: str\r\n", "Area: str\r\n", "year: i64\r\n", "geo_count: i64\r\n", "ec_count: i64" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "$df | polars shape | nuju print\n", "$df | polars schema | nuju print" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And let's get a sample of the dataset." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "|anzsic06|Area|year|geo_count|ec_count|\n", "|-|-|-|-|-|\n", "|S942|A183400|2014|3|0|\n", "|E322|A140700|2001|3|0|\n", "|J59|R13|2007|57|270|\n", "|C20|A346700|2014|0|6|\n", "|E30|A255000|2002|0|18|" ], "text/plain": [ "anzsic06: S942, Area: A183400, year: 2014, geo_count: 3, ec_count: 0\r\n", "anzsic06: E322, Area: A140700, year: 2001, geo_count: 3, ec_count: 0\r\n", "anzsic06: J59, Area: R13, year: 2007, geo_count: 57, ec_count: 270\r\n", "anzsic06: C20, Area: A346700, year: 2014, geo_count: 0, ec_count: 6\r\n", "anzsic06: E30, Area: A255000, year: 2002, geo_count: 0, ec_count: 18" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "$df | polars sample -n 5 | polars into-nu | nuju display md" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, let's run some operations on the dataset. We group the data by year, sum the `geo_count` column, and sort the result by year. Then we convert the result into a `nu` value and pipe it to `series bar` to create a bar chart."
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "<svg xmlns=\"http://www.w3.org/2000/svg\"><title>Bar chart of summed geo_count per year, 2000 to 2022; y axis from 0 to 14000000</title></svg>" ], "text/plain": [ "series: kind: bar, series: x: 2000, y: 9109038, x: 2001, y: 9036159, x: 2002, y: 9129798, x: 2003, y: 9459999, x: 2004, y: 10275174, x: 2005, y: 10726932, x: 2006, y: 11109930, x: 2007, y: 11351079, x: 2008, y: 11595300, x: 2009, y: 11680239, x: 2010, y: 11517015, x: 2011, y: 11526618, x: 2012, y: 11513895, x: 2013, y: 11590815, x: 2014, y: 12009198, x: 2015, y: 12310005, x: 2016, y: 12559281, x: 2017, y: 12904980, x: 2018, y: 13046571, x: 2019, y: 13325616, x: 2020, y: 13582815, x: 2021, y: 13682886, x: 2022, y: 14338263, color: r: 30, g: 100, b: 20, a: 1, filled: true, stroke_width: 1\r\n", "width: 1000\r\n", "height: 400\r\n", "background: \r\n", "caption: \r\n", "margin: 10, 10, 0, 25\r\n", "label_area: 0, 0, 35, 35\r\n", "x_range: \r\n", "y_range: " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "$df \n", "| polars group-by year \n", "| polars agg (polars col geo_count | polars sum)\n", "| polars sort-by year \n", "| polars into-nu\n", "| rename x y\n", "| series bar -c [30, 100, 20]\n", "| chart 2d -W 1000 -m [10, 10, 0, 25]\n", "| nuju display svg" ] } ], "metadata": { "kernelspec": { "display_name": "Nushell", "language": "nushell", "name": "nu" }, "language_info": { "file_extension": ".nu", "mimetype": "text/nu", "name": "nushell", "version": "0.97" } }, "nbformat": 4, "nbformat_minor": 2 }