{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "302e4e90",
   "metadata": {},
   "source": [
    "# Tutorial"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "127e15ad",
   "metadata": {},
   "source": [
    "Here's a more full-featured walkthrough of how to use all of `eyecite`'s functionality. We'll (1) **clean** the text of a sample opinion, (2) **extract** citations from that cleaned text, (3) **aggregate** those citations into groups based on their referents, and (4) **annotate** the original text with hypothetical URLs linking to each citation's referent."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0a7cb31e",
   "metadata": {},
   "source": [
    "First, import the functions and models we'll need:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "b494ee53",
   "metadata": {},
   "outputs": [],
   "source": [
    "from eyecite import get_citations, clean_text, resolve_citations, annotate_citations\n",
    "from eyecite.models import FullCaseCitation, Resource\n",
    "from eyecite.resolve import resolve_full_citation\n",
    "from eyecite.tokenizers import HyperscanTokenizer\n",
    "\n",
    "import requests"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bc4f74a3",
   "metadata": {},
   "source": [
    "For this tutorial, we'll use the opinion from the Supreme Court case *Citizens United v. Federal Election Com'n* (2010), 558 U.S. 310. Let's pull it from the Courtlistener API."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "a9ea9f3c",
   "metadata": {},
   "outputs": [],
   "source": [
    "opinion_url = 'https://www.courtlistener.com/api/rest/v3/opinions/1741/'\n",
    "opinion_text = requests.get(opinion_url).json()['plain_text']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "99abacba",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(Slip Opinion)              OCTOBER TERM, 2009                                       1\r\n",
      "\r\n",
      "                                       Syllabus\r\n",
      "\r\n",
      "         NOTE: Where it is feasible, a syllabus (headnote) will be released, as is\r\n",
      "       being done in connection with this case, at the time the opinion is issued.\r\n",
      "       The syllabus constitutes no part of the opinion of the Court but has been\r\n",
      "       prepared by the Reporter of Decisions for the convenience of the reader.\r\n",
      "       See United States v. Detroit Timber & Lumber Co., 200 U. S. 321, 337.\r\n",
      "\r\n",
      "\r\n",
      "SUPREME COURT OF THE UNITED STATES\r\n",
      "\r\n",
      "                                       Syllabus\r\n",
      "\r\n",
      "         CITIZENS UNITED v. FEDERAL ELECTION\r\n",
      "\r\n",
      "                     COMMISSION \r\n",
      "\r\n",
      "\r\n",
      "APPEAL FROM THE UNITED STATES DISTRICT COURT FOR THE\r\n",
      "               DISTRICT OF COLUMBIA\r\n",
      "\r\n",
      "No. 08–205.      Argued March 24, 2009—Reargued September 9, 2009––\r\n",
      "                        Decided January 21, 2010\r\n",
      "As amended by §203 of the Bipartisan Campaign Reform Act of\n"
     ]
    }
   ],
   "source": [
    "print(opinion_text[:1000])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "24417912",
   "metadata": {},
   "source": [
    "### Cleaning"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5dc0d7f8",
   "metadata": {},
   "source": [
    "Note that this text is broken up by newline characters, and the whitespace is uneven. To deal with this, we first have to clean the text to get it ready for citation extraction, which we can do by calling `clean_text()`. This function expects two arguments: The first is the text to be cleaned, and the second is an iterable of cleaning utilities to run. We have several built in utilities for removing HTML tags, whitespace, and underscores, *inter alia*. (See the [API documentation](https://freelawproject.github.io/eyecite/clean.html) for a full list.) Here, because we grabbed the `plain_text` variable from the API, it shouldn't contain any HTML tags, but let's remove those too just for demonstrative purposes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "da074163",
   "metadata": {},
   "outputs": [],
   "source": [
    "cleaned_text = clean_text(opinion_text, ['html', 'all_whitespace'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "56477cb3",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(Slip Opinion) OCTOBER TERM, 2009 1 Syllabus NOTE: Where it is feasible, a syllabus (headnote) will be released, as is being done in connection with this case, at the time the opinion is issued. The syllabus constitutes no part of the opinion of the Court but has been prepared by the Reporter of Decisions for the convenience of the reader. See United States v. Detroit Timber & Lumber Co., 200 U. S. 321, 337. SUPREME COURT OF THE UNITED STATES Syllabus CITIZENS UNITED v. FEDERAL ELECTION COMMISSION APPEAL FROM THE UNITED STATES DISTRICT COURT FOR THE DISTRICT OF COLUMBIA No. 08–205. Argued March 24, 2009—Reargued September 9, 2009–– Decided January 21, 2010 As amended by §203 of the Bipartisan Campaign Reform Act of 2002 (BCRA), federal law prohibits corporations and unions from using their general treasury funds to make independent expenditures for speech that is an “electioneering communication” or for speech that expressly advocates the election or defeat of a candidate. 2 U. S. C. §\n"
     ]
    }
   ],
   "source": [
    "print(cleaned_text[:1000])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3718fa06",
   "metadata": {},
   "source": [
    "### Extracting"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0e738743",
   "metadata": {},
   "source": [
    "Next, we'll extract the citations using a custom tokenizer. Unlike the default tokenizer, here we'll use our hyperscan tokenizer for much faster extraction, which works by automatically pre-compiling and caching a regular expression database on first use. Because of this one-time pre-compilation stage, the first use of this tokenizer is slow:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "045ea5b1",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 14.9 s, sys: 301 ms, total: 15.2 s\n",
      "Wall time: 15.7 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "tokenizer = HyperscanTokenizer(cache_dir='.test_cache')\n",
    "citations = get_citations(cleaned_text, tokenizer=tokenizer)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4fde8411",
   "metadata": {},
   "source": [
    "However, so long as the cache folder (here `.test_cache`) persists, every future call to `get_citations()` using the hyperscan tokenizer will be super fast. E.g.:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "f137f51b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 183 ms, sys: 5.74 ms, total: 189 ms\n",
      "Wall time: 198 ms\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "citations = get_citations(cleaned_text, tokenizer=tokenizer)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f309225a",
   "metadata": {},
   "source": [
    "Now, let's take a brief look at the citations we extracted:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "0afcf38f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Extracted 1005 citations.\n",
      "\n",
      "First citation:\n",
      " FullCaseCitation('200 U. S. 321', groups={'volume': '200', 'reporter': 'U. S.', 'page': '321'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite='337', year=None, court='scotus', plaintiff='States', defendant='Detroit Timber & Lumber Co.', extra=None))\n"
     ]
    }
   ],
   "source": [
    "print(f'Extracted {len(citations)} citations.\\n')\n",
    "print(f'First citation:\\n {citations[0]}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7da6199e",
   "metadata": {},
   "source": [
    "As you can see, we've extracted data about the citation's volume, reporter, page number, pincite page, and parties. If the data had been present in the text, we would have also grabbed the citation's year, its accompanying parenthetical text, and any \"extra\" information."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "331ad723",
   "metadata": {},
   "source": [
    "### Aggregating"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6898051e",
   "metadata": {},
   "source": [
    "This opinion contains more than 1000 citations, but these are not all full citations like `123 XYZ 456`. In addition to these more obvious citations, `eyecite` will also find short-form citations such as \"id\" and \"supra\". So, while there are 1005 citations total, the count of unique opinions cited is much fewer. Let's aggregate all the short form citations together by referent:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "59839eba",
   "metadata": {},
   "outputs": [],
   "source": [
    "resolutions = resolve_citations(citations)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "95ec80c7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Resolved citations into 176 groups.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(f'Resolved citations into {len(resolutions)} groups.\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3bfd8417",
   "metadata": {},
   "source": [
    "Let's look at one group as an example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "b8ea15ab",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "This case is cited lots of times:\n",
      "FullCaseCitation('479 U. S. 238', groups={'volume': '479', 'reporter': 'U. S.', 'page': '238'}, metadata=FullCaseCitation.Metadata(parenthetical='MCFL', pin_cite='249', year='1986', court='scotus', plaintiff='Comm’n', defendant='Massachusetts Citizens for Life, Inc.', extra=None))\n",
      "\n",
      "23 times, in fact.\n",
      "\n",
      "Here are all of its citations:\n",
      "[FullCaseCitation('479 U. S. 238', groups={'volume': '479', 'reporter': 'U. S.', 'page': '238'}, metadata=FullCaseCitation.Metadata(parenthetical='MCFL', pin_cite='249', year='1986', court='scotus', plaintiff='Comm’n', defendant='Massachusetts Citizens for Life, Inc.', extra=None)), ShortCaseCitation('479 U. S., at 257', groups={'volume': '479', 'reporter': 'U. S.', 'page': '257'}, metadata=ShortCaseCitation.Metadata(parenthetical=None, pin_cite='257', year=None, court='scotus', antecedent_guess='MCFL')), ShortCaseCitation('479 U. S., at 260', groups={'volume': '479', 'reporter': 'U. S.', 'page': '260'}, metadata=ShortCaseCitation.Metadata(parenthetical=None, pin_cite='260', year=None, court='scotus', antecedent_guess='MCFL')), ShortCaseCitation('479 U. S., at 262', groups={'volume': '479', 'reporter': 'U. S.', 'page': '262'}, metadata=ShortCaseCitation.Metadata(parenthetical=None, pin_cite='262', year=None, court='scotus', antecedent_guess='MCFL')), FullCaseCitation('479 U. S. 238', groups={'volume': '479', 'reporter': 'U. S.', 'page': '238'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite=None, year='1986', court='scotus', plaintiff='Comm’n', defendant='Massachusetts Citizens for Life, Inc.', extra=None)), FullCaseCitation('479 U. S. 238', groups={'volume': '479', 'reporter': 'U. S.', 'page': '238'}, metadata=FullCaseCitation.Metadata(parenthetical='MCFL', pin_cite=None, year='1986', court='scotus', plaintiff='FEC', defendant='Massachusetts Citizens for Life, Inc.', extra=None)), FullCaseCitation('479 U. S. 238', groups={'volume': '479', 'reporter': 'U. S.', 'page': '238'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite=None, year=None, court='scotus', plaintiff=None, defendant=None, extra=None)), FullCaseCitation('479 U. S. 238', groups={'volume': '479', 'reporter': 'U. S.', 'page': '238'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite=None, year=None, court='scotus', plaintiff=None, defendant=None, extra=None)), IdCitation('id.,', metadata=IdCitation.Metadata(parenthetical='quoting NRWC, 459 U. S., at 209–210', pin_cite='at 256')), ShortCaseCitation('479 U. S., at 257', groups={'volume': '479', 'reporter': 'U. S.', 'page': '257'}, metadata=ShortCaseCitation.Metadata(parenthetical='internal quotation marks omitted', pin_cite='257', year=None, court='scotus', antecedent_guess=None)), IdCitation('id.,', metadata=IdCitation.Metadata(parenthetical='internal quotation marks omitted', pin_cite='at 260')), IdCitation('id.,', metadata=IdCitation.Metadata(parenthetical=None, pin_cite='at 257')), ShortCaseCitation('479 U. S., at 268', groups={'volume': '479', 'reporter': 'U. S.', 'page': '268'}, metadata=ShortCaseCitation.Metadata(parenthetical='opinion of Rehnquist, C. J.', pin_cite='268', year=None, court='scotus', antecedent_guess=None)), ShortCaseCitation('479 U. S., at 257', groups={'volume': '479', 'reporter': 'U. S.', 'page': '257'}, metadata=ShortCaseCitation.Metadata(parenthetical=None, pin_cite='257', year=None, court='scotus', antecedent_guess='MCFL')), ShortCaseCitation('479 U. S., at 264', groups={'volume': '479', 'reporter': 'U. S.', 'page': '264'}, metadata=ShortCaseCitation.Metadata(parenthetical=None, pin_cite='264', year=None, court='scotus', antecedent_guess=None)), IdCitation('Ibid.', metadata=IdCitation.Metadata(parenthetical=None, pin_cite=None)), IdCitation('Ibid.', metadata=IdCitation.Metadata(parenthetical=None, pin_cite=None)), ShortCaseCitation('479 U. S., at 259', groups={'volume': '479', 'reporter': 'U. S.', 'page': '259'}, metadata=ShortCaseCitation.Metadata(parenthetical=None, pin_cite='259, n. 12', year=None, court='scotus', antecedent_guess='MCFL')), ShortCaseCitation('479 U. S., at 258', groups={'volume': '479', 'reporter': 'U. S.', 'page': '258'}, metadata=ShortCaseCitation.Metadata(parenthetical=None, pin_cite='258', year=None, court='scotus', antecedent_guess='MCFL')), ShortCaseCitation('479 U. S., at 258', groups={'volume': '479', 'reporter': 'U. S.', 'page': '258'}, metadata=ShortCaseCitation.Metadata(parenthetical=None, pin_cite='258', year=None, court='scotus', antecedent_guess='MCFL')), FullCaseCitation('479 U. S. 238', groups={'volume': '479', 'reporter': 'U. S.', 'page': '238'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite='264', year='1986', court='scotus', plaintiff=None, defendant='MCFL', extra=None)), ShortCaseCitation('479 U. S., at 259', groups={'volume': '479', 'reporter': 'U. S.', 'page': '259'}, metadata=ShortCaseCitation.Metadata(parenthetical=None, pin_cite='259', year=None, court='scotus', antecedent_guess='MCFL')), ShortCaseCitation('479 U. S., at 258', groups={'volume': '479', 'reporter': 'U. S.', 'page': '258'}, metadata=ShortCaseCitation.Metadata(parenthetical=None, pin_cite='258', year=None, court='scotus', antecedent_guess='MCFL'))]\n"
     ]
    }
   ],
   "source": [
    "k = list(resolutions.keys())[10]\n",
    "\n",
    "print(f'This case is cited lots of times:\\n{k.citation}\\n')\n",
    "print(f'{len(resolutions[k])} times, in fact.\\n')\n",
    "\n",
    "print(f'Here are all of its citations:\\n{resolutions[k]}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "135ba838",
   "metadata": {},
   "source": [
    "On its own, `eyecite` does a pretty good job of resolving citations, but if you want to perform more sophisticated resolution (e.g., by incorporating external knowledge about parallel citations), you'll have to pass a custom resolution function to `resolve_citations()`. See [the README](https://github.com/freelawproject/eyecite#resolving-citations) and the [API Documentation](https://freelawproject.github.io/eyecite/resolve.html) for more information about doing this."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d33aedfb",
   "metadata": {},
   "source": [
    "### Annotating"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "34e35f76",
   "metadata": {},
   "source": [
    "Next, let's prepare annotations for each of our extracted citations, now grouped in clusters. An annotation is text to insert back into the `cleaned_text`, like `((<start offset>, <end offset>), <before text>, <after text>)`. The positional offsets for each citation can be easily retrieved by calling each citation's `span()` method. Here, for simplicity, we'll plan to annotate each citation with a URL to an API that will redirect the user appropriately:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "2e2743ea",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[]\n"
     ]
    }
   ],
   "source": [
    "annotations = []\n",
    "print(resolutions['0'])\n",
    "for resource, cites in resolutions.items():\n",
    "    if type(resource) is Resource:\n",
    "        # add bespoke URL to each citation:\n",
    "        url = f\"/some_api?cite={resource.citation.matched_text()}\"\n",
    "        for citation in cites:\n",
    "            annotations.append((citation.span(), f\"<a href='{url}'>\", f\"</a>\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "efd0a1b4",
   "metadata": {},
   "source": [
    "This is what one of our annotations looks like:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "72062858",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "((392, 405), \"<a href='/some_api?cite=200 U. S. 321'>\", '</a>')\n"
     ]
    }
   ],
   "source": [
    "print(annotations[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5f924a13",
   "metadata": {},
   "source": [
    "We now have the annotations properly prepared, but recall that we *cleaned* our original opinion text before passing it to `get_citations()`. Thus, to insert the annotations into our *original* text, we need to pass `source_text=opinion_text` into `annotate_citations()`, which will intelligently adjust the annotation positions using the `diff-match-patch` library:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "5514acfc",
   "metadata": {},
   "outputs": [],
   "source": [
    "annotated_text = annotate_citations(cleaned_text, annotations, source_text=opinion_text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "fa07867a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(Slip Opinion)              OCTOBER TERM, 2009                                       1\r\n",
      "\r\n",
      "                                       Syllabus\r\n",
      "\r\n",
      "         NOTE: Where it is feasible, a syllabus (headnote) will be released, as is\r\n",
      "       being done in connection with this case, at the time the opinion is issued.\r\n",
      "       The syllabus constitutes no part of the opinion of the Court but has been\r\n",
      "       prepared by the Reporter of Decisions for the convenience of the reader.\r\n",
      "       See United States v. Detroit Timber & Lumber Co., <a href='/some_api?cite=200 U. S. 321'>200 U. S. 321</a>, 337.\r\n",
      "\r\n",
      "\r\n",
      "SUPREME COURT OF THE UNITED STATES\r\n",
      "\r\n",
      "                                       Syllabus\r\n",
      "\r\n",
      "         CITIZENS UNITED v. FEDERAL ELECTION\r\n",
      "\r\n",
      "                     COMMISSION \r\n",
      "\r\n",
      "\r\n",
      "APPEAL FROM THE UNITED STATES DISTRICT COURT FOR THE\r\n",
      "               DISTRICT OF COLUMBIA\r\n",
      "\r\n",
      "No. 08–205.      Argued March 24, 2009—Reargued September 9, 2009––\r\n",
      "                        Decided January 21, 2010\r\n",
      "As amended by §2\n"
     ]
    }
   ],
   "source": [
    "print(annotated_text[:1000])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ff77e44b",
   "metadata": {},
   "source": [
    "Nice!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3ea3af4c",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}