# Tutorial

Here's a more full-featured walkthrough of how to use all of `eyecite`'s functionality. We'll (1) **clean** the text of a sample opinion, (2) **extract** citations from that cleaned text, (3) **aggregate** those citations into groups based on their referents, and (4) **annotate** the original text with hypothetical URLs linking to each citation's referent.

First, import the functions and models we'll need:

In [1]:
from eyecite import get_citations, clean_text, resolve_citations, annotate_citations
from eyecite.models import FullCaseCitation, Resource
from eyecite.resolve import resolve_full_citation
from eyecite.tokenizers import HyperscanTokenizer

import requests

For this tutorial, we'll use the opinion from the Supreme Court case *Citizens United v. Federal Election Com'n* (2010), 558 U.S. 310. Let's pull it from the CourtListener API.

In [2]:
opinion_url = 'https://www.courtlistener.com/api/rest/v3/opinions/1741/'
opinion_text = requests.get(opinion_url).json()['plain_text']

In [3]:
print(opinion_text[:1000])

(Slip Opinion) OCTOBER TERM, 2009 1

 Syllabus

 NOTE: Where it is feasible, a syllabus (headnote) will be released, as is
 being done in connection with this case, at the time the opinion is issued.
 The syllabus constitutes no part of the opinion of the Court but has been
 prepared by the Reporter of Decisions for the convenience of the reader.
 See United States v. Detroit Timber & Lumber Co., 200 U. S. 321, 337.


SUPREME COURT OF THE UNITED STATES

 Syllabus

 CITIZENS UNITED v. FEDERAL ELECTION

 COMMISSION 


APPEAL FROM THE UNITED STATES DISTRICT COURT FOR THE
 DISTRICT OF COLUMBIA

No. 08–205. Argued March 24, 2009—Reargued September 9, 2009––
 Decided January 21, 2010
As amended by §203 of the Bipartisan Campaign Reform Act of


### Cleaning

Note that this text is broken up by newline characters, and the whitespace is uneven. To get it ready for citation extraction, we first clean it by calling `clean_text()`. This function expects two arguments: the text to be cleaned, and an iterable of cleaning utilities to run. There are several built-in utilities for removing HTML tags, whitespace, and underscores, *inter alia*. (See the [API documentation](https://freelawproject.github.io/eyecite/clean.html) for a full list.) Here, because we grabbed the `plain_text` field from the API, the text shouldn't contain any HTML tags, but let's remove those too for demonstration purposes:

In [4]:
cleaned_text = clean_text(opinion_text, ['html', 'all_whitespace'])

In [5]:
print(cleaned_text[:1000])

(Slip Opinion) OCTOBER TERM, 2009 1 Syllabus NOTE: Where it is feasible, a syllabus (headnote) will be released, as is being done in connection with this case, at the time the opinion is issued. The syllabus constitutes no part of the opinion of the Court but has been prepared by the Reporter of Decisions for the convenience of the reader. See United States v. Detroit Timber & Lumber Co., 200 U. S. 321, 337. SUPREME COURT OF THE UNITED STATES Syllabus CITIZENS UNITED v. FEDERAL ELECTION COMMISSION APPEAL FROM THE UNITED STATES DISTRICT COURT FOR THE DISTRICT OF COLUMBIA No. 08–205. Argued March 24, 2009—Reargued September 9, 2009–– Decided January 21, 2010 As amended by §203 of the Bipartisan Campaign Reform Act of 2002 (BCRA), federal law prohibits corporations and unions from using their general treasury funds to make independent expenditures for speech that is an “electioneering communication” or for speech that expressly advocates the election or defeat of a candidate. 2 U. S. C. §
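To make the idea of a cleaning pipeline concrete, here is a minimal standard-library sketch of running a list of cleaning steps in order. The helper names (`collapse_whitespace`, `strip_underscores`, `run_cleaners`) are hypothetical and written only for illustration. They approximate what whitespace and underscore cleaning do and are not `eyecite`'s own implementation:

```python
import re
from typing import Callable, Iterable


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    return re.sub(r"\s+", " ", text)


def strip_underscores(text: str) -> str:
    """Remove runs of two or more underscores, such as blank form fields."""
    return re.sub(r"__+", "", text)


def run_cleaners(text: str, steps: Iterable[Callable[[str], str]]) -> str:
    """Apply each cleaning step in order, feeding each result into the next step."""
    for step in steps:
        text = step(text)
    return text


sample = "See United States v.\n   Detroit Timber & Lumber Co.,\n 200 U. S. 321, 337."
print(run_cleaners(sample, [collapse_whitespace]))
```

The order of the steps matters: removing underscores can leave doubled spaces behind, so whitespace collapsing is usually the last step.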

### Extracting

Next, we'll extract the citations. Rather than the default tokenizer, we'll pass in the hyperscan tokenizer, which is much faster. It works by pre-compiling a regular expression database and caching it on first use. Because of this one-time pre-compilation stage, the first use of this tokenizer is slow:

In [7]:
%%time
tokenizer = HyperscanTokenizer(cache_dir='.test_cache')
citations = get_citations(cleaned_text, tokenizer=tokenizer)

CPU times: user 14.9 s, sys: 301 ms, total: 15.2 s
Wall time: 15.7 s


However, so long as the cache folder (here `.test_cache`) persists, every future call to `get_citations()` using the hyperscan tokenizer will be super fast. E.g.:

In [8]:
%%time
citations = get_citations(cleaned_text, tokenizer=tokenizer)

CPU times: user 183 ms, sys: 5.74 ms, total: 189 ms
Wall time: 198 ms
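The gap between the two timings comes from the build-once, reuse-afterwards pattern. The following is a generic standard-library sketch of that pattern, assuming a pickled file in a cache directory. The function `load_or_build` and the `regex_db` name are hypothetical, and the hyperscan tokenizer's real cache format and logic are different:

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable


def load_or_build(cache_dir: str, name: str, build: Callable[[], Any]) -> Any:
    """Return the cached value if it exists on disk; otherwise build and store it."""
    path = Path(cache_dir) / f"{name}.pickle"
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    value = build()  # the slow, one-time step
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(value, f)
    return value


build_calls = []


def expensive_build() -> dict:
    """Stand-in for a slow compilation step; records each time it runs."""
    build_calls.append(1)
    return {"patterns": [r"\d+ U\. ?S\. \d+"]}


with tempfile.TemporaryDirectory() as cache_dir:
    first = load_or_build(cache_dir, "regex_db", expensive_build)
    second = load_or_build(cache_dir, "regex_db", expensive_build)

print(f"build ran {len(build_calls)} time(s)")
```

As with `.test_cache` above, the speed-up only lasts as long as the cache directory does. Deleting it forces the slow build again.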


Now, let's take a brief look at the citations we extracted:

In [9]:
print(f'Extracted {len(citations)} citations.\n')
print(f'First citation:\n {citations[0]}')

Extracted 1005 citations.

First citation:
 FullCaseCitation('200 U. S. 321', groups={'volume': '200', 'reporter': 'U. S.', 'page': '321'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite='337', year=None, court='scotus', plaintiff='States', defendant='Detroit Timber & Lumber Co.', extra=None))


As you can see, we've extracted data about the citation's volume, reporter, page number, pincite page, and parties. If the data had been present in the text, we would have also grabbed the citation's year, its accompanying parenthetical text, and any "extra" information.
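For intuition about where fields like `volume`, `reporter`, `page`, and the pincite come from, here is a toy sketch using a single hand-written regular expression that only recognizes `U. S.` citations. The pattern and variable names are assumptions for illustration. `eyecite` itself relies on a much larger database of reporter patterns and does considerably more work to find party names and other metadata:

```python
import re

# Toy pattern: "<volume> U. S. <page>" with an optional ", <pin cite>" after it.
CITE = re.compile(
    r"(?P<volume>\d+) (?P<reporter>U\. ?S\.) (?P<page>\d+)(?:, (?P<pin_cite>\d+))?"
)

text = "See United States v. Detroit Timber & Lumber Co., 200 U. S. 321, 337."
match = CITE.search(text)

groups = {key: match.group(key) for key in ("volume", "reporter", "page")}
pin_cite = match.group("pin_cite")

# Character offsets of the citation itself, excluding the pin cite.
start, end = match.start(), match.end("page")

print(groups, pin_cite, text[start:end])
```

The `(start, end)` pair plays the same role as the offsets returned by a citation's `span()` method, which we'll rely on when annotating.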

### Aggregating

This opinion contains more than 1,000 citations, but not all of them are full citations like `123 XYZ 456`. In addition to these more obvious citations, `eyecite` also finds short-form citations such as "id" and "supra". So, while there are 1005 citations in total, the number of unique opinions cited is far smaller. Let's group the citations by referent, so that each short-form citation is collected with the full citation it points to:

In [10]:
resolutions = resolve_citations(citations)

In [11]:
print(f'Resolved citations into {len(resolutions)} groups.\n')

Resolved citations into 176 groups.



Let's look at one group as an example:

In [12]:
k = list(resolutions.keys())[10]

print(f'This case is cited lots of times:\n{k.citation}\n')
print(f'{len(resolutions[k])} times, in fact.\n')

print(f'Here are all of its citations:\n{resolutions[k]}')

This case is cited lots of times:
FullCaseCitation('479 U. S. 238', groups={'volume': '479', 'reporter': 'U. S.', 'page': '238'}, metadata=FullCaseCitation.Metadata(parenthetical='MCFL', pin_cite='249', year='1986', court='scotus', plaintiff='Comm’n', defendant='Massachusetts Citizens for Life, Inc.', extra=None))

23 times, in fact.

Here are all of its citations:
[FullCaseCitation('479 U. S. 238', groups={'volume': '479', 'reporter': 'U. S.', 'page': '238'}, metadata=FullCaseCitation.Metadata(parenthetical='MCFL', pin_cite='249', year='1986', court='scotus', plaintiff='Comm’n', defendant='Massachusetts Citizens for Life, Inc.', extra=None)), ShortCaseCitation('479 U. S., at 257', groups={'volume': '479', 'reporter': 'U. S.', 'page': '257'}, metadata=ShortCaseCitation.Metadata(parenthetical=None, pin_cite='257', year=None, court='scotus', antecedent_guess='MCFL')), ShortCaseCitation('479 U. S., at 260', groups={'volume': '479', 'reporter': 'U. S.', 'page': '260'}, metadata=ShortCaseCi

On its own, `eyecite` does a pretty good job of resolving citations, but if you want to perform more sophisticated resolution (e.g., by incorporating external knowledge about parallel citations), you'll have to pass a custom resolution function to `resolve_citations()`. See [the README](https://github.com/freelawproject/eyecite#resolving-citations) and the [API Documentation](https://freelawproject.github.io/eyecite/resolve.html) for more information about doing this.
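To make the idea of grouping by referent concrete, here is a deliberately simplified standard-library sketch. The `Cite` class and `group_by_referent` function are hypothetical. The rules used (a short citation matches a full citation with the same volume and reporter; "id" points at whatever was cited last) are a rough approximation for illustration, not `eyecite`'s actual resolution logic:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Cite:
    kind: str  # "full", "short", or "id"
    volume: str = ""
    reporter: str = ""
    page: str = ""


def group_by_referent(cites):
    """Group citations under the full citation each one refers to."""
    groups = defaultdict(list)
    last_referent = None
    for cite in cites:
        if cite.kind == "full":
            referent = cite
        elif cite.kind == "short":
            # Find an earlier full citation in the same volume and reporter.
            referent = next(
                (full for full in groups
                 if (full.volume, full.reporter) == (cite.volume, cite.reporter)),
                None,
            )
        else:  # "id" refers back to the most recently cited authority
            referent = last_referent
        if referent is None:
            continue  # nothing to attach this citation to
        groups[referent].append(cite)
        last_referent = referent
    return dict(groups)


mcfl = Cite("full", "479", "U. S.", "238")
detroit = Cite("full", "200", "U. S.", "321")
cites = [
    mcfl,
    Cite("short", "479", "U. S.", "257"),
    Cite("id"),
    detroit,
    Cite("id"),
    Cite("short", "479", "U. S.", "260"),
]
groups = group_by_referent(cites)
print({key.page: len(value) for key, value in groups.items()})
```

A custom resolution function is the place to replace a naive matching rule like this one with outside knowledge, such as a lookup of parallel citations.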

### Annotating

Next, let's prepare annotations for each of our extracted citations, now grouped in clusters. An annotation is text to insert back into the `cleaned_text`, like `((<start offset>, <end offset>), <before text>, <after text>)`. The positional offsets for each citation can be easily retrieved by calling each citation's `span()` method. Here, for simplicity, we'll plan to annotate each citation with a URL to an API that will redirect the user appropriately:

In [13]:
annotations = []
for resource, cites in resolutions.items():
    if isinstance(resource, Resource):
        # Build a bespoke (hypothetical) URL from the group's full citation
        url = f"/some_api?cite={resource.citation.matched_text()}"
        for citation in cites:
            # Each annotation is (span, text inserted before, text inserted after)
            annotations.append((citation.span(), f"<a href='{url}'>", "</a>"))


This is what one of our annotations looks like:

In [14]:
print(annotations[0])

((392, 405), "<a href='/some_api?cite=200 U. S. 321'>", '</a>')
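To see how a `(span, before, after)` tuple turns into marked-up text, here is a minimal standard-library sketch that applies annotations directly to the string they were computed on. The function `apply_annotations` is hypothetical and assumes the spans don't overlap. It is not how `annotate_citations()` is implemented:

```python
def apply_annotations(text, annotations):
    """Insert `before` and `after` around each span, assuming spans don't overlap."""
    # Work from the end of the text backwards so that earlier offsets stay valid.
    for (start, end), before, after in sorted(
        annotations, key=lambda a: a[0], reverse=True
    ):
        text = text[:start] + before + text[start:end] + after + text[end:]
    return text


text = "See 200 U. S. 321, 337; 479 U. S. 238."

annotations = []
for cite in ("200 U. S. 321", "479 U. S. 238"):
    start = text.index(cite)
    url = f"/some_api?cite={cite}"  # hypothetical URL, as in the tutorial
    annotations.append(((start, start + len(cite)), f"<a href='{url}'>", "</a>"))

annotated = apply_annotations(text, annotations)
print(annotated)
```

Inserting from the last span to the first is the key design choice: each insertion shifts only the text after it, so the offsets of spans not yet processed remain correct.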


We now have the annotations properly prepared, but recall that we *cleaned* our original opinion text before passing it to `get_citations()`. Thus, to insert the annotations into our *original* text, we need to pass `source_text=opinion_text` into `annotate_citations()`, which will intelligently adjust the annotation positions using the `diff-match-patch` library:

In [15]:
annotated_text = annotate_citations(cleaned_text, annotations, source_text=opinion_text)

In [16]:
print(annotated_text[:1000])

(Slip Opinion) OCTOBER TERM, 2009 1

 Syllabus

 NOTE: Where it is feasible, a syllabus (headnote) will be released, as is
 being done in connection with this case, at the time the opinion is issued.
 The syllabus constitutes no part of the opinion of the Court but has been
 prepared by the Reporter of Decisions for the convenience of the reader.
 See United States v. Detroit Timber & Lumber Co., <a href='/some_api?cite=200 U. S. 321'>200 U. S. 321</a>, 337.


SUPREME COURT OF THE UNITED STATES

 Syllabus

 CITIZENS UNITED v. FEDERAL ELECTION

 COMMISSION 


APPEAL FROM THE UNITED STATES DISTRICT COURT FOR THE
 DISTRICT OF COLUMBIA

No. 08–205. Argued March 24, 2009—Reargued September 9, 2009––
 Decided January 21, 2010
As amended by §2


Nice!
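For intuition about how offsets in the cleaned text can be translated back to the original, here is a small sketch that swaps in Python's standard-library `difflib` in place of `diff-match-patch`. The function `map_offset` and the sample strings are assumptions for illustration. The sketch shows the general idea of diff-based offset mapping and is not `eyecite`'s implementation:

```python
from difflib import SequenceMatcher


def map_offset(opcodes, offset):
    """Map a character offset in the cleaned text to an offset in the source text."""
    for tag, i1, i2, j1, j2 in opcodes:
        if i1 <= offset < i2:
            # Inside an unchanged run the shift is constant; otherwise snap to
            # the start of the corresponding source region.
            return j1 + (offset - i1) if tag == "equal" else j1
    return opcodes[-1][4] if opcodes else 0  # offset is past the end


source = "See   United\nStates, 200 U. S. 321."
cleaned = "See United States, 200 U. S. 321."

# Describe how to turn `cleaned` into `source` as a list of edit operations.
opcodes = SequenceMatcher(None, cleaned, source, autojunk=False).get_opcodes()

cite = "200 U. S. 321"
start = cleaned.index(cite)
end = start + len(cite)

source_start = map_offset(opcodes, start)
source_end = map_offset(opcodes, end - 1) + 1  # map the last character, then step past it
print(source[source_start:source_end])
```

Mapping the last character of the span rather than the exclusive end offset avoids landing inside text that exists only in the source, such as the extra whitespace that cleaning removed.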