| Crates.io | screenplay-doc-parser-rs |
| lib.rs | screenplay-doc-parser-rs |
| version | 0.1.8 |
| created_at | 2025-06-17 00:18:50.156596+00 |
| updated_at | 2025-07-30 00:51:39.942596+00 |
| description | Tools to parse Screenplay-formatted documents into semantically-typed structs. |
| homepage | |
| repository | https://github.com/richardmrodriguez/screenplay-doc-parser |
| max_upload_size | |
| id | 1715011 |
| size | 2,620,103 |
Parses a PDF document file into a structured, semantically typed ScreenplayDocument object.
This parser currently supports parsing from PDF, but may include support for other formats such as FDX or Fountain in the future.
The PDF parser uses the x,y positions of the TextElements on a page to deduce their type. This will usually be correct, BUT may require manual intervention after parsing for some edge-cases. Screenwriters love to play with formatting and indentation...
In general, screenplay elements like Action, Character, Dialogue, Parentheticals, even the Page Number, Scene Numbers and revision markers, all have a set indentation point, and/or specific justification.
Also, screenplays generally have consistent margins, or at least margins consistent within the same document (hopefully...)
If we know the indentations and margins of a document, we can deduce that any line of text which begins 1.5 inches from the left edge of the page, and sits below the top margin and above the bottom margin, is probably an Action line.
Lines that adhere to the above, but also start with something like INT. or EXT. are very likely SceneHeadings.
Character names and dialogue have their own indentations, as well as parentheticals. So this scheme should yield correct parsing for the majority of a properly-formatted ScreenplayPDF.
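The position-based heuristic described above can be sketched roughly as follows. This is an illustration only, not this crate's actual API: `LineKind`, `Indents`, and `classify` are hypothetical names, and every indentation value other than the 1.5 inch Action indent is a placeholder assumption rather than a value taken from the crate.

```rust
/// PDF user space uses 72 points per inch.
const POINTS_PER_INCH: f32 = 72.0;

/// Hypothetical element categories; the crate's own type names may differ.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum LineKind {
    SceneHeading,
    Action,
    Character,
    Parenthetical,
    Dialogue,
    Unknown,
}

/// Hypothetical indentation table, in inches.
/// Only the 1.5 inch Action indent comes from the text above;
/// the remaining values are illustrative assumptions.
#[derive(Debug, Clone, Copy)]
pub struct Indents {
    pub action: f32,
    pub dialogue: f32,
    pub parenthetical: f32,
    pub character: f32,
    pub top_margin: f32,
    pub bottom_margin: f32,
    pub page_height: f32,
    /// How far a line may drift from an indent and still match.
    pub tolerance: f32,
}

impl Default for Indents {
    fn default() -> Self {
        Indents {
            action: 1.5,
            dialogue: 2.5,
            parenthetical: 3.0,
            character: 3.5,
            top_margin: 1.0,
            bottom_margin: 1.0,
            page_height: 11.0, // US-Letter
            tolerance: 0.05,
        }
    }
}

/// Classify one line from its text and the position of its first glyph.
/// `x_pt` is measured from the left page edge and `y_pt` from the bottom
/// page edge, both in points.
pub fn classify(text: &str, x_pt: f32, y_pt: f32, ind: &Indents) -> LineKind {
    let x = x_pt / POINTS_PER_INCH;
    let y = y_pt / POINTS_PER_INCH;

    // Anything inside the top or bottom margin is not body text
    // (page numbers, headers, and so on).
    if y < ind.bottom_margin || y > ind.page_height - ind.top_margin {
        return LineKind::Unknown;
    }

    let near = |target: f32| (x - target).abs() <= ind.tolerance;

    if near(ind.action) {
        // Scene headings share the Action indent but start with an
        // environment marker.
        let t = text.trim_start();
        if t.starts_with("INT.") || t.starts_with("EXT.") {
            LineKind::SceneHeading
        } else {
            LineKind::Action
        }
    } else if near(ind.dialogue) {
        LineKind::Dialogue
    } else if near(ind.parenthetical) {
        LineKind::Parenthetical
    } else if near(ind.character) {
        LineKind::Character
    } else {
        LineKind::Unknown
    }
}
```

With these assumed defaults, a line starting at x = 108 points (1.5 inches) in the body of the page is classified as Action, or as a SceneHeading if its text begins with INT. or EXT.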
The user of this crate can also pass in their own indentation values and strings to match against for Scene Environments or Time of Day (INT./EXT., DAY, NIGHT...), so we can even support screenplays that are A4, or have deviated somewhat from "standard" US-Letter formatting.
The default margins and indentations for this crate are taken from the default settings found in Final Draft 11, for a simple US-Letter screenplay.
This categorizes the following Screenplay Element Types:
(V.O.) in CHARACTER (V.O.))
This parser also captures the following screenplay elements as metadata:
Some types, such as TimeOfDay, Revision Markers, and Environment rely on arbitrary string values. You can pass in your own collection of these strings, to parse a screenplay written in a different language, or support additional / specific elements.
For example, you can add "DUSK" or "HIGH NOON" as TimeOfDay strings, so that they are correctly identified as TimeOfDay elements.
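To illustrate why a user-supplied list matters, here is a minimal sketch of matching the end of a scene heading against a set of TimeOfDay strings. The function `find_time_of_day` is a hypothetical helper written for this example; it is not part of this crate, and the crate's own matching logic may work differently.

```rust
/// Hypothetical helper: return the time-of-day string that the heading
/// ends with, if any. Matching is case-insensitive, and the longest
/// candidate wins so that "HIGH NOON" is preferred over "NOON".
/// This is a simplification: a plain suffix test would also match
/// "DAY" at the end of a word such as "FRIDAY".
pub fn find_time_of_day<'a>(heading: &str, times: &[&'a str]) -> Option<&'a str> {
    let upper = heading.trim().to_uppercase();
    times
        .iter()
        .copied()
        .filter(|t| upper.ends_with(t.to_uppercase().as_str()))
        .max_by_key(|t| t.len())
}
```

With only "DAY" and "NIGHT" in the list, a heading such as `EXT. DESERT - HIGH NOON` yields no match; once "HIGH NOON" is added to the list, the same heading is recognized.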
Additionally, the ElementIndentations struct can be passed in to the PDF parser, to provide custom indentations and support parsing a screenplay formatted in A4, or a screenplay formatted with "centered" (as in placement, not justification) scripts, like from Fade In or other programs.
These are not yet parsed or handled properly:
This parser has an optional feature, which uses the mupdf-basic-text-extractor crate to allow PDF file reading. You may choose to exclude this feature and roll your own PDF file-parsing, and then handle the conversion to the generic `pdf_document::PDFDocument` object, which gets passed into the PDF parser.
This code is licensed under AGPL-3.0.