| Crates.io | textsurf |
| lib.rs | textsurf |
| version | 0.3.0 |
| created_at | 2025-05-19 09:21:18.33096+00 |
| updated_at | 2025-09-15 10:29:16.261837+00 |
| description | Webservice for efficiently serving multiple plain text documents or excerpts thereof (by unicode character offset), without loading everything into memory. |
| homepage | |
| repository | https://github.com/knaw-huc/textsurf |
| max_upload_size | |
| id | 1679508 |
| size | 139,502 |
This is a webservice for efficiently serving plain texts and fragments thereof using unicode character-based addressing. It builds upon textframe.
A RESTful API is offered with several end-points. The full OpenAPI specification can be consulted
interactively at the /swagger-ui/ endpoint once it is running.
The main feature that this service provides is that you can query excerpts of plain text by unicode character offsets. Internally, they are efficiently translated to byte offsets and only partially loaded from disk into memory, and then served. The addressing syntax for the API is derived from RFC5147.
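To make the translation from character offsets to byte offsets concrete, here is a minimal sketch in Rust. It is not the textsurf or textframe implementation: it works on a string that is already fully in memory and scans from the start, whereas the service only loads parts of a file from disk. The function names char_range_to_byte_range and excerpt are hypothetical and chosen for illustration only.

```rust
/// Translate a half-open range of Unicode code point offsets into the
/// corresponding byte range of a UTF-8 string.
/// Returns None if the range is reversed or runs past the end of the text.
fn char_range_to_byte_range(text: &str, begin: usize, end: usize) -> Option<(usize, usize)> {
    if begin > end {
        return None;
    }
    // Byte position of every code point, followed by the length of the text,
    // so that an offset equal to the number of code points is valid too.
    let mut boundaries = text
        .char_indices()
        .map(|(byte_pos, _)| byte_pos)
        .chain(std::iter::once(text.len()));
    let byte_begin = boundaries.nth(begin)?;
    // `nth` has consumed begin + 1 items, so the boundary for `end` is now
    // `end - begin - 1` steps ahead; an empty range needs no further step.
    let byte_end = if end == begin {
        byte_begin
    } else {
        boundaries.nth(end - begin - 1)?
    };
    Some((byte_begin, byte_end))
}

/// Slice a text by Unicode code point offsets (0-indexed, end non-inclusive).
fn excerpt(text: &str, begin: usize, end: usize) -> Option<&str> {
    let (byte_begin, byte_end) = char_range_to_byte_range(text, begin, end)?;
    Some(&text[byte_begin..byte_end])
}
```

With the text "héllo wörld", excerpt(text, 1, 5) selects four characters but five bytes, because é occupies two bytes in UTF-8.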
The service allows upload and deletion of texts, provided this feature is
enabled on startup using the --writable flag (make sure you understand the security implications outlined further down).
Alternatively, you can consider storing plain text files in a git repository, cloning that repository on your server (perhaps also periodically pulling updates via cron), and then serving them immutably using textsurf. Any other comparable repository or version control system will also do.
Please also see the FAQ section further below.
The following endpoints are defined and constitute the Text Referencing API, which will be more formally defined in a later section:
GET / - Returns a simple JSON list of all available texts.
GET /{text_id} - Returns a full text given its identifier.
GET /{text_id}?char={begin},{end} - Returns a text selection inside a resource. Offsets are 0-indexed unicode points; the end is non-inclusive. This implements part of RFC5147 server-side.
GET /{text_id}?line={begin},{end} - Returns a text selection inside a resource by line range. Offsets are 0-indexed lines (so the first line is 0 and not 1!); the end is non-inclusive. This implements another part of RFC5147 server-side.
DELETE /{text_id} - Deletes a text.
POST /{text_id} - Adds a new text.
GET /stat/{text_id} - Returns file size and modification date (JSON).
In all these instances text_id may itself consist of any number of path
components, a filename, and optionally an extension. If no explicit extension
is provided, the server may use an implied default one (usually .txt).
Allowing a full path allows you to use arbitrary hierarchies to organize text files.
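As an illustration of how a client might address these endpoints, the following sketch builds request URLs for a full text, a character range, and a line range. The Selection type and the text_url function are hypothetical helpers, not part of textsurf, and the sketch assumes that text_id contains only characters that are safe to place in a URL path (no percent-encoding is applied).

```rust
/// The selections offered by the GET endpoints of the Text Referencing API.
enum Selection {
    /// The whole text.
    Full,
    /// Unicode character offsets, 0-indexed, end non-inclusive.
    Chars { begin: usize, end: usize },
    /// Line offsets, 0-indexed, end non-inclusive.
    Lines { begin: usize, end: usize },
}

/// Build the request URL for a text (which may include path components)
/// relative to the base URL of a running service.
fn text_url(base_url: &str, text_id: &str, selection: &Selection) -> String {
    // Normalise slashes so base and identifier are joined by exactly one.
    let base = base_url.trim_end_matches('/');
    let id = text_id.trim_start_matches('/');
    match selection {
        Selection::Full => format!("{base}/{id}"),
        Selection::Chars { begin, end } => format!("{base}/{id}?char={begin},{end}"),
        Selection::Lines { begin, end } => format!("{base}/{id}?line={begin},{end}"),
    }
}
```

For example, text_url("http://127.0.0.1:8080", "test.txt", &Selection::Chars { begin: 10, end: 20 }) yields http://127.0.0.1:8080/test.txt?char=10,20.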
These are extra endpoints that are available but not part of the Text Referencing API:
GET /{text_id}?begin={begin}&end={end} - Returns a text selection inside a resource. Offsets are 0-indexed unicode points; the end is non-inclusive. Alternative syntax similar to the above.
GET /s/{text_id}/{begin}/{end} - Simple pure URL call. Only works with simple text IDs without any path components!
GET /swagger-ui - Serves an interactive web interface explaining the RESTful API specification.
GET /api-doc/openapi.json - Machine parseable OpenAPI specification.
Textsurf implements a minimal Text Referencing API that is directly derived from
RFC5147. RFC5147 specifies URI
fragment identifiers for the text/plain media type, in the form of, e.g.:
https://example.org/test.txt#char=10,20. It is a fragment specification and
therefore applies to the client-side, not the server side. Textsurf, however, is a server.
We take this RFC5147 spec and turn it into an API.
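The step from a client-side fragment to a server-side request can be sketched as a small URI rewrite. The function fragment_to_query below is a hypothetical client-side helper, not something textsurf provides. It assumes that the ';' separator RFC 5147 uses before integrity checks corresponds to '&' in the query form, which is inferred from the query examples in this document rather than confirmed.

```rust
/// Rewrite an RFC 5147 fragment URI (e.g. `...test.txt#char=10,20`) into a
/// query-style URI by moving the text selection from the fragment part to
/// the query part.
/// Returns None when the URI has no fragment, already has a query part, or
/// the fragment does not start with `char=` or `line=`.
fn fragment_to_query(uri: &str) -> Option<String> {
    let (resource, fragment) = uri.split_once('#')?;
    if resource.contains('?') {
        return None;
    }
    if !(fragment.starts_with("char=") || fragment.starts_with("line=")) {
        return None;
    }
    // Assumption: ';' (RFC 5147 integrity-check separator) maps to '&'.
    Some(format!("{}?{}", resource, fragment.replace(';', "&")))
}
```

Given https://example.org/test.txt#char=10,20, this produces https://example.org/test.txt?char=10,20, which a server can then act on.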
The capitalized key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this section are to be interpreted as described in RFC 2119.
This Text Referencing API lays down the following constraints:
?). Example: https://example.org/test.txt
https://example.org/test
https://example.org/deep/in/the/forest/test.txt
HTTP GET call on its URI.
text/plain with character encoding UTF-8 and UNIX line endings (linefeed, 0x0a, \n).
HTTP POST call on its URI, provided the server is not in a read-only state.
HTTP DELETE call on its URI, provided the server is not in a read-only state.
?) of its URI, rather than in the fragment part (starting with #). Examples: https://example.org/test.txt?char=10,20 , https://example.org/test.txt?line=0,1 , https://example.org/test.txt?line=0,1&length=104&md5=b07ec26b0c68933887b28278becdc5f9
# with ?).
/stat/{text_id} SHOULD be provided that provides at least the following information as keys in a JSON response:
bytes - The filesize of the file in bytes.
chars - The length of the text file in unicode points.
checksum - A SHA-256 checksum of the entire textfile.
mtime - The modification time of the file in number of seconds since the unix epoch (1970-01-01 00:00).
In addition to the above API, Textsurf implements a second Text
Referencing API. Though there are two separate interfaces, the functionality they
expose is identical and it is a matter of preference which one you want to use.
The secondary API is available under the /api2/ endpoint. It was designed not
to use query parameters, interoperate closer with linked open data, and is
modelled after the IIIF Image API.
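The region part of the URI template specified later in this section (full, {begin},{end}, char:{begin},{end} and line:{begin},{end}) could be parsed along the following lines. This is a sketch under stated assumptions, not the textsurf implementation: the Region type and parse_region function are hypothetical, and the reading that an empty end (as in the -1, example) means "up to the end of the text" is inferred from that example.

```rust
/// A parsed region of the secondary API.
#[derive(Debug, PartialEq)]
enum Region {
    /// The full text (`full`, or no region at all).
    Full,
    /// Character offsets; `end` is None when left empty, as in `-1,`.
    Chars { begin: i64, end: Option<i64> },
    /// Line offsets, 0-indexed, end non-inclusive.
    Lines { begin: i64, end: Option<i64> },
}

/// Parse the region path segment. Returns None for malformed input.
fn parse_region(region: &str) -> Option<Region> {
    if region.is_empty() || region == "full" {
        return Some(Region::Full);
    }
    // A bare `{begin},{end}` is treated the same as `char:{begin},{end}`.
    let (is_lines, range) = if let Some(rest) = region.strip_prefix("line:") {
        (true, rest)
    } else if let Some(rest) = region.strip_prefix("char:") {
        (false, rest)
    } else {
        (false, region)
    };
    let (begin, end) = range.split_once(',')?;
    let begin: i64 = begin.parse().ok()?;
    // Assumption: an empty end means "until the end of the text".
    let end: Option<i64> = if end.is_empty() {
        None
    } else {
        Some(end.parse().ok()?)
    };
    Some(if is_lines {
        Region::Lines { begin, end }
    } else {
        Region::Chars { begin, end }
    })
}
```

Signed integers are used so that negative offsets such as -1 can be represented; how they are resolved against the length of the text is left to the caller.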
The capitalized key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this section are to be interpreted as described in RFC 2119.
HTTP GET request conforming to the following URI template: {scheme}://{server}{/prefix}/{identifier}{/region}
scheme - Indicates the use of the HTTP or HTTPS protocol in calling the service.
server - The host server on which the service resides. This parameter MAY also contain a port number.
prefix - The path on the host server to the service. This prefix is OPTIONAL from the point of view of this specification, but it is REQUIRED to end in /api2 for the TextSurf implementation. A prefix may be useful when the host server supports multiple services. The prefix may contain multiple path segments, delimited by slashes, but all other special characters must be encoded.
identifier - The identifier of the requested text. This must be a filename and MAY contain path information, but special characters including slashes for directory hierarchy MUST be URI encoded. The text file MUST be retrievable by its full extension. It MAY also be retrievable by having an implied default extension. Example: https://example.org/api2/test for https://example.org/api2/test.txt
region - This parameter is OPTIONAL and used when requesting a subpart of the text. Syntax is as follows:
full - Returns the full text, same as omitting the region parameter entirely.
{begin},{end} - Returns the text from character begin to end.
0,1 returns the first character of a text.
-1, returns the last character of a text.
char:{begin},{end} - Same as above.
line:{begin},{end} - Returns lines; lines MUST be 0-indexed and the end MUST be non-inclusive.
HTTP POST call on the same URI as in point 1, but without the region part, and provided the server is not in a read-only state.
HTTP DELETE call on its URI, provided the server is not in a read-only state.
{scheme}://{server}{/prefix}/{identifier}/info.json. This SHOULD return a JSON response with the following keys:
@context - https://w3id.org/textsurf/api2.jsonld
id - URI of the text file
type - TextService2
protocol - https://w3id.org/textsurf/api2
bytes - The filesize of the file in bytes.
chars - The length of the text file in unicode points.
checksum - A SHA-256 checksum of the entire textfile.
mtime - The modification time of the file in number of seconds since the unix epoch (1970-01-01 00:00).
You can install textsurf as follows:
Production environments:
$ cargo install textsurf
Development environments:
$ git clone git@github.com:knaw-huc/textsurf.git
$ cd textsurf
$ cargo install --path .
Development versions may require a development version of
textframe; clone it alongside textsurf and add a
textsurf/.cargo/config.toml with:
#[dependencies.textframe]
paths = ["../textframe"]
Run make docker to build a container using docker or podman.
Run textsurf to start the webservice, see textsurf --help for various parameters.
Run docker run --rm -v ./test/docroot:/data -p 8080:8080 proycon/textsurf where ./test/docroot/ is the document root path containing text files that you want to mount into the container. The service will be available on 127.0.0.1:8080. Make sure that subuid 1000 inside the container is mapped to a user on the host that has read and write access to the files. You can pass --env DEBUG=1 for more verbose output.
The webservice launches in read-only mode by default (does not allow text
upload/deletion). Pass --writable to allow writing (for the container, pass environment variable WRITABLE=1).
In that case, the webservice is NOT meant to be directly opened up to the internet, as it
does not provide any authentication mechanism and can easily be abused as
an arbitrary file hosting service. Make sure it is behind a firewall or on a private network
segment.
Q: Can I request byte offsets instead?
A: No, just use any HTTP/1.1 server that supports the Range request header. We
deliberately do not implement this because using byte-offsets may result in malformed unicode responses.
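To illustrate why byte offsets can yield malformed responses, the following sketch checks whether an arbitrary byte range of a UTF-8 text is itself valid UTF-8. The function name byte_range_is_valid_utf8 is hypothetical and serves only as a demonstration; it is not part of textsurf.

```rust
use std::str;

/// Check whether the bytes `begin..end` of a UTF-8 text form valid UTF-8 on
/// their own. A byte range that starts or ends in the middle of a multi-byte
/// character does not, and neither does a range outside the text.
fn byte_range_is_valid_utf8(text: &str, begin: usize, end: usize) -> bool {
    match text.as_bytes().get(begin..end) {
        Some(bytes) => str::from_utf8(bytes).is_ok(),
        None => false,
    }
}
```

In "héllo" the character é takes up bytes 1 and 2, so the byte range 0..2 cuts it in half and is not valid UTF-8, while 0..3 is. Character-based addressing avoids this class of error by construction.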
Q: Will you support other encodings than UTF-8 and other formats than plain text?
A: No, although for formats with light markup like Markdown or ReStructuredText, this service may still be useful. For heavy markup like XML or JSON it is not recommended as character-based addressing makes little sense there.