Crates.io | pardalotus_chaser |
lib.rs | pardalotus_chaser |
version | 0.1.0 |
source | src |
created_at | 2024-11-06 17:40:59.321921 |
updated_at | 2024-11-06 17:40:59.321921 |
description | Keep up to date with scholarly metadata indexed in Crossref. |
repository | https://github.com/pardalotus/pardalotus_chaser |
id | 1438619 |
size | 72,685 |
Keep up to date with scholarly metadata. Pardalotus Chaser will keep a local SQLite database up to date with recently added or updated scholarly metadata from Crossref.
When you run the tool it will create or update a SQLite database. On each run it will retrieve data since the previous run, with a 1-hour overlap to account for jitter.
The tool doesn't attempt to retrieve historical data, only newly updated records.
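The overlap rule can be pictured with a few lines of Python. This is only a sketch of the arithmetic described above, not the tool's actual code: the names OVERLAP and window_start and the example timestamps are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# One-hour overlap, as described above. The constant name is hypothetical.
OVERLAP = timedelta(hours=1)


def window_start(previous_run: datetime) -> datetime:
    """Return the earliest point in time to ask for on the next run."""
    return previous_run - OVERLAP


# Made-up example: the previous run was at 17:00 UTC and we run again at 18:00 UTC.
previous_run = datetime(2024, 11, 6, 17, 0, tzinfo=timezone.utc)
now = datetime(2024, 11, 6, 18, 0, tzinfo=timezone.utc)

start = window_start(previous_run)
window = now - start

print(start.isoformat())  # 2024-11-06T16:00:00+00:00
print(window)             # 2:00:00
```

Because the start is always pushed back by the overlap, the requested window can never be shorter than one hour, however soon after the previous run the tool is started again.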
The content of the database is what was returned from the Crossref API. No attempt is made to interpret the metadata, beyond extracting the DOI and index date.
You may need to install libssl:
sudo apt-get install pkg-config libssl-dev
To run directly from this repo:
cargo run
To install directly from cargo:
cargo install pardalotus_chaser
Then run:
pardalotus_chaser
It will create a SQLite database in the current working directory.
Because SQLite is a local database, it's not suited to concurrent access. This tool uses SQLite's WAL (Write-Ahead Log) feature to allow you to read the database whilst it's writing.
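As a rough illustration of what WAL mode allows, here is a small Python sketch using the standard sqlite3 module. It uses a throwaway database; the file name demo.sqlite and the table t are made up for the demo and have nothing to do with the file this tool creates. One connection holds an uncommitted write while a second connection reads.

```python
import os
import sqlite3
import tempfile

# Hypothetical throwaway database, used only for this demonstration.
db_path = os.path.join(tempfile.mkdtemp(), "demo.sqlite")

writer = sqlite3.connect(db_path)
# Switching to WAL returns the journal mode that is now in effect.
journal_mode = writer.execute("PRAGMA journal_mode=WAL").fetchone()[0]
writer.execute("CREATE TABLE t (x INTEGER)")
writer.execute("INSERT INTO t VALUES (1)")
writer.commit()

# Begin a second write and deliberately leave it uncommitted.
writer.execute("INSERT INTO t VALUES (2)")

# A separate connection can still read; it sees only committed rows.
reader = sqlite3.connect(db_path)
rows_seen_during_write = reader.execute("SELECT COUNT(*) FROM t").fetchall()[0][0]

writer.commit()
rows_seen_after_commit = reader.execute("SELECT COUNT(*) FROM t").fetchall()[0][0]

print(journal_mode)            # wal
print(rows_seen_during_write)  # 1
print(rows_seen_after_commit)  # 2
```

The reader is never blocked, but it only sees the new row once the writer has committed.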
Nonetheless, the intended use-case is that you run the tool periodically to update the database rather than keep it running.
If you want to keep your database continually updated, you can set a cron job to run once an hour or so.
When you run the tool it will always retrieve at least 1 hour's worth of data. So don't run it in a tight loop.
Feature requests are welcome. Please open an issue!
This code is pre-release, and the structure may change.
The database contains a works_history table:
CREATE TABLE works_history (
  pk INT PRIMARY KEY,
  identifier TEXT,
  identifier_type INT,
  source INT,
  hash TEXT,
  json JSONB,
  updated INTEGER,
  UNIQUE(identifier, identifier_type, source, hash));
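The UNIQUE constraint is what lets the same record arrive twice, for example inside the 1-hour overlap, without producing duplicate rows. The Python sketch below shows one way this can work. It is an assumption-laden illustration, not the tool's own code: the store helper, the use of INSERT OR IGNORE, hashing the UTF-8 bytes of the JSON text, storing the JSON as plain text, and the DOI and timestamps are all made up for the demo. The table is a reduced copy of the schema with pk left out.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced copy of the schema above (pk omitted for brevity).
conn.execute("""
    CREATE TABLE works_history (
        identifier TEXT,
        identifier_type INT,
        source INT,
        hash TEXT,
        json JSONB,
        updated INTEGER,
        UNIQUE(identifier, identifier_type, source, hash))
""")


def store(conn, doi, payload, updated):
    """Insert one version of a work; an identical version is silently skipped.

    Hypothetical helper: assumes the hash is the SHA-1 of the JSON text.
    """
    digest = hashlib.sha1(payload.encode("utf-8")).hexdigest()
    conn.execute(
        "INSERT OR IGNORE INTO works_history "
        "(identifier, identifier_type, source, hash, json, updated) "
        "VALUES (?, 1, 1, ?, ?, ?)",
        (doi, digest, payload, updated),
    )


doi = "10.5555/12345678"  # made-up DOI
store(conn, doi, '{"title": "first"}', 1730914859)
store(conn, doi, '{"title": "first"}', 1730918459)    # same content, fetched again
store(conn, doi, '{"title": "revised"}', 1730922059)  # changed content
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM works_history").fetchone()[0]
print(row_count)  # 2
```

The repeated payload is dropped because its hash matches an existing row, while the revised payload is kept as a second row for the same DOI. If the json column in a real database holds SQLite's binary JSONB encoding rather than text, it may need to be read through SQLite's json() function, which requires SQLite 3.45 or later.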
identifier is the DOI. There may be other work types in future.
identifier_type is 1, to indicate that it's a DOI.
source is 1 for Crossref. There may be future sources.
hash is the SHA-1 of the JSON. This is stored for future convenience. It is used to deduplicate repeated inputs.
json is a JSONB representation of the work JSON.
updated is a Unix datestamp.
This code is MIT Licensed, Copyright 2024 Joe Wass.