| Field | Value |
|---|---|
| Crates.io | wikiwho |
| lib.rs | wikiwho |
| version | 0.1.0 |
| source | src |
| created_at | 2024-10-20 11:06:52.952965 |
| updated_at | 2024-10-20 11:06:52.952965 |
| description | Fast Rust reimplementation of the WikiWho algorithm for fine-grained authorship attribution on large datasets. Optimized for easy integration in multi-threaded applications. |
| homepage | |
| repository | https://github.com/Schuwi/wikiwho_rs |
| max_upload_size | |
| id | 1416158 |
| size | 212,920 |
A high-performance Rust implementation of the WikiWho algorithm for token-level authorship tracking in Wikimedia pages.
`wikiwho` is a Rust library that implements the WikiWho algorithm, enabling users to track authorship on a token level (token ≈ word) across all revisions of a Wikimedia page (e.g., Wikipedia, Wiktionary). It is designed to process entire Wikipedia/Wiktionary XML dumps efficiently, offering significant performance improvements over the original Python implementation by Fabian Flöck and Maribel Acosta.
Performance:
The original Python implementation of WikiWho could process about 300 pages in one to two minutes. In contrast, `wikiwho_rs` can process an entire German Wiktionary dump (approximately 1.3 million pages) in just 2 minutes using 8 processor cores. This performance boost makes large-scale authorship analysis feasible and efficient.
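As a rough back-of-the-envelope comparison (the two measurements cover different workloads, so treat this as an order-of-magnitude estimate): 1,300,000 pages in about 120 seconds is on the order of 10,000 pages per second, versus roughly 300 pages in 60–120 seconds, i.e. about 2–5 pages per second, for the Python implementation.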
Currently, `wikiwho` is available via its GitHub repository. You can include it in your `Cargo.toml` as follows:
```toml
[dependencies]
wikiwho = { git = "https://github.com/Schuwi/wikiwho_rs.git" }
```
A release on crates.io is planned soon.
Here's a minimal example of how to load a Wikimedia XML dump and analyze a page:
```rust
use wikiwho::dump_parser::DumpParser;
use wikiwho::algorithm::Analysis;
use std::fs::File;
use std::io::BufReader;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open the XML dump file
    let xml_dump = File::open("dewiktionary-20240901-pages-meta-history.xml")?;
    let reader = BufReader::new(xml_dump);
    let mut parser = DumpParser::new(reader)?;

    // Parse a single page
    if let Some(page) = parser.parse_page()? {
        // Analyze the page revisions
        let analysis = Analysis::analyse_page(&page.revisions)?;

        // Iterate over tokens in the current revision
        for token in wikiwho::utils::iterate_revision_tokens(&analysis, &analysis.current_revision) {
            println!(
                "'{}' by '{}'",
                token.value,
                analysis[token].origin_revision.contributor.username
            );
        }
    }
    Ok(())
}
```
To process a full dump, you can iterate over all pages:
```rust
use wikiwho::dump_parser::DumpParser;
use wikiwho::algorithm::Analysis;
use std::fs::File;
use std::io::BufReader;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let xml_dump = File::open("dewiktionary-20240901-pages-meta-history.xml")?;
    let reader = BufReader::new(xml_dump);
    let mut parser = DumpParser::new(reader)?;

    while let Some(page) = parser.parse_page()? {
        // Analyze each page, in parallel or sequentially
        let analysis = Analysis::analyse_page(&page.revisions)?;
        // Your processing logic here
    }
    Ok(())
}
```
While XML parsing is inherently linear, you can process pages in parallel once they are parsed: use `std::thread` or crates like `rayon` for concurrency (see the sketch below).
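Here is a minimal sketch of one way to do this with just the standard library (not taken from the crate's documentation): a single thread parses pages and feeds them through an `mpsc` channel to a small pool of worker threads. It assumes the parsed `Page` type is `Send`, which the crate's focus on multi-threaded integration suggests; with `rayon`, a parallel bridge over an iterator of parsed pages would serve the same purpose.

```rust
use std::fs::File;
use std::io::BufReader;
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

use wikiwho::algorithm::Analysis;
use wikiwho::dump_parser::DumpParser;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let xml_dump = File::open("dewiktionary-20240901-pages-meta-history.xml")?;
    let mut parser = DumpParser::new(BufReader::new(xml_dump))?;

    // Parsed pages are handed to workers through a channel.
    let (sender, receiver) = mpsc::channel();
    let receiver = Arc::new(Mutex::new(receiver));

    // Spawn a fixed pool of worker threads that analyze pages as they arrive.
    let workers: Vec<_> = (0..8)
        .map(|_| {
            let receiver = Arc::clone(&receiver);
            thread::spawn(move || loop {
                // Take the next parsed page off the shared channel.
                let page = match receiver.lock().unwrap().recv() {
                    Ok(page) => page,
                    Err(_) => break, // channel closed: parsing is finished
                };
                if let Ok(analysis) = Analysis::analyse_page(&page.revisions) {
                    // ... your per-page processing here ...
                    let _ = analysis;
                }
            })
        })
        .collect();

    // The parsing itself stays on a single thread.
    while let Some(page) = parser.parse_page()? {
        if sender.send(page).is_err() {
            break; // all workers have stopped
        }
    }
    drop(sender); // close the channel so the workers exit their loops

    for worker in workers {
        worker.join().expect("worker thread panicked");
    }
    Ok(())
}
```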
The main modules are:

- `dump_parser`: create a `DumpParser` instance with a reader, then call `parse_page()` to retrieve pages one by one.
- `algorithm`: call `Analysis::analyse_page(&page.revisions)` to analyze the revisions of a page.
- `utils`: helpers such as `iterate_revision_tokens()` for easy iteration over the tokens in a revision.
Nodes are referenced via pointer types (e.g., `SentencePointer`). Access mutable data via indexing into the `Analysis` struct (e.g., `analysis[word_pointer].origin_revision`).
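As a small illustration of this indexing pattern, here is a hypothetical helper (not part of the crate; it uses only the accessors shown in the examples above) that tallies how many tokens of the current revision each contributor originated:

```rust
use std::collections::HashMap;

use wikiwho::algorithm::Analysis;
use wikiwho::utils::iterate_revision_tokens;

// Hypothetical helper: count how many tokens in the current revision
// originate from each contributor.
fn tokens_per_author(analysis: &Analysis) -> HashMap<String, usize> {
    let mut counts: HashMap<String, usize> = HashMap::new();
    for token in iterate_revision_tokens(analysis, &analysis.current_revision) {
        // Index into the analysis to reach the token's origin revision.
        let author = analysis[token]
            .origin_revision
            .contributor
            .username
            .to_string();
        *counts.entry(author).or_default() += 1;
    }
    counts
}
```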
A few more notes:

- `compact_str`: used in the public API for efficient handling of mostly short strings, such as page titles and contributor names.
- Drop the `Page` and `Analysis` structs to free memory.
- The original Python diff algorithm is available via the optional `python-diff` feature. To use it:
```toml
[dependencies]
wikiwho = { git = "https://github.com/Schuwi/wikiwho_rs.git", features = ["python-diff"] }
```
This feature relies on `pyo3`.

- Warnings and errors are logged via the `tracing` crate.
- Enable the `strict` feature to terminate parsing on errors.
- You can also construct `Page` and `Revision` structs manually for other data sources.

Contributions are welcome!
This library was developed through a mix of hard work, creativity, and collaboration with various tools, including GitHub Copilot and ChatGPT. It has been an exciting journey filled with coding and brainstorming 💛.
Special thanks for the friendly guidance and support of ChatGPT along the way, which helped with the documentation and with understanding the original implementation, making this library as robust and performant as possible.
This project is primarily licensed under the Mozilla Public License 2.0.
However, parts of this project are derived from the original WikiWho Python implementation, which is licensed under the MIT License. Thus, for these parts of the project (as marked by the SPDX headers), the MIT License applies additionally. This basically just means that the copyright notice in LICENSE-MIT must be preserved.