Crates.io | coma |
lib.rs | coma |
version | 0.2.3 |
source | src |
created_at | 2024-08-03 22:48:13.503011 |
updated_at | 2024-10-04 21:07:17.290609 |
description | Coma is a lightweight command-line tool designed for crawling websites |
repository | https://github.com/noahfraiture/coma |
id | 1324573 |
size | 186,007 |
Disclaimer: This project is currently on pause. I recently made significant changes to how it is used (in the last merge), and it has not been tested much since. I plan to continue developing this tool and add many more features, but for now I am working on another project.
Coma is a lightweight command-line tool designed for scraping various types of content from web pages, such as text, comments, links, and images. Its simplicity and flexibility make it easy for users to extract the specific data they need from a given URL.
You can install Coma either by compiling it locally after cloning the repository or by installing it directly from crates.io.
Clone the repository:
git clone https://github.com/noahfraiture/coma.git
cd coma
Build the project using Cargo:
cargo build --release
Run the compiled binary:
./target/release/coma --help
To install Coma from crates.io, use the following command:
cargo install coma
This will download and compile Coma, making it available for easy use from the command line.
To use Coma, the basic command structure is as follows:
coma [OPTIONS] --url <URL> <COMMAND>
Where <URL> is the website you want to scrape, and <COMMAND> specifies what type of data you wish to extract.
The available commands enable you to target specific content on the web page:
Coma includes several options to customize its behavior:
-c, --content <CONTENT>: Specifies the type of content to scrape. Available values are:
-u, --url <URL>: Mandatory option that specifies the URL at which scraping starts.
-d, --depth <DEPTH>: Determines how deep the scraper should go from the specified URL:
    0: Scrapes only the specified URL.
    <0: Enables infinite depth, allowing the scraper to traverse all linked pages.
    The default is 0.
-b, --bound <BOUND>: Sets a filter that keeps only URLs containing a specific substring. This is useful for limiting the scraping to a specific domain or section of a website. The default value is an empty string, meaning no filtering is applied.
-t, --task <TASK>: Sets the maximum number of asynchronous tasks that run concurrently during scraping. The default is 5, which keeps scraping fast without overwhelming the target server.
-e, --external <EXTERNAL>: Specifies whether to include external links. The default is 0 (external links are excluded).
-h, --help: Prints the help menu for Coma, including usage instructions and command options.
-V, --version: Displays the current version of Coma.
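To make the --depth and --bound semantics described above more concrete, here is a minimal sketch in Rust. It is not Coma's actual implementation: the function names (within_bound, depth_allows, crawl) are hypothetical, the link graph is an in-memory map instead of real HTTP requests, and it assumes that a positive depth means "follow links up to that many hops from the start URL", which the option list does not state.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Hypothetical stand-in for `--bound`: keep a URL only if it contains `bound`.
/// An empty `bound` keeps every URL, because every string contains "".
fn within_bound(url: &str, bound: &str) -> bool {
    url.contains(bound)
}

/// Hypothetical stand-in for `--depth`: decides whether the links of a page
/// found at depth `current` may be followed.
/// 0 means "only the start page", a negative value means "no limit".
/// Treating a positive value as a hop count is an assumption of this sketch.
fn depth_allows(current: u32, max_depth: i32) -> bool {
    max_depth < 0 || i64::from(current) < i64::from(max_depth)
}

/// Breadth-first walk over an in-memory link graph (no network access).
/// Returns the visited URLs in the order they were visited.
fn crawl(
    links: &HashMap<&str, Vec<&str>>,
    start: &str,
    max_depth: i32,
    bound: &str,
) -> Vec<String> {
    let mut visited: HashSet<String> = HashSet::new();
    let mut order: Vec<String> = Vec::new();
    let mut queue: VecDeque<(String, u32)> = VecDeque::new();

    visited.insert(start.to_string());
    queue.push_back((start.to_string(), 0));

    while let Some((url, depth)) = queue.pop_front() {
        order.push(url.clone());

        // Stop expanding once the depth limit is reached.
        if !depth_allows(depth, max_depth) {
            continue;
        }

        if let Some(children) = links.get(url.as_str()) {
            for child in children {
                // Skip URLs outside the bound and URLs already seen.
                if within_bound(child, bound) && visited.insert(child.to_string()) {
                    queue.push_back((child.to_string(), depth + 1));
                }
            }
        }
    }

    order
}
```

With a graph in which the start page links to one page on the same site and one page on another site, calling crawl with a depth of 0 returns only the start page, while a negative depth combined with a bound such as "example.com" walks every reachable page whose URL contains that substring.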
The current graph does not support directed links, which would be a valuable addition.
I aim to provide the complete topology of the website based on different heuristics:
We could add more command options beyond the current selection:
It's important to improve the usability of the tool with these options:
Coma is a flexible and straightforward tool for anyone needing to scrape data from websites quickly. Users can easily customize their scraping experience through various commands and options, making it suitable for a wide range of web data extraction tasks.