# Stream crawler
`stream-scraper` is a Rust crate that provides an asynchronous web crawling utility. It processes URLs, extracts content and child URLs, and handles retry attempts for failed requests. It uses the `tokio` runtime for asynchronous operations and the `reqwest` library for HTTP requests.
## Features
- Asynchronous crawling using `tokio`
- Extracts URLs from `` tags in HTML
- Retries failed requests up to a specified number of attempts
- Limits the number of concurrent requests using a semaphore
## Installation
Add this to your `Cargo.toml`:
```toml
[dependencies]
stream_crawler = "0.1.0"
tokio = { version = "1", features = ["full"] }
reqwest = { version = "0.11", features = ["json"] }
scraper = "0.12"
```
## Usage
```rust
use stream_crawler::scrape;
use tokio_stream::StreamExt;
#[tokio::main]
async fn main() {
let urls = vec![
String::from("https://www.google.com"),
String::from("https://www.twitter.com"),
];
let mut result_stream = scrape(urls, 3, 5, 10).await;
while let Some(data) = result_stream.next().await {
println!("Processed URL: {:?}", data);
}
}
```
### Functionality
1. **`scrape` function** :
* Takes a vector of URLs, a retry attempt limit, and a maximum number of concurrent processes.
* Returns a stream of `ProcessedUrl` structures.
1. **`ProcessedUrl` structure** :
* Contains the original URL, the parent URL (if any), the HTML content of the page, and a list of child URLs extracted from `` tags.
### Example
This example demonstrates how to use the `scrape` function to process a list of URLs.
```rust
use stream_crawler::scrape;
use tokio_stream::StreamExt;
#[tokio::main]
async fn main() {
let urls = vec![
String::from("https://www.google.com"),
String::from("https://www.twitter.com"),
];
let mut result_stream = scrape(urls, 3, 5, 10).await;
while let Some(data) = result_stream.next().await {
println!("Processed URL: {:?}", data);
}
}
```
## Documentation
Refer to the inline documentation for detailed usage and examples.
### ` ProcessedUrl`
```rust
#[derive(Debug, PartialEq)]
pub struct ProcessedUrl {
pub parent: Option,
pub url: String,
pub content: String,
pub children: Vec,
}
```
## Contributing
Contributions are welcome! Please open an issue or submit a pull request.
## License
This project is licensed under the MIT License.
This `README.md` provides an overview of the crate, its features, installation instructions, and usage examples. You can customize it further based on your specific requirements.