# archive.is api exploration

### trying to archive a site

clicking the "save" button with a url in the bar calls the following api:

**Request URL**
```
Request URL: https://archive.ph/submit/?submitid=XsyYbX9a8aiFaUmVbhCCFgl1x0ukSnnULKgw52ZfM6JsmspLFezx51rSKUJnnmqg&url=https%3A%2F%2Fwww.theinformation.com%2Farticles%2Falphabets-google-and-deepmind-pause-grudges-join-forces-to-chase-openai%3Futm_source%3Dti_app
Request Method: GET
Status Code: 302
Remote Address: 91.193.43.144:443
Referrer Policy: strict-origin-when-cross-origin
```

**Payload (decoded):**
- **submitid:** XsyYbX9a8aiFaUmVbhCCFgl1x0ukSnnULKgw52ZfM6JsmspLFezx51rSKUJnnmqg
- **url:** https://www.theinformation.com/articles/alphabets-google-and-deepmind-pause-grudges-join-forces-to-chase-openai?utm_source=ti_app

**Payload (encoded):**
- **submitid:** XsyYbX9a8aiFaUmVbhCCFgl1x0ukSnnULKgw52ZfM6JsmspLFezx51rSKUJnnmqg
- **url:** https%3A%2F%2Fwww.theinformation.com%2Farticles%2Falphabets-google-and-deepmind-pause-grudges-join-forces-to-chase-openai%3Futm_source%3Dti_app

**Response Headers**
```
cache-control: private, no-cache, no-store, must-revalidate, maxage=0
content-length: 0
date: Thu, 30 Mar 2023 20:02:58 GMT
expires: Sat, 01 Jan 2000 00:00:00 GMT
location: https://archive.ph/x2vQs
pragma: no-cache
server: nginx
x-host: p-archiveweb31
```

**Request Headers**
```
:authority: archive.ph
:method: GET
:path: /submit/?submitid=XsyYbX9a8aiFaUmVbhCCFgl1x0ukSnnULKgw52ZfM6JsmspLFezx51rSKUJnnmqg&url=https%3A%2F%2Fwww.theinformation.com%2Farticles%2Falphabets-google-and-deepmind-pause-grudges-join-forces-to-chase-openai%3Futm_source%3Dti_app
:scheme: https
accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7
accept-encoding: gzip, deflate, br
accept-language: en,en-GB;q=0.9,en-US;q=0.8,en-CA;q=0.7,es-US;q=0.6,es;q=0.5,it-IT;q=0.4,it;q=0.3
cookie: ga=GA1.2.661111166.1680206568
referer: https://archive.ph/
sec-ch-ua: "Google Chrome";v="111", "Not(A:Brand";v="8", "Chromium";v="111"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "Linux"
sec-fetch-dest: document
sec-fetch-mode: navigate
sec-fetch-site: same-origin
sec-fetch-user: ?1
upgrade-insecure-requests: 1
user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36
```

afterwards, archive.is redirects the user to the url listed in the `location` field of the response headers above. this is the actual archive entry users will use.

### second run

running the save request a second time (this time without the `?utm_source` parameter in the requested url), we get:

**Request URL**
```
Request URL: https://archive.is/submit/?submitid=ZZiwDA836TdRU7X0tKLjBaqeQRi6F%2Bae2rYWPATBD6BsmspLFezx51rSKUJnnmqg&url=https%3A%2F%2Fwww.theinformation.com%2Farticles%2Falphabets-google-and-deepmind-pause-grudges-join-forces-to-chase-openai
Request Method: GET
Status Code: 302
Remote Address: 91.193.43.144:443
Referrer Policy: strict-origin-when-cross-origin
```

**Payload (decoded):**
- **submitid:** ZZiwDA836TdRU7X0tKLjBaqeQRi6F+ae2rYWPATBD6BsmspLFezx51rSKUJnnmqg
- **url:** https://www.theinformation.com/articles/alphabets-google-and-deepmind-pause-grudges-join-forces-to-chase-openai

**Payload (encoded):**
- **submitid:** ZZiwDA836TdRU7X0tKLjBaqeQRi6F%2Bae2rYWPATBD6BsmspLFezx51rSKUJnnmqg
- **url:** https%3A%2F%2Fwww.theinformation.com%2Farticles%2Falphabets-google-and-deepmind-pause-grudges-join-forces-to-chase-openai

**Response Headers**
```
cache-control: private, no-cache, no-store, must-revalidate, maxage=0
content-length: 0
date: Thu, 30 Mar 2023 20:12:14 GMT
expires: Sat, 01 Jan 2000 00:00:00 GMT
location: https://archive.is/GkZPl
pragma: no-cache
server: nginx
x-host: p-archiveweb31
```

**Request Headers**
```
:authority: archive.is
:method: GET
:path: /submit/?submitid=ZZiwDA836TdRU7X0tKLjBaqeQRi6F%2Bae2rYWPATBD6BsmspLFezx51rSKUJnnmqg&url=https%3A%2F%2Fwww.theinformation.com%2Farticles%2Falphabets-google-and-deepmind-pause-grudges-join-forces-to-chase-openai
:scheme: https
accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7
accept-encoding: gzip, deflate, br
accept-language: en,en-GB;q=0.9,en-US;q=0.8,en-CA;q=0.7,es-US;q=0.6,es;q=0.5,it-IT;q=0.4,it;q=0.3
cookie: a=GA1.2.661111166.1680206997
referer: https://archive.is/
sec-ch-ua: "Google Chrome";v="111", "Not(A:Brand";v="8", "Chromium";v="111"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "Linux"
sec-fetch-dest: document
sec-fetch-mode: navigate
sec-fetch-site: same-origin
sec-fetch-user: ?1
upgrade-insecure-requests: 1
user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36
```

### third run (on a website that's never been archived)

**Request URL**
```
Request URL: https://archive.is/submit/?submitid=ly1WQYvPxnOtd%2Fd9OJdEu7b7QwwdcZq%2FvfxsGZh1LjBsmspLFezx51rSKUJnnmqg&url=https%3A%2F%2Fgithub.com%2Funsafeoats
Request Method: GET
Status Code: 200
Remote Address: 91.193.43.144:443
Referrer Policy: strict-origin-when-cross-origin
```

**Response Headers**
```
accept-ranges: bytes
cache-control: private, no-cache, no-store, must-revalidate, maxage=0
content-encoding: gzip
content-length: 244
content-type: text/html;charset=utf-8
date: Thu, 30 Mar 2023 20:55:39 GMT
expires: Sat, 01 Jan 2000 00:00:00 GMT
pragma: no-cache
refresh: 0;url=https://archive.is/wip/WJW3i
server: nginx
x-host: p-archiveweb31
```

**Request Headers**
```
:authority: archive.is
:method: GET
:path: /submit/?submitid=ly1WQYvPxnOtd%2Fd9OJdEu7b7QwwdcZq%2FvfxsGZh1LjBsmspLFezx51rSKUJnnmqg&url=https%3A%2F%2Fgithub.com%2Funsafeoats
:scheme: https
accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7
accept-encoding: gzip, deflate, br
accept-language: en,en-GB;q=0.9,en-US;q=0.8,en-CA;q=0.7,es-US;q=0.6,es;q=0.5,it-IT;q=0.4,it;q=0.3
cookie: ga=GA1.2.661111166.1680209721
referer: https://archive.is/
sec-ch-ua: "Google Chrome";v="111", "Not(A:Brand";v="8", "Chromium";v="111"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "Linux"
sec-fetch-dest: document
sec-fetch-mode: navigate
sec-fetch-site: same-origin
sec-fetch-user: ?1
upgrade-insecure-requests: 1
user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36
```

### questions

1) where does the `?submitid` field in the `/submit` route come from? is it just random or does it need to be deterministic?
    - answer found: this is a unique code generated every time someone opens the archive homepage. if you hit `https://archive.is` with a get request, the html returned will have the following section:
    ```html
    ```
    - the `submitid` can be extracted from here.

### general workflow for archiving site

1) send a get request to `https://archive.is` and extract the `submitid` value
2) send a get request to `https://archive.is/submit/?submitid={encoded submitid}&url={encoded url}`
3) check the response headers to see if `location` or `refresh` is present
    - if `location` is present, extract `location` and return its value
        * the status codes observed in this situation have all been `302` so far
    - if `location` is not present but `refresh` is (with pattern `0;url=https://archive.is/wip/{new archive identifier}`), extract the future url (`https://archive.is/{new archive identifier}`) and return it
        * the status codes observed in this situation have all been `200` so far
        * can also monitor the `wip` version of the url found above and wait until it returns a `location` field before returning the url
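
steps 2 and 3 of the workflow above can be sketched in python as two small pure functions. this is only a sketch based on the three captured runs, not a documented archive.is api: the function names `build_submit_url` and `archive_url_from_headers` are made up for illustration, and the sketch assumes the `location` / `refresh` behaviour seen above holds in general. step 1 (extracting `submitid` from the homepage html) and the actual http calls are left out on purpose.

```python
from typing import Mapping, Optional
from urllib.parse import quote

ARCHIVE_HOST = "https://archive.is"


def build_submit_url(submitid: str, url: str) -> str:
    """Build the /submit/ url (step 2), percent-encoding both values.

    safe="" makes quote() encode "/" and "+" too, matching the
    encoded payloads in the captured requests.
    """
    return (
        f"{ARCHIVE_HOST}/submit/"
        f"?submitid={quote(submitid, safe='')}&url={quote(url, safe='')}"
    )


def archive_url_from_headers(headers: Mapping[str, str]) -> Optional[str]:
    """Work out the archive url from the /submit/ response headers (step 3).

    Assumption: only the two cases observed in the runs above exist.
    Returns None if neither header is usable.
    """
    lowered = {name.lower(): value for name, value in headers.items()}

    # case 1: page already archived, `location` points at the entry
    if "location" in lowered:
        return lowered["location"]

    # case 2: new archive in progress, e.g.
    # refresh: 0;url=https://archive.is/wip/WJW3i
    _, found, target = lowered.get("refresh", "").partition("url=")
    if found and "/wip/" in target:
        # drop the /wip/ segment to get the future url
        return target.replace("/wip/", "/", 1)

    return None
```

when wiring this up to a real http client, automatic redirect following would need to be turned off for the `/submit/` request, otherwise the client follows the 302 and the `location` header is never seen directly.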