From 016cbf284e5c35bddaea5e2575105f29719e1117 Mon Sep 17 00:00:00 2001
From: Yo'av Moshe
Date: Fri, 6 Sep 2024 20:38:27 +0200
Subject: [PATCH] Update README.md

---
 README.md | 82 +++++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 58 insertions(+), 24 deletions(-)

diff --git a/README.md b/README.md
index 56d9a3c..94b0287 100644
--- a/README.md
+++ b/README.md
@@ -94,7 +94,11 @@ The only required argument for Pipet is the path to your `.pipet` file. Other th
 - `--help`, `-h` - Show help
 
 # Pipet files
-Pipet files describe where and how to get the data you are interested in. They are normal text files containing one or more blocks, separated with an empty line. Line beginning with `//` are ignored and can be used for comments. Every block has at least 2 sections - the first line containing the URL and the tool we are using for scraping, and the following lines describing the selectors reaching the data we would like scrap. Some blocks can end with a special last line pointing to the "next page" selector - more on that later.
+Pipet files describe where and how to get the data you are interested in. They are normal text files containing one or more blocks, separated by an empty line. Lines beginning with `//` are ignored and can be used for comments. Every block can have 3 sections:
+
+1. **Resource** - The first line containing the URL and the tool we are using for scraping
+2. **Queries** - The following lines describing the selectors reaching the data we would like to scrape
+3. **Next page** - An _optional_ last line starting with `>` describing the selector pointing to the "next page" of data
 
 Below is an example Pipet file.
 
@@ -115,40 +119,70 @@ playwright https://github.com/bjesus/pipet
 Array.from(document.querySelectorAll('.about-margin .Link')).map(e => e.innerText.trim()).filter(t=> /^\d/.test(t) )
 ```
 
-Blocks can start with either `curl` or `playwright`. 
Pipet doesn't just call these things `curl` because it's cool - it actually uses curl to fetch the resource. This might sound weird, but it's meant so that you can use your browser to find the request containing the information you are interested in, right click it, choose "Copy as cURL", and paste in your Pipet file. This ensures that your headers and cookies are all the same, making it very easy to get data which is behind a login page or is hidden from bots.
+## Resource
+
+Resource lines can start with either `curl` or `playwright`.
 
-Starting a block with `playwright` will use a headless browser to navigate to the specified URL.
+### curl
 
-The lines following the first line are your _queries_. There are 3 different type of queries - for HTML files, for JSON files, and for websites loaded using `playwright`.
+Resource lines starting with `curl` are executed with curl itself. This might sound weird, but it's meant so that you can use your browser to find the request containing the information you are interested in, right-click it, choose "Copy as cURL", and paste it into your Pipet file. This ensures that your headers and cookies are all the same, making it very easy to get data that is behind a login page or is hidden from bots. For example, this is a perfectly valid first line for a block: `curl 'https://news.ycombinator.com/' --compressed -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:131.0) Gecko/20100101 Firefox/131.0' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/png,image/svg+xml,*/*;q=0.8' -H 'Accept-Language: en-US,en;q=0.5' -H 'Accept-Encoding: gzip, deflate, br, zstd' -H 'DNT: 1' -H 'Sec-GPC: 1' -H 'Connection: keep-alive' -H 'Upgrade-Insecure-Requests: 1' -H 'Sec-Fetch-Dest: document' -H 'Sec-Fetch-Mode: navigate' -H 'Sec-Fetch-Site: none' -H 'Sec-Fetch-User: ?1' -H 'Priority: u=0, i' -H 'Pragma: no-cache' -H 'Cache-Control: no-cache' -H 'TE: trailers'`.
 
-## HTML Queries
-HTML Queries use CSS Selectors to point select specific elements. Whitespace nesting is used for iterations - parent lines will run as iterators, making their children lines run for each occurance of the parent selector. This means that you can use nesting to determine the structure of your final output. When writing your child selectors, note that the whole document isn't available anymore, and only the parent document is present during the iteration.
+### Playwright
 
-By defult, Pipet will return the `innerText` of your elements. If you need to another piece of data, use Pipes. When piping HTML elements, Pipet will pipe the element's complete HTML to the receiving program.
+Resource lines starting with `playwright` will use a headless browser to navigate to the specified URL. If you don't have a headless browser installed, Pipet will attempt to download one.
 
-## JSON Queries
-JSON Queries use GJSON to point select specific elements. Here too, whitespace nesting is used for iterations - parent lines will run as iterators, making their children lines run for each occurance of the parent selector. If you don't like GJSON, you can always use Pipes extract your data in other ways, for example with `jq`. See more examples below.
+## Queries
 
-When using pipes with to send data to program that return valid JSON, Pipet will parse the JSON and embed it in its final output.
+Query lines define 3 things:
+1. The path to the exact pieces of data you would like to extract (e.g. using CSS selectors)
+2. The data structure your output will use (e.g. every title and URL should be grouped together by item)
+3. The way the data will be processed (e.g. using Unix pipes) before it is printed
 
-## Playwright Queries
-Playwright Queries are different and do not use whitespace nesting. Instead, queries here are simply JavaScript code that will be evaluated after the webpage loaded. 
If the JavaScript code returns something that can be serialized as JSON, it will be included in Pipet's output. Otherwise, you can write JavaScript that will click, scroll or perform any othe action you might want.
+Pipet uses 3 different query types - for HTML, for JSON, and for pages loaded with Playwright.
 
-## Unix Pipes
-Sometimes CSS Selectors and GJSON aren't enough, or perhaps you just prefer using something you already know. This is why unix pipes are first class citizen in Pipet.
+### HTML Queries
+HTML Queries use CSS Selectors to select specific elements. Whitespace nesting is used for iterations - parent lines will run as iterators, making their child lines run for each occurrence of the parent selector. This means that you can use nesting to determine the structure of your final output. See the following 3 examples:
+
Get only the first title and first URL
+
 ```
 curl https://news.ycombinator.com/
-span.yclinks a
-  body
-  body | htmlq --attribute href a
-  body | htmlq --attribute href a | wc -c
+.title .titleline > a
+.sitebit a
+```
+
+
Get all the titles, and then get all URLs
+
+```
+curl https://news.ycombinator.com/
+.title .titleline
+  span > a
+.title .titleline
+  .sitebit a
+```
 
-curl http://localhost:8000/some.json
-people
-  name
-people | jq keys
-@this | jq '[.products[].name]'
+
Get the title and URL for each story
+
 ```
+curl https://news.ycombinator.com/
+.title .titleline
+  span > a
+  .sitebit a
+```
+
+
+When writing your child selectors, note that the whole document isn't available anymore. Pipet passes only the parent element's HTML to the child iterations.
+
+By default, Pipet will return the `innerText` of your elements. If you need another piece of data, use Unix pipes. When piping HTML elements, Pipet will pipe the element's complete HTML. For example, you can use `| htmlq --attribute href a` to extract the `href` attribute from links.
+
+### JSON Queries
+
+JSON Queries use the [GJSON syntax](https://github.com/tidwall/gjson/blob/master/SYNTAX.md) to select specific elements. Here too, whitespace nesting is used for iterations - parent lines will run as iterators, making their child lines run for each occurrence of the parent selector. If you don't like GJSON, that's okay. For example, you can use `jq` by passing parts of the JSON, or all of it, through Unix pipes, like `@this | jq '.[].firstName'`.
+
+When using pipes, Pipet will try to parse the returned string. If it's JSON, it will be parsed and injected as an object into the Pipet result.
+
+### Playwright Queries
+
+Playwright Queries are different and do not use whitespace nesting. Instead, queries here are simply JavaScript code that will be evaluated after the webpage has loaded. If the JavaScript code returns something that can be serialized as JSON, it will be included in Pipet's output. Otherwise, you can write JavaScript that will click, scroll or perform any other action you might want.
 
-## Next page nav
+## Next page