diff --git a/v2/content/post/add-archivarix-archives-to-hugo/Screenshot 2022-12-05 at 19-46-57 Posts HugoPoi Blog.png b/v2/content/post/add-archivarix-archives-to-hugo/Screenshot 2022-12-05 at 19-46-57 Posts HugoPoi Blog.png
new file mode 100644
index 0000000..24d123d
Binary files /dev/null and b/v2/content/post/add-archivarix-archives-to-hugo/Screenshot 2022-12-05 at 19-46-57 Posts HugoPoi Blog.png differ
diff --git a/v2/content/post/add-archivarix-archives-to-hugo/Screenshot from 2022-12-18 20-13-35.png b/v2/content/post/add-archivarix-archives-to-hugo/Screenshot from 2022-12-18 20-13-35.png
new file mode 100644
index 0000000..6694349
Binary files /dev/null and b/v2/content/post/add-archivarix-archives-to-hugo/Screenshot from 2022-12-18 20-13-35.png differ
diff --git a/v2/content/post/add-archivarix-archives-to-hugo/index.md b/v2/content/post/add-archivarix-archives-to-hugo/index.md
index 57b027f..f1418ad 100644
--- a/v2/content/post/add-archivarix-archives-to-hugo/index.md
+++ b/v2/content/post/add-archivarix-archives-to-hugo/index.md
@@ -1,7 +1,7 @@
 ---
 title: "Add Archivarix archives to Hugo"
 date: 2022-11-06T14:27:04+01:00
-draft: true
+draft: false
 ---
 
 I want to add all my old articles to the Hugo posts list page.
@@ -14,7 +14,7 @@ Let's write some code.
 * Sorted by reverse date of publication
 * With the title
 
-First, I discover that GoHugo handle override over files, if you a file
+First, I discover that GoHugo handles file overrides: if you have a file
 in `/themes//static/js/jquery.min.js`, you can override it with a file
 in `/static/js/jquery.min.js`. So I think I don't need a custom theme, so
 let's remove that.
@@ -57,12 +57,11 @@ I will not use this solution we can't have title with it.
 ## Proof of concept with webcrawl csv file
 
-In an other life, I develop a little web crawler or spider that can list
+Some time ago, I developed a [little web crawler or spider](https://github.com/HugoPoi/webcrawler) that can list
 all the urls and robot metadatas for a given website.
 
-1. `git clone `
-1. `npm install`
-1. `node console.js http://localhost:8080 --noindex --nofollow --progress` will create a file called `localhost_urls.csv`
+1. `npm install -g hugopoi-webcrawler`
+1. `hugopoi-webcrawler http://localhost:8080 --progress` will create a file called `localhost_urls.csv`
 
 ```csv
 "url","statusCode","metas.title","metas.robots","metas.canonical","metas.lang","parent.url"
@@ -100,22 +99,55 @@ documentation of Hugo
 {{ end }}
 ```
 
- This solution is promising
- // TODO IMAGE
+ This solution is promising.
 
-## Webcrawler
+ {{< figureCupper
+img="Screenshot 2022-12-05 at 19-46-57 Posts HugoPoi Blog.png"
+caption="Blog page with legacy articles poc with empty titles"
+command="Fill"
+options="1024x500 Bottom" >}}
 
-It's simple crawler/webspider, I develop some times ago for the fun.
-`./console.js https://blog.hugopoi.net/ --output-format json --progress`
-`> crawled 499 urls. average speed: 37.32 urls/s, totalTime: 13s`
+## Refining the webcrawler and the theme mod
 
-`jq '. | length' blog.hugopoi.net_urls.json`
+* Let's use a JSON file instead of CSV
+* Filter only article urls and order them by date
+
+First I add an `--output-format json` option to my webcrawler.
+

+
+The usage becomes:
+```shell
+hugopoi-webcrawler https://blog.hugopoi.net/ --output-format json --progress
+> crawled 499 urls. average speed: 37.32 urls/s, totalTime: 13s
+```
+Now we can work on the data with jq; for example `jq '. | length' blog.hugopoi.net_urls.json` counts the crawled urls.
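Each entry in the JSON dump mirrors the csv columns shown earlier. Roughly, the shape is (values here are illustrative):

```json
[
  {
    "url": "https://blog.hugopoi.net/2016/01/01/some-article/",
    "statusCode": 200,
    "metas": {
      "title": "Some article – HugoPoi",
      "robots": null,
      "canonical": "https://blog.hugopoi.net/2016/01/01/some-article/",
      "lang": "fr-FR"
    },
    "parent": { "url": "https://blog.hugopoi.net/" }
  }
]
```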
 
 Now let's filter this file and order it.
+* Remove `?replytocom` duplicate urls and error responses without a title
 
-`jq '. | map(select((.metas.title != null) and (.url | test("\\?replytocom") == false))) | .[].url' blog.hugopoi.net_urls.json`
+    `jq '. | map(select((.metas.title != null) and (.url | test("\\?replytocom") == false))) | .[].url' blog.hugopoi.net_urls.json`
 
-`jq '. | map(select((.metas.title != null) and (.url | test("(\\?replytocom|^https://blog.hugopoi.net/v2)") == false) and (.url | test("/[0-9]{4}/[0-9]{2}/[0-9]{2}/[^/]+/$")))) | sort_by(.url) | reverse' blog.hugopoi.net_urls.json > blog.hugopoi.net_urls.filtered.json`
+* Select only urls that contain a date pattern, because my WordPress
+urls were built with the `/YYYY/MM/DD/THE_TITLE` pattern.
+
+    `jq '. | map(select((.metas.title != null) and (.url | test("(\\?replytocom|^https://blog.hugopoi.net/v2)") == false) and (.url | test("/[0-9]{4}/[0-9]{2}/[0-9]{2}/[^/]+/$")))) | sort_by(.url) | reverse' blog.hugopoi.net_urls.json > blog.hugopoi.net_urls.filtered.json`
 
 * Remove ` – HugoPoi` from the titles
-`jq '. | map(.metas.title |= sub(" – HugoPoi"; "")) | .[].metas.title' blog.hugopoi.net_urls.filtered.json`
+
+    `jq '. | map(.metas.title |= sub(" – HugoPoi"; "")) | .[].metas.title' blog.hugopoi.net_urls.filtered.json`
+
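The three cleanup steps above can be chained into a single jq pipeline. Here is a runnable sketch on a tiny inline sample (file names and sample values are illustrative; the filters are the ones used above):

```shell
# Tiny stand-in for the crawler output: one real article url,
# one ?replytocom duplicate, and one error page without a title.
cat > sample_urls.json <<'EOF'
[
  {"url": "https://blog.hugopoi.net/2016/01/01/some-article/", "metas": {"title": "Some article – HugoPoi"}},
  {"url": "https://blog.hugopoi.net/2016/01/01/some-article/?replytocom=42", "metas": {"title": "Some article – HugoPoi"}},
  {"url": "https://blog.hugopoi.net/broken/", "metas": {"title": null}}
]
EOF

# Filter, clean the titles, then sort newest first in one pass.
jq 'map(select((.metas.title != null)
      and (.url | test("(\\?replytocom|^https://blog.hugopoi.net/v2)") == false)
      and (.url | test("/[0-9]{4}/[0-9]{2}/[0-9]{2}/[^/]+/$"))))
  | map(.metas.title |= sub(" – HugoPoi"; ""))
  | sort_by(.url) | reverse' sample_urls.json > sample_urls.filtered.json
```

Only the first sample url survives: the duplicate and the titleless page are dropped, and the title suffix is stripped.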
+Now we have a proper [urls data source](https://home.hugopoi.net/gitea/hugopoi/blog.hugopoi.net/src/commit/681d85997a598e9de06820b2c6ccc1fbc4e128c6/v2/data/LegacyBlogUrls.json#L2-L32).
+

+
+## Override the Hugo Theme layout
+
+We now have a `data/LegacyBlogUrls.json` file with all the urls I want to
+put on the blog posts index page.
+I copied the original `themes/cupper-hugo-theme/layouts/_default/list.html` to `layouts/_default/list.html`.
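Hugo exposes files in `data/` under `.Site.Data`, so the copied layout can render the legacy urls with a loop over the data file. A minimal sketch of such a loop, assuming the field names from the crawler output above (the actual markup in the modified template differs):

```html
<ul>
  {{ range .Site.Data.LegacyBlogUrls }}
  <li>
    <a href="{{ .url }}">{{ .metas.title }}</a>
  </li>
  {{ end }}
</ul>
```

Since the entries keep the crawler's fields (`url`, `metas.title`, …), no front matter is needed for the legacy posts.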
+

+
+  {{< figureCupper
+img="Screenshot from 2022-12-18 20-13-35.png"
+caption="Blog page with legacy articles final version"
+command="Fit"
+options="1024x500" >}}