blog.hugopoi.net/v2/content/post/add-archivarix-archives-to-hugo/index.md
2022-12-21 23:26:50 +01:00

---
title: "Add Archivarix archives to Hugo"
date: 2022-12-18T18:27:04+01:00
tags: ["this blog","Archivarix", "gohugo"]
---
I want to add all my old articles to the Hugo posts list page.
Let's write some code.
* I can use the Archivarix sitemap as a source
* Or I can use the SQLite database as a source
* I want to add all the canonical pages to the list
* Sorted by publication date, newest first
* With their titles

First, I discovered that Hugo lets site files override theme files: if the theme ships
`/themes/<THEME>/static/js/jquery.min.js`, you can override it with a
file in `/static/js/jquery.min.js`. So I don't think I need a custom
theme; let's remove it.
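
As a rough mental model of that override rule (a simplified Python sketch, not Hugo's actual implementation), the lookup can be pictured as "project file first, theme file second". The function name `resolve_static` and the placeholder file contents are made up for illustration.

```python
import tempfile
from pathlib import Path
from typing import Optional


def resolve_static(site_root: Path, theme: str, relative: str) -> Optional[Path]:
    """Return the copy of a static file that would win: project first, then theme."""
    candidates = [
        site_root / "static" / relative,                     # project-level override
        site_root / "themes" / theme / "static" / relative,  # theme default
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    # Build a throwaway site tree with the same file in both places.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for folder in (root / "static/js", root / "themes/cupper-hugo-theme/static/js"):
            folder.mkdir(parents=True)
            (folder / "jquery.min.js").write_text("// placeholder")
        winner = resolve_static(root, "cupper-hugo-theme", "js/jquery.min.js")
        print(winner.relative_to(root))  # the project-level copy
```

The same precedence is what makes the layout override later in this post work without forking the theme.
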
## Proof of concept with a sitemap
1. First, I changed `index.php` to add a sitemap path, which enables sitemap generation in the Archivarix loader.
1. Generate the sitemap: `wget http://localhost:8080/sitemap.xml`
1. Then I discovered that the sitemap specification has no title field, so this is a dead end.
1. Place `sitemap.xml` in `/data/legacyblog/sitemap.xml`
1. Let's PoC the change in our Hugo theme, in `layouts/_default/list.html`
```html
<!-- Hugo loads the data file and parses it -->
{{ range $.Site.Data.legacyblog.sitemap.url }}
<li>
<h2>
<a href="{{ .loc }}">
<svg
class="bookmark"
aria-hidden="true"
viewBox="0 0 40 50"
focusable="false"
>
<use href="#bookmark"></use>
</svg>
{{ .loc }}
</a>
</h2>
</li>
{{ end }}
```
I won't use this solution, because a sitemap can't give us the page titles.
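
To make the limitation concrete, here is a small Python sketch that lists the fields of each sitemap entry. The sample XML is made up (the URL is only an illustration); the sitemaps.org 0.9 format defines `loc`, `lastmod`, `changefreq` and `priority` for a `<url>` entry, and nothing for a title.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the sitemaps.org 0.9 format, not the real generated file.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://localhost:8080/2021/12/28/example-post/</loc>
    <lastmod>2021-12-28</lastmod>
  </url>
</urlset>
"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def list_entries(xml_text: str) -> list:
    """Return one dict per <url> entry, keyed by child tag name without namespace."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        # Tags look like "{namespace}loc"; keep only the local part.
        entries.append({child.tag.split("}", 1)[1]: child.text for child in url})
    return entries


if __name__ == "__main__":
    for entry in list_entries(SAMPLE_SITEMAP):
        print(sorted(entry))  # the available field names; no title among them
```
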
## Proof of concept with webcrawl csv file
Some time ago, I developed a [little web crawler (or spider)](https://github.com/HugoPoi/webcrawler) that lists
all the URLs and robots metadata of a given website.
1. `npm install -g hugopoi-webcrawler`
1. `hugopoi-webcrawler http://localhost:8080 --progress` creates a file called `localhost_urls.csv`
```csv
"url","statusCode","metas.title","metas.robots","metas.canonical","metas.lang","parent.url"
"http://localhost:8080/",200,"HugoPoi Internet, Hardware et Bidouille","max-image-preview:large",,"fr-FR",
"http://localhost:8080/v2/",200,"HugoPoi Blog",,"http://localhost:1313/v2/","en","http://localhost:8080/"
"http://localhost:8080/en/",200,"How to decrypt flows_cred.json from NodeRED data ? HugoPoi","max-image-preview:large","http://localhost:8080/en/2021/12/28/how-to-decrypt-flows_cred-json-from-nodered-data/","en-US","http://localhost:8080/"
```
1. Then we put this file outside of the data directory, as mentioned in
the Hugo documentation
1. Modify the template to parse the file with the `getCSV` function
```html
<!-- Loop against csv lines -->
{{ range $i,$line := getCSV "," "./localhost_urls.csv" }}
<!-- Fill variables with columns -->
{{ $url := index $line 0 }}
{{ $title := index $line 2 }}
<!-- Skip csv head line and replytocom wordpress urls -->
{{ if and (ne $i 0) (eq (len (findRE `replytocom` $url 1)) 0)}}
<li>
<h2>
<a href="{{ $url }}">
<svg
class="bookmark"
aria-hidden="true"
viewBox="0 0 40 50"
focusable="false"
>
<use href="#bookmark"></use>
</svg>
{{ $title }}
</a>
</h2>
</li>
{{ end }}
{{ end }}
```
This solution is promising.
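
For readers less familiar with Hugo templates, the same row logic can be sketched in Python: skip the header line, skip `replytocom` URLs, and keep columns 0 (URL) and 2 (title). The first rows of the sample come from the crawl output above; the `replytocom` row and the function name `legacy_links` are invented for the example.

```python
import csv
import io

# Rows 1-3 are from the crawl output shown earlier; the last row is hypothetical.
SAMPLE_CSV = '''"url","statusCode","metas.title","metas.robots","metas.canonical","metas.lang","parent.url"
"http://localhost:8080/",200,"HugoPoi Internet, Hardware et Bidouille","max-image-preview:large",,"fr-FR",
"http://localhost:8080/v2/",200,"HugoPoi Blog",,"http://localhost:1313/v2/","en","http://localhost:8080/"
"http://localhost:8080/example/?replytocom=42",200,"Example",,,"fr-FR","http://localhost:8080/"
'''


def legacy_links(csv_text: str) -> list:
    """Return (url, title) pairs, skipping the header row and replytocom URLs."""
    links = []
    for i, row in enumerate(csv.reader(io.StringIO(csv_text))):
        if i == 0 or not row or "replytocom" in row[0]:
            continue
        links.append((row[0], row[2]))  # column 0 is the URL, column 2 the title
    return links


if __name__ == "__main__":
    for url, title in legacy_links(SAMPLE_CSV):
        print(title, "->", url)
```
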
{{< figureCupper
img="Screenshot 2022-12-05 at 19-46-57 Posts HugoPoi Blog.png"
caption="Blog page with legacy articles poc with empty titles"
command="Fill"
options="1024x500 Bottom" >}}
## Refining the webcrawler and the theme mod
* Let's use a JSON file instead of CSV
* Keep only the article URLs and order them by date

First, I added an `--output-format json` option to my webcrawler.
<pre data-src="https://raw.githubusercontent.com/HugoPoi/webcrawler/e4675502af9ee133c32a5fe49ceba433461a1c00/console.js" data-range="50,60" class="line-numbers"></pre>
The usage becomes:
```shell
hugopoi-webcrawler https://blog.hugopoi.net/ --output-format json --progress
> crawled 499 urls. average speed: 37.32 urls/s, totalTime: 13s
```
Now we can inspect the data with jq, for example `jq '. | length' blog.hugopoi.net_urls.json` to count the crawled entries.
Now let's filter this file and order it.
* Remove the duplicate replytocom URLs and the error responses, which have no title
`jq '. | map(select((.metas.title != null) and (.url | test("\\?replytocom") == false))) | .[].url' blog.hugopoi.net_urls.json`
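* Keep only the URLs that contain a date pattern, because my WordPress
URLs were built with the `/YYYY/MM/DD/THE_TITLE` pattern. This step also drops the new blog under `/v2`, and sorting by URL in reverse gives reverse date order.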
`jq '. | map(select((.metas.title != null) and (.url | test("(\\?replytocom|^https://blog.hugopoi.net/v2)") == false) and (.url | test("/[0-9]{4}/[0-9]{2}/[0-9]{2}/[^/]+/$")))) | sort_by(.url) | reverse' blog.hugopoi.net_urls.json > blog.hugopoi.net_urls.filtered.json`
* Remove ` HugoPoi` from the titles
`jq '. | map(.metas.title |= sub(" HugoPoi"; "")) | .[].metas.title' blog.hugopoi.net_urls.filtered.json`
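
Put together, the three jq steps above amount to one filter, one sort and one title cleanup. Here is a Python sketch of the same pipeline for comparison; it assumes the JSON file is a list of objects with `url` and `metas.title` fields, as the jq expressions suggest, and the function name `filter_legacy_urls` is made up.

```python
import re

DATE_URL = re.compile(r"/[0-9]{4}/[0-9]{2}/[0-9]{2}/[^/]+/$")
EXCLUDED = re.compile(r"\?replytocom|^https://blog\.hugopoi\.net/v2")
TITLE_SUFFIX = " HugoPoi"  # same literal as in the jq `sub` call


def filter_legacy_urls(entries: list) -> list:
    """Keep dated article URLs with a title, newest first, with cleaned titles."""
    kept = [
        e for e in entries
        if e.get("metas", {}).get("title") is not None
        and not EXCLUDED.search(e["url"])
        and DATE_URL.search(e["url"])
    ]
    # URLs start with /YYYY/MM/DD/, so reverse URL order is reverse date order.
    kept.sort(key=lambda e: e["url"], reverse=True)
    return [
        {**e, "metas": {**e["metas"],
                        "title": e["metas"]["title"].replace(TITLE_SUFFIX, "", 1)}}
        for e in kept
    ]
```

Loading the crawl output with `json.load` and passing the result to `filter_legacy_urls` should give the same kind of list as the jq commands, provided the URLs are unique.
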
Now we have a proper [urls data source](https://home.hugopoi.net/gitea/hugopoi/blog.hugopoi.net/src/commit/681d85997a598e9de06820b2c6ccc1fbc4e128c6/v2/data/LegacyBlogUrls.json#L2-L32)
<pre data-src="https://home.hugopoi.net/gitea/hugopoi/blog.hugopoi.net/raw/branch/master/v2/data/LegacyBlogUrls.json" data-range="1,32" class="line-numbers"></pre>
## Override the Hugo Theme layout
We now have a `data/LegacyBlogUrls.json` file with all the URLs I want to
put on the blog posts index page.
I copied the original `themes/cupper-hugo-theme/layouts/_default/list.html` to `layouts/_default/list.html`.
<pre data-src="https://home.hugopoi.net/gitea/hugopoi/blog.hugopoi.net/raw/commit/681d85997a598e9de06820b2c6ccc1fbc4e128c6/v2/layouts/_default/list.html" data-range="30,48" class="line-numbers"></pre>
{{< figureCupper
img="Screenshot from 2022-12-18 20-13-35.png"
caption="Blog page with legacy articles final version"
command="Fit"
options="1024x500" >}}