---
title: "Add Archivarix archives to Hugo"
date: 2022-11-06T14:27:04+01:00
draft: false
---
I want to add all my old articles to the Hugo posts list page.
Let's write some code.
* I can use the Archivarix sitemap as the source
* Or I can use the SQLite database as the source
* I want to add all the canonical pages to the list
* Sorted by publication date, newest first
* With their titles
First, I discovered that Hugo lets site files override theme files: if the
theme ships `/themes/<THEME>/static/js/jquery.min.js`, you can override it with a
file in `/static/js/jquery.min.js`. So I don't think I need a custom
theme, and I can remove mine.
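As a minimal sketch of that override, the helper below copies a theme file to the same relative path at the site root, where Hugo gives it precedence over the theme's copy. The function name `override_theme_file` is hypothetical, and it assumes you run it from the root of the Hugo site.

```shell
# Hypothetical helper: copy one theme file to the site root so that
# Hugo uses the site-level copy instead of the theme's.
# $1 = theme folder name under themes/, $2 = path relative to the theme root
override_theme_file() {
  theme="$1"
  asset="$2"
  # Recreate the same directory layout at the site root
  mkdir -p "$(dirname "$asset")"
  cp "themes/$theme/$asset" "$asset"
}
```

For example, `override_theme_file cupper-hugo-theme static/js/jquery.min.js` would give an editable copy in `static/js/`, assuming the theme ships that file.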
## Proof of concept with a sitemap
1. First I changed `index.php` and added a sitemap path to enable
sitemap generation in the Archivarix loader.
1. Generate a sitemap: `wget http://localhost:8080/sitemap.xml`
1. Then I discovered that the sitemap specification has no title field, so it's a
dead end.
1. Place `sitemap.xml` in `/data/legacyblog/sitemap.xml`
1. Let's PoC the change in our Hugo theme in `layouts/_default/list.html`
```html
<!-- Loop over the url entries parsed from data/legacyblog/sitemap.xml -->
{{ range $.Site.Data.legacyblog.sitemap.url }}
<li>
  <h2>
    <a href="{{ .loc }}">
      <svg
        class="bookmark"
        aria-hidden="true"
        viewBox="0 0 40 50"
        focusable="false"
      >
        <use href="#bookmark"></use>
      </svg>
      {{ .loc }}
    </a>
  </h2>
</li>
{{ end }}
```
I won't use this solution because it can't give us the titles.
## Proof of concept with a web crawler CSV file
Some time ago, I developed a [little web crawler or spider](https://github.com/HugoPoi/webcrawler) that can list
all the URLs and robots metadata for a given website.
1. `npm install -g hugopoi-webcrawler`
1. `hugopoi-webcrawler http://localhost:8080 --progress` will create a file called `localhost_urls.csv`
```csv
"url","statusCode","metas.title","metas.robots","metas.canonical","metas.lang","parent.url"
"http://localhost:8080/",200,"HugoPoi – Internet, Hardware et Bidouille","max-image-preview:large",,"fr-FR",
"http://localhost:8080/v2/",200,"HugoPoi Blog",,"http://localhost:1313/v2/","en","http://localhost:8080/"
"http://localhost:8080/en/",200,"How to decrypt flows_cred.json from NodeRED data ? – HugoPoi","max-image-preview:large","http://localhost:8080/en/2021/12/28/how-to-decrypt-flows_cred-json-from-nodered-data/","en-US","http://localhost:8080/"
```
1. Then we put this file outside of the `data` directory, as mentioned in the
Hugo documentation
1. Modify the template to use the `getCSV` function
```html
<!-- Loop over the CSV lines -->
{{ range $i, $line := getCSV "," "./localhost_urls.csv" }}
<!-- Fill variables from the columns -->
{{ $url := index $line 0 }}
{{ $title := index $line 2 }}
<!-- Skip the CSV header line and the replytocom WordPress URLs -->
{{ if and (ne $i 0) (eq (len (findRE `replytocom` $url 1)) 0) }}
<li>
  <h2>
    <a href="{{ $url }}">
      <svg
        class="bookmark"
        aria-hidden="true"
        viewBox="0 0 40 50"
        focusable="false"
      >
        <use href="#bookmark"></use>
      </svg>
      {{ $title }}
    </a>
  </h2>
</li>
{{ end }}
{{ end }}
```
This solution is promising.
{{< figureCupper
img="Screenshot 2022-12-05 at 19-46-57 Posts HugoPoi Blog.png"
caption="Blog page with legacy articles poc with empty titles"
command="Fill"
options="1024x500 Bottom" >}}
## Refining the webcrawler and the theme mod
* Let's use a JSON file instead of CSV
* Keep only the article URLs and order them by date
First I added an `--output-format json` option to my web crawler.
<pre data-src="https://raw.githubusercontent.com/HugoPoi/webcrawler/e4675502af9ee133c32a5fe49ceba433461a1c00/console.js" data-range="50,60" class="line-numbers"></pre>
The usage becomes:
```shell
hugopoi-webcrawler https://blog.hugopoi.net/ --output-format json --progress
> crawled 499 urls. average speed: 37.32 urls/s, totalTime: 13s
```
Now we can handle the data with jq, for example to count the crawled URLs: `jq '. | length' blog.hugopoi.net_urls.json`
Now let's filter this file and order it.
* Remove the duplicate `replytocom` URLs and the error responses that have no title
`jq '. | map(select((.metas.title != null) and (.url | test("\\?replytocom") == false))) | .[].url' blog.hugopoi.net_urls.json`
* Keep only the URLs that contain a date pattern, because my WordPress
URLs were built with the `/YYYY/MM/DD/THE_TITLE` pattern.
`jq '. | map(select((.metas.title != null) and (.url | test("(\\?replytocom|^https://blog.hugopoi.net/v2)") == false) and (.url | test("/[0-9]{4}/[0-9]{2}/[0-9]{2}/[^/]+/$")))) | sort_by(.url) | reverse' blog.hugopoi.net_urls.json > blog.hugopoi.net_urls.filtered.json`
* Remove ` – HugoPoi` from the titles
`jq '. | map(.metas.title |= sub(" – HugoPoi"; "")) | .[].metas.title' blog.hugopoi.net_urls.filtered.json`
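The filtering steps above can also be chained into a single jq program, sketched here as a shell function. The name `filter_legacy_urls` is hypothetical. Unlike the last command above, which only prints the titles, this version keeps the whole objects so the result can be saved as a data file.

```shell
# Hypothetical wrapper around the jq filters used above.
# $1 = JSON file produced by the crawler; the result goes to stdout.
filter_legacy_urls() {
  jq '
    # Keep entries that have a title, are not replytocom or /v2 URLs,
    # and whose URL ends with /YYYY/MM/DD/THE_TITLE/
    map(select(
      (.metas.title != null)
      and (.url | test("(\\?replytocom|^https://blog.hugopoi.net/v2)") | not)
      and (.url | test("/[0-9]{4}/[0-9]{2}/[0-9]{2}/[^/]+/$"))
    ))
    # Strip the site name suffix from each title
    | map(.metas.title |= sub(" – HugoPoi"; ""))
    # The URLs start with the date, so sorting by URL sorts by date
    | sort_by(.url)
    | reverse
  ' "$1"
}
```

A possible usage, assuming the output should land in the site's data directory, is `filter_legacy_urls blog.hugopoi.net_urls.json > data/LegacyBlogUrls.json`.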
Now we have a proper [URLs data source](https://home.hugopoi.net/gitea/hugopoi/blog.hugopoi.net/src/commit/681d85997a598e9de06820b2c6ccc1fbc4e128c6/v2/data/LegacyBlogUrls.json#L2-L32)
<pre data-src="https://home.hugopoi.net/gitea/hugopoi/blog.hugopoi.net/raw/branch/master/v2/data/LegacyBlogUrls.json" data-range="1,32" class="line-numbers"></pre>
## Override the Hugo Theme layout
We now have a `data/LegacyBlogUrls.json` file with all the URLs I want to
put on the blog posts index page.
I copied the original `themes/cupper-hugo-theme/layouts/_default/list.html` to `layouts/_default/list.html`.
<pre data-src="https://home.hugopoi.net/gitea/hugopoi/blog.hugopoi.net/raw/commit/681d85997a598e9de06820b2c6ccc1fbc4e128c6/v2/layouts/_default/list.html" data-range="30,48" class="line-numbers"></pre>
{{< figureCupper
img="Screenshot from 2022-12-18 20-13-35.png"
caption="Blog page with legacy articles final version"
command="Fit"
options="1024x500" >}}