feat: publish add archives urls to hugo
parent 681d85997a
commit a5ae01212a
Binary image file added, not shown (3.1 MiB).
Binary image file added, not shown (148 KiB).

---
title: "Add Archivarix archives to Hugo"
date: 2022-11-06T14:27:04+01:00
draft: false
---

I want to add all my old articles to the Hugo posts list page.

[…]

Let's write some code.

* Sorted by reverse date of publication
* With the title

First, I discovered that GoHugo handles overriding files: if you have a file
in `/themes/<THEME>/static/js/jquery.min.js`, you can override it with a
file in `/static/js/jquery.min.js`. So I think I don't need a custom
theme; let's remove that.
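
For illustration, a minimal sketch of that override mechanism (the theme path is the cupper theme used later in this post; the patched file name is hypothetical):

```shell
# The theme ships its own copy of the asset:
#   themes/cupper-hugo-theme/static/js/jquery.min.js
# A file with the same relative path at the site root wins at build time:
mkdir -p static/js
cp my-patched-jquery.min.js static/js/jquery.min.js   # hypothetical patched copy
hugo   # the generated site now serves the site-level file
```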

[…]

I will not use this solution; we can't have titles with it.

## Proof of concept with the webcrawler CSV file

Some time ago, I developed a [little web crawler or spider](https://github.com/HugoPoi/webcrawler) that can list
all the urls and robots metadata for a given website.

1. `npm install -g hugopoi-webcrawler`
1. `hugopoi-webcrawler http://localhost:8080 --progress` will create a file called `localhost_urls.csv`

```csv
"url","statusCode","metas.title","metas.robots","metas.canonical","metas.lang","parent.url"
```

[…]

```
{{ end }}
```

This solution is promising.

{{< figureCupper
img="Screenshot 2022-12-05 at 19-46-57 Posts HugoPoi Blog.png"
caption="Blog page with legacy articles PoC, with empty titles"
command="Fill"
options="1024x500 Bottom" >}}

## Refining the webcrawler and the theme mod

* Let's use a JSON file instead of CSV
* Filter only the article urls and order them by date

First, I add an `--output-format json` option to my webcrawler.
<pre data-src="https://raw.githubusercontent.com/HugoPoi/webcrawler/e4675502af9ee133c32a5fe49ceba433461a1c00/console.js" data-range="50,60" class="line-numbers"></pre>

The usage becomes:

```shell
hugopoi-webcrawler https://blog.hugopoi.net/ --output-format json --progress
> crawled 499 urls. average speed: 37.32 urls/s, totalTime: 13s
```

Now we can handle the data with `jq '. | length' blog.hugopoi.net_urls.json`.
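
To eyeball the shape of a single record before filtering (a jq sketch; the field names are the ones from the CSV header shown earlier):

```shell
# pick the first crawled record and project a few fields
jq '.[0] | {url, status: .statusCode, title: .metas.title}' blog.hugopoi.net_urls.json
```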

Now let's filter this file and order it.

* Remove the `replytocom` duplicate urls and the error responses without a title:

`jq '. | map(select((.metas.title != null) and (.url | test("\\?replytocom") == false))) | .[].url' blog.hugopoi.net_urls.json`

* Select only the urls that contain a date pattern, because my WordPress
urls were built with the `/YYYY/MM/DD/THE_TITLE` pattern:

`jq '. | map(select((.metas.title != null) and (.url | test("(\\?replytocom|^https://blog.hugopoi.net/v2)") == false) and (.url | test("/[0-9]{4}/[0-9]{2}/[0-9]{2}/[^/]+/$")))) | sort_by(.url) | reverse' blog.hugopoi.net_urls.json > blog.hugopoi.net_urls.filtered.json`

* Remove ` – HugoPoi` from the titles:

`jq '. | map(.metas.title |= sub(" – HugoPoi"; "")) | .[].metas.title' blog.hugopoi.net_urls.filtered.json`
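
Putting the steps together, the whole cleanup can be written as one commented jq program (a sketch equivalent to the commands above):

```shell
jq '
  map(select(
    (.metas.title != null)                                                      # drop error responses without a title
    and (.url | test("(\\?replytocom|^https://blog.hugopoi.net/v2)") == false)  # drop comment-reply duplicates and v2 urls
    and (.url | test("/[0-9]{4}/[0-9]{2}/[0-9]{2}/[^/]+/$"))                    # keep only dated wordpress permalinks
  ))
  | map(.metas.title |= sub(" – HugoPoi"; ""))                                  # strip the site name from the titles
  | sort_by(.url) | reverse                                                     # newest first, since the date is in the url
' blog.hugopoi.net_urls.json > blog.hugopoi.net_urls.filtered.json
```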

Now we have a proper [urls data source](https://home.hugopoi.net/gitea/hugopoi/blog.hugopoi.net/src/commit/681d85997a598e9de06820b2c6ccc1fbc4e128c6/v2/data/LegacyBlogUrls.json#L2-L32):
<pre data-src="https://home.hugopoi.net/gitea/hugopoi/blog.hugopoi.net/raw/branch/master/v2/data/LegacyBlogUrls.json" data-range="1,32" class="line-numbers"></pre>

## Override the Hugo Theme layout

We now have a `data/LegacyBlogUrls.json` file with all the urls I want to
put in the blog posts index page.
I copied the original `themes/cupper-hugo-theme/layouts/_default/list.html` to `layouts/_default/list.html`.
<pre data-src="https://home.hugopoi.net/gitea/hugopoi/blog.hugopoi.net/raw/commit/681d85997a598e9de06820b2c6ccc1fbc4e128c6/v2/layouts/_default/list.html" data-range="30,48" class="line-numbers"></pre>
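
The embedded snippet above is the real template from the commit; as a minimal sketch of the idea (structure assumed from the data file, not the exact layout code), merging the legacy urls into the posts list can look like this:

```html
<ul>
  {{/* regular Hugo pages */}}
  {{ range .Pages }}
    <li><a href="{{ .RelPermalink }}">{{ .Title }}</a></li>
  {{ end }}
  {{/* legacy articles from data/LegacyBlogUrls.json */}}
  {{ range site.Data.LegacyBlogUrls }}
    <li><a href="{{ .url }}">{{ .metas.title }}</a></li>
  {{ end }}
</ul>
```

Because the jq step already sorted the legacy urls newest first, a plain `range` keeps that order.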

{{< figureCupper
img="Screenshot from 2022-12-18 20-13-35.png"
caption="Blog page with legacy articles final version"
command="Fit"
options="1024x500" >}}