On Fri, 25 Sep 2020, f6k wrote:
> i'm certainly missing something in my configuration, right?
> do you know (or anyone else) how cosmic.voyage achieves that?
> maybe some compression at some point?
My figure was counted from data that `suck` downloaded from cosmic.voyage
and saved into my directory on tilde.club. And no, there is no compression
used at any point in my workflow; all of the articles are `cat`-readable.
On Fri, 25 Sep 2020, f6k wrote:
> i'm really curious; how is it possible that a year in cosmic.voyage
> makes 293KiB when 60 days in my local copy of tildeverse makes 1.3M!
The key is your use of `du`:

	$ du -shc *

combined with your directory structure. This introduces a whole lot
of local bias (whose magnitude is larger than the actual newspool size
itself); in other words, a measurement error. The explanation starts
at heading 1; an extreme example there should give an idea of how
misleading this method is.
See heading 5 for the measurement strategy, if you would like to
use my methodology to measure your newspool.
Another possible factor is your continued use of `slrnpull`, which
I suspect didn't "expire" articles the way you might think; thus your
"i only keep articles within 60 days" might or might not be true.
I cannot confirm or rule out this factor on my own, but there is a test
in heading 6 that you could try.
There also appears to be an oversight on my end about line endings,
which caused my original total to be a bit off, but not significantly.
(Off by 2.37%; see heading 4 for a side note)
A big wall-of-text follows...
%<-----
## 1. File size vs. size on disk ##
On Fri, 25 Sep 2020, xwindows wrote:
> (all articles from tilde.* newsgroups as existed on cosmic.voyage
> at 2020-09-18 [with its 1-year history back then] totaled to only 293 KiB;
> yes, *kilobyte*, you read that correctly)

Note the emphasis: "all articles (...) totaled to only 293 KiB".
I measured the content size (in bytes) of the articles (including their
USENET headers) totaled together, without including local storage overhead.
This is intentional, to avoid introducing local bias.

Your measurement method, however, measures "storage size as occupied
on disk"; that is useful for inspecting actual disk utilization,
as it includes local overhead such as block padding, indirect block maps,
and other filesystem-specific baggage, which varies from system to system
and thus constitutes local bias.
My original measurement was done by running `du -bc *` in `suck`'s
article feed directory, where all downloaded articles are stored,
one article per file, all in a single directory...
299626 total
This number is the "293 KiB" I was talking about, in the original article.
Then compare and contrast: the following is the last line of the output
of `du -shc *` (your method) in the same directory...
780K total
You'd see that this alone made the size go up to 267%+ of the actual value,
and this does not even include space occupied by directories,
which will come up in the next heading.
This measurement was done on tilde.club, which appears to use a 4 KiB
filesystem block size; if your filesystem block size is bigger,
the difference will be even *wilder* than this.
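
If you are curious what block size your own filesystem uses, `stat`
from GNU coreutils can report it. A quick check (the 4096 shown below
is just what a 4 KiB-block filesystem would print)...

	$ stat -f -c '%S' .
	4096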
To demonstrate a more extreme case, I even tried tar'ing said directory
over to my local machine and extracting it onto a local FAT32 volume
(which was formatted with a 32 KiB block size).
Guess what `du -shc *` in that folder reported?
5.9M total
Outrageous, right?
(FYI: Running `du -bc *` there still reported the same-as-ever
299626-byte size, which is the actual size of the newspool content,
excluding storage overhead)
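
If you would like to see this padding effect in isolation, here is
a minimal demonstration you could try yourself (the `tiny.txt` filename
is just an example; the `du -k` output assumes a 4 KiB block size)...

	$ printf 'hello' > tiny.txt
	$ du -b tiny.txt
	5	tiny.txt
	$ du -k tiny.txt
	4	tiny.txt

5 bytes of content, yet one whole 4 KiB block is allocated on disk;
multiply that padding by hundreds of small article files, and the gap
between the two measurements adds up quickly.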
## 2. Directory overhead ##
This is another local bias, as it is not really storage space occupied
by articles, but rather space occupied by the directory areas which point
to those articles; which, again, varies from system to system. `du` also
counts these if it finds any directory in the specified tree.

And by including directory overhead, you even counted *empty* newsgroups;
which, by essence, contain 0 bytes of articles.
12K javascript
12K meetups
(...)
12K php
12K pink
(...)
12K python
(...)
12K team
^ This means an additional 72 KiB of emptiness added into your total.
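
If you want to spot such empty newsgroup directories in your own tree,
this should list them...

	find . -type d -empty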
And for the directories of non-empty newsgroups, this is my estimate
of your newspool directory tree, with the subgroup information taken from
the currently-listed newsgroups on news.tilde.club...
12K art
12K art/ascii
12K art/music
12K club
12K cosmic
12K food+drink
12K gopher
12K meta
12K nsfw
12K poetry
12K projects
12K radiofreqs
12K services
12K services/uucp
12K text
^ This means a minimum of 180 KiB of directory weight added to your total.
It is not exact, as it is calculated from an assumption of 12 KiB
per directory, which is the *lower bound* estimate.
(An empty directory consumed 12 KiB; a directory which itself lists
many files is bound to consume more than that)
Note that I omitted the `black` directory, as it is saved for the next point...
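
By the way, if you would rather measure the directory overhead on your
own system than take my estimate, subtracting the file-only on-disk total
from the grand total should do. A rough sketch, assuming GNU `du` and
a newspool small enough for `xargs` to fit in one batch...

	du -sk .
	find . -type f | xargs -d '\n' du -ck | tail -n 1

The difference between the two reported totals is your directory overhead.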
## 3. There are some newsgroups that cosmic.voyage did not carry ##
cosmic.voyage is no longer carrying the `tilde.black` newsgroup,
not after ~tomasino shut it down; and that has been the case since
before the time I did my suck-feeding. So, the following:
28K black
does not count.
## 4. Line ending differences ##
It happens that `suck` stores the downloaded articles with Unix (LF)
line endings, as opposed to the articles in their NNTP transfer format,
which uses Internet line endings (CR-LF). INN2, however, stores
each article verbatim in its transfer format; so I'm going to treat that
as the canonical format.
This could be counted as an error on my part, which resulted in
my original 293 KiB figure being a slight under-measurement.
Let's see what the actual size is, and check whether this error
significantly changes the statistics...
I ran `cat * | wc -c` in the same article feed directory again,
for a sanity check...
299626
You'd see that the size matches exactly with my `du -bc *` output
shown in heading 1 (the 293 KiB).
Then, add a CR byte in front of every LF byte encountered,
and measure again (`cat * | sed -e 's/$/\r/' | wc -c`)...
306896
This means 300 KiB worth of articles in their transfer format
is the correct total size:
- My original measurement had a 7270-byte difference from this correct value.
- This means my original measurement's error was 2.37%.
This is pretty much insignificant, but I think it's worth writing
about anyway, just for the record.
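
As a quick cross-check of the arithmetic: since the `sed` pass above adds
exactly one CR byte per line, the byte difference must equal the number
of lines in the newspool (assuming every article file ends with a newline).
Running this in the same directory should therefore print 7270...

	cat * | wc -l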
## 5. Methodology-matching of measurement ##
To exclude blocksize-induced bias and directory-induced bias,
you ought to take the total of only the file size of each article.
Running the following command inside your newspool directory
should have you covered...
find . -type f | xargs -d '\n' du -bc | tail -n 1
This will print out the total bytes of the files (news articles)
in the directory, excluding any local storage overhead.
(Note that the directory tree you're going to run this in must not
contain files that are not news articles, not even hidden dotfiles)
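
If your `du` lacks the GNU-specific `-b` option, or if your newspool
has so many files that `xargs` splits them into several `du` invocations
(each printing its own subtotal), a more portable equivalent is to
byte-count the concatenated article files...

	find . -type f -exec cat '{}' + | wc -c

This prints a single number: the total content bytes, same as above.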
## 6. Finding out the oldest article date in the tree ##
I don't currently use SLRNpull myself, but from your description
of your usage, it seems that you instructed SLRNpull to download
articles in the latest 60-day window into some directory,
repeated every time you'd like to fetch updates.
I don't really have insight into how SLRNpull actually operates
regarding article expiration; but the date of the oldest article actually
existing in your newspool should give the indication.
To get the date of the oldest article, run this scary-looking command [1]
in your newspool folder...
find . -type f -exec sed -e \
's/^.*[^[:space:]].*$/\0/
t CKDATE
Q
:CKDATE
s/^Date:[[:space:]]*\(.*[^[:space:]]\)[[:space:]]*$/\1/
t
d' '{}' ';' | \
xargs -d '\n' -I '{}' date -d '{}' '+%Y-%m-%d %H:%M:%S' | sort | head -n 1
(Note that the directory tree you're going to run this in should not
contain files that are not news articles, not even hidden dotfiles)
This will return the timestamp of the oldest article in your newspool,
in your timezone, in an ISO-8601-like format (YYYY-MM-DD HH:mm:ss).
If that date is older than 60 days counted from today into the past,
then you are not really "keep[ing] articles within 60 days",
but rather longer than that.
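
To get a matching cutoff timestamp for the comparison, GNU `date`
can count 60 days back in the same format...

	date -d '60 days ago' '+%Y-%m-%d %H:%M:%S'

If the oldest-article timestamp sorts before this one, your newspool
is effectively keeping more than 60 days of articles.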
Regards,
~xwindows
P.S. Note that the commands I listed here were tested on the GNU
implementations; your mileage may vary if your system is not GNU-based.
[1] This scary-looking command basically...
1. `find . -type f -exec [...] '{}' ';'`
   Find every file in the current directory tree, and for each file...
1.1. `sed -e '[...]'`
   Run the specified inline Sed program on it, which extracts
   the article's newspost date. (I take the liberty of not dissecting
   the cryptic-looking Sed program here; this article is too lengthy
   already)
2. `xargs -d '\n' -I '{}' [...]`
For each extracted newspost date...
2.1. `date -d '{}' '+%Y-%m-%d %H:%M:%S'`
Convert it into a sortable ISO-8601 format.
3. `sort`
Sort the dates in ascending order (i.e. oldest first).
4. `head -n 1`
And display only the first date (i.e. oldest date).
--- Synchronet 3.18b-Linux NewsLink 1.113