Simple Blog Post Stats

I got curious about the word count in my blog. When I generate the static site, I simply count whitespace-separated strings as words, but this also counts code sections. To find out how much prose I wrote, I need to process each blog post, ignoring the front matter (the information at the top of each post, surrounded by ---) and the code sections (surrounded by three backticks), and adding valid words to my statistics (I’m just keeping a total word count and a count of each word). I got the following results:

Word Count: 7089

Wow! Over 7000 words! I’m pretty happy with that!

order word count
1 the 402
2 to 296
3 a 196
4 i 172
5 and 155
6 it 122
7 this 109
8 of 106
9 is 93
10 in 91
11 with 86
12 that 83
13 for 66
14 on 65
15 my 65
16 you 52
17 from 52
18 be 50
19 use 47
20 can 43
21 so 38
22 if 35
23 we 35
24 file 31
25 command 31
26 site.baseurl 30
27 then 29
28 powershell 29
29 but 28
30 are 28
31 have 28
32 like 28
33 by 27
34 following 26
35 an 26
36 now 25
37 one 25
38 will 24
39 some 24
40 or 24
41 at 24
42 install 23
43 up 23
44 using 23
45 your 22
46 when 22
47 get 21
48 do 21
49 want 21
50 also 21

Well, this was a lot more disappointing. I guess I don’t use a lot of interesting words.

Code

I used the following code to generate this:

#!/usr/bin/env python3

from collections import Counter
from pathlib import Path
import string
import sys

# This script goes through my _posts directory, strips
# out lines surrounded by ``` or --- blocks, then does a little
# statistics on the results


def is_valid_word(word):
    """Return True if the token has a letter and no backtick (inline code)."""
    contains_letters = any(c in string.ascii_letters for c in word)
    not_a_variable = '`' not in word
    return contains_letters and not_a_variable


def munge_word(word):
    """Return the lowercased word with at most one non-letter character
    stripped from each end (so 'word).' becomes 'word)')."""
    word = word.lower()
    if word and word[-1] not in string.ascii_lowercase:
        word = word[:-1]
    if word and word[0] not in string.ascii_lowercase:
        word = word[1:]
    return word


def main():

    counter = Counter()
    word_count = 0

    topdir = sys.argv[1]

    for path in Path(topdir).glob('*.md'):
        with open(path) as blog_post:
            # Despite the name, is_code is True inside either a fenced
            # code block or the front matter; both are skipped.
            is_code = False
            for line in blog_post:
                if line.startswith(('```', '---')):
                    is_code = not is_code
                    continue
                if not is_code:
                    # Tally every valid whitespace-separated token;
                    # split() already drops the surrounding whitespace.
                    for word in line.split():
                        if is_valid_word(word):
                            word_count += 1
                            munged_word = munge_word(word)
                            counter[munged_word] += 1

    print()
    print('Word Count: ', word_count)
    print()

    # print(counter.most_common(100))
    print("| order | word | count |")
    print("|=======|======|=======|")
    for order, mci in enumerate(counter.most_common(50)):
        word, count = mci
        print(f"| {order + 1} | {word} | {count} |")


if __name__ == "__main__":
    main()
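
To see what the script actually counts, here is a condensed sketch of the same filtering run on a single in-memory post instead of a directory. The sample post, the `count_prose_words` name, and the `FENCE` constant are made up for illustration; the skipping and trimming rules are meant to mirror the script above, not replace it.

```python
from collections import Counter
import string

# Three backticks, built this way so the example contains no literal fence.
FENCE = "`" * 3

# Hypothetical sample post: front matter, one prose line with inline code,
# one fenced code block, and a closing prose line.
SAMPLE_POST = "\n".join([
    "---",
    "title: Example",
    "---",
    "I use PowerShell. The `ls` command lists files!",
    FENCE,
    'echo "not counted"',
    FENCE,
    "The end.",
])


def count_prose_words(text):
    """Count prose words in one post, skipping front matter and fenced code."""
    counter = Counter()
    skipping = False
    for line in text.splitlines():
        # A --- or fence line toggles between prose and skipped content.
        if line.startswith((FENCE, "---")):
            skipping = not skipping
            continue
        if skipping:
            continue
        for word in line.split():
            has_letter = any(c in string.ascii_letters for c in word)
            if not has_letter or "`" in word:
                continue  # no letters, or inline code
            word = word.lower()
            # Same one-character trim on each end as munge_word.
            if word and word[-1] not in string.ascii_lowercase:
                word = word[:-1]
            if word and word[0] not in string.ascii_lowercase:
                word = word[1:]
            counter[word] += 1
    return counter


counts = count_prose_words(SAMPLE_POST)
print("Word Count:", sum(counts.values()))
print(counts.most_common(3))
```

The title line, the echo line, and the inline `ls` are all left out of the tally, while "PowerShell." and "files!" are counted with their trailing punctuation removed.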