# Ben's Corner

# Simple Blog Post Stats

I got curious about the word count of my blog. When I generate the static site, I simply count whitespace-separated strings as words, but that count also includes code sections. To find out how much prose I wrote, I need to process each blog post, ignoring the front matter (the information at the top of each post, surrounded by ---) and the code sections (which are surrounded by three tildes), and adding each valid word to my statistics (I’m just keeping a total word count and a count of each word). I got the following results:

Word Count: 7089

Wow! Over 7000 words! I’m pretty happy with that!

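| order | word | count |
|-------|------|-------|
| 1 | the | 402 |
| 2 | to | 296 |
| 3 | a | 196 |
| 4 | i | 172 |
| 5 | and | 155 |
| 6 | it | 122 |
| 7 | this | 109 |
| 8 | of | 106 |
| 9 | is | 93 |
| 10 | in | 91 |
| 11 | with | 86 |
| 12 | that | 83 |
| 13 | for | 66 |
| 14 | on | 65 |
| 15 | my | 65 |
| 16 | you | 52 |
| 17 | from | 52 |
| 18 | be | 50 |
| 19 | use | 47 |
| 20 | can | 43 |
| 21 | so | 38 |
| 22 | if | 35 |
| 23 | we | 35 |
| 24 | file | 31 |
| 25 | command | 31 |
| 26 | site.baseurl | 30 |
| 27 | then | 29 |
| 28 | powershell | 29 |
| 29 | but | 28 |
| 30 | are | 28 |
| 31 | have | 28 |
| 32 | like | 28 |
| 33 | by | 27 |
| 34 | following | 26 |
| 35 | an | 26 |
| 36 | now | 25 |
| 37 | one | 25 |
| 38 | will | 24 |
| 39 | some | 24 |
| 40 | or | 24 |
| 41 | at | 24 |
| 42 | install | 23 |
| 43 | up | 23 |
| 44 | using | 23 |
| 45 | your | 22 |
| 46 | when | 22 |
| 47 | get | 21 |
| 48 | do | 21 |
| 49 | want | 21 |
| 50 | also | 21 |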

Well, this was a lot more disappointing. I don’t use a lot of interesting words, I assume.
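Most of the top 50 are function words like "the" and "to". One way to surface more distinctive words, sketched here under assumptions, is to drop such words before ranking. The `STOP_WORDS` set and the `interesting_words` helper are hypothetical names, and the set is a small hand-picked sample rather than any standard list:

```python
from collections import Counter

# Hypothetical, hand-picked sample of common words to ignore
STOP_WORDS = {"the", "to", "a", "i", "and", "it", "this", "of", "is", "in"}


def interesting_words(counter, n=5):
    """Return the n most common words that are not in STOP_WORDS."""
    filtered = Counter({w: c for w, c in counter.items() if w not in STOP_WORDS})
    return filtered.most_common(n)


# Counts copied from a few rows of the table above
sample = Counter({"the": 402, "to": 296, "file": 31, "command": 31, "powershell": 29})
print(interesting_words(sample, 3))
```

With a longer stop-word list, the same filter could be applied to the full counter before printing the table.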

## Code

I used the following code to generate this:

```python
#!/usr/bin/env python3

from collections import Counter
from pathlib import Path
import string
import sys

# This script goes through my _posts directory, strips
# out lines surrounded by ~~~ or --- blocks, then does a little
# statistics on the results

# NOTE: both markers were lost from this listing. '~~~' is assumed from
# the "three tildes" description above; '{{' (a template variable
# marker) is a guess. Adjust them to match your posts.
CODE_FENCE = '~~~'
VARIABLE_MARKER = '{{'


def is_valid_word(word):
    contains_letters = any(c in string.ascii_letters for c in word)
    not_a_variable = VARIABLE_MARKER not in word
    return contains_letters and not_a_variable


def munge_word(word):
    """Return the lowercase word, dropping one non-letter character from each end."""
    word = word.lower()
    if word and word[-1] not in string.ascii_lowercase:
        word = word[:-1]
    if word and word[0] not in string.ascii_lowercase:
        word = word[1:]
    return word


def main():
    counter = Counter()
    word_count = 0

    topdir = sys.argv[1]

    for path in Path(topdir).glob('*.md'):
        with open(path, encoding='utf-8') as blog_post:
            # True while inside a code section or the front matter
            is_code = False
            for line in blog_post:
                if line.startswith((CODE_FENCE, '---')):
                    is_code = not is_code
                    continue
                if not is_code:
                    # now get stats :)
                    for word in line.split():
                        if is_valid_word(word):
                            word_count += 1
                            munged_word = munge_word(word)
                            counter[munged_word] += 1

    print()
    print('Word Count: ', word_count)
    print()

    # print(counter.most_common(100))
    print("| order | word | count |")
    print("|=======|======|=======|")
    for order, mci in enumerate(counter.most_common(50)):
        word, count = mci
        print(f"| {order + 1} | {word} | {count} |")


if __name__ == "__main__":
    main()
```
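One thing to note about `munge_word`: it removes at most one character from each end, so a token like `(word).` still keeps some punctuation. As a minimal sketch of an alternative, `str.strip` with `string.punctuation` removes every leading and trailing ASCII punctuation character. The name `strip_punctuation` is hypothetical and not part of the script above, and using it would change the counts reported earlier:

```python
import string


def strip_punctuation(word):
    """Lowercase the word and strip all leading/trailing ASCII punctuation."""
    return word.lower().strip(string.punctuation)


# Punctuation inside a word (apostrophes, dots in file names) is left alone
for token in ["(Hello).", "don't", "file.md,", "---"]:
    print(repr(token), "->", repr(strip_punctuation(token)))
```

Unlike `munge_word`, this leaves leading and trailing digits in place, and it does not touch non-ASCII punctuation such as curly quotes, since those are not in `string.punctuation`.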