January 05, 2015

Pinboard tag analysis

I thought I might as well have a look at my Pinboard tags while I’m on my little bookmark analysis binge. I should point out that the script that downloads my bookmarks and saves them locally is pinboard-backup. (You can, of course, just export the same file from your Pinboard account.) Also, my amateur Python noodling depends wholly on the marvellous Beautiful Soup module.

Running the following on your bookmarks.html file will print your Top 10 tags. On my first run I was a bit shocked to discover I had over one thousand bookmarks with no tags, which renders them close to useless for me. I did a little digging on the Pinboard site, isolated the tag-less marks and slowly sifted through a couple of hundred of them to find that they were almost all superfluous links that had accumulated when I briefly had Pinboard log all my Twitter favourites. So I deleted the lot and felt a little lighter afterwards.

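from bs4 import BeautifulSoup

counts = dict()

# Open in binary mode so Beautiful Soup detects the encoding itself,
# and name the parser explicitly to avoid a bs4 warning.
with open('pinboard/bookmarks.html', 'rb') as f:
    bookmarks = BeautifulSoup(f, 'html.parser')

for link in bookmarks.find_all('a'):
    # A missing TAGS attribute is treated like an empty one, so untagged
    # bookmarks are counted under the empty string ''.
    tags = (link.get('tags') or '').split(',')
    for tag in tags:
        counts[tag] = counts.get(tag, 0) + 1

def getvalue(item):
    # Sort key: the count in a (tag, count) pair.
    return item[1]

# Print the ten most-used tags, highest count first (Python 3 print).
for uniqtag, number in sorted(counts.items(), key=getvalue, reverse=True)[:10]:
    print(uniqtag, number)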

My Top 10 tags, as they currently stand, are:

reference 183
tools 169
read 156
github 123
css 116
shop 116
starred 114
osx 111
cli 104
app 87

Each link carrying one of those tags will have other tags too, but the github and starred items need some cleaning up. They’re the result of a mass import of links using github-starred-to-pinboard, and each one needs further tagging to be really useful. I could probably also do some weeding in the shop and css categories.
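To find the imported links that still need attention, something like the following might help. It is only a rough sketch for Python 3: the `needs_tagging` and `split_tags` helpers and the `SAMPLE` data are made up for illustration, and it assumes the export keeps its tags in a comma-separated `TAGS` attribute, as the script above does. It lists bookmarks whose only tags are the import tags.

```python
from bs4 import BeautifulSoup

# Tags left behind by the bulk import (an assumption; adjust to taste).
IMPORT_TAGS = {'github', 'starred'}

# Made-up sample in the shape of a bookmarks.html export.
SAMPLE = '''
<DL><p>
<DT><A HREF="https://example.com/a" TAGS="github,starred">repo a</A>
<DT><A HREF="https://example.com/b" TAGS="github,starred,python">repo b</A>
<DT><A HREF="https://example.com/c" TAGS="">untagged</A>
</DL>
'''


def split_tags(raw):
    """Turn a comma-separated TAGS value into a set of non-empty tags."""
    return {tag.strip() for tag in (raw or '').split(',') if tag.strip()}


def needs_tagging(html, import_tags=IMPORT_TAGS):
    """Return (href, title) pairs for links that carry only import tags."""
    soup = BeautifulSoup(html, 'html.parser')
    found = []
    for link in soup.find_all('a'):
        tags = split_tags(link.get('tags'))
        # Skip untagged links; keep those with nothing but import tags.
        if tags and tags <= import_tags:
            found.append((link.get('href'), link.get_text(strip=True)))
    return found


for href, title in needs_tagging(SAMPLE):
    print(href, title)
```

To run it against a real export, pass the contents of bookmarks.html to `needs_tagging` instead of `SAMPLE`.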

A few tweaks to the Python and I got a list of tags that are only used once. Some of these are typos or, in the case of ‘add to mine’ bookmarks, someone else’s tags. More housekeeping to be done here!

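from bs4 import BeautifulSoup

counts = dict()

# Open in binary mode so Beautiful Soup detects the encoding itself,
# and name the parser explicitly to avoid a bs4 warning.
with open('pinboard/bookmarks.html', 'rb') as f:
    bookmarks = BeautifulSoup(f, 'html.parser')

for link in bookmarks.find_all('a'):
    # A missing TAGS attribute is treated like an empty one.
    tags = (link.get('tags') or '').split(',')
    for tag in tags:
        counts[tag] = counts.get(tag, 0) + 1

# Print every tag that is used exactly once; no sorting is needed
# because all of these tags share the same count.
for uniqtag, number in counts.items():
    if number == 1:
        print(uniqtag)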
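Spotting the typos in that list by eye gets tedious, so here is one possible shortcut, sketched for Python 3 using the standard library's `difflib.get_close_matches`. The `likely_typos` helper, the 0.8 cutoff and the sample counts are my own illustrative assumptions, not part of the script above. It pairs each once-used tag with similar-looking tags that are used more often.

```python
from difflib import get_close_matches


def likely_typos(counts, cutoff=0.8):
    """Map each once-used tag to similar tags that are used more often."""
    common = [tag for tag, count in counts.items() if count > 1]
    suggestions = {}
    for tag, count in counts.items():
        if count == 1 and tag:
            # Up to three close matches among the established tags.
            close = get_close_matches(tag, common, n=3, cutoff=cutoff)
            if close:
                suggestions[tag] = close
    return suggestions


# Made-up counts for illustration, shaped like the `counts` dict.
sample_counts = {'reference': 183, 'refrence': 1, 'css': 116, 'haskell': 1}

for tag, close in likely_typos(sample_counts).items():
    print(tag, '->', ', '.join(close))
```

A once-used tag with no close match, such as `haskell` in the sample, is simply left out; it is probably a genuine one-off rather than a typo.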

Getting to know my bookmarks in such detail has been well worth the time spent. I’ve removed a lot, tagged and cleaned up the remainder and have less of a ‘black hole’ feeling about them now. I’m also going to be more careful about adding new marks.