March 29, 2015

Local full text Pinboard search

This post should conclude my current Pinboard obsession, and I wondered whether to write it at all for a number of reasons: Pinboard offers a great paid full text search option, this script hammers the Instapaper API, and the code is rough and ready. With all that said, the script has already saved my bacon twice during internet outages, so I will leave it here as a proof of concept.

The Python script doesn’t create the urls directory, doesn’t perform any fancy rate-limiting or threading, and is specific to my setup, but you should be able to get it to work on your own machine if you want to try it out. It relies on a local Pinboard bookmarks file, which you can export from the Pinboard site or grab using the pinboard-backup Perl script that I use. The script sends each URL it finds in the bookmarks file to Instapaper to be converted to Markdown and writes the result to a file in a local directory.

It’s quick and a little bit nasty, but it works. Some URLs just won’t lend themselves to conversion to Markdown, some I have deliberately excluded (images, scripts, text files, PDFs), and the script will try to keep working through most errors.

#!/usr/bin/env python

import os
import codecs
import re

from bs4 import BeautifulSoup
from html2md import UrlToMarkdown

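# Parse the exported Pinboard bookmarks file. Naming the parser explicitly
# keeps the result the same regardless of which parsers are installed.
bookmarks = open('/Users/larry/pinboard/bookmarks.html')
soup = BeautifulSoup(bookmarks, 'html.parser')
u2md = UrlToMarkdown('instapaper')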

for link in soup.find_all('a'):
    link = link.get('href').encode('utf-8')

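    # Skip scripts, text files, PDFs and images. The dot is escaped so it
    # matches a literal '.' before the extension, not any character.
    if re.search(r'\.(pl|sh|txt|pdf|rb|jpe?g|png|gif)$', link):
        print 'Skipping', link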
        continue

    # Build the filename from the host plus up to three path segments,
    # e.g. http://example.com/a/b/c -> example.com_a_b_c.markdown
    localfile = '_'.join(link.split('/')[2:6]) + '.markdown'

    if os.path.exists('/Users/larry/urls/%s' % localfile):
        print 'Skipping ' '/Users/larry/urls/%s' % localfile
        continue

    try:
        markdown = u2md.convert(link)
        # The with block closes the file, so no explicit close() is needed.
        with codecs.open('/Users/larry/urls/%s' % localfile, 'w', 'utf-8') as outfile:
            outfile.write(markdown)
    except Exception:
        # Also covers IOError and AttributeError: skip this URL and carry on.
        pass
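
The script leaves out the two housekeeping steps mentioned at the top: creating the urls directory and spacing out the requests. Below is a minimal sketch of how both could be bolted on, not something the script currently does. The names `ensure_dir` and `Throttle` are hypothetical helpers invented for this example, not part of html2md, and any interval you pick is a guess rather than a documented Instapaper limit.

```python
import os
import time


def ensure_dir(path):
    """Create path (and any missing parents) unless it already exists."""
    if not os.path.isdir(path):
        os.makedirs(path)
    return path


class Throttle(object):
    """Enforce a minimum gap, in seconds, between successive wait() calls.

    The clock and sleep functions are injectable so the behaviour can be
    checked without actually sleeping.
    """

    def __init__(self, min_interval, clock=time.time, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.last = None  # time of the previous call, None before the first

    def wait(self):
        now = self.clock()
        if self.last is not None:
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                # Called too soon after the previous request: pause.
                self.sleep(remaining)
                now = self.clock()
        self.last = now
```

To use it, you would call `ensure_dir('/Users/larry/urls')` once before the loop, create something like `throttle = Throttle(2.0)`, and call `throttle.wait()` just before each `u2md.convert(link)`. A fixed delay is the bluntest form of rate-limiting, but it fits a script that runs rarely and is in no hurry.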