in-a-bind

Some notes on creating beautiful PDFs

convert -background white -alpha remove -alpha off -density 300 -depth 8 INPUT.PDF[0] SINGLE_PAGE.PNM

unpaper SINGLE_PAGE.PNM CLEAN_PAGE.PNM

echo "tessedit_create_hocr 1" > hocr.conf

tesseract CLEAN_PAGE.PNM PAGE_OCR -l eng +hocr.conf

convert CLEAN_PAGE.PNM CLEAN_PAGE.JPG

hocr2pdf -i CLEAN_PAGE.JPG -o WITH_OCR.pdf < PAGE_OCR.html
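
The per-page steps above could be scripted so they run for every page of a scan. What follows is only a minimal Python sketch under some assumptions: the helper names (page_commands, process_page) and the output naming scheme based on a "stem" are made up for illustration, and convert, unpaper, tesseract and hocr2pdf are expected to be on the PATH and to behave as in the commands above.

```python
import subprocess
from pathlib import Path


def page_commands(pdf, page, stem):
    """Return (argv, stdin_path) pairs for one page, in the order they must run.

    `page` is the zero-based page index; `stem` is a hypothetical prefix
    used to name the intermediate files.
    """
    pnm = f"{stem}.pnm"
    clean = f"{stem}_clean.pnm"
    jpg = f"{stem}_clean.jpg"
    ocr = f"{stem}_ocr"
    return [
        # Rasterize a single page at 300 dpi on a white background.
        (["convert", "-background", "white", "-alpha", "remove", "-alpha", "off",
          "-density", "300", "-depth", "8", f"{pdf}[{page}]", pnm], None),
        # Clean up the scan.
        (["unpaper", pnm, clean], None),
        # OCR to hOCR, using the hocr.conf written by process_page().
        (["tesseract", clean, ocr, "-l", "eng", "+hocr.conf"], None),
        # hocr2pdf is given a JPEG here, as in the notes above.
        (["convert", clean, jpg], None),
        # hocr2pdf reads the hOCR markup from standard input.
        (["hocr2pdf", "-i", jpg, "-o", f"{stem}.pdf"], f"{ocr}.html"),
    ]


def process_page(pdf, page, stem):
    """Run the whole pipeline for one page, stopping at the first failure."""
    Path("hocr.conf").write_text("tessedit_create_hocr 1\n")
    for argv, stdin_path in page_commands(pdf, page, stem):
        if stdin_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdin_path, "rb") as stdin:
                subprocess.run(argv, check=True, stdin=stdin)
```

For example, process_page("INPUT.PDF", 0, "page001") would leave page001.pdf next to the intermediate files. Keeping command construction separate from execution makes it easy to print the commands first and check them by eye.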

** The simplest way to build the full document would be to just glob the per-page PDFs together

pdfunite PAGE1.PDF PAGE2.PDF [...] OUTPUT.PDF
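
When there are many pages, a plain alphabetical glob can put PAGE10.PDF before PAGE2.PDF. The sketch below is one possible way to build the pdfunite command with pages in numeric order; the function names are hypothetical, and it assumes pdfunite is installed and that page files carry their number in the filename.

```python
import re
import subprocess


def natural_key(name):
    """Sort key that orders PAGE2.PDF before PAGE10.PDF."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def pdfunite_command(pages, output):
    """Build the pdfunite argv: inputs in page order, output file last."""
    ordered = sorted(pages, key=natural_key)
    if not ordered:
        raise ValueError("no pages to merge")
    return ["pdfunite", *ordered, output]


def merge_pages(pages, output):
    """Concatenate the per-page PDFs into one file."""
    subprocess.run(pdfunite_command(pages, output), check=True)
```

A call such as merge_pages(glob.glob("PAGE*.PDF"), "OUTPUT.PDF") (with glob imported from the standard library) would then do the globbing and the merge in one step.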

** If we want to have an outline, we may need to use reportlab

canvas.addOutlineEntry(self, title, key, level=0, closed=None)
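
As a rough illustration of how that call is used, here is a small sketch that writes one placeholder page per title and attaches an outline entry to each. It assumes reportlab is installed; the function names and the "page-N" bookmark keys are invented for the example, and it does not carry over the OCR text layer produced by hocr2pdf, so it only shows the bookmark mechanics.

```python
def outline_entries(titles):
    """Pair each page title with a bookmark key and outline level 0."""
    return [(title, f"page-{number}", 0)
            for number, title in enumerate(titles, start=1)]


def write_outlined_pdf(path, titles):
    """Write a PDF with one placeholder page and one outline entry per title."""
    # Imported here so outline_entries() can be used without reportlab.
    from reportlab.pdfgen import canvas

    pdf = canvas.Canvas(path)
    for title, key, level in outline_entries(titles):
        pdf.drawString(72, 720, title)  # placeholder page content
        pdf.bookmarkPage(key)           # destination the outline entry points at
        pdf.addOutlineEntry(title, key, level=level, closed=None)
        pdf.showPage()                  # finish this page, start the next
    pdf.showOutline()  # ask viewers to open with the outline panel visible
    pdf.save()
```

Each key passed to addOutlineEntry has to match a bookmarkPage call on the page it should jump to; nested entries would use level=1 and so on.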

API

when you have a PDF ready:

1. Make a PUT request to this URL: http://{{BASE_URL}}/db/{UNIQUE-ID}

with this JSON { "type": "uploaded-page", "filename": {IMAGE-FILENAME}, "side": {"left" | "right"} }

2. You should get a response back as a JSON dictionary, including

{ "rev": REVISION_ID, ... }

3. Then, make another PUT, with the actual attachment, to this URL http://{{BASE_URL}}/db/{UNIQUE-ID}/upload?rev={REVISION_ID}

and include the full-resolution image data as the request body; make sure the "Content-Type" header is set correctly, e.g. to "image/jpeg"
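
The three steps above might look like this from a client, using only the Python standard library. This is a sketch, not the project's own client: the function names are hypothetical, the base URL is whatever host serves the database, sending the metadata with "Content-Type: application/json" is an assumption, and nothing is assumed about the second response beyond returning its raw bytes.

```python
import json
import os
import urllib.parse
import urllib.request


def document_url(base_url, doc_id):
    """URL of the page document: http://BASE_URL/db/UNIQUE-ID."""
    return f"http://{base_url}/db/{urllib.parse.quote(doc_id, safe='')}"


def metadata_request(base_url, doc_id, filename, side):
    """Step 1: PUT the JSON description of the uploaded page."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    body = json.dumps({"type": "uploaded-page", "filename": filename, "side": side})
    return urllib.request.Request(
        document_url(base_url, doc_id),
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed, not in the notes
        method="PUT",
    )


def attachment_request(base_url, doc_id, rev, image_bytes, content_type="image/jpeg"):
    """Step 3: PUT the image itself, tagged with the revision from step 2."""
    query = urllib.parse.urlencode({"rev": rev})
    return urllib.request.Request(
        f"{document_url(base_url, doc_id)}/upload?{query}",
        data=image_bytes,
        headers={"Content-Type": content_type},
        method="PUT",
    )


def upload_page(base_url, doc_id, image_path, side):
    """Run both PUTs in order and return the raw body of the second response."""
    with open(image_path, "rb") as handle:
        image_bytes = handle.read()
    first = metadata_request(base_url, doc_id, os.path.basename(image_path), side)
    with urllib.request.urlopen(first) as response:
        rev = json.load(response)["rev"]  # step 2: read the revision id
    second = attachment_request(base_url, doc_id, rev, image_bytes)
    with urllib.request.urlopen(second) as response:
        return response.read()
```

The request builders are kept separate from upload_page so the URLs and headers can be inspected without touching the network.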

git clone git://git.numm.org/in-a-bind

snapshot: in-a-bind.zip
