pdftocognition

is neuroscience research sentient?

== notes

This was a messy pipeline... As best as I can reconstruct it:

A very particular directory structure is assumed. In the root must be:

pdfs/: a flat directory of PDFs named {PUBMED_ID}.pdf

We need to parse the PDFs with pdftotext. We don't actually use the extracted text itself; we only rely on the per-word bounding boxes it reports to generate the word images.

From within the pdfs/ directory, this loop should do the trick:

for i in *.pdf; do pdftotext -r 300 -bbox "$i" ../xml300/"$i".xhtml; done
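A minimal sketch of pulling the word bounding boxes back out of the `-bbox` XHTML (the `word_boxes` helper and the sample snippet are illustrative, not part of the repo; real files look like this but with real coordinates):

```python
import xml.etree.ElementTree as ET

# Minimal sample of pdftotext -bbox output; coordinate values are made up.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<doc>
<page width="612.0" height="792.0">
<word xMin="89.0" yMin="72.1" xMax="120.5" yMax="84.3">is</word>
<word xMin="125.0" yMin="72.1" xMax="210.7" yMax="84.3">neuroscience</word>
</page>
</doc>
</body>
</html>"""

NS = "{http://www.w3.org/1999/xhtml}"  # pdftotext emits the XHTML namespace

def word_boxes(xhtml_text):
    """Yield (page_index, text, (xMin, yMin, xMax, yMax)) for every word."""
    root = ET.fromstring(xhtml_text)
    for page_num, page in enumerate(root.iter(NS + "page")):
        for word in page.iter(NS + "word"):
            box = tuple(float(word.get(k)) for k in ("xMin", "yMin", "xMax", "yMax"))
            yield page_num, word.text, box

boxes = list(word_boxes(SAMPLE))
```

Those boxes, scaled to the 300 dpi page PNGs, are what lets you crop out individual word images.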

Then (and this is very slow and memory-intensive!) we need to generate pngs for each pdf.

A semi-resumable script is provided for this task:

python pdf_to_png.py
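The "semi-resumable" part presumably just means already-rendered PDFs are skipped on restart. A sketch of that logic only (function name and the `{PUBMED_ID}-{page}.png` output naming are assumptions, not taken from the actual script):

```python
import os
import pathlib
import tempfile

def pending_pdfs(pdf_dir, png_dir):
    """Return PDF ids with no rendered page yet. Crude resumability: a PDF
    counts as done once at least one PNG for it exists, assuming outputs
    are named {PUBMED_ID}-{page}.png."""
    done = {f.split("-")[0] for f in os.listdir(png_dir) if f.endswith(".png")}
    todo = sorted(f[:-4] for f in os.listdir(pdf_dir) if f.endswith(".pdf"))
    return [p for p in todo if p not in done]

# Tiny demo on throwaway directories:
pdf_dir = tempfile.mkdtemp()
png_dir = tempfile.mkdtemp()
for name in ("111.pdf", "222.pdf"):
    pathlib.Path(pdf_dir, name).touch()
pathlib.Path(png_dir, "111-1.png").touch()  # pretend 111 was already rendered

todo = pending_pdfs(pdf_dir, png_dir)
```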

An arbitrary set of papers was selected to build the components.

python random_training_words.py

From the training_words, we can generate principal components:

python cv2_pca.py
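The core of this step is ordinary batch PCA over flattened word images. A sketch of the idea, using numpy's SVD so it runs without OpenCV (the repo's script uses OpenCV's PCA instead; `fit_pca` and the data shapes here are assumptions):

```python
import numpy as np

def fit_pca(words, n_components):
    """Fit PCA on a stack of flattened word images (one row per word).
    Equivalent in spirit to OpenCV's PCA: subtract the mean, then take
    the top right-singular vectors of the centered data as components."""
    mean = words.mean(axis=0)
    _, _, vt = np.linalg.svd(words - mean, full_matrices=False)
    return mean, vt[:n_components]

# Fake "training words": 50 samples of 16x8 grayscale images, flattened.
rng = np.random.default_rng(0)
words = rng.random((50, 16 * 8))
mean, components = fit_pca(words, n_components=8)
```

The components come back orthonormal, so projection and reconstruction are just matrix products against this basis.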

Now that we have a PCA, we can project words from any paper into it.

python paper_to_coeffs.py PDF_PATH [PDF_PATH [...]]
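Projection is the cheap part: given the mean and components from the previous step, each word image becomes a short coefficient vector, and back-projecting gives the "reconstruction levels" used later. A sketch under the same assumptions as above (helper names are illustrative):

```python
import numpy as np

def project(word, mean, components):
    """Coefficients of one flattened word image in the PCA basis."""
    return components @ (word - mean)

def reconstruct(coeffs, mean, components):
    """Back-project coefficients to pixel space."""
    return mean + coeffs @ components

# Toy 3-pixel "image" lying in the span of a 2-component basis,
# so the round trip is exact:
mean = np.zeros(3)
components = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]])
word = np.array([2.0, -1.0, 0.0])

coeffs = project(word, mean, components)
roundtrip = reconstruct(coeffs, mean, components)
```

Truncating `coeffs` before reconstructing is what produces the different reconstruction levels.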

If you've gotten this far, it's time for some fun!

python word_morph.py (PDF_PATH | PUBMED_ID)

This will show a realtime display and dump video to `morph.webm`.
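I'd guess the morph works by interpolating between words in PCA-coefficient space and back-projecting each intermediate vector to pixels for a frame. A sketch of that interpolation only (this is my reading of the script's name, not its actual code):

```python
import numpy as np

def morph(coeffs_a, coeffs_b, n_frames):
    """Linearly interpolate between two words' PCA coefficient vectors.
    Each yielded vector would be back-projected to pixels for one frame."""
    for t in np.linspace(0.0, 1.0, n_frames):
        yield (1.0 - t) * coeffs_a + t * coeffs_b

frames = list(morph(np.array([0.0, 4.0]), np.array([2.0, 0.0]), n_frames=5))
```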

python plot_words_2d.py PDF_PATH [PDF_PATH [...]]

This will generate plots for each bin at two different reconstruction levels, and save them as pngs in the root.

== see also

I wasn't satisfied with pyIPCA in the end, and by now I have serious doubts about the generality of incremental PCA. Using OpenCV's PCA methods on a training set was much simpler and produced substantially better results.

git clone git://git.numm.org/pdftocognition

snapshot: pdftocognition.zip
