- Building the Database: Nabu parses the specified PDFs and builds a graph database from them. (The PDFs are listed by full path, one per line, in a file.)
- Scoring the Database: A list of files is provided, and Nabu scores their graphs for similarity. If the files are not present in the graph database, they are added. Nabu outputs the results in CSV format with the columns subject, family, candidate, score.
- Drawing Clusters: Nabu draws a dendrogram of the PDFs in the graph database, clustered by similarity score. It uses SciPy and Matplotlib for the drawing, and currently uses the Canberra distance metric.
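
The clustering step can be illustrated with a minimal sketch, which is not Nabu's actual code. It assumes each PDF has already been reduced to a row of non-negative numeric features. The `features` matrix, the file names, the average-linkage method, and the `build_linkage` and `draw_dendrogram` helpers are all hypothetical. Only the use of SciPy, Matplotlib, and the Canberra metric comes from the description above.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

# Hypothetical input: one row of feature counts per PDF (not Nabu's real data).
labels = ["a.pdf", "b.pdf", "c.pdf", "d.pdf"]
features = np.array([
    [10.0, 2.0, 0.0, 5.0],
    [11.0, 2.0, 0.0, 4.0],
    [0.0, 9.0, 7.0, 1.0],
    [0.0, 8.0, 7.0, 1.0],
])


def build_linkage(matrix):
    """Cluster rows hierarchically using pairwise Canberra distances."""
    # pdist returns the condensed (upper-triangle) distance vector.
    distances = pdist(matrix, metric="canberra")
    # Average linkage is an assumption; the source names only the metric.
    return linkage(distances, method="average")


def draw_dendrogram(link, names, out_path="clusters.png"):
    """Render the linkage as a dendrogram and save it to an image file."""
    import matplotlib
    matplotlib.use("Agg")  # headless backend, so no display is required
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(6, 4))
    dendrogram(link, labels=names, ax=ax)
    ax.set_ylabel("Canberra distance (average linkage)")
    fig.tight_layout()
    fig.savefig(out_path)
    plt.close(fig)


link = build_linkage(features)
```

Calling `draw_dendrogram(link, labels)` would write the plot to `clusters.png`. With this sample data, `c.pdf` and `d.pdf` merge first because their rows differ in only one feature.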