Visual Content Intelligence for scanned images

Haystac’s Visual/Content Intelligence (VCI) builds on NLP and machine learning by combining text and visual analytics in a unified vector space model. This combination sidesteps NLP’s “cold-start” problem and dramatically reduces the need for a priori organizational structures such as taxonomies, ontologies or editorially selected training sets. Instead, clustering and classification techniques organize information visually, leaving the user to label or merge clusters and resolve exceptions.

VCI produces high-quality results by exploiting the frequent presence of templates in the enterprise. Text-only analysis, which often starts by discarding formatting and images, cannot detect such templates. VCI samples non-text cues such as lines, shapes, bounding boxes, bar codes and photographs. Combined with text, these visual cues provide a decisive advantage for clustering, classification and detecting fielded information such as forms.
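
As a rough illustration of what sampling a non-text cue can look like, the sketch below finds long horizontal rules (the kind that frame form fields) in a binarized scan. It is a minimal example under stated assumptions, not Haystac’s implementation: the `horizontal_rules` function, the `min_length` parameter and the tiny `page` bitmap are all hypothetical.

```python
from typing import List, Tuple

Bitmap = List[List[int]]  # 1 = dark pixel, 0 = background (assumed binarized scan)


def horizontal_rules(bitmap: Bitmap, min_length: int) -> List[Tuple[int, int, int]]:
    """Return (row, start_col, length) for every dark run of at least min_length pixels."""
    rules = []
    for row_index, row in enumerate(bitmap):
        run_start = None
        # The trailing 0 is a sentinel that closes a run touching the right edge.
        for col, pixel in enumerate(list(row) + [0]):
            if pixel and run_start is None:
                run_start = col
            elif not pixel and run_start is not None:
                length = col - run_start
                if length >= min_length:
                    rules.append((row_index, run_start, length))
                run_start = None
    return rules


# Hypothetical 4x8 scan: one form rule on row 1, scattered "text" pixels elsewhere.
page = [
    [0, 1, 0, 0, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [0, 0, 1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
print(horizontal_rules(page, min_length=5))  # → [(1, 0, 7)]
```

Short runs from text strokes fall below `min_length` and are ignored, so only the layout rule survives as a feature.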

Using VCI, users who want to make sense of unstructured information follow a much simpler workflow:

  • A few documents of each type are found by searching the initial raw content set or by applying a filter/rule
  • Those documents are labeled as a class
  • The labeled class is visually generalized (clustered) across the remaining documents
  • Exceptions are placed in a queue; the user can identify them as a new type or merge them with an existing type, and either choice updates the classification models
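
The workflow above can be sketched in a few lines. This is only an illustration under assumptions: documents are taken to be already converted to feature vectors, a nearest-centroid rule with cosine similarity stands in for whatever classifier VCI actually uses, and the names `generalize`, `seeds`, `unlabeled` and the `threshold` value are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def generalize(
    seeds: Dict[str, List[Vector]],
    unlabeled: Dict[str, Vector],
    threshold: float = 0.8,
) -> Tuple[Dict[str, str], List[str]]:
    """Spread seed labels across unlabeled documents; queue poor matches as exceptions."""
    centroids = {label: centroid(vectors) for label, vectors in seeds.items()}
    assigned: Dict[str, str] = {}
    exceptions: List[str] = []
    for doc_id, vector in unlabeled.items():
        scores = {label: cosine(vector, c) for label, c in centroids.items()}
        best = max(scores, key=scores.get)
        if scores[best] >= threshold:
            assigned[doc_id] = best
        else:
            exceptions.append(doc_id)  # left for the user to label or merge
    return assigned, exceptions


# Steps 1-2: a few documents of each type are found and labeled (toy vectors).
seeds = {
    "invoice": [[1.0, 0.0, 0.1], [0.9, 0.1, 0.0]],
    "contract": [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]],
}
unlabeled = {
    "doc-a": [0.8, 0.1, 0.0],
    "doc-b": [0.1, 1.0, 0.0],
    "doc-c": [0.0, 0.1, 1.0],
}

# Step 3: generalize the labeled classes across the remaining documents.
assigned, exceptions = generalize(seeds, unlabeled)

# Step 4: the user names the exception as a new type, which updates the model.
seeds["receipt"] = [unlabeled[doc_id] for doc_id in exceptions]
assigned_after, exceptions_after = generalize(seeds, unlabeled)
```

In this toy run `doc-c` resembles neither seed class, so it lands in the exception queue on the first pass and is classified as the new `receipt` type on the second.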

Beyond classification and labeling, VCI’s visual analysis gives it insight into fielded structures such as forms and labels. This allows VCI to turn unstructured information into data without modeling.

Visual Analytics: Classification & Clustering with More Input

Underlying VCI is a layered approach to detecting and extracting non-text features. Edge detection, sampling and visual similarity algorithms, among others, identify frames, table edges, logos and similar elements. These small, specially “anchored” extracted regions are normalized and then modeled in vector space. Text extracted from the documents is modeled in the same vector space. Unified clustering and classification, via SVM or other algorithms, are then applied.
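
To make the idea of a unified vector space concrete, the sketch below normalizes a text vector and a visual vector separately and concatenates them, so that one similarity measure sees both. It is an assumption-laden illustration, not the product’s feature pipeline: the vocabulary, the three visual features (rule count, table-cell count, logo flag) and the `unified_vector` helper are hypothetical, and the sketch stops at similarity rather than training an SVM.

```python
import math
from collections import Counter
from typing import List

Vector = List[float]


def l2_normalize(vector: Vector) -> Vector:
    """Scale to unit length so neither modality dominates by raw magnitude."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def text_vector(text: str, vocabulary: List[str]) -> Vector:
    """Bag-of-words counts over a fixed vocabulary, unit-normalized."""
    counts = Counter(text.lower().split())
    return l2_normalize([float(counts[term]) for term in vocabulary])


def unified_vector(text: str, visual: Vector, vocabulary: List[str],
                   visual_weight: float = 1.0) -> Vector:
    """Concatenate normalized text and visual features into one vector."""
    weighted_visual = [visual_weight * v for v in l2_normalize(visual)]
    return text_vector(text, vocabulary) + weighted_visual


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


vocabulary = ["invoice", "total", "agreement"]  # hypothetical vocabulary

# Hypothetical visual features: [horizontal rules, table cells, logo present].
form_a = unified_vector("Invoice total due", [6.0, 12.0, 1.0], vocabulary)
form_b = unified_vector("invoice total", [6.0, 11.0, 1.0], vocabulary)
letter = unified_vector("invoice total enclosed", [0.0, 0.0, 1.0], vocabulary)

# Text alone cannot separate the form from the letter...
text_only = cosine(text_vector("Invoice total due", vocabulary),
                   text_vector("invoice total enclosed", vocabulary))
# ...but the unified vectors can.
same_template = cosine(form_a, form_b)
different_layout = cosine(form_a, letter)
```

Here the three documents share the same vocabulary terms, so `text_only` is essentially 1.0, while `same_template` comes out well above `different_layout`. Vectors built this way could then be handed to a clustering algorithm or an SVM.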

This method has several benefits compared with legacy OCR, primitive image search and textual analysis:

  • Allows high-quality classification and clustering without pre-definition of categories
  • Uses otherwise discarded data to discern major classes of documents and their variations
  • Uses anchored OCR data extraction, which yields higher accuracy and is skew/rotation independent
  • Supports boundary/border detection
  • Supports multi-line/multi-column formats
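
One way anchoring can make extraction independent of rotation and position is to express field locations relative to detected anchors rather than to the page. The sketch below is an assumption about how that could work, not a description of VCI’s internals: it fits a similarity transform (rotation, scale, shift) from two anchor points and maps a template field onto the scanned page. The helper names and all coordinates are hypothetical.

```python
from typing import Tuple

Point = Tuple[float, float]
Transform = Tuple[complex, complex]  # (rotation-and-scale, offset)


def fit_similarity(template_anchors: Tuple[Point, Point],
                   page_anchors: Tuple[Point, Point]) -> Transform:
    """Fit z -> a*z + b that maps the two template anchors onto the page anchors."""
    t1, t2 = (complex(*p) for p in template_anchors)
    p1, p2 = (complex(*p) for p in page_anchors)
    a = (p2 - p1) / (t2 - t1)  # encodes rotation and scale as one complex number
    b = p1 - a * t1            # translation that lines up the first anchor
    return a, b


def map_point(point: Point, transform: Transform) -> Point:
    """Carry a template coordinate into page coordinates."""
    a, b = transform
    z = a * complex(*point) + b
    return (z.real, z.imag)


# Template: anchors (say, a logo corner and a frame corner) and one field position.
template_anchors = ((0.0, 0.0), (100.0, 0.0))
field_in_template = (40.0, 10.0)

# The same anchors as found on a scan that is rotated 90 degrees and shifted.
page_anchors = ((50.0, 20.0), (50.0, 120.0))

transform = fit_similarity(template_anchors, page_anchors)
field_on_page = map_point(field_in_template, transform)
print(field_on_page)  # → (40.0, 60.0)
```

Because the field is located through the anchors, OCR can be pointed at the mapped region regardless of how the page was fed into the scanner.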
