Every day, document collections are populated with image files that contain valuable, often vital, information. Common sources are in-house scanning, downloads of filings or other court documents, and email attachments that are then filed into the document collection. Results gathered by Docs Corp show that a given document collection often contains a large number of image files. (1)
What's the issue?
Image files do not contain searchable text, which makes them invisible to most full-text search engines (2). As a result, users may get incomplete search results when looking for documents.
What's the solution?
The simple solution is to ensure that OCR (optical character recognition) is applied to image files upon creation or capture, thus making the content "visible" to your full-text search engine. This can occur at the scanner, or OCR can be applied to existing image files to convert them to a searchable format on an as-needed basis. But most document collections contain thousands or hundreds of thousands of documents that have already been ingested as image files.
Docs Corp has created a solution (contentCrawler) that tackles large collections of image files by surveying the entire contents of a document management system or file system and then creating new versions of the files (including email attachments) that include OCR text. These files are then indexed by the full-text search tool, with the final result being complete and accurate search results.
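To get a rough sense of the "survey" step on a plain file system, a minimal Python sketch follows. It is not how contentCrawler works; it simply walks a folder tree and flags files by extension. The function names and the extension list are assumptions made for illustration, and the check will miss image-only PDFs, which cannot be identified by extension alone.

```python
import os
import sys
from pathlib import Path

# Assumed list of image-only formats; adjust it for your own collection.
IMAGE_EXTENSIONS = {".tif", ".tiff", ".jpg", ".jpeg", ".png", ".bmp", ".gif"}


def survey_image_files(root):
    """Walk the tree under root and return (image_paths, total_file_count)."""
    image_paths = []
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += 1
            # Compare extensions case-insensitively (.TIF and .tif both match).
            if Path(name).suffix.lower() in IMAGE_EXTENSIONS:
                image_paths.append(Path(dirpath) / name)
    return image_paths, total


def image_share(root):
    """Return the fraction of files under root that are image files."""
    images, total = survey_image_files(root)
    return len(images) / total if total else 0.0


if __name__ == "__main__":
    # Usage: python survey.py /path/to/collection
    folder = sys.argv[1] if len(sys.argv) > 1 else "."
    found, count = survey_image_files(folder)
    print(f"{len(found)} of {count} files are image files")
```

The files this reports are the candidates for OCR. A document management system would need its own API for the same survey, which this sketch does not attempt.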
Docs Corp is conducting a webinar on this topic on September 9th. Register here: http://bit.ly/r36wCl
(1) Results gathered by Docs Corp show that between 17% and 25% of document collections typically include image files.
(2) Some full-text search engines can index an image-only document without prior OCR. Check with your provider on this.