Extraction from Scanned Document Images

Data Science Team, Altisource

The project developed a tool for the automated extraction and parsing of fields in scanned mortgage document images. The software handled multiple predefined document types, tolerated variation in the number of fields per page, and was robust to image skew and noise. Heuristic-based approaches, image processing, and NLP techniques were combined to make the tool robust to these variations and to extract the appropriate field values from the documents.

Tools used: Python, ImageMagick, Tesseract OCR, OpenCV
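
As an illustration of the heuristic field-parsing step described above, the following is a minimal sketch in Python. It is not the tool's actual implementation: the field names, the regular expressions, and the helper names (FIELD_PATTERNS, normalize_ocr_text, extract_fields) are hypothetical, and the input is assumed to be plain text already produced by an OCR engine such as Tesseract.

```python
import re
from typing import Dict, Optional

# Hypothetical field patterns for illustration only; the real tool's
# fields and rules are not described in the source.
FIELD_PATTERNS = {
    "loan_number": re.compile(
        r"loan\s*(?:no|number|#)\.?\s*[:\-]?\s*([A-Z0-9\-]{6,})", re.IGNORECASE
    ),
    "loan_amount": re.compile(
        r"(?:loan|principal)\s*amount\s*[:\-]?\s*\$?\s*([\d,]+(?:\.\d{2})?)",
        re.IGNORECASE,
    ),
    "borrower": re.compile(
        r"borrower(?:'s)?\s*(?:name)?\s*[:\-]\s*([A-Za-z .,'\-]+)", re.IGNORECASE
    ),
}


def normalize_ocr_text(text: str) -> str:
    """Collapse runs of spaces/tabs and drop empty lines, keeping line breaks."""
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


def extract_fields(text: str) -> Dict[str, Optional[str]]:
    """Return the first match for each field pattern, or None if absent."""
    cleaned = normalize_ocr_text(text)
    result: Dict[str, Optional[str]] = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(cleaned)
        result[name] = match.group(1).strip() if match else None
    return result


if __name__ == "__main__":
    sample = "  Loan   Number:  AB-1234567 \n\nBorrower:   Jane Q. Doe\nLoan Amount: $ 250,000.00\n"
    print(extract_fields(sample))
```

In a full pipeline, the text passed to extract_fields would presumably come from OCR on a deskewed and denoised page image; returning None for a missing field lets a caller flag pages that need manual review instead of guessing a value.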