Built with C++, the Docwire DocToText Cloud API lets you extract data and text from all popular file formats at the drop of a hat. It is compatible with all operating systems.
Big data is only useful if your company can keep up with it. The DocToText Cloud API allows you to scan any of the supported document formats, identify the text and data you specify, and extract it en masse into clean rich text or HTML. With the capability to process thousands of words a second, the Docwire DocToText API ensures fast, reliable, and consistent extraction no matter the file format.
The Docwire DocToText API derives from the SDK of the same name: a bespoke, open-source C++ development kit 10 years in the making. Because the API is accessed over HTTP, it integrates easily with any language that can send HTTP requests and parse the JSON response. It's dynamic in every sense of the word.
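As a rough illustration of that request-and-parse flow, here is a minimal Python sketch using only the standard library. This page does not document the endpoint URL, the authentication scheme, the upload format, or the shape of the JSON response, so `API_URL`, the headers, and the `"text"` field below are all hypothetical placeholders, not the actual Docwire interface.

```python
import json
import urllib.request

# Hypothetical endpoint: the real URL is not given on this page.
API_URL = "https://api.example.com/doctotext/extract"


def build_request(document: bytes, filename: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the raw document bytes.

    The header names and the bearer-token scheme are assumptions made
    for illustration only.
    """
    return urllib.request.Request(
        API_URL,
        data=document,
        method="POST",
        headers={
            "Content-Type": "application/octet-stream",
            "Authorization": f"Bearer {api_key}",
            "X-Filename": filename,
        },
    )


def parse_response(body: bytes) -> str:
    """Decode a JSON response body and return the extracted text.

    Assumes a response shaped like {"text": "..."}; the real schema may differ.
    """
    payload = json.loads(body.decode("utf-8"))
    return payload["text"]


# Sending the request would look like this (not executed here):
#
#     with urllib.request.urlopen(build_request(data, "report.pdf", key)) as resp:
#         text = parse_response(resp.read())
```

The same two steps, building an HTTP request and decoding JSON, are available in practically every mainstream language.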
DocToText supports a wide variety of file formats, including DOC, XLS, PDF, EML, HTML, Outlook (PST, OST), and images. This allows companies to extract data from many different types of documents and files without having to pull information out of each one manually.
Our next-level API (Importer, Exporter, Transformer) allows other software to interact with DocToText, enabling companies to import, export, and transform data seamlessly. Supported formats include, but are not limited to, DOC, XLS, PDF, EML, HTML, and a wide range of image formats.
The API comes with Docker, SIMD parallelization, and experimental OpenCL support for faster general processing.
It allows low-level manipulation and much higher levels of portability and scalability. The API is MSVC-compatible.
Can be integrated into apps and projects written in any language, from C to Java and Python.
Supports PDFs, all major office suites (iWork, Microsoft Office, LibreOffice), all popular image formats (JPG, PNG, TIFF, etc.), email formats, and more.
Process tens, hundreds, or thousands of documents at a time without any loss in performance.
Super easy to hook up to any data processing platform to extend its text extraction capabilities.
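For batch workloads like those described above, a client would typically submit many documents concurrently. The Python sketch below shows one way to do that with a thread pool. The `extract_one` callable is a hypothetical stand-in for whatever function actually sends a single document to the API, and the worker count is an arbitrary default rather than a documented limit.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def extract_batch(
    paths: Iterable[str],
    extract_one: Callable[[str], str],
    max_workers: int = 8,
) -> Dict[str, str]:
    """Run extract_one over every path concurrently.

    Returns a mapping of path -> extracted text. extract_one is supplied
    by the caller (for example, a function that uploads one file and
    returns the text from the response).
    """
    path_list = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs.
        texts = pool.map(extract_one, path_list)
        return dict(zip(path_list, texts))
```

Threads suit this kind of work because each task spends most of its time waiting on the network rather than using the CPU.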
From an updated Tesseract OCR to complete, bespoke support for all popular file formats, the Docwire DocToText Cloud API ensures that you can pull your data no matter the document. The Docwire DocToText API currently supports:
DOC, XLS, XLSB, PPT, RTF, ODF (ODT, ODS, ODP), OOXML (DOCX, XLSX, PPTX), iWork (PAGES, NUMBERS, KEYNOTE), ODFXML (FODP, FODS, FODT), PDF, EML, HTML, Outlook (PST, OST), Image (JPG, JPEG, JFIF, BMP, PNM, PNG, TIFF, WEBP)
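A client could use the list above to filter out unsupported files before uploading anything. The Python sketch below is one hypothetical way to do so: the `is_supported` helper is not part of the API, and the extension set simply mirrors the format names as printed in the list, which may not match real file extensions in every case (Keynote files, for example, usually end in `.key`).

```python
from pathlib import Path

# Format names copied from the supported-formats list, lowercased.
# Treating them as file extensions is an assumption of this sketch.
SUPPORTED_EXTENSIONS = frozenset({
    "doc", "xls", "xlsb", "ppt", "rtf",
    "odt", "ods", "odp",
    "docx", "xlsx", "pptx",
    "pages", "numbers", "keynote",
    "fodp", "fods", "fodt",
    "pdf", "eml", "html",
    "pst", "ost",
    "jpg", "jpeg", "jfif", "bmp", "pnm", "png", "tiff", "webp",
})


def is_supported(filename: str) -> bool:
    """Return True if the file's extension appears in the list above."""
    extension = Path(filename).suffix.lstrip(".").lower()
    return extension in SUPPORTED_EXTENSIONS
```

The comparison is case-insensitive, so `Report.PDF` and `report.pdf` are treated the same way.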