Tables in PDFs
Mauricio asked 5 months ago

I am testing pdfalchemist.exe to extract text from a PDF file that contains tables. The conversion to HTML just writes the table as an image; the conversion to XML does include the text from the table, but:

  • the line breaks are lost;
  • the column alignment is lost when there are empty columns.

Is there a way to address these two issues?

Datalogics Staff replied 5 months ago

What version of PDF Alchemist are you using and on what platform?
I assume the problems are specific to one PDF file. Can you describe the type of table that has this problem (size, complexity)? Does it span pages?

Datalogics Staff replied 5 months ago

Also, are you using any of the OCR options? Can you show us the XML output for a line or two of the table?

1 Answer
Best Answer
Datalogics Staff answered 5 months ago

Please review the documentation for the -purpose parameter. Note that HTML and EPUB output default to “balanced”, which may write tables as images to better preserve the original appearance. XML output defaults to “indexing”, which preserves text for searching and indexing workflows, but the result may differ significantly from the PDF's appearance.
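
For anyone scripting this, here is a rough Python sketch of passing an explicit -purpose value for HTML output instead of relying on the default. Only the -purpose parameter and the values “balanced” and “indexing” come from this thread. The function names, the file names, and above all the argument layout (the value as a separate token after -purpose, then the input path, then the output path) are assumptions, so please check them against the tool's documentation before use.

```python
import subprocess


def build_command(exe, pdf_path, out_path, purpose):
    """Return the argument list for one conversion run.

    ASSUMPTION: the value follows "-purpose" as a separate token and the
    input and output paths come last. This layout is a guess, not taken
    from the product documentation.
    """
    return [str(exe), "-purpose", purpose, str(pdf_path), str(out_path)]


def convert(exe, pdf_path, out_path, purpose):
    """Run the converter; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_command(exe, pdf_path, out_path, purpose), check=True)
```

Calling convert("pdfalchemist.exe", "tables.pdf", "tables.html", "indexing") would then request the text-preserving behaviour for HTML output, if the assumed layout is right. Whether that restores line breaks and empty columns in the table is something to verify on the actual file.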