I am using Calibre to convert a .docx file with a complex layout (lots of figures, tables, etc.) into .epub and .mobi. The conversion succeeds, but I have been using regular expressions to find and replace some formatting irregularities.
The regular expressions I've written work most of the time but still miss about 30% of the things I'm trying to fix, leaving me to go through each .html file and fix things by hand. My books are often 500+ pages, and this process is getting tedious. I would like to share my input file and the regular expressions I'm using, and ask whether anyone can suggest how to make these expressions more bulletproof.
Here's the process I use. (Links to files appear below.) I start with a complex-layout .docx like the one linked below. Before conversion I replace non-standard characters (like in-line arrows and smiley faces) with ASCII-character equivalents; I also replace multiple images in a table with just one image. Then I use calibre to convert to .epub.
From there I run the following regular expressions:
This widens all tables to fill the reader width:
Find: <table class="table_.*">
Replace with: <table width="100%">
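One thing that may be worth checking in the table pattern is the greedy `.*`. A minimal sketch in Python's `re` module, used here only as a stand-in for calibre's search and run on a made-up one-line sample (the real calibre output may look different), shows how `.*` can run past the table's own closing quote when another `">` follows on the same line, and how a negated character class avoids that:

```python
import re

# Hypothetical sample markup, not taken from the actual book.
html = '<table class="table_1"><tr><td class="td_2">a</td></tr></table>'

# Original pattern: greedy .* extends to the LAST "> on the line,
# so the <tr> and <td> opening tags are swallowed by the match.
greedy = re.sub(r'<table class="table_.*">', '<table width="100%">', html)

# Tighter variant: [^"]* cannot cross the closing quote of the class value,
# so only the <table ...> tag itself is replaced.
tight = re.sub(r'<table class="table_[^"]*">', '<table width="100%">', html)

print(greedy)
print(tight)
```

Whether this is behind any of the misses depends on how the tags are laid out in the converted files; when each table tag sits on its own line, both patterns behave the same.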
Next I want to enlarge all images that appear within tables / figures:
Find: <table width="100%">((.|\n)*?)src=(.*?)class=(.*)/>(.*?)Figure((.|\n)*?)/table>
Replace with: <table width="100%"> \1 src= \3 width="100%"/> \5 Figure \6/table>
This works for most but not all images within figures. The red arrows in the upper-right-hand corner of http://friedmanarchives.com/~downloa...escription.jpg show examples of where it fails.
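One possible reason for the misses is that a single pattern has to span the whole table, so it rewrites at most one image per match and the greedy `class=(.*)/>` can overshoot. A rough alternative, sketched below in Python under several assumptions, is to work in two passes: find each table first, then rewrite every `<img>` tag inside the tables that mention "Figure". The function names (`widen_img`, `widen_figure_tables`) and the sample markup are hypothetical, and the sketch assumes tables are not nested and that the first regex has already produced `<table width="100%">`.

```python
import re

# Non-greedy match of one table at a time; assumes tables are not nested.
TABLE_RE = re.compile(r'<table width="100%">.*?</table>', re.DOTALL)
# One whole <img ...> tag; [^>]* cannot run past the end of the tag.
IMG_RE = re.compile(r'<img\b[^>]*>')
# The class attribute that the original replacement discards.
CLASS_RE = re.compile(r'\sclass="[^"]*"')


def widen_img(match):
    """Drop the class attribute and add width="100%" to a single img tag."""
    tag = match.group(0)
    if 'width=' in tag:
        return tag  # already widened, leave it alone
    tag = CLASS_RE.sub('', tag)
    if tag.endswith('/>'):
        return tag[:-2].rstrip() + ' width="100%"/>'
    return tag[:-1].rstrip() + ' width="100%">'


def widen_figure_tables(html):
    """Widen every image inside tables whose text contains 'Figure'."""
    def fix_table(match):
        table = match.group(0)
        if 'Figure' not in table:
            return table  # not a figure table, keep unchanged
        return IMG_RE.sub(widen_img, table)
    return TABLE_RE.sub(fix_table, html)


# Hypothetical sample: one figure table with two images, one plain table.
sample = (
    '<table width="100%"><tr><td><img src="a.jpg" class="c1"/></td>'
    '<td><img src="b.jpg" class="c2"/></td></tr>'
    '<tr><td>Figure 1</td></tr></table>'
    '<table width="100%"><tr><td><img src="c.jpg" class="c3"/></td></tr></table>'
)
result = widen_figure_tables(sample)
print(result)
```

Because each image tag is handled on its own, the number of images per table and the line breaks between them no longer matter. The sketch does not handle nested tables, and an HTML parser would be more robust than regular expressions if the markup varies a lot.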
The files are too large to upload but you can download them from my server:
1) The original .docx file (so you can see the complex layout as it was intended for printed form): http://friedmanarchives.com/~downloa.../Original.docx
2) The .epub version after calibre had converted it (and after I started to fix things by hand): http://friedmanarchives.com/~downloa...ed_output.epub
I'd even be willing to pay someone to help create more bulletproof regular expressions and to help fix other formatting anomalies that I currently have to tweak by hand in HTML.
Sorry for the long post; hopefully some of you can be of help!
Sincerely, Gary