The Goal
Convert several hundred PDF and/or DOCX files into what looks like a single WordPress XML export file. Optionally, convert each one into an individual WP XML export file, or just into plain TXT files.
Why?
I have hundreds of PDFs that I would like to turn into blog entries in Squarespace. Squarespace supports importing from WordPress XML files, and this would greatly speed up the process of getting the text from the PDFs into blog entries in Squarespace.
Where I Am in the Conversion Process
I have figured out how to bulk convert PDFs into TXT files using Calibre and to do some minimal removal of unwanted text and HTML tags. However, I have not been able to reformat the resulting text into the WP XML export format. I know which code from a WP XML export file I need to add to the Calibre output, but I don't know whether it is possible to bulk convert the PDFs and append each conversion to a single XML output file that follows the WordPress export format. If I end up with a correctly formatted XML file for each individual PDF, that would still be great, and a huge time saver. Also acceptable would be TXT or HTML versions of my PDFs that could be opened and manually copied and pasted into a new blog entry on Squarespace.
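To make the target concrete, here is a rough Python sketch (outside Calibre) of what "append each conversion into a single WordPress-style export file" could look like. The element names follow what I see in a WP export (WXR) file, but which of them Squarespace actually requires is an assumption on my part, and `build_wxr`, the `posts` structure, and the default site title and author are hypothetical names made up for illustration.

```python
from xml.sax.saxutils import escape

# Minimal WXR-like skeleton. Real WordPress exports contain many more
# elements; this subset is an assumption about what an importer needs.
WXR_HEADER = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
  xmlns:content="http://purl.org/rss/1.0/modules/content/"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:wp="http://wordpress.org/export/1.2/">
<channel>
<title>{site_title}</title>
<wp:wxr_version>1.2</wp:wxr_version>
"""

ITEM = """<item>
<title>{title}</title>
<dc:creator>{author}</dc:creator>
<content:encoded><![CDATA[{body}]]></content:encoded>
<wp:post_id>{post_id}</wp:post_id>
<wp:post_date>{date}</wp:post_date>
<wp:status>publish</wp:status>
<wp:post_type>post</wp:post_type>
</item>
"""

WXR_FOOTER = "</channel>\n</rss>\n"


def build_wxr(posts, site_title="Imported PDFs", author="admin"):
    """Return one XML string with an <item> per post.

    `posts` is a list of dicts with "title", "html" and "date" keys
    (a hypothetical structure, one dict per converted PDF).
    """
    parts = [WXR_HEADER.format(site_title=escape(site_title))]
    for post_id, post in enumerate(posts, start=1):
        # A literal "]]>" would end the CDATA section early, so split it.
        body = post["html"].replace("]]>", "]]]]><![CDATA[>")
        parts.append(ITEM.format(
            title=escape(post["title"]),
            author=escape(author),
            body=body,
            post_id=post_id,
            date=post["date"],
        ))
    parts.append(WXR_FOOTER)
    return "".join(parts)
```

The result of `build_wxr(...)` could be written to a single `.xml` file, or the function could be called once per PDF with a one-element list to get an individual file for each document.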
The PDFs are unfortunately copyrighted content, so I have included a sample in which I changed the text to lorem ipsum or generic data. Essentially, I want to cherry-pick a few short chunks of text from these PDFs and apply basic HTML tags (such as bold, h1, and line breaks) to the output, so that the text is ready to copy and paste, or to import from a WP XML file, as correctly formatted HTML, minimizing editing on the Squarespace side.
In Calibre conversion terms I would like to know:
1. How best to eliminate all unwanted text and HTML tags in the source document.
2. How to add all relevant WordPress XML tags/code to the Calibre output file (TXT format), that is, wrap the WP tags around the text so it looks like it was created by exporting from WordPress. (Open any WP XML export file to see the format it follows.)
3. How to keep just the text I want, apply HTML formatting to the sentences I keep in the Calibre output file, and have them placed within the appropriate tags in the resulting WordPress XML file. See example below:
---Using Calibre's Search and Replace wizard, here is what Calibre "sees" in a sample source document:
<!-- created by calibre's pdftohtml -->
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
<head>
<title>Microsoft Word - Antifoam.docx</title>
<meta name="generator" content="pdftohtml 0.36"/>
<meta name="author" content="author"/>
<meta name="date" content="2012-03-21T08:56:38+00:00"/>
</head>
<body bgcolor="#A0A0A0" vlink="blue" link="blue">
<a id="1"></a><img src="index-1_1.jpg"/><br>
ZZ<br>
Z Z <br>
X YZ<br>
Z YX<br>
X Q<br>
Q W<br>
V ZX<br>
Q WYXZ<br>
Z X<br>
X XYZ<br>
Y Y<br>
Q Q<br>
Z Z,<br>
Y YY<br>
Q Z. <br>
123 Main Street <br>
City, ST 00000-0000 <br>
www.domain.com <br>
<b>Corporate:</b> 000-555-1212 <br>
<b>Fax: </b>000-555-1212 <br>
<b>East:</b> 000-555-1212 <br>
<b>Type of Document</b><br>
Date: 01/31/16 <br>
<br>
Supersedes: 01/31/15 <br>
<b>PRODUCT #: 12345</b> <br>
<i><b>PRODUCT TITLE</b></i><br>
Product Subtitle<br>
<i><b>Product Description:</b></i> <br>
Lorem ipsum<br>
lorem ipsum. <i><b>PRODUCT TITLE </b></i>lorem ipsum. <br>
lorem ipsum.<i><b> </b></i><br>
<br>
<br>
<i><b>Product Directions:</b></i> <br>
Lorem ipsum<i><b>PRODUCT TITLE</b></i> lorem ipsum <br>
lorem ipsum<br>
Lorem ipsum<br>
lorem ipsum. <br>
<br>
<br>
<i><b>Product Specifications:</b></i> <br>
<br>
<br>
<b>Product Appearance:</b> <br>
Lorem ipsum <br>
<br>
<br>
<b>Density:</b> <br>
lorem ipsum<br>
<br>
<br>
<b>Product Ingredients: </b><br>
None <br>
<br>
<b>(lorem ipsum)</b> <br>
<br>
<b>Product Warnings:</b> <br>
Lorem ipsum <br>
<br>
<br>
<br>
<br>
<br>
<br>
Legal disclaimer line 1<br>
legal disclaimer line 2 <br>
legal disclaimer line 3<br>
<hr/>
</body>
</html>
--------------------------------------------------------------------------------
////END OF SAMPLE DOCUMENT AS CALIBRE SEES IT
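As an illustration of the "keep just the text I want" step, here is a minimal Python sketch that works on the pdftohtml output above rather than inside Calibre's wizard. It assumes every document uses the same layout as the sample: a section starts at a line that contains only a bold (optionally italic) label ending in a colon, and lines are separated by `<br>`. The function names and the choice of `<h2>` and `<p>` for the output are my own, not anything Calibre provides.

```python
import re
from html import escape, unescape

# A section heading is a chunk that is ONLY a bold (optionally italic)
# label ending in a colon, e.g. <i><b>Product Description:</b></i>
HEADING = re.compile(r"^(?:<i>)?<b>([^<]+?):\s*</b>(?:</i>)?$")
TAG = re.compile(r"<[^>]+>")


def split_sections(page_html):
    """Map each heading label to the list of text lines that follow it."""
    sections = {}
    current = None
    for chunk in re.split(r"<br\s*/?>", page_html):
        chunk = chunk.strip()
        match = HEADING.match(chunk)
        if match:
            current = match.group(1).strip()
            sections[current] = []
            continue
        # Drop inline tags and collapse runs of whitespace.
        text = " ".join(unescape(TAG.sub("", chunk)).split())
        if current is not None and text:
            sections[current].append(text)
    return sections


def to_post_html(sections, wanted):
    """Build simple HTML for the wanted sections, in the order given."""
    parts = []
    for name in wanted:
        lines = sections.get(name)
        if lines:
            parts.append("<h2>{}</h2>\n<p>{}</p>".format(
                escape(name), escape(" ".join(lines))))
    return "\n".join(parts)
```

For the sample, `to_post_html(split_sections(html_text), ["Product Description", "Product Directions"])` would keep only those two sections. Everything before the first heading (the letterhead, phone numbers, and dates) is dropped, but note that the last section in a file also collects whatever follows it, such as the legal disclaimer, so that would need separate handling if the final section is wanted.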
If I have been unclear, please let me know. Hopefully you get the gist of what I am asking for. Thanks in advance.