HTML Hell



GradyHendrix
03-27-2011, 07:08 AM
I'm not a technically adept person by any means, so this isn't entirely unexpected, but I'm in the middle of HTML hell. I've got an 80,000-word document written in MS Word and I'm trying to convert it to HTML and then to an ebook-friendly format like .epub or what-have-you.

But getting it clean in HTML from Word is awful. I've been trying to use an automatic conversion program like Word 2 Clean (http://word2cleanhtml.com/) but I'm finding so many insane errors with indenting that I'm going nuts.

Any suggestions on a better way to go about this, besides, "Don't write in MS Word?"

Medievalist
03-27-2011, 08:52 AM
Find a Mac-using friend with Pages.

Scrivener for Mac is a lovely tool for making epubs; it's worth contacting the developer or a Windows user and finding out if that's true for Windows.

WordPerfect for Windows does a pretty clean Save As HTML.

AmsterdamAssassin
03-27-2011, 11:16 AM
MS Word > Notepad [to remove all invisible formatting] > jEdit [to turn it into HTML] > Calibre [to convert it into .mobi/.epub/other formats].
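For the last step of that chain, Calibre ships a command-line tool called ebook-convert that takes an input file and an output file and picks the formats from the extensions. Here is a minimal Python sketch of calling it, assuming Calibre is installed and on the PATH. The function names and the file names book.html and book.epub are made up for illustration.

```python
import shutil
import subprocess
from typing import List


def build_convert_command(source: str, target: str) -> List[str]:
    """Return the argument list for Calibre's ebook-convert tool.

    The output format is inferred by Calibre from the target extension.
    """
    return ["ebook-convert", source, target]


def convert(source: str, target: str) -> None:
    """Run the conversion, failing early if Calibre is not available."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found; is Calibre installed?")
    subprocess.run(build_convert_command(source, target), check=True)


# Hypothetical usage:
# convert("book.html", "book.epub")
# convert("book.html", "book.mobi")
```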

GradyHendrix
03-27-2011, 06:30 PM
Thanks for the advice. I'm already using Bean for most writing and jEdit for HTML editing and tweaking. My real crisis right now is that I've done everything I know how to do to strip out the MS Word formatting for tab indents and line spacing. I mean... everything! And yet when I transfer the document to HTML and take a look, I still have the occasional bizarre tab indent and, for some even weirder reason, a line space after every single paragraph. That's really what has me crying blood right now.

How to avoid in the future - check!
How to fix what I've got in hand now - confused and frustrated!

zpeteman
03-27-2011, 07:08 PM
Why not just do a find/replace for all your tabs? Word's find function is very robust and it's pretty simple to remove just about any formatting.
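If the text has already left Word, the same find/replace idea can be run on a plain-text or HTML export. Here is a rough Python sketch, assuming every tab in the file is unwanted indentation. The function name is made up, and it won't catch indents that Word stored as paragraph styling instead of literal tab characters.

```python
import re


def strip_tab_indents(text: str) -> str:
    """Remove leading tabs/spaces on every line, then flatten stray tabs."""
    # (?m) makes ^ match at the start of each line, not just the first one.
    no_leading = re.sub(r"(?m)^[ \t]+", "", text)
    # Any tab left in the middle of a line becomes a single space.
    return no_leading.replace("\t", " ")


sample = "\tFirst paragraph.\n\t\tSecond\tparagraph."
print(strip_tab_indents(sample))
```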

Then copy it into Pages to make the epub (be sure to download the template from Apple). (If you aren't using a Mac then you probably shouldn't be writing in the first place! :) )

Medievalist
03-27-2011, 08:30 PM
Can you send me the HTML file? (medievalist AT Mac.com)

I'll send you back a cleaned up file; let me know what in particular you need done. (I suspect you could do this, but I've been hand-coding HTML a very long time, and have a fabulous Mac-only text editor called BBEdit.)

GradyHendrix
03-27-2011, 10:02 PM
If that's a serious offer, I'm definitely taking you up on it.

Thanks for the advice about find/replace (and I'm using a Mac, thank goodness - didn't know it was such an advantage but I can't imagine this stuff being harder!). I tried it and it worked on a big chunk of tabs but for some reason the first tab will not match up.

Medievalist
03-27-2011, 10:11 PM
I'm serious.

You might also take a look at TextWrangler for cleaning up text; it doesn't include the HTML tools of its more geeky sibling BBEdit Pro, but it is free and very good at cleaning up via search and replace.

http://www.barebones.com/

valeriec80
03-27-2011, 10:24 PM
I'd suggest learning how to code HTML if you don't already know it. This is a great site: http://htmldog.com.

Really, for a simple document like a book, you don't need to know that much code. Making a paragraph, centering things, italicizing things and putting line breaks in is about the extent of it, unless you need to get super fancy for some reason.
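To give a sense of how little markup that is, here is a rough Python sketch that turns plain text into paragraph and line-break tags. It assumes paragraphs are separated by blank lines, and the function name is made up. It doesn't handle italics or centering, and it escapes any tags already typed into the text, so those would still be added by hand afterwards.

```python
import html
import re


def paragraphs_to_html(text: str) -> str:
    """Wrap blank-line-separated paragraphs in <p> tags.

    Single newlines inside a paragraph become <br /> line breaks.
    """
    blocks = re.split(r"\n\s*\n", text.strip())
    parts = []
    for block in blocks:
        # Escape &, < and > so the prose cannot break the markup.
        lines = [html.escape(line.strip(), quote=False)
                 for line in block.splitlines()]
        parts.append("<p>" + "<br />\n".join(lines) + "</p>")
    return "\n".join(parts)


print(paragraphs_to_html("Chapter one.\n\nIt was dark & stormy.\nVery dark."))
```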

AmsterdamAssassin
03-27-2011, 11:26 PM
I replace all em-dashes, ellipses, and single and double quotes with HTML codes, make sure all italics are wrapped in <i></i>, then copy the whole document and paste it into Notepad or similar. That will strip all formatting - tabs, indents, fonts and font sizes. From Notepad I copy it into jEdit. That way, only the HTML formatting you've done prior to rinsing the document through Notepad will have survived.

It sounds like you copied the content of the .doc straight into jEdit, without first rinsing out the formatting through Notepad.
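The character replacement step can also be scripted instead of done by hand. Here is a minimal Python sketch, assuming the curly quotes, em-dashes and ellipses are real Unicode characters in the text. It writes numeric references such as &#8212;, which is a choice made for this example because they are valid in both HTML and XHTML. The function and variable names are made up.

```python
# Typographic characters to rewrite: em-dash, ellipsis, and the
# single and double curly quotes.
TYPOGRAPHIC = {"\u2014", "\u2026", "\u2018", "\u2019", "\u201c", "\u201d"}


def to_numeric_refs(text: str) -> str:
    """Replace each typographic character with a numeric character reference."""
    return "".join(
        f"&#{ord(ch)};" if ch in TYPOGRAPHIC else ch
        for ch in text
    )


print(to_numeric_refs("\u201cWait\u2026\u201d she said."))
```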

KathleenD
03-28-2011, 04:04 AM
This is either TMI or too simple for you, but I'm using this series of posts for my next book:

http://guidohenkel.com/2010/12/take-pride-in-your-ebook-formatting/

If it helps, you're welcome. If it doesn't... I was never here. ;)

GradyHendrix
03-28-2011, 07:15 PM
Yes, yes, yes! Kathleen! Yes! I have been following that site religiously! I actually find it hugely helpful and I'm glad you're seconding the notion that it's actually useful and good. I'm glad someone else is vouching for it - I was worried I might be using it and someone wiser would come in and say, "You're using THAT site? Ha! It's 5 years out of date!"

Thanks for the pointers and offers of assistance, everyone. I'm digging into this again today and taking all this on board and trying my best. Seriously, for someone who is shy about technical things I feel like I'm hacking my way through an unmapped forest, so getting feedback is keeping me from going crazy.