How to convert .docx file to html file with formatting using open xml sdk 2.5
Question
-
Hi,
I am trying to convert a document file (.docx) to an HTML file. The conversion itself currently works, but the HTML file does not retain the formatting. I am using Open XML SDK 2.0.
For example, if a paragraph in the .docx file contains red text, with some of it bold and underlined, the converted HTML shows all the lines as plain text and loses all the formatting.
Here is my current code:
public string ConvertDocxToHtml(string docxFileEncodedData)
{
    string inputFileName = DateTime.Now.ToString("ddMMyyyyhhmmss") + ".docx";
    string imageDirectoryName = inputFileName.Split('.')[0] + "_files";
    DirectoryInfo imgDirInfo = new DirectoryInfo(HttpContext.Current.Server.MapPath("~/Documents/" + imageDirectoryName));
    int imageCounter = 0;
    byte[] byteArray = Convert.FromBase64String(docxFileEncodedData); //File.ReadAllBytes(docxFile);
    using (MemoryStream memoryStream = new MemoryStream())
    {
        memoryStream.Write(byteArray, 0, byteArray.Length);
        using (WordprocessingDocument doc = WordprocessingDocument.Open(memoryStream, true))
        {
            HtmlConverterSettings settings = new HtmlConverterSettings()
            {
                PageTitle = inputFileName,
                ConvertFormatting = false,
            };
            XElement html = HtmlConverter.ConvertToHtml(doc, settings, imageInfo =>
            {
                DirectoryInfo localDirInfo = imgDirInfo;
                if (!localDirInfo.Exists)
                    localDirInfo.Create();
                ++imageCounter;
                string extension = imageInfo.ContentType.Split('/')[1].ToLower();
                ImageFormat imageFormat = null;
                if (extension == "png")
                {
                    // Convert the .png file to a .jpeg file.
                    extension = "jpeg";
                    imageFormat = ImageFormat.Jpeg;
                }
                else if (extension == "bmp")
                    imageFormat = ImageFormat.Bmp;
                else if (extension == "jpeg")
                    imageFormat = ImageFormat.Jpeg;
                else if (extension == "tiff")
                    imageFormat = ImageFormat.Tiff;
                // If the image format is not one that you expect, ignore it,
                // and do not return markup for the link.
                if (imageFormat == null)
                    return null;
                string imageFileName = "image" + imageCounter.ToString() + "." + extension;
                try
                {
                    imageInfo.Bitmap.Save(imgDirInfo.FullName + "/" + imageFileName, imageFormat);
                }
                catch (System.Runtime.InteropServices.ExternalException)
                {
                    return null;
                }
                XElement img = new XElement(Xhtml.img,
                    new XAttribute(NoNamespace.src, imageDirectoryName + "/" + imageFileName),
                    imageInfo.ImgStyleAttribute,
                    imageInfo.AltText != null ? new XAttribute(NoNamespace.alt, imageInfo.AltText) : null);
                return img;
            });
            string htmlFilePath = HttpContext.Current.Server.MapPath("~/Documents/" + inputFileName.Split('.')[0] + ".html");
            File.WriteAllText(htmlFilePath, html.ToStringNewLineOnAttributes());
            return ConfigurationManager.AppSettings["ServerUri"].ToString() + "/Documents/" + inputFileName.Split('.')[0] + ".html";
        }
    }
}
So I just want to know: how can I retain the formatting of the .docx file in the HTML file?
Thanks
Thursday, December 19, 2013 10:42 AM
Answers
-
As far as I know, HtmlConverter does not retain the formatting of the document. You can use CSS stylesheets for the formatting, but you have to define them yourself.
You can find more information about how to do that here.
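As a rough illustration of the stylesheet approach, the sketch below adds a style element to the head of the XElement that the converter returns, using only System.Xml.Linq. CssInjector and AddStylesheet are hypothetical names made up for this example (they are not part of the Open XML SDK or PowerTools), and the CSS rule in the usage note is a placeholder that you would replace with rules matching your own document.

```csharp
using System;
using System.Xml.Linq;

// Hypothetical helper for illustration only; not part of the Open XML SDK or PowerTools.
public static class CssInjector
{
    // Adds a <style> element holding the given CSS to the <head> of the
    // converted document, creating the <head> if the markup has none.
    public static XElement AddStylesheet(XElement html, string css)
    {
        if (html == null) throw new ArgumentNullException(nameof(html));
        if (css == null) throw new ArgumentNullException(nameof(css));

        // Reuse the root element's namespace so the new elements match the
        // rest of the markup, whether or not it is in the XHTML namespace.
        XNamespace ns = html.Name.Namespace;

        XElement head = html.Element(ns + "head");
        if (head == null)
        {
            head = new XElement(ns + "head");
            html.AddFirst(head);
        }

        // Note: LINQ to XML escapes '<', '>' and '&' in text when serializing,
        // so rules that use those characters need extra care.
        head.Add(new XElement(ns + "style",
            new XAttribute("type", "text/css"),
            css));
        return html;
    }
}
```

In the code from the question, this could be called on the html variable just before File.WriteAllText, for example CssInjector.AddStylesheet(html, "p { font-family: Calibri; }").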
- Marked as answer by Fei Xue (Microsoft employee), Wednesday, December 25, 2013 11:42 AM
Thursday, December 19, 2013 12:46 PM
All replies
-
What libraries need to be included for this?
Kaiser
Wednesday, November 4, 2015 11:17 PM
-
Thanks, this is a really helpful piece of code.
Thursday, December 7, 2017 10:21 AM
-
In case you're interested, here is a newer version of that code:
https://github.com/OfficeDev/Open-Xml-PowerTools/blob/vNext/OpenXmlPowerToolsExamples/WmlToHtmlConverter02/WmlToHtmlConverter02.cs
You'll notice that HtmlConverter was renamed to WmlToHtmlConverter.
Also, an alternative approach that does retain the document's formatting is to use this Word library for .NET.
It does so by applying the required inline styling to the HTML elements. For instance, check this conversion of Word to HTML in C#:
DocumentModel document = DocumentModel.Load("HtmlExport.docx");
// Images will be embedded directly in the HTML img src attribute.
document.Save("Html Export.html", new HtmlSaveOptions() { EmbedImages = true });
- Edited by Pusting Thursday, January 18, 2018 10:50 AM
Thursday, January 18, 2018 9:40 AM