How to create HTML from InfoPath Text Data

Note: This was done with InfoPath 2003. The principles should be the same for InfoPath 2007, but I haven’t tested it yet.
Also .. PLEASE don’t confuse this post with Form Server. This is an isolated and independent code example, which I developed mainly to create PDFs (the PDF creator I used accepts formatted XHTML as input to generate the PDF).
 

I wrestled with this for quite some time. InfoPath has several little nuances that you need to be aware of:

    • When you create a Text Box, the text formatting is stored using HTML. This includes fonts, colours, styles, tables and Images.

    • Line breaks, however, are NOT stored using break tags. For some reason (ask the InfoPath development team?) line breaks inside text boxes are stored using a Unicode line-break character!
    • Images are also embedded as data into the text. It took some rummaging to eventually find (on the InfoPath Team Blog) that they are encoded as Base64 binary arrays!

    So .. on to the code!

    I will assume that the InfoPath data is being loaded from an SPFile object in SharePoint (i.e. from a Form Library). You can, however, load it from wherever you like. If you want to get the data from a file, you can use standard .NET practices to load a file from the file system into a MemoryStream.
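    For the file-system case, a minimal sketch might look like the following. The names FormFileLoader and LoadFromDisk are made up for illustration; the only real APIs used are File.ReadAllBytes, MemoryStream and XmlDocument.Load.

```csharp
using System.IO;
using System.Xml;

public static class FormFileLoader
{
    // Hypothetical helper: reads a saved InfoPath form (.xml) from disk
    // and parses it into an XmlDocument via a MemoryStream.
    public static XmlDocument LoadFromDisk(string path)
    {
        // grab the binary file data into a byte array
        byte[] fileBytes = File.ReadAllBytes(path);

        XmlDocument xmlDoc = new XmlDocument();
        using (MemoryStream fileStream = new MemoryStream(fileBytes))
        {
            // load in the XML form data
            xmlDoc.Load(fileStream);
        }
        return xmlDoc;
    }
}
```

    From there the document can be queried in exactly the same way as one loaded from an SPFile.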

    The following Code Snippet will allow you to read out the string values stored in Text Boxes from the XML structure of the XML File itself.

    Code Snippet

    // required namespaces
    using System;
    using System.IO;
    using System.Xml;

    public void LoadSPFile(Microsoft.SharePoint.SPFile xmlFile)
    {
        try
        {
            if (xmlFile != null)
            {
                // grab the binary file data into a byte array
                byte[] fileBytes = xmlFile.OpenBinary();

                // create an XML Document to grab the required information
                XmlDocument xmlDoc = new XmlDocument();

                // parse those bytes into a memory stream
                using (MemoryStream fileStream = new MemoryStream(fileBytes))
                {
                    // load in the XML form data
                    xmlDoc.Load(fileStream);
                }

                #region InfoPath – Create NameSpaceManager
                // the "my" prefix is defined in the Root Element,
                // so we can retrieve it from there
                XmlNode root = xmlDoc.DocumentElement;
                string infoPathNsUri = root.GetNamespaceOfPrefix("my");
                string infoPathNsPrefix = "my";

                // to use the namespace in the XPath queries we must first
                // load the InfoPath URI into a Name Space Manager.
                XmlNamespaceManager nsMgr = new XmlNamespaceManager(xmlDoc.NameTable);
                nsMgr.AddNamespace(infoPathNsPrefix, infoPathNsUri);
                #endregion

                // we can now do "NameSpace Aware" XPath Queries
                #region Load the properties for the object using XPath queries

                // get document text from a TextBox called txtTextBox
                // (guard against the node being missing from the form)
                XmlNode textBoxNode = xmlDoc.SelectSingleNode("//my:txtTextBox", nsMgr);
                string strTextBoxValue = (textBoxNode != null) ? textBoxNode.InnerXml : string.Empty;

                // get all TextBox nodes called "txtRepeatText"
                // from a Repeating Section called "SectionsRepeater"
                XmlNodeList sectionNodes = xmlDoc.SelectNodes("//my:SectionsRepeater/my:txtRepeatText", nsMgr);

                #endregion
            }
        }
        catch (Exception ex)
        {
            throw new Exception("Error in LoadSPFile() – " + ex.Message, ex);
        }
    }

     

    Now note that InfoPath forms use XML namespaces. That means your XPath queries won’t work unless you load the namespace into an XmlNamespaceManager and use that to select the nodes.

 

Having done that, we should now have one (or more) string values which contain the Text Box "data".
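For the repeating section, the node list still has to be turned into strings. A minimal sketch of that step, assuming you pass in the same document and namespace manager built above (RepeatingTextReader and ReadAll are illustrative names, not part of any library):

```csharp
using System.Collections.Generic;
using System.Xml;

public static class RepeatingTextReader
{
    // Hypothetical helper: returns the inner XML of every node matched by
    // the namespace-aware XPath query, in document order.
    public static List<string> ReadAll(XmlDocument xmlDoc, string xpath, XmlNamespaceManager nsMgr)
    {
        List<string> values = new List<string>();
        foreach (XmlNode node in xmlDoc.SelectNodes(xpath, nsMgr))
        {
            // InnerXml keeps any XHTML formatting stored in the Text Box
            values.Add(node.InnerXml);
        }
        return values;
    }
}
```

Calling ReadAll(xmlDoc, "//my:SectionsRepeater/my:txtRepeatText", nsMgr) would then give one string per repeated Text Box.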

As previously mentioned, this "data" will be XHTML formatted, but the line breaks are NOT.

So .. to fix the line breaks you have to replace those Unicode characters with <br /> tags.

 

Code Snippet

// replace the Unicode character that InfoPath uses for line breaks
// with a break tag (yes .. I know it’s annoying!)
strHtml = strHtml.Replace("�", "<br />");

 

 

Now .. to get the correct character, copy and paste it into your code. Alternatively, you can use the XML Serialised view when stepping through your code, and copy and paste the character directly from there while developing.

NOTE: adding this will force you to save your source code files in Unicode format. Visual Studio should prompt you to do that after adding this character.
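If you would rather not paste a raw character into your source, one alternative is to write it as a C# Unicode escape, which keeps the source file plain ASCII. The sketch below takes the character as a parameter, because the exact code point is whatever you find in your own form data. LineBreakFixer is a made-up name, and the U+2028 (LINE SEPARATOR) in the usage line is only an assumption for illustration, not a confirmed InfoPath value.

```csharp
public static class LineBreakFixer
{
    // Hypothetical helper: replaces every occurrence of lineBreakChar
    // with an XHTML break tag. Null or empty input is returned unchanged.
    public static string ToBreakTags(string html, char lineBreakChar)
    {
        if (string.IsNullOrEmpty(html))
        {
            return html;
        }
        return html.Replace(lineBreakChar.ToString(), "<br />");
    }
}
```

Usage would be along the lines of strHtml = LineBreakFixer.ToBreakTags(strHtml, '\u2028'); with the escape swapped for the character you actually see in your data.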

Now, the string we have so far is NOT a fully formed XHTML document. We need to add the namespace declarations and the <head> tags, and format the rest of the file. So we need to wrap our string in the following:

<html xmlns:xsf2="http://schemas.microsoft.com/office/infopath/2006/solutionDefinition/extensions"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns:xdEnvironment="http://schemas.microsoft.com/office/infopath/2006/xslt/environment"
      xmlns:xdUser="http://schemas.microsoft.com/office/infopath/2006/xslt/User"
      xmlns:xhtml="http://www.w3.org/1999/xhtml"
      xmlns:my="http://schemas.microsoft.com/office/infopath/2003/myXSD/2007-03-28T10:00:14"
      xmlns:xd="http://schemas.microsoft.com/office/infopath/2003"
      xmlns:msxsl="urn:schemas-microsoft-com:xslt"
      xmlns:x="urn:schemas-microsoft-com:office:excel"
      xmlns:xdExtension="http://schemas.microsoft.com/office/infopath/2003/xslt/extension"
      xmlns:xdXDocument="http://schemas.microsoft.com/office/infopath/2003/xslt/xDocument"
      xmlns:xdSolution="http://schemas.microsoft.com/office/infopath/2003/xslt/solution"
      xmlns:xdFormatting="http://schemas.microsoft.com/office/infopath/2003/xslt/formatting"
      xmlns:xdImage="http://schemas.microsoft.com/office/infopath/2003/xslt/xImage"
      xmlns:xdUtil="http://schemas.microsoft.com/office/infopath/2003/xslt/Util"
      xmlns:xdMath="http://schemas.microsoft.com/office/infopath/2003/xslt/Math"
      xmlns:xdDate="http://schemas.microsoft.com/office/infopath/2003/xslt/Date"
      xmlns:sig="http://www.w3.org/2000/09/xmldsig#"
      xmlns:xdSignatureProperties="http://schemas.microsoft.com/office/infopath/2003/SignatureProperties"
      xmlns:ipApp="http://schemas.microsoft.com/office/infopath/2006/XPathExtension/ipApp">
  <head>
    <meta http-equiv="Content-Type" content="text/html"></meta>
  </head>
  <body>

<!-- STRING DATA GOES HERE -->

  </body>
</html>

 
Once we’ve done that, we almost have a fully valid XHTML structured file.
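As a rough sketch of that wrapping step in code, a small helper could build the shell around the Text Box string. XhtmlWrapper and Wrap are illustrative names, and only two namespace declarations are included here to keep it short; you would add whichever prefixes your own form data uses. Note that namespace URIs are compared character for character, and both the XHTML and InfoPath namespaces are defined with the http:// scheme.

```csharp
using System.Text;

public static class XhtmlWrapper
{
    // Hypothetical helper: wraps the Text Box markup in a minimal
    // html/head/body shell so it can be loaded as one XML document.
    public static string Wrap(string bodyMarkup)
    {
        StringBuilder sb = new StringBuilder();
        sb.Append("<html xmlns:xhtml=\"http://www.w3.org/1999/xhtml\"");
        sb.Append(" xmlns:xd=\"http://schemas.microsoft.com/office/infopath/2003\">");
        sb.Append("<head><meta http-equiv=\"Content-Type\" content=\"text/html\"></meta></head>");
        sb.Append("<body>");
        sb.Append(bodyMarkup);
        sb.Append("</body></html>");
        return sb.ToString();
    }
}
```

The result of Wrap(strHtml) can then be passed to XmlDocument.LoadXml.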

Now, we have to sort out the images.

In order to find all of the IMG tags we need to do an XPath query on the XHTML. However, this again means we need to declare an XHTML namespace for the XPath queries.

  Code Snippet

XmlDocument xmlDoc = new XmlDocument();

xmlDoc.LoadXml(strHtml);

// Create NameSpaceManager

XmlNode root = xmlDoc.DocumentElement;

string infoPathNsUri = "http://www.w3.org/1999/xhtml";

string infoPathNsPrefix = "XHTML";

// to use the namespace in the XPath we must first

// load the XHTML URI into a Name Space Manager.

XmlNamespaceManager nsMgr = new XmlNamespaceManager(xmlDoc.NameTable);

nsMgr.AddNamespace(infoPathNsPrefix, infoPathNsUri);

// we can now do "NameSpace Aware" XPath Queries

 
Once we have our namespace manager we can find all of the "IMG" tags in the Body section.

 

Code Snippet

// requires: using System; using System.Collections; using System.Drawing;
//           using System.Drawing.Imaging; using System.IO; using System.Xml;

XmlNode bodyNode = xmlDoc.SelectSingleNode("html/body");

// ".//" keeps the search inside the body node
XmlNodeList imgNodes = bodyNode.SelectNodes(".//XHTML:img", nsMgr);

// store filenames in here to be removed later
ArrayList arrFilesList = new ArrayList();

// make sure the temp folder exists (does nothing if it already does)
Directory.CreateDirectory(@"C:\Temp");

foreach (XmlNode imgNode in imgNodes)
{
    // the Base64 image data lives in the "xd:inline" attribute
    XmlAttribute imgInlineAttribute = imgNode.Attributes["xd:inline"];
    if (imgInlineAttribute == null)
    {
        continue; // not an embedded image, leave it alone
    }

    byte[] image = Convert.FromBase64String(imgInlineAttribute.Value);

    // Create GUID filename for image
    string imgFileName = @"C:\Temp\" + Guid.NewGuid().ToString() + ".jpg";

    // the stream must stay open for as long as the image is in use
    using (MemoryStream memStr = new MemoryStream(image))
    using (System.Drawing.Image img = System.Drawing.Image.FromStream(memStr))
    {
        // save as a real JPEG so the content matches the .jpg extension
        img.Save(imgFileName, System.Drawing.Imaging.ImageFormat.Jpeg);
    }

    // store the filename
    arrFilesList.Add(imgFileName);

    // reset the HTML source (creates the "src" attribute if it is missing)
    ((XmlElement)imgNode).SetAttribute("src", imgFileName);

    // Remove the "xd:inline" attribute
    imgNode.Attributes.Remove(imgInlineAttribute);
}

// get the newly formatted HTML and put it back in our string
strHtml = xmlDoc.InnerXml;

 

 

Now note that for tracking purposes we have created each "image" as a jpg file, with a GUID as the filename. We’ve used a GUID to make sure that they don’t have any file conflicts.

 

Once you’ve finished with your HTML you can clear up those files by calling:

 

Code Snippet

foreach(string strFileName in arrFilesList)

{

System.IO.File.Delete(strFileName);

}

 

So .. now you should have formatted HTML. You can post this to the screen, or you can save it to a file (note that the image files would need to remain in place for the file to work, though you could quite easily save the images somewhere else and re-point the references).
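One way the re-pointing could be done is sketched below, under the assumption that the src attributes currently hold full paths to the temporary image files. ImageRepointer and Repoint are made-up names; the helper copies each image into a folder of your choice and rewrites src to the bare file name, so the HTML file and its images can sit side by side.

```csharp
using System.IO;
using System.Xml;

public static class ImageRepointer
{
    // Hypothetical helper: copies every image referenced by an XHTML img
    // element into outputFolder and makes its src relative to that folder.
    public static void Repoint(XmlDocument xmlDoc, string outputFolder)
    {
        Directory.CreateDirectory(outputFolder);

        XmlNamespaceManager nsMgr = new XmlNamespaceManager(xmlDoc.NameTable);
        nsMgr.AddNamespace("XHTML", "http://www.w3.org/1999/xhtml");

        foreach (XmlNode imgNode in xmlDoc.SelectNodes("//XHTML:img", nsMgr))
        {
            XmlAttribute src = imgNode.Attributes["src"];
            if (src == null || !File.Exists(src.Value))
            {
                continue; // nothing on disk to copy for this image
            }

            string fileName = Path.GetFileName(src.Value);
            File.Copy(src.Value, Path.Combine(outputFolder, fileName), true);
            src.Value = fileName;
        }
    }
}
```

After calling Repoint(xmlDoc, outputFolder), saving the document into the same folder (for example with xmlDoc.Save) keeps the image references valid even once the temporary files are deleted.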