Clean Invalid XML Characters In C#

Here's a cool way to clean large XML files that contain invalid XML characters.
Note: Stream from is the original XML file, while Stream to is the new XML file with the invalid characters removed.
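using System.IO;
using System.Text.RegularExpressions;

// Reads all of "from", strips invalid XML characters, and writes the result to "to".
private void Copy(Stream from, Stream to)
{
    TextReader reader = new StreamReader(from);
    TextWriter writer = new StreamWriter(to);
    // Write, not WriteLine, so no extra line break is appended to the output.
    writer.Write(CleanInvalidXmlChars(reader.ReadToEnd()));
    writer.Flush();
}

public static string CleanInvalidXmlChars(string text)
{
    // \x takes exactly two hex digits in .NET, so four-digit code points need \u.
    // Characters above U+FFFF are surrogate pairs in UTF-16 and are removed too.
    string re = @"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD]";
    return Regex.Replace(text, re, "");
}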
Source: http://social.msdn.microsoft.com/Forums/
Post: Invalid character returned from webservice
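
Copy reads the whole input with ReadToEnd, so a very large file is held in memory at once. The following is a minimal sketch of a chunked alternative, not part of the original MSDN sample: the class name XmlStreamCleaner, the methods Clean and CleanFile, the 8192-character buffer, and the choice of UTF-8 are all assumptions made for illustration. Unlike the regex, it keeps valid surrogate pairs (characters above U+FFFF) and drops only unpaired surrogates.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, named for illustration only.
public static class XmlStreamCleaner
{
    // True for UTF-16 code units that are legal XML 1.0 characters on their own.
    private static bool IsLegalBmpChar(char c)
    {
        return c == '\t' || c == '\n' || c == '\r'
            || (c >= '\u0020' && c <= '\uD7FF')
            || (c >= '\uE000' && c <= '\uFFFD');
    }

    // Copies text in fixed-size chunks, dropping characters that XML 1.0 forbids.
    public static void Clean(TextReader reader, TextWriter writer)
    {
        var buffer = new char[8192];   // assumed chunk size
        char pendingHigh = '\0';       // high surrogate waiting for its partner
        bool hasPending = false;
        int read;
        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                char c = buffer[i];
                if (hasPending)
                {
                    hasPending = false;
                    if (char.IsLowSurrogate(c))
                    {
                        // Valid pair (U+10000 to U+10FFFF): keep both halves.
                        writer.Write(pendingHigh);
                        writer.Write(c);
                        continue;
                    }
                    // The unpaired high surrogate is dropped; c is checked below.
                }
                if (char.IsHighSurrogate(c))
                {
                    pendingHigh = c;
                    hasPending = true;
                }
                else if (IsLegalBmpChar(c))
                {
                    writer.Write(c);
                }
            }
        }
        writer.Flush();
    }

    // Convenience wrapper for files; assumes both files are UTF-8.
    public static void CleanFile(string inputPath, string outputPath)
    {
        using (var reader = new StreamReader(inputPath, Encoding.UTF8))
        using (var writer = new StreamWriter(outputPath, false, new UTF8Encoding(false)))
        {
            Clean(reader, writer);
        }
    }
}
```

A call such as XmlStreamCleaner.CleanFile("input.xml", "clean.xml") (both file names are placeholders) would then write the cleaned copy without loading the whole document. The original Copy method can be called in much the same way by passing File.OpenRead and File.Create streams.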

Comments

  1. There is a mistake in the expression:
    is: \x10000-x10FFFF
    should be: \x10000-\x10FFFF

  2. Sorry, actually it should be:

    const string expression = @"[^\x09\x0A\x0D\u0020-\uD7FF\uE000-\uFFFD]";

    The range \x10000-\x10FFFF lies beyond a single UTF-16 code unit and can't be expressed in a .NET Regex character class.
    However, it should be OK for most applications to strip these characters.

    Furthermore, four-digit hex values must be prefixed by \u, not \x.

    This is the unit test:

    [TestMethod]
    public void CleanInvalidCharacters()
    {
        var input = "";

        // illegal range 1 (except nl, cr, tab)
        for (int i = 0; i < 0x20; i++)
        {
            input += ((char)i);
        }
        const string legalCharactersBelowX20 = "\t\n\r";

        // illegal range 2
        for (int i = 0xD800; i < 0xE000; i++)
        {
            input += ((char)i);
        }

        // illegal range 3
        for (int i = 0xFFFE; i < 0x10000; i++)
        {
            input += ((char)i);
        }

        // some legal characters
        var someLegalSampleCharacters = "";
        someLegalSampleCharacters += " abcdefghijklmnopqrstuvwxyzäöüABCDEFGHIJKLMNOPQRSTUVWXZYZÄÖÜ0123456789%&.-_";
        someLegalSampleCharacters += "\uFFFD"; // xFFFD as an example for a high range legal character
        input += someLegalSampleCharacters;

        var output = input.CleanXml10InvalidCharacters();

        Assert.AreEqual(_stringToHex(legalCharactersBelowX20 + someLegalSampleCharacters), _stringToHex(output));
    }

    // format as hex for easier debugging when the test fails
    private string _stringToHex(string s)
    {
        var sb = new StringBuilder();
        foreach (var t in s)
        {
            sb.Append(Convert.ToInt32(t).ToString("x") + " ");
        }
        return sb.ToString();
    }
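
The test above calls CleanXml10InvalidCharacters, an extension method that the comment does not show. A minimal sketch of what it might look like, assuming it simply wraps the expression given earlier in this comment (the class name XmlStringExtensions and the field name are hypothetical):

```csharp
using System.Text.RegularExpressions;

// Hypothetical container class; only the method name comes from the test above.
public static class XmlStringExtensions
{
    // Matches every UTF-16 code unit outside the XML 1.0 ranges that fit in one
    // code unit. Surrogates match too, so characters above U+FFFF are removed.
    private static readonly Regex InvalidXml10Chars =
        new Regex(@"[^\x09\x0A\x0D\u0020-\uD7FF\uE000-\uFFFD]", RegexOptions.Compiled);

    public static string CleanXml10InvalidCharacters(this string text)
    {
        return InvalidXml10Chars.Replace(text, string.Empty);
    }
}
```

Keeping the compiled Regex in a static field avoids re-parsing the pattern on every call, which may matter when many strings are cleaned.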

  3. Cool!
    Thank you for pointing out the corrections. When I tested them on large XML files, they worked just fine.

    Your modified regex pattern, declared as a constant and scoped to UTF-16, is a wiser version that I hadn't thought of.

    Thanks.. :)

    Greg

  4. Can you provide some usage? I have a large XML file I want to load, but I'm not sure how to call Copy with the XML file as a stream. Also, I'm converting to VB.NET.

    Replies
    1. Hi,

      Here's a detailed example from MSDN, from which I derived the contents of this post. It has C# and VB.NET samples.
      http://msdn.microsoft.com/en-us/library/system.web.services.protocols.soapextension(v=vs.100).aspx


      Psycho Genes
