Clean Invalid XML Characters In C#
Here's a cool way to clean large XML files that contain invalid XML characters.
Note: Stream from is the original XML file, while Stream to is the new XML file with the invalid characters removed.
Source: http://social.msdn.microsoft.com/Forums/
Post: Invalid character returned from webservice
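// Requires: using System.IO; using System.Text.RegularExpressions;
private void Copy(Stream from, Stream to)
{
    TextReader reader = new StreamReader(from);
    TextWriter writer = new StreamWriter(to);
    // Reads the whole file into memory, cleans it, and writes the result.
    writer.WriteLine(CleanInvalidXmlChars(reader.ReadToEnd()));
    writer.Flush();
}

public static string CleanInvalidXmlChars(string text)
{
    // Keep only tab, LF, CR, U+0020-U+D7FF and U+E000-U+FFFD.
    // Pattern updated per the correction in the comments below; the original
    // used \x20-\xD7FF\xE000-\xFFFD\x10000-x10FFFF, which .NET does not parse
    // as intended because \x takes exactly two hex digits.
    string re = @"[^\x09\x0A\x0D\u0020-\uD7FF\uE000-\uFFFD]";
    return Regex.Replace(text, re, "");
}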
There is a mistake in the expression:
is: \x10000-x10FFFF
should be: \x10000-\x10FFFF
Sorry, actually it should be:
const string expression = @"[^\x09\x0A\x0D\u0020-\uD7FF\uE000-\uFFFD]";
The range \x10000-\x10FFFF lies beyond a single UTF-16 code unit and can't be handled by a .NET Regex character class.
However, it should be OK for most applications to strip these.
Furthermore, four-digit hex values must be prefixed by \u, not \x.
This is the unit test:
[TestMethod]
public void CleanInvalidCharacters()
{
    var input = "";

    // illegal range 1 (except nl, cr, tab)
    for (int i = 0; i < 0x20; i++)
    {
        input += ((char)i);
    }
    const string legalCharactersBelowX20 = "\t\n\r";

    // illegal range 2
    for (int i = 0xD800; i < 0xE000; i++)
    {
        input += ((char)i);
    }

    // illegal range 3
    for (int i = 0xFFFE; i < 0x10000; i++)
    {
        input += ((char)i);
    }

    // some legal characters
    var someLegalSampleCharacters = "";
    someLegalSampleCharacters += " abcdefghijklmnopqrstuvwxyzäöüABCDEFGHIJKLMNOPQRSTUVWXZYZÄÖÜ0123456789%&.-_";
    someLegalSampleCharacters += "\uFFFD"; // xFFFD as an example for a high range legal character
    input += someLegalSampleCharacters;

    var output = input.CleanXml10InvalidCharacters();
    Assert.AreEqual(_stringToHex(legalCharactersBelowX20 + someLegalSampleCharacters), _stringToHex(output));
}

// format as hex for easier debugging when the test fails
private string _stringToHex(string s)
{
    var sb = new StringBuilder();
    foreach (var t in s)
    {
        sb.Append(Convert.ToInt32(t).ToString("x") + " ");
    }
    return sb.ToString();
}
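The test above calls CleanXml10InvalidCharacters(), which isn't shown anywhere in this thread. A minimal sketch of what it might look like, assuming it is simply a string extension method wrapping the corrected expression (the class name XmlStringExtensions and the Demo class are made up for illustration):

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class; the name is an assumption, not from the thread.
public static class XmlStringExtensions
{
    // Corrected expression from the comment above: everything except
    // tab, LF, CR, U+0020-U+D7FF and U+E000-U+FFFD is removed.
    // Note: surrogate code units (U+D800-U+DFFF) are removed as well, so
    // characters above U+FFFF are lost, which the comment accepts.
    private static readonly Regex InvalidXml10Chars = new Regex(
        @"[^\x09\x0A\x0D\u0020-\uD7FF\uE000-\uFFFD]",
        RegexOptions.Compiled);

    public static string CleanXml10InvalidCharacters(this string text)
    {
        if (text == null) throw new ArgumentNullException("text");
        return InvalidXml10Chars.Replace(text, string.Empty);
    }
}

public static class Demo
{
    public static void Main()
    {
        // U+0001 is not allowed in XML 1.0, so it is dropped.
        Console.WriteLine("bad\u0001value".CleanXml10InvalidCharacters()); // prints "badvalue"
    }
}
```

Keeping the Regex in a static field means the pattern is parsed once rather than on every call, which may matter when cleaning many strings.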
Cool!
Thank you for pointing out the corrections. When I tested it on large, bulky XML files, it worked just fine.
Your modified regex pattern, declared as a constant, is a properly scoped version for UTF-16 that I hadn't thought of.
Thanks.. :)
Greg
Can you provide some usage? I have a large XML file I want to load, but I'm not sure how to call Copy with the XML file as a stream. Also, I'm converting this to VB.NET.
Hi,
Here's a detailed example from MSDN, from which I derived the contents of this post. It has C# and VB.NET samples.
http://msdn.microsoft.com/en-us/library/system.web.services.protocols.soapextension(v=vs.100).aspx
Psycho Genes
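For the usage question above, here is a rough sketch of calling Copy with file streams. It is not taken from the MSDN sample. The file names (dirty.xml, clean.xml), the XmlFileCleaner class, and the choice of UTF-8 are assumptions for illustration, and Copy is made static here so it can be called from Main.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical wrapper class; only Copy and CleanInvalidXmlChars come from the post.
public static class XmlFileCleaner
{
    public static string CleanInvalidXmlChars(string text)
    {
        // Corrected pattern from the comments.
        return Regex.Replace(text, @"[^\x09\x0A\x0D\u0020-\uD7FF\uE000-\uFFFD]", "");
    }

    // Same shape as Copy in the post, but uses Write instead of WriteLine
    // so that no trailing newline is appended. Assumes UTF-8 input.
    public static void Copy(Stream from, Stream to)
    {
        var reader = new StreamReader(from, Encoding.UTF8);
        var writer = new StreamWriter(to, new UTF8Encoding(false));
        writer.Write(CleanInvalidXmlChars(reader.ReadToEnd()));
        writer.Flush();
    }

    public static void Main()
    {
        // Hypothetical paths; replace with the real source and target files.
        string sourcePath = Path.Combine(Path.GetTempPath(), "dirty.xml");
        string targetPath = Path.Combine(Path.GetTempPath(), "clean.xml");

        // Create a small sample file containing an invalid character (U+0001).
        File.WriteAllText(sourcePath, "<root>bad\u0001value</root>", new UTF8Encoding(false));

        // Open the original file for reading and the new file for writing.
        using (FileStream from = File.OpenRead(sourcePath))
        using (FileStream to = File.Create(targetPath))
        {
            Copy(from, to);
        }

        Console.WriteLine(File.ReadAllText(targetPath)); // prints "<root>badvalue</root>"
    }
}
```

Since ReadToEnd pulls the whole file into memory, very large files may need a chunked approach instead. The cleaned file can then be loaded as usual, for example with XmlDocument.Load(targetPath).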