Comments on .NET GENES: Clean Invalid XML Characters In C#
Feed updated: 2024-03-26T08:41:26.502-07:00

Comment posted 2014-03-18T01:57:55.814-07:00:
Hi,

Here's a detailed example from MSDN, from which I derived the contents of this post. It has a C# and a VB.NET sample.
http://msdn.microsoft.com/en-us/library/system.web.services.protocols.soapextension(v=vs.100).aspx

Psycho Genes
Posted by .NET GENES

Comment posted 2014-03-17T18:58:57.805-07:00:
Can you provide some usage? I have a large XML file I want to load, but I'm not sure how to call copy with the XML file as a stream. Also, I'm converting to VB.NET.
Posted by Anonymous

Comment posted 2013-10-17T02:21:02.223-07:00:
Cool!
Thank you for pointing out the corrections. When I tested it on large, bulky XML files, it worked just fine.

Your modified regex pattern, declared as a constant, is a scope-wise version for UTF-16 that I hadn't thought of.

Thanks.. :)

Greg
Posted by .NET GENES

Comment posted 2013-10-16T03:06:40.877-07:00:
Sorry, actually it should be:

    const string expression = @"[^\x09\x0A\x0D\u0020-\uD7FF\uE000-\uFFFD]";

\x10000-\x10FFFF lies beyond a single UTF-16 code unit and can't be handled by .NET Regex this way.
However, it should be OK for most applications to strip these.

Furthermore, four-digit hex values must be prefixed by \u, not \x.

This is the unit test:

    // Requires: using System; using System.Text;
    // using Microsoft.VisualStudio.TestTools.UnitTesting;
    [TestMethod]
    public void CleanInvalidCharacters()
    {
        var input = "";

        // illegal range 1 (except nl, cr, tab)
        for (int i = 0; i < 0x20; i++)
        {
            input += ((char)i);
        }
        const string legalCharactersBelowX20 = "\t\n\r";

        // illegal range 2 (UTF-16 surrogate code units)
        for (int i = 0xD800; i < 0xE000; i++)
        {
            input += ((char)i);
        }

        // illegal range 3
        for (int i = 0xFFFE; i < 0x10000; i++)
        {
            input += ((char)i);
        }

        // some legal characters
        var someLegalSampleCharacters = "";
        someLegalSampleCharacters += " abcdefghijklmnopqrstuvwxyzäöüABCDEFGHIJKLMNOPQRSTUVWXZYZÄÖÜ0123456789%&.-_";
        someLegalSampleCharacters += "\uFFFD"; // xFFFD as an example for a high range legal character
        input += someLegalSampleCharacters;

        var output = input.CleanXml10InvalidCharacters();

        Assert.AreEqual(_stringToHex(legalCharactersBelowX20 + someLegalSampleCharacters), _stringToHex(output));
    }

    // format as hex for easier debugging when the test fails
    private string _stringToHex(string s)
    {
        var sb = new StringBuilder();
        foreach (var t in s)
        {
            sb.Append(Convert.ToInt32(t).ToString("x") + " ");
        }
        return sb.ToString();
    }

Posted by Anonymous

Comment posted 2013-10-16T02:15:00.760-07:00:
There is a mistake in the expression:
is: \x10000-x10FFFF
should be: \x10000-\x10FFFF
Posted by Anonymous
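
For readers following the thread, here is a minimal sketch of what the extension method called in the 2013-10-16 unit test might look like. Only the method name `CleanXml10InvalidCharacters` and the regex pattern come from the comments; the class name `XmlCleaningExtensions`, the field names, and the sample input are assumptions made for illustration, and the original post's implementation may differ.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical container class; the name is not from the original post.
public static class XmlCleaningExtensions
{
    // Pattern quoted in the 2013-10-16 comment: matches any UTF-16 code unit
    // that is not tab, LF, CR, U+0020..U+D7FF, or U+E000..U+FFFD.
    private const string InvalidXml10Pattern =
        @"[^\x09\x0A\x0D\u0020-\uD7FF\uE000-\uFFFD]";

    private static readonly Regex InvalidXml10Chars =
        new Regex(InvalidXml10Pattern, RegexOptions.Compiled);

    // Returns the input with every matching code unit removed.
    // Null or empty input is returned unchanged.
    public static string CleanXml10InvalidCharacters(this string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return text;
        }
        return InvalidXml10Chars.Replace(text, string.Empty);
    }
}

public static class Program
{
    public static void Main()
    {
        // Sample input (an assumption): two control characters and U+FFFE.
        string dirty = "<a>ok\u0001\u000B</a>\uFFFE";
        Console.WriteLine(dirty.CleanXml10InvalidCharacters());
    }
}
```

Because the pattern works on individual UTF-16 code units, it also removes surrogate pairs, so characters above U+FFFF are stripped even though XML 1.0 permits them. As the commenter notes, that is acceptable for most applications, but it is worth knowing if the input may contain such characters.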