Paul Selles

Computers and cats

Monthly Archives: July 2013

PowerShell Tip #1: Strict Mode XML Parsing Gotcha

I love how easy it is to parse XML with PowerShell, but then I started scripting in Strict Mode and got hung up on a little problem dealing with attributes.

I will reuse a revised cats.xml as the example XML file:

<?xml version="1.0" encoding="utf-8"?>
<Cats>
	<Cat Name="Wilson" Type="Tabby">
		<Property Name="Fur" Value="Coarse"/>
		<Property Name="Color" Value="Orange" />
		<Part Name="Paws">
			<Property Name="Claws" Value="Very sharp" />
		</Part>
		<Part Name="Nose">
			<Property Name="Cute" Value="true" />
		</Part>
	</Cat>
	<Cat Name="Winnie" Type="Short hair">
		<Property Name="Fur" Value="Soft"/>
		<Property Name="Color" Value="Black" />
		<Part Name="Paws">
			<Property Name="Polydactyl" Value="true" />
			<Property Name="Claws" Value="Sharp" />
		</Part>
		<Part Name="Nose">
			<Property Name="Cute" Value="true" />
		</Part>
	</Cat>
	<Cat Name="Luna">
		<Property Name="Fur" Value="Soft"/>
		<Property Name="Color" Value="Black" />
		<Part Name="Paws">
			<Property Name="Claws" Value="Trimmed" />
		</Part>
		<Part Name="Nose">
			<Property Name="Cute" Value="true" />
		</Part>
	</Cat>
</Cats>

Let's make a script that will retrieve the Name of a cat by its Type:

# Get cat name by type
Param (
	[ValidateNotNullOrEmpty()]
	[String]$Type
)
[Xml]$Cats = (Get-Content -Path C:\temp\cats.xml)
($Cats.Cats.Cat | Where {$_.Type -match $Type}).Name

The script is nice and short and does what we expect. If the Type matches, it returns the Name of the cat (otherwise, we get nothing):

[Screenshot: CatTypeResults]

Now let’s try the script in Strict Mode:

# Get cat name by type
Param (
	[ValidateNotNullOrEmpty()]
	[String]$Type
)
Set-StrictMode -Version Latest
[Xml]$Cats = (Get-Content -Path C:\temp\cats.xml)
($Cats.Cats.Cat | Where {$_.Type -match $Type}).Name

If we try to run the script again we run into problems:

[Screenshot: CatTypeResultsStrictMode]

What did we do wrong? As it turns out, there are two problems. The first is that the entry for the cat Luna does not have a Type attribute (if we removed Luna and re-ran the script, it would work again for a matching type). The second is that we are expecting PowerShell to interpret what we mean by the property Type. We want Type as an attribute name, but how does PowerShell know that? This is sloppy scripting, but it works until we switch into Strict Mode.

In order to move forward, we need a better idea of the objects that we are playing with. So let’s get the members of $Cats.Cats.

[Screenshot: SystemXmlXmlDocumentGetMember]

We can see that this is a System.Xml.XmlElement [1]. We could look the class up, or we can see from the list above which method will help us clean up our code and test for the Type attribute:

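# Get cat name by type
Param (
	[ValidateNotNullOrEmpty()]
	[String]$Type
)
Set-StrictMode -Version Latest
[Xml]$Cats = (Get-Content -Path C:\temp\cats.xml)
# GetAttribute returns an empty string when the attribute is missing, so Luna no longer causes an error
$Cat = $Cats.Cats.Cat | Where {$_.GetAttribute('Type') -match $Type}
# Only read the Name when at least one cat matched
if ($Cat) { $Cat.GetAttribute('Name') }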

Using GetAttribute on the Type attribute solves the first error. We solve the second error (which appears when a non-matching type is entered) by making sure that the object is not null before reading the Name attribute.

Paul

References

[1] XmlElement Class. MSDN Library

Parsing Xml with Invalid Characters in C#

The Problem

I’ve stumbled upon an interesting predicament. I need to parse some SQL relationships from an automatically generated XML file that contains invalid characters. Here is an example XML file that I will use to highlight the problem that I saw:

<?xml version="1.0" encoding="utf-8"?>
<Cats>
	<Cat Id="1" Type="Tabby">
		<Property Name="Fur" Value="Coarse"/>
		<Property Name="Color" Value="Orange" />
		<Part Name="Paws">
			<Property Name="Claws" Value="Very sharp" />
		</Part>
		<Part Name="Nose">
			<Property Name="Cute" Value="true" />
		</Part>
		<Info>
			I have an invalid character.&#x13;
		</Info>
	</Cat>
	<Cat Id="2" Type="Short hair">
		<Property Name="Fur" Value="Soft"/>
		<Property Name="Color" Value="Black" />
		<Part Name="Paws">
			<Property Name="Polydactyl" Value="true" />
			<Property Name="Claws" Value="Sharp" />
		</Part>
		<Part Name="Nose">
			<Property Name="Cute" Value="true" />
		</Part>
		<Info>
			I don't have an invalid character.
		</Info>
	</Cat>
</Cats>

So above we have a small XML file cataloging my two cats. Within the Info tags you may notice that the first Cat entry has a superfluous character, 0x13; this falls outside of the valid XML character set [1]. The W3C recommendation, however, is no guarantee that every XML file that you encounter will follow the recommendations to a tee.

In C# we can try using the two most common XML parsing libraries System.Xml and System.Xml.Linq to import the XML file to the XmlDocument and XDocument objects using their respective Load functions [2][3]. If we try to do this we can expect to see the following exception:

‘ ‘, hexadecimal value 0x13, is an invalid character. Line 13, position 35.
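To see the failure without creating a file, the same character reference can be parsed from a string. This is only a sketch and not part of my original code: the helper name ParsesStrictly and the shortened inline XML are made up for illustration, and it is written as a top-level program (assuming .NET 5 or later).

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper: true when the XML parses with the default (strict) settings
static bool ParsesStrictly(string xml)
{
    try
    {
        XDocument.Parse(xml); // the default reader settings check characters
        return true;
    }
    catch (XmlException e)
    {
        // Show the parser's complaint so it can be compared with the message above
        Console.WriteLine(e.Message);
        return false;
    }
}

// Made-up sample: a cut-down version of the cats file with the same character reference
string bad = "<Cats><Cat Id=\"1\"><Info>I have an invalid character.&#x13;</Info></Cat></Cats>";
string good = "<Cats><Cat Id=\"2\"><Info>I don't have an invalid character.</Info></Cat></Cats>";

Console.WriteLine("bad parses: " + ParsesStrictly(bad));
Console.WriteLine("good parses: " + ParsesStrictly(good));
```

Catching XmlException keeps the sketch focused on the parsing failure; any other exception would still surface.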

The Solution

There is a workaround that is made possible by the lightweight, disposable XmlReader class and its XmlReaderSettings support class, which lets us customize the behavior of XmlReader [4][5]. The XmlReaderSettings property that interests us the most is the Boolean CheckCharacters. Setting CheckCharacters to false turns off character checking for character entity references such as &#x13;, so the reader no longer rejects them for falling outside the valid XML character set (per the documentation, names and raw text content are still checked) [6]. The XmlDocument and XDocument objects can now be loaded from the XmlReader without incident:

static XmlDocument ReadXmlDocumentWithInvalidCharacters(string filename)
{
    XmlDocument xmlDocument = new XmlDocument();

    XmlReaderSettings xmlReaderSettings = new XmlReaderSettings { CheckCharacters = false };

    using (XmlReader xmlReader = XmlReader.Create(filename, xmlReaderSettings))
    {
        // Load our XmlDocument
        xmlReader.MoveToContent();
        xmlDocument.Load(xmlReader);
    }

    return xmlDocument;
}
static XDocument ReadXDocumentWithInvalidCharacters(string filename)
{
    XDocument xDocument = null;

    XmlReaderSettings xmlReaderSettings = new XmlReaderSettings { CheckCharacters = false };

    using (XmlReader xmlReader = XmlReader.Create(filename, xmlReaderSettings))
    {
        // Load our XDocument
        xmlReader.MoveToContent();
        xDocument = XDocument.Load(xmlReader);
    }

    return xDocument;
}
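
The same settings work when the XML comes from a string rather than a file, which makes for a quick experiment. The sketch below is an illustration only: ParseLenient and the inline sample are hypothetical names and data, and it assumes a top-level program on .NET 5 or later. It loads the document with CheckCharacters set to false and then checks whether the 0x13 character made it into the Info value.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper: the same idea as ReadXDocumentWithInvalidCharacters, but for a string
static XDocument ParseLenient(string xml)
{
    XmlReaderSettings settings = new XmlReaderSettings { CheckCharacters = false };

    using (StringReader stringReader = new StringReader(xml))
    using (XmlReader xmlReader = XmlReader.Create(stringReader, settings))
    {
        return XDocument.Load(xmlReader);
    }
}

// Made-up sample: a cut-down version of the cats file
string sample = "<Cats><Cat Id=\"1\"><Info>I have an invalid character.&#x13;</Info></Cat></Cats>";

XDocument document = ParseLenient(sample);
string info = document.Descendants("Info").First().Value;

// The load succeeds, but the control character is carried along into the element value
bool stillThere = info.IndexOf('\u0013') >= 0;
Console.WriteLine("Invalid character still present: " + stillThere);
```

The point of the check at the end is that a lenient load hides the problem from the reader only; the character itself stays in the data.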

Once we have loaded our XML we are free to parse it, and since I prefer working with the System.Xml.Linq library, that is the one I will use:

static void PrintXDocument(XDocument xDocument)
{
    foreach (XElement xElement in xDocument.Elements(xDocument.Root.Name).DescendantsAndSelf())
    {
        Console.Write("".PadRight(xElement.Ancestors().Count() * 4) +
            (xElement.HasElements || string.IsNullOrEmpty(xElement.Value) ?
                xElement.Name.LocalName :
                (xElement.Name.LocalName + " \"" + xElement.Value.Trim() + "\"")));

        foreach (XAttribute xAttribute in xElement.Attributes())
            Console.Write(" " + xAttribute.Name.LocalName + "=\"" + xAttribute.Value + "\"");

        Console.WriteLine();
    }

    Console.ReadLine();
}

And the results:

Cats
    Cat Id="1" Type="Tabby"
        Property Name="Fur" Value="Coarse"
        Property Name="Color" Value="Orange"
        Part Name="Paws"
            Property Name="Claws" Value="Very sharp"
        Part Name="Nose"
            Property Name="Cute" Value="true"
        Info "I have an invalid character.‼"
    Cat Id="2" Type="Short hair"
        Property Name="Fur" Value="Soft"
        Property Name="Color" Value="Black"
        Part Name="Paws"
            Property Name="Polydactyl" Value="true"
            Property Name="Claws" Value="Sharp"
        Part Name="Nose"
            Property Name="Cute" Value="true"
        Info "I don't have an invalid character."

We are not out of the woods yet

We are dealing with damaged goods here: that invalid character is still present, so we have to be careful. Notice the ‼ in the output above; that is how the console renders 0x13.

An example of what can go wrong is evident if we try to print out the contents of the XDocument object that we just loaded:

Console.WriteLine(ReadXDocumentWithInvalidCharacters(filename).ToString());

Normally we would get a printout of the XML it contains. In this case the writer behind ToString rejects the character, and we see the same invalid-character complaint as in the exception above.
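
One way to get back onto safe ground, if losing the character is acceptable, is to strip anything invalid from the document after loading it. This is a sketch under assumptions rather than something I originally used: StripInvalidXmlCharacters and RemoveInvalidXmlCharacters are hypothetical helper names, the inline sample is made up, and it relies on XmlConvert.IsXmlChar to decide what counts as valid. It is written as a top-level program (assuming .NET 5 or later).

```csharp
using System;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper: keeps only the characters that XmlConvert.IsXmlChar accepts.
// Caveat: supplementary characters stored as surrogate pairs may be dropped by this
// per-char filter as well, which is an accepted limitation of the sketch.
static string StripInvalidXmlCharacters(string value)
{
    return new string(value.Where(c => XmlConvert.IsXmlChar(c)).ToArray());
}

// Hypothetical helper: cleans every text node and attribute value in place
static void RemoveInvalidXmlCharacters(XDocument xDocument)
{
    // ToList() takes a snapshot so we are not editing nodes while enumerating the tree
    foreach (XText xText in xDocument.DescendantNodes().OfType<XText>().ToList())
        xText.Value = StripInvalidXmlCharacters(xText.Value);

    foreach (XAttribute xAttribute in xDocument.Descendants().Attributes().ToList())
        xAttribute.Value = StripInvalidXmlCharacters(xAttribute.Value);
}

// Made-up sample: a cut-down version of the cats file
string xml = "<Cats><Cat Id=\"1\"><Info>I have an invalid character.&#x13;</Info></Cat></Cats>";

// Load leniently, exactly as in ReadXDocumentWithInvalidCharacters
XDocument document;
XmlReaderSettings xmlReaderSettings = new XmlReaderSettings { CheckCharacters = false };
using (XmlReader xmlReader = XmlReader.Create(new StringReader(xml), xmlReaderSettings))
{
    document = XDocument.Load(xmlReader);
}

RemoveInvalidXmlCharacters(document);

// With the control character gone, serializing no longer trips the character check
string cleaned = document.ToString();
Console.WriteLine(cleaned);
```

Whether dropping the character is the right call depends on the data; replacing it with a placeholder would be a small change to StripInvalidXmlCharacters.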

Paul

References

[1] Extensible Markup Language (XML) 1.0 (Fifth Edition). 26 Nov 2008. W3C Recommendation

[2] XmlDocument Class. MSDN Library

[3] XDocument Class. MSDN Library

[4] XmlReader Class. MSDN Library

[5] XmlReaderSettings Class. MSDN Library

[6] XmlReaderSettings.CheckCharacters Property. MSDN Library
