Register | Login 
View Article  

Current Articles | Categories | Search | Syndication

The XML Datatype

By Scott Klein on Saturday, December 09, 2006 :: 3000 Views :: 2 Comments :: :: SQL Server

One of the things I have been working with lately is the XML data type in SQL Server 2005.  I am a fan of XML, and I love working with SQL Server.  When the two “officially” came together in SQL Server 2005 I could not have been happier.

 

The project I am working on didn’t put the xml data type to use immediately.  I am a firm believer in the concept of not using something just because it is cool (although I do love to play with all the new cool technologies coming out).  However, it recently became readily apparent that the mechanism we had in place would not fill the requirements of what we were trying to accomplish, and the xml data type seemed to be the answer.

 

The problem we were running into is the size of a string parameter we were passing to a number of stored procedures, which in turn are the data source for some reports.  This string parameter contained a comma separated list of employees that were selected from a list via the front end application.  If no employees were selected then we just didn’t include that in the query filter and we were good to go.  But what if they did select some employees to filter on?  How big do you make that varchar parameter?  Meaning, the parameter in the stored procedure looked like this:

 

CREATE PROCEDURE tempproc

(

  @startdate DateTime,

  @enddate DateTime,

  @emplist Varchar(1000)

)

As

 

And this even brought into question the issue of filtering on this list.  It is no secret that the following is not possible in T-SQL:

 

WHERE Employee.EmployeeID IN (@emplist)

 

The solution to this problem was to create a user defined function which took that parameter of comma separated strings and returned a table of ID’s.  This function looks like the following:

 

CREATE Function udf_SplitIDString

(

  @IDString varchar(1000)

)

Returns @IDs table

(ID int)

AS

BEGIN

  Declare @Position int

  While len(@IDString) > 0

    BEGIN

      SET @Position = CHARINDEX(‘, ‘ @IDString)

      If @Position > 0

        BEGIN

          INSERT @IDs

          SELECT CONVERT(int, LEFT(@IDString, @Position – 1))

          SET @IDString = RIGHT(IDString, len(IDString – Position)

        END

      ELSE

        BEGIN

          INSERT @IDs

          SELECT CONVERT(int, @IDString)

        END

      END

    END

RETURN

END

 

By using this method, it meant that the following join would need to take place in each stored procedures that needed this functionality.

 

INNER JOIN    dbo.udf_SplitIDString(@emplist) dbo.employee.employeeid = udf_SplitIDString.ID

 

This solution worked, and worked well.  We did see a very slight decrement in performance but not anything to raise an alarm.  The decrease in performance was in the parsing of the id string in the function.

 

However, we need to go back to the original question.  Is a Varchar(1000) big enough?  A size of 1000 would only get you just over a couple hundred employees (assuming you started at employeeid 1).  If you incremented that number to 2000, you would only get just over 450 employees (again, assuming you started at employeeid 1).  If you started with an employeeid any higher, the number of returned employees would decrement significantly.

 

Many of our customers are quite large and could easily fill up this list when selecting a range of employees (for example, selecting a range by department or location, etc).

 

In comes the xml data type.  If you haven’t read anything about the xml data type in SQL Server 2005, you need to stop everything you are doing and spend a good hour or so reading up on it.  For a “first iteration”, Microsoft did an extremely fine job.

 

SQL Server 2005 comes with an xml data type.  This data type can be used in creating a column, as shown below in figure 1, or as variables.

 

scott3.gif

Figure 1

 

Since we needed to pass xml, we chose the latter, meaning, to use it as a parameter variable in our stored procedures.  Using the xml data type as a variable would remove the constraint of having to limit the size of the string being passed.

 

Using the xml data type as a variable would also provide one other benefit.  It meant that the join to the function did not have to be made and the parsing of the comma separated string did not have to been done.

 

Now, our parameter list looked like the following:

 

CREATE PROCEDURE tempproc

(

  @startdate DateTime,

  @enddate DateTime,

  @emplist xml

)

As

 

An example of the xml that would be passed to the stored procedure looks like the following:

 

<ids>

   <v id="2780"/>

   <v id="1678"/>

   <v id="2831"/>

   <v id="3977"/>

   <v id="729"/>

   <v id="1142"/>

</ids>

 

The join then became the following:

 

INNER JOIN  @emplist.nodes('ids/v') as EmpJoinValues(EmpJoin) on EmpJoin.value('@id','int') = hr180.HR180ID

 

This method worked wonderful.  It was fast and efficient.  It was time to take it to the next level.

 

Many of the reports we run filter on a number of criteria, such as location, department, category, and others.  Some of our larger reports filter on up to 6 or 7 criteria.  For example, our parameter list could look like this:

 

(

  @startdate DateTime,

  @enddate DateTime,

  @emplist Varchar(1000),

  @loclist Varchar(1000),

  @deptlist Varchar(1000),

  @catlist Varchar(1000),

  @classlist Varchar(1000),

  @paylist Varchar(1000),

)

As

 

Just like the first example, depending on the number of items they pick from each list, we could easily reach the 1000 character length mark.  But the more important issue is now we are doing a 6 or more joins using the SplitIDString function, and now this function has to parse a whole lot of strings.  Not too efficient.

 

So our thought was that if we could do xml with one of those parameters, why couldn’t we do it with the rest?  So, we converted the other parameters to xml data type.  Our parameter list now looked like the following:

 

(

  @startdate DateTime,

  @enddate DateTime,

  @emplist xml,

  @loclist xml,

  @deptlist xml,

  @catlist xml,

  @classlist xml,

  @paylist xml,

)

As

 

We also adjusted the WHERE clause appropriately to join on the xml just as we did in our first test.  For example:

 

INNER JOIN  @emplist.nodes('ids/v') as EmpJoinValues(EmpJoin) on EmpJoin.value('@id','int') = hrtable.employeeid

INNER JOIN  @loclist.nodes('ids/v') as EmpJoinValues(EmpJoin) on EmpJoin.value('@id','int') = location.locationid

 

And so on.

 

When we ran the report, we got an error stating that the query was too complex.  Huh?  We started trouble shooting and decided to add the parameters back as varchar(1000) and add the xml data type one at a time for each parameter. 

 

The report ran extremely well until our xml parameter list hit 5 or 6 and more, at which point the following error was returned:

The query processor ran out of internal resources and could not produce a query plan.

This is a rare event and only expected for extremely complex queries or queries that reference a very large number of tables or partitions.

Please simplify the query. If you believe you have received this message in error, contact Customer Support Services for more information.

In working with Microsoft on this, it turns out that this is a bug in the query compilation.  Their interim/workaround for now is to change the WHERE clause to the following:

 

AND hrtable.employeeid IN (SELECT [joinvalues].[ref].value(‘.’, ‘int’) FROM @emplist.nodes(‘/ids/values/@id’) joinvalues([ref])),

 

And so on for each xml join.

 

We are currently running tests to determine performance and thresholds.  Stay tuned for the results.

 

Summary

As I stated earlier, I love XML and I love SQL Server, and even though we have had to provide some workarounds for our solutions, it still in many ways makes me take a step back and realize how cool this technology lives.  This article is not intended to deter you from using the xml data type.  Quite the opposite.  In fact, I’d recommend that you look to see where your database and application might benefit from such as great technology.


Previous Page | Next Page

COMMENTS

I'd give 2 for trying to represent the case.
It would be nice to see the clean code in a published article. Because somebody may try to reproduce this.
Pretty much the code listed in the article does not compile.
CREATE Function udf_SplitIDString
to begin with. Either the text was formatted by the web site either..., but look, this just can not compile in sql
SET @IDString = RIGHT(IDString, len(IDString - Position)

It is easy to write your own one that does the same.
So I did.
it is easy to reproduce your xml even from t-sql. I can blow it on 5 joins, stable on 6. I put 622 int id's in each 'coma separated list' variable

But when you give an example from your joins it would be nice if they would compile.
with respect, this can not compile, it is incorrect sql
INNER JOIN @emplist.nodes('ids/v') as EmpJoinValues(EmpJoin) on EmpJoin.value('@id','int') = hrtable.employeeid
INNER JOIN @loclist.nodes('ids/v') as EmpJoinValues(EmpJoin) on EmpJoin.value('@id','int') = location.locationid

That does compile
select fname+' '+lname as name, address
from Employee e
inner join EmpWload ew on e.EmployeeID=ew.EmployeeID
inner join Empcat ec on e.EmployeeID=ec.EmployeeID
inner join Empclass es on e.EmployeeID=es.EmployeeID
inner join Empdept ed on e.EmployeeID=ed.EmployeeID
inner join Emplock el on e.EmployeeID=el.EmployeeID
inner join Emppay ep on e.EmployeeID=ep.EmployeeID
INNER JOIN @emplist.nodes('ids/v') as EmpJoinValues(EmpJoin) on EmpJoin.value('@id','int') = ew.EmployeeID
INNER JOIN @catlist.nodes('ids/v') as EmpJoinValuesc(EmpJoinc) on EmpJoinc.value('@id','int') = ec.EmployeeID
INNER JOIN @classlist.nodes('ids/v') as EmpJoinValuess(EmpJoins) on EmpJoins.value('@id','int') = es.EmployeeID
INNER JOIN @deptlist.nodes('ids/v') as EmpJoinValuesd(EmpJoind) on EmpJoind.value('@id','int') = ed.EmployeeID
INNER JOIN @locklist.nodes('ids/v') as EmpJoinValuesl(EmpJoinl) on EmpJoinl.value('@id','int') = el.EmployeeID
INNER JOIN @paylist.nodes('ids/v') as EmpJoinValuesp(EmpJoinp) on EmpJoinp.value('@id','int') = ep.EmployeeID

your last code sample I guess does not use the format that your prev sample is using. If you had to change this, would be nice to mention.
I mean this one

AND hrtable.employeeid IN (SELECT [joinvalues].[ref].value('.', 'int') FROM @emplist.nodes('/ids/values/@id') joinvalues([ref]))

As about xml method working good, I would respectfully argue this one based on the execution plan.
From a DBA point of view it is a cool technique, but not more.
Put 2000 resord or so into your 7 tables and take a look at your execution plan and see where sql is actually spending the time.
If you'd like to have this working fast, put your data into a table, not a function. UDF looks cool, but sql server cost based optimized does not think so.
Relational databases were and are good at one thing only - work with the existing data.
It is one of the cases when a DBA and a developer have different ... outlook maybe.

posted @ Thursday, March 06, 2008 2:27 PM by lilya


from performance point of view you may want to run 5-6 of these (your last version that does not blow the resource on sql server) and watch it's cpu for example. 5-6 will peak your cpu 100%.
A production DBA is not going to say 'wow, cool sql technique, do it again' :).

posted @ Thursday, March 06, 2008 2:33 PM by liliya


Click here to post a comment

Copyright (c) 2010 GSP Developers
Walling Info Systems | Terms Of Use | Privacy Statement