Returning vertical data in one horizontal column

Having data in a normalized form is great. It helps eliminate redundant data, and therefore helps save space in the database. In this blog post, I want to discuss how you can query a normalized database to return denormalized data – that is, return data from many rows in a single column. I’ve used this technique a number of times, and below a couple of the ways I’ve put it to use.

  • At my current company we have a custom service which runs on a nightly basis and sends emails for various reasons. Some emails are sent to the user in question, but also copies in their manager(s). If the user has multiple managers, we need to put both of their email addresses into a single “CC” column, separated by a semicolon, as well as the employees email address in the “To” column.
  • Over the last 6 or so months, I’ve been building my company a Holiday & Sick Day management tool in ASP.Net & C# as a means to teach myself a coding language. In the tool, a manager can view sick days on a calendar by hovering over a highlighted day (indicating there was a sickness on that day), and a tool tip will appear with each person who was ill on that day. The tool tip came straight from a stored procedure in one column, making it much easier for me to utilize in C#.

Now that I’ve given you a small taster of what it’s used for, I should probably show you some SQL!

Using the second example given above, imagine we have the below query:

SELECT S.SicknessDate,
	   P.Name   
 
FROM Person P
	 JOIN Sickness S ON P.PersonId = S.PersonId

Which gives us this dataset:

Returning vertical data in one horizontal column

 

 

 

Great, we have data! But, we only want 2 rows, one for ’2012-01-01′ containing Zach Stagers and Some Guy, and another for ’2012-03-01′ for Other Fella.

How do we do that? Through the use of a correlated subquery making use of FOR XML PATH and the [text()] column name. Here’s the new code:

SELECT DISTINCT 
	   S.SicknessDate,
	   (
		SELECT Name + ', ' [text()]
 
		FROM Person P1
			 JOIN Sickness S1 ON P1.PersonId = S1.PersonId
 
		WHERE S1.SicknessDate = S.SicknessDate
 
		FOR XML PATH('')
	   ) [Names]
 
FROM Person P
	 JOIN Sickness S ON P.PersonId = S.PersonId

And the new, easier to utilize, data set:

Returning vertical data in one horizontal column_FIXED

 

 

 
Perfecto! Just what we wanted! The FOR XML PATH subquery kindly does all of the work for us, and the distinct gets rid of duplicate rows caused when multiple people are ill on the same day. In the subquery, we use where the sickness date of the inner query equals the sickness date of the outer query, turn the result into XML (FOR XML PATH(”)), and tell it to just output it as plain text ([text()]).

I hope you find as useful an implementation for this as I have!

Digg This
Reddit This
Stumble Now!
Buzz This
Vote on DZone
Share on Facebook
Bookmark this on Delicious
Kick It on DotNetKicks.com
Shout it
Share on LinkedIn
Bookmark this on Technorati
Post on Twitter
Google Buzz (aka. Google Reader)

Subqueries – Correlated and Noncorrelated

Subqueries are a great way to easily increase the scope of your result set, and in this post I’m going to explain the difference between Correlated and Noncorrelated Subqueries, as well as provide examples of them both.

Correlated Subqueries

Correlated Subqueries create a link between the outer query and the subquery. This means that the subquery is dependent on the outer query, and won’t execute as a stand alone query.

SELECT *
 
FROM Products AS P1
 
WHERE P1.Price >
              (
               SELECT AVG(P2.Price)
               FROM Products AS P2
               WHERE P2.ProductCategoryId = P1.ProductCategoryId
              )

The above query will return all Products that have a price higher than the average price for each product category. The WHERE clause of the subquery almost acts as if it were a GROUP BY for the outer query in this instance. This is because the subquery and outer query join together on the ProductCategoryId – meaning that the average price is found for each Product Category. Then products with a price greater than the average, for that product category, will be returned.

Noncorrelated Subqueries

Noncorrelated Subqueries are completely independent of the outer query, and will execute as a stand alone query without any issues. These are commonly used within a WHERE clause – much like the example below:

SELECT *
 
FROM Products AS P
 
WHERE P.ProductId IN (SELECT DISTINCT O.ProductId FROM Orders AS O)

The above code will return all rows from the Products table that have atleast one reference in the Orders tables. It does this by returning all of the ProductId’s in the Orders table, to be used in the WHERE clause of the outer query.

Digg This
Reddit This
Stumble Now!
Buzz This
Vote on DZone
Share on Facebook
Bookmark this on Delicious
Kick It on DotNetKicks.com
Shout it
Share on LinkedIn
Bookmark this on Technorati
Post on Twitter
Google Buzz (aka. Google Reader)

Using Derived Tables

A derived table is essentially a correlated subquery within your FROM clause. If used correctly, a derived table can drastically improve the efficiency of your query.

Imagine that you’re going to be querying a large table with over a million rows in it and 20 columns, but you know your query is only ever going to be using 50,000 of those rows and 3 of the columns. That’s 950,000 rows and 17 columns that have to be scanned and rejected by your query, in order to get the data you need. By using a derived table, you can limit the table reference to those 50,000 rows.

Syntax and Explanation

SELECT P.* 
 
FROM (
      SELECT ProductId,
             ProductName,
             ProductDescription
 
      FROM Products
 
      WHERE ProductCategoryId = 4
     ) AS P 
     JOIN Orders AS O ON O.ProductId = P.ProductId
 
WHERE O.OrderDate >= DATEADD(Y, -1, GETDATE()

The above code will return all products with a ProductCategoryId of 4 from the products table, which have been ordered within the last year.

This will help query performance by cutting down the rows read out from the Products table to begin with. The FROM clause within a SQL statement is actually the first to be executed, then the WHERE. So in the above query, only the 50,000 records have to be joined to the Orders table, rather than the full million rows – only to be rejected later.

Digg This
Reddit This
Stumble Now!
Buzz This
Vote on DZone
Share on Facebook
Bookmark this on Delicious
Kick It on DotNetKicks.com
Shout it
Share on LinkedIn
Bookmark this on Technorati
Post on Twitter
Google Buzz (aka. Google Reader)