Monday, October 15, 2007

A custom SSIS PipelineComponent

I've recently been doing a spot of data migration. It's not my favorite task mostly because it involves the painful manual task of mapping one set of tables and columns to another. Drudge work of the worst kind. I've been using SQL Server Integration Services (SSIS) which is the replacement for Data Transformation Services (DTS). Data migration is a job I've mostly managed to avoid in recent years, so I never really had a reason to play with DTS and so I can't comment on how much, if at all, SSIS improves on it. SSIS basically gives you a design surface that you can drag various data sources and destinations and then connect them up via a whole stack of tools such as 'Derived Column', 'Data Conversion' and 'Lookup'. It's designed to support a variety different data manipulation task, including data mining and it's certainly capable of handing most data migration requirements.

SSIS

However, one of my data migration tasks was to take a simple list of locations in the source database that looked a bit like this:

ID     Location_Name
------ --------------------------
1      All UK
2      -England
3      --South
4      ---South East
5      ----Bedfordshire
6      ----Berkshire
7      ----Buckinghamshire
8      ----East Sussex
9      ----Essex
10     ----Hampshire
11     ----Hertfordshire
12     ----Isle of Wight
13     ----Kent
14     ----London & M25
15     -----London
16     ------Central London
17     -------City of London
18     ------East London
19     ------North London
20     ------South London
21     ------West London

... and import it into a table that looks like this:

id          name                 parentId
----------- -------------------- -----------
0           All                  NULL
1           All UK               0
2           England              1
3           South                2
4           South East           3
5           Bedfordshire         4
6           Berkshire            4
7           Buckinghamshire      4
8           East Sussex          4
9           Essex                4
10          Hampshire            4
11          Hertfordshire        4
12          Isle of Wight        4
13          Kent                 4
14          London & M25         4
15          London               14
16          Central London       15
17          City of London       16
18          East London          15
19          North London         15
20          South London         15

Notice how a location hierarchy is represented by the number of hyphens before the location name in the source table, but is represented as a correctly normalized relationship in the destination table. In order to map these two tables I wanted to write some C# code. You can write VB script directly into a script component, but it wasn't really what I wanted. The other obvious alternative was to write my own custom SSIS pipeline component. A brief look at the MSDN documentation on this suggested that it should be quite easy. You just extend a base class PipelineComponent and override a few methods. But in true MSDN style the documentation had large gaps in it leaving out some essential information, eventually I discovered CalendarTransform by Grant Dickinson, also on MSDN which pointed me in the right direction. The SSIS extensibility API could be much easier to use. It's not impossible, but there's far too much digging around for column ids that really should be provided by the toolkit.

Anyway, for your amusement, here's my very simple pipeline component that takes the source table and pumps out the destination.

[ComVisible(false)]
[DtsPipelineComponent(DisplayName="Location Mapper", ComponentType = ComponentType.Transform)]
public class LocationMapper : PipelineComponent
{
    int parentIdColumnIndex;
    int locationDescriptionColumnIndex;

    int idColumnIndex;
    int locationNameColumnIndex;

    // this override provides the columns at design time to allow you to wire up you components
    // on the design surface.
    public override void ProvideComponentProperties()
    {
        base.ProvideComponentProperties();

        //Support resetting the component, this is straight out of CalendarTransform   
        this.RemoveAllInputsOutputsAndCustomProperties();
        this.ComponentMetaData.RuntimeConnectionCollection.RemoveAll();
        this.ComponentMetaData.UsesDispositions = false;
        this.ComponentMetaData.ValidateExternalMetadata = true;

        // Add the input collection.
        IDTSInput90 input = ComponentMetaData.InputCollection.New();
        input.Name = "Input";
        input.ExternalMetadataColumnCollection.RemoveAll();
        input.ExternalMetadataColumnCollection.IsUsed = false;

        // Add the output collection.
        IDTSOutput90 output = ComponentMetaData.OutputCollection.New();
        output.Name = "Output";
        output.ExclusionGroup = 0;
        output.SynchronousInputID = input.ID;
        output.ExternalMetadataColumnCollection.RemoveAll();
        output.ExternalMetadataColumnCollection.IsUsed = false;

        // add the output columns, this is the bit that the MSDN documentation doesn't tell you :P

        // parent id
        IDTSOutputColumn90 parentIdColumn = this.InsertOutputColumnAt(output.ID, 0, "ParentId", "Parent Id");
        parentIdColumn.SetDataTypeProperties(DataType.DT_I4, 0, 0, 0, 0);

        // location description
        IDTSOutputColumn90 locationDescriptionColumn = this.InsertOutputColumnAt(output.ID, 1, "LocationDescription", "Location Description");
        locationDescriptionColumn.SetDataTypeProperties(DataType.DT_WSTR, 500, 0, 0, 0);
    }

    // this runs first at runtime.
    public override void PreExecute()
    {
        // input columns, assumes that the correct columns have been mapped on the design surface.
        // a bit of a hack!
        IDTSInput90 input = ComponentMetaData.InputCollection[0];
        IDTSInputColumnCollection90 inputColumns = input.InputColumnCollection;

        IDTSInputColumn90 idColumn = inputColumns[0];
        if (idColumn == null)
        {
            throw new ApplicationException("id column is missing");
        }

        IDTSInputColumn90 locationNameColumn = inputColumns[1];
        if (locationNameColumn == null)
        {
            throw new ApplicationException("location name column is missing");
        }

        // this is the really wacky stuff, you have to discover the column indexes inside the buffer
        // using this convoluted syntax. I could never have worked this out for myself!!
        idColumnIndex = BufferManager.FindColumnByLineageID(input.Buffer, idColumn.LineageID);
        locationNameColumnIndex = BufferManager.FindColumnByLineageID(input.Buffer, locationNameColumn.LineageID);

        // output columns
        IDTSOutput90 output = ComponentMetaData.OutputCollection[0];
        IDTSOutputColumnCollection90 outputColumns = output.OutputColumnCollection;

        IDTSOutputColumn90 parentIdColumn = outputColumns[0];
        IDTSOutputColumn90 locationDescriptionColumn = outputColumns[1];

        // do the crazy column index lookup again, this time for the output columns.
        parentIdColumnIndex = BufferManager.FindColumnByLineageID(input.Buffer, parentIdColumn.LineageID);
        locationDescriptionColumnIndex = BufferManager.FindColumnByLineageID(input.Buffer, locationDescriptionColumn.LineageID);
    }

    // this is the bit that actually does all the work.
    public override void ProcessInput(int inputID, PipelineBuffer buffer)
    {
        base.ProcessInput(inputID, buffer);

        if (buffer != null)
        {
            if (!buffer.EndOfRowset)
            {
                // level is a little class that I wrote to manage the parent/child mapping.
                Level level = new Level();

                while (buffer.NextRow())
                {
                    short id = (short)buffer[idColumnIndex];
                    string locationName = (string)buffer[locationNameColumnIndex];

                    // location is another helper class that simply describes a row in the destination table
                    Location location = new Location(id, locationName);
                    level.Set(location);

                    buffer[parentIdColumnIndex] = location.ParentId;
                    buffer[locationDescriptionColumnIndex] = location.Name;
                    buffer[orderColumnIndex] = location.Order;
                }
            }
        }
    }
}

7 comments:

Anonymous said...

Wait until 2008 and you can write script in C#!

Philip Stephenson said...

With regards your statement "A brief look at the MSDN documentation on this suggested that it should be quite easy." Too right! So much so that I wrote my own blog on just that.

See: http://bi-logger.blogspot.com

Mike Hadlow said...

Hi Philip,

Yes, why does it have to be so difficult? It would have been pretty trivial to have given us an easier API for SSIS pipeline components. I guess not everyone in MS has read the Framework Design Guidelines book.

Jamie Shellman said...

Hi Mike! Hmmm, why did you have to transfer data, anyways? Data migration is quite a challenge as it depends on the company's needs. The location table looked organized, though. What software did you use for this, btw?

Anonymous said...

I stumbled into this while searching for solutions to 'stale metadata' in SSIS.
Interesting that here we sit, 1/2-way through 2014 and Microsoft's pat answer to so many shortcomings in the SSIS tools is 'you can write C# script' for that (or VB .net).

Just about every ETL tool at some point requires writing code to do something.

However, the point here is that basic, everyday circumstances such as metadata changes in source OUGHT TO BE HANDLED BY THE PRODUCTIVITY TOOL.

If I have to write code, be it C# or VB .Net, then I might as well go back to writing T-SQL in sprocs to do my ETL.

In short, Microsoft has fallen behind in putting the PRODUCTIVITY back into SSIS.

At some point in time on our data warehouse project, we're going to get breathing room, and when we do, high on my list is "port all the SSIS packages to PDI (Pentaho Data Integration)!

RDMKR said...

Patrick, many thanks for posting this work. It has served as a good example for me to build my first SSIS pipeline component from. Please keep up the great blogging.

If you can spare the time, I invite you to take a look at this problem that still plagues me and share your expertise:
http://stackoverflow.com/questions/38241021/ssis-custom-pipeline-component-output-to-buffer-index-mapping

Anonymous said...

Hi Friends,
ssis pipeline i can,t understand please explain any Master Developer

Thanks
Ganesh