I've recently been doing a spot of data migration. It's not my favorite task, mostly because it involves the painful manual work of mapping one set of tables and columns to another. Drudge work of the worst kind. I've been using SQL Server Integration Services (SSIS), which is the replacement for Data Transformation Services (DTS). Data migration is a job I've mostly managed to avoid in recent years, so I never really had a reason to play with DTS, and so I can't comment on how much, if at all, SSIS improves on it. SSIS basically gives you a design surface onto which you can drag various data sources and destinations and then connect them up via a whole stack of tools such as 'Derived Column', 'Data Conversion' and 'Lookup'. It's designed to support a variety of different data manipulation tasks, including data mining, and it's certainly capable of handling most data migration requirements.
However, one of my data migration tasks was to take a simple list of locations in the source database that looked a bit like this:
ID      Location_Name
------  --------------------------
1       All UK
2       -England
3       --South
4       ---South East
5       ----Bedfordshire
6       ----Berkshire
7       ----Buckinghamshire
8       ----East Sussex
9       ----Essex
10      ----Hampshire
11      ----Hertfordshire
12      ----Isle of Wight
13      ----Kent
14      ----London & M25
15      -----London
16      ------Central London
17      -------City of London
18      ------East London
19      ------North London
20      ------South London
21      ------West London
... and import it into a table that looks like this:
id          name                 parentId
----------- -------------------- -----------
0           All                  NULL
1           All UK               0
2           England              1
3           South                2
4           South East           3
5           Bedfordshire         4
6           Berkshire            4
7           Buckinghamshire      4
8           East Sussex          4
9           Essex                4
10          Hampshire            4
11          Hertfordshire        4
12          Isle of Wight        4
13          Kent                 4
14          London & M25         4
15          London               14
16          Central London       15
17          City of London       16
18          East London          15
19          North London         15
20          South London         15
Notice how the location hierarchy is represented by the number of hyphens before the location name in the source table, but by a correctly normalized parent/child relationship in the destination table. In order to map these two tables I wanted to write some C# code. You can write VB.NET script directly into a script component, but that wasn't really what I wanted. The other obvious alternative was to write my own custom SSIS pipeline component. A brief look at the MSDN documentation on this suggested that it should be quite easy: you just extend a base class, PipelineComponent, and override a few methods. But in true MSDN style the documentation had large gaps in it, leaving out some essential information. Eventually I discovered CalendarTransform by Grant Dickinson, also on MSDN, which pointed me in the right direction. The SSIS extensibility API could be much easier to use. It's not impossible, but there's far too much digging around for column ids that really should be provided by the toolkit.
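Before getting to the component itself, here's the core of the mapping rule in isolation. This snippet is my own illustration rather than code from the component: the number of leading hyphens gives a row's depth in the hierarchy, and everything after them is the clean location name.

// Hypothetical helper, for illustration only: extract the depth and clean
// name from a raw location name such as "----Bedfordshire".
static void ParseLocationName(string raw, out int depth, out string name)
{
    depth = 0;
    while (depth < raw.Length && raw[depth] == '-')
    {
        depth++;
    }
    name = raw.Substring(depth);
}

So "----Bedfordshire" parses to depth 4 and name "Bedfordshire", which makes its parent the most recent row seen at depth 3, "---South East".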
Anyway, for your amusement, here's my very simple pipeline component that takes the source table and pumps out the destination.
using System;
using System.Runtime.InteropServices;
using Microsoft.SqlServer.Dts.Pipeline;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;

[ComVisible(false)]
[DtsPipelineComponent(DisplayName = "Location Mapper", ComponentType = ComponentType.Transform)]
public class LocationMapper : PipelineComponent
{
    int parentIdColumnIndex;
    int locationDescriptionColumnIndex;
    int idColumnIndex;
    int locationNameColumnIndex;

    // This override provides the columns at design time to allow you to wire up
    // your components on the design surface.
    public override void ProvideComponentProperties()
    {
        base.ProvideComponentProperties();

        // Support resetting the component; this is straight out of CalendarTransform.
        this.RemoveAllInputsOutputsAndCustomProperties();
        this.ComponentMetaData.RuntimeConnectionCollection.RemoveAll();
        this.ComponentMetaData.UsesDispositions = false;
        this.ComponentMetaData.ValidateExternalMetadata = true;

        // Add the input collection.
        IDTSInput90 input = ComponentMetaData.InputCollection.New();
        input.Name = "Input";
        input.ExternalMetadataColumnCollection.RemoveAll();
        input.ExternalMetadataColumnCollection.IsUsed = false;

        // Add the output collection.
        IDTSOutput90 output = ComponentMetaData.OutputCollection.New();
        output.Name = "Output";
        output.ExclusionGroup = 0;
        output.SynchronousInputID = input.ID;
        output.ExternalMetadataColumnCollection.RemoveAll();
        output.ExternalMetadataColumnCollection.IsUsed = false;

        // Add the output columns. This is the bit that the MSDN documentation
        // doesn't tell you :P

        // Parent id.
        IDTSOutputColumn90 parentIdColumn =
            this.InsertOutputColumnAt(output.ID, 0, "ParentId", "Parent Id");
        parentIdColumn.SetDataTypeProperties(DataType.DT_I4, 0, 0, 0, 0);

        // Location description.
        IDTSOutputColumn90 locationDescriptionColumn =
            this.InsertOutputColumnAt(output.ID, 1, "LocationDescription", "Location Description");
        locationDescriptionColumn.SetDataTypeProperties(DataType.DT_WSTR, 500, 0, 0, 0);
    }

    // This runs first at runtime.
    public override void PreExecute()
    {
        // Input columns. This assumes that the correct columns have been mapped
        // on the design surface. A bit of a hack!
        IDTSInput90 input = ComponentMetaData.InputCollection[0];
        IDTSInputColumnCollection90 inputColumns = input.InputColumnCollection;

        IDTSInputColumn90 idColumn = inputColumns[0];
        if (idColumn == null)
        {
            throw new ApplicationException("id column is missing");
        }
        IDTSInputColumn90 locationNameColumn = inputColumns[1];
        if (locationNameColumn == null)
        {
            throw new ApplicationException("location name column is missing");
        }

        // This is the really wacky stuff: you have to discover the column indexes
        // inside the buffer using this convoluted syntax. I could never have
        // worked this out for myself!!
        idColumnIndex = BufferManager.FindColumnByLineageID(input.Buffer, idColumn.LineageID);
        locationNameColumnIndex = BufferManager.FindColumnByLineageID(input.Buffer, locationNameColumn.LineageID);

        // Output columns.
        IDTSOutput90 output = ComponentMetaData.OutputCollection[0];
        IDTSOutputColumnCollection90 outputColumns = output.OutputColumnCollection;

        IDTSOutputColumn90 parentIdColumn = outputColumns[0];
        IDTSOutputColumn90 locationDescriptionColumn = outputColumns[1];

        // Do the crazy column index lookup again, this time for the output columns.
        // (For a synchronous transform the output columns live in the input buffer.)
        parentIdColumnIndex = BufferManager.FindColumnByLineageID(input.Buffer, parentIdColumn.LineageID);
        locationDescriptionColumnIndex = BufferManager.FindColumnByLineageID(input.Buffer, locationDescriptionColumn.LineageID);
    }

    // This is the bit that actually does all the work.
    public override void ProcessInput(int inputID, PipelineBuffer buffer)
    {
        base.ProcessInput(inputID, buffer);

        if (buffer != null)
        {
            if (!buffer.EndOfRowset)
            {
                // Level is a little class that I wrote to manage the parent/child mapping.
                Level level = new Level();
                while (buffer.NextRow())
                {
                    short id = (short)buffer[idColumnIndex];
                    string locationName = (string)buffer[locationNameColumnIndex];

                    // Location is another helper class that simply describes a row
                    // in the destination table.
                    Location location = new Location(id, locationName);
                    level.Set(location);

                    buffer[parentIdColumnIndex] = location.ParentId;
                    buffer[locationDescriptionColumnIndex] = location.Name;
                }
            }
        }
    }
}
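The Level and Location helpers aren't shown in the post, so here's a minimal sketch of how they might work, using the hyphen-depth rule above and a stack of the most recent location seen at each depth. Everything beyond what ProcessInput actually uses (the Set method, ParentId and Name) is my own guesswork:

using System.Collections.Generic;

// A guess at the Location helper: describes one row of the destination table.
public class Location
{
    public readonly int Id;
    public readonly string Name;
    public readonly int Depth;
    public int ParentId;

    public Location(int id, string rawName)
    {
        Id = id;
        // Count the leading hyphens to get the depth, then strip them off.
        int depth = 0;
        while (depth < rawName.Length && rawName[depth] == '-')
        {
            depth++;
        }
        Depth = depth;
        Name = rawName.Substring(depth);
    }
}

// A guess at the Level helper: remembers the most recent location seen at
// each depth so that each new row can be assigned its parent's id.
public class Level
{
    private readonly List<Location> ancestors = new List<Location>();

    public void Set(Location location)
    {
        // Forget any previous locations at this depth or deeper.
        while (ancestors.Count > location.Depth)
        {
            ancestors.RemoveAt(ancestors.Count - 1);
        }

        // The parent is the nearest shallower location; a row with no
        // ancestors hangs off the root, id 0 ("All" in the destination table).
        location.ParentId = ancestors.Count == 0 ? 0 : ancestors[ancestors.Count - 1].Id;

        ancestors.Add(location);
    }
}

Fed the source rows in order, this reproduces the destination table above: "----London & M25" (depth 4) gets parent 4 ("South East"), and "-----London" (depth 5) gets parent 14.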
7 comments:
Wait until 2008 and you can write script in C#!
With regard to your statement "A brief look at the MSDN documentation on this suggested that it should be quite easy." Too right! So much so that I wrote my own blog post on just that.
See: http://bi-logger.blogspot.com
Hi Philip,
Yes, why does it have to be so difficult? It would have been pretty trivial to have given us an easier API for SSIS pipeline components. I guess not everyone in MS has read the Framework Design Guidelines book.
Hi Mike! Hmmm, why did you have to transfer data, anyways? Data migration is quite a challenge as it depends on the company's needs. The location table looked organized, though. What software did you use for this, btw?
I stumbled into this while searching for solutions to 'stale metadata' in SSIS.
Interesting that here we sit, 1/2-way through 2014 and Microsoft's pat answer to so many shortcomings in the SSIS tools is 'you can write C# script' for that (or VB .net).
Just about every ETL tool at some point requires writing code to do something.
However, the point here is that basic, everyday circumstances such as metadata changes in source OUGHT TO BE HANDLED BY THE PRODUCTIVITY TOOL.
If I have to write code, be it C# or VB .Net, then I might as well go back to writing T-SQL in sprocs to do my ETL.
In short, Microsoft has fallen behind in putting the PRODUCTIVITY back into SSIS.
At some point in time on our data warehouse project, we're going to get breathing room, and when we do, high on my list is "port all the SSIS packages to PDI (Pentaho Data Integration)!"
Patrick, many thanks for posting this work. It has served as a good example for me to build my first SSIS pipeline component from. Please keep up the great blogging.
If you can spare the time, I invite you to take a look at this problem that still plagues me and share your expertise:
http://stackoverflow.com/questions/38241021/ssis-custom-pipeline-component-output-to-buffer-index-mapping
Hi Friends,
I can't understand the SSIS pipeline. Please explain, any master developer?
Thanks
Ganesh