SSIS and SQL Server - varchar as a handy data type
Updated: Mar 7
Data engineering and ETL processes often require dealing with text files in all shapes and forms, be it for the purpose of ingesting data or exporting it. In the following sections I will provide my take on how to streamline data import and export to files by using variable length data types of varchar and nvarchar.
Varchar and data import
Importing data from a text file is quite easy unless it's a fixed-width file or we decide to validate the file format and handle exceptions (this could be a separate topic in itself).
SSIS flat file connection manager's suggested data type is string (DT_STR), which is roughly equivalent to the SQL varchar data type. However, we may want to adjust the OutputColumnWidth property of some of the columns – the default width is 50 characters.
Our goal is to import raw data into a staging table as quick as possible. Therefore, the best strategy is to leave all values as text and in our staging table use varchar data type for all columns. Staging tables are not meant to store the data for a long time so we may be quite generous in terms of field lengths and, besides, variable length data types' storage requirements change depending on the length of the entered value itself.
If we expect multi-byte coded characters then we should use nvarchar instead, in order to avoid character conversion issues.
Based on my experience data validation is best performed using SQL stored procedures, and trying to validate it while streaming slows down the process. There are rare instances when it might be necessary or more convenient (e.g. more complex values where we might want to use regular expressions to search for patterns), but generally speaking no data validation is needed at SSIS level. This includes de-duplication.
Format of the file itself is a different matter, it can be handled by using error and warning events as well as script tasks and components.
Exporting data to files
I hardly ever need to concern myself with how SQL Server data types of the values at the source table will be displayed in the output file after SSIS processes it. I find it much easier to be in control of this process very early on and convert every single value into a varchar data type at the server end, e.g. a stored procedure, before the data is fetched by a data source in a dataflow task.
Perhaps the most common task in my experience is to apply date formatting. This can be done very easily after converting date to a character data type, using CONVERT method and different styles like so:
declare @mydate datetime = '20210306 23:36';
select convert(varchar(23), @mydate, 121) as formatted_date
And here is an example of casting a decimal number:
declare @mydecimal decimal(24,8) = 24.80000000
select @mydecimal as my_decimal_number, cast(@mydecimal as varchar(25)) as my_decimal_text
For more information about the convert and cast functions (including available date formats) follow this link to Microsoft's documentation:
Text file is just it – a text file. It does not care about data types - decimal, int, strings etc. All it expects in the end are text representations of the source values.
Good to know about varchar
There are a few things worth noting in terms of working with varchar data type.
In varchar(n) the n does not stand for the number of characters, but rather the number of bytes. In a single byte encoding it does not matter, but with encoding such as UTF-8 we may get characters using more than one byte representations.
The following example shows data truncation when we try to fit a 10-character Unicode string containing such characters into a varchar type applying one of the UTF8 collations:
declare @SomeNvarchar nvarchar(10);
set @SomeNvarchar = N'Gȏȏdtest12';
select convert(varchar(10), @SomeNvarchar collate Arabic_100_CI_AI_SC_UTF8);
As we can see, the "12" at the end didn't make it to the result.
If we expect a wider range of characters in our source data then it is best to use nvarchar, bearing in mind that in nvarchar(n), n denotes pairs of bytes (not necessarily the number of characters) so e.g. nvarchar(4) means data length of 8 bytes. Some code pages may contain characters which occupy 3 bytes of space.
Here is len vs datalenght comparison for nvarchar character string:
declare @nvarchar nvarchar(4)='Test';
select len(@nvarchar) as [len], datalength(@nvarchar) as [datalength]
Another potential source of unexpected data truncation comes from the fact that the varchar length when not provided defaults to 1.
e.g. the following statement results in the output ‘T’:
declare @default_varchar varchar = 'Test';
Finally, I would like to mention the conversion from varchar to uniqueidentifier. I happen to deal with the data containing many of them and often need to perform validation on the source column to ensure the imported string is indeed a uniqueidentifier.
According to Microsoft's documentation, uniqueidentifier type is considered a character type for the purposes of conversion from a character expression, and is subject to the truncation rules for converting to a character type.
As a result, when our input string starts with GUID, we may get a positive result when we don’t necessarily expect it, because the string will be truncated to 36 characters before the conversion takes place:
declare @test_input varchar(max) = cast(newID() as varchar(36)) + ' some text that follows';
select @test_input as original_value, try_cast(@test_input as uniqueidentifier) as i_didnt_expect_that
If our requirement is to allow just the GUID value in the field then we need to use a different method e.g. check the length of the input string first.
Variable length data types can simplify data engineer’s life on the SQL Server platform. They are flexible in terms of conversion to and from different data types, which makes them ideal candidates for tasks like importing raw data, mapping and lookup tasks, and exporting data to files with a specific format. They are quite easy to work with and there are not many pitfalls to look out for.