Collations in SQL Server provide sorting rules, case, and accent sensitivity properties for your data. Collation also keep information about CodePage (character set). So it hold combination of Character Set and Rule governing those characters. Rules for how strings of character data are stored, sorted and compared in SQL Server. At low level, through selection of CodePage it specifies the bit patterns that represent each character and the rules by which characters are sorted and compared.
Collation and CodePages
CodePage is set of non-unicode characters with their binary representation (also called character set). Most of the CodePages uses one byte of character, so can only support maximum of 256 characters. But for languages that require more characters, some CodePages exist with two bytes per character (double-byte character set (DBCS)). One Collation can point to one Code Page, and different Collations can use same Code Page. If different collations are using same Code Page, that means, same character set is being governed by different set of rules.
Please note that Unicode characters are different and does not use any CodePage. Some SQL data types use unicode characters, and so are independent of CodePage.
Examples of non-Unicode SQL Data Types using CodePage: char, varchar and text
Examples of unicode SQL Data Types using CodePage: nchar, nvarchar, and ntext
How it effects?
Collation Affects following:
- Characters supported: - i.e. Collation has the Code page information, that will be used for non-Unicode characters. Different collations are actually pointing to different CodePages and so supporting different characters and different languages.
- Number of Bytes per character: It depend directly on CodePage being used, and indirectly change of collation / CodePage might result in different number of bytes per character. Please note that most of the CodePage is of one byte character set, and only supports 256 different characters, but for some languages, where more than 256 characters are required, a character is of 2 bytes, and so the code page is of two byte character set. A two byte character can support up to 65536 (216) characters.
- Sorting: Collation defines multilevel comparisons of the character set in code page.
- Storage
- Performance: Some Collations perform better than others in certain situations. And use of Collate keyword in query will certainly add the performance penalty. For more information refer to Arvind Shyamsundar’s post.
- Collation mismatches can be dangerous – worst case, they can cause data loss or errors to occur (if coercion is disallowed) or in many cases can (at the very least) cause performance problems.
- The usage of the COLLATE clause can introduce a potential performance penalty. If this clause is applied on top of a table column inside a predicate, it may mean that any indexes might not be used (or we may have to do a lot of work using the GetRangeThroughConvert() function.)
- Lastly, if you are using cross-database references in a batch, beware of the collation assigned to each local variable – they inherit the same database collation of the last USE database prior to the first DECLARE.
What is CodePage?
There are either Unicode characters or non-Unicode characters. Unicode character set is designed to accommodate all different characters and symbols of all the different languages together in one big set. Since one character in Unicode is of four bytes size, so it can contain 232 different characters. While non-Unicode characters are usually one byte, and one byte can only support 256 (28) different characters.
In a computer, characters are represented by different patterns of bits being either ON or OFF. There are 8 bits in a byte, and the 8 bits can be turned ON and OFF in 256 different patterns. A program that uses 1 byte to store each character can therefore represent up to 256 different characters by assigning a character to each of the bit patterns. There are 16 bits in 2 bytes, and 16 bits can be turned ON and OFF in 65,536 unique patterns. A program that uses 2 bytes to represent each character can represent up to 65,536 characters.
Single-byte code pages are definitions of the characters mapped to each of the 256 bit patterns possible in a byte. Code pages define bit patterns for uppercase and lowercase characters, digits, symbols, and special characters such as the exclamation point (!), the at sign (@), the number sign (#), or percent (%). Each European language, such as German or Spanish, has its own single-byte code page. Although the bit patterns used to represent the Latin alphabet characters A through Z are the same for all the code pages, the bit patterns used to represent accented characters vary from one code page to the next.
Single-byte character sets cannot store all the characters used by many languages. Some Asian languages have thousands of characters; therefore, they must use 2 bytes per character. Double-byte character sets have been defined for these languages, and code pages have also been defined around them.
How to find out Collation for your data?
Using SQL:
To check the collation of database:
Select DATABASEPROPERTYEX('TEMPDB','COLLATION')
To check the collation of columns or one column in a table:
To get collation for all the columns for a table, you can query sys.columns table. i.e.
select name As ColumnName,collation_name from sys.columns where object_id = object_id('[Person].[Address]')
Or alternatively filter by column name to get collation of individual column:
select collation_name from sys.columns where object_id = object_id('[Person].[Address]') AND name='AddressLine1'
Using SQL Server Management Studio:
Database Collation: In SQL Server Management Studio, go to database properties, and you can find the collation on properties page: Database Properties –> General –> Collation
Table Collation: Open Table Properties –> Extended Properties –> Collation
Column Collation: Column Properties –> Collation
Get list of all collations
In SQL Server, use following method to get list of all collations:
select * from fn_helpcollations()
Collation Properties (Sensitivities):
you can notice abbreviated codes in collation names like of CS, CI, AS, KS etc. These codes represent different properties of collation. For example:
Case Sensitivity: Abbreviated as CS for Case Sensitive and CI for Case Insensitive.
Accent Sensitivity: Abbreviated as AS for Accent Sensitive and AI for Access Insensitive
Kanatype Sensitivity: Abbreviated as KS for Kanatype Sensitive and KI for Kanatype Insensitive
Width Sensitivity: Abbreviated as WS for Width Sensitive and WI for Width Insensitive
Supplementary Characters: Abbreviated as SC for Supplementary Characters supported.
So you can get idea about the collation properties by looking at its name. e.g.
Collation SQL_Latin1_General_CP1_CI_AS roughly translates to
Latin1-General, case-insensitive, accent-sensitive, kanatype-insensitive, width-insensitive
Dealing with Collation
There are many levels at which Collation can be defined. By default Collation defined at high level, will be used, unless defined otherwise at lower level. Collation defined at lower level explicitly overrides inheriting collation.
Collation Scope
Server Level
Collation have scope over whole Database server. It will be used as default collation for any database created in server,
Database Level
Default collation for database. Will be used as Default for any Tables created in DB.
Column Level
Collation for individual column. It can be used to change individual column collation and override any inherited from higher level.
Expression Level
Collation can be explicitly defined for expressions (including column) in T-SQL queries.
Best Practice:
Its better to stick with one global collation across server and even over multiple servers. Multiple collations should not be used, until required due to specific requirements. There are benefits associated with sticking with one collation, so always try to pre-empt any future data usage, and choose collation well before creating database(s).
Study more about collation::
http://msdn.microsoft.com/en-us/library/ms143726.aspx