Collations in SQL Server provide sorting rules, case, and accent sensitivity properties for your data. Collation also keep information about CodePage (character set). So it hold combination of Character Set and Rule governing those characters. Rules for how strings of character data are stored, sorted and compared in SQL Server. At low level, through selection of CodePage it specifies the bit patterns that represent each character and the rules by which characters are sorted and compared.
Collation and CodePages
CodePage is set of non-unicode characters with their binary representation (also called character set). Most of the CodePages uses one byte of character, so can only support maximum of 256 characters. But for languages that require more characters, some CodePages exist with two bytes per character (double-byte character set (DBCS)). One Collation can point to one Code Page, and different Collations can use same Code Page. If different collations are using same Code Page, that means, same character set is being governed by different set of rules.
Please note that Unicode characters are different and does not use any CodePage. Some SQL data types use unicode characters, and so are independent of CodePage.
Examples of non-Unicode SQL Data Types using CodePage: char, varchar and text
Examples of unicode SQL Data Types using CodePage: nchar, nvarchar, and ntext
How it effects?
Collation Affects following:
- Characters supported: - i.e. Collation has the Code page information, that will be used for non-Unicode characters. Different collations are actually pointing to different CodePages and so supporting different characters and different languages.
- Number of Bytes per character: It depend directly on CodePage being used, and indirectly change of collation / CodePage might result in different number of bytes per character. Please note that most of the CodePage is of one byte character set, and only supports 256 different characters, but for some languages, where more than 256 characters are required, a character is of 2 bytes, and so the code page is of two byte character set. A two byte character can support up to 65536 (216) characters.
- Sorting: Collation defines multilevel comparisons of the character set in code page.
- Storage
- Performance: Some Collations perform better than others in certain situations. And use of Collate keyword in query will certainly add the performance penalty. For more information refer to Arvind Shyamsundar’s post.
- Collation mismatches can be dangerous – worst case, they can cause data loss or errors to occur (if coercion is disallowed) or in many cases can (at the very least) cause performance problems.
- The usage of the COLLATE clause can introduce a potential performance penalty. If this clause is applied on top of a table column inside a predicate, it may mean that any indexes might not be used (or we may have to do a lot of work using the GetRangeThroughConvert() function.)
- Lastly, if you are using cross-database references in a batch, beware of the collation assigned to each local variable – they inherit the same database collation of the last USE database prior to the first DECLARE.
What is CodePage?
There are either Unicode characters or non-Unicode characters. Unicode character set is designed to accommodate all different characters and symbols of all the different languages together in one big set. Since one character in Unicode is of four bytes size, so it can contain 232 different characters. While non-Unicode characters are usually one byte, and one byte can only support 256 (28) different characters.
In a computer, characters are represented by different patterns of bits being either ON or OFF. There are 8 bits in a byte, and the 8 bits can be turned ON and OFF in 256 different patterns. A program that uses 1 byte to store each character can therefore represent up to 256 different characters by assigning a character to each of the bit patterns. There are 16 bits in 2 bytes, and 16 bits can be turned ON and OFF in 65,536 unique patterns. A program that uses 2 bytes to represent each character can represent up to 65,536 characters.
Single-byte code pages are definitions of the characters mapped to each of the 256 bit patterns possible in a byte. Code pages define bit patterns for uppercase and lowercase characters, digits, symbols, and special characters such as the exclamation point (!), the at sign (@), the number sign (#), or percent (%). Each European language, such as German or Spanish, has its own single-byte code page. Although the bit patterns used to represent the Latin alphabet characters A through Z are the same for all the code pages, the bit patterns used to represent accented characters vary from one code page to the next.
Single-byte character sets cannot store all the characters used by many languages. Some Asian languages have thousands of characters; therefore, they must use 2 bytes per character. Double-byte character sets have been defined for these languages, and code pages have also been defined around them.
How to find out Collation for your data?
Using SQL:
To check the collation of database:
Select DATABASEPROPERTYEX('TEMPDB','COLLATION')
To check the collation of columns or one column in a table:
To get collation for all the columns for a table, you can query sys.columns table. i.e.
select name As ColumnName,collation_name from sys.columns where object_id = object_id('[Person].[Address]')
Or alternatively filter by column name to get collation of individual column:
select collation_name from sys.columns where object_id = object_id('[Person].[Address]') AND name='AddressLine1'
Using SQL Server Management Studio:
Database Collation: In SQL Server Management Studio, go to database properties, and you can find the collation on properties page: Database Properties –> General –> Collation
Table Collation: Open Table Properties –> Extended Properties –> Collation
Column Collation: Column Properties –> Collation
Get list of all collations
In SQL Server, use following method to get list of all collations:
select * from fn_helpcollations()
Collation Properties (Sensitivities):
you can notice abbreviated codes in collation names like of CS, CI, AS, KS etc. These codes represent different properties of collation. For example:
Case Sensitivity: Abbreviated as CS for Case Sensitive and CI for Case Insensitive.
Accent Sensitivity: Abbreviated as AS for Accent Sensitive and AI for Access Insensitive
Kanatype Sensitivity: Abbreviated as KS for Kanatype Sensitive and KI for Kanatype Insensitive
Width Sensitivity: Abbreviated as WS for Width Sensitive and WI for Width Insensitive
Supplementary Characters: Abbreviated as SC for Supplementary Characters supported.
So you can get idea about the collation properties by looking at its name. e.g.
Collation SQL_Latin1_General_CP1_CI_AS roughly translates to
Latin1-General, case-insensitive, accent-sensitive, kanatype-insensitive, width-insensitive
Dealing with Collation
There are many levels at which Collation can be defined. By default Collation defined at high level, will be used, unless defined otherwise at lower level. Collation defined at lower level explicitly overrides inheriting collation.
Collation Scope
Server Level
Collation have scope over whole Database server. It will be used as default collation for any database created in server,
Database Level
Default collation for database. Will be used as Default for any Tables created in DB.
Column Level
Collation for individual column. It can be used to change individual column collation and override any inherited from higher level.
Expression Level
Collation can be explicitly defined for expressions (including column) in T-SQL queries.
Best Practice:
Its better to stick with one global collation across server and even over multiple servers. Multiple collations should not be used, until required due to specific requirements. There are benefits associated with sticking with one collation, so always try to pre-empt any future data usage, and choose collation well before creating database(s).
Study more about collation::
http://msdn.microsoft.com/en-us/library/ms143726.aspx
Overview
In the previous post we discussed the custom requirements and procedure of designing a full text search on relational database in a business domain driven fashion. In this article I will share high level design diagram depicting all the components, their inter-connectivity and general flow application to database end. Also I will describe each component individually.
Design
Design has always been requirement oriented. Even for the same requirement there could be N design variation. Following is one possible design implementation for business domain driven full text search, to represent overall components. It is very high level and generic, to demonstrate involvement of different components and flow of information in between them.
Components Description:
DataTier / Server Code
Classes and functionality for full text search is represented as part of DataTier component. It could be a different components using DataTier. These parts are presumably written using application programming language. Its components are:
- Full Text Search Query<T>: Is a generic class representing search query, with keywords provided by user and type of search i.e. inclusive or exclusive, and provided type T would be used to identify type of search, and associated stored procedures.
- Search Query Transformer: Is a component made up of implementation for inclusive and exclusive transformers, That will take the keywords provided by user from Search Query object, and will transform it to search engine ready keywords, adding AND/OR operators where needed and handling double quotes as might be provided by user.
- Procedure Mapping Provider<T>: Is a generic class to map stored procedures with type T.
- Full Text Search Response<T>: This class represent search response, and will hold result generated from search.
Database
Main logic of text searching, indexing, query building and full text catalogue resides in database engines.
- Table A (Pivot Table): Is our main pivot table. So it represent T in type of search we are designing. For more information on Pivot table, refer to this article.
- Table B, C and N related tables: These are the tables, which are directly or indirectly related with Table A through physical (constraint) or logical relationship. These tables can include embedded documents.
- Index A, B, C: These are the Indexes maintained by Database engine Text Catalogue, when we call for rank against keyword in stored procedure from each index, it return matching rank, and ID of record with which rank is associated. We do need to workout on this part, as it is fully managed by database engine.
- Catalog managed FTS keywords: This is a system repository (normally table) in which system maintains list of all different keywords which it collects through indexing. This provides a real data keyword list for features like keyword suggestions.
- Suggest Keyword SP: Is a stored procedure, Which we use to provide keyword suggestions on the fly as the user type. It retrieved the keywords from catalog managed keywords source, by union of all keywords extracted from all the related tables.
- Full Text Search SP: This is the main SP for type T, that will execute the custom query we have written for search N tables, based on transformed keyword. Each full text query native command will return results and rank. We will multiply the rank with weightage we assigned to each object, and will then sort the results based on calculated rank.
Summary
In previous post, we went in to theoretical details of the concept and design process of business type driven full text search on relational database, why it is needed, and general concepts. In this post, we have discussed the design and the general components we need to implement to achieve the functionality.
Overview
In business applications, designing a full text search is a custom designing process. There is no general search algorithm, that can fulfil the requirement of optimized and meaningful business domain driven full text search. And there is a reason for that. Search is always around certain type of information. For example, find a company, or a product, or a person. If multiple types of information is included in search, then it is a combination of search of multiple types. In the case of web, search engines are mostly designed around finding a general type of information in one source “web page”. No matter what kind of information is out there on web, it is all in web pages, using same HTML standards, and linking between pages, information paragraphs. So the challenge on web search is all about volume and relevancy of information.
In business applications, using relational databases at backend, if we use a generic search algorithm, it can work but to less optimized extent. Because structure of information and set of information is different from application to application, and requirement to requirement. So a single, pre-designed search inclusive of everything (open search) cannot bring the 100% expected results when it comes to relational databases and scattered files. Each “type of search” require planning and designing including radius of information, weightage of information, retrieval of information etc. Luckily, I guess, same design process can be applied to any type of business full text search. So in business applications, intelligent searching should be type specific or combination of type specifics. We can call it business domain driven full text searching.
Essence of keyword based full-text search in business applications
In the presence of today’s powerful, web search engines, end users want keyword based suggestions and search as a granted user experience. However, majority of business software system fails to deliver that basic user expectation, and ignores this fundamental pillar of a good software system. Proper implemented search experience is not only a must have to find information, but also the quickest method to access any information. In this article I would go through the steps of designing a custom business full text search.
Scope of this article
Type of search or the unit of search is one set of information represented by bunch of business models, and set of tables in database. In this article we will focus on how to design business full text search around one set of information. To expand the scope of information, or to include multiple sets of information, the process of designing can be repeated. Results of multiple types can be combined at higher level. So we would be focusing on one ‘unit of search’, where we can identify one higher piece of information around which our search revolves. We can call it pivot information.
We will see how N number of tables linking to pivot information can be included in search. Also we will explore how to design weightage, and ranking within the information.
I have implemented this design on SQL Server many times, but I would try to avoid technology specifics and I think same design can be applied to any database engine out there with basic full text searching/ indexing capability. So I would be leaving how to execute each step to reader’s knowledge and technology.
Pivot Table
Each unit of search normally revolves around central piece of information, an entity or model. Although the overall information can span to multiple tables, but select a table that is really central to a search. We will call that table Pivot table.
Scoping
- Identify central table of information unit. We will call it “Pivot Table”.
- Identify scope of search, by listing down all the related tables of Pivot table. We can include N number of tables in to search. We will call it “radius of search in database”. It is like circling the set of tables on a database diagram. More tables, greater the radius of selection and vice versa.
- There could be multiple pivots in your requirements. That means, different set of information independent from each other.
Define Full-Text Catalog
Define a new full text (or simply text) catalog in database.
- Catalog is independent of type of search, and multiple searches, or even multiple projects, can use the same full text-catalog.
- SQL Server Specific: Microsoft recommendation of defining catalogs are based on same characteristics (such as small number of changes versus large number of changes, or tables that change frequently during a particular time of day) together under the same full-text catalog. By setting up full-text catalog population schedules, full-text indexes stay synchronous with the tables without adversely affecting the resource usage of the database server during periods of high database activity.
Define Index on Pivot and Related Tables
For each type of search:
- Define indexes on Pivot Table P
- Define indexes on all the related tables (Set (RT)) that are related with main pivot table with FK or logical relation.
E.g. For Companies the related tables could be Contacts, Addresses, Classifications etc - Only add columns that contain information you want to include in search.
- You can index any table.column that contain files embedded like Word, Excel files etc.
- A unique index would be required for that. Make use of existing Clustered Primary Key Index on tables.
Define weightage for type of information
- Some information is more important that other. That means, a record with that information if matched with keyword, should be ranked higher. This concept is the process of assigning weightage to information entities.
- You have to go through the list of tables and assign a weightage to each of it. Weightage would be used to rank records while executing search.
- Weightage will define, if the keyword match with this table, how much weightage it would be given comparing to if the same keyword match for some other records from other tables.
- Normally Pivot table contains the highest weightage, and tables close to Pivot table will get higher weightage.
- Within same level of tables, you can tweak weightage as per importance of information.
- That weightage will be multiplied with built-in search rank number (generated by full text engine) to increase or decrease the overall rank of record.
Design Search Strategy around Pivot Table
Define the search strategy that include:
- Plan for a query where you can retrieve Pivot record and rank associated with that record for each associated table using relationship. That would be implemented in a stored procedure
- Plan for tables or view that you would like to retrieve as a result of matching record. Here multiple tables can be linked using relationship while retrieving information.
Develop Search Server Code
Develop application code and models, to represent query, transformation, and result.
- Write classes to represent Query on type T, FullTextSearchQuery<T> where T is data class for pivot table, we can call Pivot class.
- Types of Queries i.e. inclusive or exclusive (with ‘AND’, or ‘OR’ operator in between them)
- Transformation from user provided keywords and the transformed query that would be sent to database full text engine.
Write Database Code (Stored Procedures)
We will write two central stored procedures, for each Pivot Table.
Full_Text_Search_SP
- This is the stored procedure where we will search the actual tables and will rank the results as per matched keywords.
- Simplest way would be to collect rank and IDs (of pivot table) from all the related tables. Take SUM of all ranks received from multiple tables, and take UNION of all the IDs while grouping by IDs, this will give set of IDs of records from Pivot table along with their rank.
- Results then will be sorted by combined Rank in descending order to get results by relevancy
Suggest_Keyword_SP
- Develop the keyword strategy around each Pivot table, which will be provided as suggestions on user typing.
- Need to identify all the related tabled along with Pivot table. Number of tables you select for keyword selection could be less or more than actual searching tables, depending on strategy for keywords.
- You can only choose tables for keywords on which Full-Text index has been defined.
Code Usage
Using all of above components, the search would be using following (please refer to diagram in next part of this article):
- Create Instance of FullTextSearchQuery<T>
- On front end key press of search textbox, Use Provide_Suggestions to get number of keyword suggestions (in web through Ajax).
- Call Execute Search to get search results, in which we will get object of FullTextSearchResponse<T> which will provide list of search results.
- Display results to front end by order of rank.
Summary
Business Domain Driven Full text search require planning and designing for each type of search. Where each type of search revolves around a pivot information or pivot table in database. We select list of tables form DB, to define scope of searching, called radius of search. Each table form search is assigned a weightage which would be calculated for ranking of results. In core of search procedure, we select records based on ranking and retrieve results, where keywords are transformed in to Inclusive or Exclusive relationships, using AND / OR operators. This design will help to achieve required business domain driven full text searching for application.
Next Part – Design Diagram
In next part, I will share design as diagram, and will describe all the components of design.
To manage, asp.net web application session lifecycle and timeouts, understanding and configuration of three different settings is required.
Q. Are you looking for a way to manage asp.net session timeout, or stuck where, increasing your timeout is not working?
To increase the session timeout, you need to look at all possible places, and should have complete understanding of different timeouts, that all will define user session life while active or inactive. Three different timeouts that can affect user session life cycle are:
1) Authentication cookie timeout in config file
Assuming if you are using FormsAuthentication, you need to set timeout in config file as following:
<authentication mode="Forms">
<forms name="AppRootForms" loginUrl="Login.aspx" protection="All" timeout="20" path="/" slidingExpiration="true"/>
</authentication>
Where timeout is a time in minutes after which session cookie of user will expire. So above means, that after 20 minutes of inactivity, user’s session cookie will expire, and user will have to login again (i.e. create a new session) to use application/feature.
SlidingExpiration
SlidingExpiration property in above configuration, if set to true, resets the timeout on each request from user, and if set to false, will force session expire after every N minutes, no matter even if user was actively using application. For more information see SlidingExpiration MSDN documentation.
Please note that, If you are using Windows Authentication, there could be a Active Directory domain policy to log users out of Windows Session.
Form authentication timeout can be set in code as well. So make sure to set it properly, if you are setting it in code like:
formsAuthentication.Timeout = System.TimeSpan.FromMinutes(20);
2) Server SessionState timeout in config file
SessionState configuration element is used to configure session management at server, including where session will be stored, whether it will be cookie-less etc. It also contains a timeout value, which defaults to 20 minutes. e.g. element in system.web would be:
<sessionState cookieless="false" timeout="20" mode="InProc"/>
In above, timeout “specifies the number of minutes a session can be idle before it is abandoned”. So this was the second place to look at.
3) Application Pool Idle time-out termination in IIS
Finally, even after you have icreased the timeouts of Authentication Cookie, and Server SessionState, user will be kicked out of session, if the application pool remained idle for certain time.
Now this last part might leave you scratching your head, if not configured properly, because it only affect when there is no activity in pool at all, means, there is not a single active user, so it is rare, but if probable, chances are low. So if the user really kicked out due to IIS App pool termination, it is hard to figure out.
If you have not mentioned timeouts for authentication cookie and session state in web.config file explicitly, look out for parent configuration files. (i.e. asp.net root web.config and machine.config etc)
Programming Naturally an Art
When it comes to programming, I am in complete agreement with Donald E. Knuth well-known respected computer scientist and author of the best programming multi-volume work ‘The Art of Computer Programming’. He published an article “Computer Programming as An Art” in a CACM 1974 publication, and that article is a comprehensive coverage of the subject, and as valid today as it was back in time. Article talks in great detail about the conception of art as well science, difference in scope of them, and desire of adaptability for both of these. I would like to extend the concept by categorizing programmers in two different modes or categories. Programming Engineers and Programming Artists.
Computer science is purely based on human imagination, and layers upon layers are imagined, defined, named and implemented by humans themselves. As human creativity can never be perfect, so ironically every layer can be challenged, which holds true only but for Art. Historically, layers of human imagination are always depicted as form of expression and Art. That implies, programming is an art. In the start, there was no disagreement. In the same article, Knuth mentioned “When Communications of the ACM began publication in 1959, the members of ACM'S Editorial Board made the following remark as they described the purposes of ACM'S periodicals”
If computer programming is to become an important part of computer research and development, a transition of programming from an art to a disciplined science must be effected. – CACM 1959
Well, apparently the community at that time was certainly in agreement that programming is an art. But the objective was a transition to disciplined science. Although it says, that transition is necessary for it to become part of research and development, but rather I feel the actual benefit of transition is the production of Programming Engineers at masses. Why? we will explore in a while.
Why transition to Science?
Artist or Engineer governed by artistic principals
Artist craves for inner self imaginative expressions to create some thing new. Compare this to Natural sciences, i.e. Physics, mechanical, electrical, nuclear and genetics, etc. where scientists are observing and discovering the formulas, laws which are universal and defined by nature itself. These are certainly not humanly imagined and defined. So there is essentially a big difference between Engineering governed by some real universal laws and principals, versus Engineering governed by laws of Artistic creation of humans.
Real motive behind the struggle to make computer programming a science and programmer as Engineer is the limitation of Art itself. Art has huge limitations when it comes to producing artists, It requires natural talent, and not every one can be trained as artist. But as an industry, computer programming field require masses of programmers to fulfil demand of automation. Only principals of Engineering can fulfil the requirement of industry. As Engineering, depends on knowledgebase and exhibits pre-defined set of rules and principals to follow. So theoretically any individual can be trained as Engineer. Artists can not be produced, but Engineers could. Artist imagines, creates, invent, while Engineer follows best practices, and set of rule.
Struggle for Science
To fulfil the requirement, over the time, due to necessity, we have tried to take Art of Programming and convert it in to a form of Engineering governed by rules (not Natural but our own defined) as much as possible. Over the time a huge effort has been applied to develop practices and principals which combined can make programming a partial form of Engineering (or at least to some extent). Algorithms, Design Patterns, Good Programming practices are all example of this. This effort has been successful to good extent, so masses are actually being trained and deployed to industry as Software Programmers. There are ton of best practices, principals, patterns, articles, libraries to follow, and it is accumulating fast. It guarantees, any person can be trained as Programmer (Programming Engineer).
What Comes first? Artist or Engineer
I think the line is very thin but still very clear between what kind of work falls in Art and Engineering. There are certain properties that define both roles:
Engineer
- Studies certain set of principals and applied sciences
- Takes every day problem, applies well known established practice for a solution
- Keeps up to date with latest trends and best practices. So that knowledge can be applied for better solution of well known problems.
- Accumulates Knowledge
Artist
- Learn tools to express
- Innovative. Artist Imagine / Think and goes through no mans Land.
- Normally rules does not apply (or very few), and even if there are some, artist think out of the box to come up some thing better.
- Accumulates Inner Expressions and ways to trigger Imagination and Innovation.
Clearly, role of Artist is to innovate, and Engineer is to apply. Inventor is artist, while learner / applier is Engineer. Artist creates, Engineer applies.
Back to programming, I would categorize the programmers in two major types / roles, based on thinking style, native talent, and trend. Normally each one of programmer tilts more towards one of those categories. And definitely majority tilts towards Engineering.
Programming Artist
When it comes to innovation, programming is purely piece of art and nothing else. It genuinely require imagination, and there are very few restrictions. While innovating, you are already in no mans land. There is no rule to follow, there is no similar example. There is no design pattern that matches with problem. There is no book that gives you solution. No one did so before, your imagination is under siege. Programmer Inventor is the artist by all means.
Programming Artists are very small percentage of developers for obvious artistic principals. Those who are less of followers and more of innovators. Who have native ideas, and writing their own libraries/frameworks inspired by internal creativity. All the great programmers who invented great tool and libraries, solved a major problem for us, all fall in to this category. So they are not that rare species luckily.
Programming Artists, have to remember, space and fields for innovation is very small. If you are not in that area, you are probably solving daily life business problems. The pile of developed principals and practices is there to help solve normal business problems with ease and with greater overall confidence. Only a programming artist can appreciate another programming artist. While Industry is full of Engineers, led by non-technical bosses and management. So it would be hard to get appreciation in true means, specially if you are in typical business solutions development industry. Welcome to real world!
Programming Engineer
When it comes to solving daily life business problems with quickest and most reliable methods, and where you try to divide problem in to well known patterns and apply best practices, programming is a much like a discipline of science and engineering. While engineering, you have wonderful knowledge of common type of problems and extensive library of patterns to solve these problems. There are lot of procedures, laws, design patterns, terminologies, to make software development and problem solving much easier. There are always new patterns and libraries, methodologies, and practices in market to keep your skills fertile. With so many case studies and experience, solution confidence level is very good.
No doubt that most of programmers around as Programming Engineers and it is for this type, that for struggle for transition of programming to science has begun. Along with time, industry is improving in this front, so much so that it became common, ones who are strict followers of the defined rules are hailed as best programmers.
Programming Engineers, have to remember, that programming was natively an art. Knowledge, patterns and practices will keep you in good company. But keep craving inner self to discover an artist, to really enjoy programming at its best. Struggle to convert programming as a pure Engineering can never be completely successful, because it is all governed by imaginative, virtual and human defined rules that could be challenged at every level. Programming and designing is very artistic activity in its heart.
Summary
Programming is an Art and Science/Engineering discipline as well. It is a combination of both. It is a struggle in between. Believers of Art without boundaries might need to behave in certain situations where established practices and scientific pillar of programming is much more productive. Hard line followers of pillar of programming science needs not to think rules and best practices as the universal truth, and should think out of the box for an more personalized solution much like free form expression. Any programmer should think of these two modes, and practice switching between them as per need. Strive for balance, and you would improve, no matter which category you are in. And, look out for Programming Artists. As they are the ones, that will bring State of the Art solution of problem.