Archive for the ‘Database Continuous Integration’ Category

Database Deployment Patterns (Introduction)

Database Continuous Integration | Posted by Petr Kozelek
Oct 31 2011

In the previous post I described general approaches how to define database deployment steps. Now I focus more deeply on analysis of individual steps and on identification of deployment patterns – I replace here a more familiar term database refactorings since I want to focus on database deployment instead of database redesign. However, there can be slight overlap between the two.

Deployment patterns are of two basic types, regarding database changes that they cover:

  • Schema deployment patterns cover any modifications in database schema. They can be divided further into two disjunctive groups:
    • Data-independent patterns are applied without requirements to modify existing data. In other words, when I need to apply data-independent pattern, I do not require to affect data being hold in existing tables. They are always deterministic (see below)
      Examples: add/drop views, functions or stored procedures, drop foreign key or column constraint
    • Data-dependent patterns require existing data to be modified. The reason for modifications can be, for example, improving data quality before the pattern can be applied.
      Examples: add foreign key, add column constraint
  • Data deployment patterns modify purely the data and they keep the schema unchanged. Usually they follow schema deployment patterns in order to insert data into newly introduced tables/columns. They are deterministic or dependent (see below).

Deployment patterns are of different clarity, depending on how straightforward it is to apply the pattern.

  • Deterministic patterns include deployment of changes for which there is only one way how to deploy them. Sometimes, the particular pattern can be applied repeatably without negative impact on a database schema and data itself. E.g. deployment of Modify view pattern is pretty straightforward – just execute ALTER VIEW... and you are done. When you execute it twice, it does not cause a problem. Well, modifying the view might break a dependent application, of course, but regarding the database layer, there is neither a risk nor uncertainty in applying it.
  • Ambiguous patterns are the ones for which it is not clear how to apply them. For instance, Split column pattern includes a recipe defining the way how to split one column into the two (imagine splitting FullName into FirstName, LastName). How should it be done? Split by spaces? What if FullName includes academic titles? Get rid of them? Make them a part of the FirstName? These questions have to be answered by a developer who knows data quality. Moreover, ambiguous patterns usually depend on existing data and they usually are not repeatable.

Ambiguities are the biggest obstacle for adopting automatic database deployment. If you succeed with reducing the number of ambiguous deployment patterns you will feel less pain in introducing continuous database deployment. There are less ambiguities when deployment patterns are small, e.g. introduce column pattern. On the contrary, when database deployment requires complex patterns (e.g. Introduce Associative Table) to be applied then the number of ambiguities increases.

The main question is: How to reduce the number of ambiguities?

  • Deploy database changes often. Then there is smaller chance for introducing complexities between the original and the new database version.
  • Use the right tools. Although the existing tools cannot solve all problems connected with database deployment they can provide significant help. They should detect differences between database versions. Consequently, they should assist a developer in classifying database deployment patterns. Of course, the tools should automatically resolve differences for which deterministic patterns can be applied so that the developer can focus on ambiguous problems.

Database Deployment Approaches

Database Continuous Integration | Posted by Petr Kozelek
Oct 21 2011

There are two main categories of database deployment tools. The first one supports deployment of software that follows Application-Driven Database Design (ADD). It means that a database is a persistence layer for a particular application. It is the application itself, its features and the need them to be persisted that drives database design. A typical reprezentant here is web application development.

On the other hand, applications that are oriented strictly on database development (and deployment) follow (as  I call such a development approach) Independent Database Design (IDD). Thus, the major determinant for introducing database schema changes is the need for extending or modifying a nature of the data. This category typically includes data warehouses and high load/performance critical database processes.

It should be clear that IDD is simpler than ADD because a database developer cares just about development and deployment of one layer. But when an application developer takes care of two software layers and their mutual interconnection then he has more complex responsibility. It is natural that deployment of two dependent layers is more complex task than deployment of only one. The situation becomes even more complex when responsibilities for application layer development and database layer development are spread across different teams because of communication issues.

First I focus on IDD because it seems it is simpler than ADD, as if ADD was just extension to IDD. Later in this article I will explain that IID can be significantly different from ADD.

Independent Database Design

If a developer is responsible for development of just a database layer then his task to create deployment strategy is quite straightforward. Once the development phase within an iteration is over he is expected to define a deployment strategy for new version of database schema. The strategy usually is defined in a database deployment script. It includes DDL commands for inserting/modifying/deleting columns in tables and etc. The deployment script often defines also DML commands like inserting initial data to newly created tables. The two general questions come up when thinking about how to define database deployment process:

  1. How does the developer identifies what should be deployed into target environment?
  2. How does the developer define the deployment process of the database?

The answer to the first question is Database Comparison Tool. There is a number of them in the market. They can compare both database schema and data themselves. When the developer needs to identify what should be deployed he usually works in the following scenario:

  1. Compare the target database schema with current development database schema.
  2. Identify schema differences
  3. Identify which data should be deployed into target database. Database data comparison can help here.

Identifying schema and data differences is quite simple thanks to the mature database comparison tools. The more difficult part come up with the second question. How to define a deployment process? Regardless of whether or not the developer uses some database deployment software the final result is always the same – a deployment SQL script. So, in order to answer the second question related to defining deployment process, the developer has to do the following steps:

  1. Create a database schema deployment script. It usually contains only DDL commands.
  2. Create database data modification script. It usually contains only DML commands.

Of course, the tools could help here, too. However, the whole process of transforming the identified database differences into the deployment script can be very complex. There can be mutual dependencies among database objects – both schema and data dependencies.

Consider the following example: In a target database there is a table Person with columns Id, FullName. In a development database that I want to deploy to target there is a table Person with columns Id, FirstName, LastName. How should I deploy the development database into target database? Simply with this deployment script?

ALTER TABLE Person ADD COLUMN FirstName VARCHAR(50);
ALTER TABLE Person ADD COLUMN LastName VARCHAR(50);
ALTER TABLE Person DROP COLUMN FullName; -- data gets lost!

Opps! I lost all my data related to person names! Let us try better approach:

ALTER TABLE Person ADD COLUMN FirstName VARCHAR(50);
ALTER TABLE Person ADD COLUMN LastName VARCHAR(50);
UPDATE Person SET FirstName=ExtractFirstName(FullName), ExtractLastName(FullName);
ALTER TABLE Person DROP COLUMN FullName;

Here functions ExtractFirstNameExtractLastName represent some (potentially complex) transformations defining the process how to retrieve particular parts of the name from its full represennation.

The situation gets more complex if new columns have constraints. For example, FirstName and LastName should not be null. It such a case you are not even able to add a column until you do not specify a default values for the columns being added.

The examples above demonstrates that there is not always a straightforward way how to deploy a new version of database to target environment. If there was then it would be relatively easy to automate creation od a deployment script.

Although nowadays tools can help a developer to pre-generate a deployment script, the developer still is required to manually double-check and fix it if needed. All in all, the whole process of defining a deployment script requires rather a lot of manual work.

Application-Driven Database Design

It might seem that ADD extends IDD approach however it is true only to some extent. In case a developer is creating an application which uses a database only as persistent storage then the database design is driven by application features. If the application stores a property FullName of an entity Person then a table Person with a column FullName has to be modeled in the database.

Later on, when the need for two independent properties FirstName, LastName arises then it has to be reflected in the database as well. Deployment of the application layer is simple since it is stateless – I can simply get rid of the old version of the application that supports only FullName property and replace it completely with new version of the entire application (I simplified it here since I do not count with introducing breaking changes in case other applications depend on the application being deployed). Definition of a deployment script can be simple SQL as in IDD approach.

However, unless the database design is strictly driven by the application then it is possible (at least to some extent) to define a recipe for database deployment directly in the application layer in the language the application layer is written in. Here is an example for migration in Ruby:

class AddFirstNameAndLastNameToPersons < ActiveRecord::Migration
  def change
    add_column :FirstName, :LastName, :string
  end
end

Of course it is just a wrapper that generates good old SQL at deployment time. But with this approach of defining migrations with simple DSL you get greater level of control over deployment process because you can use full power of object-oriented language. You can programmatically combine migrations together, you can programmatically enable/disable them, you can run pre-tests validating that particular migration is even possible to be run, etc.

One of the most popular deployment tools of this kind is Liquibase. Written in Java, it uses special DSL for migrations recipes:

<changeSet id="1" author="bob">
    <addColumn tableName="Person">
        <column name="FirstName" type="varchar(50)"/>
    </addColumn>
    <addColumn tableName="Person">
        <column name="LastName" type="varchar(50)"/>
    </addColumn>
    <sql>UPDATE Person SET FirstName=ExtractFirstName(FullName), ExtractLastName(FullName)</sql>
    <dropColumn tableName="Person" columnName="FullName"/>
    <rollback>
        <addColumn tableName="Person">
            <column name="FullName" type="varchar(100)"/>
        </addColumn>
    <sql>UPDATE Person SET FullName=CreateFullName(FirstName, LastName)</sql>
    <dropColumn tableName="Person" columnName="LastName"/>
    <dropColumn tableName="Person" columnName="FirstName"/>
    </rollback>
</changeSet>

During deployment, the tool is responsible for translating migrations written in XML into SQL of target database server. Liquibase comes with an interesting feature of a rollback deployment in case a business decision to go back to previous version arises. In some cases the rollback does not have to be defined (add column, remove column) since a recipe for rollback can be inherited from the migration definition. But for a more advanced migration like in the example above the rollback recipe has to be defined manually.

Conclusion

Regardless of what approach the development team chooses for database migrations it always results in creating of a database change script. In simple case, the script is first generated by a database comparison tool, then a developer manuall reviews in and corrects conflicting changes. RoundhousE or Liquibase represent such a family of tools.

Advanced tools use custom DSL for defining database migrations. They can be written in application original language (Ruby migrations, FluentMigrator).

There are also mature commercial products that include both database comparison and deployment features in one tool. SQL Data Compare and NeXtep Open Designer are promising tools for database deployment automation.

Database Continuous Integration with DbTracer

Database Continuous Integration | Posted by Petr Kozelek
Jan 12 2010


In my company we use principles of continuous integration in our C#.Net+MS SQL development. We use Subversion for source control management and CruiseControl.Net as integration server. The problem came when we needed to choose a tool for continuous integration of database changes since we did not find any open-source alternative.

Read more »