PostgreSQL 7.4 and Advanced SQL

Jim O'Halloran • January 14, 2004

linuxconfau-2004

The real business part of the Linux.Conf.Au started off with a tutorial from Gavin Sherry (Alcove Systems Engineering) covering "PostgreSQL 7.4 and advanced SQL features". I'd done a little bit of research on PostgreSQL previously, and I'd almost settled on it as a Linux database server for our Windows apps, bt after today's presentation I'm 99% sold.

PostgreSQL grew out of an academic research project started in the 70's. It started life as a relational database, called Ingress. SQL support was later remove as the original authors believed that relational database theory had a limited future and they set out to find what came next. Eventually, it was realized that relational set theory was here to stay, and SQL was reintroduced. It was open sourced in the mid 90's under a BSD licence, and now has an active core development team. Initial releases focused on cleaning up the code and fixing the problems with Postgress, and the product was later renamed PostgreSQL.

PostgreSQL 7.4 is now the current release, and introduced many new features. One new 7.4 feature is the "information schema". Information schema is a standardized way of finding out about the database structure, and SQL features, etc available to the current user. The information schema supported by PostgreSQL is an ANSI standard, and should be supported on other databases as well. PostgreSQL 7.4 also has a new wire protocol, which features better SQL error codes, more status information on the back end, and faster session startup times. Log detail level is now configurable via pgsql.conf, and there is also a handy configuration parameter " log_min_duration_statement" which will log the text of all long running SQL statements for debugging purposes.

Holdable cursors allow cursors to be used outside of transactions. Previously cursors could only be used within transactions for multiuser concurrency reasons. Array handling has also been improved in 7.4 and also conforms to ANSI SQL99 specifications. Arrays can be used to condense a one to many relationship down to a single field in some situations.

Statement level triggers are another new feature and can be used to fire a trigger before or after any DELETE, UPDATE or INSERT statement. Full text indexing has also been updated in this release, but is generally not included in the vanilla install, but the tsearch2 module can be found in the contrib. directory. The full text indexing algorithm strips punctuation, common words, plurals and other pollutants from text before indexing and can rank search results according to relevance.

As records are deleted they may need to hang around for a period of time for multi-user concurrency reasons, so PostgreSQL databases need to be periodically vacuumed to recover free space from the database. UPDATES are handled internally as a DELETE then an INSERT, so they can contribute to database bloat. Running VACUUM with no parameters builds a table of free pages ready for reallocation which can help to alleviate the bloat. Autovacuum is also included with 7.4 (in contrib), and will periodically vacuum the database without operator intervention. However, autovacuum requires that you run row level statistics, which in turn could pose its own performance problems on some systems.

Asynchronous Master-Slave replication was also added to 7.4. Its asynchronous which means updates between master and slave are not done in real time. The replication was designed for high availability requirements, and may also be used to enhance performance. Multiple salve servers are also allowed. Two possible problems are that replication is not part of the main backend code, and it was written in Java, which could create deployment or performance issues in some situations.

There were performance improvements in 7.4 in the use if IN/NOT IN syntax, as well as better handling of joins and an improved regular expression engine. Much of the SSL code was rewritten for much higher performance.

Shared memory is used for caching, and is critical for performance. 7.4 introduced an algorithm to detect a more optimal buffer size on startup (pre7.4 there was a fixed (small) default size which had to be tuned manually). . Its recommended that production servers be tuned by hand though.

IPv6 is supported both for the storage of addresses within he database, and for connections to the database. Also new is support for read only transactions, transactions which can read the dataset but not modify it. Permissions can now be assigned by non-superuser users.

The second half of the session covered some Advanced SQL syntax. Some of the things covered I already knew about from other databases, and I haven't made a note of those, but here's some of the new stuff I learnt.

PostgreSQL supports user defined aggregates ( like sum, count, etc). Aggregates are written in C. User defined data types ( also written in C) can also be used, and allow PostgreSQL to support more complex data types than are supported natively. You can also create your own operators for these new data types. Domains are like the little brother to user defined types, and are piggy backed on top of native data types. Basic SQL constraints are used to create a domain.

Rules can rewrite SQL queries on the fly. One example given was to disallow deletions from a table, and redirecting any deletes into a second table. A view was then used to produce a combined view of both tables with the deleted rows apparently "deleted".

You can also create your own user defined functions in SQL, PL/SQL, C, Java, Ruby, PHP, and pretty much any other open source language.

Inheritance is a pretty neat feature drawn from object oriented programming. In PostgreSQL a table can inherit the structure from a "parent" table, and at the same time data from the child table is stored in the parent table.

There are a few tricks for improving performance of PostgreSQL at the SQl level. EXPLAIN can be used to get PostgreSQL to explain how a query will be executed. ANALYZE can then be used to recalculate table and row statistics. Accurate statistics improve query performance, so ANALYSE should be run periodically to ensure best performance.

SELECT Count(*) statements should be avoided as they require a sequential table scan in order to count the records in the table. On large tables, instead use triggers to update a record count stored in a one line table.

Using the Min/Max functions in SELECT statements has a similar problem to e COUNT() function. The same trigger solution can be used in this situation as well, or you can rewrite the query using a sort ascending or descending and a limit clause.

All in all an excellent presentation, and well worth the time. I'll be checking out PostgreSQL in more detail soon I hope.