Linux.Conf.Au 2004
Day 3 of Linux.Conf.Au got underway with Bdale Garbee’s “where would you like 100,000 users to do today?” keynote. Bdale looked at a couple of large-scale Linux deployments around the world and explained why we shouldn’t be surprised that these things are happening in less developed parts of the world.
He believes that the growing gap between people who can participate in online communities and those who can’t (the “digital divide”) is sparking government interest in Linux solutions from parts of the world like Brazil, Extremadura and so forth.
Extremadura is one of the poorest regions of Spain. They’ve recently rolled out 80,000 systems into their public school system (enough for one for every two students). When they did the maths, they could afford either 80,000 PCs or 80,000 Microsoft licences, but not both, so they looked at Linux as a way of being able to afford more hardware. Extremadura didn’t rush into their Linux rollout: the process took 5 years from start to finish, but this included localising Linux for their local language and building up an internet infrastructure in the region. Andalusia (another region of Spain) is now looking at a similar program.
Debian is a community-based distribution with no commercial entity driving its agenda or release cycle. The Debian Free Software Guidelines and the Social Contract set the values for the community, and as such acceptance of these values becomes a criterion for acceptance into the community. Because of its open, non-commercial nature, Debian has become the preferred distribution for deployments like Extremadura, and in other parts of the world. Debian also runs on a wide variety of hardware platforms from handhelds to mainframes.
Bdale’s slides are available from here.
Simon Horman’s Linux Virtual Server talk was the last talk of the day for me. A Linux Director is a service that sits in front of a number of servers and load balances requests between them. Persistence is maintained so that each client will connect to the same “real” server for a given period of time.
LVS is essentially a fast Layer 4 switch. A single LVS machine on modest hardware can easily saturate a 100 Mbit LAN connection.
Most of the presentation went straight over my head, but I gained enough appreciation for what LVS does and when it might be used to make it worthwhile. It also serves to remind me that as good as I think I am, there are other people who are on a whole nother level! Papers from this talk are available from UltraMonkey.org.
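To make the persistence idea a bit more concrete, here is a toy sketch in Python. It is not how LVS itself is implemented (LVS does this inside the kernel); the class name, the round-robin scheduling and the choice to refresh the timer on every request are all assumptions made for illustration.

```python
import itertools
import time


class Director:
    """Toy load balancer: round-robin scheduling with per-client persistence."""

    def __init__(self, real_servers, persistence_seconds=300, clock=time.monotonic):
        self._next_server = itertools.cycle(real_servers)
        self._persistence = persistence_seconds
        self._clock = clock
        # client address -> (chosen real server, time the entry expires)
        self._table = {}

    def pick(self, client_ip):
        now = self._clock()
        entry = self._table.get(client_ip)
        if entry is not None and entry[1] > now:
            server = entry[0]  # still inside the persistence window
        else:
            server = next(self._next_server)  # new or expired client
        # In this sketch every request refreshes the client's timer.
        self._table[client_ip] = (server, now + self._persistence)
        return server
```

A client that keeps coming back within the window always lands on the same real server; once its entry expires it is scheduled like a new client.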
Con Zymaris gave a talk after lunch on “Corporate Linux Evangelism”. The first stage in any Linux deployment is evangelism. That is, raising awareness of Linux and its strengths (and weaknesses). There are three main groups that you’ll need to pitch to: technical staff, managers, and the financial people.
Technical people need to know what the product does and how it does it, and need articles and the like that will get them up to speed on a product quickly. The managers will want case studies, articles showing strong interest in Linux, and what type of productivity gains they could expect. They need to be reassured that Linux is a “safe” choice. Financial people will be more interested in TCO and ROI, i.e. how much will this system cost me, and how long will it take to pay for itself in savings. Accountants are typically looking for about a 1.5 year ROI on IT investments these days.
Keep an eye on news sources looking for quotes which might be helpful, but studies quoting TCO and ROI figures can be a bit misleading. These analyses are pretty simple, so do your own for each job you’re bidding on.
With the recent SCO vs IBM stuff, open source is now seen as a risk. But in reality it’s no more risky than closed source (Microsoft’s entire liability for XP is to replace a faulty CD, not to fix the software). However companies should inventory their software (both closed and open source) and ensure licence compliance. If you will be modifying any open source software, you need to track that and ensure compliance with licences.
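As a rough illustration of how simple a do-it-yourself payback calculation can be, here is a small Python sketch. The function names and the dollar figures in the usage note are made up for the example; only the 1.5 year target comes from the talk.

```python
def payback_years(upfront_cost, annual_savings):
    """Years until cumulative savings cover the upfront cost (simple payback)."""
    if annual_savings <= 0:
        raise ValueError("annual_savings must be positive for the project to pay back")
    return upfront_cost / annual_savings


def meets_target(upfront_cost, annual_savings, target_years=1.5):
    """True if the project pays for itself within the target period."""
    return payback_years(upfront_cost, annual_savings) <= target_years
```

For example, a hypothetical migration costing 30,000 that saves 20,000 a year pays back in 1.5 years and just meets the target, while one costing 50,000 with the same savings does not.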
To introduce Linux to an organisation start with low key infrastructure services. DNS, Web proxies or similar where a well established open source system exists can be good starting points. Make a business case for replacing it and quantify the benefits in dollar terms. Audit the existing system to ensure you’re completely familiar with the existing setup, and carefully plan a roll over and roll back strategy in case things go wrong.
After the trial roll out, prepare a report to management. Draw attention to your success. If you want to use more Linux, call attention to the successes that you have had. Solicit feedback from IT staff and users, then pick a new system or server and repeat the process.
There are a few key areas in which you can sell desktop Linux. Linux desktops can be locked down, they have a lower TCO, and are less prone to viruses. Use these strengths to advocate desktop Linux where it makes sense. Desktop deployments are trickier because you’ll need to get buy in from each of the individual users whose machines you’re migrating.
The last talk before lunch was Graeme Merrall’s “Free Software: The AOL|7 Story”. AOL|7 is a joint venture between AAPT, AOL and Channel Seven. It’s sort of a hybrid company, part ISP, and part content company.
They started out with proprietary solutions such as Vignette and software from Xoom.com, but had trouble making it work reliably. Eventually they went to open source for reliability, performance and as a way of doing more with less money.
Their platform is a pretty standard mix of Linux, Apache, PHP and other common open source tools.
The Apache mod_ntlm module was also mentioned as a way of getting Apache to authenticate against an NT or Active Directory domain.
Next up was “Could SCO vs IBM happen to you?” presented by Jeremy Malcolm from iLaw. After starting out with a general overview of the SCO vs IBM case, Jeremy then covered the current state of copyright law.
Copyrights protect the expression of ideas only. The same idea can be restated in different language without violating copyrights. Trade Secrets aren’t actually defined in statutes, but a body of case law has evolved surrounding the idea of trade secrets which protect confidential information.
Patents protect inventive ideas, and are only valid in the country in which they were issued. It is possible to violate a patent even if you wrote the code yourself from your own ideas, without copying and without knowing the patent exists.
Both Patents and Trade Secrets take precedence over copyright law.
If you’re working on both closed and open source products you need to take steps to protect yourself. Use separate development teams and keep them separated to try and limit the cross flow of code and ideas.
If you’re a project leader you might want to think about protecting your project by ensuring that each developer makes a declaration of cleanliness of their code. You might want to ensure that the developers indemnify the other project members if their code is found to be in violation of patents or copyright. However you’ll also need to balance that with the possibility that you’ll scare away possible developers with these legalities.
You can also protect yourself by choosing a licence which explicitly deals with patents. For example the Mozilla Public Licence requires contributors to grant free patent licences to all developers and users of the project.
As a project leader you should read up on patents in your area, although the sheer volume and vague wording of some patents can make this difficult to do properly. If you have money you can always hire an attorney to do this for you and sue if they screw it up, but that’s expensive.
Finally, you can avoid some licensing issues by building from scratch rather than plugging into an existing project. The more independent the better.
The day started out with a keynote from Jon “Maddog” Hall, “Programmers are from Mars, Users/Managers/Companies are from Uranus”. This was a brilliantly executed talk which carried a useful message, but at the same time was quite light and humorous. Very well done.
Managers and corporate types like to see road maps describing where products are going in the future. The very well organised open source projects have these, but road maps are typically a product of companies who do all of their design and planning behind closed doors, where all they can share with the customer is the road map. By its very nature, anyone can see the process behind open source, and after tracking a project for a short time will get a very good feel for where the project is going. Managers like a controlled approach. Managers like plans.
A closed source company tends to get away with higher margins on services, etc because they have a virtual monopoly on the expertise wrapped up in the product. In open source anyone can pick apart the product and see how it works and acquire the expertise necessary to work with it in some way (e.g. to support it, extend it or whatever). If one open source company is charging too much, another can spring up and charge less. In this way open source can put a downward pressure on prices for support and services.
Hall’s Law #3047:
“It is easier to take an engineer and teach them business than it is to take a businessperson and teach them engineering.”
Managers are trying to protect their products and markets, which leads to “good ideas” like “customer lock in”, “patents”. It also leads to great ideas like keeping their engineers fully utilised, with minimal training, leave, etc.
Programmers on the other hand are too intelligent. We like to think that we’re just a little more intelligent than everyone else. We want to be recognised, and we want to do things “the right way”, as “elegantly” as possible. In this way open source can be good for the ego.
We write our software for our perceived user. Our ideal user was born in 1984, and taught himself BASIC at age 6. He started on Linux at 12, and had released his own Linux distribution by 14. Believe it or not this guy actually exists, but he’s certainly not the typical user.
Taken directly from Maddog’s slides…
“The real end user is:
Not intelligent
Does not read manuals
Does not want to read manuals
Just want to do their work
Has arthritis
Speaks only one language and it’s not yours.
May be completely illiterate but is not necessarily dumb
May be dumb
Barely understands how a light bulb works
“I was wondering why it was beeping at me for two days”
Has trouble separating the operating system from anything else.
Speaks in terms of the real world “knobs and widgets”.”
Maddog characterises end users as one of two types. The first is corporate users, whose PCs are installed and supported by professional IT staff: internal staff (large companies), visiting consultants (smaller ones), or a local computer shop (very small ones). Linux is probably ready for all except the last of these. The other group is the home users who get their support from their church group, their community, or in absolute emergencies their children. It could take a long time for Linux to be ready for these people.
There are a few characteristics that define a successful product. They must use real world knobs, things the user can relate to directly. Supercalc and Palm Pilots are a couple of good examples here. They must be simple to start off with and progressively disclose their features as the user becomes more experienced, with the hard parts automated wherever possible. Tivo is a classic example of this.
The name of the product doesn’t mean a lot, but it has to be the right name, e.g. Google vs Altavista.
All in all, it really was a great keynote.
After lunch, the next thing on the Linux.Conf.Au agenda for me was Rasmus Lerdorf’s “PHP Tips & Tricks” presentation.
Rasmus started off with an introduction to where PHP fits in the programming landscape: what some people are doing with it, and what it’s really designed for. Essentially Rasmus sees PHP as a templating system, which he describes as “a mechanism to separate logic from layout”. As he says “PHP is a general purpose templating system”. Other templating systems have been built on top of PHP, but by the time they add loops and conditionals, “Any general purpose templating system will eventually become PHP.”
Rasmus briefly demoed a few applications written in PHP, but one that caught my eye especially was Cacti which was a nice web based graphical management and monitoring tool.
He also demoed the usage of the gdchart library and extension to create a line chart in about 8 lines of PHP code. Gdchart is written in C, optimized for performance, with Yahoo! type scalability in mind.
PHP can generate a Macromedia Flash animation. Tools are also available to decompile Macromedia authored Flash files into PHP, which can then be rebuilt with dynamic data if required. That’s pretty cool, and someone is using this sort of thing to create an online RPG type game, which is really neat.
PECL is the PHP Extension Community Library. As PHP has grown more extensions, it has become harder to release, as each extension needs to be brought into a releasable state. PECL aims to solve that by removing many of the extensions from the main distribution and putting them into separate PEAR-installable packages.
When setting up PHP with MySQL, make sure that MySQL allows more connections than Apache. Apache defaults to 150 simultaneous connections while MySQL defaults to 100. Most of the time this will work, but when your PHP site gets SlashDotted you’ll run out of MySQL connections and scripts will fail because MySQL will refuse connections before Apache.
The PHP “magic quotes” feature automatically escapes quotes, etc and so prevents most forms of SQL injection attack. Wish I’d known about that a couple of weeks back when I was working on fixing SQL injections in MyHelpDesk.
For busy sites a reverse proxy like Squid can be used to boost performance dramatically. You can also use the SquidGuard redirector to redirect different domain names to different Apache instances or different machines altogether.
$PATH_INFO can be useful for creating friendly URLs. Using an Apache trick you can force a PHP script to be executed and return some results. You can also replace your 404 error page using an Apache configuration option and use a PHP script to redirect to different locations. Of course if you really want your 404 page to 404, use “Header(‘HTTP/1.0 404 Not Found’);”. Rasmus also demonstrated a really neat concept for using the 404 page to generate and cache dynamic image files.
All this talk of using the 404 page to do useful work prompted Rasmus to ask the question “Why should you decide where the information on your site is located, why not leave it to your users?”. In other words, why not use the 404 page to try and conjure up some useful content (e.g. a search or something) for whatever URL the user types in. Interesting food for thought.
The “auto_prepend_file” configuration option allows you to specify a file which is automatically prepended to all PHP files. This can be handy for including common code without having to do so explicitly.
There are several options available (safe mode, open basedir, etc) for ISPs needing to isolate different PHP users from each other and their host systems, but none are really 100% effective. When coding scripts, watch out for uninitialised variables, and never ever trust user data. Be paranoid with your validation of anything supplied from the client browser.
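The generate-and-cache idea can be sketched roughly as follows. This is a Python illustration of the concept rather than Rasmus's PHP code, and the `serve` function and the `generate` callback are hypothetical names invented for the example.

```python
import os


def serve(doc_root, url_path, generate):
    """Return (status, body) for url_path.

    `generate` is a hypothetical callback that builds the content for a
    missing URL, or returns None if it has nothing to offer.
    """
    target = os.path.join(doc_root, url_path.lstrip("/"))
    if os.path.isfile(target):
        # Normal case: the file already exists, so the web server would
        # deliver it directly without running any script.
        with open(target, "rb") as f:
            return 200, f.read()
    # The "404 handler": try to build the missing file.
    body = generate(url_path)
    if body is None:
        return 404, b"Not Found"
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with open(target, "wb") as f:
        f.write(body)  # cached on disk for the next request
    return 200, body
```

The expensive generation step runs only on the first request for a given URL; after that the file simply exists. A real handler would also need to validate `url_path` before writing anything to disk.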
PHP’s realpath() function will properly resolve a file name, figuring out any “/../”s which might be in use. Then prefix the resolved path with the Document Root before opening any files and you’ll pretty much guarantee that nothing can be opened outside of the document root.
I’ve seen it suggested that people use .php extensions for their includes instead of .inc for security reasons. However, it seems that .inc may be a better solution as long as Apache is configured not to serve up that file type at all.
If you allow files to be uploaded, be especially paranoid if they’re to reside inside the document root. Validate that you’ve received the file type you expect, including opening up the file to ensure that its contents really do match up with the extension.
Some of the major changes in PHP5 relate to object oriented features, which I haven’t really played with that much in PHP4, so I haven’t really noted what’s new. There is also a try/catch error handling mechanism, which should simplify the code in error prone areas like connecting to a database. DOMXML has been improved, with a general cleanup and bug fixes.
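One way to read this advice, sketched in Python rather than PHP: resolve the requested name underneath the document root, then confirm that the result is still inside the root before opening it. The containment check is a common variant that I've swapped in here, not necessarily what was shown in the talk, and `resolve_inside` is a made-up name.

```python
import os


def resolve_inside(doc_root, requested):
    """Return the real path of `requested` under doc_root, or None if it escapes."""
    root = os.path.realpath(doc_root)
    candidate = os.path.realpath(os.path.join(root, requested.lstrip("/")))
    # commonpath compares whole path components, so /var/www-evil does not
    # count as being inside /var/www.
    if os.path.commonpath([root, candidate]) != root:
        return None
    return candidate
```

Something like "img/../index.php" resolves to a file inside the root, while "../../etc/passwd" comes back as None.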
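A minimal sketch of the "does the content match the extension" check, in Python for illustration. The signature table covers only a handful of image formats and the function name is an assumption. Checking the leading bytes is only a first filter, since a file can start with a valid image header and still carry something nasty after it.

```python
import os

# Leading "magic" bytes for a few image formats (a small illustrative subset).
SIGNATURES = {
    ".png": (b"\x89PNG\r\n\x1a\n",),
    ".gif": (b"GIF87a", b"GIF89a"),
    ".jpg": (b"\xff\xd8\xff",),
    ".jpeg": (b"\xff\xd8\xff",),
}


def extension_matches_content(filename, data):
    """True only if the extension is on the allow-list and the bytes agree with it."""
    ext = os.path.splitext(filename)[1].lower()
    signatures = SIGNATURES.get(ext)
    if signatures is None:
        return False  # unknown type: reject rather than guess
    return data.startswith(signatures)
```

An upload named "photo.png" that actually begins with "<?php" is rejected, as is anything whose extension is not in the table at all.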
PHP5 also introduces a new simple XML parser, which should make working with XML a lot easier. However the simple XML parser does load the entire file into memory which might make it unsuitable for processing large files.
PHP5 also bundles SQLite, an embedded SQL database engine which stores each database in a single ordinary file. Pretty neat looking stuff too.
Rasmus also shared some hints on optimising PHP code. Essentially you should try to keep the includes to a minimum, use OOP techniques only where appropriate, and the same for layers, abstractions, etc. Opcode caches can dramatically improve performance. Poorly written regular expressions can also slow things down. Finally if you have plenty of spare CPU, and limited bandwidth, try turning on output compression.
There are a few useful techniques for benchmarking PHP applications. First of all, have a look at the average size of the pages you’re generating. If they’re fairly large you may need to look at kernel buffers. Also run http_load from acme.com for load testing. While http_load is running, use vmstat to check for idle CPU time. If the CPU is idle, then it suggests the system is IO bound somewhere, and you need to improve throughput somehow. A fully utilized CPU suggests that some benefit can be gained by tuning the PHP code itself.
If we need to tune the PHP, then check the include_path and shorten it where possible. Turn off open_basedir if you don’t need it. Also remove unused entries from the variables_order setting in php.ini to prevent PHP from populating unused $_[] arrays. Also look into an opcode cache.
The XDebug extension can be used to get stack trace data for profiling. XDebug also has a modified error handler which gives a lot more debug information than the standard error handler. XDebug.org is the home for XDebug.
All in all it was a brilliant presentation, which could probably be renamed “Things every PHP programmer should know but probably doesn’t”. Rasmus’s slides are available from Rasmus’ site.
I do try to keep these things reasonably short, but they seem to be getting longer every day..
The real business part of the Linux.Conf.Au started off with a tutorial from Gavin Sherry (Alcove Systems Engineering) covering “PostgreSQL 7.4 and advanced SQL features”. I’d done a little bit of research on PostgreSQL previously, and I’d almost settled on it as a Linux database server for our Windows apps, but after today’s presentation I’m 99% sold.
PostgreSQL grew out of an academic research project started in the 70’s. It started life as a relational database called Ingres. SQL support was later removed as the original authors believed that relational database theory had a limited future and they set out to find what came next. Eventually, it was realized that relational set theory was here to stay, and SQL was reintroduced. It was open sourced in the mid 90’s under a BSD licence, and now has an active core development team. Initial releases focused on cleaning up the code and fixing the problems with Postgres, and the product was later renamed PostgreSQL.
PostgreSQL 7.4 is now the current release, and introduced many new features. One new 7.4 feature is the “information schema”. Information schema is a standardized way of finding out about the database structure, and SQL features, etc available to the current user. The information schema supported by PostgreSQL is an ANSI standard, and should be supported on other databases as well.
PostgreSQL 7.4 also has a new wire protocol, which features better SQL error codes, more status information on the back end, and faster session startup times. Log detail level is now configurable via postgresql.conf, and there is also a handy configuration parameter “log_min_duration_statement” which will log the text of all long running SQL statements for debugging purposes.
Holdable cursors allow cursors to be used outside of transactions. Previously cursors could only be used within transactions for multiuser concurrency reasons. Array handling has also been improved in 7.4 and also conforms to ANSI SQL99 specifications. Arrays can be used to condense a one to many relationship down to a single field in some situations.
Statement level triggers are another new feature and can be used to fire a trigger before or after any DELETE, UPDATE or INSERT statement. Full text indexing has also been updated in this release. It is generally not included in the vanilla install, but the tsearch2 module can be found in the contrib directory. The full text indexing algorithm strips punctuation, common words, plurals and other pollutants from text before indexing and can rank search results according to relevance.
As records are deleted they may need to hang around for a period of time for multi-user concurrency reasons, so PostgreSQL databases need to be periodically vacuumed to recover free space from the database. UPDATES are handled internally as a DELETE then an INSERT, so they can contribute to database bloat. Running VACUUM with no parameters builds a table of free pages ready for reallocation which can help to alleviate the bloat. Autovacuum is also included with 7.4 (in contrib), and will periodically vacuum the database without operator intervention. However, autovacuum requires that you run row level statistics, which in turn could pose its own performance problems on some systems.
Asynchronous Master-Slave replication was also added to 7.4. It’s asynchronous, which means updates between master and slave are not done in real time. The replication was designed for high availability requirements, and may also be used to enhance performance. Multiple slave servers are also allowed. Two possible problems are that replication is not part of the main backend code, and it was written in Java, which could create deployment or performance issues in some situations.
There were performance improvements in 7.4 in the use of IN/NOT IN syntax, as well as better handling of joins and an improved regular expression engine. Much of the SSL code was rewritten for much higher performance.
Shared memory is used for caching, and is critical for performance. 7.4 introduced an algorithm to detect a more optimal buffer size on startup (pre-7.4 there was a fixed, small default size which had to be tuned manually). It’s recommended that production servers be tuned by hand though.
IPv6 is supported both for the storage of addresses within the database, and for connections to the database. Also new is support for read only transactions, transactions which can read the dataset but not modify it. Permissions can now be assigned by non-superuser users.
The second half of the session covered some Advanced SQL syntax. Some of the things covered I already knew about from other databases, and I haven’t made a note of those, but here’s some of the new stuff I learnt.
PostgreSQL supports user defined aggregates (like sum, count, etc). Aggregates are written in C. User defined data types (also written in C) can also be used, and allow PostgreSQL to support more complex data types than are supported natively. You can also create your own operators for these new data types. Domains are like the little brother to user defined types, and are piggy backed on top of native data types. Basic SQL constraints are used to create a domain.
Rules can rewrite SQL queries on the fly. One example given was to disallow deletions from a table, and redirecting any deletes into a second table. A view was then used to produce a combined view of both tables with the deleted rows apparently “deleted”.
You can also create your own user defined functions in SQL, PL/pgSQL, C, Java, Ruby, PHP, and pretty much any other open source language.
Inheritance is a pretty neat feature drawn from object oriented programming. In PostgreSQL a table can inherit the structure from a “parent” table, and at the same time rows stored in the child table also show up when the parent table is queried.
There are a few tricks for improving performance of PostgreSQL at the SQL level. EXPLAIN can be used to get PostgreSQL to explain how a query will be executed. ANALYZE can then be used to recalculate table and row statistics. Accurate statistics improve query performance, so ANALYZE should be run periodically to ensure best performance.
SELECT Count(*) statements should be avoided as they require a sequential table scan in order to count the records in the table. On large tables, instead use triggers to update a record count stored in a one line table.
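Here is a small sketch of the trigger-maintained counter. To keep it runnable without a database server it uses SQLite through Python's standard sqlite3 module rather than PostgreSQL, so the trigger syntax is SQLite's (PostgreSQL would need a trigger function instead), and the table names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tickets (id INTEGER PRIMARY KEY, subject TEXT);

    -- One-row table holding the running count.
    CREATE TABLE tickets_count (n INTEGER NOT NULL);
    INSERT INTO tickets_count VALUES (0);

    -- SQLite triggers fire once per affected row.
    CREATE TRIGGER tickets_ins AFTER INSERT ON tickets
    BEGIN
        UPDATE tickets_count SET n = n + 1;
    END;

    CREATE TRIGGER tickets_del AFTER DELETE ON tickets
    BEGIN
        UPDATE tickets_count SET n = n - 1;
    END;
""")

conn.executemany("INSERT INTO tickets (subject) VALUES (?)",
                 [("printer",), ("email",), ("vpn",)])
conn.execute("DELETE FROM tickets WHERE subject = 'email'")

# Reading the counter touches one row instead of scanning the table.
fast_count = conn.execute("SELECT n FROM tickets_count").fetchone()[0]
slow_count = conn.execute("SELECT COUNT(*) FROM tickets").fetchone()[0]
```

Both queries return the same number; the difference is that the first one never has to look at the tickets table.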
Using the Min/Max functions in SELECT statements has a similar problem to the COUNT() function. The same trigger solution can be used in this situation as well, or you can rewrite the query using a sort ascending or descending and a limit clause.
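The sort-and-limit rewrite looks like this. Again this is a sketch run against SQLite via Python's sqlite3 module instead of PostgreSQL, with a made-up table, purely to show that the two forms give the same answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (value INTEGER)")
# An index on the column is what lets the sorted form avoid a full scan.
conn.execute("CREATE INDEX readings_value_idx ON readings (value)")
conn.executemany("INSERT INTO readings VALUES (?)", [(7,), (42,), (3,), (19,)])

via_max = conn.execute("SELECT MAX(value) FROM readings").fetchone()[0]
via_sort_desc = conn.execute(
    "SELECT value FROM readings ORDER BY value DESC LIMIT 1").fetchone()[0]
via_sort_asc = conn.execute(
    "SELECT value FROM readings ORDER BY value ASC LIMIT 1").fetchone()[0]
```

Sorting descending and taking one row gives the maximum, sorting ascending gives the minimum. If the column can contain NULLs, add a WHERE value IS NOT NULL clause, because MIN and MAX skip NULLs whereas a sort does not.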
All in all an excellent presentation, and well worth the time. I’ll be checking out PostgreSQL in more detail soon I hope.
The first “real” day of Linux.Conf.Au started out with some opening remarks from Linux.Conf.Au organiser (one of many) Michael Davies. The winner of the national Sun Regional Delegate prize was announced (he won a T-Shirt from last year’s LCA autographed by most of the kernel contributors). The door prize (a 3 foot high stuffed Tux) was won by someone from WA who now has the job of getting it home. I’m sure they’ll have fun getting that through airport security!
Today was a more hands on introduction to IPv6. John Barlow the Advanced Communications Services Coordinator for AARnet gave all of today’s presentations and started out the day with “IPv6 101 Hands On”. This in some respects was a rehash of some things covered yesterday, but it did cover some more technical type issues.
One interesting comment made was that NAT has a happy side effect of enhancing security (which I know and exploit regularly). However, John tells us that NAT can be hacked to get at machines behind the NAT gateway, which is something I really need to investigate further.
The basic structure of the IPv6 address space is as follows… The IPv6 equivalent of 127.0.0.1 is ::1. Most globally routable IPv6 addresses start with 2001: and current best practice dictates that these are the only addresses to be routed across the backbone. IPv6 addresses starting with fe80: indicate a “link local” address. Link local addresses are used for bootstrapping, autoconfiguration, etc and should not be propagated beyond a router.
fec0::/10 are site local addresses (similar to the private IP ranges in IPv4) but the use of these has been or will be deprecated shortly. ::FFFF:a.b.c.d are mapped IPv4 addresses, and don’t actually exist, but might show up in logs. ::a.b.c.d addresses are used to arrange tunnels. There are more prefixes allocated, but many are not currently used, so the above are the commonly seen IPv6 addresses.
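For anyone wanting to poke at these prefixes, Python's standard ipaddress module knows about most of them. The `describe` helper and its labels are my own; the 2001::/16 test is just a literal reading of "starts with 2001:".

```python
import ipaddress


def describe(text):
    """Label an IPv6 address using the categories listed above."""
    addr = ipaddress.IPv6Address(text)
    if addr.is_loopback:
        return "loopback"
    if addr.ipv4_mapped is not None:
        return "IPv4-mapped (" + str(addr.ipv4_mapped) + ")"
    if addr.is_link_local:
        return "link local"
    if addr.is_site_local:
        return "site local"
    if addr in ipaddress.IPv6Network("2001::/16"):
        return "2001: prefix"
    return "other"
```

So describe("fe80::1") comes back as link local, describe("fec0::1") as site local, and describe("::ffff:192.0.2.1") as an IPv4-mapped address wrapping 192.0.2.1.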
Unlike DHCP, IPv6 autoconfiguration does not assign name servers, time servers, etc. Other solutions such as anycast addressing, etc can be used to work around this, but no mechanism has been standardized at this stage.
IPv6 uses a simpler header than IPv4: many unused or less commonly used fields are dropped or moved into extension headers. An IPv6 header is a fixed 40 bytes, with important fields aligned on 64 bit boundaries for faster processing on 64 bit architectures. Extension headers follow the main header and are each a multiple of 64 bits in length. One of the design goals of IPv6 was to ensure that routers do minimal processing of packets. IPv6 packets can’t be fragmented in transit by routers, which makes ICMP-based path MTU discovery mandatory.
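The fixed header is small enough to pack by hand. This Python sketch follows the field layout in the IPv6 specification (RFC 2460); the function name, the default hop limit and the example addresses are my own choices.

```python
import ipaddress
import struct


def ipv6_header(src, dst, payload_length, next_header, hop_limit=64,
                traffic_class=0, flow_label=0):
    """Pack the fixed IPv6 header: 8 bytes of fields plus two 16-byte addresses."""
    # First 32 bits: version (4 bits) | traffic class (8 bits) | flow label (20 bits)
    first_word = (6 << 28) | (traffic_class << 20) | flow_label
    return struct.pack(
        "!IHBB16s16s",
        first_word,
        payload_length,
        next_header,
        hop_limit,
        ipaddress.IPv6Address(src).packed,
        ipaddress.IPv6Address(dst).packed,
    )


# Next header 59 means "no next header", i.e. nothing follows this header.
header = ipv6_header("2001:db8::1", "2001:db8::2", payload_length=0, next_header=59)
```

The packed result is always 40 bytes regardless of the values passed in, which is the fixed-size property that keeps router processing simple.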
A workshop session followed, during which we were walked through IPv6 setup on our own machines. The demo was done on Linux. However the WiFi coverage where I was sitting in the Lecture Theatre was pretty spotty (more off than on), so I wasn’t able to get much going, gave up and went to lunch.
After lunch John finished off the workshop session by showing an IPv6 router setup on Linux. There is a daemon called radvd (Router Advertisement Daemon) which handles the router multicast packets required for autoconfiguration. Quagga (used to be Zebra) handles the BGP, etc required for serious internet routing.
Next up was a session on Provider Independent Addressing (PIA). From the sounds of things this is now pretty much dead in the water, and in fact current best practices will prevent PIA packets from being routed.
Essentially PIA allocates a /48 network to every 10 x 10 metre square of the earth’s surface (including oceans). The network prefix is mathematically derived from the latitude and longitude coordinates of your current location. However there are no algorithms for resolving conflicts (e.g. multiple floors of a building, or an aircraft flying overhead), and in fact the suggested remedy was to “pick some random square of ocean and use that” in the event of conflicts.
We then covered IPv6 Global Routing. Essentially obtaining an IPv6 address block is simple: you just go to APNIC and commit to handing out 200 /48s to customers in the next two years. Most of us will simply obtain our allocations from our ISP, but this means that you can’t take your allocation with you when you change ISPs. The old days of getting your own class C and having it routed via any ISP are over with IPv6. To control the growth of the routing tables, IPv6 address space is being allocated only to ISPs, hence the requirement to allocate 200 networks above.
Finally we had an “Issues with IPv6” presentation which covered some of the issues in an IPv6 migration.
Autoconfiguration will mean that Dynamic DNS will become almost a necessity for most organizations. However, most common DNS servers (e.g. BIND) will handle IPv6 addresses.
Application support is pretty good, but traditional network file sharing protocols such as SMB and NFS are yet to be upgraded to IPv6.
The Linux IPv6 HOWTO is updated regularly and is an excellent source of information related to IPv6 and migration issues.
IPv6 should run over pretty much any Layer 2 hardware (e.g. hubs, network cards, wireless gear) without problems. Most Layer 3 devices (such as routers) should only need a software upgrade in order to deal with IPv6, but some switches and routers use hardware acceleration to process packets and may therefore need to be replaced before they achieve full throughput/functionality.
Because all IPv6 addresses are provider assigned, multihoming for reliability becomes awkward and usually becomes something that needs to be negotiated with the ISPs concerned. In theory it should work, but in practice it’s difficult.
There is also a thing called NAT-PT which is an IPv6 NAT into IPv4 and back again. As per traditional NAT each protocol needs to be handled individually. NAT-PT is achieved via a DNS server hack and some additional bindings on the IPv6 interface. It’s pretty nasty, but is a reasonable migration tool.
All slides from today’s (and some of yesterday’s) presentations can be found on AARNet’s site. Overall I’ve learnt a lot from the IPv6 MiniConf, and it’s been a pretty worthwhile experience. The main conference starts tomorrow, can’t wait.