FLAIM is a FLexible Adaptable Information Manager (database engine) for traditional as well as volatile and complex information. Even though FLAIM provides many traditional database features (e.g., transactions, recovery, reliability, scalability), it was conceived with a broader view toward the greater flexibility and adaptability that is offered by an XML data model. FLAIM is not new; various products have used FLAIM for over 15 years. For instance, Novell’s scalable, reliable directory and collaboration products, eDirectory and GroupWise, both use FLAIM as the data store, with user licenses totaling well into the hundreds of millions.
XFLAIM is an embeddable cross-platform XML database engine. It is written in C++ and provides a DOM-like interface for creating, modifying, deleting, indexing, and searching on XML documents in a database. It extends traditional XML with the addition of non-text data types (number, binary). XML Documents may be of any size. The DOM interface virtualizes documents, bringing nodes into memory on demand. There are all of the traditional database features: transactions, indexing, queries (using XPath), backup, restore, caching, etc. It is designed to be very reliable and scalable. An XFLAIM database is, itself, portable to multiple platforms.
FLAIM and XFLAIM have been ported to a wide variety of 32-bit and 64-bit platforms, including various flavors of Linux (SUSE, Red Hat, Ubuntu, Debian, etc.), various flavors of Unix (Solaris, AIX, HP-UX, Mac OS X, etc.), Windows (2000, XP), and NetWare 6. There is 64-bit support for all of the Linux, Unix, and Unix-like (Mac OS X) platforms, as well as Windows versions that support 64-bit application development.
This is where you would expect us to ramble on about our plans to conquer the world of database technology and turn FLAIM into a multi-billion dollar product. While that would be nice, our goals are somewhat more humble. The most notable aspects of FLAIM are that it is scalable, reliable, and embeddable. Everything that we plan to do with the technology will be in support of improving its scalability, reliability, and embeddability. Keeping focused on these aspects will hopefully lead to the important and desirable outcome of developing a thriving user and developer community.
Anyone who is interested in improving the core technology, testing, writing utilities, documenting features and APIs, and supporting the user community by answering questions. We also encourage the use of FLAIM in open source projects that have a need for a database engine. We have found that each new project that uses FLAIM results in improvements to the technology by uncovering limitations or omissions.
FLAIM is an open source project released under the terms of the GPL by Novell. The project source code resides in a public Subversion repository on the Novell Forge site. Various other services (such as this Wiki) are also hosted on the Forge project site. There are two full-time Novell engineers that staff the project, acting as moderators and maintainers. All code changes are reviewed and approved by these engineers prior to being accepted into the FLAIM source tree.
In the past, FLAIM has always been tightly coupled to a Novell product release schedule. Now that it is an open source project, we are still trying to determine an appropriate release schedule. Stay tuned.
We have recently documented the full FLAIM API using Doxygen. This allows us to easily keep the documentation up-to-date when APIs are added, removed, or modified. At this time, there isn't a lot of conceptual documentation. We do provide a small sample application in the FLAIM project in the sample subdirectory. Click here to view the FLAIM documentation.
XFLAIM is very well documented. We haven't migrated it over to Doxygen yet, so some of the API reference is outdated. Click here to view the XFLAIM documentation.
The goal of concurrency control is to ensure that operations being executed at the same time by different applications do not interleave in such a way as to compromise database integrity. Because transactions are defined as the unit of work that transforms a database from one consistent state to another, it is necessary to address concurrency issues in the context of transaction processing.
Individual transactions that run in isolation should always leave the database in a consistent state. In practice, it is usually desirable to allow many transactions to run concurrently. However, if the various operations of the different transactions were allowed to interleave indiscriminately, serious errors may result that could leave the database in an inconsistent state. The fundamental concern of database concurrency control is to ensure that concurrent execution of transactions does not result in a loss of database consistency. This means that the effect of interleaving the operations of multiple concurrent transactions should be the same as running the transactions serially.
Stated simply, FLAIM supports an MVCC (multi-versioning concurrency control) model with unlimited readers and a single concurrent writer. In many database systems, readers must lock records in order to prevent a writer from modifying the data once it has been read. This causes readers to sometimes block writers and vice versa. In FLAIM, use of MVCC means that readers never block writers and writers never block readers. It has been found that if applications using FLAIM are architected to take advantage of this concurrency model (i.e., update transactions are used only when it is known for sure that data will be changed), the database can service thousands of requests a second on commodity hardware. This has been demonstrated again and again in customer deployments of eDirectory, as well as in SuperLab tests conducted by Novell. To summarize, FLAIM supports the following:
In short, the answer is "no." Even though FLAIM is an embedded database engine, it is architected to many of the same principles as stand-alone DBMS servers. This was done so that optimal and reliable caching and file-system layers could be implemented. As such, FLAIM opens a database for exclusive use by its host process. Since the consumers of the FLAIM technology have traditionally been servers (eDirectory, iFolder, etc.), providing a way for multiple processes to access the same database has not been a priority. Within a process, however, an unlimited number of threads can access the same database.
FLAIM provides two types of transactions:
There are two types of transaction failures. The first type of failure occurs when the application executing the transaction encounters an error that makes it impossible to continue the transaction. Upon detecting the error, the application can request that FLAIM abort the transaction, which will cause FLAIM to undo (or rollback) all operations that have been performed by the transaction.
The other type of transaction failure occurs when the application terminates before committing or aborting the transaction, thus leaving the effects of a partially completed transaction in the database. Such transactions are sometimes called dead transactions because the application that created the transaction has terminated without specifying a final disposition for the transaction. Dead transactions may be the result of external events over which the application has no control (such as a power failure), or they may be the result of faulty application code. Whatever the reason, FLAIM provides for the automatic detection and rollback of dead transactions.
Update transactions obtain an exclusive lock on the database, but during the commit operation, they will release the lock before writing out their final RFL entries. Although the database lock has been released, the transaction is not committed until the final RFL entries are safely on disk. However, it is not necessary to keep the database locked while writing out these final RFL entries. In this way, another thread that is waiting to obtain the database lock and start a transaction can obtain the lock before the current thread has fully committed its transaction. This allows update transaction throughput to be significantly improved. It also enables a feature of FLAIM called "group commit", which allows multiple update transactions to write out their final RFL entries in the same write operation.
Here is an example of how group commit works.
In this way, it is possible for multiple threads to all be inside the commit function.
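The idea can be modeled in a few lines of C++. This is a single-threaded simulation, not FLAIM code: the GroupCommitLog class and its method names are hypothetical, the database lock is not modeled, and a "disk write" is simply counted. Each transaction queues its final RFL entry after releasing the lock; whichever caller performs the flush writes every queued entry in one write, and all of those transactions become committed together.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model of group commit; none of these names are FLAIM APIs.
class GroupCommitLog {
public:
    // Called by a transaction after it has released the database lock.
    // Queues the final RFL entry and returns a 1-based sequence number.
    std::size_t queueFinalEntry(const std::string& entry) {
        pending_.push_back(entry);
        return durableCount_ + pending_.size();
    }

    // Writes every queued entry in one simulated physical write.
    // Returns how many transactions became durable as a result.
    std::size_t flush() {
        if (pending_.empty()) return 0;
        std::size_t n = pending_.size();
        for (const auto& e : pending_) disk_ += e;  // build one combined buffer
        ++writeCount_;                               // counted as a single write
        durableCount_ += n;
        pending_.clear();
        return n;
    }

    // A transaction is committed only once its final entry is on disk.
    bool isCommitted(std::size_t seq) const { return seq <= durableCount_; }
    std::size_t writeCount() const { return writeCount_; }

private:
    std::vector<std::string> pending_;
    std::string disk_;
    std::size_t durableCount_ = 0;
    std::size_t writeCount_ = 0;
};
```

If three transactions queue their entries before the flush runs, the model reports three commits for one write, which is where the throughput gain comes from.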
A checkpoint brings the on-disk version of the database up to the same coherent state as the in-memory (cached) database. FLAIM attempts to do a checkpoint whenever there are periods of minimal update activity on the database. In this case, FLAIM acquires a lock on the database and does as much work as possible until either the checkpoint completes or another thread wants to update the database.
To prevent the on-disk database from becoming too out of sync, there are conditions under which a checkpoint will be forced even if threads are waiting to update the database. First, if the checkpoint thread has not been able to complete a checkpoint within a specified time interval (default is three minutes), a checkpoint will be forced. Second, a checkpoint will always be forced when FLAIM is told to shut down. Third, I/O errors or out-of-disk conditions on the roll-forward log volume will cause a checkpoint to be forced. Forcing a checkpoint helps to shorten the amount of time it takes to recover the database after a system failure.
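The rules above can be summarized as a small decision function. This is only a sketch of the policy as described here, not FLAIM's implementation; the CheckpointState structure, its field names, and decideCheckpoint are hypothetical, and the 180-second default is taken from the "three minutes" mentioned above.

```cpp
#include <chrono>

// Hypothetical inputs to the checkpoint decision; names are illustrative only.
struct CheckpointState {
    std::chrono::seconds sinceLastCheckpoint{0};
    std::chrono::seconds maxInterval{180};  // default of three minutes
    bool shuttingDown = false;
    bool rflVolumeError = false;   // I/O error or out-of-disk on the RFL volume
    bool writersWaiting = false;   // some thread wants to start an update
};

enum class CheckpointMode {
    Opportunistic,  // work until done or until a writer shows up
    Yield,          // a writer is waiting and nothing forces us to continue
    Forced          // finish the checkpoint even though writers are waiting
};

CheckpointMode decideCheckpoint(const CheckpointState& s) {
    // Any one of the three conditions forces the checkpoint.
    if (s.shuttingDown || s.rflVolumeError ||
        s.sinceLastCheckpoint >= s.maxInterval) {
        return CheckpointMode::Forced;
    }
    return s.writersWaiting ? CheckpointMode::Yield
                            : CheckpointMode::Opportunistic;
}
```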
"I was under the impression that checkpoint thread will only block writers and readers are always serviced (I remember hearing about flaim maintaining a snapshot of the region being modified so that requests can be serviced with the old data until a transaction is committed.) In one discussion, it came out that CP thread blocks both readers and writers. Is that true?"
This question requires more than a simple yes/no answer. The short answer is that the CP thread always blocks writers (update transactions). It never blocks a read transaction, but there is a rare scenario where the CP thread will kill a read transaction. There is also a rare condition where the CP thread will be blocked by a read transaction. In that scenario, because the CP thread is blocked by a read transaction, and the CP thread blocks update transactions, you will end up with the rare scenario where update transactions are being indirectly blocked by a read transaction.
One of the tasks of the CP thread is to truncate the rollback log whenever it can (after it completes a checkpoint) to keep it from growing indefinitely. The rollback log is used for three purposes:
Whenever the CP thread completes a checkpoint, it checks to see if it can truncate the rollback log. If there are active read transactions that need information in the rollback log to maintain their snapshot of the database, the CP thread will NOT truncate the rollback log - up to a point. If the CP thread has not been able to truncate the rollback log for a long time because of a long-running read transaction, or because of a series of overlapping read transactions, and the rollback log has grown larger than 1 GB, the CP thread will take more drastic steps to keep the rollback log from continuing to grow. In prior versions of FLAIM, the CP thread would wait for the read transactions to terminate - on the assumption that they would, for the most part, terminate in a short amount of time. In the current version of FLAIM, that behavior has been changed. Now when the rollback log has grown to be more than 1 GB, the CP thread will simply terminate all read transactions that are preventing it from truncating the rollback log - with one exception. The exception is for certain internal read transactions that have been specially marked as "do not kill". Currently, the only read transactions that are marked as "do not kill" are read transactions that are used to take a hot backup of the database. In effect, this means that if a hot backup of the database runs long enough, the CP thread could end up waiting on the read transaction that is being used to do the hot backup.
Thus, in the normal course of operations, the CP thread will not wait on read transactions, but it may kill them in order to get the rollback log down below 1 GB. Since the CP thread will attempt to truncate the rollback log well before it gets to 1 GB, it should be rare that the rollback log actually grows to 1 GB. As noted above, the CP thread could also end up waiting for a hot backup to complete, but that should also be relatively rare, depending on when users decide to do their hot backups.
There are some common misconceptions about the checkpoint thread, when it runs, and what it means to "force a checkpoint" versus "not force a checkpoint" that I would like to comment on briefly.
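For readers who prefer code, here is a rough sketch of the truncation policy just described. It is our reading of the rules, not the actual CP thread code: the ReadTrans and Decision types, the afterCheckpoint function, and the exact handling when a "do not kill" reader coexists with ordinary readers (kill the ordinary ones, wait on the protected one) are all assumptions made for illustration.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical description of an active read transaction.
struct ReadTrans {
    int id;
    bool needsRollbackLog;  // still depends on old versions kept in the log
    bool doNotKill;         // e.g. the read transaction behind a hot backup
};

enum class Action { Truncate, LeaveAlone, KillReaders, WaitForProtected };

struct Decision {
    Action action;
    std::vector<int> toKill;  // ids of read transactions to terminate
};

constexpr std::uint64_t kRollbackLogLimit = 1ULL << 30;  // 1 GB, per the text

// Decide what to do with the rollback log after a checkpoint completes.
Decision afterCheckpoint(std::uint64_t logBytes,
                         const std::vector<ReadTrans>& readers) {
    std::vector<int> killable;
    bool anyBlocking = false;
    bool protectedBlocking = false;
    for (const auto& r : readers) {
        if (!r.needsRollbackLog) continue;
        anyBlocking = true;
        if (r.doNotKill) protectedBlocking = true;
        else killable.push_back(r.id);
    }
    if (!anyBlocking) return {Action::Truncate, {}};
    if (logBytes <= kRollbackLogLimit) return {Action::LeaveAlone, {}};
    // Over the limit: kill ordinary readers; protected readers must be waited on.
    if (protectedBlocking) return {Action::WaitForProtected, killable};
    return {Action::KillReaders, killable};
}
```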
FLAIM logs the operations of each update transaction to a roll-forward log. Roll-forward log files are used to recover transactions after a system failure and when restoring a database from backup.
FLAIM is able to operate in two modes with respect to the roll-forward log. In the default mode, the log is truncated every time a checkpoint is completed since the log is no longer needed for recovery. This mode allows applications that do not need continuous backup capabilities to conserve disk space. The other mode allows transactions logged to the roll-forward log to be kept indefinitely. When this mode is employed, multiple log files are utilized instead of just one. Roll-forward log files are not reset and reused when checkpoints are performed. Instead, the roll-forward log continually grows.
For recovery after a non-catastrophic event (e.g., something other than a disk failure), only the RFL entries since the last checkpoint are needed. For recovery after a disk failure, only the RFL entries logged since the last backup are needed. In short, only a subset of the RFL is needed to allow recovery in either case, thus allowing obsolete portions of the RFL to be removed as needed to reduce its footprint. FLAIM provides mechanisms for an application to identify and remove sections of the log that are no longer relevant.
Every now and then features are added to FLAIM that require new meta-data to be stored in the database, or that require changes in how data is stored in the database. Such changes are obviously unknown to prior versions of the FLAIM libraries and hence cannot be properly supported by older libraries. In the past, changes like this have included the following:
To deal with these kinds of changes, we have adopted the following practices and philosophies:
We had a user that was attempting to bulk load a database with a very large number of records in a single transaction. After loading a large number of records, they received an FERR_DB_FULL error. It turns out that the error was occurring because the transaction had filled up the roll-forward log file. The question was raised why FLAIM did not just start using the next sequential roll-forward log file instead of returning the error.
The explanation for this is that FLAIM only uses multiple roll-forward log files when the "keep-rfl-files" flag is set to TRUE. This particular user was encountering the FERR_DB_FULL error because the "keep-rfl-files" flag in their database was FALSE (this is the default value that is set when a database is created). Could FLAIM be modified to use multiple roll-forward log files when the keep-rfl-files flag is FALSE? Yes, but there are a number of subtle issues that would need to be carefully analyzed - to ensure that we have a proper distinction between when we are really "keeping" rfl files versus just using the next sequential file temporarily. We don't want the ability to really "keep" roll-forward log files to be inadvertently compromised. Furthermore, to make it work properly might require the database revision number to be bumped. Given that, it is felt that a customer who runs into this issue could easily work around the limitation by setting the keep-rfl-files flag to TRUE before performing their very large bulk-load transaction. This is done by calling FlmDbConfig as follows:
rc = FlmDbConfig( hDb, FDB_RFL_KEEP_FILES, (void *)TRUE, (void *)0);
Also, the program will probably want to set the maximum RFL file size to something:
FLMUINT uiMinRflSize = some value; // in bytes
FLMUINT uiMaxRflSize = some value; // in bytes
rc = FlmDbConfig( hDb, FDB_RFL_FILE_LIMITS, (void *)uiMinRflSize, (void *)uiMaxRflSize);
When the very large bulk-load transaction is complete, the program can reset the keep-rfl-files flag back to FALSE if desired:
rc = FlmDbConfig( hDb, FDB_RFL_KEEP_FILES, (void *)FALSE, (void *)0);
NOTE: The calls to change the keep-rfl-files flag CANNOT be done inside a transaction.
The only down side to this approach is that because the database is keeping roll-forward log files, the transactions will chew up space on the disk where the roll-forward log files reside. This would happen whether the program chose to do the bulk load in a single transaction or many smaller transactions. Of course, if the program was doing the entire bulk load in a single transaction, the disk space would be consumed regardless of whether the keep-rfl-files flag was TRUE or FALSE (assuming FLAIM could be modified to use multiple files when the keep-rfl-files flag was FALSE). Thus, if an application decided to do its bulk load using multiple small transactions instead of one large transaction, it might want to set the keep-rfl-files flag to FALSE instead of TRUE. In any case, since the keep-rfl-files flag can be changed at any time, the program has complete control of what behavior it desires.
Also, please note that when the keep-rfl-files flag is changed from TRUE to FALSE, the RFL file sequence number will bump to the next file after the last RFL file that was being used. Thus, if the last transaction ended in rfl file #27, when the keep-rfl-files flag is reset to FALSE, the database will move to rfl file #28, and continue reusing #28 going forward. This is done so that all of the data collected in rfl files while the keep-rfl-files flag was TRUE will not be overwritten - after all, that is what "keep" means, right? This means that when the application finishes its bulk load, it will have an RFL directory with a bunch of rfl files in it that it may not really want. After setting the keep-rfl-files flag back to FALSE, it may want to delete those unneeded rfl files. It can determine which RFL files it wants to delete by first determining the current RFL file number, as follows:
FLMUINT uiCurrRflFileNum;
rc = FlmDbGetConfig( hDb, FDB_GET_RFL_FILE_NUM, &uiCurrRflFileNum, NULL, NULL);
Then just delete all of the RFL files 1 through uiCurrRflFileNum minus 1. Note that this also means that the rfl file that remains in the RFL directory will have a different number than the one people are used to seeing. Instead of 00000001.log, it will be 0000001C.log (or whatever).
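As a small illustration of that last step, the helper below builds the list of file names to delete for a given current file number. It assumes the naming convention implied by the examples above (eight uppercase hexadecimal digits followed by ".log", so file #28 is 0000001C.log). The function names rflFileName and obsoleteRflFiles are ours, not part of the FLAIM API, and actually deleting the files from the RFL directory is left to the application.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Format an RFL file number as eight uppercase hex digits plus ".log",
// matching the examples in the text (1 -> 00000001.log, 28 -> 0000001C.log).
std::string rflFileName(unsigned long fileNum) {
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%08lX.log", fileNum);
    return buf;
}

// Names of RFL files 1 through currFileNum - 1, i.e. every file that
// precedes the one the database is currently using.
std::vector<std::string> obsoleteRflFiles(unsigned long currFileNum) {
    std::vector<std::string> names;
    for (unsigned long n = 1; n < currFileNum; ++n) {
        names.push_back(rflFileName(n));
    }
    return names;
}
```

With uiCurrRflFileNum equal to 28, this yields 27 names, from 00000001.log through 0000001B.log.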
Whenever a block becomes empty, FLAIM links the block into an available block list (or "avail" list). Subsequently, if FLAIM needs to create a new block, it will first look in the avail list for a block before extending the database. In certain instances, it may be desirable to have blocks in the avail list returned to the file system to reduce the footprint of a database. FLAIM provides a function (reduceSize) for reorganizing blocks so that free space can be returned to the file system.
The space reclamation function can be performed on-line, without requiring exclusive access to the database. Update operations, but not reads, are prevented while a reclamation operation is in progress. However, the reduceSize function allows the specification of the maximum amount of unused space to be reclaimed. Typically, it is best to reclaim small chunks at a time by making successive calls to the reclamation function instead of trying to reclaim all unused space in one call. This helps to minimize interference with normal update operations.
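The "small chunks" advice might look like the loop below. Because the exact signature of the reduce-size function is not shown here, the sketch takes it as a callback: reclaimChunk is a hypothetical stand-in that is asked to reclaim at most a given amount and returns how much it actually reclaimed. The units (blocks) and the pause hook are also assumptions.

```cpp
#include <cstdint>
#include <functional>

// Reclaim up to totalBudget blocks, at most chunkBlocks per call, pausing
// between calls so that waiting update transactions can get the lock.
// reclaimChunk(maxBlocks) is a hypothetical wrapper around the engine's
// reduce-size call; it returns the number of blocks actually reclaimed.
std::uint64_t reclaimInChunks(
        const std::function<std::uint64_t(std::uint64_t)>& reclaimChunk,
        std::uint64_t chunkBlocks,
        std::uint64_t totalBudget,
        const std::function<void()>& pauseBetweenCalls) {
    if (chunkBlocks == 0) return 0;
    std::uint64_t reclaimed = 0;
    while (reclaimed < totalBudget) {
        std::uint64_t remaining = totalBudget - reclaimed;
        std::uint64_t ask = remaining < chunkBlocks ? remaining : chunkBlocks;
        std::uint64_t got = reclaimChunk(ask);
        reclaimed += got;
        if (got < ask) break;  // nothing more to reclaim right now
        pauseBetweenCalls();   // give update transactions a turn
    }
    return reclaimed;
}
```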
Yes. If a new index is being created, the FLM_DO_IN_BACKGROUND flag can be passed to FlmRecordAdd when adding the index definition. The only way to cause indexing to happen in the background for existing indexes is to suspend those indexes while bulk-loading (call FlmIndexSuspend). When you are done bulk loading you can resume the indexes (call FlmIndexResume), and all newly added entries will be indexed on a background thread. Remember, though, that background indexing still performs an update transaction for each group of records it indexes. Thus, the database is locked for short bursts of time for each group of records indexed. Only one background thread at a time will be able to actually do work; the rest will wait their turn for the database lock. The transactions are small, so it has the appearance of all of the background threads working in parallel, but in reality they are being serialized, just like all other update transactions.
To answer this question, it is useful to consider the different "levels of sophistication" that pertain to database backup:
FLAIM supports five types of backups:
The idea behind hot continuous backup (hereafter referred to as HCB) is as follows:
From a purely logical point of view, the roll-forward log is a sequential list of all transactions that have been performed on the database. However, the log is represented on disk as a sequence of files that are numbered 1, 2, 3, etc. (00000001.log, 00000002.log, etc.). This sequencing is extremely important for HCB. In order for a database to be correctly restored from the roll-forward log, transactions must be replayed in their proper order, and no transactions may be omitted. Otherwise, the restored database will not be logically complete.
On the surface, it would appear that the sequence numbers in the RFL file names are sufficient to guarantee proper sequencing of RFL files. However, there are a number of scenarios where it is possible to have a 00000009.log that really isn't the sequential file that comes after the 00000008.log. I won't go into all of those scenarios. One scenario, "branching", is described below. Suffice it to say, it is possible. Given that possibility, to ensure proper log file sequencing we put two serial numbers in each RFL file's header: its own serial number, and the serial number of the next sequential log file. In this way, during a database restore operation, when FLAIM replays transactions from the roll-forward log it can verify that the sequence of transactions in the respective log files really is a valid sequence. It is not enough to check the file numbers, or even the transaction numbers in the files. FLAIM verifies those too, but FLAIM needed something that was guaranteed to be unique. Think of the "next serial number" as a pointer to the next sequential log file. That next sequential log file not only better have the right file number, it better have the right serial number in its header.
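The "pointer to the next log file" idea is easy to show in code. The sketch below is not FLAIM's restore logic and does not reflect the real RFL header layout: the RflHeader structure is hypothetical, and a 64-bit integer stands in for whatever unique serial number the header actually holds. It simply walks a list of headers and reports how many leading files form a valid chain.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified view of an RFL file header.
struct RflHeader {
    std::uint32_t fileNum;
    std::uint64_t serialNum;      // this file's own serial number
    std::uint64_t nextSerialNum;  // serial number expected in the next file
};

// Returns how many files, starting from the first, form a valid sequence.
// A file continues the sequence only if both its file number and its serial
// number match what the previous file says should come next.
std::size_t validPrefixLength(const std::vector<RflHeader>& files) {
    if (files.empty()) return 0;
    std::size_t n = 1;
    while (n < files.size()) {
        const RflHeader& prev = files[n - 1];
        const RflHeader& cur = files[n];
        if (cur.fileNum != prev.fileNum + 1) break;       // gap in numbering
        if (cur.serialNum != prev.nextSerialNum) break;   // wrong successor
        ++n;
    }
    return n;
}
```

A file with the right number but the wrong serial number stops the walk, which is exactly the case the file name alone cannot detect.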
It is important to understand what happens when the RFL logging mode is changed from "keep" to "no-keep" or vice versa.
In effect, changing from "keep" to "no-keep" will terminate the sequence of roll-forward log files. This is because when a database is changed to a no-keep mode, it can no longer guarantee an uninterrupted sequence of transactions in the RFL files. Transactions may be overwritten and thus lost. So that the transactions that have been logged up to that point are not lost, FLAIM will create a new RFL file and mark it as a "no-keep" file. For example, if a database is on RFL file #12, FLAIM will create RFL file #13 and mark it as "no-keep", thus preserving all of the transactions up through and including file #12. The sequence of restorable transactions basically ends in log file #12. To make sure that a restore operation doesn't inadvertently try to jump from log file #12 into log file #13 and treat it as a valid sequence of transactions, file #13 will be given a new serial number that does not match the "next serial number" found in file #12.
Changing from "no-keep" to "keep" basically begins a new sequence of restorable transactions. Whatever transactions are in the current RFL file can be part of that sequence. Hence, although FLAIM will create a new log file, the new log file will be linked to the old log file so that we will have a valid sequence of transactions from the old log file going into the new log file.
Another reason for keeping serial numbers on log files is because of the branching effect that can occur due to partial restore operations. Consider the following example:
At this point, we will have essentially created a branch in the RFL log, starting at RFL file #16. If new transactions are added to the live #16 (which is possible if RFL size limits have been increased), it will immediately diverge from what is in the archived #16. However, note that both the live #16 and the archived #16 will have the same serial number, because they are both valid continuations of the RFL log from archived #15. The archived #15 can point to either one of them, and a restore operation will work. If the administrator decides later on to do another restore operation, he now has two branches he can select from, starting at RFL file #16. The branch that gets restored depends on which RFL files the administrator chooses to make available to the restore operation.
Note that the "next" serial number in the live #16 will have been modified so that it no longer points to the archived #17. It must point to a different #17 than the archived one, because different transactions will be written to it than the transactions found in the archived #17.
FLAIM provides the FlmDbRebuild API and a command-line equivalent called "rebuild" that can be used to salvage all "sane" data from a damaged database. To do this, rebuild executes the following sequence of operations:
At this point, the rebuild is complete and, typically, the rebuilt database is either renamed or copied over the damaged database.
Since disk I/O is a fundamental part of the FLAIM database engine, it is important to understand some of the differences between ATA, SATA, and SCSI disk technologies and their impact on database reliability and performance. Probably the most significant differentiating factor between "enterprise class" and commodity (desktop) drives is the support of Tagged Command Queuing (TCQ) or Native Command Queuing (NCQ). Although the implementation details differ, both of these technologies are designed to improve disk performance and reliability.
SCSI and newer SATA drives (based on the SATA II specification) support command queuing, whereas most ATA and SATA 1.0 drives do not (refer to this summary). Among other things, this technology allows multiple I/O requests to be outstanding at any one time, and individual write operations can be tagged so that they are forced to disk. This is in contrast to typical ATA drives that support only one I/O operation at a time and require the entire cache to be flushed to ensure dirty sectors have been written. A drive that supports command queuing can also re-order I/O requests based on a knowledge of the physical layout of the drive, head position, and rotational latency. This allows the drive to choose an optimal sequence of I/O requests, resulting in much more efficient asynchronous and direct I/O when accessing a FLAIM database.
For additional information on command queuing, refer to the Wikipedia articles on NCQ and TCQ. Seagate also has an excellent whitepaper that discusses NCQ and its implementation in SATA II drives. It can be found here.
Simply put, the disparity can be attributed to differences in disk drive technologies. Most desktop-class machines use inexpensive ATA or SATA drives, whereas most server-class machines use SCSI, Fibre Channel, or SAS drives.
It is very important to understand that most ATA and SATA-I drives do not support tagged commands. Thus, it isn't possible to pass a block of data to one of these drives and tag it as something that must be written all the way out to the disk platters (rather than just being put into the drive's cache). These drives typically provide the option of disabling their built-in write cache, but it must be done globally, not on a per-file basis. Because of this, machines that use these drives almost never explicitly disable the on-board write cache. In fact, some drives don't even honor the command to disable the write cache.
The only other way to get ATA and SATA-I drives to write their dirty cache is to issue a command that flushes the drive's entire cache. Mac OS X provides a way to issue this command directly to the drive. As an experiment, code was added to FLAIM to make use of this command. It resulted in a catastrophic performance hit. In Linux, there doesn't appear to be a high-level API to issue this same command to the disk ... only fsync() and fdatasync() are provided. These are supposed to guarantee that the data is safe on disk, but again, because of the limited nature of ATA and SATA-I drives, frequently the drives and/or device drivers choose "optimal" performance at the expense of data integrity. Obviously, running a database application on any drive that cannot ensure that data has been flushed out of cache is dangerous.
SCSI disks and newer SATA-II drives that support Native Command Queuing (NCQ) do not suffer from the same limited set of capabilities as their ATA and SATA-I counterparts. These more sophisticated drives are really cool because not only do they allow tagging and priority queuing of write requests, they also support partial completion notifications and a host of other great features. When direct I/O is enabled on a file that resides on one of these drives, each queued write is tagged to indicate that it should be written through the drive's cache. The drive will not notify the operating system that the write is complete until the data is safe on disk (imagine that). These drives are also very smart about how they manage their queues. If, for example, two operations on neighboring sectors are queued, these drives will automatically combine the requests to minimize head movement and rotational delays.
A while ago, we could not understand why a $1,000 Linux box with a cheap ATA drive could out-perform a $20,000 Sun box with 10000 RPM SCSI drives. In fact, even a laptop with Windows XP and a 4200 RPM drive was able to out-perform the Sun machine on a bulkload test. The answer presented itself when we tried running a performance test on a new Dell PowerEdge server with 15K RPM drives and gigabytes of memory. We expected this machine to scream. Quite the contrary. The machine returned bulkload times that were similar to the Sun box. What was the difference? The hard drives. The high-end, enterprise class machines use high-end SCSI drives that actually guarantee that data is written to disk! This is in contrast to the ATA drives that say the data has been written to disk, when in fact it has only been written to the drive's cache. As an aside, it is important to note that neither the Sun nor Dell machines were using caching RAID controllers which can dramatically improve I/O performance.
The preferred way to get the best of both worlds (good performance and high data integrity) is to use a disk controller with non-volatile cache (typically battery-backed). A quality controller of this type will disable the on-board cache of any of the connected drives. There is an interesting Linux kernel thread that discusses many of these topics. Linus even contributed some thoughts and questions to the discussion. It is a very lengthy thread, but may be of interest. It can be found here.
We have known for several years that there is a performance advantage to explicitly extending a file prior to performing any write operation that would implicitly extend the file. This is due to the fact that every time a file is extended, various pieces of file system metadata have to be updated to reflect the fact that resources (sectors, i-nodes, etc.) have been assigned to the file. Additionally, most modern file systems maintain some type of transaction journal to allow partially committed file system changes to be rolled back in the event of a power failure (refer to the Wikipedia article on journaling file systems for more information). This means that whenever the size of a file changes, one or more journal writes must be performed and flushed all the way out to the disk platters (unless, of course, a battery-backed RAID controller is used, in which case the write only needs to be flushed through to the controller's cache). It is much more efficient to extend a file by a large amount because the file system metadata is updated only once. This is in contrast to the multiple metadata updates that are required when many smaller writes each cause the file to be implicitly extended.
FLAIM tries to be ultra efficient in the way it interacts with the disk. This includes using async and direct I/O when available, ordering writes to minimize seeking, using sector-aligned buffers, and various other techniques. One of the most significant factors impacting database I/O performance (particularly when performing a checkpoint) is the host platform's support of async I/O completion notifications. FLAIM (and XFLAIM) use async I/O and multiple write buffers to keep the disk channel as busy as possible. Since async I/O, by its nature, may result in later writes completing before earlier writes, FLAIM needs to be notified of an out-of-order I/O completion so that buffers can be re-used as quickly as possible.
Basically, there are a limited number of I/O buffers managed by FLAIM. As FLAIM flushes dirty cache to disk, it acquires a buffer from the buffer manager, holds onto it until the write completes, and then releases the buffer back to the manager. When all of the write buffers are in use, FLAIM must wait for a pending I/O to complete before queuing an additional I/O operation. The problem is that at any one time, especially when forcing a checkpoint, FLAIM may have thousands of pending writes and it is unknown which one will complete first. Originally, FLAIM was taking the simplistic approach of waiting for the earliest queued write to complete; this, however, was rarely the first I/O to complete. The result was that there were many usable I/O buffers available, but FLAIM was unaware of this fact because it was blocked waiting for the earliest I/O.
Platforms that support async I/O all provide a simple (i.e., polling) mechanism for determining whether an async I/O has completed, via a call to a routine like GetOverlappedResult (on Windows). Some operating systems provide an additional callback-based mechanism to notify an application that I/O has completed. FLAIM uses this callback-based notification to take advantage of out-of-order I/O completion. Instead of waiting for a specific I/O operation to complete so that its buffer can be re-used, the callback can do all of the work needed to release the buffer back to the buffer manager and also alert (via a semaphore) any thread waiting for a buffer to become available. This results in efficient and timely buffer re-use and has a huge impact on throughput by maximizing FLAIM's use of available I/O bandwidth.
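The buffer hand-off described above can be sketched in Python. This is not FLAIM's implementation: a thread pool stands in for the operating system's async I/O, a future's done-callback stands in for the I/O completion callback, and the names WriteBufferPool, flush_blocks, and write_fn are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class WriteBufferPool:
    """Fixed set of write buffers, returned by completion callbacks."""

    def __init__(self, count, size):
        self._free = [bytearray(size) for _ in range(count)]
        self._lock = threading.Lock()
        self._available = threading.Semaphore(count)

    def acquire(self):
        # Blocks only while every buffer is tied up in a pending write.
        self._available.acquire()
        with self._lock:
            return self._free.pop()

    def release(self, buf):
        with self._lock:
            self._free.append(buf)
        # Wake one thread waiting for a buffer, if there is one.
        self._available.release()

    def free_count(self):
        with self._lock:
            return len(self._free)


def flush_blocks(blocks, pool, write_fn, workers=8):
    """Queue one write per (block_id, data) pair.

    Returns block ids in the order their writes completed, which may
    differ from the order in which they were queued.
    """
    completed = []

    def on_complete(buf, block_id):
        # Runs as soon as this particular write finishes, regardless
        # of how many earlier writes are still pending.
        completed.append(block_id)
        pool.release(buf)

    with ThreadPoolExecutor(max_workers=workers) as executor:
        for block_id, data in blocks:
            buf = pool.acquire()
            buf[:len(data)] = data
            future = executor.submit(write_fn, block_id, buf, len(data))
            future.add_done_callback(
                lambda _f, b=buf, i=block_id: on_complete(b, i))
    return completed
```

The point of the design is that the queuing thread never waits on a specific write. It waits only on the semaphore, so whichever write finishes first frees a buffer for the next one.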
Threads doing update operations typically do not hold the database lock for very long periods of time. Probably the real issue you are facing is the fact that there is a background thread, called the "checkpoint thread", which periodically runs to flush all dirty blocks (blocks which have been modified in cache but not yet written to disk) from cache to disk to establish a database checkpoint. In order to do this, it must obtain the database lock. While it is holding the lock, all other threads that want to do update operations will be blocked.
The amount of time it takes for the checkpoint thread to flush dirty blocks to disk (and hence, the amount of time it holds the database lock) depends on how much dirty cache there is. The more dirty cache there is, the longer it takes. The checkpoint thread wakes up every second to see if there are dirty cache blocks that need to be written to disk. If no other thread holds the lock, and none is waiting to obtain it, the checkpoint thread will obtain the lock and start writing out dirty blocks. However, if another thread requests the lock while it is writing, the checkpoint thread will usually give up the lock immediately so that foreground update operations do not have to wait for it to finish writing out all dirty blocks. The exception is when the checkpoint thread has not been able to complete a checkpoint (that is, write out all dirty blocks) for a certain period of time (the "checkpoint interval"). In that case it will not release the lock, but will continue writing until all dirty blocks have been written and a checkpoint established.
The term "checkpoint interval" is a misnomer. It seems to suggest how often the checkpoint thread wakes up and does a checkpoint, but that is NOT what it is. The checkpoint thread is continuously waking up and attempting to complete a checkpoint. The checkpoint interval is simply the longest time that the checkpoint thread will allow to go by without completing a checkpoint. If the last completed checkpoint was too long ago (more than the number of seconds specified by the checkpoint interval), the checkpoint thread holds onto the lock and completes the checkpoint before giving up the lock. We sometimes refer to this as "forcing a checkpoint."
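The decision the checkpoint thread makes each time it wakes up can be summarized in a small sketch. This is a reading of the description above, not FLAIM's code: the names Action and checkpoint_action are hypothetical, and the assumption that an overdue checkpoint is forced even when other threads want the lock is an interpretation of the text.

```python
from enum import Enum


class Action(Enum):
    IDLE = "nothing to write, or the lock is busy"
    WRITE_AND_YIELD = "write dirty blocks, give up the lock on request"
    FORCE = "hold the lock until the checkpoint completes"


def checkpoint_action(dirty_blocks, lock_busy,
                      seconds_since_checkpoint, checkpoint_interval):
    """Decide what the checkpoint thread does on one wake-up.

    lock_busy is True when another thread holds the database lock or
    is waiting for it.
    """
    if dirty_blocks == 0:
        return Action.IDLE
    if seconds_since_checkpoint > checkpoint_interval:
        # Overdue: the interval is a ceiling on the time between
        # completed checkpoints, not a wake-up period (assumption:
        # this takes priority over foreground threads).
        return Action.FORCE
    if lock_busy:
        # Foreground update operations go first.
        return Action.IDLE
    return Action.WRITE_AND_YIELD
```

For example, with an interval of 180 seconds (a value chosen only for illustration), a wake-up 200 seconds after the last completed checkpoint with any dirty blocks present returns Action.FORCE.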
By reducing the checkpoint interval, the checkpoint thread will "force" a checkpoint more often. This means that it will generally have fewer dirty blocks to write out - because not as much dirty cache can build up in the shorter interval. There is probably a better way to keep dirty cache from building up. There are two settings in the _ndsdb.ini file that you can set to control dirty cache buildup: "maxdirtycache" and "lowdirtycache". For example:
maxdirtycache=30000000
lowdirtycache=0
These settings tell the checkpoint thread not to allow more than roughly 30 MB of dirty cache to build up. Whenever it sees that more than 30 MB of dirty cache has accumulated, it will lock the database and write it all out (down to zero, the number specified by the lowdirtycache setting). By setting maxdirtycache to the right value, the checkpoint thread forces a checkpoint more frequently, but writes out a smaller amount of dirty cache each time. This, in effect, reduces the length of time the checkpoint thread holds the lock whenever it forces a checkpoint. Note that this does NOT reduce the overall amount of writing that must be done. It just spreads it out over time, amortizing it so to speak. Also, although increasing checkpoint frequency "spreads out the writes," it may not improve overall throughput. In fact, overall throughput may actually decrease somewhat, because there is now less opportunity for a "piggybacking" effect, which is where multiple update operations update the same block before it is written to disk. Because the checkpoint thread is writing more frequently, a given block that is updated by multiple update operations may now be written to disk multiple times instead of being written out once.
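The trade-off between shorter forced checkpoints and lost piggybacking can be seen in a toy model. This is a simplified simulation, not FLAIM's cache: the function name simulate_dirty_cache is hypothetical, and writing the oldest dirty blocks first is an assumption.

```python
def simulate_dirty_cache(updates, block_size, max_dirty, low_dirty=0):
    """Return the number of blocks written at each forced flush.

    updates is a sequence of block ids, one per update operation.
    A flush starts when dirty bytes exceed max_dirty and continues
    until dirty bytes are at or below low_dirty.
    """
    dirty = {}    # block id -> None, kept in first-dirtied order
    flushes = []  # blocks written by each flush
    for block in updates:
        # Updating an already-dirty block costs no extra write later:
        # this is the piggybacking effect.
        dirty[block] = None
        if len(dirty) * block_size > max_dirty:
            written = 0
            while dirty and len(dirty) * block_size > low_dirty:
                dirty.pop(next(iter(dirty)))  # oldest first (assumed)
                written += 1
            flushes.append(written)
    if dirty:
        flushes.append(len(dirty))  # final checkpoint at shutdown
    return flushes


# 64 updates that keep cycling over the same 8 blocks of 4 KB each.
updates = [i % 8 for i in range(64)]
large = simulate_dirty_cache(updates, 4096, max_dirty=8 * 4096)
small = simulate_dirty_cache(updates, 4096, max_dirty=3 * 4096)
```

In this model the large threshold produces a single flush of 8 blocks, while the small threshold produces 16 flushes of 4 blocks each: every flush is shorter, but 64 blocks are written in total instead of 8. Real workloads will fall somewhere between these extremes.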
Determining the best value for maxdirtycache requires experimentation. It will be different for every system because it depends a lot on the disk system and how efficient it is.
© 2009 Novell, Inc. All Rights Reserved.