FLAIM FAQ

General


What is FLAIM?

FLAIM is a FLexible Adaptable Information Manager (database engine) for traditional as well as volatile and complex information. Even though FLAIM provides many traditional database features (e.g., transactions, recovery, reliability, scalability), it was conceived with a broader view toward the greater flexibility and adaptability that is offered by an XML data model. FLAIM is not new; various products have used FLAIM for over 15 years. For instance, Novell’s scalable, reliable directory and collaboration products, eDirectory and GroupWise, both use FLAIM as the data store, with user licenses totaling well into the hundreds of millions.

What is XFLAIM?

XFLAIM is an embeddable cross-platform XML database engine. It is written in C++ and provides a DOM-like interface for creating, modifying, deleting, indexing, and searching XML documents in a database. It extends traditional XML with the addition of non-text data types (number, binary). XML documents may be of any size. The DOM interface virtualizes documents, bringing nodes into memory on demand. All of the traditional database features are present: transactions, indexing, queries (using XPath), backup, restore, caching, etc. It is designed to be very reliable and scalable. An XFLAIM database is, itself, portable to multiple platforms.

Which platforms are supported?

FLAIM and XFLAIM have been ported to a wide variety of 32-bit and 64-bit platforms, including various flavors of Linux (SUSE, Red Hat, Ubuntu, Debian, etc.), various flavors of Unix (Solaris, AIX, HP-UX, Mac OS X, etc.), Windows (2000, XP), and NetWare 6. There is 64-bit support for all of the Linux, Unix, and Unix-like (Mac OS X) platforms, as well as Windows versions that support 64-bit application development.

What are the goals of FLAIM?

This is where you would expect us to ramble on about our plans to conquer the world of database technology and turn FLAIM into a multi-billion dollar product. While that would be nice, our goals are somewhat more humble. The most notable aspects of FLAIM are that it is scalable, reliable, and embeddable. Everything that we plan to do with the technology will be in support of improving its scalability, reliability, and embeddability. Keeping focused on these aspects will hopefully lead to the important and desirable outcome of developing a thriving user and developer community.

Who should participate in FLAIM?

Anyone who is interested in improving the core technology, testing, writing utilities, documenting features and APIs, and supporting the user community by answering questions. We also encourage the use of FLAIM in open source projects that have a need for a database engine. We have found that each new project that uses FLAIM results in improvements to the technology by uncovering limitations or omissions.

How is the FLAIM project organized and managed?

FLAIM is an open source project released under the terms of the GPL by Novell. The project source code resides in a public Subversion repository on the Novell Forge site. Various other services (such as this Wiki) are also hosted on the Forge project site. There are two full-time Novell engineers who staff the project, acting as moderators and maintainers. All code changes are reviewed and approved by these engineers prior to being accepted into the FLAIM source tree.

What is the FLAIM release schedule?

In the past, FLAIM has always been tightly coupled to a Novell product release schedule. Now that it is an open source project, we are still trying to determine an appropriate release schedule. Stay tuned.

Using FLAIM


Is there any documentation?

We have recently documented the full FLAIM API using Doxygen. This allows us to easily keep the documentation up-to-date when APIs are added, removed, or modified. At this time, there isn't a lot of conceptual documentation. We do provide a small sample application in the FLAIM project in the sample subdirectory. Click here to view the FLAIM documentation.

XFLAIM is very well documented. We haven't migrated it over to Doxygen yet, so some of the API reference is outdated. Click here to view the XFLAIM documentation.

What is the FLAIM concurrency model?

The goal of concurrency control is to ensure that operations being executed at the same time by different applications do not interleave in such a way as to compromise database integrity. Because transactions are defined as the unit of work that transforms a database from one consistent state to another, it is necessary to address concurrency issues in the context of transaction processing.

Individual transactions that run in isolation should always leave the database in a consistent state. In practice, it is usually desirable to allow many transactions to run concurrently. However, if the various operations of the different transactions were allowed to interleave indiscriminately, serious errors may result that could leave the database in an inconsistent state. The fundamental concern of database concurrency control is to ensure that concurrent execution of transactions does not result in a loss of database consistency. This means that the effect of interleaving the operations of multiple concurrent transactions should be the same as running the transactions serially.

Stated simply, FLAIM supports an MVCC (multiversion concurrency control) model with unlimited readers and a single concurrent writer. In many database systems, readers must lock records in order to prevent a writer from modifying the data once it has been read. This causes readers to sometimes block writers and vice versa. In FLAIM, use of MVCC means that readers never block writers and writers never block readers. It has been found that if applications using FLAIM are architected to take advantage of this concurrency model (i.e., update transactions are used only when it is known for sure that data will be changed), the database can service thousands of requests a second on commodity hardware. This has been demonstrated again and again in customer deployments of eDirectory, as well as in SuperLab tests conducted by Novell. To summarize, FLAIM supports the following:

  • One writer, multiple readers.
  • Readers don't block writers because they NEVER lock items in the database.
  • Writers don't block readers.
  • Readers get a virtual snapshot of the database. The rollback log is used to provide block multi-versioning.
  • Uncommitted data is not visible to other transactions.
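
The points above can be illustrated with a small toy model. This is only a sketch of the general MVCC idea, not FLAIM code: FLAIM versions whole blocks through its rollback log, whereas this example keeps per-key version lists in memory. The class `ToyMvccStore` and all of its members are hypothetical names invented for the illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Toy multi-version store: every key keeps a list of (commit id, value) pairs.
// Illustrates snapshot reads with a single writer; NOT FLAIM's block-level design.
class ToyMvccStore {
public:
    // A read transaction only remembers the newest commit it is allowed to see.
    struct Snapshot { std::uint64_t visibleUpTo; };

    Snapshot beginRead() const { return Snapshot{lastCommitted_}; }

    // Single writer: changes are staged privately and published on commit.
    void beginUpdate() { assert(!updating_); updating_ = true; pending_.clear(); }
    void put(const std::string& key, const std::string& value) {
        assert(updating_);
        pending_[key] = value;
    }
    void commit() {
        assert(updating_);
        ++lastCommitted_;
        for (const auto& kv : pending_)
            versions_[kv.first].push_back(Version{lastCommitted_, kv.second});
        pending_.clear();
        updating_ = false;
    }
    void abort() { assert(updating_); pending_.clear(); updating_ = false; }

    // Readers take no locks: they pick the newest version their snapshot allows.
    // Returns false if the key has no version visible to the snapshot.
    bool get(const Snapshot& snap, const std::string& key, std::string& out) const {
        auto it = versions_.find(key);
        if (it == versions_.end()) return false;
        bool found = false;
        for (const Version& v : it->second) {
            if (v.commitId <= snap.visibleUpTo) { out = v.value; found = true; }
        }
        return found;
    }

private:
    struct Version { std::uint64_t commitId; std::string value; };
    std::map<std::string, std::vector<Version>> versions_;
    std::map<std::string, std::string> pending_;   // uncommitted writer state
    std::uint64_t lastCommitted_ = 0;
    bool updating_ = false;
};
```

A reader that called `beginRead()` before a later commit keeps seeing the old value, while a reader that starts afterwards sees the new one. Neither ever waits on the writer.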


Can a database be accessed by more than one process at a time?

In short, the answer is "no." Even though FLAIM is an embedded database engine, it is architected to many of the same principles as stand-alone DBMS servers. This was done so that optimal and reliable caching and file-system layers could be implemented. As such, FLAIM opens a database for exclusive use by its host process. Since the consumers of the FLAIM technology have traditionally been servers (eDirectory, iFolder, etc.), providing a way for multiple processes to access the same database has not been a priority. Within a process, however, an unlimited number of threads can access the same database.

Does FLAIM support transactions?

FLAIM provides two types of transactions:

  • An update transaction allows an application to read and update data. Until a transaction has been committed, none of the operations performed during the transaction are made permanent in the database. Furthermore, changes to the database are not visible to other concurrent transactions. If an update is aborted, the changes made to the database during the transaction are undone.
  • A read transaction is a transaction where only read operations are allowed; updates are not permitted. A read transaction provides a read-consistent view of the database. A read-consistent view may be thought of as a "snapshot" of the database in a logically consistent state. A read transaction essentially sees a logically consistent snapshot of the database as of the point in time when the read transaction was started. In effect, updates made by other concurrent processes that have not committed before the start of the read transaction are not visible from within the read transaction. A read transaction is executed so that it never blocks a concurrent update transaction or other concurrent read transactions.


Under what circumstances would a transaction be rolled back?

There are two types of transaction failures. The first type of failure occurs when the application executing the transaction encounters an error that makes it impossible to continue the transaction. Upon detecting the error, the application can request that FLAIM abort the transaction, which will cause FLAIM to undo (or rollback) all operations that have been performed by the transaction.

The other type of transaction failure occurs when the application terminates before committing or aborting the transaction, thus leaving the effects of a partially completed transaction in the database. Such transactions are sometimes called dead transactions because the application that created the transaction has terminated without specifying a final disposition for the transaction. Dead transactions may be the result of external events over which the application has no control (such as a power failure), or they may be the result of faulty application code. Whatever the reason, FLAIM provides for the automatic detection and rollback of dead transactions.

  • Database recovery after a system crash is automatic. The rollback log is used to return the database to the last checkpointed state. The roll-forward log is used to “redo” transactions that were committed since the last checkpoint.
  • Recovery is idempotent. That is, if a crash occurs during recovery, recovery will be resumed when the database is subsequently opened.
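
The two-step recovery described above (undo back to the last checkpoint, then redo committed work) can be sketched with a toy model. This is an assumption-laden illustration, not FLAIM's implementation: the "disk" is a simple key/value map, and `BeforeImage`, `RedoOp`, and `recover` are hypothetical names. Because the redo operations here are absolute "set" operations and the logs are left untouched, running recovery a second time yields the same result, which is the idempotence property mentioned above.

```cpp
#include <map>
#include <string>
#include <vector>

using ToyDisk = std::map<std::string, int>;

// Rollback-log entry: what a key looked like at the last checkpoint.
struct BeforeImage { std::string key; bool existed; int oldValue; };

// Roll-forward-log entry: a change made by a COMMITTED transaction.
struct RedoOp { std::string key; int newValue; };

// Toy recovery: first return the disk to the last checkpointed state, then
// re-apply every committed change logged since that checkpoint.
inline void recover(ToyDisk& disk,
                    const std::vector<BeforeImage>& rollbackLog,
                    const std::vector<RedoOp>& rollForwardLog)
{
    // Step 1: undo, newest entry first, so the oldest before-image wins.
    for (auto it = rollbackLog.rbegin(); it != rollbackLog.rend(); ++it) {
        if (it->existed) disk[it->key] = it->oldValue;
        else             disk.erase(it->key);
    }
    // Step 2: redo committed transactions in the order they were logged.
    for (const RedoOp& op : rollForwardLog) {
        disk[op.key] = op.newValue;
    }
}
```

If the process dies partway through `recover`, calling it again on the next open repeats both steps from the start and converges on the same state.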


Why do I see multiple threads committing update transactions at the same time?

Update transactions obtain an exclusive lock on the database, but during the commit operation, they will release the lock before writing out their final RFL entries. Although the database lock has been released, the transaction is not committed until the final RFL entries are safely on disk. However, it is not necessary to keep the database locked while writing out these final RFL entries. In this way, another thread that is waiting to obtain the database lock and start a transaction can obtain the lock before the current thread has fully committed its transaction. This allows update transaction throughput to be significantly improved. It also enables a feature of FLAIM called "group commit", which allows multiple update transactions to write out their final RFL entries in the same write operation.

Here is an example of how group commit works.

  • Thread A obtains the database lock, performs a transaction, and makes the call to commit the transaction. It immediately unlocks the database and starts writing out the transaction's final RFL entries. Even though thread A has unlocked the database, the commit function will not return until the final RFL entries are all written out.
  • Thread B obtains the database lock and does a transaction. The commit function is called and the database is immediately unlocked. However, it notices that Thread A is still writing out its final RFL entries, so it waits to begin writing out its RFL entries until Thread A finishes. Even though it has unlocked the database, it has not yet returned from the commit function.
  • Thread C obtains the database lock and does a transaction. The commit function is called, which immediately unlocks the database. Thread A is still not done writing out its final RFL entries, so Thread C also waits for Thread A to finish. Again, even though it has unlocked the database, it has not yet returned from the commit function.
  • Thread D obtains the database lock and does a transaction. The commit function is called, which immediately unlocks the database. Thread A is still not done writing out its final RFL entries, so Thread D also waits for Thread A to finish. Again, even though it has unlocked the database, it has not yet returned from the commit function.
  • Thread A finally finishes writing out its final RFL entries. It notices that Threads B, C, and D have been waiting for it to finish, so it wakes up Thread B (because it is the first one in the wait list) to let it know that it can now write out its final RFL entries. Thread A now returns from the commit function.
  • Thread B wakes up and sees that it can now write out its final RFL entries. It also sees that Threads C and D have final RFL entries to write out. It writes not only its own RFL entries, but also the RFL entries of Threads C and D. When it finishes, it notifies Thread C and D that their final RFL entries have also been written out and have successfully committed. Thus, Threads B, C, and D all return from their commit functions at about the same time. Those three transactions were effectively "group-committed".

In this way, it is possible for multiple threads to all be inside the commit function.
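
The walkthrough above can be replayed with a small deterministic simulation. It is only a sketch: there are no real threads, locks, or disk writes, and the class `GroupCommitSim` and its methods are hypothetical names, not part of FLAIM. It models just the bookkeeping: the first committer writes alone, and whoever queues up behind it is written out together in the next write.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Single-threaded replay of the group-commit bookkeeping described above.
class GroupCommitSim {
public:
    // Called when a thread enters commit (it has already released the
    // database lock). If nobody is writing RFL entries, it starts writing
    // its own; otherwise it joins the wait list.
    void requestCommit(const std::string& thread) {
        if (inFlight_.empty()) inFlight_.push_back(thread);
        else                   waiting_.push_back(thread);
    }

    // Called when the current RFL write reaches disk. Returns the threads
    // whose commit calls return now. The first waiter (if any) then becomes
    // the writer and takes every queued thread's entries into one write.
    std::vector<std::string> writeCompleted() {
        assert(!inFlight_.empty());
        std::vector<std::string> done = inFlight_;
        batches_.push_back(done);
        inFlight_ = waiting_;
        waiting_.clear();
        return done;
    }

    // One element per physical RFL write, listing the transactions it covered.
    const std::vector<std::vector<std::string>>& batches() const { return batches_; }

private:
    std::vector<std::string> inFlight_;  // transactions covered by the current write
    std::vector<std::string> waiting_;   // transactions queued behind it
    std::vector<std::vector<std::string>> batches_;
};
```

Feeding it commits from A, B, C, and D followed by two calls to `writeCompleted()` produces two writes: one covering A alone and one covering B, C, and D together.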

What is a checkpoint?

A checkpoint brings the on-disk version of the database up to the same coherent state as the in-memory (cached) database. FLAIM attempts to do a checkpoint whenever there are periods of minimal update activity on the database. In this case, FLAIM acquires a lock on the database and does as much work as possible until either the checkpoint completes or another thread wants to update the database.

To prevent the on-disk database from becoming too out of sync, there are conditions under which a checkpoint will be forced even if threads are waiting to update the database. First, if the checkpoint thread has not been able to complete a checkpoint within a specified time interval (default is three minutes), a checkpoint will be forced. Second, a checkpoint will always be forced when FLAIM is told to shut down. Third, I/O errors or out-of-disk conditions on the roll-forward log volume will cause a checkpoint to be forced. Forcing a checkpoint helps to shorten the amount of time it takes to recover the database after a system failure.
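
The three conditions above amount to a simple decision function. The following is a minimal sketch under stated assumptions: `CheckpointState` and `shouldForceCheckpoint` are hypothetical names, and the only value taken from the text is the three-minute default interval.

```cpp
#include <cassert>
#include <chrono>

// Inputs the checkpoint thread is assumed to consider (hypothetical struct).
struct CheckpointState {
    std::chrono::seconds sinceLastCheckpoint; // time since a checkpoint last completed
    bool shuttingDown;                        // FLAIM has been told to shut down
    bool rflVolumeError;                      // I/O error or out-of-disk on the RFL volume
};

// Returns true when the checkpoint should be forced, i.e. update
// transactions are made to wait until the checkpoint completes.
inline bool shouldForceCheckpoint(
    const CheckpointState& s,
    std::chrono::seconds maxInterval = std::chrono::minutes(3))
{
    assert(maxInterval.count() > 0);
    return s.shuttingDown
        || s.rflVolumeError
        || s.sinceLastCheckpoint >= maxInterval;
}
```

When the function returns false, the checkpoint thread would still run opportunistically, but it would yield as soon as an update transaction wants the database lock.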

Does the checkpoint thread ever block readers?

"I was under the impression that checkpoint thread will only block writers and readers are always serviced (I remember hearing about flaim maintaining a snapshot of the region being modified so that requests can be serviced with the old data until a transaction is committed.) In one discussion, it came out that CP thread blocks both readers and writers. Is that true?"

This question requires more than a simple yes/no answer. The short answer is that the CP thread always blocks writers (update transactions). It never blocks a read transaction, but there is a rare scenario where the CP thread will kill a read transaction. There is also a rare condition where the CP thread will be blocked by a read transaction. In that scenario, because the CP thread is blocked by a read transaction, and the CP thread blocks update transactions, you will end up with the rare scenario where update transactions are being indirectly blocked by a read transaction.

One of the tasks of the CP thread is to truncate the rollback log whenever it can (after it completes a checkpoint) to keep it from growing indefinitely. The rollback log is used for three purposes:

  • To abort a transaction
  • To recover a database after a shutdown to the last checkpoint
  • To maintain a read transaction's snapshot of the database.

Whenever the CP thread completes a checkpoint, it checks to see if it can truncate the rollback log. If there are active read transactions that need information in the rollback log to maintain their snapshot of the database, the CP thread will NOT truncate the rollback log - up to a point. If the CP thread has not been able to truncate the rollback log for a long time because of a long-running read transaction, or because of a series of overlapping read transactions, and the rollback log has grown larger than 1 GB, the CP thread will take more drastic steps to keep the rollback log from continuing to grow. In prior versions of FLAIM, the CP thread would wait for the read transactions to terminate - on the assumption that they would, for the most part, terminate in a short amount of time. In the current version of FLAIM, that behavior has been changed. Now when the rollback log has grown to be more than 1 GB, the CP thread will simply terminate all read transactions that are preventing it from truncating the rollback log - with one exception. The exception is for certain internal read transactions that have been specially marked as "do not kill". Currently, the only read transactions that are marked as "do not kill" are read transactions that are used to take a hot backup of the database. In effect, this means that if a hot backup of the database runs long enough, the CP thread could end up waiting on the read transaction that is being used to do the hot backup.

Thus, in the normal course of operations, the CP thread will not wait on read transactions - but it may kill them in order to get the rollback log down below 1 GB. Since the CP thread will attempt to truncate the rollback log well before it gets to 1 GB, it should be rare that the rollback log actually grows to 1 GB. As noted above, the CP thread could also end up waiting for a hot backup to complete, but that should also be relatively rare, depending on when users decide to do their hot backups.
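
The truncation policy described above can be summarized as a decision function. This is a sketch, not FLAIM code: `LogAction`, `ReadTxn`, and `afterCheckpoint` are hypothetical names, and treating "1 GB" as 2^30 bytes is an assumption, since the text does not say which definition FLAIM uses.

```cpp
#include <cstdint>
#include <vector>

enum class LogAction {
    Truncate,                // no reader needs the log: truncate it
    KeepForReaders,          // readers need it and it is still under the limit
    KillReadersAndTruncate,  // over the limit: terminate the blocking readers
    WaitForHotBackup         // over the limit, but a "do not kill" reader remains
};

struct ReadTxn {
    bool needsRollbackLog;   // snapshot still depends on data in the rollback log
    bool doNotKill;          // e.g. the read transaction behind a hot backup
};

// ASSUMPTION: 1 GB taken as 2^30 bytes for this illustration.
const std::uint64_t kAssumedLimitBytes = 1024ull * 1024ull * 1024ull;

// Decides what the CP thread would do with the rollback log after a checkpoint.
inline LogAction afterCheckpoint(std::uint64_t logBytes,
                                 const std::vector<ReadTxn>& readers,
                                 std::uint64_t limit = kAssumedLimitBytes)
{
    bool blocked = false;          // some reader still needs the log
    bool protectedReader = false;  // one of those readers must not be killed
    for (const ReadTxn& r : readers) {
        if (!r.needsRollbackLog) continue;
        blocked = true;
        if (r.doNotKill) protectedReader = true;
    }
    if (!blocked) return LogAction::Truncate;
    if (logBytes <= limit) return LogAction::KeepForReaders;
    return protectedReader ? LogAction::WaitForHotBackup
                           : LogAction::KillReadersAndTruncate;
}
```

The `WaitForHotBackup` case is the one in which update transactions can end up indirectly blocked by a read transaction.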

What does it mean to "force a checkpoint"?

There are some common misconceptions about when the checkpoint thread runs and what it means to "force a checkpoint" versus "not force a checkpoint". The points below should help clear them up.

  • The checkpoint thread does NOT just run every three minutes. It runs whenever there is no update transaction currently in progress. It is always trying to complete a checkpoint as soon as it possibly can after update transactions run. There is no point in waiting if it can do it at a low priority. Whenever the checkpoint thread wakes up and runs, it runs in one of two modes: forcing a checkpoint or not forcing a checkpoint. If it is not forcing a checkpoint, it will quit whenever an update transaction comes along and wants to lock the database - this is basically its "low priority" mode because it yields to things that seem to have a higher priority. If it is forcing a checkpoint, it will make update transactions wait until the checkpoint is completed - this is its "high priority" mode. The three minute interval is simply the maximum amount of time it will allow between completed checkpoints. If it can't complete a checkpoint nicely (i.e., not forcing one) between update transactions, it will force one to complete at least every three minutes. But that doesn't mean it won't do one more often if it can.
  • If you set MaxDirtyCache, the checkpoint thread also runs to keep dirty cache below the threshold you set - not just to complete a checkpoint. It will write out dirty blocks until dirty cache comes down below the MinDirtyCache limit that was set. So, as soon as dirty cache exceeds the MaxDirtyCache, the checkpoint thread kicks in and writes out enough dirty blocks to get back below the MinDirtyCache limit. - BTW, are you sure about your settings here? Values of 350 and 250 are in bytes, not megabytes or kilobytes. These settings, if they are correct, are awful. They are the equivalent of continually forcing all dirty cache blocks to be written to disk after every single update transaction - something that will slow down your updates horrendously.
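
The MaxDirtyCache/MinDirtyCache behavior is a high-water/low-water scheme, which the following sketch makes concrete. The function `dirtyBytesToWrite` is a hypothetical helper, not a FLAIM API, and it works in bytes for simplicity even though FLAIM writes whole blocks.

```cpp
#include <cassert>
#include <cstddef>

// Returns how many dirty bytes would be written out: nothing until dirty
// cache exceeds the high-water mark (MaxDirtyCache), then enough to bring
// it back down to the low-water mark (MinDirtyCache).
inline std::size_t dirtyBytesToWrite(std::size_t dirtyBytes,
                                     std::size_t maxDirtyCache,
                                     std::size_t minDirtyCache)
{
    assert(minDirtyCache <= maxDirtyCache);
    if (dirtyBytes <= maxDirtyCache) return 0;
    return dirtyBytes - minDirtyCache;
}
```

With limits of 350 and 250 bytes, even a single dirty 4096-byte block exceeds the maximum, so nearly everything is flushed every time. With the same numbers interpreted as megabytes, nothing is written until 350 MB of cache is dirty.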


What is the purpose of the roll-forward log?

FLAIM logs the operations of each update transaction to a roll-forward log. Roll-forward log files are used to recover transactions after a system failure and when restoring a database from backup.

FLAIM is able to operate in two modes with respect to the roll-forward log. In the default mode, the log is truncated every time a checkpoint is completed, since the log is no longer needed for recovery. This mode allows applications that do not need continuous backup capabilities to conserve disk space. The other mode allows transactions logged to the roll-forward log to be kept indefinitely. When this mode is employed, multiple log files are utilized instead of just one. Roll-forward log files are not reset and reused when checkpoints are performed. Instead, the roll-forward log continually grows.

For recovery after a non-catastrophic event (e.g., something other than a disk failure), only the RFL entries since the last checkpoint are needed. For recovery after a disk failure, only the RFL entries logged since the last backup are needed. In short, only a subset of the RFL is needed to allow recovery in either case, thus allowing obsolete portions of the RFL to be removed as needed to reduce its footprint. FLAIM provides mechanisms for an application to identify and remove sections of the log that are no longer relevant.

How are database file format changes handled?

Every now and then features are added to FLAIM that require new meta-data to be stored in the database, or that require changes in how data is stored in the database. Such changes are obviously unknown to prior versions of the FLAIM libraries and hence cannot be properly supported by older libraries. In the past, changes like this have included the following:

  • Encryption of data (a fairly recent change)
  • Block checksums (done a long time ago)
  • New type of index
  • New packet types in the roll-forward log

To deal with these kinds of changes, we have adopted the following practices and philosophies:

  • FLAIM databases are stamped with a file-format version number. This version number should NOT be confused with the version number associated with a FLAIM library. Many new FLAIM features, even new APIs, can be and often are introduced into FLAIM libraries without requiring a change to the database version number. In fact, every effort is made to avoid bumping database version numbers to support new features. FLAIM library versions can change many times without ever changing database version numbers. But a database version number change will always mean a FLAIM library version change.
  • The database version number basically represents some set of meta-data or internal data formatting that the FLAIM library must be willing to support. Such changes generally come about because of the need to support new features. Higher version numbers indicate that either: a) a newer/better way of doing things is replacing an older way, or b) an altogether new feature is being introduced that requires new data formatting or new meta-data to support it. It is a good idea to keep a database version current if at all possible - more about that later.
  • When a FLAIM library opens an existing database (via FlmDbOpen), it will check the version number that is stamped in the database. If it cannot support the features implied by the version number, it will exit with an error. If the database was created by a version of the FLAIM library that supports newer features that it does not know about, it will return FERR_NEWER_FLAIM. For example, if the database is stamped as version 460, but the FLAIM library only supports up to version 452, FERR_NEWER_FLAIM will be returned. If the database version is one that it no longer supports, it will return FERR_UNSUPPORTED_VERSION. For example, if the FLAIM library only supports 451, 452, and 460, and the database is stamped with 431, FERR_UNSUPPORTED_VERSION will be returned. NOTE: FERR_UNSUPPORTED_VERSION will rarely be returned, because newer versions of the FLAIM libraries are nearly always coded to support all prior database versions.
  • A FLAIM library that supports a newer database version will NEVER automatically upgrade a database that is stamped with an older version. For example, if a FLAIM library supports every database version up to and including 460, and the database it opens is stamped with version 452, FLAIM will leave it at 452. Furthermore, although the FLAIM library has new features that it could put into the database, it will not do it. In this way, FLAIM libraries that only support up to database version 452 are still guaranteed to be able to open the database and work correctly on that database.
  • It is the application's responsibility to upgrade databases if they desire to take advantage of new features that change how data is stored in the database. This may be done by calling the FlmDbUpgrade API. An application may determine a database's current version by calling FlmDbGetConfig() with the FDB_GET_VERSION option. It can then determine if the database version is the current version by comparing it to FLM_CURRENT_VERSION_NUM, which is defined in flaim.h. If it determines that the current version is < FLM_CURRENT_VERSION_NUM, the application may want to call FlmDbUpgrade to upgrade the database. Once FlmDbUpgrade has been called, older versions of FLAIM libraries that do not support the newer database version will no longer be able to access the database.
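
The version rules in the list above can be expressed as a small stand-alone check. This sketch does not call FLAIM: `OpenResult`, `checkDbVersion`, and `needsUpgrade` are hypothetical names, and the enum values merely stand in for the FERR_NEWER_FLAIM and FERR_UNSUPPORTED_VERSION codes mentioned in the text. Treating a version that falls in a gap between supported versions as "unsupported" is an assumption.

```cpp
#include <cassert>
#include <set>

enum class OpenResult {
    Ok,
    NewerFlaim,          // stands in for FERR_NEWER_FLAIM
    UnsupportedVersion   // stands in for FERR_UNSUPPORTED_VERSION
};

// Decides whether a library supporting the given set of database versions
// could open a database stamped with dbVersion.
inline OpenResult checkDbVersion(unsigned dbVersion,
                                 const std::set<unsigned>& supported)
{
    assert(!supported.empty());
    if (supported.count(dbVersion) != 0) return OpenResult::Ok;
    const unsigned newest = *supported.rbegin();
    // Stamped by a newer library than this one knows about.
    if (dbVersion > newest) return OpenResult::NewerFlaim;
    // Older (or otherwise unknown) format that this library no longer handles.
    return OpenResult::UnsupportedVersion;
}

// An upgrade is only worth considering when the database is older than the
// library's current version; it is never performed automatically.
inline bool needsUpgrade(unsigned dbVersion, unsigned currentVersion)
{
    return dbVersion < currentVersion;
}
```

Using the numbers from the examples above, a database stamped 460 opened by a library that supports up to 452 yields `NewerFlaim`, and a database stamped 431 opened by a library that supports 451, 452, and 460 yields `UnsupportedVersion`.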


Why am I getting a "database full" error during a bulk load?

We had a user that was attempting to bulk load a database with a very large number of records in a single transaction. After loading a large number of records, they received an FERR_DB_FULL error. It turns out that the error was occurring because the transaction had filled up the roll-forward log file. The question was raised why FLAIM did not just start using the next sequential roll-forward log file instead of returning the error.

The explanation for this is that FLAIM only uses multiple roll-forward log files when the "keep-rfl-files" flag is set to TRUE. This particular user was encountering the FERR_DB_FULL error because the "keep-rfl-files" flag in their database was FALSE (this is the default value that is set when a database is created). Could FLAIM be modified to use multiple roll-forward log files when the keep-rfl-files flag is FALSE? Yes, but there are a number of subtle issues that would need to be carefully analyzed - to ensure that we have a proper distinction between when we are really "keeping" rfl files versus just using the next sequential file temporarily. We don't want the ability to really "keep" roll-forward log files to be inadvertently compromised. Furthermore, to make it work properly might require the database revision number to be bumped. Given that, it is felt that a customer who runs into this issue could easily work around the limitation by setting the keep-rfl-files flag to TRUE before performing their very large bulk-load transaction. This is done by calling FlmDbConfig as follows:

  rc = FlmDbConfig( hDb, FDB_RFL_KEEP_FILES, (void *)TRUE, (void *)0);

Also, the program will probably want to set the maximum RFL file size to something:

   FLMUINT uiMinRflSize = some value;  // in bytes
   FLMUINT uiMaxRflSize = some value;  // in bytes

   rc = FlmDbConfig( hDb, FDB_RFL_FILE_LIMITS, (void *)uiMinRflSize, (void *)uiMaxRflSize);

When the very large bulk-load transaction is complete, the program can reset the keep-rfl-files flag back to FALSE if desired:

   rc = FlmDbConfig( hDb, FDB_RFL_KEEP_FILES, (void *)FALSE, (void *)0);

NOTE: The calls to change the keep-rfl-files flag CANNOT be done inside a transaction.

The only downside to this approach is that because the database is keeping roll-forward log files, the transactions will chew up space on the disk where the roll-forward log files reside. This would happen whether the program chose to do the bulk load in a single transaction or many smaller transactions. Of course, if the program was doing the entire bulk load in a single transaction, the disk space would be consumed regardless of whether the keep-rfl-files flag was TRUE or FALSE (assuming FLAIM could be modified to use multiple files when the keep-rfl-files flag was FALSE). Thus, if an application decided to do its bulk load using multiple small transactions instead of one large transaction, it might want to set the keep-rfl-files flag to FALSE instead of TRUE. In any case, since the keep-rfl-files flag can be changed at any time, the program has complete control of what behavior it desires.

Also, please note that when the keep-rfl-files flag is changed from TRUE to FALSE, the RFL file sequence number will bump to the next file after the last RFL file that was being used. Thus, if the last transaction ended in rfl file #27, when the keep-rfl-files flag is reset to FALSE, the database will move to rfl file #28, and continue reusing #28 going forward. This is done so that all of the data collected in rfl files while the keep-rfl-files flag was TRUE will not be overwritten - after all, that is what "keep" means, right? This means that when the application finishes its bulk load, it will have an RFL directory with a bunch of rfl files in it that it may not really want. After setting the keep-rfl-files flag back to FALSE, it may want to delete those unneeded rfl files. It can determine which RFL files it wants to delete by first determining the current RFL file number, as follows:

   FLMUINT uiCurrRflFileNum;

   rc = FlmDbGetConfig( hDb, FDB_GET_RFL_FILE_NUM, &uiCurrRflFileNum, NULL, NULL);

Then just delete all of the RFL files 1 through uiCurrRflFileNum minus 1. Note that this also means that the rfl file that remains in the RFL directory will have a different number than the one people are used to seeing. Instead of 00000001.log, it will be 0000001C.log (or whatever).
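
Building the list of files to delete can be sketched as follows. The naming pattern used here (eight uppercase hexadecimal digits followed by ".log") is inferred only from the two examples above, 00000001.log and 0000001C.log for file #28, so treat it as an assumption and verify it against the actual RFL directory before deleting anything. The helpers `rflFileName` and `obsoleteRflFiles` are hypothetical, not FLAIM functions.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

// ASSUMPTION: RFL files are named as 8 uppercase hex digits plus ".log".
inline std::string rflFileName(unsigned long fileNum)
{
    char buf[32];
    std::snprintf(buf, sizeof buf, "%08lX.log", fileNum);
    return std::string(buf);
}

// Names of RFL files 1 through currentFileNum - 1, i.e. everything older
// than the file the database is currently using.
inline std::vector<std::string> obsoleteRflFiles(unsigned long currentFileNum)
{
    assert(currentFileNum >= 1);
    std::vector<std::string> names;
    for (unsigned long n = 1; n < currentFileNum; ++n) {
        names.push_back(rflFileName(n));
    }
    return names;
}
```

The sketch only produces names; the application would pass the value it obtained for uiCurrRflFileNum and then remove the listed files from its RFL directory by whatever means it prefers.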

Why didn't the size of my database decrease when I deleted items?

Whenever a block becomes empty, FLAIM links the block into an available block list (or "avail" list). Subsequently, if FLAIM needs to create a new block, it will first look in the avail list for a block before extending the database. In certain instances, it may be desirable to have blocks in the avail list returned to the file system to reduce the footprint of a database. FLAIM provides a function (reduceSize) for reorganizing blocks so that free space can be returned to the file system.

The space reclamation function can be performed on-line, without requiring exclusive access to the database. Update operations, but not reads, are prevented while a reclamation operation is in progress. However, the reduceSize function allows the specification of the maximum amount of unused space to be reclaimed. Typically, it is best to reclaim small chunks at a time by making successive calls to the reclamation function instead of trying to reclaim all unused space in one call. This helps to minimize interference with normal update operations.
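
The "small chunks, successive calls" advice can be sketched as a loop around whatever reclamation call the application uses. The exact signature of reduceSize is not given here, so the sketch takes a hypothetical callable instead: `reclaimUpTo(n)` is assumed to reclaim at most n blocks and return the number actually reclaimed. The unit (blocks) and the helper name `reclaimInChunks` are also assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Calls reclaimUpTo repeatedly with a small limit until it reports that less
// than a full chunk was reclaimed (nothing left) or maxCalls is reached.
// Returns the total number of blocks reclaimed.
inline std::size_t reclaimInChunks(
    const std::function<std::size_t(std::size_t)>& reclaimUpTo,
    std::size_t chunkBlocks,
    std::size_t maxCalls)
{
    assert(chunkBlocks > 0);
    std::size_t total = 0;
    for (std::size_t call = 0; call < maxCalls; ++call) {
        const std::size_t got = reclaimUpTo(chunkBlocks);
        total += got;
        if (got < chunkBlocks) break;  // avail list exhausted
        // Between calls, waiting update transactions get a chance to run.
    }
    return total;
}
```

Keeping each call short limits how long updates are held off at any one time, at the cost of more calls overall.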

Is there an option to do indexing in the background (as opposed to doing it inline in FlmRecordAdd/Modify)?

Yes. If a new index is being created, the FLM_DO_IN_BACKGROUND flag can be passed to FlmRecordAdd when adding the index definition. The only way to cause indexing to happen in the background for existing indexes is to suspend those indexes while bulk-loading (call FlmIndexSuspend). When you are done bulk loading you can resume the indexes (call FlmIndexResume), and all newly added entries will be indexed on a background thread. Remember, though, that background indexing still performs an update transaction for each group of records it indexes. Thus, the database is locked for short bursts of time for each group of records indexed. Only one background thread at a time will be able to actually do work; the rest will wait their turn for the database lock. The transactions are small, so it has the appearance of all of the background threads working in parallel, but in reality they are being serialized, just like all other update transactions.

How do I back up my database?

To answer this question, it is useful to consider the different "levels of sophistication" that pertain to database backup:

  • A basic, no-frills backup solution requires that all updates to the database be held off while the backup runs. This could be as simple as shutting down the database server and copying the files to a backup location, or to be slightly more sophisticated, the database server could continue to run in a read-only mode (after all dirty cache is flushed to disk) while the files are copied to a backup location. For most database deployments, this type of backup is generally not acceptable.
  • The next level of sophistication, hot backup, refers to a backup that is performed while other concurrent operations are allowed to execute against the database. This type of backup results in a snapshot in time of the database, capturing all committed transactions at the time of the backup. All modifications made to the database during the backup are excluded. A hot backup allows for reasonable protection of the data in the database, while also allowing the database to remain fully on-line for the duration of the backup. The drawback is that changes made to the database between backups are not protected against catastrophic failure. This could mean the loss of several hours, or even days, of database updates depending on when the last backup was made. For some deployments, this risk of partial data loss is unacceptable.
  • Hot, continuous backup extends the concept of a hot backup by providing a mechanism for protecting changes to the database made between backups. Typically, this is accomplished by preserving roll-forward log (RFL) files, thus maintaining a complete record of changes made to the database since the last hot backup. These log files are typically stored on a device (disk, tape, etc.) separate from the device that hosts the database.

FLAIM supports five types of backups:

  • Full, Offline Backup. This type of backup doesn't require FLAIM at all. The only requirement is that the database must be closed so that all dirty cache has been flushed to disk. After that, any utility that can access the FLAIM database files can back them up by whatever means available.
  • Full, Locked Backup. This type of backup is created via the FlmDbCopy API. FLAIM holds off update transactions until the copy is complete. Readers are allowed to access the database without blocking.
  • Full, Hot Backup via the FlmDbBackup API. A full backup makes a complete copy of all data in the database that is committed as of the start of the backup. It does this by starting a single read transaction (thus guaranteeing a read-consistent view of the database) and then streaming each of the blocks in the database out to the application via a callback. Since this type of database scan is a classic example of a cache-poisoning operation, the read transaction is started with a special flag that prevents it from using cache in a way that would cause it to be poisoned. It is interesting to note that since block reads are done from cache when possible, it is likely that some of the blocks in the backup set will be newer than the corresponding database blocks on disk.
  • Incremental, Hot Backup via the FlmDbBackup API. An incremental backup is similar to a full backup in that it is done within a single read transaction that scans every block in the database. The difference is that an incremental backup only copies those blocks that have changed since the last backup (either full or incremental).
  • Continuous, Hot Backup via the FlmDbBackup API. As discussed above, full and incremental backups are essentially snapshots of the database at the time of the backup. Thus, transactions posted to the database after the start of the backup will not be recorded in the backup set. Continuous backup overcomes this shortcoming by preserving the transactions written to the roll-forward log. During a database restore, the transactions recorded in the roll-forward log can be applied to the newly restored database to bring it up to date with the last committed transaction.
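
To make the difference between full and incremental backups concrete, here is a minimal Python sketch. It assumes each block carries the ID of the transaction that last modified it; that bookkeeping, and the backup and restore helpers, are assumptions made for illustration and do not describe how FlmDbBackup tracks changed blocks.

```python
def backup(blocks, since_txn=None):
    """Return (backup_set, snapshot_txn) for a dict of block_no -> (txn_id, data).

    since_txn=None means a full backup; otherwise only blocks whose
    txn_id is newer than since_txn are copied (incremental).
    """
    snapshot_txn = max((txn for txn, _ in blocks.values()), default=0)
    backup_set = {
        no: data
        for no, (txn, data) in blocks.items()
        if since_txn is None or txn > since_txn
    }
    return backup_set, snapshot_txn


def restore(full_set, *incremental_sets):
    # Apply the full backup first, then each incremental in order.
    restored = dict(full_set)
    for inc in incremental_sets:
        restored.update(inc)
    return restored


blocks = {0: (1, b"A"), 1: (2, b"B"), 2: (3, b"C")}
full, txn_full = backup(blocks)

blocks[1] = (4, b"B2")            # block 1 modified after the full backup
blocks[3] = (5, b"D")             # block 3 added after the full backup
inc, txn_inc = backup(blocks, since_txn=txn_full)

restored = restore(full, inc)
```

The incremental set contains only blocks 1 and 3, yet restoring the full set followed by the incremental set reproduces the current state of all four blocks.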


How does hot continuous backup work and how are roll-forward log files used in that process?

The idea behind hot continuous backup (hereafter referred to as HCB) is as follows:

  • A user may periodically take a backup of the database that is essentially a snapshot of the database at some point in time.
  • Transactions that occurred on the database after the backup was taken may be redone simply by reading them from the roll-forward log files that were created after the backup was taken.
  • To be able to restore transactions from roll-forward log files, the database must be put in a mode where it is "keeping" the roll-forward log files, rather than continuously reusing the space in a single roll-forward log file. The latter mode is what we sometimes refer to as the "don't keep RFL files" mode. It is the default mode that is set when a database is created, so in order for a database to keep its roll-forward log files, the mode must be manually changed from "no-keep" to "keep". When in "no-keep" mode, the transactions in the roll-forward log file may be overwritten whenever a checkpoint is executed, as they are no longer needed to recover the database should there be some kind of failure.

From a purely logical point of view, the roll-forward log is a sequential list of all transactions that have been performed on the database. However, the log is represented on disk as a sequence of files that are numbered 1, 2, 3, etc. (00000001.log, 00000002.log, etc.). This sequencing is extremely important for HCB. In order for a database to be correctly restored from the roll-forward log, transactions must be replayed in their proper order, and no transactions may be omitted. Otherwise, the restored database will not be logically complete.

On the surface, it would appear that the sequence numbers in the RFL file names are sufficient to guarantee proper sequencing of RFL files. However, there are a number of scenarios where it is possible to have a 00000009.log that really isn't the sequential file that comes after the 00000008.log. I won't go into all of those scenarios; one of them, "branching", is described below. Given that possibility, to ensure proper log file sequencing we put two serial numbers in each RFL file's header: its own serial number, and the serial number of the next sequential log file. In this way, during a database restore operation, when FLAIM replays transactions from the roll-forward log it can verify that the sequence of transactions in the respective log files really is a valid sequence. It is not enough to check the file numbers, or even the transaction numbers in the files. FLAIM verifies those too, but it needs something that is guaranteed to be unique. Think of the "next serial number" as a pointer to the next sequential log file. That next sequential log file must not only have the right file number, it must also have the right serial number in its header.
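
The "next serial number as a pointer" idea can be sketched as a simple chain check. The header layout, the field names, and the string serial numbers below are all assumptions made for illustration; they are not the on-disk RFL format.

```python
from dataclasses import dataclass


@dataclass
class RflHeader:
    file_number: int
    serial: str        # this file's own serial number
    next_serial: str   # serial number expected in the next file's header


def verify_chain(headers):
    """Return the headers that form a valid, unbroken sequence from the first file."""
    valid = headers[:1]
    for prev, cur in zip(headers, headers[1:]):
        if cur.file_number != prev.file_number + 1:
            break                       # gap in the file numbers
        if cur.serial != prev.next_serial:
            break                       # right file number, wrong file
        valid.append(cur)
    return valid


logs = [
    RflHeader(8, serial="s8", next_serial="s9"),
    RflHeader(9, serial="s9", next_serial="s10"),
    RflHeader(10, serial="x10", next_serial="x11"),   # an impostor 00000010.log
]
replayable = [h.file_number for h in verify_chain(logs)]
```

Here the third file has the expected file number but not the expected serial number, so a restore built on this check would stop replaying after file #9.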

It is important to understand what happens when the RFL logging mode is changed from "keep" to "no-keep" or vice versa.

KEEP to NO-KEEP

In effect, this action will terminate the sequence of roll-forward log files. This is because when a database is changed to no-keep mode, it can no longer guarantee an uninterrupted sequence of transactions in the RFL files. Transactions may be overwritten and thus lost. So that the transactions that have been logged up to that point are not lost, FLAIM will create a new RFL file and mark it as a "no-keep" file. For example, if a database is on RFL file #12, FLAIM will create RFL file #13 and mark it as "no-keep", thus preserving all of the transactions up through and including file #12. The sequence of restorable transactions basically ends in log file #12. To make sure that a restore operation doesn't inadvertently try to jump from log file #12 into log file #13 and treat it as a valid sequence of transactions, file #13 will be given a new serial number that does not match the "next serial number" found in file #12.

NO-KEEP to KEEP

This action basically begins a new sequence of restorable transactions. Whatever transactions are in the current RFL file can be part of that sequence. Hence, although FLAIM will create a new log file, the new log file will be linked to the old log file so that we will have a valid sequence of transactions from the old log file going into the new log file.

RFL LOG BRANCHING

Another reason for keeping serial numbers on log files is because of the branching effect that can occur due to partial restore operations. Consider the following example:

  • Administrator takes a backup of the database, RFL log is currently in log file #12.
  • Database operations continue, and more transactions accrue in log files #12 through #20.
  • Administrator archives RFL log files #12 through #20.
  • Administrator decides to go back and restore the backup. Before restoring the backup, he removes RFL files #17 through #20 from disk - leaving them only in the archive. Backup is restored, and only RFL files #12 through #16 are replayed.
  • Transactions are continued, some are logged to #16, and when #16 fills up, new RFL files #17 through #25 are created. The archived #16 through #20 are now different than the live #16 through #20.

At this point, we will have essentially created a branch in the RFL log, starting at RFL file #16. If new transactions are added to the live #16 (which is possible if RFL size limits have been increased), it will immediately diverge from what is in the archived #16. However, note that both the live #16 and the archived #16 will have the same serial number, because they are both valid continuations of the RFL log from archived #15. The archived #15 can point to either one of them, and a restore operation will work. If the administrator decides later on to do another restore operation, he now has two branches he can select from, starting at RFL file #16. The branch that gets restored depends on which RFL files the administrator chooses to make available to the restore operation.

Note that the "next" serial number in the live #16 will have been modified so that it no longer points to the archived #17. It must point to a different #17 than the archived one, because different transactions will be written to it than the transactions found in the archived #17.
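
The branching example can be written out with made-up serial numbers. Each header is modeled as a (file number, serial, next serial) tuple; the values and the links helper are purely illustrative.

```python
# Each header is (file_number, serial, next_serial); all values are made up.
archived = {
    15: (15, "s15", "s16"),
    16: (16, "s16", "a17"),   # archived #16 points at the archived #17
    17: (17, "a17", "a18"),
}
live = {
    16: (16, "s16", "b17"),   # same serial as archived #16, new "next" pointer
    17: (17, "b17", "b18"),
}


def links(prev, cur):
    # cur is a valid continuation of prev if file number and serial both line up.
    return cur[0] == prev[0] + 1 and cur[1] == prev[2]


old_branch = links(archived[15], archived[16]) and links(archived[16], archived[17])
new_branch = links(archived[15], live[16]) and links(live[16], live[17])
mixed = links(live[16], archived[17])    # live #16 followed by archived #17
```

Both branches validate from archived #15 onward, but mixing the live #16 with the archived #17 does not, which is the property the modified "next" serial number is meant to provide.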

Help! My database has been damaged and I don't have a backup. Can I recover my data?

FLAIM provides the FlmDbRebuild API and a command-line equivalent called "rebuild" that can be used to salvage all "sane" data from a damaged database. To do this, rebuild executes the following sequence of operations:

  • 1. Various pieces of metadata are stored in the database header (found in the .db file). If the database header has been damaged, the rebuild code will attempt to reconstruct a subset of the metadata (such as the database block size), by scanning the blocks in the database and applying various heuristics to derive the metadata. It is very important for the block size to be determined correctly so that the rebuild can salvage as much data as possible. Note that it is rare, but not impossible, for the database header to become damaged. If the header is determined to be sane, the block size (and other metadata) recorded in the header is used and, thus, there is no need to perform this scan of the database.
  • 2. The database is scanned to locate blocks that belong to the dictionary. Since the dictionary stores information about fields and indexes, it must be recovered before any actual data can be recovered. Since the structure of a damaged database cannot be trusted, the rebuild code must scan the entire database (rather than just the dictionary's B-Tree) looking for any blocks belonging to the dictionary. When a dictionary block is found, if its checksum is valid and it passes various other sanity checks, the field and index definitions it contains are extracted into a temporary holding area in memory. Once all of the blocks have been scanned and all of the usable definitions extracted, the rebuild code does some dependency checking and various fix-ups to make the set of definitions usable by the destination database. If, for example, some of the indexed field definitions could not be recovered, the referencing indexes are automatically excluded from the rebuilt database.
  • 3. A final scan of the database is made to recover data records. As data blocks are found, they are checksummed and sanity checked just like the dictionary blocks in step 2. If determined to be sane, the records in the block are extracted and added to the destination database. This continues until all of the blocks in the damaged database have been processed.

At this point, the rebuild is complete and, typically, the rebuilt database is either renamed or copied over the damaged database.
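
The scan-and-salvage idea in steps 2 and 3 can be sketched as follows. The block layout (a payload followed by a 4-byte CRC32), the tiny block size, and the function names are assumptions chosen to keep the example short; FLAIM's actual block format, checksum, and sanity checks are not shown here.

```python
import struct
import zlib

BLOCK_SIZE = 32                      # toy block size; real blocks are far larger
PAYLOAD_SIZE = BLOCK_SIZE - 4        # last 4 bytes hold a CRC32 of the payload


def make_block(payload: bytes) -> bytes:
    payload = payload.ljust(PAYLOAD_SIZE, b"\0")
    return payload + struct.pack("<I", zlib.crc32(payload))


def salvage(image: bytes):
    """Scan every block in the file image and keep those whose checksum is valid."""
    good = []
    for off in range(0, len(image) - BLOCK_SIZE + 1, BLOCK_SIZE):
        block = image[off:off + BLOCK_SIZE]
        payload = block[:PAYLOAD_SIZE]
        (crc,) = struct.unpack("<I", block[PAYLOAD_SIZE:])
        if zlib.crc32(payload) == crc:
            good.append(payload.rstrip(b"\0"))
    return good


image = bytearray(make_block(b"rec-1") + make_block(b"rec-2") + make_block(b"rec-3"))
image[BLOCK_SIZE + 2] ^= 0xFF        # corrupt one byte inside the second block
records = salvage(bytes(image))
```

The scan walks the whole file block by block rather than following any tree structure, so the damaged second block is skipped while the records on either side of it are still recovered.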

Performance and Reliability


What is command queuing?

Since disk I/O is a fundamental part of the FLAIM database engine, it is important to understand some of the differences between ATA, SATA, and SCSI disk technologies and their impact on database reliability and performance. Probably the most significant differentiating factor between "enterprise class" and commodity (desktop) drives is the support of Tagged Command Queuing (TCQ) or Native Command Queuing (NCQ). Although the implementation details differ, both of these technologies are designed to improve disk performance and reliability.

SCSI and newer SATA drives (based on the SATA II specification) support command queuing, whereas most ATA and SATA 1.0 drives do not. Among other things, this technology allows multiple I/O requests to be outstanding at any one time, and individual write operations can be tagged so that they are forced to disk. This is in contrast to typical ATA drives that support only one I/O operation at a time and require the entire cache to be flushed to ensure dirty sectors have been written. A drive that supports command queuing can also re-order I/O requests based on a knowledge of the physical layout of the drive, head position, and rotational latency. This allows the drive to choose an optimal sequence of I/O requests, resulting in much more efficient asynchronous and direct I/O when accessing a FLAIM database.

For additional information on command queuing, refer to the Wikipedia articles on NCQ and TCQ. Seagate also has an excellent whitepaper that discusses NCQ and its implementation in SATA II drives.

Why does database performance seem better on my desktop-class machine than on my $20,000 server?

Simply put, the disparity can be attributed to differences in disk drive technologies. Most desktop-class machines use inexpensive ATA or SATA drives, whereas most server-class machines use SCSI, Fibre Channel, or SAS drives.

It is very important to understand that most ATA and SATA-I drives do not support tagged commands. Thus, it isn't possible to pass a block of data to one of these drives and tag it as something that must be written all the way out to the disk platters (rather than just being put into the drive's cache). These drives typically provide the option of disabling their built-in write cache, but it must be done globally, not on a per-file basis. Because of this, machines that use these drives almost never explicitly disable the on-board write cache. In fact, some drives don't even honor the command to disable the write cache.

The only other way to get ATA and SATA-I drives to write their dirty cache is to issue a command that flushes the drive's entire cache. Mac OS X provides a way to issue this command directly to the drive. As an experiment, code was added to FLAIM to make use of this command. It resulted in a catastrophic performance hit. In Linux, there doesn't appear to be a high-level API to issue this same command to the disk ... only fsync() and fdatasync() are provided. These are supposed to guarantee that the data is safe on disk, but again, because of the limited nature of ATA and SATA-I drives, frequently the drives and/or device drivers choose "optimal" performance at the expense of data integrity. Obviously, running a database application on any drive that cannot ensure that data has been flushed out of cache is dangerous.
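
For reference, this is roughly what "write, then fsync" looks like from an application, shown in Python rather than in FLAIM's own code. The file name and helper are made up. As the text above warns, whether the data has truly reached the platters when fsync returns still depends on the drive and its driver.

```python
import os
import tempfile


def durable_write(path, data: bytes):
    """Write data, then ask the OS to push it through to the storage device."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()                 # move Python's buffered data into the OS
        os.fsync(f.fileno())      # block until the OS reports the data as written


path = os.path.join(tempfile.mkdtemp(), "commit.log")
durable_write(path, b"transaction 42 committed\n")
```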

SCSI disks and newer SATA-II drives that support Native Command Queuing (NCQ) do not suffer from the same limited set of capabilities as their ATA and SATA-I counterparts. These more sophisticated drives are really cool because not only do they allow tagging and priority queuing of write requests, they also support partial completion notifications and a host of other great features. When direct I/O is enabled on a file that resides on one of these drives, each queued write is tagged to indicate that it should be written through the drive's cache. The drive will not notify the operating system that the write is complete until the data is safe on disk (imagine that). These drives are also very smart about how they manage their queues. If, for example, two operations on neighboring sectors are queued, these drives will automatically combine the requests to minimize head movement and rotational delays.

A while ago, we could not understand why a $1,000 Linux box with a cheap ATA drive could out-perform a $20,000 Sun box with 10000 RPM SCSI drives. In fact, even a laptop with Windows XP and a 4200 RPM drive was able to out-perform the Sun machine on a bulkload test. The answer presented itself when we tried running a performance test on a new Dell PowerEdge server with 15K RPM drives and gigabytes of memory. We expected this machine to scream. Quite the contrary. The machine returned bulkload times that were similar to the Sun box. What was the difference? The hard drives. The high-end, enterprise class machines use high-end SCSI drives that actually guarantee that data is written to disk! This is in contrast to the ATA drives that say the data has been written to disk, when in fact it has only been written to the drive's cache. As an aside, it is important to note that neither the Sun nor Dell machines were using caching RAID controllers which can dramatically improve I/O performance.

The preferred way to get the best of both worlds (good performance and high data integrity) is to use a disk controller with non-volatile cache (typically battery-backed). A quality controller of this type will disable the on-board cache of any of the connected drives. There is an interesting Linux kernel thread that discusses many of these topics. Linus even contributed some thoughts and questions to the discussion. It is a very lengthy thread, but may be of interest.

Why are database files expanded several megabytes at a time?

We have known for several years that there is a performance advantage to explicitly extending a file prior to performing any write operation that would implicitly extend the file. This is due to the fact that every time a file is extended, various pieces of file system metadata have to be updated to reflect the fact that resources (sectors, i-nodes, etc.) have been assigned to the file. Additionally, most modern file systems maintain some type of transaction journal to allow partially committed file system changes to be rolled back in the event of a power failure (refer to the Wikipedia article on journaling file systems for more information). This means that whenever the size of a file changes, one or more journal writes must be performed and flushed all the way out to the disk platters (unless, of course, a battery-backed RAID controller is used, in which case the write only needs to be flushed through to the controller's cache). It is much more efficient to extend a file by a large amount because the file system metadata is updated only once. This is in contrast to the multiple metadata updates that are required when many smaller writes each cause the file to be implicitly extended.
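
A minimal sketch of the "extend in large steps" approach, in Python. The 4 MiB step and the helper name are arbitrary choices for illustration, not FLAIM's actual values. Note also that truncate() may only create a sparse file on some file systems; on POSIX systems os.posix_fallocate can be used instead when the space should really be reserved.

```python
import os
import tempfile

EXTEND_BY = 4 * 1024 * 1024     # grow in 4 MiB steps (an arbitrary choice)


def write_block(f, offset, data, allocated):
    """Write data at offset, extending the file in large steps when needed.

    Returns the (possibly updated) allocated size.
    """
    end = offset + len(data)
    if end > allocated:
        # Round the new size up to the next multiple of EXTEND_BY.
        allocated = -(-end // EXTEND_BY) * EXTEND_BY
        f.truncate(allocated)    # one explicit extension instead of many implicit ones
    f.seek(offset)
    f.write(data)
    return allocated


path = os.path.join(tempfile.mkdtemp(), "toy.db")
extensions = 0
with open(path, "wb") as f:
    allocated = 0
    for i in range(2000):                     # 2000 writes of 4 KiB each
        new_alloc = write_block(f, i * 4096, b"\0" * 4096, allocated)
        if new_alloc != allocated:
            extensions += 1
        allocated = new_alloc
```

In this run the file size changes only twice across 2000 block writes, instead of once per write.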

Why is async I/O completion important?

FLAIM tries to be ultra efficient in the way it interacts with the disk. This includes using async and direct I/O when available, ordering writes to minimize seeking, using sector-aligned buffers, and various other techniques. One of the most significant factors impacting database I/O performance (particularly when performing a checkpoint) is the host platform's support of async I/O completion notifications. FLAIM (and XFLAIM) use async I/O and multiple write buffers to keep the disk channel as busy as possible. Since async I/O, by its nature, may result in later writes completing before earlier writes, FLAIM needs to be notified of an out-of-order I/O completion so that buffers can be re-used as quickly as possible.

Basically, there are a limited number of I/O buffers managed by FLAIM. As FLAIM flushes dirty cache to disk, it acquires a buffer from the buffer manager, holds onto it until the write completes, and then releases the buffer back to the manager. When all of the write buffers are in use, FLAIM must wait for a pending I/O to complete before queuing an additional I/O operation. The problem is that at any one time, especially when forcing a checkpoint, FLAIM may have thousands of pending writes and it is unknown which one will complete first. Originally, FLAIM was taking the simplistic approach of waiting for the earliest queued write to complete; this, however, was rarely the first I/O to complete. The result was that there were many usable I/O buffers available, but FLAIM was unaware of this fact because it was blocked waiting for the earliest I/O.

Platforms that support async I/O all provide a simple (i.e., polling) mechanism for determining if an async I/O has completed via a call to a routine like GetOverlappedResult (on Windows). Some operating systems provide an additional callback-based mechanism to notify an application that I/O has completed. FLAIM takes advantage of this callback-based notification to leverage out-of-order I/O completion to its advantage. Instead of waiting for a specific I/O operation to complete so that its buffer can be re-used, the callback can do all of the work needed to release the buffer back to the buffer manager and also alert (via a semaphore) any thread waiting for a buffer to become available. This results in efficient and timely buffer re-use and has a huge impact on throughput by maximizing FLAIM's use of available I/O bandwidth.
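
The callback-driven buffer recycling described above can be modeled in Python. This is only a sketch: a thread pool stands in for the operating system's async I/O, fake_write stands in for a disk write, and a queue plays the role of the buffer manager and its semaphore. None of these names come from FLAIM.

```python
import queue
import time
from concurrent.futures import ThreadPoolExecutor

NUM_BUFFERS = 4
free_buffers = queue.Queue()
for _ in range(NUM_BUFFERS):
    free_buffers.put(bytearray(4096))

completed = []


def fake_write(block_no, buf):
    # Stand-in for an async disk write; even-numbered blocks are slower here,
    # so completions arrive out of order.
    time.sleep(0.02 if block_no % 2 == 0 else 0.001)
    return block_no


def submit_write(pool, block_no):
    buf = free_buffers.get()              # waits only if no buffer is free at all
    fut = pool.submit(fake_write, block_no, buf)

    def on_complete(f, buf=buf):
        completed.append(f.result())
        free_buffers.put(buf)             # release the buffer; wakes a waiting writer

    fut.add_done_callback(on_complete)


with ThreadPoolExecutor(max_workers=NUM_BUFFERS) as pool:
    for block_no in range(20):
        submit_write(pool, block_no)
```

The submitting thread never waits on a specific write. It waits only for any buffer to come back, which is the point of handling completions in whatever order they occur.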

We are deploying a large database (millions of objects) and are seeing cases where update operations stall for several seconds. What is causing this problem?

Threads doing update operations typically do not hold the database lock for very long periods of time. Probably the real issue you are facing is the fact that there is a background thread, called the "checkpoint thread", which periodically runs to flush all dirty blocks (blocks which have been modified in cache but not yet written to disk) from cache to disk to establish a database checkpoint. In order to do this, it must obtain the database lock. While it is holding the lock, all other threads that want to do update operations will be blocked.

The amount of time it takes for the checkpoint thread to flush dirty blocks to disk (and hence, the amount of time it holds the database lock) depends on how much dirty cache there is. The more dirty cache there is, the longer it takes. The checkpoint thread wakes up every second to see if there are dirty cache blocks that need to be written to disk. If there are no other threads that have obtained the lock, and none waiting to obtain it, it will obtain the lock and start writing out dirty blocks. While it is writing out dirty blocks, if another thread requests the lock, the checkpoint thread will usually immediately give up the lock so that foreground update operations will not have to wait for it to finish writing out all dirty blocks. However, if the checkpoint thread has not been able to complete a checkpoint (that is, write out all dirty blocks) for a certain period of time (the "checkpoint interval"), it will not release the lock, but will continue writing until all dirty blocks have been written and a checkpoint established.

The term "checkpoint interval" is a misnomer. It seems to suggest that it is how often the checkpoint thread wakes up and does a checkpoint, but that is NOT what it is. The checkpoint thread is continuously waking up and attempting to complete a checkpoint. The checkpoint interval is simply the longest time that the checkpoint thread will allow to go by without completing a checkpoint. If the last completed checkpoint was too long ago (more than the number of seconds specified by the checkpoint interval), the checkpoint thread holds onto the lock and completes the checkpoint before giving up the lock. We sometimes refer to this as "forcing a checkpoint."
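
The yield-or-force decision can be reduced to a small function. This is a simplified reading of the behavior described above, not FLAIM's code; the function name, its parameters, and the 180-second interval are made up for illustration.

```python
def checkpoint_should_yield(waiters, secs_since_last_checkpoint, checkpoint_interval):
    """Decide whether the checkpoint thread gives up the database lock.

    It yields to waiting updaters unless the last completed checkpoint is
    older than the checkpoint interval, in which case it "forces" the
    checkpoint and keeps the lock until all dirty blocks are written.
    """
    if secs_since_last_checkpoint > checkpoint_interval:
        return False          # forcing a checkpoint: hold the lock
    return waiters > 0        # otherwise give way to foreground updates


# An interval of 180 seconds is an arbitrary value for illustration.
cases = [
    checkpoint_should_yield(waiters=1, secs_since_last_checkpoint=10, checkpoint_interval=180),
    checkpoint_should_yield(waiters=0, secs_since_last_checkpoint=10, checkpoint_interval=180),
    checkpoint_should_yield(waiters=1, secs_since_last_checkpoint=200, checkpoint_interval=180),
]
```

The third case is the one that produces the stalls in question: an updater is waiting, but the checkpoint is overdue, so the checkpoint thread keeps the lock.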

By reducing the checkpoint interval, the checkpoint thread will "force" a checkpoint more often. This means that it will generally have fewer dirty blocks to write out - because not as much dirty cache can build up in the shorter interval. There is probably a better way to keep dirty cache from building up. There are two settings in the _ndsdb.ini file that you can set to control dirty cache buildup: "maxdirtycache" and "lowdirtycache". For example:

  maxdirtycache=30000000
  lowdirtycache=0

These settings tell the checkpoint thread to not allow more than 30 MB (roughly) of dirty cache to build up. Whenever it sees that more than 30 MB of dirty cache has accumulated, it will lock the database and write it all out (down to zero, the number specified by the lowdirtycache setting). By setting maxdirtycache to the right value, the checkpoint thread forces a checkpoint more frequently, but writes out smaller amounts of dirty cache each time. This, in effect, reduces the length of time the checkpoint thread holds the lock whenever it forces a checkpoint. Note that this does NOT reduce the overall amount of writing that must be done. It just spreads it out over time, amortizing it so to speak. Also, although increasing checkpoint frequency "spreads out the writes", it may not improve overall throughput. In fact, overall throughput may actually decrease somewhat, because there is now not as much opportunity for a "piggybacking" effect, which is where multiple update operations update the same block before it is written to disk. Because the checkpoint thread is writing more frequently, a given block that is updated by multiple update operations may be written to disk multiple times now instead of being written out once.

Determining the best value for maxdirtycache requires experimentation. It will be different for every system because it depends a lot on the disk system and how efficient it is.
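
The trade-off between shorter lock holds and lost "piggybacking" can be seen in a toy simulation. It counts dirty blocks rather than bytes, uses a made-up workload of 1000 updates cycling over 50 blocks, and is not a model of FLAIM's actual cache; the simulate function and its parameters are hypothetical.

```python
def simulate(updates, max_dirty_blocks, low_dirty_blocks=0):
    """Count block writes and forced flushes for a stream of block updates."""
    dirty = []              # dirty block numbers, oldest first
    writes = flushes = 0
    for block_no in updates:
        if block_no not in dirty:
            dirty.append(block_no)       # a repeat update "piggybacks" for free
        if len(dirty) > max_dirty_blocks:
            flushes += 1
            while len(dirty) > low_dirty_blocks:
                dirty.pop(0)
                writes += 1
    writes += len(dirty)                 # final checkpoint writes what is left
    return writes, flushes


# 1000 updates cycling over 50 hot blocks.
updates = [i % 50 for i in range(1000)]
small = simulate(updates, max_dirty_blocks=10)    # (total writes, forced flushes)
large = simulate(updates, max_dirty_blocks=100)
```

With the small limit every flush is short (11 blocks), but each block ends up being written many times. With the large limit each block is written once, at the cost of one longer flush. Real workloads fall somewhere in between, which is why the best setting has to be found by experiment.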


© 2009 Novell, Inc. All Rights Reserved.