CVu Journal Vol 12, #1 - Jan 2000: Letters to the Editor

Title: The Wall

Author: Administrator

Date: Sat, 08 January 2000 13:15:34 +00:00

And Another View

Dear Francis,

After the demands of a new job and moving house, I've finally had time to think about the last C Vu. The article by Waldo Posul caught my attention because, while you can't do what he's asking directly, you can achieve the same effect by abstracting the problem.

For the sake of this example, I'm assuming a disc file of 10GB, with small fixed-length records of a few KB or less. The requirement is to be able to create, read, and maintain a disc index to the records in the datafile such that the file can be accessed in reasonable time. I'll leave the definition of 'reasonable' open, but the process can be speeded up considerably over the example given in the article. As the article states, the fgetpos() return cannot be stored beyond the life of the associated stream with any guarantee of success.

My approach would be to forgo the actual file offset in favour of a record number in the index file.

An index file needs to contain as a minimum the key field and the record number in the file. This gives us the logical place to look, but not the physical offset.
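
As a minimal sketch in C, an index entry might look like this (the key width and the record-number type are assumptions for illustration):

    #include <stdio.h>

    /* One index entry: the search key plus the logical record number.
       An unsigned long comfortably numbers the few million records of
       a few KB each that make up a 10GB file. */
    struct index_entry {
        char          key[32];   /* fixed-length key field (assumed width) */
        unsigned long record_no; /* logical record number in the datafile */
    };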

The datafile should be accessed in chunks as large as memory will permit, and the chunk must be some exact multiple of the record size. A 64MB chunk is only 1/160th of our 10GB file, and for small records that is a considerable reduction in overhead compared with reading one record at a time; the program loop is slow compared to the disc-to-memory transfer. From the record number and the number of records that fit in your chunk, you know which chunk to read, and then you extract the record by knowing its offset into the chunk, as sketched below. True, you still have to read all the chunks to get to the last record.
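
A sketch of that arithmetic follows (the record and chunk sizes here are illustrative assumptions; the only requirement is that the chunk size is an exact multiple of the record size):

    #include <stddef.h>

    #define RECORD_SIZE    1024UL                /* assumed fixed record length */
    #define CHUNK_SIZE     (64UL * 1024 * 1024)  /* 64MB, a multiple of RECORD_SIZE */
    #define RECS_PER_CHUNK (CHUNK_SIZE / RECORD_SIZE)

    /* Which chunk holds a given logical record. */
    unsigned long chunk_of(unsigned long record_no)
    {
        return record_no / RECS_PER_CHUNK;
    }

    /* Byte offset of that record within its chunk. */
    size_t offset_in_chunk(unsigned long record_no)
    {
        return (size_t)((record_no % RECS_PER_CHUNK) * RECORD_SIZE);
    }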

By caching the fgetpos() returns for each chunk as it is required, but not before, you reduce your startup overhead at the expense of a variable initial access time to any specific record. If memory permits, you can cache recently used chunks to further improve performance.
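
One possible shape for that lazy cache, assuming the chunk size above and a fixed number of chunks (the names and trimmed error handling are my own):

    #include <stdio.h>

    #define CHUNK_SIZE (64UL * 1024 * 1024)  /* assumed chunk size, as above */
    #define MAX_CHUNKS 160                   /* 10GB / 64MB for this example */

    /* Cached start-of-chunk positions. These are only valid for the life
       of the stream they were taken from, which is exactly why they are
       gathered lazily at run time rather than stored in the index file. */
    static fpos_t chunk_pos[MAX_CHUNKS];
    static int    chunk_known[MAX_CHUNKS];

    /* Position fp at the start of chunk n, walking forward from the
       nearest cached chunk and recording fgetpos() results as we go.
       buf must hold CHUNK_SIZE bytes. Returns 0 on success, -1 on error. */
    static int seek_chunk(FILE *fp, unsigned long n, char *buf)
    {
        unsigned long c = n;

        if (n >= MAX_CHUNKS)
            return -1;

        while (c > 0 && !chunk_known[c])   /* nearest cached chunk <= n */
            c--;

        if (chunk_known[c]) {
            if (fsetpos(fp, &chunk_pos[c]) != 0)
                return -1;
        } else {                           /* nothing cached yet: chunk 0 */
            rewind(fp);
            if (fgetpos(fp, &chunk_pos[0]) != 0)
                return -1;
            chunk_known[0] = 1;
        }

        while (c < n) {                    /* read forward, caching as we go */
            if (fread(buf, 1, CHUNK_SIZE, fp) != CHUNK_SIZE)
                return -1;
            c++;
            if (fgetpos(fp, &chunk_pos[c]) != 0)
                return -1;
            chunk_known[c] = 1;
        }
        return 0;
    }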

The original indexed sequential access method (ISAM) used an access scheme that was directly related to the location of the record on the storage medium. This isn't possible on current general-purpose filesystems. However, by abstracting the problem to a higher level you can achieve a similar end.

There may be some performance gains to be obtained by using setvbuf() to tune the stream I/O to the chunk size, or some fraction of it, but I regard that as a late optimisation.
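
For instance (the buffer size here is the assumed chunk size from earlier; passing NULL lets the library allocate the buffer itself):

    #include <stdio.h>

    #define CHUNK_SIZE (64UL * 1024 * 1024)  /* assumed chunk size, as above */

    /* Open the datafile fully buffered, one chunk per stdio buffer.
       setvbuf() must be called before any other operation on the stream;
       if it fails, the stream simply keeps its default buffering. */
    FILE *open_datafile(const char *path)
    {
        FILE *fp = fopen(path, "rb");
        if (fp != NULL)
            setvbuf(fp, NULL, _IOFBF, (size_t)CHUNK_SIZE);
        return fp;
    }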

I do have to ask why a single large file is being used for small records, rather than an index scheme of key, file, and record that would fit within the original long int specification.

Graham Patterson

Good to hear from you. I hope you will now have time to write more regularly.

For newcomers, Graham is a long-term member of ACCU, having held office as both Secretary and Treasurer, who moved to California earlier this year.
