Introduction to the Hexane Data Engine
Hexane has a non-trivial data engine, which was designed from the ground up. The engine has built-in support for commonly desired features such as large files and data insertion. This engine handles all major functionality of the program, including, but not limited to:
- Data editing (insert, change, delete)
- Undo and redo facilities
- Save/export to various file types
The purpose of this document is to help interested parties understand why the engine was designed the way it was, why it is 'as complicated as it is', and to provide a high-level overview of how it works.
The details of the engine implementation are not for the faint of heart. Familiarity with standard Computer Science jargon is assumed.
1. A Brief Overview of the Problem
Any large-file editor suffers from inherent complexity because the entire file cannot be held in RAM at once. In this section, we discuss the properties an optimal solution should have. In the next, we examine some potential solutions.
The optimal solution should have the following properties:
- Keep the engine as simple as possible, while still being able to page sections of the file in and out of memory.
- Be able to change the file size (i.e., insert and delete bytes) without having to do any expensive computations.
- Be able to reach other addresses quickly (e.g., if we are at address X, how long does it take to get to address X/2 or 2X?).
These will be our main points of consideration for evaluating potential solutions.
2. Potential Solutions to the Problem
2.1 The Buffer Gap
The first, most well-known solution is the buffer gap, a technique employed by most modern text editors to store data. The structure keeps the file's contents in a buffer that contains a gap; the gap follows the cursor as it moves. As the user enters text, the gap fills. Once the gap is full, the buffer must be resized and its contents shifted to open a new gap.
This technique is probably the most efficient solution - provided the entire file fits in RAM. When editing a file larger than RAM, the program either relies on the operating system to page data in and out of its process space, or does the paging itself. If the program pages the data on its own, appending to the buffer is relatively cheap, but prepending is quite expensive - especially when it involves shifting all other data in the buffer to make room. This is the major drawback of the approach. Inserting and deleting bytes is simple, although expensive when the gap is full, due to the need to shift data over to make a new gap. Jumping to other addresses, however, is easy, and only requires compensating for the size and location of the gap.
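As a sketch of these mechanics - illustrative only, not Hexane's code, with all names invented for the example - a minimal gap buffer might look like this:

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Minimal gap-buffer sketch. The buffer holds the contents with a movable
// gap; inserting at the cursor just fills the gap, and moving the cursor
// shifts data across it. Growing the buffer (a full gap) is the expensive case.
class GapBuffer {
public:
    explicit GapBuffer(std::size_t capacity = 16)
        : buf_(capacity), gap_start_(0), gap_end_(capacity) {}

    std::size_t size() const { return buf_.size() - (gap_end_ - gap_start_); }

    // Move the gap so that it begins at logical position pos.
    void move_gap(std::size_t pos) {
        if (pos < gap_start_) {            // shift data rightward, across the gap
            std::size_t n = gap_start_ - pos;
            std::memmove(&buf_[gap_end_ - n], &buf_[pos], n);
            gap_start_ -= n;
            gap_end_ -= n;
        } else if (pos > gap_start_) {     // shift data leftward, across the gap
            std::size_t n = pos - gap_start_;
            std::memmove(&buf_[gap_start_], &buf_[gap_end_], n);
            gap_start_ += n;
            gap_end_ += n;
        }
    }

    // Insert one byte at pos; when the gap is empty, the buffer must grow
    // and shift its tail -- the expensive case described above.
    void insert(std::size_t pos, char c) {
        if (gap_start_ == gap_end_) grow();
        move_gap(pos);
        buf_[gap_start_++] = c;
    }

    char at(std::size_t pos) const {
        return pos < gap_start_ ? buf_[pos] : buf_[pos + (gap_end_ - gap_start_)];
    }

private:
    void grow() {
        std::size_t old_cap = buf_.size();
        std::size_t tail = old_cap - gap_end_;     // bytes after the gap
        buf_.resize(old_cap * 2);
        std::memmove(&buf_[buf_.size() - tail], &buf_[gap_end_], tail);
        gap_end_ = buf_.size() - tail;
    }

    std::vector<char> buf_;
    std::size_t gap_start_, gap_end_;  // gap occupies [gap_start_, gap_end_)
};
```

Note that move_gap() is where the cost hides: moving the cursor far from the gap shifts everything in between.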
2.2 Using a Linked List with Fixed Sized Pages
The problems presented by the buffer gap might be enough to make one consider a linked list to store data. Storing one byte per link is obviously inefficient: on a 32-bit machine, the pointers alone would require 4x the size of the file. A better solution is to make each link hold a quantity of data, such as a 4K or 8K page (suppose 8K for the purposes of this discussion). This has the benefit of being simple to implement and easy to understand, and it makes paging in and out quicker, provided that 8K chunks are read.
Problems arise, however, when we consider insertion. Inserting into a linked list is easy enough - but such an insert leaves the affected page no longer 8K in length. This has the nasty side effect of forcing the program to shift the rest of the data to maintain the 8K page size. On the other hand, since pages are a uniform 8K, calculating which page contains a given address is easy.
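A minimal sketch of the constant-time address computation that uniform pages permit (the names are illustrative):

```cpp
#include <cstddef>
#include <cstdint>

// With uniform 8K pages, locating the page that contains a file address
// is a single division and remainder -- O(1).
constexpr std::size_t kPageSize = 8192;

struct PageLocation {
    std::size_t page_index;  // which link in the list
    std::size_t offset;      // byte offset within that page
};

PageLocation locate(std::uint64_t address) {
    return { static_cast<std::size_t>(address / kPageSize),
             static_cast<std::size_t>(address % kPageSize) };
}
```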
2.3 Using a Linked List with Variable Sized Pages
The previous solution fixed our data paging problems - but we would still need to shift data to maintain the 8K page size. So what if we allow variable-sized pages? Then insertion no longer requires shifting the rest of the data.
Unfortunately, this makes address computation much more complicated. A linear search across the linked list would work, but binary search would not if the pages contain only a length and a data segment. To make binary search possible, each page would need an address field - but maintaining those fields requires O(N) upkeep every time a byte is inserted or deleted. In short, you still wind up doing a linear traversal of the list. That raises the question: can we do better?
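The linear lookup this forces can be sketched as follows (illustrative names; assumes each page stores only its length):

```cpp
#include <cstddef>

// With variable-sized pages and no address fields, finding the page that
// contains an address means walking the list and summing lengths --
// O(number of pages) per lookup.
struct VarPage {
    std::size_t length;  // bytes in this page
    VarPage*    next;
};

// Returns the page containing `address` and the offset within it via
// `offset_out`; returns nullptr if the address is past the end of the data.
VarPage* find_page(VarPage* head, std::size_t address, std::size_t* offset_out) {
    for (VarPage* p = head; p != nullptr; p = p->next) {
        if (address < p->length) {
            *offset_out = address;
            return p;
        }
        address -= p->length;  // skip this page's bytes
    }
    return nullptr;
}
```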
2.4 Variable Sized Pages and a Weight Tree
To make the last case have faster searching capabilities, we could use an additional data structure, such as a binary tree to store the weights/lengths of the pages. For simplicity, we assume it is a full tree. The tree could work as follows:
- A leaf node's weight is the length of the data segment in its corresponding page in the linked list. Leaf nodes also store either an index or a pointer to their corresponding page.
- An inner node's weight is the sum of the weights of its children.
This requires that any time the length of a page in the linked list changes, the weight of its corresponding leaf node in the tree must also be updated. Additionally, all direct ancestors of that leaf node must also have their weights updated.
The benefit of this solution is that finding which page an address belongs to requires only traversing the height of the tree. Insertion is easy, deletion is easy, and searching is now much quicker. Unfortunately, a non-full tree complicates the situation, since we would like all leaf nodes to reside on the same level.
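A sketch of the weight-tree operations described above, assuming a full binary tree (the node layout and names are invented for illustration):

```cpp
#include <cstddef>

// Inner nodes store the total weight of their subtree. To find the page
// holding an address, compare against the left child's weight at each
// level -- O(height) instead of O(pages). Weight updates are also O(height).
struct WNode {
    std::size_t weight;        // leaf: page length; inner: sum of children
    WNode* left   = nullptr;
    WNode* right  = nullptr;
    WNode* parent = nullptr;
    int    page   = -1;        // leaf only: index of its page in the list
};

// Find the leaf containing `address`; `offset_out` receives the offset
// within that leaf's page.
WNode* find_leaf(WNode* node, std::size_t address, std::size_t* offset_out) {
    while (node->left != nullptr) {         // descend until we hit a leaf
        if (address < node->left->weight) {
            node = node->left;
        } else {
            address -= node->left->weight;  // skip the left subtree's bytes
            node = node->right;
        }
    }
    *offset_out = address;
    return node;
}

// After a page's length grows by `delta`, update its leaf and every
// ancestor up to the root.
void add_weight(WNode* leaf, std::size_t delta) {
    for (WNode* n = leaf; n != nullptr; n = n->parent)
        n->weight += delta;
}
```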
2.5 Variable Sized Pages and a B+ Tree
Our problems with not having all leaf nodes on the same level are solved by use of a B+ tree. B+ trees extend the idea of a balanced search tree to higher branching factors, and allow for both random access and sequential processing of their data. A B+ tree ensures that all leaf nodes are on the same level and that the leaf nodes double as a linked list. B+ trees are admittedly not as simple as binary trees, but have the added benefit of reducing the height of the search tree. Lower tree height means faster address lookup (and faster address updates when page sizes change), as well as less memory usage due to fewer internal nodes. This also simplifies our solution, in that we no longer need separate data structures for the linked list and the tree.
2.6 HxTree - Hexane's Internal Data Structure
Hexane's internal tree is almost exactly that described in section 2.5. However, the HxTree is not a stock B+ tree. A stock B+ tree stores keys and values in the leaf nodes; keys are duplicated into the inner nodes during insert() and delete(), and are used by search() to decide which subtree to visit. Since an HxTree does not store sorted data, these fields are essentially useless, and were not included in the implementation. Instead, inner nodes store a weight and a count of the leaf nodes below them (either directly or indirectly). Leaf nodes store arrays of pointers to pages, along with their associated weights. All nodes store parent pointers. Additionally, the leaf nodes form a doubly-linked list instead of the usual singly-linked list.
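The node layouts described above might be declared roughly as follows. This is a hypothetical reconstruction from the description, not Hexane's actual code; the branching factor and all field names are assumptions:

```cpp
#include <cstddef>

constexpr int kOrder = 16;           // assumed branching factor

struct HxPage;                       // a page of file data (opaque here)
struct HxInner;

// Leaf nodes hold arrays of page pointers and their weights, and are
// chained into a doubly-linked list for sequential traversal.
struct HxLeaf {
    HxPage*     pages[kOrder];
    std::size_t weights[kOrder];     // length of each page
    int         count;               // entries in use
    HxInner*    parent;
    HxLeaf*     prev;
    HxLeaf*     next;
};

// Inner nodes hold no keys -- just a subtree weight and a leaf count.
struct HxInner {
    std::size_t weight;              // sum of all page lengths below
    std::size_t leaves;              // number of leaf nodes below
    HxInner*    parent;
    // children would be HxInner* or HxLeaf*, e.g. via a tagged union
};

// A leaf's contribution to its parent's weight is the sum of its pages.
std::size_t leaf_weight(const HxLeaf& l) {
    std::size_t w = 0;
    for (int i = 0; i < l.count; ++i) w += l.weights[i];
    return w;
}
```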
3. Other Engine Features
3.1 Undo System
Hexane's undo/redo system is pretty much the usual implementation. There is an UndoStack and a RedoStack. When the buffer is modified, an Action object is pushed onto the undo stack. This object records what type of edit was performed - Insert, Change, or Delete - along with the corresponding data and the address of the change. When an action is undone, it is popped from the UndoStack and pushed onto the RedoStack. When an action is redone, it is popped from the RedoStack and pushed back onto the UndoStack. One important thing to remember about the address in an Action object is that it is relative to the state of the file at that point in time. That is, suppose an insert is done at address $X, followed by an insert at address $W, with $W < $X. Technically, the data that was at address $X has now moved - but the undo system is oblivious to the change. This is safe because the UndoStack is a LIFO device (Last In, First Out): the change at address $X cannot be undone until the change at $W is undone first, at which point the address $X is correct again.
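The stack discipline above can be sketched like this (the names are illustrative; clearing the redo stack on a fresh edit is a common convention that the text does not specify):

```cpp
#include <cstdint>
#include <stack>
#include <vector>

// Each edit records an Action; undo moves it to the redo stack and vice
// versa. Addresses are relative to the buffer state at the time of the
// edit, which is safe precisely because the stacks are LIFO.
enum class EditType { Insert, Change, Delete };

struct Action {
    EditType type;
    std::uint64_t address;       // where the edit happened
    std::vector<char> data;      // bytes inserted/overwritten/removed
};

class UndoSystem {
public:
    void record(Action a) {
        undo_.push(std::move(a));
        redo_ = {};              // assumed: a fresh edit invalidates redo history
    }
    bool undo(Action* out) {
        if (undo_.empty()) return false;
        *out = undo_.top(); undo_.pop();
        redo_.push(*out);
        return true;
    }
    bool redo(Action* out) {
        if (redo_.empty()) return false;
        *out = redo_.top(); redo_.pop();
        undo_.push(*out);
        return true;
    }
private:
    std::stack<Action> undo_, redo_;
};
```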
3.2 Paging and Prebuffering
Hexane allows for editing of large files. To accomplish this, the whole file is not loaded into RAM unless it is smaller than the active window size, defined as the number of pages allowed in RAM multiplied by the page size. One difficulty in implementing the paging system is that Hexane allows for non-constant page lengths - that is, the user can insert and delete bytes in a page without data being shifted in the following pages. This is in contrast to most operating systems and other paging systems, which use fixed-size (instead of variable-sized) pages. To work around this, a page can be in one of three states: Unmodified, Modified and in RAM, or Modified and not in RAM. Unmodified pages are the easiest to handle: if an unmodified page is to be evicted from RAM, its contents are simply discarded, because they can be re-read directly from the original file. When a modified page is to be evicted, it is written to $TMPDIR/$FILENAME.$PAGENUMBER. When pages are loaded back into RAM, if the corresponding page-file exists in $TMPDIR, it is loaded from there instead of being read from the original file. This accounts for the fact that we cannot write variable-sized pages back into the original file, since we could potentially overwrite (and therefore lose) data.
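The three page states and the eviction rule can be sketched as follows (the helper names are hypothetical; only the $TMPDIR/$FILENAME.$PAGENUMBER naming comes from the text):

```cpp
#include <string>

// A page is either unmodified (cheap to evict: just discard and re-read
// from the original file later), modified and resident, or modified and
// swapped out to its own page-file.
enum class PageState { Unmodified, ModifiedInRam, ModifiedOnDisk };

// Where a modified page is written on eviction: $TMPDIR/$FILENAME.$PAGENUMBER
std::string swap_path(const std::string& tmpdir,
                      const std::string& filename, int page_number) {
    return tmpdir + "/" + filename + "." + std::to_string(page_number);
}

// State transition on eviction (the actual file I/O is elided here).
PageState evict(PageState s) {
    switch (s) {
        case PageState::Unmodified:    return PageState::Unmodified;      // drop contents
        case PageState::ModifiedInRam: return PageState::ModifiedOnDisk;  // write swap file
        default:                       return s;                          // already on disk
    }
}
```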
Hexane also has a prebuffering system to make scrolling seem smoother. The basic idea is that when the engine is instructed to seek(), it reads two pages ahead and keeps them in a page cache. Similarly, pages about to be evicted from memory are first pushed into the page cache. The cache holds only four pages - two in each direction of the active window. When pages are added to the cache, they force the older (i.e., farther away) pages to be paged out to disk. This makes seeking through the data smoother.
4. Planned Features
These features are currently in the planning stages, and are listed in no particular order.
4.1 Process Editing
Process editing can be accomplished using the ptrace() function. Basically, "opening" a process's memory space is just a special case of opening a file - except that the data can only be read and written one word at a time, instead of one page at a time. Planned functionality is to allow opening, editing, and re-writing of a process's memory space. Additionally, support for pausing/resuming the process is planned.
4.2 Digest Algorithms
The option to calculate various digests and cryptographic hashes is also planned. This will be accomplished by use of the freely available Botan library.
4.3 Data Operations
Support for binary data operations (AND, OR, XOR, ADD, SUB, DIV, MUL, etc…) is planned.
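As a sketch of one such operation - a hypothetical example, since these features are only planned - applying XOR with a repeating operand over a selection might look like:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Apply XOR with a repeating key across a run of bytes. The same shape
// would serve for AND, OR, ADD, and the other planned operators.
std::vector<std::uint8_t> xor_with(const std::vector<std::uint8_t>& data,
                                   const std::vector<std::uint8_t>& key) {
    std::vector<std::uint8_t> out(data.size());
    for (std::size_t i = 0; i < data.size(); ++i)
        out[i] = data[i] ^ key[i % key.size()];  // key repeats over the data
    return out;
}
```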
4.4 Engine Scripting
The engine will be scriptable in some programming language. Languages are currently being investigated; good candidates are Lua, Python, Ruby, Squirrel, and Ferite. Ideally, the language should be easy to bind to the engine, for example using SWIG or something similar.
4.5 Data Template/Structure System
A mechanism for defining data structures, and mapping those structures onto the currently open file, is in planning. Ideally, structure definitions and engine scripts would use the same language, to simplify the engine. A C-like syntax for defining structures would be familiar to many people; however, it is possible that a cleaner, more elegant syntax could be used.
5. Further Reading
If you would like to know more about the technical details of the engine, the author suggests that interested readers do the following:
- Read up on B+ trees. Unfortunately, few textbooks cover these structures in detail, but several online resources provide a good overview.
- Read the source code to the engine. There are few comments in hxtree.cpp, but well-equipped readers should have little trouble understanding what is going on.