[Date Prev][Date Next][Thread Prev][Thread Next][Author Index][Date Index][Thread Index]

Re: Presentation on Xanalogical Data Structures



On Tue, 2004-07-06 at 11:32, Steve Witham wrote:

> The stuff you said about the different large components (block server,
> Mozilla pluggin, etc.), the different
> small components (tumblers, blocking, enfilades), how they fit
> together, and how (briefly) they differ from the Green implementation
> and Ted's Silverstands vision, are good stuff and help to motivate
> the lower-level stuff.  It's good to have an idea of the uses of
> something and also an alternative design, when you're looking at
> a design.
> 
> When you say disk blocking... are you talking about the B-tree
> stuff?  Or, if you let someone else's library handle blocking,
> does that mean you are just using binary trees, and letting it
> group some of the binary nodes into blocks?

Yes, in Green they implemented their own object persistence mechanism,
with binary nodes (one datum per node with sibling links) arranged into
an N-way tree (multiple nodes per level).  They then worked hard to get
those related nodes to pack together into fixed-size disk blocks.  And
then they designed a caching system below that to hold the disk blocks
as they came in from disk.

I've usually seen it as an N-way tree the way a B-tree works; each node
holds some number of keys/datums, and the nodes -are- the disk blocks as
well.  The extra layer of tracking nodes (crums) and the blocks (loafs)
they belong in seems awkward, although it does provide an extra layer of
virtuality if you haven't finalized your tree structures.

A difference then is when the WID operator is invoked on a layer of the
tree; in their approach you have to navigate over a linked mesh of nodes
to sum their WID attributes; in a B-tree you only have to iterate over
an array of WID attributes.  The traversal of linked nodes can have
performance issues in a virtual memory environment that arrays don't
have.

In Python I'm storing N tuples (key -> datum) per tuple (enfilade node),
and making them all immutable.  The nodes are variable length on disk. 
It also means as you change the trees you generate more nodes and
consume more disk space.  But disk space is cheap, immutability solves a
lot of locking issues, and the code is easier to understand.  Fixed
length blocks are preferred when you're rewriting in-place; variable
length blocks are good for retaining old versions.

> I guess what I'm saying is that to me, the choice of binary
> trees vs. 2-3 trees vs. trees with wide branching, and whether
> & how the tree is balanced, is one set of issues,
> and putting the nodes onto disk blocks is another,
> and I'm not sure whether you're grouping both under "disk blocking."

Ah, so you would agree with the Green approach of separating the choice
of tree structure from disk storage?

> Anyway, you should make a bunch of slides, then do an audio recording
> of yourself practicing presenting the slides (if not a recording
> of your performance at the conference).  Then you can post pdf's of
> the slides and nice compact mp3 of the audio... right?

That's an idea.  I've never done such but perhaps I will. ;-)

-Jeff