[Community] Designing new types of Rtree indexes

Sean Gillies sean.gillies at gmail.com
Thu Jul 30 14:14:33 EEST 2009


On Jul 29, 2009, at 4:06 PM, Howard Butler wrote:

>
> On Jul 29, 2009, at 8:31 AM, Sean Gillies wrote:
>>
>> Yes, but there's some inevitable requirement for C code here. We're
>> just choosing to have a more reusable C API instead of a single-
>> purpose Python C extension.
>>
>
> Yep.  The audience of ctypes-capable folks is also larger than the
> audience of C/C++ python extension developers.  Hopefully this will
> result in more contribution, but if it doesn't we haven't lost  
> anything.
>
>>> - It only supported changing the pagesize parameter.   
>>> libspatialindex
>>> has nearly 20 properties that can be set/tweaked.
>>
>> I don't fully understand how to productively tweak them, yet, but I
>> appreciate this.
>>
>
> I'm awaiting a document from Marios describing the basic ideas, but
> one can find out the basics by reading the papers that the code is
> based on.  It is a fairly straightforward mapping.  The most
> significant knobs for us to tweak are the index_capacity and
> leaf_capacity.  For my scenario of inserting lots of points,
> performance is quite sensitive to these.  The various *pool_capacity
> parameters can be increased if you have more capable hardware.  These
> can improve insert and query performance.
>
>>> - It didn't allow storage of objects in the tree (ie, clustered
>>> index), only ids.
>>
>> Using only integer ids has been a design decision, not a fault. Rtree
>> was originally intended to support spatial catalog search of
>> documents, identified by integers, that could be stored *anywhere*.
>> Storage agnosticism. I think this makes Rtree quite reusable. I guess
>> I don't understand what you mean by a clustered index here.
>>
>
> Well, a limitation at least.  A clustered index in libspatialindex
> parlance just means storing the data inside the rtree.  In the new
> code, I have preserved id-only queries, and according to the
> benchmarks.py script, performance change is negligible (actually it is
> better for the disk-based indexes with the new code).
>
> You would use the intersection_obj or nearest_obj to have it return
> pickles back to you.  This will obviously have a performance cost.
> The question is whether or not that cost is small enough and where to
> take the hit.  In some cases it will be worth it.  It will also be
> handy for those looking for a no fuss, spatially-aware storage bucket.

Let's separate the search index rtree was originally designed to be  
from this new storage bucket you're proposing. I think they'll have  
very different users. Separate classes, different (though related)  
interfaces. No unnecessary constraints on each other. I'm not opposed  
to their being in the same namespace, but different modules like  
rtree.index and rtree.storage would be good, IMO.

>
>
>>> - Point storage.  insert will determine that the min and max values
>>> are the same, and we will insert a SpatialIndex::Point instead of a
>>> SpatialIndex::Region
>>
>> I'd prefer to make this more explicit and dodge numerical issues at
>> the same time. Say, Region if the user inserts a 4-tuple, and Point  
>> if
>> the user inserts a 2-tuple.
>
> What about a 6-tuple, 8-tuple, or a k-tuple (we support kD indexes
> now)?  The code currently checks using
> std::numeric_limits<double>::epsilon to determine the min and max are
> equivalent, and if any of the tests fail, inserts a region.

Okay.

>
>>>
>>> - Bulk insertion
>>>
>>
>> Yay for bulk insertion. Could reindexing be part of this same  
>> feature?
>>
>
> What do you mean by reindexing?  I don't have bulk insertion done yet,
> but the C API will work by using callbacks.  If there were a
> straightforward and common tree-walking query that people need, this
> could be implemented as well.

Say I've bulk loaded, but then reindex items here and there over time.  
If I understand it, this eventually degrades the index. I'd like to be  
able to reindex everything with the same efficiency as a bulk load.

I'm not sure how you want to use callbacks. Wouldn't you just pass an  
iterator to the bulk loading function?

>
>>
>> I'd like to discuss how we're going to test the C API. Test it via
>> Python in the old mapscript style, or write tests in C?
>
> For libLAS, I do mapscript-style, with Python doctests doing the
> work.  Much easier to read and add to than any of the c unit testing
> frameworks.  Coverage-wise, I think http://svn.gispython.org/svn/gispy/Rtree/trunk/tests/index.txt
>  and http://svn.gispython.org/svn/gispy/Rtree/trunk/tests/properties.txt
>  provide 100% C API coverage right now.  I'd be open to moving to
> entirely C tests, but I admit to it being a low priority given what
> has already been developed.

+1 on testing with python.

--
Sean Gillies
Programmer
Institute for the Study of the Ancient World
New York University




More information about the Community mailing list