Optimization strategies for traversals in Neo4j
This article is also available in French as « Stratégies d’optimisation des traversées dans Neo4j ».
This article is a follow-up of “Performance of traversals in Neo4j”.
The story could have ended after my analysis of the evolution of traversal times, but I still had a real problem: users were observing degraded response times, and it was difficult to argue that it was for the greater good and better scalability in the future.
Going back to the initial problem, the cost of operations in the Neo4j core API (`Node::hasLabel`, for example) exploded compared to the version using a cache, and now represents a significant part of the execution time. What can be done to alleviate that cost?
Preliminary analysis
The task is probably bound by reads on disk, since we’re in a database. The best optimization would be to remove some parts of the task if they aren’t actually needed, instead of trying to make them run faster.
What exactly is our traversal doing? It goes through each node in the tree: the `Evaluator` component checks whether the node is of type `B` and whether the value of its property is `true`, and always continues the traversal; the `PathExpander` component then checks whether the node is of type `A` or `B` to fetch the appropriate outgoing relationships. That means between 2 and 3 calls to `Node::hasLabel` per node, whether it’s of type `A` or `B`. There’s also one call to `Node::getRelationships` per node to expand the traversal, and one call to `Node::getProperty` per node of type `B`. It would be difficult to do less functionally, and there’s unfortunately no code to cut.
![Flowchart of the traversal per node](//images.contentful.com/95wnqgvmhlea/7jJnD0k0ggGW0KickYsw06/4764d5f96f0374fee4e5769e632ab4c6/evaluator-td-en.png)
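To make the per-node work concrete, here’s a minimal sketch of what such an `Evaluator` and `PathExpander` could look like with the Neo4j 2.3 API; the property key and relationship type names are assumptions of mine, not the actual code of the example.

```java
import org.neo4j.graphdb.*;
import org.neo4j.graphdb.traversal.BranchState;
import org.neo4j.graphdb.traversal.Evaluation;
import org.neo4j.graphdb.traversal.Evaluator;

// Sketch only: "flag" and "CHILD" are hypothetical names for illustration.
final class TraversalComponents {

    static final Label A = DynamicLabel.label("A");
    static final Label B = DynamicLabel.label("B");
    static final RelationshipType CHILD = DynamicRelationshipType.withName("CHILD");

    // Evaluator: include nodes of type B whose property is true, always continue.
    static final Evaluator EVALUATOR = path -> {
        Node node = path.endNode();
        boolean include = node.hasLabel(B)                                 // 1 hasLabel call per node
                && Boolean.TRUE.equals(node.getProperty("flag", false));   // getProperty only for type B
        return include ? Evaluation.INCLUDE_AND_CONTINUE : Evaluation.EXCLUDE_AND_CONTINUE;
    };

    // PathExpander: only expand outgoing relationships from nodes of type A or B.
    static final PathExpander<Object> EXPANDER = new PathExpander<Object>() {
        @Override
        public Iterable<Relationship> expand(Path path, BranchState<Object> state) {
            Node node = path.endNode();
            if (node.hasLabel(A) || node.hasLabel(B)) {                    // 1 or 2 more hasLabel calls
                return node.getRelationships(Direction.OUTGOING, CHILD);   // 1 getRelationships call
            }
            return java.util.Collections.emptyList();
        }

        @Override
        public PathExpander<Object> reverse() {
            throw new UnsupportedOperationException();
        }
    };
}
```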
Let’s use a profiler to find out the cost distribution of the calls to `Node::hasLabel` during a traversal. I’m exploring incrementally, using the various modes of YourKit.
First, the CPU sampling mode, which only has a small overhead. `Node::hasLabel` (or more precisely, its implementation `NodeProxy::hasLabel`) is measured as contributing 19% of the traversal, with its internal calls distributed as follows:
- 16% in `OperationsFacade::labelGetForName`, to get the label index from its name
- 84% in `OperationsFacade::nodeHasLabel`, which itself calls a single other method, which calls another single one, etc., with a decreasing relative cost (each method has its own cost, added to the cost of what it calls). The descent ends in methods such as `NodeStore::loadRecord` or `MuninnReadPageCursor::next`, which do look like disk reads (load, read page), so it’s consistent with what we expected.
![Call tree sampled by YourKit](//images.contentful.com/95wnqgvmhlea/5yGjoDtJbUUSYySaa42wKu/02eceee922eb7b5119f9248d5d28aa3c/hasLabel-baseline-sampling-1.png)
Let’s then use the CPU tracing mode, which is more invasive but gives us the number of calls and a finer measure of the method durations. `Node::hasLabel` now represents 28% of the traversal (observation changes the measure), and its cost distribution is:
- 7% in `OperationsFacade::labelGetForName`
- 73% in `OperationsFacade::nodeHasLabel`
- 19% in its own code
- 1% elsewhere
It’s still data reads, but we now know that there are 3 million calls to `Node::hasLabel`, which matches the expected 2 to 3 calls per node.
![Call tree instrumented by YourKit, without capturing the memory allocations](//images.contentful.com/95wnqgvmhlea/1ZZhmTCdBaC0om8k2ECAWm/f0b6f23bd7513e3a37fd64ad0ae90937/hasLabel-baseline-tracing-2.png)
Finally, by enabling both the CPU tracing mode and the capture of memory allocations, we get a view combining the CPU cost with the memory flow, at the price of a higher bias and execution time. `Node::hasLabel` now contributes 42% of the traversal:
- 2% in `OperationsFacade::labelGetForName`
- 92% in `OperationsFacade::nodeHasLabel`
- 6% in its own code
![Call tree instrumented by YourKit, when capturing the memory allocations](//images.contentful.com/95wnqgvmhlea/3UNmqLhKRaWCau4yIUM2eM/8b41ddaed981239e1b2dbf10e4b4b609/hasLabel-baseline-tracing-mem.png)
Obviously, reads still constitute most of the cost. Memory allocations for the request are measured at 76 MB, distributed as follows:
- 615,155 instances of `long[]`, i.e. 14 MB
- 307,591 `InlineNodeLabels` and as many `Bits`, i.e. 7 and 9 MB respectively
- 139,822 `int[]`, i.e. 3 MB
- 139,818 `NodeProxy`, `NodeProxy$2` and `TraversalBranchImpl`, i.e. 3, 3 and 5 MB respectively
- 139,808 `RelationshipType[]`, `RelationshipConversion` and `RelationshipProxy`, i.e. 3, 4 and 4 MB respectively
- etc.
![Memory allocations captured by YourKit](//images.contentful.com/95wnqgvmhlea/6d21vGGg2k8oKIg6IIKa04/4f726a7797d0fcec80e7c8eadf7cfed7/hasLabel-baseline-mem.png)
100% of the `long[]` instances allocated during the request are actually caused by `Node::hasLabel`! If we can limit the number of calls to that method, there is potential to reduce allocations.
Potential optimizations
Labels
With 2 to 3 calls to `Node::hasLabel` per node, 1 to `Node::getRelationships` and 0 or 1 to `Node::getProperty`, we can’t do much for a single request. However, inspired by an article by Max de Marzi (@maxdemarzi, Support Engineer at Neo Technology) about speeding up traversals, we can explore the possibilities if we cache data across all requests. In the real application, we have around 10 calls to `Node::hasLabel` per node anyway, so there would be more opportunity to amortize the cost of even a cache local to the request.
I’ll try to replace disk reads and the related allocations with another type of allocation, and see whether there’s a benefit.
The simplest approach would be to cache with a composite key based on the node and the label, but it involves a lot of objects just to get a simple boolean, and we might do better. Looking at the implementation of `Node::hasLabel` (subject to change, though), we can observe a simple iteration over the labels until the requested one is found: it doesn’t use a more sophisticated structure (inverted index, hash map, etc.). Therefore, if a node has *n* labels, the search for a label will statistically iterate over *n/2* labels before stopping, and 2 calls to `Node::hasLabel` will iterate over the equivalent of *n* labels, which is what `Node::getLabels` does. Of course, `Node::getLabels` adds an overhead since it creates the collection of labels instead of returning a simple boolean, but if `Node::hasLabel` is called more than twice per node on average, using `Node::getLabels` to get all the labels once and for all and store them in a structure optimized for lookups (a `Set`, for example) could improve the situation.
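As a minimal sketch of that idea (the class and its names are mine, not from the example): read the labels once through `Node::getLabels`, then answer the subsequent checks from memory.

```java
import java.util.HashSet;
import java.util.Set;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;

// Minimal sketch: read a node's labels once, then answer hasLabel-style checks from memory.
final class NodeLabels {
    private final Set<String> names = new HashSet<>();

    NodeLabels(Node node) {
        for (Label label : node.getLabels()) {   // single pass over the node's labels
            names.add(label.name());
        }
    }

    boolean has(Label label) {
        return names.contains(label.name());     // in-memory lookup, no further store access
    }
}
```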
Actually, a `Set` is fine, but why not go even further? What’s a `Label`? Its API is really simple:

```java
public interface Label {
    String name();
}
```
It looks just like an `Enum`, and that’s actually a known implementation pattern:

```java
enum Labels implements Label { A, B, Root }
```
As a `Set`, we could specifically use an `EnumSet`, which is more efficient as it’s based on the ordinal values of the enum. Continuing in that direction, a label present in the enum is equivalent to an index (its ordinal value), which can be encoded using bit masking: the *n*-th bit, set to 1 or 0, represents the presence or absence of label *n* on the node. That’s exactly what I’m going to do: since we only have 3 labels, we can do bit masking on a `byte` and cache the labels of a node (or more precisely, of its Neo4j identifier, a `long`) in that `byte`.
Since we’re already using fastutil in the application to get primitive collections (which use less memory than their object counterparts), I’ll use a `Long2ByteOpenHashMap` for the cache.
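Here’s a minimal sketch of what such a cache could look like, assuming the `Labels` enum shown above and ignoring concurrency concerns (the class and method names are mine):

```java
import it.unimi.dsi.fastutil.longs.Long2ByteOpenHashMap;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;

// Minimal sketch of a permanent label cache keyed by node id, with the labels
// encoded as a bit mask in a byte (enough for our 3 labels).
final class LabelCache {

    private final Long2ByteOpenHashMap masks = new Long2ByteOpenHashMap();

    boolean hasLabel(Node node, Labels label) {
        long nodeId = node.getId();
        byte mask;
        if (masks.containsKey(nodeId)) {
            mask = masks.get(nodeId);              // cache hit: no disk-backed read
        } else {
            mask = readMask(node);                 // cache miss: one read of all the labels
            masks.put(nodeId, mask);
        }
        return (mask & bit(label)) != 0;
    }

    private static byte readMask(Node node) {
        byte mask = 0;
        for (Label label : node.getLabels()) {
            // assumes every label on these nodes is declared in the Labels enum above
            mask |= bit(Labels.valueOf(label.name()));
        }
        return mask;
    }

    private static byte bit(Labels label) {
        return (byte) (1 << label.ordinal());      // bit n <=> label of ordinal n
    }
}
```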
Properties
As properties can be of a number of types (boolean, number, string, or an array of one of these types), we can only cache the read properties in a `Map<String, Object>`.
However, I can use a trick based on the fact that there’s only one property per node in the example, and use `Collections::singletonMap` to cache the first property, falling back to a `HashMap` once there are 2; it’s much more compact in memory.
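A minimal sketch of that trick (again, the class and method names are mine):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Minimal sketch: keep the common one-property case in an immutable singleton map,
// and only switch to a HashMap when a second property shows up.
final class CachedProperties {
    private Map<String, Object> properties = Collections.emptyMap();

    void put(String key, Object value) {
        if (properties.isEmpty()) {
            properties = Collections.singletonMap(key, value);  // compact single entry
        } else if (properties.size() == 1) {
            Map<String, Object> grown = new HashMap<>(properties);
            grown.put(key, value);
            properties = grown;                                  // fall back to a mutable HashMap
        } else {
            properties.put(key, value);
        }
    }

    Object get(String key) {
        return properties.get(key);
    }
}
```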
Relationships
Caching relationships is much more complex: even the simplest solution, caching only the relationship identifiers and reloading them on the fly, is not that easy to implement, with limited potential gains. I decide not to implement it, as it’s not even a scenario that can be optimized in the application, where the same relationship is hardly ever traversed more than once.
Measuring the effects
All this is implemented in a new branch of the example, using facades to allow independent activation of the caches (via query parameters).
All measurements are done on a Neo4j Enterprise 2.3.2 instance with a 4 GB heap, and a traversal in depth-first mode. Neo4j runs on my 15-inch, mid-2014 MacBook Pro, with a 4-core Intel Core i7 2.2 GHz processor and 16 GB of RAM, under OS X 10.11.4 (El Capitan) and a 64-bit Oracle JVM (1.8u92).
The label cache gives a higher gain than the property cache, and combining the two improves the response times by around 30%.
Scenario | Reference | Label cache | Property cache | Both caches |
---|---|---|---|---|
Mean (ms) | 1540 | 1209 | 1387 | 1067 |
50% (ms) | 1547 | 1207 | 1384 | 1053 |
90% (ms) | 1693 | 1229 | 1427 | 1100 |
95% (ms) | 1716 | 1241 | 1432 | 1192 |
Maximum (ms) | 1757 | 1284 | 1447 | 1248 |
Used heap (MB) | 24 | 42 | 40 | 58 |
Allocations per request (MB) | 76 | 46 | 73 | 43 |
![Histogram of the response times for the 4 scenarios combining the caches, showing the best improvement with both active](//images.contentful.com/95wnqgvmhlea/7yhXIJu9Oguik8SQcqIMMK/3d4511c7bafd9e2a70fb21cb6372f325/optims-histo-en.png)
The effect is largely positive, with a limited increase in used memory. The allocation of the permanent caches during the first request (18 MB for the label cache, 16 MB for the property cache) is more than compensated by the reduced allocations of every subsequent request: the label cache has the biggest impact, eliminating 30 MB of allocations (about 40%), whereas the property cache only eliminates 3 MB. The memory pressure is reduced, there are fewer garbage collections, fewer pauses, etc.
![Position of the 4 scenarios by static and dynamic memory usage](//images.contentful.com/95wnqgvmhlea/1r8JwK3FsAMcEkmuKi8m4m/5531e0d947025ff7f7fc9456cd2c07bc/optims-labels-en.png)
How about not modifying code?
When you’re using a platform, another strategy is possible: take advantage of the low-level optimizations of that platform, when they’re available. Since we migrated to version 2.3, Neo4j 3.0 has been released. It’s a new major version, with both functional and technical evolutions. Independently of our application’s upgrade plan to 3.0, let’s have a look at what it brings, using the example.
Actually, we’re building on a second platform: the JVM. I neglected that aspect when we migrated from version 2.1 to 2.3: the garbage collector (GC) configured by default in Neo4j changed from Concurrent Mark Sweep (CMS) to Garbage First (G1), and we also used the migration to upgrade from JDK 7 to JDK 8. However, even if G1 is supposed to supersede CMS, a lot of users are experiencing mixed results (like Elasticsearch or its users, for example), and it has seen lots of fixes up to recent versions of the JDK. We have a synthetic benchmark, not representative of a production situation, but I’ll use it anyway to explore how the GC algorithm affects it.
As during the comparison between versions 2.1 and 2.3, the same jar containing the unmanaged extension is deployed on different versions of Neo4j Enterprise with the same configuration: the heap is still at 4 GB, and the traversal is done in depth-first mode, with no active cache.
If we focus on the cases using G1, a degradation could be observed with version 3.0.0-RC1 (Release Candidate 1), but the final version actually shows a slight improvement in response times over 2.3.2. A spectacular gain can be seen starting with version 3.0.2, where response times are reduced by around 25%. Version 3.0.3 even manages to score a little better.
However, when comparing G1 and CMS, CMS gives better response times in almost all cases, with a huge difference in 2.3.2, where response times are 25% lower. Since the results converge in 3.0.2 and 3.0.3, we can imagine that the improved performance comes partly from reduced allocations, which would decrease the weight of the garbage collector in the equation.
Scenario | 2.3.2 G1 | 2.3.2 CMS | 3.0.0 RC1 G1 | 3.0.0 RC1 CMS | 3.0.0 G1 | 3.0.0 CMS | 3.0.1 G1 | 3.0.1 CMS | 3.0.2 G1 | 3.0.2 CMS | 3.0.3 G1 | 3.0.3 CMS |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Mean (ms) | 1438 | 1100 | 1521 | 1461 | 1367 | 1281 | 1406 | 1252 | 980 | 1027 | 1067 | 993 |
50% (ms) | 1395 | 1098 | 1517 | 1429 | 1365 | 1282 | 1386 | 1251 | 980 | 992 | 1067 | 992 |
90% (ms) | 1596 | 1131 | 1548 | 1616 | 1384 | 1294 | 1498 | 1264 | 990 | 1125 | 1075 | 1006 |
95% (ms) | 1619 | 1140 | 1556 | 1654 | 1391 | 1298 | 1560 | 1268 | 994 | 1134 | 1078 | 1010 |
Maximum (ms) | 1694 | 1168 | 1637 | 1735 | 1409 | 1323 | 1630 | 1305 | 1006 | 1163 | 1120 | 1049 |
![Histogram of the response times, by version and GC, showing a continuous improvement](//images.contentful.com/95wnqgvmhlea/1xd8tgEFeQQI2gqc4WYqO0/79de7350199022e0d6f5f7994715e324/versions-histo-en.png)
I haven’t compared the versions with a profiler (yet), but the engineers at Neo Technology have obviously done a good job, even on the minor 3.0.x versions.
Conclusion
Based on these results, I applied a slightly different label cache to the real application:
- it uses a larger type than `byte` (as we have more than 8 labels)
- the cache is local to the HTTP request, and not permanent as in the example
Making the cache local means we don’t have to manage its invalidation, its memory footprint, or the related pauses. The result is positive nonetheless, because a single node can receive around 10 label calls (to classify it, find which relationships to expand, filter it, etc.). The result is especially significant for allocations. With that single optimization, response times have been reduced by 23%, bringing them back to an acceptable level, even if they remain greater than the “perfect” measurement with Neo4j 2.2.
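For illustration only, a request-scoped variant could look like the sketch below, with the bit mask widened from a `byte` to a `long` to accommodate more than 8 labels; the `AllLabels` enum and all other names are assumptions, not the actual application code.

```java
import it.unimi.dsi.fastutil.longs.Long2LongOpenHashMap;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;

// Hypothetical enum standing in for the application's full set of labels
// (more than 8, hence the long mask instead of a byte).
enum AllLabels implements Label { A, B, Root /* ..., more labels in the real application */ }

// Request-scoped sketch: one instance is created per HTTP request and discarded with it,
// so there is no invalidation, no permanent footprint, and no related pauses to manage.
final class RequestLabelCache {
    private final Long2LongOpenHashMap masks = new Long2LongOpenHashMap();

    boolean hasLabel(Node node, AllLabels label) {
        long nodeId = node.getId();
        long mask;
        if (masks.containsKey(nodeId)) {
            mask = masks.get(nodeId);
        } else {
            mask = readMask(node);                 // single read of all the node's labels
            masks.put(nodeId, mask);
        }
        return (mask & (1L << label.ordinal())) != 0;
    }

    private static long readMask(Node node) {
        long mask = 0L;
        for (Label l : node.getLabels()) {
            // assumes every label is declared in the AllLabels enum
            mask |= 1L << AllLabels.valueOf(l.name()).ordinal();
        }
        return mask;
    }
}
```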
Finally, the lower allocations and the decreased GC pauses give us more stable response times, but the GC algorithm itself should not be neglected either. However, a single benchmark is not representative enough for that: G1 and CMS should really be compared with real traffic on the application before making a choice.