Apache Age: A Graph Extension for PostgreSQL (apache.org)
215 points by based2 on March 4, 2021 | 45 comments


Though I appreciate all the hard work people put into this and offer so generously for free, it saddens me a little to see yet another property graph database supporting a query language that isn't really standardized. I would really like to see a free, RDF-based triple store with good SPARQL support that can be used for serious production workloads. But all the open source activity seems to be in the property-graph camp, with a new product every couple of months, while the high-quality triple stores are all quite pricey.


Really, the market has spoken. In relational databases, tables with multiple columns aren't strictly necessary, but practically quite useful. In the same way, property graphs are more useful than triple stores, since common usage patterns want a collection of related properties a lot of the time.

Another way to put it is: it is straightforward to map a property graph to a triple store. In most cases, the property graph will have fewer nodes and edges and will operate faster and be easier to maintain.
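To make the comparison concrete, here is a toy sketch (in Python, with made-up data) of the same facts stored both ways: a property-graph node carries all its properties in one record, while a triple store needs one triple per fact.

```python
# Toy illustration only; names and data are made up.

# Property-graph representation: one node record holding all properties.
pg_node = {"id": "alice", "labels": ["Person"],
           "props": {"name": "Alice", "age": 34, "city": "Oslo"}}

# Triple-store representation: one (subject, predicate, object) per fact.
triples = [
    ("alice", "rdf:type", "Person"),
    ("alice", "name", "Alice"),
    ("alice", "age", 34),
    ("alice", "city", "Oslo"),
]

records_pg = 1              # one record in the property graph
records_rdf = len(triples)  # four triples for the same facts
print(records_pg, records_rdf)  # 1 4
```

The mapping in the other direction is just as mechanical, which is the point of the comment above: the property graph is usually the more compact of the two.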


Benchmarks consistently fail to show this performance advantage, at least fair ones. Which makes sense, because property graphs and RDF are very similar and mostly differ in syntax, i.e., stuff that good query planners and indexing schemes compile away.


What does it mean that "tables with multiple columns aren't strictly necessary"? Do you use tables with 1 column?


"One sequential primary key field is all the columns anyone will ever need in a database"

- Bill Gates


128 bits are enough for enumerating everything.

- 2017


2^128 is ~10^38. For frame of reference, there are ~10^22 atoms in a penny and ~10^50 atoms in the entire planet.

Alternatively, 585 years is "only" ~2^64 nanoseconds. 2^128 nanoseconds is on the order of 10^22 years, while the estimated current age of the universe comes in at a mere 10^10 years.

What sort of enumeration could you possibly do in practice on such a scale? (Allocation, on the other hand, is an entirely separate problem.)
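For anyone who wants to check the magnitudes above, a quick sanity check in Python:

```python
import math

# 2^128 in decimal orders of magnitude.
log10_2_128 = math.log10(2**128)
print(log10_2_128)              # ~38.5, so 2^128 is ~10^38

# How long is 2^64 nanoseconds? And 2^128?
ns_per_year = 1e9 * 60 * 60 * 24 * 365.25
years_2_64 = 2**64 / ns_per_year
print(years_2_64)               # ~584.5 years

years_2_128 = 2**128 / ns_per_year
print(math.log10(years_2_128))  # ~22, i.e. on the order of 10^22 years
```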


I can't think of any. But I think the historical lesson is that what I (nay, people much smarter than me) can think of today is insufficient for the greater tomorrow.

The ridiculous historical quotes for computer counts, disk size, RAM were all predicated on people not computing differently. As it turns out, technology progressed, and people began doing entirely new, unexpected things with computing.


For the record, most (not all) of those quotes are taken somewhat out of context. Even in 1950, it was trivial to come up with examples of sets with more than (for example) 2^32 elements.

When it comes to address space allocation (ie IPv6) I agree; I'm skeptical that 128 bits will prove to be sufficient. But as far as simple enumeration goes, 2^128 is unimaginably large and I don't see how changes in computing could possibly affect that assessment.

An example. Partition a 2^128 address space evenly between 2^64 individual computers (it's difficult to imagine humanity ever possessing anywhere near this many devices). Each computer does nothing more than visit each value in its segment of the address space sequentially. No additional computations, nothing, just visits it. At 1 value per nanosecond (ie 1 GHz) this otherwise pointless exercise requires approximately 585 years to complete.
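The same exercise, spelled out as a quick Python check:

```python
# The partition arithmetic from the comment above.
total_space = 2**128
machines = 2**64                       # far more devices than humanity owns
per_machine = total_space // machines  # 2^64 values per computer
rate_hz = 10**9                        # 1 value per nanosecond, i.e. 1 GHz

seconds = per_machine / rate_hz
years = seconds / (60 * 60 * 24 * 365.25)
print(round(years))  # ~585
```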


It means that everything a relational database is capable of (from CS theory) can be done with a single column of values per table. Note that each such table also has a "column" of primary keys; in other words, it's a simple K -> V mapping.

Also note that just because you can do something doesn't mean that it will be efficient, or that it will be enjoyable to work with.


> It means that everything a relational database is capable of (from CS theory) can be done with a single column of values per table. Note that each such table also has a "column" of primary keys

So, in other words, it's a two-column table (a primary key isn't some kind of virtual column, either in concrete databases or in relational theory).

Calling a table with two columns, one of which is the primary key, a one-column table is... just wrong.


I believe that they are referring to the EAV data model, which is maximally flexible in terms of schema modification. But performance, especially for queries (and also for DML), is atrocious.
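For anyone unfamiliar with EAV (entity-attribute-value), a toy Python sketch, with made-up entities and attributes, of why it is maximally flexible but painful to query: the whole database is one three-column "table", so any attribute can be added without a schema change, yet even a simple "row" fetch becomes a scan-and-regroup.

```python
# Toy EAV "database": every fact is an (entity, attribute, value) row.
eav = [
    ("user:1", "name", "Alice"),
    ("user:1", "email", "alice@example.com"),
    ("user:2", "name", "Bob"),
]

# Adding a brand-new attribute needs no ALTER TABLE equivalent:
eav.append(("user:2", "signup_year", 2021))

# But reconstructing one entity's "row" means filtering the entire list:
def get_entity(entity_id):
    return {attr: value for ent, attr, value in eav if ent == entity_id}

print(get_entity("user:2"))  # {'name': 'Bob', 'signup_year': 2021}
```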


This uses openCypher for queries, which is about to become an ISO Standard. In the mid-term the query language will be on equal footing with SQL.

http://www.opencypher.org/articles/2019/09/12/SQL-and-now-GQ...


I don't find SPARQL terrible, and have used it in commercial projects.

But RDF can't claim the "standardization" argument in good faith when RDF/SemWeb overshadowed Datalog/Prolog (based on a true ISO and community standard) for such a long time. SemWeb, like XHTML, SOAP/WS-*, and other W3C stuff, failed on the web and became an enterprise-y thing instead, the W3C being a pay-as-you-go org.


I don’t understand this argument. Several overlapping specs may exist. ISO specs weren’t built to take advantage of Web specs; W3C specs were. Like it or not, the W3C didn’t do any disservice to the ISO or fight unfairly.


Do you know Fuseki?


That’s a bit of a toy from the perspective of enterprise requirements. Good starter system if you adhere to open source religion.


What enterprise requirements does it lack?


This is based on AgensGraph: http://bitnine.net/downloads-2020/

I found this presentation from 2017 about AgensGraph: https://www.slideshare.net/mobile/kisung80/agensgraph-a-mult...



Fantastic that they are using Cypher. Love that language, if one can say that about a query language.


Cypher's pretty much the only thing about Neo4j that I found to be both pleasant to use and... well, any good, really. Love seeing it borrowed by other graph DBs. I'm far from being a SQL hater, but being able to bounce into Cypher to replace (at least some large subset of) recursive CTEs would be a huge developer-experience improvement for PostgreSQL as a multi-model DB.

Example from the n4j Cypher docs, for the curious:

    MATCH (user:User { name: 'Adam' })-[r1:FRIEND]-()-[r2:FRIEND]-(friend_of_a_friend)
    RETURN friend_of_a_friend.name AS fofName
Returns the names of friends-of-friends of the User node whose "name" property is "Adam", where friendships are FRIEND-labeled edges. Names like "friend_of_a_friend" are aliases for the matched nodes, like in SQL. () denotes a node, [] an edge. (It's been a while, so this explanation may be subtly wrong, but it's close.)
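Roughly what that pattern does, sketched over an in-memory adjacency list in Python (all names made up; this approximates Cypher's semantics rather than its real execution model):

```python
# Tiny undirected "FRIEND" graph, stored as an adjacency list.
friends = {
    "Adam": {"Beth", "Carl"},
    "Beth": {"Adam", "Dana"},
    "Carl": {"Adam", "Dana", "Eve"},
    "Dana": {"Beth", "Carl"},
    "Eve": {"Carl"},
}

def friends_of_friends(graph, user):
    """Two FRIEND hops out from `user`, excluding the start node."""
    result = set()
    for friend in graph[user]:
        for fof in graph[friend]:
            if fof != user:  # Cypher's relationship-uniqueness rule has a similar effect
                result.add(fof)
    return result

print(sorted(friends_of_friends(friends, "Adam")))  # ['Dana', 'Eve']
```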


I love Cypher too. And PostgreSQL! I wonder how this project compares to Neo4j.


Take a look at OrientDB’s OSQL.


Agreed, Cypher was the thing I loved most about neo4j.


Ever tried 4GL on IBM? Way worse than most.


It’s always nice to see such efforts around Postgres. I do think it’s very well suited to many needs aside from extreme scales that most won’t deal with.

In terms of graphs, there is also an implementation of TinkerPop, which allows using Gremlin, very different in nature from Cypher.

http://www.sqlg.org/docs/2.0.1/

NB: I believe Cypher can compile to bytecode that runs on the TinkerPop engine, which I found interesting.


How would this compare with something like pgRouting (https://pgrouting.org/)?


I read pgRouting as focusing on geospatial routing to get from point A to point B. Is there more to it than that?

A graph database is about storing the relationships between pieces of data, a social graph being one example. You'd have people and the relationships between them in the database.


Despite the description on its website, at core, there's nothing particularly geospatial about pgRouting. You don't need PostGIS, or even Postgres's built-in geo types (point, line, etc.), in order to use pgRouting.

Rather, pgRouting is a set of general graph- and path-search algorithms, exposed as procedures, that operate upon rowsets (most efficiently, upon indexed tables) of vertices and edges. You can use pgRouting to do SPARQL-like graph queries, or even full-blown network analysis, if you like.

In a previous job, I did just that: I loaded up social-network data into vertex and edge tables, and then I used pgRouting's implementation of Floyd-Warshall and Driving Distance to discover high-value potential social connections within a given relationship-weighted distance of a given user. Not as a one-time data-science thing, but as the backend of our service's matching engine, that ran every time a user refreshed their "candidate matches" page. It was pretty instantaneous.
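For readers who haven't met it, here is a minimal Python sketch of the Floyd-Warshall idea described above, applied to a tiny made-up "relationship-weighted" graph. (pgRouting's actual procedures run inside Postgres over vertex/edge tables; this only illustrates the algorithm.)

```python
INF = float("inf")

def floyd_warshall(nodes, edges):
    """All-pairs shortest path distances over undirected weighted edges."""
    dist = {a: {b: (0 if a == b else INF) for b in nodes} for a in nodes}
    for a, b, w in edges:
        dist[a][b] = min(dist[a][b], w)
        dist[b][a] = min(dist[b][a], w)
    # Relax paths through each intermediate node k.
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

nodes = ["ann", "bob", "cat", "dee"]
edges = [("ann", "bob", 1.0), ("bob", "cat", 2.0), ("cat", "dee", 1.0)]
dist = floyd_warshall(nodes, edges)

# Candidates within a weighted "social distance" of 3.0 from ann:
print(sorted(n for n in nodes if n != "ann" and dist["ann"][n] <= 3.0))
# ['bob', 'cat']
```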


I remember wondering about exactly this a couple years ago, but I couldn't figure out the answer (I didn't look all that closely into it). Now I want to come up with an excuse to try it out on something. Thanks for mentioning this!


Wow, this is super neat. I wish I knew about this a couple years ago, it would've been super useful for a recommendation system I was building in production. I'll have to give this a shot!


FWIW, Postgres already has good support for representing and querying graph structures using the LTree extension https://www.postgresql.org/docs/current/ltree.html
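For a sense of what ltree offers, here is a rough Python sketch of materialized-path matching, the idea behind ltree's ancestor/descendant operators (labels borrowed from the ltree docs' example hierarchy; the operator analogy is approximate):

```python
# Each row stores its full path from the root, ltree-style.
paths = [
    "top",
    "top.science",
    "top.science.astronomy",
    "top.science.astronomy.stars",
    "top.hobbies",
]

def descendants_of(prefix):
    # roughly analogous to: SELECT path FROM t WHERE path <@ 'top.science'
    return [p for p in paths if p == prefix or p.startswith(prefix + ".")]

print(descendants_of("top.science"))
# ['top.science', 'top.science.astronomy', 'top.science.astronomy.stars']
```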


That's more for trees, which to be fair are a specific kind of graph I guess. Ltree doesn't provide anywhere near the types of tools someone would expect if you told them it supports "graph structures".




Very ugly code review process in that project's Developer Guidelines.


And braceless if statements are the road to ruin

I also chuckled at "Repeat 4 and 5." written in a `<ul>`


That's a very interesting topic


Cool Project!


I wonder how the inclusion of graph features in Postgres 14 will affect this project.


From where do you get that? I was searching the internets for this purported feature and couldn't find it. Link?


Version 14 adds some features to recursive CTE expressions to do BFS/DFS searches and cycle detection. As always, depesz has a nice write up of it: https://www.depesz.com/2021/02/04/waiting-for-postgresql-14-...

I _think_ it's just syntactic sugar and doesn't let you do anything you couldn't already do, although perhaps it would leave room in the future for the Postgres team to optimize query execution.
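For intuition, a procedural Python sketch of what those clauses express declaratively: breadth-first traversal plus stopping when a node has already been seen. (The actual SQL clauses add ordering and cycle-mark columns to a recursive CTE; this is just the underlying idea, with a made-up graph.)

```python
from collections import deque

# Small directed graph; the d -> a edge closes a cycle.
edges = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["a"]}

def bfs(start):
    """Visit nodes in breadth-first order, never following a node twice."""
    order, seen = [], {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in edges.get(node, []):
            if nxt in seen:
                continue  # already visited: a cycle (or diamond) was detected
            seen.add(nxt)
            queue.append(nxt)
    return order

print(bfs("a"))  # ['a', 'b', 'c', 'd']
```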


The projects are not related at all, but this and https://github.com/FiloSottile/age have a name conflict.


I like to research meaningless things, so I looked at the first commit in each repository:

- https://github.com/apache/incubator-age/commit/bef50e5d86d45... (Mar 19, 2019)

- https://github.com/FiloSottile/age/commit/06cbe4f91ea9843069... (Oct 6, 2019)

What does this mean? Absolutely nothing.



