The "{area/field/business} is broken" template is broken (well, I don't know if it ever worked). Algorithm development is not broken - it's specialized work and it's largely done by people with specialized backgrounds. Maybe that's not ideal, and maybe you can improve it, but saying "it's broken" is lazy and false.
Based on my experiences in grad school, I'm not sure it is an exaggeration to say that it's "broken". So much work done solving hard problems, that ends up in academic journals and never seen again. And there's no real incentive for that to change. I think that's broken.
> So much work done solving hard problems, that ends up in academic journals and never seen again
Is this really the case? When I have to solve a particularly unusual or tricky problem, I often have a look at relevant academic papers and go from there – either finding existing solutions, or basing one on the contents of the paper. I kind of assumed most developers did the same.
Yeah, this seems bizarre. There is better access right now to cutting-edge algorithms than ever, as well as building blocks that make it easier than ever to implement the algorithms described. I've been doing this for years, and I've noticed it get progressively easier. Honorable mention to http://thepaperbay.com/ for making it easier to get around those occasional papers stuck behind paywalls.
What's the monetization strategy here? It would be a major step backwards for the industry if it turns out the plan here is to try to make money off of what is currently a public resource. I'd hate to see researchers having to decide if they want to publish their results or just make some cash by creating a closed source API with your service. You guys should make it clear if source code will always be available or if this is basically going to be a land-rush for people to provide an API for all the standard CS algorithms with you monetizing access to them.
We are algorithm developers ourselves and our goal is just to make these awesome algorithms that are being developed more widely available.
We are not trying to monopolize basic algorithms. In fact in a free market I don't see how that could happen, since anyone else could come along and implement the same basic CS algorithms and offer a lower price.
The thing that makes us different from say Wikipedia or Github is that we don't provide just source code, but live infrastructure delivering the algorithms as a service. So rather than uploading code to github, and putting the burden on a user cloning the algorithm, setting up virtual servers, deployment systems, load balancers, etc. We handle all that. That way algorithm developers are free to focus on algorithm code, and the infrastructure is taken care of. This infrastructure needs to be paid for however, so our general plan is to charge for using algorithms through the API. It will be up to the developers whether to share source code or not, since they own the IP.
The status quo now is not broken: developers who want to share IP are encouraged to incorporate their work into open source libraries so they can be used by everyone and be made robust and evolve over time. In various domains these libraries are more or less agreed upon, and the patches go through extensive review to get incorporated.
This type of system will encourage these folks to release their source code in a form compatible with your API, and 3rd parties who want to leverage their work will need to either cut-and-paste it to incorporate it into their own system (a nightmare), or access it through a slow network API, likely paying for it along the way. Libraries have solved the problem of code re-use of algorithms for the last several decades. You're basically proposing that LAPACK and BLAS and Mahout and all these great algorithmic libraries would have been best served up without any centralized coordination or binary distribution, but with each individual algorithm behind a separate API with each author publishing each one individually. If the goal is really to enable wider distribution of good algorithmic work, I don't see how you can miss that a heterogeneous API-centric approach to this problem (while one that would certainly make you a pile of cash if it catches on) wouldn't really be a great step forward for the industry compared to the centralized open source library model we have now. Good algorithmic work is hard, and the model we have now results in battle hardened, robust, agreed-upon implementations of well understood algorithms.
If you really want to make me a believer, all algorithms that your API supports that have public source code should also be integrated and built into a binary library that a user can download instead. At the very least, the path for a contributor to take what they have written and additionally publish it as an appropriate binary package themselves should be minimal. For example, if I implement something in Ruby that uses your service, I should be able to push it as an uncoupled, reusable gem in a straightforward manner.
I agree that the existing libraries are great, and if anything that success is what we want to encourage even more of. We want to build a community around algorithm development, and have a central place for developers to share their work and collaborate with other developers to make a huge library of algorithms so that everyone can benefit.
The first language we implemented was Java/Scala, largely because of the excellent libraries like Mahout and Weka easily available in Maven Central. But even though the code is there in maven, it is still not trivial to deploy those algorithms as a service. We are trying to make that easier, while at the same time supporting the developers of those libraries in the process.
Being able to run the code locally is in the plans. For the most part, the algorithms in our system are already simply maven packages, or ruby gems, npm packages, etc. Being able to distribute them in binary form is definitely possible. We're still figuring things out right now, but our focus is always on doing right by the algorithm development community.
This doesn't make sense to me. You display Dijkstra as an example on your homepage. Imagine I want to use Dijkstra in an app that needs critical performance and calls the function 1,000,000 times per minute. Making this an API rather than a library just defeats the whole point of centralizing the algorithms.
Plus, the choice of Dijkstra algorithm implementation depends on the graph size, type, and even CPU. See http://www.cs.sunysb.edu/~rezaul/papers/TR-07-54.pdf , with 10 different implementation variations and several different benchmarks.
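To make the library-versus-API point concrete: 1,000,000 calls per minute leaves a budget of 60 microseconds per call, which a network round trip is unlikely to fit within. Below is a minimal sketch of a binary-heap Dijkstra called in-process. It is only one of the many variants the linked report compares, not the implementation shown on the homepage, and the function name, graph representation, and toy graph are assumptions made for illustration.

```python
import heapq


def dijkstra(graph, source):
    """Return shortest distances from source to every reachable node.

    graph: dict mapping node -> list of (neighbor, weight) pairs,
    with non-negative weights (a hypothetical adjacency-list layout).
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry; a shorter path was already found
        for neighbor, weight in graph.get(node, []):
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


# Hypothetical toy graph (directed edges), for illustration only.
toy_graph = {
    "a": [("b", 7), ("c", 9), ("f", 14)],
    "b": [("c", 10), ("d", 15)],
    "c": [("d", 11), ("f", 2)],
    "d": [("e", 6)],
    "f": [("e", 9)],
}
distances = dijkstra(toy_graph, "a")
print(distances)
```

A plain binary heap with lazy deletion is used here because it needs nothing beyond the standard library; for large or specialized graphs, a different priority queue or a preprocessing-based method may be the better choice, which is exactly the commenter's point.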
The 9th DIMACS Implementation Challenge from 8 years ago (see http://www.dis.uniroma1.it/challenge9/ ) served as a way to gather the best algorithms for the shortest path problem. I don't see how this project can be significantly better, or even close to, a traditional effort like that.
Indeed, the Parallel Boost Graph Library is easily available, and contains two parallel shortest-path implementations.
The flip question is, what if it doesn't fit my needs? Well, then the question is "what do I need"?
The other DIMACS challenges include 1) pre-computation, which is useful for road-based graphs, 2) supercomputer parallelism to search billions of nodes, 3) out-of-core graph searches, for when you don't have enough RAM, and 4) k-shortest path search.
It's very unlikely that this proposed algorithm service provider will implement these alternate algorithms.
Yes. It might be confusing the lack of apparent centralization for a need, when a diffuse format already accomplishes the desired goal. There has to be either an unmet Tragedy of the Commons that open standards/academia isn't addressing, or something that people would be willing to pay for. I can't see either, because it would probably already exist.
Maybe we need more crowdfunding of open source and independent researchers that are independent of academia. I really think open source hardware is crying out for a project and a community to revolutionize how a community can participate in the mostly closed EDA / silicon stack, because frankly, I don't trust anything I can't get the source code to.
> The thing that makes us different from say Wikipedia or Github is that we don't provide just source code, but live infrastructure delivering the algorithms as a service. So rather than uploading code to github, and putting the burden on a user cloning the algorithm, setting up virtual servers, deployment systems, load balancers, etc. We handle all that. That way algorithm developers are free to focus on algorithm code, and the infrastructure is taken care of.
Virtual servers? deployment systems? load balancers?!? Huh? It all looks like gibberish to me. What does any of this have to do with developing implementations of new algorithmic techniques?
I've read your blog post twice now. I've looked at your website. I've seen your comments here. Still I have no idea what you're proposing or why I would be motivated to use... whatever it is you're trying to convince me to use.
Your message is vague as vague can be. It also doesn't help that the word "algorithm" is sufficiently general to cover any piece of code.
In this post we were trying to focus on the problems we've personally experienced as algorithm developers. The plan is to explain the details of Algorithmia more in future posts, but here's the basic idea:
We want smarter machines, and that means making algorithms more widely available. To achieve that, we're creating a community around algorithm development, where people can contribute code and collaborate with others on that code. Github has done a great job of this, but it is not always trivial to get from a github repo to live code in production. We think that creating a built-in market for algorithms-as-a-service could be a way to reward algorithm developers for their work, and make algorithms easily accessible via REST API.
The types of algorithms we are talking about are not limited by us. My opinion is that the most useful algorithms in the near future will be related to: machine learning, classification, analysis, and optimization. There are also numerous applications for things like image recognition, speech-to-text, and translation, to name a few.
The idea makes a lot more sense to me in this context. Even if you are aiming at other algorithms too, I'd suggest not being shy about giving these examples.
Good luck... that seems like a really intimidating area to learn and a collection of ready to try algorithms could be super useful.
Yeah: What's the goal and where's the value to investors?
Goal: It's hard to see algorithms innovating except for large scale microoptimizations using higher levels of abstractions, because many everyday purposes have already been solved using whatever the developer chose to ship a product.
Value: For-profit research is a really hard model to scale. Well-respected universities already do this and are more of a factory for talent (via publishing) rather than IP, so it becomes a very indirect horse bet for a business to invest in. In a sense, this would be building a double-ended marketplace of investors and researchers. Also really hard to scale.
Another good resource is Steven Skiena's Stony Brook Algorithm Repository [0].
It's not immediately clear on the linked page, but the "By Language", "By Problem", "Algorithm Links" targets will drop down with links to a specific page. The "By Problem" targets link to a page that is very similar to what is in "The Hitchhiker's Guide to Algorithms" part in Skiena's "The Algorithm Design Manual" textbook. They each lead with an image representation of the input and output of the general case of the algorithm, a problem description, some short text description, and then links to implementations. It's not as detailed as what is in the textbook, and I'm not sure how up to date it is these days, but it's a good place to browse about.
Algorithmia will be like a wiki of algorithms. Or a github. The difference is that the code will also be available live as a service so that other developers can easily make use of them.
We strongly believe in open source, and the source code will generally be available (although we may leave that up to the individual algorithm developers to decide). We are algorithm developers ourselves and our goal is to see more intelligent algorithms used as widely as possible.
Hm, sounds interesting but vague. I don't really understand how it will work.
How is data for the algorithms provided? A lot of times it is big. And messy, and proprietary. There seems to be an implicit assumption that you can just plug different algorithms to different data sets. But I can't think of a project where that has been the case.
I also doubt that "algorithms" are the bottleneck in a lot of projects. I'm not an expert, but I have some personal experience to back up what people say about "data trumping algorithms" (e.g. Peter Norvig and others have written about this).
I would like to hear about some more concrete examples / success stories. "Algorithms" is just too vague. I think if this becomes successful, it will be by first narrowing it to a particular domain, and then generalizing it again.
Sorry for being vague, that wasn't our intention, so much as to focus on laying out the problem that we've experienced personally as algorithm developers. We will be posting more blog posts with much more detail about the system in the near future.
The data question is huge for us. We have built our system on top of Hadoop + Spark in order to handle large amounts of data, and be able to apply computations to it. You're absolutely right that data is very heavy, and often the limiting factor, so we are doing everything we can to both get data into the system (currently with a focus on streaming data, since that gets around the heavy data problem), and getting algorithms to where the data is.
As far as specific algorithms, there are many that could apply. My personal experience is in machine learning algorithms, so I'm personally biased toward those. There are numerous algorithms like Classification, Clustering, Optimization, Anomaly Detection, etc. which are very CPU-intensive. We will be doing a blog post soon with a demo of live algorithms already in our system to give more concrete examples.
Could you return the actual code of the algorithm that could be loaded and used during runtime (for some languages if the user wants it)? Seems like having the user send in the data and running an algorithm as a service would take a lot more time and effort for everyone involved.
I can't figure out how such a system is supposed to work. Anyone have any ideas?
For example, a couple of years ago I worked on an algorithm to find the maximum common subgraph of a set of 2 or more molecular graphs. More specifically, I wanted the largest subgraph in M of N graphs (M<=N), I wanted to define the atom match criteria, and I wanted to require that rings not be broken in the subgraph. (Chemists love rings.)
I did it the old-fashioned way. I read papers, I investigated similar systems, I implemented various implementation details, and I did a lot of testing.
How would I find such an algorithm using this system?
There are only a few people who develop this sort of algorithm. Why might I expect that this system is a better resource than traditional means?
I have no idea how this would work, but I disagree with your statement. There are probably lots of people solving similar problems, but with different notations.
Lots of domains share the same mathematical abstractions, but we don't know because (almost) nobody is trying to discover them. We can know that it's common, because once in a while people do look, and always come back with new shared abstractions.
If somebody created an index that took different notations into account, that'd revolutionize research. But I don't think this system does that.
Which statement of mine do you disagree with? Do you mean that there are few people working on the maximum common subgraph problem as it applies to small molecule chemistry? [1] What forms the basis for your opposition?
Of course there are many different notations. One need only look to Newton and Leibniz notations for calculus, or Schrodinger's wave equations and Heisenberg's matrices for quantum mechanics, to know there's a long history of different mathematical abstractions for the same concept.
In my own field, I recently discovered (by reading the literature and getting feedback from conferences) that there appears to be a previously unknown connection between this maximum common subgraph problem and frequent subgraph mining, so it's not like connections are altogether rare.
Indeed, I'll argue that these occur all the time, and research libraries exist in part in order to help people find them.
Which is why I asked why this project might be better than traditional means, and I gave a real-world example to give a basis for discussion.
[1] There's plenty of MCS research, but since the problem itself is known to be NP-hard, most of the mathematical work shows that certain limit cases, like planar graphs or outerplanar graphs, are solvable in polynomial time. As the structures I deal with don't all fall into those categories, I can't use them as a general purpose solution. [2]
[2] I might be able to use them as a special purpose solution, and this project might help identify codes I can use for this case. I just can't figure out a way to make it easier than the current methods.
>Lots of domains share the same mathematical abstractions, but we don't know because (almost) nobody is trying to discover them.
>We can know that it's common, because once in a while people do look, and always come back with new shared abstractions.
Why would anyone write about not finding a shared abstraction? You have a significant publication bias there.
Funny he mentions LDA. A company I founded and sold (Algo Anywhere, which started off as a Generalized Algorithms as a Service company) was a recommendation engine as a service business built on top of LDA. The papers out there are dense, but LDA is actually pretty easy to get information on. Check out Gensim in pythonland.
The thing with algorithms is that you really have to think. 200 lines of code might take two months to really grok. Especially because the people in the field make certain assumptions when they start, and sometimes even just one of those assumptions takes two weeks to understand and research.
You can't just jump into PhD level research and expect to understand it right away.
I don't know if algorithmia will help solve this problem, but I wish them all the best of luck. Getting actual code next to research is super important and useful.
Thanks for the best wishes. LDA was an interesting one, since investigating it was our first exposure to it. Although the algorithm itself is well documented, there were at least 3 or 4 libraries that could be implemented and we really didn't know where to go from there. We want academic papers to be accompanied by live code as you mentioned. We strongly believe this will not only increase the reach of developed algorithms but also significantly lower the bar of understanding for those who are not.
Yes algorithm development is broken, but not quite in the way that the OP suggests.
People tend to focus on initial algorithm development, because that is the academically prestigious, intellectually stimulating bit, but the job is really about how to turn algorithms into cash; a much broader problem than the narrow slice that people typically obsess over.
A huge (and frequently overlooked) part of that job is the communication and coordination role between business development and algorithm development. The volume of communication and level of detail required cannot be overstated.
Another huge part of the job is actually turning a piece of research into a functioning product. Whilst a large part of the OP's proposition is intended to address this problem (kudos), I think that his solution falls short in a big way: it omits the largest part of the solution, which is where the business learns about the algorithm and how the behaviour and performance characteristics of the algorithm interact with the business' problem domain, i.e. how the business builds sufficient expertise and knowledge of their product to be able to effectively sell it. All of these are human problems, mainly oriented around communications and learning.
Having said all that, I would like to encourage the OP in his efforts. I think that it is something that is worth doing, and I really hope he is able to build a business around this idea. I think that technology can help support all of these activities, and this is actually something that I have wanted to do myself for a very long time, so kudos to the OP for actually taking the chance, going out and doing it!
However, if you want to do something non-trivial and you want it done right, hire a computer scientist or mathematician. No amount of crowd will help if no one in the crowd has a clue what they're doing.
1) Will there be limits on the size of data sets / what size data are you optimizing for? Some algos focus on hundreds of records, some on billions, and the ideal systems for each are quite different (small data works great by transferring data back and forth over HTTP; data with hundreds of thousands of records and up... not so much).
2) Same question as above, but for processing times. Some algos are aimed at operating on the fly and might take ~1 second or less. Others (many that I deal with quite often), might run for days, or even weeks. What sort of processes are you optimizing for, and are long running procs on the radar?
It is an interesting idea, but there is a high bar when competing against my language's package manager, and the 10s of thousands of "algorithms" already out there. Best of luck!
The data question is something we are taking very seriously. I talked a bit more about this in another comment, but we know that in many cases data is heavy and is often the limiting factor. We are building a robust data platform on top of Hadoop + Spark, to be able to handle large amounts of data, and so that we can deploy code to where the data is.
That said, I think there are many algorithms that will work as standalone algorithms. Lots of machine learning type algorithms are mostly CPU-bound. There are companies like http://www.kooaba.com that do image recognition as a service over HTTP. Also things like Siri and Google Now seem to work well enough, despite network latency. Thanks for the feedback!
The API is interesting but what I really want is to understand the algorithms. Clean, clear reference code in multiple languages with good explanations would be hugely helpful. Is that part of the plan?
Yes, definitely. The code is available for people to learn from, and you will be able to edit the code and see the results immediately. The goal is all about making algorithms more accessible.
This is the most interesting part to me. I agree with the other people commenting in this thread that advanced algorithm design (as seen in journals) isn't broken. Interesting advanced algorithms do get spun out into open source implementations.
A non-academic community based around discovery, discussion and implementation of algorithms would be an interesting place to hang around. An MVP would be, basically, a discussion forum with an index of github repos for the implementations, but there are other values to be added on top.
Good luck. I'll be paying attention to what becomes of your project.
1. There are 14 libraries for technical algorithms.
2. We need a meta-library so that we don't need to worry about having so many libraries!
3. There are 15 libraries for technical algorithms.
How is the DeepMind purchase in any way a "record sum"? The linked article does not seem to make this claim. I didn't even get to your main point and already I don't trust you.
I think it is a little disingenuous to say that algorithms get buried in academic literature and are impossible to find. By the nature of research most of these things are made public when published; often there is no implementation, or just a research-quality one (which I can guarantee you is almost never useful for the "real world").
For example, one day I was interested in implementing HyperLogLog (a set cardinality measure that is useful in data analysis). In about 10 minutes I had all the relevant papers on hand, and after skimming them I had a pretty good sense of how to implement it.
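As a rough illustration of how small the core of such an implementation can be, here is a minimal HyperLogLog sketch. It is not the commenter's code: the class name, the choice of SHA-1 as the hash, and the default of 2^12 registers are assumptions, and it keeps only the small-range (linear counting) correction, leaving out the large-range correction on the assumption that a 64-bit hash makes it unnecessary at these sizes.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct-count estimator (illustrative sketch only)."""

    def __init__(self, p=12):
        if p < 7:
            # The alpha constant below is only valid for m >= 128 registers.
            raise ValueError("p must be at least 7")
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # Hash choice is an assumption; any well-mixed 64-bit hash would do.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - self.p)               # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        # Position of the leftmost 1-bit in the remaining bits (1-based).
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def count(self):
        raw = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog()
for i in range(10000):
    hll.add(f"user-{i}")
estimate = hll.count()
print(round(estimate))
```

With 4096 registers the expected relative error is on the order of a couple of percent, and adding the same items again leaves the estimate unchanged. The design decisions that the papers spend their time on (bias correction, sparse representations, merging) are exactly what a sketch like this leaves out.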
Similarly if I want to know how to implement a program dependency graph for doing program analysis I can go read a few pages of a paper and get a good description on the algorithm I would need to construct such a thing. I can believe the argument that some of these things are poorly indexed but even a bare minimum of Google searching usually results in useful algorithms. I would argue that often times many of these research algorithms have a bunch of different design decisions that are best explored in the academic literature around them, and an implementation and a few notes is not sufficient exploration.
For example, I was recently implementing Paxos and there were tons of little details to be extracted from the papers around it that had a big impact on the actual implementation we ended up with. The 'Paxos Made Live' paper from Google had many details that were only relevant/true because of engineering decisions made by the team. If you were presented with an implementation derived solely from that paper, there are multiple incorrect assumptions you could draw.
An instance of this is made apparent in Paxos Made Live. Google essentially fixes their proposer because they have used Google-specific details about the number of participants and their availability. The result is that they direct all traffic to a single node, and don't spend much time talking about leader/proposer selection (which could be useful to your needs).
I also don't buy that an important part is getting the algorithms running as a service. Most so called "algorithms" are nothing more than a subroutine that is needed as a piece of a greater whole. I would venture most useful "algorithms" are most likely container data structures and algorithms that operate over them. It seems that these are probably most useful to have as a library. Many libraries have already taken this approach: LLVM (algorithms for code generation, albeit not always everything you want), OpenCV for computer vision routines, BLAS for linear algebra, NLOpt for non-linear optimization, and I'm sure one could come up with many more examples of democratized algorithms.
1. Make open source releasing of algorithms compulsory, not optional.
Think of it from your prospective clients' perspective: the upside of digging out an algorithm from academic journals and implementing it yourself is that you get to fully understand the source code (since you end up writing it yourself); you can attest its correctness; and you get to stand on the shoulders of giants by tweaking and extending it later, if you wish.
Admittedly this might not be seen as that much of a benefit for business users, but for many actual users of advanced algorithms in the scientific computing community, having to use proprietary algorithms with restrictions on their use has been seen as a significant step backwards, as manifested in the controversy over "Numerical Recipes" [1,2,3]. Even if the benefits are more a matter of principle than mere practicality, the palpable distaste for proprietary algorithms in the scientific computing community is something you should at least keep in mind, lest you risk alienating a core user base for your product.
2. Formally verify the correctness of every algorithm submitted.
This is as crucial as large-scale deployment for many scientific computing users, and it is one of the banes of (and reasons for) implementing the algorithm yourself. Here your product could really offer a compelling proposition to these users.
This would also be beneficial for ensuring the reliability of your API, even if you formally waive liability to the algorithm developers (as you surely do). Else you might find yourself on the other side of securities regulators for a multi-million dollar trading glitch caused by one of your algorithms [4], or something crazy like that.
[1] In fact, in the wikipedia article for Numerical Recipes, it is claimed that one of the motivations for the development of the GNU C library was precisely to come up with a free alternative to them! See: http://en.wikipedia.org/wiki/Numerical_Recipes