Yesterday, an undergraduate from Sri Lanka released KnowledgeGPT[1] which allows you to upload your docs and get answers from ChatGPT. It also uses FAISS so I'm wondering if DocsGPT is somehow related or inspired by the former.
It also appears the Github library for DocsGPT was created shortly after the release of KnowledgeGPT.
I thought about the same problem. Companies have lots of marketing material and data sheets that are not easily queried but would be very useful to support staff when dealing with customers' questions.
It is not offensive, it is just pointless. I've never seen people from the USA write "I made XYZ App, made by a guy in Minnesota". I'm not from the USA. But a lot of people outside the USA see the need to plaster their country of origin (Switzerland? Germany? Sweden?) all over the place. Is the country an indicator of the quality of a product? Is it supposed to convey anything to me?
Hey, I don't agree with the "... made with Rust/Go/React/X" slogan either. It is just pointless.
I would try to fine-tune GPT so that I don't need to repeat the first part for every query. Since OpenAI bills by the token (a thousand tokens is about 750 words), it makes sense to fine-tune once and then only submit the changing part.
Caveat emptor: I didn't try this out yet. No idea whether this would work.
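Same caveat applies, but the idea above can at least be sketched. OpenAI's (legacy) fine-tuning format is a JSONL file of prompt/completion pairs; the fixed instruction prefix gets baked into the training examples once, so later queries only need the short, changing part. All the questions, answers, and the prefix below are made-up illustrations, not anyone's real data:

```python
import json

# Hypothetical fixed prefix you'd otherwise resend with every request.
FIXED_PREFIX = "Answer the question using only our product docs.\n\n"

# Illustrative training pairs in OpenAI's prompt/completion JSONL shape.
examples = [
    {"prompt": FIXED_PREFIX + "Q: How do I reset my password?\nA:",
     "completion": " Go to Settings > Security and click 'Reset password'."},
    {"prompt": FIXED_PREFIX + "Q: Where are invoices stored?\nA:",
     "completion": " Under Billing > History in the dashboard."},
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Rough cost intuition: ~1,000 tokens ≈ 750 words, so a 100-word fixed
# prefix is ~133 tokens you'd otherwise pay for on every single request.
```

Whether the fine-tuned model actually internalizes the instruction well enough to drop the prefix is exactly the part that would need testing.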
It's not about this project. It's a use-case of ChatGPT someone has explored. ChatGPT is new and we all want to learn about how it works in many new use cases.
Can you provide more info?
I set up a Telegram bot and connected it to my OpenAI account with API keys, and it works. But as far as I'm aware, ChatGPT is not available as an API yet, so I'm guessing the repo I got the Telegram bot from kinda lied about it using ChatGPT?
I agree, the code is pretty average: inconsistent quotation marks, it looks copy-pasted, and there are developer comments trying to work out what the code is doing.
The concept isn't too novel either; using LLMs to query knowledge bases is nothing new. I know some lawyers looking into it for regulatory compliance.
The KnowledgeGPT repo linked by another commenter seems more interesting.
Ahhh.. I agree. I wanted to make it higher quality, but I was working on it this week and was just too excited to share it. I would have been happy with a few upvotes, but being on the front page is absolutely crazy for me.
And that's the thing... This page should always be about sharing and encouraging people who build things they're excited about, regardless of whether it meets some technical high bar.
Kudos to you, and I hope you keep that excitement going!
The first paragraph in the "What is..." section states that the purpose is to provide answers. Nowhere does it say that answers should be correct or accurate by any measure.
AI to help me read docs does seem somewhat handy but I feel like if there's already documentation it's really just gonna save a couple of minutes?
I will be much more excited when AI can explain undocumented systems to me. This feels like it can't be far away, and it will be a game changer.
I guess for this to be helpful it is gonna need out-of-band info, but maybe just the git log would be a pretty good start. If you could add a mailing list or chat history of developers I imagine things could get more powerful.
We have plenty of partially documented systems, it's a constant challenge to keep it up to date. I wonder if this could be brought more in-line with production by merging documents with support request logs and even code ?
I think it would be silly to have AI _write docs_ for us! Documentation is an obsolete concept at that point. You can just ask the AI exactly what you need to know.
So just so I understand- this is all based on taking input from the user, injecting it in a template prompt that instructs chatgpt to answer the question based on providing it all the source material? What happened to building your own models to run offline?
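For what it's worth, the pattern being described ("inject the user's question into a template prompt along with the source material") is simple enough to sketch. Everything here is illustrative, not DocsGPT's actual code:

```python
# Minimal sketch of the "prompt stuffing" pattern: retrieved source
# passages and the user's question are formatted into a fixed template,
# and that combined string is what actually gets sent to the model.
TEMPLATE = (
    "You are a documentation assistant. Answer the question using ONLY the\n"
    "context below. If the answer is not in the context, say so.\n\n"
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)

def build_prompt(question: str, passages: list[str]) -> str:
    context = "\n---\n".join(passages)
    return TEMPLATE.format(context=context, question=question)

prompt = build_prompt(
    "How do I install the CLI?",
    ["Install with: pip install example-cli", "Requires Python 3.8+."],
)
```

The model itself does the rest; the application layer is mostly retrieval plus this kind of string assembly.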
A trained AI model is a platform that you can build applications on. Pretty sure >90% of the developers in this space will be building applications, not models. Kinda like how the vast majority of mobile developers build mobile apps, not mobile operating systems. True, this particular application is pretty simple, but early applications are.
I agree on that. I think about it multiple times a day, but primarily I want that model running client-side or in my own environment, and ideally without the content filter.
That's like complaining that a musician doesn't build their own piano. Or actually, it's like asking why they don't build their own piano factory. No sole developer has the skills or resources to build something like GPT. Even if it was open source no user would be able to run it locally anyway.
Very strange counterpoint that makes no sense unless you're just arguing against cloud software in general, which was clearly not the point the parent made.
> Several of us are running GPT-3 workloads locally.
Several people out of 8 billion is the same as no-one. The point made was that anyone should be able to run this document assistant on their local machine.
In this day and age there is no way a well-funded university lab can build a model that is even a fraction as good as GPT-3. Just look at "Bloom" from BigScience: a commendable effort, but useless for any real-world use case compared to GPT-3. Unfortunately, moving forward, if you want to build useful apps that use ML models, you need to call APIs from well-funded industry groups.
NLP Tasks: Text Classification, Named Entity Recognition, Text Summarization, Entailment, Question Answering, Sentence Similarity, Embeddings, Sentiment Analysis, Model Explainability, and Auto-Annotation
You could probably ask ChatGPT how to get started with GPT, but I think a lot of people are building apps taking advantage of GPT's Fine Tuning ability. https://platform.openai.com/docs/guides/fine-tuning
If I take advantage of this to allow my customers to ask questions about our product documentation, how do I limit questions to my product documentation? I don't want to be paying OpenAI for questions unrelated to my product.
One idea would be to use a much cheaper (and faster) classifier to come back with a "yes" or "no" if the question asked is about your product documentation.
Using Ada or Babbage is about 1% of the cost of Davinci (and Curie is 10% the cost of Davinci).
Without any real tuning, this responds quite promptly (and, in the various tests I've done, correctly):
curl https://api.openai.com/v1/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-d '{
"model": "text-ada-001",
"prompt": "Identify if the following question is about the Olympic Games. Answer with {yes}, {no}, {maybe}.\n\nWhat category is pole vaulting in?",
"temperature": 0.7,
"max_tokens": 38,
"top_p": 1,
"frequency_penalty": 0,
"presence_penalty": 0
}'
I was just after something simple in its training set to demonstrate the simple classification that could be used as a check before passing the request on to a more expensive fine-tuned model.
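The gating glue around that classifier call is trivial; a hypothetical sketch (none of this is from the thread) of parsing the cheap model's reply before deciding whether to forward the question to the expensive model:

```python
# The ada prompt above asks the classifier to answer with {yes}, {no}
# or {maybe}. Only an explicit {no} blocks the request; {maybe} errs on
# the side of forwarding to the expensive fine-tuned model.
def should_forward(classifier_reply: str) -> bool:
    reply = classifier_reply.strip().lower()
    return "{no}" not in reply

print(should_forward("{yes}"))     # forward to the expensive model
print(should_forward("\n\n{no}"))  # reject as off-topic
```

With ada at roughly 1% of davinci's price, even a high false-positive rate on the gate still cuts the bill substantially.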
This is really a good question.
It would also be interesting to know the best method for providing GPT with product documentation and having it answer questions about that and nothing else.
This looks like a good way to explore how we could start using AI at our organization. Say we have a PDF of an instruction manual and we'd like to make a workflow that lets us use Ask Relevance. There doesn't seem to be a way to vectorize the content of a PDF. The option isn't presented in the drop down menu. It seems to only offer CSV-related processing, although I selected a PDF as the data type for upload. Am I using it wrong?
OpenAI has documentation and guides on their blog about how to interface with the GPT API. But in this case, you can also look directly at the source code published here.
This is a cool idea. Right now my goto tends to be https://devdocs.io/, but the idea of a conversational type of layer is fascinating. It's always a struggle in a new set of docs trying to figure out their phrasing for merge/join/combine or how they describe aggregations for example. A lot of the time when you're looking at documentation, you're trying to look up "how to do x (with y)" but most docs are written in a "common language" and end up describing things in jargon you may not be aware of yet.
What ChatGPT can do is combine knowledge and provide personalised examples. This is usually done manually and takes hours of research with trial & error.
What exactly is it aware of though? I guess it's aware of certain packages in certain languages? I asked how to create an AWS API gateway with terraform. It said:
> You can use the terraform-aws-api-gateway module. This module provides a set of Terraform resources for creating and managing an API gateway. It allows you to define the API gateway, its resources, methods, and stages. It also provides support for custom domain names, API keys, and usage plans.
I followed up with "Show me the hcl configuration for it."
Yeah, it's still a very early preview. We are working on making sure parsing works well across different formats. But for now, you can make sure it looks good in txt. Would love to see you in our Discord. Very soon it will support different formats, so stay updated.
Man, imagine if you could run it locally and let it parse all of your documents, so you could ask it questions like when your passport expires, or what your wife's ID number is, and so on.
I've already built this, and you can use it as a guide. Since it's PII, we just need to swap OpenAI for a model on huggingface.co; this lets the question-answering part run locally.
But it still requires uploading sensitive data to the cloud.
I want it completely local, I ain't uploading things like my passport and birth certificate to OpenAI.
Bro, did you see the part where I talk about swapping out OpenAI? You can download ML models from huggingface.co; if you read this link you should be well on your way to building something today: https://github.com/huggingface/transformers#quick-tour
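From that quick tour, the fully local swap looks roughly like this. A sketch, assuming `transformers` (and a backend like PyTorch) is installed; the first run downloads a default extractive QA model, after which everything runs offline, so sensitive documents never leave your machine. The passport text is made up for illustration:

```python
from transformers import pipeline

# Extractive question answering: the model picks the answer span out of
# the context you give it, entirely on your own hardware.
qa = pipeline("question-answering")

document = "Passport number X1234567, issued 2015-03-01, expires 2025-03-01."
result = qa(question="When does the passport expire?", context=document)
print(result["answer"])
```

Quality won't match OpenAI's largest models, but for "find the fact in my own document" style questions, extractive QA is often enough.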
The future will most likely be "upload it to the cloud and ask the question there".
Really sad, but it seems like most people do not care about privacy, even for the most sensitive parts of their lives.
Or even worse they mistrust "the government" but place great trust in "companies which are easily pressured or raided by more than one government".
It doesn't make sense at all.
(Edit: I don't think a well run government in a functioning democracy is inherently evil but pretending that companies are not collaborating because they are "private entities" is foolish)
"Do you want megacorps? Because that's how you get megacorps" -Archer
Control is why I want to see more "open source" versions of machine learning models, be it over the data or auditing the output. If I can p2p download a model, I know we're in a good place.
Take the internet, I could see my path being very different if we had to pay per minute like a phone plan to get information, having open access was a big bootstrap here.
Finally, of course there are problems with the Linux project, but would servers dictated only by M$oft/0r@cle be a positive outcome? Gonna say no.
You merely need to look at the Twitter example to see how risky monopoly-level tech is in the wrong hands. Imagine they have access to all politicians' DMs now, ripe for blackmail and kompromat.
All you need is to 1. convert documents to embeddings (via OpenAI or some other local library), 2. store all embeddings in an index (a SaaS like Pinecone, or FAISS locally), 3. run queries against the embeddings. Plain document search is likely good enough, but if not, you can run an LLM completion over the retrieved docs to generate answers.
1: https://github.com/mmz-001/knowledge_gpt