Having done a similar project, the results have been quite an eye-opener.
At first glance the answers look very good, but we have noticed a few things.
In a number of cases the documents returned were wrong, but the LLM was still able to produce the right answer from its own internal knowledge rather than from the retrieved contents. In our case this counts as a fail, since the answer had to cite valid documents.
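One way we could catch this failure mode is a citation-grounding check: treat an answer as valid only if every document it cites was both retrieved and actually relevant. This is a minimal sketch with hypothetical document IDs and a made-up helper name, not any particular framework's API:

```python
# Hypothetical sketch: flag answers whose citations fall outside the
# retrieved set or the known-relevant (gold) set, even when the answer
# text itself happens to be correct.
def citation_check(answer_citations, retrieved_ids, gold_ids):
    """Pass only if every cited document was actually retrieved
    AND is a known-relevant document for this question."""
    cited = set(answer_citations)
    return cited <= set(retrieved_ids) and cited <= set(gold_ids)

# A correct-sounding answer citing a wrong document still fails.
print(citation_check(["doc7"], ["doc7", "doc2"], ["doc1"]))  # False
# An answer grounded in a retrieved, relevant document passes.
print(citation_check(["doc1"], ["doc1", "doc2"], ["doc1"]))  # True
```

A check like this separates "the model got lucky" from "the pipeline retrieved and cited the right evidence", which is the distinction our evaluation needed.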
Less frequently, we also noticed that the order of chunks returned from the vector database had an impact on the answer. In one case two questions differed by a single word: past tense in the first, present tense in the second, otherwise identical. This swapped the order of the returned chunks and produced the opposite answer.
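A cheap probe for this sensitivity is to retrieve with two near-identical paraphrases and compare the ranked chunk IDs. This is a sketch assuming a generic `retrieve(query, k)` function that returns ranked chunk IDs; the fake retriever below just stands in for a real vector store:

```python
# Hypothetical sketch: detect order sensitivity between paraphrased
# queries by comparing the ranked lists of retrieved chunk IDs.
def order_diff(retrieve, query_a, query_b, k=5):
    """retrieve(query, k) -> ranked list of chunk IDs (assumed API).
    Returns (same chunks retrieved?, same ranking order?)."""
    ranked_a = retrieve(query_a, k)
    ranked_b = retrieve(query_b, k)
    return set(ranked_a) == set(ranked_b), ranked_a == ranked_b

# Fake retriever standing in for a vector DB: a tense change
# flips the ranking, mirroring the behaviour we observed.
def fake_retrieve(query, k):
    return ["c2", "c1"] if "was" in query else ["c1", "c2"]

print(order_diff(fake_retrieve, "what was the limit", "what is the limit", k=2))
# same chunks, different order -> (True, False)
```

Runs where the chunk sets match but the order differs are exactly the cases worth re-checking for flipped answers.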
There is no easy way to see when these issues pop up; the current testing frameworks are limited in catching them.