It’s been mentioned in Discord that there’s interest in developing an AI model (like ChatGPT) exclusively for Zenon. I’ve done some preliminary research with ML engineers, and I’d like to share my vision and knowledge of the matter before we begin hacking together partial solutions that could fragment or duplicate data into unnecessary places.
The idea: feed structured and unstructured data into a trained model so that it can then be called by various channels via an API.
The Zenon AI can then be used to assist various channels:
- A chatbot on the organic site.
- A chatbot on the landing pages (to help people complete the conversion funnel).
- A bot anyone can call in Discord, Telegram, or the forums.
- A reply bot on Twitter.
The model can be fed from the following data sources:
- Organic Site: zenon.org = Verified (see this link to access development URLs)
- Documentation: docs.zenon.org = Verified
- Forum: forum.zenon.org = Unverified
- Discord / Telegram = Unverified
- Git Repositories = Verified
We cannot assume that all data from these sources is correct. We need to develop a method for the model to know which data is verified. This applies specifically to the Discord / Telegram / Forum conversations we’d feed to the model, since we can assume the organic and docs sites contain verified data.
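To make the verified/unverified distinction concrete, here is a minimal sketch of a source registry that tags each data source with its trust status. The names and structure are illustrative assumptions, not a decided schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataSource:
    """One place we pull training data from, tagged with its trust status."""
    name: str
    url: str
    verified: bool

# Hypothetical registry mirroring the source list above.
SOURCES = [
    DataSource("organic", "zenon.org", True),
    DataSource("docs", "docs.zenon.org", True),
    DataSource("forum", "forum.zenon.org", False),
    DataSource("chat", "discord/telegram", False),
    DataSource("git", "git repositories", True),
]

def verified_only(sources):
    """Keep only the sources whose data can be trusted without review."""
    return [s for s in sources if s.verified]
```

A pipeline could then route unverified sources through a community-review step before their Q/A pairs reach the training set.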
Preparing the Data
A Question / Answer data structure is what’s most needed to train such a model. The more data we feed the model in this structured format, the better it will know how to answer users, drawing on the various data we feed it (organic/docs/forum/discord/telegram etc.). We do NOT need to provide questions and answers for every possible topic; the model should be able to comprehend questions it hasn’t seen before. It should do fine if we organize Q/A for the most common things people ask.
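As a sketch of what one Q/A training record could look like, here is a JSONL-style example (one JSON object per line, a common format for fine-tuning datasets). The field names are assumptions for illustration, not a fixed schema:

```python
import json

# Hypothetical training record; fields are illustrative.
record = {
    "question": "How can I convert my wZNN to native ZNN?",
    "answer": "You can refer to the guide on the Discord server.",
    "source": "discord",
    "verified": False,
}

# One JSON object per line = JSONL, easy to append to and stream during training.
line = json.dumps(record)
```

Each source (organic site, docs, forum, chat) would emit records in this shape, so the training pipeline only has to understand one format.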
I quote from the ML engineer:
Structured data helps in training. For instance, having pairs of the most commonly asked questions and their answers, written as we would explain them in person, along with the wider context, will get us really accurate results. Models trained on such a question/answer dataset will perform very well. If the question is from that list, the model will give the accurate output. But if someone asks a related question which is not in the training set, it would still be answered based on its similarity to other questions.
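The “answered based on its similarity to other questions” idea can be illustrated with a toy stdlib sketch: match an unseen question to the closest known one and reuse its answer. A real model does this in embedding space; this is only a string-similarity stand-in, and the Q/A table is hypothetical:

```python
from difflib import get_close_matches

# Tiny illustrative Q/A table; a trained model replaces this lookup.
qa = {
    "how do i convert wznn to znn": "You can refer to the bridge guide on Discord.",
    "where is wallet data stored": "Inside a KeyStore.",
}

def answer(question, cutoff=0.6):
    """Return the answer for the closest known question, or None if nothing is close."""
    matches = get_close_matches(question.lower(), qa.keys(), n=1, cutoff=cutoff)
    return qa[matches[0]] if matches else None
```

A rephrased question like “where does wallet data get stored” still lands on the right answer, while an unrelated question returns nothing, which is exactly the behavior the engineer describes.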
The key is to build this Q/A methodology into a process within the sources we choose to feed the model with. For example:
While I build a page on zenon.org, I can tag my content with a question/answer for the model. As I push the update to the website, the Q/A data for the model gets updated and output into a format designed to train the model. An automated workflow can then be triggered: new data available → trigger a retraining of the model (GPUs) → new model output → all services using the model are updated. Of course, this doesn’t need to happen after every push; it could run periodically or after major updates. It all depends on the cost to retrain the model.
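One cheap way to decide whether a push warrants retraining is to fingerprint the dataset and retrain only when the fingerprint changes. A minimal sketch, with illustrative function names:

```python
import hashlib
import json

def dataset_fingerprint(records):
    """Stable hash of the Q/A records; identical data always yields the same hash."""
    blob = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

def should_retrain(records, last_fingerprint):
    """Trigger the (expensive) GPU retraining only when the data actually changed."""
    return dataset_fingerprint(records) != last_fingerprint
```

The CI workflow would store the last fingerprint alongside the deployed model and skip the GPU job when nothing changed, which directly addresses the retraining-cost concern above.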
A similar retraining process can be developed for all the other sources. It is our job to develop a community-operated methodology to structure and tag our data for continuous retraining. We act like alien-bees in a major hive.
The model can also be trained to cite the source of its replies. For example:
Discord sourced example:
User’s question: How can I convert my wZNN to native ZNN?
AI reply: You can refer to the guide on the Discord server. To learn more go to: https://discord.com/channels/920058192560533504/1065658520462176256/1065768065888952362
Docs sourced example:
User’s question: The wallet’s data is stored into what?
AI reply: The wallet data is stored inside a KeyStore. To learn more go to: https://docs.zenon.org/hypercore/technical/phase-1/cryptography#wallet
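Sourced replies like the two above fall out naturally if each Q/A record carries its source URL, so the bot can append it to every answer. A minimal sketch, with an illustrative record shape:

```python
# Hypothetical records pairing each answer with the URL it came from.
qa_with_sources = {
    "the wallet's data is stored into what?": {
        "answer": "The wallet data is stored inside a KeyStore.",
        "url": "https://docs.zenon.org/hypercore/technical/phase-1/cryptography#wallet",
    },
}

def reply(question):
    """Build a reply that always cites its source, or None if unknown."""
    entry = qa_with_sources.get(question.lower())
    if entry is None:
        return None
    return f"{entry['answer']} To learn more go to: {entry['url']}"
```

Because the URL travels with the record from ingestion to inference, Discord-sourced and docs-sourced answers need no special casing; they only differ in the stored link.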
If an answer is not available, the question can be fed into a honeypot where the community can see what people are asking. We can then use that data to improve our dataset and retrain the model.
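The honeypot itself can start as something very simple: count unanswered questions so the community can prioritize which Q/A pairs to write next. A sketch with assumed names:

```python
from collections import Counter

# Unanswered questions, normalized so trivial variants collapse together.
honeypot = Counter()

def record_miss(question):
    """Log a question the model could not answer, for community review."""
    honeypot[question.strip().lower()] += 1

def top_gaps(n=5):
    """The most frequently missed questions = the best candidates for new Q/A data."""
    return honeypot.most_common(n)
```

Ranking by frequency means the community always works on the gaps that affect the most users first.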
Quotes from the ML Engineer
Pretrained Hugging Face transformer models are the best fit for this task. They are scalable and perform well. We can use seq-to-seq language models like T5, or GPT-2 or BERT models, for this scenario. These models can be fine-tuned on the input data to provide sophisticated output closer to ChatGPT. Besides using a large amount of data for training, ChatGPT used reinforcement learning based on human responses, so we cannot possibly achieve that smoothness here. But we can still get smart and relevant answers, since these transformer models are also pre-trained on very large data. […] My approach would be to first explore BERT and GPT-2 models.
Q/A-based performance will be significantly better. One thing is, if I don’t use the Q/A structure, the responses will be different every single time. Sometimes they might be very irrelevant, since these ML systems are not 100% accurate but approximations. So the more we leave it to the model, the lower its performance.
As you may understand, before we get into training a model, we need to organize ourselves to develop that hive.
Let’s agree on the various data sources.
Let’s discuss the hierarchy of the various data sources (and their verified/unverified status).
Let’s develop a methodology to tag a question/answer format into the various data sources. You can assume that zenon.org can develop a method to tag its content in such a format. The question is more for: Discord, Telegram, and the Forum. This could develop into a role / paid job, using AZ to fund community members to structure data.
Should the model be told that answers from certain vetted members are automatically assumed to be verified? i.e. Kaine and a few other notable members?
Git repositories / code haven’t really been considered yet, but would be useful for technical minds. I think we should keep this in our back pocket and see whether the model can learn code from Zenon repos at a later date, once the first versions of the project are developed.
Let’s organize a crew to handle the technical documentation of Technical - HyperCore | Zenon Docs (I can handle the preliminary organization of the HyperGrowth sections). The basic formats for HyperCore have been created, but the community will greatly benefit from developers working alongside marketing teams. It’s the dream synergy.
Let’s discuss and eventually outline the specific needs for the project so that the community can hire an ML engineer through AZ:
- Defined data hierarchy (as the model must be trained to give the best answer possible)
- Defined training workflows (as data sources are constantly updated)
- Open sourced code
- Web-based training (GPUs)
- API to develop chatbot services in different contexts (a Discord Zenon AI chatbot vs. a chatbot embedded on zenon.org landing pages vs. a Twitter reply chatbot, etc.)
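The “one API, many chatbot front-ends” item can be sketched as a single shared entry point that every channel calls, differing only in a channel tag. All names here are illustrative placeholders, not a proposed interface:

```python
def lookup(question):
    """Placeholder for the trained model; here just a tiny static table."""
    table = {"what is znn?": "ZNN is the native coin of the Zenon Network."}
    return table.get(question.strip().lower())

def ask(question, channel):
    """Single entry point every front-end (Discord, web widget, Twitter) would call."""
    answer = lookup(question)
    if answer is None:
        # Unknown questions go to the honeypot for community review.
        return f"[{channel}] Sorry, I don't know yet; your question was logged for review."
    return f"[{channel}] {answer}"
```

Centralizing the logic this way means retraining the model updates every channel at once, and the honeypot captures misses from all of them in one place.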
This is a community project, open for anyone to contribute.