Extending Mongo's mgenerate with LLM-Integrated $text Operator

Enhance Mongo's mgenerate tool with an async $text operator, enabling developers to generate contextually relevant, LLM-generated text data for their applications via Ollama.
Overview

This project extends Mongo's mgenerate tool by introducing a new $text operator, which integrates with Large Language Models (LLMs) via Ollama to generate text data from specific prompts. This enhancement helps developers create more relevant, context-specific dummy data for their applications, addressing use cases ranging from domain-specific data (e.g. healthcare) to regionally relevant data.


Key Features
  1. Application-Specific Data Generation:

    • Generate dummy data tailored to application needs. For instance, developers can create job titles specific to the healthcare sector with prompts like "Job title for Healthcare worker."

  2. Long Text Generation:

    • Produce coherent, contextually appropriate long text. Example: Generating dummy product reviews with prompts such as "Product review for office item."

  3. Regional Contextualization:

    • Generate data with regional relevance, such as Indian names for applications targeting the Indian market.


How It Works

Using the new $text operator, developers can define prompts within their mgenerate templates to create specific types of data. For example:

Template:

{
    "name": "$name",
    "Role": {
      "$text":{
        "prompt": "Designation or job title found in Healthcare",
        "maxWordCount": "4"
      }
    },
    "lastLogin": "$now"
}

Generated Output (using model: llama3:latest):

{
    "name": "Virginia Blair",
    "Role": "Medical Assistant",
    "lastLogin": {
        "$date": "2024-07-28T12:53:00.267Z"
    }
}
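A template like the one above can be exercised from the command line. The filename `healthcare.json` is illustrative; `-n` is mgeneratejs' standard flag for the number of documents to generate.

```shell
# Save the template above as healthcare.json, then generate 3 documents.
# Requires the fork with the $text operator installed globally
# and an Ollama API reachable (see the setup steps further down).
mgeneratejs healthcare.json -n 3
```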

The above template can also be tweaked to generate, for example, Indian names in the name attribute if the data is specific to the Indian healthcare industry.

{
    "name": {
      "$text":{
        "prompt": "Indian last name"
      }
    },
    "Role": {
      "$text":{
        "prompt": "Designation or job title found in Healthcare",
        "maxWordCount": "30"
      }
    },
    "lastLogin": "$now"
}

Generated Output (using model: llama3:latest):

{
    "name": "Kulkarni",
    "Role": " Clinical Care Coordinator",
    "lastLogin": {
        "$date": "2024-07-28T13:04:59.694Z"
    }
}

Example of long text:

{
    "name": "$name",
    "review": {
        "$text":{
          "prompt": "Product Review of a technology product of your choice",
          "maxWordCount": "50"
        }
    },
    "timestamp": "$now"
}

Generated Output (using model: llama3:latest):

{
  "name": "Rodney Hamilton",
  "review": "**Product Review:**\n\n**Logitech K380 Wireless Keyboard**\n\nThis wireless keyboard is a game-changer for those who spend hours typing away on their computers. With its compact design and reliable connectivity, I've never experienced any lag or dropped signals. The battery life is impressive, lasting up to 3 years on a single set of batteries. The keyboard itself is comfortable to type on and has a nice tactile feedback. Overall, I'm extremely satisfied with this purchase. **Rating: 5/5 stars**",
  "timestamp": {
    "$date": "2024-07-28T15:50:39.668Z"
  }
}

Technical Implementation

I have marked this implementation as experimental and have refrained from creating an immediate PR to the original repo, so that the changes can be properly reviewed first. My repository will remain available as an open source tool. The technical changes are summarized below.

  1. Converted mgenerate into an async function. This change propagates to every operator that makes a nested call to mgenerate's evaluator.

  2. Updated the Mocha test suite to make async calls

  3. Implemented the $text operator, with the Ollama integration configured through environment variables

  4. Updated the documentation for the $text operator
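The core of the change can be sketched as follows. This is an illustrative reconstruction, not the fork's actual code: the function and variable names are assumptions, and it uses Node's built-in `fetch` for brevity where the fork uses axios. Ollama's `/api/generate` endpoint with `stream: false` returns a single JSON object whose `response` field holds the generated text.

```javascript
// Illustrative sketch of an async $text operator
// (identifiers are assumptions, not the fork's actual names).

// Trim generated text to at most maxWordCount words, if a limit is given.
function trimToWordCount(text, maxWordCount) {
  const trimmed = text.trim();
  if (!maxWordCount) return trimmed;
  return trimmed.split(/\s+/).slice(0, Number(maxWordCount)).join(' ');
}

// Resolve a { "$text": { prompt, maxWordCount } } node by asking Ollama.
async function textOperator(options) {
  const endpoint = process.env.MGENERATIVEJS_OLLAMA_ENDPOINT; // e.g. http://127.0.0.1:11434/api
  const model = process.env.MGENERATIVE_OLLAMA_MODEL || 'llama3:latest';
  const res = await fetch(`${endpoint}/generate`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    // stream: false asks Ollama for one complete JSON response
    body: JSON.stringify({ model, prompt: options.prompt, stream: false })
  });
  const data = await res.json();
  return trimToWordCount(data.response, options.maxWordCount);
}

module.exports = { textOperator, trimToWordCount };
```

Because the operator awaits a network call, every caller up the chain (including the top-level evaluator) has to become async as well, which is what change 1 above refers to.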

Test Results

The single failing test is an exception-handling issue with Mocha: since the switch to async, `assert.throws` no longer recognizes the rejection as a thrown exception.

  62 passing (28ms)
  2 pending
  1 failing

  1) General
       should throw an error if additional key is present after operator:
     AssertionError [ERR_ASSERTION]: Missing expected exception.
      at Context.<anonymous> (test/index.test.js:8:12)
      at process.processImmediate (node:internal/timers:483:21)

Try it on your machine

I have only proposed a draft PR on the official mgeneratejs repository (https://github.com/rueckstiess/mgeneratejs), as an experimental feature with this much impact on the code base warrants further review. However, you can try my fork of mgenerate on your local machine using the following steps.

  1. Clone the git repository https://github.com/omkarkhair/mgenerativejs `git clone git@github.com:omkarkhair/mgenerativejs.git`

  2. Navigate to the directory where you cloned the repo and install one dependency manually, as it is missing from `package.json` at the time of writing 🫠 `npm install axios --save`

  3. Now run `npm install` to install all other dependencies

  4. Ensure that `mgeneratejs` isn't already installed in your global modules. Once confirmed, install the local repo as mgeneratejs using the command `npm install -g .` from the project directory

  5. To try the `$text` operator, an Ollama API must be accessible from your machine. You can install Ollama and run the API using the instructions here: https://github.com/ollama/ollama?tab=readme-ov-file#ollama

  6. Download the models of your choice from the Ollama library. I have tested with `llama3:latest` and it worked best for my use cases.

  7. Set the environment variable `MGENERATIVEJS_OLLAMA_ENDPOINT`, for example to `http://127.0.0.1:11434/api`. Remember to include the `/api` path in the value.

  8. Set the environment variable `MGENERATIVE_OLLAMA_MODEL` to the model of your choice, for example `llama3:latest`.

  9. Create an mgenerate template file with the `$text` operator as described in the documentation: https://github.com/omkarkhair/mgenerativejs?tab=readme-ov-file#text . Enjoy :)
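Steps 7 and 8 above, condensed into a shell snippet (the values are the examples from the steps; adjust the host and model to your setup):

```shell
# Point the fork at a locally running Ollama API (note the /api suffix)
export MGENERATIVEJS_OLLAMA_ENDPOINT="http://127.0.0.1:11434/api"
export MGENERATIVE_OLLAMA_MODEL="llama3:latest"

# Sanity-check that the endpoint answers before generating data
# (Ollama's /api/tags lists the locally available models)
curl -s "$MGENERATIVEJS_OLLAMA_ENDPOINT/tags"
```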


Next steps
  1. Complete test coverage for generative text implementation

  2. Review all async functions to ensure they return a promise (nothing breaks otherwise, but it is good code hygiene; I may have missed some along the way)

  3. Choose a lighter HTTP client than axios. Might consider ollama-js if it is light enough.

  4. Generate multiple responses in a single LLM call when producing multiple documents, backed by a state implementation
