
Data Extraction, or converting free text into structured data with an LLM

This is my exploration of how to convert free text:

We introduce NuExtract-v1.5 -- a fine-tuning of Phi-3.5-mini-instruct, which is a 3.8B parameter language model. It is trained on a private high-quality dataset for structured information extraction. It supports long documents (up to 128k token context) and several languages (English, French, Spanish, German, Portuguese, and Italian). To use the model, provide an input text and a JSON template describing the information you need to extract.

and a template:

{
    "Model": {
        "Name": "",
        "Number of parameters": "",
        "Number of token": "",
    },
    "Usage": {
        "Use case": []
    }
}

into structured data:

{
  "Model": {
    "Name": "NuExtract-v1.5",
    "Number of parameters": "3.8B parameter language model",
    "Number of token": "up to 128k token context"
  },
  "Usage": {
    "Use case": [
      "structured information extraction"
    ]
  }
}

Notes from articles on the topic

The best reading I've found:

OpenAI’s structured output vs. instructor and outlines (2024-08-10)

A short reflection on whether we still need libraries like outlines and instructor now that the Structured Outputs feature has been released.

Is structured outputs a replacement for instructor, outlines and other libraries that provide structured outputs for language models?

Yes, the core value proposition of: “give me a Pydantic model and I’ll use function calling to guarantee the output fits the schema” is now covered for OpenAI models, but only for OpenAI models. If you’re using other models or want to stay flexible, structured output libraries are still useful.
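For a concrete picture, here is a minimal sketch of the native route with the openai Python SDK; the model name, schema, and example text are mine:

from openai import OpenAI
from pydantic import BaseModel

# Hypothetical schema for this sketch
class ModelInfo(BaseModel):
    name: str
    number_of_parameters: str

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The SDK converts the Pydantic model to a strict JSON schema
# and validates the response against it
completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",  # any model that supports Structured Outputs
    messages=[
        {"role": "system", "content": "Extract the model info from the text."},
        {"role": "user", "content": "We introduce NuExtract-v1.5 -- a 3.8B parameter language model."},
    ],
    response_format=ModelInfo,  # a Pydantic model instead of a hand-written schema
)

model_info = completion.choices[0].message.parsed
print(model_info.name)                  # e.g. "NuExtract-v1.5"
print(model_info.number_of_parameters)  # e.g. "3.8B"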

Making LLMs Reliable: Building an LLM-powered Web App to Generate Gift Ideas (2024-12-20)

A demonstration of the Outlines library, from the blog of the library's authors.

Outlines transforms language models into predictable tools that fit naturally into your application.
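A minimal sketch of what that looks like, assuming the outlines 0.x API (the model choice, schema, and prompt are mine):

import outlines
from pydantic import BaseModel

class UserInfo(BaseModel):
    name: str
    age: int

# Load a local model through transformers (model choice is mine)
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")

# Build a generator whose output is guaranteed to match the schema:
# outlines constrains token sampling so invalid JSON cannot be produced
generator = outlines.generate.json(model, UserInfo)

user = generator("John Doe is a 35-year-old software engineer from New York.")
print(user.name, user.age)  # a validated UserInfo instance

The key point is that Outlines constrains sampling locally, so it works with open models and does not depend on any provider-side feature.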

The Essential Guide to Large Language Models Structured Output, and Function Calling (2024-09-22)

A long text, focused mainly on the topic of Function Calling.

LLM structured output and function calling are two of the most efficient and powerful ways of overcoming shortcomings and bottlenecks of LLMs when you build software systems using them. They truly enable a new level of LLM-powered software.


There are many fancy names and explanations to function calling, yet it all can be boiled down to one statement – “Function calling is a type of structured output capability of a large language model.”

“Function calling” naming is confusing. LLMs don’t call any functions themselves; they suggest which function you should call from pre-defined functions which you provide to the LLM in a prompt.
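To make the "suggestion" mechanics concrete, a minimal sketch against the OpenAI chat completions API (the tool definition, model name, and prompt are mine):

import json
from openai import OpenAI

client = OpenAI()

# The model only ever sees this *description* of the function
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the weather in Prague?"}],
    tools=tools,
)

# The model returns a structured *suggestion* of which function to call;
# actually calling get_weather (and sending the result back) is our job
call = response.choices[0].message.tool_calls[0]
print(call.function.name)                   # get_weather
print(json.loads(call.function.arguments))  # {'city': 'Prague'}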


Consider this problem: implement a natural language processing parser that allows users to create a grocery list out of natural language input. The user provides a list of groceries in written or spoken form, and the program outputs an HTML-formatted list.

Without LLMs, that is not such an easy task to tackle. It’s easy to build a demo, but not easy to build a high-quality product that handles edge cases well.
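A sketch of how that parser might look with structured output; the schema, prompt, and HTML rendering below are my illustration, not the article's code:

from openai import OpenAI
from pydantic import BaseModel

class GroceryItem(BaseModel):
    name: str
    quantity: str

class GroceryList(BaseModel):
    items: list[GroceryItem]

client = OpenAI()

completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Parse the user's message into a grocery list."},
        {"role": "user", "content": "two liters of milk, a loaf of bread and a dozen eggs"},
    ],
    response_format=GroceryList,
)
groceries = completion.choices[0].message.parsed

# Rendering to HTML is ordinary string formatting -- no LLM needed here
items = "\n".join(f"  <li>{i.quantity} {i.name}</li>" for i in groceries.items)
print(f"<ul>\n{items}\n</ul>")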

When should I use function calling, structured outputs or JSON mode? (2024-09-04)

An explanation of why JSON Mode is a superseded approach: it does not guarantee that the output follows the requested JSON schema.

JSON Mode was the first foray by OpenAI in creating reliable outputs. Toggling JSON mode on just required the output to be in valid JSON and did not ensure any schema adherence.

Developers wanted more and OpenAI & Gemini have since released Structured Outputs. Enabling Structured Outputs allows you to specify a JSON schema through Zod, Pydantic or through Vellum’s UI to define the JSON.
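The difference shows up directly in the request. A minimal sketch against the OpenAI chat completions API (model name, prompt, and schema are mine):

from openai import OpenAI

client = OpenAI()
messages = [{"role": "user", "content": "Jane Roe is 28. Return her name and age as JSON."}]

# JSON Mode: the output is guaranteed to be valid JSON, but its shape is not
loose = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    response_format={"type": "json_object"},
)

# Structured Outputs: the output must conform to the supplied JSON schema
strict = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "user_info",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                },
                "required": ["name", "age"],
                "additionalProperties": False,
            },
        },
    },
)

print(loose.choices[0].message.content)   # valid JSON, any shape
print(strict.choices[0].message.content)  # JSON matching the schema exactly

Note that JSON Mode also requires the prompt itself to mention JSON, which is why the example asks for it explicitly.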

The text then goes on to discuss in which situations function calling is needed and in which it is not.

Use Function Calling: You’ve given the model options of multiple tools/functions and you’d like the model to decide which tool to use.

Use response_format: When there’s a specific task at hand (e.g., data extraction) and the model is not using its reasoning capabilities to pick a task.

Groq: Introduction to Tool Use

Groq API offers best-effort matching for parameters, which means the model could occasionally miss parameters or misinterpret types for more complex tool calls. We recommend the Instructor library to simplify the process of working with structured data and to ensure that the model's output adheres to a predefined schema.

Other useful resources


Highly recommended: https://www.tamingllms.com/notebooks/structured_output.html#techniques

Concrete example 1: instructor

An example using groq, pydantic, and instructor.

import instructor
from dotenv import load_dotenv
from pydantic import BaseModel
from groq import Groq

# Load the Groq API key from .env file
load_dotenv()

# Describe the desired output schema using pydantic models
class UserInfo(BaseModel):
    name: str
    age: int
    email: str

# The text to extract data from
text = """
John Doe, a 35-year-old software engineer from New York, has been working with large language models for several years.
His email address is john.doe@example.com.
"""

# Patch Groq() with instructor, this is where the magic happens!
client = instructor.from_groq(Groq(), mode=instructor.Mode.JSON)

# Call the API
user_info = client.chat.completions.create(
    model="llama-3.1-70b-versatile",
    response_model=UserInfo, # Specify the response model
    messages=[
        {"role": "system", "content": "Your job is to extract user information from the given text."},
        {"role": "user", "content": text}
    ],
    temperature=0.65,
)

print(f"Name: {user_info.name}")
print(f"Age: {user_info.age}")
print(f"Email: {user_info.email}")

In practice, I would rather use outlines than the instructor library. So far, though, it looks like Groq and outlines don't work together.

Why outlines rather than instructor?

Concrete example 2: NuExtract

I installed the ollama tool, and after pulling the model (ollama pull sroecker/nuextract-v1.5-smol) I could run a test immediately.

This bash snippet:

PROMPT=$(cat <<'EOF'
<|input|>
### Template:
{
    "Model": {
        "Name": "",
        "Number of parameters": "",
        "Number of token": "",
    },
    "Usage": {
        "Use case": []
    }
}

### Text:
We introduce NuExtract-v1.5 -- a fine-tuning of Phi-3.5-mini-instruct, which is a 3.8B parameter language model. It is trained on a private high-quality dataset for structured information extraction. It supports long documents (up to 128k token context) and several languages (English, French, Spanish, German, Portuguese, and Italian). To use the model, provide an input text and a JSON template describing the information you need to extract.

<|output|>
EOF
)

# Escape newlines and other special characters
PROMPT_ESCAPED=$(printf '%s' "$PROMPT" | jq -Rs .)

# Generation parameters go under "options" in the Ollama API;
# top-level "max_tokens"/"temperature" would be silently ignored
curl -s -X POST http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
    "model": "sroecker/nuextract-v1.5-smol",
    "prompt": '"$PROMPT_ESCAPED"',
    "stream": false,
    "options": {"temperature": 0.0, "num_predict": 100}
}' | jq -r '.response' | jq

returns:

{
  "Model": {
    "Name": "NuExtract-v1.5",
    "Number of parameters": "3.8B parameter language model",
    "Number of token": "up to 128k token context"
  },
  "Usage": {
    "Use case": [
      "structured information extraction"
    ]
  }
}

The code is based on this write-up.
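For completeness, the same call from Python instead of bash; a sketch using the requests library against Ollama's /api/generate endpoint:

import json
import requests

TEMPLATE = """\
{
    "Model": {
        "Name": "",
        "Number of parameters": "",
        "Number of token": "",
    },
    "Usage": {
        "Use case": []
    }
}"""

TEXT = (
    "We introduce NuExtract-v1.5 -- a fine-tuning of Phi-3.5-mini-instruct, "
    "which is a 3.8B parameter language model. It is trained on a private "
    "high-quality dataset for structured information extraction. It supports "
    "long documents (up to 128k token context) and several languages (English, "
    "French, Spanish, German, Portuguese, and Italian). To use the model, "
    "provide an input text and a JSON template describing the information you "
    "need to extract."
)

# Same prompt format as the bash version above
prompt = f"<|input|>\n### Template:\n{TEMPLATE}\n\n### Text:\n{TEXT}\n\n<|output|>"

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "sroecker/nuextract-v1.5-smol",
        "prompt": prompt,
        "stream": False,
        # Generation parameters go under "options" in the Ollama API
        "options": {"temperature": 0.0, "num_predict": 200},
    },
)

extracted = json.loads(response.json()["response"])
print(json.dumps(extracted, indent=2))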

The sroecker/nuextract-v1.5-smol model is 3.4 GB; the sroecker/nuextract-tiny-v1.5 model is 0.99 GB.

About NuExtract:

NuExtract is a family of small open-source models that do only one thing: they extract information from documents and return a structured output.

More here.