Vision Language Models Explained


Vision language models learn from images and text simultaneously, so they can be used for many tasks, from visual question answering to image captioning. In this article, we give you a tour of the vision language model landscape: an overview of what they are, how they work, how to find the right model, how to run inference with them, and how to easily fine-tune them with the latest version of trl.

What is a vision language model?

A vision language model is a multimodal model that learns from images and text simultaneously. It is a generative model that takes image and text inputs and produces text outputs. Large vision language models have good zero-shot capabilities, generalize well, and can handle many types of images, including documents and web pages. Their applications include image-based chat, instruction-driven image recognition, visual question answering, document understanding, image captioning, and more. Some vision language models can also capture spatial information in an image: when prompted to detect or segment a specific object, they can output bounding boxes or segmentation masks, and some can localize different objects or answer questions about their relative or absolute positions. Existing large vision language models differ widely in their training data, image encoding methods, and so on, so their capabilities also vary greatly.

[Figure: VLM capabilities]

Overview of Open Source Vision Language Models

There are many open vision language models on the Hugging Face Hub; the table below lists some of them.

  • The list includes both base models and chat models fine-tuned for conversational use.
  • Some of these models have a "grounding" capability, which helps reduce hallucination.
  • Unless otherwise noted, all models are trained on English data.
Model | Commercially available | Model size | Image resolution | Other capabilities
LLaVA 1.6 (Hermes 34B) | | 34B | 672x672 |
deepseek-vl-7b-base | | 7B | 384x384 |
DeepSeek-VL-Chat | | 7B | 384x384 | Chat
moondream2 | | ~2B | 378x378 |
CogVLM-base | | 17B | 490x490 |
CogVLM-Chat | | 17B | 490x490 | Grounding, chat
Fuyu-8B | | 8B | 300x300 | Text detection within images
KOSMOS-2 | | ~2B | 224x224 | Grounding, zero-shot object detection
Qwen-VL | | 4B | 448x448 | Zero-shot object detection
Qwen-VL-Chat | | 4B | 448x448 | Chat
Yi-VL-34B | | 34B | 448x448 | Bilingual (English, Chinese)

Finding a suitable vision language model

There are several resources that can help you choose the best model for your use case.

Vision Arena is a leaderboard based on anonymous voting over model outputs, and its rankings are continuously refreshed. When a user submits an image and a prompt, outputs from two different anonymous models are generated, and the user then picks the output they prefer. The rankings produced this way are based entirely on human preference.

[Figure: Vision Arena]

The Open VLM Leaderboard is another option: vision language models are ranked by their average score across all metrics. You can also filter models by model size and by proprietary or open-source license, and rank them by the metric of your choice.

[Figure: Open VLM Leaderboard]

VLMEvalKit is a toolkit for running benchmarks on vision language models; the Open VLM Leaderboard is powered by it.

Another evaluation suite is LMMS-Eval, which provides a standard command-line interface for evaluating a Hugging Face model of your choice on datasets hosted on the Hugging Face Hub, as shown below:

    accelerate launch --num_processes=8 -m lmms_eval \
        --model llava \
        --model_args pretrained="liuhaotian/llava-v1.5-7b" \
        --tasks mme,mmbench_en \
        --batch_size 1 \
        --log_samples \
        --log_samples_suffix llava_v1.5_mme_mmbenchen \
        --output_path ./logs/

Both Vision Arena and the Open VLM Leaderboard only cover the models that have been submitted to them, and new models appear only when the leaderboards are updated. If you want to find other models, you can browse the models under the image-text-to-text task on the Hub.
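If you prefer to browse programmatically, here is a minimal sketch using huggingface_hub; it assumes a reasonably recent version in which list_models accepts a pipeline_tag argument.

    from huggingface_hub import HfApi

    api = HfApi()

    # List models tagged with the image-text-to-text task, most downloaded first.
    # Passing pipeline_tag directly is assumed to work with recent huggingface_hub releases.
    for model in api.list_models(pipeline_tag="image-text-to-text",
                                 sort="downloads", direction=-1, limit=10):
        print(model.id)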

The leaderboards use a number of different benchmarks to evaluate vision language models. Let's pick a few of them to introduce.

MMMU

MMMU (A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI) is the most comprehensive benchmark for evaluating vision language models. It contains 11.5K multimodal questions that require college-level subject knowledge and cross-disciplinary reasoning (for example, across art and engineering).

MMBench

MMBench consists of 3,000 single-choice questions covering more than 20 different skills, including OCR, object localization, and more. The paper also introduces an evaluation strategy called CircularEval, in which the answer choices of each question are shuffled into different permutations across rounds, and the model is expected to answer correctly in every round.
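To make the idea concrete, here is a minimal illustrative sketch of CircularEval (not the official implementation); ask_model is a hypothetical callable that returns the index of the option the model picked.

    def circular_eval(question, options, correct_idx, ask_model):
        """Rotate the answer options once per round and only count the question
        as correct if the model picks the right answer in every rotation."""
        n = len(options)
        for shift in range(n):
            rotated = options[shift:] + options[:shift]
            predicted_idx = ask_model(question, rotated)
            if rotated[predicted_idx] != options[correct_idx]:
                return False  # a single wrong round fails the whole question
        return True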

In addition, there are more targeted benchmarks for different application domains, such as MathVista (visual mathematical reasoning), AI2D (diagram understanding), ScienceQA (science question answering) and OCRBench (document understanding).

Technical details

There are many ways to pretrain a vision language model. The main technique is to unify the image and text representations and feed them to a text decoder for text generation. The most common and best-performing models typically stack, in order, an image encoder, an embedding projector used to align image and text representations (usually a dense neural network), and a text decoder. Different models then adopt different training approaches.
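To make the typical layout concrete, here is a simplified sketch; the module names, shapes, and the embed_tokens/inputs_embeds interface are assumptions for illustration, not any particular model's real implementation.

    import torch
    import torch.nn as nn

    class ToyVLM(nn.Module):
        """Illustrative stack: image encoder -> projector -> text decoder."""
        def __init__(self, image_encoder, text_decoder, vision_dim, text_dim):
            super().__init__()
            self.image_encoder = image_encoder                 # e.g. a CLIP-style vision encoder
            self.projector = nn.Linear(vision_dim, text_dim)   # aligns image features with the text space
            self.text_decoder = text_decoder                   # an autoregressive language model

        def forward(self, pixel_values, input_ids):
            image_features = self.image_encoder(pixel_values)         # (B, N_img, vision_dim)
            image_embeds = self.projector(image_features)             # (B, N_img, text_dim)
            text_embeds = self.text_decoder.embed_tokens(input_ids)   # (B, N_txt, text_dim)
            # Concatenate projected image features with the text embeddings and decode.
            inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
            return self.text_decoder(inputs_embeds=inputs_embeds)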

For example, LLaVA consists of a CLIP image encoder, a multimodal projector, and a Vicuna text decoder. The authors fed a dataset of images and captions to GPT-4 to generate questions related to the captions and images. They froze the image encoder and the text decoder and trained only the multimodal projector, feeding images and questions to the model and comparing its output with the ground-truth captions, in order to align the image and text features. After pretraining the projector, they kept the image encoder frozen, unfroze the text decoder, and continued training the decoder together with the projector. This pretraining-plus-fine-tuning recipe is the most common way to train vision language models.
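In code, this two-stage recipe mostly amounts to toggling requires_grad on the submodules. A rough sketch, reusing the hypothetical ToyVLM attribute names from above:

    # Stage 1: train only the projector; image encoder and text decoder stay frozen.
    for p in model.image_encoder.parameters():
        p.requires_grad = False
    for p in model.text_decoder.parameters():
        p.requires_grad = False
    for p in model.projector.parameters():
        p.requires_grad = True
    # ... pretrain the projector on image/question inputs against ground-truth captions ...

    # Stage 2: keep the image encoder frozen, unfreeze the text decoder,
    # and continue training the decoder together with the projector.
    for p in model.text_decoder.parameters():
        p.requires_grad = True
    # ... continue training ...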

[Figure: Typical structure of a vision language model]

[Figure: The projector output is concatenated with the text embeddings]

KOSMOS-2, on the other hand, is trained fully end to end, which is computationally expensive compared with LLaVA-style pretraining. After pretraining, the authors additionally fine-tune the model on language-only instructions to align it. Fuyu-8B takes yet another route and does not even have an image encoder: it feeds image patches directly to a projection layer, then concatenates the projector output with the text sequence and passes everything to an autoregressive decoder.
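Here is a rough, illustrative sketch of the Fuyu-style idea (the patch size and embedding dimension are made up for the example): the image is split into flattened patches, each patch is linearly projected straight into the decoder's embedding space, and the result is concatenated with the text embeddings.

    import torch
    import torch.nn as nn

    def patchify(pixel_values, patch_size=30):
        """Split (B, C, H, W) images into flattened patches of shape (B, N, C*patch*patch)."""
        b, c, h, w = pixel_values.shape
        patches = pixel_values.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
        patches = patches.permute(0, 2, 3, 1, 4, 5)                  # (B, H/p, W/p, C, p, p)
        return patches.reshape(b, -1, c * patch_size * patch_size)   # (B, N, C*p*p)

    # No image encoder: a single linear layer maps raw patches into the decoder embedding space.
    patch_projector = nn.Linear(3 * 30 * 30, 4096)

    pixel_values = torch.randn(1, 3, 300, 300)
    image_embeds = patch_projector(patchify(pixel_values))   # (1, 100, 4096)
    # These image embeddings would then be concatenated with the text embeddings
    # and fed to the autoregressive decoder.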

Most of the time you don't need to pretrain a vision language model yourself; you can either run inference with an existing model or fine-tune one for your own use case. Next, we show how to use these models with transformers and how to fine-tune them with SFTTrainer.

Using vision language models with transformers

You can run inference with LLaVA using the LlavaNext model, as shown below.

First, initialize the model and the processor.

    from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
    import torch

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
    model = LlavaNextForConditionalGeneration.from_pretrained(
        "llava-hf/llava-v1.6-mistral-7b-hf",
        torch_dtype=torch.float16,
        low_cpu_mem_usage=True
    )
    model.to(device)

Now pass the image and the text prompt to the processor, then pass the processed inputs to generate. Note that each model uses its own prompt template; make sure you use the right one to avoid performance degradation.

    from PIL import Image
    import requests

    url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
    image = Image.open(requests.get(url, stream=True).raw)

    prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"

    inputs = processor(prompt, image, return_tensors="pt").to(device)
    output = model.generate(**inputs, max_new_tokens=100)

Call decode to decode the output tokens.

    print(processor.decode(output[0], skip_special_tokens=True))

Fine-tuning vision language models with TRL

We are pleased to announce that TRL's SFTTrainer now supports vision language models as an experimental feature! Here we give an example of how to run SFT on the llava-instruct dataset, which contains 260k image-conversation pairs.

The llava-instruct dataset organizes user-assistant interactions as a sequence of messages, and each message sequence is paired with the image the user's question refers to.
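For reference, a single example in this format looks roughly like the following (the concrete text and the pil_image variable are made up for illustration; the messages/images field names match what the data collator below expects):

    example = {
        "images": [pil_image],  # the PIL image this conversation refers to
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image"},  # placeholder marking where the image goes
                    {"type": "text", "text": "What is shown in this image?"},
                ],
            },
            {
                "role": "assistant",
                "content": [
                    {"type": "text", "text": "A radar chart comparing benchmark scores."},
                ],
            },
        ],
    }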

To use the VLM training feature, install the latest version of TRL with pip install -U trl. You can find the full example script here.

    from transformers import TrainingArguments
    from trl.commands.cli_utils import SftScriptArguments, TrlParser

    parser = TrlParser((SftScriptArguments, TrainingArguments))
    args, training_args = parser.parse_args_and_config()

Initialize the chat template for instruction tuning.

    LLAVA_CHAT_TEMPLATE = """A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. {% for message in messages %}{% if message['role'] == 'user' %}USER: {% else %}ASSISTANT: {% endif %}{% for item in message['content'] %}{% if item['type'] == 'text' %}{{ item['text'] }}{% elif item['type'] == 'image' %}<image>{% endif %}{% endfor %}{% if message['role'] == 'user' %} {% else %}{{eos_token}}{% endif %}{% endfor %}"""

Now initialize the model and the tokenizer.

    from transformers import AutoTokenizer, AutoProcessor, TrainingArguments, LlavaForConditionalGeneration
    import torch

    model_id = "llava-hf/llava-1.5-7b-hf"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.chat_template = LLAVA_CHAT_TEMPLATE
    processor = AutoProcessor.from_pretrained(model_id)
    processor.tokenizer = tokenizer

    model = LlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16)

Build a data collator that combines text and image pairs.

    class LLavaDataCollator:
        def __init__(self, processor):
            self.processor = processor

        def __call__(self, examples):
            texts = []
            images = []
            for example in examples:
                messages = example["messages"]
                # Render the message list into a single training string using the chat template.
                text = self.processor.tokenizer.apply_chat_template(
                    messages, tokenize=False, add_generation_prompt=False
                )
                texts.append(text)
                images.append(example["images"][0])

            batch = self.processor(texts, images, return_tensors="pt", padding=True)

            # Use the input ids as labels, masking out padding so it does not contribute to the loss.
            labels = batch["input_ids"].clone()
            if self.processor.tokenizer.pad_token_id is not None:
                labels[labels == self.processor.tokenizer.pad_token_id] = -100
            batch["labels"] = labels

            return batch

    data_collator = LLavaDataCollator(processor)

Load the dataset.

    from datasets import load_dataset

    raw_datasets = load_dataset("HuggingFaceH4/llava-instruct-mix-vsft")
    train_dataset = raw_datasets["train"]
    eval_dataset = raw_datasets["test"]

Initialize SFTTrainer, passing in the model, the dataset splits, and the data collator (and, optionally, a PEFT configuration), then call train(). To push the final checkpoint to the Hub, call push_to_hub().

    from trl import SFTTrainer

    trainer = SFTTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        dataset_text_field="text",  # needs a dummy field
        tokenizer=tokenizer,
        data_collator=data_collator,
        dataset_kwargs={"skip_prepare_dataset": True},
    )

    trainer.train()

Save the model and push it to the Hugging Face Hub.

    trainer.save_model(training_args.output_dir)
    trainer.push_to_hub()

You can find the trained model here. You can also try the model out directly in the demo below ⬇️

 https://HuggingFaceH4-vlm-playground.hf.space

Acknowledgements

We thank Pedro Cuenca, Lewis Tunstall, Kashif Rasul and Omar Sanseviero for their comments and suggestions on this article.


Original English post: https://hf.co/blog/vlms. Original authors: Merve Noyan, Edward Beeching. Translator: Matrix Yao, a deep learning engineer at Intel, working on applying transformer-family models to data of various modalities and on the training and inference of large-scale models.
