diff --git a/TUTORIAL.md b/TUTORIAL.md
index f7915e4..545af20 100644
--- a/TUTORIAL.md
+++ b/TUTORIAL.md
@@ -218,3 +218,36 @@ You are expected to get an output similar to the following:
 **Please note that travel planning is a fairly subjective question, so the responses generated by the model may be subject to a high degree of randomness. If you do not set the random seed using ```torch.manual_seed(1234)```, the output will be different each time. Even if you set the random seed, the results obtained may still differ from this tutorial due to differences in hardware and software environments.**
+
+### Grounded Captioning
+Qwen-VL can output the bounding box information of the subject while captioning the image. For example:
+
+```
+img_url = 'assets/apple.jpeg'
+query = tokenizer.from_list_format([
+    {'image': img_url},
+    {'text': 'Generate the caption in English with grounding:'},
+])
+response, history = model.chat(tokenizer, query=query, history=None)
+print(response)
+
+image = tokenizer.draw_bbox_on_latest_picture(response, history)
+if image is not None:
+    image.save('apple.jpg')
+```
+
+The saved ```apple.jpg``` will look similar to the screenshot below:
+
+
+
+#### How to get the caption without any box-like annotations
+Sometimes you may not want any box-like annotations in the response. In that case, you can reliably obtain the clean caption text with the following post-processing.
+
+```
+# response = ' Two apples
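+# One possible post-processing step (a sketch: the <ref>...</ref> and
+# <box>...</box> tag format is assumed from Qwen-VL's grounded output;
+# adjust the pattern if your response uses different tags):
+import re
+
+clean_text = re.sub(
+    r'<ref>(.*?)</ref>|<box>.*?</box>',
+    lambda m: m.group(1) or '',  # keep referenced text, drop box coordinates
+    response,
+).strip()
+print(clean_text)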