I Practiced Building an Image-Recognition AI

pondahai

6 min read · Feb 26, 2024

Building an AI that looks at a picture and talks about it, using the Google Gemini API

Google has opened up an API for its Gemini AI and lets anyone make up to 60 API calls per minute, which is plenty for personal experimentation.

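So I asked Gemini how to write a Python application. After a few rounds of conversation it gave me a skeleton, and after roughly half a day of tidying up, this little program was done.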

When run, it takes a photo with the first camera on the system, which may be a laptop's built-in camera or a USB webcam attached to a desktop.

The core of the program is the API call. It uses the gemini-pro-vision model, which is multimodal: you send a prompt together with an image file, and the AI replies with text.

Finally, the reply text is read aloud with speech synthesis. The whole process is a lot of fun.

For the photo step, the captured frame is first saved as a JPEG file named image1.jpg. That file is then read back in binary mode and converted to a Base64 string.
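The read-and-encode step can be sketched on its own. This is a minimal illustration, not the program's exact code: the helper name `encode_image_file` is made up for this example, and it works on any file, not just a camera capture.

```python
import base64
from pathlib import Path


def encode_image_file(path):
    """Read a file in binary mode and return its Base64 encoding as bytes."""
    raw = Path(path).read_bytes()   # same effect as open(path, "rb").read()
    return base64.b64encode(raw)    # note: the result is bytes, not str


if __name__ == "__main__":
    # Assumes image1.jpg already exists in the current directory.
    encoded = encode_image_file("image1.jpg")
    print(len(encoded), "Base64 bytes")
```

Base64 output is about one third larger than the input, so a smaller capture resolution keeps the request body small.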

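Next, the prompt for the AI and the image's Base64 string are packed into a JSON payload. One thing to note: base64.b64encode returns bytes, so the result has to be decoded as UTF-8 into a str before it can be placed in the JSON.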
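Here is a small sketch of why the decode matters and what the payload looks like. The function name `build_payload` is hypothetical, and the three image bytes are a stand-in for a real JPEG; the dictionary layout follows the one used in the program.

```python
import base64
import json


def build_payload(prompt, image_bytes, mime_type="image/jpeg"):
    """Pack a prompt and raw image bytes into a generateContent-style dict."""
    # b64encode returns bytes; json.dumps cannot serialize bytes, so decode to str.
    b64_text = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "contents": [
            {
                "parts": [
                    {"text": prompt},
                    {"inline_data": {"mime_type": mime_type, "data": b64_text}},
                ]
            }
        ]
    }


# Placeholder bytes standing in for the contents of image1.jpg.
payload = build_payload("看到什麼", b"\xff\xd8\xff")
request_body = json.dumps(payload)  # succeeds because every value is str, list, or dict
```

If the `.decode("utf-8")` is left out, `json.dumps` raises a `TypeError` because bytes objects are not JSON serializable.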

The JSON-wrapped data is saved to a file, request.json, and the same data is then sent to the server as the payload when requests.post calls the REST API in the next step.
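The call itself can be wrapped in two small functions. This is a sketch under assumptions: the names `endpoint_url` and `call_gemini`, the 30-second timeout, and the `raise_for_status()` check are my additions for illustration; the URL is the one the program uses.

```python
API_BASE = "https://generativelanguage.googleapis.com/v1beta/models"


def endpoint_url(api_key, model="gemini-pro-vision"):
    """Build the generateContent URL; strip() drops a newline read from a key file."""
    return f"{API_BASE}/{model}:generateContent?key={api_key.strip()}"


def call_gemini(payload, api_key, timeout=30):
    """POST the payload as JSON and return the decoded JSON response."""
    import requests  # third-party; imported here so the URL helper works without it

    response = requests.post(endpoint_url(api_key), json=payload, timeout=timeout)
    response.raise_for_status()  # surface HTTP errors instead of parsing an error body
    return response.json()
```

Passing the dictionary through `json=` lets requests serialize it and set the Content-Type header, so the saved request.json file is only needed if you want to replay the request with another tool such as curl.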

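If the AI responds, the response contains data under the key "text". The program pulls it out into a variable and hands it to the next step, speech synthesis.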
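As an alternative to searching the whole response for a "text" key, the reply can be read from the fixed path candidates → content → parts. The sketch below assumes that response shape; the function name `extract_text` and the sample response are invented for illustration.

```python
def extract_text(response_json):
    """Return the reply text from a generateContent-style response, or None."""
    try:
        parts = response_json["candidates"][0]["content"]["parts"]
    except (KeyError, IndexError, TypeError):
        return None  # error response, empty candidate list, or unexpected shape
    if not isinstance(parts, list):
        return None
    texts = [p["text"] for p in parts if isinstance(p, dict) and "text" in p]
    return "".join(texts) or None


# Hand-written sample, not a real API reply.
sample = {"candidates": [{"content": {"parts": [{"text": "a cat on a desk"}]}}]}
reply = extract_text(sample)
```

Returning None for anything unexpected lets the caller decide what to do before speech synthesis, which cannot work without text.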

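Speech synthesis uses the gTTS library. Normally it synthesizes the text and saves the speech as an audio file, but because the library provides write_to_fp, which writes the audio to a file-like object in memory, there is no need to save an external file here. The audio can be passed straight to the next step, playback.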
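The in-memory trick looks roughly like this. The function name `synthesize_to_memory` and the `tts_factory` parameter are hypothetical; the parameter exists only so the buffer handling can be tried without the gTTS package or a network connection.

```python
from io import BytesIO


def synthesize_to_memory(text, lang="zh-TW", tts_factory=None):
    """Synthesize speech into an in-memory buffer instead of a file on disk."""
    if tts_factory is None:
        from gtts import gTTS  # third-party; needs network access when it runs
        tts_factory = gTTS
    buffer = BytesIO()
    tts_factory(text=text, lang=lang).write_to_fp(buffer)
    buffer.seek(0)  # rewind so the player reads from the start
    return buffer
```

The returned buffer can be handed to `pygame.mixer.music.load`, as the program does. Forgetting the `seek(0)` leaves the read position at the end of the buffer, so the player would find no data.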

And that completes an AI program that looks at a picture and talks about it.

from gtts import gTTS
from io import BytesIO
import os
import json
import base64

import pygame
pygame.init()

import cv2

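# take photo
cap = cv2.VideoCapture(0)
ret, frame = cap.read()
cap.release()
if not ret:
    # no camera, or the camera returned no frame
    raise RuntimeError("could not read a frame from camera 0")
cv2.imshow("capture", frame)
cv2.imwrite("image1.jpg", frame)
cv2.waitKey(3000)  # show the preview for up to 3 seconds

cv2.destroyAllWindows()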

#os.system("fswebcam -r 800x600 --no-banner image1.jpg")

import time
time.sleep(1)

# base64 encode
with open("image1.jpg", "rb") as image_file:
    encoded_string = base64.b64encode(image_file.read())

# json
dictionary = {
    'contents':[
        {
            'parts':[
                {   'text': '看到什麼'},  # prompt: "What do you see?"
                {    
                    'inline_data':{
                        'mime_type': 'image/jpeg',
                        'data': encoded_string.decode('utf-8')
                    }
                }
            ]
        }
    ]
}
 
#print(dictionary)

# Serializing json
json_object = json.dumps(dictionary)
 
# Writing to request.json
with open("request.json", "w") as outfile:
    outfile.write(json_object)


# get API_KEY
with open("apikey.txt", "r") as f:
    # strip the trailing newline so it does not end up in the request URL
    API_KEY = f.readline().strip()

# REST API
import requests
response = requests.post("https://generativelanguage.googleapis.com/v1beta/models/gemini-pro-vision:generateContent?key=" + API_KEY, json=dictionary)
response_json = response.json()
#print(response_json)
def get_value(data, key):
    if isinstance(data, dict):
        for k, v in data.items():
            if k == key:
                return v
            else:
                value = get_value(v, key)
                if value is not None:
                    return value
    elif isinstance(data, list):
        for v in data:
            value = get_value(v, key)
            if value is not None:
                return value
    return None
#print(get_value(response_json, "text"))
#response_text = response_json['candidates'][0]['content']['parts'][0]['text']
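response_text = get_value(response_json, "text")
if response_text is None:
    # error responses carry no "text" key, and gTTS cannot speak None
    raise SystemExit("No text in the API response: " + json.dumps(response_json))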

#result = os.popen("curl https://generativelanguage.googleapis.com/v1beta/models/gemini-pro-vision:generateContent?key="+API_KEY+" -H 'Content-Type: application/json' -d @request.json ").read()
#print(result)
#response_text = json.loads(result)
print(response_text)

# TTS
mp3_fp = BytesIO()
tts = gTTS(text=response_text, lang='zh-TW')
tts.write_to_fp(mp3_fp)

# audio play
mp3_fp.seek(0)
pygame.mixer.init()
pygame.mixer.music.load(mp3_fp)
pygame.mixer.music.play()
while pygame.mixer.music.get_busy():
    pygame.time.Clock().tick(10)

References:
https://ai.google.dev/tutorials/rest_quickstart
