In this article, I demonstrate how to build a computer-vision application that detects objects requested by voice command, estimates their approximate distance, and uses that location information to make daily life easier for blind individuals.
The primary goal of the project is to process real-time data, much like wearable technologies from Meta and Envision, in order to enhance users' daily experiences.
Steps Covered in this Tutorial
The steps we will walk through in this project:
- Importing libraries and defining class parameters.
- Voice recognition and processing function definitions.
- Detecting the requested object, locating it in the frame, estimating its distance, and issuing voice notifications.
Importing Libraries and Defining Class Parameters
- speech_recognition: captures audio from the microphone and converts speech to text.
- cv2 (OpenCV): captures webcam footage and applies image-processing operations to it.
- numpy: mathematical operations on arrays.
- ultralytics: provides the pre-trained YOLOv8 model.
- pyttsx3: text-to-speech conversion.
- math: trigonometric calculations and other mathematical operations.
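All of these packages can be installed from PyPI; note that microphone access through speech_recognition additionally requires PyAudio:

pip install SpeechRecognition PyAudio opencv-python numpy ultralytics pyttsx3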
import speech_recognition as sr
import cv2
import numpy as np
from ultralytics import YOLO
import pyttsx3
import math
class_names = ["person", "bicycle", "car", "motorbike", "aeroplane", "bus", "train", "truck", "boat",
               "traffic light", "fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat",
               "dog", "horse", "sheep", "cow", "elephant", "bear", "zebra", "giraffe", "backpack", "umbrella",
               "handbag", "tie", "suitcase", "frisbee", "skis", "snowboard", "sports ball", "kite", "baseball bat",
               "baseball glove", "skateboard", "surfboard", "tennis racket", "bottle", "wine glass", "cup",
               "fork", "knife", "spoon", "bowl", "banana", "apple", "sandwich", "orange", "broccoli",
               "carrot", "hot dog", "pizza", "donut", "cake", "chair", "sofa", "pottedplant", "bed",
               "diningtable", "toilet", "tvmonitor", "laptop", "mouse", "remote", "keyboard", "cell phone",
               "microwave", "oven", "toaster", "sink", "refrigerator", "book", "clock", "vase", "scissors",
               "teddy bear", "hair drier", "toothbrush"]
object_dimensions = {
    "bird": "0.10",
    "cat": "0.45",
    "backpack": "0.55",
    "umbrella": "0.50",
    "bottle": "0.20",
    "wine glass": "0.25",
    "cup": "0.15",
    "fork": "0.15",
    "knife": "0.25",
    "spoon": "0.15",
    "banana": "0.20",
    "apple": "0.07",
    "sandwich": "0.20",
    "orange": "0.08",
    "chair": "0.50",
    "laptop": "0.40",
    "mouse": "0.10",
    "remote": "0.20",
    "keyboard": "0.30",
    "cell phone": "0.15",
    "book": "0.18",
    "toothbrush": "0.16"
}
I store the classes of the YOLOv8 model, which was trained on the COCO dataset, in the ‘class_names’ variable, and their average real-world widths (in meters) in the ‘object_dimensions’ variable. Since this application is intended for a home environment, I selected objects commonly found there. If you wish to work with your own dataset, you will need to train a custom object detector and modify these variables accordingly.
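For the custom route, the change is small. Here is a minimal sketch, assuming you have a dataset config file (the name ‘my_dataset.yaml’ and the training settings are illustrative); after training, the class list can be read from the model itself:

from ultralytics import YOLO

# Hypothetical example: fine-tune on your own dataset, then load the resulting weights
model = YOLO("yolov8n.pt")
model.train(data="my_dataset.yaml", epochs=50)     # your dataset config (assumed name)
model = YOLO("runs/detect/train/weights/best.pt")  # default Ultralytics output location
class_names = list(model.names.values())           # the class list comes from the model itself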
Voice Recognition and Processing Function Definitions
Commands such as “Where is my book?”, “Find the book!”, or simply “Book.” all place the searched-for object at the end of the sentence. Assuming this pattern, I define a function called ‘get_last_word’ that returns the last word of the recognized sentence, i.e., the object name.
def get_last_word(sentence):
    words = sentence.split()
    return words[-1] if words else ""
I define a function named ‘voice_command’ that listens for a spoken command and returns the object the user wants to search for, together with that object’s average real-world width in meters.
def voice_command():
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        print("Waiting for voice command...")
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)

    target_object = ""
    real_width = 0.15

    try:
        command = recognizer.recognize_google(audio, language="en-US")
        print("Recognized command:", command)
        last_word = get_last_word(command.lower())
        if last_word:
            print("Last word:", last_word)
            target_object = last_word.lower()
            if target_object in object_dimensions:
                real_width = float(object_dimensions[target_object])
                print(real_width)
            else:
                print(f"No width information found for {target_object}, using the default value of 0.15.")
    except sr.UnknownValueError:
        print("Voice could not be understood.")
    except sr.RequestError as e:
        print("Voice recognition error; {0}".format(e))

    return target_object, real_width
I’m creating a function called ‘voice_notification’ to alert the user with spoken feedback.
def voice_notification(obj_name, direction, distance):
    engine = pyttsx3.init()
    text = "{} is at {}. It is {:.2f} meters away.".format(obj_name, direction, distance)
    engine.say(text)
    engine.runAndWait()
I’m loading the YOLOv8 model; the pre-trained weights are published in the Ultralytics repository, and the ultralytics package also downloads them automatically on first use. In the main loop, I estimate the distance from the camera to the object requested by voice command and announce to the user the clock-face direction in which the object lies.
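The distance estimate rests on the pinhole-camera relation: distance ≈ (real width × focal length in pixels) / perceived width in pixels. As a simplification, I treat the focal length as roughly equal to the frame width, so the result is a rough approximation rather than a calibrated measurement. A quick numeric check with illustrative values:

# Pinhole approximation used in main() below, with illustrative numbers:
# a 0.18 m wide book occupying 160 px of a 640 px wide frame
real_width = 0.18                      # meters (the book entry in object_dimensions)
frame_width, camera_width = 640, 160   # frame width and bounding-box width, in pixels
distance = (real_width * frame_width) / camera_width   # = 0.72 meters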
def main():
    # Load the YOLO model
    model = YOLO("yolov8n.pt")

    # Get video frame dimensions for the distance and direction calculations
    cap = cv2.VideoCapture(0)
    frame_width = cap.get(cv2.CAP_PROP_FRAME_WIDTH)
    frame_height = cap.get(cv2.CAP_PROP_FRAME_HEIGHT)
    center_x = int(frame_width // 2)
    center_y = int(frame_height // 2)
    radius = min(center_x, center_y) - 30  # Radius of the circle on which the clock numbers are drawn

    # The target object the user wants to search for via voice command, and its real-world average width
    target_object, real_width = voice_command()

    while True:
        success, img = cap.read()
        if not success:
            break

        # Predict objects using the YOLO model
        results = model.predict(img, stream=True)

        # Draw the clock-face numbers around the frame center
        for i in range(1, 13):
            angle = math.radians(360 / 12 * i - 90)
            x = int(center_x + radius * math.cos(angle))
            y = int(center_y + radius * math.sin(angle))
            thickness = 3 if i % 3 == 0 else 1
            cv2.putText(img, str(i), (x - 10, y + 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), thickness)

        # Detect and process the objects recognized by the model
        for r in results:
            for box in r.boxes:
                x1, y1, x2, y2 = box.xyxy[0]
                x1, y1, x2, y2 = int(x1), int(y1), int(x2), int(y2)
                cls = int(box.cls)
                if class_names[cls].lower() == target_object:
                    # Pinhole approximation: the narrower the box, the farther the object
                    camera_width = x2 - x1
                    distance = (real_width * frame_width) / camera_width

                    # Vector from the frame center to the object center
                    obj_center_x = (x1 + x2) // 2
                    obj_center_y = (y1 + y2) // 2
                    vector_x = obj_center_x - center_x
                    vector_y = obj_center_y - center_y
                    angle_deg = math.degrees(math.atan2(vector_y, vector_x))
                    if angle_deg < 0:
                        angle_deg += 360

                    # Map the angle to a clock-face direction (0 degrees points at 3 o'clock)
                    if 0 <= angle_deg < 30:
                        direction = "3 o'clock"
                    elif 30 <= angle_deg < 60:
                        direction = "4 o'clock"
                    elif 60 <= angle_deg < 90:
                        direction = "5 o'clock"
                    elif 90 <= angle_deg < 120:
                        direction = "6 o'clock"
                    elif 120 <= angle_deg < 150:
                        direction = "7 o'clock"
                    elif 150 <= angle_deg < 180:
                        direction = "8 o'clock"
                    elif 180 <= angle_deg < 210:
                        direction = "9 o'clock"
                    elif 210 <= angle_deg < 240:
                        direction = "10 o'clock"
                    elif 240 <= angle_deg < 270:
                        direction = "11 o'clock"
                    elif 270 <= angle_deg < 300:
                        direction = "12 o'clock"
                    elif 300 <= angle_deg < 330:
                        direction = "1 o'clock"
                    else:
                        direction = "2 o'clock"

                    cv2.putText(img, direction, (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 255), 2)
                    cv2.putText(img, "Distance: {:.2f} meters".format(distance), (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
                    cv2.rectangle(img, (x1, y1), (x2, y2), (255, 0, 255), 3)

                    # Announce the object only when it has actually been detected;
                    # note that runAndWait() blocks the loop while speaking
                    voice_notification(target_object, direction, distance)

        cv2.imshow("Webcam", img)
        if cv2.waitKey(1) == ord("q"):
            break

    cap.release()
    cv2.destroyAllWindows()


if __name__ == "__main__":
    main()
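As a design note, the 12-branch direction chain in main() could be collapsed into a single line of arithmetic. A minimal equivalent sketch (the offset of 2 accounts for the fact that 0 degrees points at 3 o'clock):

def clock_direction(angle_deg):
    # angle_deg in [0, 360), measured from the 3 o'clock axis, increasing clockwise in image coordinates
    hour = (int(angle_deg // 30) + 2) % 12 + 1
    return "{} o'clock".format(hour)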
Conclusion and Recommendations
- I wanted to demonstrate how computer vision can enhance human vision and make our lives easier, and I hope this project has broadened your horizons.
- With your own datasets, you can further develop and customize this project to meet your specific needs and add functions tailored to your goals. For example, you could use OCR (Optical Character Recognition) to read any printed text aloud, which could help someone find a product in a store or listen to a book, among many other applications; a minimal sketch follows below.
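As a starting point for the OCR idea, here is a minimal sketch, assuming the Tesseract engine and its pytesseract Python wrapper are installed; it extracts text from a single camera frame and speaks it:

import cv2
import pytesseract  # assumes the Tesseract OCR engine is installed on the system
import pyttsx3

def read_text_aloud(frame):
    # Grayscale usually improves OCR accuracy on webcam frames
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    text = pytesseract.image_to_string(gray)
    if text.strip():
        engine = pyttsx3.init()
        engine.say(text)
        engine.runAndWait()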
Feel free to share any ideas that come to your mind.
References:
- Ultralytics official repository: https://github.com/ultralytics/ultralytics
- PyImageSearch
- Stack Overflow