Deep Learning and Detection Technique with Least Image-Capturing for Multiple Pill Dispensing Inspection
Abstract
In this study, we propose a method to effectively increase the performance of small-object detection using limited training data. We aimed at detecting multiple objects in an image using training data in which each image contains only a single object. Medical pills of various shapes and colors were used as the learning and detection targets. We propose a labeling automation process to easily create label files for learning and a three-dimensional (3D) augmentation technique that applies stereo vision and 3D photo inpainting (3DPI) to avoid overfitting caused by limited data. We also apply confidence-based nonmaximum suppression and voting to improve detection performance. The proposed 3D augmentation, 2D rotation, nonmaximum suppression, and voting algorithms were applied in experiments conducted with 20 and 40 types of pills. The precision, recall, individual accuracy, and combination accuracy of the experiment with 20 types of pills were 0.998, 1.000, 0.998, and 0.991, respectively, and those for the experiment with 40 types of pills were 0.986, 0.999, 0.985, and 0.940, respectively.
1. Introduction
Drug prescription and inventory management are important for ensuring safe drug dispensing and require promptness and accuracy. Hospitals handle 500–1000 types of pills, and various prescriptions are required depending on the conditions of the patients. The drug prescription and inventory management processes require a long time in many hospitals and pharmacies because the pills are manually sorted and packed by pharmacists according to the prescriptions. In addition, because performing simple tasks repetitively causes fatigue, mistakes may occur in pill classification, which may lead to medical accidents. Recently, automated equipment such as automatic drug-dispensing machines [1–3] has gained popularity and is being used in pharmacies and hospitals to sort and package pills. An automatic drug-dispensing machine sorts and packs drugs based on prescriptions entered in a computer program. However, because there is a risk of error in automatic dispensing machines, it is necessary to inspect the prepared products.
Visual inspection using digital cameras is widely used for pill inspection. Existing visual inspection methods include rule-based analysis, in which the product features are compared and analyzed, and template matching, in which the similarity to a reference image is analyzed [4–7]. Deep-learning object-detection algorithms have also been actively studied recently [8–11]. In a rule-based algorithm, the characteristics of the objects, such as their colors, sizes, shapes, and identification marks, are compared. Image processing is used to extract the object features: a binary threshold is applied to separate the object from the background, morphological operations are then used to remove the blob noise generated by the thresholding, and histogram equalization, gamma transformation, and retinex filtering are applied to make the brightness of the captured image uniform [12–14].
Template matching is another type of rule-based algorithm in which a template of the image to be classified is registered and the input image is compared with the template. In general, object detection using rule-based image-processing methods is hindered by reflected light and adjacent objects, and objects with similar shapes are difficult to classify using only rule-based methods. Furthermore, the template matching method requires several templates to be registered for each object because the matching result depends strongly on the registered templates, and the increased number of templates reduces the processing speed.
Convolutional neural network (CNN)-based deep-learning algorithms have recently become the subject of active study because they enable the detection of various types of objects that cannot be easily detected using existing rule-based algorithms. CNNs overcome the problem of insufficiently expressed local information around the image pixels in early neural networks by introducing a convolution operation. The recognition rate of CNNs subsequently increased with the emergence of VGG, ResNet, and GoogLeNet, which change the layer depth or structure [15–17]. Object detection methods applying deep learning can be divided into two-step and one-step algorithms. Two-step algorithms first identify a group of object candidates and subsequently classify the identified candidates; representative algorithms include R-CNN, Fast R-CNN, Faster R-CNN, and Mask R-CNN in the R-CNN series [18–21]. In contrast, one-step algorithms simultaneously identify the location and type of objects from the feature maps generated by the CNN layers. You only look once (YOLO) is a representative one-step algorithm [22]. YOLO can be trained faster than R-CNN using the same amount of training data, and real-time detection at up to 65 fps is possible.
A large amount of training data is required to ensure the performance of deep-learning algorithms; if only a small amount of data is used, overfitting occurs and the model becomes difficult to generalize. Various data augmentation methods have therefore been used to avoid overfitting [23]. Data augmentation supplements insufficient training data by artificially transforming the training images. Representative data augmentation methods include rotation, brightness or saturation adjustment, enlargement, reduction, shearing, noise addition, and image translation. In these methods, the data are multiplied by transforming the two-dimensional (2D) information of the photographed objects. However, when images are captured from a training video, the apparent shape of an object located a small distance from the camera varies with its position in the frame because of depth effects. Such changes are difficult to express in 2D and require three-dimensional (3D) image augmentation that considers the height and sides of the object.
3D augmentation requires depth maps to be acquired. Existing techniques for acquiring depth information include structured light cameras, laser scanning, time-of-flight (TOF) cameras, and stereo cameras [24]. In the structured light camera method, the depth information of an object is acquired by projecting structured light of various shapes, such as points or planes, onto the object and analyzing how the structured light is deformed in the captured image [25]. This method requires a separate projector for the structured light and has the disadvantage of being significantly affected by external light. In the laser scanning method, the depth of an object is estimated by scanning it with a laser light source. Although ultra-precise depth information can be acquired with an accuracy of 1 mm, a mechanical device is required to move the laser light source and sensor, and a long time is needed for scanning. In the TOF camera method, the distance is estimated by emitting an infrared signal and measuring the time required for the signal to return after being reflected from an object within the measurement range [26]. The depth information can be measured in a short time at a rate of approximately 30 fps; however, the sensor and object must be separated by at least 2 m, and the achievable image resolution is low. In the stereo camera method, the depth information is estimated from the relative positions of an object in images captured by two or more cameras. The stereo camera method is affected by the performance of the cameras and lenses and by the distance between the cameras, but it has the advantage of being less affected by external light than the other methods.
The main contributions of this study are as follows:
- (i) We propose a method to improve detection performance using training data with a minimal number of captured images. The proposed capturing system can effectively generate training and test data using only four cameras.
- (ii) Operator intervention during the capture of training data is reduced to three placements per pill: one scene each is recorded for the front, rear, and side of a single pill.
- (iii) Image processing is used to detect the pill position in each training image and to automatically generate the corresponding label file.
- (iv) 3D augmentation based on stereo vision and 3D photo inpainting (3DPI) is used to increase the amount of training data, and the proposed image rotation method fills the empty regions of the rotated image without distortion.
- (v) Confidence-based nonmaximum suppression (NMS) and voting algorithms are used to improve the model performance in the decision stage.
2. Related Works
2.1. YOLOv4
YOLO is a model in which features are extracted from the entire image and used to predict the locations of objects. In the YOLO model, the image is divided into an S × S grid, and the bounding box positions and class probabilities are predicted for each grid cell. Four versions of YOLO, denoted v1, v2, v3, and v4, have been developed. The concept of a bounding box was used in YOLOv1 to solve the localization problem directly as a regression problem. Anchor boxes were subsequently introduced in later versions of YOLO to facilitate the solution of the regression problem.
YOLOv2 [27] achieves improved object detection by adopting batch normalization, anchor box, direct location prediction, and multiscale training methods. The object detection performance is further improved in YOLOv3 [28] through predictions across different scales and residual blocks. The bounding box is predicted across different scales in YOLOv3 by composing feature maps from three different scales into a structure similar to that of a feature pyramid network. This allows meaningful information to be extracted from the previous layer to obtain subdivided information from the current layer.
YOLOv4 [29] has a structure similar to that of YOLOv3. Its detection performance is improved by applying additional techniques such as cross-stage partial connections (CSP) [30], spatial pyramid pooling (SPP) [31], and a path aggregation network (PAN) [32]. The YOLOv4 structure consists of a backbone, neck, and head. The backbone, which is based on a CSPDarknet53 structure with batch normalization and Mish activation functions, generates feature maps from the input image. CSPDarknet53 improves object detection by slightly deepening the Darknet53 network used in YOLOv3. In addition, because only half of the feature map passes through the residual block in CSPDarknet53, a bottleneck layer is not required. The neck connects the backbone and head and reconstructs and refines the feature maps. The neck in YOLOv4 comprises SPP and a PAN. SPP contributes to the extraction of contextual features and effectively increases the receptive field by concatenating four parallel max-pooling branches. The PAN improves the information flow by expanding the paths between the lowest and uppermost feature layers, which improves the overall feature hierarchy and thereby the localization performance. The head in YOLOv4 is the same as the prediction structure of YOLOv3; it enables predictions across different scales by predicting boxes from feature maps at three different scales. Furthermore, YOLOv4 uses Mosaic and CutMix [33] augmentation to improve its ability to detect objects in the training data. Figure 1 shows the structure of YOLOv4.

2.2. Stereo Vision Disparity Map
In the stereo vision method, the three-dimensional coordinates of each point on an object are reconstructed to estimate its depth. Two cameras are placed horizontally and capture two images simultaneously to generate a disparity map, which represents the horizontal shift of each point between the left and right images. Stereo-vision-based disparity map algorithms can be divided into global and local methods [34]. In local methods, the disparity is calculated using the brightness information of the pixels within a predefined window. Because a local method uses only the information inside the window, its computational complexity is lower than that of a global method. A local method comprises four steps: matching cost computation, cost aggregation, disparity selection, and disparity refinement [35]. The matching cost is defined over the disparity of a point in the stereo-pair image. Common matching cost functions include the sum of absolute or squared differences, normalized cross-correlation, and the rank and census transforms. Adjacent pixel information is used in cost aggregation to reduce matching uncertainty. The disparity is calculated and optimized during disparity selection, and the noise in the disparity map is reduced during disparity refinement, which consists of regularization and occlusion filling: in the former, filtering is used to remove overall noise, and in the latter, regions with uncertain disparity values are interpolated from their adjacent values. In contrast, global methods minimize the cost of a global energy function over all disparity values. Various methods to solve the global energy minimization problem using Markov random fields have been proposed [36, 37]. Although global methods provide good performance, they are not suitable for real-time processing owing to their high computational complexity.
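For illustration, the following Python sketch computes a disparity map with OpenCV's semiglobal block matcher (StereoSGBM); the file names and parameter values are illustrative assumptions and are not taken from this study.

```python
# A minimal sketch of disparity-map computation with OpenCV's semiglobal
# block matcher; parameter values are illustrative, not those of the paper.
import cv2
import numpy as np

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)    # hypothetical file names
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# StereoSGBM combines local block matching with a semiglobal smoothness cost.
matcher = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=64,          # must be a multiple of 16
    blockSize=9,                # matching window size
    P1=8 * 9 * 9,               # penalty for small disparity changes
    P2=32 * 9 * 9,              # penalty for large disparity changes
    uniquenessRatio=10,
    speckleWindowSize=100,      # post-filtering of small disparity blobs
    speckleRange=2,
)

# The raw output is a fixed-point map scaled by 16.
disparity = matcher.compute(left, right).astype(np.float32) / 16.0
```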
2.3. 3D Photo Inpainting
3D image conversion is a technique for rendering a 2D image captured by a camera from a new viewpoint. It can be used to reproduce and record visual perceptions from various angles. Classical image-based rendering technology requires sophisticated image-capturing techniques and specialized hardware. 3D image conversion methods that use RGB-depth (RGB-D) images obtained from small digital cameras or mobile phones equipped with dual lenses, in place of expensive specialized equipment, have recently been developed. 3DPI provides a means to handle parallax and restore the information that is lost when a new view is rendered from an RGB-D image. 3DPI techniques can be divided into image-based and learning-based rendering techniques. In image-based rendering, an image of a new view is synthesized from a collection of posed images. Image-based rendering performs well when the multiview stereo algorithm works well or when the images are captured using a depth sensor. Several learning-based rendering techniques based on CNNs have recently been studied [38, 39]. These techniques do not require expensive equipment because they can synthesize images of the new view from both single and stereo-pair images. Shih et al. [40] used layered depth images (LDIs) to reduce the complexity of arbitrary depth information. The input LDI is divided into several local regions based on the connectivity between pixels, and the images synthesized by the inpainting algorithm are fused into a new LDI. The depth information is subsequently changed, and the process is applied repeatedly. The algorithm comprises three subnetworks for color, depth, and edge inpainting. The edges of the regions that require restoration are restored first by the edge inpainting network; after edge restoration, the color and depth information of the image boundary region are restored by the color and depth inpainting networks.
3. Proposed Methods
In this study, we propose a method to improve small-object detection performance when only limited training data are available for multiclass training. The proposed method consists of data augmentation to supplement insufficient data, an automated method to process the training data into a form suitable for learning, and a process to improve multiobject detection performance. Figure 2 shows a flowchart of the proposed method.

3.1. Pill Data Labeling
In the proposed method, pill position detection is performed during the pretraining stage to process each captured image into a form suitable for training and to automatically generate a label file that records the location information of the object. As shown in Figure 3, the position of the pill is determined based on its edge components. In this example of a transparent pill, the blue channel, which had the largest standard deviation between the pixel values of the background and foreground among the RGB channels, was used for edge detection. Figure 4 shows the distribution and standard deviation of the pixel brightness for each RGB channel along the centerline of the transparent pill; the blue channel had the largest standard deviation. A median filter was applied in a preprocessing step to remove noise while preserving the edges of the image. Subsequently, an adaptive threshold algorithm was applied to obtain the edge region of the pill. It was difficult to apply the global or Otsu threshold methods to transparent pills because the brightness distribution was not uniform and depended on the location of the pill.
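A minimal sketch of this channel-selection step is shown below, assuming the standard deviation is computed along the image centerline as in Figure 4; the function name is introduced here for illustration only.

```python
# A minimal sketch of the channel-selection step: the color channel whose
# pixel values vary the most along the image centerline is chosen for edge
# detection, as in Figure 4.
import cv2
import numpy as np

def select_channel(image_bgr):
    h = image_bgr.shape[0]
    centerline = image_bgr[h // 2, :, :]               # one row, all columns, 3 channels
    stds = centerline.astype(np.float32).std(axis=0)   # per-channel standard deviation (B, G, R)
    return cv2.split(image_bgr)[int(np.argmax(stds))]  # channel with the largest deviation
```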



Figure 5 shows the results of applying the global, Otsu, and adaptive threshold methods. The shape of the pill was accurately detected by the adaptive threshold method. Some regions of the pill were also detected by the global threshold method; however, this method has the drawback that the threshold value must be adjusted according to the type of pill. After the thresholding step, a morphological operation was used to merge the separated regions of the pill, and a partial region of the image was cropped to remove large noise at the outermost part of the image. In the next step, a contouring algorithm was applied to detect the position of the pill. The contouring algorithm yields the area and centroid of each blob in the binary image. To identify the blob corresponding to the pill among the several blobs in the binary image, the largest blob whose centroid was located less than 50 pixels from the image center was selected. Finally, the location information of the pill was obtained from the bounding rectangle of the selected blob. Table 1 lists the main process parameters for identifying the pill position.




Process | Function (OpenCV) | Parameter | Value
---|---|---|---
Input | | Image size | 608 × 608
Median filter | medianBlur | kSize | 11, 11
Adaptive threshold | adaptiveThreshold | maxValue | 255
 | | adaptiveMethod | ADAPTIVE_THRESH_GAUSSIAN_C
 | | thresholdType | THRESH_BINARY_INV
 | | blockSize | 33
 | | C | 1
Morphology | getStructuringElement | shape | MORPH_ELLIPSE
 | | kSize | 11, 11
 | morphologyEx | op | MORPH_CLOSE
 | | iterations | 2
Contour | findContours | mode | RETR_EXTERNAL
 | | method | CHAIN_APPROX_SIMPLE
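The following Python sketch illustrates the pill-position detection pipeline using the OpenCV functions and parameter values of Table 1; the 50-pixel centroid rule follows the description above, while the function name and return convention are assumptions for illustration.

```python
# A sketch of the pill-position detection pipeline of Section 3.1, using the
# OpenCV functions and parameter values listed in Table 1. The input is the
# single channel selected in the previous step (608 x 608).
import cv2
import numpy as np

def find_pill_bbox(channel):
    blurred = cv2.medianBlur(channel, 11)
    edges = cv2.adaptiveThreshold(blurred, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                  cv2.THRESH_BINARY_INV, 33, 1)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (11, 11))
    closed = cv2.morphologyEx(edges, cv2.MORPH_CLOSE, kernel, iterations=2)

    contours, _ = cv2.findContours(closed, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    h, w = channel.shape
    best = None
    for cnt in contours:
        m = cv2.moments(cnt)
        if m["m00"] == 0:
            continue
        cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]
        # keep the largest blob whose centroid lies within 50 px of the center
        if np.hypot(cx - w / 2, cy - h / 2) < 50:
            if best is None or cv2.contourArea(cnt) > cv2.contourArea(best):
                best = cnt
    return cv2.boundingRect(best) if best is not None else None  # (x, y, w, h)
```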
3.2. Data Augmentation
Deep-learning models require large amounts of data for effective training. However, it is not easy to obtain a large amount of suitable data for training, and the classes in the obtained data may be imbalanced. When a deep-learning model is trained with unbalanced class data, the model may be biased toward a specific class, which significantly affects its performance. Data augmentation is used to solve this problem. Data augmentation methods such as varying the brightness, color, and image ratio were not required in this study because the data were acquired in a fixed capturing environment. Instead, a 3D image augmentation method is proposed in which three-dimensional changes are applied according to the position of the object. RGB images and depth maps are used in this method. For each specimen, a depth map was generated in the up, down, left, and right directions using four cameras attached to the capturing system, and a 3DPI algorithm was applied based on the generated depth map and four RGB images to generate multiple training images.
3.2.1. 3D Data Augmentation
Because the brightness, color, and camera position were fixed in the proposed capturing system, the sizes and colors of the objects in the training images were always the same as those in the test images. Commonly used 2D augmentation methods, such as resizing, flipping, shearing, and changing the color or exposure, can distort the shape and color of the object and degrade the detection performance of the model. However, the actual changes in the appearance of a pill placed at various locations within the measurement range are three-dimensional. Therefore, capturing the pill only at the center position does not capture the shapes that the pill can take at other positions. The proposed 3D augmentation method creates more realistic images by shifting the viewpoint from which the object is seen.
Figure 6 shows the process of generating a 3D-augmented image using the top-left and top-right images among the four captured images. Four refined images labeled LT, RT, LB, and RB (left-top, right-top, left-bottom, and right-bottom) were generated in the 3D image augmentation process by rotations of 0°, 90°, 180°, and 270°, respectively. Two image pairs, comprising the top and bottom images and the left and right images, were used for the 3D augmentation. The pill in the image was detected, and its aspect ratio was calculated. Because circular pills suffered from severe distortion during 3D augmentation, 3D augmentation was applied only to oval-shaped pills with an aspect ratio of at least 1.05; multiple copies of pills with aspect ratios below 1.05 were used in place of 3D augmentation to match the amount of data generated for the 3D-augmented pills. Subsequently, a depth map was generated using a semiglobal block-matching algorithm; the depth map in Figure 6 was generated based on the LT image. A depth line map was then generated from the depth map to reflect the changes in camera viewpoint during 3D augmentation. After the depth line was extracted from the centerline of the depth map, the value at each point of the line was copied along the vertical direction. The horizontal line was calculated by averaging the values along the left and right directions so that the extracted line was symmetrical, and a one-dimensional convolution with a Gaussian kernel was applied to remove noise from the averaged line. The smoothed horizontal depth line was repeated in the vertical direction to match the image size, creating a new depth map. Finally, 3D image augmentation was performed using 3DPI based on the generated depth line map.
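A minimal Python sketch of the depth-line-map construction is given below; the SGBM parameters and Gaussian kernel size are illustrative assumptions, and the symmetrization step is interpreted here as averaging the centerline with its left-right mirror.

```python
# A minimal sketch of the depth-line-map construction; the SGBM parameters
# and Gaussian kernel size are illustrative, and the symmetrization step is
# interpreted as averaging the centerline with its left-right mirror.
import cv2
import numpy as np

def depth_line_map(left_gray, right_gray, sigma=5.0):
    matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=64, blockSize=9)
    disparity = matcher.compute(left_gray, right_gray).astype(np.float32) / 16.0

    h, w = disparity.shape
    line = disparity[h // 2, :]              # depth line at the image centerline

    # make the line symmetrical by averaging it with its left-right mirror
    line = 0.5 * (line + line[::-1])

    # 1D Gaussian convolution to remove noise along the averaged line
    kernel = cv2.getGaussianKernel(ksize=31, sigma=sigma).ravel()
    line = np.convolve(line, kernel, mode="same")

    # repeat the smoothed line vertically to form the new depth map
    return np.tile(line, (h, 1))
```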

3.2.2. Image Rotational Augmentation


Figure 8 shows the overall process of the proposed image rotational augmentation. In the input image, the edge region was cropped before the padding process to remove the noise components at the outer edges of the image. In the experiments, the image was cropped to a width-to-height ratio of 0.95. Padding and rotational transformations were then applied to the image, and the image was cut to the same dimensions as the input image based on the center of the transformed image.
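The following Python sketch illustrates this rotational augmentation, assuming that the 0.95 ratio denotes cropping to 95% of the original width and height; the padding margin is an illustrative assumption.

```python
# A minimal sketch of the rotational augmentation of Figure 8: crop the outer
# edge, pad, rotate, then center-crop back to the input size.
import cv2

def rotate_augment(image, angle_deg, crop_ratio=0.95):
    h, w = image.shape[:2]

    # 1) crop the outer edge to remove noise at the image border
    ch, cw = int(h * crop_ratio), int(w * crop_ratio)
    y0, x0 = (h - ch) // 2, (w - cw) // 2
    cropped = image[y0:y0 + ch, x0:x0 + cw]

    # 2) pad so the rotated content cannot leave the frame (margin is assumed)
    pad = int(0.25 * max(ch, cw))
    padded = cv2.copyMakeBorder(cropped, pad, pad, pad, pad,
                                borderType=cv2.BORDER_REPLICATE)

    # 3) rotate about the padded-image center
    ph, pw = padded.shape[:2]
    M = cv2.getRotationMatrix2D((pw / 2, ph / 2), angle_deg, 1.0)
    rotated = cv2.warpAffine(padded, M, (pw, ph))

    # 4) center-crop back to the original input dimensions
    y0, x0 = (ph - h) // 2, (pw - w) // 2
    return rotated[y0:y0 + h, x0:x0 + w]
```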

3.3. Data Labeling Automation
3.4. Data Training
Images containing only one pill each were used as the training data. To convert each captured image into training data, the position of the pill was detected and the image was cropped such that the pill was located at the center of the image. The cropped image was resized to 608 × 608 pixels for training. A label file was created from the 3D- and rotationally augmented images using the proposed pill position detection algorithm before training. Label files and images with errors such as data omissions, class mismatches, and duplicates were excluded from the training data during the preprocessing review of the generated label files. Training was performed using the YOLOv4 model with only the image-shift and mosaic data augmentation options enabled; the color, chroma, exposure, left-right inversion, and image ratio options were disabled. Figure 9 shows the training process.

3.5. Data Detection
Unlike training, which used images containing only a single pill each, the images used in the detection process contained multiple pills. Each detection image was cropped to the area containing the pills and resized to 608 × 608 pixels. A confidence-based NMS algorithm was applied as a postprocessing step to exclude duplicate detections at the same location in the detection results. Subsequently, the scenes captured by the four cameras were designated as one group, and the results were analyzed using a voting algorithm. Figure 10 shows the detection process.

3.5.1. Removal of Overlapping Data
Because the numbers and shapes of the pills were not the same in the images used for training and detection, duplicate detection results may appear at the same locations in detection images containing multiple pills. The overlapping regions were removed using an NMS algorithm based on the intersection over union (IoU), as shown in Figure 11. The class of the overlapping detections was not considered in the algorithm, and only the result with the highest confidence was retained. The IoU represents the ratio between the intersection and union of the two overlapping regions, as shown in Figure 12; a larger IoU value indicates a larger overlap between the two regions. The IoU ranges from 0 to 1, and an IoU threshold of 0.5 was used in the proposed method.
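A minimal sketch of this class-agnostic, confidence-based NMS is given below; the box format and function names are assumptions for illustration.

```python
# A minimal sketch of the class-agnostic, confidence-based NMS described
# above: among boxes that overlap with IoU >= 0.5, only the detection with
# the highest confidence is kept, regardless of its class.
def iou(a, b):
    # boxes as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(detections, iou_thresh=0.5):
    """detections: list of (box, class_name, confidence) tuples."""
    kept = []
    for det in sorted(detections, key=lambda d: d[2], reverse=True):
        # keep a detection only if it does not overlap a higher-confidence one
        if all(iou(det[0], k[0]) < iou_thresh for k in kept):
            kept.append(det)
    return kept
```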



3.5.2. Final Detection Decision
Because the capturing system comprised four cameras for two-axis stereo photography, four slightly different images were acquired for the same scene, and the detection results may differ between the four images. Therefore, to improve detection performance, the four images of each scene were grouped and compared. Figure 13 shows the process for comparing the results of an image group. When an NG (incorrect detection) occurs in only one image of the group, the confidences of each pill class are summed over the four images, and the four pill classes with the highest cumulative scores are selected. In the example of Figure 13, the abp, cgp, caco, and eut pills were detected in the LT, RT, and LB images, whereas lox was incorrectly detected instead of caco in the RB image. The cumulative confidences were 397, 395, 395, 282, and 76 for eut, cgp, abp, caco, and lox, respectively, so caco was retained and lox was rejected. The final detection results were verified to be consistent with the ground truth.
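A minimal sketch of the voting step is given below; the data structures and function names are assumptions for illustration.

```python
# A minimal sketch of the voting step: NMS-filtered detections from the four
# images of one scene (LT, RT, LB, RB) are pooled, the confidences of each
# pill class are summed, and the classes with the highest cumulative scores
# are selected as the final result.
from collections import defaultdict

def vote(group_detections, num_pills=4):
    """group_detections: one list per camera image, each containing
    (class_name, confidence) pairs that survived NMS."""
    scores = defaultdict(float)
    for detections in group_detections:
        for class_name, confidence in detections:
            scores[class_name] += confidence
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:num_pills]
```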

4. Experimental Environment
4.1. Data Capturing Environment and Development System
Four CAM8200-U cameras (5-megapixel USB camera modules) were used to capture stereo vision images in the capturing system. In addition, an NVIDIA Jetson Nano embedded board with a Wi-Fi module was used for capturing control and data communication. Backlighting was used to remove the shadow of the pill, and two LED lighting modules were installed facing the pill next to the upper camera module. A blocking film was installed around the cameras to reduce the influence of external light sources. Figure 14 shows photographs and illustrations of the interior and exterior of the image-capturing system.


The system for deep-learning training comprised an NVIDIA RTX 3090 GPU, an Intel i9-10980XE CPU, and 256 GB of RAM, with CUDA 11.0 and cuDNN 8.0.4. The C language-based Darknet framework was used for YOLOv4, and PyTorch 1.9.0 with Python 3.8.10 was used for 3DPI. Table 2 lists the YOLOv4 parameters used for training.
Option | Description |
---|---|
Image size | 608 × 608 |
Validation ratio | 20% |
Learning rate | 0.001 |
Momentum | 0.949 |
Decay | 0.005 |
Backbone | CSPDarkNet53 |
Augmentation in YOLOv4 | Mosaic, image shift |
4.2. Data Communication
A Jetson Nano board was used for image capture and data transmission in the system. The images were captured by the capturing system, and the captured data was transmitted to the deep-learning PC through socket communication using the transmission control protocol (TCP). Data refinement and augmentation, label file creation, training, and detection were performed on the received images by the deep-learning PC.
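The following sketch illustrates how an image might be sent over TCP from the Jetson Nano to the deep-learning PC; the JPEG encoding, length-prefix framing, host address, and port are illustrative assumptions rather than the protocol actually used in this study.

```python
# A minimal sketch of TCP image transfer from the capturing board to the
# deep-learning PC; host, port, and framing are illustrative assumptions.
import socket
import struct
import cv2

def send_image(image, host="192.168.0.10", port=5000):
    payload = cv2.imencode(".jpg", image)[1].tobytes()   # JPEG-encode the frame
    with socket.create_connection((host, port)) as sock:
        sock.sendall(struct.pack(">I", len(payload)))    # 4-byte length prefix
        sock.sendall(payload)                            # image bytes
```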
4.3. Data Refinement
The pill in the training image should ideally be placed at the center of the image for 3D augmentation. However, the manually placed pills were not always centered in the image, which caused position errors. It was therefore necessary to find the center point of the pill in each image and crop the image to a fixed size. The fixed backlight region was first removed from the image, and the center of the pill was determined using the pill-position detection algorithm described in Section 3.1. After cropping the image to 400 × 400 pixels around the center of the pill, a padding algorithm was applied to expand the image to 500 × 500 pixels. If the image had been cropped directly to 500 × 500 pixels, the area outside the backlight could have been included, depending on the position of the pill. Finally, the image was resized to 608 × 608 pixels for YOLO training. The images used for detection contained multiple pills; in this case, the capturing position was fixed such that all pills were included in the image. The area containing the pills was cropped to 500 × 500 pixels based on the fixed location in the captured image and then resized to 608 × 608 pixels to match the training image size. Figures 15(a) and 15(b) show the processes for refining the training and detection data, respectively.
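A minimal sketch of the training-image refinement steps (crop, pad, and resize) is given below; the pill center is assumed to have been found with the detection algorithm of Section 3.1, the padding value is an illustrative assumption, and the pill is assumed to lie at least 200 pixels from the image border.

```python
# A minimal sketch of the training-image refinement: crop 400 x 400 around
# the detected pill center, pad to 500 x 500, and resize to 608 x 608.
import cv2

def refine_training_image(image, pill_center):
    cx, cy = int(pill_center[0]), int(pill_center[1])

    # 1) crop 400 x 400 around the pill center found in Section 3.1
    crop = image[cy - 200:cy + 200, cx - 200:cx + 200]

    # 2) pad to 500 x 500 so the crop never extends beyond the backlight area
    padded = cv2.copyMakeBorder(crop, 50, 50, 50, 50,
                                borderType=cv2.BORDER_CONSTANT, value=0)

    # 3) resize to the YOLO input size
    return cv2.resize(padded, (608, 608))
```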


4.4. Data Set for Training and Detection
The images used for training and detection contained pills with hard and soft materials; circular, ovoid, or square shapes; and various colors including white and transparent. The training images contained only one pill each, whereas the detection images contained multiple pills.
4.4.1. Pill Data for Single-Axis 3D Augmentation
Twenty types of pills were used for the primary pill detection training. Figure 16(a) shows an example of the images used for primary pill detection training. The training images were acquired using two pairs of stereo cameras, and each pill was captured at angles of 0° and 90° in its front, rear, and side views. Figure 16(b) shows the images augmented by applying the gamma transformation and luminance adaptation transform (LAT) for comparison with the proposed augmentation method. The image brightness was nonlinearly adjusted to different levels by the gamma transformation, whereas the local image brightness was adjusted by LAT to improve the image contrast and detail. LAT has two versions, single-scale LAT (SLAT) and multiscale LAT (MLAT), according to the number of applied Gaussian blur kernels [13]. Figure 16(b) shows that the excessive local adaptation in SLAT led to halo artifacts, whereas both the contrast and detail were improved in MLAT. Figure 16(c) shows the images used for the detection test.



4.4.2. Pill Data for Dual-Axis 3D Augmentation
Figure 17 shows the images used for the secondary training and detection test. Figure 17(a) shows the LT, LB, RT, and RB images acquired from the four cameras for the front, rear, and side of each pill without rotational arrangement. The captured images were expanded using the 3D image and rotational augmentations. Figure 17(b) shows the images used for the detection test. Four pills were randomly selected from 40 types of pills and placed in their front, rear, or side views in each detection test image.


5. Experimental Results
5.1. Evaluation Index
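The evaluation indices used in the experiments are the precision, recall, individual accuracy, and combination accuracy. The definitions below are assumed to be the standard ones consistent with the values reported in Tables 5–13, where TP, FP, and FN denote the numbers of true positive, false positive, and false negative pill detections, and OK and NG denote the numbers of test images (or image groups) in which all pills were detected correctly and in which at least one error occurred, respectively.

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
\mathrm{Individual\ accuracy} = \frac{TP}{TP + FP + FN}, \qquad
\mathrm{Combination\ accuracy} = \frac{\mathrm{OK}}{\mathrm{OK} + \mathrm{NG}}
```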
5.2. Pill Detection Training with Single-Axis 3D Augmentation
In the first test, the effects of different augmentation methods were compared using 20 types of pills. The compared settings were the original captured images without augmentation, 2D image augmentation based on the gamma transformation and LAT, and single-axis 3D augmentation using the stereo left-right pairs. The original images were captured in the vertical and horizontal directions in the front, rear, and side views with the pill at the center of the image. A total of 100 test images, each containing four to six pills of various shapes randomly chosen from the 20 types, were used for the detection test; the 100 test images contained 468 individual pills in total. The NMS algorithm was applied as a postprocessing step to remove duplicate detection areas. Table 3 lists the capturing conditions, training settings, and test settings, and Table 4 lists the types of augmentation used in each experiment.
Environment | Option | Description
---|---|---
Capturing conditions | Number of captured images | 480 (24 per pill)
 | Pill position | Center
 | Pill direction | Front, rear, side
 | Pill angle | 0°, 90°
 | Number of cameras | 2
Test setting | Test set size | 100
 | Postprocessing | Nonmaximum suppression
Test | Gamma | LAT | Rotation (0°, ±20°, ±45°) | 3D (left-right and top-bottom pairs; 0°, 90°, 180°, 270°; 10 images per angle)
---|---|---|---|---
Basic image capture | | | |
2-dimension augmentation | ✓ | ✓ | |
3-dimension single-axis augmentation | | | ✓ | ✓
5.2.1. Original Captured Images
Four hundred and eighty original captured images of the 20 types of pills were used for training without additional augmentation. The detection results are presented in Table 5. The highest precision of 0.917 and recall of 0.948 were achieved at 4000 iterations, and the combination accuracy was 0.55, indicating that accurate detection was achieved in 55 of the 100 test images.
Iterations | TP | FP | FN | Precision | Recall | Individual accuracy | OK | NG | Combination accuracy |
---|---|---|---|---|---|---|---|---|---|
1000 | 215 | 252 | 117 | 0.460 | 0.647 | 0.368 | 0 | 100 | 0 |
2000 | 382 | 49 | 42 | 0.890 | 0.900 | 0.808 | 44 | 56 | 0.44 |
3000 | 393 | 44 | 32 | 0.899 | 0.924 | 0.838 | 47 | 53 | 0.47 |
4000 | 409 | 37 | 22 | 0.917 | 0.948 | 0.874 | 55 | 45 | 0.55 |
5000 | 406 | 38 | 24 | 0.914 | 0.944 | 0.868 | 54 | 46 | 0.54 |
6000 | 407 | 38 | 23 | 0.914 | 0.946 | 0.870 | 54 | 46 | 0.54 |
7000 | 408 | 37 | 23 | 0.916 | 0.946 | 0.872 | 54 | 46 | 0.54 |
5.2.2. 2D Image Augmentation
The gamma transformation and LAT were used for 2D image augmentation. Three gamma transforms and two LAT transforms (SLAT and MLAT) were applied. The detection results are presented in Table 6. The highest detection performance was achieved at 3000 iterations; however, the accuracy did not improve significantly, and the combination accuracy was only 0.59. Although both transformations changed the image brightness, the results confirmed that brightness augmentation was not very helpful for training because the images were captured under a fixed lighting environment.
Iterations | TP | FP | FN | Precision | Recall | Individual accuracy | OK | NG | Combination accuracy |
---|---|---|---|---|---|---|---|---|---|
1000 | 420 | 101 | 258 | 0.542 | 0.317 | 0.539 | 0 | 100 | 0 |
2000 | 400 | 53 | 17 | 0.883 | 0.959 | 0.851 | 51 | 49 | 0.51 |
3000 | 416 | 47 | 7 | 0.898 | 0.983 | 0.885 | 59 | 41 | 0.59 |
4000 | 416 | 48 | 5 | 0.896 | 0.988 | 0.887 | 58 | 42 | 0.58 |
5000 | 414 | 51 | 5 | 0.890 | 0.988 | 0.881 | 57 | 43 | 0.57 |
6000 | 415 | 50 | 5 | 0.892 | 0.988 | 0.883 | 57 | 43 | 0.57 |
7000 | 414 | 50 | 5 | 0.892 | 0.988 | 0.883 | 58 | 42 | 0.58 |
5.2.3. 3D Single-Axis Image Augmentation
In 3D single-axis augmentation, the depth map of the augmented image was extracted from only the stereo left-right image pairs. The 3DPI algorithm was used for 3D augmentation and was applied in both the left and right directions, generating fourteen images for each direction. In addition, rotational augmentation was applied at angles of ±20° and ±45° to the 3D-augmented images. Table 7 presents the results of the trained model. The detection performance was highest at 4000 iterations, and the precision, recall, and individual and combination accuracies were all higher than those of the previous two models. In particular, the combination accuracy for the whole pill image was significantly improved from 0.59 to 0.94. These results confirm that the proposed 3D and rotational augmentations are effective in improving detection performance.
Iterations | TP | FP | FN | Precision | Recall | Individual accuracy | OK | NG | Combination accuracy |
---|---|---|---|---|---|---|---|---|---|
1000 | 254 | 211 | 89 | 0.546 | 0.740 | 0.458 | 0 | 100 | 0 |
2000 | 439 | 36 | 9 | 0.924 | 0.979 | 0.907 | 65 | 35 | 0.65 |
3000 | 459 | 7 | 2 | 0.984 | 0.995 | 0.981 | 93 | 7 | 0.93 |
4000 | 460 | 5 | 3 | 0.989 | 0.993 | 0.983 | 94 | 6 | 0.94 |
5000 | 459 | 6 | 3 | 0.987 | 0.993 | 0.981 | 93 | 7 | 0.93 |
6000 | 459 | 5 | 4 | 0.989 | 0.991 | 0.981 | 93 | 7 | 0.93 |
7000 | 459 | 6 | 3 | 0.987 | 0.993 | 0.981 | 93 | 7 | 0.93 |
5.3. Pill Detection Training with Dual-Axis 3D Augmentation
In the secondary pill detection training, the direction of the pill was additionally restricted to only the vertical direction to reduce the need for operator intervention in capturing the training images. Dual-axis 3D augmentation was performed in which a combination of left, right, top, and bottom stereo pairs was formed using images captured by the four cameras, and 3D augmentation was performed using each pair of images. Additionally, rotational augmentation was applied at the angles of ±20° and ±45° to the 3D-augmented images. Detection training was conducted for 20 and 40 types of pills. In the detection test, 430 images were used in the detection experiments with 20 types of pills, and 862 images for the detection experiments with 40 types of pills. The NMS algorithm was applied as a postprocessing algorithm to remove overlapping regions. To improve the detection performance, images of the same pill taken by four cameras were processed as one group, and the pills with high cumulative scores were selected in the detection decision.
5.3.1. Detection for 20 Types of Pills with Dual-Axis 3D Image Augmentation
Table 8 lists the test conditions for the detection of 20 types of pills to which dual-axis 3D image augmentation was applied. The results are presented in Tables 9 and 10, which show the results of applying only the NMS algorithm and of applying both the NMS and voting algorithms, respectively. The number of original images used in the secondary pill detection training was 240, approximately half of the 480 original images used in the primary pill detection training. Nevertheless, Table 9 shows that the results are similar to those of the primary detection test. Moreover, as shown in Table 10, the combination accuracy increased from 0.923 to 0.991 after applying the voting algorithm.
Environment | Option | Description
---|---|---
Capturing conditions | Number of captured images | 240 (12 per pill)
 | Pill position | Center
 | Pill direction | Front, rear, side
 | Pill angle | 90°
 | Number of cameras | 4
Training setting | Augmentation | 3D (left-right and top-bottom pairs; 0°, 90°, 180°, 270°; 10 images per angle), rotation (0°, ±20°, ±45°)
Test setting | Test set size | 430
 | Postprocessing | Nonmaximum suppression, voting
Iterations | TP | FP | FN | Precision | Recall | Individual accuracy | OK | NG | Combination accuracy
---|---|---|---|---|---|---|---|---|---
1000 | 918 | 1010 | 282 | 0.476 | 0.765 | 0.415 | 1 | 429 | 0.002 |
2000 | 1628 | 96 | 2 | 0.944 | 0.999 | 0.943 | 340 | 90 | 0.791 |
3000 | 1651 | 68 | 1 | 0.960 | 0.999 | 0.960 | 365 | 65 | 0.849 |
4000 | 1681 | 38 | 1 | 0.978 | 0.999 | 0.977 | 393 | 37 | 0.914 |
5000 | 1683 | 36 | 1 | 0.979 | 0.999 | 0.978 | 394 | 36 | 0.916 |
6000 | 1685 | 34 | 1 | 0.980 | 0.999 | 0.980 | 397 | 33 | 0.923 |
7000 | 1683 | 37 | 1 | 0.978 | 0.999 | 0.978 | 394 | 36 | 0.916 |
Iterations | TP | FP | FN | Precision | Recall | Individual accuracy | OK | NG | Combination accuracy
---|---|---|---|---|---|---|---|---|---
1000 | 296 | 70 | 66 | 0.809 | 0.818 | 0.685 | 15 | 93 | 0.139 |
2000 | 420 | 9 | 3 | 0.979 | 0.993 | 0.972 | 96 | 12 | 0.889 |
3000 | 428 | 3 | 1 | 0.993 | 0.998 | 0.991 | 104 | 4 | 0.963 |
4000 | 431 | 1 | 0 | 0.998 | 1.000 | 0.998 | 107 | 1 | 0.991 |
5000 | 431 | 1 | 0 | 0.998 | 1.000 | 0.998 | 107 | 1 | 0.991 |
6000 | 431 | 1 | 0 | 0.998 | 1.000 | 0.998 | 107 | 1 | 0.991 |
7000 | 431 | 1 | 0 | 0.998 | 1.000 | 0.998 | 107 | 1 | 0.991 |
5.3.2. Detection for 40 Types of Pills with Dual-Axis 3D Image Augmentation
An additional 20 types of pills were included, for a total of 40 types, and images containing diverse combinations of these pills were captured for the detection test. Table 11 lists the experimental conditions, and Tables 12 and 13 list the experimental results. Table 12 shows that the combination accuracy was 0.833 without the voting algorithm, which is lower than the accuracy of 0.923 for the 20-type case; however, it increased to 0.940 after applying the voting algorithm, as shown in Table 13. This demonstrates the effectiveness of the combination determination method that uses the four output images.
Environment | Option | Description
---|---|---
Capturing conditions | Number of captured images | 480 (12 per pill)
 | Pill position | Center
 | Pill direction | Front, rear, side
 | Pill angle | 90°
 | Number of cameras | 4
Training setting | Augmentation | 3D (left-right and top-bottom pairs; 0°, 90°, 180°, 270°; 10 images per angle), rotation (0°, ±20°, ±45°)
Test setting | Test set size | 862
 | Postprocessing | Nonmaximum suppression, voting
Iterations | TP | FP | FN | Precision | Recall | Individual accuracy | OK | NG | Combination accuracy
---|---|---|---|---|---|---|---|---|---
1000 | 495 | 614 | 2414 | 0.446 | 0.170 | 0.141 | 1 | 861 | 0.001 |
2000 | 3,233 | 548 | 45 | 0.855 | 0.986 | 0.845 | 421 | 441 | 0.488 |
3000 | 3,278 | 266 | 41 | 0.925 | 0.988 | 0.914 | 587 | 275 | 0.681 |
4000 | 3,277 | 148 | 23 | 0.957 | 0.993 | 0.950 | 706 | 156 | 0.819 |
5000 | 3,292 | 135 | 21 | 0.961 | 0.994 | 0.955 | 718 | 144 | 0.833 |
6000 | 3,286 | 136 | 26 | 0.960 | 0.992 | 0.953 | 712 | 150 | 0.826 |
7000 | 3,283 | 138 | 27 | 0.960 | 0.992 | 0.952 | 709 | 153 | 0.823 |
Iterations | TP | FP | FN | Precision | Recall | Individual accuracy | OK | NG | Combination accuracy
---|---|---|---|---|---|---|---|---|---
1000 | 233 | 143 | 488 | 0.620 | 0.323 | 0.270 | 1 | 215 | 0.005 |
2000 | 838 | 24 | 2 | 0.972 | 0.998 | 0.970 | 192 | 24 | 0.889 |
3000 | 850 | 14 | 0 | 0.984 | 1.000 | 0.984 | 202 | 14 | 0.935 |
4000 | 849 | 15 | 0 | 0.983 | 1.000 | 0.983 | 201 | 15 | 0.931 |
5000 | 851 | 12 | 1 | 0.986 | 0.999 | 0.985 | 203 | 13 | 0.940 |
6000 | 849 | 14 | 1 | 0.984 | 0.999 | 0.983 | 201 | 15 | 0.931 |
7000 | 848 | 14 | 2 | 0.984 | 0.998 | 0.981 | 200 | 16 | 0.926 |
The results of the proposed method were compared with those obtained using the two-step approach of Kwon et al., which is based on Mask R-CNN [41]. In the first step, the area of each pill was detected in an image containing multiple pills, and virtual images containing only a single pill were generated using the detected area of the pill. In the second step, pill detection was performed using the generated virtual images. In the test with 27 classes of pills, the average accuracy for each pill was 0.916. However, if the accuracy is converted to the combination accuracy for multiple pills used in this study, the accuracy becomes 66%, which confirms the superiority of the proposed method.
6. Conclusions
In this study, we proposed a method to improve the small-object detection performance of a model trained using limited data. An image-capturing system was implemented for training and pill detection. In general, multiclass object detection algorithms require multiclass data for training. However, the number of training images required to cover the possible combinations of classes grows rapidly as the number of object classes increases, as does the difficulty of database management. A method to increase detection performance using minimal training data was therefore proposed in this study. Conventionally, images of the objects to be detected need to be obtained under the same capturing conditions as the test images, so a large number of training images are required to account for variations in object positions and adjacent objects; insufficient training images can cause detection problems for test images that contain several objects. The detection problems caused by the differences between multiobject test images and single-object training images were addressed in this study.
Pills of various shapes and colors were used for data training and detection tests. Four cameras were used to effectively capture 2-axis stereo images for training. A backlight and upper light were used simultaneously to remove the shadow of the pill during capture. An NVIDIA Jetson Nano board was used in the imaging system for standalone use and remote data communication based on a TCP socket communication protocol. The four cameras performed stereo vision capture along the two-axis directions. YOLOv4, which can detect objects in real time and has excellent detection performance, was used for data training. The data augmentation methods in YOLOv4 that distort the shape and color of the objects, such as left-right inversion and ratio, color, brightness, and saturation adjustments, were not used, and only the mosaic and image shift augmentations were applied. To automatically generate the necessary label data during training, a vision algorithm to detect the position of the pill in the training image was developed. An NMS algorithm was used to remove the overlapping regions during detection, and the final result was obtained by combining the detected results from the four test images, which were combined into a single group.
Primary (single-axis 3D augmentation) and secondary (dual-axis 3D augmentation) pill detection training were conducted for the training tests. In the primary pill detection training experiment, the results of basic imaging without data augmentation, 2D image augmentation with brightness conversion, and 3D augmentation using the stereo vision method were compared for 20 types of pills. The proposed image augmentation and postprocessing methods showed a combination accuracy of more than 99% in a fixed lighting environment. In the second experiment, to reduce operator intervention to three placements per pill during the capture of the training data, each pill was placed only at 90°, and its front, rear, and side scenes were recorded using the four cameras. During training, the 3D and rotational augmentations were applied along the two axes of the up-down and left-right directions. Training tests were conducted using 20 and 40 types of pills, and the proposed method achieved a combination accuracy of 94% in the latter test.
The proposed training and detection methods, including the capturing system, are not limited to pill detection and can be used in various image-based object detection applications. In particular, the method is expected to be applicable to the training and detection of various objects in environments where training data are insufficient.
Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of this paper.
Acknowledgments
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), BK21 FOUR project funded by the Ministry of Education, Korea [NRF-2021R1I1A3049604, 4199990113966], and Electronics and Telecommunications Research Institute (ETRI) grant funded by the Korean government [21ZD1140, development of ICT convergence technology for Daegu-Gyeongbuk Regional Industry].
Open Research
Data Availability
All data used in this study (e.g., pill images) were captured with the self-produced imaging equipment.