Face detection with Darknet Yolo

Real time object detection with custom data

Posted on December 24, 2017

You only look once (YOLO) is a state-of-the-art, real-time object detection system. It comes with a few pre-trained models, but I decided to train it on my own data to see how well it works, and to get a feel for the potential of image recognition in general and its applications in real-life situations.

If you are a fan of HBO's Silicon Valley TV series, you might know the famous Not Hotdog app that Jian-Yang built. Mine is similar and very basic: it detects whether or not an image contains me.

Getting Started

To get started, we need to install Darknet along with two optional dependencies: OpenCV, and CUDA for faster computation. The steps below were run on an Ubuntu 16.04 machine with an Nvidia GTX 1060.

Installing Darknet
         git clone https://github.com/pjreddie/darknet.git

         cd darknet

         make

         mkdir -p obj

To verify your installation, run darknet:

         ./darknet

         output: usage: ./darknet <function>

If you get the above output, you're good to go to the next step!

Compiling with CUDA (optional)

Running on your GPU is many times faster than on your CPU. To use CUDA, you'll need a compatible Nvidia GPU. Download CUDA (make sure it is version 8) and follow the installation instructions on the website.

To enable CUDA, change the first line of the Makefile in the base directory to GPU=1 and run 'make' in the terminal.

Compiling with OpenCV (optional)

Install OpenCV to support more image and video formats. Check the instructions here.

Similar to CUDA, change the Makefile to read OPENCV=1 to enable OpenCV, and then run 'make' in the terminal to build the darknet application.
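If you would rather script the two Makefile edits than make them by hand, here is a minimal Python sketch. It assumes the flags appear at the start of a line in the form GPU=0 and OPENCV=0, as in the base Makefile; the function name set_make_flags is hypothetical.

```python
import re


def set_make_flags(makefile_text: str, **flags: int) -> str:
    """Return makefile_text with lines such as 'GPU=0' rewritten to the given values."""
    for name, value in flags.items():
        # Only match the variable at the very start of a line, with optional spaces
        # before '=', so lines like 'ifeq ($(GPU), 1)' are left alone.
        pattern = re.compile(rf"^{re.escape(name)}\s*=.*$", re.MULTILINE)
        makefile_text, count = pattern.subn(f"{name}={value}", makefile_text)
        if count == 0:
            raise ValueError(f"{name} not found in Makefile")
    return makefile_text


# Example, run from the darknet base directory:
# from pathlib import Path
# path = Path("Makefile")
# path.write_text(set_make_flags(path.read_text(), GPU=1, OPENCV=1))
```

After rewriting the Makefile, rebuild with make as usual. Raising an error when a flag is missing avoids silently building without GPU support.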

Training YOLO with your custom objects

Create the file yolo-obj.cfg with the same content as yolo-voc.2.0.cfg (or copy yolo-voc.2.0.cfg to yolo-obj.cfg) and make these changes:

  • change line batch to batch=64
  • change line subdivisions to subdivisions=8
  • change line classes=20 to your number of objects
  • change line #237 from filters=125 to filters=(classes + 5)*5; for example, if classes=2 then it should be filters=35
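The (classes + 5)*5 rule comes from YOLOv2's region layer: each of the 5 anchor boxes predicts 4 box coordinates, 1 objectness score and one score per class. The Python sketch below applies all four edits to the config text. Rather than relying on line #237, it assumes the layer to change is the last filters= line before the [region] section; the helper names and the cfg path in the example are hypothetical.

```python
def region_filters(classes: int, num_anchors: int = 5, coords: int = 4) -> int:
    """Filters in the last convolutional layer of a YOLOv2 [region] model."""
    # Each anchor box predicts 4 coordinates, 1 objectness score and one score per class.
    return (classes + coords + 1) * num_anchors


def patch_cfg(cfg_text: str, classes: int) -> str:
    """Apply the batch/subdivisions/classes/filters edits to yolo-voc.2.0.cfg-style text."""
    lines = cfg_text.splitlines()
    # Assumption: the layer whose filters must change is the last 'filters=' line
    # that appears before the [region] section header.
    region = max(i for i, line in enumerate(lines) if line.strip() == "[region]")
    target = max(i for i in range(region) if lines[i].strip().startswith("filters="))

    patched = []
    for i, line in enumerate(lines):
        # Compare the exact key, so 'batch_normalize=1' is not mistaken for 'batch'.
        key = line.split("=", 1)[0].strip()
        if key == "batch":
            line = "batch=64"
        elif key == "subdivisions":
            line = "subdivisions=8"
        elif key == "classes":
            line = f"classes={classes}"
        elif i == target:
            line = f"filters={region_filters(classes)}"
        patched.append(line)
    return "\n".join(patched) + "\n"


# Example (adjust the path to where yolo-voc.2.0.cfg lives in your checkout):
# from pathlib import Path
# text = Path("cfg/yolo-voc.2.0.cfg").read_text()
# Path("yolo-obj.cfg").write_text(patch_cfg(text, classes=2))
```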

Create the file obj.names in the directory darknet/data/, containing the object names, one per line.

Create the file obj.data in the directory darknet/data/, containing (where classes = number of objects):

        classes= 2
        train  = data/train.txt
        valid  = data/test.txt
        names = data/obj.names
        backup = backup/
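The two files must agree: classes in obj.data is the number of lines in obj.names, and the line order in obj.names defines the class ids used in the label files. A minimal Python sketch that derives both files from one list follows; the function name and the class names in the example are hypothetical.

```python
from pathlib import Path
from typing import List


def write_data_files(darknet_root: Path, class_names: List[str]) -> None:
    """Create data/obj.names and data/obj.data under the darknet directory."""
    data_dir = darknet_root / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    # One name per line; the line order defines class ids 0, 1, ... in the label files.
    (data_dir / "obj.names").write_text("\n".join(class_names) + "\n")
    # 'classes' is derived from the names list so the two files cannot disagree.
    (data_dir / "obj.data").write_text(
        f"classes = {len(class_names)}\n"
        "train  = data/train.txt\n"
        "valid  = data/test.txt\n"
        "names = data/obj.names\n"
        "backup = backup/\n"
    )
    # Make sure the backup directory named in obj.data exists for the weight files.
    (darknet_root / "backup").mkdir(exist_ok=True)


# Example with hypothetical class names:
# write_data_files(Path("."), ["me", "not_me"])
```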

Put the image files (.jpg) of your objects in the directory darknet/obj/.

For each .jpg image file, create a .txt file in the same directory with the same name but a .txt extension. In it, put the object number and the object coordinates on this image, one object per line: <object-class> <x> <y> <width> <height>


  • <object-class> - integer number of the object, from 0 to (classes-1)
  • <x> <y> <width> <height> - float values relative to the width and height of the image, between 0.0 and 1.0
  • <x> <y> - the center of the rectangle, not its top-left corner

Use the BBox-Label-Tool to get the face coordinates from your images.

For example, for img1.jpg you should create img1.txt containing:

        1 0.716797 0.395833 0.216406 0.147222
        0 0.687109 0.379167 0.255469 0.158333
        1 0.420312 0.395833 0.140625 0.166667
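The label values above are normalized box centers and sizes. If your labeling tool reports boxes as pixel corner coordinates (an assumption; check what your version of BBox-Label-Tool actually writes), they need converting. A minimal Python sketch, with hypothetical function names, assuming you already know each image's width and height in pixels:

```python
from typing import Iterable, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def to_yolo_line(class_id: int, box: Box, img_w: int, img_h: int) -> str:
    """Convert a pixel corner box to a '<class> <x> <y> <width> <height>' label line."""
    x_min, y_min, x_max, y_max = box
    # YOLO wants the box center and size, each divided by the image width/height.
    x_c = (x_min + x_max) / 2.0 / img_w
    y_c = (y_min + y_max) / 2.0 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{class_id} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}"


def write_label_file(path: str, boxes: Iterable[Tuple[int, Box]], img_w: int, img_h: int) -> None:
    """Write one line per (class_id, box) pair, e.g. img1.txt next to img1.jpg."""
    with open(path, "w") as f:
        for class_id, box in boxes:
            f.write(to_yolo_line(class_id, box, img_w, img_h) + "\n")
```

For instance, a 200x200 pixel box with its top-left corner at (100, 50) in a 400x500 image becomes 0 0.500000 0.300000 0.500000 0.400000 for class 0.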

Create the file train.txt in the directory darknet/data/, with the filenames of your images, one per line, with paths relative to ./darknet. For example:

          obj/img1.jpg
          obj/img2.jpg

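Since obj.data also points valid at data/test.txt, it can be convenient to generate both lists at once. The Python sketch below shuffles the images in darknet/obj with a fixed seed and holds out a fraction for testing; the 10% default and the function name are arbitrary assumptions, not part of the original setup.

```python
import random
from pathlib import Path
from typing import List, Tuple


def write_image_lists(darknet_root: Path, image_dir: str = "obj",
                      test_fraction: float = 0.1, seed: int = 0) -> Tuple[List[str], List[str]]:
    """Write data/train.txt and data/test.txt with image paths relative to darknet_root."""
    images = sorted(p.relative_to(darknet_root).as_posix()
                    for p in (darknet_root / image_dir).glob("*.jpg"))
    # Shuffle with a fixed seed so the split is the same on every run.
    random.Random(seed).shuffle(images)
    n_test = int(len(images) * test_fraction)
    test, train = images[:n_test], images[n_test:]
    data_dir = darknet_root / "data"
    data_dir.mkdir(exist_ok=True)
    (data_dir / "train.txt").write_text("".join(p + "\n" for p in train))
    (data_dir / "test.txt").write_text("".join(p + "\n" for p in test))
    return train, test


# Example, run from the darknet directory:
# write_image_lists(Path("."))
```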
Download the pre-trained weights for the convolutional layers (darknet19_448.conv.23, 76 MB) here and put them in the main directory.

Start training with the command:

          ./darknet detector train data/obj.data yolo-obj.cfg darknet19_448.conv.23

When training is complete, get the resulting yolo-obj_xxxxx.weights from darknet/backup/.

After every 1000 iterations, you can stop and later resume training from that point. For example, after 2000 iterations you can stop training, copy yolo-obj_2000.weights from darknet/backup/ to the main directory, and resume with:

          ./darknet detector train data/obj.data yolo-obj.cfg yolo-obj_2000.weights
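When resuming, the newest checkpoint is the one with the highest iteration number, and comparing the names as plain strings gets this wrong ("yolo-obj_10000.weights" sorts before "yolo-obj_2000.weights"). A small Python sketch that picks it numerically; the function name is hypothetical, and the file pattern is assumed from the yolo-obj_xxxxx.weights names above.

```python
import re
from pathlib import Path
from typing import Optional


def latest_checkpoint(backup_dir: Path, prefix: str = "yolo-obj") -> Optional[Path]:
    """Return the <prefix>_<iterations>.weights file with the highest iteration count."""
    pattern = re.compile(rf"^{re.escape(prefix)}_(\d+)\.weights$")
    best = None
    for path in backup_dir.iterdir():
        match = pattern.match(path.name)
        # Compare iteration counts as integers, not as strings.
        if match and (best is None or int(match.group(1)) > best[0]):
            best = (int(match.group(1)), path)
    return best[1] if best else None


# Example: print the resume command for the newest checkpoint.
# ckpt = latest_checkpoint(Path("backup"))
# if ckpt:
#     print(f"./darknet detector train data/obj.data yolo-obj.cfg {ckpt}")
```

Passing the path inside backup/ directly should work as well as copying the file to the main directory first, since darknet takes the weights file as a path argument.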

During training, you will see various error indicators:

        Region Avg IOU: 0.798363, Class: 0.893232, Obj: 0.700808, No Obj: 0.004567, Avg Recall: 1.000000, count: 8
        Region Avg IOU: 0.800677, Class: 0.892181, Obj: 0.701590, No Obj: 0.004574, Avg Recall: 1.000000, count: 8

        9002: 0.211667, 0.060730 avg, 0.001000 rate, 3.868000 seconds, 576128 images Loaded: 0.000000 seconds

When the average loss (the 0.xxxxxx avg value) stops decreasing over many iterations, you should stop training.
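"Stops decreasing over many iterations" can be checked mechanically if you save the training output to a file (for example by piping it through tee). The Python sketch below extracts the avg value from lines like 9002: 0.211667, 0.060730 avg, ... and compares the best loss in the most recent window with the window before it. The window of 1000 iterations and the 0.001 threshold are arbitrary assumptions, not darknet settings.

```python
import re
from typing import Iterable, List, Sequence, Tuple

# Matches lines such as "9002: 0.211667, 0.060730 avg, 0.001000 rate, ..."
ITERATION_LINE = re.compile(r"^\s*(\d+): ([\d.]+), ([\d.]+) avg")


def parse_avg_loss(lines: Iterable[str]) -> List[Tuple[int, float]]:
    """Extract (iteration, average loss) pairs from darknet training output."""
    result = []
    for line in lines:
        match = ITERATION_LINE.match(line)
        if match:  # 'Region Avg IOU' lines do not start with a number and are skipped
            result.append((int(match.group(1)), float(match.group(3))))
    return result


def has_plateaued(avg_losses: Sequence[float], window: int = 1000,
                  min_improvement: float = 0.001) -> bool:
    """True if the best avg loss in the last `window` iterations is not at least
    `min_improvement` lower than the best in the `window` iterations before it."""
    if len(avg_losses) < 2 * window:
        return False
    earlier_best = min(avg_losses[-2 * window:-window])
    recent_best = min(avg_losses[-window:])
    return earlier_best - recent_best < min_improvement


# Example (assumes the output was saved, e.g. with "... | tee train.log"):
# with open("train.log") as f:
#     losses = [loss for _, loss in parse_avg_loss(f)]
# print("plateaued" if has_plateaued(losses) else "still improving")
```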

Face ID results

Test your trained weights using the command:

          ./darknet detector test data/obj.data yolo-obj.cfg yolo-obj_xxxx.weights

After over 40000 iterations, I found my results to be fairly accurate. By default, YOLO only displays objects detected with a confidence of 0.25 or higher. You can change this by passing the -thresh <val> flag to the detector command:

          ./darknet detector test data/obj.data yolo-obj.cfg yolo-obj_xxxx.weights images/test.jpg -thresh 0.6 

Scope for improvement and applications

I don't know whether an application based on Darknet YOLO exists yet. Building one would make this complete and useful for real-life situations. The next step should be a tool that iterates through multiple images and saves the results for easy insights. If you have built one already, let me know.
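As a starting point for that kind of tool, here is a rough Python sketch that runs the test command above on every .jpg in a folder and keeps one result image per input. It relies on an assumption worth verifying against detector.c in your checkout: that the -out flag sets the output image name (with darknet adding the extension), so results are not all written over the same predictions file. The function names and the weights file name are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List


def build_command(image: Path, out_dir: Path, weights: str, thresh: float = 0.6) -> List[str]:
    """Build the darknet test command for one image (run from the darknet directory)."""
    return [
        "./darknet", "detector", "test", "data/obj.data", "yolo-obj.cfg", weights,
        str(image), "-thresh", str(thresh),
        # Assumption: -out sets the output image name; darknet appends the extension.
        "-out", str(out_dir / image.stem),
    ]


def run_batch(image_dir: Path, out_dir: Path, weights: str, thresh: float = 0.6) -> None:
    """Run detection on every .jpg in image_dir, one darknet process per image."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for image in sorted(image_dir.glob("*.jpg")):
        subprocess.run(build_command(image, out_dir, weights, thresh), check=True)


# Example with a hypothetical weights file:
# run_batch(Path("images"), Path("results"), "yolo-obj_40000.weights")
```

Starting one process per image reloads the network each time, which is slow but keeps the sketch simple; a real tool would load the weights once.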

Doing this small exercise made me appreciate the power of image recognition. If you're a mechanical engineer like me, you could quickly build a tool that helps manufacturing companies study material flaws in industrial radiography, or a sign language translator for hearing-impaired people. The possibilities are endless.