Real-Time Facial Expression Transformation for Monocular RGB Video
L. Ma
Department of Computer Science, University of Houston, Houston, TX, USA
Z. Deng
Department of Computer Science, University of Houston, Houston, TX, USA

Abstract
This paper describes a novel real-time end-to-end system for facial expression transformation that requires no driving source. Its core idea is to directly generate desired, photo-realistic facial expressions on top of an input monocular RGB video. Specifically, an unpaired learning framework is developed to learn the mapping between any two facial expressions in the facial blendshape space. The system then automatically transforms the source expression in an input video clip to a specified target expression by combining automated 3D face construction, the learned bi-directional expression mapping and automated lip correction. It can be applied to new users without additional training. Its effectiveness is demonstrated through extensive experiments on live and online video footage of faces with different identities, ages, speech content and expressions.
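To make the core idea concrete, the sketch below illustrates the general shape of an unpaired, bi-directional mapping between two expressions represented as per-frame blendshape weight vectors. It is a minimal illustration, not the authors' implementation: the blendshape dimensionality (46, typical of FaceWarehouse-style rigs), the network sizes, and the CycleGAN-style cycle-consistency term are assumptions, and the adversarial losses and training loop are omitted.

```python
# Minimal sketch (NOT the paper's code) of a bi-directional mapping between
# two expressions in blendshape-weight space, trained from UNPAIRED frames.
# DIM = 46 and the MLP sizes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
DIM = 46  # assumed number of blendshape weights per frame

def mlp_init(dims):
    """Initialize a small MLP as a list of (weight, bias) pairs."""
    return [(rng.normal(0.0, 0.1, (i, o)), np.zeros(o))
            for i, o in zip(dims[:-1], dims[1:])]

def mlp_apply(params, x):
    """Forward pass: tanh hidden activations, linear output layer."""
    for k, (w, b) in enumerate(params):
        x = x @ w + b
        if k < len(params) - 1:
            x = np.tanh(x)
    return x

# Two generators: G maps expression-A weights to expression-B weights,
# and F maps back. They are trained on unpaired pools of frames.
G = mlp_init([DIM, 64, DIM])
F = mlp_init([DIM, 64, DIM])

def cycle_loss(a_batch, b_batch):
    """Cycle-consistency term: F(G(a)) should recover a, and G(F(b)) b."""
    a_rec = mlp_apply(F, mlp_apply(G, a_batch))
    b_rec = mlp_apply(G, mlp_apply(F, b_batch))
    return np.mean(np.abs(a_rec - a_batch)) + np.mean(np.abs(b_rec - b_batch))

# Unpaired pools of per-frame blendshape weights (random stand-ins here;
# in practice these would come from tracked video of each expression).
neutral_frames = rng.uniform(0.0, 1.0, (128, DIM))
smiling_frames = rng.uniform(0.0, 1.0, (128, DIM))
print("initial cycle loss:", cycle_loss(neutral_frames, smiling_frames))
```

Operating on low-dimensional blendshape weights rather than on pixels is what makes such a mapping cheap enough to evaluate per frame at real-time rates.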
Supporting Information
Filename | Description
---|---
cgf13586-sup-0001-video1.mp4 (92 MB) | Video S1
Please note: The publisher is not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing content) should be directed to the corresponding author for the article.