Development of sign language database


The first step in the study of sign languages is to collect sign language data, but compared with spoken language research, far less data has been developed. Because signs are expressed using the whole body and three-dimensional space, the way signs are recorded varies with the purpose of the research, and there is no standard recording method or format. Even when data is collected for research, it is often impossible for other researchers to reuse it. For these and other reasons, collecting and accumulating sign language data remains difficult.

Under the leadership of Professor Yuji Nagashima of Kogakuin University (now Professor Emeritus), who has been engaged in engineering research on sign languages for many years, we participated in a project to develop a large-scale, high-precision database of Japanese Sign Language words. More than 5,000 Japanese Sign Language words, a scale unparalleled in the world, were recorded using optical motion capture equipment, and the database has been made available for research purposes. The development of the KoSign database was supported by a Grant-in-Aid for Scientific Research (KAKENHI), Kiban-Kenkyu (S).


  • Kogakuin University Japanese Sign Language Multi-Dimensional Database (KoSign) [DOI] [URL]
  • Keiko Watanabe, Yuji Nagashima, Daisuke Hara, Yasuo Horiuchi, Shinji Sako, Akira Ichikawa, "Construction of a Japanese Sign Language Database with Various Data Types", International Conference on Human-Computer Interaction (HCII2019), Communications in Computer and Information Science book series (CCIS), Vol. 1032, pp.317–322, Jul. 2019. [DOI]
  • Shinji Sako, Yuji Nagashima, Daisuke Hara, Yasuo Horiuchi, Keiko Watanabe, Ritsuko Kikusawa, Naoto Kato, Akira Ichikawa, "Discussion of a Japanese sign language database and its annotation systems with consideration for its use in various areas", LingCologne2019, Poster Nr. 24, Jun. 2019.
  • Yuji Nagashima, Daisuke Hara, Shinji Sako, Keiko Watanabe, Yasuo Horiuchi, Ritsuko Kikusawa, Naoto Kato, Akira Ichikawa, "Constructing a Japanese Sign Language Multi-Dimensional Database", The 7th Meeting of Signed and Spoken Language Linguistics (SSLL 2018), Sep. 2018. [PDF]

Research on fingering and playing motion


♪ Automatic violin fingering estimation

This research was carried out by Ms. Nagata, who completed her master's degree in 2014, and Ms. Watanabe, who graduated from the undergraduate programme in 2017. It is characterised by its ability to estimate fingerings appropriate to the performer's level of proficiency. Fingering refers to the hand position used when playing an instrument; on the violin, it corresponds to which finger presses which string. Because several fingerings can produce the same note, the appropriate choice depends on the player's skill. At beginner level the simplest fingering is appropriate, but as the player progresses, fingerings are chosen that are not only playable but also suited to expressive performance (e.g. different timbres on different strings, vibrato). The purpose of this study is to determine fingerings appropriate to each proficiency level from an arbitrary score.

To accommodate different levels of ability, it is necessary not only to find fingerings that are playable and avoid unnatural movements, but also to take performance expression into account. We therefore focus on performance indications in the score other than the notes themselves, and model the relationship between score and fingering using a conditional random field (CRF), a type of probabilistic model. A fingering estimation model is then obtained by training on fingering data taken from published music books.
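As a rough sketch of the decoding step, the fingering sequence that best balances per-note suitability against smooth transitions can be found with Viterbi-style dynamic programming, as in a linear-chain CRF. The scoring functions below are illustrative stand-ins, not the trained feature weights from the study:

```python
# Viterbi decoding of a fingering sequence (illustrative sketch only).
FINGERS = [0, 1, 2, 3, 4]  # 0 = open string, 1-4 = left-hand fingers

def viterbi(notes, emit_score, trans_score):
    """Return the highest-scoring fingering sequence for `notes`.

    emit_score(note, f): how well finger f suits this note.
    trans_score(g, f):   how natural the move from finger g to f is.
    """
    # best[i][f] = score of the best fingering ending with finger f at note i
    best = [{f: emit_score(notes[0], f) for f in FINGERS}]
    back = []
    for i in range(1, len(notes)):
        col, ptr = {}, {}
        for f in FINGERS:
            prev = max(FINGERS, key=lambda g: best[-1][g] + trans_score(g, f))
            col[f] = best[-1][prev] + trans_score(prev, f) + emit_score(notes[i], f)
            ptr[f] = prev
        best.append(col)
        back.append(ptr)
    # Backtrack from the best final finger.
    last = max(FINGERS, key=lambda f: best[-1][f])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Toy usage: a hypothetical "beginner" profile that prefers low-numbered
# fingers and penalises large finger jumps.
fingering = viterbi([60, 62, 64],                      # MIDI pitches
                    lambda note, f: -f,                # prefer low fingers
                    lambda g, f: -abs(g - f))          # prefer small jumps
# fingering == [0, 0, 0]: the open string throughout
```

In the actual model, these scores would come from learned CRF feature weights conditioned on the notes and on the performance indications in the score, with the proficiency level reflected in the features or training data.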

♪ Contrabass performance motion generation

This research was undertaken by Mr. Shirai, who completed his master's degree in 2022. Its objective is to generate a 3D model of a contrabass player's playing motion (the 3D trajectories of the upper body's physical feature points) for a given score. Such a model is expected to be useful for performance training for beginners and, combined with technology for generating performance expression, for creating a virtual performer.


  • Wakana Nagata, Shinji Sako, and Tadashi Kitamura, "Violin Fingering Estimation According to Skill Level based on Hidden Markov Model", Joint conference of 40th ICMC (International Computer Music Conference) and 11th SMC (Sound & Music Computing conference), pp. 1233–1238, Sep. 2014. [PDF]
  • Shinji Sako, Wakana Nagata, and Tadashi Kitamura, "Violin fingering estimation according to the performer's skill level based on conditional random field", Proc. of HCII 2015, Human-Computer Interaction: Interaction Technologies, LNCS 9170, pp.485–494, Aug. 2015. [DOI]
  • Takeru Shirai, and Shinji Sako, "3D skeleton motion generation of double bass from musical score", 15th International Symposium on Computer Music Multidisciplinary Research (CMMR), pp.41–46, Nov. 2021. [PDF]

Research on Automatic Arrangement


Recreating a performance form or melody from an existing piece of music to suit a specific purpose is widely practiced in jazz, popular music, classical music, and other forms of music. Arranging music, like composing, requires special knowledge and experience, and there is widespread research aimed at using computers to assist in this process. There are various methodologies for arrangement depending on the subject and purpose, and our laboratory has been involved in research on automatic arrangement from various viewpoints and approaches.

♬ Automatic ensemble score generation

This research was mainly carried out by Mr. Rio Mizuno, a master's student who graduated in 2009. Given a piano score and a desired instrumentation, the system generates an accompaniment and sub-melody that fit the melody and can be played by each instrument, automatically producing an ensemble score in which all instruments participate, switching parts in the middle of the melody where necessary. A demonstration system called "MusicPipe" was developed that lets users input a MIDI piano score, generate an ensemble score (output as a PDF file), and perform it on the spot; it has been presented at the Interaction conference and other events.

♬ Automatic Jazz Arrangement

This research was carried out by Mr. Naoto Sato, who completed his master's degree in 2015. The aim is an automatic jazz arrangement method that transforms the melody itself into a jazz-like one. As a first step, we proposed a method for transforming rhythm, an important element of jazz arrangement. We first prepared a large amount of example data consisting of pairs of original songs and their jazz arrangements. Treating jazz-style rhythmic transformation as a change between rhythm patterns of a fixed length, the method transforms an arbitrary melody into a jazz-specific rhythm using the example data.
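The example-based idea can be sketched as a nearest-neighbour lookup over rhythm patterns of a fixed length. The example pairs and distance function below are hypothetical stand-ins, not the corpus or matching criterion used in the study:

```python
# Example-based rhythm transfer, sketched per bar (hypothetical data).
# A rhythm pattern is a tuple of onset times within one bar, in beats.
EXAMPLES = {
    (0.0, 1.0, 2.0, 3.0): (0.0, 1.0, 2.0, 2.67),  # straight -> swung last beat
    (0.0, 2.0):           (0.5, 2.5),             # on-beat -> off-beat
}

def pattern_distance(a, b):
    """Crude distance between two onset patterns; incomparable if lengths differ."""
    if len(a) != len(b):
        return float("inf")
    return sum(abs(x - y) for x, y in zip(a, b))

def transfer_bar(bar):
    """Replace one bar's rhythm with that of the closest example pair."""
    best = min(EXAMPLES, key=lambda ex: pattern_distance(ex, bar))
    if pattern_distance(best, bar) == float("inf"):
        return bar  # no usable example: keep the original rhythm
    return EXAMPLES[best]

def transfer(melody_bars):
    """Apply the example-based transformation bar by bar."""
    return [transfer_bar(bar) for bar in melody_bars]
```

The actual method additionally considers consistency across the piece (as reflected in the publication titles below), rather than transforming each bar independently.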

♬ Automatic music box arrangement

This research was carried out by Mr. Matsumoto, who received his master's degree in 2022. It converts a given melody into one that satisfies the structural constraints of a common cylinder music box. A cylinder music box produces sound when pins on a rotating cylinder pluck a fixed number of metal teeth, each tuned to a specific pitch. Because the mechanism is simple and the number of teeth is limited, the pitch range is restricted, and there are structural constraints such as the inability to sound the same pitch twice in quick succession. The conversion of note sequences to satisfy these constraints was realised using neural networks.
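The constraints themselves are easy to state programmatically. The sketch below checks a note sequence against a hypothetical comb layout, with a naive rule-based repair standing in for the learned note-sequence conversion described above:

```python
# Music-box playability constraints (comb layout and repair are assumptions).
COMB = set(range(60, 78))   # available pitches: 18 teeth, hypothetical layout
MIN_GAP = 1                 # a tooth cannot be restruck on the very next step

def satisfies_constraints(notes):
    """notes: list of (onset_step, pitch). True if playable on the comb."""
    last_step = {}
    for step, pitch in notes:
        if pitch not in COMB:
            return False                       # outside the comb's range
        if pitch in last_step and step - last_step[pitch] <= MIN_GAP:
            return False                       # same tooth struck too soon
        last_step[pitch] = step
    return True

def naive_repair(notes):
    """Drop violating notes -- a crude stand-in for the learned conversion,
    which would instead rewrite the sequence to preserve the melody."""
    out, last_step = [], {}
    for step, pitch in notes:
        if pitch in COMB and (pitch not in last_step
                              or step - last_step[pitch] > MIN_GAP):
            out.append((step, pitch))
            last_step[pitch] = step
    return out
```
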

♬ Automatic arrangement of NES music

This research was carried out by Mr. Ogiso, who received his master's degree in 2022. Early game music, as exemplified by Nintendo's Family Computer (NES), combined distinctive electronic sounds (square waves, triangle waves, etc.) to create background music and sound effects within the limitations of the hardware sound chips. This distinctive music has established itself as a genre beyond games: new music is composed in the NES style, and existing music is arranged for the NES sound source. In this study, we aimed to automatically arrange existing music into music playable on the NES sound source, and realised the conversion using a neural network.


  • Naoto Sato, Shinji Sako, and Tadashi Kitamura, "Rhythm Transfer Considering Musical Consistency in Automatic Jazz Arrangement" (in Japanese), 2014 Autumn Meeting of the Acoustical Society of Japan, 2-4-19, pp. 945–946, Sep. 2014.
  • Naoto Sato, Shinji Sako, and Tadashi Kitamura, "Example-Based Rhythm Transfer for Automatic Jazz Arrangement" (in Japanese), 2014 IEICE General Conference, Student Poster Session, ISS-SP-396, p. 225, Mar. 2014. [IEICE]
  • Naoto Sato, Shinji Sako, and Tadashi Kitamura, "Example-Based Rhythm Transfer for Automatic Jazz Arrangement" (in Japanese), IPSJ SIG Technical Report, Music and Computer (MUS), Vol. 2015-MUS-107, No. 21, pp. 1–2, May 2015. [CiNii]
  • Naoto Sato, Shinji Sako, and Tadashi Kitamura, "Example-Based Melody Transformation for Automatic Jazz Arrangement" (in Japanese), 78th IPSJ National Convention, 2Q-06, pp. 457–458, Mar. 2016. [Convention Encouragement Award]
  • Yuta Matsumoto and Shinji Sako, "Automatic Arrangement for Cylinder Music Boxes Using FCN" (in Japanese), Proceedings of the 85th IPSJ National Convention, 1T-05, pp. 491–492, Mar. 2023. [IPSJ]
  • Yuhi Ogiso and Shinji Sako, "A Transformer-Based Automatic Arrangement Method in the Style of NES Music" (in Japanese), Proceedings of the 85th IPSJ National Convention, 1T-07, pp. 495–496, Mar. 2023. [IPSJ]




Sign language recognition


☝ Subunit modeling for sign language recognition

This research was mainly carried out by Mr. Ariga, who completed his master's degree in 2009. The vocabulary of a sign language is mainly represented by manual signals. Methods that extract features of these hand movements from images or sensors and recognise and classify them with pattern recognition techniques have been widely studied. In this study, a recognition model tailored to the nature of manual signals was investigated, reflecting the fact that they are composed mainly of hand movement, hand position, and hand shape. We showed that extending the hidden Markov model to handle these features separately, while exploiting the commonalities that appear across word representations, improves the recognition performance for sign words.
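The core idea of scoring movement, position, and shape separately can be sketched as a weighted combination of per-stream likelihoods. The stream weights and toy scores below are illustrative assumptions, not the trained models from the study:

```python
# Multi-stream scoring for sign-word recognition (illustrative sketch).
STREAMS = ("movement", "position", "shape")
WEIGHTS = {"movement": 0.5, "position": 0.3, "shape": 0.2}  # hypothetical

def combined_log_likelihood(stream_loglik):
    """stream_loglik: dict stream -> log-likelihood under one word model.
    Streams are treated as independent, so weighted log-likelihoods add."""
    return sum(WEIGHTS[s] * stream_loglik[s] for s in STREAMS)

def recognize(word_scores):
    """word_scores: dict word -> per-stream log-likelihoods for one input.
    Returns the word whose weighted combined score is highest."""
    return max(word_scores,
               key=lambda w: combined_log_likelihood(word_scores[w]))
```

In the actual approach, each stream's log-likelihood would come from an HMM over that feature, with subunit models shared across words that have common movement, position, or shape components.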

☝ Depth sensor-based sign language recognition system

This research was mainly carried out by Ms Hatano, who completed her Master's degree in 2015. Sign language, a visual language, contains various lexical and grammatical expressions in three-dimensional body movements. With the advent of depth sensors (e.g. Kinect sensor), it is now possible to acquire depth information at high speed and with high accuracy without using image processing. In this study, a real-time sign language word recognition technique was proposed using Kinect version 2, a typical depth sensor. The results of this research were also used to develop a continuous sign language recognition system using a small kiosk terminal in cooperation with a private company under a project supported by the Ministry of Economy, Trade and Industry.

☝ Fingerspelling recognition

This study was carried out by Ms. Hosoe, who completed her master's degree in 2017, and Nam, who completed a master's degree in 2019. In Japanese Sign Language, vocabulary is mainly represented by hand movements and shapes, but some words, such as names of people or places, have no specific sign. In such cases Japanese Sign Language uses distinctive finger shapes (fingerspelling) corresponding to each hiragana character in Japanese. The same is true in other sign languages, which use fingerspelling to represent the phonograms of the surrounding spoken languages.

As part of research into automatic sign language recognition, fingerspelling recognition from video has been widely studied. Although each letter has a specific hand shape, fingerspelling varies from person to person in finger shape, presentation style, and the direction from which it is captured. Our aim was to improve fingerspelling recognition performance by using a 3D hand model to synthesise data reproducing various shape changes and viewpoint differences, without having to collect large amounts of real data. This work was carried out in collaboration with Prof. Bogdan Kwolek of AGH University of Science and Technology, Poland.
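The viewpoint-augmentation idea can be sketched as rotating 3D hand keypoints and projecting them to 2D to obtain synthetic training views. The toy keypoints and pinhole camera below are assumptions for illustration, not the actual 3D hand model or renderer used in the study:

```python
import math

def rotate_y(points, angle):
    """Rotate 3D points (x, y, z) around the vertical axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x + s * z, y, -s * x + c * z) for x, y, z in points]

def project(points, focal=1.0, depth_offset=5.0):
    """Simple pinhole projection of 3D points to 2D image coordinates."""
    return [(focal * x / (z + depth_offset), focal * y / (z + depth_offset))
            for x, y, z in points]

def synthesize_views(keypoints, n_views=8):
    """Generate n_views projected 2D variants of one fingerspelling pose,
    evenly spaced around the vertical axis."""
    step = 2 * math.pi / n_views
    return [project(rotate_y(keypoints, i * step)) for i in range(n_views)]
```

A full pipeline would render textured hand images rather than bare keypoints, but the principle is the same: one captured pose yields many synthetic viewpoints for training.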

☝ Sign language recognition using ego-centric video

This research was mainly carried out by Mr. Miura, who graduated in 2022. It is unique in reading sign language from video taken from the signer's own point of view (first-person video). Previous video-based automatic sign language recognition has mostly used data filmed facing the signer. However, since sign language is a visual language that makes full use of three-dimensional space, information from the signer's own viewpoint is also essential. For example, the object a pointing gesture refers to is needed to understand the meaning of a sign, yet conventional face-on video lacks information such as the person, object, or direction being pointed at, which prevents full interpretation.

In this study, we investigated a technique that uses an omnidirectional (360°) camera worn at the signer's viewpoint to capture first-person video while simultaneously tracking the signer's body movements, and examined whether the body movement information obtained in this way is useful for automatic sign language recognition.


  • Mika Hatano, Shinji Sako, and Tadashi Kitamura, "Contour-based Hand Pose Recognition for Sign Language Recognition", Proc. of 6th Workshop on Speech and Language Processing for Assistive Technologies, Sep. 2015. [PDF]
  • Bogdan Kwolek, and Shinji Sako, "Learning Siamese Features for Finger Spelling Recognition", Advanced Concepts for Intelligent Vision Systems, LNCS, Vol. 10617, pp.225–236, Sep. 2017. [DOI]
  • Nam Tu Nguyen, Shinji Sako, and Bogdan Kwolek, "Deep CNN-based Recognition of JSL Finger Spelling", International Conference on Hybrid Artificial Intelligent Systems (HAIS), Lecture Notes in Computer Science book series (LNCS), Vol. 11734, pp.602–613, Sep. 2019. [DOI]
  • Nguyen Tu Nam, Shinji Sako, Bogdan Kwolek, "Fingerspelling recognition using synthetic images and deep transfer learning", 2020 The 13th International Conference on Machine Vision (ICMV 2020), 11605, pp. 528–535, Nov. 2020. [DOI]
  • Teppei Miura, and Shinji Sako, "SynSLaG: Synthetic Sign Language Generator", The 23rd International ACM SIGACCESS Conference on Computers and Accessibility, pp.1–4, Oct. 2021. [DOI]
  • Teppei Miura, Shinji Sako, "3D Ego-Pose Lift-Up Robustness Study for Fisheye Camera Perturbations", 18th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, Vol. 4: pp. 600–606, Feb. 2023. [DOI]

Research on automatic score following


In human performance, even when a piece is played according to the score, the result deviates from it: tempo fluctuates, dynamics change, deliberate insertions and other expressive variations occur, and pure performance errors creep in. Consequently, even when the score is known, tracking in real time which part of the score is currently being played is not easy. In this research, we developed a mechanism that tracks the performance position in real time by using a probabilistic model to represent how much a performance can vary, together with a mechanism that predicts the next position from the local tempo of the performance.
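A minimal sketch of such probabilistic tracking (not the actual system, which also models tempo explicitly): maintain a belief distribution over score positions, spread it with a transition model at each played note, and reweight it by how well the note matches the score. The transition and match probabilities below are illustrative assumptions:

```python
# Probabilistic score following, minimal filtering sketch.
def follow(score, performed):
    """score: list of expected pitches; performed: stream of played pitches.
    Returns the most probable score position after each performed note."""
    n = len(score)
    belief = [1.0 / n] * n                    # uniform prior over positions
    positions = []
    for pitch in performed:
        # Transition model: stay at the same note (0.3) or advance (0.7).
        moved = [0.0] * n
        for i, p in enumerate(belief):
            moved[i] += 0.3 * p
            if i + 1 < n:
                moved[i + 1] += 0.7 * p
            else:
                moved[i] += 0.7 * p           # remain at the final note
        # Observation model: a matching pitch is far likelier than a wrong one.
        belief = [p * (0.9 if score[i] == pitch else 0.1 / n)
                  for i, p in enumerate(moved)]
        total = sum(belief) or 1.0
        belief = [p / total for p in belief]
        positions.append(max(range(n), key=lambda i: belief[i]))
    return positions
```

Because the belief is a full distribution rather than a single pointer, one wrong or missed note does not derail the tracker; the real system builds on the same idea with an explicit tempo model to predict where the next note should fall in time.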

Application 1: Automatic accompaniment playback synchronized with violin performance

We have developed an automatic accompaniment system that takes the acoustic signal of an instrumental performance as input and plays the other parts in synchrony. By preparing scores (MIDI data) for both the human part and the accompaniment parts, a human-machine ensemble can enjoy any piece of music. The video below shows the system robustly following the tempo changes and performance errors of a violin performance. Besides the violin, we have confirmed that the system also works with piano and guitar.

Application example 2: Live demonstration of automatic accompaniment playback synchronized with violin performance

At Interaction 2013 in March 2013, we presented a live demonstration of our automatic accompaniment playback system. The venue was crowded with exhibitors and visitors, so we were concerned about the effect of ambient noise, but the system performed reasonably robustly.

Arm-type robot that follows piano playing

Since September 2015, Denso Corporation, Denso Wave, SOKEN, and our laboratory have been collaborating to develop Denmaiko, an arm-type robot that dances to musical performances. The results were exhibited at the Denso booth at the International Robot Exhibition 2015 in December 2015. Cobotta is a small 6-axis arm robot newly developed to bring industrial robots into familiar settings such as the home and education. Robot technology will increasingly enter the home, but for robots to work alongside humans they must be able to understand human intentions correctly. With this in mind, we concluded that our performance tracking technology, which follows a human performance, could serve this purpose, and this led to the joint development of the robot.

During the exhibition, live demonstrations were held regularly for many visitors. Fortunately, a video of the demonstrations is available on YouTube, courtesy of Mr. Kazumichi Moriyama.


  • Shinji Sako, Ryuichi Yamamoto, and Tadashi Kitamura, "Ryry: A Real-Time Score-Following Automatic Accompaniment Playback System Capable of Real Performances with Errors, Repeats and", Active Media Technology (AMT) Lecture Notes in Computer Science, LNCS 8610, pp. 134–145, Aug. 2014. [DOI]
  • Ryuichi Yamamoto, Shinji Sako, and Tadashi Kitamura, "Robust On-line Algorithm For Real-time Audio-to-score Alignment Based on A Delayed Decision and Anticipation Framework", International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 191–195, May 2013. [DOI]
  • Ryuichi Yamamoto, Shinji Sako, and Tadashi Kitamura, "Accurate and Low Computational Audio-to-score Alignment Using Segmental CRF with An Explicit Continuous Tempo Model", International Workshop on Nonlinear Circuits, Communications and Signal Processing (NCSP), pp. 345–348, Mar. 2013.
  • Ryuichi Yamamoto, Shinji Sako, and Tadashi Kitamura, "Real-time Audio to Score Alignment Using Semi-Markov Conditional Random Fields and Linear Dynamical System", The Music Information Retrieval Evaluation eXchange (MIREX 2012), Oct. 2012. [PDF]