
Transactions on Engineering and Computing Sciences - Vol. 12, No. 3

Publication Date: June 25, 2024

DOI:10.14738/tecs.123.16927.

Michael, C. (2024). Singing Voice Melody Detection. Transactions on Engineering and Computing Sciences, 12(3). 09-17.


Singing Voice Melody Detection

Chourdakis Michael

University of Athens

ABSTRACT

This is an enhancement of our time-domain pitch detection algorithm (Anonymous 2007), optimized for the singing voice. Various time- and frequency-domain methods have been researched to detect the pitch of the human voice. This manuscript describes our methods for converting human singing to music notation. Over the past years significant progress has been made in speech recognition, but the methods used in that process recognize textual patterns, not melody. We use a combination of time- and frequency-domain algorithms, implemented in C++, to achieve our aim. The results of our algorithms are satisfactory and accurate enough not only to represent human singing in European notation, but also to represent it when a higher-resolution script is available, for example the Byzantine script, which has a minimum note distance of about 17 cents (as opposed to European notation, with a minimum note distance of 100 cents).

OTHER ALGORITHMS

In this section we will describe our results with known pitch detection algorithms. There are two major categories: "time domain", in which the algorithm works mostly as a function of time, and "frequency domain", in which the algorithm works mostly as a function of frequency. We will give a quick review of these algorithms and later compare their results with our own.

The Fundamental Problem of a Window

Most pitch detection algorithms suffer from a fundamental problem: the choice of a window (J. Smith 2010). A window is the region of the signal on which we work. If the region is too small, we cannot be sure that the frequency contained in it completes a full period within the window. If the region is too large, we cannot know whether it contains only one frequency or perhaps more.

Fig 1: Improper picking of a section to analyze. The signal's period is longer than the window's duration.


Time Domain

These algorithms work mainly in the time domain, that is, on a signal expressed as a function of time such as S(t) = A_0 sin(2π f_0 t) + A_1 sin(2π f_1 t) + … + A_v sin(2π f_v t) (1).

Direct Attribute Count Methods:

These algorithms count basic properties of the signal (zero crossings, peaks, or phase) directly. They work only on very simple signals, which is not the case for most vocal signals. A plain zero-crossing approach (Anonymous 2006) counts the crossings of the signal with the X-axis and divides by the observation period to find the frequency. Obviously, this will only work on very simple signals such as a plain sine wave. For something more complex, like the signal below, we can see that if the maximum time is 10 seconds, then the frequency of the signal is 1 Hz, but the zero crossings are 20.

Fig. 2: f(t) = sin(2πt) + sin(4πt). Zero crossings cannot return reliable results.
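As an illustration of the plain zero-crossing estimator described above, the following is a minimal C++ sketch, assuming a mono sample buffer and its sample rate; the function names are ours, not taken from the paper's implementation.

```cpp
// Minimal zero-crossing frequency estimator (sketch, assumed interface).
#include <vector>
#include <cstddef>

double zeroCrossingFrequency(const std::vector<float>& x, double sr)
{
    if (x.size() < 2 || sr <= 0.0) return 0.0;
    std::size_t crossings = 0;
    for (std::size_t i = 1; i < x.size(); ++i)
        if ((x[i - 1] < 0.0f) != (x[i] < 0.0f))   // sign change => X-axis crossing
            ++crossings;
    const double seconds = double(x.size()) / sr;
    // A pure sine crosses the axis twice per period.
    return crossings / (2.0 * seconds);
}
```

On a pure sine wave this estimate is exact; on a signal such as Fig. 2 the extra crossings contributed by the higher partial inflate the count, which is exactly the failure described above.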

The "Peak Detection" method, closely related to zero crossings (Kedem 1986), counts the peaks instead of the zero crossings. Theoretically this can be more accurate, but in practice we very often have signals of the following type:

Fig. 3: Adult male voice.

In the above pattern, we do not know whether we should count the positive or the negative peaks. The plain peak detection algorithm performs only slightly better than plain zero crossings. The Phase Detection algorithm (Gibiat 1988) relies on counting phase differences. For each signal of the type f(t) = A sin(2πft) there is a phase movement from 0 to 2π (a full circle) for each period T = 1/f, and this can therefore be used to calculate the frequency. Again, this only works on signals whose frequency is approximately constant.
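For comparison, here is a minimal C++ sketch of the peak-counting variant discussed above, assuming we arbitrarily choose to count positive local maxima; on asymmetric voice waveforms such as Fig. 3 that choice is exactly the ambiguity described above. The names are illustrative, not from the paper's code.

```cpp
// Minimal peak-counting frequency estimator (sketch, assumed interface).
#include <vector>
#include <cstddef>

double peakCountFrequency(const std::vector<float>& x, double sr)
{
    if (x.size() < 3 || sr <= 0.0) return 0.0;
    std::size_t peaks = 0;
    for (std::size_t i = 1; i + 1 < x.size(); ++i)
        if (x[i] > 0.0f && x[i] > x[i - 1] && x[i] >= x[i + 1])   // positive local maximum
            ++peaks;
    const double seconds = double(x.size()) / sr;
    return peaks / seconds;   // one positive peak per period for a pure sine
}
```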

Related Signal Methods:

The correlation of two signals is a check of their similarity when one of them is shifted in time by τ, defined as follows: (f ⋆ g)(τ) = ∫_{−∞}^{+∞} f(t) g(t + τ) dt (2). Autocorrelation (Tan and Karnjanadecha 2003) occurs when the signal is compared with itself (f = g). If the function has a period T, then for τ ∈ [0, T/2] the autocorrelation function decreases and for τ ∈ [T/2, T] it increases. By measuring the points at which it decreases and increases we can calculate the frequency. This works only on "clean" signals that do not vary significantly over the recording.


It is therefore not recommended for the human voice. For example, consider the 70 ms voice signal recorded at 164 Hz below:

Fig. 4: Voice recorded with a tone of 164 Hz.

The autocorrelation function returns this signal:

Fig. 5: Autocorrelation function of the signal in Fig. 4.

This yields an incorrect frequency result, since counting the peaks of the signal above gives 122 Hz. YIN (Cheveigné and Kawahara 2002), a specialization of the autocorrelation function, tries to resolve various autocorrelation errors at high frequencies but still suffers from the general autocorrelation issues shown above. pYIN, a newer version of YIN, still suffers from the autocorrelation issues but enhances the "melody aspects" of YIN (Mauch and Dixon 2014).
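To make the lag search concrete, the following is a minimal C++ sketch of textbook autocorrelation-based period estimation for a single frame, assuming a mono buffer, its sample rate, and a rough vocal range; it is the standard method, not the paper's own algorithm.

```cpp
// Minimal autocorrelation pitch estimator for one frame (sketch, assumed interface).
#include <vector>
#include <cstddef>

double autocorrelationPitch(const std::vector<float>& x, double sr,
                            double fMin = 80.0, double fMax = 500.0)
{
    const std::size_t lagMin = std::size_t(sr / fMax);
    const std::size_t lagMax = std::size_t(sr / fMin);
    if (lagMin == 0 || x.size() <= lagMax) return 0.0;

    std::size_t bestLag = 0;
    double bestCorr = 0.0;
    for (std::size_t lag = lagMin; lag <= lagMax; ++lag) {
        double corr = 0.0;
        for (std::size_t t = 0; t + lag < x.size(); ++t)
            corr += double(x[t]) * double(x[t + lag]);   // equation (2) with f = g
        if (corr > bestCorr) { bestCorr = corr; bestLag = lag; }
    }
    return bestLag ? sr / double(bestLag) : 0.0;
}
```

On a clean, steady tone the strongest lag corresponds to the true period; on a varying signal such as Fig. 4 the search can lock onto a neighbouring peak, which is how an estimate like the 122 Hz above arises.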

Frequency Domain

These algorithms work mainly in the frequency domain, that is, on the Fourier transforms of the time signal: F(ω) = ∫_{−∞}^{+∞} f(t) e^{−iωt} dt (3) and F(ω) = ∑_{t=−∞}^{+∞} f(t) e^{−iωt} (4).

Fast Fourier Transform:

The generic Fast Fourier Transform suffers from a "cut" in the signal to be represented, known as the Gibbs phenomenon (Gibbs 1899), which occurs when the signal to be analyzed contains frequencies that are not exact multiples of the fundamental frequency (which is the case for the human voice). Also, the digital (discrete) Fourier transform can only return results for specific frequencies, namely those given by the following formula for integer j: f_j = j·SR/N, j ∈ [0, N/2 − 1] (5), where SR is the sampling rate and N is the window length (Marchand 2001). All algorithms based on the frequency domain suffer from these fundamental inaccuracies.
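A small worked example of equation (5) in C++ follows; the sample rate and window size are assumptions chosen only to show the size of the quantization error.

```cpp
// Illustration of equation (5): the DFT only reports energy at f_j = j * SR / N,
// so any pitch between two bins is quantized to the nearest bin.
#include <cmath>
#include <cstdio>

int main()
{
    const double SR     = 44100.0; // sample rate in Hz (assumed)
    const int    N      = 4096;    // analysis window length in samples (assumed)
    const double target = 164.0;   // the tone of Fig. 4

    const double binWidth = SR / N;                           // about 10.77 Hz per bin
    const int    j        = int(std::round(target / binWidth));
    const double fj       = j * binWidth;                     // nearest representable frequency

    std::printf("bin %d -> %.2f Hz (error %.2f Hz, %.1f cents)\n",
                j, fj, fj - target, 1200.0 * std::log2(fj / target));
    return 0;
}
```

With these assumed values the closest bin to 164 Hz lies about 2.5 Hz (roughly 26 cents) away, already coarser than the 17-cent step of the Byzantine script mentioned in the abstract.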

Cepstrum:

The Cepstrum approach (Noll 1964) applies a second Fourier transform to the logarithm of the signal's power spectrum: |F(log(|F(f(t))|²))|² (6), on the theory that a linear evaluation of the frequency is more easily analyzed than a logarithmic one. This method works better in speech recognition but not in human singing melody recognition.
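The following is a minimal C++ sketch of cepstrum-based pitch estimation following equation (6), assuming a mono frame and its sample rate; the naive O(N²) transform stands in for a real FFT library, and the vocal range limits are assumptions.

```cpp
// Minimal cepstrum pitch estimator (sketch, assumed interface).
#include <cmath>
#include <complex>
#include <vector>
#include <cstddef>

static std::vector<std::complex<double>> naiveDft(const std::vector<double>& in)
{
    const double kPi = 3.14159265358979323846;
    const std::size_t n = in.size();
    std::vector<std::complex<double>> out(n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t t = 0; t < n; ++t)
            out[k] += in[t] * std::polar(1.0, -2.0 * kPi * double(k) * double(t) / double(n));
    return out;
}

// Returns an estimated fundamental in Hz, or 0 if no peak is found.
double cepstrumPitch(const std::vector<double>& frame, double sr,
                     double fMin = 80.0, double fMax = 500.0)
{
    // First transform: log power spectrum of the frame, the inner part of (6).
    const auto spectrum = naiveDft(frame);
    std::vector<double> logPower(frame.size());
    for (std::size_t k = 0; k < frame.size(); ++k)
        logPower[k] = std::log(std::norm(spectrum[k]) + 1e-12);

    // Second transform: the cepstrum. A peak at quefrency q (in samples)
    // corresponds to a fundamental frequency of sr / q.
    const auto ceps = naiveDft(logPower);

    std::size_t qMin = std::size_t(sr / fMax);
    std::size_t qMax = std::size_t(sr / fMin);
    if (qMin < 1) qMin = 1;

    std::size_t best = 0;
    double bestVal = 0.0;
    for (std::size_t q = qMin; q <= qMax && q < ceps.size() / 2; ++q) {
        const double v = std::abs(ceps[q]);
        if (v > bestVal) { bestVal = v; best = q; }
    }
    return best ? sr / double(best) : 0.0;
}
```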

Maximum Likelihood:

The maximum likelihood method (Doval and Rodet 1993) is a statistical approach based on the hypothesis that a given signal is more likely to have been originally composed from a series of exact