Biometric Consortium

Verification Score Normalization

Let's take a look at one real system. Thousands of test utterances (YOHO database) are input to this system and the engine returns raw scores:

After all tests are run, we create histograms (top figure) and count our errors, both false reject and false accept for each possible threshold (bottom figure). Here are the results:
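The error counting described above can be sketched in a few lines. This is only a minimal illustration, assuming the engine accepts a speaker when the raw score is at or above the threshold (which matches the examples further down, where higher scores mean a better match). The function name `error_curves` and the score lists are made up for illustration; they are not YOHO data.

```python
from bisect import bisect_left

def error_curves(genuine, impostor, thresholds):
    """Return (threshold, FR %, FA %) tuples, accepting when score >= threshold."""
    genuine = sorted(genuine)
    impostor = sorted(impostor)
    curves = []
    for t in thresholds:
        # False reject: true speakers whose score falls below the threshold.
        fr = 100.0 * bisect_left(genuine, t) / len(genuine)
        # False accept: impostors whose score is at or above the threshold.
        fa = 100.0 * (len(impostor) - bisect_left(impostor, t)) / len(impostor)
        curves.append((t, fr, fa))
    return curves

# Made-up scores for illustration only (not YOHO data).
genuine_scores = [0.5, 1.0, 1.5, 2.5, 3.0]
impostor_scores = [-3.0, -2.0, -1.0, 0.0, 1.2]

for t, fr, fa in error_curves(genuine_scores, impostor_scores, [-1.0, 1.0, 2.0]):
    print(f"threshold {t:+.2f}: FR {fr:.1f}%  FA {fa:.1f}%")
```

Sweeping the threshold over the whole score range gives the two curves of the bottom figure: FR rises and FA falls as the threshold goes up.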

The system has an equal error rate of 1.221% at a raw score of 1.243. For this particular database, the raw scores range from -7.8988 to 6.9692. The raw numbers returned are floats and could conceivably span a much larger range given a different database, a different set of speakers, or different audio quality. In the figures, the dashed line is the FA % error and the solid line is the FR % error.
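One way to locate an equal error rate from test scores is to scan candidate thresholds and keep the one where the FR and FA percentages are closest. The sketch below does that; it is an assumption about how such a number could be obtained, not a description of how the figure above was produced. The name `equal_error_rate` and the scores are hypothetical.

```python
def equal_error_rate(genuine, impostor):
    """Approximate the EER by trying every observed score as a threshold."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        # Accept when score >= t (assumed convention: higher score = better match).
        fr = 100.0 * sum(s < t for s in genuine) / len(genuine)
        fa = 100.0 * sum(s >= t for s in impostor) / len(impostor)
        gap = abs(fr - fa)
        if best is None or gap < best[0]:
            # Report the midpoint of FR and FA at the closest crossing.
            best = (gap, t, (fr + fa) / 2.0)
    _, threshold, eer = best
    return threshold, eer

# Made-up scores for illustration only (not YOHO data).
genuine_scores = [0.5, 1.0, 1.5, 2.5, 3.0]
impostor_scores = [-3.0, -2.0, -1.0, 0.0, 1.2]

threshold, eer = equal_error_rate(genuine_scores, impostor_scores)
print(f"EER is about {eer:.1f}% at a raw score of {threshold:.2f}")
```

With only a handful of scores the two curves rarely cross exactly, which is why the sketch reports the midpoint at the closest pair. The scan is quadratic in the number of scores, which is fine for a sketch but slow for thousands of utterances.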

Problem: return a score from this system which "means the same" as a score from other systems.

Discussion: These curves have been created based on a certain set of tests (using the YOHO database). A different database or live tests should return different scores with different statistics. In live applications, the scores will depend on the microphone, office noise, background chatter, computer fans, etc. In fact, the same SVAPI engine will probably return different raw scores for the same person if the SVAPI machine is moved to a different location (say, from lab to office). In this hypothetical situation, what should the engine return? Whatever transformation maps raw scores to normalized scores in the lab environment won't be the same transformation in the office setting. Realistically, noise and microphone effects could contribute greatly to the raw (match) scores of the SVAPI engines.

ONE POSSIBLE SOLUTION: if the engine developers test their system with realistic tests, then for particular raw scores, they could return the values of the curves in the bottom figure. For those who want "confidence" scores, these are the closest values. For example, a user attempts a speaker login procedure using a single combination lock phrase (prompted). The engine returns a raw score of 2.00, which has an associated false reject error of 11.8160%. In other words, if I reject this individual, there exists an 11.8160% chance I goofed. On the other hand, if I allow this individual access, there simultaneously exists a probability of about 0.0481% that this person is an impostor. The application can decide what to do.

Likewise, if the returned score was -1.00, this has an associated false reject error of 0.0%. In other words, if I reject this individual, there exists a 0.0% chance I goofed. On the other hand, if I allow this individual access, there simultaneously exists a probability of about 76.0625% that it is an impostor!
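The lookup in these two examples could be sketched as a small table of curve points. The three rows below are the (raw score, FR %, FA %) values quoted in the text; a real engine would tabulate far more points from its own tests. The function name `normalized_score`, the linear interpolation between rows, and the clamping outside the table are all assumptions made for illustration. Real FA/FR curves are not linear, so interpolated values from such a sparse table are only rough.

```python
from bisect import bisect_right

# (raw score, FR %, FA %) points quoted in the text; a real table would be denser.
CURVE = [
    (-1.00, 0.0, 76.0625),
    (1.243, 1.221, 1.221),
    (2.00, 11.8160, 0.0481),
]

def normalized_score(raw, curve=CURVE):
    """Return (FR %, FA %) for a raw score, interpolating linearly between rows."""
    scores = [point[0] for point in curve]
    # Outside the tabulated range, clamp to the nearest row (an assumption).
    if raw <= scores[0]:
        return curve[0][1:]
    if raw >= scores[-1]:
        return curve[-1][1:]
    i = bisect_right(scores, raw)  # curve[i - 1][0] <= raw < curve[i][0]
    (x0, fr0, fa0), (x1, fr1, fa1) = curve[i - 1], curve[i]
    w = (raw - x0) / (x1 - x0)
    return (fr0 + w * (fr1 - fr0), fa0 + w * (fa1 - fa0))

fr, fa = normalized_score(2.00)
print(f"raw 2.00: FR {fr:.4f}%  FA {fa:.4f}%")
```

Returning both percentages, rather than a single number, leaves the accept/reject decision to the application.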

So, for testing, raw scores are needed. Also, many applications may want to decide acceptability on raw scores (since thresholds may have to change somewhat with the environment: office, lab, etc.). Lastly, if a "normalized score" should be returned, we may want to consider the probability of FA and FR (or their percentages) based on vendor/engine developer tests. I would guess that developers would know their FA/FR curves so they could recommend a threshold to their customers. These FA and FR percentages should satisfy the probability/confidence-loving members of the SVAPI crowd :^)
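As a rough sketch of how a developer might turn impostor test scores into a recommended threshold, the function below picks the lowest observed score at which the FA percentage drops to a target or below. The name `recommend_threshold`, the target value, and the scores are hypothetical, and the accept-when-score-is-at-or-above-threshold convention is assumed.

```python
def recommend_threshold(impostor, max_fa_percent):
    """Lowest observed-score threshold whose FA % does not exceed the target."""
    for t in sorted(set(impostor)):
        # FA at threshold t: impostors scoring at or above t would be accepted.
        fa = 100.0 * sum(s >= t for s in impostor) / len(impostor)
        if fa <= max_fa_percent:
            return t
    # No observed score is strict enough; a threshold above the maximum is needed.
    return None

# Made-up impostor scores for illustration only (not YOHO data).
impostor_scores = [-3.0, -2.0, -1.0, 0.0, 1.2]

print(recommend_threshold(impostor_scores, 20.0))
```

Such a recommendation holds only for the conditions under which the scores were collected, so it would need to be recomputed from tests in the new environment when the machine moves from lab to office.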

