Recovering Capitalization and Punctuation Marks on Speech Transcriptions
INESC-ID Lisboa, IST and ISCTE
This presentation addresses two important metadata annotation tasks involved in the production of rich transcripts: capitalization and the recovery of punctuation marks. The study focuses on broadcast news, using both manual and automatic speech transcripts. Several capitalization models were analysed and compared. The results indicate that generative approaches capture the structure of written corpora better, while discriminative approaches are suitable for dealing with speech transcripts and are also more robust to ASR errors. The effect of so-called language dynamics was also addressed: results indicate that capitalization performance is affected by the temporal distance between the training and testing data. For the punctuation task, the study covers the three most frequent marks: full stop, comma, and question mark. Early experiments addressed full-stop and comma recovery using local features and combining lexical and acoustic information. Recent experiments also incorporate prosodic information and extend the study to question marks.
Date: 2011-May-25 Time: 14:30:00 Room: 020