Trying Out a Speech Recognition API (Including Subtitle Generation)

2023. 1. 20. 11:47


A Look at AssemblyAI Speech Recognition

The AssemblyAI API converts recorded audio files and real-time audio streams into text transcripts. The transcribed text can then be processed further for sentiment analysis, summarization, entity detection, and topic detection.

The AssemblyAI API documentation (in English) is at https://docs.assemblyai.com/.

Signing Up for an API Account

Click 'Create Account' at the top right, enter an email address and password, and submit.

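API Overview

Processing runs at roughly 10 times real time: a 10-minute file, for example, is processed in about 1 minute. With real-time streaming, results come back within a few hundred milliseconds.

A free account can process only 1 file at a time, while a paid account can process up to 12 concurrently. Real-time streaming is unavailable on a free account (limit == 0); a paid account allows up to 32 concurrent streams.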
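The 10x figure can be turned into a back-of-the-envelope estimate. The sketch below only encodes that rule of thumb; the function name and the constant are illustrative assumptions, and real processing times will vary.

```python
# Rough estimate only: the "about 10x real time" figure is the one quoted in this post.
SPEED_FACTOR = 10.0  # assumed constant, not an API guarantee


def estimate_processing_seconds(audio_seconds: float,
                                speed_factor: float = SPEED_FACTOR) -> float:
    """Return the approximate processing time for a recording of the given length."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    return audio_seconds / speed_factor


# A 10-minute file should take about one minute.
print(estimate_processing_seconds(10 * 60))  # 60.0
```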
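One way to stay inside these concurrency limits on the client side is to cap the number of worker threads. This is a minimal sketch, not an official pattern: `submit` is a hypothetical callable that is assumed to block until the transcript for one file is finished, and the limits are simply the numbers quoted above.

```python
from concurrent.futures import ThreadPoolExecutor

# Limits as quoted in this post; treat them as a snapshot, not a guarantee.
MAX_CONCURRENT = {"free": 1, "paid": 12}


def transcribe_all(audio_urls, submit, plan="free"):
    """Run submit(url) for every URL without exceeding the plan's limit.

    `submit` is a hypothetical callable that handles one file end to end
    and returns its result.
    """
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT[plan]) as pool:
        # map() returns results in the same order as the input URLs
        return list(pool.map(submit, audio_urls))
```

With `plan="free"` the files are processed one after another; with `plan="paid"` up to 12 are in flight at once.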

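The error codes most often encountered when using the API are:
401 : the API key is missing
400 : an API request parameter is used incorrectly
500 : server-side error
An error response carries a message as well as the code, so the message should be consulted for details. Examples include:
an unsupported file format
a file that contains no audio data
audio that is too short (200 ms or less)
an audio file that cannot be accessed (for example, at a particular URL)

The supported file formats are listed in the AssemblyAI documentation. Practically any common audio or video file format is supported.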
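The status codes described above can be combined with the message in the response body when reporting a failure. This is a minimal sketch: it assumes the body carries its message under an "error" key, which should be checked against a real response, and the function name is illustrative.

```python
# General meanings as described in this post.
ERROR_MEANINGS = {
    400: "bad request: an API parameter is used incorrectly",
    401: "unauthorized: the API key is missing",
    500: "server-side error",
}


def describe_failure(status_code: int, body: dict) -> str:
    """Combine the code's general meaning with the message from the response body.

    Assumes the message is stored under the "error" key (an assumption).
    """
    meaning = ERROR_MEANINGS.get(status_code, "unexpected status")
    detail = body.get("error", "no message provided")
    return f"{status_code} ({meaning}): {detail}"


# The message text here is a made-up placeholder, not a real API message.
print(describe_failure(400, {"error": "example message"}))
```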

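Authentication and Processing an Online Audio File

The code is written in the Visual Studio Code editor.

The authorization value in headers must be replaced with your own API key.

The API key is shown on the dashboard and in the Developers drop-down panel. Copy it and paste it into the code.

The file to be processed is one hosted on the internet.

It is a clip of English speech about 12 seconds long.

The result is as follows.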
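The request described here can be sketched as follows. It assumes the same endpoint and headers that the later examples in this post use. The `post` parameter is a hypothetical hook added only so the function can be tried with a stub instead of a real network call.

```python
TRANSCRIPT_ENDPOINT = "https://api.assemblyai.com/v2/transcript"


def submit_transcription(api_key, audio_url, post=None):
    """Ask the API to transcribe the audio at audio_url; return the parsed JSON reply."""
    if post is None:
        import requests  # imported lazily so a stub can be used without the package
        post = requests.post
    headers = {"authorization": api_key, "content-type": "application/json"}
    response = post(TRANSCRIPT_ENDPOINT, json={"audio_url": audio_url}, headers=headers)
    return response.json()


# Usage (replace YOUR-API-TOKEN with your own key before running):
#   print(submit_transcription("YOUR-API-TOKEN", "https://bit.ly/3yxKEIY"))
```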

{'id': 'oe8djuoctf-3874-4228-82f6-0416162f48fd', 'language_model': 'assemblyai_default', 'acoustic_model': 'assemblyai_default', 'language_code': 'en_us', 'status': 'queued', 'audio_url': 'https://bit.ly/3yxKEIY', 'text': None, 'words': None, 'utterances': None, 'confidence': None, 'audio_duration': None, 'punctuate': True, 'format_text': True, 'dual_channel': None, 'webhook_url': None, 'webhook_status_code': None, 'speed_boost': False, 'auto_highlights_result': None, 'auto_highlights': False, 'audio_start_from': None, 'audio_end_at': None, 'word_boost': [], 'boost_param': None, 'filter_profanity': False, 'redact_pii': False, 'redact_pii_audio':
False, 'redact_pii_audio_quality': None, 'redact_pii_policies': None, 'redact_pii_sub': None, 'speaker_labels': False, 'content_safety': False, 'iab_categories': False, 'content_safety_labels': {}, 'iab_categories_result': {}, 'disfluencies': False, 'sentiment_analysis': False, 'sentiment_analysis_results': None, 'auto_chapters': False, 'chapters': None, 'entity_detection': False, 'entities': None}

As this output shows, the transcription result is not returned immediately. The result has to be fetched afterwards using the id value (oe8djuoctf-3874-4228-82f6-0416162f48fd).

{'id': 'oe8djuoctf-3874-4228-82f6-0416162f48fd', 'language_model': 'assemblyai_default', 'acoustic_model': 'assemblyai_default', 'language_code': 'en_us', 'status': 'completed', 'audio_url': 'https://bit.ly/3yxKEIY', , 'words': [{'text': 'You', 'start': 430, 'end': 522, 'confidence': 0.99876, 'speaker': None}, {'text': 'know,', 'start': 536, 'end': 858, 'confidence': 0.9964, 'speaker': None}, {'text': 'demons', 'start': 944, 'end': 1402, 'confidence': 0.99522, 'speaker': None}, {'text': 'on', 'start': 1426, 'end': 1578, 'confidence': 0.78258, 'speaker': None}, {'text': 'TV', 'start': 1604, 'end': 1894, 'confidence': 0.97286, 'speaker': None}, {'text': 'like', 'start': 1942, 'end': 2118, 'confidence': 0.99993, 'speaker': None}, {'text': 'that.', 'start': 2144, 'end': 2334, 'confidence': 0.99808, 'speaker': None}, {'text': 'And', 'start': 2372, 'end': 2754, 'confidence': 0.70275, 'speaker': None}, {'text': 'and', 'start': 2852, 'end': 3114, 'confidence': 0.84709, 'speaker': None}, {'text': 'for', 'start': 3152, 'end': 3354, 'confidence': 0.80745, 'speaker': None}, {'text': 'people', 'start': 3392, 'end': 3630, 'confidence': 0.9997, 'speaker': None}, {'text': 'to', 'start': 3680, 'end': 4038, 'confidence': 0.9935, 'speaker': None}, {'text': 'expose', 'start': 4124, 'end': 4558, 'confidence': 0.66411, 'speaker': None}, {'text': 'themselves', 'start': 4594, 'end': 4974, 'confidence': 0.98493, 'speaker': None}, {'text': 'to', 'start': 5072, 'end': 5298, 'confidence': 0.99923, 'speaker': None}, {'text': 'being', 'start': 5324, 'end': 5514, 'confidence': 0.99932, 'speaker': None}, {'text': 'rejected', 'start': 5552, 'end': 6058, 'confidence': 0.99986, 'speaker': None}, {'text': 'on', 'start': 6094, 'end': 6258, 'confidence': 0.9996, 'speaker': None}, {'text': 'TV.', 'start': 6284, 'end': 6550, 'confidence': 0.76502, 'speaker': None}, {'text': 'Or,', 'start': 6610, 'end': 7050, 'confidence': 0.51107, 'speaker': None}, {'text': 'you', 'start': 7160, 'end': 7398, 
'confidence': 0.55854, 'speaker': None}, {'text': 'know,', 'start': 7424, 'end': 7650, 'confidence': 0.99902, 'speaker': None}, {'text': 'Humil,', 'start': 7700, 'end': 8170, 'confidence': 0.61465, 'speaker': None}, {'text': 'humiliated', 'start': 8230, 'end': 8962, 'confidence': 0.50987, 'speaker': None}, {'text': 'by', 'start': 9046, 'end': 9258, 'confidence': 0.99928, 'speaker': None}, {'text': 'Fear', 'start': 9284, 'end': 9598, 'confidence': 0.63855,
'speaker': None}, {'text': 'Factor', 'start': 9634, 'end': 10090, 'confidence': 0.67138, 'speaker': None}, {'text': 'or,', 'start': 10150, 'end': 10760, 'confidence': 0.99936, 'speaker': None}, {'text': 'you', 'start': 11210, 'end': 11502, 'confidence': 0.70603, 'speaker': None}, {'text': 'know.', 'start': 11516, 'end': 11580, 'confidence': 0.62727, 'speaker': None}], 'utterances': None, 'confidence': 0.844713666666667, 'audio_duration': 12, 'punctuate': True, 'format_text': True, 'dual_channel': None, 'webhook_url': None, 'webhook_status_code': None, 'speed_boost': False, 'auto_highlights_result': None, 'auto_highlights': False, 'audio_start_from': None, 'audio_end_at': None, 'word_boost': [], 'boost_param': None, 'filter_profanity': False, 'redact_pii': False, 'redact_pii_audio': False, 'redact_pii_audio_quality': None, 'redact_pii_policies': None, 'redact_pii_sub': None, 'speaker_labels': False, 'content_safety': False, 'iab_categories': False, 'content_safety_labels': {'status': 'unavailable', 'results': [], 'summary': {}}, 'iab_categories_result': {'status': 'unavailable', 'results': [], 'summary': {}}, 'disfluencies': False, 'sentiment_analysis': False, 'auto_chapters': False, 'chapters': None, 'sentiment_analysis_results': None, 'entity_detection': False, 'entities': None}

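The API response carries a variety of information about the processing result.

acoustic_model : the acoustic model used
audio_duration : length of the processed audio file (about 12 seconds)
audio_url : URL of the file being processed
confidence : recognition confidence over the entire audio
dual_channel : whether the data is stereo or single-channel
id : ID of the request
status : state of the request with this ID; for long files, check whether it has become 'completed'
text : the transcript, i.e. the recognized text string
words : the start time, end time, and confidence of each word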
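The words field makes it easy to spot the parts of a transcript that deserve a second look. The sketch below filters words by confidence; the 0.7 threshold and the function name are arbitrary assumptions, and the sample is a trimmed-down response using three words from the output shown earlier.

```python
def low_confidence_words(transcript: dict, threshold: float = 0.7):
    """Return (text, confidence) pairs for words scoring below the threshold."""
    words = transcript.get("words") or []  # 'words' is None until the job completes
    return [(w["text"], w["confidence"]) for w in words if w["confidence"] < threshold]


# Trimmed-down response shaped like the output shown in this post.
sample = {
    "status": "completed",
    "words": [
        {"text": "expose", "start": 4124, "end": 4558, "confidence": 0.66411},
        {"text": "rejected", "start": 5552, "end": 6058, "confidence": 0.99986},
        {"text": "humiliated", "start": 8230, "end": 8962, "confidence": 0.50987},
    ],
}
print(low_confidence_words(sample))  # [('expose', 0.66411), ('humiliated', 0.50987)]
```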

Processing an Audio File on the Local PC

import requests

# Path to the local audio file to upload
filename = "d:\\AssemblyAI\\myfile_44kHz_mono.wav"

def read_file(filename, chunk_size=5242880):
    # Yield the file in chunks (5 MB by default) so it is streamed instead of loaded whole
    with open(filename, 'rb') as _file:
        while True:
            data = _file.read(chunk_size)
            if not data:
                break
            yield data

# Replace YOUR-API-TOKEN with your own API key
headers = {'authorization': "YOUR-API-TOKEN"}
response = requests.post('https://api.assemblyai.com/v2/upload',
                        headers=headers,
                        data=read_file(filename))

print(response.json())

Set filename to the audio file on the local PC, enter your API key, and run the program.

When run, it returns a URL like the one below.

{'upload_url': 'https://cdn.assemblyai.com/upload/b0a420fb-58d2-4005-8b6f-9d53f41b2b89'}

This value (b0a420fb-58d2-4005-8b6f-9d53f41b2b89) is used when requesting the transcription.

import requests

endpoint = "https://api.assemblyai.com/v2/transcript"
# audio_url is the upload_url returned by the upload step
payload = { "audio_url": "https://cdn.assemblyai.com/upload/b0a420fb-58d2-4005-8b6f-9d53f41b2b89" }
headers = {
    "authorization": "YOUR-API-TOKEN",
    "content-type": "application/json"
}
response = requests.post(endpoint, json=payload, headers=headers)
print(response.status_code)
print(response.json())

The output is as follows.

{'id': 'oehlp3curi-3f13-4dc5-ba17-686e48615313', 'language_model': 'assemblyai_default', 'acoustic_model': 'assemblyai_default', 'language_code': 'en_us', 'status': 'queued', 'audio_url': 'https://cdn.assemblyai.com/upload/b0a420fb-58d2-4005-8b6f-9d53f41b2b89', 'text': None, 'words': None, 'utterances': None, 'confidence': None, 'audio_duration': None, 'punctuate': True, 'format_text': True, 'dual_channel': None, 'webhook_url': None, 'webhook_status_code': None, 'speed_boost': False, 'auto_highlights_result': None, 'auto_highlights': False, 'audio_start_from': None, 'audio_end_at': None, 'word_boost': [], 'boost_param': None, 'filter_profanity':
False, 'redact_pii': False, 'redact_pii_audio': False, 'redact_pii_audio_quality': None, 'redact_pii_policies': None, 'redact_pii_sub': None, 'speaker_labels': False, 'content_safety': False, 'iab_categories': False, 'content_safety_labels': {}, 'iab_categories_result': {}, 'disfluencies': False, 'sentiment_analysis': False, 'sentiment_analysis_results': None, 'auto_chapters': False, 'chapters': None, 'entity_detection': False, 'entities': None}

Once status becomes completed, the transcript appears in the text field, but at this point status is still queued. The status therefore has to be checked periodically (polling).

Instead of polling, a webhook can be used to receive a notification when the job is completed.

The recognition result can now be fetched using the id value (oehlp3curi-3f13-4dc5-ba17-686e48615313).
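A polling loop along these lines could look like the sketch below. It is written under assumptions: `fetch` is a hypothetical zero-argument callable returning the parsed JSON of a GET on the transcript endpoint, the "error" status and its "error" message key are assumed failure markers, and the interval and attempt count are arbitrary.

```python
import time


def wait_for_transcript(fetch, interval_seconds=3.0, max_attempts=100, sleep=time.sleep):
    """Call fetch() until the job finishes, then return the final response dict.

    fetch is any zero-argument callable that returns the parsed JSON of
    GET /v2/transcript/<id>.
    """
    for _ in range(max_attempts):
        result = fetch()
        status = result.get("status")
        if status == "completed":
            return result
        if status == "error":  # assumed failure status
            raise RuntimeError(result.get("error", "transcription failed"))
        sleep(interval_seconds)  # any other status: wait and try again
    raise TimeoutError("transcript was not ready in time")
```

With the requests library, `fetch` could be something like `lambda: requests.get(endpoint, headers=headers).json()`, where `endpoint` ends in the id returned by the submission.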
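For the webhook approach, the responses in this post include a webhook_url field, which suggests the URL is supplied with the transcription request. The sketch below builds such a request body under that assumption; the function name is illustrative and the example.com address is a placeholder for an endpoint you control.

```python
def build_transcript_payload(audio_url, webhook_url=None):
    """Build the JSON body for the transcription request, optionally with a webhook."""
    payload = {"audio_url": audio_url}
    if webhook_url is not None:
        # Assumption: the service calls this URL once the job finishes.
        payload["webhook_url"] = webhook_url
    return payload


# example.com is a placeholder; use an endpoint you control.
print(build_transcript_payload("https://bit.ly/3yxKEIY",
                               "https://example.com/assemblyai-hook"))
```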

import requests
endpoint = "https://api.assemblyai.com/v2/transcript/oehlp3curi-3f13-4dc5-ba17-686e48615313"
headers = {
    "authorization": "YOUR-API-TOKEN",
}
response = requests.get(endpoint, headers=headers)
print(response.json())

The result is shown below.

{'id': 'oehlp3curi-3f13-4dc5-ba17-686e48615313', 'language_model': 'assemblyai_default', 'acoustic_model': 'assemblyai_default', 'language_code': 'en_us', 'status': 'completed', 'audio_url': 'https://cdn.assemblyai.com/upload/b0a420fb-58d2-4005-8b6f-9d53f41b2b89', 'text': 'Good morning. May I help you? Itchesani.', 'words': [{'text': 'Good', 'start': 910, 'end': 1110, 'confidence': 0.55991, 'speaker': None}, {'text': 'morning.', 'start': 1160, 'end': 1734, 'confidence': 0.52938, 'speaker': None}, {'text': 'May', 'start': 1892, 'end': 2214, 'confidence': 0.63963, 'speaker': None}, {'text': 'I', 'start': 2252, 'end': 2454, 'confidence': 0.99273, 'speaker': None}, {'text': 'help', 'start': 2492, 'end': 2694, 'confidence': 0.91993, 'speaker': None}, {'text': 'you?', 'start': 2732, 'end': 3258, 'confidence': 0.97432, 'speaker': None}, {'text': 'Itchesani.', 'start': 3404, 'end': 4150, 'confidence': 0.1758, 'speaker': None}], 'utterances': None, 'confidence': 0.684528571428571, 'audio_duration': 5, 'punctuate': True, 'format_text': True, 'dual_channel': None, 'webhook_url': None, 'webhook_status_code': None, 'speed_boost': False, 'auto_highlights_result': None, 'auto_highlights': False, 'audio_start_from': None, 'audio_end_at': None, 'word_boost': [], 'boost_param': None, 'filter_profanity': False, 'redact_pii': False, 'redact_pii_audio': False, 'redact_pii_audio_quality': None, 'redact_pii_policies': None, 'redact_pii_sub': None, 'speaker_labels': False, 'content_safety': False, 'iab_categories': False, 'content_safety_labels': {'status': 'unavailable', 'results': [], 'summary': {}}, 'iab_categories_result': {'status': 'unavailable', 'results': [], 'summary': {}}, 'disfluencies': False, 'sentiment_analysis': False, 'auto_chapters': False, 'chapters': None, 'sentiment_analysis_results': None, 'entity_detection': False, 'entities': None}

status is 'completed', and the text field contains the recognition result 'Good morning. May I help you? Itchesani.' What the author actually said was 'Good morning. May I help you? It's sunny.'

Naver's CLOVA Note, which offers a web-based user interface, recognized the same utterance as 'could mary may i help you it's a sunny'.

Next, a video file is uploaded from the local PC. The video is in mp4 format and runs about 1 minute.

Running UploadingLocalFilesForTranscription.py returns the value below.

{'upload_url': 'https://cdn.assemblyai.com/upload/de846263-e374-454a-a42e-c0c4a2401e2b'}

Running SubmitUploadForTranscription.py returns the following.
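The two results can be compared more concretely with a word error rate (WER). The post itself does not compute this metric, so the following is only an illustrative sketch: it lowercases the text, strips punctuation other than apostrophes, and divides the word-level edit distance by the number of reference words.

```python
import re


def tokenize(sentence):
    """Lowercase and keep only runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


spoken = "Good morning. May I help you? It's sunny."
print(word_error_rate(spoken, "Good morning. May I help you? Itchesani."))  # 0.25
print(word_error_rate(spoken, "could mary may i help you it's a sunny"))    # 0.375
```

On this single short utterance the AssemblyAI output has 2 word errors out of 8 reference words and the CLOVA Note output has 3, which is far too small a sample to rank the two services.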

{'id': 'oowurcgelb-923b-456b-a5bf-a2d4be721940', 'language_model': 'assemblyai_default', 'acoustic_model': 'assemblyai_default', 'language_code': 'en_us', 'status': 'queued', 'audio_url': 'https://cdn.assemblyai.com/upload/de846263-e374-454a-a42e-c0c4a2401e2b', 'text': None, 'words': None, 'utterances': None, 'confidence': None, 'audio_duration': None, 'punctuate': True, 'format_text': True, 'dual_channel': None, 'webhook_url': None, 'webhook_status_code': None, 'speed_boost': False, 'auto_highlights_result': None, 'auto_highlights': False, 'audio_start_from': None, 'audio_end_at': None, 'word_boost': [], 'boost_param': None, 'filter_profanity':
False, 'redact_pii': False, 'redact_pii_audio': False, 'redact_pii_audio_quality': None, 'redact_pii_policies': None, 'redact_pii_sub': None, 'speaker_labels': False, 'content_safety': False, 'iab_categories': False, 'content_safety_labels': {}, 'iab_categories_result': {}, 'disfluencies': False, 'sentiment_analysis': False, 'sentiment_analysis_results': None, 'auto_chapters': False, 'chapters': None, 'entity_detection': False, 'entities': None}

{'id': 'oowurcgelb-923b-456b-a5bf-a2d4be721940', 'language_model': 'assemblyai_default', 'acoustic_model': 'assemblyai_default', 'language_code': 'en_us', 'status': 'completed', 'audio_url': 'https://cdn.assemblyai.com/upload/de846263-e374-454a-a42e-c0c4a2401e2b', 'text': "Foreign Minister Kyan Wakang, welcome to the program. Well, thank you for having me back on your program, Christian. You know, when we last spoke, it was a very different world. And now today, December 20, you in South Korea are facing a third wave, maybe even a fourth wave of this coronavirus. Just tell me why this latest one is so difficult to control. Yes, we are in the midst of our third way, which is turning out to be higher than the first and lasting much longer. And it is thus because the virus has now penetrated into every corner of everyday life of people. And this happening mostly
in the Metropolitan Seoul area. And you know how packed with people this particular area is in a country that is already one of the most population density wise, high density country today. In fact, we've hit the highest number so far at 1017, three new confirmed cases, including, of course, those who have recently come in from overseas.", 'words': [{'text': 'Foreign', 'start': 3050, 'end': 3278, 'confidence': 0.9741, 'speaker': None}, {'text': 'Minister', 'start': 3314, 'end': 3686, 'confidence': 0.99944, 'speaker': None}, {'text': 'Kyan', 'start': 3758, 'end': 4154, 'confidence': 0.13066, 'speaker': None}, {'text': 'Wakang,', 'start': 4202, 'end': 4874, 'confidence': 0.10904, 'speaker': None}, {'text': 'welcome', 'start': 4982, 'end': 5434, 'confidence': 0.99937, 'speaker': None}, {'text': 'to', 'start': 5532, 'end': 5830, 'confidence': 0.99932, 'speaker': None}, {'text': 'the', 'start': 5880, 'end': 6094, 'confidence': 0.9995, 'speaker': None}, {'text': 'program.', 'start': 6132, 'end': 6720, 'confidence': 0.99991, 'speaker': None}, {'text': 'Well,', 'start': 7290, 'end': 7654, 'confidence': 0.85043, 'speaker': None}, {'text': 'thank', 'start': 7692, 'end': 7858, 'confidence': 0.9993, 'speaker': None}, {'text': 'you', 'start': 7884, 'end': 8038, 'confidence': 0.99861, 'speaker': None}, {'text': 'for', 'start': 8064, 'end': 8182, 'confidence': 0.99974, 'speaker': None}, {'text': 'having', 'start': 8196, 'end': 8374, 'confidence': 0.99797, 'speaker': None}, {'text': 'me', 'start': 8412, 'end': 8578,
'confidence': 0.99978, 'speaker': None}, {'text': 'back', 'start': 8604, 'end': 8758, 'confidence': 0.99943, 'speaker': None}, {'text': 'on', 'start': 8784, 'end': 8902, 'confidence': 0.99888, 'speaker': None}, {'text': 'your', 'start': 8916, 'end': 9058, 'confidence': 0.99841, 'speaker': None}, {'text': 'program,', 'start': 9084, 'end': 9346, 'confidence': 0.9997, 'speaker': None}, {'text': 'Christian.', 'start': 9408, 'end': 9938, 'confidence': 0.53526, 'speaker': None}, {'text': 'You', 'start': 10034, 'end': 10222, 'confidence': 0.65565, 'speaker': None}, {'text': 'know,', 'start': 10236, 'end': 10414, 'confidence': 0.99709, 'speaker': None},
{'text': 'when', 'start': 10452, 'end': 10582, 'confidence': 0.99958, 'speaker': None}, {'text': 'we', 'start': 10596, 'end': 10738, 'confidence': 0.99828, 'speaker': None}, {'text': 'last', 'start': 10764, 'end': 10954, 'confidence': 0.98151, 'speaker': None}, {'text': 'spoke,', 'start': 10992, 'end': 11258, 'confidence': 0.51839, 'speaker': None}, {'text': 'it', 'start': 11294, 'end': 11422, 'confidence': 0.99569, 'speaker': None}, {'text': 'was', 'start': 11436, 'end': 11578, 'confidence': 0.99951, 'speaker': None}, {'text': 'a', 'start': 11604, 'end': 11722, 'confidence': 0.99016, 'speaker': None}, {'text': 'very', 'start': 11736, 'end': 12058, 'confidence': 0.9872, 'speaker': None}, {'text': 'different', 'start': 12144, 'end': 12502, 'confidence': 0.99908, 'speaker': None}, {'text': 'world.', 'start': 12576, 'end': 13030, 'confidence': 0.9981, 'speaker':
None}, {'text': 'And', 'start': 13140, 'end': 13558, 'confidence': 0.92822, 'speaker': None}, {'text': 'now', 'start': 13644, 'end': 14074, 'confidence': 0.99358, 'speaker': None}, {'text': 'today,', 'start': 14172, 'end': 14758, 'confidence': 0.99458, 'speaker': None}, {'text': 'December', 'start': 14904, 'end': 15518, 'confidence': 0.97343, 'speaker': None}, {'text': '20,', 'start': 15614, 'end': 16500, 'confidence': 0.6, 'speaker': None}, {'text': 'you', 'start': 17250, 'end': 17614, 'confidence': 0.91964, 'speaker': None}, {'text': 'in', 'start': 17652, 'end': 17818, 'confidence': 0.6083, 'speaker': None}, {'text': 'South', 'start': 17844, 'end': 18038, 'confidence': 0.54402, 'speaker': None}, {'text': 'Korea', 'start': 18074, 'end': 18458, 'confidence': 0.60191, 'speaker': None}, {'text': 'are', 'start': 18494, 'end': 18658, 'confidence': 0.94141, 'speaker': None}, {'text': 'facing', 'start': 18684, 'end': 19202, 'confidence': 0.99969, 'speaker': None}, {'text': 'a', 'start': 19286, 'end': 19786, 'confidence': 0.97704, 'speaker': None}, {'text': 'third', 'start': 19908, 'end': 20266, 'confidence': 0.95205, 'speaker': None}, {'text': 'wave,', 'start': 20328, 'end': 20654, 'confidence': 0.99737, 'speaker': None}, {'text': 'maybe', 'start': 20702, 'end': 20914, 'confidence': 0.99814, 'speaker': None}, {'text': 'even', 'start': 20952, 'end': 21154, 'confidence': 0.99965, 'speaker': None}, {'text': 'a', 'start': 21192, 'end': 21358, 'confidence': 0.98562, 'speaker': None}, {'text': 'fourth', 'start':
21384, 'end': 21674, 'confidence': 0.66432, 'speaker': None}, {'text': 'wave', 'start': 21722, 'end': 22070, 'confidence': 0.99114, 'speaker': None}, {'text': 'of', 'start': 22130, 'end': 22354, 'confidence': 0.99836,
'speaker': None}, {'text': 'this', 'start': 22392, 'end': 22702, 'confidence': 0.98492, 'speaker': None}, {'text': 'coronavirus.', 'start': 22776, 'end': 24074, 'confidence': 0.88828, 'speaker': None}, {'text': 'Just', 'start': 24242, 'end': 24610, 'confidence': 0.9945, 'speaker': None}, {'text': 'tell', 'start': 24660, 'end': 24838, 'confidence': 0.99804, 'speaker': None}, {'text': 'me', 'start': 24864, 'end': 25126, 'confidence':
0.99943, 'speaker': None}, {'text': 'why', 'start': 25188, 'end': 25486, 'confidence': 0.99981, 'speaker': None}, {'text': 'this', 'start': 25548, 'end': 25774, 'confidence': 0.99367, 'speaker': None}, {'text': 'latest', 'start': 25812, 'end': 26174, 'confidence': 0.71501, 'speaker': None}, {'text': 'one', 'start': 26222, 'end': 26434, 'confidence': 0.96477, 'speaker': None}, {'text': 'is', 'start': 26472, 'end': 26710, 'confidence': 0.99636, 'speaker': None}, {'text': 'so', 'start': 26760, 'end': 27082, 'confidence': 0.92246, 'speaker': None}, {'text': 'difficult', 'start': 27156, 'end': 27574, 'confidence': 0.99996, 'speaker': None}, {'text': 'to', 'start': 27672, 'end': 27898, 'confidence': 0.99988, 'speaker': None}, {'text': 'control.', 'start': 27924, 'end': 28500, 'confidence': 0.51343, 'speaker': None}, {'text': 'Yes,', 'start': 29190, 'end': 29626, 'confidence': 0.94699, 'speaker': None}, {'text': 'we', 'start': 29688, 'end': 29878, 'confidence': 0.9987, 'speaker': None}, {'text': 'are', 'start': 29904, 'end': 30202, 'confidence': 0.99162, 'speaker': None}, {'text': 'in', 'start': 30276, 'end': 30442, 'confidence': 0.99882, 'speaker': None}, {'text': 'the', 'start': 30456, 'end': 30598, 'confidence': 0.79534, 'speaker': None}, {'text': 'midst', 'start': 30624, 'end': 30914, 'confidence': 0.99387, 'speaker': None}, {'text': 'of', 'start': 30962, 'end': 31102, 'confidence': 0.49814, 'speaker': None}, {'text': 'our', 'start': 31116, 'end': 31294, 'confidence': 0.99701, 'speaker': None}, {'text':
'third', 'start': 31332, 'end': 31570, 'confidence': 0.97586, 'speaker': None}, {'text': 'way,', 'start': 31620, 'end': 32014, 'confidence': 0.885, 'speaker': None}, {'text': 'which', 'start': 32112, 'end': 32338, 'confidence': 0.99769, 'speaker': None}, {'text': 'is', 'start': 32364, 'end': 32914, 'confidence': 0.95308, 'speaker': None}, {'text': 'turning', 'start': 33072, 'end': 33482, 'confidence': 0.99574, 'speaker': None}, {'text': 'out', 'start': 33506, 'end': 33658, 'confidence': 0.99874, 'speaker': None}, {'text': 'to', 'start': 33684, 'end': 33838, 'confidence': 0.99925, 'speaker': None}, {'text': 'be', 'start': 33864, 'end': 34090, 'confidence': 0.99958, 'speaker': None}, {'text': 'higher', 'start': 34140, 'end': 34390, 'confidence': 0.99982, 'speaker': None}, {'text': 'than', 'start': 34440, 'end': 34618, 'confidence': 0.9997, 'speaker': None}, {'text': 'the', 'start': 34644, 'end': 34798, 'confidence': 0.99572

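The video used for this English recognition test is a recording, made on a personal PC, of the YouTube video https://www.youtube.com/watch?v=XXJoJ8lugRY. Recognition accuracy can be judged by comparing the result with YouTube's English subtitles.

Generating Video Subtitles

A VTT (WebVTT, Web Video Text Tracks) file is requested for the video transcription performed above.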

import requests
endpoint = "https://api.assemblyai.com/v2/transcript/YOUR-TRANSCRIPT-ID-HERE/vtt"
headers = {
    "authorization": "YOUR-API-TOKEN",
}
response = requests.get(endpoint, headers=headers)
print(response.text)

Running this returns the following WebVTT output.

WEBVTT

00:03.050 --> 00:06.720
Foreign Minister Kyan Wakang, welcome to the program.

00:07.290 --> 00:10.738
Well, thank you for having me back on your program, Christian. You know, when we

00:10.764 --> 00:14.758
last spoke, it was a very different world. And now today,

00:14.904 --> 00:18.458
December 20, you in South Korea

00:18.494 --> 00:21.674
are facing a third wave, maybe even a fourth

00:21.722 --> 00:25.126
wave of this coronavirus. Just tell me

00:25.188 --> 00:28.500
why this latest one is so difficult to control.

00:29.190 --> 00:32.914
Yes, we are in the midst of our third way, which is

00:33.072 --> 00:36.874
turning out to be higher than the first and lasting much longer.

00:36.972 --> 00:41.270
And it is thus because the virus has now penetrated

00:41.330 --> 00:45.370
into every corner of everyday life of people. And this

00:45.420 --> 00:49.198
happening mostly in the Metropolitan Seoul area. And you

00:49.224 --> 00:52.570
know how packed with people this particular

00:52.680 --> 00:56.566
area is in a country that is already one of the most

00:56.748 --> 01:00.540
population density wise, high density country

01:02.430 --> 01:05.614
today. In fact, we've hit the highest number so

01:05.652 --> 01:08.894
far at 1017, three new confirmed

01:08.942 --> 01:12.358
cases, including, of course, those who have recently come

01:12.384 --> 01:13.250
in from overseas.

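The output above is the resulting WebVTT subtitle track.

SRT (SubRip Subtitle) is one of the most widely used subtitle file formats. Generating an SRT file looks like this: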
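The cue times in this output line up with the millisecond start and end values in the words list: the first cue runs from the start of 'Foreign' (3050) to the end of 'program.' (6720). A small sketch of that conversion, with an illustrative function name, is:

```python
def ms_to_vtt(ms: int) -> str:
    """Format milliseconds as MM:SS.mmm, adding an hours field only when needed."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    stamp = f"{minutes:02d}:{seconds:02d}.{millis:03d}"
    return f"{hours:02d}:{stamp}" if hours else stamp


print(ms_to_vtt(3050), "-->", ms_to_vtt(6720))  # 00:03.050 --> 00:06.720
```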

import requests
endpoint = "https://api.assemblyai.com/v2/transcript/YOUR-TRANSCRIPT-ID-HERE/srt"
headers = {
    "authorization": "YOUR-API-TOKEN",
}
response = requests.get(endpoint, headers=headers)
print(response.text)

The output is as follows.

1
00:00:03,050 --> 00:00:06,720
Foreign Minister Kyan Wakang, welcome to the program.

2
00:00:07,290 --> 00:00:10,738
Well, thank you for having me back on your program, Christian. You know, when we

3
00:00:10,764 --> 00:00:14,758
last spoke, it was a very different world. And now today,

4
00:00:14,904 --> 00:00:18,458
December 20, you in South Korea

5
00:00:18,494 --> 00:00:22,354
are facing a third wave, maybe even a fourth wave of

6
00:00:22,392 --> 00:00:25,774
this coronavirus. Just tell me why this

7
00:00:25,812 --> 00:00:28,500
latest one is so difficult to control.

8
00:00:29,190 --> 00:00:32,914
Yes, we are in the midst of our third way, which is

9
00:00:33,072 --> 00:00:36,874
turning out to be higher than the first and lasting much longer.

10
00:00:36,972 --> 00:00:41,270
And it is thus because the virus has now penetrated

11
00:00:41,330 --> 00:00:44,746
into every corner of everyday life of people.

12
00:00:44,868 --> 00:00:48,158
And this happening mostly in the Metropolitan Seoul

13
00:00:48,194 --> 00:00:52,090
area. And you know how packed with people this

14
00:00:52,140 --> 00:00:55,618
particular area is in a country that is already one

15
00:00:55,644 --> 00:00:58,658
of the most population density wise,

16
00:00:58,814 --> 00:01:02,866
high density country today.

17
00:01:02,928 --> 00:01:06,840
In fact, we've hit the highest number so far at 1017,

18
00:01:06,870 --> 00:01:10,642
three new confirmed cases, including, of course,

19
00:01:10,836 --> 00:01:13,250
those who have recently come in from overseas.

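Save the output to a file, then play the video in a video player to check the quality of the subtitles (the transcription). The subtitle file must have the same base name as the video file.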
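Saving the subtitles under the video's name can be sketched as below. The helper name is hypothetical; it simply swaps the video's extension for the subtitle extension and writes the text there.

```python
from pathlib import Path


def save_subtitles(video_path, subtitle_text, extension=".srt"):
    """Write subtitle_text next to the video, reusing the video's base name."""
    target = Path(video_path).with_suffix(extension)
    target.write_text(subtitle_text, encoding="utf-8")
    return target


# Usage: save_subtitles("interview.mp4", response.text) writes interview.srt,
# where response.text is the body returned by the /srt request.
```

Passing `extension=".vtt"` does the same for the WebVTT output.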


구름사이