Assembly AI


PyTorch 2.0 vs TensorFlow usage (by number of models)

2023. 5. 8. 18:09


PyTorch 2.0 has been released. TensorFlow is the other major deep learning framework, and together the two dominate the field.

AssemblyAI reports that 92% of the models on HuggingFace (as of March 2023) are PyTorch-based.

Last year (2022) the share was around 85%, so it has grown noticeably.

By contrast, TensorFlow-based models reportedly account for only about 8% (down from about 14% last year, a drop of roughly 6 percentage points).
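
The change in these shares can be expressed either in percentage points or as a relative change, and the two are easy to mix up. Here is a minimal sketch of the arithmetic using only the figures quoted above; the helper name `share_change` is made up for this illustration.

```python
def share_change(previous_pct, current_pct):
    """Return (percentage-point change, relative change in %) between two shares."""
    point_change = current_pct - previous_pct
    relative_change = point_change / previous_pct * 100
    return point_change, relative_change


# Shares quoted in this post (2022 -> March 2023).
pytorch_points, pytorch_relative = share_change(85, 92)
tensorflow_points, tensorflow_relative = share_change(14, 8)

print(f"PyTorch: {pytorch_points:+} points ({pytorch_relative:+.1f}% relative)")
print(f"TensorFlow: {tensorflow_points:+} points ({tensorflow_relative:+.1f}% relative)")
```

The percentage-point figure is a simple difference between the two shares, while the relative figure is measured against last year's share.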

Researchers tend to prefer PyTorch.

- There are many researcher community sites.

- Model flexibility, easier debugging, shorter training times...

- A "pythonic", object-oriented approach

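Industry practitioners tend to prefer TensorFlow.

- They are already familiar with machine learning and deep learning.

- It helps in landing a job in industry.

- It is good for debugging.

- The training process can be tracked more closely.

- It is strong at visualization.

- A wide range of options...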

PyTorch

- It does not ship with a built-in web serving component, so it needs a back-end server such as Django or Flask.

- TensorFlow, by contrast, provides the TensorFlow Serving framework.

Criterion	Keras	PyTorch	TensorFlow
API Level	High	Low	High and Low
Architecture	Simple, concise, readable	Complex, less readable	Not easy to use
Datasets	Smaller datasets	Large datasets, high performance	Large datasets, high performance
Debugging	Simple network, so debugging is not often needed	Good debugging capabilities	Difficult to conduct debugging
Does It Have Trained Models?	Yes	Yes	Yes
Popularity	Most popular	Third most popular	Second most popular
Speed	Slow, low performance	Fast, high-performance	Fast, high-performance
Written In	Python	Python, C++, CUDA	C++, CUDA, Python

For more details, see AssemblyAI's article "PyTorch vs TensorFlow in 2023":

https://www.assemblyai.com/blog/pytorch-vs-tensorflow-in-2023/


Conformer Architecture for ASR

2023. 3. 22. 10:05


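Assembly AI, a US company, explained in a blog post that it achieved strong results by applying an architecture called Conformer-1 to speech recognition.

Conformer-1 is described as combining the strengths of Transformer models and convolutional models.

The Conformer is a neural network architecture for speech recognition that was introduced by Google Brain in 2020.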

The Transformer architecture inside the Conformer is already known for its parallelizability and its attention mechanism.
By adding convolutional layers to the Transformer structure, the Conformer can model both local and global features. (This is reminiscent of wavelets: wavelet analysis made it possible to analyze a signal at multiple scales, which conventional Fourier-transform frequency analysis could not do as readily.)
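
To make the local versus global distinction concrete, here is a toy sketch in plain Python. It is not the Conformer itself and not AssemblyAI's code; the functions `conv1d_same` and `self_attention` are simplified stand-ins written for this illustration, operating on a list of scalars rather than on real feature vectors.

```python
import math


def conv1d_same(x, kernel):
    """1-D convolution with zero padding: each output sees only nearby inputs."""
    half = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(x):
                acc += weight * x[j]
        out.append(acc)
    return out


def self_attention(x):
    """Scalar self-attention: each output is a softmax-weighted mix of ALL inputs."""
    out = []
    for query in x:
        scores = [query * key for key in x]
        top = max(scores)  # subtract the maximum for numerical stability
        weights = [math.exp(score - top) for score in scores]
        total = sum(weights)
        out.append(sum(w / total * value for w, value in zip(weights, x)))
    return out


kernel = [0.25, 0.5, 0.25]
base = [1.0, 0.0, 0.0, 0.0, 0.0]
changed = [1.0, 0.0, 0.0, 0.0, 5.0]  # only the LAST position differs

# Output at position 0: the convolution ignores the distant change, attention does not.
print(conv1d_same(base, kernel)[0], conv1d_same(changed, kernel)[0])
print(self_attention(base)[0], self_attention(changed)[0])
```

Changing a distant input leaves the convolution output at position 0 untouched but shifts the attention output, which is the sense in which the two layer types complement each other.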

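Assembly AI says that the Conformer architecture delivers state-of-the-art performance, and that they improved its efficiency by reducing computation and memory usage.

They explain that, trained on 650,000 hours of data, the model reached human-level performance and performed well on diverse kinds of data, noisy data in particular.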

For details, see

https://www.assemblyai.com/blog/conformer-1/

In machine translation, Transformer models and CNN-based models are competing with each other in commercial deployment.

Figure: Efficient Conformer encoder model architecture. Source: the Efficient Conformer paper.


Trying out a speech recognition API (including subtitle generation)

2023. 1. 20. 11:47


A look at the features of AssemblyAI speech recognition

The AssemblyAI API converts recorded audio files and real-time audio streams into text transcripts. The transcribed text can then be processed further for sentiment analysis, summarization, entity detection, and topic detection.

The AssemblyAI API documentation (in English) is at https://docs.assemblyai.com/.

Signing up for an API account

Click 'Create Account' at the top right, enter your email address and password, and submit.

API overview

Processing runs at roughly 10x real time; for example, a 10-minute file is processed in about 1 minute. With real-time streaming, results come back within a few hundred milliseconds.
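
As a rough planning aid, the 10x figure can be turned into a one-line estimate. This is only a sketch based on the approximate speed quoted above; the function name and the default `speedup` value are assumptions for illustration, and actual processing time will vary.

```python
def estimated_processing_seconds(audio_seconds, speedup=10.0):
    """Rough processing time for a pre-recorded file at the quoted ~10x speed."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    return audio_seconds / speedup


# A 10-minute file, expressed in seconds.
print(estimated_processing_seconds(10 * 60))
```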

A free account can process only 1 file at a time, while a paid account can process up to 12 concurrently. Real-time streaming is not available on a free account (Limit==0); a paid account can run up to 32 concurrent streams.

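The error codes you will most often run into when using the API are:

- 401: the authentication key is missing
- 400: an API request parameter is being used incorrectly
- 500: a server-side error

An error response carries a message as well as the code, so check the message for details. Examples include:

- Unsupported file format
- The file contains no audio data
- The audio is too short (200 ms or less)
- The audio file cannot be reached (for example, at a particular URL)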
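
The three status codes above can be folded into a small helper that produces a readable message. This is a sketch, not part of the AssemblyAI client: the name `describe_error` is made up, and reading the detailed message from an `"error"` field of the JSON body is an assumption that should be checked against the API documentation.

```python
# Hints for the status codes listed in this post.
ERROR_HINTS = {
    400: "bad request: an API request parameter is being used incorrectly",
    401: "unauthorized: the API key is missing",
    500: "server error: something went wrong on the server side",
}


def describe_error(status_code, body=None):
    """Combine the status-code hint with the server's message, if there is one."""
    hint = ERROR_HINTS.get(status_code, "unexpected status code")
    # Assumption: the detailed message arrives in an "error" field of the JSON body.
    detail = (body or {}).get("error")
    if detail:
        return f"{status_code} ({hint}): {detail}"
    return f"{status_code} ({hint})"


print(describe_error(401))
print(describe_error(400, {"error": "Audio file is too short"}))
```

With `requests`, this could be called as `describe_error(response.status_code, response.json())` whenever `response.ok` is false.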

The supported file formats are listed in the documentation. In practice, most common audio and video file formats are supported.

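Authentication and processing an online audio file

The code is written in Visual Studio Code.

The authorization value in headers must be replaced with your own API key.

The API key can be found on the dashboard and in the Developers drop-down panel. Copy it and paste it into the code.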

The file to be processed is one hosted on the internet.

It is an audio file of about 12 seconds of spoken English.
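
The code for this step does not appear in the text, so here is a minimal sketch of what the request could look like. It assumes the same endpoint and header format as the local-file example later in this post; the helper names `build_transcript_request` and `submit_transcript` are made up for this illustration.

```python
TRANSCRIPT_ENDPOINT = "https://api.assemblyai.com/v2/transcript"


def build_transcript_request(audio_url, api_key):
    """Return (endpoint, headers, payload) for a transcription request."""
    headers = {
        "authorization": api_key,  # replace with your own API key
        "content-type": "application/json",
    }
    payload = {"audio_url": audio_url}
    return TRANSCRIPT_ENDPOINT, headers, payload


def submit_transcript(audio_url, api_key):
    """Send the request and return the JSON response (status starts as 'queued')."""
    import requests  # third-party; imported here so the helper above needs nothing extra

    endpoint, headers, payload = build_transcript_request(audio_url, api_key)
    response = requests.post(endpoint, json=payload, headers=headers, timeout=30)
    response.raise_for_status()
    return response.json()
```

Calling `submit_transcript("https://bit.ly/3yxKEIY", "YOUR-API-TOKEN")` with a valid key would submit the audio URL that appears in the output of this section.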

Running it gives:

{'id': 'oe8djuoctf-3874-4228-82f6-0416162f48fd', 'language_model': 'assemblyai_default', 'acoustic_model': 'assemblyai_default', 'language_code': 'en_us', 'status': 'queued', 'audio_url': 'https://bit.ly/3yxKEIY', 'text': None, 'words': None, 'utterances': None, 'confidence': None, 'audio_duration': None, 'punctuate': True, 'format_text': True, 'dual_channel': None, 'webhook_url': None, 'webhook_status_code': None, 'speed_boost': False, 'auto_highlights_result': None, 'auto_highlights': False, 'audio_start_from': None, 'audio_end_at': None, 'word_boost': [], 'boost_param': None, 'filter_profanity': False, 'redact_pii': False, 'redact_pii_audio':
False, 'redact_pii_audio_quality': None, 'redact_pii_policies': None, 'redact_pii_sub': None, 'speaker_labels': False, 'content_safety': False, 'iab_categories': False, 'content_safety_labels': {}, 'iab_categories_result': {}, 'disfluencies': False, 'sentiment_analysis': False, 'sentiment_analysis_results': None, 'auto_chapters': False, 'chapters': None, 'entity_detection': False, 'entities': None}

As the output above shows, the transcription result is not returned right away. It has to be retrieved afterwards using the id value (oe8djuoctf-3874-4228-82f6-0416162f48fd).

{'id': 'oe8djuoctf-3874-4228-82f6-0416162f48fd', 'language_model': 'assemblyai_default', 'acoustic_model': 'assemblyai_default', 'language_code': 'en_us', 'status': 'completed', 'audio_url': 'https://bit.ly/3yxKEIY', , 'words': [{'text': 'You', 'start': 430, 'end': 522, 'confidence': 0.99876, 'speaker': None}, {'text': 'know,', 'start': 536, 'end': 858, 'confidence': 0.9964, 'speaker': None}, {'text': 'demons', 'start': 944, 'end': 1402, 'confidence': 0.99522, 'speaker': None}, {'text': 'on', 'start': 1426, 'end': 1578, 'confidence': 0.78258, 'speaker': None}, {'text': 'TV', 'start': 1604, 'end': 1894, 'confidence': 0.97286, 'speaker': None}, {'text': 'like', 'start': 1942, 'end': 2118, 'confidence': 0.99993, 'speaker': None}, {'text': 'that.', 'start': 2144, 'end': 2334, 'confidence': 0.99808, 'speaker': None}, {'text': 'And', 'start': 2372, 'end': 2754, 'confidence': 0.70275, 'speaker': None}, {'text': 'and', 'start': 2852, 'end': 3114, 'confidence': 0.84709, 'speaker': None}, {'text': 'for', 'start': 3152, 'end': 3354, 'confidence': 0.80745, 'speaker': None}, {'text': 'people', 'start': 3392, 'end': 3630, 'confidence': 0.9997, 'speaker': None}, {'text': 'to', 'start': 3680, 'end': 4038, 'confidence': 0.9935, 'speaker': None}, {'text': 'expose', 'start': 4124, 'end': 4558, 'confidence': 0.66411, 'speaker': None}, {'text': 'themselves', 'start': 4594, 'end': 4974, 'confidence': 0.98493, 'speaker': None}, {'text': 'to', 'start': 5072, 'end': 5298, 'confidence': 0.99923, 'speaker': None}, {'text': 'being', 'start': 5324, 'end': 5514, 'confidence': 0.99932, 'speaker': None}, {'text': 'rejected', 'start': 5552, 'end': 6058, 'confidence': 0.99986, 'speaker': None}, {'text': 'on', 'start': 6094, 'end': 6258, 'confidence': 0.9996, 'speaker': None}, {'text': 'TV.', 'start': 6284, 'end': 6550, 'confidence': 0.76502, 'speaker': None}, {'text': 'Or,', 'start': 6610, 'end': 7050, 'confidence': 0.51107, 'speaker': None}, {'text': 'you', 'start': 7160, 'end': 7398, 
'confidence': 0.55854, 'speaker': None}, {'text': 'know,', 'start': 7424, 'end': 7650, 'confidence': 0.99902, 'speaker': None}, {'text': 'Humil,', 'start': 7700, 'end': 8170, 'confidence': 0.61465, 'speaker': None}, {'text': 'humiliated', 'start': 8230, 'end': 8962, 'confidence': 0.50987, 'speaker': None}, {'text': 'by', 'start': 9046, 'end': 9258, 'confidence': 0.99928, 'speaker': None}, {'text': 'Fear', 'start': 9284, 'end': 9598, 'confidence': 0.63855,
'speaker': None}, {'text': 'Factor', 'start': 9634, 'end': 10090, 'confidence': 0.67138, 'speaker': None}, {'text': 'or,', 'start': 10150, 'end': 10760, 'confidence': 0.99936, 'speaker': None}, {'text': 'you', 'start': 11210, 'end': 11502, 'confidence': 0.70603, 'speaker': None}, {'text': 'know.', 'start': 11516, 'end': 11580, 'confidence': 0.62727, 'speaker': None}], 'utterances': None, 'confidence': 0.844713666666667, 'audio_duration': 12, 'punctuate': True, 'format_text': True, 'dual_channel': None, 'webhook_url': None, 'webhook_status_code': None, 'speed_boost': False, 'auto_highlights_result': None, 'auto_highlights': False, 'audio_start_from': None, 'audio_end_at': None, 'word_boost': [], 'boost_param': None, 'filter_profanity': False, 'redact_pii': False, 'redact_pii_audio': False, 'redact_pii_audio_quality': None, 'redact_pii_policies': None, 'redact_pii_sub': None, 'speaker_labels': False, 'content_safety': False, 'iab_categories': False, 'content_safety_labels': {'status': 'unavailable', 'results': [], 'summary': {}}, 'iab_categories_result': {'status': 'unavailable', 'results': [], 'summary': {}}, 'disfluencies': False, 'sentiment_analysis': False, 'auto_chapters': False, 'chapters': None, 'sentiment_analysis_results': None, 'entity_detection': False, 'entities': None}

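The API response contains a variety of information about the processed audio.

acoustic_model : the acoustic model used
audio_duration : length of the processed audio file (about 12 seconds)
audio_url : URL of the file being processed
confidence : recognition confidence over the whole audio
dual_channel : whether the audio is dual-channel (stereo) or single-channel
id : ID of the request
status : status of the request with this id; for long files, check whether it has become 'completed'
text : the transcript, i.e. the recognized text
words : the start time, end time (in milliseconds), and confidence of each word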
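
One practical use of the words field is to flag the words the recognizer was least sure about. The sketch below does this on three word entries copied from the response above; the function name `low_confidence_words` and the 0.6 threshold are arbitrary choices for illustration.

```python
def low_confidence_words(transcript, threshold=0.5):
    """Return (text, start_ms, end_ms, confidence) for words below the threshold."""
    words = transcript.get("words") or []  # 'words' is None until status is 'completed'
    return [
        (word["text"], word["start"], word["end"], word["confidence"])
        for word in words
        if word["confidence"] < threshold
    ]


# Three entries copied from the response shown above.
sample = {
    "status": "completed",
    "words": [
        {"text": "Or,", "start": 6610, "end": 7050, "confidence": 0.51107, "speaker": None},
        {"text": "you", "start": 7160, "end": 7398, "confidence": 0.55854, "speaker": None},
        {"text": "know,", "start": 7424, "end": 7650, "confidence": 0.99902, "speaker": None},
    ],
}

for text, start, end, confidence in low_confidence_words(sample, threshold=0.6):
    print(f"{text!r} at {start}-{end} ms, confidence {confidence}")
```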

Processing an audio file on my PC

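import requests

# Path to the local audio file to upload
filename = "d:\\AssemblyAI\\myfile_44kHz_mono.wav"

def read_file(filename, chunk_size=5242880):
    # Yield the file in chunks (5 MB by default) so it is streamed instead of loaded at once
    with open(filename, 'rb') as _file:
        while True:
            data = _file.read(chunk_size)
            if not data:
                break
            yield data

# Replace with your own API key
headers = {'authorization': "YOUR-API-TOKEN"}
response = requests.post('https://api.assemblyai.com/v2/upload',
                        headers=headers,
                        data=read_file(filename))

# The JSON response contains the upload_url of the uploaded file
print(response.json())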

Set filename to the audio file on your local PC, enter your API key, and run the program.

It returns a URL like the one below.

{'upload_url': 'https://cdn.assemblyai.com/upload/b0a420fb-58d2-4005-8b6f-9d53f41b2b89'}

This value (b0a420fb-58d2-4005-8b6f-9d53f41b2b89) is used in the step that requests the transcript.

import requests

# Endpoint that creates a new transcription job
endpoint = "https://api.assemblyai.com/v2/transcript"
# audio_url is the upload_url returned by the upload step
json = { "audio_url": "https://cdn.assemblyai.com/upload/b0a420fb-58d2-4005-8b6f-9d53f41b2b89" }
headers = {
    "authorization": "YOUR-API-TOKEN",  # replace with your own API key
    "content-type": "application/json"
}
response = requests.post(endpoint, json=json, headers=headers)
print(response.status_code)
print(response.json())

The output looks like this.

{'id': 'oehlp3curi-3f13-4dc5-ba17-686e48615313', 'language_model': 'assemblyai_default', 'acoustic_model': 'assemblyai_default', 'language_code': 'en_us', 'status': 'queued', 'audio_url': 'https://cdn.assemblyai.com/upload/b0a420fb-58d2-4005-8b6f-9d53f41b2b89', 'text': None, 'words': None, 'utterances': None, 'confidence': None, 'audio_duration': None, 'punctuate': True, 'format_text': True, 'dual_channel': None, 'webhook_url': None, 'webhook_status_code': None, 'speed_boost': False, 'auto_highlights_result': None, 'auto_highlights': False, 'audio_start_from': None, 'audio_end_at': None, 'word_boost': [], 'boost_param': None, 'filter_profanity':
False, 'redact_pii': False, 'redact_pii_audio': False, 'redact_pii_audio_quality': None, 'redact_pii_policies': None, 'redact_pii_sub': None, 'speaker_labels': False, 'content_safety': False, 'iab_categories': False, 'content_safety_labels': {}, 'iab_categories_result': {}, 'disfluencies': False, 'sentiment_analysis': False, 'sentiment_analysis_results': None, 'auto_chapters': False, 'chapters': None, 'entity_detection': False, 'entities': None}

Once status becomes completed, the transcript appears in text, but at this point status is still queued. The status therefore has to be checked periodically (polling).

Instead of polling, you can also use a webhook to be notified when the transcript is completed.
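
The polling approach can be sketched as a small loop. This is an illustration under assumptions, not AssemblyAI's client code: `wait_for_transcript` is a made-up name, `fetch` is any function that returns the transcript JSON (for example the GET request shown later in this post), and treating `"error"` as the failure status is an assumption, since only `queued` and `completed` appear in the outputs here.

```python
import time


def wait_for_transcript(fetch, interval_seconds=3.0, max_attempts=100, sleep=time.sleep):
    """Call fetch() repeatedly until the transcript status is 'completed'."""
    for _ in range(max_attempts):
        result = fetch()
        status = result.get("status")
        if status == "completed":
            return result
        if status == "error":  # assumption: the status name used for failed jobs
            raise RuntimeError(result.get("error", "transcription failed"))
        # Any other status (such as 'queued') means the job is not finished yet.
        sleep(interval_seconds)
    raise TimeoutError("transcript was not completed in time")
```

With `requests`, `fetch` could be `lambda: requests.get(endpoint, headers=headers).json()`. Passing `sleep` as a parameter keeps the loop easy to test without real waiting.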

Now the recognition result can be retrieved using the id value (oehlp3curi-3f13-4dc5-ba17-686e48615313).

import requests

# Append the transcript id to the endpoint to fetch that job's status and result
endpoint = "https://api.assemblyai.com/v2/transcript/oehlp3curi-3f13-4dc5-ba17-686e48615313"
headers = {
    "authorization": "YOUR-API-TOKEN",
}
response = requests.get(endpoint, headers=headers)
print(response.json())

The result is shown below.

{'id': 'oehlp3curi-3f13-4dc5-ba17-686e48615313', 'language_model': 'assemblyai_default', 'acoustic_model': 'assemblyai_default', 'language_code': 'en_us', 'status': 'completed', 'audio_url': 'https://cdn.assemblyai.com/upload/b0a420fb-58d2-4005-8b6f-9d53f41b2b89', 'text': 'Good morning. May I help you? Itchesani.', 'words': [{'text': 'Good', 'start': 910, 'end': 1110, 'confidence': 0.55991, 'speaker': None}, {'text': 'morning.', 'start': 1160, 'end': 1734, 'confidence': 0.52938, 'speaker': None}, {'text': 'May', 'start': 1892, 'end': 2214, 'confidence': 0.63963, 'speaker': None}, {'text': 'I', 'start': 2252, 'end': 2454, 'confidence': 0.99273, 'speaker': None}, {'text': 'help', 'start': 2492, 'end': 2694, 'confidence': 0.91993, 'speaker': None}, {'text': 'you?', 'start': 2732, 'end': 3258, 'confidence': 0.97432, 'speaker': None}, {'text': 'Itchesani.', 'start': 3404, 'end': 4150, 'confidence': 0.1758, 'speaker': None}], 'utterances': None, 'confidence': 0.684528571428571, 'audio_duration': 5, 'punctuate': True, 'format_text': True, 'dual_channel': None, 'webhook_url': None, 'webhook_status_code': None, 'speed_boost': False, 'auto_highlights_result': None, 'auto_highlights': False, 'audio_start_from': None, 'audio_end_at': None, 'word_boost': [], 'boost_param': None, 'filter_profanity': False, 'redact_pii': False, 'redact_pii_audio': False, 'redact_pii_audio_quality': None, 'redact_pii_policies': None, 'redact_pii_sub': None, 'speaker_labels': False, 'content_safety': False, 'iab_categories': False, 'content_safety_labels': {'status': 'unavailable', 'results': [], 'summary': {}}, 'iab_categories_result': {'status': 'unavailable', 'results': [], 'summary': {}}, 'disfluencies': False, 'sentiment_analysis': False, 'auto_chapters': False, 'chapters': None, 'sentiment_analysis_results': None, 'entity_detection': False, 'entities': None}

The status is 'completed', and the text field contains the recognition result 'Good morning. May I help you? Itchesani.' What the author actually said was 'Good morning. May I help you? It's sunny.'

Naver's Clova Note, which provides a web-based user interface, recognized it as 'could mary may i help you it's a sunny'.
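
A common way to put a number on such a comparison is the word error rate (WER): the word-level edit distance divided by the number of words in the reference. The post itself does not compute WER; the sketch below is one straightforward implementation, and the tokenization rule (lowercase, punctuation dropped, apostrophes kept) is an assumption that affects the result.

```python
import re


def tokenize(text):
    """Lowercase the text and keep only word-like tokens (apostrophes included)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the ref prefix so far and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion of a reference word
                current[j - 1] + 1,      # insertion of a hypothesis word
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)


spoken = "Good morning. May I help you? It's sunny."
print(word_error_rate(spoken, "Good morning. May I help you? Itchesani."))
print(word_error_rate(spoken, "could mary may i help you it's a sunny"))
```

A single 5-second utterance is far too little data to rank the two services; the numbers only make the difference between the two outputs explicit.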

Next, upload a video file from the local PC. The file is an mp4 with a running time of about 1 minute.

Running UploadingLocalFilesForTranscription.py returns the value below.

{'upload_url': 'https://cdn.assemblyai.com/upload/de846263-e374-454a-a42e-c0c4a2401e2b'}

Running SubmitUploadForTranscription.py returns the following.

{'id': 'oowurcgelb-923b-456b-a5bf-a2d4be721940', 'language_model': 'assemblyai_default', 'acoustic_model': 'assemblyai_default', 'language_code': 'en_us', 'status': 'queued', 'audio_url': 'https://cdn.assemblyai.com/upload/de846263-e374-454a-a42e-c0c4a2401e2b', 'text': None, 'words': None, 'utterances': None, 'confidence': None, 'audio_duration': None, 'punctuate': True, 'format_text': True, 'dual_channel': None, 'webhook_url': None, 'webhook_status_code': None, 'speed_boost': False, 'auto_highlights_result': None, 'auto_highlights': False, 'audio_start_from': None, 'audio_end_at': None, 'word_boost': [], 'boost_param': None, 'filter_profanity':
False, 'redact_pii': False, 'redact_pii_audio': False, 'redact_pii_audio_quality': None, 'redact_pii_policies': None, 'redact_pii_sub': None, 'speaker_labels': False, 'content_safety': False, 'iab_categories': False, 'content_safety_labels': {}, 'iab_categories_result': {}, 'disfluencies': False, 'sentiment_analysis': False, 'sentiment_analysis_results': None, 'auto_chapters': False, 'chapters': None, 'entity_detection': False, 'entities': None}

{'id': 'oowurcgelb-923b-456b-a5bf-a2d4be721940', 'language_model': 'assemblyai_default', 'acoustic_model': 'assemblyai_default', 'language_code': 'en_us', 'status': 'completed', 'audio_url': 'https://cdn.assemblyai.com/upload/de846263-e374-454a-a42e-c0c4a2401e2b', 'text': "Foreign Minister Kyan Wakang, welcome to the program. Well, thank you for having me back on your program, Christian. You know, when we last spoke, it was a very different world. And now today, December 20, you in South Korea are facing a third wave, maybe even a fourth wave of this coronavirus. Just tell me why this latest one is so difficult to control. Yes, we are in the midst of our third way, which is turning out to be higher than the first and lasting much longer. And it is thus because the virus has now penetrated into every corner of everyday life of people. And this happening mostly
in the Metropolitan Seoul area. And you know how packed with people this particular area is in a country that is already one of the most population density wise, high density country today. In fact, we've hit the highest number so far at 1017, three new confirmed cases, including, of course, those who have recently come in from overseas.", 'words': [{'text': 'Foreign', 'start': 3050, 'end': 3278, 'confidence': 0.9741, 'speaker': None}, {'text': 'Minister', 'start': 3314, 'end': 3686, 'confidence': 0.99944, 'speaker': None}, {'text': 'Kyan', 'start': 3758, 'end': 4154, 'confidence': 0.13066, 'speaker': None}, {'text': 'Wakang,', 'start': 4202, 'end': 4874, 'confidence': 0.10904, 'speaker': None}, {'text': 'welcome', 'start': 4982, 'end': 5434, 'confidence': 0.99937, 'speaker': None}, {'text': 'to', 'start': 5532, 'end': 5830, 'confidence': 0.99932, 'speaker': None}, {'text': 'the', 'start': 5880, 'end': 6094, 'confidence': 0.9995, 'speaker': None}, {'text': 'program.', 'start': 6132, 'end': 6720, 'confidence': 0.99991, 'speaker': None}, {'text': 'Well,', 'start': 7290, 'end': 7654, 'confidence': 0.85043, 'speaker': None}, {'text': 'thank', 'start': 7692, 'end': 7858, 'confidence': 0.9993, 'speaker': None}, {'text': 'you', 'start': 7884, 'end': 8038, 'confidence': 0.99861, 'speaker': None}, {'text': 'for', 'start': 8064, 'end': 8182, 'confidence': 0.99974, 'speaker': None}, {'text': 'having', 'start': 8196, 'end': 8374, 'confidence': 0.99797, 'speaker': None}, {'text': 'me', 'start': 8412, 'end': 8578,
'confidence': 0.99978, 'speaker': None}, {'text': 'back', 'start': 8604, 'end': 8758, 'confidence': 0.99943, 'speaker': None}, {'text': 'on', 'start': 8784, 'end': 8902, 'confidence': 0.99888, 'speaker': None}, {'text': 'your', 'start': 8916, 'end': 9058, 'confidence': 0.99841, 'speaker': None}, {'text': 'program,', 'start': 9084, 'end': 9346, 'confidence': 0.9997, 'speaker': None}, {'text': 'Christian.', 'start': 9408, 'end': 9938, 'confidence': 0.53526, 'speaker': None}, {'text': 'You', 'start': 10034, 'end': 10222, 'confidence': 0.65565, 'speaker': None}, {'text': 'know,', 'start': 10236, 'end': 10414, 'confidence': 0.99709, 'speaker': None},
{'text': 'when', 'start': 10452, 'end': 10582, 'confidence': 0.99958, 'speaker': None}, {'text': 'we', 'start': 10596, 'end': 10738, 'confidence': 0.99828, 'speaker': None}, {'text': 'last', 'start': 10764, 'end': 10954, 'confidence': 0.98151, 'speaker': None}, {'text': 'spoke,', 'start': 10992, 'end': 11258, 'confidence': 0.51839, 'speaker': None}, {'text': 'it', 'start': 11294, 'end': 11422, 'confidence': 0.99569, 'speaker': None}, {'text': 'was', 'start': 11436, 'end': 11578, 'confidence': 0.99951, 'speaker': None}, {'text': 'a', 'start': 11604, 'end': 11722, 'confidence': 0.99016, 'speaker': None}, {'text': 'very', 'start': 11736, 'end': 12058, 'confidence': 0.9872, 'speaker': None}, {'text': 'different', 'start': 12144, 'end': 12502, 'confidence': 0.99908, 'speaker': None}, {'text': 'world.', 'start': 12576, 'end': 13030, 'confidence': 0.9981, 'speaker':
None}, {'text': 'And', 'start': 13140, 'end': 13558, 'confidence': 0.92822, 'speaker': None}, {'text': 'now', 'start': 13644, 'end': 14074, 'confidence': 0.99358, 'speaker': None}, {'text': 'today,', 'start': 14172, 'end': 14758, 'confidence': 0.99458, 'speaker': None}, {'text': 'December', 'start': 14904, 'end': 15518, 'confidence': 0.97343, 'speaker': None}, {'text': '20,', 'start': 15614, 'end': 16500, 'confidence': 0.6, 'speaker': None}, {'text': 'you', 'start': 17250, 'end': 17614, 'confidence': 0.91964, 'speaker': None}, {'text': 'in', 'start': 17652, 'end': 17818, 'confidence': 0.6083, 'speaker': None}, {'text': 'South', 'start': 17844, 'end': 18038, 'confidence': 0.54402, 'speaker': None}, {'text': 'Korea', 'start': 18074, 'end': 18458, 'confidence': 0.60191, 'speaker': None}, {'text': 'are', 'start': 18494, 'end': 18658, 'confidence': 0.94141, 'speaker': None}, {'text': 'facing', 'start': 18684, 'end': 19202, 'confidence': 0.99969, 'speaker': None}, {'text': 'a', 'start': 19286, 'end': 19786, 'confidence': 0.97704, 'speaker': None}, {'text': 'third', 'start': 19908, 'end': 20266, 'confidence': 0.95205, 'speaker': None}, {'text': 'wave,', 'start': 20328, 'end': 20654, 'confidence': 0.99737, 'speaker': None}, {'text': 'maybe', 'start': 20702, 'end': 20914, 'confidence': 0.99814, 'speaker': None}, {'text': 'even', 'start': 20952, 'end': 21154, 'confidence': 0.99965, 'speaker': None}, {'text': 'a', 'start': 21192, 'end': 21358, 'confidence': 0.98562, 'speaker': None}, {'text': 'fourth', 'start':
21384, 'end': 21674, 'confidence': 0.66432, 'speaker': None}, {'text': 'wave', 'start': 21722, 'end': 22070, 'confidence': 0.99114, 'speaker': None}, {'text': 'of', 'start': 22130, 'end': 22354, 'confidence': 0.99836,
'speaker': None}, {'text': 'this', 'start': 22392, 'end': 22702, 'confidence': 0.98492, 'speaker': None}, {'text': 'coronavirus.', 'start': 22776, 'end': 24074, 'confidence': 0.88828, 'speaker': None}, {'text': 'Just', 'start': 24242, 'end': 24610, 'confidence': 0.9945, 'speaker': None}, {'text': 'tell', 'start': 24660, 'end': 24838, 'confidence': 0.99804, 'speaker': None}, {'text': 'me', 'start': 24864, 'end': 25126, 'confidence':
0.99943, 'speaker': None}, {'text': 'why', 'start': 25188, 'end': 25486, 'confidence': 0.99981, 'speaker': None}, {'text': 'this', 'start': 25548, 'end': 25774, 'confidence': 0.99367, 'speaker': None}, {'text': 'latest', 'start': 25812, 'end': 26174, 'confidence': 0.71501, 'speaker': None}, {'text': 'one', 'start': 26222, 'end': 26434, 'confidence': 0.96477, 'speaker': None}, {'text': 'is', 'start': 26472, 'end': 26710, 'confidence': 0.99636, 'speaker': None}, {'text': 'so', 'start': 26760, 'end': 27082, 'confidence': 0.92246, 'speaker': None}, {'text': 'difficult', 'start': 27156, 'end': 27574, 'confidence': 0.99996, 'speaker': None}, {'text': 'to', 'start': 27672, 'end': 27898, 'confidence': 0.99988, 'speaker': None}, {'text': 'control.', 'start': 27924, 'end': 28500, 'confidence': 0.51343, 'speaker': None}, {'text': 'Yes,', 'start': 29190, 'end': 29626, 'confidence': 0.94699, 'speaker': None}, {'text': 'we', 'start': 29688, 'end': 29878, 'confidence': 0.9987, 'speaker': None}, {'text': 'are', 'start': 29904, 'end': 30202, 'confidence': 0.99162, 'speaker': None}, {'text': 'in', 'start': 30276, 'end': 30442, 'confidence': 0.99882, 'speaker': None}, {'text': 'the', 'start': 30456, 'end': 30598, 'confidence': 0.79534, 'speaker': None}, {'text': 'midst', 'start': 30624, 'end': 30914, 'confidence': 0.99387, 'speaker': None}, {'text': 'of', 'start': 30962, 'end': 31102, 'confidence': 0.49814, 'speaker': None}, {'text': 'our', 'start': 31116, 'end': 31294, 'confidence': 0.99701, 'speaker': None}, {'text':
'third', 'start': 31332, 'end': 31570, 'confidence': 0.97586, 'speaker': None}, {'text': 'way,', 'start': 31620, 'end': 32014, 'confidence': 0.885, 'speaker': None}, {'text': 'which', 'start': 32112, 'end': 32338, 'confidence': 0.99769, 'speaker': None}, {'text': 'is', 'start': 32364, 'end': 32914, 'confidence': 0.95308, 'speaker': None}, {'text': 'turning', 'start': 33072, 'end': 33482, 'confidence': 0.99574, 'speaker': None}, {'text': 'out', 'start': 33506, 'end': 33658, 'confidence': 0.99874, 'speaker': None}, {'text': 'to', 'start': 33684, 'end': 33838, 'confidence': 0.99925, 'speaker': None}, {'text': 'be', 'start': 33864, 'end': 34090, 'confidence': 0.99958, 'speaker': None}, {'text': 'higher', 'start': 34140, 'end': 34390, 'confidence': 0.99982, 'speaker': None}, {'text': 'than', 'start': 34440, 'end': 34618, 'confidence': 0.9997, 'speaker': None}, {'text': 'the', 'start': 34644, 'end': 34798, 'confidence': 0.99572

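The video used for the test is a recording, made on a personal PC, of the YouTube video https://www.youtube.com/watch?v=XXJoJ8lugRY, chosen for English recognition. The recognition quality can be judged by comparing the result with YouTube's English subtitles.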

Generating video subtitles

Next, request VTT (Video Text Tracks) output for the video transcript produced above.

import requests
endpoint = "https://api.assemblyai.com/v2/transcript/YOUR-TRANSCRIPT-ID-HERE/vtt"
headers = {
    "authorization": "YOUR-API-TOKEN",
}
response = requests.get(endpoint, headers=headers)
print(response.text)

Running this produces:

WEBVTT

00:03.050 --> 00:06.720
Foreign Minister Kyan Wakang, welcome to the program.

00:07.290 --> 00:10.738
Well, thank you for having me back on your program, Christian. You know, when we

00:10.764 --> 00:14.758
last spoke, it was a very different world. And now today,

00:14.904 --> 00:18.458
December 20, you in South Korea

00:18.494 --> 00:21.674
are facing a third wave, maybe even a fourth

00:21.722 --> 00:25.126
wave of this coronavirus. Just tell me

00:25.188 --> 00:28.500
why this latest one is so difficult to control.

00:29.190 --> 00:32.914
Yes, we are in the midst of our third way, which is

00:33.072 --> 00:36.874
turning out to be higher than the first and lasting much longer.

00:36.972 --> 00:41.270
And it is thus because the virus has now penetrated

00:41.330 --> 00:45.370
into every corner of everyday life of people. And this

00:45.420 --> 00:49.198
happening mostly in the Metropolitan Seoul area. And you

00:49.224 --> 00:52.570
know how packed with people this particular

00:52.680 --> 00:56.566
area is in a country that is already one of the most

00:56.748 --> 01:00.540
population density wise, high density country

01:02.430 --> 01:05.614
today. In fact, we've hit the highest number so

01:05.652 --> 01:08.894
far at 1017, three new confirmed

01:08.942 --> 01:12.358
cases, including, of course, those who have recently come

01:12.384 --> 01:13.250
in from overseas.

This is the subtitle text in WebVTT format.

SRT (SubRip subtitle) is a widely used subtitle file format. To generate one:

import requests
endpoint = "https://api.assemblyai.com/v2/transcript/YOUR-TRANSCRIPT-ID-HERE/srt"
headers = {
    "authorization": "YOUR-API-TOKEN",
}
response = requests.get(endpoint, headers=headers)
print(response.text)

The output is as follows.

1
00:00:03,050 --> 00:00:06,720
Foreign Minister Kyan Wakang, welcome to the program.

2
00:00:07,290 --> 00:00:10,738
Well, thank you for having me back on your program, Christian. You know, when we

3
00:00:10,764 --> 00:00:14,758
last spoke, it was a very different world. And now today,

4
00:00:14,904 --> 00:00:18,458
December 20, you in South Korea

5
00:00:18,494 --> 00:00:22,354
are facing a third wave, maybe even a fourth wave of

6
00:00:22,392 --> 00:00:25,774
this coronavirus. Just tell me why this

7
00:00:25,812 --> 00:00:28,500
latest one is so difficult to control.

8
00:00:29,190 --> 00:00:32,914
Yes, we are in the midst of our third way, which is

9
00:00:33,072 --> 00:00:36,874
turning out to be higher than the first and lasting much longer.

10
00:00:36,972 --> 00:00:41,270
And it is thus because the virus has now penetrated

11
00:00:41,330 --> 00:00:44,746
into every corner of everyday life of people.

12
00:00:44,868 --> 00:00:48,158
And this happening mostly in the Metropolitan Seoul

13
00:00:48,194 --> 00:00:52,090
area. And you know how packed with people this

14
00:00:52,140 --> 00:00:55,618
particular area is in a country that is already one

15
00:00:55,644 --> 00:00:58,658
of the most population density wise,

16
00:00:58,814 --> 00:01:02,866
high density country today.

17
00:01:02,928 --> 00:01:06,840
In fact, we've hit the highest number so far at 1017,

18
00:01:06,870 --> 00:01:10,642
three new confirmed cases, including, of course,

19
00:01:10,836 --> 00:01:13,250
those who have recently come in from overseas.
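
The cue times in both outputs line up with the millisecond start and end values in the words list (for example, 3050 ms appears as 00:03.050 in VTT and 00:00:03,050 in SRT). A minimal sketch of that conversion is below; the helper names are made up, and the API generates these files for you, so this is only useful when building cues by hand from the words data.

```python
def _split_ms(total_ms):
    """Split milliseconds into (hours, minutes, seconds, milliseconds)."""
    seconds, millis = divmod(total_ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return hours, minutes, seconds, millis


def format_vtt(total_ms):
    """WebVTT timestamp; the hours part is only written when it is non-zero."""
    hours, minutes, seconds, millis = _split_ms(total_ms)
    if hours:
        return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"
    return f"{minutes:02d}:{seconds:02d}.{millis:03d}"


def format_srt(total_ms):
    """SRT timestamp: always HH:MM:SS with a comma before the milliseconds."""
    hours, minutes, seconds, millis = _split_ms(total_ms)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


# Start of the first word and end of the last cue in the outputs above.
print(format_vtt(3050), format_srt(3050))
print(format_vtt(73250), format_srt(73250))
```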

Save the output to a file, then play the video in a video player to check the quality of the subtitles (the transcription). The subtitle file must have the same name as the video file.
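
Saving the subtitle text next to the video under the same base name can be sketched as follows. The helper names `subtitle_path_for` and `save_subtitles` are made up for this illustration; `subtitle_text` would be the `response.text` from the VTT or SRT request.

```python
from pathlib import Path


def subtitle_path_for(video_path, extension=".srt"):
    """Same folder and base name as the video, with the subtitle extension."""
    return Path(video_path).with_suffix(extension)


def save_subtitles(video_path, subtitle_text, extension=".srt"):
    """Write the subtitle text next to the video and return the path written."""
    target = subtitle_path_for(video_path, extension)
    target.write_text(subtitle_text, encoding="utf-8")
    return target


print(subtitle_path_for("interview.mp4"))
print(subtitle_path_for("interview.mp4", ".vtt"))
```

Many video players load a subtitle file automatically when it sits beside the video and shares its base name, which is why the name has to match.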
