In this code pattern, learn how to extract speaker-diarized notes and key insights reports from any video using IBM® Watson™ Speech to Text, Watson Natural Language Understanding, and Watson Tone Analyzer.


In an increasingly connected world, staying focused on work or education is extremely important. Studies suggest that many people lose focus in live virtual meetings or virtual classroom sessions after approximately 20 minutes. For that reason, many meetings and virtual classes are recorded so that a person can watch them later.

It would help if these recordings could be analyzed and a detailed report of the meeting or class generated by using artificial intelligence (AI). This code pattern shows how to do that. Given a video recording of the virtual meeting or virtual classroom, it explains how to extract audio from the video file using the FFmpeg open source library, transcribe the audio to get speaker-diarized notes with custom-trained language and acoustic Speech to Text models, and generate a natural language understanding report that consists of the category, concepts, emotion, entities, keywords, sentiment, top positive sentences, and word clouds using a Python Flask runtime.
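The audio extraction step can be sketched in Python by shelling out to FFmpeg. This is a minimal illustration, not the pattern's actual code; the helper names (`build_ffmpeg_command`, `extract_audio`) are hypothetical, but the FFmpeg flags shown are standard:

```python
import subprocess

def build_ffmpeg_command(video_path: str, audio_path: str) -> list:
    """Build an ffmpeg command that extracts mono 16 kHz WAV audio,
    a format that Watson Speech to Text accepts."""
    return [
        "ffmpeg", "-i", video_path,
        "-vn",                   # drop the video stream
        "-acodec", "pcm_s16le",  # 16-bit PCM WAV
        "-ar", "16000",          # 16 kHz sample rate
        "-ac", "1",              # mono
        audio_path,
    ]

def extract_audio(video_path: str, audio_path: str) -> None:
    """Run FFmpeg; raises CalledProcessError if extraction fails."""
    subprocess.run(build_ffmpeg_command(video_path, audio_path), check=True)
```

Separating command construction from execution keeps the FFmpeg invocation easy to inspect and test without the binary installed.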

After completing the code pattern, you'll understand how to:

  • Use the Watson Speech to Text service to convert the human voice into the written word
  • Use advanced natural language processing to analyze text and extract metadata from content such as concepts, entities, keywords, categories, sentiment, and emotion
  • Leverage Watson Tone Analyzer cognitive linguistic analysis to determine a range of tones at both the sentence and document level
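When Watson Speech to Text is called with `speaker_labels=True` and word timestamps enabled, the JSON response carries per-chunk `timestamps` and a top-level `speaker_labels` array. The sketch below shows one simplified way to merge them into speaker-diarized notes; the code pattern's own implementation may differ:

```python
def diarize(stt_response: dict) -> list:
    """Merge Watson Speech to Text word timestamps with speaker_labels
    into (speaker, utterance) pairs."""
    # Flatten all word timestamps across result chunks:
    # each entry is [word, start_time, end_time].
    words = []
    for result in stt_response.get("results", []):
        words.extend(result["alternatives"][0].get("timestamps", []))
    # Map each word's start time to its speaker label.
    speaker_at = {lbl["from"]: lbl["speaker"]
                  for lbl in stt_response.get("speaker_labels", [])}
    notes, current, buffer = [], None, []
    for word, start, _end in words:
        speaker = speaker_at.get(start, current)
        if speaker != current and buffer:
            # Speaker changed: flush the buffered utterance.
            notes.append((current, " ".join(buffer)))
            buffer = []
        current = speaker
        buffer.append(word)
    if buffer:
        notes.append((current, " ".join(buffer)))
    return notes
```

For example, a response with two speakers yields pairs such as `(0, "hello there")` and `(1, "hi")`, which the application can render as labeled meeting notes.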



  1. The user uploads a recorded video file of the virtual meeting or virtual classroom session.
  2. The FFmpeg library extracts audio from the video file.
  3. The Watson Speech to Text service transcribes the audio to provide a diarized textual output.
  4. (Optionally) The Watson Language Translator service translates other languages into an English transcript.
  5. Watson Tone Analyzer analyzes the transcript and picks out the top positive statements from it.
  6. Watson Natural Language Understanding reads the transcript to identify key insights and to get the sentiments and emotions.
  7. The key insights and summary of the video are presented to the user in the application.
  8. The user can download the textual insights.
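Step 5 can be sketched from the shape of a Tone Analyzer `sentences_tone` response. The helper below ranks sentences by their strongest positive tone score; which tones count as "positive" (here `joy` and `confident`) is an assumption for illustration, not something the pattern prescribes:

```python
# Assumption: these tone IDs are treated as "positive".
POSITIVE_TONES = {"joy", "confident"}

def top_positive_sentences(tone_response: dict, n: int = 3) -> list:
    """Rank sentences from a Tone Analyzer sentences_tone result by
    their highest positive tone score and return the top n texts."""
    scored = []
    for sent in tone_response.get("sentences_tone", []):
        positives = [t["score"] for t in sent.get("tones", [])
                     if t["tone_id"] in POSITIVE_TONES]
        if positives:
            scored.append((max(positives), sent["text"]))
    scored.sort(reverse=True)  # strongest positive tone first
    return [text for _score, text in scored[:n]]
```

Sentences with no positive tone at all are simply excluded, so a transcript dominated by sadness or anger produces a short (possibly empty) list rather than misleading "positive" picks.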


Find the detailed steps for this pattern in the README file. Those steps explain how to:

  1. Clone the GitHub repository.
  2. Add the credentials to the application.
  3. Deploy the application.
  4. Run the application.

This code pattern is part of the Extracting insights from videos with IBM Watson use case series, which showcases how to extract meaningful insights from videos using the Watson Speech to Text, Watson Natural Language Understanding, and Watson Tone Analyzer services.