The King of Blog

Batch speech-to-text with Whisper JAX

Author: King
Date: 2024-03-15
Category: Technology

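I have been practicing my listening lately and found the EnglishPod mp3s. They are very good but come without synchronized subtitles, so I used OpenAI's Whisper to transcribe them.
The official Whisper only processes 30 seconds of audio at a time, so longer files have to be split into chunks. That led me to Whisper JAX, which claims to be faster and takes care of the chunking itself.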

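import json
import os
import time

import jax.numpy as jnp
import magic  # python-magic, used to detect files by MIME type
# Note: the class name really is spelled "FlaxWhisperPipline" in whisper_jax.
from whisper_jax import FlaxWhisperPipline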

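def get_mp3_files(directory):
    """Yield the path of every MP3 under directory, detected by MIME type."""
    total_files = 0
    print("scanning files...")
    mime = magic.Magic(mime=True)  # create the detector once, not per file
    for root, dirs, files in os.walk(directory):
        for file in files:
            path = os.path.join(root, file)
            if mime.from_file(path) == 'audio/mpeg':
                total_files += 1
                yield path
    # This is a generator, so the total is only printed after the last file.
    print(f"Total files: {total_files}")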

def mp3_to_text(directory):
    pipeline = FlaxWhisperPipline("openai/whisper-large-v2", dtype=jnp.bfloat16)
    for mp3_file in get_mp3_files(directory):
        print(f"Processing file: {mp3_file}")

        txt_file = os.path.splitext(mp3_file)[0] + '.txt'
        if os.path.isfile(txt_file):
            print(f"Skipping file: {mp3_file}")
            continue

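        start_time = time.time()
        # task="translate" produces English output; use task="transcribe" to keep
        # the audio's own language. language tells Whisper which language is spoken.
        text = pipeline(mp3_file, task="translate", language="chinese", return_timestamps=True)
        end_time = time.time()
        text_length = len(text['text'])  # length of the transcribed text
        print(f"Text length: {text_length} characters")
        print(f"Processed in {end_time - start_time} seconds")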

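        # The result is a dict, so it is stored as JSON even though the file ends in .txt.
        with open(txt_file, 'w', encoding='utf-8') as f:
            f.write(json.dumps(text, ensure_ascii=False))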


start_time = time.time()
mp3_to_text('/data/englishpod')
end_time = time.time()
print(f"Total time: {end_time - start_time} seconds")
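
Since the goal was synchronized subtitles, the JSON saved by the script can be turned into an SRT file. This is only a minimal sketch: it assumes the saved result has the shape {"text": ..., "chunks": [{"timestamp": [start, end], "text": ...}]} when return_timestamps=True, and the helper names format_srt_time, chunks_to_srt and json_to_srt_file are made up for this example, not part of whisper_jax.

```python
import json


def format_srt_time(seconds):
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def chunks_to_srt(result):
    """Build SRT text from a dict with a 'chunks' list (assumed output shape)."""
    entries = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        if start is None:
            continue  # skip chunks without a usable start time
        if end is None:
            end = start  # an open-ended last chunk gets a zero-length cue
        index = len(entries) + 1
        entries.append(
            f"{index}\n{format_srt_time(start)} --> {format_srt_time(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    return "\n".join(entries)


def json_to_srt_file(txt_file, srt_file):
    """Read the JSON written by the script and write an .srt file."""
    with open(txt_file, encoding="utf-8") as f:
        result = json.load(f)
    with open(srt_file, "w", encoding="utf-8") as f:
        f.write(chunks_to_srt(result))
```

Calling json_to_srt_file on each generated .txt file would give one .srt per episode; check a file or two by hand first, because the chunk layout is an assumption.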

Exporting Yinxiang Biji (Evernote China) notes to enex

Author: King
Date: 2023-07-10
Category: Technology

Use this tool:
https://github.com/vzhd1701/evernote-backup

Backup step 1
If your account is on Yinxiang Biji (the China version of Evernote), note that you need to add the --backend china option

./evernote-backup.exe init-db --backend china
Logging in to Evernote...
Username or Email: king@gmail.com
Password:
Enter one-time code (+XX XXX XXXX 2010): 638303
Authorizing auth token, china backend...
Successfully authenticated as king!
Current login will expire at 2024-07-09 07:31:21.
Initializing database en_backup.db...
Reading database en_backup.db...
Successfully initialized database for king!

Backup step 2

./evernote-backup.exe sync
Reading database en_backup.db...
Authorizing auth token, china backend...
Successfully authenticated as king!
Current login will expire at 2024-07-09 07:31:21.
Syncing user notebooks...
  [####################################]  60931/60931
2267 note(s) to download...
Downloading 2267 note(s)...
  [####################################]  2267/2267
Updated or added notebooks: 96
Updated or added notes: 2267
Expunged notebooks: 38
Expunged linked notebooks: 0
Expunged notes: 660
Synchronization completed!

Backup step 3

./evernote-backup.exe export ./
Reading database en_backup.db...
Exporting notes...
  [####################################]  96/96
All notes have been exported!
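
To sanity-check the export, the .enex files can be read back and their notes counted. The sketch below assumes the usual ENEX layout, where an en-export root element contains note elements that each have a title child. The function names list_note_titles and summarize_export are made up for this example.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def list_note_titles(enex_path):
    """Return the <title> text of every <note> in one .enex file."""
    titles = []
    # iterparse streams the file, so large notebooks are not loaded at once
    for _event, elem in ET.iterparse(enex_path, events=("end",)):
        if elem.tag == "note":
            titles.append(elem.findtext("title", default=""))
            elem.clear()  # free the note's content once its title is recorded
    return titles


def summarize_export(directory):
    """Map each .enex file name under directory to its number of notes."""
    return {
        path.name: len(list_note_titles(path))
        for path in sorted(Path(directory).rglob("*.enex"))
    }
```

Running summarize_export("./") in the export directory and comparing the totals with the numbers printed by the sync step is a quick way to spot a notebook that came out empty.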

Open-source projects and tools for counting GPT-4 tokens

Author: King
Date: 2023-06-07
Category: Miscellaneous

GPT limits the length of both its input and its output: OpenAI's GPT-4 allows 8K tokens, and the one on Azure allows 32K.

Note that this limit covers the input prompt plus the text the model returns.
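
As a small worked example of that budget: whatever the prompt uses is no longer available for the reply. The sketch assumes "8K" and "32K" mean 8192 and 32768 tokens; the function name max_completion_tokens is made up for illustration.

```python
GPT4_8K_LIMIT = 8192    # "8K" context window, in tokens
GPT4_32K_LIMIT = 32768  # "32K" context window, in tokens


def max_completion_tokens(context_limit, prompt_tokens):
    """Tokens left for the model's reply once the prompt is counted."""
    remaining = context_limit - prompt_tokens
    if remaining <= 0:
        raise ValueError("prompt alone fills or exceeds the context window")
    return remaining


print(max_completion_tokens(GPT4_8K_LIMIT, 6000))  # 2192
```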

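To split long text accurately before calling the API you need to count tokens, but different GPT versions use different encodings, so the token count is computed differently too. There is an official Python library, but for other languages you have to find a suitable one.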
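
Splitting by token count might look like the sketch below. split_by_tokens works with any pair of encode/decode functions, and split_with_tiktoken plugs in the official Python library (it needs tiktoken installed). Both function names are made up for this example.

```python
def split_by_tokens(text, encode, decode, max_tokens):
    """Split text into pieces of at most max_tokens tokens each."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    tokens = encode(text)
    return [
        decode(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]


def split_with_tiktoken(text, max_tokens, encoding_name="cl100k_base"):
    """Same split using tiktoken (requires `pip install tiktoken`)."""
    import tiktoken  # imported lazily so split_by_tokens has no dependency

    enc = tiktoken.get_encoding(encoding_name)
    return split_by_tokens(text, enc.encode, enc.decode, max_tokens)
```

This cuts purely on token boundaries, so a piece can end in the middle of a sentence, or even in the middle of a multi-byte character. For real use it is probably better to split on paragraphs first and use the token count only as the size check.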

Different versions use different encodings:
cl100k_base: gpt-4, gpt-3.5-turbo, text-embedding-ada-002
p50k_base: Codex models, text-davinci-002, text-davinci-003
r50k_base (or gpt2): GPT-3 models like davinci
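
The list above can be written down as a lookup table. The sketch below covers only the models named in the list, and the names MODEL_TO_ENCODING, encoding_name_for and count_tokens are made up for this example. count_tokens needs tiktoken installed.

```python
# Mapping copied from the list above; other model names are unknown to this sketch.
MODEL_TO_ENCODING = {
    "gpt-4": "cl100k_base",
    "gpt-3.5-turbo": "cl100k_base",
    "text-embedding-ada-002": "cl100k_base",
    "text-davinci-002": "p50k_base",
    "text-davinci-003": "p50k_base",
    "davinci": "r50k_base",
}


def encoding_name_for(model):
    """Look up the encoding name for a model from the table above."""
    try:
        return MODEL_TO_ENCODING[model]
    except KeyError:
        raise ValueError(f"no encoding known for model {model!r}") from None


def count_tokens(text, model):
    """Count tokens with tiktoken, using the encoding that matches the model."""
    import tiktoken  # imported lazily; requires `pip install tiktoken`

    enc = tiktoken.get_encoding(encoding_name_for(model))
    return len(enc.encode(text))
```

tiktoken also ships its own tiktoken.encoding_for_model(), which is the better choice in practice; the hand-written table is just a template for languages whose libraries lack such a helper.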

Official GPT-3 online tokenizer
https://platform.openai.com/tokenizer
Tiktokenizer online tool
https://tiktokenizer.vercel.app/

[Screenshot: gpt-tokenizer]

Libraries that support the cl100k_base and p50k_base encodings (that is, GPT-4 and GPT-3.5)
JavaScript
https://github.com/niieani/gpt-tokenizer
https://www.npmjs.com/package/gpt-tokenizer

Python
https://github.com/openai/tiktoken

Java
https://github.com/knuddelsgmbh/jtokkit

.NET/C#
https://github.com/dmitry-brazhenko/SharpToken
https://github.com/aiqinxuancai/TiktokenSharp

Native NTFS write support on macOS Monterey

Author: King
Date: 2022-01-03
Category: Miscellaneous

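The method below actually relies on the mount_ntfs command, which Apple removed starting with Ventura.
My advice is to use exFAT whenever you can and save yourself the trouble.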

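My portable drive has to work with other devices, so NTFS gives the best compatibility.
Because of licensing issues, writing to NTFS on macOS has always been a hassle, and after the upgrade to Monterey the open-source tool I had been using stopped working.
The fstab approach I used before no longer works either, so the only option is to mount the volume by hand as follows: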

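1. First unmount the volume in Disk Utility
2. Open a terminal
diskutil list
Find the partition on your disk and copy its name from the rightmost column (the identifier, e.g. disk2s3).
sudo mkdir /Volumes/oDisk
This creates the mount directory; any name works as long as it matches the one used below.
sudo mount -t ntfs -o rw,auto,nobrowse /dev/disk2s3 /Volumes/oDisk
Replace disk2s3 above with your own identifier.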

3. In Finder, use Go to Folder to open /Volumes/oDisk

4. Enjoy it
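
If you do this often, the mkdir and mount commands from step 2 can be generated from the identifier. The helper below is only a sketch that builds and prints the commands without running anything. The function names are made up, and the -p flag on mkdir is my addition so that an existing directory is not an error.

```python
import shlex


def ntfs_mount_commands(device, mount_point):
    """Return the mkdir and mount commands from step 2 as argument lists."""
    if not device.startswith("/dev/"):
        device = "/dev/" + device  # accept "disk2s3" as well as "/dev/disk2s3"
    return [
        ["sudo", "mkdir", "-p", mount_point],
        ["sudo", "mount", "-t", "ntfs", "-o", "rw,auto,nobrowse",
         device, mount_point],
    ]


def print_commands(device, mount_point):
    """Print the commands so they can be reviewed before pasting into a terminal."""
    for argv in ntfs_mount_commands(device, mount_point):
        print(shlex.join(argv))


print_commands("disk2s3", "/Volumes/oDisk")
```

Printing instead of executing is deliberate: mounting the wrong partition read-write is not something to automate blindly.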

Also, if some files show up greyed out and cannot be opened, this fixes it:
xattr -d -r com.apple.FinderInfo /Volumes/oDisk/*

Hyperf + Swoole: a bug in Aliyun's official PHP OSS SDK caused by SWOOLE_HOOK_CURL

Author: King
Date: 2020-11-27
Category: Technology

Swoole 4.5.7 + Hyperf

Uploading a file to Aliyun OSS threw a baffling exception:
Oss\Core\OssException: : RequestId:

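It took a long time to trace the problem to SWOOLE_HOOK_CURL, so I
checked bin/hyperf.php:
! defined('SWOOLE_HOOK_FLAGS') && define('SWOOLE_HOOK_FLAGS', SWOOLE_HOOK_ALL);
The configuration looked correct. According to the Swoole documentation, SWOOLE_HOOK_ALL enables every hook type except CURL: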
https://wiki.swoole.com/wiki/diff/?id=993&version=0&compare=current

Then I found that a kind soul had written a compatibility component:
https://github.com/Reasno/swoole-aliyunoss-addon

After installing it I turned SWOOLE_HOOK_CURL on:
! defined('SWOOLE_HOOK_FLAGS') && define('SWOOLE_HOOK_FLAGS', SWOOLE_HOOK_ALL | SWOOLE_HOOK_CURL);

But the error just changed to:
string(106) "MissingContentLength: You must provide the Content-Length HTTP header. RequestId: 5FC048380D92D938376A214A"
At this point there was no use digging further; better to try a different Swoole version. First remove swoole-aliyunoss-addon:

composer remove swoole-aliyunoss-addon

Upgrading Swoole to 4.5.8 did not help.
Downgrading Swoole to 4.5.2 finally fixed the problem.

Then I read the official documentation:
https://wiki.swoole.com/#/runtime?id=swoole_hook_all
From v4.5.4 onward, SWOOLE_HOOK_ALL includes SWOOLE_HOOK_CURL.
So that was it....

On v4.5.4 or later, the following solves the problem:
! defined('SWOOLE_HOOK_FLAGS') && define('SWOOLE_HOOK_FLAGS', SWOOLE_HOOK_ALL ^ SWOOLE_HOOK_CURL);
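
One thing to watch with the ^ in that line: XOR toggles a bit, it does not unconditionally clear it. The Python sketch below shows the difference with made-up flag values (the real Swoole constants have other values, and all names here are only for illustration).

```python
# Hypothetical bit values for illustration only; the real Swoole constants differ.
HOOK_TCP = 1 << 0
HOOK_FILE = 1 << 1
HOOK_CURL = 1 << 2

ALL_WITH_CURL = HOOK_TCP | HOOK_FILE | HOOK_CURL  # like SWOOLE_HOOK_ALL from v4.5.4
ALL_WITHOUT_CURL = HOOK_TCP | HOOK_FILE           # like SWOOLE_HOOK_ALL before v4.5.4


def curl_enabled(flags):
    """True if the CURL bit is set in flags."""
    return bool(flags & HOOK_CURL)


# XOR toggles the bit: it clears CURL only if CURL was set to begin with.
assert not curl_enabled(ALL_WITH_CURL ^ HOOK_CURL)
assert curl_enabled(ALL_WITHOUT_CURL ^ HOOK_CURL)  # XOR switched CURL on here

# AND with the complement always clears the bit, whatever the starting value.
assert not curl_enabled(ALL_WITH_CURL & ~HOOK_CURL)
assert not curl_enabled(ALL_WITHOUT_CURL & ~HOOK_CURL)
```

So the XOR form is only right on versions where SWOOLE_HOOK_ALL already includes CURL. In PHP, SWOOLE_HOOK_ALL & ~SWOOLE_HOOK_CURL should behave the same on either side of v4.5.4, though I have not tested that variant in this project.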

Another solution is to use a PHP OSS SDK that supports the curl hook:
https://packagist.org/packages/starfalling/aliyun-oss-php-sdk
