Create Chunks Semantically
Create chunks of large regulatory texts semantically rather than by size.
This is the result of my request to ChatGPT Alpha to create chunks semantically: https://sharegpt.com/c/GZq0DYJ
It did it: https://sharegpt.com/c/XgZJoVs
Note: No, it didn't do it. It just got to line 35.
It is a token limitation issue: https://sharegpt.com/c/hx9WiXf
Looks like 13K characters is going to have to be the chunk size for query completions. Both text-davinci-003 and gpt-3.5-turbo support roughly 4,000 tokens.
And 31K characters should work as the chunk size for embeddings: text-embedding-ada-002 supports 8,191 tokens.
Note: Going to chunk all documents down to 14K or less, so no single embedding will be larger than 14K.
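A minimal sketch of what chunking by meaning rather than by raw size could look like, assuming blank-line paragraph breaks are an acceptable stand-in for semantic boundaries in these regulatory texts. The function name `chunk_by_paragraph` and the 14,000-character cap are illustrative; the cap is taken from the note above. Lengths are counted in characters, which matches file size only for plain ASCII text.

```python
import re

MAX_CHARS = 14_000  # assumed cap, taken from the "14K or less" note

def chunk_by_paragraph(text, max_chars=MAX_CHARS):
    """Greedily pack whole paragraphs into chunks of at most max_chars."""
    # Paragraphs are assumed to be separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        # Fallback: a single paragraph over the cap is split by size.
        pieces = [para[i:i + max_chars] for i in range(0, len(para), max_chars)]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                # Adding this piece would exceed the cap, so close the chunk.
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Each chunk ends on a paragraph boundary unless a single paragraph is itself larger than the cap. Splitting on section headings instead of blank lines would follow the same pattern with a different regular expression.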
Find the top 20 largest files and total number of files:
ls -lahS adminlaw_files| head -n 20
-
adminlaw = 6k (no need to chunk)
-
ls -lahS adminlaw_files| head
-rw-r--r-- 1 root root 6.0K Feb 9 17:28 adminlaw_11529.txt
-rw-r--r-- 1 root root 5.1K Feb 9 17:28 adminlaw_11517.txt
-rw-r--r-- 1 root root 4.4K Feb 9 17:28 adminlaw_11505.txt
-rw-r--r-- 1 root root 4.4K Feb 9 17:28 adminlaw_11507.7.txt
drwxr-xr-x 3 root root 4.0K Feb 9 16:40 ..
-rw-r--r-- 1 root root 3.6K Feb 9 17:28 adminlaw_11512.txt
-rw-r--r-- 1 root root 3.3K Feb 9 17:28 adminlaw_11440.50.txt
-rw-r--r-- 1 root root 3.2K Feb 9 17:28 adminlaw_11507.6.txt
-
ls adminlaw_files | wc -l
-
140 files
-
-
-
coe = 11K (no need to chunk)
-
ls coe_files | wc -l
-
19 files
-
-
-
excerpts = 37k
-
Top 20:
-
/home/ron/workarea/openai/php/regs/excerpts# ls -lahS excerpts_files| head -n 20
total 3.9M
-rw-r--r-- 1 root root 37K Feb 14 01:43 excerpts_2924f_3551.txt
drwxr-xr-x 2 root root 36K Feb 14 01:43 .
-rw-r--r-- 1 root root 28K Feb 14 01:43 excerpts_6000_5496.txt
-rw-r--r-- 1 root root 28K Feb 14 01:43 excerpts_17520_5973.txt
-rw-r--r-- 1 root root 22K Feb 14 01:43 excerpts_494.5_269.txt
-rw-r--r-- 1 root root 19K Feb 14 01:43 excerpts_1950.5_2988.txt
-rw-r--r-- 1 root root 16K Feb 14 01:43 excerpts_2937_3695.txt
-rw-r--r-- 1 root root 15K Feb 14 01:43 excerpts_2924b_3509.txt
-rw-r--r-- 1 root root 15K Feb 14 01:43 excerpts_51.4_1225.txt
-rw-r--r-- 1 root root 15K Feb 14 01:43 excerpts_1161_5611.txt
-rw-r--r-- 1 root root 15K Feb 14 01:43 excerpts_1798.82_2441.txt
-rw-r--r-- 1 root root 15K Feb 14 01:43 excerpts_896_1347.txt
-rw-r--r-- 1 root root 14K Feb 14 01:43 excerpts_2943_3760.txt
-rw-r--r-- 1 root root 13K Feb 14 01:43 excerpts_1946.7_2892.txt
-rw-r--r-- 1 root root 13K Feb 14 01:43 excerpts_1102.6_1674.txt
-rw-r--r-- 1 root root 13K Feb 14 01:43 excerpts_1940.8.5_2775.txt
-rw-r--r-- 1 root root 13K Feb 14 01:43 excerpts_4973_6068.txt
-rw-r--r-- 1 root root 13K Feb 14 01:43 excerpts_17973_6600.txt
-rw-r--r-- 1 root root 12K Feb 14 01:43 excerpts_14312_5830.txt
-
ls excerpts_files | wc -l
-
791 files (before chunking)
-
-
-
-
fs < 1K (no need to chunk)
-
ls fs_files | wc -l
-
436 files (may leave these out)
-
-
-
ref = 149k
-
ls -lahS ref_files| head -n 20
-rw-r--r-- 1 root root 149K Mar 9 22:05 ref12_8003.txt
-rw-r--r-- 1 root root 136K Mar 10 01:56 ref27_3.txt
-rw-r--r-- 1 root root 84K Mar 8 01:35 ref09_3.txt
-rw-r--r-- 1 root root 83K Mar 9 22:05 ref12_3058.txt
-rw-r--r-- 1 root root 80K Mar 9 22:05 ref12_1759.txt
-rw-r--r-- 1 root root 73K Mar 9 22:05 ref12_6432.txt
-rw-r--r-- 1 root root 62K Mar 9 22:05 ref12_5582.txt
-rw-r--r-- 1 root root 45K Mar 9 22:05 ref12_4533.txt
-rw-r--r-- 1 root root 40K Mar 8 01:21 ref05_609.txt
-rw-r--r-- 1 root root 37K Mar 9 21:57 ref10_461.txt
-rw-r--r-- 1 root root 37K Mar 8 01:22 ref06_61.txt
-rw-r--r-- 1 root root 36K Mar 9 21:57 ref10_987.txt
-rw-r--r-- 1 root root 36K Mar 9 22:05 ref12_5160.txt
-rw-r--r-- 1 root root 33K Mar 9 21:57 ref10_62.txt
-rw-r--r-- 1 root root 31K Mar 10 01:56 ref24_3.txt
-rw-r--r-- 1 root root 30K Mar 9 21:57 ref10_1760.txt
-rw-r--r-- 1 root root 29K Mar 9 22:16 ref20_835.txt
-rw-r--r-- 1 root root 28K Mar 9 22:05 ref12_4153.txt
-rw-r--r-- 1 root root 28K Feb 26 00:09 ref01_488.txt
-
ls ref_files | wc -l
-
217 files (before chunking)
-
-
-
regs = 35k
-
regs_2930.txt
-
9,835 tokens
-
34,767 characters
-
373 lines (before adding the questions)
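The token and character counts recorded in these notes give a rough characters-per-token ratio, which is what turns a token limit into a chunk size. The sketch below only does that arithmetic on the three measurements that appear here; the function names are illustrative, and using the lowest observed ratio as the conservative one is an assumption, not a rule.

```python
# (tokens, characters) pairs as recorded in these notes
MEASUREMENTS = {
    "regs_2930.txt": (9_835, 34_767),
    "regs_2799.1.txt": (3_879, 13_096),
    "relaw_10237.txt": (6_482, 30_264),
}

def chars_per_token(tokens, chars):
    """Average number of characters covered by one token."""
    return chars / tokens

def max_chars_for_budget(token_budget, measurements=MEASUREMENTS):
    """Characters that fit in token_budget at the lowest observed ratio."""
    lowest = min(chars_per_token(t, c) for t, c in measurements.values())
    return int(token_budget * lowest)

if __name__ == "__main__":
    for name, (tokens, chars) in MEASUREMENTS.items():
        print(f"{name}: {chars_per_token(tokens, chars):.2f} chars/token")
    print("chars for a 4,000-token budget:", max_chars_for_budget(4_000))
```

At the lowest ratio measured here, a 4,000-token budget works out to roughly 13.5K characters, which is in line with the 13K to 14K figures above. For completions the prompt and the generated answer share that budget, so the usable chunk size is somewhat smaller.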
-
Top 20 regs:
-
/home/ron/workarea/openai/php/regs/regs# ls -lahS regs_files| head -n 20
total 1.1M
-rw-r--r-- 1 root root 35K Feb 8 11:03 regs_2930.txt
-rw-r--r-- 1 root root 25K Feb 8 11:03 regs_2718.txt
-rw-r--r-- 1 root root 18K Feb 8 11:03 regs_2849.01.txt
-rw-r--r-- 1 root root 16K Feb 8 11:03 regs_2792.32.txt
-rw-r--r-- 1 root root 16K Feb 8 11:03 regs_2809.3.txt
-rw-r--r-- 1 root root 16K Feb 8 11:03 regs_2780.txt
-rw-r--r-- 1 root root 13K Feb 8 11:03 regs_2799.1.txt
3,879 tokens, 13,096 characters, 146 lines (before questions)
-rw-r--r-- 1 root root 13K Feb 8 11:03 regs_2790.9.txt
-rw-r--r-- 1 root root 13K Feb 8 11:03 regs_2809.1.txt
drwxr-xr-x 2 root root 12K Feb 8 11:03 .
-rw-r--r-- 1 root root 12K Feb 8 11:03 regs_2912.txt
-rw-r--r-- 1 root root 9.7K Feb 8 11:03 regs_2848.txt
-rw-r--r-- 1 root root 8.8K Feb 8 11:03 regs_2811.txt
-rw-r--r-- 1 root root 8.4K Feb 8 11:03 regs_2790.txt
-rw-r--r-- 1 root root 7.8K Feb 8 11:03 regs_2792.txt
-rw-r--r-- 1 root root 7.3K Feb 8 11:03 regs_2844.txt
-rw-r--r-- 1 root root 6.9K Feb 8 11:03 regs_2792.23.txt
-rw-r--r-- 1 root root 6.7K Feb 8 11:03 regs_2792.2.txt
-rw-r--r-- 1 root root 6.7K Feb 8 11:03 regs_2841.txt
-
ls regs_files | wc -l
-
202 files (before chunking)
-
-
-
-
relaw = 30k
-
Top 20
/home/ron/workarea/openai/php/regs/relaw# ls -lahS relaw_files| head -n 20
total 2.1M
-rw-r--r-- 1 root root 30K Feb 9 00:38 relaw_10237.txt
6,482 tokens, 30,264 characters, 248 lines
drwxr-xr-x 2 root root 20K Feb 9 00:38 .
drwxr-xr-x 3 root root 20K Feb 9 01:09 ..
-rw-r--r-- 1 root root 17K Feb 9 00:38 relaw_11234.txt
-rw-r--r-- 1 root root 15K Feb 9 00:38 relaw_10000.txt
-rw-r--r-- 1 root root 12K Feb 9 00:38 relaw_10145.txt
-rw-r--r-- 1 root root 12K Feb 9 00:38 relaw_11212.txt
-rw-r--r-- 1 root root 11K Feb 9 00:38 relaw_11240.txt
-rw-r--r-- 1 root root 9.2K Feb 9 00:38 relaw_11227.txt
-rw-r--r-- 1 root root 8.7K Feb 9 00:38 relaw_11010.txt
-rw-r--r-- 1 root root 8.7K Feb 9 00:38 relaw_10177.txt
-rw-r--r-- 1 root root 8.3K Feb 9 00:38 relaw_11226.txt
-rw-r--r-- 1 root root 8.3K Feb 9 00:38 relaw_10232.3.txt
-rw-r--r-- 1 root root 8.2K Feb 9 00:38 relaw_10167.10.txt
-rw-r--r-- 1 root root 8.2K Feb 9 00:38 relaw_11245.txt
-rw-r--r-- 1 root root 8.0K Feb 9 00:38 relaw_11265.txt
-rw-r--r-- 1 root root 7.9K Feb 9 00:38 relaw_11251.txt
-rw-r--r-- 1 root root 7.6K Feb 9 00:38 relaw_10471.txt
-rw-r--r-- 1 root root 7.5K Feb 9 00:38 relaw_11225.txt
-
ls relaw_files | wc -l
-
434 files (before chunking)
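The per-directory listings above could also be gathered in one pass. This is a hedged sketch, assuming every corpus directory holds plain `.txt` files and reading the 14K cap as 14 * 1024 bytes, the way `ls -h` would report it. The helper names `over_limit` and `scan` are illustrative; the directory names are the ones used in these notes.

```python
from pathlib import Path

MAX_BYTES = 14 * 1024  # assumed reading of the "14K" cap

def over_limit(sizes, max_bytes=MAX_BYTES):
    """Keep (name, size) pairs above max_bytes, largest first."""
    return sorted((s for s in sizes if s[1] > max_bytes),
                  key=lambda s: s[1], reverse=True)

def scan(directory, max_bytes=MAX_BYTES):
    """Return (file_count, oversized) for the .txt files in one directory."""
    sizes = [(p.name, p.stat().st_size) for p in Path(directory).glob("*.txt")]
    return len(sizes), over_limit(sizes, max_bytes)

if __name__ == "__main__":
    for corpus in ["adminlaw_files", "coe_files", "excerpts_files",
                   "fs_files", "ref_files", "regs_files", "relaw_files"]:
        # Skip directories that are not present where the script is run.
        if Path(corpus).is_dir():
            count, big = scan(corpus)
            print(f"{corpus}: {count} files, {len(big)} over the cap")
```

The list returned by `scan` is the set of files that would need chunking. Unlike `ls | wc -l`, it counts only `.txt` files.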
-
-