Create Chunks Semantically

Date Adopted: Saturday, March 18, 2023

Create chunks of large regulatory texts semantically rather than by size.

This is the result of my request to ChatGPT Alpha to create chunks semantically: https://sharegpt.com/c/GZq0DYJ

It did it: https://sharegpt.com/c/XgZJoVs

Note: No, it didn't do it. It only got as far as line 35.

It is a token limitation issue: https://sharegpt.com/c/hx9WiXf

Looks like 13K is going to have to be the chunk size for query completions. Both text-davinci-003 and gpt-3.5-turbo support roughly 4,000 tokens, and a 13K file (regs_2799.1.txt) measured 3,879 tokens.

And 31K should work as the chunk size for embeddings: text-embedding-ada-002 supports 8191 tokens.
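The 13K and 31K figures above come from converting a token limit into a file size. A rough sketch of that arithmetic, using only the token and character counts recorded in these notes (the helper names `chars_per_token` and `max_chars` are made up for illustration, and the ratio will vary from document to document):

```python
# (tokens, characters) pairs as recorded in these notes.
samples = {
    "regs_2930.txt": (9835, 34767),
    "regs_2799.1.txt": (3879, 13096),
    "relaw_10237.txt": (6482, 30264),
}


def chars_per_token(tokens, chars):
    """Average number of characters consumed by one token."""
    return chars / tokens


def max_chars(token_limit, ratio):
    """Largest character count expected to fit within token_limit."""
    return int(token_limit * ratio)


ratios = {name: chars_per_token(t, c) for name, (t, c) in samples.items()}

# The smallest ratio (fewest characters per token) is the most conservative,
# because it assumes the text burns through tokens fastest.
worst = min(ratios.values())

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} chars/token")
print("4000-token budget:", max_chars(4000, worst), "chars")
print("8191-token budget:", max_chars(8191, worst), "chars")
```

The three samples range from about 3.4 to 4.7 characters per token, so a size that fits one file may not fit another. With the most conservative of these ratios the 8191-token budget comes out somewhat below 31K, so a smaller chunk size leaves more headroom.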

Note:  Going to chunk all documents down to 14K or less, so no single embedding will be larger than 14K.
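One possible way to keep chunks at 14K or less while still respecting the structure of the text is to split on paragraph boundaries and pack whole paragraphs into each chunk. This is only a sketch, not the method used here: it assumes paragraphs are separated by blank lines, treats "14K" as 14 * 1024 bytes (the unit `ls -h` reports), and the names `chunk_text`, `split_units`, and `nbytes` are hypothetical.

```python
import re

MAX_BYTES = 14 * 1024  # assumption: "14K" means 14 KiB, as shown by ls -h


def nbytes(text: str) -> int:
    """Size of the text in bytes when stored as UTF-8."""
    return len(text.encode("utf-8"))


def split_units(text: str, limit: int) -> list:
    """Split into paragraphs; break any oversized paragraph into lines."""
    units = []
    for para in re.split(r"\n\s*\n", text.strip()):
        if nbytes(para) <= limit:
            units.append(para)
        else:
            # Fallback for a paragraph that alone exceeds the limit.
            units.extend(line for line in para.splitlines() if line.strip())
    return units


def chunk_text(text: str, limit: int = MAX_BYTES) -> list:
    """Greedily pack whole units into chunks of at most `limit` bytes.

    Units are rejoined with a blank line. A single line longer than
    `limit` is kept whole and will still exceed the limit.
    """
    chunks, current = [], ""
    for unit in split_units(text, limit):
        candidate = f"{current}\n\n{unit}" if current else unit
        if nbytes(candidate) <= limit:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = unit
    if current:
        chunks.append(current)
    return chunks
```

A file such as regs_2930.txt (35K) would come out as several chunks, each ending at a paragraph break rather than mid-sentence. Splitting on section headings instead of blank lines would be closer to true semantic chunking, but that depends on how the regulatory text is marked up.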

Find the top 20 largest files and total number of files:

ls -lahS adminlaw_files| head -n 20

  • adminlaw = 6k (no need to chunk)

    • ls -lahS adminlaw_files| head
      -rw-r--r-- 1 root root 6.0K Feb  9 17:28 adminlaw_11529.txt
      -rw-r--r-- 1 root root 5.1K Feb  9 17:28 adminlaw_11517.txt
      -rw-r--r-- 1 root root 4.4K Feb  9 17:28 adminlaw_11505.txt
      -rw-r--r-- 1 root root 4.4K Feb  9 17:28 adminlaw_11507.7.txt
      drwxr-xr-x 3 root root 4.0K Feb  9 16:40 ..
      -rw-r--r-- 1 root root 3.6K Feb  9 17:28 adminlaw_11512.txt
      -rw-r--r-- 1 root root 3.3K Feb  9 17:28 adminlaw_11440.50.txt
      -rw-r--r-- 1 root root 3.2K Feb  9 17:28 adminlaw_11507.6.txt

    • ls adminlaw_files | wc -l

      • 140 files 

  • coe = 11K (no need to chunk)

    • ls coe_files | wc -l

      • 19 files

  • excerpts = 37k

    • Top 20:

      • /home/ron/workarea/openai/php/regs/excerpts# ls -lahS excerpts_files| head -n 20
        total 3.9M
        -rw-r--r-- 1 root root  37K Feb 14 01:43 excerpts_2924f_3551.txt
        drwxr-xr-x 2 root root  36K Feb 14 01:43 .
        -rw-r--r-- 1 root root  28K Feb 14 01:43 excerpts_6000_5496.txt
        -rw-r--r-- 1 root root  28K Feb 14 01:43 excerpts_17520_5973.txt
        -rw-r--r-- 1 root root  22K Feb 14 01:43 excerpts_494.5_269.txt
        -rw-r--r-- 1 root root  19K Feb 14 01:43 excerpts_1950.5_2988.txt
        -rw-r--r-- 1 root root  16K Feb 14 01:43 excerpts_2937_3695.txt
        -rw-r--r-- 1 root root  15K Feb 14 01:43 excerpts_2924b_3509.txt
        -rw-r--r-- 1 root root  15K Feb 14 01:43 excerpts_51.4_1225.txt
        -rw-r--r-- 1 root root  15K Feb 14 01:43 excerpts_1161_5611.txt
        -rw-r--r-- 1 root root  15K Feb 14 01:43 excerpts_1798.82_2441.txt
        -rw-r--r-- 1 root root  15K Feb 14 01:43 excerpts_896_1347.txt
        -rw-r--r-- 1 root root  14K Feb 14 01:43 excerpts_2943_3760.txt
        -rw-r--r-- 1 root root  13K Feb 14 01:43 excerpts_1946.7_2892.txt
        -rw-r--r-- 1 root root  13K Feb 14 01:43 excerpts_1102.6_1674.txt
        -rw-r--r-- 1 root root  13K Feb 14 01:43 excerpts_1940.8.5_2775.txt
        -rw-r--r-- 1 root root  13K Feb 14 01:43 excerpts_4973_6068.txt
        -rw-r--r-- 1 root root  13K Feb 14 01:43 excerpts_17973_6600.txt
        -rw-r--r-- 1 root root  12K Feb 14 01:43 excerpts_14312_5830.txt

      • ls excerpts_files | wc -l

        • # files: 791 (before chunking)

  • fs < 1K (no need to chunk)

    • ls fs_files | wc -l

      • 436 (may leave these out)

  • ref = 149k

    • ls -lahS ref_files| head -n 20
      -rw-r--r-- 1 root root 149K Mar  9 22:05 ref12_8003.txt
      -rw-r--r-- 1 root root 136K Mar 10 01:56 ref27_3.txt
      -rw-r--r-- 1 root root  84K Mar  8 01:35 ref09_3.txt
      -rw-r--r-- 1 root root  83K Mar  9 22:05 ref12_3058.txt
      -rw-r--r-- 1 root root  80K Mar  9 22:05 ref12_1759.txt
      -rw-r--r-- 1 root root  73K Mar  9 22:05 ref12_6432.txt
      -rw-r--r-- 1 root root  62K Mar  9 22:05 ref12_5582.txt
      -rw-r--r-- 1 root root  45K Mar  9 22:05 ref12_4533.txt
      -rw-r--r-- 1 root root  40K Mar  8 01:21 ref05_609.txt
      -rw-r--r-- 1 root root  37K Mar  9 21:57 ref10_461.txt
      -rw-r--r-- 1 root root  37K Mar  8 01:22 ref06_61.txt
      -rw-r--r-- 1 root root  36K Mar  9 21:57 ref10_987.txt
      -rw-r--r-- 1 root root  36K Mar  9 22:05 ref12_5160.txt
      -rw-r--r-- 1 root root  33K Mar  9 21:57 ref10_62.txt
      -rw-r--r-- 1 root root  31K Mar 10 01:56 ref24_3.txt
      -rw-r--r-- 1 root root  30K Mar  9 21:57 ref10_1760.txt
      -rw-r--r-- 1 root root  29K Mar  9 22:16 ref20_835.txt
      -rw-r--r-- 1 root root  28K Mar  9 22:05 ref12_4153.txt
      -rw-r--r-- 1 root root  28K Feb 26 00:09 ref01_488.txt
       

    • ls ref_files | wc -l

      • 217 (before chunking)

  • regs = 35k         

    • regs_2930.txt

    • 9,835 tokens

    • 34,767 characters

    • 373 lines (before adding the questions)

    • Top 20 regs:

      • /home/ron/workarea/openai/php/regs/regs# ls -lahS regs_files| head -n 20
        total 1.1M
        -rw-r--r-- 1 root root  35K Feb  8 11:03 regs_2930.txt
        -rw-r--r-- 1 root root  25K Feb  8 11:03 regs_2718.txt
        -rw-r--r-- 1 root root  18K Feb  8 11:03 regs_2849.01.txt
        -rw-r--r-- 1 root root  16K Feb  8 11:03 regs_2792.32.txt
        -rw-r--r-- 1 root root  16K Feb  8 11:03 regs_2809.3.txt
        -rw-r--r-- 1 root root  16K Feb  8 11:03 regs_2780.txt
        -rw-r--r-- 1 root root  13K Feb  8 11:03 regs_2799.1.txt
           3,879 tokens, 13,096 characters, 146 lines (before questions)
        -rw-r--r-- 1 root root  13K Feb  8 11:03 regs_2790.9.txt
        -rw-r--r-- 1 root root  13K Feb  8 11:03 regs_2809.1.txt
        drwxr-xr-x 2 root root  12K Feb  8 11:03 .
        -rw-r--r-- 1 root root  12K Feb  8 11:03 regs_2912.txt
        -rw-r--r-- 1 root root 9.7K Feb  8 11:03 regs_2848.txt
        -rw-r--r-- 1 root root 8.8K Feb  8 11:03 regs_2811.txt
        -rw-r--r-- 1 root root 8.4K Feb  8 11:03 regs_2790.txt
        -rw-r--r-- 1 root root 7.8K Feb  8 11:03 regs_2792.txt
        -rw-r--r-- 1 root root 7.3K Feb  8 11:03 regs_2844.txt
        -rw-r--r-- 1 root root 6.9K Feb  8 11:03 regs_2792.23.txt
        -rw-r--r-- 1 root root 6.7K Feb  8 11:03 regs_2792.2.txt
        -rw-r--r-- 1 root root 6.7K Feb  8 11:03 regs_2841.txt

      • ls regs_files | wc -l

        • 202 files (before chunking)

  • relaw = 30k

    • Top 20
      /home/ron/workarea/openai/php/regs/relaw# ls -lahS relaw_files| head -n 20
      total 2.1M
      -rw-r--r-- 1 root root  30K Feb  9 00:38 relaw_10237.txt
         6,482 tokens, 30,264 characters, 248 lines
      drwxr-xr-x 2 root root  20K Feb  9 00:38 .
      drwxr-xr-x 3 root root  20K Feb  9 01:09 ..
      -rw-r--r-- 1 root root  17K Feb  9 00:38 relaw_11234.txt
      -rw-r--r-- 1 root root  15K Feb  9 00:38 relaw_10000.txt
      -rw-r--r-- 1 root root  12K Feb  9 00:38 relaw_10145.txt
      -rw-r--r-- 1 root root  12K Feb  9 00:38 relaw_11212.txt
      -rw-r--r-- 1 root root  11K Feb  9 00:38 relaw_11240.txt
      -rw-r--r-- 1 root root 9.2K Feb  9 00:38 relaw_11227.txt
      -rw-r--r-- 1 root root 8.7K Feb  9 00:38 relaw_11010.txt
      -rw-r--r-- 1 root root 8.7K Feb  9 00:38 relaw_10177.txt
      -rw-r--r-- 1 root root 8.3K Feb  9 00:38 relaw_11226.txt
      -rw-r--r-- 1 root root 8.3K Feb  9 00:38 relaw_10232.3.txt
      -rw-r--r-- 1 root root 8.2K Feb  9 00:38 relaw_10167.10.txt
      -rw-r--r-- 1 root root 8.2K Feb  9 00:38 relaw_11245.txt
      -rw-r--r-- 1 root root 8.0K Feb  9 00:38 relaw_11265.txt
      -rw-r--r-- 1 root root 7.9K Feb  9 00:38 relaw_11251.txt
      -rw-r--r-- 1 root root 7.6K Feb  9 00:38 relaw_10471.txt
      -rw-r--r-- 1 root root 7.5K Feb  9 00:38 relaw_11225.txt

    • ls relaw_files | wc -l

      • 434 (before chunking)
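The per-directory survey above (file count, largest file, which files need chunking) could also be gathered in one pass. A minimal sketch, assuming the directory names listed in these notes and a 14K threshold; the function name `survey` is hypothetical:

```python
from pathlib import Path

THRESHOLD = 14 * 1024  # assumption: "14K" means 14 KiB, as shown by ls -h


def survey(directory, threshold=THRESHOLD):
    """Count regular files and list those larger than `threshold` bytes."""
    sizes = {
        p.name: p.stat().st_size
        for p in Path(directory).iterdir()
        if p.is_file()
    }
    # Largest first, mirroring the ordering of `ls -S`.
    oversized = sorted(
        (name for name, size in sizes.items() if size > threshold),
        key=sizes.get,
        reverse=True,
    )
    return {
        "count": len(sizes),
        "largest": max(sizes.values(), default=0),
        "oversized": oversized,
    }


if __name__ == "__main__":
    names = ["adminlaw_files", "coe_files", "excerpts_files", "fs_files",
             "ref_files", "regs_files", "relaw_files"]
    for name in names:
        if Path(name).is_dir():  # skip directories that are not present
            info = survey(name)
            print(f"{name}: {info['count']} files, "
                  f"largest {info['largest']} bytes, "
                  f"{len(info['oversized'])} to chunk")
```

Unlike `ls | wc -l`, this counts only regular files, so the totals can differ slightly if a directory contains subdirectories.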

 
