How does "Word Count" work?

8cuda · February 17, 2023, 10:20pm

In the GitHub repository, only the Config.plist file is present.

Looking at the code, it seems that the application can simply access the wordcount property with this line:

<string>{popclip wordcount} Words</string>

This makes me think that “wordcount” is a property that can be accessed in many places.

As a test, I tried to develop an extension in AppleScript that displays a dialog with the basic text metrics (word count, character count, line count).

But it seems that I cannot manage to access “wordcount” from AppleScript. What might be the problem here?

Config.json

{
  "identifier": "de.foobar.popclip.extension.basic-text-metrics",
  "name": "basic-text-metrics",
  "icon": "remspace.png",
  "description": "Outputs a dialog with the basic text metrics (word count, char count, line count)",
  "applescript file": "basic-text-metrics.scpt",
  "applescript call": {    
    "handler": "newDocument",
    "parameters": ["text", "wordcount"]
  }
}

basic-text-metrics.scpt

on newDocument(popClipText, popClipWordCount) --this is a handler
  set textTwo to ("Text: " & popClipText & "\n" & "WordCount: " & popClipWordCount)
  display dialog textTwo
end newDocument

nick · February 18, 2023, 3:55am

An excellent question!

The Word Count and Character Count extensions are really ancient (circa 2012) from when I made the very first few extensions.

At the time there was no way for extensions to calculate the action title themselves (they can now) and so I basically hacked it in. I made those {popclip wordcount} and {popclip charcount} special tags for the title only.

Looks like I’ve since also put these same values in the JavaScript environment at popclip.input.stats.words and popclip.input.stats.characters, but for some reason I haven’t documented this. (I think that’s because there is some custom behaviour in how it calculates counts for texts containing Chinese, Japanese and Korean words, which I wanted to verify first, and I also wanted to consider further how to present the data.)

The wordcount and charcount could easily be in the main variable set accessible from AppleScript, but they aren’t. I guess that’s simply because nobody has asked about it until today, so well done!

Here is how it calculates them internally, if you are interested:
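Since the counts are currently only exposed to JavaScript, a JavaScript extension could build the same kind of metrics summary from them. The following is a minimal sketch, not a complete extension: `formatMetrics` and `sampleInput` are hypothetical names, the `text` property is assumed to hold the selected text, and only `stats.words` and `stats.characters` are taken from the description above. How the result is wired into an action and shown to the user is left out.

```javascript
// Hypothetical helper: builds a summary string from an object shaped like
// `popclip.input`. `stats.words` and `stats.characters` are the properties
// named above; `text` is an assumption about where the selected text lives.
function formatMetrics(input) {
  // The line count is not part of `stats`, so derive it from the text.
  const lines = input.text === "" ? 0 : input.text.split(/\r\n|\r|\n/).length;
  return [
    `Words: ${input.stats.words}`,
    `Characters: ${input.stats.characters}`,
    `Lines: ${lines}`,
  ].join("\n");
}

// Stand-in object for trying the helper outside PopClip.
const sampleInput = {
  text: "one two\nthree",
  stats: { words: 3, characters: 13 },
};
console.log(formatMetrics(sampleInput));
```

Inside an extension, the same helper would presumably be called with `popclip.input` instead of the stand-in object.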


- (NSUInteger)characterCount
{
    // quick check for the presence of combining characters
    if (!NMStringMayContainCombiningCharacters(self)) {
        return [self length];
    }
    
    // this is a lot slower so we only do it if we need to
    __block NSUInteger result=0;
    [self enumerateSubstringsInRange:NSMakeRange(0, [self length])
                                     options:NSStringEnumerationByComposedCharacterSequences
                                  usingBlock:^(NSString *substring, NSRange substringRange, NSRange enclosingRange, BOOL *stop) {
                                      result+=1;
                                  }];
    return result;
}

- (BOOL)stringIsChineseOrJapanese
{
#ifdef DEBUG
    NSDate *start=[NSDate date];
#endif
    BOOL result=NO;
    if (NMStringMayContainCJ(self)) {
        // needs at least 10% cj characters
        const NSUInteger cjCount=NMNumberOfCJCharacters(self);
        const NSUInteger threshold=[self length]*0.1;
        result=cjCount>threshold;
        NMLogFine(@"cj string detection took %f; cj chars %lu, total chars %lu, threshold %lu, iscj? %d", [[NSDate date] timeIntervalSinceDate:start], cjCount, [self length], threshold, result);
    }
    return result;
}

- (NSUInteger)wordCount:(BOOL)cjMode
{
#ifdef DEBUG
    NSDate *start=[NSDate date];
#endif
    __block NSUInteger result=0;

    // use different block depending on mode
    void (^block)(NSString *) = cjMode ?
    ^(NSString *word) {
        result+=word.length;
    }:
    ^(NSString *word) {
        result+=1;
    };
    
    for (NSString *word in [self componentsSeparatedByCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]]) {
        if (word.length>0) {
            block(word);
        }
    }
    
    NMLogFine(@"word count took %f; words %lu, cj mode? %d", [[NSDate date] timeIntervalSinceDate:start], result, cjMode);
    return result;
}

BOOL NMCharacterIsChineseOrJapanese(uint32_t cp)
{
    // chinese from http://stackoverflow.com/questions/1366068/whats-the-complete-range-for-chinese-characters-in-unicode
    // plus katakana and hiragana articles
    return
    (cp>=0x4E00  && cp<=0x9FCC)|| // CJK unified ideographs
    (cp>=0x3400  && cp<=0x4DB5)|| // CJKUI Ext A block
    (cp>=0x20000 && cp<=0x2A6D6)|| // CJKUI Ext B block
    (cp>=0x2A700 && cp<=0x2B734)|| // CJKUI Ext C block
    (cp>=0x2B740 && cp<=0x2B81D)|| // CJKUI Ext D block
    (cp>=0x30A0  && cp<=0x30FF)|| // Katakana
    (cp>=0x31F0  && cp<=0x32FF)|| // Katakana Phonetic Extensions + Enclosed CJK Letters and Months
    (cp>=0xFF00  && cp<=0xFFEF)|| // Halfwidth and fullwidth forms
    (cp>=0x3040  && cp<=0x309F)|| // Hiragana
    (cp>=0x1B000 && cp<=0x1B0FF); // Kana supplement
}

// count the number of chinese and japanese characters
NSUInteger NMNumberOfCJCharacters(NSString *s)
{
    NSUInteger result=0;
    NSData *const data=[s dataUsingEncoding:NSUTF32LittleEndianStringEncoding];
    const size_t len=[data length]/sizeof(uint32_t);
    const uint32_t *buf=(uint32_t *)[data bytes];
    const uint32_t *const end=buf+len;
    for(; buf<end; buf+=1) {
        if (NMCharacterIsChineseOrJapanese(*buf)) {
            result+=1;
        }
    }
    return result;
}

// quickly detect if the string might contain any chinese or japanese characters
BOOL NMStringMayContainCJ(NSString *s)
{
    NSData *const data=[s dataUsingEncoding:NSUTF32LittleEndianStringEncoding];
    const size_t len=[data length]/sizeof(uint32_t);
    const uint32_t *buf=(uint32_t *)[data bytes];
    const uint32_t *const end=buf+len;
    for(; buf<end; buf+=1) {
        if (*buf>=0x3040) {
            return YES;
        }
    }
    return NO;
}

// quickly detect combining characters
// see http://www.fileformat.info/info/unicode/category/Mn/list.htm for rationale
BOOL NMStringMayContainCombiningCharacters(NSString *s)
{
    NSData *const data=[s dataUsingEncoding:NSUTF32LittleEndianStringEncoding];
    const size_t len=[data length]/sizeof(uint32_t);
    const uint32_t *buf=(uint32_t *)[data bytes];
    const uint32_t *const end=buf+len;
    for(; buf<end; buf+=1) {
        if (*buf>=0x0300) {
            return YES;
        }
    }
    return NO;
}

Looking at the code now I’m just thinking of all the ways it could be improved. But there it is.
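
For anyone who wants to reproduce these counts in their own extension, the logic above can be sketched in JavaScript. This is a rough port under stated assumptions, not PopClip's implementation: all function names are hypothetical, `\s` only approximates `whitespaceAndNewlineCharacterSet`, `Intl.Segmenter` grapheme segmentation stands in for composed character sequences, and passing the detection result as `cjMode` is a guess at how the two methods are combined.

```javascript
// Hypothetical JavaScript port of the counting logic; none of these names
// are part of PopClip's API.

// Same code point ranges as NMCharacterIsChineseOrJapanese.
const CJ_RANGES = [
  [0x4e00, 0x9fcc], // CJK unified ideographs
  [0x3400, 0x4db5], // CJKUI Ext A
  [0x20000, 0x2a6d6], // CJKUI Ext B
  [0x2a700, 0x2b734], // CJKUI Ext C
  [0x2b740, 0x2b81d], // CJKUI Ext D
  [0x30a0, 0x30ff], // Katakana
  [0x31f0, 0x32ff], // Katakana phonetic extensions, enclosed CJK letters and months
  [0xff00, 0xffef], // Halfwidth and fullwidth forms
  [0x3040, 0x309f], // Hiragana
  [0x1b000, 0x1b0ff], // Kana supplement
];

function isChineseOrJapaneseCodePoint(cp) {
  return CJ_RANGES.some(([lo, hi]) => cp >= lo && cp <= hi);
}

// Mirrors stringIsChineseOrJapanese: the number of CJ characters must exceed
// 10% of the string's UTF-16 length (truncated, like the NSUInteger threshold).
function isChineseOrJapaneseText(text) {
  let cjCount = 0;
  // for...of walks the string by code point, not by UTF-16 code unit.
  for (const ch of text) {
    if (isChineseOrJapaneseCodePoint(ch.codePointAt(0))) cjCount += 1;
  }
  return cjCount > Math.floor(text.length * 0.1);
}

// Mirrors wordCount: split on whitespace, skip empty pieces, then count each
// piece as one word, or as its length when in CJ mode.
function wordCount(text, cjMode) {
  let result = 0;
  for (const word of text.split(/\s+/)) {
    if (word.length > 0) result += cjMode ? word.length : 1;
  }
  return result;
}

// Mirrors characterCount: use the plain length unless a code point at or
// above U+0300 suggests combining characters, then count grapheme clusters.
function characterCount(text) {
  let mayCombine = false;
  for (const ch of text) {
    if (ch.codePointAt(0) >= 0x0300) {
      mayCombine = true;
      break;
    }
  }
  if (!mayCombine) return text.length;
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  let result = 0;
  for (const _segment of segmenter.segment(text)) result += 1;
  return result;
}

const sample = "hello  wide\nworld";
console.log(wordCount(sample, isChineseOrJapaneseText(sample))); // 3
console.log(characterCount("e\u0301")); // 1 (letter plus combining accent)
```

Note that in CJ mode the "word" count is effectively a count of non-whitespace UTF-16 units, which is why a mostly Chinese or Japanese selection reports a much larger number than a whitespace split would.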