version 1.0
Technical Report: 96004
Date Jun. 14, 1996
ASPAC Project
Computing Center of Academia Sinica
Workstation Lab.
E-mail: aspac@phi.sinica.edu.tw
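1. Background
Regular expressions can describe strings that are complex and hard to spell out one by one but that still follow a definite rule, which is why many UNIX tools support them: ex, vi, sed, awk, grep, emacs, and others. Besides these ready-made tools, there are also libraries that let programmers add regular expression support to their own programs with little effort. The Regex library released by GNU is one of them. This report describes how to use the GNU Regex library to give your own programs regular expression capability.
Before writing programs with the GNU Regex library, it is necessary to understand the following:
The following sections give a brief introduction to the GNU Regex library on each of these points.
A regular expression is a text string that denotes the set of all strings that follow a particular rule. For example, the regular expression "fo*" denotes the set of strings "f", "fo", "foo", "fooo", and so on. If a string A belongs to the set denoted by the regular expression 'fo*', we say that the regular expression 'fo*' matches the string A. For a detailed introduction to regular expressions, see the Regular Expression Introduction [2] from the ASPAC Project of the Computing Center, Academia Sinica.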
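As a rough illustration of set membership, the sketch below checks whether a whole string belongs to the set denoted by a pattern. It uses the POSIX-compatible calls regcomp, regexec, and regfree that are declared in the appendix; the helper name in_regex_set and the choice to anchor the pattern with ^ and $ are assumptions made for this example, not part of the library.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the whole of `text` is matched by
   `anchored_pattern`, 0 if it is not, and -1 if the pattern does not
   compile. The caller is expected to anchor the pattern with ^ and $. */
int in_regex_set(const char *anchored_pattern, const char *text)
{
    regex_t re;
    int rc;

    /* REG_NOSUB: only success or failure is needed, no match offsets. */
    if (regcomp(&re, anchored_pattern, REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

For instance, in_regex_set("^fo*$", "fooo") reports membership, while in_regex_set("^fo*$", "fob") does not.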
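The GNU Regex library, developed by GNU, is a library for working with and matching regular expression strings. With the GNU Regex library, the following can be done:
To obtain the GNU Regex library, download it from a public FTP server, for example:
ftp://phi.sinica.edu.tw/pub/GNU/gnu/regex-0.12.tar.gz
or
ftp://prep.ai.mit.edu/pub/gnu/regex-0.12.tar.gz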
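Note that GNU also has a separate Rx library, a newer regular expression library with the POSIX.2 standard interface. The GNU Rx library can likewise be downloaded from a public FTP server, for example:
ftp://phi.sinica.edu.tw/pub/GNU/gnu/rx-1.0.tar.gz
or
ftp://prep.ai.mit.edu/pub/gnu/rx-1.0.tar.gz
The GNU Rx functions and the POSIX-compatible interface functions in GNU Regex are essentially the same in function names, calling interface, and number of functions, but their internal workings differ considerably. Because this report is mainly about using the GNU Regex library, the GNU Rx library is not covered further.
Once the GNU Regex library has been obtained, it can be built. Building the GNU Regex library takes five steps: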
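Before using the GNU Regex library, it is best to understand how regular expressions work, which in turn requires understanding their syntax. This chapter gives a brief introduction to regular expression syntax and how it is controlled.
As mentioned earlier, a regular expression is a text string that denotes a set of strings following a particular rule; this text string is called the "pattern string". The contents of a pattern string fall into two broad categories:
Syntax variable settings control how the GNU Regex library functions treat the special characters in a regular expression pattern string; in other words, they are a control setting made inside the program. To use them, set the variable re_syntax_options to the desired syntax bits in your own program before calling any GNU Regex library function. For a usage example, see the function examples in section 4.1.1 below.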
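The effect of the syntax setting can be sketched as follows. This example assumes a glibc system, where defining _GNU_SOURCE before including <regex.h> exposes the GNU interface; with the regex-0.12 distribution one would include the library's own "regex.h" instead. The helper name search_with_syntax is hypothetical. It sets the syntax with re_set_syntax (declared in the appendix, and equivalent to assigning re_syntax_options), compiles the pattern, and then searches.

```c
#define _GNU_SOURCE   /* assumption: glibc exposes the GNU interface this way */
#include <regex.h>
#include <string.h>

/* Hypothetical helper: compile `pattern` under `syntax`, then search
   `text` from its start. Returns the position of the match, -1 if there
   is no match, or -2 if the pattern does not compile. */
int search_with_syntax(reg_syntax_t syntax, const char *pattern,
                       const char *text)
{
    struct re_pattern_buffer buf;
    reg_syntax_t old;
    const char *err;
    int len = (int) strlen(text);
    int pos;

    memset(&buf, 0, sizeof buf);   /* no buffer, fastmap, or translate table */
    old = re_set_syntax(syntax);   /* must happen before compiling */
    err = re_compile_pattern(pattern, strlen(pattern), &buf);
    re_set_syntax(old);            /* restore the previous syntax */
    if (err != NULL)
        return -2;
    pos = re_search(&buf, text, len, 0, len, NULL);
    regfree(&buf);
    return pos;
}
```

With RE_SYNTAX_EGREP the + in "a+" is an operator, so the pattern is found in "xaay"; with RE_SYNTAX_GREP, which sets RE_BK_PLUS_QM, the same + is a literal character and the pattern is found only in text that contains "a+" itself.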
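Because the syntax variable settings affect what the special characters do and how they are used, they need to be understood clearly. The following are the regular expression syntax bits in the GNU Regex library that control the function and usage of special characters:
Based on these syntax bits, the GNU Regex header file regex.h (see the appendix) provides several predefined composite syntax values that programmers can use directly. These predefined values also show which special control characters of regular expression syntax various GNU tools accept.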
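How a composite value is built from individual bits can be sketched without the library at all. The block below keeps hypothetical local copies (prefixed DEMO_) of a few syntax bits, placed at the bit positions implied by the order of definitions in the regex.h appendix, and rebuilds the egrep and POSIX egrep composites from them. The names and the helper demo_syntax_has are assumptions for illustration only.

```c
/* Hypothetical local copies of some syntax bits. Positions follow the
   order of definitions in the regex.h appendix, where each bit is the
   previous one shifted left by one. */
typedef unsigned demo_syntax_t;

#define DEMO_CHAR_CLASSES            (1u << 2)
#define DEMO_CONTEXT_INDEP_ANCHORS   (1u << 3)
#define DEMO_CONTEXT_INDEP_OPS       (1u << 4)
#define DEMO_HAT_LISTS_NOT_NEWLINE   (1u << 8)
#define DEMO_INTERVALS               (1u << 9)
#define DEMO_NEWLINE_ALT             (1u << 11)
#define DEMO_NO_BK_BRACES            (1u << 12)
#define DEMO_NO_BK_PARENS            (1u << 13)
#define DEMO_NO_BK_VBAR              (1u << 15)

/* Composite values, combined the same way as in the appendix. */
#define DEMO_SYNTAX_EGREP \
    (DEMO_CHAR_CLASSES | DEMO_CONTEXT_INDEP_ANCHORS \
     | DEMO_CONTEXT_INDEP_OPS | DEMO_HAT_LISTS_NOT_NEWLINE \
     | DEMO_NEWLINE_ALT | DEMO_NO_BK_PARENS | DEMO_NO_BK_VBAR)
#define DEMO_SYNTAX_POSIX_EGREP \
    (DEMO_SYNTAX_EGREP | DEMO_INTERVALS | DEMO_NO_BK_BRACES)

/* Returns 1 if `bit` is set in `syntax`, 0 otherwise. */
int demo_syntax_has(demo_syntax_t syntax, demo_syntax_t bit)
{
    return (syntax & bit) != 0;
}
```

Read this way, the egrep composite accepts ( ) as grouping operators but does not enable intervals, while the POSIX egrep composite adds intervals written as {...}.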
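The special control character \ is very important in regular expressions. It has four meanings, as follows:
Regular expressions work mainly through the ordinary and special characters in the pattern string, so these operating elements are introduced briefly here. In the GNU Regex library the operating elements of regular expressions fall roughly into three classes:
The following sections briefly introduce each of these three classes.
This class is mainly used in POSIX-standard regular expression programs, but it can also be used in GNU Regex programs. Most operating elements in this class can be written in two ways:
This class is mainly used in GNU Regex programs and cannot be used in POSIX-standard programs.
This class can be used only in GNU Regex programs, not in POSIX-standard programs. In addition, the preprocessor symbol emacs must be defined before the GNU Regex library is built; that is, these features are available only when the library is rebuilt with the C compiler's preprocessor symbol emacs defined. To define it, edit the GNU Regex Makefile and append -Demacs to the end of the DEFS line.
To use this class of features, the program must also set re_syntax_table to an Emacs syntax table. Emacs syntax is much more complicated than Regex syntax, however, so interested readers should consult the syntax section of the "GNU Emacs User's Manual"; it is not covered here.
The GNU Regex library offers three sets of interface functions: the GNU interface, the POSIX-compatible interface, and the BSD-compatible interface. Whichever set is used, the program follows the same main flow, which can be outlined in three steps: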
Program Regex:
{
  Setup regular expression string;
  Initialize regular expression pattern buffer;
  do ( compile regular expression into pattern buffer );
  if ( compiling successfully ) {
    do ( match or search string with pattern buffer );
    if ( matching or searching successfully ) {
      report matching or searching succeeded;
    }
    else {
      report matching or searching failed;
    }
  }
  else {
    report compiling error;
  }
}
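The flow outlined above can be sketched in C with the POSIX-compatible interface, which needs the least setup. The function name regex_workflow and the messages it prints are assumptions for this example; the calls regcomp, regexec, regerror, and regfree are the ones declared in the appendix.

```c
#include <regex.h>
#include <stdio.h>

/* Hypothetical sketch of the compile / search / report flow.
   Returns 1 if `text` contains a match, 0 if it does not,
   and -1 if the pattern does not compile. */
int regex_workflow(const char *pattern, const char *text)
{
    regex_t pattern_buffer;
    char message[256];
    int status;

    /* Step 1: compile the pattern string into a pattern buffer. */
    status = regcomp(&pattern_buffer, pattern, REG_EXTENDED | REG_NOSUB);
    if (status != 0) {
        regerror(status, &pattern_buffer, message, sizeof message);
        fprintf(stderr, "compile error: %s\n", message);
        return -1;
    }

    /* Step 2: search the text with the compiled pattern buffer. */
    status = regexec(&pattern_buffer, text, 0, NULL, 0);
    regfree(&pattern_buffer);

    /* Step 3: report the outcome. */
    if (status == 0) {
        printf("search succeeded\n");
        return 1;
    }
    printf("search failed\n");
    return 0;
}
```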
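Section 4.1 below introduces the functions and usage of the three interfaces and gives a simple usage example for each function. Section 4.2 then gives complete usage examples and test results for the three interfaces.
The GNU Regex library has three interfaces: one designed specifically for GNU, one compatible with POSIX, and one compatible with Berkeley UNIX, so the description below is divided into these three parts. The strengths and weaknesses of the three interfaces were covered only briefly in section 1.2, so they are described further here.
The GNU interface has six Regex functions, as follows: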
4.1.1.1. The function that compiles a regular expression
The first task when using the GNU interface of the regex library is to compile the regular expression pattern string into a pattern buffer that the regex functions can use. The pattern buffer structure is defined in regex.h; see the appendix.
const char *re_compile_pattern(const char *regex, const int regex_size, struct re_pattern_buffer *pattern_buffer)
/*
 regex_pattern  : the regular expression pattern string
 pattern_buffer : the pattern buffer structure used by the regex functions
 errcode        : the result string of the compilation (NULL on success)
*/
const char regex_pattern[7] = "[Ff]oo";
struct re_pattern_buffer pattern_buffer;
const char *errcode;
/* initialize the pattern buffer structure */
pattern_buffer.allocated = 0;
pattern_buffer.buffer = 0;
pattern_buffer.fastmap = 0;
pattern_buffer.translate = 0;
/* set the syntax definition */
re_syntax_options = RE_SYNTAX_EGREP;
/* compile the regular expression */
errcode = re_compile_pattern( regex_pattern, strlen(regex_pattern), &pattern_buffer);
4.1.1.2. The matching function
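Matching means using a regular expression to match a string starting at a given position and seeing how long the matching substring is.
int re_match(struct re_pattern_buffer *pattern_buffer, const char *string, const int size, const int start, struct re_registers *regs)
/*
 n              : the result value of the match
 textstring     : the string to be matched against "[Ff]oo"
 regs           : all match information recorded during matching
 pattern_buffer : the compiled pattern buffer for "[Ff]oo"
*/
int n;
const char *textstring;
struct re_registers regs;
struct re_pattern_buffer pattern_buffer;
/* match starting from the beginning of textstring */
n = re_match( &pattern_buffer, textstring, strlen(textstring), 0, &regs);
4.1.1.3. The search function
Searching means using a regular expression to search a string, starting at a given position, for a substring that matches; if one is found, the position where that substring starts is returned.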
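The difference between matching and searching can be sketched with two small wrappers. As an assumption, the example targets a glibc system, where defining _GNU_SOURCE before including <regex.h> exposes re_match and re_search; with the regex-0.12 distribution the library's own "regex.h" would be included instead. The names compile_egrep, match_length_at, and search_position are hypothetical.

```c
#define _GNU_SOURCE   /* assumption: glibc exposes the GNU interface this way */
#include <regex.h>
#include <string.h>

/* Hypothetical helper: compile `pattern` with the egrep syntax.
   Returns 0 on success and -1 on failure. */
static int compile_egrep(const char *pattern, struct re_pattern_buffer *buf)
{
    reg_syntax_t old;
    const char *err;

    memset(buf, 0, sizeof *buf);
    old = re_set_syntax(RE_SYNTAX_EGREP);
    err = re_compile_pattern(pattern, strlen(pattern), buf);
    re_set_syntax(old);
    return err == NULL ? 0 : -1;
}

/* Matching: the pattern must match exactly at position `start`.
   Returns the length of the match, -1 for no match, -2 on compile error. */
int match_length_at(const char *pattern, const char *text, int start)
{
    struct re_pattern_buffer buf;
    int n;

    if (compile_egrep(pattern, &buf) != 0)
        return -2;
    n = re_match(&buf, text, (int) strlen(text), start, NULL);
    regfree(&buf);
    return n;
}

/* Searching: every start position from 0 onward is tried.
   Returns the position of the match, -1 for no match, -2 on compile error. */
int search_position(const char *pattern, const char *text)
{
    struct re_pattern_buffer buf;
    int len = (int) strlen(text);
    int n;

    if (compile_egrep(pattern, &buf) != 0)
        return -2;
    n = re_search(&buf, text, len, 0, len, NULL);
    regfree(&buf);
    return n;
}
```

For the text "a Foo b" and the pattern "[Ff]oo", matching at position 0 fails, matching at position 2 reports a length of 3, and searching reports position 2.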
int re_search(struct re_pattern_buffer *pattern_buffer, const char *string, const int size, const int start, const int range, struct re_registers *regs)
/*
 n              : the result value of the search
 textstring     : the string to be searched with "[Ff]oo"
 regs           : all match information recorded during the search
 pattern_buffer : the compiled pattern buffer for "[Ff]oo"
*/
int n;
const char *textstring;
struct re_registers regs;
struct re_pattern_buffer pattern_buffer;
/* search from the beginning of textstring through to its end */
n = re_search( &pattern_buffer, textstring, strlen(textstring), 0, strlen(textstring), &regs);
4.1.1.4. The function that matches in two strings
Matching in two strings is similar to the matching described above, except that the match is made across two strings at once; the two strings are treated as one concatenated string.
int re_match_2(struct re_pattern_buffer *pattern_buffer, const char *string1, const int size1, const char *string2, const int size2, const int start, struct re_registers *regs, const int stop)
/*
 n              : the result value of the match
 string1        : the first string to be matched against "[Ff]oo"
 string2        : the second string to be matched against "[Ff]oo"
 regs           : all match information recorded during matching
 pattern_buffer : the compiled pattern buffer for "[Ff]oo"
*/
int n;
const char *string1;
const char *string2;
struct re_registers regs;
struct re_pattern_buffer pattern_buffer;
/* match from the beginning of string1 through to the end of string2 */
n = re_match_2( &pattern_buffer, string1, strlen(string1), string2, strlen(string2), 0, &regs, strlen(string1)+strlen(string2));
4.1.1.5. The function that searches in two strings
Searching in two strings is similar to the searching described above, except that the search is made across two strings at once.
int re_search_2(struct re_pattern_buffer *pattern_buffer, const char *string1, const int size1, const char *string2, const int size2, const int start, const int range, struct re_registers *regs, const int stop)
/*
 n              : the result value of the search
 string1        : the first string to be searched with "[Ff]oo"
 string2        : the second string to be searched with "[Ff]oo"
 regs           : all match information recorded during the search
 pattern_buffer : the compiled pattern buffer for "[Ff]oo"
*/
int n;
const char *string1;
const char *string2;
struct re_registers regs;
struct re_pattern_buffer pattern_buffer;
/* search from the beginning of string1 through to the end of string2 */
n = re_search_2( &pattern_buffer, string1, strlen(string1), string2, strlen(string2), 0, strlen(string1)+strlen(string2), &regs, strlen(string1)+strlen(string2));
4.1.1.6. The function that compiles a fastmap for a regular expression
When searching a very long string, it is best to compile a fastmap for the regular expression; otherwise the search will be slow.
int re_compile_fastmap(struct re_pattern_buffer *pattern_buffer)
/*
 regex_pattern  : the regular expression pattern string
 pattern_buffer : the pattern buffer structure used by the regex functions
 errcode        : the result string of the compilation (NULL on success)
 fastmap        : the space used by the fastmap
 n              : the result of compiling the fastmap
*/
const char regex_pattern[7] = "[Ff]oo";
struct re_pattern_buffer pattern_buffer;
const char *errcode;
char fastmap[1 << 8];
int n;
/* initialize the pattern buffer structure */
pattern_buffer.allocated = 0;
pattern_buffer.buffer = 0;
pattern_buffer.fastmap = fastmap;
pattern_buffer.translate = 0;
/* set the syntax definition */
re_syntax_options = RE_SYNTAX_EGREP;
/* compile the regular expression */
errcode = re_compile_pattern( regex_pattern, strlen(regex_pattern), &pattern_buffer);
/* compile the fastmap */
n = re_compile_fastmap( &pattern_buffer );
The POSIX-compatible interface of GNU Regex has four functions:
4.1.2.1. The function that compiles a regular expression
As with the GNU interface, the first task when using the POSIX-compatible interface is to compile the regular expression. The pattern buffer structure regex_t used by this function is identical to the structure re_pattern_buffer of the GNU interface described earlier.
int regcomp(regex_t *preg, const char *regex, int cflags)
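/*
 pattern_buffer : the pattern buffer structure used by the regex functions
 regex          : the regular expression pattern string
 cflags         : the compile flags
 errcode        : the result code of the compilation
*/
regex_t pattern_buffer;
char regex[7] = "[Ff]oo";
int cflags;
int errcode;
/* set the compile flag to REG_NEWLINE */
cflags = REG_NEWLINE;
/* compile the regular expression */
errcode = regcomp( &pattern_buffer, regex, cflags);
4.1.2.2. The search function
Once the regular expression has been compiled, the pattern can be searched for. The POSIX-compatible search function is far less capable than the GNU search functions: it has no start or range argument, so the search always begins at the start of the string passed in. Its return value reports only whether a matching substring was found; the position of the match is not returned directly and is available only through the pmatch array (the rm_so and rm_eo fields of regmatch_t, see the appendix) when nmatch is nonzero.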
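As a sketch of how the match position can be read from the pmatch array, the hypothetical helper below asks regexec for one regmatch_t entry, which describes the whole match. The name first_match_offset and its return convention are assumptions for this example.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns the byte offset at which the first match
   of `pattern` starts in `text`, or -1 if there is no match or the
   pattern does not compile. If `end` is not NULL, the offset just past
   the match is stored there on success. */
int first_match_offset(const char *pattern, const char *text, int *end)
{
    regex_t re;
    regmatch_t whole[1];   /* entry 0 describes the whole match */
    int start = -1;

    /* No REG_NOSUB here, because the offsets are wanted. */
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, text, 1, whole, 0) == 0) {
        start = (int) whole[0].rm_so;
        if (end != NULL)
            *end = (int) whole[0].rm_eo;
    }
    regfree(&re);
    return start;
}
```

The search still begins at the start of the string; only the way the result is reported changes.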
int regexec(const regex_t *preg, const char *string, size_t nmatch, regmatch_t pmatch[], int eflags)
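/*
 pattern_buffer : the compiled pattern buffer for "[Ff]oo"
 text           : the string to be searched with "[Ff]oo"
 eflag          : the execution flags for the search
 n              : the result code of the search
*/
regex_t pattern_buffer;
char *text;
int eflag;
int n;
/* set no execution flags */
eflag = 0;
/* search, ignoring all match information recorded during the search */
n = regexec(&pattern_buffer, text, 0, 0, eflag);
4.1.2.3. The error-reporting function
If an error occurs while compiling a regular expression or while searching, what is returned is only an error code, not a message string. To obtain the message string that corresponds to an error code, call the error-reporting function.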
size_t regerror(int errcode, const regex_t *preg, char *errbuf, size_t errbuf_size)
/*
 pattern_buffer : the pattern buffer structure used by the regex functions
 regex          : the regular expression pattern string
 cflags         : the compile flags
 errcode        : the result code of the compilation
 buf            : the error message of the compilation
*/
regex_t pattern_buffer;
int errcode;
char buf[256];
char regex[7] = "[Ff]oo";
int cflags;
/* set the compile flag to REG_NEWLINE */
cflags = REG_NEWLINE;
/* compile the regular expression */
errcode = regcomp( &pattern_buffer, regex, cflags);
/* check whether an error occurred during compilation */
if ( errcode != 0 ) {
  regerror( errcode,
            &pattern_buffer,
            buf,
            sizeof(buf));
  printf(" error : %s\n", buf);
}
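When a fixed 256-byte buffer is not wanted, the return value of regerror can be used to size the buffer, since it reports the number of bytes needed for the message including the terminating null. The sketch below relies on that behavior; the helper name compile_error_message is hypothetical.

```c
#include <regex.h>
#include <stdlib.h>

/* Hypothetical helper: returns a malloc'ed error message if `pattern`
   fails to compile, or NULL if it compiles (NULL is also returned if
   memory runs out). The caller frees the returned string. */
char *compile_error_message(const char *pattern)
{
    regex_t re;
    size_t needed;
    char *message;
    int errcode;

    errcode = regcomp(&re, pattern, REG_NEWLINE);
    if (errcode == 0) {
        regfree(&re);
        return NULL;
    }
    /* With a buffer size of 0, regerror only reports the size needed. */
    needed = regerror(errcode, &re, NULL, 0);
    message = malloc(needed);
    if (message != NULL)
        regerror(errcode, &re, message, needed);
    return message;
}
```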
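4.1.2.4. The function that frees a compiled regular expression buffer
After using the POSIX-compatible Regex functions, if the pattern buffer is no longer needed, call this function to release the memory it uses. This function can also be used with the GNU interface functions, because the POSIX pattern buffer structure regex_t and the GNU pattern buffer structure re_pattern_buffer are the same.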
void regfree(regex_t *preg)
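/*
 pattern_buffer : the compiled pattern buffer for "[Ff]oo"
*/
regex_t pattern_buffer;
/* free the compiled regular expression pattern buffer */
regfree(&pattern_buffer);
The BSD-compatible interface has only two Regex functions and is very simple:
4.1.3.1. The function that compiles a regular expression
Likewise, the first task when using the BSD-compatible interface is to compile the regular expression pattern string into a pattern buffer that the BSD-compatible functions can use. Because the BSD interface uses an internal pattern buffer, the user does not need to set up a pattern buffer at all; it is enough to pass the pattern string to the compile function.
char *re_comp(const char *regex)
/*
 regex : the regular expression pattern string
 err   : the result string of the compilation
*/
char regex[7] = "[Ff]oo";
const char *err;
/* set the compile syntax to RE_SYNTAX_GREP */
re_syntax_options = RE_SYNTAX_GREP;
/* compile the regular expression */
err = re_comp( regex );
4.1.3.2. The search function
Once the regular expression has been compiled, the pattern can be searched for. Because the BSD-compatible interface keeps its settings internally, it is enough to pass the string to be searched to the search function. Because the interface is so simple, however, it reports only whether the search succeeded, not where the matching substring starts.
int re_exec(const char *string)
/*
 text : the string to be searched with "[Ff]oo"
 n    : the result of the search
*/
char *text;
int n;
/* search */
n = re_exec( text );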
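This section gives a small sample program for each of the three interfaces, based on the functions described above. The sample programs use a test file (testfile) with the following contents:
there and here
where ?
here and there
?here and there
The regular expression patterns to be matched by the sample programs are "here as a whole word at the end of a line" and "a space followed by ?". Besides matching the pattern, each test program also prints the lines that match.
The GNU interface sample program reads the regular expression from its command-line arguments and opens the test file. It then tests matching one line at a time with re_match and re_search, and two lines at a time with re_match_2 and re_search_2. The source of the sample program gnu_regex_test.c is as follows: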
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "regex.h"
int gnu_regex(char *regex_pattern, char *line1, char *line2)
{
  struct re_pattern_buffer pattern_buffer;
  struct re_registers regs;
  int n, len1, len2;
  const char *id;
  len1 = strlen(line1);
  len2 = strlen(line2);
  /* set the regular expression syntax */
  re_syntax_options = RE_SYNTAX_EGREP | RE_INTERVALS | RE_BACKSLASH_ESCAPE_IN_LISTS;
  /* initialize the regular expression pattern buffer */
  pattern_buffer.allocated = 0;
  pattern_buffer.buffer = 0;
  pattern_buffer.fastmap = 0;
  pattern_buffer.translate = 0;
  /* compile the regular expression */
  id = re_compile_pattern( regex_pattern, strlen(regex_pattern), &pattern_buffer);
  /* check whether an error occurred */
  if (id != NULL) {
    printf(" error on compiling regex1. code = %s\n", id);
    exit(1);
  }
  /* match in line1 and print the return value */
  n = re_match( &pattern_buffer, line1, len1, 0, &regs);
  printf(" re_match return = %d\n",n);
  /* match in line1 followed by line2 and print the return value */
  n = re_match_2( &pattern_buffer, line1, len1, line2, len2, 0, &regs, len1+len2);
  printf(" re_match_2 return = %d\n",n);
  /* search in line1 and print the return value */
  n = re_search( &pattern_buffer, line1, len1, 0, len1, &regs);
  printf(" re_search return = %d\n",n);
  if (n >= 0) printf(" re_search string = %s\n",line1);
  /* search in line1 followed by line2 and print the return value */
  n = re_search_2( &pattern_buffer, line1, len1, line2, len2, 0, len1+len2, &regs, len1+len2);
  if (n >= 0) {
    printf(" re_search_2 return = %d\n",n);
    /* a position below len1 lies in line1, otherwise in line2 */
    if (n < len1) printf(" re_search_2 string = %s\n",line1);
    else printf(" re_search_2 string = %s\n",line2);
    return 1;
  }
  printf(" re_search_2 return = %d\n",n);
  return n;
}
int main(int argc, char **argv)
{
  FILE *fp;
  char line[2][1024];
  int i, n, k, j;
  /* check the number of arguments */
  if (argc != 3) {
    printf("Usage: %s pattern file\n",argv[0]);
    exit(1);
  }
  /* open the test file */
  fp = fopen(argv[2],"r");
  if (fp == NULL) {
    fprintf(stderr, "Can't open %s.\n", argv[2]);
    exit(1);
  }
  /* read the lines of the test file and test the GNU interface functions */
  j = 1;
  if (fgets(line[0], 1024, fp) == NULL) {
    /* empty file: nothing to test */
    fclose(fp);
    return 0;
  }
  i = strlen(line[0]) - 1;
  if (line[0][i] == '\n') { line[0][i] = '\0';}
  while (1) {
    /* n is the slot for the new line, k the slot holding the previous line */
    n = j & 1;
    k = n ^ 1;
    if (fgets(line[n], 1024, fp) == NULL) {break;}
    j++;
    i = strlen(line[n]) - 1;
    if (line[n][i] == '\n') { line[n][i] = '\0';}
    gnu_regex(argv[1], line[k], line[n]);
  }
  /* test the last line once more, paired with an empty second line */
  line[n][0] = '\0';
  gnu_regex(argv[1], line[k], line[n]);
  /* close the test file */
  fclose(fp);
  return 0;
}
Result of matching "here as a whole word at the end of a line":
% gnu_regex_test '\bhere$' testfile
 re_match return = -1
 re_match_2 return = -1
 re_search return = 10
 re_search string = there and here
 re_search_2 return = -1
 re_match return = -1
 re_match_2 return = -1
 re_search return = -1
 re_search_2 return = -1
 re_match return = -1
 re_match_2 return = -1
 re_search return = -1
 re_search_2 return = -1
 re_match return = -1
 re_match_2 return = -1
 re_search return = -1
 re_search_2 return = -1
Result of matching "a space followed by ?":
% gnu_regex_test '[[:space:]]\?' testfile
 re_match return = -1
 re_match_2 return = -1
 re_search return = -1
 re_search_2 return = 19
 re_search_2 string = where ?
 re_match return = -1
 re_match_2 return = -1
 re_search return = 5
 re_search string = where ?
 re_search_2 return = 5
 re_search_2 string = where ?
 re_match return = -1
 re_match_2 return = -1
 re_search return = -1
 re_search_2 return = -1
 re_match return = -1
 re_match_2 return = -1
 re_search return = -1
 re_search_2 return = -1
4.2.2 POSIX-compatible interface functions
The POSIX-compatible interface sample program reads the regular expression from its command-line arguments and opens the test file. It then tests matching one line at a time with regexec. The source of the sample program posix_regex_test.c is as follows:
#include <sys/types.h>
#include <stdio.h>
#include <stdlib.h>
#include "regex.h"
/* routine that initializes the pattern buffer */
void init_pattern_buffer(regex_t *pattern_buffer)
{
  pattern_buffer->buffer = NULL;
  pattern_buffer->allocated = 0;
  pattern_buffer->used = 0;
  pattern_buffer->fastmap = NULL;
  pattern_buffer->fastmap_accurate = 0;
  pattern_buffer->translate = NULL;
  pattern_buffer->can_be_null = 0;
  pattern_buffer->re_nsub = 0;
  pattern_buffer->no_sub = 0;
  pattern_buffer->not_bol = 0;
  pattern_buffer->not_eol = 0;
}
int test_posix(regex_t *pattern_buffer, char *regex, char *text)
{
  int cflags, eflag;
  int n;
  int id;
  char buf[256];
  /* initialize the regular expression pattern buffer */
  init_pattern_buffer(pattern_buffer);
  /* set the regular expression syntax */
  cflags = REG_NEWLINE | REG_EXTENDED;
  /* compile the regular expression */
  id = regcomp( pattern_buffer, regex, cflags);
  /* check whether an error occurred */
  if (id != 0) {
    printf(" error on compiling regex. code = %d\n", id);
    regerror( id, pattern_buffer, buf, sizeof(buf));
    printf(" error : %s\n", buf);
    exit(1);
  }
  /* set no special execution flags for the search */
  eflag = 0;
  /* search in text and print the line if it matches */
  n = regexec(pattern_buffer, text, 0, 0, eflag);
  if (n == 0) {
    printf(" regexec match string = %s\n",text);
  }
  /* free the pattern buffer, since it is recompiled for every line */
  regfree(pattern_buffer);
  return n;
}
int main(int argc, char **argv)
{
  FILE *fp;
  char line[1024];
  regex_t pattern_buffer;
  /* check the number of arguments */
  if (argc != 3) {
    printf("Usage: %s pattern file\n",argv[0]);
    exit(1);
  }
  /* open the test file */
  fp = fopen(argv[2],"r");
  if (fp == NULL) {
    fprintf(stderr, "Can't open %s.\n", argv[2]);
    exit(1);
  }
  /* read the lines of the test file and test the POSIX interface functions */
  while (fgets(line, 1024, fp) != NULL) {
    test_posix(&pattern_buffer,argv[1],line);
  }
  /* close the test file */
  fclose(fp);
  return 0;
}
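Result of matching "here as a whole word at the end of a line":
% posix_regex_test '[[:space:]]here$' testfile
 regexec match string = there and here
Result of matching "a space followed by ?":
% posix_regex_test '[[:space:]]\?' testfile
 regexec match string = where ?
4.2.3 BSD-compatible interface functions
The BSD-compatible interface sample program reads the regular expression from its command-line arguments and opens the test file. It then tests matching one line at a time with re_exec. The source of the sample program bsd_regex_test.c is as follows: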
#include <stdio.h>
#include <stdlib.h>
#include "regex.h"
int test_bsd(char *regex, char *text)
{
  int n;
  const char *id;
  /* set the regular expression syntax */
  re_syntax_options = RE_SYNTAX_GREP;
  /* compile the regular expression */
  id = re_comp( regex);
  /* check whether an error occurred */
  if (id != NULL) {
    printf(" error on compiling regex. code = %s\n", id);
    exit(1);
  }
  /* search in text and print the line if it matches */
  n = re_exec(text);
  if (n == 1) {printf(" re_exec match string = %s\n",text);}
  return n;
}
int main(int argc, char **argv)
{
  FILE *fp;
  char line[1024];
  /* check the number of arguments */
  if (argc != 3) {
    printf("Usage: %s pattern file\n",argv[0]);
    exit(1);
  }
  /* open the test file */
  fp = fopen(argv[2],"r");
  if (fp == NULL) {
    fprintf(stderr, "Can't open %s.\n", argv[2]);
    exit(1);
  }
  /* read the lines of the test file and test the BSD-compatible interface functions */
  while (fgets(line, 1024, fp) != NULL) {
    test_bsd(argv[1],line);
  }
  /* close the test file */
  fclose(fp);
  return 0;
}
Result of matching "here as a whole word at the end of a line":
% bsd_regex_test '[[:space:]]here$' testfile
 re_exec match string = there and here
Result of matching "a space followed by ?":
% bsd_regex_test '[[:space:]]?' testfile
 re_exec match string = where ?
regex.h
/* Definitions for data structures and routines for the regular
expression library, version 0.12.
Copyright (C) 1985, 1989, 1990, 1991, 1992, 1993 Free Software Foundation, Inc.
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2, or (at your option)
any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */
#ifndef __REGEXP_LIBRARY_H__
#define __REGEXP_LIBRARY_H__
/* POSIX says that <sys/types.h> must be included (by the caller) before
<regex.h>. */
#ifdef VMS
/* VMS doesn't have `size_t' in <sys/types.h>, even though POSIX says it
should be there. */
#include <stddef.h>
#endif
/* The following bits are used to determine the regexp syntax we
recognize. The set/not-set meanings are chosen so that Emacs syntax
remains the value 0. The bits are given in alphabetical order, and
the definitions shifted by one from the previous bit; thus, when we
add or remove a bit, only one other definition need change. */
typedef unsigned reg_syntax_t;
/* If this bit is not set, then \ inside a bracket expression is literal.
If set, then such a \ quotes the following character. */
#define RE_BACKSLASH_ESCAPE_IN_LISTS (1)
/* If this bit is not set, then + and ? are operators, and \+ and \? are
literals.
If set, then \+ and \? are operators and + and ? are literals. */
#define RE_BK_PLUS_QM (RE_BACKSLASH_ESCAPE_IN_LISTS << 1)
/* If this bit is set, then character classes are supported. They are:
[:alpha:], [:upper:], [:lower:], [:digit:], [:alnum:], [:xdigit:],
[:space:], [:print:], [:punct:], [:graph:], and [:cntrl:].
If not set, then character classes are not supported. */
#define RE_CHAR_CLASSES (RE_BK_PLUS_QM << 1)
/* If this bit is set, then ^ and $ are always anchors (outside bracket
expressions, of course).
If this bit is not set, then it depends:
^ is an anchor if it is at the beginning of a regular
expression or after an open-group or an alternation operator;
$ is an anchor if it is at the end of a regular expression, or
before a close-group or an alternation operator.
This bit could be (re)combined with RE_CONTEXT_INDEP_OPS, because
POSIX draft 11.2 says that * etc. in leading positions is undefined.
We already implemented a previous draft which made those constructs
invalid, though, so we haven't changed the code back. */
#define RE_CONTEXT_INDEP_ANCHORS (RE_CHAR_CLASSES << 1)
/* If this bit is set, then special characters are always special
regardless of where they are in the pattern.
If this bit is not set, then special characters are special only in
some contexts; otherwise they are ordinary. Specifically,
* + ? and intervals are only special when not after the beginning,
open-group, or alternation operator. */
#define RE_CONTEXT_INDEP_OPS (RE_CONTEXT_INDEP_ANCHORS << 1)
/* If this bit is set, then *, +, ?, and { cannot be first in an re or
immediately after an alternation or begin-group operator. */
#define RE_CONTEXT_INVALID_OPS (RE_CONTEXT_INDEP_OPS << 1)
/* If this bit is set, then . matches newline.
If not set, then it doesn't. */
#define RE_DOT_NEWLINE (RE_CONTEXT_INVALID_OPS << 1)
/* If this bit is set, then . doesn't match NUL.
If not set, then it does. */
#define RE_DOT_NOT_NULL (RE_DOT_NEWLINE << 1)
/* If this bit is set, nonmatching lists [^...] do not match newline.
If not set, they do. */
#define RE_HAT_LISTS_NOT_NEWLINE (RE_DOT_NOT_NULL << 1)
/* If this bit is set, either \{...\} or {...} defines an
interval, depending on RE_NO_BK_BRACES.
If not set, \{, \}, {, and } are literals. */
#define RE_INTERVALS (RE_HAT_LISTS_NOT_NEWLINE << 1)
/* If this bit is set, +, ? and | aren't recognized as operators.
If not set, they are. */
#define RE_LIMITED_OPS (RE_INTERVALS << 1)
/* If this bit is set, newline is an alternation operator.
If not set, newline is literal. */
#define RE_NEWLINE_ALT (RE_LIMITED_OPS << 1)
/* If this bit is set, then `{...}' defines an interval, and \{ and \}
are literals.
If not set, then `\{...\}' defines an interval. */
#define RE_NO_BK_BRACES (RE_NEWLINE_ALT << 1)
/* If this bit is set, (...) defines a group, and \( and \) are literals.
If not set, \(...\) defines a group, and ( and ) are literals. */
#define RE_NO_BK_PARENS (RE_NO_BK_BRACES << 1)
/* If this bit is set, then \<digit> matches <digit>.
If not set, then \<digit> is a back-reference. */
#define RE_NO_BK_REFS (RE_NO_BK_PARENS << 1)
/* If this bit is set, then | is an alternation operator, and \| is literal.
If not set, then \| is an alternation operator, and | is literal. */
#define RE_NO_BK_VBAR (RE_NO_BK_REFS << 1)
/* If this bit is set, then an ending range point collating higher
than the starting range point, as in [z-a], is invalid.
If not set, then when ending range point collates higher than the
starting range point, the range is ignored. */
#define RE_NO_EMPTY_RANGES (RE_NO_BK_VBAR << 1)
/* If this bit is set, then an unmatched ) is ordinary.
If not set, then an unmatched ) is invalid. */
#define RE_UNMATCHED_RIGHT_PAREN_ORD (RE_NO_EMPTY_RANGES << 1)
/* This global variable defines the particular regexp syntax to use (for
some interfaces). When a regexp is compiled, the syntax used is
stored in the pattern buffer, so changing this does not affect
already-compiled regexps. */
extern reg_syntax_t re_syntax_options;
/* Define combinations of the above bits for the standard possibilities.
(The [[[ comments delimit what gets put into the Texinfo file, so
don't delete them!) */
/* [[[begin syntaxes]]] */
#define RE_SYNTAX_EMACS 0
#define RE_SYNTAX_AWK \
(RE_BACKSLASH_ESCAPE_IN_LISTS | RE_DOT_NOT_NULL \
| RE_NO_BK_PARENS | RE_NO_BK_REFS \
| RE_NO_BK_VBAR | RE_NO_EMPTY_RANGES \
| RE_UNMATCHED_RIGHT_PAREN_ORD)
#define RE_SYNTAX_POSIX_AWK \
(RE_SYNTAX_POSIX_EXTENDED | RE_BACKSLASH_ESCAPE_IN_LISTS)
#define RE_SYNTAX_GREP \
(RE_BK_PLUS_QM | RE_CHAR_CLASSES \
| RE_HAT_LISTS_NOT_NEWLINE | RE_INTERVALS \
| RE_NEWLINE_ALT)
#define RE_SYNTAX_EGREP \
(RE_CHAR_CLASSES | RE_CONTEXT_INDEP_ANCHORS \
| RE_CONTEXT_INDEP_OPS | RE_HAT_LISTS_NOT_NEWLINE \
| RE_NEWLINE_ALT | RE_NO_BK_PARENS \
| RE_NO_BK_VBAR)
#define RE_SYNTAX_POSIX_EGREP \
(RE_SYNTAX_EGREP | RE_INTERVALS | RE_NO_BK_BRACES)
/* P1003.2/D11.2, section 4.20.7.1, lines 5078ff. */
#define RE_SYNTAX_ED RE_SYNTAX_POSIX_BASIC
#define RE_SYNTAX_SED RE_SYNTAX_POSIX_BASIC
/* Syntax bits common to both basic and extended POSIX regex syntax. */
#define _RE_SYNTAX_POSIX_COMMON \
(RE_CHAR_CLASSES | RE_DOT_NEWLINE | RE_DOT_NOT_NULL \
| RE_INTERVALS | RE_NO_EMPTY_RANGES)
#define RE_SYNTAX_POSIX_BASIC \
(_RE_SYNTAX_POSIX_COMMON | RE_BK_PLUS_QM)
/* Differs from ..._POSIX_BASIC only in that RE_BK_PLUS_QM becomes
RE_LIMITED_OPS, i.e., \? \+ \| are not recognized. Actually, this
isn't minimal, since other operators, such as \`, aren't disabled. */
#define RE_SYNTAX_POSIX_MINIMAL_BASIC \
(_RE_SYNTAX_POSIX_COMMON | RE_LIMITED_OPS)
#define RE_SYNTAX_POSIX_EXTENDED \
(_RE_SYNTAX_POSIX_COMMON | RE_CONTEXT_INDEP_ANCHORS \
| RE_CONTEXT_INDEP_OPS | RE_NO_BK_BRACES \
| RE_NO_BK_PARENS | RE_NO_BK_VBAR \
| RE_UNMATCHED_RIGHT_PAREN_ORD)
/* Differs from ..._POSIX_EXTENDED in that RE_CONTEXT_INVALID_OPS
replaces RE_CONTEXT_INDEP_OPS and RE_NO_BK_REFS is added. */
#define RE_SYNTAX_POSIX_MINIMAL_EXTENDED \
(_RE_SYNTAX_POSIX_COMMON | RE_CONTEXT_INDEP_ANCHORS \
| RE_CONTEXT_INVALID_OPS | RE_NO_BK_BRACES \
| RE_NO_BK_PARENS | RE_NO_BK_REFS \
| RE_NO_BK_VBAR | RE_UNMATCHED_RIGHT_PAREN_ORD)
/* [[[end syntaxes]]] */
/* Maximum number of duplicates an interval can allow. Some systems
(erroneously) define this in other header files, but we want our
value, so remove any previous define. */
#ifdef RE_DUP_MAX
#undef RE_DUP_MAX
#endif
#define RE_DUP_MAX ((1 << 15) - 1)
/* POSIX `cflags' bits (i.e., information for `regcomp'). */
/* If this bit is set, then use extended regular expression syntax.
If not set, then use basic regular expression syntax. */
#define REG_EXTENDED 1
/* If this bit is set, then ignore case when matching.
If not set, then case is significant. */
#define REG_ICASE (REG_EXTENDED << 1)
/* If this bit is set, then anchors do not match at newline
characters in the string.
If not set, then anchors do match at newlines. */
#define REG_NEWLINE (REG_ICASE << 1)
/* If this bit is set, then report only success or fail in regexec.
If not set, then returns differ between not matching and errors. */
#define REG_NOSUB (REG_NEWLINE << 1)
/* POSIX `eflags' bits (i.e., information for regexec). */
/* If this bit is set, then the beginning-of-line operator doesn't match
the beginning of the string (presumably because it's not the
beginning of a line).
If not set, then the beginning-of-line operator does match the
beginning of the string. */
#define REG_NOTBOL 1
/* Like REG_NOTBOL, except for the end-of-line. */
#define REG_NOTEOL (1 << 1)
/* If any error codes are removed, changed, or added, update the
`re_error_msg' table in regex.c. */
typedef enum
{
REG_NOERROR = 0, /* Success. */
REG_NOMATCH, /* Didn't find a match (for regexec). */
/* POSIX regcomp return error codes. (In the order listed in the
standard.) */
REG_BADPAT, /* Invalid pattern. */
REG_ECOLLATE, /* Not implemented. */
REG_ECTYPE, /* Invalid character class name. */
REG_EESCAPE, /* Trailing backslash. */
REG_ESUBREG, /* Invalid back reference. */
REG_EBRACK, /* Unmatched left bracket. */
REG_EPAREN, /* Parenthesis imbalance. */
REG_EBRACE, /* Unmatched \{. */
REG_BADBR, /* Invalid contents of \{\}. */
REG_ERANGE, /* Invalid range end. */
REG_ESPACE, /* Ran out of memory. */
REG_BADRPT, /* No preceding re for repetition op. */
/* Error codes we've added. */
REG_EEND, /* Premature end. */
REG_ESIZE, /* Compiled pattern bigger than 2^16 bytes. */
REG_ERPAREN /* Unmatched ) or \); not returned from regcomp. */
} reg_errcode_t;
/* This data structure represents a compiled pattern. Before calling
the pattern compiler, the fields `buffer', `allocated', `fastmap',
`translate', and `no_sub' can be set. After the pattern has been
compiled, the `re_nsub' field is available. All other fields are
private to the regex routines. */
struct re_pattern_buffer
{
/* [[[begin pattern_buffer]]] */
/* Space that holds the compiled pattern. It is declared as
`unsigned char *' because its elements are
sometimes used as array indexes. */
unsigned char *buffer;
/* Number of bytes to which `buffer' points. */
unsigned long allocated;
/* Number of bytes actually used in `buffer'. */
unsigned long used;
/* Syntax setting with which the pattern was compiled. */
reg_syntax_t syntax;
/* Pointer to a fastmap, if any, otherwise zero. re_search uses
the fastmap, if there is one, to skip over impossible
starting points for matches. */
char *fastmap;
/* Either a translate table to apply to all characters before
comparing them, or zero for no translation. The translation
is applied to a pattern when it is compiled and to a string
when it is matched. */
char *translate;
/* Number of subexpressions found by the compiler. */
size_t re_nsub;
/* Zero if this pattern cannot match the empty string, one else.
Well, in truth it's used only in `re_search_2', to see
whether or not we should use the fastmap, so we don't set
this absolutely perfectly; see `re_compile_fastmap' (the
`duplicate' case). */
unsigned can_be_null : 1;
/* If REGS_UNALLOCATED, allocate space in the `regs' structure
for `max (RE_NREGS, re_nsub + 1)' groups.
If REGS_REALLOCATE, reallocate space if necessary.
If REGS_FIXED, use what's there. */
#define REGS_UNALLOCATED 0
#define REGS_REALLOCATE 1
#define REGS_FIXED 2
unsigned regs_allocated : 2;
/* Set to zero when `regex_compile' compiles a pattern; set to one
by `re_compile_fastmap' if it updates the fastmap. */
unsigned fastmap_accurate : 1;
/* If set, `re_match_2' does not return information about
subexpressions. */
unsigned no_sub : 1;
/* If set, a beginning-of-line anchor doesn't match at the
beginning of the string. */
unsigned not_bol : 1;
/* Similarly for an end-of-line anchor. */
unsigned not_eol : 1;
/* If true, an anchor at a newline matches. */
unsigned newline_anchor : 1;
/* [[[end pattern_buffer]]] */
};
typedef struct re_pattern_buffer regex_t;
/* search.c (search_buffer) in Emacs needs this one opcode value. It is
defined both in `regex.c' and here. */
#define RE_EXACTN_VALUE 1
/* Type for byte offsets within the string. POSIX mandates this. */
typedef int regoff_t;
/* This is the structure we store register match data in. See
regex.texinfo for a full description of what registers match. */
struct re_registers
{
unsigned num_regs;
regoff_t *start;
regoff_t *end;
};
/* If `regs_allocated' is REGS_UNALLOCATED in the pattern buffer,
`re_match_2' returns information about at least this many registers
the first time a `regs' structure is passed. */
#ifndef RE_NREGS
#define RE_NREGS 30
#endif
/* POSIX specification for registers. Aside from the different names than
`re_registers', POSIX uses an array of structures, instead of a
structure of arrays. */
typedef struct
{
regoff_t rm_so; /* Byte offset from string's start to substring's start. */
regoff_t rm_eo; /* Byte offset from string's start to substring's end. */
} regmatch_t;
/* Declarations for routines. */
/* To avoid duplicating every routine declaration -- once with a
prototype (if we are ANSI), and once without (if we aren't) -- we
use the following macro to declare argument types. This
unfortunately clutters up the declarations a bit, but I think it's
worth it. */
#if __STDC__
#define _RE_ARGS(args) args
#else /* not __STDC__ */
#define _RE_ARGS(args) ()
#endif /* not __STDC__ */
/* Sets the current default syntax to SYNTAX, and return the old syntax.
You can also simply assign to the `re_syntax_options' variable. */
extern reg_syntax_t re_set_syntax _RE_ARGS ((reg_syntax_t syntax));
/* Compile the regular expression PATTERN, with length LENGTH
and syntax given by the global `re_syntax_options', into the buffer
BUFFER. Return NULL if successful, and an error string if not. */
extern const char *re_compile_pattern
_RE_ARGS ((const char *pattern, int length,
struct re_pattern_buffer *buffer));
/* Compile a fastmap for the compiled pattern in BUFFER; used to
accelerate searches. Return 0 if successful and -2 if was an
internal error. */
extern int re_compile_fastmap _RE_ARGS ((struct re_pattern_buffer *buffer));
/* Search in the string STRING (with length LENGTH) for the pattern
compiled into BUFFER. Start searching at position START, for RANGE
characters. Return the starting position of the match, -1 for no
match, or -2 for an internal error. Also return register
information in REGS (if REGS and BUFFER->no_sub are nonzero). */
extern int re_search
_RE_ARGS ((struct re_pattern_buffer *buffer, const char *string,
int length, int start, int range, struct re_registers *regs));
/* Like `re_search', but search in the concatenation of STRING1 and
STRING2. Also, stop searching at index START + STOP. */
extern int re_search_2
_RE_ARGS ((struct re_pattern_buffer *buffer, const char *string1,
int length1, const char *string2, int length2,
int start, int range, struct re_registers *regs, int stop));
/* Like `re_search', but return how many characters in STRING the regexp
in BUFFER matched, starting at position START. */
extern int re_match
_RE_ARGS ((struct re_pattern_buffer *buffer, const char *string,
int length, int start, struct re_registers *regs));
/* Relates to `re_match' as `re_search_2' relates to `re_search'. */
extern int re_match_2
_RE_ARGS ((struct re_pattern_buffer *buffer, const char *string1,
int length1, const char *string2, int length2,
int start, struct re_registers *regs, int stop));
/* Set REGS to hold NUM_REGS registers, storing them in STARTS and
ENDS. Subsequent matches using BUFFER and REGS will use this memory
for recording register information. STARTS and ENDS must be
allocated with malloc, and must each be at least `NUM_REGS * sizeof
(regoff_t)' bytes long.
If NUM_REGS == 0, then subsequent matches should allocate their own
register data.
Unless this function is called, the first search or match using
PATTERN_BUFFER will allocate its own register data, without
freeing the old data. */
extern void re_set_registers
_RE_ARGS ((struct re_pattern_buffer *buffer, struct re_registers *regs,
unsigned num_regs, regoff_t *starts, regoff_t *ends));
/* 4.2 bsd compatibility. */
extern char *re_comp _RE_ARGS ((const char *));
extern int re_exec _RE_ARGS ((const char *));
/* POSIX compatibility. */
extern int regcomp _RE_ARGS ((regex_t *preg, const char *pattern, int cflags));
extern int regexec
_RE_ARGS ((const regex_t *preg, const char *string, size_t nmatch,
regmatch_t pmatch[], int eflags));
extern size_t regerror
_RE_ARGS ((int errcode, const regex_t *preg, char *errbuf,
size_t errbuf_size));
extern void regfree _RE_ARGS ((regex_t *preg));
#endif /* not __REGEXP_LIBRARY_H__ */
/*
Local variables:
make-backup-files: t
version-control: t
trim-versions-without-asking: nil
End:
*/
6. References