How to Correctly Use Regular Expressions in Bash to Test String Validity

2017-11-27|Categories: Magedu-training|Tags: |

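"Regex test" here means the [[ string =~ regex ]] construct, where regex can be either a long regular expression written out in place or a variable that holds one.

Do not wrap the regular expression in quotes

Bash interprets the string to the right of =~ as a POSIX extended regular expression (ERE), but putting quotes around it leads to a completely different interpretation. The following script demonstrates the difference: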

#!/bin/bash
#
#===== ===== ===== ===== ===== ===== ===== ===== ===== ===== ===== ===== 
# Filename:     check_ip.sh
# Revision:     1.0
# Author:       Li Yang
# Date:         2017-11-24
# Description:  show how quoting affects regex evaluation
#===== ===== ===== ===== ===== ===== ===== ===== ===== ===== ===== ===== 

# Two ways to define an array: all at once (commented out) or element by element
# ips=("str1" "str2" "str3" "str4" "str5")
ips[0]='192.168.0.1'
ips[1]='192.168.0'
ips[2]='255.255.255.256'
ips[3]='123.123.123.123.123'
ips[4]='a.b.c.d'
ips[5]='255.255.255.255'
ips[6]='0.0.0.0'

red=$'\033[31m'
green=$'\033[32m'
yellow=$'\033[33m'
blue=$'\033[34m'
magenta=$'\033[35m'
cyan=$'\033[36m'
normal=$'\033[0m'

invalidOut="Invalid"
validOut="Valid"
regexVar="Saved in a VARIABLE"
regexStr="Is a STRING"
noQuotes="NO quotes"
singleQuotes="Single quotes"
doubleQuotes="Double quotes"

printStyleT="%-25s\t%-10s\t%-20s\t%-15s\n"
# color codes must stay outside the `%[column_width]s` fields, or they count toward the column width
printStyleY="${green}%-25s\t%-10s${normal}\t%-20s\t%-15s\n"
printStyleN="${red}%-25s\t%-10s${normal}\t%-20s\t%-15s\n"
printTitle=$(printf "${printStyleT}" "IP Address" "Validity" "Regex" "Quote Type")
print2ndLine=$(printf "${printStyleT}" "====================" "==========" "====================" "===============")

# part 1: regex saved in a variable

regex="^((25[0-5]|2[0-4][0-9]|[01][0-9][0-9]|[0-9]{1,2})[.]){3}(25[0-5]|2[0-4][0-9]|[01][0-9][0-9]|[0-9]{1,2})$"

# the regex variable is used without quotes
function validIP1a() {
    echo "${printTitle}"
    echo "${print2ndLine}"
    for (( i=0; i<${#ips[@]}; i++ )); do
        if [[ "${ips[i]}" =~ $regex ]]; then
            printf "${printStyleY}" "${ips[i]}" "${validOut}" "${regexVar}" "${noQuotes}"
        else
            printf "${printStyleN}" "${ips[i]}" "${invalidOut}" "${regexVar}" "${noQuotes}"
        fi
    done
}

# the regex variable is wrapped in single quotes (so $regex is not even expanded)
function validIP1b() {
    echo "${printTitle}"
    echo "${print2ndLine}"
    for (( i=0; i<${#ips[@]}; i++ )); do
        if [[ "${ips[i]}" =~ '$regex' ]]; then
            printf "${printStyleY}" "${ips[i]}" "${validOut}" "${regexVar}" "${singleQuotes}"
        else
            printf "${printStyleN}" "${ips[i]}" "${invalidOut}" "${regexVar}" "${singleQuotes}"
        fi
    done
}

# the regex variable is wrapped in double quotes
function validIP1c() {
    echo "${printTitle}"
    echo "${print2ndLine}"
    for (( i=0; i<${#ips[@]}; i++ )); do
        if [[ "${ips[i]}" =~ "$regex" ]]; then
            printf "${printStyleY}" "${ips[i]}" "${validOut}" "${regexVar}" "${doubleQuotes}"
        else
            printf "${printStyleN}" "${ips[i]}" "${invalidOut}" "${regexVar}" "${doubleQuotes}"
        fi
    done
}

# part 2: regex written inline as a literal

# the inline regex is used without quotes
function validIP2a() {
    echo "${printTitle}"
    echo "${print2ndLine}"
    for (( i=0; i<${#ips[@]}; i++ )); do
        if [[ "${ips[i]}" =~ ^((25[0-5]|2[0-4][0-9]|[01][0-9][0-9]|[0-9]{1,2})[.]){3}(25[0-5]|2[0-4][0-9]|[01][0-9][0-9]|[0-9]{1,2})$ ]]; then
            printf "${printStyleY}" "${ips[i]}" "${validOut}" "${regexStr}" "${noQuotes}"
        else
            printf "${printStyleN}" "${ips[i]}" "${invalidOut}" "${regexStr}" "${noQuotes}"
        fi
    done
}

# the inline regex is wrapped in single quotes
function validIP2b() {
    echo "${printTitle}"
    echo "${print2ndLine}"
    for (( i=0; i<${#ips[@]}; i++ )); do
        if [[ "${ips[i]}" =~ '^((25[0-5]|2[0-4][0-9]|[01][0-9][0-9]|[0-9]{1,2})[.]){3}(25[0-5]|2[0-4][0-9]|[01][0-9][0-9]|[0-9]{1,2})$' ]]; then
            printf "${printStyleY}" "${ips[i]}" "${validOut}" "${regexStr}" "${singleQuotes}"
        else
            printf "${printStyleN}" "${ips[i]}" "${invalidOut}" "${regexStr}" "${singleQuotes}"
        fi
    done
}

# the inline regex is wrapped in double quotes
function validIP2c() {
    echo "${printTitle}"
    echo "${print2ndLine}"
    for (( i=0; i<${#ips[@]}; i++ )); do
        if [[ "${ips[i]}" =~ "^((25[0-5]|2[0-4][0-9]|[01][0-9][0-9]|[0-9]{1,2})[.]){3}(25[0-5]|2[0-4][0-9]|[01][0-9][0-9]|[0-9]{1,2})$" ]]; then
            printf "${printStyleY}" "${ips[i]}" "${validOut}" "${regexStr}" "${doubleQuotes}"
        else
            printf "${printStyleN}" "${ips[i]}" "${invalidOut}" "${regexStr}" "${doubleQuotes}"
        fi
    done
}

validIP1a
echo
validIP1b
echo
validIP1c
echo

validIP2a
echo
validIP2b
echo
validIP2c

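Running the script shows that the regular expression is evaluated correctly only when it is not quoted. Once quotes are added, Bash treats the pattern as a plain string: every character is matched literally instead of as a regex metacharacter.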
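The effect can be reproduced in a few lines. The sketch below is a minimal illustration and not part of check_ip.sh: the function names match_unquoted and match_quoted and the digits-only pattern are made up for the demonstration, and it assumes Bash 3.2 or later with default compatibility settings.

```bash
#!/bin/bash
# Hypothetical demo pattern: one or more digits, anchored at both ends.
re='^[0-9]+$'

# Unquoted expansion: Bash hands the pattern to the regex engine as an ERE.
match_unquoted() {
    [[ "$1" =~ $re ]] && echo "match" || echo "no match"
}

# Quoted expansion: every character of the pattern is matched literally.
match_quoted() {
    [[ "$1" =~ "$re" ]] && echo "match" || echo "no match"
}

match_unquoted '12345'      # digits tested with ERE semantics
match_quoted   '12345'      # digits tested against the literal pattern text
match_quoted   '^[0-9]+$'   # the pattern text itself as the input string
```

Under these assumptions only the first and third calls print "match": the quoted form succeeds only when the input contains the pattern text itself as a literal substring.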

In fact, quoting in this context is not advisable as it may cause regex evaluation to fail. Chet Ramey states in the Bash FAQ that quoting explicitly disables regex evaluation.

http://tldp.org/LDP/abs/html/bashver3.html

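Stack Overflow has more examples that explain this issue.

Prefer passing the regular expression through a variable

If the regular expression is written directly to the right of =~, it works most of the time. It fails, however, when the pattern contains metacharacters that start with a backslash, such as \b, \< and \>. An unquoted backslash inside [[ ... ]] is shell quoting, so \b reaches the regex engine as a literal b. For example, the following regular expression works fine with grep -E: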

echo -e "192.168.0.1\na.b.c.d\n255.255.255.255" \
| grep -E "\b((25[0-5]|2[0-4][0-9]|[01][0-9][0-9]|[0-9]{1,2})[.]){3}(25[0-5]|2[0-4][0-9]|[01][0-9][0-9]|[0-9]{1,2})\b"

But the same pattern fails inside [[ ... =~ ... ]]:

ip="192.168.0.1"; \
[[ "${ip}" =~ \b((25[0-5]|2[0-4][0-9]|[01][0-9][0-9]|[0-9]{1,2})[.]){3}(25[0-5]|2[0-4][0-9]|[01][0-9][0-9]|[0-9]{1,2})\b ]] \
&& echo -e "${ip} is \033[32mValid\033[0m" \
|| echo -e "${ip} is \033[31mInvalid\033[0m"
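A stripped-down check makes the cause of the failure visible. This is a sketch under the assumption of Bash 3.2 or later; the function name inline_b_test and the test strings are invented for the illustration.

```bash
#!/bin/bash
# Unquoted \b inside [[ ... =~ ... ]] is shell quoting, not a regex escape:
# the pattern the regex engine receives is the literal text "b1b".
inline_b_test() {
    [[ "$1" =~ \b1\b ]] && echo "match" || echo "no match"
}

inline_b_test "1"     # a lone word "1": a real word-boundary pattern would match
inline_b_test "b1b"   # contains the literal text b1b
```

Here the first call reports "no match" and the second reports "match", which is the opposite of what a word-boundary pattern would do.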

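Only after the two \b word-boundary anchors are removed does it work.

A better solution is to store the regular expression in a variable and reference that variable in the conditional: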
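The inline form without \b can be sketched as follows. The wrapper function check_ip_inline is a hypothetical name used only to keep the example testable; the pattern is the one from the failing snippet with both \b removed.

```bash
#!/bin/bash
# Inline pattern with the two \b removed: nothing in it needs a backslash,
# so the shell passes it to the regex engine unchanged.
check_ip_inline() {
    [[ "$1" =~ ((25[0-5]|2[0-4][0-9]|[01][0-9][0-9]|[0-9]{1,2})[.]){3}(25[0-5]|2[0-4][0-9]|[01][0-9][0-9]|[0-9]{1,2}) ]]
}

for ip in "192.168.0.1" "a.b.c.d" "255.255.255.256"; do
    if check_ip_inline "${ip}"; then
        echo "${ip} is Valid"
    else
        echo "${ip} is Invalid"
    fi
done
```

Note that without any anchors the pattern matches substrings, so 255.255.255.256 is accepted because 255.255.255.25 matches. Anchoring with ^ and $, as check_ip.sh does, avoids this.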

ip="192.168.0.1"; \
re="\b((25[0-5]|2[0-4][0-9]|[01][0-9][0-9]|[0-9]{1,2})[.]){3}(25[0-5]|2[0-4][0-9]|[01][0-9][0-9]|[0-9]{1,2})\b"; \
[[ "${ip}" =~ $re ]] \
&& echo -e "${ip} is \033[32mValid\033[0m" \
|| echo -e "${ip} is \033[31mInvalid\033[0m"

Alternatively, use command substitution to emit the regular expression as a string and match against that:

ip="192.168.0.1"; \
[[ "${ip}" =~ $(echo "\b((25[0-5]|2[0-4][0-9]|[01][0-9][0-9]|[0-9]{1,2})[.]){3}(25[0-5]|2[0-4][0-9]|[01][0-9][0-9]|[0-9]{1,2})\b") ]] \
&& echo -e "${ip} is \033[32mValid\033[0m" \
|| echo -e "${ip} is \033[31mInvalid\033[0m"

Regex results depend on the operating system

=~ is the rare case (the only case?) of a built-in bash feature that is platform-dependent: It uses the regex libraries of the platform it is running on, resulting in different regex flavors on different platforms.

For instance, on FreeBSD/OSX \<, \> and \b are NOT supported, but [[:<:]] and [[:>:]] are. On Linux it is the other way around.

https://stackoverflow.com/a/12696899/3025050

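In other words, Bash's regex test relies on the operating system's regex library, so the same regular expression can produce different results on different operating systems. For example, on macOS a word boundary can only be written as [[:<:]] and [[:>:]] (Linux does not support these two metacharacters).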
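One way to cope with this difference is to probe the local regex library at run time rather than hard-coding either syntax. The following is only a sketch: word_regex is a hypothetical helper, and it assumes that a library without \< and \> support makes the probe fail (by not matching or by rejecting the pattern), in which case the BSD-style classes are used.

```bash
#!/bin/bash
# Hypothetical helper: wrap a pattern in whichever word-boundary syntax
# the platform's regex library appears to accept.
word_regex() {
    local probe='\<a\>'
    if [[ "a" =~ $probe ]]; then
        # GNU-style boundaries (the Linux case in the quote above)
        printf '%s%s%s' '\<' "$1" '\>'
    else
        # BSD-style boundaries (the FreeBSD/macOS case in the quote above)
        printf '%s%s%s' '[[:<:]]' "$1" '[[:>:]]'
    fi
}

# The pattern goes through a variable, so its backslashes survive.
re=$(word_regex "bar")

[[ "foo bar" =~ $re ]] && echo "'bar' found as a whole word"
[[ "foobar"  =~ $re ]] || echo "'bar' is not a whole word in 'foobar'"
```

Probing behavior instead of checking the output of uname keeps the script independent of platform names, at the cost of one extra regex test per call.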
