
Re: delimiters with more than one character? ...



On Wed, Aug 05, 2020 at 03:04:32PM +0900, John Crawley wrote:
> This method cuts off the first part of the string, up to the delimiter, and
> adds it to the array, then continues with what's left of the string until
> there's none left:

Yes, that's a valid approach, although not one that most bash scripters
would choose, because it's hideously slow in bash, and doesn't "feel"
like it's in the "spirit" of shell scripting.

> ---
> #!/bin/bash
> 
> _S=' 34 + 45 \| abc \| 1 2 3 \| c\|123abc '
> del='\|'
> 
> arr=()
> s=${_S}${del}
> while [[ -n $s ]]
> do
>     arr+=( "${s%%"${del}"*}" )
>     s=${s#*"${del}"}
> done
> 
> declare -p arr
> ---
> outputs:
> declare -a arr=([0]=" 34 + 45 " [1]=" abc\\ " [2]=" 1 2 3 " [3]=" c"
> [4]="123abc ")

Your output doesn't match your input.  You've got an extra backslash
character in the [1] element.  Perhaps you tested with several different
inputs, and accidentally pasted the wrong output.

(It would have been good if the original poster's sample input string
had included some of the things you were probably testing, such as
standalone backslash and pipe characters.  As presented, the original
problem was pretty stupid.  Why even have a multi-character delimiter
when none of the individual characters in the delimiter appear in the
data?)

> I've used this myself, so am eager to hear of any hidden snags. :)
> 
> (One already: if the delimiter is a repeated character which might also be
> the last in the last string fragment, then the loop never closes. Fairly
> rare?)

OK... yeah, that would be a show-stopper, all right.  I was able
to reproduce the infinite loop using:

_S=' 34 + 45 || abc\ || 1|2 3 || c||123abc |'
del='||'
s=${_S}${del}
arr=()
while [[ -n $s ]]; do arr+=( "${s%%"${del}"*}" ); s=${s#*"${del}"}; done

(Don't do this in a window you care about, or on a system that can't
afford to run out of memory.  By the time I pressed Ctrl-C, the output
array had over 600,000 elements.)

The problem here is that the two expressions inside the loop fail to
identify the final delimiter correctly.  At the point where everything
fails, we have:

del='||'
s='123abc |||'

and:

$ printf '<%s>\n' "${s%%"${del}"*}"
<123abc >
$ printf '<%s>\n' "${s#*"${del}"}"
<|>

In other words, the expressions treat the *first* two pipes as the
delimiter, rather than the *last* two pipes.  The third pipe character
is therefore left behind in the string, causing the infinite loop.
(In addition, the final array element before the infinite loop is
incorrect.  It should be "123abc |", not "123abc ".)

I don't believe this approach can be made to work in all cases, because
appending the extra delimiter to the end of the string creates an
ambiguous input.  The ||| at the end could be "a pipe followed by
the delimiter", or "the delimiter followed by a pipe".  We humans
know that it's supposed to be the former, but the code in the loop
is resulting in the latter.
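
For anyone who wants to watch the failure without risking their
terminal, here is a minimal sketch of the same flawed loop with an
iteration cap bolted on.  The names max_iter and n, and the cap of 10,
are my own additions for illustration and are not part of the original
script.

```shell
#!/bin/bash
# Bounded reproduction of the flawed loop.
# max_iter is an arbitrary illustrative cap, not part of the original.
_S=' 34 + 45 || abc\ || 1|2 3 || c||123abc |'
del='||'
s=${_S}${del}
arr=()
max_iter=10
n=0
while [[ -n $s ]] && (( n < max_iter )); do
    arr+=( "${s%%"${del}"*}" )   # everything before the first delimiter
    s=${s#*"${del}"}             # drop that piece and the delimiter
    n=$(( n + 1 ))
done
declare -p arr
printf 'leftover: <%s>\n' "$s"
```

With the cap in place the loop stops after 10 passes.  The first five
elements are the "real" ones (the fifth being the wrong "123abc "),
and the rest are copies of the stray pipe, which is also what remains
in s at the end.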

A different approach is needed.  The one that immediately springs to
mind for me is:

s='a || b || c ||| d |'
del='||'
arr=()
while [[ $s = *"$del"* ]]; do arr+=( "${s%%"$del"*}" ); s=${s#*"$del"}; done
arr+=( "$s" )

This is very similar to the flawed approach, but instead of appending
an extra delimiter to the input and looping until the input is empty,
we *check* for the existence of a delimiter in the input, and loop until
none is found.  Whatever is left over becomes the final array element.
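
If you want to reuse this, one way to package it is as a function.
This is only a sketch: the function name split_string and the global
output array parts are names I made up for the example.  I've also
added a check for an empty delimiter, on the assumption that callers
might pass one; an empty delimiter matches everywhere and never
shortens the string, so the loop would not terminate without it.

```shell
#!/bin/bash
# split_string is a hypothetical helper name; it stores its result in
# the global array "parts" (also an illustrative name).
split_string() {
    local s=$1 del=$2
    parts=()
    # An empty delimiter would match everywhere and never shrink $s.
    [[ -n $del ]] || return 1
    while [[ $s = *"$del"* ]]; do
        parts+=( "${s%%"$del"*}" )   # text before the first delimiter
        s=${s#*"$del"}               # remove that text and the delimiter
    done
    parts+=( "$s" )                  # whatever is left is the last field
}

split_string 'a || b || c ||| d |' '||'
declare -p parts
```

For this input the array ends up with four elements: "a ", " b ",
" c " and "| d |".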

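This will be a little bit slower than the flawed code (probably), but I
believe it is free from infinite loops: as long as the delimiter is
non-empty, every iteration removes at least one delimiter's worth of
characters from s, so the loop must end.  If you're not convinced,
you could add a loop counter, and abort when some arbitrary limit is
reached.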
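
A loop counter of that kind might look something like this.  The
variable names limit and count, and the value 1000, are arbitrary
choices for the sketch; pick whatever limit makes sense for your data.

```shell
#!/bin/bash
s='a || b || c ||| d |'
del='||'
arr=()
limit=1000   # arbitrary cap chosen for illustration
count=0
while [[ $s = *"$del"* ]]; do
    if (( count >= limit )); then
        printf 'giving up after %d iterations\n' "$count" >&2
        exit 1
    fi
    count=$(( count + 1 ))
    arr+=( "${s%%"$del"*}" )
    s=${s#*"$del"}
done
arr+=( "$s" )
declare -p arr
```

With the sample input the loop body runs three times, so the limit is
never reached.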

Since we're not modifying the input by appending an extra delimiter,
any ambiguity in the input string was put there by the original input
source, not created by our code.  In my sample input, the ||| substring
between c and d can be parsed as either "delimiter followed by pipe",
or "pipe followed by delimiter".  It is impossible to tell which one
is correct, because we have no external knowledge of the input string.
Therefore, either result must be acceptable.  If the humans who
provided this input don't care for the results they get, well, it's
their problem and not ours.
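
To make the two readings concrete, here is a sketch that splits the
same string twice: once scanning from the left, as in the loop above,
and once scanning from the right.  The right-to-left variant is purely
an illustration of the other parse, not something anyone in this thread
proposed, and the array names left and right are made up for the
example.

```shell
#!/bin/bash
s='a || b || c ||| d |'
del='||'

# Left-to-right scan (same logic as the loop above): the leftmost
# match wins, so ||| is read as "delimiter followed by pipe".
left=()
t=$s
while [[ $t = *"$del"* ]]; do
    left+=( "${t%%"$del"*}" )
    t=${t#*"$del"}
done
left+=( "$t" )

# Right-to-left scan (illustrative variant): the rightmost match wins,
# so ||| is read as "pipe followed by delimiter".
right=()
t=$s
while [[ $t = *"$del"* ]]; do
    right=( "${t##*"$del"}" "${right[@]}" )   # prepend the last field
    t=${t%"$del"*}                            # cut it and its delimiter
done
right=( "$t" "${right[@]}" )

printf 'left:  '; printf '<%s>' "${left[@]}"; echo
printf 'right: '; printf '<%s>' "${right[@]}"; echo
```

The left scan gives <a >< b >< c ><| d |>, and the right scan gives
<a >< b >< c |>< d |>.  Both are defensible readings of the same input.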

