Channel: Active questions tagged python - Stack Overflow

Exclude specific spans from string when substituting by regular expression in Python


I have this problem. Assume there is a long, complicated text file that contains special blocks enclosed in delimiters (the blocks can be empty):

some text----text inside special block----some text--------some text----text inside special block----

The task is to make some substitutions in the whole text but exclude text inside these borders (----) from these substitutions.

For example, we need to replace every occurrence of the substring text with TEXT, except inside special blocks. The result should be:

some TEXT----text inside special block----some TEXT--------some TEXT----text inside special block----

We cannot use lookahead or lookbehind here because at a given position we don't know whether we are inside a special block or not (the delimiters are not oriented: the opening and closing delimiters are identical).

What I currently do is this: first I parse the whole text for the delimiters of special blocks, then I collect the indexes of the "bad" lines, and then I apply my regex substitutions line by line, checking that each line is not one of the "bad" lines. But if a regex has to match across more than one line, this becomes more complicated. I'm sure there are smarter and easier ways to handle this.

So basically what I need is a way to exclude some fragments of the text (given by their spans) from re.sub when it is applied to the whole text, even if a regex match only intersects a span rather than fully containing it. That way I could apply the first regex, take the spans of the special blocks by their begin and end indexes, and exclude these spans from the second regex. How is this possible?
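To make the requirement concrete, here is a minimal sketch of what "exclude spans from re.sub" could look like on the simplified example. The names `sub_outside_spans` and `BLOCK` are hypothetical, and the delimiter regex assumes the inline `----` form shown above rather than the line-based form in the real data. The idea is to run the substitution only on the gaps between protected spans, so a match can never start inside a block or reach across one.

```python
import re

def sub_outside_spans(pattern, repl, text, spans):
    """Apply re.sub only to the parts of text that lie outside spans.

    spans: (start, end) character offsets, assumed sorted and non-overlapping.
    """
    out = []
    pos = 0
    for start, end in spans:
        out.append(re.sub(pattern, repl, text[pos:start]))  # editable gap
        out.append(text[start:end])                         # protected span, copied as is
        pos = end
    out.append(re.sub(pattern, repl, text[pos:]))           # tail after the last span
    return "".join(out)

# Hypothetical delimiter regex for the simplified example: a block is
# everything between two consecutive "----" markers (possibly empty).
BLOCK = re.compile(r"----.*?----", re.DOTALL)

sample = ("some text----text inside special block----some text"
          "--------some text----text inside special block----")
spans = [m.span() for m in BLOCK.finditer(sample)]
result = sub_outside_spans(r"text", "TEXT", sample, spans)
print(result)
```

One caveat with this sketch: because each gap is substituted as a separate string, anchors and lookarounds in the second regex treat the edge of a gap as the edge of the text.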

Right now I have this solution (the example above is simplified, sorry):

    import re

    # Note: `data` (the input text) and `log_it` are defined elsewhere in my script.

    def find_code_lines(data):
        # Search for blocks by regex. They can be empty!
        r = re.compile(r'(\n----(?=\n)(?P<group1>[\s\S]*?\n)----\n)')
        # Delete all literal '\n' sequences which are not line breaks
        # (there are some of them in formulas etc.)
        data_edited = data.replace('\\n', '')
        # Save spans by character indexes
        char_spans = []
        for m in r.finditer(data_edited):
            char_spans.append(m.span(1))
        # Calculate spans by line indexes
        line_spans = []
        for span in char_spans:
            begin = data_edited[:span[0]].count("\n") + 2
            end = data_edited[:span[1]].count("\n") - 1
            line_spans.append((begin, end))
        return line_spans

    # Check if index is inside one of the spans
    def in_spans(spans, line_index):
        res = False
        for span in spans:
            if line_index >= span[0] and line_index < span[1]:
                res = True
        return res

    # Parse text by blocks
    code_lines = find_code_lines(data)
    lines_edited = []
    data_lines = data.splitlines()
    replace_count = 0
    for i in range(len(data_lines)):
        if in_spans(code_lines, i):
            # Line belongs to a special block: keep it unchanged
            lines_edited.append(data_lines[i])
        else:
            data_tuple = re.subn(r'(?P<group1>\s|^|\s\()\$(?P<group2>[^\$`\r\n]{1,1000}?)\$',
                                 r'\1stem:[\2]',
                                 data_lines[i])
            if data_tuple[1] == 0:
                lines_edited.append(data_lines[i])
            else:
                lines_edited.append(data_tuple[0])
                replace_count += data_tuple[1]
    lines_edited.append('')
    data = '\n'.join(lines_edited)
    log_it('Replaced Math blocks', replace_count)

UPD: I added more text to the input example because some of the solutions below handle only specific, easier versions of the input. The most difficult one so far looks like this:

some text----text inside special block----some text--------some text----text inside special block----some text

Expected output:

some TEXT----text inside special block----some TEXT--------some TEXT----text inside special block----some TEXT
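For reference, the updated example can be reproduced with a single pass that matches either a whole special block or the target, and only rewrites the target. This is a sketch under assumptions, not a tested solution for the real data: `COMBINED` and `keep_blocks` are hypothetical names, and the pattern assumes the inline `----` delimiters of the simplified example.

```python
import re

# Alternative 1 (captured) consumes a whole block, including an empty one;
# alternative 2 is the text we actually want to replace.
COMBINED = re.compile(r"(----.*?----)|text", re.DOTALL)

def keep_blocks(m):
    # If the block alternative matched, return it untouched;
    # otherwise the target matched, so substitute it.
    if m.group(1) is not None:
        return m.group(1)
    return "TEXT"

sample = ("some text----text inside special block----some text"
          "--------some text----text inside special block----some text")
result = COMBINED.sub(keep_blocks, sample)
print(result)
```

Since the regex engine scans left to right and consumes each block as a single match, the target alternative never gets a chance to match inside a block, which sidesteps the lack of orientation in the delimiters.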
