Channel: Active questions tagged python - Stack Overflow

Python regular expression is not able to extract the text and URLs from the mail body

For example, I have a mail in my Outlook folder that has a subject and a lot of Japanese text and URLs like below.

01 事務用品・機器
大阪府警察大正警察署:指サック等の購入  :大阪市大正区
https://www.e-nyusatsu.pref.osaka.jp/CALS/Publish/EbController?Shori=SmallKokokuInfo&open_kokoku=01202350042214
01 事務用品・機器
府立学校大阪わかば高等学校:校内衛生用品7件 ★ :大阪市生野区
https://www.e-nyusatsu.pref.osaka.jp/CALS/Publish/EbController?Shori=SmallKokokuInfo&open_kokoku=01202350041978
01 事務用品・機器
府立学校工芸高等学校:イレパネ 他 購入  :大阪市阿倍野区
https://www.e-nyusatsu.pref.osaka.jp/CALS/Publish/EbController?Shori=SmallKokokuInfo&open_kokoku=01202350042117

I want to search by matching a parent keyword and a list of child keywords, which I have configured in a JSON config file like below.

{
    "folder_name": "調達プロジェクト",
    "output_file_path": "E:\\output",
    "output_file_name": "output.txt",
    "parent_keyword": "meeting",
    "child_keywords": ["土木一式工事", "産業用機器", "事務用品・機器"]
}

Now I am trying to find the mails that contain these parent and child keywords, and I want to create a text file with each matched keyword and the information (linked text and URLs) associated with it. For example, for the mail above, if a keyword such as 通信用機器 matches, I have to extract the text and URLs below or associated with this keyword (and with the rest of the matched keywords) like below.

keyword: matched keyword
Paragraph text: text associated with the keyword
Urls: urls associated with the keyword
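To make the target concrete, here is a rough sketch of the extraction I am after, not a tested solution against Outlook. It assumes that an entry starts at a line containing a child keyword and ends at the first following line that carries a URL; that layout is my guess from the mail above. The names `extract_entries`, `format_entries` and `sample_body` are made up for this illustration, and the 通信用機器 entry with its example.com URL is a hypothetical entry added only to show that unmatched categories are skipped.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")


def extract_entries(body, child_keywords):
    """Collect the text and URLs that follow each line containing a child keyword."""
    entries = []
    lines = body.splitlines()
    i = 0
    while i < len(lines):
        keyword = next((k for k in child_keywords if k in lines[i]), None)
        if keyword is None:
            i += 1
            continue
        # Start with whatever follows the keyword on the same line (often nothing).
        pending = lines[i].split(keyword, 1)[1]
        text_parts, urls = [], []
        while True:
            urls = URL_PATTERN.findall(pending)
            text = URL_PATTERN.sub("", pending).strip()
            if text:
                text_parts.append(text)
            i += 1
            # Assumption: an entry ends at the first line that carries a URL.
            if urls or i >= len(lines):
                break
            pending = lines[i]
        entries.append({"keyword": keyword, "text": " ".join(text_parts), "urls": urls})
    return entries


def format_entries(entries):
    """Render entries in the keyword / Paragraph text / Urls layout."""
    return "\n".join(
        f"keyword: {e['keyword']}\n"
        f"Paragraph text: {e['text']}\n"
        f"Urls: {', '.join(e['urls'])}\n"
        for e in entries
    )


BASE_URL = "https://www.e-nyusatsu.pref.osaka.jp/CALS/Publish/EbController?Shori=SmallKokokuInfo&open_kokoku="

# Stand-in for item.Body; the line breaks are an assumption about the real mail.
sample_body = (
    "01 事務用品・機器\r\n"
    "大阪府警察大正警察署:指サック等の購入  :大阪市大正区\r\n"
    f"{BASE_URL}01202350042214\r\n"
    "05 通信用機器\r\n"
    "(hypothetical entry that should be skipped)\r\n"
    "https://example.com/skipped\r\n"
    "01 事務用品・機器\r\n"
    "府立学校大阪わかば高等学校:校内衛生用品7件 ★ :大阪市生野区\r\n"
    f"{BASE_URL}01202350041978\r\n"
)

entries = extract_entries(sample_body, ["土木一式工事", "産業用機器", "事務用品・機器"])
print(format_entries(entries))
```

If an entry in the real mail can have several URLs or none at all, the "stop at the first URL line" rule would need to be replaced by some other way of detecting where the next entry begins.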

Here is what I tried with Python.

import win32com.client
import os
import json
import logging
import re

def read_config(config_file):
    with open(config_file, 'r', encoding="utf-8") as f:
        config = json.load(f)
    return config

def search_and_save_email(config):
    try:
        folder_name = config.get("folder_name", "")
        output_file_path = config.get("output_file_path", "")
        parent_keyword = config.get("parent_keyword", "")
        child_keywords = config.get("child_keywords", [])
        # Ensure the directory exists
        os.makedirs(output_file_path, exist_ok=True)
        outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
        inbox = outlook.GetDefaultFolder(6)
        # Find the user-created folder within the Inbox
        user_folder = None
        for folder in inbox.Folders:
            if folder.Name == folder_name:
                user_folder = folder
                break
        if user_folder is not None:
            # Search for emails with the parent keyword anywhere in the subject
            parent_keyword_pattern = re.compile(r'\b(?:'+'|'.join(map(re.escape, parent_keyword.split())) + r')\b', re.IGNORECASE)
            for item in user_folder.Items:
                if parent_keyword_pattern.findall(item.Subject):
                    logging.info(f"Found parent keyword in Subject: {item.Subject}")
                    # Parent keyword found, now search for child keywords in the body
                    body_lower = item.Body.lower()
                    # Initialize output_text outside the child keywords loop
                    output_text = ""
                    for child_keyword in child_keywords:
                        # Search for child keyword in the body using regular expression
                        child_keyword_pattern = re.compile(re.escape(child_keyword), re.IGNORECASE)
                        matches = child_keyword_pattern.finditer(body_lower)
                        for match in matches:
                            logging.info(f"Found child keyword '{child_keyword}' at position {match.start()}-{match.end()}")
                            # Extract the paragraph around the matched position
                            paragraph_start = body_lower.rfind('\n', 0, match.start())
                            paragraph_end = body_lower.find('\n', match.end())
                            paragraph_text = item.Body[paragraph_start + 1:paragraph_end]
                            # Extract URLs from the paragraph using a simple pattern
                            url_pattern = re.compile(r'http[s]?://\S+')
                            urls = url_pattern.findall(paragraph_text)
                            # Append the results to the output_text
                            output_text += f"Child Keyword: {child_keyword}\n"
                            output_text += f"Paragraph Text: {paragraph_text}\n"
                            output_text += f"URLs: {', '.join(urls)}\n\n"
                    # Save the result to a text file
                    output_file = os.path.join(output_file_path, f"{item.Subject.replace('', '_')}.txt")
                    with open(output_file, 'w', encoding='utf-8') as f:
                        f.write(output_text)
                    logging.info(f"Saved results to {output_file}")
                else:
                    logging.warning(f"Child keywords not found in folder '{folder_name}'.")
        else:
            logging.warning(f"Folder '{folder_name}' not found.")
    except Exception as e:
        logging.error(f"An error occurred: {str(e)}")

if __name__ == "__main__":
    # Set up logging
    logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
    # Specify the path to the configuration file
    config_file_path = "E:\\config2.json"
    # Read configuration from the file
    config = read_config(config_file_path)
    # Search and save email based on the configuration
    search_and_save_email(config)

Unfortunately, the code only gives me the matched keyword, not the text and URLs associated with it. My output text file looks like this:

Child Keyword: 土木一式工事
Paragraph Text:  土木一式工事
URLs: 

Child Keyword: 産業用機器
Paragraph Text:  19 産業用機器
URLs: 

Child Keyword: 産業用機器
Paragraph Text:  19 産業用機器
URLs: 
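The behaviour seems reproducible without Outlook. The snippet below is a minimal reproduction, assuming the mail body puts the keyword on its own line, with the description and the URL on the lines after it (my guess at the layout, using one entry from the mail above). The variable `body` is a hypothetical stand-in for `item.Body`. With that layout, the slice between the nearest newlines around the match can only ever contain the keyword line itself, so the URL search has nothing to find.

```python
import re

# Hypothetical stand-in for item.Body; the \r\n line breaks are an assumption.
body = (
    "01 事務用品・機器\r\n"
    "大阪府警察大正警察署:指サック等の購入  :大阪市大正区\r\n"
    "https://www.e-nyusatsu.pref.osaka.jp/CALS/Publish/EbController"
    "?Shori=SmallKokokuInfo&open_kokoku=01202350042214\r\n"
)

keyword = "事務用品・機器"
match = re.search(re.escape(keyword), body)

# Same line-bounded slice as in my code: from the newline before the match
# to the newline after it, which is just the line holding the keyword.
paragraph_start = body.rfind("\n", 0, match.start())
paragraph_end = body.find("\n", match.end())
paragraph_text = body[paragraph_start + 1:paragraph_end]

urls = re.findall(r"https?://\S+", paragraph_text)

print(repr(paragraph_text))  # only the keyword line, plus a trailing \r
print(urls)                  # empty, because the URL is two lines further down
```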

I am pretty sure the problem lies in the logic and the regular expression, which I am trying to figure out, but I need some help. Sorry for the long post, guys.

