I am trying to save a large number of JSON files by appending each JSON string as a new line to one large text file. I have limited storage, so I don't want to save the JSON strings as they are for so many files. Instead, I compress each JSON string with the zlib library and then append the compressed string as a new line to the big file.
The compression is pretty good. The problem is that the compressed string often contains a line break character "\n", which causes decompression errors when the file is read line by line. I tried to overcome this by base64-encoding the zlib-compressed string, since base64 output does not contain line breaks, but that makes the final string much longer, so the compression is less effective (for shorter strings, the final string after zlib/base64 is actually longer than the original string).
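The base64 overhead is predictable: standard padded base64 emits 4 output characters for every 3 input bytes, rounded up, so the encoded form is about a third larger than the zlib output it wraps. A minimal sketch of that arithmetic follows; the helper name `base64_length` and the sample record are made up for illustration.

```python
import base64
import math
import zlib


def base64_length(n_bytes: int) -> int:
    """Length of standard padded base64 output for n_bytes of input."""
    return 4 * math.ceil(n_bytes / 3)


# Hypothetical short JSON record, used only to exercise the formula.
sample = '{"url": "http://example.com", "title": "", "links": []}'
compressed = zlib.compress(sample.encode("utf-8"))
encoded = base64.b64encode(compressed)

# b64encode never inserts line breaks, and its length follows the formula.
assert b"\n" not in encoded
assert len(encoded) == base64_length(len(compressed))
print(len(sample), len(compressed), len(encoded))
```

By this formula, 84 compressed bytes become 112 base64 characters and 941 become 1256.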
import zlib, base64

item_dict = {}
item_dict["a"] = "ماهذاالذيقالهاليومبشأنالأخباريةالتيفلتهامتعمدا؟"
item_dict["b"] = "She’s allowed to not want someone else’s kids in her picture. Y’all are weird for the way youre acting over this. I don’t want any pics of myself with my ex’s children, because they aren’t my children and I’m not in their lives anymore. It’s weird to post pics of someone else’s kids… so asking for them to be removed so I can still enjoy my picture from my holiday isn’t as bad as y’all are making it seem."
item_dict["c"] = '''{"symbol": "A/RES/74/1", "resolution_number": "74/1.", "title": "Scale of assessments for the apportionment of the expenses of the United Nations: requests under Article 19 of the Charter", "session": "Seventy-fourth session", "adoption_meeting": "14th plenary meeting", "adoption_date": "2019-10-10 00:00:00", "originating_document": "A/74/483", "report_paragraph": "6", "committee": "Fifth Committee", "agenda_item": "Agenda item 139", "agenda_item_name": "Scale of assessments for the apportionment of the expenses of the United Nations", "voting_type": "Without a vote", "MS_in_favour_count": "N.A.", "MS_against_count": "N.A.", "MS_abstaining_count": "N.A.", "pv": "A/74/PV.14", "MS_in_favour": [], "MS_against": [], "MS_abstaining": [], "sponsors": ["SUBMITTED BY THE CHAIR OF THE COMMITTEE"], "additional_sponsors": [], "SDGs": [], "subjects": [["Comoros", "UNBIS Thesaurus"], ["Sao Tome And Principe", "UNBIS Thesaurus"], ["Somalia", "UNBIS Thesaurus"]]}{"symbol": "A/RES/74/2", "resolution_number": "74/2.", "title": "Political declaration of the high-level meeting on universal health coverage", "session": "Seventy-fourth session", "adoption_meeting": "14th plenary meeting", "adoption_date": "2019-10-10 00:00:00", "originating_document": "A/74/L.4", "report_paragraph": "N.A.", "committee": "Without reference to a Main Committee", "agenda_item": "Agenda item 126", "agenda_item_name": "Global health and foreign policy", "voting_type": "Without a 
vote", "MS_in_favour_count": "N.A.", "MS_against_count": "N.A.", "MS_abstaining_count": "N.A.", "pv": "A/74/PV.14", "MS_in_favour": [], "MS_against": [], "MS_abstaining": [], "sponsors": ["SUBMITTED BY THE PRESIDENT OF THE GENERAL ASSEMBLY"], "additional_sponsors": [], "SDGs": ["3"], "subjects": [["Health Policy", "UNBIS Thesaurus"], ["Public Health", "UNBIS Thesaurus"], ["Health Services", "UNBIS Thesaurus"], ["Health Insurance", "UNBIS Thesaurus"], ["Declarations (Text)", "UNBIS Thesaurus"]]}{"symbol": "A/RES/74/3", "resolution_number": "74/3.", "title": "Political declaration of the high-level meeting to review progress made in addressing the priorities of small island developing States through the implementation of the SIDS Accelerated Modalities of Action (SAMOA) Pathway", "session": "Seventy-fourth session", "adoption_meeting": "14th plenary meeting", "adoption_date": "2019-10-10 00:00:00", "originating_document": "A/74/L.3", "report_paragraph": "N.A.", "committee": "Without reference to a Main Committee", "agenda_item": "Agenda item 19 (b)", "agenda_item_name": "Sustainable development: follow-up to and implementation of the SIDS Accelerated Modalities of Action (SAMOA) Pathway and the Mauritius Strategy for the Further Implementation of the Programme of Action for the Sustainable Development of Small Island Developing States of the SIDS Accelerated Modalities of Action (SAMOA) Pathway and the Mauritius Strategy for the Further Implementation of the Programme of Action for the Sustainable Development of Small Island Developing States", "voting_type": "Without a vote", "MS_in_favour_count": "N.A.", "MS_against_count": "N.A.", "MS_abstaining_count": "N.A.", "pv": "A/74/PV.14", "MS_in_favour": [], "MS_against": [], "MS_abstaining": [], "sponsors": ["SUBMITTED BY THE PRESIDENT OF THE GENERAL ASSEMBLY"], "additional_sponsors": [], "SDGs": ["16", "17", "3"], "subjects": [["Sustainable Development", "UNBIS Thesaurus"], ["Developing Island Countries", "UNBIS 
Thesaurus"], ["Development Assistance", "UNBIS Thesaurus"], ["Programme Implementation", "UNBIS Thesaurus"], ["Programme Evaluation", "UNBIS Thesaurus"], ["Declarations (Text)", "UNBIS Thesaurus"]]}'''
item_dict["d"] = '{"url": "http://agribank.ngan-hang.net", "final_url": "http://ww7.ngan-hang.net/?usid=18&utid=23776691570", "lang": "", "title": "", "description": "", "keywords": "", "phone_numbers": [], "links": [], "social_links": [], "emails": [], "addresses": [], "logos": [], "text": "", "last": 41, "n_items": 1}'

# Compare raw, zlib and zlib+base64 sizes, and count line breaks in each output
for key, val in item_dict.items():
    zlib_compressed = zlib.compress(val.encode())
    base64_compressed = base64.b64encode(zlib_compressed)
    zlib_n_line_breaks = zlib_compressed.count(b'\n')
    base64_line_breaks = base64_compressed.count(b'\n')
    print("original size:", len(val), " | zlib:", len(zlib_compressed), "base64", len(base64_compressed), "| zlib_n_line_breaks", zlib_n_line_breaks, base64_line_breaks)

Result:

original size: 56 | zlib: 84 base64 112 | zlib_n_line_breaks 0 0
original size: 407 | zlib: 254 base64 340 | zlib_n_line_breaks 0 0
original size: 3655 | zlib: 941 base64 1256 | zlib_n_line_breaks 1 0
original size: 303 | zlib: 184 base64 248 | zlib_n_line_breaks 1 0

As a workaround, I wrote custom compression/decompression functions: the compression step replaces each line break in the zlib output with an arbitrary string (e.g. 00000), and the decompression step does the opposite. This reduces the likelihood of decompression errors but does not eliminate them, because the compressed output may itself happen to contain that arbitrary string.
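A placeholder such as 00000 can collide because the placeholder itself may occur in the compressed bytes. One common way to make such a replacement reversible is to reserve an escape byte and escape that byte as well. The sketch below assumes records are written and read as bytes. The names `escape`, `unescape`, `pack` and `unpack` are hypothetical helpers, not part of zlib.

```python
import re
import zlib

# Matches a backslash followed by either a backslash or the letter n.
_ESCAPED = re.compile(rb"\\([\\n])")


def escape(raw: bytes) -> bytes:
    """Return raw with no 0x0A byte left in it.

    Backslashes are doubled first, so every backslash in the result
    starts a two-byte escape sequence and the mapping stays invertible.
    """
    return raw.replace(b"\\", b"\\\\").replace(b"\n", b"\\n")


def unescape(line: bytes) -> bytes:
    """Invert escape() in a single left-to-right pass."""
    return _ESCAPED.sub(lambda m: b"\n" if m.group(1) == b"n" else b"\\", line)


def pack(text: str) -> bytes:
    """Compress text and make the result safe to store on one line."""
    return escape(zlib.compress(text.encode("utf-8")))


def unpack(line: bytes) -> str:
    """Reverse pack(): undo the escaping, then decompress."""
    return zlib.decompress(unescape(line)).decode("utf-8")


# Hypothetical records, joined one per line and read back again.
records = ['{"id": 1, "text": "first"}', '{"id": 2, "text": "second\\nline"}']
blob = b"\n".join(pack(r) for r in records)
restored = [unpack(line) for line in blob.split(b"\n")]
assert restored == records
```

Each backslash or line-break byte in the compressed output costs one extra byte. If the compressed bytes are close to uniformly distributed, that affects about 2 bytes in 256, so the growth should stay under 1%, compared with roughly 33% for base64.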
I'm aware of this question, but it did not give me a satisfactory answer:
So, the question is: is there any compression algorithm that can compress a string without producing a line break? Or is there a way to reliably post-process the output of zlib (or of any other compression algorithm), and reverse that step before decompression, so that the stored data never contains a line break?
Edit
Thanks to the answer by Booboo, I realized the difference between a line break character and a backslash followed by "n". I tested it, and the encoding part now makes sense:
import zlib

line0 = '{"symbol": "A/RES/74/1", "resolution_number": "74/1.", "title": "Scale of assessments for the apportionment of the expenses of the United Nations: requests under Article 19 of the Charter", "session": "Seventy-fourth session", "adoption_meeting": "14th plenary meeting", "adoption_date": "2019-10-10 00:00:00", "originating_document": "A/74/483", "report_paragraph": "6", "committee": "Fifth Committee", "agenda_item": "Agenda item 139", "agenda_item_name": "Scale of assessments for the apportionment of the expenses of the United Nations", "voting_type": "Without a vote", "MS_in_favour_count": "N.A.", "MS_against_count": "N.A.", "MS_abstaining_count": "N.A.", "pv": "A/74/PV.14", "MS_in_favour": [], "MS_against": [], "MS_abstaining": [], "sponsors": ["SUBMITTED BY THE CHAIR OF THE COMMITTEE"], "additional_sponsors": [], "SDGs": [], "subjects": [["Comoros", "UNBIS Thesaurus"], ["Sao Tome And Principe", "UNBIS Thesaurus"], ["Somalia", "UNBIS Thesaurus"]]} {"url": "http://agroreal911.sk", "final_url": "http://agroreal911.sk/", "lang": "sk-SK", "title": "Agroreal 911 s.r.o.", "description": "", "keywords": "", "phone_numbers": [], "links": [["http://agroreal911.sk/pozemky", "K\u00fapa p\u00f4dy"], ["http://agroreal911.sk/kontakty", "Kontakty"], ["http://www.advertplus.sk", "Advertplus.sk"], ["http://agroreal911.sk/predaj-pody", "Predaj p\u00f4dy"], ["http://agroreal911.sk/o-nas", "O n\u00e1s"], ["http://agroreal911.sk/?lang=en", ""], ["http://transposh.org/sk", ""]], "social_links": [], "emails": ["mgr.michal.hrabovsky@gmail.com"], "addresses": [], "logos": ["http://agroreal911.sk/wp-content/plugins/transposh-translation-filter-for-wordpress/img/tplogo.png"], "text": "Agroreal 911 s.r.o. \nAGRO REAL 911, S.R.O. \nMenu \nO n\u00e1s \nPOZEMKY \nK\u00fapa p\u00f4dy \nPredaj p\u00f4dy \nKontakty \nby \nWebstr\u00e1nku vytvoril Advertplus.sk Kontakt: 0908 692 782 \u00a0\u00a0\u00a0\u00a0\n\n\n \n \n\n\n\n\n\n\n\n mgr.michal.hrabovsky@gmail.com\n\n\n ", "last": 74, "n_items": 2}'

compressed = zlib.compress(line0.encode())
# Replace each line break byte with a backslash followed by the letter n
compressed0 = compressed.replace(b"\n", b"\\n")
print("number of line breaks in zlib output:", compressed.count(b"\n"))

# Write the escaped bytes to a file
test_out_fpath = "test_compress.txt"
fopen0 = open(test_out_fpath, "wb")
fopen0.write(compressed0)
fopen0.close()

# Read the file back and count the lines
fopen0 = open(test_out_fpath, "rb")
lines = fopen0.readlines()
print("number of lines after replacing line breaks", len(lines))
fopen0.close()

Output:

number of line breaks in zlib output: 7
number of lines after replacing line breaks 1

I still need help with the decompression step, though, if possible.
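One reason the decompression side is harder than it looks: replacing only the line break is not always reversible, because the compressed bytes may already contain a backslash followed by "n". The sketch below uses small made-up byte strings in place of real zlib output, and the function names are hypothetical, to show two different inputs collapsing to the same escaped form.

```python
# b"\n" is one byte (0x0A); b"\\n" is two bytes: a backslash (0x5C) and the letter n (0x6E).
assert len(b"\n") == 1
assert len(b"\\n") == 2


def naive_escape(raw: bytes) -> bytes:
    """Replace each line break with backslash + n (the scheme tested above)."""
    return raw.replace(b"\n", b"\\n")


def naive_unescape(line: bytes) -> bytes:
    """Attempt to reverse naive_escape()."""
    return line.replace(b"\\n", b"\n")


# Stand-ins for compressed output (assumed values, not real zlib data).
with_newline = b"ab\ncd"        # contains a real line break
with_backslash_n = b"ab\\ncd"   # already contains backslash + n

# Both inputs produce the same escaped bytes, so the reader cannot tell them apart.
assert naive_escape(with_newline) == naive_escape(with_backslash_n)
assert naive_unescape(naive_escape(with_newline)) == with_newline
assert naive_unescape(naive_escape(with_backslash_n)) != with_backslash_n
```

Escaping the backslash itself before escaping the line break would remove this ambiguity, at the cost of one extra byte per backslash.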