[Home]

Summary:ASTERISK-09993: Asterisk Crashes in cdr_csv.c during csv_log while trying to close the log file
Reporter:Frank Waller (explidous)Labels:
Date Opened:2007-07-31 16:13:51Date Closed:2007-09-13 14:59:12
Priority:CriticalRegression?No
Status:Closed/CompleteComponents:CDR/cdr_csv
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) all_thread_backtrace
( 1) cdr_csv.patch
( 2) cdr_csv.patch2
( 3) cflags.xml.patch
( 4) single_thread_full_backtrace
Description:Under moderately high load Asterisk crashes while trying to close the the log file at line 296 in cdr_csv.c. It seems that two processes call this function at the same time and try to close the file descriptor twice. This is most likely because a lock was removed somewhere in Asterisk.

To fix this I made the module only close the file on unload and reload and open the file on load and reload and put a lock around where it actually writes to the file. Previously Asterisk opened the file, wrote to the file, flushed the buffer, then closed the file. This was most likely due to the fact that fflush() had problems in the past not properly flushing the buffer. This has been fixes since about 2002. Now Asterisk locks, writes to the file, unlocks, then flushes the buffer. This also makes this function thread safe.

I have not had Asterisk crash at this point since (12 tries so far).

****** STEPS TO REPRODUCE ******

I am able to reproduce this very easily by running Vicidial (a predictive dialer) on this server with twenty agents and dialing at a ratio of three to one. This means that there are twenty channels waiting in twenty meetmes and the server is dialing 60 numbers. When a number connects they get placed into one of the meetmes. The server reliably crashes at one of two spots, either this location or at another that I have not had a chance to debug yet.

I believe that it should be able to be reproduced without Vicidial though by simply dialing many calls simultaneously with cdr_csv.so loaded.
Comments:By: Jason Parker (jparker) 2007-07-31 16:49:12

We probably want to still open/close the file on every call to cvs_log (for safety).  We could probably use mf in writefile as well, and just wrap both calls in a mutex that gets declared with mf.

By: Jason Parker (jparker) 2007-07-31 16:50:39

I guess we would need to put the locking in (un)load_module also

By: Frank Waller (explidous) 2007-07-31 22:57:29

No objection whatsoever on locking in (un)load_module. I'll upload a new patch soon.
writefile looks like it would not need any lock as it is, if it would, it would be better served by a separate lock, no reason to limit parallelisms between the two separate files. A lock in writefile is only desirable if the OS can not handle multiple threads opening the same file and writing to it at the "same" time, in this case the simple but not very performant solution would be to put that separate lock around the whole fopen to fclose section. For optimum parallel processing we would need a lock per file or forcing files to be not opened as shared with better handling of failing to open the file than just a "return -1". BTW I'll put an LOG_ERROR on that fail now...
Opening and closing the file on each write should really not be necessary anymore, there were issues where fflush was not reliably updating files depending on which write operation was used, these have however been resolved in most OS versions many years ago (2002).
The problem with closing and reopening for each write is, that it would have to be done within the lock as well and that would make the time locked much longer which negatively affects the general performance.
I would be happy to include a compile/config option, something along the lines of CSVLOG_PARANOID_FILE_HANDLING to switch to an alternative code that provides per line open and closes. If anyone still knows of systems that do not handle fflush correctly, those could be set to automatically use the alternative code.
I am willing to test if we can produce any data loss situation on the current linux platforms we use.

By: Tilghman Lesher (tilghman) 2007-08-27 14:42:27

Any reason why this issue seems to be stuck in limbo?  Should it be added to 1.4, as well?

By: Jason Parker (jparker) 2007-09-13 14:59:12

Fixed in svn branches 1.4 and trunk in revisions 82344 and 82345.