uart_flush blocks forever (IDFGH-7778) #9315

Closed
Superberti opened this issue Jul 8, 2022 · 8 comments
Labels
Resolution: Done Issue is done internally Status: Done Issue is done internally

Comments

@Superberti

Environment

  • Development Kit: none
  • Kit version N/A
  • Module or chip used: ESP32-WROOM-32D
  • IDF version: stable, v4.4.1
  • Build System: ninja, idf.py
  • Compiler version: xtensa-esp32-elf-gcc (crosstool-NG esp-2021r2-patch3) 8.4.0
  • Operating System: Windows
  • environment type: PowerShell
  • Using an IDE?: Yes, VS Code with Espressif IDF extension
  • Power Supply: external 3.3V

Problem Description

I'm communicating with another device over a serial RS422 line at a high rate of 1 Mbps with no flow control.
Receiving and transmitting work absolutely reliably. I only had to move the UART driver to the second core in order to get no buffer overruns in the hardware FIFO (although the UART software ring buffer is always big enough for the data so this seems to be an interrupt latency problem on core 0).
One task listens to the UART input via a queue, if (xQueueReceive(spp_uart_queue, (void *)&event, (portTickType)portMAX_DELAY))..., and another task sends some data back to the UART.
Now the sending task wants to flush the UART (some time after successfully receiving a lot of UART data) with uart_flush(UART_NUM_1); and the function blocks forever. At this moment my second device is not sending any data, so the UART input buffer is not being filled.
I don't know how this can happen, as uart_flush should never block! But if I send some data from my second device to the UART while uart_flush is blocked, the function unblocks immediately! After that the program works as expected again.
Fortunately, I do not really need to flush the UART(input) in my sending task, but this behaviour is strange and may lead to serious problems.
Maybe the function hangs in some of the spinlocks?

Expected behaviour

The function uart_flush should never block.

Actual behaviour

The function blocks in some scenarios.

Steps to reproduce

Unfortunately I don't have any kind of simple test code right now and it needs a second device which is sending "real" data to the UART.
I know this is not very satisfying, but maybe my description helps to find a potential problem in the driver.

@espressif-bot espressif-bot added the Status: Opened Issue is new label Jul 8, 2022
@github-actions github-actions bot changed the title uart_flush blocks forever uart_flush blocks forever (IDFGH-7778) Jul 8, 2022
@moefear85

moefear85 commented Jul 9, 2022

I hope nobody takes this the wrong way, and I was going to wait before mentioning this, but I share your thoughts, and there is at least one other person who is concerned about the current state of the UART driver.

Specifically, this is because of similar hanging with RS485 as well as USB-CDC, and there is the still unresolved issue of misdetected UART break positions. Moreover, I suspect a UART RX corruption issue under load (by load I mean Wi-Fi is running). I observe it unmistakably, but I haven't finished isolated test code to reproduce the issue, which I need so I can be sure I haven't made any mistakes myself elsewhere (but I doubt it). I've created a testing stream from node A to node B, but the problem isn't yet manifesting itself because the system isn't under load the way it is in my actual project.

Specifically, when using the UART event subsystem under load, event.size is sometimes larger than what uart_get_buffered_data_len() returns. Usually both are 120 under load, but sporadically event.size is 120 while uart_get_buffered_data_len() returns 118. It is exactly at that moment that I detect corruption in the data stream. Without stress they always match, even if they are less than 120.

One might think it's corruption on the wires. But no framing/parity/full/OVF errors are ever raised, nor are any breaks detected. Moreover, if it were corruption on the wires, it should also occur without size mismatches, but it never does. I do a lot of heavy logging of almost identical lines very often. It's something to celebrate whenever I do detect a corrupted byte anywhere in the stream. Either way, I know for sure the wires are very clean and quiet. The corruption is happening in the RX FIFO or afterwards.

One might think that if the UART were responsible, it would also show up in the output logs. Not really, since that is the TX direction, not RX. Moreover, I sometimes split/T the UART channels to monitor two ESPs communicating with each other over UART. If the corruption happened on the line, it would have to be visible on the monitor, but it isn't.
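
To make the size mismatch described above easier to catch in logs, here is a minimal sketch of a check that compares the two lengths and clamps the read to the smaller one. The helper name check_rx_len and the rx_len_check_t type are hypothetical; on target, event_size would come from uart_event_t.size and buffered_len from uart_get_buffered_data_len(). It only detects the condition and does not explain or fix the underlying cause.

```c
#include <stdbool.h>
#include <stddef.h>

/* Result of comparing the length announced by a UART_DATA event with the
 * length the driver reports as buffered. Hypothetical type for this sketch. */
typedef struct {
    size_t read_len;  /* safe number of bytes to read: the smaller of the two */
    bool   mismatch;  /* true when the two lengths disagree */
} rx_len_check_t;

/* event_size:   uart_event_t.size from the event queue
 * buffered_len: value written by uart_get_buffered_data_len() */
rx_len_check_t check_rx_len(size_t event_size, size_t buffered_len)
{
    rx_len_check_t result;
    result.mismatch = (event_size != buffered_len);
    result.read_len = (event_size < buffered_len) ? event_size : buffered_len;
    return result;
}
```

Logging a timestamp whenever mismatch is true would show whether the corrupted bytes line up with the 120 vs. 118 case reported above.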

I think there is still something wrong with the UART driver and/or with concurrency management. My worry is that one or both might be a silicon issue, meaning I wouldn't be able to use an ESP for any serious project for a long time into the future. Either way, I'm working on tracing the problem and will report it to Espressif once I understand what is going on. I've only just started studying the SoC/HAL structure, the UART driver, and the Xtensa ISA.

@moefear85

moefear85 commented Jul 9, 2022

@Superberti

Still, I wonder... are you sure the flush function itself is blocking? Maybe it completes, but afterwards the code blocks on xQueueReceive waiting for new input? When flushing the buffer, I think (from the example) it is also necessary to reset the queue. Otherwise it leads to unexpected results, such as the next pending event still firing despite the input buffer being completely empty, so any UART read operation in the UART_DATA case then hangs while it waits for new data to arrive on the UART. To detect all of these cases, it is often best to set timeouts, then check the return values for errors and at least recover from endless blocking.

@Superberti
Author

Hi,

It is definitely the flush function. I logged before and after uart_flush, and without any new UART input it hangs in this function.

@moefear85

moefear85 commented Jul 10, 2022

I suspect uart_flush is not callable from a writing task if there is a separate reading task running (since xQueueReceive will be blocked while it is accessing the underlying read buffer or related locks/semaphores), while the writing task would then also attempt to affect parts of the buffer, or the specific locks/semaphores, that the receiving side is concurrently accessing. For similar reasons, I know that in FreeRTOS a queue can only be reset (i.e. flushed) when there is no currently blocked operation on it; otherwise it also hangs. The driver uses ESP additions (ring buffers), but I assume something similar applies.

You could verify this by sending notifications from the writing task to the reading task so that the reading task itself does the flushing. The reading task would have to call xQueueReceive with a timeout, though, so it can check for notifications.
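
A minimal sketch of that experiment, under several assumptions: the reader waits on the event queue with a bounded timeout, then checks for a pending task notification and, if one is set, flushes the input and resets the queue itself. The names reader_poll_once, request_uart_flush, handle_uart_event and EVENT_WAIT_MS are hypothetical, and uart_event_queue and reader_task_handle are assumed to be set up elsewhere by uart_driver_install() and xTaskCreate(). It uses uart_flush_input(), which as far as I know is what uart_flush() is documented to alias. The #else branch holds host stand-ins (not the real API) so the logic can be compiled off-target. Whether this avoids the hang is exactly what the test would show; it is not a confirmed fix.

```c
#include <stdbool.h>
#include <stddef.h>

#ifdef ESP_PLATFORM
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "freertos/queue.h"
#include "driver/uart.h"
#else
/* Host stand-ins, NOT the real API: just enough to compile the logic off-target. */
typedef void *QueueHandle_t;
typedef void *TaskHandle_t;
typedef struct { int type; size_t size; } uart_event_t;
#define UART_NUM_1 1
#define pdTRUE 1
#define pdMS_TO_TICKS(ms) (ms)
static unsigned pending_notifications;
static int xQueueReceive(QueueHandle_t q, void *item, unsigned ticks)
{
    (void)q; (void)item; (void)ticks;
    return 0; /* stand-in: the queue is always empty */
}
static unsigned ulTaskNotifyTake(int clear_on_exit, unsigned ticks)
{
    (void)ticks;
    unsigned n = pending_notifications;
    if (n > 0) pending_notifications = clear_on_exit ? 0 : n - 1;
    return n;
}
static void xTaskNotifyGive(TaskHandle_t t) { (void)t; pending_notifications++; }
static int uart_flush_input(int port) { (void)port; return 0; }
static int xQueueReset(QueueHandle_t q) { (void)q; return pdTRUE; }
#endif

#define UART_PORT     UART_NUM_1  /* port used in the report */
#define EVENT_WAIT_MS 20          /* hypothetical poll period */

static QueueHandle_t uart_event_queue;   /* assumed: filled in by uart_driver_install() */
static TaskHandle_t  reader_task_handle; /* assumed: filled in by xTaskCreate() */

/* Hypothetical application hook: read and process the bytes for one event. */
static void handle_uart_event(const uart_event_t *event) { (void)event; }

/* One iteration of the reader loop. Returns true if a flush was performed. */
bool reader_poll_once(void)
{
    uart_event_t event;

    /* Bounded wait instead of portMAX_DELAY, so the notification gets checked. */
    if (xQueueReceive(uart_event_queue, &event, pdMS_TO_TICKS(EVENT_WAIT_MS)) == pdTRUE) {
        handle_uart_event(&event);
    }

    /* Non-blocking check: did the writer ask for a flush? */
    if (ulTaskNotifyTake(pdTRUE, 0) == 0) {
        return false;
    }

    /* Flush from the reader's own context, then drop stale events. */
    uart_flush_input(UART_PORT);
    xQueueReset(uart_event_queue);
    return true;
}

/* Called from the writer task instead of calling uart_flush() directly. */
void request_uart_flush(void)
{
    xTaskNotifyGive(reader_task_handle);
}

/* Reader task body: loops over the single-step function above. */
void reader_task(void *arg)
{
    (void)arg;
    for (;;) {
        (void)reader_poll_once();
    }
}
```

The flush is delayed by up to EVENT_WAIT_MS, which is the price of keeping all RX buffer and queue manipulation in one task.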

@Superberti
Author

Maybe that's the case. But unfortunately it is not documented anywhere. And it's strange that uart_flush does not block every time (called from a different task) but only in some scenarios.
In my case it's ok not to call the flush function at all from the non-listening task.
It is a bit of a pity that the xTaskNotifyWait(Indexed) functions are limited to one notification per task, so all the XYZIndexed functions are useless (the limit is hardcoded in the FreeRTOS-Header).

Bye,
Oliver

@negativekelvin
Contributor

> I only had to move the UART driver to the second core in order to get no buffer overruns in the hardware FIFO (although the UART software ring buffer is always big enough for the data so this seems to be an interrupt latency problem on core 0).

Do you have the UART ISR placed in IRAM via menuconfig?

@Superberti
Author

Ah, good hint, it was not checked! The help tells me: "If this option is not selected, UART interrupt will be disabled for a long time and may cause data lost when doing spi flash operation."

Bye,
Oliver
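
For reference, a sketch of the corresponding sdkconfig entry, assuming the option meant here is the one named CONFIG_UART_ISR_IN_IRAM (in menuconfig it should sit under the UART driver configuration; the exact menu path may differ between IDF versions):

```
CONFIG_UART_ISR_IN_IRAM=y
```

With this set, the UART interrupt handler can keep running while the flash cache is disabled, which is the situation the quoted help text warns about.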

@Alvin1Zhang
Collaborator

Thanks for reporting, feel free to reopen if the issue still happens.

@espressif-bot espressif-bot added Status: Done Issue is done internally Resolution: Done Issue is done internally and removed Status: Opened Issue is new labels Mar 12, 2025
6 participants