[Error] 500 HTTP response in 05_Training_lab_student

  1. Open the “Training process” stage notebook
  2. Execute the 25th cell with the following input:
bigger_finetuned_model = BasicModelRunner(model_name_to_id["bigger_model_name"])
bigger_finetuned_output = bigger_finetuned_model(test_question)
print("Bigger (2.8B) finetuned model (test): ", bigger_finetuned_output)

Cell execution failed with the following output:

LAMINI CONFIGURATION
{}
LAMINI CONFIGURATION
{}
LAMINI CONFIGURATION
{}
status code: 500
---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
File /usr/local/lib/python3.9/site-packages/lamini/api/rest_requests.py:25, in make_web_request(key, url, http_method, json)
     24 try:
---> 25     resp.raise_for_status()
     26 except requests.exceptions.HTTPError as e:

File /usr/local/lib/python3.9/site-packages/requests/models.py:1021, in Response.raise_for_status(self)
   1020 if http_error_msg:
-> 1021     raise HTTPError(http_error_msg, response=self)

HTTPError: 500 Server Error: Internal Server Error for url: http://jupyter-api-proxy.internal.dlai/rev-proxy/lamini/v1/completions

During handling of the above exception, another exception occurred:

APIError                                  Traceback (most recent call last)
Cell In[25], line 2
      1 bigger_finetuned_model = BasicModelRunner(model_name_to_id["bigger_model_name"])
----> 2 bigger_finetuned_output = bigger_finetuned_model(test_question)
      3 print("Bigger (2.8B) finetuned model (test): ", bigger_finetuned_output)

File /usr/local/lib/python3.9/site-packages/lamini/runners/base_runner.py:28, in BaseRunner.__call__(self, prompt, system_prompt, output_type, max_tokens)
     21 def __call__(
     22     self,
     23     prompt: Union[str, List[str]],
   (...)
     26     max_tokens: Optional[int] = None,
     27 ):
---> 28     return self.call(prompt, system_prompt, output_type, max_tokens)

File /usr/local/lib/python3.9/site-packages/lamini/runners/base_runner.py:39, in BaseRunner.call(self, prompt, system_prompt, output_type, max_tokens)
     30 def call(
     31     self,
     32     prompt: Union[str, List[str]],
   (...)
     35     max_tokens: Optional[int] = None,
     36 ):
     37     input_objects = self.create_final_prompts(prompt, system_prompt)
---> 39     return self.lamini_api.generate(
     40         prompt=input_objects,
     41         model_name=self.model_name,
     42         max_tokens=max_tokens,
     43         output_type=output_type,
     44     )

File /usr/local/lib/python3.9/site-packages/lamini/api/lamini.py:46, in Lamini.generate(self, prompt, model_name, output_type, max_tokens, stop_tokens)
     31 def generate(
     32     self,
     33     prompt: Union[str, List[str]],
   (...)
     37     stop_tokens: Optional[List[str]] = None,
     38 ):
     39     req_data = self.make_llm_req_map(
     40         prompt=prompt,
     41         model_name=model_name or self.model_name,
   (...)
     44         stop_tokens=stop_tokens,
     45     )
---> 46     result = self.inference_queue.submit(req_data)
     47     if isinstance(prompt, str) and len(result) == 1:
     48         if output_type is None:

File /usr/local/lib/python3.9/site-packages/lamini/api/inference_queue.py:41, in InferenceQueue.submit(self, request)
     39 # Wait for all the results to come back
     40 for result in results:
---> 41     result.result()
     43 # Combine the results and return them
     44 return self.combine_results(results)

File /usr/local/lib/python3.9/concurrent/futures/_base.py:446, in Future.result(self, timeout)
    444     raise CancelledError()
    445 elif self._state == FINISHED:
--> 446     return self.__get_result()
    447 else:
    448     raise TimeoutError()

File /usr/local/lib/python3.9/concurrent/futures/_base.py:391, in Future.__get_result(self)
    389 if self._exception:
    390     try:
--> 391         raise self._exception
    392     finally:
    393         # Break a reference cycle with the exception in self._exception
    394         self = None

File /usr/local/lib/python3.9/concurrent/futures/thread.py:58, in _WorkItem.run(self)
     55     return
     57 try:
---> 58     result = self.fn(*self.args, **self.kwargs)
     59 except BaseException as exc:
     60     self.future.set_exception(exc)

File /usr/local/lib/python3.9/site-packages/lamini/api/inference_queue.py:103, in process_batch(key, api_prefix, batch)
    101 def process_batch(key, api_prefix, batch):
    102     url = api_prefix + "completions"
--> 103     result = make_web_request(key, url, "post", batch)
    104     return result

File /usr/local/lib/python3.9/site-packages/lamini/api/rest_requests.py:76, in make_web_request(key, url, http_method, json)
     74             if description == {"detail": ""}:
     75                 raise APIError("500 Internal Server Error")
---> 76             raise APIError(f"API error {description}")
     78 return resp.json()

APIError: API error {'code': 500, 'error_id': '143093073380044201231992061503924942078', 'detail': "[Errno 2] No such file or directory: '/home/slurmuser/jobs/2320/checkpoints'"}